A quantum search decoder for natural language processing

Probabilistic language models, e.g. those based on recurrent neural networks such as long short-term memory models (LSTMs), often face the problem of finding a high probability prediction from a sequence of random variables over a set of tokens. This is commonly addressed using a form of greedy decoding such as beam search, where a limited number of highest-likelihood paths (the beam width) of the decoder are kept, and at the end the maximum-likelihood path is chosen. In this work, we construct a quantum algorithm to find the globally optimal parse (i.e. for infinite beam width) with high constant success probability. When the input to the decoder follows a power law with exponent k > 0, our algorithm has runtime $R^{n f(R,k)}$, where R is the alphabet size and n the input length; here f < 1/2, and $f \rightarrow 0$ exponentially fast with increasing k, hence making our algorithm always more than quadratically faster than its classical counterpart. We further modify our procedure to recover a finite beam width variant, which enables an even stronger empirical speedup while still retaining higher accuracy than possible classically. Finally, we apply this quantum beam search decoder to Mozilla's implementation of Baidu's DeepSpeech neural net, which we show to exhibit such a power law word rank frequency.

When the input to the decoder follows a power law with exponent k > 0, our algorithm yields a runtime $R^{n f(R,k)}$, where f ≤ 1/2, and f → 0 exponentially quickly for growing k. This implies that our algorithm always yields a super-Grover type speedup, i.e. it is more than quadratically faster than its classical counterpart. The algorithm is based on a recent quantum maximum finding algorithm, which we combine with an advice-based query analysis for quantum search; it is known that the latter cannot be used to speed up an equivalent classical algorithm. The quantum search decoder requires a quantum procedure that can sample from the grammar to be parsed, but in a biased fashion: the weight of each word in the sequence is determined by the sequence of random variables given as input. We explicitly construct such a quantum sampling subroutine for the case where a classical uniform sampler is known (e.g. for regular or context-free languages).
We further modify our procedure to recover a quantum beam search variant, which enables an even stronger empirical speedup, while sacrificing accuracy. Finally, we apply this quantum beam search decoder to Mozilla's implementation of Baidu's DeepSpeech neural net, which we show to exhibit such a power law word rank frequency, underpinning the applicability of our model.

Background and Context
A recurring task in the context of parsing and neural sequence-to-sequence models (such as machine translation [SMH11; SVL14], natural language processing [Sch14] and generative models [Gra13]) is to find an optimal path of tokens (e.g. words or letters) from a sequential list of probability distributions. Such a distribution can for instance be produced at the output layer of a recurrent neural network, e.g. a long short-term memory (LSTM). The goal is to decode these distributions by scoring all viable output sequences (paths) under some language model, and finding the path with the highest score.
Nowadays, the de-facto standard solution is to use a variant of beam search [STN94; Vij+16; WR16; Kul+18] to traverse the list of all possible output strings. Beam search stores and explores a constant-sized list of possible decoded hypotheses at each step, compared to a greedy algorithm that only considers the top element at each step. Beam search thus interpolates between a simple greedy algorithm and best-first search; but just like greedy search, beam search is not guaranteed to find a global optimum. Furthermore, beam search suffers from sensitivity to the predicted sequence length; improving the algorithm itself [MC18; YHM18], as well as finding new decoding strategies [FLD18; Hol+19], is an ongoing field of research.
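To illustrate how beam search interpolates between greedy and exhaustive decoding, the following is a minimal sketch over a list of per-step token distributions. The function name `beam_search`, the dictionary-based input format, and the scoring by summed log-probabilities of independent tokens are assumptions made for this example; they are not taken from any of the cited implementations.

```python
import math

def beam_search(step_probs, beam_width):
    """Return the best (sequence, log_prob) found with the given beam width.

    step_probs: list of dicts mapping token -> probability, one per step.
    Tokens are assumed independent across steps in this sketch.
    """
    beam = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for dist in step_probs:
        # Extend every hypothesis in the beam by every possible next token.
        candidates = [
            (seq + (tok,), score + math.log(p))
            for seq, score in beam
            for tok, p in dist.items()
            if p > 0
        ]
        # Keep only the beam_width highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]
```

Setting `beam_width=1` recovers greedy decoding, while a beam width at least as large as the number of paths amounts to exhaustive search; the memory and runtime cost grows accordingly.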
A related question is found in transition-based parsing of formal languages, such as context-free grammars [HMU01; ZC08; ZN11; ZQH15a; Dye+15]. In this model, an input string is processed token by token, and a heuristic prediction (which can be based on various types of classifiers, such as feed forward networks) is made on how to apply a transition at any one point. As in generative models and decoding tasks, heuristic parsing employs beam search, where a constant-sized list of possible parse trees is retained in memory at any point in time, and at the end the hypothesis optimising a suitable objective function is chosen. Improvements of beam-search based parsing strategies are an active field of research [BBD16; Boh+16; VG18].
In essence, the problem of decoding a probabilistic sequence with a language model, or probabilistically parsing a formal grammar, becomes one of performing search over an exponentially-growing tree, since at each step the list of possible sequences branches with degree up to the number of predicted words. The goal is to find a path through this search space with the highest overall score. Due to runtime and memory constraints, a tradeoff has to be made which limits any guarantees on the performance of the search strategy.
Quantum computing has shown promise as an emerging technology to efficiently solve some instances of difficult computing tasks in fields ranging from optimisation [GAW19], linear algebra [HHL09; Ber+17], simulation of quantum systems [Llo96], distributional property testing [MW16], and language processing [Wie+19; AGS18], to machine learning [CT17; Jia+18; DLD17; CL18; Bau18; BL18]. While quantum computers are not yet robust enough to evaluate any of these applications on sample sizes large enough to claim an empirical advantage, a structured search problem such as language decoding is a prime candidate for a quantum speedup.
While the most naïve search problems can be sped up using Grover's search algorithm (or one of its variants, such as fixed point/oblivious amplitude amplification), finding good applications for quantum algorithms remains challenging, and super-quadratic speedups (such as Shor's for prime factoring [NC10]) are rare. Recently, several exponentially-faster algorithms (such as quantum recommender systems [KP16], or dense low rank linear algebra [WZP18]) have been proven to rely on an unrealistic random access memory model which, if classically available, can yield an exponential speedup without the need for quantum computing [Tan19].
Our quantum search decoder does not rely on an unrealistic memory model. The novel algorithmic contribution is to analyse a very recent quantum maximum finding algorithm [Van+17] and its expected runtime when provided with a biased quantum sampler for a formal grammar that we developed, under the promise that at each step the input tokens are non-uniformly distributed.
For the case of finding the most likely parsed string, the close connection between decoding a probabilistic sequence and sampling from it yields precisely the quadratic speedup expected from applying quantum amplitude amplification to an unstructured search problem.
We obtain a more striking advantage in the case that the input sequence is just serving as advice on where to find the top scoring parse under a secondary metric, i.e. where the element with the highest score is not necessarily the one with the highest probability of occurring when sampled. In that case, our proposal is always more than quadratically faster than its classical counterpart, and the speedup becomes more pronounced the better the advice state.

Main Results
In this paper, we address the question of decoding a probabilistic sequence of words, letters, or generally tokens, obtained e.g. from the final softmax layer of a recurrent neural network, or given as a probabilistic list of heuristic parse transitions. These models are essentially identical from a computational perspective. Hence, we give the following formal setup, and will speak of a decoding task, leaving implicit the two closely-related applications.
Given an alphabet Σ, we expect as input a sequence of random variables $X_1, X_2, \dots, X_n$ over Σ, distributed as $X_i \sim D_i^\Sigma$. The distributions $D_i^\Sigma$ can in principle vary for each i; furthermore, the $X_i$ can either be independent, or include correlations. The input model is such that we are given this list of distributions explicitly, e.g. as a table of floating point numbers; for simplicity of notation we will continue to write $X_i$ for such a table. The decoding machine M is assumed to ingest the input one symbol at a time, and branch according to some factor R at every step; for simplicity we will assume that R is constant (e.g. an upper bound to the branching ratio at every step). As noted, M can for instance be a parser for a formal grammar (such as an Earley parser [Ear70]) or some other type of language model; it can either accept good input strings, or reject others that cannot be parsed. The set of configurations of M that lead up to an accepted state is denoted by Ω; we assume that everything that is rejected is mapped by the decoder to some type of sink state $\omega \notin \Omega$.
We allow M to make use of a heuristic that attempts to guess good candidates for the next decoding step. Furthermore, this heuristic can also depend on the input, i.e. we have a function $H : \Sigma \times \Omega \to \Omega$. We allow H to itself be an automaton, possibly with a stack, and even a full-fledged Turing machine is a natural extension of this model. Since we are not interested in the complexity of the heuristic itself, we simply distinguish between a stateful and stateless heuristic by regarding them as randomized automata with or without correlations respectively, but otherwise assume they always produce the expected output in unit time.
It is not difficult to see that the randomized input setting is more generic than employing a heuristic at the decoding step. In this light, we will restrict our discussion to a decoder M that processes a token sequence step by step, and such that its state itself now simply becomes a sequence $(M_i)_{i \le n}$ of random variables. Described as a stochastic process, the $M_i$ are random variables over the set Ω of internal configurations after the automaton has ingested $X_i$, given that it has ingested $X_{i-1}, \dots, X_1$ prior to that, with a distribution $D_i^\Omega$. The probability of decoding a specific accepted string $x = (x_1, \dots, x_n)$ is then given by the product of the conditional probabilities (eq. (1)), with normalisation $N = 1/\bigl(\sum_{x \in \Omega} \Pr(X = x)\bigr)$, and in a slight abuse of notation we write $M_n = x$ when we mean $M_n = y(x)$, where $y(x)$ is the configuration of the parser M that was provided with some input to produce the parsed string x; similarly we will write $x \in \Omega$ for an accepted string/decoded path. The obvious question is: which final accepted string of the decoder is the most likely? This is captured in the following computational problem.

MLP
Input: Decoder M over alphabet Σ and with set of accepting configurations Ω. Sequence of random variables $(X_i)_{i \le n}$ over sample space Σ.
Question: Find $\sigma = \operatorname{argmax}_{x \in \Omega} \Pr(M_n = x)$.

Classically, it is clear that if we have a procedure that can sample the random variable $M_n$ efficiently, then we can find the most likely element with an expected runtime of $1/\Pr(M_n = \sigma)$, as this is the number of samples we are expected to draw to see the element once. While such sampling algorithms might be inefficient to construct in general, we emphasize that the question of drawing samples from strings over a formal language is an active field of research, and algorithms to sample uniformly are available for a large class of grammars: in linear time for regular languages [BG12; ODG13], and context-free grammars/restrictions thereof can be sampled uniformly [McK97; GPS01; HC83; Gor+97; Den96] and also with word bias [RPW13; LP13; DRT00; Pon12].
In theorem 6 and section 3.1, we lift a classical uniform sampler (e.g. given as a bounded-error probabilistic poly-time BPP algorithm with coin flips as a source of randomness) to a biased quantum sampler, which we can use to obtain a quantum advantage when answering MLP. We note that the techniques therein may well be used to obtain a classical Monte Carlo procedure to sample from $M_n$. In what follows, we will therefore assume that obtaining a sample of $M_n$ within a sufficiently-small error margin can be done with a (uniform) family of classical poly-time randomised circuits denoted $(S_n)_{n \ge 1}$ that is given to us. Yet in order to be precise, we will explicitly keep the sampling runtime separate from the rest of the complexity analysis.
We prove the following result:

Theorem 1. For an input sequence of length n of random variables to a parser with a classical sampling runtime T(n), there exists a quantum search algorithm answering MLP with certainty, using $\pi / \bigl(4 \sqrt{\Pr(M_n = \sigma)}\bigr)$ iterations. In each iteration, it runs a quantum circuit for the sampler in $O(T(n)^{1.6})$ time.
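To make the size of the gap concrete, the following sketch compares the expected number of classical samples, 1/p, with the standard Grover-type iteration count π/(4√p), where p stands for Pr(M_n = σ). The function names are hypothetical, and the per-iteration cost of the sampler is deliberately ignored; this is an illustration of the scaling only.

```python
import math

def classical_expected_samples(p_top):
    """Expected number of draws until an element of probability p_top is seen."""
    return 1.0 / p_top

def grover_iterations(p_top):
    """Grover-type iteration count pi / (4 * sqrt(p_top))."""
    return math.pi / (4.0 * math.sqrt(p_top))
```

For instance, calling both functions with `p_top = 1e-6` shows the classical count growing like 1/p while the quantum count grows only like 1/√p.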
As explained, this theorem formalises the expected quadratic speedup of the runtime as compared to a classical algorithm based on sampling from M n .This works because we know that the element to search for is the one that also occurs with the highest likelihood within the distribution.Given the input to the parser is power-law distributed (see definition 8), this allows us to formulate the following corollary.

Corollary 2. If the
Yet a priori, it is not clear that the weight of a decoded path (e.g. the product of probabilities of the input tokens) also corresponds to the highest score we wish to assign to such a path. This becomes obvious in the setting of a heuristic applied to a live translation. While at every point in time the heuristic might be able to guess a good forward transition, it might well be that long range correlations strongly affect the likelihood of prior choices. Addressing these long-distance "collocations" is an active field of research [ZQH15a].
To give an exemplary illustration, consider a sentence beginning with "Who does Bill want to", which can continue along two branches. This example contains a so-called clausally unbounded long distance dependency [Dąb08]. In the top and bottom branches respectively, the word 'Who' is either the direct object of the verb 'replace', or the direct object of the verb 'want' and hence by implication the subject of the verb 'win'. The parser cannot distinguish between the two cases until it has seen the verb, so both choices of interpretation have to be retained. Several recent studies evaluate how parsers perform in the presence of such dependencies in the input. The findings of [Kha+18] indicate that LSTM models are capable of using about 200 tokens of context on average, but that they sharply distinguish nearby context (≈ 50 tokens) from the distant past. Furthermore, such models appear to be very sensitive to word order within the most recent context, but ignore word order in the long-range context (more than 50 tokens away). On the other hand, specialised dependency parsers (such as the MSTParser and MaltParser) which are equipped with simple post-processing to extract unbounded dependencies from the basic dependency tree are able to correctly recall unbounded dependencies only roughly 50% of the time [Niv+10].
To address this setting formally, we assume there is a scoring function $F : \Omega \to \mathbb{R}$, which assigns scores to all possible decoded paths. Without loss of generality, there will be one optimal string which we denote with $\tau = \operatorname{argmax}_{x \in \Omega} F(x)$. Furthermore, we order all decoded strings Ω in some fashion, and index them with numbers $i = 1, \dots, |\Omega|$. Within this ordering, τ can now be in different places, either because the heuristic guesses differently at each step, or because the input sequence varied a little. We denote the probability that the marked element τ is at position i with $p_i$. In essence, the position where τ is found is now a random variable itself, with probability mass function $\Pr(\text{finding } \tau \text{ at index } i) = p_i$.
For the decoder probabilities $\Pr(M_n = x)$ to serve as good advice on where to find the highest-score element under the metric F, we demand that
$\Pr(M_n = \text{string with index } i) = p_i$. (2)
Loosely speaking, what eq. (2) means is that the final distribution over the states of the decoder puts high mass where the highest-scoring element often occurs.
To be precise, we define the following problem.

HSP
Input: Decoder M over alphabet Σ and with state space Ω. Sequence of random variables $(X_i)_{i \le n}$ over sample space Σ. Scoring function $F : \Omega \to \mathbb{R}$.
Question: Find $\tau = \operatorname{argmax}_{x \in \Omega} F(x)$.

What is the classical baseline for this problem? As mentioned in [Mon11], if $p_x$ is the probability that x is the highest-scoring string, then in expectation one has to obtain $1/p_x$ samples to see x at least once. Any procedure based on sampling from the underlying distribution $p_x$ thus has expected runtime $\sum_{x \in \Omega} p_x \cdot (1/p_x) = |\Omega|$. In a sense this is as bad as possible; the advice gives zero gain over iterating the list item by item and finding the maximum in an unstructured fashion. Yet provided with the same type of advice, a quantum computer can exhibit tremendous gains over unstructured search.
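The classical baseline can be checked numerically. The sketch below assumes the advice distribution is a power law over R ranks (the helper `power_law` is introduced only for this example) and averages the waiting time 1/p_x over the position of the marked element; for any distribution with full support the result is simply the number of elements.

```python
def power_law(R, k):
    """Probabilities proportional to r^(-k) for ranks r = 1..R."""
    weights = [r ** (-k) for r in range(1, R + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_classical_samples(probs):
    """Average of the waiting time 1/p_x when x is itself drawn from p."""
    return sum(p * (1.0 / p) for p in probs if p > 0)
```

The power-law exponent has no influence on the outcome here, which is the sense in which the advice gives no classical gain.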
Theorem 3. With the same setup as in theorem 1 but under the promise that the input tokens are iid with $X_i \sim \mathrm{Power}_{|\Sigma|}(k)$ over alphabet Σ (definition 8), that the decoder has a branching ratio R ≤ |Σ|, and that we can uniformly sample from the grammar to be decoded, there exists a quantum algorithm QSD answering HSP with an expected number of iterations $O\bigl(R^{n f(R,k)}\bigr)$, where $f(R,k) = \log\bigl(H_R(k/2) / H_R(k)^{1/2}\bigr) / \log R$, and where $H_R(k)$ denotes the R-th harmonic number of order k. Each iteration runs a quantum circuit for the sampler in time $O(T(n)^{1.6})$.
There exists no classical algorithm to solve this problem based on taking stochastic samples from the decoder M that requires fewer than $\Omega(R^n)$ samples.
While the runtime for algorithm 1 used to prove theorem 3 is based on an analytical bound, we found that numerically it comes close to the true expected query complexity of the search decoding algorithm. The exponent f(R, k) indicates the speedup over a classical implementation of the decoding algorithm (which would have to search over $R^n$ elements). We find that f(R, k) < 1/2 for all R, k > 0, and in fact f(R, k) → 0 exponentially quickly with k; we formulate the following corollary.
Corollary 4. For k > 0, QSD is always faster than plain Grover search (with runtime $\propto R^{n/2}$); the extent of the speedup depends on the branching ratio R and the power law exponent k, and is plotted in fig. 1.
Finally, in section 5 we modify the full quantum search decoder by only searching over the paths with likelihood above some given threshold (that we allow to depend on n in some fashion), effectively turning the decoder into a type of beam search, but where the pruning only happens at the very end. This means that in contrast to beam search, the top scoring element is found over the globally most likely parsed paths, avoiding the risk that early beam pruning brings. We analyse the runtime of algorithm 2 for various choices of beam width numerically, and analyse its performance on a concrete example: Mozilla's DeepSpeech implementation, a speech-to-text LSTM which we show to follow a power-law token distribution at each output frame.
For DeepSpeech, we empirically find that input sequence lengths of up to 500 tokens can realistically be decoded, with an effective beam width of $10^{15}$ hypotheses, while requiring ≈ $3 \times 10^6$ search iterations (cf. fig. 9).
We want to emphasize that the fact that the letters a-z follow Zipf's law with respect to their occurrence in English sentences (see e.g. [Egg00; Pia14]) plays no role in attaining the speedup. In addition to fig. 7, we verified that when only collecting those output frames of DeepSpeech where, say, "t" is the most likely prediction, the distribution over all letters (sorted by rank, i.e. sorted from most to least likely prediction) is already a power-law. This is a feature of the output of the model, and not necessarily a property of the underlying data the model was trained on. In our context this means that the Softmax output layer of the LSTM has to yield a power-law probability distribution. How frequently a given letter is the most likely prediction (which is itself known to be a power-law, as mentioned) is not important.

Quantum Search Decoding
In this section, we give an explicit algorithm for quantum search decoding. As mentioned before (see section 2), we assume we have access to a classical algorithm that, given a list of transition probabilities determined by the inputs $X_1, \dots, X_n$, yields a random sample drawn from the distribution, either uniformly, or with weights corresponding to eq. (1). Since either way this sampler is given as a classical probabilistic program, we first need to translate it to a quantum algorithm. We start with the following lemma.
Lemma 5. For a probabilistic classical circuit with runtime T(n) and space requirement S(n) on an input of length n, there exists a quantum algorithm that runs in time $O(T(n)^{\log_2 3})$ and requires $O(S(n) \log T(n))$ qubits.
Proof. Follows from [BTV01, Th. 1]: any non-reversible computation requiring time T and space S can be simulated reversibly in time $T' = 3^k 2^{O(T/2^k)}$ and space $S' = (1 + O(k)) S$, for a $0 \le k \le \log_2 T$ chosen arbitrarily. Choose $k = \log_2 T$, then $S' = (1 + O(\log_2 T)) S$, and $T' = O(T^{\log_2 3})$. Now translate this reversible probabilistic classical circuit into a quantum circuit, e.g. using the Solovay-Kitaev theorem [NC10], which incurs an at most logarithmic runtime overhead.
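For completeness, the substitution $k = \log_2 T$ used in the proof can be spelled out as follows. Primes denote the resources of the reversible simulation, and the constant hidden in the $O(\cdot)$ is left implicit; the numerical value of the exponent is what motivates the rounded figure $T(n)^{1.6}$ quoted elsewhere in the text.

```latex
\begin{align*}
T' &= 3^{k}\, 2^{O(T/2^{k})}\Big|_{k=\log_2 T}
    = 3^{\log_2 T}\, 2^{O(1)}
    = O\!\left(T^{\log_2 3}\right)
    \approx O\!\left(T^{1.585}\right),\\
S' &= \bigl(1 + O(\log_2 T)\bigr)\, S
    = O\!\left(S \log T\right).
\end{align*}
```

The middle step uses $3^{\log_2 T} = 2^{\log_2 3 \cdot \log_2 T} = T^{\log_2 3}$.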

Biased Quantum Sampling from a Regular or Context-Free Grammar
As an immediate consequence, given a classical probabilistic sampling algorithm that can produce strings $a_1 a_2 \cdots a_n$ of a language such that each string is weighted by the probability of the symbol $a_i$ occurring at site i (i.e. eq. (1)), we can obtain a quantum circuit that produces a quantum state which is a weighted superposition over all such strings.
In addition to the weighted superposition, however, we would like to have the weight of each state in the superposition spelled out as an explicit number in an extra register (e.g. as a fixed precision floating point number), i.e. in the form
$|\mu\rangle \propto \sum_{q \in \Omega} \sqrt{p_q}\, |h_q\rangle |p_q\rangle |q\rangle$, (4)
where Ω is the set of accepted strings reachable by the decoder in n steps, $|h_q\rangle$ is an ancillary state that depends on q and is contained in the decoder's work space, and the weight is $p_q = \prod_{i=1}^{n} p_{i j_i}$ if q is a state reached by reading the input sequence $a_{1 j_1}, a_{2 j_2}, \dots, a_{n j_n}$.

However, it is not clear that such a weighted sampler (dependent input or not) is available at all. As outlined in the introduction, we know there exist uniform classical probabilistic samplers for large classes of grammars, e.g. for regular languages in linear time (e.g. [ODG13]) and polynomial time for variants of CFGs (e.g. [GPS01]). Again keeping the uniform sampler's runtime separate from the rest of the algorithm, we can raise classical uniform samplers to obtain a biased quantum state preparator for $|\mu\rangle$.

Theorem 6. Given a classical probabilistic algorithm that, in time T(n), produces uniform samples of length n from a language, and given a list of independent random variables $X_1, \dots, X_n$ with pdfs $p_{i,j}$ for $i = 1, \dots, n$ and $j \in [|\Sigma|]$, we can construct a quantum circuit $U_\mu$ that produces a state $|\mu'\rangle$ that is ε-close to the one in eq. (4). The algorithm runs in time $O(T(n)^{1.6} \times n^3 \kappa / \epsilon^2)$, where κ is an upper bound on the relative variance of the conditional probability $\Pr(a \mid s_1 \dots s_i)$.
Proof. Using lemma 5, translate the parser (which takes its input step by step) into a sequence of unitaries $U = U_n \cdots U_1$. Considering a single unitary $U_i$ at the i-th step, it is clear that it can be broken up into a family of unitaries $(U_i^a)_{a \in \Sigma}$, such that each $U_i^a$ is a specialization of $U_i$ when given a fixed input symbol $a \in \Sigma$. We define $V_i^a$ to perform $U_i^a$, and in addition store the input a in some ancillary workspace, e.g. via $V_i^a |\phi\rangle |\xi\rangle = (U_i^a |\phi\rangle) |\xi \oplus a\rangle$. Then define the block-diagonal unitary $V_i := \mathrm{diag}(V_i^a)_{a \in \Sigma}$, which acts like a controlled matrix, meaning that if $V_i$ acts on some state $|\psi\rangle = |a\rangle |\phi\rangle$, then $V_i |\psi\rangle = |a\rangle V_i^a |\phi\rangle$. Naturally this works in superposition as well, e.g. $V_i (\alpha |a\rangle + \beta |b\rangle) |\phi\rangle = \alpha |a\rangle V_i^a |\phi\rangle + \beta |b\rangle V_i^b |\phi\rangle$.
We further assume that the $V_0^a$ take as initial state $|0\rangle |q_0\rangle$.
The final step in augmenting the parser is to extend $V_i$ to carry out a controlled multiplication on a register holding a number from a finite set $F \subset \mathbb{R}$ (e.g. fixed precision). We denote this extended unitary for step i with $U'_i$.

The next ingredient we take is the classical uniform language sampler. Once again using lemma 5, we raise it to a unitary W, which takes as input a prefix $s_m := a_1 \cdots a_m$ of the m previously-seen tokens, and a list of distributions over the future weights $W_m := (p_{i,j})_{m < j \le n}$. These are the distributions of tokens for each of the $X_j$. We then augment W to a circuit W' that quantumly performs the following classical calculations, in superposition over its input:
1. Draw S samples uniformly at random from the grammar starting at strings prefixed with $s_m$; denote this list with $B := \{b_1, \dots, b_S\}$.
2. Group the samples B into bins $C_a$ of samples with the same first token $a \in \Sigma$, i.e. $C_a = \{b \in B : b = a?? \cdots ?\}$, where ? stands for any token in the alphabet Σ.
3. Calculate the total of the probabilities of each bin $C_a$ where each element is weighted with respect to the future probabilities given in list $W_m$, which yields a distribution $D = (d_a)_{a \in \Sigma}$.
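The three steps above can be sketched classically as follows. This is an illustration only, not the quantum circuit: the callable `sample_uniform` is a hypothetical stand-in for the uniform grammar sampler, the "first token" of a sample is taken to be the first token after the prefix, and the final normalisation of the bin totals into a distribution is an assumption of this sketch.

```python
import random
from collections import defaultdict

def estimate_next_token_distribution(sample_uniform, prefix, future_probs,
                                     num_samples, rng=random):
    """Classical analogue of steps 1-3.

    sample_uniform(prefix, rng): hypothetical uniform sampler returning a
        full string (tuple of tokens) of the grammar that starts with prefix.
    future_probs: list of dicts; future_probs[j][tok] is the probability of
        tok at position len(prefix) + j.
    Assumes every sampled string is strictly longer than the prefix.
    """
    m = len(prefix)
    bins = defaultdict(float)
    for _ in range(num_samples):
        # Step 1: draw a uniform sample continuing the prefix.
        suffix = sample_uniform(prefix, rng)[m:]
        # Step 3 (per element): weight the sample by its future probabilities.
        weight = 1.0
        for dist, tok in zip(future_probs, suffix):
            weight *= dist.get(tok, 0.0)
        # Step 2: bin by the first token after the prefix.
        bins[suffix[0]] += weight
    total = sum(bins.values())
    # Normalise the bin totals (an assumption of this sketch).
    return {a: w / total for a, w in bins.items()} if total > 0 else {}
```

With a toy grammar given as an explicit list of tuples, `sample_uniform` can be as simple as `rng.choice` over the entries matching the prefix; the number of samples needed for a given accuracy is the subject of the variance argument in the proof.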
It is straightforward to write the unitary W'' that then takes a state $|00\rangle \in H_F \otimes \mathbb{C}^\Sigma$ (the first register for storing a number in F, and the second for storing a letter) and a list of such weights D to a weighted superposition $W''(D) |00\rangle = \sum_{a \in \Sigma} \sqrt{d_a}\, |d_a\rangle |a\rangle$ (where for the sake of simplicity we drop the scratch space register that is certainly required). Furthermore, we need a controlled unitary Q that, given some state $|h\rangle |a\rangle$ where h = h(a) in some specified fashion (which we can demand the $V_i^a$ produce), uncomputes a and $d_a$ from the second register, i.e. $Q |h\rangle |d_a\rangle |a\rangle = |h\rangle |00\rangle$. Together with the sequence of parser unitaries $U'_i$, the overall quantum circuit $U_\mu$ can then be constructed.

For a partial string $s_1 s_2 \cdots s_i$ of length i, we denote the set of all strings in the grammar prefixed with letters of s with $A(s_1 \dots s_i)$. At every step i in the algorithm we sample the expectation value of a future hypothesis continuing with some token a, weighted by their individual likelihood $p_{ij}$. The sampling procedure then yields an empirical distribution $(d_a)_{a \in \Sigma}$. Our goal is to show that the algorithm reproduces the desired weight distribution given in eq. (4).
To estimate the total probability distribution to error ε in total variation distance, it suffices to approximate each conditional distribution to error ε/n, and thus we must show how many samples S are required for $d_a$ to be a good estimator for $\Pr(a \mid s_1 \dots s_i)$.
A straightforward calculation shows that $E(u_{s_i}(a)) / E(v_{s_i}) = \Pr(a \mid s_1 \dots s_i)$, the value we are trying to estimate.
Therefore it suffices to take enough samples S such that the $u_{s_i}(a)$ are close to their mean in relative error (and thus $v_{s_i}$ is also close in relative error, since $v_{s_i} = \sum_a u_{s_i}(a)$).
Noting that $u_{s_i}(a) = \frac{1}{S} \sum_{j=1}^{S} Y_j$ for i.i.d. random variables $Y_j$, we have that $\mathrm{Var}(u_{s_i}(a)) = \frac{1}{S} \mathrm{Var}(Y)$. Therefore by Chebyshev's inequality, to get an ε/n relative error approximation requires the number of samples S to be at least of order $\frac{n^2}{\epsilon^2} \frac{\mathrm{Var}(Y)}{E(Y)^2}$. By assumption $\mathrm{Var}(Y)/E(Y)^2 \le \kappa$, and so the total number of uses of the sampler over all n steps of the algorithm is $O(\kappa n^3 / \epsilon^2)$ as claimed.
We remark that getting a precise handle on κ strongly depends on the grammar to be parsed and the input presented to it; it seems unreasonable to claim any general bounds as they will most likely be of no good use for any specific instance. However, we note that it is conceivable that if the input is long and reasonably independent of the language to be sampled, then κ should be independent of n, and $\kappa \approx 1/p(r_{\min})$, where p(r) is the distribution of the input tokens at any point in time, e.g. $p(r) \propto r^{-k}$ as in a power law.

We note that variants of this sampling algorithm are certainly possible: a naïve approach would be to just sample from the product-of-powerlaws distribution and postselect on the resulting strings being in the grammar; the performance of this will then depend on the number of strings in the grammar vs. the number of all possible strings. Another method could be to execute the uniform sampler in superposition, and perform amplitude amplification on the resulting quantum state to reintroduce the power-law bias. The number of amplification rounds will again depend on the distribution of the strings in the grammar.

The Quantum Search Decoder
The quantum algorithm underlying the decoder is based on the standard maximum finding procedure from [DH96; AK99], and its extension in [Van+17] used in the context of SDP solvers.
The procedure takes as input a unitary operator U µ which prepares the advice state, and a scoring function F which scores its elements, and returns as output the element within the advice state that has the maximum score under F. As in section 3.1, we assume that F can be made into a reversible quantum circuit to be used in the comparison operation.We also note that reversible circuits for bit string comparison are readily available [SR07], and can be implemented using quantum adder circuits [Gid18].
Algorithm 1 lists the steps in the decoding procedure. As a subroutine within the search loop, we perform exponential search with oblivious amplitude amplification [Ber+14].
As in the maximum finding algorithm, the expected query count for quantum search decoding is given as follows.
Theorem 7 ([Van+17]).If x is the most likely decoded string, the expected number of iterations

Power Law Decoder Input
In this section we formally prove that if the decoder is fed independent tokens that are distributed like a power law, then the resulting distribution over the parse paths yields a super-Grover speedup, meaning the decoding speed is faster than applying Grover search, which itself is already quadratically faster than a classical search algorithm that traverses all possible paths individually.

Algorithm 1 Algorithm for quantum search decoding.
      measure new best score
      counter ← counter + 1
  until counter = m
end function
A power law distribution is the discrete variant of a Pareto distribution, also known as Zipf's law, which ubiquitously appears in the context of language features [Jäg12; SB16; Egg00; Pia14]. This fact has already been exploited by some authors in the context of generative models [GGJ11].
Formally, we define it as follows. We are interested in the Cartesian product of power law random variables, i.e. sequences of random variables of the form $(X_1, \dots, X_n)$. Assuming the random variables $X_i \sim \mathrm{Power}_R(k)$ are all independent and of rank $r_i$ with pdf $q(r_i) = r_i^{-k} / H_R(k)$, respectively, it is clear that
$q(r_1, \dots, r_n) = \prod_{i=1}^{n} q(r_i) = \frac{1}{H_R(k)^n} \prod_{i=1}^{n} r_i^{-k}$. (6)
As in [Mon11], we can upper bound the number of decoder queries in QSD by calculating the expectation value of the iterations necessary (given by theorem 7) with respect to the position of the top element.
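As a concrete illustration of the product-of-powerlaws distribution, the following sketch enumerates the joint probabilities of all rank tuples for small R and n. The function names are chosen for this example only, and the exhaustive enumeration is feasible only for toy sizes since the number of tuples is R^n.

```python
from itertools import product

def harmonic(R, k):
    """Generalised harmonic number H_R(k) = sum_{r=1}^{R} r^(-k)."""
    return sum(r ** (-k) for r in range(1, R + 1))

def power_law_pmf(R, k):
    """Probabilities q(r) = r^(-k) / H_R(k) for ranks r = 1..R."""
    h = harmonic(R, k)
    return [r ** (-k) / h for r in range(1, R + 1)]

def product_pmf(R, k, n):
    """Map each rank tuple (r_1, ..., r_n) to its joint probability."""
    q = power_law_pmf(R, k)
    joint = {}
    for ranks in product(range(1, R + 1), repeat=n):
        p = 1.0
        for r in ranks:
            p *= q[r - 1]  # independent tokens: probabilities multiply
        joint[ranks] = p
    return joint
```

The highest-mass point is always the all-ones rank tuple, with probability 1 / H_R(k)^n, which is the quantity that enters the query bound for the most likely parse.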
We assume that at every step, when presented with choices from an alphabet Σ, the parsed grammar branches on average R ≤ |Σ| times.Of course, even within a single time frame, the subset of accepted tokens may differ depending on what the previously-accepted tokens are.This means that if the decoder is currently on two paths β 1 (e.g.corresponding to "I want") and β 2 ("I were"), where the next accepted token sets are Σ 1 , Σ 2 ⊂ Σ (each different subsets of possible next letters for the two presented sentences), respectively, then we do not necessarily have that the total probability of choices for the two paths-Pr(Σ 1 ) and Pr(Σ 2 )-are equal.But what does this distribution over all possible paths of the language, weighted by eq.(1), look like?Certainly this will depend on the language and type of input presented.Under a reasonable assumption of independence between input and decoded grammar, this becomes equivalent to answering the following question: let X a product-of-powerlaw distributions with pdf given in eq. ( 6), where every term is a powerlaw over Σ.Let Y be defined as X, but with a random subset of elements deleted; in particular, such that R n elements are left, for some R < |Σ|.Is Y distributed as a product-of-powerlaws as in eq. ( 6), but over R elements at each step?In the case of continuous variables this is a straightforward calculation, and we postpone it to appendix A; numerics suggest it also holds true for the discrete case.
But even if the input that the parser is given is independent of the parsed grammar, it is not clear whether the sample distribution over R (i.e. sampling R out of $|\Sigma|$ power-law distributed elements) follows the same power law as the original one over $\Sigma$; this is in fact not the case in general [ZQH15b]. However, it is straightforward to numerically estimate the changed power law exponent of a sample distribution given R and $|\Sigma|$, and we note that the exponent shrinks only marginally when $R < |\Sigma|$.
In this light and to simplify the runtime analysis, we therefore assume the decoder accepts exactly R tokens at all times during the parsing process (like an R-ary tree over hypotheses) with a resulting product-of-powerlaw distribution, and give the runtimes in terms of the branching ratio, and not in terms of the alphabet's size. This indeed yields a fair runtime for comparison with a classical variant, since any classical algorithm will also have the aforementioned advantage (i.e. we assume the size of final elements to search over is $R^n$, which precisely corresponds to the number of paths down the R-ary tree).

M L P : Query Bound
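In this case F simply returns $p_q$ as the score in eq. (4). It thus suffices to calculate the state overlap $|\langle x|\mu\rangle|$, under the assumption that x is the highest mass point of the probability density function. By eq. (6), the highest mass point is the one with all ranks equal to one, so we have

$$|\langle x|\mu\rangle|^2 = p(1, \ldots, 1) = \frac{1}{H_R(k)^n}.$$

The claim of corollary 2 follows from these observations.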

H S P : Simple Query Bound
We aim to find a top element scored under some function F under the promise that $|\mu\rangle$ (eq. (4)) presents good advice on where to find it, in the sense of eq. (2). The expected runtimes for various power law falloffs k can be obtained by taking the expectation with respect to $p_x$ as in [Mon11].
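In order to do so, we need to be able to calculate expectation values of the Cartesian product of power law random variables, where we restrict the domain to those elements with probability above some threshold. We start with the following observation.

Lemma 9. If Q S D receives as input iid random variables $X_1, \ldots, X_n$, with $X_i \sim \mathrm{Power}_R(k)$, then the number of queries required to the parser is

$$\mathrm{RT}_1(R, k, n) = O\!\left(\frac{H_R(k/2)^n}{H_R(k)^{n/2}}\right).$$

Proof. The expectation value of $1/\langle x|\mu\rangle$ is straightforward to calculate; writing $\vec r = (r_1, \ldots, r_n)$, by eq. (6), we have

$$\mathbb{E}\!\left(\frac{1}{\langle x|\mu\rangle}\right) = \sum_{\vec r} \frac{p(\vec r)}{\sqrt{p(\vec r)}} = \frac{1}{H_R(k)^{n/2}} \sum_{\vec r} \prod_{i=1}^n r_i^{-k/2} = \frac{H_R(k/2)^n}{H_R(k)^{n/2}}.$$

We observe that the runtime in lemma 9 is exponential in n. Nevertheless, as compared to a Grover algorithm, with runtime $R^{n/2}$, the base is now dependent on the power law's falloff k. We can compare the runtimes if we rephrase

$$\mathrm{RT}_1(R, k, n) = R^{n f(R,k)} \quad \text{where} \quad f(R, k) = \log\!\left(\frac{H_R(k/2)}{H_R(k)^{1/2}}\right) \Big/ \log R.$$

We observe that the exponent $f(R, k) \in (0, 1/2]$; its precise dependency on k for a set of alphabet sizes R is plotted in fig. 1. For growing k, $f(R, k)$ falls off exponentially.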
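To make the comparison with Grover search concrete, the exponent can be evaluated directly. The sketch below assumes the closed form $f(R,k) = \log_R\big(H_R(k/2)/\sqrt{H_R(k)}\big)$, which is what the expectation of $1/\sqrt{p}$ under the product-of-powerlaw distribution gives; the function names `grover_exponent` and `rt1` are hypothetical.

```python
import math


def harmonic(R: int, k: float) -> float:
    """Generalised harmonic number H_R(k) = sum_{r=1}^R r^(-k)."""
    return sum(r ** -k for r in range(1, R + 1))


def grover_exponent(R: int, k: float) -> float:
    """Exponent f(R, k) such that the query bound reads R^(n * f(R, k))."""
    ratio = harmonic(R, k / 2) / math.sqrt(harmonic(R, k))
    return math.log(ratio) / math.log(R)


def rt1(R: int, k: float, n: int) -> float:
    """Query bound R^(n f(R, k)), up to constant factors."""
    return R ** (n * grover_exponent(R, k))


# At k = 0 the input is uniform and the exponent is the Grover value 1/2;
# for larger k the exponent drops below 1/2.
for k in (0.0, 1.0, 2.0, 3.0, 5.0):
    row = [round(grover_exponent(R, k), 4) for R in (3, 10, 100)]
    print(f"k = {k}: f(R, k) for R in (3, 10, 100) = {row}")
```

For k = 0 the harmonic numbers reduce to $H_R(0) = R$, so the ratio is $\sqrt{R}$ and the plain Grover exponent is recovered.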

M L P : Full Query Bound
A priori, it is unclear how much we lose in lemma 9 by upper-bounding $O(\min\{1/\langle x|\mu\rangle, \sqrt{R^n}\})$ by $O(1/\langle x|\mu\rangle)$, so let us be more precise. In order to evaluate the expectation value of the minimum, we will break up the support of the full probability density function $p(\vec r)$ into a region where $p(\vec r) > 1/R^n$, and its complement. Then, for two constants $C_1$ and $C_2$, we have for the full query complexity

$$\mathrm{RT}_2(R, k, n) = C_1 \sum_{\vec r:\, p(\vec r) > 1/R^n} \frac{p(\vec r)}{\sqrt{p(\vec r)}} + C_2 \sum_{\vec r:\, p(\vec r) \le 1/R^n} p(\vec r)\, \sqrt{R^n}. \tag{7}$$

In order to calculate sums over sections of the pdf $p(\vec r)$, we first move to a truncated Pareto distribution by making the substitution […]

[Figure caption fragment: … where the discrete probabilities from the power law are approximated with a continuous Pareto distribution. On the x-axis is the length of the input sequence n.]
While this does introduce a deviation, its magnitude is minor, as can be verified numerically throughout (see fig. 2, where we plot both $\mathrm{RT}_1$ and the continuous variant RT ). The type of integral we are interested in thus takes the form

$$M(R, k_1, k_2, c, n) := \frac{1}{h_R(k_1)^n} \int \!\cdots\! \int \chi\Big(\prod_{i=1}^n r_i \le c\Big) \prod_{i=1}^n r_i^{-k_2}\, \mathrm{d}r_1 \cdots \mathrm{d}r_n, \tag{8}$$

with the integration running over the support of the truncated Pareto distribution, where $k_1$ is not necessarily equal to $k_2$, and typically $c = (R/h_R(k_1))^{n/k_1}$, which would reduce to the case we are seeking to address in eq. (7). Here, $\chi(\cdot)$ denotes the characteristic function of a set, i.e. it takes the value 1 where the premise is true, and 0 otherwise. It is possible to integrate eq. (8) numerically for small n; however, due to the high dimensionality and the flat tail, convergence suffers drastically already for n > 6. Similarly, evaluating the integral with a computer algebra system takes significant time for larger n and produces ever growing expressions that are hard to handle, as the reader is welcome to verify. To address this problem, we derive the following closed-form expression.
Lemma 10. For $k \neq 1$, eq. (8) becomes […]

Proof. As a first step, we perform a log substitution $z_i = \log r_i$, $e^{z_i}\,\mathrm{d}z_i = \mathrm{d}r_i$, which yields […] The characteristic function is now supported on a rescaled unit simplex, and writing $z := \sum_i z_i$ we can take its Fourier transform […]

We of course have […]

In the step marked with *, we applied Fubini's theorem, for which we implicitly assumed a smooth limiting argument for the step function. To evaluate the integral $J_n$, we observe that the denominator has a root of order n at $t = -ik'$. We further expand the Fourier-transformed characteristic function, again glossing over the details of Fubini's theorem to swap the integration order, to obtain […] We handle the integrand's three pole cases separately.
k > 1. We have $k' < 0$ and an order-n pole at $i|k'|$; the integrand $g(t) := e^{it(x - ja)}/(t + ik')^n$ is holomorphic in the lower half plane. The exponent of the exponential, $x - ja$, assumes signs […] In the latter case, the integral (over t) evaluates to zero.
In the middle case, for $t = -is$ we have $\exp(i(-i)s(x - ja)) = \exp(s(x - ja)) \to 0$ as $s \to \infty$; by Jordan's lemma we can thus write […] where $\gamma_1(r)$ contains the real interval $[-r, r]$ and a half circle connecting the end points in the lower half complex plane.
In the first case, for $t = is$, we have $\exp(i^2 s(x - ja)) = \exp(-s(x - ja)) \to 0$ as $s \to \infty$; however now the corresponding upper half plane loop encircles the pole of $g(t)$. We apply the residue theorem for a flipped path $\gamma_2(r) = -\gamma_1(r)$: […] For the case $x - ja > 0$ we are left to perform the outer integration in eq. (10). If $c \le ja$ we necessarily have $x \le ja$ and $J_n = 0$. For the case $c > ja$ we have […] where $\Gamma(n, \cdot)$ is the lower incomplete gamma function. Putting it all together, we get […]
Finally, we insert the last expression back into eq. (9), and obtain […] The second term in the sum we can further simplify using the identity $\Gamma(n, x)/\Gamma(n) = e^{-x} \sum_{l=0}^{n-1} x^l/l!$, which holds for integer n, which yields […]

k < 1. We have $k' > 0$ and the order-n pole of eq. (10) lies at $-i|k'|$. The integrand $g(t) = e^{it(x - ja)}/(t + ik')^n$ is holomorphic in the upper half plane; and analogous to before, this time when $x - ja > 0$, we have […] In the opposite case we can again apply the residue theorem and obtain […] where the negative sign in step * stems from the clockwise orientation of the contour $\gamma_2$. The outer integration in eq. (10) is now […]
Inserting the expression back into eq. (9) we obtain […] To reduce the last sum to the previous expression, we note that […], where $S^{(n)}_m$ is the Stirling number of the second kind, which denotes the number of ways to partition a set of size m into n non-empty subsets. Since $m \le l \le n - 1$, $S^{(n)}_m \equiv 0$ here, and thus […] The claim follows.
We leave the k = 1 case as an exercise to the reader.
With lemma 10, we can now evaluate the terms in eq. (7) efficiently. The first term is […] and the second […] Of interest is whether taking this full expectation value and splitting it to fall back to Grover search whenever the probability dips below $1/R^n$ yields a significant improvement of the runtime bound. We found this to not be the case, as fig. 3 demonstrates; while for smaller n there is a significant improvement, as n grows the ratio rt/$\mathrm{RT}_1 \to 1$ exponentially fast.
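For very short sequences, the effect of the Grover fallback can also be checked by brute force on the discrete distribution, without the continuous approximation. The following sketch enumerates all $R^n$ rank tuples and compares the expectation of $1/\sqrt{p}$ with the expectation of $\min\{1/\sqrt{p}, \sqrt{R^n}\}$; setting both constants to $C_1 = C_2 = 1$ and the function name `query_bounds` are assumptions made for illustration.

```python
import itertools
import math


def harmonic(R: int, k: float) -> float:
    """Generalised harmonic number H_R(k) = sum_{r=1}^R r^(-k)."""
    return sum(r ** -k for r in range(1, R + 1))


def query_bounds(R: int, k: float, n: int):
    """Return (simple, full) query bounds by enumerating all R^n rank tuples.

    simple: expectation of 1/sqrt(p), without a fallback
    full:   expectation of min(1/sqrt(p), sqrt(R^n)), i.e. falling back to
            Grover search wherever p < 1/R^n (constants set to 1)
    """
    norm = harmonic(R, k) ** n
    grover = math.sqrt(R ** n)
    simple = full = 0.0
    for ranks in itertools.product(range(1, R + 1), repeat=n):
        p = math.prod(ranks) ** -k / norm
        simple += p / math.sqrt(p)
        full += p * min(1.0 / math.sqrt(p), grover)
    return simple, full


# Illustrative parameters; the enumeration is only feasible for small n.
for n in (2, 4, 6, 8):
    simple, full = query_bounds(3, 2.0, n)
    print(f"n = {n}: simple = {simple:.4f}, full = {full:.4f}, ratio = {full / simple:.4f}")
```

Since the minimum never exceeds $1/\sqrt{p}$, the full bound is at most the simple one; how quickly the ratio approaches one depends on R and k.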

Quantum Beam Search Decoding
The goal of this section is to modify the Q S D decoder such that it behaves more akin to a classical beam search algorithm. More specifically, instead of searching for the top scored element, which could sit anywhere within the advice distribution, we discard all hypotheses whose advice probability lies below some threshold, $p(x) < p_0$, where $p_0$ can be very small. This is done by dovetailing a few rounds of amplitude amplification to suppress all beam paths with probability less than $p_0$ (which we can do, since we have those probabilities written out as numbers within the advice state $|\mu\rangle$ in eq. (4)); a schematic of the algorithm can be found in algorithm 2.

[Algorithm 2: Algorithm for beam search decoding. … measure new best score; counter ← counter + 1; until counter = m; end function]
Of course we only want to do this if the number of amplification rounds, given as the square root of the inverse of the leftover probability $\sum_{x:\, p(x) \ge p_0} p(x)$, is small (i.e. constant, or logarithmic in n). We note that this expression is, as before, well-approximated by $M(R, k, k, p_0, n)$ given in lemma 10.
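For small instances, the leftover probability and the resulting number of amplification rounds can be computed exactly by enumeration, which gives a reference point for the continuous approximation. This is a minimal sketch on the discrete product-of-powerlaw distribution; the function name `retained_mass` and the parameter values are hypothetical.

```python
import itertools
import math


def harmonic(R: int, k: float) -> float:
    """Generalised harmonic number H_R(k) = sum_{r=1}^R r^(-k)."""
    return sum(r ** -k for r in range(1, R + 1))


def retained_mass(R: int, k: float, n: int, p0: float):
    """Return (mass, count) of all rank tuples with probability >= p0."""
    norm = harmonic(R, k) ** n
    mass, count = 0.0, 0
    for ranks in itertools.product(range(1, R + 1), repeat=n):
        # Using the integer product of ranks keeps permutations of the
        # same ranks at exactly the same floating-point probability.
        p = math.prod(ranks) ** -k / norm
        if p >= p0:
            mass += p
            count += 1
    return mass, count


# Illustrative parameters (assumptions, not taken from the paper).
R, k, n, p0 = 3, 2.0, 8, 1e-4
mass, count = retained_mass(R, k, n, p0)

# Number of amplification rounds: square root of the inverse leftover mass.
rounds = 1.0 / math.sqrt(mass)

print(f"retained mass = {mass:.4f}, hypotheses kept = {count} of {R ** n}")
print(f"amplification rounds ~ {rounds:.2f}")
```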
In beam search, only the top scoring hypotheses are kept around at any point in time; the difference to our method is of course that we can score the elements after every hypothesis has been built. This is not possible in the classical case, since it would require an exponential amount of memory or postselection. As in section 3, we have the two cases of finding the top scoring path and the most likely parse. Deriving a runtime bound for the most likely parse is straightforward, and does not, in fact, gain anything. This is because when finding the maximum likelihood path $\tau$, one performs amplitude amplification on that element anyhow, and $p(\tau) > p_0$, so it is within the set of elements with probability kept intact by the post-amplification. The only interesting case of amplifying the advice state in Q S D to raise it to a beam search variant is thus for finding the top scoring element under a secondary scoring function, using the decoder's output as advice distribution. The relevant question to ask here is what choice of $p_0$ will
1. only require a constant (or logarithmic) number of rounds of amplitude amplification,
2. retain a large number of hypotheses, and
3. improve runtime for the post-amplified Q S D variant.
We address all these questions in the next sections.

Constant Post-Amplification
For simplicity, we will take $\mathrm{RT}_1$ as an upper runtime bound to the full expected number of rounds, $\mathrm{RT}_2$; as we amplify away all paths with weights below the cutoff we never expect to find an element therein, meaning we can drop the fallback to Grover search in our analysis, and treat the search as if the advice state was purely on those paths with weight $\ge p_0$. We first address the question for which choice of $p_0$ the cumulative leftover probability $M(R, k, k, p_0, n)$ can be lower-bounded by a quantity independent of n, which means we have to perform only a constant number of amplitude amplification rounds on the advice state. In order to do so, we solve the implicit inequality […] As M is monotonically decreasing for a decreasing splitting exponent $f_{\mathrm{split}}$, and since M can be computed in $O(n^2)$ many arithmetic operations, we can perform the minimization efficiently. For a choice of $C_0 = 1/4$ (which implies a single amplitude amplification round) and $C_0 = 1/100$ (ten rounds of amplification) we plot $f_{\mathrm{split}}$ in fig. 4. As can be seen, $f_{\mathrm{split}}$ tends towards a limiting value $\in (0, 1)$ for $n \to \infty$.

The next step in our analysis is to take the modified splitting exponent $f_{\mathrm{split}}$ and count how many hypotheses $N_{\mathrm{hyp}}$ remain to be searched over; this is important because it is not clear a priori how many paths we can still search over, and if that quantity is low, or even tends towards zero, then we retained too few elements. Our hope is of course that in contrast to beam search, where generally the beam's width, i.e. the number of hypotheses retained at any point in time, is capped at some possibly large but constant value, we have a growing number of hypotheses to search over.
In order to count this number of hypotheses given a cutoff probability $p_0$, we can evaluate $M(R, k, k, p_0, n)$ in the limit of the power law exponent $k \to 0$, and finally multiply by $h_R^n(k_1)$ in eq. (8) to make the integral count instead of calculating a cumulative density. We again choose a series of values for R, k and $C_0$ and plot the results in fig. 5. While the number of leftover hypotheses is indeed reduced drastically as compared to performing a full search over $R^n$ elements, it is still growing exponentially with n, which results in a significant number of hypotheses to search over, many more than possible in the classical setting.
As a last step, we want to analyse the modified runtime given the changed probability cutoff, which corresponds to evaluating the integral $M(R, k, k/2, p_0, n)$ with the $p_0$ derived from the optimization eq. (13). The results are collected in fig. 6. As one can verify, the runtime does remain asymptotically exponential in the sequence length n; however the base of the exponential is reduced accordingly.
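The inverse problem, choosing the cutoff for a target retained mass $C_0$ and then counting the surviving hypotheses, can likewise be illustrated by brute force on the discrete distribution. The sketch below is a stand-in for the closed-form optimization and not the procedure used for the figures; the function name `cutoff_for_mass` and the parameter values are assumptions.

```python
import itertools
import math


def harmonic(R: int, k: float) -> float:
    """Generalised harmonic number H_R(k) = sum_{r=1}^R r^(-k)."""
    return sum(r ** -k for r in range(1, R + 1))


def cutoff_for_mass(R: int, k: float, n: int, C0: float):
    """Largest cutoff p0 whose retained mass is at least C0, and the count kept.

    Brute force over all R^n rank tuples, so only feasible for small n.
    """
    norm = harmonic(R, k) ** n
    probs = sorted(
        (math.prod(ranks) ** -k / norm
         for ranks in itertools.product(range(1, R + 1), repeat=n)),
        reverse=True,
    )
    p0 = probs[-1]  # fallback: keep everything
    mass = 0.0
    for p in probs:
        mass += p
        if mass >= C0:
            p0 = p
            break
    # All hypotheses tied with the cutoff probability are kept as well.
    n_hyp = sum(1 for p in probs if p >= p0)
    return p0, n_hyp


# Illustrative parameters (assumptions, not taken from the paper).
R, k = 3, 2.0
for n in (4, 6, 8):
    for C0 in (1 / 4, 1 / 100):
        p0, n_hyp = cutoff_for_mass(R, k, n, C0)
        print(f"n = {n}, C0 = {C0:.2f}: p0 = {p0:.3e}, N_hyp = {n_hyp} of {R ** n}")
```

A smaller $C_0$ permits a larger cutoff and therefore keeps fewer hypotheses, at the price of more amplification rounds.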

Non-Constant Post-Amplification
The analysis of section 5.1 can of course be repeated for a non-constant $f_{\mathrm{split}}$; however, one has to be aware that these extra amplitude amplification rounds factor into the overall runtime. For a retained fraction $g(n)$ of the total probability weight, the optimization thus reads […] and has runtime bound […] Instead of listing a series of results for a range of parameters, we provide an explicit example of this analysis with real-world parameters derived from Mozilla's DeepSpeech neural network in the next section.

Analysis of the Output Rank Frequency
To support the applicability of our model, we analysed our hypothesis that the output probabilities of an LSTM used to transcribe voice to letters (which can then be used e.g. in a dialogue system with an underlying parser) are distributed in a power-law like fashion. More specifically, we use DeepSpeech, Mozilla's implementation of Baidu's DeepSpeech speech recognition system [Han+14; Moz19a]. The neural network processes mel-frequency cepstral coefficients extracted from a sliding window of 25 milliseconds, with a stride of 20 milliseconds; for each such frame, the LSTM is invoked, and yields a distribution over the letters of the English alphabet "a" to "z", as well as a few special symbols, e.g. "silence". For the specific architecture of the LSTM we refer the reader to the original paper [Han+14]. Our hypothesis was that these letter probabilities follow a power-law distribution; our data supports this claim, as can be seen in fig. 7.
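A rank-frequency fit of this kind can be sketched as a least-squares fit on a log-log scale. The example below uses synthetic data as a stand-in for the network output (no DeepSpeech data is involved); the alphabet size of 28 symbols, the noise level, and the function name `fit_power_law` are assumptions made for illustration, and the fitting procedure behind fig. 7 may differ.

```python
import numpy as np


def fit_power_law(freqs):
    """Fit freqs ~ a * r^(-b) over ranks r = 1..len(freqs) on a log-log scale.

    Returns the pair (a, b).
    """
    # Rank the values: the largest frequency gets rank 1.
    freqs = np.sort(np.asarray(freqs, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic stand-in for averaged per-rank output probabilities:
# a power law with multiplicative log-normal noise.
rng = np.random.default_rng(0)
true_a, true_b = 1.2, 3.0
ranks = np.arange(1, 29, dtype=float)
synthetic = true_a * ranks ** -true_b * np.exp(rng.normal(0.0, 0.05, size=ranks.size))

a_hat, b_hat = fit_power_law(synthetic)
print(f"fitted a = {a_hat:.3f}, fitted b = {b_hat:.3f}")
```

Fitting in log space weights all ranks equally on the logarithmic scale, so the tail of the distribution influences the exponent as much as the head.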

Runtime Bounds for Quantum Beam Search Decoding
As outlined in section 5.2 we take the power law exponent derived from Mozilla's DeepSpeech neural network, k = 3.03, and derive runtime bounds for decoding its output with a parser under the assumption that, on average, we take R = 3 branches in the parsing tree at every time step. As discussed in section 4, the sampling distribution over three elements only yields a slightly lower exponent of k = 2.91. How does quantum beam search perform in this setting, and how many hypotheses are actually searched over? And what if we fix the beam's width to a constant, and increase the sequence length? We summarise our findings in figs. 8 and 9.

Figure 8: Runtime (eq. (16)) and number of hypotheses (eq. (15)) of quantum beam search decoding the output of Mozilla's DeepSpeech LSTM with a grammar, assuming an average branching ratio of R = 3, a token power law distribution with exponent k = 2.91, and post-amplification of the quantum search decoder with a retained fraction of hypotheses $C_0 = C_0(n) \in \{n^{-1/2}, n^{-2/3}, n^{-1}, n^{-3/2}, n^{-2}, n^{-3}\}$ as defined in eq. (14), which is plotted in rainbow colors from red to blue, top to bottom. The dashed line is the full quantum search runtime and number of hypotheses from eq. (11).
As an example we consider an input sequence of length 500; with the above parameters and a splitting exponential $f_{\mathrm{split}} = n^{-1/2}$ (resp. $= n^{-3}$) we can search over $N_{\mathrm{hyp}} \approx 10^{60}$ (resp. $\approx 10^{18}$) hypotheses, with a runtime $\approx 10^{30}$ (resp. $\approx 10^{9}$). Similarly, when capping the beam width at $N_{\mathrm{hyp}} \le 10^{6}$, we asymptotically require $\approx 10^{3}$ iterations of the beam search decoder (which includes the post-amplification rounds); for shorter sequences, a super-Grover speedup as present in full Q S D is achieved.

Summary and Conclusions
In summary, we have presented a quantum algorithm that is modelled on and extends the capabilities of beam search decoding for generative models. Studies of context sensitivity of language models have shown that state-of-the-art LSTM models are able to use about 200 tokens of context on average while working with standard datasets (WikiText2, Penn Treebank), but sharply distinguish nearby context (roughly 50 tokens) from distant history [Kha+18]. The performance of an efficient classical beam search decoder using such an LSTM depends heavily on the context sensitivity of the underlying language model. On the other hand, our quantum search decoding method is guaranteed to find, with high constant success probability, the global optimum in expected runtime that is always more than quadratically faster than possible classically (neglecting the sampling cost), and with a Grover exponent that shrinks exponentially quickly as the power law exponent k grows, thus surpassing plain Grover search for any k > 0. We have further shown that neural networks used in the real world, concretely DeepSpeech, indeed exhibit a strong power law distribution on their outputs, which in turn supports the premise of our algorithm.
There are many extensions possible for the type of search we have described in this paper, and we hope to make a series of improvements. Foremost, it would be interesting to study how much non-independence of the input random variables affects the runtime, either negatively or potentially even positively. Quantifying dependence for input models will likely require analysing a specific problem setup.
A fully error-corrected quantum computer will remain outside of the realm of the possible for the foreseeable future; yet we hope that our hitherto theoretical proposal of a quantum search decoder demonstrates that natural language processing is one of the areas where a potential quantum advantage can be obtained.
where the S sampled hypotheses are given in the list $B = \{b_1, \ldots, b_S\}$ (with individual letters $b_j = b_{j,1} \cdots b_{j,n}$). As usual, $\chi[\cdot]$ denotes the indicator function, and $p(b_j) := \prod_{k=1}^n p_{k, b_{j,k}}$.

Definition 8. Let A be a finite set with $|A| = R$, and $k > 1$. Then $\mathrm{Power}_R(k)$ is the power law distribution over R elements: for $X \sim \mathrm{Power}_R(k)$ the probability density function satisfies $\Pr(X = x) = r^{-k}/H_R(k)$ for an element x of rank r, where $H_R(k)$ denotes the R-th harmonic number of order k.

Figure 1: Exponent $f(R, k)$ of expected runtime of Q S D, when fed with a power law input with exponent k, over R alphabet tokens; plotted are individual curves for the values $R \in \{3, 5, 10, 15, 20, 30, 40, 60, 100\}$, from top to bottom. For all R, $f(R, k)$ drops off exponentially with growing k.

Figure 4: Minimized value of the splitting exponent $f_{\mathrm{split}}$ as defined in eq. (13). Plotted are the values for R = 6 (left) and R = 24 (right), as well as $C_0 = 1/4$ (green, upper family of lines) which implies exactly one extra round of amplitude amplification, and $C_0 = 1/100$ (red, lower family of lines) which implies ten extra rounds of amplification. The power law exponents chosen are $k \in \{1.5, 2.0, 2.5, 3.0\}$ (bottom to top, respectively).

Figure 7: Log plot of the power law distribution of the output probabilities obtained from Mozilla's DeepSpeech voice recognition LSTM on the Mozilla Common Voice verified test dataset for English [Moz19b], which consists of 3995 audio samples of about ten seconds each of spoken test sentences. The dashed line is a fitted power law $a r^{-b}$ with parameters $a = 1.2 \pm 0.1$ and $b = 3.03 \pm 0.03$. We individually process each audio file, and capture the output after the final Softmax layer (logits:0), but before it is processed further by the greedy connectionist temporal classification (CTC beam search) implemented by DeepSpeech.

Figure 9: Runtime of quantum beam search decoding the output of Mozilla's DeepSpeech LSTM with a grammar, assuming an average branching ratio of R = 5, a token power law distribution with exponent k = 2.91, and post-amplification of the quantum search decoder with a constant number of retained hypotheses $N_{\mathrm{hyp}} \in \{10^{1}, \ldots, 10^{15}\}$, plotted in rainbow colors from purple to red, bottom to top. As expected, the super-Grover speedup is achieved in the regime where full Q S D happens; once the beam width saturates, the speedup asymptotically approaches a quadratic advantage as compared to classical beam search.