Server-side Rescoring of Spoken Entity-centric Knowledge Queries for Virtual Assistants

On-device Virtual Assistants (VAs) powered by Automatic Speech Recognition (ASR) require effective knowledge integration for the challenging task of entity-rich query recognition. In this paper, we conduct an empirical study of modeling strategies for server-side rescoring of spoken information domain queries using various categories of Language Models (LMs) (N-gram word LMs, sub-word neural LMs). We investigate the combination of on-device and server-side signals, and demonstrate significant WER improvements of 23%-35% on various entity-centric query subpopulations by integrating various server-side LMs compared to performing ASR on-device only. We also perform a comparison between LMs trained on domain data and a GPT-3 variant offered by OpenAI as a baseline. Furthermore, we show that model fusion of multiple server-side LMs trained from scratch most effectively combines complementary strengths of each model and integrates knowledge learned from domain-specific data into a VA ASR system.


INTRODUCTION
Virtual Assistants (VAs) are rapidly gaining popularity [1,2] as they assist users with various tasks [3]. Voice commands issued by VA users are recognized using Automatic Speech Recognition (ASR), a critical component of any VA system. The VA ASR component takes user-spoken audio as input and generates a ranked list of N transcription hypotheses, hereafter the N-best list. A primary challenge for a VA ASR system is that queries are entity-rich and heavily centered around complex information domains which are usually represented in server-side knowledge bases. Here, we refer to any such query that may benefit from a knowledge base as an information domain query. For example, consider a query, "play Red Smoke by The Reytons", that instructs the VA to play a song. If the corresponding song (i.e., Red Smoke in the previous example) is not available within the user's local music library, the VA could execute a search query against an online media catalog with the end goal of streaming the song to the user's device. Hence, integration of domain knowledge becomes crucial to improve recognition accuracy of such spoken queries. However, with on-device ASR, offline knowledge sources are constrained because disk space and compute resources are limited. Therefore, integration of knowledge sources, which are often very large and dynamic in nature [4], can be cumbersome.
An empirical analysis of entity-centric information domain queries from a representative sample of anonymized usage logs of a popular VA shows that 36% of such queries contain a more suitable candidate hypothesis in the N-best list which is not ranked first by the on-device ASR system. One way to overcome the problems associated with recognizing entity-centric queries is to run an on-device domain classifier on top of the on-device ASR result and, for voice commands classified as information domain queries, perform server-side N-best rescoring using domain-specific Language Models (LMs) for knowledge integration. Since rescoring occurs on the server, the domain-specific LMs are not subject to the same constraints as the on-device models, and can therefore be larger.
Previously, various efforts have been made to improve ASR accuracy for entity-rich queries. Huang and Peng [5] conducted an empirical study using Transformer-based LMs to achieve significant Word Error Rate (WER) reductions with second-pass N-best rescoring. Others [6,7,8,9] have shown that masked LM training objectives, like BERT [10], are effective for N-best rescoring to improve ASR accuracy. Wang et al. [11] demonstrated the effectiveness of ASR N-best information in entity retrieval. Van Gysel et al. [12] improved WER on entity queries by implementing probabilistic grammars as a complement to N-gram LMs within a Finite State Transducer framework.
While the works mentioned above have made significant progress with respect to the use of LMs for N-best rescoring, they are often limited to considering only a single LM architecture at a time and do not consider different subpopulations of the query distribution (that is, head, torso, and tail; see §3.2). Recently, it was shown that different LM architectures lead to better performance on different subsets of the information domain query distribution [12], where some architectures work well for head queries and others perform better on the tail. In addition, a lack of analysis of how an LM technique affects different subpopulations of the query distribution may lead to cases where ASR quality degrades on tail queries while the degradation is concealed by overall recognition gains, since most of the improvements come from head queries.
In this paper, we investigate strategies for building and combining multiple LMs for N-best rescoring of entity-centric information domain queries. We combine different rescoring LMs and evaluate the recognition quality on different subpopulations (head, torso, and tail; §3.2) of the information domain query distribution. Our focus lies on applying N-best rescoring in an application where the on-device ASR system is resource-constrained, while the server-side N-best re-scorers can leverage additional resources and information and hence enhance ASR accuracy.
To the best of our knowledge, our contribution is the first comparison of established techniques for this specific application, with an extensive empirical evaluation of domain knowledge suitability, effective model fusion techniques that combine multiple LM architectures with complementary strengths, and testing on different data splits and subpopulations (§3.2).
We focus on three categories of LMs and evaluate how signals extracted from each category can contribute to improving ASR accuracy. The categories are (1) NGram: back-off word N-gram LMs [13], (2) NNLM: sub-word Neural Network Language Models (NNLMs) [14] (§2.2), and (3) LLM: pretrained Large Language Models (LLMs) such as GPT-3 [15]. While the NGram and NNLM categories are trained from scratch on domain-specific data, the LLM category includes out-of-the-box models accessed as a service.
In our specific server-side rescoring setting, the inclusion of and comparison with the LLM category serves as an "out-of-the-box" baseline that shows the hardness of the problem; i.e., it demonstrates the quality one could achieve by outsourcing the problem to an external LLM service without additional training. Therefore, fine-tuning an LLM is out of scope for this work. Instead, we focus on constructing and assessing the ability of various LMs to help improve the recognition of domain-specific queries, at more controllable modeling scales than LLMs.
We focus on unidirectional LMs since they are more generally applicable (e.g., to streaming applications) than bidirectional LMs, and can hence be repurposed in a wider variety of settings and applications.
We systematically pick representative model architectures within each LM category, conduct an in-depth analysis of each category's individual N-best rescoring results, and assess the joint impact of combining multiple LM categories. In the end, we compare single- and cross-category N-best rescoring performance to determine the best rescoring modeling strategy. To this end, our Research Questions (RQs) are: (RQ1) Can a single rescoring LM from each LM category reach substantial accuracy improvements on all subpopulations (head, torso, and tail) of entity-rich information domain queries? (RQ2) Is it beneficial to conduct N-best rescoring using domain-specific LMs trained from scratch, compared with directly using an out-of-the-box external LLM service as a baseline? (RQ3) Furthermore, do mixture combinations of multiple LMs from various categories outperform the single best rescoring LM?

Table 1: Example of a problematic N-best ranking for an utterance with reference text "play Dickie Jones movies", with associated on-device signals (lower is predicted as better). Note that although the best prediction by the ASR system is incorrect, the correct prediction is given at rank 3.
We provide empirical advice on training and selecting N-best rescoring models in a generalizable way. Our findings are: (1) Training a single domain-expert N-best rescoring model from the NGram or NNLM category leads to significant WER reductions (WERRs) on entity-rich queries across all subpopulations compared to the baseline ASR system. (2) Any of our domain-specific NGram and NNLM category rescoring models significantly outperforms the out-of-the-box LLM category baseline, while also being much smaller in size with fewer model parameters. (3) The NNLM category is slightly better than the NGram category, but building NGram category models is still beneficial. (4) Most importantly, effective multi-category rescoring model fusion of the NGram and NNLM categories gains complementary advantages over single rescoring models, and yields additional overall accuracy improvements averaged over all subpopulations.

METHODOLOGY
ASR systems generate ranked N-best lists that consist of multiple candidate hypotheses, sorted by decoding signals, by exploring a subset of the entire search space.

N-best rescoring with ASR and server-side LMs
Table 1 shows an example of a problematic N-best list generated by an on-device ASR system; it suggests room for improvement when the correct hypothesis is included among the top-ranked hypotheses but not ranked first. Rather than generating new hypotheses, we focus on seeking opportunities to optimize the ranking of the hypotheses, also known as N-best reranking. Compared to other techniques, it allows us to integrate large, server-side LMs from various sources such as the Davinci model API described in §3.4.
In this paper, we obtain multiple signals from on-device ASR and server-side LMs and use linear interpolation to combine the features, integrating domain knowledge into the N-best ranking criteria. The on-device ASR decoding signals consist of acoustic and LM scores. In the case of a traditional hybrid ASR system [16, p. 289], the acoustic and LM signals are provided by the acoustic and language models, respectively. For newer end-to-end systems [17, §16.3], the acoustic signal is provided by the end-to-end model and the LM signal by the external LM of the on-device system. These signals enable us to evaluate the costs of the decoded sequences stemming from the acoustic and language models of an ASR system. Lower scores indicate better hypotheses.
The additional features that contribute to the domain knowledge integration ability of the new N-best ranking criteria are calculated from the server-side LMs. The features stem from the negative log-likelihood assigned by the LM to each token in a sequence. To effectively combine the on-device ASR decoding signals (e.g., the acoustic score assigned by the on-device acoustic model) and the server-side LM features, we use a linear model and find the best set of weights by minimizing the WER on a validation set using Powell's method [18], which is effective for non-differentiable cost functions.
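As a concrete illustration, the feature combination and weight search can be sketched as follows. This is a minimal toy with hypothetical feature tuples; the paper minimizes true WER with Powell's method (e.g., via an off-the-shelf optimizer), whereas this self-contained sketch uses an exhaustive grid search and a sentence-error proxy for WER.

```python
import itertools

def combined_score(weights, features):
    # Linear interpolation of per-hypothesis features, e.g.
    # (acoustic_cost, on_device_lm_cost, server_lm_nll); lower is better.
    return sum(w * f for w, f in zip(weights, features))

def rerank(weights, nbest):
    # nbest: list of (hypothesis_text, feature_tuple).
    # Pick the hypothesis with the lowest combined cost.
    return min(nbest, key=lambda h: combined_score(weights, h[1]))[0]

def wer_proxy(weights, validation):
    # Fraction of utterances whose re-ranked top hypothesis differs from the
    # reference: a sentence-error proxy standing in for WER in this toy.
    errors = sum(rerank(weights, nbest) != ref for ref, nbest in validation)
    return errors / len(validation)

def tune_weights(validation, grid=(0.0, 0.5, 1.0, 2.0)):
    # Exhaustive grid search standing in for Powell's method, which the paper
    # uses because WER is non-differentiable in the interpolation weights.
    return min(itertools.product(grid, repeat=3),
               key=lambda w: wer_proxy(w, validation))
```

The key property this illustrates is that the objective (WER) is evaluated only through re-ranking, so any derivative-free search over the weight vector applies.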

LM categories under consideration
As mentioned in §1, we consider multiple LM categories. For the first category (NGram), we consider N-gram LMs: Markov models in which the current token prediction depends on a window of history tokens. The conditional probabilities for the tokens stem from counts over a training corpus. With the development of various smoothing techniques such as Witten-Bell discounting [19], N-gram LMs have become one of the classical LMs for speech recognition tasks.
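To make the counting-based formulation concrete, the sketch below implements a toy back-off bigram LM with Witten-Bell discounting. It is a simplified illustration, not the production LM: the back-off unigram distribution here uses crude add-one smoothing, an assumption of this sketch.

```python
from collections import Counter, defaultdict

class WittenBellBigramLM:
    """Toy back-off bigram LM with Witten-Bell discounting."""

    def __init__(self, sentences):
        self.unigrams = Counter()          # history counts c(h)
        self.bigrams = Counter()           # pair counts c(h, w)
        self.followers = defaultdict(set)  # distinct continuation types of h
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            for h, w in zip(tokens, tokens[1:]):
                self.unigrams[h] += 1
                self.bigrams[(h, w)] += 1
                self.followers[h].add(w)
        self.total = sum(self.unigrams.values())
        self.vocab = set(self.unigrams) | {w for _, w in self.bigrams}

    def _unigram(self, w):
        # Add-one smoothed unigram back-off distribution (a simplification).
        return (self.unigrams[w] + 1) / (self.total + len(self.vocab) + 1)

    def prob(self, w, h):
        c_h, t_h = self.unigrams[h], len(self.followers[h])
        if c_h == 0:
            return self._unigram(w)  # unseen history: fully back off
        c_hw = self.bigrams[(h, w)]
        if c_hw > 0:
            # Witten-Bell discounting: reserve probability mass proportional
            # to the number of distinct continuation types T(h).
            return c_hw / (c_h + t_h)
        # Back-off weight lambda(h) = T(h) / (c(h) + T(h)).
        return (t_h / (c_h + t_h)) * self._unigram(w)
```

Seen bigrams thus receive count-based probabilities discounted by T(h), while the reserved mass is redistributed to unseen continuations via the unigram distribution.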
For the second category (NNLM), we consider three architectures: FOFE, LSTM, and Transformer. The FOFE NNLM is a feed-forward model in which variable-length input sequences are encoded as fixed-size vectors with minimal information loss. Such an architecture has been reported as an accuracy-competitive and performance-efficient language modeling approach [23].
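The FOFE idea itself fits in a few lines; the following sketch (with a made-up token-ID interface) shows how a variable-length prefix is folded into a single fixed-size vector:

```python
def fofe_encode(token_ids, vocab_size, alpha=0.85):
    # Fixed-size Ordinally-Forgetting Encoding: z_t = alpha * z_{t-1} + e_{w_t},
    # where e_{w_t} is the one-hot vector of the t-th token. Any prefix is thus
    # represented by one vocab_size-dimensional vector; the forgetting factor
    # alpha (0.85 in this paper's setup) down-weights older tokens.
    z = [0.0] * vocab_size
    for t in token_ids:
        z = [alpha * v for v in z]  # decay the history
        z[t] += 1.0                 # add the current token's one-hot vector
    return z
```

Because the encoding is fixed-size regardless of sequence length, it can feed a plain feed-forward network, which is what makes FOFE NNLMs inexpensive at inference time.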
The LSTM is a type of recurrent neural network designed to have better gradient flow. These models have previously been competitive with N-gram and Transformer-based models when used as reranking models for ASR systems [24].
Transformer-based architectures achieved state-of-the-art results in many language modeling tasks [10,24,15,25]. The Transformer-based NNLMs used in this work are built as follows. Relative positional encoding [26] is added to the input embedding vector. Then, several self-attention encoding blocks are stacked. We use layer normalization, followed by multi-head attention and residual connections. A final linear projection with softmax activation is used to determine the sub-word unit scores.
For the third category (LLM), we consider large language models: Transformer decoder-only autoregressive LMs, typically with a large number of parameters. Recent developments in the GPT-3.5 series [27] have shown that such models provide high-quality feedback for multiple tasks [15], thanks to their parameter sizes and the vast amount of training data from a large variety of sources. We are interested in using the GPT-3.5 series for N-best rescoring knowledge integration and expect that domain knowledge is intrinsic to the GPT-3.5 series.

Entity-heavy query data
In this paper, we focus on the recognition of entity-centric media player queries. As mentioned in §1, we operate under the setting where ASR runs on-device using a resource-constrained model, and the obtained transcription is then classified as belonging to the information domain, thus requiring access to a knowledge base. We use the context-free grammar of media player queries published by Van Gysel et al. [12] to generate media player queries. In addition, we generate speech with a Neural Text-To-Speech (TTS) system [28] on the validation and test splits (§3.2) of the generated queries to measure the quality of our N-best rescoring, which is common practice in much past research [29,30,31]. We generate the synthetic validation and test sets mainly because most existing speech recognition test sets do not have good entity coverage. Some directly accessible usage-based VA test sets show only 0.83% entity coverage, which is insufficient. Specifically, for the scope of this research, representative entity-rich VA utterances are necessary for investigating the effectiveness of our approach, while most of the available ASR test sets mainly consist of entity-unrelated or general-purpose samples, making it ambiguous whether our approach can provide sufficient evidence for the research questions. Consequently, the synthesis process becomes necessary for effective evaluation in this research.
The query grammar (see [12] for more information) consists of two components: (1) query templates that contain entity slots and are representative of the VA media player query distribution, each associated with a prior probability, P(template), and (2) a weighted list of entities that can be inserted into the templates, with each entity associated with a prior, P(entity), that correlates with entity popularity.

Table 2: Statistics of our validation/test set utterances (§3.2) sliced by their subpopulations (head/torso/tail). We report the mean (µ) and std. dev. (σ) of the N-best list lengths. In addition, we also report the best and worst possible WERs obtained by selecting the best and worst hypothesis for each utterance, respectively.

To control the complexity of the experiments, we apply cutoffs on the ranked template and entity lists. We keep the top-100 templates according to their prior, and we limit the number of entities to the top-200k. We subsequently sample queries from the joint template/entity probability, P(template) · P(entity), until each unique query is sampled at least once.
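A minimal sketch of this sampling procedure follows, using hypothetical toy templates and entities; the real pipeline uses the grammar of [12] with the top-100 templates and top-200k entities.

```python
import random

def generate_queries(templates, entities, seed=0):
    # templates: list of (template_with_{entity}_slot, prior);
    # entities:  list of (entity_string, prior).
    # Sample from the joint P(template) * P(entity) until every unique
    # template/entity combination has been drawn at least once.
    random.seed(seed)
    t_names, t_prior = zip(*templates)
    e_names, e_prior = zip(*entities)
    target = len(templates) * len(entities)
    seen, samples = set(), []
    while len(seen) < target:
        t = random.choices(t_names, weights=t_prior)[0]
        e = random.choices(e_names, weights=e_prior)[0]
        seen.add((t, e))
        samples.append(t.format(entity=e))
    return samples
```

Note that the coupon-collector-style stopping rule preserves the natural frequency skew in the sample: popular template/entity pairs appear many times before the rarest pair is drawn once, which is what later makes the head/torso/tail partitioning (§3.2) meaningful.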

Entity query splits and subpopulations
We randomly split the generated queries into training, validation, and test sets with sampling ratios of 90%, 5%, and 5%, respectively. The three sets are disjoint, even though they are sampled from the same carrier phrase and entity distributions; sampling from the same distribution is an approach widely used in the community, including in open evaluations held by NIST. For the scope of this work, we do not focus on zero-shot evaluations of pre-trained LLMs. Instead, our setting aims to improve accuracy on specific domains that have curated entities from knowledge bases, trending topics, etc., and aims for extensive coverage of them. This is reflected in the fact that we use 200k entities.
The training set is used to train the LM architectures of the NGram and NNLM categories (§2.2) as follows. For the N-gram models, the entire training set is used during estimation. Meanwhile, since the NNLMs can be expensive to estimate, we take a 500M-query sample (denoted T) as their training data.
The validation and test splits are partitioned based on the frequency with which a query occurs in the respective split, with the top 10% being "head", 10% to 50% being "torso", and the bottom 50% being "tail". We subsequently sample 1k queries from each partition, generate audio using TTS (§3.1), and use the generated audio to obtain N-best lists with our on-device ASR system (§3.3). Table 2 presents detailed statistics of all subpopulations in the validation and test sets after ASR decoding.
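The frequency-based partitioning can be sketched as follows (a toy illustration of the 10%/40%/50% split over unique queries ranked by frequency):

```python
from collections import Counter

def split_subpopulations(queries):
    # Rank unique queries by descending frequency, then cut at the 10% and
    # 50% marks: top 10% -> head, next 40% -> torso, bottom 50% -> tail.
    ranked = [q for q, _ in Counter(queries).most_common()]
    n = len(ranked)
    head = ranked[: int(n * 0.1)]
    torso = ranked[int(n * 0.1): int(n * 0.5)]
    tail = ranked[int(n * 0.5):]
    return head, torso, tail
```

Reporting WER per partition, rather than on the pooled set, is what exposes tail-query regressions that an aggregate number would hide (§1).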
The validation set serves two purposes: (1) to find the optimal linear interpolation weights of the features ( §2.1), and (2) to select sub-word tokenizer and LM hyper-parameters ( §3.4).
The test sets are only used to report WERs obtained by server-side rescoring.

On-device ASR system
Our on-device ASR system uses a convolutional neural network acoustic model similar to [32], a 4-gram word LM in the first pass, and a FOFE word NNLM in the second pass. We generate the N-best lists and decoding signals with the ASR decoder for the validation and test sets.

Sub-word NNLMs (NNLM).
Before training the NNLMs, we first build a SentencePiece (SP) model [34] and encode the query texts in T with it, which is expected to help handle rare words [5] in entity recognition tasks. We determine the SP vocabulary size through a pilot study: sweeping the vocabulary size over {15k, 36k, 48k} on the validation set (§3.2), we select an optimal size of 15k. The various NNLM architectures (§2.2) are trained on SP-encoded T with the hyper-parameter configurations outlined in Table 3. We select these configurations because bigger NNLMs, with more parameters than the ones shown in Table 3, perform worse on the validation sets because of overfitting.
The models of the NNLM category are trained with the Adam optimizer [35] on 16 GPUs for 80 epochs, each with 28k training steps and 16 sequences per minibatch. A warm-up stage linearly increases the learning rate from 10^-6 to 10^-3 over 1.2k steps; subsequently, the learning rate decreases exponentially by a factor of 0.94. Dropout with a fixed rate of 0.1 is applied. Internal layers use ReLU activations. The FOFE models have a FOFE factor of 0.85 and a FOFE order of 8.
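The resulting learning-rate schedule can be sketched as follows. The paper does not state the interval at which the 0.94 decay is applied, so the per-epoch decay here is an assumption of this sketch.

```python
def learning_rate(step, warmup_steps=1200, lr_min=1e-6, lr_peak=1e-3,
                  decay=0.94, steps_per_epoch=28_000):
    # Linear warm-up from lr_min to lr_peak over the first warmup_steps,
    # then exponential decay by `decay` per epoch (assumed interval).
    if step < warmup_steps:
        return lr_min + (lr_peak - lr_min) * step / warmup_steps
    epochs_done = (step - warmup_steps) / steps_per_epoch
    return lr_peak * decay ** epochs_done
```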
At inference time, we tokenize an N-best candidate using the SP model and subsequently compute the joint log-likelihood using the trained NNLMs to obtain the NNLM feature.

Davinci (LLM). The latest Davinci [36] models build upon InstructGPT [37]; text-davinci-003 is a reinforcement learning from human feedback (RLHF) [38] model of about 175 billion parameters that improves upon the previous model series. We directly use the OpenAI API with Davinci models for the LLM feature by requesting token log-probabilities in the API response and then calculating the joint log-likelihood of the corresponding N-best candidate.
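In both cases, the server-side feature reduces to the same computation — a hypothesis's negative joint log-likelihood from its per-token log-probabilities — which can be sketched as follows (hypothesis strings and scores are illustrative):

```python
def sequence_nll(token_logprobs):
    # Negative joint log-likelihood of a hypothesis: the negated sum of its
    # per-token log-probabilities (from a trained NNLM, or from the token
    # log-probabilities returned by an LLM API). Lower is better.
    return -sum(token_logprobs)

def rank_by_lm(nbest_logprobs):
    # nbest_logprobs: {hypothesis: [per-token log-probs]}.
    # Return hypotheses sorted best-first by the LM feature alone; in the
    # full system this feature is interpolated with the on-device signals.
    return sorted(nbest_logprobs, key=lambda h: sequence_nll(nbest_logprobs[h]))
```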
Model combination. We combine the server-side features from the trained NGram, NNLM, and LLM category LMs with the on-device ASR signals by linear interpolation, with coefficients estimated by Powell's method as described in §2.1, for N-best rescoring and domain knowledge integration. We then evaluate the validation and test rescoring WERs on the different subpopulations (head, torso, and tail).

RESULTS AND DISCUSSIONS
Table 4 shows rescoring WERs for the various LM categories, where R1 corresponds to the on-device ASR system only, R2 to the addition of the out-of-the-box "Davinci" model (LLM category), and R3 to the use of the N-gram model (NGram category) for rescoring. R4, R5, and R6 correspond to the inclusion of LSTM, FOFE, and Transformer server-side LMs (NNLM category), respectively. For each LM in R2-R6, we report the number of parameters of the corresponding model after picking the best model hyper-parameters on the validation set (§3.4).

For (RQ1), we use on-device ASR (R1) as the baseline. As shown in R3 to R6, compared to baseline R1, all average WERs (column Avg.) are significantly better, by 27%-30% relative. N-best rescoring with an NGram category model (R3) demonstrates a substantial 28% average WER reduction, but we can achieve more significant improvements when rescoring with NNLM category models, such as the Transformer (R6), which leads to accuracy improvements of over 23% on head, 34% on torso, 30% on tail, and 30% on average. Therefore, our answer to (RQ1) is in the affirmative: integrating a single server-side domain-expert LM is indeed effective for optimizing N-best rescoring and improving entity recognition accuracy compared to on-device ASR only.

Moreover, for (RQ2), we compare against the out-of-the-box baseline of GPT-3.5 "Davinci" (R2) rescoring results. By comparing R3-R6 to R2, we observe that the N-gram model (NGram category) and the sub-word server-side LMs (NNLM category), trained from scratch on domain-specific data, outperform the out-of-the-box GPT-3.5 "Davinci" baseline (LLM category). For example, comparing R6 with R2, rescoring with the Transformer achieves substantial relative accuracy enhancements of 18% on head, 32% on torso, and 24% on tail over the GPT-3.5 "Davinci" rescoring WERs. Therefore, our answer to (RQ2) is also positive: it is beneficial to train models from the NGram and NNLM categories on domain-specific data with modest numbers of model parameters, rather than outsourcing directly to an out-of-the-box LLM category model for effective N-best rescoring.
In terms of (RQ3), we compare the WERs after rescoring the N-best lists with the interpolation (§2.1) of multiple server-side LMs from different categories in Table 4 (R7-R12). Inspired by the results obtained as part of answering (RQ1), we investigate whether the combination of multiple LM categories can further improve recognition quality. Furthermore, because the NGram and NNLM categories outperform the LLM category (R3-R6 over R2), as shown in the part answering (RQ2), we assess multiple NGram and NNLM category combinations (R7, R9, R11). For completeness, we also conduct all {NGram, NNLM, LLM} category combinations, with results shown in R8, R10, and R12. In general, observing R7 to R12 in Table 4, we draw the following conclusions: (1) Integrating an NGram category model with a sub-word NNLM category model (Transformer) leads to the best average WER improvement (R11) among all of our experiments.
(3) Introducing additional LLM category models does not bring extra accuracy benefit. Therefore, the dominating factors for the significant N-best rescoring improvements are the NGram and NNLM category rescoring LMs.
(4) One trade-off is that the best average WER from R11 does not always guarantee the best WER on each individual subpopulation among our experiments. However, we consider this trade-off acceptable.
Therefore, multi-model combinations from different model categories for N-best rescoring gain complementary advantages on the head, torso, and tail subpopulations over single rescoring models. By effectively joining the strengths of each single rescoring LM on each subpopulation via the model fusion described in §2.1, we collectively reach optimal WERs for each subpopulation and consequently the best average WERR across the head/torso/tail sets among our experiments. In conclusion, our answer to (RQ3) is also positive: we suggest combining multiple rescoring LMs from different categories to further improve N-best rescoring accuracy, since combinations of various LM categories outperform individual models.

CONCLUSIONS
We showed that training domain-specific server-side LMs for N-best rescoring leads to significant accuracy improvements on all of the head, torso, and tail subpopulations. We focused on three LM categories and investigated modeling strategies for N-best rescoring. Training sub-word NNLMs (NNLM category) on domain-centric data with a SentencePiece tokenizer was the most effective single rescoring modeling choice. Using a single sub-word NNLM, we improved accuracy by 30% and 34% for difficult tail and torso entities, respectively, and by 23% for head entities. The best single sub-word NNLM trained from scratch also significantly outperformed the out-of-the-box pretrained large LM baseline (LLM category), by 25% averaged over all subpopulations. Furthermore, integrating multiple server-side LMs of different categories for rescoring led to additional accuracy improvements over single rescoring LMs, and consequently the best WERRs among all of our experiments. This is thanks to effective model fusion, with interpolation coefficients estimated by Powell's method on the validation sets, which combined the complementary strengths of multiple single rescoring LMs (§2.1) and substantially improved ASR accuracy on all entity-heavy head/torso/tail subpopulations in our experiments. By combining sub-word NNLMs (NNLM category) and the N-gram model (NGram category), we reached a substantial average WER reduction of 30% relative across all subpopulations.
In conclusion, training multiple N-best rescoring models of various categories on domain-specific data, integrating information domain knowledge via server-side rescoring LMs with individual strengths, and boosting N-best accuracy with complementary signals through model fusion proved effective at enhancing entity speech recognition accuracy for on-device VA systems.
As future work, we plan to include more LM categories such as autoencoding LMs, conduct fine-tuning of pretrained models, and perform knowledge distillation with teacher-student learning. We also plan to expand the variety of domain data.

Table 4: Test set WERs for the best single and multiple model combinations, with corresponding relative improvements in parentheses. Overall best test set WERs are shown in bold and in-group best WERs are underlined. The best model architectures and combinations are selected based on the best validation set accuracy. We only specify the best model architecture once per model category (R3-R6), but use the same configuration consistently throughout our experiments.