1 Introduction

Virtual assistants (VAs) are rapidly gaining popularity (Juniper Research, 2019; Statista, 2022) as they assist users with various tasks (Maarek, 2019). Voice commands issued by VA users are recognized using automatic speech recognition (ASR), a critical component of any VA system. The VA ASR component takes user-spoken audio as input and generates a ranked list of N transcription hypotheses, hereafter the N-best list. A primary challenge for a VA ASR system is that queries are entity-rich and heavily centered around complex information domains that are usually represented in server-side knowledge bases. Here, we refer to any query that may benefit from a knowledge base as an information domain query. For example, consider the query “play Red Smoke by The Reytons”, which instructs the VA to play a song. If the corresponding song (Red Smoke in this example) is not available in the user’s local music library, the VA could execute a search query against an online media catalog with the end goal of streaming the song to the user’s device. Hence, integrating domain knowledge is crucial to improving recognition accuracy on such spoken queries. However, with on-device ASR, offline knowledge sources are constrained because disk space and compute resources are limited. Integration of knowledge sources, which are often very large and dynamic in nature (Van Gysel et al., 2020), can therefore be cumbersome.

An empirical analysis of entity-centric information domain queries from a representative sample of anonymized usage logs of a popular VA shows that 36% of such queries contain a more suitable candidate hypothesis in the N-best list that is not ranked first by the on-device ASR system. One way to overcome the problems associated with recognizing entity-centric queries is to run an on-device domain classifier on top of the on-device ASR result and, for voice commands classified as information domain queries, perform server-side N-best rescoring using domain-specific language models (LMs) for knowledge integration. Since rescoring occurs on the server, the domain-specific LMs are not subject to the same constraints as the on-device models and can therefore be larger.

Previously, various efforts have been made to improve ASR accuracy for entity-rich queries. Huang and Peng (2019) conducted an empirical study using Transformer-based LMs to achieve significant word error rate (WER) reductions with second-pass N-best rescoring. Others (Shin et al., 2019; Pelloin et al., 2022; Salazar et al., 2020; Xu et al., 2022) have shown that masked LM training objectives, as in BERT (Devlin et al., 2019), are effective for \(N{}\)-best rescoring and improve ASR accuracy. Wang et al. (2021) demonstrated the effectiveness of ASR \(N{}\)-best information in entity retrieval. Van Gysel et al. (2022) improved WER on entity queries by implementing probabilistic grammars as a complement to \(N\)-gram LMs within a finite state transducer framework.

While the works mentioned above have made significant progress with respect to the use of LMs for \(N{}\)-best rescoring, they are often limited to considering only a single LM architecture at a time and do not consider different subpopulations of the query distribution (that is, head, torso, and tail; see §3.2). Recently, it was shown that different LM architectures lead to better performance on different subsets of the information domain query distribution (Van Gysel et al., 2022): some architectures work well for head queries, while others perform better on the tail. In addition, a lack of analysis of how an LM technique affects different subpopulations of the query distribution may conceal cases where ASR quality degrades on tail queries behind overall recognition improvements, since most of the improvements come from head queries.

In this paper, we investigate strategies for building and combining multiple LMs for \(N{}\)-best rescoring of entity-centric information domain queries. We combine different rescoring LMs and evaluate recognition quality on the different subpopulations (head, torso, and tail; §3.2) of the information domain query distribution. Our focus lies on applying \(N{}\)-best rescoring in an application where the on-device ASR system is resource-constrained, while the server-side \(N{}\)-best rescorers can leverage additional resources and information and hence enhance ASR accuracy.

To the best of our knowledge, this is the first work for this application that compares established techniques, provides an extensive empirical evaluation of domain knowledge suitability, applies model fusion techniques that combine multiple LM architectures with complementary strengths, and evaluates on different data splits and subpopulations (§3.2).

We focus on three categories of LMs and evaluate how signals extracted from each category can contribute to improving ASR accuracy. The categories are:

  1. (1)

    \({NGram}\): back-off word \(N\)-gram LMs (Katz, 1987),

  2. (2)

    \({NNLM}\): sub-word neural network language models (NNLMs) (Bengio et al., 2000) (§2.2), and

  3. (3)

    \({LLM}\): pretrained large language models (LLMs) such as GPT-3 (Brown et al., 2020).

While the \({NGram}\) and \({NNLM}\) categories are trained from scratch on domain-specific data, the \({LLM}\) category includes out-of-the-box models accessed as a service.

In our specific server-side rescoring setting, the inclusion of and comparison with the \({LLM}\) category serves as an “out-of-the-box” baseline that illustrates the difficulty of the problem; i.e., it demonstrates the quality one could achieve by outsourcing the problem to an external \({LLM}\) service without additional training. Fine-tuning an \({LLM}\) is therefore out of the scope of this work. Instead, we focus on constructing and assessing various LMs, at more controllable modeling scales than \({LLM}\)s, that help improve recognition of domain-specific queries.

We focus on unidirectional LMs since they are more generally applicable than bidirectional LMs (e.g., in streaming applications) and can therefore be repurposed across a wider variety of settings and applications.

We systematically pick representative model architectures within each LM category and conduct an in-depth analysis of each category’s individual \(N{}\)-best rescoring results, as well as the joint impact of combining multiple LM categories. Finally, we compare single- and cross-category \(N{}\)-best rescoring performance to identify the best rescoring strategy for improving recognition accuracy.

To this end, our research questions (RQs) are:

  1. (RQ1)

    Can a single rescoring LM from each LM category achieve substantial accuracy improvements on all subpopulations (head, torso and tail) of entity-rich information domain queries?

  2. (RQ2)

    Is it beneficial to conduct \(N{}\)-best rescoring using domain-specific LMs trained from scratch, compared with directly using an out-of-the-box external LLM service as a baseline?

  3. (RQ3)

    Furthermore, do combinations of multiple LMs from different categories outperform the single best rescoring LM?

We provide empirical advice on training and selecting \(N{}\)-best rescoring models in a generalizable way. Our findings are:

  1. (1)

    We show that training a single domain-expert \(N{}\)-best rescoring model from the \({NGram}\) or \({NNLM}\) category leads to significant WER reductions (WERRs) on entity-rich queries across all subpopulations compared to the baseline ASR system.

  2. (2)

    We find that every one of our domain-specific \({NGram}\) and \({NNLM}\) category rescoring models significantly outperforms the out-of-the-box \({LLM}\) category baseline, while also being much smaller in size and having far fewer model parameters.

  3. (3)

    We discover that the \({NNLM}\) category is slightly better than the \({NGram}\) category, but building \({NGram}\) category models is still beneficial.

  4. (4)

    Most importantly, we further discover that effective multi-category fusion of \({NGram}\) and \({NNLM}\) rescoring models gains complementary advantages over single rescoring models and yields additional overall accuracy improvements averaged over all subpopulations.

2 Methodology

ASR systems generate ranked \(N{}\)-best lists consisting of multiple candidate hypotheses, obtained by exploring a subset of the entire search space and sorted by decoding signals.

2.1 N-best rescoring with ASR and server-side LMs

Table 1 Example of a problematic N-best ranking for an utterance with reference text “play Dickie Jones movies”, with associated on-device signals (lower is predicted as better)

Table 1 shows an example of a problematic \(N{}\)-best list generated by an on-device ASR system; it suggests room for improvement, since the correct hypothesis is included in the list but not ranked first. Rather than generating new hypotheses, we focus on optimizing the ranking of the existing hypotheses, also known as \(N{}\)-best reranking. Compared to other techniques, this allows us to integrate large, server-side LMs from various resources, such as the Davinci model API described in §3.4.

In this paper, we obtain multiple signals from the on-device ASR system and server-side LMs, and use linear interpolation to combine these features and integrate domain knowledge into the \(N{}\)-best ranking criterion. The on-device ASR decoding signals consist of acoustic and LM scores. In a traditional hybrid ASR system (Jurafsky & Martin, 2008, p. 289), the acoustic and LM signals are provided by the acoustic and language models, respectively. For newer end-to-end systems (Jurafsky & Martin, 2023, §16.3), the acoustic signal is provided by the end-to-end model and the LM signal by the external LM of the on-device system. These signals let us evaluate the costs assigned to the decoded sequences by the acoustic and language models of the ASR system; lower scores indicate a better hypothesis.

The additional features, which provide the domain knowledge integration ability of the new \(N{}\)-best ranking criterion, are computed from the server-side LMs. The features stem from the negative log-likelihood assigned by the LM to each token in a sequence. To effectively combine the on-device ASR decoding signals (e.g., the acoustic score assigned by the on-device acoustic model) and the server-side LM features, we use a linear model and find the best set of weights by minimizing the WER on a validation set using Powell’s method (Powell, 1964), which is effective for non-differentiable cost functions.
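To make the reranking criterion concrete, the sketch below (not the authors’ released code) shows how such interpolation weights could be fitted: each hypothesis receives a weighted sum of its cost features, each \(N{}\)-best list is reranked by that sum, and Powell’s method searches for the weights that minimize validation WER. The feature layout, the jiwer WER utility, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
import jiwer  # any WER implementation would do; jiwer is used here for illustration

def rerank_wer(weights, nbest_features, nbest_texts, references):
    """Rerank every N-best list by the weighted sum of its cost features
    (lower is better) and return the WER of the top-ranked hypotheses."""
    selected = []
    for feats, texts in zip(nbest_features, nbest_texts):
        # feats: (num_hypotheses, num_features), e.g. [acoustic cost,
        # on-device LM cost, server-side LM negative log-likelihood, ...]
        scores = np.asarray(feats) @ np.asarray(weights)
        selected.append(texts[int(np.argmin(scores))])
    return jiwer.wer(references, selected)

def fit_interpolation_weights(nbest_features, nbest_texts, references, num_features):
    """Powell's method handles the non-differentiable WER objective directly."""
    result = minimize(
        rerank_wer,
        x0=np.ones(num_features),
        args=(nbest_features, nbest_texts, references),
        method="Powell",
    )
    return result.x
```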

2.2 LM categories under consideration

As mentioned in §1, we consider multiple LM categories. For the first category (\({NGram}\)), we consider \(N\)-gram LMs, which are Markov models in which the prediction of the current token depends on a window of preceding tokens. The conditional probabilities of the tokens stem from counts in a training corpus. With the development of various smoothing techniques, such as Witten-Bell discounting (Witten & Bell, 1991), \(N\)-gram LMs have become one of the classical LM choices for speech recognition tasks.
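For reference, a back-off \(N\)-gram LM of this kind follows the standard back-off recursion (the discounted probability \(P^{*}\) may be obtained with Witten-Bell or another smoothing method):

\[
P(w_i \mid w_{i-N+1}^{\,i-1}) =
\begin{cases}
P^{*}(w_i \mid w_{i-N+1}^{\,i-1}), & \text{if } c(w_{i-N+1}^{\,i}) > 0,\\
\alpha(w_{i-N+1}^{\,i-1}) \, P(w_i \mid w_{i-N+2}^{\,i-1}), & \text{otherwise},
\end{cases}
\]

where \(c(\cdot)\) denotes the training-corpus count and \(\alpha(\cdot)\) is the back-off weight that keeps the distribution normalized.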

For the second category (\({NNLM}\)), we consider fixed-size ordinally-forgetting encoding (FOFE) (Zhang et al., 2015), long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Transformer (Vaswani et al., 2017) architectures. We use sub-word level NNLMs of varying sizes, shown in Table 3.

The FOFE NNLM is a feed-forward model in which variable-length input sequences are encoded as fixed-size vectors with minimal information loss. This architecture has been reported to be an accuracy-competitive and computationally efficient language modeling approach (Watcharawittayakul et al., 2018).
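As an illustration of the encoding itself, the following minimal sketch (a simplification, not the model used in our experiments) computes the FOFE vector \(z_t = \alpha \, z_{t-1} + e_t\) of a token prefix, where \(e_t\) is the one-hot vector of the \(t\)-th token and \(\alpha\) is the forgetting factor (the FOFE factor of §3.4); in practice the encoding is combined with embeddings and a bounded context order before being fed into feed-forward layers.

```python
import numpy as np

def fofe_encode(token_ids, vocab_size, alpha=0.85):
    """Encode a variable-length token prefix into one fixed-size vector:
    z_0 = 0 and z_t = alpha * z_{t-1} + e_t, with 0 < alpha < 1."""
    z = np.zeros(vocab_size)
    for tok in token_ids:
        z = alpha * z      # older tokens are exponentially down-weighted
        z[tok] += 1.0      # add the one-hot vector of the current token
    return z
```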

LSTM is a type of recurrent neural network designed to have better gradient flow. Such models have previously been competitive with \(N\)-gram and Transformer-based models when used as reranking models for ASR systems (Irie et al., 2019).

Transformer-based architectures have achieved state-of-the-art results in many language modeling tasks (Devlin et al., 2019; Irie et al., 2019; Brown et al., 2020; Raffel et al., 2020). The Transformer-based NNLMs used in this work are built as follows. Relative positional encoding (Shaw et al., 2018) is added to the input embedding vector. Then, several self-attention encoding blocks are stacked; we use layer normalization, followed by multi-head attention and residual connections. A final linear projection with softmax activation produces the sub-word unit scores.
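The sketch below outlines a causal (unidirectional) Transformer LM of this general shape in PyTorch. It is a simplified illustration under assumptions: the relative positional encoding of Shaw et al. (2018) is omitted and the layer sizes are placeholders rather than the configurations of Table 3.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One self-attention block: layer norm, multi-head attention with a causal
    mask, residual connection, then a feed-forward sublayer with its own residual."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x, causal_mask):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        return x + self.ff(self.ln2(x))

class TransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(DecoderBlock(d_model) for _ in range(n_layers))
        self.proj = nn.Linear(d_model, vocab_size)  # softmax is applied downstream

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        # True above the diagonal = future positions are masked out.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x, mask)
        return self.proj(x)                         # per-position sub-word logits
```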

For the third category (\({LLM}\)), we consider large language models, which are decoder-only autoregressive Transformer LMs, typically with a large number of parameters. Recent developments in the GPT-3.5 series (OpenAI, 2023a) have shown that such models provide high-quality feedback on multiple tasks (Brown et al., 2020), thanks to their parameter counts and the vast amount of training data from a wide variety of sources. We are interested in using the GPT-3.5 series for \(N{}\)-best rescoring knowledge integration and expect that domain knowledge is intrinsic to the GPT-3.5 series.

3 Experiments

3.1 Entity-heavy query data

In this paper, we focus on the recognition of entity-centric media player queries. As mentioned in §1, we operate under the setting where ASR runs on-device using a resource-constrained model, and the obtained transcription is then classified as belonging to the information domain, thus requiring access to a knowledge base. We use the context-free grammar of media player queries published by Van Gysel et al. (2022) to generate media player queries. In addition, we generate speech with a Neural Text-To-Speech (TTS) system (Achanta et al., 2021) on the validation and test splits (§3.2) of the generated queries to measure the quality of our \(N{}\)-best rescoring, which is common practice in prior work (Peyser et al., 2020; Huang et al., 2022; Weiran et al., 2022). We generate synthetic validation and test sets mainly because most existing speech recognition test sets do not have good entity coverage; some directly accessible usage-based VA test sets show only 0.83% entity coverage, which is insufficient. For the scope of this research, representative entity-rich VA utterances are necessary to investigate the effectiveness of our approach, whereas most available ASR test sets consist mainly of entity-unrelated or general-purpose samples, which would make it difficult to obtain sufficient evidence for our research questions. Consequently, synthesis is necessary for effective evaluation in this research.

The query grammar (see Van Gysel et al. (2022) for more information) consists of two components:

  1. (1)

    query templates that contain entity slots and are representative of the VA media player query distribution, each associated with a prior probability, \(P\left( \text {template}\right) {}\), and

  2. (2)

    a weighted list of entities that can be inserted into the template, with each entity associated with a prior, \(P\left( \text {entity}\right) {}\), that correlates with entity popularity.

To control the complexity of the experiments, we apply cutoffs to the ranked template and entity lists. We keep the top-100 templates according to their prior, and we limit the number of entities to the top-200k. We subsequently sample queries from the joint template/entity probability, \(P\left( \text {template}\right) {} \cdot P\left( \text {entity}\right) {}\), until each unique query has been sampled at least once.
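A minimal sketch of this sampling procedure is shown below, assuming each template carries a single entity slot; the template format, the stopping-criterion implementation, and all names are illustrative assumptions rather than the exact generation code.

```python
import random

def sample_queries(templates, template_priors, entities, entity_priors):
    """Draw (template, entity) pairs from P(template) * P(entity) until every
    unique query under the cutoffs has been sampled at least once."""
    all_queries = {t.format(entity=e) for t in templates for e in entities}
    sampled, seen = [], set()
    while seen != all_queries:
        template = random.choices(templates, weights=template_priors, k=1)[0]
        entity = random.choices(entities, weights=entity_priors, k=1)[0]
        query = template.format(entity=entity)  # e.g. "play {entity}" -> "play red smoke"
        sampled.append(query)
        seen.add(query)
    return sampled
```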

3.2 Entity query splits and subpopulations

Table 2 Statistics of our validation/test set utterances (§3.2) sliced by their subpopulations (head/torso/tail)

We randomly split the generated queries into training, validation and test sets with sampling ratios of 90%, 5% and 5%, respectively; these sets are further partitioned and sampled according to the various needs described later in this section. The three sets are disjoint, even though they are sampled from the same carrier-phrase and entity distributions; sampling from the same distribution is an approach widely used in the community, including in open evaluations held by NIST. For the scope of this work, we do not focus on zero-shot evaluations of pretrained LLMs. Instead, our setting aims to improve accuracy on specific domains with curated entities from knowledge bases, trending topics, etc., and to cover them extensively, which is reflected in our use of 200k entities.

The training set is used to train the LM architectures of the \({NGram}\) and \({NNLM}\) categories (§2.2) as follows. For the \({NGram}\) category models, the entire training set of about 20 M queries is used during estimation. For the \({NNLM}\) category models, we upsample with replacement from the entire training set according to the joint template/entity probability (§3.1) and generate a 500 M-query sample (denoted \(\textbf{T}\)) as training data.

The validation and test splits are partitioned based on the frequency with which a query occurs in the respective split: the top 10% form the “head”, 10% to 50% the “torso”, and the bottom 50% the “tail”. We subsequently sample 1k queries from each partition, generate audio using TTS (§3.1), and use the generated audio to obtain \(N{}\)-best lists with our on-device ASR system (§3.3). Table 2 presents detailed statistics of all subpopulations in the validation and test sets after ASR decoding.
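One plausible reading of this partitioning is sketched below (an illustrative assumption about the exact cut points, not the released preprocessing code): unique queries are ranked by their frequency in the split and the ranked list is cut at the 10% and 50% marks.

```python
from collections import Counter

def split_head_torso_tail(queries):
    """Rank unique queries by frequency in the split; the top 10% of the ranking
    is 'head', 10%-50% is 'torso', and the bottom 50% is 'tail'."""
    ranked = [query for query, _ in Counter(queries).most_common()]
    n = len(ranked)
    head = ranked[: int(0.1 * n)]
    torso = ranked[int(0.1 * n): int(0.5 * n)]
    tail = ranked[int(0.5 * n):]
    return head, torso, tail
```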

The validation set serves two purposes:

  1. (1)

    to find the optimal linear interpolation weights of the features (§2.1), and

  2. (2)

    to select sub-word tokenizer and LM hyper-parameters (§3.4).

The test sets are only used to report WERs obtained by server-side rescoring.

3.3 On-device ASR system

Our on-device ASR system uses a convolution-based neural network acoustic model similar to Huang et al. (2020), a 4-gram word LM in the first pass, and a FOFE word NNLM in the second pass. We generate the \(N{}\)-best lists and decoding signals for the validation and test sets with this ASR decoder.

3.4 Server-side LMs for rescoring features

Word \(N\)-gram LMs (\({NGram}\)). We train back-off \(N\)-gram models with Witten-Bell smoothing on the training set using SRILM (Stolcke, 2002) for the \({NGram}\) feature. We sweep the max \(N\)-gram order over \(\{2, 3, 4\}\) and the pruning threshold over \(\{4^{-4}, 4^{-5}, \ldots, 4^{-19}\} \cup \{0\}\).

Sub-word NNLMs (\({NNLM}\)). Before training the NNLMs, we first train a SentencePiece (SP) model (Kudo & Richardson, 2018) and encode the query texts in \(\textbf{T}\) with it, which is expected to help with rare words (Huang & Peng, 2019) in entity recognition tasks. We determine the SP vocabulary size through a pilot study in which we sweep the vocabulary size over \(\{\text {15k}, \text {36k}, \text {48k}\}\) on the validation set (§3.2) and select an optimal size of 15k. The NNLM architectures (§2.2) are trained on the SP-encoded \(\textbf{T}\) with the hyper-parameter configurations outlined in Table 3. We select these configurations because larger NNLMs trained on \(\textbf{T}\) show undesirable performance on the validation sets due to overfitting.
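As a rough illustration of this tokenizer selection step, the sketch below trains one SentencePiece model per candidate vocabulary size and then loads the selected 15k model; the file paths and the unigram model type are assumptions made for the example.

```python
import sentencepiece as spm

# Train one SentencePiece model per candidate vocabulary size on the NNLM
# training text T; the best size is picked via validation rescoring WER.
for vocab_size in (15000, 36000, 48000):
    spm.SentencePieceTrainer.train(
        input="nnlm_training_queries.txt",   # hypothetical path to the text of T
        model_prefix=f"sp_{vocab_size}",
        vocab_size=vocab_size,
        model_type="unigram",
    )

# Load the selected model (15k in our setup) and encode a query into sub-words.
sp = spm.SentencePieceProcessor(model_file="sp_15000.model")
pieces = sp.encode("play red smoke by the reytons", out_type=str)
```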

Table 3 Hyper-parameters of server-side NNLMs

The \({NNLM}\) category models are trained with the Adam optimizer (Kingma & Ba, 2015) on 16 GPUs for 80 epochs, each with 28k training steps and 16 sequences per minibatch. A warmup stage linearly increases the learning rate from \(10^{-6}\) to \(10^{-3}\) over 1.2k steps; subsequently the learning rate decays exponentially with a factor of 0.94. Dropout with a fixed rate of 0.1 is applied, and internal layers use ReLU activations. The FOFE models use a FOFE factor of 0.85 and a FOFE order of 8.
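The learning-rate schedule can be sketched as follows; the assumption that the 0.94 exponential decay is applied once per epoch after warmup is ours, since the text does not pin down the decay interval.

```python
def learning_rate(step, epoch, warmup_steps=1200,
                  lr_min=1e-6, lr_max=1e-3, decay=0.94):
    """Linear warmup from 1e-6 to 1e-3 over 1.2k steps, then exponential decay
    by a factor of 0.94 (assumed here to be applied once per epoch)."""
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    return lr_max * (decay ** epoch)
```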

At inference time, we tokenize each \(N{}\)-best candidate with the SP model and compute its joint log-likelihood under the trained NNLMs to obtain the \({NNLM}\) feature.
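A minimal sketch of this feature computation is given below, assuming the TransformerLM interface sketched in §2.2 and explicit begin/end-of-sentence handling; the sign convention (negative joint log-likelihood, lower is better) follows §2.1.

```python
import torch
import torch.nn.functional as F

def nnlm_feature(hypothesis, sp, nnlm, bos_id, eos_id):
    """Tokenize an N-best candidate with the SentencePiece model and return the
    negative joint log-likelihood under the NNLM (lower is better)."""
    ids = [bos_id] + sp.encode(hypothesis) + [eos_id]
    tokens = torch.tensor([ids])
    with torch.no_grad():
        logits = nnlm(tokens[:, :-1])              # predict the next token at each position
        log_probs = F.log_softmax(logits, dim=-1)
    targets = tokens[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -token_ll.sum().item()
```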

Davinci (\({LLM}\)). The latest Davinci models (OpenAI, 2023b) build upon InstructGPT (Ouyang et al., 2022); text-davinci-003 is a reinforcement learning from human feedback (RLHF) (Christiano et al., 2017) model with about 175 billion parameters that improves upon the previous model series. We use the OpenAI API with Davinci models for the \({LLM}\) feature: we request token log-probabilities in the API response and compute the joint log-likelihood of the corresponding \(N{}\)-best candidate.
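The sketch below shows one way to obtain such a feature through the legacy Completions endpoint, where echoing the prompt with zero generated tokens returns the prompt's own token log-probabilities; field names and behavior may differ across API and library versions, so treat this as an assumption-laden illustration rather than the exact calls used.

```python
import openai

def davinci_feature(hypothesis):
    """Sum the token log-probabilities of the hypothesis under text-davinci-003
    and return the negative joint log-likelihood (lower is better)."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=hypothesis,
        max_tokens=0,   # score the prompt only, generate nothing
        echo=True,      # return log-probabilities for the prompt tokens
        logprobs=0,
    )
    token_logprobs = response["choices"][0]["logprobs"]["token_logprobs"]
    # The first token has no conditioning context and is returned as None.
    return -sum(lp for lp in token_logprobs if lp is not None)
```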

Model combination. We combine the server-side features from the trained \({NGram}\), \({NNLM}\) and \({LLM}\) category LMs with the on-device ASR signals by linear interpolation, with coefficients estimated by Powell’s method as described in §2.1, for \(N{}\)-best rescoring and domain knowledge integration. We then evaluate the rescoring WERs on the validation and test sets for the different subpopulations (head, torso and tail).

4 Results and discussions

Table 4 Test set WERs for the best single and multiple model combinations, with corresponding relative improvements in parentheses

Table 4 shows rescoring WERs for the various LM categories, where R1 corresponds to the on-device ASR system only, R2 to the addition of the out-of-the-box “Davinci” model (\({LLM}\) category), and R3 to rescoring with the \(N\)-gram model (\({NGram}\) category). R4, R5 and R6 correspond to the inclusion of the LSTM, FOFE and Transformer server-side LMs (\({NNLM}\) category), respectively. For each LM in R2-R6, we report the number of parameters of the corresponding model after picking the best model hyper-parameters on the validation set (§3.4).

For (RQ1), we use the on-device ASR system (R1) as the baseline. As shown in R3 to R6, compared to baseline R1, all average WERs (column \({\textbf {Avg.}}\)) are significantly better, by 27%-30% relative. \(N{}\)-best rescoring with an \({NGram}\) category model (R3) yields a substantial 28% average WER reduction, and rescoring with \({NNLM}\) category models achieves even larger improvements; the Transformer (R6), for example, leads to accuracy improvements of over 23% on head, 34% on torso, 30% on tail, and 30% on average. Therefore, our answer to (RQ1) is in the affirmative: integrating a single server-side domain-expert LM is indeed effective for \(N{}\)-best rescoring and improves entity recognition accuracy compared to on-device ASR alone.

Moreover, for (RQ2), we compare against the out-of-the-box GPT-3.5 “Davinci” (R2) rescoring results. Comparing R3-R6 to R2, we observe that the \(N\)-gram model (\({NGram}\) category) and the sub-word server-side LMs (\({NNLM}\) category) trained from scratch on domain-specific data outperform the out-of-the-box GPT-3.5 “Davinci” model (\({LLM}\) category). For example, comparing R6 with R2, rescoring with the Transformer achieves substantial relative accuracy improvements of 18% on head, 32% on torso, and 24% on tail over the GPT-3.5 “Davinci” rescoring WERs. Therefore, our answer to (RQ2) is also positive: it is beneficial to train \({NGram}\) and \({NNLM}\) category models on domain-specific data with moderate numbers of parameters, rather than outsourcing \(N{}\)-best rescoring directly to an out-of-the-box \({LLM}\) category model.

For (RQ3), we compare the WERs after rescoring the \(N{}\)-best lists with interpolations (§2.1) of multiple server-side LMs from different categories, shown in Table 4 (R7-R12). Motivated by the results obtained while answering (RQ1), we investigate whether combining multiple LM categories can further improve recognition quality. Furthermore, because the \({NGram}\) and \({NNLM}\) categories outperform the \({LLM}\) category (R3-R6 over R2), as shown while answering (RQ2), we assess \({NGram}\) and \({NNLM}\) category combinations (R7, R9, R11). For completeness, we also evaluate combinations of all three categories {\({NGram}\), \({NNLM}\), \({LLM}\)}, with results shown in R8, R10 and R12. From R7 to R12 in Table 4, we draw the following conclusions:

  1. (1)

    Integrating an \({NGram}\) category model with a sub-word \({NNLM}\) category model (Transformer) leads to the best average WER improvement (R11) among all of our experiments.

  2. (2)

    More specifically, the combination of \({NGram}\) and \({NNLM}\) category rescoring LMs (R11) shows a statistically significant (Student’s t-test, p-value \(< 0.03\)) tail-entity WER improvement over the single Transformer rescoring model (R6), the best torso-entity accuracy among all of our experiments, and a head-entity WER improvement over the single \(N\)-gram rescoring model (R3).

  3. (3)

    Introducing additional \({LLM}\) category models does not bring extra accuracy benefits. Therefore, the dominating factors behind the significant \(N{}\)-best rescoring improvements are the \({NGram}\) and \({NNLM}\) category rescoring LMs.

  4. (4)

    One trade-off is that the best average WER (R11) does not guarantee the best WER on every individual subpopulation among our experiments. However, we consider this trade-off acceptable.

Therefore, multi-model combinations from different categories for \(N{}\)-best rescoring gain complementary advantages over single rescoring models on the head, torso and tail subpopulations. By joining the strengths of the individual rescoring LMs on each head/torso/tail subpopulation through the model fusion described in §2.1, we reach strong WERs for each subpopulation and, consequently, the best average WERR over all head/torso/tail sets among our experiments. In conclusion, our answer to (RQ3) is also positive: we suggest combining multiple rescoring LMs from different categories to further improve \(N{}\)-best rescoring accuracy, since combinations of LM categories outperform individual models.

5 Conclusions

We showed that training domain-specific server-side LMs for \(N{}\)-best rescoring led to significant accuracy improvements on all head, torso and tail subpopulations. We focused on three LM categories and investigated modeling strategies for \(N{}\)-best rescoring. Training sub-word NNLMs (\({NNLM}\) category) on domain-centric data with a SentencePiece tokenizer was the most effective single rescoring modeling choice. Using a single sub-word NNLM, we improved accuracy by 30% and 34% on the difficult tail and torso entities, respectively, and by 23% on head entities. The best single sub-word NNLM trained from scratch also significantly outperformed the out-of-the-box pretrained large LM baseline (\({LLM}\) category), by 25% averaged over all subpopulations.

Furthermore, integrating multiple server-side LMs of different categories for rescoring led to additional accuracy improvements over single rescoring LMs and, consequently, to the best WERRs among all of our experiments. The model fusion, with interpolation coefficients estimated by Powell’s method on the validation sets (§2.1), combined the complementary strengths of the individual rescoring LMs and substantially improved ASR accuracy on all entity-heavy head/torso/tail subpopulations. By combining sub-word NNLMs (\({NNLM}\) category) with an \(N\)-gram model (\({NGram}\) category), we reached a substantial average WER reduction of 30% relative across all subpopulations.

In conclusion, training multiple \(N{}\)-best rescoring models of various categories on domain-specific data, integrating information domain knowledge through server-side rescoring LMs with individual strengths, and boosting \(N{}\)-best accuracy with complementary signals through model fusion proved effective at improving entity speech recognition accuracy for on-device VA systems.

As future work, we plan to include more LM categories, such as autoencoding LMs, to fine-tune pretrained models, and to explore knowledge distillation with teacher-student learning. We also plan to expand the variety of domain data.