Unified Modeling of Multi-Domain Multi-Device ASR Systems

Part of the Lecture Notes in Computer Science book series (LNAI, volume 14102)


Modern Automatic Speech Recognition (ASR) technology is typically fine-tuned for a targeted domain or application to obtain the best recognition results. This requires training and maintaining a dedicated ASR model for each domain, which increases the overall cost. Moreover, a fine-tuned model may not be the best way to share knowledge across domains. To address this, we propose a novel unified RNN-T based ASR technology that leverages domain embeddings and an attention-based mixture-of-experts architecture. Further, the proposed unified neural architecture allows data and parameters to be shared seamlessly across domains. Our experiments show that the proposed approach outperforms a carefully fine-tuned domain-specific ASR model, yielding up to 10% relative word error rate (WER) improvement and a 30% reduction in overall training cost.
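
To make the idea of combining a domain embedding with attention over experts more concrete, the following is a minimal sketch in plain Python. It is not the architecture from the paper, whose details are not given in this abstract. The class name `AttentionMoE`, the dimensions, the random weights, and the choice to form the attention query by concatenating the domain embedding with the input frame are all assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class AttentionMoE:
    """Hypothetical layer: experts shared by all domains, mixed per domain."""

    def __init__(self, num_domains, num_experts, feat_dim, emb_dim, seed=0):
        rng = random.Random(seed)
        # One embedding vector per domain (assumed lookup-table design).
        self.domain_emb = random_matrix(num_domains, emb_dim, rng)
        # Each expert is a single linear map here (assumption).
        self.experts = [random_matrix(feat_dim, feat_dim, rng)
                        for _ in range(num_experts)]
        # One attention key per expert, matched to the query size.
        self.expert_keys = random_matrix(num_experts, emb_dim + feat_dim, rng)

    def gate(self, features, domain_id):
        """Attention weights over experts for one frame and one domain."""
        query = self.domain_emb[domain_id] + features  # list concatenation
        scale = math.sqrt(len(query))
        scores = [sum(k * q for k, q in zip(key, query)) / scale
                  for key in self.expert_keys]
        return softmax(scores)

    def forward(self, features, domain_id):
        """Weighted sum of expert outputs using the attention weights."""
        weights = self.gate(features, domain_id)
        outputs = [matvec(expert, features) for expert in self.experts]
        return [sum(w * out[i] for w, out in zip(weights, outputs))
                for i in range(len(features))]


# Usage: one 8-dimensional frame routed for domain 1.
model = AttentionMoE(num_domains=3, num_experts=4, feat_dim=8, emb_dim=2)
frame = [0.5] * 8
weights = model.gate(frame, domain_id=1)
mixed = model.forward(frame, domain_id=1)
```

In this sketch every domain uses the same expert parameters, and only the domain embedding changes how the experts are weighted, which is one plausible way to share data and parameters across domains in a single model.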


S. Mitra and S. N. Ray—Equal Contribution.




Author information

Corresponding author

Correspondence to Soumyajit Mitra.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mitra, S. et al. (2023). Unified Modeling of Multi-Domain Multi-Device ASR Systems. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer Science, Computer Science (R0)
