Sequence-Based Pronunciation Modeling Using a Noisy-Channel Approach

Hofmann, Hansjörg; Sakti, Sakriani; Isotani, Ryosuke; Kawai, Hisashi; Nakamura, Satoshi; Minker, Wolfgang

doi:10.1007/978-3-642-16202-2_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6392)

Included in the following conference series:

International Workshop on Spoken Dialogue Systems Technology

Abstract

Previous approaches to spontaneous speech recognition address the multiple-pronunciation problem by modeling pronunciation alterations at the phoneme-to-phoneme level. However, they do not yet consider the phonetic transformation effects induced by the pronunciation of the whole sentence. In this paper, we attempt to model sequence-based pronunciation variation using a noisy-channel approach, in which the spontaneous phoneme sequence is treated as a “noisy” string and the goal is to recover the “clean” string of the word sequence. In this way, the whole word sequence and its effect on the alteration of the phonemes are taken into consideration. Moreover, the system learns not only the phoneme transformation but also the mapping from phonemes to words directly. In this preliminary study, the phonemes are first recognized with the existing recognition system, and the pronunciation variation model based on the noisy-channel approach then maps from the phoneme level to the word level. Our experiments use Switchboard as the spontaneous speech corpus. The results show that the proposed method consistently improves word accuracy over the conventional recognition system. The best system achieves up to 38.9% relative improvement over the baseline speech recognition system.
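
The abstract does not spell out the model itself, so the following is only a toy sketch of the generic noisy-channel decision rule it refers to: given an observed "noisy" phoneme string S, choose the word sequence W that maximizes P(S | W) · P(W). The tables `LANGUAGE_MODEL` and `CHANNEL_MODEL`, the function `decode`, the phoneme strings, and all probabilities are hypothetical and invented for illustration. They are not taken from the paper or from Switchboard.

```python
import math

# Hypothetical prior P(W) over a few candidate word sequences (invented values).
LANGUAGE_MODEL = {
    ("going", "to"): 0.6,
    ("gonna",): 0.1,
    ("gone", "a"): 0.3,
}

# Hypothetical channel model P(S | W): probability that the whole word
# sequence W is realized as the phoneme string S (invented values).
CHANNEL_MODEL = {
    ("going", "to"): {"g ow ih ng t uw": 0.5, "g ah n ah": 0.5},
    ("gonna",): {"g ah n ah": 1.0},
    ("gone", "a"): {"g ao n ah": 0.9, "g ah n ah": 0.1},
}


def decode(phonemes):
    """Return argmax over W of log P(S | W) + log P(W), or None if no candidate fits."""
    best_words = None
    best_score = -math.inf
    for words, prior in LANGUAGE_MODEL.items():
        likelihood = CHANNEL_MODEL[words].get(phonemes, 0.0)
        if likelihood == 0.0:
            continue  # this word sequence cannot produce the observed string
        score = math.log(likelihood) + math.log(prior)
        if score > best_score:
            best_words, best_score = words, score
    return best_words


if __name__ == "__main__":
    # A reduced, spontaneous realization is scored against whole word sequences.
    print(decode("g ah n ah"))
```

In this toy setting the reduced string "g ah n ah" is compatible with all three candidates, and the prior over word sequences decides between them. A real system would replace the lookup tables with models trained on aligned phoneme and word sequences and would search over a much larger hypothesis space.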

Author information

Authors and Affiliations

National Institute of Information and Communications Technology, Japan
Hansjörg Hofmann, Sakriani Sakti, Ryosuke Isotani, Hisashi Kawai & Satoshi Nakamura
University of Ulm, Germany
Hansjörg Hofmann & Wolfgang Minker

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, 790-784, Pohang, South Korea
Gary Geunbae Lee
Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur, Centre National de la Recherche Scientifique, B.P. 133, 91403 Orsay cedex, France
Joseph Mariani
Institute of Information Technology, University of Ulm, Albert-Einstein-Allee 43, 89081, Ulm, Germany
Wolfgang Minker
National Institute of Information and Communications Technology, 3-5 Hikaridai, Keihanna Science City, Kyoto, Japan
Satoshi Nakamura

About this paper

Cite this paper

Hofmann, H., Sakti, S., Isotani, R., Kawai, H., Nakamura, S., Minker, W. (2010). Sequence-Based Pronunciation Modeling Using a Noisy-Channel Approach. In: Lee, G.G., Mariani, J., Minker, W., Nakamura, S. (eds) Spoken Dialogue Systems for Ambient Environments. IWSDS 2010. Lecture Notes in Computer Science, vol 6392. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16202-2_15

DOI: https://doi.org/10.1007/978-3-642-16202-2_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16201-5
Online ISBN: 978-3-642-16202-2
eBook Packages: Computer Science, Computer Science (R0)
