Abstract
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed attention owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, which are also barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies, at the heart of which lies multilingual speech processing.
Cite this article
BOURLARD, H., DINES, J., MAGIMAI-DOSS, M. et al. Current trends in multilingual speech processing. Sadhana 36, 885–915 (2011). https://doi.org/10.1007/s12046-011-0050-4
DOI: https://doi.org/10.1007/s12046-011-0050-4