Transfer Learning for Tandem ASR Feature Extraction
Tandem automatic speech recognition (ASR), in which a single multi-layer perceptron (MLP) or an ensemble of MLPs provides a non-linear transform of the acoustic parameters, has become a standard technique in a number of state-of-the-art systems. In this paper, we examine how to transfer learning from out-of-domain data to new tasks.
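As a rough illustration of the tandem idea, the sketch below passes acoustic frames through a small MLP, takes the log of the output posteriors, reduces them with PCA, and appends the result to the original features. The abstract does not spell out these steps; the log, PCA, and concatenation stages follow common tandem practice and are assumptions here, as are all sizes, the random weights, and the function names `mlp_posteriors` and `tandem_features`.

```python
import numpy as np

def mlp_posteriors(x, w1, b1, w2, b2):
    """One-hidden-layer MLP: sigmoid hidden units, softmax outputs."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    z = h @ w2 + b2
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def tandem_features(acoustic, posteriors, n_keep):
    """Log posteriors -> PCA to n_keep dimensions -> append to acoustic features."""
    logp = np.log(posteriors + 1e-10)
    centered = logp - logp.mean(axis=0)
    # PCA via SVD of the centered data; rows of vt are the principal directions.
    # (In a real system the projection would be estimated on training data only.)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_keep].T
    return np.hstack([acoustic, reduced])

# Hypothetical sizes and untrained random weights, for illustration only.
rng = np.random.default_rng(0)
n_frames, n_in, n_hidden, n_phones = 200, 39, 50, 46
x = rng.normal(size=(n_frames, n_in))
w1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_phones))
b2 = np.zeros(n_phones)

post = mlp_posteriors(x, w1, b1, w2, b2)
feats = tandem_features(x, post, n_keep=25)
print(feats.shape)  # (200, 64): 39 acoustic + 25 MLP-derived dimensions
```

The augmented vectors would then be modelled by a conventional HMM-based recognizer in place of the raw acoustic features.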
Our primary focus is to develop tandem features for recognition of speech from the meetings domain. We show that adapting MLPs originally trained on conversational telephone speech leads to lower word error rates than training MLPs solely on the target data. Multi-task learning, in which a single MLP is trained to perform a secondary task in addition to its primary one (in this case a speech enhancement mapping from farfield to nearfield signals), is also shown to be advantageous.
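One way to picture the multi-task setup is a shared hidden layer feeding two output heads: a softmax head for phone classification and a linear head that regresses nearfield features from farfield input. The sketch below is a minimal version under stated assumptions, not the authors' configuration: the network sizes, the task weight `lam`, the squared-error enhancement loss, the learning rate, and the synthetic data are all hypothetical.

```python
import numpy as np

def forward(params, x):
    """Shared sigmoid hidden layer with a softmax head and a linear head."""
    h = 1.0 / (1.0 + np.exp(-(x @ params["W1"] + params["b1"])))
    z = h @ params["Wa"] + params["ba"]
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)          # phone posteriors (primary task)
    y = h @ params["Wb"] + params["bb"]        # enhanced features (secondary task)
    return h, p, y

def loss_and_grads(params, x, labels, target, lam):
    """Cross-entropy plus lam * squared error, with gradients by backpropagation."""
    n = x.shape[0]
    h, p, y = forward(params, x)
    ce = -np.mean(np.log(p[np.arange(n), labels] + 1e-12))
    mse = 0.5 * np.mean(np.sum((y - target) ** 2, axis=1))
    dz = p.copy()
    dz[np.arange(n), labels] -= 1.0
    dz /= n                                    # gradient at the softmax inputs
    dy = lam * (y - target) / n                # gradient at the linear outputs
    dh = dz @ params["Wa"].T + dy @ params["Wb"].T  # both tasks reach the shared layer
    dpre = dh * h * (1.0 - h)
    grads = {
        "W1": x.T @ dpre, "b1": dpre.sum(axis=0),
        "Wa": h.T @ dz, "ba": dz.sum(axis=0),
        "Wb": h.T @ dy, "bb": dy.sum(axis=0),
    }
    return ce + lam * mse, grads

# Synthetic stand-ins: "far" is a noisy copy of "near"; labels are random.
rng = np.random.default_rng(0)
n, d, n_hidden, n_phones = 256, 20, 32, 10
near = rng.normal(size=(n, d))
far = near + 0.3 * rng.normal(size=(n, d))
labels = rng.integers(0, n_phones, size=n)

params = {
    "W1": rng.normal(scale=0.1, size=(d, n_hidden)), "b1": np.zeros(n_hidden),
    "Wa": rng.normal(scale=0.1, size=(n_hidden, n_phones)), "ba": np.zeros(n_phones),
    "Wb": rng.normal(scale=0.1, size=(n_hidden, d)), "bb": np.zeros(d),
}

losses = []
for _ in range(100):
    loss, grads = loss_and_grads(params, far, labels, near, lam=0.5)
    losses.append(loss)
    for k in params:
        params[k] -= 0.05 * grads[k]           # plain full-batch gradient descent
```

Because the hidden layer receives gradients from both heads, the secondary task shapes the representation used by the primary one; at feature-extraction time only the phone head would be needed.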
We also present recognition experiments on broadcast news data, which suggest that structure learned from English speech can be adapted to Mandarin Chinese. Adapted MLPs using about 97 hours of target-language data matched the performance of tandem MLPs trained from a random initialization on 440 hours of Mandarin speech.
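Adaptation of this kind can be sketched as a warm start: rather than drawing all weights at random, the target-language MLP begins from the weights of the source-language MLP and training then continues on target data. The abstract does not say how the output layer is handled when the phone sets differ; re-initializing it, as below, is one common choice and is an assumption here. The helper names, layer sizes, and phone counts are hypothetical.

```python
import numpy as np

def init_mlp(rng, n_in, n_hidden, n_out, scale=0.1):
    """Randomly initialized one-hidden-layer MLP (the baseline starting point)."""
    return {
        "W1": rng.normal(scale=scale, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=scale, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def adapt_from(source, rng, n_out_target, scale=0.1):
    """Warm start: copy the hidden layer; reuse the output layer only if sizes match."""
    n_hidden = source["W1"].shape[1]
    adapted = {"W1": source["W1"].copy(), "b1": source["b1"].copy()}
    if source["W2"].shape[1] == n_out_target:
        adapted["W2"] = source["W2"].copy()
        adapted["b2"] = source["b2"].copy()
    else:
        # Different phone set: start the output layer afresh (assumed strategy).
        adapted["W2"] = rng.normal(scale=scale, size=(n_hidden, n_out_target))
        adapted["b2"] = np.zeros(n_out_target)
    return adapted

rng = np.random.default_rng(0)
# Stands in for an MLP already trained on source-language (English) data.
english = init_mlp(rng, n_in=39, n_hidden=64, n_out=46)
# Starting point for further training on target-language (Mandarin) data.
mandarin = adapt_from(english, rng, n_out_target=71)
```

Training would then proceed on the target data exactly as for a randomly initialized network; only the starting point differs.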
- Book Title: Machine Learning for Multimodal Interaction
- Book Subtitle: 4th International Workshop, MLMI 2007, Brno, Czech Republic, June 28-30, 2007, Revised Selected Papers
- Pages: 227-236
- Series Title: Lecture Notes in Computer Science
- Publisher: Springer Berlin Heidelberg
- Copyright Holder: Springer-Verlag Berlin Heidelberg