
Speaker-Clustered Acoustic Models Evaluated on GPU for On-line Subtitling of Parliament Meetings

  • Josef V. Psutka
  • Jan Vaněk
  • Josef Psutka
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6836)

Abstract

This paper describes our effort in building speaker-clustered acoustic models as part of the real-time LVCSR system that has been used for more than a year by Czech TV for automatic subtitling of parliament meetings broadcast on the channel ČT24. Speaker-clustered acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent model or even gender-dependent models. Frequent speaker changes and the direct connection of the LVCSR system to the audio channel require automatic switching or fusion of the models as quickly as possible. An important part of the solution is the real-time likelihood evaluation of all clustered acoustic models, which takes advantage of a fast GPU (Graphics Processing Unit). The proposed method achieved a relative WER reduction of more than 2.34% over the baseline gender-independent model, with more than 2M Gaussian mixture components evaluated in real time.
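The batched likelihood evaluation the abstract refers to can be pictured as follows: the states of all clustered models are concatenated into one flat array and scored against each incoming frame in a single GPU launch. The CUDA kernel below is a minimal sketch of such a diagonal-covariance GMM log-likelihood evaluation; it is not the authors' implementation, and every name, memory layout, and parameter in it (gmmLogLikelihood, feats, invVars, logWts, etc.) is an illustrative assumption.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Sketch only: one thread per (frame, state) pair, where the state axis
// spans the concatenated states of ALL clustered acoustic models.
// Each thread accumulates a numerically stable log-sum-exp over the
// mixture components of its state. All layouts are assumptions:
//   feats   [numFrames x dim]            feature vectors
//   means   [numStates x numMix x dim]   component means
//   invVars [numStates x numMix x dim]   inverse diagonal variances
//   logWts  [numStates x numMix]         log weight + Gaussian constant
//   logLik  [numFrames x numStates]      output log-likelihoods
__global__ void gmmLogLikelihood(const float* feats,
                                 const float* means,
                                 const float* invVars,
                                 const float* logWts,
                                 float* logLik,
                                 int numFrames, int numStates,
                                 int numMix, int dim)
{
    int frame = blockIdx.y;
    int state = blockIdx.x * blockDim.x + threadIdx.x;
    if (frame >= numFrames || state >= numStates) return;

    const float* x = feats + frame * dim;
    float best = -1e30f;  // running maximum for the streaming log-sum-exp
    float sum  = 0.0f;

    for (int m = 0; m < numMix; ++m) {
        const float* mu = means   + (state * numMix + m) * dim;
        const float* iv = invVars + (state * numMix + m) * dim;
        float acc = logWts[state * numMix + m];
        for (int d = 0; d < dim; ++d) {
            float diff = x[d] - mu[d];
            acc -= 0.5f * diff * diff * iv[d];  // diagonal Mahalanobis term
        }
        // fold this component into the running log-sum-exp
        if (acc > best) { sum = sum * expf(best - acc) + 1.0f; best = acc; }
        else            { sum += expf(acc - best); }
    }
    logLik[frame * numStates + state] = best + logf(sum);
}
```

Scoring every cluster's states in one launch like this means per-model likelihood totals come out as a by-product, which is exactly what a fast model switching/fusion stage of the kind described above would consume.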

Keywords

Graphic Processing Unit · Automatic Speech Recognition · Acoustic Model · Word Error Rate · Automatic Speech Recognition System



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Josef V. Psutka (1)
  • Jan Vaněk (1)
  • Josef Psutka (1)
  1. Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic
