Multimedia Tools and Applications, Volume 72, Issue 2, pp 1465–1481

Leveraging topical and positional cues for language modeling in speech recognition

Abstract

This paper investigates language modeling with topical and positional information for large vocabulary continuous speech recognition. We first compare several topic models, including document topic models and word topic models, both theoretically and empirically. In addition, for some spoken documents such as broadcast news stories, documents of the same style usually share a similar composition and word usage. Such documents can therefore be separated, according to their literary structure, into partitions with the same rhetorical or topical style, such as introductory remarks, elucidations of methodology or affairs, conclusions of the articles, and references or footnotes of reporters. We thus present two position-dependent language models for speech recognition that integrate word positional information into the existing n-gram and topic models. Experiments conducted on broadcast news transcription seem to indicate that these position-dependent models obtain results comparable to those of the existing n-gram and topic models.
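As a rough illustration of the position-dependent idea, the sketch below splits each document into equal-width position segments and interpolates a segment-specific word distribution with a position-independent one. It is only a minimal sketch under stated assumptions, not the models proposed in the paper: it uses unigrams instead of n-gram or topic models, equal-width segments, add-one smoothing, and a fixed interpolation weight `lam`. All function names and the toy data are hypothetical.

```python
from collections import Counter

def segment_index(pos, doc_len, num_segments):
    """Map a 0-based word position to one of num_segments equal-width partitions."""
    return min(pos * num_segments // doc_len, num_segments - 1)

def train(docs, num_segments):
    """Count words globally (background) and separately for each position segment."""
    background = Counter()
    per_segment = [Counter() for _ in range(num_segments)]
    for doc in docs:
        for pos, word in enumerate(doc):
            background[word] += 1
            per_segment[segment_index(pos, len(doc), num_segments)][word] += 1
    return background, per_segment

def prob(word, seg, background, per_segment, vocab_size, lam=0.5):
    """Interpolate the segment unigram with the background unigram.

    Both components use add-one smoothing (an assumption made for brevity).
    """
    def smoothed(counts):
        return (counts[word] + 1) / (sum(counts.values()) + vocab_size)
    return lam * smoothed(per_segment[seg]) + (1 - lam) * smoothed(background)

# Toy "broadcast news" stories: greetings at the start, reporter sign-off at the end.
docs = [
    "good evening the storm hit taipei reporter chen".split(),
    "good evening the market fell today reporter lin".split(),
]
vocab = sorted({w for d in docs for w in d})
background, per_segment = train(docs, num_segments=2)

p_start = prob("good", 0, background, per_segment, len(vocab))
p_end = prob("good", 1, background, per_segment, len(vocab))
print(f"P(good | first half) = {p_start:.4f}")
print(f"P(good | second half) = {p_end:.4f}")
```

In this toy setup, a word that tends to open a story receives more probability mass in the first segment than in the last, while interpolation with the background counts keeps every word's probability non-zero in every segment.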

Keywords

Speech recognition, Language model, Positional information, Topical information, Language model adaptation

Notes

Acknowledgments

This work was sponsored in part by the “Aim for the Top University Plan” of National Taiwan Normal University and the Ministry of Education, Taiwan, and by the National Science Council, Taiwan, under Grants NSC 101-2221-E-003-024-MY3, NSC 101-2511-S-003-057-MY3, NSC 101-2511-S-003-047-MY3, NSC 99-2221-E-003-017-MY3, and NSC 98-2221-E-003-011-MY3.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Computer Science & Information Engineering, National Taiwan Normal University, Taipei, Taiwan
