Journal of Signal Processing Systems, Volume 82, Issue 2, pp 197–206

A Keyword-Aware Language Modeling Approach to Spoken Keyword Search

  • I-Fan Chen
  • Chongjia Ni
  • Boon Pang Lim
  • Nancy F. Chen
  • Chin-Hui Lee


A keyword-sensitive language modeling framework for spoken keyword search (KWS) is proposed to combine the advantages of conventional keyword-filler based and large vocabulary continuous speech recognition (LVCSR) based KWS systems. The framework keeps the flexibility in keyword target settings offered by LVCSR-based keyword search. In low-resource scenarios, it enables KWS to achieve the high keyword detection accuracy of keyword-filler based systems while retaining the low false alarm rate inherent in LVCSR-based systems. The proposed keyword-aware grammar is realized by incorporating keyword information to re-train and modify the language models used in LVCSR-based KWS. Experimental results on the evalpart1 data of the IARPA Babel OpenKWS13 Vietnamese tasks indicate that, relative to conventional LVCSR-based KWS systems, the proposed approach improves the actual term weighted value (ATWV) by about 57% (from 0.2093 to 0.3287) on the limited-language-pack task and by about 20% (from 0.4578 to 0.5486) on the full-language-pack task.
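The relative improvements quoted above can be checked directly from the reported actual term weighted values. The following is a minimal sketch of that arithmetic only; the function name `relative_improvement` and the dictionary keys are illustrative labels, not identifiers from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative gain of `proposed` over `baseline`, as a fraction."""
    return (proposed - baseline) / baseline


# Actual term weighted values reported in the abstract: (baseline, proposed).
# The keys are illustrative labels for the two tasks.
atwv = {
    "limited-language-pack": (0.2093, 0.3287),
    "full-language-pack": (0.4578, 0.5486),
}

gains = {task: relative_improvement(b, p) for task, (b, p) in atwv.items()}

for task, gain in gains.items():
    print(f"{task}: {gain:.1%}")
```

The absolute gains are 0.1194 and 0.0908, so the larger relative figure on the limited-language-pack task reflects its lower baseline as much as the larger absolute gain.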


  • Keyword spotting
  • Keyword search
  • Filler
  • Spoken term detection
  • Grammar network
  • LVCSR



This study uses the IARPA Babel Program Vietnamese language collection release babel107b-v0.7 with the LimitedLP and FullLP training sets.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • I-Fan Chen (1)
  • Chongjia Ni (2)
  • Boon Pang Lim (2)
  • Nancy F. Chen (2)
  • Chin-Hui Lee (1)

  1. School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA
  2. Institute for Infocomm Research, Singapore, Singapore
