
A Keyword-Aware Language Modeling Approach to Spoken Keyword Search

Journal of Signal Processing Systems

Abstract

A keyword-aware language modeling framework for spoken keyword search (KWS) is proposed to combine the advantages of conventional keyword-filler based and large vocabulary continuous speech recognition (LVCSR) based KWS systems. The proposed framework keeps keyword search systems flexible in their keyword target settings, as in LVCSR-based keyword search. In low-resource scenarios it enables KWS to achieve the high keyword detection accuracy of keyword-filler based systems while retaining the low false alarm rate inherent in LVCSR-based systems. The proposed keyword-aware grammar is realized by incorporating keyword information to re-train and modify the language models used in LVCSR-based KWS. Experimental results on the evalpart1 data of the IARPA Babel OpenKWS13 Vietnamese tasks indicate that the proposed approach achieves relative improvements in actual term-weighted value (ATWV) over conventional LVCSR-based KWS systems of about 57% (from 0.2093 to 0.3287) and 20% (from 0.4578 to 0.5486) on the limited-language-pack and full-language-pack tasks, respectively.
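
The ATWV numbers above follow the NIST term-weighted value (TWV) metric used in the OpenKWS evaluations. As a rough illustration of how the metric trades misses against false alarms, the Python sketch below computes TWV from per-keyword counts; it is not the evaluation scoring tool, the non-target trial count is approximated by one trial per second of speech, and all counts and names are invented for illustration.

    def twv(keywords, speech_dur_sec, beta=999.9):
        """Illustrative term-weighted value: keywords is a list of dicts with
        n_true, n_correct and n_false_alarm, each keyword occurring at least
        once in the reference; beta of about 999.9 is the usual OpenKWS weight."""
        loss = 0.0
        for kw in keywords:
            p_miss = 1.0 - kw["n_correct"] / kw["n_true"]
            # Non-target trials approximated as one per second of speech.
            p_fa = kw["n_false_alarm"] / (speech_dur_sec - kw["n_true"])
            loss += p_miss + beta * p_fa
        return 1.0 - loss / len(keywords)

    # Two made-up keywords scored over one hour of speech.
    counts = [
        {"n_true": 10, "n_correct": 8, "n_false_alarm": 1},
        {"n_true": 3, "n_correct": 3, "n_false_alarm": 0},
    ]
    print(round(twv(counts, speech_dur_sec=3600.0), 4))  # ~0.7607

Because beta is so large, even a handful of false alarms per keyword can push the average below zero, which is consistent with the negative ATWVs reported for keyword-filler systems in Note 4.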


Notes

  1. In this study, a grammar is defined as a search graph or network whose paths from the initial node to the final nodes represent the valid word sequences of a system, together with their corresponding scores; such a graph/network is easily realized with weighted finite-state automata (WFSA) [40, 41] (see the toy acceptor sketched after these notes).

  2. For example, Fig. 4 shows that the averaged keyword prior probabilities in the IARPA Babel Vietnamese data [32] are in the range of 5 × 10⁻⁵ to 5 × 10⁻⁶.

  3. In this research, we used all the bigrams in the original Kneser-Ney smoothed LM as context terms (a sketch of extracting such bigrams from an ARPA-format LM follows these notes).

  4. We obtained very poor performance (negative ATWVs) from keyword-filler based KWS systems, owing to an extremely large number of false alarms caused by noise in the test data; keyword-filler based KWS systems were therefore not considered here.

  5. We observed that setting k = 5 gives significantly better performance than k = 1 as in [19]; the differences become negligible once k is larger than 5.
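
The following two sketches illustrate the machinery referred to in Notes 1 and 3. First, a toy weighted finite-state acceptor in plain Python (not the authors' WFSA implementation): arcs are (source state, word label, weight, destination state), and a word sequence is valid if it traces a path from the start state to a final state, accumulating a score along the way.

    # Toy keyword+filler acceptor; states and weights are invented.
    arcs = [
        (0, "<filler>", 1.0, 0),  # self-loop absorbing non-keyword speech
        (0, "keyword", 0.1, 1),   # hypothesised keyword occurrence
        (1, "<filler>", 1.0, 1),  # non-keyword speech after the keyword
    ]
    start_state, final_states = 0, {0, 1}

    def path_cost(words):
        """Follow arcs deterministically; return the accumulated cost,
        or None if the word sequence is not accepted by the graph."""
        state, cost = start_state, 0.0
        for w in words:
            matches = [(c, d) for (s, lbl, c, d) in arcs if s == state and lbl == w]
            if not matches:
                return None
            c, state = matches[0]
            cost += c
        return cost if state in final_states else None

    print(path_cost(["<filler>", "keyword", "<filler>"]))  # -> 2.1

Second, for Note 3, collecting the bigrams of a Kneser-Ney smoothed LM is straightforward when the model is stored in the standard ARPA text format; the reader below simply gathers the entries of the \2-grams: section (the file name lm.arpa is hypothetical).

    def read_arpa_bigrams(path="lm.arpa"):
        """Return (word1, word2, log10 probability) triples from an ARPA LM."""
        bigrams, in_bigram_section = [], False
        with open(path, encoding="utf-8") as f:
            for raw in f:
                line = raw.strip()
                if line == "\\2-grams:":
                    in_bigram_section = True
                    continue
                if in_bigram_section:
                    if not line or line.startswith("\\"):
                        break  # blank separator or start of the next section
                    fields = line.split()
                    bigrams.append((fields[1], fields[2], float(fields[0])))
        return bigrams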

References

  1. Wilpon, J. G., Rabiner, L. R., Lee, C.-H., & Goldman, E. (1990). Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(11), 1870–1878.

  2. Rose, R. C., & Paul, D. B. (1990). A hidden Markov model based keyword recognition system. In Proceedings of ICASSP, Albuquerque, NM (vol. 1, pp. 129–132): IEEE. doi:10.1109/ICASSP.1990.115555

  3. Vergyri, D., Shafran, I., Stolcke, A., Gadde, R. R., Akbacak, M., Roark, B., et al. (2007). The SRI/OGI 2006 spoken term detection system. In Proceedings of Interspeech, (pp. 2393–2396): ISCA. http://www.isca-speech.org/archive/interspeech_2007/i07_2393.html

  4. Mamou, J., Ramabhadran, B., & Siohan, O. (2007). Vocabulary Independent spoken term detection. In Proceedings of SIGIR (pp. 615–622): ACM. doi:10.1145/1277741.1277847

  5. Miller, D. R., Kleber, M., Kao, C.-l., Kimball, O., Colthurst, T., Lowe, S. A., et al. (2007). Rapid and accurate spoken term detection. In Proceedings of Interspeech: ISCA. http://www.isca-speech.org/archive/interspeech_2007/i07_0314.html

  6. Wallace, R., Vogt, R., & Sridharan, S. (2007). A Phonetic Search Approach to the 2006 NIST Spoken Term Detection Evaluation. In Proceedings of Interspeech: ISCA. http://www.isca-speech.org/archive/interspeech_2007/i07_2385.html

  7. Makhoul, J., Kubala, F., Leek, T., Liu, D., Nguyen, L., Schwartz, R., et al. (2000). Speech and language technologies for audio indexing and retrieval. Proceedings of the IEEE, 88(8), 1338–1353.

  8. Warren, R. L. (2001). Broadcast speech recognition system for keyword monitoring. U.S. Patent 6332120 B1. http://www.google.tl/patents/US6332120

  9. Kawahara, T., Lee, C.-H., & Juang, B.-H. (1998). Key-phrase detection and verification for flexible speech understanding. IEEE Transactions on Speech and Audio Processing, 6(6), 558–568.

  10. Juang, B.-H., & Furui, S. (2000). Automatic recognition and understanding of spoken language – a first step toward natural human-machine communication. Proceedings of the IEEE, 88(8), 1142–1165.

  11. Rosenfeld, R. (2000). Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE, 88(8), 1270–1278.

  12. Pallett, D. S. (2003). A look at NIST’s benchmark ASR tests: past, present, and future. In Proceedings of ASRU, (pp. 483–488): IEEE. doi:10.1109/ASRU.2003.1318488

  13. Fiscus, J. G., Ajot, J., Garofolo, J. S., & Doddington, G. (2007). Results of the 2006 spoken term detection evaluation. In Proceedings of SIGIR: ACM. http://www.itl.nist.gov/iad/mig/publications/storage_paper/Interspeech07-STD06-v13.pdf

  14. Szoeke, I., Fapso, M., & Burget, L. (2008). Hybrid word-subword decoding for spoken term detection. In Proceedings of SIGIR, Singapore (pp. 42–48): ACM.

  15. Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359–393.

  16. Riedhammer, K., Do, V. H., & Hieronymus, J. (2013). A study on LVCSR and keyword search for Tagalog. In Proceedings of Interspeech, (pp. 2529–2533). http://www.isca-speech.org/archive/interspeech_2013/i13_2529.html

  17. Jeanrenaud, P., Eide, E., Chaudhari, U., McDonough, J., Ng, K., Siu, M., et al. (1995). Reducing word error rate on conversational speech from the Switchboard corpus. In Proceedings of ICASSP, (vol. 1, pp. 53–56): IEEE. doi:10.1109/ICASSP.1995.479271

  18. BABEL Program. http://www.iarpa.gov/Programs/ia/Babel/babel.html.

  19. Cui, J., Cui, X., Ramabhadran, B., Kim, J., Kingsbury, B., Mamou, J., et al. (2013). Developing speech recognition systems for corpus indexing under the IARPA babel program. In Proceedings of ICASSP, (pp. 6753–6757): IEEE. doi:10.1109/ICASSP.2013.6638969

  20. Chen, N. F., Sivadas, S., Lim, B. P., Ngo, H. G., Xu, H., Pham, V. T., et al. (2014). Strategies for Vietnamese keyword search. In Proceedings of ICASSP, (pp. 4149–4153). Florence: IEEE. doi:10.1109/ICASSP.2014.6854377

  21. Metze, F., Rajput, N., Anguera, X., Davel, M., Gravier, G., Heerden, C. v., et al. (2012). The spoken web search task at MediaEval 2011. In Proceedings of ICASSP, (pp. 5165–5168). Kyoto: IEEE. doi:10.1109/ICASSP.2012.6289083

  22. Metze, F., Anguera, X., Barnard, E., Davel, M., & Gravier, G. (2013). The spoken web search task at MediaEval 2012. In Proceedings of ICASSP, (pp. 8121–8125). Vancouver, BC: IEEE. doi:10.1109/ICASSP.2013.6639247

  23. MediaEval Benchmarking Initiative for Multimedia Evaluation. http://www.multimediaeval.org/

  24. Moseley, C. (Ed.). (2010). Atlas of the world’s languages in danger (3rd ed.). Paris: UNESCO.

  25. Tueske, Z., Nolden, D., Schlueter, R., & Ney, H. (2014). Multilingual MRASTA features for low-resource keyword search and speech recognition systems. In Proceedings of ICASSP, (pp. 7854–7858). Florence: IEEE. doi:10.1109/ICASSP.2014.6855129

  26. Ghahremani, P., Babaali, B., Povey, D., Riedhammer, K., Trmal, J., & Khudanpur, S. (2014). A pitch extraction algorithm tuned for automatic speech recognition. In Proceedings of ICASSP, (pp. 2494–2498). Florence: IEEE. doi:10.1109/ICASSP.2014.6854049

  27. Lee, H.-Y., Zhang, Y., Chuangsuwanich, E., & Glass, J. (2014). Graph-based re-ranking using acoustic feature similarity between search results for spoken term detection on low-resource languages. In Proceedings of Interspeech, Singapore (pp. 2479–2483): ISCA. http://www.isca-speech.org/archive/interspeech_2014/i14_2479.html

  28. Soto, V., Mangu, L., Rosenberg, A., & Hirschberg, J. (2014). A comparison of multiple methods for rescoring keyword search lists for low resource languages. In Proceedings of Interspeech, Singapore (pp. 2464–2468): ISCA. http://www.isca-speech.org/archive/interspeech_2014/i14_2464.html

  29. Hartmann, W., Le, V.-B., Messaoudi, A., Lamel, L., & Gauvain, J.-L. (2014). Comparing decoding strategies for subword-based keyword spotting in low-resourced languages. In Proceedings of Interspeech, Singapore (pp. 2764–2768). http://www.isca-speech.org/archive/interspeech_2014/i14_2764.html

  30. Hsiao, R., Ng, T., Zhang, L., Ranjan, S., Tsakalidis, S., Nguyen, L., et al. (2014). Improving semi-supervised deep neural network for keyword search in low resource languages. In Proceedings of Interspeech, Singapore (pp. 1088–1091): ISCA. http://www.isca-speech.org/archive/interspeech_2014/i14_1088.html

  31. Cui, X., Kingsbury, B., Cui, J., Ramabhadran, B., Rosenberg, A., Rasooli, M. S., et al. (2014). Improving deep neural network acoustic modeling for audio corpus indexing under the IARPA babel program. In Proceedings of Interspeech, Singapore (pp. 2103–2107): ISCA. http://www.isca-speech.org/archive/interspeech_2014/i14_2103.html

  32. Huang, J.-T., Li, J., Yu, D., Deng, L., & Gong, Y. (2013). Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In Proceedings of ICASSP, Vancouver, BC (pp. 7304-7308): IEEE. doi:10.1109/ICASSP.2013.6639081

  33. Chen, I.-F., Ni, C., Lim, B. P., Chen, N. F., & Lee, C.-H. (2014). A novel keyword + LVCSR-filler based grammar network representation for spoken keyword search. In Proceedings of ISCSLP, Singapore (pp. 192-196): IEEE. doi:10.1109/ISCSLP.2014.6936713

  34. Chen, I.-F., Ni, C., Lim, B. P., Chen, N. F., & Lee, C.-H. (2015). A keyword-aware grammar framework for LVCSR-based spoken keyword search. In Proceedings of ICASSP, Brisbane: IEEE.

  35. Sukkar, R. A., & Lee, C.-H. (1996). Vocabulary independent discriminative utterance verification for non-keyword rejection in subword based speech recognition. IEEE Transactions on Speech and Audio Processing, 4(6), 420–429.

  36. Ou, J., Chen, K., Wang, X., & Lee, Z. (2001). Utterance verification of short keywords using hybrid neural-network/HMM approach. In Proceedings of ICII, Beijing (vol. 2, pp. 671–676): IEEE. doi:10.1109/ICII.2001.983657.

  37. Chen, I.-F., & Lee, C.-H. (2013). A hybrid HMM/DNN Approach to keyword spotting of short words. In Proceedings of Interspeech, Lyon (pp. 1574-1578): ISCA. http://www.isca-speech.org/archive/interspeech_2013/i13_1574.html

  38. Chen, I.-F., & Lee, C.-H. (2013). A Resource-dependent approach to word modeling for keyword spotting. In Proceedings of Interspeech, Lyon (pp. 2544–2548): ISCA. http://www.isca-speech.org/archive/interspeech_2013/i13_2544.html

  39. Szoke, I., Schwarz, P., Matejka, P., Burget, L., Karafiat, M., Fapso, M., et al. (2005). Comparison of keyword spotting approaches for informal continuous speech. In Proceedings of Eurospeech.

  40. Mohri, M., Pereira, F., & Riley, M. (2008). Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing (pp. 559–584): Springer Berlin Heidelberg. doi:10.1007/978-3-540-49127-9_28

  41. Allauzen, C., Mohri, M., & Roark, B. (2003). Generalized algorithms for constructing language models. In Proceedings of ACL, Stroudsburg, PA, USA (vol. 1, pp. 40–47): ACL. doi:10.3115/1075096.1075102

  42. NIST Open Keyword Search 2013 Evaluation (OpenKWS13). http://www.nist.gov/itl/iad/mig/openkws13.cfm.

  43. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The Kaldi speech recognition toolkit. In Proceedings of IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE Signal Processing Society.

  44. Vesely, K., Ghoshal, A., Burget, L., & Povey, D. (2013). Sequence-discriminative training of deep neural networks. In Proceedings of Interspeech, Lyon, France (pp. 2345–2349): ISCA. http://www.isca-speech.org/archive/interspeech_2013/i13_2345.html

  45. Novak, J. R., Minematsu, N., & Hirose, K. (2012). WFST-based Grapheme-to-Phoneme conversion: open source tools for alignment, model-building and decoding. In Proceedings of International Workshop on Finite State Methods and Natural Language Processing, Donostia-San Sebastian (pp. 45–49). https://code.google.com/p/phonetisaurus

  46. Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400–401.

Acknowledgments

This study uses the IARPA Babel Program Vietnamese language collection release babel107b-v0.7 with the LimitedLP and FullLP training sets.

Corresponding author

Correspondence to I-Fan Chen.

Cite this article

Chen, I.-F., Ni, C., Lim, B. P., et al. A Keyword-Aware Language Modeling Approach to Spoken Keyword Search. Journal of Signal Processing Systems, 82, 197–206 (2016). https://doi.org/10.1007/s11265-015-0998-0
