A Keyword-Aware Language Modeling Approach to Spoken Keyword Search

Chen, I-Fan; Ni, Chongjia; Lim, Boon Pang; Chen, Nancy F.; Lee, Chin-Hui

doi:10.1007/s11265-015-0998-0

Published: 21 April 2015

Journal of Signal Processing Systems, Volume 82, pages 197–206 (2016)

I-Fan Chen¹,
Chongjia Ni²,
Boon Pang Lim²,
Nancy F. Chen² &
Chin-Hui Lee¹

Abstract

A keyword-sensitive language modeling framework for spoken keyword search (KWS) is proposed to combine the advantages of conventional keyword-filler based and large vocabulary continuous speech recognition (LVCSR) based KWS systems. The framework keeps keyword target settings as flexible as in LVCSR-based keyword search. In low-resource scenarios, it allows KWS to achieve the high keyword detection accuracy of keyword-filler based systems while retaining the low false alarm rate inherent in LVCSR-based systems. The proposed keyword-aware grammar is realized by incorporating keyword information to re-train and modify the language models used in LVCSR-based KWS. Experimental results on the evalpart1 data of the IARPA Babel OpenKWS13 Vietnamese tasks indicate that, relative to conventional LVCSR-based KWS systems, the proposed approach improves the actual term weighted value (ATWV) by about 57% (from 0.2093 to 0.3287) on the limited-language-pack task and by about 20% (from 0.4578 to 0.5486) on the full-language-pack task.
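
The relative improvements quoted in the abstract can be checked directly from the four reported actual term weighted values. The following is a minimal sketch of that arithmetic only; the function name and the dictionary keys are illustrative choices, not identifiers from the article.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative gain of `proposed` over `baseline`, returned as a fraction."""
    return (proposed - baseline) / baseline


# (baseline, proposed) actual term weighted values quoted in the abstract.
# The keys are shorthand labels for the two tasks (an assumption of this sketch).
reported = {
    "limited-language-pack": (0.2093, 0.3287),
    "full-language-pack": (0.4578, 0.5486),
}

gains = {task: relative_improvement(b, p) for task, (b, p) in reported.items()}

for task, gain in gains.items():
    # Print each gain as a percentage with one decimal place.
    print(f"{task}: {gain:.1%}")
```

Both results round to the figures stated in the abstract (about 57% and about 20%).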

Notes

In this study, a grammar is defined as a search graph or network whose paths from the initial to final nodes represent valid word sequences in a system with corresponding scores, and the graph/network is easily realized by weighted finite-state automata (WFSA) [40, 41].
For example, Fig. 4 shows that the averaged keyword prior probabilities in the IARPA Babel Vietnamese data [32] are in the range of 5 × 10⁻⁵ to 5 × 10⁻⁶.
In this research, we used all the bigrams in the original Kneser-Ney smoothed LM as context terms.
We obtained very poor performance (negative ATWVs) for keyword-filler based KWS systems because noise in the test data caused an extremely large number of false alarms. Therefore, keyword-filler based KWS systems were not considered here.
We observed that setting k = 5 gives significantly better performance than setting k = 1 as in [19]. However, the differences became trivial when k was larger than 5.
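
The first note above defines a grammar as a weighted search graph whose paths from the initial node to a final node are the valid word sequences. As a rough illustration of that definition only, the sketch below encodes a toy weighted finite-state acceptor as a Python dictionary and scores a word sequence along its path. The states, words, and probabilities are hypothetical and are not taken from the article, and this is not the authors' implementation.

```python
import math

# Toy acceptor: state -> {word: (next_state, cost)}.
# Costs are negative log probabilities; all values are made up for illustration.
ARCS = {
    0: {"hello": (1, -math.log(0.6)), "world": (2, -math.log(0.4))},
    1: {"world": (2, -math.log(1.0))},
}
START = 0
FINALS = {2}


def path_cost(words):
    """Return the summed cost of `words` if the graph accepts them, else None."""
    state, cost = START, 0.0
    for word in words:
        arc = ARCS.get(state, {}).get(word)
        if arc is None:
            return None  # no outgoing arc for this word: sequence is not valid
        state, step_cost = arc
        cost += step_cost
    # A sequence is valid only if it ends in a final state.
    return cost if state in FINALS else None


print(path_cost(["hello", "world"]))  # accepted: cost of the 0 -> 1 -> 2 path
print(path_cost(["hello"]))           # rejected: state 1 is not final
```

In a real system the graph would be built and searched with a finite-state toolkit rather than a hand-written dictionary; the dictionary form is used here only to make the "paths with scores" idea concrete.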

References

Wilpon, J. G., Rabiner, L. R., Lee, C.-H., & Goldman, E. (1990). Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(11), 1870–1878.
Rose, R. C., & Paul, D. B. (1990). A hidden Markov model based keyword recognition system. In Proceedings of ICASSP, Albuquerque, NM (vol. 1, pp. 129–132): IEEE. doi:10.1109/ICASSP.1990.115555
Vergyri, D., Shafran, I., Stolcke, A., Gadde, R. R., Akbacak, M., Roark, B., et al. (2007). The SRI/OGI 2006 spoken term detection system. In Proceedings of Interspeech, (pp. 2393–2396): ISCA. http://www.isca-speech.org/archive/interspeech_2007/i07_2393.html
Mamou, J., Ramabhadran, B., & Siohan, O. (2007). Vocabulary Independent spoken term detection. In Proceedings of SIGIR (pp. 615–622): ACM. doi:10.1145/1277741.1277847
Miller, D. R., Kleber, M., Kao, C.-l., Kimball, O., Colthurst, T., Lowe, S. A., et al. (2007). Rapid and accurate spoken term detection. In Proceedings of Interspeech: ISCA. http://www.isca-speech.org/archive/interspeech_2007/i07_0314.html
Wallace, R., Vogt, R., & Sridharan, S. (2007). A Phonetic Search Approach to the 2006 NIST Spoken Term Detection Evaluation. In Proceedings of Interspeech: ISCA. http://www.isca-speech.org/archive/interspeech_2007/i07_2385.html
Makhoul, J., Kubala, F., Leek, T., Liu, D., Nguyen, L., Schwartz, R., et al. (2000). Speech and language technologies for audio indexing and retrieval. Proceedings of the IEEE, 88(8), 1338–1353.
Warren, R. L. (2001). Broadcast speech recognition system for keyword monitoring. U.S. Patent 6332120 B1. http://www.google.tl/patents/US6332120
Kawahara, T., Lee, C.-H., & Juang, B.-H. (1998). Key-phrase detection and verification for flexible speech understanding. IEEE Transactions on Speech and Audio Processing, 6(6), 558–568.
Juang, B.-H., & Furui, S. (2000). Automatic recognition and understanding of spoken language – a first step toward natural human-machine communication. Proceedings of the IEEE, 88(8), 1142–1165.
Rosenfeld, R. (2000). Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE, 88(8), 1270–1278.
Pallett, D. S. (2003). A look at NIST’s benchmark ASR tests: past, present, and future. In Proceedings of ASRU, (pp. 483–488): IEEE. doi:10.1109/ASRU.2003.1318488
Fiscus, J. G., Ajot, J., Garofolo, J. S., & Doddington, G. (2007). Results of the 2006 spoken term detection evaluation. In Proceedings of SIGIR: ACM. http://www.itl.nist.gov/iad/mig/publications/storage_paper/Interspeech07-STD06-v13.pdf
Szoeke, I., Fapso, M., & Burget, L. (2008). Hybrid word-subword decoding for spoken term detection. In Proceedings of SIGIR, Singapore (pp. 42–48): ACM.
Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359–393.
Riedhammer, K., Do, V. H., & Hieronymus, J. (2013). A study on LVCSR and keyword search for Tagalog. In Proceedings of Interspeech, (pp. 2529–2533). http://www.isca-speech.org/archive/interspeech_2013/i13_2529.html
Jeanrenaud, P., Eide, E., Chaudhari, U., McDonough, J., Ng, K., Siu, M., et al. (1995). Reducing word error rate on conversational speech from the Switchboard corpus. In Proceedings of ICASSP, (vol. 1, pp. 53–56): IEEE. doi:10.1109/ICASSP.1995.479271
BABEL Program. http://www.iarpa.gov/Programs/ia/Babel/babel.html.
Cui, J., Cui, X., Ramabhadran, B., Kim, J., Kingsbury, B., Mamou, J., et al. (2013). Developing speech recognition systems for corpus indexing under the IARPA Babel program. In Proceedings of ICASSP, (pp. 6753–6757): IEEE. doi:10.1109/ICASSP.2013.6638969
Chen, N. F., Sivadas, S., Lim, B. P., Ngo, H. G., Xu, H., Pham, V. T., et al. (2014). Strategies for Vietnamese keyword search. In Proceedings of ICASSP, (pp. 4149–4153). Florence: IEEE. doi:10.1109/ICASSP.2014.6854377
Metze, F., Rajput, N., Anguera, X., Davel, M., Gravier, G., Heerden, C. v., et al. (2012). The spoken web search task at MediaEval 2011. In Proceedings of ICASSP, (pp. 5165–5168). Kyoto: IEEE. doi:10.1109/ICASSP.2012.6289083
Metze, F., Anguera, X., Barnard, E., Davel, M., & Gravier, G. (2013). The spoken web search task at MediaEval 2012. In Proceedings of ICASSP, (pp. 8121–8125). Vancouver, BC: IEEE. doi:10.1109/ICASSP.2013.6639247
MediaEval Benchmarking Initiative for Multimedia Evaluation. http://www.multimediaeval.org/
Moseley, C. (Ed.). (2010). Atlas of the world’s languages in danger (3rd ed.). Paris: UNESCO.
Tueske, Z., Nolden, D., Schlueter, R., & Ney, H. (2014). Multilingual MRASTA features for low-resource keyword search and speech recognition systems. In Proceedings of ICASSP, (pp. 7854–7858). Florence: IEEE. doi:10.1109/ICASSP.2014.6855129
Ghahremani, P., Babaali, B., Povey, D., Riedhammer, K., Trmal, J., & Khudanpur, S. (2014). A pitch extraction algorithm tuned for automatic speech recognition. In Proceedings of ICASSP, (pp. 2494–2498). Florence: IEEE. doi:10.1109/ICASSP.2014.6854049
Lee, H.-Y., Zhang, Y., Chuangsuwanich, E., & Glass, J. (2014). Graph-based re-ranking using acoustic feature similarity between search results for spoken term detection on low-resource languages. In Proceedings of Interspeech, Singapore (pp. 2479–2483): ISCA. http://www.isca-speech.org/archive/interspeech_2014/i14_2479.html
Soto, V., Mangu, L., Rosenberg, A., & Hirschberg, J. (2014). A comparison of multiple methods for rescoring keyword search lists for low resource languages. In Proceedings of Interspeech, Singapore (pp. 2464–2468): ISCA. http://www.isca-speech.org/archive/interspeech_2014/i14_2464.html
Hartmann, W., Le, V.-B., Messaoudi, A., Lamel, L., & Gauvain, J.-L. (2014). Comparing decoding strategies for subword-based keyword spotting in low-resourced languages. In Proceedings of Interspeech, Singapore (pp. 2764–2768). http://www.isca-speech.org/archive/interspeech_2014/i14_2764.html
Hsiao, R., Ng, T., Zhang, L., Ranjan, S., Tsakalidis, S., Nguyen, L., et al. (2014). Improving semi-supervised deep neural network for keyword search in low resource languages. In Proceedings of Interspeech, Singapore (pp. 1088–1091): ISCA. http://www.isca-speech.org/archive/interspeech_2014/i14_1088.html
Cui, X., Kingsbury, B., Cui, J., Ramabhadran, B., Rosenberg, A., Rasooli, M. S., et al. (2014). Improving deep neural network acoustic modeling for audio corpus indexing under the IARPA Babel program. In Proceedings of Interspeech, Singapore (pp. 2103–2107): ISCA. http://www.isca-speech.org/archive/interspeech_2014/i14_2103.html
Huang, J.-T., Li, J., Yu, D., Deng, L., & Gong, Y. (2013). Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In Proceedings of ICASSP, Vancouver, BC (pp. 7304–7308): IEEE. doi:10.1109/ICASSP.2013.6639081
Chen, I.-F., Ni, C., Lim, B. P., Chen, N. F., & Lee, C.-H. (2014). A novel keyword + LVCSR-filler based grammar network representation for spoken keyword search. In Proceedings of ISCSLP, Singapore (pp. 192–196): IEEE. doi:10.1109/ISCSLP.2014.6936713
Chen, I.-F., Ni, C., Lim, B. P., Chen, N. F., & Lee, C.-H. (2015). A keyword-aware grammar framework for LVCSR-based spoken keyword search. In Proceedings of ICASSP, Brisbane: IEEE.
Sukkar, R. A., & Lee, C.-H. (1996). Vocabulary independent discriminative utterance verification for non-keyword rejection in subword based speech recognition. IEEE Transactions on Speech and Audio Processing, 4(6), 420–429.
Ou, J., Chen, K., Want, X., & Lee, Z. (2001). Utterance verification of short keywords using hybrid neural-network/HMM approach. In Proceedings of ICII, Beijing (vol. 2, pp. 671–676): IEEE. doi:10.1109/ICII.2001.983657
Chen, I.-F., & Lee, C.-H. (2013). A hybrid HMM/DNN approach to keyword spotting of short words. In Proceedings of Interspeech, Lyon (pp. 1574–1578): ISCA. http://www.isca-speech.org/archive/interspeech_2013/i13_1574.html
Chen, I.-F., & Lee, C.-H. (2013). A resource-dependent approach to word modeling for keyword spotting. In Proceedings of Interspeech, Lyon (pp. 2544–2548): ISCA. http://www.isca-speech.org/archive/interspeech_2013/i13_2544.html
Szoke, I., Schwarz, P., Matejka, P., Burget, L., Karafiat, M., Fapso, M., et al. (2005). Comparison of keyword spotting approaches for informal continuous speech. In Proceedings of EuroSpeech.
Mohri, M., Pereira, F., & Riley, M. (2008). Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing (pp. 559–584): Springer Berlin Heidelberg. doi:10.1007/978-3-540-49127-9_28
Allauzen, C., Mohri, M., & Roark, B. (2003). Generalized algorithms for constructing language models. In Proceedings of ACL, Stroudsburg, PA, USA (vol. 1, pp. 40–47): ACL. doi:10.3115/1075096.1075102
NIST Open Keyword Search 2013 Evaluation (OpenKWS13). http://www.nist.gov/itl/iad/mig/openkws13.cfm.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The Kaldi speech recognition toolkit. In Proceedings of IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE Signal Processing Society.
Vesely, K., Ghoshal, A., Burget, L., & Povey, D. (2013). Sequence-discriminative training of deep neural networks. In Proceedings of Interspeech, Lyon, France (pp. 2345–2349): ISCA. http://www.isca-speech.org/archive/interspeech_2013/i13_2345.html
Novak, J. R., Minematsu, N., & Hirose, K. (2012). WFST-based Grapheme-to-Phoneme conversion: open source tools for alignment, model-building and decoding. In Proceedings of International Workshop on Finite State Methods and Natural Language Processing, Donostia-San Sebastian (pp. 45–49). https://code.google.com/p/phonetisaurus
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400–401.

Acknowledgments

This study uses the IARPA Babel Program Vietnamese language collection release babel107b-v0.7 with the LimitedLP and FullLP training sets.

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
I-Fan Chen & Chin-Hui Lee
Institute for Infocomm Research, Singapore, Singapore
Chongjia Ni, Boon Pang Lim & Nancy F. Chen

Corresponding author

Correspondence to I-Fan Chen.

About this article

Cite this article

Chen, IF., Ni, C., Lim, B.P. et al. A Keyword-Aware Language Modeling Approach to Spoken Keyword Search. J Sign Process Syst 82, 197–206 (2016). https://doi.org/10.1007/s11265-015-0998-0

Received: 13 November 2014
Revised: 16 February 2015
Accepted: 23 March 2015
Published: 21 April 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11265-015-0998-0
