Keyword Spotting in Continuous Speech Using Spectral and Prosodic Information Fusion

Abstract

Keyword spotting in continuous speech is a challenging problem with relevance to applications such as audio indexing and music retrieval. In this work, keyword spotting is addressed by exploiting the complementary information present in the spectral and prosodic features of the speech signal. A thorough analysis of this complementary information is performed on a large Hindi-language database developed for this purpose. Toward this end, phonetic and prosodic distribution analyses are carried out using canonical correlation and the Student's t-distance function. Motivated by these analyses, novel methods for spectral and prosodic information fusion that optimize a combined error function are proposed. The fusion methods are developed at both the feature and the model level. These methods yield improved syllable sequence prediction and keyword spotting performance compared to conventional keyword spotting methods. Additionally, to enable comparison with state-of-the-art deep learning-based methods, a novel method for improved syllable sequence prediction using deep denoising autoencoders is proposed. The performance of the proposed methods is evaluated for keyword spotting using a syllable sliding protocol over a large Hindi database. Reasonable performance improvements are observed in the experimental results on syllable sequence prediction, keyword spotting, and audio retrieval.
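
The sketch below is a minimal illustration, not the paper's implementation, of the kind of canonical-correlation analysis the abstract describes: quantifying how much shared versus complementary information the spectral and prosodic feature streams carry. The feature matrices here are random stand-ins for per-frame features; the dimensionalities (13-dimensional spectral, 3-dimensional prosodic) are assumptions for illustration only.

```python
# Minimal sketch of canonical correlation between two feature streams.
# Random matrices stand in for real per-frame spectral (e.g. MFCC) and
# prosodic (e.g. pitch, energy, duration) features.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_frames = 1000
spectral = rng.normal(size=(n_frames, 13))  # stand-in for 13-dim spectral features
prosodic = rng.normal(size=(n_frames, 3))   # stand-in for 3-dim prosodic features

# Project both streams onto maximally correlated canonical directions.
cca = CCA(n_components=3)
spec_c, pros_c = cca.fit_transform(spectral, prosodic)

# Per-component canonical correlations: values near zero suggest the
# streams are largely complementary rather than redundant.
corrs = [np.corrcoef(spec_c[:, k], pros_c[:, k])[0, 1] for k in range(3)]
print("canonical correlations:", np.round(corrs, 3))
```

With genuinely complementary streams, low canonical correlations indicate that fusing the two feature sets can add information that neither carries alone, which is the premise behind the fusion methods proposed in the paper.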

Notes

  1. Encodes categorical data using a one-of-K scheme (a toy example follows these notes).

  2. newsonair.nic.in.

  3. www.youtube.com.
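
As a minimal illustration of the one-of-K (one-hot) scheme mentioned in note 1, the snippet below encodes a toy set of syllable labels. The labels are hypothetical placeholders, not drawn from the paper's database.

```python
# One-of-K (one-hot) encoding of a toy syllable label set.
import numpy as np

labels = ["ka", "ma", "ra", "ka"]        # categorical data (hypothetical)
vocab = sorted(set(labels))              # K distinct categories
index = {s: i for i, s in enumerate(vocab)}

one_hot = np.zeros((len(labels), len(vocab)), dtype=int)
for row, s in enumerate(labels):
    one_hot[row, index[s]] = 1           # exactly one 1 per row

print(vocab)     # ['ka', 'ma', 'ra']
print(one_hot)   # each row selects one of the K categories
```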

Author information

Corresponding author: Laxmi Pandey.

About this article

Cite this article

Pandey, L., Hegde, R.M. Keyword Spotting in Continuous Speech Using Spectral and Prosodic Information Fusion. Circuits Syst Signal Process 38, 2767–2791 (2019). https://doi.org/10.1007/s00034-018-0990-6
