Keyword Spotting in Continuous Speech Using Spectral and Prosodic Information Fusion

Abstract

Keyword spotting in continuous speech is a challenging problem with relevance to applications such as audio indexing and music retrieval. In this work, keyword spotting is addressed by exploiting the complementary information present in the spectral and prosodic features of the speech signal. A thorough analysis of this complementary information is performed on a large Hindi-language database developed for this purpose. Toward this end, phonetic and prosodic distribution analyses are carried out using canonical correlation and the Student's t-distance function. Motivated by these analyses, novel methods for spectral and prosodic information fusion that optimize a combined error function are proposed. The fusion methods are developed at both the feature and the model level. These methods yield improved syllable sequence prediction and keyword spotting performance compared to conventional keyword spotting methods. Additionally, to enable comparison with state-of-the-art deep learning-based methods, a novel method for improved syllable sequence prediction using deep denoising autoencoders is proposed. The performance of the proposed methods is evaluated for keyword spotting using a syllable sliding protocol over a large Hindi database. Reasonable performance improvements are observed in the experimental results on syllable sequence prediction, keyword spotting, and audio retrieval.
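
The sketch below is a minimal illustration, not the paper's implementation, of the kind of canonical-correlation analysis the abstract describes: quantifying how much shared versus complementary information the spectral and prosodic feature streams carry. The feature matrices here are random stand-ins for per-frame features; the dimensionalities (13-dimensional spectral, 3-dimensional prosodic) are assumptions for illustration only.

```python
# Minimal sketch of canonical correlation between two feature streams.
# Random matrices stand in for real per-frame spectral (e.g. MFCC) and
# prosodic (e.g. pitch, energy, duration) features.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_frames = 1000
spectral = rng.normal(size=(n_frames, 13))  # stand-in for 13-dim spectral features
prosodic = rng.normal(size=(n_frames, 3))   # stand-in for 3-dim prosodic features

# Project both streams onto maximally correlated canonical directions.
cca = CCA(n_components=3)
spec_c, pros_c = cca.fit_transform(spectral, prosodic)

# Per-component canonical correlations: values near zero suggest the
# streams are largely complementary rather than redundant.
corrs = [np.corrcoef(spec_c[:, k], pros_c[:, k])[0, 1] for k in range(3)]
print("canonical correlations:", np.round(corrs, 3))
```

With genuinely complementary streams, low canonical correlations indicate that fusing the two feature sets can add information that neither carries alone, which is the premise behind the fusion methods proposed in the paper.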

Notes

  1. Encodes categorical data using a one-of-K scheme (a toy example follows these notes).

  2. newsonair.nic.in.

  3. www.youtube.com.
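
As a minimal illustration of the one-of-K (one-hot) scheme mentioned in note 1, the snippet below encodes a toy set of syllable labels. The labels are hypothetical placeholders, not drawn from the paper's database.

```python
# One-of-K (one-hot) encoding of a toy syllable label set.
import numpy as np

labels = ["ka", "ma", "ra", "ka"]        # categorical data (hypothetical)
vocab = sorted(set(labels))              # K distinct categories
index = {s: i for i, s in enumerate(vocab)}

one_hot = np.zeros((len(labels), len(vocab)), dtype=int)
for row, s in enumerate(labels):
    one_hot[row, index[s]] = 1           # exactly one 1 per row

print(vocab)     # ['ka', 'ma', 'ra']
print(one_hot)   # each row selects one of the K categories
```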

Author information

Corresponding author: Laxmi Pandey.

About this article

Cite this article

Pandey, L., Hegde, R.M. Keyword Spotting in Continuous Speech Using Spectral and Prosodic Information Fusion. Circuits Syst Signal Process 38, 2767–2791 (2019). https://doi.org/10.1007/s00034-018-0990-6
