Abstract
Dynamic speech properties such as time warping, silence, and background noise make continuous speech signal matching challenging. Among these, matching time-warped speech signals is of particular interest and has long been a tough problem for researchers. The literature offers a variety of techniques for measuring the similarity between speech utterances, but each has limitations. This paper introduces an adaptive-framing approach to continuous speech tracking and similarity measurement that uses a Kalman filter (KF) as a robust tracker; the use of a KF for time-warped speech signal matching and dynamic time warping is novel. A dynamic state model based on the equations of linear motion is presented: a fixed-length frame of the input (test) speech signal is treated as an object moving in one direction as it slides along the template speech signal. The state model estimates the best-matching position (sample number) in the template for the current test frame, while a feature-based distance metric simultaneously produces an independent position observation. The KF fuses the model estimate with this observation, weighted by their noise variances, to yield the best estimate of the template frame position for the current state. Finally, the noise variances and the template frame size for the next state are forecast from the KF output. Experimental results demonstrate the robustness of the proposed technique for time-warped speech signal matching, as well as its low computational cost.
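The fusion described in the abstract can be illustrated with a minimal sketch: a constant-velocity (linear-motion) Kalman filter tracks each test frame's position in the template, and a simple Euclidean frame distance stands in for the paper's feature-based distance metric. All names, the noise variances `q` and `r`, the search radius, and the frame length are illustrative assumptions, not the authors' actual parameters; the adaptive forecasting of noise variances and template frame size is omitted for brevity.

```python
import numpy as np

def best_match_observation(test_frame, template, search_center, search_radius):
    """Observe the template sample position whose window is closest (Euclidean
    distance) to the test frame, searching around the model's prediction.
    (A stand-in for the paper's feature-based distance metric.)"""
    n = len(test_frame)
    lo = max(0, int(search_center) - search_radius)
    hi = min(len(template) - n, int(search_center) + search_radius)
    dists = [np.linalg.norm(template[p:p + n] - test_frame)
             for p in range(lo, hi + 1)]
    return lo + int(np.argmin(dists))

def track(test, template, frame_len=160, q=1.0, r=25.0, search_radius=400):
    """Track each fixed-length test frame's best-matching template position
    with a constant-velocity Kalman filter (state = [position, velocity]).
    q and r are assumed process/measurement noise variances."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # linear-motion state transition
    H = np.array([[1.0, 0.0]])               # only position is observed
    Q = q * np.eye(2)
    R = np.array([[r]])
    x = np.array([0.0, float(frame_len)])    # start at the template origin
    P = np.eye(2) * 100.0                    # large initial uncertainty
    path = []
    for start in range(0, len(test) - frame_len + 1, frame_len):
        frame = test[start:start + frame_len]
        # Predict with the motion model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Observe via the distance metric, then fuse prediction and observation.
        z = best_match_observation(frame, template, x[0], search_radius)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([float(z)]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        path.append(int(round(x[0])))
    return path
```

For an unwarped test signal that is simply a prefix of the template, the recovered path should advance by roughly one frame length per step; under time warping, the fused estimate lets the tracked position speed up or slow down relative to that nominal rate.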
Additional information
Rob Holton is Head of Computer Science at the University of Bradford.
Cite this article
Khan, W., Holton, R. Time warped continuous speech signal matching using Kalman filter. Int J Speech Technol 18, 419–431 (2015). https://doi.org/10.1007/s10772-015-9277-5