Leveraging a Small Corpus by Different Frame Shifts for Training of a Speech Recognizer

  • Akinori Ito
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 110)

Abstract

During feature extraction for speech recognition, a window function is first applied to the input waveform to extract a temporally limited spectrum. By shifting the window by a short time period, we can analyze the temporal change of the speech spectrum. This time period is called the “frame shift” and is usually 5 to 10 ms. In this paper, the frame shift is reconsidered from two aspects. The first is the appropriateness of 10 ms as the frame shift. Frame-based processing rests on the assumption that the speech spectrum changes slowly relative to the frame shift, which does not hold for certain consonants such as plosives. This paper therefore shows experimentally that feature values fluctuate considerably depending on the starting position of the frame. A training method is then proposed that uses temporally shifted samples as independent training samples to compensate for the feature fluctuation caused by differences in the frame’s starting position. The second aspect is that the frame shift could be made longer if this fluctuation is compensated for. To verify this, an experiment was conducted in which the frame shift was varied from 10 to 60 ms; a 40 ms frame shift outperformed the 10 ms frame shift, and recognition performance comparable to the 10 ms result was obtained even with a 50 ms frame shift.
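To make the framing procedure concrete, the minimal sketch below computes MFCC features with a configurable frame shift and generates several sub-frame-shift offsets of the same waveform, each treated as an independent training sample. It uses librosa as a stand-in front end; the function names, parameter values, and offset scheme are illustrative assumptions, not the paper’s exact implementation.

```python
# Minimal sketch (not the paper's implementation): frame-based feature
# extraction with a configurable frame shift, plus temporally shifted
# copies of one utterance used as independent training samples.
import numpy as np
import librosa

def extract_mfcc(wave, sr, frame_shift_ms=10.0, frame_len_ms=25.0):
    """Extract MFCCs with a given frame shift (hop) in milliseconds."""
    hop = int(sr * frame_shift_ms / 1000)   # frame shift in samples
    win = int(sr * frame_len_ms / 1000)     # window length in samples
    return librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=hop,
                                win_length=win).T  # shape (frames, 13)

def shifted_training_samples(wave, sr, frame_shift_ms=40.0, n_shifts=4):
    """Generate feature sequences from waveform copies offset by evenly
    spaced sub-frame-shift amounts; each sequence is then treated as an
    independent training sample (an assumption about the scheme)."""
    shift_samples = int(sr * frame_shift_ms / 1000)
    samples = []
    for k in range(n_shifts):
        offset = k * shift_samples // n_shifts  # evenly spaced offsets
        samples.append(extract_mfcc(wave[offset:], sr, frame_shift_ms))
    return samples
```

Under these assumptions, a 40 ms frame shift with n_shifts=4 places the offsets 10 ms apart, so the shifted copies together sample the waveform roughly as densely as a conventional 10 ms shift while each individual copy keeps the longer, cheaper frame shift.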

Keywords

Speech recognition · Windowing · Frame shift

Notes

Acknowledgment

Part of this work was supported by JSPS Kakenhi JP17H00823.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Tohoku University, Sendai, Japan