Study of articulators’ contribution and compensation during speech by articulatory speech recognition

  • Jianguo Wei
  • Yan Ji
  • Jingshu Zhang
  • Qiang Fang
  • Wenhuan Lu
  • Kiyoshi Honda
  • Xugang Lu

Abstract

In this paper, the contribution of dynamic articulatory information is evaluated with an articulatory speech recognition system. Electromagnetic articulographic (EMA) data are difficult to record, and the resulting datasets are small compared with the speech corpora commonly used in modern speech research. We used such articulatory data within a DNN framework to study the contribution of each observation channel of the vocal tract to speech recognition, and analyzed the recognition results for each phoneme in light of speech production rules. The contribution rate of each articulator can be interpreted as how crucial that articulator is for producing a given phoneme. The results further indicate that the contribution of each observation point does not depend on the particular recognition method, and that the trend of each sensor's contribution agrees with the rules of Japanese phonology. We also evaluated the compensation effect between different channels and found that crucial points are harder to compensate for than non-crucial points. The proposed method helps identify the crucial points of each phoneme during speech, and the results can contribute to the study of speech production and to articulatory-based speech recognition.
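
As a rough illustration of the channel-contribution analysis described above, the sketch below ablates one hypothetical EMA channel at a time and measures the resulting drop in phoneme classification accuracy, reading a larger drop as a larger contribution. This is a minimal sketch only: the synthetic data, channel names, and scikit-learn classifier are assumptions made for illustration and are not the authors' DNN-based recognition system.

```python
# Minimal sketch of a channel-ablation experiment for estimating each
# articulator's contribution to phoneme recognition. NOT the authors'
# pipeline (they used a DNN framework over real EMA recordings); the
# synthetic data, channel names, and classifier are illustrative only.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical EMA channels (x/y coordinates of sensor coils).
CHANNELS = ["tongue_tip", "tongue_body", "tongue_dorsum",
            "lower_lip", "upper_lip", "jaw"]
DIMS_PER_CHANNEL = 2                      # x and y per coil
N_PHONEMES, N_FRAMES = 10, 4000

# Synthetic stand-in for articulatory frames labelled with phonemes.
X = rng.normal(size=(N_FRAMES, len(CHANNELS) * DIMS_PER_CHANNEL))
y = rng.integers(0, N_PHONEMES, size=N_FRAMES)
# Make some channels informative so the ablation has something to find.
X[:, 0:2] += y[:, None] * 0.8             # tongue_tip: strongly phoneme-dependent
X[:, 6:8] += y[:, None] * 0.3             # lower_lip: weakly phoneme-dependent

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def train_and_score(train_X, test_X):
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
    clf.fit(train_X, y_tr)
    return accuracy_score(y_te, clf.predict(test_X))

baseline = train_and_score(X_tr, X_te)
print(f"baseline accuracy (all channels): {baseline:.3f}")

# Ablate one channel at a time: retrain without it and measure the drop.
for i, name in enumerate(CHANNELS):
    keep = [c for c in range(X.shape[1]) if c // DIMS_PER_CHANNEL != i]
    acc = train_and_score(X_tr[:, keep], X_te[:, keep])
    print(f"without {name:13s}: accuracy {acc:.3f}  (drop {baseline - acc:+.3f})")
```

The same ablation logic could in principle also be applied at the phoneme level (per-phoneme accuracy drops) to approximate the per-articulator "crucial level" discussed in the abstract.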

Keywords

DNN; Articulatory recognition; Articulators' contribution; Crucial level; Compensation

Acknowledgements

This work was supported in part by grants from the National Natural Science Foundation of China (General Program No. 61471259 and Key Program No. 61233009) and in part by the NSFC of Tianjin (No. 16JCZDJC35400).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Jianguo Wei (1)
  • Yan Ji (2)
  • Jingshu Zhang (2)
  • Qiang Fang (3)
  • Wenhuan Lu (1)
  • Kiyoshi Honda (2)
  • Xugang Lu (4)

  1. School of Computer Software, Tianjin University, Tianjin, China
  2. School of Computer Science and Technology, Tianjin University, Tianjin, China
  3. Chinese Academy of Social Sciences, Beijing, China
  4. NICT, Tokyo, Japan
