Journal of Signal Processing Systems, Volume 90, Issue 7, pp 975–983

Auxiliary Features from Laser-Doppler Vibrometer Sensor for Deep Neural Network Based Robust Speech Recognition

  • Lei Sun
  • Jun Du
  • Zhipeng Xie
  • Yong Xu


Recently, signals captured by a laser Doppler vibrometer (LDV) sensor have been shown to improve the noise robustness of automatic speech recognition (ASR) systems when used to enhance the acoustic signal prior to feature extraction. In this study, an alternative approach is proposed to further improve the performance of ASR systems that use a deep neural network (DNN) for acoustic modeling: concatenating auxiliary features extracted from the LDV signal with the conventional acoustic features. Preliminary experiments on a small stereo data set containing both LDV and acoustic signals demonstrate its effectiveness. To leverage existing large-scale speech databases, a regression DNN is then designed to map acoustic features to LDV features. It is trained on the limited-size stereo data set and then used to generate pseudo-LDV features from a massive speech data set for parallel training of an ASR system. Our experiments verify that both the features from the limited-scale LDV data set and the massive-scale pseudo-LDV features yield significant improvements in recognition performance over a system using purely acoustic features, in both quiet and noisy environments.
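The two ideas in the abstract, frame-wise feature concatenation and a regression network that maps acoustic features to LDV features, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature dimensions (8 acoustic, 4 LDV), the synthetic data, the single tanh hidden layer, the plain gradient-descent training, and the names `concat_features` and `RegressionMLP`.

```python
import numpy as np

def concat_features(acoustic, ldv):
    """Frame-wise concatenation: (T, Da) and (T, Dl) -> (T, Da + Dl)."""
    if acoustic.shape[0] != ldv.shape[0]:
        raise ValueError("acoustic and LDV streams need the same number of frames")
    return np.concatenate([acoustic, ldv], axis=1)

class RegressionMLP:
    """Hypothetical one-hidden-layer regression net trained with mean squared error."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def predict(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

    def fit(self, x, y, epochs=500, lr=0.05):
        n = x.shape[0]
        for _ in range(epochs):
            # Forward pass
            h = np.tanh(x @ self.W1 + self.b1)
            err = (h @ self.W2 + self.b2) - y
            # Backward pass for the loss 0.5 * mean over frames of ||err||^2
            grad_W2 = h.T @ err / n
            grad_b2 = err.mean(axis=0)
            dh = (err @ self.W2.T) * (1.0 - h ** 2)
            grad_W1 = x.T @ dh / n
            grad_b1 = dh.mean(axis=0)
            # Full-batch gradient-descent update
            self.W2 -= lr * grad_W2
            self.b2 -= lr * grad_b2
            self.W1 -= lr * grad_W1
            self.b1 -= lr * grad_b1
        return self

# Synthetic stand-ins (assumed): a small stereo set with both streams,
# and a larger set that has acoustic features only.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 4)) / np.sqrt(8)
stereo_acoustic = rng.normal(size=(200, 8))
stereo_ldv = np.tanh(stereo_acoustic @ A)
large_acoustic = rng.normal(size=(2000, 8))

# Train the acoustic-to-LDV regression on the small stereo set.
model = RegressionMLP(in_dim=8, hidden_dim=16, out_dim=4)
mse_before = np.mean((model.predict(stereo_acoustic) - stereo_ldv) ** 2)
model.fit(stereo_acoustic, stereo_ldv)
mse_after = np.mean((model.predict(stereo_acoustic) - stereo_ldv) ** 2)

# Generate pseudo-LDV features for the large set and build the ASR input.
pseudo_ldv = model.predict(large_acoustic)
train_feats = concat_features(large_acoustic, pseudo_ldv)
```

In this sketch `train_feats` plays the role of the concatenated input that an acoustic-model DNN would be trained on; the acoustic model itself is omitted.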


Laser Doppler vibrometer; Auxiliary features; Deep neural network; Regression model; Speech recognition



This work was supported in part by the National Natural Science Foundation of China under Grants 61671422 and U1613211, in part by the National Key Research and Development Program of China under Grant 2017YFB1002200, and in part by the MOE-Microsoft Key Laboratory of USTC.



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. University of Science and Technology of China, Hefei, People’s Republic of China
  2. iFlytek Research, Hefei, China
  3. University of Surrey, Guildford GU2, UK
