Efficient Audio-Visual Speaker Recognition via Deep Heterogeneous Feature Fusion

  • Yu-Hang Liu
  • Xin Liu
  • Wentao Fan
  • Bineng Zhong
  • Ji-Xiang Du
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10568)


Audio-visual speaker recognition (AVSR) has long been an active research area, primarily because the complementary information between the two modalities enables reliable access control in biometric systems; it remains a challenging problem, mainly owing to its multimodal nature. In this paper, we present an efficient audio-visual speaker recognition approach based on deep heterogeneous feature fusion. First, we exploit a dual-branch deep convolutional neural network (CNN) learning framework to extract and fuse the high-level semantic features of face and audio data. Further, to account for the temporal dependency of audio-visual data, we feed the fused features into a bidirectional Long Short-Term Memory (LSTM) network to produce the recognition result, through which speakers acquired under different challenging conditions can be well identified. The experimental results demonstrate the efficiency of the proposed approach in both audio-visual feature fusion and speaker recognition.
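The pipeline described above — two modality-specific CNN branches whose outputs are fused at the feature level and then passed through a bidirectional LSTM for classification — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, input shapes (64x64 face crops, 40x40 audio feature patches), concatenation-based fusion, and last-step classification head are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class DualBranchAVSR(nn.Module):
    """Sketch of a dual-branch CNN + BiLSTM audio-visual speaker recognizer.
    All architectural details here are illustrative assumptions."""

    def __init__(self, n_speakers=10, feat_dim=128):
        super().__init__()
        # Face branch: small CNN over per-frame face crops (3, 64, 64)
        self.face_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim),
        )
        # Audio branch: CNN over per-step spectral patches (1, 40, 40),
        # e.g. stacked MFCC frames
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim),
        )
        # Bidirectional LSTM models the temporal dependency of the
        # per-step fused (concatenated) features
        self.lstm = nn.LSTM(2 * feat_dim, feat_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * feat_dim, n_speakers)

    def forward(self, faces, audio):
        # faces: (B, T, 3, 64, 64); audio: (B, T, 1, 40, 40)
        B, T = faces.shape[:2]
        f = self.face_cnn(faces.flatten(0, 1)).view(B, T, -1)
        a = self.audio_cnn(audio.flatten(0, 1)).view(B, T, -1)
        fused = torch.cat([f, a], dim=-1)   # feature-level fusion
        out, _ = self.lstm(fused)           # (B, T, 2 * feat_dim)
        return self.classifier(out[:, -1])  # speaker logits from last step
```

Fusing before the recurrent layer (rather than averaging per-modality scores) lets the LSTM exploit cross-modal correlations at every time step, which matches the feature-level fusion idea described in the abstract.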


Keywords: Audio-visual speaker recognition · Deep heterogeneous feature fusion · Dual-branch deep CNN · Bidirectional LSTM



The work described in this paper was supported by the National Natural Science Foundation of China (Nos. 61673185, 61502183, 61572205, 61673186), the Natural Science Foundation of Fujian Province (No. 2017J01112), the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research (No. ZQN-PY309), and the Promotion Program for Graduate Students in Scientific Research and Innovation Ability of Huaqiao University (No. 1611314014).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Yu-Hang Liu (1, 2)
  • Xin Liu (1, 2)
  • Wentao Fan (1, 2)
  • Bineng Zhong (1, 2)
  • Ji-Xiang Du (1, 2)

  1. Department of Computer Science, Huaqiao University, Xiamen, China
  2. Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University, Xiamen, China
