Efficient Audio-Visual Speaker Recognition via Deep Heterogeneous Feature Fusion
Audio-visual speaker recognition (AVSR) has long been an active research area, primarily because the two modalities provide complementary information for reliable access control in biometric systems; it remains a challenging problem, mainly attributable to its multimodal nature. In this paper, we present an efficient audio-visual speaker recognition approach via deep heterogeneous feature fusion. First, we exploit a dual-branch deep convolutional neural network (CNN) learning framework to extract and fuse the high-level semantic features of face and audio data. Further, considering the temporal dependency of audio-visual data, we feed the fused features into a bidirectional Long Short-Term Memory (LSTM) network to produce the recognition result, through which speakers recorded under different challenging conditions can be reliably identified. The experimental results demonstrate the efficiency of our proposed approach in both audio-visual feature fusion and speaker recognition.
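The pipeline described above can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: the embedding sizes, fusion by per-time-step concatenation, and the toy recurrence standing in for the CNN branches and the bidirectional LSTM are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes; the paper does not specify these exact values.
FACE_DIM, AUDIO_DIM, HIDDEN = 128, 64, 32

def cnn_branch(frames, out_dim):
    """Stand-in for one CNN branch: projects each frame to a semantic feature
    vector with a ReLU-like nonlinearity."""
    w = rng.standard_normal((frames.shape[1], out_dim)) * 0.01
    return np.maximum(frames @ w, 0.0)

def fuse(face_feats, audio_feats):
    """Feature-level fusion by concatenation per time step (one common choice)."""
    return np.concatenate([face_feats, audio_feats], axis=1)

def bilstm_like(seq, hidden):
    """Toy bidirectional recurrence: a forward and a backward pass over the
    fused sequence, with their hidden states concatenated per time step."""
    def run(s):
        h = np.zeros(hidden)
        w_x = rng.standard_normal((s.shape[1], hidden)) * 0.01
        w_h = rng.standard_normal((hidden, hidden)) * 0.01
        outs = []
        for x in s:
            h = np.tanh(x @ w_x + h @ w_h)
            outs.append(h)
        return np.stack(outs)
    fwd = run(seq)
    bwd = run(seq[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=1)

T = 10                                  # time steps in one utterance clip
face = rng.standard_normal((T, 512))    # per-frame face features (illustrative)
audio = rng.standard_normal((T, 256))   # per-frame audio features (illustrative)

fused = fuse(cnn_branch(face, FACE_DIM), cnn_branch(audio, AUDIO_DIM))
out = bilstm_like(fused, HIDDEN)
print(fused.shape, out.shape)  # (10, 192) (10, 64)
```

A real system would replace `cnn_branch` with trained convolutional networks over face crops and spectrograms, and `bilstm_like` with a learned bidirectional LSTM followed by a speaker-classification layer; the sketch only shows how the fused per-time-step features flow into the temporal model.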
Keywords: Audio-visual speaker recognition · Deep heterogeneous feature fusion · Dual-branch deep CNN · Bidirectional LSTM
The work described in this paper was supported by the National Science Foundation of China (Nos. 61673185, 61502183, 61572205, 61673186), the Natural Science Foundation of Fujian Province (No. 2017J01112), the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research (No. ZQN-PY309), and the Promotion Program for Graduate Students' Scientific Research and Innovation Ability of Huaqiao University (No. 1611314014).