Chinese Dialects Identification Using Attention-Based Deep Neural Networks
This paper presents a novel Chinese dialect identification system. Attention-based deep neural networks (AB-DNN) serve as the back-end that models the Chinese dialects. The front-end fuses the identity vector (i-vector) with global prosodic information, and this fused input is used to describe dialectal category information accurately. Five Chinese dialects (Min, Yue, Wu, Jianghuai, and Zhongyuan), together with standard Mandarin, are selected as the identification targets. Experimental results show that AB-DNN obtains a 21.1% relative reduction in equal error rate (EER) compared with regular deep neural networks (DNN), and a further 14.5% reduction when the global fusion features are applied. Overall, AB-DNN combined with global fusion features achieves a 29.2% performance improvement over a traditional DNN with MFCC features.
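The abstract does not specify the exact fusion rule, attention formulation, or scoring procedure, so the following is only a minimal sketch of the kinds of computations involved, under stated assumptions. It assumes that "global fusion" means concatenating an utterance-level i-vector with an utterance-level prosodic vector. It also assumes the attention is feed-forward attention of the kind proposed by Raffel and Ellis: each frame representation h_t gets a score e_t, the scores are normalized with a softmax into weights alpha_t, and the frames are pooled into a context vector c = sum_t alpha_t h_t. Finally, it shows how EER and a relative EER reduction are computed. All function names and the scoring vector `w` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def fuse_features(ivector: Sequence[float], prosody: Sequence[float]) -> List[float]:
    """Hypothetical 'global fusion': concatenate an utterance-level i-vector
    with an utterance-level prosodic feature vector (assumed fusion rule)."""
    return list(ivector) + list(prosody)


def attention_pool(frames: Sequence[Sequence[float]], w: Sequence[float], b: float = 0.0) -> List[float]:
    """Feed-forward attention pooling (assumed form):
    e_t = tanh(w . h_t + b), alpha = softmax(e), c = sum_t alpha_t * h_t."""
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in frames]
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]  # attention weights, sum to 1
    dim = len(frames[0])
    # Weighted sum of frame vectors, computed per dimension
    return [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]


def equal_error_rate(target: Sequence[float], nontarget: Sequence[float]) -> float:
    """Approximate EER: sweep every observed score as a decision threshold and
    return the mean of FAR and FRR at the threshold where they are closest."""
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(target) | set(nontarget)):
        frr = sum(s < thr for s in target) / len(target)          # true targets rejected
        far = sum(s >= thr for s in nontarget) / len(nontarget)   # non-targets accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction; 0.211 corresponds to a 21.1% reduction."""
    return (baseline - new) / baseline
```

For example, with equal scores for every frame, `attention_pool([[1.0, 0.0], [0.0, 1.0]], w=[0.0, 0.0])` assigns each frame a weight of 0.5 and returns `[0.5, 0.5]`. A real system would learn `w` and `b` jointly with the rest of the network rather than fixing them by hand.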
Keywords: Chinese dialects identification; Attention mechanism; Global fusion features; Attention-based deep neural networks
This work is supported by the National Natural Science Foundation of China under grants No. 61040053 and No. 61673196, and by the Key Project of Philosophy and Social Science Research in Colleges and Universities in Jiangsu Province under grant No. 2012JDXM016.