Because traditional methods of diagnosing depression suffer from a high false positive rate, this paper proposes a multimodal fusion algorithm for depression diagnosis based on speech signals and facial image sequences. Spectral subtraction is introduced to enhance the depressed speech signal, and the cepstrum method is used to extract pitch-frequency features with a large variation rate and formant features with significant differences; the short-time energy and Mel-frequency cepstral coefficient (MFCC) parameters of speech under different emotions are analyzed in both the time and frequency domains, and a model is established for training and recognition. Meanwhile, the orthogonal matching pursuit algorithm is applied to obtain a sparse linear combination of face test samples, and the speech-based and facial-emotion-based results are cascaded in proportion. Experimental results show that the recognition rate of the proposed fusion algorithm reaches 81.14%. Compared with the reported physicians' accuracy of 47.3%, the proposed method yields a 71.54% relative improvement. Moreover, it can be deployed at low cost on the hardware and software of existing hospital instruments. It is therefore an accurate and effective method for diagnosing depression.
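The speech-enhancement step named above, spectral subtraction, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame length, the use of non-overlapping frames, and the assumption that the first few frames are speech-free noise are all simplifications chosen here for brevity.

```python
import numpy as np

def spectral_subtraction(signal, frame_len=256, noise_frames=5):
    """Enhance a noisy speech signal by subtracting an estimated
    noise magnitude spectrum (classic spectral subtraction)."""
    # Split the signal into non-overlapping frames (no windowing or
    # overlap-add, to keep the sketch short).
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    # Estimate the noise magnitude from the first few frames,
    # assumed (for this sketch) to contain no speech.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the noise estimate and floor at zero so no
    # frequency bin gets a negative magnitude.
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    # Rebuild the time-domain signal using the original phase.
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase),
                         n=frame_len, axis=1)
    return clean.reshape(-1)
```

Because the subtraction only ever shrinks spectral magnitudes, the enhanced output never gains energy over the input; a production system would add overlap-add windowing and an over-subtraction factor to reduce musical noise.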
Keywords: Depression diagnosis · Pitch frequency · Sparse representation · Speech signal · Facial emotions · Multimodal fusion
This work was supported by the National Natural Science Foundation of China (No. 61379010).