
A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video

  • Article
  • Published:
Phenomics

Abstract

Depression is one of the most common mental disorders, and its prevalence increases each year. Traditional diagnostic methods rely primarily on professional judgment, which is prone to individual bias. It is therefore crucial to design an effective and robust method for automated depression detection. Current artificial intelligence approaches are limited in their ability to extract features from long sentences, and existing models become less robust as the input dimensionality grows. To address these concerns, a multimodal fusion model combining text, audio, and video was developed for both depression detection and assessment. For the text modality, pre-trained sentence embeddings were used to extract semantic representations, which were fed to a Bidirectional long short-term memory (BiLSTM) network to predict depression. For the audio modality, Principal component analysis (PCA) was used to reduce the dimensionality of the input feature space, and a Support vector machine (SVM) was used to predict depression. For the video modality, Extreme gradient boosting (XGBoost) was employed to perform both feature selection and depression detection. The final predictions were obtained by combining the outputs of the different modalities through an ensemble voting algorithm. Experiments on the Distress analysis interview corpus Wizard-of-Oz (DAIC-WOZ) dataset showed a substantial improvement in performance, with a weighted F1 score of 0.85, a Root mean square error (RMSE) of 5.57, and a Mean absolute error (MAE) of 4.48. The proposed model outperforms the baseline on both depression detection and assessment tasks and performs better than other existing state-of-the-art depression detection methods.
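
As a rough illustration of the fusion scheme outlined in the abstract (a minimal sketch, not the authors' implementation), the example below trains one classifier per modality on synthetic features and combines their outputs by majority vote. PCA + SVM for audio and XGBoost for video follow the description above; a logistic regression over sentence-embedding-sized vectors stands in for the sentence-embedding + BiLSTM text branch, and all feature dimensions and data are illustrative assumptions.

# Minimal sketch: one classifier per modality, fused by majority voting.
# Synthetic data only; the actual system uses DAIC-WOZ features and a
# BiLSTM over pre-trained sentence embeddings for the text branch.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200                                # number of interview sessions (synthetic)
y = rng.integers(0, 2, size=n)         # binary depression label (e.g. PHQ-8 cutoff)

X_text = rng.normal(size=(n, 512))     # stand-in for sentence-embedding vectors
X_audio = rng.normal(size=(n, 74))     # stand-in for COVAREP-style acoustic features
X_video = rng.normal(size=(n, 68))     # stand-in for facial-landmark / FAU features

# Text branch: logistic regression used here only to keep the sketch short;
# the paper describes sentence embeddings fed to a BiLSTM.
text_clf = LogisticRegression(max_iter=1000).fit(X_text, y)

# Audio branch: PCA for dimensionality reduction, then an SVM classifier.
audio_clf = make_pipeline(PCA(n_components=20), SVC()).fit(X_audio, y)

# Video branch: XGBoost, which also performs implicit feature selection.
video_clf = XGBClassifier(n_estimators=100, max_depth=3).fit(X_video, y)

# Ensemble: simple majority vote over the three per-modality predictions.
votes = np.stack([
    text_clf.predict(X_text),
    audio_clf.predict(X_audio),
    video_clf.predict(X_video),
])
y_pred = (votes.sum(axis=0) >= 2).astype(int)
print("fused training accuracy:", (y_pred == y).mean())

The late-fusion design lets each modality keep the model best suited to its feature space (recurrent for sequential text, low-dimensional SVM for acoustics, tree boosting for high-dimensional visual descriptors) while the vote keeps the combination simple and robust to a single weak modality.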


Data Availability

The data sets used in the current study are available from https://dcapswoz.ict.usc.edu/.

Abbreviations

BiLSTM: Bidirectional long short-term memory
PCA: Principal component analysis
SVM: Support vector machine
XGBoost: Extreme gradient boosting
DAIC-WOZ: Distress analysis interview corpus Wizard-of-Oz
RMSE: Root mean square error
MAE: Mean absolute error
PHQ-8: Patient health questionnaire-8
BDI: Beck's depression inventory
AI: Artificial intelligence
ML: Machine learning
GloVe: Global vectors
CNN: Convolutional neural network
MFCC: Mel-frequency cepstral coefficient
COVAREP: Cooperative voice analysis repository
MHI: Motion history image
AVEC: Audio/visual emotion challenge
LSTM: Long short-term memory
COVID-19: Coronavirus disease 2019
BERT: Bidirectional encoder representations from transformers
USE: Universal sentence encoder
MSE: Mean squared error
BCE: Binary cross entropy
VUV: Voiced/unvoiced
F0: Fundamental frequency
NAQ: Normalized amplitude quotient
QOQ: Quasi-open quotient
H1H2: First two harmonics of the differentiated glottal source spectrum
PSP: Parabolic spectral parameter
MDQ: Maxima dispersion quotient
MCEP: Mel cepstral coefficient
HMPDM: Harmonic model and phase distortion mean
HMPDD: Harmonic model and phase distortion deviation
FAU: Facial action unit
KNN: K-nearest neighbors


Acknowledgements

We would like to acknowledge the funding support from MITACS, Canada. The authors would also like to thank Ryan Corpuz for proofreading the manuscript.

Funding

China Scholarship Council, 201606280044, Wei Zhang.

Author information


Contributions

WZ: Conceptual and experimental design, data analysis, manuscript preparation. KM: Data analysis. JC: Conceptual design, project supervision, obtaining funding, manuscript preparation.

Corresponding author

Correspondence to Jie Chen.

Ethics declarations

Conflict of Interest

The authors declare that there is no conflict of interest. Jie Chen is an Editorial Board member of Phenomics and was not involved in reviewing this paper.

Ethical Approval

All the methods were performed in accordance with the relevant guidelines and regulations.

Consent to Participate

All volunteers provided written informed consent.

Consent for Publication

Not applicable.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 17 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, W., Mao, K. & Chen, J. A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video. Phenomics (2024). https://doi.org/10.1007/s43657-023-00152-8

