
A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video

  • Article
  • Published:
Phenomics

Abstract

Depression is one of the most common mental disorders, and its prevalence increases each year. Traditional diagnostic methods rely primarily on professional judgment, which is prone to individual bias. It is therefore crucial to design an effective and robust method for automated depression detection. Current artificial intelligence approaches are limited in their ability to extract features from long sentences, and existing models become less robust as the input dimensionality grows. To address these concerns, a multimodal fusion model combining text, audio, and video was developed for both depression detection and assessment. For the text modality, pre-trained sentence embeddings were used to extract semantic representations, which were fed to a Bidirectional long short-term memory (BiLSTM) network to predict depression. For the audio modality, Principal component analysis (PCA) was used to reduce the dimensionality of the input feature space, and a Support vector machine (SVM) was used to predict depression. For the video modality, Extreme gradient boosting (XGBoost) was employed to perform both feature selection and depression detection. The final predictions were obtained by combining the outputs of the different modalities through an ensemble voting algorithm. Experiments on the Distress analysis interview corpus Wizard-of-Oz (DAIC-WOZ) dataset showed a substantial improvement in performance, with a weighted F1 score of 0.85, a Root mean square error (RMSE) of 5.57, and a Mean absolute error (MAE) of 4.48. The proposed model outperforms the baseline on both depression detection and assessment tasks and performs better than other existing state-of-the-art depression detection methods.
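
As a rough illustration of the fusion scheme outlined in the abstract (a minimal sketch, not the authors' implementation), the example below trains one classifier per modality on synthetic features and combines their outputs by majority vote. PCA + SVM for audio and XGBoost for video follow the description above; a logistic regression over sentence-embedding-sized vectors stands in for the sentence-embedding + BiLSTM text branch, and all feature dimensions and data are illustrative assumptions.

# Minimal sketch: one classifier per modality, fused by majority voting.
# Synthetic data only; the actual system uses DAIC-WOZ features and a
# BiLSTM over pre-trained sentence embeddings for the text branch.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200                                # number of interview sessions (synthetic)
y = rng.integers(0, 2, size=n)         # binary depression label (e.g. PHQ-8 cutoff)

X_text = rng.normal(size=(n, 512))     # stand-in for sentence-embedding vectors
X_audio = rng.normal(size=(n, 74))     # stand-in for COVAREP-style acoustic features
X_video = rng.normal(size=(n, 68))     # stand-in for facial-landmark / FAU features

# Text branch: logistic regression used here only to keep the sketch short;
# the paper describes sentence embeddings fed to a BiLSTM.
text_clf = LogisticRegression(max_iter=1000).fit(X_text, y)

# Audio branch: PCA for dimensionality reduction, then an SVM classifier.
audio_clf = make_pipeline(PCA(n_components=20), SVC()).fit(X_audio, y)

# Video branch: XGBoost, which also performs implicit feature selection.
video_clf = XGBClassifier(n_estimators=100, max_depth=3).fit(X_video, y)

# Ensemble: simple majority vote over the three per-modality predictions.
votes = np.stack([
    text_clf.predict(X_text),
    audio_clf.predict(X_audio),
    video_clf.predict(X_video),
])
y_pred = (votes.sum(axis=0) >= 2).astype(int)
print("fused training accuracy:", (y_pred == y).mean())

The late-fusion design lets each modality keep the model best suited to its feature space (recurrent for sequential text, low-dimensional SVM for acoustics, tree boosting for high-dimensional visual descriptors) while the vote keeps the combination simple and robust to a single weak modality.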


Data Availability

The data sets used in the current study are available from https://dcapswoz.ict.usc.edu/.

Abbreviations

BiLSTM: Bidirectional long short-term memory
PCA: Principal component analysis
SVM: Support vector machine
XGBoost: Extreme gradient boosting
DAIC-WOZ: Distress analysis interview corpus Wizard-of-Oz
RMSE: Root mean square error
MAE: Mean absolute error
PHQ-8: Patient health questionnaire-8
BDI: Beck's depression inventory
AI: Artificial intelligence
ML: Machine learning
GloVe: Global vectors
CNN: Convolutional neural network
MFCC: Mel-frequency cepstral coefficient
COVAREP: Cooperative voice analysis repository
MHI: Motion history image
AVEC: Audio/visual emotion challenge
LSTM: Long short-term memory
COVID-19: Coronavirus disease 2019
BERT: Bidirectional encoder representations from transformers
USE: Universal sentence encoder
MSE: Mean squared error
BCE: Binary cross entropy
VUV: Voiced/unvoiced
F0: Fundamental frequency
NAQ: Normalized amplitude quotient
QOQ: Quasi-open quotient
H1H2: First two harmonics of the differentiated glottal source spectrum
PSP: Parabolic spectral parameter
MDQ: Maxima dispersion quotient
MCEP: Mel cepstral coefficient
HMPDM: Harmonic model and phase distortion mean
HMPDD: Harmonic model and phase distortion deviation
FAU: Facial action unit
KNN: K-nearest neighbors


Acknowledgements

We would like to acknowledge the funding support from MITACS, Canada. The authors would also like to thank Ryan Corpuz for proofreading the manuscript.

Funding

China Scholarship Council, 201606280044, Wei Zhang.

Author information


Contributions

WZ: Conceptual and experimental design, data analysis, manuscript preparation. KM: Data analysis. JC: Conceptual design, project supervision, obtaining funding, manuscript preparation.

Corresponding author

Correspondence to Jie Chen.

Ethics declarations

Conflict of Interest

The authors declare that there is no conflict of interest. Jie Chen is an Editorial Board member of Phenomics and was not involved in reviewing this paper.

Ethical Approval

All the methods were performed in accordance with the relevant guidelines and regulations.

Consent to Participate

All volunteers provided written informed consent.

Consent for Publication

Not applicable.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 17 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, W., Mao, K. & Chen, J. A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video. Phenomics (2024). https://doi.org/10.1007/s43657-023-00152-8

