
1 Introduction

According to the World Health Organization (WHO), 800,000 people die by suicide every year [Footnote 1], making it the second leading cause of death among people aged 15 to 29. Many of these suicides can be prevented: mental disorders such as depression contribute to a large share of them worldwide, so early detection and appropriate management are key to ensuring that people receive the care they need [Footnote 2]. Patients suffering from Major Depressive Disorder (MDD) and Bipolar Disorder (BD) should therefore be continuously monitored, since they may experience frequent depressive episodes.

Several mental health monitoring approaches using mobile devices have been proposed. Most of them [1,2,3, 5,6,7,8,9,10,11, 13, 14] are based on (1) collecting and analyzing smartphone features such as activity, localization, and phone calls, and (2) launching interactive questionnaires such as the PHQ-9 [Footnote 3] and the BDI [Footnote 4]. "Active" monitoring approaches (i.e., requiring the patient's intervention) are less used and less effective in practice than "passive" ones (i.e., requiring no intervention): patients dislike answering questionnaires every day, and even when they do, the honesty of their answers cannot be guaranteed.

Passive monitoring approaches [3, 5,6,7,8, 10, 11, 13, 14] can be divided into two categories. The first category uses most of the smartphone's sensors, such as the accelerometer and camera, to collect related features [5,6,7, 11, 14]. These approaches require special environmental conditions to ensure the precision of the collected data. For instance, Maxhuni et al. [11] require the smartphone to be held in a particular position to accurately capture the patient's activity, which makes such approaches difficult to use in practice. The second category [3, 8, 10, 13] focuses on analyzing "only" voice sounds recorded from phone calls; indeed, acoustic changes in speech allow us to detect depressive episodes [8]. Most of these approaches rely on complex speech-analysis algorithms, such as deep learning, which typically require significant computational resources. They therefore send recorded phone calls to an external server for analysis, since running complex algorithms on smartphones is still a challenging task [12]. This limits the adoption of such mobile applications, because they do not protect data privacy: many patients do not accept that external servers process their recorded phone calls.

In this paper, we present a novel deep learning approach that can be loaded and executed on a patient's smartphone to allow real-time detection of depressive episodes. Our proposal, called DL4DED (Deep Learning for Depressive Episode Detection), is based on optimizing and compressing a deep learning model so that it can be integrated into our mobile application. DL4DED (1) records phone calls, (2) executes our deep learning model, and (3) triggers alerts if a depressive episode is detected. DL4DED does not send recorded phone calls to an external server and consequently preserves data privacy. To evaluate our approach, we conducted two groups of experiments: the first illustrates the efficiency of DL4DED in terms of accuracy, and the second demonstrates that its power consumption is reasonable compared to a baseline approach.

The paper is organized as follows. Section 2 discusses related work. DL4DED and its implementation are detailed in Sects. 3 and 4, respectively. Section 5 presents experiments and results. Section 6 concludes the paper and outlines areas for future research.

2 Related Work

Cornet et al. [1] published a comprehensive survey of existing mobile applications for mental health diseases (e.g., BD, MDD, schizophrenia). The survey covers the different parameters that have been used over the years to detect mental health diseases, usually extracted from smartphone sensors such as the microphone (phone calls), accelerometer (movement data), and GPS (localization). The authors also describe the analysis algorithms employed, the most advanced of which are based on deep learning networks. Moreover, they discuss the data privacy issues shared by all reviewed approaches, since all extracted smartphone features are sent to an external server for analysis.

Haque et al. [6] present a deep learning model to measure the severity of depression symptoms. The proposed approach is based on analyzing 3D facial expressions and spoken language, and has been evaluated on the DAIC-WOZ dataset [4]. It requires active intervention from the patient, which makes it unusable in some cases: it is hard to ask a patient to record videos daily, notably due to privacy issues. In contrast, our approach is passive, analyzes spontaneous phone calls, and preserves data privacy.

Su et al. [14] present a diagnosis assistance system based on deep learning and fusion techniques. The system uses two deep learning models: the first for speech classification and the second for facial expression processing. This work aims to help doctors avoid misdiagnosis in the case of BD, which is often confused with MDD. The results are quite promising, but the system cannot be applied in a spontaneous manner, since it requires special environmental conditions, such as a camera.

Huang et al. [7] propose an attention-based convolutional neural network (CNN) and long short-term memory (LSTM) approach for distinguishing between MDD and BD. The approach identifies mood disorders on the basis of responses given to 6 video sequences and can consequently be applied to depressive episode detection. The analysis includes speech processing and facial recognition to extract the emotional state. In contrast to our approach, this work cannot be applied in a spontaneous manner and does not protect data privacy: recording responses to 6 particular videos cannot be expected on a daily basis without the intervention of a psychiatrist.

Grunerbl et al. [5] present a smartphone-based approach for mood status identification in BD. The approach uses smartphone features (e.g., location, phone call sounds, phone light) to detect depressive and manic episodes. Features are extracted manually, and data privacy is not protected.

Maxhuni et al. [11] define an analysis approach that identifies manic and depressive episodes on the basis of activity data and phone calls. To monitor activities, the authors note that the phone has to be held in a particular position. Moreover, phone calls are recorded, stored on the phone's memory card, and sent to another server for analysis. This approach therefore does not protect data privacy and requires special environmental conditions to generate accurate results.

Khorram et al. [10] propose a machine learning (ML) approach for depression detection in BD, based on the analysis of acoustic features of the patient's voice. The paper shows the effectiveness of speech features in detecting depression. However, the classical ML techniques used require features to be extracted manually, and the patient's voice is sent to an external server for analysis, which does not protect data privacy. We therefore propose a deep learning model that extracts features automatically and preserves data privacy.

Gideon et al. [3] discuss the impact of recorded voice on the identification of depressive and manic episodes in BD. Only phone calls recorded during clinical trials are considered; other phone calls are removed to protect data privacy. This means that this work does not analyze naturalistic and spontaneous voice, which might alter the detection results. In contrast, our work analyzes spontaneous phone calls (i.e., recorded on a daily basis), thanks to our deep learning model running locally on the patient's smartphone.

Huang et al. [8] define two speech features based on speech landmark bigrams (i.e., bigram count and LDA bigram) for depression detection. Both features can be extracted from naturalistic phone calls covering 6 elicitation tasks, such as measuring the diadochokinetic rate. Landmarks are extracted using the SpeechMark® toolbox [Footnote 5], a Matlab tool that runs under Windows 7, Windows XP, and Apple OSX (Lion or Mountain Lion), as mentioned on its web page. The proposed work therefore needs to send recorded phone calls to an external server for feature extraction, whereas our work extracts features locally on the smartphone.

Pan et al. [13] propose analysis approaches for detecting manic episodes in BD, based on Support Vector Machines (SVM) and Gaussian Mixture Models (GMM). They record spontaneous phone calls and send them to an external server for analysis, which does not protect data privacy.

3 DL4DED

This section presents our novel approach, called DL4DED. It is a mobile voice analysis approach that:

  1. monitors patients suffering from BD and MDD;

  2. detects depressive episodes by locally analyzing their phone calls without storing them;

  3. alerts patients, their family members, and their psychiatrists if a depressive episode is detected.

DL4DED is based on deep learning methods applied to spontaneous speech, recorded from phone calls, to identify depressed voice. The proposed deep learning model runs locally on the patient's smartphone to preserve data privacy: recorded phone calls are stored on the smartphone only temporarily, until the analysis is complete. Once the decision is sent to our external server, the recorded phone call is discarded. The architecture of DL4DED, the proposed deep learning model, and our optimizations are described below.

3.1 Architecture

Figure 1 shows the architecture of DL4DED. Our approach (1) records spontaneous phone calls, (2) stores them temporarily in a buffer in the local memory of the patient's phone, (3) processes them via our "mobile" deep learning model, (4) identifies the state of the recorded voice (i.e., depressed or not depressed), (5) discards the recorded voice (i.e., the phone call), and finally (6) sends the obtained decision to a database hosted on a cloud server, where it is stored. Communication between the smartphone and the cloud server uses the HTTP protocol. A web dashboard, available to both patients and psychiatrists, displays the patient's mood on the basis of the voice status.

To identify the status of the voice, we use a novel deep learning model that is described in Sect. 3.2.

Fig. 1. Architecture of DL4DED

Fig. 2. Deep learning model for voice status identification

Fig. 3. Energy bar: an audio segment

3.2 Deep Learning for Voice Status Identification

Figure 2 shows our deep learning model. It processes recorded phone calls and detects depressed voice. The model takes a spectrogram as input, obtained by applying a Short-Time Fourier Transform (STFT) to the recorded phone call. A spectrogram, also called a voiceprint, represents the spectrum of frequencies of the recorded phone call as a function of time. It is composed of a set of frequency bars, where each bar corresponds to a timestamp t and is a 513-dimensional vector of energy quantities, expressed in decibels (see Fig. 3).
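The spectrogram construction can be sketched as follows. The paper does not specify the FFT size or hop length, so the 1024-point FFT (which yields the 513 frequency bins mentioned above at 16 kHz) and the 256-sample hop are assumptions:

```python
import numpy as np

def spectrogram(signal, n_fft=1024, hop=256):
    """Magnitude spectrogram (in dB) via a short-time Fourier transform.
    Each column is one 'frequency bar': n_fft // 2 + 1 = 513 energy
    values for a single time step."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T    # shape (513, n_frames)
    return 20.0 * np.log10(spec + 1e-10)            # energies in decibels

x = np.random.randn(16000)        # one second of audio sampled at 16 kHz
S = spectrogram(x)
print(S.shape)                    # (513, 59) for these parameters
```

With these assumed parameters, 120 time steps correspond to roughly two seconds of recorded speech per model input.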

In our case, the spectrogram is a matrix of dimensions 513 (frequencies) \(\times \) 120 (time steps). The temporal dimension was chosen experimentally, while the frequency dimension results from applying the STFT to audio segments recorded at 16 kHz (the standard frequency for human voice recording). The color intensity represents the energy of the recorded voice at instant t, which is useful for identifying the patient's mood. Once the spectrogram is built, it is processed by a CNN having one convolution layer and one max-pooling layer. The convolution layer applies a \((1 \times 3)\) filter with a stride of (1, 2), allowing us to keep all frequencies (i.e., all energy quantities describing the voice) while preserving temporal continuity. The max-pooling layer applies a \((1 \times 5)\) filter with a stride of (1, 5) to extract medium-term features, again keeping all frequencies. The output of the max-pooling layer is processed by an LSTM, whose memory cells retain long-term information. A fully connected layer then transforms the obtained matrix into a 128-dimensional vector, which is fed to a softmax classifier for binary classification (i.e., depressed voice or not depressed voice). The parameters of our CNN model were identified experimentally.
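The layer dimensions above can be verified with simple output-shape arithmetic, assuming "valid" (no-padding) convolutions, which the paper does not state explicitly:

```python
def out_dim(size, kernel, stride):
    """Output length of a 'valid' convolution or pooling along one axis."""
    return (size - kernel) // stride + 1

freq, time_steps = 513, 120                       # input spectrogram
# Convolution: (1 x 3) filter, stride (1, 2) -> all 513 frequencies kept.
f1, t1 = out_dim(freq, 1, 1), out_dim(time_steps, 3, 2)
# Max-pooling: (1 x 5) filter, stride (1, 5) -> medium-term features.
f2, t2 = out_dim(f1, 1, 1), out_dim(t1, 5, 5)
print((f1, t1), (f2, t2))                         # (513, 59) (513, 11)
```

The LSTM thus receives 11 time steps of 513-dimensional feature vectors under these assumptions.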

To run the proposed deep learning model on mobile devices, we considered two compression methods: quantization and pruning. These methods allow us to reduce the size of our deep learning model by removing weights or operations that are least useful for prediction.
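A minimal sketch of the two compression methods on a raw weight matrix, using magnitude pruning and uniform 8-bit quantization; the paper does not detail the exact variants it applies, so these are illustrative choices:

```python
import numpy as np

def prune(weights, fraction=0.5):
    """Magnitude pruning: zero out the fraction of weights with the
    smallest absolute values (they contribute least to predictions)."""
    threshold = np.quantile(np.abs(weights), fraction)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def quantize(weights, bits=8):
    """Uniform quantization: map float weights onto an integer grid,
    returning the int codes and the dequantization scale."""
    scale = np.abs(weights).max() / (2 ** (bits - 1) - 1)
    codes = np.round(weights / scale).astype(np.int8)
    return codes, scale

w = np.random.randn(4, 4).astype(np.float32)
sparse_w = prune(w)                    # half of the 16 entries become zero
codes, scale = quantize(w)             # int8 codes plus one float scale
restored = codes.astype(np.float32) * scale   # approximate original weights
```

Pruning shrinks the model by making weight tensors sparse, while quantization replaces 32-bit floats with 8-bit integers, roughly quartering the storage per weight.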

4 Implementation Issues

Figure 4 shows the technical architecture of DL4DED. The smartphone is in charge of recording and analyzing phone calls, running our deep learning model implemented with the Keras [Footnote 6] library. Generated decisions are then sent to a Flask RESTful API 1.0.2, a web service platform that stores received decisions in a RethinkDB 2.3.6 0xenial (GCC 5.3.1) database. A Node.js v4.2.6 server hosts a real-time web application composed of a set of dashboards displaying the analysis results and an estimation of the patient's mood/status.

Fig. 4. DL4DED implementation

To predict depressed voice, recorded phone calls are pre-processed first. Afterwards, the deep learning model is loaded into a mobile application to trigger predictions. Implementation details are presented below.

4.1 Data Processing

To process a phone call with a CNN model, a spectrogram is built. For this purpose, the recorded voice signal is first processed by a pre-emphasis filter, which allows us to (1) improve the Signal-to-Noise Ratio (SNR), (2) avoid numerical problems that might appear during the Fourier Transform, and (3) balance the frequency spectrum, since high frequencies usually have smaller magnitudes than lower frequencies.
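Pre-emphasis is typically a first-order high-pass filter; a minimal sketch, assuming the commonly used coefficient of 0.97 (the paper does not state the coefficient):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[t] = x[t] - alpha * x[t - 1].
    It boosts high frequencies, which carry less energy in speech."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

emphasized = pre_emphasis(np.array([1.0, 2.0, 3.0]))  # ≈ [1.0, 1.03, 1.06]
```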

After applying the pre-emphasis filter, the signal is decomposed into short, overlapping frames. This avoids applying the STFT across the entire signal, which would lose the frequency contours of the signal over time. Instead, the STFT is applied to short-term frames, and concatenating adjacent frames yields a good approximation of the signal's frequency contours.
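Framing can be sketched as follows; the frame and hop lengths (1024 and 256 samples, i.e., 64 ms frames every 16 ms at 16 kHz) are assumptions consistent with a 513-bin STFT, not values given in the paper:

```python
import numpy as np

def frame_signal(signal, frame_len=1024, hop_len=256):
    """Split a signal into overlapping short-time frames. With 1024-sample
    frames every 256 samples, adjacent frames overlap by 75%."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(16000))   # one second of audio at 16 kHz
print(frames.shape)                      # (59, 1024)
```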

A window function (e.g., a Hamming window) and an STFT are applied to each frame. This allows us to compute the power spectrum, from which frequency bands are extracted by applying triangular filters on a mel scale. The mel scale mimics the non-linear frequency perception of the human ear, being more discriminative at lower frequencies and less discriminative at higher frequencies.
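The mel scale is defined by a standard logarithmic mapping; below is a sketch of that mapping and of the filter edge frequencies, assuming 40 triangular filters over the 0–8 kHz band (8 kHz being the Nyquist frequency at 16 kHz sampling — the filter count is an assumption, not from the paper):

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale mapping: near-linear below ~1 kHz, logarithmic
    above, mimicking the human ear's frequency resolution."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filter_edges(n_filters=40, f_min=0.0, f_max=8000.0):
    """Edge frequencies (Hz) of triangular filters spaced evenly on the
    mel scale between f_min and f_max."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)   # back to Hz

edges = mel_filter_edges()
# Edges are densely packed at low frequencies and sparse at high ones,
# giving finer resolution where the ear is most discriminative.
```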

After applying the filter bank to the power spectrum (i.e., periodogram) of the signal, we obtain the spectrogram that is processed by our CNN.

All the steps described above were implemented in Android Studio to create our mobile application (see the first step of Fig. 5).

4.2 Implementation on Mobile Devices

To implement our deep learning model on mobile devices, we followed three steps (see Fig. 5). First, our Keras-based model is converted to TensorFlow, which allows easier integration of deep learning models on mobile devices (i.e., Android or iOS). For this purpose, we use a method called "Keras_to_Tensorflow" provided with the TensorFlow library, which converts a Keras model file into a TensorFlow file containing both the network architecture and its associated weights. Second, we optimize the generated model by applying the pruning and quantization methods, removing the weights and operations that are least useful for prediction. Third, we load the model (i.e., a *.pb file including weights and architecture) into our mobile application to allow real-time detection of depressed voice.

Fig. 5. Implementation of our deep learning model on mobile devices

5 Experimental Results

Our experiments were conducted on (1) an Ubuntu Server 16.04.5 LTS with two Nvidia GeForce GTX 1080 Ti Turbo GPUs (11 GB GDDR5X RAM, PCIe x16, HDMI) and (2) a OnePlus A6003 smartphone running Android OS 8.1.0, with 8 GB of RAM and 128 GB of storage. We use the DAIC-WOZ dataset [Footnote 7] to evaluate DL4DED. The DAIC-WOZ dataset was compiled by the USC Institute for Creative Technologies and published as part of the Audiovisual Emotional Challenge 2016 (AVEC 2016). It includes 189 sessions, with an average duration of 16 min, between a participant and a virtual interviewer controlled by a human interviewer in another room via a "Wizard of Oz" approach. Prior to the interview, each participant completed a psychiatric questionnaire (PHQ-8), from which a binary classification (depressed, non-depressed) was derived [4]. To evaluate DL4DED, we conducted two groups of experiments: the first evaluates the performance of our approach in terms of accuracy, and the second assesses the power consumption of DL4DED on mobile devices.

5.1 Performance

The main objective of this group of experiments is to compare our "optimized deep learning model" (i.e., with model compression, running on a smartphone) to the "original deep learning model" (i.e., without compression or optimization) in terms of accuracy, computed using TensorFlow libraries. Since the goal is to assess the accuracy loss introduced by the optimizations of DL4DED, both models are evaluated on the same database (i.e., the DAIC-WOZ dataset). The obtained results show that the accuracy of DL4DED (0.50) is only slightly lower than that of the original deep learning model (0.52), meaning that the applied compression techniques do not significantly alter the analysis results.

5.2 Power Consumption

To assess the power consumption of DL4DED, we used the battery monitoring functionality provided by the Android OS that allows us to measure the power usage of each application in mAh (see Fig. 6).

Fig. 6. Power monitoring functionality of Android OS

mAh stands for milliampere-hour, a unit of electric charge: it expresses how much charge the mobile application has drawn from the battery. Battery capacity is usually expressed in mAh.
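To relate the reported charge figures to energy, one can multiply by the battery voltage; the 3.85 V nominal Li-ion voltage below is an assumption, as the paper reports charge only:

```python
# mAh measures electric charge; multiplying by battery voltage yields energy.
battery_voltage = 3.85                  # volts; typical Li-ion nominal (assumed)
dl4ded_mah, baseline_mah = 5.0, 1.0     # charge drawn during a 4-minute call
extra_energy_mwh = (dl4ded_mah - baseline_mah) * battery_voltage
print(extra_energy_mwh)                 # 15.4 mWh extra per 4-minute call
```

Against a typical smartphone battery of ~3000 mAh, a 4 mAh overhead per call is on the order of 0.1% of capacity.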

We measured the power consumption of (1) DL4DED and (2) a baseline mobile application, while varying the duration of phone calls (from 1 min to 16 min). The baseline application only records phone calls and sends them to an external server, which loads and runs our "original deep learning model" (i.e., without compression or optimization) to detect depressed voice. Due to the limited throughput of our communication links, the baseline approach does not send the whole phone call, whereas DL4DED processes its entire duration to allow better classification. The obtained results show that, for a 4-minute phone call, the average power consumption of DL4DED (5 mAh) is higher than that of the baseline application (1 mAh). This is expected: DL4DED records the phone call, saves it to a temporary buffer, and runs a deep learning model on the smartphone to allow real-time prediction of depressed voice, which logically increases power consumption. Our experiments show that the average difference in power usage (4 mAh for a 4-minute call) is reasonably low and therefore acceptable. As shown in Fig. 7, the power consumption of DL4DED increases with the duration of the phone call, since we process the whole call, in contrast to the baseline approach. In future work, we will address this issue by processing only the pertinent parts of the recorded phone call.

Fig. 7. Power consumption: DL4DED vs. baseline approach

6 Conclusion

We presented DL4DED, a novel mobile deep learning approach for depressive episode detection, based on a combination of CNN and LSTM networks. The proposed deep learning model was optimized using compression techniques so that it can be loaded onto smartphones. DL4DED records spontaneous phone calls on a daily basis, stores them temporarily on the smartphone, and runs our optimized deep learning model to allow real-time detection of depressed voice. DL4DED preserves data privacy, since recorded phone calls are not sent to external servers. DL4DED was evaluated on the DAIC-WOZ database; the results demonstrated its efficiency in terms of accuracy and power consumption.

There are several directions for future research. First, we aim to consider the number of recorded phone calls in our study to improve analysis results. Actually, the absence of phone calls could be seen as a severe sign of depression. Second, we plan to extend our deep learning approach to detect manic episodes in BD. Third, we aim to build a realistic and balanced database to improve the performance of DL4DED. Finally, further optimization methods should be investigated to improve our implementation on mobile devices.