For the experiments, two text datasets were obtained from the queries and answers on the ung.no website. The datasets comprise several categories, including depression texts. The messages were annotated with the help of professionals such as medical doctors and psychologists. All experiments were run on a computer with an Intel(R) Core(TM) i7-7700HQ CPU (2.80 GHz and 2.81 GHz), 32 GB of memory, the Windows® 10 operating system, and the TensorFlow deep learning tool, version 2.4.1.
First dataset and experiments
From the whole collection of texts across different categories, 11,807 were extracted for the first dataset and experiments: 1820 texts categorized as depression texts (describing symptoms of depression) and the other 9987 as non-depression texts (not describing symptoms of depression). Tables 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 present the classification reports of the ten folds used in the experiments, where each fold uses 90% of the data for training and the rest for testing. Figures 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 show the confusion matrices of each fold. Figure 23 depicts the accuracy and loss over 100 epochs during the training of the ten folds. Training is stable across the folds apart from minor, negligible fluctuations. Figure 24 shows the attention-based LSTM model used in this work, which has 53,358 parameters and consists of an LSTM layer with 50 memory units, an attention layer, and a dense layer for the 2 emotional states (i.e., depression and non-depression).
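The tenfold protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the folds were stratified by class (the text does not say so), and the helper name `stratified_kfold` and the seed are hypothetical. Only the class sizes (1820 and 9987) come from the text.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs so each fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


# Class sizes as reported for the first dataset (1 = depression, 0 = non-depression).
labels = [1] * 1820 + [0] * 9987
splits = list(stratified_kfold(labels, k=10))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    share = len(train_idx) / len(labels)
    print(f"fold {fold_no}: train={len(train_idx)} test={len(test_idx)} ({share:.1%} train)")
```

Each text appears in exactly one test fold, so the ten classification reports together cover the whole dataset once.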
Comparison with traditional approaches
We compared the proposed approach with traditional approaches, and the proposed one showed superior results. We first applied conventional machine learning algorithms (i.e., logistic regression, decision trees, support vector machines (SVM), a typical large artificial neural network (ANN), DBN, and CNN) to different features (i.e., typical one-hot, TF-IDF, and the proposed features), but none achieved more than 91% mean accuracy, as shown in Table 12. Furthermore, we tried LSTM with the traditional as well as the proposed features to decode and model the time-sequential information and determine the emotional states. Table 13 and the chart in Fig. 25 show the performance of three different approaches on the first dataset, where the proposed approach outperforms the two others by achieving 98% mean accuracy.
In addition, a straightforward approach was applied in which the direct presence of the symptoms from the “Appendix” was checked to make the binary decision of depression or non-depression. Because this is a simple rule-based classification, it was applied to the whole dataset rather than to separate training and testing splits. This approach, based on the direct presence of one or more symptoms, achieved an accuracy of 84.20%, where 1684 depression texts were correctly classified among a total of 1807 depression texts and 1730 non-depression texts were correctly classified among 10,000 non-depression texts. Since depression can be self-expressed in many different ways and in texts of different lengths, a single binary rule is not sufficient to determine depression in a text. Hence, it is better to combine the base words from all the symptoms into a collection of depression features and to apply a more sophisticated, sequence-based machine learning algorithm, such as the LSTM-based RNN used in this work.
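A minimal sketch of such a presence-based rule is given here to show why it breaks down. Everything in it is hypothetical: the phrases in `SYMPTOMS` are English placeholders (the actual symptom list is in the Appendix and the original texts are Norwegian), and the four sample texts are made up for illustration.

```python
# Hypothetical placeholder phrases, not the symptom list from the Appendix.
SYMPTOMS = ["feel sad", "no energy", "cannot sleep", "feel worthless"]


def rule_based_predict(text, symptoms=SYMPTOMS):
    """Return 1 (depression) if any symptom phrase occurs in the text, else 0."""
    lowered = text.lower()
    return int(any(phrase in lowered for phrase in symptoms))


def accuracy(predictions, truths):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


# Made-up examples for illustration only (1 = depression, 0 = non-depression).
samples = [
    ("I feel sad all the time and have no energy", 1),
    ("I cannot sleep before exams, is that normal?", 0),  # symptom phrase, other cause
    ("Everything seems pointless lately", 1),             # no listed phrase present
    ("How do I apply for a summer job?", 0),
]
texts, truths = zip(*samples)
predictions = [rule_based_predict(t) for t in texts]
print(predictions, accuracy(predictions, truths))
```

The second and third samples illustrate the two failure modes of a binary rule: a symptom phrase that appears in a non-depression context, and a depression text that uses none of the listed phrases.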
Second dataset and experiments
For the second dataset, a total of 21,470 text samples were obtained, consisting of depression and non-depression texts. Of these, 1470 were depression texts and the remaining 20,000 were non-depression texts. We applied fivefold cross-validation in the second-phase experiments with the proposed approach, i.e., using the RNN on the robust features. Only the results of the proposed approach are reported here, since it gave the best results among the approaches in the first-phase experiments, i.e., on the first dataset. Figures 26, 27, 28, 29, 30 present the confusion matrices of the five folds used in the second experiments, where each fold uses 80% of the data for training and the remaining 20% for testing. The experimental results show a remarkable performance of the proposed features followed by one-hot and LSTM, where the mean recall rates of depression and non-depression are 0.98 and 0.99, respectively. The mean accuracy is 0.99, which shows the robustness of the proposed approach.
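For reference, the sketch below shows how per-class recall and accuracy are derived from a fold's confusion matrix and then averaged over folds. The cell counts are invented for illustration and are not taken from Figures 26 to 30; only the fold size (294 depression and 4000 non-depression test texts) follows from the reported dataset sizes, assuming stratified folds. The function names are hypothetical.

```python
def per_class_recall(matrix):
    """matrix[i][j] = number of texts with true class i predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correctly classified texts (the diagonal) divided by all texts."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def mean(values):
    return sum(values) / len(values)


# Illustrative counts only. Rows: true depression, true non-depression.
fold_matrices = [
    [[288, 6], [40, 3960]],
    [[290, 4], [36, 3964]],
]

recalls = [per_class_recall(m) for m in fold_matrices]
mean_recall_depression = mean([r[0] for r in recalls])
mean_recall_non_depression = mean([r[1] for r in recalls])
mean_accuracy = mean([overall_accuracy(m) for m in fold_matrices])
print(f"{mean_recall_depression:.3f} {mean_recall_non_depression:.3f} {mean_accuracy:.3f}")
```

Because the classes are imbalanced, accuracy is dominated by the non-depression class, which is why the per-class recall is reported alongside it.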
In summary, the above experimental results show the overall efficiency of the proposed depression prediction system, which uses depression symptom-based features and a time-sequential LSTM-based machine learning model. The proposed system shows better results than the latest existing approaches for depression prediction. For instance, one such work is based on a point-based measuring scale for depression, anxiety, and stress, obtained by having the candidates write four different kinds of letters. The candidates, recruited through formal advertisements, were asked to write these letters, whereas in our database the participants wrote their texts spontaneously, expressing their need to seek assistance through a national portal. The model used in that work is logistic regression, a basic linear machine learning model that generally does not fit well when the sample data are distributed non-linearly. In contrast, our work adopted a time-sequential LSTM-based machine learning model that can separate both linearly and non-linearly distributed samples of different classes. The proposed approach also outperforms other popular deep learning models such as DBN and CNN, which are usually used for non-sequential event modelling.
XAI to explain the ML decisions
Humans are generally reluctant to accept approaches that are not interpretable or trustworthy, which increases the demand for transparent AI. Hence, focusing only on the performance of AI models gradually makes the systems less acceptable. Although there is a trade-off between the performance and transparency of machine learning models, a better understanding of the models through explainability can also lead to the correction of their deficiencies. Therefore, to overcome the limited acceptance of the current generation of AI models, XAI should focus on machine learning techniques that produce more explainable models while maintaining a high level of accuracy. Such techniques can also enable humans to appropriately understand, trust, and manage the emerging AI phenomena as much as possible. Explainability is a main factor in gaining confidence that a model will act as intended for a given problem, and it is certainly a property of any explainable model. Local explanations in AI models handle explainability by dividing the model's complex solution space into several less complex solution subspaces that are relevant for the whole model. These explanations can use approaches with the differentiating property to explain the model to some basic extent.
Most model simplification techniques are based on rule extraction. The most popular contribution to local post-hoc explanation is the approach called Local Interpretable Model-Agnostic Explanations (LIME). LIME generates locally linear models around the predictions of a machine learning model in order to explain it. It falls under the category of rule-based local explanations by simplification. Explanation by simplification builds a whole new system based on the trained model to be explained. The new, simplified model then tries to optimize its resemblance to the functions of its predecessor while reducing complexity and keeping a similar performance. Therefore, once the machine learning decision is obtained, the XAI algorithm LIME is applied to see the importance of the features and their probabilities towards the decision. We can thus see how important the features present in the input are for the decision, which helps in understanding the outcomes of the system. Figure 31 shows the total class probabilities, the top 10 features, their probabilities, and the automatically highlighted features in a sample input text using LIME. As can be seen on the right side of the figure, the features towards depression altogether get higher weights than those towards the non-depression class, which indicates that the person is in a depressed state. The input text, features, and highlights were originally in Norwegian, since the database comes from a Norwegian national portal for interacting with youth, but the figure shows the corresponding English translation for better readability and understanding of the approach. According to the decision of the machine learning model and the explanations from LIME, the sample text expresses depression. Note that the ground truth for the sample text in the figure was the same as the model's prediction (i.e., depression), indicating the robustness of the model's decision and explanation.
Furthermore, Fig. 32 shows the summarized probabilities of the top 10 features for a paraphrased non-depression example text using LIME. In the figure, the left side shows the original output after applying the algorithm and the right side shows the corresponding representation in English for better readability and understandability. The overall probabilities given by the machine learning model to the non-depression text for the depression and non-depression classes were 0.001 and 0.999, respectively.
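The idea behind the locally linear explanations discussed in this section can be sketched in a few lines of plain Python. This is a simplified illustration of a LIME-style surrogate, not the LIME library and not the authors' pipeline: the black box `toy_predict_proba`, its keyword weights, the distance measure, the kernel width, and the sample text are all hypothetical stand-ins for the trained LSTM model and the Norwegian input.

```python
import math
import random


def toy_predict_proba(text):
    """Hypothetical black box: P(depression) from a few keyword weights."""
    weights = {"hopeless": 2.5, "tired": 1.0, "sad": 2.0, "football": -1.5}
    score = -1.0 + sum(weights.get(tok, 0.0) for tok in text.lower().split())
    return 1.0 / (1.0 + math.exp(-score))


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def explain(text, predict, n_samples=500, kernel_width=0.75, ridge=1e-3, seed=0):
    """Fit a locally weighted linear surrogate over word-presence masks."""
    rng = random.Random(seed)
    tokens = text.split()
    d = len(tokens)
    rows, targets, sample_weights = [], [], []
    for s in range(n_samples):
        # The first sample is the unperturbed text; the rest drop words at random.
        mask = [1] * d if s == 0 else [rng.randint(0, 1) for _ in range(d)]
        perturbed = " ".join(t for t, keep in zip(tokens, mask) if keep)
        distance = 1.0 - sum(mask) / d  # fraction of words removed (a simplification)
        rows.append(mask + [1])  # trailing 1 is the intercept column
        targets.append(predict(perturbed))
        sample_weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    # Normal equations of weighted ridge regression: (X'WX + ridge*I) beta = X'Wy
    p = d + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, sample_weights))
             + (ridge if i == j else 0.0) for j in range(p)] for i in range(p)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, sample_weights, targets))
            for i in range(p)]
    beta = solve(xtwx, xtwy)
    return sorted(zip(tokens, beta[:d]), key=lambda kv: abs(kv[1]), reverse=True)


text = "i feel sad and hopeless after football practice"
explanation = explain(text, toy_predict_proba)
for word, weight in explanation[:10]:
    print(f"{word:>10s} {weight:+.3f}")
```

Positive weights push the surrogate towards the depression class and negative weights towards non-depression, which corresponds to the per-feature bars and highlighted words in the LIME figures. The surrogate is only valid near the explained text, which is what makes the explanation local.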