1 Introduction

Depression is a serious psychiatric disorder affecting communities across the world and accounts for a substantial share of the global disease burden. More than 350 million people suffer from depression, corresponding to over 4.4% of the global population [18]. Additionally, two-thirds of patients do not seek help. The major problem is that depression unknowingly affects the personal and social life of a person. At its extreme, it can lead to suicide and other psychiatric disorders. Approximately one person dies by suicide every 40 seconds, which translates to about 800,000 suicide deaths every year across the globe [21]. Suicide is one of the primary causes of adolescent deaths, which implies that adolescents are at great risk of depression. Researching depression and identifying it in individuals is an essential task globally, as well as in the context of India, where suicide rates are alarming [69]. According to a Lancet report published in 2012, a student in India dies by suicide every hour due to depression. According to a World Health Organization (W.H.O.) report, approximately 8934 students died by suicide in 2015. In the previous 5 years, approximately 39,775 students killed themselves, and most suicides are not even reported [16]. These figures are alarming and call for critical action to address the problem.

Mental health and related issues are important to address at every stage of life, be it childhood, adolescence, or adulthood. A person suffering from depression usually experiences a short-term or long-term low mood state that diminishes creativity and enthusiasm for day-to-day activities [44]. A prolonged low mood state and routine tensions can become chronic or recurrent, leading to serious health problems [75]. Victims of depression usually suffer from symptoms such as insomnia, loneliness, loss of appetite, and lack of concentration in work and personal life, and are sometimes at high risk of suicide [31, 69]. Some of the common symptoms are shown in Table 1 [13]. Long-term depression can have various causes, such as a rough childhood, sexual abuse, alcohol addiction, medical treatments, work pressure, and historical legacies of racism, colonialism, and caste [42]. Untreated depression and anxiety worsen with time and result in sleep deprivation, memory issues, heart problems, etc. Various initiatives and programmes for treating depression have been started under the guidance of national governments and well-known organizations such as the W.H.O. However, many victims of depression are unable to utilize these treatments because most of them come from lower- and middle-income families [8, 9]. Also, developing countries lack effective schemes for treating depression due to insufficient funds and resources [16].

Table 1 Symptoms of depression

Differentiating depressed individuals from non-depressed ones is quite a challenging task, as no practical approach is available for doing so. Moreover, there are not enough resources and trained health professionals for treating depression, and most existing prediction techniques are inaccurate due to the lack of effective methods for diagnosing depression. Social media giants such as Twitter, Facebook, Snapchat, and Instagram can help in predicting depression because of their large pools of users and the activity on their respective platforms. Platforms like Twitter generate data at an average of 6000 tweets per second, i.e., around 200 billion tweets per year, and make much of this data openly available [7]. Figure 1 clearly shows that anxiety and depression are top-listed major problems for young people. The young population usually spends a lot of time on social sites, which allows researchers to collect and analyze the content shared on these platforms [4, 9]. After talking to health professionals, we became aware that many problems regarding mental health remain unsolved and that most people consider it taboo to discuss them with family or friends. However, on social media, people are willing to share their sentiments and problems, and they expect solutions because of the vast reach of social networks.

Fig. 1 Survey of the U.S. showing depression as a major problem in the young population in 2018 [36]

Natural Language Processing (NLP) is a set of techniques for studying speech and text, and it has evolved from purely linguistic approaches to computer algorithms alongside the development of computers. Initially, NLP was applied to classical tasks on well-structured text, such as the language found in books, but it has since evolved to interpret human expression in mail, web content, reviews, comments, news, social media posts, and media articles, which are far more challenging to process [29, 32]. Sentiment Analysis (SA) is a technique for analyzing the positive and negative characteristics of text or speech by studying sentiments. People share their thoughts, ideas, opinions, and life events by posting textual posts and comments on their topics of interest [51]. This approach can be extended to study a person’s depression level from his/her social media content by observing negative sentiment scores. Social media content has progressively evolved from text to videos. However, there are various complexities in text-related content, such as the usage of emojis, hashtags, and languages other than English, which are highly valuable in understanding a person’s sentiments [20, 61]. A substantial body of research has shown that, used correctly, the content generated by users on social networking sites can reveal a person’s mental state at an early stage [75]. There are many predictive algorithms and optimization techniques, such as Deep Learning (DL) and Machine Learning (ML), that observe patterns in data and generate insights from those observations. Thus, researchers prefer various computational techniques to examine individuals with mental health problems, thereby predicting depression among social media users. NLP is capable of processing a number of languages in many respects [27]. Researchers generally use classical ML algorithms such as Random Forest (RF), Support Vector Machines (SVMs), Decision Trees (DT), etc. for text analysis using binary classification. The efficacy of conventional ML techniques is limited, as the number of correlations increases significantly with the growth in data volume [48]. In this research, we investigated various DL techniques and provide a comparative report of our findings against those of conventional approaches.

Textual data classification has seen a significant boost in accuracy from deploying DL models like RNNs and CNNs together with neural word embeddings [58]. CNNs were originally designed to mimic the visual sense of humans and animals in recognizing and detecting objects in images [11] and videos, whereas RNNs are designed for sequence processing tasks such as part-of-speech tagging and language translation [68]. A distinctive property of CNNs is translational invariance, i.e., the extraction of characteristic features irrespective of the angle or intensity of an image in vision tasks. The addition of pooling procedures enabled CNNs to operate on textual data as well; CNNs combined with word embeddings capture syntactic and semantic information from text. The advantage of LSTM is that it is effective at retaining vital information and is designed to avoid the “vanishing gradient” problem [33]. Due to its capacity to handle arbitrary-length sequential input, it also performs well on sequence labeling tasks [60]. In this study, we have integrated these DL approaches to propose our hybrid model.

In our work, we used a Twitter dataset for classifying tweets, labeled as positive for depressed users and negative for non-depressed users. The remainder of the paper is structured as follows. Section 2 discusses the related work on the analysis of depression using various techniques. Section 3 highlights the main contributions of our work. Section 4 focuses on the proposed methodology. Section 5 describes the dataset used in our work and the pre-processing techniques. In Section 6, our hybrid DL-based method, CNN-biLSTM, is explained in detail. The proposed approach is based on the concepts of Deep Neural Networks (DNNs) and word embeddings, which remain the state-of-the-art methodology for learning tasks in NLP. Section 7 compares our proposed model with CNN- and RNN-based models for text classification; the DL-based binary classification algorithms have been adapted to suit our current use case. Although CNNs perform strongly on most such tasks, their predictions are harder to explain; therefore, to add more explainability to the models, we have also employed an RNN-based approach. Section 8 illustrates the comparative analysis, visualizations, and statistical and performance analysis of the proposed work. Lastly, the conclusion of our study is given in Section 9.

2 Related work

2.1 Traditional techniques

Various ML algorithms and statistical techniques have been used for classifying text data using social media as a platform. Traditional studies show a correlation between increased depression and social media usage. Costello et al. [15] explained the mapping of psychological characteristics to digital records of online behavior on social platforms like Facebook, Instagram, and Twitter. It was hypothesized that psychological traits can be predicted from the language used on internet platforms and the pages liked by people. Eichstaedt et al. [19] collected the Facebook status history of 683 patients to predict their depression. ML algorithms were trained using at least 200 topics along with cross-validation techniques to avoid overfitting. Priya et al. [47] proposed a five-level prediction of depression, stress, and anxiety using five ML algorithms. Mori et al. [38] used ML algorithms on four types of Twitter information: network, word statistics, time, and bag of words. The study considered 24 types of personality traits and 239 features. Tao et al. [62] predicted depressive content in social media using depression context: data gathered from social networks was given as text to a knowledge sentiment block, and the depressive content identified by SA was used to warn and alert family members and social activists. Guntuku et al. [22] collected around 400 million tweets from Pennsylvania, USA, and identified users whose tweets contained words like “alone” or “lonely”. User tweets were analyzed with respect to age, time of posting, daily activities, and gender. Patterns in tweets were analyzed using ML classifiers and NLP techniques, thereby predicting a user’s loneliness.

Various popular ML techniques have attracted many researchers in the last few years. Islam et al. [27] proposed depression classification on Facebook data using ML techniques. For effective analysis, traditional ML approaches were utilized considering different psycholinguistic features. It was found that DT produced better results, with a lower classification error rate and significantly higher accuracy than the other ML techniques. Hiraga et al. [24] utilized conventional ML algorithms such as multinomial Logistic Regression, NB, and linear SVMs to classify mental disorders from blogs written in Japanese. Wu et al. [73] proposed various hypotheses and their correlations based on language, time, and interaction to predict job burnout, applying ML algorithms like DT, Logistic Regression (LR), SVM, XGBoost, and RF to 1532 Weibo burnout users, replacing earlier survey-based statistical methods. Fatima et al. [20] used ML techniques such as SVMs, multilayer perceptron neural networks, and LR to predict postpartum depression from social media text. Features were extracted from social media platforms based on linguistics and classified as general, depression, and postpartum depression.

2.2 Deep learning techniques

Conventional ML algorithms have performed well in predicting depression from social media-based textual content, and researchers have employed DL-based solutions to gain further insight from photos, videos, unstructured text, and emojis. Orabi et al. [43] utilized CNN and RNN to detect depression in Twitter data using Adaptive Moment Estimation (Adam) as the optimizer, with word embedding training done using CBOW, Skip-gram, and random word embeddings uniformly distributed in the range −0.5 to +0.5. Shrestha et al. [56] proposed an unsupervised method utilizing RNNs and anomaly detection to analyze the behavior of users on ReachOut.com (an online forum). Two streamed approaches were used, one for linguistics and the other for network connections, to detect depression in users. Eatedal et al. [2] utilized an RNN technique to predict depression among women in the Arab world using 10,000 tweets generated by 200 users. Zogan et al. [79] proposed a new approach for identifying depressed users based on a user’s online timeline tweets and behaviors. The hybrid model comprising CNN and Bi-GRU approaches was tested on a benchmark dataset, with semantic features extracted to represent user behaviors. The hybrid CNN and Bi-GRU model was compared with state-of-the-art techniques, and the classification performance improved considerably. Chiu et al. [14] and Huang et al. [26] proposed multi-modal DL frameworks to predict depression from Instagram posts using picture, text, and behavior features. Tommasel et al. [64] proposed a DL technique to capture social media expression in Argentina; time-series data generated using markers was fed to a neural network to forecast mental health and emotions during COVID-19. Wang et al. [70] built a dataset on Sina Weibo named the Weibo User Depression Detection Data Set (WU3D), containing 10,000 depressed users and 20,000 normal users. Ten statistical features were proposed based on social behavior, user text, and posted pictures, and a fusion F-net was trained on these 10 features to detect depression. Suman [61] utilized DL models with a cloud-based smartphone application on tweets to detect depression. The sentence classifier used in that study is RoBERTa, which compares the provided tweet or query with tweets from a standard corpus; the standard corpus enhanced the model’s reliability. The patient’s depression status was estimated and mental health predicted, with the authors also using random noise factors and a larger set of tweet samples.

Furthermore, Rao [48] showed that critical sentiment information cannot be correctly captured by traditional depression analysis models. This issue was handled by a proposed multi-gated LeakyReLU CNN model, which also identified depressed characters in social media. Every user post was first recognized, and then the emotional status and overall representation were identified; the multi-gated design was reduced to a single LeakyReLU CNN. Content posted by online users was taken from the Reddit self-reported dataset, and the model’s ability to identify depressed persons was analyzed. Sood et al. [58] used RStudio to retrieve tweets and then evaluated their sentiments: to analyze the sentiments of the general population on Twitter, every tweet was assigned a sentiment score using an algorithm developed in that study. Uddin et al. [68] focused on online data in the Bangla language and analyzed depression utilizing Long Short-Term Memory (LSTM) and deep recurrent networks. The effects of hyper-parameter tuning were demonstrated, showing that for a stratified dataset, depression detection accuracy is higher with repeated sampling. An individual’s depression is also detected with the help of the proposed models, thereby helping to prevent undesirable outcomes. Pranav [46] found that victims of depression use atypical language while speaking. Twitter data was processed for depression prognosis by implementing neural networks, and the study showed that ambiguity arising in SA can be reduced by applying CNNs; the proposed approach is a fruitful prognostic tool for observing a user’s depression on social platforms. Rosa et al. [50] emphasized analyzing emotional sentiment and extracting deep semantics from textual data. Descriptive information about the natural-language content is extracted, and combined model training is performed using semi-supervised learning; the hybrid DHMR model was utilized to obtain better results. Shetty et al. [55] performed SA on Twitter posts: a Kaggle dataset was taken as input, LSTM was employed within the deep network, and CNNs were later added to the classification stage for enhanced performance.

2.3 Other techniques

For the early identification of mental illness and depression, Alsagri et al. [5] proposed a data-driven approach, concluding from tweets gathered from social media that peaks in depression can be detected. A transformer-based approach was built on the depression dataset; comparative analysis with existing studies showed that overall model performance improved substantially. Stephen and Prabhu [60] detected the level of depression among Twitter users by measuring depression scores through various emotions combined with sentiment scores. Different aspects of depression were highlighted, and the estimated scores were correlated with key information concerning users’ depression levels. Levia et al. [33] analyzed social media users who used online platforms to reveal their mental health states. Online messages were analyzed iteratively over time to detect a user’s depression risk state early; comprehensive SA combined with ML approaches showed effective results in detecting early symptoms of depression. Zucco et al. [80] analyzed user opinions for performing SA, using text extraction and NLP to detect the opinions of users. Birjali et al. [11] analyzed user activities on social media platforms and predicted their depressive emotions. The Weka tool was utilized to apply ML techniques for Twitter data classification, and WordNet was used as an external semantic source to compute semantic similarities among the evaluated participants, with sentiments extracted from Twitter. Zhang [76] monitored depression trends and stress-related posts across various geographical entities and found that when people talk more about COVID-19, depression signals increase significantly.

2.4 Analysis table

In this subsection, a few studies that used ML and DL models for depression prediction on text data are analyzed and discussed. Table 2 summarizes the most closely related recent studies (2017–2021).

Table 2 Analysis of recent studies on text data using deep learning techniques and its comparison with proposed work

2.5 Problem formulation

Growing online social media platforms have become a channel for day-to-day communication. Classifying depressed individuals from non-depressed ones based on their linguistic usage is a challenging task. Previous works, as discussed in Section 2, have addressed the problems of (1) distinguishing depressed individuals from non-depressed individuals using social media platforms, and (2) classifying posts made by depressed people.

Furthermore, earlier researchers have focused on unsupervised techniques, whereas in this study a supervised technique is put forward to classify depressed users using psycholinguistic features. Experiments and outcomes reported in previous works illustrate that both text and user detection are challenging problems. Although Twitter provides a huge amount of data, handling this data is challenging. Some of the most common issues encountered while working with Twitter data are:

1. A huge number of image and video transactions occur in parallel with text.

2. Unstructured data with significant usage of emojis and GIFs.

3. Usage of foreign languages, etc.

4. Labeling tweets requires professionals and is very time-consuming.

For simplicity, we limit our problem statement to text in the English language. Moreover, emojis, videos, and non-English alphabets are dropped. The unstructured, noisy data is cleaned using various pre-processing techniques. The problem of labeling data is solved by using an authenticated Twitter dataset released for the use of psychology and computer science researchers.

The main objective of our study is to develop a hybrid DL model for depression prediction and compare its performance to related DL models, namely CNN and RNN. Our hypothesis is that the proposed DL model should outperform the DL-based CNN and RNN models, state-of-the-art studies, and baseline models in both accuracy and robustness.

3 Core contribution of the article

DL algorithms like CNN and LSTM thrive on data, and with so many open datasets available, they give better results for classifying depressed and non-depressed tweets. RNNs are preferred as they can be fed with external pre-trained embeddings. CNNs can be utilized for text processing as they can be trained very fast; moreover, their capability to extract local features from text is particularly valuable for NLP tasks. CNNs and RNNs can be combined to take advantage of both architectures, and we have proposed a hybrid architecture that provides the advantages of both CNNs and LSTMs. The major contributions of this study are the following:

1. A depression detection framework has been proposed using DL, utilizing textual content from social media.

2. A feature-rich CNN network is built to classify user tweets and is concatenated with a biLSTM network to classify social media users suffering from depression, i.e., CNN-biLSTM. To the best of our knowledge, this is the first work to jointly use semantic features, statistical analysis, and DL techniques with word embeddings for depression detection.

3. The research outcomes achieved on a benchmark Twitter dataset show the superiority of our proposed method when compared to state-of-the-art studies.

4 Proposed methodology

The proposed architecture comprises six modules, as shown in Fig. 2: (1) extraction of data generated by online users; (2) analysis of the raw data; (3) pre-processing of the raw data into clean data; (4) feature extraction to generate machine-compatible data; (5) depression classification, differentiating depressive tweets from non-depressive tweets; and (6) parameter evaluation, to evaluate and compare the hybrid CNN-biLSTM, RNN, and CNN classification approaches. In this study, depression is predicted via a hybrid CNN-biLSTM approach using Twitter-based depression datasets. The classification error is minimized, and the prediction of depression becomes more precise and accurate.

Fig. 2 The proposed system architecture

The steps given below illustrate the proposed methodology and its related flowchart is shown in Fig. 3.

Step 1: Design a framework for SA and upload a Twitter database to predict depression. Tweets were labeled as depressed or non-depressed by experts (Shen et al. [54]) for mental health text analysis and to identify the sentiments of depressed and non-depressed users.

Fig. 3 Proposed flowchart

Step 2: Apply pre-processing steps to the uploaded data for noise elimination. Data is processed as per the requirements, resulting in considerable positive effects on the quality of feature extraction. Pre-processing techniques like data normalization, tokenization, punctuation removal, stop-word removal, etc. are applied to the uploaded text data. This step provides clean, noise-free data that is further used for feature extraction.

Step 3: In this step, feature extraction procedures are applied to the pre-processed data to extract important and relevant features. The extracted features ascertain the relevant data dimensions, helping the classification algorithms to perform better.

Step 4: To obtain better accuracy, the CNN model, the RNN model, and the proposed hybrid classification algorithm, i.e., CNN-biLSTM, are used. The proposed model is evaluated against RNN, CNN, and traditional reference models on the same Twitter dataset to validate its performance. The optimized features obtained in the third step are forwarded as input to the classifiers for training and testing purposes.

Step 5: In the last step, the performance parameters of the proposed depression analysis framework, such as precision, recall, F1-score, specificity, accuracy, and AUC, are calculated to validate the system.

5 Dataset description and pre-processing

5.1 Dataset description

Twitter is a popular online social media platform that provides open and easy access to data. The development and validation of vocabulary terms for browsing data of users with mental illness take a significant amount of time. In the past, researchers have typically followed two methods of obtaining Twitter data:

1. Using existing datasets that are freely and publicly shared by others [34].

2. Crawling social media vocabulary, which is slow but helps to obtain reliable data.

In addition, resources such as Twitter impose tweet download restrictions per user per day because of the fair-use policy applied to all users.

The benchmark dataset that has been considered in this study for depression prediction is taken from [54]. The information of the dataset is given in Table 3. The sample dataset is depicted in Fig. 4 below.

Table 3 Statistics of the dataset used [54]
Fig. 4 Sample of depressed and non-depressed tweets from the Twitter dataset

This dataset has been subdivided further into three complementary datasets namely, D1, D2, and D3, as discussed below:

1. Depression Dataset (D1): D1 consists of 292,564 tweets from 1402 depressed users, each labeled as depressed. It comprises tweets obtained between 2009 and 2016. Users were labeled as depressed if their anchor tweet satisfied the pattern “I am / I was / I have been identified as depressed”.

2. Non-depression Dataset (D2): D2 consists of more than 300 million active users and 10 billion tweets. Each user is labeled as non-depressed, and the tweets were gathered in December 2016.

3. Depression-candidate Dataset (D3): Tweets were gathered if they loosely contained the word “depress”. This dataset contains 36,993 depression-candidate users and more than 35 million tweets.

For D1, D2, and D3, there are 2558, 5304, and 58,810 collected samples respectively. Each dataset also includes one month of a user’s Twitter posts prior to the detection of the anchor tweet. In the present study, the D1 and D2 datasets are used and analyzed with the DL-based classification algorithms.

5.2 Data pre-processing

Data pre-processing [41] is an integral part of the data mining process. Real-world data is collected using different methods and is not specific to a particular domain, resulting in incomplete, unstructured, and unreliable data containing errors; analyzing such data directly leads to irrelevant and erroneous predictions. In our framework, several methods are applied during the pre-processing phase. The first method eliminates text patterns specified by the user, e.g., user handles (@username), hashtags (#hashtag), URLs, characters, symbols, and numbers other than alphabets, and empty strings, and it also drops rows with NaN values and duplicate rows. This method cleans up each tweet in the dataset and deletes all URLs in the tweet; URLs are not taken into account because they are not useful for prediction, and eliminating them reduces computing complexity. The next step is to delete the date, time, digits, and hashtags. Date and time are useless for predicting depression, so this information is removed from the tweets. Likewise, digits are not an appropriate feature for prediction. Hashtags could in principle be used for prediction, but we observed that accuracy is very low when prediction is based on hashtags, so hashtags have also been removed. The final steps delete emojis and remove whitespace and extra spaces within sentences.
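A minimal sketch of this cleaning step is given below; the `tweet` column name, the sample rows, and the exact regular expressions are illustrative assumptions rather than the authors’ exact implementation:

```python
import re
import pandas as pd

def clean_tweet(text: str) -> str:
    """Remove the noise patterns listed above from a single tweet."""
    text = re.sub(r"@\w+", "", text)                    # user handles (@username)
    text = re.sub(r"#\w+", "", text)                    # hashtags (#hashtag)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # URLs
    text = re.sub(r"[^A-Za-z\s]", "", text)             # digits, symbols, emojis
    text = re.sub(r"\s+", " ", text).strip()            # collapse extra whitespace
    return text

# df is assumed to be a DataFrame holding the raw tweets.
df = pd.DataFrame({"tweet": ["@user I feel so low today... #depressed https://t.co/x", None]})
df = df.dropna(subset=["tweet"]).drop_duplicates()      # drop NaN rows and duplicate rows
df["clean"] = df["tweet"].apply(clean_tweet)
print(df["clean"].tolist())                             # ['I feel so low today']
```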

After this, stop words are eliminated and stemming is performed. Stopwords such as “are”, “was”, “at”, and “if” do not contribute to the meaning of a sentence; the NLTK [10] package, which includes a set of stopwords, is used to remove them from our text. Stemming [6] is a method of reducing a word to its root form; the Porter Stemmer creates the root of a word by removing prefixes or suffixes (−ize, −ed, −s, −de, etc.). After all tweets are cleaned, they are passed as input to the tokenizer, which is the next step. Tokenizing [54] raw text data is a key pre-processing step for NLP methods: tokenizers use regular expressions to divide a given string into tokens, breaking a larger body of text into smaller lines or words. A sketch of the stop-word and stemming step is given below, and Fig. 5 shows clean depression tweets after pre-processing.
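This NLTK sketch illustrates the stop-word removal and Porter stemming described above; the sample sentence is illustrative, and the required NLTK resource names may vary across NLTK versions:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def normalize(text: str) -> str:
    tokens = word_tokenize(text.lower())
    # drop stopwords such as "are", "was", "at", "if", then stem the rest
    tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

print(normalize("I was feeling hopeless and depressed at work"))
# e.g. 'feel hopeless depress work'
```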

Fig. 5 Sample of depressed and non-depressed tweets after pre-processing

The different tokenization functions are used by importing the Keras tokenizer. The first stage of tokenizing is to provide the cleaned positive and negative tweet datasets as input to the Tokenizer.fit_on_texts() function. It updates the internal vocabulary from a list of texts and creates the vocabulary index based on word frequency, so the word with the highest frequency has the lowest index value. In our framework this yields a vocabulary of 10,275 words, each with an index. The next stage of tokenization is the texts_to_sequences() method, which operates on the vocabulary built in the preceding step. Its objective is to turn every word in a tweet into an integer by replacing it with the corresponding value from the word_index dictionary, so the tweets are converted into integer sequences of varying lengths. After this, tweets shorter than Max_Tweet_Length, which is 25 in our framework, are padded with zeroes.
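A minimal sketch of this tokenization and padding pipeline using the Keras utilities named above is shown below; the sample tweets and the padding direction (`padding="post"`) are assumptions:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_TWEET_LENGTH = 25            # maximum sequence length used in this framework

cleaned_tweets = ["feel hopeless depress work", "love morn coffe"]  # illustrative input

tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned_tweets)         # builds the frequency-ranked word index
vocab_size = len(tokenizer.word_index) + 1     # +1 because index 0 is reserved for padding

sequences = tokenizer.texts_to_sequences(cleaned_tweets)  # words -> integer indices
padded = pad_sequences(sequences, maxlen=MAX_TWEET_LENGTH, padding="post")
print(vocab_size, padded.shape)                # e.g. 8 (2, 25)
```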

5.3 Word embeddings

In NLP, embeddings facilitate ML applications working with large data: a word represented by a highly sparse vector can be projected down into a low-dimensional embedding vector. Embeddings are dense, low-dimensional representations of high-dimensional input vectors. Recent techniques learn word vectors [20, 68] from a given text corpus, replacing naïve encodings whose dimensionality is typically on the order of the whole corpus vocabulary. Word embeddings [37] are trained in such a way that words with semantically similar meanings are positioned close to each other and receive approximately similar representations; e.g., the terms “joyful” and “miserable” have very different semantics, so they are represented far apart in the geometric space, thereby building more separable features out of the tokenized numerical vectors. These vectors are transformed with the help of the embedding layer so that the semantic relationships between associated word vectors are captured. The original tokenized vector carries no relationship between different words, whereas the embedding vectors learn relationships through the distances between vectors in the embedding space. Each time training is repeated, more separable features are extracted, providing more predictive power to the CNN or RNN networks. Word embeddings are one of the best approaches to date for encoding a sentence, paragraph, or document and can be seen as one of the breakthroughs of DL for solving challenging NLP problems.

We used the “word embeddings” technique to calculate numerical vectors for every pre-processed data point. First, we converted all sample text words into sequences to generate word indexes; these indices are retrieved using the Keras text tokenizer [63]. We ensured that the tokenizer does not assign a zero index to any word, and the vocabulary length is adapted accordingly. Next, every distinct word in the dataset is assigned a unique index that is used to form the numeric vectors of all text samples. Initially, the length of all tweets is obtained for text sequence generation. Figure 6 illustrates a histogram of the number of tweets by word length, and it is evident that the majority of tweets in the training set are fewer than 25 words long. Accordingly, text sequences are converted into integer sequences and zero padding is applied, with the maximum sequence length (number of words) set to 25 because the majority of tweets in the dataset fit within that length. Words beyond this length are discarded, as longer tweets are very few in number and a larger limit would add zeros to the sequences of most tweets, slowing model training and affecting overall performance. The process of generating word embeddings for a tweet is illustrated in Fig. 7.

Fig. 6 Text sequence generation

Fig. 7 Word embeddings for a single tweet

In this study, an embedding matrix of dimension Max_unique_words × Embedding_dim is created, where Embedding_dim is the vector length, i.e., 300 in our case. This matrix is initially populated with zeroes. Each of the top 10,275 unique words is converted into a vector by looking it up in the vector space, and each vector is then inserted as a row of the embedding matrix, so the matrix holds 300-dimensional feature vectors over a vocabulary of 10,275 words. We used an embedding layer of length 50 in our DNNs to produce a 25 × 50 output embedding vector for each tokenized vector. This layered neural network in our framework is trained to rebuild the linguistic context of words by taking a large body of text, i.e., EMBEDDING_FILE, as input. This generates a vector space, usually of several hundred dimensions, in which each unique word in the corpus has a unique vector.
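A minimal sketch of this embedding-matrix construction is given below, assuming `tokenizer` is the fitted Keras tokenizer from the earlier sketch and `embeddings_index` holds the pre-trained word vectors parsed from EMBEDDING_FILE (the loading code is omitted):

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

EMBEDDING_DIM = 300        # vector length used in this study
MAX_UNIQUE_WORDS = 10275   # top unique words reported above
MAX_TWEET_LENGTH = 25

# embeddings_index is assumed to map words to 300-d vectors from EMBEDDING_FILE.
embeddings_index = {}

embedding_matrix = np.zeros((MAX_UNIQUE_WORDS, EMBEDDING_DIM))  # initially all zeroes
for word, i in tokenizer.word_index.items():
    if i < MAX_UNIQUE_WORDS:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector                        # one 300-d row per known word

embedding_layer = Embedding(MAX_UNIQUE_WORDS, EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_TWEET_LENGTH)
```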

Finally, we divided the positive and negative datasets for training and testing: 30% of the data is used for training and the remaining 70% for testing. The CNN, RNN, and proposed hybrid CNN-biLSTM models are discussed in the following sections.

6 Proposed hybrid CNN-biLSTM model

We propose an approach combining CNN and bidirectional LSTM (a type of RNN) for higher classification performance in predicting depressed users on Twitter. Across multiple experiments, we found that CNN is good at extracting spatial features and performs well when contextual information about the prior sequence is not required, whereas RNNs are effective at extracting information when the context of adjacent elements matters for classification. Initially, multidimensional data is fed directly as low-level input to the CNN, and significant features are mined by each convolution and pooling layer. As in a traditional CNN, the output layer is fully connected to the hidden layer. Depression tweets are analyzed in depth, and accuracy is improved, by using a number of convolutional layers, pooling layers, and convolutional kernel enhancements; however, a more complex network carries a greater risk of over-fitting.

As a result, a time-dependent recurrent network, i.e., LSTM, is incorporated with the CNN model to address the sequence problem. Redundant and useless information is filtered out by the convolution kernels, and the important extracted information is stored for a longer time in the state cell. Strong results are achieved through the combination of CNN and biLSTM; the biLSTM architecture is shown in Fig. 8. Consequently, CNN-biLSTM is used, where the convolutional layers facilitate the extraction of low-dimensional semantic features from textual data and reduce the number of dimensions, while the biLSTM processes the text as a sequential input. In this study, several 1-dimensional convolutional kernels are used together to achieve better performance on the input vectors. Table 4 describes the control parameters used.

Fig. 8 Architecture of biLSTM

Table 4 Control parameters used in equations

Sequential input data is represented as the sequence of embedding vectors of the individual words, as shown in Eq. (1). Unigram, bigram, and trigram features are extracted by applying 1D convolutional kernels of different sizes to X_{1:T}. In the t-th convolution, a window of d words stretching from position t to t + d − 1 is taken as input, and the convolution generates features for that window as shown in Eq. (2),

$$ X_{1:T}=\left[{x}_1,{x}_2,{x}_3,{x}_4,\cdots, {x}_T\right] $$
(1)
$$ {h}_{d,t}=\tanh \left({W}_d\,{x}_{t:t+d-1}+{b}_d\right) $$
(2)

where x_{t:t+d−1} is the concatenation of the embedding vectors of the words in the context window, W_d is the learnable weight matrix, and b_d is the bias. As every filter is convolved over the different regions of the text, the feature map generated by a filter of convolution size d is given in Eq. (3),

$$ {h}_d=\left[{h}_{d,1},{h}_{d,2},{h}_{d,3},{h}_{d,4},\cdots, {h}_{d,T-d+1}\right] $$
(3)

Using convolutional kernels with varied widths expands the scope of the CNN to find latent correlations between several adjacent words. An important property of using convolution filters for feature extraction from textual data is that the number of trainable parameters during feature learning is minimized. This is achieved using a max-pooling layer following the convolutional layers [5]. The input is processed through various convolutional channels, each with its own set of values; during max-pooling, the largest value from each convolutional channel is selected and pooled to create a set of new features. Within each convolution kernel, max pooling is applied to the feature maps of convolutional size d to obtain Eq. (4). The final window features are obtained by concatenating p_d for every filter size d = 1, 2, 3, yielding the unigram, bigram, and trigram hidden features as shown in Eq. (5),

$$ {p}_d=\underset{t}{\max}\left({h}_{d,1},{h}_{d,2},{h}_{d,3},{h}_{d,4},\cdots, {h}_{d,T-d+1}\right) $$
(4)
$$ h=\left[{p}_1,{p}_2,{p}_3\right] $$
(5)

The most significant advantage of the CNN-based feature extraction technique over a conventional LSTM is that the overall number of features is considerably reduced. These features are then used by the depression prediction model following the feature extraction process. The LSTM architecture overcomes the “vanishing gradient” problem on sequential data through gate structures, namely input, output, and forget gates, together with the cell state. The cell state acts as the long-term memory of the LSTM unit, with additive connections between states. At a given time t, for an LSTM cell with input x_t and previous hidden state h_{t−1}, the output state h_t is calculated as shown in Eqs. (6)–(11).

$$ {f}_t=\sigma \left({W}_f\,{x}_t+{U}_f\,{h}_{t-1}+{b}_f\right) $$
(6)
$$ {i}_t=\sigma \left({W}_i\,{x}_t+{U}_i\,{h}_{t-1}+{b}_i\right) $$
(7)
$$ {o}_t=\sigma \left({W}_o\,{x}_t+{U}_o\,{h}_{t-1}+{b}_o\right) $$
(8)
$$ {g}_t=\tanh \left({W}_g\,{x}_t+{U}_g\,{h}_{t-1}+{b}_g\right) $$
(9)
$$ {c}_t={f}_t\circ {c}_{t-1}+{i}_t\circ {g}_t $$
(10)
$$ {h}_t={o}_t\circ \tanh \left({c}_t\right) $$
(11)

where W, U, and b are the learnable parameters, σ is the sigmoid function, and ∘ denotes the element-wise (Hadamard) product. The gates of the LSTM are represented by f_t, i_t, and o_t for the forget, input, and output gates respectively, and c_t is the memory (cell) state. The cell state is what enables LSTMs to capture long-term dependencies in the input sequence and makes them applicable to data with longer sequences. As shown in Figs. 9 and 10, the CNN part of the network comprises three convolution layers with a variable number of filters. The first two layers have 128 filters with a kernel size of 3 × 3 and use sigmoid and Rectified linear activation unit (Relu) activation functions respectively. The third convolution layer has 64 filters with a kernel size of 3 × 3 and a sigmoid activation function.

Fig. 9 Architecture of CNN-biLSTM for the depression prediction

Fig. 10 Layers and parameters used in the CNN-biLSTM neural network

This is followed by a Max-pool layer with a 4 × 4 kernel size. Finally, we used a biLSTM with slightly different hidden computations: since computations are carried out in both forward and backward directions, this addresses the drawback of RNNs that only information from previous steps is available at the next step. We used a “dropout layer” with a “keep probability” of 0.1 to prevent overfitting on the training data. We used Relu activation for the output layer, and the model is trained with the “binary_crossentropy” loss and the “root mean squared propagation (RMSprop)” optimizer. Table 5 shows the parameters used in our model, and the pseudocode of our proposed work is illustrated in Algorithm 1.
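A minimal Keras sketch of the hybrid model as described above is given below. The biLSTM hidden size, the `padding="same"` setting, the reading of “keep probability” 0.1 as the Keras drop rate, and the sigmoid output (the paper reports Relu on the output layer; sigmoid is substituted here so that binary cross-entropy receives values in [0, 1]) are all our assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(vocab_size=10275, embed_dim=300, max_len=25):
    model = models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),
        layers.Conv1D(128, 3, padding="same", activation="sigmoid"),  # 1D kernel of size 3
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        layers.Conv1D(64, 3, padding="same", activation="sigmoid"),
        layers.MaxPooling1D(pool_size=4),            # max-pool over the sequence axis
        layers.Bidirectional(layers.LSTM(64)),       # biLSTM; 64 units is an assumed size
        layers.Dropout(0.1),                         # "keep probability" 0.1, read as drop rate
        layers.Dense(1, activation="sigmoid"),       # paper reports Relu here; see lead-in
    ])
    model.compile(loss="binary_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    return model

model = build_cnn_bilstm()
model.summary()
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10)
```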

Table 5 Parameters used in our hybrid model
Algorithm 1 Pseudocode of the proposed CNN-biLSTM model

7 Comparative methods

7.1 CNN model

CNN [3] has revolutionized research and innovation in ML, as seen in the performance of DL techniques on the ImageNet [17] dataset for object detection problems in Computer Vision. Beyond Computer Vision, CNNs have also been applied successfully in the text analysis domain. CNNs are excellent feature extractors, extracting domain-specific characteristics during each epoch, e.g., retrieving low-level characteristics, such as edges, and high-level characteristics, such as objects in images. Using appropriate filters, convolution layers, dimensionality reduction, and pooling operations like max-pooling and average-pooling, time dependencies and specific features can also be captured. The role of a convolution layer is to distill the input features, reducing the spatial size of the convolution matrix into a form that is easier to process without losing information critical for classification. The matrix involved in realizing the convolution is referred to as the kernel or filter. The kernel moves with a specific stride, and each time a matrix multiplication is performed between the kernel and the portion of the input vector it superimposes. This not only reduces the computing power needed to process the data but also extracts key position-invariant features that contribute to effective model training.

In this study, convolutional layers, pooling layers, drop-out layers, and dense layers have been used, with the convolutional layers carrying most of the training. The architecture of the CNN used to classify the text is shown in Fig. 11.

Fig. 11 Layers and parameters used in the CNN neural network

The fully connected layers hold most of the network parameters. The input features are extracted by the convolution layer, and the resulting convolution matrix is reduced by pooling. The problem of over-fitting is addressed by the dropout layer, which randomly drops units during training to prevent co-adaptation. We present the application of a DNN with CNN + Maxpool to improve classification performance compared to RNN. For the convolution layer, we used a 3 × 3 kernel size with a stride of 1, followed by a Max-pooling layer with a 2 × 2 kernel size. Finally, the convolved features are fed into a dense layer. Since the number of network parameters is close to ~1 M, we used a “dropout” layer with a “keep probability” of 0.2 to avoid overfitting the training data. Sigmoid activation is used for the output layer, and the model is trained with the “BinaryCrossentropy” loss function and the “RMSprop” optimizer. A sentence of length l is represented as a sentence matrix S ∈ ℝ^(d × l), where d is the embedding dimension, and the CNN carries out convolutions on this input using linear filters. A filter of width r is parameterized by a weight matrix W ∈ ℝ^(d × r). Applying the filter W repeatedly to sub-matrices of S produces the feature map O = [o_0, o_1, …, o_(s−r)] ∈ ℝ^(s−r+1), as shown in Eq. (12),

$$ {O}_i=W\cdot {S}_{i:i+r-1} $$
(12)

where i = 0, 1, 2, …, s − r, (·) denotes the dot product, and S_{i:j} is the sub-matrix of S spanning positions i to j. Each feature map O is passed to the pooling layer to generate salient features. The most significant feature v is captured by the max-pooling layer, which selects the highest value of the feature map, as shown in Eq. (13),

$$ v=\underset{0\le i\le s-r}{\max }{O}_i $$
(13)
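A minimal Keras sketch of this CNN baseline, under the hyper-parameters stated above, could look as follows; the filter count and the reading of “keep probability” 0.2 as the Keras drop rate are assumptions:

```python
from tensorflow.keras import layers, models

def build_cnn(vocab_size=10275, embed_dim=300, max_len=25):
    model = models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),
        layers.Conv1D(128, 3, strides=1, activation="relu"),  # 128 filters is an assumed count
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dropout(0.2),              # "keep probability" 0.2, read as drop rate
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model
```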

7.2 Recurrent neural networks – RNN model

RNN and LSTM have been highly popular in NLP and speech recognition. Applications such as language modeling, sentiment classification, contextual modeling, named entity recognition, and character-level neural machine translation have produced excellent experimental results with RNN-based sequential modeling [23, 59]. In 2016, Google’s Neural Machine Translation System [72] utilized a deep LSTM network for multi-lingual translation. Intuitively, LSTM is useful for learning the context among adjacent words, improving classification performance where class ambiguity exists. RNNs are a family of neural networks that support sequential data modeling; they are derived from feed-forward networks and exhibit behavior analogous to the functioning of the human brain.

RNNs have a memory that captures sequential information using unrolled units with hidden states; thus, weights and biases are shared over time, as shown in Fig. 12. For every element in a sequence, the RNN’s output utilizes the previous computations, which is why these networks are described as recurrent, and future computations build on this captured information. As a result, RNNs are successful in NLP problems such as automatic translation, where inputs and outputs are highly dependent on each other. RNNs can in principle use information in long sequences, but practically they are limited to a few steps back and capture only short-term dependencies. Generally, in an RNN, the initial layer is the encoder, which converts text into a sequence of token indices. After the encoder layer, data is routed through the embedding layer, and sequences of word indices are converted into sequences of trainable vectors; words with similar meanings are assigned similar vectors. Iterating over the elements, the RNN processes the input sequence one time step at a time, passing information from one step to the next. After the RNN has condensed the sequence into a single vector, final processing is done using the dense layer.

Fig. 12 Layers and parameters used in the RNN neural network

In our study, we used a network of SimpleRNN [49] + Dense layers with “Relu” [1] activation to solve the depression prediction classification problem on the Twitter dataset. SimpleRNN is a fully connected RNN in which the output of the prior time step is fed to the next step. We used a “dropout” with a “keep probability” of 0.4 and a “tanh” activation function for the output layer. The model is trained with the “binary_crossentropy” loss and the “Adam” [30] optimizer with a learning rate of 0.001. The RNN requires 3-dimensional input, which is provided by the embedding layer. The design of our model is illustrated in Fig. 12. RNNs have proven very promising for improving baseline performance, although they have certain limitations: their computation is slow, they are not good at accessing information over the long term, and they cannot account for future inputs when computing the present state. As a result, other DL approaches were explored to address these issues. LSTMs [25] and biLSTMs [12] are the most frequently used RNN variants; they differ only in how the hidden state is calculated but are far better at capturing the long-term dependencies that simple RNNs cannot.
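A minimal Keras sketch of this SimpleRNN baseline is shown below; the unit count is an assumption, and sigmoid is substituted for the reported tanh output so that the binary cross-entropy loss stays well-defined:

```python
from tensorflow.keras import layers, models, optimizers

def build_rnn(vocab_size=10275, embed_dim=300, max_len=25):
    model = models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),  # provides the 3-D input the RNN expects
        layers.SimpleRNN(64, activation="relu"),  # 64 units is an assumed size
        layers.Dropout(0.4),                      # "keep probability" 0.4, read as drop rate
        layers.Dense(1, activation="sigmoid"),    # paper reports tanh here; see lead-in
    ])
    model.compile(loss="binary_crossentropy",
                  optimizer=optimizers.Adam(learning_rate=0.001),
                  metrics=["accuracy"])
    return model
```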

8 Results and discussion

This section presents the experimental setup and the evaluation measures considered while conducting experiments. It discusses the results achieved for the proposed CNN-biLSTM approach and includes a performance comparison between the proposed method and other state-of-the-art studies. Finally, it presents statistical analysis and visualizations of the lingual dialects used by depressed and non-depressed users.

8.1 Experimental setup and evaluation metrics

All DL techniques have been implemented using Anaconda Navigator in Python 3.7 and Keras (an open-source library based on TensorFlow). For calculating the performance of the implemented algorithms, we used information retrieval metrics extracted from the classifier’s confusion matrix [27, 79]. A confusion matrix is a way to summarize a classification model through multiple sub-metrics; those derived from it include precision, recall, accuracy, F1-score, sensitivity, and AUC curves. It analyzes the model’s performance for each class in a classification problem. A confusion matrix is sometimes referred to as an error matrix because of its tabulated representation of correct and incorrect predictions. The metrics are described in Tables 6 and 7.

Table 6 Terms used to compute confusion matrix
Table 7 Description of evaluation metrics
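To make the computation of these metrics concrete, the scikit-learn sketch below derives them from a confusion matrix; the label and probability arrays are dummy values for illustration, not our experimental outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0])                # illustrative ground-truth labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6])    # illustrative model probabilities
y_pred = (y_prob > 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)      # specificity is not provided directly by scikit-learn

print("accuracy:   ", accuracy_score(y_true, y_pred))
print("precision:  ", precision_score(y_true, y_pred))
print("recall:     ", recall_score(y_true, y_pred))
print("F1-score:   ", f1_score(y_true, y_pred))
print("specificity:", specificity)
print("AUC:        ", roc_auc_score(y_true, y_prob))
```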

The performance of a classification model is evaluated by plotting the distribution of predictions across classes on test data with known actual values, in the form of a confusion matrix. The confusion matrices for RNN, CNN, and CNN-biLSTM are shown in Fig. 13a–c. CNN-biLSTM obtained a significantly larger number of true positive and true negative values, demonstrating that it is an efficient classifier for our dataset.

Fig. 13 Confusion matrix for (a) RNN, (b) CNN, and (c) CNN-biLSTM

8.2 Comparative analysis

The proposed hybrid CNN-biLSTM model is compared with the CNN and RNN models for depression prediction using accuracy, precision, recall, F1-score, and specificity on the test split of the dataset. Table 8 shows that CNN-biLSTM achieves higher accuracy, precision, F1-score, and specificity, i.e., 94.28%, 96.99%, 94.78%, and 96.35% respectively, compared to the RNN and CNN models. The proposed model aims to enhance the precision score while maintaining a relatively stable recall value, thereby minimizing incorrect depression predictions. The performance metrics of Table 8 are shown graphically in Fig. 14.

Table 8 Experimental results
Fig. 14 Comparison of RNN, CNN and CNN-biLSTM

From Fig. 14 it is evident that the RNN model has the lowest prediction performance compared to CNN and CNN-biLSTM, owing to the inability of RNNs to deal with the “vanishing gradient” and “exploding gradient” problems. Although LSTM can tackle both problems, it can only learn from information before the current word, not after it; the meaning of a word in a sentence is related not only to the preceding history but also to the information that follows. Instead of a plain LSTM, this study therefore employs biLSTM, which incorporates bidirectional information while also overcoming the “vanishing gradient” and “exploding gradient” problems. The best results are obtained when CNN’s deep feature extraction capabilities are combined with the biLSTM model.

To analyze model performance more closely, per-epoch graphs of accuracy, loss, and AUC are compared in Fig. 15. For most models, training and test accuracy follow a typical learning curve and stabilize around 0.90, as shown in the graphs. However, the loss curves, particularly on the test data, show unstable, noisy behavior for both RNN and CNN, whereas the proposed hybrid CNN-biLSTM model reduces this propagated noise. For RNN and CNN, the overall AUC score is around 0.94; the AUC value of CNN-biLSTM is slightly higher, i.e., 0.95432, indicating improved performance. With the proposed hybrid CNN-biLSTM model, the accuracy of depression prediction improves while the model loss decreases.

Fig. 15 Comparison of (a) accuracy, (b) loss, and (c) AUC graphs for RNN; (d) accuracy, (e) loss, and (f) AUC graphs for CNN; (g) accuracy, (h) loss, and (i) AUC graphs for CNN-biLSTM

Table 9 compares the F1-scores and accuracies of state-of-the-art studies with the proposed hybrid CNN-biLSTM model. In comparison to existing studies on depression prediction from Twitter data, our proposed hybrid model improves not only accuracy but also the overall F1-score. Figure 16 compares the accuracies of various algorithms (as implemented in [65] on the same Twitter dataset) with our proposed model. The proposed hybrid CNN-biLSTM model outperforms traditional approaches with an accuracy of 94.28%. This suggests that hybrid deep learning models merit further investigation for depression analytics.

Table 9 Classification comparison with state-of-the-art studies using F1-scores and accuracy measures
Fig. 16 Comparison of accuracy of the CNN-biLSTM model with state-of-the-art models [65] on the same dataset

It can be concluded that CNN effectively extracts local features from different locations in a sentence but does not capture the contextual features of a word token. Convolution, pooling, and fully connected layers allow a CNN to adapt and learn important features using the backpropagation algorithm. The convolution operation shares weights across neighboring positions, allowing kernels to extract local information within a given window, and the pooling operation lets the CNN learn relevant feature patterns: the max-pooling layer extracts important information from the input feature maps, outputting the most significant value in each map while discarding the others, thereby shrinking the number of input features. In comparison to LSTM, RNN has the shortcoming of being unable to handle the “vanishing gradient” and “exploding gradient” problems or to extract specific context information from a long sentence. LSTM, on the other hand, efficiently tackles the “vanishing gradient” and “exploding gradient” problems and provides contextual feature extraction. However, a plain LSTM is unidirectional, meaning it does not consider the effect of the next word in a sentence on the current context. The bidirectional nature of the biLSTM model concentrates on the important contextual features of a sentence, and the embedding layer extracts not only word-level but also character-level embedding vectors.

Thus, the proposed CNN-biLSTM model efficiently addresses the shortcomings of CNN and RNN by extracting both local features and contextual information from the features produced by the convolutional layer. This validates our hypothesis that integrating CNN and biLSTM improves localized feature extraction while also leveraging biLSTM’s bidirectional enhancement of RNN functionality.

8.3 Statistical analysis

In this study, a t-test is applied to determine whether there is a significant difference between the two groups, i.e., depressed and non-depressed tweets. The t-test [45] is a parametric test that determines whether two sets differ from one another. Here, the aim is to determine whether there is a significant difference between the average string lengths of depressed and non-depressed tweets. The t-test statistic is given by Eq. (14),

$$ t=\frac{\overline{x_1}-\overline{x_2}}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} $$
(14)

where t is the t-value to be calculated, \( \overline{x_1} \) and \( \overline{x_2} \) are the means of the two groups (depressed and non-depressed) whose sample distributions are compared, \( \overline{x_1}-\overline{x_2} \) is the difference in sample means, s_1 and s_2 are the standard deviations of the two distributions, and n_1 and n_2 are the numbers of observations in each group. For this t-test, the degrees of freedom (df) is the lesser of (n_1 − 1, n_2 − 1). To perform the statistical analysis, the average length is calculated for both depressed and non-depressed tweets. The tweet lengths are shown in Fig. 17, and density distribution plots of the string lengths are shown in Fig. 18a, b. The means of the two distributions are μ_1 = 9.17 for depressed strings and μ_2 = 6.63 for non-depressed strings. The null hypothesis assumes that the means of the two population sample distributions are equal (H0: μ_1 = μ_2), and the t-test is used to test the alternate hypothesis against the null; the alternate hypothesis asserts a significant difference between the means of the two distributions.

Fig. 17 Tweet length for non-depressed users

Fig. 18 Distribution plot for (a) depressed users and (b) non-depressed users

The p value acts as the decision threshold for the test results: it permits us to conclude when to reject the null hypothesis H0: μ1 = μ2, i.e., to establish that the two groups are dissimilar. The p value obtained from the t-test is 0.000, and it is compared with the chosen significance level (α), which in our study is α = 0.02.

1. If p value > α (critical value): the t-test fails to reject the null hypothesis, establishing that the distributions have the same mean.

2. If p value < α (critical value): the t-test rejects the null hypothesis, establishing that the means of the sample distributions are different.

In this study, we performed a two-tailed test. The calculated t-test statistic is 34.749, whereas the tabulated t value at α = 0.02 and one degree of freedom is 31.821. The t-test results are shown in Table 10. The inference from the t-test is that the null hypothesis is rejected: the distributions of tweet lengths for depressed and non-depressed users differ at the 0.02 level of significance.
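A small SciPy sketch of this test on illustrative tweet-length samples is shown below; the synthetic Poisson samples merely mimic the reported group means and are not the study’s data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative per-tweet word counts; the group means ~9.17 and ~6.63 follow the values above.
len_depressed = rng.poisson(9.17, size=1000)
len_non_depressed = rng.poisson(6.63, size=1000)

# Welch's two-sample t-test (unequal variances), matching the unpooled form of Eq. (14)
t_stat, p_value = stats.ttest_ind(len_depressed, len_non_depressed, equal_var=False)

alpha = 0.02
if p_value < alpha:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: reject H0; the mean tweet lengths differ")
else:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: fail to reject H0")
```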

Table 10 Result of the t-test

Figure 19a shows the word cloud [53] for depressed users, depicting that the words “depress” and “depressed” are frequently posted by depressed users on social media. This demonstrates how often depressed people express and post their sentiments on social platforms. Besides that, other words related to mental health, such as anxiety, mental disorder, bipolar, and PTSD, appear for more than 20% of depressed users. Figure 19b shows the word cloud for non-depressed users, depicting positive, life-enriching attitudes and an emphasis on self-love, as shown by terms such as “love” and “good”.

Fig. 19 Word cloud of (a) depressed users and (b) non-depressed users

The word frequency plots of the top 20 words in Fig. 20a, b show that words like severe, depressed, help, anxiety, bipolar, etc. are often used by depressed users, whereas the top words used by non-depressed users are love, like, know, think, etc. From this analysis we can conclude that depressed individuals are more likely to engage in depressive self-analysis, which they actively report on various social media platforms.
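A brief sketch of how such frequency counts and word clouds can be produced with `collections.Counter` and the `wordcloud` package is given below; the token list is illustrative, not the study’s vocabulary:

```python
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# depressed_tokens is assumed to hold the pre-processed tokens pooled over all
# tweets of depressed users; the short list below is only illustrative.
depressed_tokens = ["depress", "depress", "anxieti", "help", "severe", "bipolar", "ptsd"]

freq = Counter(depressed_tokens)
print(freq.most_common(20))          # top-20 word frequencies, as plotted in Fig. 20

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(freq)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```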

Fig. 20 Word frequency distribution plots of (a) depressed users and (b) non-depressed users

9 Conclusion

Depression is one of the most common mental disorders worldwide. It is important to educate ourselves about depression at the individual, communal, and global scale, and addressing the issue and helping individuals suffering from depression should be given the utmost priority. For classification-based problems, NB, DT, and RF are generally used in text-based SA. In this study, we presented a methodology for predicting depression using a raw Twitter dataset comprising three subsets (D1, D2, and D3). This real-world dataset has been used to categorize non-depressed and depressed users in the proposed model. The proposed hybrid model, CNN-biLSTM, is characterized by the interplay between a CNN and a biLSTM network. Unlike a plain RNN, the biLSTM in our model can process longer text sequences and tackle the “vanishing gradient” and “exploding gradient” problems; moreover, our approach extracts features using convolution layers together with the enhanced recurrent network architecture. Compared with state-of-the-art studies, the CNN-biLSTM model shows better performance across various evaluation metrics. Experimentally, the CNN-biLSTM model achieved the best accuracy of 94.28%, precision of 96.99%, F1-score of 94.78%, specificity of 96.35%, and AUC score of 95.43%.

This work has substantial potential for future study; for instance, the model’s accuracy could be increased by exploring different combinations of neural network layers and activation functions. Pre-trained language models such as Deep Contextualized Word Representations (ELMo) and Bidirectional Encoder Representations from Transformers (BERT) could be used in the future, trained on a large corpus of depression-related tweets. Using such pre-trained language models can be challenging due to the restrictions they impose on sentence sequence length; nevertheless, studying these models on this task would help to reveal their pros and cons. Our future work also aims to detect other mental illnesses in conjunction with depression, to capture the complex mental issues pervading an individual’s life. Beyond raw Twitter data, various other ML methods can be evaluated on other social media networks.