1 Background

According to the World Health Organization, more than 0.8 million people die by suicide every year. According to the 2019 Leading Causes of Death Reports from the Centers for Disease Control and Prevention (CDC) WISQARS, suicide is the tenth leading cause of death in the United States. According to official US data, one person dies by suicide every 11.1 min (Footnote 1). According to the latest available data, Statistics Canada estimates 4157 suicides in 2017, making it the ninth leading cause of death. Clinical psychologists and academic researchers have come across an increasing number of mental health problems, and their exposure on social media platforms, during the COVID-19 pandemic lockdown. The pandemic has long-term impacts on the mental health and wellness of the masses due to economic insecurity and isolation. Suicide cases have an adverse physical, economic, and emotional impact on social well-being. Early suicide risk prediction may help to control the suicide rate by signalling the need for preventive measures.

As per reports released in August 2021 (Footnote 2), 1.6 million people in England were on waiting lists for mental health care. An estimated 8 million people could not get specialist help as they were not considered sick enough to qualify. This situation underscores the need to automate mental health detection from social media data, where people express themselves and their thoughts, beliefs and emotions with ease. These writings contain heterogeneous, unstructured and ill-formed data that is human-readable but difficult for a system to interpret automatically. Recent studies on predicting suicidal tendency from social media data using machine learning (ML) models [1,2,3,4,5] are more successful than those using medical records [6], and have paved the way to explore deep learning (DL) [7,8,9,10,11,12] and computational intelligence techniques [13] for quantifying suicidal tendency. We limit the scope of our study to stress, clinical depression, and suicide risk.

1.1 Motivation

Traditional clinical psychology relies on labour-intensive, theory-driven methods to identify signs of suicidal tendencies. This subjective approach requires time-consuming face-to-face interaction. Of the people who are at risk, 80% are not comfortable disclosing the level of stress and anxiety that they may have [14]. Further, an increase in the levels of stress and anxiety may steer a person’s thoughts towards suicidal tendencies. Progressive studies on suicide prevention [15] have enriched the research community with datasets and resources, and provide motivation for new frontiers.

In the past, we closely observed cross-sectional studies for identifying mental disorders in a given self-reported text using AI models for classification and categorization. We also witness progressive studies on finding mental disorder levels from longitudinal data, which provide useful insights. Based on these investigations, the research community may report the available and required resources for medical assistance to people at risk in the near future. These developments reduce the dependency on in-person sessions with a therapist or clinical psychologist and thus the cost of identifying people at risk. As evident from the recent deployment of a suicide risk detection model by Facebook [16], we may identify potential users at risk and offer them help in the near future.

As evident from past studies, social media platforms have a strong association with the feelings expressed by users [17,18,19]. About 8 out of 10 people tend to disclose their suicidal tendencies on social media [20]. Mental health prediction from social media [21] facilitates suicide risk assessment [22] and early detection of suicidal tendencies by using the emotion spectrum from a social media user’s historical timeline [7] due to the presence of the Papageno effect [23]. Such path-breaking developments strengthen faith in developing learning-based mechanisms to capture mental health levels using language.

1.2 Mental Healthcare: A Taxonomy

After a comprehensive investigation of NLP-centered problems and the social computing of mental health, we introduce a unique taxonomy for mental healthcare as shown in Fig. 1. We further examine mental healthcare as an interdisciplinary domain of computational linguistics and human–computer interaction to automate the predictions.

We discover different aspects of the mental health domain and observe both independent and integrated studies for each aspect. In this section, we describe these aspects. Recent developments with ECG signals, Electronic Health Records (EHR), demographic information and other medical reports exemplify the available data and resources for neuroscience-based studies, known as the biomedical domain. The social aspect of mental health studies is closely associated with human behaviour within society. The psychological aspect is inclined towards theorizing thoughts on mental health. The ethical aspect is concerned with the security of the data, that is, to what extent and in what manner it can be used [24]. The prevention and control measures for mental health issues are examined independently or in association with any of the corresponding aspects.

Fig. 1 Taxonomy on mental healthcare

Conventionally, identification of people at risk is carried out on digital data and through traditional offline interactions. The use of the traditional method is decreasing because of social stigma and the unavailability of clinical psychologists. Digital mental health data comprises the blogs and diaries of a user, information filled in private questionnaires or Google forms, (voluntarily) self-reported mental illness, and online social media data. We choose to explore online social media language resources, which contain heterogeneous types of information such as linguistic data, user metadata, social metadata, and multimedia data. The scope of this manuscript is text on social media platforms (Twitter, Reddit, Sina Weibo), occasionally accompanied by images, for stress, depression and suicide risk. Past studies over multimodal (image, audio and visual) social media platforms (Instagram, YouTube; see Footnote 3) are beyond the scope of this manuscript due to the different nature and semantics of the available resources.

The Social NLP research community investigates six other mental health problems in social media data, which may or may not be directly associated with suicidal tendencies. Moreover, among the nine mental health problems, stress, clinical depression and suicide risk detection are the most widely studied on social media [18]. The success of existing AI models has opened a new research direction for this problem and motivates academic researchers to find practical applications in industry.

1.3 Corpus Overview

We perform an in-depth analysis of 92 research articles, which are classified as follows: 9 articles on stress; 32 articles on depression; 37 articles on suicide risk; and 14 articles on two or more mental disorders. The year-wise distribution of publications is shown in Fig. 2, and the top 3 venues are CLPsych, ACL, and AAAI, as observed from Fig. 3. We note that the research articles on stress and suicide risk detection are fewer than the articles on identifying clinical depression.

Fig. 2 Year-wise distribution of number of publications on mental disorders

Fig. 3 Year-wise distribution of publication venues

The research community’s area of interest has evolved from social venues [25,26,27,28] to Human–Computer Interaction venues [29, 30] and the Computational Linguistics domain of computer science [31, 32]. Existing studies have addressed concerns about datasets and their ethical constraints [17, 33,34,35]; multi-modal feature extraction [36,37,38,39]; classification techniques [1,2,3,4, 9, 39,40,41]; the graph learning approach [2]; the use of the Internet of Medical Things for real-time applications [34]; the noisy label problem in dataset annotations [42]; and improvements over attention mechanisms [9, 11, 43, 44].

1.4 Scope of the Study

The research domain of Mental Illness Detection and Analysis on Social media (MIDAS) has been evolving for less than a decade [45]. In the past, the honest disclosure of public opinion about privacy concerns has demanded explainable and responsible AI models [46]. Datasets and their ethical issues were explored in depth in a systematic review for statistical analysis of mental health datasets [17]. A critical review of 75 research articles on mental health issues from 2013 to 2018 studies the design and research methods [18]. A short survey addresses concerns about the association between social media data and mental health prediction [47]. Recent advancements yield a comprehensive study of features and online behaviour patterns for mental health prediction with DL mechanisms [48].

We focus on direct contributions in the field of suicide risk detection by identifying the extent of suicidal tendencies, which opens a new research direction for building real-time NLP-centered applications that address mental disorders. Our major contributions are:

  • Classification of heterogeneous social media features.

  • State-of-the-art AI models for stress, depression and suicide risk detection and analysis.

  • Available tools, resources, and dataset in this research domain.

  • Open challenges and new frontiers.

We structure the rest of this work as follows. Section 2 presents the classification of different features of social media data for suicide risk detection and elaborates on embedding and feature enhancement in this domain. Section 3 gives a summary of automated learning-based techniques for quantifying mental health and compiles a list of available datasets and other tools/resources. Section 4 highlights the open challenges and new frontiers. Finally, Sect. 5 concludes the manuscript.

2 Features from Social Media Data

With this background, data curation becomes the most challenging task, as the data is unstructured or semi-structured, user-generated and ill-formed. Recent advances in the development of classifiers [49] enrich natural language understanding to infer mental states. In the past, exclusive studies on feature extraction have made headway towards finding neuropsychiatric disorders from self-reported text [13, 50].

The social media platforms are usually characterized by one-way connections (Twitter, Reddit, Instagram) and two-way connections (Facebook). The most widely used social media platforms are Twitter and Reddit followed by Instagram and Facebook. We observe multimodal models on social media data but we limit our studies to natural language processing and social features only.

When the available information is limited, patterns are fabricated from it. These patterns aid in feature extraction or transformation for both cross-sectional and longitudinal studies. Recent works on feature extraction have addressed the need to explore dominant features [13, 50, 51].

Fig. 4 Architecture of feature harvesting from social media data for classification algorithms

The architecture of a cross-sectional study to infer mental state from social media data is given in Fig. 4. Learning-based models are built on features extracted from data, such as handcrafted features, statistical information, and automated features, to name a few. We categorize and discuss four classes of features, namely, user-profile features, linguistic features, social features, and multimedia features, as given in Fig. 5.

Fig. 5 Classification of social media features for quantifying suicidal tendencies

The Social NLP research community exploits social media data for two modalities: text and images. We extract the textual features using either a conventional approach or automation. The conventional approach covers surface-level linguistic features and semantic-level aspects, which are referred to as handcrafted features. The automatic features incorporate vector representations for end-to-end pre-trained models.

2.1 Handling Ambiguity of Features

Although there is no ideal classification of features, we classify them into four categories, with a few exceptions belonging to multiple categories. We resolve the ambiguities with the following guidelines:

  • The metadata of posts yields two kinds of information: user metadata, which is data about the user’s profile and is thus kept under User Profile Features; and post metadata, which is data about the post and is categorized under Social Features.

  • The ruminative response style is the expression of repetitive thoughts and behaviour [52]. People with depression tend to express their feelings or negative experiences repeatedly by repeating sentences in their posts. Though the ruminative response style is part of both user behaviour and linguistic style, it is more closely associated with user-profile features and is thus studied under User Profile Features.

  • An interesting study introduces bBridge [53], a big data based feature extraction approach for social media data which contains both user-profile features and social networking features.

  • The community-specific information of the user comprises information about followers and favourites. We associate these features with the user’s social networking, and they are thus discussed in Social Features.

2.1.1 User Profile Features

Past studies reveal the proportional impact of employment on the psychiatric behaviour of a person by analyzing their college degree or type of job [54]. People sharing similar demographic, linguistic and cultural traits with depressed users are more at risk than others [55]. In this context, we further classify the user-profile features in Table 1.

Demographic The users’ metadata contains information about age, gender, occupation, race, and ethnicity [55]. These characteristics disclose people’s alignment towards psychiatric disorders; for example, mental disorders come to the fore more in older people than in younger ones, and the social well-being of males declines more than that of females [56].

Spatio-temporal Ref. [7] models a user’s emotional spectrum by tracking their historical timeline on a social media platform. In that study, the patterns of irregularity in posting behaviour incorporate a time-varying component, and a time-aware LSTM cell is used to capture them [57]. A shared task in the eRISK workshop at the CLEF forum introduces a longitudinal dataset which encourages more research contributions on early risk detection in social media [58]. Similarly, the location of a social media post has strong associations with economic indexes such as Ease of Doing Business (Footnote 4) and the World Happiness Report (Footnote 5), and thereby with the mental health status of residents. In future, both the temporal and the location components may yield significant information for mental health analysis.
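To make the time-varying component concrete, the following minimal sketch computes the elapsed time between consecutive posts of one user, which is the kind of irregularity signal a time-aware sequence model could consume. The function name `elapsed_hours` and the sample timestamps are illustrative assumptions and are not taken from the cited studies.

```python
from datetime import datetime


def elapsed_hours(timestamps):
    """Return the gaps, in hours, between consecutive posts.

    Assumption: `timestamps` is a list of datetime objects for one user;
    they are sorted here so the caller does not need to pre-order them.
    """
    ordered = sorted(timestamps)
    return [
        (later - earlier).total_seconds() / 3600.0
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Hypothetical posting timeline of a single user.
posts = [
    datetime(2021, 3, 1, 23, 30),
    datetime(2021, 3, 2, 1, 30),
    datetime(2021, 3, 4, 1, 30),
]
gaps = elapsed_hours(posts)
print(gaps)
```

Large or highly variable gaps can then be passed alongside the text representation of each post; how they enter the model is a design choice of the specific architecture.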

Behavioural features Social media users are more likely to express themselves late at night than during the daytime [39]. Behavioural patterns such as the insomnia index, sleep cycle [45] and ruminative response style [52] affect the user’s state of mind. People with depression tend to express their feelings or negative experiences repeatedly. In this context, Ref. [59] considers the ruminative response style using a text encoding mechanism, which proves significant for mental health analysis.
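As a rough illustration of two behavioural signals discussed above, the sketch below computes the share of posts made late at night and the share of repeated sentences in a post. The night window (00:00 to 05:59) and both function names are assumptions chosen for illustration; the cited studies may define an insomnia index or rumination measure differently.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: "late night" means 00:00-05:59 in the user's local time.
NIGHT_HOURS = range(0, 6)


def late_night_ratio(timestamps):
    """Fraction of posts whose hour falls inside NIGHT_HOURS."""
    if not timestamps:
        return 0.0
    night = sum(1 for t in timestamps if t.hour in NIGHT_HOURS)
    return night / len(timestamps)


def repetition_ratio(text):
    """Share of sentences that repeat an earlier sentence in the same text.

    Sentences are split naively on '.', '!' and '?' and compared after
    lower-casing, so this is only a crude proxy for rumination.
    """
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    counts = Counter(sentences)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(sentences)
```

Both values lie between 0 and 1, so they can be appended directly to a handcrafted user-profile feature vector.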

Table 1 User profile feature extraction for mental health state

2.1.2 Linguistic Features

To study linguistic features, Ref. [54] recapitulates the importance of the words that users pick to express their feelings in their personal writings. People with depression exhibit differences in linguistic style, such as the distribution of nouns, verbs and adverbs and the unconscious conceptualization of complex sentences [64]. Exclusive studies on linguistic features reveal the increased use of first-person language, references to the current scenario and anger-based terms as indicators of a person’s state of mind [65]. We further classify linguistic features in Table 2.
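The first-person language signal can be sketched as a simple ratio of first-person singular pronouns to all tokens. The pronoun list and the tokenizer below are simplifying assumptions made for illustration; dictionary resources such as LIWC define their categories more carefully.

```python
import re

# Assumption: a small hand-picked list of first-person singular pronouns.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def first_person_ratio(text):
    """Fraction of tokens that are first-person singular pronouns.

    Tokens are lower-cased runs of letters and apostrophes, so contractions
    such as "I'm" stay as one token and are not counted here.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in FIRST_PERSON)
    return hits / len(tokens)
```

Expanding contractions before tokenization would be a natural refinement if this feature is used in practice.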

Emotional Features Infusing implicit and explicit emotions while encoding text is a current trend. We emphasise and recommend the use of sentiments and emotions from the active vocabulary of a user. The research community has many emotion-based pre-trained models available as word embeddings. Such models set a strong foundation for building contextual transformer-based models [66]. We come across different pre-trained models such as EmoBERT [67], DistilBERT for emotions (Footnote 6), MentalBERT [68], and other contextual BERT-based models [69].

Semantic Features Topic modelling methods such as LDA [28, 70] are used for clustering posts related to similar topics. Depressed and non-depressed users discuss different topics, which may help to identify potentially depressed users [71]. Another interesting study aims to understand Twitter users’ discourse and psychological reactions during the COVID-19 pandemic using topic modelling [72].

Statistical Features We categorize the statistical features into lexical, dictionary-based, and syntactic. The lexical features use the tokenized form of text to calculate statistical measures such as TF-IDF, n-grams, morphology and similar features. Dictionary-based features use existing dictionaries such as LIWC (Footnote 7), the Suicide dictionary (Footnote 8) and ANEW (Footnote 9) for assigning values. Syntactic features are used to check the context of a token with respect to its neighbourhood, for instance, Part-Of-Speech tagging. The domain-specific features are lexicons of mental health specific words derived from Wikipedia, domain-specific dictionaries, and depression symptoms such as those in the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) (Footnote 10).
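For readers less familiar with the lexical measures named above, the following sketch shows n-gram extraction and a plain, unsmoothed TF-IDF weighting over tokenized posts. The helper names are hypothetical, and the formula (term frequency times the log of inverse document frequency) is one common variant; library implementations often add smoothing and normalisation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Compute unsmoothed TF-IDF weights.

    `docs` is a list of token lists; the result is one {term: weight}
    dictionary per document, with weight = tf * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

With this variant, a term that occurs in every document receives a weight of zero, which is why very common words contribute little to the representation.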

Domain Specific With the evolving era of ‘Emotional Intelligence’, we observe a clear description of emotion models in clinical psychology and psychiatric theories for affective computing [73]. Valence refers to the pleasant–unpleasant quality of a stimulus and ranges from negative to positive, whereas arousal refers to the intensity of a stimulus and ranges from dull to arousing. Past studies on mental health analysis (MHA) incorporate the Valence Arousal Dominance (VAD) emotion model [36, 39, 43, 74] and the Plutchik model [7, 75]. Plutchik’s theory of emotion and emotional consequences for cognition, personality, and psychotherapy is derived from an evolutionary perspective [75].
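A lexicon-based reading of valence and arousal can be sketched as averaging per-word scores over a post. The tiny lexicon below is entirely hypothetical, with made-up scores in the range 0 to 1; in practice such scores would come from a published affective lexicon.

```python
# Hypothetical (valence, arousal) scores in [0, 1]; values are illustrative only.
TOY_VA_LEXICON = {
    "happy": (0.9, 0.6),
    "calm": (0.7, 0.1),
    "angry": (0.1, 0.9),
    "sad": (0.1, 0.3),
}


def mean_valence_arousal(tokens, lexicon=TOY_VA_LEXICON):
    """Average valence and arousal over the tokens found in the lexicon.

    Returns None when no token is covered, so callers can distinguish
    "no evidence" from a neutral score.
    """
    scored = [lexicon[token] for token in tokens if token in lexicon]
    if not scored:
        return None
    valence = sum(v for v, _ in scored) / len(scored)
    arousal = sum(a for _, a in scored) / len(scored)
    return valence, arousal
```

Low valence combined with high arousal would correspond to agitated negative affect under this model, whereas low valence with low arousal is closer to a subdued negative state.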

Table 2 Linguistic feature extraction for mental health state

2.1.3 Social Features

Depressed people are conscious about their social circle on social media platforms and have a limited number of friends [91]. A depressed tweet gains more attention from friends, and so important features are retweets, comments, and favourites [76]. We further classify social features into social metadata and social networking as shown in Table 3.

Social Metadata Social information about a user’s post consists of the length of the post, the number of hashtags, the number of URLs used and other minute details, which are termed the metadata.
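The post-level metadata listed above can be extracted with a few regular expressions. This is a minimal sketch; the patterns for hashtags and URLs and the dictionary keys are assumptions for illustration, and platform APIs usually expose such counts directly.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def post_metadata(text):
    """Return simple surface statistics of a single post."""
    # Remove URLs first so that '#fragment' parts are not counted as hashtags.
    without_urls = URL_RE.sub("", text)
    return {
        "length": len(text),
        "n_tokens": len(text.split()),
        "n_hashtags": len(HASHTAG_RE.findall(without_urls)),
        "n_urls": len(URL_RE.findall(text)),
    }
```

For example, calling `post_metadata` on a tweet yields a small dictionary that can be concatenated with linguistic features of the same post.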

Social Network We observe patterns in the interactions and relationships among users [92]. These networking features are gaining importance due to the non-Euclidean space representation of the problem. Applying hyperbolic geometry to the non-Euclidean representation has opened a new research direction in the field of mental health analysis [66, 93].

Table 3 Social feature extraction for mental health state

2.1.4 Multimedia Features

The increasing use of images for feature extraction or transformation considers either the display picture in Twitter (also referred to as avatars in Reddit) or the images posted by a user. The colour combinations, colour ratio, brightness, saturation, and convolution are a few interesting features for mining social media images, as shown in Table 4.
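Two of the colour features mentioned above, brightness and saturation, can be approximated from raw pixel values. The sketch below assumes an image has already been decoded into a list of RGB tuples, and it takes the HSV value channel as a stand-in for brightness, which is one of several possible definitions.

```python
import colorsys


def colour_features(pixels):
    """Mean saturation and brightness of an image.

    `pixels` is an iterable of (r, g, b) tuples with channels in 0-255.
    Assumption: brightness is taken as the V channel of the HSV colour space.
    """
    pixels = list(pixels)
    if not pixels:
        return {"saturation": 0.0, "brightness": 0.0}
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    n = len(hsv)
    return {
        "saturation": sum(s for _, s, _ in hsv) / n,
        "brightness": sum(v for _, _, v in hsv) / n,
    }
```

Decoding image files into pixel tuples requires an imaging library and is deliberately left out of this sketch.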

Table 4 Multimedia feature extraction for mental health state

2.2 Feature Vector Representation

Feature vectorization is the process of representing input data in the form of a vector. We further classify feature vector representation into text feature vectorization and image feature vectorization as shown in Fig. 6. Text feature vectorization comprises feature extraction and feature embedding. We list the past studies along with classified insights for feature vector representation in Table 5.

Textual Feature Extraction The traditional conversion of text into vectors (TFEx) is performed with the conventional TF-IDF vectorizer, Count vectorizer, and Hashing vectorizer [42]. For dimensionality reduction, the selected features are processed further using PCA, NMF and other filter-based linear feature selection algorithms. In the past, authors have used one-hot encoding to encode a set of Tweets [82]. Uni-modal dictionaries are learned from text and image data separately and are further useful for joint sparse representation [39]. These traditional feature extraction techniques are convenient for converting social media data into vector representations for classification models.
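The difference between a count vectorizer and a hashing vectorizer can be shown in a few lines. This is a simplified sketch with hypothetical helper names: the count vector needs a fixed vocabulary, whereas the hashing variant maps each token to one of `n_buckets` positions (an assumed parameter) and therefore needs no vocabulary but may suffer collisions.

```python
import hashlib
from collections import Counter


def count_vector(tokens, vocabulary):
    """Count occurrences of each vocabulary term, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def hashing_vector(tokens, n_buckets=8):
    """Map tokens to a fixed-size vector with the hashing trick.

    A SHA-256 digest is used instead of Python's built-in hash() so that
    the bucket assignment is stable across interpreter runs.
    """
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % n_buckets] += 1
    return vector
```

Tokens outside the vocabulary are silently ignored by `count_vector`, while `hashing_vector` always accounts for every token, at the cost of interpretability.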

Feature Embedding With advancements in word-to-vector conversion using the neural network approach, word2vec [98], GloVe [10, 11], and fastText are used to encode the text. To handle longer text such as a phrase, sentence or paragraph, researchers use BERT [99], Sentence-BERT [100], and Google Universal Sentence Encoder (GUSE) [101] for feature vector representation [42]. The use of embeddings over dense layers, BERT, GUSE, and GRU [11, 38] for sequence-to-sequence learning has contributed significantly to attention-based mechanisms that enhance the importance of features across representations.

An image represents many characteristics of psychological thoughts and health. Permutations and combinations of different image feature extractions help determine mental health. The research community follows an end-to-end feature transformation technique, using a 16-layer pre-trained VGGNet to use images as features [38, 97, 102].
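One simple way to obtain a vector for a longer text from word-level embeddings is to average the word vectors, as sketched below. This mean-pooling baseline is named here only for illustration and is not how BERT-style encoders work internally; the embedding table passed to the function is assumed to be supplied by the caller, for example loaded from pre-trained vectors.

```python
import math


def sentence_vector(tokens, embeddings, dim):
    """Mean-pool the embeddings of the tokens that have a vector.

    `embeddings` maps a token to a list of `dim` floats. Tokens without an
    entry are skipped; an all-zero vector is returned if none is covered.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Cosine similarity between two pooled vectors then gives a rough measure of how close two posts are in the embedding space.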

Dimensionality Reduction One of the most promising steps of social media data mining is dimensionality reduction. The dimensions of the text representation in conventional feature extraction techniques are reduced in MIDAS by linear and non-linear methods such as Principal Component Analysis (PCA), Deep Neural Autoencoders (DNAE) [103], and Uniform Manifold Approximation and Projection (UMAP) [104]. The Post Feature Transformation (PFT) approach is recommended for transformer-based end-to-end data conversion into feature vectors. We witness existing works with attention mechanisms such as the Hierarchical Attention Mechanism (HAM) [32, 105], which give weight to important posts for identifying suicidal tendencies [11, 43]. Multi-attributed feature extraction is given as a three-level framework consisting of low-level features (linguistic features), middle-level features (visual features) and high-level features (social features), which are given as input to the Deep Sparse Neural Network (DSNN) [76]. The authors argue that all three types of features are not always available in the data.
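As a worked illustration of linear dimensionality reduction, the sketch below finds the first principal component of a small feature matrix by power iteration on its covariance matrix and projects the rows onto it. It is a didactic simplification under stated assumptions (at least two rows, a starting vector that is not orthogonal to the leading component, a fixed iteration count); numerical libraries compute PCA more robustly.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of the first principal component).

    Assumption: `rows` holds at least two feature vectors of equal length.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centred data.
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance in the reachable direction
            break
        vec = [x / norm for x in nxt]
    return means, vec


def project(rows, means, component):
    """Project each row onto the component after centring it."""
    return [
        sum((value - mean) * weight
            for value, mean, weight in zip(row, means, component))
        for row in rows
    ]
```

Repeating the procedure on the residual data would yield further components; non-linear methods such as autoencoders or UMAP follow different principles and are not covered by this sketch.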

2.3 Summary of Feature Extraction and Transformation

Existing studies define and explore new features for mental health detection from social media data, as shown in Table 6. Most recent approaches use embedding techniques and work on post-feature transformation to obtain a better feature representation. Moreover, all existing studies use the textual information of posts, with other features being optional.

Table 5 Feature vector representation for social mental status detection
Table 6 Feature extraction and transformation for mental health detection
Fig. 6 Feature vector representation for mental health analysis in social media posts

3 Classification

The classification problem of identifying suicidal tendency on social media uses many shallow learning and DL algorithms. One of the most challenging steps is to handle the unstructured and semi-structured data from social media, fill in missing values and jointly represent the multi-modal information. Although some data resources for this task are freely available in the public domain, most datasets are not available due to the sensitivity of the data.

3.1 Available Dataset

In the past, the research community has made wide use of available datasets such as the CLPsych shared task [25], Reddit Self-reported Depression Diagnosis [115], Language of Mental Health [64], and early risk prediction on the Internet (eRISK) from the CLEF Forum [116]. As discussed earlier, only a few datasets are available in the public domain; many of them are either reproducible or available on request. Every year we come across more than 12 datasets for predicting mental health from social media data. The limited availability of these datasets leads us to list either the most popular and reproducible datasets, or the datasets which are available on request or via signed agreement. A list of reproducible datasets is given in Table 7. In this section, we further discuss the details of each dataset.

Table 7 Results obtained for Social Media Health Detection
  1. CLPsych 2015 Shared task dataset: The CLPsych dataset (Footnote 11) contains three modules which are available via signed agreement, namely, Depression v Control (DvC), PTSD v Control (PvC), and Depression v PTSD (DvP). To use this dataset, academic researchers must sign a confidentiality agreement to ensure the privacy of the data.

  2. Multimodal Dictionary Learning (MDDL): MDDL (Footnote 12) is a depression detection dataset which comprises three modules, D1, D2, and D3. The Depression Dataset D1 is constructed using tweets between 2009 and 2016, where users were labeled as depressed if their anchor tweets satisfied the strict pattern “(I’m/I was/I am/I’ve been) diagnosed depression”. The Non-Depression Dataset D2 was constructed in December 2016, where users were labeled as non-depressed if they had never posted any tweet containing the character string “depress”. Although D1 and D2 are well labeled, the depressed users in D1 are too few; thus, a larger unlabelled Depression-candidate Dataset D3, which contains much more noise, is constructed for the discovery of depression behaviours.

  3. Reddit Self-reported Depression Diagnosis (RSDD): The RSDD dataset (Footnote 13) contains the Reddit posts of approximately 9000 users who have claimed to have been diagnosed with depression (“diagnosed users”) and approximately 107,000 matched control users. The introduction of this Reddit dataset [115] is a significant contribution that has been used by many existing studies.

  4. Self-Reported Mental Health Diagnoses (SMHD) dataset: The SMHD dataset (Footnote 14), just like the RSDD dataset, can be obtained via signed agreement as per the privacy policy of the data. The dataset consists of Reddit posts of users diagnosed with one or several of nine mental health conditions (“diagnosed users”), and matched control users. This dataset is also used by a few studies in the literature and covers multiple mental health conditions rather than depression alone.

  5. eRISK: The eRISK dataset (Footnote 15) has been available online for a few years for experiments and analysis within a shared task. The dataset for early risk detection by the CLEF Lab targets the problems of detecting depression, anorexia and self-harm.

  6. Pirina: A new dataset is proposed in [120], referred to as Pirina in this study, and is available online (Footnote 16) for research purposes. Filtered data is extracted from the Reddit social media platform for the depression detection task. Although this dataset is not actively maintained, it can be extracted and used for pilot studies.

  7. Ji: A new Reddit dataset of 5326 suicidal posts out of 20,000 posts, and 594 suicidal Tweets out of 10,000 Tweets, were extracted for experiments and evaluation of the proposed classification approach for suicide risk detection. This dataset is referred to as the Ji dataset (Footnote 17) in this study and is available on request.

  8. Sina Weibo: Another dataset which is proposed for the public domain and remains unnamed is referred to in this study by the name of its social media platform, Sina Weibo (Footnote 18). The dataset, with 3652 users having suicidal tendency and 3677 users without suicide risk, is extracted from Sina Weibo, a Chinese social media platform.

  9. Dreaddit: Dreaddit (Footnote 19) is a new text corpus of lengthy multi-domain social media data for the identification of stress. This dataset consists of 190K posts from five different categories of Reddit communities; the authors additionally label 3.5K total segments taken from 3K posts using Amazon Mechanical Turk. The lexical features used in this dataset are the Dictionary of Affect in Language [131], LIWC features [132] and the Pattern sentiment library [133]; syntactic features such as unigrams and bigrams, the Flesch-Kincaid Grade Level and the Automated Readability Index; and social media features such as timestamp, upvote ratio, karma (upvote–downvote) and the total number of comments.

  10. Suicide Risk Assessment using Reddit (SRAR): The SRAR dataset (Footnote 20) is available in the public domain. The dataset is composed of 500 (anonymized) Redditors, their posts and labels annotated by domain experts. SRAR is used along with different lexicons built from knowledge bases associated with mental health, such as SNOMED-CT, ICD-10, UMLS, and Clinical Trials. This dataset was recently used in [123], and the research community is looking forward to using it in the near future to enhance the proposed techniques.

  11. Aladaug: This dataset was built by Aladaug [124] during his study on identifying suicidal tendency from posts on social media. Since no name is given to this dataset, it is referred to as Aladaug in this study. Among 10,785 posts, 785 were manually labelled for this study. This dataset is available on request from the authors.

  12. The University of Maryland Reddit Suicidality Dataset (UMD-RD): The UMD-Reddit Dataset (Footnote 21) contains one sub-directory with data pertaining to 11,129 users who posted on SuicideWatch, and another for 11,129 users who did not. For each user there is full longitudinal data from the 2015 Full Reddit Submission Corpus. The UMD-Reddit dataset has been actively used by academic researchers since 2019, as it is available via signed agreement.

  13. GoEmotion: The GoEmotion dataset (Footnote 22) contains 58K carefully curated comments extracted from Reddit, with human annotations for 27 emotion categories or Neutral. It also contains a filtered version based on rater agreement, which contains a train/test/validation split. This dataset was proposed in 2020 [33] for emotion detection and is used to validate the scalability of the proposed models for stress detection.

  14. SDCNL dataset: The SDCNL dataset (Footnote 23) was collected using the Reddit API and scraped from two subreddits, r/SuicideWatch and r/Depression; it contains 1895 posts in total. Two fields were utilized from the scraped data: the original text of the post as the input, and the subreddit it belongs to as the label. Posts from r/SuicideWatch are labeled as suicidal, and posts from r/Depression are labeled as depressed.

  15. CAMS: CAMS stands for Causal Analysis for Mental illness in Social media posts. The introduction of the CAMS dataset (Footnote 24) enables academic researchers to perform causal inference, causal explanation extraction and causal categorization. The dataset contains 5051 samples and categorizes each sample into one of five causal categories, namely, bias/abuse, jobs and careers, medication, relationships, and alienation. This dataset is publicly available [126].

  16.

    RHMD: RHMD stands for a Real-world Dataset for Health Mention classification on Reddit data (Footnote 25). Health mention classification is defined as the problem of finding symptoms and understanding their semantics. These semantics specify the contextual perspective in which a given symptom is used in a text [128]. Every sample of this dataset assigns a post to one of five categories: health mention, non-health mention, hyperbolic mention, figurative mention, and uninformative.

  17.

    Kayalvizhi: A dataset (Footnote 26) that supports not only detecting depression from social media text but also analyzing the level of depression. Initially, 20,088 postings were annotated, of which 16,613 received the same label from both judges; these instances, with their corresponding labels, constitute the dataset [129].
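The subreddit-as-label scheme described for the SDCNL dataset can be illustrated with a minimal sketch. It assumes each scraped post is a dictionary with hypothetical `subreddit` and `text` keys; the function name `label_by_subreddit` and the label strings are illustrative choices, not taken from the SDCNL code.

```python
from typing import Dict, Iterable, List

# Assumed mapping, following the description of SDCNL in the list above.
SUBREDDIT_TO_LABEL = {"suicidewatch": "suicidal", "depression": "depressed"}


def label_by_subreddit(posts: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep posts from the two source subreddits and attach a weak label."""
    labelled = []
    for post in posts:
        name = post.get("subreddit", "").lower()
        if name.startswith("r/"):
            name = name[2:]  # accept both "r/depression" and "depression"
        label = SUBREDDIT_TO_LABEL.get(name)
        text = post.get("text", "").strip()
        if label is None or not text:
            continue  # skip other subreddits and empty posts
        labelled.append({"text": text, "label": label})
    return labelled
```

Labels obtained this way depend only on where a post was published, which is one reason such datasets may contain noisy labels.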

3.2 The Historical Evolution of Classification Models

In this section, we discuss the evolution of methods developed for mental health analysis. Social NLP researchers at Microsoft, one of the leading IT solution organizations, highlighted the role of social media in identifying mental health problems. Based on a comprehensive study of 92 research articles on three mental health problems (stress, depression, and suicide risk), the historical timeline is represented in Fig. 7. Furthermore, the architecture of path-breaking models for mental health analysis is shown in Fig. 8.

Fig. 7 The timeline of evolving important events for quantification of suicidal tendency on social media

Fig. 8 Some existing models for quantifying the suicidal tendency on social media

Studies since 2013 set the stage for investigating the significance of users’ social media data in predicting depression [45] and suicidal tendencies [60]. With the introduction of word embeddings and vector-space representations [98], the development of deep neural network classifiers from a psychological perspective gained much attention from academic researchers [76, 95]. Beyond linguistic features, Ref. [27] introduces user-profile features, which improve performance in classifying posts. We witness exponential growth in this domain after the release of the initial datasets, as they eased the limited availability of sensitive mental health data from social media posts. The CLPsych shared task data paved the way for new studies and for the development of new datasets for future use [47].

In 2017, we observe extended studies on different social media platforms such as Facebook [16], Sina Weibo (a Chinese online social platform), and Instagram [94, 134]. The use of social media and social network features for stress detection has enriched this domain with learning-based mechanisms [77]. Simultaneously, the dual-attention mechanism for multimodal approaches reveals the need for explainability and reliability of models [135].

In 2018, more studies revolve around dimensionality reduction and feature-vector optimization for ML and DL models, respectively [82]. Studies on depression detection started with the use of different social network features [45], evolved with interactions over social media [77] and cascading social networks [61] to extract reliable features, and were followed by ontologies and knowledge graphs [2].

Observations about users’ dynamic historical timelines on Twitter led to improvements with the interpretive Multi-Modal Depression Detection with Hierarchical Attention Network (MDHAN) [43]. The MDHAN framework is designed with multi-modal features, and two attention mechanisms are applied at the tweet level and at the word level, respectively. Ref. [38] introduces COMMA, a depression detection mechanism that encodes text and visual data, selects among them using a GRU, and feeds the averaged embedding to a classifier. We further investigate a set of recent contextualized models such as multimodal feature extraction techniques for multiple social networking learning (MSNL) [136], Wasserstein dictionary learning (WDL) [137], and multimodal depressive dictionary learning (MDL) [39]. The Dual-ContextBERT model [36] uses multi-level analysis to overcome the limitation of single-level analysis. It was the best-performing model at CLPsych 2019 and feeds BERT-encoded posts to an attention-based RNN layer.
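The idea of applying attention at two levels, first over the words of a tweet and then over the tweets of a user, can be sketched as follows. This is a simplified illustration rather than the MDHAN architecture itself: it assumes word vectors are already given, uses plain dot-product scoring against hypothetical context vectors (`word_context`, `tweet_context`) that would be learned in a real model, and omits the recurrent encoders and multi-modal features.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors: Sequence[Vector], context: Vector) -> Vector:
    """Pool vectors into one vector, weighted by similarity to a context vector."""
    scores = [sum(a * b for a, b in zip(vec, context)) for vec in vectors]
    weights = softmax(scores)
    return [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(len(context))
    ]


def encode_user(tweets: Sequence[Sequence[Vector]],
                word_context: Vector,
                tweet_context: Vector) -> Vector:
    """Word-level attention per tweet, then tweet-level attention per user."""
    tweet_vectors = [attend(words, word_context) for words in tweets]
    return attend(tweet_vectors, tweet_context)
```

The attention weights computed inside `attend` are what make such models inspectable: a high weight indicates that a word or tweet contributed strongly to the pooled representation.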

The performance of responsible and explainable models for mental health prediction is evaluated using ablation studies [7, 83]. After extensive work on ML and DL algorithms, academic researchers found interesting improvements with hybrid studies [36]. Recent investigations with Graph Neural Networks have improved early risk detection [2]. Furthermore, computational intelligence techniques for feature optimization have addressed the problem of noisy data [13].

During the next transition phase, we observe research advancements that use the historical aspect of users’ timelines to identify different phases of mental health [7], and a hybrid extractive and abstractive summarization strategy, DepressionNet [41] (Footnote 27). DepressionNet is a novel approach which summarizes user posts before encoding them via embeddings. It applies a BiGRU model and concatenates the results with the encoded current post. Multitask models encode data using pre-trained models and the GoEmotions dataset [83] (Footnote 28). Most of the datasets collected from Reddit are labelled using subreddits. However, Ref. [42] points out the problem of noisy labels and addresses it by introducing a new dataset on depression versus suicide. In addition, a data augmentation approach addresses the problem of limited data availability for mental health analysis [138].

Past studies incorporate multi-modal feature extraction to build contextual transformer-based models for depression detection, such as co-attention [139], dual attention [135], and modality attention [140]. Novel contributions to suicidal tendency prediction comprise implicit emotion-based features [7] and explicit commonsense knowledge [141].

Table 8 Linguistic feature extraction for mental health status

People often express their feelings in their native language; thus, a potential new research frontier is to build explainable language-independent models for low-resourced languages. A comprehensive study of Chinese data reveals interesting insights into the semantics of the language [145]. Linguistic feature analysis shows a significant increase in the frequency of terms related to affect, positive emotion, anger, cognition (including the subcategory of insight), and conjunctions. A recent work on code-mixing covers English and Hindi, which may support deployment across multiple platforms and help address rising depression rates in a methodical and automated manner [146]. We keep the examination of mental health for low-resourced languages as an open research direction.

We further observe improved efficiency in early risk detection with the help of bidirectional transformer-based models and ordinal classification [114, 147]. Recent advances in early depression detection using attention mechanisms over transformer-based models have led to explainable AI in this domain [85, 147, 148, 149]. Further work with graph convolution encoders [86] and hyperbolic space embeddings has enriched this domain with new insights on recognizing patterns in graphs and on viewing the problem in non-Euclidean space, respectively. Beyond improving cross-sectional and longitudinal studies with additional attention mechanisms and semantic enhancements, we came across studies that go a step further and find indicators of the reasons behind mental disorders in self-reported texts [126]. Such studies point to a new research direction towards discourse and pragmatics.

To summarize the extensive study of classification models for identifying suicidal tendency, Table 8 reports recent developments, listing the dataset, baselines, results, and code availability for each study. We acknowledge that existing studies are not directly comparable. Before we discuss new frontiers, we list useful tools and resources for future research.

3.3 Tools and Resources

As discussed earlier, social media data is first-hand user-generated information which is informal in nature. Thus, identifying named entities and semantics in social media posts is still a challenging task. In this section, we list different tools and libraries as potential resources.

  • Python Reddit API: Reddit can be scraped through the Python Reddit API Wrapper (PRAW) (Footnote 29), which follows the Reddit API rules (Footnote 30) for scraping data.

  • PyPlutchik: A pre-built Python tool for working with emotion models [150], based on the Plutchik model of emotions [75].

  • DLATK Python Package: DLATK stands for Differential Language Analysis Toolkit, an end-to-end human text analysis package specifically suited for social media and social scientific applications. Non-neural models may be implemented via the DLATK Python package [90].

  • Optimization: Adaptive experimentation is the ML-guided process of iteratively exploring a (possibly infinite) parameter space in order to identify optimal configurations in a resource-efficient manner. Ax (Footnote 31) currently supports Bayesian optimization and bandit optimization as exploration strategies and has been used for social mental health detection [83].
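As an illustration of the first tool in the list above, the following minimal sketch collects recent posts from a subreddit with PRAW. It uses `praw.Reddit(...)` and `subreddit(...).new(limit=...)` from PRAW; the helper names `make_client` and `fetch_new_posts`, the credential arguments, and the shape of the returned dictionaries are assumptions made for this example. Valid Reddit API credentials and compliance with the Reddit API rules are required to run it against the live service.

```python
from typing import Any, Dict, List


def make_client(client_id: str, client_secret: str, user_agent: str) -> Any:
    """Create a read-only PRAW client from caller-supplied credentials."""
    import praw  # third-party package, imported lazily (pip install praw)

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def fetch_new_posts(reddit: Any, subreddit_name: str,
                    limit: int = 25) -> List[Dict[str, Any]]:
    """Return the newest submissions of a subreddit as plain dictionaries."""
    posts = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        posts.append({
            "id": submission.id,
            "subreddit": subreddit_name,
            # Title and body together form the text used for analysis.
            "text": f"{submission.title}\n{submission.selftext}".strip(),
            "created_utc": submission.created_utc,
        })
    return posts
```

A call such as `fetch_new_posts(make_client(...), "depression", limit=100)` would then return up to 100 posts; keeping `created_utc` preserves the temporal order needed for longitudinal analysis.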

4 New Frontiers

After an extensive study of 92 research articles related to stress, depression, and suicidal tendency, we draw inferences that define new research directions and future scope, as shown in Tables 9 and 10. Finally, we present new frontiers in Fig. 9.

Table 9 Inferences of evolving suicidal tendency detection on social media
Table 10 Inferences of recent suicidal tendency detection on social media
Fig. 9 Open challenges and new research directions in identifying suicidal tendency on social media

  1.

    Noisy Labels: Some labels in existing datasets have been found to be corrupted; these are referred to as noisy labels. To address this problem, the SDCNL model introduced a label correction methodology for classifying posts as suicide versus depression [42].

  2.

    New Features: Other potential features include the happiness index of the country of the user; the ease-of-living index of the country of the user; variation in geographical location and multi-source distributed crawling; and detection of multi-source communities using spectral clustering over multi-level graphs [53]. Although there are studies that find correlations among different features and map different variables for mental illness in China [152], this needs to be studied for other countries and at a global level because of the socio-political differences among countries.

  3.

    Embedding for Multi-task problem: Multi-task mental health analysis has been approached through a systematic word embedding optimizer [82]. However, there is no explanation or mathematical validation of why the results are better.

  4.

    Time Complexity: Although the FGM approach [95] shows a significant improvement in F1-score among recent approaches for stress detection, it is computationally expensive and takes more than twice as long as the second-best approach. Computational complexity needs to be given equal importance in recent advancements.

  5.

    Behavioral Analysis: Mental health detection is part of an integrated study of computational linguistics, human–computer interaction, and clinical psychology. Few studies have observed the latent patterns among social media users who express common but sensitive thoughts. Depressive tweets are more likely to be posted late at night than during the day [39]. This analytical aspect of human behavior is rarely explored in the existing literature, as observed from Tables 9 and 10.

  6.

    Interpretability and Explainability: Existing studies provide detailed theoretical explanations of their proposed approaches and test interpretability [11] or explainability [2, 7, 43, 83] via ablation studies. Ethical validation must be explored further to enhance the applicability of new methods in real-time applications.

  7.

    Social Networks and Graph Neural Networks: The trend of using text, visual, and multimedia information has opened several new research directions in this domain. Although network features for Twitter data have shown promising results [153], there is still considerable room to study multi-level networks and heterogeneous information networks for a better, integrated representation of multi-modal information in social media. A few studies on knowledge graphs [2], ontologies [2], and graph neural networks [92] validate this as a progressive domain.

  8.

    Multi-lingual, cross-lingual and language-independent approach: We find limited studies on low-resourced languages in this domain. We found no work on multi-lingual approaches comparable to that on offensive language [154]. A few studies have made progress towards a language-independent approach [2, 10, 77]; however, the existing techniques have not been compared, directly or indirectly, for language-independent or multi-lingual settings.

  9.

    Incremental Learning from Streaming Data: There are some studies on topic extraction from social media content for early depression detection on retrospective data [79] and on phase change of the user [7, 92]. Existing studies have rarely used online streaming data [37], and no study examines concept drift [151] in streaming data. Detecting concept drift could help identify changes in the level of suicide risk.

  10.

    Real-time Applications: Real-time mental health prediction is yet to be explored: to the best of our knowledge, only one study by academic researchers integrates the Internet of Medical Things (IoMT) with a social media dataset [34].
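To make the notion of concept drift in streaming data (item 9 above) concrete, the following minimal sketch applies the Page-Hinkley test, a standard change detector, to a stream of per-post risk scores. The choice of Page-Hinkley, the score stream, and the parameter values `delta` and `threshold` are assumptions made for illustration; none of the surveyed studies is claimed to use this procedure.

```python
from typing import Iterable, Optional


class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta: float = 0.005, threshold: float = 1.0) -> None:
        self.delta = delta          # tolerated deviation per observation
        self.threshold = threshold  # alarm level for the test statistic
        self.count = 0
        self.mean = 0.0
        self.cumulative = 0.0
        self.minimum = 0.0

    def update(self, value: float) -> bool:
        """Consume one observation; return True if drift is signalled."""
        self.count += 1
        self.mean += (value - self.mean) / self.count  # running mean
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return (self.cumulative - self.minimum) > self.threshold


def first_alarm(scores: Iterable[float], delta: float = 0.005,
                threshold: float = 1.0) -> Optional[int]:
    """Index of the first observation that triggers an alarm, if any."""
    detector = PageHinkley(delta, threshold)
    for index, score in enumerate(scores):
        if detector.update(score):
            return index
    return None
```

For a user whose scores stay near a constant level, `first_alarm` returns `None`; a sustained rise in the scores triggers an alarm shortly after the change point, which is the signal an incremental learner could use to update or re-train its model.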

5 Conclusion

This manuscript is an extensive literature survey on predicting suicidal tendency from social media data. The rapid progress of data science for mental health prediction has shown its significance in recent years. The corpus of 92 research articles covers studies on stress, depression, and suicide risk detection on social media. However, there is no substantial work on quantifying suicide risk from the longitudinal data of a user. To address this and to integrate the existing studies on multiple tasks, an extensive survey is given along with the open challenges and possible research directions. The major contributions of this manuscript are a list of the available datasets (public, on request, and via signed agreement); an introduction to the taxonomy of mental healthcare; a classification of feature extraction and transformation techniques for vector representation; the historical evolution of suicidal tendency detection with a timeline; and new research directions and open challenges. This manuscript further highlights the important contributions which can be used as benchmark studies in this domain.