1 Introduction

As the economy and society have developed rapidly over the past few years, mental health disorders have become one of the most critical and prevalent global health problems. According to the latest figures from the World Health Organization, nearly one billion people worldwide, about one in eight, live with a mental health disorder. The COVID-19 pandemic has further increased the likelihood of people experiencing negative emotions such as depression and even developing mental illness, with cases of major depression rising by 28% worldwide in 2020. In countries more severely affected by COVID-19, the incidence of psychological disorders has increased sharply (World Health Organization 2022).

Traditional psychological research assesses mental health status mainly through quantitative scales, questionnaires, behavioral observations, and clinical interviews. These methods perform well in terms of credibility, but most are intrusive, poorly timed, difficult to apply for long-term tracking, and require sufferers to initiate a request for help (Saxena et al. 2007). Especially in the early stages of a problem, people may not be aware of their psychological changes, or may feel ashamed of their symptoms and be reluctant to consult a professional.

With the development of computing technology and social media, platforms such as Twitter, Reddit, and Weibo have become an integral part of people's lives. Many researchers have found that social media data contains a large number of linguistic and behavioral features that can reflect individual emotions and mental states, and that individuals with mental disorders tend to show changes in language and behavior when using these platforms. Acquiring and analyzing social media data for effective and efficient mental state assessment has become a critical research topic in affective computing (Cavazos-Rehg et al. 2016; Guntuku et al. 2017). Using computing technology to analyze subjects' social media data for psychological assessment enables healthcare workers to identify problems and initiate preventive efforts at an early stage.

At present, this task is mostly performed with classification models based on supervised learning. Beyond the rapid growth of computational resources, the superior performance of machine learning (ML) and deep learning (DL) techniques on this task depends on large amounts of high-quality data with clear and accurate labels.

Currently, there are two main ways to obtain and label social media data sets for psychological assessment. The first is to recruit volunteers to complete questionnaires such as the Self-Rating Depression Scale (SDS), have professional psychological staff mark their mental states, and collect the corresponding social media data. This method yields accurate labels, but it is difficult to carry out for large-scale data collection due to the limited number of volunteers and the cost in manpower and time. The second is to crawl texts from social media and label them according to users' self-reports or keyword search rules, for example, finding users with expressions like "I was just diagnosed with depression" at a certain time and labeling them as positive for mental problems, while labeling users who have never posted texts in the corresponding format as negative. This annotation method collects data faster and yields a larger amount of data. But due to the implicit nature of psychological issues, users may misperceive and misreport their psychological states, and the format of online expression differs greatly from the prescribed format, so such labels cannot meet ideal ground-truth standards. As a result, there is a high volume of noisy labels, which can cause large errors when used by current psychological assessment methods (Wongkoblap et al. 2017). Although the powerful representational capacity of DL models allows them to fit any data in the training set, the included noise may cause the network to fit the wrong patterns, degrading the network's performance (Algan and Ulusoy 2021). Therefore, learning with noisy labels has arisen as an important research issue for deep learning tasks.

Most current research on DL models with noisy labels focuses on visual image classification, where noisy labels usually contradict objective facts and are easy to distinguish. Unlike image data, however, text data contains more subjective emotional tendencies, is highly influenced by individual differences among users, and lacks clear decision boundaries, making it more prone to noise from labeling errors. For psychological assessment tasks especially, it is difficult (1) for individuals to assess their own psychological states accurately, and (2) for different psychologists to agree on a uniform or correct label from text alone. An in-depth study of how to adequately extract features from noise-labeled text data sets and classify them, while improving model robustness, is still lacking for the mental state assessment task.

On the other hand, some recent advances in DL and noise handling show great potential to address the problem. In particular, recent studies reveal that applying auxiliary noise models on top of the classifier to capture the noise can better solve the problem of classification with label noise (Cheng et al. 2021). Furthermore, the training and classification models themselves are evolving rapidly. Beyond LSTM (Hochreiter and Schmidhuber 1997) and GRU (Cho et al. 2014), the most popular models trained from scratch, pre-trained models are attracting increasing attention because their training can be divided into pre-training and fine-tuning phases, and mislabeling does not corrupt the former. However, using sequential or pre-trained models alone may over-focus on local semantic information and ignore global structural information. Graph neural networks (GNNs) have been recognized as an effective remedy for this problem that can fully extract text features. In addition, previous studies on learning from noisy labels have shown that correctly and incorrectly labeled samples behave differently during training, differing in loss value and convergence speed (Arpit et al. 2017), which sheds light on the selection of noisy samples and the training of models under noise.

Therefore, in this paper, a new mental state classification method is proposed that incorporates a noisy-label correction mechanism based on social media text data, improving the effectiveness of mental health status classification while enhancing the model's noise immunity. The main contributions are summarized as follows:

  1.

    We propose a mental state classification framework incorporating a noisy label correction mechanism, which includes a primary classification model and a noisy sample selection and pseudo-label generation strategy. The model is iteratively trained on a new data set consisting of samples with clean labels and samples with pseudo-labels. Specifically, after preliminary classification, a Gaussian mixture model is used to identify mislabeled samples in the social media data, and pseudo-labels are then generated for the noisy samples. This approach can effectively increase the classification accuracy and robustness of the mental state classification model without knowing the noise distribution in advance.

  2.

    Furthermore, the primary mental state classification model incorporates a pre-trained model and a graph neural network, using the former to extract semantic features of the text and the latter to emphasize structural information. Unlike conventional single-use sequence models that attend only to local and sequential information, the proposed model obtains a more effective text representation by adopting global and structural features.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 details the proposed framework and algorithm for mental state evaluation incorporating the noisy label correction mechanism. Sect. 4 presents the experiments conducted to prove the effectiveness of the proposed method. Conclusions and outlook are given in Sect. 5.

2 Related work

The key point of this research is to fully extract features from noise-labeled social media text data sets using DL approaches, incorporating noisy label correction mechanisms to improve model robustness while achieving accurate assessment of user mental states. Related work is therefore presented in terms of psychological evaluation through social media using ML and DL methods, learning from noisy labels with deep neural networks (DNNs), and text feature extraction based on DL.

2.1 Psychological evaluation through social media using ML & DL

Different from physical illnesses, psychological issues are often overlooked. Lack of awareness and acceptance among the population, the associated social stigma, the high cost and time consumption of traditional questionnaire testing methods, and the need for multiple interviews are all key factors in the neglect of mental problems. Contemporary non-invasive techniques such as ML and DL offer new ways to accomplish psychological assessment tasks. The Internet and social media have become the first choice for timely communication, and researchers have found that people with mental health disorders prefer to express their feelings in online communities (Ghosh et al. 2022). The analysis of behaviors and texts on social media can reveal a person's personality and has far-reaching implications for applications in psychological assessment such as depression detection and suicidal ideation detection (Yang et al. 2022). Manual analysis is impractical at this scale of data; ML and DL methods can extract hidden patterns from huge volumes of social media data and are therefore emerging as new approaches to assist clinical decision making. Since 2006, ML and DL have been used in text classification tasks related to psychological health (Diederich et al. 2007). Traditional ML algorithms require manual extraction of features. For example, Reece et al. (2017) collected information from participants' Twitter accounts, extracted features from their contents, and used these features with supervised learning algorithms to construct models predicting the presence of depression and Post-Traumatic Stress Disorder (PTSD). Peng et al. (2019) extracted features from users' social media data, constructed an emotion dictionary, and built depression screening models using multi-kernel Support Vector Machines (SVM). Aguilera et al. (2021) used ML methods such as SVM, k-Nearest Neighbor (kNN), and k-Strongest Strengths (kSS) to detect depressed and anorexic patients from user-generated content, and proposed a one-class classification method. Although traditional ML methods are efficient and perform well, they cannot deeply learn the latent features in the data and only extract relationships among the dimensions of shallow features, making them suitable for cases with few feature dimensions and little data. In addition, feature engineering requires extensive domain knowledge, and it is challenging to identify all factors that could cause mental health disorders, which vary across users and levels of disorder. As a result, it is hard to hand-select the features needed to identify the behavioral patterns of users with mental health disorders.

In recent years, DL algorithms have shown good results in natural language processing (NLP), especially in areas such as psychometric classification and sentiment analysis (Chen et al. 2020). In current research using social media data for psychological evaluation, sequence models such as LSTM are mostly used to prioritize the local and sequential nature of the text; this type of model captures the semantic and syntactic information in local sequences of consecutive words well. For example, Gupta et al. (2022) extracted depression-related behaviors and activities from social media messages to detect depression in tweets, and experimentally demonstrated that LSTM is better suited to processing large-scale data and obtains better detection results than traditional machine learning methods. Almars (2022) combined attention mechanisms with a Bi-LSTM network to learn hidden features of depression for depression detection from Arabic social media. Building on these works, researchers continue to explore the application of deep learning models in psychological assessment; for example, Uban et al. (2021) explored more complex deep architectures, including RNNs, CNNs, hierarchical attention networks, and Transformers, to find individuals who might be diagnosed with psychological illness. However, using sequence models alone cannot account for text structure information and global word co-occurrence, which limits text feature extraction. Moreover, most of these methods rely on supervised learning for model training, and most currently used data sets are collected from social media platforms and labeled by self-reporting or keyword retrieval, where the implicit nature of psychological problems and the subjectivity of judgments can introduce incorrect labels; the resulting label noise limits these models on this task.

2.2 Learning from noisy labels with DNNs

Label noise can affect the performance of neural networks in a variety of ways. On the one hand, training DNNs with noisy labels is vulnerable because the large number of model parameters makes DNNs overfit the mistaken labels, so test accuracy differs greatly between models trained on clean and on noisy data. On the other hand, in the presence of noisy labels, DNNs memorize all the noisy samples, resulting in poor accuracy and low robustness, and the negative impact is more prominent for supervised models with deeper label dependence than for semi-supervised and unsupervised classification models. The most direct and simple way to deal with noise-labeled data is to find the samples with noisy labels. Improving learning performance by directly removing the noisy data from the data set or reducing their weights is called sample re-weighting, which transforms noisy deep learning into a semi-supervised learning problem. For example, Northcutt et al. (2021) constructed a confidence learning framework that trains the model on the noisy data set, estimates the joint distribution of noisy and true labels, then finds and filters the noisy samples, and retrains the model by re-adjusting the sample category weights. Deep learning models preferentially learn simple samples, so correctly labeled samples tend to show smaller losses than mislabeled ones, a phenomenon known as the small-loss trick. Based on this concept, Li et al. (2020) proposed the DivideMix algorithm, which clusters the loss values of the samples to filter suspected erroneous samples. Another approach is to screen out suspected mislabeled samples and then correct the mislabeling using a noise transition matrix or other methods. For example, Pham et al. (2021) proposed a meta-pseudo-labeling algorithm with a teacher network and a student network: suspected mislabeled sample labels are rejected, the teacher network generates pseudo-labels for them, the pseudo-labels guide parameter updating of the student network, and the student network's results on the correct labels and pseudo-labels in turn guide parameter tuning of the teacher network, with the two networks optimizing alternately. These methods start from the data and find and correct mislabeling as much as possible to radically improve the quality of the data set. However, using a single method of processing the data set may cause errors to accumulate due to sample-selection bias. Therefore, some more advanced methods also improve the training approach itself, for instance using multiple DNNs cooperating with each other or running several training cycles (Song et al. 2022). Co-teaching (Han et al. 2018) and Co-teaching+ (Yu et al. 2019) selected a proportion of small-loss samples in each DNN based on the small-loss trick and fed them to the other DNN for further training. The robustness of this approach can be attributed to the different learning capabilities of the two networks: each can filter errors introduced by noisy labels, reducing the error flow during the exchange process, and when errors from noisy data flow into the other network, they are attenuated by its own robustness. This method is suitable for dealing with noisier data as it solves the cumulative error problem, but it incurs a large time overhead on large-scale data sets.
In recent years, scholars have found that combining sample selection strategies with loss correction or semi-supervised learning can effectively solve this problem. Without adding additional DNNs, multi-round learning selects clean samples by repeating training iterations. For example, Qiao et al. (2022) proposed SelfMix, which uses a dropout mechanism on a single model to alleviate confirmation bias in self-training and achieve robust learning against text label noise. Liu et al. (2021) proposed a noise-resistant medical image classification framework that advances DL-based handling of label noise in the medical field. As research at the intersection of information science and psychology has deepened, researchers have noticed that mislabeled data limits the continued development of the technology, and have begun to delve deeper into this area. A two-stage pseudo-labeling-based classification framework, called DSPL (Hu et al. 2022), was proposed for diagnosing mental disorders based on MRI data. Haque et al. (2021) incorporated an unsupervised learning method into a graph convolutional network (GCN) model to achieve node classification in graph data with label noise and applied it to a population graph for the detection of patients with autism spectrum disorders.

2.3 Text feature extraction based on DL

Most of the methods mentioned in the previous section to combat noisy labels add a noisy sample screening module to a primary classification model and choose a more robust training approach. To evaluate users' psychological state from social media text data, effective information must first be fully extracted from the text. However, text on social media varies widely in information quality, value density, and length. How to quickly obtain rich and effective information from huge volumes of text data and perform sentiment recognition is one of the unavoidable problems in this task. Traditional text extraction strategies based on statistical approaches, like the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm (Zhang et al. 2011), ignore extra information, such as contextual or semantic information, when facing unstructured text data.

Introducing neural networks and corpus pre-trained models into feature extraction can reduce the manual workload and yield better analysis results. Currently, sequential models such as LSTM and GRU are extensively applied in NLP, where a sentence or text is used as network input, and features are extracted by sequential models and classified to assist the diagnosis of mental disorders (Ji et al. 2020). Using pre-trained language models is a new trend in NLP and has received substantial attention in various text-processing tasks. Devlin et al. (2019) proposed a pre-trained language model called BERT, pioneering the use of a bidirectional Transformer text encoder trained on a large-scale corpus. Since then, many field-specific pre-trained models have been developed; a case in point is that Ji et al. (2022b) trained two pre-trained models, MentalBERT and MentalRoBERTa, for the mental health domain and conducted psychological evaluation based on them. Sequential models can effectively capture the sequential information between words, contextual information, and semantic information of adjacent words in a document, but they neglect the extraction of global structural features.

Compared with sequential learning models, GNNs can directly process structured data and have a stronger ability to capture global information, obtaining excellent performance, and are widely used in sentiment analysis, text classification, and other fields. To address the text classification task, Yao et al. (2019) transformed the corpus into a heterogeneous graph and used a GNN approach to learn word and document embeddings. Yang et al. (2021) proposed a heterogeneous GNN model integrating text, topic, and entity information for short text classification using a semi-supervised approach. Hong et al. (2022) captured specific embeddings of each word node based on graph neural networks to provide a global representation for each node, obtaining better performance than other graph network models on the depression symptom severity prediction task. Lin et al. (2021) achieved excellent performance on text classification tasks by jointly training BERT and GCN modules, fusing the pre-trained model's extensive use of raw data with the GCN's ability to jointly learn from trained and untrained data by propagating label influence. Fusing pre-trained models with GNNs can adapt to texts of different lengths and complexity, capturing rich semantic information along with text structure information to enhance text representation.

Although the combination of sequential models and GNNs for feature extraction has seen many applications and contributions in text and image classification, a large proportion of the available work assumes that true data labels are always available. In addition, learning with mislabeled data has been studied mainly in the context of image classification. For example, NGC (Wu et al. 2021) proposed a graph-based anti-noise framework that builds a graph structure filtered by confidence information, selects subgraphs using the geometric structural relationships between the data, treats the samples in the subgraphs as clean samples, and refines the feature representations using contrastive learning. These studies showed the potential of GNNs for deep learning with noise in experiments on image processing. However, there has been little progress on enhancing the robustness of GNNs, and applications to text classification and to more fine-grained mental assessment tasks are even rarer.

Fig. 1 Structure of the proposed PGNLC

3 Proposed method

As mentioned earlier, the key issue we focus on in this paper is to fully extract features from social media text with noisy labels and incorporate a noisy label correction mechanism to obtain more accurate psychological assessments of users. Therefore, to handle the specific characteristics of this task, we propose a classification framework for the psychological assessment of subjects using their social media data, which constructs feature extraction and classification models by combining pre-trained models with graph neural networks and incorporates a noisy label correction mechanism (PGNLC). In particular, a primary classification model is first designed to fully extract features from social media texts for classification of users' mental states in the presence of noisy labels. In addition, a noisy sample selection and pseudo-label generation mechanism is introduced to screen samples suspected to be mislabeled and generate soft labels for them, which are then used together with clean samples to retrain the classification model and improve its accuracy and robustness for psychological assessment.

The framework consists of two parts: (a) a feature extraction and classification module and (b) a noisy sample selection and pseudo-label generation module. The first module combines a pre-trained model, which is relatively less disturbed by noise, with a graph neural network, so as to fully extract local semantic information and global structural information of the text. The resulting text representation is passed through the classifier to perform a preliminary mental assessment. After that, the sample losses of the classification model, evaluated against the original data set labels, are input into the second module, where noisy sample selection is performed by a Gaussian mixture model (GMM). The labels of the suspected noisy samples are then erased and replaced by soft pseudo-labels. The corrected samples together with the clean samples are input into the classification model again for another psychological evaluation, so that the influence of noisy labels is reduced through multiple iterations and the performance of user mental state classification improves. Without prior knowledge of the noise distribution, the method performs label correction by generating predicted soft pseudo-labels using an iteratively fitted GMM. The framework structure is shown in Fig. 1. In the following, we explain the proposed model and algorithm in detail.

3.1 Construction of heterogeneous text graph

To fully extract semantic information and global structural information, we combine a large-scale pre-trained model with GNNs, convert the social media text data that contains the user's mental state into a graph representation, and construct a heterogeneous text graph containing word nodes and document nodes, denoted as \(HG=\{U,V,E,F\}\). There are two types of nodes in the text graph, word nodes (U) and document nodes (V), and two kinds of edges, edges between word and document nodes (E) and edges between words (F). An example of a heterogeneous text graph is shown in Fig. 2.

Fig. 2 Example of a text heterogeneous graph

In the text graph, an identity matrix \(X={{I}_{{{N}_{U}}+{{N}_{V}}}}\) is used for node initialization, where \({{N}_{U}}\) and \({{N}_{V}}\) denote the numbers of word nodes and document nodes in the corpus, respectively. We use a pre-trained model as a sentence encoder to obtain the input representation of the document nodes, where d denotes the embedding dimension. Therefore, the initialized node feature matrix is represented as \(X={{\left( \begin{matrix} {{X}_{doc}} \\ 0 \\ \end{matrix} \right) }_{({{N}_{U}}+{{N}_{V}})\times d}}\). For the edge weights of the text graph, the weights of E are calculated using TF-IDF, while positive point-wise mutual information (PPMI), computed from local word co-occurrence within a sliding window, represents the edge weights of F. The weight of the edge between nodes i and j is calculated as shown in Eq. (1).

$$\begin{aligned} {{A}_{i,j}}=\left\{ \begin{matrix} PPMI(i,j) &{} i,j\in U\text { and }i\ne j \\ \text {TF-IDF}(i,j) &{} i\in V,j\in U \\ 1 &{} i=j \\ 0 &{} \text {otherwise} \\ \end{matrix} \right. \end{aligned}$$
(1)
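To make the construction concrete, the sketch below builds the adjacency matrix of Eq. (1) from a tokenized corpus. It is a minimal illustration under stated assumptions rather than the authors' exact implementation: the window size of 20 is an assumed value (the paper specifies a sliding window but not its width), and `build_adjacency` is a name of our choosing.

```python
# Sketch of heterogeneous text-graph construction from a tokenized corpus
# `docs` (a list of word lists). Word nodes occupy indices [0, n_u),
# document nodes occupy [n_u, n_u + n_v), matching Eq. (1).
import math
from collections import Counter

import numpy as np

def build_adjacency(docs, window_size=20):
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_u, n_v = len(vocab), len(docs)   # word nodes, document nodes
    A = np.eye(n_u + n_v)              # A[i, i] = 1, per Eq. (1)

    # Word-word edges (F): PPMI over sliding-window co-occurrence counts.
    win_count, pair_count, n_windows = Counter(), Counter(), 0
    for doc in docs:
        for s in range(max(1, len(doc) - window_size + 1)):
            window = set(doc[s:s + window_size])
            n_windows += 1
            for w in window:
                win_count[w] += 1
            for w1 in window:
                for w2 in window:
                    if w1 != w2:
                        pair_count[(w1, w2)] += 1
    for (w1, w2), c in pair_count.items():
        # PMI = log( p(i,j) / (p(i) p(j)) ); keep only positive values.
        pmi = math.log(c * n_windows / (win_count[w1] * win_count[w2]))
        if pmi > 0:
            A[word_id[w1], word_id[w2]] = pmi

    # Word-document edges (E): TF-IDF, stored symmetrically so the
    # adjacency matrix stays symmetric for the GCN.
    df = Counter(w for doc in docs for w in set(doc))
    for d, doc in enumerate(docs):
        for w, c in Counter(doc).items():
            tfidf = (c / len(doc)) * math.log(n_v / df[w])
            A[n_u + d, word_id[w]] = tfidf
            A[word_id[w], n_u + d] = tfidf
    return A, word_id
```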

3.2 Noisy sample selection and pseudo label generation mechanism

We first need to design a reliable criterion to filter out noisy samples with mislabeled mental states. We use \(D=\{{{x}_{i}},{{y}_{i}}\}_{i=1}^{{{N}_{V}}}\) to denote the original data set, \({{x}_{i}}\) the text of the ith document, and \({{y}_{i}}\) the one-hot representation of the sample label.

For noisy data, DNNs prioritize learning straightforward and consistent samples, reducing their loss first. Arazo et al. (2019) found that learning random labels takes longer than learning clean labels, implying that noisy samples have higher losses in the early part of training, and that the loss distributions of clean and noisy samples converge to two Gaussian distributions. Based on this finding, we transform the problem of identifying and filtering noisy data into a problem of clustering the loss values of individual samples. Thus, this paper proposes a sample selection strategy based on a GMM. The model applies the cross-entropy loss function shown in Eq. (2).

$$\begin{aligned} L=\{{{l}_{i}}\}_{i=1}^{{{N}_{V}}}=\{-y_{i}^{T}\log (p({{x}_{i}};\theta ,\phi ))\}_{i=1}^{{{N}_{V}}} \end{aligned}$$
(2)

where \(p({{x}_{i}};\theta ,\phi )\) is the output probability of the model, \({{l}_{i}}\) denotes the loss of the ith sample, \(\theta \) denotes the parameters of the pre-trained encoder, and \(\phi \) denotes the parameters of a Multilayer Perceptron (MLP) classifier with two fully connected layers.

The primary classification model first completes the psychological assessment task without overfitting the noisy labels; the per-sample losses are then fed into the GMM, and the Expectation-Maximization (EM) algorithm is used to fit the GMM to the observations. Specifically, \({{\omega }_{i}}=p(g|{{l}_{i}})\) denotes the probability that the ith sample belongs to the Gaussian component g with the smaller mean. Given a preset threshold \(\tau \), if \({{\omega }_{i}}\) is greater than \(\tau \), the sample is identified as a clean sample; otherwise, the sample is identified as a noisy sample and its label is erased, as shown in Eqs. (3) and (4):

$$\begin{aligned}{} & {} {{D}_{C}}=\{({{x}_{i}},{{y}_{i}})|{{x}_{i}}\in D,{{\omega }_{i}}>\tau \} \end{aligned}$$
(3)
$$\begin{aligned}{} & {} {{D}_{N}}=\{({{x}_{i}})|{{x}_{i}}\in D,{{\omega }_{i}}<\tau \} \end{aligned}$$
(4)

where \({{D}_{C}}\) denotes the clean data set, including the clean sample text and the corresponding labels, and \({{D}_{N}}\) denotes the noisy data set, including the noisy sample text.
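The selection step itself is straightforward to sketch. Below, a two-component Gaussian mixture is fitted to the per-sample losses of Eq. (2) with scikit-learn's EM-based GaussianMixture, and the split of Eqs. (3)–(4) is applied; `split_clean_noisy` is our illustrative name, and \(\tau =0.5\) follows the setting reported in Sect. 4.3.

```python
# GMM-based clean/noisy split over per-sample cross-entropy losses.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(losses, tau=0.5):
    losses = np.asarray(losses).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(losses)   # EM fitting
    clean_comp = int(np.argmin(gmm.means_))             # smaller-mean component g
    omega = gmm.predict_proba(losses)[:, clean_comp]    # w_i = p(g | l_i)
    clean_idx = np.where(omega > tau)[0]                # kept with original labels, Eq. (3)
    noisy_idx = np.where(omega <= tau)[0]               # labels erased, Eq. (4)
    return clean_idx, noisy_idx, omega
```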

After noisy sample selection, the noisy mental state labels need to be updated and corrected to reduce the effect of noise. To enable semi-supervised learning, pseudo-labels are generated for the noisy sample set after label removal. In this paper, we generate soft labels by sharpening the model's prediction distribution; utilizing soft rather than hard pseudo-labels leads to a more robust model. The formulas are shown in Eqs. (5)–(8):

$$\begin{aligned}{} & {} Sharpen{{(p,T)}_{i}}=\frac{p_{i}^{\frac{1}{T}}}{\sum \nolimits _{j=1}^{C}{p_{j}^{\frac{1}{T}}}} \end{aligned}$$
(5)
$$\begin{aligned}{} & {} {{{\tilde{y}}}_{i}}=Sharpen(p({{x}_{i}};\theta ,\phi ),T) \end{aligned}$$
(6)
$$\begin{aligned}{} & {} {{{\hat{D}}}_{N}}=\{({{x}_{i}},{{{\tilde{y}}}_{i}})|{{x}_{i}}\in {{D}_{N}}\} \end{aligned}$$
(7)
$$\begin{aligned}{} & {} {\hat{D}}={{D}_{C}}\cup {{{\hat{D}}}_{N}} \end{aligned}$$
(8)

The temperature sharpening module, often applied in self-training models, is denoted by Sharpen. \({{p}_{i}}\) is the predicted probability of each class output by the model; T, a hyperparameter, is the sharpening temperature; and C denotes the number of categories. The Sharpen module lowers the entropy of the model's predicted soft labels. By combining \({{{\hat{D}}}_{N}}\) and \({{D}_{C}}\), we obtain the integrated data set \({\hat{D}}\).
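A minimal sketch of the sharpening step in Eqs. (5)–(6) follows; the default temperature T = 0.5 is an illustrative assumption, since the paper does not report its value.

```python
import torch

def sharpen(probs: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    p = probs ** (1.0 / T)                  # p_i^(1/T), Eq. (5)
    return p / p.sum(dim=-1, keepdim=True)  # renormalize over the C classes

# Lower T pushes the distribution toward one-hot, i.e. lower entropy.
```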

3.3 Model training and mental state classification

The initialization matrix X obtained after encoding via the large-scale pre-trained model serves as the input to the GNN. The GCN iteratively propagates features to generate each node embedding, and its output is the final embedding representation of the document nodes. In this article, BERT is used as the pre-trained model; we then feed the outputs of BERT and the GCN into softmax layers for classification and linearly interpolate the two predictions.

The output feature matrix of the ith GCN layer is calculated as in Eq. (9):

$$\begin{aligned} {{L}^{(i)}}=\rho ({\tilde{A}}{{L}^{(i-1)}}{{W}^{(i)}}) \end{aligned}$$
(9)

where \(\rho \) denotes a nonlinear activation function. \({\tilde{A}}\) is the normalized adjacency matrix and \({{W}^{(i)}}\) is the weight parameter matrix of the ith layer.
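As a sketch, one GCN layer implementing Eq. (9) can be written as below, assuming a precomputed normalized adjacency matrix held as a dense tensor and ReLU as an assumed choice for the activation \(\rho \).

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # W^(i)

    def forward(self, A_norm: torch.Tensor, L_prev: torch.Tensor) -> torch.Tensor:
        # rho( A_tilde · L^(i-1) · W^(i) ), Eq. (9)
        return torch.relu(self.W(A_norm @ L_prev))
```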

To optimize both the pre-trained model and the GCN, their predictions are interpolated. The two modules' predictions and their linear interpolation are shown in Eqs. (10)–(12):

$$\begin{aligned}{} & {} {{Z}_{GCN}}=\text {softmax}(G(X,A)) \end{aligned}$$
(10)
$$\begin{aligned}{} & {} {{Z}_{BERT}}=\text {softmax}(WX) \end{aligned}$$
(11)
$$\begin{aligned}{} & {} Z=\lambda {{Z}_{GCN}}+(1-\lambda ){{Z}_{BERT}} \end{aligned}$$
(12)

where G denotes the GCN model, whose output is fed into a softmax layer to obtain the prediction results, denoted as \({{Z}_{GCN}}\). The document embedding output by BERT is fed into a softmax layer to obtain the pre-trained model's prediction results, denoted as \({{Z}_{BERT}}\). \(\lambda \in (0,1)\) is the interpolation parameter balancing the two parts of the prediction, adjusted to better optimize the embedding model.
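The interpolation itself reduces to a few lines; in this sketch, `gcn_logits` and `bert_logits` stand for the two modules' raw outputs, and \(\lambda =0.5\) follows the setting in Sect. 4.3.

```python
import torch.nn.functional as F

def interpolate_predictions(gcn_logits, bert_logits, lambda_=0.5):
    z_gcn = F.softmax(gcn_logits, dim=-1)    # Z_GCN, Eq. (10)
    z_bert = F.softmax(bert_logits, dim=-1)  # Z_BERT, Eq. (11)
    return lambda_ * z_gcn + (1 - lambda_) * z_bert  # Z, Eq. (12)
```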

3.4 Loss function

Classification loss \({{L}_{o}}\) Classification is performed on the integrated data set \({\hat{D}}\), using the standard cross-entropy loss as \({{L}_{o}}\), as shown in Eq. (13):

$$\begin{aligned} {{L}_{o}}=-\frac{1}{|{\hat{D}}|}\sum \limits _{(x,y)\in {\hat{D}}}{{{y}^{T}}\log ({{p}_{o}}(x;\phi ))} \end{aligned}$$
(13)

where \({{p}_{o}}(x;\phi )\) represents the predicted probability using the corrected data set.

Pseudo loss \({{L}_{p}}\) Following the low-density separation assumption, the classifier's decision boundary should preferably pass through low-density regions of the input space. Therefore, a regularization term is added for the unlabeled data set after label erasure to penalize samples with small output probabilities for the predicted class, as shown in Eqs. (14) and (15):

$$\begin{aligned}{} & {} {{{\hat{y}}}_{i}}=\arg \max (p({{x}_{i}};\theta ,\phi )) \end{aligned}$$
(14)
$$\begin{aligned}{} & {} {{L}_{p}}=-\frac{1}{|{{D}_{N}}|}\sum \limits _{{{x}_{i}}\in {{D}_{N}}}{{\hat{y}}_{i}^{T}\log (p({{x}_{i}};\theta ,\phi ))} \end{aligned}$$
(15)

where \(p({{x}_{i}};\theta ,\phi )\) denotes the model's prediction for sample \({{x}_{i}}\), and \({{{\hat{y}}}_{i}}\) is the one-hot representation of the pseudo-label obtained from the model prediction.

In the label correction and network parameter update phase, the overall network loss function is shown in Eq. (16):

$$\begin{aligned} {{L}_\textrm{total}}={{L}_{o}}+\alpha {{L}_{p}} \end{aligned}$$
(16)

where \(\alpha \) is the weight parameter for the additional loss.
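A hedged sketch of the overall objective in Eqs. (13)–(16) is given below, with \(\alpha =0.2\) as set in Sect. 4.3; representing all targets as probability vectors (one-hot for clean labels, soft pseudo-labels for corrected ones) is our simplifying assumption.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_hat, targets_hat, logits_noisy, alpha=0.2):
    # L_o: cross-entropy over the integrated set D-hat, Eq. (13). Probability
    # targets require PyTorch >= 1.10.
    l_o = F.cross_entropy(logits_hat, targets_hat)
    # L_p: penalize low confidence on the model's own predicted class for
    # the label-erased samples, Eqs. (14)-(15) (low-density separation).
    pseudo = logits_noisy.argmax(dim=-1)        # y-hat_i, Eq. (14)
    l_p = F.cross_entropy(logits_noisy, pseudo) # Eq. (15)
    return l_o + alpha * l_p                    # L_total, Eq. (16)
```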

The training of PGNLC is given in Algorithm 1.

Algorithm 1 Training with PGNLC against Label Noise
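Since Algorithm 1 is presented as a figure, the following high-level sketch reconstructs the training loop from the description in this section. It reuses the `split_clean_noisy` and `sharpen` sketches above, while `train_one_epoch`, `per_sample_losses`, `predict_probs`, and `retrain` are hypothetical helpers standing in for routine training code; the number of correction rounds is also an assumption.

```python
def train_pgnlc(model, dataset, warmup_epochs=2, n_rounds=5, tau=0.5, alpha=0.2):
    # 1. Warm-up: train the BERT+GCN classifier on the original labels for a
    #    few epochs, before it starts to overfit the noisy labels.
    for _ in range(warmup_epochs):
        train_one_epoch(model, dataset)          # hypothetical helper

    for _ in range(n_rounds):
        # 2. Per-sample cross-entropy losses (Eq. (2)), then a two-component
        #    GMM split into clean and suspected-noisy samples (Eqs. (3)-(4)).
        losses = per_sample_losses(model, dataset)        # hypothetical
        clean_idx, noisy_idx, _ = split_clean_noisy(losses, tau)

        # 3. Erase noisy labels and replace them with sharpened soft
        #    pseudo-labels predicted by the current model (Eqs. (5)-(7)).
        probs = predict_probs(model, dataset, noisy_idx)  # hypothetical
        soft_labels = sharpen(probs)

        # 4. Retrain on D-hat = D_C union D-hat_N with L_o + alpha * L_p
        #    (Eq. (16)), iterating the correction.
        retrain(model, dataset, clean_idx, noisy_idx, soft_labels, alpha)
    return model
```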

4 Experiments

4.1 Data sets

To demonstrate the effectiveness of the method proposed in this paper, experiments are performed on three data sets collected from real social media platforms, Twitter and Reddit, to evaluate and compare psychological assessment performance across data sets. Table 1 summarizes the three data sets in terms of name, source platform, mental disorder detection category, number of classification categories, and the sizes of the training, validation, and test sets.

Table 1 A summary of data sets

Depression_Reddit We use the depression data set collected from the Reddit platform by Pirina and Çöltekin (2018). They found authors with the statement "I was just diagnosed with depression" in the Reddit depression board, searched for posts made by these authors in the month after that statement, excluded potentially confounding content such as "anxiety" and "depression help", and considered the remaining posts as likely written by authors with depression; these posts were labeled "depression". An equal number of posts were randomly selected whose authors did not post in the depression subsection during the same period; these were regarded as non-depression texts and labeled "non-depression".

Suicide_Twitter Many users express suicidal ideation on social media when it occurs or before committing suicidal acts. A keyword filtering method was used to collect a suicidal ideation data set from Twitter: posts expressing suicidal thoughts or declaring imminent suicidal intent were collected and labeled "suicide", while posts that discuss suicide formally and objectively, mention the suicide of others, or never mention suicide-related content were collected as non-suicidal texts and labeled "non-suicide".

SWMH Because serious mental disorders may cause patients to self-harm or even commit suicide, Ji et al. (2022a) collected the SWMH data set from a number of mental health-related subsections of the Reddit platform, covering suicide-related intentions and mental disorders such as depression, anxiety, and bipolar disorder. We conduct experiments on this data set to validate the performance of the proposed method on large-scale, multi-category data.

4.2 Experimental setting

Comparison experiment Experiments were conducted on the above three data sets using six popular models from social media sentiment analysis, mental health evaluation, and noisy-label learning, and compared with the method proposed in this paper to demonstrate its effectiveness. The baseline models are as follows:

LSTM (Gupta et al. 2022): LSTM is widely used to address the long-term dependency problem in general recurrent neural networks. It can effectively convey information across long sequences without forgetting useful information from far in the past, and has been widely used in natural language processing since its proposal. Gupta et al. used this method for depression detection on Twitter data, achieving good performance and demonstrating its validity for mental health assessment.

BERT (Devlin et al. 2019): The large-scale pre-trained model proposed in the literature has excelled on many classical NLP tasks over the past few years.

TextGCN (Yao et al. 2019): This work proposed using a GCN for text classification, constructing text graphs, initializing node embeddings with one-hot representations of words and documents, and learning and predicting through graph convolutional networks.

BERTGCN (Lin et al. 2021): This work constructed heterogeneous graphs on the data set, initialized the document nodes using a pre-trained model, and learned through a graph convolutional network to achieve text classification.

Co-teaching (Han et al. 2018): This approach trains two models simultaneously and lets each model select small-loss instances to teach the other for further training. We use this strategy for processing noisy labels on top of the BERT model.

SelfMix (Qiao et al. 2022): This method utilizes the dropout mechanism on a single model to reduce confirmation bias in self-training and introduces a textual-level mixup training strategy, achieving superior performance on text classification tasks with noisy labels.

Ablation experiment To verify the effectiveness of each part of PGNLC, each component is removed separately for experiments. The additional loss function \({{L}_{p}}\) is removed to verify that adding this loss term improves model performance, denoted PGNLC w/o Lp. In addition, the GCN module is removed so that only the pre-trained model with the classifier is used as the classification model, with the encoded results serving as the input for noisy label selection and correction, to verify whether the fused GCN effectively improves psychological assessment performance, denoted PGNLC w/o GCN.

Comparison experiments at different noise rates To further validate the noise resistance of PGNLC, noise is manually injected into the training and validation sets, and models are evaluated on the original test set. Labels in the training sets are randomly changed to different categories to generate label noise; noise rates of 20% and 40% are applied to the three data sets, and PGNLC is compared with other advanced models to verify assessment performance under uniform random label noise.
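For reproducibility, the injection scheme can be sketched as follows: a fixed fraction of training labels is flipped uniformly at random to a different category, which is our reading of the uniform random noise described above; `inject_label_noise` and its arguments are illustrative names.

```python
import numpy as np

def inject_label_noise(labels, noise_rate, num_classes, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.array(labels).copy()
    n = len(labels)
    flip_idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in flip_idx:
        # Always flip to a *different* category, chosen uniformly at random.
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels
```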

Table 2 Results (%) for comparison experiment on three data sets

4.3 Evaluation metrics and parameter settings

For the mental state classification problem addressed in this experiment, Accuracy and macro-F1 (F1) are used as evaluation metrics, calculated as shown in Eqs. (17)–(20).

$$\begin{aligned}{} & {} {Accuracy=\frac{\sum \nolimits _{i=1}^{M}{T{{P}_{i}}}}{\sum \nolimits _{i=1}^{M}{(T{{P}_{i}}+F{{P}_{i}})}}} \end{aligned}$$
(17)
$$\begin{aligned}{} & {} {Precision=\frac{1}{M}*\sum \limits _{i=1}^{M}{\frac{T{{P}_{i}}}{T{{P}_{i}}+F{{P}_{i}}}}} \end{aligned}$$
(18)
$$\begin{aligned}{} & {} {Recall=\frac{1}{M}*\sum \limits _{i=1}^{M}{\frac{T{{P}_{i}}}{T{{P}_{i}}+F{{N}_{i}}}} } \end{aligned}$$
(19)
$$\begin{aligned}{} & {} {F1=\frac{1}{M}*\sum \limits _{i=1}^{M}{\frac{2*{{R}_{i}}*{{P}_{i}}}{{{R}_{i}}+{{P}_{i}}}} } \end{aligned}$$
(20)

where i indexes the mental status categories and M is their number. \(T{{P}_{i}}\) is the number of samples correctly predicted as category i, \(F{{P}_{i}}\) is the number of samples of other categories incorrectly predicted as category i, and \(F{{N}_{i}}\) is the number of samples of true category i incorrectly predicted as other categories. \({{R}_{i}}\) and \({{P}_{i}}\) are the recall and precision of category i, respectively.
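In practice both metrics match their scikit-learn counterparts, so evaluation can be sketched as:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    # Accuracy (Eq. (17)) and macro-averaged F1 (Eqs. (18)-(20)).
    return (accuracy_score(y_true, y_pred),
            f1_score(y_true, y_pred, average="macro"))
```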

Data collected from social media platforms may contain redundant or irregular expressions, which makes semantic analysis difficult. Therefore, to ensure the efficiency and completeness of the task, we follow the data preprocessing method of Yao et al. (2019). First, the data is cleaned to remove special characters such as web links, emoticons and face characters, and non-English characters. Then NLTK is used in Python to remove stop words and the corresponding prefixes and suffixes, as well as low-frequency words with fewer than five occurrences in the data set. We use BERT-base as the pre-training module and a two-layer GCN as the graph neural network module in the primary classification model. Since a pre-trained model is used, its learning rate needs to be kept small for fine-tuning; hence, we set the learning rate of the pre-trained module to 1e−5, the learning rate of the GCN module to 1e−3, and the batch size to 32. To avoid overfitting, we set dropout = 0.4 and train 2 epochs in the warm-up phase. There are three adjustable hyperparameters in PGNLC: the noisy sample selection threshold \(\tau \) of the GMM, the weight parameter \(\alpha \) of \({{L}_{p}}\), and the linear interpolation factor \(\lambda \) connecting BERT and the GCN in the classification model. The optimal values of the three parameters may differ across tasks. We experimented to select the best combination and set \(\tau =0.5\), \(\alpha =0.2\), \(\lambda =0.5\). All results are averages over five runs.
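The preprocessing pipeline described above can be sketched as follows; the regular expressions and the Porter stemmer are illustrative stand-ins for the cleaning and affix-stripping steps, not the authors' exact rules.

```python
import re
from collections import Counter

from nltk.corpus import stopwords   # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

def preprocess(texts, min_freq=5):
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    cleaned = []
    for t in texts:
        t = re.sub(r"http\S+", " ", t)        # strip web links
        t = re.sub(r"[^A-Za-z\s]", " ", t)    # strip emoticons / non-English chars
        tokens = [stemmer.stem(w) for w in t.lower().split() if w not in stop]
        cleaned.append(tokens)
    freq = Counter(w for doc in cleaned for w in doc)
    # Drop low-frequency words (fewer than five occurrences in the data set).
    return [[w for w in doc if freq[w] >= min_freq] for doc in cleaned]
```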

4.4 Results and discussion

Comparison experiment The baseline models can be categorized into three groups: (1) single-use sequence models (LSTM, BERT), (2) methods that combine graph neural networks but use no denoising strategy (TextGCN, BERTGCN), and (3) methods using current state-of-the-art noisy-label processing strategies (Co-teaching, SelfMix). The results of the comparison experiments are shown in Table 2. The proposed PGNLC achieves the best performance on all three data sets, showing that PGNLC can effectively perform psychological assessment with superior performance. Moreover, BERT outperforms LSTM significantly on all three data sets. The BERTGCN model, whose GNN is initialized with sentence embeddings obtained from BERT encoding, performs significantly better than TextGCN without the BERT pre-trained model, reflecting the effectiveness of fine-tuning the large-scale pre-trained model for this task. On the two data sets besides SWMH, BERTGCN outperforms the pre-trained model used alone, confirming that combining a GNN offers an advantage on this task. On the SWMH data set, however, BERTGCN performs slightly worse than the model without a GNN. Based on our analysis, the reasons are multifold: (1) this data set is larger, with more classification categories; (2) different psychological disorders bear certain similarities; and (3) the data contains more noise. Consequently, PGNLC obtains a significant performance improvement on this data set. In addition, the proposed method, Co-teaching, and SelfMix, which use noise processing strategies, generally outperform the methods without noise processing on the Depression_Reddit and SWMH data sets, demonstrating the value of processing noisy labels for this task. The exception is the Suicide_Twitter data set, where the two noise-handling baselines do not perform well enough. We speculate that this is because the categorical boundaries of this data set are clearer and the original data set contains fewer noisy labels; in addition, the two methods may have filtered out some useful samples during sample selection, whereas our method compensates for this problem through its graph neural network, which extracts semantic information more completely. In summary, the proposed method can perform the psychological assessment task effectively in the presence of noisy labels, and the introduced noisy label correction mechanism improves assessment accuracy.

Ablation experiment PGNLC contains several components. To demonstrate that each component contributes to the final performance, the additional loss function \({{L}_{p}}\) and the graph convolution module are removed separately, and ablation experiments are performed on the three data sets, with results shown in Fig. 3. PGNLC is significantly better than PGNLC w/o GCN in both accuracy and F1, demonstrating that feature extraction using only the BERT model without the graph neural network reduces accuracy; this confirms the contribution of the GNN to extracting features and fine-tuning BERT toward a better text representation. Moreover, PGNLC w/o GCN slightly outperforms the plain BERT model on all three data sets, showing that the label correction strategy alone improves evaluation accuracy. Furthermore, removing the additional loss function \({{L}_{p}}\) causes significant performance degradation, and we observed during the experiments that the network may overfit after removing \({{L}_{p}}\). We therefore conclude that the pseudo loss improves results and prevents the model from over-fitting noisy data.

Fig. 3 Ablation experiment results (%) on three data sets

Fig. 4 Comparison experiment of artificially adding label noise with different noise ratios. The bar chart indicates the classification accuracy and the line chart indicates F1 scores

Comparison experiments at different noise rates

Random noisy labels are added manually to explore the model's performance at various noise levels, and the results are given in Fig. 4. The classification results correspond to no artificially added noisy labels and to noise rates of 20% and 40%, respectively. PGNLC achieves better performance on both metrics across the three data sets under the different noise conditions. Taking the Depression_Reddit and Suicide_Twitter data sets as examples, compared to the best baseline, the accuracy is improved by 2.71% and 0.5% when adding 20% noise, and by 1.97% and 0.61% when adding 40% noise. In addition, experiments on the Suicide_Twitter data set show that as the proportion of noise increases, the performance of Co-teaching and SelfMix gradually approaches, and even exceeds, that of BERT, which does not handle noise. This result again confirms the effectiveness of processing noisy labels. PGNLC is shown to be more robust and can efficiently attenuate the influence of noisy labels on classification performance. Moreover, on the SWMH data set, the improvement of PGNLC's classification metrics is more pronounced than on the other two data sets, and increasing the ratio of noisy labels affects the model less. We infer that this is because this data set is larger, its labels are finer-grained, and there are more classification categories, so the correction of noisy labels is more effective, suggesting that the proposed method has greater potential for large-scale data and more fine-grained psychological assessment tasks.

5 Conclusions

In this article, PGNLC is presented, a psychological evaluation approach for social media users incorporating a noisy label correction mechanism. First, building on the development of DL models in NLP, the proposed scheme fuses pre-trained models and graph convolution models as the classification model, attends to both local semantic information and global structural information of the text, fully extracts text features, obtains embedding representations of the nodes in the text graph, and completes the preliminary mental assessment task. Second, we enhance the robustness of the model by filtering label-noise samples and generating pseudo-labels through a GMM to correct noisy labels. Experiments are carried out on three real data sets from social media platforms, and the results show that the proposed approach attains superior accuracy on mental state classification tasks. The validity of the model's components is demonstrated by ablation experiments. Further experiments artificially add random noise at different ratios: with noise ratios of 20% and 40%, the proposed method obtains classification performance superior to the comparison models, and the performance improvement is more prominent on larger data sets, demonstrating that the proposed scheme offers higher classification performance and robustness and better potential for tasks with large data sizes and high classification granularity. Meanwhile, experiments with current state-of-the-art label noise processing strategies show that performance differences among noise processing methods vary considerably across data sets, proving that not all methods apply to all data sets. In future work, we will therefore continue to explore the characteristics of text noise reflecting mental states in social media data, introduce domain knowledge to guide the selection and correction of label noise, and improve the generalization ability of the proposed method.