1. INTRODUCTION

Analysis of negative sentiment in social media comments that express fear, anxiety, boredom, sadness, etc. is a promising direction for assessing the state of mental health in general and identifying various affective states in particular. Some researchers note that in social networks, people describe the problems, symptoms, and manifestations of their mental illness more freely than at a doctor’s appointment [1, 2]. For this reason, researchers have shown growing interest in using natural language processing methods to identify patterns in comment texts that are characteristic of various types of disorders and for their diagnosis. Coppersmith et al. [3] revealed the predominance of first-person personal pronouns in depressive comments based on the analysis of parts of speech; Sarsam et al. [4] note the predominance of emotional states associated with the expression of sadness in suicidal messages. Some researchers [5, 6] note that an altered emotional state and the desire to deliberately distort the meaning affect the linguistic indicators of the text, which can be used at the stage of vectorization of comment texts to improve the quality of classification.

The COVID-19 pandemic and, in particular, self-isolation, mask regimen, and vaccination have led to an increase in the manifestation of affective states in social media comments. For example, Zhang et al. [7] investigated the impact of the COVID-19 pandemic on the expression of depressive emotions in tweets; Saifullah et al. [8] demonstrated the efficiency of using a random forest in conjunction with a TF–IDF vectorization approach to classify COVID-19–related disturbing comments on YouTube.

The development of new methods for analyzing the texts of comments in the field of mental health research is aimed not only at identifying comments related to various types of affective disorders but also at creating decision support systems to provide personalized assistance to people suffering from such disorders. Of interest is also the problem of determining the point of no return in messages, when a negative emotional state and a negative attitude towards all aspects of life lead to suicidal ideation [4].

The aim of the paper is to study the efficiency of various approaches to the vectorization of comment texts, in particular approaches based on the analysis of bigrams, for classifying and clustering comments that describe various affective disorders, as well as to identify patterns that contribute to the understanding of psychosocial stressors associated with affective disorders.

2. SURVEY OF LITERATURE ON RESEARCH TOPICS

Twitter is the most researched social media platform in terms of identifying mood disorders based on the analysis of comment texts. Table 1 lists some of the studies aimed at extracting patterns from tweets that improve the quality of classification of affective comments.

Wolk et al. [5] have demonstrated the efficiency of sentiment-based comment text classification and depression detection based on call-gram analysis and the deep model BERT of the language representation, while Moyeen et al. [11] note that bigram-based vectorization considerably increases the quality of classification, in contrast to the use of trigrams. Accordingly, the present study is aimed at identifying bigrams and trigrams that describe the subject area and at considering approaches to the vectorization of comment texts based on the analysis of bigrams and their characteristics to solve the problems of classifying and clustering comments containing a description of affective disorders.

Table 1. Analysis of tweets to identify some types of affective disorders

Despite the development of efficient approaches to preprocessing, vectorization, and classification of tweets and comment texts in social networks into classes corresponding to various affective states, research aimed at extracting patterns that describe the causes of such states remains relevant. It should be noted that comments on social networks are significantly different from tweets, as they allow one to describe thoughts and feelings in more detail. In particular, people suffering from depression tend to change their mood when writing comments on social networks [12]. In addition, the comments of people suffering from depression describe a positive past, followed by a description of a negative present; therefore, it is necessary to develop special methods from the stage of preprocessing the texts of comments to the stage of classification, clustering, and extracting patterns from them taking into account the characteristics of affective disorders.

3. DATASET AND RESEARCH METHODOLOGY

3.1. Classification of Text Messages in Two Classes: A Class with a Description of Affective States and a Class of Ordinary Comments

The study used a balanced dataset of 3553 comments: 1857 comments describing anxiety and 1696 regular comments from the social network Reddit. The tagging of comments into two classes was done manually with the help of two practicing experts who provide care to people with various types of affective disorders. The considered dataset is part of the dataset described in [13] and presented on the Kaggle platform.

The random forest algorithm was considered as the basic classification algorithm. To improve the quality of the classification algorithm, various approaches to text vectorization were studied such as

  1. Applying Bag of Words (BoW).

  2. Using the TF–IDF measure.

  3. Applying the deep model BERT of the language representation.

  4. Using bigrams analysis based on pointwise mutual information as well as numerical estimates of sentiments obtained using the VADER method implemented in the vaderSentiment Python library.

Let us briefly describe the methods listed above. The “bag of words” (BoW) model is based on extracting all words from comment texts and comparing them with the frequency of their occurrence in comments. The TF–IDF measure (TF stands for Term Frequency, IDF is Inverse Document Frequency) is calculated as the product of the ratio of the number of occurrences of the selected word to the total number of words in the comment and the inverse of the frequency with which a certain word occurs in the comments of the corpus [14]. The deep model BERT of the language representation (Bidirectional Encoder Representations from Transformers) implements the transformer architecture and allows taking into account the context and representation of the token as well as its position within the sentence and the sentence number in the corpus [15].
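The TF–IDF description above can be illustrated with a minimal sketch. It is not the implementation used in the study: whitespace tokenization, the logarithmic form of the inverse document frequency, and the toy comments are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(comments):
    """Return one {word: weight} dictionary per comment."""
    # Assumption: lower-casing and whitespace splitting stand in for real tokenization.
    tokenized = [comment.lower().split() for comment in comments]
    n_docs = len(tokenized)
    # Document frequency: the number of comments that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag-of-words counts for this comment
        total = len(tokens)
        weights.append({
            # TF = occurrences / words in the comment; IDF = log(N / df) (assumed variant).
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, not taken from the dataset.
toy_comments = ["panic attack again", "panic at work", "nice day at work"]
toy_weights = tf_idf(toy_comments)
```

In this toy corpus the word "attack" occurs in one comment only, so it receives a larger weight in the first comment than "panic", which occurs in two.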

To assess the quality of the classification of the studied approaches on a balanced dataset, we used the indicators (1)–(4) given below and 5-fold cross-validation,

$$ \textit {Accuracy} = \frac {TP+TN}{TP+TN+FP+FN}, $$
(1)
$$ \textit {Precision} = \frac {TP}{TP+FP}, $$
(2)
$$ \textit {Recall} = \frac {TP}{TP+FN}, $$
(3)
$$ \textit {F1-score} = 2 \cdot \frac {\textit {Precision} \cdot \textit {Recall}}{\textit {Precision} + \textit {Recall}}, $$
(4)

where \(TP\), \(TN \), \(FP\), and \(FN \) are the numbers of true-positive, true-negative, false-positive, and false-negative classifications, respectively.
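As a worked example of indicators (1)–(4), the sketch below evaluates them for a hypothetical confusion matrix. The counts are invented for illustration and are not results of the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluate Accuracy, Precision, Recall, and F1-score from confusion-matrix counts."""
    # No guard against zero denominators: the sketch assumes tp + fp > 0 and tp + fn > 0.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one cross-validation fold.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

For these counts, Accuracy is 170/200 = 0.85, Precision is 90/110, Recall is 0.9, and the F1-score is 180/210.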

3.2. Clustering Text Messages Containing Descriptions of Affective States

At the clustering stage, we used a dataset of 1857 messages describing various anxieties. Here is an example of a random comment with the author’s spelling from the dataset under study: “The attack lasted several hours. It looked like circulatory problems and I panicked and of course ended up in the emergency room again. This time the doctor came to me immediately. He wanted to talk about my anxiety. He said he could run some more tests, but he didn’t think it would help.”

Following the methodology for detecting mathematical anxiety based on the analysis of MOOC comments outlined in [16], the vectorization of comment texts based on the deep model BERT of the language representation and the \(k \)-means clustering algorithm was used to identify clusters.

Consider a hybrid approach based on the use of the LDA thematic modeling method, the VADER sentiment analysis method, pointwise mutual information, and parts of speech analysis and allowing one to select bigrams and trigrams to describe comment clusters.

The algorithm for analyzing bigrams and constructing trigrams based thereon relies on the following main steps:

  1. Extraction of \(M\) keywords with the highest frequency from the topics identified based on Latent Dirichlet Allocation (LDA) and extraction of nouns or verbs based on the analysis of parts of speech. Latent Dirichlet allocation is aimed at extracting hidden (latent) topics from documents, with the coherence index taken into account to ensure the similarity of terms within the same topic when constructing a topic model and determining the number of topics.

  2. Extraction of key bigrams in the cluster whose left and/or right token is one of the \(M \) keywords of the topic.

  3. For each key bigram, its left and right neighboring bigrams are extracted from the set of all bigrams and glued to it along the shared word to obtain a trigram.

  4. Trigrams containing MDs (modal verbs), more than two adverbs or adjectives (RB, JJ), etc. are removed based on the analysis of parts of speech.

  5. A set of target trigrams is formed on the basis of rare trigrams that have a negative tone. The rare trigrams are extracted based on pPMI values according to (5), and negative sentiment is determined using the VADER (Valence Aware Dictionary and Sentiment Reasoner) sentiment analysis method. The VADER method is based on rules and dictionaries in which words from the dictionary are juxtaposed with polarity assessments by experts [17].
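Steps 2 and 3 of the list above can be sketched as follows. This is one possible reading of the gluing step, not the author's code: a key bigram (a, b) is assumed to be joined with a left neighbor (x, a) into (x, a, b) and with a right neighbor (b, y) into (a, b, y). The function names, the toy tokens, and the keyword set are hypothetical.

```python
def key_bigrams(bigrams, keywords):
    """Step 2: keep bigrams whose left and/or right token is a topic keyword."""
    return [bg for bg in bigrams if bg[0] in keywords or bg[1] in keywords]


def glue_trigrams(bigrams, keywords):
    """Step 3 (assumed reading): glue each key bigram with its neighbors along the shared word."""
    trigrams = set()
    for left, right in key_bigrams(bigrams, keywords):
        for a, b in bigrams:
            if b == left:   # left neighbor (a, left) + key bigram (left, right)
                trigrams.add((a, left, right))
            if a == right:  # key bigram (left, right) + right neighbor (right, b)
                trigrams.add((left, right, b))
    return trigrams


# Hypothetical preprocessed tokens and keyword set.
toy_tokens = ["panic", "attack", "lasted", "hours"]
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))
toy_trigrams = glue_trigrams(toy_bigrams, keywords={"attack"})
```

With the keyword "attack", the sketch yields the trigrams (panic, attack, lasted) and (attack, lasted, hours); the filtering of steps 4 and 5 is not covered here.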

The pointwise mutual information (PMI) is calculated using the formula

$$ PMI(w_1,w_2, w_3) = \log _2\left (\frac {P(w_1, w_2, w_3)}{P(w_1) P(w_2) P(w_3)}\right ), $$

where \(P(w_1)\), \(P(w_2) \), and \(P(w_3) \) are the probabilities of occurrence of tokens (words) \(w_1 \), \(w_2\), and \(w_3 \), respectively, in the comment text and \(P(w_1, w_2, w_3)\) is the probability of occurrence of the triple of words \((w_1, w_2, w_3)\), a trigram, in the comment text.

To identify rare trigrams, we used the pPMI modification calculated by the formula

$$ pPMI(w_1,w_2, w_3) = \max \big (0, PMI(w_1,w_2, w_3)\big ).$$
(5)
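A minimal sketch of the PMI and pPMI computation for trigrams is given below. Estimating the probabilities as relative frequencies over a single token sequence is an assumption; the source does not state how the probabilities were estimated, and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def ppmi_trigrams(tokens):
    """Return {trigram: pPMI} with probabilities estimated as relative frequencies."""
    unigram_counts = Counter(tokens)
    trigram_list = list(zip(tokens, tokens[1:], tokens[2:]))
    trigram_counts = Counter(trigram_list)
    n_tokens = len(tokens)
    n_trigrams = len(trigram_list)
    scores = {}
    for (w1, w2, w3), count in trigram_counts.items():
        p_joint = count / n_trigrams  # P(w1, w2, w3)
        p_independent = (
            (unigram_counts[w1] / n_tokens)
            * (unigram_counts[w2] / n_tokens)
            * (unigram_counts[w3] / n_tokens)
        )  # P(w1) P(w2) P(w3)
        pmi = math.log2(p_joint / p_independent)
        scores[(w1, w2, w3)] = max(0.0, pmi)  # formula (5)
    return scores


# Hypothetical token sequence.
toy_tokens = "panic attack again panic attack today".split()
toy_scores = ppmi_trigrams(toy_tokens)
```

For the trigram (panic, attack, again), the joint probability is 1/4 and the product of the unigram probabilities is 1/54, so the PMI equals \(\log _2 13.5\).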

The trigram analysis algorithm is based on constructing all trigrams for each cluster and identifying rare trigrams with negative sentiment, followed by part-of-speech filtering: a trigram is kept if it matches one of the patterns (JJ, VB, NN), (NN(P), VBD, NN(S)), (NN, VBN, NN), (NNP, VBG, NNP), (JJ, VBG, NN), (JJ, VB+VB, NN), (NN, JJ, NN), etc. or if the central word of the trigram has the ROOT dependency tag. Here JJ is an adjective; NN, NNS, and NNP are a singular, plural, and proper noun, respectively; VB is a verb in the base form; VBD is a past-tense verb; VBG is a gerund or present participle; and VBN is a past participle. For example, the following trigrams were extracted based on the pattern (NN|JJ, VB, NN(S)|JJ), i.e., (noun | adjective, verb, noun (plural) | adjective): (panic, occur, attack), (catatonia, detect, symptoms), (attack, affect, traumatic), etc.
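The pattern (NN|JJ, VB, NN(S)|JJ) mentioned above can be checked with a short sketch. It assumes that every trigram is already tagged; the tags below were assigned by hand for illustration, and the last trigram is a hypothetical counterexample that does not come from the dataset.

```python
# One set of admissible tags per trigram position: (NN|JJ, VB, NN(S)|JJ).
PATTERN = ({"NN", "JJ"}, {"VB"}, {"NN", "NNS", "JJ"})


def matches(tagged_trigram, pattern=PATTERN):
    """True if the tag of every word belongs to the set allowed at its position."""
    return all(tag in allowed for (_, tag), allowed in zip(tagged_trigram, pattern))


# Hand-tagged trigrams (tags are assumptions, not tagger output).
tagged_trigrams = [
    (("panic", "NN"), ("occur", "VB"), ("attack", "NN")),
    (("catatonia", "NN"), ("detect", "VB"), ("symptoms", "NNS")),
    (("attack", "NN"), ("affect", "VB"), ("traumatic", "JJ")),
    (("would", "MD"), ("really", "RB"), ("help", "VB")),  # hypothetical non-match
]
kept = [tuple(word for word, _ in tg) for tg in tagged_trigrams if matches(tg)]
```

The first three trigrams satisfy the pattern and are kept, while the trigram that starts with a modal verb is rejected.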

To compare the keywords of the trigrams in each cluster with the types of psychosocial stressors, we used the LIWC psycholinguistic dictionary [19] and a custom dictionary based thereon and containing various synonyms for the words “anxiety,” “fear,” and “loneliness” as well as words describing various family relationships and social ties (for comparison with a sociological stressor associated with building relationships), words describing various types of pain, parts of the body, health care facilities (for comparison with a sociological stressor related to health and health care), etc.

4. RESEARCH RESULTS

4.1. Classification of Text Messages in Two Classes: A Class with a Description of Affective States and a Class of Ordinary Comments

The results of a 5-fold cross-validation to assess the quality of the classification algorithm (random forest (RF)) depending on various approaches to the vectorization of comment texts are presented in Table 2.

It can be seen from Table 2 that the best classification accuracy was achieved by expanding the vector space based on bigram analysis and amounted to 91.1%.

Table 2. Evaluation of the efficiency of various approaches to the vectorization of comment texts

4.2. Clustering Text Messages Containing Descriptions of Affective States

The preliminary stage of processing comments included the removal of punctuation and stop words, tokenization, and normalization. The optimal number of clusters for solving the problem of cluster analysis of 1857 messages containing descriptions of affective states was estimated based on the voting of various methods for determining the number of clusters (silhouette method, elbow method, etc.) implemented in the NbClust R-package [18] and was equal to 7.

Table 3 shows a fragment of the results of cluster analysis based on \(\text {BERT}+{}\) \(k \)-means with the selection of bigrams and trigrams with negative sentiment containing keywords determined on the basis of topic modeling using LDA and ranking using pointwise mutual information.

Table 3. An example of three selected clusters and the construction of their descriptions based on the analysis of bigrams and trigrams

For example, from the trigrams in cluster 1, patterns can be identified that describe panic attacks, catatonia, insomnia, and anxiety associated with school, drugs, living conditions, and job search.

To implement the bigram analysis algorithm, the following Python data analysis libraries were used: nltk, gensim, spacy, and sentence-transformers.

A frequency analysis of trigram words was performed separately for each cluster based on a custom dictionary describing psychosocial stressors. For example, the frequency analysis of cluster-2 trigrams revealed that most of the comments in the considered cluster (48% of all cluster trigrams) describe health-related problems, for example, (heart, tracking, frequency), (stomach, physically, sick), (thought, refer, hypochondria), etc. A considerable part of the comments of the cluster under consideration (21% of all trigrams in the cluster) describe problems associated with building social relationships, for example, (wedding, chronic, anxiety), (angry, say, jealous), (boyfriend, feel, deceived), (incest, start, abuse), etc.
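The frequency analysis described above can be sketched as counting the share of trigrams that contain at least one dictionary word of each stressor category. The dictionary below is a hypothetical fragment, not the custom dictionary of the study, and the resulting shares are toy values that do not reproduce the reported 48% and 21%.

```python
def stressor_shares(trigrams, dictionary):
    """Share of trigrams containing at least one word of each stressor category."""
    shares = {}
    for category, words in dictionary.items():
        hits = sum(1 for trigram in trigrams if any(word in words for word in trigram))
        shares[category] = hits / len(trigrams)
    return shares


# Hypothetical dictionary fragment (category names and word lists are assumptions).
toy_dictionary = {
    "health": {"heart", "stomach", "sick", "hypochondria"},
    "relationships": {"wedding", "boyfriend", "jealous"},
}
# Example trigrams quoted in the text.
toy_trigrams = [
    ("heart", "tracking", "frequency"),
    ("stomach", "physically", "sick"),
    ("thought", "refer", "hypochondria"),
    ("wedding", "chronic", "anxiety"),
    ("angry", "say", "jealous"),
    ("boyfriend", "feel", "deceived"),
]
toy_shares = stressor_shares(toy_trigrams, toy_dictionary)
```

A trigram is counted once per category even if several of its words occur in the dictionary, which is a design choice of this sketch.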

4.3. Constructing a Knowledge Graph Based on the Analysis of Bigrams and Trigrams

Knowledge graphs have proven themselves in the field of visualizing the extracted patterns, highlighting the main characteristics, and demonstrating the relationships between them. Knowledge graphs are widely used in the field of medicine, for example, to represent medical knowledge about strokes [20] or to show personalized dietary suggestions for people with diabetes [21]; however, the present author is not aware of studies that use knowledge graphs to represent patterns describing mental health problems. In what follows, it is proposed to construct a knowledge graph to describe affective states based on the obtained trigrams.

For each trigram \((w_{i-1}, w_{i}, w_{i+1}) \), we construct graph vertices with labels \(w_{i-1} \) and \(w_{i+1} \) and also an edge labeled \(w_{i} \). First, a target word is selected from the dictionary of affective disorders, for example, “anxiety”; all trigrams constructed on the basis of the patterns proposed above with the word “anxiety” are extracted, for example, (wedding, chronic, anxiety), (anxiety, affect, fear), (rushing, constant, anxiety), (anxiety, numbers, drama), (anxiety, diagnose, depression), etc. The vertices of the graph are the target word, for example, “anxiety,” as well as the words of the following parts of speech included in the trigram: a noun or any word with the tag ROOT (root word). The words of trigrams that are associated with the description of the affective state and are adjectives, adverbs, or participles are assigned to the directed edges of the graph; if words that are these parts of speech do not occur in the trigram, then the remaining word in the trigram is assigned to the edge. The knowledge graph was constructed using the spacy part-of-speech library and the PyViz graph construction library. A fragment of the knowledge graph demonstrating the patterns constructed for the words “anxiety” and “fear” related to affective states is shown in Fig. 1.

Fig. 1. A fragment of the knowledge graph for the words “anxiety” and “fear,” related to the manifestation of affective states, compiled by the present author.

From the knowledge graph presented in the figure, one can see anxiety caused by depression, wedding, and drama as well as a description of anxiety as chronic, associated with numbers, or arising under the influence of fear.
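The basic construction rule of this subsection, in which the outer words of a trigram become vertices and the central word labels the edge between them, can be sketched with a plain adjacency dictionary. The sketch deliberately omits the part-of-speech refinement of vertex and edge assignment and does not use the libraries mentioned above; the function name is hypothetical.

```python
from collections import defaultdict


def build_graph(trigrams):
    """Map each vertex to a list of (edge label, target vertex) pairs."""
    graph = defaultdict(list)
    for left, middle, right in trigrams:
        graph[left].append((middle, right))  # directed edge left -> right labeled middle
        graph.setdefault(right, [])          # register the target vertex as well
    return dict(graph)


# Trigrams with the target word "anxiety" quoted in the text.
anxiety_trigrams = [
    ("wedding", "chronic", "anxiety"),
    ("anxiety", "affect", "fear"),
    ("rushing", "constant", "anxiety"),
    ("anxiety", "numbers", "drama"),
    ("anxiety", "diagnose", "depression"),
]
anxiety_graph = build_graph(anxiety_trigrams)
```

The resulting dictionary has six vertices, and the vertex "anxiety" has three outgoing edges labeled "affect", "numbers", and "diagnose".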

5. CONCLUSIONS

The active use of social networks has led to the accumulation of a huge number of comments left by users. Natural language processing methods, together with machine learning algorithms, have made it possible to obtain interesting results in the field of assessing the emotional state of both individual groups of social network users and society as a whole. Recently, such a direction in cyberpsychology as the assessment of the mental state based on the analysis of comments in social networks and the impact of various content on the physical and mental health of a person has been actively developing.

The present paper demonstrates the efficiency of using bigrams to improve the quality of the classification of comments containing descriptions of affective disorders and the possibility of extracting bigrams and trigrams to describe the subject area. Further research by the present author will be aimed at improving the quality of the extracted patterns to identify the causes of various types of psychosocial stressors that lead to the manifestation of anxiety disorders in the texts of social media comments.