4.1 The Application of the Topic Model
As described in Data and Methods, altogether 14 hashtags were used to collect the posts for our corpus, which contained more than four million posts. We applied the unsupervised topic model Latent Dirichlet Allocation (LDA) to identify explicitly suicide-related posts and the main discourses of this topic. Using LDA presents two main challenges that have to be addressed in order to fit the model.
The first challenge of LDA is the selection of the number of topics. The goal is to find the right number of topics and to make sure that the chosen solution is stable. There are several ways to specify the number of topics. We chose the ldatuning package of R, which calculates different metrics to estimate the most preferable number of topics for an LDA model. We selected 5 percent random samples (200,000 posts per sample) from the original corpus and repeated this sampling process 5 times to check the stability of the structure; the 5 percent sample approach was chosen to minimize the run time of the models. On each sample, we ran the ldatuning package with candidate topic numbers from 2 to 15. We did not go above 15 topics, because we expected that the interpretability of the results would be lost. Based on the results of the different models, we decided to use the Gibbs sampling method for the estimation of LDA, as the alternative VEM algorithm provided fast but less stable solutions. Figure 1 presents the result for one of the five samples; the results of the five samples did not deviate strongly from each other.
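For orientation, a minimal sketch of this search with the ldatuning and topicmodels packages is given below; the document-term matrix `dtm`, the seed, and the number of cores are illustrative assumptions, not values from the original analysis.

```r
library(ldatuning)
library(topicmodels)

# Sketch: score candidate topic numbers on one 5 percent sample.
# `dtm` is an assumed document-term matrix built from that sample.
result <- FindTopicsNumber(
  dtm,
  topics   = 2:15,
  metrics  = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method   = "Gibbs",           # the VEM alternative was faster but less stable
  control  = list(seed = 42),
  mc.cores = 2L
)
FindTopicsNumber_plot(result)   # diagnostic plot of the four metrics, as in Fig. 1
```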
The different metrics plotted in Fig. 1 can be interpreted as follows: for the metrics CaoJuan2009 (Cao et al. 2009) and Arun2010 (Arun et al. 2010), low values indicate a good solution; for Griffiths2004 (Griffiths and Steyvers 2004) and Deveaud2014 (Deveaud et al. 2014), high values indicate a good choice. Based on this figure, there is no single best solution, but extracting at least six topics is a reasonable choice.
In order to decide on the final number of topics, we ran several LDA models on the 5 percent samples, with 6, 8, and 10 topics, to test the stability of the results with different topic numbers. Just as with the traditional method of K-means clustering, LDA only finds local optima; thus, different runs give different results. Accordingly, to find the most stable solution for the number of topics, we ran two models with the same number of topics at a time and repeated this process several times. Then, we calculated the ratio of similarly classified posts for each pair of runs. This ratio averaged around 65 percent for the 6-topic solution and 60 percent for the 8-topic solution, and it was much lower for higher topic numbers. Thus, regarding stability, lower topic numbers provided a better fit.
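One way to implement this pairwise check is sketched below; the greedy matching of topics by the cosine similarity of their term distributions is our assumption about a reasonable implementation, not necessarily the exact procedure used in the original analysis.

```r
library(topicmodels)

# Sketch: fit two LDA models with the same k, match their topics greedily
# by cosine similarity of the topic-term distributions, then measure the
# share of posts assigned to matched topics in both runs.
match_stability <- function(dtm, k) {
  m1 <- LDA(dtm, k = k, method = "Gibbs", control = list(seed = sample.int(1e6, 1)))
  m2 <- LDA(dtm, k = k, method = "Gibbs", control = list(seed = sample.int(1e6, 1)))
  beta1 <- posterior(m1)$terms            # k x V topic-term distributions
  beta2 <- posterior(m2)$terms
  # cosine similarity between every pair of topics from the two runs
  sim <- (beta1 %*% t(beta2)) /
    (sqrt(rowSums(beta1^2)) %o% sqrt(rowSums(beta2^2)))
  # greedy one-to-one matching of run-1 topics to run-2 topics
  map <- integer(k)
  for (i in order(apply(sim, 1, max), decreasing = TRUE)) {
    map[i] <- which.max(sim[i, ])
    sim[, map[i]] <- -Inf                 # each run-2 topic may be used once
  }
  # share of posts whose most probable topic matches across the two runs
  mean(map[topics(m1)] == topics(m2))
}
```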
However, overall stability was far from perfect (65 percent), even in the case of the six-topic approach. Therefore, we ran the LDA 5 more times, now on the full corpus, with six topics. We found that less than 50 percent of the posts were assigned to the same six topics in all five runs. Beyond the six base topics, there were some fairly large additional topics that also seemed coherent and interpretable. In the end, we decided to keep all the extracted topics that contained at least two percent of the posts; below 2 percent, only very small topics remained. Choosing 13 topics instead of the originally selected 6 may seem odd; nevertheless, it was the direct result of the LDA algorithm. Like K-means clustering, LDA tends to avoid finding small topics. For example, if we change the number of topics from 6 to 10, LDA will not split the original six topics into smaller ones but re-estimates the whole model and tries to find 10 topics of approximately similar size. This is why higher topic numbers are more unstable.
To evaluate our approach, we calculated a topic coherence statistic, the UMass coherence measure (Mimno et al. 2011). The values of UMass are always negative, and the closer the value is to zero, the more coherent the topic. The average topic coherence was −1.69, but it varied considerably across topics, from −0.81 to −3.61. The average topic coherence improved compared to the original 6-topic models (where the value of UMass was around −2) and was similar to the simple 13-topic solution (where it was −1.7). Therefore, the 13-topic solution appeared to be an acceptable choice.
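For reference, the UMass measure of Mimno et al. (2011) scores a topic $t$ by the co-document frequencies of its $M$ most probable words $v_1^{(t)}, \dots, v_M^{(t)}$:

$$C\left(t; V^{(t)}\right) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D\left(v_m^{(t)}, v_l^{(t)}\right) + 1}{D\left(v_l^{(t)}\right)}$$

where $D(v)$ is the number of documents containing word $v$ and $D(v, v')$ the number of documents containing both, with the added 1 keeping the logarithm finite. Since $D(v_m, v_l) \le D(v_l)$, each summand is at or below zero in practice, which is why UMass values are negative.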
Although the number of topics was now fixed, around 30 percent of the posts were still not stably classified at this stage. In order to classify these texts, we applied the word embedding-based supervised classifier of fastText (Joulin et al. 2016). We used all stably classified posts as a training dataset and predicted the topics of the unstably classified posts. The average value of topic coherence decreased to −1.84, mostly because of one quite incoherent topic with a value of −3.45.
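A hedged sketch of this step is shown below: it only exports the posts in fastText's supervised input format, assuming a data frame `posts` with `text`, `topic`, and a logical `stable` column; the training and prediction themselves are then run with the fastText tool.

```r
# Write the stably classified posts in fastText's supervised input format
# (one "__label__<topic> <text>" line per post). The classifier is trained
# and applied outside R, e.g. with the fastText command-line tool:
#   fasttext supervised -input train.txt -output topic_model
#   fasttext predict topic_model.bin unlabeled.txt
stable_posts <- posts[posts$stable, ]   # assumed columns: text, topic, stable
writeLines(sprintf("__label__%d %s", stable_posts$topic, stable_posts$text),
           "train.txt")
writeLines(posts$text[!posts$stable], "unlabeled.txt")
```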
Taking a closer look at this topic, it did not contain any relevant depression- or suicide-related content, only posts about the kitchen and grooming. Although #cutting is widely used in depression- and suicide-related posts, it is used in other domains as well, for example, when people post about cutting their hair. As this topic contained no relevant information from the perspective of our research question, we decided to omit it from further analysis. After the omission of this topic, the coherence value increased to −1.7.
The preliminary analysis also showed that two topics had very similar content about mental illnesses, and their lists of the most important keywords overlapped (e.g., #mentalhealthmatters, #recovery, #mentalhealthawareness). Based on the values of topic coherence, merging them seemed reasonable as well: the coherence values of the original topics were −1.74 and −1.45, while that of the merged topic was −1.67. Based on the coherence statistics and the similar keywords, we decided to merge the two topics.
The appearance of fairly similar topics also indicated that it would not be worth creating more topics. Ultimately, we chose an 11-topic solution. Table 1 presents the 15 most frequent words of each topic in descending order.
Table 1 The 15 most frequent words of each topic in the LDA analysis
Topic 1 is the darkest one. It assigns high probability to all the hashtags we used for the selection of the posts and contains additional disturbing words like helpme and selfhate. This is the largest topic, containing around 16 percent of the posts.
Topic 2 is also large, but the posts assigned to it have completely different content. This topic covers posts about memes and meme sites like filthyfrank. It also contains posts that use a wide range of hashtags to reach a broader audience. Topic 9 is similar, as it also contains “clickbait” posts with hashtags such as likeforlike and followforfollow; nevertheless, topic 9 is more about heartbreak and loneliness.
Topic 3 is about photos and depression-related art. This is the second largest topic, with more than 14 percent of the posts. Topic 10 (which is much smaller, with only 3 percent of the posts) is similar to topic 3, as it is also connected to art, but while topic 3 is more about visual representation, topic 10 is more about literary content. We can also include topic 8 among the art-related topics, as its posts are about specific dark music, such as grunge or goth, though hashtags like #emo or #alternative also appear here.
Topic 4 contains posts about fitness, gym, and diet, associating exercise and sports with their positive effects on mental health and depression. Topic 5 is also about positive messages for dealing with mental health problems, but with a focus on spirituality and positive emotions. Topic 11 covers similar perspectives, but from the point of view of religiosity and faith, emphasizing God and Jesus.
Topic 7 covers posts that use more medical vocabulary about mental illness than the other topics. However, this topic is not exclusively negative, as it contains expressions like recovery or support as well. Medical expressions also appear in topic 6; however, that topic focuses on mental health issues as consequences of other types of diseases.
To confirm our impression that topic 1 is the darkest of all, we calculated the sentiment value of each topic (see Table 2 for the results). For this analysis, we used a sentiment dictionary, the NRC Word-Emotion Association Lexicon (Mohammad and Turney 2013). As we expected, the self-harm topic contained the most negative words: 30 percent of the words assigned to this topic were negative, and only 3.5 percent were positive. This topic also contained the most emotion words of any kind (either positive or negative). Some topics, such as fitness, mental health awareness, or religion, contained more positive words than negative ones, which is a clear sign that markedly different drivers can be identified behind posts related to mental health or depression.
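A tidytext-based sketch of this tally is given below, assuming a data frame `tokens` with one row per (topic, word) occurrence; the use of tidytext and the textdata-backed NRC download are our assumptions, not details from the original analysis.

```r
library(dplyr)
library(tidytext)

# Load the NRC lexicon (Mohammad and Turney 2013); tidytext fetches it
# through the textdata package on first use.
nrc <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative"))

# Share of positive and negative words among all words assigned to each topic
tokens %>%
  group_by(topic) %>%
  mutate(n_words = n()) %>%            # total words in the topic (denominator)
  inner_join(nrc, by = "word") %>%     # keep only sentiment-bearing words
  count(topic, sentiment, n_words) %>%
  mutate(share = n / n_words)
```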
Table 2 The ratio of positive and negative words in the different topics
We also examined the average number of likes and comments for each topic (see appendix Table 3). It is not surprising that topic 9 – likeforlike – attracts the most likes and comments. However, its median number of likes is only 7, which is much lower than in topic 4, where the median is 12. Thus, we can conclude that a small number of posts with a large number of likes and comments in topic 9 causes this difference. Overall, topic 1 (self-harm) has the smallest number of likes and comments (and this holds even if we use the median instead of the average). This may indicate that people respond less actively to this type of content. However, we cannot investigate this presumption, as we do not have information about the number of followers of the users who posted about raw selfharm and selfhate.
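The mean-versus-median comparison above reduces to a simple grouped summary; a dplyr sketch, assuming `posts` carries `topic`, `likes`, and `comments` columns:

```r
library(dplyr)

# Engagement per topic; the median guards against a few viral outliers
posts %>%
  group_by(topic) %>%
  summarise(
    mean_likes      = mean(likes),
    median_likes    = median(likes),
    mean_comments   = mean(comments),
    median_comments = median(comments)
  )
```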
The temporal dynamics of the posts also present interesting results. As we do not know the number of all Instagram posts created in the different periods of the data collection, we could only calculate the relative frequency of the different topics compared to all collected posts. Thus, the number of posts in a topic was compared to all downloaded mental health-related posts, which definitely makes our analysis limited. In this section, we primarily focus on topic 1, which covers self-harm posts, because long-term temporal trends exist for the number of suicides, and these trends seem independent of geographical factors. Thus, if we select the topic that is closest to the sociological and theoretical definition of our object of interest, we can compare these established trends with the ones we found in our new type of data. Figure 2 presents the yearly, monthly, and day-of-week trends of suicide- and depression-related posts.
A general increase can be observed in the ratio of this topic during the examined period. In 2016, 12 percent of all the collected posts belonged to this topic; this increased to 15 percent in 2017 and above 17 percent in 2018. Thus, within this broad mental health-related space, posts about suicide and self-harm ideation increased significantly over the examined 3 years.
As for the monthly trend, we observed that in the first 3 months of the year the proportion of topic 1 is around 12–13 percent; it then increases to 15 percent and remains at that level.
There is another long-standing and spatially independent trend concerning the days of the week; thus, it is worth examining the weekly dynamics of the ratio of topic 1 among all the collected posts. The proportion of topic 1 is highest on weekends and lowest in the middle of the week (Wednesday).
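The relative-frequency computations behind Fig. 2 can be sketched as follows, assuming `posts` has a POSIXct `created_at` column and a `topic` column; the lubridate helpers are our assumed tooling.

```r
library(dplyr)
library(lubridate)

# Yearly share of topic 1 among all collected posts
posts %>%
  group_by(year = year(created_at)) %>%
  summarise(topic1_share = mean(topic == 1))

# Day-of-week share of topic 1 (weeks starting on Monday)
posts %>%
  group_by(wday = wday(created_at, label = TRUE, week_start = 1)) %>%
  summarise(topic1_share = mean(topic == 1))
```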
Finally, we examined the ratio of topic 1 on special days, such as Christmas, New Year’s Eve, and Valentine’s Day, just as Beauchamp et al. (2014), Jessen et al. (1999), and Zonda et al. (2009) examined these days. However, we did not find any remarkable pattern in our data.
4.2 The Application of Word Embedding
Our second approach for discovering the different discourses behind mental health-related Instagram posts is based on a word embedding model. We used the GloVe algorithm (Pennington et al. 2014) with a window size of 30, which means that a context of 30 words was taken into account when the co-occurrences of words were calculated. This window size is larger than usual, but we wanted to make sure that all the hashtags appearing in a post were assigned to the same window: we hypothesized that the distance between two hashtags within the same post does not matter, only that they appear together in the same post. With this window size, each post served as one window. For the training of the vector space, we only used words that occurred at least 30 times in the corpus, as words with low occurrence could increase the instability of the word embedding model. Overall, more than 150,000 unique words were used in the training of the model. The trained vector space had 300 dimensions.
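The setup described above can be reproduced, for instance, with text2vec's GloVe implementation; the choice of text2vec, the iteration count, and the `x_max` weighting cap are our assumptions, while the window size, minimum frequency, and dimensionality come from the text.

```r
library(text2vec)

# Sketch: train 300-dimensional GloVe vectors with a 30-word window and a
# minimum term frequency of 30, as described above. `post_texts` is an
# assumed character vector of the collected posts.
tokens     <- word_tokenizer(tolower(post_texts))
it         <- itoken(tokens, progressbar = FALSE)
vocab      <- prune_vocabulary(create_vocabulary(it), term_count_min = 30L)
vectorizer <- vocab_vectorizer(vocab)
tcm        <- create_tcm(it, vectorizer, skip_grams_window = 30L)

glove   <- GlobalVectors$new(rank = 300, x_max = 100)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
word_vectors <- wv_main + t(glove$components)   # combine main and context vectors
```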
After we trained the model, we selected the 14 hashtags applied in the data collection stage of the project and detected the closest words in the vector space for each of them. We used a threshold of 0.3 cosine similarity in the selection of the closest words; thus, for some initial hashtags, fewer than 50 words were detected. The 0.3 value was based on our previous experience with such word embedding models. Finally, we had 275 words in the close environment of the initial hashtags, from which we created a similarity matrix based on cosine similarities. To identify latent topics behind the corpus, we performed hierarchical clustering on this similarity matrix. As we wanted to group the words, and we could calculate the distance between the words, clustering seemed a reasonable solution for this task. Based on the interpretation of the dendrogram of the hierarchical cluster analysis, we decided to keep the solution of 13 word clusters. Here, the selection of the number of clusters was also not straightforward, but we could rely on the dendrogram. Figure 3 presents a simplified version of this dendrogram and thus shows the distances between the clusters, with some interpretative labels. The list of all words assigned to each cluster is available in Table 4 in the appendix.
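Continuing the sketch above, the neighbourhood selection and clustering might look as follows; `seeds` (the 14 initial hashtags as row names of `word_vectors`) and the Ward linkage are our assumptions, while the 0.3 threshold and the 13-cluster cut come from the text.

```r
library(text2vec)

# Pool all words within 0.3 cosine similarity of any seed hashtag
sims  <- sim2(word_vectors, word_vectors[seeds, , drop = FALSE],
              method = "cosine", norm = "l2")
close <- rownames(sims)[apply(sims, 1, max) > 0.3]

# Cosine similarity matrix of the pooled words, turned into a dissimilarity
sim_m <- sim2(word_vectors[close, ], method = "cosine", norm = "l2")
hc    <- hclust(as.dist(1 - sim_m), method = "ward.D2")  # linkage assumed

clusters <- cutree(hc, k = 13)   # the 13-cluster cut read off the dendrogram
plot(as.dendrogram(hc))          # inspect, as in Fig. 3
```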
As expected, the word clusters of chronic pain and stress/insomnia are close to each other, as are mental illness and mental illness awareness. These four clusters merge before ptsd recovery joins them (right side of the figure). Words associated with recovery and healing create their own cluster, but on a higher level it merges with the first five. In the other part of the dendrogram, suicide- and self-harm-related hashtags are close to eating disorder, and these two clusters are close to the cluster labeled broken heart. Additionally, the cluster of memes and music (grunge) is close to the likeforlike and followforfollow clusters, which are also not that far from the art- and photography-related cluster (left side of the figure). Last but not least, we can find a distinct cluster about fitness and diet; this cluster is close to the recovery and mental illness-related clusters.
The main question is how similar the results of the word embedding clustering are to the topics we obtained from the topic modeling approach. Most of the topics of the topic model can be identified in the results of the word embedding clustering, but there are some slight differences, of which two stand out. In the topic model, religion-related words form a distinct topic, but in the word embedding clustering this topic does not appear separately. At the same time, eating disorder forms a separate cluster in the word embedding analysis, which does not appear in the topic model. Nevertheless, these are rather small differences; on the whole, the results of the two methodological approaches are exceedingly similar.