Predicting future mental illness from social media: A big-data approach

Abstract

In the present research, we investigated whether people’s everyday language contains sufficient signal to predict the future occurrence of mental illness. Language samples were collected from the social media website Reddit, drawing on posts to discussion groups focusing on different kinds of mental illness (clinical subreddits), as well as on posts to discussion groups focusing on nonmental health topics (nonclinical subreddits). As expected, words drawn from the clinical subreddits could be used to distinguish several kinds of mental illness (ADHD, anxiety, bipolar disorder, and depression). Interestingly, words drawn from the nonclinical subreddits (e.g., travel, cooking, cars) could also be used to distinguish different categories of mental illness, implying that the impact of mental illness spills over into topics unrelated to mental illness. Most importantly, words derived from the nonclinical subreddits predicted future postings to clinical subreddits, implying that everyday language contains signal about the likelihood of future mental illness, possibly before people are aware of their mental health condition. Finally, whereas models trained on clinical subreddits learned to focus on words indicating disorder-specific symptoms, models trained to predict future mental illness learned to focus on words indicating life stress, suggesting that the kinds of features that are predictive of mental illness may change over time. Implications for the underlying causes of mental illness are discussed.

Millions of Americans are affected by mental illness, with profound consequences for their well-being and for the economy. Early detection and intervention offer the best hope for treatment, but accurate and noninvasive early detection of mental illness based on biological markers has proven elusive (Kapur, Phillips, & Insel, 2012). Recent advances in big-data analytics, machine learning, and natural language processing may offer a solution to this problem. Such tools may make possible the extraction of indicators of mental illness from subtle aspects of people’s natural speech, as reflected in their particular word choices, syntactic constructions, and turns of phrase. These linguistic markers may allow for the construction of a digital phenotype, a computationally derived characterization of an individual that can be probed for signs of mental illness (Elvevag et al., 2016; Insel, 2017; Jain, Powers, Hawkins, & Brownstein, 2015).

Several studies have demonstrated that psychological characteristics of an individual can be recovered from analyses of people’s everyday language. Youyou, Kosinski, and Stillwell (2015) showed that analyses of people’s Facebook likes can result in personality judgments that are more accurate than those based on the judgments of friends and family members. Thorstad and Wolff (2018) demonstrated that the content of people’s tweets can be used to predict their future-sightedness, and consequently decisions about investment and risk taking. With respect to diagnosing mental illness, a number of studies have shown that people’s posts on social media sites such as Twitter, Facebook, and Reddit can be used to identify a person’s mental illness (for a review, see Guntuku, Yaden, Kern, Ungar, & Eichstaedt, 2017). Still to be determined, however, is whether mental illness can be detected before a person knows they are mentally ill. Here we use analyses of the social media platform Reddit to show that the signs needed to make these predictions are in fact present in people’s everyday language.

Background

Most studies using natural language to predict mental illness have focused on language samples that overlap in time with the mental illness. For example, Schwartz et al. (2014) used the text of people’s Facebook posts to predict depression scores obtained from currently depressed individuals, as reflected in their neuroticism scores from a Big 5 questionnaire. Resnik et al. (2015) used supervised latent Dirichlet allocation to predict depression on the basis of training using self-identifying statements in Twitter posts such as “I was diagnosed with depression.” Bagroy, Kumaraguru, and De Choudhury (2017) used logistic regression to predict mental health and well-being on the basis of training in which well-being was estimated from membership in mental health subreddits on Reddit. Coppersmith, Dredze, Harman, Hollingshead, and Mitchell (2015) used Twitter posts to predict those with depression and PTSD, in which membership in a particular mental health condition was based on Twitter statements indicating that the user had a particular mental illness (see also Preotiuc-Pietro et al., 2015). The results from these studies suggest that everyday language contains implicit information about mental illness.

However, if language is to be used to predict future mental illness, it needs to be determined whether signs of mental illness are present in language before the person is sick. Conceptually, such an analysis should be possible, since people’s language use tends to be at least moderately stable over time, as revealed by analyses of both written and spoken language (Mehl, Pennebaker, Crow, Dabbs, & Price, 2001; Pennebaker & King, 1999). Indeed, several projects suggest that language contains indicators of future mental illness. De Choudhury, Gamon, Counts, and Horvitz (2013) used the content of people’s tweets to predict future depression on the basis of models in which people indicated their level of depression on the Center for Epidemiologic Studies Depression Scale. De Choudhury, Counts, Horvitz, and Hoff (2014) found that several predictor variables, including user demographics, social connectedness, and linguistic features drawn from Facebook posts, could explain as much as 35% of the variance in whether new mothers would eventually experience postpartum depression. De Choudhury, Kiciman, Dredze, Coppersmith, and Kumar (2016) found that users’ linguistic structures, interpersonal awareness, and social interaction predicted whether they would eventually subscribe to a mental health subreddit. Although these studies have demonstrated how language might be used to predict mental illness prospectively, the broader viability of this methodology is still unclear. Ideally, such information would be used to predict the future onset of any number of mental illnesses, whereas the research to date has only shown prospective classification of a single mental illness at a time.

Multiclassification of mental illness based on language has, in fact, been demonstrated. Gkotsis et al. (2017) used a deep-learning convolutional neural network to determine which of several subreddits (Borderline personality disorder, Bipolar, Schizophrenia, Anxiety, Depression, Selfharm, SuicideWatch) a particular Reddit post came from, with 71.37% accuracy. Critically, however, the multiclassification was not conducted prospectively.

The primary goal of the present research was to determine whether language could be used to predict the future occurrence of several different kinds of mental illness. In three studies, the research addressed four main questions. First, is multiclassification possible on the basis of language focusing on mental illness? At the very least, we were interested in whether it was possible to qualitatively replicate the results from Gkotsis et al. (2017). Second, is multiclassification of mental illness possible on the basis of language that is not explicitly about mental illness? Given the broad effects that mental illness can have on cognition, it was expected that signs of mental illness might be present in what people say about topics unrelated to mental illness, as when they talk about cars, sports, and restaurants. Third, can everyday language be used to predict the future occurrence of different kinds of mental illness? To the extent that prospective classification of multiple categories of mental illness can be achieved, it may be possible to use people’s everyday language for early detection of mental illness. Lastly, assuming these different kinds of classifications are possible, what are the features that predict mental illness? Predictive models of mental illness can be used to learn about the nature of a mental illness through an analysis of the features discovered during training.

The data for this research were drawn from the social media platform Reddit. Members on Reddit (N = 234 million) self-organize into user-created discussion groups called “subreddits.” These subreddits sometimes reflect a general perspective (such as r/Politics and r/Philosophy), but more often reflect specific interests (such as r/ModelTrains, r/Badminton, r/MachineLearning), experiences (r/TalesFromTheCustomer, r/KitchenConfidential, r/IDontWorkHereLady), and question-asking forums (r/AskWomen, r/AskHistorians, r/AskScienceFiction). Of particular interest to the present research, some of these subreddits are about clinical psychological disorders, such as r/Anxiety and r/Depression. Those who subscribe to a particular mental health subreddit do not necessarily have the associated mental illness, but it seems likely that a relatively high proportion of the subscribers to the mental health subreddits would have a personal connection to the associated mental illness (see Gkotsis et al., 2017). A key advantage of Reddit, then, is that subscription to a clinical subreddit can be used as a (noisy) indicator of having a particular mental illness. Thus, given a large number of language samples from one of these subreddits, it might be possible to gain some insight into the nature of the features associated with various mental illnesses.

Study 1: Clinical subreddits

The main goal in Study 1 was to determine whether a model could be trained to identify the mental illness subreddit from which a post was drawn. To the extent that a model can learn this classification, it would suggest that people’s language when experiencing a mental illness can be used to determine that person’s mental illness. Posts from four common clinical psychological subreddits were downloaded: r/ADHD, r/Anxiety, r/Bipolar, and r/Depression. Training involved a one-versus-all classification strategy in which binary logistic regression was used to search for features that could be used to identify one of the mental illness subreddits at a time, with the process repeated for each mental illness category. As in Gkotsis et al. (2017), it was expected that words drawn from the different mental illness subreddits would allow for accurate classification. In this and the remaining studies, the posts were randomly divided into a training set (80%) and a testing set (20%). Of primary interest was how a model fitted to the training data would generalize to previously unseen data in the testing set.

Method

All procedures were approved under exempt review by the Emory University institutional review board.

Data acquisition

Data acquisition involved downloading posts from different clinical subreddits, undersampling these posts to create a balanced dataset, and dividing the data into separate training and testing datasets. First, we used the Reddit Application Programming Interface (API: reddit.com/dev/api) to download 5 years of submissions (2012–2017) to four clinical subreddits: r/ADHD, r/Anxiety, r/Bipolar, and r/Depression. In all, 515,374 posts were obtained (range 56,009–300,141 posts per subreddit). Second, we randomly undersampled these posts to create a balanced dataset of 224,036 posts, with 56,009 posts per subreddit. Third, we randomly divided this balanced dataset into a training set (80%) of 179,228 posts for model training and a testing set (20%) of 44,808 posts for model evaluation.
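For illustration, the balancing and splitting steps can be sketched in Python roughly as follows; the sketch assumes the posts have already been downloaded via the API, and the file and column names are hypothetical rather than taken from the original analysis code.

```python
# Minimal sketch of the undersampling and train/test split (assumes posts are
# already in a pandas DataFrame; file and column names are hypothetical).
import pandas as pd
from sklearn.model_selection import train_test_split

posts = pd.read_csv("clinical_posts.csv")  # hypothetical file with "subreddit" and "text" columns

# Undersample to the size of the smallest subreddit so all four classes are balanced
n_min = posts["subreddit"].value_counts().min()
balanced = (posts.groupby("subreddit", group_keys=False)
                 .apply(lambda g: g.sample(n=n_min, random_state=0)))

# Random 80/20 division into training and testing sets
train, test = train_test_split(balanced, test_size=0.2, random_state=0)
```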

Data preprocessing

Data preprocessing involved two steps: removing explicit mentions of the names of clinical disorders and other unusual characters, and transforming the text of the posts into machine-readable vectors. First, we removed the words anxiety, anxious, depressed, bipolar, and adhd from the posts, as well as words beginning with anx, depr, bipol, and add. An additional part of this preprocessing step was to remove highly unusual characters, such as foreign alphabets, that the algorithm would have difficulty handling. Because ASCII is a standard, restricted character set based mostly on letters, numbers, and a few special characters, we removed all unusual characters by removing non-ASCII characters. Second, we transformed each post into a vector of numbers, in which each column in the vector represented the frequency of a given word. Such a vector representation is thought to provide a semantic characterization of the post, since posts that use similar words should have similar meanings. Frequency information offers a direct and relatively transparent characterization of the information contained in a text document. However, raw frequencies are vulnerable to base rate effects, and as a consequence it is traditional to reweight them. If, for example, a word has a high frequency in a document, this could mean that it is an important word for that document. However, if the word also has a high frequency in other documents, then the word is not informative about that document, and its frequency is traditionally down-weighted. Tf-idf is a weighting methodology that keeps a word’s frequency high only when the word is relatively unique to a particular group of posts, and down-weights it when the word is common across documents and therefore not diagnostic of the category. Formally, where the term frequency tf(t, d) is the frequency of a term t in a document d, and the document frequency df(t, D) is the frequency of t across all documents in the corpus D, the tf-idf score is given by tfidf(t, d, D) = tf(t, d) / df(t, D). We applied this scaling using the Python library scikit-learn (Pedregosa et al., 2011), with an ngram range of (1, 1) and a minimum document frequency of 1. For the same reason, we also removed some especially frequent English words using a list of “stopwords” provided by the scikit-learn library. The total vocabulary size in Study 1 was 168,822 words.
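As an illustration, the vectorization step might look roughly as follows in scikit-learn; note that TfidfVectorizer’s default inverse-document-frequency term is a smoothed logarithmic variant of the simple 1/df weighting described above, and the variable names are illustrative.

```python
# Sketch of the tf-idf vectorization described above (parameter values follow the text).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 1),    # unigrams only
                             min_df=1,              # minimum document frequency of 1
                             stop_words="english")  # scikit-learn's built-in stopword list
X_train = vectorizer.fit_transform(train["text"])   # one row per post, one column per word
X_test = vectorizer.transform(test["text"])         # reuse the vocabulary learned on the training set
```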

Machine-learning model

We trained a machine-learning model to take the words in posts as input and to output the subreddit that the post was submitted to. We trained an L2-penalized logistic regression model in the Python library scikit-learn, using the default regularization strength parameter, C = 1. This model encodes a prior belief that a good language model should base its predictions on a relatively large number of words rather than on only a few. This is accomplished by adding an L2 penalty to the model’s training objective, which penalizes the squared magnitude of the regression weights. Such a penalty favors learning many moderately sized regression weights over a few large ones, effectively encouraging the model to use many words to make predictions.

The inputs to the model were the unigram (word) vectors created after data preprocessing; thus, the inputs were a matrix with 179,228 rows and 168,822 columns. For multiclass classification, scikit-learn trains four separate logistic regression models, each performing a binary classification (e.g., depression versus all other subreddits, anxiety versus all other subreddits, and so forth). The output of the model is the label assigned the highest probability by any of the four binary classifiers. Note that chance performance is 25%: although any individual model (e.g., anxiety vs. all others) has a 50% chance of being correct, the aggregated model must guess one of four categories, and thus has a 25% chance of being correct. In this and all future analyses, the model was trained to categorize the posts in the training set and evaluated with no further training on the basis of its performance on the held-out testing set.
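A minimal sketch of this classifier, assuming the tf-idf matrices from the preprocessing step, appears below; in more recent scikit-learn releases the same one-vs-rest behavior can instead be obtained explicitly with OneVsRestClassifier.

```python
# Sketch of the one-vs-rest, L2-penalized logistic regression classifier with
# the default regularization strength (C = 1).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty="l2", C=1.0, multi_class="ovr", max_iter=1000)
clf.fit(X_train, train["subreddit"])    # fits one binary classifier per subreddit
predicted = clf.predict(X_test)         # the label with the highest probability wins
```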

Model accuracy was calculated using the F1 score, a standard statistic for machine-learning classifiers that balances two objectives. First, a good classifier should find most examples of a class, which is measured with the recall statistic, recall = true positives/(true positives + false negatives). For example, a good depression classifier should identify most posts from the Depression subreddit. Second, a good classifier should usually be correct when it labels a data point as part of a class, which is measured with the precision statistic, precision = true positives/(true positives + false positives). For example, the classifier should usually be correct when it labels a post as coming from the Depression subreddit. Note that the F1 score is usually preferred to raw accuracy because a classifier can be accurate while failing one of these objectives, for example, by learning to always guess the class with the highest base rate. More formally, the F1 score is the harmonic mean of precision (PR) and recall (R), F1 = 2 × (PR × R)/(PR + R). We computed the macro-averaged F1 score by first computing the F1 score for each class (e.g., depression, anxiety, etc.) and then averaging these scores with equal weights. We also report percent accuracy for the model as a whole, calculated using the formula accuracy = (true positives + true negatives)/(true positives + true negatives + false positives + false negatives).
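For concreteness, these quantities can be computed with scikit-learn’s metrics module, for example:

```python
# Per-class precision, recall, and F1 averaged with equal class weights
# (macro averaging), plus overall percent accuracy.
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

precision, recall, f1, _ = precision_recall_fscore_support(test["subreddit"], predicted,
                                                           average="macro")
accuracy = accuracy_score(test["subreddit"], predicted)
```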

Machine-learning models are typically evaluated using descriptive statistics (e.g., the magnitude of F1), as opposed to inferential statistics such as p values, in part due to the large sample sizes and to the use of held-out testing data. Although to our knowledge there is no standard method for computing p values for a machine-learning model, in this and the following studies we calculated p values by creating a null distribution based on 10,000 random draws, given chance accuracy and the observed sample size.
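One way to implement this procedure, under the assumption that the random draws are binomial draws at the 25% chance rate for the observed number of test posts, is sketched below.

```python
# Sketch of the permutation-style p value: build a null distribution of
# accuracies expected from guessing at chance (25%) for the observed number of
# test posts, then ask how often chance does at least as well as the model.
import numpy as np

rng = np.random.default_rng(0)
n_test = len(test)                                           # observed sample size
null_acc = rng.binomial(n_test, 0.25, size=10_000) / n_test  # 10,000 chance-level accuracies
p_value = (null_acc >= accuracy).mean()
```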

Clustering analysis

We conducted a clustering analysis to understand which features the model used to make predictions, and whether these features fell into coherent semantic groups. It is possible that the model learned a large number of unrelated features, but also possible that these features fall into a relatively small number of semantic groups, which would provide more insight into the representations learned by the model. The first step of this analysis was to select the most predictive features from the model. For each disorder, this was done by selecting the 100 words associated with the most positive regression weights. The next step was to represent each word in terms of its semantics, in order to understand whether coherent semantic groups of words were used to make predictions. To do this, we represented each word using the document vector generated during data preprocessing, which, as previously discussed, is thought to be a semantic representation of the word (e.g., documents that contain the word terrified may be more likely to contain the word afraid, suggesting that these words have similar meanings). Third, we reduced the dimensionality of this semantic space to two dimensions to facilitate the discovery of clusters and to allow for their visualization. We reduced the dimensions using t-distributed stochastic neighbor embedding (t-SNE; Maaten & Hinton, 2008), using the parameters perplexity = 5 and number of principal component dimensions = 5. The t-SNE algorithm emphasizes preserving local distances from the high-dimensional space, which makes it a popular choice for clustering: points that cluster together in the high-dimensional space are likely to remain clustered after dimensionality reduction, even if the global configuration of the clusters relative to one another changes. Because t-SNE is a stochastic algorithm, we ran t-SNE 20 times for each solution, selecting the best solution as described below.
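A sketch of this step appears below; the reading of “number of principal component dimensions = 5” as a PCA preprocessing step before t-SNE is our assumption, and the class label used to pick out a disorder is hypothetical.

```python
# Sketch of the feature selection and dimensionality reduction for one disorder.
# clf and X_train are the classifier and tf-idf matrix from the sketches above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

class_idx = list(clf.classes_).index("Depression")  # hypothetical label name
top100 = np.argsort(clf.coef_[class_idx])[-100:]     # 100 most positive regression weights

word_vectors = X_train[:, top100].T.toarray()        # one row per word, one column per training post
reduced = PCA(n_components=5).fit_transform(word_vectors)
embedding = TSNE(n_components=2, perplexity=5).fit_transform(reduced)
```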

In the final step, we asked whether there were coherent semantic groupings to the most predictive words. To do this, we applied a clustering technique known as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996), using the parameters epsilon = 10 and minimum samples = 3. DBSCAN looks for clusters by identifying dense groups of points in the data, making few assumptions about the distribution of these groups and no assumptions about the number of clusters. We observed in pilot testing that the most semantically meaningful clustering solutions were those for which t-SNE and DBSCAN generated a larger number of clusters. Thus, for each disorder separately, we retained the clustering solution (out of the 20 t-SNE iterations) with the maximum number of clusters, breaking ties randomly. The resulting clusters indicate the main semantic groups of words used by the model to make predictions.
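The clustering step itself is then a single call, sketched below with the parameters given above.

```python
# Sketch of the density-based clustering on the 2-D t-SNE embedding.
from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=10, min_samples=3).fit_predict(embedding)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks "noise" points
# Repeating the t-SNE step 20 times and keeping the run with the largest
# n_clusters (ties broken randomly) implements the selection rule described above.
```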

Results

Classification performance

People’s language on clinical subreddits was highly predictive of which clinical subreddit the post was written on. As would be expected for a model with this number of parameters, the model learned to predict clinical disorders in the training data with high accuracy (81%, where chance = 25%), also reflected in a high F1 score, .81. A more robust test is whether the model generalizes to previously unseen test data without further training, which would suggest that the model learned to fit signal rather than noise in the training data. As is shown in Table 1, the model had strong performance on the held-out test data, F1 = .77, accuracy = 77%, PR = .77, and R = .77, where chance = .25. The classifier performed best for ADHD (F1 = .83, PR = .82, R = .85), followed by anxiety (F1 = .75, PR = .75, R = .74) and bipolar disorder (F1 = .75, PR = .78, R = .71), and worst for depression (F1 = .74, PR = .71, R = .77). Performance was above chance for all disorders, as revealed by permutation testing (p < .05 for all disorders). In future comparisons, we will report only this stricter test of performance based on the held-out test data (not performance on the training data).

Table 1 Performance of the clinical disorder classifier in Studies 1–3

Clustering analysis

As a consequence of training a predictive model, the model learns that some words are especially prominent predictors of a given mental illness. We selected the 100 most prominent of these words for each disorder. It is possible that these 100 words might each refer to a separate and independent feature of the illness. However, it is also possible that these words might refer to a small number of dimensions. To address this question, we conducted a clustering analysis of the 100 most predictive words for each disorder. To do this, we represented each word as a vector reflecting its meaning, reduced its dimensionality, and calculated whether these re-representations fell into clusters.

As expected, the top 100 words formed coherent semantic clusters. The clusters for depression are shown in Fig. 1, and the clusters for the other disorders are shown in Supplemental Figs. 1–3. The dimensionality reduction technique used to project words to two dimensions, t-SNE, prioritizes local over nonlocal similarity patterns. This prioritization has implications for interpreting solutions like the one in Fig. 1: Greater certainty should be given to the words in a cluster than to the spatial relations between those clusters. Nevertheless, larger patterns in the arrangement of the clusters sometimes emerged. For example, in Fig. 1, the lower right-hand corner of the space (the green and orange clusters) includes groups of words focusing on negative emotions such as sadness, loneliness, and pain. The left side of the space (the blue and yellow clusters) tends to concern words from health care settings, such as drug names (effexor, prozac) and clinical terms (chronic, severely). Finally, whereas the clusters in the bottom half of the figure (green, orange, blue) focus on internal states, those at the top of the figure tend to focus on actions (the pink and light blue clusters), including cutting, escape, suicide watch, and fantasize.

Fig. 1

The 100 most predictive words for depression learned in Study 1. Words are projected in a 2-D space based on their document vectors, after dimensionality reduction with t-SNE. Colors indicate clusters assigned by DBSCAN. Gray Y-shaped markers indicate “noise” points not assigned to any cluster by DBSCAN. In each cluster, the top three most predictive words are labeled. The marker size is also scaled linearly with predictive rank, where larger markers indicate more predictive words

Table 2 lists the words and clusters depicted in Fig. 1. The clusters are ordered by their average predictiveness of depression, from most to least predictive. In addition, within each cluster the words are ordered by predictiveness. As is reflected in Table 2, the most predictive cluster contained words referring to Sadness (e.g., sadness, loneliness, suffering). The second most predictive cluster included words related to Life Problems (e.g., life, friends, talk, girl). Following these clusters are clusters focusing on clinical settings, such as Drugs, Health, Time, Emptiness, and Intensity. Many of the words in these clusters focus on medications (e.g., antidepressants, effexor). However, other clinical terms emerged, such as temporal variation in depression intensity (e.g., seasonal, relapsing) and feelings of worthlessness and despair (e.g., meaningless, darkness, emptiness, despair). The last two clusters, Ugliness & Harm and Fantasization, included references to self-harm (e.g., ugly, cutting, harm) and to fantasy (e.g., anime, facade, fantasize).

Table 2 Listing of the most predictive words for depression drawn from the r/Depression subreddit

As is covered in the supplemental materials, the 100 most predictive words for ADHD, anxiety, and bipolar disorder clustered into approximately the same number of clusters as depression. The clusters for ADHD (eight clusters) included Tests (inattentive, tested, testing), Stimulants (concerta, vyvanse, stimulants), Health (hyperfocus, dopamine, dexamphetamine), Tasks (distracted, focused, forget, task), Impulsivity (impulsivity, executive, forgetfulness), and Drugs (methylphenidate, dexedrine, focalin). The clusters for anxiety (ten clusters) included Panic (attacks, panic), Fear (fear, scared, afraid), Worry (worrying, worry, nervous), Drugs (buspirone, cipralex, lorazepam), Uncomfortableness (crippling, stressful, uncomfortable), Present focus (panicking, panicked, shaky), Obsessive thoughts (worries, obsessive, intrusive), and Suffering (separation, sufferers, paralyzed). Finally, the clusters for bipolar disorder (eight clusters) included Mania (manic, mania, mood), Doctors (pdoc, med, psych), Cycling (cycling, phases, swings), Support (hospital, state), Hallucination (grandiose, hallucination, delusion), and Violence (unstable, paranoia, rage).

A final analysis concerned whether the posts analyzed by our model primarily reflect people’s own experience with mental health, or instead reflect people seeking advice about other people’s mental health experiences. To address this question, both authors coded a random sample of 400 posts, 100 from each clinical subreddit, with disagreements (7.3%) resolved by discussion. The main finding was that the vast majority of posts (371/400, 92.8%) concerned an individual’s own experience, suggesting that the posts largely reflect people’s own experiences with mental health. We did find that some posts (14/400, 3.5%) reflected other people’s experiences, primarily seeking advice for another person’s mental health problem. An additional set of posts did not clearly fall into either category, for example, posts consisting of links to resources about mental health (15/400, 3.8%).

Discussion

There were two main results of Study 1. First, as expected, it was possible to use people’s language on mental health subreddits to predict which mental health subreddit the post was written on. The accuracy of the model was relatively high (F1 = .77), suggesting that people’s posts on mental health subreddits contain sufficient information to distinguish between several types of mental illness. Second, a clustering analysis revealed clear themes in the talk of those contributing to mental illness subreddits. For depression, the cluster analysis indicated that people subscribing to the r/Depression subreddit tend to talk about sadness, life problems, drugs used to treat depression, ugliness and harm, and fantasization, among other topics. The clusters revealed in this analysis overlap with several of the major symptoms listed by the DSM-5 for major depressive disorder (American Psychiatric Association, 2013). In particular, the clusters reflected (1) depressed mood, as indicated in the clusters by feelings of sadness, emptiness, and hopelessness; (2) feelings of worthlessness, as indicated in the clusters by expressions of emptiness, meaninglessness, and despair; and (3) recurrent thoughts of death, as indicated in the clusters by references to death and suicide. The overlap between the DSM-5 criteria and the clusters in Fig. 1 suggests that the feature discovery procedures used in our analyses uncovered features of major depressive disorder with significance in real-world clinical settings. Although there is overlap between the DSM-5 criteria and the clusters observed in Fig. 1, several DSM-5 criteria were not well represented in the clusters generated in our analyses. In particular, we did not see clusters focusing on (1) diminished interest or pleasure in all activities, (2) significant weight loss, (3) insomnia or hypersomnia, (4) psychomotor agitation or retardation, (5) fatigue or loss of energy, and (6) diminished ability to think or concentrate, or indecisiveness. The lack of clusters focusing on these areas could represent a limitation of the feature discovery approach used in our analyses. However, it is also possible that the points of overlap are indicative of the DSM-5 criteria that are most salient in the prediction of depression, and hence might go beyond the DSM-5 itself in providing a rough ranking of the criteria most central to a mental illness such as depression.

These results are encouraging for the ability to automatically identify individuals who likely already have a clinical disorder. However, as was concluded in a recent literature review, an unfulfilled promise in clinical science is the development of classifiers that could be used to detect otherwise undiagnosed cases, presumably including cases in which individuals are unaware that they have a disorder (Guntuku et al., 2017). The aim of Study 2 was to investigate whether the mental illness associated with a particular individual, as indicated by membership in a particular clinical subreddit, could be identified using words drawn from nonclinical subreddit posts. In effect, Study 2 addressed the question of whether mental illness can be inferred from people’s everyday language. It was expected that in nonclinical language contexts, there would be much less talk about medications and specific symptoms, but that the signs of mental illness might nevertheless still be present.

Study 2: Nonclinical subreddits

The main goal of Study 2 was to determine whether people’s language in nonclinical contexts still reveals information about their mental health. To do this, we capitalized on the fact that most individuals on Reddit post to many different subreddits, only a small subset of which are dedicated to clinical disorders. For each individual who posted to a clinical subreddit in Study 1, we downloaded all of the individual’s posts to other subreddits, excluding posts on the four clinical subreddits from Study 1. We then trained the same logistic regression model from Study 1 to use these everyday posts to predict which mental health subreddit the individual had also posted on. Such a classifier should not be expected to perform as well as one based on people’s talk in mental health contexts, because people are much less likely to refer explicitly to highly diagnostic features such as symptoms and medications in this nonclinical context. Nevertheless, we expected that people’s nonclinical language might be revealing of their mental health, and hence that a classifier should be able to classify an individual’s mental health above chance accuracy. As in Study 1, we also performed a content analysis of the most predictive language in nonclinical contexts. We expected that this language might be reflective of only a subset of the clusters observed in Study 1.

Method

Data acquisition

Data acquisition involved identifying users who had posted to the clinical subreddits in Study 1 and acquiring these users’ posts to all other subreddits, excluding those clinical subreddits. First, for each user in Study 1, we used the Reddit application programming interface to download all of the user’s posts to Reddit, excluding the subreddits r/ADHD, r/Anxiety, r/Bipolar, and r/Depression. For each user, we concatenated all of the user’s posts into a single data point to avoid the same user appearing in both the training and testing datasets (which could artificially inflate accuracy rates, based on the same user having a consistent but idiosyncratic linguistic style). Posts were acquired for 121,722 users, randomly undersampled to create a balanced dataset of 24,436 users (6,109 users per disorder), and randomly divided into a training set (80%) of 19,548 users and a testing set (20%) of 4,888 users.

Data preprocessing and machine-learning model

The data preprocessing and the machine-learning model architecture and evaluation were identical to those components of Study 1. The total vocabulary size for Study 2 was 973,962 words. Class labels were assigned on the basis of the mental health subreddit that the individual had posted on in Study 1.

Clustering analysis

We conducted a clustering analysis in which we selected the most predictive words for each disorder and assigned them to the clusters derived in Study 1. First, for each disorder we selected the words associated with the 100 most positive regression weights, yielding the 100 most predictive words for each mental illness. Next, to facilitate comparisons across the different kinds of subreddits, the words were analyzed with respect to the clusters derived from the clinical subreddits in Study 1. The following steps were conducted for each disorder separately. First, we represented each of the 100 most predictive words in the same space as the Study 1 clusters. To do this, we used the word’s document vector based on Study 1 (i.e., a vector of length 179,228). In the rare cases in which this document vector was not available, because the word never occurred in a clinical subreddit, the word was eliminated from the analysis (13/400 words eliminated across all disorders, 3.25%). Second, we calculated the mean vector for each cluster in Study 1 (i.e., the center of the cluster): for each cluster, we selected the document vectors for all words in the cluster (before dimensionality reduction) and calculated their mean (a vector of length 179,228). Finally, we assigned each of the top 100 words to the nearest Study 1 cluster, defined as the cluster mean vector with the highest cosine similarity to the word. To prevent spurious classifications, cluster assignments were restricted to cosine similarities of at least .05. Note that, in general, the cosine similarities were low because the vectors were sparse (i.e., they had many elements, most of which were 0).
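The assignment rule can be sketched as follows; cluster_means is a hypothetical matrix holding the mean Study 1 document vector for each cluster, and word_vector is the document vector of a single predictive word.

```python
# Sketch of the nearest-cluster assignment with a minimum cosine similarity of .05.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def assign_cluster(word_vector, cluster_means, threshold=0.05):
    """Return the index of the nearest Study 1 cluster, or None if no cluster
    reaches the similarity threshold."""
    sims = cosine_similarity(word_vector.reshape(1, -1), cluster_means)[0]
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None
```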

Results

Classification performance

People’s language on nonclinical subreddits was moderately predictive of which clinical subreddit the user had also submitted to. As is shown in Table 1, the model had moderate performance overall on the held-out test data, F1 = .38, accuracy = 39%, PR = .40, R = .39, where chance = .25. The classifier performed best for depression (F1 = .44, PR = .34, R = .62) and ADHD (F1 = .42, PR = .44, R = .44) and worst for bipolar disorder (F1 = .34, PR = .46, R = .27) and anxiety (F1 = .30, PR = .37, R = .26). Performance was above chance for all disorders as revealed by permutation testing (p < .05 for all disorders).

Clustering analysis

As expected, the most predictive words from the nonclinical subreddits were assigned to only a subset of the clusters discovered in Study 1. The results for depression are shown in Table 3, and the results for the other disorders are shown in Supplemental Tables 1–3. For depression, 73 of the most predictive words were similar enough to one of the clusters drawn from the clinical subreddits to be assigned to a cluster. These 73 words fell into only three clusters: Sadness, Life Problems, and Ugliness & Harm.

Table 3 Listing of the most predictive words for depression drawn from nonclinical subreddits with respect to the semantic clusters derived from the clinical subreddits in Study 1

As with depression, and as is shown in Supplemental Tables 4–6, the most predictive words from the nonclinical subreddits for ADHD, anxiety, and bipolar disorder were assigned to only a subset of the clusters found in Study 1. In effect, this subset of clusters may highlight the most central features of these disorders. For ADHD, the main clusters present in the nonclinical subreddits were Stimulants (vyvanse, ritalin, stimulant) and Tasks (study, exams, business). For anxiety, the main clusters were Panic (panic, attack, intense), Fear (social, feel), and Uncomfortableness (uncomfortable, physically). Finally, for bipolar disorder, the most prominent clusters were Mania (mania, episodes, mood), Doctors (pdoc, psych, doc), and Cycling (cycling, psychosis, swings).

Discussion

The results of Study 2 showed that people’s language in nonclinical contexts was still moderately predictive of information about their mental health, although much less so than their language in clinical contexts. A classifier trained on people’s talk in nonclinical contexts performed markedly worse than one trained on people’s talk in clinical contexts (F1 = .38 vs. .77), although the performance was still better than chance (F1 = .25).

A content analysis of the depression results indicated that the same two clusters were most predictive of depression when the words were drawn from clinical subreddits as when the words were drawn from nonclinical subreddits. These clusters referred to Sadness and Life Problems. As expected, only a subset of the clusters based on the clinical subreddits emerged in language based on nonclinical subreddits. In a nonclinical context, people generally did not talk about Drugs for treating depression, Health-specific terms, Intensity of symptoms, Fantasization, or feelings of Emptiness. There was, however, some mention of Ugliness & Harm. Thus, when combined with the results from Study 1, the present results suggest that words concerning Sadness and Life Problems might be especially central to depression.

Study 3: Future prediction

The results from Study 2 indicated that signs of mental illness spill over into people’s everyday language. Since everyday language contains indicators of mental illness, it might be possible to use this language to predict the future occurrence of mental illness before a person has enough awareness to join a discussion group focusing on this condition. The aim of Study 3 was to test this possibility. To do this, for each individual who posted on a mental health subreddit in Study 1, we downloaded all of their posts to nonclinical subreddits (excluding r/ADHD, r/Anxiety, r/Bipolar, r/Depression). We then identified, for each individual, the first date when the individual posted to any of the clinical subreddits identified in Study 1. We eliminated any posts written on or after this date, leaving a dataset of posts in nonclinical contexts from before the individual ever posted to a clinical subreddit. We trained the same logistic regression model as in Studies 1–2, to investigate whether people’s past everyday language in nonclinical contexts could predict which clinical subreddit the individual would post to in the future. As in Studies 1–2, we performed a content analysis of the most predictive language. We expected that the most predictive words would concern the same subset of clusters observed in Study 2.

Method

Data acquisition

Data acquisition involved selecting the subset of people’s nonclinical subreddit posts in Study 2 that were posted before the user had ever posted to a clinical subreddit. First, for each user in Study 1, we calculated the earliest date that the user had posted to one of the subreddits r/ADHD, r/Anxiety, r/Bipolar, and r/Depression. Next, we used the same methods as in Study 2 to download each of these users’ posts to subreddits other than r/ADHD, r/Anxiety, r/Bipolar, and r/Depression, retaining only those posts submitted before the user’s earliest post to a clinical subreddit. If the user had never posted to another subreddit before the first post on a clinical subreddit, the user was eliminated from the analysis. Posts were acquired for 66,605 users, randomly undersampled to create a balanced dataset of 18,052 users (4,513 users per disorder), and randomly divided into a training set (80%) of 14,441 users and a testing set (20%) of 3,611 users. On average, the posts in Study 3 were submitted 182 days before the user posted on a clinical subreddit (max = 2,991 days, SD = 247 days).
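The date-based filtering can be sketched as follows; the DataFrame and its column names ("user", "subreddit", "created") are hypothetical, as are the exact label strings for the clinical subreddits.

```python
# Sketch of the Study 3 filtering step: keep only nonclinical posts submitted
# before the user's earliest post to any of the four clinical subreddits.
import pandas as pd

clinical = {"ADHD", "Anxiety", "bipolar", "depression"}           # illustrative subreddit labels
posts = pd.read_csv("user_posts.csv", parse_dates=["created"])    # hypothetical file

first_clinical = (posts[posts["subreddit"].isin(clinical)]
                  .groupby("user")["created"].min())               # earliest clinical post per user

nonclinical = posts[~posts["subreddit"].isin(clinical)].copy()
cutoff = nonclinical["user"].map(first_clinical)                   # each user's own cutoff date
past_posts = nonclinical[nonclinical["created"] < cutoff]          # posts on or after the cutoff are dropped
```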

Data preprocessing and machine-learning model

The data preprocessing and the machine-learning model architecture and evaluation were identical to those aspects of Study 1. The total vocabulary size for Study 3 was 491,144 words. Class labels were assigned on the basis of the mental health subreddit that the individual had posted to in Study 1.

Clustering analysis

To enable comparisons across the different subreddits, we represented the 100 most predictive words for each disorder in the similarity space derived in Study 1. This analysis used the same methods as in Study 2. As in Study 2, document vectors were not available for a small proportion of words (7/400 words across all disorders, 1.75%), and these words were excluded from the clustering analysis.

Split-half analysis

We also performed a split-half analysis. In this analysis, we split users’ posts into two datasets. The recent-past dataset consisted of the half of a user’s past posts that were submitted more recently in time, relative to the date the user had first posted on a clinical subreddit (M = 123 days in the past, SD = 195 days). The distant-past dataset consisted of the half of a user’s past posts that were submitted more distantly in time from the date the user had first posted on a clinical subreddit (M = 239 days in the past, SD = 313 days). Note that in order to ensure that each user had the same number of posts in both datasets, the exact distance in time dividing the recent- and distant-past datasets varied for each user. We trained the same logistic regression model as in Studies 1–2 separately on each dataset, with one change in the model evaluation procedure: Because we anticipated a smaller difference in accuracy between the models, we evaluated the models using tenfold cross-validation rather than a single train/test split, to provide a more stable estimate of held-out prediction accuracy.
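Under the assumption that the same scikit-learn pipeline from Studies 1–2 is reused, the split-half evaluation reduces to scoring each dataset with tenfold cross-validation, for example:

```python
# Sketch of the split-half evaluation: score the same L2 logistic regression on
# the recent-past and distant-past datasets with tenfold cross-validation.
# X_recent, X_distant, and y are hypothetical tf-idf matrices and labels built
# exactly as in Studies 1-2.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression(penalty="l2", C=1.0, multi_class="ovr", max_iter=1000)
recent_f1 = cross_val_score(model, X_recent, y, cv=10, scoring="f1_macro").mean()
distant_f1 = cross_val_score(model, X_distant, y, cv=10, scoring="f1_macro").mean()
```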

Results

Classification performance

People’s past language on the nonclinical subreddits was predictive of which clinical subreddit the user would submit to in the future. As is shown in Table 1, the classifier had moderate performance overall on the held-out test data, F1 = .36, accuracy = 36%, PR = .36, R = .36, where chance = .25. Overall, the performance of the model was almost as high as when the classifier had been based on people’s past and present everyday posts in Study 2 (F1 = .38). The classifier performed best for ADHD (F1 = .39, PR = .37, R = .42) and bipolar disorder (F1 = .37, PR = .39, R = .35), and worst for depression (F1 = .36, PR = .35, R = .38) and anxiety (F1 = .32, PR = .35, R = .30). Performance was above chance for all disorders, as revealed by permutation testing (p < .05 for all disorders).

We also performed a split-half analysis to evaluate whether the model was indeed performing future prediction. Although all of a user’s past posts might provide information about their mental health, posts from the more recent past should provide the most helpful information, because an individual would be closer in time to joining a mental health subreddit. To test this prediction, we divided users’ posts into two datasets: a dataset based on the more recent past posts, and a dataset based on the more distant past posts. We then separately trained the classifier on the basis of users’ recent-past posts or on the basis of users’ distant-past posts. As expected, both models performed slightly worse than the model based on all of the users’ past posts, due to the smaller amount of available training data. Critically, the model based on users’ recent-past posts performed slightly better (F1 = .344) than the model based on users’ distant-past posts (F1 = .338).

Clustering analysis

The most predictive words based on future prediction were again assigned to only a subset of the clusters discovered in Study 1. The results for depression are shown in Table 4, and the results for the other disorders are shown in Supplemental Tables 7–9. For depression, 68 of the most predictive words were similar enough to one of the clusters drawn from the clinical subreddits to be assigned to a cluster. These words fell into only three clusters: Sadness, Life Problems, and Ugliness & Harm. The large majority of these words (65/68, 96%) fell into a single cluster: Life Problems.

Table 4 Listing of the most predictive words for depression drawn from past posts to nonclinical subreddits, with respect to the semantic clusters derived from the clinical subreddits in Study 1

As with depression, the most predictive words for the other disorders fell into a much smaller set of clusters than was discovered in Study 1, possibly indicating the most salient features of each disorder. For ADHD, the main clusters were Stimulants (vyvanse, ritalin) and Tasks (tasks, finish, routine). For anxiety, the primary clusters were Panic (panic, attacks, night), Fear (fear, situations, sounds), Body (breathing, stomach, blood), and Worry (worry, terrified, excited). Finally, for bipolar disorder, the main clusters were Mania (mania, mood, experiences) and Support (hospital, support, father, god).

Discussion

The main result of Study 3 was that people’s past language in nonclinical contexts was still moderately predictive of which clinical subreddit the individual would later post to. In fact, the performance of the classifier based only on past posts was almost as strong as the performance of the classifier based on both past and present posts together (F1 = .36 vs. .38), although both classifiers performed much worse than a model based on people’s language in clinical contexts (F1 = .77). Interestingly, the words most predictive of future depression based on everyday language fell into the same three clusters as those based on everyday language from individuals who were currently experiencing depression. However, in the case of words predicting future depression, nearly all of the words fell into the Life Problems cluster. These results suggest that the most predictive indicators of future depression may be talk implying life problems. A content analysis of the words predictive of future ADHD, anxiety, and bipolar disorder yielded patterns of results highly similar to those for depression.

General discussion

In three studies, we found that people’s language on the social media website Reddit is predictive of information about mental illness. Study 1 showed that people’s language in clinical contexts is highly predictive of the particular clinical disorder they are writing about. A classifier trained on posts from the r/Anxiety, r/ADHD, r/Bipolar, and r/Depression subreddits could identify which of these subreddits a new post was submitted to with high accuracy. Study 2 extended these results beyond explicit clinical contexts. A classifier based on people’s posts on nonclinical subreddits was able to identify which clinical subreddit the individual had also posted to with moderate accuracy. Finally, Study 3 revealed that people’s everyday language can be used prospectively to predict with moderate accuracy which clinical subreddit an individual will post to in the future. A split-half analysis suggested that this model was indeed performing future prediction, given that classification was more accurate when based on posts from the recent rather than the distant past.

Content analysis yielded several findings. First, Study 1 showed that the most predictive words for each mental illness fell into relatively coherent clusters. The clusters overlapped with some, but not all, of the DSM-5 criteria for major depressive disorder. Second, the clusters in Study 1 varied in their average predictiveness of depression. This variation was diagnostic of the results of Studies 2 and 3: the most predictive words from Studies 2 and 3 fell into the most predictive clusters from Study 1. In sum, language in clinical contexts is broader and includes clusters of content not generally found in everyday language. However, the most predictive of these clusters are also found in everyday language. A highly similar pattern of results was observed for the other mental illnesses examined in this study: ADHD, anxiety, and bipolar disorder.

The present study adds to the growing list of studies showing that people’s online language reveals information about their mental health (for a review, see Guntuku et al., 2017), personality (Youyou et al., 2015), and decision-making (Thorstad & Wolff, 2018). The present study goes beyond prior studies in showing that everyday language can be used to predict the future occurrence of multiple kinds of mental illness, and can do so many months in advance. It also goes beyond prior studies in providing a content analysis of the language features that predict mental illness. This content analysis shows that automatically discovered linguistic predictors of mental illness overlap partly, but not perfectly, with clinical observations.

Inferring causation from analyses of content

The results from Studies 1–3 suggest how content analyses might be used to generate hypotheses about causation. In Study 1, the most predictive words, which were drawn from clinically oriented discussion groups, fell into nine clusters. In Study 2, the most predictive words, which were drawn from nonclinical discussion groups, fell into only three of those clusters: Sadness, Life Problems, and Ugliness & Harm. In Study 3, the most predictive words, which were selected from nonclinical posts made before an individual joined a clinical discussion group, fell almost entirely into one cluster: Life Problems. The Life Problems cluster included words implying stressful events (e.g., life, try, end, sorry, wrong, girlfriend, problems, high school, debt). As is specified in diathesis–stress models of psychopathology, conversion from health to illness may sometimes be precipitated by life stresses (Monroe & Simons, 1991). The results from Study 3 are consistent with such models in showing that the first signs of depression are associated with Life Problems. The results from Study 2 might reflect the consequences of these Life Problems, specifically feelings of sadness, as reflected in words such as sadness, worthless, loneliness, miserable, and suffering. The content words in Study 1 might reflect the end result of the progression, including talk about drugs (Effexor, Prozac), illness intensity (chronic, severely), and specific clinical features (dysthymia, resistant). Thus, by examining the most predictive features at different distances in time and context, it might be possible to isolate the causal sequence of events leading to a particular mental illness.

Empirically discovered versus hand-designed features

The feature discovery approach used in this research contrasts with the approaches of a number of recent studies that have sought to identify mental illness using hand-designed features. For example, Mota et al. (2012) found that graph analysis of people’s descriptions of dreams distinguished mania from schizophrenia with 93.7% accuracy. Elvevag, Foltz, Weinberger, and Goldberg (2007) found that semantic coherence, indicated by cosine similarity between sentences represented semantically using latent semantic analysis (LSA), distinguished individuals with schizophrenia from normal controls. Bedi et al. (2015) found that semantic coherence measured during free speech predicted conversion to psychosis with 100% accuracy (see also Corcoran et al., 2018; Mota, Copelli, & Ribeiro, 2017). Classification based on these hand-designed features has been impressively accurate, possibly because these studies only involved distinguishing one mental illness from another, or mental illness from health, rather than multiple kinds of mental illness. Indeed, Gkotsis et al. (2017) found that a convolutional neural network (CNN) distinguished clinical and nonclinical posts with 91.08% accuracy, but the same network only distinguished different clinical conditions (N = 11) with 71.37% accuracy. Distinguishing different kinds of mental illness appears to be a harder classification problem than distinguishing mental illness from mental health. Thus, despite the high levels of performance in studies involving hand-designed features, it is not necessarily clear that hand-designed features are superior to empirically discovered features. Moreover, whereas hand-designed features must be tailored to each category and are often difficult to build, empirically discovered features fall out easily and naturally from the machine learning process in which they are used.

Despite these advantages, our ability to classify posts prospectively using empirically derived features was modest, with roughly 36% average accuracy. This performance level raises the question of how classification accuracy might be improved. One possible improvement would be classifiers that combine empirically discovered features with hand-designed features. These two kinds of features tend to detect different kinds of information. Empirically discovered features focus on semantic information from language. By contrast, hand-designed features focus on the by-products of mental and social processes. For example, Mota et al. (2012; see also Mota et al., 2017) designed programs to measure links between ideas. Bedi et al. (2015) designed programs to measure semantic relations between sentences (see also Corcoran et al., 2018; Elvevag et al., 2007). De Choudhury et al. (2014) used features that measured people’s degree of social interaction. Because empirically discovered and hand-designed features capture different kinds of information, performance may be significantly improved by building classifiers that combine both kinds of features. Interestingly, whereas empirically discovered features require big data studies, hand-designed features have almost entirely come from small-data studies in which feature creation is informed by prior theory. To the extent that classification can be improved by combining these two kinds of features, it would demonstrate the benefit of combining insights from big and small data studies.

Using big data for insight

One challenge for big-data studies in psychology is not just to provide accurate predictions, but also to provide insight into the underlying psychological constructs being studied (Kern et al., 2016). Indeed, as machine-learning models grow more sophisticated, there is sometimes a tension between creating a strong predictive model and creating a model that can be easily queried to gain insight into the underlying psychological construct being studied. In these studies, we focused on a simple model class (logistic regression) that allowed querying the underlying predictive features learned by the model. In a series of cluster and content analyses, we found that people’s everyday language was revealing of mental health information, but in a much more implicit way than explicit talk about symptoms and disorders. The most predictive words in the clinical subreddits tended to be those with strong associations to health. This pattern was observed in language drawn from subreddits related to depression (antidepressants, prozac, effexor, major), ADHD (ritalin, concerta, stimulants), anxiety (zoloft, lexapro, citalopram, benzos), and bipolar disorder (hallucinations, hospitalization, mania, hypomanic). Many of the words used in the clinical subreddits were indicative of people who were fully aware of their condition. Of potentially greater interest were the words used in nonclinical subreddits. Subscribers to the Depression subreddit tended to use words emphasizing negative emotion (e.g., sadness, loneliness, worthlessness). In the case of ADHD, prediction was heavily influenced by talk about stimulants (ritalin, caffeine) and performance on tasks (tasks, finish, exams, productive, lazy). Language from discussion groups associated with anxiety tended to include expressions of uneasiness and apprehension (worry, nervous, uncomfortable, freaking), as well as frequent comments about bodily functions (shaking, breathing, stomach, chest). Finally, posts about bipolar disorder made frequent mention of cycles in time (e.g., times, episodes, swing, cycles) and supportive help (hospital, father, god). These content analyses help clarify how mental illness may impact people’s experience before the full emergence of a disease: in depression, strong changes in emotion; in ADHD, strong effects on performance; in anxiety, an impact on mood; and in bipolar disorder, changes in emotion over time.

The role of function words in predicting mental illness

Our content analysis contrasts with some previous approaches in psychology that have been based on counting words in predefined categories (for reviews, see Ireland & Mehl, 2014; Pennebaker & Graybeal, 2001; Pennebaker, Mehl, & Niederhoffer, 2003). In these studies, words have been divided into categories of function words (such as I, my) and categories of content words (such as talk about cognitive processing, emotion, and leisure). Many of these studies have shown that the words most predictive of clinical disorders are function words rather than content words. For example, many studies have found an elevated use of I-words in depression (Rude, Gortner, & Pennebaker, 2004). These same studies have found that content words, such as emotion words, are at best inconsistent predictors of depression.
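For readers unfamiliar with the dictionary-based approach, it amounts to scoring each text by the rate of words falling into predefined categories. A minimal sketch with toy categories (not the published dictionaries) follows:

```python
# Sketch: dictionary-style word counting with predefined categories.
# The category lists here are toy examples, not the published dictionaries.
CATEGORIES = {
    "i_words": {"i", "me", "my", "mine", "myself"},
    "negative_emotion": {"sad", "worthless", "lonely", "hopeless"},
}

def category_rates(text):
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return {name: sum(t in words for t in tokens) / n
            for name, words in CATEGORIES.items()}

print(category_rates("I feel worthless and my days are sad"))
# {'i_words': 0.25, 'negative_emotion': 0.25}
```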

In contrast to these studies, most of the features learned by our model are content words, such as talk about symptoms, emotion, and feelings of worthlessness. In fact, some of the methods we used, namely tf-idf weighting and stop-word removal, remove many function words from the model. What, then, explains the fact that our model largely learned to rely on content words? One possible explanation is that the model was allowed to learn the relevant categories of content words from the data, rather than specifying these categories in advance. In doing so, the model may learn more appropriate groupings of content words. Indeed, the model learned several categories that do not exist in current dictionary-based approaches. For example, the model learned several separate subcategories of negative emotion words, namely talk about life problems and feelings of worthlessness, and it learned to use these different subcategories in different ways. Of course, another possibility is that our model would have learned to use function words more often if procedures such as tf-idf scaling and stop-word removal had not been performed. Certainly, further work should explore the differences between dictionary-driven approaches and insight-driven modeling approaches to predicting psychological disorders (for an example of such a comparison in personality prediction, see Schwartz et al., 2014).
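The effect of stop-word removal is easy to see directly. The sketch below assumes scikit-learn's built-in English stop-word list: pronouns such as "my" are removed before tf-idf weighting and therefore can never become predictive features (the single-character "I" is already excluded by the default token pattern).

```python
# Sketch: how stop-word removal discards function words before tf-idf weighting.
from sklearn.feature_extraction.text import TfidfVectorizer

posts = ["I feel that my work is worthless", "I finished all my tasks today"]

keep_stops = TfidfVectorizer()                      # default: no stop-word removal
drop_stops = TfidfVectorizer(stop_words="english")  # stop-word removal as described above

keep_stops.fit(posts)
drop_stops.fit(posts)

print(sorted(keep_stops.vocabulary_))  # function words such as 'my' and 'is' remain
print(sorted(drop_stops.vocabulary_))  # content words remain, e.g. 'worthless', 'tasks'
```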

Implications for clinical practice

The results have several potential implications for clinical psychological practice. First, as described above, the analyses suggest that big data can be used to provide insight into clinical psychological disorders. Instead of beginning with an expectation of the kinds of language that are predictive of various disorders, it is possible to begin with a large sample of language and then learn the words most predictive of each disorder. As we described above, the features learned overlapped somewhat, but not perfectly, with the features described in the DSM. Thus, we expect that this type of insight-driven modeling can be used to derive insights relevant to clinical practice. Second, the results suggest a practical application: classifiers can potentially be built that use people's natural language to identify those at risk for developing certain mental illnesses. Indeed, one could imagine a future clinical tool that combines several sources of people's natural language, from social media or elsewhere, to identify those who may need additional screening for mental illness. Admittedly, our predictive accuracy was quite modest, suggesting that such prediction may not yet be possible with high accuracy. However, our results provide a proof of concept that predicting future mental health from people's everyday language is possible.
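Purely as an illustration of what such a screening aid might look like computationally, a fitted model's predicted probabilities could be thresholded to flag users for follow-up. Everything in the sketch below (the model variable, the threshold, and the user data) is hypothetical, and flagged users would be candidates for additional human screening, not a diagnosis.

```python
# Sketch: flagging users whose everyday language suggests elevated risk.
# `model` is assumed to be a fitted binary classifier (e.g., a scikit-learn
# pipeline) whose positive class is "will later post to a clinical subreddit".
# The threshold is illustrative, not clinically validated.
def flag_for_screening(model, user_texts, threshold=0.8):
    """user_texts: dict mapping user id -> that user's concatenated posts."""
    users = list(user_texts)
    risk = model.predict_proba([user_texts[u] for u in users])[:, 1]
    return {u: p for u, p in zip(users, risk) if p >= threshold}

# flagged = flag_for_screening(model, {"user_a": "...", "user_b": "..."})
# Flagged users would be referred for additional human screening, not diagnosed.
```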

Dual-use research in the age of big data

Big-data methods are increasingly enabling inferences about private characteristics of people from publicly available data. Prior work has shown that people's sexuality, politics, and smoking habits can be predicted from their activities on Facebook (Kosinski, Stillwell, & Graepel, 2013), and that people can be deanonymized on the basis of features of their social network (Narayanan & Shmatikov, 2008). The present work adds to this body of findings by showing that mental health can be predicted from people's publicly accessible activity on social media, even activity seemingly outside of mental health contexts.

Although these methods are intended to contribute to basic science research, it is conceivable that big-data research can begin to provide a blueprint that could be misused. It could be possible, in the future, to build a system to screen potential employees for their sexuality or politics based on their statements on social media. It could even be possible to build such a system to screen potential political candidates. This type of dilemma is typically discussed in the biological sciences under the name dual-use research (e.g., Frankel, 2012; Wolinetz, 2012). Dual-use research concerns situations in which research can be directly misused to harm the public, such as research that could suggest how to build a better biological weapon. Although such dilemmas have typically been confined to the life sciences, the growth of big data may bring an increase in dual-use dilemmas in the social and computing sciences, including psychology. The dilemma is that people may reveal more about themselves than they realize, and this information may be used against them. One strategy to address this problem could be to require that people be notified of the ways their casual comments can be mined. By allowing people to see what information can be mined about them, they would be better able to evaluate the costs and benefits of participation on social media. In certain circumstances, it may also be necessary to place restrictions on the kinds of mind-mining activities people are allowed to pursue. We believe the community may benefit from beginning a discussion about the contours of dual-use dilemmas in psychological research, especially given the growth of big data.

Limitations

The present study does have some limitations. Chief among them is that posting on a clinical subreddit is not a gold-standard diagnosis. Thus, it is not possible to be certain that individuals on a clinical subreddit have been diagnosed with that disorder. However, the frequent references to disorder-specific medications and symptoms in the clinical subreddits, as well as references to therapy and doctors, suggest that many individuals on mental health subreddits have indeed been diagnosed with a clinical disorder. Many of the predictive words learned by the model refer to specific medications (such as Effexor) that the general nonclinical population would not be expected to know. Moreover, the vast majority of hand-coded posts were found to refer to an individual's own experience of the disorder, providing additional evidence that posting on a subreddit may be a proxy for having that disorder. An additional limitation concerns the inference that our model can predict future mental disorders. It is possible that some individuals in our dataset had been diagnosed with a mental disorder before they decided to post to a disorder-specific subreddit. If such prior diagnoses were frequent in our dataset, they would undermine the claim that the model performs future prediction. However, the split-half analysis in Study 3 provides some evidence that the model predicts the future, by showing that more recent past posts are more predictive; this pattern would not be expected if individuals had already been diagnosed with a disorder. Additionally, few of the top predictors learned by the future-prediction model were explicit references to medications and therapy (as compared to many such references in Study 1). More such references would likely have appeared if many individuals had already been diagnosed with a disorder. However, we cannot rule out the possibility that some individuals had been diagnosed with a mental health disorder before they ever posted to a clinical subreddit.

Conclusions

There has been a recent explosion of "big-data" research in several areas of science (Bond et al., 2012; Krizhevsky, Sutskever, & Hinton, 2012; Silver et al., 2016). Such research has led to breakthroughs in physics (dark matter), biology (genomics), computer science (vision), and neuroscience (MVPA). There is every reason to believe that such analyses will have a similar impact on the study of the mind and social behavior. Through the collection of huge datasets, made possible by repurposing naturally occurring data (Goldstone & Lupyan, 2016), it should be possible to detect the subtle signals associated with mental representations and mental processes, and consequently to use these signals to identify ways that people assign meaning, make decisions, and experience emotions in sickness as well as in health.

Notes

  1. It is possible that some users could post to more than one mental health subreddit. Such duplicate posting could either indicate noisy training labels, or alternatively could suggest that the individual really has more than one mental illness. We found that 5.7% of the users in Study 1 posted to more than one of the mental health subreddits. To evaluate the effect of these users, we excluded duplicate users from the data in Study 1 and retrained the model. We found little effect on model performance, F1 = .76 (without duplicate users) versus F1 = .77 (with duplicate users). We therefore retained these duplicate users, since such users appear to demonstrate real comorbidity between multiple disorders.

  2. A common alternative to unigrams is pretrained word vectors, such as word2vec (Mikolov, Chen, Corrado, & Dean, 2013) or Glove (Pennington, Socher, & Manning, 2014). Here we used unigram vectors for two reasons. First, pilot experiments revealed that the pretrained embeddings did not contain vectors for many highly predictive words, such as medication names (vyvanse, dexedrine, focalin) and important clinical abbreviations (cbt, ssri). Second, unigram vectors increase the interpretability of the model, because it is straightforward to ask which words the model learns as being most predictive. It is possible, however, that the performance of our model could be improved by training custom word2vec embeddings on our corpus (see the sketch following these notes).

  3. Because the dimensionality of these features was quite high, we experimented with two dimensionality-reduction techniques, but neither affected model performance (both checks are illustrated in the sketch following these notes). First, we stemmed the words in the Reddit posts (e.g., the stem of "running" and "runs" is "run") and found reduced dimensionality but identical model performance. Second, we reduced the dimensionality of the feature matrix to 5,000 dimensions using truncated singular-value decomposition, and again found identical model performance.

  4. We also experimented with training the model with a multiclass objective, without the one-versus-rest classifiers (also sketched below). We observed identical results, F1 = .77 with one-versus-rest training versus F1 = .77 with multiclass training.

  5. To verify that the model was not only learning to identify medication names, we created a list of common mental health medications, removed these words from the model's features, and retrained the model (also sketched below). Model performance was quite similar, although slightly worse, F1 = .75 (without medication names) versus F1 = .77 (with medication names).
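To make these checks concrete, the following is a minimal sketch, with toy data and illustrative parameters, of how the analyses described in Notes 2–5 can be set up using scikit-learn, NLTK, and gensim. It is a reconstruction for illustration, not the original analysis code.

```python
# Minimal sketches of the checks described in Notes 2-5; toy data throughout.
from gensim.models import Word2Vec
from nltk.stem import PorterStemmer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

posts = ["started prozac and running again",
         "runs out of focus before finishing tasks",
         "worried and shaking before every exam"]
labels = ["depression", "ADHD", "anxiety"]

# Note 2: corpus-specific word2vec embeddings, so in-domain terms such as
# medication names receive vectors (because they occur in the training corpus).
w2v = Word2Vec([p.split() for p in posts], vector_size=50, window=5, min_count=1)
medication_vector = w2v.wv["prozac"]

# Note 3a: stemming collapses inflected forms ("running", "runs" -> "run").
stemmer = PorterStemmer()
stemmed = [" ".join(stemmer.stem(t) for t in p.split()) for p in posts]

# Note 3b: truncated SVD projects the tf-idf matrix onto k components
# (k = 5,000 in the actual analysis; tiny here because the toy corpus is tiny).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

# Note 4: one-vs-rest classifiers versus a single multinomial (softmax) objective.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, labels)
multinomial = LogisticRegression(max_iter=1000).fit(X, labels)

# Note 5: drop medication names from the vocabulary, then retrain.
medications = {"prozac", "effexor", "zoloft", "ritalin", "vyvanse"}  # short illustrative list
kept_vocab = [w for w in vectorizer.get_feature_names_out() if w not in medications]
X_no_meds = TfidfVectorizer(vocabulary=kept_vocab).fit_transform(posts)
no_meds_model = LogisticRegression(max_iter=1000).fit(X_no_meds, labels)
```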

References

  1. American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA: American Psychiatric Publishing.

  2. Bagroy, S., Kumaraguru, P., & De Choudhury, M. (2017). A social media based index of mental well-being in college campuses. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 1634–1646). New York, NY: ACM Press.

  3. Bedi, G., Carrillo, F., Cecchi, G., Slezak, D., Sigman, M., Mota, N., . . . Corcoran, C. M. (2015). Automated analysis of free speech predicts psychosis onset in high-risk youths. NPJ Schizophrenia, 1, 15030.

  4. Bond, R., Fariss, C., Jones, J., Kramer, A., Marlow, C., Settle, J., & Fowler, J. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489, 295–298.

  5. Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K., & Mitchell, M. (2015). CLPsych 2015 shared task: Depression and PTSD on Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology (pp. 31–39). Red Hook, NY: Association for Computational Linguistics.

  6. Corcoran, C., Carrillo, F., Slezak, D., Klim, C., Bedi, G., Javitt, D., . . . Cecchi, G. (2018). Language disturbance as a predictor of psychosis onset in youth at enhanced clinical risk. Schizophrenia Bulletin, 44, S43–S44.

  7. De Choudhury, M., Counts, S., Horvitz, E., & Hoff, A. (2014). Characterizing and predicting postpartum depression from shared Facebook data. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing (pp. 628–638). New York, NY: ACM Press.

  8. De Choudhury, M., Gamon, M., Counts, S., & Horvitz, E. (2013). Predicting depression via social media. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (pp. 128–137). Menlo Park, CA: AAAI Press.

  9. De Choudhury, M., Kiciman, E., Dredze, M., Coppersmith, G., & Kumar, M. (2016). Discovering shifts to suicidal ideation from mental health content in social media. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (pp. 2098–2110). New York, NY: ACM Press.

  10. Elvevag, B., Cohen, A., Wolters, M., Whalley, H., Gountouna, V., Kuznetsova, K., . . . Nicodemus, K. (2016). An examination of the language construct in NIMH's research domain criteria: Time for reconceptualization! American Journal of Medical Genetics Part B, 171, 904–919.

  11. Elvevag, B., Foltz, P., Weinberger, D., & Goldberg, T. (2007). Quantifying incoherence in speech: An automated methodology and novel application to schizophrenia. Schizophrenia Research, 93, 304–316.

  12. Ester, M., Kriegel, H., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In E. Simoudis, J. Han, & U. Fayyad (Eds.), Proceedings of Second International Conference on Knowledge Discovery and Data Mining (pp. 226–231). Menlo Park, CA: AAAI Press.

  13. Frankel, M. (2012). Regulating the boundaries of dual-use research. Science, 336(6088), 1523–1525.

  14. Gkotsis, G., Oellrich, A., Velupillai, S., Liakata, M., Hubbard, T., Dobson, R., & Dutta, R. (2017). Characterisation of mental health conditions in social media using Informed Deep Learning. Scientific Reports, 7, 45141.

  15. Goldstone, R., & Lupyan, G. (2016). Discovering psychological principles by mining naturally occurring datasets. Topics in Cognitive Science, 8, 548–568.

  16. Guntuku, S. C., Yaden, D. B., Kern, M. L., Ungar, L. H., & Eichstaedt, J. C. (2017). Detecting depression and mental illness on social media: An integrative review. Current Opinion in Behavioral Sciences, 18, 43–49. https://doi.org/10.1016/j.cobeha.2017.07.005

  17. Insel, T. (2017). Digital phenotyping: Technology for a new science of behavior. Journal of the American Medical Association, 318, 1215–1216.

  18. Ireland, M. E., & Mehl, M. R. (2014). Natural language use as a marker of personality. In T. M. Holtgraves (Ed.), Oxford handbook of language and social psychology (pp. 201–218). New York, NY: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199838639.013.034

  19. Jain, S., Powers, B., Hawkins, J., & Brownstein, J. (2015). The digital phenotype. Nature Biotechnology, 33, 462–463.

  20. Kapur, S., Phillips, A. G., & Insel, T. R. (2012). Why has it taken so long for biological psychiatry to develop clinical tests and what to do about it? Molecular Psychiatry, 17, 1174–1179. https://doi.org/10.1038/mp.2012.105

  21. Kern, M. L., Park, G., Eichstaedt, J., Schwartz, H., Sap, M., Smith, L., & Ungar, L. (2016). Gaining insights from social media language: Methodologies and challenges. Psychological Methods, 21, 507–525. https://doi.org/10.1037/met0000091

  22. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS’12 Proceedings of the 25th International Conference on Neural Information Processing Systems (pp. 1097–1105). Red Hook, NY: Curran Associates.

  23. Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802–5805.

  24. Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

  25. Mehl, M., Pennebaker, J., Crow, D., Dabbs, J., & Price, J. (2001). The electronically activated recorder (EAR): A device for sampling naturalistic daily activities and conversations. Behavior Research Methods, Instruments, & Computers, 33, 517–523.

  26. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In International Conference on Learning Representations (ICLR) 2013. Retrieved from https://sites.google.com/site/representationlearning2013/workshop-proceedings

  27. Monroe, S. M., & Simons, A. D. (1991). Diathesis—Stress theories in the context of life stress research: Implications for the depressive disorders. Psychological Bulletin, 110, 406–425.

  28. Mota, N., Copelli, M., & Ribeiro, S. (2017). Thought disorder measured as random speech structure classifies negative symptoms and schizophrenia diagnosis 6 months in advance. NPJ Schizophrenia, 3, 18. https://doi.org/10.1038/s41537-017-0019-3

  29. Mota, N., Vasconcelos, N., Lemos, N., Pieretti, A., Kinouchi, O., Cecchi, G., . . . Ribeiro, S. (2012). Speech graphs provide a quantitative measure of thought disorder in psychosis. PLoS ONE, 7, e34928. https://doi.org/10.1371/journal.pone.0034928

  30. Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy.

  31. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  32. Pennebaker, J., Boyd, R., Jordan, K., & Blackburn, K. (2015). The development and psychometric properties of LIWC2015. Retrieved from https://repositories.lib.utexas.edu/.

  33. Pennebaker, J., & King, L. (1999). Linguistic style: Language use as an individual difference. Journal of Personality and Social Psychology, 77, 1296–1312.

  34. Pennebaker, J. W., & Graybeal, A. (2001). Patterns of natural language use: Disclosure, personality, and social integration. Current Directions in Psychological Science, 10, 90–93. https://doi.org/10.1111/1467-8721.00123

  35. Pennebaker, J. W., Mehl, M. R., & Niederhoffer, K. G. (2003). Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54, 547–577. https://doi.org/10.1146/annurev.psych.54.101601.145041

  36. Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on EMNLP (pp. 1532–1543). New York, NY: Association for Computational Linguistics.

  37. Preotiuc-Pietro, D., Eichstaedt, J., Park, G., Sap, M., Smith, L., Tobolsky, V., . . . Ungar, L. (2015). The role of personality, age and gender in tweeting about mental illness. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology (pp. 21–30). New York, NY: Association for Computational Linguistics.

  38. Resnik, P., Armstrong, W., Claudino, L., Nguyen, T., Nguyen, V., & Boyd-Graber, J. (2015). Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology (pp. 99–107). New York, NY: Association for Computational Linguistics.

  39. Rude, S., Gortner, E., & Pennebaker, J. (2004). Language use of depressed and depression-vulnerable college students. Cognition & Emotion, 18(8), 1121–1133.

  40. Schwartz, H. A., Eichstaedt, J., Kern, M. L., Park, G., Sap, M., Stillwell, D., . . . Ungar, L. (2014). Toward assessing changes in degree of depression through Facebook. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology (pp. 118–125). New York, NY: Association for Computational Linguistics.

  41. Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489.

  42. Thorstad, R., & Wolff, P. (2018). A big data analysis of the relationship between future thinking and decision-making. Proceedings of the National Academy of Sciences, 115, 1740–1748.

  43. Wolinetz, C. (2012). Implementing the new US dual-use policy. Science, 336(6088), 1525–1527.

  44. Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences, 112, 1036–1040.

Author note

The data are available upon request, and are also publicly accessible via the website Reddit. The experiments were not preregistered.

Corresponding author

Correspondence to Robert Thorstad.

Electronic supplementary material

ESM 1 (DOCX 219 kb)

Cite this article

Thorstad, R., Wolff, P. Predicting future mental illness from social media: A big-data approach. Behav Res 51, 1586–1600 (2019). https://doi.org/10.3758/s13428-019-01235-z

Keywords

  • Mental health
  • Machine learning
  • Big data
  • ADHD
  • Anxiety
  • Bipolar
  • Depression