1 Introduction

In the past decade, digital trace data has become an integral part of social science research [1, 2]. One of the advantages of such data is that it can be used to predict otherwise unknown characteristics of people, allowing researchers to conduct studies on a large scale. For example, on an aggregate level, mobile phone metadata were used to predict wealth [3], images of street scenes were used to predict voting preferences [4], and texts of books were used to predict subjective well-being [5]. On an individual level, socio-demographic characteristics such as gender, ethnicity, age, and income were predicted from profile images [6], tweets [7, 8], and Facebook posts [9].

The work in this domain typically focuses on basic, often categorical, demographic variables. In such cases, the ground truth data required for training and validating a predictive model can be obtained relatively easily. In contrast, studying more complex human characteristics requires linking extensive survey data with digital traces on an individual level—a task that is known to be rather challenging [10]. Examples of such work include predicting personality (see [11] for a review) and mental health status (see [12] for a review) from social media activity. These complex individual-level characteristics were predicted from Facebook likes [13], posts on Facebook [14] or Twitter [15], and Instagram images [16].

A common limitation of this line of work is its reliance on voluntary response samples: participants are typically recruited via purpose-built applications or crowdsourcing platforms such as Amazon Mechanical Turk. This approach might not be problematic if the model can be externally validated. For basic demographic variables, external validation is typically achieved by comparing model predictions with existing data at an aggregated level (e.g., census data [4], governmental data [3], the Eurobarometer survey [5]). However, for more complex characteristics, such data are rarely available, and in most cases validation is limited to cross-validation on the existing dataset. As a result, the predictive power of these models on out-of-sample cases is not well known.

In our study, we built a model to predict the educational outcomes of students from their posts on social media. Educational outcomes are of particular interest because they are complex, continuous characteristics that—in contrast to basic demographic variables—can be reliably measured only by extensive standardized tests. Because educational outcomes are also closely related to many important life outcomes [17–21], the ability to predict them on a large scale might be particularly valuable both for researchers and for policy-makers.

Although there is a large body of research on the prediction of academic performance (see [22, 23] for reviews), such approaches typically (though not always [24]) rely on internal data from the educational organization (e.g., library logins [25], class attendance [26, 27], or data from learning management systems [28]) and are therefore limited to a single educational institution. As a result, the extent to which it is possible to predict educational outcomes on a population level is not well understood.

In this study, we used data from a nationally representative panel study entitled “Trajectories in Education and Careers” (TrEC) [29] that tracks 4400 students who participated in the Programme for International Student Assessment (PISA) [30]. In addition to survey data, this dataset contains information about public posts on a popular Russian social networking site, VK, for those participants who agreed to share their VK data (\(N = 3483\)). We combined unsupervised learning of word embeddings on a large corpus of VK posts (1.9B tokens) with a simple supervised model trained on individual posts to predict PISA scores from texts. We then tested the predictive performance of the model in different contexts. In particular, we used the model to predict the rankings of schools and universities based on the public posts of their students on VK. We further tested the generalizability of the model by applying it to users’ tweets rather than VK posts. Finally, we used the model to explore the differential language use of participants.

Because we make use of data from a nationally representative sample, our results can be generalized to the population level, albeit for a single age cohort. Our study also benefits from the fact that academic performance measured by a standardized test is publicly available at both the high school and university levels in Russia, allowing for external validation of the predictive model.

The main aim of this study was to understand to what extent educational outcomes can be predicted from posts on social media. Note that prediction here is understood as the identification of patterns in data (i.e., correlations between academic performance and the text of posts) rather than forecasting of future outcomes. Although the estimates provided by such models cannot be used to predict future outcomes, they are nevertheless valuable. For instance, Russian students take standardized tests only once per cohort, making it extremely difficult to estimate any value that is added to students’ learning by an educational organization [31]. A comparison of predicted scores for the same users at different points in time might help to overcome this issue and shed light on the factors contributing to students’ progress. The ability to estimate educational outcomes on a large scale could also help to uncover previously unknown factors that are associated with low or high academic performance.

2 Data & methods

2.1 TrEC data

We used data from the Russian Longitudinal Panel Study of Educational and Occupational Trajectories (TrEC) [29]. The study tracks 4400 students from 42 Russian regions who took the PISA test in 2012 [30]. We used PISA reading scores as a measure of students’ academic performance. PISA defines reading literacy as “understanding, using, reflecting on and engaging with written texts in order to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society” and considers it a foundation for achievement in other subject areas within the educational system as well as a prerequisite for successful participation in most areas of adult life [32]. PISA scores are scaled so that the OECD average is 500 and the standard deviation is 100; 40 score points correspond roughly to one year of formal schooling [30].

In 2018, publicly available information from the social networking site VK was collected for the 3483 TrEC participants who provided informed consent for the use of this data for research purposes. Note that while the initial sample was representative of 9th-grade Russian high school students in 2012, the social network data is not necessarily representative. There were no publicly available posts for 498 users; the median number of public posts for the remaining users was 35. We removed posts containing URLs from our data set to account for potentially automated postings, and we also excluded re-posts and posts with no text. This resulted in a final data set of 130,575 posts from 2468 users.

2.2 Continuous-vocabulary approach

There are two general strategies for identifying correlations between user characteristics and their language use: closed-vocabulary analysis, which relies on a priori, human-judged word categories, and open-vocabulary analysis, which extracts patterns from the data without relying on any predefined categories [9]. We adopted the latter approach; however, in contrast to previous studies [9, 33, 34], we used continuous rather than discrete word representations. For that purpose, we trained a fastText model [35] on the VK corpus (1.9B tokens; vocabulary size 2.5M) to obtain vector representations of Russian words (the model is available at [36]). We used a simple tokenizer that defines a word (token) as any uninterrupted sequence of symbols from the Cyrillic or Latin alphabet, or a hyphen. We also substituted all numbers with a single 〈num〉 token and all URLs with a single 〈url〉 token.
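A minimal sketch of such a tokenizer in Python; the exact regular expressions are our illustration, not the authors’ code:

```python
import re

# Assumed patterns: URLs and numbers are replaced by placeholder tokens,
# and a word is an uninterrupted run of Cyrillic/Latin letters or hyphens.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUM_RE = re.compile(r"\d+")
TOKEN_RE = re.compile(r"<num>|<url>|[А-Яа-яЁёA-Za-z-]+")

def tokenize(text: str) -> list[str]:
    text = URL_RE.sub(" <url> ", text.lower())
    text = NUM_RE.sub(" <num> ", text)
    return TOKEN_RE.findall(text)

tokenize("Читаю Брэдбери: 451 градус, https://example.com")
# -> ['читаю', 'брэдбери', '<num>', 'градус', '<url>']
```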

We represented each post as a 300-dimensional vector by averaging the vector representations of all of its constituent words. These post representations were used to train a linear regression model to predict the PISA scores of the posts’ authors (the model is available at [36]). Because the model is linear, the predicted score of a text is, by construction, equal to the average predicted score of its constituent words.
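The supervised step can then be sketched as follows, assuming gensim and scikit-learn; the file name and the `posts` and `pisa_scores` variables are hypothetical placeholders for the actual data:

```python
import numpy as np
from gensim.models.fasttext import load_facebook_model
from sklearn.linear_model import LinearRegression

ft = load_facebook_model("vk_fasttext.bin")  # embeddings trained on the VK corpus

def embed_post(tokens: list[str]) -> np.ndarray:
    # A post is the mean of the 300-d fastText vectors of its words
    return np.mean([ft.wv[t] for t in tokens], axis=0)

X = np.stack([embed_post(tokenize(p)) for p in posts])  # one row per post
reg = LinearRegression().fit(X, pisa_scores)            # target: author's PISA score

# Linearity makes the averaging property explicit:
# w . mean(v_i) + b == mean(w . v_i + b),
# so a post's score is exactly the mean of its words' scores.
```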

The advantage of this continuous-vocabulary approach over discrete methods is that it incorporates rich knowledge about language structure learned during the unsupervised training of word embeddings. It enables, for instance, the computation of meaningful scores even for words that are not present in the training dataset. This property is particularly valuable for the small datasets that are typical of studies combining survey data with digital traces. As we demonstrate, this approach outperforms common alternatives such as TF-IDF models (see Results).
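To illustrate the out-of-vocabulary property, a hypothetical helper built on the objects from the previous sketch (fastText composes vectors for unseen words from character n-grams):

```python
def word_score(word: str) -> float:
    # Works even for words absent from the labeled training data
    return float(reg.predict(ft.wv[word].reshape(1, -1))[0])
```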

At the same time, the continuous-vocabulary strategy is simpler than state-of-the-art ANN methods [37] and, as a result, allows straightforward interpretation of the predictions and exploration of users’ differential language use, as we demonstrate in the Results section.

2.3 External validation: high schools and universities data

VK provides an application programming interface (API) that enables systematic downloading of information from the site. In particular, it is possible to download user profiles from particular educational institutions and within selected age ranges, and, for each user, to obtain a list of their public posts. According to the VK Terms of Service: “Publishing any content on his/her own personal page, including personal information, the User understands and accepts that this information may be available to other Internet users, taking into account the architecture and functionality of the Site.” The VK team confirmed that their public API could be used for research purposes.
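As an illustration, public posts can be retrieved through the VK API method wall.get; the access token and API version below are placeholders, and this is not necessarily the exact collection procedure used in the study:

```python
import requests

def get_public_posts(owner_id: int, access_token: str, count: int = 100) -> list[str]:
    # Fetch up to `count` posts from a user's public wall
    resp = requests.get(
        "https://api.vk.com/method/wall.get",
        params={"owner_id": owner_id, "count": count,
                "access_token": access_token, "v": "5.131"},
        timeout=30,
    ).json()
    return [item.get("text", "") for item in resp["response"]["items"]]
```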

We created a list of high schools in Saint Petersburg (\(N = 601\)), Samara (\(N = 214\)), and Tomsk (\(N = 99\)) and then collected the IDs of users who indicated on VK that they graduated from one of these schools. We removed profiles with no friends from the same school, profiles that already belonged to the TrEC data set, and users who indicated several schools in their profiles. The public posts of the remaining users were downloaded, and our model was applied to them to predict the users’ academic performance. We then estimated the educational outcomes of a school by averaging the predicted performance of its students. Overall, 1,064,371 posts from 38,833 users were used at this stage of the analysis. The same procedure was performed to predict academic performance for students from the 100 largest universities in Russia (\(N_{\mathrm{users}} = 115\text{,}804\); \(N_{\mathrm{posts}} = 6\text{,}508\text{,}770\)). Moscow State University was excluded from the analysis as it is known to be a default choice for bots and fake profiles: even after the aforementioned filtering, there remain an order of magnitude more user profiles than actual students in a given cohort.
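The aggregation step amounts to a two-level average; a minimal sketch with pandas, where the DataFrame layout (one row per post with user_id, school_id, and predicted_score columns) is an assumption:

```python
import pandas as pd

# Average post-level predictions per user, then average users per school
user_scores = df.groupby(["school_id", "user_id"])["predicted_score"].mean()
school_scores = user_scores.groupby(level="school_id").mean()
```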

We used data from the web portal “Schools of Saint Petersburg” [38] to obtain the average performance of schools’ graduates in the Unified State Examination (USE), a mandatory state examination that all school graduates in Russia must pass. The USE scores for Samara were provided by the web portal “Zeus” [39]. The USE scores of the Tomsk schools, along with the data on university enrollees [40], were collected by the Higher School of Economics.

This information was used to check whether the scores predicted from social media data correspond to the rankings of schools and universities based on their USE results. Note that individual USE and PISA scores are not perfectly correlated [41]; however, it might be assumed that both tests measure the same underlying academic ability. This means that our model was tested not only on a different set of users but also with a measure of academic performance different from the one used in the training setting.

2.4 Twitter data

VK allows users to indicate links to other social media accounts, including Twitter, in their profiles. Only a small proportion of users provide such links: information about 665 Twitter accounts was available for the Saint Petersburg data set (fewer for the other cities), and 2836 Twitter accounts were available for the university data set. This allowed the analysis to be performed only for the university data set. The latest tweets of these 2836 users were downloaded via the Twitter API. Note that unlike tweets, VK posts are not limited to 140 or 280 characters; however, most VK posts are short texts (85.2% of the posts in our sample are shorter than 140 characters and 92.6% are shorter than 280 characters).

3 Results

3.1 Prediction

We first explored the predictive power of common text features with respect to academic performance. We found small negative correlations for the use of capitalized words (\(P = 2 \times 10^{-3}\)), emojis (\(P = 7 \times 10^{-3}\)), and exclamations (\(P = 0.05\)), as seen in Fig. 1. The use of Latin characters (\(P = 5 \times 10^{-3}\)), average post length (\(P = 2 \times 10^{-4}\)), word length (\(P = 4 \times 10^{-4}\)), and vocabulary size (\(P < 10^{-10}\)) are positively correlated with academic performance. The strongest correlation was found for the information entropy of users’ texts (Pearson’s \(r = 0.20\), \(P < 10^{-15}\)).

Figure 1

Pearson correlation between common text features and academic performance. The use of capitalized words, emojis and exclamations (average number per post normalized by the post length in tokens) is negatively correlated with performance. The use of Latin characters, average post and word length, vocabulary size and entropy of users’ texts are positively correlated with academic performance
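For concreteness, a sketch of the strongest feature, assuming `users_tokens` (one token list per user) and `pisa` as hypothetical placeholders for the actual data:

```python
import numpy as np
from collections import Counter
from scipy.stats import pearsonr

def text_entropy(tokens: list[str]) -> float:
    # Empirical Shannon entropy (bits) of a user's word distribution
    counts = np.array(list(Counter(tokens).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

H = [text_entropy(toks) for toks in users_tokens]
r, p_value = pearsonr(H, pisa)  # reported above: r = 0.20, P < 1e-15
```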

We used a TF-IDF model to obtain a baseline prediction of academic performance from the users’ posts. We selected the 1000 most common unigrams and bigrams in our corpus, excluding Russian stop words. We then applied a TF-IDF transformation to represent posts as 1000-dimensional vectors and trained a linear regression model on individual posts to predict the academic performance of their authors. The correlation between predicted and real scores is \(r = 0.285\). Here and for the following models, we report results on the user level obtained using leave-one-out cross-validation, i.e., scores for the posts of a given user were obtained from a model trained on the posts of all other users. We obtained significantly better results with a model that used word embeddings (see Methods). We also found that embeddings trained on the VK corpus outperform models trained on the Wikipedia and Common Crawl corpora (Table 1).

Table 1 Predictive power of the models measured as Pearson correlation between real and predicted outcomes. The results were computed using leave-one-out cross-validation
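A sketch of the baseline and the leave-one-user-out scheme with scikit-learn; `russian_stop_words`, `posts`, `author_scores`, and `user_ids` are placeholders for the actual data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=1000,
                    stop_words=russian_stop_words),
    LinearRegression(),
)
# Each user's posts are scored by a model trained on all other users' posts
post_preds = cross_val_predict(baseline, posts, author_scores,
                               groups=user_ids, cv=LeaveOneGroupOut())
```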

The predictive power of a model depends on the number of posts available for each user (see Fig. 2). If only one post is available per user, the predictive power is rather low (\(r = 0.237\)). However, it increases with the number of posts available, reaching \(r = 0.541\) for 20 posts per user.

Figure 2

The predictive power of the model depending on the number of posts used per user. First, users with at least 20 posts were selected. Then, for each user, N of their posts were selected to predict their academic performance (\(N = 1,\ldots, 20\)). The shaded region corresponds to the bootstrapped 90% confidence interval
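A sketch of this procedure under stated assumptions (`post_scores_by_user` holds the model’s post-level scores for each user with at least 20 posts, `true_scores` the corresponding PISA scores):

```python
import numpy as np
from scipy.stats import pearsonr

def corr_with_n_posts(n: int, rng: np.random.Generator) -> float:
    # Predict each user's score from a random sample of n of their posts
    preds = [rng.choice(scores, size=n, replace=False).mean()
             for scores in post_scores_by_user]
    return pearsonr(preds, true_scores)[0]

rng = np.random.default_rng(0)
boot = [corr_with_n_posts(20, rng) for _ in range(1000)]
lo, hi = np.percentile(boot, [5, 95])  # a bootstrapped 90% CI
```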

In addition to predicting raw scores, we analyzed how well the model can distinguish between low- and high-performing users. To help interpret what student scores mean in substantive terms, the PISA scale is divided into six proficiency levels [30]. In Table 2, we report the ability of our model to distinguish between students of different proficiency levels, measured as the area under the ROC curve (AUC). According to the OECD, Level 2 is the baseline proficiency required to participate fully in modern society [30]. Students who do not meet this baseline are considered low-performing; high-performing students are those who achieve proficiency Level 5 or higher. The ability of our model to distinguish between low- and high-performing students, i.e., between Levels 0 & 1 and Levels 5 & 6, is 93.7%.

Table 2 Model performance in discrimination between different proficiency levels measured as the area under the ROC curve
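The discrimination measure itself is straightforward to compute; a sketch in which `pred` (user-level predicted scores) and `level` (PISA proficiency levels) are assumed NumPy arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

low, high = level <= 1, level >= 5           # Levels 0 & 1 vs. Levels 5 & 6
mask = low | high
auc = roc_auc_score(high[mask], pred[mask])  # reported above: 0.937
```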

3.2 Transfer

Figure 3 shows the correlation between the predicted performance of schools (a)–(c) and universities (d) and the USE scores of their graduates or enrollees. In all four cases, we find a relatively strong signal, despite the fact that the VK sample might not be representative and that academic performance was measured differently than in the training setting, being available only in aggregated form and from secondary sources.

Figure 3

Correlation between the predicted and real performance of schools and universities. Pearson’s correlation coefficients between predicted school scores and the USE scores of their graduates were computed for Saint Petersburg (a), Samara (b) and Tomsk (c). The correlation between predicted university scores and the USE scores of their enrollees was also computed for the 100 largest Russian universities (d)

Intriguingly, we find that substituting tweets for VK posts does not substantially alter the resulting performance (see Fig. 4). For a fair comparison, we used VK data only for those users for whom Twitter data was also available. This is why the performance of the model is substantially lower than in Fig. 3(d), where all available VK data was used.

Figure 4

Comparison of predictions based on VK and Twitter data. While estimates from Twitter and VK vary for individual universities, the overall performance of the model is similar in both cases. Note that the performance of the model is rather low due to the limited number of users per university for whom both VK and Twitter data are available

3.3 Differential language use

We explored the resulting model by selecting the 400 words with the highest and lowest scores among words that appear at least 5 times in the training corpus. A t-SNE [42] representation of the embeddings produced by our model helped to identify several thematic clusters (Fig. 5). High-performing clusters include English words (above, saying, yours, must), words related to literature (Dandelion, Bradbury, Fahrenheit, Orwell, Huxley, Faulkner, Nabokov, Brodsky, Camus, Mann, Shelley, Shakespeare), words related to reading (read, reread, published, book, volume), words related to physics (universe, hole, string, theory, quantum, Einstein, Newton, Hawking), and words related to thinking processes, including various synonyms of “thinking” and “remembering.”

Figure 5

t-SNE representation of the words with the highest and lowest scores from the training data set. High-performing clusters (orange) include English words and words related to literature, physics, or thinking processes. Low-performing clusters (green) include spelling errors and words related to horoscopes, military service, or cars and road accidents

Low-performing clusters include common spelling errors and typos, names of popular computer games, and words related to military service (army, to serve, military oath), horoscopes (Aries, Sagittarius), and cars and road accidents (traffic collision, General Administration for Traffic Safety, wheels, tuning).
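A minimal sketch of how such a map can be produced from the objects introduced in Methods; `extreme_words` and the t-SNE parameters are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE

# 2-D layout of the highest- and lowest-scoring words for plotting
vecs = np.stack([ft.wv[w] for w in extreme_words])
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vecs)
```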

The use of continuous word representations allows one to compute scores even for words that were not present in the training data set. We computed the scores for all 2.5M words in our vector model and made them available for further exploration [36]. These scores could be used in exploratory analyses of differential language use with respect to academic performance across various domains, from literature to politics or food (Fig. 6).
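Scoring the full vocabulary is a single pass of the linear model over the word vectors; a sketch assuming gensim 4 (in practice this would be batched):

```python
import numpy as np

vocab = list(ft.wv.key_to_index)  # ~2.5M words
vocab_scores = reg.predict(np.stack([ft.wv[w] for w in vocab]))
ranked = sorted(zip(vocab, vocab_scores), key=lambda t: -t[1])
```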

Figure 6

Ranking of selected words by their predicted score (translated from Russian). The application of the model to words from different domains confirms its face validity

4 Discussion

In traditional cross-validation settings, one data set is randomly split into two parts: the model is trained on one part, and its performance is then assessed on the other. In our case, we used two different data sources (VK and Twitter) and different measures of academic performance (PISA scores and USE scores). Additional limitations were that the PISA scores were collected in 2012, while most of the posts were written much later, and that the USE scores were available only at the aggregated level. Despite these limitations, we found a relatively strong signal in the data. For instance, we were able to explain 69% of the variation in universities’ scores using information about the VK posts of users from these universities. While the result for tweets was significantly lower (\(R^{2} = 0.26\)), this was probably at least partly due to the smaller sample size. Also, note that the prediction of individual scores depends substantially on the number of available posts: if only one post per user is used for prediction, our model explains 6% of the variation in academic performance; this number rises to 29% if 20 posts are used.

The ability to predict the ranking of educational organizations might seem trivial given that direct ranking information is readily available. However, USE scores are measured only once per cohort, making it extremely hard to estimate any added value provided by an educational organization. A comparison of predicted scores for the same users at different points in time might shed light on factors contributing to the students’ progress.

We demonstrate how domain-specific unsupervised learning of word embeddings allows predictive models to be trained on relatively small labeled data sets. One reason is that even words that are rare in the training or testing data sets could be valuable for prediction. For instance, even if the word “Newt” never occurs in the training data set, the model could assign a higher score to posts containing it. This would happen if the model learns from the training data set that words from the Harry Potter universe are indicative of high-performing students and learns from the unsupervised training that “Newt” belongs to this category, i.e., that this word is close to other Harry Potter related words in the vector space. This might make the use of continuous word representations preferable to common approaches that rely on counting word frequencies. As our approach does not depend on a particular language, source of texts, or target variable (i.e., academic performance could be substituted by income or depression), it could be applied to a wide variety of settings.
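This argument can be checked directly in the embedding space; an illustrative, hypothetical query using the objects from the Methods sketches:

```python
# fastText builds vectors for unseen words from character n-grams, so an
# out-of-vocabulary word lands near related vocabulary...
ft.wv.most_similar("Newt", topn=5)  # expected: Harry Potter related words

# ...and the linear model then transfers their scores to it
word_score("Newt")
```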

Our results also suggest that models trained on text data could be successfully transferred from one data source to another. While this certainly might be useful in some applications, it also means there is a greater risk to users’ privacy. If users of platform A do not disclose an attribute X on it, then there is no data to train a model to predict X from digital traces on platform A. However, if X is disclosed on platform B, and both platforms collect short texts from users, then it becomes possible to predict X from digital traces on A given access to data from B. In recent years, face recognition technology has raised particular privacy concerns because of its potential omnipresence and the inability of people to hide from it. In a similar way, digital traces in the form of short texts are ubiquitous, and our results suggest that they allow, if not to identify a person, then at least to predict potentially sensitive private attributes.