1 Introduction

The rise of social media has led to an unprecedented increase in the supply of publicly available unstructured text data. Researchers often wish to examine relationships between observable metadata (e.g., characteristics of a document’s author) and in-text patterns (Farrell 2016; Kim 2017). Probabilistic topic models identify such in-text patterns by producing a posterior distribution over different topics. Yet estimating relationships with observed metadata is not trivial, since the target variable is latent and itself estimated from the text data. In this work we focus on exploring and estimating relationships between metadata and topics learned by the structural topic model (STM; Roberts et al. 2016). We selected this model due to its high relevance in the social sciences—see “Appendix A”. The R package stm (Roberts et al. 2019) implements the STM itself and additionally provides a framework for estimating topic-metadata relationships via the method of composition, a combination of Monte Carlo sampling and frequentist linear regression. Even though this estimation technique is prone to producing predictions incompatible with standard definitions of probability, it is frequently applied in the literature (cf. “Appendix A”). This leads to implausibilities of two different forms: authors sometimes report negative expected topic proportions (e.g., Farrell 2016; Moschella and Pinto 2019, see also our Fig. 1); whereas in other cases "only" the confidence bands partly include negative values (e.g., Cho et al. 2017; Chandelier et al. 2018; Bohr and Dunlap 2018; Heberling et al. 2019). In both cases, it is ignored that sampled topic proportions are confined to (0, 1) by definition, which severely harms the interpretability of results.

In this paper, we suggest two key modifications to the stm implementation in R (Roberts et al. 2019): First, our proposed Beta regression approach is a natural correction of the linear regression approach, accounting for topic proportions being restricted to the interval (0, 1). Second, we develop a Bayesian design within the method of composition to allow for a more coherent estimation and interpretation of topic-metadata relationships; in particular, we obtain a posterior predictive distribution of topic proportions at different values of metadata covariates.

We demonstrate the added value of our corrections by analyzing Twitter posts of German politicians, gathered from September 2017 through April 2020. Politics has been particularly affected by the rise of social media, as evidenced by the Brexit vote and the US presidential elections, with Twitter being used extensively by politicians for direct communication. We investigate relationships between latent topics in the tweets of German members of parliament (MPs) and corresponding metadata, such as the tweet date or the unemployment rate in the respective MP’s electoral district. In doing so, we attempt to link the topics discussed to specific events as well as to socioeconomic characteristics of the MPs’ electoral districts.

2 Background

Topic models seek to discover latent thematic clusters, called topics, within a collection of discrete data, usually text documents. In addition to identifying such clusters, topic models estimate the proportions of the discovered topics within each document. Many topic models build upon the well-known latent Dirichlet allocation (LDA), a generative probabilistic three-level hierarchical Bayesian mixture model that assumes a Dirichlet distribution for topic proportions. The Correlated Topic Model (CTM; Blei and Lafferty 2007), for instance, builds on the LDA but replaces the Dirichlet distribution with a logistic normal distribution in order to capture inter-topic correlations. The STM adopts this approach, but additionally incorporates document-level metadata into the estimation of topics:

  • For each document, indexed by \(d \in \{1,\dots ,D\}\), and each topic, indexed by \(k \in \{1,\dots ,K\}\), a topic proportion \(\theta _{d,k}\) is drawn from a logistic normal distribution.

  • The parameters of the logistic normal distribution depend on document-level metadata covariates \({\textbf{x}}_d\).

For parameter estimation, the STM employs a variational expectation maximization (EM) algorithm, where in the E-step the variational posteriors are updated using a Laplace approximation (Wang and Blei 2013; Roberts et al. 2016). In the M-step, the approximated Kullback–Leibler (KL) divergence is minimized with respect to the model parameters.

3 Modeling topic-metadata relationships in the STM

The STM produces an approximate posterior distribution of topic proportions. A point estimate can be obtained, for example, as the mode of this distribution. Topic proportions are often used in subsequent analyses, e.g., for determining their relationship with metadata. We argue that the usual practice of simply regressing point estimates of topic proportions on document-level covariates is not adequate for estimating topic-metadata relationships: this approach ignores that topic proportions are themselves estimates and thus neglects much of the information contained in their posterior distribution. In this section, we propose a method to adequately explore the relationship between topic proportions and metadata covariates.

One way to account for the uncertainty in topic proportions is the "method of composition" (Tanner 2012, p. 52), which is a simple Monte Carlo sampling technique. Let y be a random variable with unknown distribution p(y) from which we would like to sample, and let z be another random variable with known distribution p(z). If \(p(y \vert z)\) is known, we can sample from

$$\begin{aligned} p(y) = \int p(y \vert z)\, p(z)\, dz, \qquad (1) \end{aligned}$$

using the following procedure:

  1. Draw \(z^* \sim p(z)\).

  2. Draw \(y^* \sim p(y \vert z^*)\).

Discarding \(z^*\), the resulting \(y^*\) are samples from p(y).
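To make the procedure concrete, the following minimal R sketch applies it to a toy example with known Gaussian distributions; it is purely illustrative and not part of the STM workflow.

```r
# Toy illustration (not STM-specific): z ~ N(0, 1) and p(y | z) = N(z, 0.5^2).
# Marginally, y ~ N(0, 1 + 0.5^2), which the composition samples should match.
set.seed(1)
n_draws <- 10000
z_star  <- rnorm(n_draws, mean = 0, sd = 1)         # step 1: draw z* ~ p(z)
y_star  <- rnorm(n_draws, mean = z_star, sd = 0.5)  # step 2: draw y* ~ p(y | z*)
c(empirical_var = var(y_star), theoretical_var = 1 + 0.5^2)
```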

In Roberts et al. (2016), the authors employ a variant of the method of composition established by Treier and Jackman (2008), which uses linear regression to obtain the conditional distribution \(p(y \vert z)\). To demonstrate this variant, let \(\varvec{\theta }_{{\varvec{\cdot }}k}=(\theta _{1,k}, \dots , \theta _{D,k})^T \in (0,1)^{D}\) denote the proportions of topic k and let \({\textbf{X}}:=[{\textbf{x}}_1 \vert \dots \vert {\textbf{x}}_D]^T\) be the covariates for all D documents. Let further \(q(\varvec{\theta }_{{\varvec{\cdot }}k})\) be the approximate posterior distribution of topic proportions given observed documents and metadata, as produced by the STM. The idea now is to repeatedly draw samples \(\varvec{\theta }_{{\varvec{\cdot }}k}^*\) from \(q(\varvec{\theta }_{{\varvec{\cdot }}k})\) and subsequently perform a regression of each sample \(\varvec{\theta }_{{\varvec{\cdot }}k}^*\) on covariates \({\textbf{X}}\) to obtain coefficient estimates \(\hat{\varvec{\xi }}\). Treier and Jackman (2008) consider the asymptotic distribution of \(\hat{\varvec{\xi }}\) as the posterior density for \(\varvec{\xi }\), i.e., as \(p(\varvec{\xi } \vert \varvec{\theta }_{{\varvec{\cdot }}k}^*, {\textbf{X}})\).

That is, the method of composition draws samples from the asymptotic distribution of the maximum likelihood estimator (MLE) of the regression parameters. This use of the asymptotic distribution of the MLE can be motivated by the idea that the prior distribution is dominated by the likelihood for larger samples. The posterior can then be shown to be approximately normal with mean vector equal to the MLE and covariance matrix equal to the inverse of the observed information matrix (see, e.g., Walker 1969).

Using samples \(\varvec{\xi }^*\) from this distribution \(p(\varvec{\xi } \vert \varvec{\theta }_{{\varvec{\cdot }}k}^*, {\textbf{X}})\), we can “predict” topic proportions \(\theta _{pred, k}^{*} = g({\textbf{x}}_{pred}^T \varvec{\xi }^*)\) at new covariate values \({\textbf{x}}_{pred}\) (g is the regression response function, e.g., identity function for linear regression). "Algorithm 1" summarizes the method. Note that sampling from the posterior of topic proportions in the first step of Algorithm 1 accounts for the uncertainty in \(\varvec{\theta }_{{\varvec{\cdot }}k}\), while the uncertainty of the regression estimation itself is addressed by sampling from the (asymptotic) distribution of the regression coefficient estimator.

Algorithm 1: Method of composition with frequentist regression
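The following R sketch illustrates one pass of Algorithm 1 for a single topic k. It assumes that `theta_star` (one draw of the topic-k proportions from the approximate STM posterior, e.g., via stm::thetaPosterior) and the covariate matrix `X` are given; the function name and data objects are ours for illustration and not part of the stm package.

```r
library(MASS)  # for mvrnorm

# One pass of Algorithm 1 (method of composition with frequentist linear regression).
# theta_star: sampled topic-k proportions (length D); X: D x p covariate matrix;
# x_pred: new covariate values (length p). Returns one "predicted" proportion.
composition_lm_step <- function(theta_star, X, x_pred) {
  fit <- lm(theta_star ~ X)                             # frequentist regression
  xi  <- mvrnorm(1, mu = coef(fit), Sigma = vcov(fit))  # xi* from asymptotic MLE distribution
  drop(c(1, x_pred) %*% xi)                             # g = identity for linear regression
}

# Repeating this over many posterior draws of theta yields samples whose
# empirical mean and quantiles are the quantities visualized in Fig. 1.
```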

To visualize topic-metadata relationships, Roberts et al. (2016) generate multiple “predictions” \(\theta _{pred, k}^{*}\) and calculate empirical quantities such as the mean and quantiles. Calculating the mean and credible intervals in such a Bayesian fashion implicitly assumes a (posterior predictive) distribution for \(\theta _{pred, k}^{*}\). This distribution, however, directly depends on the regression, which is frequentist as implemented in the stm package. We address this point in detail in Sect. 4.2.

4 Methodological Improvements

While we agree with performing Monte Carlo sampling of topic proportions in order to integrate over latent variables, we aim to address two inconsistencies:

  1. Inadequate modeling of proportions: The method of composition is implemented in the R package stm via the estimateEffect function, which employs a linear regression in the second step of Algorithm 1 (implying \(g = id\) in the last step). This implementation ignores that topic proportions are naturally restricted to the interval (0, 1). As a consequence, when using the estimateEffect function, we frequently observed predicted topic proportions outside of (0, 1), as illustrated for one specific topic-covariate combination in Fig. 1.

  2. Mixing Bayesian and frequentist methods: The method of composition as used by Treier and Jackman (2008) and Roberts et al. (2016) mixes Bayesian and frequentist methods. As described in Sect. 3, a frequentist regression is used inside the method of composition, yet estimates are obtained in a Bayesian manner via calculation of the empirical mean and quantiles. Recall that, according to Treier and Jackman (2008), \(\varvec{\xi }^*\) can be considered a sample from the posterior of regression coefficients. However, the coefficients resulting from a frequentist regression do not have any distribution, because the frequentist framework assumes them to be fixed parameters. As a consequence, one cannot sample from the distribution of regression coefficients, which is why Treier and Jackman (2008) sample \(\varvec{\xi }^*\) from the distribution of the coefficient estimator. This distribution, however, only exists under frequentist assumptions.

In Sects. 4.1 and 4.2 below we further discuss these problems and present corrections and alternatives, all of which are implemented in the R package stmprevalence.

Fig. 1: Mean prediction and 95% confidence intervals for the topic proportion of topic “Climate Protection” over time, generated using estimateEffect from the R package stm

4.1 Frequentist Beta regression

As noted above, the linear regression approach is often applied carelessly in the literature, neglecting that topic proportions are restricted to (0, 1) by definition. Farrell (2016) and Moschella and Pinto (2019), for instance, produce figures containing negative expected topic proportions, while Cho et al. (2017), Chandelier et al. (2018), Bohr and Dunlap (2018), and Heberling et al. (2019) display confidence bands partly covering negative values.

Therefore, we correct the approach employed within the stm package by replacing the linear regression with a regression model that assumes a dependent variable in the interval (0, 1). As shown by Aitchison and Shen (1980), the Dirichlet distribution is well suited to approximate a logistic normal distribution, though it induces less interdependence among the different topics. When employing a Dirichlet distribution, the univariate marginal distributions are Beta distributions. We thus perform a separate Beta regression of each topic proportion on \({\textbf{X}}\), using a logit link. This approach again corresponds to Algorithm 1, but with g being the logistic sigmoid function.
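As a minimal sketch of this replacement step, one can use the betareg package for the regression inside Algorithm 1; the function and variable names below are illustrative and the stmprevalence internals may differ.

```r
library(betareg)
library(MASS)

# Beta regression replacement for the linear regression in Algorithm 1.
# theta_star: one sampled vector of topic-k proportions in (0, 1);
# covariates: data frame of document-level metadata (one row per document).
composition_beta_step <- function(theta_star, covariates) {
  dat <- cbind(theta = theta_star, covariates)
  fit <- betareg(theta ~ ., data = dat, link = "logit")
  # Draw xi* from the asymptotic distribution of the mean-model coefficients.
  mvrnorm(1, mu = coef(fit, model = "mean"), Sigma = vcov(fit, model = "mean"))
}

# "Predictions" at new covariate values x_pred (including the intercept column)
# are then plogis(x_pred %*% xi), i.e., g is the logistic sigmoid.
```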

4.2 Bayesian Beta regression

Treier and Jackman (2008) and the authors of the STM consider \(\varvec{\xi }^*\) to be samples from the posterior of regression coefficients. While it is possible to view frequentist regression from a Bayesian perspective, doing so implies assuming a uniform prior distribution for the regression coefficients \(\varvec{\xi }\), which is rather implausible. More generally, the mixing of Bayesian and frequentist frameworks within the method of composition lacks a theoretical foundation, especially when employing an asymptotic distribution of regression coefficient estimators. This applies to the model of Treier and Jackman (2008) as well as to the Beta regression presented in Sect. 4.1. Furthermore, note that when using a frequentist regression, the estimated uncertainty refers to the prediction of the mean of topic proportions. When exploring topic-metadata relationships, however, it may be preferable to examine the variation of individual topic proportions among documents at different values of metadata covariates.

Algorithm 2: Method of composition with Bayesian Beta regression

Therefore, we propose to replace the frequentist regression in "Algorithm 1" by a Bayesian Beta regression with normal priors centered around zero. This enables modeling topic-metadata relationships in a fully Bayesian manner while preserving the methodological improvements from Sect. 4.1. Algorithm 2 summarizes this approach. By drawing \(\theta _{pred, k}^{*}\) at covariate values \({\textbf{x}}_{pred}\), we obtain samples from the posterior predictive distribution

$$\begin{aligned} p(\varvec{\theta }_{pred, k} \vert \varvec{\theta }^*_{{\varvec{\cdot }}k}, {\textbf{X}}, {\textbf{x}}_{pred}) = \int p(\varvec{\theta }_{pred, k} \vert {\textbf{x}}_{pred}, \varvec{\xi })\, p(\varvec{\xi } \vert \varvec{\theta }_{{\varvec{\cdot }}k}^*, {\textbf{X}})\, d\varvec{\xi }, \qquad (2) \end{aligned}$$

where \(p(\varvec{\xi } \vert \varvec{\theta }_{{\varvec{\cdot }}k}^*, {\textbf{X}})\) denotes the posterior distribution of regression coefficients. This allows us to display the (predicted) variation of topic proportions at different covariate levels. As before, quantities of interest, such as the mean and quantiles, are computed empirically from the samples; now, however, these samples are generated within a fully Bayesian framework.
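One way to realize the Bayesian regression step of Algorithm 2 is via rstanarm's stan_betareg with normal priors centered around zero. The sketch below uses illustrative covariate names and is not the exact stmprevalence implementation.

```r
library(rstanarm)

# dat: data frame with one draw of topic-k proportions (column theta, in (0, 1))
# plus document-level covariates; x_pred: one-row data frame of new covariate values.
fit <- stan_betareg(
  theta ~ time + immigration_share + gdp_per_capita + unemployment_rate,
  data            = dat,
  link            = "logit",
  prior           = normal(0, 2.5),   # normal priors centered around zero
  prior_intercept = normal(0, 2.5),
  chains = 2, iter = 2000
)

# Draws from the posterior predictive distribution of topic proportions at x_pred,
# i.e., from Eq. (2); empirical mean and quantiles give the curves in Fig. 4.
theta_pred <- posterior_predict(fit, newdata = x_pred)
quantile(theta_pred, probs = c(0.05, 0.5, 0.95))
```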

5 Application

Source code available at https://github.com/PMSchulze/topic-metadata-stm.

In this section, we first apply the STM to German parliamentarians’ Twitter data and subsequently demonstrate both the original method (stm) and our new methods (stmprevalence) for exploring topic-metadata relationships. We chose the STM for this illustration in particular because of its flexibility and its relevance in the social sciences. We would like to emphasize again, however, that our methods work with any other topic model, such as LDA or CTM, as long as it produces an (approximate) posterior distribution of topic proportions. This is because our methods concern the step subsequent to the estimation of a topic model, i.e., the exploration of relationships between previously estimated topic proportions and metadata covariates.

5.1 Data

For all German MPs during the 19th election period (starting on September 24, 2017), we gathered personal information such as name, party affiliation, and electoral district from the official parliament website as well as Twitter profiles from the official party websites, using BeautifulSoup (Richardson 2007). Next, after excluding MPs without a public Twitter profile, we used tweepy (Roesslein 2020) to scrape all tweets by German MPs from September 24, 2017 through April 24, 2020. We also gathered socioeconomic data, such as GDP per capita and unemployment rate, as well as 2017 election results on an electoral-district level. Text preprocessing, such as transcription of German umlauts, removal of stopwords, and word-stemming, was performed with quanteda (Benoit et al. 2018).

We define a document as the concatenation of an individual MP’s tweets during a single calendar month in order to achieve sufficient document length. Our final data set includes 10,998 monthly MP-level documents, each one associated with 90 covariates.
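A condensed sketch of this pipeline, combining the monthly aggregation per MP with the quanteda preprocessing steps and the conversion to the stm input format, is shown below; column and object names are illustrative.

```r
library(dplyr)
library(quanteda)

# Concatenate each MP's tweets per calendar month into one document and attach
# the MP-level and electoral-district-level covariates (illustrative objects).
docs <- tweets %>%
  mutate(month = format(created_at, "%Y-%m")) %>%
  group_by(mp_id, month) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  left_join(mp_covariates, by = "mp_id")

# Tokenize, lowercase, remove German stopwords, and stem
# (transcription of umlauts omitted in this sketch).
corp <- corpus(docs, text_field = "text")
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("de")) %>%
  tokens_wordstem(language = "german")

# Document-feature matrix and conversion to the format expected by stm().
stm_input <- convert(dfm(toks), to = "stm")  # list with documents, vocab, meta
```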

5.2 Model fitting and global-level analysis

Before fitting the STM, we need to decide on the number of topics, K. To do so, we use the following four model evaluation metrics: held-out likelihood, semantic coherence, exclusivity, and residuals. The held-out likelihood approach is based on document completion; the higher the held-out likelihood, the more predictive power the model has on average (Wallach et al. 2009). Semantic coherence means that words characterizing a specific topic also appear together in the same documents (Mimno et al. 2011). Exclusivity, on the other hand, indicates the degree to which words characterizing a given topic occur only in that topic. Finally, the residuals metric, which is based on residual dispersion, indicates a (potentially) too small value of K whenever the residual dispersion is larger than one (Taddy 2012).
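These four metrics can be computed over a grid of candidate K values with stm::searchK; a sketch, assuming `stm_input` holds the converted documents, vocabulary, and metadata, and with an illustrative prevalence formula:

```r
library(stm)

# Evaluate K = 5, 10, ..., 40 on held-out likelihood, semantic coherence,
# exclusivity, and residual dispersion (cf. Fig. 2, left panel).
k_search <- searchK(
  documents  = stm_input$documents,
  vocab      = stm_input$vocab,
  K          = seq(5, 40, by = 5),
  prevalence = ~ s(date) + party + state,  # illustrative covariates
  data       = stm_input$meta
)
plot(k_search)
```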

Fig. 2: Left: Model evaluation metrics for hyperparameter K (number of topics). Right: Word cloud for the topic labeled as “Climate Protection”

The left part of Fig. 2 shows these four metrics for a grid of K between five and 40 with step size five. Both \(K=15\) and \(K=20\) seem to be good choices. Given the better interpretability for models with fewer topics, we choose \(K = 15\).

After fitting the model, we manually assign each topic a human-interpretable label; to do so, we use word clouds and top words (see Fig. 2, right panel, and “Appendix B”). Throughout this work, we consider the topics “Climate Protection,” “Right/Nationalist,” “Social/Housing,” and “Europe” for illustration, in particular the first one. To obtain an overview of the model output, different global-level analyses are conducted, such as inspecting global topic proportions \(\bar{{\theta }}_k = \frac{1}{D}\sum _{d=1}^{D}\theta _{d,k}\) or creating a network graph.
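A sketch of the corresponding calls with K = 15 follows; the prevalence formula mirrors the covariates of Sect. 5.3 with illustrative variable names, and the topic labels themselves are assigned manually after inspecting the output.

```r
library(stm)

fit <- stm(
  documents  = stm_input$documents,
  vocab      = stm_input$vocab,
  K          = 15,
  prevalence = ~ s(date) + party + state +
                 s(immigration_share) + s(gdp_per_capita) + s(unemployment_rate),
  data       = stm_input$meta
)

# Top words and a word cloud per topic as a basis for manual labeling
# (cf. Fig. 2, right panel).
labelTopics(fit, n = 10)
cloud(fit, topic = 1)

# Global topic proportions: average the document-level proportions per topic.
theta_bar <- colMeans(fit$theta)
```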

5.3 Topic-metadata relationships

Moving from the global to the document level, we now visualize relationships between document-level topic proportions \(\theta _{d,k}\) and covariates \({\textbf{x}}_d\). In particular, we examine the extent to which German MPs discussed the abovementioned topics over time and in relation to several socioeconomic variables regarding their respective electoral districts. These relationships were estimated by regressing the previously estimated topic proportions on metadata covariates, using either the linear regression-based method of composition (see Fig. 1) or our Beta regression-based methods (see Figs. 3 and 4).

For all regressions, we choose the same linear predictor, containing the date of the Twitter posts, the MP-level categorical covariates political party affiliation and federal state, as well as the electoral district-level continuous socioeconomic covariates immigration share, GDP per capita, and unemployment rate; the effects of the latter three continuous covariates are estimated as smooth functions using B-splines.

To demonstrate the shortcomings of the approach implemented in the stm package, we first apply the estimateEffect function to produce “naïve” estimates of the relationship between estimated topic proportions and document-level covariates. Figure 1 shows the estimated proportion of the topic “Climate Protection” over time, peaking around the UN Climate Action Summit held in September 2019. Importantly, notice that estimateEffect produces predicted topic proportions outside of (0, 1). This is due to the use of a linear regression, which places no restrictions on the range of the dependent variable.
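For reference, the estimates in Fig. 1 correspond to a call of the following form; covariate names are illustrative, and the index of the “Climate Protection” topic depends on the fitted model.

```r
library(stm)

# Linear-regression-based method of composition as implemented in stm.
prep <- estimateEffect(
  1:15 ~ s(date) + party + state +
         s(immigration_share) + s(gdp_per_capita) + s(unemployment_rate),
  stmobj      = fit,
  metadata    = stm_input$meta,
  uncertainty = "Global"
)

# Mean prediction and 95% confidence intervals over time for one topic;
# note that the y-axis is not restricted to (0, 1).
plot(prep, covariate = "date", method = "continuous",
     topics = 3)  # illustrative index of the "Climate Protection" topic
```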

Fig. 3: Mean prediction and 95% confidence intervals for the topic proportions of the topics “Climate Protection,” "Right/Nationalist," "Social/Housing," and "Europe" for different document-level covariates, obtained using a frequentist Beta regression from the R package stmprevalence

Fig. 4: Left: Mean prediction for the topic proportion of topic “Climate Protection” for different document-level covariates, obtained using a Bayesian Beta regression from the R package stmprevalence. Right: 95% (light gray), 90% (gray), and 85% (dark gray) quantiles of the posterior predictive distribution for the topic proportion of topic “Climate Protection”

Next, we evaluate the results when replacing the linear regression by a Beta regression, which restricts the dependent variable to the (0, 1)-interval.

Figure 3 consists of four panels, one for each topic, each panel being made up of four subplots. The top left plot in the top left panel corresponds to the time trend of the climate protection topic. It shows that the overall trend over time is similar to the one in Fig. 1, yet the range is shifted upwards and no negative values are estimated. The three remaining plots of the top left panel depict the relationship of the climate protection topic with the socioeconomic covariates immigration, GDP per capita, and unemployment, measured at the electoral-district level. First, note that, as desired, only values within (0, 1) are obtained. Regarding GDP per capita, we notice an increase in the relevance of the climate protection topic until around EUR 70k, yet for very high-income electoral districts this trend is reversed. The unemployment rate shows an ambiguous relationship, with rather large fluctuations. Finally, the higher the share of immigrants in an electoral district, the less frequently the district’s MPs tend to discuss climate-related subjects on average.

However, one might suspect that this negative relationship between climate protection relevance and immigration is the consequence of spurious correlation: one immigration-related topic might simply be suppressing all other topics. To investigate this, and also to evaluate our approach more broadly, we consider three further topics, “Right/Nationalist,” “Social/Housing,” and “Europe.” Indeed, the frequency of the “Right/Nationalist” topic increases as the electoral district-level immigrant share increases, yet a similar association can also be found for the Europe-related topic; for the topic regarding social issues and housing, no clear trend is recognizable. This leads us to conclude that the negative association between the relevance of the climate protection topic and the immigration share is not merely an artifact of the compositional nature of topic proportions.

Regarding time, the social and European topics do not show any temporal trend, whereas the nationalist topic clearly peaks around September 2018. As for GDP per capita and unemployment rate, only a few reasonably clear trends can be recognized, such as the decrease in the relevance of both the European and the social topic with increasing unemployment rate. However, while some interesting and plausible patterns emerge, we caution against (quantitative) over-interpretation of the observed patterns.

Finally, we display the results of the fully Bayesian approach discussed in Sect. 4.2; for the sake of brevity, we focus on the climate protection topic only. As can be seen in the left plot of Fig. 4, the predicted trajectories of mean topic proportions at different covariate values are mostly similar to those obtained with the frequentist Beta regression, yet the range is compressed and shifted downwards. In addition to the empirical mean, the right plot of Fig. 4 depicts different empirical quantiles of the posterior predictive distribution of topic proportions. Here we can see that topic proportions at given covariate values vary considerably across MPs. More generally, we find that a fully Bayesian approach enables a much more comprehensive analysis of topic-metadata relationships, because it allows for displaying the variation of individual topic proportions observed in the data.

6 Conclusion

Nowadays, large-scale unstructured text from a wide variety of fields is publicly available on social media and various other online platforms. Topic modeling plays an important role in extracting specific information from such data. At the same time, researchers, in particular from the social sciences, increasingly move beyond purely exploratory topic analyses and wish to associate identified topics with metadata. In order to investigate topic-metadata relationships while accounting for the probabilistic nature of topic proportions, the R package stm implements repeated linear regressions of sampled topic proportions on metadata covariates using the method of composition.

In this paper, we identify two main inconsistencies in this original implementation: the inadequate modeling of proportions via linear regression, which allows topic proportions to take on values outside of (0, 1); and the mixing of frequentist regression with Bayesian computations of empirical quantities. We propose remedies for both shortcomings: a Beta regression that respects the restricted range of topic proportions; and a fully Bayesian approach that replaces the current mixture of frequentist and Bayesian methods within the method of composition.

We illustrate our proposed improvements by first applying the STM to a data set containing Twitter posts by German MPs and subsequently employing our methods to estimate relationships between estimated topic proportions and MP-level metadata covariates. It is important to note that our methods merely concern the second-step estimation of topic-metadata relationships and are thus equally applicable to other topic models and beyond.

7 Limitations and Outlook

There are some limitations to our approach, which in turn give rise to future research. Regarding the application presented in this paper, the relationship with Twitter-related metadata such as retweets or likes would be interesting, especially because such metadata would be actively influenced by the topics of the tweets, whereas the socioeconomic covariates used here are of a more explanatory nature. Unfortunately, Twitter-related metadata are not contained in our data set. Another use case-related aspect is document length. Longer documents are generally beneficial for topic models such as the STM, yet in our specific case they hamper the content-related interpretability of the resulting “tweet documents.” We experimented extensively with different aggregation periods, including days and weeks, but concluded that aggregating tweets at a monthly interval constitutes the best compromise between content-related interpretability and sufficient text length.

Both frequentist and Bayesian Beta regression are well-established approaches in the statistical literature, which necessarily limits the methodological novelty of our contribution. However, correctly modeling and illustrating topic-metadata relationships and the corresponding uncertainty is of paramount importance, given the enormous popularity of topic models such as the STM and the fact that conclusions drawn from a misspecified model can be (substantially) misleading (cf. "Appendix A").

Several possibilities exist to build upon our exploratory methods. For instance, our approach could be combined with MCMC-based methods in order to perform inference in a Bayesian setting. If the goal is causal inference beyond exploratory purposes, one must take into account that the estimation of topic proportions induces additional dependence across documents. Developing methods to identify underlying causal mechanisms is the subject of current research (e.g., Egami et al. 2018).