FormalPara Learning Objectives

By the end of this chapter, you will be able to:

  • Explain the motivation for MRP and the circumstances under which it is appropriate to implement.

  • Describe the two steps in producing MRP estimates: model fitting and poststratification.

  • Generate MRP estimates by adapting the provided sample code.

  • Implement more sophisticated variants of MRP, including stacked regression and poststratification (SRP) or multilevel regression and synthetic poststratification (MrsP), where appropriate.

5.1 Introduction

The book you are reading is a testament to the “credibility revolution” in the social sciences (Angrist & Pischke, 2010), a wide-ranging effort spanning multiple disciplines to develop credible, design-based approaches to causal inference. It is difficult to overstate the influence this revolution has had on empirical social science, and the increasing emphasis that policymakers place on informing policy with good research design is a welcome trend.

But as the ongoing replication crisis in experimental psychology (Button et al., 2013) has made clear, good research design alone is insufficient to yield good science. After all, double-blind randomized controlled trials are the “gold standard” of credible causal inference, but small sample sizes and noisy measurement have created a situation where many published effect estimates fail to replicate upon further scrutiny (Loken & Gelman, 2017). To confidently detect causation, one needs both good research design and good measurement.

Often policy researchers are interested in public opinion on some issue, either as an independent or dependent variable. But the surveys we use to measure public opinion are frequently unrepresentative in some important way. Perhaps their respondents come from a convenience sample (Wang et al., 2015), or non-response bias skews an otherwise random sample. Or perhaps the data is representative of some larger population (i.e., a country-level random sample) but contains too few observations to make inferences about a subgroup of interest. Even the largest US public opinion surveys do not have enough respondents to make reliable inferences about lower-level political entities like states or municipalities. Conclusions drawn from low frequency observations – even in a large sample survey – can be wildly misleading (Ansolabehere et al., 2015).

This presents a challenge for researchers: how to take unrepresentative survey data and adjust it so that it is useful for our particular research question. In this chapter, I will demonstrate a method called Multilevel Regression and Poststratification (MRP). Using this approach, the researcher first constructs a model of public opinion (multilevel regression) and then reweights the model’s predictions based on the observed characteristics of the population of interest (poststratification). In the sections that follow, I will describe this approach in detail, accompanied by replication code in the R statistical language.

As we will see, the accuracy of our MRP estimates depends critically on whether the first-stage model makes good out-of-sample predictions. The best first-stage models are regularized (Gelman, 2018) to avoid both over- and underfitting to the survey data. Regularized ensemble models (Ornstein, 2020) with group-level predictors tend to produce the best estimates, especially when trained on large survey datasets.

5.2 How It Works

MRP was first introduced by Gelman and Little (1997), and in the subsequent decades, it has helped address a diverse set of research questions in political science. These range from generating election forecasts using unrepresentative survey data (Wang et al., 2015) to assessing the responsiveness of state (Lax & Phillips, 2012) and local policymakers (Tausanovitch & Warshaw, 2014) to their constituents’ policy preferences.

To demonstrate how the method works, the next section will introduce a running example drawn from the Cooperative Election Study (Schaffner et al., 2021), a 50,000+ respondent study of voters in the United States. The 2020 wave of the study includes a question asking respondents whether they support a policy that would “decrease the number of police on the street by 10 percent, and increase funding for other public services.” Since police reform is a policy issue on which US local governments have a significant amount of autonomy, it would be useful to know how opinions on this issue vary from place to place without having to conduct separate, costly surveys in each area.

The problem is that even a survey as large as the CES has relatively few respondents in some small areas of interest. If we wanted to know, for example, what voters in Detroit thought about police reform, a survey of 50,000 people randomly sampled from across the United States would include, on average, only about 100 people from Detroit. Estimates from such a small sample will not be very precise. And more importantly, those 100 people are unlikely to be representative of the population of Detroit, since the survey was designed to be representative of the country at large.

The core insight of the MRP approach is that we can use similar respondents from similar areas – e.g., Cleveland or Chicago or Pittsburgh – to improve our inferences about public opinion in Detroit. The way we do so is to first fit a statistical model of public opinion, using both individual-level predictors (e.g., race, age, gender, education) and group-level predictors (e.g., median income, population density) from our survey dataset. Then, we reweight the predictions of the model to match the observed demographics and characteristics of Detroit. In this way, we get the most out of the information contained in our survey and produce a better estimate of what Detroit residents think than our small sample from Detroit alone could produce.

5.3 Running Example

To help demonstrate this process, we will draw a small random sample from the CES survey, and, using that sample alone, attempt to estimate state-level public opinion on police reform in each US state. In this way, we can evaluate the accuracy of our MRP estimates and explore how various refinements to the method improve predictive accuracy. This approach mirrors Buttice and Highton (2013), who use disaggregated responses from a large-scale US survey of voters as their target estimand to evaluate MRP’s performance. The Cooperative Election Study data is available here, and we’ll be using a tidied version of the dataset created by the R/cleanup-ces-2020.R script.Footnote 1

We begin by loading two packages, tidyverse and ggrepel, along with a data loader function that reads in the tidied CES data.
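The original listing is not reproduced here, so the following is a minimal sketch; the read_csv() call and file path stand in for the chapter’s data loader function, whose actual name and location are not shown.

```r
library(tidyverse)
library(ggrepel)

# Stand-in for the chapter's data loader: read the tidied CES data
# produced by R/cleanup-ces-2020.R (the file path is an assumption)
ces <- read_csv('data/ces-2020.csv')
```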

This tidied version of the data only includes the 33 states with at least 500 respondents. First, let’s plot the percent of CES respondents who supported “defunding” the policeFootnote 2 by state.

The code below calculates the percentage of respondents in each state who support the police reform policy and plots the result.
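A sketch of that computation, assuming the outcome is stored as a 0/1 indicator; the column name defund_police is my invention, since the original listing is not reproduced:

```r
# Percent of CES respondents who support the policy, by state
state_means <- ces |>
  group_by(abb) |>
  summarize(truth = mean(defund_police))

# Plot support by state (Fig. 5.1)
ggplot(state_means, aes(x = truth, y = fct_reorder(abb, truth))) +
  geom_point() +
  labs(x = 'Percent Who Support Police Reform Policy', y = 'State')
```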

Oregon is the only state where a majority of respondents supported this policy proposal. Note that Fig. 5.1 likely overstates the percent of the total population that supports such a policy, since self-identified Democrats are overrepresented in the CES sample. Nevertheless, these population-level parameters will be a useful target for evaluating the performance of our MRP estimates.

Fig. 5.1 The percent of CES respondents in each state who support reducing police budgets; Oregon has the highest support, at roughly 54%. These are our target estimands

5.3.1 Draw a Sample

Suppose that we did not have access to the entire CES dataset, but only to a random sample of 1,000 respondents. How well can we estimate those state-level means?

The code below draws the sample and summarizes it by state, recording each state’s abbreviation (abb), sample mean (estimate), and number of respondents (num).
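Something like the following, where the seed is arbitrary (the chapter’s actual sample will differ) and defund_police remains an assumed column name:

```r
set.seed(42)  # an arbitrary seed, so the sketch is reproducible

# Draw a random sample of 1,000 respondents
ces_sample <- ces |>
  slice_sample(n = 1000)

# Disaggregated estimates: sample mean and sample size for each state
sample_summary <- ces_sample |>
  group_by(abb) |>
  summarize(estimate = mean(defund_police),
            num = n())

sample_summary
```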

In a sample with only 1,000 respondents, there are several states with very few (or no) respondents. Notice, for example, that this sample includes only four respondents from Arkansas, of whom zero support reducing police budgets. Simply disaggregating and taking sample means is unlikely to yield good estimates, as you can see by comparing those sample means against the truth (Fig. 5.2).

Fig. 5.2 Estimates from disaggregated sample data, plotted against the truth

The code below defines a function that plots a set of state-level estimates against the truth.
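A sketch of such a helper; the function name is my own, and it reuses the state_means object computed earlier:

```r
# Plot a set of state-level estimates against the known state means
plot_against_truth <- function(estimates) {
  state_means |>
    left_join(estimates, by = 'abb') |>
    ggplot(aes(x = truth, y = estimate, label = abb)) +
    geom_text_repel() +
    geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
    labs(x = 'Truth', y = 'Estimate')
}

plot_against_truth(sample_summary)  # produces Fig. 5.2
```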

These are clearly poor estimates of state-level public opinion. The four respondents from Arkansas simply do not give us enough information to adequately measure public opinion in that state. But one of the key insights behind MRP is that the respondents from Arkansas are not the only respondents who can give us information about Arkansas! There are other respondents in, for example, Missouri, who are similar to Arkansas residents in their observed characteristics. If we can determine the characteristics that predict support for police reform using the entire survey sample, then we can use those predictions – combined with demographic information about Arkansans – to generate better estimates. The trick, in essence, is that our estimate for Arkansas will borrow information from similar respondents in other states.

The method proceeds in three steps.

5.3.1.1 Step 1: Fit a Model

First, we fit a model of our outcome, using observed characteristics of the survey respondents as predictors. To demonstrate, let’s fit a simple logistic regression model including only four demographic predictors: gender, education, race, and age.

The code below fits this model to our 1,000-person sample, with gender, educ, race, and age as predictors.
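For instance (the predictor names follow the tidied CES data; the outcome name is still assumed):

```r
# Step 1: a logistic regression with four demographic predictors
model1 <- glm(defund_police ~ gender + educ + race + age,
              data = ces_sample,
              family = binomial(link = 'logit'))
```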

5.3.1.2 Step 2: Construct the Poststratification Frame

The poststratification stage requires the researcher to know (or estimate) the joint frequency distribution of predictor variables in each state. This information is stored in a “poststratification frame,” a matrix where each row is a unique combination of characteristics, along with the observed frequency of that combination. Often, one constructs this frequency distribution from Census micro-data (Lax & Phillips, 2009). For our demonstration, I will compute it directly from the CES.

The code below builds the frame from the CES, counting the number of respondents (n) in each unique combination of abb, gender, educ, race, and age.
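A sketch of the construction:

```r
# Step 2: each row is a unique combination of state and demographics;
# n counts how many CES respondents fall in that cell. (Any cells with
# factor levels unseen in the sample would need to be dropped before
# prediction.)
psframe <- ces |>
  count(abb, gender, educ, race, age)
```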

5.3.1.3 Step 3: Predict and Poststratify

With the model and poststratification frame in hand, the final step is to generate frequency-weighted predictions of public opinion. For each cell in the poststratification frame, append the model’s predicted probability of supporting police defunding.

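Continuing the sketch with the objects defined above:

```r
# Step 3a: append the model's predicted probability to each cell
psframe$predicted <- predict(model1, newdata = psframe, type = 'response')
```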

Then, the poststratified estimates are the frequency-weighted means of those predictions.

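In code:

```r
# Step 3b: frequency-weighted mean of the predictions in each state
mrp_estimates <- psframe |>
  group_by(abb) |>
  summarize(estimate = weighted.mean(predicted, w = n))
```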

Let’s see how these estimates compare with the known values (Fig. 5.3).

Fig. 5.3 Underfit MRP estimates from complete pooling model (correlation with truth = 0.31; mean absolute error = 0.028)

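Reusing the plotting helper sketched earlier:

```r
plot_against_truth(mrp_estimates)  # produces Fig. 5.3
```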

These estimates, though still imperfectly correlated with the truth, are much better than the previous estimates from disaggregation. Notice, in particular, that the estimate for Arkansas went from 0% to roughly 39%, reflecting the significant improvement that comes from using more information than the four Arkansans in our sample can provide.

But we can still do better. In the following sections, I will show how successive improvements to the first-stage model can yield more reliable poststratified estimates.

5.3.2 Beware Overfitting

A common instinct among social scientists building models is to take a “kitchen sink” approach, including as many explanatory variables as possible (Achen, 2005). This is counterproductive when the objective is out-of-sample predictive accuracy. To illustrate, let’s estimate a model with a separate intercept term for each state – a “fixed effects” model. Because our sample contains several states with very few observations, these state-specific intercepts will be overfit to sampling variability (Fig. 5.4).

Fig. 5.4 Overfit MRP estimates from fixed effects model (correlation with truth = 0.13; mean absolute error = 0.093)

The code below repeats the full pipeline (model fitting, poststratification frame construction, prediction, and poststratification) with the fixed effects specification.
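A sketch of that pipeline, under the same assumed names as before:

```r
# Fit a model with a separate intercept for each state
model2 <- glm(defund_police ~ abb + gender + educ + race + age,
              data = ces_sample,
              family = binomial(link = 'logit'))

# Restrict the poststratification frame to states observed in the
# sample, so predict() never sees a state the model wasn't fit on
psframe2 <- psframe |>
  filter(abb %in% unique(ces_sample$abb))

# Predict and poststratify as before
psframe2$predicted <- predict(model2, newdata = psframe2, type = 'response')

fixed_effects_estimates <- psframe2 |>
  group_by(abb) |>
  summarize(estimate = weighted.mean(predicted, w = n))

plot_against_truth(fixed_effects_estimates)  # produces Fig. 5.4
```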

These poststratified estimates perform about as well as the disaggregated estimates from Fig. 5.2. Because each state’s intercept is estimated separately, the overfit model forgoes the advantage of “partial pooling” (Park et al., 2004): borrowing information from respondents in other states. Note that the estimate for Arkansas is once again 0%.

5.3.3 Partial Pooling

A better approach is to estimate a multilevel model (also known as a “varying intercepts” or “random effects” model) that includes group-level covariates. In the model below, I estimate varying intercepts by US Census division and include the state’s 2020 Democratic vote share as a group-level covariate. The result is a marked improvement over Fig. 5.3, particularly for West Coast states like Oregon, Washington, and California (Fig. 5.5).

Fig. 5.5 MRP estimates from model with partial pooling (correlation with truth = 0.39; mean absolute error = 0.033)

The code below fits this model with the lme4 package, then repeats the poststratification frame construction, prediction, and poststratification steps.
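A sketch with lme4; the division and dem_vote_share column names are my assumptions about the tidied data:

```r
library(lme4)

# Varying intercepts by Census division, plus the state's 2020
# Democratic vote share as a group-level covariate
model3 <- glmer(
  defund_police ~ (1 | division) + dem_vote_share +
    gender + educ + race + age,
  data = ces_sample,
  family = binomial(link = 'logit')
)

# Rebuild the poststratification frame with the group-level variables
# (dem_vote_share is constant within state, so it passes through)
psframe3 <- ces |>
  count(abb, division, dem_vote_share, gender, educ, race, age)

psframe3$predicted <- predict(model3, newdata = psframe3,
                              type = 'response', allow.new.levels = TRUE)

partial_pooling_estimates <- psframe3 |>
  group_by(abb) |>
  summarize(estimate = weighted.mean(predicted, w = n))

plot_against_truth(partial_pooling_estimates)  # produces Fig. 5.5
```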

5.3.4 Sample Size Is Critical

MRP’s performance depends heavily on the quality and size of the researcher’s survey sample. Up to now, we’ve been working with a random sample of 1,000 respondents, and though the resulting estimates are better than the raw sample means, their performance has been somewhat underwhelming. Suppose instead we had a sample of 5,000 respondents (Fig. 5.6).

Fig. 5.6 Poststratified estimates with a survey sample of 5,000 (correlation with truth = 0.68; mean absolute error = 0.021)

The code below repeats the entire pipeline (model fitting, prediction, and poststratification) using the larger sample.
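The sketch is unchanged except for the sample size:

```r
# Draw a larger sample and refit the multilevel model
ces_sample_5000 <- ces |>
  slice_sample(n = 5000)

model4 <- glmer(
  defund_police ~ (1 | division) + dem_vote_share +
    gender + educ + race + age,
  data = ces_sample_5000,
  family = binomial(link = 'logit')
)

psframe3$predicted <- predict(model4, newdata = psframe3,
                              type = 'response', allow.new.levels = TRUE)

estimates_5000 <- psframe3 |>
  group_by(abb) |>
  summarize(estimate = weighted.mean(predicted, w = n))

plot_against_truth(estimates_5000)  # produces Fig. 5.6
```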

Now MRP really shines. With more observations, the first-stage model can better predict opinions of out-of-sample respondents, which dramatically improves the poststratified estimates.

5.3.5 Stacked Regression and Poststratification (SRP)

Ultimately, the accuracy of one’s poststratified estimates depends on the out-of-sample predictive performance of the first-stage model. As we’ve seen above, the challenge is to thread the needle between overfitting and underfitting. Several recent papers (Bisbee, 2019; Broniecki et al., 2022; Ornstein, 2020) have shown that approaches from machine learning can help to automate this process, particularly with large survey samples.

In the code below, I’ll demonstrate how an ensemble of models – using the same set of predictors but different methods for combining them into predictions – can yield superior performance to a single multilevel regression model. In particular, I will fit a “stacked regression” (Breiman, 1996), which makes predictions based on a weighted average of multiple models, where the weights are assigned by cross-validated prediction performance (van der Laan et al., 2007). The literature on ensemble models is extensive, but for good entry points, I recommend Breiman (1996), Breiman (2001), and Montgomery et al. (2012) (Fig. 5.7).

Fig. 5.7 Estimates from an ensemble first-stage model (correlation with truth = 0.83; mean absolute error = 0.019)

The code below constructs the poststratification frame, fits the stacked ensemble, and predicts and poststratifies as before.
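One way to sketch this with the SuperLearner package; the particular candidate library is illustrative, and because most wrappers expect numeric predictors, the factors are expanded into dummy variables first:

```r
library(SuperLearner)

# Expand factors into dummy columns; the sample and the
# poststratification frame must end up with identical columns,
# which requires matching factor levels in both
make_dummies <- function(data) {
  data |>
    select(division, dem_vote_share, gender, educ, race, age) |>
    model.matrix(~ . - 1, data = _) |>
    as.data.frame()
}

# Stack a GLM, a penalized regression, and a random forest, weighted
# by cross-validated prediction performance
sl_model <- SuperLearner(
  Y = ces_sample_5000$defund_police,
  X = make_dummies(ces_sample_5000),
  family = binomial(),
  SL.library = c('SL.glm', 'SL.glmnet', 'SL.ranger')
)

# Predict for every cell and poststratify as before
psframe3$predicted <- predict(sl_model,
                              newdata = make_dummies(psframe3))$pred[, 1]

srp_estimates <- psframe3 |>
  group_by(abb) |>
  summarize(estimate = weighted.mean(predicted, w = n))

plot_against_truth(srp_estimates)  # produces Fig. 5.7
```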

The performance gains in Fig. 5.7 reflect the improvement that comes from modeling “deep interactions” in the predictors of public opinion (Ghitza & Gelman, 2013). If, for example, income better predicts partisanship in some states but not in others (Gelman et al., 2007), then a model that captures that moderating effect will produce better poststratified estimates than one that does not. Machine learning techniques like random forest (Breiman, 2001) are especially useful for automatically detecting and representing such deep interactions, and stacked regression and poststratification (SRP) tends to outperform MRP in simulations, particularly for training data with large sample size (Ornstein, 2020).

5.3.6 Synthetic Poststratification

Researchers rarely have access to the entire joint distribution of individual-level covariates. This can be limiting, since there may be a variable that one would like to include in the first-stage model but cannot, because it is not in the poststratification frame. Leemann and Wasserfallen (2017) suggest an extension of MRP, which they (delightfully) dub “multilevel regression and synthetic poststratification” (MrsP). Lacking the full joint distribution of covariates for poststratification, one can instead create a synthetic poststratification frame by assuming that the additional covariates are statistically independent of one another. So long as the first-stage model is linear-additive, this approach yields the same predictions as if you knew the true joint distribution!Footnote 3 And even if the first-stage model is not linear-additive, simulations suggest that the improved performance from additional predictors tends to outweigh the error introduced in the poststratification stage.

Here are some CES covariates that we might want to include in our model of police reform:

  • How important religion is to the respondent.

  • Whether the respondent lives in an urban, rural, or suburban area.

  • Whether the respondent or a member of the respondent’s family is a military veteran.

  • Whether the respondent owns or rents their home.

  • Whether the respondent is the parent or guardian of a child under the age of 18.

These variables are likely to be useful predictors of opinion about police reform, and the first-stage model could be improved by including them. But there is no dataset (that I know of) that would allow us to compute a state-level joint probability distribution over every one of them. Instead, we would typically only know the marginal distribution of each covariate (e.g., the percent of a state’s residents who live in military households, or the percent who live in urban areas). So a synthetic poststratification approach may prove helpful.

To create a synthetic poststratification frame, we create a set of marginal probability distributions and multiply them together.Footnote 4

The code below fits the expanded first-stage model, converts the poststratification frame’s frequencies to probabilities, computes the marginal distribution of each new variable, and then combines the marginals with the frame by multiplying the probabilities together.
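A sketch of the construction; the new covariate names (religion_importance, urbanicity, military_household, homeowner, parent) are my assumptions about the tidied data:

```r
# First-stage model including the additional covariates
model5 <- glm(defund_police ~ gender + educ + race + age +
                religion_importance + urbanicity + military_household +
                homeowner + parent,
              data = ces_sample_5000,
              family = binomial(link = 'logit'))

# Convert the frame's frequencies into within-state probabilities
psframe_prob <- ces |>
  count(abb, gender, educ, race, age) |>
  group_by(abb) |>
  mutate(prob = n / sum(n)) |>
  select(-n)

# The state-level marginal distribution of one new covariate
get_marginal <- function(data, variable) {
  data |>
    count(abb, {{ variable }}) |>
    group_by(abb) |>
    mutate(marginal_prob = n / sum(n)) |>
    select(-n)
}

# Join each marginal onto the frame and multiply the probabilities,
# which treats the new covariate as independent of the others
synthetic_psframe <- psframe_prob |>
  left_join(get_marginal(ces, religion_importance),
            by = 'abb', relationship = 'many-to-many') |>
  mutate(prob = prob * marginal_prob) |>
  select(-marginal_prob)

# ...then repeat the join-and-multiply step for urbanicity,
# military_household, homeowner, and parent
```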

Then, poststratify as normal using the synthetic poststratification frame (Fig. 5.8).

Fig. 5.8 Estimates from synthetic poststratification, including additional covariates (correlation with truth = 0.81; mean absolute error = 0.019)

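Sketching the final step:

```r
# Predict for each synthetic cell, then take probability-weighted means
synthetic_psframe$predicted <- predict(model5,
                                       newdata = synthetic_psframe,
                                       type = 'response')

synthetic_estimates <- synthetic_psframe |>
  group_by(abb) |>
  summarize(estimate = weighted.mean(predicted, w = prob))

plot_against_truth(synthetic_estimates)  # produces Fig. 5.8
```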

5.3.7 Best-Performing Estimates

As a final demonstration, suppose we had access to the entire joint distribution over those covariates, and our first-stage model was a Super Learner ensemble. This combination yields the best-performing estimates yet (Fig. 5.9).

Fig. 5.9 The best performing estimates, using a large survey sample, ensemble first-stage model, and full set of predictors (correlation with truth = 0.83; mean absolute error = 0.019)

The code below constructs the full poststratification frame, fits the Super Learner ensemble, and predicts and poststratifies.
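A sketch combining the two refinements, reusing the assumed covariate names and dummy-expansion approach from the earlier sketches:

```r
# The full joint distribution over every covariate, computable here
# only because we observe the complete CES
full_psframe <- ces |>
  count(abb, division, dem_vote_share, gender, educ, race, age,
        religion_importance, urbanicity, military_household,
        homeowner, parent)

# Dummy-expand the full predictor set for SuperLearner
make_dummies_full <- function(data) {
  data |>
    select(division, dem_vote_share, gender, educ, race, age,
           religion_importance, urbanicity, military_household,
           homeowner, parent) |>
    model.matrix(~ . - 1, data = _) |>
    as.data.frame()
}

# Fit the ensemble on the 5,000-respondent sample
sl_full <- SuperLearner(
  Y = ces_sample_5000$defund_police,
  X = make_dummies_full(ces_sample_5000),
  family = binomial(),
  SL.library = c('SL.glm', 'SL.glmnet', 'SL.ranger')
)

# Predict for every cell and poststratify
full_psframe$predicted <- predict(
  sl_full, newdata = make_dummies_full(full_psframe))$pred[, 1]

best_estimates <- full_psframe |>
  group_by(abb) |>
  summarize(estimate = weighted.mean(predicted, w = n))

plot_against_truth(best_estimates)  # produces Fig. 5.9
```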

The results shown in Fig. 5.9 reflect all the gains from a larger sample size, ensemble modeling, and a full set of individual-level and group-level predictors.

5.4 Conclusion

For policy researchers interested in public opinion, MRP and its various refinements offer a useful approach to get the most out of survey data. The results I’ve presented in this chapter suggest a few lessons to keep in mind when applying MRP to one’s own research.

First, be wary of first-stage models that are underfit or overfit to the survey data. As we saw in Fig. 5.3, MRP estimates with too few predictors tend to over-shrink toward the grand mean.Footnote 5 Using such estimates to inform subsequent causal inference would understate the differences between regions. Conversely, models that are overfit to survey data (e.g., Fig. 5.4) will tend to exaggerate regional differences.

Second, new techniques like synthetic poststratification and stacked regression can help researchers manage the trade-off between underfitting and overfitting. Synthetic poststratification allows for the inclusion of more relevant predictors, and regularized ensemble models help ensure that the predictions are not overfit to noisy survey samples. The best estimates often come from combining these two approaches.

Finally, recall that the most significant performance gains in our demonstration came not from more sophisticated modeling techniques, but from more data. As we saw in Fig. 5.6, working with a larger survey yielded greater improvements than any tinkering with the first-stage modeling choices. MRP is not a panacea, and one should be skeptical of estimates produced from small-sample surveys, no matter how the first-stage model is specified.

In the code above, I emphasize “do-it-yourself” approaches to MRP – fitting a model, building a poststratification frame, and producing estimates separately. But there are now a number of R packages with useful functions to help ease the process. In particular, I would encourage curious readers to explore the autoMrP package (Broniecki et al., 2022), which implements the ensemble modeling approach described above and performs quite well in simulations compared to existing alternatives.

Further Suggested Readings

  • McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. Boca Raton: Taylor and Francis, CRC Press. (particularly chapter 13).

  • Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2021. Regression and Other Stories. Cambridge, United Kingdom: Cambridge University Press. (particularly chapter 17).

Review Questions

  1. What other individual-level or group-level variables might be useful to include in the first-stage model of opinion on police reform, if they were available?

  2. Why is regularization crucial for constructing good first-stage MRP models?

  3. What are the benefits and potential downsides of using a synthetic poststratification frame?