Estimates of subnational public opinion are necessary to study many important questions in political science. Despite their usefulness, accurate estimates can be difficult to obtain because of the cost of collecting sufficiently large samples in every subnational unit of interest (e.g., states or regions). In response to this lack of data, scholars have for decades developed alternative approaches to estimate subnational opinion from national polls (e.g., Erikson, Wright, and McIver 1993).
In recent years, subnational opinion estimation has been substantially aided by the development of multilevel regression and poststratification (MRP), which allows for more accurate estimates than previous methods (Park, Gelman, and Bafumi 2004). MRP involves fitting a predictive model of individual opinion from survey data, predicting opinion for different demographic and geographic subgroups in the public, and then taking a weighted average using the known distribution of these subgroups within subnational geographies. It has become the gold standard for estimating opinion, primarily based on studies of state and legislative district opinion in the US (Lax and Phillips 2009b; Warshaw and Rodden 2012) and in Europe (Leemann and Wasserfallen 2016; Lipps and Schraff 2021; Toshkov 2015).
The development of MRP has made it considerably more straightforward for scholars to estimate opinion on individual issues within subnational geographies. This has proven particularly useful for studies of elite responsiveness (Lax and Phillips 2009a, 2012; Tausanovitch and Warshaw 2014), electoral behavior (Ghitza and Gelman 2013; Gelman et al. 2016), and public opinion (Shirley and Gelman 2015), among others. By estimating opinion on individual issues, scholars can examine more specific determinants of policy and opinion than latent opinion measures allow.
While MRP has provided considerable advancements, it has primarily been developed with polls whose respondents represent simple random samples (or close approximations of simple random samples) of the target population in mind. Yet, there are many settings in which such samples cannot be produced. In such cases, scholars have instead turned to alternative sampling designs. One popular alternative is cluster sampling – also called area-probability sampling – which produces a sample by taking multiple respondents from a small number of randomly drawn geographic areas.
Cluster sampling was nearly ubiquitous among US polling firms into the 1980s, as it allowed for high-quality national samples to be obtained for face-to-face interviews at a comparatively low cost (Warshaw 2016). It is also a common sampling method for large, multinational and national academic surveys (e.g., AmericasBarometer, Asian Barometer, the American National Election Study [ANES], and General Social Survey [GSS]). As a result, public opinion data in many contexts has been predominantly generated through the use of cluster-sampling methods. These procedures commonly draw clusters nationally or within large regions, conditional on some geographic characteristics (e.g., the “urbanness” of communities). This allows them to produce nationally representative samples in a single poll without necessarily generating representative samples at subnational levels of interest, such as states. This makes these polls particularly susceptible to inaccurate opinion estimation at smaller geographies using conventional MRP approaches (Stollwerk 2017).
As a result, scholarship is limited in any domain which relies on subnational estimates of opinion. Among these are descriptive studies of opinion across space and time, as well as research on policy responsiveness, which uses opinion as an independent variable to predict the positions taken by legislators, governments, or political parties. Measurement error in the opinion variable may introduce a number of challenges, including the attenuation of regression coefficients. Common workarounds, such as pooling and disaggregating surveys over many years (e.g., a decade or more), do not allow for the estimation of effects over time.
In this paper, I introduce and test two approaches that scholars may employ to improve opinion estimation with MRP from cluster-sampled polls. I begin by providing background information about MRP and the cluster sampling methods that underpin the following sections. In Section 2, I illustrate problems that may arise from using traditional MRP in the case of cluster-sampled polls. I first use an empirical example of abortion opinion in 1980, in which MRP returns estimates that lack face validity. I show that the uneven distribution of clusters across states is likely the source of the problem. I clarify this intuition using simulations that demonstrate the limitations of traditional MRP under this sampling design.
Sections 3 and 4 propose two possibly complementary solutions to the problem. In each section, I conduct a simulation analysis and validate the methods against the case of presidential polls. My first solution, outlined in Section 3, pools responses from multiple surveys. This can address the problem by increasing the number of distinct clusters and thus hopefully mitigating problems inherent to a single survey. However, there are significant data limitations that may make this approach impossible in many circumstances. First, few issues are repeatedly surveyed using the same question wording over narrow windows of time. Second, I find that pooling only produces significant improvements in estimation when the number of clusters (and not merely sample size within clusters) increases. As a result, it would generally be necessary to find the same question asked in polls fielded by different firms.
Section 4 presents an alternative, model-based strategy – Clustered MRP (CMRP) – that incorporates features of sampling design into the model. By explicitly including the geographic levels used in a pollster’s sampling procedure in MRP’s predictive model, CMRP properly accounts for polling firms’ sampling protocols and allows information from similar geographic areas to be pooled across state and regional lines. Even using a single poll, I find that CMRP can reduce mean absolute error (MAE) in state-level opinion estimates by between 2.1% and 3.3% compared to standard MRP approaches. These accuracy gains are similar in magnitude to those associated with other improvements to MRP, such as using machine learning or models with deeper interactions for opinion estimation with modern polls (Ornstein 2020; Goplerud 2024).
Finally, I also discuss the concerns of particular cluster-sampling procedures and steps needed to produce poststratification data from the Census for CMRP in Section 5, as well as in the Supplementary Appendix. While this paper focuses on historical polls in the US, my approach may be applicable in other contexts in which cluster-sampling is common, including comparative multinational surveys.
MRP and cluster sampling
In recent decades, scholars have turned to MRP to estimate subnational public opinion from the individual responses in national polls. Developed by Gelman and Little (1997) and Park, Gelman, and Bafumi (2004), MRP has been shown to outperform alternative methods, such as disaggregation, at the state (Lax and Phillips 2009b), congressional and state legislative district (Warshaw and Rodden 2012), and municipal levels (Tausanovitch and Warshaw 2013). This is because MRP uses both demographic and geographic characteristics of respondents and partially pools information across geographies to better model the drivers of individual opinion.
Particularly compelling for applied researchers, MRP has the potential to produce reliable estimates of state opinion from a single, conventional national survey (Lax and Phillips 2009b), although it may produce less accurate estimates from such polls when geographic variables are poor predictors of individual-level opinion (Buttice and Highton 2013).
MRP with polls using simple random samples
There are two steps to estimating opinion with MRP: First, a multilevel model is fit to predict individual response to a binary question using individual-level data from a survey. Then, using the model and the joint distribution of demographic characteristics in the population from the Census, the scholar poststratifies to estimate the average response to the question of interest at the state (or other subnational) level.
The typical multilevel logistic model for MRP contains as predictors detailed demographic information about respondents, state and region indicators, and one or two state-level variables (e.g., presidential vote and religiosity). Here, I formalize a standard MRP model that could be fit from data available in a standard Gallup Poll fielded in the US during the 1980s, using the notation from Gelman and Hill (2007). The outcome of interest, $ {y}_i $ , indicates an individual respondent $ i $ ’s response to a survey question. The $ \alpha $ terms are random effects corresponding to demographic or geographic groups; so $ {\alpha}_{r\left[i\right]}^{\mathrm{race}} $ indicates the random effect for the racial group $ r $ to which respondent $ i $ belongs. The state random effect $ {\alpha}_{s\left[i\right]}^{\mathrm{state}} $ is modeled as a function of the region and contextual variables included (here I use $ RepVote $ , Republican vote share in the last presidential election, and $ Relig $ , the proportion of the state that identifies as evangelical Christian or Mormon).
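Written out, a model matching this description takes roughly the following form. This is a sketch assembled from the verbal description, not the exact published specification: the intercept $ {\beta}^0 $ , the index letters $ g $ , $ a $ , $ e $ , and $ m $ , the set of demographic terms (taken from the poststratification cells of race, sex, age, and education), and the variance parameters are assumptions.

$$ \Pr\left({y}_i=1\right)={\mathrm{logit}}^{-1}\left({\beta}^0+{\alpha}_{r\left[i\right]}^{\mathrm{race}}+{\alpha}_{g\left[i\right]}^{\mathrm{sex}}+{\alpha}_{a\left[i\right]}^{\mathrm{age}}+{\alpha}_{e\left[i\right]}^{\mathrm{educ}}+{\alpha}_{s\left[i\right]}^{\mathrm{state}}\right) $$

$$ {\alpha}_r^{\mathrm{race}}\sim N\left(0,{\sigma}_{\mathrm{race}}^2\right),\quad {\alpha}_g^{\mathrm{sex}}\sim N\left(0,{\sigma}_{\mathrm{sex}}^2\right),\quad {\alpha}_a^{\mathrm{age}}\sim N\left(0,{\sigma}_{\mathrm{age}}^2\right),\quad {\alpha}_e^{\mathrm{educ}}\sim N\left(0,{\sigma}_{\mathrm{educ}}^2\right) $$

$$ {\alpha}_s^{\mathrm{state}}\sim N\left({\alpha}_{m\left[s\right]}^{\mathrm{region}}+{\beta}^{RepVote}\cdot {RepVote}_s+{\beta}^{Relig}\cdot {Relig}_s,\;{\sigma}_{\mathrm{state}}^2\right),\qquad {\alpha}_m^{\mathrm{region}}\sim N\left(0,{\sigma}_{\mathrm{region}}^2\right) $$

Here $ m\left[s\right] $ denotes the region containing state $ s $ , so the state effect is partially pooled toward a prediction built from its region and its two contextual variables.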
This model allows us to predict the expected level of support for the policy $ y $ among each “type” of person in the population – that is, each of the 4,800 possible combinations of $ race\times sex\times age\times educ\times state $ . These predictions are then used to poststratify and aggregate to the state level by taking a weighted average where the weights are the share of each combination of demographic variables in the state’s population.
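The poststratification step is a population-weighted average of cell-level predictions. A minimal sketch in Python follows; the function name `poststratify`, the dictionary layout, and every number in the toy inputs are hypothetical and chosen only for illustration (two cells per state rather than the full set of demographic combinations).

```python
def poststratify(cell_predictions, cell_counts):
    """Return the population-weighted mean prediction for each state.

    cell_predictions: {(state, cell): predicted probability of support}
    cell_counts:      {(state, cell): number of people of that type in the state}
    """
    weighted_sum = {}
    population = {}
    for (state, cell), count in cell_counts.items():
        prediction = cell_predictions[(state, cell)]
        weighted_sum[state] = weighted_sum.get(state, 0.0) + count * prediction
        population[state] = population.get(state, 0.0) + count
    # Dividing by the state total turns counts into population shares.
    return {state: weighted_sum[state] / population[state] for state in population}


# Toy inputs (made-up numbers).
cell_predictions = {
    ("UT", "type_1"): 0.40,
    ("UT", "type_2"): 0.20,
    ("NY", "type_1"): 0.70,
    ("NY", "type_2"): 0.50,
}
cell_counts = {
    ("UT", "type_1"): 600,
    ("UT", "type_2"): 400,
    ("NY", "type_1"): 500,
    ("NY", "type_2"): 500,
}
state_estimates = poststratify(cell_predictions, cell_counts)
```

In practice the predictions would come from the fitted multilevel model and the counts from Census joint distributions; the averaging logic is unchanged.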
Estimates may be further improved by fitting a “deep” model that adds interactions among demographic and geographic variables, capturing variation that a model without interactions misses (Ghitza and Gelman 2013; Goplerud 2024).
Cluster-sampled surveys
The above model has been developed assuming survey respondents are independently drawn from the population, as in a simple random sample. However, this assumption may not hold in many circumstances (Berinsky 2017). Instead, when such sampling methods are impossible or impractical, pollsters often rely on alternative methods to produce samples that are representative of target populations. For many surveys, this target is the population of an entire country, and not any one subnational unit.
One of the most common procedures used for survey sampling is cluster sampling (also called area-probability sampling). Variations of this approach (and especially multi-stage sampling methods) were almost universally used by US polling firms from the 1950s to the 1980s (Warshaw 2016).Footnote 1 Two major academic studies, the GSS and ANES, still use cluster sampling to produce all or part of their samples today, as do many large multinational surveys (see, e.g., Latin American Public Opinion Project 2019; Asian Barometer Survey 2003). As a result, the only quality polls available to scholars in many contexts are likely to be cluster-sampled.
The aim of cluster sampling is straightforward: researchers randomly select a set of “clusters,” such as cities or neighborhoods, from which they randomly draw people to interview. This approach is appealing to survey researchers because it reduces the costs of producing a nationally representative sample for surveys that rely on in-person interviews.
As an example, consider the Gallup Poll during the late 1970s and early 1980s, on which I based the sampling algorithm in the simulation studies discussed below (Gallup Organization 1980b). First, Gallup assigned each state to a region. Within each region, geographic areas were assigned to a “size-of-community stratum” based on urban/rural status and population. These region-stratum combinations form the basis of primary sampling units (PSUs). Then, Gallup randomly selected two localities from each PSU, weighting by population, and repeated the process using progressively smaller geographies to identify a block or cluster of blocks. The resulting sample is expected to produce reasonable estimates of national opinion. A more detailed description of this procedure can be found in Supplementary Appendix A.
Challenges of MRP with cluster-sampled polls
While some scholars have used MRP with cluster-sampled polls (e.g., Shirley and Gelman 2015), its performance has been primarily validated on polls that use simple random samples (or close approximations). As a result, it is less clear whether MRP should perform well under more complex sampling procedures – such as cluster sampling – in which respondents are not drawn independently from the public (Stollwerk 2017). In this section, I provide a simple empirical example that illustrates problems that may arise.
Case study: abortion opinion
To illustrate the problems that may arise from using MRP with cluster-sampled polls, I estimated opinion on abortion from a survey fielded by Gallup in September 1980 and downloaded from the Roper Center for Public Opinion Research (Gallup Organization 1980a). The survey is typical of the period and used cluster sampling to produce a set of respondents ( $ N=1{,}602 $ ) who were interviewed in person at their homes. I used MRP to estimate state-level opinion using responses to a question asking whether respondents “generally favor” or “generally oppose” a ban on abortion.Footnote 2
I modeled individual opinion in Stan (Carpenter et al. 2017) with random effects for race, female, the race × female interaction, age group, education, state, and region. I also included state Republican vote share and the share of the population that is evangelical Christian or Mormon as linear predictors. This specification is typical for MRP for a question about social issues like abortion. All variables in the model were used in poststratification, using weights from joint distributions of the population downloaded from IPUMS-NHGIS (Manson et al. 2021). Presidential vote data are from Leip (2021), and religion data are from the Churches and Church Membership in the United States study (Grammich et al. 2019).
Figure 1 reports the results from the MRP model. The lefthand panel maps the share of each state that opposed a ban on abortion, as estimated with MRP. The righthand panel compares modeled opposition to an abortion ban with the measure of state-level abortion liberalism produced by Brace et al. (2002) using pooled surveys from 1974 to 1998.Footnote 3 A higher score on their scale corresponds to more liberal opinion on abortion. The MRP results are clearly only minimally correlated with the baseline estimates of abortion opinion. Similar analyses using latent policy liberalism estimates provide similar results (these can be found in Supplementary Appendix F). Many state estimates also clearly lack face validity. For example, it is unlikely that Utah would have the highest opposition to a ban on abortions in the country. Conversely, more liberal states in the Northeast and upper Midwest show surprisingly low levels of opposition to banning abortions.
One reason for error is that state estimates may depend on only a small number of clusters from particular (non-representative) parts of the state. For example, Table 1 shows the five Utah respondents in the poll. All five appear to come from one cluster in a city of 50,000–99,999 people; only two answered the question about an abortion ban, and both were opposed.Footnote 4 Respondents in the Utah cluster are more likely to be Democrats or Independents than Utahans of the era as a whole (Erikson, Wright, and McIver 1993). In principle, the limited nature of this Utah subsample is the type of problem MRP handles well. However, in the case of cluster sampling, the predictive model attributes the average opinion in this cluster to Utah as a whole, rather than to the stratum $ \times $ region combination that the cluster was drawn to represent.
Although subgroup means converge on true population means when a large number of clusters are included in a sample (Kish and Frankel 1974), scholars often have access to a limited number of surveys (indeed, there may be only one survey asking a particular question in the time period of interest). As a result, the relevant question for using MRP with cluster-sampled polls is not whether state-level subsamples are representative of the population in expectation over repeated surveys, but rather whether the respondents from a given state in a single poll are likely to be representative. In Supplementary Appendix B, I show that state subsamples in individual polls are often not representative of their states. This problem is especially severe in states with lower populations, which are much more likely to have no respondents included in a poll, and whose subsamples are less representative of the state population (because they include fewer clusters).Footnote 5 It is also intuitively likely to be the case in states with more diverse and segregated populations, where any two clusters may be very different from one another. Erikson, Wright, and McIver (1993) note these potential problems in discussing their decision to use disaggregated CBS/New York Times polls, rather than a cluster-sampled survey.
Simulation study 1: MRP with cluster-sampled polls
To better understand and illustrate the problems that may arise when using MRP with cluster-sampled polls, I conducted a series of simulations.
In each simulation, I generated one million “voters” distributed across 50 states (according to their actual share of the population) and seven size-of-community strata (according to a random distribution that I hold constant across simulations). Next, I randomly assigned each voter a binary demographic predictor (which I call race) distributed according to each state’s real-world white and non-white populations. I then drew a survey response for each voter from the data generating process described below.
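In the notation of the MRP model, the response model laid out in the next two paragraphs can be sketched as follows. The inverse-logit link, the intercept $ {\beta}^0 $ , and the index letters $ t $ (stratum) and $ m $ (region) are assumptions made for illustration; the text specifies only that the effects are normally distributed and that responses are Bernoulli draws.

$$ {y}_i\sim \mathrm{Bernoulli}\left({p}_i\right),\qquad {p}_i={\mathrm{logit}}^{-1}\left({\beta}^0+{\alpha}_{r\left[i\right]}^{\mathrm{race}}+{\alpha}_{s\left[i\right]}^{\mathrm{state}}+{\alpha}_{m\left[i\right]}^{\mathrm{region}}+{\alpha}_{t\left[i\right]}^{\mathrm{stratum}}+{\alpha}_{t\left[i\right],m\left[i\right]}^{\mathrm{stratum}\times \mathrm{region}}+{\alpha}_{t\left[i\right],s\left[i\right]}^{\mathrm{stratum}\times \mathrm{state}}\right) $$

$$ {\alpha}^j\sim N\left(0,{\sigma}_j^2\right)\quad \mathrm{for}\ \mathrm{each}\ \mathrm{variable}\ j $$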
I drew true effect sizes for the demographic and geographic variables from a normal distribution, varying the standard deviation $ \sigma $ for one variable at a time. For each variable, I ran the simulation with $ \sigma \in \left\{\mathrm{0.1,0.75,2.0}\right\} $ , holding all other $ \sigma $ values constant at 0.1. As $ \sigma $ increases for each effect, so does the extent to which that variable independently impacts individual opinion. Finally, I use these effects to produce a probability that individual $ i $ supports a survey question and draw response $ {y}_i $ from a Bernoulli distribution.
The true data generating process for individual opinion in the simulation is based on race, state, region, and stratum, as well as the interactions between stratum $ \times $ region and stratum $ \times $ state, to reflect that rural and urban places may vary systematically in different parts of the country. $ RepVote $ is normally distributed and constrained to be modestly correlated with $ {\alpha}_s^{\mathrm{state}} $ .Footnote 6
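A scaled-down sketch of this data generating process in Python may help fix ideas. Everything concrete here is hypothetical: the four-state, two-region, three-stratum geography, the uniform assignment of voters to states and strata (the simulations in the text use real population shares), the absence of an intercept, and the inverse-logit link. The $ RepVote $ predictor is omitted.

```python
import math
import random


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_effects(levels, sigma, rng):
    """Draw one effect per level from Normal(0, sigma)."""
    return {level: rng.gauss(0.0, sigma) for level in levels}


def simulate_voters(n_voters, sigma, seed=0):
    rng = random.Random(seed)
    # Hypothetical toy geography: 4 states in 2 regions, 3 strata.
    region_of = {"A": "East", "B": "East", "C": "West", "D": "West"}
    states = sorted(region_of)
    regions = ["East", "West"]
    strata = ["rural", "town", "city"]
    races = ["white", "nonwhite"]

    effects = {
        "race": draw_effects(races, sigma["race"], rng),
        "state": draw_effects(states, sigma["state"], rng),
        "region": draw_effects(regions, sigma["region"], rng),
        "stratum": draw_effects(strata, sigma["stratum"], rng),
        "stratum_x_region": draw_effects(
            [(t, m) for t in strata for m in regions], sigma["stratum_x_region"], rng),
        "stratum_x_state": draw_effects(
            [(t, s) for t in strata for s in states], sigma["stratum_x_state"], rng),
    }

    voters = []
    for _ in range(n_voters):
        s = rng.choice(states)  # uniform here, an assumption of this sketch
        t = rng.choice(strata)
        r = rng.choice(races)
        m = region_of[s]
        eta = (effects["race"][r] + effects["state"][s] + effects["region"][m]
               + effects["stratum"][t] + effects["stratum_x_region"][(t, m)]
               + effects["stratum_x_state"][(t, s)])
        p = inv_logit(eta)
        y = 1 if rng.random() < p else 0  # Bernoulli(p) draw
        voters.append({"state": s, "region": m, "stratum": t, "race": r, "p": p, "y": y})
    return voters


# Vary one sigma (stratum) while holding the others at 0.1, as in the text.
sigma = {"race": 0.1, "state": 0.1, "region": 0.1, "stratum": 2.0,
         "stratum_x_region": 0.1, "stratum_x_state": 0.1}
voters = simulate_voters(1000, sigma, seed=1)
```

Raising a single entry of `sigma` while leaving the rest at 0.1 mirrors the one-variable-at-a-time design described above.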
With a population of one million voters in hand, I then produce two samples, each meant to mimic a standard survey of approximately 1,500 respondents.Footnote 7 The first is a simple random sample in which every voter has an equal probability of being selected. The second is based on the Gallup Poll cluster sampling procedure. For each stratum × region pair, I randomly select two states, weighting by their populations. From each stratum × region × state combination selected, I then sampled 14 respondents from the pool of voters. Finally, I fit both traditional and deep MRP models on both sets of polls to predict opinion using the lme4 package in R (Bates et al. 2015). Deep MRP models add race $ \times $ state and race $ \times $ region random effects. I poststratified using all variables and interactions included in the model. I repeated the simulation 100 times for each combination of parameters, allowing the specific effects drawn, the voters, and the samples to vary each time.
Figure 2 reports the results of these simulations. For each variable $ j $ , I report error from the MRP model’s opinion estimates when the corresponding $ {\sigma}_j $ is set at 0.1, 0.75, or 2.0. I hold $ {\sigma}_{\neg j} $ for all other variables constant at 0.1. The lefthand column of Figure 2 reports the mean absolute error (MAE) across 100 simulations, using the observed “true” opinion value from the full pool of simulated voters. Filled circles and squares report results from using MRP with the cluster-sampled survey, while the hollow points are for the simple random sample. The righthand column reports the average correlation between true and modeled opinion.
As I increase the magnitude of the independent effect of stratum on opinion, as well as the interactions between stratum × state and stratum × region, the amount of error from MRP increases in the clustered random sample. While the MAE does increase slightly for the poll with a simple random sample, the increase in error is much more dramatic when cluster sampling is used. Notably, MRP does not perform much worse under cluster sampling when the effects for state, region, or race are large. My results also suggest that although deep models have been found to improve estimation in MRP generally, they do not seem to offer dramatic improvements when polls are cluster-sampled.
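The two sampling schemes can be sketched in Python as follows. The function names, the toy population, and the choice to draw the two states per stratum × region pair without replacement are assumptions of this sketch; the text specifies only population-weighted selection of two states and 14 respondents per selected combination.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, rng):
    """Every voter has the same probability of selection."""
    return rng.sample(population, n)


def draw_weighted_without_replacement(items, weights, k, rng):
    """Sequentially draw up to k distinct items with probability proportional to weight."""
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(min(k, len(items))):
        idx = rng.choices(range(len(items)), weights=weights, k=1)[0]
        chosen.append(items.pop(idx))
        weights.pop(idx)
    return chosen


def cluster_sample(population, rng, states_per_pair=2, per_cluster=14):
    state_pop = defaultdict(int)
    voters_in = defaultdict(list)      # (stratum, region, state) -> voters
    states_in_pair = defaultdict(set)  # (stratum, region) -> states present
    for v in population:
        state_pop[v["state"]] += 1
        voters_in[(v["stratum"], v["region"], v["state"])].append(v)
        states_in_pair[(v["stratum"], v["region"])].add(v["state"])

    sample = []
    for pair in sorted(states_in_pair):
        candidates = sorted(states_in_pair[pair])
        weights = [state_pop[s] for s in candidates]
        picked = draw_weighted_without_replacement(candidates, weights, states_per_pair, rng)
        for state in picked:
            pool = voters_in[(pair[0], pair[1], state)]
            sample.extend(rng.sample(pool, min(per_cluster, len(pool))))
    return sample


# Hypothetical toy population: 6 states of unequal size in 2 regions, 3 strata.
rng = random.Random(42)
region_of = {"A": "East", "B": "East", "C": "East", "D": "West", "E": "West", "F": "West"}
population = [
    {"id": i, "state": s, "region": region_of[s],
     "stratum": rng.choice(["rural", "town", "city"])}
    for i, s in enumerate(rng.choices(sorted(region_of), weights=[5, 3, 1, 4, 2, 1], k=3000))
]
srs = simple_random_sample(population, 150, rng)
clustered = cluster_sample(population, rng)
```

Under this design, a state contributes respondents only from the strata in which it happened to be drawn, which is the source of the unrepresentative state subsamples discussed above.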
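The two accuracy measures can be computed as in the short Python sketch below. The function names and the four-state toy vectors are hypothetical; in the simulations the same quantities would be computed over all states and then averaged across the 100 runs.

```python
import math


def mean_absolute_error(truth, estimate):
    """Average absolute gap between true and estimated state opinion."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy example with made-up state-level values.
true_opinion = [0.50, 0.60, 0.40, 0.70]
mrp_estimate = [0.55, 0.50, 0.45, 0.65]
mae = mean_absolute_error(true_opinion, mrp_estimate)
corr = pearson_correlation(true_opinion, mrp_estimate)
```

MAE captures how far estimates are from the truth in absolute terms, while the correlation captures whether states are ordered correctly even if levels are off.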
We only see divergence between cluster sampling and simple random sampling when the effect of a stratum variable (i.e., heterogeneity inside of subnational units and across the dimensions included in the sampling frame) increases. This suggests that when traditional MRP is conducted with cluster-sampled polls, to the extent that there is heterogeneity inside a state based on stratum, the model performs worse.
Approaches to MRP with cluster-sampled polls
The abortion case study and simulations above highlight two problems that may produce increased error when estimating state opinion with MRP on cluster-sampled polls.
First, clusters may not be representative of the overall population in a state. At one extreme, some states will have zero respondents despite the fact that a simple random sample might include two or three individuals in expectation. More commonly, they may have a single cluster drawn from just one (unrepresentative) community, as in the Utah case described above. In principle, this is the kind of problem MRP is designed to address (indeed, even simple random samples will produce states with very few respondents); MRP borrows strength across states and assumes that even if a state has few respondents, a reasonable prediction can be derived from the behavior of similar individuals in other states. However, if clusters are not representative of their states as a whole, this can contaminate the estimate of the state effect and thus lead to inappropriate predictions for the state as a whole.
Second, the hierarchical model typically used with traditional MRP may account for the wrong geographic variation in opinion. The traditional MRP model is inconsistent with the known data generating process for the opinion survey. Because pollsters produced the sample using important information not accounted for in the model, results may have increased error. The traditional MRP model also ignores important information that pollsters use to produce their samples. If a nationally representative survey can be produced by sampling based on stratum, then these same characteristics may be useful in accurately predicting opinion.
Because of the role that geography plays in a cluster sample, these problems make MRP particularly sensitive to geographic variation in public opinion. Stollwerk (2017) conjectured that MRP estimates from cluster-sampled polls will be incorrect when opinion varies within states in ways that are not accounted for by demographic variables. Pollsters produce samples by randomizing not at the state level but at the region $ \times $ stratum level. As a result, the kinds of people missing from the poll may not always be well represented by those in the dataset. For example, consider the (un)representativeness of an urban neighborhood in Milwaukee being the only cluster sampled in the state of Wisconsin.
In the following sections, I propose and test two approaches to improve estimation of opinion from cluster-sampled surveys. First, I pool responses from multiple surveys. By adding new clusters, the poll in this case begins to approach a simple random sample. (Simple random samples can be thought of as clustered samples with $ N $ clusters of 1 respondent each.)
Second, I propose respecifying the predictive model in the first stage of MRP to include the geographic information that pollsters use to produce clustered random samples. This data can then be used in the poststratification step of MRP. Underpinning this approach – which I call CMRP – is the idea that scholars can improve estimates from MRP by fitting a model that accounts for the cluster-sampling procedure itself.
Pooling cluster-sampled polls
One solution to the problem of clustered random samples being unrepresentative of their states as a whole (without conditioning on stratum or cluster-level information) is to pool multiple surveys. MRP performs better with cluster-sampled polls that have a larger number of clusters, and pooling is analogous to adding clusters, provided the pooled polls draw on different clusters (Stollwerk 2017).Footnote 8
However, pooling surveys improves opinion estimates only in cases where two conditions are met. First, multiple surveys must ask identical (or at least very similar) questions. For many substantive applications (e.g., studying policy responsiveness or constructing time-series of opinion) it may also be necessary for the polls to be fielded around the same time. Second, pooling surveys only produces large reductions in error from MRP when doing so increases the number of clusters, and not simply the sample size within each cluster. Because survey firms may not change clusters from one poll to the next – and this cannot usually be observed – it is therefore usually necessary to find polls from different firms asking the same questions.
Simulation study: pooling
As an initial test of whether pooling can produce better estimates than using a single cluster-sampled poll, I incorporated pooling into the simulation setup described above. I found that by doubling the number of clusters in each stratum × region combination, estimation can be improved, especially when opinion varies across states or by stratum within states. This is consistent with the problems associated with unrepresentative state-level subsamples from a single poll. The simulation results are presented and discussed in Supplementary Appendix D.
However, I also find that pooling does very little to improve opinion estimates if the surveys do not increase the number of clusters. In Supplementary Appendix D, I show that doubling the sample size within clusters (analogous to pooling two surveys sampling from the same clusters) does not meaningfully improve estimates versus a traditional MRP model.
Validation: pooled surveys for presidential opinion
I confirmed the results of the simulation study using presidential election polls from 1968 to 1984. Presidential elections are a useful testing ground for MRP because they offer a ground truth against which polls can be compared: the election results themselves.
Here, I use MRP to predict support for the Democratic presidential candidate in the 1968–1984 presidential elections. I use two samples: a baseline that comes from the final Gallup poll conducted in the election season, as well as a pooled sample that includes that same Gallup poll and the ANES. For each sample, I fit two models in Stan: traditional MRP, which included variables for race, sex, the race × sex interaction, age group, education, and percent evangelical or Mormon;Footnote 9 and a Deep MRP model adding interactions among demographic predictors and between demographic and geographic variables. I then poststratified on all included predictors using a poststratification matrix built from joint distributions of the population in the 1980 Census, obtained from IPUMS-NHGIS. The pooled models included a random effect for the survey firm, but I did not poststratify on this variable.Footnote 10 The full model specifications and details of the surveys used are in Supplementary Appendix E.
To account for variability of polls and unobserved changes in the national environment in the final weeks of the campaign (Gelman and King 1993), I report results relative to national support. I adjust both estimates and actual election results by taking the difference between state-level support and national support for the Democratic candidate.
Figure 3 reports results. The leftmost column shows the change in MAE that comes from using a pooled sample, rather than the single sample. Negative numbers reflect a reduction in MAE (i.e., more accurate estimation). The second column reports the MAE improvement as a share of the error in the traditional MRP model. The third column shows the share of states whose estimates improved when the pooled sample was used. A pooled sample reduced overall error in four of the five election years – and in the case of 1972 by more than 20%. Finally, the rightmost panel shows the average increase in the variance of state-level estimates from using a pooled sample (versus standard MRP).Footnote 11 The variance of the pooled sample is slightly larger, though not dramatically so, suggesting slightly higher uncertainty from pooling, though the point estimates themselves are more accurate on average.
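As a worked illustration of this adjustment, the Python sketch below subtracts national support from state-level support for both the estimates and the results before computing error. All numbers are made up, and how the national figure is obtained for each series (for example, the poll's own national estimate versus the national vote share) is an assumption left to the inputs.

```python
def relative_to_national(state_support, national_support):
    """Express each state's support as a deviation from national support."""
    return {state: share - national_support for state, share in state_support.items()}


# Made-up Democratic support in two states.
estimated = {"A": 0.50, "B": 0.40}
actual = {"A": 0.56, "B": 0.42}
estimated_national = 0.45  # hypothetical national estimate from the poll
actual_national = 0.50     # hypothetical national election result

estimated_relative = relative_to_national(estimated, estimated_national)
actual_relative = relative_to_national(actual, actual_national)

# Error is computed on the deviations, so a uniform national swing cancels out.
abs_error = {s: abs(estimated_relative[s] - actual_relative[s]) for s in estimated}
mae = sum(abs_error.values()) / len(abs_error)
```

In this toy case the poll understates national support by five points, yet the relative errors are only one and three points, because the comparison removes the common national shift.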
Limitations of pooling
While pooling surveys represents a promising improvement on traditional MRP in the case of cluster-sampled polls, there are two challenges that make it impractical in some situations.
First, pooling requires scholars to find multiple polls that asked the same question around the same time. This is often not possible, as many topics appear infrequently in surveys, and the exact question wording can vary widely between polling firms and even from one survey to the next. The need to collect multiple polls can also make scholars’ attempts to construct time series of opinion impossible.
Second, simulations indicate that MRP with pooled surveys works well when the number of clusters increases and not necessarily when the size of each cluster does. This makes it especially problematic that polling firms rarely change the communities from which they select respondents, particularly for face-to-face interviews. A review of memos by Gallup statisticians during the 1980s indicated that sampled areas were frequently re-used by pollsters until they had been “exhausted.”Footnote 12 As a result, the same communities can appear repeatedly in surveys from some pollsters, and because detailed information about the exact location of respondents is generally not available, the extent of this re-use can be difficult to definitively determine. In Supplementary Appendix D, I show that increasing sample size by doubling the number of respondents from each cluster does not offer the same improvements as increasing the number of clusters.
Estimating opinion with CMRP
In this section, I propose CMRP, which offers an alternative approach to improving opinion estimation from cluster-sampled polls. CMRP adds geographic data used in the pollster’s sampling procedure to both the multilevel model and poststratification stages of MRP. Because clustered random samples are representative of the overall population conditional on the sampling procedure used, we should be able to improve state opinion estimates by accounting for pollsters’ procedures in the model. CMRP also pools respondents more intelligently within regions by allowing missing “types” of people to be represented by more similar groups elsewhere in the sample, rather than dissimilar people in their state. That is, rather than using urban Milwaukee residents to model rural Wisconsinites’ opinion, CMRP takes more similar groups (e.g., rural Minnesotans) into account when predicting opinion.
How to fit CMRP
CMRP follows a similar procedure to MRP. First, the researcher fits a multilevel model of individual opinion, incorporating the geographic units employed by the pollster to produce the sample. In the Gallup case, this would be region and size-of-community stratum. Specifically, the model mirrors the standard MRP specification above, but adds random effects for these sampling units.
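As a rough sketch of what such added terms might look like, let $ k[i] $ index the sampling stratum of respondent $ i $ and $ r[i] $ the region. The notation, the logit link, and the choice of a stratum $ \times $ region interaction are all assumptions made for illustration and may differ from the specification actually used:

```latex
\begin{align*}
\Pr(y_i = 1) &= \mathrm{logit}^{-1}\!\left( \eta_i^{\mathrm{MRP}}
    + \alpha^{\mathrm{stratum}}_{k[i]}
    + \alpha^{\mathrm{stratum}\times\mathrm{region}}_{k[i],\,r[i]} \right), \\
\alpha^{\mathrm{stratum}}_{k} &\sim \mathcal{N}\!\left(0,\ \sigma^{2}_{\mathrm{stratum}}\right), \\
\alpha^{\mathrm{stratum}\times\mathrm{region}}_{k,r} &\sim \mathcal{N}\!\left(0,\ \sigma^{2}_{\mathrm{stratum}\times\mathrm{region}}\right).
\end{align*}
```

Here $ \eta_i^{\mathrm{MRP}} $ is a placeholder for the linear predictor of the standard MRP model (its demographic, state, and region terms), so the sketch only shows what is layered on top of it.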
We might also include interactions among demographic predictors and between geographic and demographic variables to improve estimation; I refer to this specification as Deep CMRP.
Next, the researcher poststratifies to the level of the sampling unit (i.e., stratum) within each state, using joint distributions from the Census as weights. Finally, the estimates are aggregated up to the state level, again using Census data to weight. The exact steps that need to be taken to produce poststratification information from the Census vary depending on the cluster-sampling procedure used by a polling firm. In general, joint distributions of demographic variables in the population at small geographic levels (e.g., metropolitan areas, counties, cities and towns) can be downloaded from IPUMS-NHGIS. In some census years, one or two variables may not be included in joint Census tables; however, race, gender, and often age are routinely available. The steps I took to produce poststratification matrices used in this paper can be found in Supplementary Appendix C.
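The two aggregation steps can be sketched in a few lines of Python. This is a minimal illustration rather than the replication code: the column names (`state`, `stratum`, `cell`, `pred`, `n`) and the toy numbers are hypothetical, with `pred` standing in for model-predicted support in a demographic cell and `n` for the Census count of that cell.

```python
import pandas as pd


def poststratify(cells):
    """Two-stage poststratification: cells -> state-by-stratum -> state.

    `cells` has one row per state x stratum x demographic cell, with a
    model prediction (`pred`) and a population count (`n`).
    """
    df = cells.copy()
    df["weighted"] = df["pred"] * df["n"]

    # Stage 1: population-weighted mean within each state-by-stratum unit.
    strata = df.groupby(["state", "stratum"], as_index=False)[["weighted", "n"]].sum()
    strata["estimate"] = strata["weighted"] / strata["n"]

    # Stage 2: aggregate strata up to states, weighting by stratum population.
    totals = strata.groupby("state")[["weighted", "n"]].sum()
    states = (totals["weighted"] / totals["n"]).rename("estimate")

    return strata[["state", "stratum", "n", "estimate"]], states


# Invented example: one state, two strata, two demographic cells each.
cells = pd.DataFrame(
    {
        "state": ["WI", "WI", "WI", "WI"],
        "stratum": ["urban", "urban", "rural", "rural"],
        "cell": ["white", "nonwhite", "white", "nonwhite"],
        "pred": [0.50, 0.80, 0.40, 0.60],
        "n": [600, 400, 900, 100],
    }
)

strata_est, state_est = poststratify(cells)
print(strata_est)
print(state_est)
```

Because both stages use population counts as weights, the state estimate equals what a single weighted average over all cells would give; keeping the intermediate stratum-level estimates is useful mainly for inspecting how opinion varies across sampling units within a state.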
Simulation study: testing CMRP
To test CMRP, I again return to simulations. I follow the same procedure as before but do not generate pooled samples. For each sample, I fit CMRP and Deep CMRP, which adds interactions between race and all geographic variables. I also fit traditional and deep MRP models that do not adjust for clustering geography to serve as comparisons. Figure 4 reports the results of these simulations. Here, all results come from polls with clustered random samples. I report results from various CMRP methods (filled circles and squares) and the corresponding traditional MRP methods (hollow circles and squares).
The leftmost column of Figure 4 reports the MAE across simulations. As the effect sizes increase for stratum $ \times $ state, stratum $ \times $ region, and the independent stratum effect, traditional methods perform worse, while CMRP reduces the error across all specifications. The middle column reports the difference in MAE between corresponding traditional MRP and CMRP methods as a percentage of the error in traditional MRP. A negative result means that the MAE decreases (improves) when CMRP is used. Depending on the conditions shaping opinion, using CMRP might reduce error by as much as 50% when stratum, stratum $ \times $ state, and stratum $ \times $ region effect sizes are large. The third column reports the average share of states in each simulation whose modeled estimates of opinion get closer to true opinion. Using CMRP improves the estimates of more than half of states, particularly as the effect sizes for stratum $ \times $ state, stratum $ \times $ region, and stratum get larger.
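The three summaries in these columns are simple functions of the estimated and true state values. A minimal sketch for a single simulated poll follows; the arrays `truth`, `mrp`, and `cmrp` are invented for illustration and are not drawn from the simulations, in which these quantities are averaged over many simulated polls.

```python
import numpy as np


def compare_estimates(truth, baseline, alternative):
    """Summarize how `alternative` estimates compare with `baseline`."""
    truth = np.asarray(truth, dtype=float)
    err_base = np.abs(np.asarray(baseline, dtype=float) - truth)
    err_alt = np.abs(np.asarray(alternative, dtype=float) - truth)

    mae_base = err_base.mean()
    mae_alt = err_alt.mean()
    return {
        "mae_baseline": mae_base,
        "mae_alternative": mae_alt,
        # Negative values mean the alternative method reduces error.
        "pct_change": 100.0 * (mae_alt - mae_base) / mae_base,
        # Share of states whose estimate moves closer to the truth.
        "share_improved": float(np.mean(err_alt < err_base)),
    }


# Invented values for four states.
truth = [0.50, 0.40, 0.60, 0.55]
mrp = [0.56, 0.46, 0.56, 0.54]
cmrp = [0.53, 0.42, 0.57, 0.57]

summary = compare_estimates(truth, mrp, cmrp)
print(summary)
```

In this toy case the alternative method lowers MAE overall even though one of the four states gets a worse estimate, which is the same pattern the third column is designed to expose.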
CMRP with historical polls
I now turn to validating CMRP using historical cluster-sampled polls. Using CMRP and traditional MRP, I estimated state-level support for the Democratic candidate in each of the five presidential elections from 1968 to 1984, in each case using the last available Gallup poll before Election Day. I fit two models: CMRP and Deep CMRP, which adds interactions among demographic predictors (e.g., $ \mathrm{race}\times \mathrm{sex}\times \mathrm{educ} $ ) and between demographic and geographic variables (e.g., $ \mathrm{race}\times \mathrm{state} $ and $ \mathrm{race}\times \mathrm{stratum} $ ). I then compared each to the corresponding traditional MRP or deep MRP model.Footnote 13 As in the pooling example above, all models included variables for race, sex, the race $ \times $ sex interaction, age group, and percent evangelical or Mormon. Models for 1976–1984 also included education (the requisite variables for poststratification were not available in the 1970 Census). I then poststratified on all included predictors using a matrix built from joint distributions of the population in the 1970 and 1980 Censuses, which I obtained from IPUMS-NHGIS. The full model specifications and details of the surveys used are in Supplementary Appendix E. As before, I report results relative to national support.
Figure 5 reports results. The results indicate that CMRP, on average, reduces the overall error in opinion estimates by 2.1% for the simple CMRP model and 3.3% for the deep model. These improvements are similar in magnitude to the gains from using machine learning methods in MRP, which have been tested on modern polls that use simple random samples. For example, Ornstein (2020) shows that ensembles improve MRP estimates from standard polls by approximately 2%–3%. Likewise, Goplerud (2024) finds that Bayesian additive regression trees (BART) outperform standard MRP by 4.5% and deeply specified MRP models by 2.6% in datasets with sample sizes similar to those available from Gallup. In some cases, CMRP can produce much larger improvements; using polls from the 1968 election, CMRP performs nearly 10% better than traditional MRP. As predicted in the simulation studies, CMRP and Deep CMRP appear to safeguard against the worst increases in error in the 1972 and 1976 elections. While CMRP outperforms traditional MRP on average, individual state estimates can perform worse in some rare cases, as shown in the third column.
The rightmost column presents differences in variance between estimates produced via CMRP and traditional MRP. As in the case of pooling, there is slightly more uncertainty associated with the CMRP estimates. However, these differences are much smaller than in the case of pooling – in fact, nearly zero – suggesting that CMRP can improve the accuracy of estimates with limited reduction in precision.
I further tested CMRP for a series of specific issue questions, which are the most common context in which MRP is used but also difficult to validate as high-quality measures of “ground truth” opinion generally do not exist. In Supplementary Appendix F, I show that CMRP on average produces slight increases in the correlation between opinion and state liberalism scores by Enns and Koch (2013), though the improvement can be more dramatic on some issues. I also return to the abortion question above and compare it against an abortion liberalism scale, as well as a limited number of state-level public opinion polls in 10 states. I find improvements from using CMRP versus traditional MRP approaches consistent with those reported for estimates of presidential vote choice (around 4%–8% reductions in MAE, depending on the model), though I note that the underlying state polls introduce considerable noise of their own into the comparison. Finally, in Supplementary Appendix G, I tested CMRP on six issues from the 2000 ANES and compared them with the 2000 National Annenberg Election Survey (NAES). Here, I found that, on average, CMRP methods decrease MAE by 2%.
Practical considerations for CMRP
To use CMRP, researchers may need to take additional steps beyond those required for MRP with modern surveys and simple random samples (or similar).
First, in order to model public opinion as a function of the cluster-sampling procedure, it is necessary to obtain more granular individual-level geographic data. Ideally, researchers would fit the model using the exact sampling strata or categories of PSUs from the sampling frame. (Sampling frames and clustering procedures are usually described in the documentation for surveys.) In the case of the Gallup polls that form the core of my validation, the precise stratum designations were not available in the survey data, but a “city size” variable was, which allowed me to match respondents to Gallup’s strata. For some surveys (e.g., the GSS and most years of the ANES), these data are not publicly available and must be requested; in some cases they may not exist at all.
In cases where sufficiently granular data are not available, reasonable proxies may work. In Supplementary Appendix E, I show that replacing stratum in the Gallup models with a two-level urban/rural variable produces similar results. I produced this variable using the city size data, as Gallup does not always publish a coarsened urban/rural variable. However, a similar approach may be reasonable for other situations in which samples are produced from clusters based on their urbanness but granular data are not available.
The second major practical consideration is collecting poststratification data necessary for CMRP. Unfortunately, modeling opinion at sub-state levels presents new difficulties not always present in the standard MRP case. While most guides for MRP suggest computing the joint distribution of the population across several demographic variables using Census microdata from IPUMS, the data are too sparse in many smaller geographies to do so.
Instead, joint distributions from tables published by the Census can be used. In this paper, I created poststratification matrices using Census data at the state and place (city or town) level, which I downloaded from IPUMS-NHGIS. For each place, the total population can be used to match to Gallup strata. These can then be combined and aggregated within states to produce the joint distribution by race, sex, age, education, and stratum.Footnote 14 I discuss this procedure in greater detail in Supplementary Appendix C. I have also made poststratification data for 1970 and 1980, using both the Gallup strata and a simpler urban versus rural setup, available in the replication data for this paper.Footnote 15
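The matching and aggregation described above might be sketched as follows. Everything specific here is a hypothetical placeholder: the population cutoffs and stratum labels are not Gallup's actual strata, the column names are invented, and the counts are toy values. For brevity the sketch also ignores the population living outside the listed places and uses only race and sex as demographic variables.

```python
import pandas as pd

# Hypothetical size-of-community cutoffs (lower bounds); NOT Gallup's strata.
STRATUM_BINS = [0, 2_500, 50_000, 500_000, float("inf")]
STRATUM_LABELS = ["rural", "small town", "mid-size city", "large city"]


def build_poststrat(places, cells):
    """Assign places to strata by total population, then aggregate cells.

    `places`: one row per place (`place_id`, `state`, `total_pop`).
    `cells`: one row per place x demographic cell with a count `n`.
    """
    places = places.copy()
    places["stratum"] = pd.cut(
        places["total_pop"], bins=STRATUM_BINS, labels=STRATUM_LABELS, right=False
    ).astype(str)

    merged = cells.merge(
        places[["place_id", "state", "stratum"]], on="place_id", how="left"
    )
    out = merged.groupby(["state", "stratum", "race", "sex"], as_index=False)["n"].sum()

    # Each cell's share of its state's population: the poststratification weight.
    out["share_of_state"] = out["n"] / out.groupby("state")["n"].transform("sum")
    return out


# Invented example: one large city and two rural places in one state.
places = pd.DataFrame(
    {"place_id": [1, 2, 3], "state": ["WI"] * 3, "total_pop": [600_000, 1_200, 1_800]}
)
counts = {
    1: [200_000, 190_000, 110_000, 100_000],
    2: [600, 550, 30, 20],
    3: [900, 850, 30, 20],
}
demo = [("white", "F"), ("white", "M"), ("nonwhite", "F"), ("nonwhite", "M")]
cells = pd.DataFrame(
    [
        {"place_id": pid, "race": race, "sex": sex, "n": n}
        for pid, ns in counts.items()
        for (race, sex), n in zip(demo, ns)
    ]
)

post = build_poststrat(places, cells)
print(post)
```

The two rural places are merged into a single stratum within the state, so the resulting table has one row per state, stratum, and demographic cell, which is the shape needed to weight model predictions.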
Conclusion
This paper seeks to address the challenges of estimating subnational public opinion using MRP on polls produced from cluster sampling. Simulations suggest that MRP may produce estimates with higher error for two reasons: clusters may not be representative of the overall population of the state, and the multilevel model commonly used in MRP may not correctly account for geographic variation in opinion.
To mitigate these potential sources of error, I propose two solutions: pooling samples and CMRP. By pooling multiple cluster-sampled polls that ask the same question around the same time, scholars in effect increase the number of clusters included in the sample frame. By doing so, the MRP model can better account for geographic variation in opinion within and across states. However, this may not be feasible in many contexts.
A second approach – CMRP – improves estimation, on average, without requiring multiple polls. CMRP integrates the sampling procedure used by the polling firm into the estimation process by including relevant geographic variables in the predictive model fit in the first step of MRP. Specifically, CMRP fits a multilevel model using demographics and the geographic variables used for clustering (which I call strata, following the Gallup Poll’s nomenclature). I also introduced Deep CMRP, which adds interactions among demographic predictors and between demographic and geographic variables.
In principle, these two approaches could be combined. That is, one could pool multiple surveys from different firms that use similar sampling procedures, and then produce estimates using CMRP. This poses additional challenges in producing joint distributions of the population for poststratification. However, even in the case where only one method is feasible, the methods in this paper improve estimation of subnational opinion using cluster-sampled polls.
Higher-quality opinion estimates can improve research in a number of domains. First, and most obviously, descriptive studies of public opinion will be aided by more accurate estimates at subnational levels. But public opinion is also useful as an input to understand other political processes and dynamics. Studies of responsiveness depend on estimates of constituent opinion on issues. Likewise, our understanding of party position-taking is often limited by the lack of available estimates of public support for issues at subnational levels. Reducing measurement error in opinion data may allow for greater precision in scholarship in these domains.
In addition to improving estimation from cluster-sampled polls in the US and in comparative contexts, the idea underpinning CMRP may be useful in analyzing opinion data from other sources with more complex samples. In particular, when selection into a survey varies across some other observable variable, modeling opinion at the level of this variation and aggregating up can reduce measurement error. One concern with online surveys, in particular, has been the unrepresentativeness of samples (Berinsky 2017). MRP has been used to correct unrepresentative online samples in some cases (e.g., Gelman et al. 2016); future research in this vein may be augmented by considering more granular levels at which opinion can be estimated, particularly in cases where observable variables are used deterministically to produce samples.
More generally, the takeaway for scholars is that careful consideration of not only the policy domain at hand but also the procedures used to produce a poll can improve public opinion estimation.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/spq.2024.16.
Data availability statement
Replication materials are available on SPPQ Dataverse at https://doi.org/10.15139/S3/VD6BGL (Auslen 2024).
Acknowledgements
I thank Naoki Egami, Andrew Gelman, Max Goplerud, Shigeo Hirano, Eunji Kim, Patricia Kirkland, Justin Phillips, Robert Shapiro, Elizabeth Zechmeister, and participants at APSA 2021, the 2022 State Politics and Policy Conference, and the Columbia American Politics Graduate Student Workshop for helpful feedback. I also thank Kathleen Weldon from the Roper Center for Public Opinion Research for sharing internal memos from the Gallup Poll. I acknowledge computing resources from Columbia University’s Shared Research Computing Facility project (supported by NIH Research Facility Improvement Grant 1G20RR030893-01 and New York State Empire State Development, Division of Science Technology and Innovation Contract C090171).
Funding statement
The author received no financial support for the research, authorship, and/or publication of this article.
Competing interest
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author biography
Michael Auslen is a Postdoctoral Fellow in the Department of Government at the University of Texas at Austin. His research focuses on the roles that public opinion and the news media play in political representation, especially in subnational politics.