INTRODUCTION
Regression models are ubiquitous in political science, but standard estimators can produce extreme and unstable inferences, particularly with highly correlated predictors (Gelman et al. Reference Gelman, Jakulin, Pittau and Su2008). These extreme results overinflate effect size estimates, especially when filtered on statistical significance. One solution is to apply regularization techniques, either through Bayesian priors or frequentist methods such as ridge regression or least absolute shrinkage and selection operator (LASSO). Regularization techniques have become popular in the last decade, with at least 102 articles in the top three political science journals (see Appendix R5 for details).
Regularization reduces the variance of estimates by deliberately biasing them toward zero. In repeated samples, unregularized estimators converge on the population value of a parameter, but individual estimates may be far from this value. By contrast, regularized estimators are biased in expectation (i.e., closer to zero than the population value), but there is less variation in estimates between different samples. The degree of bias is determined by making assumptions about the likely distribution of effects. Different methods make different assumptions. Faced with a set of correlated predictors, a researcher who thought there was likely to be a large number of small effects would choose ridge regression, but one who thought there was likely to be a small number of large effects—while other variables had no effect at all—would choose LASSO. Bayesian methods are more flexible, allowing researchers to specify ridge, LASSO, and elastic net equivalent priors, but also to use a range of other distributions (Carvalho, Polson, and Scott Reference Carvalho, Polson and Scott2009). Used judiciously, regularization stabilizes estimates and improves out-of-sample predictions.
However, regularization has underappreciated consequences for making causal inferences: with strong regularization, as some coefficients are shrunk, the coefficients of correlated variables can be inflated. This happens because regularized coefficients do not account for all the variance associated with a variable, leaving residual variance open to be modeled by other variables. In other words, regularization partially reopens backdoor causal paths that were previously blocked by covariate adjustment.
In this letter, we first demonstrate how regularization can reopen backdoor causal paths. We then discuss this problem with respect to an American Political Science Review letter by Cavari and Freedman (Reference Cavari and Freedman2023, hereafter CF2), which employs regularized regression to argue that falling survey cooperation rates have exaggerated mass polarization in the United States. CF2 include a linear time control in their model to guard against spurious correlations due to trending variables. However, their use of regularization entails the assumption that the effect of time is small. We argue that a priori we should have strong expectations that any two (causally unrelated) variables measured over time are likely to be correlated with one another. Shrinking the time effect reopens the backdoor path between response rates and polarization via time. We reanalyze CF2’s data in light of these problems and show it does not support the claim that declining response rates have inflated estimates of mass polarization.
REGULARIZED REGRESSION
Ordinary Least Squares (OLS) and regularized regression methods are both approaches for estimating the model y = Xβ + ϵ. Frequentist regularized methods shrink coefficient estimates by adding a penalty term to the OLS cost function, but differ in how they do so. We examine three common frequentist regularization methods: ridge regression, LASSO, and elastic net. We also contextualize these methods against the Bayesian approach of specifying explicit priors.
Ridge regression minimizes the cost function:

$$ \underset{\beta}{\min} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2. \qquad (1) $$
The ridge cost function is the OLS cost function, which minimizes the sum of squared residuals, plus an additional penalty term $ \lambda \sum_{j=1}^p{\beta}_j^2 $. Regularization penalties shrink coefficients because any reduction in the sum of squared residuals produced by larger coefficients is partially offset by an increase in the size of the penalty term. For example, while a model with two coefficients 2 and 1 might result in a lower sum of squares compared to a model with coefficients 1.9 and 0.9, the latter has a smaller penalty term, and with a sufficiently large λ will have a smaller overall sum. Importantly, the size of the ridge penalty is proportional to the square of the coefficients, so if coefficient $\beta_1$ is twice the magnitude of coefficient $\beta_2$, $\beta_1$ contributes four times as large a penalty as $\beta_2$. Consequently, for a marginal increase in the sum of squared residuals, ridge regression prefers two coefficients of similar sizes to one large and one small coefficient.
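To make the arithmetic concrete, the penalty comparison above can be sketched in a few lines (a minimal illustration, not part of the original analysis; the coefficient values are the hypothetical pairs from the text):

```python
def ridge_penalty(coefs, lam):
    """Ridge penalty term: lambda times the sum of squared coefficients."""
    return lam * sum(b ** 2 for b in coefs)

# A coefficient of 2 contributes four times the penalty of a coefficient of 1.
assert ridge_penalty([2.0], 1.0) == 4 * ridge_penalty([1.0], 1.0)

# For the pairs discussed above, the shrunken pair (1.9, 0.9) always carries
# the smaller penalty, and the gap widens as lambda grows.
for lam in (0.1, 1.0, 10.0):
    assert ridge_penalty([1.9, 0.9], lam) < ridge_penalty([2.0, 1.0], lam)
```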
LASSO minimizes the cost function:

$$ \underset{\beta}{\min} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p \left| \beta_j \right|. \qquad (2) $$
The difference between Equation 2 and Equation 1 is that the LASSO loss function regularizes coefficients in proportion to their absolute size rather than their square, and so does not share ridge regression’s preference for many small coefficients over a smaller set of larger ones.
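One way to see LASSO’s qualitatively different behavior is the textbook special case of orthonormal predictors, where the LASSO solution is simply the OLS estimate soft-thresholded at λ/2 (a sketch under the cost function above; it is not code from the original analysis):

```python
def soft_threshold(b_ols, lam):
    """LASSO coefficient under orthonormal predictors: shrink the OLS
    estimate toward zero by lam / 2, clipping small values to exactly zero."""
    shrink = lam / 2.0
    if abs(b_ols) <= shrink:
        return 0.0
    return b_ols - shrink if b_ols > 0 else b_ols + shrink

# Large coefficients are shrunk by a constant amount...
assert soft_threshold(2.0, 1.0) == 1.5
assert soft_threshold(-2.0, 1.0) == -1.5
# ...while small coefficients are zeroed out entirely, unlike under ridge.
assert soft_threshold(0.3, 1.0) == 0.0
```

This is why LASSO tends to produce sparse solutions: the ridge penalty shrinks every coefficient proportionally but never to exactly zero.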
The elastic net incorporates both of these penalty terms, adding a mixing parameter α that controls the relative weight between the two. When α = 1, the elastic net is equivalent to LASSO, and when α = 0, it is equivalent to ridge regression:

$$ \underset{\beta}{\min} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p \left( \alpha \left| \beta_j \right| + \left( 1 - \alpha \right) \beta_j^2 \right). \qquad (3) $$
These penalty terms encode assumptions about the likely size of the $\beta_j$ effects, which can be interpreted as Bayesian priors (Hastie, Tibshirani, and Friedman Reference Hastie, Tibshirani and Friedman2009). In Bayesian terms, the ridge regression penalty expects that the $\beta_j$ are normally distributed around zero, which implies that coefficients are increasingly unlikely the further they are from zero. As a Bayesian prior, the LASSO penalty follows a Laplace distribution, which is sharply peaked at zero, entailing the expectation that large values of $\beta_j$ are unlikely and, moreover, that many effects are exactly zero. The elastic net prior is a mixture of these two distributions. Larger λs imply stronger versions of these assumptions, or in Bayesian terms, prior distributions with smaller standard deviations and/or scale parameters.
Choosing λ
Given the role of the penalty term in regularized estimators, λ must be chosen with care. The most common approach is to choose λ based on out-of-sample predictive performance, typically using cross-validation (CV). For ridge regression, an alternative approach is based on a transformation of sample variance and covariance. CF2 take this latter approach, using the “KM4” method (Muniz and Kibria Reference Muniz and M. Golam Kibria2009). Different methods for choosing λ can result in radically different values. In our simulations below, KM4 tends to choose considerably higher λs (mean = 4.1) compared to CV (mean = 0.9), and so a KM4 approach to choosing λ would result in much stronger regularization.
Thinking about penalty terms as being equivalent to Bayesian priors suggests an alternative approach. In Bayesian statistics, a common choice is to use weakly informative priors, such as N(0, 1) when all variables are standardized to have mean 0 and SD 1 (Betancourt Reference Betancourt2017). For a regression where all variables are unit-scaled, a ridge regression with penalty λ is equivalent to a Bayesian estimate with a normally distributed prior with mean 0 and standard deviation $ \sqrt{\frac{Var(\widehat{\epsilon})}{\lambda }} $ (Hastie, Tibshirani, and Friedman Reference Hastie, Tibshirani and Friedman2009, 64). In practice, the conversion between λ and priors can be further complicated by rescaling procedures applied by particular software, which we discuss in Section S5 of the Supplementary Material.
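The conversion described above is simple to apply in practice (a sketch under the simplifying assumptions that all variables are unit-scaled, the residual variance is known, and no software-specific rescaling applies):

```python
import math

def ridge_lambda_to_prior_sd(lam, resid_var=1.0):
    """SD of the normal prior implied by a ridge penalty lambda, for
    unit-scaled variables: sd = sqrt(Var(residuals) / lambda)."""
    return math.sqrt(resid_var / lam)

# A penalty of lambda = 1 (with residual variance near 1) corresponds to a
# weakly informative N(0, 1) prior...
assert ridge_lambda_to_prior_sd(1.0) == 1.0
# ...while lambda = 100 implies a tight N(0, 0.1) prior: a strong a priori
# claim that the effect is very close to zero.
assert abs(ridge_lambda_to_prior_sd(100.0) - 0.1) < 1e-12
```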
Regularized Regressions Can Reopen Backdoor Causal Paths
To demonstrate the counterintuitive effects that regularization can have, we run simulations in which X causes Y, and X and Z are correlated, but Z has no causal effect on Y, as illustrated in the directed acyclic graph (DAG) below:

$$ Z \leftrightarrow X \rightarrow Y \qquad (4) $$
where X and Z are multivariate standard normal variables with a correlation ρ, and Y is a linear function of $\beta_X X$ and a standard normal error term (in CF2’s data, if the response rate effect on polarization were actually zero, then X would be equivalent to time, Z to response rate, and Y to polarization):

$$ \begin{pmatrix} X \\ Z \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \qquad Y = \beta_X X + \epsilon, \qquad \epsilon \sim N(0,1). \qquad (5) $$

From this DGP, we simulate ten thousand datasets with $\beta_X \sim U(0,1)$ and ρ = −0.9 (approximately the correlation between time and response rate in CF2’s data).
We then fit the following model to the simulated data:

$$ Y = \beta_X X + \beta_Z Z + \epsilon \qquad (6) $$
using four estimators: OLS, ridge regression, LASSO, and elastic net (with α = 0.5, an equal mix of the two penalties). For each regularized estimator, we set four λ penalties: 0.00615, 0.154, 0.615, and 2.46, which correspond approximately to a weakly informative prior for the average simulated case, and priors that are 20%, 10%, and 5% the width of the standard deviation of that weakly informative prior for ridge regression.
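The mechanism can be sketched with a single simulated dataset (an illustrative sketch, not our actual simulation code: the sample size, the fixed $\beta_X = 0.8$, and the deliberately heavy penalty are arbitrary choices, and ridge is computed in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, beta_x = 20_000, -0.9, 0.8  # true beta_Z is exactly zero

# Correlated standard-normal X and Z; Y depends only on X.
cov = np.array([[1.0, rho], [rho, 1.0]])
XZ = rng.multivariate_normal([0.0, 0.0], cov, size=n)
y = beta_x * XZ[:, 0] + rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam I)^-1 X'y; lam = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge(XZ, y, 0.0)
b_ridge = ridge(XZ, y, 4.0 * n)  # deliberately strong penalty

# OLS recovers beta_X near 0.8 and beta_Z near 0; heavy ridge shrinks beta_X
# and, in doing so, pushes beta_Z away from zero along the Z <-> X -> Y path.
print(b_ols.round(3), b_ridge.round(3))
```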
Figure 1 shows the average coefficient estimates for $\beta_X$ and $\beta_Z$ for each estimator, which illustrates the interaction between effect size and λ. We first consider bias (the average signed difference between the estimate and the simulated coefficient). The OLS estimate is unbiased for both $\beta_X$ and $\beta_Z$ across the range of the simulated $\beta_X$ terms. With a penalty equivalent to a weakly informative prior (λ = 0.00615, row A), the regularized estimates are only marginally biased. With a larger penalty (λ = 0.154, row B), the bias in the estimated $\beta_X$ is substantially larger. For ridge regression, this bias increases linearly as the simulated $\beta_X$ increases. For LASSO (and to a lesser extent elastic net), the level of bias flattens off at sufficiently large $\beta_X$ values. As λ gets larger, the bias in the $\beta_X$ estimates also increases (rows C and D). These biases are strongest for LASSO—which with the highest penalty almost always shrinks the estimates to zero—and smallest for ridge regression, with elastic net in between.
That regularized regression induces bias and shrinks $\beta_X$ toward zero is well understood—this is the price these estimators pay for reducing variance. What is less appreciated is that regularized estimators also bias $\beta_Z$ away from zero. This occurs because some of the X → Y effect is left unaccounted for by the biased $\beta_X$ estimate, and so is transferred to Z via the Z ↔ X → Y backdoor path (for a demonstration, see Section S3 of the Supplementary Material). The $\beta_Z$ bias is most pronounced for ridge regression: as the λ penalty increases, the bias in $\beta_Z$ initially increases (compare the ridge regression lines in rows B/C). As λ increases further, the ridge $\beta_Z$ estimates are themselves regularized and move back toward zero. Although this means the absolute bias in $\beta_Z$ is reduced at the largest λ values, the size of $\beta_Z$ relative to $\beta_X$ continues to increase, with heavily penalized estimators assigning almost equal values to $\beta_X$ and $\beta_Z$. For LASSO, the bias in $\beta_Z$ is only noticeable at smaller levels of λ (i.e., row B); with higher penalties $\beta_Z$ is almost always pulled to zero. Elastic net again falls between the ridge and LASSO estimates.
Making Statistical Inferences with Regularized Regressions
To make inferences about underlying population parameters (e.g., the true effect of response rate on estimates of mass polarization), we typically use estimates of sampling variance. With unbiased estimators this is straightforward; in repeated samples, we would expect 95% of our confidence intervals (CIs) to overlap the population value. With regularized estimators, this is more complicated because (1) the sampling variance of some regularized estimators is poorly defined, and (2) even where it is well defined (such as for ridge regression) the resulting CIs may have no overlap with the underlying population parameter.
A particular challenge for the LASSO and elastic net estimators is that “we still do not have a general, statistically valid method of obtaining standard errors of LASSO estimates” (Kyung et al. Reference Kyung, Gill, Ghosh and Casella2010, 377) even using bootstrap methods. LASSO and elastic net are therefore poorly suited to inference.
This problem does not apply to ridge regression, where the variance of ridge estimates is

$$ Var\left( {\widehat{\beta}}_{ridge} \right) = \sigma^2 \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} X \left( X^{\top} X + \lambda I \right)^{-1}. \qquad (7) $$
We must be clear what this variance captures: how much a ridge estimate $ {\widehat{\beta}}_{ridge} $ is expected to differ across repeated samples (when λ = 0, this is the same as OLS). By design, the λ penalty decreases variance. However, λ also increases bias (indeed, it reduces variance by increasing bias). This has the pernicious effect of shrinking the ridge CIs at the same time as increasing the distance between the ridge estimates and the underlying parameter of interest. As bias grows, CI coverage falls because more of the discrepancy between $ {\widehat{\beta}}_{ridge} $ and β is due to bias (which the CIs do not account for) rather than sampling variation (which the CIs do account for). If the bias is large enough, CIs routinely exclude the true value of β. Narrow ridge regression CIs do not necessarily indicate a precise estimate of the parameter of interest, but simply that ridge regression would produce similar estimates with new samples. The resulting CIs may lead to misleading inferences about population parameters.
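The mechanics are straightforward to verify: the ridge sampling variance $ \sigma^2 W X^{\top} X W $ with $ W = \left( X^{\top} X + \lambda I \right)^{-1} $ shrinks monotonically as λ grows, regardless of how biased the estimates become (a minimal sketch with simulated predictors; the sample size and penalties are arbitrary choices):

```python
import numpy as np

def ridge_se(X, lam, sigma2=1.0):
    """Ridge standard errors from Var(b) = sigma^2 * W X'X W, where
    W = (X'X + lam I)^-1; lam = 0 reproduces the OLS variance."""
    p = X.shape[1]
    W = np.linalg.inv(X.T @ X + lam * np.eye(p))
    V = sigma2 * W @ (X.T @ X) @ W
    return np.sqrt(np.diag(V))

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))

# The penalty mechanically narrows the CIs even though the corresponding
# estimates grow more biased: narrow ridge CIs do not signal accuracy.
assert (ridge_se(X, 50.0) < ridge_se(X, 0.0)).all()
assert (ridge_se(X, 5000.0) < ridge_se(X, 50.0)).all()
```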
We can again demonstrate this problem using simulations. Returning to the simulated data we used in Figure 1, we calculate CIs for OLS and ridge regression. For each simulation, we record whether the CI overlaps the simulated parameters for $\beta_X$ and $\beta_Z$, illustrated in Figure 2. The OLS CIs have the appropriate coverage (95%) across all simulated values of $\beta_X$. For the ridge CIs, however, we again see an interaction between λ and the size of $\beta_X$. When λ and $\beta_X$ are both small, the CIs still have coverage close to the 95% benchmark. As $\beta_X$ gets larger, the coverage rate begins to fall for both $\beta_X$ and $\beta_Z$. This problem is exacerbated by larger values of λ—the coverage rate falls precipitously and approaches zero. At the highest levels of λ and $\beta_X$, the CIs for $\beta_X$ and $\beta_Z$ never overlap the population values of these parameters. In all but the most favorable circumstances, ridge regression is simply not an appropriate tool for statistical inference.
AN EXAMPLE OF RIDGE REGRESSION ESTIMATES IMPLYING UNREALISTIC PRIORS ABOUT THE EFFECT OF TIME
In light of the issues we discuss above, we turn to CF2’s analysis. CF2 analyze the relationship between survey response/cooperation rate and partisan polarization for survey questions asked by Pew between 2004 and 2018. They operationalize polarization as the absolute Cohen’s D of mean differences between Republican and Democrat identifiers’ responses. They estimate models for six issue areas: economy, energy, immigration, civil rights, welfare, and foreign policy, specified as

$$ \mathrm{polarization}_{it} = \beta_0 + \beta_1 \mathrm{rr}_{it} + \beta_2 \mathrm{congress}_t + \beta_3 \mathrm{year}_t + \epsilon_{it}, \qquad (8) $$
where $\mathrm{rr}_{it}$ is survey i’s response rate at time t, $\mathrm{congress}_t$ is congressional polarization at time t, and $\mathrm{year}_t$ is the year in which the survey was conducted. For the contact and cooperation rate versions of these models, the $\beta_1 \mathrm{rr}_{it}$ term is replaced with $\beta_4 \mathrm{contact}_{it}$ and $\beta_5 \mathrm{cooperation}_{it}$, respectively. Equation 8, which adds a year term to an earlier model estimated by Cavari and Freedman (Reference Cavari and Freedman2018, henceforth CF1), was suggested by Mellon and Prosser (Reference Mellon and Prosser2021, hereafter MP) in a critique of CF1 as one of several ways they could deal with the problem of time trends.Footnote 1 CF2 estimate their models using ridge regression (unlike MP, who use OLS) with λ chosen by the KM4 method. CF2 report statistically significant negative effects of response and cooperation rates (i.e., lower response rates increase the estimated level of polarization) for three areas: economy, energy, and immigration.
CF2’s KM4 ridge regression estimates imply strong priors that are between 1.5% and 5.9% the width of a weakly informative prior. Are these priors plausible when considering time trends? A priori, there are good reasons to think not. Many social processes can be modeled as random walks or similar processes. Independent random walks will tend to correlate with time (Granger and Newbold Reference Granger and Newbold1974) because their variance increases as a function of time (see Section S2 of the Supplementary Material).
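The tendency of independent random walks to correlate with time is easy to reproduce by simulation (an illustrative sketch; the series length and number of walks are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_walks = 60, 2_000

# Independent random walks: cumulative sums of white noise, with no causal
# relationship to time whatsoever.
walks = rng.standard_normal((n_walks, T)).cumsum(axis=1)

# Pearson correlation of each walk with the time index.
t = np.arange(T)
t_std = (t - t.mean()) / t.std()
corrs = (walks - walks.mean(axis=1, keepdims=True)) @ t_std / (T * walks.std(axis=1))

# Despite the absence of any causal link, the typical walk is strongly
# correlated with time.
print(np.median(np.abs(corrs)))
```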
Moreover, we can show that many social science variables are correlated with time in practice. We calculate the annual mean of binary, ordinal, and continuous variables asked in 15 or more years of the GSS (Smith et al. Reference Smith, Davern, Freese and Morgan2019) and estimate a linear model of each variable on time using OLS. Figure 3 shows a histogram of the standardized coefficients for the effect of time on each variable. The distribution has far higher dispersion than CF2’s priors imply, with a standard deviation of 0.657 and heavy tails close to 1 and −1.
To show how different these priors are from CF2’s, we can consider how likely we are to observe a 0.5 or larger magnitude unit-scale coefficient. The weakest of CF2’s priors implies the probability of observing a 0.5 or larger unit-scale coefficient (considerably smaller than the mean OLS-year estimate on their data) is 2.8 × 10−17. By contrast, a N(0, 0.657) prior gives a probability of 0.447 of observing an effect size magnitude of 0.5 or larger.
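These probabilities follow directly from the normal tail formula (a sketch; the prior SD of 0.059 is our reading of CF2’s weakest prior, i.e., 5.9% of a unit-scale SD, so the exact tiny probability should be treated as approximate):

```python
import math

def normal_tail_prob(effect, prior_sd):
    """P(|beta| >= effect) under a N(0, prior_sd) prior, computed via the
    complementary error function: erfc(z / sqrt(2)) with z = effect / sd."""
    z = effect / prior_sd
    return math.erfc(z / math.sqrt(2))

# Under the GSS-informed N(0, 0.657) prior, |beta| >= 0.5 is unremarkable...
assert abs(normal_tail_prob(0.5, 0.657) - 0.447) < 0.001
# ...while under a prior at roughly 5.9% of a unit SD it is effectively
# impossible, in line with the contrast discussed above.
assert normal_tail_prob(0.5, 0.059) < 1e-15
```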
CF2’s narrow priors mean they strongly shrink the year coefficient in their models, which risks reopening the backdoor causal paths that the inclusion of a time control is designed to block.
AN EXAMPLE OF HOW A LARGE SHRINKAGE PARAMETER AFFECTS EMPIRICAL RESULTS
To demonstrate the impact of CF2’s regularization approach, we reestimate their models using a variety of approaches: their original KM4 ridge regressions, OLS, ridge regression with λ chosen by CV, a weakly informative N(0, 1) unit-scaled prior, a prior informed by the GSS distribution of time effect sizes N(0, 0.657), LASSO with λ chosen by CV, and elastic net (α = 0.5) with λ again chosen by CV. These estimators show the impact of different regularization approaches, but they all share a common approach to controlling for time (a linear trend), which may not fully account for time-series problems. Given this, we benchmark them against a more sophisticated repeated cross-section model suggested by Lebo and Weber (Reference Lebo and Weber2015) (ARIMA-MLM—we elaborate on this approach in Section S1 of the Supplementary Material). These estimates for the key response and cooperation rate variables are shown in Figure 4.Footnote 2
The results in Figure 4 present a striking pattern—six of the seven estimators of the linear-time-trend model produce similar results, and one does not. With the exception of the economy issue area, CF2’s KM4 results are considerably closer to zero than either the OLS estimate or the other regularized estimates. All of the unregularized estimates, and nearly all of the other regularized estimates apart from KM4, have estimates in the opposite direction to those reported by CF2. Those estimates that CF2 report as negative have swapped directions compared to their unregularized equivalents—the same effect we demonstrated earlier in our simulations—and are driven not by the data, but by high λ penalties.Footnote 3 Only in one other case—ridge regression with λ chosen by CV for the economy—are the estimates of response and cooperation rate effects negative. Even these estimates do not support an effect of response rates on mass polarization: the CIs for both overlap zero.
Appendix R3 presents further simulations that examine the likelihood of false positives using CF2’s approach if the true effect of response/cooperation rates was zero. The results are stark. We observe false positives more than 98% of the time for the effect of response/cooperation rates for energy, immigration, and foreign policy and more than 83% for the economy. In other words, CF2’s results are what we would expect to see if there were no true effect of response/cooperation rate on mass polarization.
Should we conclude that the effects of response/cooperation rates on mass polarization are null or even positive (i.e., the opposite of CF2’s claims)? While there are plausible theoretical models under which lower response rates could deflate measures of mass polarization (see Appendix R4), the evidence is insufficient to make that claim. Although some OLS estimates of the linear-time-trend model suggest positive effects, the ARIMA-MLM approach (and several other alternative model specifications) never finds significant effects in either direction, suggesting that the linear-time-trend specification does not fully account for structural properties such as autocorrelation. We believe that ARIMA-MLM is a better model specification for removing spurious time effects, so we prioritize its results over those of the linear specification used by CF2.
We also refrain from claiming a precise null effect. As Section S4 of the Supplementary Material shows, the data are underpowered even if the linear time trend specification is correct. We suspect that matched individual-level and administrative data (e.g., Clinton, Lapinski, and Trussler Reference Clinton, Lapinski and Trussler2022) may be a more fruitful avenue for exploring this question. Our claim is merely that the apparent evidence of an effect of response rates on mass polarization is driven by an inappropriate estimator and that the data are insufficient to test for effects of the claimed size.
CONCLUSIONS
The problem of correlated predictors is a real concern for scholars. Regularization through weakly informative priors or equivalent frequentist regularized estimators may be helpful for reducing variance in these cases with only a limited increase in bias. However, we urge scholars to avoid using data-driven regularization procedures without assessing their substantive implications. With overly strong penalty terms, regularization can reintroduce confounding through backdoor causal paths that are blocked by covariate adjustment with an unregularized estimator.Footnote 4 We recommend either directly using Bayesian priors, or translating the implicit assumptions of frequentist regularized regressions into priors so that their plausibility and impact can be assessed.
There are also broader lessons that can be taken from our analysis. First, correlation with time is a serious threat to valid statistical inference. Many—perhaps most—social science variables are correlated with time. Any statistical model that uses longitudinal data needs to account for time. Second, social scientists frequently use complex methods to try to extract the maximum available information from limited data, which often entail opaque assumptions about the world. These assumptions should be made explicit and justified. Third, simulation should be part of any social scientist’s toolkit. Simulations make clear the impact that different modeling choices can have on our estimates, and require little expertise beyond that needed to estimate the same statistical models on observed data.
SUPPLEMENTARY MATERIAL
To view supplementary material for this article, please visit https://doi.org/10.1017/S0003055424000935. Further supplementary material (appendices labeled with R) can be found with the replication material.
DATA AVAILABILITY STATEMENT
Research documentation and data that support the findings of this study are openly available at the American Political Science Review Dataverse: https://doi.org/10.7910/DVN/VEFZXI.
ACKNOWLEDGMENTS
The views expressed herein are those of the authors and do not reflect the position of the United States Military Academy, the Department of the Army, or the Department of Defense. Thank you to Jack Bailey and Rex Douglas for their insightful comments on earlier drafts of this paper.
CONFLICT OF INTEREST
The authors declare no ethical issues or conflicts of interest in this research.
ETHICAL STANDARDS
The authors affirm that this research did not involve human participants.