1 Introduction
Entropy balancing (ebal) is a popular reweighting method for estimating the average treatment effect on the treated (ATT) with nonexperimental data and a binary treatment (Hainmueller 2012). It adjusts the weights of the control units to achieve exact covariate balance by solving the following constrained minimization problem:
$$\begin{aligned} \min_{w}\;\; & H(w)=\sum_{i\in \mathcal{C}} w_{i}\log\big(w_{i}/q_{i}\big)\\ \text{s.t.}\;\; & \sum_{i\in \mathcal{C}} w_{i}G_{ij}=m_{j}\;\;\text{for } j=1,\dots,J, \qquad \sum_{i\in \mathcal{C}} w_{i}=1, \qquad w_{i}>0, \end{aligned}\tag{1}$$
in which $w = \{w_{i}\}_{i\in \mathcal{C}}$ is the set of solution weights for units in the control group (denoted $\mathcal{C}$); $q_{i}>0$ is the base weight for unit i (with $\sum_{i\in \mathcal{C}}q_{i}=1$); $H(\cdot)$ is the Kullback–Leibler divergence between the distributions of the solution weights and the base weights; and $\sum_{i\in \mathcal{C}}w_{i}G_{ij}=m_{j}$ specifies a set of J balance constraints, where $G_{i}\in \mathbb{R}^{J}$ collects J pretreatment covariates for unit i and $m_{j}$ is the mean of the jth covariate in the treatment group.
Despite its appealing properties, such as exact balance, computational efficiency, and double robustness, ebal has two main drawbacks. First, it requires researchers to specify which moments of the covariates to balance on, which leaves room for specification searching and selective reporting. Second, when the number of control units is small relative to the number of available covariates, the algorithm either fails to converge or generates highly concentrated weights. As a result, researchers face a dilemma: balancing on too few terms leads to bias, while balancing on too many terms may be infeasible or may induce high variance due to extreme weights. To address these problems, we propose hierarchically regularized entropy balancing (hbal) as an extension of ebal. hbal achieves approximate balance on reasonably flexible functions of the covariates through a ridge-regularized entropy balancing framework.
2 Approximate Balancing with Hierarchical Regularization
Hainmueller (2012) proves that the global solution for each control unit i's weight exists and is given by $w_{i}^{ebal}=\frac{q_{i}\exp(-G_{i}^{\prime}Z)}{\sum_{j\in \mathcal{C}}q_{j}\exp(-G_{j}^{\prime}Z)}$, where $Z = \{\lambda_{1}, \lambda_{2}, \dots, \lambda_{J}\}^{\prime}$ is a set of Lagrange multipliers for the balance and normalizing constraints and $D_{i} \in \{0, 1\}$ is the binary treatment indicator. Using the Lagrange multipliers and the solution weights $w_{i}^{ebal}$, the constrained optimization problem in (1) can be rewritten as the following unconstrained dual problem:
$$\min_{Z}\;\; \log\bigg(\sum_{i\in \mathcal{C}} q_{i}\exp\big(-G_{i}^{\prime}Z\big)\bigg) + \sum_{j=1}^{J} m_{j}\lambda_{j}. \tag{2}$$
After obtaining $w_{i}^{ebal}$, one can use a difference-in-means (DIM) approach to estimate the ATT:
$$\hat{\tau}_{ebal}=\frac{1}{n_{1}}\sum_{i:D_{i}=1}Y_{i}-\sum_{i\in \mathcal{C}}w_{i}^{ebal}Y_{i}, \tag{3}$$
in which $n_{1}$ is the number of treated units. Zhao and Percival (2017) show that Problem (1) is an M-estimator for the propensity score with a logistic link using G as predictors, and that the solution weights $w_{i}^{ebal}$ belong to a class of inverse probability weights. They also show that, under strong ignorability and positivity, $\hat{\tau}_{ebal}$ is a doubly robust estimator of the ATT: when either the untreated potential outcome $Y_{i}(0)$ or the log-odds of treatment assignment is linear in G, $\hat{\tau}_{ebal}$ is consistent (see the Supplementary Material (SM) for details).
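To fix ideas, the base-R sketch below implements the dual formulation in (2) and the DIM estimator in (3). It is expository code of our own, not the authors' hbal package implementation; the function and argument names are ours.

```r
## Minimal sketch of entropy balancing via its dual (Equation (2)).
## Inputs: D (binary treatment), G (n x J covariate matrix), Y (outcome).
ebal_weights <- function(D, G, q = NULL) {
  Gc <- G[D == 0, , drop = FALSE]                 # control-unit covariates
  m  <- colMeans(G[D == 1, , drop = FALSE])       # treated-group moment targets
  if (is.null(q)) q <- rep(1 / nrow(Gc), nrow(Gc))  # uniform base weights

  # Dual objective: log(sum_i q_i exp(-G_i'Z)) + m'Z, convex in Z
  dual <- function(Z) {
    eta <- as.vector(Gc %*% Z)
    log(sum(q * exp(-eta))) + sum(m * Z)
  }
  opt <- optim(par = rep(0, ncol(Gc)), fn = dual, method = "BFGS")

  # Solution weights: w_i proportional to q_i exp(-G_i'Z), normalized to sum to 1
  w <- q * exp(-as.vector(Gc %*% opt$par))
  w / sum(w)
}

# Difference-in-means ATT estimate (Equation (3))
att_dim <- function(Y, D, w) mean(Y[D == 1]) - sum(w * Y[D == 0])
```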
In practice, the linearity assumption may be unrealistic. To make it more plausible, we can conduct a series expansion of G, for example, by including higher-order terms and various interactions, obtaining $X\in \mathbb{R}^{T}$ with $T \gg J$. However, exactly balancing on high-dimensional X is often infeasible; even when it is feasible, the large number of balance constraints may cause the solution weights to concentrate heavily on a few control units, resulting in high variance of the ATT estimates. hbal addresses this problem by modifying the dual objective in (2), that is, by adding an $\ell^{2}$ penalty with a hierarchical structure on the Lagrangian multipliers:
$$\min_{Z^{+}}\;\; \log\bigg(\sum_{i\in \mathcal{C}} q_{i}\exp\big(-X_{i}^{\prime}Z^{+}\big)\bigg) + \sum_{t=1}^{T} m_{t}\lambda_{t} + \sum_{k=1}^{K}\alpha_{k}r_{k}, \tag{4}$$
where $Z^{+} = \{\lambda_{1}, \dots, \lambda_{T}\}^{\prime}$ is a vector of Lagrangian multipliers corresponding to the T moment conditions $\sum_{i\in \mathcal{C}}w_{i}X_{it}=m_{t}$, with $m_{t}$ the mean of the tth expanded covariate in the treatment group. $\sum_{k=1}^{K} \alpha_{k}r_{k}$ is the newly added penalty term, in which $\alpha_{k}$ is a scalar tuning parameter that adjusts the strength of the penalty for the kth group, for $k = 1, 2, \dots, K$, and $r_{k} = \sum_{t \in P_{k}} \lambda_{t}^{2}$ is the squared $\ell^{2}$ norm of the Lagrangian multipliers ($\lambda_{t}$) for the moment conditions in the kth group, where $P_{k}$ is the set of their indices. We choose the $\ell^{2}$ penalty over the $\ell^{1}$ penalty mainly because the former is twice differentiable, which makes computation much more efficient. This grouped structure allows differential strengths of regularization to be applied to different groups of balance constraints and prioritizes feature groups that have a heavy influence on the overall covariate balance between the treatment and control groups. For example, in a given application, two-way interactions may matter more for overall balance than the squared terms of the covariates (see the SM for a performance comparison of hierarchical and nonhierarchical regularization). This optimization problem gives $w_{i}^{hbal}=\frac{q_{i}\exp(-X_{i}^{\prime}Z^{+})}{\sum_{j\in \mathcal{C}}q_{j}\exp(-X_{j}^{\prime}Z^{+})}$ ($i\in \mathcal{C}$) and
$$\hat{\tau}_{hbal}=\frac{1}{n_{1}}\sum_{i:D_{i}=1}Y_{i}-\sum_{i\in \mathcal{C}}w_{i}^{hbal}Y_{i}. \tag{5}$$
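A minimal extension of the sketch above illustrates how the grouped ridge penalty in (4) enters the dual; again, this is an expository sketch of our own rather than the package implementation. Here `groups` is an assumed integer vector mapping each column of the expanded matrix X to one of K penalty groups, and `alpha` holds the K tuning parameters (`alpha[1] = 0` keeps the level terms exactly balanced, as described below).

```r
## Sketch of the hbal penalized dual (Equation (4)).
hbal_weights <- function(D, X, groups, alpha, q = NULL) {
  Xc <- X[D == 0, , drop = FALSE]               # control-unit expanded covariates
  m  <- colMeans(X[D == 1, , drop = FALSE])     # treated-group moment targets
  if (is.null(q)) q <- rep(1 / nrow(Xc), nrow(Xc))

  # Penalized dual: log-sum-exp term + m'Z + sum_k alpha_k * ||Z_{P_k}||_2^2
  dual_pen <- function(Z) {
    eta <- as.vector(Xc %*% Z)
    log(sum(q * exp(-eta))) + sum(m * Z) + sum(alpha[groups] * Z^2)
  }
  opt <- optim(par = rep(0, ncol(Xc)), fn = dual_pen, method = "BFGS")

  w <- q * exp(-as.vector(Xc %*% opt$par))
  list(Z = opt$par, w = w / sum(w))             # multipliers and solution weights
}
```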
2.1 Implementation Details
Implementing hbal involves several technical details, such as grouping moment conditions, selecting tuning parameters, and prescreening covariates. Due to space limitations, we only provide a sketch here and offer more details in the SM.
In terms of grouping, we put all the level terms of G in the first group ($k=1$); two-way interactions ($k=2$), squared terms ($k=3$), three-way interactions ($k=4$), interactions between squared and level terms ($k=5$), and cubic terms ($k=6$) each form a separate group. Because the Lagrangian multipliers can be interpreted as covariate coefficients in a logistic regression for the propensity score, shrinking them differentially enables us to prioritize groups of features in the expanded covariate space that are more predictive of the propensity score. By default, we impose a hierarchical structure by setting $\alpha_{1} = 0$; that is, hbal seeks exact balance on the level terms, just like ebal, and regularizes only the higher-order moment constraints. When $\alpha_{2} = \alpha_{3} = \cdots = \alpha_{K} = 0$, hbal reduces to ebal applied to the full series expansion of the covariates. To select the tuning parameters, we combine a trust-region optimization method (Powell 1994) with a V-fold cross-validation procedure that minimizes the mean absolute imbalance of the expanded covariates between the treated units and a held-out subsample of control units (see the sketch below). This procedure encapsulates the intuition that the optimal Lagrangian multipliers, from which the solution weights are constructed, should generalize to randomly selected held-out data and produce approximate covariate balance there. While the proposed approach is data-driven, we also allow researchers to incorporate prior knowledge when applying hbal, for example, by imposing custom covariate groupings or by restricting the parameter space of the tuning parameters.
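The cross-validation criterion can be sketched as follows, reusing the `hbal_weights` function from the sketch above. The outer trust-region search over `alpha` is omitted, and all names are ours rather than the package's.

```r
## Sketch of the V-fold cross-validation score for a candidate alpha: fit the
## penalized dual on each training fold of controls, reweight the held-out
## controls with the fitted multipliers, and score their mean absolute
## imbalance against the treated-group targets.
cv_imbalance <- function(alpha, D, X, groups, V = 5) {
  m    <- colMeans(X[D == 1, , drop = FALSE])   # treated-group targets
  ctrl <- which(D == 0)
  fold <- sample(rep(seq_len(V), length.out = length(ctrl)))
  errs <- sapply(seq_len(V), function(v) {
    keep <- c(which(D == 1), ctrl[fold != v])   # treated + training controls
    Z    <- hbal_weights(D[keep], X[keep, , drop = FALSE], groups, alpha)$Z
    Xte  <- X[ctrl[fold == v], , drop = FALSE]  # held-out controls
    w    <- exp(-as.vector(Xte %*% Z)); w <- w / sum(w)
    mean(abs(as.vector(t(Xte) %*% w) - m))      # mean absolute imbalance
  })
  mean(errs)
}
```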
2.2 Combining hbal with an Outcome Model
When the number of control units is small relative to the number of moment conditions, hbal achieves only approximate balance. To remove the bias caused by the remaining imbalance, we suggest combining hbal with an outcome model that includes the same set of covariates as the moment conditions, which we label hbal+. Because hbal gives higher weights to control units whose propensity scores are similar to those of the treated units, this strategy leads to efficiency gains for a regression-based double selection approach. When combined with an outcome model, an ATT estimator is given by
$$\hat{\tau}_{hbal+}=\frac{1}{n_{1}}\sum_{i:D_{i}=1}\big(Y_{i}-\hat{g}_{0}(G_{i})\big)-\sum_{i\in \mathcal{C}}w_{i}^{hbal}\big(Y_{i}-\hat{g}_{0}(G_{i})\big), \tag{6}$$
where $\hat{g}_{0}(G_{i}) = X_{i}^{\prime}\hat{\beta}$ is based on a linear regression of the outcome on the expanded features. Zhao and Percival (2017) show that $\hat{\tau}_{hbal+}$ is consistent for the ATT when either $g_{0}$ is linear in $X_{i}$ or $w^{hbal}$ converges to the logit of the true propensity scores. With nonzero tuning parameters, hbal achieves exact balance on G but only approximate balance on $X\backslash G$. Hence, if the true propensity scores depend on $X\backslash G$, $w^{hbal}$ does not converge to the logit of the true propensity scores, in which case a correctly specified outcome model $g_{0}$ ensures the consistency of $\hat{\tau}_{hbal+}$.
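As a rough illustration of (6), the sketch below residualizes the outcome with a linear regression on the expanded covariates (fit here on the control units, one reasonable choice) and applies the hbal weights to the residuals; `w_hbal` can be taken from `hbal_weights(...)$w` in the earlier sketch. This is our own sketch, not the package implementation.

```r
## Sketch of the hbal+ estimator in (6). Assumes X has full column rank so that
## lm() returns a complete coefficient vector.
att_hbal_plus <- function(Y, D, X, w_hbal) {
  g0   <- lm(Y ~ X, subset = D == 0)            # outcome model fit on control units
  ghat <- as.vector(cbind(1, X) %*% coef(g0))   # fitted g0_hat for all units
  res  <- Y - ghat
  mean(res[D == 1]) - sum(w_hbal * res[D == 0]) # weighted DIM of residuals
}
```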
2.3 Related Work
hbal builds on a class of preprocessing methods that explicitly seek to achieve approximate covariate balance for causal inference purposes (e.g., Athey, Imbens, and Wager 2018; Hazlett 2020; Imai and Ratkovic 2014; Ning, Sida, and Imai 2020; Zubizarreta 2015). These methods have been shown to estimate propensity scores with loss functions targeting good covariate balance (Ben-Michael, Feller, and Rothstein 2021; Wang and Zubizarreta 2019; Zhao and Percival 2017). hbal extends this line of research in that it aims at achieving approximate balance in a large covariate space. Hence, hbal's solution weights can be interpreted as penalized propensity scores with a special loss function. Moreover, the balancing approach is closely connected to the survey literature on calibrated weighting, or raking (e.g., Deming and Stephan 1940). The key component of hbal, hierarchical ridge regularization, shares similarities with recent research in survey methods that deals with the high dimensionality of cross-tabulations of respondent characteristics (Ben-Michael, Feller, and Hartman 2021; Caughey and Hartman 2017; Tan 2020).
3 Monte Carlo Evidence
To evaluate the performance of hbal, we conduct Monte Carlo simulations comparing hbal and hbal+ with five commonly used matching and weighting methods: inverse propensity score weighting (PSW; e.g., Hirano, Imbens, and Ridder 2003), covariate balancing propensity score (CBPS; Imai and Ratkovic 2014), coarsened exact matching (CEM; Iacus, King, and Porro 2012), ebal (Hainmueller 2012), and kernel balancing (kbal; Hazlett 2020). To illustrate the advantage of hierarchical regularization, we also report results from using ebal to balance on the serially expanded covariate set (ebal*). The naive DIM (Raw) estimator is included as a benchmark.
3.1 Design
We use six covariates $G = \{G_{1}, G_{2}, G_{3}, G_{4}, G_{5}, G_{6}\}$, in which $G_{1}, \dots, G_{4}$ are drawn from a multivariate normal distribution with means $0$, variances $1$, and covariances of $0.2$, and $G_{5}$ and $G_{6}$ are independently drawn from Bernoulli distributions with success probabilities $0.3$ and $0.2$, respectively. To simulate a complex treatment assignment process, we let all level terms and a random subset of higher-order terms enter the treatment assignment and draw their coefficients from a normal distribution. Specifically, the treatment indicator is given by $D=\mathbf{1}\{f(G)-2 +\epsilon> 0\}$, where $f(G)$ is a linear combination of the chosen subset of the serial expansion of G and $\epsilon \stackrel{i.i.d.}{\sim} N(0, \sqrt{8})$. To capture the empirical variability in outcome-generating processes, we consider three outcome designs with increasing degrees of nonlinearity: (1) linear: $Y = G_{1}+G_{2}+ G_{3}-G_{4} +G_{5} +G_{6} + u$; (2) nonlinear: $Y=G_{1} + G_{1}^{2} - G_{4}G_{6} +u$; and (3) trigonometric: $Y=2\cos(G_{1}) - \sin(\pi G_{2})+u$, with $u \stackrel{i.i.d.}{\sim} N(0, 1)$ and the true treatment effect fixed at $0$ for all units. For PSW, CBPS, CEM, ebal, and kbal, we include only the original variables G (note that kbal expands the covariate space through Gaussian kernels); for ebal*, hbal, and hbal+, we include all 69 covariates in the third-degree series expansion of G.
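For concreteness, the sketch below generates data with this structure for the linear outcome design. Because the exact subset of higher-order terms entering $f(G)$ and their coefficients are randomized in the paper, the interaction terms and coefficients here are placeholders of our own choosing, and we treat $\sqrt{8}$ as the standard deviation of $\epsilon$.

```r
## Sketch of the simulation design (linear outcome, design 1). The subset of
## higher-order terms in f(G) and all coefficients below are placeholders.
set.seed(1)
n  <- 900
Sg <- matrix(0.2, 4, 4); diag(Sg) <- 1
G  <- cbind(MASS::mvrnorm(n, rep(0, 4), Sg),   # G1-G4: correlated normals
            rbinom(n, 1, 0.3),                 # G5 ~ Bernoulli(0.3)
            rbinom(n, 1, 0.2))                 # G6 ~ Bernoulli(0.2)
colnames(G) <- paste0("G", 1:6)

# f(G): level terms plus an arbitrary subset of higher-order terms
fG <- cbind(G, G[, 1] * G[, 2], G[, 2] * G[, 3], G[, 1]^2, G[, 4] * G[, 6]) %*%
  rnorm(10)
D  <- as.numeric(fG - 2 + rnorm(n, sd = sqrt(8)) > 0)   # sqrt(8) taken as the SD

# Linear outcome design; the true treatment effect is zero for all units.
Y <- G[, 1] + G[, 2] + G[, 3] - G[, 4] + G[, 5] + G[, 6] + rnorm(n)
```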
3.2 Results
Figure 1 presents the simulation results with sample size $N=900$ and a control-to-treatment ratio of 5:1. We report additional results with varying outcome designs, sample sizes, and control-to-treatment ratios in the SM. The comparative performance of hbal (and hbal+) is similar across these simulation setups.
For the linear outcome design (design 1), most methods substantially reduce bias compared to the naive DIM estimator. One exception is ebal*, whose poor performance is caused by nonconvergence of the ebal algorithm with many moment constraints. In comparison, hbal's ability to prioritize balance across covariate moments leads to superior performance in both bias and variance reduction. As the outcome-generating process becomes more complex, methods that rely only on G to estimate propensity scores or weights perform poorly. Compared to ebal, ebal*, and kbal, hbal and hbal+ yield estimates with substantially less bias and smaller variance in designs 2 and 3. These results demonstrate that, when the data-generating process is not too far from a polynomial expansion of the covariates, hbal's hierarchical structure can tailor the strength of regularization to the importance of the balance constraints, reducing bias while keeping the variance of the estimates relatively small. In the SM, we provide additional evidence that hierarchical regularization leads to a higher correlation between the solution weights and the true propensity scores.
Moreover, hbal is much more computationally efficient and scalable than kbal. The bottom-right panel of Figure 1 shows the average running time across $500$ samples for varying sample sizes. Across sample sizes, hbal finds the solution weights in a fraction of kbal's running time, and its advantage in scalability becomes more evident as the sample size increases.
4 Promotion Prospect and Circuit Court Judge Behavior
To illustrate how hbal works in empirical settings, we replicate a recent article by Black and Owens (2016), who study the effect of the prospect of promotion to the Supreme Court on the behavior of circuit court judges. Using CEM, the authors show that judges who are on the president's shortlist to fill Supreme Court vacancies are more likely to vote in line with the president's position during vacancy periods than during nonvacancy periods; they find no such effect among noncontending judges, who stand little chance of being nominated to the Supreme Court. We focus on whether circuit court judges ruled in line with the president as the outcome of interest. The binary treatment variable is vacancy period (vs. nonvacancy period). To address potential confounding, the authors use CEM to match cases on seven covariates that might influence both a judge's behavior and the treatment, such as the judge's Judicial Common Space score, the judge's ideological alignment with the president, and whether the case decision was reversed by the circuit court.
In Figure 2, we compare the results from mean balancing on the level terms of the covariates using ebal+ (shown as a solid circle) with those from balancing on a set of serially expanded covariates using hbal+ (shown as a solid triangle). To assess whether the ebal+ estimate is sensitive to different model specifications, we also include an additional 500 models in which random higher moments of the covariates are included (shown in gray). For both methods, we use the solution weights to estimate a weighted linear regression and report 95% confidence intervals based on standard errors clustered at the individual judge level. For contending judges, estimates from both methods indicate that judges are more likely to rule in line with the president during vacancy periods than during nonvacancy periods, although the ebal+ estimate using the level terms only is not statistically significant at the $5\%$ level. Because hbal+'s specification includes higher-order terms that can explain additional variation in the outcome and treatment assignment, we obtain a more precise and likely more reliable estimate than ebal+'s. For noncontending judges, ebal+'s estimate suggests that these judges are more likely to rule in line with the president during vacancy periods, while hbal+'s estimate shows no significant difference between the vacancy and nonvacancy periods. In short, hbal+'s results are broadly in line with Black and Owens' (2016) original findings, whereas the ebal+ estimates for both contending and noncontending judges vary widely depending on the specification, ranging from negative to positive effects.
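The estimation step described above can be sketched as follows. The data frame `dat` and the variables `ruled_with_president`, `vacancy`, `w` (the hbal+ solution weights), and `judge` are hypothetical names standing in for the replication data; this is not the replication code itself.

```r
## Weighted linear regression with standard errors clustered by judge
## (a minimal sketch; variable names are placeholders).
library(sandwich)
library(lmtest)

fit <- lm(ruled_with_president ~ vacancy, data = dat, weights = w)
coeftest(fit, vcov. = vcovCL(fit, cluster = dat$judge))   # clustered inference
coefci(fit, vcov. = vcovCL(fit, cluster = dat$judge))     # 95% confidence intervals
```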
5 Conclusion
In this letter, we extend ebal to hbal by introducing hierarchical regularization on the Lagrangian multipliers in the dual objective function. hbal achieves approximate balance on a potentially large covariate space. Through simulations and an empirical study, we demonstrate hbal's desirable properties in comparison to ebal and other commonly used preprocessing methods. We also show that hbal is computationally more efficient than kbal, another popular covariate balancing method. hbal can thus serve as a building block for methods that seek approximate covariate balance. To facilitate implementation, we develop an open-source R package, hbal.
In the SM, we provide more details on hbal's identifying assumptions and theoretical guarantees, its algorithm, implementation procedure, and inferential method, as well as additional information on the simulation results and the empirical example. We also apply hbal to the well-known LaLonde data and find reassuring results, which are reported in the SM.
Acknowledgements
We thank Ryan Black, Avi Feller, Justin Grimmer, Jens Hainmueller, Chad Hazlett, Molly Offer-Westort, Xiang Zhou, seminar participants at Stanford and UCSD, the Editor, Jeff Gill, as well as four anonymous reviewers for extremely helpful comments.
Data Availability Agreement
Replication data and code for this article have been published at Harvard Dataverse at https://doi.org/10.7910/DVN/QI2WP9 (Xu and Yang 2022).
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2022.12.