1. Introduction
Causal inference in social sciences stands out as an increasingly attractive field. Currently, it is acknowledged as a missing data problem, where one of the potential outcomes is missing for each unit (Little and Rubin, Reference Little and Rubin2019). This problem is widely addressed by inverse probability weighting (IPW) based on the propensity score estimation under the conditional ignorability assumption. In a review of papers in the American Journal of Political Science, the American Political Science Review, and the Journal of Politics from 2000 to 2022,Footnote 1 I found that 35 articles employ the IPW. Notably, 30 of them appear in the past five years, making it the most widely used method among the popular weighting or matching methods, including the entropy balancing (Hainmueller, Reference Hainmueller2012), propensity score matching, and genetic matching (Diamond and Sekhon, Reference Diamond and Sekhon2013), in each of the five years. Despite its increasing popularity, however, it is vulnerable to propensity score model misspecification (Kang and Schafer, Reference Kang and Schafer2007; Imai and Ratkovic, Reference Imai and Ratkovic2014), which is rather common in practice.
To mitigate the bias due to model misspecification, existing studies have proposed various methods balancing some moments (or kernels) of observed covariates between the target and weighted groups (Hainmueller, Reference Hainmueller2012; Imai and Ratkovic, Reference Imai and Ratkovic2014; Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Zubizarreta, Reference Zubizarreta2015; Chan et al., Reference Chan, Yam and Zhang2016; Wong and Chan, Reference Wong and Chan2017; Zhao, Reference Zhao2019; Hazlett, Reference Hazlett2020; Kallus, Reference Kallus2020; Tan, Reference Tan2020; Wang and Zubizarreta, Reference Wang and Zubizarreta2020). However, the efficacy of these methods is contingent upon stringent assumptions. Specifically, removing bias by balancing specific moments requires prior knowledge of the true outcome model (Zubizarreta, Reference Zubizarreta2015; Zhao and Percival, Reference Zhao and Percival2017; Zhao, Reference Zhao2019; Fan et al., Reference Fan, Imai, Lee, Liu, Ning and Yang2021), which is an impractical expectation in the face of real-world scenarios characterized by nonlinearity, interactions, and modifications. Instead, I propose to balance the multivariate confounder distribution. This novel perspective, while acknowledged in previous works (Hainmueller, Reference Hainmueller2012; Zubizarreta, Reference Zubizarreta2015; Li et al., Reference Li, Morgan and Zaslavsky2018), remains largely unexplored.
In this paper, I propose an estimation method for the IPW with a misspecified propensity score model that mitigates bias and controls the mean squared error (MSE) by minimizing the imbalance in the multivariate covariate distribution rather than only specific moments. The key idea is to use the estimating equations optimized for estimating the (inverse probability) weights rather than the treatment assignment or propensity scores. Specifically, it estimates propensity scores by minimizing the Kullback–Leibler (KL) divergence between the true and estimated weights, which directly quantifies the quality of the IPW estimation. This idea builds upon the quasi maximum likelihood estimation (MLE) with misspecified models, which minimizes the KL divergence between the true and estimated distribution (Akaike, Reference Akaike1973; White, Reference White1982). I demonstrate that the KL divergence between the weights can be calculated up to a constant without requiring knowledge of the true weights or propensity scores and that the proposed method mitigates bias and controls the MSE of parameters of interest by minimizing their upper bounds (Section 3.3 and 3.5).
Figure 1 provides an overview of the proposed method, summarizing its two main goals of the proposed method and two associated intermediate goals for each along with the propositions for the theoretical results. The first main goal is (a) mitigating bias of the parameter of interest. The associated intermediate goals are (a-1) minimizing imbalance in multivariate covariate distribution between the target group and the weighted group and (a-2) minimizing an upper bound of bias. The second main goal is (b) controlling the MSE of the parameter of interest. The associated intermediate goals are (b-1) minimizing an upper bound of the relative error of (inverse probability) weights, and (b-2) minimizing an upper bound of the MSE. Section 2 presents a numerical example to illustrate that the proposed method achieves these goals better than the MLE.
The proposed method has several additional attractive characteristics. First, the proposed method can incorporate the outcome regression model to substitute for the mean balancing condition (Section 4.3). This allows it to remove bias under the same conditions as the recently proposed covariate balancing methods such as the covariate balancing propensity scores (CBPS) (Imai and Ratkovic, Reference Imai and Ratkovic2014), bias-reduced doubly robust estimator (Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015), calibrated weighting (Tan, Reference Tan2020), and entropy balancing (Hainmueller, Reference Hainmueller2012). Second, like the CBPS, it models propensity scores explicitly and estimates them consistently when the model is correctly specified. While weight calibration methods such as the entropy balancing have an implicit propensity score model (Hainmueller, Reference Hainmueller2012; Zhao and Percival, Reference Zhao and Percival2017; Wang and Zubizarreta, Reference Wang and Zubizarreta2020), it is easier for empirical researchers to make a propensity score model by considering the treatment assignment mechanism with their domain knowledge than to incorporate it implicitly through moment conditions. Third, as a direct consequence of these two characteristics, it can become a doubly robust (DR) estimator (Section 4.3) (Seaman and Vansteelandt, Reference Seaman and Vansteelandt2018). It is consistent either if the propensity score model or the outcome model is correct, and it attains the semiparametric efficiency bound if both the models are correct (Robins et al., Reference Robins, Rotnitzky and Zhao1994).
Despite the theoretical attractiveness, the proposed method has some difficulties in implementation. First, it is not sample-bounded because the estimated weights may not sum to one; second, it is likely to overfit extreme data points; and third, it necessitates the non-convex optimization. I address the first problem by imposing the normalization constraint and deriving the unconstrained optimization problem by the Lagrangian multiplier method (Section 4.1). I address the second problem by utilizing the L2 regularization (Section 4.1). To address the third problem, I develop a majorization-minimization algorithm that iteratively optimizes two convex decompositions of the original function in Section 4.2 (Wu and Lange, Reference Wu and Lange2010).
The proposed method is motivated by an observational study of political devolution and resistance to foreign rule in France during World War II (Ferwerda and Miller, Reference Ferwerda and Miller2014). In Section 6, I apply the proposed method to this study and discover the treatment effect heterogeneity, suggesting that spill-over effects contaminate the treatment effect. I also apply the proposed method to another empirical study in Supplementary Materials H. Section 5 demonstrates the finite-sample performance of the proposed method through simulation studies.
The proposed methodology is implemented via an open-source R package dbw, which is available at https://github.com/hirotokatsumata/dbw and will be soon at CRAN.
2. A numerical example
This section presents a numerical example for illustrating that the proposed method achieves goals (a) and (b), through goals (a-1), and (b-1) better than the MLE, where goals (a-2) and (b-2) are direct consequences of (a-1) and (b-1) as later shown in propositions 2 and 4. It also provides intuition on how the proposed method achieves these goals by showing that it is optimized for estimating inverse probability weights rather than propensity scores.
For simplicity, this example examines a situation where there is only one continuous covariate X i following the standard normal distribution as well as a binary response variable R i, indicating that the outcome Y i is observed when R i = 1 but not observed when R i = 0, for observation i = 1, 2, …, 1000. The propensity score model is misspecified in the sense that it only includes a linear term for X i but the true model also includes its squared term $X_i^2$. Specifically, the misspecified model is $\Pr ( R_i = 1) = \text {expit}( \alpha + \beta X_i)$ whereas the true model is $\Pr ( R_i = 1) = \text {expit}( -1 + X + 0.5 X^2)$. The inverse probability weights are estimated as $\hat {w}_i = 1 / \widehat {\Pr }( R_i = 1)$. To consider the possible worst-case scenario for each estimator, I do not specify the data-generating process for the outcome Y i.
a-1: Distributional imbalance in the covariate
First, I compare the performance of the proposed method with the MLE in terms of distributional imbalance in the covariate X. For this purpose, a commonly used statistic is the Kolmogorov–Smirnov (KS) statistic. The KS statistics are 0.112 for the proposed method and 0.193 for the MLE, indicating that the proposed method has a smaller covariate distributional imbalance than the MLE. To compare the distribution visually, I present a figure showing the cumulative distribution functions of X in Supplementary Material B.
b-1: Relative error of the weights
Next, I compare the proposed method with the MLE in terms of the relative error of the weights. As detailed in Section 3.5, this quantity is defined as ${\mathbb E}[ ( \hat {w}_i / w_i - 1) ^2]$, where w i and $\hat {w}_i$ denote true and estimated weights. The relative error of the weights is 0.38 for the proposed method and 4.14 for the MLE, indicating that the relative error of the proposed method is smaller than one-tenth of the MLE.
2.1 Intuition: targeting weights rather than propensity scores
To gain intuition, the left panel of Figure 2 presents the true (the x-axis) and estimated weights (the y-axis) for each observation. The proposed method (triangles) estimates inverse probability weights better than the MLE (circles). Most of the triangles are within or near the shaded narrow corridor around the diagonal line, which indicates that the difference between the true and estimated weights is less than two, whereas many circles lie far from the narrow corridor.
The right panel shows the true (the x-axis) and estimated propensity scores (the y-axis) for each observation. Overall, the MLE (circles) estimates propensity scores better than the proposed method (triangles) but the proposed method restricts the difference more effectively than the MLE when the difference in the resulting inverse probability weights is large shown as the shaded area. This explains why the proposed method approximates the weights better than the MLE.
These results provide intuition when the difference between the two estimators is large: It is when some units have large estimated weights, which is likely to happen when there are unmodeled nonlinear effects of covariates on treatment assignment, there are unmodeled interaction effects of covariates on treatment assignment, or the numbers of the treated and controlled units are unbalanced. In addition, the proposed estimator is especially better than the MLE when the outcomes of these units are extreme (relative to the true average outcome), but not so much when the opposite is true. The difference in the estimated weights between the two estimators is small when no units have large estimated weights.
The root mean squared errors (RMSEs) of the estimated weights are 1.18 for the proposed method and 3.10 for the MLE, and RMSEs of the estimated propensity scores are 0.198 for the proposed method and 0.131 for the MLE, indicating that the proposed method estimates the weights better than the MLE whereas the MLE estimates the propensity scores better than the proposed method. To estimate the quantity of interest through estimating inverse probability weights, the proposed method is better suited than the MLE.
3. Proposed methodology
3.1 Setup
Suppose that there is a random sample of n units (i = 1, 2, …, n) from a population. Each unit consists of a triplet (Y i, R i, X i), where Y i is an outcome variable, R i is a binary outcome response indicator variable that takes one when Y i is observed and zero otherwise, and X i is a vector of observed covariates whose support is denoted by ${\cal X}$. The propensity score is defined as the conditional response probability given observed covariates and denoted as $\Pr ( R_i = 1 \mid X_i) = \pi ( X_i)$. We impose the overlap assumption that the propensity score is bounded away from 0 and 1: 0 < π(x) < 1 for any ${\bf x} \in {\cal X}$. We also assume that the outcome is missing at random, i,e., the outcome variable does not account for the response conditional on the observed covariates: $Y_i \; \mathop {\perp \!\!\!\!\!\!\perp }\; R_i \mid X_i$ (Little and Rubin, Reference Little and Rubin2019). These assumptions correspond to the conditional ignorability assumption in causal inference.
For simplicity, the parameter of interest in this study is the average outcome $\mu = {\mathbb E}[ Y_i]$ for the target group of R i = 1 and R i = 0, and the proposed method is easily extended to other estimands such as the average treatment effect in causal inference (Kang and Schafer, Reference Kang and Schafer2007; Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Zubizarreta, Reference Zubizarreta2015; Wong and Chan, Reference Wong and Chan2017; Seaman and Vansteelandt, Reference Seaman and Vansteelandt2018; Zhao, Reference Zhao2019; Tan, Reference Tan2020; Wang and Zubizarreta, Reference Wang and Zubizarreta2020).
3.2 Problems in the inverse probability weighting with the maximum likelihood estimation
The average outcome can be estimated unbiasedly and consistently by the IPW with true propensity scores: $\sum _{i = 1}^n R_i w_i Y_i = \sum _{i = 1}^n R_i Y_i / \pi ( X_i)$, where w i = 1/π(X i) is the true inverse probability weight for unit i. In observational studies, however, neither the true propensity scores nor the true inverse probability weights are known. In practice, researchers use a parametric propensity score model π(X i, β), where the most common choice is the logistic regression model $\pi ( X_i,\; \, \beta ) = \exp ( X_i^{\sf T} \beta ) / ( 1 + \exp ( X_i^{\sf T} \beta ) )$. Typically, the nuisance parameters β are estimated using the MLE to obtain the estimated propensity scores for each unit $\hat {\pi }_{\rm MLE}( X_i) = \pi ( X_i,\; \, \hat {\beta }_{\rm MLE})$. The resulting IPW estimator with the estimated propensity scores is consistent if the propensity score model is correct.
However, if the propensity score model is misspecified, the MLE estimator for β and the resulting IPW estimator is not consistent. Generally, the limiting values of $\hat {\beta }_{\rm MLE}$ are the solution to the following problem (Akaike, Reference Akaike1973; Tan, Reference Tan2020; White, Reference White1982):
This implies that the MLE estimates $\hat {\beta }_{\rm MLE}$ to minimizes the (generalized) KL divergence between the true responses (R and 1 − R) and their estimates ($\pi ( {\bf X},\; \, \hat {\beta }_{\rm MLE})$ and $1 - \pi ( {\bf X},\; \, \hat {\beta }_{\rm MLE})$) for the target group of R i = 1 and R i = 0, where the (generalized) KL divergence is the most common measure of the difference between two points P and Q, which is defined as ${\rm KL}( P,\; \, Q) = \sum _{i = 1}^{n} \{ P_i \log ( P_i / Q_i) - P_i + Q_i\}$. The problem is that this is not the same as the KL divergence between the true weights and estimated inverse probability weights although the latter determines the performance of the IPW estimator.
3.3 Distribution balancing weighting
This study proposes the distribution balancing weighting (DBW), which estimates such propensity scores $\pi ( X_i,\; \, \hat {\beta }_{\rm DBW})$ that minimize the KL divergence between the true and the estimated inverse probability weights. The idea of minimizing the KL divergence with possibly misspecified models originates from the quasi MLE, which minimizes the KL divergence between the true and estimated distribution when the true distribution is unknown (Akaike, Reference Akaike1973; White, Reference White1982). Since the propensity scores and their inverse probability weights have a non-linear relationship, minimizing the KL divergence between the true and estimated response assignment like the MLE does not minimize that for the inverse probability weights in general. The DBW applies the idea to the KL divergence between the true and estimated weights and obtains the following KL divergence:
The DBW estimates coefficients as the minimizers of the following loss function, which is the sample version of the KL divergence above up to a constant:
As stated formally in proposition 5, the DBW estimates propensity scores and resulting inverse probability weights consistently, like the MLE, when the propensity score model is correctly specified.
3.4 Minimizing an upper bound of bias
The main advantage of the DBW is that it also minimizes the imbalance in the multivariate covariate distribution in the target and weighted groups as stated in proposition 1 below, whose importance has been acknowledged in the existing studies (Hainmueller, Reference Hainmueller2012; Zubizarreta, Reference Zubizarreta2015; Li et al., Reference Li, Morgan and Zaslavsky2018; Zhao, Reference Zhao2019). While the propensity scores estimated by the MLE minimize the distributional imbalance in the limit when the propensity score model is correctly specified, it does not when the model is misspecified. The DBW achieves this even when the model is misspecified. The following proposition summarizes this result.
Proposition 1 (Multivariate distribution balancing property): When the propensity scores and resulting inverse probability weights are estimated using (4), the weights minimize the following KL divergence for the multivariate covariate distribution between target and weighted groups in the limit even when the propensity score model is misspecified.
where $\beta _{\rm DBW}^\ast$ is the probability limit of $\hat {\beta }_{\rm DBW}$ and f(x) and f 1(x) is the multivariate covariate distribution for the target group (${\cal S} = \{ i \mid R_i \in \{ 0,\; \, 1 \} \}$) and the response group (${\cal S}_{1} = \{ i \mid R_i = 1 \}$), respectively.
Proof. For any β, the following equations hold:
By applying the continuous mapping theorem, one can obtain
□
This property is advantageous because it implies that bias is bounded above as stated in the following proposition.
Proposition 2 (Minimizing a sharp upper bound of bias): Let the absolute value of bias be denoted as B and $C = \max ( \vert {\mathbb E}[ Y_i - \mu \mid R_i = 1,\; \, {\bf x}] \vert )$. A function of imbalance in the multivariate covariate distribution quantified by the KL divergence sharply bounds bias from above as follows:
where the last inequality follows from Pinsker's inequality, and the equalities hold when the propensity score model is correct. □
3.5 Minimizing an upper bound of the mean squared error
This section demonstrates that the proposed method sharply bounds an upper bound of the MSE by showing that the proposed method minimizes a sharp upper bound of the relative error of weights (proposition 3) and the relative error of weights bounds the MSE of the IPW estimator (proposition 4).
First, the KL divergence between the weights, which the proposed method minimizes as a loss function, bounds the relative error of the weights $\xi ( \hat {\beta }) = {\mathbb E} [ ( \pi ( X_i) / \pi ( X_i,\; \, \hat {\beta }) - 1 ) ^2 ]$ from above as the following proposition states:
Proposition 3 (Minimizing a sharp upper bound of the relative error of weights): When the propensity scores and resulting inverse probability weights are estimated using (4), a sharp upper bound of the relative error of weights is minimized in the limit even when the propensity score model is misspecified. If $\pi ( X_i,\; \, \hat {\beta }) \geq a \pi ( X_i)$ almost surely for some constant a ∈ (0, 1], the bound is as follows:
where the equalities hold when $\pi ( X_i,\; \, \hat {\beta }) = a \pi ( X_i)$, for all i. Proof is available in Supplementary Material C.1.
Note that the MLE does not minimize this upper bound as its coefficients converge to different limits than the DBW unless the propensity score model is correctly specified. Note also that this bound is tighter than Tan (Reference Tan2020)'s bound (Supplementary Material C.1), where the leading term is 5/(3a) for a case where $\pi ( X_i,\; \, \hat {\beta }) \geq a \pi ( X_i)$ almost surely for some constant a ∈ (0, 1/2] instead of a ∈ (0, 1] in proposition 3. Tan (Reference Tan2020) neither provide an intuitive interpretation of this relative error nor investigate the estimator minimizing it as the loss function.
Second, the MSE of the IPW estimator is bounded above by the relative error of weights as the following proposition states, which is also shown in Proposition 2 of Tan (2020) (a typo in the original proposition corrected):
Proposition 4 (Minimizing a sharp upper bound of the mean squared error): The MSE of the IPW estimator is sharply bounded above by the relative error of weights as:
where ${\mathbb E}[ Y^2 \mid X] \leq c$ and π(X i) ≥ ρ almost surely for some constants c > 0 and ρ ∈ (0, 1). The equality holds when Y i = μ and $\pi ( X_i) = \pi ( X_i,\; \, \hat {\beta }) = 0.5$ for all i. Proof is available in Supplementary Material C.2.
Propositions 3 and 4 together imply that the proposed method minimizes the upper bound of the MSE even when the propensity score model is misspecified.
3.6 Comparison with the existing methodologies
This section briefly discusses related literature and compares the proposed methods with existing ones. A more detailed discussion including a comparison with the non-parametric methods is provided in Supplementary Material F.
Table 1 summarizes the loss function and three important properties of proposed and three existing methods, the standard IPW, calibrated weighting (Tan, Reference Tan2020), and entropy balancing (Hainmueller, Reference Hainmueller2012), for the average outcome estimation, where each of the DBW, calibrated weighting, and entropy balancing has two of those three properties. First, the distribution balancing property minimizes the imbalance in the multivariate covariate distribution between the target and weighted groups (Section 3.3) and an upper bound of the mean squared error (MSE) of the parameter of interest (Section 3.5). The DBW and entropy balancing have this property because they use the same loss function (4) with different link functions, where the DBW uses the logistic function and the entropy balancing uses the exponential function (Wang and Zubizarreta, Reference Wang and Zubizarreta2020, See also, Supplementary Material D and F). Second, the calibrated weighting and entropy balancing have the mean balancing property, which is increasingly employed in recently proposed methods (Imai and Ratkovic, Reference Imai and Ratkovic2014; Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Zubizarreta, Reference Zubizarreta2015; Zhao, Reference Zhao2019; Wang and Zubizarreta, Reference Wang and Zubizarreta2020). This property implies that the estimators are sample-bounded (Robins et al., Reference Robins, Sued, Lei-Gomez and Rotnitzky2007; Wang and Zubizarreta, Reference Wang and Zubizarreta2020). Third, the estimated propensity score is bounded between 0 and 1 except for the entropy balancing, whose implicit propensity scores can take larger values than one (See, Wang and Zubizarreta, Reference Wang and Zubizarreta2020). This unboundedness implies that the implicit propensity score model in the entropy balancing is always misspecified, and it cannot be a doubly robust estimator for the average outcome estimation (Zhao, Reference Zhao2019), which contrasts with the double robustness property for the estimation of the average treatment effect for the treated (Zhao and Percival, Reference Zhao and Percival2017).
Notes: The standard IPW is the IPW with the MLE.
The comparison shows that the drawback of the DBW is the lack of mean balancing and sample-boundedness properties. However, the DBW can incorporate or substitute for these properties. First, it can substitute the mean balancing with outcome models as explained in Section 4.3, which removes bias under the same assumptions as the mean balancing methods such as the calibrated weighting and entropy balancing. I also demonstrate that incorporating the linear outcome model is equivalent to modify the DBW weights to have the mean balancing property. Second, it can incorporate the normalization constraint to gain the sample-boundedness as explained in Section 4.1.
4. Improvement, estimation, and inference
4.1 Normalization, regularization, and estimation
Like the IPW with the MLE, a drawback of the DBW is that estimated weights may not sum to one. To become normalized and thus sample-bounded, the normalized DBW (nDBW) estimator incorporates the normalization restriction as follows:
This nDBW estimator minimizes the distribution imbalance and an upper bound of the MSE under the normalization constraint. When the propensity score model is correctly specified, the limiting values of the DBW and nDBW estimators are the same because the (unnormalized) DBW estimator asymptotically satisfies the normalization constraint.
Using the Lagrangian function, the solution to this constrained optimization problem is obtained as the solution to the following unconstrained optimization problem:
To estimate coefficients for the nDBW in (19), we can use the M-estimation under the regularity conditions that the solution to (19) is unique and an element of the interior of a compact set. The second condition is satisfied when the objective function in (19) is bounded below, which is similar to the nonseparation condition for the MLE (Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Tan, Reference Tan2020). To satisfy this condition, we can utilize the L2 regularization:
where λ is a hyper-parameter controlling for the regularization, β −1 is β without the intercept, and || ⋅ || denotes the Euclidean norm.
As the properties of the M-estimator, the following results regarding the consistency and asymptotic normality of the nDBW estimator are obtained (Stefanski and Boos, Reference Stefanski and Boos2002).
Proposition 5 (Large sample properties): Under the conditions that the solution to (21) is unique and that (21) is bounded below, $\hat {\beta }_{\rm nDBW}$ and resulting inverse probability weights $1 / \pi ( X_i,\; \, \hat {\beta }_{\rm nDBW})$ converge in probability to their limiting values $\hat {\beta }_{\rm nDBW}^\ast$ and $1 / \pi ( X_i,\; \, \hat {\beta }_{\rm nDBW}^\ast )$, and since (21) is a smooth function, $\hat {\beta }_{\rm nDBW}$ is asymptotically normally distributed with a certain variance Σ as follows:
where β* is the true values of the coefficients, which is the solution for (3), g′nDBW (β) = ∂g nDBW(β)/∂β and $g''_{\rm nDBW}( \beta ) = \partial g'_{\rm nDBW}( \beta ) / \partial \beta ^{\sf T}$ are the first and second derivatives of g nDBW(β) with respect to β. When the propensity score model is correctly specified, $\hat {\beta }_{\rm nDBW}$ and $1 / \pi ( X_i,\; \, \hat {\beta }_{\rm nDBW})$ are consistently estimated.
To demonstrate the benefits of normalization and regularization, I conduct a simulation in Section 2 for the non-normalized DBW and the nDBW with regularization. First, the RMSEs of the estimated weights are 1.23 for the non-normalized DBW against 1.18 for the nDBW, indicating that the normalization improves the weight estimation.
Second, Figure 3 presents the performance of the proposed estimators with different levels of regularization. The optimal value of the regularization hyper-parameter is 0.22 as depicted by the vertical dotted line. Beyond this value, the RMSE of the weights increases as the regularization becomes stronger, approaching the RMSE of the uniform weights shown in the horizontal dotted line. Although tuning hyper-parameters is still a difficult and unsolved open problem (Zubizarreta, Reference Zubizarreta2015; Wong and Chan, Reference Wong and Chan2017; Zhao, Reference Zhao2019), I recommend using regularization mainly to avoid the separation problem by choosing the smallest one that addresses the separation problem as I do in the simulation studies in Section 5. Another way is choosing the one that minimizes the upper bound of bias estimated by the kernel balance method in Hazlett (Reference Hazlett2020), which I adopt in the empirical application in Section 6.
4.2 Algorithm
In this section, I explain the algorithm for estimating the DBW without the normalization and regularization for simplicity, and the algorithm for the nDBW with regularization is explained in Supplementary Material E. Since the loss function in (4) is non-convex with respect to β, the convex optimization algorithm such as the Newton's method is not applicable. For this non-convex optimization, I develop a majorization-minimization algorithm that iteratively optimizes two convex decompositions of the original function (Wu and Lange, Reference Wu and Lange2010). It exploits the characteristics that the minimization problem of the non-convex loss function of (4) is equivalent to the minimization problem of the difference of the two convex functions as follows:
This equivalence implies that the loss function for the DBW (25) is equivalent to the difference of the convex loss functions for the calibrated weighting (26) and the MLE (27). Note that the DBW does not minimize the absolute difference of the two functions but it minimizes the sum of the loss functions of the calibrated weighting and the negative of the loss function of the MLE. Thus, we should not interpret the DBW as the combination of the calibrated weighting and the MLE but rather the calibrated weighting should be regarded as the combination of the DBW and the MLE (Tan, Reference Tan2020). Since the DBW is better at controlling the mean squared error than the MLE as shown in Section 3.5, the DBW is also better than the calibrated weighting in this regard.
The difference of the convex functions algorithm for the DBW repeats the following two steps until convergence. First, as the majorization step, it constructs a surrogate function u(β, β t) for iteration t by the Taylor expansion of the second component of (25) as follows:
where β t is the initial values or values obtained in the previous iteration and $g'_2( \beta ) = {\partial g_2( \beta ) \over \partial \beta }$. Next, as the minimization step, it estimates β t+1 such that minimizes (28) with respect to β while keeping β t fixed. This minimization is easily conducted, e.g., via the Newton's method, because the objective surrogate function is convex. This algorithm is proved to decrease the loss monotonically in every steps and converge to a stationary point (Wu and Lange, Reference Wu and Lange2010).
4.3 A substitute for the mean balancing condition: doubly robust distribution balancing weighting estimator
Another drawback of the DBW is that it lacks the mean balancing property, which is increasingly employed by recently proposed methods as a safeguard against model misspecification (Hainmueller, Reference Hainmueller2012; Imai and Ratkovic, Reference Imai and Ratkovic2014; Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Zubizarreta, Reference Zubizarreta2015; Zhao, Reference Zhao2019; Tan, Reference Tan2020; Wang and Zubizarreta, Reference Wang and Zubizarreta2020). This removes bias when the true outcome model is a linear regression model with the balanced covariates (Zubizarreta, Reference Zubizarreta2015; Zhao and Percival, Reference Zhao and Percival2017; Zhao, Reference Zhao2019; Fan et al., Reference Fan, Imai, Lee, Liu, Ning and Yang2021).
To substitute the mean balancing property, the (n)DBW can incorporate the outcome regression model. Specifically, this study considers the augmented IPW where the weights are estimated by the (n)DBW instead of the MLE. The augmented IPW combines the outcome model and propensity score model in the following manner (Robins et al., Reference Robins, Rotnitzky and Zhao1994; Seaman and Vansteelandt, Reference Seaman and Vansteelandt2018):
where $\pi ( X_i,\; \, \hat {\beta })$ is the estimated propensity scores and $\hat {m}( X_i)$ is the estimated conditional expectation function $m( X_i) = {\mathbb E}[ Y_i \mid X_i]$. The (n)DBW DR estimator uses the WLS to estimate m(X i) because it has better performance than the OLS (Kang and Schafer, Reference Kang and Schafer2007; Imai and Ratkovic, Reference Imai and Ratkovic2014). Under the same condition for the mean balancing methods, the (n)DBW estimator eliminates bias by using a linear regression outcome model. It can also utilize more flexible methods such as machine learning techniques for estimating m(X i). This estimator is doubly robust, i.e., consistent when either the propensity score or the outcome model is correctly specified (Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Seaman and Vansteelandt, Reference Seaman and Vansteelandt2018). Moreover, the (n)DBW DR estimator achieves semiparametric efficiency bound if both of the models are correctly specified (Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Seaman and Vansteelandt, Reference Seaman and Vansteelandt2018).
Recently, Chattopadhyay and Zubizarreta (Reference Chattopadhyay and Zubizarreta2023) demonstrates that the augmented IPW is expressed as the (non-augmented) IPW with modified weights with the covariate balancing property. Chattopadhyay and Zubizarreta (Reference Chattopadhyay and Zubizarreta2023) also proves that the modified weights minimize the Chi-square-type distance from the (inverse probability) weights used in the augmented IPW with the WLS. This implies that the (n)DBW DR estimator can be expressed as the (non-augmented) IPW estimator with weights $\hat {w}_{\rm nDBWDR}$ satisfying the following conditions:
which demonstrates that the (n)DBW DR estimator equips the covariate balancing property with the approximate multivariate covariate distribution balancing.
Since the covariate balancing methods satisfy ${1\over n} \sum _{i = 1}^n ( R_i - \pi ( X_i,\; \, \hat {\beta }) ) \, X_i = 0$ by construction, their DR estimators with the linear outcome model of $\hat {m}( X_i)$ cannot improve their non-DR estimators because the augmented term, the second term in (29), is always 0. I will examine the finite-sample performance of the nDBW DR estimators through simulation studies in Section 5.
5. Simulation studies
This section examines the finite-sample performance of the proposed methods through simulation studies. The standard simulation setting is proposed by Kang and Schafer (Reference Kang and Schafer2007), which showed that even the DR estimators may suffer from large bias and variance when both the outcome and propensity score models are misspecified. To examine the performance of newly proposed methods, many studies have utilized the same data-generating process (Robins et al., Reference Robins, Sued, Lei-Gomez and Rotnitzky2007; Imai and Ratkovic, Reference Imai and Ratkovic2014; Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Zubizarreta, Reference Zubizarreta2015; Zhao and Percival, Reference Zhao and Percival2017; Tan, Reference Tan2020; Wang and Zubizarreta, Reference Wang and Zubizarreta2020).
Following these studies, I use the following data-generating process. There are n units and each unit i has four covariates X i = (X i1, X i2, X i3, X i4), each of which is independently and identically distributed according to the standard normal distribution. Each unit also has an outcome Y i and a response indicator variable R i ∈ {0, 1}, where outcome is observed only for the response group R i = 1. The response indicator variable is assigned according to the Bernoulli distribution R i = Bernoulli (π(X i)), where π(X i) is a true propensity score. Following Robins et al. (Reference Robins, Sued, Lei-Gomez and Rotnitzky2007); Vermeulen and Vansteelandt (Reference Vermeulen and Vansteelandt2015), this study considers two types of the true propensity score model, where the signs of the coefficients are reversed each other: the type A model π(X i) = expit( − X i1 + 0.5 X i2 − 0.25 X i3 − 0.1 X i4) and type B model π(X i) = expit(X i1 − 0.5 X i2 + 0.25 X i3 + 0.1 X i4). The target parameter is $\mu = {\mathbb E}[ Y_i]$, where the outcome variable Y i is independently and identically distributed according to the standard normal distribution $Y_i \sim {\cal N}( m( X_i) ,\; \, 1)$. Following the extension of the standard simulation study by Tan (Reference Tan2020), this study considers six versions of the true outcome model m(X i): (Linear 1) m(X i) = 210 + 27.4 X i1 + 13.7 X i2 + 13.7 X i3 + 13.7 X i4 and μ = 210; (Linear 2) m(X i) = 210 + 13.7 X i1 + 27.4 X i2 + 27.4 X i3 + 27.4 X i4 and μ = 210; (Quadratic 1) $m( X_i) = 210 + 27.4 \sum _{j = 1}^4 \{ \max ( X_{ij},\; \, 0) \} ^2$ and μ = 264.8; (Quadratic 2) $m( X_i) = 210 + 27.4 \sum _{j = 1}^4 \{ \max ( -X_{ij},\; \, 0) \} ^2$ and μ = 264.8; (Exponential 1) $m( X_i) = 210 + 27.4 \sum _{j = 1}^4 \exp ( X_{ij})$ and μ ≈ 390.7; (Exponential 2) $m( X_i) = 210 + 27.4 \sum _{j = 1}^4 \exp ( -X_{ij})$ and μ ≈ 390.7.
To represent misspecified working models, non-linear transforms of the covariates $Z_i = ( Z_{i1},\; \, Z_{i2},\; \, Z_{i3},\; \, Z_{i4}) = ( \exp ( X_{i1} / 2) ,\; \, X_{i2} / ( 1 + \exp ( X_{i1}) ) + 10,\; \, ( X_{i1} X_{i3} / 25 + 0.6) ^3,\; \, ( X_{i1} + X_{i4} + 20) ^2) $ are used in some models (Robins et al., Reference Robins, Sued, Lei-Gomez and Rotnitzky2007; Imai and Ratkovic, Reference Imai and Ratkovic2014; Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Zubizarreta, Reference Zubizarreta2015; Zhao and Percival, Reference Zhao and Percival2017; Tan, Reference Tan2020; Wang and Zubizarreta, Reference Wang and Zubizarreta2020). Specifically, this study examines the following two scenarios for each simulation setting. In the first scenario, the working propensity score model is correctly specified and uses the logistic regression with X i as linear predictors whereas in the second scenario it is misspecified and uses the logistic regression with Z i as linear predictors. Since the focus is on the relative performance of the different propensity score estimation methods, the working outcome model in both the scenarios is misspecified with the linear regression with Z i as linear predictors.
The propensity scores and/or (inverse probability) weights are estimated by the nDBW and four existing methods (MLE, CBPS, calibrated weighting, and entropy balancing). The unweighted (difference-in-means) and the true propensity score estimators are also used for comparison. For each of these methods, the simple IPW type estimator $\hat {\mu }^{\rm IPW} = \sum _{i = 1}^n {R_i Y_i} / {( n \hat {\pi }( X_i) ) }$ and the augmented IPW type estimator (29) are utilized for the target parameter, where the OLS is used to estimate the outcome regression model. I conduct 2000 Monte Carlo simulations with two different sample sizes (n = 200 and 1000) and calculate the bias and root-mean-squared error (RMSE) for each estimator in each of the 48 scenarios (two versions of the true propensity score model, six versions of the true outcome model, two versions of the working propensity score model, and two versions of the sample size). Although the adaptive optimization would improve the estimation, the regularization parameters for the nDBW are simply fixed at λ = 0.03 for n = 200 and λ = 0.007 for n = 1000, which are minimal values to obtain the convergence in all the Monte Carlo draws. Note that the CBPS DR estimator and the bias-reduced doubly robust estimator (Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015) produce the same estimate when the linear outcome regression model uses the same covariates as the propensity score model, i.e., in the misspecified propensity score model cases in this simulation setting.
The results for the augmented IPW type estimators are shown in Table 2–4 and the full results are shown in Table I.1–I.6 in Supplementary Materials I. Note that the unweighted estimator is an IPW type estimator with uniform weights, and the imputation estimator is an augmented IPW type of that.
Notes: This simulation compares the performance of various methods for estimating propensity scores and (inverse probability) weights by investigating combinations of two versions of the true outcome model (Linear 1 and 2) and two versions of coefficients for the propensity score model (type A and B) with the two different numbers of observations (n = 200 and n = 1000). For each estimation method, I use two propensity score model specifications (correct and misspecified) and report the bias and RMSE for each in the table.
Notes: This simulation compares the performance of various methods for estimating propensity scores and (inverse probability) weights by investigating combinations of two versions of the true outcome model (Quadratic 1 and 2) and two versions of coefficients for the propensity score model (type A and B) with the two different numbers of observations (n = 200 and n = 1000). For each estimation method, I use two propensity score model specifications (correct and misspecified) and report the bias and RMSE for each in the table.
Notes: This simulation compares the performance of various methods for estimating propensity scores and (inverse probability) weights by investigating combinations of two versions of the true outcome model (Exponential 1 and 2) and two versions of coefficients for the propensity score model (type A and B) with the two different numbers of observations (n = 200 and n = 1000). For each estimation method, I use two propensity score model specifications (correct and misspecified) and report the bias and RMSE for each in the table.
When the propensity score model is correctly specified, the proposed nDBW DR estimator has at least comparably small bias and RMSE as the calibrated weighting DR, and it has the smallest RMSE in all the cases with a moderate sample size (n = 1000) and in most of the cases with a small sample size n = 200. Unlike the other estimators, the entropy balancing estimator is not consistent even in the correct propensity score model scenarios except for the linear outcome model cases because it is not doubly robust.
When the propensity score model is misspecified, all the estimators except for the true propensity score DR estimator are no longer consistent, and the MLE DR estimator suffers from large bias and variance when the true propensity score model is type A. Even in these difficult scenarios, the nDBW DR estimator, as well as the calibrated weighting DR estimator, have small biases and RMSEs. To visually compare the relative performance of the estimators with the misspecified propensity score model and moderate sample size (n = 1000), Figure 4 shows the absolute bias and RMSE of nDBW DR estimator in the x-axis against those of the calibrated weighting DR (circles), CBPS DR estimator (triangles), and entropy balancing DR (squares) estimators in the y-axis. The symbols above (below) the diagonal line indicate that the nDBW DR estimator performs better (worse) than other estimators. The nDBW estimator outperforms the calibrated weighting DR and CBPS DR estimators both in terms of bias and RMSE in most of the cases. The performance of the nDBW DR estimator is also preferable to the entropy balancing DR estimator, although the latter performs better than the former in some scenarios. Compared with the entropy balancing DR estimator, the nDBW DR estimator has smaller RMSEs by more than 30% in one-third of the cases but larger RMSEs by more than 30% in none of the cases.
Finally, the full results are shown in Table I.1–I.6 in Supplementary Materials I, which enable to compare the IPW estimators and their DR variants. In most of the scenarios, the nDBW estimator is improved by incorporating outcome models. In contrast, as explained in Section 4.3, the calibrated weighting and entropy balancing estimators have the same biases and RMSEs as their DR variants in the misspecified propensity score model scenarios, where the linear outcome model uses the same covariates as the propensity score model.
6. Empirical analysis
Can foreign occupiers reduce resistance by devolving political authority to native elites? This question is central in the literature on foreign occupation but also challenging because randomized experiments are almost impossible, and strategic determination of political devolution causes an endogeneity problem in observational studies.
To address this issue, Ferwerda and Miller (Reference Ferwerda and Miller2014) utilizes a natural experimental setting in France during World War II, where municipalities were plausibly viewed as randomly assigned into German or Vichy-governed zones in the neighborhood of the demarcation line. After Germany defeated France in June 1940, an armistice provided for the division of France into an occupied zone in northern France and the western coast ruled by Germany directly and an unoccupied zone in the south ruled by the Vichy government. In November 1942, in response to the landing of the Allies in North Africa, German forces invaded and occupied the unoccupied southern zone. However, the formal administrative division remained, which kept the extent of political devolution between the two zones different. Ferwerda and Miller (Reference Ferwerda and Miller2014) analyzes this period (November 1942 to September 1944), when German military forces were present in both the zones and most of the resistance events occurred (mainly after 1943).
Using the regression discontinuity design, Ferwerda and Miller (Reference Ferwerda and Miller2014) finds that political devolution decreases sabotage events in the close neighborhood of the demarcation line, which was drawn arbitrarily at the local level, cutting across the preexisting administrative borders and geographic features such as mountain ranges and rivers. There is, however, a concern about the spill-over effects: Resistance fighters operating from the Vichy zone, not residents in the German-governed zone, may increase sabotage events in the German zone at the demarcation line. To address this concern, it is necessary to compare the municipalities at some distance from the line like the robustness checks in their article. A problem that arises here is that the demarcation line is drawn at random only locally, which leaves some imbalance in the confounders between municipalities with high (the Vichy zone) and low (the German zone) levels of political devolution distant from the demarcation line.
To address this issue, I employ the nDBW DR estimator, and compare the results with the MLE DR, calibrated weighting DR, CBPS DR, entropy balancing DR, and the unweighted difference-in-means estimators. Following Ferwerda and Miller (Reference Ferwerda and Miller2014), I construct a propensity score model with the following variables: the distance from the demarcation line, local state capacity (Population from the 1936 census, the distance to the nearest train station, an index variable indicating whether a municipality possessed a telephone bureau, a telegraph station, or a post office in 1939), forest cover and urbanization (the number of actively farmed hectares), the ruggedness of terrain (the mean and standard deviation of the elevation), and the political ideology (leftist and rightist vote shares in the 1936 election).
With these methods, I compare the occurrence of sabotage events, which counts all attacks against infrastructure (largely railroad and communications), between the treatment (German-zone) and control (Vichy-zone) municipalities within some pre-specified bandwidths of the distance from the demarcation line (Ferwerda and Miller, Reference Ferwerda and Miller2014). To examine whether the effects are observed only locally near the demarcation line or not, I use rolling windows of 15 km, where municipalities within each of 0–15, 2.5–17.5, …,20–35 km from the demarcation line are compared. I denote them by the upper limit: e.g., the results for the municipalities that are 20 km distant from the demarcation line indicate results for those within 5–20 km distant from the line.
6.1 Bias due to imbalance in multivariate covariate distribution
While bias induced by multivariate distributional imbalance is difficult to evaluate, an upper bound of bias can be estimated by the kernel technique (Hazlett, Reference Hazlett2020). I use this upper bound to compare the performance of various methods.
Figure 5 presents the upper bounds of bias due to multivariate distributional imbalance for the proposed (shown in red) and other estimators. Those for the calibrated weighting for the distance larger than 30 km are missing because the weights cannot be estimated due to the convergence issue. As the distance from the demarcation line increases, bias due to multivariate distributional imbalance becomes severe, but the proposed estimator mitigates the bias more effectively than the others.
6.2 Results
The results are presented in Figure 6, and the details are shown in Supplementary Material G. The left panel presents point estimates of the ATE for the proposed (shown in red) and other estimators, which diverge when the distance exceeds 25 km. Estimates by the other estimators than the proposed and entropy balancing estimators show unnatural U-shaped effects fluctuating sharply.Footnote 2 These results are consistent with the upper bound of the bias shown in Figure 5.
Why do these estimators provide such different estimates? The key is multivariate distributional imbalance. To examine the distribution balance obtained by these estimators, I conduct the following analysis. First, to represent a possibly complex function of covariates, I prepared the constitutive terms, their squared terms, and all the first-order interactions. Then, to select relevant terms influencing potential outcomes, I estimated outcome models using LASSO and selected terms with non-zero coefficients. This procedure selected sixteen terms out of 54 terms as relevant, of which four were squared terms and nine were interactions, implying that the outcome model consists of a complex function of covariates, highlighting the importance of the distribution balance. Finally, I calculated the absolute mean difference for these selected terms.
Figure 7 presents the results for the treated municipalities with a 20–35 km distance from the demarcation line as an illustration. The proposed estimator (circles) attains better balance than the CBPS estimator (triangles) in all but two terms while avoiding large imbalances unlike the entropy balancing estimator (squares). These results confirm the superiority of the proposed method in the distribution balance.
The right panel of Figure 6 presents the ATE estimates estimated by the nDBW DR estimator and their jack-knife standard errors, which demonstrate that the treatment effects decrease steeply toward zero as the distance from the demarcation line increases. The decreasing effect indicates that political devolution reduces resistance activities only near the demarcation line, which raises concern about the spill-over effects that the resistance operating from the Vichy zone increases sabotage events in the German zone. This result suggests the possibility of the contamination of the treatment effects by spill-over effects.
7. Conclusions
The IPW estimators are widely utilized to address missing data problems including causal inference. However, their practical application is susceptible to bias due to propensity score model misspecification. In response, existing studies proposed various methods balancing some moments (or kernels) of observed covariates (Hainmueller, Reference Hainmueller2012; Imai and Ratkovic, Reference Imai and Ratkovic2014; Vermeulen and Vansteelandt, Reference Vermeulen and Vansteelandt2015; Zubizarreta, Reference Zubizarreta2015; Chan et al., Reference Chan, Yam and Zhang2016; Wong and Chan, Reference Wong and Chan2017; Zhao, Reference Zhao2019; Tan, Reference Tan2020; Wang and Zubizarreta, Reference Wang and Zubizarreta2020), but specifying the moment conditions remains a formidable task because it requires knowledge of the true outcome model (Zubizarreta, Reference Zubizarreta2015; Zhao and Percival, Reference Zhao and Percival2017; Zhao, Reference Zhao2019; Fan et al., Reference Fan, Imai, Lee, Liu, Ning and Yang2021).
This study proposes the distribution balancing weighting, which mitigates bias and controls the MSE by minimizing their upper bounds. These goals are achieved by balancing covariate distribution through minimizing the KL divergence between the true and estimated weights, inspired by the quasi MLE with misspecified models (Akaike, Reference Akaike1973; White, Reference White1982). The proposed method has several attractive properties as demonstrated in this study.
Finally, I show some future directions for the improvement and the application of the proposed method. First, the key idea of the distribution balancing weighting is to minimize the statistics measuring the discrepancy between the true and estimated weights. This idea may also apply to recently proposed machine learning techniques for propensity score estimation, such as the decision tree, random forest, and generalized random forest. Second, the proposed method may be useful in other missing data problems such as mediation analysis, causal inference with time-series cross-section data, and continuous treatment cases. I am currently exploring these potential directions.
Acknowledgements
I thank Soichiro Yamauchi, the editorial team at PSRM, and the anonymous reviewers for their helpful comments. This work was supported by JSPS KAKENHI, Grant-in-Aid for JSPS Fellows (Grant Number 17J03508) and Grant-in-Aid for Early-Career Scientists (Grant Number 21K13225). Competing interests: The author declares none.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2024.23.
To obtain replication material for this article, https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UAJME4&version=DRAFT&faces-redirect=true