Why Propensity Scores Should Not Be Used for Matching

Gary King; Richard Nielsen

doi:10.1017/pan.2019.11

Published online by Cambridge University Press: 07 May 2019

Gary King*: Institute for Quantitative Social Science, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138, USA. Email: king@harvard.edu, URL: http://GaryKing.org
Richard Nielsen: Department of Political Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. Email: rnielsen@mit.edu, URL: http://www.mit.edu/~rnielsen
*Email: king@harvard.edu

Abstract

We show that propensity score matching (PSM), an enormously popular method of preprocessing data for causal inference, often accomplishes the opposite of its intended goal—thus increasing imbalance, inefficiency, model dependence, and bias. The weakness of PSM comes from its attempts to approximate a completely randomized experiment, rather than, as with other matching methods, a more efficient fully blocked randomized experiment. PSM is thus uniquely blind to the often large portion of imbalance that can be eliminated by approximating full blocking with other matching methods. Moreover, in data balanced enough to approximate complete randomization, either to begin with or after pruning some observations, PSM approximates random matching which, we show, increases imbalance even relative to the original data. Although these results suggest researchers replace PSM with one of the other available matching methods, propensity scores have other productive uses.
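The random-matching claim can be illustrated with a minimal sketch, which is not the authors' replication code. It assumes a completely randomized assignment, so the true propensity score is 0.5 for every unit and matching on it reduces to pairing each treated unit with a randomly chosen control. The function names (`simulate`, `mean_pair_distance`, `compare`), the two standard-normal covariates, and the use of Euclidean distance as a stand-in for Mahalanobis distance are all assumptions made for illustration.

```python
import math
import random


def simulate(n=200, seed=0):
    """Draw two covariates per unit and assign treatment by a fair coin flip."""
    rng = random.Random(seed)
    units = [((rng.gauss(0, 1), rng.gauss(0, 1)), rng.random() < 0.5)
             for _ in range(n)]
    treated = [x for x, t in units if t]
    controls = [x for x, t in units if not t]
    return treated, controls


def mean_pair_distance(treated, controls, pick_control):
    """Average Euclidean covariate distance between treated units and their matches."""
    dists = [math.dist(x, pick_control(x, controls)) for x in treated]
    return sum(dists) / len(dists)


def compare(n=200, seed=0):
    """Return (random-matching distance, covariate nearest-neighbor distance)."""
    treated, controls = simulate(n, seed)
    rng = random.Random(seed + 1)
    # Assumption: the true propensity score is 0.5 for everyone, so every
    # control ties on the score; breaking ties at random is random matching.
    psm = mean_pair_distance(treated, controls, lambda x, cs: rng.choice(cs))
    # Nearest neighbor on the covariates themselves, with replacement.
    nn = mean_pair_distance(
        treated, controls,
        lambda x, cs: min(cs, key=lambda c: math.dist(x, c)),
    )
    return psm, nn


if __name__ == "__main__":
    psm, nn = compare()
    print(f"random (constant-score) matching: {psm:.3f}")
    print(f"covariate nearest-neighbor matching: {nn:.3f}")
```

Because the nearest neighbor minimizes covariate distance for each treated unit, the second figure can never exceed the first in this setup. The sketch covers only the within-pair distance comparison; it does not reproduce the paper's results on pruning, model dependence, or bias.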

Keywords

matching; propensity score matching; coarsened exact matching; Mahalanobis distance matching; model dependence

Type: Articles
Information: Political Analysis, Volume 27, Issue 4, October 2019, pp. 435–454

DOI: https://doi.org/10.1017/pan.2019.11
Copyright: © The Author(s) 2019. Published by Cambridge University Press on behalf of the Society for Political Methodology.

Footnotes

Authors’ note: The current version of this paper, along with a Supplementary Appendix, can be found at j.mp/PScore. We thank Alberto Abadie, Alan Dafoe, Justin Grimmer, Jens Hainmueller, Chad Hazlett, Seth Hill, Stefano Iacus, Kosuke Imai, Simon Jackman, John Londregan, Adam Meirowitz, Giuseppe Porro, Molly Roberts, Jamie Robins, Bradley Spahn, Brandon Stewart, Liz Stuart, Chris Winship, and Yiqing Xu for helpful suggestions, and Connor Jerzak, Chris Lucas, and Jason Sclar for superb research assistance. We also appreciate the insights from our collaborators on a previous related project, Carter Coberley, James E. Pope, and Aaron Wells. All data necessary to replicate the results in this article are available at Nielsen and King (2019).

Contributing Editor: Jeff Gill

References

Abadie, A., and Imbens, G. W. 2006. “Large Sample Properties of Matching Estimators for Average Treatment Effects.” Econometrica 74(1):235–267.

Athey, S., and Imbens, G. W. 2015. “A Measure of Robustness to Misspecification.” American Economic Review Papers and Proceedings 105(5):476–480.

Austin, P. C. 2008. “A Critical Appraisal of Propensity-Score Matching in the Medical Literature Between 1996 and 2003.” Statistics in Medicine 27(12):2037–2049.

Austin, P. C. 2009. “Some Methods of Propensity-Score Matching had Superior Performance to Others: Results of an Empirical Investigation and Monte Carlo Simulations.” Biometrical Journal 51(1):171–184.

Banaji, M. R., and Greenwald, A. G. 2016. Blindspot: Hidden Biases of Good People. New York: Bantam.

Bansal, P. P., and Ardell, A. J. 1972. “Average Nearest-Neighbor Distances Between Uniformly Distributed Finite Particles.” Metallography 5(2):97–111.

Barnow, B. S., Cain, G. G., and Goldberger, A. S. 1980. “Issues in the Analysis of Selectivity Bias.” In Evaluation Studies, vol. 5, edited by Stromsdorfer, E. and Farkas, G. San Francisco: Sage.

Box, G. E. P., Hunter, W. G., and Hunter, J. S. 1978. Statistics for Experimenters. New York: Wiley-Interscience.

Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Glynn, R. J., Avorn, J., and Sturmer, T. 2006. “Variable Selection for Propensity Score Models.” American Journal of Epidemiology 163:1149–1156.

Caliendo, M., and Kopeinig, S. 2008. “Some Practical Guidance for the Implementation of Propensity Score Matching.” Journal of Economic Surveys 22(1):31–72.

Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. 2009. “Dealing with Limited Overlap in Estimation of Average Treatment Effects.” Biometrika 96(1):187.

D’Agostino, R. B. 1998. “Propensity Score Methods for Bias Reduction in the Comparison of a Treatment to a Non-Randomized Control Group.” Statistics in Medicine 17:2265–2281.

Dehejia, R. 2004. “Estimating Causal Effects in Nonexperimental Studies.” In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, edited by Gelman, A. and Meng, X.-L. New York: Wiley.

Diamond, A., and Sekhon, J. S. 2012. “Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies.” Review of Economics and Statistics 95(3):932–945.

Drake, C. 1993. “Effects of Misspecification of the Propensity Score on Estimators of Treatment Effects.” Biometrics 49:1231–1236.

Efron, B. 2014. “Estimation and Accuracy After Model Selection.” Journal of the American Statistical Association 109(507):991–1007.

Finkel, S. E., Horowitz, J., and Rojo-Mendoza, R. T. 2012. “Civic Education and Democratic Backsliding in the Wake of Kenya’s Post-2007 Election Violence.” Journal of Politics 74(01):52–65.

Glazerman, S., Levy, D. M., and Myers, D. 2003. “Nonexperimental Versus Experimental Estimates of Earnings Impacts.” The Annals of the American Academy of Political and Social Science 589:63–93.

Greevy, R., Lu, B., Silver, J. H., and Rosenbaum, P. R. 2004. “Optimal Multivariate Matching Before Randomization.” Biostatistics 5(2):263–275.

Gu, X. S., and Rosenbaum, P. R. 1993. “Comparison of Multivariate Matching Methods: Structures, Distances, and Algorithms.” Journal of Computational and Graphical Statistics 2:405–420.

Heckman, J., Ichimura, H., and Todd, P. 1998. “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program.” Review of Economic Studies 65:261–294.

Hill, J. 2008. “Discussion of Research Using Propensity-Score Matching: Comments on ‘A Critical Appraisal of Propensity-Score Matching in the Medical Literature Between 1996 and 2003’ by Peter Austin, Statistics in Medicine.” Statistics in Medicine 27(12):2055–2061.

Ho, D. E., Imai, K., King, G., and Stuart, E. A. 2007. “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.” Political Analysis 15:199–236. URL: j.mp/matchP.

Holland, P. W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81:945–960.

Iacus, S. M., King, G., and Porro, G. 2011. “Multivariate Matching Methods that are Monotonic Imbalance Bounding.” Journal of the American Statistical Association 106:345–361. URL: j.mp/matchMIB.

Imai, K., King, G., and Nall, C. 2009. “The Essential Role of Pair Matching in Cluster-Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation.” Statistical Science 24(1):29–53. URL: j.mp/essrole.

Imai, K., King, G., and Stuart, E. A. 2008. “Misunderstandings Among Experimentalists and Observationalists about Causal Inference.” Journal of the Royal Statistical Society, Series A 171(2):481–502. URL: j.mp/misunEO.

Imai, K., and Ratkovic, M. 2014. “Covariate Balancing Propensity Score.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1):243–263.

Imbens, G. W. 2004. “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.” Review of Economics and Statistics 86(1):4–29.

Imbens, G. W., and Rubin, D. B. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York: Cambridge University Press.

Ioannidis, J. P. A. 2005. “Why Most Published Research Findings are False.” PLoS Medicine 2(8):e124.

Kahneman, D. 2011. Thinking, Fast and Slow. London: Macmillan.

Kallus, N. 2018. “Optimal A Priori Balance in The Design of Controlled Experiments.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(1):85–112.

Kang, J. D. Y., and Schafer, J. L. 2007. “Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data.” Statistical Science 22(4):523–539.

King, G., and Zeng, L. 2006. “The Dangers of Extreme Counterfactuals.” Political Analysis 14(2):131–159. URL: j.mp/dangerEC.

King, G., and Zeng, L. 2007. “When Can History Be Our Guide? The Pitfalls of Counterfactual Inference.” International Studies Quarterly, 183–210. URL: j.mp/pitfallsH.

Lechner, M. 2001. “Identification and Estimation of Causal Effects of Multiple Treatments under the Conditional Independence Assumption.” In Econometric Evaluation of Labour Market Policies, edited by Lechner, M. and Pfeiffer, F., 43–58. Heidelberg: Physica.

Lunceford, J. K., and Davidian, M. 2004. “Stratification and Weighting via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study.” Statistics in Medicine 23(19):2937–2960.

Mahoney, M. J. 1977. “Publication Prejudices: An Experimental Study of Confirmatory Bias in the Peer Review System.” Cognitive Therapy and Research 1(2):161–175.

Mielke, P., and Berry, K. 2007. Permutation Methods: A Distance Function Approach. New York: Springer.

Morgan, S. L., and Winship, C. 2014. Counterfactuals and Causal Inference: Methods and Principles for Social Research, 2nd edn. Cambridge: Cambridge University Press.

Nielsen, R., Findley, M., Davis, Z., Candland, T., and Nielson, D. 2011. “Foreign Aid Shocks as a Cause of Violent Armed Conflict.” American Journal of Political Science 55(2):219–232.

Nielsen, R., and King, G. 2019. “Replication Data for: Why Propensity Scores Should Not Be Used for Matching.” https://doi.org/10.7910/DVN/A9LZNV, Harvard Dataverse, V1.

Pearl, J. 2009. “Myth, Confusion, and Science in Causal Analysis.” Unpublished paper, http://web.cs.ucla.edu/~kaoru/r348.pdf.

Pearl, J. 2009. “The Foundations of Causal Inference.” Sociological Methodology 40(1):75–149.

Peikes, D. N., Moreno, L., and Orzol, S. M. 2008. “Propensity Score Matching.” The American Statistician 62(3):222–231.

Pimentel, S. D., Page, L. C., Lenard, M., and Keele, L. 2018. “Optimal Multilevel Matching Using Network Flows: An Application to a Summer Reading Intervention.” The Annals of Applied Statistics 12(3):1479–1505.

Robins, J. M., Hernan, M. A., and Brumback, B. 2000. “Marginal Structural Models and Causal Inference in Epidemiology.” Epidemiology 11(5):550–560.

Robins, J. M., and Morgenstern, H. 1987. “The Foundations of Confounding in Epidemiology.” Computers & Mathematics with Applications 14(9):869–916.

Rosenbaum, P. R., Ross, R., and Silber, J. 2007. “Minimum Distance Matched Sampling With Fine Balance in an Observational Study of Treatment for Ovarian Cancer.” Journal of the American Statistical Association 102(477):75–83.

Rosenbaum, P. R., and Rubin, D. B. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70:41–55.

Rosenbaum, P. R., and Rubin, D. B. 1984. “Reducing Bias in Observational Studies Using Subclassification on the Propensity Score.” Journal of the American Statistical Association 79:515–524.

Rosenbaum, P. R., and Rubin, D. B. 1985a. “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score.” The American Statistician 39:33–38.

Rosenbaum, P. R., and Rubin, D. B. 1985b. “The Bias Due to Incomplete Matching.” Biometrics 41(1):103–116.

Rubin, D. B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66:688–701.

Rubin, D. B. 1976. “Inference and Missing Data.” Biometrika 63:581–592.

Rubin, D. B. 1980. “Comments on ‘Randomization Analysis of Experimental Data: The Fisher Randomization Test’, by D. Basu.” Journal of the American Statistical Association 75:591–593.

Rubin, D. B. 2008a. “Comment: The Design and Analysis of Gold Standard Randomized Experiments.” Journal of the American Statistical Association 103(484):1350–1353.

Rubin, D. B. 2008b. “For Objective Causal Inference, Design Trumps Analysis.” Annals of Applied Statistics 2(3):808–840.

Rubin, D. B. 2009. “Should Observational Studies be Designed to Allow Lack of Balance in Covariate Distributions Across Treatment Groups?” Statistics in Medicine 28:1415–1424.

Rubin, D. B. 2010. “On the Limitations of Comparative Effectiveness Research.” Statistics in Medicine 29(19):1991–1995.

Rubin, D. B., and Stuart, E. A. 2006. “Affinely Invariant Matching Methods with Discriminant Mixtures of Proportional Ellipsoidally Symmetric Distributions.” Annals of Statistics 34(4):1814–1826.

Rubin, D. B., and Thomas, N. 2000. “Combining Propensity Score Matching with Additional Adjustments for Prognostic Covariates.” Journal of the American Statistical Association 95:573–585.

Simmons, J. P., Nelson, L. D., and Simonsohn, U. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science 22(11):1359–1366.

Smith, J. A., and Todd, P. E. 2005a. “Does Matching Overcome LaLonde’s Critique of Nonexperimental Estimators?” Journal of Econometrics 125(1–2):305–353.

Smith, J., and Todd, P. 2005b. “Rejoinder.” Journal of Econometrics 125:365–375.

Stuart, E. A. 2010. “Matching Methods for Causal Inference: A Review and a Look Forward.” Statistical Science 25(1):1–21.

Stuart, E. A., and Rubin, D. B. 2007. “Best Practices in Quasi-Experimental Designs: Matching Methods for Causal Inference.” In Best Practices in Quantitative Methods, edited by Osborne, J., 155–176. New York: Sage.

Stuart, E. A., and Rubin, D. B. 2008. “Matching with Multiple Control Groups with Adjustment for Group Differences.” Journal of Educational and Behavioral Statistics 33(3):279–306.

Tetlock, P. E. 2005. Expert Political Judgment: How Good Is It? How Can We Know? Princeton: Princeton University Press.

VanderWeele, T. J., and Hernan, M. A. 2012. “Causal Inference Under Multiple Versions of Treatment.” Journal of Causal Inference 1:1–20.

VanderWeele, T. J., and Shpitser, I. 2011. “A New Criterion for Confounder Selection.” Biometrics 67(4):1406–1413.

Vansteelandt, S., and Daniel, R. 2014. “On Regression Adjustment for the Propensity Score.” Statistics in Medicine 33(23):4053–4072.

Wilson, T. D., and Brekke, N. 1994. “Mental Contamination and Mental Correction: Unwanted Influences on Judgments and Evaluations.” Psychological Bulletin 116(1):117.

Zhao, Z. 2008. “Sensitivity of Propensity Score Methods to the Specifications.” Economics Letters 98(3):309–319.

Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. 2014. “Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-Profit and Not-For-Profit High Schools in Chile.” The Annals of Applied Statistics 8(1):204–231.
