Published online by Cambridge University Press: 19 July 2012
Missing values are a frequent problem in empirical political science research. Surprisingly, the match between the measurement of the missing values and the correcting algorithms applied is seldom studied. While multiple imputation is a vast improvement over the deletion of cases with missing values, it is often unsuitable for imputing highly non-granular discrete data. We develop a simple technique for imputing missing values in such situations, which is a variant of hot deck imputation, drawing from the conditional distribution of the variable with missing values to preserve the discrete measure of the variable. This method is tested against existing techniques using Monte Carlo analysis and then applied to real data on democratization and modernization theory. Software for our imputation technique is provided in a free, easy-to-use package for the R statistical environment.
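The general hot deck idea in the abstract can be illustrated with a minimal sketch. This is not the authors' R package or their specific algorithm: the function `hot_deck_impute`, the variable names `region` and `regime`, and the exact-match donor rule are all hypothetical, chosen only to show how drawing from observed donor values keeps an imputed discrete variable on its original scale.

```python
import random

def hot_deck_impute(rows, target, match_on, rng):
    """Fill missing `target` values by drawing from observed donor values.

    Hypothetical illustration: a donor is any row with an observed target
    that agrees with the recipient on every variable in `match_on`.
    """
    donors = [r for r in rows if r[target] is not None]
    completed = []
    for row in rows:
        row = dict(row)  # copy so the input data are left untouched
        if row[target] is None:
            pool = [d[target] for d in donors
                    if all(d[k] == row[k] for k in match_on)]
            if not pool:
                # Assumed fallback: no matching donor, so use all donors.
                pool = [d[target] for d in donors]
            # Drawing an observed value preserves the discrete measure.
            row[target] = rng.choice(pool)
        completed.append(row)
    return completed

# Toy data (invented for illustration): a binary regime indicator.
rows = [
    {"region": "A", "regime": 1},
    {"region": "A", "regime": 1},
    {"region": "A", "regime": None},
    {"region": "B", "regime": 0},
    {"region": "B", "regime": None},
    {"region": "C", "regime": None},
]
completed = hot_deck_impute(rows, "regime", ["region"], random.Random(0))
```

Because every imputed value is copied from an observed case, the completed variable can only take values that already occur in the data, which is the property a model-based draw from a continuous distribution does not guarantee.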
Department of Political Science, University of North Carolina; and Department of Political Science, Washington University (email: jgill@wustl.edu), respectively. The authors wish to thank Micah Altman, James Fowler, Katie Gan, Adam Glynn, Justin Grimmer, Dominik Hangartner, Michael Kellerman, Gary King, Ryan Moore and Randolph Siverson for valuable comments. Replication data is available at http://www.unc.edu/~skylerc/.
1 The term ‘missing data’ can mean either missing values (e.g. item non-response in a survey) or missing observations such as refusal to take an entire survey. Throughout this work, we use the term exclusively to mean the first case.
2 Taagepera, Rein and Shugart, Matthew Soberg, Seats and Votes: The Effects and Determinants of Electoral Systems (New Haven, Conn.: Yale University Press, 1989).
3 Peter Mair and Ingrid van Biezen, ‘Party Membership in Twenty European Democracies, 1980–2000’, Party Politics, 7 (2001), 5–21.
4 Palmer, Harvey D. and Whitten, Guy D., ‘The Electoral Impact of Unexpected Inflation and Economic Growth’, British Journal of Political Science, 29 (1999), 623–639.
5 Reiter, Dan, ‘Does Peace Nurture Democracy?’ Journal of Politics, 63 (2001), 935–948.
6 Tsiatis, Anastasios A., Semiparametric Theory and Missing Data (New York: Springer, 2010).
Enders, Craig K., Applied Missing Data Analysis (New York: The Guilford Press, 2010).
Tan, Ming T., Tian, Guo-Liang and Ng, Kai Wang, Bayesian Missing Data Problems: EM, Data Augmentation and Noniterative Computation (New York: Chapman & Hall/CRC, 2009).
Molenberghs, Geert and Kenward, Michael G., Missing Data in Clinical Studies (New York: Wiley, 2007).
McKnight, Patrick E., McKnight, Katherine M., Sidani, Souraya and Figueredo, Aurelio Jose, Missing Data: A Gentle Approach (New York: The Guilford Press, 2007).
7 Rees, Phil H. and Duke-Williams, Oliver, ‘Methods for Estimating Missing Data on Migrants in the 1991 British Census’, International Journal of Population Geography, 3 (1997), 323–368.
8 Rees and Duke-Williams, ‘Methods for Estimating Missing Data on Migrants in the 1991 British Census’.
9 Roderick J. A. Little and Donald B. Rubin, Statistical Analysis with Missing Data, 2nd edn (New York: Wiley, 2002), p. 42.
10 Allison, Paul D., Missing Data (Thousand Oaks, Calif.: Sage, 2001).
Little, Roderick J. A., ‘Regression with Missing X's: A Review’, Journal of the American Statistical Association, 87 (1992), 1227–1237.
Little, Roderick J. A., ‘Approximately Calibrated Small Sample Inference about Means from Bivariate Normal Data with Missing Values’, Computational Statistics & Data Analysis, 7 (1988), 161–178.
Rubin, Donald B., ‘Inference and Missing Data (with Discussion)’, Biometrika, 63 (1976), 581–592.
King, Gary, Honaker, James, Joseph, Anne and Scheve, Kenneth, ‘Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation’, American Political Science Review, 95 (2001), 49–69.
11 Honaker, James and King, Gary, ‘What to Do about Missing Values in Time-Series Cross-Section Data’, American Journal of Political Science, 54 (2010), 561–581.
12 Rubin, ‘Inference and Missing Data’; King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Little and Rubin, Statistical Analysis with Missing Data.
13 Little and Rubin, Statistical Analysis with Missing Data, p. 12.
14 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’.
15 Gelman, Andrew and Hill, Jennifer, Data Analysis Using Regression and Multilevel/Hierarchical Models (New York: Cambridge University Press, 2007).
16 Bailar, John C. III and Bailar, Barbara A., ‘Comparison of the Biases of the “Hot Deck” Imputation Procedure with an “Equal Weights” Imputation Procedure’, Symposium on Incomplete Data (Panel on Incomplete Data of the Committee on National Statistics, National Research Council, 1997), 422–47.
Cox, Brenda G., ‘The Weighted Sequential Hot Deck Imputation Procedure’, Proceedings of the Section on Survey Research Methods, American Statistical Association (1980), 721–6.
Rockwell, Richard C., ‘An Investigation of Imputation and Differential Quality of Data in the 1970 Census’, Journal of the American Statistical Association, 70 (1975), 39–42.
17 Rubin, Donald B., Multiple Imputation for Nonresponse in Surveys (New York: Wiley, 2004).
18 Rubin, ‘Inference and Missing Data’.
19 Rubin, Donald B., ‘Formalizing Subjective Notions about the Effect of Nonrespondents in Sample Surveys’, Journal of the American Statistical Association, 72 (1977), 538–543.
Rubin, Donald B., ‘Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse’, Proceedings of the Survey Research Methods Section of the American Statistical Association (1978), 20–34.
Rubin, Donald B. and Schenker, Nathaniel, ‘Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse’, Journal of the American Statistical Association, 81 (1986), 366–374.
Rubin, Donald B., ‘Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations’, Journal of Business and Economic Statistics, 4 (1986), 87–94.
Rubin, Donald B., Schafer, J. L. and Schenker, Nathaniel, ‘Imputation Strategies for Missing Values in Post-Enumeration Surveys’, Survey Methodology, 14 (1988), 209–221.
Rubin, Donald B., ‘Multiple Imputation after 18+ Years’, Journal of the American Statistical Association, 91 (1996), 473–489.
20 The combined $\bar{\theta}_{M}$ is in fact an average, but the treatment of the variability of this estimate is slightly more complicated than an average since it needs to account for within imputation variation and between imputation variation. The subject of multiple estimate combination will be discussed in some detail below. See Little and Rubin, Statistical Analysis with Missing Data, for a more detailed treatment.
21 Kim, Jae Kwang, ‘Finite Sample Properties of Multiple Imputation Estimators’, Annals of Statistics, 32 (2004), 766–783.
Kim, Jae Kwang and Fuller, Wayne, ‘Fractional Hot Deck Imputation’, Biometrika, 91 (2004), 559–578.
Fuller, Wayne and Kim, Jae Kwang, ‘Hot Deck Imputation for the Response Model’, Statistics Canada, 31 (2005), 139–149.
22 Schafer, Joseph L., Analysis of Incomplete Multivariate Data (New York: Chapman & Hall/CRC, 1997).
23 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Honaker and King, ‘What to Do about Missing Values in Time-Series Cross-Section Data’.
24 The articles describing the Amelia procedure have received over 330 ISI citations as of this writing.
25 Reilly, Marie, ‘Data Analysis Using Hot Deck Multiple Imputation’, The Statistician, 42 (1993), 307–313.
26 Kalton, Graham and Kish, Leslie, ‘Some Efficient Random Imputation Methods’, Communications in Statistics – Theory and Methods, 13 (1984), 1919–1939.
Fay, Robert E., ‘Alternative Paradigms for the Analysis of Imputed Survey Data’, Journal of the American Statistical Association, 91 (1996), 490–498.
27 Reilly, ‘Data Analysis Using Hot Deck Multiple Imputation’.
28 Reilly, ‘Data Analysis Using Hot Deck Multiple Imputation’.
29 For linguistic parsimony, we generally use the term ‘respondent’ below, but these methods are immediately applicable to datasets where the rows reflect any other type of observation.
30 Gower, J. C., ‘A General Coefficient of Similarity and Some of its Properties’, Biometrics, 27 (1971), 857–871.
31 Rosenbaum, Paul R. and Rubin, Donald B., ‘The Central Role of the Propensity Score in Observational Studies for Causal Effects’, Biometrika, 70 (1983), 41–55.
32 Kim, ‘Finite Sample Properties of Multiple Imputation Estimators’; Kim and Fuller, ‘Fractional Hot Deck Imputation’; Fuller and Kim, ‘Hot Deck Imputation for the Response Model’.
33 Kim, ‘Finite Sample Properties of Multiple Imputation Estimators’.
34 Little and Rubin, Statistical Analysis with Missing Data; Rubin, ‘Multiple Imputations in Sample Surveys’; Rubin, Multiple Imputation for Nonresponse in Surveys; Rubin, ‘Multiple Imputation after 18+ Years’.
35 Little and Rubin, Statistical Analysis with Missing Data.
36 Our software formats its output for seamless use with the R package Zelig; Kosuke Imai, Gary King and Olivia Lau, ‘Zelig: Everyone's Statistical Software’, Comprehensive R Archive Network (2006). This has the advantage of allowing the user to run, in a single line of code, a great variety of models on the multiply imputed datasets and have the combination handled automatically.
37 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Honaker and King, ‘What to Do about Missing Values in Time-Series Cross-Section Data’.
38 Stef van Buuren, Jaap P. L. Brand, C. G. M. Groothuis-Oudshoorn and Donald B. Rubin, ‘Fully Conditional Specification in Multivariate Imputation’, Journal of Statistical Computation and Simulation, 76 (2006), 1049–1064.
Stef van Buuren, ‘Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification’, Statistical Methods in Medical Research, 16 (2007), 219–242.
39 Dempster, A. P., Laird, N. M. and Rubin, D. B., ‘Maximum Likelihood from Incomplete Data via the EM Algorithm’, Journal of the Royal Statistical Society, Series B, 39 (1977), 493–510.
40 We also ran experiments where the missing values were MCAR, but, as we would expect theoretically, no method was biased under those conditions.
41 Lipset, Seymour M., ‘Some Social Requisites of Democracy: Economic Development and Political Legitimacy’, American Political Science Review, 53 (1959), 69–105.
42 Cutright, Phillips, ‘National Political Development: Its Measurement and Social Correlates’, in Nelson W. Polsby, Robert A. Dentler and Paul A. Smith, eds, Politics and Social Life: An Introduction to Political Behavior (Boston, Mass.: Houghton Mifflin, 1963), 569–581.
Deutsch, Karl W., ‘Social Mobilization and Political Development’, American Political Science Review, 55 (1961), 493–510.
Dahl, Robert A., Polyarchy: Participation and Opposition (New Haven, Conn.: Yale University Press, 1971).
Burkhart, Ross E. and Lewis-Beck, Michael S., ‘The Economic Development Thesis’, American Political Science Review, 88 (1994), 903–910.
Londregan, John B. and Poole, Keith T., ‘Does High Income Promote Democracy?’ World Politics, 49 (1996), 1–30.
43 Przeworski, Adam, Democracy and the Market: Political and Economic Reforms in Eastern Europe (New York: Cambridge University Press, 1991).
Przeworski, Adam and Limongi, Fernando, ‘Political Regimes and Economic Growth’, Journal of Economic Perspectives, 7 (1993), 51–69.
Przeworski, Adam, Alvarez, Michael E., Cheibub, Jose A. and Limongi, Fernando, ‘What Makes Democracies Endure?’ Journal of Democracy, 7 (1996), 39–55.
Przeworski, Adam and Limongi, Fernando, ‘Modernization: Theories and Facts’, World Politics, 49 (1997), 155–183.
Przeworski, Adam, Alvarez, Michael E., Cheibub, Jose A. and Limongi, Fernando, Democracy and Development: Political Institutions and Well-Being in the World, 1950–1990 (New York: Cambridge University Press, 2000).
44 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.
45 Boix, Carles, Democracy and Redistribution (New York: Cambridge University Press, 2002).
Boix, Carles and Stokes, Susan, ‘Endogenous Democratization’, World Politics, 55 (2003), 517–549.
Epstein, David L., Bates, Robert, Goldstone, Jack, Kristensen, Ida and O'Halloran, Sharyn, ‘Democratic Transitions’, American Journal of Political Science, 50 (2006), 551–569.
46 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.
47 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.
48 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.
49 The true results are true to the extent that they are the results actually obtained by analysing the complete data. They are not true in the more traditional sense of being the true population parameters an empirical analysis attempts to estimate.
50 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.
51 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.
52 Imai, King and Lau, ‘Zelig’.
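Note 20 describes combining estimates across imputed datasets, with variability that reflects both within-imputation and between-imputation variation. A minimal sketch of the standard combining rules (Rubin's rules, as treated in Little and Rubin) is given here for orientation. The function name `combine_estimates` and the example numbers are hypothetical and are not taken from the article or its software.

```python
from statistics import mean, variance

def combine_estimates(estimates, variances):
    """Combine M point estimates and their squared standard errors.

    Returns the pooled estimate and its total variance, where
    total = within + (1 + 1/M) * between.
    """
    m = len(estimates)
    theta_bar = mean(estimates)    # pooled point estimate (a simple average)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance, divisor M - 1
    total = within + (1 + 1 / m) * between
    return theta_bar, total

# Invented numbers: one coefficient estimated on M = 3 imputed datasets.
theta_bar, total_var = combine_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The pooled point estimate is a plain average, while the total variance exceeds the average within-imputation variance whenever the estimates differ across imputations, which is the complication the note refers to.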