The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning

Ranjit Lall; Thomas Robinson

doi:10.1017/pan.2020.49

Published online by Cambridge University Press: 26 February 2021


Ranjit Lall (corresponding author): Department of International Relations, London School of Economics and Political Science, London, UK. Email: r.lall@lse.ac.uk
Thomas Robinson: School of Government and International Affairs, Durham University, Durham, UK. Email: thomas.robinson@durham.ac.uk


Abstract

Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS’s accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
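The core idea in the abstract (treat missing values as one more portion of corrupted input, but score the reconstruction only on entries that were originally observed) can be illustrated with a minimal sketch. This is not the authors' implementation or their software's API: the function names `corrupt` and `masked_mse`, the use of `None` to mark missing values, the zero fill, and the hard-coded `reconstruction` standing in for a trained network's output are all assumptions made for illustration.

```python
import random


def corrupt(row, drop_prob, rng):
    """Return (corrupted_row, observed_mask) for one data row.

    Missing entries (None) and randomly dropped observed entries are both
    set to 0.0, so the model sees them as the same kind of corruption.
    Hypothetical helper, not part of any MIDAS package.
    """
    observed = [x is not None for x in row]
    corrupted = [
        x if (obs and rng.random() >= drop_prob) else 0.0
        for x, obs in zip(row, observed)
    ]
    return corrupted, observed


def masked_mse(row, reconstruction, observed):
    """Mean squared reconstruction error over originally observed entries only."""
    errors = [
        (x - r) ** 2
        for x, r, obs in zip(row, reconstruction, observed)
        if obs
    ]
    return sum(errors) / len(errors) if errors else 0.0


rng = random.Random(0)
row = [1.0, None, 3.0, 4.0]             # second value is missing
corrupted, observed = corrupt(row, drop_prob=0.5, rng=rng)

# Stand-in for a trained autoencoder's output (assumed values).
reconstruction = [1.5, 9.9, 3.0, 4.0]

# The missing entry (index 1) contributes nothing to the loss; its
# reconstructed value is what would be used as an imputation.
loss = masked_mse(row, reconstruction, observed)
```

In a full multiple-imputation workflow, several completed datasets would be drawn from the trained model rather than a single reconstruction; the sketch shows only the corruption step and the masked loss.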

Keywords

missing data; multiple imputation; imputation methods; machine learning

Type: Article
Information: Political Analysis, Volume 30, Issue 2, April 2022, pp. 179–196

DOI: https://doi.org/10.1017/pan.2020.49
Copyright: © The Author(s) 2021. Published by Cambridge University Press on behalf of the Society for Political Methodology


Footnotes

Edited by Jeff Gill


Lall and Robinson Dataset: https://doi.org/10.7910/DVN/UPL4TT

