Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset

Georgina Evans; Gary King

doi:10.1017/pan.2022.1

Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset

Published online by Cambridge University Press: 20 April 2022

Georgina Evans

and

Gary King

Show author details

Georgina Evans: Affiliation:
Department of Government, Harvard University, Cambridge, MA 02138, USA. E-mail: GeorginaEvans@g.harvard.edu, URL: https://Georgina-Evans.com
Gary King*: Affiliation:
Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA. E-mail: King@Harvard.edu, URL: https://GaryKing.org
*: Corresponding author Gary King

Article contents

Abstract
Footnotes
References

Get access

Rights & Permissions

Abstract

We offer methods to analyze the “differentially private” Facebook URLs Dataset which, at over 40 trillion cell values, is one of the largest social science research datasets ever constructed. The version of differential privacy used in the URLs dataset has specially calibrated random noise added, which provides mathematical guarantees for the privacy of individual research subjects while still making it possible to learn about aggregate patterns of interest to social scientists. Unfortunately, random noise creates measurement error which induces statistical bias—including attenuation, exaggeration, switched signs, or incorrect uncertainty estimates. We adapt methods developed to correct for naturally occurring measurement error, with special attention to computational efficiency for large datasets. The result is statistically valid linear regression estimates and descriptive statistics that can be interpreted as ordinary analyses of nonconfidential data but with appropriately larger standard errors.

Keywords

privacy measurement error linear regression descriptive statistics

Type: Article
Information: Political Analysis , Volume 31 , Issue 1 , January 2023 , pp. 1 - 21

DOI: https://doi.org/10.1017/pan.2022.1 [Opens in a new window]
Copyright: © The Author(s) 2022. Published by Cambridge University Press on behalf of the Society for Political Methodology

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Footnotes

Edited by Jeff Gill

References

Aktay, A. et al. 2020. “Google COVID-19 Community Mobility Reports: Anonymization Process Description (Version 1.0).” arXiv:2004.04145.Google Scholar

Barrientos, A. F., Reiter, J., Ashwin, M., and Chen, Y.. 2019. “Differentially Private Significance Tests for Regression Coefficients.” Journal of Computational and Graphical Statistics 28 (2):1–24.CrossRef Google Scholar

Blackwell, M., Honaker, J., and King, G.. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview.” Sociological Methods and Research 46 (3): 303–341.CrossRef Google Scholar

Blair, G., Imai, K., and Zhou, Y.-Y.. 2015. “Design and Analysis of the Randomized Response Technique.” Journal of the American Statistical Association 110 (511): 1304–1319.CrossRef Google Scholar

Bun, M. and Steinke, T.. 2016. “Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds.” In Theory of Cryptography Conference, 635–658. Berlin: Springer.CrossRef Google Scholar

Buonaccorsi, J. P. 2010. Measurement Error: Models, Methods, and Applications. Boca Raton, FL: CRC Press.CrossRef Google Scholar

Diaz-Frances, E., and Rubio, F. J.. 2013. “On the Existence of a Normal Approximation to the Distribution of the Ratio of Two Independent Normal Random Variables.” Statistical Papers 54 (2): 309–323.CrossRef Google Scholar

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A.. 2015. “The Reusable Holdout: Preserving Validity in Adaptive Data Analysis.” Science 349 (6248): 636–638.CrossRef Google Scholar PubMed

Dwork, C., McSherry, F., Nissim, K., and Smith, A.. 2006. “Calibrating Noise to Sensitivity in Private Data Analysis.” In Theory of Cryptography Conference, 265–284. Berlin: Springer.CrossRef Google Scholar

Dwork, C., and Roth, A.. 2014. “The Algorithmic Foundations of Differential Privacy.” Founda-tions and Trends in Theoretical Computer Science 9 (3–4): 211–407.CrossRef Google Scholar

Evans, G., and King, G.. 2021a. “Replication Data for: Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset.” https://doi.org/10.7910/DVN/UDFZJD, Harvard Dataverse, V1, UNF:6:qVAL2iA9dusDRaLhZ1X4xg== [fileUNF].CrossRef Google Scholar

Evans, G., and King, G.. 2021b. “Statistically Valid Inferences from Differentially Private Data Releases, II: Extensions to Nonlinear Transformations.” Working Paper. GaryKing.org/dpd2.Google Scholar

Evans, G., King, G., Schwenzfeier, M., and Thakurta, A.. 2020. “Statistically Valid Inferences from Privacy Protected Data.” GaryKing.org/dp.Google Scholar

Evans, G., King, G., Smith, A. D., and Thankurta, A.. 2022. “Differentially Private Survey Research.” American Journal of Political Science, to appear. Preprint available at garyking.org/DPsurvey.Google Scholar

Fan, J. 1991. “On the Optimal Rates of Convergence for Nonparametric Deconvolution Problems.” The Annals of Statistics 19 (3):1257–1272.CrossRef Google Scholar

Gaboardi, M., Lim, H.-W., Rogers, R. M., and Vadhan, S. P.. 2016. “Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing.” In ICML’16 Proceedings of the 33rd International Conference on International Conference on Machine Learning, Vol. 48. JMLR.Google Scholar

Garfinkel, S. L, Abowd, J. M., and Powazek, S.. 2018. “Issues Encountered Deploying Differential Privacy.” In Proceedings of the 2018 Workshop on Privacy in the Electronic Society, 133–137. New York: ACM.Google Scholar

Glynn, A. N. 2013. “What Can We Learn with Statistical Truth Serum? Design and Analysis of the List Experiment.” Public Opinion Quarterly 77 (S1): 159–172.CrossRef Google Scholar

Goldberger, A. 1991. A Course in Econometrics. Cambridge, MA: Harvard University Press.Google Scholar

Gong, R. 2019. “Exact Inference with Approximate Computation for Differentially Private Data via Perturbations.” arXiv:1909.12237.Google Scholar

Hersh, E. D., and Nall, C.. 2016. “The Primacy of Race in the Geography of Income-Based Voting: New Evidence from Public Voting Records.” American Journal of Political Science 60 (2): 289–303.CrossRef Google Scholar

Jayaraman, B., and Evans, D.. 2019. “Evaluating Differentially Private Machine Learning in Practice.” In 28th USENIX Security Symposium (USENIX Security 19). Santa Clara, CA: USENIX Association.Google Scholar

Karwa, V., and Vadhan, S.. 2017. “Finite Sample Differentially Private Confidence Intervals.” arXiv:1711.03908.Google Scholar

King, G. 1989. “Variance Specification in Event Count Models: From Restrictive Assumptions to a Generalized Estimator.” American Journal of Political Science 33 (3): 762–784.CrossRef Google Scholar

King, G., and Persily, N.. 2020. “A New Model for Industry–Academic Partnerships.” PS: Political Science & Politics 53 (4): 703–709.Google Scholar

King, G., and Signorino, C. S.. 1996. “The Generalization in the Generalized Event Count Model.” Political Analysis 6: 225–252.CrossRef Google Scholar

King, G., Tomz, M., and Wittenberg, J.. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44 (2): 341–355.CrossRef Google Scholar

Messing, S., DeGregorio, C., Hillenbrand, B., King, G., Mahanti, S., Muk-erjee, Z., Nayak, C., Persily, N., State, B., and Wilkins, A.. 2020. “Facebook Privacy-Protected Full URLs Data Set.” https://doi.org/10.7910/DVN/TDOAPG, Harvard Dataverse, V8.CrossRef Google Scholar

Mnatsakanov, R. M. 2008. “Hausdorff Moment Problem: Reconstruction of Distributions.” Statistics & Probability Letters 78 (12): 1612–1618.CrossRef Google Scholar

Oberski, D. L., and Kreuter, F. 2020. “Differential Privacy and Social Science: An Urgent Puzzle.” Harvard Data Science Review 2 (1).CrossRef Google Scholar

Papoulis, A. 1984. Random Variables, and Stochastic Processes. New York: McGraw-Hill.Google Scholar

Sheffet, O. 2019. “Differentially Private Ordinary Least Squares.” Journal of Privacy and Confidentiality 9 (1): 1–43.CrossRef Google Scholar

Smith, A. 2011. “Privacy-Preserving Statistical Estimation with Optimal Convergence Rates.” In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, 813–822. New York: ACM.CrossRef Google Scholar

Stefanski, L. A. 2000. “Measurement Error Models.” Journal of the American Statistical Association 95 (452): 1353–1358.CrossRef Google Scholar

Štulajter, F. 1978. “Nonlinear Estimators of Polynomials in Mean Values of a Gaussian Stochastic Process.” Kybernetika 14 (3): 206–220.Google Scholar

Sweeney, L. 1997. “Weaving Technology and Policy Together to Maintain Confidentiality.” The Journal of Law, Medicine & Ethics 25 (2–3): 98–110.CrossRef Google Scholar PubMed

Thomas, L., Stefanski, L., and Davidian, M.. 2011. “A Moment-Adjusted Imputation Method for Measurement Error Models.” Biometrics 67 (4): 1461–1470.CrossRef Google Scholar PubMed

Vadhan, S. 2017. “The Complexity of Differential Privacy.” In Tutorials on the Foundations of Cryptography, 347–450. Berlin: Springer.CrossRef Google Scholar

Wang, Y., Kifer, D., and Lee, J.. 2018. “Differentially Private Confidence Intervals for Empirical Risk Minimization.” arXiv:1804.03794.CrossRef Google Scholar

Wang, Y., Lee, J., and Kifer, D.. 2015. “Differentially Private Hypothesis Testing, Revisited.” arXiv:1511.03376.Google Scholar

Warren, R. D., White, J. K., and Fuller, W. A.. 1974. “An Errors-in-Variables Analysis of Managerial Role Performance.” Journal of the American Statistical Association 69 (348): 886–893.CrossRef Google Scholar

Williams, O., and McSherry, F. 2010. “Probabilistic Inference and Differential Privacy.” Advances in Neural Information Processing Systems 23:2451–2459.Google Scholar

Article contents

Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset

Abstract

Keywords

Access options

Footnotes

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests