An Improved Method of Automated Nonparametric Content Analysis for Social Science

Published online by Cambridge University Press:  07 January 2022

Connor T. Jerzak
Affiliation:
Ph.D. Candidate and Carl J. Friedrich Fellow, Department of Government, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138, USA. E-mail: cjerzak@g.harvard.edu, URL: https://ConnorJerzak.com
Gary King*
Affiliation:
Albert J. Weatherhead III University Professor, Institute for Quantitative Social Science, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138, USA. URL: https://GaryKing.org
Anton Strezhnev
Affiliation:
Assistant Professor, University of Chicago, Department of Political Science, 5828 S. University Avenue, Chicago, IL 60637, USA. E-mail: astrezhnev@uchicago.edu, URL: https://antonstrezhnev.com
*Corresponding author: Gary King

Abstract

Some scholars build models to classify documents into chosen categories. Others, especially social scientists who tend to focus on population characteristics, instead usually estimate the proportion of documents in each category—using either parametric “classify-and-count” methods or “direct” nonparametric estimation of proportions without individual classification. Unfortunately, classify-and-count methods can be highly model-dependent or generate more bias in the proportions even as the percent of documents correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning of language changes between training and test sets or is too similar across categories. We develop an improved direct estimation approach without these issues by including and optimizing continuous text features, along with a form of matching adapted from the causal inference literature. Our approach substantially improves performance in a diverse collection of 73 datasets. We also offer easy-to-use software that implements all ideas discussed herein.
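
The contrast between classify-and-count and direct estimation can be illustrated with a toy example. The sketch below is not the authors' software or their improved method. It assumes two categories, two binary text features, and within-category feature rates that are identical in the training and test sets (the condition the abstract notes can fail in practice). All function names, feature rates, and sample sizes are hypothetical and chosen only for illustration.

```python
# Toy comparison: classify-and-count vs. direct estimation of a category share.
# Everything here (names, rates, sample sizes) is an illustrative assumption.

def make_docs(n, rates):
    """Build n binary feature vectors whose feature means match `rates`."""
    return [[1 if i < round(n * r) else 0 for r in rates] for i in range(n)]

def feature_means(docs):
    """Mean of each binary feature over a list of feature vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def classify_and_count(test_docs, classify):
    """Share of test documents that the classifier assigns to category A."""
    return sum(classify(d) for d in test_docs) / len(test_docs)

def direct_estimate(train_a, train_b, test_docs):
    """Estimate the share of category A without labeling any test document.

    Solves mean(test) = pi * mean(A) + (1 - pi) * mean(B) for pi by
    least squares, then clips the result to [0, 1].
    """
    m_a = feature_means(train_a)
    m_b = feature_means(train_b)
    m_t = feature_means(test_docs)
    num = sum((t - b) * (a - b) for t, a, b in zip(m_t, m_a, m_b))
    den = sum((a - b) ** 2 for a, b in zip(m_a, m_b))
    return min(1.0, max(0.0, num / den))

# Hypothetical within-category feature rates (same in training and test).
RATES_A = [0.8, 0.4]
RATES_B = [0.3, 0.6]

# Balanced training set: 100 documents per category.
train_a = make_docs(100, RATES_A)
train_b = make_docs(100, RATES_B)

# Test set in which category A is only 20% of documents.
test_docs = make_docs(200, RATES_A) + make_docs(800, RATES_B)
true_share = 200 / 1000

# A simple classifier: label a document "A" when the first feature is present.
def classify(doc):
    return doc[0] == 1

# Training accuracy of that rule (correct A labels plus correct B labels).
accuracy = (sum(classify(d) for d in train_a)
            + sum(not classify(d) for d in train_b)) / 200

cc = classify_and_count(test_docs, classify)
direct = direct_estimate(train_a, train_b, test_docs)

print(f"training accuracy:  {accuracy:.2f}")
print(f"true share of A:    {true_share:.2f}")
print(f"classify-and-count: {cc:.2f}")
print(f"direct estimate:    {direct:.2f}")
```

In this construction the classifier is correct on 75% of training documents, yet counting its labels in the test set gives a share of 0.40 for category A when the true share is 0.20. The direct estimate recovers 0.20 because it uses only the within-category feature means, which were built to be stable across the two sets.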

Type
Article
Copyright
© The Author(s) 2022. Published by Cambridge University Press on behalf of the Society for Political Methodology

Footnotes

Edited by Jeff Gill

Supplementary material

Jerzak et al. Dataset (link)
Jerzak et al. supplementary material (PDF, 487.9 KB)