This chapter introduces the state space model and shows how this can be adapted to represent a wide variety of models of use in economics and finance. We define the Kalman filter and show how it can be implemented in leading examples.
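As a minimal, self-contained illustration of the kind of recursion the chapter defines (not the chapter's own code), the sketch below implements the Kalman filter for the simplest state space model, the local level model, in base R; the variances and starting values are assumed purely for the example.

```r
# Minimal sketch: Kalman filter for the local level model
#   y_t = a_t + e_t,     e_t ~ N(0, sig_e2)   (observation equation)
#   a_t = a_{t-1} + w_t, w_t ~ N(0, sig_w2)   (state equation)
kalman_local_level <- function(y, sig_e2, sig_w2, a0 = 0, P0 = 1e7) {
  n <- length(y)
  filt_mean <- numeric(n)
  filt_var  <- numeric(n)
  a_pred <- a0    # prior mean of the state at t = 1
  P_pred <- P0    # large prior variance = (near) diffuse initialisation
  for (t in seq_len(n)) {
    K <- P_pred / (P_pred + sig_e2)                 # Kalman gain
    filt_mean[t] <- a_pred + K * (y[t] - a_pred)    # update with observation y_t
    filt_var[t]  <- (1 - K) * P_pred
    a_pred <- filt_mean[t]                          # one-step-ahead prediction
    P_pred <- filt_var[t] + sig_w2
  }
  data.frame(filtered_mean = filt_mean, filtered_var = filt_var)
}

set.seed(1)
true_level <- cumsum(rnorm(100, sd = 0.5))   # simulated random-walk level
y <- true_level + rnorm(100, sd = 1)         # noisy observations
head(kalman_local_level(y, sig_e2 = 1, sig_w2 = 0.25))
```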
Inverse probability weighting is a common remedy for missing data issues, notably in causal inference. Despite its prevalence, practical applications are prone to bias from propensity score model misspecification. Recently proposed methods try to rectify this by balancing some moments of covariates between the target and weighted groups. Yet bias persists without knowledge of the true outcome model. Drawing inspiration from quasi-maximum likelihood estimation with misspecified statistical models, I propose an estimation method that minimizes a distance between the true and estimated weights under possibly misspecified models. This novel approach mitigates bias and controls mean squared error by minimizing their upper bounds. As an empirical application, it gives new insights into the study of foreign occupation and insurgency in France.
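For readers unfamiliar with the baseline method being improved upon, the toy sketch below shows plain inverse probability weighting in R: a propensity model for being observed is fitted (and may well be misspecified in practice), and its inverted fitted probabilities reweight the respondents. The data-generating values are invented for the illustration; the paper's own estimator, which minimizes a distance between true and estimated weights, is not reproduced here.

```r
# Toy data: y is observed only when r = 1, and the probability of being
# observed depends on covariates x1 and x2 (values assumed for illustration)
set.seed(1)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.4)
p_obs <- plogis(-0.3 + 0.8 * x1 - 0.5 * x2)   # true response propensity
r  <- rbinom(n, 1, p_obs)
y  <- 1 + 2 * x1 + x2 + rnorm(n)

# Step 1: estimate the propensity score (possibly misspecified in practice)
ps_fit <- glm(r ~ x1 + x2, family = binomial)
w <- 1 / fitted(ps_fit)                        # inverse probability weights

# Step 2: weighted (Hajek-type) estimate of the mean of y from respondents only
ipw_mean <- sum(r * w * y) / sum(r * w)
c(complete_case = mean(y[r == 1]), ipw = ipw_mean, truth = mean(y))
```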
Making best use of collected weather observations is simplified where thought is given to record management and storage: gathering meteorological records is usually a means to an end, rather than an end in itself. The more effectively records are stored, the quicker and easier it becomes to analyse and use them productively – a statement which applies equally to both professional and amateur observers. This chapter provides tried and tested suggestions for collecting, storing and archiving data from both manual observations and automatic weather stations (AWSs).
We outline key conceptual issues and strategies in social network data collection, focusing on the differences between realist and nominalist approaches. Given that most networks are incomplete in some way, we discuss ways to anticipate and assess problems with missing data.
All statistical models have assumptions, and violation of these assumptions can affect the reliability of any conclusions we draw. Before we fit any statistical model, we need to explore the data to be sure we fit a valid model. Are relationships assumed to be straight lines really linear? Does the response variable follow the assumed distribution? Are variances consistent? We outline several graphical techniques for exploring data and introduce the analysis of model residuals as a powerful tool. If assumptions are violated, we consider two solutions: transforming variables to satisfy the assumptions, or using models that assume different distributions more consistent with the raw data and residuals. The exploratory stage can be extensive, but it is essential. At this pre-analysis stage, we also consider what to do about missing observations.
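A minimal R sketch of the kind of residual check described above, using made-up data with a multiplicative error structure; the residual plots before and after a log transformation are the point:

```r
# Simulated data where the raw-scale linear model violates the assumptions
set.seed(1)
x <- runif(200, 1, 50)
y <- exp(0.5 + 0.05 * x + rnorm(200, sd = 0.3))   # curved mean, multiplicative error

fit_raw <- lm(y ~ x)
plot(fitted(fit_raw), residuals(fit_raw))   # fanning spread: variance grows with the mean
abline(h = 0, lty = 2)

# A log transformation brings the model much closer to its assumptions
fit_log <- lm(log(y) ~ x)
plot(fitted(fit_log), residuals(fit_log))   # roughly even scatter around zero
abline(h = 0, lty = 2)
```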
Making repeated observations through time adds complications, but it’s a common way to deal with limited research resources and reduce the use of experimental animals. A consequence of this design is that observations fall into clusters, often corresponding to individual organisms or “subjects.” We need to incorporate these relationships into statistical models and consider the additional complication where observations closer together in time may be more similar than those further apart. These designs were traditionally analyzed with repeated measures ANOVA, fitted by OLS. We illustrate this traditional approach but recommend the alternative linear mixed models approach. Mixed models offer better ways to deal with correlations within the data by specifying the clusters as random effects and modeling the correlations explicitly. When the repeated measures form a sequence (e.g. time), mixed models also offer a way to deal with occasional missing observations without omitting the whole subject from the model.
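A minimal sketch of the recommended mixed-model approach, using the lme4 package and invented repeated-measures data; note that lme4 handles the random-effects structure, but an explicit serial correlation structure (e.g. AR(1) residuals) would require a package such as nlme or glmmTMB.

```r
library(lme4)

# Hypothetical repeated-measures data: 20 subjects, 5 time points each,
# with a few observations missing
set.seed(1)
d <- expand.grid(subject = factor(1:20), time = 1:5)
subj_eff <- rnorm(20, sd = 1)                      # subject-level deviations
d$y <- 2 + 0.5 * d$time + subj_eff[as.integer(d$subject)] + rnorm(nrow(d), sd = 0.5)
d <- d[-sample(nrow(d), 8), ]                      # a few observations lost

# Random intercept per subject captures the within-subject correlation;
# subjects with missing time points are retained rather than dropped
m <- lmer(y ~ time + (1 | subject), data = d)
summary(m)
```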
This accessible and practical textbook gives students the perfect guide to the use of regression models in testing and evaluating hypotheses dealing with social relationships. A range of statistical methods suited to a wide variety of dependent variables is explained, which will allow students to read, understand, and interpret complex statistical analyses of social data. Each chapter contains example applications using relevant statistical methods in both Stata and R, giving students direct experience of applying their knowledge. A full suite of online resources - including statistical command files, datasets and results files, homework assignments, class discussion topics, PowerPoint slides, and exam questions - supports the student to work independently with the data, and the instructor to deliver the most effective possible course. This is the ideal textbook for advanced undergraduate and beginning graduate students taking courses in applied social statistics.
This chapter reviews several methods for addressing the statistical problem of missing data. We first explain how missing data can affect different components of the study design and the statistical analyses in such a way that the validity of the findings may become questionable. We next describe several methods to address the missing data problem and show why some may be problematic. We explain why multiple imputation (MI) and maximum likelihood (ML) are the preferred methods for addressing missing data issues. We then present an example using Stata, focusing on one of the preferred methods, multiple imputation. Lastly, within the context of an analysis of adolescent pregnancy, we use several methods to handle missing data and show how the analysis results may differ depending on which missing data method is used.
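The chapter's worked example uses Stata; as a rough analogue only (not the chapter's code), a multiple-imputation analysis in R with the mice package might look like the following, with the data-generating step invented for illustration.

```r
library(mice)

# Toy data: x1 is missing at random, with missingness depending on x2
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)
x1[rbinom(n, 1, plogis(x2)) == 1] <- NA

dat  <- data.frame(y, x1, x2)
imp  <- mice(dat, m = 20, method = "pmm", seed = 1, printFlag = FALSE)  # 20 imputations
fits <- with(imp, lm(y ~ x1 + x2))    # analyse each completed dataset
summary(pool(fits))                   # combine estimates with Rubin's rules
```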
Streamflow predictions are vital for detecting flood and drought events. Such predictions are even more critical to Sub-Saharan African regions that are vulnerable to the increasing frequency and intensity of such events. These regions are sparsely gaged, with few available gaging stations that are often plagued with missing data due to various causes, such as harsh environmental conditions and constrained operational resources. This work presents a novel workflow for predicting streamflow in the presence of missing gage observations. We leverage bias correction of the Group on Earth Observations Global Water and Sustainability Initiative ECMWF streamflow service (GESS) forecasts for missing data imputation and predict future streamflow using the state-of-the-art temporal fusion transformers (TFTs) at 10 river gaging stations in the Benin Republic. We show by simulating missingness in a testing period that GESS forecasts have a significant bias that results in poor imputation performance over the 10 Beninese stations. Our findings suggest that overall bias correction by Elastic Net and Gaussian Process regression achieves superior performance relative to traditional imputation by established methods. We also show that the TFT yields high predictive skill and further provides explanations for predictions through the weights of its attention mechanism. The findings of this work provide a basis for integrating global streamflow prediction model data and state-of-the-art machine learning models into operational early-warning decision-making systems in resource-constrained countries vulnerable to drought and flooding due to extreme weather events.
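The study's actual pipeline bias-corrects GESS forecasts against gage records and then trains a temporal fusion transformer; the fragment below is only a generic sketch of the bias-correction step, fitting an elastic net (glmnet, alpha between 0 and 1) to synthetic forecast and observation series invented for the example.

```r
library(glmnet)

# Synthetic station data: a systematically biased forecast of streamflow
set.seed(1)
n <- 1000
observed <- rgamma(n, shape = 2, scale = 10)
forecast <- 0.6 * observed + 5 + rnorm(n, sd = 3)   # bias plus noise
lag1 <- c(NA, forecast[-n])                         # simple lagged predictor

X    <- cbind(forecast, lag1)
keep <- complete.cases(X, observed)

# Elastic net bias correction with a cross-validated penalty
cv <- cv.glmnet(X[keep, ], observed[keep], alpha = 0.5)
corrected <- predict(cv, newx = X[keep, ], s = "lambda.min")

c(raw_rmse       = sqrt(mean((forecast[keep] - observed[keep])^2)),
  corrected_rmse = sqrt(mean((corrected - observed[keep])^2)))
```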
I-O psychologists often face the need to reduce the length of a data collection effort due to logistical constraints or data quality concerns. Standard practice in the field has been either to drop some measures from the planned data collection or to use short forms of instruments rather than full measures. Dropping measures is unappealing given the loss of potential information, and short forms often do not exist and have to be developed, which can be a time-consuming and expensive process. We advocate for an alternative approach to reduce the length of a survey or a test, namely to implement a planned missingness (PM) design in which each participant completes a random subset of items. We begin with a short introduction of PM designs, then summarize recent empirical findings that directly compare PM and short form approaches and suggest that they perform equivalently across a large number of conditions. We surveyed a sample of researchers and practitioners to investigate why PM has not been commonly used in I-O work and found that the underusage stems primarily from a lack of knowledge and understanding. Therefore, we provide a simple walkthrough of the implementation of PM designs and analysis of data with PM, as well as point to various resources and statistical software that are equipped for its use. Last, we prescribe a set of four conditions that would characterize a good opportunity to implement a PM design.
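To make the design concrete, the sketch below simulates a simplified variant of the classic three-form planned missingness design: items fall into three blocks and each respondent is randomly assigned to skip one block. Block sizes and sample size are invented; data collected this way would then be analysed with multiple imputation or full-information maximum likelihood, as the abstract notes.

```r
# Simplified planned missingness design: each respondent skips one of three item blocks
set.seed(1)
n_resp <- 300
blocks <- list(A = paste0("a", 1:10), B = paste0("b", 1:10), C = paste0("c", 1:10))
skip   <- sample(names(blocks), n_resp, replace = TRUE)   # block each respondent skips

admin <- matrix(TRUE, n_resp, 30, dimnames = list(NULL, unlist(blocks)))
for (i in seq_len(n_resp)) admin[i, blocks[[skip[i]]]] <- FALSE

summary(colMeans(admin))   # each item administered to roughly two-thirds of respondents
summary(rowMeans(admin))   # each respondent answers exactly 20 of the 30 items
```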
Issues of quantile regression, simulation methods, multi-level panel data, errors of measurement, distributed lag models when T is short, rotating or randomly missing data, repeated cross-sectional data, and discretizing unobserved heterogeneity are discussed.
We consider the properties of listwise deletion when both n and the number of variables grow large. We show that when (i) all data have some idiosyncratic missingness and (ii) the number of variables grows superlogarithmically in n, then, for large n, listwise deletion will drop all rows with probability 1. Using two canonical datasets from the study of comparative politics and international relations, we provide numerical illustration that these problems may emerge in real-world settings. These results suggest that, in practice, using listwise deletion may mean using few of the variables available to the researcher.
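The mechanism is easy to reproduce: if each of p variables is independently missing with a small probability, a row survives listwise deletion with probability that decays geometrically in p. A quick simulation (a 2% missingness rate per cell, chosen arbitrarily):

```r
# Share of rows surviving listwise deletion is roughly 0.98^p
set.seed(1)
n      <- 5000
p_vals <- c(5, 25, 100, 250)
kept <- sapply(p_vals, function(p) {
  miss <- matrix(rbinom(n * p, 1, 0.02), n, p)   # 1 = cell is missing
  mean(rowSums(miss) == 0)                       # fully observed rows
})
round(rbind(p = p_vals, share_kept = kept), 3)   # about 0.90, 0.60, 0.13, < 0.01
```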
Imputing missing values is an important preprocessing step in data analysis, but the literature offers little guidance on how to choose between imputation models. This letter suggests adopting the imputation model that generates a density of imputed values most similar to that of the observed values for an incomplete variable after balancing all other covariates. We recommend stable balancing weights as a practical approach to balance covariates whose distribution is expected to differ if the values are not missing completely at random. After balancing, discrepancy statistics can be used to compare the density of imputed and observed values. We illustrate the application of the suggested approach using simulated and real-world survey data from the American National Election Study, comparing popular imputation approaches including random forests, hot-deck, predictive mean matching, and multivariate normal imputation. An R package implementing the suggested approach accompanies this letter.
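The snippet below is a deliberately simplified sketch of the final comparison step only: it contrasts the density of imputed and observed values for one variable using a two-sample Kolmogorov–Smirnov statistic, for two toy imputation models. It omits the covariate balancing with stable balancing weights that the letter recommends, so the direct comparison is defensible here only because the example is simulated; it is not the letter's procedure or its accompanying R package.

```r
# Compare imputed vs. observed densities for one incomplete variable z
set.seed(1)
n <- 1000
x <- rnorm(n); z <- 0.7 * x + rnorm(n)
z_mis <- z; z_mis[rbinom(n, 1, plogis(x)) == 1] <- NA
obs <- !is.na(z_mis)

# Candidate 1: mean imputation; candidate 2: regression imputation on x
imp_mean <- rep(mean(z_mis, na.rm = TRUE), sum(!obs))
imp_reg  <- predict(lm(z_mis ~ x, subset = obs),
                    newdata = data.frame(x = x[!obs]))

# Discrepancy between imputed and observed values (smaller D = more similar)
c(mean_imputation = ks.test(imp_mean, z_mis[obs])$statistic,
  regression      = ks.test(imp_reg,  z_mis[obs])$statistic)
```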
The use of a Kaplan–Meier (K–M) survival time approach is generally considered appropriate for reporting antimalarial efficacy trials. However, when a treatment arm has 100% efficacy, confidence intervals cannot be computed. Furthermore, methods that use probability rules to handle missing data, for instance multiple imputation, encounter the perfect prediction problem when a treatment arm has full efficacy, in which case the imputed values are either all treatment successes or all failures. The K–M survival method addresses this imputation problem when estimating efficacy rates, also referred to as cure rates. We discuss the statistical challenges and propose a potential way forward.
The proposed approach uses K–M estimates as the main measure of efficacy. Confidence intervals can be computed using the exact binomial method, and p-values for the difference in efficacy between treatments can be estimated using Fisher’s exact test. We emphasize that when efficacy rates are not 100% in both groups, the K–M approach remains the main analysis strategy, given its statistical robustness in handling missing data, and confidence intervals can be computed under such scenarios.
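A small R sketch of the proposed reporting strategy, with an invented two-arm dataset in which arm B has no failures at all; survfit gives the K–M estimates, binom.test the exact binomial interval, and fisher.test the between-arm comparison.

```r
library(survival)

# Toy antimalarial efficacy data: time to failure in days, status 1 = failure,
# 0 = censored at day 42; arm B has 100% efficacy (0 failures out of 50)
d <- data.frame(
  arm    = rep(c("A", "B"), each = 50),
  time   = c(14, 21, 28, 35, rep(42, 96)),
  status = c(rep(1, 4), rep(0, 96))
)

km <- survfit(Surv(time, status) ~ arm, data = d)   # Kaplan-Meier by arm
summary(km, times = 42)                             # cure-rate estimates at day 42

binom.test(0, 50)$conf.int                          # exact CI for the 100% arm
fisher.test(table(d$arm, d$status))                 # exact between-arm comparison
```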
The large amount of synchrophasor data obtained by Phasor Measurement Units (PMUs) provides dynamic visibility of power systems. As the data is being collected from geographically distant locations facilitated by computer networks, the data quality can be compromised by data losses, bad data, and cybernetic attacks. Data privacy is also an increasing concern. This chapter describes a common framework of methods for data recovery, error correction, detection and correction of cybernetic attacks, and data privacy enhancement by exploiting the intrinsic low-dimensional structures in the high-dimensional spatial-temporal blocks of PMU data. The developed data-driven approaches are computationally efficient with provable analytical guarantees. For instance, the data recovery method can recover the ground-truth data even if simultaneous and consecutive data losses and errors happen across all PMU channels for some time. This approach can identify PMU channels that are under false data injection attacks by locating abnormal dynamics in the data. Random noise and quantization can be applied to the measurements before transmission to compress the data and enhance data privacy.
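As a toy illustration of the underlying low-rank idea only (not the chapter's algorithms, which come with analytical guarantees), the sketch below recovers missing entries of a synthetic rank-2 "measurement block" by repeatedly truncating an SVD and refilling the missing cells.

```r
# Synthetic low-rank "PMU block" with 20% of entries missing
set.seed(1)
U <- matrix(rnorm(60 * 2), 60, 2)
V <- matrix(rnorm(20 * 2), 20, 2)
X <- U %*% t(V)                                     # ground-truth rank-2 matrix
mask  <- matrix(runif(length(X)) < 0.2, nrow(X))    # TRUE = missing entry
X_obs <- X; X_obs[mask] <- NA

# Naive iterative low-rank completion: truncate the SVD, refill missing cells
X_hat <- X_obs
X_hat[is.na(X_hat)] <- mean(X_obs, na.rm = TRUE)    # crude initial fill
k <- 2                                              # assumed rank
for (it in 1:200) {
  s <- svd(X_hat)
  X_low <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
  X_hat[mask] <- X_low[mask]
}
sqrt(mean((X_hat[mask] - X[mask])^2))               # recovery error on missing entries
```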
Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS’s accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
Measurement errors are omnipresent in network data. Most studies observe an erroneous network instead of the desired error-free network. It is well known that such errors can have a severe impact on network metrics, especially on centrality measures: a central node in the observed network might be less central in the underlying, error-free network. Robustness is a common concept for quantifying these effects. Studies have shown that robustness primarily depends on the centrality measure, the type of error (e.g., missing edges or missing nodes), and the network topology (e.g., tree-like, core-periphery). Previous findings regarding the influence of network size on robustness are, however, inconclusive. We present empirical evidence and analytical arguments indicating that there exist arbitrarily large robust and non-robust networks and that the average degree is well suited to explain robustness. We demonstrate that networks with a higher average degree are often more robust. For the degree centrality and Erdős–Rényi (ER) graphs, we present explicit formulas for computing the robustness, based mainly on the joint distribution of node degrees and degree changes, which allows us to analyze the robustness of ER graphs with a constant or increasing average degree.
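A small empirical illustration of the robustness notion (not the paper's formulas), using the igraph package: degree centrality is computed on an ER graph before and after deleting 10% of edges at random, for a low and a higher average degree.

```r
library(igraph)

set.seed(42)
n <- 500
for (avg_deg in c(4, 20)) {
  g <- sample_gnp(n, p = avg_deg / (n - 1))             # ER graph
  drop  <- sample(ecount(g), round(0.10 * ecount(g)))   # 10% of edges unobserved
  g_obs <- delete_edges(g, drop)
  rho <- cor(degree(g), degree(g_obs), method = "spearman")
  cat(sprintf("average degree %2d: Spearman rho = %.3f\n", avg_deg, rho))
}
```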
Missing data are inevitable in medical research and appropriate handling of missing data is critical for statistical estimation and making inferences. Imputation is often employed in order to maximize the amount of data available for statistical analysis and is preferred over the typically biased output of complete case analysis. This article examines several types of regression imputation of missing covariates in the prediction of time-to-event outcomes subject to right censoring.
Methods:
We evaluated the performance of five regression methods for imputing missing covariates in the proportional hazards model, using summary statistics including proportional bias and proportional mean squared error. The primary objective was to determine which of the five methods, the parametric generalized linear model (GLM) and least absolute shrinkage and selection operator (LASSO), or the nonparametric multivariate adaptive regression splines (MARS), support vector machine (SVM), and random forest (RF), provides the “best” imputation model for missing baseline covariates when predicting a survival outcome.
Results:
LASSO showed, on average, the smallest bias, mean squared error, mean squared prediction error, and median absolute deviation (MAD) of the final analysis model’s parameters among the five methods considered. SVM performed second best, while GLM and MARS exhibited the lowest relative performance.
Conclusion:
LASSO and SVM outperform GLM, MARS, and RF in the context of regression imputation for prediction of a time-to-event outcome.
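As a schematic of regression imputation for a survival analysis (not the article's simulation design), the sketch below imputes one missing baseline covariate with a cross-validated LASSO (glmnet) and then fits a Cox proportional hazards model on the completed data; all data-generating choices are invented, and in a real application the LASSO would select among many more candidate predictors.

```r
library(glmnet)
library(survival)

# Toy data: baseline covariate x1 has 30% missing values; x2-x4 are complete
set.seed(1)
n  <- 600
x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rbinom(n, 1, 0.5)
x1 <- 0.6 * x2 - 0.4 * x3 + rnorm(n)
time   <- rexp(n, rate = exp(-2 + 0.5 * x1 + 0.3 * x4))
status <- as.integer(time < 5); time <- pmin(time, 5)   # administrative censoring
x1[rbinom(n, 1, 0.3) == 1] <- NA

# LASSO regression imputation of x1 from the fully observed covariates
obs <- !is.na(x1)
X   <- cbind(x2, x3, x4)
cv  <- cv.glmnet(X[obs, ], x1[obs], alpha = 1)
x1_imp <- x1
x1_imp[!obs] <- predict(cv, newx = X[!obs, ], s = "lambda.min")

# Proportional hazards model fitted to the completed data
coxph(Surv(time, status) ~ x1_imp + x2 + x3 + x4)
```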