A method for robust canonical discriminant analysis via two robust objective loss functions is discussed. These functions are useful to reduce the influence of outliers in the data. Majorization is used at several stages of the minimization procedure to obtain a monotonically convergent algorithm. An advantage of the proposed method is that it allows for optimal scaling of the variables. In a simulation study it is shown that in the presence of outliers the robust functions outperform the ordinary least squares function, both when the underlying structure is linear in the variables and when it is nonlinear. Furthermore, the method is illustrated with empirical data.
This paper presents an approach for determining unidimensional scale estimates that are relatively insensitive to limited inconsistencies in paired comparisons data. The solution procedure, shown to be a minimum-cost network-flow problem, is presented in conjunction with a sensitivity diagnostic that assesses the influence of a single pairwise comparison on traditional Thurstone (ordinary least squares) scale estimates. When the diagnostic indicates some source of distortion in the data, the network technique appears to be more successful than Thurstone scaling in preserving the interval scale properties of the estimates.
A method for multidimensional scaling that is highly resistant to the effects of outliers is described. To illustrate the efficacy of the procedure, some Monte Carlo simulation results are presented. The method is shown to perform well when outliers are present, even in relatively large numbers, and also to perform comparably to other approaches when no outliers are present.
Factor analysis is regularly used for analyzing survey data. Missing data, data with outliers and consequently nonnormal data are very common for data obtained through questionnaires. Based on covariance matrix estimates for such nonstandard samples, a unified approach for factor analysis is developed. By generalizing the approach of maximum likelihood under constraints, statistical properties of the estimates for factor loadings and error variances are obtained. A rescaled Bartlett-corrected statistic is proposed for evaluating the number of factors. Equivariance and invariance of parameter estimates and their standard errors for canonical, varimax, and normalized varimax rotations are discussed. Numerical results illustrate the sensitivity of classical methods and advantages of the proposed procedures.
We discuss measuring and detecting influential observations and outliers in the context of exponential family random graph (ERG) models for social networks. We focus on the level of the nodes of the network and consider those nodes whose removal would result in changes to the model as extreme or “central” with respect to the structural features that “matter”. We construe removal in terms of two case-deletion strategies: the tie-variables of an actor are assumed to be unobserved, or the node is removed resulting in the induced subgraph. We define the difference in inferred model resulting from case deletion from the perspective of information theory and difference in estimates, in both the natural and mean-value parameterisation, representing varying degrees of approximation. We arrive at several measures of influence and propose the use of two that do not require refitting of the model and lend themselves to routine application in the ERGM fitting procedure. MCMC p values are obtained for testing how extreme each node is with respect to the network structure. The influence measures are applied to two well-known data sets to illustrate the information they provide. From a network perspective, the proposed statistics offer an indication of which actors are most distinctive in the network structure, in terms of not abiding by the structural norms present across other actors.
By means of more than a dozen user-friendly packages, structural equation models (SEMs) are widely used in behavioral, educational, social, and psychological research. As the underlying theory and methods in these packages are vulnerable to outliers and distributions with longer-than-normal tails, a fundamental problem in the field is the development of robust methods to reduce the influence of outliers and the distributional deviation in the analysis. In this paper we develop a maximum likelihood (ML) approach that is robust to outliers and symmetrically heavy-tailed distributions for analyzing nonlinear SEMs with ignorable missing data. The analytic strategy is to incorporate a general class of distributions into the latent variables and the error measurements in the measurement and structural equations. A Monte Carlo EM (MCEM) algorithm is constructed to obtain the ML estimates, and a path sampling procedure is implemented to compute the observed-data log-likelihood and then the Bayesian information criterion for model comparison. The proposed methodologies are illustrated with simulation studies and an example.
Stochastic mortality models are important for a variety of actuarial tasks, from best-estimate forecasting to assessment of risk capital requirements. However, the mortality shock associated with the Covid-19 pandemic of 2020 distorts forecasts by (i) biasing parameter estimates, (ii) biasing starting points, and (iii) inflating variance. Stochastic mortality models therefore require outlier-robust methods for forecasting. Objective methods are required, as outliers are not always obvious on visual inspection. In this paper we look at the robustification of three broad classes of forecast: univariate time indices (such as in the Lee-Carter and APC models); multivariate time indices (such as in the Cairns-Blake-Dowd and newer Tang-Li-Tickle model families); and penalty projections (such as with the 2D P-spline model). In each case we identify outliers using quantitative methods, then co-estimate outlier effects along with other parameters. Doing so removes the bias and distortion to the forecast caused by a mortality shock, while providing a robust starting point for projections. Illustrations are given for various models in common use.
Chapter 3 examines measures of central tendency and their correspondence to normality and skewness. The three measures of central tendency presented include the mode, median, and mean. The mean is typically thought of as the average. The mode is the score occurring most frequently in a distribution of scores; the median is the central score, or the point which divides a distribution into two equal parts. The median is a robust statistic. The level of measurement assumption is crucial in selecting the best measure of central tendency for specific analyses.
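As a minimal sketch of why the median is described as robust, the following Python snippet compares the three measures before and after a single extreme score is added. The scores and the helper name `summarize` are made up for illustration and are not taken from the chapter.

```python
from statistics import mean, median, mode

# Hypothetical scores; the second list adds one extreme value.
scores = [2, 3, 3, 4, 5]
with_outlier = scores + [100]


def summarize(values):
    """Return the three measures of central tendency for a list of scores."""
    return {
        "mean": mean(values),      # arithmetic average
        "median": median(values),  # middle score (average of the two middle scores if n is even)
        "mode": mode(values),      # most frequent score
    }


clean = summarize(scores)
contaminated = summarize(with_outlier)

print(clean)
print(contaminated)
```

In this toy example the single outlier moves the mean from 3.4 to 19.5, while the median moves only from 3 to 3.5 and the mode does not change.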
Central tendency describes the typical value of a variable. Measures of central tendency by level of measurement are covered including the mean, median, and mode. Appropriate use of each measure by level of measurement is the central theme of the chapter. The chapter shows how to find these measures of central tendency by hand and in the R Commander with detailed instructions and steps. Skewed distributions and outliers of data are also covered, as is the relationship between the mean and median in these cases.
Multiple linear regression generalizes straight line regression to allow multiple explanatory (or predictor) variables, in this chapter under the normal errors assumption. The focus may be on accurate prediction. Or it may, alternatively or additionally, be on the regression coefficients themselves. Simplistic interpretations of coefficients can be grossly misleading. Later chapters elaborate on the ideas and methods developed in this chapter, applying them in new contexts. The attaching of causal interpretations to model coefficients must be justified both by reference to subject area knowledge and by careful checks to ensure that they are not artefacts of the correlation structure. There is attention to regression diagnostics, to assessment, and comparison of models. Variable selection strategies can readily over-fit. Hence the importance of training/test approaches and cross-validation. The potential is demonstrated for errors in x to seriously bias regression coefficients. Strong multicollinearity leads to large variance inflation factors.
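The link between multicollinearity and variance inflation factors can be illustrated with a short sketch. It relies on the standard identity that the VIF of predictor j, 1 / (1 - R_j^2), equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The simulated data, the seed, and the function name `variance_inflation_factors` are assumptions made for this example and do not come from the chapter.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X, read off the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.05 * rng.standard_normal(n)  # nearly a copy of x1
x3 = rng.standard_normal(n)              # unrelated predictor
X = np.column_stack([x1, x2, x3])

vif = variance_inflation_factors(X)
print(vif)
```

The two nearly collinear columns should show very large factors, while the unrelated column should stay close to 1.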
There are many applications of the low-rank signal-plus-noise model 𝒀 = 𝑿 + 𝒁 where 𝑿 is a low-rank matrix and 𝒁 is noise, such as denoising and dimensionality reduction. We are interested in the properties of the latent matrix 𝑿, such as its singular value decomposition (SVD), but all we are given is the noisy matrix 𝒀. It is important to understand how the SVD components of 𝒀 relate to those of 𝑿 in the presence of a random noise matrix 𝒁. The field of random matrix theory (RMT) provides insights into those relationships, and this chapter summarizes some key results from RMT that help explain how the noise in 𝒁 perturbs the SVD components, by analyzing limits as matrix dimensions increase. The perturbations considered include roundoff error, additive Gaussian noise, outliers, and missing data. This is the only chapter that requires familiarity with the distributions of continuous random variables, and it provides many pointers to the literature on this modern topic, along with several demos that illustrate remarkable agreement between the asymptotic predictions and the empirical performance even for modest matrix sizes.
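A minimal numerical sketch of the model 𝒀 = 𝑿 + 𝒁 is given below. It is not one of the chapter's demos: the matrix sizes, rank, noise level, and seed are arbitrary assumptions, and the check uses Weyl's inequality, a deterministic bound stating that each singular value of 𝒀 differs from the corresponding singular value of 𝑿 by at most the spectral norm of 𝒁. The asymptotic RMT results summarized in the chapter are sharper than this bound.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank = 60, 40, 2  # assumed sizes for illustration

# Low-rank latent matrix X, built as a product of two thin factors.
U = rng.standard_normal((m, rank))
V = rng.standard_normal((n, rank))
X = U @ V.T

# Additive Gaussian noise Z and the observed matrix Y.
Z = 0.1 * rng.standard_normal((m, n))
Y = X + Z

s_X = np.linalg.svd(X, compute_uv=False)
s_Y = np.linalg.svd(Y, compute_uv=False)

noise_norm = np.linalg.norm(Z, 2)       # spectral norm of the noise
max_shift = np.max(np.abs(s_Y - s_X))   # largest change in any singular value

print("largest singular-value shift:", max_shift)
print("spectral norm of Z:", noise_norm)
```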
Medearis and his two cofounders of Silicon Valley Bank wished to tackle the antiquated banking practices that led to a massive reduction in the number of banks, the disappearance of community banks, and the mergers of Big Banks. Bank regulations and culture prevent banks from embracing tech startups and entrepreneurs as lending clients. The SVB founders knew about Bank of America’s abandonment of its early tech lending, missed opportunities, and bank failures to capture tech startups and entrepreneurs. The old, conservative banking environment during the early days of the tech sector presented the founders with an opportunity.
A core contention woven into the fabric of Sun Tzu’s thinking is that all situations faced by a strategic actor, even those that appear on their face to be losing ones, hold seeds of opportunity that, if grasped correctly, can be parlayed into strategic advantage. An illustrative statement starts off Passage #5.1 below.
This chapter discusses Feature Engineering techniques that look holistically at the feature set, replacing or enhancing the features based on their relation to the whole set of instances and features. Techniques such as normalization, scaling, dealing with outliers and generating descriptive features are covered. Scaling and normalization are the most common; they involve finding the maximum and minimum and changing the values so that they lie in a given interval (e.g., [0, 1] or [−1, 1]). Discretization and binning involve, for example, analyzing a feature that is an integer (any number from −1 trillion to +1 trillion) and realizing that it only takes the values 0, 1 and 10, so it can be simplified into a symbolic feature with three values (value0, value1 and value10). Descriptive features gather information about the shape of the data; the discussion centres on tables of counts (histograms) and general descriptive features such as maximum, minimum and averages. Outlier detection and treatment refers to looking at the feature values across many instances and realizing that some values might lie very far from the rest.
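The scaling and discretization steps described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `min_max_scale` and `to_symbolic` and the `max_levels` threshold are hypothetical choices and are not taken from the chapter.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so the minimum lands on `low` and the maximum on `high`."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant feature has no spread; map everything to the lower bound.
        return [low for _ in values]
    span = vmax - vmin
    return [low + (v - vmin) * (high - low) / span for v in values]


def to_symbolic(values, max_levels=10):
    """Turn a numeric feature into symbols if it takes only a few distinct values."""
    levels = set(values)
    if len(levels) > max_levels:
        return None  # too many distinct values to treat as symbolic
    return [f"value{v}" for v in values]


print(min_max_scale([0, 5, 10]))
print(min_max_scale([0, 5, 10], low=-1.0, high=1.0))
print(to_symbolic([0, 1, 10, 1]))
```

Note that min-max scaling depends on the observed extremes, so a single outlier can compress all other values into a narrow part of the interval.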
Graphs are a powerful and concise way to communicate information. Representing data from an experiment in the form of an x-y graph allows relationships to be examined, scatter in the data to be assessed, and special or unusual features to be identified rapidly. A well laid out graph containing all the components discussed in this chapter can act as a 'one stop' summary of a whole experiment. Someone studying an account of an experiment will often examine the graph(s) included in the account first to gain an overall picture of the outcome of the experiment. The importance of graphs, therefore, cannot be overstated, as they so often play a central role in the communication of the key findings of an experiment. This chapter contains many examples of graphs and includes exercises and end of chapter problems which reinforce the graph-plotting principles.
A high-resolution 14C chronology for the Teopancazco archaeological site in the Teotihuacan urban center of Mesoamerica was generated by Bayesian analysis of 33 radiocarbon dates and detailed archaeological information related to occupation stratigraphy, pottery and archaeomagnetic dates. The calibrated intervals obtained using the Bayesian model are up to ca. 70% shorter than those obtained with individual calibrations. For some samples, this is a consequence of plateaus in the part of the calibration curve covered by the sample dates (2500 to 1450 14C yr BP). Effects of outliers are explored by comparing the results from a Bayesian model that incorporates radiocarbon data for two outlier samples with the same model excluding them. The effect of outliers was more significant than expected. Inclusion of radiocarbon dates from two altered contexts, 500 14C yr earlier than those for the first occupational phase, results in ages calculated by the model earlier than the archaeological records. The Bayesian chronology excluding these outliers separates the first two Teopancazco occupational phases and suggests that ending of the Xolalpan phase was around cal AD 550, 100 yr earlier than previously estimated and in accordance with previously reported archaeomagnetic dates from lime plasters for the same site.
State politics researchers commonly employ ordinary least squares (OLS) regression or one of its variants to test linear hypotheses. However, OLS is easily influenced by outliers and thus can produce misleading results when the error term distribution has heavy tails. Here we demonstrate that median regression (MR), an alternative to OLS that conditions the median of the dependent variable (rather than the mean) on the independent variables, can be a solution to this problem. Then we propose and validate a hypothesis test that applied researchers can use to select between OLS and MR in a given sample of data. Finally, we present two examples from state politics research in which (1) the test selects MR over OLS and (2) differences in results between the two methods could lead to different substantive inferences. We conclude that MR and the test we propose can improve linear models in state politics research.
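The contrast between OLS and median regression can be sketched on toy data. The example below is not the authors' procedure and does not implement their selection test. It fits median regression as a least-absolute-deviations linear program via `scipy.optimize.linprog`, and the data (eleven points on the line y = 1 + 2x with one observation shifted by 50) and the function name `median_regression` are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def median_regression(X, y):
    """Least-absolute-deviations fit: minimize sum(u + v) subject to X b + u - v = y, u, v >= 0."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    result = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return result.x[:p]


# Toy data: an exact line with a single gross outlier at x = 8.
x = np.arange(11, dtype=float)
y = 1.0 + 2.0 * x
y[8] += 50.0

X = np.column_stack([np.ones_like(x), x])  # intercept and slope columns

ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
mr_coef = median_regression(X, y)

print("OLS intercept, slope:", ols_coef)
print("MR  intercept, slope:", mr_coef)
```

On this data the median regression fit recovers the underlying line, while the OLS slope is pulled upward by the single outlier.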
Structural instability in economic time series is widely reported in the literature. It is most prevalent in such series as price indices and inflation related data. Many methods have been developed for analysing and modelling structural changes in a univariate time series model. However, most of them assume that the data are generated by one fixed type (linear or non-linear) of the time series processes. This paper proposes a strategy for modelling different segments of an economic time series by different linear or non-linear models. A graphical procedure is suggested for detecting the model change points. The proposed procedure is illustrated by modelling annual United Kingdom price inflation series over the period 1265 to 2005. Stochastic modelling of inflation rates is an important topic to actuaries for dealing with long-term index linked insurance business. The proposed method suggests dividing the U.K. inflation series into four segments for modelling. Inflation projections based on the latest segment of the data are obtained through simulations. To get a better understanding of the impact of structural changes on inflation projections we also perform a forecasting study.
With more satellite systems becoming available there is currently a need for Receiver Autonomous Integrity Monitoring (RAIM) to exclude multiple outliers. While the single outlier test can be applied iteratively, in the field of statistics robust methods are preferred when multiple outliers exist. This study compares the outlier test and numerous robust methods with simulated GPS measurements to identify which methods have the greatest ability to correctly exclude outliers. It was found that no method could correctly exclude outliers 100% of the time. However, for a single outlier the outlier test achieved the highest rates of correct exclusion, followed by the MM-estimator and the L1-norm. As the number of outliers increased, MM-estimators and the L1-norm obtained the highest rates of correct exclusion, which were up to ten percent higher than the outlier test.