Mediation analysis practices in social and personality psychology would benefit from the integration of practices from statistical mediation analysis, which is currently commonly implemented in social and personality psychology, and causal mediation analysis, which is not frequently used in psychology. In this chapter, I briefly describe each method on its own, then provide recommendations for how to integrate practices from each method to simultaneously evaluate statistical inference and causal inference as part of a single analysis. At the end of the chapter, I describe additional areas of recent development in mediation analysis that social and personality psychologists should also consider adopting in order to improve the quality of inference in their mediation analyses: latent variables and longitudinal models. Ultimately, this chapter is meant to be a gentle introduction to causal inference in the context of mediation, with very practical recommendations for how one can implement these practices in one’s own research.
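As a concrete reference point, the Python sketch below illustrates the basic statistical mediation estimate via the product of coefficients; the simulated data and variable names X, M, and Y are assumptions for illustration, and a full causal mediation analysis would add explicit identification assumptions and sensitivity checks beyond this.

```python
# Minimal statistical mediation sketch (product of coefficients) on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)                       # treatment / predictor
M = 0.5 * X + rng.normal(size=n)             # mediator
Y = 0.4 * M + 0.2 * X + rng.normal(size=n)   # outcome

a = sm.OLS(M, sm.add_constant(X)).fit().params[1]                 # X -> M path
fit_y = sm.OLS(Y, sm.add_constant(np.column_stack([X, M]))).fit()
c_prime, b = fit_y.params[1], fit_y.params[2]                     # direct effect, M -> Y path

print(f"indirect effect (a*b) = {a * b:.3f}, direct effect (c') = {c_prime:.3f}")
```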
This chapter is devoted to extensive instruction regarding bivariate regression, also known as ordinary least squares regression (OLS). Students are presented with a scatterplot of data with a best-fitting line drawn through it. They are instructed on how to calculate the equation of this line (least squares line) by hand and with the R Commander. Interpretation of the statistical output of the y-intercept, beta coefficient, and R-squared value is discussed. Statistical significance of the beta coefficient and its implications for the relationship between an independent and dependent variable are described. Finally, the use of the regression equation for prediction is illustrated.
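For readers who want to see the hand calculation in code, here is a minimal sketch in Python (the chapter itself uses the R Commander; the sample data below are assumptions):

```python
# Least-squares line, R-squared, and prediction, computed from first principles.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

y_hat = intercept + slope * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"y = {intercept:.2f} + {slope:.2f} x, R^2 = {r_squared:.3f}")
print("prediction at x = 6:", intercept + slope * 6)   # using the equation for prediction
```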
Many applications require solving a system of linear equations 𝑨𝒙 = 𝒚 for 𝒙 given 𝑨 and 𝒚. In practice, often there is no exact solution for 𝒙, so one seeks an approximate solution. This chapter focuses on least-squares formulations of this type of problem. It briefly reviews the 𝑨𝒙 = 𝒚 case and then motivates the more general 𝑨𝒙 ≈ 𝒚 cases. It then focuses on the over-determined case where 𝑨 is tall, emphasizing the insights offered by the SVD of 𝑨. It introduces the pseudoinverse, which is especially important for the under-determined case where 𝑨 is wide. It describes alternative approaches for the under-determined case such as Tikhonov regularization. It introduces frames, a generalization of unitary matrices. It uses the SVD analysis of this chapter to describe projection onto a subspace, completing the subspace-based classification ideas introduced in the previous chapter, and also introduces a least-squares approach to binary classifier design. It introduces recursive least-squares methods that are important for streaming data.
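A small numerical sketch of these ideas (the matrix 𝑨 and vector 𝒚 below are arbitrary illustrative arrays, not taken from the chapter):

```python
# Least-squares via the SVD, the pseudoinverse, and a Tikhonov-regularized alternative.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))        # tall (over-determined) case
y = rng.normal(size=8)

# Over-determined case: minimize ||Ax - y||_2 using the SVD of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_ls = Vt.T @ ((U.T @ y) / s)

# Same solution via the pseudoinverse (which also covers the wide, under-determined case).
x_pinv = np.linalg.pinv(A) @ y

# Tikhonov regularization: minimize ||Ax - y||^2 + lam * ||x||^2.
lam = 0.1
x_tik = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

print(np.allclose(x_ls, x_pinv), x_tik)
```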
In this chapter we cover clustering and regression, looking at two traditional machine learning methods: k-means and linear regression. We first briefly discuss how to implement these methods in a non-distributed manner, and then carefully analyze their bottlenecks when manipulating big data. This enables us to design global-based solutions based on the DataFrame API of Spark. The key focus is on the principles for designing solutions effectively. Nevertheless, some of the challenges in this chapter are to investigate tools from Spark to speed up the processing even further. k-means is an example of an iterative algorithm and illustrates how to exploit caching in Spark; we analyze its implementation with both the RDD and DataFrame APIs. For linear regression, we first implement the closed form, which involves numerous matrix multiplications and outer products, to simplify the processing in big data. Then, we look at gradient descent. These examples give us the opportunity to expand on the principles of designing a global solution, and also allow us to show how knowing the underlying platform well, Spark in this case, is essential to truly maximize performance.
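The non-distributed baselines look roughly like the following NumPy sketch (simulated data; the Spark RDD/DataFrame versions would distribute these same computations):

```python
# Non-distributed baselines: closed-form and gradient-descent linear regression,
# plus one k-means iteration (assignment then centroid update).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=200)

# Closed form: normal equations built from matrix products.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same squared-error loss.
w, lr = np.zeros(2), 0.01
for _ in range(500):
    w -= lr * (2.0 / len(y)) * X.T @ (X @ w - y)

# One k-means iteration: assign each point to its nearest centroid, then update centroids.
centroids = X[:3].copy()
labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])

print(w_closed, w, centroids.shape)
```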
In this chapter, we introduce some of the more popular ML algorithms. Our objective is to present the basic concepts and main ideas, show how to use these algorithms in Matlab, and offer some examples. In particular, we discuss essential concepts in feature engineering and how to apply them in Matlab. Support vector machines (SVM), K-nearest neighbor (KNN), linear regression, the Naïve Bayes algorithm, and decision trees are introduced, and the fundamental underlying mathematics is explained while using Matlab’s corresponding Apps to implement each of these algorithms. A special section on reinforcement learning is included, detailing the key concepts and basic mechanism of this third ML category. In particular, we showcase how to implement reinforcement learning in Matlab as well as make use of some of the Python libraries available online, and show how to use reinforcement learning for controller design.
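The chapter's implementations use Matlab's Apps; as a rough cross-reference only (not the chapter's code), the same workflow of feature scaling plus classification looks like this in Python/scikit-learn:

```python
# KNN and a linear SVM with simple feature scaling, on a built-in dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)
svm = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X_tr, y_tr)

print("KNN accuracy:", knn.score(X_te, y_te))
print("SVM accuracy:", svm.score(X_te, y_te))
```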
Regression models with log-transformed dependent variables are widely used by social scientists to investigate nonlinear relationships between variables. Unfortunately, this transformation complicates the substantive interpretation of estimation results and often leads to incomplete and sometimes even misleading interpretations. We focus on one valuable but underused method, the presentation of quantities of interest such as expected values or first differences on the original scale of the dependent variable. The procedure to derive these quantities differs in seemingly minor but critical aspects from the well-known procedure based on standard linear models. To improve empirical practice, we explain the underlying problem and develop guidelines that help researchers to derive meaningful interpretations from regression results of models with log-transformed dependent variables.
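To make the retransformation issue concrete, the sketch below compares naive exponentiation of the linear prediction with one common correction (Duan's smearing estimator) on simulated data; whether this matches the authors' exact procedure is not claimed, and the data and scenarios are assumptions.

```python
# Expected values on the original scale after a log-DV regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=1000)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.6, size=1000))

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
x0 = sm.add_constant(np.array([0.5, 1.5]))          # two scenarios of interest

naive = np.exp(fit.predict(x0))                      # biased downward for E[y | x]
smear = np.exp(fit.predict(x0)) * np.mean(np.exp(fit.resid))   # smearing correction

print("naive:", naive, "corrected:", smear)
print("first difference (corrected):", smear[1] - smear[0])
```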
Simple linear regression is extended to multiple linear regression (for multiple predictor variables) and to multivariate linear regression (for multiple response variables). Regression with circular data and/or categorical data is covered. How to select predictors and how to avoid overfitting with techniques such as ridge regression and lasso are followed by quantile regression. The assumption of Gaussian noise or residuals is removed in generalized least squares, with applications to optimal fingerprinting in climate change.
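A minimal sketch of the two shrinkage methods mentioned here, on simulated data with only two truly relevant predictors (data and penalty strengths are illustrative assumptions):

```python
# Ridge (L2) shrinks all coefficients; lasso (L1) can set some exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)   # only 2 real predictors

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefs:", np.round(ridge.coef_, 2))
print("lasso coefs:", np.round(lasso.coef_, 2))   # many exactly zero
```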
When analyzing data, researchers are often less interested in the parameters of statistical models than in functions of these parameters such as predicted values. Here we show that Bayesian simulation with Markov-Chain Monte Carlo tools makes it easy to compute these quantities of interest with their uncertainty. We illustrate how to produce customary and relatively new quantities of interest such as variable importance ranking, posterior predictive data, difficult marginal effects, and model comparison statistics to allow researchers to report more informative results.
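The core idea, computing a quantity of interest over posterior draws, looks like the sketch below. For brevity the draws come from a large-sample normal approximation to the posterior rather than an actual MCMC run (an assumption of this sketch); with genuine MCMC output the last three lines are the same.

```python
# A first difference with uncertainty, summarized over simulated posterior draws.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=300)
y = 1.0 + 0.8 * x + rng.normal(size=300)

fit = sm.OLS(y, sm.add_constant(x)).fit()
draws = rng.multivariate_normal(fit.params, fit.cov_params(), size=5000)

# Quantity of interest: predicted value at x = 1 minus predicted value at x = 0.
qoi = draws @ np.array([1.0, 1.0]) - draws @ np.array([1.0, 0.0])
print("first difference:", qoi.mean(), "95% interval:", np.percentile(qoi, [2.5, 97.5]))
```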
We offer methods to analyze the “differentially private” Facebook URLs Dataset which, at over 40 trillion cell values, is one of the largest social science research datasets ever constructed. The version of differential privacy used in the URLs dataset has specially calibrated random noise added, which provides mathematical guarantees for the privacy of individual research subjects while still making it possible to learn about aggregate patterns of interest to social scientists. Unfortunately, random noise creates measurement error which induces statistical bias—including attenuation, exaggeration, switched signs, or incorrect uncertainty estimates. We adapt methods developed to correct for naturally occurring measurement error, with special attention to computational efficiency for large datasets. The result is statistically valid linear regression estimates and descriptive statistics that can be interpreted as ordinary analyses of nonconfidential data but with appropriately larger standard errors.
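The flavor of such corrections can be seen in the textbook errors-in-variables adjustment for a single regressor with known-variance additive noise, sketched below on simulated data. This is only a generic illustration of why known noise variance permits bias correction; it is not the paper's estimator and makes no claim about the URLs dataset.

```python
# Attenuation correction when a regressor carries additive noise of known variance.
import numpy as np

rng = np.random.default_rng(6)
n, sigma_noise = 100_000, 1.0
x_true = rng.normal(size=n)
y = 2.0 * x_true + rng.normal(size=n)
x_obs = x_true + rng.normal(scale=sigma_noise, size=n)   # calibrated noise added

b_naive = np.cov(x_obs, y)[0, 1] / np.var(x_obs)          # attenuated toward zero
reliability = (np.var(x_obs) - sigma_noise**2) / np.var(x_obs)
b_corrected = b_naive / reliability

print(f"naive: {b_naive:.3f}  corrected: {b_corrected:.3f}  truth: 2.0")
```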
Analysis of various data sets can be accomplished using techniques based on least-squares methods. For example, linear regression of data determines the best-fit line to the data via a least-squares approach. The same is true for polynomial and regression methods using other basis functions. Curve fitting is used to determine the best-fit line or curve to a particular set of data, while interpolation is used to determine a curve that passes through all of the data points. Polynomial and spline interpolation are discussed. State estimation is covered using techniques based on least-squares methods.
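The distinction between fitting and interpolation is easy to see in code (the sample points below are assumptions for illustration):

```python
# Curve fitting (best-fit polynomial, need not pass through the points)
# versus interpolation (cubic spline, passes through every point).
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.8, 5.2, 9.9, 17.3])

coeffs = np.polyfit(x, y, deg=2)       # least-squares quadratic fit
spline = CubicSpline(x, y)             # interpolant through all points

x_new = 2.5
print("polynomial fit at 2.5:", np.polyval(coeffs, x_new))
print("spline interpolation at 2.5:", spline(x_new))
```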
Many researchers use an ordinal scale to quantitatively measure and analyze concepts. Theoretically valid empirical estimates are robust in sign to any monotonic increasing transformation of the ordinal scale. This presents challenges for the point-identification of important parameters of interest. I develop a partial identification method for testing the robustness of empirical estimates to a range of plausible monotonic increasing transformations of the ordinal scale. This method allows for the calculation of plausible bounds around effect estimates. I illustrate this method by revisiting analysis by Nunn and Wantchekon (2011, American Economic Review, 101, 3221–3252) on the slave trade and trust in sub-Saharan Africa. Supplemental illustrations examine results from (i) Aghion et al. (2016, American Economic Review, 106, 3869–3897) on creative destruction and subjective well-being and (ii) Bond and Lang (2013, The Review of Economics and Statistics, 95, 1468–1479) on the fragility of the black–white test score gap.
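A toy version of the robustness exercise, scanning a one-parameter family of monotone increasing (power) transformations of the scale and reporting the range of estimates, is sketched below; the simulated data and this particular family of transformations are assumptions, and the paper's partial identification method covers a more general class.

```python
# Bounds on a group difference in an ordinal outcome across monotone rescalings.
import numpy as np

rng = np.random.default_rng(7)
group = rng.integers(0, 2, size=1000)
ordinal = np.clip(np.round(2.5 + 0.4 * group + rng.normal(size=1000)), 1, 5)

estimates = []
for lam in np.linspace(0.2, 5.0, 50):      # y -> y**lam is monotone increasing on [1, 5]
    y_t = ordinal ** lam
    estimates.append(y_t[group == 1].mean() - y_t[group == 0].mean())

print("bounds on the group difference:", min(estimates), max(estimates))
```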
Research on gender differences in language use previously focused mainly on affluent, especially Western societies. The present chapter extends this research to acrolectal Indian English, a postcolonial variety of English, investigating how the use of intensifiers (e.g. very, really) is affected not only by the speakers’ gender, but also their age, the gender of the other speakers in the conversation and the formality of the context. Results show some parallels with Western varieties of English, in particular a tendency for women to use more intensifiers than men in informal contexts. However, Indian women modify their usage of intensifiers with respect to the formality of the context more than British women and men, while Indian men do so less than British women and men. In mixed-sex conversations, Indian women also converge with Indian men in their intensifier usage, while neither British women nor men do so. The more flexible use of intensifiers by Indian women may be a response to societal expectations regarding their linguistic behaviour, in order to avoid censure by society. British women likewise continue to be affected by such constraints, but much less so, while the linguistic behaviour of Indian and British men is subject to less criticism.
This chapter is not about one particular method (or a family of methods). Instead, it provides a set of tools useful for better pattern recognition, especially for real-world applications. They include the definition of distance metrics, vector norms, a brief introduction to the idea of distance metric learning, and power mean kernels (a family of useful metrics). We also show by example that proper normalization of the data is essential, and introduce a few data normalization and transformation methods.
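A short sketch of why normalization matters for distance computations (the feature values are illustrative assumptions):

```python
# Z-score normalization and two distance measures; scaling changes which points are "close".
import numpy as np

X = np.array([[180.0, 0.02],        # features on very different scales
              [175.0, 0.80],
              [160.0, 0.05]])

Xz = (X - X.mean(axis=0)) / X.std(axis=0)        # z-score normalization

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_distance(a, b):
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("raw Euclidean 0-1:", euclidean(X[0], X[1]))        # dominated by the large-scale feature
print("normalized Euclidean 0-1:", euclidean(Xz[0], Xz[1]))
print("normalized cosine 0-1:", cosine_distance(Xz[0], Xz[1]))
```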
The focus of this study is to monitor the effect of the lockdown due to the coronavirus disease (COVID-19) pandemic on various air pollutants and to identify the ones that affect COVID-19 fatalities, so that measures to control the pollution can be enforced.
Methods:
Various machine learning techniques, namely Decision Trees, Linear Regression, and Random Forest, have been applied to correlate air pollutants and COVID-19 fatalities in Delhi. Furthermore, a comparison of the concentrations of various air pollutants and the air quality index during the lockdown period with those of the last two years, 2018 and 2019, is presented.
Results:
From the experimental work, it has been observed that the pollutants ozone and toluene have increased during the lockdown period. It has also been deduced that the pollutants that may impact the mortalities due to COVID-19 are ozone, NH3, NO2, and PM10.
Conclusions:
The novel coronavirus has led to environmental restoration due to lockdown. However, there is a need to impose measures to control ozone pollution, as there has been a significant increase in its concentration and it also impacts the COVID-19 mortality rate.
There is evidence indicating that using the current UK energy feeding system to ration the present sheep flocks may underestimate their nutrient requirements. The objective of the present study was to address this issue by developing updated maintenance energy requirements for the current sheep flocks and evaluating if these requirements were influenced by a range of dietary and animal factors. Data (n = 131) used were collated from five experiments with sheep (5 to 18 months old and 29.0 to 69.8 kg BW) undertaken at the Agri-Food and Biosciences Institute of the UK from 2013 to 2017. The trials were designed to evaluate the effects of dietary type, genotype, physiological stage and sex on nutrient utilization and energetic efficiencies. Energy intake and output data were measured in individual calorimeter chambers. Energy balance (Eg) was calculated as the difference between gross energy intake and a sum of fecal energy, urine energy, methane energy and heat production. Data were analysed using the restricted maximum likelihood analysis to develop the linear relationship between Eg or heat production and metabolizable energy (ME) intake, with the effects of a range of dietary and animal factors removed. The net energy (NEm) and ME (MEm) requirements for maintenance derived from the linear relationship between Eg and ME intake were 0.358 and 0.486 MJ/kg BW0.75, respectively, which are 40% to 53% higher than those recommended in energy feeding systems currently used to ration sheep in the USA and the UK. Further analysis of the current dataset revealed that concentrate supplement, sire type or physiological stage had no significant effect on the derived NEm values. However, female lambs had a significantly higher NEm (0.352 v. 0.306 or 0.288 MJ/kg BW0.75) or MEm (0.507 v. 0.441 or 0.415 MJ/kg BW0.75) than those for male or castrated lambs. The present results indicate that using present energy feeding systems in the UK developed over 40 years ago to ration the current sheep flocks could underestimate maintenance energy requirements. There is an urgent need to update these systems to reflect the higher metabolic rates of the current sheep flocks.
Ecological inference (EI) is the process of learning about individual behavior from aggregate data. We relax assumptions by allowing for “linear contextual effects,” which previous works have regarded as plausible but avoided due to nonidentification, a problem we sidestep by deriving bounds instead of point estimates. In this way, we offer a conceptual framework to improve on the Duncan–Davis bound, derived more than 65 years ago. To study the effectiveness of our approach, we collect and analyze 8,430 $2\times 2$ EI datasets with known ground truth from several sources—thus bringing considerably more data to bear on the problem than the existing dozen or so datasets available in the literature for evaluating EI estimators. For the 88% of real data sets in our collection that fit a proposed rule, our approach reduces the width of the Duncan–Davis bound, on average, by about 44%, while still capturing the true district-level parameter about 99% of the time. The remaining 12% revert to the Duncan–Davis bound.
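For reference, the classic Duncan–Davis calculation for a $2\times 2$ EI table is a few lines of code; the precinct numbers below are illustrative, and the paper's tighter bounds based on linear contextual effects are not reproduced here.

```python
# Duncan-Davis (method-of-bounds) for a 2x2 ecological inference table:
# given each precinct's group share x and overall outcome rate t, the
# group-specific rate beta is bounded without any model.
import numpy as np

x = np.array([0.30, 0.55, 0.80])   # share of the precinct in the group of interest
t = np.array([0.45, 0.50, 0.62])   # overall outcome rate in each precinct

lower = np.maximum(0.0, (t - (1 - x)) / x)
upper = np.minimum(1.0, t / x)

for xi, lo, hi in zip(x, lower, upper):
    print(f"x = {xi:.2f}: beta in [{lo:.2f}, {hi:.2f}]")
```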
Tognini-Bonelli (2001) made the following distinction between corpus-based and corpus-driven studies: while corpus-based studies start with pre-existing theories which are tested using corpus data, in corpus-driven studies the hypothesis is derived by examination of the corpus evidence. This chapter will give an overview of the two different families of statistical tests which are suited to these two approaches. For corpus-based approaches, we use more traditional statistics, such as the t-test or ANOVA, which return a value called a p-value to tell us to what extent we should accept or reject the initial hypothesis. Multi-level modelling (also known as mixed modelling) is a newer technique which shows considerable promise for corpus-based studies, and will also be described here and used to analyse the ENNTT subset of the Europarl corpus. Multi-level modelling is useful for the examination of hierarchically structured or “nested” data, where for example translations may be “nested” together in a class if they have the same language of origin. A multi-level model takes account both of the variation between individual translations and of the variation between classes. For example, we might expect the scores (such as vocabulary richness or readability scores) of two translations in the same class to be more similar to each other than those of two translations in different classes.
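A minimal mixed-model sketch in Python's statsmodels (the chapter itself is not tied to this tool): translation-level scores nested within source-language classes, with a random intercept per class. The data, column names, and class labels below are assumptions for illustration.

```python
# Random-intercept model: score variation within and between language-of-origin classes.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
classes = np.repeat(["de", "fr", "es", "pl"], 50)               # language-of-origin classes
class_effect = {"de": 0.3, "fr": -0.1, "es": 0.2, "pl": -0.4}
df = pd.DataFrame({
    "origin_class": classes,
    "richness": [class_effect[c] + rng.normal(scale=0.5) for c in classes],
})

model = smf.mixedlm("richness ~ 1", df, groups=df["origin_class"]).fit()
print(model.summary())   # fixed intercept plus between-class variance
```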
Review of correlation and simple linear regression. Introduction to lagged (cross-)correlation for identifying recurrent and periodic features in common between pairs of time-series, and as statistical evidence of possible causal relationships. Introduction to (lagged) autocorrelation for identifying recurrent and periodic features in time-series. Use of correlation and simple linear regression for statistical comparison of time-series to reference datasets, with a focus on periodic (sinusoidal) reference datasets. Interpretation of statistical effect-size and significance (p-value).
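A small sketch of lagged cross-correlation and autocorrelation on a synthetic sinusoidal pair of series (with a known 5-step lag and 25-step period, both assumptions for illustration):

```python
# Lagged cross-correlation between two series and autocorrelation of one series,
# computed as Pearson correlations at each shift.
import numpy as np

rng = np.random.default_rng(9)
t = np.arange(200)
a = np.sin(2 * np.pi * t / 25) + 0.3 * rng.normal(size=200)
b = np.roll(a, 5)                            # b lags a by 5 steps

def lagged_corr(x, y, lag):
    if lag > 0:
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]
    return np.corrcoef(x, y)[0, 1]

cross = [lagged_corr(a, b, k) for k in range(0, 20)]
auto = [lagged_corr(a, a, k) for k in range(0, 30)]

print("best cross-correlation lag:", int(np.argmax(cross)))            # expect about 5
print("autocorrelation peak near the period:", int(np.argmax(auto[10:])) + 10)  # expect about 25
```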
We study the problem of choosing the best subset of $p$ features in linear regression, given $n$ observations. This problem naturally involves two objective functions: minimizing the amount of bias and minimizing the number of predictors. The existing approaches transform the problem into a single-objective optimization problem. We explain the main weaknesses of existing approaches and, to overcome their drawbacks, propose a bi-objective mixed integer linear programming approach. A computational study shows the efficacy of the proposed approach.
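To make the two objectives concrete, the sketch below enumerates, for a tiny simulated problem, the best achievable residual sum of squares at each subset size, tracing out the trade-off frontier by brute force; this enumeration is only illustrative, and the paper's mixed integer linear programming approach is what makes larger $p$ tractable.

```python
# Brute-force bias-versus-sparsity frontier for best-subset selection with tiny p.
import itertools
import numpy as np

rng = np.random.default_rng(10)
n, p = 60, 6
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def best_rss(k):
    """Smallest residual sum of squares over all subsets of size k."""
    rss = []
    for cols in itertools.combinations(range(p), k):
        Xk = X[:, list(cols)]
        beta, res, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        rss.append(res[0] if res.size else np.sum((y - Xk @ beta) ** 2))
    return min(rss)

for k in range(1, p + 1):
    print(f"subset size {k}: best RSS = {best_rss(k):.2f}")   # the trade-off frontier
```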