The way networks grow and change over time is called network evolution. Numerous off-the-shelf algorithms have been developed to study network evolution. These can give us insight into the way systems grow and change over time. However, what off-the-shelf algorithms often lack is knowledge of the behavioral details surrounding a specific problem. Here we will develop a simple case that we will revisit over the next few chapters: How do children learn words from exposure to a sea of language? One possibility is that the words children learn first influence the words they learn next. Another possibility is that the structure of language itself facilitates the learning of some words over others. Indeed, we know that adults speak differently to children in ways that facilitate language learning, with semantically informative words tending to appear more often around words that children learn earliest. This invites the question: To what extent does the semantic structure of language predict word learning? This chapter will provide a general framework for building models and pitting them against one another, with a specific application to the network evolution of child vocabularies.
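As a toy illustration of this kind of model comparison (not the chapter's actual models or data; the network, the acquisition order, and the degree-based rule below are all made up), one can score an observed order of word learning under two candidate growth rules and compare their log-likelihoods:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

# Toy semantic network and a hypothetical order of word acquisition.
G = nx.erdos_renyi_graph(30, 0.15, seed=1)
observed_order = list(rng.permutation(list(G.nodes())))

def log_likelihood(order, weight_fn):
    """Log-likelihood of an acquisition order: at each step the next word is
    drawn from the unknown words with probability proportional to weight_fn."""
    unknown = set(order)
    ll = 0.0
    for word in order:
        weights = {w: weight_fn(w) for w in unknown}
        ll += np.log(weights[word] / sum(weights.values()))
        unknown.remove(word)
    return ll

# Candidate model 1: every unknown word is equally likely to be learned next.
ll_random = log_likelihood(observed_order, lambda w: 1.0)
# Candidate model 2: well-connected words are learned first (degree + 1 avoids zero weights).
ll_degree = log_likelihood(observed_order, lambda w: G.degree(w) + 1.0)

print(f"log-likelihood, random model: {ll_random:.1f}")
print(f"log-likelihood, degree model: {ll_degree:.1f}")
```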
Network science is a broadly interdisciplinary field, pulling from computer science, mathematics, statistics, and more. The data scientist working with networks thus needs a broad base of knowledge, as network data calls for—and is analyzed with—many computational and mathematical tools. One needs good working knowledge of programming, including data structures and algorithms, to analyze networks effectively. In addition to graph theory, probability theory is the foundation for any statistical modeling and data analysis. Linear algebra provides another foundation for network analysis and modeling because matrices are often the most natural way to represent graphs. Although this book assumes that readers are familiar with the basics of these topics, here we review the computational and mathematical concepts and notation that will be used throughout the book. You can use this chapter as a starting point for catching up on the basics, or as a reference while delving into the book.
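For example, here is a minimal sketch (illustrative graph, not from the book) of how a matrix representation turns graph questions into linear algebra:

```python
import numpy as np
import networkx as nx

# A small undirected graph given as an edge list.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
G = nx.Graph(edges)

# Adjacency matrix representation: A[i, j] = 1 if nodes i and j are connected.
A = nx.to_numpy_array(G, nodelist=sorted(G.nodes()))
print(A)

# Linear algebra on A answers graph questions directly, e.g. the (i, j) entry
# of A^2 counts walks of length 2 between nodes i and j.
print(np.linalg.matrix_power(A, 2))

# Node degrees are simply the row sums of A.
print(A.sum(axis=1))
```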
Separation commonly occurs in political science, usually when a binary explanatory variable perfectly predicts a binary outcome. In these situations, methodologists often recommend penalized maximum likelihood or Bayesian estimation. But researchers might struggle to identify an appropriate penalty or prior distribution. Fortunately, I show that researchers can easily test hypotheses about the model coefficients with standard frequentist tools. While the popular Wald test produces misleading (even nonsensical) p-values under separation, I show that likelihood ratio tests and score tests behave in the usual manner. Therefore, researchers can produce meaningful p-values with standard frequentist tools under separation without the use of penalties or prior information.
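As a concrete, hedged sketch of the contrast (hypothetical data, a hand-rolled likelihood rather than any particular package, and BFGS's approximate inverse Hessian standing in for the usual observed-information covariance), the following fits a logistic regression with a separating binary predictor and compares the Wald p-value with the likelihood ratio p-value:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2, norm

# Hypothetical data: x2 = 1 always implies y = 1 (quasi-complete separation).
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = np.r_[np.zeros(30), np.ones(30)]
y = np.r_[rng.integers(0, 2, 30), np.ones(30)]
X_full = np.column_stack([np.ones(n), x1, x2])
X_restricted = X_full[:, :2]          # model without the separating variable

def negloglik(beta, X, y):
    # Negative log-likelihood of a Bernoulli GLM with logit link.
    eta = X @ beta
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

def fit(X, y):
    return minimize(negloglik, np.zeros(X.shape[1]), args=(X, y), method="BFGS")

full = fit(X_full, y)
restricted = fit(X_restricted, y)

# Wald test for the separated coefficient: the estimate and its standard error
# both blow up, so the z statistic is tiny and the p-value is uselessly large.
beta_hat = full.x[2]
se = np.sqrt(full.hess_inv[2, 2])     # rough SE from BFGS's inverse Hessian
p_wald = 2 * norm.sf(abs(beta_hat / se))

# Likelihood ratio test: the difference in maximized log-likelihoods is finite
# and well behaved even though beta_hat itself diverges.
lr_stat = 2 * (negloglik(restricted.x, X_restricted, y) - negloglik(full.x, X_full, y))
p_lr = chi2.sf(lr_stat, df=1)

print(f"Wald p-value: {p_wald:.3f}   LR p-value: {p_lr:.4f}")
```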
There is a daunting array of statistical “methods” out there – regression, ANOVA, loglinear models, GLMMs, ANCOVA, etc. They are often treated as different data analysis approaches. We take a more holistic view. Most methods biologists use are variations on a central theme of generalized linear models – relating a biological response to a linear combination of predictor variables. We show how several common “named” methods are related, based on classifying biological response and predictor variables as continuous or categorical. We use simple regression, single-factor ANOVA, logistic regression, and two-dimensional contingency tables to show how these methods all represent generalized linear models with a single predictor. We describe how we fit these models and outline their assumptions.
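The unification can be made concrete in code; the sketch below (simulated data, illustrative variable names) fits all four examples through the same GLM interface, assuming Python's statsmodels is available:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "y_cont": rng.normal(size=120),
    "x_cont": rng.normal(size=120),
    "group": rng.choice(["a", "b", "c"], size=120),
    "y_bin": rng.integers(0, 2, size=120),
})

# Simple linear regression: continuous response, continuous predictor.
m_reg = smf.glm("y_cont ~ x_cont", data=df, family=sm.families.Gaussian()).fit()
# Single-factor ANOVA: continuous response, categorical predictor.
m_anova = smf.glm("y_cont ~ C(group)", data=df, family=sm.families.Gaussian()).fit()
# Logistic regression: binary response, continuous predictor.
m_logit = smf.glm("y_bin ~ x_cont", data=df, family=sm.families.Binomial()).fit()

# A two-way contingency table analysed as a Poisson log-linear model of counts.
counts = df.groupby(["group", "y_bin"]).size().reset_index(name="n")
m_loglin = smf.glm("n ~ C(group) + C(y_bin)", data=counts,
                   family=sm.families.Poisson()).fit()

for name, m in [("regression", m_reg), ("ANOVA", m_anova),
                ("logistic", m_logit), ("log-linear", m_loglin)]:
    print(name, "deviance:", round(m.deviance, 2))
```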
In this chapter, we introduce the design of statistical anomaly detectors. We discuss the types of data – continuous, discrete categorical, and discrete ordinal features – encountered in practice. We then discuss how to model such data, in particular how to form a null model for statistical anomaly detection, with emphasis on mixture densities. The EM algorithm is developed for estimating the parameters of a mixture density, with K-means a specialization of EM for Gaussian mixtures. The Bayesian information criterion (BIC), widely used for estimating the number of components in a mixture density, is discussed and developed. We also discuss parsimonious mixtures, which economize on the number of model parameters in a mixture density (by sharing parameters across components). These models allow BIC to obtain accurate model-order estimates even when the feature dimensionality is huge and the number of data samples is small (a case where BIC applied to traditional mixtures grossly underestimates the model order). Key performance measures are discussed, including the true positive rate, the false positive rate, and the receiver operating characteristic (ROC) curve with its associated area under the curve (ROC AUC). The density models are used in the attack detection defenses of Chapters 4 and 13. The detection performance measures are used throughout the book.
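A minimal sketch of this workflow, using synthetic data and scikit-learn rather than the chapter's own development: fit Gaussian mixtures to "normal" data, pick the number of components by BIC, score test points by their log-density under the null model, and summarize detection performance with ROC AUC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# "Normal" training data from two clusters; a test set with injected anomalies.
normal = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(5, 1, (300, 2))])
test_normal = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
anomalies = rng.uniform(-6, 11, (40, 2))
X_test = np.vstack([test_normal, anomalies])
labels = np.r_[np.zeros(len(test_normal)), np.ones(len(anomalies))]

# Choose the number of mixture components for the null model by BIC.
fits = [GaussianMixture(n_components=k, random_state=0).fit(normal)
        for k in range(1, 6)]
best = min(fits, key=lambda m: m.bic(normal))
print("components chosen by BIC:", best.n_components)

# Anomaly score: low log-density under the null model means more anomalous.
scores = -best.score_samples(X_test)
print("ROC AUC:", round(roc_auc_score(labels, scores), 3))
```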
Approximate Bayesian analysis is presented as the solution for complex computational models for which no explicit maximum likelihood estimation is possible. The activation-suppression race model (ASR), which does have a likelihood amenable to Markov chain Monte Carlo methods, is used to demonstrate the accuracy with which parameters can be estimated with the approximate Bayesian methods.
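A minimal sketch of the approximate Bayesian idea (rejection ABC on a toy Gaussian model with made-up numbers, not the ASR model itself): draw parameters from the prior, forward-simulate data, and keep the draws whose simulated summary statistics fall close to the observed ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Observed" data from a toy model with unknown location parameter mu.
true_mu = 0.45
observed = rng.normal(true_mu, 0.1, size=200)
obs_summary = np.array([observed.mean(), observed.std()])

def simulate(mu, n=200):
    """Forward-simulate the toy model; only simulation is needed, no likelihood."""
    sample = rng.normal(mu, 0.1, size=n)
    return np.array([sample.mean(), sample.std()])

# ABC rejection sampling: keep prior draws whose simulated summaries land
# within a tolerance of the observed summaries.
n_draws, tolerance = 20000, 0.03
prior_draws = rng.uniform(0.0, 1.0, size=n_draws)
accepted = np.array([mu for mu in prior_draws
                     if np.linalg.norm(simulate(mu) - obs_summary) < tolerance])

print(f"accepted {len(accepted)} draws")
print(f"approximate posterior mean of mu: {accepted.mean():.3f}")
```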
A good model aims to learn the underlying signal without overfitting (i.e., fitting the noise in the data). This chapter has four main parts: the first covers objective functions and errors; the second covers regularization techniques (weight penalty/decay, early stopping, ensembles, dropout, etc.) used to prevent overfitting; the third covers the Bayesian approach to model selection and model averaging; and the fourth covers recent developments in interpretable machine learning.
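As a small illustration of the weight-penalty idea (simulated data, with scikit-learn ridge regression standing in for the chapter's neural-network setting), compare an unregularized high-degree polynomial fit with an L2-penalized one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Noisy samples of a smooth signal; a high-degree polynomial will chase the noise.
x = np.sort(rng.uniform(0, 1, 30))[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 30)
x_test = np.linspace(0, 1, 200)[:, None]
y_test = np.sin(2 * np.pi * x_test).ravel()

unregularized = make_pipeline(PolynomialFeatures(12), LinearRegression()).fit(x, y)
# Weight penalty (L2 / ridge): alpha controls how strongly large weights are punished.
regularized = make_pipeline(PolynomialFeatures(12), Ridge(alpha=1e-3)).fit(x, y)

for name, model in [("no penalty", unregularized), ("L2 penalty", regularized)]:
    train_err = mean_squared_error(y, model.predict(x))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"{name}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```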
As probability distributions form the cornerstone of statistics, a survey is made of the common families of distributions, including the binomial distribution, Poisson distribution, multinomial distribution, Gaussian distribution, gamma distribution, beta distribution, von Mises distribution, extreme value distributions, t-distribution and chi-squared distribution. Other topics include maximum likelihood estimation, Gaussian mixtures and kernel density estimation.
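For instance (a toy sketch with simulated data, assuming SciPy is available), maximum likelihood estimation of a gamma distribution and kernel density estimation can be carried out as follows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=1.5, size=500)

# Maximum likelihood estimation of the gamma parameters (location fixed at 0).
shape_hat, loc_hat, scale_hat = stats.gamma.fit(sample, floc=0)
print(f"MLE: shape = {shape_hat:.2f}, scale = {scale_hat:.2f}")

# Kernel density estimation: a nonparametric alternative to fitting a named family.
kde = stats.gaussian_kde(sample)
grid = np.linspace(0, sample.max(), 5)
print("KDE density on a coarse grid:", np.round(kde(grid), 3))
```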
We contribute to the literature on empirical macroeconomic models with time-varying conditional moments by introducing a heteroskedastic score-driven model with Student’s t-distributed innovations, named the heteroskedastic score-driven $t$-QVAR (quasi-vector autoregressive) model. The $t$-QVAR model is a robust nonlinear extension of the VARMA (VAR moving average) model. As an illustration, we apply the heteroskedastic $t$-QVAR model to a dynamic stochastic general equilibrium model, for which we estimate Gaussian-ABCD and $t$-ABCD representations. We use data on economic output, inflation, the interest rate, government spending, aggregate productivity, and consumption for the USA for the period 1954 Q3 to 2022 Q1. Owing to the robustness of the heteroskedastic $t$-QVAR model, even when the sample includes the coronavirus disease 2019 (COVID-19) pandemic and the start of the Russian invasion of Ukraine, we find superior statistical performance, lower policy-relevant dynamic effects, and higher estimation precision of the impulse response functions for US gross domestic product growth and the US inflation rate for the heteroskedastic score-driven $t$-ABCD representation than for the homoskedastic Gaussian-ABCD representation.
One major challenge to behavioral decision research is to identify the cognitive processes underlying judgment and decision making. Glöckner (2009) has argued that, compared to previous methods, process models can be tested more efficiently by simultaneously analyzing choices, decision times, and confidence judgments. The Multiple-Measure Maximum Likelihood (MM-ML) strategy classification method was developed for this purpose and implemented as a ready-to-use routine in STATA, a commercial package for statistical data analysis. In the present article, we describe the implementation of MM-ML in R, a free software environment for statistical computing released under the GNU General Public License, and we provide a practical guide to its application. We also provide MM-ML as an easy-to-use R function. Thus, prior knowledge of R programming is not necessary for those interested in using MM-ML.
The Halphen type B (Hal-B) frequency distribution has been employed for frequency analyses of hydrometeorological and hydrological extremes. This chapter derives this distribution using entropy theory and discusses the estimation of its parameters using the constraints employed in the derivation. The distribution is tested using entropy and the methods of moments and maximum likelihood estimation.
Several generalized frequency distributions have been employed in environmental and water engineering over the years. These distributions are quite versatile and can apply to frequency analysis of a wide variety of random variables, such as flood peaks, volume, duration, and inter-arrival time; extreme rainfall amount, duration, spatial coverage, and inter-arrival time; drought duration, severity, spatial extent, and inter-arrival time; wind speed, duration, direction, and spatial coverage; water quality parameters; and sediment concentration, discharge, and yield. However, because of their relatively complex form, these distributions have not become as popular as the simpler distributions. These distributions have at least three but usually more parameters, which have been estimated using the methods of moments, maximum likelihood, probability weighted moments, and L-moments. In some cases, entropy theory has been used to estimate parameters. This chapter provides a snapshot of the generalized distributions that will be discussed in this book. Moreover, a short discussion of the methods of parameter estimation, goodness-of-fit statistics, and confidence intervals is provided.
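As a small, hedged illustration of the parameter-estimation and goodness-of-fit steps (synthetic data, with the generalized extreme value family standing in for the book's distributions), SciPy can fit parameters by maximum likelihood and run a simple goodness-of-fit check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic annual flood peaks drawn from a generalized extreme value (GEV) law.
peaks = stats.genextreme.rvs(c=-0.1, loc=100.0, scale=25.0, size=80, random_state=rng)

# Maximum likelihood estimation of the three GEV parameters.
c_hat, loc_hat, scale_hat = stats.genextreme.fit(peaks)
print(f"shape = {c_hat:.3f}, location = {loc_hat:.1f}, scale = {scale_hat:.1f}")

# A simple goodness-of-fit check: Kolmogorov-Smirnov test against the fitted law.
ks = stats.kstest(peaks, "genextreme", args=(c_hat, loc_hat, scale_hat))
print(f"KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.3f}")
```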
This paper proposes a procedure to improve the accuracy of a light-aircraft 6 DOF simulation model by implementing model tuning and aerodynamic database correction using flight test data. In this study, full-scale flight testing of a 2-seater aircraft was performed in specific longitudinal manoeuvres for model enhancement and simulation validation purposes. The baseline simulation model database is constructed using multi-fidelity analysis methods such as wind tunnel (W/T) tests, computational fluid dynamics (CFD) and empirical calculation. The enhancement process starts with identifying the longitudinal equations of motion for sensitivity analysis, where the effect of crucial parameters is analysed and then adjusted using the model tuning technique. Next, the classical Maximum Likelihood (ML) estimation method is applied to calculate aerodynamic derivatives from flight test data, and these parameters are used to correct the initial aerodynamic table. A simulation validation process is introduced to evaluate the accuracy of the enhanced 6 DOF simulation model. The presented results demonstrate that the applied enhancement procedure has improved the simulation accuracy in longitudinal motion. The discrepancy between the simulation and flight test responses is significantly reduced and satisfies the regulatory tolerance.
The problem of tracking the system frequency is ubiquitous in power systems. However, despite numerous empirical comparative studies of various algorithms, the underlying links and commonalities between frequency tracking methods are often overlooked. To this end, we show that the treatments of the two best known frequency estimation methodologies, (i) tracking the rate of change of the voltage phasor angles and (ii) fixed frequency demodulation, can be unified, whereby the former can be interpreted as a special case of the latter. Furthermore, we show that the frequency estimator derived from the difference in the phase angle is the maximum likelihood frequency estimator of a nonstationary sinusoid. Drawing upon the data analytics interpretation of the Clarke and related transforms in power system analysis as practical Principal Component Analyzers (PCA), we then set out to explore commonalities between classic frequency estimation techniques and widely linear modeling. The so-obtained additional degrees of freedom allow us to arrive at the adaptive Smart Clarke and Smart Park transforms (SCT and SPT), which are shown to operate in an unbiased and statistically consistent way for both standard and dynamically unbalanced smart grids. Overall, this work suggests avenues for next-generation solutions for the analysis of modern grids that are not accessible from the Circuit Theory perspective.
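A hedged numerical sketch of the phase-angle-difference estimator (synthetic balanced three-phase signal, illustrative sampling rate and frequency): form the Clarke-transform phasor and average the sample-to-sample phase increments.

```python
import numpy as np

fs = 5000.0                      # sampling rate in Hz (illustrative)
f_true = 50.2                    # off-nominal system frequency to be tracked
t = np.arange(0, 0.2, 1 / fs)

# Balanced three-phase voltages at a slightly off-nominal frequency.
va = np.cos(2 * np.pi * f_true * t)
vb = np.cos(2 * np.pi * f_true * t - 2 * np.pi / 3)
vc = np.cos(2 * np.pi * f_true * t + 2 * np.pi / 3)

# Clarke (alpha-beta) transform gives the complex voltage phasor v = v_alpha + j*v_beta.
v_alpha = (2 * va - vb - vc) / 3
v_beta = (vb - vc) / np.sqrt(3)
v = v_alpha + 1j * v_beta

# Frequency from the rate of change of the phasor angle: the phase increment
# between consecutive samples equals 2*pi*f/fs.
dphi = np.angle(v[1:] * np.conj(v[:-1]))
f_est = fs * np.mean(dphi) / (2 * np.pi)
print(f"estimated frequency: {f_est:.3f} Hz")
```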
Multinomial logit (MNL) differs from many other econometric methods because it estimates the effects of variables upon nominal, not ordered, outcomes. One consequence of this is that the estimated coefficients vary depending upon the researcher’s choice of a reference, or “baseline,” outcome. Most researchers realize this in principle, but many focus upon the statistical significance of MNL coefficients for inference in the same way that they use coefficients from models with ordered dependent variables. In some instances, this leads researchers to report statistics that do not reflect the correct quantities of interest and to reach flawed conclusions. In this note, I argue that researchers should instead orient their analyses toward the substantive and statistical significance of the predicted probabilities that match their research questions.
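The point can be illustrated with a short sketch (simulated data, statsmodels' MNLogit assumed available): recoding the baseline category changes the coefficients, but the predicted probabilities, the quantities of interest, do not.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# Hypothetical three-category nominal outcome generated from a latent MNL.
utilities = np.column_stack([np.zeros(n), 0.8 * x, -0.5 * x])
probs = np.exp(utilities) / np.exp(utilities).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=p) for p in probs])
X = sm.add_constant(x)

# Fit with category 0 as the baseline (statsmodels uses the lowest code).
fit_a = sm.MNLogit(y, X).fit(disp=False)
# Recode so a different category becomes the baseline: the coefficients change...
y_recoded = np.where(y == 0, 2, np.where(y == 2, 0, y))
fit_b = sm.MNLogit(y_recoded, X).fit(disp=False)
print(fit_a.params, fit_b.params, sep="\n")

# ...but the predicted probabilities are the same, up to the relabelling of
# outcome columns implied by the recoding.
x_grid = sm.add_constant(np.linspace(-2, 2, 5))
print(np.round(fit_a.predict(x_grid), 3))
print(np.round(fit_b.predict(x_grid), 3))
```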
Quantitative comparative social scientists have long worried about the performance of multilevel models when the number of upper-level units is small. Adding to these concerns, an influential Monte Carlo study by Stegmueller (2013) suggests that standard maximum likelihood (ML) methods yield biased point estimates and severely anti-conservative inference with few upper-level units. In this article, the authors seek to rectify this negative assessment. First, they show that ML estimators of coefficients are unbiased in linear multilevel models; the apparent bias in coefficient estimates found by Stegmueller can be attributed to Monte Carlo error and a flaw in the design of his simulation study. Second, they demonstrate how inferential problems can be overcome by using restricted ML estimators for variance parameters and a t-distribution with appropriate degrees of freedom for statistical inference. Thus, accurate multilevel analysis is possible within the framework most practitioners are familiar with, even if there are only a few upper-level units.
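A minimal sketch of the recommended practice (simulated data; the degrees-of-freedom rule used here, m - l - 1, is illustrative of the article's approach rather than a quotation of it): estimate by REML and refer the test statistic for the group-level coefficient to a t distribution rather than the default normal reference.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)

# Simulated two-level data: only 10 upper-level units and one group-level predictor z.
n_groups, n_per = 10, 30
g = np.repeat(np.arange(n_groups), n_per)
z_group = rng.normal(size=n_groups)
u = rng.normal(0, 0.7, n_groups)                     # random intercepts
y = 0.5 * z_group[g] + u[g] + rng.normal(size=n_groups * n_per)
df = pd.DataFrame({"y": y, "z": z_group[g], "g": g})

# Restricted maximum likelihood (REML) estimation of the random-intercept model.
fit = smf.mixedlm("y ~ z", data=df, groups="g").fit(reml=True)
print(fit.summary())

# Inference for the group-level coefficient using a t distribution whose
# degrees of freedom reflect the small number of groups (m - l - 1, with l = 1
# group-level predictor here), instead of the default normal (z) reference.
t_stat = fit.params["z"] / fit.bse["z"]
dof = n_groups - 1 - 1
p_t = 2 * stats.t.sf(abs(t_stat), dof)
print(f"t = {t_stat:.2f}, df = {dof}, p = {p_t:.4f}")
```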
The objective of this study was to identify potential recruitment sources of Prochilodus lineatus from freshwater areas (Paraná and Uruguay rivers) to the estuarine population of the Río de la Plata Estuary (La Plata Basin, South America), considering young (age-1) and adult (age-7) fish. LA-ICP-MS chemical analysis of the otolith core (nine element:Ca ratios) of an unknown mixed sample from the Río de la Plata Estuary (2011 and 2017) was compared with a young-of-year baseline data set (same cohort) and classified into freshwater nurseries (Paraná or Uruguay river) using maximum likelihood (MLE) and maximum classification likelihood (MCL) mixture models and quadratic discriminant analysis (QDA). Across the three models used, the Uruguay River was the most important contributor for both the young and the adult populations. The young population (2011) was highly mixed, with contributions ranging from 31.7 to 68.3%, whereas the degree of mixing decreased in 2017 (adult fish), with contributions of 97.1 to 100%. The three methods produced comparable estimates; however, QDA closely matched the MCL model, suggesting greater sensitivity for evaluating small contributions than the MLE method. Our results show the potential of maximum likelihood mixture models and QDA for determining the relative importance of recruitment sources of fish in estuarine waters of the La Plata Basin.
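As a schematic illustration of the QDA step (entirely made-up element:Ca values, two ratios instead of nine), a baseline of known-origin fish can be used to classify a mixed sample and estimate source contributions:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical stand-in for otolith-core chemistry: two element:Ca ratios for a
# known-origin baseline (two freshwater nurseries) and a mixed sample of unknown origin.
parana = rng.multivariate_normal([1.0, 3.0], [[0.10, 0.02], [0.02, 0.08]], 60)
uruguay = rng.multivariate_normal([1.6, 2.2], [[0.08, -0.01], [-0.01, 0.12]], 60)
X_baseline = np.vstack([parana, uruguay])
y_baseline = np.r_[np.zeros(60), np.ones(60)]          # 0 = Parana, 1 = Uruguay

mixed = np.vstack([
    rng.multivariate_normal([1.0, 3.0], [[0.10, 0.02], [0.02, 0.08]], 30),
    rng.multivariate_normal([1.6, 2.2], [[0.08, -0.01], [-0.01, 0.12]], 70),
])

# Quadratic discriminant analysis trained on the baseline, applied to the mixed sample.
qda = QuadraticDiscriminantAnalysis().fit(X_baseline, y_baseline)
assigned = qda.predict(mixed)
print("estimated Uruguay contribution:", round(assigned.mean() * 100, 1), "%")
```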