This chapter focuses on the foundations of study design and statistical analysis in psychological research. It explores strategies for ensuring internal validity, such as randomization, control groups, and large sample sizes. Additionally, it addresses the complexity of human behavior by exploring multivariate experiments and the use of artificial intelligence and machine learning in neuroscience. The chapter also discusses the replication crisis and the emergence of open science practices, encouraging students to think critically about isolated scientific findings and offering tools for identifying credible research. Lastly, it critiques null hypothesis significance testing and p-values while providing an overview of key statistical topics like correlation coefficients, standardized mean differences, and regression.
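As a concrete illustration of the statistical quantities the chapter surveys, the following is a minimal Python sketch computing a correlation coefficient, a standardized mean difference (Cohen's d), and a p-value; the data are simulated and the chapter itself does not prescribe any particular software.

```python
# Illustrative computation of statistics discussed in the chapter:
# Pearson correlation, Cohen's d, and a null-hypothesis significance test.
# All data below are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=50, scale=10, size=100)    # simulated control group
treatment = rng.normal(loc=55, scale=10, size=100)  # simulated treatment group

# Correlation between two simulated continuous measures
x = rng.normal(size=100)
y = 0.4 * x + rng.normal(scale=0.9, size=100)
r, r_p = stats.pearsonr(x, y)

# Standardized mean difference (Cohen's d with pooled standard deviation)
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

# Independent-samples t-test and its p-value
t_stat, p_value = stats.ttest_ind(treatment, control)

print(f"r = {r:.2f}, d = {cohens_d:.2f}, p = {p_value:.4f}")
```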
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool Python, a new chapter on using Python for statistical analysis, and a new chapter that demonstrates how to use Python within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool R, a new chapter on using R for statistical analysis, and a new chapter that demonstrates how to use R within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Emphasizing how and why machine learning algorithms work, this introductory textbook bridges the gap between the theoretical foundations of machine learning and its practical algorithmic and code-level implementation. Over 85 thorough worked examples, in both Matlab and Python, demonstrate how algorithms are implemented and applied whilst illustrating the end result. Over 75 end-of-chapter problems empower students to develop their own code to implement these algorithms, equipping them with hands-on experience. Matlab coding examples demonstrate how a mathematical idea is converted from equations to code, and provide a jumping off point for students, supported by in-depth coverage of essential mathematics including multivariable calculus, linear algebra, probability and statistics, numerical methods, and optimization. Accompanied online by instructor lecture slides, downloadable Python code and additional appendices, this is an excellent introduction to machine learning for senior undergraduate and graduate students in Engineering and Computer Science.
Political science is a field rich in multimodal information sources, from televised debates to parliamentary briefings. This paper bridges a gap between computer science and political science in multimodal data analysis using audio. The adoption of multimodal analyses in political science (e.g., video/audio with text-as-data approaches) has been relatively slow due to the unequal distribution of the computational power and skills needed. We provide solutions to challenges encountered when analyzing audio, advancing the potential for multimodal data analysis in political science. Using a dataset of all televised U.S. presidential debates from 1960 to 2020, we focus on three features encountered when analyzing audio data: low-level descriptors (LLDs), such as pitch or energy; Mel-frequency cepstral coefficients (MFCCs); and audio embeddings/encodings, like Wav2Vec. We showcase four applications: (a) forced alignment of audio and text using MFCCs, time-stamping transcripts, and speaker information; (b) speech characterization using LLDs; (c) custom-made classification models with audio embeddings and MFCCs; and (d) emotion recognition models using Wav2Vec for classification of discrete emotions and their valence-arousal-dominance. We explain how these features can be applied to different political research questions and advise vigilance against naive interpretation, for both experienced researchers and those starting to work with audio.
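For readers unfamiliar with these audio features, a minimal Python sketch using the librosa library shows how LLDs (pitch, energy) and MFCCs can be extracted; the file name and parameter values are illustrative, not those used in the paper.

```python
# Illustrative extraction of low-level descriptors (LLDs) and MFCCs from an
# audio clip. The file path and all parameters are hypothetical placeholders.
import librosa
import numpy as np

y, sr = librosa.load("debate_clip.wav", sr=16000)   # mono waveform at 16 kHz

# Low-level descriptors: fundamental frequency (pitch) and energy (RMS)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
rms = librosa.feature.rms(y=y)[0]

# Mel-frequency cepstral coefficients: one 13-dimensional vector per frame
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mfccs.shape, np.nanmean(f0), rms.mean())
```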
In this work we present a framework to explain the prediction of the velocity fluctuation at a certain wall-normal distance from wall measurements with a deep-learning model. For this purpose, we apply the deep-SHAP (deep Shapley additive explanations) method to explain the velocity-fluctuation prediction in wall-parallel planes in a turbulent open channel at a friction Reynolds number ${\textit{Re}}_\tau =180$. The explainable-deep-learning methodology comprises two stages. The first stage consists of training the estimator: in this case, the velocity fluctuation at a wall-normal distance of 15 wall units is predicted from the wall-shear stress and wall pressure. In the second stage, the deep-SHAP algorithm is applied to estimate the impact each grid point has on the output. This analysis yields an importance field, whose high-importance regions are then correlated with the wall-pressure and wall-shear-stress distributions. The grid points are then clustered into structures according to their importance. We find that the high-importance clusters exhibit large pressure and shear-stress fluctuations, although generally not corresponding to the highest intensities in the input datasets. Their typical values, averaged over these clusters, are one to two times the standard deviation and are associated with streak-like regions. These high-importance clusters have sizes between 20 and 120 wall units, corresponding to approximately 100–600 $\mu\textrm{m}$ for the case of a commercial aircraft.
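As a rough sketch of the second stage, the SHAP library's DeepExplainer can attribute a network's output to its input grid points. The tiny model and random arrays below are stand-ins for the authors' trained wall-to-velocity estimator and wall data, and behaviour may vary with library versions.

```python
# Sketch of a deep-SHAP attribution stage: explain a trained estimator's output
# in terms of its wall inputs. The model and data are synthetic placeholders
# for the wall-pressure / wall-shear-stress fields and the y+ = 15 prediction.
import numpy as np
import tensorflow as tf
import shap

# Stand-in estimator: maps 32x32 wall fields (2 channels) to a scalar output.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 2)),
    tf.keras.layers.Conv2D(8, 3, activation="relu", padding="same"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1),
])

wall_inputs = np.random.rand(200, 32, 32, 2).astype("float32")  # synthetic data
background = wall_inputs[:100]                                   # reference set

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(wall_inputs[100:110])

# One importance value per grid point; high-importance regions can then be
# thresholded and clustered into structures, as described in the abstract.
importance_field = np.abs(np.asarray(shap_values)).mean(axis=0)
```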
An increasing number of reports highlight the potential of machine learning (ML) methodologies over the conventional generalised linear model (GLM) for non-life insurance pricing. In parallel, national and international regulatory institutions are accentuating their focus on pricing fairness to quantify and mitigate algorithmic differences and discrimination. However, comprehensive studies that assess both pricing accuracy and fairness remain scarce. We propose a benchmark of the GLM against mainstream regularised linear models and tree-based ensemble models under two popular distribution modelling strategies (Poisson-gamma and Tweedie), with respect to key criteria including estimation bias, deviance, risk differentiation, competitiveness, loss ratios, discrimination and fairness. Pricing performance and fairness were assessed simultaneously on the same samples of premium estimates for GLM and ML models. The models were compared on two open-access motor insurance datasets, each with a different type of cover (fully comprehensive and third-party liability). While no single ML model outperformed across both pricing and discrimination metrics, the GLM significantly underperformed for most. The results indicate that ML may be considered a realistic and reasonable alternative to current practices. We advocate that benchmarking exercises for risk prediction models should be carried out to assess both pricing accuracy and fairness for any given portfolio.
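As an illustration of the two distributional strategies in a Python setting, the sketch below fits a Poisson-gamma frequency-severity split, a direct Tweedie model, and a tree-based ensemble with scikit-learn; the data, feature set, and hyperparameters are synthetic placeholders, not the paper's benchmark configuration.

```python
# Illustrative pure-premium models under the two strategies compared in the
# paper (Poisson-gamma and Tweedie), plus a tree-based ensemble benchmark.
import numpy as np
from sklearn.linear_model import PoissonRegressor, GammaRegressor, TweedieRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))                    # synthetic rating factors
claim_count = rng.poisson(lam=0.1, size=5000)     # claim frequency
severity = np.where(claim_count > 0,
                    rng.gamma(shape=2.0, scale=500.0, size=5000), 0.0)
pure_premium = claim_count * severity

# Poisson-gamma: model frequency and severity separately
freq_glm = PoissonRegressor(alpha=1e-3).fit(X, claim_count)
has_claim = claim_count > 0
sev_glm = GammaRegressor(alpha=1e-3).fit(X[has_claim], severity[has_claim])

# Tweedie: model pure premium directly (1 < power < 2 is compound Poisson-gamma)
tweedie_glm = TweedieRegressor(power=1.5, alpha=1e-3, link="log").fit(X, pure_premium)

# Tree-based ensemble on the same target, as one of the ML challengers
gbm = HistGradientBoostingRegressor().fit(X, pure_premium)
```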
As the volume of meteorological observations continues to grow, automating the quality control (QC) process is essential for timely data delivery. This study evaluates the performance of three machine learning algorithms—autoencoder, variational autoencoder, and long short-term memory (LSTM) autoencoder—for detecting anomalies in air temperature data. Using expert-quality-controlled data as ground truth, all models demonstrated anomaly detection capability, with the LSTM outperforming others due to its ability to capture temporal patterns and minimize false positives. When applied to raw data, the LSTM achieved 99.6% accuracy in identifying valid observations and replicated 79% of manual flags, with only five false negatives and six false positives over a full year. Its sensitivity to subtle meteorological changes, such as those caused by rainfall or cloud cover, highlights its robustness. The LSTM’s performance using a three-day timestep, combined with basic QC checks in SaQC (System for Automated Quality Control), suggests a scalable and effective solution for automated QC at Met Éireann, with potential for expansion to include additional variables and multi-station generalization.
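A minimal Keras sketch of an LSTM autoencoder of the kind evaluated is shown below; the window length, layer sizes, and thresholding rule are illustrative and not Met Éireann's configuration.

```python
# Minimal LSTM autoencoder for anomaly detection in an air-temperature series:
# windows are reconstructed and flagged when reconstruction error is large.
import numpy as np
import tensorflow as tf

timesteps, n_features = 72, 1      # e.g. a three-day window of hourly values

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, n_features)),
    tf.keras.layers.LSTM(32),                                # encoder
    tf.keras.layers.RepeatVector(timesteps),
    tf.keras.layers.LSTM(32, return_sequences=True),         # decoder
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_features)),
])
model.compile(optimizer="adam", loss="mse")

# Synthetic stand-in for quality-controlled training windows
X_train = np.random.rand(1000, timesteps, n_features).astype("float32")
model.fit(X_train, X_train, epochs=5, batch_size=64, verbose=0)

# Flag windows whose reconstruction error exceeds a percentile-based threshold
recon = model.predict(X_train, verbose=0)
errors = np.mean((recon - X_train) ** 2, axis=(1, 2))
threshold = np.percentile(errors, 99)
anomalies = errors > threshold
```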
Quantifying differences between flow fields is a key challenge in fluid mechanics, particularly when evaluating the effectiveness of flow control or other problem parameters. Traditional vector metrics, such as the Euclidean distance, provide straightforward pointwise comparisons but can fail to distinguish distributional changes in flow fields. To address this limitation, we employ optimal transport (OT) theory, which is a mathematical framework built on probability and measure theory. By aligning Euclidean distances between flow fields in a latent space learned by an autoencoder with the corresponding OT geodesics, we seek to learn low-dimensional representations of flow fields that are interpretable from the perspective of unbalanced OT. As a demonstration, we utilise this OT-based analysis on separated flows past a NACA 0012 airfoil with periodic heat flux actuation near the leading edge. The cases considered are at a chord-based Reynolds number of 23 000 and a free-stream Mach number of 0.3 for two angles of attack (AoA) of $6^\circ$ and $9^\circ$. For each angle of attack, we identify a two-dimensional embedding that succinctly captures the different effective regimes of flow responses and control performance, characterised by the degree of suppression of the separation bubble and secondary effects from laminarisation and trailing-edge separation. The interpretation of the latent representation was found to be consistent across the two AoA, suggesting that the OT-based latent encoding was capable of extracting physical relationships that are common across the different suites of cases. This study demonstrates the potential utility of optimal transport in the analysis and interpretation of complex flow fields.
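For readers unfamiliar with optimal-transport distances between fields, the toy sketch below computes an exact OT cost between two synthetic scalar fields with the POT (Python Optimal Transport) library; it does not reproduce the study's unbalanced-OT geodesics or the autoencoder alignment.

```python
# Toy optimal-transport distance between two 2-D fields using POT.
# The Gaussian "fields" are synthetic stand-ins, not the airfoil flow data.
import numpy as np
import ot

n = 32
xx, yy = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
coords = np.stack([xx.ravel(), yy.ravel()], axis=1)

# Two fields represented as normalised non-negative distributions
field_a = np.exp(-((xx - 0.3) ** 2 + (yy - 0.5) ** 2) / 0.02).ravel()
field_b = np.exp(-((xx - 0.7) ** 2 + (yy - 0.5) ** 2) / 0.02).ravel()
a = field_a / field_a.sum()
b = field_b / field_b.sum()

# Ground cost: squared Euclidean distance between grid points
M = ot.dist(coords, coords, metric="sqeuclidean")

# Exact OT cost (squared 2-Wasserstein distance between the two fields);
# unlike a pointwise Euclidean norm, this reflects how far "mass" must move.
w2_squared = ot.emd2(a, b, M)
print(w2_squared)
```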
Accurate and efficient modelling of cardiac blood flow is crucial for advancing data-driven tools in cardiovascular research and clinical applications. Recently, the accuracy and availability of computational fluid dynamics methodologies for simulating intraventricular flow have increased. However, these methods remain complex and computationally costly. This study presents a reduced-order model (ROM) based on higher-order dynamic mode decomposition (HODMD). The proposed approach enables accurate reconstruction and long-term prediction of left ventricle flow fields. The method is tested on two idealized ventricular geometries exhibiting distinct flow regimes to assess its robustness under different hemodynamic conditions. By leveraging a small number of training snapshots and focusing on the dominant periodic components representing the physics of the system, the HODMD-based model accurately reconstructs the flow field over entire cardiac cycles and provides reliable long-term predictions beyond the training window. The reconstruction and prediction errors remain below 5 % for the first geometry and below 10 % for the second, even when using as few as the first three cycles of simulated data, representing the transitory regime. Additionally, the approach reduces computational costs with a speed-up factor of at least $10^{5}$ compared with full-order simulations, enabling fast surrogate modelling of complex cardiac flows. These results highlight the potential of spectrally constrained HODMD as a robust and interpretable ROM for simulating intraventricular hemodynamics. This approach shows promise for integration in real-time analysis and patient specific models.
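As a rough sketch of the underlying idea, a plain (first-order) exact DMD can be written in a few lines of NumPy; HODMD, used in the paper, applies the same SVD/eigendecomposition to a matrix of d time-delayed snapshots, a step omitted here, and the synthetic data below stand in for ventricular flow snapshots.

```python
# Toy exact DMD on synthetic snapshot data (delay embedding of HODMD omitted).
import numpy as np

n_space, n_time = 500, 60
t = np.linspace(0, 4 * np.pi, n_time)
x = np.linspace(0, 1, n_space)[:, None]
# Two-frequency synthetic "flow" as a stand-in for simulated snapshots
snapshots = np.sin(2 * np.pi * x) * np.cos(t) + 0.5 * np.cos(4 * np.pi * x) * np.sin(3 * t)

X1, X2 = snapshots[:, :-1], snapshots[:, 1:]

# Reduced linear operator via truncated SVD
U, S, Vh = np.linalg.svd(X1, full_matrices=False)
r = 6                                            # truncation rank (illustrative)
Ur, Sr, Vr = U[:, :r], np.diag(S[:r]), Vh[:r].conj().T
A_tilde = Ur.conj().T @ X2 @ Vr @ np.linalg.inv(Sr)

# DMD eigenvalues (temporal dynamics) and modes (spatial structures)
eigvals, W = np.linalg.eig(A_tilde)
modes = X2 @ Vr @ np.linalg.inv(Sr) @ W

# Future states are predicted by propagating the eigenvalues in time, which is
# how such a ROM extrapolates beyond the training window.
```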
We studied the reconstruction of turbulent flow fields from trajectory data recorded by actively migrating Lagrangian agents. We propose a deep-learning model, track-to-flow (T2F), which employs a vision transformer as the encoder to capture the spatiotemporal features of a single agent trajectory, and a convolutional neural network as the decoder to reconstruct the flow field. To enhance the physical consistency of the T2F model, we further incorporate a physics-informed loss function inspired by the framework of physics-informed neural network (PINN), yielding a variant model referred to as T2F+PINN. We first evaluate both models in a laminar cylinder wake flow at a Reynolds number of $\textit{Re} = 800$ as a proof of concept. The results show that the T2F model achieves velocity reconstruction accuracy comparable to that of existing flow reconstruction methods, while the T2F+PINN model reduces the normalised error in vorticity reconstruction relative to the T2F model. We then apply the models in turbulent Rayleigh–Bénard convection at a Rayleigh number of $Ra = 10^{8}$ and a Prandtl number of $\textit{Pr} = 0.71$. The results show that the T2F model accurately reconstructs both the velocity and temperature fields, whereas the T2F+PINN model further improves the reconstruction accuracy of gradient-related physical quantities, such as temperature gradients, vorticity and the $Q$ value, with a maximum improvement of approximately 60 % compared to the T2F model. Overall, the T2F model is better suited for reconstructing primitive flow variables, while the T2F+PINN model provides advantages in reconstructing gradient-related quantities. Our models open a promising avenue for accurate flow reconstruction from a single Lagrangian trajectory.
This work proposes a data-driven explicit algebraic stress-based detached-eddy simulation (DES) method. Despite the widespread use of data-driven methods in model development for both Reynolds-averaged Navier–Stokes (RANS) and large-eddy simulations (LES), their applications to DES remain limited. The challenge mainly lies in the absence of modelled stress data, the requirement for proper length scales in RANS and LES branches, and the maintenance of a reasonable switching behaviour. The data-driven DES method is constructed based on the algebraic stress equation. The control of RANS/LES switching is achieved through the eddy viscosity in the linear part of the modelled stress, under the $\ell ^2-\omega$ DES framework. Three model coefficients associated with the pressure–strain terms and the LES length scale are represented by a neural network as functions of scalar invariants of velocity gradient. The neural network is trained using velocity data with the ensemble Kalman method, thereby circumventing the requirement for modelled stress data. Moreover, the baseline coefficient values are incorporated as additional reference data to ensure reasonable switching behaviour. The proposed approach is evaluated on two challenging turbulent flows, i.e. the secondary flow in a square duct and the separated flow over a bump. The trained model achieves significant improvements in predicting mean flow statistics compared with the baseline model. This is attributed to improved predictions of the modelled stress. The trained model also exhibits reasonable switching behaviour, enlarging the LES region to resolve more turbulent structures. Furthermore, the model shows satisfactory generalization capabilities for both cases in similar flow configurations.
Poor food consumption remains one of the most common challenges among older adults in the UK, with at least 10% in community settings and up to 45% in care homes affected by malnutrition, which is strongly associated with frailty and with functional and health decline. Tracking and understanding the impact of diet is not easy: problems with dietary monitoring and malnutrition screening include difficulty remembering, lack of time, and needing a dietician to interpret the results. Computerised tailored education may offer a solution to these issues, and with the rise in smartphone ownership the use of technology to monitor diet is becoming more popular. This review paper examines the issues with current methods of dietary monitoring, particularly in older adults, and presents the benefits of and barriers to using technology to monitor food intake. It discusses how a photo-based food monitoring app was developed to address the current issues with such technology and how it was tested with older adults living in community and care settings. The prototype was co-developed, incorporated automated food classification to monitor dietary intake and food preferences, and was tested with older adults. It proved usable by both older adults and care workers, and feedback on how to improve it was collected. Key design improvements to make it quicker and more accurate were suggested for future testing in this population. With adaptations, this prototype could benefit older adults living in both community and care settings.
Researchers classify political parties into families by their shared cleavage origins. However, as parties have drifted from their original ideological commitments, it is unclear to what extent party families today can function as effective heuristics for shared positions. We propose an alternative way of classifying parties based solely on their ideological positions as one solution to this challenge. We use model‐based clustering to recast common subjective decisions involved in the process of creating party groups as problems of model selection, thus providing non‐subjective criteria to define ideological clusters. By comparing canonical families to our ideological clusters, we show that while party families on the right are often too similar to justify categorizing them into different clusters, left‐wing families are weakly internally cohesive. Moreover, we identify two clusters predominantly composed of parties in Eastern Europe, questioning the degree to which categories originally designed to describe Western Europe can generalize to other regions.
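A compact sketch of what model-based clustering with information-criterion model selection looks like in Python is given below; scikit-learn's Gaussian mixtures stand in for whichever implementation the authors used, and the random matrix is a placeholder for party position scores.

```python
# Sketch of model-based clustering where "how many party groups?" is treated
# as a model-selection problem, here chosen by BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
positions = rng.normal(size=(250, 5))   # 250 parties x 5 ideological dimensions

best_model, best_bic = None, np.inf
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=5, random_state=0).fit(positions)
    bic = gmm.bic(positions)
    if bic < best_bic:
        best_model, best_bic = gmm, bic

clusters = best_model.predict(positions)  # ideological cluster for each party
```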
How does the language of male and female politicians differ when they communicate directly with the public on social media? Do citizens address them differently? We apply Lasso logistic regression models to identify the linguistic features that most differentiate the language used by or addressed to male and female Spanish politicians. Male politicians use more words related to politics, sports, ideology and infrastructure, while female politicians talk about gender and social affairs. The choice of emojis varies greatly across genders. In a novel analysis of tweets written by citizens, we find evidence of gender‐specific insults, and note that mentions of physical appearance and infantilising words are disproportionately found in text addressed to female politicians. The results suggest that politicians conform to gender stereotypes online and reveal ways in which citizens treat politicians differently depending on their gender.
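A minimal sketch of the modelling step follows, using scikit-learn; the example tweets, preprocessing, and hyperparameters are invented for illustration and do not reflect the authors' pipeline.

```python
# L1-penalised (lasso) logistic regression separating tweets by the gender of
# the politician; the largest coefficients indicate the most differentiating
# terms. The two example tweets and labels are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["presupuesto e infraestructuras para el municipio",
          "igualdad y políticas sociales para todas"]
labels = [0, 1]                          # 0 = male politician, 1 = female politician

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)

# liblinear supports the L1 penalty; C controls coefficient sparsity
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, labels)

coefs = clf.coef_[0]
terms = np.array(vectorizer.get_feature_names_out())
top = np.argsort(np.abs(coefs))[::-1][:10]
print(list(zip(terms[top], coefs[top].round(2))))
```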
Pater's (2019) target article builds a persuasive case for establishing stronger ties between theoretical linguistics and connectionism (deep learning). This commentary extends his arguments to semantics, focusing in particular on issues of learning, compositionality, and lexical meaning.
This paper charts the rapid rise of data science methodologies in manuscripts published in top journals for third sector scholarship, indicating their growing importance to research in the field. We draw on critical quantitative theory (QuantCrit) to challenge the assumed neutrality of data science insights that are especially prone to misrepresentation and unbalanced treatment of sub-groups (i.e., those marginalized and minoritized because of their race, gender, etc.). We summarize a set of challenges that result in biases within machine learning methods that are increasingly deployed in scientific inquiry. As a means of proactively addressing these concerns, we introduce the “Wells-Du Bois Protocol,” a tool that scholars can use to determine if their research achieves a baseline level of bias mitigation. Ultimately, this work aims to facilitate the diffusion of key insights from the field of QuantCrit by showing how new computational methodologies can be improved by coupling quantitative work with humanistic and reflexive approaches to inquiry. The protocol ultimately aims to help safeguard third sector scholarship from systematic biases that can be introduced through the adoption of machine learning methods.
Media plays a major role in molding US public opinion about Muslims. This paper assesses the effect of the 9/11 events on the US media's framing of the Muslim nonprofit sector. Overall, it finds that the press was more likely to represent the Muslim nonprofit sector negatively post-9/11. However, post-9/11, the media framing of Muslim nonprofits was mixed. While the media were more likely to associate Muslim nonprofits with terrorism, they were also more likely to represent Muslim nonprofits as organizations facing persecution in the form of Islamophobia, government scrutiny, or hate attacks. These media frames may have contributed to public perceptions that Muslim organizations support terrorism, while also raising the alarm amongst various stakeholders that the government and the general public are persecuting the Muslim nonprofit sector.
Scholars have discovered remarkable inequalities in who gets represented in electoral democracies. Around the world, the preferences of the rich tend to be better represented than those of the less well‐off. In this paper, we use the most comprehensive comparative dataset of unequal representation available to answer why the poor are underrepresented. By leveraging variation over time and across countries, we study which factors explain why representation is more unequal in some places than in others. We compile a number of covariates examined in previous studies and use machine learning to describe which mechanisms best explain the data. Globally, we find that economic conditions and good governance are most important in determining the extent of unequal representation, and we find little support for hypotheses related to political institutions, interest groups or political behaviour, such as turnout. These results provide the first broadly comparative explanations for unequal representation.
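As a sketch of the kind of analysis described, one common way to ask "which covariates best explain the outcome?" is to fit a flexible learner and inspect permutation importances; the covariate names, data, and model below are placeholders, not the study's variables or method.

```python
# Rank candidate explanations of unequal representation by how much each
# covariate contributes to a flexible model's predictions (synthetic data).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
covariates = ["gdp_per_capita", "good_governance", "union_density",
              "turnout", "proportionality"]
X = pd.DataFrame(rng.normal(size=(400, len(covariates))), columns=covariates)
y = 0.8 * X["gdp_per_capita"] + 0.6 * X["good_governance"] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data: how much does shuffling each
# covariate degrade predictive performance?
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
ranking = sorted(zip(covariates, result.importances_mean), key=lambda kv: -kv[1])
print(ranking)
```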
Infrared (IR) nanoscopy represents a collection of imaging and spectroscopy techniques capable of resolving IR absorption on the nanometer scale. Chemical specificity is leveraged from vibrational spectroscopy, while light–matter interactions are detected by observing perturbations in the optical near field with an atomic force microscopy probe. Therefore, imaging is wavelength independent and has a spatial resolution on the nanometer scale, well beyond the classical diffraction limit. In this perspective, we outline the recent biological applications of scattering type scanning near-field optical microscopy and nanoscale Fourier-transform IR spectroscopy. These techniques are uniquely suited to resolving subcellular ultrastructure from a variety of cell types, as well as studying biological processes such as metabolic activity on the single-cell level. Furthermore, this review describes recent technical advances in IR nanoscopy, and emerging machine learning supported approaches to sampling, signal enhancement, and data processing. This emphasizes that label-free IR nanoscopy holds significant potential for ongoing and future biological applications.