We may now have access to very large databases, often termed Big Data, and for such large datasets simple econometric models will not do. When you have a million people in your database, as insurance firms, telephone providers, or charities do, and you have collected information on these individuals for many years, you simply cannot summarize these data using a small econometric model with just a few regressors. In this chapter we address diverse options for handling Big Data. We kick off with a discussion of what Big Data is and why it is special. Next, we discuss a few options such as selective sampling, aggregation, nonlinear models, and variable reduction. Methods such as ridge regression, the lasso, the elastic net, and artificial neural networks are also addressed; these latter methods are nowadays described as machine learning methods. We see that with these methods the number of choices increases rapidly, and that reproducibility can suffer. The analysis of Big Data therefore comes at the cost of more analysis and more choices to make and to report.
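As a rough illustration of the shrinkage methods named in this chapter summary, the sketch below fits ridge, lasso, and elastic net regressions on simulated data with many regressors; the data, penalty values, and scikit-learn usage are illustrative assumptions, not the chapter's own code.

```python
# A minimal sketch of the shrinkage methods mentioned above (ridge, lasso,
# elastic net), fit on simulated data; all parameter values are illustrative.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 1000, 200                         # many observations, many regressors
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]   # only a few regressors matter
y = X @ beta + rng.standard_normal(n)

for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1)),
                    ("elastic net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X, y)
    nonzero = int(np.sum(np.abs(model.coef_) > 1e-6))
    print(f"{name}: {nonzero} nonzero coefficients")
```

The lasso and elastic net shrink most of the irrelevant coefficients exactly to zero, while ridge only shrinks them towards zero, which is why these methods are presented as variable-reduction options.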
This chapter goes beyond the description of individual events by covering extremes caused by combinations of multiple events. Two main types of interaction are covered: domino effects and compound events. Domino effects, which represent one-way chains of events, are quantified using Markov theory and graph theory. Compound events, which include complex feedback loops in the Earth system, are modelled with system dynamics (as in Chapter 4). Two such models are presented: the ESCIMO climate model and the World2 model of world dynamics. The impact of global warming, pollution, and resource depletion on catastrophes is investigated, up to ecosystem and societal collapse. The types of catastrophes considered in this chapter are: storm clustering, earthquake clustering (with accelerated fatigue of structures), domino effects at refineries (explosions, fires, toxic spills), cascading failures in physical networks (more precisely, blackouts in a power grid), rainforest dieback, lake eutrophication, and hypothetical human population collapse.
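A toy example of the Markov-theoretic quantification of a one-way domino chain mentioned above might look as follows; the states and transition probabilities are invented for illustration and do not come from the chapter.

```python
# A toy sketch of using Markov theory to quantify a one-way domino chain,
# e.g. tank fire -> adjacent explosion -> site-wide fire; the states and
# transition probabilities below are hypothetical.
import numpy as np

states = ["initial fire", "adjacent explosion", "site-wide fire", "contained"]
P = np.array([
    [0.0, 0.3, 0.0, 0.7],   # initial fire escalates or is contained
    [0.0, 0.0, 0.5, 0.5],   # explosion may trigger a site-wide fire
    [0.0, 0.0, 1.0, 0.0],   # absorbing state: worst-case outcome
    [0.0, 0.0, 0.0, 1.0],   # absorbing state: incident contained
])

dist = np.array([1.0, 0.0, 0.0, 0.0])    # chain starts with an initial fire
for _ in range(10):                      # propagate until (near) absorption
    dist = dist @ P
print(dict(zip(states, dist.round(3))))  # P(site-wide fire) = 0.3 * 0.5 = 0.15
```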
Information related to the climate, sowing time, harvest, and crop development is essential for defining appropriate strategies for agricultural activities, which helps both producers and responsible bodies. Paraná, the second largest soybean producer in Brazil, has high climatic variability, which strongly influences planting and harvesting periods and crop productivity. The objective of this study was therefore to regionalize the state of Paraná using decennial metrics associated with climate variables and the enhanced vegetation index (EVI) during the soybean cycle. Individual and global analyses of these metrics were performed using multivariate techniques. These analyses were carried out for agricultural scenarios with low, medium, and high precipitation, corresponding to the 2011/2012, 2013/2014, and 2015/2016 harvest years, respectively. The scores of the retained factors and the cluster analysis yielded profiles of the groups, with Group 1 presenting the most favourable climatic and agronomic conditions for the development of soybean crops in all three harvest years. The opposite occurred for Group 2 (2011/2012 and 2013/2014) and Group 3 (2015/2016). During the soybean reproductive phases (R2–R5), precipitation was inadequate, especially for Group 2 (2011/2012 and 2013/2014), whose high water deficit resulted in a drop in soybean productivity. The climatic and agronomic regionalization of Paraná made it possible to identify the regions most suitable for growing soybeans, the effect of climatic conditions on phenological stages, and the variability of soybean productivity across the three harvest years.
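The factor-retention-plus-clustering workflow described in this abstract can be sketched schematically as below; the data are random stand-ins for the decennial climate/EVI metrics, and the component and cluster counts are arbitrary choices.

```python
# A schematic of the multivariate workflow described above: factor analysis
# on standardized metrics, then clustering on the retained factor scores.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.standard_normal((120, 8))        # 120 municipalities x 8 metrics

scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(
    StandardScaler().fit_transform(X))   # scores of the retained factors
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(groups))               # group sizes of the regionalization
```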
Repetition is a critical issue in interpreting the work of Herodotus. Detlev Fehling, for one, has pointed to recurrence of motif and scene as evidence of the historian’s ‘free invention’. Words that occur twice in Herodotus are an efficient way to consider pressing issues at the centre of how and why Herodotus put together his narrative in the way he has. Pairs where the uses are close together in stories with a lot in common suggest that we may be seeing Herodotus’ ‘habit of presentation’, especially when phrasal repetition is also found. Where pairs are found further apart, the issue of deliberate linkage between discrete episodes may be indicated through the strategic redeployment of a key term. Finally, with Xerxes’ invasion, recurring terms help us to see how Herodotus could operate over large portions of text, deliberately linking one episode to another through the deployment of twice-occurring words, thereby also connecting the whole account of the campaign to the largest project of the History.
It is increasingly common to use chatbots as an interface to services. One of the main components of a chatbot is the Natural Language Understanding (NLU) model, which is responsible for interpreting text and extracting the intent and entities present in it. It is possible to focus on only one of these NLU tasks, such as intent classification. To train an NLU intent classification model, it is generally necessary to use a considerable amount of annotated data, where each sentence in the dataset receives a label indicating an intent. Manually labeling data is arduous and can be impracticable, depending on the data volume. Thus, an unsupervised machine learning technique, such as data clustering, can be applied to find and label patterns in the data. For this task, it is essential to have an effective vector embedding representation of texts, one that captures the semantic information and helps the machine understand the context, intent, and other nuances of the entire text. This paper extensively evaluates different text embedding models for clustering and labeling. We also apply operations to improve the dataset's quality, such as removing sentences and testing various strategies for distance thresholds (cosine similarity) around the clusters' centroids. We then trained intent classification models with two different architectures, one built with the Rasa framework and the other a neural network (NN), using the attendance text from the Coronavirus Platform Service of Ceará, Brazil. We also manually annotated a dataset to be used as validation data. We conducted a study on semiautomatic labeling, implemented through clustering and visual inspection, which introduced some labeling errors into the intent classification models; annotating the entire dataset manually, however, would be unfeasible. Nevertheless, the trained models still achieved competitive accuracy.
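A condensed sketch of the semiautomatic labeling pipeline the paper describes (embedding, clustering, and a cosine-similarity threshold around centroids) might look like this; the embedding model, threshold, and example sentences are illustrative assumptions, not the paper's configuration.

```python
# Embed sentences, cluster them, and keep only sentences whose cosine
# similarity to their own cluster centroid exceeds a threshold; kept
# clusters can then be labeled by visual inspection.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["I lost my card", "My card was stolen", "What is my balance?",
             "Check account balance", "Card not working"]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb)
sims = cosine_similarity(emb, km.cluster_centers_)
own_sim = sims[np.arange(len(sentences)), km.labels_]
for s, label, sim in zip(sentences, km.labels_, own_sim):
    keep = sim >= 0.5            # discard sentences far from their centroid
    print(f"cluster {label}  keep={keep}  {s}")
```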
Data mining and techniques for analyzing big data play a crucial role in various practical fields, including financial markets. However, few quantitative studies have focused on predicting daily stock market returns, and the data mining methods used in previous studies are either incomplete or inefficient. This study used the FPC clustering algorithm alongside prominent clustering algorithms such as K-means, IPC, FDPC, and GOPC for clustering stock market data. The data comprise records for cement companies listed on the Tehran Stock Exchange; capital returns and price fluctuations are examined and analyzed to guide investment decisions. The analysis process involves extracting the stock market data of these companies over the past two years. These companies are then categorized based on two criteria, profitability percentage and short-term and long-term price fluctuations, using the FPC clustering algorithm and the aforementioned algorithms. The results of these clustering analyses are compared against each other using standard, recognized evaluation criteria to assess clustering quality. The findings indicate that the FPC algorithm provides more favorable results than the other algorithms. Based on these results, companies demonstrating profitability, stability, or loss within short-term (weekly and monthly) and long-term (three-month, six-month, and one-year) time frames are placed within their respective clusters and introduced accordingly.
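The comparison loop sketched below uses K-means, one of the baselines named in the abstract (the FPC algorithm itself is not reproduced here), with the silhouette score as a standard evaluation criterion; the return and volatility figures are simulated.

```python
# Cluster companies by profitability and price-fluctuation features, then
# score the clustering with a recognized criterion (silhouette).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# rows: companies; columns: profitability %, short- and long-term volatility
X = np.vstack([rng.normal([10, 2, 5], 1.0, (15, 3)),    # profitable, stable
               rng.normal([0, 6, 12], 1.0, (15, 3)),    # volatile
               rng.normal([-8, 3, 7], 1.0, (15, 3))])   # loss-making

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", round(silhouette_score(X, labels), 3))
```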
How do children process language as they get older? Is there continuity in the functions assigned to specific structures? And what changes in their processing and their representations as they acquire more language? Children appear to use bracketing (finding boundaries), reference (linking to meanings), and clustering (grouping units that belong together) as they analyze the speech stream and extract recurring units, word classes, and larger constructions. Comprehension precedes production; this allows children to monitor and repair production that doesn't match the adult forms they have represented in memory. Children also track the frequency of types and tokens; they use types in setting up paradigms and identifying regular versus irregular forms. The amount of experience with language (the diversity of settings), plus feedback and practice, also accounts for individual differences in the paths followed during acquisition. Ultimately, models of the process of acquisition need to incorporate all of this to account for how acquisition takes place.
Geolectal variation is often present where one language is spoken across a vast geographic area, and can appear in phonological, morphosyntactic, and lexical features. For practical reasons, it is not always possible to conduct fieldwork in every location of interest to obtain the full pattern of variation, so a sample of locations must be chosen. We propose and test a method for sampling these locations, with the goal of obtaining a distribution of typological features representative of the whole area. We apply k-means and hierarchical clustering algorithms to define this sample, based on the geographic distribution of the locations. We test our methods against simulated data with several spatial configurations, and also against real data from Circassian dialects (Northwest Caucasian). Our results show significantly higher efficiency than random sampling in detecting this variation, which makes our method useful to fieldworkers when designing their research.
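A minimal sketch of the sampling idea: cluster candidate locations by their coordinates and survey the location nearest each centroid. The coordinates below are simulated, not the Circassian data.

```python
# Choose k fieldwork sites by k-means on geographic coordinates, then pick
# the real location closest to each cluster centre.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
coords = rng.uniform(0, 100, (200, 2))    # candidate fieldwork sites (x, y)

k = 10                                     # survey budget: 10 sites
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
sample = [int(np.argmin(np.linalg.norm(coords - c, axis=1)))
          for c in km.cluster_centers_]
print("locations to survey:", sorted(sample))
```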
This paper provides a new classification of Central–Southern Italian dialects using dialectometric methods. All varieties considered are analyzed and cast in a data set in which homogeneous areas are evaluated according to a selected list of phonetic features. A linguistic distance measure is defined from the numerical evaluation of these features, using the Manhattan distance. On this basis, the classification problem is formulated as a clustering problem, and a k-means algorithm is used. Additionally, an ad hoc rule is set to identify transitional areas, and silhouette analysis is used to select the most appropriate number of clusters. While meaningful results are obtained for each number of clusters, a nine-group classification emerges as the most appropriate. As the results suggest, this classification is less subjective, more precise, and more comprehensive than traditional ones based on selected isoglosses.
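The cluster-number selection step can be sketched as follows, with binary phonetic features per area, Manhattan distances, and silhouette scores; note that scikit-learn's KMeans is Euclidean internally, so this is a simplification of the paper's Manhattan-based procedure, and the data are random placeholders.

```python
# Select the number of dialect clusters via silhouette analysis, scoring
# with the Manhattan ("cityblock") distance over phonetic feature vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = rng.integers(0, 2, (60, 25)).astype(float)  # 60 areas x 25 phonetic features

for k in range(2, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels, metric="cityblock")  # Manhattan
    print(k, round(score, 3))        # pick the k with the highest silhouette
```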
This paper presents a methodology designed to leverage multitemporal sequences of synthetic aperture radar (SAR) and multispectral data to automatically extract urban changes. The approach compares results across different radar and optical sensors, describing the advantages and drawbacks of using SAR data from the COnstellation of small Satellites for the Mediterranean basin Observation (COSMO)/SkyMed, SAtélite Argentino de Observación COn Microondas (SAOCOM), and Sentinel-1 constellations, as well as nighttime light data or Sentinel-2 images. Multiple indices obtained from multispectral data are also compared, and the results of an unsupervised clustering procedure are analyzed. The results show that different datasets yield consistent findings about different types of changes in urban areas (e.g., demolition, development, and densification) at different levels of spatial detail.
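A schematic of an unsupervised change-extraction step of this kind: compute a per-pixel difference between two dates of a built-up index and cluster pixels into "change" and "no change". The arrays stand in for co-registered index images; the paper's actual pipeline is sensor-specific.

```python
# Cluster per-pixel change magnitudes into "change" vs "no change" classes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
t1 = rng.uniform(0, 1, (100, 100))        # built-up index, date 1
t2 = t1.copy()
t2[40:60, 40:60] += 0.5                   # simulated new construction

diff = np.abs(t2 - t1).reshape(-1, 1)     # per-pixel change magnitude
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(diff)
change_mask = (labels == labels[diff.argmax()]).reshape(100, 100)
print("changed pixels:", int(change_mask.sum()))  # 400 in the 20x20 block
```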
Chapter 9 demonstrates how RIO facilitates a field-theoretic approach to regression models. The chapter draws parallels between the data representations made possible by turning regression models inside out and the geometric data analysis (GDA) that is central to field-theoretic approaches to social research.
Chapter 6 demonstrates one way that RIO can be used for exploratory data analysis: identifying statistically significant interaction terms. We show how exploring the relationships among cases offers important insights into the relationships between variables.
This paper investigates the growth and clustering of craft breweries in New Jersey. We compiled a historical dataset from 1995 to 2020 that allows us to measure the degree of geographic clustering among craft breweries in New Jersey. The number of craft breweries in New Jersey grew 491% from 2012 to 2020 (from 22 to 130 craft breweries). An impetus for this growth was that New Jersey enacted legislation in 2012 that made opening and operating a craft brewery in the state more economically viable. Our analysis finds that craft breweries in New Jersey are clustering in specific parts of the state and that this is likely due to co-location benefits such as building a culture of craft beer that drives innovation, knowledge sharing, customer sharing, and a thicker labor market. While distinct craft beer clusters have formed in New Jersey, we find there is still significant opportunity for growth. Our analysis confirms this using data on planned craft brewery openings to measure changes in the size and density of clusters and where, in New Jersey, new clusters are likely to form.
Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. We commonly rely on methods such as the random effects approach to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down algorithm to reduce the hierarchically structured risk factor to its essence, by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers’ compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. In addition, we use embeddings to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. Our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, we obtain a better differentiation between high-risk and low-risk companies.
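One top-down grouping step of the kind described here can be sketched as follows, profiling each category by its observed damage rate and claim frequency and clustering similar categories; the figures are invented, and the actual algorithm also uses text embeddings of the activity descriptions.

```python
# Profile categories at one level of the hierarchy by summary statistics,
# then cluster them to merge similar risk profiles.
import numpy as np
from sklearn.cluster import KMeans

categories = ["construction", "roofing", "bakery", "office", "IT services"]
profile = np.array([[0.30, 0.12],   # damage rate, claim frequency
                    [0.35, 0.15],
                    [0.10, 0.05],
                    [0.02, 0.01],
                    [0.03, 0.01]])

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profile)
for cat, g in zip(categories, groups):
    print(f"{cat}: group {g}")      # high-risk vs low-risk grouping
```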
Most community detection methods focus on clustering actors with common features in a network. However, clustering edges offers a more intuitive way to understand the network structure in many real-life applications. Among the existing methods for network edge clustering, the majority are algorithmic, with the exception of the latent space edge clustering (LSEC) model proposed by Sewell (Journal of Computational and Graphical Statistics, 30(2), 390–405, 2021). LSEC was shown to have good performance in simulation and real-life data analysis, but fitting this model requires prior knowledge of the number of clusters and latent dimensions, which are often unknown to researchers. Within a Bayesian framework, we propose an extension to the LSEC model using a sparse finite mixture prior that supports automated selection of the number of clusters. We refer to our proposed approach as the automated LSEC, or aLSEC. We develop a variational Bayes generalized expectation-maximization approach and a Hamiltonian Monte Carlo-within-Gibbs algorithm for estimation. Our simulation study showed that aLSEC reduced run time by a factor of 10 to over 100 compared to LSEC. Like LSEC, aLSEC maintains a computational cost that grows linearly with the number of actors in a network, making it scalable to large sparse networks. We developed the R package aLSEC, which implements the proposed methodology.
Preference for functional and nutritious foods capable of meeting consumers' demands and health needs is on the increase. This preliminary study assesses the physico-chemical and nutraceutical diversity in the cocoa bean powder of 77 genotypes present in four Nigerian cocoa field banks. Twenty ripe pods per genotype were utilized in each of the four active breeding field banks at the Cocoa Research Institute of Nigeria (CRIN), Ibadan, Nigeria. Composite beans from the 20 pods of each genotype were singly fermented, sun-dried, and milled, and duplicate samples of each genotype's powder were analysed for physico-chemical and nutraceutical components. Twenty-one polymorphic variables distinguished the 77 cocoa genotypes. Grouping by dendrogram identified four clusters: three uniquely captured 100% of the genotype membership in the local clone, international clone, and regional varieties field banks, while 86% of the genotypes in the hybrid trial field bank were grouped in Cluster I. The traits with the highest values in each cluster were protein, pH, Ca, K, and Fe (Cluster I); Zn and Mg (Cluster II); crude fat and P (Cluster III); and crude fibre, ash, theobromine, flavonoids, and caffeine (Cluster IV). Exploitable diversity for nutritional quality improvement is present in the active breeding and working collections of Nigerian cocoa field banks.
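The dendrogram-based grouping reported above corresponds to a standard hierarchical clustering, sketched generically below; the data matrix is a random placeholder for the 77 genotypes by 21 polymorphic variables.

```python
# Standardize the variables, build a hierarchical clustering, and cut the
# dendrogram into four clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

rng = np.random.default_rng(6)
X = zscore(rng.normal(size=(77, 21)), axis=0)  # genotypes x variables

Z = linkage(X, method="ward")                  # agglomerative clustering
clusters = fcluster(Z, t=4, criterion="maxclust")
print(np.bincount(clusters)[1:])               # sizes of Clusters I-IV
```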
The purpose of this paper is to group the flight data phases based on the sensor readings that are most distinctive and to create a representation of the higher-dimensional input space as a two-dimensional cluster map. The research design includes a self-organising map framework that provides spatially organised representations of flight signal features and abstractions. Flight data are mapped on a topology-preserving organisation that describes the similarity of their content. The findings reveal that there is a significant correlation between monitored flight data signals and given flight data phases. In addition, the clusters of flight regimes can be determined and observed on the maps. This suggests that further flight data processing schemes can use the same data marking and mapping themes regarding flight phases when working on a regime basis. The contribution of the research is the grouping of real data flows produced by in-flight sensors for aircraft monitoring purposes, thus visualising the evolution of the signal monitored on a real aircraft.
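A minimal self-organising map over flight-signal features, using the third-party minisom package, might look as follows; the grid size, input signals, and training length are illustrative, not the study's configuration.

```python
# Fit a 2-D topology-preserving map of flight-signal features and locate
# the grid cell (best-matching unit) for a sample.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(7)
signals = rng.standard_normal((5000, 6))   # e.g. altitude, speed, N1, ...

som = MiniSom(10, 10, input_len=6, sigma=1.5, learning_rate=0.5,
              random_seed=0)
som.train_random(signals, 1000)            # train on random samples

bmu = som.winner(signals[0])               # best-matching unit for one sample
print("sample mapped to grid cell:", bmu)
```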
Real networks often exhibit clustering, the tendency to form relatively small groups of nodes with high edge densities. This clustering property can cause large numbers of small and dense subgraphs to emerge in otherwise sparse networks. Subgraph counts are an important and commonly used source of information about the network structure and function. We study probability distributions of subgraph counts in a community affiliation graph. This is a random graph generated as an overlay of m partly overlapping independent Bernoulli random graphs (layers) $G_1,\dots,G_m$ with variable sizes and densities. The model is parameterised by a joint distribution of layer sizes and densities. When m grows linearly in the number of nodes n, the model generates sparse random graphs with a rich statistical structure, admitting a nonvanishing clustering coefficient and a power-law limiting degree distribution. In this paper we establish the normal and $\alpha$-stable approximations to the numbers of small cliques, cycles, and more general 2-connected subgraphs of a community affiliation graph.
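A small simulation of such a community affiliation graph, built as an overlay of independent Bernoulli layers with random sizes and densities, with a triangle count as a simple subgraph statistic, might look as follows; all parameter choices are illustrative.

```python
# Overlay m independent Bernoulli (Erdos-Renyi) layers on n nodes and count
# triangles, a small clique statistic of the resulting sparse graph.
import numpy as np
import networkx as nx

rng = np.random.default_rng(8)
n, m = 500, 500                           # m grows linearly in n
G = nx.empty_graph(n)
for _ in range(m):
    size = int(rng.integers(3, 10))       # random layer size
    nodes = rng.choice(n, size=size, replace=False)
    p = rng.uniform(0.3, 0.9)             # random layer density
    layer = nx.gnp_random_graph(size, p, seed=int(rng.integers(10**9)))
    G.add_edges_from((nodes[u], nodes[v]) for u, v in layer.edges())

triangles = sum(nx.triangles(G).values()) // 3   # each triangle counted thrice
print("edges:", G.number_of_edges(), "triangles:", triangles)
```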
Multivariate Analysis focuses on the most essential tools for analyzing the compositional and/or multivariate data sets that often emerge in geochemical analysis. The chapter starts by introducing groundwater contamination in one of the world's largest agricultural areas: the Central Valley of California. The goal is to use data science to discover the processes that caused the contamination, whether geogenic or anthropogenic; knowing these causes aids in deciding on mitigation actions. The reader takes a path of discovery through several protocols for applying data-scientific tools to unmask the processes, including principal component analysis, multivariate outlier detection, and factor analysis. The key to using these tools is to understand the compositional nature of geochemical datasets and how compositions must be treated appropriately to draw meaningful conclusions, a field termed compositional data analysis. The chapter emphasizes the need for data scientists to work with domain experts.
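A bare-bones sketch of the compositional workflow named here: close the data to proportions, apply a centred log-ratio (clr) transform, and run principal component analysis. The composition matrix is a random placeholder for geochemical concentrations.

```python
# Close raw concentrations to proportions, apply the clr transform, and
# run PCA on the transformed compositions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(9)
raw = rng.uniform(1, 100, (200, 6))          # 200 samples x 6 analytes
comp = raw / raw.sum(axis=1, keepdims=True)  # closure to proportions

clr = np.log(comp) - np.log(comp).mean(axis=1, keepdims=True)  # clr transform
pca = PCA(n_components=2).fit(clr)
scores = pca.transform(clr)
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
```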
Bambara groundnut (Vigna subterranea (L.) Verdc.) has been neglected in terms of variety selection and development, which has resulted in farmers growing a mixture of landraces that are not genetically characterized and are low yielding. With the need to set up a breeding programme in Malawi, it was necessary to thoroughly understand the genetic diversity (GD) present in the available germplasm. The objectives of the study were to assess the GD of Bambara genotypes using agro-morphological traits and SNP markers, and to identify and select high-yielding Bambara genotypes. Field trials were conducted for two seasons at Bunda College. The genotypes were later genotyped using DArTseqLD SNP markers. All data were analysed in R. Significant genetic variation (P < 0.001) was observed for most traits, including grain yield, which suggests that genetic variability exists in Bambara groundnut that can be exploited in breeding programmes aimed at developing high-performing varieties. Based on grain yield, the study identified 18 top-performing genotypes across the evaluation seasons, which will be tested under farmers' field conditions. DArTseqLD markers grouped the genotypes into three clusters, and the majority of genotypes from the same origin clustered together. High genetic distances were observed between genotypes from the Southern African and West African regions, which has important implications for parental selection in the genetic improvement of Bambara. Our results provide valuable insights into the extent of genetic variability and how parental lines can be selected for improved genetic gain in Bambara groundnut.