The goal of this paper is to systematically review the literature on United States Department of Agriculture (USDA) forecast evaluation and to critically assess the methods and findings of these studies. The fundamental characteristics of optimal forecasts are unbiasedness, accuracy, and efficiency, as well as encompassing and informativeness. This review revealed that the findings of these studies can differ considerably depending on the forecasts examined, the commodity, the sample period, and the methodology. Some forecasts performed very well, while others were not very reliable, resulting in a forecast-specific optimality record. We discuss the methodological and empirical contributions of these studies, as well as their shortcomings and potential opportunities for future work.
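As background to the kind of optimality testing these studies apply, the sketch below runs a standard Mincer–Zarnowitz unbiasedness/efficiency regression on simulated data; the data and variable names are hypothetical and not drawn from any study covered by the review.

```python
# Hypothetical sketch of a Mincer-Zarnowitz optimality test: regress realized values
# on forecasts and jointly test intercept = 0 and slope = 1 (unbiasedness/efficiency).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
forecast = rng.normal(100.0, 10.0, size=200)                  # simulated forecasts
realized = 2.0 + 0.95 * forecast + rng.normal(0.0, 5.0, 200)  # simulated outcomes with mild bias

X = sm.add_constant(forecast)                 # exogenous columns named 'const' and 'x1'
fit = sm.OLS(realized, X).fit()
print(fit.params)                             # estimated intercept and slope
print(fit.f_test("(const = 0), (x1 = 1)"))    # joint test; rejection signals bias/inefficiency
```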
In this chapter, we explore several important statistical models. Statistical models allow us to perform statistical inference—the process of selecting models and making predictions about the underlying distributions—based on the data we have. Many approaches exist, from the stochastic block model and its generalizations to the edge observer model, the exponential random graph model, and the graphical LASSO. As we show in this chapter, such models help us understand our data, but using them may at times be challenging, either computationally or mathematically. For example, the model must often be specified with great care, lest it seize on a drastically unexpected network property or fall victim to degeneracy. Or the model must make implausibly strong assumptions, such as conditionally independent edges, leading us to question its applicability to our problem. Or our data may simply be too large for the inference method to handle efficiently. As we discuss, the search continues for better, more tractable statistical models and more efficient, more accurate inference algorithms for network data.
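As a small, self-contained illustration of the kind of statistical inference the chapter surveys, the sketch below fits a graphical lasso (one of the models named above) to simulated data with scikit-learn; the data, regularization strength, and use of scikit-learn are assumptions for illustration, not the chapter's own workflow.

```python
# Hypothetical sketch: estimate a sparse inverse-covariance (conditional dependence)
# structure with the graphical lasso on simulated chain-structured data.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
for j in range(1, p):
    X[:, j] += 0.6 * X[:, j - 1]   # each variable depends on its predecessor

model = GraphicalLasso(alpha=0.05).fit(X)
# Nonzero off-diagonal entries of the precision matrix are the estimated "edges".
print(np.round(model.precision_, 2))
```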
This study suggests that there may be considerable difficulties in providing accurate calendar age estimates for the Roman period in Europe, between ca. AD 60 and ca. AD 230, using the radiocarbon calibration datasets that are currently available. Incorporating the potential for systematic offsets between the measured data and the calibration curve using the ΔR approach suggested by Hogg et al. (2019) only marginally mitigates the biases observed in calendar date estimates. At present, it clearly behoves researchers working in this period to heed “caveat emptor” and validate the accuracy of their calibrated radiocarbon dates and chronological models against other sources of dating information.
Response times (RTs) have become ubiquitous in second language acquisition (SLA) research, providing empirical evidence for the theorization of the language learning process. Recently, there have been discussions of some fundamental psychometric properties of RT data, including, but not limited to, their reliability and validity. In this light, we take a step back to reflect on the use of RT data to tap into linguistic knowledge in SLA. First, we offer a brief overview of how RT data are most commonly used as vocabulary and grammar measures. We then point out three key limitations of such uses, namely that (a) RT data can lack substantive importance without considerations of accuracy, (b) RT differences may or may not be a satisfactory psychometric individual difference measure, and (c) some tasks designed to elicit RT data may not be sufficiently fine-grained to target specific language processes. Our overarching goal is to enhance the awareness among SLA researchers of these issues when interpreting RT results and stimulate research endeavors that delve into the unique properties of RT data when used in our field.
This chapter, authored by a computer scientist and an industry expert in computer vision, briefly explains the fundamentals of artificial intelligence and facial recognition technologies. The discussion encompasses the typical development life cycle of these technologies and unravels the essential building blocks integral to understanding the complexities of facial recognition systems. The authors further explore key challenges confronting computer and data scientists in their pursuit of ensuring the accuracy, effectiveness, and trustworthiness of these technologies; these challenges also drive many of the common concerns regarding facial recognition technologies.
Recently released Moderate-Resolution Imaging Spectroradiometer (MODIS) land surface temperature (LST) collection 6.1 (C6.1) products are useful for understanding ice–atmosphere interactions over East Antarctica, but their accuracy should be known prior to application. This study assessed the Level 2 and Level 3 MODIS C6.1 LST products (MxD11_L2 and MxD11C1) against radiance-derived in situ LSTs from 12 weather stations. Significant cloud-related issues were identified in both LST products. Applying a stricter filter based on automatic weather station cloud data greatly improved the accuracy of the MODIS LSTs, although 29.4% of the data were lost. The cloud-screened MODIS LSTs exhibited cold biases relative to the in situ LSTs at most stations (−5.18 to −0.07°C, with root mean square errors from 2.37 to 6.28°C), with smaller cold biases at inland stations and larger ones in coastal regions and at the edge of the plateau. Accuracy was notably higher during warm periods (October–March) than during cold periods (April–September). The cloud-screened MODIS C6.1 LSTs did not show significant improvements over the C5 (Collection 5) version across East Antarctica. Ice-crystal precipitation occurring during near-surface temperature inversions (Tair − Tsurface) played a crucial role in MODIS LST accuracy on the inland plateau. In coastal regions, larger MODIS LST biases were observed when the original measurements were lower.
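For concreteness, the sketch below shows how the bias and root mean square error statistics reported above are typically computed from matched satellite and in situ observations; the numbers are hypothetical, not values from this study.

```python
# Minimal sketch: bias and RMSE of cloud-screened satellite LST against in situ LST
# for hypothetical matched observation pairs (degrees Celsius).
import numpy as np

modis_lst = np.array([-35.2, -41.8, -28.9, -50.3, -33.1])
insitu_lst = np.array([-33.0, -39.5, -27.8, -47.9, -31.6])

diff = modis_lst - insitu_lst
bias = diff.mean()                      # negative bias indicates MODIS is colder
rmse = np.sqrt((diff ** 2).mean())
print(f"bias = {bias:.2f} degC, RMSE = {rmse:.2f} degC")
```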
In 2000, The Clay Minerals Society established a biennial quantitative mineralogy round robin. The so-called Reynolds Cup competition is named after Bob Reynolds for his pioneering work in quantitative clay mineralogy and exceptional contributions to clay science. The first contest was run in 2002 with 40 sets of three samples, which were prepared from mixtures of purified, natural, and synthetic minerals that are commonly found in clay-bearing rocks and soils and represent realistic mineral assemblages. The rules of the competition allow any method or combination of methods to be used in the quantitative analysis of the mineral assemblages. Throughout the competition, X-ray diffraction has been the method of choice for quantifying the mineralogy of the sample mixtures, with a multitude of other techniques used to assist with phase identification and quantification. In the first twelve years of the Reynolds Cup competition (2002 to 2014), around 14,000 analyses from 448 participants were carried out on a total of 21 samples. The data provided by these analyses constitute an extensive database on the accuracy of quantitative mineral analyses and span enough time to track the progression of improvements in such analyses. In the Reynolds Cup competition, the accuracy of a particular quantification is judged by calculating a “bias” for each phase in an assemblage; reporting exactly the true amount of a phase gives a bias of zero. Generally, the higher-placed participants correctly identified all or most of the mineral phases present. Conversely, the worst performers failed to identify or misidentified phases. Several contestants reported a long list of minor exotic phases, which were likely reported by automated search/match programs and were mineralogically implausible. Not surprisingly, clay minerals were among the greatest sources of error. This article reports on the results of the first 12 years of the Reynolds Cup competition and analyzes the competition data to determine the overall accuracy of the mineral assemblage quantities reported by the participants. The data from the competition were also used to ascertain trends in quantification accuracy over a 12-year period and to highlight sources of error in quantitative analyses.
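The sketch below shows one plausible way to compute a per-phase bias and a total absolute bias for a submitted quantification; the phase names and weight percentages are invented, and the competition's exact scoring formula may differ.

```python
# Hypothetical sketch: per-phase bias (reported minus true wt.%) and total absolute
# bias for a single Reynolds Cup-style submission.
true_wt = {"quartz": 25.0, "illite": 30.0, "kaolinite": 20.0, "calcite": 25.0}
reported_wt = {"quartz": 27.5, "illite": 26.0, "kaolinite": 22.0, "calcite": 24.5}

per_phase_bias = {phase: reported_wt.get(phase, 0.0) - true_wt[phase] for phase in true_wt}
total_abs_bias = sum(abs(b) for b in per_phase_bias.values())

print(per_phase_bias)   # a bias of zero means the true amount was reported exactly
print(total_abs_bias)   # lower totals correspond to more accurate quantifications
```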
Chapter 11 provides an overview of the terms for talking about grammar instruction and learning, including implicit learning vs. explicit learning and implicit knowledge vs. explicit knowledge. With these common terms defined, the chapter then describes several instructional approaches that researchers have utilized to better understand how language learners build their understanding of the target language. Particular attention is paid to focus-on-form and form-focused instructional strategies.
GIRI (Glasgow International Radiocarbon Intercomparison) was designed to meet a number of objectives, including providing an independent assessment of the analytical quality of the laboratory/measurement and an opportunity for a laboratory to participate and improve (if needed). The principles in the design of GIRI were to provide the following: (a) a series of unrelated individual samples spanning the dating age range, (b) samples linked to earlier intercomparisons to allow traceability, (c) known-age samples to allow independent accuracy checks, (d) a small number of duplicates to allow independent estimation of laboratory uncertainty, and (e) two categories of samples (bulk and individual) to support laboratory investigation of variability. All of the GIRI samples are natural (wood, peat, and grain), some are of known age, and overall their ages span from more than approximately 40,000 years BP to modern. The complete list of sample materials includes humic acid, whalebone, grain, single-ring dendro-dated samples, dendro-dated wood samples spanning a number of rings (e.g., 10 rings), and background and near-background samples of bone and wood. We present an overview of the results received and preliminary consensus values for the samples, supporting a more in-depth evaluation of laboratory performance and variability.
Status hierarchies are ubiquitous across cultures and have been so over deep time. Position in hierarchies shows important links with fitness outcomes. Consequently, humans should possess psychological adaptations for navigating the adaptive challenges posed by living in hierarchically organised groups. One hypothesised adaptation functions to assess, track, and store the status impacts of different acts, characteristics, and events in order to guide hierarchy navigation. Although this status-impact assessment system is expected to be universal, there are several ways in which differences in assessment accuracy could arise. This variation may link to broader individual difference constructs. In a preregistered study with samples from India (N = 815) and the USA (N = 822), we examined how individual differences in the accuracy of status-impact assessments covary with status motivations and personality. In both countries, greater overall status-impact assessment accuracy was associated with higher status motivations, as well as higher standing on two broad personality constructs: Honesty–Humility and Conscientiousness. These findings help map broad personality constructs onto variation in the functioning of specific cognitive mechanisms and contribute to an evolutionary understanding of individual differences.
We use our calibrated ABM and our AI algorithm to make case-by-case predictions of outcomes in new out-of-sample test data. These predictions concern the full partisan composition of the cabinets which form, participation by particular parties in the cabinets which form, and the observed duration of the cabinet which forms. Absent a baseline model of government formation in such complex settings against which we can evaluate our results, we compare success rates with those of a prediction of minimal winning coalitions, which is common to a large number of existing studies. Bearing in mind that the ABM in particular generates probability distributions of predicted outcomes in each case, which we feel is substantively realistic, while only a single outcome can be observed, we are very satisfied with the predictive accuracy of the model. Successful predictions relating to cabinet durations are particularly distinctive to the model, deriving from the model-predicted number of issues tabled in formation negotiations and the model-predicted likelihood that a random shock will create a situation in which a majority of legislators now prefer some alternative to the incumbent.
Scoring rules measure the deviation between a forecast, which assigns degrees of confidence to various events, and reality. Strictly proper scoring rules have the property that, for any forecast p, the mathematical expectation of the score of p by the lights of p is strictly better than the mathematical expectation of the score of any other forecast q by the lights of p. Forecasts need not satisfy the axioms of the probability calculus, but Predd et al. [9] have shown that, given a finite sample space and any strictly proper additive and continuous scoring rule, the score of any forecast that does not satisfy the axioms of probability is strictly dominated by the score of some probabilistically consistent forecast. Recently, this result has been extended to non-additive continuous scoring rules. In this paper, a condition weaker than continuity is given that suffices for the result, and the condition is proved to be optimal.
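For readers less familiar with the terminology, the display below records the standard definition of strict propriety (in the penalty convention, where lower scores are better) together with the quadratic (Brier) score as the textbook example; this is background notation, not a result of the paper.

```latex
% Background: a scoring rule $S$ is strictly proper (penalty convention) if, for all
% forecasts $p$ and all $q \neq p$,
\[
  \mathbb{E}_p\bigl[S(p)\bigr] \;<\; \mathbb{E}_p\bigl[S(q)\bigr].
\]
% The textbook example is the quadratic (Brier) score over events $E_1,\dots,E_n$:
\[
  S(q,\omega) \;=\; \sum_{i=1}^{n} \bigl( q(E_i) - \mathbf{1}_{E_i}(\omega) \bigr)^{2}.
\]
```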
Three experiments (N = 550) examined the effect on judgment accuracy of an interval construction elicitation method used in several expert elicitation studies. Participants made judgments about topics that were either searchable or unsearchable online, using one of two order variations of the interval construction procedure. One group of participants provided their best judgment (one step) prior to constructing an interval (i.e., lower bound, upper bound, and a confidence rating that the correct value fell in the range provided), whereas another group of participants provided their best judgment last, after the three-step confidence interval was constructed. The overall effect of this elicitation method was not significant in 8 out of 9 univariate tests. Moreover, the calibration of confidence intervals was not affected by elicitation order. The findings warrant skepticism regarding the benefit of prior confidence interval construction for improving judgment accuracy.
We make hundreds of decisions every day, many of them extremely quickly and without much explicit deliberation. This motivates two important open questions: What is the minimum time required to make choices with above-chance accuracy? What is the impact of additional decision-making time on choice accuracy? We investigated these questions in four experiments in which subjects made binary food choices using saccadic or manual responses, under either “speed” or “accuracy” instructions. Subjects were able to make above-chance decisions in as little as 313 ms and chose their preferred food item in over 70% of trials at average speeds of 404 ms. Further, slowing down their responses, either by asking them explicitly to be confident about their choices or by asking them to respond with hand movements, generated about a 10% increase in accuracy. Together, these results suggest that consumers can make accurate everyday choices, akin to those made in a grocery store, at significantly faster speeds than previously reported.
Recent research suggests that communicating probabilities numerically rather than verbally benefits forecasters’ credibility. In two experiments, we tested the reproducibility of this communication-format effect. The effect was replicated under comparable conditions (low-probability, inaccurate forecasts), but it was reversed for low-probability accurate forecasts and eliminated for high-probability forecasts. Experiment 2 further showed that verbal probabilities convey implicit recommendations more clearly than they convey probability information, whereas numeric probabilities do the opposite. Descriptively, the findings indicate that the effect of probability words versus numbers on credibility depends on how these formats differ in conveying directionality, how directionality implies recommendations even when none are explicitly given, and how such recommendations correspond with outcomes. Prescriptively, we propose that experts distinguish forecasts from advice, using numeric probabilities for the former and well-reasoned arguments for the latter.
Stastny and Lehner (2018) reported a study comparing the forecast accuracy of a US intelligence community prediction market (ICPM) to that of traditionally produced intelligence reports. Five analysts unaffiliated with the intelligence reports imputed forecasts from the reports after stating their personal forecasts on the same forecasting questions. The authors claimed that the accuracy of the ICPM was significantly greater than that of the intelligence reports and suggested that this may have been due to methods that harness crowd wisdom. However, additional analyses conducted here show that the imputers’ personal forecasts, which were made individually, were as accurate as the ICPM forecasts. In fact, their updated personal forecasts (made after reading the intelligence reports) were marginally more accurate than the ICPM forecasts. The imputed forecasts are also strongly correlated with the imputers’ personal forecasts, casting doubt on the degree to which the imputation was in fact a reliably inter-subjective assessment of what the intelligence reports implied about the forecasting questions. Alternative methods for comparing intelligence community forecasting methods are discussed.
A routine part of intelligence analysis is judging the probability of alternative hypotheses given the available evidence. Intelligence organizations advise analysts to use intelligence-tradecraft methods such as Analysis of Competing Hypotheses (ACH) to improve judgment, but such methods have not been rigorously tested. We compared the evidence evaluation and judgment accuracy of a group of intelligence analysts who were recently trained in ACH and then used it on a probability judgment task with those of another group of analysts from the same cohort who were neither trained in ACH nor asked to use any specific method. Although the ACH group assessed information usefulness better than the control group, the control group was a little more accurate (and more coherent) than the ACH group. Both groups, however, exhibited suboptimal judgment and were susceptible to unpacking effects. Although ACH failed to improve accuracy, we found that recalibration and aggregation methods substantially improved accuracy. Specifically, mean absolute error (MAE) in analysts’ probability judgments decreased by 61% after first coherentizing their judgments (a process that ensures judgments respect the unitarity axiom) and then aggregating them. The findings cast doubt on the efficacy of ACH and show the promise of statistical methods for boosting judgment quality in intelligence and other organizations that routinely produce expert judgments.
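To make the recalibration-and-aggregation step concrete, the sketch below coherentizes each analyst's judgments over mutually exclusive and exhaustive hypotheses by simple normalization (so they respect the unitarity axiom) and then averages across analysts; the numbers are invented, and the study's actual coherentization procedure may differ (e.g., a least-squares projection onto the probability simplex).

```python
# Hypothetical sketch: coherentize probability judgments over mutually exclusive,
# exhaustive hypotheses (rows = analysts, columns = hypotheses), then aggregate.
import numpy as np

raw = np.array([
    [0.7, 0.5, 0.2],   # raw judgments need not sum to 1
    [0.4, 0.4, 0.4],
    [0.9, 0.3, 0.1],
])

coherent = raw / raw.sum(axis=1, keepdims=True)   # simple normalization (one option)
aggregated = coherent.mean(axis=0)                # unweighted linear pooling

truth = np.array([1.0, 0.0, 0.0])                 # hypothetical resolved outcome
mae = np.abs(aggregated - truth).mean()
print(aggregated, round(mae, 3))
```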
Objective:
To examine the trans-fat labelling status of pre-packaged foods sold in Hong Kong.
Design:
Data from 19 027 items in the 2019 FoodSwitch Hong Kong database were used. Ingredient lists were screened to identify specific trans-fat ingredient indicators (e.g. partially hydrogenated vegetable oil, PHVO) and non-specific indicators (e.g. hydrogenated oil). Trans-fat content was obtained from the on-pack nutrition labels and converted into a proportion of total fat (%total fat). Descriptive statistics were calculated for trans-fat content and for the number of specific, non-specific and total trans-fat ingredient indicators found on the ingredient lists. Comparisons were made between regions using one-way ANOVA and χ2 tests for continuous and categorical variables, respectively.
Setting:
Cross-sectional audit.
Participants:
Not applicable.
Results:
A total of 729 items (3·8 % of all products) were reported to contain industrially produced trans-fat, with a median of 0·4 g/100 g or 100 ml (interquartile range (IQR): 0·1–0·6) and 1·2 %total fat (IQR: 0·6–2·9). ‘Bread and bakery products’ had the highest proportion of items with industrially produced trans-fat (18·9 %). ‘Non-alcoholic beverages’ had the highest proportion of products with ‘false negative’ labelling (e.g. labelled as 0 trans-fat but containing PHVO; 59·3 %). The majority of products with a trans-fat indicator originated from Asia (70 %).
Conclusions:
According to the labelling, ∼4 % of pre-packaged foods and beverages sold in Hong Kong in 2019 contained industrially produced trans-fat, and a third of these had trans-fat >2 %total fat. The ambiguous trans-fat labelling in Hong Kong may not effectively assist consumers in identifying products free from industrially produced trans-fat.
Epistemologists who study credences have a well-developed account of how you should change them when you learn new evidence; that is, when your body of evidence grows. What's more, they boast a diverse range of epistemic and pragmatic arguments that support that account. But they do not have a satisfactory account of when and how you should change your credences when you become aware of possibilities and propositions you have not entertained before; that is, when your awareness grows. In this paper, I consider the arguments for the credal epistemologist's account of how to respond to evidence, and I ask whether they can help us generate an account of how to respond to awareness growth. The results are surprising: the arguments that all support the same norms for responding to evidence growth support a number of different norms when they are applied to awareness growth. Some of these norms seem too weak, others too strong. I ask what we should conclude from this, and argue that our credal response to awareness growth is considerably less rigorously constrained than our credal response to new evidence.