1 Introduction
There is a substantial literature on the quality of probability and confidence judgments (see Erev, Wallsten, & Budescu, 1994; Griffin & Brenner, 2004; Harvey, 1997; Kahneman, Slovic, & Tversky, 1982; Keren, 1991; McClelland & Bolger, 1994; Murphy & Winkler, 1992; Wallsten & Budescu, 1983). A specific property of the probability judgments—their calibration—has been accepted as the “common standard of validity” in the empirical literature (Wallsten & Budescu, 1983). Judgments are said to be calibrated if 100p% of all events that are assigned a subjective probability of p materialize.Footnote 1 This paper focuses on some conceptual and methodological problems associated with standard calibration analyses. After reviewing these problems we propose and illustrate an alternative, model-based approach to assessing the calibration of probability judgments that overcomes them.
1.1 Calibration
We define calibration in terms of individual judgments concerning individual events. Let E ij denote the j-th event for which the i-th judge gives a confidence judgment, C ij. Calibration concerns the relationship between the judgment C ij and the probability of E ij. There are two distinct types of calibration, depending on how the probability of the target event is defined (see Budescu, Erev, & Wallsten, 1997; Budescu, Wallsten, & Au, 1997; Murphy & Winkler, 1992; Wallsten, 1996). One type concerns the conditional probability P(E ij | C ij = c). A judge is calibrated if P(E ij | C ij = c) = c for all c. This means that the probability of the event is c when the judge assigns to it a confidence judgment of c. There are two ways of defining miscalibration, or over/underconfidence (Harvey, 1997). The most prevalent definition pertains to the judge’s ability to distinguish between true and false events and is usually applied in forced-choice tasks. According to this view, a judge is overconfident if P(E ij | C ij = c) < c when c > 0.5, or P(E ij | C ij = c) > c when c < 0.5. A judge is considered to be underconfident if P(E ij | C ij = c) > c when c > 0.5, or P(E ij | C ij = c) < c when c < 0.5 (see Wallsten, Budescu, & Zwick, 1993). A second definition captures the judge’s confidence in the truth of an event. Here overconfidence and underconfidence are implied by the inequalities P(E ij | C ij = c) < c and P(E ij | C ij = c) > c, respectively. The (mis)calibration of judgments is often summarized by a “calibration curve,” which plots P(E ij | C ij = c) as a function of c.
A second type of calibration concerns the marginal probability P(E ij). A judge is calibrated if P(E ij) = C ij. If we consider the calibration of judges on average, then this can be viewed as a reversal of the conditioning argument in the previous definition, in the sense that one is calibrated if E(C ij | P(E ij) = p) = p, so that when the probability of the event is p, the expected judgment is also p. This type of calibration can be visualized by plotting C ij against P(E ij) as a type of “reversed” calibration curve in which the axes are interchanged. This definition is often used in studies of Bayesian updating where the events’ probabilities result from a known random generating process (see Erev, Wallsten, & Budescu, 1994). A judge is overconfident when C ij > P(E ij) and P(E ij) > 0.5, or C ij < P(E ij) and P(E ij) < 0.5. Underconfidence occurs when C ij < P(E ij) and P(E ij) > 0.5, or C ij > P(E ij) and P(E ij) < 0.5.
Ideally, one would like to quantify the quality of the judgment provided by any specific judge for any given event. However, traditional calibration analysis is usually performed at the group level and across multiple events, since the conditional probabilities, P(E ij | C ij = c), or the marginal probabilities, P(E ij), are typically unknown. To estimate P(E ij) one computes the proportion of observations in a subset of observations for which the event occurs. To estimate P(E ij | C ij = c), one computes this only for those observations that were assigned a judgment of c. These probability estimates are then used to assess calibration. For example, calibration curves plot p c against c, where p c is an estimate of P(E ij | C ij = c), usually the proportion of observations where the event occurs when the judgment is c. Calibration can then be measured, for example, using the calibration index
CI = (1/N) Σc n c(p c − c)²,
where n c is the number of observations aggregated to estimate p c when the judgment is c, and N = Σc n c is the total number of observations. For calibration based on marginal probabilities researchers use simple global measures, such as the bias c̄ − p, where p is a relative frequency and c̄ is the average confidence judgment.
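As a worked illustration, the calibration index can be computed directly from a set of judgments and binary outcomes. This is a minimal sketch with hypothetical data; the function name is ours:

```python
import numpy as np

def calibration_index(judgments, outcomes):
    """Calibration index CI = (1/N) * sum_c n_c * (p_c - c)^2, where
    p_c is the relative frequency of the event among observations
    assigned judgment c, and N is the total number of observations."""
    judgments = np.asarray(judgments, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n_total = len(judgments)
    ci = 0.0
    for c in np.unique(judgments):
        mask = judgments == c
        n_c = mask.sum()
        p_c = outcomes[mask].mean()  # relative frequency given judgment c
        ci += n_c * (p_c - c) ** 2
    return ci / n_total

# Perfectly calibrated toy data: half of the events judged 0.5 occur,
# and all events judged 1.0 occur, so CI = 0.
judg = [0.5, 0.5, 1.0, 1.0]
outc = [1, 0, 1, 1]
print(calibration_index(judg, outc))  # 0.0
```

The global bias for the same data, np.mean(judg) − np.mean(outc), is likewise zero here, illustrating how the two summary measures behave for calibrated judgments.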
The use of relative frequencies is justified by the implicit assumption that the events being aggregated form an equivalence class, so that all events in a given aggregated set have a common (conditional) probability. In cases where the observations are not replications of the same process, or are not sampled from the same domain, this assumption may be questionable. For example, it may be misleading to combine weather forecasts of different forecasters operating at different locations, just as it would be inappropriate to aggregate financial forecasts made in various countries. The practice of combining judgments of many participants regarding general-knowledge items selected arbitrarily from various domains, which has been used in many psychological experiments, has been criticized on similar grounds (Gigerenzer, Hoffrage, & Kleinbölting, 1991; Juslin, 1994; Juslin, Winman, & Olsson, 2000; Winman, 1997). Next we review some important conceptual and statistical concerns about group-level analysis of calibration based on aggregated data.
1.2 Conceptual concerns
The quality of subjective/personal judgments regarding unique/non-repeatable events is assessed by comparing these judgments with relative frequencies aggregated across multiple judges, events, and occasions. Paradoxically, then, the standard of calibration for subjective probabilities is based on a frequentist approach to probability. Given the diametrically opposed views held by these two schools of thought, this state of affairs should be equally disturbing to all researchers, regardless of their stand on the question of the “proper” interpretation of probability (Keren, 1991; Lad, 1984). Another concern is the insensitivity of calibration analysis to individual differences. Researchers (e.g., Gigerenzer et al., 1991; Juslin, 1994; Winman, 1997) have argued that some empirical results are artifacts of biased selection of events. A similar argument can be made with respect to the selection of judges. The degree of miscalibration in any study is determined, in part, by the expertise of the participants in the domains of interest. Judges who vary in knowledge or expertise may lead researchers to reach different, possibly conflicting, conclusions.
1.3 Statistical concerns
The analysis of calibration based on aggregated observations also has several statistical problems. First, if the subsets of observations used to produce relative frequencies are not sufficiently large, the estimated probabilities will be unstable. This undermines the power and precision of statistical inferences based on these estimates. To avoid this problem researchers aggregate observations. But this potentially leads to a second problem, as it may require aggregating data over important characteristics of the judges, events, and/or circumstances under which the judgments were elicited. Confounding variables may then distort the apparent relationship between the probabilities and these variables. This problem is well known in statistics and can lead to such phenomena as Simpson’s Paradox (Simpson, 1951) or the Ecological Fallacy (Robinson, 1950). A standard solution is to avoid aggregation by conditioning on the relevant variables using a statistical model.
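The distortion caused by aggregation can be seen in a small numerical sketch (the numbers are hypothetical). Suppose all shots from two distances receive the same judgment, c = 0.6, but the true hit rates differ by distance; pooling the observations can make a miscalibrated judge appear perfectly calibrated:

```python
import numpy as np

# Hypothetical strata: shots from two distances, all judged c = 0.6.
# Near shots succeed 80% of the time (underconfidence: 0.8 > 0.6);
# far shots succeed 40% of the time (overconfidence: 0.4 < 0.6).
near_outcomes = np.array([1] * 8 + [0] * 2)   # hit rate 0.8
far_outcomes = np.array([1] * 4 + [0] * 6)    # hit rate 0.4

pooled = np.concatenate([near_outcomes, far_outcomes])
print(near_outcomes.mean())  # 0.8 -> looks underconfident near the basket
print(far_outcomes.mean())   # 0.4 -> looks overconfident far from the basket
print(pooled.mean())         # 0.6 -> pooled data suggest perfect calibration
```

Conditioning on distance, as a statistical model does, reveals the opposing patterns that the pooled relative frequency conceals.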
2 A model-based approach to estimating probabilities
To simplify notation, let πij(c) = P(E ij | C ij = c) and πij = P(E ij), where i (still) indexes the judges, j the events, and c a given confidence level. We propose estimating πij(c) and πij with regression models. Since the outcome (the event) is binary, a natural family of statistical models is generalized linear mixed models for binary variables (see Pendergast, Gange, Newton, Lindstrom, Palta, & Fisher, 1996; Guo & Zhao, 2000). This includes mixed logistic regression models and extensions thereof. These models are of the form:
πij(c) = f(β, bi, c ij) and πij = g(β, bi),
where β is a vector of parameters representing the effects of explanatory variables that characterize the events or circumstances of the judgments, bi is a vector of random judge-specific parameters that allow for individual differences, and f and g are inverse link functions that map the parameters into probabilities.Footnote 2 Note that πij(c) is a function of c ij, since the probability is conditional on the confidence judgment, whereas πij is a marginal probability. Appendix A gives further details on model notation and specification, and the next section provides specific examples. We propose a three-step process:
1. Specify a mixed model for πij(c) and/or πij to produce the estimates π̂ij(c) and π̂ij.
2. Estimate the event probabilities as P̂(E ij | C ij = c) = π̂ij(c) and/or P̂(E ij) = π̂ij.
3. Use the estimated probabilities, π̂ij(c) and/or π̂ij, to assess the calibration of the observed judgments, c ij.
Most models can be estimated using standard statistical packages for generalized linear mixed models. We used PROC GLIMMIX in SAS/STAT, Version 9.2 (SAS, 2008). Appendix B gives syntax examples showing how to implement these models with software. The specification of a model is an important issue, since this approach relies on having valid estimates of the probabilities. We relied on the Akaike information criterion (AIC; Akaike, 1974) to select models, but additional analyses confirmed that our results were reasonably robust to minor changes in the model specification.Footnote 3
Mixed models have already been shown to be useful in analyses of data from research in judgment and decision making (e.g., Merkle, 2010; Merkle, Smithson, & Verkuilen, 2010; Stockard, O’Brien, & Peters, 2006). The methodology proposed here can be viewed as an extension of the approach proposed by Merkle (2010) for using mixed models to study calibration. Such models can be used to study systematic trends in data while also accounting for individual differences. This model-based approach overcomes the sparseness of the data by expressing the unknown probabilities as a function of explanatory variables. Instead of relying on relative frequencies from aggregated observations, the model provides an estimate of the probability of each event, thus avoiding the problems caused by aggregating data. And by using a parametric model we obtain more precise estimates of the probabilities than those based on relative frequencies. In the following section we demonstrate this approach and contrast it with analyses based on aggregation.
2.1 Illustrative example
McGraw, Mellers, and Ritov (2004) report two studies in which subjects gave confidence judgments prior to throwing a basketball at a basket.Footnote 4 In the first study 45 subjects threw a basketball at a hoop three times from each of 12 locations that varied in distance to the basket and side of the court. In the second study 20 subjects were randomly assigned to a control group and 22 to a “debiased” group, in which they were instructed to avoid overconfidence. They attempted five shots from each of seven distances from the basket along the center of the court. McGraw et al. were concerned with the relationship between judgments of the pleasure of the outcomes and both confidence and calibration, but we will focus only on calibration. The limitations of using relative frequencies to estimate the unknown probabilities become clear when we consider that the probabilities may vary over distance, side of the court, group, and subject. Each subject made only a few shots from each spot, too small a number of observations to estimate probabilities accurately using relative frequencies. But aggregating across variables, such as distance, can change the results.
2.1.1 Analysis based on conditional probabilities
First we consider the conditional probabilities, πij(c). For the first study we used the model
logit(πij(c)) = β 0 + β 1DISTANCEij + β 2logit(c ij) + b i0,     (1)
where logit(z) = ln[z/(1−z)], DISTANCEij is the distance (in inches) to the basket, and c ij is the confidence judgment. It is convenient to transform the probabilities and judgments to log-odds since the model is linear on the log-odds scale; furthermore, when β 0 = β 1 = 0, β 2 = 1, and b i0 = 0, then logit(πij(c)) = logit(c ij) and thus πij(c) = c ij, implying perfect calibration.Footnote 5 Thus the parameters capture miscalibration due to different sources. The discrepancy between the confidence judgment and the conditional probability is represented by β 0 and β 2, the effect of distance by β 1, and b i0 is a subject-specific effect. For the second study we specified a similar model but added an effect for the experimental manipulation, so that
logit(πij(c)) = β 0 + β 1DEBIASi + β 2DISTANCEij + β 3logit(c ij) + b i0,     (2)
where DEBIASi is a binary variable that indicates whether the i-th subject was in the debiased group.
Figure 1a plots the estimated calibration curves based on the model in Equation (1). The smooth curves are the mean calibration curves. The open points are mean values of π̂ij(c) grouped by distance and confidence judgment. The model confirmed a significant effect for distance (β̂ 1 = −0.007, z = −5.61, p < 0.001), which can be seen clearly in the figure: as distance increases, the judges tend to be more overconfident. Figure 2a shows the estimated calibration curves for each distance and group for the second study, based on Equation (2). The plot is constructed like Figure 1 but the data are also conditioned on group. Again distance had a significant effect (β̂ 2 = −0.01, z = −9.24, p < 0.001). However, neither the manipulation (β̂ 1 = 0.12, z = 0.61, p = 0.54) nor the effect of the confidence judgment (β̂ 3 = −0.07, z = −1.24, p = 0.21) was significant. The lack of an apparent effect for the confidence judgment might seem surprising, since it implies flat calibration curves. However, it is reasonable that, after accounting for distance and the subject, the judgments themselves would not predict the outcome. In both Figures 1 and 2 the solid points are the mean values of the estimates of πij(c), grouped by confidence judgment and aggregated over distance. This average curve is notably steeper than the estimated calibration curves for each distance. Again the pattern of over- or underconfidence is highly dependent on whether or not one controls for distance.
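The perfect-calibration benchmark of Equation (1), and the direction of the distance effect just described, can be checked numerically. In this sketch the helper functions are ours, and all parameter values other than the distance coefficient −0.007 are hypothetical:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def model_prob(c, distance, b0=0.0, b1=0.0, b2=1.0, bi0=0.0):
    """Conditional probability under a model of the form of Equation (1):
    logit(pi) = b0 + b1*distance + b2*logit(c) + bi0."""
    return inv_logit(b0 + b1 * distance + b2 * logit(c) + bi0)

# With b0 = b1 = 0, b2 = 1, and bi0 = 0 the model returns the judgment
# itself, i.e., perfect calibration:
for c in (0.1, 0.5, 0.9):
    assert abs(model_prob(c, distance=120.0) - c) < 1e-9

# A negative distance coefficient pushes the conditional probability
# below the judgment at long range, i.e., overconfidence grows with
# distance:
print(model_prob(0.7, distance=300.0, b1=-0.007) < 0.7)  # True
```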
Figures 1b and 2b are similar to Figures 1a and 2a, respectively, but are based on relative frequencies. The open points are relative frequencies from aggregating observations for each distance and confidence judgment. The relative frequencies aggregated within distances are highly unstable. The closed points are relative frequencies based on aggregating observations across distances.Footnote 6 These are more stable, but averaging over distance ignores the effect of distance on confidence. The model-based approach provides more stable estimates of the probabilities without ignoring or obscuring the effects of variables.
One significant benefit of using a mixed model is that it can account for individual differences in calibration as well as systematic effects due to explanatory variables such as distance and treatment group. In Equations (1) and (2) these individual differences are modeled through the subject-specific effect represented by b i0. The effect of this parameter can be seen graphically by plotting the estimated calibration curve for each subject, at each given distance, as shown in Figure 3. As can be seen in the plot, there is considerable variation across subjects in the calibration curves. The variance of b i0 was estimated at approximately 0.25, with a standard error of 0.09. It is useful to note that one could also permit variability across subjects in the slope of the calibration curve, on the log-odds scale, by adding the term b i1logit(c ij) to (1) or (2), although this did not improve the fit of either model here.
The lack of a significant effect for the experimental manipulation might appear to contradict the analysis by McGraw et al., but their analysis ignores the effect of distance, whereas our analysis controls for it. But the manipulation may also have influenced the judgments.Footnote 7 To assess the effect of the manipulation on calibration, we examined its effect on the joint distribution of the judgments and the estimated probabilities—specifically the distribution of the discrepancy between them. We computed the squared discrepancy, (c ij − π̂ij(c))², and used it as the response variable in a mixed-effects linear model. The main effect for group was significant (F (1,1416) = 9.07, p = 0.003), showing better calibration indices for the debiased group. The effect for distance was also significant (F (6,1416) = 5.17, p < 0.001), but the interaction between group and distance was not (F (6,1416) = 0.82, p = 0.55). Figure 4 shows the means and distributions of the logs of the discrepancy measure as a function of distance by group. A similar analysis based on the relative frequencies did not show a significant effect for group (F (1,615) = 2.42, p = 0.12), most likely due to the greater variability of the indices computed from relative frequencies based on few observations.
2.1.2 Analysis based on marginal probabilities
Next we estimated the marginal probabilities, πij. For the first study we specified the model
logit(πij) = β 1LEFTij + β 2CENTERij + β 3RIGHTij + β 4DISTANCEij + b i0 + b i1DISTANCEij.     (3)
We include the effects of side of the basket with indicator variables, and an additional judge-specific effect for distance, b i1. These effects were not used in the model for the conditional probabilities because they did not improve the fit of those models. To estimate the marginal probabilities in the second study we used the model
logit(πij) = β 0 + β 1DEBIASi + β 2DISTANCEij + b i0 + b i1DISTANCEij.     (4)
These models provide for an interesting new analysis: “reversing” the traditional calibration curve to examine the conditional distribution of the confidence judgments given the (estimated) probabilities. This is possible only because the models provide estimates of the probability for each observation. It is not possible, or at least cannot be done as finely, by aggregating observations into relative frequencies. Figures 5 and 6 depict the mean confidence judgments and their confidence intervals, conditional on the estimated probabilities, for the first and second studies, respectively.
We grouped the confidence judgments by the corresponding marginal probability estimate, rounded to the nearest tenth. The plots appear to indicate some tendency to overestimate the probabilities, particularly at shorter distances, when shots were attempted at the center of the court in the first study, and more so in the control group than in the debiased group in the second study.
To further examine apparent trends in the miscalibration of the judgments based on the marginal probabilities, we analyzed the discrepancy measure c ij − π̂ij. Figures 7a and 8a show the means and distributions of the miscalibration measures by distance and location for the two studies. These plots also show the trends we observed in Figures 5 and 6, and statistical analyses confirmed them. In the first study there was a significant interaction between side and distance (F (6,1564) = 6.17, p < 0.001). The tendency to overestimate the probability decreased with distance when shooting from the side, but the trend was curvilinear when shooting from the center. For the second study we found significant main effects for both distance (F (6,1416) = 4.01, p = 0.005) and group (F (1,1416) = 8.97, p = 0.003), but not for their interaction (F (6,1416) = 1.88, p = 0.08). When controlling for distance, the analysis reveals a significant effect for the debiasing manipulation: it significantly improved calibration overall.
To compare these analyses with a model-free approach based on the raw data, we analyzed the discrepancy measure c ij − I(E ij), where I(E ij) indicates whether event E ij (a basket) occurred. We used c ij − I(E ij) rather than aggregating observations to estimate πij, since E[c ij − I(E ij)] = E(c ij) − πij. Figures 7b and 8b show the means and distributions of this measure for the first and second studies, respectively. Note the greater variability of this measure; this instability hinders statistical analyses. In the first study we failed to detect a significant interaction (F (6,1564) = 1.23, p = 0.29), main effect for distance (F (3,1565) = 1.85, p = 0.14), or main effect for side (F (2,1564) = 1.06, p = 0.35). For the second study we confirmed significant main effects for the manipulation (F (1,1416) = 8.97, p = 0.003) and distance (F (6,1416) = 4.01, p = 0.001), but not for the interaction (F (6,1416) = 1.88, p = 0.081). While both analyses estimated the same mean difference in the calibration measure between the control and debiased groups (0.11), the standard error in this model-free analysis was approximately 30% larger than in the analysis using the model-based probability estimates. While it is possible to analyze calibration based on the marginal probabilities without aggregation, a model-based approach can provide more stable estimates and thus more precise inferences.
3 Discussion
We have argued and demonstrated with examples that the standard calibration analysis has significant limitations. Conceptually it is inconsistent because it makes relative frequencies the standard of evaluation of the judges’ subjective probabilities. Statistically it is problematic because it can lead to biased estimates when aggregation is over important variables, and imprecise estimates when aggregation is over too few observations.
We proposed and demonstrated the use of mixed models for binary data to estimate the probabilities of specific events for the purpose of analyzing the calibration of specific judges. With a good model, one can estimate the probabilities of individual events accurately, without resorting to indiscriminate data aggregation. Instead of comparing the judgments to a set of relative frequencies, our approach uses probabilities derived from a model that captures empirical regularities and incorporates relevant individual differences. Thus the standard of comparison for any given event is personal, as it is explicitly tailored to each judge. This addresses, at least in part, the conceptual concern about using a frequentist analysis to evaluate the quality of subjective judgments. We demonstrated that the model-based approach provides superior results: it can address confounding among explanatory variables while providing more precise probability estimates, which translate into higher precision and power in statistical inferences, and it allows new, informative analyses (e.g., “reverse calibration” curves) at different levels (individual events and judges) that are not possible in the traditional approach.
Appendix A: Model Parameterization
We introduced a model-based approach to estimating the conditional and marginal probabilities by using the general models πij(c) = f(β, bi, c ij) and πij = g(β, bi), respectively. In this appendix we discuss in more detail how these models might be specified. A common parameterization is the generalized linear mixed model
h(πij) = x′ijβ + z′ijbi,
where h is a link function, such as the log-odds or “logit” that we used, xij and zij are vectors of observed design variables and covariates corresponding to the fixed and random effects, respectively, β is a vector of unknown parameters, and bi is a random vector of unknown subject-specific parameters. In the first model the probability πij(c) is conditional on the confidence judgment, so xij, and possibly zij, will contain c ij or some function thereof. We showed that it is convenient to use the logit link function, with h(c ij) entering the model as a covariate, so that the parameters β and bi capture miscalibration. The random subject-specific parameters are assumed to follow a specified distribution; in our analyses we assumed that the bi are independently and identically distributed as multivariate normal, N(0, Σ).
An alternative parameterization is to write the model as a multilevel or hierarchical generalized linear model, as described in, for example, Goldstein (2010) and Raudenbush and Bryk (2001), respectively. Here we can write the 2-level multilevel model in two stages, where the level-1 within-subjects model is
h(πij) = v′ijδi,
and the level-2 between-subjects model is δi = Γwi + ui, where vij and wi are vectors of observed design variables or covariates that vary within subjects or between subjects, respectively.Footnote 8 For estimating πij(c) the vector vij would contain h(c ij). This model can be written in the mixed model parameterization given earlier by substituting the level-2 model for δi into the level-1 model, although note that in some cases some of the elements of δi will be fixed, so that ui has a degenerate distribution, in which case bi is a sub-vector of ui. To give a concrete example of both parameterizations, Equation (2) can be written as a generalized linear mixed model with x′ij = (1, DEBIASi, DISTANCEij, logit(c ij)), β′ = (β 0, β 1, β 2, β 3), zij = 1, and bi = b i0 in the generalized linear mixed model parameterization, and as v′ij = (1, DISTANCEij, logit(c ij)), w′i = (1, DEBIASi),
Γ = | γ 00  γ 01 |
    | γ 10  0    |
    | γ 20  0    |,
and ui = (u i0, 0, 0)′ using the multilevel model parameterization. Note that the multilevel model is then
logit(πij(c)) = γ 00 + γ 01DEBIASi + γ 10DISTANCEij + γ 20logit(c ij) + u i0,
which is equivalent to Equation (2) except for the change in notation. The choice of parameterization is largely a matter of preference and of the software used.
There is a fairly large literature on inference for generalized linear mixed and multilevel models. Our approach was to estimate β and Σ using maximum likelihood, approximating the integral in the likelihood using adaptive quadrature. Estimates of bi can be obtained using an empirical Bayes approach. The estimates π̂ij(c) and π̂ij are then obtained by replacing β and bi by their estimates in the model. Another potentially useful approach would be to specify a Bayesian probability model, with a prior distribution for β and Σ, and make inferences concerning the posterior distribution of πij(c) or πij using simulation-based methods.
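The plug-in step can be sketched numerically. All estimates below are hypothetical, chosen only to illustrate replacing β and bi by their estimates in a model of the form h(πij(c)) = x′ijβ + z′ijbi:

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical estimates: fixed effects for (intercept, distance,
# logit(c)) and one subject's estimated random intercept.
beta_hat = np.array([0.4, -0.007, 0.9])
b_hat_i = np.array([0.25])

def estimate_prob(x_ij, z_ij):
    """Plug-in estimate: pi_hat = h^{-1}(x'beta_hat + z'b_hat_i)."""
    return inv_logit(x_ij @ beta_hat + z_ij @ b_hat_i)

# One observation: distance of 180 inches, confidence judgment 0.75.
logit_c = np.log(0.75 / 0.25)
x = np.array([1.0, 180.0, logit_c])
z = np.array([1.0])
p = estimate_prob(x, z)
print(round(float(p), 3))
```

The resulting π̂ij(c) would then be compared with the judgment c ij = 0.75 to assess miscalibration for that single observation.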
The methodological approach described in this paper is quite general. One could potentially specify a useful model beyond the families of generalized linear mixed or multilevel models described here. For example, one might find it useful to consider models that are not linear on the scale of h(πij(c)) or h(πij), or alternative distributions for the subject-specific parameters, such as a mixture distribution. All that is necessary is a viable statistical model that provides good estimates of the conditional or marginal probabilities.
Appendix B: Software Implementation
This appendix gives the syntax for PROC GLIMMIX in SAS/STAT, Version 9.2 (SAS, 2008) for estimating the conditional and marginal probabilities (i.e., πij(c) and πij, respectively) for the two studies in McGraw et al. (2004). These models can also be implemented using the glmer() function in the lme4 package (Bates, Maechler, & Bolker, 2011) for R (R Development Core Team, 2011). We have included the corresponding syntax for glmer() for each model as well. The data are assumed to be in “long form,” where each observation/row in the data file corresponds to one trial for a given subject. The response variable result is a binary indicator variable for a successful basket. The explanatory variables distance, logitc, and debias correspond to DISTANCEij, logit(c ij), and DEBIASi, respectively, in Equations 1-4. The variable side indicates the side of the basket (left, center, or right) and generates the indicator variables LEFTij, CENTERij, and RIGHTij, as shown in Equation (3). Subjects are identified by id. The output variables probc and probm are π̂ij(c) and π̂ij, respectively.
Model 1: Estimating πij(c) for Study 1
PROC GLIMMIX Syntax
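A specification consistent with Equation (1) and the variables described above (the input data set name study1 and output data set name pred1 are illustrative):

```sas
proc glimmix data=study1 method=quad;
   class id;
   model result(event='1') = distance logitc
         / dist=binary link=logit solution;
   random intercept / subject=id;
   output out=pred1 pred(blup ilink)=probc;
run;
```

glmer() Syntax

```r
library(lme4)
m1 <- glmer(result ~ distance + logitc + (1 | id),
            data = study1, family = binomial)
study1$probc <- fitted(m1)
```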
Model 2: Estimating πij(c) for Study 2
PROC GLIMMIX Syntax
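A specification consistent with Equation (2), adding the debias effect (data set names are illustrative):

```sas
proc glimmix data=study2 method=quad;
   class id;
   model result(event='1') = debias distance logitc
         / dist=binary link=logit solution;
   random intercept / subject=id;
   output out=pred2 pred(blup ilink)=probc;
run;
```

glmer() Syntax

```r
library(lme4)
m2 <- glmer(result ~ debias + distance + logitc + (1 | id),
            data = study2, family = binomial)
study2$probc <- fitted(m2)
```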
Model 3: Estimating πij for Study 1
PROC GLIMMIX Syntax
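A specification consistent with Equation (3), with side indicators (no intercept) and a judge-specific distance slope (data set names are illustrative):

```sas
proc glimmix data=study1 method=quad;
   class id side;
   model result(event='1') = side distance
         / noint dist=binary link=logit solution;
   random intercept distance / subject=id;
   output out=pred3 pred(blup ilink)=probm;
run;
```

glmer() Syntax

```r
library(lme4)
m3 <- glmer(result ~ 0 + side + distance + (1 + distance | id),
            data = study1, family = binomial)
study1$probm <- fitted(m3)
```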
Model 4: Estimating πij for Study 2
PROC GLIMMIX Syntax
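A specification consistent with Equation (4), with the debias effect and a judge-specific distance slope (data set names are illustrative):

```sas
proc glimmix data=study2 method=quad;
   class id;
   model result(event='1') = debias distance
         / dist=binary link=logit solution;
   random intercept distance / subject=id;
   output out=pred4 pred(blup ilink)=probm;
run;
```

glmer() Syntax

```r
library(lme4)
m4 <- glmer(result ~ debias + distance + (1 + distance | id),
            data = study2, family = binomial)
study2$probm <- fitted(m4)
```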
For making inferences concerning the posterior distribution of πij(c) or πij, or functions thereof, we would suggest using OpenBUGS (Thomas, O’Hara, Ligges, & Sturtz, 2006). As an example, we give below one possible way to specify the probability model corresponding to Equation (1).
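One such specification (node names and the diffuse priors are illustrative; N denotes the number of trials and S the number of subjects):

```
model {
   for (k in 1:N) {
      result[k] ~ dbern(pi[k])
      logit(pi[k]) <- beta0 + beta1 * distance[k]
                      + beta2 * logitc[k] + b[id[k]]
   }
   for (i in 1:S) {
      b[i] ~ dnorm(0, tau)
   }
   beta0 ~ dnorm(0, 1.0E-3)
   beta1 ~ dnorm(0, 1.0E-3)
   beta2 ~ dnorm(0, 1.0E-3)
   tau ~ dgamma(1.0E-3, 1.0E-3)
   sigma.b <- 1 / sqrt(tau)
}
```

Posterior draws of pi[k] then provide simulation-based inferences for πij(c) or any function of it.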