1 Introduction
Common sense would suggest that it is always better to have more, rather than less, relevant information when making a decision. Most normative and prescriptive theories of multi-attribute decision making are compensatory models that incorporate all relevant variables. This perspective was challenged by Gigerenzer and Goldstein (1996) and Gigerenzer, Todd, and the ABC Group (1999), who proposed a theoretical framework of simple decision rules, often referred to as “fast and frugal” heuristics, suggesting that in some cases a decision maker (DM) utilizing less relevant information may actually outperform an idealized DM utilizing all relevant information. In fact, many of these heuristics use a single cue selected among the many available for the prediction task. Key among these single-variable decision rules is the Recognition Heuristic (RH) (Gigerenzer & Goldstein, 1996; Gigerenzer et al., 1999; Goldstein & Gigerenzer, 2002).
A rapidly growing empirical literature suggests that single-variable decision rules are descriptive for at least a subset of DMs with regard to both Take The Best (Bröder, 2000; Bröder & Schiffer, 2003; Newell & Shanks, 2003) and the RH (Goldstein & Gigerenzer, 2002; Hertwig & Todd, 2003; Pachur & Biele, 2007; Scheibehenne & Bröder, 2007; Serwe & Frings, 2006; Snook & Cullen, 2006). Questions remain, however, as to exactly when and why a single-variable rule will perform well.
Hogarth and Karelaia (2005) examined the relative performance of single-variable rules within a binary choice framework where both predictor (independent) and criterion (dependent) variables were assumed to be continuous. Using a combination of analytic tools and simulations, they found that single-variable rules have strong predictive accuracy when: 1) all predictors are highly and positively inter-correlated, and 2) the single predictor used is highly (and, typically, positively) correlated with the criterion (see footnote 1). Hogarth and Karelaia (2006) conducted a related analysis using binary rather than continuous cues (predictors). Fasolo, McClelland, and Todd (2007) identified similar favorable conditions for single-variable rules using a series of simulations. Shanteau and Thomas (2000) labeled environments with highly positively correlated predictors “friendly” environments, and demonstrated in a simulation that single-variable rules tended to underperform when the predictors in the model were negatively correlated, a finding later replicated by Fasolo et al. (2007) (see also Martignon & Hoffrage, 1999; and Payne, Bettman, & Johnson, 1993). In a similar vein, Baucells, Carrasco, and Hogarth (2008) presented a framework to analyze simple decision rules within the context of cumulative dominance (see also Katsikopoulos & Martignon, 2006).
In this paper, we present results regarding the effectiveness of single-variable rules that diverge from previous studies in two major ways. First, we show that when a single predictor, denoted without loss of generality x_1, is positively correlated with an array of p-many other predictors, where each of these p-many predictors is either uncorrelated or weakly positively correlated with the others, the optimal weighting scheme places greater weight on x_1 than on any of the remaining cues. This result is a major departure from previous studies (Fasolo et al., 2007; Hogarth & Karelaia, 2005; Shanteau & Thomas, 2000) that identify favorable conditions for single-variable rules as “single-factor” models where all variables are highly positively correlated. Second, our results do not rely on any specific assumptions about the cue validities, defined as the correlations between the predictors (cues) and the criterion. All that matters is the sign of the single cue’s correlations with the other predictors. Some lexicographic single-variable rules depend upon either the knowledge or estimation of all cue validities. For example, the Take The Best rule (Gigerenzer & Goldstein, 1996; Gigerenzer et al., 1999) depends on the identification of the single best cue. In our results, the validity of the single variable need not be the highest among the available predictors. We apply the new results to the RH, for which Goldstein and Gigerenzer (2002) argue that the recognition validity, i.e., the correlation between the criterion and recognition, is inaccessible to the DM, with the criterion variable influencing recognition through mediator variables in the environment.
The divergence in our results stems from our somewhat different approach to answering the question “when does less information lead to better performance?” First, we characterize the RH within the framework of the linear model — i.e., within the standard regression framework — as an “estimator” that relies on a single predictor. As in the regression framework, we conceptualize the best set of weights to assign to the cues, i.e., the weights that would maximize predictive accuracy given unlimited data and knowledge, and call this vector of weights β. We then consider the weights that would be placed on the cues by various decision heuristics (such as equal weighting) as if they were estimates of β. Within this framework, we can prove results with respect to a measure of statistical inaccuracy called “risk” (defined formally in Section 2), which measures how close these heuristic weights are expected to be to the best weights, β. This measure of inaccuracy is particularly useful because it goes beyond optimizing within a single sample; the closer a set of weights is to the best weights, the more robustly it cross-validates in new samples. An advantage to casting the problem within the framework of the linear model is that the results can be generalized to accommodate a broad range of situations, including choosing between more than two options and binary or continuous representations of recognition, and to evaluate the success of any single-variable rule, not just the RH.
We show that under certain conditions, placing greater weight on a single variable relative to all others represents a form of optimization: it minimizes the maximum value of risk over all choices of decision rules. As the number of cues becomes large, this mini-max strategy converges to a rule that puts a large weight on a single cue and minimally weights all others. We use the term “over-weighting” to describe this effect of a single predictor cue receiving disproportionately more weight than any other predictor cue according to an optimal weighting strategy. Further, we show that weighting the single cue and ignoring all others produces the same risk as ignoring the single cue and weighting all others, regardless of the number of cues. Previous research has shown that decision heuristics applied in this manner outperform standard regression models until samples become very large (Davis-Stober, Dana, & Budescu, 2010). Thus we expect, under the right conditions, a single variable to be just as accurate a predictor as the full set of predictors.
While this framework could be used to justify any single-variable heuristic, we argue that the sufficient conditions plausibly resemble environments in which one would use the RH, with recognition as the single cue. Indeed, we examine data used in prior recognition tasks (Goldstein, 1997) and show that they fit our sufficient conditions well. Our derivation does not assume a “cue selection” process. In other words, we presuppose that the DM always utilizes the single cue of interest. The RH theory is a natural application of these results, as this theory also does not presuppose a cue selection process, i.e., if one alternative is recognized and the other is not, then recognition is automatically the predictor cue of interest.
Why is recognition rational? Our results demonstrate that when a single cue (recognition) is positively correlated with all other cues (knowledge), then it is a mini-max strategy to over-weight the recognition cue. Interestingly, these results do not depend upon the validities of either the recognition or the knowledge cues.
This paper represents a convergence of two perspectives. On the one hand, we validate the ecological rationality of single-variable rules by recasting them as robust statistical estimators that minimize maximal risk within a linear model. From another perspective, our results suggest that the heuristics could work for no other reason than that they approximate an optimal statistical model, albeit with an objective function that has heretofore been unarticulated in this literature. Thus, while Goldstein and Gigerenzer (2002) see the RH approach as a contrast to heuristics being used as “imperfect versions of optimal statistical procedures,” it appears that the “Laplacean demon” (Gigerenzer & Goldstein, 1996; Todd, 1999) — with unlimited computational power — would use a rule much like the RH!
The remainder of the paper is organized as follows. First, we summarize recent advances on the evaluation of decision heuristics within the linear model and make some simplifying assumptions. We then describe the sufficient conditions for the single-variable “over-weighting” phenomenon, first considering the case of a single variable correlated with an array of weakly positively inter-correlated predictors followed by the more extreme case of a set of mutually uncorrelated predictors within this array. We then present an application of this theory to the RH and prove that under these conditions a DM utilizing only recognition will perform at least as well as a DM utilizing only knowledge. We then examine the inter-correlation matrix of an empirical study, finding preliminary support for the descriptive accuracy of these sufficient conditions. We conclude with a summary and discussion of these results and potential applications and implications.
2 The linear model
We consider decision heuristics as a set of weighting schemes embedded within the linear model, a standard formulation when evaluating performance, e.g., when one compares performance to that of regression (e.g., Martignon & Hoffrage, 2002; Hogarth & Karelaia, 2005). Specifically, we re-cast decision heuristics as “improper” linear models (Dawes, 1979) within a linear estimation framework, treating each weighting scheme as an estimator of the true relationship between the criterion and the predictors. This formulation allows us to evaluate different weighting schemes by a standard statistical measure of performance, risk, utilizing recent advances in the evaluation of “improper” models (Davis-Stober et al., 2010).
The standard linear model is defined as:

$$Y_i = \sum_{j=1}^{p+1} \beta_j X_{ij} + \epsilon_i, \qquad (1)$$

where ε_i ∼ (0, σ²). We include p+1 predictors to distinguish between the single predictor and the other p. In this paper, we use the linear model interchangeably as either a representation of a criterion (target) variable with environmental predictors or as a representation of a DM’s utility for choice alternative C_i (Keeney & Raiffa, 1993), which can be expressed as:

$$U(C_i) = \sum_{j=1}^{p+1} \beta_j X_{ij} + \epsilon_i. \qquad (2)$$
In this case, the weights, β_j, reflect a DM’s true underlying utility. The results in this paper are general and can be applied to model either the environment (1) or the individual (2). The environmental case (1) is particularly applicable to recognition. In this case, the performance of an individual DM is a function of the true underlying relationship between recognition, environmental cues, and the criterion of interest.
For either case, we are interested in examining different estimators of β, denoted by the vector β̂. We apply the standard statistical benchmark, risk, to assess the performance of an estimator of the “true” relationship between a criterion and predictors. Risk for an estimator β̂ is defined as

$$\mathrm{Risk}(\hat{\beta}) = E\left(\lVert \hat{\beta} - \beta \rVert^2\right), \qquad (3)$$
where E is the expectation of a random variable and ‖β̂ − β‖² is the sum of squared differences between the coefficients of β̂ and β. Informally, (3) is a measure of how “far” an estimator is expected to be from the “true” value of β. Risk (also known in the literature as Mean Squared Error, or MSE) can be decomposed into the squared bias of the estimates and their variance. Unlike other criteria that focus on unbiasedness, risk minimization provides a more flexible framework in which bias and variability are traded off. In many cases estimates with relatively low bias can reduce the variance considerably and would be considered superior with respect to their overall risk. This definition of risk allows us to directly compare fixed weighting schemes to other statistical estimators, and to each other. Our definition of risk bears some qualitative similarity to the “matching” index (Hammond, Hursch, & Todd, 1964) applied to the well-known lens model (Brunswik, 1952). It is a measure of fit between a person’s judgment, or choice of weights, and an optimal weighting scheme within the environment.
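To make the risk criterion concrete, the following minimal sketch (ours, not from the original derivations; all names and parameter values are illustrative) estimates (3) by Monte Carlo simulation for any estimator of β, with OLS as a baseline:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_risk(estimator, beta, X, sigma, n_rep=2000):
    """Monte Carlo estimate of Risk = E||beta_hat - beta||^2, per (3)."""
    total = 0.0
    for _ in range(n_rep):
        y = X @ beta + rng.normal(0.0, sigma, size=X.shape[0])
        total += np.sum((estimator(X, y) - beta) ** 2)
    return total / n_rep

def ols(X, y):
    """Ordinary least squares: unbiased, but high variance in small samples."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

X = rng.standard_normal((20, 4))        # n = 20 cases, p + 1 = 4 predictors
beta = np.array([1.0, 0.5, 0.5, 0.5])   # "true" weights
print(mc_risk(ols, beta, X, sigma=2.0)) # squared bias plus variance
```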
2.1 An optimality result on weighting rules
To compare various weighting schemes within the context of the linear model, we use what are commonly referred to as “improper” linear models (Dawes, 1979; Dawes & Corrigan, 1974). These are fixed, pre-determined weighting schemes chosen independently of the data collected. Let a denote an improper weighting vector. An example would be “Dawes Rule,” where a is a vector of all ones (Dawes, 1979). In this example, the weights β_j in (1) and (2) are replaced with ones, resulting in an equal weighting of all the predictors (these predictors are typically assumed to be standardized and/or properly calibrated). When considered as estimators, improper models are clearly both biased and inconsistent, yet many of these a priori weighting schemes have been shown to be surprisingly accurate. For example, Wainer (1976) and Dawes (1979) have demonstrated the excellent predictive accuracy of equally weighting all predictors in a linear regression model.
We are interested in examining various choices of weights, a, using the criterion of risk (see (3)). To proceed, we re-cast the weighting vector a as a statistical estimator of β. Davis-Stober et al. (2010) define a constrained linear estimator, β̂_a, as

$$\hat{\beta}_a = \left(\frac{a^\top X^\top y}{a^\top X^\top X a}\right) a, \qquad (4)$$

where X is the matrix that consists of the values of all p+1 predictors for all n cases, and y is a series of n observations of the criterion (target) variable Y in (1) and (2).
The constrained estimator (4) is a representation of the “improper” weighting scheme, a, in the sense that it is the least-squares solution to the estimation of β subject to the constraints in a. In other words, if the weighting scheme a is a vector of ones (i.e., equal weights) then β̂_{a,1} = β̂_{a,2} = β̂_{a,3} = … = β̂_{a,p+1}. The reader should note that while data are being used to scale the weighting vector a in (4), the resulting coefficient of determination, R², is the same as if only a had been used (because R² is invariant under linear transformation). Thus, this formulation will not affect the outcomes in the binary choice task considered, nor would it ever change a DM’s ranking of several objects on the criterion. The formulation in (4), however, allows us to define and compute an upper bound on the maximal risk for arbitrary choices of the weighting vector a.
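As a code-level sketch of (4) (our reading of it as a least-squares rescaling of the fixed direction a; the helper name is ours), note that only the overall length of a is fit to the data, so R² and all implied rankings are untouched:

```python
def constrained_estimator(a, X, y):
    """Improper weighting scheme a, rescaled by least squares (Equation 4):
    beta_hat_a = (a'X'y / a'X'Xa) * a. The relative weights never change."""
    a = np.asarray(a, dtype=float)
    c = (a @ X.T @ y) / (a @ X.T @ X @ a)
    return c * a

# Example: "Dawes Rule" (equal weights) on the data from the previous snippet.
y = X @ beta + rng.normal(0.0, 2.0, size=20)
print(constrained_estimator(np.ones(4), X, y))
```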
Given a choice of weighting vector a and design matrix X, Davis-Stober et al. (2010) proved that the maximal risk of (4) is as follows,

$$\max \mathrm{Risk}(\hat{\beta}_a) = \lVert \beta \rVert^2 \, \frac{(a^\top a)\left(a^\top (X^\top X)^2 a\right)}{\left(a^\top X^\top X a\right)^2} + \sigma^2 \, \frac{a^\top a}{a^\top X^\top X a}, \qquad (5)$$
where ‖β‖² denotes the sum of squared coefficients of β. Equation (5) allows us to measure the performance of a particular choice of a. Davis-Stober et al. (2010) provide an analysis of (5) for several well-known choices of a, including: equal weights (Dawes, 1979), weighting a subset of cues while ignoring others, and unit weighting (Einhorn & Hogarth, 1975).
It is often useful, given a choice of X, to know the optimal choice of weighting vector a. Davis-Stober et al. (2010) prove that the value of a that minimizes maximal risk is the eigenvector corresponding to the largest eigenvalue of the matrix X′X.
THEOREM (Davis-Stober et al., 2010). Define λ_max as the largest eigenvalue of the matrix X′X. Assume ‖β‖² < ∞. The weighting vector a that is mini-max with respect to risk, over all choices of a, is the eigenvector corresponding to λ_max.
In other words, given a model of the matrix X ′X, it is possible to identify the fixed weighting scheme that minimizes maximal risk. Remarkably, this optimal weighting scheme does not depend on the individual predictor cue validities.
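In code, the theorem amounts to a single eigendecomposition. A sketch under our naming conventions, assuming standardized predictors so that X′X is proportional to the cue correlation matrix R_XX:

```python
def minimax_weights(R):
    """Mini-max fixed weighting: the eigenvector of X'X (equivalently, of
    R_XX under standardization) associated with its largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues in ascending order
    a = eigvecs[:, -1]
    return a if a.sum() >= 0 else -a       # resolve the arbitrary sign
```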
To summarize, each choice of weighting scheme, a, can be considered as an estimator of β in the linear model using (4). We can compute an upper bound on the maximal risk that each choice of a will incur using (5). Finally, given a model of X ′X, we can calculate the optimal choice of weighting vector a using the above theorem. In the next section, we apply these results to the case of a single predictor cue positively inter-correlated with an array of p-many other predictor cues, all of which are weakly positively inter-correlated and, in the limit, mutually uncorrelated.
3 Conditions for over-weighting a single predictor
3.1 The case of inter-correlated predictors
Henceforth, assume that all the variables are standardized such that X̄_i = 0 and s²_{X_i} = 1 for all i. This implies that X′X = (n − 1)R_XX, where n is the sample size and R_XX is the correlation matrix of the predictor variables. Let r_ij denote the (i,j)th entry of R_XX. For our analysis of single-variable rules, we are interested in the case when there are a total of (p+1)-many predictors, where p-many predictors follow a “single-factor” design, i.e., all p-many predictors in this array are assumed to be inter-correlated at r′, with the remaining predictor equally inter-correlated with all other predictors at r.
CONDITION 1. Assume that R_XX is comprised of p+1 predictor cues, p-many of which are inter-correlated at r′, with the remaining predictor cue correlated with all others at r. Assume r > 0 and r′ ≥ 0.
For the case of p = 5, Condition 1 specifies the following correlation matrix:

$$R_{XX} = \begin{pmatrix}
1 & r & r & r & r & r \\
r & 1 & r' & r' & r' & r' \\
r & r' & 1 & r' & r' & r' \\
r & r' & r' & 1 & r' & r' \\
r & r' & r' & r' & 1 & r' \\
r & r' & r' & r' & r' & 1
\end{pmatrix}.$$
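For the numerical checks below, a small helper (ours) generates matrices of this form for arbitrary p, r, and r′:

```python
def condition1_matrix(p, r, r_prime):
    """(p+1) x (p+1) correlation matrix of Condition 1: x_1 correlates at r
    with every other cue; the remaining p cues inter-correlate at r'."""
    R = np.full((p + 1, p + 1), r_prime)
    R[0, :] = r
    R[:, 0] = r
    np.fill_diagonal(R, 1.0)
    return R
```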
This structure is a special case of the two-factor oblique correlation matrix presented in Davis-Stober et al. (2010), who provided a closed-form solution for all eigenvalues and corresponding eigenvectors of this matrix. Applying those results, we obtain the maximum eigenvalue for matrices satisfying Condition 1 as

$$\lambda_{\max} = 1 + \frac{(p-1)r' + \sqrt{(p-1)^2 r'^2 + 4pr^2}}{2},$$

with a corresponding (un-normed) eigenvector with the first entry valued at

$$w = \frac{(1-p)r' + \sqrt{(p-1)^2 r'^2 + 4pr^2}}{2r},$$
with the remaining p-many entries valued at 1. Applying the previous theorem, we can now state precisely the mini-max choice of weighting scheme for matrices satisfying Condition 1.
RESULT 1. Assume the matrix structure described in Condition 1. Let

$$w = \frac{(1-p)r' + \sqrt{(p-1)^2 r'^2 + 4pr^2}}{2r}.$$

The normed choice of weighting vector that is mini-max with respect to risk, denoted a*, is:

$$a^* = \frac{1}{\sqrt{w^2 + p}}\,(w, 1, 1, \ldots, 1)^\top. \qquad (6)$$
Note that the relative weights of the predictors are unchanged by scalar multiplication.
Result 1 presents the mini-max choice of a fixed weighting vector for the inter-correlation matrix specified in Condition 1. In other words, given a linear model structure as in (1)–(2) and assuming the inter-correlation matrix of the predictor cues follows the structure in Condition 1, (6) is the best choice of fixed weighting scheme with respect to minimizing maximal risk. This leads to an important question. Assuming Condition 1 holds, when will it be a mini-max strategy to place more weight on a single predictor cue (in our case x_1) than on any other cue? The next result answers this question.
RESULT 2. Assume R_XX is defined as in Condition 1. Let a* be the mini-max choice of weighting scheme (Equation 6). Let a_i^* denote the ith element of the vector a*. Then a_1^* > a_i^*, ∀ i ∈ {2, 3, …, p+1}, if, and only if, r > r′.
PROOF. Applying Result 1, a_1^* > a_i^*, ∀ i ∈ {2, 3, …, p+1}, if, and only if, w > 1, i.e.,

$$\sqrt{(p-1)^2 r'^2 + 4pr^2} > 2r + (p-1)r',$$

which, upon squaring both (positive) sides and simplifying, holds if, and only if, r > r′. □
Result 2 states that it is an optimal mini-max strategy, with respect to risk, to overweight a single predictor when that single predictor positively correlates with the other predictors, which are weakly positively inter-correlated (see footnote 2) or mutually uncorrelated. Result 2 suggests that the overweighting effect of the first predictor becomes more pronounced as the ratio r′/r becomes small and/or as the number of predictors increases. Figure 1 plots the overweighting ratio a_1^*/a_i^* as a function of p and r for a fixed value of r′ = .30. As expected, the relative weight placed on the first predictor increases smoothly as a function of both p and r.
As an example, let p = 5, r = .6, r′ = .25. This gives the following matrix for six predictors:

$$R_{XX} = \begin{pmatrix}
1 & .6 & .6 & .6 & .6 & .6 \\
.6 & 1 & .25 & .25 & .25 & .25 \\
.6 & .25 & 1 & .25 & .25 & .25 \\
.6 & .25 & .25 & 1 & .25 & .25 \\
.6 & .25 & .25 & .25 & 1 & .25 \\
.6 & .25 & .25 & .25 & .25 & 1
\end{pmatrix}.$$
R_XX has the following mini-max choice of a,

$$a^* = (.570, .367, .367, .367, .367, .367)^\top.$$
In this example, the first element in the mini-max choice of weighting scheme, a_1^*, is over one and a half times as large as each of the remaining weights. This overweighting effect becomes even more pronounced when the predictors in this p-element array are mutually uncorrelated.
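Using the helpers defined above, the closed form of Result 1 can be checked against a direct eigendecomposition for this example (a sketch; the printout is incidental):

```python
p, r, r_prime = 5, 0.6, 0.25

# First entry of the un-normed eigenvector, per Result 1.
w = ((1 - p) * r_prime
     + np.sqrt((p - 1) ** 2 * r_prime ** 2 + 4 * p * r ** 2)) / (2 * r)
a_closed = np.append(w, np.ones(p))
a_closed /= np.linalg.norm(a_closed)       # ~ (.570, .367, ..., .367)

a_numeric = minimax_weights(condition1_matrix(p, r, r_prime))
assert np.allclose(a_closed, a_numeric)
print(a_closed[0] / a_closed[1])           # over-weighting ratio w ~ 1.55
```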
3.2 The case of mutually uncorrelated predictors
In contrast to the conditions identified in Hogarth and Karelaia (2005) and Fasolo et al. (2007), this over-weighting effect becomes more pronounced as the number of predictors increases and as the inter-correlations of the p-many predictor array approach 0. In this section, we analyze the extreme case of a single cue positively correlated with p-many predictors which are mutually uncorrelated. This is a special case of Result 2 when r′ = 0. In the asymptotic case, where the number of predictors is unbounded, the weights on the p-many predictors become arbitrarily small.
CONDITION 2. Let R_XX be defined as in Condition 1 and assume r′ = 0.

Condition 2 is a special case of Condition 1 so, applying Result 1, the maximal eigenvalue of R_XX under Condition 2 is equal to

$$\lambda_{\max} = 1 + r\sqrt{p},$$

and the corresponding normed eigenvector (and the mini-max choice of a) equals

$$a^* = \frac{1}{\sqrt{2p}}\,(\sqrt{p}, 1, 1, \ldots, 1)^\top. \qquad (7)$$
It is easy to see that as the number of predictors increases, i.e., as p becomes large, the weights on the p-many predictor cue entries come arbitrarily close to 0 while a_1^* remains constant at 1/√2 ≈ .707. Put another way, when r′ = 0, it is mini-max with respect to risk for a decision maker to heavily weight the single predictor. The effectiveness of this strategy becomes more pronounced as the number of mutually uncorrelated predictors increases.
As a numerical example, let p = 5, r′ = 0. This gives the following matrix for six predictors (any r > 0 will do; see below):

$$R_{XX} = \begin{pmatrix}
1 & r & r & r & r & r \\
r & 1 & 0 & 0 & 0 & 0 \\
r & 0 & 1 & 0 & 0 & 0 \\
r & 0 & 0 & 1 & 0 & 0 \\
r & 0 & 0 & 0 & 1 & 0 \\
r & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$
R_XX has the following mini-max choice of a,

$$a^* = (.707, .316, .316, .316, .316, .316)^\top.$$
For this example, the weight on x_1 is more than double the weight on each of the other predictors. In general, the ratio a_1^*/a_i^* = √p. Thus, if, for example, p = 100, the weight on x_1 would be 10 times as large as the weight on any other predictor. We emphasize that the mini-max choice of weighting under Condition 2 does not depend upon the value of r.
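Both properties, the constant first weight of 1/√2 and the √p ratio, are easy to confirm numerically (the value of r is illustrative; any positive value yields the same weights):

```python
for p in (5, 25, 100):
    a = minimax_weights(condition1_matrix(p, r=0.3, r_prime=0.0))
    print(p, round(a[0], 3), round(a[0] / a[1], 3))  # 0.707 and sqrt(p)
```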
The preceding results are a mathematical abstraction, in the sense that real predictors are typically neither all equally correlated nor all mutually uncorrelated. However, it is clear from these derivations that this over-weighting effect of a single variable, with respect to risk, will occur whenever the single variable of interest is positively correlated with the other predictors and this correlation dominates the inter-correlations of the remaining predictors. In the next section we demonstrate how this over-weighting effect applies to the performance of the Recognition Heuristic, showing that under Condition 2 a DM using only recognition will perform as well as a DM using only knowledge.
4 The Recognition Heuristic
The RH is arguably the simplest of the “fast and frugal” heuristics that make up the “adaptive cognitive toolbox” (e.g., Gigerenzer et al., 1999). Quite simply, if a DM is choosing between two alternatives based on a target criterion, the recognized alternative is selected over the one that is not. Goldstein and Gigerenzer (2002) state that a “less-is-more effect” will be encountered whenever the probability of a correct response using only recognition is greater than the probability of a correct response using knowledge (equations (1) and (2) in Goldstein & Gigerenzer, 2002), in their words: “whenever the accuracy of mere recognition is greater than the accuracy achievable when both objects are recognized.” Goldstein and Gigerenzer’s conditions for the success of the recognition heuristic depend upon the recognition validity, i.e., the correlation between the target criterion and recognition, which is assumed not to be directly accessible to the DM. Other environmental variables, called mediator variables, which are positively correlated with both the target criterion (the ecological correlation) and the probability of recognition (the surrogate correlation), influence the DM as proxies for the target criterion. We build upon the recognition heuristic theory by defining a condition on the predictor cues that leads to a mini-max strategy closely mirroring the recognition heuristic. Conditions 1 and 2 depend only upon observable predictor cue inter-correlations and do not require the estimation or assumption of either a recognition validity or corresponding surrogate and ecological correlations.
4.1 The Recognition Heuristic within a linear models framework
In this section we present a linear interpretation of the recognition heuristic by defining weighting schemes representing a DM using only recognition or only knowledge. We present a proof (Result 3) that a DM using only recognition will incur precisely the same amount of maximal risk as a DM using only knowledge under Condition 2. This result states that a DM using only recognition will do at least as well as using only knowledge with respect to maximal risk. We compare these weighting schemes to each other, the optimal mini-max weighting, and Ordinary Least Squares (OLS).
To apply the optimality results presented above, we must first re-cast the RH within the linear model. Let x_{i1} be the recognition response variable with respect to the RH theory. Let the remaining p-many predictors correspond to an array of predictors relevant to the criterion under consideration. Within the RH framework, we shall refer to these p-many variables as knowledge variables. This gives the following model,

$$Y_i = \beta_1 x_{i1} + \sum_{j=2}^{p+1} \beta_j x_{ij} + \epsilon_i. \qquad (8)$$
The recognition variable, x_{i1}, is typically conceptualized as a binary variable (e.g., Goldstein & Gigerenzer, 2002). Pleskac (2007) presents an analysis of the recognition heuristic in which recognition is modeled as a continuous variable integrated into a signal detection framework (see also Schooler & Hertwig, 2005). Note that our general result on maximal risk (5) holds for an arbitrary matrix, X, in which predictor cues can be binary, continuous, or any combination of the two types.
We use maximal risk as the principal measure of performance within the linear model, in contrast to previous studies that used percentage correct within a two-alternative forced-choice framework as a measure of model performance (e.g., Goldstein & Gigerenzer, 2002; Hogarth & Karelaia, 2005). As such, we do not compare individual values of Y for different choice alternatives. We instead make the assumption that a DM using the “true” β as a weighting scheme will outperform a DM using any other choice of weighting scheme. In our framework, minimizing maximal risk corresponds to a DM using a fixed weighting scheme that is “closer” to the “true” β than any other fixed weighting scheme.
The Recognition Heuristic assumes that the DM can find him/herself in one of three mutually exclusive cases that induce different response strategies: 1) neither alternative is recognized and one must be selected at random, 2) both alternatives are recognized and only knowledge can be applied to choose one of them, and 3) only one alternative is recognized (the other is not), so the DM selects the alternative that is recognized. To model this process, we consider the following two choices of the weighting vector a (see the code sketch after the list):
• Only Recognition: let the first entry of a_R be valued at one and all other entries at zero, a_R = (1, 0, 0, …, 0)⊤.

• Only Knowledge: let the first entry of a_K be valued at zero and all other entries at one, a_K = (0, 1, 1, …, 1)⊤.
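In code, the two schemes are simply complementary 0/1 weighting vectors (a small sketch; the number of knowledge cues is illustrative and matches the Figure 3 setting below):

```python
p = 3                                  # number of knowledge cues (illustrative)
a_R = np.eye(p + 1)[0]                 # only recognition: (1, 0, ..., 0)
a_K = np.ones(p + 1)
a_K[0] = 0.0                           # only knowledge:   (0, 1, ..., 1)
```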
In the context of the linear model (8), the act of recognizing an item but having no knowledge of that item is modeled by a_R, i.e., recognizing one choice alternative without applying knowledge to the decision. The act of only using knowledge to select between items is modeled by a_K, corresponding to a DM recognizing both choice alternatives and relying exclusively on knowledge to make the decision. We do not analyze the trivial case of neither alternative being recognized. To facilitate comparison of these weights, let a_M be the mini-max weighting scheme from Result 1 and let OLS denote the ordinary least squares estimate.
We can now describe and analyze, in the language of the linear model, when “less” information will lead to better performance. When will a_R perform as well as, or better than, a_K? Consider Condition 2, i.e., the case when the p-many predictor cues are mutually uncorrelated, r′ = 0. Under these circumstances the over-weighting effect of the mini-max vector is most pronounced and, surprisingly, the weighting schemes a_R and a_K incur exactly the same amount of maximal risk.
RESULT 3. Let R_XX be defined as in Condition 2, i.e., r′ = 0. Let a_R and a_K be defined as above. Then max Risk(a_R) = max Risk(a_K).
PROOF. The proof follows by direct calculation using (5). Under Condition 2, X′X = (n−1)R_XX, and

$$\max \mathrm{Risk}(a_R) = \lVert \beta \rVert^2 (1 + pr^2) + \frac{\sigma^2}{n-1} = \max \mathrm{Risk}(a_K). \qquad \square$$
Result 3 provides a new perspective on the phenomenon of “less” information leading to better performance: A simple single-variable decision heuristic (e.g., the recognition heuristic) is at least as good in terms of risk as equally weighting the remaining cues, e.g., knowledge.
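Result 3 can also be verified numerically by coding the maximal-risk formula (5) directly. A sketch under our reconstruction of (5), reusing the vectors a_R and a_K defined above; n, σ², and ‖β‖² are illustrative values:

```python
def max_risk(a, R, n, sigma2, beta_sq):
    """Maximal risk (5) of the constrained estimator, with X'X = (n-1)R."""
    M = (n - 1) * R
    aMa = a @ M @ a
    bias = beta_sq * (a @ a) * (a @ M @ M @ a) / aMa ** 2
    return bias + sigma2 * (a @ a) / aMa

R = condition1_matrix(p, r=0.3, r_prime=0.0)             # Condition 2
print(max_risk(a_R, R, n=20, sigma2=2.0, beta_sq=1.0))   # identical values,
print(max_risk(a_K, R, n=20, sigma2=2.0, beta_sq=1.0))   # as Result 3 predicts
```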
We can directly compare the performance of a_K, a_R, a_M, and OLS by placing some weak assumptions on the inter-correlation matrix of the predictors and the values of σ² and ‖β‖. First, we can bound ‖β‖² by applying the inequality ‖β‖² ≤ R²/λ_min, where R² is the coefficient of multiple determination of the linear model and λ_min is the smallest eigenvalue of the matrix R_XX. Now, assuming values of σ² and R², and a structure of R_XX, we can compare the maximal risk for different weighting vectors as a function of sample size. Under our framework, sample size does not play a role in determining the mini-max choice of fixed weighting scheme; however, we must consider sample size when comparing values of risk to other estimators, in this case OLS.
How do recognition and knowledge compare with the mini-max choice of weighting scheme? Figure 2 displays the maximal risk for a_K, a_R, a_M, and OLS under Condition 1 for p = 6, r = .6, r′ = .25, and σ² = 2. The three panels of Figure 2 correspond to R² = .3, .4, and .5.
The performances of the knowledge, recognition, and mini-max weightings are comparable in all three conditions, with the knowledge weighting, a_K, coming closest to the optimal weighting scheme. In the case of Condition 1, recognition does not win out over knowledge; however, recognition wins out over OLS for sample sizes as large as n = 30 in the case of R² = .3 and n = 20 in the case of R² = .5.
Figure 3 displays a corresponding set of graphs under Condition 2 for p = 3, r = .3, r′ = 0, and σ² = 2. As predicted by Result 3, recognition and knowledge have precisely the same maximal risk. The mini-max weighting scheme is quite comparable to knowledge/recognition performance. The performance of OLS improves in Condition 2; this follows from the matrix X′X having fewer inter-correlated predictors.
To summarize, under Condition 1, both recognition and knowledge compare favorably to OLS and to the mini-max choice of weighting. Here recognition does not perform as well as knowledge but is comparable. Under Condition 2, the maximal risks of recognition and knowledge are identical, as shown in Result 3, and closely resemble the performance of the mini-max choice of weighting vector. In all cases, the performance of the different weighting schemes is affected by the amount of variance in the error term ε_i, in agreement with Hogarth and Karelaia (2005).
4.2 Empirical example
As an empirical illustration of these analytic results, we examine previously unpublished pilot data collected during the dissertation research of Goldstein (1997). These data come from a study that uses the well-known “German Cities” experimental stimuli. These stimuli were used by Goldstein and Gigerenzer (2002) in a series of experiments to empirically validate the recognition heuristic, and in experiments that examined the Take The Best heuristic (Gigerenzer & Goldstein, 1996; Gigerenzer et al., 1999). These data (Goldstein, 1997) consist of recognition counts from 25 subjects who were asked to indicate whether or not they recognized each of the 83 cities based on their names alone. Each city was assigned a recognition score based on the number of subjects (out of a possible 25) who recognized it. In addition, we have nine binary attributes for each city that serve as predictors; see Table 1 for a description of the predictor cues (Gigerenzer et al., 1999). The inter-correlation matrix of the 10 predictors is presented in Table 2.
The data in Table 2 strongly resemble the optimality conditions described in Conditions 1 and 2. Recognition is significantly and positively correlated with 7 of the 9 predictor cue variables at p < .01. Twenty-four of the (9 × 8/2 =) 36 cue inter-correlations are not significantly greater than 0, which strongly resembles Condition 2. Each of the nine predictors in the array (x_2, x_3, …, x_10) tends to have its highest correlation with recognition: r_{1j} > r_{kj}, ∀ k ∈ {2, 3, …, 10}, holds for 7 of the 9 predictors, with at most one exception each. The average correlation of the recognition variable with the other nine cue predictors is .29, and the average inter-correlation of the 9 predictors is .11.
To summarize, these data are consistent with the structure of Conditions 1 and 2. The nine predictor cues for the cities appear to contribute “unique” pieces of information to the linear model (8), yet most of these predictors have large positive correlations with recognition. We conclude that in the case of the “German Cities” stimuli, it is an optimal mini-max strategy to “overweight” recognition.
5 Discussion
We presented a condition on the predictor cue inter-correlation matrix under which it is optimal, with respect to risk, to over-weight a single variable within a linear model. We derived the optimal weighting scheme (Result 1) and demonstrated when this over-weighting occurs (Result 2). To summarize, when a single cue correlates with the other predictors more strongly than they inter-correlate with each other, it is a mini-max strategy, with respect to risk, to over-weight the single cue. We applied these results to a prominent single-variable decision heuristic — the Recognition Heuristic (Goldstein & Gigerenzer, 2002) — and provided a condition under which a DM using solely recognition to choose between two choice alternatives would incur precisely the same maximal risk as a decision maker using only knowledge (Result 3). We illustrated these results by analyzing the inter-cue correlation matrix from an experiment using the well-known “German Cities” stimuli. This dataset provides empirical support for the descriptive accuracy of Conditions 1 and 2, and therefore for the over-weighting of a single predictor cue.
The performance of single-variable decision heuristics like the RH depends upon the complex interplay of a DM’s cognitive capacities and the structure of the environment, i.e., the relationships among predictor cues and the criterion variable. When all predictor cues are highly positively correlated we have conditions for a “flat maximum effect”; here a single-variable decision rule will do as well as any other simple weighting rule. In other words, any choice of weighting scheme utilizes essentially the same information in the environment. The DM benefits in such “single-factor” environments as he/she can use a less cognitively demanding strategy and not suffer any serious penalties for doing so. In such single-factor environments the optimal weighting scheme may not resemble a single-variable rule, but the differences in performance are so small that it hardly matters. The DM is “rational” because he/she is balancing cognitive effort with the demands of the environment, the “twin blades” of Simon’s scissors (Simon, 1990), or seeking to optimize the accuracy/effort tradeoff (Payne et al., 1993).
This article provides a new perspective on the “rationality” of single-variable rules such as the RH. The conditions we have identified are quite different from the conditions previously identified as favorable for single-variable rules, e.g., all predictor cues highly positively correlated. Surprisingly, the over-weighting effect becomes stronger as the inter-correlations in the predictor array become lower, peaking when the predictor cues are mutually uncorrelated. A decision maker employing a single-variable decision heuristic in this environment is “rational” in the sense that he/she is using a decision strategy that is not cognitively demanding and one that resembles an optimal strategy in terms of minimizing risk.
To clarify, we are not proposing a decision heuristic, e.g., pick the most highly correlated predictor cue; rather, we point out that in certain decision environments single-variable heuristics are almost optimal. Our definition of risk, a common benchmark in the statistical literature, may help explain why our “favorable” conditions for a single-variable rule differ from previous ones. Minimizing maximal risk is equivalent to choosing a weighting scheme that is, on average, closest to the true state of nature, β. This definition accounts for an infinity of possible relationships between the predictor cues and the criterion variable, focusing on the conditions that yield the least favorable relationships. By this measure, it is better to over-weight a single variable that contains some information about all of the predictor cues than to weight, say equally, all of the predictor cues, taking the chance that some combination of them may not be at all predictive of the criterion. In this way, we account for an infinity of possible validity structures, even though our formulation does not require any specific assumptions on the validities themselves.
To solve for mini-max risk, we only need to know the predictor (cue) inter-correlation matrix and identify the choice of weighting vector a. With minimal assumptions, this is tantamount to assuming a correlational structure on the predictor cues and deciding on a decision heuristic. As a result, we do not assume that the DM has knowledge of the individual validities, nor that he/she has an ability to estimate them, in contrast to the lexicographic single-variable heuristic Take The Best (e.g., Gigerenzer et al., 1999). We do assume in our analysis that the decision maker always utilizes the single predictor that conforms to our sufficient conditions. We do not assume any other structure to predictor cue selection or application. Thus, recognition is a natural application of these results, as the Recognition Heuristic does not presuppose a predictor cue selection process.
Our results easily extend to recent generalizations of the RH. Several studies have extended the application of the RH beyond simple 2-alternative forced choice to n-alternative forced choice (Beaman, McCloy, & Smith, 2006; Frosch, Beaman, & McCloy, 2007; McCloy & Beaman, 2004; McCloy, Beaman, & Goddard, 2006); see also Marewski, Gaissmaier, Schooler, Goldstein, & Gigerenzer (2010) for a related framework employing “consideration sets.” Our linear modeling framework does not depend upon the number of choice alternatives being considered. Our only assumption is that the DM differentially selects alternatives based on the value of the criterion as implied by his/her choice of predictor cue weights. Thus, moving from 2 to n alternatives under consideration does not change our key results: a DM utilizing our optimal weighting scheme to select among a set of alternatives will “on average” perform better than another DM using a sub-optimal weighting scheme. Our framework also allows the predictors to be binary, continuous, or any combination thereof. In this way, our results easily extend to continuous representations of recognition. Several authors have previously explored continuous representations of recognition within the contexts of both signal detection models (Pleskac, 2007) and the ACT-R framework (Schooler & Hertwig, 2005).
Although we applied our model successfully in the context of the RH, we must remain silent on the psychological nature of some key concepts underlying the theory in this domain. For example, we treat the p-many predictor cue “knowledge” array as an abstract quantity and subsequently do not place any special psychological restrictions or assumptions upon these predictor cues. We also recognize that there is an ongoing debate in the literature as to the interplay between knowledge, learning, and memory with regard to recognition, and that there are many factors that influence whether DMs’ choices are consistent with the RH (e.g., Newell & Fernandez, 2006; Pachur, Bröder, & Marewski, 2008; Pachur & Hertwig, 2006). However, our model is mute with respect to these debates.
The results we have presented are general, and while we have restricted their application to the study of recognition, they could be applied to other domains. For example, in a well-publicized study on heart disease, Batty, Deary, Benzeval, and Der (2010) found that IQ was a better predictor of heart disease than many other, more traditional, predictors such as income, blood pressure, and low physical activity. One possible explanation of their findings is that, historically, IQ tends to correlate to some degree with each of the remaining predictors (e.g., Batty, Deary, Schoon, & Gale, 2007; Ceci & Williams, 1997; Knecht et al., 2008). In light of the results presented in our paper, it is perhaps not so surprising that IQ is a “robust” predictor of heart disease.
Our results do not imply that ordinary least squares or other statistical/machine-learning weighting processes are “sub-optimal.” Given an infinite amount of data, weighting schemes derived from such processes would indeed be optimal with respect to almost any measure. Our results apply to fixed weighting schemes, which, in the context of small sample sizes, tend to perform very well as they do not “over-fit” the observed data. In other words, fixed weighting schemes cross-validate extremely well compared to other, more computationally intensive, estimation procedures in situations involving small sample sizes and/or large variances in the error terms.
Gigerenzer and colleagues (1996; 1999) argue that “fast and frugal” heuristics are a rational response to the “bounded” or “finite” computational mind of a decision maker. This article raises many new questions regarding the rationality of simple decision heuristics. Given infinite computational might and a relatively small sample of data, a “Laplacean Demon’s” choice of weighting scheme might resemble that of a “fast and frugal” decision maker. In other words, DMs could be reasoning like a “Laplacean Demon” would if the demon were given limited information, but retained the assumption of infinite computational ability.