1 Introduction
The history of science and technology repeatedly demonstrates that many laws are discovered and many inventions are made serendipitously, as a by-product when researchers are striving for something else. The recognition heuristic is just one example of this: It was formulated as a post-hoc explanation for a puzzling finding that was observed while attempting to test a specific prediction of the theory of Probabilistic Mental Models (PMM; Gigerenzer, Hoffrage, & Kleinbölting, 1991). While the other papers contained in this series of special issues and many of the references given therein illustrate how stimulating the formulation of the recognition heuristic was and how much research it has spurred, the present paper turns back the clock and reports three studies that were conducted in the late 1980s and early 1990s (Hoffrage, 1995) [Footnote 1].
This paper is organized as follows: The first part provides the historical context that led to the formulation of the recognition heuristic. At the outset of this part, a brief summary of PMM theory is given. Experiment 1 is then reported, which was conducted to address one of the criticisms of the theory, namely the confounding of sampling procedure and item difficulty. Specifically, we compared over/underconfidence in two item sets that were generated by the same sampling procedure but were nevertheless supposed to differ with respect to percentage correct. This attempt failed, yielding the counter-intuitive finding that German participants performed about the same when making comparisons between German cities as when making comparisons between U.S. cities. Experiment 2 reports a second, and this time successful, attempt to unconfound item difficulty and sampling procedure in order to answer the question that motivated Experiment 1. The second part is also historical: It reports Experiment 3, which provides, to the best of my knowledge, the first empirical test of the recognition heuristic. It was designed to find out whether the results obtained in Experiment 1 could be explained by participants having used the recognition heuristic. In this experiment, the participants’ knowledge and recognition of each city was elicited, and how often this could be used to make an inference was manipulated. We also manipulated the inclusion criterion (and, in turn, the size) of the reference class that the cities were drawn from when constructing the paired comparisons. The last part links the three “historical” experiments to later studies and measures, and discusses the theoretical relevance of the work described here.
2 The historical context of the recognition heuristic
2.1 The theory of Probabilistic Mental Models
Independently, Gigerenzer et al. (1991) with their PMM theory and Juslin (1994) developed what was later termed “ecological models” (McClelland & Bolger, 1994). When solving a task such as “Which city has more inhabitants, A or B?” people construct a PMM (unless they have direct knowledge or can deduce the answer with certainty, which we called a “local mental model”; Gigerenzer et al., 1991). By searching for probabilistic cues that discriminate between the two alternatives, the question is put into a larger context. Imagine that a search hits on the soccer-team cue: City A has a soccer team in the major league and City B does not. Based on the literature on automatic frequency processing, PMM theory posits that people are able to estimate the ecological validity of cues (as long as the objects belong to their natural environment). This validity is defined by the relative frequency of cases in the environment where the cue indicates the correct answer, given that the cue discriminates. For instance, the validity of the soccer-team cue is 90% (in the complete set of paired comparisons of all German cities with more than 100,000 inhabitants). If participants choose the city to which the cue points and report the cue validity as their confidence, they should be well calibrated. This, however, is true only if the cue validities in the item sample reflect the cue validities in the population. If researchers do not sample general-knowledge questions randomly, but over-represent items in which cue-based inferences would lead to wrong choices, overconfidence will occur. Such overconfidence does not reflect fallible reasoning processes but is an artifact of the way the experimenter sampled the stimuli and ultimately misrepresented the cue-criterion relations in the ecology. In two experiments, Gigerenzer et al. (1991) found exactly this: Overconfidence was observed for a set of selected items, but disappeared when the objects that were used in the paired comparisons were randomly sampled from a defined reference class.
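The definition of ecological validity given above can be made concrete with a minimal sketch. It assumes a binary cue and a toy environment; the function name `cue_validity`, the data layout, and the city values are hypothetical and serve only as illustration, not as the procedure used in the original studies.

```python
from itertools import combinations

def cue_validity(objects):
    """Ecological validity of a binary cue over all pairs of objects.

    `objects` is a list of (criterion_value, cue_value) tuples, where
    cue_value is 1 (cue present) or 0 (cue absent).
    Returns the proportion of discriminating pairs in which the cue points
    to the object with the higher criterion value, or None if the cue
    never discriminates.
    """
    discriminating = 0
    correct = 0
    for (crit_a, cue_a), (crit_b, cue_b) in combinations(objects, 2):
        # Assumption: pairs with tied criterion values are skipped.
        if cue_a == cue_b or crit_a == crit_b:
            continue
        discriminating += 1
        cue_points_to_a = cue_a > cue_b
        a_is_larger = crit_a > crit_b
        if cue_points_to_a == a_is_larger:
            correct += 1
    return correct / discriminating if discriminating else None

# Hypothetical toy environment:
# (population in thousands, has a major-league soccer team)
toy_cities = [(1700, 1), (1200, 1), (600, 0), (550, 1), (300, 0), (150, 0)]
validity = cue_validity(toy_cities)
```

In this toy environment the cue discriminates in nine pairs and points to the larger city in eight of them. Over-representing the one misleading pair in an item sample would lower the validity in the sample relative to the environment, which is the sampling artifact described above.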
The theory can also account for the common finding that average confidence judgments exceed average frequency estimates (“How many of the last 50 items did you answer correctly?”) by positing that different reference classes are used for the two kinds of judgments (for details, see Gigerenzer et al., 1991). When PMM theory was first published, we had a long list of criticisms and open questions that, in turn, gave rise to a series of studies in which attempts were made to falsify the theory in a true Popperian fashion [Footnote 2].
2.2 A failed attempt to unconfound sampling procedure and item difficulty (Experiment 1)
One of the established findings in research on overconfidence is the hard-easy effect (Hoffrage, 2004; Lichtenstein & Fischhoff, 1977), according to which overconfidence covaries with item difficulty: Hard item sets (i.e., those with a percentage of correct answers of about 75% or lower in a two-alternative forced-choice task) tend to produce overconfidence, whereas easy sets (i.e., those with a percentage correct of about 75% or higher) tend to produce underconfidence.
One of the problems of PMM theory was the fact that selected item sets turned out to be hard (e.g., for Experiments 1 and 2 of Gigerenzer et al., 1991, percentage correct was 52.9 and 56.2, respectively), whereas representative item sets turned out to be relatively easy (71.7 and 75.3, respectively). Therefore, even though PMM theory correctly predicted that overconfidence disappeared for the representative sets while it could be observed for the selected sets, these findings could, at the same time, be seen as just another example of the hard-easy effect. Hoffrage (1995) tried to shed some light on this issue by comparing two item sets, each consisting of paired comparisons for which the objects were generated by the same, representative, sampling procedure, but which nevertheless differed in difficulty (see also Kilcher, 1991). If PMM theory was correct, then overconfidence should disappear in both sets, whereas the hard-easy effect would be observed if there was overconfidence for the hard set, but no overconfidence for the easy set.
2.2.1 Method
Participants were mainly students of the University of Constance, Germany (n=56; 12 female, 44 male). Their task was to (1) repeatedly select, in a series of paired comparisons among cities, the city with more inhabitants, and (2) indicate their confidence in the correctness of their choices on a scale ranging from 50–100% in increments of 10%. Two item sets were used: comparisons between U.S. cities and comparisons between German cities. These item sets were constructed as follows: In the first phase, the largest 75 U.S. and the largest 75 West German cities (before Germany’s unification) were determined. Second, a random set of 39 cities was selected, and ranked according to population size. Third, two ranks were randomly determined and this pair of ranks constituted both the first pair of German cities and the first pair of U.S. cities. This procedure of randomly combining German and U.S. cities simultaneously was repeated until 100 comparisons among German cities and 100 comparisons among U.S. cities (with matched ranks) were determined, with the constraint that no pair appeared twice in the item set. Participants worked on both item sets, with order counterbalanced between-participants.
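The matched-rank construction can be sketched as follows. This is one possible reading of the procedure, not a reconstruction of the original software: it assumes that the 39 selected cities correspond to 39 ranks drawn from the 75 largest cities, and that each unordered pair of ranks is then applied to both the German and the U.S. list. The function name, parameters, and seed are hypothetical.

```python
import random

def sample_matched_rank_pairs(n_cities=75, n_selected=39, n_pairs=100, seed=0):
    """Draw distinct pairs of ranks to be used for both city lists."""
    rng = random.Random(seed)
    # Assumption: a random subset of ranks is selected from the largest cities.
    selected_ranks = sorted(rng.sample(range(1, n_cities + 1), n_selected))
    max_pairs = n_selected * (n_selected - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough distinct pairs for the requested item set")
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(selected_ranks, 2)
        # Store pairs in a canonical order so that no pair appears twice.
        pairs.add((min(a, b), max(a, b)))
    return sorted(pairs)

rank_pairs = sample_matched_rank_pairs()
# Each (rank_a, rank_b) would index both the German and the U.S. ranking.
```

Because the same rank pairs are used for both countries, any difference in percentage correct between the two item sets cannot be attributed to differences in the rank structure of the items.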
2.2.2 Results
The two item sets had almost the same difficulty (percentage correct for the German cities: 75.7% and for the U.S. cities: 76.0%). Mean confidence was higher for the German cities (79.5% vs. 72.3%), and thus participants were slightly overconfident for the German cities (3.8%), and slightly underconfident for the U.S. cities (−3.7%). A participant-specific analysis revealed the same tendency: For the German (U.S.) cities, 39 (22) participants were overconfident and 17 (34) were underconfident. A comparison between item sets within participants showed that 22 participants achieved a higher percentage of correct answers for the German cities, 29 participants achieved a higher percentage for the U.S. cities, and for the remaining 5 participants these percentages were the same. In contrast, for 51 participants, their mean confidence was higher for the German cities, for 4 participants it was higher for the U.S. cities, and for the remaining 1 participant there was a tie. Moreover, for 48 participants the overconfidence score (mean confidence minus percentage correct) was higher for the German cities and for 8 participants it was higher for the U.S. cities (no ties).
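As a small illustration of the overconfidence score (mean confidence minus percentage correct), the following sketch computes it from trial-level data and re-derives the two aggregate scores from the reported means. The function name and the four example trials are hypothetical.

```python
def overconfidence_score(trials):
    """Mean confidence minus percentage correct, in percentage points.

    `trials` is a list of (confidence_percent, was_correct) tuples.
    """
    mean_confidence = sum(conf for conf, _ in trials) / len(trials)
    percent_correct = 100 * sum(1 for _, ok in trials if ok) / len(trials)
    return mean_confidence - percent_correct

# Hypothetical participant: mean confidence 75, three of four correct.
example_trials = [(100, True), (80, True), (70, False), (50, True)]
example_score = overconfidence_score(example_trials)

# Aggregate scores implied by the means reported for Experiment 1.
german_score = round(79.5 - 75.7, 1)
us_score = round(72.3 - 76.0, 1)
```

A positive score indicates overconfidence and a negative score underconfidence; a score of zero, as for the hypothetical participant, corresponds to confidence that matches accuracy on average.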
2.2.3 Discussion
We expected that Germans would perform much better on the German city comparisons than on the U.S. city comparisons. Therefore, the main finding that item difficulty was practically the same for the two sets came as a complete surprise to us and gave rise to two questions. First, how else could the original intention, namely to unconfound sampling procedure and item difficulty, be achieved? And, second, how could the striking result of Experiment 1 be explained? I continue this report with the experiment that addressed the first of these questions.
2.3 A successful attempt to unconfound sampling procedure and item difficulty (Experiment 2)
This study was a second attempt to unconfound sampling procedure and item difficulty (Hoffrage, 1995, Exp. 5). In Experiment 1, I tried to achieve this by using two different item sets (German vs. U.S. cities), each consisting of comparisons that had to be made with respect to the same criterion (number of inhabitants). In Experiment 2, in contrast, I used only one reference class (famous people) but two different criteria: age at time of death (“Who lived to be older?”) and time of birth (“Who was born earlier?”). It was expected that the age questions would be relatively hard (think of Plato vs. Albert Einstein) and that the birth questions would be much easier (again, think of Plato vs. Einstein).
2.3.1 Method
Participants were 100 students from the University of Salzburg (31 male, 69 female). Comparisons were generated from a list of 286 famous names (for details, see Hoffrage, 1995). The criterion (age vs. birth questions) was manipulated within-participants, each of the two item sets consisted of 100 comparisons, and order was counterbalanced [Footnote 3].
2.3.2 Results and discussion
As expected, the age questions were much harder than the birth questions (percentage correct = 57.1 and 73.5, respectively, t(99)=21.3, p<.001). Mean confidence was much lower for the age questions (62.3% compared to 76.8% for the birth questions, t(99)=17.7, p<.001). Participants were slightly overconfident, both for the age and the birth questions (5.2% and 3.3%, respectively). Even though this difference of 1.9 percentage points was statistically significant (t(99)=1.99, p=.049), it was obtained with 100 participants in a within-participants design, and cannot be considered as substantial; in fact, the effect size of question type was small to medium (d=.4).
This time the attempt to generate two item sets through the same sampling procedure that differed with respect to percentage correct was successful. Even though there was slightly more overconfidence for the harder set (5.2%) than for the easier set (3.3%), the absolute difference was minuscule compared to the numbers in Lichtenstein and Fischhoff’s (1977) paper on the hard-easy effect. Moreover, the hard set had a percentage correct of 57.1% (compared to 73.5% for the easy set), which suggests that for the harder set, scale-end effects (Juslin, Wennerholm, & Olsson, 1999) and unsystematic error (Erev, Wallsten, & Budescu, 1994; Juslin & Olsson, 1997; Juslin, Olsson, & Björkman, 1997) contributed more to overconfidence than was the case for the easier set.
3 First empirical test of the recognition heuristic (Experiment 3)
Soon after the data of Experiment 1 were analyzed, we moved to the University of Salzburg. When we told our new colleagues about this puzzling result, one of them, Anton Kühberger, just repeated what we said, namely that “the participants had not even heard of many of the American cities” (see also the introduction of Gigerenzer and Goldstein, 2011). He then turned our own words into an explanation that we ourselves had not seen as such and that has, since then, been referred to as the recognition heuristic. He pointed out that this partial lack of knowledge was not an obstacle but something that the German students could exploit. Goldstein and Gigerenzer (2002) later formulated the recognition heuristic as follows: “If one of two objects is recognized and the other is not, then infer that the recognized object has the higher value with respect to the criterion” (p. 76).
The data from Experiment 1 could not be used to test this post-hoc recognition explanation, and so the following study was designed to determine whether people used the recognition heuristic when making inferences about city populations (Hoffrage, 1995; see also Schmuck, 1993). We first determined which cities a participant recognized and then manipulated how often the recognition heuristic could be used. In addition, we manipulated the size of the reference class to test whether, as explained below, this affected the participants’ confidence judgments.
3.1 Method
3.1.1 Participants
Participants were 60 students from the University of Salzburg, Austria (30 male, 30 female).
3.1.2 Design and materials
For each of the 100 pairs of U.S. cities that the participants saw, they had to select the city with more inhabitants and then give a confidence judgment. For half of the participants, the cities were taken from the set of all cities with more than 200,000 inhabitants, and for the rest, from all cities with more than 400,000 inhabitants. This factor is henceforth referred to as the size of the reference class, having a value of either 75 or 32 cities, respectively. It is important to note that the size refers to the number of objects with a criterion value that is higher than a specific threshold. Comparing the performance for a reference class of 75 cities, randomly drawn from the 100 largest cities, to the performance for a reference class of 32 cities, randomly drawn from the 100 largest cities, would not be instrumental to test the predictions concerning size of reference class laid out below. The second factor that was manipulated between-subjects was how often participants’ knowledge discriminated between the cities, henceforth referred to as the discrimination rate, with the levels of high, low, and uncontrolled. These two factors were fully crossed and 10 participants were randomly assigned to each of the resulting six conditions.
3.1.3 Procedure
The 40 participants who were assigned to either the high or low discrimination-rate conditions were informed that they were now “presented with a list of some American cities”. These were either the largest 32 or the largest 75 cities (manipulated between-participants, see above). The cities appeared in alphabetic order, and participants were asked to indicate, for each city, whether they (1) “know something about the city, that is, know more than just the name” (henceforth referred to as K, for more Knowledge), (2) “have heard the name of the city, but have no knowledge beyond that” (R, for Recognized name), or (3) “know nothing about the city and have not even heard its name” (U, for Unrecognized). These categorizations made it possible to generate six different types of pairs. As can be seen in Table 1, the number of pairs of a particular type differed between the discrimination-rate conditions. Specifically, for participants in the high discrimination-rate condition, the cities were combined such that the recognition heuristic could be used in 55 of the 100 comparisons (25 K-U and 30 R-U comparisons). In addition, there were 5 K-K comparisons and 30 K-R comparisons for which some other knowledge could possibly allow for an inference. Thus, depending on what this other knowledge was, there was a minimum of 55 and a maximum of 90 comparisons for which either recognition or other knowledge discriminated. In contrast, for the low discrimination-rate condition, the minimum was 25 (5 K-U + 20 R-U) and the maximum was 50 (all except 30 R-R and 20 U-U).
After a participant finished his or her recognition judgments, the computer program generated comparisons by randomly selecting two cities. The frequency distribution for the possible comparison types depended on the condition, as displayed in Table 1. (If, for a given participant, this requirement could not be met, the software stopped and this participant was excluded from the experiment). Another constraint was that no pair was presented twice. Then, the 100 comparisons were randomly ordered and participants chose, for each pair, the city with more inhabitants and indicated their subjective confidence in the correctness of their choice.
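The generation of comparisons under a fixed frequency distribution of comparison types can be sketched as follows. This is an illustrative stand-in, not the original program: instead of repeatedly drawing two cities at random, it enumerates all possible pairs per type and samples the required number from each. The function names, the quota dictionary, and the six example cities are hypothetical.

```python
import random
from itertools import combinations

ORDER = {"K": 0, "R": 1, "U": 2}

def pair_type(status_a, status_b):
    """Label a pair by its two recognition categories, e.g. 'K-U'."""
    first, second = sorted((status_a, status_b), key=ORDER.get)
    return f"{first}-{second}"

def generate_comparisons(status_by_city, quotas, seed=0):
    """Sample distinct city pairs so that each type meets its quota.

    Returns None when a quota cannot be met, mirroring the exclusion rule.
    """
    rng = random.Random(seed)
    by_type = {}
    for a, b in combinations(sorted(status_by_city), 2):
        label = pair_type(status_by_city[a], status_by_city[b])
        by_type.setdefault(label, []).append((a, b))
    chosen = []
    for label, count in quotas.items():
        candidates = by_type.get(label, [])
        if len(candidates) < count:
            return None
        # Sampling without replacement guarantees that no pair repeats.
        chosen.extend(rng.sample(candidates, count))
    rng.shuffle(chosen)  # random presentation order
    return chosen

# Hypothetical recognition judgments and a small hypothetical quota.
judgments = {"A": "K", "B": "K", "C": "R", "D": "R", "E": "U", "F": "U"}
items = generate_comparisons(judgments, {"K-U": 2, "R-U": 2, "K-R": 1})
```

With the quotas of Table 1 in place of the hypothetical ones, a participant who recognized too few or too many cities would yield `None` for at least one comparison type, which corresponds to the case in which the participant was excluded.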
The 20 participants who were assigned to the uncontrolled discrimination-rate condition started with 100 paired comparisons that were generated with the only constraint that no pair was presented twice. For these participants, recognition judgments were elicited after the comparison phase.
Finally, participants estimated several relative frequencies. First, for each of the three heterogeneous comparisons, they estimated the accuracy of inferences made for these comparison types. For the R-U comparisons, for instance, the instructions read “For all possible comparisons among cities for which you recognized one (but have no more knowledge about it), and have not heard of the other, what do you think is the percentage of comparisons for which the cities you recognized is the larger one?” Second, they estimated their own percentage of correct inferences for each of the six comparison types.
3.2 Predictions
3.2.1 Discrimination rate
Gigerenzer and Goldstein (1996) extended Gigerenzer et al.’s (1991) PMM algorithm by adding the recognition heuristic, while preserving the principle of one-reason decision making. This results in the following possible situations. When the recognition heuristic discriminates between two objects (cities in this case), then the city that is recognized is chosen as the larger one and the recognition validity is given as the confidence level. When both cities are recognized and something is known about at least one of the cities, the most valid cue is used to make a choice and the confidence is determined by the validity of this cue. When both cities are recognized but no further knowledge is available, or when neither city is recognized, a city is chosen randomly and confidence is 50%. If cue validities and recognition validity are estimated without any bias, confidence judgments should be well calibrated and mean percentage correct should equal mean confidence. This is because, within each of the six comparison types, the city pairs were selected randomly so that the validity of the cue and the recognition validity in the sample used in the experiment were expected to be identical to those in the reference class. The factor of discrimination rate should thus affect only mean percentage correct (higher in the high discrimination-rate condition) and mean confidence (again, higher in the high discrimination-rate condition), but it should not affect overconfidence (the difference between confidence and percentage correct).
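The three situations just described can be written down as a minimal sketch. The function name, the K/R/U status codes as inputs, and the `best_cue` argument (the most valid discriminating cue, supplied by the caller) are assumptions made for illustration; how cues are searched and ranked is not modeled here.

```python
import random

def pmm_choice(status_a, status_b, recognition_validity, best_cue=None, rng=random):
    """Return (choice, confidence_percent), with choice either 'a' or 'b'.

    status_a, status_b: 'K', 'R', or 'U' for the two cities.
    best_cue: None, or a (favored_city, cue_validity_percent) tuple.
    """
    recognized_a = status_a != "U"
    recognized_b = status_b != "U"
    # Situation 1: recognition discriminates between the two cities.
    if recognized_a != recognized_b:
        return ("a" if recognized_a else "b"), recognition_validity
    # Situation 2: both recognized and some further knowledge discriminates.
    if recognized_a and recognized_b and best_cue is not None:
        favored, cue_validity_percent = best_cue
        return favored, cue_validity_percent
    # Situation 3: nothing discriminates, so guess with 50% confidence.
    return rng.choice(["a", "b"]), 50.0

choice_ru = pmm_choice("R", "U", recognition_validity=80.0)
choice_kr = pmm_choice("K", "R", recognition_validity=80.0, best_cue=("b", 90.0))
```

Because the reported confidence equals the validity of the single reason that determined the choice, calibration follows whenever the validities in the item sample match those in the reference class.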
3.2.2 Size of reference class
The size of the reference class was neither an issue in the original PMM paper, nor in Gigerenzer and Goldstein (1996), nor in Goldstein and Gigerenzer (1999). It was simply assumed that people are well adapted to their natural environments, and that they are able to estimate cue validities with a reasonable degree of accuracy. However, as Hoffrage (1995) and Hoffrage and Hertwig (2006) showed, cue validities can depend on the size of the reference class (see also Goldstein & Gigerenzer, 2002, Figure 5). Gigerenzer et al. (1991) used all German cities with more than 100,000 inhabitants (as of 1988). Although 100,000 is a salient number, other thresholds might have been used. Indeed, the cue validities in this environment depend on this threshold, that is, on the minimum number of inhabitants a city must have to be included in the set. Across four different thresholds, cue validities varied widely: For one of the twelve cues, the validity dropped from 77% to 0%; for the others, the average absolute difference between the validities among all cities with more than 100,000 inhabitants and those among all cities with more than 300,000 was 10.3%.
In a similar vein, Juslin, Olsson, and Winman (1998) showed that the percentage of correct inferences depends on how items are sampled from a reference class. These authors varied whether pairs were constructed by randomly drawing each of the two objects from the whole reference class or whether sampling was constrained such that one (the other) object was randomly drawn from the set of those objects with a criterion value above (below) the median. The commonality between their constrained sampling procedure and my larger reference class is that for both conditions the differences between ranks (of objects with respect to the criterion value) are, on average, larger compared to the corresponding rank differences in the unconstrained procedure and the smaller reference class, respectively. Juslin et al. (1998) found, both with simulated and with participants’ data, that cue validities and percentage correct, respectively, were positively related to averaged rank size differences.
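The claim about average rank differences can be checked by exhaustive enumeration. The sketch below is an illustration under simple assumptions (ranks 1 to n without ties; for odd n the upper half is taken as the first n // 2 ranks); the function names are hypothetical and the code is not taken from Juslin et al.

```python
from itertools import combinations, product

def mean_rank_difference_unconstrained(n):
    """Mean rank difference over all pairs drawn from ranks 1..n."""
    pairs = list(combinations(range(1, n + 1), 2))
    return sum(b - a for a, b in pairs) / len(pairs)

def mean_rank_difference_median_split(n):
    """Mean rank difference when one object comes from each half."""
    half = n // 2
    upper = range(1, half + 1)      # ranks above the median
    lower = range(half + 1, n + 1)  # ranks below the median
    pairs = list(product(upper, lower))
    return sum(b - a for a, b in pairs) / len(pairs)

large_class = mean_rank_difference_unconstrained(75)
small_class = mean_rank_difference_unconstrained(32)
split_large = mean_rank_difference_median_split(75)
```

For unconstrained sampling the mean rank difference equals (n + 1) / 3, so it is about 25.3 for the 75 largest cities and 11 for the 32 largest, while the median-split procedure applied to 75 objects yields 37.5. Both the constrained procedure and the larger reference class thus increase the average rank difference.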
Based on Juslin et al.’s findings and on my own calculations just reported, one would predict that the percentage of correct inferences would be higher for the larger reference class. Given PMM theory’s assumption that cue validities drive not only percentage correct but also confidence, one should predict that mean confidence would be higher for the larger reference class as well. However, based on the assumption that participants are not aware of the relationship between (recognition and other cues’) validities and the size of the reference class (note that not even the authors of PMM theory saw this when they published their paper; Gigerenzer et al., 1991), I predicted that the mean confidence would not differ between the two reference class conditions. Confidence could even be higher for the smaller reference class. This is because, for this reference class, participants will presumably recognize a higher proportion of cities and will know more about a higher proportion of cities, compared to the larger reference class condition. This overall impression of higher familiarity with the cities may translate into higher confidence judgments [Footnote 4]. Given the data of Experiment 1, which used U.S. cities above 200,000 inhabitants, I expected that mean confidence would match percentage correct and that there would thus be no overconfidence for the larger reference class (largest 75 cities). In contrast, I predicted overconfidence for the smaller reference class (largest 32 cities), certainly because percentage correct would be lower than for the larger reference class and maybe, in addition, because mean confidence would be higher than for the larger reference class.
Note that the combination of the uncontrolled discrimination-rate condition with the two reference class conditions is most likely to yield evidence conflicting with PMM theory’s prediction that overconfidence disappears if pairs are randomly sampled from a defined reference class. (This prediction holds only for confidence ratings, not for frequency estimates.) For the controlled discrimination-rate conditions, analyzing participants’ overconfidence is less crucial for a test of PMM theory, as sampling of pairs is constrained in these conditions.
3.3 Results
This section proceeds as follows. First, I report the effects of the two main factors, discrimination rate and size of the reference class, on percentage correct, mean confidence, and overconfidence. Second, I show that the effect of discrimination rate was driven exclusively by the relative frequencies of the six comparison types, whereas the effect of the size of the reference class persisted within comparison types. Third, I ask how often participants followed the recognition heuristic when selecting a city. Fourth, I compare percentage correct and mean confidence to estimated validities and estimated percentages of correct choices. Finally, I demonstrate that recognition judgments depended on the size of the reference class.
3.3.1 Main effects of discrimination rate and size of the reference class
Figure 1 displays the six calibration curves for the six conditions that result from crossing the two main factors: discrimination rate and size of the reference class. Table 2 displays the mean confidence (MC), percentage correct (PC), and overconfidence (OC = MC-PC), again, across all participants and items. It can be seen that the effect of discrimination rate was small compared to that of the size of reference class. This is consistent with the results of three ANOVAs, each computed with the participant-specific values for MC, PC, and OC (Table 2, lower rows). As predicted, for each of the corresponding discrimination-rate conditions (e.g., high), PC was higher for the larger reference class (e.g., 69.4) than for the smaller one (e.g., 65.5), and MC was higher for the smaller reference class that contained relatively more familiar cities. Further, as predicted, OC differed dramatically between the reference classes. For the 75 cities with more than 200,000 inhabitants, it basically disappeared, replicating Experiment 1 which used the same reference class. For the reference class of the largest 32 cities (each city more than 400,000 inhabitants), however, massive overconfidence was observed. The interaction between size of reference class and discrimination rate was not statistically significant for any of the three dependent variables (not shown in Table 2).
3.3.2 Effect of discrimination rate and size of reference class within comparison type
The values for PC in Table 2 should differ between the discrimination-rate conditions because the validities of the recognition heuristic and that of other cues should be different for the six comparison types, and because the relative frequencies of these types were different for the three levels of discrimination rate. For a given comparison type, however, the PC should not depend on the discrimination-rate condition. To test for this independence, for each of the 60 participants, MC, PC and OC were computed within each of the six comparison types. Based on the resulting 60*6=360 entries, three ANOVAs were conducted, one for each of the three dependent variables (these ANOVAs had only 349 degrees of freedom because for some participants of the uncontrolled discrimination-rate condition there were no entries for some comparison types). Unlike the previous analyses which were computed based on averaging across all 100 items, when comparison type was held constant, that is, statistically controlled for, the discrimination rate no longer had an effect, MC: F(2,349)=1.32, p=0.27, PC: F(2,349)=0.16, p=0.85, and OC: F(2,349)=0.19, p=0.83. In contrast, the differences between the reference class conditions were significant: MC: F(1,349)=12.70, p<0.001, PC: F(1,349)=8.60, p=0.004, and OC: F(1,349)=32.37, p<0.001. Because discrimination rate had, as expected, no significant effect within a given comparison type (and its effect on MC, PC, and OC across all 100 items was due only to different frequencies of the different comparison types), the subsequent analyses focus on comparison types, thereby aggregating across participants of the different discrimination-rate conditions.
3.3.3 Effects of (recognition) knowledge on decisions
The results reported above establish that the frequency of comparison types drove the percentage correct: The more often the recognition heuristic could be used, the better participants’ performance was. Even though this finding already suggests that participants tended to infer that recognized cities are larger than unrecognized cities, there is also a more direct way to see whether this was the case. Table 3 displays—for each of the six comparison types and across all participants and items—mean confidence, percentage correct and overconfidence. For the three heterogeneous comparison types an additional analysis was performed, based on the knowledge about the two cities and how a participant responded. Specifically, decisions that were consistent with the assumption that the recognition heuristic was used (referred to as “consistent”) include those where a city that was recognized (be it with or without more knowledge, that is, a K-city or an R-city) was selected when it was paired with an unrecognized city. Finally, a decision in favor of a K-city when paired with an R-city was also classified as “consistent”. Note that for these cases, recognition did not discriminate, so this classification was based on the assumption that more knowledge about one city is most likely to be knowledge that allows for the inference that it is larger than a city for which such knowledge does not exist.
Across all cases in which a recognized city (either K or R) was paired with an unrecognized city (U), participants decided in favor of the recognized city in 84.3% of the cases. An analysis conducted on an individual basis revealed that 5 participants decided in favor of the recognized city in 100% of the critical cases, 11 in 99.9 - 90% of the cases, 29 in 89.9 - 80%, 10 in 79.9 - 70%, 3 in 69.9 - 60%, 1 in 48% (this participant had a percentage correct of 51%, suggesting that he responded randomly throughout), and for 1 participant, the adherence rate could not be computed (as she recognized all the cities in the reference class). When participants recognized both cities but knew something about one city (K) but not the other (R), they favored the city that they knew something about in 79.3% of the cases. It is interesting to see that such “consistent” decisions were far more likely to be correct than the “inconsistent” decisions. Had participants always decided in favor of the recognized city (or, for K-R pairs, in favor of the K city), the percentage correct for the K-U, K-R, and R-U comparisons would have been 83.1%, 70.0%, and 64.4%, instead of 78.6%, 69.4%, and 60.0%, respectively. It is also interesting to see that mean confidence was lower for the “inconsistent” decisions than for the “consistent” decisions. This reduction, however, was not sufficient to compensate for the lower percentage correct, and so overconfidence was far more pronounced for the “inconsistent” than for the “consistent” decisions.
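The participant-specific adherence rate can be computed as in the following sketch. The trial format, the function name, and the example trials are hypothetical; the sketch only covers the critical cases in which exactly one of the two cities was recognized.

```python
def adherence_rate(trials):
    """Percentage of critical pairs in which the recognized city was chosen.

    `trials` is a list of (status_a, status_b, choice) tuples with statuses
    'K', 'R', or 'U' and choice 'a' or 'b'. Returns None when no pair
    contains exactly one recognized city.
    """
    critical = 0
    consistent = 0
    for status_a, status_b, choice in trials:
        recognized_a = status_a != "U"
        recognized_b = status_b != "U"
        if recognized_a == recognized_b:
            continue  # recognition does not discriminate
        critical += 1
        if (choice == "a") == recognized_a:
            consistent += 1
    return 100 * consistent / critical if critical else None

# Hypothetical participant with three critical pairs, two chosen consistently.
example = [
    ("K", "U", "a"),
    ("U", "R", "b"),
    ("R", "U", "b"),
    ("K", "R", "a"),
    ("U", "U", "a"),
]
rate = adherence_rate(example)
```

A participant who recognized all cities has no critical pairs, so the function returns `None`, which corresponds to the one participant for whom the rate could not be computed.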
3.3.4 Effects of (recognition) knowledge on validities
How do participants’ estimates of the validities for the various comparison types relate to the corresponding percentages of correct decisions? Before answering this question, I extend the list of variables by adding what I refer to here as simulated validities, that is, the percentages of correct inferences for all possible comparisons of cities within a given type of comparison (K-U, K-R, and R-U), given that a participant always decided in favor of the first city (K, K, and R, respectively). These variables obviously had to be computed separately for each participant. Table 4 contains the values of the six variables, averaged across all 60 participants. The consistency of the pattern revealed in Table 4 is striking. Within each of the three heterogeneous comparison types, both the simulated validity and percentage correct are higher for the large reference class (75 cities) than for the small one (32 cities). Across all these comparison types and across these two variables (simulated validity and PC), the average for the 75 largest cities exceeds that for the largest 32 cities by 8.2 percentage points. Interestingly, participants were obviously not aware of this relationship. To the contrary, their estimated validities, their mean confidence, and their estimated percentage correct all pointed in the opposite direction: Each of these values was larger for the reduced reference class. Averaged across all comparison types and all these three variables, the difference was −5.4 percentage points.
It is also interesting to see that in each of the six rows in Table 4, the simulated validity exceeded percentage correct. Note that the values for these two measures would have been the same had all participants always chosen the K-city or the R-city whenever it was paired with a U-city, and the K-city whenever it was paired with an R-city. However, as was explained above, this was not the case, and so the findings reported in Table 4 mirror those reported in Table 3.
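For a single participant, the simulated validity defined in this section could be computed along the following lines. The sketch rests on stated assumptions: the function names, the city labels, the population figures (invented round numbers, not real data), and the knowledge states are all hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple


def simulated_validity(
    population: Dict[str, int],
    state: Dict[str, str],
    pair_type: Tuple[str, str],
) -> Optional[float]:
    """Share of all pairs of the given type (e.g., ("K", "U")) in which
    always choosing the city in the first-named state picks the larger city."""
    first, second = pair_type
    hits = total = 0
    for a, b in combinations(population, 2):
        if {state[a], state[b]} != {first, second}:
            continue  # pair is not of the requested comparison type
        favored, other = (a, b) if state[a] == first else (b, a)
        total += 1
        hits += population[favored] > population[other]
    return hits / total if total else None  # undefined without such pairs


def largest(population: Dict[str, int], n: int) -> Dict[str, int]:
    """Restrict the reference class to the n largest cities."""
    ranked = sorted(population.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


# Invented toy reference class (populations in thousands) and one
# hypothetical participant's knowledge states (K, R, or U).
POPULATION = {"A": 900, "B": 800, "C": 700, "D": 600, "E": 500, "F": 400}
STATE = {"A": "K", "B": "U", "C": "R", "D": "K", "E": "U", "F": "U"}

for label, ref_class in [("all 6", POPULATION), ("largest 4", largest(POPULATION, 4))]:
    for pair_type in [("K", "U"), ("K", "R"), ("R", "U")]:
        print(label, pair_type, simulated_validity(ref_class, STATE, pair_type))
```

The toy data are constructed so that adding smaller, unrecognized cities raises the simulated validity for K-U and R-U pairs. They illustrate how the measure is computed for two reference classes and are not evidence for the effect reported in Table 4.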
3.3.5 Effects of the size of the reference class on recognition judgments
The last analysis of these data reported here concerns the question of whether recognition judgments were independent of the size of the reference class (for more results, see Hoffrage, Reference Hoffrage1995). According to range-frequency theory (Parducci, Reference Parducci1965), which posits that people have a tendency to map the range of an attribute’s levels linearly onto the range of the response scale, one may suspect that this may not be the case. Specifically, having relatively few K-cities in the larger reference class or relatively few unrecognized cities in the smaller reference class may lead one to shift the criterion that is used to make these classifications. Conversely, having relatively more U-cities in the larger reference class and relatively more K-cities in the smaller reference class may lead to corresponding criterion shifts in the other direction. Even though Goldstein and Gigerenzer (Reference Goldstein and Gigerenzer2002) conceptualized recognition as a simple dichotomous variable (a city is either recognized or not), others discussed the possibility that the process of making such categorical judgments may draw on some more continuous representations in memory which, in turn, open the theoretical possibility of context effects on threshold settings (Erdfelder, Küpper-Tetzel, & Mattern, Reference Erdfelder, Küpper-Tetzel and Mattern2011; Gigerenzer & Murray, Reference Gigerenzer and Murray1987; Hertwig, Herzog, Schooler, & Reimer, Reference Hertwig, Herzog, Schooler and Reimer2008; Pleskac, Reference Pleskac2007; Schooler & Hertwig, Reference Schooler and Hertwig2005).
In fact, recognition judgments in this experiment did depend on the size of the reference class. The most straightforward way to see this is to compare the recognition judgments of the largest 32 cities to those of the same cities, but now as a subset that is embedded in the set of the largest 75 cities (henceforth referred to as 32-in-75). Without any context effects, the recognition judgments should not differ between the two sets (that contain, after all, exactly the same cities). Table 5 displays the absolute and relative frequencies of the three knowledge states, depending on the size of the reference class. It is interesting to see that in the population of the largest 32 and of the largest 75 cities, virtually the same percentage of cities was categorized as “more knowledge beyond name recognition”, 34.0 and 32.8, respectively. As a necessary consequence, for the 32-in-75 cities, this percentage increased dramatically, from about a third, to more than half. The criterion shift was expected to be in the other direction for the unrecognized cities, and this was the case; the percentage of U-cities decreased by almost a factor of two and fell from about 20% (32 largest cities) to about 10% (32-in-75).
As a necessary consequence, the frequency distribution of the type of comparisons was quite different for the set of the largest 32 cities and the 32-in-75 set (Table 6).
This leads to the interesting question of whether these context effects on the recognition judgments affected the confidence, percentage correct, or overconfidence. The following rationale makes it clear why this may be the case. We have seen that the set of the largest 32 cities when presented alone, compared to analyzing the 32-in-75 set, led to a stricter criterion for a city to be classified as a K-city, and to a more liberal criterion to classify a city as a U-city (see also Table 6). Moreover, we have seen that the validities (be it of the recognition heuristic or, by an extension of the argument, those of other cues) were higher for the larger reference class than for the smaller one. In fact, the percentage of correct inferences for the set of the largest 75 cities, the embedded set (32-in-75) and the set of the largest 32 cities were 68.5%, 67.2%, and 61.7%, respectively.
3.4 Discussion
All of the predictions were basically confirmed. The discrimination rate affected percentage correct, mean confidence, and overconfidence as predicted: The more often the recognition heuristic could be applied and the more often other knowledge discriminated among the cities, the higher the percentage correct and mean confidence were. These effects could be fully accounted for by the relative frequencies of the six comparison types. The percentage of participants’ choices that were consistent with the prediction of the recognition heuristic is in the same range as reported in other studies that were conducted later (e.g., Goldstein & Gigerenzer, Reference Goldstein and Gigerenzer2002). What the present study adds to the literature is the observation that, for a larger reference class (all cities above 200,000 inhabitants) as compared to a smaller reference class (all cities above 400,000 inhabitants), percentage correct was higher, mean confidence was lower, and overconfidence was less pronounced. To the best of my knowledge, such effects have not been reported elsewhere. Equally important is the related finding that participants were not only unaware of the dependency of the validity of the recognition and other knowledge on reference class size, but also that their answers even pointed in the opposite direction (higher confidence judgments and frequency estimates for the smaller reference class). Finally, what the present study adds to the literature is the conjecture that recognition judgments might best be seen as resulting from mapping an underlying hypothetical variable with the help of a response function onto a dichotomous recognition value. Such a view could, at least, easily account for the fact that the observed recognition judgments depended, in a between-participants comparison, on the size of the reference class.
4 General discussion
The present paper reported three studies. The first, paving the way to the recognition heuristic, was a failed attempt to generate hard questions by asking German students which of two U.S. cities (each randomly drawn from a defined reference class) has more inhabitants. To our surprise, German students were about as good at these questions as they were at the corresponding comparisons among German cities. In Experiment 2, a similar attempt succeeded: When comparing two representative item sets, one hard and the other easy, the hard-easy effect was still observed (higher overconfidence for the hard set), but now the effect was much smaller than in previous studies. These two data points fit perfectly into the larger picture that Juslin, Winman, and Olsson (Reference Juslin, Winman and Olsson2000) provided in their meta-analysis in which they analyzed the effect of sampling procedure. Specifically, those authors conducted a review of 95 independent data sets with selected items and 35 sets in which items had been sampled representatively. Across all selected item sets, overconfidence was 10%, and across all representative sets it was 1% (95% confidence intervals for each set were at ±1%). Juslin et al. pointed out that this difference could not be explained by differences in percentage correct. Moreover, when they controlled for the end effects of the confidence scale and the linear dependence between percentage correct and the overconfidence score (recall that OC=MC-PC), the hard-easy effect virtually disappeared for the representative item sets.
4.1 The recognition heuristic: Compensatory or non-compensatory?
The focus of the present paper was on the recognition heuristic, which was proposed as a post-hoc explanation for the puzzling result of Experiment 1. Two of the major results of Experiment 3 were, first, that people’s choices were consistent with the recognition heuristic in about 80% of the pairs when they had no additional knowledge about the recognized city (and in about 90% when there was such knowledge), and, second, that discrimination rates drive percentage correct, mean confidence and overconfidence. As of today, almost 20 years after this study was conducted, readers might say “we knew that all along”, and rightly so, as many similar findings have been reported since then (for overviews see Gigerenzer & Goldstein, Reference Gigerenzer and Goldstein2011; Pachur, Todd, Gigerenzer, Schooler, & Goldstein, in press). However, the literature on the recognition heuristic also reveals some controversies. Some papers (e.g., Bröder, Reference Bröder2000; Chater, Oaksford, & Nakisa, Reference Chater, Oaksford and Nakisa2003; Dougherty, Franco-Watkins, & Thomas, Reference Dougherty, Franco-Watkins and Thomas2008) criticize some aspects and raise some doubts concerning the research program in which the recognition heuristic is embedded, namely the simple heuristics program initiated by the ABC Research Group (Gigerenzer, Todd, and the ABC Research Group, Reference Gigerenzer and Todd1999) in general. Space and the focus of this special issue do not allow such criticism to be addressed here (but see, e.g., Todd, Gigerenzer, and the ABC Research Group, Reference Todd and Gigerenzer2000; Gigerenzer, Hoffrage, & Goldstein, Reference Gigerenzer, Hoffrage and Goldstein2008).
Among those criticisms that refer to the recognition heuristic specifically, one is particularly interesting as it directly relates to a distinction already made in the present Experiment 3. Several authors (e.g., Bröder & Eichler, Reference Bröder and Eichler2006; Hilbig & Pohl, Reference Hilbig and Pohl2008; Reference Hilbig and Pohl2009; Newell & Fernandez, Reference Newell and Fernandez2006; Newell & Shanks, Reference Newell and Shanks2004; Oppenheimer, Reference Oppenheimer2003; Pachur, Bröder, & Marewski, Reference Pachur, Bröder and Marewski2008; Pohl, Reference Pohl2006; Richter & Späth, Reference Richter and Späth2006) have challenged Goldstein and Gigerenzer’s (Reference Goldstein and Gigerenzer2002) claim that people use recognition knowledge in a non-compensatory fashion. Most of the studies reported by those authors distinguished between objects that participants recognized but for which they had no additional knowledge (in the present paper referred to as R-objects) and objects which they recognized and for which they had further knowledge (K-objects). Hilbig and Pohl (Reference Hilbig and Pohl2008), for instance, referred to these objects as mR (for mere recognition) and R+ (for recognition plus knowledge), respectively. Some of these authors then developed and used measures beyond those used in the analyses reported above, like various parameters in a multinomial model approach (Hilbig, Erdfelder, & Pohl, Reference Hilbig, Erdfelder and Pohl2010), response times (Hilbig & Pohl, Reference Hilbig and Pohl2009), or the DI (Discrimination Index; Hilbig & Pohl, Reference Hilbig and Pohl2008); for an overview, see Hilbig (Reference Hilbig2010). The overall conclusion of these authors is that their data conflict with the hypothesis that recognition knowledge is always used in a non-compensatory way.
Some of these authors would probably also interpret some of the results reported in the present paper as inconsistent with the non-compensatory nature of the recognition heuristic. For instance, the finding in Experiment 3 that percentage correct is substantially larger for K-U pairs than for R-U pairs is consistent with the assumption that the knowledge that was available for K-cities has been used in some way. Another example would be the DI (Hilbig & Pohl, Reference Hilbig and Pohl2008), which is defined as the adherence rate to the recognition heuristic among paired comparisons in which the recognized object was the correct answer minus the adherence rate among those comparisons for which the recognized object was the incorrect answer. In their studies, Hilbig and Pohl found this index to be positive and concluded that the recognition heuristic is not used in a non-compensatory way. The rationale for this conclusion is that a positive index “would not be possible through following the recognition cue alone” (Hilbig & Pohl, Reference Hilbig and Pohl2008, p. 395), simply because following the recognition cue alone yields adherence rates of 100%, both for cases in which the recognition heuristic would lead to a correct and an incorrect inference, which, in turn, would yield a difference of zero.Footnote 5
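As a minimal sketch of the definition just given, the DI can be computed from trial-level data as follows. The tuple layout, the function name, and the toy data are assumptions made for illustration; this is not Hilbig and Pohl's code. Returning an undefined value whenever one of the two subsets is empty is likewise a choice made here.

```python
from typing import Iterable, Optional, Tuple

# Each trial: (chose_recognized, recognized_is_correct). Hypothetical layout.
Trial = Tuple[bool, bool]


def discrimination_index(trials: Iterable[Trial]) -> Optional[float]:
    """Adherence rate when the recognized object is the correct answer
    minus adherence rate when the recognized object is the incorrect answer."""
    trials = list(trials)
    when_correct = [chose for chose, rec_correct in trials if rec_correct]
    when_incorrect = [chose for chose, rec_correct in trials if not rec_correct]
    if not when_correct or not when_incorrect:
        return None  # DI is left undefined if either subset is empty
    return (sum(when_correct) / len(when_correct)
            - sum(when_incorrect) / len(when_incorrect))


# A participant who always follows recognition: adherence is 100% in both
# subsets, so the difference is zero.
strict_follower = [(True, True)] * 8 + [(True, False)] * 4

# Invented toy data: recognition is followed in 7 of 8 cases when it is
# correct, and in 3 of 4 cases when it is incorrect.
mixed = ([(True, True)] * 7 + [(False, True)] * 1
         + [(True, False)] * 3 + [(False, False)] * 1)

print(discrimination_index(strict_follower))  # 0.0
print(discrimination_index(mixed))            # 0.125
```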
The DI for Experiment 3 can be recovered from the information displayed in Table 3. Across all participants and items, it was .055 (the average of the participant-specific DIs was .053, with SD = .184, SE = .024, which was significantly greater than 0, t(58) = 2.22, p = .02, and it was positive, zero, negative, and not defined for 30, 6, 23, and 1 participants, respectively). Among R-U comparisons, DI = .031 (the average across participant-specific DIs was .041, SD = .242, SE = .032, t(58) = 1.3, p = .10, with 25, 7, 27, and 1 participants who had a positive, zero, negative, and undefined score, respectively) and among K-U comparisons, DI = .025 (the average across participant-specific DIs was .032, SD = .277, SE = .041, t(45) = .77, p = .22, with 9, 21, 16, and 14 participants, respectively). Even though the DI in Experiment 3 was positive, it was lower than for other studies reported in the literature (e.g., Hilbig & Pohl, 2008), and the difference from zero was only significant when R-U and K-U comparisons were pooled (but for none of these comparison types separately). Moreover, the DI for K-U comparisons did not exceed the DI for R-U comparisons. To the extent that a positive DI reflects the use of knowledge beyond recognition, one should have expected the DI to be larger for K-U than for R-U comparisons, because for K-U comparisons more knowledge can be used than for R-U comparisons.
Some findings of Experiment 3 are in line with those reported by authors who have challenged the non-compensatory nature of the recognition heuristic. Not only the DI (which includes adherence rates conditioned on whether the recognized object is the correct answer), but also Table 3 (which reported percentage correct conditioned on adherence) can be interpreted as evidence inconsistent with the claim that recognition is always used in a non-compensatory way. I want to emphasize that I, just like Gigerenzer and Goldstein (Reference Gigerenzer and Goldstein2011, p. 110), “have no doubts that recognition is sometimes dealt with in a compensatory way”. In fact, if a participant happens to know that a city she recognizes is very small and recognized for reasons other than population size (think of Chernobyl or Fatima), then this would constitute a good reason not to make an inference based on the recognition heuristic, but to decide based on what Gigerenzer et al. (Reference Gigerenzer, Hoffrage and Kleinbölting1991) called a local mental model, that is, to use direct knowledge about the criterion. A simple example can demonstrate that very few cases (5 in 2,000 pairs) like this are already enough to make a difference between the DI that was observed in Experiment 3 and a DI of zero.Footnote 6
That recognition knowledge is trumped by criterion knowledge is one reason why choices may not be consistent with the recognition heuristic. Another reason is that recognition knowledge could be trumped by probabilistic cues (see also Gigerenzer & Goldstein, Reference Gigerenzer and Goldstein2011). Experiment 3 of the present paper did not live up to Gigerenzer and Goldstein’s request to specify models for such compensatory use of cues against which the non-compensatory recognition heuristic is tested. One should not forget, however, that this was the first, exploratory study on the recognition heuristic, conducted almost 20 years ago, whose goal was to test the post-hoc explanation developed after Experiment 1, rather than to test specific claims that were formulated only several years later. While Marewski, Gaissmaier, Schooler, Goldstein, and Gigerenzer (Reference Marewski, Gaissmaier, Schooler, Goldstein and Gigerenzer2010), who conducted such a rigorous test, conclude from their studies that the recognition heuristic outperformed all competing compensatory models with respect to predicting people’s inferences, Experiment 3 of the present paper did not elicit the data that are necessary to perform such tests.
The recognition heuristic is a model of cognitive processes involved in inferences, and, as every model does, it simplifies. Therefore, I do not find it at all surprising to see that people seem to follow the recognition heuristic in less than 100% of the cases in which it allows for an inference (as reflected in adherence rates < 1) and even less so if an inference would be incorrect (as reflected in DI > 0). What I do find surprising, though, is that this “failure” to make correct predictions in 100% of the cases is sometimes seen as critical evidence. This attitude strikes me as even more surprising when considering that there is no scarcity of authors in cognitive psychology who seem to be satisfied if their model predicts outcomes significantly better than chance.
4.2 The theoretical importance of the (size of the) reference class
Experiment 3 revealed effects that have not been reported elsewhere. It is easy to understand why increasing the size of the reference class increases both the recognition validity and the validities of cues: adding smaller cities to a set of larger cities is more likely to result in adding unrecognized cities than recognized cities, and it is more likely to result in adding cities with unknown or negative cue values than with positive cue values. This, in turn, will not only increase the proportion of pairs consisting of recognized and unrecognized cities, but also, within this set, will increase the proportion of pairs in which the recognized city is the larger one (see the simulated validities in Table 4). However, it should be mentioned that increasing the size of the reference class also increases the average difference between the population sizes of the cities that are compared (see also Juslin et al., Reference Juslin, Olsson and Winman1998). To the extent that participants possess criterion knowledge (Hilbig, Pohl, & Bröder, Reference Hilbig, Pohl and Bröder2009), the increase of percentage correct (as size of the reference class increases) could also be explained by a relative increase of comparisons that are made through the construction of local mental models as compared to probabilistic mental models (Gigerenzer et al., Reference Gigerenzer, Hoffrage and Kleinbölting1991).
In contrast, participants’ mean confidence revealed an effect in the opposite direction to that observed for percentage correct: confidence judgments were lower for the larger reference class and higher for the smaller one. Taken together with the effect on percentage correct, this resulted in zero over/under-confidence for the larger reference class but in severe overconfidence for the smaller reference class. Note that this result was observed in the condition in which the discrimination rate was not controlled for, and thus it poses a challenge for PMM theory (Gigerenzer et al., Reference Gigerenzer, Hoffrage and Kleinbölting1991). At the same time, the effect on confidence judgments is easily explained: It may have resulted from the fact that the smaller reference class contained relatively more cities that the participants recognized and also more cities that they knew something about, coupled with the (false) belief “the more I know, the better I will perform.” It is also consistent with the results of many studies conducted by Klaus Fiedler and his colleagues demonstrating that participants do not appropriately adjust their judgments to the sampling procedure of the items they are presented with (e.g., Fiedler, Reference Fiedler2000).
The insight that both the validity of recognition and the validities of other cues depend on the size of the reference class leads to some interesting questions: Which reference classes should experimenters select in their studies? Which reference classes do participants use when they determine their confidence? The problem of choosing the adequate reference class is neither trivial nor new. It is, for instance, fundamental to the frequentistic interpretation of probabilities (for history and interpretations, see Gigerenzer et al., Reference Gigerenzer, Swijtink, Porter, Daston, Beatty and Krüger1989). As the great probability theorist Richard von Mises (Reference von Mises1957) put it, “we shall not speak of probability until a collective has been defined” (p. 18). Insurance companies face the same problem when determining the premium for a particular person’s life insurance policy. Clearly this premium will depend on the probability that this person will die, say, within the next ten years. But which of the person’s innumerable properties should be used to construct a reference population? Each of these properties (as well as combinations thereof) could be used to define the reference class, and in all likelihood, many of the resulting reference classes would yield different statistics and thus different estimations for mortality risks, leaving open the question of which is the correct one.
Frankly, I do not have a good answer. However, I think there are possible pragmatic routes toward a “good enough solution” (see also Hoffrage & Hertwig, Reference Hoffrage, Hertwig, Fiedler and Juslin2006). Under some circumstances, experimenters may circumvent the problems that result from fuzzy reference classes, either by selecting one that is small, finite, and complete (e.g., all African states) or by creating microworlds (e.g., Fiedler et al., Reference Fiedler, Walther, Freytag and Plessner2002). This allows them to control for participants’ exposure to these worlds and make sure that the intended reference class and the participants’ reference class converge. Another possibility would be to explore the boundaries of a reference class empirically (e.g., by analyzing environmental frequencies). Anderson and Schooler (Reference Anderson and Schooler1991), for instance, examined a number of environmental sources (such as the New York Times) and showed that there are reliable relationships between the probability that a memory for a particular piece of information will be needed and frequency, recency, and patterns of prior exposure. Such an analysis of environmental statistics could also be conducted in the context of the research reported in this paper. For the city task, for instance, it may show that people are much more likely to encounter larger cities than smaller cities. Specifically, such environmental frequencies could be used to determine how often a particular city is used in the experimental materials. Finally, another way to determine the “right” size of people’s reference classes is to transfer the task of sampling experimental stimuli from the experimenter to the participants. Hogarth (Reference Hogarth, Fiedler and Juslin2005), for instance, used mobile phones and interrupted his participants in their flow of daily activities at randomly chosen intervals and asked several questions regarding the last decision that they made, thereby letting them, the environment, and chance determine which environmental stimuli are designated to become experimental ones (see also Dhami, Hertwig, & Hoffrage, Reference Dhami, Hertwig and Hoffrage2004).
4.3 Final remarks
The formulation of the recognition heuristic has led to a lot of exciting research. However, we should not only look at what has been achieved in the past, but also continue this fruitful tradition in the future. Interestingly, when adopting the recognition heuristic to generate recommendations for choosing among research topics, it should be inverted. When faced with the choice between working on recognized topics, replicating known findings, versus entering new and unexplored territory: Go with the latter. I hope the present paper helped to identify some of these blank areas on the map of research on the recognition heuristic, thereby initiating some further steps towards new directions.