1 Introduction
In their seminal article, Kahneman and Tversky (1979) presented behavioral results from 16 choice problems, designed to demonstrate ways in which human decision making under risk violates the assumptions of Expected Utility Theory (EUT). These psychological phenomena – elevated to “paradoxes” by virtue of their conflict with EUT – were in turn used to motivate psychological assumptions, principally in the form of nonlinear reactions to value and probability and differential reactions to gains and losses of the same magnitude. These hypothesized mechanisms motivated the functional forms of Prospect Theory (PT), such as the value and the probability weighting functions. Although the article was based on a relatively small sample, few studies have had a stronger influence on the field of decision making.
These functional forms have been used to explain behavior in many domains of the social sciences: for example, labor supply (Camerer, Babcock, Loewenstein & Thaler, 1997), international relations (e.g., Jervis, 1992), and conflict theory (e.g., Levy, 1996). It has been proposed that these functional forms are evolutionarily adaptive (McDermott, Fowler & Smirnow, 2008; Mallpress, Fawcett, Houston & McNamara, 2015). With time, the ability to account for these phenomena has been elevated to a benchmark for any model that is to be admitted into the debate on decision making under risk (Birnbaum, 1999, 2008; Brandstätter, Gigerenzer & Hertwig, 2006; Erev, Ert, Plonsky, Cohen & Cohen, 2017).
There are at least three different kinds of empirical research performed in connection with PT: i) Studies that attempt to replicate the psychological phenomena that motivated the original formulation of PT; ii) Studies that test whether the psychological assumptions postulated by PT are the correct explanations of these phenomena (e.g., whether the so-called Certainty Effect, see below, is explained by nonlinear probability weighting); iii) Studies that apply the functional forms of PT to account post hoc for real-life phenomena.
Given the recent discussion of a “replication crisis” in the behavioral sciences (Nelson, Simmons & Simonsohn, 2018) – and the observation that relatively few studies on PT have been concerned with replication (but see notable exceptions in the review presented below) – in this article we attempt to replicate the psychological phenomena that supported PT. The replication is conceptual, rather than direct, because it targets a population with a wider range of numeracy (i.e., the ability to apply and reason with numerical concepts) than the original study did (which involved undergraduate university students).
Variation in numeracy was desirable because past research has highlighted the impact of numeracy on decision making, often finding it to be superior to general cognitive abilities (e.g., algebra competence, intelligence, cognitive reflection, literacy) for predicting decision making skills (see Cokely et al., 2016), and also finding that it has a direct effect on the functional forms of PT (see more under “Aims and Hypotheses of the Present Study”). Consequently, a wide range of numeracy enabled a test of the robustness and limiting conditions of the results presented by Kahneman and Tversky (1979). Because we have previously shown that people’s probability weighting functions appear to be dependent on anchors in the form of related judgments, we also manipulated design type: Within-Subject Design (WSD) vs. Single-Stimuli Design (SSD), where participants assess only one problem (Millroth, Nilsson & Juslin, 2018).
1.1 The seminal study by Kahneman and Tversky (1979)
The strategy in Kahneman and Tversky (1979) was to set up pairs of binary choice problems, where the participants choose between A1 and B1 and between A2 and B2. The pairs were constructed so that choices of A1 and B2, or B1 and A2, are incompatible with EUT. A total of 16 problems were included, which together posited nine paradoxesFootnote 1: four different variants of the Certainty Effect; two variants of the Reflection Effect; two variants of the Isolation Effect; and the Probabilistic Insurance Effect. The choice problems are described in Table 1 and explained in the following, along with the results in the original study. Note that our labeling of Problems 14 and 16 corresponds to “13´ ” and “14´ ” in Kahneman and Tversky (1979).
+ England, France, Italy.
++ The item is described as a two-stage game where choice options A and B occur in stage two IF one wins in stage one (p win in stage 1 = .75). The choice between A and B must be made before stage one.
+++ For item 11[12] choices are supposed to be made under the following condition: “In addition to whatever you own, you have been given 1000[2000]”. Thus, the ultimate outcomes are the same for A and A’, and for B and B’.
1.2 The certainty effect
To exemplify the certainty effect, consider the first problem in the second pair (Problem 3 in Table 1), which involved a choice between a lottery with a .80 probability of winning $4000 (A) and $3000 for certain (B). The other problem in the pair (Problem 4 in Table 1) involved a choice between two lotteries, one with a .20 probability of winning $4000 (A’) and one with a .25 probability of winning $3000 (B’). Notably, in the light of EUT, both of these problems involve a choice between prospects with expected utility p•u($3000) and .8p•u($4000) (p is the probability, and u() is a utility function). Because EUT assumes a linear use of probability, it postulates that a person who prefers A [B] should also prefer A’ [B’]. This holds for all stable utility functions. In conflict with EUT, the majority response in Kahneman and Tversky (1979) was to choose B and A’. This result, together with the similar demonstrations comparing Problems 1 and 2, Problems 5 and 6, and Problems 7 and 8, suggests that people’s subjective weighting of probabilities is nonlinear. Most notably, as shown in the first three comparisons, people prefer outcomes that are certain. In addition, later studies have shown that for probabilities other than 0 or 1, people tend to overweight low probabilities and underweight high probabilities (e.g., Tversky & Kahneman, 1992).
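The EUT prediction for Problems 3 and 4 can be checked numerically. The sketch below is illustrative only: the utility functions are hypothetical choices (not taken from the original study), and each is assumed to satisfy u(0) = 0. For every one of them, the preference in Problem 3 and the preference in Problem 4 point the same way, which is why the majority pattern B and A’ conflicts with EUT.

```python
import math

def expected_utility(prob, amount, u):
    """Expected utility of 'win `amount` with probability `prob`, else nothing'.
    Assumes u(0) = 0, so the zero outcome contributes nothing."""
    return prob * u(amount)

# Hypothetical utility functions for illustration; all satisfy u(0) = 0.
UTILITIES = {
    "linear": lambda x: x,
    "sqrt": lambda x: math.sqrt(x),
    "power_0.3": lambda x: x ** 0.3,
    "log1p": lambda x: math.log1p(x),
}

def eut_preferences(u):
    """Return the EUT-implied choices for Problem 3 and Problem 4."""
    # Problem 3: A = ($4000, .80) vs. B = ($3000 for certain)
    a = expected_utility(0.80, 4000, u)
    b = expected_utility(1.00, 3000, u)
    # Problem 4: A' = ($4000, .20) vs. B' = ($3000, .25)
    a_prime = expected_utility(0.20, 4000, u)
    b_prime = expected_utility(0.25, 3000, u)
    return ("A" if a > b else "B", "A'" if a_prime > b_prime else "B'")

for name, u in UTILITIES.items():
    print(name, eut_preferences(u))
```

Because Problem 4 scales both winning probabilities of Problem 3 by the same factor (.25), the ordering of the two expected utilities cannot change under EUT, whatever utility function is used.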
1.3 The isolation effects
These demonstrations show that prospects can be decomposed into common and distinctive components in more than one way, and that different decompositions can lead to inconsistent preferences. For example, consider Problems 4 (described above) and 10. Problem 10 involves two compound, or two-stage, lotteries. Under both lotteries, there is a .75 probability of losing in the first stage. However, if one proceeds to the second stage, then lottery A offers a .8 probability of winning $4000 while lottery B gives $3000 for certain. Thus, the amounts that can be won and the probabilities of winning are identical in Problems 4 and 10. Despite this, as shown in Table 1, the majority response differed greatly between the two problems. The comparisons between Problems 4 and 10 and Problems 11 and 12 show two ways in which different descriptions of one and the same choice problem might give rise to contradictory choices. In the first comparison, one option is made attractive by associating it with a certain positive outcome. As for the first three comparisons in Table 1, this result highlights an apparent attractiveness of perceived certainty (i.e., of perceived control and predictability). In the second comparison, one option is made unattractive by framing it as if it involved a certain loss. PT implies that this framing effect occurs because people have different value functions for gains and losses.
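The claim that Problems 4 and 10 offer identical final probabilities follows from multiplying the stage probabilities. A minimal sketch, using exact fractions and a hypothetical helper name:

```python
from fractions import Fraction

def reduce_two_stage(p_reach_stage_two, p_win_in_stage_two):
    """Overall probability of winning a two-stage lottery (hypothetical helper)."""
    return p_reach_stage_two * p_win_in_stage_two

# Problem 10: a .75 probability of losing in stage one, so .25 of reaching stage two.
p_stage_two = 1 - Fraction(3, 4)

p_win_a = reduce_two_stage(p_stage_two, Fraction(4, 5))  # $4000 with .8 in stage two
p_win_b = reduce_two_stage(p_stage_two, Fraction(1))     # $3000 for certain in stage two

print(p_win_a, p_win_b)  # the .20 and .25 of Problem 4, as exact fractions
```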
1.4 The reflection effects
These effects posit that outcomes are treated differently in the loss and the gain domain. For example, Problems 13 and 14 are identical apart from one involving only gains and the other only losses.
1.5 Probabilistic insurance
A probabilistic insurance is a hypothetical insurance described as follows. If you have a probabilistic insurance against event E (e.g., your house burns down) and E occurs, then there is a probability of p that all your expenses are paid. However, there is also a probability of 1−p that your premium is returned and that you receive no coverage. In Problem 9, participants are asked if they would be interested in a probabilistic insurance that costs half the full premium but has a .5 probability of not covering any costs. Because of the standard assumption in EUT of a concave utility function, .5 × u(2X) > u(X), and people should prefer a probabilistic to a deterministic insurance. As shown in Table 1, only 20% responded that they would be interested in such an option. Research has shown that this reluctance is primarily due to an overweighting of rare events (Wakker, Thaler & Tversky, 1997).
1.6 Explanatory mechanisms
The paradoxes in Kahneman and Tversky (1979) can roughly be divided into three categories, according to the psychological assumptions they evoke for their explanation. In analogy with many findings in perception and psychophysics, the psychological assumptions typically imply better discrimination between stimuli close to salient references: the current state of wealth in regard to the value function, and certain states in regard to the probability weighting function (see also Tversky & Kahneman, 1992).
Problems 1–9 (the Certainty and the Probabilistic Insurance Effects), and the comparison of Problem 10 to Problem 4 (the Isolation Effect for probabilities) all relate to the non-linearity of probability weighting, where there is especially acute discrimination between probabilities close to 0 and 1, and over-weighting of low probabilities and under-weighting of high probabilities (Tversky & Kahneman, 1992). However, there is also evidence suggesting that subjective probability is categorical (Fleming, Maloney & Daw, 2013); that is, that probabilities of 0 (impossible) and 1 (certain) are particularly vivid and treated qualitatively differently from other categories. Notably, this is fully in line with the Certainty Effect. The second category includes demonstrations attributed to the shape of the value function, with the most acute discrimination close to the current state of wealth, implying a convex value function for losses and a concave value function for gains (Problems 11–12). The third category includes effects attributed to loss aversion: the differential evaluation of magnitudes in the loss and gain domain (Problems 13–14, 15–16).
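To make the role of nonlinear probability weighting concrete, the sketch below evaluates Problems 3 and 4 with the one-parameter weighting function and power value function of Tversky and Kahneman (1992). The parameter values (gamma = 0.61, alpha = 0.88) are the median estimates reported in that paper for gains; they are an assumption here and were not estimated in the present study.

```python
def weight(p, gamma=0.61):
    """Probability weighting function for gains (Tversky & Kahneman, 1992 form).
    gamma = 0.61 is their median estimate, used here as an assumption."""
    if p in (0.0, 1.0):
        return p  # impossibility and certainty are left unweighted
    numerator = p ** gamma
    return numerator / (numerator + (1 - p) ** gamma) ** (1 / gamma)

def value(x, alpha=0.88):
    """Power value function for gains (alpha = 0.88, assumed)."""
    return x ** alpha

def pt_value(p, x):
    """PT value of 'win x with probability p, else nothing'."""
    return weight(p) * value(x)

# Problem 3: A = ($4000, .80) vs. B = ($3000, 1.0)
print("Problem 3:", "A" if pt_value(0.80, 4000) > pt_value(1.0, 3000) else "B")
# Problem 4: A' = ($4000, .20) vs. B' = ($3000, .25)
print("Problem 4:", "A'" if pt_value(0.20, 4000) > pt_value(0.25, 3000) else "B'")
```

Under these assumed parameters, the weight given to .80 falls well below .80 while certainty keeps its full weight, so the certain option wins in Problem 3; in Problem 4 the weights of .20 and .25 are close enough that the larger prize dominates.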
1.7 Conceptual replication attempts
For choices, evidence for the paradoxes in Kahneman and Tversky (1979) is mixed. In a recent study with university students, the Certainty Effect and the Reflection Effects were replicated (Erev et al., 2017; see Kühberger, Schulte-Mecklenbeck & Perner, 1999, and Linde & Vis, 2017, for similar results). But other studies show that the prevalence of the Certainty and the Reflection Effects decreases if presentation formats other than explicitly stated probabilities are used (Carlin, 1990; Erev & Wallsten, 1993). The effects are also sensitive to the population tested. Politicians in one study did not exhibit the effects (Linde & Vis, 2017) and the effects decreased in people high in both education and domain knowledge (Huck & Müller, 2012). Moreover, several studies have failed to capture the Isolation Effect for Outcomes (see Romanus & Gärling, 1999) and have shown that the differences in curvature between utility and probability functions in the loss and gain domains are fairly small (Harbaugh, Krause & Vesterlund, 2009; Mukherjee, Sahay, Pammi & Srinivasan, 2017; Yechiam & Hochman, 2013), the main difference being that an outcome is perceived as more extreme in the loss than in the gain domain. However, the existence of this hypothesized loss aversion has also been questioned (e.g., Harbaugh et al., 2009; Nilsson, Rieskamp & Wagenmakers, 2011). The endowment effect, which is typically explained by the notion of loss aversion, has likewise been questioned (Plott & Zeifler, 2005).
In sum, the probability weighting seems highly dependent on context and cognitive constraints of the decision maker (Fox & Poldrack, 2014) and the paradoxes related to loss aversion and curvature differences for gains and losses seem to be most difficult to replicate. Intriguingly, although PT was originally formulated for choices, the evidence for the paradoxes is, if anything, stronger with evaluations of prospects (e.g., certainty equivalents or willingness-to-pay). For example, the fourfold-pattern of risk attitudes (risk seeking over low-probability gains; risk-averse over high probability gains; risk-averse over low-probability losses; risk-seeking over high-probability losses) seems more prevalent under evaluations (Harbaugh et al., 2009).
1.8 Aims and Hypotheses of the Present Study
The aim of the study is a conceptual replication of Kahneman and Tversky (1979), engaging a participant population with a wider range of numeracy than in the original study (to be precise, the higher levels of numeracy in our study should approximate the levels of numeracy in the original study).Footnote 2 At first glance, the literature might suggest that the less numerate should be more vulnerable to the paradoxes. For example, the less numerate are more incoherent in probability judgments (Liberali et al., 2012; Lindskog et al., 2015; Winman et al., 2014) and their probability weighting functions are more nonlinear and sensitive to framing (Millroth & Juslin, 2015; Patalano, Saltiel, Machlin & Barth, 2015; Schley & Peters, 2014; Traczyk & Fulawka, 2016). A more nonlinear probability weighting function will produce more of the paradoxes reported in Kahneman and TverskyFootnote 3.
However, the studies documenting that probability weighting is dependent on numeracy have relied on evaluations of risky prospects. It is well-established that preferences can differ depending on whether they are elicited through evaluations of prospects or by choices between prospects (Lichtenstein & Slovic, 2006) – the latter being the method used by Kahneman and Tversky (1979). Studies on choices between risky prospects suggest that people often do not rely on the compensatory strategy implied by PT, where trade-offs are made between probabilities and value (Cokely & Kelley, 2009; Reyna, Chick, Corbin & Hsia, 2014). Instead, people often rely on non-compensatory heuristics, for example, choosing the option that minimizes the risk of obtaining the worst possible outcome, and such heuristics are especially likely to be used by people who are low in numeracy (Cokely & Kelley, 2009). It may thus be those high in numeracy who are most affected by the paradoxes that follow from the nonlinear and compensatory processing of value and probability implied by PT.
The issue of whether cognitive illusions arise both in within-subject and between-subject designs (WSDs, BSDs) has been repeatedly addressed (Kahneman & Frederick, 2005) and people often disclose more normative behavior in a WSD (e.g., Regenwetter, Dana & Davis-Stober, 2011; Tversky, 1969; Mellers, Weiss & Birnbaum, 1992). Specifically, recent research suggests that the presence of comparative anchors in a WSD allows people to produce more linear probability weighting than in an extreme case of the BSD, namely the Single-Stimuli Design (SSD, where participants make only one judgment in isolation; Millroth et al., 2018). This predicts larger Certainty Effects and Isolation Effects for Probability – effects explained by the nonlinear probability weighting function – in an SSD.
Kahneman and Tversky (1979) focused on reporting choice proportions and modal responses (i.e., showing that while a majority chose B for the first problem, a majority chose A’ for the second problem). However, in recent years it has become increasingly clear that inferences about the behavior of individuals from aggregate data can be problematic (e.g., Kirman, 1992; Jouini & Napp, 2012; Regenwetter et al., 2009; Regenwetter, Dana & Davis-Stober, 2011; Regenwetter & Robinson, 2017). Indeed, for the first paradox, Kahneman and Tversky (1979) reported not only choice proportions and modal choice for each problem (i.e., 82% chose Option B in Problem 1 and 83% chose Option A’ in Problem 2), but also the proportion of individuals actually producing the paradoxical choice pattern (i.e., 61% of the individuals made this EUT-violating choice pattern, BA’, see p. 266). For the other paradoxes they did not, leaving the reader to assume that the same pattern held for the other paradoxes. In this article, we therefore report not only the modal choice and the choice proportions (e.g., 60% of the participants choose B in Problem 1), but also the proportion of participants who disclose the paradoxical choice pattern (e.g., 20% revealed the choice pattern B and A’ for Problems 1 and 2, violating EUT). Note that 60% choosing B and 60% choosing A’ is consistent with 80% of the individuals making choices in agreement with EUT.
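The arithmetic behind the 60%/60% example can be sketched as follows. Given only the two aggregate choice proportions, the share of individuals showing the BA’ pattern is constrained to an interval; the function name is hypothetical and the bounds are the standard ones for a joint proportion with fixed marginals.

```python
def paradox_share_bounds(p_b_first, p_a_second):
    """Smallest and largest share of individuals who can show the BA' pattern,
    given only the aggregate proportions choosing B in the first problem and
    A' in the second (hypothetical helper)."""
    lower = max(0.0, p_b_first + p_a_second - 1.0)
    upper = min(p_b_first, p_a_second)
    return lower, upper

lower, upper = paradox_share_bounds(0.60, 0.60)
print(f"BA' share lies between {lower:.2f} and {upper:.2f}")
```

At the lower bound, 20% of the individuals choose BA’, 40% choose BB’, and 40% choose AA’, so 80% make EUT-consistent choices even though the modal responses in the two problems conflict.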
2 Method
2.1 Participants
In the main study, the WSD sample consisted of 346 participants (165 male and 181 female participants) ranging in age from 18 to 75 years (M = 36.9, SD = 12.4). The SSD sample consisted of 1,287 participants (576 male and 711 female) ranging in age from 18 to 74 years (M = 34.9, SD = 12.3).Footnote 4 Settings on the platform CrowdFlower.com were configured so that only U.S. residents could participate. Data collection continued until the recruitment rate plateaued (i.e., until one week had passed since any new participant had started the survey). The participants were compensated with one U.S. dollar for the WSD and a quarter dollar for the SSD.
A potential disadvantage of recruiting participants online is less experimental control over responses, possibly with poorer data quality as a result, although the empirical evidence for this claim is weak at best (Gosling & Mason, 2015; Hauser & Schwarz, 2016). We address this concern with a number of separate analyses presented at the end of the Results section.
2.2 Design, Material & Procedure
The experiment involved all 16 forced-choice problems in Table 1. All choices were hypothetical. We created four surveys with different presentation ordersFootnote 5 to control for the possibility that the results were driven by a specific presentation order. Each participant in the WSD was randomly assigned to one of the surveys, resulting in 88 participants allocated to Survey A; 86 participants allocated to Survey B; 90 participants allocated to Survey C; and 82 participants allocated to Survey D.Footnote 6 Before starting the survey, the participants reported their age and gender. They received written information that the study addressed judgment and decision making, was not in any way invasive or unpleasant, did not involve deception, and that participation was voluntary. The participants were explicitly told that they could abort the study whenever they wished. No personal information was recorded in a way that could make identification of a specific participant possible.
Numeracy was measured last with the four-item Berlin Numeracy Test (BNT; Cokely et al., 2012). While there are other tests of numeracy (e.g., Schwartz, Woloshin, Black & Welch, 1997; Lipkus, Samsa & Rimer, 2001), the BNT has come to be the most widely used test validated for use with diverse samples from industrialized communities, doubling the predictive power of the best available alternative numeracy instruments and uniquely predicting decision quality independent of several measures of general cognitive abilities (Cokely et al., 2018; Ghazal et al., 2014; Lindskog, Kerimi, Winman & Juslin, 2015). Because LimeSurvey (the survey tool that we used) allows for the collection of response times, these were also collected. An example screenshot of the decision task is available at https://osf.io/fjvmz/.
2.3 Statistical Analyses
Analyses involved Bayesian Hypothesis Testing (BHT) using the Bayes Factor (BF) in the software JASP (JASP Team, 2018: v. 0.8.6) and the BayesFactor package in R (Morey, Rouder & Jamil, 2015: v. 0.9.2+; for a discussion of the advantages of BHT over Null Hypothesis Significance Testing, NHST, see Dienes, 2014; 2017; Rouder, Speckman, Sun, Morey & Iverson, 2009; Wagenmakers, 2007). Bayes factors (BFs) focus on the relative evidence, provided by the data, for the hypotheses (NHST focuses on the probability of the data, given that a null hypothesis is true). BFs thus indicate how many times more likely the data are under one hypothesis than under another hypothesis (e.g., BF10 = 100 represents that the data are 100 times more likely under Hypothesis 1 than under Hypothesis 0). Throughout the Results section, however, we demonstrate that the conclusions obtained are not contingent on the use of BHT, but correspond to the conclusions suggested by NHST (p-values).
3 Results
A Bayesian ANOVA based on the WSD with number of EUT-violations (1–9) as dependent variable and presentation order as independent variable did not yield any evidence that the presentation order affected the number of effects (BF10 = .110), and these datasets were thus collapsed into one single WSD set. We first compared the choice proportions and modal responses for each problem in the WSD sample and the SSD sample. Then we compared our results with those reported by Kahneman and Tversky (1979). Then we asked whether numeracy affected the prevalence of decision paradoxes in our data. Finally, we performed complementary analyses in order to check our conclusions.
3.1 WSD vs. SSD
Figure A1 in Appendix A reports the mean proportions of “Decision A” with 95% credible intervals, for data from the SSD and the WSD (the proportions are summarized in Table A1 in Appendix A together with Bayesian hypothesis tests). Figure A1 suggests that for most problems the decision proportions are similar in both designs. The only exception is Problem 11, where there is a higher proportion of Decision A in the SSD than in the WSD. For no problem is the modal decision changed by the design, and we conclude that there is little evidence for large or systematic effects of the design.
3.2 Comparison with Kahneman and Tversky (1979)
Because there were no differences in the majority choices in the WSD and SSD, we collapsed the two datasets in order to gain statistical power when the aggregated proportions were compared with the proportions reported in Kahneman and Tversky (1979).Footnote 7 The results are reported in Table 2 and visually compared to Kahneman and Tversky (1979) in Figure 1. In the statistical analysis, we report three different Bayes Factors (BFs): BF10 quantifies the evidence in favor of a population proportion different from .5 relative to the evidence that this proportion is .5 (roughly corresponding to a two-tailed t-test with NHST). BFDir quantifies the evidence that the population difference is in the observed direction relative to the alternative direction (i.e., that the population proportion is > .5 if the sample proportion is > .5, and correspondingly for a negative difference).Footnote 8 BFDiff quantifies the evidence in favor of a difference between the two choice proportions that together define the paradox.Footnote 9
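As an illustration of what a BF10 for a single choice proportion measures, the sketch below computes the Bayes factor for a binomial proportion against the point null of .5. It assumes a uniform Beta(1, 1) prior on the proportion under H1; the prior settings actually used in the reported analyses are not stated in this section, so the numbers it produces need not match Table 2. The function name is hypothetical.

```python
import math

def log10_bf10_binomial(k, n):
    """log10 of the Bayes factor for H1 (proportion unknown, uniform prior;
    an assumption) against H0 (proportion = .5), given k 'A' choices among n."""
    # Under a uniform prior, the marginal likelihood of k successes is 1 / (n + 1).
    log_m1 = -math.log(n + 1)
    # Under H0 the likelihood is C(n, k) * 0.5 ** n; logs avoid underflow for large n.
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_m0 = log_choose + n * math.log(0.5)
    return (log_m1 - log_m0) / math.log(10)

# Hypothetical counts for illustration only.
print(log10_bf10_binomial(50, 100))  # negative: the data favor H0
print(log10_bf10_binomial(90, 100))  # large and positive: the data favor H1
```

A value above 3 on this scale corresponds to BF10 > 1000, and a negative value means that the evidence favors a proportion of .5.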
3.3 Choice proportions and modal responses
In Table 2 we see that, as in Kahneman and Tversky (1979), for most problems there is clear evidence against the null hypothesis that the population proportion is .5 (BF10 > 1000; the exceptions are Problems 2, 4, and 8) and for most paradoxes there is evidence against the choice proportions in the two problems compared being the same. However, it is only for 11 of the 16 problems that we replicate the modal response in Kahneman and Tversky. Among these 11 problems, the evidence actually favors H0 (a population proportion of .5) over H1 for Problem 2 (choice proportion .530; BF10 = .129) and the evidence for H1 is weak for Problem 8 (choice proportion .576; BF10 = 8.85). For the cases where the modal response in the attempted replication deviates from the original study, the evidence is very strong (Problems 4, 6, 12, 14, & 15).
As a consequence, of the nine paradoxes that Kahneman and Tversky (1979) reported in terms of conflicting modal choices, it is only for two (the Certainty Effects with .001 probabilities and the Probabilistic Insurance Effect) that we unambiguously replicate the modal pattern (e.g., “B” for Problem 1 and “A’” for Problem 2). As illustrated in Figure 1, there appear to be systematic differences between our results and those reported by Kahneman and Tversky, with many more “B” choices in our data. “B” choices all involve choice of a certain outcome (and are therefore risk averse) except for Problems 7, 8, 13, and 14, three of which were exceptions to the strong preference for B relative to Kahneman and Tversky.
3.4 Paradoxes at the individual level
Examining the proportion of individuals in the WSD that showed 0, 1,… up to all 9 paradoxes in Kahneman and Tversky (1979) (i.e., producing an EUT-violating AB’ or BA’ choice pattern for the pair of problems), it is clear that not a single participant (N = 346) produced all nine paradoxes, the modal participant produced 2 of the 9 paradoxes, and the median participant produced 3 paradoxes, suggesting that more than 50% of the participants exhibited no more than 3 of the 9 original paradoxes (the patterns of replicated paradoxes are discussed in greater detail in the next section on the effects of numeracy, and they are summarized in Appendix B).
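The individual-level count can be sketched as below. This is not the scoring script used in the study: the list of problem pairs is an assumption inferred from the pairs named in the text, the single-item Probabilistic Insurance problem (Problem 9) is left out, and the participant data are hypothetical.

```python
# Assumed pairing of problems into paradoxes, inferred from the text.
# Problem 9 is a single item and is omitted from this sketch.
PARADOX_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8), (4, 10), (11, 12), (13, 14), (15, 16)]

def choice_pattern(first_choice, second_choice):
    """Label a pair of choices, e.g. 'B' and 'A' become "BA'"."""
    return f"{first_choice}{second_choice}'"

def count_paired_paradoxes(choices, pairs=PARADOX_PAIRS):
    """Number of pairs for which the participant shows an AB' or BA' pattern.
    `choices` maps problem number to 'A' or 'B'."""
    return sum(
        choice_pattern(choices[first], choices[second]) in ("AB'", "BA'")
        for first, second in pairs
    )

# Hypothetical participants for illustration.
always_b = {problem: "B" for problem in range(1, 17)}
mixed = {**always_b, 2: "A", 4: "A"}

print(count_paired_paradoxes(always_b))
print(count_paired_paradoxes(mixed))
```

A participant who always chooses B produces no paired paradoxes, which illustrates how a general preference for B lowers the individual-level replication rate.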
3.5 Dependence on Numeracy
3.5.1 Choice proportions and modal responses
Tables C1 to C5 in Appendix C report results for each numeracy group (zero to four items correct on the BNT) in the same fashion as in Table 2. We generally replicate the finding that the choice proportions differ. However, as summarized in Figure 2 (see also Table B1 in Appendix B), there are notable differences regarding the modal patterns across numeracy groups. First, the number of modal-pattern replications increases systematically as numeracy increases (Table 3). Second, the increases in modal-pattern replications are related to the paradoxes driven by the probability weighting function. Third, for the “paradoxes” linked to the value and the loss function, the modal pattern was EUT-consistent choices, if only because subjects were generally risk averse. When the modal patterns were not replicated, it was generally not because the data yielded insufficient evidence, but rather because they favored other patterns than in the original study (e.g., “B” and “B” instead of “B” and “A”). In sum, the modal results in Kahneman and Tversky (1979) are not replicated for all numeracy levels and are best replicated for the probability-weighting paradoxes and the most numerate participants.
3.5.2 Paradoxes at the individual level
A Bayesian ANOVA with number of paradoxes as dependent variable and numeracy group as the independent variable (parametric assumptions were satisfied) yielded a BF10 of 52.6 (strong evidence) in favor of a difference between the numeracy groups (p < .05; see also Appendix D).
The descriptive statistics summarized in Table 4 strongly suggest that the observed difference between the numeracy groups is limited to the paradoxes associated with the probability weighting function. Among the paradoxes linked to the probability weighting, the EUT-inconsistent choice pattern was the modal pattern for Numeracy Groups 3, 4, and 5 (for 51%, 52%, and 74% of the participants in each group, respectively). For “paradoxes” linked to the value function, the paradoxical choice pattern in Reference Kahneman and TverskyKahneman and Tversky (1979) was never the most typical pattern (on average observed in 23% of the participants). Supporting these notions, tests of linear trends using logistic-regression analysis for each paradox with prevalence of paradox (yes/no) as the dependent variable and BNT-score as an independent variable (coded on an interval level) showed significant effects for Certainty effect 1–4 and the Isolation effect for probabilities.Footnote 10 Importantly, the pattern at the level of the individuals (Table 4) is very similar to the pattern observed for the modal response proportions (Figure 2). The prevalence of paradoxes is dependent on numeracy.
Figure 2 (Table B2 in Appendix B) show that (i) BB’ was the most prevalent choice pattern, with increasing frequency as numeracy decreases; (ii) BA’ choices (which were the choice pattern emphasized by Reference Kahneman and TverskyKahneman & Tversky, 1979) were the second-most prevalent pattern with increasing frequency as numeracy increases; (iii) AA’ responses was the most rare pattern, and with no visible difference between the numeracy groups; (iv) AB’ responses were the second-most rare pattern, also with no visible difference between the numeracy groups. This finding is worth stressing: the difference in the responses is due to how the participants respond to BB’ or BA’. Notably, B’ as compared to A’, are options that minimize the risk of obtaining the worst possible outcome, corresponding to the notion that the less numerate rely on heuristics that favor less risky options (Reference Cokely and KelleyCokely & Kelley, 2009), while the more numerate integrate more of the quantitative information (Reference Reyna, Chick, Corbin and HsiaReyna et al., 2014).
3.5.3 Adressing data quality and noise as a confounding variable
A concern may be that the data for the less numerate participants may be of lower quality because they, for example, are less motivated to engage in numerical choice tasks (e.g., Reference Peer, Samat, Brandimarte and AcquistiPeer, Samat, Brandimart & Acquisti, 2016). Several lines of evidence speak against this interpretation. First, BNT is seemingly not correlated with measures of motivation (Reference Cokely, Galesic, Schulz, Ghazal and Garcia-RetameroCokely et al., 2012). Second, the responses for the least numerate deviate systematically from .5 and do not seem random or arbitrary. For example, inspecting the choice proportions for each of the 16 choice problems for the least numerate (Table B1 in Appendix B), we see that the evidence for a choice proportion different from .5, over the .5 null-hypothesis of decision proportion .5 (i.e., random choice), is supported by a BF10 > 1000 for 13 of the 16 problems. The same conclusion is suggested by the consistent choice pattern by the least numerate in Figure 4, which deviates most distinctly from the uniform distribution expected by chance responses. The low replication rate at low numeracy is not a result of more random responses.
Third, there was no evidence that the response times differed between the BNT groups: a Bayesian ANOVA, with the lognormal-transformed average response times per prospect as the dependent variable and BNT group as the independent variable, yielded a BF10 of .072 (p > .05); for the response times on the BNT itself, the ANOVA yielded a BF10 of .122 (p > .05).Footnote 11 It thus seems unlikely that the less numerate simply wanted to get through the experiment as quickly as possible and collect their payment. It should also be noted that our sample replicated the positive skew of responses on the BNT that previous research has documented for similar crowdsourced participants (Cokely et al., 2012); thus, there was seemingly nothing peculiar about our participants compared to other similar samples.
Fourth, assuming that the probability weighting and value functions can be used to model people’s behavior, less reliable responses (i.e., more “noisy” responses) should lead to more of the patterns observed by Kahneman and Tversky (1979), not fewer: that random noise contributes to a more nonlinear probability weighting function (and thus should produce more paradoxes) has been demonstrated elsewhere (Blavatsky, 2007; Millroth et al., 2018).
A more serious potential problem motivated an independent replication of our results. Following how we (wrongly) interpreted the procedure of Kahneman and Tversky (1979), we did not counterbalance the presentation order of the alternatives (i.e., the prospects for Alternative A, the first choice option, were the same for all participants). Hypothetically, some of the results reported in Table 2 could be driven by a large proportion of participants having chosen Alternative B because it was always presented on the right-hand side of the screen. We therefore conducted an independent replication of the results in Table 2, where half of the participants received the options in the same order as in Table 2 (N = 99) and the other half received the options in the reverse order (N = 100; see Table 4 for results).
In contrast to the hypothesis of a response bias towards Option B (e.g., because it is on the right of the computer screen), the response proportions in Table E1 in Appendix E are mirror images of each other when the choice options are reversed (i.e., if Alternative A was the majority response in the original order, Alternative B was the majority response in the reverse order). The BFDiff values in favor of a difference in the proportions between the two order conditions in Table 4 were low and ranged between .177 and .466 over the 16 problems (and between .261 and 1.66 if we only consider the participants with the lowest numeracy, BNT = 0). Thus, regardless of the presentation order of the choice options, the results in Table E1 replicate the results in Table 2, but here three out of the nine modal patterns in Kahneman and Tversky (1979) reappear.Footnote 12 The consistent mirror effect of reversing the order of the decision options in Table E1 provides further evidence that the responses collected here – and by implication the relatively modest replication rate – are not explained by the participants providing random responses.
A Bayesian between-subjects ANOVA provides evidence (BF10 = 34.3, p < .05) for a difference in the number of paradoxes between the numeracy groups, with the most replicated paradoxes for participants with the highest numeracy (M = 4.75, SD = 2.06) and the fewest replicated paradoxes for participants with the lowest numeracy (M = 2.43, SD = 1.52). In these data, no participant (N = 199) exhibited all nine paradoxes from the original study.
4 Discussion
The aim of this study was to provide a conceptual replication of the psychological effects that were reported in the classical study of Kahneman and Tversky (1979), relating the results to the numeracy of the participants and the role of contextual support in terms of other related judgments (raised in terms of the comparison between a WSD and an SSD). These psychological effects – or decision paradoxes – have in turn been used to motivate a number of psychological assumptions of PT, essentially nonlinear probability weighting, nonlinear value functions that are differently shaped for gains and losses, and a stronger reaction to losses than to gains of the same magnitude (i.e., loss aversion). Because we found no strong evidence for systematic differences in the choice proportions depending on the design type, the subsequent discussion is focused on the observed differences between the present study and Kahneman and Tversky (1979) and on how the results are related to numeracy.
4.1 Replication of the paradoxes in Kahneman and Tversky (1979)
While we replicate that the choice proportions often differ between the two choice problems that define a paradox, the results show that, for the entire participant sample, the modal responses were clearly replicated for only two of the nine paradoxes (one Certainty Effect and Probabilistic Insurance). The conflicting modal choices in these paradoxes were in focus in Kahneman and Tversky (1979) because they suggested that a majority of the individuals produced choices that are incompatible with EUT. In our results, this seems to be most evident for the paradoxes that are related to the probability weighting function and among the most numerate participants.Footnote 13 Footnote 14 The paradoxes associated with the value function and loss aversion were harder to replicate in our study. Not a single individual exhibited all nine paradoxes posited by PT, and over 50 percent of all participants exhibited no more than three of the paradoxes. These conclusions tie into at least two lines of previous research.
First, the prevalence of the reflection effects has indeed varied across studies (e.g., Ert & Erev, 2013; Harbaugh et al., 2009; Nilsson, Rieskamp & Wagenmakers, 2011; Yechiam & Hochmann, 2013). The same is true for the Isolation Effect for Outcomes (Romanus & Gärling, 1999), and even the Certainty Effect has been shown to depend on the presentation format (Carlin, 1990; Harbaugh et al., 2009) and the participant population (Linde & Vis, 2017; Huck & Müller, 2012). Tellingly, the studies that have most consistently replicated the original results (Erev et al., 2017; Kühberger et al., 1999) have used participants recruited from universities, that is, participants who are likely to score high on numeracy.
Second, concerns related to ‘generalized agent models’ (Kirman, 1992) deserve more attention. For a number of reasons, Kahneman and Tversky (1979), along with later studies focusing on choice proportions at the aggregate level, may have overstated the degree to which the individual participants disclose the choice patterns that correspond to the nine paradoxes. Kirman (1992) argued that i) there may simply be no direct relation between individual and collective behavior, ii) the generalized agent need not react to a manipulation in the same manner as the underlying individuals, and iii) the beliefs of the generalized agent may not be shared by any of the individuals, but emerge due to the effects of dispersion. Recent research has shown by simulations that this may hold for the psychological assumptions of PT (e.g., Jouini & Napp, 2012; Regenwetter & Robinson, 2017).
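The gap between aggregate and individual choice patterns can be illustrated with a minimal sketch. The pattern shares below are hypothetical numbers chosen for illustration and are not the proportions reported in Table B2; the function names are likewise assumptions. A pattern such as "BA'" denotes choosing Option B in the first problem of a pair and Option A' in the second.

```python
# Hypothetical shares of participants showing each paired-response pattern.
pattern_shares = {"BB'": 0.40, "BA'": 0.35, "AA'": 0.25, "AB'": 0.00}


def modal_individual_pattern(shares: dict) -> str:
    """Return the pattern shown by the largest share of individuals."""
    return max(shares, key=shares.get)


def aggregate_modal_pattern(shares: dict) -> str:
    """Return the pattern formed by the modal choice in each problem.

    This mimics an analysis of choice proportions at the aggregate level:
    each problem is summarized separately by its majority response.
    Exact 50/50 splits are resolved arbitrarily (A in the first problem,
    B' in the second).
    """
    share_b_first = sum(s for p, s in shares.items() if p[0] == "B")
    share_a_second = sum(s for p, s in shares.items() if p[1] == "A")
    first = "B" if share_b_first > 0.5 else "A"
    second = "A'" if share_a_second > 0.5 else "B'"
    return first + second


print("Modal individual pattern:", modal_individual_pattern(pattern_shares))
print("Pattern implied by modal choices:", aggregate_modal_pattern(pattern_shares))
```

With these illustrative shares, the majority responses in the two problems jointly suggest the BA' pattern, although BA' is produced by only 35 percent of the hypothetical individuals and the most common individual pattern is BB'.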
The point is not to argue that PT is poorer than EUT as a quantitative account of the data. On the contrary, because PT is more flexible, with free parameters (and with EUT as a special case), it will trivially be better at accounting for various patterns observed in data, including those observed here and those implied by EUT. The ability of a model to also capture various more idiosyncratic patterns present in a minority of the participants can be regarded as a virtue of a quantitative model. Therefore, PT may in many applied circumstances be a more valid and versatile instrument than EUT for describing a variety of different choice patterns.
However, the results presented here do raise questions about how universal the assumptions postulated by PT are and highlight the importance of determining their limiting conditions. The results in Kahneman and Tversky (1979) served to motivate and illustrate the key assumptions of PT, which were intended – as far as we can tell – to capture important and general aspects of human decision making that deviate from the assumptions of EUT. It is thus hard to regard the limited replicability of these phenomena in a wider population as anything but potentially problematic. As noted above, relaxing EUT in ways that take psychological assumptions into account can be useful to capture choice behaviors. The results presented here, however, raise the question of whether the assumptions made in PT are the most relevant ones for capturing prevalent deviations from EUT. Moreover, the application of PT to explain many large-scale societal phenomena that apparently represent deviations from EUT often seems to presume that the weighting functions of PT are operative in many or most of the individual agents. The validity of these assumptions and their associated effects may also depend on the individual characteristics of the agent, such as his or her level of numeracy. Our results also call into question whether accounting for these paradoxes should be an obligatory benchmark that any theory of risky decision making in the field must meet.
4.2 Dependence on numeracy
As noted in the Introduction, the previous literature on numeracy suggested two contrasting possibilities regarding the outcome of the present study. The first possibility was that the less numerate would exhibit more paradoxes because they have a probability weighting function that is more nonlinear than that of the more numerate (Millroth & Juslin, 2015; Patalano et al., 2015; Schley & Peters, 2014; Traczyk & Fulawka, 2016), and in the current context a more nonlinear probability weighting function will produce more paradoxes. However, a second possibility rested on the observation that the larger nonlinearity of the probability weighting function at lower numeracy has been observed for evaluations of prospects. When people make choices between prospects, by contrast, the less numerate have been found to rely on simple heuristics that often favor less risky options (Cokely & Kelley, 2009), while the numerate are capable of more deliberate behavior, taking the quantitative details of the problems into account (Cokely & Kelley, 2009; Reyna et al., 2014).
The results showed that the least numerate exhibited the fewest paradoxes at the individual level, and the most numerate exhibited the most paradoxes at the individual level. The differences were confined to the paradoxes that relate to the probability weighting function proposed by Kahneman and Tversky (1979; see also Tversky and Kahneman, 1992). Most of the effects of numeracy were consistent with a systematic shift in the less numerate to a more cautious non-compensatory strategy that minimized the risk of obtaining the worst possible outcome, in line with the second possibility. The simple strategy of rejecting the option with the poorest possible outcome or, if all options share this same poorest outcome as a possibility, rejecting the option with the higher probability of this poor outcome, predicts Option B for all choice problems in Table 1, which corresponds to the decision behavior at low numeracy (see Table C1 in Appendix C). Researchers therefore need to be aware of the implications: the preferences captured in any given experiment are likely to depend on an interaction between the type of elicitation method (evaluations vs. choices) and the level of numeracy. This is in line with a growing body of research that has documented the malleability of preferences derived from behavioral measures (e.g., Pedroni et al., 2017): rather than having stable risk preferences that can be fundamentally different between people, people are probably equipped with a large variety of decision strategies that they apply in response to the specific architecture of the environment.
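The simple worst-outcome strategy described in this section can be written down as a minimal sketch. The representation of a prospect as a list of (outcome, probability) pairs, the function names, and the example prospects are assumptions made for illustration; the examples are hypothetical and are not the entries of Table 1.

```python
from typing import List, Tuple

# A prospect is a list of (outcome, probability) pairs; outcomes with zero
# payoff must be listed explicitly so that the worst outcome is visible.
Prospect = List[Tuple[float, float]]


def worst_outcome(prospect: Prospect) -> float:
    """Return the poorest outcome that can occur with nonzero probability."""
    return min(outcome for outcome, prob in prospect if prob > 0)


def prob_of(prospect: Prospect, outcome: float) -> float:
    """Return the total probability of a given outcome in the prospect."""
    return sum(prob for out, prob in prospect if out == outcome)


def cautious_choice(option_a: Prospect, option_b: Prospect) -> str:
    """Reject the option with the poorer worst outcome; if the worst
    outcomes are equal, reject the option where it is more probable."""
    worst_a, worst_b = worst_outcome(option_a), worst_outcome(option_b)
    if worst_a != worst_b:
        return "A" if worst_a > worst_b else "B"
    p_a, p_b = prob_of(option_a, worst_a), prob_of(option_b, worst_b)
    if p_a != p_b:
        return "A" if p_a < p_b else "B"
    return "indifferent"


# Hypothetical gain problem with a sure option.
print(cautious_choice([(4000, 0.80), (0, 0.20)], [(3000, 1.0)]))
# Hypothetical gain problem where both options share the worst outcome.
print(cautious_choice([(4000, 0.20), (0, 0.80)], [(3000, 0.25), (0, 0.75)]))
```

The strategy is non-compensatory in the sense that it never trades off the size of the better outcomes against the probability of the worst one; only the worst outcome and, as a tie-breaker, its probability enter the decision.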
We found no evidence that these patterns were explained by “random responses” or poor data quality. The responses provided at low numeracy seem highly systematic, and when the order of the choice options is reversed, so are the choices between the alternatives (i.e., if the participants choose Option A in the original order, they seem to select Option B when the options are presented in the reverse order). Nor were there differences in the response times to the prospects or to the BNT. Future research could usefully include other measures of numeracy (e.g., the Lipkus test) and measures of motivation and metacognition, in order to elucidate the exact mechanisms by which differences in numeracy cause the results.
4.3 Why is numeracy positively related to the susceptibility to (some) paradoxes?
In the following, we entertain two (not necessarily exclusive) explanations. A first explanation emphasizes that the kind of lottery-metaphor tasks typically used in decision research (and in this study) may be cognitively more demanding for people with low numeracy. People low in numeracy may therefore find it especially difficult to confidently evaluate these complex quantitative options, and they may instead retreat to the more cautious strategy of minimizing the risk of receiving the worst possible outcome, which is the pattern observed in our data.
On the one hand, this attitude makes some sense (i.e., even those high in numeracy could presumably be presented with lottery options so complex that they are difficult to evaluate, and which they are therefore inclined to reject) and, in regard to the specific choice set in Kahneman and Tversky (1979), it leads to better agreement with EUT. On the other hand, and in the grand scheme of things, an inability to properly integrate probabilities and outcomes to identify superior options will often produce mediocre decisions with poorer long-term accrual of returns. This can be considered an epistemic risk aversion, where people avoid not only the options for which the outcome is unknown, but also the options for which they lack sufficient confidence in their own ability to accurately evaluate their attractiveness. Future research should delineate to what extent this holds under varying conditions (i.e., numerically simplifying the task in terms of attributes and alternatives).
A second potential explanation is provided by fuzzy-trace theory (FTT: Brainerd & Reyna, 1990; Reyna & Brainerd, 1995; Reyna, 2008; Reyna & Brainerd, 2011; Broniatowski & Reyna, 2018). FTT posits that people rely on two types of mental representations: verbatim and gist representations. Verbatim representations capture the exact surface form of problems or situations, that is, how they are perceived literally (e.g., the words or numbers). Gist representations capture the bottom-line meaning of the problem or situation. In contrast to verbatim representations, which are precise (and quantitative, if they involve numbers), gist representations are vague and qualitative. People are capable of processing both verbatim and gist information, but they prefer to reason with gist traces rather than verbatim traces. Importantly, FTT can explain the functional forms of probability and value posited by PT (for the mathematics, see Reyna & Brainerd, 2011; Broniatowski & Reyna, 2018). A hypothesis in the framework of FTT is that, in choices between risky prospects, the least numerate rely on gist representations while the more numerate are more able to make use of verbatim quantitative representations.
4.4 Conclusions
We believe that our study demonstrates that i) the replication rate for the paradoxes in Kahneman and Tversky (1979) that originally motivated the key psychological assumptions of PT is very modest in a population with a larger variation in numeracy; ii) the paradoxes that are easiest to replicate are those that relate to the probability weighting function, but they primarily occur among participants who are high in numeracy; and iii) the choices that people low in numeracy make are consistent with a shift towards a more cautious and non-compensatory strategy that concentrates on minimizing the risk of obtaining the worst outcome. The results highlight important limiting conditions for the psychological assumptions made in PT.
5 Appendix A: Detailed Report of Statistics Related to Design (WSD or SSD)
To determine whether the modal responses differed between the samples, proportions along with 95 per cent credible intervals were derived using Bayesian binomial tests. To quantify the evidence that a proportion was either mostly “A” or mostly “B”, we calculated a Bayes factor (BF) for each problem that contrasted the hypotheses that the proportion of answer “A” was below .50 or over .50. This BF is obtained by first computing a Bayesian binomial test for a positive difference vs. a point null hypothesis of zero difference to obtain a first BF, and then computing a Bayesian binomial test for a negative difference vs. a point null hypothesis of zero difference to obtain a second BF. The BF directly contrasting a positive vs. a negative difference is obtained by taking the ratio of the BFs for the positive and the negative difference. The proportion was categorized as A or B if the BF was over three, as this can at least be considered “positive evidence” (Kass & Raftery, 1996). The results are summarized in Table A1, showing that the modal responses were the same for both samples.
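The two-step computation of the directional Bayes factor can be sketched as follows. The sketch assumes a uniform Beta(1, 1) prior on the proportion, so that each one-sided hypothesis carries prior mass .5; this prior, the function names, and the symmetric threshold of 1/3 for categorizing a proportion as “B” are illustrative assumptions and not necessarily the settings used for Table A1. The counts in the usage example are hypothetical.

```python
from math import comb


def bf10_two_sided(k: int, n: int) -> float:
    """BF for theta ~ Uniform(0, 1) over the point null theta = .5."""
    return (1.0 / (n + 1)) / (comb(n, k) * 0.5**n)


def posterior_prob_below_half(k: int, n: int) -> float:
    """P(theta < .5 | k of n) under a uniform prior.

    The posterior is Beta(k + 1, n - k + 1); for integer parameters its CDF
    at .5 equals an upper binomial tail with n + 1 trials.
    """
    return sum(comb(n + 1, j) for j in range(k + 1, n + 2)) / 2 ** (n + 1)


def directional_bfs(k: int, n: int):
    """Return (BF for theta > .5 vs. null, BF for theta < .5 vs. null,
    and their ratio, i.e., the BF for theta > .5 vs. theta < .5)."""
    p_below = posterior_prob_below_half(k, n)
    p_above = 1.0 - p_below
    two_sided = bf10_two_sided(k, n)
    # One-sided BF = two-sided BF times posterior mass over prior mass (.5).
    bf_plus = two_sided * p_above / 0.5
    bf_minus = two_sided * p_below / 0.5
    return bf_plus, bf_minus, bf_plus / bf_minus


def categorize(k_a: int, n: int, threshold: float = 3.0) -> str:
    """Label the modal response given k_a choices of "A" out of n."""
    ratio = directional_bfs(k_a, n)[2]
    if ratio > threshold:
        return "A"
    if ratio < 1.0 / threshold:
        return "B"
    return "undecided"


# Hypothetical example: 8 of 10 participants answer "A".
print(directional_bfs(8, 10), categorize(8, 10))
```

Because the two-sided BF cancels in the ratio, the directional BF under these assumptions reduces to the posterior odds that the proportion of “A” answers lies above rather than below .50.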
6 Appendix B: Proportion of Paradoxes
Table B1 reports the proportion of participants, for each numeracy group and for all participants, who exhibited a specific number of paradoxes. Table B2 reports the proportion of responses, for each numeracy group and for all participants, for each observed paired-response type.
7 Appendix C: Results for Each Numeracy Group
Tables C1 to C5 summarize the detailed results for each numeracy group.
8 Appendix D: Post-Hoc Comparisons for ANOVA of Paradoxes at the Individual Level
While JASP allows for post-hoc testing of the ANOVA, those tests test only against the null, and directional hypotheses are arguably more desirable in general (see, e.g., Morey & Rouder, 2011; Shaffer, 1972). Thus, we again used directional hypothesis testing (Morey & Rouder, 2011). The results in Table D1 show that the least numerate produce fewer paradoxes than all other groups, but also that the most numerate exhibit more paradoxes than the groups that scored one or two items correct on the BNT. The evidence is even stronger when the paradoxes of Probabilistic Insurance, Isolation Effect for Outcomes, and the Reflection Effects are excluded from the analysis (a reasonable exclusion, because the groups did not differ in regard to these paradoxes), as also illustrated in Table D1.
9 Appendix E: Independent Replication of WSD Results
Table E1 summarizes the results for the independent replication of the WSD results obtained in the main study.
Note. The Bayes factor BFDiff refers to the hypothesis that there is a difference between the proportions after the proportions have been coded to represent the same option, relative to the hypothesis of no difference. Both of the proportions in this cell refer to the response “Yes” to the option of a probabilistic insurance.