Introduction
Depression is a common and debilitating disease that is a worldwide leading cause of morbidity and mortality. According to the latest estimates from the World Health Organization, more than 300 million people were living with depression in 2015 (World Health Organization, Reference World Health Organization2017). Low mood and anhedonia are core symptoms of major depressive disorder; these two symptoms are key criteria for the diagnosis of Major Depressive Disorder (MDD) in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) (American Psychiatric Association, 2013). Anhedonia is broadly defined as a decreased ability to experience pleasure from positive stimuli. More specifically, it encompasses a reduced motivation to engage in daily life activities (motivational anhedonia) and a reduced enjoyment of usually enjoyable activities (consummatory anhedonia).
Depression is a complex and heterogeneous disorder involving instinctual, emotional and cognitive dysfunctions. Although its underlying mechanisms remain unclear, it has been proposed – based on the importance of anhedonia and low mood in depression – that reduced reward processing, both in terms of incentive motivation and reinforcement learning, plays a key role in the clinical manifestation of depression (Admon & Pizzagalli, Reference Admon and Pizzagalli2015; Chen, Takahashi, Nakagawa, Inoue, & Kusumi, Reference Chen, Takahashi, Nakagawa, Inoue and Kusumi2015; Eshel & Roiser, Reference Eshel and Roiser2010; Huys, Pizzagalli, Bogdan, & Dayan, Reference Huys, Pizzagalli, Bogdan and Dayan2013; Safra, Chevallier, & Palminteri, Reference Safra, Chevallier and Palminteri2019; Whitton et al., Reference Whitton, Kakani, Foti, Van't Veer, Haile, Crowley and Pizzagalli2016). This hypothesis implies that subjects with depression should display reduced reward sensitivity in value-based learning, at both the behavioral and neural levels. In the long term, a better understanding of these processes could help with the prevention and management of depression.
Following up on this assumption, numerous studies have tried to identify and characterize such reinforcement learning deficits; however, the results have been mixed so far. Indeed, while some studies did find evidence of blunted reward learning and reward-related signals in the brain, others indicated limited or no effect (Brolsma et al., Reference Brolsma, Vrijsen, Vassena, Kandroodi, Bergman, van Eijndhoven and Cools2022; Chung et al., Reference Chung, Kadlec, Aimone, McCurry, King-Casas and Chiu2017; Hägele et al., Reference Hägele, Schlagenhauf, Rapp, Sterzer, Beck, Bermpohl and Heinz2015; Rothkirch, Tonn, Köhler, & Sterzer, Reference Rothkirch, Tonn, Köhler and Sterzer2017; Rutledge et al., Reference Rutledge, Moutoussis, Smittenaar, Zeidman, Taylor, Hrynkiewicz and Dolan2017; Shah, O'carroll, Rogers, Moffoot, & Ebmeier, Reference Shah, O'carroll, Rogers, Moffoot and Ebmeier1999). Outside the learning domain, other recent studies found no evidence of disrupted valuation during decision-making under risk (Chung et al., Reference Chung, Kadlec, Aimone, McCurry, King-Casas and Chiu2017; Moutoussis et al., Reference Moutoussis, Rutledge, Prabhu, Hrynkiewicz, Lam, Ousdal and Dolan2018). It is also worth noting that many previous studies identifying value-related deficits in depression included only one valence domain (i.e., only rewards or only punishments) and neither directly contrasted rewards and punishments nor separated the two valence domains into different experimental sessions (Admon & Pizzagalli, Reference Admon and Pizzagalli2015; Elliott et al., Reference Elliott, Sahakian, McKay, Herrod, Robbins and Paykel1996; Elliott, Sahakian, Herrod, Robbins, & Paykel, Reference Elliott, Sahakian, Herrod, Robbins and Paykel1997; Forbes & Dahl, Reference Forbes and Dahl2012; Gradin et al., Reference Gradin, Kumar, Waiter, Ahearn, Stickle, Milders and Steele2011; Kumar et al., Reference Kumar, Waiter, Ahearn, Milders, Reid and Steele2008; Pizzagalli, Reference Pizzagalli2014; Vrieze et al., Reference Vrieze, Pizzagalli, Demyttenaere, Hompes, Sienaert, de Boer and Claes2013; Zhang, Chang, Guo, Zhang, & Wang, Reference Zhang, Chang, Guo, Zhang and Wang2013). In a recent meta-analysis (Pike & Robinson, Reference Pike and Robinson2022), where reward and punishment sensitivity was computationally quantified by assuming different learning rate parameters for positive and negative outcomes, patients' behaviour was, compared to controls, generally better explained by assuming a reduced sensitivity to positive outcomes, contrary to the positivity bias generally found in healthy subjects (Chambon et al., Reference Chambon, Théro, Vidal, Vandendriessche, Haggard and Palminteri2020; Palminteri, Lefebvre, Kilford, & Blakemore, Reference Palminteri, Lefebvre, Kilford and Blakemore2017).
Here we speculate that the lack of concordant results may be partly explained by the fact that reinforcement learning impairments in depression depend on the overall value of the learning context. In fact, computational studies clearly illustrate that the behavioral consequences of blunted reward and punishment sensitivity depend on the underlying distribution of outcomes. More specifically, Cazé and van der Meer (Cazé & van der Meer, Reference Cazé and van der Meer2013) showed that a greater sensitivity to reward than to punishment (a positivity bias, as proxied by different learning rates; Pike and Robinson, Reference Pike and Robinson2022) favors learning in contexts with a low overall reward expectation (i.e., ‘poor’ contexts) compared to those with a high overall reward expectation (‘rich’ contexts). Conversely, a greater sensitivity to punishment than to reward (a negativity bias) should favor learning in ‘rich’ contexts. As a consequence, if depressive patients present a blunted sensitivity to reward relative to punishment (i.e., a negativity bias), this should induce a difference in performance specifically in ‘poor’ contexts, where displaying a positivity bias is optimal.
To test this hypothesis, we adapted a standard protocol composed of a learning phase and a post-learning transfer phase. The learning phase included two different contexts: one defined as ‘rich’ (in which the two options have an overall positive expected value) and the other as ‘poor’ (two options with an overall negative expected value). In contrast with the learning phase, no feedback was given in the transfer phase, in order to probe the subjective values of the options without modifying them (Bavard, Lebreton, Khamassi, Coricelli, & Palminteri, Reference Bavard, Lebreton, Khamassi, Coricelli and Palminteri2018; Frank, Seeberger, & O'Reilly, Reference Frank, Seeberger and O'Reilly2004; Palminteri, Khamassi, Joffily, & Coricelli, Reference Palminteri, Khamassi, Joffily and Coricelli2015). In similar tasks, healthy subjects are generally reported to learn equally well from rewards and punishments (Palminteri et al., Reference Palminteri, Khamassi, Joffily and Coricelli2015; Pessiglione, Seymour, Flandin, Dolan, & Frith, Reference Pessiglione, Seymour, Flandin, Dolan and Frith2006). However, based on the idea that depression blunts reward sensitivity and that a positivity bias is advantageous in ‘poor’ contexts, we expected a learning asymmetry in MDD patients. More precisely, learning rate differences should induce lower performance in the ‘poor’ context in MDD patients.
In addition to choice data, we also analyzed reaction times and outcome observation times as ancillary measures of attention and performance. Previous findings suggest that negative value contexts are associated with overall slower responses (Fontanesi, Gluth, Spektor, & Rieskamp, Reference Fontanesi, Gluth, Spektor and Rieskamp2019a; Fontanesi, Palminteri, & Lebreton, Reference Fontanesi, Palminteri and Lebreton2019b). However, previous studies did not find any specific reaction time signatures in patients (Brolsma et al., Reference Brolsma, Vassena, Vrijsen, Sescousse, Collard, van Eijndhoven and Cools2021; Chase et al., Reference Chase, Frank, Michael, Bullmore, Sahakian and Robbins2010; Douglas, Porter, Frampton, Gallagher, & Young, Reference Douglas, Porter, Frampton, Gallagher and Young2009; Knutson, Bhanji, Cooney, Atlas, & Gotlib, Reference Knutson, Bhanji, Cooney, Atlas and Gotlib2008).
Methods
Participants and inclusion criteria
Fifty-six subjects were recruited in a clinical center (the Ginette Amado psychiatric crisis center) in Paris between May 2016 and July 2017. Inclusion criteria were a diagnosis of major unipolar depression made by a psychiatrist and an age between 18 and 65 years (see Table 1). All participants received a clear oral and written explanation of the study. All procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008. In total, we tested N = 30 patients undergoing a Major Depressive Episode (MDE) and N = 26 age-, gender- and socioeconomically-matched controls. For patients, exclusion criteria were the presence of psychotic symptoms or a diagnosis of chronic psychosis, severe personality disorder, neurological or any somatic disease that might cause cognitive alterations, neuroleptic treatment, electro-convulsive therapy in the past 12 months and current substance use. Psychiatric co-morbidities were established by a clinician in a semi-structured interview based on the Mini International Neuropsychiatric Interview (MINI) (Sheehan et al., Reference Sheehan, Lecrubier, Sheehan, Amorim, Janavs, Weiller and Dunbar1998). In our final sample, some patients (n = 13) presented anxiety-related disorders. Among them, some (n = 6) presented an isolated anxiety-related disorder (social anxiety n = 2; panic disorder n = 2; agoraphobia n = 1; claustrophobia n = 1) and the rest of the group (n = 7) presented several associated anxiety-related disorders (agoraphobia n = 4; panic disorder n = 4; social anxiety n = 3; generalized anxiety n = 3; OCD n = 1; PTSD n = 1). Others (n = 8) presented a substance use disorder (cannabis n = 3; alcohol n = 4; cocaine n = 2). All patients were taking medication (see Table 2 for details). Participants included in the healthy volunteer group had no past or present psychiatric diagnosis and were not taking any psychoactive treatment.
Education: years after graduation. For each sample, the mean of each variable is presented with its standard error of the mean.
‘SSRI’: selective serotonin reuptake inhibitor; ‘others’: anti-arrhythmic agent or vitamins.
Behavioral testing
Patients volunteering to take part in the experiment were welcomed in a calm office away from the center's activity, where they were given information about the aim and the procedure of the study. The study was verbally described as an evaluation of cognitive functions through a computer ‘game’. The diagnosis of MDE and the presence of psychiatric co-morbidities were assessed with the MINI, completed in a semi-structured interview with a psychiatrist. The subjects were then asked to complete several questionnaires assessing their level of optimism [Life Orientation Test-Revised (LOT-R)], an optimism analog scale (created for this study to contrast usual and current levels of optimism) and the severity of depression [Beck Depression Inventory-II (BDI-II)] (Beck, Steer, Ball, & Ranieri, Reference Beck, Steer, Ball and Ranieri1996). The participants were told they were going to play a simple computer game whose goal was to earn as many points as possible. Written instructions were provided and verbally reformulated if necessary. There was no monetary compensation, as patients did the task alongside a psychiatric assessment. To match patients' conditions, controls did not receive any compensation either.
As in previous studies of reinforcement learning, the behavioral protocol was divided into a learning phase and a transfer phase (Chase et al., Reference Chase, Frank, Michael, Bullmore, Sahakian and Robbins2010; Frank et al., Reference Frank, Seeberger and O'Reilly2004; Palminteri & Pessiglione, 2017) (Fig. 1a). Options were materialized by abstract symbols (Agathodaimon font), displayed in pairs on a black screen. During the learning phase, options were presented in fixed pairs, while during the transfer phase they were presented in all possible combinations (Fig. 1b). Beforehand, subjects were told that one of the two options was more advantageous than the other and were encouraged to identify it to maximize their (fictive) reward. Each symbol was associated with a fixed reward probability. The reward probability attached to each symbol was never explicitly given and the subjects had to learn it through trial and error. Reward probabilities were inspired by previous empirical and theoretical studies (Cazé & van der Meer, Reference Cazé and van der Meer2013; Chambon et al., Reference Chambon, Théro, Vidal, Vandendriessche, Haggard and Palminteri2020; Palminteri & Pessiglione, Reference Palminteri, Pessiglione, Dreher and Tremblay2017) and distributed across symbols as follows: 10%/40% (‘poor’ context) and 60%/90% (‘rich’ context). The reward probabilities were chosen so as to obtain the same choice difficulty (as indexed by the difference in expected value between the two options) across choice contexts. The learning phase was divided into two sessions of 100 trials each (each involving both the ‘rich’ and the ‘poor’ context, repeated for 50 trials).
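As a sanity check of this difficulty matching, with outcomes of +1/−1 point (see below) the expected value of a symbol with reward probability $p$ is

$$\mathrm{EV}(p) = p \cdot (+1) + (1 - p) \cdot (-1) = 2p - 1,$$

yielding −0.8/−0.2 in the ‘poor’ context and +0.2/+0.8 in the ‘rich’ context, i.e., the same expected value difference (0.6) between the two options in both contexts.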
In the transfer phase, the eight different symbols were presented in all binary combinations four times (including pairings that had never been displayed together in the previous phase; 112 trials). The subjects had to choose the symbol they deemed the more rewarding; however, in the transfer phase, no feedback was provided, in order not to interfere with subjects' final estimates of option values (Chase et al., Reference Chase, Frank, Michael, Bullmore, Sahakian and Robbins2010; Frank et al., Reference Frank, Seeberger and O'Reilly2004; Palminteri & Pessiglione, Reference Palminteri, Pessiglione, Dreher and Tremblay2017). The subjects were told to rely on their instinct when in doubt. The aim of the transfer phase was to assess the participants' learning on a longer time scale than the learning phase, which is supposed to rely mainly on working memory (Collins & Frank, Reference Collins and Frank2012). The transfer phase also assessed the capacity to remember and extrapolate the symbols' subjective values outside their initial context (generalization).
When the symbols appeared on the screen, subjects had to choose between the two symbols by pressing a right or a left key on a keyboard. In rewarded trials, a green smiley face and ‘+1pts’ appeared on screen; in punished trials, a red sad face and ‘−1pts’ appeared. To make sure that the subjects paid attention to the feedback, they had to press the up key after a win and the down key after a loss to move to the next trial (Fig. 1c; top). Trials in the transfer phase differed in that the feedback was not displayed (Fig. 1c; bottom).
Dependent variables
The main behavioral variables of our study are the correct choice rates, as measured in the learning and the transfer phase. A choice is defined as ‘correct’ (coded as ‘1’) if the participant picks the reward-maximizing option, and as ‘incorrect’ (coded as ‘0’) otherwise. In the learning phase, the correct choice is therefore picking ‘A’ in the ‘rich’ context and ‘B’ in the ‘poor’ context (Fig. 1b). For display purposes, the learning curves were smoothed (five-trial sliding average) (Fig. 2a). In the transfer phase, the correct choice was defined on a trial-by-trial basis and depended on the particular combination presented (note that in some trials a correct choice could not be defined, as the comparison involved two symbols with the same value, originally presented in different sessions) (Fig. 1b). For display purposes, concerning the transfer phase, we also considered the choice rate, defined as the number of times a given option was chosen divided by the number of times it was presented (calculated across all possible combinations except those involving two options of the same value) (Fig. 2b). As ancillary exploratory dependent variables, we also looked at two different measures of response times. More precisely, we extracted the reaction times (i.e., the time between symbol onset and choice; Fig. 4a) and the outcome observation times (i.e., the time between outcome onset and the key press to move to the next trial; Fig. 4b). For display purposes, response time curves were also smoothed (five-trial sliding average).
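As an illustration, a minimal R sketch of these two computations, assuming long-format data frames `learning_df` and `transfer_df` with the (hypothetical) column names used below:

```r
library(dplyr)
library(tidyr)
library(zoo)   # rollmean() for the sliding average

# Learning phase: trial-by-trial accuracy per group and context,
# then a five-trial sliding average for display
curves <- learning_df %>%
  group_by(group, context, trial) %>%
  summarise(acc = mean(correct), .groups = "drop_last") %>%
  mutate(acc_smooth = rollmean(acc, k = 5, fill = NA, align = "right"))

# Transfer phase: choice rate per symbol = times chosen / times presented,
# excluding pairings of two symbols with the same value
choice_rate <- transfer_df %>%
  filter(value_left != value_right) %>%
  pivot_longer(c(sym_left, sym_right), values_to = "symbol") %>%
  mutate(chosen = symbol == sym_chosen) %>%
  group_by(symbol) %>%
  summarise(rate = mean(chosen))
```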
Statistical analyses
The dependent variables were analyzed using Generalized Linear Mixed Models (GLMM) as implemented by the glmer function in R [R version 3.6.3 (2020-02-29); R Core Team (2022)] with the lme4 package [version 1.1-27.1; Bates, Mächler, Bolker, & Walker, Reference Bates, Mächler, Bolker and Walker2015]. The GLMMs of correct choice rates (in both the learning and the transfer phase) used a binomial link function, while those of response times (both reaction times and outcome observation times) used a gamma link function (Yu et al., Reference Yu, Guindani, Grieco, Chen, Holmes and Xu2022). All GLMMs were similarly constructed and included ‘subject’ as a random effect, and ‘group’ (between-subject variable: controls v. patients), ‘context’ (within-subject variable) and their interaction as fixed effects. For dependent variables extracted from the learning phase, the ‘context’ within-subject variable corresponded to whether the measure was taken from the ‘rich’ or the ‘poor’ context. In the GLMM of the correct choice rate in the transfer phase, the ‘context’ variable took three levels, corresponding to whether the choice under consideration involved the best possible option of the ‘rich’ condition (‘A present’), the worst possible option of the ‘poor’ condition (‘D present’), or neither (‘other’) (see Fig. 1b). Post hoc comparisons were assessed by comparing the marginal means of the contrast of interest to zero. All p values are reported after Tukey's correction for multiple comparisons.
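For concreteness, a minimal R sketch of GLMMs of this form, assuming a long-format data frame `df` with (hypothetical) columns `correct`, `rt`, `group`, `context` and `subject`; the log link for the gamma model is also our assumption, as it is not specified above:

```r
library(lme4)     # glmer()
library(emmeans)  # marginal means and Tukey-corrected post hoc contrasts

# Correct choice rate: binomial GLMM with a random intercept per subject
m_choice <- glmer(correct ~ group * context + (1 | subject),
                  family = binomial, data = df)

# Response times: gamma GLMM with the same fixed- and random-effect structure
m_rt <- glmer(rt ~ group * context + (1 | subject),
              family = Gamma(link = "log"), data = df)

# Post hoc tests: context contrasts within each group, Tukey-corrected
emmeans(m_choice, pairwise ~ context | group, adjust = "tukey")
```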
Model fitting and model simulations
To link the behavioral performance in our task to computational processes, we performed model simulations. More specifically, to assess the behavioral consequences of learning rate biases, we simulated a variant of a standard cognitive model of reinforcement learning. The model assumes that subjective option values (Q values) are learnt from reward prediction errors (RPE) that quantify the difference between the expected and the obtained outcome (Sutton & Barto, Reference Sutton and Barto2018). In this model, Q values are calculated for each combination of states (s; in our task the four contexts; Fig. 1b) and actions (a; in our task the symbols). Most of these models assume that subjective option values are updated following a Rescorla-Wagner rule (Rescorla & Wagner, Reference Rescorla and Wagner1972). However, to assess the behavioral consequences of a positivity and a negativity bias, based on previous studies (Chambon et al., Reference Chambon, Théro, Vidal, Vandendriessche, Haggard and Palminteri2020; Frank, Moustafa, Haughey, Curran, & Hutchison, Reference Frank, Moustafa, Haughey, Curran and Hutchison2007; Niv, Edlund, Dayan, & O'Doherty, Reference Niv, Edlund, Dayan and O'Doherty2012), we modified the standard model by including different learning rates for positive and negative prediction errors (which in our design correspond to positive and negative outcomes):
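$$
\delta_t = R_t - Q_t(s,a), \qquad
Q_{t+1}(s,a) =
\begin{cases}
Q_t(s,a) + \alpha^{+}\,\delta_t & \text{if } \delta_t > 0\\
Q_t(s,a) + \alpha^{-}\,\delta_t & \text{otherwise,}
\end{cases}
$$

where $R_t$ is the obtained outcome and $\alpha^{+}$ and $\alpha^{-}$ are the learning rates applied after positive and negative prediction errors, respectively (a standard reconstruction of the asymmetric Rescorla-Wagner rule described above).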
The model's decision rule was implemented as a softmax function, which calculates the probability of choosing a given option as a function of the difference between the Q values of the two options, as follows:
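$$
P_t(s,a) = \frac{1}{1 + e^{\left(Q_t(s,b) - Q_t(s,a)\right)/\beta}},
$$

where $b$ denotes the alternative option of the pair and $\beta$ is the choice temperature (higher values producing noisier choices); this is the standard two-option softmax consistent with the description above.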
To assess the effect of the positivity and negativity biases on learning performance in our task, we ran extensive model simulations in which artificial agents played our learning task (i.e., a ‘rich’ and a ‘poor’ context, for 50 trials each). More specifically, we simulated two different sets of learning rates (1000 virtual agents each). One set represented agents with a positivity bias (i.e., α + > α −) and the other set agents with a negativity bias (α + < α −) (Cazé & van der Meer, Reference Cazé and van der Meer2013). The values of the parameters (learning rates and temperature) were randomly drawn from uniform distributions; the temperature was drawn from β ∈ U(0, 1) and the learning rates (for example, in the positivity bias case) were drawn from α + ∈ U(0, 1) and α − ∈ U(0, α +) (the opposite held for the negativity bias case).
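A compact R sketch of one such simulated agent, under the stated assumptions (outcomes of ±1 point; reward probabilities of 90%/60% in the ‘rich’ pair and 40%/10% in the ‘poor’ pair; all function and variable names are ours, not the authors' simulation code):

```r
simulate_agent <- function(a_pos, a_neg, beta, p_reward, n_trials = 50) {
  Q <- c(0, 0)                      # one Q value per option in the pair
  correct <- numeric(n_trials)
  for (t in 1:n_trials) {
    # Softmax choice between the two options (beta = temperature)
    p1 <- 1 / (1 + exp((Q[2] - Q[1]) / beta))
    choice <- if (runif(1) < p1) 1 else 2
    correct[t] <- as.numeric(choice == which.max(p_reward))
    outcome <- if (runif(1) < p_reward[choice]) 1 else -1
    delta <- outcome - Q[choice]              # reward prediction error
    lr <- if (delta > 0) a_pos else a_neg     # asymmetric learning rates
    Q[choice] <- Q[choice] + lr * delta
  }
  mean(correct)                     # accuracy over the 50 trials
}

# One positivity-bias agent: alpha+ ~ U(0,1), alpha- ~ U(0, alpha+), beta ~ U(0,1)
a_pos <- runif(1); a_neg <- runif(1, 0, a_pos); beta <- runif(1)
c(rich = simulate_agent(a_pos, a_neg, beta, c(.9, .6)),
  poor = simulate_agent(a_pos, a_neg, beta, c(.4, .1)))
```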
After running the simulations, we also fitted the empirical data. More specifically, we focused on fitting the transfer phase choices, because this allows us to estimate the learning rates involved in long-term learning, whose estimation is not contaminated by working memory or choice perseveration biases (Collins & Frank, Reference Collins and Frank2012; Frank et al., Reference Frank, Moustafa, Haughey, Curran and Hutchison2007; Katahira, Yuki, & Okanoya, Reference Katahira, Yuki and Okanoya2017). The model's free parameters (temperature and learning rates) were fitted at the individual level using the fmincon function (Optimization Toolbox, MATLAB R2021b, 9.11.0.1809720, Natick, Massachusetts: The MathWorks, Inc.) via log model evidence maximization, as previously described (Daw, Gershman, Seymour, Dayan, & Dolan, Reference Daw, Gershman, Seymour, Dayan and Dolan2011; Wilson & Collins, Reference Wilson and Collins2019).
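As an illustration of this fitting step, a simplified R analog (the original fitting used MATLAB's fmincon with log model evidence maximization; this sketch uses plain maximum likelihood via optim, and the data frame columns are hypothetical). Q values are first learned from the learning-phase outcome sequence, then the softmax likelihood is evaluated on the transfer-phase choices:

```r
# learn_df: one row per learning trial (chosen = symbol index 1-8, outcome = +/-1)
# transfer_df: one row per transfer trial (left, right = symbol indices; chose_left = 0/1)
nll_transfer <- function(par, learn, transfer) {
  beta <- par[1]; a_pos <- par[2]; a_neg <- par[3]
  Q <- rep(0, 8)
  # Learn symbol values from the learning-phase outcomes
  for (t in seq_len(nrow(learn))) {
    k <- learn$chosen[t]
    delta <- learn$outcome[t] - Q[k]
    Q[k] <- Q[k] + (if (delta > 0) a_pos else a_neg) * delta
  }
  # Evaluate the softmax likelihood on the transfer-phase choices
  ll <- 0
  for (t in seq_len(nrow(transfer))) {
    p_left <- 1 / (1 + exp((Q[transfer$right[t]] - Q[transfer$left[t]]) / beta))
    ll <- ll + log(if (transfer$chose_left[t] == 1) p_left else 1 - p_left)
  }
  -ll  # negative log-likelihood, to be minimized
}

fit <- optim(c(0.5, 0.5, 0.5), nll_transfer, method = "L-BFGS-B",
             lower = c(1e-3, 0, 0), upper = c(10, 1, 1),
             learn = learn_df, transfer = transfer_df)
```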
Results
Demographics
Patients and controls were matched in age (t(51) = −1.1, p = 0.28), gender (t(53) = 1.15, p = 0.29) and years of education (t(54) = −1.59, p = 0.12). Concerning the optimism measures, patients with depression were found to be less optimistic on all scales (LOT-R: t(47) = −7.42, p = 1.76 × 10−9; usual optimism: t(51) = −2.29, p = 0.03; current optimism: t(50) = −10.34, p = 4.19 × 10−14). Furthermore, the comparison between usual and current optimism in patients and controls revealed that only patients were significantly less optimistic than usual at the moment of the test (patients: t(29) = 8.26, p = 4.21 × 10−9; controls: t(25) = −1.53, p = 0.14), consistent with the fact that they were undergoing an MDE. All patients were taking at least one psychotropic medication at the moment of the test. Their average BDI score was 29.37 and they had experienced, on average, 1.8 previous MDEs.
Learning phase results
Global inspection of the learning curves (Fig. 2a) suggests that, overall, participants were able to learn to respond correctly. Indeed, all the learning curves are above chance, whatever the group or the context. A more detailed inspection reveals that controls' learning curves were unaffected by the choice context (‘rich’ v. ‘poor’), while patients' learning curves differed depending on the choice context (with a lower correct response rate in the ‘poor’ context).
The correct response rate in the learning phase (Fig. 2a), as proxied by the intercept of our GLMM, was significantly above chance level (0.5) (χ2(1, 56) = 16.17, p < 0.001). There was no significant effect of context (χ2(1, 56) = 0.046, p = 0.83) and no main effect of group (χ2(1, 56) = 2.86, p = 0.091), meaning that there were no overall significant differences between patients and controls, nor between the ‘rich’ and ‘poor’ contexts. However, there was a significant interaction between context and group (χ2(1, 56) = 5.88, p = 0.015). Post hoc tests indicated that the interaction was driven by an effect of context in patients (slope = −0.72, s.e. = 0.24, p < 0.0027) that was absent in controls (slope = −0.063, s.e. = 0.29, p = 0.83).
These results therefore show a context-specific impact on the two groups: patients displayed higher accuracy in the ‘rich’ compared to the ‘poor’ context, while controls were not affected by this factor, as expected from previous literature (Palminteri et al., Reference Palminteri, Khamassi, Joffily and Coricelli2015; Pessiglione et al., Reference Pessiglione, Seymour, Flandin, Dolan and Frith2006).
Critically, the learning phase results cannot establish whether the performance asymmetry observed in patients stems from the learning process (i.e., how values are updated) or from the decision process (i.e., how options are selected). To tease apart these interpretations, we turned to the analysis of the transfer phase performance.
Transfer phase analysis
Visual inspection of the option-by-option choice rates in the transfer phase showed that subjects were able to retrieve the values of the options and express meaningful preferences among them (Fig. 2b). In fact, in both groups, the options ‘A’ (overall highest value) were chosen much more frequently than the options ‘D’ (overall lowest value). Intermediate value options (‘B’ and ‘C’) scored in between the extreme ones (with a pattern reminiscent of relative value encoding; Klein, Ullsperger, & Jocham, Reference Klein, Ullsperger and Jocham2017; Palminteri & Lebreton, Reference Palminteri and Lebreton2021).
Before assessing whether the learning asymmetry observed in patients in the learning phase replicated in the transfer phase, one has to keep in mind that there were no longer fixed choice contexts in the transfer phase, as options were presented in all possible combinations. Accordingly, the context factor used for the transfer phase contained three levels, defined by the presence of particular options: (1) trials involving the ‘A’ options (and not ‘D’); (2) trials involving the ‘D’ options (and not ‘A’); (3) other trials. In the transfer phase too, the average correct response rate (as proxied by the intercept of our GLMM) was significantly above chance (χ2(1, 56) = 15.9, p < 0.001). We also found a significant effect of group (χ2(1, 56) = 6.83, p = 0.009), no effect of context (χ2(1, 56) = 2.23, p = 0.327) and a very strong and significant group-by-context interaction (χ2(1, 56) = 53.21, p < 0.001). Post hoc tests revealed that controls were equally able to make the correct decision in contexts involving seeking ‘A’ and in those involving avoiding ‘D’ (slope = −0.004, s.e. = 0.1, p = 0.999), whereas patients were strikingly better at seeking ‘A’ than at avoiding ‘D’ (slope = 1.06, s.e. = 0.1, p < 0.001).
These results are consistent with the learning phase results. The context-specific asymmetry that we found in patients in the learning phase was also present in the transfer phase, where all the different options were extracted from their initial context and paired with other options. This allows us to conclude that the performance asymmetry can be traced back to a learning asymmetry, where negative outcomes (more frequent following the worst possible option ‘D’) seem to exert a smaller effect on patients' learning performance than positive ones (more frequent following the best possible option ‘A’) (Frank et al., Reference Frank, Seeberger and O'Reilly2004).
Modelling results
Model simulations indicate that learning biases affect performance in a context-dependent manner (Fig. 3a). More specifically, in our task, a positivity bias (α + > α −) is associated with similar accuracy in the ‘rich’ and ‘poor’ contexts, while a negativity bias (α + < α −) is associated with much higher accuracy in the ‘rich’ compared to the ‘poor’ context. The reason for this result can be traced back to the idea that it is rational to preferentially learn from rare outcomes (Cazé & van der Meer, Reference Cazé and van der Meer2013). The ‘positivity bias’ behavioral pattern closely resembles what we observed in healthy participants, while the ‘negativity bias’ pattern closely resembles that observed in patients, thus suggesting that patients' behavior is better explained by an exacerbated sensitivity to negative outcomes.
To formally substantiate this intuition, we submitted the learning rates fitted from the transfer phase choices to a 2 × 2 ANOVA, with group (patients v. controls) and valence (positive or negative learning rate) as between- and within-subject variables, respectively (Fig. 3b). The results showed a main effect of group [F(1, 107) = 5.26, p = 0.024; η2 (partial) = 0.05, 95% CI (3.37 × 10−3, 1.00)], no main effect of valence [F(1, 107) = 3.27 × 10−3, p = 0.954; η2 (partial) = 3.06 × 10−5, 95% CI (0.00, 1.00)], and, crucially, a significant valence-by-group interaction [F(1, 107) = 7.58, p = 0.007; η2 (partial) = 0.07, 95% CI (0.01, 1.00)]. Finally, we detected no significant difference in the choice temperature (t(48) = 1.64, p = 0.11).
Response time analysis
As an exploratory analysis, to assess how learning performance was reflected in response times (at both the decision and the outcome stage), we looked at reaction times and outcome observation times during the learning phase. Reaction times (defined as the difference between stimulus onset and the button press expressing the decision) showed a main effect of context (χ2(1, 56) = 9.83, p = 0.002), with reaction times being longer in the ‘poor’ compared to the ‘rich’ condition, which is consistent with previous studies showing valence-induced slowing in reinforcement learning (Fontanesi et al., Reference Fontanesi, Palminteri and Lebreton2019b) (Fig. 4a). Reaction times showed no significant main effect of group (χ2(1, 56) = 0.03, p = 0.86) nor a context-by-group interaction (χ2(1, 56) = 0.12, p = 0.73). Post hoc tests showed that the effect of context was significant in both controls (slope = 0.047, s.e. = 0.016, p < 0.003) and patients (slope = −0.043, s.e. = 0.0067, p < 0.001).
Outcome observation times (defined as the difference between outcome onset and the button press to move to the next trial) displayed a significant effect of context (χ2(1, 56) = 10.39, p = 0.001), but no effect of group (χ2(1, 56) = 2.17, p = 0.14) nor interaction (χ2(1, 56) = 0.39, p = 0.53) (Fig. 4b).
Taken together, the reaction time and outcome observation time analyses suggest that the learning performance asymmetry in patients cannot be accounted for by reduced engagement or reduced outcome processing during the learning task.
Discussion
In the present study, we assessed reinforcement learning with a behavioral paradigm involving two different reward contexts – one ‘rich’ with a positive overall expected value and one ‘poor’ with a negative overall expected value – in patients undergoing a major depressive episode and age-, gender- and education-matched healthy volunteers.
We used a reinforcement learning task featuring two different learning contexts: one with an overall positive expected value (‘rich’ context) and one with an overall negative expected value (‘poor’ context). Consistent with previous studies, healthy subjects learned equally well in both contexts (Palminteri & Pessiglione, Reference Palminteri, Pessiglione, Dreher and Tremblay2017). On the other hand, patients with depression displayed a reduced correct response rate in the ‘poor’ context. This context-dependent learning asymmetry found in the learning phase was confirmed in the analysis of the transfer phase, where subjects were asked to retrieve and generalize the values learned during the learning sessions.
In standard reinforcement learning tasks, a participant has to learn the value of the options and select among them. A deficit in reinforcement learning can therefore arise from two possible causes. On the one hand, it can be caused by a learning impairment, i.e., a failure to accurately update the value of the stimulus. On the other hand, it can be the result of a decision impairment: in this scenario, a participant could still end up selecting the wrong stimulus even though the learning process itself is intact. Our design, coupling a learning phase with feedback and a transfer phase where all options were shuffled and no feedback was given, allows us to separate these two possible sources of error. Indeed, a decision-related problem would lead to a specific impairment during the learning phase, but to no impairment, or only an unspecific one, in the transfer phase. Conversely, a valence-specific update-related deficit would originate in the learning phase (when feedback is provided) and would therefore propagate to the transfer phase, remaining associated with the specific options concerned (Frank et al., Reference Frank, Moustafa, Haughey, Curran and Hutchison2007).
Our results are consistent with this second scenario, as patients were less able to identify the correct response in the ‘poor’ context in both the learning and the transfer phase. This suggests that the asymmetrical performance observed in patients stems from the learning process per se and not from the decision process. We therefore suppose that this asymmetric learning pattern is the consequence of a more complex mechanism, embedded in the learning process and triggered by affectively negative situations or by less frequent affectively positive situations (‘poor’ context).
Our results suggest that learning performance in depression depends on the valence of the context. More specifically, patients undergoing a major depressive episode seem to perform worse when learning in a negative value context compared to a positive one. This was true despite the fact that the two contexts were matched in difficulty. Control participants, on the contrary, showed no difference in performance between the two contexts. Prima facie, this observation challenges some formulations of the negative bias hypothesis described in the literature. Some studies describe negative affective biases in several cognitive processes, such as emotion, memory and perception, as an increased and aberrant saliency of negative affective stimuli (for a review see Gotlib and Joormann, Reference Gotlib and Joormann2010; Joormann and Quinn, Reference Joormann and Quinn2014). From this view, one could extrapolate that, contrary to what we observed in our data, MDD patients should display, if anything, higher performance in the ‘poor’ contexts. This prediction contrasts with a computational definition of the negativity bias as a difference between the learning rates for positive and negative outcomes (or reward prediction errors). In fact, model simulation studies clearly show that positivity or negativity learning biases affect performance in a context-dependent manner, which in our case is consistent with the idea of a negativity bias in depression (Bavard & Théro, Reference Bavard and Théro2018; Cazé & van der Meer, Reference Cazé and van der Meer2013). These results were confirmed by model simulations and by the analysis of the learning rates fitted from the transfer phase choices and, even if a systematic pattern is hard to find in the literature, they are consistent with a recent computational meta-analysis by Pike and colleagues (Beck, Reference Beck1987; Brolsma et al., Reference Brolsma, Vrijsen, Vassena, Kandroodi, Bergman, van Eijndhoven and Cools2022; Chase et al., Reference Chase, Frank, Michael, Bullmore, Sahakian and Robbins2010; Eshel & Roiser, Reference Eshel and Roiser2010; Gradin et al., Reference Gradin, Kumar, Waiter, Ahearn, Stickle, Milders and Steele2011; Henriques et al., 1994; Huys et al., Reference Huys, Pizzagalli, Bogdan and Dayan2013; Knutson et al., Reference Knutson, Bhanji, Cooney, Atlas and Gotlib2008; Kumar et al., Reference Kumar, Waiter, Ahearn, Milders, Reid and Steele2008; Murphy, Michael, Robbins, & Sahakian, Reference Murphy, Michael, Robbins and Sahakian2003; Pike & Robinson, Reference Pike and Robinson2022; Pizzagalli, Jahn, & O'Shea, Reference Pizzagalli, Jahn and O'Shea2005; Steele, Kumar, & Ebmeier, Reference Steele, Kumar and Ebmeier2007; Ubl et al., Reference Ubl, Kuehner, Kirsch, Ruttorf, Diener and Flor2015; Whitton et al., Reference Whitton, Kakani, Foti, Van't Veer, Haile, Crowley and Pizzagalli2016). Crucially, and consistent with our simulations, the overall good performance of patients, specifically in the ‘rich’ context, indicates that patients displayed no generic impairment. An overall good performance of patients in some control conditions is actually not uncommon and can be explained by the fact that patients are in general more focused and more involved than controls in this type of study (the so-called Hawthorne effect), because the result of the experiment is much more ‘meaningful’ for them than it is for controls (Frank et al., Reference Frank, Seeberger and O'Reilly2004).
In addition to choice data, we collected two different response time measures in our study. The first one, reaction time, was classically defined as the time between stimulus onset and the choice button press. Reaction times did not differ between our groups of participants, indicating that our experiment provides no support for the idea of a generalized sensorimotor slowing in patients (Byrne, Reference Byrne1976). On the other hand, reaction times were strongly affected by the experimental condition, being significantly slower in the ‘poor’ context in both groups. This finding is apparently at odds with the fact that objective difficulty (as quantified by the difference in value between the two options) was matched across contexts (note that this effect was also present in healthy controls, who displayed equal performance in both conditions). However, slower reaction times in the ‘poor’ context are consistent with recent findings (Fontanesi et al., Reference Fontanesi, Palminteri and Lebreton2019b). Indeed, previous studies coupling diffusion decision model analyses with reinforcement learning paradigms indicate that reaction times tend to be slower in negative valence contexts compared to positive ones. This effect is well captured by a combination of increased non-decision time (a possible manifestation of Pavlovian-to-instrumental transfer; Guitart-Masip et al., Reference Guitart-Masip, Huys, Fuentemilla, Dayan, Duzel and Dolan2012) and increased cautiousness (a possible manifestation of loss attention; Yechiam & Hochman, Reference Yechiam and Hochman2014). We also recorded the outcome observation times, which quantify the time separating the onset of the outcome from the button press necessary to move to the subsequent trial. Outcome observation times did not differ between groups, therefore indicating that the learning asymmetry observed in patients cannot be explained by a failure to process outcome information.
Our study, of course, suffers from a few important limitations. One limitation is the relatively small sample size, due to the fact that our study was monocentric and ran over a relatively short time period. We note, however, that several meaningful insights concerning the impairment of reinforcement learning in psychiatric diseases have been obtained, until very recently, from studies with sample sizes comparable to ours (Chase et al., Reference Chase, Frank, Michael, Bullmore, Sahakian and Robbins2010; Frank et al., Reference Frank, Seeberger and O'Reilly2004; Henriques & Davidson, Reference Henriques and Davidson2000; Huys et al., Reference Huys, Gölzer, Friedel, Heinz, Cools, Dayan and Dolan2016; Moutoussis et al., Reference Moutoussis, Rutledge, Prabhu, Hrynkiewicz, Lam, Ousdal and Dolan2018; Murphy et al., Reference Murphy, Michael, Robbins and Sahakian2003; Rothkirch et al., Reference Rothkirch, Tonn, Köhler and Sterzer2017; Rupprechter, Stankevicius, Huys, Steele, & Seriès, Reference Rupprechter, Stankevicius, Huys, Steele and Seriès2018). Future, multi-centric studies will be required to overcome this issue and to probe the replicability and generalizability of our findings. Furthermore, by openly sharing our data, our study may contribute to (computational) meta-analyses (Pike & Robinson, Reference Pike and Robinson2022). Another limitation of our study is that patients were medicated at the time of the experiment. Even though previous studies have assessed performance in both medicated and unmedicated patients (Douglas et al., Reference Douglas, Porter, Frampton, Gallagher and Young2009; Steele et al., Reference Steele, Kumar and Ebmeier2007), it is always difficult to control for this effect, especially when certain patients take medications for other comorbidities. Additionally, the role of serotonin in reward and punishment learning is far from being understood (Palminteri & Pessiglione, 2017). In some tasks, it has been shown to improve performance in a valence-independent manner, making it unlikely that the observed effect was a consequence of medication (Palminteri, Clair, Mallet, & Pessiglione, Reference Palminteri, Clair, Mallet and Pessiglione2012). Moreover, under the theory that serotonin drives punishment avoidance learning, we would expect the opposite effect. Finally, as MDD is a heterogeneous condition, and even though we tried to monitor and control the inclusion of patients to avoid interference with other mental conditions, some patients had other symptoms, especially addictive disorders, that should be considered in future studies.
In the literature, it has been repeatedly shown that controls perform equally well whether they have to choose a reward or avoid a punishment. It is also frequent that patients with mental or neurological disorders other than MDD show an imbalance between reward seeking and punishment avoidance in such tasks (Frank et al., Reference Frank, Seeberger and O'Reilly2004). Studying several aspects of reward processing that correspond to different neurobiological circuits, and exploring their dysregulation across different psychiatric disorders, could be a very efficient way to uncover abnormalities in reward-related decision making. It could be interesting to apply our task to other psychiatric disorders in order to identify neurobiological signatures and develop more targeted and promising treatments (Brolsma et al., Reference Brolsma, Vrijsen, Vassena, Kandroodi, Bergman, van Eijndhoven and Cools2022; Insel et al., Reference Insel, Cuthbert, Garvey, Heinssen, Pine, Quinn and Wang2010; Whitton, Treadway, & Pizzagalli, Reference Whitton, Treadway and Pizzagalli2015).
Data
The data collected for this paper, an R script reproducing the main figures of the paper and the Matlab simulation files are available at https://github.com/hrl-team/Data_depression.
Acknowledgements
We thank Magdalena Soukupova for her bright insights on statistical analysis. HV is supported by the Institut de Recherche en Santé Publique (IRESP, grant number: 20II171-00). SP is supported by the Institut de Recherche en Santé Publique (IRESP, grant number: 20II138-00) and the Agence Nationale de la Recherche (CogFinAgent: ANR-21-CE23-0002-02; RELATIVE: ANR-21-CE37-0008-01; RANGE: ANR-21-CE28-0024-01). The Département d'études cognitives is funded by the Agence Nationale de la Recherche (FrontCog ANR-17-EURE-0017). The funding agencies did not influence the content of the manuscript.
Conflict of interest
Dr Lemogne reports personal fees and non-financial support from Boehringer Ingelheim, Janssen-Cilag, Lundbeck and Otsuka Pharmaceutical, outside the submitted work. The other authors declare no competing interests concerning the related work.