We are invited by this excellent target article (TA) to consider three critical questions: (1) why should there ever be conflict between short-term temptations and long-term goals; (2) what mechanisms in the brain overcome these temptations; and (3) why is the operation of some, but not others, of those mechanisms accompanied by a sense of effort? These are respectively ethological, psychological/neural, and metaphysical – and so demand answers of different characters. Here, we reflect on the TA using the terms and language of neural reinforcement learning (RL): Pavlovian versus instrumental control (Dayan, Niv, Seymour, & Daw, Reference Dayan, Niv, Seymour and Daw2006; Dickinson, Reference Dickinson1980; Mackintosh, Reference Mackintosh1983); liking versus wanting (Berridge, Reference Berridge2009); model-based (MB) versus model-free (MF) control (Daw, Niv, & Dayan, Reference Daw, Niv and Dayan2005); online versus offline learning and planning (Sutton, Reference Sutton1991; Mattar & Daw, Reference Mattar and Daw2018); and internal versus external actions and control (Dayan, Reference Dayan2012; Keramati, Smittenaar, Dolan, & Dayan, Reference Keramati, Smittenaar, Dolan and Dayan2016; Pezzulo, Rigoli, & Chersi, Reference Pezzulo, Rigoli and Chersi2013).
First, why should we and other animals suffer from temptations at all – why should there be even a possibility of misalignment between short- and long-term incentive structures? After all, patience can evolve quite naturally (Stevens & Stephens, Reference Stevens and Stephens2008) – as in stalking hunters who presumably suppress immediate attacking urges in order to better their chances of ultimate success. In the context of thinking (and, in this case, acting) fast and slow (Kahneman, Reference Kahneman2011), the TA hints that the former system repurposed a psychologically common (hyperbolic) heuristic for valuation across delays that simply cannot meet the challenge, posed by the latter system, of being sufficiently patient. Neural RL offers complementary Pavlovian and instrumental interpretations.
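To make concrete how the hyperbolic heuristic fails the challenge of patience, the following minimal sketch (our illustration; the reward magnitudes, delays, and parameters are hypothetical, not drawn from the TA) shows the classic preference reversal that hyperbolic, but not exponential, discounting produces as a smaller-sooner reward draws near:

```python
# Minimal sketch of preference reversal under hyperbolic discounting.
# All magnitudes, delays, and parameters are illustrative assumptions.

def hyperbolic(value, delay, k=0.5):
    """Hyperbolic discounting: V = value / (1 + k * delay)."""
    return value / (1.0 + k * delay)

def exponential(value, delay, gamma=0.9):
    """Exponential discounting: V = value * gamma ** delay."""
    return value * gamma ** delay

# A smaller-sooner (SS) versus larger-later (LL) choice, viewed from
# far in advance and then again just before the SS reward is available.
for discount in (hyperbolic, exponential):
    far = (discount(5.0, 10), discount(10.0, 15))   # SS in 10 steps, LL in 15
    near = (discount(5.0, 1), discount(10.0, 6))    # SS in 1 step,  LL in 6
    print(discount.__name__,
          "| far: prefers", "LL" if far[1] > far[0] else "SS",
          "| near: prefers", "LL" if near[1] > near[0] else "SS")
```

With these values, the hyperbolic agent prefers the larger-later reward from a distance but defects to the smaller-sooner one at the last moment; the exponential agent's preference is stable across vantage points.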
By Pavlovian influences we mean hard-wired or pre-specified actions such as direct approach to, and engagement with, primary and even secondary reinforcers. One view of these influences – which align well with the visceral processes (Loewenstein, Reference Loewenstein1996) discussed in the TA – is that they are evolutionarily specified priors over actions or policies. The benefit of this sort of inductive bias is obviating the sample complexity of learning in stable environments (Dayan et al., Reference Dayan, Niv, Seymour and Daw2006). Could it be that, on balance, the costs of lacking this pre-programming outweigh the benefits? The classic tasks assessing willpower focus on the benefits; a careful accounting of the costs would be interesting. Certainly, external pre-commitment (or suppression in the form of revaluation or attentional diversion) is a way of avoiding Pavlovian misbehaviour (Dayan et al., Reference Dayan, Niv, Seymour and Daw2006).
By contrast, in instrumental conditioning, we and other animals learn to choose actions based on the contingent rewards they produce and punishments they avoid. Of course, here, the key question is what happens when these affective outcomes are in the future. A useful analogy comes from experiments into food reward that separate out the short-term hedonic (e.g., sweetness) and long-term (e.g., nutritive) qualities of the outcomes of actions (de Araujo, Schatzker, & Small, Reference de Araujo, Schatzker and Small2020). Animals are initially attracted by the hedonic appeal of outcomes, but ultimately (via information from the gut), their choices are dictated by what is closer to the true long-term value.
One possibility is that the hedonic system is again a sort of typically useful prior, but now over likely long-term valuation rather than over a policy or action – if, for instance, sweetness is sufficiently frequently aligned with long-term nutritive quality. Thus, the animal might be drawn, at least at first, into favouring what are actually poor choices from a long-term perspective. In RL terms, one speculation is that hedonics – as a form of liking (Berridge, Reference Berridge2009) – act as what are known as shaping rewards (Ng, Harada, & Russell, Reference Ng, Harada and Russell1999): hints for the instrumental system that speed learning when they are appropriate, and that do not ultimately affect the optimal policy, but only slow its acquisition when they are misleading. Complementary to liking is wanting (Berridge, Reference Berridge2009), which would then be considered the true currency for choice. Thus, again, a conventionally useful, hard-wired prior system can appear to give unwarranted favour to smaller-sooner outcomes, favour that it then takes more or less learning to wash out.
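To make the guarantee concrete: in the formulation of Ng et al. (1999) – standard RL notation, not anything specific to the TA – a shaping reward leaves the optimal policy unchanged whenever it is potential-based, that is, derived from a potential function \(\Phi\) over states:
\[
F(s, a, s') = \gamma\,\Phi(s') - \Phi(s),
\]
with the agent learning from \(r(s, a, s') + F(s, a, s')\) in place of the raw reward. A hedonic signal of this form would speed learning exactly when \(\Phi\) tracks long-term (e.g., nutritive) value, and would merely slow it when it does not.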
Second are the mechanisms that overcome temptations – when the temporal accounting of wanting over liking does not suffice. From an RL perspective, it is useful to think about MB and MF systems, and also externally- and internally-directed actions. MB (or goal-directed; Dickinson & Balleine, Reference Dickinson, Balleine and Gallistel2002) control operates by constructing, and performing forward planning in, an internal simulacrum of the environment; it can exactly capture, for instance, the resolve-associated observation that defection sooner implies defection later, thus reducing the chance of actually attaining long-term rewards. This sort of future planning has been associated with phenomena such as preplay in rodents (Pfeiffer & Foster, Reference Pfeiffer and Foster2013; Wikenheiser & Redish, Reference Wikenheiser and Redish2015) and, more speculatively, humans (Eldar, Lièvre, Dayan, & Dolan, Reference Eldar, Lièvre, Dayan and Dolan2020; Liu, Dolan, Kurth-Nelson, & Behrens, Reference Liu, Dolan, Kurth-Nelson and Behrens2019) and, as noted in the TA, various parts of the default mode network.
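A minimal sketch of what such MB forward planning involves, under the usual assumptions (a learned transition and reward model over a small discrete state space; the interface and example values below are our illustrative choices, not a published implementation):

```python
# Minimal sketch of model-based (MB) forward planning: depth-limited
# lookahead in an internal model. `model` maps (state, action) to a
# list of (probability, next_state, reward) triples.

def plan_value(model, state, actions, depth, gamma=0.95):
    """Value of `state` under optimal depth-limited lookahead."""
    if depth == 0:
        return 0.0
    return max(
        sum(p * (r + gamma * plan_value(model, s2, actions, depth - 1, gamma))
            for p, s2, r in model[(state, a)])
        for a in actions)

def plan_action(model, state, actions, depth=5, gamma=0.95):
    """Choose the action whose simulated rollouts look best."""
    return max(actions, key=lambda a: sum(
        p * (r + gamma * plan_value(model, s2, actions, depth - 1, gamma))
        for p, s2, r in model[(state, a)]))

# Illustrative model: 'wait' forgoes a smaller-sooner reward for a
# larger-later one; planning correctly resolves in favour of waiting.
actions = ['take', 'wait']
model = {
    ('start', 'take'): [(1.0, 'done', 1.0)],   # smaller-sooner
    ('start', 'wait'): [(1.0, 'mid', 0.0)],
    ('mid', 'take'):   [(1.0, 'done', 3.0)],   # larger-later
    ('mid', 'wait'):   [(1.0, 'done', 0.0)],
    ('done', 'take'):  [(1.0, 'done', 0.0)],
    ('done', 'wait'):  [(1.0, 'done', 0.0)],
}
print(plan_action(model, 'start', actions))  # -> 'wait'
```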
By contrast, MF systems cache or store information about the actions performed in the past, and thereby come directly to favour those actions that were either associated with rewards or possibly just frequently exercised (Gershman, Reference Gershman2020). When information about rewards is cached, mechanisms such as temporal difference learning (Sutton, Reference Sutton1988; Watkins, Reference Watkins1989), associated with the wanting mentioned above, ensure that the cached values are appropriate in the long run. These MF policies have been identified with habits (Daw et al., Reference Daw, Niv and Dayan2005; Dickinson & Balleine, Reference Dickinson, Balleine and Gallistel2002).
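In its simplest form, the caching at issue is the temporal difference update (again standard notation rather than the TA's):
\[
V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big],
\]
where the bracketed prediction error propagates information about delayed rewards back to earlier states, so that the cached values come to reflect long-run rather than merely immediate consequences.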
MB control is time-consuming (Pezzulo et al., Reference Pezzulo, Rigoli and Chersi2013) and potentially taxing (see below); thus, there is a ready process of habit formation (Daw et al., Reference Daw, Niv and Dayan2005; Dickinson & Balleine, Reference Dickinson, Balleine and Gallistel2002), in which habits take over – becoming what we are wont to do. This is partly consonant with notions in the TA. What, though, of suppression – rendered here as internally-directed (e.g., devaluation, or indeed overriding Pavlovian control; Cavanagh, Eisenberg, Guitart-Masip, Huys, & Frank, Reference Cavanagh, Eisenberg, Guitart-Masip, Huys and Frank2013) or externally-directed (e.g., attention) mechanisms for changing the attraction of short-term temptations? In neural RL terms, these can be considered as (expensive; Shenhav et al., Reference Shenhav, Musslick, Lieder, Kool, Griffiths, Cohen and Botvinick2017) internal actions that are controlled in the same way as external actions, along with other internal actions such as the deployment of working memory (Dayan, Reference Dayan2012). Although the TA suggests that suppression is less stable than resolve (e.g., via a positive feedback process by which partial failures in distraction tend to spiral out of control and lead to full failure and defection on long-term goals), it should be noted that the internal actions necessary to complete the calculations for the MB realization of resolve are of a piece with those enforcing suppression, and so are subject to some of the same problems. It is certainly not obvious that some forms of suppression will not also habitize.
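One hedged reading of how habits "take over" is the uncertainty-based arbitration of Daw et al. (2005): control passes to whichever system currently reports the more reliable value estimate, so that the MF system dominates once its estimates sharpen with experience. A schematic sketch (the variance bookkeeping here is a placeholder for whatever uncertainty estimates the systems actually maintain):

```python
# Schematic sketch of uncertainty-based arbitration between model-based
# (MB) and model-free (MF) control, in the spirit of Daw et al. (2005).
# The numerical values below are illustrative placeholders.

def arbitrate(q_mb, var_mb, q_mf, var_mf):
    """Trust whichever system reports the lower-variance estimate."""
    return q_mb if var_mb < var_mf else q_mf

# Early in learning, MF estimates are noisy and MB planning dominates;
# after extensive experience, MF variance shrinks and habits take over.
print(arbitrate(q_mb=1.0, var_mb=0.2, q_mf=0.4, var_mf=0.9))   # MB early
print(arbitrate(q_mb=1.0, var_mb=0.2, q_mf=0.9, var_mf=0.05))  # MF late
```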
Briefly, what of the effort associated with suppression and, we would argue, resolve, at least when MB calculations remain necessary? Here, we cheat to focus back on ethology. To the flavours of opportunity costs discussed (Boureau, Sokol-Hessner, & Daw, Reference Boureau, Sokol-Hessner and Daw2015; Kurzban, Duckworth, Kable, & Myers, Reference Kurzban, Duckworth, Kable and Myers2013; Shenhav et al., Reference Shenhav, Musslick, Lieder, Kool, Griffiths, Cohen and Botvinick2017), we should add the cost of not being able to transfer knowledge from the MB to the MF system (Mattar & Daw, Reference Mattar and Daw2018; Sutton, Reference Sutton1991) – a transfer that, among other benefits, renders choices less effortful. That is what would make will become wont.
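The Dyna architecture (Sutton, 1991) makes the missing transfer concrete: transitions stored in the internal model are replayed offline to train the MF system, so that later choices can be made cheaply and without deliberation. A minimal sketch, with illustrative parameter values and an interface of our own choosing:

```python
import random

# Minimal Dyna-Q sketch (after Sutton, 1991): real transitions train a
# model-free Q-table directly, and are also stored in a model that is
# replayed offline -- transferring MB knowledge into cheap MF habits.

def dyna_q_update(Q, model, s, a, r, s2, actions,
                  alpha=0.1, gamma=0.95, n_replay=10):
    def td(s, a, r, s2):
        best_next = max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
            r + gamma * best_next - Q.get((s, a), 0.0))

    td(s, a, r, s2)              # learn from real experience (MF)
    model[(s, a)] = (r, s2)      # update the internal model (MB)
    for _ in range(n_replay):    # offline replay: MB -> MF transfer
        sr, ar = random.choice(list(model.keys()))
        rr, s2r = model[(sr, ar)]
        td(sr, ar, rr, s2r)
```

On this picture, effortful MB deliberation that cannot be consolidated in this way must be paid for anew on every choice; consolidation is precisely what would let will become wont.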
Financial support
This study was supported by the Max Planck Society and the Alexander von Humboldt Foundation.
Conflict of interest
None.