When will's wont wants wanting

Published online by Cambridge University Press:  26 April 2021

Peter Dayan*
Affiliation:
Max Planck Institute for Biological Cybernetics & University of Tuebingen, 72076 Tuebingen, Germany. dayan@tue.mpg.de

Abstract

We use neural reinforcement learning concepts including Pavlovian versus instrumental control, liking versus wanting, model-based versus model-free control, online versus offline learning and planning, and internal versus external actions and control to reflect on putative conflicts between short-term temptations and long-term goals.

Type
Open Peer Commentary
Creative Commons
The target article and response article are works of the U.S. Government and are not subject to copyright protection in the United States.
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press

We are invited by this excellent target article (TA) to consider three critical questions: (1) why should there ever be conflict between short-term temptations and long-term goals; (2) what mechanisms in the brain overcome these temptations; and (3) why is the operation of some, but not others, of those mechanisms accompanied by a sense of effort? These are respectively ethological, psychological/neural, and metaphysical – and so demand answers of different characters. Here, we reflect on the TA using the terms and language of neural reinforcement learning (RL): Pavlovian versus instrumental control (Dayan, Niv, Seymour, & Daw, 2006; Dickinson, 1980; Mackintosh, 1983); liking versus wanting (Berridge, 2009); model-based (MB) versus model-free (MF) control (Daw, Niv, & Dayan, 2005); online versus offline learning and planning (Sutton, 1991; Mattar & Daw, 2018); and internal versus external actions and control (Dayan, 2012; Keramati, Smittenaar, Dolan, & Dayan, 2016; Pezzulo, Rigoli, & Chersi, 2013).

First, why should we and other animals suffer from temptations at all – why should there be even a possibility of misalignment between short- and long-term incentive structures? After all, patience can evolve quite naturally (Stevens & Stephens, 2008) – as in stalking hunters who presumably suppress immediate attacking urges in order to better their chances of ultimate success. In the context of thinking (and, in this case, acting) fast and slow (Kahneman, 2011), the TA hints that the former system repurposed a psychologically common (hyperbolic) heuristic for valuation across delays that simply cannot meet the challenge, posed by the latter system, of being sufficiently patient. Neural RL offers complementary Pavlovian and instrumental interpretations.
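To see the heuristic's impatience concretely: under hyperbolic discounting, V = A/(1 + kD), preferences between a smaller-sooner and a larger-later reward can reverse as both draw near, whereas exponential discounting is dynamically consistent. A minimal numerical sketch (the amounts, delays, and discount rates are arbitrary illustrative choices, not values from the TA):

```python
# Illustrative preference reversal under hyperbolic discounting.
# All amounts, delays, and discount rates are arbitrary choices for the sketch.

def hyperbolic(amount, delay, k=1.0):
    """Hyperbolic discounted value: V = A / (1 + k * D)."""
    return amount / (1.0 + k * delay)

def exponential(amount, delay, gamma=0.7):
    """Exponential discounted value: V = A * gamma ** D."""
    return amount * gamma ** delay

SS, LL = 5.0, 10.0                     # smaller-sooner vs. larger-later rewards

for days_until_ss in (0, 10):          # choice at the moment of temptation vs. 10 days out
    d_ss, d_ll = days_until_ss, days_until_ss + 3
    h = (hyperbolic(SS, d_ss), hyperbolic(LL, d_ll))
    e = (exponential(SS, d_ss), exponential(LL, d_ll))
    print(f"delay to SS = {d_ss:2d}: hyperbolic prefers "
          f"{'SS' if h[0] > h[1] else 'LL'}, exponential prefers "
          f"{'SS' if e[0] > e[1] else 'LL'}")
```

Viewed from afar, the hyperbolic discounter prefers the larger-later reward; at the moment of temptation it switches to the smaller-sooner one, while the exponential discounter's preference never reverses.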

By Pavlovian influences we mean hard-wired or pre-specified actions such as direct approach to, and engagement with, primary and even secondary reinforcers. One view of these influences – which largely align with the visceral processes (Loewenstein, 1996) that the TA discusses – is that they are evolutionarily specified priors over actions or policies. The benefit of this sort of inductive bias is that it obviates the sample complexity of learning in stable environments (Dayan et al., 2006). Could it be that, on balance, the benefits of this pre-programming outweigh its costs? The classic tasks assessing willpower focus on the costs; a careful accounting of the benefits would be interesting. Certainly, external pre-commitment (or suppression, in the form of revaluation or attentional diversion) is a way of avoiding Pavlovian misbehaviour (Dayan et al., 2006).
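As a toy rendering of such a prior over policies (the additive bias-plus-softmax form follows models in the go/no-go literature, e.g., Cavanagh et al., 2013; the particular numbers are illustrative):

```python
import math

# A Pavlovian prior over policies: a hard-wired bias toward approaching an
# appetitive cue is added to learned instrumental values before a softmax
# choice. The weight w and all values are illustrative.

def softmax(values):
    exps = [math.exp(v) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

q_instrumental = {"approach": 0.0, "withhold": 1.0}   # learned: withholding pays
pavlovian_bias = {"approach": 1.0, "withhold": 0.0}   # pre-specified prior
w = 2.0                                               # strength of the prior

actions = list(q_instrumental)
net = [q_instrumental[a] + w * pavlovian_bias[a] for a in actions]
for action, p in zip(actions, softmax(net)):
    print(f"P({action}) = {p:.2f}")   # the prior overrides the instrumental values
```

With a strong prior, approach dominates (P ≈ 0.73) even though withholding has the higher learned value: useful when the prior matches the environment, misbehaviour when it does not.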

By contrast, in instrumental conditioning, we and other animals learn to choose actions based on the contingent rewards they produce and punishments they avoid. Of course, here, the key question is what happens when these affective outcomes lie in the future. A useful analogy comes from experiments on food reward that separate out the short-term hedonic (e.g., sweetness) and long-term (e.g., nutritive) qualities of the outcomes of actions (de Araujo, Schatzker, & Small, 2020). Animals are initially attracted by the hedonic appeal of outcomes, but ultimately (via information from the gut), their choices are dictated by what is closer to the true long-term value.

One possibility is that the hedonic system is again a sort of typically useful prior, but now over likely long-term valuation rather than over a policy or action – if, for instance, sweetness is sufficiently often aligned with long-term nutritive quality. Thus, the animal might be drawn, at least at first, into favouring what are actually poor choices from a long-term perspective. In RL terms, one speculation is that hedonics – as a form of liking (Berridge, 2009) – act as what is known as a shaping reward (Ng, Harada, & Russell, 1999): a hint for the instrumental system that speeds learning when it is appropriate, but that does not ultimately affect the optimal policy, merely slowing its acquisition when misleading. Complementary to liking is wanting (Berridge, 2009), which would then be considered the true currency for choice. Thus, again, a conventionally useful, hard-wired prior system can appear to give unwarranted favour to smaller-sooner outcomes, favour that it then takes more or less learning to wash out.
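The shaping idea can be stated compactly: Ng et al. (1999) showed that adding a reward of the form F(s, s') = γΦ(s') − Φ(s), for any potential function Φ, leaves the optimal policy unchanged. A minimal sketch, with "sweetness" cast, purely hypothetically, as the potential:

```python
# Potential-based reward shaping (Ng, Harada, & Russell, 1999): adding
# F(s, s') = gamma * phi(s') - phi(s) to the environmental reward cannot
# change the optimal policy. Here, "sweetness" plays the role of the
# potential phi -- a hypothetical stand-in for hedonic liking.

GAMMA = 0.95

def sweetness(state):
    """Hypothetical hedonic potential: how sweet the food in this state is."""
    return {"no_food": 0.0, "candy": 1.0, "fruit": 0.6}.get(state, 0.0)

def shaped_reward(state, next_state, nutritive_reward):
    """Environmental (nutritive) reward plus the hedonic shaping bonus."""
    shaping = GAMMA * sweetness(next_state) - sweetness(state)
    return nutritive_reward + shaping

# When sweetness tracks nutritive value, the hint speeds learning; when it
# does not (candy), it can slow -- but cannot ultimately change -- convergence
# to the policy dictated by nutritive reward alone.
print(shaped_reward("no_food", "fruit", nutritive_reward=0.8))
print(shaped_reward("no_food", "candy", nutritive_reward=0.1))
```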

Second are the mechanisms that overcome temptations – when the temporal accounting of wanting over liking does not suffice. From an RL perspective, it is useful to think about MB and MF systems, and also externally- and internally-directed actions. MB (or goal-directed; Dickinson & Balleine, 2002) control operates by constructing, and performing forward planning in, an internal simulacrum of the environment; it can exactly capture, for instance, the resolve-associated observation that defection sooner implies defection later, thus reducing the chance of actually attaining long-term rewards. This sort of future planning has been associated with phenomena such as preplay in rodents (Pfeiffer & Foster, 2013; Wikenheiser & Redish, 2015) and, more speculatively, humans (Eldar, Lièvre, Dayan, & Dolan, 2020; Liu, Dolan, Kurth-Nelson, & Behrens, 2019) and, as noted in the TA, various parts of the default mode network.
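A minimal sketch of this sort of forward planning, on a toy "temptation" problem in which the internal model itself encodes that defection sooner implies defection later (the states and rewards are invented for illustration):

```python
# Model-based forward planning as depth-limited search over an internal
# model of a toy "temptation" problem. All states and rewards are invented.

# model[state][action] = (next_state, reward)
model = {
    "start":    {"resist": ("resisted", 0.0), "defect": ("tempted", 1.0)},
    "resisted": {"persist": ("goal", 5.0)},
    "tempted":  {"persist": ("lapsed", 1.0)},   # having defected once,
    "goal": {}, "lapsed": {},                   # only further defection follows
}

def plan(state, depth=3):
    """Return (value, best_action) by simulating the model forward."""
    options = model.get(state, {})
    if depth == 0 or not options:
        return 0.0, None
    def backed_up(item):
        _, (next_state, reward) = item
        return reward + plan(next_state, depth - 1)[0]
    best = max(options.items(), key=backed_up)
    return backed_up(best), best[0]

print(plan("start"))   # -> (5.0, 'resist'): the planner forgoes the temptation
```

Because the simulated consequences of defecting now include defecting later, the backed-up value of resisting (5.0) beats that of the temptation (2.0).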

By contrast, MF systems cache or store information about the actions performed in the past, and thereby come directly to favour those actions that were either associated with rewards or possibly just frequently exercised (Gershman, 2020). In the case that information about rewards is cached, mechanisms such as temporal difference learning (Sutton, 1988; Watkins, 1989), associated with the wanting mentioned above, ensure that these cached values are appropriate in the long run. These MF policies have been identified with habits (Daw et al., 2005; Dickinson & Balleine, 2002).
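For concreteness, here is the canonical one-step temporal difference update, V(s) ← V(s) + α[r + γV(s') − V(s)], applied to a toy trajectory (the states and parameters are illustrative):

```python
# Model-free caching via one-step temporal-difference learning (Sutton, 1988):
# a state's cached value is nudged toward the sampled reward plus the
# discounted cached value of its successor. The tiny chain and the learning
# parameters are illustrative.

ALPHA, GAMMA = 0.1, 0.9
V = {"cue": 0.0, "wait": 0.0, "outcome": 0.0}

# Repeated experience of the same trajectory: cue -> wait -> outcome (+1).
trajectory = [("cue", "wait", 0.0), ("wait", "outcome", 1.0)]

for _ in range(200):
    for state, next_state, reward in trajectory:
        td_error = reward + GAMMA * V[next_state] - V[state]  # prediction error
        V[state] += ALPHA * td_error                          # cache update

print({s: round(v, 2) for s, v in V.items()})
# Values propagate backwards -- V[wait] -> 1.0, V[cue] -> 0.9 -- without any model.
```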

MB control is time-consuming (Pezzulo et al., 2013) and potentially taxing (see below); thus, there is a ready process of habit formation (Daw et al., 2005; Dickinson & Balleine, 2002), in which habits take over – becoming what we are wont to do. This is partly consonant with notions in the TA. What, though, of suppression – rendered here as internally-directed (e.g., devaluation, or indeed the overriding of Pavlovian control; Cavanagh, Eisenberg, Guitart-Masip, Huys, & Frank, 2013) or externally-directed (e.g., attention) mechanisms for changing the attraction of short-term temptations? In neural RL terms, these can be considered as (expensive; Shenhav et al., 2017) internal actions, controlled in the same way as external actions, along with other internal actions such as the deployment of working memory (Dayan, 2012). Although the TA suggests that suppression is less stable than resolve (e.g., via a positive feedback process by which partial failures in distraction tend to spiral out of control and lead to full failure and defection on long-term goals), it should be noted that the internal actions necessary to complete the calculations for the MB realization of resolve are of a piece with those enforcing suppression, and so are subject to some of the same problems. It is certainly not obvious that some forms of suppression will not also habitize.
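To illustrate how an internal action such as suppression might be valued in the same currency as external actions, a sketch in which suppressing simply damps the Pavlovian bias weight from the earlier sketch, at a fixed effort cost (both the mechanism and the numbers are illustrative assumptions, not claims from the TA):

```python
# Suppression as a costly internal action, selected by the same value
# comparison as external actions. All quantities are illustrative.

def chosen_action(q, bias, w):
    """External action maximizing instrumental value plus Pavlovian bias."""
    return max(q, key=lambda a: q[a] + w * bias[a])

q = {"approach": 1.0, "withhold": 2.0}        # long-run instrumental values
bias = {"approach": 1.0, "withhold": 0.0}     # hard-wired approach bias
cost = 0.3                                    # effort cost of suppressing

# Internal decision: leave the bias at full strength, or pay to damp it.
value_unsuppressed = q[chosen_action(q, bias, w=2.0)]        # approach wins: 1.0
value_suppressed = q[chosen_action(q, bias, w=0.0)] - cost   # withhold wins: 1.7

print("suppress" if value_suppressed > value_unsuppressed else "do not suppress")
```

On these numbers, paying the suppression cost is worthwhile; but nothing distinguishes this internal choice computationally from an external one, which is the point.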

Briefly, what of the effort associated with suppression and, I would argue, resolve, at least when MB calculations remain necessary? Here, we cheat to focus back on ethology. To the flavours of opportunity cost discussed (Boureau, Sokol-Hessner, & Daw, 2015; Kurzban, Duckworth, Kable, & Myers, 2013; Shenhav et al., 2017), we should add that of not being able to transfer knowledge from MB to MF systems (Mattar & Daw, 2018; Sutton, 1991), a transfer that renders the former's choices less effortful. That is what would make will become wont.
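A sketch of this MB-to-MF transfer in the style of Dyna (Sutton, 1991), in which offline replay from a learned model fills in the MF cache (the toy model echoes the invented planning sketch above; all quantities are illustrative):

```python
import random

# Dyna-style offline replay (Sutton, 1991): experiences drawn from a learned
# model train the model-free cache, so choices that once needed effortful MB
# simulation become cheap cached lookups. Model contents are illustrative.

ALPHA, GAMMA = 0.5, 0.9
Q = {}                                        # the MF cache
model = {("start", "resist"): ("resisted", 0.0),
         ("start", "defect"): ("end", 1.0),
         ("resisted", "persist"): ("end", 5.0)}

def max_q(state):
    values = [q for (s, _), q in Q.items() if s == state]
    return max(values, default=0.0)

random.seed(0)
for _ in range(200):                          # offline replay sweeps
    (state, action), (next_state, reward) = random.choice(list(model.items()))
    q = Q.get((state, action), 0.0)
    Q[(state, action)] = q + ALPHA * (reward + GAMMA * max_q(next_state) - q)

print({k: round(v, 2) for k, v in Q.items()})
# The cache alone now prefers 'resist' at start (4.5 > 1.0): will becomes wont.
```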

Financial support

This study was supported by the Max Planck Society and the Alexander von Humboldt Foundation.

Conflict of interest

None.

References

Berridge, K. C. (2009). Wanting and liking: Observations from the neuroscience and psychology laboratory. Inquiry, 52(4), 378–398.
Boureau, Y.-L., Sokol-Hessner, P., & Daw, N. D. (2015). Deciding how to decide: Self-control and meta-decision making. Trends in Cognitive Sciences, 19(11), 700–710.
Cavanagh, J. F., Eisenberg, I., Guitart-Masip, M., Huys, Q., & Frank, M. J. (2013). Frontal theta overrides Pavlovian learning biases. Journal of Neuroscience, 33(19), 8541–8548.
Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12), 1704–1711.
Dayan, P. (2012). How to set the switches on this thing. Current Opinion in Neurobiology, 22, 1068–1074.
Dayan, P., Niv, Y., Seymour, B., & Daw, N. D. (2006). The misbehavior of value and the discipline of the will. Neural Networks, 19(8), 1153–1160.
de Araujo, I. E., Schatzker, M., & Small, D. M. (2020). Rethinking food reward. Annual Review of Psychology, 71, 139–164.
Dickinson, A. (1980). Contemporary animal learning theory. Cambridge, UK: Cambridge University Press.
Dickinson, A., & Balleine, B. (2002). The role of learning in motivation. In C. Gallistel (Ed.), Stevens' handbook of experimental psychology (Vol. 3, pp. 497–533). New York, NY: Wiley.
Eldar, E., Lièvre, G., Dayan, P., & Dolan, R. J. (2020). The roles of online and offline replay in planning. eLife, 9.
Gershman, S. J. (2020). Origin of perseveration in the trade-off between reward and complexity. bioRxiv.
Kahneman, D. (2011). Thinking, fast and slow. Macmillan.
Keramati, M., Smittenaar, P., Dolan, R. J., & Dayan, P. (2016). Adaptive integration of habits into depth-limited planning defines a habitual-goal-directed spectrum. Proceedings of the National Academy of Sciences of the United States of America, 113, 12868–12873.
Kurzban, R., Duckworth, A., Kable, J. W., & Myers, J. (2013). An opportunity cost model of subjective effort and task performance. Behavioral and Brain Sciences, 36(6), 661–679.
Liu, Y., Dolan, R. J., Kurth-Nelson, Z., & Behrens, T. E. (2019). Human replay spontaneously reorganizes experience. Cell, 178(3), 640–652.
Loewenstein, G. (1996). Out of control: Visceral influences on behavior. Organizational Behavior and Human Decision Processes, 65(3), 272–292.
Mackintosh, N. J. (1983). Conditioning and associative learning. Oxford, UK: Oxford University Press.
Mattar, M. G., & Daw, N. D. (2018). Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience, 21, 1609–1617.
Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. ICML (Vol. 99, pp. 278–287).
Pezzulo, G., Rigoli, F., & Chersi, F. (2013). The mixed instrumental controller: Using value of information to combine habitual choice and mental simulation. Frontiers in Psychology, 4, 92.
Pfeiffer, B. E., & Foster, D. J. (2013). Hippocampal place-cell sequences depict future paths to remembered goals. Nature, 497, 74–79.
Shenhav, A., Musslick, S., Lieder, F., Kool, W., Griffiths, T. L., Cohen, J. D., & Botvinick, M. M. (2017). Toward a rational and mechanistic account of mental effort. Annual Review of Neuroscience, 40, 99–124.
Stevens, J. R., & Stephens, D. W. (2008). Patience. Current Biology, 18(1), R11–R12.
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4), 160–163.
Watkins, C. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge.
Wikenheiser, A. M., & Redish, A. D. (2015). Decoding the cognitive map: Ensemble hippocampal sequences and decision making. Current Opinion in Neurobiology, 32, 8–15.