1. Introduction
According to “radical enactivists,” the cognitive sciences should abandon the representational framework, for several reasons. For example, enactivists claim that there is no satisfactory, naturalistic account of content at the level of basic cognition. Hence, representationalism faces the “Hard Problem of Content” (Hutto and Myin 2013, 2017). Thus, the cognitive sciences should give up the notion of neurocognitive representations. In addition, enactivists argue, there is no account of how contentful representational states drive action. Again, the conclusion is that cognitive scientists should let go of the assumption of representations (Hutto and Myin 2020). Instead, according to enactivists, action control should be explained without appealing to representations. As Myin and Hutto write, “acts of perceptual, motor, or perceptuomotor cognition—chasing and grasping a swirling leaf—are directed towards worldly objects and states of affairs, or aspects thereof, yet without representing them” (2015, 62, italics added).
In this article, however, we question these claims by looking at contemporary algorithmic research on action control. In what follows, we focus especially on reinforcement learning (RL) algorithms, which are widely used to study various aspects of action and motor control in computational neuroscience, artificial intelligence (AI), and robotics.
In RL, agents take actions in an environment in order to maximize cumulative reward. Action control is understood as choosing the right action selection policy for a given environment, so as to maximize future reward. This formulation of the computational problem makes RL-based action control models cognitively more sophisticated than many other models, such as simple proportional feedback control models based on control theory from the 1960s.
Moreover, as the case of action planner systems illustrates, action control in RL can be given a representational interpretation. Thus, RL provides a well-understood, algorithmic way to describe how the manipulation of representations makes a difference to the systems that guide and drive behavior. It provides a means of explicating “action-oriented views” of cognitive systems that is overlooked by recent enactivists (and many other antirepresentationalists).
2. Reinforcement Learning Algorithms
In a nutshell, RL can be described as learning by interacting with an environment. An RL agent learns by trial and error, observing the consequences of its actions, rather than by being explicitly taught what to do. The agent selects its actions on the basis of its past experiences and also by exploring new choices.
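To make this idea concrete, the following minimal sketch shows one standard way of combining past experience with exploration, namely epsilon-greedy action selection over estimated action values. The value estimates and the exploration rate here are illustrative assumptions, not taken from the article or from any specific system.

```python
import random

def select_action(value_estimates, epsilon=0.1):
    """Mostly pick the action that has worked best so far; sometimes explore."""
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))   # explore a new choice
    # exploit: choose the action with the highest estimated value so far
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

# Example: with estimates [0.2, 0.5, 0.1], action 1 is chosen most of the time.
action = select_action([0.2, 0.5, 0.1])
```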
Historically, the basic idea of RL—learning as trial and error—was developed by early behaviorists. Thorndike’s (1911) “Law of Effect” described how reinforcing events (i.e., reward and punishment) affect the tendency to select actions and, hence, how they affect learning. Computer scientists combined this framework with the formalisms of optimal control theory, temporal difference learning, and learning automata, giving it a precise formulation in the 1960s and 1970s.Footnote 1
Nowadays, variants of this algorithmic approach are used in a wide range of applications in AI and robotics, and in the cognitive and computational neurosciences they are used to study various forms of skilled action and motor control. RL algorithms are also deployed, for example, in learning, decision making, and strategic reasoning tasks, and they have been applied to study attention, procedural memory (for model-free policies or action values), semantic declarative memory (for world maps or models), and episodic memory.Footnote 2
3. The Core Concepts of RL
RL describes how an agent learns to interact with its environment in a rational way: the algorithms are designed to maximize the cumulative reward over time by observing the consequences of actions.Footnote 3 In other words, an agent learns from experience to choose actions that lead to greater rewards in the long run.
When using RL as a theory of brain function, the basic idea is that neural activity reflects a set of operations that together constitute computations specified in the RL framework. One of the key theoretical insights of RL is its way of describing how brains, as computational systems, can learn what to do (see fig. 1). In RL, a (technical) environment is a temporal succession of states s_t from a set of environment states S. At each point in time, the environment is in exactly one state. A state encodes “the world” into a number of variables whose values determine the state.Footnote 4 Each state has a fixed reward that is observable (in a technical sense) for the agent. Reward is not a complex or multidimensional feature but a simple scalar, which can be negative (punishment) or positive (reward). The agent can act in the environment, performing individual actions from a set of actions A. An action a_t taken in a state s_t will take the environment at the next time step to a new state s_{t+1} according to a state transition function.Footnote 5 This transition function is part of the world and is generally not known to the agent.

Figure 1. Structure of a reinforcement learning algorithm.
At each time step t the agent is in a state s_t, in which it can choose an action a_t.Footnote 6 The agent then receives some amount of reward r_t with probability P(r|s). The reward function R(s) is not known to the agent and is considered to be produced by the (technical) environment. This prevents the agent from updating its own reward function—otherwise the agent could trivially maximize the reward by treating whatever happens as maximally rewarding.Footnote 7
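The following toy sketch illustrates this agent-environment structure: the environment owns the transition and reward functions, and the agent only observes states and scalar rewards. The class, the toy transition rule, and the reward values are illustrative assumptions, not drawn from the article or from any particular implementation.

```python
import random

class ToyEnvironment:
    """A tiny synthetic environment with states S = {0, 1, 2} and actions A = {0, 1}."""

    def __init__(self):
        self.state = 0  # the environment is in exactly one state at a time

    def step(self, action):
        # The state transition function and the reward function R(s) belong to
        # the environment; the agent can neither inspect nor modify them.
        next_state = (self.state + action + 1) % 3
        reward = 1.0 if next_state == 2 else 0.0   # reward is a simple scalar
        self.state = next_state
        return next_state, reward

env = ToyEnvironment()
state = env.state
for t in range(10):
    action = random.choice([0, 1])      # the agent chooses an action a_t
    state, reward = env.step(action)    # and observes s_{t+1} and r_{t+1}
```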
However, note that anatomically the reward signal is often generated within the organism; thus, it is typically organism dependent. Moreover, what is rewarding for a particular agent is not a property of the physical world but a property pertaining to the agent. Different agents will have different reward functions even when the physical (or technical) environment is the same.
In RL, the agent’s task is generally to learn to estimate and maximize the long-term cumulative reward. This means producing an estimate of the value of a state and choosing actions that lead to maximally valuable states. The concept of value stands for these cumulative expected long-term rewards accruing from a state. Technically, the value V(s) of a state s is the expected temporally discounted sum of rewards: the reward r_t observed at time t plus future rewards, which are discounted the further they lie in the future.
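As a worked example of this definition, the short sketch below computes a temporally discounted sum of rewards for one observed reward sequence; V(s) is the expectation of this quantity over trajectories starting from s. The discount factor and the sample rewards are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards r_t, r_{t+1}, ..., discounting each by gamma per time step."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards further in the future contribute less to the value estimate.
print(discounted_return([0.0, 0.0, 1.0, 1.0]))   # 0.9**2 * 1.0 + 0.9**3 * 1.0 = 1.539
```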
4. Reinforcement Learning and Action Planning
In experimental work on motor control, actions—such as chasing and reaching a leaf—are typically seen as based on internal predictive or forward models of reaching dynamics (Wolpert, Ghahramani, and Jordan 1995; Miall and Wolpert 1996). Typically, the analyses describe the dynamics as progressive adjustments of internal models to fit current observations. When the dynamics of action are approached in terms of RL (Doya 2008; Botvinick et al. 2015; Weinstein and Botvinick 2017),Footnote 8 the agent is thought to take an action (e.g., reaching a leaf) according to its action policy and then to update the policy after receiving a reward outcome in the form of a signal.Footnote 9
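One standard model-free way to cash out “updating the policy after receiving a reward outcome” is a tabular temporal-difference (Q-learning) update, sketched below. The learning rate, discount factor, and table sizes are illustrative assumptions; the cited works use considerably richer schemes.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # estimated values of state-action pairs
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def update(state, action, reward, next_state):
    """Move Q(s, a) toward the received reward plus the discounted value of s_{t+1}."""
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

# After acting (say, action 1 in state 0 led to state 2 with reward 1.0),
# the policy implicitly shifts toward the actions with higher estimated values.
update(state=0, action=1, reward=1.0, next_state=2)
```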
When the goal of the agent changes, the appropriate action becomes different (Doya 2008). In this case, the agent must somehow find a way to handle the new goal. In RL, one possible solution is that the agent uses an internal model M to help update the action policy (Doya 2008). This internal model consists of a learned state transition rule P(new state | state, action), that is, P(s_{t+1} | s_t, a_t) (Doya 1999; Kawato 1999). If such a model is available, the agent can perform the following inference: if I take action a_t from the current state s_t, what new state s_{t+1} will I end up in? In addition, if the reward for each state is also known, the agent can use this internal model to evaluate the “goodness” of any hypothetical action.
This approach assumes the existence of a forward model of the environment. This forward model allows an agent to “plan” its actions: it helps to evaluate how the environment will evolve in response to different actions. By using this forward model, the system can select a sequence of actions that will take the agent from its current state to a desired goal state (e.g., one that maximizes the rewards accruing along the trajectory).Footnote 10 Technically, this planning procedure can be described as the maximization of the value up to a time horizon, that is, of the cumulative reward Σ_{t=1}^{T} r_t, where t indexes discrete time steps up to some maximum T, and r_t is the reward received at each step (Weinstein and Botvinick 2017). Further, based on a particular policy, the system queries the forward model with a series of state-action pairs (s_t, a_t) and in turn receives an estimated next state (s_{t+1}) and reward (r_{t+1}). After the planner completes its queries, it returns an action a, which is executed in M. This results in a new state and a new reward, and the process starts over.
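The planning loop just described can be sketched as follows: the planner repeatedly queries a forward model with hypothetical state-action pairs, accumulates the predicted rewards along each trajectory, and returns the first action of the best trajectory for execution. The function names and the simple random-shooting strategy are illustrative assumptions, not the specific procedure of the cited works.

```python
import random

def plan(forward_model, state, actions, horizon=5, n_candidates=50):
    """Query the forward model with action sequences and return the best first action."""
    best_value, best_first_action = float("-inf"), None
    for _ in range(n_candidates):
        s, value, first_action = state, 0.0, None
        for _ in range(horizon):
            a = random.choice(actions)      # hypothetical action a_t
            s, r = forward_model(s, a)      # predicted next state and reward
            value += r                      # reward accruing along the trajectory
            if first_action is None:
                first_action = a
        if value > best_value:
            best_value, best_first_action = value, first_action
    return best_first_action                # this action is then executed in the environment
```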
5. Representations in RL-Based Action Control
To characterize the cognitive dynamics of action control in this way is to characterize it in exact, abstract, and algorithmic terms. This approach makes no mention of the features of the actual environments in which the cognitive processes, or mechanisms, might be deployed. Instead, RL is an exact way to study the cognitive dynamics as forms of algorithmically specified reasoning and learning processes. It explains how cognitive systems control action by planning and selecting among different options.
In RL-based action control, the algorithms can be taken to operate with at least two types of representational states. First, estimates of the values of state-action pairs can be taken to represent the estimated “goodness” of an action in terms of cumulative long-term rewards. What they represent are neither entities in the real world (say, hands grasping leaves) nor the future trajectories of real-world entities (say, the possible future trajectories of hands grasping new leaves). Instead, they represent the goals of action in light of an action policy (e.g., the amount of reward the agent would receive if it grasps the leaf).
Second, when the algorithm estimates the goodness of future actions, it uses a forward model. This model can be taken as a representation of the future states of the algorithm’s world in light of its action policy. As a representation, it refers to the estimated, possible future states of the algorithm’s world, not to the states of the real-world environment.
Namely, in RL the agent-environment construction is part of the algorithm specification, and the environment is literally a “synthetic” model of the environment for the algorithm. It is specified in terms of the formalisms, not (only) in terms of real-world entities.Footnote 11 Neither it nor the concept of the agent should be confused with the notions of “real” environments (e.g., physical stimuli) or “real agents” (e.g., the organism).
Philosophically, these representational states resemble Egan’s (2014, 2020) cognitive states with computational contents. According to Egan’s (2014) distinction, some (computational) contents are about the formal descriptions of the tasks computed by cognitive systems, and some (“cognitive”) contents are about the environment. As Egan (2014, 2020) remarks, computational contents are domain general and environmentally neutral: they can be applied to a variety of different cognitive uses in different contexts, and they make no reference to the external environment whatsoever (Egan 2014). Computational states can be assigned a “semantic” content in an appropriate, intentional “gloss.” According to Egan (2014, 2020), this gloss is a pragmatically motivated way of describing the states of the system as standing in for objects or properties in the environment, so as to capture the interaction between the organism and its environment. The intentional gloss enables the analysis of cognitive systems as representing elements of the environment.
In the case of RL, however, the intentional gloss is not given in terms of “properties of the external environment.” The (synthetic) environment is not just a (re)description of the external, real environment. Instead, it is a technical environment for the algorithm, reflecting the computational RL problem and the structure of the algorithm.
In many real-world tasks, a sufficient correspondence between the synthetic environment and the real-world environment can be crucial. For example, if the goal of a robot hand is, say, to pick up a leaf in a real-world environment, then, obviously, the leaf’s location, its size, and its configuration relative to the hand are relevant to the success of the performance. To select appropriate policies, the system must take these (and other relevant) external factors into account.
Technically, the degree and quality of this correspondence depend on the details of the specific application, and the correspondence can be implemented in many ways. Not all of them are representational, or “contentful” in the radical enactivists’ sense. The real-world environment may serve only as a source of feedback information: for example, the parameters of the system can be updated causally using this feedback. As Ramsey (2007) remarks, however, mere causal relations do not represent. Thus, the feedback information may play only a causal, not a representational, role.
In some cases, systems may receive so-called observations (e.g., an image of the environment) as inputs, parametrize them, and transform them into hidden states. The hidden states are then updated iteratively by a recurrent process that receives the previous hidden states and hypothetical next actions. At each step the model predicts the policy, value function, and immediate reward.
However, there is no requirement for the hidden states to “match” the states of the external environment, nor are there any other such constraints on the semantics of the states (Sutton and Barto 2018). Instead, the hidden states may represent states in whatever way is relevant for predicting current and future values. That is, RL algorithms do not simply use (current or past) “observations” (about the external environment) to estimate future rewards; they do not track the regularities of the external environment. Instead, they estimate what actions they should take to maximize the reward. Thus, they refer to the future development of the (synthetic) environment M, not to the (development of the) real-world environment as such. Hence, if they stand in for something, they stand in for entities and states in possible worlds.
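A schematic sketch of such a hidden-state setup is given below: an observation is encoded into a hidden state, which is then rolled forward by hypothetical actions, and at each step the model predicts a policy, a value, and a reward. All weights, dimensions, and function names are illustrative assumptions; nothing in the sketch constrains the hidden state to “match” any state of the external environment.

```python
import numpy as np

rng = np.random.default_rng(0)
H, OBS, ACT = 8, 16, 3                          # hidden, observation, and action sizes
W_enc = rng.normal(size=(H, OBS))               # observation -> hidden state
W_dyn = rng.normal(size=(H, H + ACT))           # (hidden state, action) -> next hidden state
W_pi = rng.normal(size=(ACT, H))                # prediction heads: policy,
W_v = rng.normal(size=(1, H))                   # value,
W_r = rng.normal(size=(1, H))                   # and immediate reward

def encode(observation):
    """Parametrize an observation into a hidden state."""
    return np.tanh(W_enc @ observation)

def step(hidden, action_onehot):
    """Recurrently update the hidden state with a hypothetical action and predict."""
    hidden = np.tanh(W_dyn @ np.concatenate([hidden, action_onehot]))
    logits = W_pi @ hidden
    policy = np.exp(logits - logits.max()); policy /= policy.sum()
    value, reward = (W_v @ hidden).item(), (W_r @ hidden).item()
    return hidden, policy, value, reward

h = encode(rng.normal(size=OBS))                # encode an "observation"
for a in range(ACT):                            # roll forward with hypothetical actions
    h, policy, value, reward = step(h, np.eye(ACT)[a])
```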
6. From Fly Detectors to a Variety of Representations
Obviously, these representations do not fit well with the portrait of neurocognitive representations painted by recent radical enactivists. For example, in their recent work Hutto and Myin (2020) describe representational content as “the property that states of mind possess” (82). It “allows them to represent how things are with the world” (82). The states of the mind are connected with the world via “sensory contact,” and the content of representational states is taken to “track” the external environment (Hutto 2015).
This view of representation continues the legacy of so-called fly detectors. In fly detectors, the notion of representation is specified in terms of a relation between the tokening of an internal, neurocognitive state and the external object or property the state represents. Historically, this view is inspired by the receptive field studies on sensory systems in the 1950s and 1960s (Hubel and Wiesel 1959; Lettvin et al. 1959). In Lettvin et al. (1959), the focus was on the signal transformation properties of frog ganglion cells, later known as “fly detectors.” These cells were found to respond to small, black, fly-like dots moving in the frog’s visual field. Hubel and Wiesel (1962) proposed a way in which “pooling mechanisms” might explain the response properties of these cells in the mammalian primary visual cortex.
This framework deeply affected neuroscientific and psychological research on sensory processes, which were studied as bottom-up feature detection for decades. Fly detectors also began to dominate philosophical intuitions about representations. A great deal of effort was expended in the 1980s to answer the questions of (i) whether a representation of a fly is really about flies, (ii) how to make the leap from the physical signal transformation properties of ganglion cells to the semantic properties of fly detectors, and (iii) how to specify the content determination of these representations in a satisfactory, naturalistic way (Dretske 1981; Millikan 1989; Fodor 1992).
In fly-detector accounts, the activation of a representation requires a causal association with a preceding stimulus, a (“neural”) signal, and subsequent behavior, or the occurrence of a stimulus that causes some indicator to fire. Typically, the stimulus is taken as a proximal cause of the activation of the representational state. This requires that a source for the stimulus (e.g., a signal that causes the stimulus) exists somehow in the physical environment. Alternatively, depending on the account, the source can be taken as the cause responsible, for example, for the firing of an indicator.
Obviously, the representations in RL-based action control systems are not specified in such terms. In RL, value refers to a mathematically specified amount of long-term cumulative expected reward; that is what value representations stand in for. These representations are not “triggered” by the occurrence of a value stimulus or a “value” signal from the (real) external environment, nor are rewards based on which stimulus features of the world neural signals are responding to. Rewards are not “out there,” and there are no “reward signals” causing reward stimuli to activate “reward detectors” or any other mechanisms analogous to how detectors (or “indicators” in teleo- or indicator semantics) have been envisaged in the recent enactivist or classical neurosemantic literature.
Of course, these representations raise very difficult problems concerning the “neural encoding” of such future-oriented, abstract, and organism-dependent entities. They challenge the intuition that all cognitive states represent in the way fly detectors do or, more generally, in the way sensory representations do. However, this puzzle is not a question answerable to intuition. Instead, it is answerable to the roles that representations play in explaining action control scientifically.
From a neuro- and cognitive scientific point of view, not all representations are sensory. Instead, there appear to be a variety of representational states. For example, while some sensory states (such as auditory signals) are more directly about external environmental target systems, other representations (such as complex action control representations) may not be. Thus, perhaps we should let go of the assumption that only states that track external environments count as representational and abandon a too-narrow fly-detector-based construal of representations.
7. Conclusion
RL algorithms are used to study the same phenomena (e.g., motor and action control) that are celebrated by enactivists (and many other antirepresentationalists) as paradigmatic examples of nonrepresentational phenomena. And yet, as the case of RL illustrates, motor and action control can be given a representational interpretation.
Computational models based on RL not only use some of the most powerful algorithms we have in AI; they are also widely and successfully used in many areas of the neurocognitive sciences to study biological organisms. One cannot simply ignore this computational and theoretical framework when assessing research on action control in the current neurocognitive sciences.Footnote 12
Moreover, RL algorithms are theoretically and mathematically well understood. Hence, this framework provides an exact, formal way to analyze, in detail, how action control systems use representations to drive action. It offers a means of explicating action-oriented views of cognitive systems that is overlooked by recent enactivists (and many other antirepresentationalists).
To characterize the cognitive dynamics of action control in this way is to characterize it in abstract and algorithmic terms. This approach makes no mention of the features of the actual environments in which the cognitive processes, or mechanisms, might be deployed. Instead, RL provides an exact way to study the cognitive dynamics as forms of reasoning and learning processes. It helps to explain how cognitive systems control action by planning and selecting among different options.
Even a simple action—such as grasping a swirling leaf—requires complicated cognitive coordination for an agent in a dynamic, complex, and changing environment. To solve this coordination challenge, cognitive systems learn by observing the consequences of the agent’s actions, select actions on the basis of past results, and explore new strategies. Moreover, when necessary, intelligent cognitive systems change their goals, compare alternative plans, and search for better solutions. When assessing what the most plausible account of this kind of action control is, perhaps what we should let go of is the assumption that only states that track external environments count as representational, not the whole representational framework.