The target article raises many important questions about the nature of human high-fidelity imitation. We zoom in on two of them: (1) whether high-fidelity imitation is underpinned by generic learning processes such as associative or reinforcement learning; and (2) whether imitation depends on explicit deliberation. Here we sketch an account in which a non-deliberative system is trained by generic learning mechanisms.
We adopt an approach based on building connectionist agent models. This approach complements that of the target article, which mainly reviews empirical studies. In any such constructive approach, it is necessary to select a computationally significant problem to solve. Here there are two: (1) transmission of complex skills and (2) establishing cooperation based on prosocial norms.
The specific models we consider are based on multi-agent learning algorithms developed by the artificial-intelligence research community. In these models, multiple agents co-inhabit a virtual environment resembling a computer game. The researchers can specify the physical dynamics of the simulated world, for example, that irrigation is needed to grow crops. The agents process raw visual sensory information formatted as images. Using a deep neural network, they transform the raw sensory input at each moment in time through a series of intermediate representations into a low-level motor action (e.g., move forward, turn left, etc.). Behavior is temporally coherent because the network retains information about the past via an internal state representation. Various learning mechanisms can be used to train the neural network. Reinforcement learning is the most prominent, but other approaches are also possible, and it is common to use more than one simultaneously in the same agent (e.g., Jaderberg et al., 2016).
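To make this architecture concrete, the following sketch (in PyTorch) shows the kind of recurrent policy network just described: pixels are encoded through intermediate representations into a discrete motor action, with an internal LSTM state carrying information across time steps. It is a minimal illustration; the layer sizes, input resolution, and four-action repertoire are our own assumptions, not those of any cited model.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Pixels -> intermediate representations -> discrete motor action,
    with an LSTM state retaining information about the past."""

    def __init__(self, num_actions=4):  # e.g., forward, back, left, right
        super().__init__()
        # Convolutional encoder: raw 64x64 RGB frames -> feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Recurrent core: makes behavior temporally coherent.
        self.core = nn.LSTMCell(32 * 6 * 6, 256)
        # Policy head: internal state -> distribution over motor actions.
        self.policy = nn.Linear(256, num_actions)

    def forward(self, frame, state):
        h, c = self.core(self.encoder(frame), state)
        return self.policy(h), (h, c)

# One step of acting: sample a low-level action from the current frame.
net = RecurrentPolicy()
state = (torch.zeros(1, 256), torch.zeros(1, 256))
frame = torch.rand(1, 3, 64, 64)           # one raw visual observation
logits, state = net(frame, state)
action = torch.distributions.Categorical(logits=logits).sample()
```

The recurrent core is what supplies the temporal coherence noted above: the same network is called once per time step and carries its internal state forward.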
In model-free approaches to reinforcement learning, the algorithms do not infer internal models of their environment and do not include explicit planning. As such, they are often considered models of habit learning and automatization (Dolan & Dayan, 2013). They are associated with implicit and procedural forms of memory, as opposed to declarative or episodic memory. Model-free agents are contrasted with model-based agents, such as that of Schrittwieser et al. (2020), which does employ explicit planning.
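The distinction can be stated compactly in code. The contrast below is our own tabular illustration, not the cited systems (which use deep networks): the model-free rule caches values directly from experience, while the model-based rule simulates outcomes with a learned model before acting.

```python
Q = {}  # state-action values, learned directly from experience

def model_free_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Model-free (Q-learning): adjust cached values from raw experience.
    No model of the environment is inferred; no planning is performed."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * best_next - Q.get((s, a), 0.0))

def model_based_choice(s, actions, model, value, gamma=0.99):
    """Model-based: plan explicitly by simulating each candidate action
    with a learned model of the environment before committing to one."""
    def lookahead(a):
        s_next, r = model(s, a)          # imagined next state and reward
        return r + gamma * value(s_next)
    return max(actions, key=lookahead)
```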
The first critical question is whether or not mostly unstructured tabula-rasa agents can transmit causally transparent skills from one to another. Several recent results suggest this is possible. It is now clear that, in the right environment, imitation itself, both the tendency to engage in it and the skill at doing so, can emerge spontaneously from processes resembling trial-and-error learning in the presence of a demonstrator agent. These results were obtained using generic learning methods with no built-in inductive bias for imitation. Borsa, Piot, Munos, and Pietquin (2019), Ha and Jeong (2022), Woodward, Finn, and Hausman (2020), and Bhoopchand et al. (2022) all used model-free reinforcement learning. Ndousse, Eck, Levine, and Jaques (2021) combined reinforcement learning with inference of an internal model but did not use it for explicit planning. Thus all five results showcase emergent imitation based on non-deliberative processes. The learning mechanisms employed in these models are consistent with those posited by the associative sequence learning account of imitation (Catmur, Walsh, & Heyes, 2009). However, unlike associative sequence learning, where imitation may be either goal directed or not (Heyes, 2016), these models depend critically on the presence of a goal, which must be used to define the reward signal. Thus, in the language of the target article, such results show that generic learning mechanisms may underpin instrumental-stance imitation but not ritual-stance imitation, where there may be no concrete goal.
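The logic of these emergence results can be illustrated with a deliberately tiny, runnable example, entirely our own construction and far simpler than the cited studies: a tabular Q-learner in a corridor is rewarded only for reaching a hidden goal, while a knowledgeable demonstrator walks toward that goal in view of the learner. Nothing in the learning rule mentions the demonstrator, yet the reward-maximizing policy that emerges is to copy the demonstrator's direction of travel.

```python
import random

N = 7                    # corridor cells 0..N-1; goal hidden at either end
ACTIONS = (-1, +1)       # the learner's motor actions: step left or right
Q = {}                   # generic tabular action values

def run_episode(alpha=0.2, gamma=0.95, eps=0.1):
    goal = random.choice((0, N - 1))     # the learner is never told this
    learner = demo = N // 2
    demo_move = +1
    for _ in range(3 * N):
        if demo != goal:                 # the demonstrator knows the goal
            demo_move = -1 if goal < demo else +1
            demo += demo_move
        s = (learner, demo_move)         # learner sees the demo's last move
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q.get((s, b), 0.0))
        learner = min(max(learner + a, 0), N - 1)
        r = 1.0 if learner == goal else 0.0   # reward only at the goal
        s2 = (learner, demo_move)
        target = r + gamma * max(Q.get((s2, b), 0.0) for b in ACTIONS)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        if r > 0:
            return

for _ in range(5000):
    run_episode()

# The trained greedy policy reproduces the demonstrator's direction of
# travel, although no imitation objective was ever specified.
for move in ACTIONS:
    print(move, max(ACTIONS, key=lambda b: Q.get(((N // 2, move), b), 0.0)))
```

Removing the reward signal leaves the learner with nothing to optimize, which is exactly the dependence on a concrete goal noted above.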
The second critical question concerns whether generic learning mechanisms without any imitation-related inductive bias are sufficient to establish cooperation based on prosocial norms. Indeed, mirroring evolutionary models where individual fitness maximization cannot on its own explain the evolution of altruism, self-interested reinforcement learning agents do not cooperate in social dilemmas (Leibo, Zambaldi, Lanctot, Marecki, & Graepel, 2017; Perolat et al., 2017). Groups with prosocial norms, however, can resolve the social dilemmas they face. For instance, some norms, such as those regulating interpersonal conflict, elicit third-party enforcement patterns that, once established, discourage antisocial behavior.
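The failure of self-interested learning here is not an artifact of any particular algorithm; it follows from the payoff structure itself, as a minimal prisoner's dilemma calculation (with standard textbook payoffs) shows:

```python
# Standard prisoner's dilemma payoffs: (my action, their action) -> my reward.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 4, ("D", "D"): 1,
}

def best_reply(their_action):
    """The action a purely self-interested learner converges toward."""
    return max("CD", key=lambda mine: PAYOFF[(mine, their_action)])

# Defection is the best reply to either choice, so independent reward
# maximization settles on mutual defection even though (C, C) pays more.
assert best_reply("C") == "D" and best_reply("D") == "D"
```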
In keeping with the target article, we regard ritual-stance imitation as norm learning. Normative behavior has two main components: (1) enforcement and (2) compliance (Heyes, 2022). Given a stable background pattern of enforcement, such as that provided by adults in a society, it is easy for agents to learn compliance by trial and error because deviations from the norm are punished (Köster et al., 2022). One study endowed agents in a social dilemma setting with an intrinsic motivation to punish the same behavior as others in the group and found that prosocial norms supporting cooperation sometimes emerged (Vinitsky et al., 2021). However, it is likely that an even simpler model would work, in which the shaped behavior itself involves punishing transgressions when they occur (a meta-norm). We therefore conjecture that the same generic and non-deliberative learning mechanisms that can learn to imitate instrumentally could also learn to imitate for affiliation, if applied in the right environment. Notice that this also implies that copying fidelity should be higher for norm learning, because it is based on punishment for deviation.
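The compliance half of this story is simple enough to exhibit directly. In the following toy model, again our own construction rather than any cited system, several behaviors pay identically (the norm is arbitrary, as in ritual), but a stable background of enforcers sanctions every deviation; a generic value learner converges on exact compliance.

```python
import random

BEHAVIORS = range(5)   # five intrinsically equivalent ways to behave
NORM = 2               # the arbitrary behavior the community enforces
values = [0.0] * 5     # the learner's estimated value of each behavior

for _ in range(2000):
    if random.random() < 0.1:                      # occasional exploration
        b = random.choice(BEHAVIORS)
    else:                                          # otherwise act greedily
        b = max(BEHAVIORS, key=lambda i: values[i])
    intrinsic = 1.0                                # every behavior pays alike
    sanction = 0.0 if b == NORM else -2.0          # enforcers punish deviation
    values[b] += 0.1 * ((intrinsic + sanction) - values[b])

# Compliance is learned exactly: the norm is the unique unpunished behavior.
assert max(BEHAVIORS, key=lambda i: values[i]) == NORM
```

Because any deviation, however slight, is sanctioned, the learned behavior matches the norm exactly, which is the sense in which punishment-based transmission predicts higher copying fidelity.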
Of course, none of these considerations means that deliberation is never important for high-fidelity copying. All such computational modeling can ever show is that some mechanism is sufficient to generate some outcome. However, insofar as the argument for deliberateness rests on the apparent implausibility of non-deliberative mechanisms, these results should be seen as weakening the case for deliberation's importance. Indeed, there are numerous theories that treat deliberation largely as post hoc rationalization of decisions made by other means (e.g., Haidt, 2001; Mercier & Sperber, 2017). The sufficiency of non-deliberative mechanisms for transmitting complex skills and norms lends support to these theories.
Financial support
All authors are employees of DeepMind.
Conflict of interest
None.