
What is the simplest model that can account for high-fidelity imitation?

Published online by Cambridge University Press:  10 November 2022

Joel Z. Leibo, Raphael Köster, Alexander Sasha Vezhnevets, Edgar A. Duénez-Guzmán, John P. Agapiou, and Peter Sunehag

Affiliation: DeepMind, London EC4A 3TW, UK
jzl@deepmind.com, rkoster@deepmind.com, vezhnick@deepmind.com, duenez@deepmind.com, jagapiou@deepmind.com, sunehag@deepmind.com
www.jzleibo.com

Abstract

What inductive biases must be incorporated into multi-agent artificial intelligence models to get them to capture high-fidelity imitation? We think very little is needed. In the right environments, both instrumental- and ritual-stance imitation can emerge from generic learning mechanisms operating on non-deliberative decision architectures. In this view, imitation emerges from trial-and-error learning and does not require explicit deliberation.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

The target article raises many important questions about the nature of human high-fidelity imitation. We zoom in on two of them: (1) whether high-fidelity imitation is underpinned by generic learning processes such as associative or reinforcement learning; and (2) whether imitation depends on explicit deliberation or not. Here we sketch an account where a non-deliberative system is trained by generic learning mechanisms.

We adopt an approach based on building connectionist agent models. This approach complements that of the target article, which mainly reviews empirical studies. In any such constructive approach it is necessary to select a computationally significant problem to solve. Here we consider two: (1) the transmission of complex skills and (2) the establishment of cooperation based on prosocial norms.

The specific models we consider are based on multi-agent learning algorithms developed by the artificial-intelligence research community. In these models, multiple agents co-inhabit a virtual environment resembling a computer game. The researchers can specify the physical dynamics of the simulated world, for example, that irrigation is needed to grow crops. The agents process raw visual sensory information formatted as images. Using a deep neural network, they transform the raw sensory input at each moment in time through a series of intermediate representations into a low-level motor action (e.g., move forward, turn left, etc.). Behavior is temporally coherent because the network retains information about the past via an internal state representation. Various learning mechanisms can be used to train the neural network. Reinforcement learning is the most prominent, but other approaches are also possible, and it is common to use more than one simultaneously in the same agent (e.g., Jaderberg et al., 2016).
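To make this pipeline concrete, the following is a minimal Python/numpy sketch (our own illustration, not the architecture of any cited system; the class name, dimensions, and action labels are assumptions chosen for brevity) of an agent that maps a raw image frame, through an internal state summarizing the past, to a discrete motor action:

import numpy as np

class RecurrentAgent:
    # Toy recurrent policy: raw image -> internal state -> discrete motor action.
    def __init__(self, obs_dim, state_dim, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W_in = self.rng.normal(0.0, 0.1, (state_dim, obs_dim))    # encodes the current observation
        self.W_rec = self.rng.normal(0.0, 0.1, (state_dim, state_dim)) # carries information about the past
        self.W_out = self.rng.normal(0.0, 0.1, (n_actions, state_dim)) # maps state to action preferences
        self.state = np.zeros(state_dim)

    def step(self, image):
        obs = image.astype(np.float64).ravel() / 255.0                 # raw pixels as a flat vector
        self.state = np.tanh(self.W_in @ obs + self.W_rec @ self.state)
        logits = self.W_out @ self.state
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(self.rng.choice(len(probs), p=probs))               # e.g., 0 = move forward, 1 = turn left

# One decision step on a stand-in visual frame.
agent = RecurrentAgent(obs_dim=8 * 8 * 3, state_dim=32, n_actions=4)
frame = np.zeros((8, 8, 3), dtype=np.uint8)
action = agent.step(frame)

In a full agent the weights would be deep and trained (e.g., by reinforcement learning) rather than fixed random matrices; the sketch only shows the observation-to-action flow and the role of the internal state.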

In model-free approaches to reinforcement learning, the algorithms do not infer internal models of their environment and do not include explicit planning. As such, they are often considered models of habit learning and automatization (Dolan & Dayan, 2013). They are associated with implicit and procedural forms of memory, as opposed to declarative or episodic memory. Model-free agents contrast with model-based agents such as that of Schrittwieser et al. (2020), which do employ explicit planning.
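The distinction can be illustrated with a deliberately tiny example (a sketch under assumed toy dynamics, not the algorithms of the cited papers): a model-free learner adjusts action values directly from sampled experience, whereas a model-based learner computes values by rolling an explicit transition model forward.

import numpy as np

n_states, n_actions, gamma, alpha = 2, 2, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def model_free_update(s, a, r, s_next):
    # Q-learning: nudge the value of (s, a) toward the experienced return; no world model is consulted.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def model_based_value(s, transition, reward, depth=3):
    # Explicit planning: look ahead through a known transition model instead of learning from samples.
    if depth == 0:
        return 0.0
    return max(reward[s, a] + gamma * model_based_value(transition[s, a], transition, reward, depth - 1)
               for a in range(n_actions))

# Toy deterministic world: in state 0, action 1 pays off and leads to state 1.
transition = np.array([[0, 1], [1, 1]])
reward = np.array([[0.0, 1.0], [0.0, 0.0]])
model_free_update(0, 1, 1.0, 1)                         # learn from one sampled experience
print(Q[0], model_based_value(0, transition, reward))   # sample-based estimate vs. planned value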

The first critical question is whether or not mostly unstructured tabula-rasa agents can transmit causally transparent skills from one to another. Several recent results suggest this is possible. It is now clear that in the right environment, imitation itself, both the tendency to engage in it and the skill at doing so, can emerge spontaneously from processes resembling trial-and-error learning in the presence of a demonstrator agent. These results were obtained using generic learning methods with no built-in inductive bias for imitation. Borsa, Piot, Munos, and Pietquin (2019), Ha and Jeong (2022), Woodward, Finn, and Hausman (2020), and Bhoopchand et al. (2022) all used model-free reinforcement learning. Ndousse, Eck, Levine, and Jaques (2021) combined reinforcement learning with inference of an internal model but did not use it for explicit planning. Thus all five results showcase emergent imitation based on non-deliberative processes. The learning mechanisms employed in these models are consistent with those posited by the associative sequence learning account of imitation (Catmur, Walsh, & Heyes, 2009). However, unlike associative sequence learning, where imitation may be either goal directed or not (Heyes, 2016), these models depend critically on the presence of a goal, which must be used to define the reward signal. Thus, in the language of the target article, such results show that generic learning mechanisms may underpin instrumental-stance imitation but not ritual-stance imitation, where there may be no concrete goal.
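The gist of these emergence results can be conveyed with a deliberately small simulation (our own illustrative sketch with made-up numbers, not a reproduction of any cited experiment). The learner is rewarded only for reaching a hidden goal, yet because a knowledgeable demonstrator is visible, the tendency to follow the demonstrator is what generic reinforcement ends up strengthening; remove the goal-defined reward and nothing drives the copying.

import numpy as np

rng = np.random.default_rng(0)
n_options, episodes, alpha = 4, 2000, 0.05
pref = np.zeros(2)  # preferences over two dispositions: [act independently, follow the demonstrator]

for _ in range(episodes):
    goal = rng.integers(n_options)              # hidden rewarding option, resampled each episode
    demo_choice = goal                          # the demonstrator knows where the goal is
    p_follow = 1.0 / (1.0 + np.exp(pref[0] - pref[1]))
    follow = rng.random() < p_follow
    choice = demo_choice if follow else rng.integers(n_options)
    reward = float(choice == goal)              # reward comes only from reaching the goal
    pref[int(follow)] += alpha * reward         # ordinary reinforcement; no imitation-specific bias

print(1.0 / (1.0 + np.exp(pref[0] - pref[1])))  # approaches 1: following is reinforced because it pays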

The second critical question concerns whether generic learning mechanisms without any imitation-related inductive bias are sufficient to establish cooperation based on prosocial norms. Indeed, mirroring evolutionary models where individual fitness maximization cannot on its own explain the evolution of altruism, self-interested reinforcement learning agents do not cooperate in social dilemmas (Leibo, Zambaldi, Lanctot, Marecki, & Graepel, 2017; Perolat et al., 2017). Groups with prosocial norms, however, can resolve the social dilemmas they face. For instance, some norms – such as those regulating interpersonal conflict – elicit third-party enforcement patterns that, once established, have the effect of discouraging antisocial behavior.
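The underlying tension is visible even in a one-shot prisoner's dilemma (illustrative payoffs of our choosing, not the sequential social dilemmas of the cited papers): whatever the other agent does, defection is the self-interested best response, yet mutual defection leaves both agents worse off than mutual cooperation.

import numpy as np

# Row player's payoffs; rows and columns are (0 = cooperate, 1 = defect).
payoff = np.array([[3.0, 0.0],
                   [4.0, 1.0]])

for opponent_action in (0, 1):
    best_response = int(payoff[:, opponent_action].argmax())
    print(opponent_action, best_response)  # prints (0, 1) and (1, 1): defecting is always the best response
# Mutual defection yields 1 each, while mutual cooperation would have yielded 3 each.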

In keeping with the target article, we regard ritual-stance imitation as norm learning. Normative behavior has two main components: (1) enforcement and (2) compliance (Heyes, 2022). Given a stable background pattern of enforcement, such as that provided by adults in a society, it is easy for agents to learn compliance by trial and error, because deviations from the norm are punished (Köster et al., 2022). One study endowed agents in a social dilemma setting with an intrinsic motivation to punish the same behavior as others in the group and found that prosocial norms supporting cooperation sometimes emerged (Vinitsky et al., 2021). However, it is likely that an even simpler model would work, one in which the shaped behavior itself involves punishing transgressions when they occur (a meta-norm). We therefore conjecture that the same generic and non-deliberative learning mechanisms that can learn to imitate instrumentally could also learn to imitate for affiliation, if applied in the right environment. Notice that this also implies that copying fidelity should be higher for norm learning, because it is based on punishment for deviation.
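How compliance can follow from trial and error once enforcement is stable can be sketched as follows (our own illustrative numbers and reward structure, not the environment of Köster et al., 2022): a sanction applied by the surrounding group makes the otherwise tempting transgression the lower-valued action, so an ordinary value learner comes to comply.

import numpy as np

rng = np.random.default_rng(1)
alpha, epsilon, episodes = 0.1, 0.1, 5000
value = np.zeros(2)  # learned value of each action: 0 = comply with the norm, 1 = transgress

def reward(action):
    base = 1.0 if action == 1 else 0.8        # transgressing is slightly more profitable in itself...
    sanction = -2.0 if action == 1 else 0.0   # ...but a stable pattern of enforcement punishes it
    return base + sanction

for _ in range(episodes):
    # epsilon-greedy trial and error over the two actions
    action = rng.integers(2) if rng.random() < epsilon else int(value.argmax())
    value[action] += alpha * (reward(action) - value[action])

print(value)  # compliance ends up valued higher purely through generic reinforcement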

Of course, none of these considerations means that deliberation is never important for high-fidelity copying. All such computational modeling can ever show is that some mechanism is sufficient to generate some outcome. However, insofar as the argument for deliberation rests on the apparent implausibility of non-deliberative mechanisms, these results should be seen as weakening the case for deliberation's importance. Indeed there are numerous theories that treat deliberation largely as post-hoc rationalization of decisions made by other means (e.g., Haidt, 2001; Mercier & Sperber, 2017). The sufficiency of non-deliberative mechanisms for transmitting complex skills and norms lends support to these theories.

Financial support

All authors are employees of DeepMind.

Conflict of interest

None.

References

Bhoopchand, A., Brownfield, B., Collister, A., Lago, A. D., Edwards, A., Everett, R., … Zhang, L. M. (2022). Learning robust real-time cultural transmission without human data. arXiv preprint arXiv:2203.00715.
Borsa, D., Piot, B., Munos, R., & Pietquin, O. (2019). Observational learning by reinforcement learning. Proceedings of the 18th International Conference on Autonomous Agents and Multi-Agent Systems (pp. 1117–1124).
Catmur, C., Walsh, V., & Heyes, C. (2009). Associative sequence learning: The role of experience in the development of imitation and the mirror system. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1528), 2369–2380.
Dolan, R. J., & Dayan, P. (2013). Goals and habits in the brain. Neuron, 80(2), 312–325.
Ha, S., & Jeong, H. (2022). Social learning spontaneously emerges by searching optimal heuristics with deep reinforcement learning. arXiv preprint arXiv:2204.12371.
Haidt, J. (2001). The emotional dog and its rational tail: A social intuitionist approach to moral judgment. Psychological Review, 108(4), 814–834.
Heyes, C. (2016). Homo imitans? Seven reasons why imitation couldn't possibly be associative. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1686), 20150069.
Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., & Kavukcuoglu, K. (2016). Reinforcement learning with unsupervised auxiliary tasks. International Conference on Learning Representations (ICLR).
Köster, R., Hadfield-Menell, D., Everett, R., Weidinger, L., Hadfield, G. K., & Leibo, J. Z. (2022). Spurious normativity enhances learning of compliance and enforcement behavior in artificial agents. Proceedings of the National Academy of Sciences, 119(3).
Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., & Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. Proceedings of the 16th Conference on Autonomous Agents and Multi-Agent Systems (pp. 464–473).
Mercier, H., & Sperber, D. (2017). The enigma of reason. Harvard University Press.
Ndousse, K. K., Eck, D., Levine, S., & Jaques, N. (2021). Emergent social learning via multi-agent reinforcement learning. International Conference on Machine Learning (pp. 7991–8004). PMLR.
Perolat, J., Leibo, J. Z., Zambaldi, V., Beattie, C., Tuyls, K., & Graepel, T. (2017). A multi-agent reinforcement learning model of common-pool resource appropriation. Advances in Neural Information Processing Systems, 30.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., … Silver, D. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839), 604–609.
Vinitsky, E., Köster, R., Agapiou, J. P., Duéñez-Guzmán, E., Vezhnevets, A. S., & Leibo, J. Z. (2021). A learning agent that acquires social norms from public sanctions in decentralized multi-agent settings. arXiv preprint arXiv:2106.09012.
Woodward, M., Finn, C., & Hausman, K. (2020). Learning to interactively learn and assist. Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 03, pp. 2535–2543).