
Crossmodal lifelong learning in hybrid neural embodied architectures

Published online by Cambridge University Press: 10 November 2017

Stefan Wermter
Affiliation:
Knowledge Technology Group, Department of Informatics, Universität Hamburg, Hamburg, Germany. wermter@informatik.uni-hamburg.de, https://www.informatik.uni-hamburg.de/~wermter/
Sascha Griffiths
Affiliation:
Knowledge Technology Group, Department of Informatics, Universität Hamburg, Hamburg, Germany. griffiths@informatik.uni-hamburg.de, https://www.informatik.uni-hamburg.de/~griffiths/
Stefan Heinrich
Affiliation:
Knowledge Technology Group, Department of Informatics, Universität Hamburg, Hamburg, Germany. heinrich@informatik.uni-hamburg.de, https://www.informatik.uni-hamburg.de/~heinrich/

Abstract

Lake et al. point out that grounding learning in general principles of embodied perception and social cognition is the next step in advancing artificially intelligent machines. We suggest it is necessary to go further and consider lifelong learning, which includes developmental learning, focused on embodiment as applied in developmental robotics and neurorobotics, and crossmodal learning that facilitates integrating multiple senses.

Type
Open Peer Commentary
Copyright
Copyright © Cambridge University Press 2017 

Artificial intelligence has recently been successful in a number of domains, such as playing chess or Go, recognising handwritten characters, or describing visual scenes in natural language. Lake et al. discuss these kinds of breakthroughs as a big step for artificial intelligence, but raise the question of how we can build machines that learn like people. We can find an indication in a survey of mind perception (Gray et al. 2007), which is the "amount of mind" people are willing to attribute to others. Participants judged machines to be high on agency but low on experience. We attribute this to the fact that computers are trained on individual tasks, often involving a single modality such as vision or speech, or a single context such as classifying traffic signs, as opposed to interpreting spoken and gestured utterances. In contrast, for people, the "world" essentially appears as a multimodal stream of stimuli, which unfolds over time. Therefore, we suggest that the next paradigm shift in intelligent machines will have to include processing the "world" through lifelong and crossmodal learning. This is important because people develop problem-solving capabilities, including language processing, over their life span and via interaction with the environment and other people (Elman 1993; Christiansen and Chater 2016). In addition, this learning is embodied, as developing infants have a body-rational view of the world, but also seem to apply general problem-solving strategies to a wide range of quite different tasks (Cangelosi and Schlesinger 2015).

Hence, we argue that the proposed principles or "start-up software" are coupled tightly with general learning mechanisms in the brain, and that these conditions inherently enable the development of distributed representations of knowledge. For example, in our research we found that architectural mechanisms, such as different timings in cortical information processing, foster compositionality, which in turn enables both the development of more complex body actions and the development of language competence from primitives (Heinrich 2016). These kinds of distributed representations are consistent with cognitive science findings on embodied cognition. Lakoff and Johnson (2003), for example, argue that people describe personal relationships in terms of the physical sensation of temperature. The transfer from one domain to the other is plausible, as an embrace or handshake between friends or family members, for example, will cause a warm sensation for the participants. Such temperature-exchanging actions are taken to be signs of people's positive feelings towards each other (Hall 1966). The connection between temperature sensation and social relatedness is argued to reflect neural "bindings" (Gallese and Lakoff 2005). The domain knowledge that is used later in life can be derived from the primitives encountered early in childhood, for example in interactions between infants and parents; this is referred to as intermodal synchrony (Rohlfing and Nomikou 2014). As a further example, our own research shows that learning based on crossmodal integration, such as the integration of real sensory perception at low and intermediate levels (as suggested for the superior colliculus in the brain), can enable both super-additivity and dominance of certain modalities depending on the task (Bauer et al. 2015).
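To make the "different timings" mechanism concrete, the following is a minimal NumPy sketch of leaky-integrator recurrent units with layer-specific time constants, in the spirit of multiple-timescale recurrent networks. The layer sizes, time constants, and random weights are illustrative assumptions, not the architecture from Heinrich (2016).

```python
import numpy as np

# Sketch: two recurrent layers with different time constants (tau).
# A small tau gives fast dynamics; a large tau gives slow dynamics.
rng = np.random.default_rng(0)

FAST_TAU, SLOW_TAU = 2.0, 16.0
N_FAST, N_SLOW, N_IN = 20, 8, 4

W_in = rng.normal(0, 0.1, (N_FAST, N_IN))             # input -> fast layer
W_ff = rng.normal(0, 0.1, (N_FAST, N_FAST + N_SLOW))  # recurrent into fast layer
W_fb = rng.normal(0, 0.1, (N_SLOW, N_FAST + N_SLOW))  # recurrent into slow layer

def step(u_fast, u_slow, x):
    """One leaky-integrator update: each layer mixes its previous internal
    state with new input at a rate set by its own time constant."""
    h = np.tanh(np.concatenate([u_fast, u_slow]))  # shared activations
    u_fast = (1 - 1/FAST_TAU) * u_fast + (1/FAST_TAU) * (W_ff @ h + W_in @ x)
    u_slow = (1 - 1/SLOW_TAU) * u_slow + (1/SLOW_TAU) * (W_fb @ h)
    return u_fast, u_slow

# Drive the network with a short random input sequence.
u_fast, u_slow = np.zeros(N_FAST), np.zeros(N_SLOW)
for t in range(50):
    u_fast, u_slow = step(u_fast, u_slow, rng.normal(size=N_IN))
print("fast-layer state norm:", np.linalg.norm(u_fast))
print("slow-layer state norm:", np.linalg.norm(u_slow))
```

Run forward, the fast layer tracks moment-to-moment input while the slow layer changes only gradually; it is this separation of timescales that allows slow units to encode sequence-level, compositional structure over the primitives represented in the fast units.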

In developing machines, approaches such as transfer learning and zero-shot learning are receiving increasing attention, but are often restricted to transfers from domain to domain or from modality to modality. In the domain case, this can take the form of a horizontal transfer, in which a concept in one domain is learned and then transferred to another domain within the same modality. For example, it is possible to learn about affect in speech and to transfer that model of affect to music (Coutinho et al. 2014). In the modality case, one can vertically transfer concepts from one modality to another. This could be a learning process in which language knowledge is transferred to the visual domain (Laptev et al. 2008; Donahue et al. 2015). However, given the crossmodal integration observed in people, we must look into combinations of both, such that transferring between domains is not merely switching between two modalities, but integrating into both. Therefore, machines must exploit the representations that form when integrating multiple modalities, which are richer than the sum of the parts. Recent initial examples include (1) understanding continuous counting expressed in spoken numbers from learned spatial differences in gestural motor grounding (Ruciński 2014) and (2) classifying affective states from audiovisual emotion expressions in music, speech, facial expressions, and motion (Barros and Wermter 2016).
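As a sketch of what such integration, rather than a mere modality switch, could look like computationally, the toy example below projects two modalities into one shared representation; anything learned on that shared space operates on an integrated code and still works when only one modality is available. The encoders, dimensions, and averaging fusion are illustrative assumptions, not a specific published model.

```python
import numpy as np

# Sketch: two modality encoders projecting into one shared space.
rng = np.random.default_rng(1)

D_AUDIO, D_VISION, D_SHARED = 12, 16, 8
W_a = rng.normal(0, 0.1, (D_SHARED, D_AUDIO))   # audio encoder (assumed linear)
W_v = rng.normal(0, 0.1, (D_SHARED, D_VISION))  # vision encoder (assumed linear)

def encode(audio=None, vision=None):
    """Project whichever modalities are present into the shared space and
    combine them there. A classifier trained on this space sees an
    integrated representation; vertical transfer comes for free, since
    the same downstream head works when only one modality is present."""
    parts = []
    if audio is not None:
        parts.append(np.tanh(W_a @ audio))
    if vision is not None:
        parts.append(np.tanh(W_v @ vision))
    return np.mean(parts, axis=0)

audio = rng.normal(size=D_AUDIO)
vision = rng.normal(size=D_VISION)
print("audio only :", encode(audio=audio)[:3])
print("vision only:", encode(vision=vision)[:3])
print("fused      :", encode(audio=audio, vision=vision)[:3])
```

In a trained system the fused code can carry correlations between the modalities that neither unimodal code contains, which is one simple way to read the claim that integrated representations are richer than the sum of their parts.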

Freeing learning from individual modalities and domains in favour of distributed representations, and reusing learned representations in subsequent learning tasks, will enable a broader view of learning to learn. Underlying hybrid neural embodied architectures (Wermter et al. 2005) will support such horizontal and vertical transfer and integration. This is the "true experience" machines need in order to learn and think like people. All in all, Lake et al. stress the important point of grounding learning in general principles of embodied perception and social cognition. Yet we suggest it is still necessary to go a step further and consider lifelong learning, which includes developmental learning, focused on embodiment as applied in developmental robotics and neurorobotics, and crossmodal learning, which facilitates the integration of multiple senses.

References

Barros, P. & Wermter, S. (2016) Developing crossmodal expression recognition based on a deep neural model. Adaptive Behavior 24(5):373–96.
Bauer, J., Dávila-Chacón, J. & Wermter, S. (2015) Modeling development of natural multi-sensory integration using neural self-organisation and probabilistic population codes. Connection Science 27(4):358–76.
Cangelosi, A. & Schlesinger, M. (2015) Developmental robotics: From babies to robots. MIT Press.
Christiansen, M. H. & Chater, N. (2016) Creating language: Integrating evolution, acquisition, and processing. MIT Press.
Coutinho, E., Deng, J. & Schuller, B. (2014) Transfer learning emotion manifestation across music and speech. In: Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, pp. 3592–98. IEEE.
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K. & Darrell, T. (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, June 7–12, 2015, pp. 2625–34. IEEE.
Elman, J. L. (1993) Learning and development in neural networks: The importance of starting small. Cognition 48(1):71–99.
Gallese, V. & Lakoff, G. (2005) The brain's concepts: The role of the sensory-motor system in conceptual knowledge. Cognitive Neuropsychology 22(3–4):455–79.
Gray, H. M., Gray, K. & Wegner, D. M. (2007) Dimensions of mind perception. Science 315(5812):619.
Hall, E. T. (1966) The hidden dimension. Doubleday.
Heinrich, S. (2016) Natural language acquisition in recurrent neural architectures. Ph.D. thesis, Universität Hamburg, Germany.
Lakoff, G. & Johnson, M. (2003) Metaphors we live by, 2nd ed. University of Chicago Press.
Laptev, I., Marszalek, M., Schmid, C. & Rozenfeld, B. (2008) Learning realistic human actions from movies. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, AK, June 23–28, 2008, pp. 1–8. IEEE.
Rohlfing, K. J. & Nomikou, I. (2014) Intermodal synchrony as a form of maternal responsiveness: Association with language development. Language, Interaction and Acquisition 5(1):117–36.
Ruciński, M. (2014) Modelling learning to count in humanoid robots. Ph.D. thesis, University of Plymouth, UK.
Wermter, S., Palm, G., Weber, C. & Elshaw, M. (2005) Towards biomimetic neural learning for intelligent robots. In: Biomimetic neural learning for intelligent robots, ed. Wermter, S., Palm, G. & Elshaw, M., pp. 1–18. Springer.