The sensorimotor and social sides of the architecture of speech

Giovanni Pezzulo; Laura Barca; Alessandro D'Ausilio

doi:10.1017/S0140525X13004172

Published online by Cambridge University Press: 17 December 2014

Giovanni Pezzulo: Institute of Cognitive Sciences and Technologies, National Research Council, 00185 Rome, Italy. giovanni.pezzulo@istc.cnr.it https://sites.google.com/site/giovannipezzulo/
Laura Barca: Institute of Cognitive Sciences and Technologies, National Research Council, 00185 Rome, Italy. laura.barca@istc.cnr.it https://sites.google.com/site/laurabarcahomepage/
Alessandro D'Ausilio: Robotics, Brain and Cognitive Sciences Department, Italian Institute of Technology, 16163 Genova, Italy. alessandro.dausilio@iit.it http://www.iit.it/people/robotics-brain-and-cognitive-sciences-mirror-neurons-and-interaction-lab/researcher/alessandro-dausilio.html

Abstract

Speech is a complex skill to master. In addition to sophisticated phono-articulatory abilities, speech acquisition requires neuronal systems configured for vocal learning, with adaptable sensorimotor maps that couple heard speech sounds with motor programs for speech production; imitation and self-imitation mechanisms that can train the sensorimotor maps to reproduce heard speech sounds; and a “pedagogical” learning environment that supports tutor learning.

Type: Open Peer Commentary
Information: Behavioral and Brain Sciences, Volume 37, Issue 6, December 2014, pp. 569–570

DOI: https://doi.org/10.1017/S0140525X13004172
Copyright: Copyright © Cambridge University Press 2014

Besides sophisticated phono-articulatory abilities, the architecture of speech has key computational, neuronal, and social prerequisites that can shed light on its phylogenetic and ontogenetic origins.

As a first important requirement, the architecture of speech has to be configured for vocal learning, with adaptable sensorimotor circuits that couple heard speech sounds with motor programs for speech production. From a computational perspective, mastering speech in naturalistic environments plagued by uncertainty and noise is hard; this fact has long motivated control-theoretic views of speech emphasizing error-correction mechanisms and internal modeling (Guenther & Perkell 2004; Moore 2007).

Computational considerations also suggest that speech processing (and learning, see below) might benefit from a close interaction of perception and production systems. For example, production systems might support perceptual processes by predicting and “synthesizing” auditory candidates (as in analysis by synthesis), while perceptual systems might support the self-monitoring and error-correction of vocal production by affording an advance auditory analysis of the produced speech sounds. Neurobiological experiments support this idea by showing that the neuronal mechanisms for speech production and perception are not segregated in the brain; for example, specific motor circuits are recruited for the analysis of speech sound features (D'Ausilio et al. 2012). An organic proposal on the architecture of speech can be formulated within the framework of generative systems, in which perception and action systems share computational (and neuronal) resources and are both guided by a common prediction-error minimization process (Dindo et al. 2011; Friston 2010; Kiebel et al. 2008; Pezzulo 2012a; 2013; Yildiz et al. 2013).
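The analysis-by-synthesis idea mentioned above can be made concrete with a deliberately minimal sketch. It is not a model from any of the cited works: the function names, the three candidate vowels, and the formant values are illustrative assumptions chosen only for this example. A hypothetical production model "synthesizes" the expected acoustic features of each candidate, and perception selects the candidate whose prediction yields the smallest error with respect to the heard sound.

```python
# Toy analysis-by-synthesis: perception reuses a (hypothetical) production model.
# All names and numbers below are illustrative assumptions, not measured data.

def synthesize(phoneme):
    """Hypothetical generative model: map a vowel label to predicted
    acoustic features (first and second formant, in Hz)."""
    prototypes = {
        "a": (730.0, 1090.0),
        "i": (270.0, 2290.0),
        "u": (300.0, 870.0),
    }
    return prototypes[phoneme]


def prediction_error(heard, predicted):
    """Squared distance between the heard features and a synthesized candidate."""
    return sum((h - p) ** 2 for h, p in zip(heard, predicted))


def perceive(heard, candidates=("a", "i", "u")):
    """Return the candidate whose synthesized prediction best explains the input."""
    return min(candidates, key=lambda ph: prediction_error(heard, synthesize(ph)))


if __name__ == "__main__":
    # A noisy token near the assumed "i" prototype.
    print(perceive((290.0, 2200.0)))
```

The point of the design is only that recognition is carried out by comparing the input against the outputs of a production-side model, so that the two systems share resources; a realistic account would replace the lookup table with a learned, probabilistic generative model.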

A second important requirement is a learning method powerful enough to train the aforementioned sensorimotor architecture to perceive and (re)produce sounds and speech. This problem has been studied particularly in songbirds that, while not speaking, have sophisticated vocal learning abilities. Most theories assume that songbird learning is a staged process (Brainard & Doupe 2002). An initial period of auditory learning is needed to tune sensory maps to represent sensory “prototypes” of heard speech sounds (e.g., to memorize song patterns heard from conspecifics). These prototypes are then used as “reference signals” for imitation learning; by learning to reproduce the stored template, an animal can acquire equivalent vocal sound production skills. In control-theoretic terms, this process uses (auditory and articulatory) feedback error-correction mechanisms to produce a sound (song or speech) that closely matches the stored template (Guenther & Perkell 2004). During the learning process, internal (inverse and forward) models are trained as well, and these subsequently afford skilled song or speech processing.
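The two stages described above can be sketched in a toy simulation. This is a minimal illustration under strong assumptions, not the model of Guenther & Perkell or of any other cited work: the "vocal tract" is a made-up linear mapping from two motor parameters to two acoustic features, and all function names and constants are hypothetical. Stage 1 averages noisy tutor exemplars into a stored template; stage 2 repeatedly produces a sound, compares it with the template, and corrects the motor command using the auditory error. The small motor probes stand in, very loosely, for a locally estimated forward model.

```python
import random


def vocal_tract(motor):
    """Hypothetical plant: two motor parameters -> two acoustic features."""
    a, b = motor
    return (2.0 * a + 0.5 * b, -0.3 * a + 1.5 * b)


def learn_template(tutor_sounds):
    """Stage 1 (auditory learning): average heard exemplars into a prototype."""
    n = len(tutor_sounds)
    return tuple(sum(s[i] for s in tutor_sounds) / n for i in range(2))


def imitate(template, steps=200, rate=0.1, probe=1e-3):
    """Stage 2 (imitation): feedback error-correction toward the template."""
    motor = [0.0, 0.0]
    for _ in range(steps):
        sound = vocal_tract(motor)
        error = [template[i] - sound[i] for i in range(2)]
        grad = []
        for j in range(2):
            # Probe one motor parameter to estimate its effect on the sound.
            probed = list(motor)
            probed[j] += probe
            probed_sound = vocal_tract(probed)
            sens = [(probed_sound[i] - sound[i]) / probe for i in range(2)]
            grad.append(sum(error[i] * sens[i] for i in range(2)))
        # Move the motor command in the direction that reduces auditory error.
        motor = [motor[j] + rate * grad[j] for j in range(2)]
    return motor


if __name__ == "__main__":
    random.seed(0)
    # Noisy tutor exemplars around an arbitrary target sound (assumed values).
    tutor = [(1.0 + random.gauss(0, 0.05), -0.5 + random.gauss(0, 0.05))
             for _ in range(50)]
    template = learn_template(tutor)
    motor = imitate(template)
    print(template, vocal_tract(motor))
```

In this sketch the learner never has direct access to the tutor's motor commands; it matches only the stored auditory template, which is the sense in which the template acts as a reference signal.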

To speed up learning, learners can also benefit from self-imitation. Covert rather than overt singing (or speaking) might reproduce frequently heard speech sounds in the same way they are encoded in the learners' sensory maps (note that generative architectures afford this form of learning quite naturally; Hinton 2007). Using both overt and covert processes, animals (including humans) might reproduce their stored prototypes with high fidelity, including the local accents of their communities.

The brain architecture supporting the aforementioned learning processes is incompletely known. Indeed, speech is a computationally challenging skill, as it requires sensorimotor circuits to be sensitive enough to discriminate subtle changes in speech sounds and accurate enough to afford extremely precise control (e.g., of the timing of speech). The brain could finesse these problems by recruiting cortico-subcortical loops (particularly those involving the basal ganglia and the cerebellum), especially during learning. The role of these loops is seldom recognized in “cortico-centric” theories of motor skills (including speech), but the evidence indicates that they could play an important role in skill learning and mastery (Ackermann 2008; Caligiore et al. 2013). For example, vocal learning in the swamp sparrow might involve a loop between forebrain neurons that establish auditory-vocal correspondences and striatal structures important for song learning (Prather et al. 2008).

The high-fidelity reproduction of sounds could be key to cultural transmission and the evolutionary value of singing in songbirds (Merker 2012). However, human communities have richer social structures than other animals, which might have favored an open-ended instrumental use of vocal production besides ritualized display. The importance of this skill might have led to a greater investment of parental time in teaching and, we propose, to advanced forms of “tutor learning” (Canevari et al. 2013). Of note, a so-called pedagogical learning environment (Csibra & Gergely 2011) might have afforded specialized teaching strategies that could be uniquely human and that greatly improve on imitation and self-teaching learning methods. One example is “motherese”: Mothers modify their speech when speaking to young children in order to simplify their auditory processing and learning (see Pezzulo et al. 2013). This example suggests that social and interactive aspects of the learning environment are important prerequisites, or at least a useful scaffold, for speech acquisition and cultural transmission.

In sum, speech processing requires a sophisticated neuro-computational architecture in which physiologic, motoric, sensory, and social aspects mutually constrain each other and plausibly co-evolve. In addition to studying genetic determinants, it is important to recognize that speech could have found a suitable “neuronal niche” (Dehaene & Cohen 2007) in existing brain structures (cortical and subcortical) supporting skilled action. For example, speech could have re-used “generative” dynamics of such structures for imitation and self-imitation, and redeployed existing computational resources for combinatorial processing (Chersi et al. 2014; Fadiga et al. 2009).

In parallel, speech could have found a suitable “socio-cultural niche”: It could have been incubated within the sophisticated interactive and social dynamics of our species. The social context in which human speech is acquired is extremely rich, and human speech learning operates on top of sophisticated interactive, joint-action, mutual-emulation, and pedagogical abilities, most of which are unique to, or at least much more developed in, our species (Pickering & Garrod 2013; Sebanz et al. 2006). The demands of sophisticated social interactions might have contributed to transforming vocalization from an initially quite limited sensorimotor feat into a powerful, open-ended instrumental tool that permits conveying rich communicative intentions and forming extremely varied cultures (Pezzulo 2012b). In turn, we should not neglect how the intertwined sensorimotor and social sides of speech had a transformative impact on the destiny of our species.

References

Ackermann, H. (2008) Cerebellar contributions to speech production and speech perception: Psycholinguistic and neurobiological perspectives. Trends in Neurosciences 31(6):265–72. doi: 10.1016/j.tins.2008.02.011.

Brainard, M. S. & Doupe, A. J. (2002) What songbirds teach us about learning. Nature 417:351–58.

Caligiore, D., Pezzulo, G., Miall, R. C. & Baldassarre, G. (2013) The contribution of brain sub-cortical loops in the expression and acquisition of action understanding abilities. Neuroscience and Biobehavioral Reviews 37(10):2504–15.

Canevari, C., Badino, L., D'Ausilio, A., Fadiga, L. & Metta, G. (2013) Modeling speech imitation and ecological learning of auditory-motor maps. Frontiers in Psychology 4:364.

Chersi, F., Ferro, M., Pezzulo, G. & Pirrelli, V. (2014) Topological self-organization and prediction learning can support both action and lexical chains in the brain. Topics in Cognitive Science 6(3):476–91.

Csibra, G. & Gergely, G. (2011) Natural pedagogy as evolutionary adaptation. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences 366:1149–57.

D'Ausilio, A., Craighero, L. & Fadiga, L. (2012) The contribution of the frontal lobe to the perception of speech. Journal of Neurolinguistics 25:328–35.

Dehaene, S. & Cohen, L. (2007) Cultural recycling of cortical maps. Neuron 56(2):384–98. doi: 10.1016/j.neuron.2007.10.004.

Dindo, H., Zambuto, D. & Pezzulo, G. (2011) Motor simulation via coupled internal models using sequential Monte Carlo. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Catalonia, Spain, 16–22 July 2011, ed. Walsh, Toby, pp. 2113–19. AAAI Press/International Joint Conferences on Artificial Intelligence.

Fadiga, L., Craighero, L. & D'Ausilio, A. (2009) Broca's area in language, action, and music. Annals of the New York Academy of Sciences 1169(1):448–58. doi: 10.1111/j.1749-6632.2009.04582.x.

Friston, K. (2010) The free-energy principle: A unified brain theory? Nature Reviews Neuroscience 11:127–38.

Guenther, F. H. & Perkell, J. S. (2004) A neural model of speech production and its application to studies of the role of auditory feedback in speech. In: Speech motor control in normal and disordered speech, ed. Maassen, B., Kent, R., Peters, H., Lieshout, P. Van & Hulstijn, W., pp. 29–49. Oxford University Press.

Hinton, G. E. (2007) Learning multiple layers of representation. Trends in Cognitive Sciences 11:428–34.

Kiebel, S. J., Daunizeau, J. & Friston, K. J. (2008) A hierarchy of time-scales and the brain. PLOS Computational Biology 4:e1000209.

Merker, B. (2012) The vocal learning constellation: Imitation, ritual culture, encephalization. In: Music, language and human evolution, ed. Bannan, N., pp. 215–60. Oxford University Press.

Moore, R. K. (2007) PRESENCE: A human-inspired architecture for speech-based human-machine interaction. IEEE Transactions on Computers 56:1176–88.

Pezzulo, G. (2012a) An Active Inference view of cognitive control. Frontiers in Psychology 3:478. doi: 10.3389/fpsyg.2012.00478.

Pezzulo, G. (2012b) The “Interaction Engine”: A common pragmatic competence across linguistic and non-linguistic interactions. IEEE Transactions on Autonomous Mental Development 4:105–23.

Pezzulo, G. (2013) Studying mirror mechanisms within generative and predictive architectures for joint action. Cortex 49:2968–69.

Pezzulo, G., Donnarumma, F. & Dindo, H. (2013) Human sensorimotor communication: A theory of signaling in online social interactions. PLOS ONE 8:e79876.

Pickering, M. J. & Garrod, S. (2013) An integrated theory of language production and comprehension. Behavioral and Brain Sciences 36(4):329–47.

Prather, J. F., Peters, S., Nowicki, S. & Mooney, R. (2008) Precise auditory–vocal mirroring in neurons for learned vocal communication. Nature 451:305–10.

Sebanz, N., Bekkering, H. & Knoblich, G. (2006) Joint action: Bodies and minds moving together. Trends in Cognitive Sciences 10:70–76.

Yildiz, I. B., von Kriegstein, K. & Kiebel, S. J. (2013) From birdsong to human speech recognition: Bayesian inference on a hierarchy of nonlinear dynamical systems. PLOS Computational Biology 9:e1003219.