
The sensorimotor and social sides of the architecture of speech

Published online by Cambridge University Press: 17 December 2014

Giovanni Pezzulo
Affiliation:
Institute of Cognitive Sciences and Technologies, National Research Council, 00185 Rome, Italy. giovanni.pezzulo@istc.cnr.it https://sites.google.com/site/giovannipezzulo/
Laura Barca
Affiliation:
Institute of Cognitive Sciences and Technologies, National Research Council, 00185 Rome, Italy. laura.barca@istc.cnr.it https://sites.google.com/site/laurabarcahomepage/
Alessandro D'Ausilio
Affiliation:
Robotics, Brain and Cognitive Sciences Department, Italian Institute of Technology, 16163 Genova, Italy. alessandro.dausilio@iit.it http://www.iit.it/people/robotics-brain-and-cognitive-sciences-mirror-neurons-and-interaction-lab/researcher/alessandro-dausilio.html

Abstract

Speech is a complex skill to master. In addition to sophisticated phono-articulatory abilities, speech acquisition requires neuronal systems configured for vocal learning, with adaptable sensorimotor maps that couple heard speech sounds with motor programs for speech production; imitation and self-imitation mechanisms that can train the sensorimotor maps to reproduce heard speech sounds; and a “pedagogical” learning environment that supports tutor learning.

Type
Open Peer Commentary
Copyright
Copyright © Cambridge University Press 2014 

Besides sophisticated phono-articulatory abilities, the architecture of speech has key computational, neuronal, and social prerequisites that can shed light on its phylogenetic and ontogenetic origins.

As a first important requirement, the architecture of speech has to be configured for vocal learning, with adaptable sensorimotor circuits that couple heard speech sounds with motor programs for speech production. From a computational perspective, mastering speech in naturalistic environments plagued by uncertainty and noise is hard; this fact has long motivated control-theoretic views of speech emphasizing error-correction mechanisms and internal modeling (Guenther & Perkell 2004; Moore 2007).

Computational considerations also suggest that speech processing (and learning, see below) might benefit from a close interaction of perception and production systems. For example, production systems might support perceptual processes by predicting and “synthesizing” auditory candidates (as in analysis by synthesis), while perceptual systems might support the self-monitoring and error-correction of vocal production by affording an advance auditory analysis of the produced speech sounds. Neurobiological experiments support this idea by showing that the neuronal mechanisms for speech production and perception are not segregated in the brain; for example, specific motor circuits are recruited for the analysis of speech sound features (D'Ausilio et al. 2012). An organic proposal on the architecture of speech can be formulated within the framework of generative systems, in which perception and action systems share computational (and neuronal) resources and are both guided by a common prediction-error minimization process (Dindo et al. 2011; Friston 2010; Kiebel et al. 2008; Pezzulo 2012a; 2013; Yildiz et al. 2013).
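The analysis-by-synthesis idea can be made concrete with a toy sketch. It is an illustration under strong assumptions, not a model taken from the cited work: the forward model, the two-feature "auditory" vector, and the syllable labels with their articulatory values are all hypothetical. Each candidate motor program is passed through a forward model to synthesize the sound it would produce, and perception selects the candidate whose prediction error against the heard input is smallest.

```python
import math

def forward_model(articulation):
    """Hypothetical forward model: maps a 1-D articulatory parameter
    to a two-feature auditory vector (the mapping is invented)."""
    return (math.sin(articulation), math.cos(2.0 * articulation))

def prediction_error(predicted, heard):
    """Squared error between a synthesized candidate and the heard input."""
    return sum((p - h) ** 2 for p, h in zip(predicted, heard))

def recognize(heard, candidates):
    """Return the label whose synthesized sound best matches the input."""
    return min(
        candidates,
        key=lambda label: prediction_error(forward_model(candidates[label]), heard),
    )

# Hypothetical motor programs (label -> articulatory parameter).
CANDIDATES = {"ba": 0.3, "da": 0.9, "ga": 1.5}

# A slightly perturbed version of the sound that "da" would produce.
clean = forward_model(CANDIDATES["da"])
heard_noisy = (clean[0] + 0.05, clean[1] - 0.05)

best = recognize(heard_noisy, CANDIDATES)
print(best)
```

Here the production side does the perceptual work: recognition reduces to finding the motor program that best explains the input, which is one simple way for perception and action to share the same generative resources.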

A second important requirement is a learning method powerful enough to train the aforementioned sensorimotor architecture to perceive and (re)produce sounds and speech. This problem has been studied particularly in songbirds that, while not speaking, have sophisticated vocal learning abilities. Most theories assume that songbird learning is a staged process (Brainard & Doupe 2002). An initial period of auditory learning is needed to tune sensory maps to represent sensory “prototypes” of heard sounds (e.g., to memorize song patterns heard from conspecifics). These prototypes are then used as “reference signals” for imitation learning; by learning to reproduce the stored template, an animal can acquire equivalent vocal sound production skills. In control-theoretic terms, this process uses (auditory and articulatory) feedback error-correction mechanisms to produce a sound (song or speech) that closely matches the stored template (Guenther & Perkell 2004). During the learning process, internal (inverse and forward) models are trained, too, and these subsequently afford skilled song or speech processing.
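The template-matching stage can be sketched as a simple feedback loop. This is a minimal illustration under stated assumptions, not the model of Guenther & Perkell (2004): the vocal apparatus (`vocal_plant`), its coefficients, the two-feature sound vector, the template values, and the feedback gain are all hypothetical. The learner never reads the plant's coefficients; it only hears its own output, compares it with the stored template, and nudges the motor command in proportion to the auditory error. The sketch covers feedback error-correction only and does not train inverse or forward models.

```python
def vocal_plant(motor):
    """Hypothetical vocal apparatus: motor command -> produced sound features.
    The learner treats this as a black box and only hears the output."""
    return [2.0 * motor[0] + 1.0, 0.5 * motor[1] - 0.3]

def imitate(template, steps=200, gain=0.2):
    """Adjust the motor command until the produced sound matches the template.

    Assumes each sound feature increases with its motor dimension, so the
    auditory error can be fed back directly as a motor correction.
    """
    motor = [0.0, 0.0]
    for _ in range(steps):
        heard = vocal_plant(motor)                           # auditory feedback
        error = [t - h for t, h in zip(template, heard)]     # mismatch with template
        motor = [m + gain * e for m, e in zip(motor, error)] # error correction
    return motor

# Stored auditory "prototype" acquired during the earlier listening stage.
template = [5.0, 0.7]

motor = imitate(template)
produced = vocal_plant(motor)
print(motor, produced)
```

With these values the auditory error shrinks geometrically on every rehearsal, so the produced sound ends up very close to the template; a gain that is too large for the plant would instead make the loop overshoot, which is one reason precise calibration matters.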

To speed up learning, learners benefit from using self-imitation, too. Covert rather than overt singing (or speaking) might reproduce frequently heard speech sounds in the same way they are encoded in their sensory maps (note that generative architectures afford this form of learning quite naturally; Hinton 2007). Using both overt and covert processes, animals (including humans) might reproduce their stored prototypes with high fidelity, including the local accents of their communities.

The brain architecture supporting the aforementioned learning processes is incompletely known. Indeed, speech is a computationally challenging skill, as it requires sensorimotor circuits to be sensitive enough to discriminate subtle changes in speech sounds, and accurate enough to afford extremely precise control (e.g., of the timing of speech). The brain could finesse these problems by recruiting cortico-subcortical loops (particularly those involving the basal ganglia and the cerebellum), especially during learning. The role of these loops is seldom recognized in “cortico-centric” theories of motor skills (including speech), but the evidence indicates that they could play an important role in skill learning and mastery (Ackermann 2008; Caligiore et al. 2013). For example, vocal learning in the swamp sparrow might involve a loop between forebrain neurons that establish auditory-vocal correspondences and striatal structures important for song learning (Prather et al. 2008).

The high-fidelity reproduction of sounds could be key to cultural transmission and the evolutionary value of singing in songbirds (Merker 2012). However, human communities have richer social structures than other animals, which might have favored an open-ended instrumental use of vocal production besides ritualized display. The importance of this skill might have led to a greater investment of parental time in teaching and, we propose, to advanced forms of “tutor learning” (Canevari et al. 2013). Of note, a so-called pedagogical learning environment (Csibra & Gergely 2011) might have afforded specialized teaching strategies that could be uniquely human and that greatly improve on imitation and self-teaching learning methods. One example is “motherese”: Mothers modify their speech when speaking to young children in order to simplify their auditory processing and learning (see Pezzulo et al. 2013). This example suggests that social and interactive aspects of the learning environment are important prerequisites (or at least a useful scaffold) for speech acquisition and cultural transmission.

In sum, speech processing requires a sophisticated neuro-computational architecture in which physiologic, motoric, sensory, and social aspects mutually constrain each other and plausibly co-evolve. In addition to studying genetic determinants, it is important to recognize that speech could have found a suitable “neuronal niche” (Dehaene & Cohen 2007) in existing brain structures (cortical and subcortical) supporting skilled action. For example, speech could have re-used “generative” dynamics of such structures for imitation and self-imitation, and redeployed existing computational resources for combinatorial processing (Chersi et al. 2014; Fadiga et al. 2009).

In parallel, speech could have found a suitable “socio-cultural niche”: It could have been incubated within the sophisticated interactive and social dynamics of our species. The social context in which human speech is acquired is extremely rich, and human speech learning operates on top of sophisticated interactive, joint action, mutual emulation, and pedagogical abilities, most of which are unique to, or at least much more developed in, our species (Pickering & Garrod 2013; Sebanz et al. 2006). The demands of sophisticated social interactions might have contributed to transforming vocalization from an initially quite limited sensorimotor feat into a powerful, open-ended instrumental tool that permits conveying rich communicative intentions and forming extremely varied cultures (Pezzulo 2012b). In turn, we should not neglect how the intertwined sensorimotor and social sides of speech had a transformative impact on the destiny of our species.

References

Ackermann, H. (2008) Cerebellar contributions to speech production and speech perception: Psycholinguistic and neurobiological perspectives. Trends in Neurosciences 31(6):265–72. doi: 10.1016/j.tins.2008.02.011.
Brainard, M. S. & Doupe, A. J. (2002) What songbirds teach us about learning. Nature 417:351–58.
Caligiore, D., Pezzulo, G., Miall, R. C. & Baldassarre, G. (2013) The contribution of brain sub-cortical loops in the expression and acquisition of action understanding abilities. Neuroscience and Biobehavioral Reviews 37(10):2504–15.
Canevari, C., Badino, L., D'Ausilio, A., Fadiga, L. & Metta, G. (2013) Modeling speech imitation and ecological learning of auditory-motor maps. Frontiers in Psychology 4:364.
Chersi, F., Ferro, M., Pezzulo, G. & Pirrelli, V. (2014) Topological self-organization and prediction learning can support both action and lexical chains in the brain. Topics in Cognitive Science 6(3):476–91.
Csibra, G. & Gergely, G. (2011) Natural pedagogy as evolutionary adaptation. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences 366:1149–57.
D'Ausilio, A., Craighero, L. & Fadiga, L. (2012) The contribution of the frontal lobe to the perception of speech. Journal of Neurolinguistics 25:328–35.
Dehaene, S. & Cohen, L. (2007) Cultural recycling of cortical maps. Neuron 56(2):384–98. doi: 10.1016/j.neuron.2007.10.004.
Dindo, H., Zambuto, D. & Pezzulo, G. (2011) Motor simulation via coupled internal models using sequential Monte Carlo. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Catalonia, Spain, 16–22 July 2011, ed. Walsh, Toby, pp. 2113–19. AAAI Press/International Joint Conferences on Artificial Intelligence.
Fadiga, L., Craighero, L. & D'Ausilio, A. (2009) Broca's area in language, action, and music. Annals of the New York Academy of Sciences 1169(1):448–58. doi: 10.1111/j.1749-6632.2009.04582.x.
Friston, K. (2010) The free-energy principle: A unified brain theory? Nature Reviews Neuroscience 11:127–38.
Guenther, F. H. & Perkell, J. S. (2004) A neural model of speech production and its application to studies of the role of auditory feedback in speech. In: Speech motor control in normal and disordered speech, ed. Maassen, B., Kent, R., Peters, H., Lieshout, P. Van & Hulstijn, W., pp. 29–49. Oxford University Press.
Hinton, G. E. (2007) Learning multiple layers of representation. Trends in Cognitive Sciences 11:428–34.
Kiebel, S. J., Daunizeau, J. & Friston, K. J. (2008) A hierarchy of time-scales and the brain. PLOS Computational Biology 4:e1000209.
Merker, B. (2012) The vocal learning constellation: Imitation, ritual culture, encephalization. In: Music, language and human evolution, ed. Bannan, N., pp. 215–60. Oxford University Press.
Moore, R. K. (2007) PRESENCE: A human-inspired architecture for speech-based human-machine interaction. IEEE Transactions on Computers 56:1176–88.
Pezzulo, G. (2012a) An Active Inference view of cognitive control. Frontiers in Psychology 3:478. doi: 10.3389/fpsyg.2012.00478.
Pezzulo, G. (2012b) The “Interaction Engine”: A common pragmatic competence across linguistic and non-linguistic interactions. IEEE Transactions on Autonomous Mental Development 4:105–23.
Pezzulo, G. (2013) Studying mirror mechanisms within generative and predictive architectures for joint action. Cortex 49:2968–69.
Pezzulo, G., Donnarumma, F. & Dindo, H. (2013) Human sensorimotor communication: A theory of signaling in online social interactions. PLOS ONE 8:e79876.
Pickering, M. J. & Garrod, S. (2013) An integrated theory of language production and comprehension. Behavioral and Brain Sciences 36(4):329–47.
Prather, J. F., Peters, S., Nowicki, S. & Mooney, R. (2008) Precise auditory–vocal mirroring in neurons for learned vocal communication. Nature 451:305–10.
Sebanz, N., Bekkering, H. & Knoblich, G. (2006) Joint action: Bodies and minds moving together. Trends in Cognitive Sciences 10:70–76.
Yildiz, I. B., von Kriegstein, K. & Kiebel, S. J. (2013) From birdsong to human speech recognition: Bayesian inference on a hierarchy of nonlinear dynamical systems. PLOS Computational Biology 9:e1003219.