Besides sophisticated phono-articulatory abilities, the architecture of speech has key computational, neuronal, and social prerequisites that can shed light on its phylogenetic and ontogenetic origins.
As a first important requirement, the architecture of speech has to be configured for vocal learning, with adaptable sensorimotor circuits that couple heard speech sounds with motor programs for speech production. From a computational perspective, mastering speech in naturalistic environments plagued by uncertainty and noise is hard; this fact has long motivated control-theoretic views of speech emphasizing error-correction mechanisms and internal modeling (Guenther & Perkell 2004; Moore 2007).
Computational considerations also suggest that speech processing (and learning, see below) might benefit from a close interaction of perception and production systems. For example, production systems might support perceptual processes by predicting and “synthesizing” auditory candidates (as in analysis by synthesis), while perceptual systems might support the self-monitoring and error-correction of vocal production by affording an advance auditory analysis of the produced speech sounds. Neurobiological experiments support this idea by showing that the neuronal mechanisms for speech production and perception are not segregated in the brain; for example, specific motor circuits are recruited for the analysis of speech sound features (D'Ausilio et al. 2012). An organic proposal on the architecture of speech can be formulated within the framework of generative systems, in which perception and action systems share computational (and neuronal) resources and are both guided by a common prediction-error minimization process (Dindo et al. 2011; Friston 2010; Kiebel et al. 2008; Pezzulo 2012a; 2013; Yildiz et al. 2013).
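To make the idea of a shared generative model concrete, the following is a minimal sketch, not a model taken from the cited work. It assumes a toy linear mapping from two articulatory parameters to two auditory features; the function names (`synthesize`, `perceive`) and all coefficients are hypothetical. The same mapping is used both to produce a sound and to perceive one, with perception cast as minimizing the error between the heard sound and the sound the model predicts.

```python
# Illustrative sketch only: the mapping and its coefficients are hypothetical.

def synthesize(articulation):
    """Generative model: predict two auditory features from two
    articulatory parameters (an arbitrary linear mapping)."""
    a1, a2 = articulation
    return (2.0 * a1 + 0.5 * a2, -0.5 * a1 + 1.5 * a2)


def perceive(heard, steps=200, rate=0.05):
    """Analysis by synthesis: find the articulation whose predicted sound
    best matches `heard`, by gradient descent on squared prediction error."""
    a1, a2 = 0.0, 0.0
    for _ in range(steps):
        p1, p2 = synthesize((a1, a2))
        e1, e2 = heard[0] - p1, heard[1] - p2  # prediction errors
        # Gradient step: each parameter moves in proportion to how strongly
        # it drives each feature (the transposed coefficients of synthesize).
        a1 += rate * (2.0 * e1 - 0.5 * e2)
        a2 += rate * (0.5 * e1 + 1.5 * e2)
    return a1, a2


if __name__ == "__main__":
    spoken = (0.3, -0.7)        # articulation used by a "speaker"
    sound = synthesize(spoken)  # production with the shared model
    print(perceive(sound))      # perception recovers roughly (0.3, -0.7)
```

In this toy setting the listener recovers the speaker's articulation because both sides rely on the same generative mapping; a realistic account would need nonlinear, dynamic, and noisy models.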
A second important requirement is a learning method powerful enough to train the aforementioned sensorimotor architecture to perceive and (re)produce sounds and speech. This problem has been studied particularly in songbirds that, while not speaking, have sophisticated vocal learning abilities. Most theories assume that songbird learning is a staged process (Brainard & Doupe 2002). An initial period of auditory learning is needed to tune sensory maps to represent sensory “prototypes” of heard sounds (e.g., to memorize song patterns heard from conspecifics). These prototypes are then used as “reference signals” for imitation learning; by learning to reproduce the stored template, an animal can acquire equivalent vocal sound production skills. In control-theoretic terms, this process uses (auditory and articulatory) feedback error-correction mechanisms to produce a sound (song or speech) that closely matches the stored template (Guenther & Perkell 2004). During the learning process, internal (inverse and forward) models are trained, too, and these subsequently afford skilled song or speech processing.
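The staged process described above can be sketched in a few lines. This is an illustrative toy under strong assumptions, not the model of the cited authors: sounds and motor commands are single numbers, the vocal apparatus is a hypothetical linear "plant" (`vocal_tract`), the forward model is fitted by a simple delta rule during babbling, and its inversion serves as the inverse model. All names and constants are invented for the example.

```python
import random


def vocal_tract(motor):
    """Hypothetical plant: the true motor-to-sound mapping,
    which the learner cannot inspect directly."""
    return 1.8 * motor + 0.4


def learn_template(tutor_sounds):
    """Stage 1 (auditory learning): store a prototype of the heard sounds,
    here simply their mean."""
    return sum(tutor_sounds) / len(tutor_sounds)


def learn_forward_model(n_babbles=1000, rate=0.05, seed=0):
    """Babbling: fit sound ~ gain * motor + bias with the delta rule,
    using the mismatch between predicted and heard own sound."""
    rng = random.Random(seed)
    gain, bias = 0.0, 0.0
    for _ in range(n_babbles):
        motor = rng.uniform(-1.0, 1.0)
        error = vocal_tract(motor) - (gain * motor + bias)
        gain += rate * error * motor
        bias += rate * error
    return gain, bias


def practice(template, gain, bias, trials=20, feedback_gain=0.5):
    """Stage 2 (imitation): start from the inverse model's guess, then use
    auditory feedback error to correct the motor command."""
    motor = (template - bias) / gain
    for _ in range(trials):
        error = template - vocal_tract(motor)  # heard vs. stored template
        motor += feedback_gain * error / gain
    return motor


if __name__ == "__main__":
    template = learn_template([1.0, 1.2, 1.1])
    gain, bias = learn_forward_model()
    motor = practice(template, gain, bias)
    print(template, vocal_tract(motor))  # produced sound approaches template
```

The design choice to separate `learn_forward_model` from `practice` mirrors the distinction in the text between training internal models and using feedback error correction against a stored reference signal.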
To speed up learning, learners benefit from self-imitation as well. Covert rather than overt singing (or speaking) might reproduce frequently heard speech sounds in the same way they are encoded in the learner's sensory maps (note that generative architectures afford this form of learning quite naturally; Hinton 2007). Using both overt and covert processes, animals (including humans) might reproduce their stored prototypes with high fidelity, including the local accents of their communities.
The brain architecture supporting the aforementioned learning processes is incompletely known. Indeed, speech is a computationally challenging skill, as it requires sensorimotor circuits to be sensitive enough to discriminate subtle changes in speech sounds and accurate enough to afford extremely precise control (e.g., of the timing of speech). The brain could finesse these problems by recruiting cortico-subcortical loops, particularly those involving the basal ganglia and the cerebellum, especially during learning. The role of these loops is seldom recognized in “cortico-centric” theories of motor skills (including speech), but the evidence indicates that they could play an important role in skill learning and mastery (Ackermann 2008; Caligiore et al. 2013). For example, vocal learning in the swamp sparrow might involve a loop between forebrain neurons that establish auditory-vocal correspondences and striatal structures important for song learning (Prather et al. 2008).
The high-fidelity reproduction of sounds could be key to cultural transmission and the evolutionary value of singing in songbirds (Merker 2012). However, human communities have richer social structures than those of other animals, which might have favored an open-ended instrumental use of vocal production besides ritualized display. The importance of this skill might have led to a greater investment of parental time in teaching and, we propose, to advanced forms of “tutor learning” (Canevari et al. 2013). Of note, a so-called pedagogical learning environment (Csibra & Gergely 2011) might have afforded specialized teaching strategies that could be uniquely human and that greatly improve on imitation and self-teaching methods. One example is “motherese”: Mothers modify their speech when speaking to young children in order to simplify the children's auditory processing and learning (see Pezzulo et al. 2013). This example suggests that social and interactive aspects of the learning environment are important prerequisites – or at least a useful scaffold – for speech acquisition and cultural transmission.
In sum, speech processing requires a sophisticated neuro-computational architecture in which physiologic, motoric, sensory, and social aspects mutually constrain each other and plausibly co-evolve. In addition to studying genetic determinants, it is important to recognize that speech could have found a suitable “neuronal niche” (Dehaene & Cohen 2007) in existing brain structures (cortical and subcortical) supporting skilled action. For example, speech could have re-used “generative” dynamics of such structures for imitation and self-imitation, and redeployed existing computational resources for combinatorial processing (Chersi et al. 2014; Fadiga et al. 2009).
In parallel, speech could have found a suitable “socio-cultural niche”: It could have been incubated within the sophisticated interactive and social dynamics of our species. The social context in which human speech is acquired is extremely rich, and human speech learning operates on top of sophisticated interactive, joint-action, mutual-emulation, and pedagogical abilities, most of which are unique to our species or at least much more developed in it (Pickering & Garrod 2013; Sebanz et al. 2006). The demands of sophisticated social interactions might have contributed to transforming vocalization from an initially quite limited sensorimotor feat into a powerful, open-ended instrumental tool that permits conveying rich communicative intentions and forming extremely varied cultures (Pezzulo 2012b). In turn, we should not neglect how the intertwined sensorimotor and social sides of speech had a transformative impact on the destiny of our species.