Both speech and nonhuman primate vocalizations are produced by the coordinated movements of the lungs, larynx (vocal folds), and the supralaryngeal vocal tract (Ghazanfar & Rendall 2008). During vocal production, the shape of the vocal tract can be changed by moving the various effectors of the face (including the lips, jaw, and tongue) into different positions. The different shapes, along with changes in vocal fold tension and respiratory power, are what give rise to different-sounding vocalizations. Different vocalizations (including different speech sounds) are produced in part by making different facial expressions. Thus, vocalizations are inherently “multisensory” (Ghazanfar 2013).
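To make this source-filter logic concrete, the following is a minimal, illustrative synthesis sketch in Python. The formant frequencies, bandwidths, and pulse rate are hypothetical values chosen only for demonstration: a glottal pulse train stands in for the laryngeal source, and a pair of resonant filters stands in for the vocal-tract shape; changing the resonances changes the vowel-like quality, while changing the pulse rate changes the pitch.

```python
# A minimal source-filter sketch (hypothetical parameter values): a glottal
# pulse train (the laryngeal "source") is shaped by vocal-tract resonances
# (the "filter"); different resonances yield different vowel-like qualities,
# and a different pulse rate yields a different pitch.
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate (Hz)
dur = 0.5                        # duration (s)
f0 = 120                         # glottal pulse rate (Hz), set by vocal fold tension

n = int(fs * dur)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0     # impulse train approximating glottal pulses

def formant(x, freq, bw, fs):
    """Second-order resonator approximating one vocal-tract formant."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    return lfilter([1.0], [1.0, -2 * r * np.cos(theta), r ** 2], x)

# Two illustrative vocal-tract "shapes" (approximate formant frequencies)
vowel_a_like = formant(formant(source, 700, 110, fs), 1200, 120, fs)
vowel_i_like = formant(formant(source, 300, 90, fs), 2300, 150, fs)
```

Holding the source fixed while changing only the filter, as above, mirrors the point in the text: the same laryngeal output sounds different when the facial effectors reshape the vocal tract.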
Given the inextricable link between vocal output and facial expressions, it is perhaps not surprising that nonhuman primates, like humans, readily recognize the correspondence between the visual and auditory components of vocal signals (Ghazanfar & Logothetis 2003; Ghazanfar et al. 2007; Habbershon et al. 2013; Jordan et al. 2005; Sliwa et al. 2011) and use facial motion to detect vocalizations more accurately and more quickly (Chandrasekaran et al. 2011). However, one striking dissimilarity between monkey vocalizations and human speech is that the latter has a unique bi-sensory rhythmicity: both the acoustic output and the movements of the mouth share a 3–8 Hz rhythm and are tightly correlated (Chandrasekaran et al. 2009; Greenberg et al. 2003). According to one hypothesis, this bimodal speech rhythm evolved through the linking of rhythmic facial expressions to vocal output in ancestral primates to produce the first babbling-like speech output (Ghazanfar & Poeppel 2014; MacNeilage 1998). Lip-smacking, a rhythmic facial expression commonly produced by many primate species, may have been one such ancestral expression. It is used during affiliative and often face-to-face interactions (Ferrari et al. 2009; Van Hooff 1962); it exhibits a 3–8 Hz rhythmicity like speech (Ghazanfar et al. 2010); and both the coordination of effectors during its production (Ghazanfar et al. 2012) and its developmental trajectory are similar to speech (Morrill et al. 2012).
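The claimed 3–8 Hz coupling between mouth movements and the acoustic envelope can be illustrated with a simple band-pass-then-correlate sketch. The snippet below uses synthetic stand-in signals; Chandrasekaran et al. (2009) used a related spectral-coherence analysis of real audiovisual speech, so treat this as an analogy rather than a reproduction of their method.

```python
# Synthetic illustration of measuring shared 3-8 Hz rhythmicity between a
# mouth-opening trace and the speech amplitude envelope (stand-in signals).
import numpy as np
from scipy.signal import butter, filtfilt

fs = 100.0                                    # video frame rate (Hz)
t = np.arange(0, 10, 1 / fs)
rhythm = np.sin(2 * np.pi * 5 * t)            # a shared 5 Hz articulatory rhythm
mouth = rhythm + 0.3 * np.random.randn(t.size)                  # mouth-opening area (a.u.)
envelope = 0.5 * (1 + rhythm) + 0.3 * np.random.randn(t.size)   # acoustic envelope (a.u.)

# Band-limit both signals to 3-8 Hz, then correlate them
b, a = butter(4, [3 / (fs / 2), 8 / (fs / 2)], btype="band")
mouth_band = filtfilt(b, a, mouth)
env_band = filtfilt(b, a, envelope)
r = np.corrcoef(mouth_band, env_band)[0, 1]
print(f"3-8 Hz mouth/envelope correlation: r = {r:.2f}")
```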
Very little is known about the neural mechanisms underlying the production of rhythmic communication signals in human and nonhuman primates. The mandibular movements shared by lip-smacking, vocalizations, and speech all require the coordination of muscles controlling the jaw, face, tongue, and respiration, and their foundational rhythms are likely produced by homologous central pattern generators in the brainstem (Lund & Kolta 2006). These circuits are modulated by feedback from peripheral sensory receptors. The neocortex may be an additional source influencing orofacial movements and their rhythmicity. Indeed, lip-smacking and speech production are both modulated by the neocortex, in accord with social context and communication goals (Bohland & Guenther 2006; Caruana et al. 2011). Thus, one hypothesis for the similarities between lip-smacking and visual speech (i.e., the orofacial component of speech production) is that they are a reflection of the development of neocortical circuits influencing brainstem central pattern generators.
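As a rough illustration of the central-pattern-generator idea, the sketch below implements a Matsuoka-style half-center oscillator: two units with mutual inhibition and slow adaptation produce an alternating open/close rhythm without any rhythmic input, and a tonic "descending drive" term stands in for cortical modulation. The parameter values are illustrative, not fitted to any species; the time constants are chosen only so that the rhythm falls roughly in the lip-smacking range.

```python
# A Matsuoka-style half-center oscillator: two units with mutual inhibition
# and slow adaptation alternate rhythmically; "drive" is a tonic descending
# input standing in for cortical modulation. Illustrative parameters only.
import numpy as np

def half_center_cpg(drive=1.0, tau=0.05, T=0.1, beta=2.5, w=2.5,
                    dt=0.001, steps=2000):
    x = np.array([0.1, 0.0])     # membrane-like states of the two half-centers
    v = np.zeros(2)              # adaptation (fatigue) states
    out = []
    for _ in range(steps):
        y = np.maximum(x, 0.0)                             # rectified outputs
        dx = (-x - beta * v - w * y[::-1] + drive) / tau   # mutual inhibition
        dv = (-v + y) / T
        x = x + dt * dx
        v = v + dt * dv
        out.append(y[0] - y[1])                            # alternating open/close signal
    return np.array(out)

jaw_rhythm = half_center_cpg(drive=1.0)   # stronger drive -> larger oscillations
```

In this toy model the descending drive scales the oscillation rather than generating it, which is the sense in which cortical circuits are hypothesized above to influence, not replace, brainstem pattern generators.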
One important neocortical node likely to be involved in this circuit is the insula, a structure that has been a target for selection in the primate lineage (Bauernfeind et al. 2013). The human insula is involved in, among other socio-emotional behaviors, speech production (Ackermann & Riecker 2004; Bohland & Guenther 2006; Dronkers 1996). Consistent with an evolutionary link between lip-smacking and speech, the insula also plays a role in generating monkey lip-smacking (Caruana et al. 2011). It is conceivable that for both monkey lip-smacking and human speech, the development and coordination of effectors related to their shared orofacial rhythm are due to the socially guided development of the insula. However, a neural substrate is needed to link the production of lip-smack-like facial expressions to concomitant vocal output (the laryngeal source) in order to generate that first babbling-like vocal output. This link to laryngeal control remains a mystery. One scenario is the evolution of insular cortical control over the brainstem's nucleus ambiguus. The fact that geladas produce lip-smacks concurrently with vocal output, generating a babbling-like sound (Bergman 2013), suggests that coordination between lip-smacking and vocal output may be relatively easy to evolve.
Human vocal communication is also a coordinated and cooperative exchange of signals between individuals (Hasson et al. 2012). Foundational to all cooperative verbal communicative acts is a more general one: taking turns to speak. Given the universality of turn-taking (Stivers et al. 2009), it is natural to ask how it evolved. Recently, we tested whether marmoset monkeys communicate cooperatively like humans (Takahashi et al. 2013). Among the traits marmosets share with humans are a cooperative breeding strategy and volubility. Cooperative care behaviors scaffold prosocial motivational and cognitive processes not typically seen in other primate species (Burkart et al. 2009a). We capitalized on the fact that marmosets are not only prosocial, but also highly vocal and readily exchange vocalizations with conspecifics. We observed that they exhibit cooperative vocal communication, taking turns in extended sequences of call exchanges (Takahashi et al. 2013) and following conversational rules strikingly similar to those of humans (Stivers et al. 2009). Such exchanges did not depend upon pair-bonding or kinship with conspecifics and are more sophisticated than the simple call-and-response patterns exhibited by other species. Moreover, our data show that turn-taking in marmosets shares with human turn-taking the characteristics of coupled oscillators, with self-monitoring as a necessary component (Takahashi et al. 2013) – an example of convergent evolution.
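A toy version of the coupled-oscillator account of turn-taking can be sketched as two phase oscillators with antiphase coupling: each animal's "call readiness" cycles at its own intrinsic rate, the coupling pulls the pair toward antiphase, and a call is logged whenever a cycle completes. This is a qualitative analogy to the dynamics described in Takahashi et al. (2013), not their implementation, and it omits the self-monitoring component; all parameter values are made up for illustration.

```python
# Two antiphase-coupled phase oscillators as a toy model of vocal turn-taking.
# Each animal's "call readiness" is a phase; coupling pulls the pair toward
# antiphase, and a call is logged each time a cycle completes.
import numpy as np

dt, total_time = 0.01, 120.0
omega = 2 * np.pi * np.array([0.10, 0.12])    # intrinsic call rates (cycles/s)
K = 0.5                                        # coupling strength
theta = np.array([0.0, np.pi / 3])             # initial phases
calls = ([], [])

for step in range(int(total_time / dt)):
    t = step * dt
    # Attraction toward the partner's phase shifted by pi (antiphase locking)
    dtheta = omega + K * np.sin(theta[::-1] + np.pi - theta)
    theta = theta + dt * dtheta
    for i in (0, 1):
        if theta[i] >= 2 * np.pi:              # cycle completes -> emit a call
            calls[i].append(round(t, 1))
            theta[i] -= 2 * np.pi

# With antiphase coupling the two call trains interleave rather than overlap.
print(calls[0][:5])
print(calls[1][:5])
```

The qualitative signature of turn-taking in this sketch is that the two call trains alternate even though the oscillators have different intrinsic rates; mutual coupling, not a fixed stimulus-response rule, produces the coordination.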
The lack of evidence for such turn-taking (vocal or otherwise) in apes suggests that human cooperative vocal communication could have evolved in a manner very different from what the gestural-origins hypotheses predict (Rizzolatti & Arbib 1998; Tomasello 2008). In this alternative scenario, existing vocal repertoires could begin to be used in a cooperative, turn-taking manner once prosocial behaviors in general emerged. Although the physiological basis of cooperative breeding is unknown (Fernandez-Duque et al. 2009), the “prosociality” that comes with it would certainly require modifications to the organization of social and motivational neuroanatomical circuitry. This must have been an essential step in the evolution of both human and marmoset cooperative vocal communication – one that may, like vocal production learning, also include changes to cortico-basal ganglia loops as well as changes to socially related motivational circuitry in the hypothalamus and amygdala (Syal & Finlay 2011). These neuroanatomical changes would link vocalizations and response contingency to reward centers during development. Importantly, given the small encephalization quotient of marmosets, such changes may not require an enlarged brain.