Ackermann et al. present an excellent overview of the neurocognitive architecture underlying primate vocal production, including a proposal for the evolution of articulated speech in humans. Multiple sources of evidence support the dual pathway model of acoustic communication. The evolution of volitional control over vocalizations might critically involve adaptations for rhythmic entrainment (i.e., a coupling of independent oscillators that have some means of energy transfer between them). Entrained vocal and non-vocal behaviors afford a variety of modern abilities such as turn-taking in conversation and coordinated music-making, in addition to refinements that lead to the production of speech sounds that interface with the language faculty.
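The parenthetical definition of entrainment as coupled oscillators with some means of energy transfer can be made concrete with a minimal simulation. The sketch below uses the Kuramoto model of two coupled phase oscillators; this is an illustrative choice on my part, not a model drawn from the target article, and all parameter values are arbitrary.

```python
import math

def simulate_entrainment(w1=2.0, w2=2.3, K=0.5, dt=0.01, steps=5000):
    """Two coupled phase oscillators (Kuramoto model): each oscillator
    nudges its phase toward the other's. Returns the wrapped phase
    difference (th1 - th2) after `steps` Euler steps."""
    th1, th2 = 0.0, math.pi  # start out of phase
    for _ in range(steps):
        d1 = w1 + K * math.sin(th2 - th1)  # coupling pulls th1 toward th2
        d2 = w2 + K * math.sin(th1 - th2)  # and th2 toward th1
        th1 += d1 * dt
        th2 += d2 * dt
    # wrap the phase difference into (-pi, pi]
    return math.atan2(math.sin(th1 - th2), math.cos(th1 - th2))

# With sufficient coupling the two oscillators phase-lock despite their
# different natural rates; with no coupling their phases drift apart.
locked = simulate_entrainment(K=0.5)
unlocked = simulate_entrainment(K=0.0)
```

Phase-locking occurs here because the coupling strength exceeds half the difference in natural frequencies; at the locked state the residual phase lag satisfies sin(Δφ) = (w1 − w2)/2K.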
Wilson and Wilson (Reference Wilson and Wilson2005) described an oscillator model of conversational turn-taking in which entrainment of syllable production allows for efficient interlocutor coordination with minimal gap and overlap in talk. The mechanisms underlying this ability might have been present in the hominin line well before language evolved, and could be closely tied to potential early functions of social signaling, including rhythmic musical behavior and dance (Bryant Reference Bryant2013; Hagen & Bryant Reference Hagen and Bryant2003; Hagen & Hammerstein Reference Hagen and Hammerstein2009). Research on error correction has revealed several design features of these entrainment mechanisms. Repp (Reference Repp2005) proposed distinct neural systems underlying different kinds of error correction in synchronous tapping: phase-related adjustments involve dorsal processes controlling action, while ventral perception and planning processes underlie period correction adjustments.
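The phase/period distinction can be illustrated with the kind of two-process linear error-correction model common in the sensorimotor synchronization literature. This is a toy sketch with arbitrary parameter values, not code from Repp (2005): phase correction shifts only the timing of the next tap, whereas period correction adjusts the tapper's internal interval.

```python
def tap_simulation(alpha=0.5, beta=0.1, S=0.5, T0=0.6, n=200):
    """Tapping with a metronome of interval S, starting with a mismatched
    internal period T0. alpha scales phase correction; beta scales period
    correction. Returns the asynchrony series and the final internal period."""
    t, T = 0.0, T0      # time of current tap, internal period
    s = 0.0             # time of current metronome onset
    asyncs = []
    for _ in range(n):
        a = t - s               # asynchrony: tap time minus onset time
        asyncs.append(a)
        T = T - beta * a        # period correction (planning process)
        t = t + T - alpha * a   # phase correction (action process)
        s += S
    return asyncs, T

asyncs, T_final = tap_simulation()
```

With phase correction alone (beta=0), a constant residual asynchrony of (T0 − S)/alpha remains; adding period correction drives both the asynchrony and the internal period error to zero. That the two processes leave different behavioral signatures is what makes them dissociable in tapping experiments.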
Bispham (Reference Bispham2006) and Phillips-Silver et al. (Reference Phillips-Silver, Aktipis and Bryant2010) have suggested that behavioral entrainment in humans involves the coupling of perception and action incorporating pre-existing elements of motor control and pulse perception. This coupling is plausibly linked to Ackermann et al.'s first phylogenetic stage including laryngeal elaboration and monosynaptic refinement of corticobulbar tracts. In order to implement proper error correction in improvised contexts of vocal synchrony, volitional control over articulators is necessary. While little comparative work has shown such an ability in nonhuman primates, there is some evidence suggesting control over vocal articulators in gelada baboons, with an ability to control, for example, vocal onset times relative to conspecific vocalizations (Richman Reference Richman1976). And recently, Perlman et al. (Reference Perlman, Patterson and Cohn2012) have found that Koko the gorilla exercises breath control in her deliberate play with wind instruments. Other evidence of this sort is certainly forthcoming, and will help us develop an accurate account of the evolutionary precursors to speech production in humans.
Laughter provides a window into the phylogeny of human vocal production as well. Laugh-like vocalizations first appeared prior to the last common ancestor of humans and great apes (Davila-Ross et al. Reference Davila-Ross, Owren and Zimmermann2009), and human laughter is likely derived from the breathing patterns exhibited during play activity (Provine Reference Provine2000). Bryant and Aktipis (Reference Bryant and Aktipis2014) found that perceptible proportions of inter-voicing intervals (IVIs) differed systematically between spontaneous and volitional human laughter, and that altered versions of the laughs were differentially perceived as human-made, in ways related to the IVI measures. Specifically, slowed spontaneous laughs were indistinguishable from nonhuman animal calls, while slowed volitional laughs were recognizable as human-produced. These data were interpreted as evidence for perceptual sensitivity to vocalizations originating from different production machinery – a finding consistent with the dual pathway model presented here by Ackermann et al.
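One plausible operationalization of an IVI-based measure is sketched below, assuming IVIs are simply the gaps between voiced bursts within a laugh; the actual measure used by Bryant and Aktipis (2014) may differ, and the numbers here are invented for illustration.

```python
def ivi_proportion(voiced_segments, total_duration):
    """Toy measure: the proportion of a laugh's duration taken up by
    inter-voicing intervals (gaps between voiced bursts).
    voiced_segments is a sorted list of non-overlapping (onset, offset)
    times in seconds; total_duration is the laugh's length in seconds."""
    voiced = sum(offset - onset for onset, offset in voiced_segments)
    return (total_duration - voiced) / total_duration

# e.g., three 100-ms voiced bursts within a 1.0-s laugh
p = ivi_proportion([(0.0, 0.1), (0.3, 0.4), (0.6, 0.7)], 1.0)
```

On this toy measure, spontaneous and volitional laughs would differ to the extent that their voiced bursts are packed differently in time.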
Interestingly, laughter seems to play a role in coordinating conversational timing. Manson et al. (Reference Manson, Bryant, Gervais and Kline2013) reported that convergence in speech rate was positively associated with how much interlocutors engaged in co-laughter. While the degree of convergence over a 10-minute conversation predicted cooperative play in an unannounced Prisoner's Dilemma game, the amount of co-laughter did not. The relationship between laughter and speech is not well understood, though evidence suggests that the two are integrated to some extent. The placement of laughter in the speech stream follows some linguistic patterns (i.e., a punctuation effect) (Provine Reference Provine1993), but laughter also manifests embedded within words and sentences (Bryant Reference Bryant2012). Co-laughter might serve in some capacity to help conversationalists coordinate their talk and, in early humans, perhaps coordinate other kinds of vocal behavior. Recent work has demonstrated that people can detect from very short co-laughter segments (<2 seconds) whether the co-laughers are acquainted or not (Bryant Reference Bryant2012), suggesting a possible chorusing function.
A surge of recent work shows that interpersonal synchrony involving entrainment fosters cooperative interactions (e.g., Kirschner & Tomasello Reference Kirschner and Tomasello2010; Manson et al. Reference Manson, Bryant, Gervais and Kline2013; Wiltermuth & Heath Reference Wiltermuth and Heath2009), and the effect seems immune to the negative consequences of explicit recognition. That is, while noticing synchrony does not undermine its benefits, when behavior matching is noticed but does not involve fine temporal coordination, interactants do not respond positively (e.g., Bailenson et al. Reference Bailenson, Yee, Patel and Beall2008). Manson et al. (Reference Manson, Bryant, Gervais and Kline2013) described interpersonal synchrony as a coordination game that affords no cheating opportunities, unlike mimicry and other behavior matching phenomena in which deceptive, manipulative strategies are potentially profitable. Coordinating vocal (and other) behavior provides a means for individuals to assess the fit of others as cooperating partners. Given the extremely cooperative nature of humans relative to other species, mechanisms for such assessment are not surprising, and indeed should be expected.
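The coordination-game logic can be sketched with a toy payoff matrix. The payoff values below are hypothetical, chosen only to encode the key property that tight temporal coordination pays off only when both parties produce it, so faking it unilaterally earns nothing; they are not data from Manson et al. (2013).

```python
from itertools import product

ACTIONS = ("entrain", "defect")
# (row payoff, column payoff); hypothetical values
PAYOFF = {
    ("entrain", "entrain"): (3, 3),  # successful synchrony benefits both
    ("entrain", "defect"):  (0, 1),  # one-sided entrainment effort is wasted
    ("defect",  "entrain"): (1, 0),
    ("defect",  "defect"):  (1, 1),
}

def is_nash(profile):
    """A profile is a Nash equilibrium if neither player gains by
    deviating unilaterally."""
    r, c = profile
    u_r, u_c = PAYOFF[profile]
    best_r = all(PAYOFF[(alt, c)][0] <= u_r for alt in ACTIONS)
    best_c = all(PAYOFF[(r, alt)][1] <= u_c for alt in ACTIONS)
    return best_r and best_c

equilibria = [p for p in product(ACTIONS, repeat=2) if is_nash(p)]
```

Against an entraining partner, defecting yields 1 rather than 3: there is no profitable "cheat", which is precisely what distinguishes synchrony from mimicry, where a deceptive matcher can exploit the matched party.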
Taken together, the findings described above point to an important component of human vocal communication: the independent and integrated action of emotional vocal production and speech production systems. Selection for articulatory control mechanisms underlying the entrainment of vocal behavior for within- and between-group communicative functions could have set the stage for conversational turn-taking – an ability that came to incorporate speech. Dual pathway models of acoustic communication should more seriously consider the neurocognitive underpinnings of vocal entrainment abilities and incorporate these adaptations into the phylogenetic history of human vocal behavior.