According to Ackermann et al., human language is a multicomponent process whose evolution must have operated at all life stages (Hogan 1988; Locke & Bogin 2006). In the first year of life, human sounds undergo a radical transformation: the replacement of the cry, an analog signal paralleling the extent of the infant's homeostatic imbalance (Gustafson et al. 2000; Lenti Boero et al. 1998) and similar to mammalian signals by design (Lieberman et al. 1968; 1971), by articulated speech-like sounds and some meaningful words at the end of the first year (de Boysson-Bardie 2001; Lenti Boero & Bottoni 2006; Oller 2000). Thus, the first few months of life summarize millions of years of language evolution, and some benchmarks can therefore be outlined to frame hypotheses about the selective pressures at work.
The how question
A first point is the transformation of the analog signal “cry” into articulated sounds: From the age of 2 months, infants start producing very low intensity protophones, which are mostly, but not all, vowel-like and have more complex melodic contours than cries (Ruzza et al. 2003). In the following months, sounds are produced with the entire vocal apparatus (mouth, lips, nose, and throat) (de Boysson-Bardie 2001; Oller 2000). This points to better control of sound emission, due not only to the maturation of the vocal apparatus but also to better neural motor control of phonation. Vocal tract length in neonates is about 6 to 8 cm (Vorperian & Kent 2007) and reaches 8.5 cm at 18 months, that is, 55% of adult size (Vorperian et al. 2009). Animal and human studies suggest that the neural motor control of infant cry is similar to that of monkeys: It involves the limbic system, which initiates the cry; the midbrain structures, which configure the response; and the brainstem, which is responsible for the mechanics of the cry. The brainstem integrates laryngeal and respiratory activity with the activation of subroutines for fixed vocal patterns pre-programmed as a response to external stimuli (Jürgens & Ploog 1988; Lenti Boero 2009; Lester & Boukydis 1992); thus control of articulated sounds should come only from a rearrangement and maturation of other centers allowing more motor freedom to lip, mandible, and tongue movements (Davis & MacNeilage 2002).
The why question
Cry is an alarm signal with striking characteristics of loudness and long duration. It can communicate the individuality and sex of the caller (Cismaresco & Montagner 1990; Rocca & Lenti Boero 2005) and urgency to a recipient (Lenti Boero et al. 2008). Now, imagine a hominid social group endowed with such a communicative tool: A known individual could communicate alarm and urgency to group mates from a distance. This communication might have had a basic referentiality, as in other mammalian species (Lenti Boero 1992; Rasa 1986; Seyfarth & Cheney 1980; Zuberbühler 2000b). Why go further? Cries are fixed analog sounds, and we know they might be aversive even for mothers (Frodi 1985; Frodi & Senchack 1990; Lenti Boero et al. 2008; Levitzky & Cooper 2000), while articulated sounds are considered music-like and very pleasant to the caregiver (Papoušek & Papoušek 1981). Newborns having a capacity for music-like sounds might have been preferentially selected by parents (Locke 2006), as a pilot experiment suggests (Lenti Boero & Bottoni 2009). Those same infants might have been selected as adults (Hogan 1988) because they were able to use frequency-modulated sounds in courtship, in a kind of ancestral hominid serenade, enabling them to communicate felt emotions (Banse & Scherer 1996).
Auditory-motor coevolution
All communication devices, human language included, imply the coevolution of both receiver and emitter, which is evident in the specialized adult language brain areas: Wernicke's and Broca's. During early development, infant perception of surrounding sounds, including language, is much more advanced than motor competence: Infants are capable of auditory streaming at 2–5 days old (Winkler et al. 2003), and they discriminate vowel and phonetic sounds from the first month (Clarkson & Berg 1983; Eimas et al. 1971; Mehler et al. 1988; Teinonen et al. 2009), sharing this capacity with many animal species: rhesus macaques, dogs, chinchillas, quails, and parrots (Adams et al. 1987; Bottoni et al. 2009; Dewson 1964; Kluender et al. 1987; Kuhl & Miller 1975; Miller 1977; Morse & Snowdon 1975; Pepperberg 2007). On the melodic and musical side, newborn infants recognize musical melodies heard before birth (Kisilevsky et al. 2004). In addition, event-related brain potential (ERP) and magnetoencephalography (MEG) studies show that newborns can form expectations of musical pitch and that infants detect the substitution of musical notes (Tervaniemi & Huotilainen 2003). Eventually, infants shape their cries' melodic contours on their native language (Mampe et al. 2009), thus showing a foundation for auditory-motor connection and imitation. Infants' sense of hearing is “encyclopaedic,” because it is open to all linguistic and musical sounds. This capacity is lost from 5 to 6 months (de Boysson-Bardie 2001), when infants direct their attention to motherese (Oller 2000), a feature that did not exist at the dawn of language. Thus, sound imitation by means of protophones might have concentrated on surrounding sounds, especially those uttered by predator or prey animals, to convey information about their presence and denote them in the acoustic channel. Though many theories point to enhanced sociality (Dunbar 1993), the possibility of referring to an object is still a core point for language evolution (Lenti Boero & Bottoni 2006) and might have been a key factor in the selection for the emission of articulated sounds.
ACKNOWLEDGMENTS
The research going into the preparation of this commentary was supported in 1995, from 2001 to 2003, and from 2005 to 2007 by grants from MURST (Ministero dell'Università e della Ricerca Scientifica e Tecnologica), and by funding from the University of Valle d'Aosta from 2009 to 2013.