INTRODUCTION
Many studies suggest that both the amount of caregiver speech (e.g. Hart & Risley, 1995; Hoff & Naigles, 2002; Zimmerman et al., 2009) and the quality of caregiver speech (e.g. Cristia, 2011; Hurtado, Marchman & Fernald, 2008) impact child language acquisition. However, caregiver speech is not a static collection of utterances, passively received by the child. Quite to the contrary, much – if not all – infant-directed speech occurs in the context of conversational exchange (Snow, 1977), in which caregiver and child dynamically influence each other's speech. Here, we investigate the possibility that children and caregivers modulate the prosodic characteristics of their speech as a function of their interlocutor's speech patterns, as well as who initiates the conversation.
Examining to what extent individual dyads, as well as children and caregivers in general, are sensitive to one another in running speech is crucial to understanding the nature of these linguistic interactions, and may provide important insights into how language development occurs in both typical and atypical contexts. These insights may emerge on two levels. First, linguistic responsiveness may reflect social sensitivity to other individuals in a global sense, such that more responsive dyads may have higher levels of attachment, which may have a positive impact on language development. In support of this idea, there is evidence that dyads whose speech is optimally correlated in terms of timing may have more secure attachment and the children in such dyads may have better developmental outcomes (Jaffe, Beebe, Feldstein, Crown & Jasnow, 2001). On the other hand, other researchers have found differences between maternal responsiveness in the sense of social contingency and affect communication, and the effects of speech timing per se (Striano, Henning & Stahl, 2006). Second, such responsiveness may have a direct impact on learning mechanisms, such that infants from more responsive dyads may acquire mature linguistic forms more rapidly. This form of learning has been demonstrated experimentally in dyads with infants as young as 9 months of age for characteristics of phonological form (Goldstein & Schwade, 2008), but has yet to be examined with respect to suprasegmental linguistic characteristics, and/or in naturalistic interactions.
The mechanism responsible for this coordination of speech between mother and child has been called variously ‘entrainment’ (e.g. Brennan, Galati & Kuhlen, 2010), ‘alignment’ (e.g. Pickering & Garrod, 2004), or ‘imitation’ (e.g. Meltzoff & Moore, 1977). These terms have been used in a variety of different ways by researchers, with different implications. Probably the most widely used term in child development is ‘imitation’. This term generally has the connotation of a non-reciprocal influence whereby one party (typically the adult) performs an action and another (typically the child) imperfectly performs the same action. The first party's actions therefore influence how the second party's actions are performed, leading to learning over time. In this paper, we take the position that such influences in conversations are, or may be, reciprocal in nature. We therefore use the term entrainment (more commonly used in adult–adult interactions) to broadly refer to the phenomenon whereby interlocutors engaged in a conversation adapt their speech patterns in accordance with the interlocutor's speech. This may therefore be an important mechanism leading to learning on the part of the child, but we do not rule out the possibility that the child is also influencing the caregiver's behavior. Our study focuses on two specific and separable aspects of entrainment: similarity (overall similarity across a sample) and convergence (becoming more similar over time). Similarity documents a static relationship that may indicate alikeness due to convergence that has already occurred, or convergence that occurs very quickly or instantaneously. Convergence, on the other hand, documents effects of entrainment as it unfolds within a timeline.
We focus our investigation of entrainment on the following acoustic variables: timing (utterance duration and inter-speaker silences), pitch measures (mean, minimum, and maximum pitch and pitch range across the utterance), and speaking rate.
There has been considerable research on the development of conversational timing in young infants. Infants are sensitive to appropriate timing and disprefer interactions in which a significant delay is introduced (Striano et al., 2006). At least some aspects of this timing develop in a non-linear fashion, with more overlapping speech occurring in interactions with younger and older infants, but less during the period of the emergence of meaningful language, during the first half of the second year of life (Elias & Broerse, 1996). Both mothers and infants appear to influence the duration of overlapping speech and pausing behavior (Feldstein et al., 1993), and there may be an important relationship between this ability and later cognitive and language development (e.g. Jaffe et al., 2001). One study (Beebe, Alson, Jaffe, Feldstein & Crown, 1988) found a correlation across mother–infant dyads in the duration of response times at turns, but not of within-speaker pauses. Both this study and another (Shimura & Yamanoucho, 1992) found no correlation in the durations of infant and mother utterances.
Conversational timing is implicated as a key universal feature of human interaction. There are cross-cultural similarities in the qualitative features of turn-taking timing, such as a unimodal distribution of response timing with a peak between 0 and 200 ms after the interlocutor's offset, and longer response times for disconfirming utterances, which are modulated by quantitative differences across languages in the length of mean response times (Stivers et al., 2009). Nonetheless, conversational timing is clearly a domain that requires learning on the part of the infant. It is therefore perhaps unsurprising that durational measures that capture more dynamic, interactional components of the conversation appear to be more sensitive to entrainment effects.
With respect to pitch, Siegel, Cooper, Morgan, and Brenneise-Sarshad (1990) found no effect of interlocutor (mother or father) on measures of the mean pitch (f0) of utterances produced by 9–12-month-old infants. Similarly, McRoberts and Best (1997), examining a single infant longitudinally from 3 to 17 months, found adjustment in the form of higher pitch by the mother and father when speaking with their infant (as would be expected either because of general characteristics of infant-directed speech or because of an entrainment effect), but no difference in the infant's mean pitch depending on whether the infant was alone, with mother, or with father across the age range. Shimura and Yamanoucho (1992) found correlations both between and within mother–infant dyads with respect to mean pitch for 2- to 3-month-olds, but did not attempt to disentangle mother–infant or infant–mother effects. One study, Masataka (1992), found that 3- to 4-month-old infants showed significantly more similarity to their mother in the intonation contours of their vocalizations when the mother's intonation was exaggerated. However, such response similarity was not seen with less exaggerated contours.
The different results found in the studies reported above are difficult to interpret, as the studies differ on a variety of variables, including age of the infants, language being acquired (Japanese versus English, two languages that differ radically in both linguistic and cultural factors), specific variables being analyzed, and the form of analysis (correlations by participant or by utterance, or more sophisticated statistical analyses). Additionally, many of these studies have necessarily been limited to very small samples of speech, given the labor involved in coding such speech samples. While it has been argued that such small slices produce convergence estimates that are reliable and meaningful for timing (e.g. Jaffe & Feldstein, 1970), the same has not been shown for pitch.
Another limitation of much of this research is that correlations between dyads are interpreted as evidence for entrainment effects. However, mothers and infants share much in common going into a conversation, and this prior resemblance unrelated to entrainment may drive correlations across dyads. More fine-grained analysis within a given dyad allows us to more directly examine the extent to which mother and infant influence each other dynamically over the course of a conversational exchange, and to tease apart whether the broad-level correlations found to date across dyads are capturing true entrainment, or are simply the result of pre-existing resemblance between mother and child.
The Language ENvironment Analysis, or LENA, provides us with the unique opportunity to examine mother–child interactions in a large-scale, naturalistic manner (Zimmerman et al., 2009). The LENA Research Foundation has developed a small recorder that is capable of storing up to 16 hours of running speech, and which can be worn easily by the child when fitted in the pocket of a custom-made vest, thus capturing the child's production as well as her interlocutors' input over the whole day. Additionally, LENA has developed software that pre-processes the recordings to divide up stretches of recordings into utterances (versus silence or noise), and labels the detected utterances as a function of speaker, based on basic acoustic properties (child versus adult female, among others). In the present study, we used LENA software together with our own scripts to estimate child–caregiver entrainment in terms of both timing (specifically, utterance duration, speech rate, and response time) and pitch characteristics (pitch mean, maximum, minimum, and range).
In our study we collected multiple full-day, naturalistic recordings using LENA, of thirteen monolingual English-learning infants and toddlers aged 12 to 30 months going about their typical day. These recordings were then processed in the laboratory to address the following research questions. First, does entrainment in the form of similarity occur in mother–child conversation? To address this question, we conducted Pearson correlation tests for the speech of mother and child pairs at the level of the conversational block (a unit referring to short-term conversational bouts separated by pausing, described in more detail below) and the conversational turn (i.e. where a child utterance follows a maternal utterance or vice versa). Similarity in conversational blocks would indicate an adaptation of speech in accordance with the interlocutor's speech patterns within the context of a particular conversation, over and above broad-level similarity between a mother–infant/toddler dyad. Turn-level similarity would indicate a more immediate adaptation of speech to match the interlocutor's token-level speech patterns. In adults' speech, similarity has been reported to be tighter between speakers at turns than at the session level (Levitan & Hirschberg, 2011), and at the conversational-block level than at the session level (Edlund, Heldner & Hirschberg, 2009). A comparison of the correlation coefficients at the turn and block level will shed light on the extent to which the influence of the dyad members on each other varies at different levels of analysis.
Second, we asked whether there were any effects of interlocutor order (i.e. who initiates the conversation and who responds) on the nature and extent of responding. To our knowledge, ours is the first study to examine this potentially important aspect of the conversational dynamic with respect to prosodic influences. We address this question by considering who is the initiator and who the respondent at both the conversational turn and block level. At the conversational turn level, we compare the correlation coefficients in mother-to-child turns with those in child-to-mother turns. Differences between these two analyses would help to tease apart the directionality of mother–child entrainment (i.e. whether mothers adjust more to infants or vice versa). At the level of the conversational block, we examine whether initiating a conversational block has any effect on the temporal pattern, i.e. response time and utterance length, of the speaker as the owner of the block.
Finally, does the child's and the mother's speech converge over the course of a conversational block? In studies of adults' convergence, conversations analyzed typically take place between strangers who need time to get familiarized with each other's speech patterns. Thus convergence usually occurs late in the course of the conversation. Due to the existing familiarity between mother and child, such conversational convergence may be reduced or absent entirely. On the other hand, since infants are immature conversationalists, they may require time to warm up with respect to the dynamics of a particular conversational interchange, and hence may show greater similarity at the end than at the beginning of a conversational block.
METHOD
Participants and recording procedure
Recordings were collected as part of a larger project examining the linguistic environment of infants/toddlers across childcare settings (Soderstrom & Wittebolle, 2013). For the present study, all samples were day-long recordings of infants and toddlers being cared for in the home, primarily by their own mothers.
A total of fourteen mother–child pairs participated in the study, and each pair recorded 3 to 5 days (mean = 3·85, SD = 0·99) of their daily life. However, one of these dyads was excluded from analysis because the father was the primary caregiver, and therefore a much smaller proportion of the speech segments were maternal speech, raising the baseline error rate of maternal speech segment identification to a much higher proportion of total utterances. The mean duration of the interval between two subsequent recordings was 16 days (SD = 19). The final recording sample consisted of 517·27 hours of recording from a total of 50 days. Due to a technical issue, two adjacent recording days from one participant were treated as a single ‘day’ in the analysis, so our final recording sample consists of 49 ‘days’. Recordings ranged from 6·56 to 13·96 hours with a mean of 10·00 hours (SD = 1·84). The total hours of recording from each child ranged from 29 to 51 (mean = 40·0, SD = 10·1). Children's age ranged from 12 months to 30 months (mean = 20·4 months, SD = 4·5), with nine male and four female infants/toddlers participating. See Table 1 for participant-specific data.
Table 1. Demographic and recording details by participant. In the ‘age range’ row, the age in months of the first and last recording is listed. The bottom row refers to the total number of segments where the speech constitutes more than half of the segment duration (only these segments were included in the analyses).
Participants were recruited from an existing database of families in Winnipeg, Canada, who had expressed an interest in participating in research studies with their child. If verbal interest in the study was expressed by the caregiver over the phone, a research assistant visited the home. During this initial visit, written informed consent was obtained, and the caregivers were given instructions regarding the use of the LENA recording device. After each day of recording, a research assistant would visit the home to collect the recording device in order to download the recording for processing and provide another device for the next recording. Participants were provided a log sheet to note the number of people in the room and the child's activities throughout the day. Participants were compensated at $20/day for the recordings.
Technical descriptions of the data and pre-processing
Day-long recordings were annotated using the unsupervised algorithms included in the software accompanying LENA, which tags stretches of time as speech, speaker overlap, noise, or silence. Thus, a segment is an individual chunk of time associated with a particular category of speech or acoustic input by the LENA system, such as adult female utterance or non-verbal noise. A detailed description of the LENA system processing may be found in the LENA technical reports (<http://www.lenafoundation.org/customer-resources/technical-reports/>). For our purposes, a speech segment may be considered roughly equivalent to an utterance, and we will use these terms interchangeably. The speech segments we analyzed within the study were female adult (FAN in LENA's coding system, assumed to be the mother) and the child being recorded (CHN, identified separately from other children in the environment by LENA directly). Speech segments are divided into conversational blocks by LENA, such that a conversational block continues until the gap between two subsequent speech segments is more than 5 seconds, at which point a new conversational block is declared. Gaps can consist of silence or non-verbal noise. For our purposes, we focused on mother-initiated and child-initiated conversational blocks (AICF and CIC in LENA's coding system), from which the adult female and child segments were selected. The total number of segments spoken by the adult female and the child were similar (collapsing across all days: female adult = 39,745; target child = 38,868). We also analyzed a subset of the data (34,992 segments, 45% of the total) in which observational notes taken by the mother at the time of the recording indicated that only the mother and child were present, such that any adult female as labeled by LENA should in fact be the mother.
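The 5-second gap rule that defines conversational blocks can be sketched as follows. This is an illustrative Python reimplementation, not LENA's actual algorithm; the segment tuples, the function name, and the example timings are hypothetical, while the 5-second threshold comes from the description above.

```python
def group_into_blocks(segments, max_gap=5.0):
    """segments: list of (speaker, start_s, end_s), sorted by start time.
    Returns a list of conversational blocks, each a list of segments."""
    blocks = []
    current = []
    prev_end = None
    for seg in segments:
        _speaker, start, end = seg
        # A gap of more than max_gap seconds between subsequent speech
        # segments closes the current block and opens a new one.
        if prev_end is not None and start - prev_end > max_gap:
            blocks.append(current)
            current = []
        current.append(seg)
        prev_end = end
    if current:
        blocks.append(current)
    return blocks

example = [("FAN", 0.0, 1.2), ("CHN", 1.5, 2.0),    # short gaps: one block
           ("FAN", 9.0, 10.1), ("CHN", 10.4, 11.0)]  # 7 s gap: new block
print(len(group_into_blocks(example)))  # -> 2
```

Note that, as in LENA's scheme, a block can contain several consecutive segments by the same speaker; turn-taking is identified in a later step.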
In our initial raw acoustic measures (not reported) we found higher mean pitch in the total dataset compared to the restricted set for both mother and child, but no other basic acoustic differences. Crucially, the overall pattern in the relationship between the two interlocutors was similar. We therefore report only the findings for the larger dataset, and refer to the adult female as the mother for simplicity.
The LENA software provides an estimation of vocalization duration directly from the speaker segment, which may include small within-talker pauses and non-verbal vocalizations such as vegetative noises and crying. LENA makes a first-pass assumption that adult speech will have a minimum length of 1 second, while child speech will have a minimum length of 0·6 s (and 0·8 s for certain other segments). Utterances shorter than these minima will have their boundaries extended by LENA's segmentation process. Therefore, durations at these minima are artificially inflated. Segments containing less than 50% vocalization (i.e. greater than 50% silence or non-verbal production like crying or laughing) were excluded from our analyses.
Finally, we used custom-written Praat scripts (Boersma & Weenink, 2013) to measure pitch and speaking rate in the dataset gathered using LENA. Pitch was expressed in ERBs (equivalent rectangular bandwidth), since this psycho-acoustic scale provides perceptually relevant quanta (Hermes & van Gestel, 1991). Speaking rate was expressed as the number of syllables per second, calculated from the combination of intensity peaks and voicing in each syllable (De Jong & Wempe, 2009). We used a single set of parameters for all participants. Figure 1A schematically demonstrates the definition of segments, conversational turns, and blocks, and Figure 1B shows an example of how acoustic values extracted from the defined turns and blocks are organized into a data frame for a single conversational block combining LENA tagging and Praat measurements.
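The Hz-to-ERB conversion underlying the pitch measures can be illustrated as follows. Hermes and van Gestel (1991) motivate the use of the ERB-rate scale; the specific constants below follow the Glasberg and Moore (1990) formulation that is common in speech software, and it is an assumption that the authors' Praat scripts used exactly this variant.

```python
import math

def hz_to_erb(f_hz):
    """Convert a frequency in Hz to the ERB-rate scale.
    Constants follow Glasberg & Moore (1990): E = 21.4 * log10(0.00437*f + 1).
    Whether this matches the authors' exact formula is an assumption."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

# A typical adult female mean pitch (~220 Hz) vs. a higher child-like
# pitch (~300 Hz): the ERB scale compresses high-frequency differences.
print(round(hz_to_erb(220.0), 2))  # -> 6.26
print(round(hz_to_erb(300.0), 2))  # -> 7.79
```

Expressing pitch on this scale means that a fixed difference in ERB units corresponds roughly to a fixed perceptual difference, which is why it is preferred over raw Hz for comparing speakers with very different registers.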
Fig. 1 A (top). Structure of conversational turns and conversational blocks (M = Mother, C = Child). Note that intervening segments with labels other than mother and child were filtered out. Fig. 1B (bottom). Example of the data frame resulting from the application of custom scripts to the data frame generated by the LENA system.
We processed the data frame further using custom scripts written in R (R Core Team, 2013) to extract the information specific to each of our research questions. For example, a segment was tagged as having constituted a conversational turn if there was a talker change between that segment and the previous one. We then extracted acoustic values for the previous and following segments at each turn, as well as its linear location within the conversational block. Based on this information, we also extracted the first and last turn of the block to investigate the pattern of convergence or the lack of convergence. More information about these steps of analysis is provided in the ‘Results’ section, where the investigation of each particular research question is described in detail (see Figure 2).
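The turn-tagging step described above can be sketched in a few lines. The original used custom R scripts over the LENA data frame, so this Python version, with hypothetical speaker labels, is only illustrative of the logic.

```python
def tag_turns(block_speakers):
    """block_speakers: ordered list of speaker labels within one block,
    e.g. ['MOT', 'MOT', 'CHI', ...] (labels are hypothetical stand-ins
    for LENA's FAN/CHN codes). Returns a parallel list of booleans that
    is True where a segment constitutes a conversational turn, i.e.
    where the talker changes relative to the preceding segment."""
    tags = []
    for i, spk in enumerate(block_speakers):
        tags.append(i > 0 and block_speakers[i - 1] != spk)
    return tags

print(tag_turns(["MOT", "MOT", "CHI", "MOT", "MOT"]))
# -> [False, False, True, True, False]
```

In the real pipeline each True position would then be paired with the acoustic values of the preceding and following segments, plus its linear position in the block.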
Fig. 2. Flow of data selection and processing.
Reliability test
For reliability purposes, we manually coded 100 segments for each dyad (50 from the mother and 50 from the child). Reliability coding was performed by volunteer research assistants with undergraduate-level linguistics training. Segments were selected randomly by a custom computer script from a single transcript for each dyad. The research assistant first listened to the segment as given by the LENA system, selected which LENA-generated speaker code best represented the segment (e.g. female adult, silence, overlapping speech), and recorded the number of syllables they heard in the segment. They then listened to the segment a second time with a 1 s context buffer on either side of the segment and again selected a LENA code. After this, the research assistant adjusted the start and stop times for the utterance in Praat until they were satisfied that it accurately reflected the real start and stop times of the utterances. If necessary, an additional buffer was added, such that the start or stop time could vary further than 1 s from the original LENA parameter.
The median of the differences in segment duration between the LENA-generated and manual annotations was –90 ms, with an inter-quartile range of Q1: –481 ms to Q3: 112 ms (see Figure 3). Given that the mean duration of manually annotated segments was 1390 ms, the absolute median difference between the LENA-generated and the manual annotation corresponds to roughly 6% of the mean segment duration. This appears to us an acceptable level of accuracy for extracting robust patterns of speech in a large dataset.
Fig. 3. Differences between LENA-generated and manual annotations in segment boundaries and durations.
We then compared the number of syllables calculated by the Praat script with those generated by the manual coding by applying a Pearson's product-moment correlation test. The result shows that there was a high level of correlation between the two sets of data (t(1,397) = 35·9, p < ·001, r = 0·70). This level of agreement is more modest than human inter-coder agreement, but within acceptable ranges given the automated nature of the analysis.
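The correlation test above can be reproduced from first principles. The sketch below (in Python, with invented toy counts; the actual analysis covered roughly 1,400 segments) computes Pearson's product-moment r and the associated t statistic in the same way.

```python
import math

def pearson_r_and_t(x, y):
    """Pearson product-moment correlation and its t statistic,
    t = r * sqrt(n - 2) / sqrt(1 - r^2), as used in the reliability test."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return r, t

# Toy data: automatic vs. manual syllable counts for ten segments
# (invented for illustration; the paper reports r = 0.70 over ~1,400 segments).
auto = [2, 3, 1, 4, 5, 2, 3, 6, 4, 2]
manual = [2, 4, 1, 3, 5, 2, 4, 6, 5, 2]
r, t = pearson_r_and_t(auto, manual)
print(r > 0.8, t > 2)
```

With the large n of the actual dataset, even a moderate r yields a very large t, which is why the reported test is so far into the significant range.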
Finally, we performed a Pearson's chi-squared test with Yates' continuity correction to evaluate LENA's accuracy in labeling the speaker by comparing the LENA-generated speaker codes with manually annotated ones. The LENA-generated codes are either female adult (FAN) or target child (CHN), whereas manual annotations included as many as nine different categories (see Table 2). The results show that the LENA-generated codes and the manual annotations are not independent of each other (χ2(8, N = 1,400) = 1,045, p < ·001). The type of classification errors made by the LENA system can be inferred from Table 2, which shows how the segments labeled as female adult and target child were re-classified by human annotators after listening to the segment in context. The mean proportion of segments whose speaker had been identified correctly was ·84, with a range of ·51 to ·93. After removing the one dyad where the primary caregiver was the father, the mean proportion of correct identification was ·86.
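As a simplified illustration of this accuracy check, the sketch below computes a Pearson chi-squared statistic with Yates' continuity correction on an invented 2x2 table (LENA code by whether the manual annotation agreed); the actual analysis used a 2x9 contingency table over 1,400 segments, and all counts here are hypothetical.

```python
def yates_chi2_2x2(table):
    """Pearson chi-squared statistic with Yates' continuity correction
    for a 2x2 table of observed counts, table = [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        expected = row * col / n
        # Yates' correction subtracts 0.5 from each |observed - expected|.
        chi2 += (abs(obs - expected) - 0.5) ** 2 / expected
    return chi2

table = [[600, 100],   # FAN segments: manual annotation agreed / disagreed
         [560, 140]]   # CHN segments: manual annotation agreed / disagreed
print(round(yates_chi2_2x2(table), 2))  # -> 7.65
```

For a 2x2 table the statistic has one degree of freedom, so values above 3.84 are significant at the ·05 level; the full 2x9 comparison in the paper has eight degrees of freedom.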
Table 2. Types of LENA's classification errors in labeling speaker ID. For a subset of the segments labeled as CHN (Target Child) or FAN (Female Adult), human annotators manually re-labeled them.
Statistics
Our analyses of entrainment rely on comparing simple acoustic measures of mother and child speech at different levels of analysis – comparisons of overall mean measures across conversational blocks, comparisons of means at turns only across blocks, and comparisons at turns at different time periods within a block. This approach has the advantage that it does not require baseline measures, because we rely on the variance across blocks and turns to control for baseline effects. Additionally, it provides matched or paired datapoints, thus enabling us to use well-described statistical approaches such as correlations and regressions.
We also constructed linear mixed-effects models where appropriate using the lme4 package (Bates, Maechler & Bolker, 2013) implemented in R (R Core Team, 2013). To evaluate the statistical significance of the linear models and obtain p-values, we conducted likelihood ratio tests comparing nested models. We report t-values without degrees of freedom or p-values, due to the difficulty of determining the correct number of degrees of freedom for assessing the t-values. Roughly, however, absolute t-values greater than 2 can be interpreted as indicating statistical significance (Baayen, Davidson & Bates, 2008).
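The likelihood ratio test for nested models can be sketched as follows: the statistic is twice the difference in maximized log-likelihoods, referred to a chi-squared distribution whose degrees of freedom equal the number of extra parameters in the full model. The log-likelihood values below are invented (chosen so that the statistic equals 15·5, the value reported later for the effect of discourse level), the function name is hypothetical, and the closed-form p-value applies only to the df = 1 case.

```python
import math

def lrt_chi2_df1(ll_reduced, ll_full):
    """Likelihood-ratio test for nested models differing by ONE parameter.
    For df = 1, the chi-squared survival function reduces to
    erfc(sqrt(x / 2)), so no external statistics library is needed."""
    stat = 2.0 * (ll_full - ll_reduced)
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Invented log-likelihoods, e.g. for dropping one fixed effect:
stat, p = lrt_chi2_df1(ll_reduced=-105.0, ll_full=-97.25)
print(round(stat, 2), p < 0.001)  # -> 15.5 True
```

In R this is what `anova(reduced_model, full_model)` reports for two nested `lmer` fits; the sketch just makes the arithmetic behind the χ2 value explicit.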
RESULTS
Analysis of similarity at the level of conversational turn and conversational block
We first investigated broadly the overall similarity across partners in the measures of segment duration, speaking rate, and pitch by applying Pearson correlation tests. If mother–child dyads have a tendency to coordinate their speech in conversational exchange, we would expect to see positive correlations in these acoustic measures.
The number of conversational blocks contained in each of the 49 days' recording varied between 38 and 623. The number of segments contained within a conversational block ranged from 1 to 254, with a mean of 7·6 segments in a block (SD = 9·5 segments). At the level of the conversational block, we took the mean acoustic values for mother and child across the conversational block, including all utterances for mother and child, which reduced the data in each block to two data points, one for each interlocutor. Out of a total of 10,406 blocks, we selected the 8,466 blocks which contained both the mother's and the child's speech and had valid values in all acoustic variables under investigation. For example, blocks where pitch values could not be measured for various reasons were eliminated. We then calculated correlation coefficients of the acoustic variables between mother and child across all of their conversational blocks within a single recording day.
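The block-level reduction and per-day correlation can be sketched as follows. The data below are invented, and the structure is a simplified Python stand-in for the actual R pipeline: each block is reduced to one mean value per interlocutor, blocks lacking either interlocutor are dropped, and the two resulting vectors are correlated across blocks.

```python
import math

def pearson_r(x, y):
    """Plain Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Each block: list of (speaker, pitch_in_ERB) segments; values invented.
blocks = [
    [("MOT", 6.1), ("CHI", 7.4), ("MOT", 6.3)],
    [("MOT", 6.6), ("CHI", 7.9)],
    [("CHI", 7.1), ("MOT", 5.9), ("CHI", 7.0)],
]

mot_means, chi_means = [], []
for block in blocks:
    mot = [v for spk, v in block if spk == "MOT"]
    chi = [v for spk, v in block if spk == "CHI"]
    if mot and chi:  # keep only blocks containing both interlocutors
        mot_means.append(sum(mot) / len(mot))
        chi_means.append(sum(chi) / len(chi))

print(round(pearson_r(mot_means, chi_means), 2))
```

One such coefficient is computed per recording day; the day-level coefficients are then the datapoints for the one-sample t-tests reported below.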
We conducted two-tailed one-sample t-tests on the correlation coefficients across the session for each dyad, with the null hypothesis that correlations would not be different from zero. This null hypothesis was rejected in all variables except for pitch range (see Table 3, left-hand side). That is, the correlation coefficient was significantly different from zero across the mothers in pitch mean, pitch minimum, and pitch maximum values as well as speaking rate and duration at the block level. However, the mean correlation coefficients were very small, particularly for duration.
Table 3. Results of one-sample t-tests for the mean correlation coefficients at the block level pooled across the forty-nine days of recording from thirteen mother–child dyads
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160921005612597-0706:S0305000915000203:S0305000915000203_tab3.gif?pub-status=live)
For the turn-level analysis, we applied the same statistics to the acoustic values of pairs of segments which constituted conversational turns averaged across the block. This analysis was identical to the preceding one, except that only segments adjacent to turns were included. If several subsequent utterances by the same speaker occurred without the interlocutor taking a turn, these repetitions were excluded. The number of turns in each block ranged from 1 to 76, with a mean of 3·2 turns in a block (SD = 4·0). The rationale for conducting this second analysis was to investigate whether there is a tighter or looser relationship between the two speakers at a more local level in the immediate context of conversational turns, based on the idea that direct feedback between the interlocutors might be masked by including all the utterances in the block.
At the turn level, the hypothesis of a zero correlation coefficient was rejected for the pitch measures (see Table 3, right-hand side), but not for duration or speaking rate.
Finally, we investigated whether there is a tighter correlation, as indicated by the magnitude of the correlation coefficients, at the turn level compared to the block level, and whether there is an effect of age and gender on the correlation coefficients. To answer this question, we constructed a linear mixed-effects model for the correlation coefficients pooled across the 49 sessions with the acoustic variables that showed significant correlations at both the block and the turn level, i.e. pitch-related measures. The model included level (block, turn), age (in days), and gender (male, female) as fixed effects and the dyads (13 groups) and the acoustic measures (4 parameters) as random intercepts. Specifically, the formula used was lmer(coefficients ~ level + age + gender + (1|dyad) + (1|measure), data = level.data, REML = FALSE). The lmer function was always run with the same REML specification in this study, so we do not repeat this specification in the rest of the paper.
Likelihood ratio tests comparing the full model with the one without the fixed effect of level, i.e. turn or block, showed that the discourse level of conversational exchange affected correlation coefficients between mother and child speech (χ 2 (1) = 15·5, p < ·001). We found that the correlation coefficients were significantly greater at the turn than the block level (Estimate = 0·033, SE = 0·008, t = 4·0), in line with the findings in previous research. We did not find any significant effect of age or gender on the correlation coefficients between mother and child speech.
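The mechanics of such a likelihood ratio test can be sketched in Python as follows. The log-likelihood values below are hypothetical, chosen only so that the statistic matches the χ2 value reported above; for one degree of freedom, the chi-square survival function reduces to erfc(√(x/2)), so no statistics library is needed.

```python
# Minimal sketch of a 1-df likelihood ratio test comparing nested models
# fit by maximum likelihood (as with lmer(..., REML = FALSE) above).
import math

def lrt_1df(loglik_full, loglik_reduced):
    """Chi-square statistic and p-value for a 1-df likelihood ratio test."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    # For df = 1, the chi-square survival function is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Hypothetical log-likelihoods chosen so the statistic equals 15.5.
stat, p = lrt_1df(loglik_full=-102.3, loglik_reduced=-110.05)
```

With these inputs the test yields a statistic of 15·5 and a p-value well below ·001, matching the comparison reported in the text.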
In sum, there was a significant correlation between the mother's and the child's speech for many acoustic measures at both the conversational block and turn levels, particularly in pitch measures, but these correlations were small. We did not find an effect of age or gender on the correlation coefficients of mother–child speech at either level of analysis, although the clustering of our infants between 20 and 25 months may have limited our ability to find an age effect. We found a higher correlation between the mother's and the child's speech at the turn level than at the block level in pitch-related measures.
Speaker order effects at the turn and block level
In this section, we explore our second question of whether there is any effect of speaker order, i.e. of who initiates the turn or the block, on the pattern of mother–child acoustic similarity. Although the previous section showed a tendency toward greater correlation coefficients for child-to-mother than for mother-to-child turns, that analysis was based on a substantially reduced number of datapoints, because the raw values were averaged within each block and then collapsed into a single correlation coefficient per recording day. In this section, we conduct a more targeted investigation of the effects induced by the linear order of the speakers in the conversation, constructing statistical models on a more fine-grained dataset. In addition to the effects of speaker order at conversational turns, we also investigate whether the conversational patterns in mother–child speech depend on who initiated the series of conversational turns in a conversational block, and whether turn type interacts with the initiator of the block.
We derived this more targeted dataset by calculating the correlation coefficients in mother–child speech by turn type for each block. For this calculation, we selected conversational blocks containing at least three conversational turns of each type, mother-to-child (M-to-C) and child-to-mother (C-to-M). We then subjected these data directly to a statistical test comparing the magnitude of the correlation coefficients across turn types.
To investigate the effects of turn type on the strength of the correlation, we constructed a linear mixed-effects model with the correlation coefficient of conversational turns in each block as the dependent variable, turn type (C-to-M and M-to-C) as the fixed effect, and the mother–child dyads (13 groups) and the acoustic measures (6 parameters) as random factors. The formula used for this model was coefficient ~ turntype + (1|dyad) + (1|measure). A likelihood ratio test comparing the full model with the intercepts-only model showed that turn type has a significant effect on the correlation coefficients (χ 2 (1) = 11·1, p < ·001), with C-to-M turns having a greater correlation coefficient (Estimate = 0·03, SE = 0·009, t = 3·3) than M-to-C turns. This result thus indicates that mothers tend to accommodate their speech to their child, increasing the similarity of their mean pitch in the interaction, more than children do to their mothers (Figure 4).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160921005612597-0706:S0305000915000203:S0305000915000203_fig4g.gif?pub-status=live)
Fig. 4. Effects of turn type (C-to-M and M-to-C turns) on the strength of correlation in conversational turns.
We next analyzed the effects of turn type on each speaker's response time to the other speaker. Time stamps at each conversational turn were extracted from the LENA output. Out of the total of 78,613 segments, there were 15,204 C-to-M and 14,062 M-to-C conversational turns. As stated in the ‘Introduction’, we constructed a subset of the master data by selecting only the segments labeled as spoken by either the mother or the child from the original LENA-generated dataset to be used for our study. Therefore, many of the M-to-C or C-to-M turns in our data contained intervening segments of other types, for example other child speech (CXN) or overlapping speech (OLN). Nevertheless, a robust pattern of longer response time at the M-to-C turn compared to the C-to-M turn was found (see Figure 5). The response time at M-to-C turns was longer (M = 1·92 s, SD = 2·43 s) compared to the response time at C-to-M turns (M = 1·46 s, SD = 2·10 s). This finding is consistent with previous research suggesting longer response times for children than their caregivers (Tice, Bobb & Clark, Reference Tice (Casillas), Bobb and Clark2011). There is relatively little research on the response time in mother–child interaction, but response times in child-to-child exchanges have been measured at 1·5 to 2·1 seconds (Garvey & Berninger, Reference Garvey and Berninger1981; Lieberman & Garvey, Reference Lieberman and Garvey1977).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160921005612597-0706:S0305000915000203:S0305000915000203_fig5g.gif?pub-status=live)
Fig. 5. Mean response time as a function of turn type and block type. C-to-M: response time between a child utterance and a maternal response. M-to-C: response time between a maternal utterance and a child response.
To statistically test the effect of turn type on response time, we again constructed a mixed-effects model. Turn type (M-to-C and C-to-M) and block type (child-initiated and mother-initiated) were entered as fixed effects with an interaction term between them. We included a random intercept for the mother–child dyads. The R formula used for this model was responsetime ~ turntype * blocktype + (1|dyad). Likelihood ratio tests showed that the full model accounts for the data significantly better than the one without an interaction term (χ 2 (1) = 18·0, p < ·001) or other reduced models, i.e. the one with only the turn type (χ 2 (2) = 19·3, p < ·001) or only the block type (χ 2 (2) = 312·8, p < ·001). The effect of turn type reflects the significantly longer response time at the M-to-C than the C-to-M turn type (Estimate = 0·6, SE = 0·04, t = 14·4). The effect of block type reflects a significantly longer response time in the child-initiated than the mother-initiated blocks (Estimate = 0·15, SE = 0·04, t = 3·8). The significant interaction between turn type and block type reflects the shorter mean response time at C-to-M turns in mother-initiated blocks (1·35 s) than in child-initiated blocks (1·52 s), and the shorter mean response time at M-to-C turns in child-initiated blocks (1·87 s) than in mother-initiated blocks (1·96 s). This interaction suggests that speakers tend to respond more quickly in blocks that they themselves initiated.
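As a hedged illustration of how such response times can be derived from time stamps, the sketch below assumes a simplified segment format of (speaker, onset, offset) tuples with no intervening segments; as noted above, the real LENA data also contained intervening segments of other types, which are ignored here. The function name and toy values are our own.

```python
# Hypothetical sketch (not the authors' code) of deriving response times
# from time-stamped segments; "M" = mother, "C" = child, times in seconds.
def response_times(segments):
    """Collect gaps between one speaker's offset and the other's onset,
    keyed by turn type (C-to-M: mother responds; M-to-C: child responds)."""
    out = {"C-to-M": [], "M-to-C": []}
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(segments, segments[1:]):
        if spk_a == "C" and spk_b == "M":
            out["C-to-M"].append(start_b - end_a)
        elif spk_a == "M" and spk_b == "C":
            out["M-to-C"].append(start_b - end_a)
    return out

# Toy exchange: mother speaks, child replies after 1.9 s, mother after 1.5 s,
# echoing the reported pattern of longer M-to-C than C-to-M response times.
segments = [("M", 0.0, 1.2), ("C", 3.1, 4.0), ("M", 5.5, 6.0)]
rts = response_times(segments)
```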
We next investigated the effect of initiator on duration of speech segments produced by each speaker, and found that both types of speakers produced longer segment durations in conversational blocks that they initiated (see Figure 6). That is, the mean duration of maternal utterances is longer in the mother-initiated (M = 1·53 s, SD = 0·88) than in the child-initiated (M = 1·35 s, SD = 0·66) blocks, whereas the mean duration of child utterances was longer in the child-initiated (M = 1·01 s, SD = 0·79) than in the mother-initiated (M = 0·98 s, SD = 0·7) blocks.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160921005612597-0706:S0305000915000203:S0305000915000203_fig6g.gif?pub-status=live)
Fig. 6. Mean segment duration as a function of speaker and block type.
We constructed a mixed-effects model including speaker and block type (child-initiated and mother-initiated) as fixed factors with an interaction term between the two, and the mother–child dyad as a random factor. The R formula used for this analysis is duration ~ speaker * blocktype + (1|dyad). Likelihood ratio tests showed that the full model with the interaction term fits the data significantly better than the model with only the speaker (χ 2 (2) = 382·2, p < ·001), with only the block type (χ 2 (2) = 5,950, p < ·001), or the one without the interaction (χ 2 (1) = 291·2, p < ·001). The significant interaction between speaker and block type (Estimate = –0·2, SE = 0·01, t = –17·1) reflects the tendency, reported above, for both mothers and children to produce longer segments in blocks that they initiated than in blocks initiated by their conversational partner.
In sum, we found evidence of an effect of turn type on the correlation of mean pitch and on response time in mother–child conversation. The greater correlation coefficients in C-to-M turns for mean and maximum pitch suggest that the mother adapts her speech to the child more actively than vice versa. In addition, our results revealed an initiator effect, which interacted with turn type for response time and with speaker for segment duration. In other words, both children and mothers spoke for longer durations and responded more quickly in conversational blocks that they themselves initiated.
Testing convergence over a conversational block
In this section, we investigate whether the speech of mother and child becomes more similar over the course of a conversation, focusing on the trend within conversational blocks. Much previous research reporting entrainment in adult speech is based on datasets recorded in 30-minute or 1-hour sessions. Often, the mean of the entire session is compared with that of the conversational partner to check similarity, or the data are divided in half and the mean acoustic values in the earlier part of the session are compared with those in the later part to examine convergence (e.g. Levitan & Hirschberg, Reference Levitan and Hirschberg2011). When such a method is applied to a day-long recording like ours, however, information that a more temporally fine-grained reduction of the data could provide is likely to be lost. In this regard, the conversational block in our data provides an efficient unit for dividing the long stretch of recording into ecologically valid segments at a smaller time scale, without the radical data reduction that collapsing data across the entire day would entail.
We compared the acoustic values of mother and child speech at the first and last conversational turn of each conversational block within each of the thirteen dyads. For this analysis, we selected the conversational blocks containing at least two conversational turns, which totaled 5,726 blocks out of the original 9,170 blocks that contained at least one turn. We then extracted the first and last conversational turn of each block, took the mean differences between the two interlocutors in the first and last turn of the block across the entire session for each dyad, and compared them using two-tailed paired t-tests. If mother–child speech converges, we would expect the difference between the mother and child speech to be smaller towards the end of a conversational block compared to the beginning.
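The convergence test can be sketched as follows; the per-dyad values below are invented for illustration (the paper used two-tailed paired t-tests over the thirteen dyads), and only the t statistic is computed here.

```python
# Illustrative sketch of the convergence analysis: per-dyad mean
# mother-child differences in an acoustic measure at the first vs. last
# turn of a block are compared with a paired t-test. Hypothetical data.
import statistics

def paired_t(first_diffs, last_diffs):
    """Paired t statistic for per-dyad (first-turn, last-turn) differences."""
    d = [f - l for f, l in zip(first_diffs, last_diffs)]
    return statistics.fmean(d) / (statistics.stdev(d) / len(d) ** 0.5)

# Convergence predicts first-turn differences exceed last-turn ones (t > 0).
first = [92.0, 88.5, 95.1, 90.3, 87.9]  # hypothetical mean pitch gaps (Hz)
last = [85.2, 86.0, 90.4, 84.1, 86.5]
t = paired_t(first, last)
```

A positive t here corresponds to the predicted direction: smaller mother–child differences at the end of a block than at its beginning.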
We found that the mean differences of the acoustic values in the first and last turn of the conversational block between mother and child differed significantly in mean and minimum pitch (see Table 4), with mother and child becoming more similar over the course of a conversational block. This convergence in pitch measures was small, and driven by a decrement in pitch on the part of the child over the course of the conversation. No other measures showed significant convergence.
Table 4. Testing convergence by comparing the difference in acoustic values in the first and last conversational turns between mother and child in conversational blocks
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160921005612597-0706:S0305000915000203:S0305000915000203_tab4.gif?pub-status=live)
DISCUSSION
With respect to our first question, we found evidence of similarity in mother–child utterances across a variety of different analyses of our data, particularly in pitch measures. This effect of similarity cannot be accounted for by overall similarity of the mother–child dyad, since our analyses captured variance across blocks, turns, and utterances within a given dyad. Our findings therefore provide some evidence of an entrainment effect whereby mother and child influence each other toward more similar speech patterns across conversational blocks. However, given that the size of the correlations is quite small, the implications of these effects must not be overstated. It is important to remember that mother and child will start out very similar in these global speech measures due to shared environmental and (often) genetic influences. Therefore, detecting this subtle variance across time between mother and child may be very difficult over and above this baseline similarity. Nonetheless, these findings build on prior work that analyzed across dyads, suggesting that even with our more rigorous analysis there is evidence for entrainment effects.
Our second question tackled the extent to which initiator and respondent effects influenced mother–child speech patterns. We found stronger correlations in the mean and the maximum pitch at the turn level when mothers responded to their child than vice versa, suggesting that mothers adapt their speech to their child more than the reverse. Nonetheless, in some measures, correlations of child responses to maternal utterances reached significance, suggesting at least some adaptation on the part of the child. Additionally, children (and mothers) produced more mature speech forms in conversations that they themselves initiated (i.e. shorter response latencies and longer segment durations). To our knowledge, this is the first study to identify such an effect, and it has strong implications for the role of locus of control in the development of language. Our findings suggest that providing infants/toddlers with opportunities to initiate dialog may drive learning. This possibility suggests that the notion of language environment quality for infants needs to be expanded to include consideration of agency on the part of the infant, and converges with research finding that infants learn better when mothers are responsive to their requests for linguistic input (e.g. Begus, Gliga & Southgate, Reference Begus, Gliga and Southgage2014).
Our third and final question looked for evidence of convergence of acoustic measures between mother and child, rather than simply similarity, across a conversational block. We found a significant effect of convergence in mean and minimum pitch (i.e. mother and child became more similar to each other in mean and minimum pitch over the course of a conversation), and no other significant changes in the acoustic similarity across a conversational block. It is noteworthy that the convergence in pitch was driven by a decrease in the child's mean pitch and not that of the mother. While this may indicate an entrainment effect, it is also possible that this reflects a more general pitch decrement over the course of a conversation, and is not driven by the mother's pitch.
The finding that child-to-mother turns show a stronger correlation than mother-to-child turns bears additional consideration. This systematic pattern lends credence to the notion that while our correlations are small, we are tapping into a real phenomenon in mother–child interactions, in that it is consistent with the findings of previous research. In McRoberts and Best (Reference McRoberts and Best1997), a similar direction of adjustment is found: both the mother and father increased their mean pitch when interacting with their child compared to their adult-directed speech, but the child's pitch was not significantly different from their baseline pitch, e.g. the pitch of vocalization when alone. Our results replicate this by showing that mothers entrain more than infants/toddlers do, but extend it by suggesting that finer-grained analyses can reveal small but consistent alterations in the child's responses to their mother (cf. Table 3 and further discussion below). Thus, we should not conclude that children are entirely passive, a notion that is also inappropriate given the initiator effects found. Indeed, the initiator influences on utterance duration and response time for both mother and child indicate that children are active contributors to the paralinguistic interactions in these conversations.
Our study diverged from previous research in that we found entrainment of pitch from child to mother, while some previous studies, such as McRoberts and Best (Reference McRoberts and Best1997) and Siegel et al. (Reference Siegel, Cooper, Morgan and Brenneise-Sarshad1990), did not. One possible reason for these differing results could be the coarse timescale of analysis in those studies and/or their smaller sample sizes. In both the McRoberts and Best and the Siegel et al. studies, comparisons were made of mean pitch across different contexts (i.e. comparing interactions with mother and interactions with father), rather than more direct comparisons of adjacent utterances. Although Siegel et al. did perform an analysis examining adjacent turns, and McRoberts and Best also report an attempted finer-grained timescale analysis, in both cases this was done with a relatively small sample and might not have had enough power to detect the subtle effects found in our analysis (although see below for another possibility related to developmental changes). Previous literature on adult speech indicates that the correlation between interlocutors increases at a local level. For example, Levitan and Hirschberg (Reference Levitan and Hirschberg2011) found significant correlations in all acoustic variables related to pitch, intensity, and vocal quality at the turn level, but only some of them displayed correlation at the session level. Likewise, Gorisch, Wells, and Brown (Reference Gorisch, Wells and Brown2012) found that the pitch contour of insertions (short utterances) was significantly more similar to that of the immediately preceding turn, suggesting local management of pitch contour. Thus our finding of higher correlation coefficients at the turn level than at the block level for pitch-related measures may reflect a similar effect of tighter correlation at a local level.
The tighter correlation observed over a short time span of conversation in various studies suggests that the entrainment in mother–child speech is essentially a process of continuously coordinating the speech in response to the interlocutor at a local level of exchange.
One important difference between our study and many of the studies reported in the ‘Introduction’ is the age of our sample. Most of the studies to date have been performed on relatively young infants (e.g. 3-month-olds), with the oldest ages being 9–12 months (Siegel et al., Reference Siegel, Cooper, Morgan and Brenneise-Sarshad1990) and up to 17 months (McRoberts & Best, Reference McRoberts and Best1997). The one exception was the Elias and Broerse (Reference Elias and Broerse1996) study, which examined infants from ages 0;3 up to 2;0 and found developmental changes in overlapping speech across the ages studied. The youngest child in our sample was 13 months old, with the oldest being 2;6. It is therefore possible that the differences in pitch entrainment between our sample and prior research might be driven by developmental changes. Our analyses did not find strong support for age effects, but the clustering of our infants' ages around 20–25 months may have obscured developmental effects. Similarly, McRoberts and Best did not find evidence of developmental differences in their longitudinal case study. Nevertheless, given the small amount of data in that study, it remains possible that developmental changes that could account for these differences take place between the first and second year of life but went undetected.
Our findings suggest an important role for robust automated analyses in examining questions regarding the acoustic properties of mother–child interactions. This is an exciting time for this kind of research, as there is a convergence of methodological, analytical, and statistical innovations. Recent work by Buder, Warlaumont, Oller, and Chorna (Reference Buder, Warlaumont, Oller and Chorna2010), for example, demonstrates a unique analytic approach to the automated analysis of pitch dynamics between mother and child that may allow for the detection of more complex relationships than simple similarity or convergence, and may be generalizable to multiple other acoustic features.
CONCLUSION
Our study examined the dynamic relationship between the acoustic properties of mother and infant/toddler speech, by examining correlations within mother–child dyads, in order to reduce the influence of prior resemblance on measures of entrainment. We found small, but significant effects of entrainment in pitch, and less robust effects in utterance duration and speaking rate. We also found evidence of convergence in pitch measures across a single conversational block between mother and child, but these effects were small and driven by a general decrease in pitch on the part of the child. In general, maternal entrainment toward the child was stronger than vice versa. However, child entrainment toward mother was also found. In addition, we found effects of the initiator of a conversation, with longer utterances and shorter response latencies associated with the initiator of a conversational block. While mothers show more mature conversational capabilities (more entrainment, shorter response latencies, longer utterances), our findings converge with prior research to highlight the active role of young children in the conversational exchange.