Difficulty in identifying and expressing emotional states is one of the major symptoms of schizophrenia as well as a predictor of social adjustment (Brazo, Beaucousin, Lecardeur, Razafimandimby, & Dollfus, 2014; Hooker & Park, 2002; Kee, Green, Mintz, & Brekke, 2003). This “flat” affective state is present in at least 66% of patients with schizophrenia (Trémeau et al., 2005).
Difficulty in identifying the emotional state of others from their facial expressions (Kohler, Walker, Martin, Healey, & Moberg, 2010) and from their tone of voice (Hoekert, Kahn, Pijnenborg, & Aleman, 2007) has been well established as one of the main predictors of deterioration in all phases of the disorder: from the first episode (Horan et al., 2012), through its chronic course (Green et al., 2012), and even in high-risk states (Allott et al., 2014).
These difficulties are part of the alterations in social cognition (described as the ability to construct representations of the relationship between oneself and others, and to use them flexibly in behavior regulation; Adolphs, 2001), along with social perception, theory of mind, social knowledge and attributional style, which have been consistently linked to schizophrenia (Couture, Penn, & Roberts, 2006; Green & Horan, 2010; Kring & Ellis, 2013; Penn, Sanna, & Roberts, 2008).
Similarly, the emotional expressive ability is also impaired (Cohen, Kim, & Najolia, 2013), although there is evidence (Kring & Moran, 2008) indicating the dissociation between expressiveness and emotional experience, since flat affect does not necessarily lead to a reduction in emotional experience.
Prosody has been studied to a lesser extent than the ability to identify and express emotions facially. In speech, not only the changes in the melody produced by variations in the frequency of opening and closing of the vocal cords are perceived, but also the changes of rhythm, speed, intonation, pauses, intensity and other spectral alterations that are perceived by the listener as melodic variations, and interpreted subjectively as paralinguistic signals, essential for the understanding and interpretation of the utterance and the identification of the emotional and motivational state of the speaker (Patel, 2008).
Receptive prosody studies have shown that patients exposed to voice samples with different emotions and questioned about the emotion they have heard show marked difficulty in identifying such emotions. An extensive meta-analysis (Hoekert et al., 2007), comprising twenty articles with a total of 663 patients, described a significant effect size (d = –1.24) when comparing the performance of participants with schizophrenia and controls.
Regarding expressive prosody, the studies that have collected voice samples and have asked participants to encode different emotional states have concluded that there is a marked difficulty in expressing emotions verbally (Putnam & Kring, 2007). Hoekert et al.’s (2007) meta-analysis also found a significant effect size (d = –1.11) across the eleven studies reviewed, with data from 186 patients.
Most of the studies on expressive prosody have used tasks in which participants were asked to encode various emotional states perceived in voice (Hoekert et al., 2007); fewer studies have asked their participants to spontaneously narrate sad, happy and angry events they have experienced (Alpert, Rosenberg, Pouget, & Shaw, 2000; Shaw et al., 1999). These methods have, in our view, reduced the ecological validity of the study of expressive prosody, since high-intensity discrete emotional states (anger, sadness, fear, etc.) are rarely expressed in colloquial language in the manner described in these studies. In everyday social interaction, the affective state, intentions and aptitude are constantly and inexorably present in the tone of voice, though less explicitly. An alternative method is used by Cohen et al. (Cohen, Iglesias, & Minor, 2009; Cohen & Hong, 2011), consisting of analyzing the prosody of verbal responses triggered by the presentation of emotional stimuli from the “International Affective Picture System” (Lang, Bradley, & Cuthbert, 2005).
In contrast to these methods, only three studies have used emotionally neutral readings (Cohen, Alpert, Nienow, Dinzeo, & Docherty, 2008; Dickey et al., 2012; Leentjens, Wielaert, Harskamp, & Wilmink, 1998). To evaluate non-emotional expressive prosody in schizophrenia, this paper presents a potentially useful technique, using an emotionally neutral text that has proven useful in quantifying prosody in other disorders (Martínez-Sánchez, Meilán, Pérez, Carro, & Arana, 2012). The procedure consists of the semi-automatic analysis of the variations in the trajectory of pitch and height perception of the fundamental frequency (F0, the frequency of the vocal cords’ vibrations) of the vocalic syllable nuclei, as this is the point of greatest loudness, using a purely acoustic base to extract the harmonic peak without the need for phonetic segmentation. The data derived from the behavior of the F0 in the intensity of vocalic segments yield a complete melodic pattern of the speaker that shows significant changes in tone, both upward (prosodic peaks) and downward (prosodic valleys), within the syllabic nucleus, as well as between different nuclei (see Annex I for a description of the prosodic parameters used).
This procedure has many advantages over previously used prosodic analyses, since it increases the reliability and validity of results, speeds up the production of prosodic parameters and minimizes the influence of the coding skills of the subject, as it uses an emotionally neutral text. It also increases its ecological validity, as it is closer to the colloquial language used in everyday interactions. Finally, the procedure does not require phonetic segmentation, which virtually eliminates any errors the experimenter could commit in the process of quantification, as well as any differences in estimation between various experimenters.
In the present work, this procedure is used to objectively quantify the deficits in expressive prosody in schizophrenia, as well as to assess its discriminatory power between groups. It is hypothesized that the group of schizophrenia patients will show a significantly flatter prosodic profile than the control group, characterized by less variability in the dynamics and the path of the vocalic nuclei and in voice intensity, as well as a greater number of pauses.
Method
Statistical design
A cross-sectional, analytical, observational and retrospective design was used.
Participants
A sample of 80 participants, divided into two groups, was recruited: 45 patients diagnosed with schizophrenia and 35 asymptomatic controls.
The group of patients (M age = 39.49, SD = 10.89; 71.1% male) was recruited from various Mental Health units of the Andalusian Health Service of the province of Jaen (Therapeutic and Rehabilitation Community of the Jaen area and the Mental Health Clinics of Martos and Andujar). All participants were evaluated using the clinical version of the Structured Clinical Interview (First, Spitzer, Gibbon, & Williams, 1996), following the criteria established in the DSM-IV-TR (2000). The average duration of the disorder, from its initial diagnosis, was 21.17 years (SD = 5.65), the mean number of relapses was 3.47 (SD = 1.89) and the mean time elapsed since the last relapse was 45.88 days (SD = 25.67). The average dose in chlorpromazine equivalent units was 669.88 mg/day (SD = 559.31).
Meanwhile, the control group (M age = 35.34, SD = 10.48; 62.9% male) was matched with the patient sample for age, sex and educational level, and was extracted from the same social environment as the patient sample. They had no history of mental or neurological disorders or drug or alcohol abuse, which were considered exclusion criteria.
Materials and Procedure
The Brief Psychiatric Rating Scale (BPRS; Lukoff, Nuechterlein, & Ventura, 1986) was administered in the Spanish validation by Peralta and Cuesta (1994). The 0–4 point response range was used, as it increases the inter-rater reliability (Bech, Larsen, & Andersen, 1988). The scale is composed of 18 items, to be administered by an experienced therapist after a semi-structured interview (15–25 minutes in length). Each item is scored using a Likert-type scale with 5 levels of intensity, where 0 represents the absence of the symptom and 4 represents extreme severity.
Speech was recorded with a professional Fostex FR-2LE recorder, at a resolution of 24 bits and a 48 kHz sampling rate, using a cardioid AKG D3700S microphone. Samples were edited using the Praat acoustic voice analysis program, version 5.1.42 (Boersma & Weenink, 2013). Annex I contains the definition of the various parameters used.
The study was conducted between June and December of 2013. All participants were adequately informed in order to sign their consent according to the protocols of the Bioethics Committees of the participating institutions. This study complied with the ethical principles of the Declaration of Helsinki for medical research involving human subjects. The procedure was performed in a single session, initially collecting socio-demographic information (age, marital status, etc.) and clinical data (age at onset of symptoms and diagnosis, duration of illness, number of admissions, time since last admission, etc.) as well as administering the BPRS. The doses of chlorpromazine equivalent units ingested by patients were also registered. Subsequently, speech recordings were made.
The task entailed reading the first paragraph (405 syllables) of “Don Quijote” by Miguel de Cervantes. The recordings were performed in a silent room (but not acoustically isolated), placing the microphone 8 cm away and at a 45° angle from the participant’s mouth in order to prevent any aerodynamic noise.
To quantify prosodic patterns, the automatic prosodic transcription of the recordings was performed using the algorithms implemented by Mertens (2004) in the Praat program (Boersma & Weenink, 2013). The prosodic speech profile was estimated by analyzing the variations in the height trajectory and pitch perception (prosodic peaks and valleys) of the F0 of the voiced vocalic syllable nuclei. Each nucleus was delimited around its intensity peak, at –3 dB to the left and –9 dB to the right, in order to represent the melodic movements perceived by the human ear. The left limit (–3 dB) eliminates most of the microprosodic disturbances and stylizes the beginning of the syllable, while the right limit (–9 dB) preserves the variations in tone of accented vowels.
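The asymmetric delimitation around the intensity peak can be illustrated with a minimal Python sketch. It is not the algorithm implemented by Mertens (2004): the function name `nucleus_bounds`, the frame-by-frame intensity list and the example values are assumptions made purely for illustration, and the real procedure also applies voicing and segmentation criteria that are omitted here.

```python
from typing import List, Tuple


def nucleus_bounds(intensity_db: List[float],
                   left_drop_db: float = 3.0,
                   right_drop_db: float = 9.0) -> Tuple[int, int]:
    """Return (start, end) frame indices of a nucleus around the intensity peak.

    Hypothetical helper: the span grows to the left while intensity stays
    within `left_drop_db` of the peak, and to the right while it stays
    within `right_drop_db` of the peak.
    """
    if not intensity_db:
        raise ValueError("intensity contour is empty")

    # Index of the (first) intensity maximum.
    peak = max(range(len(intensity_db)), key=intensity_db.__getitem__)
    peak_db = intensity_db[peak]

    start = peak
    while start > 0 and intensity_db[start - 1] >= peak_db - left_drop_db:
        start -= 1

    end = peak
    while (end < len(intensity_db) - 1
           and intensity_db[end + 1] >= peak_db - right_drop_db):
        end += 1

    return start, end


# Invented intensity contour (dB) for a single syllable, one value per frame.
contour = [52.0, 58.0, 63.0, 66.0, 64.0, 60.0, 58.0, 55.0]
print(nucleus_bounds(contour))  # (2, 6)
```

In this toy contour the span closes quickly on the left, where the onset is excluded once intensity falls more than 3 dB below the peak, and extends further on the right, which is the asymmetry the two limits are meant to produce.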
In this paper, a detection range for the F0 of 65–650 Hz was established for 0.005 s windows, and the following threshold values were established for the automatic segmentation in the stylization of the algorithm: Glissando = 0.32/T², DG = 30, dmin = 0.05. To determine the presence of a vowel, a 0.32/T² semitone threshold was allocated, where T is the duration of the vowel in seconds. If the rate of change of the tone is higher than the threshold defined by the perceptual values of voice detection, a value proportional to the glissando threshold (continuous slippage of the melodic line in the same syllable) was assigned, whereas if the value is lower than the threshold, the same value as the median of the voice sample analyzed was assigned. It should be noted that while the standard psychoacoustic threshold for isolated vowels is G = 0.16/T², voice flow is rarely linear during natural speech; hence, the value assigned in the present study has been shown to model voice variations more adequately, especially in automatic transcription.
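To make the duration dependence of the glissando threshold concrete, the following Python sketch computes G/T² and compares it with the rate of pitch change across a vowel. It assumes that the threshold is expressed in semitones per second and that the pitch change is measured between the F0 at the start and end of the vowel; the function names and the example frequencies are hypothetical and do not come from the software used in the study.

```python
import math


def glissando_threshold(duration_s: float, g: float = 0.32) -> float:
    """Threshold (assumed to be in semitones/second) for a vowel lasting T s: g / T^2."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return g / duration_s ** 2


def semitone_rate(f0_start_hz: float, f0_end_hz: float, duration_s: float) -> float:
    """Rate of pitch change in semitones/second between two F0 values."""
    semitones = 12.0 * math.log2(f0_end_hz / f0_start_hz)
    return semitones / duration_s


def is_glissando(f0_start_hz: float, f0_end_hz: float,
                 duration_s: float, g: float = 0.32) -> bool:
    """True when the absolute rate of pitch change exceeds the threshold."""
    rate = abs(semitone_rate(f0_start_hz, f0_end_hz, duration_s))
    return rate > glissando_threshold(duration_s, g)


# A 100 -> 110 Hz rise over 0.1 s is about 16.5 ST/s.
print(is_glissando(100.0, 110.0, 0.1))           # False (threshold about 32 ST/s)
print(is_glissando(100.0, 110.0, 0.1, g=0.16))   # True  (threshold about 16 ST/s)
```

With these invented values, the same pitch movement counts as a glissando under the isolated-vowel constant (0.16) but not under the more conservative constant adopted for continuous speech (0.32).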
Data analyses
The IBM SPSS (version 21) statistical package was used. Student’s t test for independent samples was used to define the differential prosodic profile between groups, and Pearson correlations were used to assess the relationship between variables. Finally, a discriminant analysis was performed in order to assess the ability of prosodic variables to classify subjects into the two groups.
Results
Statistical tests were conducted to check for intergroup differences in the sociodemographic variables. No differences were observed for age (t(79) = –1.76; p = .091), educational level (measured in months of schooling; t(79) = 1.74; p = .085), or the distribution of sex per group (χ² = .611; p = .434).
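As a rough check on the sex-distribution test, the Python sketch below recomputes a Pearson chi-square without continuity correction from the group sizes and percentages given in the Participants section. The cell counts are not reported in the text; they are assumed here by rounding 71.1% of 45 and 62.9% of 35 to whole participants, and the helper name `chi_square_2x2` is hypothetical.

```python
from typing import List


def chi_square_2x2(table: List[List[int]]) -> float:
    """Pearson chi-square (no continuity correction) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Assumed counts, derived from the reported percentages (not given in the text).
patients_male = round(0.711 * 45)   # 32 of 45 patients
controls_male = round(0.629 * 35)   # 22 of 35 controls

table = [
    [patients_male, 45 - patients_male],   # patients: male, female
    [controls_male, 35 - controls_male],   # controls: male, female
]
print(round(chi_square_2x2(table), 3))  # 0.611
```

Under these assumed counts the statistic agrees with the reported value of .611.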
The mean comparisons show significant intergroup differences in all the prosodic variables except those dependent on the fundamental frequency (F0) (Table 1). The main prosodic parameters (valleys, prosodic dynamics, inter- and intra-syllabic and phonation trajectories) of the schizophrenia patient group were significantly lower than those of the control group (Figure 1), showing, in general, sparsely prosodic and melodically flatter speech than that of controls (Figure 2). Moreover, patients spent more time performing the task, made a greater proportion of pauses and exhibited a significantly lower voice intensity, accentuating the perception of dysprosody.
Note: dB = Decibels; Hz = Hertz; F0 = Fundamental frequency; ST/s= Semitones/second.
As expected, education level significantly affected the performance of the task in the control group. For example, in the schizophrenia group, the higher the educational level, the lower the time spent on reading the text and the lower the proportion of pauses performed. It should be noted that for a test to be used for screening, it must be scarcely sensitive to the effects of education in the pathological group; the poor correlation between prosodic variables and educational level in the patient group for this task can be appreciated in Table 2.
Note: + p < .006 (Bonferroni’s correction).
The years of chronicity of the disorder is the clinical variable most strongly associated with prosodic variables in the schizophrenia patient group; the greater the number of years elapsed since diagnosis of the disorder, the smaller the intrasyllabic (r = –.377; p = .028) and phonation (r = –.422; p = .013) trajectories were. Similarly, the time elapsed since the last relapse correlated with the phonation trajectory (r = .404; p = .018). Drug treatment did not induce significant changes in prosody, although a trend was observed which suggests that the higher the dose of chlorpromazine was, the less time was spent on the reading task (r = –.280; p = .063) and the fewer pauses were made (r = –.251; p = .091). Moreover, the BPRS scores were not significantly correlated with the prosodic variables, although the positive symptoms scale score correlated negatively with the intensity of voice (r = –.346; p = .021). The scale’s item that evaluates blunted affect did not correlate significantly with any prosodic parameter; its highest correlation was with the “Intrasyllabic trajectory” variable (r = –.219; p = .118). Apart from the association between positive symptoms and voice intensity, no significant correlations were found between the total scores of the BPRS scale and the prosodic parameters evaluated, or between these parameters and the scores obtained in the positive and negative symptoms subscales.
Finally, a discriminant analysis was performed in order to assess the ability of the prosodic parameters studied to distinguish between subjects in the two groups. The canonical discriminant function explained 100% of the variance (canonical correlation = .828; Wilks’ λ = .314; χ²(14) = 82.16; p < .001), with Intensity, Intrasyllabic trajectory, Total length, Phonation trajectory and Pause rate being the variables with the greatest discriminating power (with standardized canonical discriminant function coefficients of .630, .541, –.403, .365 and –.344, respectively). Conversely, the variables dependent on the fundamental frequency (M F0 = –.003; F0 Range = –.026; F0 SD = .077) were those that showed the least discriminatory power. Overall, the discriminant analysis allowed for the correct classification of 93.8% of the original data (Table 3); in the cross-validation analysis, 87.5% of cases were correctly classified.
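The reported discriminant statistics are mutually consistent, which can be verified with a short Python sketch. It assumes that the reported λ is Wilks’ lambda, that the 14 degrees of freedom correspond to 14 predictors in a two-group analysis, and that the chi-square was obtained with Bartlett’s approximation; none of these details is stated explicitly in the text, and the function names are hypothetical.

```python
import math


def wilks_lambda_single_function(canonical_r: float) -> float:
    """Wilks' lambda when there is a single discriminant function: 1 - Rc^2."""
    return 1.0 - canonical_r ** 2


def bartlett_chi2(wilks: float, n_cases: int,
                  n_predictors: int, n_groups: int) -> float:
    """Bartlett's chi-square approximation for testing Wilks' lambda."""
    factor = n_cases - 1 - (n_predictors + n_groups) / 2
    return -factor * math.log(wilks)


N_CASES = 80        # 45 patients + 35 controls
N_GROUPS = 2
N_PREDICTORS = 14   # assumed from the reported degrees of freedom

wilks = wilks_lambda_single_function(0.828)
chi2 = bartlett_chi2(wilks, N_CASES, N_PREDICTORS, N_GROUPS)

print(round(wilks, 3))                 # 0.314
print(round(chi2, 2))                  # 82.15, close to the reported 82.16
print(N_PREDICTORS * (N_GROUPS - 1))   # 14 degrees of freedom

# Classification rates expressed as whole cases out of 80.
print(round(0.938 * N_CASES), round(0.875 * N_CASES))  # 75 70
```

Under these assumptions, the classification rates correspond to 75 of the 80 original cases and 70 of the 80 cross-validated cases, and the small gap between 82.15 and 82.16 is compatible with rounding of the canonical correlation.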
Note: 93.8 % of the original cases were classified correctly.
Discussion
The objective of this research was to study the differential expressive prosodic patterns in a group of patients with schizophrenia and in a group of asymptomatic subjects, using an unemotional text as the reading task. The results show the existence of marked intergroup differences, with the clinical group exhibiting slow, low-intensity speech with many pauses. Moreover, this group’s dysprosody is characterized by scarce tone changes, both within the syllable nucleus and between adjacent syllabic nuclei. All these characteristics yield an “emotionally flattened” speech (Trémeau, 2006).
These results concur with those obtained in several studies. Firstly, the low dynamics of both the inter- and intra-syllabic nuclei observed in our results, despite not having been studied before, are the result of the scarce changes in F0. This is consistent with previous studies (Alpert et al., 2000) reporting that speech in schizophrenia is characterized by few inflections. Secondly, the large number of pauses has also been previously identified as a characteristic of the disease (Alpert, Kotsaftis, & Pouget, 1997; Clemmer, 1980; Cohen & Elvevåg, 2014; Cohen, Mitchell, & Elvevåg, 2014). The pauses can be attributed to the reduction in verbal fluency and in the discrimination of phonemes, processes that are altered in schizophrenia (Johnson-Selfridge & Zalewski, 2001; Kugler & Caudrey, 1983). Thirdly, the voice signal is less intense (Pascual, Solé, Castillón, Abadía, & Tejedor, 2005), which also accentuates the perception of dysprosodic speech. Finally, these subjects take longer to complete the task, evidencing the limited cognitive resources available to perform a complex task such as reading (Cohen, McGovern, Dinzeo, & Covington, 2014; Melinder & Barch, 2003).
The parameters dependent on the fundamental frequency (mean, standard deviation and range) showed no differences between groups. This is not unexpected: even though these measures contribute to variations in tone, they are microprosodic measures, as opposed to the suprasegmental parameters studied, which have been shown to be significant intergroup discriminators.
The observed differences in the variations in the syllabic dynamics of the vocalic nucleus are especially noteworthy, as they show a differential prosodic profile in the two groups regarding their melodic slopes. Thus, although the percentage of upward vocalic nuclei (expressed by prosodic peaks) did not differ between the groups, the downward vocalic nuclei (prosodic valleys) were significantly lower in the group with schizophrenia. In the Spanish language, declarative intonation is typically downward, fully explaining the above-mentioned, while interrogative intonation is generally upward, presenting an incomplete utterance (Cantero, 2003). As the text used in the task was a declarative text, the clinical group’s intonation was clearly inadequate, a fact that the listener may perceive as discordant with the meaning of the sentence being read.
The clinical variables have proven to be largely insignificant in expressive prosody. Our results show a correlation between the years of chronicity of the disorder and the impairment of intrasyllabic and phonation trajectories. Although there is a lack of data from other investigations for comparison, it is known that the ability to identify facial and prosodic expressions declines with the duration of illness (Kucharska-Pietura, David, Masiak, & Phillips, 2005; Silver, Goodman, Knoll, Isakov, & Modai, 2005; Silver, Shlomo, Turner, & Gur, 2002). Drug treatment did not induce significant changes in prosody, beyond a non-significant trend toward shorter task duration and fewer pauses at higher doses; similarly, Hoekert et al.’s (2007) meta-analysis did not find any significant relationship between these variables.
The relationship between the increase in positive symptoms and the reduction in the intensity of the voice is paradoxical, because one would expect the negative symptoms to be related to a lower intensity of voice (considering it an indicator of flat prosody), as reported by Cohen and Hong (2011), although their results did not reach statistical significance, while positive symptoms would correlate to an increase in verbal fluency. It is known that positive symptoms correlate with the difficulty in discriminating the fundamental frequency changes in sentences with an emotional content (Matsumoto et al., 2006). Similarly, patients who experience hallucinations and delusions show alterations when identifying changes in the tone of voice (Johns et al., 2001).
Finally, the scores of the BPRS subscales have not yielded any significant correlations with prosody. These scales’ limitations are well-known (Cohen & Elvevåg, 2014; Nicholson, Chapman, & Neufeld, 1995). Even though they are able to identify differences of up to six standard deviations when comparing negative symptoms of patients and controls (Emmerson et al., 2009), they are relatively insensitive to changes in the patient’s condition and they induce response biases that make it difficult for even trained evaluators to notice specific aspects of behavior related to alogia and blunted affect within the patient’s speech (Alpert, Shaw, Pouget, & Lim, 2002). On the other hand, our results coincide with those reported by Cohen et al. (2013), who found no relationship between the symptoms of a schizophrenic group and variables related to the expressive prosody (number of pauses and variability of F0).
While the slow execution of the task (duration and rate of pauses) is conditioned by educational level in both groups (though to a greater extent in the clinical group), the variables related to the trajectory of the syllabic nuclei exhibit no relationship to educational level in either group, showing that the prosody results are scarcely sensitive to cultural level, in line with the results reported by Leentjens et al. (1998).
The procedure used has proven to be, in our view, a valid and reliable alternative for accurately recording non-emotional expressive prosody in colloquial speech, as its results can be related to patients’ everyday interactions with their environment. The small number of existing studies using non-emotional stimuli to assess expressive prosody in schizophrenia is surprising (Cohen et al., 2008; Dickey et al., 2012; Leentjens et al., 1998).
Furthermore, the procedure provides numerous advantages over those traditionally employed. It allows the degree of dysprosody to be quantified objectively in a fast and non-intrusive way, without causing discomfort to the patient. Additionally, the acoustic analyses are highly sensitive to changes in the voice; therefore, they are potentially useful in the study of the evolution of the disorder and in evaluating drug treatments. The discriminating ability obtained (93.8%) is higher than that achieved by Kliper et al. (80.95%; Kliper, Vaizman, Weinshall, & Portuguese, 2010).
This study has several limitations. Firstly, patients were medicated, making it impossible to assess the differential effect of drug treatment on prosody; however, the correlations obtained seem to rule out any connection. Secondly, although several variables that can modulate the results (age, education level, etc.) have been evaluated, it is necessary to expand the number of variables that can influence the results. Thirdly, the possible existence of differences in prosody dependent on the demand of the task should be researched, since the cognitive demand required while reading a neutral text is very different from that required by speech in response to stimuli of an emotional character. The differences in the demands imposed by the tasks can potentially explain the disparity in the results obtained in various investigations, since cognitive resources particularly determine speech production in patients with schizophrenia.
Future studies with larger samples may shed more light on the utility of this procedure and its clinical implications, especially in longitudinal designs. The results obtained support the use of the procedure; however, the acoustic analysis algorithms should be refined in order to achieve higher levels of discriminatory power between groups.
Annex I. Definition of the parameters used