INTRODUCTION
Bipolar disorder (BD) is a complex mood disorder associated with diminished quality of life and global functioning (Coryell et al., 1993; Cramer, Torgersen, & Kringlen, 2010; Freeman et al., 2009; Goodwin & Jamison, 1990; MacQueen, Young, & Joffe, 2001; Morriss et al., 2007; Saarni et al., 2010). A growing body of research suggests that social cognition is impaired in BD, with recent meta-analytic effect size estimates for overall facial emotion perception falling almost a quarter of a standard deviation below the healthy population mean (Rossell, Van Rheenen, Groot, Gogos, & Joshua, 2013; Samamé, Martino, & Strejilevich, 2012; Van Rheenen & Rossell, 2014). Although small, these impairments are nonetheless significant and appear to contribute to psychosocial dysfunction in the disorder (Hoertnagl et al., 2011; Martino, Strejilevich, Fassi, Marengo, & Igoa, 2011; Ryan et al., 2013). However, the predominant focus on the visual emotion processing domain (using facial expressions) in BD has hampered understanding of other social cognitive processes at play. Indeed, there is only preliminary research on prosodic processing or multimodal emotion integration in the disorder, despite the potential for deficits in these processes to detrimentally influence psychosocial outcome as well (Rossell et al., 2013; Van Rheenen & Rossell, 2013a, 2013b; Vederman et al., 2012).
In the natural environment, the stimulation of several sensory modalities occurs simultaneously, and the cognitive mechanisms underpinning the processing of information from these sources are thought to be strongly related (Borod et al., 2000; de Gelder & Jean, 2000). Although perception via a single modality is sufficient in some contexts, the integration of equivalent (also referred to as redundant) information from different sensory channels has been found to augment meaningful and holistic perceptual decoding by improving both the accuracy and the speed of judgment (de Gelder et al., 2006; Paulmann & Pell, 2011; Pell, 2005). This multimodal integration reflects cross-modal influences between sensory channels that are thought to occur early in the time course of perception, where they serve to enrich perception, compensate for conflicts in cross-modal sensation, and facilitate perceptual decoding in times of unimodal ambiguity (Alais & Burr, 2004; De Gelder & Bertelson, 2003; de Gelder et al., 2006; de Gelder, Pourtois, Vroomen, & Bachoud-Lévi, 2000; Vroomen, Driver, & Gelder, 2001). Indeed, auditory stimuli have been found to modulate visual perception and vice versa, with incompatibility between facial and speech information distorting perceptions (de Jong, Hodiamont, Van den Stock, & de Gelder, 2009; Kim, Seitz, & Shams, 2008; McGurk & MacDonald, 1976; Paulmann, Titone, & Pell, 2012; Shams, Kamitani, & Shimojo, 2004; Vroomen & De Gelder, 2000). As such, involuntary multimodal integration plays a substantial part in facilitating understanding of the world in a non-segmented, inclusive manner, and is thus important for social and interpersonal functioning.
Despite BD being characteristically associated with poor psychosocial outcomes, including a reduced capacity for meaningful, long-term interpersonal relationships (Australian Bureau of Statistics, 2007; Blairy et al., 2004; Tsai, Lee, & Chen, 1999), poor social skills (Goldstein, Miklowitz, & Mullen, 2006), and difficulties in social activities (Morriss et al., 2007), the possibility that these outcomes may be partially underpinned by abnormalities in multimodal integration has not yet received adequate attention in the BD literature. However, an investigation of multimodal emotion processing in the disorder is justified in light of evidence suggesting that it is subserved by neural structures implicated in the pathophysiology of BD, including the temporal lobe, amygdala, anterior cingulate, and prefrontal cortex (de Gelder, Böcker, Tuomainen, Hensen, & Vroomen, 1999; Dolan, Morris, & de Gelder, 2001; Laurienti et al., 2003; Phillips, Drevets, Rauch, & Lane, 2003; Phillips, Ladouceur, & Drevets, 2008; Pourtois, de Gelder, Bol, & Crommelinck, 2005). This, coupled with the large literature indicating that emotion perception from faces and, to a lesser extent, prosody is impaired in the disorder, and with evidence that multimodal integration itself is deficient in the genetically and phenotypically related disorder schizophrenia (Craddock, O’Donovan, & Owen, 2005, 2006; de Gelder et al., 2005; Kohler, Walker, Martin, Healey, & Moberg, 2010; Rossell et al., 2013; Van Rheenen & Rossell, 2013a, 2013b, 2014), provides a reasonable foundation for expecting that abnormalities in cross-modal influences between different sensory modalities may be present in BD.
One method of investigating multimodal integration is to use a focused attention paradigm comparing responses between conditions in which the facial and prosodic modalities are fed congruent or incongruent emotional information. In this paradigm, when participants are explicitly instructed to make judgments based on the inputs of a particular modality, better performance for congruent relative to incongruent stimuli (commonly referred to as priming) would indicate automatic cross-modal interference and thus mandatory multimodal integration. On the other hand, matched performance between the conditions would indicate the absence of cross-modal priming.
Here, we present a study employing this paradigm in a large cohort of BD patients compared to controls. That is, in the interests of ascertaining the nature of inter-sensory processes occurring during the time course of emotional perception in the disorder, we specifically sought to determine the extent to which the evaluation of complex emotional inputs from an attended visual channel (facial expressions) was biased by the concurrent presentation of stimuli in an unattended auditory channel (emotional prosody). By comparing facial emotion recognition on the parameters of both accuracy and its more sensitive counterpart, response latency, we aimed to examine the level at which potential group differences become apparent. We hypothesized that healthy individuals would demonstrate multimodal integration via a disproportionate response pattern to congruent relative to incongruent audio-visual emotional information. Given the lack of prior research on this topic in the BD literature, the extent to which emotional prosody would influence facial emotion recognition in individuals with BD remained an open question; however, a degree of impairment compared to controls was predicted, given prior evidence of unimodal deficits in BD cohorts.
MATERIALS AND METHODS
This study was approved by the Alfred Hospital and Swinburne University Human Ethics Review Boards and abided by the Declaration of Helsinki. Written informed consent was obtained from each participant before the study began.
Participants
The clinical sample comprised 50 patients (17 male, 33 female) diagnosed as having DSM-IV-TR BD (BD I n=38; BD II n=12) using the Mini International Neuropsychiatric Interview (MINI: Sheehan et al., 1998). Patients were recruited via community support groups and general advertisements and were all out-patients. Current symptomatology was assessed using the Young Mania Rating Scale (YMRS: Young, Biggs, Ziegler, & Meyer, 1978) and the Montgomery Asberg Depression Rating Scale (MADRS: Montgomery & Asberg, 1979); there were 16 depressed (defined as those who met strict criteria for MADRS scores >8), 4 (hypo)manic (defined as those who met strict criteria for YMRS scores >8), 12 mixed (defined as those who met strict criteria for both YMRS and MADRS scores >8), and 18 euthymic patients (defined as those who met strict criteria for YMRS and MADRS scores ≤8). Those with current psychosis, co-morbid psychotic disorders, significant visual or auditory impairments, neurological disorder, and/or a history of substance/alcohol abuse or dependence during the past six months were excluded. Thirty-two patients were taking antipsychotics, 15 were taking antidepressants, 16 were taking mood stabilizers, and 10 were taking benzodiazepines.
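To make the mood-state groupings explicit, the cut-off rule described above can be expressed as a short sketch (Python; the function and variable names are ours and purely illustrative, not part of the study materials):

```python
def classify_mood_state(ymrs: int, madrs: int) -> str:
    """Assign a mood-state label from YMRS and MADRS total scores.

    Cut-offs follow the criteria described above: a score >8 on a scale
    indicates clinically relevant symptoms on that dimension.
    """
    if ymrs > 8 and madrs > 8:
        return "mixed"
    if ymrs > 8:
        return "(hypo)manic"
    if madrs > 8:
        return "depressed"
    return "euthymic"


# Example: a patient scoring 3 on the YMRS and 12 on the MADRS is classed as depressed.
print(classify_mood_state(ymrs=3, madrs=12))  # -> "depressed"
```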
An age- and gender-matched control sample of 52 healthy participants (20 male, 32 female) was recruited for comparison purposes via general advertisement and contacts of the authors. Using the MINI screen, no control participant had a current diagnosis or previous history of psychiatric (Axis I) illness. An immediate family history of mood or other psychiatric disorder, a personal history of neurological disorder, current or previous alcohol/substance dependence or abuse, visual impairments, and current psychiatric medication use were exclusion criteria for all controls.
All participants were fluent in English, were between the ages of 18 and 65 years, and had an estimated pre-morbid IQ of >75 as assessed by the Wechsler Test of Adult Reading (WTAR).
Materials
A task designed by the authors was administered to assess emotional multimodal integration across visual and auditory modes (described below). The Brief Assessment of Cognition in Schizophrenia - Symbol Coding subtest (BACS-SC; taken from the MATRICS consensus cognitive battery and described in detail by Nuechterlein & Green, 2006) was used to co-vary out processing speed as a potential confound in the response time analysis.
Visual and Auditory Stimuli
The visual stimuli were taken from the widely used and well-validated Ekman and Friesen series known as the Pictures of Facial Affect (POFA: Ekman & Friesen, 1976). The stimuli comprised black and white photographs of female and male faces free of jewelry, spectacles, make-up, and facial hair, expressing happy, sad, fearful, and neutral emotions. The faces were cropped to an oval shape spanning the top of the forehead to the bottom of the chin and excluding any hair and the ears on either side of the face. The auditory prosodic emotion stimuli comprised a series of sentences with neutral content (e.g., “the windows are made of glass”), matched for length and spoken by male and female actors who were directed to express each of four emotional prosodic tones (happy, sad, fearful, and neutral). The task was presented on a 14-inch Lenovo laptop computer running Presentation (Neurobehavioral Systems Inc., 2012), with auditory stimuli delivered binaurally via noise-reduction headphones. Written instructions, an example, and a set of practice trials were provided to participants before the task commenced.
Design and Procedure
The task required participants to recognize emotional faces while being presented with a series of paired facial (visual) and prosodic (auditory) stimuli portraying either congruent or incongruent happy, sad, fearful, or neutral emotional expressions. Participants were instructed to keep their eyes on the screen at all times and to label a target emotion expressed by the facial stimuli whilst mentally blocking out the irrelevant prosodic stimuli. Responses were made via a labeled keyboard button press, and accuracy and response time data were recorded by the computer from 200 milliseconds (ms) onward. The task comprised 48 randomized trials: 24 presentations of congruent stimuli (six pairs each for happy, sad, fearful, or neutral expressions, with some pairs presented twice) and 24 presentations of incongruent stimuli (pairs representing different combinations of emotion, with some pairs presented twice or three times). Each trial lasted 3500 ms (including an inter-stimulus interval of 1000 ms); the prosodic stimuli were presented for approximately 2500 ms and the facial stimuli for 2000 ms. The onset of the facial emotion stimuli was delayed by 500 ms to ensure that their departure from the computer screen coincided with the departure of the prosodic stimuli (which were longer in duration). This design was necessitated by the need to combine a short visual presentation with a longer auditory utterance. The delay in the onset of facial stimuli was suitable given that emotional information is not usually present in the initial fragment of a sentence, but is rather aggregated over its time course (de Gelder et al., 2005). The entire task took approximately 4 min to complete.
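The timing and trial structure described above can be summarized in the following sketch (Python). The constants and the even repetition of incongruent pairs are our own illustrative reconstruction of the description, not the original Presentation script:

```python
import random

# Timing constants taken from the task description (values in milliseconds).
TRIAL_DURATION_MS = 3500            # total trial length, including the ISI
INTER_STIMULUS_INTERVAL_MS = 1000
PROSODY_DURATION_MS = 2500          # spoken sentence (approximate), starts at trial onset
FACE_ONSET_DELAY_MS = 500           # face onset delayed so face and voice end together
FACE_DURATION_MS = 2000

assert FACE_ONSET_DELAY_MS + FACE_DURATION_MS == PROSODY_DURATION_MS
assert PROSODY_DURATION_MS + INTER_STIMULUS_INTERVAL_MS == TRIAL_DURATION_MS

EMOTIONS = ["happy", "sad", "fear", "neutral"]

# 24 congruent face-voice pairs (six per emotion) and 24 incongruent pairs.
# The exact repetition pattern of individual exemplars in the study is not
# specified, so an even repetition is assumed here for illustration.
congruent_trials = [(emotion, emotion) for emotion in EMOTIONS for _ in range(6)]
incongruent_trials = [(face, voice)
                      for face in EMOTIONS for voice in EMOTIONS
                      if face != voice] * 2

trials = congruent_trials + incongruent_trials
random.shuffle(trials)
assert len(trials) == 48
```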
Statistical Analysis
Demographic and clinical group differences were assessed via χ2 or independent samples t tests. We used a 2 (condition: congruent, incongruent) × 2 (group: control, BD) repeated measures design to ascertain group differences between conditions for both accuracy and response time data. Post hoc paired sample t tests split by group were used to follow up significant results. To better understand the effects of diagnostic status and medication on task performance, all analyses were rerun in the patient group comparing those diagnosed with BD I (n=38) versus BD II (n=12), and those currently on or off different classes of medication. The effect of current mood status was also considered; however, given that the sample sizes of some of the mood phase subgroups were too small for meaningful analysis, we collapsed the mixed and manic groups into one (resulting n=16) and compared this group to patients meeting criteria for euthymia (n=18) or depression (n=16). Bivariate correlations were also conducted to examine the relationship between mean response times to congruent and incongruent stimuli and symptom severity on the YMRS and MADRS.
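A minimal sketch of this analysis pipeline is given below (Python, using pandas, pingouin, and scipy). The data file, column names, and layout are hypothetical; this illustrates the described 2 × 2 mixed design and follow-up tests rather than reproducing the authors' actual analysis code:

```python
import pandas as pd
import pingouin as pg
from scipy import stats

# Hypothetical long-format table: one row per participant per condition, with
# columns 'subject', 'group' ('control'/'BD'), 'condition' ('congruent'/
# 'incongruent'), mean 'accuracy' and 'rt', plus per-subject 'ymrs' and 'madrs'.
df = pd.read_csv("multimodal_task_long.csv")

# 2 (condition) x 2 (group) mixed ANOVA for each dependent variable.
for dv in ["accuracy", "rt"]:
    aov = pg.mixed_anova(data=df, dv=dv, within="condition",
                         between="group", subject="subject")
    print(dv, "\n", aov[["Source", "F", "p-unc"]])

# Post hoc paired-sample t tests (congruent vs. incongruent), split by group.
for grp, sub in df.groupby("group"):
    wide = sub.pivot(index="subject", columns="condition", values="rt")
    t, p = stats.ttest_rel(wide["congruent"], wide["incongruent"])
    print(grp, round(t, 2), round(p, 3))

# Bivariate correlations between symptom severity and mean RT per condition
# (patients only).
bd = df[df["group"] == "BD"]
for cond, sub in bd.groupby("condition"):
    for scale in ["ymrs", "madrs"]:
        r, p = stats.pearsonr(sub[scale], sub["rt"])
        print(cond, scale, round(r, 2), round(p, 3))
```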
RESULTS
Descriptive Analyses
There was no significant difference in age, gender, or pre-morbid IQ between the two groups (see Table 1).
Note: BD = bipolar disorder; M/F = male/female; WTAR = Wechsler Test of Adult Reading; YMRS = Young Mania Rating Scale; MADRS = Montgomery Asberg Depression Rating Scale.
Multi-modal Integration
Figures 1 and 2 present mean accuracy and response time scores, respectively, for the recognition of facial emotions as a function of congruent or incongruent prosody in both patients and controls.
Accuracy (% correct)
There was a main effect of condition (F[1,100]=23.43; p<.001), with all participants performing more accurately in the congruent condition, relative to the incongruent condition (congruent: M=98.41; SD=3.47; incongruent: M=95.01; SD=6.76; d=−.63). However, there was no effect of group (Control M=97.15; SD=3.50; BD M=96.25; SD=4.55; d=−.22; F[1,100]=1.27; p=.26) and no interaction effect (F[1,100]=.15; p=.70).
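The between-condition and between-group effect sizes reported here and below are Cohen's d values. As the exact variant used in the study is not stated, the sketch below shows one common pooled-standard-deviation formulation for illustration only (Python):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation of two score sets.

    Illustrative only; the specific formula (and sign convention) used in
    the study is not reported.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd
```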
Response time (ms)
There was a main effect of condition (F[1,100]=12.58; p<.01) and a main effect of group (F[1,100]=9.39; p<.01), with response recognition taking longer for incongruent relative to congruent trials (congruent M=1584.42; SD=429.52; incongruent M=1672.94; SD=401.21; d=.21), and patients performing worse than controls overall (Control M=1515.54; SD=375.79; BD M=1746.34; SD=384.71; d=−.31). A significant interaction effect was also present (F[1,100]=5.62; p≤.02), such that the effect of condition on response time differed across the groups. Follow-up analyses revealed a significant cross-modal effect of prosody on latency for recognizing facial emotions in controls only, such that response times were shorter when emotion-congruent, relative to emotion-incongruent, prosody accompanied facial stimuli (t[51]=−4.84; p<.01). This effect (reflected by the average ms difference between responses in the congruent and incongruent conditions) was four times greater for controls than for patients (Control M=145.75; SD=217.10; BD M=29.00; SD=277.86; Cohen’s d=−.47). When the BACS-SC scores were entered into the repeated measures model to control for processing speed, the main effects of condition (F[1,99]=745.75; p=.88) and group (F[1,99]=2.43; p=.12) became non-significant, but the significant interaction effect was preserved (F[1,99]=3.89; p≤.05).
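One way to approximate the repeated measures model with processing speed entered as a covariate is a linear mixed-effects model with a random intercept per subject (Python, statsmodels). The column names reuse the hypothetical long-format table sketched in the Statistical Analysis section; this is an approximation of, not a reproduction of, the reported analysis:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data with a per-subject processing speed score
# ('bacs_sc') added to each row.
df = pd.read_csv("multimodal_task_long.csv")

# Random-intercept model; the condition:group term indexes the interaction
# that remained significant after controlling for processing speed.
model = smf.mixedlm("rt ~ condition * group + bacs_sc",
                    data=df, groups=df["subject"]).fit()
print(model.summary())
```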
Subgroup analyses
Table 2 presents descriptive statistics and correlations for the subgroup analyses. There were no between-group main effects or interactions for patients diagnosed as having BD I versus BD II (all p’s>.05), nor were there significant between-group differences in the patient group for individuals on or off antipsychotics/anticonvulsants, antidepressants, lithium, or benzodiazepines (all p’s>.05). There were also no between-group main effects or interactions for patients classified as euthymic, depressed, or mixed/manic (all p’s>.05). Further bivariate correlations supported this, indicating no associations between current depression (as measured by the MADRS) or mania (as measured by the YMRS) severity and response times to either congruent or incongruent stimuli.
Note: BD = bipolar disorder; MADRS = Montgomery Asberg Depression Rating Scale; YMRS = Young Mania Rating Scale.
DISCUSSION
This study is the first of its kind to investigate how the perceptual system in BD combines visual and auditory emotional information. By comparing the extent to which emotional prosody cross-modally influences facial expression recognition performance between BD and control cohorts, we were able to assess multimodal integration in the disorder. Our findings indicated that emotional prosody interfered with accuracy for recognizing facial expressions regardless of group status, with performance being better in the congruent relative to the incongruent condition. This occurred even though participants were instructed to attend only to information presented to the visual channel, which is consistent with previous literature suggesting that multimodal integration occurs pre-attentively (de Gelder & Jean, 2000; de Gelder et al., 2005; Kim et al., 2008). Indeed, the less efficient processing of incongruent relative to congruent emotional information indicates that cues from different modalities were being integrated at a perceptual level.
In contrast, there were group differences in response times, with BD patients exhibiting difficulty in processing emotional facial expressions irrespective of the influence of congruent or incongruent prosodic information. This is largely supportive of prior work indicating that visual facial emotion impairments may represent a unique processing difficulty in BD (Vederman et al., 2012). An interaction effect was also evident, indicating that it took longer for participants in the control group to recognize facial expressions when these conflicted with incongruent emotional prosody. This effect was not apparent in the BD patients; the absence of priming/facilitation of emotion processing in this group relative to controls signifies that sensory emotion integration was impaired in this cohort. Indeed, the pattern of disproportionate latencies, in the absence of group differences in accuracy between conditions, suggests a subtle delay in automatic multimodal emotion integration in BD, such that typical redundancy-based performance benefits are diminished. As this BD-related impairment in implicitly extracting vocal emotional information was evident only on the sensitive response time parameter, the deficit in the normal automatic process of rapidly integrating information from different sensory sources appears to be quite subtle in nature.
This group difference is unlikely to be attributable to premorbid differences in processing speed, given that the group by condition interaction remained even when processing speed was statistically partialed out. As the subgroup analyses did not reveal differences between patients diagnosed as having BD I versus BD II, or between patients meeting criteria for different symptomatic statuses, mood state and diagnostic subtype are also unlikely to have had a significant impact. However, as these post hoc subgroup comparisons were underpowered, and as we were unable to directly compare these subgroups to controls because of the restricted sample size after stratification, the contribution of these factors cannot be completely ruled out.
There are other limitations to the study that should be considered when interpreting the results. First, the emotion integration task was newly developed in our lab and has not been validated in other clinical samples. Second, the presentation of the facial emotions over long intervals likely resulted in the ceiling-level accuracy observed for the task. Although these intervals were necessary to accommodate the length of the auditory utterances, the duration of emotional facial expressions in real-time interactions is much shorter; thus, these results may not be ecologically valid. Third, there were too few stimuli in the current task design to reliably evaluate the effects of specific emotions. Subsequent research in the field would do well to include more stimuli of each emotion type to investigate valence effects. Finally, the same face-emotion stimuli were repeated because of the limited number of stimuli in the POFA series. Thus, it is possible that our results are partly attributable to cross-contamination effects whereby responses on earlier trials affected responses on later trials using the same face and facial expression.
Despite these limitations, the current findings appear to indicate a level of hypo-integration in BD, at least with reference to the cross-modal influence of prosody on response latencies for facial expression recognition. Whether the reciprocal cross-modal influence of facial expressions on emotional prosody is also diminished in BD, however, remains to be seen. It is nonetheless likely that the typical magnitude of gains to be made on the basis of redundant multimodal information is diminished in these patients, at least at a subtle latency level. Further research directly comparing congruent multimodal emotion recognition to unimodal emotion recognition is needed to establish whether this is the case.
It is also possible that the BD-related findings observed here are not specific to emotion, but rather relate to more general difficulties in processing different streams of information simultaneously. Indeed, it is widely recognized that patients with BD exhibit pervasive deficits in executive functioning that translate to difficulties in shifting between information sources and simultaneously thinking about multiple concepts (McKirdy et al., 2009; Melcher et al., 2013). Such neurocognitive deficits have been shown to underpin unimodal emotion processing in schizophrenia populations (Brekke, Kay, Lee, & Green, 2005), and it is certainly possible that our results reflect the downstream outcome of this cognitive inflexibility. Given that both executive and emotion perception impairments have recently been shown to predict occupational outcome in BD (Ryan et al., 2013), future work would do well to establish a more coherent understanding of the interplay between cognitive and emotional processing abilities in this context.
As social communication is a complex process, our findings have substantial implications for the understanding of social functioning in BD. For example, given that multimodal integration is particularly necessary in times of unimodal ambiguity (Alais & Burr, 2004), and that facial emotion processing is impaired in BD, the fact that emotional prosody does not appear to implicitly guide the speed of perception of facial emotions in patients with the disorder may add to the significant psychosocial burden they carry. Indeed, this delay in generating a coherent emotional representation could decrease cognitive efficiency in social situations where communication occurs in real time, in turn adversely influencing interpersonal behaviors. These findings add to the existing research suggesting that emotion processing is important in facilitating healthy psychosocial outcomes (Hoertnagl et al., 2011; Martino et al., 2011; Ryan et al., 2013).
In conclusion, the results of this study indicate that the automatic process of rapidly integrating multimodal information from facial and prosodic sensory channels is impaired in BD. As we cannot comment on whether this apparent multimodal integration impairment reflects an emotion-specific phenomenon or a more generalized audio-visual integration problem, researchers would be well placed to consider this distinction when designing future multimodal investigations of the disorder.
ACKNOWLEDGMENTS
The authors have no conflicts of interest but would like to acknowledge the Australian Rotary Health/Bipolar Expedition, the Helen McPherson Smith Trust and an Australian Postgraduate Award for providing financial assistance for the completion of this work. We also thank Chris Groot (University of Melbourne) for providing the auditory stimuli for the multi-modal task.