Introduction
A variety of tests across cognitive domains have been evaluated on mobile self-administered platforms (Moore, Swendsen, & Depp, 2017). Potential advantages of mobile cognitive testing include the ability to assess cognitive performance in naturalistic settings and to improve practical access to cognitive testing for research or clinical purposes. When tests are repeated intensively within persons over time and coupled with reports of experiences from ecological momentary assessment (EMA), it also becomes possible to evaluate day-to-day and contextual influences on cognition that have heretofore been challenging, if not impossible, to measure. In a small group of studies, ecological momentary cognitive tests have made measurement more precise by reducing error, and cognitive performance has been linked to variability in activity participation (Allard et al., 2014; Moore et al., 2017). Tasks developed to date have been designed to measure a variety of cognitive domains (e.g. memory, attention, processing speed; Jongstra et al., 2017; Moore et al., 2017, 2020; Schweitzer et al., 2017), but none to our knowledge have focused on social cognition. The purpose of this paper is to detail the initial validation of a new ecological momentary test of facial emotion recognition and the relationships between performance on this test and other variables (e.g. symptoms as measured by EMA).
Social cognition is a growing focus of observational and interventional research in schizophrenia and psychotic disorders (Green, Horan, & Lee, 2015). There is evidence that social cognition abilities in general are separable from non-social cognition, and that social cognition independently predicts community function in schizophrenia and bipolar disorder (Fett, Viechtbauer, Penn, van Os, & Krabbendam, 2011; Hoe, Nakagami, Green, & Brekke, 2012; Lahera et al., 2012; Mehta et al., 2013). A number of longitudinal studies indicate that social cognitive abilities, including facial emotion recognition, appear to be stable over the course of the illness, similar to neurocognition (Comparelli et al., 2013; Green et al., 2012; McCleery et al., 2016). Moreover, social cognitive tasks that have been vetted for psychometric properties, including facial emotion recognition, have test–retest reliabilities comparable to non-social cognitive tests (Pinkham, Harvey, & Penn, 2018). As such, social cognitive abilities are presumed to be relatively stable trait-like abilities over time and might not be assumed to fluctuate markedly. Nonetheless, it is unclear if this stability is evident in intensively repeated performance outside of the lab setting.
In addition, psychotic symptoms may influence social cognitive abilities, and this influence may be greater than that between symptoms and non-social cognition (Fett, Maat, & Investigators, 2013; Pinkham, Harvey, & Penn, 2016; Ventura, Wood, & Hellemann, 2013), including increasing intra-task variation (Hajduk, Harvey, Penn, & Pinkham, 2018). Thus, although social cognitive performance may be generally stable, the influence of symptoms on changes within persons over time is unclear.
In addition to elucidating the influence of symptoms on performance, ecological momentary tests of facial emotion recognition could help specify the influence of social cognition on social behavior and performance within participants over time. However, no ecological momentary tasks of social cognition exist and, according to a recent review (Mote & Fulford, 2020), only one study has evaluated the relationship between in-lab social cognitive performance and EMA-derived social behavior. That study indicated a somewhat surprising lack of association between social cognitive performance and EMA measures of social activity (e.g. time spent alone, with others; Janssens et al., 2012). Therefore, assessing social cognition in a manner that is more proximal to social behavior may provide a more sensitive test of this relationship.
An additional dimension that may vary over time in conjunction with affect recognition accuracy is introspective accuracy judgement, or the ability to accurately gauge one's performance (Harvey & Pinkham, 2015). Introspective accuracy is strongly linked to functional outcomes (Gould et al., 2015), and introspective accuracy for facial affect predicts social function above and beyond ability (Gould et al., 2015; Silberstein, Pinkham, Penn, & Harvey, 2018). Overconfidence is particularly pronounced in psychotic disorders (Balzan, Woodward, Delfabbro, & Moritz, 2016; Jones et al., 2020; Moritz et al., 2016). Psychotic and mood symptoms, which vary over time, are associated with overconfidence and underestimation of performance, respectively (Harvey et al., 2019; Harvey, Paschall, & Depp, 2015; Jones et al., 2020; Köther et al., 2012; Moritz et al., 2015). Therefore, simultaneous and intensively repeated evaluation of symptoms, cognitive performance, and introspective accuracy may help to identify if shifts in mood or psychotic symptoms within subjects are associated with changes in introspective accuracy for emotion recognition.
We developed a facial affect recognition measure called the Mobile Ecological Test of Emotion Recognition (METER), which is delivered through a web-based, smartphone-capable program coupled with contemporaneous EMA and real-time accuracy judgements of task performance. This study aimed to evaluate the acceptability, adherence, and convergent validity of the METER task through the following planned analyses: (1) METER adherence and predictors of adherence, (2) patterns of performance and self-assessed performance ratings over time and evidence of practice effects, (3) convergent validity with ‘gold standard’ facial emotion recognition measures, and (4) convergent validity with non-social cognition test performance. We also explored associations with psychotic and mood symptoms measured by both in-lab testing and EMA reports, as well as patterns of overestimation as identified in prior cross-sectional lab-based research (Jones et al., 2020).
Methods
Participants
Data for this study were derived from an ongoing longitudinal study investigating relationships between negative social cognitive biases, psychosis, and suicidal ideation and behavior. Participants were recruited from three sites – the University of California San Diego (UCSD), The University of Texas at Dallas (UTD), and the University of Miami (UM). Recruitment was performed in a stratified fashion based on the presence v. absence of active suicidal ideation by the use of the Columbia Suicide Severity Rating Scale (CSSRS; Posner et al., 2011). Participants were included in the study if they (1) were between the ages of 18 and 65; (2) had a current diagnosis of schizophrenia, schizoaffective disorder, bipolar disorder with psychotic features, or major depressive disorder with psychotic features, confirmed by the Structured Clinical Interview for the DSM-5 (SCID-5; First, Williams, Karg, & Spitzer, 2015) and Mini International Neuropsychiatric Interview (MINI; Sheehan et al., 1998); (3) had an informant they were regularly in contact with, for safety procedures; (4) were in outpatient, partial hospitalization, or residential care; (5) were proficient in English; and (6) were able to provide informed consent.
Participants were excluded if they (1) had a history of head trauma with loss of consciousness >15 min; (2) had ever been diagnosed with a neurological or neurodegenerative disorder; (3) had vision or hearing problems that would interfere with data collection; (4) had an estimated IQ < 70, as determined by the Wide Range Achievement Test-4 (WRAT-4; Wilkinson & Robertson, 2006); or (5) had a DSM-5 diagnosis of a substance use disorder in the past 3 months, excluding cannabis and tobacco, as confirmed by the SCID-5 (First et al., 2015). This study was reviewed by each site's Institutional Review Board, and all participants provided written informed consent.
Procedures
Once deemed eligible, participants completed lab-based assessments examining their social and neurocognitive performance. At the end of this visit, participants were given the option of using their own smartphone (either iPhone or Android) or a study-provided Samsung Galaxy S8 Android smartphone to complete the EMA surveys and METER tasks. All participants received a 15 min training session at the end of this in-lab visit on operating the study-provided smartphone (if borrowed) and on completing the EMA and METER tasks. During the 10-day remote survey period, research staff conducted weekly or as-needed check-ins to maintain adherence and to resolve participant concerns. Once the 10 days were completed, participants returned the smartphone, if borrowed, and were compensated $1.67 for each completed survey (30 total), for a maximum of $50 (in addition to $50 for in-lab testing).
In-lab measures of psychopathology
Clinical diagnoses were established through the MINI (Sheehan et al., 1998), the SCID-5 (First et al., 2015), clinical chart reviews, and consensus meetings with the site investigators. Primary current diagnoses were based on both past and present history of diagnoses and symptoms using the methods described above. Psychotic symptom severity was assessed with the Positive and Negative Syndrome Scale subscales for positive and negative symptoms (PANSS; Kay, Fiszbein, & Opler, 1987). Depressive symptom severity was also measured using the interview-rated Montgomery–Åsberg Depression Rating Scale (MADRS; Montgomery & Åsberg, 1979). Symptoms of mania were assessed using the Young Mania Rating Scale (YMRS; Young, Biggs, Ziegler, & Meyer, 1978). These three symptom assessments measured current (past week) symptom severity.
Facial emotion recognition measures
Participants completed the Bell Lysaker Emotion Recognition Task (BLERT; Bryson, Bell, & Lysaker, 1997) and the computerized Penn Emotion Recognition Task (ER-40; Kohler et al., 2003). The BLERT displays 21 video segments of one male actor who, through intonation, upper body movement cues, and facial expression, displays one of seven emotions: happiness, sadness, fear, disgust, surprise, anger, or no emotion. Participants were instructed to choose the correctly displayed emotion in this task. A total score was calculated as the number of emotions correctly identified.
The ER-40 measures emotion recognition ability by displaying 40 color photographs, each expressing one of four emotions (happiness, sadness, anger, or fear) or no emotion. Participants were presented one image at a time and asked to select the emotion displayed as quickly and as accurately as possible. Total scores were calculated as a sum of correct responses from 0 to 40.
Neurocognitive performance measures
Premorbid verbal ability was assessed with the WRAT-4 (Wilkinson & Robertson, 2006). Participants were administered a subset of the MATRICS Consensus Cognitive Battery (MCCB; Nuechterlein et al., 2008) including the Trail Making Test, Part A (TMT-A; Tombaugh, 2004); Brief Assessment of Cognition in Schizophrenia (BACS) Symbol Coding Subtest (Keefe et al., 2004); Category Fluency: Animal Naming (Spreen, 1991); Letter-Number Span (LNS; Gold, Carpenter, Randolph, Goldberg, & Weinberger, 1997); and Hopkins Verbal Learning Test (HVLT; Brandt & Benedict, 2001). In addition to individual subscale scores, age- and education-normed T-scores were calculated and averaged into a Global Composite Score.
EMA procedures
Participants were sent text notifications to their smartphones (or study-provided Android device) to complete the smartphone-based surveys three times daily for 10 days and the METER task once per day. This text notification contained the participant link for the study surveys. Participants selected preferred time slots for the survey notifications, with at least a 2 h interval between surveys. Participants received the surveys once in the morning, once in the afternoon, and once at night. Upon receiving the link, participants first completed several EMA questions assessing context, mood, and behaviors and then completed the METER task, if administered, followed by post-task game performance questions. Once the survey was delivered, the link stayed active for 1 h, after which the survey was no longer accessible. Study surveys were linked to the smartphone number, and so could be opened only on that device. Participants' data were deidentified and were not stored locally on the devices. Survey data were sent to encrypted, HIPAA-compliant cloud storage in Amazon Web Services (AWS), and responses were recorded even if participants did not complete the entire survey. The AWS system allowed research staff to access participant data in real time and monitor progress daily. If participants missed more than three surveys in a row, experimenters contacted them to address any technical difficulties or adherence issues.
Mobile facial emotion task
The mobile facial emotion task (see Fig. 1) was modeled directly after the widely used and validated Penn Emotion Recognition 40 test (Kohler et al., 2003) and was administered concurrently with the EMA surveys once per day. The timing of the task was stratified by time of day (either morning, afternoon, or evening periods). This task was administered immediately following the EMA questions. In METER, participants were presented with a total of 10 faces each session from a pool of 100 unique faces taken from the publicly available University of Pennsylvania Brain Behavior Lab 2D Facial Emotion Stimuli. Those faces were validated by collecting recognition ratings from healthy volunteers, and only those faces identified with accuracy levels exceeding 70% were used (Gur et al., 2002). Each face displayed one of five categories: happiness, sadness, anger, fear, or no emotion, and two faces from each category were presented each session. Neither actor identities nor specific stimuli overlapped with those used in the ER-40. Completion time was collected for each emotion choice for every face and was aggregated and averaged across the 10-day protocol.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20221109151231975-0368:S0033291720004419:S0033291720004419_fig1.png?pub-status=live)
Fig. 1. Screenshots of ecological momentary emotion recognition test (METER).
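The session composition described above (10 faces per session, two from each of five categories, drawn from a pool of 100) can be illustrated with a minimal sketch. The function `draw_session`, the record layout, the even split of 20 faces per category, and the shuffling of presentation order are all assumptions made for illustration; the source does not describe how stimuli were actually selected or ordered.

```python
import random

EMOTIONS = ["happiness", "sadness", "anger", "fear", "no emotion"]

def draw_session(pool, per_emotion=2, rng=None):
    """Pick `per_emotion` faces from each category, then shuffle their order."""
    rng = rng or random.Random()
    session = []
    for emotion in EMOTIONS:
        candidates = [face for face in pool if face["emotion"] == emotion]
        # sample() draws without replacement, so no face repeats in a session
        session.extend(rng.sample(candidates, per_emotion))
    rng.shuffle(session)  # shuffling is an assumption, not stated in the text
    return session

# Hypothetical pool: 100 faces, 20 per category (the even split is assumed).
pool = [{"id": i, "emotion": EMOTIONS[i % 5]} for i in range(100)]
session = draw_session(pool, rng=random.Random(0))
```

A sketch like this guarantees the two-per-category balance within a session but does nothing to prevent a face from recurring across days, which a real protocol might also need to control.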
After each session of the METER, participants were asked to rate their self-assessed performance, that is, how many faces they believe they correctly identified from 0 to 10. We then calculated the difference between actual and self-assessed performance and, since this discrepancy score is bimodal with optimal performance in the middle, we categorized each test session into (1) overestimation (estimated > actual), (2) accurate estimation (estimated = actual), and (3) underestimation (estimated < actual).
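The three-way categorization above can be written as a small function. This is a sketch only: the name `classify_session` and the category labels are illustrative, and the range check simply reflects the 0 to 10 scale stated in the text.

```python
def classify_session(actual, estimated):
    """Categorize one METER session from actual and self-assessed scores (0-10)."""
    for value in (actual, estimated):
        if not 0 <= value <= 10:
            raise ValueError("scores must lie between 0 and 10")
    if estimated > actual:
        return "overestimation"
    if estimated < actual:
        return "underestimation"
    return "accurate"

# Example: a participant who got 7 faces right but believed they got 9 right
label = classify_session(actual=7, estimated=9)
```

Note that "accurate" here requires an exact match, as in the definition above, so a session with a discrepancy of a single face is already counted as over- or underestimation.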
EMA mood and symptom items
The EMA survey included items on location, activity, mood, voices and paranoia, and social activities. For this study, we focused on items that corresponded to in-lab symptom measures of psychosis: voices (e.g. ‘Since the past alarm, how much have you been bothered by voices?’), and participants' trust in others (e.g. ‘Since the last alarm, how much have you had thoughts that you really can't trust other people?’), along with mood state: happiness (e.g. ‘Since the past alarm, how much have you felt happy?’), sadness (e.g. ‘Since the past alarm, how much have you felt sad or depressed?’). These self-reported items were presented on a seven-point Likert Scale (1 not at all and 7 extremely). Analyses using these variables focused on the epoch in which the METER was administered.
Statistical analysis
We first evaluated METER adherence, which was calculated as the number of tests that were completed out of the total number possible (i.e. 10). We also evaluated the relative impact of removing low adherent participants on convergent validity. We evaluated the METER's total completion time relationship with the in-lab social cognition measures and METER performance. Parametric or non-parametric correlations (depending on whether variables violated normality assumptions) were used to examine the relationship between adherence and demographics, mental health symptoms, and cognitive variables. Then, the mean squared successive difference (MSSD), defined as the sum of the squared differences between consecutive observations divided by (N−1), was calculated to understand within-person variability, as was an intraclass correlation coefficient (ICC). We then evaluated performance and self-assessed performance across testing sessions to evaluate practice effects using linear mixed models. We evaluated convergent validity by examining correlations between the person-averaged METER performance and self-assessed performance with in-lab measures of facial emotion recognition (BLERT and ER-40 Total scores), non-social neurocognitive measures (MCCB tasks), and symptoms. These analyses included univariate analyses (correlations) between individual measures and a multivariate regression to determine the contribution of social v. non-social cognitive measures to METER performance. With EMA data, we evaluated combined between- and within-person associations between EMA measures of happiness, sadness, hearing voices, and trust in others with METER performance and self-assessed performance using linear mixed models (Twisk, 2019). All linear mixed models had a random effect for the intercept (subject). The α value was set at 0.05 and the Bonferroni correction was used for post-hoc pairwise analyses.
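As a worked illustration of the MSSD definition given above, the following sketch computes it for one participant's sequence of session scores. The function name `mssd` and the example scores are hypothetical, and treating only completed sessions, in chronological order, as the consecutive observations is an assumption about how missed sessions would be handled.

```python
def mssd(scores):
    """Mean squared successive difference for one participant's scores.

    Sum of squared differences between consecutive observations,
    divided by (N - 1), where N is the number of observations.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("MSSD needs at least two observations")
    squared_diffs = [(later - earlier) ** 2
                     for earlier, later in zip(scores, scores[1:])]
    return sum(squared_diffs) / (n - 1)

# Hypothetical scores (faces correct out of 10) on four completed days:
# differences are -1, +2, -3, so MSSD = (1 + 4 + 9) / 3
example = mssd([8, 7, 9, 6])
```

Unlike a within-person standard deviation, this index is sensitive to the ordering of observations: a slow drift and a rapid day-to-day oscillation with the same overall spread yield very different MSSD values.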
Results
Sample characteristics and adherence
Sample characteristics can be seen in Table 1. As might be expected, diagnostic groups differed on the PANSS Positive Syndrome Scale [F (2,83) = 0.5, p = 0.001] and Negative Syndrome Scale [F (2,83) = 6.23, p = 0.003]. Participants with schizoaffective disorder (M = 20.0, s.d. = 5.0) had a higher severity of positive symptoms than those with schizophrenia (M = 18.5, s.d. = 5.24) and the mood disorder group (M = 14.0, s.d. = 5.9). Additionally, the schizophrenia group (M = 14.7, s.d. = 4.2) showed higher severity of negative symptoms than those with schizoaffective disorder (M = 12.5, s.d. = 3.7) and the mood disorder group (M = 11.0, s.d. = 2.8). Groups also differed by depression severity on the MADRS [F (2,82) = 4.7, p = 0.012], with the group with schizoaffective disorder (M = 18.9, s.d. = 10.9) having more severe depressive symptoms than the group with schizophrenia (M = 11.0, s.d. = 11.6) and the mood disorder group not differing significantly from either psychosis group (M = 19.2, s.d. = 13.8). Overall, the sample had more severe positive symptoms and comparable negative symptoms to prior reports involving social cognition validation (Pinkham et al., 2018) but was otherwise similar in terms of demographic distribution.
Table 1. Sample characteristics (N = 86)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20221109151231975-0368:S0033291720004419:S0033291720004419_tab1.png?pub-status=live)
WRAT-4, Wide Range Achievement Test 4; ER-40, Penn Emotion Recognition Task; BLERT, Bell Lysaker Emotion Recognition Task; HVLT, Hopkins Verbal Learning Test; Global Impairment T-Score was calculated by averaging the MCCB age-corrected T-scores; PANSS, Positive and Negative Symptoms Scale; MADRS, Montgomery–Åsberg Depression Scale; YMRS, Young Mania Rating Scale. The ranges are observed from our sample.
METER adherence, mean performance, variability, and reaction time
The mean rate of adherence (number of tests completed/number provided) for the METER task was 79.8% (s.d. = 20.9), ranging from 10% to 100%. Adherence was not correlated with any demographic, cognitive, or symptom variables (p's > 0.05; see online Supplementary Table S1). Adherence was not significantly different across schizophrenia, schizoaffective disorder, and the mood disorder group [F (2,83) = 0.23, p = 0.799] or by the presence of current suicidal ideation [F (1,84) = 0.9, p = 0.585].
Mean percent of faces correct on the METER was 75.6% (s.d. = 11.0%), which was very similar to the self-assessed correct number of faces, 76.5% (s.d. = 15.1%). Interestingly, mean actual and self-assessed performance were not correlated (ρ = 0.151, p = 0.159). In terms of potential practice effects, performance was negatively associated with protocol day, with slight but significant declines in performance over time (estimate = −0.07, s.e. = 0.23, t = −2.98, p = 0.003), but no significant changes over time were observed in self-assessed performance (estimate = 0.006, s.e. = 0.03, t = −0.23, p = 0.822). Performance on the METER was negatively correlated with age and positively correlated with education and WRAT-4 Standard Score (Table 2). After removing seven individuals who had <50% adherence on the METER, performance was no longer correlated with WRAT-4 score. There were no correlations between METER self-assessed performance and other demographic characteristics (p's > 0.05; Table 2).
Table 2. METER parametric and non-parametric correlations with in-lab variables (N = 86)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20221109151231975-0368:S0033291720004419:S0033291720004419_tab2.png?pub-status=live)
WRAT-4, Wide Range Achievement Test 4; ER-40, Penn Emotion Recognition Task; BLERT, Bell Lysaker Emotion Recognition Task; HVLT, Hopkins Verbal Learning Test; Global Impairment T-Score was calculated by averaging the MCCB age-corrected T-scores; PANSS, Positive and Negative Symptoms Scale; MADRS, Montgomery–Åsberg Depression Scale; YMRS, Young Mania Rating Scale.
a Non-parametric correlation.
bN = 85; cN = 82.
*Significant at p < 0.05; **significant at p < 0.01.
In terms of within-person variability, we found that performance on the METER had a higher MSSD (5.21, s.d. = 3.12) than self-assessed performance on the METER (MSSD = 4.08, s.d. = 5.74). Greater variability of performance on the METER was correlated with older age (ρ = 0.282, p = 0.009). The ICC for performance on the METER was 0.29, whereas the ICC for self-assessed performance on the METER was 0.51. Mean total completion time for the task, aggregated across all 10 lists, was 49.61 s (s.d. = 85.8). The mean total completion time was negatively associated with performance (ρ = −0.218, p = 0.044). A linear mixed model revealed that there was no effect of day on completion time (estimate = −2.56, s.e. = 2.9, t = −0.88, p = 0.382).
Convergent validity with the METER performance, variability, and reaction time
Mean performance on the METER was strongly positively associated with the ER-40 total score (ρ = 0.454, p < 0.001) as well as the BLERT total score (ρ = 0.592, p < 0.001) (Table 2). METER performance was associated with all non-social MCCB neurocognitive measures with the exception of the HVLT total score. METER correlations with non-social cognition were slightly lower than those of the ER-40 or BLERT. The strength and significance of associations in the subsample of participants with 50% or higher adherence (N = 79) were highly similar to those in the entire sample, except that the TMT-A score was no longer significantly associated with METER performance. To evaluate the relative association of METER performance with social and non-social tests, we fitted a linear regression predicting METER performance from social cognition and non-social cognition tests, which was significant overall [F (7,78) = 11.4, p < 0.001, R² = 0.51]. BLERT emerged as the only significant predictor (B = 0.04, t = 6.1, p < 0.001), followed by the ER-40, which fell just short of significance (B = 0.01, t = 2.0, p = 0.053). Participant self-assessed performance on the METER had no relationship with any of these social or non-social cognitive scores (Table 2).
Greater within-person variability in performance on the METER, as calculated by MSSDs, was associated with worse ER-40 performance (ρ = −0.312, p = 0.004). There were no other correlations between variability of performance and other variables of interest (p's > 0.153), and there were no correlations between variability of self-assessed performance and variables of interest (p's > 0.125). Finally, reaction time on the METER task was negatively associated with BLERT, ER-40, and non-social task scores, with longer reaction time associated particularly strongly with worse performance on timed tasks (e.g. Trail Making Test, Symbol Coding).
Associations with in-lab symptom measures
PANSS positive syndrome score was strongly negatively correlated with METER performance (ρ = −0.537, p < 0.001) (Table 2). To evaluate whether this effect was confounded with diagnosis, we repeated this analysis with only participants with a diagnosis of schizophrenia or schizoaffective disorder, and found a similar correlation (ρ = −0.540, p < 0.001). By comparison, the PANSS positive syndrome score was also negatively correlated with the BLERT (ρ = −0.315, p = 0.003) but not the ER-40 (ρ = −0.115, p = 0.319). Additionally, the specific association between the PANSS positive syndrome scale and METER performance remained significant when adjusting in a partial correlation for PANSS general psychopathology (r = −0.467, p < 0.001).
There was no significant association between METER performance and either the PANSS negative syndrome scale or the MADRS total score. Self-assessed performance on the METER was negatively correlated with depressive symptoms (MADRS total score; ρ = −0.30, p = 0.005). Among the symptom measures, the mean total completion time was significantly positively associated only with PANSS positive symptoms (ρ = 0.213, p = 0.049).
Associations between METER and time-varying EMA-assessed symptoms
Linear mixed models were used to assess the effects of concurrent EMA-reported psychosis symptoms of hearing voices and mistrustfulness in others, along with affective ratings of happiness and sadness, on actual and self-assessed performance on the METER. In these models, we simultaneously evaluated the person-averaged effect and the effect of momentary deviations from each person's average. Person-averaged voices (estimate = −0.29, s.e. = 0.07, t = −4.3, p < 0.001), mistrustfulness (estimate = −0.13, s.e. = 0.06, t = −2.2, p = 0.032), and sadness (estimate = −0.14, s.e. = 0.06, t = −2.1, p = 0.037) were associated with reduced accuracy on the METER, with no significant effects of momentary changes (there was a trend for momentary increases in voices and reduced performance, estimate = 0.11, s.e. = 0.06, t = 1.8, p = 0.080). Neither person-averaged nor momentary psychotic symptoms were associated with self-assessed performance (p's > 0.05). In contrast, person-averaged self-assessed performance was associated negatively with person-averaged sadness (estimate = −0.33, s.e. = 0.10, t = −3.5, p = 0.001) and positively with person-averaged happiness (estimate = 0.33, s.e. = 0.10, t = 3.42, p = 0.001), with no effect of momentary changes. As seen in Fig. 2, when actual and self-assessed performance were combined, overestimation was associated with more severe voices, whereas underestimation was associated with greater sadness and lesser happiness.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20221109151231975-0368:S0033291720004419:S0033291720004419_fig2.png?pub-status=live)
Fig. 2. Overestimation and underestimation of performance on the METER by EMA voices severity and mood.
Note. This figure depicts the concurrent associations between underestimation, overestimation and accurate estimation of the METER tests with self-reported EMA voices severity and mood. As shown, there is a significant association between overestimation and reported happiness. The same trend applies to voices severity. However, those who tend to report sadness tend to underestimate their METER performance. Linear mixed models; Overestimation > accurate, estimate = 0.23, S.E. = 0.20, p = 0.004; Overestimation > underestimation, estimate = 0.87, S.E. = 0.16, p < 0.001; Happy: F(2,694) = 15.4, p < 0.001; Sad: F(2,694) = 5.1, p = 0.006; Underestimation > overestimation, estimate = 0.49, S.E. = 0.16, p = 0.008; Voices: F(2,674) = 4.1, p = 0.017; Overestimation > underestimation, estimate = 0.40, S.E. = 0.15, p = 0.021 (pairwise contrasts Bonferroni adjusted).
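The separation of person-averaged effects from momentary deviations used in the mixed models of this section is commonly achieved by person-mean centering each EMA predictor. The sketch below shows that decomposition only; the function `split_between_within`, the record layout, and the example ratings are hypothetical, and the source does not state exactly how the two components were constructed.

```python
from collections import defaultdict

def split_between_within(records, key):
    """Add a person-mean and a momentary-deviation column for `key`."""
    values_by_person = defaultdict(list)
    for rec in records:
        values_by_person[rec["subject"]].append(rec[key])
    person_means = {subject: sum(vals) / len(vals)
                    for subject, vals in values_by_person.items()}
    decomposed = []
    for rec in records:
        mean = person_means[rec["subject"]]
        decomposed.append({
            **rec,
            f"{key}_person_mean": mean,       # between-person component
            f"{key}_momentary": rec[key] - mean,  # within-person component
        })
    return decomposed

# Hypothetical EMA voices ratings (1-7 scale) for two participants
records = [
    {"subject": "A", "voices": 2}, {"subject": "A", "voices": 4},
    {"subject": "B", "voices": 1}, {"subject": "B", "voices": 1},
]
rows = split_between_within(records, "voices")
```

Both new columns would then enter a linear mixed model with a random intercept for subject as separate predictors, so that the coefficient on the person mean reflects differences between people and the coefficient on the deviation reflects fluctuation within a person.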
Discussion
This paper provides preliminary evidence for the validity of a new mobile self-administered ecological momentary emotion recognition test (METER) which enables the evaluation of social cognitive ability and self-assessments of ability in naturalistic settings. The measure was well tolerated, with an average adherence of 79.8%, and all participants contributed data for analyses. Adherence was not correlated with any demographic or symptom variables, indicating broad tolerability. Despite frequent repetition, the measure was not associated with detectable practice effects. In terms of convergent validity, performance on the task was highly associated with gold standard in-lab affect recognition measures as well as other non-social neurocognitive measures. Highlighting the potential utility of intensively repeated tests, concurrently assessed psychosis symptoms (severity of voices and mistrustfulness) and sadness were associated with diminished performance, whereas sadness and positive affect but not psychotic symptoms impacted self-assessed performance. Taken together, these findings extend prior cross-sectional work on the influence of psychotic symptoms on social cognitive ability, and mood symptoms on biased judgements of performance. Thus, the METER could be a useful complement to a variety of applications in social cognition research.
Our findings address many of the dimensions used to evaluate the validity of lab-based social cognitive tasks [see Social Cognition Psychometric Evaluation study (SCOPE); Pinkham et al., Reference Pinkham, Harvey and Penn2018; Pinkham, Penn, Green, & Harvey, Reference Pinkham, Penn, Green and Harvey2016]. In terms of practicability and tolerability, the METER was associated with a relatively high rate of adherence, which was likely boosted by the practice of micro-payments per survey and check-ins from staff. Baseline symptom, cognitive, and demographic data did not impact adherence, and so there did not appear to be subgroups who experienced greater challenges with completing the task; in particular, adherence did not vary by level of cognitive impairment, which might be assumed to determine whether individuals can complete self-administered tasks. Adherence also did not appear to markedly impact convergent validity.
Although the task did not display substantial practice effects over the course of 10 days, future research would be needed to evaluate test–retest reliability of the task over separate measurement epochs. Furthermore, the task was associated with gold-standard, in-lab measures of the same construct (BLERT, ER-40) as well as, to a lesser extent, non-social neurocognition tests. Regression analyses indicated some specificity of validation toward the target construct of facial emotion recognition. Nonetheless, some psychometric properties remain to be evaluated. In particular, a central aspect of SCOPE was validation against measures of functional outcome, which will be evaluated in future studies examining the METER. In addition, other psychometric properties typical of in-lab measures, such as internal consistency, are challenging to measure with mobile repeated tasks. Each testing epoch contains a relatively small number of stimuli, and it is somewhat unclear how internal consistency metrics could be inclusive of repeated administrations over time when the underlying construct under study may also change; we found slight performance declines over time, and so task design must take into account performance changes as they correspond to the ordering of stimuli. As such, ecological momentary tasks may also require different kinds of psychometric indices. Further, the lack of a control group in this study inhibits the establishment of normative performance.
In addition to establishing the initial validity of the METER, this study showcases some of the potential for EMA to examine how clinical factors might influence social cognition and confidence judgments. Our study is consistent with prior work indicating that psychotic symptoms, in particular voices, negatively impact social cognitive ability, whereas mood, but not psychotic symptoms, is associated with self-assessment of performance. When between- and within-person effects were disentangled, these associations were best accounted for by between-person variation rather than by day-to-day increases in symptoms. There were surprisingly few associations with negative symptoms, although the sample was likely enriched for positive symptoms given the focus on suicidal ideation. Further, the PANSS may be less optimal for quantifying negative symptoms compared to other instruments such as the Clinical Assessment Interview for Negative Symptoms (CAINS; Kring, Gur, Blanchard, Horan, & Reise, Reference Kring, Gur, Blanchard, Horan and Reise2013).
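The distinction between between-person and within-person effects can be illustrated with person-mean centering, a common way of preparing EMA predictors for mixed models. The following is a minimal sketch and not the analysis code used in this study; the function name, participant identifiers, and ratings are hypothetical. Each momentary rating is split into the participant's own mean (the between-person component) and the deviation of that moment from the mean (the within-person component).

```python
from collections import defaultdict
from statistics import mean


def decompose_between_within(records):
    """Split each (person_id, rating) pair into a between-person part
    (that person's mean rating) and a within-person part (the momentary
    deviation from that mean)."""
    by_person = defaultdict(list)
    for person_id, value in records:
        by_person[person_id].append(value)
    person_means = {pid: mean(vals) for pid, vals in by_person.items()}
    return [
        (pid, person_means[pid], value - person_means[pid])
        for pid, value in records
    ]


# Hypothetical EMA voices-severity ratings: (participant id, rating).
ratings = [("p1", 2), ("p1", 4), ("p1", 3), ("p2", 6), ("p2", 6)]
decomposed = decompose_between_within(ratings)
for person_id, between_part, within_part in decomposed:
    print(person_id, between_part, within_part)
```

Entering both components as separate predictors would allow a model to estimate whether performance tracks who tends to report more severe symptoms overall, or when a given person reports more than is usual for them.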
Self-assessed performance is an emerging area of research in psychotic disorders because of its link to functional outcome (Gould et al., Reference Gould, McGuire, Durand, Sabbag, Larrauri, Patterson and Harvey2015; Silberstein & Harvey, Reference Silberstein and Harvey2019), and biases in judgements of performance may alter effort, motivation, and sustainment of goal-directed activities (Cornacchio, Pinkham, Penn, & Harvey, Reference Cornacchio, Pinkham, Penn and Harvey2017; Gould, Sabbag, Durand, Patterson, & Harvey, Reference Gould, Sabbag, Durand, Patterson and Harvey2013; Harvey, Strassnig, & Silberstein, Reference Harvey, Strassnig and Silberstein2019). Extending prior work from in-lab studies (Harvey et al., Reference Harvey, Paschall and Depp2015, Reference Harvey, Deckler, Jones, Jarskog, Penn and Pinkham2019; Moritz et al., Reference Moritz, Goritz, Gallinat, Schafschetzy, Van Quaquebeke, Peters and Andreou2015), our study indicated that overestimation of performance was linked to concurrent severity of voices as measured by EMA, whereas sadness was associated with underestimation of performance (and reduced performance). As with actual performance, these effects were most aligned with between-person variation rather than within-person fluctuation. This study demonstrates that over- and underestimation biases can be studied in real time. This opens the door for evaluating person- and time-varying mechanisms and social-environmental influences on these biases, such as with lagged models that exploit time series, as well as the impact of these biases on everyday social decision making, social avoidance, and behavior. It may also be possible for rehabilitative interventions to attempt to alter biases as they occur, such as with feedback delivered through ecological momentary interventions.
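As an illustration of the data preparation that lagged models would require, the sketch below pairs each momentary rating with the same participant's preceding rating. It is offered under stated assumptions rather than as a description of any analysis in this study: the field names (`person`, `time`, `sadness`), the function name, and the example values are all hypothetical, and the lag is taken across consecutive surveys within a person without regard to day boundaries.

```python
from itertools import groupby
from operator import itemgetter


def add_lag(rows, key="sadness"):
    """Return rows sorted by person and time, each with an added
    '<key>_lag1' field holding that person's previous rating
    (None for a person's first survey)."""
    ordered = sorted(rows, key=itemgetter("person", "time"))
    lagged = []
    for _, person_rows in groupby(ordered, key=itemgetter("person")):
        previous = None  # reset so lags never cross participants
        for row in person_rows:
            lagged.append({**row, f"{key}_lag1": previous})
            previous = row[key]
    return lagged


# Hypothetical EMA surveys, deliberately out of order.
rows = [
    {"person": "p1", "time": 2, "sadness": 5},
    {"person": "p1", "time": 1, "sadness": 3},
    {"person": "p2", "time": 1, "sadness": 4},
]
lagged = add_lag(rows)
for row in lagged:
    print(row)
```

A lagged predictor of this kind could then be entered alongside the concurrent rating to ask whether earlier mood or symptoms precede later estimation bias.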
There were several limitations to the study. The sample size was small and so validity should be considered preliminary, and the findings on the strength and direction of associations with in-lab and EMA measures would need to be replicated in a larger sample. The sample was stratified to over-recruit participants with current suicidal ideation, and the mean levels of current depression and psychosis severity were likely higher than those of prior studies of social cognitive tests that recruited more psychiatrically stable outpatient samples. In addition, at this time, we lack data from multiple EMA epochs, and so test–retest reliability across EMA bursts is unknown. Lastly, the METER is a measure of only one domain of social cognition, and future work may evaluate whether other domains (e.g. theory of mind) could be translated to mobile self-administered ecological momentary formats.
In summary, this study provided initial validation of a novel mobile self-administered facial emotion task, with a positive indication of adherence, tolerability, practicability, and lack of observed practice effects, along with convergent validity with gold-standard lab-based measures of the same construct and of non-social neurocognitive domains. EMA analyses revealed that psychotic symptoms influence facial emotion recognition accuracy but not self-assessed performance, whereas mood had a stronger impact on self-assessed performance. Future work will evaluate test–retest reliability and examine whether and how these observed accuracy deficits and biases influence behavior, including social function and suicidal behavior. Finally, this study provides optimism that other social cognitive tasks could be translated into EMA paradigms.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291720004419.
Acknowledgements
We would like to thank Katelyn Barone, Bianca Tercero, Cassi Springfield, Linlin Fan, Ian Kilpatrick, and Maxine Hernandez for their involvement in data collection and recruitment. We would also like to thank Mayra Cano for her efforts in data collection and in managing the data across the three sites.
Financial support
This work was supported by the National Institute of Mental Health (grant number: NIMH R01 MH116902-01A1).
Conflict of interest
Dr Raeanne C. Moore is a co-founder of KeyWise AI, Inc. and a consultant for NeuroUX. Dr Philip D. Harvey has received consulting fees or travel reimbursements from Acadia Pharma, Alkermes, Bio Excel, Boehringer Ingelheim, Minerva Pharma, Otsuka Pharma, Regeneron Pharma, Roche Pharma, and Sunovion Pharma during the past year. He receives royalties from the Brief Assessment of Cognition in Schizophrenia. He is the chief scientific officer of i-Function, Inc. He had a research grant from Takeda and from the Stanley Medical Research Foundation. None of these companies provided any information to the authors that is not in the public domain. No other authors have conflicts of interest to report.
Ethical standards
The authors assert that all procedures contributing to this project comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.