Introduction
Major depressive disorder (MDD) is typically a relapsing-remitting illness that rarely occurs as a solitary episode. Even depressions treated in primary care tend to be recurrent, chronic and co-morbid (Judd, 1997; Lin et al. 1998; Brodaty et al. 2001; Gilmer et al. 2005; Vuorilehto et al. 2005). Numerous publications have drawn attention to the low detection of MDD in primary care, with a typical case recognition rate (sensitivity of unassisted clinical detection alone) of between 36% and 56% (Thompson et al. 2000; Christensen et al. 2003; Croudace et al. 2003; MaGPIe Research Group, 2004). An equivalent body of work has highlighted low detection rates in medical settings (Wilhelm et al. 2004). Little, however, has been published regarding the detection rates in psychiatric settings or indeed about the assessment and screening practices of psychiatrists. Most information concerning the diagnostic value of specific symptoms of depression comes from validation studies of various mood questionnaires, including those relying upon only one or two questions (Williams et al. 2002a, b; Takeuchi et al. 2006; Li et al. 2007; Mitchell & Coyne, 2007).
Yet this approach may be problematic because almost every scale and diagnostic schedule was created by consensus without primary data regarding the value of the specific symptoms suggested. Furthermore, although there is considerable evidence concerning the accuracy of scores generated from combining multiple questions for depression, there is very little evidence concerning the diagnostic value of individual symptoms, even those that are included in DSM-IV and ICD-10. In ICD-10 two typical symptoms are required from the following three items: depressed mood, loss of interest, and decreased energy. A minimum of four symptoms is required to qualify for mild depression, and five symptoms (later revised to six) are needed for moderate depression. To qualify as a severe depressive episode all three typical symptoms must be present plus at least four other symptoms. In DSM-IV either depressed mood or loss of interest is required for a diagnosis of MDD, with a total of five from a list of nine symptoms altogether. In theory, assigning special significance to core features reduces false-positive diagnoses in those patients who manifest five of the nine criteria but without low mood or lost interest. Nevertheless, Zimmerman et al. (2006a) found that only 27 (1.5%) of 1800 psychiatric out-patients reported five or more criteria in the absence of low mood or loss of interest or pleasure. Of these 27 patients, 25 reported depressed mood at a subthreshold level. It is therefore unclear whether low mood or loss of interest has special significance in relation to a diagnosis of MDD. Indeed, it is also unclear to what degree each of the proposed specific symptoms of MDD has diagnostic weight when considered on its own.
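The DSM-IV counting rule described above can be sketched in a few lines. This is only an illustration of the symptom-count logic (five of nine criteria, at least one of them a core symptom); the identifier strings are hypothetical labels chosen for this example, and duration, impairment and exclusion criteria are deliberately not modelled.

```python
# Sketch of the DSM-IV symptom-count rule summarised in the text.
# Assumption: the string labels are illustrative, not an official coding scheme.
DSM_IV_SYMPTOMS = frozenset({
    "depressed_mood",
    "loss_of_interest",
    "appetite_weight_change",
    "sleep_disturbance",
    "psychomotor_change",
    "loss_of_energy",
    "worthlessness_guilt",
    "poor_concentration",
    "suicidal_ideation",
})
CORE_SYMPTOMS = frozenset({"depressed_mood", "loss_of_interest"})


def meets_dsm_iv_symptom_rule(present):
    """True if at least 5 of the 9 criteria are present, including a core symptom."""
    counted = set(present) & DSM_IV_SYMPTOMS  # ignore labels outside the nine criteria
    return len(counted) >= 5 and bool(counted & CORE_SYMPTOMS)


# Five criteria without either core symptom: the case Zimmerman et al. found to be rare.
no_core = {
    "appetite_weight_change",
    "sleep_disturbance",
    "psychomotor_change",
    "loss_of_energy",
    "worthlessness_guilt",
}
print(meets_dsm_iv_symptom_rule(no_core))                                 # fails the rule
print(meets_dsm_iv_symptom_rule((no_core - {"psychomotor_change"}) | {"depressed_mood"}))
```

Separating the core-symptom check from the count makes it easy to see which patients are excluded solely by the core requirement.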
The accuracy of specific symptoms may have additional importance in relation to screening for depression. There has been interest in developing very brief screening tools with fewer than five questions in the hope of improving acceptability in primary care (Takeuchi et al. 2006; Mitchell & Coyne, 2007; Muhwezi et al. 2007). Although the nine DSM-IV criteria of MDD have been incorporated into a short instrument, the Patient Health Questionnaire (PHQ-9), in one survey 62.5% of general practitioners considered this questionnaire too long and 37.5% considered it too time-consuming (Bermejo et al. 2005). This has led to the development of ultra-short questionnaires consisting of two or three questions, or even just a single detection question. Perhaps the best-known example is the two-item Patient Health Questionnaire (PHQ-2) (Spitzer et al. 1999). This asks: Over the past 2 weeks, have you been bothered by either (a) feeling down, depressed or hopeless or (b) having little interest or pleasure in doing things? From early validation studies on the PHQ it is likely that the items low mood (strictly a three-part conjoint question) and loss of interest (strictly a two-part conjoint question) were selected because they were the essential features in DSM-IV. In 2004 the National Institute for Health and Clinical Excellence (NICE) released guidelines for the management of unipolar depression in primary and secondary care (NICE, 2004). In 2007 NICE released guideline 45 for antenatal and postnatal mental health (NICE, 2007). Both included the recommendation that simple screening using two or three questions would suffice, and offered an adapted version of the PHQ-2 that extended over a duration of 4 weeks rather than 2.
Although the PHQ-2 has been used in a range of studies, only two studies have reported the accuracy of the questions applied individually (head-to-head) (Whooley et al. 1997; Lowe et al. 2003). In both of these studies, the second PHQ question (loss of interest) alone had superior sensitivity (Se) and negative predictive value (NPV) to the first question (low mood). It remains untested whether these items are optimal for diagnosing MDD or whether other specific symptoms would be preferable. To our knowledge, no group has reported on the diagnostic validity of the PHQ-2 methods in specialist settings, although one publication reported on the PHQ-9 in 171 ‘psychosomatic out-patients’ (Grafe et al. 2004). Similarly, no group has attempted to examine the diagnostic value of specific symptoms in psychiatric settings.
The Rhode Island Methods to Improve Diagnostic Assessment and Services (MIDAS) project is a large clinical epidemiological study in which semi-structured interviews were administered to a large sample of patients presenting for psychiatric out-patient treatment. We have previously examined the diagnostic properties of the DSM-IV criteria, in addition to the psychometric performance of symptoms that are not part of the diagnostic criteria (McGlinchey et al. 2006). Using simple logistic regression we found that the ranked order of symptoms by diagnostic weight for DSM-IV membership was depressed mood > loss of interest (anhedonia) > sleep disturbance > concentration/indecision > worthlessness/excessive guilt > loss of energy (Zimmerman et al. 2006b). The aim of the current study was to re-examine the diagnostic validity of a full item bank of DSM-IV and non-DSM-IV symptoms in order to determine which single items would be the most useful as a single-item diagnostic rule-in or rule-out test for MDD, as applied to a high prevalence setting.
Method
To date in the MIDAS project 1800 psychiatric out-patients have been evaluated with a semi-structured diagnostic interview in the Rhode Island Hospital Department of Psychiatry out-patient practice. The methods of the study have been described in detail elsewhere (Zimmerman et al. 2006b). The Rhode Island Hospital institutional review committee approved the research protocol, and all patients provided informed, written consent. Patients were interviewed by a diagnostic rater who administered the Structured Clinical Interview for DSM-IV (SCID; First et al. 1995). To study the psychometric performance of the DSM-IV symptom criteria for major depression, it was necessary to modify the SCID and eliminate the skip-out that curtails the depression module for patients who did not report either depressed mood or loss of interest or pleasure. Thus, we enquired about all of the symptoms of depression for all patients. For compound criteria that encompass more than one symptom (e.g. indecisiveness or impaired concentration; increased sleep or insomnia), we made separate ratings of each component of the diagnostic criterion. Thus, the nine DSM-IV symptom criteria were broken down into 17 separate items. Our total item bank consisted of 25 items (22 single and three dual or combination items). The combination items we allowed were ‘diminished interest or pleasure’, ‘anxiety’ and ‘sleep disturbance’, as many consider these to be integral features of depression without separation into more specific components.
Patients were interviewed by a diagnostic rater who administered the SCID, supplemented by items from the Schedule for Affective Disorders and Schizophrenia (SADS), to rate the severity of depressive and non-depressive symptoms (Endicott & Spitzer, 1978). The diagnostic raters in the MIDAS project were highly trained and monitored throughout the study to minimize rater drift. In addition to rating the DSM-IV MDD criteria, the interviewers determined the presence of the following symptoms, which are not part of the diagnostic criteria: hopelessness, helplessness, unreactive mood, diminished drive, psychic anxiety, and somatic anxiety. The reasons for considering these symptoms as possible diagnostic indicators for depression have been detailed previously (McGlinchey et al. 2006).
Out of the pool of 1800 patients, we excluded from the present analysis 106 patients with current bipolar disorder because there are some symptom differences between patients with bipolar and non-bipolar forms of depression. We also excluded 171 patients who had MDD that was in partial remission because inclusion of these patients with the depression group would have lowered the Se of the symptom criteria, whereas inclusion of the patients in the non-depressed group would have reduced specificity (Sp). Thus, the present report is based on the 1523 remaining psychiatric out-patients who were administered a semi-structured diagnostic interview. The demographic characteristics of the sample are included in Table 1. In brief, the majority of the subjects were white, female, and either married or single. The most frequent DSM-IV diagnoses were MDD (46.0%, n=829), social phobia (28.8%, n=519), panic disorder (18.4%, n=331) and generalized anxiety disorder (17.8%, n=320).
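The sample flow can be checked with simple arithmetic. The short sketch below uses only the counts reported above; the one assumption is that all 829 patients with MDD remain in the analysed sample after the exclusions, which is what the 54.4% prevalence quoted later in the paper implies.

```python
# Counts reported in the text.
total_evaluated = 1800
excluded_bipolar = 106
excluded_partial_remission = 171

analysed = total_evaluated - excluded_bipolar - excluded_partial_remission

# Assumption: all 829 MDD cases are retained in the analysed sample.
mdd_cases = 829
prevalence_analysed = mdd_cases / analysed

print(analysed)
print(round(100 * prevalence_analysed, 1))
```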
Diagnostic validity testing
In examining diagnostic accuracy of specific symptoms, a number of methodological issues were considered. The performance of a test will vary with the baseline prevalence of the condition (Whiting et al. 2004). Rule-in accuracy is best measured by the positive predictive value (PPV), where the denominator is all who test positive, and by Sp, where the denominator is all without the disease (Sackett & Haynes, 2002). For example, if the Sp is 100%, then there can be no false positives and hence all positive scores will imply a true case. Rule-out accuracy is best measured by the NPV, where the denominator is all who test negative, and by Se, where the denominator is all with the disease (Sackett & Haynes, 2002). When using a clinical feature as a diagnostic test, a symptom may have a high PPV or NPV, but in the real world its clinical relevance will be poor if it occurs rarely. Consider the example of a hypothetical biological challenge test that, if positive, has a 90% PPV but is positive in only half of depressed individuals (Se 50%). Clinically relevant rule-in accuracy (rule-in accuracy corrected for occurrence) would be the product of the PPV and Se, and clinically relevant rule-out accuracy (rule-out accuracy corrected for occurrence) would be the product of the NPV and Sp. These have been defined as the positive utility index (UI+) and the negative utility index (UI–) respectively (Mitchell, 2008). For the hypothetical challenge test, the UI+ would be only 0.90 × 0.50 = 0.45. Several methods are available to calculate overall diagnostic accuracy. The pre-test post-test gain, or ‘added value’, may be calculated by the difference between the prevalence and either PPV or NPV. Summary methods of accuracy include Youden's J and the Predictive Summary Index (PSI; Youden, 1950). Youden's J is a composite of overall accuracy calculated as Se + Sp – 1. The PSI is a composite of overall accuracy using all positive and negative screens, calculated as PPV + NPV – 1. Where multiple tests generate different Se and Sp values, the results can be combined in a summary receiver operating characteristic (sROC) curve (Macaskill, 2004).
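The measures defined in this section can all be derived from a single 2×2 table of counts. The following is a minimal sketch, not the authors' analysis code: the function name and dictionary keys are our own, and the example counts are hypothetical values chosen only to reproduce the challenge test described above (PPV 90%, Se 50%).

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Accuracy measures for one symptom from a 2x2 table of counts.

    tp/fn: depressed patients with/without the symptom
    fp/tn: non-depressed patients with/without the symptom
    """
    se = tp / (tp + fn)    # sensitivity: denominator is all with the disease
    sp = tn / (tn + fp)    # specificity: denominator is all without the disease
    ppv = tp / (tp + fp)   # denominator is all who test positive
    npv = tn / (tn + fn)   # denominator is all who test negative
    return {
        "Se": se,
        "Sp": sp,
        "PPV": ppv,
        "NPV": npv,
        "UI+": se * ppv,           # positive utility index
        "UI-": sp * npv,           # negative utility index
        "Youden_J": se + sp - 1,
        "PSI": ppv + npv - 1,
    }


# Hypothetical counts: 90 depressed, 110 non-depressed patients.
m = diagnostic_metrics(tp=45, fp=5, fn=45, tn=105)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

With these counts the test looks strong on PPV alone, yet its UI+ is less than half that value because it misses every second case.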
In the analysis of specific symptoms there is no cut-off, as all items score categorically. A further issue is that there is no consensus about what level of accuracy is acceptable clinically. The acceptable ratio of Se/Sp may depend on how important it is not to overlook cases (false negatives) or incorrectly diagnose cases (false positives). In some situations it may be better to minimize false negatives at the expense of false positives. As a guide we used the following grades of diagnostic accuracy (applied to PPV or NPV) adapted from Landis & Koch (1977): excellent=0.90, good=0.80, satisfactory=0.70; otherwise poor. The equivalent grades for clinical utility were 0.81, 0.64 and 0.49 (applied to the utility index).
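As a rough illustration, the grading scheme can be written as a small lookup. Two assumptions are made here and should be read as such: the grades are treated as lower bounds (a value at or above the threshold earns the grade), and the satisfactory threshold for PPV/NPV is taken to be 0.70, so that the utility grades of 0.81, 0.64 and 0.49 are the squares of the accuracy grades.

```python
# Assumed thresholds: (excellent, good, satisfactory), treated as lower bounds.
GRADE_THRESHOLDS = {
    "accuracy": (0.90, 0.80, 0.70),  # applied to PPV or NPV; 0.70 is an assumption
    "utility": (0.81, 0.64, 0.49),   # applied to UI+ or UI-
}
GRADE_LABELS = ("excellent", "good", "satisfactory")


def grade(value, kind="accuracy"):
    """Return the qualitative grade for a PPV/NPV ('accuracy') or UI ('utility')."""
    for label, cutoff in zip(GRADE_LABELS, GRADE_THRESHOLDS[kind]):
        if value >= cutoff:
            return label
    return "poor"


print(grade(0.91))                  # a PPV or NPV of 0.91
print(grade(0.75, kind="utility"))  # a utility index of 0.75
```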
Results
The sample characteristics are shown in Table 1.
Symptom frequencies
Seven items occurred in at least half of all cases (depressed and non-depressed). The all-case proportion was, in diminishing order: loss of energy (62%), diminished drive (62%), sleep disturbance (60%), depressed mood (59%), anxiety (57%), diminished concentration (56%), and insomnia (51%). All of these items also occurred in more than half of patients with MDD. The only items that were common (>50%) in MDD but uncommon in non-MDD were diminished interest/pleasure, helplessness, appetite/weight disturbance, indecisiveness and psychomotor retardation.
The 10 most common symptoms in patients with depression were: depressed mood (93%), diminished drive (88%), loss of energy (87%), sleep disturbance (83%), diminished concentration (82%), diminished interest/pleasure (81%), insomnia (70%), anxiety (69%), worthlessness (61%) and helplessness (60%).
Rule-in accuracy
Overall rule-in accuracy
The Se, Sp, PPV and NPV of all items considered are shown in Table 2. Using the PPV, the five most accurate symptoms and the five least accurate symptoms when confirming a diagnosis of MDD are shown in Table 3. From these results, in a psychiatric out-patient setting, if a patient had psychomotor retardation, there was a 90% chance based on this item alone that MDD would be present. By contrast, the presence of anxiety alone increased the chance of a correct diagnosis by only 11.7% above the baseline (chance) rate (calculated by subtracting prevalence from PPV). However, psychomotor retardation occurred in only 16.9% of the sample and 28% of depressed patients, reducing its clinical applicability.
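The added-value calculation is a simple subtraction, sketched below. The prevalence of 54.4% is the figure reported for the analysed sample; the PPV for anxiety is not quoted directly in the text, so the value computed here is only what the stated 11.7% gain implies.

```python
def added_value(predictive_value, baseline):
    """Gain in diagnostic certainty over the baseline (chance) rate."""
    return predictive_value - baseline


prevalence = 0.544  # MDD prevalence in the analysed sample

# Psychomotor retardation: PPV of 90% as reported in the text.
gain_retardation = added_value(0.90, prevalence)

# Anxiety: the text gives the gain (11.7%), so the implied PPV is derived from it.
implied_ppv_anxiety = prevalence + 0.117

print(f"psychomotor retardation gain: {gain_retardation:.1%}")
print(f"implied PPV for anxiety: {implied_ppv_anxiety:.1%}")
```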
Rule-in accuracy corrected for occurrence (UI)
Considering the UI+, that is the occurrence in the depressed patients (Se) multiplied by the proportion of all positive tests that are accurate (PPV), the top five most accurate and frequent items were: depressed mood (UI+ 0.80), diminished interest/pleasure (UI+ 0.71), diminished drive (UI+ 0.69), loss of energy (UI+ 0.67) and diminished concentration (UI+ 0.65).
Clinically, out of 200 individuals seen as psychiatric out-patients where the prevalence of depression is approximately 50%, 111 would be expected to have depressed mood and, of these, 93 would be true cases and 18 false positives. Eighty-nine individuals would be expected to deny depressed mood and, of these, 82 would be true negatives and seven would have syndromal depression that was overlooked.
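The hypothetical cohort can be reproduced from a sensitivity, a specificity and a prevalence. In the sketch below, Se = 0.93 is the reported frequency of depressed mood in depressed patients; Sp = 0.82 is not quoted directly and is inferred here from the 18 false positives among 100 non-depressed patients, so it should be treated as an assumption.

```python
def expected_counts(n, prevalence, se, sp):
    """Expected 2x2 cell counts when one symptom is used as a test in n patients."""
    cases = n * prevalence
    non_cases = n - cases
    tp = se * cases        # depressed patients who report the symptom
    fn = cases - tp        # depressed patients who deny it (overlooked)
    tn = sp * non_cases    # non-depressed patients who deny it
    fp = non_cases - tn    # non-depressed patients who report it
    return {"TP": round(tp), "FN": round(fn), "FP": round(fp), "TN": round(tn)}


# Depressed mood in 200 out-patients at roughly 50% prevalence.
counts = expected_counts(n=200, prevalence=0.50, se=0.93, sp=0.82)
screen_positive = counts["TP"] + counts["FP"]
screen_negative = counts["TN"] + counts["FN"]

print(counts)
print(screen_positive, screen_negative)
```

The two margins necessarily sum to the cohort size, which is a quick consistency check on any worked example of this kind.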
Rule-out accuracy
Overall rule-out accuracy
Using the NPV, the five most accurate and the five least accurate symptoms when attempting to confirm the absence of MDD are shown in Table 4. Thus, in a clinical setting, if an individual did not report depressed mood, there was a 91% chance that the individual was not suffering from depression, an added value of 36.2% over chance detection alone. However, if an individual did not report weight loss, there was only a 50% chance based on this item alone that MDD would be absent. This is in fact 4% less than the unassisted detection rate of 54.4%, suggesting that this item is less than helpful in ruling out depression. A summary of the added value of all specific symptoms is illustrated in Fig. 1.
Rule-out accuracy corrected for occurrence (UI)
Considering the UI–, or the occurrence in the non-depressed patients (Sp) in combination with the proportion of all negative tests that are accurate (NPV), the top five most accurate and frequent items were: depressed mood (UI− 0.75), diminished interest/pleasure (UI− 0.69), diminished concentration (UI− 0.59), diminished drive (UI− 0.58) and worthlessness (UI− 0.57).
Using the item diminished interest/pleasure in a hypothetical group of 200 out-patients in which the prevalence of depression is about 50%, 92 people would be expected to show loss of interest/pleasure and 108 would not. Of these 92, 80 (88%) would be true positives and of the 108 without this symptom, 88 (79%) would be true negatives.
Combined accuracy
Measures of combined accuracy attempt to assess overall diagnostic accuracy by considering rule-in and rule-out performance at the same time. Examining the Youden index and the PSI, the most discriminating items were: depressed mood (Youden=0.75), diminished interest/pleasure (Youden=0.68) and diminished drive (Youden=0.58). This is illustrated in an sROC plot of Se versus 1 – Sp for each individual item (Fig. 2).
Discussion
In this report from the MIDAS project, we extended our previous work on discriminatory items by looking in more detail at rule-in and rule-out accuracy of specific symptoms of depression. As anticipated, the most common symptom of depression in this sample was depressed mood but, of note, the next four most common (diminished drive, loss of energy, sleep disturbance, diminished concentration) are traditionally viewed as somatic or cognitive rather than affective in nature. These findings correspond to previous studies that have examined the frequency of symptoms of depression in psychiatric, primary care and community samples (Breslau & Davis, 1985; Buchwald & Rudick-Davis, 1993). However, frequency of occurrence is not the same as discriminatory ability and this in turn is best considered in two directions. In this sample the three most discriminatory items for ruling in MDD were psychomotor retardation, diminished interest/pleasure and indecisiveness. The three most discriminatory specific symptoms for ruling out MDD were the absence of depressed mood, diminished drive and loss of energy. However, some rule-in items were uncommon in depressed patients and some rule-out items were uncommon in non-depressed patients, so that their usefulness in the clinical setting would be limited. Correcting for this, the most clinically useful rule-in items became depressed mood, diminished interest/pleasure and diminished drive. The most clinically useful rule-out items became the absence of depressed mood, the absence of diminished interest/pleasure and the absence of poor concentration. The combined measures of accuracy ranked discriminatory power as depressed mood > diminished interest/pleasure > diminished drive. These items increased the chances of an accurate diagnosis from about 50% to more than 80% even when used alone.
The results suggest that the low mood and loss of interest/pleasure items of the PHQ-2 and NICE guidance are close to the optimal single-item questions out of all possible symptoms of depression. The fact that the best single item was depressed mood is notable, given that Zimmerman et al. (2006c) recently found that only 7% of those qualifying as MDD do not report low mood. Patients without reported depressed mood may be qualitatively different to those with low mood and hence may be more problematic for general practitioners to detect. Diminished drive had significant value in diagnosing syndromal depression. Diminished drive refers to reduced motivation or a decreased capacity to initiate behaviour towards a certain goal. Further work should be performed to elucidate the role of motivation as a diagnostic feature of MDD. Loss of energy (fatigue), a typical (core) feature of ICD-10, had only modest discriminatory power, in part because it occurred in 32% of non-depressed cases. However, somatic symptoms were common and important. Most previous studies agree that somatic symptoms of depression are common and discriminating in both primary and secondary care settings (Akechi et al. 2003; Nakao & Yano, 2003; Barkow et al. 2004; Reuter et al. 2004; de Coster et al. 2005). However, in a study of Nigerian army personnel using principal component analysis, Okulate et al. (2004) found that somatic items accounted for only a little of the total variance for depression in this setting.
Patients who present initially with somatic complaints may pose diagnostic difficulties if health-care professionals only think about depression when emotional complaints are mentioned (Bridges & Goldberg, 1985).
The strengths of this study are the large sample size and the use of highly trained, reliable interviewers who used semi-structured diagnostic interviews. There are also several limitations. First, the alternative (non-DSM-IV) symptoms may have been disadvantaged, in that a diagnosis of MDD was dependent on the contribution of each of the DSM-IV MDD symptom criteria but not influenced at all by the alternative symptoms. Second, data on diagnostic accuracy were calculated post hoc. Dissecting individual items from a questionnaire may increase the discriminatory ability compared with a new independent analysis because an interviewer is not required to use all items to make a diagnosis. Third, and perhaps most importantly, we relied on a psychiatric out-patient sample where the prevalence of depression was 54.4%. This is considerably higher than that seen in studies of MDD in primary care. This may mean that the diagnostic weighting of individual symptoms is somewhat different in a primary care sample. This clearly requires further study.
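The dependence on prevalence can be made concrete with Bayes' theorem. The sketch below holds Se and Sp fixed and varies only prevalence. Both inputs are assumptions for illustration: Se = 0.93 and Sp = 0.82 are approximate values for depressed mood inferred from the figures reported in the Results, and the 10% prevalence is a hypothetical primary care value, not a figure from this study. As the limitation itself notes, Se and Sp may also differ between settings, which this sketch ignores.

```python
def predictive_values(prevalence, se, sp):
    """PPV and NPV implied by Bayes' theorem for a given prevalence, Se and Sp."""
    tp = se * prevalence
    fn = (1 - se) * prevalence
    tn = sp * (1 - prevalence)
    fp = (1 - sp) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Assumed Se/Sp for depressed mood; 0.544 is the study prevalence,
# 0.10 is a hypothetical lower-prevalence setting.
ppv_clinic, npv_clinic = predictive_values(0.544, se=0.93, sp=0.82)
ppv_low, npv_low = predictive_values(0.10, se=0.93, sp=0.82)

print(f"prevalence 54.4%: PPV {ppv_clinic:.2f}, NPV {npv_clinic:.2f}")
print(f"prevalence 10.0%: PPV {ppv_low:.2f}, NPV {npv_low:.2f}")
```

Under these assumptions the same question loses much of its rule-in value at low prevalence while its rule-out value rises, which is why the rankings reported here should not be transferred to primary care without further study.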
Miller (2002) found that, when unassisted, clinicians evaluated an average of only 32% of DSM-IV criteria for MDD. Even psychiatrists, who usually remember to ask about low mood, enquire about loss of interest/pleasure in only 8% of evaluations for depression (Miller, 2002). General practitioners only consider low mood or loss of interest to be useful in detecting depression in 54% and 36% of cases respectively (Krupinski & Tiller, 2001). Given these problems, considerable effort has been expended on perfecting short case-finding instruments that might perform almost as well as longer validated severity scales or even semi-structured interviews. In a large occupational health sample of 1621 workers, Takeuchi et al. (2006) reported that the single item ‘feeling blue’ or a combination with ‘miserable’ had similar diagnostic accuracy to the full 15-item Profile of Mood States (POMS) questionnaire. Mitchell & Coyne (2007) recently reported a pooled analysis from eight studies of single-question tests for the diagnosis of depression in primary care. Although the overall Se was low at 31.9%, Sp was high at 96% (PPV was 55.6% and NPV was 92.3%).
In conclusion, we found that the most clinically valuable specific symptoms when diagnosing depression were depressed mood and diminished interest/pleasure together with diminished drive (rule-in) and poor concentration (rule-out). This study supports the use of the questions endorsed by the PHQ-2 in a psychiatric out-patient clinic and suggests that specific items can give reasonable diagnostic performance in high prevalence settings, providing that 14% false-positive and 10% false-negative error rates are considered acceptable.
Declaration of Interest
None.