
The effects of psychotherapy for adult depression are overestimated: a meta-analysis of study quality and effect size

Published online by Cambridge University Press:  03 June 2009

P. Cuijpers*
Affiliation:
Department of Clinical Psychology, VU University Amsterdam, The Netherlands; EMGO Institute, VU University Medical Center, Amsterdam, The Netherlands
A. van Straten
Affiliation:
Department of Clinical Psychology, VU University Amsterdam, The Netherlands; EMGO Institute, VU University Medical Center, Amsterdam, The Netherlands
E. Bohlmeijer
Affiliation:
Technical University Twente, Deventer, The Netherlands
S. D. Hollon
Affiliation:
Department of Psychology, Vanderbilt University, Nashville, TN, USA
G. Andersson
Affiliation:
Department of Behavioural Sciences and Learning, Swedish Institute for Disability Research, Linköping University, Sweden; Department of Clinical Neuroscience, Psychiatry Section, Karolinska Institutet, Stockholm, Sweden
*Address for correspondence: P. Cuijpers, Ph.D., Professor of Clinical Psychology, Department of Clinical Psychology, VU University Amsterdam, Van der Boechorststraat 1, 1081 BT Amsterdam, The Netherlands. (Email: p.cuijpers@psy.vu.nl)

Abstract

Background

No meta-analysis has examined whether the quality of studies of psychotherapy for adult depression is associated with the effect sizes they report. The present study assesses this association.

Method

We used a database of 115 randomized controlled trials in which 178 psychotherapies for adult depression were compared to a control condition. Eight quality criteria were assessed by two independent coders: participants met diagnostic criteria for a depressive disorder, a treatment manual was used, the therapists were trained, treatment integrity was checked, intention-to-treat analyses were used, N⩾50, randomization was conducted by an independent party, and assessors of outcome were blinded.

Results

Only 11 studies (16 comparisons) met all eight quality criteria. The standardized mean effect size for the high-quality studies (d=0.22) was significantly smaller than that for the other studies (d=0.74, p<0.001), even after restricting the comparison to the subset of other studies that used the care-as-usual or non-specific control conditions typical of the high-quality studies. Heterogeneity was zero in the group of high-quality studies. The number needed to be treated (NNT) was 8 in the high-quality studies and 2 in the lower-quality studies.

Conclusions

We found strong evidence that the effects of psychotherapy for adult depression have been overestimated in meta-analytical studies. Although the effects of psychotherapy are significant, they are much smaller than was assumed until now, even after controlling for the type of control condition used.

Type
Original Articles
Copyright
Copyright © Cambridge University Press 2009

Introduction

In the past three decades more than 100 randomized controlled studies have shown that psychotherapy is an effective treatment of depressive disorders in adults. Meta-analyses in this field have consistently found that psychotherapies have moderate to large effect sizes (Gloaguen et al. 1998; Churchill et al. 2001; Wampold et al. 2002; Cuijpers et al. 2007a; Ekers et al. 2008). However, in several adjacent fields recent meta-analytical studies have found strong indications that earlier meta-analyses overestimated treatment effects considerably. Two recent meta-analytical reviews of studies examining the effects of antidepressant medication found that these effects have been overestimated because of publication bias (Kirsch et al. 2008; Turner et al. 2008). Recent meta-analyses of psychotherapy for depression in children and adolescents have likewise found indications that the effects were overestimated considerably in earlier meta-analytical research (Weisz et al. 2006; Klein et al. 2007).

One of the major reasons for overestimation of the overall efficacy of an intervention is the inclusion of low-quality studies in meta-analyses (Altman, 2002). Whether lower study quality inflates the mean effect sizes found in meta-analyses of psychotherapy for adult depression has rarely been addressed. Most meta-analyses in this field have focused on specific types of psychotherapy (De Mello et al. 2005; Cuijpers et al. 2007b; Leichsenring & Rabung, 2008), specific target groups (Bledsoe & Grote, 2006; Pinquart et al. 2006), or specific delivery formats (McDermut et al. 2001; Gellatly et al. 2007), and only one of these studies addressed the association between effect size and study quality (Gellatly et al. 2007). That study, on self-help treatments for depression, found significant associations between effect size and several quality criteria, including concealment of allocation and intention-to-treat analyses. Another meta-analysis, on psychodynamic therapies, correlated effect sizes with study quality as a continuous measure but found no significant association (Leichsenring & Rabung, 2008). Only a few meta-analyses have covered the complete field of psychotherapy for adult depression, and these either did not examine study quality at all (Robinson et al. 1990) or did not examine the association between effect size and study quality (Cuijpers et al. 2008b). The one comprehensive meta-analysis that did examine study quality found that high-quality studies had lower effect sizes than lower-quality studies, but it neither tested this difference for significance, nor estimated how much the overall effect size was inflated by the inclusion of lower-quality studies, nor examined whether specific quality criteria were responsible for such an overestimation.

We decided, therefore, to examine in more detail whether there is an association between study quality and outcome in a meta-analysis in which we included all studies that compared psychotherapy to a control condition.

Method

Identification and selection of studies

We used a database of studies on the psychological treatment of depression in general. This database, how it was developed, and the methods used have been described in detail elsewhere (Cuijpers et al. 2008a). Key materials, an overview of the goals and mission of the project, and a list of all other meta-analyses based on this database can be downloaded from the project website (www.evidencebasedpsychotherapies.org). In brief, the database was developed through a comprehensive literature search (from 1966 to December 2007) in which we examined 8861 abstracts in PubMed (1403 abstracts), PsycINFO (2097), EMBASE (2207), the Cochrane Central Register of Controlled Trials (2204) and, to identify unpublished studies, Dissertation Abstracts International (950). We identified abstracts by combining terms indicative of psychological treatment and depression. For this database, we also collected the primary studies from earlier meta-analyses of psychological treatments for depression (Cuijpers & Dekker, 2005) and checked the references of included studies. We retrieved a total of 857 papers and 33 dissertations for further study, and selected those that met our inclusion criteria.

We included studies in which (1) the effects of a psychological treatment (2) on adults (3) with a depressive disorder or an elevated level of depressive symptomatology, (4) were compared to a control condition (5) in a randomized controlled trial.

Psychological treatments were defined as interventions in which verbal communication between a therapist and a patient was the core element, or in which a psychological treatment was written down in book format (bibliotherapy) and the patient worked through it more or less independently but with some kind of personal support from a therapist (by telephone, email, or otherwise). Control conditions could be waiting lists, care as usual (CAU), pill-placebo, or psychological placebo groups. We excluded studies in which the psychological intervention could not be discerned from other elements of the intervention (managed care interventions and disease management programmes), studies for which a standardized effect size could not be calculated (usually because no statistical test comparing psychotherapy to the control condition was reported), studies of children and adolescents, studies of in-patients, studies of maintenance treatments and relapse prevention, and studies whose samples included participants with anxiety but no depression alongside those who also had depression. Co-morbid general medical or psychiatric disorders were not used as an exclusion criterion. No language restrictions were applied.

Coding of study quality

We assessed the quality of each study using eight criteria. These criteria were based on an authoritative review of empirically supported psychotherapies (Chambless & Hollon, 1998), and on the criteria proposed by the Cochrane Collaboration to assess the methodological validity of a study (Higgins & Green, 2006). The criteria based on the review of empirically supported psychotherapies assessed the quality of the treatment delivery, while the criteria proposed by the Cochrane Collaboration assessed more methodological sources of bias.

A study was considered to be of high quality when (1) participants met diagnostic criteria for a depressive disorder (as assessed with a personal diagnostic interview, such as the CIDI, SCID, or SADS, and using a diagnostic system such as DSM or Research Diagnostic Criteria); (2) the study referred to the use of a treatment manual (either a published manual, or a manual specifically designed for the study); (3) the therapists who conducted the therapy were trained for the specific therapy, either specifically for that study or as part of their general training; (4) treatment integrity was checked during the study (by supervision of the therapists during treatment, by recording of treatment sessions, or by systematic screening of protocol adherence with a standardized measurement instrument); (5) data were analysed with intention-to-treat analyses, in which all persons initially randomized to the treatment and control conditions were included in the analyses; (6) the study had a minimal level of statistical power to detect significant treatment effects, including ⩾50 persons in the comparison between treatment and control groups [this allows the study to detect standardized effect sizes of d=0.80 and larger, assuming a statistical power of 0.80 and α=0.05; calculations in Stata (Stata Corp., USA)]; (7) the study reported that randomization was conducted by an independent (third) party (this criterion was rated positive if an independent person did the randomization, if a computer program was used to assign patients to conditions, or if sealed envelopes were used); (8) assessors of outcome were blinded to the condition to which respondents were assigned (this criterion was only coded when the effect sizes were based on interviewer-based depression ratings; when only self-reports were used, it was assumed to be met).
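
The sample-size threshold in criterion (6) can be checked with any standard power routine. As a minimal sketch (not the authors' calculation, which was done in Stata; here assuming a two-sided independent-samples t-test), the equivalent in Python with statsmodels is:

```python
# Rough check of criterion (6): with alpha = 0.05 and power = 0.80,
# about 26 participants per arm are needed to detect d = 0.80,
# i.e. roughly 50 in the treatment-versus-control comparison.
from statsmodels.stats.power import TTestIndPower

n_per_arm = TTestIndPower().solve_power(effect_size=0.80, alpha=0.05, power=0.80)
print(round(n_per_arm))  # -> 26 per arm
```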

If a study did not report whether it met a quality criterion, that criterion was coded as negative. All quality criteria for each study were coded as positive or negative by two independent researchers. Disagreements were resolved by discussing the ratings, which usually resulted in more precisely specified definitions of the criteria.

We also coded other characteristics of the studies: target group (adults in general versus a more specific population, such as older adults, or women with postpartum depression), recruitment method (through community recruitment, clinical samples, or other), type of therapy (cognitive behaviour therapy versus other therapy types), therapy format (individual, group, or guided self-help), number of sessions (<8, 8–11, ⩾12), and type of control group (waiting-list control group, CAU, or other, such as pill-placebo or psychological placebo groups).

Analyses

We first calculated effect sizes (standardized mean difference, d) for each study by subtracting (at post-test) the average score of the control group (Mc) from the average score of the experimental group (Me) and dividing the result by the pooled standard deviation of the experimental and control groups (SDec). An effect size of 0.5 thus indicates that the mean of the experimental group is half a standard deviation larger than the mean of the control group. Effect sizes of ⩾0.56 can be assumed to be large, while effect sizes of 0.33–0.55 are moderate, and lower effect sizes are small (Lipsey, 1990).
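
Written out (a standard formulation; the (n−1)-weighted pooling shown for SDec is an assumption, since the paper does not spell out the pooling formula):

```latex
d = \frac{M_{e} - M_{c}}{SD_{ec}},
\qquad
SD_{ec} = \sqrt{\frac{(n_{e}-1)\,SD_{e}^{2} + (n_{c}-1)\,SD_{c}^{2}}{n_{e}+n_{c}-2}}
```

where Me and Mc are the post-test means and ne and nc the group sizes of the experimental and control groups.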

In the calculations of effect sizes, only those instruments were used that explicitly measure depression (Table 1). If more than one depression measure was used, the mean of the effect sizes was calculated, so that each study (or contrast group) had only one effect size. When means and standard deviations were not reported, we used other statistics (t value, p value) to calculate effect sizes. We focused on the effect sizes based on the difference between treatment and control groups at post-test. Differences at follow-up were not included in the analyses.

Table 1. Selected characteristics of high-quality studies of psychotherapy for adult depression

BDI, Beck Depression Inventory; CAU, care as usual; CBT, cognitive behaviour therapy; DRP, Depression Recurrence Prevention programme; DYST, dysthymia; EPDS, Edinburgh Postnatal Depression Scale; EU, European Union; GDS, Geriatric Depression Scale; GP, general practitioner; GRP, group therapy; HAMD, Hamilton Rating Scale for Depression; HSCL-D-20, Hopkins Symptom Checklist Depression Scale; IND, individual therapy; IPT, interpersonal psychotherapy; minD, minor depression; MADRS, Montgomery–Asberg Depression Rating Scale; MDD, major depressive disorder; NL, The Netherlands; PST, problem-solving therapy.

The standardized mean difference is not easy to interpret from a clinical point of view. Therefore, we transformed the standardized mean differences into the numbers needed to be treated (NNT), using the formulae provided by Kraemer & Kupfer (2006). The NNT indicates the number of patients that have to be treated in order to generate an additional positive outcome in one of them (Sackett et al. 2000).
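
A minimal sketch (not the authors' code) of one common reading of the Kraemer & Kupfer (2006) conversion, NNT = 1/(2·Φ(d/√2) − 1), with Φ the standard normal cumulative distribution function; it reproduces the NNT values reported below to within rounding:

```python
from math import sqrt

from scipy.stats import norm

def nnt_from_d(d: float) -> float:
    # Convert a standardized mean difference d into a number needed to treat.
    return 1.0 / (2.0 * norm.cdf(d / sqrt(2.0)) - 1.0)

print(round(nnt_from_d(0.22), 2))  # ~8.09, close to the reported NNT of 8.06
print(round(nnt_from_d(0.68), 2))  # ~2.71, close to the reported overall NNT of 2.70
```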

To calculate pooled mean effect sizes, we used the computer program Comprehensive Meta-Analysis (version 2.2.021). Because we expected considerable heterogeneity, we conducted all analyses using the random-effects model (Higgins & Green, 2006).

We assessed heterogeneity by calculating the I² statistic, which expresses heterogeneity as a percentage (Higgins et al. 2003). A value of 0% indicates no observed heterogeneity, and larger values indicate increasing heterogeneity, with 25% regarded as low, 50% as moderate, and 75% as high heterogeneity. We also calculated the Q statistic, but only report whether or not it was significant.
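
A minimal sketch (not the Comprehensive Meta-Analysis implementation used by the authors) of the Higgins et al. (2003) definition, I² = max(0, (Q − df)/Q) × 100, where Q is Cochran's heterogeneity statistic; the effect sizes and variances below are made up for illustration:

```python
import numpy as np

def q_and_i2(effects, variances):
    """Return Cochran's Q and the I^2 statistic (%) for a set of comparisons."""
    w = 1.0 / np.asarray(variances, dtype=float)   # inverse-variance weights
    d = np.asarray(effects, dtype=float)
    d_fixed = np.sum(w * d) / np.sum(w)            # fixed-effect pooled estimate
    q = float(np.sum(w * (d - d_fixed) ** 2))      # Cochran's Q
    df = len(d) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

q, i2 = q_and_i2([0.2, 0.8, 0.5, 1.1, 0.4], [0.04, 0.05, 0.03, 0.06, 0.04])
print(round(q, 2), round(i2, 1))  # Q = 10.0, I^2 = 60.0% for these toy data
```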

Subgroup analyses and univariate meta-regression analyses were conducted according to the procedures implemented in Comprehensive Meta-Analysis version 2.2.021. In the subgroup analyses we used mixed-effects analyses that pooled studies within subgroups with the random-effects model, but tested for significant differences between subgroups with the fixed-effects model.

Multivariate meta-regression analyses in which more than one predictor was entered simultaneously were conducted in Stata/SE 8.2 for Windows, because these analyses cannot be conducted in Comprehensive Meta-Analysis. To avoid collinearity among the predictors entered into the regression models, we calculated the correlations between all candidate study characteristics and checked that all were lower than r=0.60.
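
As a minimal sketch of this screening step (the column names and data below are made up, not the authors' coding sheet), the pairwise check can be expressed as:

```python
import numpy as np
import pandas as pd

# Hypothetical coded study characteristics, one row per comparison.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "high_quality": rng.integers(0, 2, size=178),
    "cbt": rng.integers(0, 2, size=178),
    "individual_format": rng.integers(0, 2, size=178),
    "n_sessions": rng.integers(4, 20, size=178),
    "cau_control": rng.integers(0, 2, size=178),
})

corr = X.corr()
# Flag any predictor pair with |r| >= 0.60 (ignoring the diagonal).
flagged = [(a, b, round(corr.loc[a, b], 2))
           for i, a in enumerate(corr.index)
           for b in corr.columns[i + 1:]
           if abs(corr.loc[a, b]) >= 0.60]
print(flagged or "no collinear pairs; all predictors can enter the model")
```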

Results

Description of included studies

All inclusion criteria were met by 115 studies, in which 178 psychological treatment conditions were compared to a control group. These studies included a total of 8140 participants (4773 in the experimental groups and 3367 in the control groups). The overall mean effect size of all comparisons was 0.68 (95% CI 0.60–0.76). A full list of references and the coded characteristics of the included studies can be downloaded from the website of the project (www.evidencebasedpsychotherapies.org).

Table 2 shows how many comparisons were present in each of the major subgroups of studies. In 60 studies (52.2%) participants met diagnostic criteria for a mood disorder, while in the other 55 studies another definition of depression was used (usually a high score on a self-report questionnaire). Of the 59 studies which measured depression severity with the Beck Depression Inventory (BDI; Beck et al. 1961) at pre-test, most (43 studies, 72.9%) indicated moderate to severe mean depression scores (BDI score between 19 and 29), while seven studies (11.9%) were aimed at participants with severe depression (mean BDI score ⩾30), and nine (7.8%) at mild to moderate depression (mean BDI score ⩽18). Fifteen unpublished dissertations were included.

Table 2. Differences between high-quality and other studies: overall analyses and subgroup analyses

BDI, Beck Depression Inventory; CBT, cognitive behaviour therapy; CI, confidence interval; HAMD, Hamilton Rating Scale for Depression; HQ, High-quality studies; N comp, number of comparisons; NNT, numbers needed to be treated.

a The p value in this column indicates whether the Q statistic is significant (the I² statistic contains no test of significance, but indicates heterogeneity in percentages).

b This p value indicates whether high-quality studies differed significantly from the other studies. Significant p values are underlined.

c When the other studies (not high-quality) were limited to pill placebo, the outcomes were comparable (four lower-quality studies used a pill placebo as control; mean effect size: 0.53, 95% CI −0.16 to 1.22, Z=1.50, n.s., I²=82.57, p<0.01), but the difference was not significant, possibly because of low statistical power.

** p<0.01, *** p<0.001.

All quality criteria were met by 11 studies (16 comparisons). Selected characteristics of these studies are presented in Table 1. Five of the high-quality studies used a CAU group, while six used a placebo control group. None of these studies used a waiting-list control group. Eight studies were restricted to patients with a major depressive disorder, while the other three also included persons with other depressive diagnoses. Three studies recruited participants from the community, five from clinical samples, and three used other recruitment methods. Six were aimed at adults in general, while the others focused on more specific target groups. Of the 16 interventions in these studies, eight examined cognitive behavioural therapies, and 13 used an individual treatment format (no intervention used a guided self-help format).

Effect size in high-quality and lower-quality studies

The overall mean effect size for all comparisons was 0.68 [95% confidence interval (CI) 0.60–0.76], with an NNT of 2.70 and high heterogeneity (I²=70.27; Table 2).

We tested the difference between the high-quality studies and the studies that did not meet all quality criteria in subgroup analyses (Table 2), and found that the mean effect size for high-quality studies (d=0.22, 95% CI 0.13–0.31, NNT=8.06) was significantly smaller than the mean effect size of the other studies (d=0.74, 95% CI 0.65–0.84, p<0.001, NNT=2.48). Furthermore, heterogeneity was zero in the group of high-quality studies, while it was high in the other studies. The effect sizes of the high-quality studies and the pooled mean effect sizes are presented in Fig. 1.

Fig. 1. Standardized effect sizes of high-quality studies on psychological treatment of adult depression compared to control conditions at post-test.

Because outliers may distort the overall mean effect sizes, we excluded the comparisons with effect sizes ⩾2.0, and again compared the remaining effect sizes from the high-quality and the other studies. This resulted in comparable outcomes, with a highly significant difference between high-quality and other studies (p<0.001, Table 2), zero heterogeneity for the high-quality studies, and high heterogeneity for the other studies.

Our analyses included 49 studies in which two or more psychological treatments were compared to a control group, which means that multiple comparisons from one study were included in the same analysis. These multiple comparisons are not independent of each other, possibly resulting in an artificial reduction of heterogeneity and a bias in the overall mean effect size. We therefore conducted additional analyses in which we included only one comparison per study: first only the comparison with the largest effect size from each study, followed by another analysis including only the comparison with the smallest effect size. As shown in Table 2, the difference between high-quality and other studies remained highly significant (p<0.001).
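
In practice, this sensitivity analysis amounts to selecting, within each trial, the comparison with the largest (and then the smallest) effect size before re-pooling; a minimal pandas sketch with made-up data (the column names are illustrative):

```python
import pandas as pd

# Hypothetical long-format data: one row per comparison, "study" identifies the parent trial.
df = pd.DataFrame({
    "study": ["A", "A", "B", "C", "C", "C"],
    "d":     [0.90, 0.40, 0.55, 1.10, 0.70, 0.20],
})

largest_per_study = df.loc[df.groupby("study")["d"].idxmax()]   # one row per study, largest d
smallest_per_study = df.loc[df.groupby("study")["d"].idxmin()]  # one row per study, smallest d
print(largest_per_study, smallest_per_study, sep="\n")
```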

The BDI and the Hamilton Rating Scale for Depression (HAMD; Hamilton, 1960) were the most frequently used instruments in the included studies. When we calculated the effect sizes based on BDI only, we again found a highly significant difference between high-quality and other studies (Table 2), although the effect sizes in both groups of studies were somewhat higher than in the overall analyses. The same was true for the analyses based on HAMD only.

Subgroup and meta-regression analyses

A series of subgroup analyses was conducted in which we first selected a subgroup of studies based on a specific characteristic (Table 2), and then examined whether there was a significant difference between high-quality and other studies within that subgroup. This is important because the difference between high-quality and other studies might otherwise be explained by, for example, the fact that none of the high-quality studies used a waiting-list control group (a type of control that is known to yield higher effect sizes than other control groups), by differences in recruitment methods, or by other characteristics of the studies.

As shown in Table 2, the difference between high-quality and other studies remained significant in all subgroup analyses. In these analyses, we found no indication that target group, recruitment method, type of treatment, format, number of sessions, or type of control condition could explain the difference between high-quality and other studies. No high-quality studies had examined psychotherapy compared to waiting-list control groups, and no high-quality study had examined guided self-help therapy. Heterogeneity was low to zero for the high-quality studies in most subgroup analyses, while in the other studies heterogeneity was high.

Because all high-quality studies used CAU or pill-placebo control conditions, we repeated all previous analyses including only those studies (high-quality and other) which used these two types of control conditions. The results of these analyses are presented in Table 3 and are comparable to those of the previous analyses. Although not all differences between high-quality and other studies were significant, this was to be expected given the limited statistical power of these comparisons.

Table 3. Differences between high-quality and other studies using care-as-usual or placebo control groups: overall analyses and subgroup analyses

BDI, Beck Depression Inventory; CBT, cognitive behaviour therapy; CI, confidence interval; HAMD, Hamilton Rating Scale for Depression; HQ, high-quality studies; N comp, number of comparisons; NNT, numbers needed to be treated.

a The p value in this column indicates whether the Q statistic is significant (the I² statistic contains no test of significance, but indicates heterogeneity in percentages).

b This p value indicates whether high-quality studies differed significantly from the other studies. Significant p values are underlined.

c None of the studies of this sample was an outlier (d⩾2).

** p<0.01, *** p<0.001.

Then we conducted a multivariate meta-regression analysis with effect size as the dependent variable. As predictors we used the same variables as in the subgroup analyses described in Table 2 (number of sessions was entered as a continuous variable instead of dichotomous variables, in order not to lose statistical power). We also added a dummy variable indicating whether or not the study was a high-quality study. The results of these analyses are presented in Table 4. The dummy variable indicating study quality was significantly associated with the effect size after controlling for the other variables (B=−0.30, s.e.=0.14, p<0.05).

Table 4. Regression coefficients of study characteristics in relation to the effect size of psychological interventions for depression: multivariate meta-regression analyses

CBT, cognitive behaviour therapy; CI, confidence interval; s.e., standard error.

a In the parsimonious model, the least significant variable was dropped in each step of a (manual) backwards regression analysis, until only significant predictors were retained.

* p<0.05, ** p<0.01, *** p<0.001.

We subsequently conducted a (manual) back-step meta-regression analysis with the same dependent variable (effect size) and the same predictors. The least significant variable was dropped at each step until only significant predictors were retained in the model. Table 4 shows that the dummy variable indicating whether the study was a high-quality study or not remained a significant predictor of effect size (B=−0.27, s.e.=0.13, p<0.05).

Because we included studies with more than one comparison, which are not independent of each other, we conducted a series of additional analyses. First, we conducted a multivariate meta-regression analysis with one effect size per study as the dependent variable, using the same predictors as in the earlier multivariate meta-regression analysis. Again we fitted a full model and a parsimonious model. In both models there was a trend (p<0.1) for the variable indicating whether or not the study was of high quality to predict the effect size. The two dummy variables for the control conditions were significant at the p<0.01 level.

Then we conducted another multivariate meta-regression analysis in which we again used one effect size per study, but this time with the smallest effect size as the dependent variable. In the full model, the variable indicating whether or not a study belonged to the high-quality group was a significant predictor of the effect size (p<0.05), and in the parsimonious model it was the only variable that remained significant after removal of all non-significant variables.

Subgroup analyses limited to high-quality studies

Because it is possible that there are significant differences within the group of high-quality studies, we conducted another series of subgroup analyses limited to the 11 high-quality studies (16 comparisons between psychotherapy and a control group). In these analyses, we examined whether effect sizes differed significantly across subgroups defined by recruitment method (community, clinical samples, other), target group (adults in general versus a more specific target group), type of treatment (cognitive behaviour therapy versus other therapies), format (individual versus group therapy), number of sessions (<8, 8–11, ⩾12), and type of control group (CAU versus other control groups). In all subgroups the resulting effect sizes were comparable and none of the analyses found a significant difference between subgroups (p>0.1). The detailed results of these analyses can be obtained from the first author (P.C.).

Several of the included studies were aimed at specific target groups (women with postpartum depression, veterans with co-morbid depression and post-traumatic stress disorder, low-income, young minority women, and older adults). Some of these studies would not be considered representative efficacy trials by many researchers in the field, although several other studies were aimed at adult outpatients in general. It is possible that one or more of the non-typical studies had a negative effect on the overall mean effect size. Therefore, we conducted a series of meta-analyses (limited to the group of high-quality studies) in which one of the studies was removed each time. In this way we could examine whether removal of one study resulted in important changes of the outcomes.

Removal of the study by Williams and colleagues (2000) resulted in the largest increase of the effect size (the resulting effect size was 0.25, with I²=0). After removing this study, we repeated the procedure and examined which study should be removed to produce the next largest increase in the effect size. This was the study by Miranda and colleagues (2003), and the resulting meta-analysis yielded an effect size of 0.26 (I²=0). Repeating this procedure a third time (after removal of the study by Smit and colleagues, 2006) resulted in a mean effect size of 0.27 (I²=0). These analyses suggest that removal of individual studies did not result in major changes in the effect sizes.

We repeated these analyses, but this time we examined which studies contributed not to an increase but to a decrease in the effect size. This resulted in removal of the studies by Cooper and colleagues (2003; d=0.21, I²=0), Jarrett and colleagues (1999; d=0.19, I²=0), and Dimidjian and colleagues (2006; d=0.19, I²=0). Again, there was no clear indication that removal of individual studies resulted in important changes in the mean effect size.

Associations between effect size and each of the quality criteria

We examined the associations between the effect sizes and each of the quality criteria in a series of subgroup analyses (detailed results can be obtained from P.C.). The effect size was significantly lower in studies with intention-to-treat analyses (p<0.001), studies with a sample size ⩾50 (p<0.001), studies in which randomization was done by an independent party (p<0.001), and studies in which assessors of outcome did not know to which condition the respondents were assigned (p<0.01). There was also a trend indicating that studies in which a manual was used had larger effect sizes than other studies (p<0.1).

We selected the studies that met the four quality criteria that were significantly associated with effect size (regardless of whether they met the other quality criteria). These 20 studies (28 comparisons between psychotherapy and control conditions) also had a significantly smaller effect size (d=0.34, 95% CI 0.20–0.48, p<0.001, NNT=5.26) than the other studies (N=150, d=0.78, 95% CI 0.68–0.88, p<0.001, NNT=2.39), although heterogeneity was considerably higher (I²=61.93% in the higher-quality studies, with a significant Q statistic, p<0.001).

Quality as a continuous measure

We created a continuous variable by adding up the scores on each of the quality criteria (0 or 1), resulting in a scale ranging from 0 to 8. In a meta-regression analysis we found that this variable was strongly associated with effect size, with a slope of −0.07 (95% CI −0.09 to −0.05, p<0.001). This indicates that each unmet quality criterion is associated with an increase of about 0.07 in the estimated effect size.
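
As a rough, back-of-the-envelope illustration of this slope (assuming the linear meta-regression model and ignoring the width of the confidence interval), a study meeting only four of the eight criteria would be expected to report an effect size that is larger by about

```latex
\Delta d \approx (8 - 4) \times 0.07 = 0.28
```

than an otherwise comparable study meeting all eight criteria.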

Discussion

We found strong indications that high-quality studies of psychotherapy for adult depression yield considerably smaller effect sizes than other studies, and that the effects of psychotherapy for adult depression have therefore been overestimated in earlier meta-analytical studies, even after controlling for the nature of the control group. Although the effects of psychotherapy are significant, they are much smaller than was assumed until now, with an NNT of eight in the high-quality studies, compared to two in the lower-quality studies (four in the lower-quality studies using CAU control groups, and three in the studies using placebo and other control groups). In a series of subgroup analyses, we found that these differences could not be explained by characteristics of the population, the nature of the intervention, or other general characteristics of the studies. Clearly, the inclusion of studies with less rigorous methods has led to an overestimation of the effects of psychotherapy. Since their development, meta-analyses have been criticized by many authors for paying too much attention to studies of low quality (Hedges & Olkin, 1985; Rosenthal & DiMatteo, 2001). This criticism, also known as the 'garbage in, garbage out' problem, now appears to be very relevant to studies of psychotherapy for adult depression.

We found only 11 studies that met all our quality criteria. That is probably partly because we used a conservative method of assessing quality: when a study did not report that a quality criterion was met, we assumed that it was not. That may not always have been the case, and failure to report a criterion may reflect lack of space in journals or a tradition of reporting certain characteristics but not others. Furthermore, studies were only defined as high quality when treatment delivery was intended to optimize treatment efficacy, which excludes effectiveness studies aimed at implementation in routine practice and maximal external validity.

On the other hand, we did not examine all possible quality criteria. We prioritized quality criteria that could be regarded as important from the point of view of psychotherapy delivery and from the point of view of the validity of the studies. We did not examine other criteria, such as the adequacy of analyses in general, or the use of long-term outcomes. We also did not include high attrition as one of the quality criteria, because there is no clear threshold which can be used to indicate which studies had high attrition rates. It is possible that if we had included more quality criteria we would have ended up with even fewer studies. It seems clear that the methodological quality of the literature has improved over time and that many of the earlier studies are less than optimal in quality. It may be wise to take those limitations into account analytically when such studies are included in quantitative reviews.

This study has several limitations. One important limitation, as indicated above, is that we did not examine all quality criteria. Second, the definition of several of our quality criteria could be questioned. For example, meeting diagnostic criteria for a mood disorder is a good quality criterion because it defines a relatively homogeneous population with at least a minimum level of severity. However, good research is also possible, in principle, with a target group scoring high on a self-rating scale, although such a group is much more heterogeneous and it is uncertain whether all participants suffer from clinically relevant depressive symptoms rather than from other problems. Moreover, the criterion of including at least 50 respondents per study is admittedly arbitrary, as is the concrete operationalization of several of our other quality criteria. A third limitation is that we could find only a small number of trials that met all our quality criteria, so we have to be cautious in interpreting the results of this meta-analysis. Furthermore, several of the high-quality studies were effectiveness trials that sought to examine the utility of forms of psychotherapy intentionally restricted in dose or duration to meet pragmatic 'real world' constraints, rather than efficacy trials designed to maximize the impact of psychotherapy. However, we found no indication that removal of the most influential studies markedly increased or decreased the overall mean effect size. In that regard, it is interesting that the quality indices that contributed most to the overestimation of the effects of psychotherapy were methodological factors that protect against bias, rather than aspects of treatment implementation that serve to maximize the potency of treatment.

The factors that lead to the overestimation of treatment effects may not always be the same for psychotherapy and medication treatments. In the psychotherapy literature the overestimation of effect sizes appears to be largely a function of a reliance on inadequately rigorous methods, whereas economic considerations may lead to selective reporting of null findings in industry-funded trials of medication treatment (Kirsch et al. 2008; Turner et al. 2008). There are clear indications that the methodological rigour of psychotherapy trials has increased over time. All but one of the high-quality psychotherapy studies were published within the previous decade, and that one exception was planned and conducted by the National Institute of Mental Health (NIMH) as an example of how to perform high-quality treatment research (Elkin et al. 1989). Features such as intention-to-treat analyses, adequate statistical power, independent randomization, and blinding of assessors, which have long been required in the testing of medications, are only now starting to find their way into the psychotherapy research literature. In part, this reflects the regulatory environment in which pharmacological research is conducted. Much of the pharmacological research is conducted by private companies driven by a profit motive, which must secure permission from governmental regulatory agencies to market their medications. As a consequence, their studies are often larger and better funded than has traditionally been the case for psychotherapy, but they are also subjected to greater outside scrutiny. Psychotherapy research has often been conducted by academic investigators with more limited resources, albeit not totally without commercial interests, as books describing a psychotherapy may generate additional income for academics.

Despite the limitations of this study, we can conclude that earlier meta-analyses have overestimated the effects of psychotherapy for adult depression, and that the quality of the studies in this field can be improved considerably. This does not necessarily mean that psychotherapy for depression is without value. The comparison conditions used in the high-quality studies typically controlled for most of the non-specific factors that contribute in part to the beneficial effects of treatment. Whereas waiting lists control only for the passage of time and repeated assessments, patients receiving CAU or pill-placebo controls believe that they are taking part in active treatment and benefit from the non-specific factors associated with the expectation of change and the provision of a helping relationship. The rather modest effect sizes produced by the high-quality studies should therefore be interpreted as the magnitude of the incremental benefit produced by a treatment with a specific effect relative to a non-specific treatment, not as an estimate of the overall benefit of treatment relative to its absence. Given that medication did no better than psychotherapy in several of the high-quality studies we examined (Elkin et al. 1989; Jarrett et al. 1999; DeRubeis et al. 2005; Dimidjian et al. 2006), and that earlier meta-analyses found that the effects of medication are also smaller than was previously assumed (Turner et al. 2008), we need either to develop more powerful interventions or to enhance the power of the interventions already available.

Declaration of Interest

None.

References

Altman, DG (2002). Poor-quality medical research: what can journals do? Journal of the American Medical Association 287, 2765–2767.
Beck, AT, Ward, CH, Mendelson, M, Mock, J, Erbaugh, J (1961). An inventory for measuring depression. Archives of General Psychiatry 4, 561–571.
Bledsoe, SE, Grote, NK (2006). Treating depression during pregnancy and the postpartum: a preliminary meta-analysis. Research on Social Work Practice 16, 109–120.
Chambless, DL, Hollon, SD (1998). Defining empirically supported therapies. Journal of Consulting and Clinical Psychology 66, 7–18.
Churchill, R, Hunot, V, Corney, R, Knapp, M, McGuire, H, Tylee, A, Wessely, S (2001). A systematic review of controlled trials of the effectiveness and cost-effectiveness of brief psychological treatments for depression. Health Technology Assessment 5, 35.
Cooper, P, Murray, L, Wilson, A, Romaniuk, H (2003). Controlled trial of the short- and long-term effect of psychological treatment of post-partum depression. I. Impact on maternal mood. British Journal of Psychiatry 182, 412–419.
Cuijpers, P, Dekker, J (2005). Psychological treatment of depression: a systematic review of meta-analyses. Dutch Journal of Medicine 149, 1892–1897.
Cuijpers, P, van Straten, A, Warmerdam, L, Andersson, G (2008a). Psychological treatment of depression: a meta-analytic database of randomized studies. BMC Psychiatry 8, 36.
Cuijpers, P, van Straten, A, Warmerdam, L, Smits, N (2008b). Characteristics of effective psychological treatments of depression: a meta-regression analysis. Psychotherapy Research 18, 225–236.
Cuijpers, P, van Straten, A, Warmerdam, L (2007a). Behavioral treatment of depression: a meta-analysis of activity scheduling. Clinical Psychology Review 27, 318–326.
Cuijpers, P, van Straten, A, Warmerdam, L (2007b). Problem solving therapies for depression: a meta-analysis. European Psychiatry 22, 9–15.
De Mello, MF, De Jesus Mari, J, Bacaltchuk, J, Verdeli, H, Neugebauer, R (2005). A systematic review of research findings on the efficacy of interpersonal therapy for depressive disorders. European Archives of Psychiatry and Clinical Neuroscience 255, 75–82.
DeRubeis, RJ, Hollon, SD, Amsterdam, JD, Shelton, RC, Young, PR, Salomon, RM, O'Reardon, JP, Lovett, ML, Gladis, MM, Brown, LL, Gallop, R (2005). Cognitive therapy vs. medications in the treatment of moderate to severe depression. Archives of General Psychiatry 62, 409–416.
Dimidjian, S, Hollon, SD, Dobson, KS, Schmaling, KB, Kohlenberg, RJ, Addis, ME, Gallop, R, McGlinchey, JB, Markley, DK, Gollan, JK, Atkins, DC, Dunner, DL, Jacobson, NS (2006). Randomized trial of behavioral activation, cognitive therapy, and antidepressant medication in the acute treatment of adults with major depression. Journal of Consulting and Clinical Psychology 74, 658–670.
Dowrick, C, Dunn, G, Ayuso-Mateos, JL, Dalgard, OS, Page, H, Lehtinen, V, Casey, P, Wilkinson, C, Vazquez Barquero, JL, Wilkinson, G, the Outcomes of Depression International Network (ODIN) Group (2000). Problem solving treatment and group psychoeducation for depression: multicentre randomised controlled trial. Outcomes of Depression International Network (ODIN) Group. British Medical Journal 321, 1450–1454.
Dunn, NJ, Rehm, LP, Schillaci, J, Soucheck, J, Mehta, P, Ashton, CM, Yanasak, E, Hamilton, JD (2007). A randomized trial of self-management and psychoeducational group therapies for comorbid chronic posttraumatic stress disorder and depressive disorder. Journal of Traumatic Stress 20, 221–237.
Ekers, D, Richards, D, Gilbody, S (2008). A meta-analysis of randomized trials of behavioural treatment of depression. Psychological Medicine 38, 611–623.
Elkin, I, Shea, MT, Watkins, JT, Imber, SD, Sotsky, SM, Collins, JF, Glass, DR, Pilkonis, PA, Leber, WA, Docherty, JP, Fiester, SJ, Parloff, MB (1989). Treatment of Depression Collaborative Research Program. Archives of General Psychiatry 46, 971–982.
Gellatly, J, Bower, P, Hennessy, S, Richards, D, Gilbody, S, Lovell, K (2007). What makes self-help interventions effective in the management of depressive symptoms? Meta-analysis and meta-regression. Psychological Medicine 37, 1217–1228.
Gloaguen, V, Cottraux, J, Cucherat, M, Blackburn, IM (1998). A meta-analysis of the effects of cognitive therapy in depressed patients. Journal of Affective Disorders 49, 59–72.
Hamilton, M (1960). A rating scale for depression. Journal of Neurology, Neurosurgery and Psychiatry 23, 56–62.
Hedges, L, Olkin, I (1985). Statistical Methods for Meta-analysis. Academic Press: Orlando.
Higgins, JPT, Green, S (2006). Cochrane Handbook for Systematic Reviews of Interventions 4.2.6 (updated September 2006). In The Cochrane Library, Issue 4, 2006. John Wiley & Sons, Ltd: Chichester, UK.
Higgins, JPT, Thompson, SG, Deeks, JJ, Altman, DG (2003). Measuring inconsistency in meta-analyses. British Medical Journal 327, 557–560.
Jarrett, RB, Schaffer, M, McIntire, D, Witt-Browder, A, Kraft, D, Risser, RC (1999). Treatment of atypical depression with cognitive therapy or phenelzine: a double-blind, placebo-controlled trial. Archives of General Psychiatry 56, 431–437.
Kirsch, I, Deacon, BJ, Huedo-Medina, TB, Scoboria, A, Moore, TJ, Johnson, BT (2008). Initial severity and antidepressant benefits: a meta-analysis of data submitted to the Food and Drug Administration. PLOS Medicine 5, e45.
Klein, JS, Jacobs, RH, Reinecke, MA (2007). Cognitive-behavioral therapy for adolescent depression: a meta-analytic investigation of changes in effect-size estimates. Journal of the American Academy of Child and Adolescent Psychiatry 46, 1403–1413.
Kraemer, HC, Kupfer, DJ (2006). Size of treatment effects and their importance to clinical research and practice. Biological Psychiatry 59, 990–996.
Leichsenring, F, Rabung, S (2008). Effectiveness of long-term psychodynamic psychotherapy: a meta-analysis. Journal of the American Medical Association 300, 1551–1565.
Lipsey, MW (1990). Design Sensitivity: Statistical Power for Experimental Research. Sage: Newbury Park.
McDermut, W, Miller, IW, Brown, RA (2001). The efficacy of group psychotherapy for depression: a meta-analysis and review of the empirical research. Clinical Psychology: Science and Practice 8, 98–116.
Miranda, J, Chung, JY, Green, BL, Krupnick, J, Siddique, J, Revicki, DA, Belin, T (2003). Treating depression in predominantly low-income young minority women: a randomized controlled trial. Journal of the American Medical Association 290, 57–65.
Pinquart, M, Duberstein, PR, Lyness, JM (2006). Treatments for later-life depressive conditions: a meta-analytic comparison of pharmacotherapy and psychotherapy. American Journal of Psychiatry 163, 1493–1501.
Robinson, LA, Berman, JS, Neimeyer, RA (1990). Psychotherapy for the treatment of depression: a comprehensive review of controlled outcome research. Psychological Bulletin 108, 30–49.
Rosenthal, R, DiMatteo, MR (2001). Meta-analysis: recent developments in quantitative methods for literature reviews. Annual Review of Psychology 52, 59–82.
Sackett, DL, Strauss, SE, Richardson, WS, Rosenberg, W, Haynes, RB (2000). Evidence-based Medicine: How to Practice and Teach EBM, 2nd edn. Churchill Livingstone: Edinburgh.
Smit, A, Kluiter, H, Conradi, HJ, Van der Meer, K, Tiemens, BG, Jenner, JA, Van Os, TWDP, Ormel, J (2006). Short-term effects of enhanced treatment for depression in primary care: results from a randomized controlled trial. Psychological Medicine 36, 15–26.
Turner, EH, Matthews, AM, Linardatos, E, Tell, RA, Rosenthal, R (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine 358, 252–260.
Van Schaik, A, van Marwijk, H, Ader, H, van Dyck, R, de Haan, M, Penninx, B, van der Kooij, K, van Hout, H, Beekman, A (2006). Interpersonal psychotherapy for elderly patients in primary care. American Journal of Geriatric Psychiatry 14, 777–786.
Wampold, BE, Minami, T, Baskin, TW, Tierney, SC (2002). A meta-(re)analysis of the effects of cognitive therapy versus 'other therapies' for depression. Journal of Affective Disorders 68, 159–165.
Weisz, JR, McCarty, CA, Valeri, SM (2006). Effects of psychotherapy for depression in children and adolescents: a meta-analysis. Psychological Bulletin 132, 132–149.
Williams, JW Jr., Oxman, T, Frank, E, Katon, W, Sullivan, M, Cornell, J, Sengupta, A (2000). Treatment of dysthymia and minor depression in primary care: a randomized controlled trial in older adults. Journal of the American Medical Association 284, 1519–1526.