Introduction
About half the people who initiate major depression treatment either drop out or otherwise fail to receive a minimally adequate first course (Thornicroft et al., 2017). About 30% of treated cases remit after a full course of evidence-based treatment and 70% after several such courses (Rush et al., 2006a). However, given the high initial treatment dropout rate, many patients never get the later courses needed for remission (Wells et al., 2013). These observations have led to an interest in clinical decision support tools to select optimal first-line depression treatments. No single first-line antidepressant medication (ADM) is substantially better than others in the aggregate, resulting in ADM treatment recommendations being based largely on considerations of tolerability and safety (Parikh et al., 2016). However, comparison studies suggest that some patients respond better to some ADMs than others (Jha et al., 2017), making it important to ask whether this differential treatment response can be determined in advance to guide personalized treatment selection.
Early research on this question failed to find stable depression subtypes predicting significant between-patient differences in relative treatment responses (van Loo, de Jonge, Romeijn, Kessler, & Schoevers, 2012). Subsequent research found some stable biomarkers, but none strong enough to have broad clinical value (Frodl, 2017). Recent attention has consequently shifted to combining information across weak predictors to construct stronger composite individualized treatment rules (ITRs). Promising results exist (Cohen & DeRubeis, 2018; Ermers, Hagoort, & Scheepers, 2020), but no practical ADM ITR has yet been developed. Three important barriers exist to doing this. First, much larger samples are needed to detect complex interactions (Luedtke, Sadikova, & Kessler, 2019). Second, richer baseline predictor batteries are needed (Kessler, 2018). Third, more sophisticated statistical methods are needed (VanderWeele, Luedtke, van der Laan, & Kessler, 2019).
We address the third of these barriers here by carrying out a secondary analysis of the SUN(^_^)D trial (Furukawa et al., 2011; Kato et al., 2018) of outpatient depression treatment to develop an ITR for depression remission. Prior SUN(^_^)D analyses investigated predictors of remission at any time in the first 9 weeks (Furukawa et al., 2019b), acute phase deterioration (Akechi et al., 2020), switching to hypomania/mania (Akechi, Kato, Watanabe, Tanaka, & Furukawa, 2019a), and relapse after successful acute-phase treatment (Akechi et al., 2019b). However, no prior SUN(^_^)D analysis attempted to develop an ITR for remission, although one previous SUN(^_^)D report developed an ITR for mean symptom change (Furukawa et al., 2020). We focus on remission here based on the injunction to treat depression to remission to minimize risks of relapse and recurrence (Keller, 2003). The second-line treatment alternatives we consider are continuing a first-line ADM, switching to a different ADM, or combining the two ADMs.
Method
Participants and setting
SUN(^_^)D was a 25-week, parallel-group, open-label, assessor-blinded, pragmatic trial of previously-untreated outpatients ages 25–75 with a primary diagnosis of DSM-IV major depressive episode (MDE) in 48 outpatient psychiatric treatment centers across Japan. Exclusion criteria included history of bipolar disorder, psychosis, and several current disorders (dementia, borderline personality disorder, eating disorder, and substance disorder) as well as imminent suicide risk (suicide ideation with any level of reported intent). Participants could not already be taking antidepressants, antipsychotics, or mood stabilizers or receiving depression-specific psychotherapy. Recruitment occurred between December 2010 and March 2015. More details about eligibility are presented elsewhere (Furukawa et al., 2011). Socio-demographic and baseline clinical characteristics are presented in online Supplementary Table S1. The trial was ‘pragmatic’ in the sense that (i) the sample consists of real patients coming to treatment at study clinics rather than recruited by advertisements; (ii) the clinicians providing the treatment were those at these clinics rather than employees of the research team; (iii) there were minimal exclusions; and (iv) the trial was open-label randomized.
The trial involved two randomization steps. First-step randomization, done by site, assigned patients to sertraline (i) up to 50 mg/day through week 3 or (ii) 25–50 mg/day in week 1 up to 100 mg/day in weeks 2–3. The licensed dose for sertraline in Japan is 50–100 mg/day, which corresponds to the dose range found to provide the best balance between efficacy, tolerability, and acceptability in a recent dose–response meta-analysis of selective serotonin reuptake inhibitors (Furukawa et al., 2019a). Second-step individual-level randomization occurred among patients who did not remit within 3 weeks, defined as Patient Health Questionnaire (PHQ-9; Kroenke, Spitzer, & Williams, 2001) scores >4, with equal allocation to continuing at the same dose on sertraline (n = 551), switching to mirtazapine (n = 558; 7.5–45 mg/day at psychiatrist discretion, with sertraline tapered and discontinued by week 7), or combining mirtazapine and sertraline (n = 537). Second-step randomization was done regardless of symptom response or side effects. More design details are presented elsewhere (Furukawa et al., 2011). Second-step randomization occurred within 4 days of the end of week 3 for 87.3% of patients but, as per the pragmatic nature of the trial, was extended for patients who took longer to reach. The week 9 assessment was anchored to baseline rather than to timing of second-step randomization.
Study procedures were explained and written informed consent was obtained before collecting data. Patients were remunerated approximately $20 for each assessment. The authors assert that all procedures contributing to this study comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008. All procedures involving human subjects/patients were approved by the institutional review board at each participating site. A target first-step sample of n = 1934 patients was set based on a power analysis to detect a 1 point mean difference on the PHQ-9 at first-step randomization and a 0.20 standardized mean difference on the PHQ-9 between a given pair of treatment groups in second-step randomization pooled across 50 and 100 mg/day first-step samples assuming a week 3 dropout rate of 10% and a week 9 remission rate of 10%. The final sample consisted of n = 2011 patients in the first-step randomization, 1927 of whom were re-assessed at week 9 (95.8%). More detailed recruitment information is presented elsewhere (Kato et al., 2018). We focus here on the n = 1549 patients who had not remitted in week 3 and provided baseline and week 1 data, all but 30 of whom were followed at week 9 (98.2%). Online Supplementary Fig. S1 shows how we arrived at the samples used for this analysis. Separate analyses were carried out in the 50 and 100 mg/day subsamples.
Measures
Outcome
The primary outcome was depression remission (PHQ-9 < 5) measured by masked assessors in semi-structured week 9 telephone interviews.
Predictors
We considered 58 predictors. Distributions are reported in online Supplementary Table S1:
(i) Socio-demographics: Age, sex, education (as both continuous and three nested dichotomies), four categories of employment status, four categories of marital status.
(ii) Baseline reports of lifetime clinical characteristics: Depression age-at-onset, number of previous episodes, length of current episode.
(iii) Clinical scales: The Beck Depression Inventory 2nd Edition (BDI-II; Beck, Steer, & Brown, 1996) and the PHQ-9 assessed at baseline, week 1 and week 3.
(iv) Derived clinical subscales: Exploratory factor analysis of combined BDI and PHQ items was carried out to extract meaningful structure that might not emerge in within-scale analyses. Principal axis was used for factor extraction (Costello & Osborne, 2005). Parallel analysis was used to select number of factors (Lim & Jahng, 2019). Promax rotation was used to aid in interpretation (Schmitt & Sass, 2011). Six symptom subscales were found that we labeled: dysphoria, negative view of self, anhedonia, suicidality, irritability, and sleep/appetite disturbance.
Scale and subscale scores were standardized to baseline means and variances (thereby allowing us to observe changes in both means and variances over time benchmarked to baseline values). We also included nested quintiles of subscale scores as predictors to capture nonlinear main effects and interactions. In addition, we constructed change scores between baseline and week 3 for each scale and subscale. Finally, we created three summary nested counts of how often symptom scores at week 3 were meaningfully (i) higher, (ii) higher or approximately the same, or (iii) no less than slightly lower in the six symptom subscales than the baseline scores.
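The predictor construction described above can be illustrated with a minimal sketch. The scores, the quintile cutpoints, and the function names are hypothetical, and the use of the population standard deviation is an assumption; the text does not specify these details.

```python
from statistics import mean, pstdev

def standardize_to_baseline(scores, baseline):
    """Express scores in units of the baseline mean and standard deviation."""
    mu, sd = mean(baseline), pstdev(baseline)  # population SD is an assumption
    return [(x - mu) / sd for x in scores]

def nested_quintile_flags(score, cutpoints):
    """Nested dichotomies: flag k is 1 when the score exceeds the k-th cutpoint."""
    return [int(score > c) for c in cutpoints]

# Hypothetical subscale scores for five patients (illustration only).
baseline = [10.0, 12.0, 14.0, 16.0, 18.0]
week3 = [8.0, 9.0, 14.0, 10.0, 12.0]

z_base = standardize_to_baseline(baseline, baseline)
z_week3 = standardize_to_baseline(week3, baseline)

# Change score between baseline and week 3 in baseline standard-deviation units.
change = [w - b for w, b in zip(z_week3, z_base)]

# Hypothetical cutpoints standing in for the four quintile boundaries.
flags = nested_quintile_flags(0.5, [-0.8, -0.2, 0.2, 0.8])
```

Because week 3 scores are scaled by the baseline mean and standard deviation rather than their own, shifts in both level and spread over time remain visible in the standardized values.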
(v) Side effects: The week 3 telephone interview administered the Frequency, Intensity and Burden of Side Effects Rating (FIBSER; Rush et al., 2006b), which we used to define seven summary dichotomous measures of side effects.
Statistical analysis
Adjusting for missing data
Small amounts of item-missing data (n = 3–28 patients per measure) existed for some socio-demographic variables and reports about age-of-onset, number of prior depressive episodes, and duration of current episode (online Supplementary Table S1). We used predictive mean matching to impute these missing values (van Buuren, 2016). Thirty respondents with week 3 data were lost to follow-up at week 9. We imputed their week 9 outcome scores using all baseline information with the R package ranger (Wright & Ziegler, 2017). Given the small numbers involved, all imputed values were treated as observed rather than implementing a multiple imputation adjustment procedure.
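As a rough illustration of predictive mean matching, the sketch below imputes a missing value by borrowing an observed value from one of the cases whose predicted value is closest. It is a deliberate simplification: it uses a single predictor and ordinary least squares, omits the parameter draws used by full implementations, and all data and names are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def pmm_impute(x_obs, y_obs, x_missing, k=3, seed=0):
    """Impute each missing y with the observed y of a randomly chosen near donor."""
    rng = random.Random(seed)
    intercept, slope = fit_line(x_obs, y_obs)
    predicted_obs = [intercept + slope * x for x in x_obs]
    imputed = []
    for x in x_missing:
        target = intercept + slope * x
        # The k observed cases whose predicted values are closest to the target.
        donors = sorted(range(len(x_obs)), key=lambda i: abs(predicted_obs[i] - target))[:k]
        imputed.append(y_obs[rng.choice(donors)])
    return imputed

# Hypothetical data: current age predicting age at onset.
age = [30, 40, 50, 60, 70]
onset = [25, 31, 38, 52, 55]
filled = pmm_impute(age, onset, x_missing=[45])
```

A property worth noting is that every imputed value is a value actually observed in the data, which keeps imputations within a plausible range.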
Adjusting for variation in pre-randomization covariates across treatment arms
We observed minor differences in pre-randomization covariates across experimental treatment arms (online Supplementary Table S1). Although well within the bounds of chance, we adjusted for these differences to improve ITR accuracy (Luedtke & van der Laan, 2016a). A propensity score weight based on a simple logistic regression model was used for this purpose (Gruber & van der Laan, 2009).
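The idea behind propensity score weighting can be sketched as follows. For brevity this example replaces the logistic regression with empirical assignment proportions within strata of a single categorical covariate, which is an assumption made purely for illustration; the data and names are hypothetical.

```python
from collections import Counter

def inverse_probability_weights(arms, strata):
    """Weight each patient by 1 / P(assigned arm | covariate stratum)."""
    n_stratum = Counter(strata)
    n_cell = Counter(zip(strata, arms))
    return [n_stratum[s] / n_cell[(s, a)] for a, s in zip(arms, strata)]

# Hypothetical assignments for eight patients in two covariate strata.
arms = ["continue", "switch", "combine", "continue",
        "switch", "combine", "combine", "combine"]
strata = ["m", "m", "m", "f", "f", "f", "f", "f"]

weights = inverse_probability_weights(arms, strata)

# After weighting, every arm represents the full sample of eight patients.
weighted_size = {}
for arm, w in zip(arms, weights):
    weighted_size[arm] = weighted_size.get(arm, 0.0) + w
```

Patients in over-represented arm-by-stratum cells receive smaller weights, so the weighted covariate distribution is the same in every arm.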
Estimating the ITR
In the conventional approach to estimating an ITR, a model is estimated to search for significant interactions between predictors and dummy variables for treatment in predicting the outcome (Van Bronswijk et al., 2020). Predicted probabilities of the outcome are estimated from this model conditional on values of the predictors for each patient in each treatment condition (e.g. the estimated outcome of patient p separately based on the alternative assumptions that the patient was assigned to treatment arm a, arm b, or arm c). An estimate of the optimal treatment strategy for patient p is then obtained by creating predicted difference scores in probabilities of remission across all treatment arms. The treatment with the highest predicted difference score is defined as the best treatment for the patient. The accuracy of this approach depends on correct specification of both the (possibly nonlinear) main effects and the (possibly complex nonlinear and higher-order) interaction terms and will yield upwardly biased estimates in the absence of cross-validation (VanderWeele et al., 2019).
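The selection step of this counterfactual logic can be sketched in a few lines. The predicted probabilities below are hypothetical; in practice they would come from a fitted outcome model.

```python
def best_treatment(predicted):
    """Return the best arm, the runner-up, and the margin between them."""
    ranked = sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)
    (best, p_best), (second, p_second) = ranked[0], ranked[1]
    return best, second, p_best - p_second

# Hypothetical predicted remission probabilities for one patient under each arm.
patient = {"continue": 0.22, "switch": 0.31, "combine": 0.38}
choice, runner_up, margin = best_treatment(patient)
```

Here the margin is the predicted difference score between the best and second-best arms, the quantity that determines how much a patient stands to gain from individualized assignment.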
All more advanced approaches for developing an ITR use the same counterfactual logic as the conventional approach by either explicitly or implicitly creating patient-level difference scores and defining the optimal treatment as the one with the highest predicted difference score (Kent et al., 2020). Approaches differ, however, in how they handle misspecification and over-fitting. The approach we used deviates in three important ways from the conventional approach. First, rather than estimate interaction terms in a regression model and generate individual-level difference scores from that model, we used machine learning (ML) methods to develop a separate model for remission among patients assigned to each treatment arm and then used each of these arm-specific models to generate conditional patient-level predicted probabilities of remission in the entire sample regardless of treatment received. Patient-level difference scores were then calculated from these treatment-specific predictions. This approach allowed us to capture complex nonlinearities and interactions within each treatment arm using predictors that might not be important in other arms. Cross-validation was used to estimate within-subgroup models to minimize over-fitting.
Second, rather than use first-stage difference scores to define the ITR, we used these scores as second-stage ML analysis outcomes. Predictions from that second stage were then used to define the ITR. The advantage of this is that the second-stage prediction model does not require correct specification of main effects, as it directly estimates differences (Luedtke & van der Laan, 2016a). The first-stage difference scores used as outcomes in this second-stage analysis take into consideration information about individual-level differences in probability of treatment assignment with respect to important predictors of the ITR (which can occur even in experiments due to random processes) and in observed outcomes, both of which reduce bias (Funk et al., 2011). Cross-validation was again used in this stage to minimize over-fitting.
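One common way to build difference scores that use both the assignment probability and the observed outcome is the augmented inverse-probability-weighted (doubly robust) form. The sketch below assumes that form for illustration; it is not confirmed to be the exact construction used by the software in this analysis, and all values and names are hypothetical.

```python
def doubly_robust_score(y, arm, target_arm, m_hat, g_hat):
    """Model prediction plus a weighted residual for patients observed in target_arm.

    y: observed remission (0 or 1); arm: arm actually received;
    m_hat: predicted remission probability under target_arm;
    g_hat: estimated probability of being assigned to target_arm.
    """
    correction = (y - m_hat) / g_hat if arm == target_arm else 0.0
    return m_hat + correction

# Hypothetical patient who received combined treatment and remitted.
y, arm = 1, "combine"
g = 1 / 3  # equal allocation across the three arms

score_combine = doubly_robust_score(y, arm, "combine", m_hat=0.40, g_hat=g)
score_switch = doubly_robust_score(y, arm, "switch", m_hat=0.30, g_hat=g)

# First-stage difference score that would serve as a second-stage outcome.
difference = score_combine - score_switch
```

Individual scores of this kind can fall outside the 0 to 1 range; they are intended as regression targets whose conditional mean estimates the treatment contrast, not as probabilities.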
Third, although some earlier approaches were similar to our approach in predicting difference scores directly (Murphy, 2003; Robins, 2004), our approach uses a principled method for combining results across an ensemble of algorithms rather than using a single algorithm. This maximizes chances of capturing complex nonlinear and higher-order interactions correctly. We used the sg R package to do this (Luedtke & van der Laan, 2016a, 2017). sg uses the super learner (SL) method (van der Laan, Polley, & Hubbard, 2007) for prediction. SL selects a weighted combination of predicted outcome scores (in our case, a difference in predicted probabilities of remission between alternative treatments) across a collection (ensemble) of candidate algorithms via cross-validation to yield a nearly optimal weighted combination guaranteed to perform as well in expectation as the best component algorithm according to a pre-specified criterion (in our case, minimizing mean-squared error) (Polley, LeDell, Kennedy, Lendle, & van der Laan, 2018). The algorithms in the ensemble can be a mix of parametric and flexible ML algorithms, making SL less prone to model misspecification than traditional parametric approaches.
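The weighting idea behind SL can be shown with a toy version: given cross-validated predictions from two candidate algorithms, choose the convex weight that minimizes mean-squared error. This is only a sketch; actual implementations solve a constrained optimization over many algorithms, and the predictions below are hypothetical.

```python
def mse(pred, y):
    """Mean-squared error between predictions and outcomes."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def best_convex_weight(pred_a, pred_b, y, steps=100):
    """Grid-search w in [0, 1] minimizing the MSE of w*pred_a + (1-w)*pred_b."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blend = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        loss = mse(blend, y)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Hypothetical cross-validated difference-score predictions and outcomes.
y = [0.10, -0.05, 0.20, 0.00]
pred_a = [0.20, -0.10, 0.10, 0.05]
pred_b = [0.00, 0.00, 0.30, -0.05]

weight, loss = best_convex_weight(pred_a, pred_b, y)
```

Because the grid includes the weights 0 and 1, the blended prediction can never have a larger cross-validated error than either candidate alone, which mirrors the performance guarantee described above.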
We used external 5-fold cross-validation (5F-CV) of the SL solution to develop a separate prediction model for remission among patients who received each of the three types of treatment. SL used 10F-CV to generate individual-level predicted probabilities in each of the 5F-CV subsamples. All CV folds were selected at random without stratification. Consistent with recommendations (LeDell, van der Laan, & Petersen, 2016), we used a diverse set of algorithms in the SL ensemble to capture nonlinearities and interactions. Key hyperparameters were varied for some algorithms and treated as distinct models, resulting in a total of 40 models (online Supplementary Table S2). We screened predictors before estimating the SL separately within each CV sample based on evidence that this reduces over-fitting (Kuhn & Johnson, 2019). Two screening approaches were used in each CV sample: lasso (Friedman, Hastie, & Tibshirani, 2010); and selecting all predictors having p < 0.10 univariate associations with the outcome. All algorithms within a given CV sample began with the feature sets selected using these two methods. Each algorithm–screener combination was treated as a distinct model in the ensemble; that is, each algorithm was estimated twice in each CV sample: once using the feature set selected by lasso and the second using the feature set selected by the p < 0.10 rule. Only algorithms that had nonzero weights in the SL ensemble in the first step were used in the second step. The ITR for each patient was defined as the treatment with the highest positive cross-validated second-stage difference score.
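The nested fold structure (an outer 5-fold split with an inner 10-fold split inside each training portion) can be sketched as follows. The seeds and function name are hypothetical, and the sample size of 715 is the 50 mg/day total implied by the subgroup counts reported in the Results.

```python
import random

def assign_folds(n, k, seed):
    """Randomly assign n patients to k folds of near-equal size, unstratified."""
    folds = [i % k for i in range(n)]
    random.Random(seed).shuffle(folds)
    return folds

# Outer 5-fold split used to evaluate the whole SL procedure.
outer = assign_folds(n=715, k=5, seed=1)

# Inner 10-fold split within the training portion of outer fold 0,
# used by SL to weight the candidate algorithms.
train_idx = [i for i, fold in enumerate(outer) if fold != 0]
inner = assign_folds(n=len(train_idx), k=10, seed=2)
```

Keeping predictor screening and ensemble weighting inside the inner loop means that the held-out outer fold plays no part in model selection.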
Evaluating ITR performance
Once ITRs are determined, sg uses targeted minimum loss-based estimation (TMLE; Gruber & van der Laan, 2011) to evaluate the ITRs. The expected remission rates and their TMLE standard errors were estimated both for each treatment alternative and for differences between optimized treatment and the alternatives. We considered four alternatives to optimization: randomization across treatments and all patients assigned to one of the three treatment arms (i.e. continuation, switching, or combining). Significance of differences in estimated remission rates for optimal treatment v. alternatives was evaluated using 0.05-level one-sided χ2 tests, as optimization would be expected, at worst, to have no effect on the remission rate. As sg uses TMLE to estimate marginal differences directly, these estimates differ slightly from the estimates calculated by subtracting one TMLE-estimated marginal treatment-specific remission rate from another.
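A one-sided test of an estimated difference against its standard error can be sketched with a normal approximation, which for a difference in the expected direction gives the same p-value as halving a 1-degree-of-freedom χ2 test. The estimate and standard error below are hypothetical and are not taken from the tables.

```python
import math

def one_sided_p(diff, se):
    """Upper-tail p-value for H0: diff <= 0 under a normal approximation."""
    z = diff / se
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical gain in remission rate of 5 percentage points with s.e. of 2.5.
p_value = one_sided_p(diff=0.05, se=0.025)
significant = p_value < 0.05
```

The one-sided form reflects the expectation that optimized assignment should, at worst, leave the remission rate unchanged.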
We also separately estimated average treatment remission rates among those patients optimized to receive each treatment under both optimization and randomization. The standard errors of these data-adaptive target parameters (i.e. parameters in subsamples defined by the ITR) are valid for large samples (Hubbard, Kherad-Pajouh, & van der Laan, 2016). Finally, we evaluated aggregate remission rates under optimization constraints because combination treatment, which we suspected to be optimal for many patients, is more expensive and might be constrained in some treatment settings. sg allowed us to estimate the aggregate remission rate if combined treatment was available to only the k% of patients whose predicted probability of remission was highest compared to the second-best treatment option, with second-best treatment assigned to other patients optimized by combining (Luedtke & van der Laan, 2016b). k was varied in deciles between 0% and 100%.
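The constrained assignment rule can be illustrated with a small sketch. The field names and patient values are hypothetical, and the rule is written as described in the text: combined treatment goes to the patients with the largest predicted advantage over their second-best option, up to k% of all patients.

```python
def constrained_rule(patients, k_percent):
    """Assign combined treatment to at most k_percent of all patients."""
    quota = k_percent * len(patients) // 100
    # Patients optimized by combining, ranked by advantage over second-best.
    combiners = sorted(
        (i for i, p in enumerate(patients) if p["best"] == "combine"),
        key=lambda i: patients[i]["advantage"],
        reverse=True,
    )
    allowed = set(combiners[:quota])
    return [
        p["best"] if p["best"] != "combine" or i in allowed else p["second"]
        for i, p in enumerate(patients)
    ]

# Hypothetical predictions for five patients.
patients = [
    {"best": "combine", "second": "switch", "advantage": 0.12},
    {"best": "combine", "second": "continue", "advantage": 0.02},
    {"best": "switch", "second": "combine", "advantage": 0.06},
    {"best": "combine", "second": "switch", "advantage": 0.08},
    {"best": "continue", "second": "combine", "advantage": 0.01},
]

assigned_40 = constrained_rule(patients, k_percent=40)
assigned_0 = constrained_rule(patients, k_percent=0)
```

With k = 40, the two patients with the largest advantage keep combined treatment and the third falls back to the second-best option; patients optimized by other treatments are unaffected by the constraint.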
Examining predictor importance
We examined predictor importance by comparing mean differences for each predictor across subgroups of patients defined by optimal treatment type. This simple method was used rather than methods available to generate rank orderings of predictor importance (Lundberg & Lee, 2017) based on concerns that have been raised in the literature about the plausibility of the assumptions needed for such rankings to be interpretable (Kumar, Venkatasubramanian, Scheidegger, & Friedler, 2020).
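This comparison amounts to a difference in predictor means between one optimal-treatment subgroup and all other patients. A minimal sketch with hypothetical values and names:

```python
from statistics import mean

def mean_difference(values, groups, target):
    """Mean of a predictor in the target subgroup minus its mean elsewhere."""
    in_group = [v for v, g in zip(values, groups) if g == target]
    others = [v for v, g in zip(values, groups) if g != target]
    return mean(in_group) - mean(others)

# Hypothetical standardized symptom change scores (baseline to week 3).
change = [-1.2, -0.8, -0.1, 0.0, -0.4]
optimal = ["combine", "combine", "continue", "continue", "switch"]

diff_combine = mean_difference(change, optimal, "combine")
```

A negative value here would indicate that patients optimized by combining had larger symptom decreases than other patients on that predictor.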
Analyses were carried out using R Version 3.6.3 (R Core Team, 2019) and SAS Version 9.4 (SAS Institute Inc., 2014). The STROBE criteria for reporting cohort studies (Institute of Social and Preventive Medicine Clinical Epidemiology & Biostatistics, 2009) and the TRIPOD criteria for reporting analyses designed to develop a prediction rule (Collins, Reitsma, Altman, & Moons, 2015) were used in reporting results.
Results
Estimating the ITR
The distribution of optimal treatment
Combining mirtazapine with sertraline was estimated to be optimal for most patients either if sertraline was limited to 50 mg/day (69.9%) or 100 mg/day (66.7%) (Table 1). The proportions of patients optimized by switching were smaller (21.1–29.3% in the 50 and 100 mg/day samples) and the proportions optimized by continuing were very small (9.0–4.1% in the 50 and 100 mg/day samples). Inspection of individual-level estimated difference scores for optimal and second-best probabilities-of-remission showed that the most common pattern was combining best with switching second-best (48.8–50.5% in the 50 and 100 mg/day samples) (Figs 1a, 1b). Other common patterns were combining best with continuing second-best (21.1–16.2%) and switching best with combining second-best (18.3–20.5%). The difference score distributions show that the median advantage when either combining or switching is best is to increase the remission rate by 5–12% compared to the second-best alternative. The median advantage is much smaller (1–3%), in comparison, when continuing is best. Continuing, then, is seldom best and has only a small advantage when it is best.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20221222060616328-0009:S0033291721000027:S0033291721000027_fig1.png?pub-status=live)
Fig. 1. (a) 50 mg/day subgroup second-line 9-week remission difference in difference scores (first minus second choice for optimal treatment). (b) 100 mg/day subgroup second-line 9-week remission difference in difference scores (first minus second choice for optimal treatment).
Table 1. Estimated second-line 9-week optimal treatment assignment proportions in the different samples
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20221222060616328-0009:S0033291721000027:S0033291721000027_tab1.png?pub-status=live)
s.e., standard error.
The predicted impact of the ITR on the overall remission rate
The TMLE estimated that random treatment assignment would result in a 30.1% remission rate in the 50 mg/day sample and 30.8% in the 100 mg/day sample (Table 2). Optimal treatment assignment, in comparison, was estimated to result in a significant 5.3% increase in remission compared to random assignment in the 50 mg/day sample (p = 0.016) and a significant 5.1% increase in the 100 mg/day sample (p = 0.031). Aggregate decomposition showed that these significant differences were due to optimal assignment resulting in significantly higher remission rates than continuing in both the 50 mg/day sample (11.2%, p = 0.002) and the 100 mg/day sample (11.7%, p = 0.001), moderate nonsignificant gains over switching (5.0%, p = 0.09 in the 50 mg/day sample and 3.3%, p = 0.18 in the 100 mg/day), and modest gains over combining (0.4% in the 50 mg/day sample and 1.0% in the 100 mg/day sample). In other words, using the simple treatment rule of assigning all second-step patients to combined treatment would not yield substantially lower remission rates than optimal assignment.
Table 2. Estimated second-line 9-week remission rates under a range of treatment rules given three treatment options
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20221222060616328-0009:S0033291721000027:S0033291721000027_tab2.png?pub-status=live)
s.e., standard error; TMLE, targeted minimum loss-based estimation.
a The difference from SL-optimized was estimated directly with TMLE and is not the same as the difference between the aggregate within-arm estimated remission rates.
b Standard errors for difference scores are not reliable because the effective sample size for the comparison, the individuals for whom the two treatment rules disagree, is too small.
*Significantly lower remission rate than for SL-optimized allocation at the 0.05 level, one-sided test.
Subgroup decomposition showed that the few patients optimized by continuing (n = 64–34) had a much lower probability of remission under randomization (15.4–11.7%) than patients optimized by switching (n = 151–244; 29.6–41.6%) or combining (n = 500–556; 30.9–26.7%). The estimated effects of optimization were significant only among patients optimized by combining (5.8%, p < 0.001 in the 50 mg/day sample and 5.9%, p < 0.001 in the 100 mg/day sample). The estimated effects of optimization were non-significant among patients optimized by continuing due to low remission rates under either optimization or randomization and among patients optimized by switching due to intermediate (50 mg/day sample) or high (100 mg/day sample) remission rates under either optimization or randomization.
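The subgroup sizes above can be checked against the optimal-treatment proportions and the analysis sample size reported earlier. The short calculation below uses only numbers stated in the text; the variable names are illustrative.

```python
# Subgroup sizes by optimal treatment, as reported in the text.
counts = {
    "50 mg/day": {"continue": 64, "switch": 151, "combine": 500},
    "100 mg/day": {"continue": 34, "switch": 244, "combine": 556},
}

# Percentage of each sample optimized by each treatment.
shares = {
    sample: {arm: round(100 * n / sum(arms.values()), 1) for arm, n in arms.items()}
    for sample, arms in counts.items()
}

# Total analysis sample across both dose subsamples.
total = sum(sum(arms.values()) for arms in counts.values())
```

The resulting percentages match those reported for the distribution of optimal treatment (9.0, 21.1, and 69.9% in the 50 mg/day sample; 4.1, 29.3, and 66.7% in the 100 mg/day sample), and the total equals the n = 1549 analysis sample.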
Constrained optimization
Subgroup decomposition results (Table 2) show that the significant difference in aggregate remission between optimal treatment and randomization was due to optimization having by far the largest effect among patients optimized by combining. Figures 1a and 1b show, however, that the advantage of combining over the second-best treatment option is relatively small for some patients. Consistent with this result, constrained optimization analysis, which used smoothing to stabilize estimates, suggests that aggregate remission rates would remain at the optimum if no more than about 50% of patients in the 50 mg/day sample received combined treatment and about 80% of the advantage would be retained if only 30% received it [i.e. (35.5–31.1%)/(36.5–31.1%) compared to 69.9% optimized by combined treatment] (Table 3). In the 100 mg/day sample, about 90% of the advantage of combined treatment would be retained if 60% received it and 63% of the advantage if only 50% received it (compared to 66.7% optimized by combined treatment). Caution is needed in interpreting these projections with the sample size we have here, as the standard errors of the estimates are large relative to the differences in prevalence, but the broad conclusion is that cost constraints would not lead to dramatic decreases in overall remission rates if the ITR was used to assign combined treatment only to the patients for whom this option has substantial benefits over the second-best treatment.
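The "about 80%" figure for the 50 mg/day sample follows directly from the bracketed expression. The calculation below reproduces it; the function and argument names are illustrative, and reading 31.1% as the rate when no patient receives combined treatment is an interpretation of that expression, not something stated explicitly.

```python
def share_retained(constrained, optimal, reference):
    """Fraction of the remission gain over the reference rate that is retained."""
    return (constrained - reference) / (optimal - reference)

# Remission rates (%) taken from the bracketed expression in the text.
retained_30 = share_retained(constrained=35.5, optimal=36.5, reference=31.1)
```

This gives 4.4/5.4, or roughly 0.81, consistent with the statement that about 80% of the advantage is retained when only 30% of patients receive combined treatment.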
Table 3. Estimated second-line 9-week remission rates under constrained optimization limiting the proportion (k) of patients that could receive combined treatment
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20221222060616328-0009:S0033291721000027:S0033291721000027_tab3.png?pub-status=live)
s.e., standard error.
Evaluating predictor importance
The small proportion of patients whose probability of remission was optimized by continuing with sertraline was consistently (across the 50 and 100 mg/day samples) more likely than other patients to be male, well-educated, have baseline irritability, have four high symptom scores at week 1 (total PHQ-9, dysphoria, irritability, sleep-appetite disturbance), have five high symptom scores at week 3 (total BDI in addition to the same 4 as at week 1), and have significantly less reduction in symptom subscale scores than other patients between baseline and week 3 on symptom change scores (online Supplementary Table S3). The patients whose probability of remission was optimized by switching to mirtazapine, in comparison, were consistently more likely than other patients to be separated/divorced, have a comorbid physical disorder, have low baseline dysphoria, and have low suicidality at baseline, week 1, and week 3. The patients whose probability of remission was optimized by combining mirtazapine with sertraline, finally, consistently had the largest decreases in all nine symptom subscale scores between baseline and week 3. None of the week 3 side effect measures was consistently associated with optimized treatment groups. Several inconsistently significant associations were also found (i.e. significant in one but not both samples) that might have been due either to over-fitting or to unique characteristics of the two samples.
Discussion
Results are broadly consistent with prior studies, most of which were small proof-of-concept studies, in suggesting that treatment optimization could improve MDE symptom response (Cohen & DeRubeis, 2018; Ermers et al., 2020). It is instructive to compare our results to two earlier SUN(^_^)D reports. The first focused on aggregate week 9 depression for second-line treatment pooled across 50 and 100 mg samples and found switching and combining equally efficacious but that continuing was significantly less efficacious (Kato et al., 2018). The second estimated an ITR in the pooled sample for week 9 mean symptoms and found equivalent proportions optimized by switching (47%) and combining (45%) (Furukawa et al., 2020). Our ITR results, in comparison, focus on remission and show that considerably higher proportions of patients are optimized for remission by combining (69.9–66.7% in the 50–100 mg/day subsamples) than switching (21.1–29.3%). Comparison across studies suggests that optimal second-line treatment differs for some patients depending on whether the goal is to minimize mean symptom scores or optimize probability of remission.
But what of using the simpler rule of treating all second-line patients with combined treatment? The results in Figs 1a and 1b suggest that this would reduce the probability of remission among patients optimized by switching with combining second-best (18.3–20.5% of all patients) by about 5%. A more reasonable strategy, then, would be to limit combination treatment to the patients for whom it is optimal. We also found that, when resources are constrained, combination treatment could be reduced even more without meaningful reduction in aggregate remission rates by assigning second-best alternatives to patients for whom the advantage of combined treatment is low.
In interpreting these results, it is important to note that the numbers of patients in the separate treatment arms in the 50 and 100 mg/day samples were below the n = 300 minimum recommended for estimating an MDE ITR (Luedtke et al., 2019). In addition, the estimated benefit of treatment optimization might have increased if other known baseline predictors of ADM treatment response had been assessed (e.g. comorbid anxiety disorders; Kessler, 2018). Replication in a larger sample using more extensive measures would resolve these limitations, but would be expensive and time-consuming. One practical way around this difficulty would be to emulate the SUN(^_^)D trial in a large observational comparative effectiveness study (Hernán & Robins, 2016). The validity of this design could be examined by determining whether it yields the same aggregate comparative effectiveness results as the SUN(^_^)D trial and whether the ITR reported here could be replicated in the observational sample using appropriate baseline covariate balancing. If so, the ITR could be refined in this large observational sample by using additional baseline predictors. If expensive biomarkers were included among the predictors, their use could be restricted to the patients for whom the estimated difference scores in probability of remission in the absence of biomarker assessment (as in Figs 1a, 1b) are small enough that biomarker effects in a plausible range could influence optimal treatment assignment.
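The idea of restricting biomarker assessment to patients with small estimated difference scores can be sketched as a simple screening step. This is only an illustration under stated assumptions: the function name `needs_biomarker` and the threshold `max_plausible_shift` (the largest change in a between-treatment difference that a biomarker could plausibly produce) are hypothetical, and the predicted probabilities would come from an ITR estimated without biomarkers.

```python
import numpy as np


def needs_biomarker(p_remit, max_plausible_shift):
    """Flag patients whose optimal treatment could plausibly change.

    p_remit: (n_patients, n_treatments) array of predicted remission
             probabilities estimated without biomarkers (illustrative).
    max_plausible_shift: assumed upper bound on how much biomarker
             information could change a between-treatment difference.
    Returns a boolean array, True where the margin between the best and
    second-best options is smaller than the assumed shift.
    """
    p = np.sort(np.asarray(p_remit, dtype=float), axis=1)
    margin = p[:, -1] - p[:, -2]  # best minus second-best option
    return margin < max_plausible_shift
```

Patients flagged `True` would be candidates for biomarker assessment; for the remainder, the assumed biomarker effect is too small to overturn the treatment already indicated.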
Even before such a replication, though, the results reported here are useful in suggesting that substantial variation exists in the relative effectiveness of continuing, switching, and combining sertraline with mirtazapine. Individual differences would likely be even more pronounced if psychotherapy and combined ADM-psychotherapy were added as treatment options (Cuijpers et al., 2020; Kessler, 2018; McGrath et al., 2013). The development of comprehensive ITRs with the approach used here could be of value in improving the matching of patients with their individually optimal treatments. The extent to which the strength of the ITR is sufficient to justify the added complexity of developing and using it is a separate matter that requires weighing the benefits of improved outcomes against the costs of implementation. Principled methods exist for determining appropriate decision thresholds of this sort (Van Calster et al., 2018), but that type of cost–benefit analysis is an extensive undertaking that exceeds the scope of the current report.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291721000027.
Data
Sharing of identifiable, individual-level data is prohibited by Japanese law. Fully anonymized, unidentifiable data with age, sex, and site deleted are available from the University Hospital Medical Information Network (UMIN-ICDR; http://www.umin.ac.jp/icdr/index.html), an academic network information center for biomedical sciences funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT), at https://upload.umin.ac.jp/cgi-open-bin/icdr_e/ctr_view.cgi?recptno=R000034028.
Financial support
SUN(^_^)D was funded by the Ministry of Health, Labor and Welfare, Japan (H-22-Seishin-Ippan-008) from April 2010 through March 2012 to TAF (http://www.mhlw.go.jp/english/) and thereafter by the Japan Foundation for Neuroscience and Mental Health (JFNMH) to TAF (http://www.jfnm.or.jp/). The JFNMH received donations from Asahi Kasei, Eli Lilly, GlaxoSmithKline (GSK), Janssen, Merck Sharp & Dohme (MSD), Meiji, Mochida, Otsuka, Pfizer, Shionogi, Taisho, and Mitsubishi-Tanabe. The secondary analysis reported here and preparation of this report were funded in part by the Precision Treatment of Mental Disorders Foundation. The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. Clinical trial registration: ClinicalTrials.gov identifier: NCT01109693. Registered on 23 April 2010.
Conflict of interest
In the past 3 years, Dr Kessler reports being a consultant for Datastat, Inc., Rallypoint Networks, Inc., Sage Pharmaceuticals, and Takeda. Dr Furukawa reports personal fees from Mitsubishi-Tanabe, MSD, and Shionogi, and a grant from Mitsubishi-Tanabe, outside the submitted work. Dr Furukawa has a patent 2018-177688 pending. Dr Kato reports receiving lecture fees from Eli Lilly and Mitsubishi-Tanabe. Dr Kato has received contracted research funds from GSK, MSD, and Mitsubishi-Tanabe. The remaining authors report no financial relationships with commercial interests.