There is no dispute: there is great heterogeneity in the treatment effects associated with major depressive disorder (MDD). Since depression is a syndrome, we would expect a wide range of causes and potential treatments, and a wide range of outcomes both with any one treatment and between treatments. As Kessler's report notes, the field has yet to effectively address and reduce this heterogeneity despite multi-decade attempts, including descriptively defined depressive subtypes (Rush et al. 2008) and, more recently, neuropsychological (Etkin et al. 2015), imaging (Korgaonkar et al. 2014) and genetic (Schatzberg et al. 2015) approaches.
Kessler and colleagues review a variety of recent efforts and methodological innovations to better address the challenge of the heterogeneity of treatment effects (HTE) in MDD. They have compiled a list of critical baseline patient-reported parameters that have at least some evidence of relating to outcome in the acute treatment of depression. These 30+ measures are easily obtained from patients at very low cost, although the optimal combination for predicting treatment outcomes or selecting between treatments is as yet undefined, since the full compendium of measures has not been tested in large informative populations of depressed patients.
The authors properly emphasise the extraordinary importance of such an effort. Indeed, the relevant combination of these inexpensively acquired measures can and should provide a platform upon which additional laboratory-based measures can be placed to further reduce HTE and better target treatments. In fact, a lack of knowledge regarding what could be offered by these patient-reported indicators could limit the clinical value of specific laboratory tests. This idea is not surprising because clinical parameters typically influence whether treatments will work.
Finally, the authors provide an important demonstration of an analytic approach using epidemiological data. Their proposal clearly deserves testing in large informative samples on which all the suspected baseline predictors are collected. Large samples, combined with machine learning and replication of the initial results in independent samples, seem very likely to yield clinically actionable information and clinically useful tools that should inform clinical decision-making.
There are, however, some challenges and questions to be considered in taking this proposal further. One challenge is to define the preferred sample. While one may hope that machine learning will sort out many issues if samples are big enough, unnecessarily large samples may add more cost and complexity than benefit. At first glance, would we not want to include all patients who are deemed to be clinically depressed and are receiving medication? Extremely inclusive samples may not be more informative, especially if: (a) our treatments are not uniquely effective in distinct subsamples, and (b) the accuracy of the chart diagnosis is questionable. Both seemingly apply.
We know, for example, that some depressed patients suffer from large numbers of comorbid general medical conditions that in turn affect the efficacy of treatments (Rush et al. 2006, 2009; Rush, 2007). If a medication is ineffective in some substantial subset of depressed patients, the case mix under study may well affect the algorithms being developed, which could reduce the chances of replication in another independent population that has a different case mix. While machine learning may overcome some of these issues to some degree, a more cost-effective approach may entail fewer but better-defined subjects, which could increase the chances of replication in independent samples. Relying on randomised controlled trials that are designed for maximal internal validity limits subject availability and yields samples that do not represent patients in practice, making results potentially challenging to replicate (Wisniewski et al. 2009).
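The case-mix argument can be made concrete with a small arithmetic sketch. The subgroup labels, response rates and sample proportions below are hypothetical numbers chosen only for illustration; they are not drawn from any of the cited studies. The sketch simply shows that two samples with identical subgroup-specific response rates can yield different pooled response rates when their case mix differs.

```python
def pooled_response_rate(case_mix, response_by_subgroup):
    """Pooled response rate: subgroup rates weighted by their share of the sample."""
    if abs(sum(case_mix.values()) - 1.0) > 1e-9:
        raise ValueError("case-mix proportions must sum to 1")
    return sum(case_mix[g] * response_by_subgroup[g] for g in case_mix)


# Hypothetical subgroup response rates (assumed, for illustration only).
response_by_subgroup = {"low_comorbidity": 0.50, "high_comorbidity": 0.25}

# Two hypothetical samples that differ only in case mix.
sample_a = {"low_comorbidity": 0.8, "high_comorbidity": 0.2}
sample_b = {"low_comorbidity": 0.4, "high_comorbidity": 0.6}

rate_a = pooled_response_rate(sample_a, response_by_subgroup)
rate_b = pooled_response_rate(sample_b, response_by_subgroup)

print(f"Sample A pooled response rate: {rate_a:.2f}")
print(f"Sample B pooled response rate: {rate_b:.2f}")
```

Under these assumed numbers the treatment behaves identically within each subgroup, yet the pooled result differs between the two samples, which is one way an algorithm developed in one population could fail to replicate in another.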
Another sample selection issue is whether to include treatment-resistant patients with multiple prior failed treatment attempts. Such a sample is likely to be less heterogeneous than all patients who entered the first treatment step, and placebo response rates should be lower.
If placebo responders are common in the sample (logically more likely with the first than the second treatment step), they may mask the detection of indicators that reduce heterogeneity among those who do respond. While very large samples may help with this challenge, there are costs.
Furthermore, if a combination of inexpensively acquired patient-reported variables actually reduces HTE, then that particular combination should become a platform upon which more expensive laboratory tests can be added. Developing the combination of patient-reported variables that best informs the first step would certainly be useful. However, the addition of laboratory tests will likely be more cost-effective at the second or subsequent steps. Thus, a second platform, perhaps using a different or even the same combination of patient-reported variables, would be of value in informing this second step in anticipation of the addition of laboratory tests. It would seem important to define the clinical decision points (which steps and concerning which choices) to be addressed by the results of machine learning.
This could be accomplished in a registry in which patients begin with a single generic medication, after which the second step could offer some common choices. Such a registry would be readily accepted by patients, even if one limited the number of second step options (to focus on specific clinically important questions or common decisions). An equipoise-stratified randomised design could be a consideration (Lavori et al. 2001).
Another truly major challenge entails how to minimise noise generated by wide variation in treatment delivery. Antidepressant medications are often underdosed, patient adherence is highly variable, clinician adherence to clinical practice guidelines is quite variable, and up to one-third of patients do not complete acute medication trials. Measurement-based care does produce more robust yet tailored dosing and better outcomes than routine care (Rush, 2015), but measurement-based care is not yet widely used. Large samples of improperly treated patients could produce machine learning results that do not apply to higher-quality treatment practices. Inappropriate heterogeneity in care delivery simply adds to HTE, and lowers the chances of replication in independent samples with higher-quality care.
Another challenge is the absence of any measured outcome (e.g., depressive symptoms or function) in most electronic health records (EHRs). In this context, we would not know how well each patient has fared. If measurement-based care were implemented, it would have the dual advantage of improving the quality of care and providing a clinically relevant outcome that could be entered into the EHR. Even if different providers used different symptom ratings, such as the Montgomery-Åsberg Depression Rating Scale, the Patient Health Questionnaire or the Quick Inventory of Depressive Symptomatology, item response theory analyses can provide reasonably good crosswalks between these various measures (www.ids-qids.org).
Two final thoughts. First, let us not overlook the potential value of combining baseline patient self-reported variables with early post-baseline changes in either symptoms or function to target treatments. A post hoc analysis of STAR*D (Rush et al. 2004) data provided a proof of concept of this approach (Kuk et al. 2010; Li et al. 2012). Machine learning in large samples with both baseline and early post-baseline data would clearly strengthen the clinical utility and precision of such an approach. These kinds of results could provide an evidence-based approach to address another clinical challenge; namely, to avoid prolonged treatment trials that are certain to produce a poor outcome. This challenge should be addressed as soon as possible using baseline and early post-baseline patient-reported parameters.
Second, these patient-reported baseline parameters may also serve other clinical functions beyond reducing HTE to better target treatments. Clinically available algorithms developed by machine learning could help to predict longer-term prognoses for depressed patients following response or remission. For example, could we identify the 10–18% of remitted depressed patients who are likely to relapse within a year (Judd et al. 2015) using a combination of either baseline or end-of-acute-treatment self-reports? If so, follow-up visit frequency could be tailored to individuals based on their likelihood of relapse.
In summary, a combination of self-reported baseline parameters would appear to be feasible, inexpensive and very likely to reduce HTE and thus enhance treatment targeting. This tool is also likely to provide an essential platform upon which more time-intensive and expensive tests to further reduce HTE could be evaluated and developed. A multi-site registry would seem to be essential to ensure a reasonably representative patient sample, perhaps focused on a select number of treatment options and specific treatment steps, all delivered by measurement-based care. This effort would be one way to increase feasibility and contain cost. Different registries focused on clinically important but distinct subgroups (e.g., bipolar and unipolar patients; depressed youth; adults; the elderly) could potentially be developed and similarly analysed. Given the current remarkable dearth of clinically informative tools to reduce HTE, the table is set for substantial advances.