Death by suicide is a major public health concern (World Health Organization, 2018). Given the high rates of death by suicide among military veterans (Kessler et al., Reference Kessler, Hwang, Hoffmire, McCarthy, Petukhova, Rosellini and Thompson2017), the U.S. Veterans Health Administration (VHA) has prioritized improving methods to identify individuals at the highest risk for suicide (McCarthy et al., Reference McCarthy, Bossarte, Katz, Thompson, Kemp, Hannemann and Schoenbaum2015; Torous et al., Reference Torous, Larsen, Depp, Cosco, Barnett, Nock and Firth2018). As part of this initiative, machine-learning models are increasingly utilized to evaluate complex networks of potential predictor variables, decipher true v. false positives, and identify the most predictive variables or constellations of variables (Walsh, Ribeiro, & Franklin, Reference Walsh, Ribeiro and Franklin2017). Although these innovations have led to improvements, even state-of-the-art suicide prediction models have been critiqued for lacking optimal accuracy (Belsher et al., Reference Belsher, Smolenski, Pruitt, Bush, Beech, Workman and Skopp2019; Kessler et al., Reference Kessler, Bernecker, Bossarte, Luedtke, McCarthy, Nock, Zuromski, Passos, Mwangi and Kapczinski2019). As a means to further increase accuracy, we describe the development and preliminary testing of a novel natural language processing (NLP) approach that includes risk predictor variables extracted from mental health providers' written notes alongside the structured variables included in the current VHA state-of-the-art suicide prediction model.
Suicide risk prediction models have been critiqued for a range of reasons, from concerns about specific predictor variables to foundational debates about in-person clinical evaluations v. analytically derived risk assessment tools (Kessler, Reference Kessler2019). The accuracy of suicide risk screening tools is further complicated by patients' reluctance to disclose suicidal intent (Ganzini et al., Reference Ganzini, Denneson, Press, Bair, Helmer, Poat and Dobscha2013; Husky, Zablith, Fernandez, & Kovess-Masfety, Reference Husky, Zablith, Fernandez and Kovess-Masfety2016) and concerns about associated stigma (Ganzini et al., Reference Ganzini, Denneson, Press, Bair, Helmer, Poat and Dobscha2013; Hom, Stanley, Podlogar, & Joiner, Reference Hom, Stanley, Podlogar and Joiner2017). The direct and indirect costs of provider-administered suicide risk assessments have also been noted (Kessler, Reference Kessler2019). Responding to these concerns, the VHA recently developed Recovery Engagement and Coordination for Health – Veterans Enhanced Treatment (REACH VET; Veterans Affairs Office of Public and Intergovernmental Affairs, 2017), a machine-learning-based suicide prediction model. Drawing from VHA users' Electronic Medical Records (EMR), REACH VET was designed to identify individuals with the highest suicide risk (the top 0.1% tier). REACH VET includes 61 variables abstracted from structured EMR data, ranging from health service usage to psychotropic medication usage, socio-demographics, and the interaction of demographics and healthcare usage. REACH VET, integrated within an alert system that informs mental health providers about patient risk, has been quickly adopted throughout the VHA network (Lowman, Reference Lowman, Ritchie and Llorente2019).
REACH VET's predictive ability is constrained by its reliance on structured EMR variables, which are easily quantified and analyzed. Although structured EMR variables account for many relevant suicide predictor variables, not all potential predictor variables have been developed into structured formats or have structured formats that are widely used (Barzilay & Apter, Reference Barzilay and Apter2014; Rudd et al., Reference Rudd, Berman, Joiner, Nock, Silverman, Mandrusiak and Witte2006). To this end, studies have explored the linguistic analysis of unstructured, text-based EMR data for predictive purposes (Leonard Westgate, Shiner, Thompson, & Watts, Reference Leonard Westgate, Shiner, Thompson and Watts2015; Poulin et al., Reference Poulin, Shiner, Thompson, Vepstas, Young-Xu, Goertzel and McAllister2014; Rumshisky et al., Reference Rumshisky, Ghassemi, Naumann, Szolovits, Castro, McCoy and Perlis2016). Evidence suggests that the linguistic analysis of unstructured EMR data – materials such as clinicians' free-text notes and written records – may offer relevant information for suicide risk prediction, including information about patients' interpersonal patterns and the relationship between the patient and the medical provider. In addition to capturing new and potentially useful clinical information, this indirect approach may lessen concerns about patients' self-report bias as well as clinicians' interpretative bias. Finally, these text-based approaches may better account for the dynamic nature of suicide risk, in contrast to approaches that largely rely on unchanging demographic variables.
NLP bridges linguistics and machine learning, quantifying written language as vectors that can be statistically evaluated. NLP offers to broaden the reach of computational analysis to better include human experience, emotion, and relationships (Crossley, Kyle, & McNamara, Reference Crossley, Kyle and McNamara2017), areas that have been previously linked with suicide risk (Van Orden et al., Reference Van Orden, Witte, Cukrowicz, Braithwaite, Selby and Joiner2010). Initial research suggests that NLP offers an effective means to mine free-text data for variables that impact suicide (Ben-Ari & Hammond, Reference Ben-Ari and Hammond2015; Fernandes et al., Reference Fernandes, Dutta, Velupillai, Sanyal, Stewart and Chandran2018; Koleck, Dreisbach, Bourne, & Bakken, Reference Koleck, Dreisbach, Bourne and Bakken2019). This study evaluates whether REACH VET's ability to predict death by suicide can be improved by including NLP-derived variables from unstructured EMR data. To accomplish this objective, we built on established REACH VET predictor variables to determine whether linguistic analysis of free-text clinical notes could improve prediction of death by suicide within a cohort of veterans who had been diagnosed with post-traumatic stress disorder (PTSD). This study utilized a PTSD cohort because of associations linking PTSD and suicide (McKinney, Hirsch, & Britton, Reference McKinney, Hirsch and Britton2017), because excess suicide mortality in the VHA PTSD treatment population has been previously established (Forehand et al., Reference Forehand, Peltzman, Westgate, Riblet, Watts and Shiner2019), and because we had a readily available and well-developed cohort (Shiner, Leonard Westgate, Bernardy, Schnurr, & Watts, Reference Shiner, Leonard Westgate, Bernardy, Schnurr and Watts2017).
Methods
Data source
Since 2000, the Department of Veterans Affairs (VA) has employed an electronic medical record for all aspects of patient care. EMR data from all VHA hospitals are stored in the VA Corporate Data Warehouse (CDW). Using the VA CDW, we selected VA users who had been newly diagnosed with PTSD between 2004 and 2013. Individuals in this sample received at least two PTSD diagnoses within 3 months, of which at least one occurred in a mental health clinic, and had not met PTSD diagnostic criteria during the previous 2 years. This original sample (n = 731 520) has been previously described (Shiner et al., Reference Shiner, Leonard Westgate, Bernardy, Schnurr and Watts2017). The first of the two qualifying PTSD diagnoses was considered the ‘index PTSD diagnosis’, and the date of that diagnosis was used as the start of the risk time for analyses (see below). Patients were followed for 1 year following the index diagnosis and, if they met criteria for multiple yearly sub-cohorts, they were assigned to the earliest sub-cohort. We obtained information on patient characteristics and service use, as well as clinical note text associated with psychotherapy encounters, from the CDW.
Selection of cases and controls
We used a three-step process to match cases and controls. First, we identified all psychotherapists who had at least one individual psychotherapy session with patients (n = 436) who died by suicide within their first treatment year after the index PTSD diagnosis. Psychotherapists were selected regardless of therapeutic orientation and level of graduate training. As less than 3% of patients in this cohort received evidence-based psychotherapy (EBP) at the recommended treatment level (Shiner et al., Reference Shiner, Westgate, Gui, Cornelius, Maguen, Watts and Schnurr2019), we did not differentiate between EBP and non-EBP providers. We then identified all other patients (n = 96 570) within the PTSD cohort who remained alive at the end of their first treatment year and who had at least one session with one of these psychotherapists (the same psychotherapist as the case group). Second, to account for individual clinician note-writing differences, familiarity with patients, psychotherapy type, and note-writing style, we performed a conditional match on plurality psychotherapist (the psychotherapist who completed the highest number of administratively-coded psychotherapy sessions with each patient) and a 2-year window. This step resulted in 343 cases and 26 542 potential controls. Third, we selected among conditional matches using propensity scores created with the 61 REACH VET variables. REACH VET is the current standard method for suicide risk identification in the VHA. Minor modifications were made to the REACH VET variables to address variable collinearity, interaction variables, and associated sample cohort differences. Modifications to the REACH VET variables are addressed in Table 1.
Because REACH VET's weighted coefficients were not publicly available, we calculated propensity scores from patients' treatment history prior to an end date dependent on the case (date of death occurring during the first year of PTSD treatment) or control (end date of the first year of PTSD treatment). We used greedy matching (Austin, Reference Austin2011), with the psychotherapist who provided the plurality of psychotherapy visits (‘primary psychotherapist’) specified as an exact matching constraint, to balance the sample such that there were no consequential differences between cases and controls on REACH VET variables (cases and controls were at an equal calculated risk for suicide based on the 61 REACH VET variables). We utilized a 5:1 nearest-neighbor propensity score match (Austin, Reference Austin2011) and achieved a caliper of 27.5 and C statistic = 0.806. The initial analytic sample consisted of 252 cases and 1090 controls. Table 1 presents conditional matching and propensity score matching details.
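The matching logic described above can be illustrated with a simplified sketch. This is not the authors' implementation: the data are synthetic, ten covariates stand in for the 61 REACH VET variables, and the exact-match constraint on primary psychotherapist and the caliper tuning are omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the structured predictors (hypothetical data).
n_case, n_ctrl, n_vars = 50, 500, 10
X = rng.normal(size=(n_case + n_ctrl, n_vars))
y = np.r_[np.ones(n_case), np.zeros(n_ctrl)]  # 1 = case, 0 = potential control

# Propensity score: modeled probability of case status given the covariates.
ps = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
ps_case, ps_ctrl = ps[:n_case], ps[n_case:]

def greedy_match(ps_case, ps_ctrl, ratio=5, caliper=None):
    """Greedy nearest-neighbor matching without replacement: each case
    takes up to `ratio` unused controls with the closest propensity
    scores, optionally restricted to a maximum distance (caliper)."""
    used, matches = set(), {}
    for i in range(len(ps_case)):
        order = np.argsort(np.abs(ps_ctrl - ps_case[i]))
        picked = []
        for j in order:
            if j in used:
                continue
            if caliper is not None and abs(ps_ctrl[j] - ps_case[i]) > caliper:
                continue
            picked.append(int(j))
            used.add(int(j))
            if len(picked) == ratio:
                break
        matches[i] = picked
    return matches

matches = greedy_match(ps_case, ps_ctrl, ratio=5)
```

Because matching is without replacement, each control serves at most one case, which mirrors the 5:1 structure of the analytic sample.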
Table presents the conditional match and propensity match for cases and controls based on the 61-variable electronic medical record (EMR) suicide prediction metric.c
a The original 61-variable suicide risk model includes three interaction variables. As the weighted regression model associated with these variables was not publicly available, we could not effectively evaluate these interaction variables. To approximate them, we included both components of each interaction variable as unique variables. As the original model included Interaction between anxiety disorder and personality disorders diagnoses in the last 24 months, we instead included both anxiety disorder diagnosis and personality disorders diagnoses in the last 24 months. Similarly, as the original model included Interaction between Divorced and Male and the Interaction between Widowed and Male, we instead included Divorced, Male, and Widowed.
b AK, AZ, CA, CO, HI, ID, MT, NV, NM, OR, UT, WA, WY.
c The original 61-variable suicide risk model also included ‘Head/neck cancer diagnosis in last 12 months’ and ‘First VHA visit in last 5 years was in the prior year’. ‘Head/neck cancer diagnosis in last 12 months’ was excluded from our model due to collinearity with ‘Head/neck cancer diagnosis in last 24 months’. ‘First VHA visit in last 5 years was in the prior year’ was excluded from our model as we did not have access to data going back 5 years.
Selection of the text corpus
For cases in the analytic sample, we obtained all clinical notes associated with administratively-coded psychotherapy encounters beginning at the index PTSD diagnosis and ending 5 days before death. Templates and forms were deleted from notes in order to facilitate machine learning on free text written by clinicians. For each of the five controls, notes were selected from an equivalent range of time as the matched case, so that if a case lived for 6 months, notes from the relevant controls would be evaluated from diagnosis forward for 6 months. We excluded notes from within 5 days before death as the VA EMR often documents calls to or from families following a death by suicide, and dates of death can sometimes be incorrect by several days. We also excluded patients who did not have any notes within the selected date range. Patients who had more than threefold the mean number of notes were removed so as not to overweight patients who had been seen more frequently. Although we retrieved and processed notes from all psychotherapy encounters, analysis was limited to notes from patients' plurality psychotherapist to account for more developed patient relationships. Due to these exclusions, the number of associated controls for each case varied throughout the sample. The final sample consisted of 246 cases and 986 controls. A total of 10 244 notes were selected for analysis.
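The note-selection rules (per-patient observation window, removal of patients with no in-window notes, removal of patients above threefold the mean note count) amount to a small filtering pipeline. A minimal sketch follows; the table layout and column names are hypothetical, not the CDW schema.

```python
import pandas as pd

# Hypothetical note table (illustrative rows only).
notes = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2, 3],
    "note_date": pd.to_datetime(
        ["2010-01-05", "2010-02-01", "2010-01-10",
         "2010-01-20", "2010-03-01", "2010-06-01"]),
})

# Per-patient window: index-diagnosis date to the case-defined end date.
window = {1: ("2010-01-01", "2010-06-30"),
          2: ("2010-01-01", "2010-02-15"),
          3: ("2010-01-01", "2010-02-01")}

def in_window(row):
    start, end = window[row.patient_id]
    return pd.Timestamp(start) <= row.note_date <= pd.Timestamp(end)

# Keep only notes inside each patient's window; patients with no
# in-window notes drop out automatically.
kept = notes[notes.apply(in_window, axis=1)]

# Remove patients with more than threefold the mean note count,
# so heavily documented patients do not dominate the corpus.
counts = kept.groupby("patient_id").size()
eligible = counts[counts <= 3 * counts.mean()].index
kept = kept[kept.patient_id.isin(eligible)]
```

In this toy example, patient 3's only note falls outside the window and is excluded, leaving two patients with two notes each.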
Because we were concerned that patients who were in treatment longer may have had more developed relationships with their providers than those who were treated for less time, we grouped patients who lived equivalent lengths of time from diagnosis together. Cases and their associated controls were grouped together as follows: patients who lived up to 3 months after index diagnosis (3-month cohort, cases: n = 33, controls: n = 99); patients who lived between 3 and 6 months after index diagnosis (6-month cohort, cases: n = 63, controls: n = 222); patients who lived between 6 and 9 months (9-month cohort, cases: n = 72, controls: n = 297); and patients who lived between 9 and 12 months (12-month cohort, cases: n = 78, controls: n = 368). Notes from diagnosis until the relevant end date were evaluated.
Calculation of linguistic indices
Notes were processed by Sentiment Analysis and Cognition Engine [SÉANCE; (Crossley et al., Reference Crossley, Kyle and McNamara2017)], a Python-based NLP package. SÉANCE utilizes a suite of established linguistic databases including SemanticNet (Cambria, Havasi, & Hussain, Reference Cambria, Havasi and Hussain2012; Cambria, Speer, Havasi, & Hussain, Reference Cambria, Speer, Havasi and Hussain2010), General Inquirer Database (GID; Stone, Dunphy, Smith, & Ogilvie, Reference Stone, Dunphy, Smith and Ogilvie1966), EmoLex (Mohammad & Turney, Reference Mohammad and Turney2010, Reference Mohammad and Turney2013), Lasswell (Lasswell & Namenwirth, Reference Lasswell and Namenwirth1969), Valence Aware Dictionary and sEntiment Reasoner (VADER; Hutto & Gilbert, Reference Hutto and Gilbert2014), Hu–Liu (Hu & Liu, Reference Hu, Liu, Kim and Kohavi2004), Harvard IV-4 (Stone et al., Reference Stone, Dunphy, Smith and Ogilvie1966), and the Geneva Affect Label Coder (GALC; Scherer, Reference Scherer2005). These sources range from expert-derived dictionary lists to rule based systems (Urbanowicz & Moore, Reference Urbanowicz and Moore2009), comprise more than 250 unique variables, and can be evaluated in positive and negative iterations. SÉANCE compares positively (Crossley et al., Reference Crossley, Kyle and McNamara2017) with Linguistic Inquiry and Word Count (LIWC; Pennebaker, Booth, and Francis, Reference Pennebaker, Booth and Francis2007), a widely used semantic analysis tool. Additionally, in contrast to LIWC, SÉANCE is a downloadable open-source software package that is more easily utilized under VA Office of Information and Technology (VA OI&T) constraints, which restrict the use of cloud-based software packages such as LIWC.
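The core of a dictionary-based linguistic index, the form shared by the GID, EmoLex, and GALC features above, can be sketched simply: tokenize a note and report, per lexicon, the proportion of tokens that match. SÉANCE's actual implementation is more elaborate (negation handling, part-of-speech filtering), and the lexicons below are toy stand-ins, not the real dictionary entries.

```python
import re

# Toy lexicons standing in for SEANCE's dictionaries (illustrative words
# only, not the actual GID/EmoLex/GALC word lists).
LEXICONS = {
    "hostility": {"abhor", "antagonism", "hostile", "anger"},
    "affiliation": {"affection", "friend", "support", "belong"},
}

def linguistic_indices(text):
    """Return, for each lexicon, the proportion of tokens in `text`
    that appear in that lexicon -- the basic dictionary-based index."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)  # guard against empty notes
    return {name: sum(t in lex for t in tokens) / n
            for name, lex in LEXICONS.items()}

scores = linguistic_indices(
    "Patient reports anger toward a friend and little support.")
# One 'hostility' hit and two 'affiliation' hits among nine tokens.
```

Each note thus becomes a fixed-length numeric vector, one value per dictionary, which is what makes the downstream penalized regression possible.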
Analysis
Cases and controls were compared over the full range of SÉANCE variables. Data were analyzed using a least absolute shrinkage and selection operator (LASSO) penalized generalized linear mixed model that controlled for varying numbers of psychotherapy sessions. LASSO is a machine-learning algorithm that reduces prediction errors frequently associated with stepwise selection (Tibshirani, Reference Tibshirani1996). LASSO sets the sum of the absolute values of the regression coefficients to be less than a fixed value, such that less important feature coefficients are reduced to zero and excluded from the model. Bayesian information criterion (BIC; Schwarz, Reference Schwarz1978) was utilized to select the tuning parameter.
Each sub-cohort was randomly divided into training (2/3 of sample) and testing (1/3 of sample) sets. LASSO was implemented on the training set to select features, which were in turn utilized in the testing set to estimate prediction scores. Area under the receiver operating characteristic curve (AUC, equivalent to the c-statistic) and 95% confidence intervals were calculated to determine each model's predictive accuracy. Analysis was completed using Python and R's glmnet package (Friedman, Hastie, & Tibshirani, Reference Friedman, Hastie and Tibshirani2010).
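A simplified sketch of this select-then-evaluate pipeline follows. It is illustrative only: the data are synthetic, scikit-learn's L1-penalized logistic regression stands in for glmnet, and the random-effect structure of the mixed model (the per-session grouping) is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data: 50 candidate linguistic features, only 3 informative.
n, p = 600, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = 1.0
prob = 1 / (1 + np.exp(-(X @ beta)))
y = (rng.random(n) < prob).astype(int)

# 2/3 training, 1/3 testing, as in the study design.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

# Sweep the L1 penalty strength; score each fit by BIC = k*ln(n) - 2*loglik,
# where k is the number of nonzero coefficients.
best = None
for C in np.logspace(-2, 1, 20):
    m = LogisticRegression(penalty="l1", solver="liblinear",
                           C=C, max_iter=2000).fit(X_tr, y_tr)
    p_hat = np.clip(m.predict_proba(X_tr)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y_tr * np.log(p_hat) + (1 - y_tr) * np.log(1 - p_hat))
    k = np.count_nonzero(m.coef_)
    bic = k * np.log(len(y_tr)) - 2 * loglik
    if best is None or bic < best[0]:
        best = (bic, m)

# Evaluate the BIC-selected sparse model on the held-out third.
model = best[1]
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

The BIC term penalizes each retained coefficient by ln(n), so uninformative features are driven to zero before the model is scored on the test set.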
Results
Each sub-cohort's reduced model contained unique linguistic features. AUC statistics evaluating the predictive accuracy of these models showed little to no improvement above chance except for the 12-month cohort (Table 2). AUC statistics for the 12-month cohort (C = 0.580; 95% CI 0.512–0.649) indicated that the associated features offered a small (8%) predictive improvement. Within the other sub-cohorts, 95% confidence intervals spanned a wide range of values, indicating that the sample was too small for adequate analysis. Figure 1 presents model AUC curves.
Table presents SÉANCE (Crossley et al., Reference Crossley, Kyle and McNamara2017) selected features. Features were selected by a LASSO-penalized generalized linear mixed model that used BIC to select the tuning parameter. Data were randomly divided into training (2/3) and test (1/3) sets. The following table presents the evaluation of features using the test set.
a Features were selected from the GID (Stone et al., Reference Stone, Dunphy, Smith and Ogilvie1966) semantic dictionary.
b Repeated items correspond to positive and negative articulations of selected features.
c Features were selected from EmoLex (Mohammad & Turney, Reference Mohammad and Turney2010, Reference Mohammad and Turney2013) semantic dictionary.
d Features were selected from Geneva Affect Label Coder (GALC; Scherer, Reference Scherer2005) semantic dictionary.
The most prominent NLP features in the 3-month cohort dealt with food and life routine issues. The most prominent NLP features in the 6-month cohort dealt with vice, anger, and male professional and social roles. The most prominent NLP features in the 9-month cohort dealt with heightened arousal, including issues of affiliation and hostility. The most prominent NLP features in the 12-month cohort dealt with goal directedness and indifference and adjectives evaluating social relationships. Selected NLP features were drawn from GID's (Stone et al., Reference Stone, Dunphy, Smith and Ogilvie1966) semantically categorized dictionary list and from EmoLex's (Mohammad & Turney, Reference Mohammad and Turney2013) and GALC's (Scherer, Reference Scherer2005) lists of emotion-specific words. Table 2 includes each sub-cohort's featured model.
Discussion
Whereas REACH VET variables are often static, our results present a novel dynamic method to identify and monitor predictor variables and how they change over time. Following Kleiman et al.'s (Reference Kleiman, Turner, Fedor, Beale, Huffman and Nock2017) findings about the variation and fluctuation of suicide risk factors, this method presents a valuable format to monitor real-time risk factor changes. Results suggest that unique themes were present in the notes of patients who had different life durations after diagnosis. Examined more closely, these selected variables parallel Maslow's hierarchy of needs (Maslow, Reference Maslow1943), transitioning over the course of the year after diagnosis from variables associated with basic needs such as food, to those associated with safety and conduct, emotional arousal and social affiliation, and finally to personal fulfillment and achievement. For Maslow, wellbeing rests upon a foundation of satisfied needs; as one level of need is met, other more abstract potentialities become apparent (Heylighen, Reference Heylighen1992). Supporting this approach, our findings suggest that those who fail to reach certain thresholds of achievement experience increased suicide risk.
Three-month cohort features have strong associations with previous research. Food insecurity has been identified as an important predictor of mental health symptoms and access to health care among veterans (Narain et al., Reference Narain, Bean-Mayberry, Washington, Canelo, Darling and Yano2018). Indeed, almost 25% of veterans have experienced food insecurity, 40% more than the general population (Widome, Jensen, Bangerter, & Fu, Reference Widome, Jensen, Bangerter and Fu2015). Food and routine may also be considered proxies for homelessness (Wang et al., Reference Wang, McGinnis, Goulet, Bryant, Gibert, Leaf and Fiellin2015), another substantial predictor for suicide (Schinka, Schinka, Casey, Kasprow, & Bossarte, Reference Schinka, Schinka, Casey, Kasprow and Bossarte2012). Food insecure veterans frequently experience heightened levels of mental and physical health disorders, including pronounced rates of substance abuse (Wang et al., Reference Wang, McGinnis, Goulet, Bryant, Gibert, Leaf and Fiellin2015; Widome et al., Reference Widome, Jensen, Bangerter and Fu2015), and limited access to mental health services (McGuire & Rosenheck, Reference McGuire and Rosenheck2005). Both increased and decreased food intake have been widely linked with major depression and suicide risk (Brundin, Petersén, Björkqvist, & Träskman-Bendz, Reference Brundin, Petersén, Björkqvist and Träskman-Bendz2007; Bulik, Carpenter, Kupfer, & Frank, Reference Bulik, Carpenter, Kupfer and Frank1990). We may potentially understand the rapid rates of suicide among this sub-cohort as stemming from the population's acute psychiatric needs, as those with higher rates of psychiatric symptoms tend to die by suicide earlier in treatment (Qin & Nordentoft, Reference Qin and Nordentoft2005).
Six-month cohort features are associated with anger and aggression, substance and physical abuse, and masculine gender norms, all of which are known predictors of suicidality (Bohnert, Ilgen, Louzon, McCarthy, & Katz, Reference Bohnert, Ilgen, Louzon, McCarthy and Katz2017; Genuchi, Reference Genuchi2019; Wilks et al., Reference Wilks, Morland, Dillon, Mackintosh, Blakey, Wagner and Elbogen2019). Threats to safety are especially highlighted in the GID Vice dictionary, which includes words like ‘abject’, ‘abuse’, and ‘adultery’. Masculine gender roles, especially within the military community, are thought to encourage stoicism as opposed to help-seeking behaviors (Addis, Reference Addis2008; Genuchi, Reference Genuchi2019), and thus add additional suicide risk. Anger, substance abuse, and depression are also cited as correlates of PTSD (Hellmuth, Stappenbeck, Hoerster, & Jakupcak, Reference Hellmuth, Stappenbeck, Hoerster and Jakupcak2012). As everyone in this sample had received a PTSD diagnosis, this may indicate particularly elevated symptomatology. Notably, the terms described are closely related to core symptoms of PTSD and depression.
Nine-month cohort findings are similarly associated with established suicide predictor variables. Words included within the GID Arousal dictionary (Stone et al., Reference Stone, Dunphy, Smith and Ogilvie1966) focus on emotional excitation, including affiliation- and hostility-related terms. Included words, such as ‘abhor’, ‘affection’, and ‘antagonism’, center on the presence and absence of belongingness and love. Recalling the Interpersonal Theory of Suicide's approach (Van Orden et al., Reference Van Orden, Witte, Cukrowicz, Braithwaite, Selby and Joiner2010), this relational spectrum could have importance in evaluating suicide risk. Whereas supportive relationships protect against suicidality (Wilks et al., Reference Wilks, Morland, Dillon, Mackintosh, Blakey, Wagner and Elbogen2019), hostility, anger, and aggression are established suicide risk factors, especially for a PTSD cohort (McKinney et al., Reference McKinney, Hirsch and Britton2017). Similarly, these terms suggest that patients who died by suicide may have struggled to achieve and maintain relationships.
Twelve-month cohort features include self-fulfillment related constructs. Whereas the GALC Boredom dictionary (Scherer, Reference Scherer2005) highlights words showing lack of interest and indifference, the Harvard Interpersonal Adjective dictionary (Stone et al., Reference Stone, Dunphy, Smith and Ogilvie1966) addresses judgments about other people and constructs connected with competitiveness and externalizing behavior. Based on these word usages, this sub-cohort tends towards depression, hopelessness, and antisocial tendencies, variables recognized as long-term suicide risk factors (Beck, Steer, Kovacs, & Garrison, Reference Beck, Steer, Kovacks and Garrison1985; Verona, Patrick, & Joiner, Reference Verona, Patrick and Joiner2001). In contrast to individuals who died by suicide earlier in the treatment year, whose EMR notes demonstrated acute physical (3-month cohort), psychological (6-month cohort), and relational needs (9-month cohort), those who lived longest had notes that conveyed depression, lack of purpose, and social detachment. While the 3-month, 6-month, and 9-month cohorts had fewer suicide deaths than the 12-month cohort, the fact that these patients died sooner may suggest that failure to address those more basic needs and associated symptoms may result in more rapid progression to lethality.
It is difficult to identify whether these clustered sub-cohorts are connected with personal changes over time, responses to interventions, or different levels of baseline functionality. Alternatively, differences may be indicative of deepening therapeutic relationships between provider and patient, these shifts being associated with having increased time to explore existential issues. Regardless of the causal mechanism, the described method offers access to theoretically important areas that have remained outside of the purview of computational analysis (Rogers, Reference Rogers2001). An alternate hypothesis is that the initial focus of psychotherapy sessions may begin with basic needs, evolve to focus on symptoms, and finally begin to address more core interpersonal issues. Given the nature of the data, we are hesitant to derive conclusions regarding whether these trends reflect patients not achieving resolution of these needs or reflect increased focus by the psychotherapist.
Findings reinforce the importance of formally assessing patients' immediate needs as well as addressing more abstract patient characteristics, such as goal directedness, hope, and social relationship quality. So as to better identify and mitigate suicide risk, it is recommended that evaluations of these issues be included both as part of initial assessment and at regular intervals during the course of treatment.
Leveraging NLP variables extracted from EMR data substantially improved REACH VET's predictive model only for patients who received the most care. It is worth noting, however, that while AUC statistics for all sub-cohorts remained close to 0.50, ranging from 0.46 to 0.58, there was considerable confidence interval variance. The confidence interval range narrowed as the sample size increased, suggesting that the sample was not uniform and that, with a larger sample size, estimates would likely differ. Only the 12-month cohort, the sub-cohort containing the largest sample size and quantity of free-text data, showed significant improvements above REACH VET. Findings suggest that with an adequate sample size and quantity of free-text data, NLP-derived variables added to REACH VET's ability to predict death by suicide.
Recent research has led to significant advances in understanding the centrality of therapeutic alliance in predicting psychotherapy outcome in general (Lambert & Barley, Reference Lambert and Barley2001; Norcross & Lambert, Reference Norcross and Lambert2018), and PTSD (Keller, Zoellner, & Feeny, Reference Keller, Zoellner and Feeny2010) and suicide treatment outcomes (Dunster-Page, Haddock, Wainwright, & Berry, Reference Dunster-Page, Haddock, Wainwright and Berry2017) in particular. While progress has been made in the development of psychometrically robust therapeutic alliance measures (Mallinckrodt & Tekie, Reference Mallinckrodt and Tekie2016), alliance can be difficult to adequately monitor (Elvins & Green, Reference Elvins and Green2008). In line with current trends (Colli & Lingiardi, Reference Colli and Lingiardi2009; Martinez et al., Reference Martinez, Flemotomos, Ardulov, Somandepalli, Goldberg, Imel and Narayanan2019), our research suggests the value of assessing therapeutic alliance through indirect mechanisms. Leveraging NLP-derived variables may offer additional ability to indirectly monitor therapeutic alliance over the course of treatment. That being noted, at least within our sample, the presence or absence of linguistic themes related to therapeutic alliance could be associated with psychotherapist or theory-related factors as opposed to patient factors.
Several considerations should be acknowledged regarding the unique nature of this study and its design. Firstly, as we utilized a PTSD sample, as opposed to the broader VA population on which REACH VET was developed, results are not necessarily comparable. Secondly, as REACH VET variables' coefficients were not publicly available, we ran a regression model using the selected cases and the potential control pool to implement propensity score matching. While this approach sufficiently accounted for all REACH VET variables, it is likely that our method yielded different results than the weighted REACH VET model and may have been overly conservative or liberal in restricting the sample. Thirdly, and perhaps most importantly, our sample size was insufficient to optimally develop testing and training sets, which in turn may have impacted the accuracy of the feature selection algorithm. As the sample size was dependent on suicide deaths, we could not correct for this limitation. Even with these constraints, as the REACH VET variables were precisely developed and evaluated (Kessler et al., Reference Kessler, Hwang, Hoffmire, McCarthy, Petukhova, Rosellini and Thompson2017; McCarthy et al., Reference McCarthy, Bossarte, Katz, Thompson, Kemp, Hannemann and Schoenbaum2015), any improvements beyond its predictive model should be taken seriously. Given the importance of effective and timely suicide intervention, even small improvements in prediction may make lifesaving differences.
Several additional concerns are worth noting and should be addressed in future studies. Our sample may not be comparable with the populations used to develop SÉANCE and its associated toolkits' variables. To avoid this confound, deep-learning NLP approaches (Geraci et al., Reference Geraci, Wilansky, de Luca, Roy, Kennedy and Strauss2017) could be used to develop population-specific linguistic references. Although our note extraction method took steps to eliminate templates and information copied and pasted from other sources, it is difficult to fully exclude this content. Not knowing the extent of duplication limits the ability to extrapolate personalized information. As duplication is relatively common (Cohen, Elhadad, & Elhadad, Reference Cohen, Elhadad and Elhadad2013), preventative steps could be taken to pre-evaluate whether EMR notes include copied and pasted materials.
We acknowledge that the discussed method necessitates patients' having EMR psychotherapy notes. As such, the method does not offer increased predictive accuracy for patients without said notes. It is unclear if an adapted method could be usefully applied to non-psychotherapy mental health notes or to notes from general medical visits. Lastly, we plan to re-run the study and evaluate the comparative impact if and when REACH VET coefficients become publicly available.
Conclusion
The study identified a novel method for measuring suicide risk over time and potentially categorizing patient subgroups with distinct risk sensitivities. Results suggest that modest improvements above and beyond REACH VET's predictive capability were achieved during select times over the treatment year. In particular, for the 12-month cohort, the group with the largest sample size and the greatest number of psychotherapy sessions, NLP-derived variables provided an 8% predictive gain above state-of-the-art standards. Future research is necessary to parse whether the strengths of the 12-month cohort are associated with sample size, length of treatment, or number of notes. Despite its shortcomings, this study broadens domains of inquiry, contributing new methods to assess patients' feelings and experiences. These domains present opportunities to inform theory and practice, and potentially save lives.
Financial support
This work was funded by the VA National Center for Patient Safety Center of Inquiry Program (PSCI-WRJ-Shiner). Dr Levis' time was supported by the VA Office of Academic Affiliations Advanced Fellowship in Health Systems Engineering. Dr Shiner's time was supported by the VA Health Services Research and Development Career Development Award Program (CDA11-263).
Conflict of interest
The authors have no conflicts of interest.