Clostridium difficile infection (CDI) is a serious illness whose presentation can range from loose stools to profuse watery diarrhea, leading to dehydration, life-threatening complications, and sometimes death. This illness is associated with substantial morbidity, mortality, excess health services utilization, and increased cost.Reference Lessa, Winston and McDonald 1 – Reference Olsen, Young-Xu and Stwalley 3 The Centers for Disease Control and Prevention estimated that there were 453,000 cases of incident CDI (iCDI) in 2011, with 29,000 associated deaths and 83,000 first recurrences (rCDI).Reference Lessa, Winston and McDonald 1 Recurrences are common due to persistent or newly acquired bacterial spores.Reference Freedberg, Salmasian, Cohen, Abrams and Larson 4 After initial treatment and resolution of diarrhea, up to 35% of CDI patients experience rCDI.Reference Lessa, Winston and McDonald 1 , Reference McFarland 5 , Reference Bouza 6 Of those with a primary recurrence, 40% will have another CDI episode, and after 2 recurrences, the likelihood of an additional episode increases to as high as 65%.Reference McFarland, Elmer and Surawicz 7 However, due to recent advances, this estimate may be overstated.Reference Zilberberg, Reske, Olsen, Yan and Dubberke 8 , Reference Sheitoyan-Pesant, Abou Chakra, Pepin, Marcil-Heguy, Nault and Valiquette 9
Prevention of rCDI remains a critical unmet medical need, and it is desirable to predict which patients are at highest risk of recurrence. A number of research teams have developed predictive models for rCDI.Reference Hu, Katchar and Kyne 10 – Reference D’Agostino, Collins, Pencina, Kean and Gorbach 13 These models have been limited by small sample sizes, restriction to data from a single center, imprecise proxies for disease severity, and limited use of electronic medical record (EMR) data.
A need exists for risk prediction models to address these gaps. As more healthcare systems in the United States transition to fully automated EMRs, it is important to take advantage of the increasingly granular clinical data that are becoming available. Although health systems are beginning to experiment with predictive models embedded in EMRs,Reference Kollef, Chen and Heard 14 – Reference Escobar, Turk and Ragins 16 access to such capability remains limited. The overall incidence of CDI is affected by local factors such as antimicrobial stewardship efforts, patient case mix, varying antibiotic utilization patterns, C. difficile strain epidemiology, and prevention practices. Thus, models may not be completely generalizable and may need periodic updating. Although considerable interest in predicting rCDI exists, descriptions of the performance characteristics of existing models have been limited, and few have been sufficiently validated outside the populations in which they were developed. Now that treatments are available to prevent the recurrence of CDI (eg, fidaxomicin,Reference Watt, Dinh, Le Monnier and Tilleul 17 , Reference Nelson, Suda and Evans 18 bezlotoxumabReference Wilcox, Gerding and Poxton 19 ), it is advantageous to patients and healthcare providers to identify those at greatest risk for recurrence who may benefit from the most appropriate treatments.
To address these gaps, we developed and validated rCDI predictive models in a large and representative sample of adults. The study population was cared for by a single medical group within an integrated delivery system, Kaiser Permanente Northern California (KPNC), for which comprehensive EMR data were available. Our modeling process included comparing different models and externally validating a previously published model.
MATERIALS AND METHODS
This project was approved by the KPNC Institutional Review Board for the Protection of Human Subjects, which has jurisdiction over all the hospitals and clinics described in this report.
Our setting consisted of 21 KPNC hospitals described previously.Reference Escobar, Greene, Gardner, Marelich, Quick and Kipnis 20 – Reference Escobar, Gardner, Greene, Draper and Kipnis 22 Under a mutual exclusivity arrangement, salaried physicians of The Permanente Medical Group care for 4.2 million Kaiser Foundation Health Plan members at facilities owned by Kaiser Foundation Hospitals. All KPNC facilities (21 hospitals and an additional 60 clinics) employ the same information systems with a common medical record number.Reference Selby 23 Comprehensive KPNC information systems permit tracking of patient information across the continuum of care, including some aspects of care outside KPNC.22,23 Deployment of the Epic EMR system (www.epicsystems.com), known internally as KP HealthConnect (KPHC), began in 2006 and was completed in 2010.
The eligible population (denominator) included adults ≥18 years of age with at least 1 positive test (the index test) for C. difficile toxins or DNA associated with a hospitalization between 2007 and 2014. The date-time stamp of the physician order for the index test was time zero (T0) for all study measurements. Details on KPNC assays and testing procedures are provided in the Appendix.
Measures
Primary study outcome
The dependent variable was rCDI, which could occur either in the inpatient or outpatient setting. To ensure that we distinguished between incident and recurrent episodes, T0 had to be preceded by an 84-day period with no evidence of CDI (Figure 1). A patient’s treatment period extended from the first known instance of antibiotic treatment to 48 hours after conclusion of such treatment. A positive test defining a patient as having rCDI had to occur within 84 days after the end of the treatment period. Tests that occurred within the treatment period were not included. Figure 1 also shows that predictors were included if available up to 4 days after T0, a clinically reasonable period for acquisition of information following a CDI testing order.

FIGURE 1 Time periods employed to define patient inclusion in cohort and patient data in predictive models. The T0 is defined by the date/time stamp of the physician order for the index test. In order for the patient to be included in the cohort, the T0 had to be preceded by 84 days with no positive test for Clostridium difficile (“clean” period). To be considered an outcome, an infection had to occur during the Recurrence period. This meant that a positive test result occurred within 84 days following the end of a variable treatment period (time between the T0 and completion of antibiotic treatment, ABX End). Patient data included in the predictive models had to be available within 4 days from the T0 (Predictor period). See text for additional details.
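To make these timing rules concrete, the following minimal sketch (Python; field names such as abx_end are hypothetical illustrations and are not drawn from the study's code) applies the 84-day clean period, the treatment period ending 48 hours after antibiotics conclude, and the 84-day recurrence window to a patient's positive test dates.

from datetime import timedelta

CLEAN_DAYS = 84              # days before T0 with no positive test (defines an incident case)
RECURRENCE_DAYS = 84         # outcome window after the end of the treatment period
TREATMENT_BUFFER_HOURS = 48  # treatment period ends 48 h after antibiotics conclude

def is_incident_case(t0, prior_positive_tests):
    # T0 qualifies as incident CDI only if no positive test occurred in the prior 84 days.
    window_start = t0 - timedelta(days=CLEAN_DAYS)
    return not any(window_start <= t < t0 for t in prior_positive_tests)

def is_recurrence(abx_end, later_positive_tests):
    # A positive test counts as rCDI only if it falls after the treatment period
    # (which extends to 48 h after the last antibiotic dose) and within the
    # following 84 days; tests during the treatment period itself are ignored.
    treatment_end = abx_end + timedelta(hours=TREATMENT_BUFFER_HOURS)
    recurrence_end = treatment_end + timedelta(days=RECURRENCE_DAYS)
    return any(treatment_end < t <= recurrence_end for t in later_positive_tests)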
Mortality
We ascertained mortality using KPNC patient demographic databases and publicly available files of deceased patients provided by the Social Security Administration, as described previously.Reference Escobar, Gardner, Greene, Draper and Kipnis 22
Model development
We assessed more than 150 potential predictors, including age, sex, and different configurations of historical variables (eg, antibiotic exposure, recent hospitalizations, and surgery). The final set of 23 predictors incorporated in the 3 models was based on clinical grounds, statistical performance, data abstraction burden in settings without EMRs, and (for the fully automated models) current KPNC data availability.Reference Escobar and Dellinger 15 , Reference Escobar, Turk and Ragins 16
Predictors fell into the following categories: demographic (age, sex), location of iCDI onset (either the inpatient setting or a skilled nursing facility), medication exposure (antibiotics, proton pump inhibitors), comorbidities (both as individual predictors as well as composite indices such as the Charlson comorbidity indexReference Deyo, Cherkin and Ciol 24 and the 12-month longitudinal COmorbidity Point Score, version 2, or COPS2Reference Escobar, Gardner, Greene, Draper and Kipnis 22 ), medical history (eg, recent surgery involving the gastrointestinal tract), and physiologic markers (ie, laboratory tests, vital signs, and a severity of illness score, the Laboratory-based Acute Physiology Score, version 2, or LAPS2).Reference Escobar, Gardner, Greene, Draper and Kipnis 22 The LAPS2 employs 16 laboratory tests, vital signs, pulse oximetry, and neurological status checks. We categorized 24 antibiotics as high risk (eg, ciprofloxacin, clindamycin, and amoxicillin).Reference Zilberberg, Reske, Olsen, Yan and Dubberke 25 – Reference Dubberke, Yan and Reske 27 A full list of the predictors examined is provided in the Appendix.
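As a compact summary of these categories (the variable names below are illustrative only; the full predictor list appears in the Appendix), the candidate predictors can be grouped as follows:

PREDICTOR_CATEGORIES = {
    "demographic": ["age", "sex"],
    "locus_of_icdi_onset": ["inpatient_onset", "skilled_nursing_facility_onset"],
    "medication_exposure": ["high_risk_antibiotics", "proton_pump_inhibitors"],
    "comorbidity": ["charlson_index", "cops2"],
    "medical_history": ["recent_gastrointestinal_surgery"],
    "physiologic": ["laboratory_tests", "vital_signs", "laps2"],
}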
Based on statistical performance, the 3 best-performing models are described here: basic, enhanced, and automated. The basic model is a parsimonious model with components that could be easily populated in most medical settings. The enhanced model is a variant of the basic model to which a limited set of variables that could be extracted from an EMR were added. These variables, which are part of the LAPS2 severity of illness score,Reference Escobar, Gardner, Greene, Draper and Kipnis 22 were selected based on their statistical contribution, using methods described below. Finally, the automated model is based on variables that could be generated in real time given existing systems in place in KPNC.Reference Escobar, Turk and Ragins 16
We elected to compare these final 3 models against a previously published model by Zilberberg et al (the Zilberberg model)Reference Zilberberg, Reske, Olsen, Yan and Dubberke 25 because it was based on a large cohort and the authors provided substantive detail on its statistical performance. For the Zilberberg model, we structured predictors to match the specifications of Zilberberg et al exactly. However, we did not employ their original coefficients, instead re-estimating them in our population. The 4 models, arranged in order of increasing complexity, are summarized in Table 1.
TABLE 1 Predictors Used Within Each Model

NOTE. LAPS2, laboratory-based acute physiology score, version 2; COPS2, comorbidity point score, version 2; T0, Time zero (T0) is the date-time stamp of the physician order for the index Clostridium difficile infection test; iCDI, incident Clostridium difficile infection (see text for how iCDI is defined); ICU, intensive care unit.
a See text for more detail on model selection.
b We replicated the model developed by Zilberberg et al.Reference Zilberberg, Reske, Olsen, Yan and Dubberke 25
c A patient’s immunosuppression status was defined using algorithmic rules based on International Classification of Diseases, Ninth Revision (ICD-9) diagnosis codes and on immunocompromising medications and treatments used in the 6 mo prior to iCDI.
d Locus of iCDI onset is categorized as (1) community-onset, healthcare-facility–associated (iCDI diagnosed by a positive toxin test within 72 h of admission or iCDI diagnosed in any outpatient setting and a hospitalization in the prior 90 d); (2) community-onset, community-associated (reference group in model: iCDI diagnosed by a positive toxin test within 72 h of admission or in any outpatient setting and no hospitalization in the previous 90 d); or (3) hospital-onset, healthcare-facility–associated (CDI diagnosed >72 h after hospital admission). These definitions were also used by Zilberberg et al.Reference Zilberberg, Reske, Olsen, Yan and Dubberke 25
e We employed the same definitions as Zilberberg et al.Reference Zilberberg, Reske, Olsen, Yan and Dubberke 25
f LAPS2 is a composite severity of illness score and employs 16 laboratory tests, vital signs, pulse oximetry, and neurological status checks.Reference Escobar, Gardner, Greene, Draper and Kipnis 22
g COPS2 is a 12-month longitudinal comorbidity burden score that includes history elements (eg, recent surgery involving the gastrointestinal tract).Reference Escobar, Gardner, Greene, Draper and Kipnis 22
Statistical Methods
We divided cohort data into derivation (patients with iCDI between 2007 and 2013) and validation (iCDI in 2014) datasets. All analyses during model development were performed using the derivation dataset, with final coefficients applied once to the validation dataset. As a further precaution against overfitting, we divided derivation data into Derivation 1 (iCDI dates 2007–2012) and Derivation 2 (2013) datasets.Reference Hastie, Tibshirani and Friedman 28 Within the Derivation 1 dataset, we identified a set of candidate predictors by first performing univariate and bivariate analyses and then applying a random forest algorithm.Reference Hastie, Tibshirani and Friedman 28 , Reference Allison 29 We evaluated the performance and robustness of all models on the Derivation 2 data set using 5-fold cross-validation.Reference Hastie, Tibshirani, Friedman and Franklin 30 We excluded multiple models because, although they performed well in the derivation dataset, performance deteriorated dramatically following cross-validation. This was particularly true with respect to models that incorporated multiple interaction terms.
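A minimal sketch of this workflow, assuming a pandas DataFrame with an iCDI-year column and a binary recurrence outcome (all column names here are hypothetical, and the study's actual variable screening was more extensive), could look like the following:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def split_by_year(df):
    # Temporal splits: Derivation 1 (2007-2012), Derivation 2 (2013), validation (2014).
    derivation1 = df[df["icdi_year"] <= 2012]
    derivation2 = df[df["icdi_year"] == 2013]
    validation = df[df["icdi_year"] == 2014]
    return derivation1, derivation2, validation

def screen_predictors(derivation1, candidate_columns, outcome="rcdi", top_k=23):
    # Rank candidate predictors with a random forest after univariate/bivariate review.
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(derivation1[candidate_columns], derivation1[outcome])
    ranked = sorted(zip(candidate_columns, forest.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

def check_robustness(derivation2, predictors, outcome="rcdi"):
    # 5-fold cross-validated discrimination (c statistic) on the Derivation 2 dataset.
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, derivation2[predictors], derivation2[outcome],
                           cv=5, scoring="roc_auc")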
For the basic, enhanced, and automated models, we fit simple logistic regressions, excluding patients who died prior to rCDI. However, because patients with CDI have a substantial mortality risk and might die prior to developing rCDI, we evaluated several models (based on the enhanced model predictors) to address the possible impact of mortality on rCDI prediction. These included competing risk discrete survival modelsReference Allison 29 and Cox competing risk survival regression.Reference Hosmer and Lemeshow 31 We conducted sensitivity analyses in which we first assigned a probability of rCDI to all patients in a randomly selected portion of the derivation dataset. We then tested various models using the remaining records in which the dependent variable was not dichotomous but continuous (ie, patients who died were assigned a probability of rCDI, and we then modeled rCDI as a continuous outcome), and we incorporated the conditional probability of mortality into the analyses. Additional details are provided in the Appendix.
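One such sensitivity analysis can be sketched roughly as follows (Python with statsmodels; the column names and two-stage structure are illustrative assumptions, not the study's actual code): a recurrence model fit on one randomly selected portion of the derivation data supplies an imputed recurrence probability for patients who died, and the remaining records are then modeled with a continuous (fractional) outcome.

import numpy as np
import statsmodels.api as sm

def mortality_adjusted_fit(df, predictors, died_col="died", rcdi_col="rcdi"):
    # Stage 1: fit a recurrence model on a randomly selected portion of the
    # derivation data, restricted to patients observed through follow-up.
    rng = np.random.default_rng(0)
    in_stage1 = rng.random(len(df)) < 0.5
    stage1_data = df[in_stage1]
    remaining = df[~in_stage1]
    survivors = stage1_data[~stage1_data[died_col].astype(bool)]
    stage1 = sm.Logit(survivors[rcdi_col],
                      sm.add_constant(survivors[predictors])).fit(disp=0)

    # Stage 2: patients who died receive an imputed (continuous) probability of rCDI.
    outcome = remaining[rcdi_col].astype(float).copy()
    died = remaining[died_col].astype(bool)
    outcome[died] = stage1.predict(sm.add_constant(remaining.loc[died, predictors]))

    # The continuous outcome is then modeled with a binomial GLM (quasi-likelihood).
    return sm.GLM(outcome, sm.add_constant(remaining[predictors]),
                  family=sm.families.Binomial()).fit()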
We compared the discrimination of each model using the c statistic (area under the receiver operator characteristic curve),Reference Cook, Duke, Hart, Pilcher and Mullany 32 assessed calibration through calibration plots,Reference Crowson, Atkinson and Therneau 33 and evaluated the incremental contribution of additional predictors using integrated discrimination improvement (IDI) and net reclassification improvement (NRI), as recommended by CookReference Cook 34 and Pencina et al.Reference Pencina, D’Agostino, D’Agostino and Vasan 35 As recommended by Cook,Reference Cook 34 we also included the Nagelkerke pseudo-R2 in our assessments of model performance. In standard linear regression models, the ratio of the mean-squared error to the variance of the dependent variable can be subtracted from 1 to define an R2 that is always between 0 and 1. In a validation sample, however, the mean-squared error may exceed the variance of the dependent variable, and the resulting R2 may be negative. A negative R2 indicates a very poor fit with the validation sample.Reference Estrella and Mishkin 36
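For concreteness, the two summary measures reported most often below (the c statistic and the Nagelkerke pseudo-R2) can be computed from out-of-sample predicted probabilities roughly as in this sketch; it is illustrative rather than the analysis code, and it shows why the pseudo-R2 can fall below zero in a validation sample when the model fits worse than the null (mean-only) model.

import numpy as np
from sklearn.metrics import roc_auc_score

def binomial_log_likelihood(y, p):
    # Log likelihood of binary outcomes y under predicted probabilities p.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nagelkerke_r2(y, p):
    # Cox-Snell pseudo-R2 rescaled so that its maximum possible value is 1 (Nagelkerke).
    # If the model's likelihood is worse than the null model's, the result is negative.
    y = np.asarray(y, dtype=float)
    n = len(y)
    ll_model = binomial_log_likelihood(y, np.asarray(p, dtype=float))
    ll_null = binomial_log_likelihood(y, np.full(n, y.mean()))
    cox_snell = 1 - np.exp(2 * (ll_null - ll_model) / n)
    return cox_snell / (1 - np.exp(2 * ll_null / n))

def c_statistic(y, p):
    # Area under the receiver operator characteristic curve.
    return roc_auc_score(y, p)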
We also conducted sensitivity analyses in which we employed a 30-day (as opposed to an 84-day) period for outcome ascertainment.
RESULTS
We scanned KPNC databases from 2007 to 2014 and identified a total of 41,499 positive tests for Clostridium difficile. Of these, a total of 11,251 patients experienced iCDI. In the derivation dataset, a total of 9,386 patients with iCDI experienced 1,311 first recurrences (14.0%); 2,197 (23.4%) patients died prior to the end of the follow-up period; and 260 (2.8%) died following a recurrence. The corresponding numbers in the validation dataset were 1,865 iCDIs, 144 (7.7%) rCDIs, 376 (20.2%) deaths prior to the end of the follow-up period, and 27 (1.4%) deaths following rCDI. The Appendix provides a flow chart describing the cohort assembly. Table 2 summarizes the characteristics of our cohort, excluding patients who died prior to the end of the follow-up period; these characteristics are fairly similar to those of the cohort described by Zilberberg et al.Reference Zilberberg, Reske, Olsen, Yan and Dubberke 25 However, in general, the KPNC cohort was older but healthier (eg, the proportion with Charlson scores <3 was 80%, while that in the Zilberberg et al cohort was ~55%). Furthermore, the KPNC cohort generally had lower risk (eg, only 24% were receiving high-risk antibiotics, compared to 40% in the Zilberberg cohort). Expanded versions of this table are provided in the Appendix.
TABLE 2 Incident Clostridium difficile (iCDI) Cohort Description

NOTE. iCDI, incident Clostridium difficile infection; LAPS2, laboratory-based acute physiology score, version 2; COPS2, comorbidity point score, version 2.
a Cohort consists of patients with iCDI. Patients who died during the follow-up period were removed from analysis.
b See Deyo et alReference Deyo, Cherkin and Ciol 24 for details on how this score was assigned.
c Locus of iCDI onset is categorized as (1) community onset, healthcare-facility associated (iCDI diagnosed by a positive toxin test within 72 h of admission or iCDI diagnosed in any outpatient setting and a hospitalization in the prior 90 d); (2) community onset, community associated (reference group in model: iCDI diagnosed by a positive toxin test within 72 h of admission or in any outpatient setting and no hospitalization in the previous 90 d); or (3) hospital onset, healthcare-facility associated (CDI diagnosed >72 h after hospital admission). These definitions were also used by Zilberberg et al.Reference Zilberberg, Reske, Olsen, Yan and Dubberke 25
d We employed the same antibiotic classifications as Zilberberg et al.Reference Zilberberg, Reske, Olsen, Yan and Dubberke 25
e For extended definitions of LAPS2 and COPS2, refer to the text and Escobar et al.Reference Escobar, Gardner, Greene, Draper and Kipnis 22 For both of these scores, increasing values are associated with increasing mortality risk. The univariate relationship of an admission LAPS2 with 30-d mortality is as follows: 0–59, 1.0%; 60–109, 5.0%; 110+, 13.7%. The univariate relationship of COPS2 with 30-d mortality is as follows: 0–39, 1.7%; 40–64, 5.2%; 65+, 9.0%.
We compared the performance of the discrete time survival and competing risk Cox regression models against the simple logistic regression algorithm in which patients who died prior to rCDI were excluded. The simple logistic regression basic, enhanced, and automated models showed performance comparable to that of the competing risk survival models.
Table 3 summarizes the performance characteristics of our models in the validation dataset. All models demonstrated modest discrimination, as shown by their areas under the receiver operator characteristic curve, or c statistics (range, 0.591–0.605), and poor explanatory power, with negative Nagelkerke pseudo-R2s (−0.1033 to −0.0875). At a predicted risk of ≥15%, the positive predictive value ranged from 11.0% to 12.1%; sensitivity ranged from 69.4% to 79.2%; and specificity ranged from 32.0% to 43.6% across the models. At this threshold, the number of patients needed to evaluate (NNE) to detect 1 case of rCDI ranged from 8.3 to 9.0 across the models. Figure 2 shows the calibration of the Zilberberg model and the enhanced model; neither model was well calibrated.
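As a worked illustration of the threshold-based measures in Table 3 (a hypothetical helper, not the study code): flagging patients at a predicted risk of ≥15% and computing NNE as the reciprocal of the positive predictive value reproduces the range above, since 1/0.121 ≈ 8.3 and 1/0.110 ≈ 9.0.

import numpy as np

def threshold_metrics(y, p, threshold=0.15):
    # Classification measures when patients with predicted recurrence risk at or
    # above the threshold are flagged as likely to recur.
    y = np.asarray(y, dtype=bool)
    flagged = np.asarray(p, dtype=float) >= threshold
    tp = np.sum(flagged & y)
    fp = np.sum(flagged & ~y)
    fn = np.sum(~flagged & y)
    tn = np.sum(~flagged & ~y)
    ppv = tp / (tp + fp)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "nne": 1 / ppv,  # incident cases to evaluate per recurrence detected
    }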

FIGURE 2 Model Calibration Using the Validation Dataset. For both plots, the X axis shows predicted rates of recurrent CDI in 5% increments, while the Y axis shows the actual observed rates (with associated 95% confidence intervals) in the validation dataset for all observations with that predicted level of risk. The dotted line shows what would be found were calibration perfect. For both the Zilberberg and Enhanced models, calibration is poor, failing at levels above 10% predicted risk: observed rates do not approach predicted rates, meaning that both models over-predict recurrent CDI. Additional calibration figures, including Hosmer-Lemeshow plots, are provided in the Appendix.
TABLE 3 Model Performance in the Validation DatasetFootnote a at a Predicted Risk of ≥15%

NOTE. c statistic, area under the receiver operator characteristic curve; R2, Nagelkerke’s pseudo-R2; PPV, positive predictive value; NPV, negative predictive value; NNE, number of incident cases one would need to evaluate to detect one recurrence; NRI, net reclassification improvement; IDI, integrated discrimination improvement; iCDI, incident Clostridium difficile infection.
a The validation dataset consisted of 1,865 iCDI patients, of whom 144 developed rCDI. A total of 376 iCDI patients died (and thus could not be assessed for recurrence).
b See text for description of the 4 models. “Age ≥65 years” refers to a simple decision rule based on age alone. Sensitivity, PPV, NPV, NNE, NRI, and IDI are based on the model giving a predicted recurrence risk of ≥15% within 84 days.
c We conducted sensitivity analyses using predicted risk of ≥20%, ≥25%, and ≥30%. These results are provided in the Appendix.
Sensitivity analyses of the possible impact of mortality indicate that consideration of this issue (eg, by assigning a weighted probability of rCDI to patients who died and then modeling for rCDI as a continuous outcome) did not improve prediction. Sensitivity analyses using a 30-day (instead of 84-day) follow-up period resulted in worse model performance. Additional results are provided in the Appendix.
DISCUSSION
Using a large recent cohort, we developed and validated 3 rCDI predictive models using contemporary modeling techniques and EMR data. We also validated a previously published modelReference Zilberberg, Reske, Olsen, Yan and Dubberke 25 in a different population. However, despite including highly granular EMR data (eg, vital signs, laboratory tests, composite severity of illness scores, and longitudinal comorbidity), the models and underlying data had poor ability to predict rCDI. We formally tested a common assumption made by many investigators (ie, that deaths can simply be excluded from the numerator). We found that this approach is justified, and that including patients who die prior to the conclusion of the follow-up period did not improve prediction. Lastly, we found that shortening the length of follow-up to 30 days resulted in worse model performance.
Some authors have reported better model performance. Examination of these other studies paints a less optimistic picture. Hu et alReference Hu, Katchar and Kyne 10 report the use of machine-learning approaches and a c statistic of 0.80 in their validation dataset. However, this study had a very small sample size (N=110, with N=64 in the validation dataset) and did not employ cross-validation (ie, no formal assessment of the possibility that model performance in a different population might be poor). We were able to achieve c statistics that were this high in our derivation dataset, but these apparently successful models demonstrated considerable instability during cross-validation. We did not pursue them further and chose more parsimonious models.
Contrary to previous literature reports, some predictors (eg, specific antibiotic exposures) were of limited value, particularly in models that included severity of illness. This probably reflects the fact that severity of illness is highly correlated with other predictors (eg, intensive care and antibiotics known to predispose to CDI) and may, in fact, be the underlying risk factor. We deliberately focused on predicting rCDI in iCDI cases, though previous CDI is a well-known risk factor for recurrence. It is possible that, had we included prior CDI as a predictor, we might have achieved better model performance. However, models that included the COPS2 score (a longitudinal comorbidity measure that captures information from the preceding 12 months) did not perform much better.
Multiple investigators, using a variety of statistical approaches, including machine-learning methods, have been unable to produce static models with better performance using the currently available set of predictors. While it is true that many predictors reach statistical significance in bivariate analyses (particularly when the sample size is large), the clinical significance may be muted because the relative proportions of patients with and without recurrence are not that different. Further, it is clear that the risk factors (age, antibiotic exposure, severity of illness) that place an individual at risk for iCDI are also risk factors for rCDI. Thus, future efforts ought to be placed on identifying better predictors rather than on using different statistical approaches with the currently available predictors. New predictors may include newer biomarkers (eg, indicators of underlying predisposition to recurrence), environmental factors (eg, proximity to other CDI patients, presence of C. difficile sporesReference Freedberg, Salmasian, Cohen, Abrams and Larson 4 ), behavioral aspects (eg, handwashing), and/or molecular markers (eg, information on specific C. difficile strains). It is also important to consider rCDI in an ecological context, and future predictive models may need to be explicit about including environmental and ecological predictors (eg, isolation rooms, who is roomed where, other family members’ exposure), if such data become available.
One alternative that we did not explore, because it is not currently feasible with existing EMRs, was the development of dynamic models. In contrast to the static approach we and others have employed (ie, providing a single probability estimate based on a discrete set of predictors available at some T0), such models adjust posterior probabilities based on new information. In the case of rCDI, having additional information on both antibiotic treatment as well as other exposures (eg, proton pump inhibitors) could have dramatic effects on our ability to predict recurrence.Reference McDonald, Milligan, Frenette and Lee 37 , Reference Deshpande, Pasupuleti and Thota 38 The development of such models would require EMRs with greater capabilities than those currently available.
Our study had several additional limitations. Due to resource limitations and sparse data, we limited our cases to inpatient iCDI. During this study, KPNC implemented aggressive efforts to reduce CDI. As a result, our data show that the incidences of iCDI and rCDI were decreasing in our study cohort. Despite these limitations, models to predict recurrence have value. They do permit identification of patient subsets with elevated or very low risk. In some scenarios, and in the context of discrete interventions, the use of these models might improve outcomes and decrease costs. In addition, existing models point to predictors that can be assessed in the future, such as the aforementioned ecological ones.
Compared to our ability to predict other outcomes (eg, death, unplanned transfer to intensive care),Reference Escobar, Turk and Ragins 16 , Reference Escobar, Greene, Gardner, Marelich, Quick and Kipnis 20 , Reference Escobar, Gardner, Greene, Draper and Kipnis 22 our ability to predict rCDI is limited and contrasts with much better ability to predict iCDI.Reference Kuntz, Johnson and Raebel 39 , Reference Kuntz, Smith and Petrik 40 Given the major consequences of rCDI on patient outcomes, our results support the need to expand research on the prevention and treatment of recurrence. Such research may also result in the identification of novel predictors that are currently unavailable even in the most comprehensive EMRs.
ACKNOWLEDGMENTS
This project was funded by a grant from Merck Sharp & Dohme Corporation, Whitehouse Station, New Jersey. The authors wish to thank Juan Carlos LaGuardia for help assembling the dataset, Dr Tracy Lieu for reviewing the manuscript, Vanessa Rodriguez for formatting the text for publication, Anna Cardellino for her assistance in drafting the protocol, and Mary Beth Dorr for her review and guidance in the analysis.
Financial support: Dr Vincent Liu was funded by a National Institutes of Health award (grant no. K23GM112018).
Potential conflicts of interest: The Kaiser Permanente authors Escobar, Kipnis, Liu, Greene, and Baker have no conflicts of interest to report. Dr Erik Dubberke has received grant support from Rebiotix, Merck, and Sanofi Pasteur; he also has consulting and advisory board relationships with Rebiotix, Summit, GSK, Valneva, and Sanofi Pasteur. The remaining coauthors Cossrow, Gupta, Mast, and Mehta are or were employees of Merck Sharp & Dohme Corporation, a subsidiary of Merck & Co., Kenilworth, New Jersey, and potentially own stock and/or hold stock options in the company.
SUPPLEMENTARY MATERIAL
To view supplementary material for this article, please visit https://doi.org/10.1017/ice.2017.176