Incidents such as Superstorm Sandy, the 2009-2010 H1N1 influenza pandemic, Hurricane Katrina, the severe acute respiratory syndrome (SARS) outbreak, and numerous large-scale food-borne disease outbreaks highlight the need for sustained attention to ensuring the nation's ability to prepare for, respond to, and recover from public health emergencies. Anecdotal evidence suggests that federal investments in preparedness, such as pandemic influenza planning, medical countermeasure development and delivery, and laboratory testing for bioterrorism agents, have improved the nation's readiness. However, empirical evidence for such claims remains limited.1-4
Measuring the construct of preparedness has proven difficult,5-8 as the quality and results of the nation's preparedness become fully visible only during operational responses to real large-scale emergencies, which fortunately are rare.9, 10 In most instances, performance measures must focus either on (1) the capacities (eg, plans, personnel, materiel, training) that are necessary, but often not sufficient, conditions for successful operational capability (ie, the ability to use capacities to perform activities and functions), or (2) the capabilities observed during more routine responses (eg, food-borne outbreaks) or exercises.10 However, the evidence base for the linkage between capacity measurement and actual operational capability is weak, as is the evidence on the relationship between performance in small- and large-scale responses.4, 6, 11, 12 An additional challenge lies in the fact that measuring preparedness requires data from the local, regional, state, and federal levels, and that measures must account for variations in community context.13, 14
Since 2008, the Centers for Disease Control and Prevention (CDC) Public Health Emergency Preparedness (PHEP) Cooperative Agreement—one of the principal federal vehicles for funding state and local public health preparedness—has included a standardized set of capability measures.15, 16 Data from measures on laboratory, emergency operations coordination, and emergency public information and warning capabilities were recently published in the 2012 State-by-State Preparedness Report,3 and similar data were published in versions of the report released in 2010 and 2011.
This report describes the process that the CDC and its partners used to develop these capability measures, including the approaches used to (1) determine intended uses and users of the measures, (2) decide what to measure, (3) ensure data quality, and (4) revise and improve measures based on pilot testing and stakeholder input. The full list of performance measures developed through this process is found in the online supplement. The report concludes by identifying a number of additional topics that need to be addressed to build a robust measurement system.
Identifying Intended Uses and Users
Before developing a performance measurement system, its designers must determine the primary uses and users of the measures. Three common uses of performance measures include accountability (eg, applying standards and benchmarks, demonstrating the effects of investments, making funding decisions), performance management and quality improvement (eg, identifying weaknesses and opportunities for improvement, lessons learned, and best practices), and research (eg, understanding determinants of variation in performance measure data).6, 17
Accountability-oriented measures are usually aimed at external stakeholders (eg, funders, elected officials) and seek to determine what was achieved with funds provided, and whether programs warrant continued funding.6, 17 Improvement-oriented measures are aimed at internal stakeholders (eg, program managers, staff) and seek to support planning and the identification and remediation of performance gaps.6, 17 Often these two uses stand in tension, as internal stakeholders may be less willing to accurately report and share data on performance gaps when they know it might be used to drive funding cuts or other punitive actions. Similarly, accountability-oriented measures are often reported at a level of aggregation that makes them less useful for identifying and executing improvements to specific system components. In some circumstances, however, the same measurement system can be dual-purposed, supporting both improvement efforts and ensuring accountability.6
The CDC's PHEP measurement system was designed to support both improvement and accountability. Legislation and federal reporting requirements dictated that the primary use be accountability; specifically, to allow the CDC, the Department of Health and Human Services, and other decision-makers to determine the extent to which states and other jurisdictions have used public funds to build preparedness capabilities.15, 16 Data are also used for public reporting and incorporated into larger discussions of national health security policy.
The CDC recognized the importance of measures that can support program improvement, both because such measures help improve the overall system and because state and local officials value them. Awareness of stakeholders’ needs and interests helped increase the acceptance and utility of the measurement system and resulted in more meaningful data for reporting and use at the local, state, and national levels. Participation of federal, state, and local stakeholders throughout the process ensured that key stakeholders’ views were represented and weighed in the design of the measures, and that the dual uses of accountability and improvement could be achieved.
Determining What Should Be Measured
Once the primary uses and users of the data were identified, the next step was to understand what information is most useful and relevant to measure. Ideally, decisions about what to measure would follow a strong program of empirical research that links structures and processes to health outcomes, including (where possible) randomized experiments or quasi-experiments.18 However, a strong evidence base is not always available. In the case of public health preparedness, the aggressive timelines for measure development required by the Pandemic and All-Hazards Preparedness Act necessitated the use of other forms of evidence to support measure identification and development. While the nine CDC-sponsored Preparedness and Emergency Response Research Centers were tasked with helping to build the evidence base for public health preparedness,19 most findings from those efforts were released too late to inform the measures described here.
We used three approaches to identify relevant, feasible, and useful measures of public health preparedness: review of existing measures and literature, elicitation of practitioner perspectives, and process analysis. Although these approaches are discussed separately below for clarity, all three work together to determine what is important to measure.
Review of Existing Measures and Available Literature
One of the initial steps in the process of performance measure development was to conduct a thorough review of existing literature and measures related to public health preparedness. Despite the overall shortage of scientific literature on public health preparedness, we were able to identify measurable predictors of successful responses, either in traditional peer-reviewed scientific journals or in the so-called gray literature. For instance, previous research on epidemiologic investigations identified success factors that served as candidates for measures of this capability. The review of existing metrics provided clues about what is important to measure: the existence of multiple metrics focusing on the same construct or activity may (but does not necessarily) indicate that the area is important to measure. Building on existing measures is especially relevant in an era of budget cuts, as it reduces the likelihood of duplicative efforts and capitalizes on existing opportunities for data collection.
Eliciting Practitioners’ Perspectives
The CDC convened an overarching workgroup of key stakeholders, including representatives from state and local health departments, national partner organizations (ie, the Association of Public Health Laboratories, the Association of State and Territorial Health Officials, the Council of State and Territorial Epidemiologists, and the National Association of County and City Health Officials), federal preparedness partners (ie, the Office of the Assistant Secretary for Preparedness and Response, the Federal Emergency Management Agency, and CDC's Career Epidemiology Field Officer Program), and experts from academic and nonprofit institutions.
The workgroup provided guidance and recommendations on multiple aspects of measure development and implementation, including identifying priority areas for measurement, suggesting strategies and approaches to developing and implementing measures, and reviewing and prioritizing draft measures before implementation. Using a modified Delphi process,20 the workgroup identified the following public health preparedness capabilities as most critical for initial measure development: biosurveillance, incident management (later renamed emergency operations coordination), crisis and emergency risk communication (CERC; later renamed emergency public information and warning), delivery of medical countermeasures, and community-based disease mitigation strategies (eg, social distancing, school closures). Once the measures were developed, the workgroup helped ensure that the measures and resulting data would meet various needs, thereby ensuring that the measurement system could meet its dual mission of accountability and program improvement.
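To make the prioritization step concrete, the following is a minimal sketch of how ratings from a modified Delphi round could be tallied. The capability names echo those listed above, but the panel size, the 1-to-9 rating scale, the scores, and the consensus rule are illustrative assumptions, not the workgroup's actual procedure.

```python
# Hypothetical tally of one modified Delphi rating round: each panelist rates
# each candidate capability on a 1 (low) to 9 (high) priority scale, and a
# capability is kept when its median rating is high and the ratings are tight
# enough to suggest consensus. All numbers are illustrative.
from statistics import median

ratings = {
    "biosurveillance": [9, 8, 9, 7, 8],
    "incident management": [8, 9, 8, 8, 7],
    "crisis and emergency risk communication": [9, 9, 8, 7, 9],
    "medical countermeasure delivery": [8, 7, 9, 8, 8],
    "community mitigation strategies": [7, 8, 7, 7, 8],
    "volunteer management": [5, 4, 6, 5, 4],
}

def prioritize(ratings, cutoff=7.0, max_spread=3):
    """Return capabilities whose median rating meets the cutoff and whose
    rating spread suggests panel consensus, sorted by median rating."""
    selected = [
        (capability, median(scores))
        for capability, scores in ratings.items()
        if median(scores) >= cutoff and max(scores) - min(scores) <= max_spread
    ]
    return sorted(selected, key=lambda item: item[1], reverse=True)

for capability, score in prioritize(ratings):
    print(f"{capability}: median priority {score}")
```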
While the type of input provided by the workgroup was critical to meeting the needs of the national measurement system, the CDC recognized that it was not the appropriate vehicle for addressing the details of measure identification and development. For each capability selected for measurement, the CDC convened a group of state and local practitioners with expertise in the specific capability to provide guidance on the most important aspects to measure and the appropriate metrics.
In summary, stakeholder involvement was crucial in complementing and supplementing more traditional forms of evidence. It promoted buy-in among key stakeholders and ensured that the measures were meaningful, feasible to collect, and applicable across a range of communities.
Process Analysis
A key challenge to developing preparedness measures without a strong empirical evidence base was to provide a systematic and reasoned justification for selecting a small set of high-priority measures from the array of specific preparedness-related activities that could potentially be measured. In some sectors, knowledge gained through experience with actual performance might be analyzed to identify system components that are (1) high-leverage, ie, empirically linked to favorable outcomes; and (2) failure-prone, ie, points in the system or process that are at particular risk of failure. Given the absence of such an experience base in public health emergency preparedness, we used techniques from the fields of industrial engineering and logistics, namely process mapping, to provide a reasonable and efficient way to identify high-leverage and failure-prone components.
Process maps illustrate the sequence of steps required to generate a product or outcome.21 In this context, they were used to show the series of activities required to successfully implement a given capability. We examined the maps to identify activities that were (1) critical to the success of the system or process, (2) most vulnerable to failure, (3) relevant across a wide range of threat scenarios and community contexts, (4) readily observable, and (5) under the control or influence of the entity being measured. Highest priority for measurement was given to activities that met all of these criteria. We engaged the stakeholder workgroups in this work to ensure that the activities identified in the process met each of these criteria.
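The screening logic described above can be expressed compactly. The sketch below checks candidate activities from a process map against the five criteria and keeps only those meeting all of them; the activity names and criterion flags are illustrative assumptions, not drawn from the actual CDC process maps.

```python
# Hypothetical screening of process-map activities against the five selection
# criteria; only activities meeting all criteria become measurement candidates.
from dataclasses import dataclass, astuple

@dataclass
class Activity:
    name: str
    critical: bool          # (1) critical to the success of the process
    failure_prone: bool     # (2) vulnerable to failure
    broadly_relevant: bool  # (3) relevant across threat scenarios and contexts
    observable: bool        # (4) readily observable
    controllable: bool      # (5) under the measured entity's control or influence

    def meets_all_criteria(self) -> bool:
        return all(astuple(self)[1:])  # every flag after the name must be True

candidates = [
    Activity("approve first risk communication message", True, True, True, True, True),
    Activity("activate joint information center", True, True, False, True, False),
    Activity("monitor media coverage", False, False, True, True, True),
]

priorities = [a.name for a in candidates if a.meets_all_criteria()]
print(priorities)  # -> ['approve first risk communication message']
```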
Variations among state and local public health systems provided a challenge, as the specific activities and sequences used in one jurisdiction may not be the same as those used in other jurisdictions. We addressed this challenge by dividing the stakeholder workgroup into three smaller groups, each tasked with developing a process map that represented response activities in their jurisdictions. Because each small group included stakeholders from several jurisdictions, these first-draft maps helped to identify common-mode elements that applied across jurisdictions.
Next, each group presented its map to the full group. After considering each small group's draft map, the full group worked collectively to build a single process map that captured common-mode process elements across all represented jurisdictions. The fact that the groups included members from a wide variety of jurisdictions helped ensure that the resulting process maps would include components applicable to most—if not all—jurisdictions across the United States.
Figure 1 provides a sample process map that captures the common-mode elements of emergency public information and warning. Based on the process-mapping exercise and the criteria listed, the CDC and its workgroup of public information experts identified the “time to issue a risk communication message to the public” as the most feasible and relevant component of the capability to measure. The process map helped inform the decision about start and stop times for the measure.
Figure 1 Process Map of Crisis and Emergency Risk Communication.
If a potential measure did not meet all of the criteria listed above, its use was reconsidered. For example, the CDC and the workgroup of public information experts considered developing a measure related to activating a joint information center. However, while activating a joint information center is a key step in providing information to the public, health departments do not always have the authority to do so, and many smaller health departments are not responsible for operating one. Therefore, this component failed to meet two of the selection criteria (ie, largely in public health's control and relevant across a broad range of jurisdictions and scenarios), and the measure was deemed inappropriate for the standardized national measurement system.
Ensuring Data Quality
Ensuring the availability of high-quality data to support the selected measures required two key considerations. The first was to ensure that feasible and regularly available data sources exist.22 Selecting observable measures (as recommended in the previous section) increased the likelihood that feasible data sources exist for implementing the measure. The second consideration was to employ mechanisms for reducing irrelevant variation in the data, that is, differences in the data that are not relevant to performance.22
Health departments can differ in risk profiles and governance structures, as well as in the contextual details of exercises and incidents, limiting the accuracy of cross-jurisdictional and temporal comparisons.13, 14 For instance, variations in the amount of time required for different jurisdictions to issue risk communication messages to the public might reflect variations in preparedness, differences in the complexity of the approval process required in different jurisdictions, the perceived urgency of the exercise scenario or incident, and many other factors.
Many sources of irrelevant variance can be anticipated and reduced before analysis through measurement specifications and reporting criteria. For example, the measurement specifications and reporting criteria for the crisis and emergency risk communication measure allow data to be reported only for the first risk communication message developed during an incident. This specification reduces the variability associated with slower-developing incidents.
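As a concrete illustration, the sketch below applies this kind of reporting restriction to a set of hypothetical message records, keeping only the earliest approved message for each incident. The field names and records are invented for the example.

```python
# Keep only the first risk communication message developed during each
# incident, as the reporting criteria require; later messages are dropped
# because they add incident-specific variability. Records are hypothetical.
messages = [
    {"incident": "flood-2011", "approved": "2011-05-22T15:35", "text": "initial advisory"},
    {"incident": "flood-2011", "approved": "2011-05-23T09:10", "text": "boil-water update"},
    {"incident": "tornado-2011", "approved": "2011-04-27T18:02", "text": "shelter guidance"},
]

first_messages = {}
for msg in sorted(messages, key=lambda m: m["approved"]):  # ISO timestamps sort correctly
    first_messages.setdefault(msg["incident"], msg)        # keep the earliest per incident

print([m["text"] for m in first_messages.values()])
# -> ['shelter guidance', 'initial advisory']
```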
A variety of complementary approaches can be used to reduce irrelevant variance in the data, including clarifying key definitions, restricting the range of measured tasks and data sources, and making post hoc statistical adjustments using control variables. Each of these is discussed in more detail below.
Clarifying Key Definitions
Differences in the interpretation of key terms can lead to undesirable variation, that is, a larger range of responses than would be expected. Given that many of the Cooperative Agreement measures are time-based, the CDC took particular care in defining start and stop times. For instance, the stop time for the crisis and emergency risk communication measure is defined as the “date and time that a designated official approved the first risk communication message for dissemination,” with designated official defined as “any individual in the public health agency who has the authority to take the necessary action (e.g., approve a message). A designated official may be a public information officer, an incident commander, or other individual with such authority.” This definition was crucial to the measure because it provides the flexibility needed to recognize variation in organizational infrastructure across health departments, such as who in the organization is authorized to develop and release a communication, while standardizing the actual process for using the measure and allowing for a comparable national sample.
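With start and stop times defined this way, the time-based measure becomes a simple difference between two timestamps. The sketch below assumes hypothetical field names and times; the stop time follows the definition quoted above, while the start time stands in for whatever start event the measurement specification designates.

```python
# Hypothetical computation of a time-based measure from defined start and stop
# times: elapsed minutes from the measure's start event to approval of the
# first risk communication message by a designated official.
from datetime import datetime

incident_record = {
    "start_time": datetime(2011, 5, 22, 14, 5),      # measure-defined start event (assumed)
    "approval_time": datetime(2011, 5, 22, 15, 35),  # designated official approves first message
}

def minutes_to_first_message(record: dict) -> float:
    """Elapsed minutes between the defined start and stop times."""
    delta = record["approval_time"] - record["start_time"]
    return delta.total_seconds() / 60

print(minutes_to_first_message(incident_record))  # -> 90.0
```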
Restricting Measured Tasks and Data Sources
Another approach to minimizing performance-irrelevant variance is to specify or restrict the tasks or data sources that can be used to generate data for the measure.18, 22 In general, the broader the range of potential data sources, the greater the chance that measures will reflect variations in factors that are irrelevant to the concept being measured. Therefore, the CDC PHEP measures restrict data sources to specific types of exercises or incidents.
However, restrictions on data sources should take into account the realities and constraints involved in using the measures. For its preparedness measures, the CDC decided, in conjunction with stakeholders, to give jurisdictions considerable flexibility in selecting data sources as a way to minimize burden and increase the feasibility of collecting and reporting data on the measures. For instance, the crisis and emergency risk communication measure allows jurisdictions to report data based on any drill, functional exercise, full-scale exercise, or real incident, with the only restriction being that they report on the first risk communication message issued. Furthermore, jurisdictions are required to submit data only on their “best demonstration,” that is, the exercise or incident that shows their strongest performance.15
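The "best demonstration" rule can likewise be illustrated with a small sketch: from all of the exercises and incidents in which a jurisdiction issued a first risk communication message, only the strongest performance (here, the shortest time) is reported. The records below are hypothetical.

```python
# Hypothetical selection of a jurisdiction's "best demonstration": report only
# the exercise or incident showing the strongest performance on the measure.
demonstrations = [
    {"event": "functional exercise", "minutes_to_first_message": 95},
    {"event": "full-scale exercise", "minutes_to_first_message": 70},
    {"event": "real incident", "minutes_to_first_message": 110},
]

best = min(demonstrations, key=lambda d: d["minutes_to_first_message"])
print(best["event"], best["minutes_to_first_message"])  # -> full-scale exercise 70
```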
Relying on Post Hoc Statistical Adjustment
In exchange for allowing greater flexibility in data sources, the CDC PHEP measures generally require jurisdictions to report on a wide range of contextual variables, with the expectation that such data can be used to statistically control for the error variance post hoc. For instance, the “time required to issue a risk communication message” measure requires jurisdictions to report the exercise type (eg, full-scale exercise, functional exercise, drill), incident severity (from the National Incident Management System), whether the jurisdiction led the simulated or real response, as well as a number of other factors.
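One way such contextual variables could be used is a simple covariate-adjustment regression, sketched below. The covariates mirror those named above (exercise type, incident severity, whether the jurisdiction led the response), but the data are synthetic and the model form is an assumption for illustration, not the CDC's actual analysis.

```python
# Hypothetical post hoc adjustment: regress reported message times on contextual
# covariates so that cross-jurisdiction comparisons reflect performance rather
# than context. Data and model form are illustrative only.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "minutes": [45, 60, 120, 30, 95, 150, 40, 75, 110, 55],
    "exercise_type": ["drill", "functional", "full_scale", "drill", "functional",
                      "full_scale", "drill", "functional", "full_scale", "drill"],
    "severity": [1, 2, 3, 2, 1, 3, 1, 2, 3, 2],
    "led_response": [1, 0, 0, 1, 1, 0, 0, 1, 0, 1],
})

# Fit an ordinary least squares model with exercise type treated as categorical;
# the residuals (recentered on the overall mean) give context-adjusted times.
model = smf.ols("minutes ~ C(exercise_type) + severity + led_response", data=data).fit()
data["adjusted_minutes"] = model.resid + data["minutes"].mean()
print(data[["minutes", "adjusted_minutes"]].round(1))
```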
Preliminary analysis of the measurement data suggests that statistical adjustment does not fully account for variations in how data are collected, and that better adjustment variables may need to be identified. Yet, given reduced funding for PHEP and the significant burdens that performance measures can place on jurisdictions, additional restrictions on data sources did not seem feasible at the time.
Assessing Utility and Acceptability
Prior to implementing measures, pilot testing helped ensure that measures were useful and acceptable.16, 22 It also provided an initial assessment of the face validity, utility, consistency of interpretation, and feasibility of data collection.
For each capability measured, the CDC identified up to eight health departments to perform either a desk review of the measures or a pilot test to collect and report the measures. For the most part, these sites could be regarded as atypical, both in their willingness to engage in measurement and their capacity to do so. While this limited the generalizability of findings from the pilot tests, the approach was consistent with the fiscal realities and the nascent culture of measurement at the time.
After initial pilot testing, the measures were rolled out to all 62 PHEP Cooperative Agreement grantees, and data from the measures have now been featured in four successive public reports on preparedness from the CDC.1, 3 Multiyear longitudinal data indicate discernable and sustained progress. Below are a few examples of the data reported on emergency operations coordination and emergency public information and warning measures in the 2012 State-By-State Report.3
Activating the Emergency Operations Center
Performance measure: Time for pre-identified staff covering activated public health agency incident management roles to report for immediate duty.
In 2011, the median staff assembly time was 30 min, with 47 of 50 states (94%) meeting the target of assembling staff in 60 min or less. The median time for state public health departments to assemble their staff has decreased during the past 3 years (57 min in 2009, 31 min in 2010, 30 min in 2011).3
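The kind of summary reported above can be computed directly from raw assembly times, as in the sketch below; the times are fabricated for illustration and do not reproduce the actual data.

```python
# Hypothetical computation of the staff-assembly summary: median assembly time
# and the share of jurisdictions meeting the 60-minute target. Times are invented.
from statistics import median

assembly_minutes = [12, 18, 25, 28, 30, 35, 42, 55, 58, 70]  # one value per jurisdiction

TARGET_MINUTES = 60
met_target = sum(1 for m in assembly_minutes if m <= TARGET_MINUTES)

print(f"median assembly time: {median(assembly_minutes)} min")
print(f"{met_target} of {len(assembly_minutes)} jurisdictions "
      f"({100 * met_target / len(assembly_minutes):.0f}%) met the {TARGET_MINUTES}-minute target")
```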
Ensuring Overall Response Strategy for Incident Management
Performance measure: Approved incident action plan (IAP) produced before the start of the second operational period
The number of states, localities, and insular areas that had an approved IAP written before the start of the second operational period was 54 of 62 (87%) in 2009, 55 of 62 (89%) in 2010, and 57 of 62 (92%) in 2011. More than one-half of the IAPs were developed in response to an executed or planned exercise. Natural disasters, such as tornadoes, blizzards, tsunamis, and earthquakes, accounted for the remaining IAPs.3
Assessing Response Capabilities
Performance measure: Drafted an after-action report (AAR) and improvement plan (IP) after an exercise or real incident
In 2010 and 2011, all 62 (100%) states, localities, and insular areas developed an AAR and IP. In 2009, 61 of 62 (98%) developed an AAR and IP.3
Communicating With the Public During an Emergency
Performance measure: Developed a first risk communication message for the public during an exercise or a real incident
In 2010 and 2011, 61 of 62 awardees (98%) developed a first risk communication message for the public during an exercise or a real incident. In 2009, 60 of 62 (97%) developed a first risk communication message. Awardees cited biological outbreaks and natural disasters as the most common incidents for which they created the risk communication message.3
While a formal evaluation of the measures has not yet been conducted, interactions with state and local officials indicate that grantees are familiar with the intent and expectations of the measures. Reporting compliance has not been an issue because grantees are required to report on the measures and the fields are mandatory in the online reporting system.
Discussion
The fortunate rarity of large-scale incidents, combined with variations in the observable manifestations of preparedness across communities and situations, has slowed efforts to develop a scientific basis for identifying performance measures for public health preparedness and feasible strategies for collecting measurement data. This report presents an approach that allows the field to move forward in spite of these challenges.
The approach described here marshals evidence to identify meaningful measures in a more flexible way than is typical in clinical medicine or public health. It bears some similarities to the approach recently taken in developing the CDC's standards for mass dispensing of medications.9 While it is clear that traditional, peer-reviewed scientific studies are important references for measure development, techniques such as those described herein are important complements to the published literature. For example, process mapping, especially when informed by experts, can provide a strong starting point for identifying the most important components of capabilities to measure.
In addition, the approach involves techniques for minimizing performance-irrelevant variation in measurement data, which ensures consistent and fair comparisons over time and across jurisdictions. In some instances, specifying or restricting the range of allowable data sources can enhance the comparability of data. When such restrictions prove too burdensome for jurisdictions, however, comparability can be enhanced by collecting contextual data on the jurisdiction, exercise, and/or incident and filtering out such variation through statistical analysis. In all instances, it is crucial to clearly define key terms.
Researchers and practitioners will need to address a number of additional issues to develop an effective and meaningful performance-based accountability system for public health preparedness. First, the long-term goal should be a balanced portfolio of capacity and capability measures. Second, to be meaningful, performance measures that assess how well jurisdictions are doing must be paired with standards that explain how good is good enough. Very few of the CDC PHEP performance measures specify standards or targets based on programmatic expectations; collecting contextual information for each measure is key to developing evidence-based standards. Third, although knowing about specific preparedness-related activities, such as how long it takes jurisdictions to assemble their staff, is important, these measures by themselves are unlikely to be meaningful to citizens and policymakers seeking to know what these activities tell us about how prepared the nation is.
For accountability, it is critical to consider how the broad set of performance measures fits together and provides a larger picture of preparedness and national health security overall. Given the decentralized nature of the US public health system, public health preparedness must be measured at the local, state, and national levels to provide a comprehensive picture. Thus, it will be necessary to develop approaches that allow individual measures of critical capabilities to be aggregated into readily interpretable summaries of overall preparedness and for measures of local-level preparedness to be incorporated into state- and national-level assessments. Finally, it will be important to assess how well performance on these measures relates to performance in real responses.
Conclusions
Progress in measuring public health preparedness has been, and will likely continue to be, incremental. In the absence of a strong evidence base, we propose an approach to developing measures that leverages nontraditional sources of evidence. Once developed, these initial measures can, in turn, provide the basis for studies of increasing quality and sophistication, which can then provide a scientific base for subsequent generations of measures. Efforts to develop performance measures for public health preparedness must complement efforts to support research so that changes in these measurement systems and programs are based on the best available science, rigorously evaluated, and periodically tested using established quality improvement methods.
Funding and Support
This research was funded by the Office of Public Health Preparedness and Response, Centers for Disease Control and Prevention. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Acknowledgments
The authors thank members of the PHEP Evaluation Workgroup and capability-specific workgroups for their expertise and support in developing the measures. The authors also thank Dale Rose and Michael Fanning from CDC OPHPR for their contributions to the measure development process and data analysis. Finally, the authors thank Sarah Hauer and Kristin Leuschner for their administrative and editing expertise.