The clinical effectiveness of a new medical test is determined by the extent to which incorporating the test into clinical practice ultimately improves patient health outcomes. This depends on a series of factors. For example, the clinical effectiveness of positron emission tomography (PET) in the assessment of patients with head and neck cancer for radiotherapy depends on its accuracy in delineating the tumor, on changes in the radiotherapy regimen following PET, and on the consequences of these changes for patient survival and quality of life (19).
Randomized controlled trials (RCTs) of tests that capture the entire clinical pathway between testing and health outcomes provide direct evidence of the clinical effectiveness of a test. Although ideal, such studies are rarely conducted and are sometimes not feasible (Reference Bossuyt, Lijmer and Mol4). For fast-evolving technologies such as medical tests, reviewers will rarely find direct trial evidence and must therefore often rely on evidence about test accuracy and other factors to draw conclusions about clinical effectiveness.
Within the test evaluation framework of Fryback and Thornbury (Reference Fryback and Thornbury7), these factors can be regarded as critical steps along the clinical pathway linking the use of the test to patient health outcomes (Figure 1). Diagnostic accuracy is a measure of how well a test identifies patients with and without a disorder, commonly reported as test sensitivity and specificity (Reference Bossuyt, Reitsma and Bruns5). For the purpose of this report, we define "intermediate" test outcomes as the direct consequences of test results, such as changes in therapeutic decisions, that can in turn have downstream consequences for health outcomes. Health outcomes refer to measures of the health state of patients, ideally obtained in treatment RCTs (Reference Moher, Schulz and Altman17).
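For readers less familiar with these accuracy measures, the definitions above can be made concrete with a small worked calculation. The following sketch uses an entirely hypothetical 2×2 table; the counts are invented for illustration and do not come from any study cited in this review.

```python
# Hypothetical 2x2 table comparing a test against a reference standard.
# All counts are invented for illustration.
tp = 90   # test positive, disorder present (true positives)
fn = 10   # test negative, disorder present (false negatives)
fp = 40   # test positive, disorder absent  (false positives)
tn = 160  # test negative, disorder absent  (true negatives)

# Sensitivity: proportion of patients WITH the disorder whom the test detects
sensitivity = tp / (tp + fn)

# Specificity: proportion of patients WITHOUT the disorder whom the test
# correctly classifies as negative
specificity = tn / (tn + fp)

print(f"Sensitivity: {sensitivity:.2f}")  # 0.90
print(f"Specificity: {specificity:.2f}")  # 0.80
```

Reporting both measures matters because, as discussed later in this review, false positive and false negative results have different downstream consequences for patient management.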
All these outcomes are relevant in the assessment of medical tests. Information from studies investigating test accuracy can sometimes be directly linked with health outcomes from RCTs showing that treatment for the target condition is effective to draw conclusions about the health benefits of detecting disease (Reference Lord, Irwig and Simes15). This requires that the spectrum of disease defined by the new test is representative of cases included in the treatment RCTs.
If test accuracy and health outcomes cannot be directly linked, studies reporting intermediate outcomes—those occurring between accuracy and health outcomes—may provide additional information to strengthen conclusions about the effectiveness of a new test (Figure 1). Studies of intermediate outcomes may demonstrate that the test information has an impact on clinical decision making, for example, by changing decisions about treatment or about the ordering of further tests. An observational study of seventy-one patients with head and neck cancer showed that PET changed the management plan for 32 percent of patients (70 percent when additional lesions were detected by PET, 11 percent when there were no additional lesions) (Reference Scott, Gunawardana and Bartholomeusz24). Clearly, this change in management plan does not by itself provide evidence of improved health outcomes. Hence, studies on intermediate outcomes need careful interpretation.
Current guidelines on conducting and reporting health technology assessments (HTAs) of medical tests do not provide explicit criteria about when to include intermediate outcomes, what assumptions are necessary when linking evidence of accuracy with intermediate outcomes and health outcomes, and how to assess the quality of primary studies that examine intermediate outcomes (1;6;18;20). Given this lack of guidance, we sought to understand how, and to what extent, different test outcomes are incorporated into HTAs in current practice. We document what outcomes beyond test accuracy are being used in current HTAs of medical tests when direct evidence of health outcomes is lacking. This review focuses on intermediate outcomes and on how this evidence is interpreted to draw conclusions about the clinical effectiveness of new tests.
METHODS
Identification of HTA Reports
We first searched the Web sites of HTA organizations that are INAHTA members to identify English-language test assessments published between January 2005 and February 2010 (search date, February 12, 2010). This pilot search confirmed the wide range of approaches in current test evaluation and helped refine the extraction of the data.
For the main review, we then searched the HTA database (http://www.crd.york.ac.uk/crdweb) for test evaluations with a sensitive search strategy using the terms diagnos* OR test* AND english:la (search date, March 25, 2010). We included test HTAs with a primary focus on test accuracy, intermediate outcomes, and/or patient health outcomes. Reviews of outcomes peripheral to our study, such as patient or clinician confidence and testing or screening compliance, were not examined further. To be eligible for our review, HTAs had to be reports of human studies with a full report in English. We excluded methodological reviews, horizon scanning studies, newsletters, pure economic studies, reviews comparing different generations of the same technology, and guidelines for tests already used in clinical practice.
Assessment of HTA Reports
We extracted general information about the name of the test, the proposed role of the test, the disease and patient group to be tested, and outcomes mentioned for each eligible HTA. Reports were classified according to the type of investigated test: screening (asymptomatic populations) (Reference Grootendorst, Jager, Zoccali and Dekker9); diagnosis (detecting or excluding disorders in symptomatic populations) (Reference Knottnerus, van Weel and Muris13); disease classification in patients with established diagnosis (including staging, prognosis, monitoring) (Reference Glasziou, Aronson, Glasziou, Irwig and Aronson8;Reference Pepe and Pepe22); or combinations of these purposes. Where more than one research question, indication, or test was included in an HTA, the first indication identifying studies on intermediate outcomes was used. All included HTAs were independently reviewed by two investigators (S.D., L.S.).
We compiled descriptive statistics of the frequencies of the types of tests, disease areas, and the types of reported outcomes in the HTAs. Where applicable, we classified the reported intermediate outcomes and summarized the kinds of primary studies on intermediate outcomes and how the quality of these studies was assessed. We also examined how this evidence was interpreted in the HTAs to support conclusions about the clinical effectiveness of the test. HTAs were classified as providing clear conclusions if they made a clear positive or negative statement about the clinical effectiveness based on the evidence presented or if they judged there was not enough evidence to support definitive conclusions. HTAs were classified as not providing a clear conclusion about clinical effectiveness if they did not provide any statement about the likely impact of the test on health outcomes or did not state that the evidence available was insufficient for these conclusions.
RESULTS
Characteristics of Identified HTA Reports
We identified 318 non-duplicate records. Ninety-seven of these were excluded because the main focus was not test evaluation; thirty-eight did not present data on accuracy, intermediate outcomes or patient health outcomes; twenty-two were horizon scanning reports or economic evaluations; and twelve were guidelines for tests already in use.
The included 149 HTAs were prepared by eighteen agencies in eight countries. The types of tests evaluated were for screening (24 percent), diagnosis (25 percent), disease classification of established diagnosis (32 percent), or multiple purposes (19 percent). The most common disease areas were oncology (38 percent) and the circulatory system (17 percent), followed by endocrine and metabolic diseases (6 percent), infectious diseases (5 percent), and multiple disease areas (6 percent) (Table 1). Additional information and Web links to all included HTAs are available in Supplementary Table 1, which can be viewed online at www.journals.cambridge.org/thc2012008.
Accuracy
Seventy-one of the 149 included HTAs (48 percent) reported solely on diagnostic accuracy. In forty-two (59 percent) of these assessments we found a clear conclusion about the clinical effectiveness of the test. These conclusions were negative (that is, the test was not effective) in nineteen assessments and positive (the test was effective) in sixteen, while in the remaining seven assessments the authors argued that there was not enough evidence to support definitive conclusions about the effectiveness of the test to improve health outcomes.
Patient Health Outcomes
In addition to accuracy, evidence of patient health outcomes was reported in seventeen HTAs (11 percent). Common outcomes were treatment success, disease progression, and treatment complication rates. Thirteen of the seventeen HTAs (76 percent) had clear conclusions about the clinical effectiveness of the test. These conclusions were positive in six HTAs and negative in one. In six HTAs, it was concluded that evidence for final conclusions about the clinical effectiveness of the test was lacking.
Intermediate Outcomes
A total of sixty-one HTAs (41 percent) identified intermediate outcomes that were deemed relevant to answer the reviewers’ research question. Of these, fourteen did not identify any primary studies but included a theoretical discussion of intermediate outcomes. In the remaining forty-seven, primary studies reporting on intermediate outcomes were included. Change in patient management was reported in thirty-three HTAs (70 percent) and was by far the most common intermediate outcome (Table 2). Measures of patient management included changes in medication (dose, time to discontinuation), surgical procedures (surgery avoided, postponed, or added), radiotherapy (target field, dose), ordering of further tests, hospitalization rates, duration of treatment, and referral to specialists.
Other intermediate outcomes reported were downstream patient adherence to other interventions (e.g., motivation to cease smoking or lose weight, mammography uptake), impact of testing on subsequent visits to health services or hospital admissions, change in definitive diagnosis or reducing the number of differential diagnoses, and impact on time delays (time to diagnosis, time to transfer to operative care, length of hospital stay).
In thirty-three HTAs (70 percent), at least some of the included studies reported intermediate outcomes in sufficient detail to allow an interpretation of test consequences in the clinical pathway. For example, these studies did not simply mention that patient management was changed, but specified what changes occurred by reporting rates of patients in whom surgery was avoided or chemotherapy increased. However, only seventeen HTAs included studies that compared intermediate outcomes according to test results, for instance, differences in measured time to diagnosis between test positives and negatives.
Design and Quality Assessment of Primary Studies on Intermediate Outcomes
Studies that reported intermediate outcomes included randomized trials of tests and observational studies. In twenty-one HTAs, RCTs were included that measured intermediate outcomes as the primary endpoint. In fourteen of these HTAs, the trials also reported health outcomes. In twelve HTAs, observational diagnostic before–after designs (Reference Guyatt, Tugwell and Feeny10) were included to provide evidence about intermediate outcomes. These studies compared planned patient management before and after test results had been made available to clinicians. In fourteen HTAs, other observational studies were included, of which five compared the consequences of testing, such as hospital admission rates, with the rates of historical controls from before the test was in use.
The quality of studies on intermediate outcomes was considered in thirty-four of the forty-seven HTAs. In fourteen HTAs, the authors used published quality-rating tools to assess intermediate outcomes. Some of these tools had originally been developed for diagnostic accuracy studies (e.g., QUADAS (Reference Whiting, Rutjes and Reitsma26): four HTAs), some for randomized trials of clinical interventions (e.g., Jadad scale (Reference Jadad, Moore and Carroll12): ten HTAs). In thirteen HTAs, the authors adapted existing tools for diagnostic accuracy studies for the appraisal of intermediate outcomes. In seven HTAs, the authors developed their own quality-assessment tools, for example, checklists based on recommendations by Guyatt et al. (Reference Guyatt, Tugwell and Feeny10). The results of the quality assessment were clearly reported in thirty HTAs.
Interpretation of the Evidence of Intermediate Outcomes
Of the forty-seven HTAs that identified studies of intermediate outcomes, seventeen mentioned in the methods section a specific test evaluation framework or guidelines describing how evidence from different outcomes was integrated. The Fryback and Thornbury framework (Reference Fryback and Thornbury7) was mentioned in five HTAs, while twelve Australian HTAs cited the MSAC Guidelines (18) for the assessment of diagnostic technologies. Furthermore, nine HTAs applied an overall quality rating of the body of evidence to their review.
The relationship between intermediate and patient health outcomes was considered in thirty-one HTAs; however, the uncertainty around assumptions linking intermediate outcomes with health benefits was inconsistently discussed. The validity and limitations of linking patient management with health outcomes were discussed in most cases (twenty-eight HTAs). In twenty-two HTAs, these discussions were at least partly supported by data from included studies on health outcomes; in the remaining cases they rested on untested assumptions.
Using the evidence of intermediate outcomes, twenty-seven of forty-seven (57 percent) HTAs drew clear conclusions about the clinical effectiveness of the investigated technology. These conclusions were positive in fifteen HTAs and negative in seven; in the remaining five, the authors concluded that there was insufficient evidence to draw conclusions.
DISCUSSION
We have reviewed how the international HTA community deals with the challenges of evaluating medical tests, with a particular focus on the common situation in which no direct evidence exists that a test improves health outcomes. Half of the 149 HTAs reported evidence about the consequences of testing beyond accuracy, with 41 percent considering intermediate outcomes. Overall, only approximately 60 percent of the 149 HTAs drew clear conclusions about the clinical effectiveness of the test based on the available evidence. Below, we discuss the use of evidence on the impact of test results on patient management, the most frequently used intermediate outcome, and make recommendations about the interpretation of this evidence in HTAs of tests.
The use of intermediate outcomes is well established in test evaluation frameworks. Fryback and Thornbury's six-tiered model (Reference Fryback and Thornbury7) is arguably the most prominent of these frameworks, and similar schemes have been proposed (Reference Lijmer, Leeflang and Bossuyt14). They share the basic principle of a hierarchy of types of outcome, starting with technical efficacy at the lowest level and then progressing sequentially to diagnostic accuracy, diagnostic thinking, therapeutic impact, patient health outcomes, and societal aspects. In this hierarchy, therapeutic impact provides higher level evidence of test effectiveness than accuracy. When a test has been shown to be accurate and its purpose is to improve treatment selection, change in patient management is a necessary condition for the test to improve health outcomes. It is, however, not a sufficient condition, because the test result is often only one of several factors influencing patient management, and a change of management does not necessarily lead to improved outcomes. Hence, intermediate outcomes may help answer some questions about the consequences of testing but leave reviewers with open issues about how to judge whether this evidence is an adequate surrogate for patient health outcomes.
To make valid judgments when evaluating change in patient management, we propose a structured approach that starts with making a claim about what change in patient management will occur as a consequence of the test results and how this is expected to lead to improved health outcomes. The type of management change specified and assumptions required to infer impact on health outcomes will then inform the formulation of research questions for the test HTA (Box 1). This approach is similar to the methodology of realist synthesis developed for complex policy interventions (Reference Pawson, Greenhalgh, Harvey and Walshe21). Indeed, change in patient management may provide important evidence for realist reviews of tests.
The first consideration is whether evidence of test impact on change in patient management is needed for drawing conclusions about the clinical effectiveness of a test. When direct evidence of test impact on health outcomes is not available, the value of measuring patient management depends on the role the test has in the clinical pathway (Reference Bossuyt, Irwig, Craig and Glasziou3). If a new test is proposed to replace a more expensive or invasive existing test without changing practice, accuracy may suffice to recommend the new test. For example, evidence of improved or at least similar sensitivity of new fecal DNA analyses compared with the common fecal occult blood tests in colorectal cancer screening may be enough to recommend the new method, provided it is reasonable to assume that a positive result from the new test will have the same consequences for patient management as a positive result from the old test (Reference Piper, Aronson and Ziegler23).
When the consequences of test results are not well established, evidence about patient management will be relevant to the assessment. In these situations, the second step for reviewers is to specify what management changes are anticipated and the assumptions required to link those management changes to changes in health outcomes (Box 1). These assumptions are critical to interpreting the evidence and ideally should be tested. We found that key assumptions were identified in most, but not all, of the HTAs we reviewed. Evidence from published studies was often used to support these assumptions. Expert opinion is required to infer whether evidence of effective treatment from these studies can be applied to the new setting that includes the test under review. In the assessment of PET for head and neck cancer, a panel of oncologists and radio-oncologists judged that increased radiotherapy due to PET-detected additional lymph node metastases is likely to improve health outcomes, based on existing evidence of the effectiveness of radiotherapy on cervical lymph node metastases (19). Such a judgment needs to weigh the likelihood and extent of the benefits of changed management against its potential harms. However, in many of the reviewed assessments the statements of assumptions could not easily be located; they were often somewhat hidden in the discussion. We suggest giving this important issue a more prominent place in a dedicated paragraph of test HTAs.
If the assumptions that changes in patient management are likely to improve outcomes appear reasonable, the third step is a review of the evidence for changed management (Box 1). Included studies need to report their results with a minimum standard of detail to be interpretable. Simply reporting a rate of "overall change" is not informative. Information about the direction and extent of changed treatment after a positive and a negative test result is needed to estimate the impact on health outcomes. The assumptions underlying these conclusions should be explicitly stated, as discussed above. Disappointingly, in only approximately a third of the reviewed HTAs were the included primary studies sufficiently reported to allow an interpretation of changed patient management stratified by test result. Interpretation also requires information about test accuracy to determine what proportion of patients receives a change in management based on a correct diagnosis and what proportion has management changed because of a false positive or false negative test result.
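The arithmetic behind stratifying management change by correct and incorrect test results can be sketched as follows. All parameters here (prevalence, sensitivity, specificity, and the assumption that every positive result changes management) are invented for illustration and do not describe any HTA or test discussed in this review.

```python
# Hypothetical illustration: how test accuracy determines what fraction of
# management changes rests on a correct diagnosis. All numbers are invented.
prevalence = 0.30     # proportion of tested patients with the condition
sensitivity = 0.85    # true-positive rate of the hypothetical test
specificity = 0.90    # true-negative rate of the hypothetical test

n = 1000              # per 1,000 tested patients
diseased = n * prevalence
healthy = n - diseased

true_pos = diseased * sensitivity          # condition correctly detected
false_neg = diseased * (1 - sensitivity)   # condition missed
false_pos = healthy * (1 - specificity)    # wrongly labelled positive
true_neg = healthy * specificity           # correctly excluded

# Strong simplifying assumption: every positive result changes management.
changed = true_pos + false_pos

print(f"Management changed in {changed:.0f} of {n} patients")
print(f"  based on a correct diagnosis:  {true_pos:.0f}")
print(f"  triggered by a false positive: {false_pos:.0f}")
```

Under these invented parameters, 70 of the 325 management changes per 1,000 patients would be triggered by false positive results, which is why a reported rate of "changed management" cannot be interpreted without the accompanying accuracy data.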
In the fourth step, the quality appraisal of this evidence, reviewers have to judge whether the included studies are able to demonstrate a true change in patient management (Box 1). The different study designs are prone to varying types of bias (Reference Staub, Lord and Simes25). If these studies do not measure actual management in patients randomly allocated to different test strategies, the outcome is often a hypothetical assessment of planned management in a patient cohort, so it remains unclear to what extent the measured changes in planned management reflect actual clinical practice. These limitations always need consideration. We also found inconsistent use of different appraisal tools. For a systematic review evaluating the added value of structural neuro-imaging with computed tomography or magnetic resonance imaging compared with current practice in the assessment of psychotic patients (Reference Albon, Tsourapas and Frew2), the authors adapted an appraisal tool commonly used for accuracy studies (QUADAS) to assess the included diagnostic before–after studies. Their subsequent publication of this method (Reference Meads and Davenport16) is an important step toward a more consistent appraisal of these studies. However, the sources of bias relevant to accuracy studies, particularly in the verification of test results with the reference standard, do not apply to assessing the impact of test information on downstream health outcomes. More important are the types of bias encountered in intervention studies, such as differences in patient characteristics between tested groups, differences in the measurement of outcomes, or differences in the reporting of outcomes (Reference Higgins, Altman, Higgins and Green11). In addition, appraisal should include assessing the validity of the study authors' assumptions for inferring that management is a good proxy for outcomes.
Finally, the conclusions of test HTAs should include a clear statement as to whether the use of the test is recommended (Box 1). They should also explain whether the test is accurate, changes patient management, and improves health outcomes, and reviewers should specify on what basis the recommendation about the use of the test was made.
This review has some limitations. Because of financial and time constraints, we included only English-language assessments. We believe that our sample is representative of HTAs in the current published English literature, but the extent to which the results can be applied to other HTA settings is debatable. However, the primary aim of this review was to document the range of approaches to test evaluation used by different agencies, and we believe the HTAs reviewed here are appropriate for documenting this issue. Some of the information extracted for this review was subjective, such as whether conclusions about the effectiveness of tests on improving health outcomes were clearly stated. Although two investigators (S.D., L.S.) independently rated the included assessments and agreed on a consensus rating in cases of initial disagreement, these judgments cannot be fully objective. Finally, in undertaking this review, we have presented a framework for test evaluation that has been used by the Australian MSAC. We are aware that different agencies may hold slightly different views; we anticipate this review will stimulate discussion about the use of intermediate outcomes in medical test assessments. In particular, we have identified the need for further research in the HTAi community to establish criteria for assessing the quality of primary studies and judging the validity of assumptions when using patient management as a surrogate for health outcomes. We hope that the recommendations in our Box can be a departure point for these discussions.
In conclusion, we have demonstrated that intermediate outcomes are frequently used in medical test HTAs, but interpretation of this evidence is inconsistently reported. We recommend that reviewers routinely explain the rationale for using intermediate outcomes to investigate a claim about impact on health outcomes, identify the assumptions required to link intermediate outcomes and patient benefits and harms, and assess the quality of included studies.
SUPPLEMENTARY MATERIAL
Supplementary Table 1 www.journals.cambridge.org/thc2012008
CONFLICT OF INTEREST
All authors report having no potential conflicts of interest.
CONTACT INFORMATION
Lukas P. Staub, MD, PhD candidate, Suzanne Dyer, PhD, Research Fellow, Sarah J. Lord, MBBS, Research Fellow, R. John Simes, MBBS, MS, MD, Professor of Medicine and Clinical Epidemiology, National Health and Medical Research Council Clinical Trials Centre, The University of Sydney, Sydney, New South Wales, Australia