The overarching purpose of health technology assessments (HTAs) is to help readers make informed healthcare decisions at both the health policy and the clinical level. HTAs, therefore, need to assess whether included studies are free of bias and whether findings are applicable to a specific population of interest.
Because randomized controlled trials (RCTs), if conducted properly, minimize the risk of bias (Reference Mulrow25), they are the primary source of information for most HTAs. RCTs, however, are frequently conducted in highly selected populations that do not reflect the broad populations to which the conclusions of HTAs will be applied. For this reason, uncertainty about the applicability of RCT findings to the “average” healthcare population has become a major concern in systematic reviews and HTAs (Reference Malmivaara, Koes and Bouter21). Two initiatives, the Drug Effectiveness Review Project (DERP) (26) and the Agency for Healthcare Research and Quality's (AHRQ) program to conduct comparative effectiveness reviews under Section 1013 of the Medicare Modernization Act (MMA) (Reference Slutsky, Atkins and Chang29), both of which conduct systematic reviews on the comparative efficacy and safety of drugs, have recognized this issue and specifically emphasize findings of trials with high applicability in their reports (1).
Clinicians and policy makers often distinguish between the efficacy and the effectiveness of an intervention. Efficacy trials, also called explanatory trials, determine whether an intervention will produce an expected result under ideal circumstances—usually within a narrowly defined clinical setting (Reference MacRae20). Such trials usually enroll highly selected patients to minimize any factors that could jeopardize treatment success (Reference Fortin, Dionne and Pinho7;Reference Sokka and Pincus30). By contrast, effectiveness trials, also called pragmatic trials, measure the degree of beneficial effects under “real-world,” more diverse settings (Reference Godwin, Ruhland and Casson11;Reference MacRae20). Patients in these trials reflect the heterogeneous populations that are likely to be treated in primary care settings. For the purpose of this discussion, we will use the terms explanatory and pragmatic studies as coined by Schwartz and Lellouch, who characterized pragmatism as a feature of trial design (Reference Schwartz and Lellouch28).
Explanatory and pragmatic studies essentially answer two different questions: explanatory trials determine whether an intervention can work, whereas pragmatic studies examine whether an intervention does work (Reference Haynes15). Trials conducted for regulatory approval are usually designed to answer the first question. They rarely provide all the necessary information to answer how well a treatment will work in practice or how the benefits compare with adverse effects for specific populations of interest (Reference Atkins3).
Policy decision makers, as well as clinicians, must always determine how applicable study results are to their population of interest and, thus, whether studies are explanatory or pragmatic. This process requires them to answer three questions: (i) Is the study population similar to the population of interest? (ii) Does the study design reflect clinical practice? (iii) Are the outcomes relevant for making policy or clinical decisions?
Therefore, to some extent applicability becomes a relative measure that largely depends on the reader's population of interest. Any given study can have high applicability for one population and low applicability for another. For example, a well-conducted pragmatic trial in hypertensive elderly men might have little applicability to young hypertensive women. Figure 1 depicts a logical framework of the association of health decision-making, explanatory, and pragmatic studies.

Figure 1. Logical framework for the distinction between efficacy and effectiveness.
Recently, the RTI/UNC (Research Triangle Institute/University of North Carolina) Evidence-based Practice Center (EPC) developed and validated a simple tool to distinguish explanatory from pragmatic studies in a standardized manner. Involving the directors of the US and Canadian EPCs, they identified seven criteria for classifying randomized trials as explanatory or pragmatic studies (Reference Gartlehner, Hansen and Nissman9). These criteria are intended to provide a tool that can help researchers and clinicians assess aspects of study design that can be viewed as prerequisites for pragmatic studies. Given the rationale of identifying pragmatic studies reliably with minimal false positives, testing revealed that a cutoff of six of the seven criteria produced the most desirable balance between sensitivity and specificity.
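To illustrate how such a cutoff operates, the following minimal sketch in Python (our own illustration, not the tool's actual implementation) labels a trial as pragmatic when it fulfills at least six of the seven criteria and shows how the sensitivity and specificity of a candidate cutoff could be checked against a set of reference judgments:

```python
# Illustrative sketch only: classify trials by a criteria cutoff and
# evaluate the cutoff against hypothetical reference labels.

def classify(criteria_met, cutoff=6):
    """Label a trial pragmatic if it fulfills at least `cutoff` criteria."""
    return sum(criteria_met) >= cutoff

def sensitivity_specificity(predicted, reference):
    """predicted/reference: equal-length lists of booleans (True = pragmatic)."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fn = sum(not p and r for p, r in zip(predicted, reference))
    tn = sum(not p and not r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    return tp / (tp + fn), tn / (tn + fp)

# A trial fulfilling six of the seven criteria is labeled pragmatic:
print(classify([True, True, True, True, True, True, False]))  # True
```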
There are limitations to such an approach, though. Like the methodological quality (internal validity) of studies, explanatory and pragmatic characteristics exist on a continuum. Dichotomizing studies into groups at high or low risk of bias, or into explanatory or pragmatic categories, always involves some degree of arbitrariness. Nevertheless, various organizations commissioning HTAs and systematic reviews have concluded that the merits of an explicit distinction outweigh the disadvantages. The AHRQ, for example, will base its recommendations on these criteria in the upcoming manual for comparative effectiveness reviews. Table 1 lists the RTI-UNC criteria. Detailed descriptions of each criterion and of the process used to develop them have been published previously (Reference Gartlehner, Hansen and Nissman9).
Table 1. Criteria to Distinguish Pragmatic from Explanatory Trialsa

aAdapted from: Gartlehner et al. (Reference Gartlehner, Hansen and Nissman9).
To apply these criteria and distinguish explanatory from pragmatic trials, publications need to provide sufficient information on the factors that characterize pragmatic trials. The primary objective of the current study was to determine the adequacy of reporting of information relevant to distinguishing explanatory from pragmatic studies in four different areas of drug therapy. Numerous studies have assessed the adequacy of reporting in areas such as adverse events (Reference Ethgen, Boutron and Baron6;Reference Hazell and Shakir16;Reference Ioannidis, Evans and Gotzsche19) or study methods (Reference Chan, Hrobjartsson and Haahr4;Reference Dumville, Torgerson and Hewitt5;Reference Gross, Mallory and Heiat12;Reference Hewitt, Hahn and Torgerson17;Reference Huwiler-Muntener, Juni and Junker18;Reference Mullner, Matthews and Altman24) and have subsequently led to reporting guidelines such as the CONSORT (Consolidated Standards of Reporting Trials) (Reference Altman, Schulz and Moher2;Reference Ioannidis, Evans and Gotzsche19;Reference Moher, Schulz and Altman23) and QUOROM (Quality of Reporting of Meta-analyses) (Reference Moher, Cook and Eastwood22) statements, which have greatly improved the reporting of RCTs and systematic reviews (Reference Plint, Moher and Morrison27). To our knowledge, our study is the first to determine the adequacy of reporting of information critical to distinguishing explanatory from pragmatic studies.
METHODS
To be able to examine comprehensive bodies of evidence that reflect an entire spectrum of publicly and industry-funded studies, we chose four medical domains of drug therapy for which we have recently completed systematic reviews on the comparative efficacy and safety of drugs (Reference Gartlehner, Hansen and Kahwati8;Reference Gartlehner, Hansen and Thieda10;Reference Hansen, Gartlehner and Kaufer13;Reference Hansen, Gartlehner and Lohr14). These treatments included the following: (i) second-generation antidepressants for the treatment of major depressive disorder; (ii) inhaled corticosteroids for the treatment of asthma and chronic obstructive pulmonary disease; (iii) cholinesterase inhibitors and memantine for the treatment of Alzheimer's disease; and (iv) targeted immune modulators for the treatment of rheumatoid arthritis, ankylosing spondylitis, psoriatic arthritis, and Crohn's disease.
Because we were interested in explanatory and pragmatic trials regardless of internal validity, we included all trials that met the eligibility criteria of these systematic reviews, regardless of their quality ratings. We excluded all observational studies, such as case series and cohort studies; extension studies, which usually were open-label safety studies of previously conducted RCTs; and studies with pharmacokinetic outcomes as their primary purpose. We did not conduct any additional systematic literature searches.
Evaluation of 8,013 citations from the systematic reviews and their respective bibliographic databases yielded 137 eligible trials: 43 on second-generation antidepressants, 32 on inhaled corticosteroids, 28 on Alzheimer's disease drugs, and 38 on targeted immune modulators.
The majority of the reviewed studies were head-to-head trials with flexible dose designs. Trials were published between 1992 and 2006.
We developed a questionnaire to assess the adequacy of reporting based on the seven criteria presented in Table 1. The objective of the questionnaire was to assess whether reviewers felt confident answering each of the seven criteria based on the reported information. Therefore, the term “inadequate reporting” indicates that two reviewers judged the available information insufficient to reliably answer a given criterion.
We pilot-tested the instrument to ensure usability and inter-rater reliability. The final version consisted of two types of questions: first, did the article contain enough information to reliably answer each criterion; second, did the individual study fulfill enough criteria to be considered a pragmatic study? The final questionnaire is presented in Table 2.
Table 2. Summary of the Questionnaire Used to Determine Adequacy of Reporting

WHO, World Health Organization; UKU, Udvalg for Kliniske Undersøgelser; ITT, intention-to-treat.
Two persons, experienced in systematic reviews and trained in using the RTI-UNC tool, independently reviewed each eligible study. The primary endpoint of our analysis was the proportion of criteria that could be answered adequately based on the reported information. We used the unweighted kappa statistic to measure the inter-rater reliability of the form. The overall observed agreement between reviewers on the adequacy of reporting was 88 percent, resulting in an unweighted overall kappa of 0.68, which indicates substantial agreement. All discrepancies were subsequently resolved through consultation with a third person.
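For readers unfamiliar with the statistic, the following sketch (our illustration, not the analysis code used for this study) shows how an unweighted Cohen's kappa is derived from two raters' judgments. Note that, given the observed agreement of 88 percent, a kappa of 0.68 implies a chance-expected agreement of roughly 62.5 percent.

```python
# Illustrative sketch: unweighted Cohen's kappa for two raters making
# categorical (e.g., adequate/inadequate) judgments per item.
from collections import Counter

def cohens_kappa(rater1, rater2):
    """rater1, rater2: equal-length lists of categorical ratings."""
    n = len(rater1)
    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance-expected agreement, from each rater's marginal frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(rater1) | set(rater2))
    return (p_o - p_e) / (1 - p_e)
```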
RESULTS
Overall Adequacy of Reporting
Only 12 percent (n = 16) of the included studies reported sufficient information to enable readers to reliably distinguish explanatory from pragmatic studies. The large majority (n = 121) had substantial shortcomings in reporting, limiting readers' ability to assess whether a study could be considered a pragmatic trial. The degree of suboptimal reporting varied substantially among these publications. In 36 percent (n = 49) of the examined studies, the reported information was insufficient to make a judgment for only one criterion. At the lower end of the spectrum, 2 percent (n = 3) of the reviewed publications failed to provide adequate information for five of the seven criteria. Table 3 summarizes the percentages of studies that adequately reported information on all or fewer criteria.
Table 3. Percentages of Articles That Adequately Reported Information Necessary to Determine All or Fewer Criteria

Over time, the quality of reporting has not improved substantially. Figure 2 depicts the average number of adequately reported criteria from 1992 to 2005.

Figure 2. Average number of adequately reported criteria over time.
Adequacy of Reporting by Criterion
Table 4 summarizes the percentages of studies that provided enough information to determine whether individual criteria could be answered reliably for each study. In the following sections, we briefly discuss the adequacy of reporting for each criterion.
Table 4. Percentages of Articles That Adequately Reported Information Necessary to Determine Individual Criteria

Criterion 1: Populations in Primary Care or Sites of Usual Care
Explanatory studies are frequently conducted in tertiary care, referral settings with specialized equipment and highly trained staff. Pragmatic studies, by contrast, strive to reflect the initial care facilities available to a diverse population. Reporting of information relevant to this criterion was substantially worse than for all the other criteria: only 26.3 percent (n = 36) of studies provided enough information to determine the exact setting of the study. For the majority of studies, reviewers could not reliably determine whether study populations were primary-, secondary-, or tertiary-care based.
Criterion 2: Eligibility criteria
Explanatory studies often apply stringent eligibility criteria with run-in phases to enroll compliant patients without substantial comorbidities, risks for adverse events, or placebo responses. This procedure leads to highly selected populations that are, in turn, unrepresentative of the “average” patients seen in daily clinical care. By contrast, pragmatic trials must allow the study population to be representative of the general population affected by a condition.
The large majority (92.7 percent; n = 127) of studies provided adequate information on this criterion. Nevertheless, more than 7 percent (n = 10) of all surveyed studies did not report sufficiently on inclusion and exclusion criteria of study participants.
Criteria 3 and 4: Health Outcomes, Long Study Duration, and Clinically Relevant Treatment Modalities
Surrogate outcomes, fixed-dose designs, or study durations that do not reflect clinical practice frequently limit the applicability of explanatory trials. Therefore, it is crucial for pragmatic studies to examine health outcomes relevant to the condition of interest over time periods that mimic the minimum length of treatment in clinical settings. Furthermore, treatment modalities should be clinically relevant.
Information on these criteria was almost always adequately reported. Overall, 99 percent (n = 136) of studies clearly reported which outcomes were assessed, enabling readers to judge whether any of these were relevant health outcomes. Likewise, the length of a study and treatment modalities were adequately reported in almost all publications (99 percent).
Criterion 5: Assessment of Adverse Events
A crucial component in determining whether the benefits of any given intervention outweigh its harms is adequate assessment of adverse events. Ideally, adverse events should be prespecified for each intervention and assessed with an objective adverse events scale (e.g., the World Health Organization scale of adverse reactions).
Information on the methods of adverse events assessment was well reported in approximately two-thirds (67 percent; n = 92) of all studies. A substantial proportion, however, did not specify how adverse events were assessed, even though the publications may have reported specific adverse events.
Criterion 6: Adequate Sample Size to Assess a Minimally Important Difference from a Patient Perspective
Explanatory trials often lack the power to detect minimally important differences (MIDs) from a patient perspective. The sample size of a pragmatic trial should be sufficient to detect at least a MID on a health-related quality-of-life scale.
Information on sample size calculations with respect to a MID was well reported in 58 percent (n = 79) of the reviewed publications. Conversely, slightly less than half of the studies did not provide enough information to justify the chosen sample size and to enable readers to determine whether a study was adequately powered to detect a MID.
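To make the sample size requirement concrete, the following sketch applies the standard two-sample formula for the number of patients per arm needed to detect a given MID on a continuous scale at conventional error levels; the MID and standard deviation in the example are hypothetical placeholders, not values drawn from the reviewed trials.

```python
# Illustrative sketch: patients per arm needed to detect a minimally
# important difference (MID) between two groups on a continuous outcome.
from math import ceil
from statistics import NormalDist

def n_per_arm(mid, sd, alpha=0.05, power=0.80):
    """n = 2 * (z_{1-alpha/2} + z_{power})^2 * (sd / mid)^2 per arm."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 * (sd / mid) ** 2)

# Hypothetical example: a MID of 5 points on a 0-100 quality-of-life
# scale with a standard deviation of 15 requires about 142 patients per arm.
print(n_per_arm(mid=5, sd=15))  # 142
```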
Criterion 7: Intention-to-Treat Analysis
To some extent, an adequate intention-to-treat (ITT) analysis takes the effects of nonadherence and other reasons for treatment discontinuation into consideration when estimating a treatment effect. Information to determine whether an ITT analysis was conducted adequately was reported in 82 percent (n = 112) of the included articles. The remaining studies did not provide sufficient detail to determine whether the analysis followed ITT principles; they generally lacked the information needed to distinguish between a “completers” analysis and an ITT analysis.
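The distinction matters because the two analyses can yield different effect estimates, as the following minimal sketch with hypothetical patient records illustrates: an ITT analysis keeps every randomized patient in the denominator, whereas a completers analysis drops those who discontinued.

```python
# Illustrative sketch: ITT versus "completers" response rates on
# hypothetical records of (arm, completed_study, responded).
records = [
    ("drug", True, True),
    ("drug", False, False),   # dropout; ITT keeps this patient
    ("drug", True, False),
    ("placebo", True, False),
    ("placebo", False, False),
]

def response_rate(data, arm, itt=True):
    """ITT counts all randomized patients; completers only those who finished."""
    cohort = [r for r in data if r[0] == arm and (itt or r[1])]
    return sum(r[2] for r in cohort) / len(cohort)

print(response_rate(records, "drug", itt=True))   # 1/3 under ITT
print(response_rate(records, "drug", itt=False))  # 1/2 among completers
```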
DISCUSSION
Our evaluation of 137 trials from four different drug classes indicates that substantial shortcomings exist in the reporting of aspects of study design important for determining the applicability of a study. Only 12 percent of the reviewed publications reported sufficient information to reliably distinguish explanatory from pragmatic trials. The elements most commonly lacking adequate reporting were setting descriptions, methods of adverse events assessment, and sample size justifications. Although all three elements are important for distinguishing explanatory from pragmatic trials, setting descriptions in particular are important for policy decision makers and clinicians in determining the applicability of results.
For most conditions, primary care facilities will be the initial providers of care. Primary care settings, however, can vary depending on the health condition and the available infrastructure. For most diseases, office-based practices, primary care clinics, or community health centers are the initial settings for health care. For specific populations, such as children or the frail elderly, schools or nursing homes may be the site of primary care. For persons with rare or severe diseases, or those requiring high-risk interventions such as organ transplantation, specialized secondary or tertiary care settings may provide essentially all of the care. It is therefore troubling that only 26 percent of the reviewed studies provided enough information to determine the setting of a given study.
The reason for suboptimal reporting, particularly with respect to study settings, remains unclear. Since the introduction of the CONSORT statement in 1996, substantial improvement in many areas of reporting has been noted; a recent systematic review illustrated the improvements achieved since its introduction (Reference Plint, Moher and Morrison27). An extension of the CONSORT statement has recently been published to improve the reporting of pragmatic trials (Reference Zwarenstein, Treweek and Gagnier31). We hope this statement will increase awareness of the importance of applicability among researchers and editors, who still focus primarily on issues of internal validity.
Our study has two major limitations. First, it was limited to four medical domains of drug therapy; other interventions may have different reporting issues with respect to applicability. We believe, however, that these shortcomings in reporting are similar throughout the medical literature, and the four areas we chose represent a wide range of chronic conditions affecting a broad portion of the population. Second, the inter-rater agreements indicate that subjective judgment may play a modest role in evaluating whether reported information is sufficient to answer a given pragmatic criterion. It is unlikely, though, that the poor reporting of information on the type of setting is entirely attributable to a lack of inter-rater agreement, because the inter-rater agreement for this criterion was 82.6 percent.
CONCLUSION
Suboptimal reporting of information relevant to distinguishing explanatory from pragmatic trials limits the assessment of the applicability of published studies. Authors and editors should therefore place increased emphasis on reporting the information essential for determining applicability. Diligent use of the extension of the CONSORT statement (Reference Zwarenstein, Treweek and Gagnier31) would greatly facilitate these efforts.
CONTACT INFORMATION
Gerald Gartlehner, MD, MPH (gerald.gartlehner@donau-uni.ac.at), Associate Professor, Department for Evidence-based Medicine and Clinical Epidemiology, Danube University, Krems, Dr. Karl Dorrekstr. 30, Krems, 3500, Austria
Patricia Thieda, MA (pthieda@schsr.unc.edu), Research Associate, Cecil G. Sheps Center for Health Services Research, Richard A. Hansen, PhD, RPh (rahansen@unc.edu), Associate Professor, Eshelman School of Pharmacy, Division of Pharmaceutical Outcomes and Policy, Laura C. Morgan, MA (lmorgan@schsr.unc.edu), Research Associate and Project Coordinator, Cecil G. Sheps Center for Health Services Research, The University of North Carolina at Chapel Hill, 725 Martin Luther King Jr. Boulevard, CB# 7590, Chapel Hill, North Carolina 27599
Janelle A. Shumate, MD, MPH (jshumate@unch.unc.edu), Resident; PGY3, Department of Pediatrics, University of North Carolina Hospitals, 101 Manning Drive, Chapel Hill, North Carolina 27599
Daniel B. Nissman, MD, MPH (nissman@musc.edu), Resident, Department of Radiology, Medical University of South Carolina, 69 Jonathan Lucas Street, Charleston, South Carolina 29425