Association between methodological characteristics and outcome in health technology assessments which included case series
Published online by Cambridge University Press: 04 August 2005
Abstract
Objectives: Case series constitute a weak form of evidence for effectiveness of health technologies. However, for a variety of reasons, such studies may be included in health technology assessments. There are no clear criteria for assessing the quality of case series. We carried out an empirical investigation of the association between outcome frequency and methodological characteristics in a sample of health technology assessments.
Methods: Systematic reviews of functional endoscopic sinus surgery for nasal polyps, spinal cord stimulation for chronic back pain, and percutaneous transluminal coronary angioplasty and coronary artery bypass grafting for chronic stable angina were identified as containing more than forty case series. Data were extracted by one reviewer and checked by a second on population characteristics, outcomes, and the following methodological features: sample size, prospective/retrospective approach, consecutive recruitment, multi- or single-center organization, length of follow-up, independence of outcome measurement, and date of publication. Associations between methodological features and outcome were explored in univariate and multivariate analyses using parametric and nonparametric tests and robust regression or analysis of variance/analysis of covariance, as appropriate.
Results: Included reviews contained between forty-two and seventy-six case series studies, involving 5 to 172,283 participants. Reporting of methodological features was poor and limited the analyses. In general, we found little evidence of any association between methodological characteristics and outcome. Sample size is used as an inclusion criterion in many reviews of case series but was consistently shown to have no relationship to outcome in all analyses. A prospective approach was not associated with outcome. Insufficient data were available to explore consecutive recruitment. Mixed results were shown for length of follow-up, independence of outcome measurement, and publication date.
Conclusion: We found little evidence to support the use of many of the factors included in tools used for quality assessment of case series. Importantly, we found no relationship between study size and outcome across the four examples studied. Isolated examples of a potentially important relationship between other methodological factors and outcome were shown, for example, blinding of outcome measurement, but these examples were not shown consistently across the small number of examples studied. Further research into the determinants of quality in case series studies is required to support health technology assessment.
- Type
- GENERAL ESSAYS
- Information
- International Journal of Technology Assessment in Health Care, Volume 21, Issue 3, July 2005, pp. 277-287
- Copyright
- © 2005 Cambridge University Press
Case series are uncontrolled observational studies involving an intervention and outcome for more than one person (2). They constitute a weak form of evidence, and, in an ideal world, it would be unnecessary to consider them when carrying out systematic reviews. However, case series may be the only form of evidence available, or inclusion may be considered necessary by policy-makers for comprehensiveness.
It has been argued that case series cannot be used to assess effectiveness as it is impossible to attribute observed outcomes to treatment. Evidence from case series may be misleading, and such misinformation may result in ineffective treatments, or harmful treatments being adopted, or effective treatments being disregarded. Examples exist in the literature, such as observational studies of hormone replacement therapy, that suggested a potential benefit (6), whereas later randomized controlled trials (RCTs; 7) revealed no net benefit.
As background to the present study, we reviewed systematic reviews carried out for the UK's National Institute for Clinical Excellence (NICE) and published on the NICE Web site up to September 2002 (3). Forty-seven reviews were obtained, of which fourteen (30 percent) included case series. The reviews included between 2 and 159 case series studies (mean =28). In two reviews, case series constituted the only evidence available to inform national policy (9;15). Other nonrandomized study designs (such as case-control and cohort studies) were also included in half of these reports. The most common reason for considering case series was the absence of sufficient data from randomized controlled trials for effectiveness or safety outcomes.
Sample size and duration of follow-up were used to determine inclusion of case series in eight reviews (1;5;8;10;11; 13–15). However, the specific criteria applied varied widely, with sample sizes of between 10 and 2,000 used to determine inclusion.
After inclusion, nineteen different methods of quality assessment were used to appraise case series. The most commonly used features were a clear description of the included patients/cases (seven reports), description of loss to follow-up (seven reports), length of follow-up sufficient/described (eight reports) and valid, objective, masked outcome measurement (seven reports).
It seems likely that, despite their critical methodological weaknesses, case series will continue to play a significant role in health technology assessments, particularly in systems such as the NICE Appraisal process, which predominantly consider new technologies. The plethora of approaches to quality assessment of case series reflects uncertainty about the importance of different methodological features of case series. Despite extensive searches of electronic databases, examination of bibliographies, and hand searching of key journals (strategy available from the authors), we were unable to find methodological studies that examined the association between characteristics of case series studies and outcome. We, therefore, carried out such an investigation in a sample of systematic reviews of case series drawn from those carried out for NICE and the NHS HTA Programme.
METHODS
We developed the following hypotheses:
- Smaller sample sizes may be associated with increased selection of cases.
- The desirable outcome frequency in retrospective studies may be systematically better than in those that are prospective, due to selection.
- Multicenter case series may show greater desirable outcome frequency than single-center case series due to bias in selection of cases within individual centers of a multicenter study.
- Consecutive enrollment may protect against selection bias compared with nonconsecutive enrollment.
- Blind or objective outcome assessment may protect against measurement bias.
- Length of follow-up is often independently related to estimate of outcome frequency, due to natural history of the condition. Case series with short follow-up periods may systematically report better outcomes than those with longer follow-up periods.
- Early reports of a new intervention may be biased toward more favorable outcomes due to selection bias.
We identified and selected reviews of case series from projects undertaken for the NHS HTA Programme up to October 2002. The selection criteria were that at least 40 case series were included and that information on the age of participants was reported, as a minimum description of the population. We also intended to compare the results of case series with available RCTs, and a further inclusion criterion was that at least one good-quality RCT should be available. This element of the study is not reported here.
Reviews on four interventions, in three reports, were included: functional endoscopic sinus surgery (FESS) for nasal polyps (4); spinal cord stimulation (SCS) for chronic back pain (12); coronary artery bypass grafting (CABG) for chronic stable angina (11); and percutaneous transluminal coronary angioplasty (PTCA) for chronic stable angina (11).
Data were extracted from tables in published or unpublished reports (see Table 1). Where studies were excluded by original reviewers due to small sample size, these data were obtained. Data were extracted by one reviewer (R.G. or K.D.) and checked by a second reviewer (K.S. or E.C.).

Analyses of study characteristics were undertaken at a between-study level within each review. For each of the study hypotheses described above, where the potential explanatory variable was continuous, a scatter plot was drawn and inspected, and, if appropriate, a linear regression analysis was performed. Given the considerable heterogeneity in the data, robust regression was carried out using STATA version 8. This approach identifies single data points that have a particularly strong impact on the regression and sets these aside. The remaining data points are weighted according to the size of their residuals before an ordinary least squares regression is carried out. This method, while having lower statistical power in ideal circumstances, does not require the errors in the data to be normally, independently, and identically distributed (normal i.i.d.). It is, therefore, a more general and flexible approach. Weighted regression analyses were also performed, weighted by sample size. Where appropriate, heteroskedasticity was examined in regression analyses using the Cook–Weisberg test.
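The two regression approaches described above can be sketched as follows. This is a simplified illustration only: it uses iteratively reweighted least squares with Huber weights, which is an assumption and not a reproduction of the STATA routine used in the study. The function names, the tuning constant, and all data values are hypothetical.

```python
from statistics import median

def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def robust_fit(x, y, c=1.345, iterations=50):
    """Iteratively reweighted least squares with Huber weights (assumed scheme)."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        a, b = weighted_fit(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate from the median absolute residual.
        scale = median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            break
        # Points with large residuals receive weights below 1.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in resid]
    return weighted_fit(x, y, w)

# Hypothetical data: outcome against follow-up, with one extreme study.
follow_up = list(range(20))
outcome = [2 + 0.5 * t + (0.1 if t % 2 == 0 else -0.1) for t in follow_up]
outcome[-1] += 10                          # a single influential data point
sizes = [10 + 5 * t for t in follow_up]    # hypothetical study sample sizes

_, ols_slope = weighted_fit(follow_up, outcome, [1.0] * len(follow_up))
_, robust_slope = robust_fit(follow_up, outcome)
_, size_weighted_slope = weighted_fit(follow_up, outcome, sizes)
```

In this made-up example the underlying slope is 0.5; the unweighted fit is pulled upward by the extreme study, whereas the robust fit downweights it. The size-weighted fit shows the second approach, in which each study contributes in proportion to its sample size.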
Where the potential explanatory variable was dichotomous, associations between methodological characteristics and outcome were explored using analysis of variance/analysis of covariance (ANOVA/ANCOVA) as appropriate. In addition to these univariate analyses, multivariate analyses were undertaken because variations in the populations included in different studies may explain some of the differences in outcomes. Available data on disease severity, sex, and mean age of the population, therefore, were included in multivariable analysis using robust regression or ANOVA/ANCOVA as appropriate.
The outcome measures extracted from the studies were reported at different time periods, depending on the length of follow-up of the entire study. For relatively non–time-dependent outcomes (such as those following FESS and SCS), the relationship between length of follow-up and outcome was one of our hypotheses. For the angina outcomes, the natural history of the condition suggested that the outcome measure of mortality would worsen with time. Therefore, a yearly adjusted outcome measure was calculated by dividing the reported outcome by the average length of follow-up. Although this strategy is likely to oversimplify the true relationship between length of follow-up and mortality, it seemed a reasonable approximation. This method, as opposed to including length of follow-up in a multivariable analysis, was used because of the relatively small number of observations in the data set. For angina recurrence, the relationship was not as clear; hence, both nonadjusted and adjusted outcome measures were calculated.
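The yearly adjustment amounts to a simple division, sketched below with hypothetical numbers. The second function is not part of the study; it is included only as an assumed constant-rate alternative to show why the simple division is an approximation.

```python
def yearly_adjusted(proportion, mean_follow_up_years):
    """Reported outcome proportion divided by mean follow-up in years."""
    if mean_follow_up_years <= 0:
        raise ValueError("follow-up must be positive")
    return proportion / mean_follow_up_years

def constant_rate_annual(proportion, mean_follow_up_years):
    """Assumed alternative: annual risk if the same risk applied every year."""
    return 1 - (1 - proportion) ** (1 / mean_follow_up_years)

# Hypothetical study: 12 percent mortality over a mean follow-up of 4 years.
simple = yearly_adjusted(0.12, 4.0)
compound = constant_rate_annual(0.12, 4.0)
```

The two values are close when the reported proportion is small and diverge as it grows, which is one way to see where the linear adjustment becomes less reliable.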
RESULTS
Details of the number and type of studies that form the four data sets analyzed in this section are shown below in Table 2. The included papers were often not explicit about items such as whether the data were collected prospectively or consecutively. Where it was not possible to tell one way or another, these data were excluded from further analysis. There was no evidence of heteroskedasticity in the data sets.

Results: FESS for Nasal Polyps
Forty-two case series were available for analysis. Outcomes were nasal patency and proportion of patients showing symptomatic improvement. Methodological details were poorly reported, and data were unavailable from many studies. Scatter plots showed no obvious relationship between length of follow-up and outcomes (Figures 1 and 2).

Percentage symptomatic improvement after functional endoscopic sinus surgery versus follow-up (months)

Nasal patency after functional endoscopic sinus surgery versus follow-up (months)
Univariate analyses showed significant associations between both outcomes and whether studies were multicentered. Symptomatic improvement was shown, on average, in 92 percent of participants in multicentered studies compared with 81 percent in single-center studies (Mann–Whitney, p =.02) and nasal patency in 95 percent versus 70 percent (Mann–Whitney, p =.03). Although significant association was shown between age and proportion of males in studies for symptom improvement (b =.68, p =.003), the inclusion of this and other population characteristics in multivariate analyses did not alter the conclusion. In univariate and multivariate analyses, no association was shown between either outcome and sample size, prospective approach, consecutive recruitment, length of follow-up, or date of publication.
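For readers unfamiliar with the Mann–Whitney test used for these comparisons, a minimal sketch follows. It computes the U statistic from average ranks and a two-sided p-value from the normal approximation, without tie or continuity corrections; with so few studies an exact test would normally be preferred. The study-level proportions are hypothetical and are not the data analyzed here.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p) using the normal approximation."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical proportions improved in multicenter and single-center studies.
multicenter = [0.95, 0.92, 0.90]
single_center = [0.70, 0.81, 0.85, 0.60]
u_stat, p_value = mann_whitney_u(multicenter, single_center)
```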
Results: Spinal Cord Stimulation for Chronic Back Pain
Seventy-five case series were available for analysis on the outcome measure of pain relief. Insufficient data were available to address hypotheses concerning multi- versus single-centered studies, prospective approach, consecutive recruitment, or independent/blinded outcome measurement.
There was no significant relationship between pain relief and length of follow-up (Figure 3). Univariate and multivariate analyses showed no association between methodological characteristics and outcome. However, an interesting finding was that a quality score developed by the authors of the original review for appraisal of case series, and based on the Jadad score for assessing RCTs, showed a small negative correlation with percentage of participants reporting pain relief (robust regression coefficient -0.053, p =.04). No association between outcome and other potential explanatory variables (age, proportion males, duration of pain, and number of previous operations) was found, and the addition of these factors in multivariate analyses had no effect on conclusions.

Percentage of participants reporting pain relief after spinal cord stimulation and length of follow-up.
Results: PTCA for Chronic Stable Angina
Sixty-three case series were available for this analysis. Unlike the results for FESS and SCS, which reported outcomes related to treatment success, the outcomes in the PTCA and CABG analyses reported undesirable outcomes (mortality and angina recurrence).
A scatter plot showing reported mortality (proportion) against length of follow-up in years was plotted for PTCA (Figure 4). There was a positive regression coefficient (b =.007, p =.03). Adjusted yearly mortality, therefore, was used in the analysis.

Mortality after percutaneous transluminal coronary angioplasty (PTCA) and length of follow-up.
However, there was no significant linear relationship between the proportion reporting recurrent angina and length of follow-up (Figure 5). The natural history of the condition suggests that angina recurrence should increase over time, rather than decrease, so this finding is surprising. Loss to follow-up, including deaths, may explain the apparent lack of relationship, although the inclusion of mortality in a multivariate analysis did not demonstrate that this factor had a significant confounding effect. Despite this finding, we adjusted the analysis for length of follow-up for consistency with the analysis of mortality.

Angina recurrence after percutaneous transluminal coronary angioplasty and length of follow-up.
Due to the nature of the outcome, it was not possible to assess the effect of independent or blinded measurement of mortality. Insufficient data were available to examine outcomes according to multicentered or single-center studies (only three studies involved multicenter data collection).
A significantly higher rate of recurrent angina was seen in the nonadjusted analysis for studies that measured this outcome independently compared with those that did not (ANOVA p =.005). However, when data were adjusted for length of follow-up, this effect disappeared.
The effects of publication date were conflicting. No association was found with reported mortality or with angina recurrence in a robust regression that did not take account of study size. However, in a regression where study size was taken into account, a very small effect was shown (robust regression coefficient =-0.02, 95 percent confidence interval [CI], -0.037 to -0.007, p =.005). Note that earlier publication date was associated with less favorable results, contrary to the hypothesis. This result was maintained in the multivariate analysis. Neither sample size nor prospective/retrospective approach was associated with mortality or recurrence in univariate or multivariate analyses.
Results: CABG for Chronic Stable Angina
Seventy-two case series were available for analysis. There was a positive relationship between length of follow-up and mortality and angina recurrence (Figures 6 and 7). Annual mortality, therefore, was used in the analyses. For angina recurrence, the association with length of follow-up was not clear, with significant results reported in robust regression but not in regression weighted for sample size.

Mortality after coronary artery bypass grafting (CABG) and duration of follow-up.

Angina recurrence after coronary artery bypass grafting (CABG) and duration of follow-up.
As for PTCA, independence of outcome measurement was not explored for mortality. Insufficient data were available to investigate consecutive recruitment, as only two studies reporting mortality and one reporting angina recurrence included this feature. For angina recurrence, the analyses of date of publication produced conflicting results, depending on the approach taken in the univariate analyses. Robust regression suggested a small but significant negative association, i.e., earlier publication was associated with worse outcome (b =-0.004, 95 percent CI, -0.007 to -0.001, p =.01). However, analysis weighted for sample size was not significant (b =-0.004, 95 percent CI, -0.012 to 0.013, p =.95).
Comparison of prospective versus retrospective study designs showed conflicting results, depending on the type of analysis. Twenty-six studies were included (seventeen retrospective, nine prospective), and mean angina recurrence was higher in the prospective group (0.34 versus 0.26). This difference was not significant when examined using the t-test (p =.31) or Mann–Whitney test (p =.33) but was significant using ANOVA weighted for study size (p =.002).
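One way to see how weighting by study size can change the picture is to compute a one-way ANOVA F statistic with and without weights. The sketch below treats the weights as analytic weights, so the degrees of freedom depend on the number of studies; this is an assumption, and the exact weighting scheme used in the original analysis is not stated. All values are hypothetical.

```python
def weighted_anova_f(groups):
    """One-way ANOVA F statistic; each group is a list of (value, weight)."""
    observations = [obs for group in groups for obs in group]
    n, k = len(observations), len(groups)
    total_weight = sum(w for _, w in observations)
    grand_mean = sum(v * w for v, w in observations) / total_weight
    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        group_weight = sum(w for _, w in group)
        group_mean = sum(v * w for v, w in group) / group_weight
        ss_between += group_weight * (group_mean - grand_mean) ** 2
        ss_within += sum(w * (v - group_mean) ** 2 for v, w in group)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical (recurrence proportion, sample size) pairs per study.
prospective = [(0.45, 900), (0.44, 800), (0.20, 40)]
retrospective = [(0.26, 100), (0.30, 120), (0.22, 90)]

f_weighted = weighted_anova_f([prospective, retrospective])
f_unweighted = weighted_anova_f(
    [[(v, 1.0) for v, _ in prospective], [(v, 1.0) for v, _ in retrospective]]
)
```

In this made-up example the large prospective studies dominate the weighted group mean, so the weighted and unweighted statistics differ, which is the kind of discrepancy between analytic approaches reported above. A p-value would be obtained by referring F to the F distribution with (k - 1, n - k) degrees of freedom in a statistics package.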
Analysis of independent measurement of angina recurrence also showed discrepant results according to the statistical approach taken. A small difference in mean frequency between studies was shown (0.02), which was significant only in weighted ANOVA (p =.002).
Age and sex were significant predictors of outcome, although New York Heart Association classification stage was not. However, the inclusion of these factors in multivariate analyses did not alter the conclusions of the univariate analyses and, notably, did not alter the discrepant findings on angina recurrence. Publication date remained a significant predictor of outcome, with a small coefficient in the unweighted robust regression (b =-0.0015, 95 percent CI, -0.003 to -0.0002, p =.02). Table 3 shows a summary of the results of all analyses.

DISCUSSION
Overall, we found limited evidence of association between methodological features and outcome in the analyses carried out. However, a consistent finding across all the case studies was of no relationship between sample size and outcome frequency. Hitherto, sample size has been used as a criterion for the inclusion or exclusion of case series from reviews. The lack of relationship between study size and outcome suggests that this approach may not be justified. Case series are likely to be more numerous than RCTs or other designs, and there may be an incentive for researchers to limit the number of series included in a review, supported perhaps by the view that this design is necessarily less likely to result in robust conclusions. Our findings tentatively suggest that setting a cut-off in terms of sample size may be less justified than including all studies or taking a random sample.
We found no evidence that prospective series or those in which consecutive cases were enrolled were associated with different outcome frequency to studies not having these features. Again, these criteria are frequently used to judge the quality of case series. These analyses were particularly constrained by inadequate reporting in the original studies. However, all the examples explored were surgical interventions, and it may be that, where retrospective designs were used, case ascertainment was good, reflecting the ease of identifying patients after surgical procedures from hospital records. Where ascertainment is more difficult retrospectively, for example for drug technologies, a greater difference may be shown between retrospective and prospective studies, or those in which recruitment was or was not consecutive. A further consideration in this and all the analyses showing no association between methodological features and outcome is the limited power to detect a significant difference afforded by the small number and heterogeneity of available studies.
In the case of spinal cord stimulation, a significant association between the quality score used by reviewers and outcome was demonstrated. This finding was the only example in which such a score had been used and may suggest that the use of quality scoring systems can differentiate between studies. However, we found no relationship between the individual study factors that made up the score and outcome in the case series. It is difficult, therefore, to conclude whether the score is acting as a valid measure of study quality. Because the relationship between methodological features and validity is not clear and how item scores should be summed into a single measure of study quality remains uncertain, it may be unwise to use such single scoring systems to judge the quality of case series.
Our other findings were inconsistent across the case studies, and the small number of associations demonstrated cannot be taken as good evidence on which to base any change in approach to the appraisal of case series. The finding, in one analysis, that independent measurement of outcome may be important in determining study quality is consistent with the findings of Juni and colleagues regarding blinding in relation to the quality of RCTs (10). This finding is potentially important, but further evidence is required on the importance of this factor in other case series.
The failure to demonstrate any relationship between date of publication and outcome, as with the other negative findings in this study, may be related to limited statistical power. Three other explanations are possible. First, the impact of early adopters and any effects of selection of cases in early studies may be short-lived and, therefore, be not apparent when a longer historical perspective is taken. Second, the effect of the learning curve in the early stages of use of a technology may counteract the effects of case selection. Third, technological improvements may have a very marked effect on successful outcomes.
We suggest that there are complementary positions for different methodological approaches in the ongoing evaluation of health technologies. In some cases, where the natural history of the condition is well understood and a dramatic effect is shown by a technology, comparative studies may not be considered necessary or ethical. We expect that such cases will be very few. It is more likely that case series will continue to be carried out in the early stages of technology diffusion, particularly in surgery where there is a less-stringent regulatory framework governing adoption. Such case series will be important in identifying whether technologies are likely to be efficacious. Early assessment of case series, therefore, may identify technologies that should be subject to more rigorous evaluation. Efficacy may then be established through well-conducted RCTs. This determination, however, may be insufficient to inform practice and policy, and for some technologies, it may be necessary to continue to collect data through case series or, more systematically, through the use of comprehensive registries. These options hold several potential advantages over case series led, more conventionally, by the clinicians delivering the intervention. Standardization of data collection and reporting is more feasible and investigation of the effects of center and operator would be facilitated. Furthermore, establishment of an ongoing system for reporting of process and outcomes would demonstrate changes in the nature of the technology, which is a particular issue in the development of surgical techniques. A key advantage of ongoing collection of data through large case series or registry studies is the identification of uncommon side effects in practice and a high degree of external validity. 
Using registry or case series data to make a comparison between technologies will continue to be necessarily and severely constrained by the nondirect nature of such comparisons and the effect of a large range of known and unknown confounders. However, the collection of data on the performance of technologies in undifferentiated populations over long time periods will complement and may extend the knowledge yielded in the generally short time scales and selected populations of RCTs.
In the investigation of possible impact of methodological aspects of case series, our examples were all surgical interventions. This means that our findings may not be generalizable to evaluations of other types of technology using case series. As noted above, the effects of learning curve and the possibility of bias arising from enthusiastic early adopters may act in opposite directions, making it difficult to discern any effect relating to timing of publication. A more important problem arising from the nature of the technologies examined is the introduction of further variance in the data as a result of operator effects that would not be apparent in, for example, drug technologies.
The small number of cases examined and the relatively limited number of studies in each set of case series are important limitations to precision and generalizability, which may be addressed by further research. However, it is likely that empirical opportunities for investigation will be few, as has been shown in comparisons of randomized and nonrandomized controlled trials. Under these circumstances, modeling studies may be valuable. Our analysis was necessarily limited to the aggregate reports of individual studies, and there is, therefore, the potential for ecological bias.
Some statistical considerations should be borne in mind when interpreting our results. Weighting for variance is generally favored in meta-analysis and gives greater weight to larger studies to improve precision. Although sample size is not the only determinant of variance in studies, in the current context, it is likely to dominate other factors. As we did not have data on the variance of individual studies, we were constrained to using sample size alone. We found no relationship between study size and outcome, a finding that does not support concern over increased bias in smaller observational studies. Hence, we have included weighted regression for completeness. Our main statistical approach was robust regression. This technique resists the influence of extreme outliers but results in slightly larger standard errors. Hence, the power of robust regression to detect true differences is slightly reduced compared with ordinary least squares regression. Given the nature of our data, we consider this method to be a reasonable analytic approach. The use of several methods of analysis has led, in some cases, to apparently discrepant results. Given the large number of analyses performed, the usual level of significance of p =.05 should be viewed with caution.
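To illustrate why a nominal threshold of .05 deserves caution when many analyses are run, the sketch below applies the Holm step-down adjustment to a set of p-values. No such correction was applied in the study; the procedure is shown only as one common way of formalizing the caution, and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on; stop at the first failure.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

# Hypothetical p-values from four analyses of the same data set.
p_values = [0.002, 0.04, 0.03, 0.005]
decisions = holm_reject(p_values)
```

Here all four p-values fall below .05 individually, but only the two smallest remain significant after adjustment.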
A general problem in the data examined is the very low “signal to noise” ratio. In other words, it is difficult to identify the effects of methodological factors from the effects of heterogeneity between studies in aspects of the populations and interventions. This is a particular problem where reporting of population and intervention characteristics was limited. The impact of unknown confounders, the fundamental reason for favoring RCTs over other study designs, is also an important consideration.
The potential role of publication bias should also be considered. Case series may be particularly prone to publication bias. Being much less robust than comparative or experimental designs, they may be less likely to achieve publication in any journal. Small case series may be more prone to this bias, as in other study designs, and those with less impressive findings and small size are likely to be at greatest risk. However, two findings suggest that publication bias may not be a particular problem in the examples studied. First, the finding of no association between sample size and outcome suggests that smaller studies are not more likely to be positive. Second, the very large range in sample sizes among studies suggests that even small studies are achieving publication. The extent to which these findings are likely to be replicated in other reviews of case series is unknown and further research into the extent and impact of publication bias in different study designs is required.
The general finding of poor reporting of methodological features in case series is a cause for concern and will continue to hamper research into case series and the ability of decision-makers to consider the appropriate influence of case series evidence on policy. We chose to constrain our analyses to reported data, that is, where a methodological feature was not reported in a study, this information was excluded from the analysis. The reporting of other study designs, notably RCTs and systematic reviews, has been improved considerably in recent years. While case series rightly occupy a position low in the hierarchy of evidence, their continued use in health technology assessments seems inevitable and strongly suggests the need to improve reporting quality.
CONTACT INFORMATION
Ken Stein, MSc, FFPHM, Senior Lecturer in Public Health (Ken.Stein@exeter.ac.uk), Kim Dalziel, BSc, Research Fellow (kim.dalziel@med.monash.edu.au), Ruth Garside, MA, Emanuela Castelnuovo, MSc, Research Fellow (Emanuela.Castelnuovo@PenTAG.nhs.uk), Ali Round, MRCP, FFPHM, Senior Lecturer in Public Health (Alison.Round@nhs.net), Peninsula Technology Assessment Group, Peninsula Medical School, Dean Clarke House, Southernhay East, Exeter, Devon EX1 1PQ, UK
References

Data Extracted on Population Characteristics, Outcome Measures, and Methodological Features

Summary of Studies Included in Analysis

Summary of Results