With the scarcity of resources in health care, efficacy and safety alone are insufficient for well-informed decision making on resource allocation. In the current environment, the priority is to reduce costs without deteriorating the quality of care, or to improve the quality of care at a reasonable cost (Reference Drummond, Sculpher, Torrance, O'Brien and Stoddart10). Consequently, interest in full economic evaluations, that is, studies comparing both the costs and the outcomes of at least two healthcare programs (Reference Drummond, Sculpher, Torrance, O'Brien and Stoddart10), has increased, and numerous countries have now developed specific guidelines for economic evaluations. As a consequence, the use of systematic reviews of economic evaluations to summarize knowledge has intensified, and quality assessment instruments have been developed to evaluate the quality of published economic evaluations. The most frequently used instruments (Reference Jefferson, Demicheli and Vale13) are the ten-item check-list of Drummond et al. (Reference Drummond, Sculpher, Torrance, O'Brien and Stoddart10) and the BMJ check-list (Reference Drummond and Jefferson9), which is based on the Drummond check-list.
Jefferson et al. showed that the quality assessment instruments in use were too disparate (Reference Jefferson, Demicheli and Vale13), illustrating the need for a validated and internationally accepted list. To respond to this need, the Consensus Health Economic Criteria (CHEC) list was recently developed (Reference Evers, Goossens, de Vet, van Tulder and Ament11).
On the other hand, neither the CHEC list nor the BMJ check-list was created with the aim of comparing studies in a simple, quantifiable way. Quantitative measures of quality would allow studies to be ranked according to a quality score. One solution is to apply an equal weight to each item, but this strategy does not allow analysts to take into account the relative importance of each criterion. For this reason, a new instrument has been developed: the Quality of Health Economic Studies (QHES) (Reference Chiou, Hay and Wallace5), a grading system in which weights differ according to the relative importance of each criterion.
The first objective of this study was to compare the BMJ check-list, the CHEC list, and the QHES instrument as quantitative tools to measure the quality of economic evaluations, and to examine the importance of weighting the criteria. The second objective was to assess the test–retest reliability and the inter-rater agreement for each instrument. Finally, problems associated with analyzing the quality of economic evaluations with these instruments were also determined and recommendations were made. The analysis was performed through a systematic review of economic evaluations of the surgical treatment of obesity.
METHODS
Description of the Quality Assessment Instruments
The BMJ Check-List. The BMJ set up a working party to develop a quality assessment check-list for use by both referees and authors. Drafts of the check-list were circulated to health economists and journal editors and were debated at the biannual meeting of the UK Health Economists' Study Group in January 1996. The final check-list was based on a broad consensus and contains thirty-five items under three headings: study design, data collection, and analysis and interpretation of results (see Table 1). This check-list concentrates on full economic evaluations but can also be used for partial economic evaluations or for reports and commentaries on economic evaluations. If an item is not applicable to a specific study, a "not appropriate" (NA) response can be given. The working party acknowledged that it is not possible to address all the points in a single article and that authors can, for example, refer the reader to other published sources. More details about this check-list can be found in the literature (Reference Drummond and Jefferson9).
The CHEC List. An initial item pool divided into nineteen categories was first developed by performing a literature search of Medline, Psychlit, Econlit, the Cochrane Library, and the National Health Service Economic Evaluation Database (NHS EED). The criteria list was then created using the Delphi method, which uses a panel of experts on a specific topic to reach a consensus (Reference Whitman21). In a first round, international experts were asked to give their opinion on the categories and items selected from the literature search. Comments and the resulting list were redistributed among the experts until a consensus was reached. Three rounds were sufficient to obtain the final criteria list. More details on the method used can be found in the literature (Reference Ament, Evers, Goossens, De Vet, Van Tulder, Donaldson, Mugford and Vale2;Reference Evers, Goossens, de Vet, van Tulder and Ament11).
The list contains nineteen yes-or-no questions (see Table 2). The authors recommended that, if not enough information is available in the article or in other published material to answer a question, a "No" response should be given. A description of the items can be found on the Web (www.beoz.unimaas.nl/chec/). It should be noted that this list was not created to analyze the quality of economic evaluations based on modeling studies. However, in this study, all economic evaluations found in the literature were analyzed with the three lists, including modeling studies. Consequently, the quality scores generated by the CHEC list for modeling studies have to be interpreted with caution.
The QHES List. A steering committee comprising five experts in the field of health economics and three investigators developed a check-list for economic evaluations from a literature search using Medline, Healthstar, and the Cochrane databases. From existing guidelines and check-lists, the committee selected, by consensus, sixteen criteria with a "Yes" or "No" format (see Table 3). Weights for each criterion were then estimated using a general linear regression (random effects) based on data collected from a conjoint analysis survey of 120 international health economists. More details about the QHES list can be found in the literature (Reference Chiou, Hay and Wallace5).
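The idea behind this weight estimation can be illustrated with a minimal sketch. The Python fragment below simulates conjoint-style survey data and recovers per-criterion weights by regression; all data are invented for illustration, and ordinary least squares stands in here for the random-effects general linear model actually used by Chiou et al.

```python
import numpy as np

rng = np.random.default_rng(0)
n_raters, n_items = 120, 16          # survey size and criteria count from the text

# Hypothetical conjoint data: each row is a profile of criteria met (1) or not (0);
# each y is the overall quality rating (0-100) a surveyed economist assigned to it.
X = rng.integers(0, 2, size=(n_raters, n_items)).astype(float)
true_w = rng.dirichlet(np.ones(n_items)) * 100     # hidden weights summing to 100
y = X @ true_w + rng.normal(0, 5, size=n_raters)   # noisy overall ratings

# Recover per-criterion weights by least squares (a simplification of the
# random-effects model), then rescale so the weights sum to 100 points.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
w_scaled = 100 * w_hat / w_hat.sum()
print(np.round(w_scaled, 1))
```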
Study Selection
The different quality assessment instruments were applied to nine economic evaluations of surgical treatment techniques for obesity. More details about the studies and the selection criteria are described elsewhere (Reference Lambert, Kohn and Vinck14). While initially only full economic evaluations were selected for review, we also tested the quality assessment instruments on cost-outcome descriptions, that is, studies describing both costs and effects but not presenting an incremental cost-effectiveness ratio (ICER). Five full economic evaluations, of which two included both a cost-utility and a cost-effectiveness analysis and three included only a cost-utility analysis, were included in the quality assessment exercise (Reference Chevallier, Daoud, Szwarcensztein, Volcot and Rupprecht4;Reference Clegg, Colquitt and Sidhu7;Reference Craig and Tseng8;Reference van Gemert, Adang and Kop19;Reference van Mastrigt, van Dielen, Severens, Voss and Greve20). Moreover, four cost-outcome description studies were evaluated with the three quality assessment instruments (Reference Agren, Narbro and Jonsson1;Reference Christou, Sampalis and Liberman6;Reference Martin, Tan and Horn16;Reference Nguyen, Goldman and Rosenquist17).
Quality Assessment of Economic Evaluations
The quality of the selected studies was assessed independently by two health economists (rater 1 and rater 2) using the BMJ check-list, the CHEC list, and the QHES list. Each economist blindly evaluated the quality of the studies with the three instruments consecutively. Moreover, rater 1 repeated the analysis 8 weeks later. During the investigation, the guidelines of the instruments were followed, and an inventory was made of the problems encountered when analyzing the quality of economic evaluations according to these guidelines.
Quality scores were then calculated for each study. In a first stage, a score with an equal weight for each item was calculated as a quantitative proxy for the evaluation's quality. In a second stage, the importance of weighting the criteria was examined. For the QHES instrument, the weights determined by Chiou et al. were used (Reference Chiou, Hay and Wallace5). For the BMJ check-list and the CHEC list, no weighting exists; implicit weights determined by one of the assessors, according to the relative importance he conferred on each item (subjective assessment), were thus used. In summary, three types of quality scores were obtained: a score without weighting of the criteria, a score with an implicit weighting for the BMJ check-list and the CHEC list, and a score with the explicit weighting determined by Chiou et al. for the QHES instrument.
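As a minimal sketch of how such scores can be computed, the following Python fragment derives both an equal-weight and a weighted score on a 0–100 scale from yes/no item responses; the responses and weights shown are invented for illustration and do not correspond to any study in the review.

```python
import numpy as np

def quality_score(responses, weights=None):
    """Quality score on a 0-100 scale from yes/no item responses.

    responses: sequence of 1 ("Yes") / 0 ("No") item judgments.
    weights:   optional per-item weights; equal weights if omitted.
    """
    r = np.asarray(responses, dtype=float)
    w = np.ones_like(r) if weights is None else np.asarray(weights, dtype=float)
    return 100 * (w @ r) / w.sum()

# Hypothetical responses to the nineteen CHEC items for one study.
responses = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
print(quality_score(responses))                  # equal weighting

# Implicit weights an assessor might assign (illustrative values only).
weights = [2, 1, 1, 3, 2, 2, 1, 1, 2, 3, 1, 1, 2, 2, 1, 1, 2, 1, 1]
print(quality_score(responses, weights))         # weighted score
```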
Statistical Analysis
To compare the instruments, the range and the mean of the quality scores generated by each instrument were calculated. Ranking differences between instruments were then assessed using the Spearman rank correlation coefficient. In this study, we considered a correlation coefficient of r ≥ 0.7 as high, 0.7 > r ≥ 0.5 as moderate, and r < 0.5 as low.
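For illustration, a Spearman rank correlation between two instruments' scores can be computed as follows; the scores are hypothetical, and the banding applies the thresholds stated above.

```python
from scipy.stats import spearmanr

# Hypothetical quality scores for the nine studies under two instruments.
bmj_scores  = [72, 55, 88, 40, 63, 91, 35, 78, 60]
chec_scores = [68, 50, 84, 45, 70, 89, 30, 75, 58]

# Spearman's coefficient compares the rankings, not the raw scores.
r, p = spearmanr(bmj_scores, chec_scores)
band = "high" if r >= 0.7 else "moderate" if r >= 0.5 else "low"
print(f"Spearman r = {r:.2f} ({band}), p = {p:.3f}")
```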
Moreover, test–retest reliability between times 1 and 2 was assessed for rater 1 on each instrument using model 3 of the six ICCs discussed by Shrout and Fleiss (Reference Shrout and Fleiss18), that is, ICC(3,1), where raters are assumed to be representative of the entire population of raters (Supplementary Figure 1, available at www.journals.cambridge.org/thc).
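Both ICC forms used in this study follow directly from the mean squares of a two-way ANOVA. The sketch below computes ICC(3,1), used here for test–retest reliability, and ICC(2,1), used below for inter-rater agreement, from the Shrout and Fleiss formulas; the scores are invented for illustration.

```python
import numpy as np

def icc_shrout_fleiss(ratings):
    """ICC(3,1) and ICC(2,1) from the Shrout & Fleiss (1979) mean squares.

    ratings: (n_targets, k_raters) array; rows are studies, columns ratings.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()

    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols

    bms = ss_rows / (n - 1)             # between-targets mean square
    jms = ss_cols / (k - 1)             # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))  # residual mean square

    icc31 = (bms - ems) / (bms + (k - 1) * ems)
    icc21 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    return icc31, icc21

# Hypothetical scores: rows = nine studies, columns = rater 1 at times 1 and 2.
scores = np.array([[72, 70], [55, 58], [88, 86], [40, 43], [63, 60],
                   [91, 90], [35, 38], [78, 80], [60, 57]])
icc31, icc21 = icc_shrout_fleiss(scores)
print(f"ICC(3,1) = {icc31:.2f}, ICC(2,1) = {icc21:.2f}")
```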
Finally, for each instrument, the inter-rater agreement at time 1 was estimated at two levels: comparison of the total score of each article using ICC(2,1) (Reference Shrout and Fleiss18), where raters are assumed to be a random subset of all possible raters, and comparison of the results per item using kappa values. Kappa values less than 0.40, between 0.40 and 0.74, and between 0.75 and 1 were defined as poor, fair to good, and perfect agreement, respectively (Reference Fleiss12;Reference Landis and Koch15). Tests were performed using SAS software, version 9.
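Per-item agreement of this kind can be illustrated with Cohen's kappa; the sketch below uses invented yes/no responses for a single check-list item across the nine studies and applies the agreement bands defined above.

```python
from sklearn.metrics import cohen_kappa_score

def agreement_band(kappa):
    """Bands from the text: <0.40 poor, 0.40-0.74 fair to good, >=0.75 perfect."""
    if kappa < 0.40:
        return "poor"
    return "fair to good" if kappa < 0.75 else "perfect"

# Hypothetical "Yes"(1)/"No"(0) responses of the two raters for one item
# across the nine studies.
rater1 = [1, 0, 1, 1, 0, 1, 1, 0, 1]
rater2 = [1, 0, 0, 1, 1, 1, 0, 0, 1]

kappa = cohen_kappa_score(rater1, rater2)
print(f"kappa = {kappa:.2f} ({agreement_band(kappa)})")
```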
RESULTS
Comparison of the Instruments
The comparison of instruments showed that they mainly analyze similar items. Nevertheless, some differences can be highlighted (see Table 4).
Table 4. Instruments’ Comparisons
BMJ, British Medical Journal; CHEC, Consensus Health Economic Criteria; QHES, Quality of Health Economic Studies; Y, handled; N, not handled.
First, only the BMJ check-list investigates whether the economic importance of the study question is stated and whether the choice of alternatives is justified. On the other hand, the BMJ check-list does not assess whether the choice of cost and outcome items is appropriate, as the other two instruments do. Finally, it is the only instrument that does not include a question about conflicts of interest of the authors.
Second, the CHEC list is designed for clinical trials and observational studies. Consequently, there is no item on model characteristics. Moreover, this instrument does not determine whether the limitations of the studies are specified. On the other hand, it is the only instrument asking whether ethical aspects and the generalizability of the results are investigated.
Third, the QHES instrument determines whether the subgroups analyzed are appropriately defined but does not examine whether details on the population and on the study design are specified, in contrast to both the BMJ check-list and the CHEC list. Finally, this instrument does not investigate whether details about price adjustments for inflation or about currency conversion are given.
Inventory of Problems Encountered During the Quality Assessment
It was often difficult to judge from the studies whether an item was fulfilled, because too little information was given in the publication. For example, details on cost calculations were often limited, and sometimes only sources were given. In such situations, it is thus important to consult these sources to evaluate the quality of the studies more accurately.
Moreover, the BMJ check-list and the QHES instrument were mainly adapted to modeling studies, while the CHEC list was conceived for clinical trials and observational studies. Consequently, the item assessing whether details of the model were given was, for example, not applicable to clinical trials and observational studies. Hence, it could be interesting to have an instrument adapted to several study designs, with specific subquestions for each design.
It was also often difficult to choose between a "Yes" and a "No" response. Some items grouped various criteria; consequently, if only one of the criteria was not fulfilled, a "No" response had to be given, even if the other criteria were fulfilled. The possibility of using an intermediate value such as "partially fulfilled" could thus be interesting. This problem was most pronounced with the QHES instrument. For example, one item tested whether the time horizon was relevant, whether costs and outcomes were discounted, and whether the discount rate was justified. It would be interesting to test the impact of subdividing this kind of item.
Differences in Quality Scores Between Instruments
With equal weights between items, the quality scores of the nine selected studies over the three ratings (rater 1, time 1; rater 1, time 2; rater 2, time 1) varied between 30.8 and 90.0 of 100 points on the BMJ check-list, between 15.8 and 84.2 of 100 points on the CHEC list, and between 6.3 and 87.5 of 100 points on the QHES instrument. Means and standard deviations of the studies' quality scores for each instrument and the three ratings are detailed on the Web site (Supplementary Table 5, available at www.journals.cambridge.org/thc).
With a weighting between items, the quality scores of the nine selected studies over the three ratings varied between 6 and 92 of 100 points on the BMJ check-list, between 14 and 89 of 100 points on the CHEC list, and between 22 and 77 of 100 points on the QHES instrument.
Hierarchical Ranking of Studies
The hierarchical ranking of the studies based on their quality scores and the Spearman rank correlation coefficients can be found on the Web site (Supplementary Tables 6 and 7, available at www.journals.cambridge.org/thc). A high Spearman rank correlation between instruments was found, except for rater 1 at time 1, where the correlation between the BMJ check-list and the CHEC list was moderate.
Moreover, for each instrument, the Spearman rank correlation coefficients between weighted and unweighted scores were high, ranging from 0.83 to 0.99. Thus, weighting the criteria had little impact on the ranking. On the other hand, the ranking varied according to the assessor, as shown by the inter-rater agreement.
Test–retest Reliability and Inter-rater Agreement
Test–retest reliability in terms of ICC(3,1) was good for all instruments, that is, 0.99 (95 percent CI, 0.86–0.99) for the BMJ check-list, 0.97 (95 percent CI, 0.73–0.98) for the CHEC list, and 0.95 (95 percent CI, 0.75–0.99) for the QHES instrument. However, inter-rater agreement was poor. For the BMJ check-list, agreement was poor for twenty-seven of thirty-five items (77 percent), fair to good for six of thirty-five items (17 percent), and perfect for only two of thirty-five items (6 percent). For the CHEC list, agreement was poor for fifteen of nineteen items (79 percent) and perfect for four of nineteen items (21 percent). For the QHES instrument, agreement was poor for ten of sixteen items (63 percent), fair to good for three of sixteen items (19 percent), and perfect for three of sixteen items (19 percent). Overall inter-rater agreement in terms of ICC(2,1) was 0.52 (95 percent CI, 0.21–0.83) for the BMJ check-list, 0.33 (95 percent CI, 0.07–0.71) for the CHEC list, and 0.33 (95 percent CI, 0.02–0.73) for the QHES instrument.
DISCUSSION
The comparison of instruments highlighted the subjective character of quality assessment. Indeed, the results were influenced not so much by the instruments used as by the experts who analyzed the studies. As shown in this study, the instruments mainly assess similar items, which could explain the high Spearman rank correlation coefficients.
The poor agreement between experts could be explained by various factors. First, the time spent analyzing the studies might have an impact on the results. One author spent around 1 day per study to assess its quality in depth and systematically returned to the referenced sources if insufficient details were provided in the article itself. The other expert spent around 2 days to assess the quality of all the studies and based his analysis on the main article only.
Second, the subjectivity of the examiners could also influence the responses. Complete fulfillment of the criteria was rare, and intermediate responses were not allowed. Strict raters could tend to assign a 0 value if one criterion of an item was not fulfilled, while more lenient raters could tend to assign a 1 value, considering that, on the whole, the criteria were fulfilled.
Third, the experience of the rater in economic evaluation could also play a role. One rater had worked in the health economics domain for nearly 20 years, while the experience of the second rater was only 2 years. Thus, it is possible that they considered the quality of the studies differently.
Fourth, the perception and interpretation of the items and the ambiguity of the responses also influenced the results. Items were sometimes broad and could be interpreted in various ways. Some items also referred to a specific study design, and when the design of the study under review did not match, raters could react differently.
It should also be noted that the BMJ check-list and the CHEC list were created as qualitative instruments, not as scoring instruments. On the other hand, calculating a quality score from these instruments allowed us to easily obtain an idea of the ranking of the studies according to their quality.
Finally, caution is needed when interpreting our results, given that the limited number of studies led to wide confidence intervals for the inter-rater agreement. In a previous study, two raters analyzed the quality of thirty economic evaluations of health promotion with the QHES instrument and found better inter-rater reliability (95 percent CI, 0.64–0.91) (Reference Au, Prahardhi and Shiell3). However, our results argue for further research to estimate the overall role of the evaluator in assessing the quality of economic evaluations. To this end, an international study including a higher number of evaluators and, in particular, a larger sample of studies from various areas should be conducted.
In conclusion, our findings highlight that, in practice, results are influenced not so much by the instrument used as by the assessor. It is thus essential to have the quality of economic evaluations analyzed by at least two blinded experts and to base the final scoring on a consensus. Moreover, a clear definition of each item should be given and respected by the raters. Experts should also spend a substantial period of time analyzing the studies thoroughly and should consult the sources of information cited in the article if not enough details are provided in the study itself. Finally, in the future, it would be interesting to create a single instrument adapted to each study design and to introduce the possibility of using an intermediate score value.
CONTACT INFORMATION
Sophie Gerkens, MSc, PhD Candidate (sophie.gerkens@uclouvain.be), Health Economist, School of Public Health–Unité de Socioéconomie de la santé (SESA), Université Catholique de Louvain, 30 Clos Chapelle-aux-Champs box 3041, Brussels 1200, Belgium
Ralph Crott, PhD (ralph.crott@uclouvain.be), Health Economist, Department of Medicine, Cliniques Universitaires Saint-Luc, 10, Av. Hippocrate, Brussels B-1200, Belgium
Irina Cleemput, PhD (Irina.Cleemput@kce.fgov.be), Expert Economic Analysis, Department of Research, Belgian Health Care Knowledge Centre (KCE), 62 Rue de la Loi, Brussels B-1040, Belgium
Jean-Paul Thissen, MD, PhD (Thissen@diab.ucl.ac.be), Professor, Department of Endocrinology, Université Catholique de Louvain, 10, av. Hippocrate, Brussels B-1200, Belgium; Chief, Department of Endocrinology, Cliniques Universitaires Saint-Luc, 10, av. Hippocrate, Brussels B-1200, Belgium
Marie-Christine Closon, PhD (closon@sesa.ucl.ac.be), Professor, School of Public Health – Unité de socioéconomie de la santé (SESA), Université Catholique de Louvain, 30 Clos Chapelle-aux-Champs, Box 3041, Brussels B-1200, Belgium
Yves Horsmans, MD, PhD (yves.horsmans@uclouvain.be), Professor, Department of Gastroenterology, Université Catholique de Louvain; Chief, Department of Gastroenterology, Cliniques Universitaires Saint-Luc, 10, Av. Hippocrate, Brussels B-1200, Belgium
Claire Beguin, MD, PhD (claire.beguin@uclouvain.be), Chief, Department of Medical Information and Statistics, Cliniques Universitaires Saint-Luc, 10, Av. Hippocrate, Brussels B-1200, Belgium