Published online by Cambridge University Press: 04 August 2005
Objectives: There is relatively little evidence on the test–retest reliability of utility scores derived from multiattribute measures. The objective was to estimate test–retest reliability for Health Utilities Index Mark 2 (HUI2) and Mark 3 (HUI3) utility scores in patients recovering from hip fracture.
Methods: We enrolled an inception cohort of hip fracture patients within 3 to 5 days of surgery. Baseline assessments included the Functional Independence Measure (FIM™), Folstein Mini-Mental State Examinations, and the HUI2 and HUI3 questionnaire. Follow-up assessments at 1, 3, and 6 months also included a global change question. Test–retest reliability was assessed as agreement between 3- and 6-month scores using the intraclass correlation coefficient (ICC). Two approaches were used to classify patients as stable; a third approach based on the generalizability theory was also used. Patients were classified as stable if their FIM™ overall scores changed by 10 points or fewer and if they classified themselves as having experienced no or only a little change according to their global change question.
Results: Complete data at both the 3- and 6-month assessments based on self-report were available for 195 patients; 141 patients with complete data were classified as stable. The ICCs for HUI2 and HUI3 for stable patients were 0.71 and 0.72; the ICCs derived from the generalizability theory were 0.76 and 0.77.
Conclusions: Test–retest reliability for HUI in this cohort was similar to reliability estimates for other preference-based multiattribute and generic health-profile measures—in the acceptable range for making valid group-level comparisons.
Although there is evidence on the test–retest reliability of directly measured utilities, there is relatively little published evidence on the test–retest reliability of utility scores derived from multiattribute measures such as the EQ-5D (30), Short-Form 6D (SF-6D; 4), Quality of Well Being scale (QWB; 18), and Health Utilities Index (HUI; 9;10;12;16). A standard design for studies investigating test–retest reliability is to administer the instrument to a group of respondents who are expected not to experience changes in health status and then to readminister the instrument shortly thereafter (24;37). Often the interval between administrations is selected to be long enough that respondents are likely to have forgotten their previous responses but short enough that health status is unlikely to have changed. Alternatively, using the generalizability theory, one can calculate the intraclass correlation coefficient (ICC) between scores at the two administrations by dividing the between-subject variation by the total variation (37). Finally, one can use longitudinal studies to obtain estimates of test–retest reliability by assessing agreement among scores for stable patients. A challenge in this approach is to classify subjects as stable versus nonstable (25).
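As a sketch of the variance-ratio definition mentioned above, the ICC can be written as follows. The symbols are illustrative assumptions rather than notation from the cited sources: $\sigma^2_{\text{subject}}$ denotes between-subject variance and $\sigma^2_{\text{other}}$ lumps together all remaining sources of variation (for example, occasion and residual error).

```latex
\mathrm{ICC}
  = \frac{\sigma^2_{\text{subject}}}{\sigma^2_{\text{total}}}
  = \frac{\sigma^2_{\text{subject}}}
         {\sigma^2_{\text{subject}} + \sigma^2_{\text{other}}}
```

Under this reading, the ICC approaches 1 when almost all variation in scores reflects stable differences among subjects, and approaches 0 when variation across administrations dominates.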
This investigation of test–retest reliability is part of a larger prospective cohort study examining recovery after hip fracture. Inclusion criteria included age 65 or older, ability to speak English, availability of a friend or family member who could act as a proxy respondent, and residence in the Capital Health region (Edmonton and surrounding area) of Alberta, Canada. Exclusion criteria included a pathological fracture other than one caused by osteoporosis, Paget disease, readmission for a previous fracture, and previous fracture within the past 5 years. Patients were recruited from October 2000 until December 2001. Questionnaires were administered to patients who had Folstein Mini-Mental State Examination (MMSE) scores of 18 or higher (11;39). Data for patients with MMSE scores <18 were collected from proxy respondents. For this analysis, we report only the results from patients with MMSE scores greater than or equal to 18 at baseline.
The baseline assessment was performed in person by a trained professional interviewer within 3 to 5 days after surgery. Follow-up interviews were conducted by telephone at 1, 3, and 6 months after fracture. Follow-up interviews included the same battery of instruments used at baseline plus a nine-point global change (over the past 1 month) question (extremely worse; a lot worse; somewhat worse; a little worse; no change; a little better; somewhat better; a lot better; extremely better).
Data on clinical and demographic characteristics of patients were collected. Information on comorbidities was obtained in interviews with patients using a list derived from chronic conditions listed in the Charlson Comorbidity Index (5) and the Statistics Canada National Population Health Survey instrument.
Although many patients improved substantially during the first month and continued to improve over the next 2 months, descriptive results for the entire cohort indicated little or no change in the overall health of the cohort between the month 3 and month 6 assessments. We, therefore, based our examination of test–retest reliability on a comparison of the 3- and 6-month scores.
The HUI2 is a multiattribute utility measure that includes a health-status descriptive system and a multiplicative multiattribute utility function that provides overall utility scores for HUI2 health states on the conventional dead = 0.00 to perfect health = 1.00 scale (9;12). The HUI2 covers seven attributes (dimensions) of health status: sensation (vision, hearing, and speech), mobility, emotion, cognition, self-care, pain, and fertility. (The initial application of HUI2 involved survivors of cancer in childhood for whom low fertility and infertility are issues; the fertility dimension was omitted from the study reported here.) Each attribute has four or five levels, ranging from highly impaired, levels 4 or 5 (for instance, level 5, unable to control or use arms and legs, for mobility), to normal, level 1. The HUI2 focuses on capacity rather than performance. The multiplicative HUI2 scoring function is based on preference scores obtained from a random sample of parents in the general population in Hamilton, Ontario, Canada (40).
The attributes included in the HUI3 are vision, hearing, speech, ambulation, dexterity, emotion, cognition, and pain (10;12;16). There are five or six levels per attribute in the HUI3. The multiplicative scoring function for the HUI3 is based on preference scores obtained from a random sample of respondents 16 years of age and older in Hamilton, Ontario, Canada (10). Scores range from -0.36 (the all-worst HUI3 state) to 0.00 for dead to 1.00 for perfect health.
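For readers unfamiliar with multiplicative scoring functions, one general form in which such functions are commonly written on the dead = 0.00 to perfect health = 1.00 scale is sketched below. The symbols are assumptions introduced for illustration: $b_j$ is the multiplier for the level reported on attribute $j$, $n$ is the number of attributes, and $c$ is a scaling constant. The actual values of $c$ and $b_j$ come from the published scoring tables (9;10) and are not reproduced here.

```latex
u = c \prod_{j=1}^{n} b_j \; - \; (c - 1),
\qquad b_j = 1 \ \text{for level 1 (normal) on attribute } j
```

With every attribute at level 1, the product equals 1 and $u = 1.00$. Because $c > 1$, states with a sufficiently small product receive scores below 0.00, that is, states valued as worse than dead.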
The MMSE is a screening instrument for cognitive status (11). Eleven questions cover orientation to time, orientation to place, registration of three words, attention and calculation, recall, language, and visual construction. We defined the severity of cognitive impairment by using three cutoff levels (32;39): no cognitive impairment (24 to 30 points), mild impairment (18 to 23 points), and severe impairment (0 to 17 points; patients in this group at baseline were excluded from the analyses presented here).
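The cutoffs above can be expressed as a small lookup. This is a minimal sketch: the function names and the returned labels are hypothetical, and only the score bands and the self-report threshold of 18 come from the text.

```python
def classify_mmse(score: int) -> str:
    """Map an MMSE total (0 to 30) to the three severity bands used here."""
    if not 0 <= score <= 30:
        raise ValueError("MMSE total must lie between 0 and 30")
    if score >= 24:
        return "no impairment"      # 24 to 30 points
    if score >= 18:
        return "mild impairment"    # 18 to 23 points
    return "severe impairment"      # 0 to 17 points


def respondent_type(score: int) -> str:
    """Scores of 18 or higher were self-reported; lower scores used a proxy."""
    return "self" if score >= 18 else "proxy"
```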
The Functional Independence Measure™ (FIM™) is a performance-based measure of disability based on the amount of assistance required to perform basic activities of daily living (13). The FIM™ includes eighteen items covering self-care, sphincter control, transfers, locomotion, communication, social adjustment/cooperation, and cognition/problem solving. Scores range from 18 to 126, with a higher score representing greater independence. Several studies provide evidence on the use of the telephone version of the FIM™ and/or its use in patients with hip fracture or elderly patients (13;14;26–29;34;36).
Wallace et al. (41) have suggested that the minimal clinically important difference for FIM™ scores is 11. Conservatively, we chose a slightly more stringent criterion to define stable patients—those experiencing a change of 10 points or fewer.
The usefulness of one approach to assessing test–retest reliability relies on identifying a “known group” of stable patients. Following the example of Deyo et al. (7), we classified patients as stable if they fulfilled two criteria: (i) a less than clinically important change in the overall FIM™ score (10 or fewer); and (ii) a response on the global change question of no change, a little worse, somewhat worse, a little better, or somewhat better (± two categories). Given that recall of previous health status is less than ideal (25), we did not rely solely on results from the global change question. In a secondary analysis, a more stringent criterion was used: no change, a little worse, or a little better (± one category). In addition, using the generalizability theory (37), we estimated test–retest reliability for all available patients by assessing agreement between HUI scores at 3 and 6 months.
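The two-part stability rule can be sketched as follows. The function and variable names are hypothetical. The sketch also assumes that a single global change response is supplied for the interval of interest, because the text does not state which follow-up response was used.

```python
# The nine response options of the global change question, worst to best.
GLOBAL_CHANGE = [
    "extremely worse", "a lot worse", "somewhat worse", "a little worse",
    "no change",
    "a little better", "somewhat better", "a lot better", "extremely better",
]
NO_CHANGE_INDEX = GLOBAL_CHANGE.index("no change")


def is_stable(fim_3mo, fim_6mo, global_change,
              max_fim_change=10, max_categories=2):
    """Return True when both stability criteria are met.

    (i)  the overall FIM score changed by max_fim_change points or fewer;
    (ii) the global change response lies within max_categories steps
         of "no change" (2 for the primary analysis, 1 for the secondary).
    """
    fim_ok = abs(fim_6mo - fim_3mo) <= max_fim_change
    distance = abs(GLOBAL_CHANGE.index(global_change) - NO_CHANGE_INDEX)
    return fim_ok and distance <= max_categories
```

Passing `max_categories=1` reproduces the more stringent secondary criterion.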
Agreement among measures was assessed using the kappa statistic (19). Kappa values of <0.00 are interpreted as indicating poor agreement, values of 0.00–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect (19).
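For illustration, an unweighted kappa for two binary classifications can be computed from a 2 × 2 table as sketched below. The function names are hypothetical, the counts in the usage line are invented for the example and are not study data, and the handling of values that fall between the published bands (for instance, 0.205) is an assumption.

```python
def cohen_kappa(both_yes, yes_no, no_yes, both_no):
    """Unweighted kappa for two raters (or criteria) with a yes/no outcome."""
    n = both_yes + yes_no + no_yes + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + yes_no) / n   # marginal "yes" rate, first criterion
    b_yes = (both_yes + no_yes) / n   # marginal "yes" rate, second criterion
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Apply the descriptive bands listed above."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical counts, chosen only to illustrate the calculation.
example = cohen_kappa(both_yes=45, yes_no=5, no_yes=10, both_no=40)
```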
We assessed test–retest reliability using an ICC (7;33;35) derived from a mixed model two-way analysis of variance in which assessment was the fixed factor and patients were modeled as random. (Estimates were also done using a two-way random effects model in which both time and patient were random.) For the analyses based on the generalizability theory, the ICC was calculated as between-subject variability divided by total variability. Analyses were performed with SPSS Version 12 (SPSS, Inc., Chicago, IL). Using criteria proposed by Juniper et al. (17), ICCs > 0.80 were classified as excellent agreement, 0.61 through 0.80 as good agreement, 0.41 through 0.60 as moderate agreement, and ≤ 0.40 as poor to fair. Ethics approval was obtained before data collection and medical chart review from the Health Research Ethics Review Board of the University of Alberta and Capital Health Region.
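A minimal sketch of how single-measure ICCs can be obtained from a two-way analysis of variance follows. It is not the SPSS procedure used in the study. The text does not state which ICC form was reported, so the sketch assumes the widely used consistency form for the mixed model and the absolute-agreement form for the random effects model. Function names and the example scores are hypothetical.

```python
def icc_two_way(scores):
    """Single-measure ICCs from a patients x assessments table.

    scores: list of rows, one per patient, one column per assessment.
    Returns (consistency, agreement):
      consistency = (MSR - MSE) / (MSR + (k - 1) * MSE)
      agreement   = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # assessments
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return consistency, agreement


def interpret_icc(icc):
    """Descriptive bands from the text, assuming ICCs rounded to two decimals."""
    if icc > 0.80:
        return "excellent"
    if icc > 0.60:
        return "good"
    if icc > 0.40:
        return "moderate"
    return "poor to fair"


# Hypothetical 3- and 6-month utility scores for three patients.
consistency, agreement = icc_two_way([[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]])
```

In this invented example every patient shifts by the same amount, so the consistency form is essentially 1 while the agreement form is lower, which shows how a systematic change between assessments affects only the latter.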
In total, 383 patients were enrolled in the study. Of these, 265 patients had baseline MMSE scores greater than or equal to 18 and, thus, were potentially available for analysis. Complete data based on patient self-assessment at months 3 and 6 are available for 195 patients. Using the ± two categories criterion from the global change question, 167 (85.6 percent) of these patients were classified as stable and 28 (14.4 percent) as not stable. Using the criterion of a change in overall FIM™ score ≤ 10, 160 (82.1 percent) patients were classified as stable and 35 (17.9 percent) as not stable. The percentage agreement between the two criteria was 77; the kappa statistic (unweighted) was 0.54, indicating moderate agreement (19).
Combining the two criteria, 141 (72.3 percent) patients were classified as stable by both criteria, 9 (4.6 percent) were classified as not stable by both, 19 (9.7 percent) were classified as stable by the FIM™ but not by the global change question, and 26 (13.3 percent) were classified as stable by the global change question but not by the FIM™. The 141 patients classified as stable by both criteria were used to assess test–retest reliability (Tables 1 and 2).
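The percentages and marginal totals reported above follow directly from the four cell counts, as the short check below shows. Variable names are hypothetical; the counts are those given in the text.

```python
# Keys are (FIM classification, global change classification).
counts = {
    ("stable", "stable"): 141,
    ("not stable", "not stable"): 9,
    ("stable", "not stable"): 19,
    ("not stable", "stable"): 26,
}

total = sum(counts.values())
percentages = {cell: round(100 * n / total, 1) for cell, n in counts.items()}

# Marginal totals for each criterion taken on its own.
fim_stable = sum(n for (fim, _), n in counts.items() if fim == "stable")
global_stable = sum(n for (_, glob), n in counts.items() if glob == "stable")

# Percentage of patients on whom the two criteria agree.
agreement_pct = round(
    100 * sum(n for (fim, glob), n in counts.items() if fim == glob) / total
)
```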
In the secondary analysis using the ± one category criterion for the global change question and change in FIM™ score ≤10, 124 (63.6 percent) patients were classified as stable by both criteria; the percentage agreement between the two criteria was 69 and the kappa was 0.37, indicating fair agreement (19).
For the analyses based on the generalizability theory, complete HUI2 scores were available for the 3- and 6-month assessments for 136 patients. For the HUI3, complete data were available for 137 patients.
Several patients skipped one or more items on the HUI questionnaire; frequently skipped questions included those on vision and hearing. As a result, complete data are available at both the 3- and 6-month assessments for only 104 patients for HUI2 (37 missing) and 105 for HUI3 (36 missing). At the 3-month assessment, patients for whom HUI data were incomplete may have been less healthy than those for whom data were complete. This tendency is not apparent at the 6-month assessment. Nonetheless, the mean HUI scores for the patients used in the analysis of test–retest reliability should not be regarded as representative of results for the cohort.
Mean change scores between the 3- and 6-month assessments indicate little change over the period (Table 3). The ICCs for the overall HUI2 and HUI3 scores were 0.71 and 0.72 (Table 4). Results are virtually identical for the ± two or ± one category criteria from the global change question. Results are also virtually identical when using a two-way random effects model (data not shown). Results based on the generalizability theory are similar but somewhat higher, 0.76 and 0.77. When patients are divided into two groups, not cognitively impaired versus cognitively impaired at 3 and 6 months, the ICCs based on the generalizability theory are 0.79 for HUI2 and 0.77 for HUI3 for the not cognitively impaired compared with 0.65 and 0.67 for the cognitively impaired.
The results provide evidence that the test–retest reliability of HUI2 and HUI3 falls within the acceptable range of 0.70 or higher generally recommended for group-level comparisons (15;22;31). The ICCs are below the 0.90 level generally recommended for individual-level use of scores. The ICCs based on the generalizability theory for the full cohort are higher, 0.76 and 0.77, and again acceptable for group-level comparisons.
Our reliability estimates are consistent with those from previous studies of the HUI. In an assessment of test–retest reliability conducted as part of a pretest of the Statistics Canada National Population Health Survey using a provisional scoring system for HUI3, Boyle et al. (1) report an ICC of 0.77. Suarez-Almazor et al. (38) reported 3-month and 6-month test–retest reliability ICCs of 0.78 and 0.80 for HUI2 in a cohort of patients with low back pain. In the same study, ICCs for EQ-5D index scores were 0.76 and 0.50. (ICCs for EQ-5D visual analogue scale scores are reported in the studies by Dorman et al. and Macran (8;21).) Our results are also similar to those of Luo et al. (20) for patients with rheumatic disease, who reported test–retest ICCs for EQ-5D and HUI3 of 0.64 and 0.75.
Brazier et al. (2) reported a Spearman correlation of 0.67 between test and retest EQ-5D index scores in a study of elderly female patients in the United Kingdom. They also reported additional evidence for EQ-5D index scores of 0.83 in patients with chronic obstructive pulmonary disease and 0.55 in patients with rheumatoid arthritis. Coons et al. (6) found an ICC of 0.78 in a 2-week test–retest study using EQ-5D index scores. One-day test–retest reliability ICCs for the QWB scale range from 0.78 to 0.99, with most values exceeding 0.90 (3).
It is also important to compare the ICCs for the preference-based multiattribute measures with ICCs for generic health-profile measures. In their review, Coons et al. (6) reported ICCs of 0.60 to 0.81 for various domain scores derived from SF-36, 0.87 to 0.97 for the Sickness Impact Profile, and 0.67 to 0.97 for the Dartmouth COOP Charts. McHorney and Tarlov (23) report ICCs of 0.77 to 0.85 for the Nottingham Health Profile, 0.42 to 0.88 for the Dartmouth COOP Charts, 0.30 to 0.78 for the Duke Health Profile, and 0.60 to 0.81 for the SF-36.
Several study limitations should be noted. First, our results are based on self-assessments of health status and systematically exclude cognitively impaired and very ill patients. Second, because many respondents skipped questions on vision or hearing or other dimensions of health status, there were missing data for the HUI. Clearly, the mean HUI scores are not representative of the entire cohort. The reliability estimates may reflect the experience of somewhat healthier respondents and may not be fully generalizable.
The results reported here for the test–retest reliability of the HUI2 and HUI3 are consistent with other results for the HUI, results for other multiattribute preference measures, and results for generic health profile measures. Test–retest reliability appears to be acceptable for group-level comparisons. Additional empirical evidence on test–retest reliability for multiattribute utility scores in other patient groups would be welcome.
C. Allyson Jones, PT, PhD, Assistant Professor (Allsyson.Jones@ualberta.ca) Department of Physical Therapy, Faculty of Rehabilitation Medicine, 2-50 Corbett Hall, University of Alberta, Edmonton, Alberta T6G 2G4, Canada
David Feeny, PhD (david.feeny@ualberta.ca), Professor of Economics, Public Health Sciences, and Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, Alberta T6G 2H4, Canada; Institute of Health Economics, 10405 Jasper Avenue, Suite 1200, Edmonton, Alberta T5J 3N4, Canada
Ken Eng, MA, Research Associate (keng@ihe.ca), Institute of Health Economics, 10405 Jasper Avenue, Suite 1200, Edmonton, Alberta T5J 3N4, Canada
The study, "Measurement of Health Status and Health-Related Quality of Life in Patients with Hip Fractures," was supported by grants to Drs. C. Allyson Jones, David Feeny, Finlay McAlister, Cheryl Wiens, and John Cinats from the Institute of Health Economics, University Hospital Foundation, Edmonton Orthopaedic Research Trust, and Royal Alexandra Foundation. The analyses reported in this paper were supported by grants from the Alberta Heritage Foundation for Medical Research (AHFMR; #199909) and the Institute of Health Economics (IHE) to C. Allyson Jones and David Feeny. IHE, AHFMR, the University Hospital Foundation, Edmonton Orthopaedic Research Trust, and Royal Alexandra Foundation played no role in the design, interpretation, or analysis of the project and have not reviewed or approved this manuscript. Support for the postdoctoral fellowship held by Dr. C. Allyson Jones was provided by the Alberta Heritage Foundation for Medical Research and the Canadian Institutes of Health Research. An earlier version of the paper was presented as a poster at the 11th Annual Meeting of the International Society for Quality of Life Research, October 16–19, 2004, in Hong Kong. We thank the patients and family caregivers for their participation in the study. We also thank the staff of the University of Alberta Hospital Orthopaedics Research Office for their assistance in patient recruitment and data collection. It should be noted that David Feeny has a proprietary interest in Health Utilities Incorporated, Dundas, Ontario, Canada. HUInc. owns the copyright to and distributes HUI materials.