Introduction
As in many other areas of evidence-based practice, there is an increasing demand in otorhinolaryngology for reliable outcome measures. Voice clinicians need valid, repeatable and change-sensitive outcome measures in order to evaluate speech and language therapy and/or phonosurgical interventions. To date, three main areas of laryngological outcome assessment have been developed: self-reporting, acoustic analysis and perceptual rating. Detailed investigation has shown minimal correlation between general acoustic measures and patient complaints.Reference Carding, Steen, Webb, Mackenzie, Deary and Wilson1 Perceptual measurement has become the accepted ‘gold standard’ for voice assessment; however, the process is time intensive and requires the expertise of a trained observer, usually a speech and language therapist. Also, unlike self-reported measures, expert voice rating does not reflect quality of life impact. For these reasons, there has been increasing activity in the design of research tools which evaluate the quality of life impact of dysphonia from the patient's perspective.Reference Deary, Webb, Mackenzie, Wilson and Carding2–Reference Ma and Yiu6
From the various available self-rating scales for the evaluation of voice-related quality of life, we selected three which we considered the most appropriate for potential use in the UK. These were the vocal performance questionnaire,Reference Deary, Webb, Mackenzie, Wilson and Carding2 the voice handicap indexReference Jacobson, Johnson, Grywalski, Silbergleit, Jacobson and Benninger3 and the newly developed voice symptom scale.Reference Wilson, Webb, Carding, Steen, Mackenzie and Deary7 The vocal performance questionnaire was the first scale developed for use within a British population. The sample size used in the evaluation of its reliability was small,Reference Carding, Steen, Webb, Mackenzie, Deary and Wilson1 and there is little information in support of its validity.Reference Carding and Docherty8 The voice handicap index already has a body of evidence in support of its reliability and validity, but these studies were undertaken on small samples of North American patients.Reference Jacobson, Johnson, Grywalski, Silbergleit, Jacobson and Benninger3, Reference Benninger, Ahuja, Gardner and Grywalski9, Reference Rosen and Murray10 The voice symptom scale, which was designed to be applicable across the range of heterogeneous voice symptoms, was developed from British patient samples.Reference Scott, Robinson, Wilson and MacKenzie11 It underwent rigorous psychometric evaluation of its content validity, internal consistency and factorial structure, using a number of large samples of voice patients.Reference Wilson, Webb, Carding, Steen, Mackenzie and Deary7–Reference Benninger, Ahuja, Gardner and Grywalski9
The aim of this study was to evaluate the reliability (i.e. internal consistency and repeatability) and validity of these three self-rating scales.
Methods
Patient self-reported scales
Vocal performance questionnaire
This scale was designed for use in an evaluation study of voice therapy in cases of non-organic dysphonia.Reference Carding and Horsley12 It consists of 12 items which address the physical aspects of the voice problem and also its social and emotional impact. It is scored to give a total score only, with no subscales. The reliability of the questionnaire was originally assessed on a group of only 10 respondents. Validity was ascertained by discussing the questionnaire with the patients in the pilot study and by correlating the scores with a mean severity rating of voice quality, determined by external raters.Reference Carding and Docherty8
Voice handicap index
This disability and handicap inventory was developed for use in a variety of voice disorders.Reference Jacobson, Johnson, Grywalski, Silbergleit, Jacobson and Benninger3 Its 30 items are grouped into three content domains representing functional, emotional and physical aspects of voice disorders. The items were selected from patients' case records. The reliability of the questionnaire was assessed on a sample of 65 consecutive patients. Construct validity was evaluated by correlating the voice handicap index with the domains of the SF-36 quality of life measure in 260 patients.Reference Benninger, Ahuja, Gardner and Grywalski9 The sensitivity to change in voice was evaluated on a sample of 37 subjects with various vocal fold abnormalities.Reference Rosen and Murray10 This study concluded that the voice handicap index was a useful patient-based instrument for the measurement of change following intervention. The voice handicap index has been used in previous studies to assess patients' perception of the severity of their voice disorder due to a variety of aetiologiesReference Stewart, Chen and Stach13, Reference Rosen, Murray, Zinn, Zullo and Sonbolian14 and in efficacy studies of intervention for voice disorders.Reference Ma and Yiu6, Reference Courey, Garrett, Billante, Stove, Portell and Smith15–Reference Fung and Yoo17
Voice symptom scale
This 30-item scale has three content domains – impairment, physical symptoms and emotional response – and a total score. The impairment domain has 15 items and reflects the impact of the voice problem and the patient's ability to use their voice. The physical symptoms domain has seven items and addresses the symptoms which regularly occur as concomitants of voice disorder (e.g. sore throat and throat clearing). These may result from and/or exacerbate dysphonia but are not synonymous with poor voice quality. The emotional domain reflects the impact of the voice disorder on the patient's psychological well-being.
In summary, these three different questionnaires attempt to reflect the breadth of patients' voice problems, but have different derivations and thus potentially different applicability to the general population of voice-disordered patients.
Perceptual analysis
The grade–roughness–breathiness–aesthenia–strain rating scaleReference Hirano18 scores each of these five parameters. Each parameter is scored using a four-point rating scale, from zero (normal) to three (extreme). There is a body of evidence in support of the reliability of this scale.Reference Dejonckere, Obbens, Leeper, Hawkins, Heeneman and Doyle19–Reference Webb, Carding, Deary, Mackenzie, Steen and Wilson22
Patients
One hundred and eighty-one patients complaining of hoarseness and attending otorhinolaryngology – head and neck surgery out-patient clinics in Newcastle and Glasgow gave consent to take part in the study at their initial out-patient consultation. Patient exclusion criteria were: laryngeal cancer; age less than 18 years; pregnancy; learning difficulties; stroke; aphasia; and English not being their first language. The 127 female and 54 male patients included had a mean age of 52 years (range 18 to 88 years). Forty-four (34 per cent) were smokers. Patients' voice disorder categories are shown in Table I. At the initial out-patient appointment, each participant completed the three voice questionnaires. A sub-group of 50 participants was asked to complete a second set of the same questionnaires, one week later. The gold standard with which each questionnaire was compared was perceptual analysis of the voice using the grade–roughness–breathiness–aesthenia–strain scale.Reference Hirano18 This analysis was determined for each participant following a standard protocol for recording and assessment. Each patient gave a speech sample, consisting of rote counting and the days of the week, a prolonged /a/ and /i/ vowel, and three sentences from the Rainbow Passage.Reference Fairbanks23 An independent, expert rater evaluated each of the voice recordings, blinded to all but the age and sex of the participant. Each of the ratings was recorded in a standardised, pre-designed proforma.
Statistical analysis
Reliability
The assessment of reliability was based on whether each scale gave consistent and reproducible results.Reference Webb, Carding, Deary, Mackenzie, Steen and Wilson22
Firstly, the vocal performance questionnaire, the voice handicap index and the voice symptom scale were evaluated for internal consistency. From the several assessment methods available, we selected the most widely used – the Cronbach's alpha reliability coefficient.Reference Cronbach24
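As an illustration of the internal consistency calculation, the following is a minimal sketch of Cronbach's alpha in plain Python. It is not the analysis code used in this study; the function name `cronbach_alpha` and the small response matrix are hypothetical, and the sketch assumes complete data (no missing items) with items scored in the same direction.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per respondent)."""
    k = len(responses[0])  # number of items in the scale
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item across respondents
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of each respondent's total score
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 respondents answering a 4-item scale
scores = [
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [1, 1, 2, 1],
]
alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.2f}")
```

In practice the same function would be applied once per domain and once to all items together to obtain the total-score coefficient.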
Secondly, the repeatability or stability of the measurementsReference Hays, Anderson and Revicki25 was assessed, based on analysis of correlations between repeated measures. The measures were repeated over time (i.e. test–retest reliability) in 50 of the participants with dysphonia. Test–retest reliability was assessed by calculating the intra-class correlation coefficient based on a two-way analysis of variance (subjects by occasions), with both subjects and occasions being treated as random effects.Reference Shrout and Fleiss26
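The test–retest calculation can be sketched as follows. This is an assumption-laden illustration rather than the study's own code: it takes the description above (two-way analysis of variance, subjects and occasions both random) to correspond to the single-measure form usually written ICC(2,1) in Shrout and Fleiss's notation. The function name `icc_two_way_random` and the scores are hypothetical.

```python
def icc_two_way_random(data):
    """Single-measure, two-way random effects ICC.

    data: list of rows, one per subject, each holding k scores (one per occasion).
    """
    n = len(data)     # subjects
    k = len(data[0])  # occasions
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sums of squares from the subjects-by-occasions ANOVA
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_occasions = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_subjects - ss_occasions

    bms = ss_subjects / (n - 1)             # between-subjects mean square
    jms = ss_occasions / (k - 1)            # between-occasions mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Hypothetical total scores for 5 participants at visit 1 and one week later
test_retest = [[40, 42], [55, 50], [30, 33], [62, 60], [48, 47]]
icc = icc_two_way_random(test_retest)
print(f"ICC = {icc:.2f}")
```

Treating occasions as a random effect means that any systematic shift between the first and second administration lowers the coefficient, which is the behaviour wanted when absolute agreement over time is of interest.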
Validity
Two aspects were assessed: concurrent validity and criterion validity.
Concurrent validity is the extent to which results obtained with one measure of a construct relate to results obtained with another measure of the same construct.Reference Cronbach24 Concurrent validity was evaluated by Pearson correlations of the three different self-reported scales.
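A minimal sketch of the pairwise Pearson correlations between scale totals is given below. The total scores are invented for illustration, and the names `pearson_r` and `totals` are hypothetical; the sketch assumes each participant has a score on every scale.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical total scores for 6 participants on the three questionnaires
totals = {
    "VPQ": [20, 35, 28, 41, 17, 30],
    "VHI": [32, 70, 51, 88, 25, 60],
    "VoiSS": [30, 58, 49, 75, 22, 47],
}

# One coefficient per pair of scales, i.e. the off-diagonal of a correlation matrix
for a, b in combinations(totals, 2):
    print(f"{a} vs {b}: r = {pearson_r(totals[a], totals[b]):.2f}")
```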
Criterion validity is a special case of construct validity in which a stronger hypothesis is made possible by reference to some outside validating criterion or gold standard.Reference Hays, Anderson and Revicki25, Reference Nunnally27, Reference Schiavetti and Metz28 There are no gold standards available for voice-specific, self-reported patient scales; therefore, criterion validity was evaluated by comparing the ratings given to the participants' voice quality using the grade–roughness–breathiness–aesthenia–strain scaleReference Hirano18 with the scores on the three self-reported voice scales, using the Spearman rho correlation coefficient.
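Because each perceptual parameter takes only four values (zero to three), tied ranks are unavoidable, so a rank correlation needs to handle ties. The sketch below computes Spearman's rho as the Pearson correlation of average ranks. It is an illustration under that assumption, not the study's analysis code; the function names and the ratings are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: expert 'grade' rating (0-3) and a questionnaire total score
grade = [0, 1, 1, 2, 3, 2, 0, 1]
total = [12, 20, 25, 31, 40, 22, 15, 18]
rho = spearman_rho(grade, total)
print(f"Spearman rho = {rho:.2f}")
```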
Results
One hundred and seventy participants with complete data sets for all three questionnaires, and 46 with a second complete set, were included in the analysis.
Internal consistency
Generally, Cronbach's alpha coefficients of at least 0.7–0.8 are regarded as necessary for adequate internal consistency.Reference Cronbach24 Cronbach's alpha coefficient for the vocal performance questionnaire total score was 0.81. The alpha coefficients for the domains of the voice handicap index were: physical aspects 0.85, functional aspects 0.90 and emotional aspects 0.90; the coefficient for the total score was 0.95. The alpha coefficients for the domains of the voice symptom scale were: impairment 0.85, physical symptoms 0.73 and emotion 0.90; the coefficient for the total score was 0.89.
Repeatability
Table II shows the test–retest coefficients and the 95 per cent confidence intervals for each of the domains within each scale and for their total scores. The voice handicap index demonstrated very good stability, with a total scale test–retest reliability coefficient of 0.83. The vocal performance questionnaire and the voice symptom scale both demonstrated adequate stability, with test–retest reliability coefficients of 0.75 and 0.63, respectively. It should also be noted that the test–retest reliability of the voice handicap index and the voice symptom scale domain scores was very good.
* The vocal performance questionnaire (VPQ), the voice handicap index (VHI) and the voice symptom scale (VoiSS). †Intra-class correlation coefficient. CI = confidence interval
Concurrent validity
Table III presents a correlation matrix for the domains and total score of the voice symptom scale, the domains and total score of the voice handicap index, and the total score of the vocal performance questionnaire. Most components showed strong positive correlations, except the voice symptom scale physical symptoms domain, which included relevant but non-voice throat symptoms.
* The voice symptom scale (VoiSS), the vocal performance questionnaire (VPQ) and the voice handicap index (VHI). ††Correlation significant at the 0.01 level (two-tailed); †correlation significant at the 0.05 level (two-tailed).
Criterion validity
Table IV presents a correlation matrix for the self-reported scales and the parameters of the grade–roughness–breathiness–aesthenia–strain auditory rating scale. Observer- and self-rated voice quality are two different things, and it is not surprising that the overall strength of correlations between the self-reported voice scales and the grade–roughness–breathiness–aesthenia–strain scale is less than that of correlations between the three self-reported scales. Results for the vocal performance questionnaire and the voice handicap index significantly correlated with all parameters of the grade–roughness–breathiness–aesthenia–strain scale except roughness. The highest vocal performance questionnaire correlation was with overall grade (0.32), while the voice handicap index and the voice symptom scale correlated most strongly with breathiness (0.44 and 0.43, respectively). The physical symptoms domain of the voice symptom scale was not related to the grade–roughness–breathiness–aesthenia–strain rating scale.
* The vocal performance questionnaire (VPQ), the voice handicap index (VHI) and the voice symptom scale (VoiSS). ††Correlation significant at the 0.01 level (two-tailed); †correlation significant at the 0.05 level (two-tailed). GRBAS = grade–roughness–breathiness–aesthenia–strain
Discussion
This study demonstrated that all three self-reported patient questionnaires were reliable and valid instruments for measuring the patient-perceived impact of a voice disorder. We consider that the relatively minor differences between the scales, with regard to coefficient sizes, are of limited significance.
The vocal performance questionnaire, voice handicap index and voice symptom scale had good internal consistency and test–retest reliability (Table II).Reference Wilson, Webb, Carding, Steen, Mackenzie and Deary7, Reference Deary, Webb, Mackenzie, Wilson and Carding2 Criterion validity entails comparing the scale under review with an outside validating criterion or gold standard. The adopted criterion in this study was the grade–roughness–breathiness–aesthenia–strain perceptual rating scale. Previously, the vocal performance questionnaire had been correlated with an overall rating of severity of voice quality, comparable to ‘grade’ on the grade–roughness–breathiness–aesthenia–strain scale, in 45 patients with non-organic dysphonia, giving a Spearman rho coefficient of 0.65.Reference Deary, Webb, Mackenzie, Wilson and Carding2 In the present study, the total score of the vocal performance questionnaire again correlated significantly (0.32) with the grade parameter of the grade–roughness–breathiness–aesthenia–strain scale. All the self-reported scales, with the exception of the physical symptoms domain of the voice symptom scale, correlated significantly with all the parameters of the grade–roughness–breathiness–aesthenia–strain scale, except roughness. This supports the theory that the self-reported and perceptual assessments are in part measuring the same underlying concept.
• There are several self-reported voice quality research tools available
• Most studies report on only one such tool
• The comparative reliability and validity of different tools is not known
• The vocal performance questionnaire, the voice handicap index and the voice symptom scale have good internal consistency and test–retest reliability
• In comparison with observer rating of voice performance, all three scales emerged as valid; the vocal performance questionnaire appeared adequate for a synopsis of voice outcomes, whereas the voice handicap index may be superior for emotional domains
• The voice symptom scale physical symptom domain score seemed independent of the other self- and observer-reported ratings
The highest correlations were demonstrated between the function/impairment domains of the voice handicap index and voice symptom scale, respectively, and the ‘breathiness’ parameter of the grade–roughness–breathiness–aesthenia–strain scale. This may indicate that air wastage through the glottis has the largest subjective impact on the patient's ability to carry out their normal activities. However, although statistically significant, none of the correlations was high. In other words, a clinician's perception of voice quality, as recorded at one point in time, does not directly correspond to the patient's perception of voice quality and its impact on their daily activities.Reference Ma and Yiu6
Conclusion
There were strong correlations between the vocal performance questionnaire, the voice handicap index and the voice symptom scale, particularly for those aspects which address impairment or alteration of function. The voice symptom scale has an additional domain not reflected in the other scales, that of associated physical symptoms. The vocal performance questionnaire gives little indication of emotional effects, but, like the shortened version of the voice handicap index (the voice handicap index 10), is a convenient, internally consistent, uni-dimensional voice outcome tool.Reference Deary, Webb, Mackenzie, Wilson and Carding2
A voice assessment tool that addresses voice problems in terms of physical, functional and emotional impacts may provide a more accurate indication of the outcomes of a particular treatment package. For example, a functional approach to therapy may more adequately be informed by assessment with the voice handicap index, whilst a medical approach to the treatment of symptoms may benefit from the results of the more symptom-based voice symptom scale. However, if the aim of assessment is to obtain a brief, simple indication of severity of impact, in order to determine intervention outcomes and to audit service provision, then a shorter, more general measure (such as the vocal performance questionnaire) would be more appropriate.
Acknowledgements
This research was supported by a grant from the Wellcome Trust.