Effects of using same- versus alternate-form memory tests during short-interval repeated assessments in multiple sclerosis
Published online by Cambridge University Press: 21 October 2005
Abstract
Repeated neuropsychological testing gives rise to practice effects in that patients become familiar with test material as well as test-taking procedures. Using alternate forms prevents the learning of specific test stimuli, potentially mitigating practice effects. However, changing forms could diminish test-retest reliability coefficients. Our objective was to examine test-retest effects in multiple sclerosis (MS) patients randomly assigned to same- (SF) or alternate-form (AF) conditions. Thirty-four MS patients underwent neuropsychological evaluation. The battery included the California Verbal Learning Test II (CVLT-II) and the Brief Visuospatial Memory Test–Revised (BVMT-R), memory tests recommended by a recently convened consensus panel. Patients were randomly assigned to SF or AF groups and then tested at baseline and follow-up examination 1 week later. Analysis of variance tests (ANOVAs) revealed significant group × time interactions, with SF patients showing greater gain than AF patients. SF practice effects were often large, compromising test validity. Reliability coefficients were either equivalent or higher in the AF group, a finding attributed to ceiling effects and reduced variance in the SF group at retest. The generalizability of the findings may be limited to short test-retest intervals and the MS population. Nevertheless, we conclude that the use of CVLT-II and BVMT-R alternate forms likely helps preserve test validity without compromising test-retest reliability. (JINS, 2005, 11, 727–736.)
- Type
- Research Article
- Information
- Journal of the International Neuropsychological Society, Volume 11, Issue 6, October 2005, pp. 727–736
- Copyright
- © 2005 The International Neuropsychological Society
INTRODUCTION
Repeated neuropsychological (NP) testing has become commonplace in recent years. Increasingly, tests are used to establish a baseline against which the effects of neurologic disease (Heaton et al., 1995; Amato et al., 2001; DeCarli et al., 2004; Green et al., 2004), trauma (Iverson et al., 2003; Lovell et al., 2004), or treatment (Chelune et al., 1993; Collie et al., 2002; Krupp et al., 2004) may be assessed. There are, in turn, efforts to develop NP tests with good test-retest reliability and alternate forms (Benedict et al., 1996; Woodard et al., 1996; Benedict et al., 1998) so that testing enhances the power to detect reliable change (Temkin et al., 1999; Heaton et al., 2001; Iverson et al., 2003).
Practice effects are particularly salient when patients are asked to learn a problem-solving strategy or remember new information. There may be two distinct sources of learning across testing sessions. First, as with almost any cognitive test, patients are apt to develop test-taking strategies with repeated exposures to the same procedure. We (Benedict & Zgaljardic, 1998; Zgaljardic & Benedict, 2001) previously referred to this as “test-specific” practice. In contrast, “item-specific” practice refers to the learning of actual content (e.g., a word list) from one administration of the test to the next. While test-specific practice is unavoidable in serial NP assessment, item-specific practice can be mitigated via the use of alternate test forms.
In our previous review of studies involving healthy subjects (Benedict & Zgaljardic, 1998), we compared effect sizes from studies of repeated memory testing whenever same or alternate test forms were employed at retest. We calculated Cohen's (1988) d and found marked practice effects in studies repeating the same test forms (McCaffrey et al., 1992, 1993, 1995; Rapport et al., 1997). In contrast, studies using alternate forms generated much lower d values (Parker et al., 1995; Benedict et al., 1996). Surprisingly few studies had directly compared the effects of using same or alternate forms in the same sample. An exception was a study where two forms of the Rey Auditory Verbal Learning Test (Rey, 1964) were compared (Crawford et al., 1989). Subjects were randomly assigned to testing with one of two test forms. A between-group baseline comparison indicated that the forms were of equivalent difficulty. Participants were then randomly reassigned to same- or alternate-form conditions. After 27 days, significant improvement was found for the same-form group on all measures, with mostly large effects (d range .4 to 1.3). By comparison, significant practice effects were not apparent in the alternate-form group.
In our study (Benedict & Zgaljardic, 1998), we assessed 30 healthy volunteers divided into same- and alternate-form groups, matched on age, education, and baseline memory test performance. The tests employed were the Hopkins Verbal Learning Test–Revised (Brandt & Benedict, 2001) and the Brief Visuospatial Memory Test–Revised (Benedict, 1997). Participants tested with the same form every two weeks improved significantly over four sessions, whereas those completing alternate forms produced either small or insignificant practice effects. Taken together, these studies of normal volunteers show that in the domain of memory, practice effects are significantly attenuated when alternate forms are employed.
Practice effects have rarely been examined systematically in patient samples. Hawkins and Wexler (1999) administered the California Verbal Learning Test (CVLT; Delis et al., 1987) to 20 schizophrenia patients on three occasions: baseline, 10, and 14 weeks. As anticipated, statistically significant practice effects were observed on multiple measures. Baseline to week 14 effect sizes (d) ranged from 0.6 on Trial 5 to 1.0 on Delayed Recall. A larger and older (mean age = 59.4 years) schizophrenia sample was studied by Harvey et al. (2005), who used a large battery of tests spanning multiple cognitive domains. There were two assessments, with an 8-week test-retest interval. Of the 17 tests administered, only two had alternate forms. Significant improvement was recorded on three tests, and in each case the effects were small. The authors concluded that NP performance of older schizophrenia patients is stable over 8 weeks. Practice effects have also been documented in HIV samples. McCaffrey and colleagues (1995) found significant practice effects on the CVLT and the Paced Auditory Serial Addition Test (Gronwall, 1977). More recently, in 26 HIV+ symptomatic and 33 HIV+ asymptomatic patients, significant gain was observed on multiple CVLT indices over an interval of roughly 16 days (Duff et al., 2001). Altogether, these studies show that neurological and psychiatric patients, like healthy volunteers, often demonstrate significant practice effects, but that older patients with chronic neuropsychiatric illness may be less susceptible to these effects. Unfortunately, to the best of our knowledge, no studies have so far compared the effects of same versus alternate forms within the same sample of neurological or psychiatric patients.
Reliability is another issue to be weighed in considering the costs and benefits of alternate forms. Even if the equivalence of alternate forms is well established, test-retest reliability may be attenuated from one test session to another, thereby increasing error in statistical analysis and hindering estimates of reliable change in performance (Chelune et al., 1991; Jacobson & Truax, 1991; Sawrie et al., 1996). In the studies with clinical patients cited earlier, test-retest reliability coefficients were generally good. For example, in Harvey et al.'s schizophrenia study, correlations ranged from r = .52 to .93 (Harvey et al., 2005). In the more recent HIV work (Duff et al., 2001), reliability coefficients for CVLT recall measures for asymptomatic patients ranged from .52 to .76. It is conceivable that using alternate forms in longitudinal studies could increase error, thus lowering reliability coefficients and increasing reliable change indices.
The population of interest in this study is multiple sclerosis (MS). Roughly half of all MS patients are cognitively impaired (Rao et al., 1991; Heaton, 1985) and a higher frequency of impairment may exist among clinic attendees (Medaer et al., 1984; Rao et al., 1984). Deficits in processing speed and working memory are common (Franklin et al., 1988; Litvan et al., 1988; Beatty et al., 1989; Rao et al., 1989; DeLuca et al., 1993; Kujala, 1994; Camp et al., 1999; Demaree et al., 1999; Archibald & Fisk, 2000), as are impairments in new learning and memory (Grant et al., 1984; Rao et al., 1984; Fischer, 1988; DeLuca et al., 1994; Beatty, 1996; DeLuca et al., 1998). In recent years, medications that alter disease activity have significantly diminished the frequency of relapses in MS and delayed the onset of physical disability (Paty et al., 1993; Johnson et al., 1995; Jacobs et al., 1996). In some cases, interferon beta-1a has significantly delayed the onset of neuropsychological impairment (Fischer et al., 2000). In addition, donepezil was recently found to enhance memory function in MS patients (Krupp et al., 2004). As a result of such medicinal successes, demand for serial NP testing in MS has increased, leading a consensus panel (Benedict et al., 2002) to conclude that minimal standards for NP testing in MS should emphasize alternate forms, as well as reliability and discriminative validity. This group acknowledged, however, that little is known about stability of NP testing when alternate forms are used in clinical trials.
In summary, practice effects are problematic in NP assessment, and it appears that serial examinations over short time intervals are increasingly called for in studies concerning the natural history of cognitive impairment in neurologic disease and clinical trials. Memory tests are almost always included in NP assessments, yet the effects of using same/alternate forms of memory tests have not been compared in a neurologic or psychiatric sample. This study, therefore, was designed to assess test-retest effects in an MS sample, the general hypothesis being that alternate forms would protect against marked practice effects, and that the stability of NP testing would not be adversely affected by introducing alternate forms.
METHODS
Research Participants
Included were 34 MS patients who underwent repeat NP testing while participating in a study of the psychometric properties of a recently developed screening questionnaire (Benedict et al., 2004a). All were attending one of four MS clinics (Baird MS Center, Buffalo, NY; University of California at San Francisco; University of Colorado Health Sciences Center, Denver, CO; Gimbel Center at Teaneck, NJ). Inclusion criteria were (a) diagnosis of clinically definite MS (McDonald et al., 2001), (b) informant having contact with the patient at least three times per week, (c) age 18 or older, (d) fluent in English, and (e) able to provide informed consent to all procedures. Exclusion criteria were (a) neurological disorder other than MS, (b) psychiatric disorder (American Psychiatric Association, 1994) other than mood, personality, or behavior change following the onset of MS, (c) other medical condition that might influence cognition, (d) history of developmental disorder (e.g., ADHD, learning disability), (e) history of substance or alcohol dependence, (f) current substance abuse, (g) motor or sensory defect that might interfere with cognitive test performance, (h) relapse and/or corticosteroid pulse within four weeks of assessment. All participants signed informed consent forms approved by institutional review panels prior to participating in the study.
Mean age (±SD) was 42.2 ± 8.9 years. On average, patients completed 15.4 ± 2.5 years of education. The majority were Caucasian (n = 31 or 91%) and female (n = 28 or 82%), consistent with the MS population, which is primarily female (Jacobs et al., 1999). Scores derived from the Expanded Disability Status Scale (Kurtzke, 1983), which assess neurologic/physical disability, were available for 28 patients and the mean was 2.5 ± 2.0, representing mild/moderate impairment. Most patients (n = 27 or 79%) had relapsing-remitting rather than progressive (4 secondary progressive, 3 primary progressive) course.
Materials
Each patient underwent NP testing in accordance with recent consensus panel recommendations (Benedict et al., 2002). This battery, known as the Minimal Assessment of Cognitive Function in MS (MACFIMS), recommends the use of alternate forms where possible, and includes the following tests with alternate forms: California Verbal Learning Test, second edition (CVLT-II; Delis et al., 2000), Brief Visuospatial Memory Test–Revised (BVMT-R; Benedict, 1997), Paced Auditory Serial Addition Test (PASAT; Gronwall, 1977), Controlled Oral Word Association Test (COWAT; Benton et al., 1994), and Sorting Test from the Delis-Kaplan Executive Function System (Delis et al., 2001). The Sorting Test was not included in the analysis because it was not administered uniformly across subjects (cued sorting not administered to all subjects). Same/alternate form comparisons were thus restricted to four tests, two emphasizing memory (CVLT-II, BVMT-R) and two emphasizing processing speed or executive control (PASAT, COWAT).
The CVLT-II and BVMT-R are both learning and memory tests, with similar formats. Both require single exposures to new material that is recalled immediately after presentation. There is a 25-minute interval following the final learning trial, after which participants are asked to recall the information again without further exposure to the to-be-learned material. Delayed recall is followed by a yes/no, forced-choice recognition task. There are 5 learning trials for the CVLT-II. Examiners read 16 words and ask participants to repeat as many words as possible. The entire word list is repeated each time. For the BVMT-R, the stimulus material is a matrix of 6 visual designs, held before the participant for 10 seconds. Participants are asked to render the designs using paper and pencil, taking as much time as needed for reproduction. Each design receives a score of 0, 1, or 2, based on accuracy and location scoring criteria. There are three free-recall trials preceded by stimulus exposure. In this study, we considered the following measures for each test: trial 1 recall (Trial 1), final trial recall (CVLT-II Trial 5; BVMT-R Trial 3), total recall over all learning trials (Total Learning), recall after the delay interval (Delayed Recall), and recognition discrimination index recommended in the test manuals, which follows the delayed interval (Delayed Recognition). Z scores were calculated using a previously published (Benedict et al., 2004a) normative sample (n = 40; mean age = 40.0 ± 9.1; mean education = 14.9 ± 2.0; 70% female), and the means ranged from −1.2 ± 1.5 for BVMT-R Delayed Recall to −0.5 ± 1.4 for CVLT-II Total Learning.
The PASAT and COWAT served as executive control tasks. We employed Rao's adaptation (Rao, 1991) of the PASAT, which includes 60 trials presented at interstimulus intervals of 3 and 2 seconds. This version was selected to be a component of the MS Functional Composite (MSFC), a clinical outcome measure composed of quantitative measures of leg, arm/hand, and cognitive function (Cutter et al., 1999; Fischer et al., 1999). The dependent measure was the mean number of correct responses from the two trials. The COWAT was administered in the standard manner, following the method of Arthur Benton (Benton et al., 1994). In successive one-minute trials, participants generated as many words as possible, beginning with each of three designated letters. The dependent measure was the total number of correct words over the three trials.
The test battery also included the Symbol Digit Modalities Test (SDMT; Smith, 1982), which measures working memory and processing speed. It presents a series of nine symbols, each of which is paired with a single digit in a key at the top of an 8.5 × 11 inch sheet. The remainder of the page presents a pseudo-randomized sequence of symbols. Participants respond by voicing the digit associated with each symbol as quickly as possible. As recommended by the MACFIMS panel (Benedict et al., 2002), we employed only the oral-response administration. The SDMT is strongly correlated with whole-brain atrophy in MS (Christodoulou et al., 2003; Benedict et al., 2004b). The alternate forms available are probably not equivalent (Boringa et al., 2001). Thus, only the standard form was used in this study. The data are nevertheless presented for descriptive purposes and to determine if the degree of practice was the same across groups.
Finally, patients also completed the Beck Depression Inventory–Fast Screen for Medical Patients (BDI-FS; Beck et al., 2000), which was recently validated in this population (Benedict et al., 2003).
Procedure
The participants were contacted by mail or approached during the course of their usual clinical care at an MS center. The tests were administered individually by a trained assistant or graduate student working under the supervision of a licensed psychologist. The entire test battery, requiring approximately 90 minutes, was repeated 6–8 days after the baseline examination. The participants were assigned, randomly, in sequence, to same- (SF) or alternate-form (AF) conditions. The SF group underwent NP testing using the same forms on each occasion, whereas the AF group was tested with an alternate version at retest. The CVLT-II, PASAT, and COWAT have two equivalent alternate forms (Rao, 1991; Benton et al., 1994; Ruff et al., 1996; Delis et al., 2000). For the BVMT-R, we employed the equivalent forms 1 and 4 (Benedict et al., 1996). At baseline, all patients were examined with the CVLT-II Standard Form, BVMT-R Form 1, PASAT Form A, and COWAT Form CFL. At retest, only the AF group was examined with alternate forms (CVLT-II Alternate Form, BVMT-R Form 4, PASAT Form B, COWAT Form PRW).
Analysis
Pearson correlation, analysis of variance (ANOVA), and chi-square tests were utilized to examine correlations and between-group effects. Reliability coefficients were compared using the Fisher r-to-z transformation. The primary analysis approach was mixed factor ANOVA, with group (SF, AF) serving as the between-groups factor, and time (test, retest) serving as the repeated factor. The hypothesis that using alternate forms differentially affects retest performance was tested by the group × time interaction. The dependent variables included total correct indices from the SDMT, PASAT, and COWAT, and the following were derived from the CVLT-II and BVMT-R: Trial 1, Trial 5 or 3, Total Learning, Delayed Recall, Delayed Recognition. For the memory tests, we also examined the pattern of learning and forgetting via 3 × 2 repeated measures ANOVAs. Here, the within-subjects effects were trial (Trial 1, Trial 5/3, Delayed Recall) and time (test, retest). Post-hoc comparisons were accomplished via t test. Throughout, alpha was set at p < .05. For descriptive purposes, we examined effect sizes using Cohen's d statistic, which is the difference between means divided by the pooled standard deviation (SD).
RESULTS
The SF and AF groups were well matched on demographics, disease features, and BDI-FS scores as demonstrated by nonsignificant univariate comparisons. The groups were also matched on baseline CVLT-II, BVMT-R, PASAT, and COWAT performance (all p values > .20).
Symbol Digit Modalities Test
The test-retest reliability coefficients were robust and similar for both groups (SF r = .98, p < .001; AF r = .97, p < .001). The performance of the SF and AF groups on the SDMT is presented in Figure 1. The ANOVA showed a significant effect for time [F(1,32) = 23.1, p < .001], but the group and group × time interaction effects were not significant. Collectively, the sample improved from a raw score of 51.9 ± 16.0 to 55.4 ± 17.4, a small effect of d = 0.2.

Mean raw scores produced by the SF and AF groups on the Symbol Digit Modalities Test (SDMT). Statistical analysis using ANOVA reveals significant gain or practice effect in both groups, but no difference in overall performance or degree of gain.
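As a rough check on the SDMT effect size reported in this section, Cohen's d can be recomputed from the published means and SDs. The sketch below assumes the pooled SD is the square root of the average of the two variances (equal weighting of test and retest); the function name `cohens_d` is illustrative and is not taken from the study.

```python
import math


def cohens_d(mean_1, sd_1, mean_2, sd_2):
    """Difference between means divided by the pooled SD.

    Assumption: the pooled SD weights both occasions equally,
    i.e. sqrt((sd_1^2 + sd_2^2) / 2).
    """
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2.0)
    return (mean_2 - mean_1) / pooled_sd


# Whole-sample SDMT raw scores reported in the text:
# baseline 51.9 +/- 16.0, retest 55.4 +/- 17.4
sdmt_d = cohens_d(51.9, 16.0, 55.4, 17.4)
print(round(sdmt_d, 1))  # 0.2
```

Under this pooling assumption the result rounds to the reported d = 0.2.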
Auditory/Verbal Memory
CVLT-II data are presented in Table 1. The mixed-factor ANOVAs revealed significant interaction effects for all measures. In every analysis, the SF group showed significantly higher scores on retest compared to baseline [T1 t(16) = −3.9, p < .01; T5 t(16) = −2.4, p < .05; Total Learning t(16) = −4.1, p < .01; Delayed Recall t(16) = −4.2, p < .01; Delayed Recognition t(16) = −3.6, p < .01], whereas there were no significant test-retest effects in the AF group. Analysis of effect sizes revealed that SF d's ranged from 0.5 to 1.0 and AF effects ranged from −0.1 to 0.1.
Within-group data and interaction effects for the CVLT-II

Figure 2 describes the data as analyzed by the trial × time repeated measures ANOVA. The SF group (Fig. 2a) showed marked change over the test-retest interval, as demonstrated by a significant interaction [F(2,15) = 9.5, p < .01]. As noted earlier, t tests showed significant effects of time at Trial 1, Trial 5, and Delayed Recall, with higher scores being observed at retest. In an effort to further explain the interaction, Trial 5 and Delayed Recall scores were also compared. At baseline, there was a significant difference between the scores with the higher value being associated with Trial 5 [t(16) = 4.1, p < .01]. In contrast, the Trial 5/Delayed Recall comparison was not significant at retest [t(16) = 0.0].

Raw scores produced by the SF and AF groups on the California Verbal Learning Test–II. Shown are the mean number of words recalled on Trial 1, Trial 5, and after the 25 min delay interval. For the same form group (Fig. 2a) repeated measures, trial × time ANOVA reveals a significant interaction. Post hoc tests reveal decline in performance from Trial 5 to Delayed Recall at baseline, but not at retest. For the alternate form condition (Fig. 2b), the ANOVA reveals only a main effect for trial, meaning that performance is not affected by retesting.
The AF repeated measures ANOVA (Fig. 2b) revealed only a significant trial main effect [F(2,15) = 89.15, p < .001].
Visual/Spatial Memory
Similar findings emerged for the BVMT-R (Table 2). In all of the ANOVAs but one, there was again a significant interaction effect best explained by significant gain in the SF group [T1 t(16) = −6.5, p < .001; T3 t(16) = −2.7, p < .05; Total Learning t(16) = −6.1, p < .001; Delayed Recall t(16) = −2.9, p < .05]. This time, SF d's ranged from 0.7 to 1.6, whereas all AF effect sizes were again small (−0.2 to 0.2). There were no significant within-subjects effects in the AF group. For BVMT-R recognition, the ANOVA revealed no significant main or interaction effect.
Within-group data and interaction effects for the BVMT-R

The BVMT-R across trial data are shown in Figure 3. Again, the SF group (Fig. 3a) showed marked change over the test-retest interval, as demonstrated by trial × time interaction [F(2,15) = 4.7, p < .05]. As noted earlier, there were significant test-retest effects at Trial 1, Trial 3, and Delayed Recall. When comparing Trial 3 and Delayed Recall scores, we observed a trend toward a significant difference at test [t(16) = 1.9, p = .07] and no effect at retest [t(16) = 1.7].

Raw scores produced by the SF and AF groups on the Brief Visuospatial Memory Test–Revised. Shown are the mean recall scores (range 0–12) on Trial 1, Trial 3, and after the 25 min delay interval. For the same form group (Fig. 3a) repeated measures, trial × time ANOVA reveals a significant interaction. Post hoc tests reveal a trend toward decline in performance from Trial 3 to Delayed Recall at baseline, but not at retest. For the alternate form condition (Fig. 3b), the ANOVA reveals only a main effect for trial, meaning that performance is not affected by retesting.
As was the case with the CVLT-II, the AF ANOVA showed only a significant trial main effect [F(2,15) = 49.7, p < .001].
Executive Control Tests
The interaction effects (Table 3) were not significant for either the PASAT [F(1,32) = 0.1] or COWAT [F(1,32) = 0.2]. There were no significant group effects, but both ANOVAs revealed significant effects of time [PASAT F(1,32) = 19.9, p < .001; COWAT F(1,32) = 8.8, p < .01].
Within-group data and interaction effects for the PASAT and COWAT

Reliability
Same/alternate form pairings of test-retest reliability coefficients were analyzed for statistical significance using the Fisher r-to-z transformation. Significant differences were found on four measures: CVLT-II Delayed Recall [SF r = 0.54, AF r = 0.89, z = −2.2, p < .05], BVMT-R Trial 1 [SF r = 0.47, AF r = 0.88, z = −2.3, p < .05], BVMT-R Total Learning [SF r = 0.64, AF r = 0.91, z = −2.0, p < .05], and BVMT-R Delayed Recall [SF r = 0.32, AF r = 0.85, z = −2.5, p < .05]. In each case, higher reliability coefficients were found in the AF group, and, except for one case, there was substantially greater variance at retest in the AF group (Tables 1 and 2).
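The comparison of independent correlations described above can be sketched as follows. This is a minimal illustration of the standard Fisher r-to-z test, not the authors' own code. It assumes n = 17 per group, which is inferred from the reported t(16) degrees of freedom rather than stated directly, and the function name `fisher_z_difference` is hypothetical.

```python
import math


def fisher_z_difference(r_1, n_1, r_2, n_2):
    """Compare two independent Pearson correlations via Fisher's r-to-z.

    Returns the z statistic and its two-tailed p value.
    """
    z_1 = math.atanh(r_1)  # Fisher transformation of r_1
    z_2 = math.atanh(r_2)
    standard_error = math.sqrt(1.0 / (n_1 - 3) + 1.0 / (n_2 - 3))
    z = (z_1 - z_2) / standard_error
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_tailed


N_PER_GROUP = 17  # assumption: inferred from t(16), not given in the text

# (measure, SF r, AF r) as reported in the text
pairs = [
    ("CVLT-II Delayed Recall", 0.54, 0.89),
    ("BVMT-R Trial 1", 0.47, 0.88),
    ("BVMT-R Total Learning", 0.64, 0.91),
    ("BVMT-R Delayed Recall", 0.32, 0.85),
]

results = {}
for measure, r_sf, r_af in pairs:
    z, p = fisher_z_difference(r_sf, N_PER_GROUP, r_af, N_PER_GROUP)
    results[measure] = (z, p)
    print(f"{measure}: z = {z:.2f}, p = {p:.3f}")
```

Because the inputs are correlations rounded to two decimals, the recomputed z values should be expected to match the reported ones only approximately.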
DISCUSSION
The general aim of this study was to investigate the effects of using alternate-forms during repeated NP testing in MS. The common assumption that alternate forms protect against inter-examination learning effects was directly tested by randomly assigning patients to same- (SF) and alternate-form (AF) conditions. Because practice effects can be attributed to the learning of both the procedural aspects of testing and test content, this study focused on tests of auditory/verbal and visual/spatial memory. When retested at one week, SF patients showed marked practice benefit on the order of ½ to 1 SD in magnitude. Patients taking a different form of these memory tests showed no such practice. Thus, the data support the use of alternate forms, at least for these particular tests, in MS patients.
There is an additional benefit with the use of alternate forms. Figures 2 and 3 clearly show that the shape of the learning and delayed recall curve in the SF group is affected by the same/alternate form contingency (see also Tables 1 and 2). For example, at test 1 (or baseline), the asymptote of the CVLT-II learning curve is 12.3 ± 3.0. The mean recall score after the delay interval falls to 10.5 ± 3.5. Such a significant, if modest, drop in performance would be expected in a cerebral disease such as MS (Benedict et al., 2002; Delis et al., 2000). On retest, however, the Trial 5 and Delayed Recall values are precisely the same. Similar findings emerge for the BVMT-R, although the statistical test of Trial 3 versus Delayed Recall for the SF group at baseline shows only a trend toward significance. In other words, there is no longer evidence of modest forgetting after patients were exposed to the baseline examination. As such, the validity of the CVLT-II and BVMT-R is compromised when readministered in this way.
Is reliability compromised when alternate forms are used? In this study, test-retest correlations were calculated for each CVLT-II and BVMT-R measure independently for the SF and AF groups. The reliability coefficients were statistically different in four measures. In general, these significant findings were associated with higher reliability in the AF group. This finding was contrary to hypothesis. Error in test-retest analysis is often conceptualized (Crocker & Algina, 1986) as being attributed to the effects of time (e.g., fluctuations in mental state, motivation, etc.) or both time and content as may occur when alternate forms are used. Because the AF group's error variance may have stemmed from two different sources (time, content), lower r values were expected from these data. The most parsimonious explanation of this unexpected finding is that practice in the SF group restricted the range of retest scores, thereby compromising Pearson r statistics. The BVMT-R Delayed Recall measure provides a useful illustration of this point. At baseline, the mean scores and variances are nearly identical (SF = 8.8 ± 2.8, AF = 8.7 ± 2.6). One week later, the SF group mean is 10.7 but the range of possible scores extends only to 12. Therefore, it is not surprising to find that the SD drops to 1.3. In contrast, there is virtually no change in the AF group mean or variance.
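The range-restriction argument can be illustrated with a small simulation. This is a sketch under invented assumptions, not a reanalysis of the study data: the true-score distribution, the measurement-error SD, the sample size, and the function names are all hypothetical, chosen only so that scores resemble a 0–12 scale with a baseline mean near 8.8 and a practice gain of 1.9 points.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson product-moment correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def simulate_retest_r(practice_gain, n=5000, ceiling=12.0, seed=0):
    """Test-retest r when retest scores are shifted up and capped at a ceiling.

    All distribution parameters here are hypothetical illustration values.
    """
    rng = random.Random(seed)
    baseline, retest = [], []
    for _ in range(n):
        true_score = rng.gauss(8.8, 2.5)
        first = true_score + rng.gauss(0.0, 1.0)
        second = true_score + practice_gain + rng.gauss(0.0, 1.0)
        # Observed scores cannot leave the 0..ceiling range
        baseline.append(min(ceiling, max(0.0, first)))
        retest.append(min(ceiling, max(0.0, second)))
    return pearson_r(baseline, retest)


r_no_gain = simulate_retest_r(practice_gain=0.0)
r_with_gain = simulate_retest_r(practice_gain=1.9)
print(f"r without practice gain: {r_no_gain:.2f}")
print(f"r with practice gain:    {r_with_gain:.2f}")
```

In this toy setup the practice gain pushes more retest scores against the ceiling, which shrinks retest variance and lowers the correlation, the same mechanism proposed above for the SF group.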
These data have a bearing on calculations of reliable change intervals that may be used in clinical trials to gauge the meaning of improved NP test performance. The familiar Reliable Change Index (RCI) is most simply calculated as the estimated practice effect ± a margin of error based on the standard error of the difference (SEdiff). The SEdiff is the square root of the sum of the squared standard errors of measurement (SEMs) for test and retest (Hageman & Arrindell, 1993; Iverson et al., 2003). SEMs are, in turn, directly related to the reliability coefficient. Thus, as can be seen in Tables 1 and 2, low test-retest reliability is associated with higher SEdiff values, which would invariably lead to larger RCIs. For example, returning to the SF group's BVMT-R Delayed Recall performance, the 80% confidence interval (CI), as derived from the SEdiff of 2.5, is 3.3. When this figure is added to the measured practice effect (+1.9) the 80% RCI stands at 5. Considering that the baseline mean is 8.8, the RCI then extends beyond the test's ceiling. This finding raises serious concerns about the test's utility when the same form is administered repeatedly.
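The worked example above can be retraced numerically. The sketch assumes the conventional formula SEM = SD × sqrt(1 − r) with the same test-retest r applied to both occasions, which the text implies but does not spell out. The inputs (baseline SD 2.8, retest SD 1.3, r = 0.32, practice effect 1.9, baseline mean 8.8, maximum score 12) are the SF BVMT-R Delayed Recall values quoted in this article; the function and variable names are illustrative.

```python
import math
from statistics import NormalDist


def se_diff(sd_test, sd_retest, reliability):
    """Standard error of the difference from the two SEMs.

    Assumption: SEM = SD * sqrt(1 - r), using the same test-retest r
    for both occasions.
    """
    sem_test = sd_test * math.sqrt(1.0 - reliability)
    sem_retest = sd_retest * math.sqrt(1.0 - reliability)
    return math.sqrt(sem_test ** 2 + sem_retest ** 2)


# SF group, BVMT-R Delayed Recall (values quoted in the article)
BASELINE_MEAN = 8.8
PRACTICE_EFFECT = 1.9
TEST_CEILING = 12.0

sed = se_diff(sd_test=2.8, sd_retest=1.3, reliability=0.32)

# Two-sided 80% interval: z for the 90th percentile (about 1.28)
z_80 = NormalDist().inv_cdf(0.90)
margin = z_80 * sed

# Gain needed to exceed practice plus measurement error
required_gain = PRACTICE_EFFECT + margin
required_score = BASELINE_MEAN + required_gain

print(f"SEdiff = {sed:.1f}")                     # 2.5
print(f"80% margin = {margin:.1f}")              # 3.3
print(f"required gain = {required_gain:.1f}")    # 5.2
print(f"exceeds ceiling: {required_score > TEST_CEILING}")  # True
```

Under these assumptions, a patient starting at the group mean would need a retest score near 14 on a 12-point scale to show reliable improvement, which is the ceiling problem described in the text.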
It is noteworthy that while significant practice effects were found, group × time interactions did not emerge for either the PASAT or COWAT. The PASAT finding is perhaps most easily explained. Patients are asked to quickly add numbers while listening for new stimuli, and it would be nearly impossible for patients to memorize number combinations while the test is proceeding. Practice effects found on the PASAT are presumably a result of the development of a test-taking strategy that should not differ across same versus alternate-form conditions. The COWAT data are more difficult to explain. Using a two-week test-retest interval, we (Zgaljardic & Benedict, 2001) found similar results when comparing the CFL and FAS forms in healthy volunteers. Although a significant effect of practice was observed, there was no interaction. These and the present findings imply that procedural aspects of the test (e.g., retrieval strategy, learning to avoid repeating the same stem word, etc.) are more important than the examinee's familiarity with specific word content. The finding, if replicated, would challenge the common belief that alternate forms are an important consideration for tests of generative verbal fluency.
This study supports the reliability and validity of several tests proposed by a consensus panel for the minimal core assessment of MS patients (Benedict et al., 2002). The MACFIMS battery is a rationally derived, clinically oriented battery based on expert consensus regarding the cognitive domains most important to assess in MS. Test selection was based largely on psychometric properties of candidate measures and ease of administration. All of the recommended tests have acceptable test-retest reliability in this sample of 34 MS patients. If one assumes the use of CVLT-II and BVMT-R alternate forms, only one measure (CVLT-II Trial 1) showed a reliability coefficient less than 0.70. Furthermore, the construct validity of these memory tests is supported by the demonstration of a learning curve followed by modest forgetting, approximating data derived from healthy volunteers (Benedict, 1997; Delis et al., 2000).
There are several methodological limitations in this study that merit further comment. First, the lack of a normal-control group hindered determination of whether the magnitude of practice shown on these tests is “normal.” Second, the small sample size raises questions about external validity, and because so few men were enrolled in the study, analysis of gender interaction effects could not be pursued. The study is further limited by the one-week test-retest interval. These findings are useful for planning serial assessments for clinical trials where short test-retest intervals are needed. However, many clinical trials or natural history studies will require much longer test-retest intervals and very different findings might be obtained in such a research design. In addition, the patients studied here are mostly those with relapsing-remitting course and mild physical/neurological disability. Research with other clinical populations suggests that elderly patients with chronic psychiatric disease may not show practice with repeated exposure to the same test forms (Harvey et al., 2005). Unfortunately, our study sheds no light on the question of whether or not alternate forms are valuable in older MS patients with greater degrees of disability.
These concerns notwithstanding, the present data support the use of alternate forms when evaluating memory in MS repeatedly over time. When the test-retest interval is short, alternate forms may enhance, not hinder, test-retest reliability.
ACKNOWLEDGMENTS
We thank Drs. Darcy Cox, Laetitia Thompson, and Frederick Foley for their assistance in data collection. This research was supported, in part, by an unrestricted educational grant from Biogen Idec.