Evaluation of measurement equivalence of the Family Satisfaction with the End-of-Life Care (FAMCARE): Tests of differential item functioning between Hispanic and non-Hispanic White caregivers

Jeanne A. Teresi; Katja Ocepek-Welikson; Mildred Ramirez; Marjorie Kleinman; Katherine Ornstein; Albert Siu; Jose Luchsinger

doi:10.1017/S1478951520000152

Published online by Cambridge University Press: 19 March 2020

Jeanne A. Teresi: Affiliation:
Research Division, Hebrew Home at Riverdale, Riverdale, New York, NY; Measurement and Data Management Core, Mount Sinai Pepper Older Americans Independence Center, Mount Sinai Medical Center, and Analytic Core, Columbia University Alzheimer's Disease Resource Center for Minority Aging Research, New York, NY; Columbia University Stroud Center, New York State Psychiatric Institute, New York, NY; Division of Geriatrics and Palliative Medicine, Weill Cornell Medical Center, New York, NY
Katja Ocepek-Welikson: Affiliation:
Research Division, Hebrew Home at Riverdale, Riverdale, New York, NY
Mildred Ramirez*: Affiliation:
Research Division, Hebrew Home at Riverdale, Riverdale, New York, NY; Measurement and Data Management Core, Mount Sinai Pepper Older Americans Independence Center, Mount Sinai Medical Center, and Analytic Core, Columbia University Alzheimer's Disease Resource Center for Minority Aging Research, New York, NY; Division of Geriatrics and Palliative Medicine, Weill Cornell Medical Center, New York, NY
Marjorie Kleinman: Affiliation:
Columbia University Stroud Center, New York State Psychiatric Institute, New York, NY
Katherine Ornstein: Affiliation:
Department of Geriatrics and Palliative Medicine, Institute for Translational Epidemiology Mount Sinai School of Medicine, New York, NY
Albert Siu: Affiliation:
Department of Geriatrics and Palliative Medicine, General Internal Medicine, Health Evidence and Policy, Mount Sinai Medical Center, New York, NY
Jose Luchsinger: Affiliation:
Department of Medicine, Columbia University Medical Center, PH9 Center, New York, NY 10032
*: Author for correspondence: Mildred Ramirez, Research Division, Hebrew Home at Riverdale in RiverSpring Health, 5901 Palisade Avenue, Riverdale, New York, NY 10471, USA. E-mail: milramirez@aol.com

Abstract

Objective

Although the psychometric properties of the Family Satisfaction with End-of-Life Care measure have been examined in diverse settings internationally, little evidence exists regarding measurement equivalence in Hispanic caregivers. The aim was to examine the psychometric properties of a short-form of the FAMCARE in Hispanics using latent variable models and place information on differential item functioning (DIF) in an existing family satisfaction item bank.

Method

The graded form of the item response theory model was used for the analyses of DIF; sensitivity analyses were performed using a latent variable logistic regression approach. Exploratory and confirmatory factor analyses to examine dimensionality were performed within each subgroup studied. The sample included 1,834 respondents: 317 Hispanic and 1,517 non-Hispanic White caregivers of patients with Alzheimer's disease and cancer, respectively.

Results

There was strong support for essential unidimensionality for both Hispanic and non-Hispanic White subgroups. Modest DIF of low magnitude and impact was observed; flagged items related to information sharing. Only 1 item was flagged with significant DIF by both a primary and sensitivity method after correction for multiple comparisons: “The way the family is included in treatment and care decisions.” This item was more discriminating for the non-Hispanic, White responders than for the Hispanic subsample, and was also a more severe indicator at some levels of the trait; the Hispanic respondents located at higher satisfaction levels were more likely than White non-Hispanic respondents to report satisfaction.

Significance of results

The magnitude of DIF was below the salience threshold for all items. Evidence supported the measurement equivalence and use for cross-cultural comparisons of the short-form FAMCARE among Hispanic caregivers, including those interviewed in Spanish.

Keywords

Differential item functioning; Ethnic diversity; Family satisfaction with end-of-life care; Item response theory; Palliative care

Type: Original Article
Information: Palliative & Supportive Care, Volume 18, Issue 5, October 2020, pp. 544-556

DOI: https://doi.org/10.1017/S1478951520000152
Copyright: Copyright © Cambridge University Press 2020

Introduction

The Family Satisfaction with End-of-Life Care (FAMCARE) scale (Kristjanson, 1986, 1989), although used most widely with cancer patients in palliative care, has also been applied across a range of serious illnesses (Hwang et al., 2003), including with caregivers to patients with Alzheimer's disease (Teresi et al., 2019) and residents in long-term care (Rodriguez et al., 2010). The FAMCARE is used widely internationally as a quality measure of end-of-life care in clinical and research settings, and translations are available in many languages, including Italian (D'Angelo et al., 2017), Spanish (Teresi et al., 2019), and Swedish (Ljungberg et al., 2015). Although the psychometric properties of the scale have been examined in cancer patients in diverse settings internationally, little evidence exists regarding measurement equivalence across ethnically diverse groups. There is also little experience with the scale among individuals with different diseases such as Alzheimer's disease and related disorders (ADRD) or among ethnic subgroups, including Spanish speakers and caregivers. While several studies have examined the relationship of demographic characteristics to satisfaction with end-of-life care (Kristjanson, 1993; Lo et al., 2009; Aoun et al., 2010), no studies have examined these characteristics in terms of measurement equivalence in Hispanic samples.

A study of measurement equivalence comparing Black with White non-Hispanic caregivers of patients with cancer found that 13 items evidenced differential item functioning (DIF), a type of item bias; however, none was of high magnitude (Teresi et al., 2015). Moreover, the scale-level impact was negligible. One item related to pain relief evidenced DIF for race and education and was also hypothesized to show DIF. To our knowledge, no other studies have examined the FAMCARE for equivalence of item endorsement across different socio-demographic groups using item response theory (IRT) methods to detect DIF. Thus, the aim of this set of analyses was to examine the psychometric properties of the scale in a sample of Hispanics using latent variable models and to obtain information on DIF to place in an existing item bank on family satisfaction and care transitions.

Methods

Qualitative

Qualitative methods, including content analyses and cognitive interviews, were used to develop Spanish translations for use among Spanish speakers (Teresi et al., 2019). The first step in the evaluation of DIF is the generation of a priori hypotheses regarding potential group differences in item responses, conditional on the trait. Hypotheses regarding potential racial/ethnic group differences in item response were established qualitatively by a panel of content experts. The following instructions related to hypotheses generation were given.

Differential item functioning means that individuals in groups with the same underlying trait (state) level will have different probabilities of endorsing an item. Put another way, item endorsement should depend only on the level of the trait (state), e.g., satisfaction, and not on membership in a group, e.g., race/ethnicity. Very specifically, randomly selected persons from each of two groups (e.g., minority and non-minority) who are at the same (e.g., mild) level of satisfaction should have the same likelihood of reporting being very satisfied with the aspects of care provided. If it is hypothesized that this is not the case, it would be hypothesized that the item has DIF with respect to race/ethnicity.

The rationale for DIF hypotheses is that items may be posited to have a different meaning for some individuals and may measure a trait that is not expected. Thus, the item could perform differently for some groups, conditional on the trait.

Quantitative analyses and tests of DIF hypotheses

The graded (Samejima, 1969) form of the IRT model (Lord and Novick, 1968; Lord, 1980; Hambleton et al., 1991) was used for the analyses of DIF. An item shows DIF if people from different subgroups but at the same level of satisfaction have unequal probabilities of endorsement. The item characteristic curve (ICC) that relates the probability of an item response to the underlying state, e.g., satisfaction, measured by the item set can be characterized by two parameters: location (denoted b and also called threshold, difficulty, or severity) and a discrimination parameter (denoted a) that is proportional to the slope of the curve. Approaches to DIF analysis in the assessment of patient- and caregiver-reported outcomes using IRT are described in Orlando-Edelen et al. (2006). The Wald test was used for examination of group differences in IRT item parameters (Lord, 1980; Teresi et al., 2000; Cai et al., 2011) accompanied by magnitude measures (Thissen et al., 1993; Raju et al., 1995; Kleinman and Teresi, 2016).
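As an illustration of how the a and b parameters shape item responses, the following minimal sketch computes category probabilities and the expected item score under a logistic graded response model for an item with three ordinal categories. The function names and the parameter values are hypothetical and chosen only for illustration; they are not estimates from this study, and the sketch is not the IRTPRO estimation procedure.

```python
import math


def grm_category_probs(theta, a, b):
    """Category probabilities under a logistic graded response model.

    theta: latent trait level (e.g., satisfaction)
    a:     discrimination parameter
    b:     ordered list of location (threshold) parameters
    Returns one probability per category 0..len(b).
    """
    # Cumulative probability of responding in category k or higher,
    # bracketed by 1.0 (category 0 or higher) and 0.0 (above the top category).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]


def expected_item_score(theta, a, b):
    """Expected item score: sum of category value times its probability."""
    probs = grm_category_probs(theta, a, b)
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical item: a = 2.0, thresholds at -1.0 and 0.5 (three categories: 0, 1, 2).
for theta in (-2.0, 0.0, 2.0):
    probs = grm_category_probs(theta, a=2.0, b=[-1.0, 0.5])
    print(theta, [round(p, 3) for p in probs], round(expected_item_score(theta, 2.0, [-1.0, 0.5]), 3))
```

A larger a makes the category probabilities change more sharply around each threshold, while shifting the b values moves the whole set of curves along the satisfaction continuum.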

Uniform DIF is detected when the b parameters differ because the direction of the DIF (more or less severe) for one group as contrasted with a comparison group is the same across the latent continuum. If the a parameters differ, this result is called non-uniform DIF because the ICC curves cross and the direction of DIF can differ across the latent continuum. Non-uniform DIF occurs when the probability of response is in a different direction for the reference and focal groups, at different levels of the latent trait (θ). For example, Hispanic persons may have a lower probability than White, non-Hispanic persons of endorsing a satisfaction item at low levels of the satisfaction trait and higher probabilities of an endorsement than White, non-Hispanic persons at higher levels. If non-uniform DIF is detected in the context of the IRT method, this finding assumes primacy over findings of uniform DIF because tests for group differences in the a parameters are followed by conditional tests of the b parameters (tests of b parameters are performed, constraining the a parameters to be equal).
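The distinction between uniform and non-uniform DIF can be sketched with a simple two-parameter logistic curve for a single response threshold. The helper below is hypothetical and uses made-up parameters; it only checks whether the between-group difference in endorsement probability keeps the same sign across a grid of theta values (uniform) or changes sign (non-uniform). It is a descriptive illustration, not the Wald test used in these analyses.

```python
import math


def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def dif_pattern(ref, focal, grid):
    """Classify the DIF pattern from the sign of the group difference.

    ref, focal: (a, b) tuples for the reference and focal groups.
    grid:       theta values at which the curves are compared.
    """
    diffs = [icc(t, *focal) - icc(t, *ref) for t in grid]
    # Record the direction of every non-negligible difference.
    signs = {d > 0 for d in diffs if abs(d) > 1e-9}
    if not signs:
        return "none"
    return "uniform" if len(signs) == 1 else "non-uniform"


grid = [x / 10.0 for x in range(-30, 31)]

# Same discrimination, shifted location: the curves never cross on the grid.
print(dif_pattern(ref=(1.5, 0.0), focal=(1.5, 0.4), grid=grid))

# Different discrimination: the curves cross, so the direction of DIF reverses.
print(dif_pattern(ref=(2.0, 0.0), focal=(1.0, 0.0), grid=grid))
```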

An iterative process was used in the selection of the anchor items for theta estimation. There are several methods for selecting anchor items, assumed to be DIF-free (Orlando-Edelen et al., 2006; Woods, 2009; Wang et al., 2012). The approach that was used in these analyses was a modified “all-other” method in which initial DIF estimates were obtained by treating each item as a “studied” item while using the remainder as “anchor” items. The purification process was also iterative, and items identified as DIF-free were those included in the final anchor set. IRTPRO, version 3.1 option 3, which permits the all-other approach for the multiple group case, was used. This (Wald-type) procedure is more robust than just relying on the all-other anchor procedure and may take several iterations.

The final P values testing for DIF were adjusted using the Bonferroni method (Bonferroni, 1936). Other methods such as Benjamini–Hochberg (B-H; Benjamini and Hochberg, 1995; Thissen et al., 2002) have been used in sensitivity analyses for many of our studies. Generally, the results are almost identical. Thus, the Bonferroni method was selected as the primary approach for adjustment for multiple comparisons.
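For readers unfamiliar with the two adjustments, the following sketch computes Bonferroni and Benjamini–Hochberg adjusted P values from a list of raw P values. The input values are made up for illustration and are not results from this study; the function names are likewise hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each P value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjustment (step-up false discovery rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value so that the adjusted
    # values stay monotone in the original ordering.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.012, 0.03, 0.2]  # made-up P values for four DIF tests
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni is the more conservative of the two, so any item flagged after the Bonferroni adjustment would also be flagged under Benjamini–Hochberg at the same nominal level.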

Model assumptions and fit: Exploratory and confirmatory factor analyses (Asparouhov and Muthén, 2009) to examine dimensionality were performed within each subgroup studied, and fit indices (Bentler, 1990) examined. Additionally, the explained common variance (ECV) was used as an indicator of unidimensionality. The ECV (Sijtsma, 2009), estimated as the percent of observed variance explained (Reise, 2012), can be calculated as the ratio of the first eigenvalue to the sum of all eigenvalues extracted (see Reise et al., 2010).
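The eigenvalue-based calculation described above can be sketched in a few lines. The eigenvalues below are made up for illustration, and the function names are hypothetical; in practice the eigenvalues would come from the factor analysis of the polychoric correlation matrix.

```python
def explained_common_variance(eigenvalues):
    """Ratio of the first (largest) eigenvalue to the sum of all extracted eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return ordered[0] / sum(ordered)


def first_to_second_ratio(eigenvalues):
    """Ratio of the first to the second eigenvalue, another unidimensionality indicator."""
    ordered = sorted(eigenvalues, reverse=True)
    return ordered[0] / ordered[1]


eigenvalues = [7.5, 0.5, 0.4, 0.3, 0.3]  # made-up values, not study results
print(round(100 * explained_common_variance(eigenvalues), 2))  # expressed as a percent
print(first_to_second_ratio(eigenvalues))
```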

Local independence requires that all pairs of item responses be independent, conditional on the latent trait. Local dependency (LD) was examined using the methods of Chen and Thissen (1997). A suggested cutoff indicative of potential LD is 10 (Chen and Thissen, 1997; Cai et al., 2011). This approach is based on a comparison of observed and expected frequencies derived from item-by-item two-way cross-tabulations; the likelihood ratio statistic resulting from this comparison is chi-square distributed. LD statistics are affected by sample size and increase in value as sample size increases. Thus, to ensure comparability in sample sizes between the Hispanic and non-Hispanic White samples, a random sample of the White non-Hispanic group comparable in size to that of the Hispanic sample was selected. The root mean square error of approximation (RMSEA) was examined for both confirmatory factor analyses and IRT model fit.
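The comparison of observed and expected frequencies can be illustrated with a generic likelihood ratio (G²) statistic for one item pair. This is a sketch only: the expected frequencies are assumed to be supplied by the fitted IRT model, the tables below are made up, and the code does not reproduce the exact standardized LD index reported by IRTPRO.

```python
import math


def likelihood_ratio_g2(observed, expected):
    """G2 = 2 * sum(O * ln(O / E)) over the cells of a two-way table."""
    g2 = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            if o > 0:  # empty observed cells contribute nothing
                g2 += 2.0 * o * math.log(o / e)
    return g2


# Made-up 2 x 2 cross-tabulation for a pair of items and its model-expected counts.
observed = [[30, 20], [20, 30]]
expected = [[25.0, 25.0], [25.0, 25.0]]
small = likelihood_ratio_g2(observed, expected)

# The same proportions with ten times as many respondents.
observed_large = [[10 * o for o in row] for row in observed]
expected_large = [[10 * e for e in row] for row in expected]
large = likelihood_ratio_g2(observed_large, expected_large)

print(round(small, 3), round(large, 3))
```

With identical cell proportions, the statistic grows in direct proportion to the number of respondents, which is why the two groups were compared at similar sample sizes.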

The best methods and criteria for cutoff values for goodness of fit statistics have been debated (e.g., Cook et al., 2009), with recommendations to not be overly reliant on specific values, given the many factors that may affect these statistics. The following model fit statistics and criteria for goodness of fit (Bentler, 1990) provided general guidelines, and included the comparative fit index (CFI; Bentler, 1990; CFI > 0.95), Tucker–Lewis index (TLI; Tucker and Lewis, 1973; TLI > 0.95), and the root mean square error of approximation (RMSEA < 0.06).

Evaluation of DIF magnitude and impact: Expected item scores were measures of magnitude. A method for quantification of the difference in the average expected item scores is the non-compensatory DIF (NCDIF) index used by Raju et al. (1995). NCDIF is expressed as the average squared difference in expected scores for individuals as members of the focal group and as members of the reference group. The cutoff recommended as indicative of high DIF magnitude is 0.024 for polytomous items with three response options. An additional effect size measure (T1) proposed by Wainer (1993) and extended for polytomous data by Kim et al. (2007) was also examined; however, primary reliance was on the NCDIF magnitude measure because little research has been conducted on the performance of T1. The use of these statistics is explicated in Kleinman and Teresi (2016) and Teresi et al. (2007).
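The NCDIF definition above can be sketched as follows for an item with three response categories under a logistic graded response model. All parameter values and theta estimates are made up, the function names are hypothetical, and this is not the MAGNITS implementation; it only shows the averaging of squared differences in expected item scores over focal group members.

```python
import math

NCDIF_CUTOFF = 0.024  # cutoff cited in the text for three response options


def expected_item_score(theta, a, b):
    """Expected score for categories 0..len(b): the sum of P(response >= k), k >= 1."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b)


def ncdif(focal_thetas, focal_params, ref_params):
    """Average squared difference in expected item scores over focal group members.

    Each person's expected score is computed twice: once with the focal group
    item parameters and once with the reference group item parameters.
    """
    diffs = [
        expected_item_score(t, *focal_params) - expected_item_score(t, *ref_params)
        for t in focal_thetas
    ]
    return sum(d * d for d in diffs) / len(diffs)


# Made-up theta estimates for focal group members and made-up (a, [b1, b2]) parameters.
focal_thetas = [-1.5, -0.5, 0.0, 0.4, 1.2]
value = ncdif(focal_thetas, focal_params=(1.8, [-0.9, 0.6]), ref_params=(2.2, [-1.0, 0.5]))
print(round(value, 4), "high magnitude" if value > NCDIF_CUTOFF else "below cutoff")
```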

Expected scale scores that provide information about the effect of DIF on the total score were calculated by summing the expected item scores. Group differences in these scale response functions provide overall aggregated measures of impact.

DIF sensitivity analyses: Sensitivity analyses were conducted with a different method, an ordinal logistic regression approach with a latent conditioning variable; lordif, version 0.3-3 (Choi et al., 2011) was used. This method was used to flag consistent DIF identified by both methods that might be salient based on magnitude and impact measures.

Additionally, sensitivity analyses were conducted comparing only Spanish speakers to White, non-Hispanic English speakers.

Reliability and information: Reliability was evaluated with McDonald's omega total (ω_t; McDonald, 1999); this estimate is based on the proportion of total common variance explained. Reliability estimates were also calculated for various points along the latent continuum of family satisfaction using IRT. IRT also provides estimates of the information provided by items and scales. This item information can be used to select items for short-form measures. Additionally, information function parameters stored in item banks are used to generate computerized adaptive tests that tailor item selection to target the respondent's level of the trait based on responses to a starting item and to other items administered.

MPlus, version 6.11 (Muthén and Muthén, 2011) was used for factor analyses and IRTPRO, version 3.12 (Cai et al., 2011) for IRT item parameter estimation and DIF analyses. Item level magnitude using NCDIF (Fleer, 1993; Raju et al., 1995; Flowers et al., 1999; Morales et al., 2006) was estimated using MAGNITS (Kleinman and Teresi, 2016). Scale-level impact was evaluated using lordif, version 0.3-3 (Choi et al., 2011) in the psych package in R. Reliability estimated with McDonald's omega was also calculated with R version 3.4.4 (R Core Team, 2018).

Measure

The short-form FAMCARE used in these analyses was based on earlier work (Teresi et al., 2014) with advanced psychometric methods. This work showed that lower categories were overlapping such that the probability of response was similar for the three categories: “very dissatisfied,” “dissatisfied,” and “undecided,” indicating little if any unique information provided by these categories. Thus, items were coded as ordinal and collapsed as follows: “very satisfied” responses were coded as 2, “satisfied” as 1 and “not satisfied” (indecision or “dissatisfaction”) as 0, with a resulting sum score from 0 to 20. The item analyses were thus performed with three ordinal response categories.
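The collapsing of the five original response options into three ordinal codes, and the resulting 0 to 20 sum score over 10 items, can be sketched as follows. The dictionary keys use the category labels given in the text; the function name and the example response pattern are hypothetical.

```python
# Collapse the five original response options into three ordinal codes.
RECODE = {
    "very dissatisfied": 0,
    "dissatisfied": 0,
    "undecided": 0,
    "satisfied": 1,
    "very satisfied": 2,
}


def sum_score(responses):
    """Sum of recoded item responses (range 0 to 20 for 10 items)."""
    return sum(RECODE[r] for r in responses)


# Made-up response pattern for one caregiver across the 10 items.
example = ["very satisfied"] * 4 + ["satisfied"] * 3 + ["undecided"] * 2 + ["dissatisfied"]
print(sum_score(example))
```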

Sample

There were 1,834 respondents, 317 Hispanics, and 1,517 non-Hispanic Whites; among the Hispanic sample, 209 were interviewed in Spanish. For these analyses, the Hispanic Spanish and English speakers were combined because not enough respondents were interviewed in English to perform a separate DIF analysis by the language of administration. The Hispanic sample consisted of caregivers to patients with Alzheimer's disease (study period June 1, 2013 through March 31, 2019), while the White non-Hispanic sample consisted of caregivers to cancer patients (study period September 30, 2006 through July 31, 2013). A larger proportion of the Hispanic caregiver sample (83%) than of the non-Hispanic White caregiver sample (55%) was female, and the Hispanic caregivers were younger (74% were below age 65, as contrasted with 62% of the non-Hispanic Whites; see Table 1). Among the Hispanic caregiver sample, 45% had some post-high school education and 24% had 0–11 years, as contrasted with the White non-Hispanic sample for which only 11% had less than high school education. More of the Hispanic sample of caregivers (77%) than the White non-Hispanic caregiver sample (54%) lived with the patient. The average age of the non-Hispanic White care recipients was 60.7 (11.6) as contrasted with the Hispanic care recipients with an average age of 79.9 (8.9).

Table 1. Demographic characteristics of the caregivers and care recipients for the White and Hispanic samples

Sample size: Hispanic responders (n = 317); non-Hispanic White responders (n = 1,517); total (n = 1,834). Data were not available for care recipient education for the Hispanic sample.

This study was approved by the Institutional Review Board (IRB) at Mount Sinai Medical Center (study reported at https://projectreporter.nih.gov/project_info_description.cfm?aid=7892314) and at Columbia University Medical Center (protocols IRB-AAAL7251, IRB-AAAM5150), reported at https://projectreporter.nih.gov/project_info_description.cfm?aid=9251192&icde=43514731&ddparam=&ddvalue=&ddsub=&cr=10&csb=default&cs=ASC&MMOpt=.

Results

Qualitative

The DIF hypotheses were posited with respect to race/ethnicity and language. Although the majority (two-thirds) of the Hispanic sample were interviewed in Spanish, the number interviewed in English was too small to examine language within the Hispanic subgroup. Thus, only the hypotheses regarding ethnicity were relevant to these analyses. With respect to race/ethnicity, 5 items out of 10 were hypothesized to evidence DIF; however, a direction was given for only 2: “The way the family is included in treatment and care decisions” and “Information given about the patient's tests.” These were hypothesized to be more likely endorsed in the dissatisfied direction, conditional on the trait, by minority than by White respondents.

Quantitative

Model assumptions: As shown by the eigenvalue ratios in Table 2, there was strong support for essential unidimensionality for the total sample and both subgroups, Hispanic and non-Hispanic White responders. All three ratios of the first to the second component were large (total sample: 19.5; non-Hispanic White responders: 16.1; Hispanic responders: 33.9). The first component accounted for between 74% and 89% of the variance for all groups, supporting the essential unidimensionality of the item set across comparison subgroups. The RMSEA index from the MPlus analysis was 0.10 for the total sample and for both demographic groups. The RMSEA indices from the IRTPRO estimation were slightly lower, ranging from 0.08 to 0.09. The CFIs ranged from 0.988 to 0.997. The ECVs ranged from 92.66% to 96.77% (see Table 3).

Table 2. Eigenvalues from the exploratory factor analysis using principal components estimation and fit indices from confirmatory factor analyses^a (MPlus)

Model fit statistics: comparative fit index (CFI); Tucker–Lewis index (TLI), and root mean square error of approximation (RMSEA) from MPlus and RMSEA from IRTPRO.

^a Geomin (oblique) rotation and fit statistics for one factor solutions.

^b Based on M₂ statistics which are based on full marginal tables.

Table 3. Reliability and dimensionality estimates

All analyses based on polychoric correlations.

In general, the LD statistics (Chen and Thissen, 1997) were in the acceptable range for Hispanics, and over the threshold for the non-Hispanic White sample. There were five instances of LDs above 10 for the White non-Hispanic sample (see Appendix Table A1): items 2 (availability of doctors) and 8 (doctor assesses symptoms; 15.9); items 3 (coordination of care) and 4 (time required to make diagnosis; 13.2); items 5 (families included in treatment) and 8 (doctor assesses symptoms; 14.6); items 6 (information given about management of pain) and 10 (availability of the doctor; 14.5); and items 9 (tests and treatments followed up by doctor) and 10 (availability of the doctor; 12.2). These values did not appear to inflate the magnitude of the discrimination parameters, and the values were relatively low; thus, it was concluded that they did not require action.

The reliability estimates were high. The omega total values ranged from 0.962 to 0.986, and the ordinal alphas ranged from 0.961 to 0.985 (see Table 3). The classical test theory estimated Cronbach's alpha for the total sample was 0.95 for both non-standardized and standardized calculations. The corrected item-total correlations ranged from 0.72 to 0.83 (see Appendix Table A2). The internal consistency estimates for those interviewed in English and Spanish were 0.96 and 0.97, respectively.

IRT-based reliability: The reliability estimates calculated along the satisfaction continuum were >0.90 in the range of theta from −2.0 to 0.8. Estimates were slightly lower at the dissatisfied tail (0.80, 0.83, 0.84 across the total, non-Hispanic White, and Hispanic subsamples) as well as the very satisfied range of the distribution. The overall reliability estimates were 0.90 for the total sample, 0.91 for the non-Hispanic White, and 0.93 for the Hispanic subgroup (see Table 4).

Table 4. IRT reliability estimates at varying levels of the attribute (theta) estimate based on results of the IRT analysis (IRTPRO)

Note: Reliability estimates were calculated for theta levels for which there are respondents.

The information functions for the items and overall scale for the total sample were bimodal, with the highest peaks at theta levels −1.2 and 0.4. The most informative item was “The way tests and treatments are followed up by the doctor” (item 9), and the least informative item was “Coordination of care” (item 3; see Figure 1).

Fig. 1. FAMCARE: scale and item information functions.

The analyses of DIF showed that three items evidenced DIF consistently by two methods: IRTPRO and lordif (see Table 5 and Appendix Table A3). However, only one item was flagged as significant by both methods. After the Bonferroni adjustment, non-uniform DIF was flagged with IRTPRO for the item, “The way the family is included in treatment and care decisions” (item 5). The item was more discriminating (more highly related to the satisfaction state) for the non-Hispanic, White responders than for the Hispanic subsample, and was also a more severe indicator (higher difficulty parameter) for this group at specific levels of the trait; the non-Hispanic White responders were less satisfied at higher levels of the satisfaction trait.

Table 5. Summary of DIF hypotheses and analyses

All NCDIF values were smaller than the threshold (0.0240); the range was from 0.0001 to 0.0057 and none of the T1 statistics were significant.

NU, non-uniform DIF involving the discrimination parameters; U, uniform DIF involving the location parameters.

^a The numbers in bold are the number positing DIF. Not all provided a direction to the hypothesis; only those with a direction are presented.

* Significant after Bonferroni correction.

The items “Information given about how to manage the patient's pain” (item 6) and “Information given about the patient's tests” (item 7) were identified with uniform DIF by IRTPRO; however, the result was not significant after application of the Bonferroni adjustment for multiple comparisons. The item “Information given about patient's tests” was also flagged for uniform DIF by lordif. Lordif identified non-uniform DIF for both items, after the adjustment; the items were more discriminating for the Hispanic responders (see Appendix Figure A1). The magnitude of DIF was small; all NCDIF and T1 statistics were below threshold (see Table 5). The impact of DIF was negligible, as shown by the overlapping curves (see Appendix Figure A2).

Language sensitivity analysis: A sensitivity DIF analysis was performed comparing the White non-Hispanic group to Spanish-speaking Hispanics alone (see Appendix Table A4). The results were similar to those of the main analyses. Three items showed DIF, two of which were the same as in the main analysis. No DIF comparisons were significant after the Bonferroni correction.

Discussion

The FAMCARE scale, although extensively used to assess satisfaction with care for cancer patients, has also been applied to palliative care, including caregivers to patients with Alzheimer's disease. The psychometric properties of the FAMCARE have been examined in cancer patients in diverse settings internationally, including the relationship of demographic characteristics to satisfaction with end-of-life care. However, little evidence exists concerning measurement equivalence across ethnically diverse groups, particularly in Hispanic samples.

These analyses identified only one item with consistent DIF after Bonferroni correction: item 5, “The way the family is included in treatment and care decisions.” No items evidenced salient DIF.

Although the two groups examined in this study differ in disease type, we argue that the two groups have in common that they are caregivers to individuals with serious illness and poor prognosis. The diseases are different; however, it was not posited that the different diseases would result in DIF. It was posited that cultural and language differences can have an impact on item meaning and response. An advantage of IRT is that it produces arguably more invariant parameters that can be compared because they are sample independent. Philosophically, DIF can be examined with IRT across many groups differing in socio-demographic characteristics; however, it is important to present a rationale for such analyses.

Examination of the hypotheses for the qualitative analyses in conjunction with the quantitative analyses showed that two items were posited to evidence DIF for ethnic/race groups. In general, minority groups were hypothesized to express less satisfaction than White groups, conditional on overall satisfaction. Content experts posited directional race/ethnicity hypotheses for the item that evidenced consistent DIF: “The way the family is included in treatment and care decisions” (item 5). It was posited that minority group members would be less satisfied, conditional on the trait. Contrary to the hypotheses, item 5 showed non-uniform DIF, and the uniform DIF observed was in the opposite direction of that hypothesized. As noted, this item did not reach the criteria for salient DIF. Because the experts used their clinical experience when establishing hypotheses, it is possible that they took into account the potential language barrier when suggesting lower satisfaction for Hispanics, in contrast to their White counterparts. It may be that they felt, even at the same levels of satisfaction, Hispanics might respond in a more dissatisfied direction because of general health disparities and health care disparities, both real and perceived. Although there is no literature on the FAMCARE in a sample of Hispanic caregivers to persons with ADRD, earlier work on ethnically diverse caregivers may have informed the hypotheses. In contrast to the findings reported here, in an earlier paper on DIF in the FAMCARE (Teresi et al., 2015), Black responders reported less satisfaction with their care, conditional on the trait.

The non-uniform DIF observed showed that, conditional on overall satisfaction, the reported satisfaction for Hispanics was not constant (see the crossing item response curves), thus supporting the dissatisfied direction posited by the experts for some satisfaction levels. This hypothesis is consistent with research evidence suggesting that Hispanics tend to endorse the extreme response categories in surveys (Clarke, 2000), potentially due to cultural values that relate such response style with demonstrating trustworthiness (McHorney and Fleishman, 2006).

A confirmatory directional hypothesis was not given for the item related to information about management of pain. However, in an earlier study, a similar item, “Satisfaction with the patient's pain relief,” was found to show DIF for the comparison of Blacks and White non-Hispanics. In that study, it was found that, conditional on the satisfaction level, caregivers of Black patients were less satisfied with pain relief (Teresi et al., 2015), a finding corresponding to findings of racial and ethnic disparities in pain treatment identified by Green et al. (2003). It is possible that the content experts posited the presence of an unmeasured secondary extraneous factor, such as personal experiences, that may have influenced responses to satisfaction items.

Strengths and limitations

Limitations of the study include the small number of Hispanics interviewed in English, which did not permit systematic analyses of this group. The inability to perform other subgroup analyses due to sample size restrictions is also a limitation. As pointed out by a reviewer, the overlapping information curves and high corrected item-total correlations may be indicative of redundancy in the item set for this sample. IRT-based reliability estimates provided at varying points along the satisfaction trait continuum yielded somewhat lower reliability estimates, particularly at the tails of the distribution. Thus, while omnibus summary reliability estimates appear to show uniform item performance, the scale was not uniformly reliable across the trait; however, it is emphasized that estimates were above 0.80 for nearly all theta points for which reliability was estimated.
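To show why reliability can fall at the tails of the trait, the following sketch computes conditional reliability from test information under the graded response model. It assumes the common conversion reliability(θ) = 1 − 1/I(θ), which holds when the latent trait is scaled to unit variance; the exact formula used in the analyses reported here is not restated in this section, so this is an assumption. The ten identical items and their parameters are hypothetical.

```python
import math

def grm_item_information(theta, a, thresholds):
    """Fisher information of one graded response item at theta."""
    # Cumulative probabilities P(X >= k), padded with 1 (k = 0) and 0 (k = K).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]  # probability of responding in category k
        # Derivative of the category probability with respect to theta.
        d_k = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p_k > 0:
            info += d_k ** 2 / p_k
    return info

def conditional_reliability(theta, items):
    """Assumed conversion: 1 - 1/I(theta), with the trait on a unit-variance scale."""
    test_information = sum(grm_item_information(theta, a, b) for a, b in items)
    return 1.0 - 1.0 / test_information

# Hypothetical scale: ten items with identical (a, thresholds); not study estimates.
ITEMS = [(2.0, [-1.5, -0.5, 0.5])] * 10

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta={theta:+.1f}  reliability={conditional_reliability(theta, ITEMS):.3f}")
```

Because the hypothetical thresholds sit below the upper end of the trait, information and therefore reliability drop at high theta, even though reliability near the center of the distribution is high.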

Strengths of the study include the provision of information for placement in an item bank on family satisfaction and care transitions. Such a bank was used to develop the short form of the FAMCARE (Ornstein et al., 2015) used in these analyses. Additionally, the short-form version developed with IRT was used to develop a Japanese translation (Ito and Tadaka, 2018). This study is the first to examine the measurement equivalence of the FAMCARE scale in a sample of Hispanic caregivers of patients with ADRD using latent variable models. This paper provides information on DIF for inclusion in an existing item bank on family satisfaction with care and care transitions. Additionally, reliability estimates indicated that the scale was highly reliable (estimates ≥ 0.90). Most items provided adequate information, although the item related to care coordination was less informative.

In summary, the analyses showed modest DIF of low magnitude and impact for the Hispanic sample in comparison to a White non-Hispanic sample. The item flagged related to information sharing: the way the family is included in treatment and care decisions. No items rose to the level of salient DIF of high magnitude or impact. Evidence from this study supports the measurement equivalence of the FAMCARE among Hispanics interviewed in Spanish and English. Thus, the short-form FAMCARE can be recommended for use in cross-cultural assessments and research involving such groups.

Authorship

J.A.T. substantially contributed to the design of the work, oversaw analyses, and drafted the article. K.O.-W. performed analyses and participated in drafting the article. M.R. contributed to the design, qualitative analyses, and review of the article. M.K. performed analyses. K.O. contributed to the design of the work and reviewed the manuscript. A.S. and J.L. acquired the data and participated substantially in the work. All authors have approved the publication of the article.

Acknowledgments

The authors thank Stephanie Silver, MPH for her expert editing of the manuscript.

Funding

Support for these analyses was provided by a collaboration between the Claude Pepper Older Americans Independence Center: National Institute on Aging (grant number 1P30AG028741) and the National Institute on Aging Alzheimer's Disease Resource Center on Minority Aging Research (grant number 1P30AG059303). The studies from which data were supplied were funded by the Patient-Centered Outcomes Research Institute (PCORI) (contract number CE-1304-7160) and the National Institute of Nursing Research (NINR) (grant number 1R01NR0114430-01) and the National Cancer Institute (NCI) (grant number 5R01CA116227-059999).

Conflict of interest

The authors declare that there is no conflict of interest with respect to the research, authorship, and/or publication of this article.

Appendix

Fig. A1. Item response functions and magnitude of DIF.

Note: Results are from lordif software. For each item, the upper left panel shows the expected item score plots (denoted item true score functions) for Hispanics and non-Hispanic Whites. The lower left panel shows the item characteristic curves (category response functions). The upper right panel displays the absolute group differences in expected item scores. The lower right panel shows the differences weighted by density and is indicative of the magnitude (impact) of DIF at the item level. This measure is related to the non-compensatory DIF statistic (NCDIF) described in the text.
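As a rough illustration of an item-level magnitude measure of this kind, the sketch below computes a non-compensatory DIF index in the sense of Raju et al. (1995): the average squared difference between the focal and reference expected item scores, taken over the focal group's trait estimates. It assumes the graded response model; the trait values and item parameters are hypothetical, and the function names are illustrative rather than taken from lordif or any DFIT software.

```python
import math

def expected_item_score(theta, a, thresholds):
    """Expected score on an item scored 0..K-1 under the graded response model."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def ncdif(focal_thetas, focal_params, reference_params):
    """Mean squared difference in expected item scores over the focal group."""
    squared = [
        (expected_item_score(t, *focal_params)
         - expected_item_score(t, *reference_params)) ** 2
        for t in focal_thetas
    ]
    return sum(squared) / len(squared)

# Hypothetical focal-group trait estimates and item parameters (not study values).
FOCAL_THETAS = [-1.5, -0.8, -0.2, 0.0, 0.3, 0.9, 1.4]
REFERENCE = (1.8, [-1.0, 0.0, 1.0])  # (discrimination, thresholds)
FOCAL = (1.8, [-0.8, 0.2, 1.2])      # thresholds shifted: uniform DIF

print(f"NCDIF = {ncdif(FOCAL_THETAS, FOCAL, REFERENCE):.4f}")
```

Averaging over the focal group's own trait values plays a role similar to the density weighting in the lower right panel: differences located where few respondents fall contribute little to the index.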

Fig. A2. Impact of DIF at the scale level: expected scale scores.

Table A1. Local dependency statistics (bolded entries are slightly above the threshold for elevation).

Table A2. Classical test reliability estimates (SPSS): total sample (n = 1,834)

Table A3. IRT item parameters and DIF statistics for Hispanic compared to non-Hispanic White responders (reference group)

Table A4. Sensitivity analyses: summary of DIF analyses comparing White non-Hispanic subsample with Spanish-speaking Hispanics only

Footnotes

“NS, Anchor item” refers to a non-significant DIF finding for the item during the initial iterative anchor item selection process. The “non-significant” designation refers to the second stage DIF detection procedure using the anchor items and testing the remaining items. The “non-significant” designation indicates that the item was not found to have DIF in the second stage of DIF detection.

^a Statistical test for differences in parameters is Wald test using 1 df for the test of differences in the a parameters for the comparison groups and 2 df for the test of differences in the b parameters.

^b Bolded entries indicate items that evidence DIF after correction for multiple comparisons.

NU, non-uniform DIF involving the discrimination parameters; U, uniform DIF involving the location parameters.

Note: No items were significant after correction for multiple comparisons.
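The effect of a multiple-comparison correction on item-level DIF tests can be sketched as follows. The reference list cites both the Bonferroni and the Benjamini-Hochberg procedures; this example assumes the Benjamini-Hochberg step-up rule and uses hypothetical p-values, not values from the tables.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

# Hypothetical p-values for three items: one is below 0.05 before adjustment,
# but none survives the correction.
print(benjamini_hochberg([0.04, 0.20, 0.30]))
# Hypothetical p-values for five items: the three smallest survive.
print(benjamini_hochberg([0.20, 0.001, 0.50, 0.025, 0.012]))
```

The first call shows how an item can be nominally significant yet not flagged after adjustment, which is the situation described in the note above.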

References

Aoun, S, Bird, S, Kristjanson, LJ, et al. (2010) Reliability testing of the FAMCARE-2 scale: Measuring family care satisfaction with palliative care. Palliative Medicine 24(7), 674–681.

Asparouhov, T and Muthén, B (2009) Exploratory structural equation modeling. Structural Equation Modeling 16, 397–438.

Benjamini, Y and Hochberg, Y (1995) Controlling for the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289–300. doi:10.2307/2346101

Bentler, PM (1990) Comparative fit indexes in structural models. Psychological Bulletin 107(2), 238–246. doi:10.1037/0033-2909.107.2.238

Bonferroni, CE (1936) Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8, 3–62.

Cai, L, Thissen, D and du Toit, SHC (2011) IRTPRO: Flexible, Multidimensional, Multiple Categorical IRT Modeling (Computer Software). Chicago, IL: Scientific Software International, Inc.

Chen, WH and Thissen, D (1997) Local dependence indices for item pairs using item response theory. Journal of Educational and Behavioral Statistics 22, 265–289.

Choi, SW, Gibbons, LE and Crane, PK (2011) lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software 39, 1–30.

Clarke, I (2000) Extreme response style in cross-cultural research: An empirical investigation. Journal of Social Behavior and Personality 15, 137–152.

Cook, KF, Kallen, MA and Amtmann, D (2009) Having a fit: Impact of number of items and distribution of data on traditional criteria for assessing IRT's unidimensionality assumption. Quality of Life Research 18, 447–460. doi:10.1007/s11136-009-9464-4

D'Angelo, D, Punziano, AC, Mastroianni, C, et al. (2017) Translation and testing of the Italian version of FAMCARE-2: Measuring family caregivers’ satisfaction with palliative care. Journal of Family Nursing 23(2), 252–272.

Fleer, PF (1993) A Monte Carlo Assessment of a New Measure of Item and Test Bias (Dissertation, Dissertation Abstracts International, 54-04B, 2266). Illinois Institute of Technology, Chicago, IL.

Flowers, CP, Oshima, TC and Raju, NS (1999) A description and demonstration of the polytomous DFIT framework. Applied Psychological Measurement 23, 309–332.

Green, CR, Anderson, KO, Baker, TA, et al. (2003) The unequal burden of pain: Confronting racial and ethnic disparities in pain. Pain Medicine 4(3), 277–294.

Hambleton, RK, Swaminathan, H and Rogers, HJ (1991) Fundamentals of Item Response Theory. Newbury Park, CA: Sage Publications, Inc.

Hwang, SS, Chang, VT, Alejandro, Y, et al. (2003) Caregiver unmet needs, burden, and satisfaction in symptomatic advanced care patients at a Veterans Affairs (VA) medical center. Palliative & Supportive Care 1, 319–329.

Ito, E and Tadaka, E (2018) Development of a Japanese version of the short-form FAMCARE scale for family caregivers of terminal cancer patients at home in Japan. Nippon Ronen Igakkai Zasshi. Japanese Journal of Geriatrics 55(1), 81–89.

Kim, S, Cohen, AS, Alagoz, C, et al. (2007) DIF detection and effect size measures for polytomously scored items. Journal of Educational Measurement 44, 93–116. doi:10.1111/j.1745-3984.2007.00029.x

Kleinman, M and Teresi, JA (2016) Differential item functioning magnitude and impact measures from item response theory models. Psychological Test and Assessment Modeling 58(1), 79–98.

Kristjanson, LJ (1986) Indicators of quality of palliative care from a family perspective. Journal of Palliative Care 1(2), 8–17.

Kristjanson, LJ (1989) Quality of terminal care: Salient indicators identified by families. Journal of Palliative Care 5(1), 21–30.

Kristjanson, LJ (1993) Validity and reliability testing of the FAMCARE Scale: Measuring family satisfaction with advanced cancer care. Social Science & Medicine 36(5), 693–701.

Ljungberg, AK, Fossum, B, First, CJ, et al. (2015) Translation and cultural adaptation of research instruments – Guidelines and challenges: An example in FAMCARE-2 for use in Sweden. Informatics for Health and Social Care 40, 67–78. doi:10.3109/17538157.2013.87211

Lo, C, Burman, D, Rodin, G, et al. (2009) Measuring patient satisfaction in oncology palliative care: Psychometric properties of the FAMCARE-patient scale. Quality of Life Research 18, 747–752.

Lord, FM (1980) Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum.

Lord, FM and Novick, MR (1968) Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley Publishing Co.

McDonald, RP (1999) Test Theory: A Unified Treatment. Mahwah, NJ: L. Erlbaum Associates.

McHorney, C and Fleishman, J (2006) Assessing and understanding measurement equivalence in health outcome measures: Issues for further quantitative and qualitative inquiry. Medical Care 44(Suppl 3), S205–S210. doi:10.1097/01.mlr.0000245451.67862.57

Morales, LS, Flowers, C, Gutierrez, P, et al. (2006) Item and scale differential functioning of the Mini-Mental State Exam assessed using the Differential Item and Test Functioning (DFIT) framework. Medical Care 44(11), S143–S151.

Muthén, LK and Muthén, BO (2011) M-PLUS Users Guide, 6th ed. Los Angeles, CA: Muthén and Muthén, pp. 1998–2011.

Orlando-Edelen, M, Thissen, D, Teresi, JA, et al. (2006) Identification of differential item functioning using item response theory and the likelihood-based model comparison approach: Applications to the Mini-Mental State Examination. Medical Care 44, S134–S142.

Ornstein, KA, Teresi, JA, Ocepek Welikson, K, et al. (2015) Use of an item bank to develop two short-form FAMCARE scales to measure family satisfaction with care in the setting of serious illness. Journal of Pain and Symptom Management 49(5), 894–903.

Raju, NS, van der Linden, WJ and Fleer, PF (1995) IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement 19, 353–368.

R Core Team (2018) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/

Reise, SP (2012) The rediscovery of bifactor measurement models. Multivariate Behavioral Research 47, 667–696. doi:10.1080/00273171.2012.715555

Reise, SP, Moore, TM and Haviland, MG (2010) Bi-factor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment 92, 544–559. doi:10.1080/00223891.2010.496477

Rodriguez, KL, Bayliss, NK, Jaffe, E, et al. (2010) Factor analysis and internal consistency evaluation of the FAMCARE Scale for use in the long-term care setting. Palliative & Supportive Care 8(2), 169–176.

Samejima, F (1969) Estimation of Latent Ability Using a Response Pattern of Graded Scores (Psychometrika Monograph; Supplement 17). Dordrecht: Springer.

Sijtsma, K (2009) On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika 74, 107–120. doi:10.1007/s11336-008-9101-0

Teresi, JA, Kleinman, M and Ocepek-Welikson, K (2000) Modern psychometric methods for detection of differential item functioning: Application to cognitive assessment measures. Statistics in Medicine 19, 1651–1683.

Teresi, J, Ocepek-Welikson, K, Kleinman, M, et al. (2007) Evaluating measurement equivalence using the item response theory log-likelihood ratio (IRTLR) method to assess differential item functioning (DIF): Applications (with illustrations) to measures of physical functioning ability and general distress. Quality of Life Research 16, 43–68. doi:10.1007/s11136-007-9186-4

Teresi, JA, Ornstein, K, Ramirez, M, et al. (2014) Performance of the Family Satisfaction with the End-of-Life Care (FAMCARE) measure in an ethnically diverse cohort: Psychometric analyses using item response theory. Supportive Care in Cancer 22, 399–408.

Teresi, JA, Ocepek-Welikson, K, Ramirez, M, et al. (2015) Evaluation of measurement equivalence of the Family Satisfaction with the End-of-Life Care in an ethnically diverse cohort: Tests of differential item functioning. Palliative Medicine 29, 83–96.

Teresi, JA, Ocepek-Welikson, K, Ramirez, M, et al. (2019) Psychometric properties of a Spanish-language version of a short-form FAMCARE: Applications to caregivers of patients with Alzheimer's disease and related dementias. Journal of Family Nursing 25(4), 557–589.

Thissen, D, Steinberg, L and Wainer, H (1993) Detection of differential item functioning using the parameters of item response models. In Holland, PW and Wainer, H (eds), Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum, Inc.

Thissen, D, Steinberg, L and Kuang, D (2002) Quick and easy implementation of the Benjamini-Hochberg procedure for controlling the false discovery rate in multiple comparisons. Journal of Educational and Behavioral Statistics 27, 77–83. doi:10.3102/10769986027001077

Tucker, LR and Lewis, C (1973) A reliability coefficient for maximum likelihood factor analysis. Psychometrika 38, 1–10. doi:10.1007/BF02291170

Wainer, H (1993) Model-based standardization measurement of an item's differential impact. In Holland, PW and Wainer, H (eds), Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum, Inc., pp. 123–135.

Wang, W-C, Shih, C-L and Sun, G-W (2012) The DIF-free-then-DIF strategy for the assessment of differential item functioning. Educational and Psychological Measurement 72, 687–708. doi:10.1177/0013164411426157

Woods, CM (2009) Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement 33, 42–57. doi:10.1177/0146621607314044

Table 1. Demographic characteristics of the caregivers and care recipients for the White and Hispanic samples

Table 2. Eigenvalues from the exploratory factor analysis using principal components estimation and fit indices from confirmatory factor analyses^a (MPlus)

Table 3. Reliability and dimensionality estimates

Table 4. IRT reliability estimates at varying levels of the attribute (theta) estimate based on results of the IRT analysis (IRTPRO)

Fig. 1. FAMCARE: scale and item information functions.

Table 5. Summary of DIF hypotheses and analyses
