The Rosenberg Self-Esteem Scale (RSES), with ten items and a balanced number of positively and negatively worded items, is widely used in psychology, the social sciences, education and elsewhere. Its factor structure has been investigated by many researchers across a large variety of populations and cultural contexts. Given its opposing wording directions, bifactor models that take account of wording effects have enjoyed increasing popularity among researchers. A bifactor model is a CFA model comprising a general factor explaining the substantive trait and several method-effect factors accounting for variance caused by wording direction, phrasing, context, method, and so on; the covariances between the general factor and the method factors are fixed to zero (Reise, Morizot, & Hays, 2007). Wang, Siegal, Falck, and Carlson (2001) assessed the factor structure of the RSES in a sample of crack-cocaine users by comparing nine alternative confirmatory factor analysis (CFA) models, and found that the best-fitting model was the one with a general self-esteem factor and two method-effect factors associated with item wording, one for positive and the other for negative items. Tomas and Oliver (1999) tested nine RSES models in a sample of Spanish high school pupils, and found that the model with a general self-esteem factor and method effects among positively and negatively worded items fitted the data best. Recently, McKay, Boduszek, and Harvey (2014) compared four models: one-factor, two-factor correlated, two-factor hierarchical and bifactor models. Results showed that the bifactor model fitted best, with reasonable model fit and factor loadings. The bifactor model was also found to provide the best fit in a sample of general U.S. adults (Hyland, Boduszek, Dhingra, Shevlin, & Egan, 2014) and in children of prisoners from four European countries (Sharratt, Boduszek, Jones, & Gallagher, 2014). The bifactor model with both positive and negative method-effects therefore seems a good candidate for the RSES.
Bifactor model with negative-only effects
However, Wang et al. (2001) also noted that the emergence of a positive method-effect might be due to the uniqueness of the population, with the positive wording arousing extra mental experience among the drug users. Further, others have endorsed the one-factor model with only one method-effect, introduced by negative item wording (e.g., Marsh, 1996; Motl & DiStefano, 2002). Marsh, Scalas, and Nagengast (2010) used a seven-item self-esteem scale and found that the unidimensional model with a method-effect for negative item wording only fitted the data best. Horan, DiStefano, and Motl (2003) explored the factor structure of the seven-item RSES, based on the National Educational Longitudinal Study (NELS:88) over three time points, and found that the model with a negative wording factor fitted better. Additionally, the negative method-effects of the RSES correlated moderately with each other across time. Motl and DiStefano (2002) evaluated the longitudinal invariance of the same measurement model and found that the method-effect introduced by negatively worded items exhibited longitudinal invariance at a strict level (equal item uniquenesses), suggesting that the method-effect reflects a stable response style. Based on a sample of European children with parental imprisonment, Sharratt et al. (2014) tested a series of CFA models and found that the bifactor model with only a negative factor exhibited decent model fit, though not the best. Hence, we also consider a bifactor model with only negative effects. The RSES is in Likert format, and it is debatable whether such data should be treated as continuous or categorical.
Continuous vs categorical
There is a long-running debate on whether Likert scales can be treated as continuous; a summary can be found in Leung (2011). Muthén (1984) maintained that treating Likert-type data as continuous is inadequate because equal distances between the categories are not guaranteed, and that correlations are distorted if some variables are heavily skewed. Jöreskog (1994) argued that it is not meaningful to compute covariances from ordinal data because the distribution of Likert-type responses is non-normal and the data discrete. According to Finney and DiStefano (2006), Likert-type scales are in widespread use in the social sciences but are often treated as continuous, typically with the normal-theory maximum likelihood (ML) estimator; this attenuates the Pearson product-moment (PPM) correlations and thus biases model fit, parameter estimates and standard errors. The smaller the number of scale points, the more severely parameter estimates and standard errors are underestimated (Finney & DiStefano, 2006; Leung, 2011). Leung (2011) showed that with as few as four scale points the approximation to a continuous scale is poorest, as more scale points better approximate normal variables and hence a continuous scale. In this study the number of scale points is four. Strictly speaking, the Likert scale is an ordinal scale of measurement. However, far more studies treat it as continuous than as categorical. There is empirical evidence that many parametric results (e.g., t-tests) are indistinguishable between the two treatments, and most people regard the sum scores as continuous even though individual items are categorical. In this paper we investigate the differences in scoring. A summary of the pros and cons can be found in Leung (2011).
When treating Likert scales as categorical, we use the mean- and variance-adjusted weighted least squares (WLSMV) estimator, which displays smaller Type I error rates (Finney & DiStefano, 2006). With WLSMV, the polychoric correlation matrix together with its asymptotic covariance matrix is used as the sufficient statistic. It is posited that continuous latent response variables underlie the item responses, and that the observed ordinal categorical data arise from cutting the latent response distributions at thresholds (Muthén, 1984). Flora and Curran (2004) conducted a simulation study comparing the performance of weighted least squares (WLS) and WLSMV under varying conditions of sample size, model specification, degree of non-normality, and number of scale points. They found that WLSMV performed consistently better in terms of parameter estimates and standard errors across all simulation conditions, given a large sample size. The WLSMV estimator is therefore used in this study to address the categorical nature of the data. Besides the controversy over whether the Likert scale is continuous, there is a problem with the Chinese version of the RSES, which is dealt with below.
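Before turning to that, the following sketch makes the latent response formulation concrete: it estimates thresholds for a four-point item from its marginal proportions and finds the polychoric correlation of two items by maximum likelihood under a bivariate normal model. This is a minimal illustration with a hypothetical contingency table, not the WLSMV estimator itself.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def thresholds(margin_counts):
    """Thresholds from cumulative proportions: tau_k = Phi^{-1}(P(X <= k)),
    giving three finite cut-points for a four-point item."""
    p = np.cumsum(margin_counts) / np.sum(margin_counts)
    return norm.ppf(p[:-1])

def cell_prob(rho, ta, tb, i, j):
    """P(row = i, col = j) under a bivariate normal with correlation rho;
    +/-10 stands in for +/-infinity on the standard normal scale."""
    ta = np.concatenate(([-10.0], ta, [10.0]))
    tb = np.concatenate(([-10.0], tb, [10.0]))
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return (bvn.cdf([ta[i + 1], tb[j + 1]]) - bvn.cdf([ta[i], tb[j + 1]])
            - bvn.cdf([ta[i + 1], tb[j]]) + bvn.cdf([ta[i], tb[j]]))

def polychoric(table):
    """Maximum-likelihood polychoric correlation for a 4 x 4 table."""
    ta = thresholds(table.sum(axis=1))
    tb = thresholds(table.sum(axis=0))
    def negloglik(rho):
        return -sum(table[i, j] * np.log(cell_prob(rho, ta, tb, i, j))
                    for i in range(4) for j in range(4) if table[i, j] > 0)
    return minimize_scalar(negloglik, bounds=(-0.99, 0.99), method="bounded").x

# Hypothetical cross-tabulation of two four-point items
tab = np.array([[40, 20, 10, 5], [20, 60, 30, 10],
                [10, 30, 80, 40], [5, 10, 40, 90]])
print(round(polychoric(tab), 3))
```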
Chinese scale, 10- or 9-item
The RSES has been translated into many languages to extend its use to different cultures. In the case of the Chinese RSES, Cheng and Hamid (1995) argued that item 8, I wish I could have more respect for myself, should be dropped, because the word ‘wish’ has no comparable term in Chinese. Item 8 displayed near-zero average inter-item correlations and negative item-total correlations. A similar problem with item 8 also appeared in the Polish version (Boduszek, Shevlin, Mallett, Hyland, & O’Kane, 2012). Shek (1995) countered that item 8 should not be dropped without administering both the English and Chinese versions of the RSES and comparing the results. In response to Shek’s argument, Hamid and Cheng (1995) administered the RSES to a sample of college students in New Zealand, where the item-total correlations were all above 0.4, in contrast to near-zero in the Chinese sample. Leung and Wong (2008) used a sample of secondary school pupils in Macau and offered three alternative translations of item 8. Their results showed that none of the translations displayed satisfactory item-total correlations or internal consistency reliability. An exploration of the factor structure of the Chinese-translated RSES is certainly needed, for the following reasons. First, the translation problem with item 8 has demonstrated negative effects on the psychometric properties, but none of the previous studies analysed the factor structure. Wu (2008) studied a large sample of college students in Taiwan, compared a series of CFA models of the 10-item RSES (in Chinese), and found that a bifactor model with two group factors based on item wording fitted best. However, no information on factor loadings was provided, nor was there any reference to item 8; the mystery of item 8 was not cleared up. Wang, Kong, Huang, and Liu (2015) studied a sample of college students in China and also found that the bifactor model with two method-effect factors was the best-fitting model. However, the signs of the loadings on the positive method factor were inconsistent, and item 8 exhibited the lowest loadings on its factors, indicating that it is imperative to explore the factor structure of the Chinese RSES while taking the issue of item 8 into account. Because of this, both 10- and 9-item scales are included in this paper, the latter without item 8. However, regardless of whether item 8 is removed, the final scoring may not be affected substantially, as explained below.
Scoring
Regardless of the model adopted, a major purpose of the RSES is to rate individual self-esteem. A model with bad fit may still provide reliable scoring, a property referred to here as ‘robustness’. Generally, Bartholomew and Knott (1999, pp. 32–39) showed that the sum of item scores weighted by factor loadings correlates highly with the latent construct under study. Specifically, Leung and Wu (2013) found empirically that scores from one- and two-factor IRT models of the RSES correlate highly with the usual simple sum scores, supporting the scoring used in standard practice. In this paper, scoring under different models and conditions is investigated to examine the degree of robustness.
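In symbols, with item scores $x_i$ and loadings $\lambda_i$ for $p$ items, the two scoring rules compared throughout this paper are (our notation, as a sketch of the idea):

$$\hat{s}_{\text{weighted}} = \sum_{i=1}^{p} \lambda_i x_i, \qquad \hat{s}_{\text{sum}} = \sum_{i=1}^{p} x_i .$$

When the $\lambda_i$ are similar in magnitude, the weighted score is nearly proportional to the simple sum, which is why the two tend to correlate highly.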
The objectives of this paper are as follows. First, we consider bifactor models with both positive and negative effects (termed ‘bifactor models’) and bifactor models with only negative effects (termed ‘bifactor negative models’). Second, the RSES is in Likert format, which it is controversial to treat as continuous; we compare methods that treat the data as continuous vs categorical. Third, there is a problem in translating item 8 into Chinese; we investigate the differences between the 10- and 9-item Chinese RSES. Fourth, scorings are analysed to see whether different models provide similar or different scoring.
Method
The sample for this study consisted of 1,734 senior elementary school pupils in Hong Kong. In terms of gender, 738 (42.6%) were male and 992 (57.2%) were female, while four did not disclose their gender. In terms of class, 832 (48%) were from Grade 5, 864 (49.8%) were from Grade 6, and 38 did not report their class. The Chinese version of the RSES (Leung & Wong, 2008) is used, with ten items rated on a four-point Likert scale. The scale comprises a balance of five positively and five negatively worded items. The frequencies in each response category of each item are displayed by means of simple frequency distributions.
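As an illustration of this tabulation, a minimal pandas sketch, assuming the responses are stored with columns `i1`–`i10` coded 1–4 (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file of item responses; columns i1-i10 coded 1-4,
# negatively worded items assumed already reverse-coded
df = pd.read_csv("rses_responses.csv")
items = [f"i{k}" for k in range(1, 11)]

# Frequency of each response category per item (cf. Table 1)
print(df[items].apply(lambda col: col.value_counts().sort_index()))

# Corrected item-total correlation: each item vs the sum of the others
for it in items:
    rest = df[items].drop(columns=it).sum(axis=1)
    print(it, round(df[it].corr(rest), 3))
```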
The model comparison procedures were conducted within a CFA framework. Based on the studies by McKay et al. (2014) and others, as stated above, a series of CFA models was tested (see Figure 1) with the aim of finding the most appropriate factor structure for the 10- vs 9-item scale. The first model is the one-factor type without error covariances, based on the assumption that item responses can be explained by a common latent trait, self-esteem. The second model is the two-factor orthogonal type with two independent latent traits, where all negatively worded items load on one factor (labelled neg) and all positively worded items load on another factor (labelled pos). In the third model, the two-factor correlated type, the two factors are allowed to correlate. The fourth model is the bifactor type, where all items load on a general factor, intended to represent self-esteem; in addition, all negatively worded items load on the neg group factor and all positively worded items on the pos group factor. Finally, the fifth model is the bifactor negative type, with a method-effect introduced only for the negatively worded items; that is, the same as the bifactor model but with the pos factor removed. All CFA models were analysed with Mplus 7.2. The five models were compared for both the 10- and 9-item scales.
As the previous literature supports the bifactor model for the structure of RSES, the bifactor and bifactor negative models form the major comparison in this paper. We also include the other three models partly for purposes of comparison and partly because we want to see whether their scoring is similar to that of the bifactor models. All models are shown graphically in Figure 1.
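The analyses in this paper were run in Mplus 7.2, but as a sketch of what the two key specifications look like, here they are in lavaan-style syntax using the Python package semopy (our choice of package, not the authors’ software; we assume its syntax for fixed zero covariances, which enforce the orthogonality described above):

```python
import semopy

# Bifactor model: general factor plus orthogonal pos/neg method factors;
# items 2, 5, 6, 8 and 9 are the negatively worded ones
BIFACTOR = """
g   =~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 + i10
pos =~ i1 + i3 + i4 + i7 + i10
neg =~ i2 + i5 + i6 + i8 + i9
g ~~ 0*pos
g ~~ 0*neg
pos ~~ 0*neg
"""

# Bifactor negative model: the same with the pos factor removed
BIFACTOR_NEG = """
g   =~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 + i10
neg =~ i2 + i5 + i6 + i8 + i9
g ~~ 0*neg
"""

model = semopy.Model(BIFACTOR)
model.fit(df)                        # df: item responses, as sketched above
print(model.inspect())               # loadings and variances
print(semopy.calc_stats(model).T)    # CFI, TLI, RMSEA, AIC, BIC, ...
```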
Regarding the evaluation of model performance, a series of model fit indices was examined. Following McKay et al. (2014), Bentler’s comparative fit index (CFI), the Tucker-Lewis Index (TLI), and the root mean square error of approximation (RMSEA) were used. Values larger than 0.95 for CFI and TLI indicate good model fit (Hu & Bentler, 1999). RMSEA values smaller than .08 and .05 indicate acceptable and good fit respectively (Browne & Cudeck, 1992).
The BIC was computed for the models treating the Likert-type data as continuous. Lower BIC values indicate better models. Kass and Raftery (1995) suggested that a difference in BIC of 0–2, 2–6, 6–10 and more than 10 can be interpreted as little, positive, strong and very strong evidence, respectively, against the model with the larger BIC value. BIC is particularly useful when we compare the bifactor and bifactor negative models, because both may provide similar fits, with the latter preferred in view of its greater parsimony; BIC strikes a balance between model fit and simplicity. More importantly, BIC can be interpreted as more than a fit index. There is a simple relation between BIC differences and the Bayes factor which, under equal prior model probabilities, equals the ratio of posterior probabilities of two models. Hence, the interpretations given by Kass and Raftery (1995) can be read as the chance of the bifactor model against the bifactor negative, or vice versa.
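As a sketch of this relation: the Bayes factor of model 1 over model 2 is approximately exp(ΔBIC/2), where ΔBIC = BIC₂ − BIC₁, and with equal prior odds this converts directly into a posterior model probability. Using the 10-item BIC values reported in the Results below:

```python
import math

def bic_comparison(bic_1, bic_2):
    """Approximate Bayes factor BF_12 = exp((BIC_2 - BIC_1) / 2) and the
    posterior probability of model 1, assuming equal prior odds."""
    bf = math.exp((bic_2 - bic_1) / 2)
    return bf, bf / (1 + bf)

# 10-item bifactor (35,680) vs bifactor negative (35,697), from Table 3
bf, post = bic_comparison(35680, 35697)
print(f"Bayes factor = {bf:.0f}, P(bifactor | data) = {post:.4f}")
```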
Since comparing BICs across a different number of variables (here, 10 vs 9) may be problematic, we use the minimum BIC within the 10- and 9-item scales respectively to calculate BIC differences. Further, because the WLSMV estimator is not based on a likelihood function, BIC cannot be produced when the data are treated as categorical.
Results
Table 1 displays the frequencies by category and the item-total correlations of the ten items in the scale. It shows that there were sufficient responses in each category of each item for further analysis. With respect to item-total correlations, it is notable that item 8 displayed a correlation opposite in sign to the other items, reconfirming previous findings on the Chinese RSES, as mentioned above.
Table 2 shows the correlation matrix of the ten items in the RSES, with the upper triangle giving the Pearson correlations and the lower triangle the polychoric correlations; the mean (M) and standard deviation (SD) for each item are also displayed. In the correlation matrix, item 8 shows correlations opposite in sign to the other items; as expected, the polychoric correlations are generally larger in magnitude than the Pearson correlations.
Note: M: mean; SD: standard deviation.
Model fit and selection
Table 3 reports fit indices for the five models listed earlier, for both the 10- and 9-item scales and for both categorical and continuous treatments of the data. When treating the data as categorical, for the 10-item scale the one-factor model had poor fit across all the model fit indices (e.g., RMSEA > .08, CFI < .9, TLI < .9).
Note: χ2 = chi-square statistic with the WLSMV estimator; df = degrees of freedom; RMSEA = root mean square error of approximation; CFI = Comparative Fit Index; BIC = Bayesian Information Criterion; TLI = Tucker–Lewis Index; ∆BIC = difference in BIC, calculated as the BIC of the current model minus the minimum BIC (that of the bifactor model). 10 = 10-item scale; 9 = 9-item scale. Categorical = treating the Likert scale as categorical; Continuous = treating the Likert scale as continuous.
The two-factor orthogonal model has a less satisfactory fit than the one-factor model, with larger RMSEA and smaller CFI and TLI. The two-factor correlated model shows better CFI and TLI, but an RMSEA of .114 indicates room for improvement. The bifactor model fits best, and the bifactor negative model is second best, suggesting that a more parsimonious version without the positive method-effect could also be considered.

The fit indices for the 9-item scale show similar patterns. The one-factor model fits unsatisfactorily (e.g., RMSEA > .08, CFI < .9, TLI < .9). The two-factor orthogonal model does not fit the data as well as the one-factor model on all fit indices. The two-factor correlated model performs better than the first two, but an RMSEA of .106 suggests that improvement is still needed. The bifactor model displays good fit, suggesting that it successfully represents the factor structure of the 9-item RSES. Finally, the bifactor negative model displays fit similar to the bifactor model with two method-effects, with differences of less than .01 in RMSEA, CFI and TLI. Comparing the model fit of the 10- and 9-item scales shows that they share the same ordering across the five tested models: bifactor > bifactor negative > two-factor correlated > one-factor > two-factor orthogonal (with > indicating better performance). The bifactor models are the best. Both scales have similar fit index values, so the overall goodness-of-fit of the 10- and 9-item scales is largely similar.

When treating item responses as continuous (see the lower half of Table 3), the pattern of model fit across the five tested models was consistent with the categorical condition, indicating that, in terms of model fit, treating the RSES as continuous or categorical produces the same patterns. In terms of the model selection index BIC, the ordering is also the same: bifactor > bifactor negative > two-factor correlated > one-factor > two-factor orthogonal. The bifactor model with both effects displays slightly lower BIC values (35,680 and 32,376 for the 10- and 9-item scales respectively) than the bifactor negative model (35,697 and 32,404 respectively). The latter has fewer parameters and would therefore be expected to have the smaller BIC; that this is not the case indicates that the improvement in fit from the bifactor model more than compensates for its extra parameters. Since the bifactor model has the minimum BIC for both 10- and 9-item scales, it is taken as the standard for calculating BIC differences.
All the resulting BIC differences are larger than 10, some much larger, giving very strong evidence for choosing the bifactor model over the others. Strictly speaking, BIC may not be comparable between the 10- and 9-item scales, as the number of items differs. Nevertheless, the BICs for the 10- and 9-item scales are around 35,000 and 32,000 respectively, a difference of about 3,000. We conjecture that this reflects a clear difference in model fit that is not picked up by the other indices, as the extra parameters for a single item could not by themselves produce such a gap.
Hence, the overall conclusion is that the 9-item bifactor categorical model is the best in terms of both model fit and selection by BIC. Its composite reliability, calculated using Raykov’s (2004) method, is 0.876, indicating that the substantive factor of the RSES has satisfactory reliability. The composite reliabilities for the two group factors are lower: 0.474 and 0.335 for the positive and negative factors respectively. These results are better than the corresponding figures in McKay et al. (2014): 0.838 for the general factor, and 0.468 and 0.167 for the positive and negative factors respectively.
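Raykov’s composite reliability is the squared sum of a factor’s loadings divided by that quantity plus the summed error variances. A minimal sketch under the assumptions of standardized loadings and uncorrelated errors (the loadings shown are hypothetical, not those of Table 4):

```python
import numpy as np

def composite_reliability(loadings, error_vars):
    """Composite reliability: (sum of loadings)^2 /
    ((sum of loadings)^2 + sum of error variances)."""
    s = np.sum(loadings) ** 2
    return s / (s + np.sum(error_vars))

# Hypothetical standardized loadings for a five-item factor;
# for a single standardized factor, error variance = 1 - loading^2
lam = np.array([0.70, 0.65, 0.72, 0.58, 0.66])
print(round(composite_reliability(lam, 1 - lam**2), 3))
```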
Factor loadings
The results reported above indicate that the bifactor and bifactor negative models fit well. We proceed to investigate the factor loadings of these two models. We use the label ‘9BiPN’ for the 9-item bifactor model with both positive and negative effects and ‘9BiN’ for the same model with only negative effects; likewise, ‘10BiPN’ and ‘10BiN’ for the 10-item scale. In factor analysis, loadings can be seen as the degree to which a factor is associated with a given item. In IRT, loadings represent the discriminating power of items on their latent traits. Further, loadings affect the final factor score, which is the sum of item scores weighted by the corresponding loadings.
Standardized factor loadings for the bifactor models, 10- vs 9-item scales and categorical vs continuous treatments, are displayed in Table 4. All loadings across the models are statistically significant at p < .001, except in some minor cases. Returning to the debate about whether Likert scales can be treated as continuous: the factor loadings show a similar general pattern under both treatments. The immediate consequence is that the scoring, which is the sum of item scores weighted by loadings, is likely to be similar as well, and this is further supported by the scoring results below. This may hold even though the data are not normal, for continuous data need not be normal.
Note: 9Bi = 9-item bifactor model; 10Bi = 10-item bifactor model; P = positive method effect; N = negative method effect; neg = the negative method effect; pos = the positive method effect. Categorical = treating the Likert scale as categorical; Continuous = treating the Likert scale as continuous; Negative = factor loadings for the negatively worded items only; Positive = factor loadings for the positively worded items only.
However, a closer look at the loadings shows that their magnitudes in the categorical case are generally slightly greater than in the continuous case. This is consistent with the earlier report that treating Likert-type data as continuous and adopting ML estimators leads to underestimation of parameter estimates. Treating the data as categorical is therefore preferred, yielding factor loadings of slightly larger magnitude.
For the 10-item scale, the patterns of factor loadings are similar to those for the 9-item scale, and the loadings on the general factor are of similar magnitude across the two. But item 8, present only in the 10-item scale, always loads negatively on the general factor. Similar results can be found in Leung and Wong (2008). Though the 10- and 9-item scales have similar model fit, item 8 contaminates the loading pattern of the scale, and it is therefore recommended that it be removed, leaving a 9-item scale. This further supports the view that a scale without item 8 has a more reasonable loading pattern.
Comparing the loadings for general and group effects, in most cases the loadings on the group factors were lower than their counterparts on the general factor. This suggests that the trait of self-esteem is already largely captured by the general factor, although the method effect introduced by item wording cannot be ignored. Only two items have larger loadings on the method effect than on the general factor: items 2 and 6, which respectively contain the phrases, in English, ‘I am no good’ and ‘I feel useless’. No other item contains wording as strong as ‘no good’ or ‘useless’. We suggest that such extreme wording causes these items to load more strongly on their method-effect factors.
Regardless of whether the 10- or 9-item scale is used, whether there are two method-effects or only a negative one, and whether the data are treated as continuous or categorical, the rankings of the negative items on the general factor are the same: item 9 > 5 > 6 > 2, with item 8 negative. For the positive items, the ranking on the general factor is less clear-cut, but the range of loadings is around 0.1, indicating that the differences between the maximum and minimum loadings are not large across models, item scales and data treatments. The relative strength of each item within the positive or negative group is thus quite consistent across situations. Model 9BiPN is the only model in which the loadings of all negative items on the general factor exceed both those of the positive items on the general factor and those of the negative items on the group factor, indicating that (a) the positive items share variance with the group factor, making their general-factor loadings smaller; (b) the negative items are more dominant in the general factor; and (c) in the 10-item models, the presence of item 8 with its negative loadings inflates the loadings of the positive items. The ranking of loadings has immediate effects on factor scores, which are item scores weighted by loadings, as reported below.
Scoring
Table 5 reports the correlations of factor scores from the various fitted models with those from the 9-item bifactor model with both effects, treating the data as categorical. We choose this model as the standard because it provides the best fit across the various indices. Since there is no general factor in the two-factor correlated and orthogonal models, these require two factor scores instead of one: one positive and one negative. Table 5 shows that the correlations are similar for corresponding models between the 10- and 9-item scales, and also between categorical and continuous treatments. Hence, treating the data as categorical or continuous, or including item 8, does not have much effect on scoring; scoring is affected more by the model used. Among all models, the 9-item continuous bifactor model has a correlation of 0.99, indicating essentially no difference between the categorical and continuous treatments, although keeping item 8 lowers the correlations to around 0.95 and 0.96. We therefore suggest removing item 8. The one-factor models have correlations of around 0.97, probably because they squeeze all items into a single factor. The simple sums (9 and 10 items) have correlations of around 0.96, indicating the usefulness of this simplest of practices. The general components of the two-factor correlated models have correlations of 0.95, and the bifactor negative models of around 0.92. The positive components of the two-factor correlated models have correlations of around 0.9, showing the greater importance of the negative component compared with the positive. The two-factor orthogonal models have the lowest correlations, because they spread the scoring over two dimensions.
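The comparisons in Table 5 amount to correlating loading-weighted scores with each other and with the simple sums. A minimal sketch of the computation, reusing the df and item columns of the earlier sketches, with hypothetical general-factor loadings:

```python
import numpy as np

# Hypothetical standardized general-factor loadings for the nine items
lam = np.array([0.55, 0.60, 0.58, 0.52, 0.63, 0.61, 0.57, 0.66, 0.54])
items9 = [f"i{k}" for k in range(1, 11) if k != 8]  # drop item 8

weighted = df[items9].to_numpy() @ lam      # loading-weighted factor score
simple = df[items9].sum(axis=1).to_numpy()  # usual simple sum score

# Correlation between the two scorings (cf. the ~0.96 values in Table 5)
print(round(np.corrcoef(weighted, simple)[0, 1], 3))
```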
Note: Categorical = treating Likert scale as categorical; Continuous = treating Likert scale as continuous. 10 = 10-item scale; 9 = 9-item scale. Simple Sum = sum score for 10-item and 9-item scale each.
Discussion
The RSES is widely used to measure self-esteem, and its structure has been investigated by many authors. McKay et al. (2014) recently suggested a bifactor model for the RSES, while Donnellan, Ackerman, and Brecheen (2016) and Wang et al. (2015) also suggested bifactor negative models. This paper moves in the same direction and extends the results in more depth.
First, the bifactor negative model is considered and assessed by BIC. Marsh (1996) found that method-effects were primarily associated with negative rather than positive wording, while Tomas and Oliver (1999) found that models with method-effects for positively worded items only had the least satisfactory fit among the models specifying method-effects. Wang et al. (2015) examined the associations between the RSES factors and the amygdala and hippocampus in the brain, and found that the negative method factor was uniquely associated with the grey matter volume of the right amygdala, revealing a unique neural mechanism underlying the negative factor and echoing our finding that the negative method factor plays a significant role. The current study added an assessment of the models by BIC, which favoured the bifactor over the bifactor negative model. For both the 10- and 9-item scales, the BIC values for the bifactor model are smaller than for the bifactor negative model, with differences larger than 10, exceeding the cut-off values of Kass and Raftery (1995) and providing very strong evidence that, even when penalized for model complexity, the bifactor model is still preferred to the bifactor negative. Further, ‘very strong evidence’ can be interpreted probabilistically: the bifactor model is much more likely than the bifactor negative.
Second, regarding the controversy over treating Likert scales as continuous or categorical, this paper also found little difference in fit indices or final scoring. In terms of factor loading patterns, however, treating the data as categorical resulted in loadings of greater magnitude and is therefore preferred. This paper has a sample size of 1,734; in studies with smaller samples, perhaps only a few hundred, the factor loadings may not be so clear, in which case treating the data as categorical has relatively greater advantages.
Third, regarding the debate about whether to keep item 8 in the Chinese RSES, this paper reports that keeping it makes little difference to the fit indices or the final scoring of individual self-esteem. However, item 8 consistently loaded in the opposite direction to the other items, making the loadings uninterpretable; removing it improves the model selection indices and produces better loading patterns. We suggest removing it because the item content in Chinese is confusing, and content validity is considered to be the real validity (Borsboom, Mellenbergh, & van Heerden, 2004). The similarity of the final results whether item 8 is kept or removed may be a matter of statistical robustness and has nothing to do with the content domain; removing it gives a better loading pattern, supporting the content domain of what is being measured.
Fourth, with regard to factor loadings, the ordering of loading magnitudes within the positive and within the negative items is the same across all models. The relative weights are consistent, and hence so is the scoring. This paper also found that in most cases the general effects were greater than the method effects, as revealed by the magnitude of the loadings, except for two items with more extreme wording. This shows that the general factor captures most of the effects of self-esteem, with the remainder going to the method-effects.
Fifth, most practitioners may focus on whether the final scoring of individual self-esteem is affected by the latent structure. Since the final score is the sum of item scores weighted by loadings, we expect it to be largely similar whenever the loading patterns do not differ greatly, and the results confirmed this. However, two-factor models are not preferable to one-factor models, because the latter squeeze all loadings into one dimension whereas the former spread the loadings in two different directions without providing a general factor. Interestingly, the usual simple sum score is in this one-dimensional spirit, endorsing the usual practice of summing all items without weights. More importantly, even though the two-factor correlated model yields better fit and selection statistics than the one-factor model, the latter can produce the same or even better scorings. Donnellan et al. (2016) investigated the relationships between the factors in the bifactor model and several external variables, and found that the correlation patterns involving the simple sum score and the global factor were similar, echoing our findings. Simple sum scores can give a very good measure of individual self-esteem even when the model behind them is not correct, showing that a less satisfactory model can nevertheless yield good scores because of the way they are calculated. This is supported theoretically by Bartholomew and Knott (1999, pp. 32–39) and empirically by Leung and Wu (2013), as stated in the scoring section of the introduction.