As organizations become more decentralized and work becomes more team based, organizations are adopting performance management practices that integrate employees’ performance information from multiple perspectives (e.g., 360-degree performance ratings). Both the argument for and the argument against the use of performance ratings presented in the focal article treated rater agreement (or the lack thereof) as evidence that multisource ratings are (or are not) a useful approach to performance appraisal. In the argument for the use of multisource ratings, Adler, Campion, and Grubb (Adler et al., 2016) point out that multisource ratings are advantageous because they increase the interrater reliability of the ratings. Although Adler and colleagues were not explicit about why this would be true, proponents of multisource ratings often cite the measurement theory assumption that increasing the number of raters will yield more valid and reliable scores to the extent that there is any correlation among the ratings (Shrout & Fleiss, 1979). In the argument against the use of multisource performance ratings, Colquitt, Murphy, and Ollander-Krane argued that because multisource ratings pool ratings from raters who differ systematically in their roles and perspectives on the target employee's performance, increasing the number of raters cannot be expected to resolve the low interrater agreement that is typically observed in performance ratings (Viswesvaran, Ones, & Schmidt, 1996).
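The measurement theory claim can be made concrete with the Spearman–Brown prophecy formula, a standard result in the reliability literature (cf. Shrout & Fleiss, 1979); the notation below is introduced here for illustration and is not taken from the focal article. If the average correlation between pairs of raters is $\bar{r}$, the reliability of the mean of $k$ raters is

$$ \rho_k = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}}, $$

so that, for example, $\bar{r} = .30$ yields $\rho_4 \approx .63$ for four raters, and $\rho_k \to 1$ as $k$ grows for any $\bar{r} > 0$. This is the sense in which "any correlation in the ratings" implies that adding raters improves reliability.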
The focus on agreement (or disagreement) among raters as the key issue in the argument for or against the use of multisource performance ratings is not surprising, given that reliability is a focal index of the quality of measurement scores. Reliability can be estimated in various ways depending on the relevant source of measurement error (Cortina, 1993), but interrater reliability in particular has emerged as the reliability index of choice in defining the psychometric quality of performance ratings (Murphy, 2008). Interrater reliability treats rater idiosyncrasies as a source of random measurement error (Schmidt & Hunter, 1996). However, an underlying assumption in the use of multisource ratings is that different rating sources provide unique performance-relevant information (Borman, 1974, 1997), meaning that different rating sources are actually expected to disagree in their perceptions of target performance (Hoffman, Lance, Bynum, & Gentry, 2010). To the extent that each rating source provides uniquely meaningful information about a target employee's job performance, collapsing all variance not shared across raters into error would collapse meaningful variance into error, which in turn can be expected to produce inappropriate inferences regarding the construct validity of multisource performance ratings (Murphy & DeShon, 2000). Thus, whether a systematic source effect in multisource ratings represents undesirable source-specific bias or independently valid performance-relevant information is an important issue with implications for what multisource ratings represent. This commentary seeks to supplement the focal article by discussing the construct validity of multisource ratings in more detail. Specifically, I briefly review previous studies that have examined the validity and meaning of multisource ratings to assess the value of interrater reliability evidence in arguments for or against the use of multisource performance ratings.
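One way to formalize this contrast is to compare the implicit measurement models; the following sketch uses notation introduced here, not equations from the cited sources. Interrater reliability assumes that a rating of employee $i$ by rater $j$ decomposes as

$$ X_{ij} = T_i + E_{ij}, \qquad \rho = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}, $$

so that everything raters do not share is error. The multisource assumption instead posits a systematic source-specific component,

$$ X_{ijs} = T_i + S_{is} + E_{ijs}, $$

where $S_{is}$ is employee $i$'s standing as seen from source $s$. If $S_{is}$ carries valid performance information, an interrater reliability estimate computed across sources, $\sigma^2_T / (\sigma^2_T + \sigma^2_S + \sigma^2_E)$, misclassifies $\sigma^2_S$ as error and understates the quality of the ratings.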
Source Effect in Multisource Performance Ratings: Meaningful Variance Versus Bias
Consistent with the underlying assumption that different rating sources provide unique performance-relevant information, previous studies examining the internal structure of multisource performance ratings have consistently found that rating source accounts for a significant proportion of variance in the ratings (Hoffman et al., 2010; Lance, Hoffman, Gentry, & Baranik, 2008; Woehr, Sheehan, & Bennett, 2005). From a traditional psychometric perspective, variance attributed to the rating source represents undesirable construct-irrelevant bias that should be reduced (Podsakoff, MacKenzie, Podsakoff, & Lee, 2003), but in multisource performance ratings the underlying assumption is that different rating sources provide source-specific information about the target employee's job performance. From this perspective, rating sources should disagree, but that disagreement should reflect reliable performance-relevant information unique to each rating source.
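A minimal simulation sketch can illustrate how a reliable source effect shows up in the internal structure of the ratings; all variance magnitudes below are assumptions chosen for illustration, not estimates from the cited studies. Two raters from the same source agree more than two raters from different sources, and the gap reflects the source-specific variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                              # simulated employees
sd_T, sd_S, sd_E = 1.0, 0.7, 1.0      # true, source-specific, residual SDs (assumed)

T = rng.normal(0, sd_T, n)            # true performance
S_peer = rng.normal(0, sd_S, n)       # peers' shared, source-specific view
S_sup = rng.normal(0, sd_S, n)        # supervisors' shared, source-specific view

def one_rater(S):
    """One rater's scores: truth + that rater's source effect + idiosyncratic noise."""
    return T + S + rng.normal(0, sd_E, n)

r_within = np.corrcoef(one_rater(S_peer), one_rater(S_peer))[0, 1]  # same source
r_across = np.corrcoef(one_rater(S_peer), one_rater(S_sup))[0, 1]   # different sources

print(f"within-source interrater r: {r_within:.2f}")   # ~ (1 + .49) / 2.49 = .60
print(f"across-source interrater r: {r_across:.2f}")   # ~ 1 / 2.49 = .40
```

The within-source correlation exceeds the across-source correlation precisely because raters within a source share the source-specific component; an interrater reliability estimate pooled across sources would report only the lower figure.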
An internal structure approach to examining the construct validity of multisource performance ratings may be supplemented with a nomological network approach, which examines the patterns of covariance among source effects and external measures of performance to assess the extent to which rating source effects represent substantively meaningful source-specific variance (Cronbach & Meehl, 1955). Specifically, source-specific variance should correlate with relevant externally measured constructs to the extent that it represents substantively meaningful performance-relevant variance. In addition, consistent with the theoretical explanation that different rating sources capture different aspects of performance, rating source effects should be differentially related to external measures of job performance to the extent that each rating source relies on different performance information in providing its ratings (Hoffman & Woehr, 2009), as the sketch below illustrates.
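Continuing the simulation sketch above (again, the structure is assumed purely for illustration), the nomological network logic can be demonstrated by adding an external criterion that one source's unique perspective captures, for instance an assessment center exercise tapping behavior mainly visible to peers:

```python
# An external criterion that loads on both true performance and the
# peers' source-specific component (structure assumed for illustration).
criterion = 0.5 * T + 0.5 * S_peer + rng.normal(0, 1.0, n)

peer_composite = (one_rater(S_peer) + one_rater(S_peer)) / 2  # mean of two peer raters
sup_composite = (one_rater(S_sup) + one_rater(S_sup)) / 2     # mean of two supervisors

print(f"peer composite vs. criterion: {np.corrcoef(peer_composite, criterion)[0, 1]:.2f}")
print(f"sup. composite vs. criterion: {np.corrcoef(sup_composite, criterion)[0, 1]:.2f}")
```

The peer composite correlates more strongly with the criterion (about .45 versus .30 under these assumed variances) because it shares both the true-performance and the peer-specific components with it; if the source effects were mere bias, the two composites would correlate with the criterion equally.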
Previous research implementing a nomological network approach has provided support for the assumption that rating source effects represent substantively meaningful performance-relevant variance. Namely, Hoffman and Woehr (2009) collected multisource ratings of managers enrolled in an executive master of business administration program (ratings collected from supervisors, peers, and subordinates of the participants) and also had the managers participate in an assessment center that measured different managerial skills (decision making, judgment, influencing others, persuasiveness, and coaching). Consistent with previous studies that examined the internal factor structure of multisource ratings, Hoffman and Woehr (2009) found clear support for a factor structure that modeled each rating source as a separate rating source factor (supervisor, peer, and subordinate). Correlations of the rating source factors with external variables provided further support for the assumption that source effects represent substantively meaningful performance-relevant variance. Specifically, all three factors showed weak to moderate, statistically significant correlations with the relevant measured managerial skills (e.g., r = .29 between the subordinate latent factor and leadership skills). Furthermore, each rating source effect showed differential relationships with the measured managerial skills (i.e., confidence intervals around the correlation differences did not include zero; Meng, Rosenthal, & Rubin, 1992). For example, the subordinate source factor showed a stronger correlation with the leadership skill factor (r = .29) than did the peer (difference in r = .13) or manager (difference in r = .16) source factors.
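For readers who want to apply the same differential-relationship test, the Meng, Rosenthal, and Rubin (1992) z-test cited above is straightforward to implement. The sketch below follows the published formula; the intercorrelation between the two source factors (rx) and the sample size (n) in the example call are hypothetical placeholders, not values reported by Hoffman and Woehr (2009):

```python
import numpy as np
from scipy.stats import norm

def meng_z(r1, r2, rx, n):
    """Meng, Rosenthal, & Rubin (1992): compare two correlations that share
    a common criterion. r1, r2: predictor-criterion correlations;
    rx: correlation between the two predictors; n: sample size."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)      # Fisher r-to-z transforms
    rbar2 = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - rx) / (2 * (1 - rbar2)), 1.0)   # f is capped at 1
    h = (1 - f * rbar2) / (1 - rbar2)
    z = (z1 - z2) * np.sqrt((n - 3) / (2 * (1 - rx) * h))
    return z, 2 * norm.sf(abs(z))                # two-tailed p value

# r1/r2 mirror the example in the text (.29 subordinate, .16 peer, vs. leadership);
# rx = .40 and n = 200 are assumptions for illustration only.
z, p = meng_z(r1=0.29, r2=0.16, rx=0.40, n=200)
print(f"z = {z:.2f}, two-tailed p = {p:.3f}")
```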
Taken together, Hoffman and Woehr's (2009) results indicate not only that rating source factors represent substantively meaningful variance but also that each factor can provide source-specific information that is uniquely related to performance. These findings offer a more in-depth perspective on the meaning of rating source effects in multisource ratings, and on what those effects represent, than can be derived from internal structure or interrater reliability approaches to examining the construct validity of multisource ratings.
Implications for Validity of Multisource Ratings
As multisource ratings have become an increasingly common performance measurement practice in organizations, there has been a corresponding increase in research attention to the psychometric properties of multisource ratings (e.g., Conway, 1996; Conway & Huffcutt, 1997; Mount, Judge, Scullen, Sytsma, & Hezlett, 1998). Much of this research has relied on an internal approach that examines the covariance of ratings made by different sources, including the interrater reliability evidence that was briefly discussed in the focal article. Interestingly, the contrast between the assumptions underlying the use of multisource ratings and the assumptions about what constitutes true variance in interrater reliability raises questions about what interrater reliability estimates can tell us about the construct validity of multisource performance ratings. That is, interrater reliability treats rater idiosyncrasies as a source of random measurement error, whereas the use of multisource ratings rests on the assumption that each rating source provides a unique perspective on a target employee's performance. As a result, different rating sources are expected to show low agreement, yet each source is expected to provide source-specific, valid performance information.
In addition to the consistent stream of evidence showing that rating source factors represent a reliable source of variance in multisource performance ratings, Hoffman and Woehr's (2009) findings on the relationships between different rating source factors and measures of job performance support the assumption underlying multisource performance ratings that rating source represents a meaningful source of specific variance rather than bias. Although the authors of the focal article focused on interrater reliability evidence to support their arguments for or against the use of multisource ratings, the literature reviewed in this commentary suggests that interrater reliability alone is not sufficient evidence for (or against) the construct validity of multisource performance ratings.