Recent commentary has suggested that performance management (PM) is fundamentally “broken,” with negative feelings from managers and employees toward the process at an all-time high (Pulakos, Hanson, Arad, & Moye, 2015; Pulakos & O'Leary, 2011). In response, some high-profile organizations have decided to eliminate performance ratings altogether as a solution to the growing disenchantment. Adler et al. (2016) offer arguments both in support of and against eliminating performance ratings in organizations. Although both sides of the debate in the focal article make strong arguments, we believe there continue to be misunderstandings, mischaracterizations, and misinformation with respect to some of the measurement issues in PM. We offer this commentary not to persuade readers to adopt one side over the other but as a call to critically reconsider some of the assumptions underlying measurement issues in PM and to dispel some of the pervasive beliefs that persist throughout the performance rating literature.
Measurement Issues in Performance Ratings
As noted by Adler et al., measurement issues have been pervasive in the PM literature since its inception. Understandably, some scholars have argued that the overwhelming focus on measurement issues in the academic literature has rendered PM research essentially useless to PM practitioners (DeNisi & Pritchard, 2006; Fletcher, 2001). Unfortunately, however, PM critics continue to rely on overgeneralized conclusions regarding measurement issues that are based on outdated, unsupported, and misinterpreted research, and these unsubstantiated generalizations have become accepted as truth in our science. Below, we separate fact from fiction with respect to three key areas in PM: (a) rating formats, (b) rater training, and (c) rater (dis)agreement and the reliability of PM ratings.
Rating Formats
The most frequently cited article put forth as “evidence” of the failure of rating format interventions is Landy and Farr's (1980) watershed article, in which they famously called for a moratorium on rating format design research and concluded that interventions designed to improve performance rating formats were, at best, minimally successful. What is less often communicated, however, is that Landy and Farr's (1980) conclusions regarding the lack of usefulness of rating format research were based almost entirely on the presence of psychometric “errors” in performance ratings (DeNisi, 1996). As Colquitt, Murphy, and Ollander-Krane (as cited in Adler et al.) aptly note in their own criticism of PM ratings, psychometric “errors” represent only one type of rating property, and they have been repeatedly criticized as poor indicators of rating quality (Balzer & Sulsky, 1992; Fisicaro, 1988; Murphy, 2008; Murphy & Balzer, 1989; Murphy, Jako, & Anhalt, 1993; Nathan & Tippins, 1990). Consequently, the tenuous evidence base regarding rating errors calls into question Landy and Farr's (1980) dismissal of an entire body of rating format research.
More recently, other psychometric indices have been used to evaluate rating quality, and they have provided a much clearer picture of the value of rating formats. Specifically, research that has used more appropriate indices of rating quality, such as predictive validity and rater reactions, has actually yielded favorable results (Bartram, 2007; Benson, Buckley, & Hall, 1988; Borman et al., 2001; Goffin, Gellatly, Paunonen, Jackson, & Meyer, 1996; Roch, Sternburgh, & Caputo, 2007; Tziner, 1984; Wagner & Goffin, 1997). Hoffman, Gorman, Blair, Meriac, Overstreet, and Atchley (2012), for example, found that a new rating format they termed “frame-of-reference (FOR) scales” yielded an improved factor structure relative to a standard multisource rating instrument and rating accuracy comparable with that achieved by a FOR training program. Moreover, recognizing recent advances in technology, the expanding criterion domain, and the creation of new forms of work, Landy (2010) himself officially lifted the 30-year moratorium on rating format design research. Thus, we suggest that rumors of the death of performance rating formats have been greatly exaggerated.
Rater Training
Colquitt et al. (in Adler et al.) state that rater training has been unsuccessful in substantially improving ratings in organizations. We agree wholeheartedly that rater error training has been a tremendous disappointment as an intervention for improving ratings. Research has shown that although rater error training results in fewer leniency and halo errors, it inadvertently lowers rating accuracy (Bernardin & Pence, 1980; Borman, 1979; Landy & Farr, 1980). In effect, rater error training merely redistributes ratings and is practically useless for improving rating quality (Borman, 1979; Smith, 1986). As noted above, though, this is not surprising given the inherent limitations of psychometric “errors” as indicators of rating quality.
However, we disagree with the assertion made in the focal article that behavior-based rater training (e.g., FOR training) is a disappointing rating intervention. Although there is relatively little evidence that rater training improves actual ratings in field settings (cf. Noonan & Sulsky, 2001), we suggest, for several reasons, that rater training is an understudied intervention with a great deal of potential for improving ratings in organizations. For example, in a popular industrial–organizational (I-O) psychology textbook, Levy (2010) noted that rater training has become more common in organizations such as the Tennessee Valley Authority, JP Morgan Chase, Lucent Technologies, and AT&T.
Moreover, meta-analytic reviews have found impressive effect sizes for the impact of FOR training on rating quality (d = 0.83 in Woehr & Huffcutt, 1994, and d = 0.50 in Roch, Woehr, Mishra, & Kieszczynska, 2011). Finally, in a recent exploratory survey of for-profit companies, Gorman, Meriac, Ray, and Roddy (2015) found that 61% of the 101 organizations surveyed reported using a behavior-based approach (such as FOR training) to train raters, and companies that utilized behavior-focused rater training programs generated higher revenue than those that provided rater error training or no training at all. More research is clearly needed on this topic, but claiming that rater training is ineffective is premature. Thus, we agree that rater error training should not be considered a viable rater training option, but we hasten to note that a lack of research in organizational settings does not automatically equate to a failed intervention in the case of FOR and other behavior-based training interventions.
Rater (Dis)agreement and the Reliability of Performance Ratings
Colquitt et al. (in Adler et al.) also suggest that disagreement among raters in PM is a major problem that supports the abandonment of ratings altogether, and they support this assertion using the ubiquitous .52 interrater reliability estimate often cited as a general estimate of the reliability of performance ratings (Viswesvaran, Ones, & Schmidt, 1996). There are at least two problems with this argument: (a) disagreement among raters may reflect true variance, instead of error, and (b) .52 is potentially an underestimate of the reliability of performance ratings. We address each of these issues below.
We agree that “raters do not show the level of agreement one might expect from . . . two different forms of the same paper-and-pencil test” (Adler et al., p. 225). However, we must ask why one would or should expect that level of agreement among multiple raters in the first place. Would we actually want to see perfect agreement among multiple raters or sources? From a classical test theory perspective, for example, differences between a true score and an observed score on a paper-and-pencil test are considered error (Nunnally & Bernstein, 1994). But how do we know what the true score is when it comes to job performance?
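For readers who want the classical test theory framing spelled out, the standard textbook decomposition below (a general formulation, not a result specific to performance ratings or to the focal article) makes the assumption explicit: treating between-rater differences as error presupposes that every rater is an interchangeable measure of the same true score.

```latex
% Classical test theory: an observed rating X decomposes into a true
% score T and random error E, and reliability is the share of observed
% variance attributable to true-score variance.
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma^{2}_{T}}{\sigma^{2}_{X}}
           = \frac{\sigma^{2}_{T}}{\sigma^{2}_{T} + \sigma^{2}_{E}}
```

If different raters or sources validly capture different facets of performance, then some of the variance this decomposition would assign to error in fact belongs with the true score, and low interrater agreement overstates unreliability.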
Research evidence suggests that rating source disagreement may be due more to differences in the performance constructs being rated than to differences between sources (Woehr, Sheehan, & Bennett, 2005). Moreover, proponents of the ecological validity perspective have long recognized that performance ratings are based on functionally and socially adaptive judgments that likely represent true sources of variance rather than error (Hoffman, Lance, Bynum, & Gentry, 2010; Lance, Hoffman, Gentry, & Baranik, 2008; Lance & Woehr, 1989). In fact, as Hoffman et al. (2010) aptly noted, why would we gather performance ratings from different sources if we expected them all to agree completely? Thus, we suggest that the assumption that rater disagreement is indicative of a problem with PM rests on a faulty premise.
We completely agree that a reliability estimate around .50 is “hardly the level one would expect if ratings were in fact good measures of the performance of the ratees” (Adler et al., p. 225). However, the oft-cited estimate of .52 is likely a downwardly biased estimate of the reliability of performance ratings (LeBreton, Scherer, & James, 2014). The argument over whether inter- or intrarater correlations are more appropriate measures of reliability (Murphy & DeShon, 2000; Ones, Viswesvaran, & Schmidt, 2008; Schmidt, Viswesvaran, & Ones, 2000) notwithstanding, there are several other reasons to believe this estimate is biased downward. First, job performance is a dynamic and multidimensional criterion (Austin & Villanova, 1992), and low reliability is often indicative of a dynamic and multidimensional construct (Nunnally & Bernstein, 1994). Second, it is well known that performance ratings are skewed toward the positive end of the distribution; when a seven-point scale is used, for example, 80% of ratings are often a 6 or 7 (Murphy & Cleveland, 1995). This problem becomes even more pronounced when ratings are used for administrative decisions (Jawahar & Williams, 1997). Thus, restriction of range severely attenuates the observed reliability of job performance ratings (LeBreton, Burgess, Kaiser, Atchley, & James, 2003).
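To make the range-restriction point concrete, the brief simulation below is our own hypothetical illustration; the variance values and cutoffs are invented, not drawn from any of the studies cited above. Two raters share a common true-performance component, and simply discarding the lower portion of the rating distribution, as leniency and administrative use effectively do, visibly shrinks the observed interrater correlation.

```python
# Minimal, hypothetical simulation of range restriction attenuating an
# observed interrater correlation. All parameter values are invented
# solely for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Latent "true" performance plus independent rater-specific error,
# scaled so the unrestricted interrater correlation is about .70.
true_perf = rng.normal(0, 1, n)
rater_a = true_perf + rng.normal(0, 0.65, n)
rater_b = true_perf + rng.normal(0, 0.65, n)

unrestricted_r = np.corrcoef(rater_a, rater_b)[0, 1]

# Mimic leniency/administrative restriction: keep only ratees whose
# ratings fall in the upper half of each rater's distribution.
top = (rater_a > np.quantile(rater_a, 0.5)) & (rater_b > np.quantile(rater_b, 0.5))
restricted_r = np.corrcoef(rater_a[top], rater_b[top])[0, 1]

print(f"Unrestricted interrater r:     {unrestricted_r:.2f}")
print(f"Range-restricted interrater r: {restricted_r:.2f}")
```

The specific values are immaterial; the sketch merely shows the direction of the bias that LeBreton et al. (2003) describe.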
Finally, extant research rarely models training or rating format as a factor when estimating the reliability of job performance ratings, yet research has demonstrated that interventions such as rater training can improve reliability estimates. Lievens (2001), for example, found that a schema-driven training condition produced interrater reliability estimates of .80 or greater in samples of both students and managers across three performance dimensions. In addition, using variance components analysis, Gorman and Jackson (2012) reported that rater idiosyncrasies accounted for a large amount of variance in a control condition but a negligible amount of variance in a FOR training condition. Thus, ratings from trained raters are much less influenced by idiosyncratic error than ratings made by untrained raters. Hence, in situations where raters are left to their own devices, without proper training and well-developed rating instruments, the reliability of job performance ratings may actually be much lower.
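As a back-of-the-envelope illustration of the variance-components logic (the numbers are hypothetical and are not the components reported by Gorman and Jackson, 2012), the sketch below shows how shrinking the rater-idiosyncratic component raises the share of rating variance attributable to true ratee differences, which is what a single-rater reliability coefficient captures.

```python
# Hypothetical variance components illustrating the logic described above;
# the values are invented for illustration, not taken from any cited study.
def single_rater_reliability(var_ratee: float, var_idiosyncratic: float) -> float:
    """Proportion of observed rating variance due to true ratee differences."""
    return var_ratee / (var_ratee + var_idiosyncratic)

# Untrained (control) raters: idiosyncratic variance as large as ratee variance.
control = single_rater_reliability(var_ratee=1.0, var_idiosyncratic=1.0)

# FOR-trained raters: idiosyncratic variance shrinks to a small fraction.
for_trained = single_rater_reliability(var_ratee=1.0, var_idiosyncratic=0.2)

print(f"Control condition:     {control:.2f}")      # 0.50
print(f"FOR-trained condition: {for_trained:.2f}")  # 0.83
```

Under this framing, reliability estimates obtained from untrained raters say as much about the rating conditions as about the ratings themselves.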
Some Additional Considerations
The above issues aside, we agree with many of the other points made by the authors, including the notion that the overall process must be considered, along with the consequences of ratings. Dissatisfaction with the ultimate outcomes of management decisions (e.g., raises, promotions, or terminations) would simply shift the criticism from performance ratings to other elements of the process. Performance judgments and comparisons between employees will inevitably be made, whether we call them “ratings” or something else (Meriac, Gorman, & Macan, 2015). In addition, the social context of PM is, and should remain, an important consideration in the PM process. Without proper management support, accountability (London, Smither, & Adsit, 1997), and an environment that supports the effective use of performance ratings and feedback (e.g., Steelman, Levy, & Snell, 2004), even highly reliable ratings are unlikely to work as expected. However, abandoning ratings is unlikely to facilitate effective PM.
Conclusion
In this commentary, we suggested that, as evidenced in the focal article, there are several myths and urban legends surrounding the measurement of performance ratings that have been perpetuated and passed down in the PM literature. Specifically, we argued that (a) premature conclusions have been reached regarding performance rating formats based on outdated research using improper criteria, (b) behavior-focused rater training programs hold great promise as interventions to improve the quality of ratings in organizations but deserve much more research attention in field settings, and (c) although complete rater agreement is an unrealistic goal, estimates around .50 are likely downwardly biased estimates of the reliability of job performance ratings. We further urge readers to consider the research evidence critically for themselves before accepting foregone conclusions regarding the measurement and ultimate value of performance ratings. As I-O scientists and practitioners, we must make a shared understanding of the measurement issues involved in PM a priority before we can begin a dialogue on the merits of abandoning a process fundamental to many of our human resource activities.