Adler et al. (2016) provide a discussion of the pros and cons surrounding the issue of “Getting Rid of Performance Ratings.” Yet neither the pro nor the con side of the debate appears to fully consider the central role of performance ratings outside the realm of performance management. In 1949, Robert L. Thorndike wrote,
The key to effective research in personnel selection and classification is an adequate measure of proficiency on the job. Only when proficiency measures can be obtained for the individuals who have been tested is it possible to check the effectiveness of test and selection procedures. (Thorndike, 1949, p. 6, italics added)
This statement remains as true today as it was in 1949. For better or worse, performance ratings have been the most frequently used measure of “proficiency on the job” for nearly 100 years (Austin & Villanova, 1992). And if performance rating in organizations is truly a “failed experiment,” does this call into question all of the research for which performance ratings have served as the criteria? Performance ratings are the criterion of choice not only for validating selection measures but also for evaluating training interventions (Goldstein & Ford, 2002).
So before admitting defeat with respect to performance ratings, we believe it important to consider the evidence suggesting that performance ratings do indeed capture performance. One piece of evidence that is often overlooked is the extent to which performance ratings correlate with conceptually relevant predictors. In general, predictors that should be related to individual job performance do indeed predict performance ratings. As noted above, in studies investigating the relationship between predictors and job performance, job performance is most often assessed using supervisory ratings (Schmidt & Hunter, 1998). The literature clearly demonstrates that a variety of predictors frequently used for selection and assessment show substantial relationships with job performance as typically assessed. For example, cognitive ability has a corrected validity of approximately ρ = .50 (e.g., Bertua, Anderson, & Salgado, 2005), although estimates range from as high as ρ = .62 (Salgado, Anderson, Moscoso, Bertua, & De Fruyt, 2003) to as low as ρ = .45 (Hunter, 1983). Job knowledge has a validity of ρ = .48 (Hunter & Hunter, 1984); corresponding estimates are ρ = .36 for assessment centers (Arthur, Day, McNelly, & Edens, 2003; Gaugler, Rosenthal, Thornton, & Bentson, 1987), ρ = .32 for biodata (Rothstein, Schmidt, Erwin, Owens, & Sparks, 1990), ρ = .37 for interviews (Huffcutt & Arthur, 1994; McDaniel, Whetzel, Schmidt, & Maurer, 1994), and ρ = .34 for situational judgment tests (McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001). Given that job performance is most often assessed via supervisory performance ratings, these studies provide an indication of the general level of predictability of these ratings. Although these correlations do not provide incontrovertible evidence of the construct validity of performance ratings, they are certainly consistent with theoretical expectations and inconsistent with the notion that performance ratings do not work.
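To make the meaning of these corrected coefficients concrete, recall the classic correction for attenuation, which relates an observed validity to the estimated true-score correlation (the meta-analyses cited above differ in exactly which artifacts, such as range restriction, they also correct for):

$$\rho = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$

where r_xy is the observed predictor–criterion correlation and r_xx and r_yy are the reliabilities of the predictor and the criterion (here, the performance ratings). As a purely illustrative example with assumed values, an observed validity of r = .25 combined with a criterion reliability of .60, correcting for criterion unreliability only, yields ρ = .25/√.60 ≈ .32.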
Moreover, several studies have directly compared criterion methods with respect to criterion-related validity, contrasting the predictability of rating-based measures with that of more objective measures such as production records, sales records, and output, often referred to as hard criteria. Schmitt, Gooding, Noe, and Kirsch (1984), for example, reviewed validation studies published between 1964 and 1982 and examined validity coefficients as a function of the type of criterion used. They reported remarkably similar validity coefficients across several criteria: the average r was .26 for performance ratings, .25 for turnover, and .21 for productivity. It should be noted that other criteria had higher validities (r = .36 for status change and r = .40 for wages). Similarly, Schmidt and Rader (1999) investigated phone interviews and found nearly identical validity coefficients for performance ratings (ρ = .40), production records (ρ = .40), and job tenure (ρ = .39). Interestingly, they found higher validity for performance ratings than for sales performance (ρ = .24). In general, research indicates that supervisory performance ratings typically demonstrate criterion-related validity as good as, if not better than, that of other criterion measures. Predictability has long been viewed as a desirable criterion characteristic (e.g., Blum & Naylor, 1968).
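For readers less familiar with how such average validities are obtained, the basic bare-bones estimator weights each study's observed correlation by its sample size (the cited meta-analyses may employ additional corrections and weighting refinements):

$$\bar{r} = \frac{\sum_{i=1}^{k} N_i\, r_i}{\sum_{i=1}^{k} N_i}$$

where k is the number of studies and N_i and r_i are the sample size and observed validity of study i.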
It may be argued that subjective measures such as supervisory performance ratings are more susceptible to bias than more objective criteria. However, in a meta-analysis investigating the extent to which race influences performance evaluations, McKay and McDaniel (2006) found that race-based differences were essentially the same for subjective and objective measures of task performance (d = 0.18 vs. d = 0.20). Interestingly, race-based differences were smaller for subjective estimates of absenteeism (d = −0.01) than for more objective measures of absenteeism (d = 0.11). It is not possible to determine to what extent these effect sizes represent bias versus true performance differences. Nevertheless, these findings indicate that performance ratings show no larger race-based differences than do objective measures.
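The d values here are standardized mean differences. For reference, and assuming the common convention in this literature that positive values indicate higher mean scores for the majority subgroup:

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{pooled}}}, \qquad s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

On this metric, a d of 0.18 corresponds to subgroup means separated by roughly one-fifth of a pooled standard deviation.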
Although it is important not to confound construct and method when comparing predictor–criterion relationships (Arthur & Villado, 2008), the literature to date suggests that supervisor ratings of job performance are consistently predicted by those constructs expected to be related to job performance. Rating-based measures of performance appear to be as predictable as, if not more predictable than, nonrating measures. It is also notable that supervisor ratings of performance tend to show no larger race-based differences than do other, more objective criteria.
Performance ratings are the criterion of choice not only for validating selection measures but also for evaluating whether training has influenced employees’ on-the-job behavior, in other words, training transfer (Goldstein & Ford, 2002). In their meta-analysis of the management training literature, Taylor, Russ-Eft, and Taylor (2009) found that supervisor ratings were the most frequently used criterion for evaluating training transfer, compared with self, peer, and subordinate ratings. Even more important, Taylor et al. found larger training effect sizes when the performance ratings targeted the training content. This is in line with Kraiger's (2002) suggestion that, when determining the organizational payoff of training, it is important to focus on changes in behavior on the job. Performance ratings allow the organization to assess the extent to which training has influenced on-the-job behavior, information that is not available in bottom-line performance measures, which may be influenced by factors outside the employees’ control.
Of course, one could argue that the performance ratings used for personnel research are not the same as those used for administrative purposes in organizations. There is certainly a good bit of research focusing on this “purpose of appraisal” effect (Jawahar & Williams, 1997). Yet it has been widely noted that the same performance ratings are regularly used for multiple purposes, and much research utilizes operational performance ratings. Even if one accepts that research-oriented ratings differ from administration-oriented ratings, this suggests that the problem is not with the performance ratings themselves but with the way in which they are used. Of course, much has been written about problems in the performance management process. Both sides of the debate seem to agree that performance management practices are almost universally poorly implemented. But if we suddenly had a perfect measure of job performance, would these problems be alleviated? We think not. So although performance management in organizations may be a messy, poorly managed, and poorly implemented process, we should be cautious not to lay the blame on the quality of performance ratings. Let's not throw the performance rating baby out with the performance management bathwater.