In their focal article, Adler and his colleagues (2016) elaborate on the pros and cons of abolishing the performance appraisal process in organizations. Sherman-Garr (2014) contends that this trend is on the rise because both managers (the raters) and their subordinates (the ratees) disdain performance scores. Employees feel that performance ratings do not reflect their actual performance, and therefore they do not gain the rewards they merit. For their part, supervisors/managers experience a great deal of frustration because the improvement of employee performance does not match up to the excessive time and effort invested in the appraisal process, making the whole process ineffective and inefficient. We agree that performance appraisals, specifically the practice of assigning performance ratings, appear to be a disliked and ineffective human resource function. However, we do not agree that goal attainment should be used in place of performance ratings, that rating format and rater training represent “disappointing interventions,” or, most of all, that only “weak” criteria exist for performance ratings.
Goal Attainment as an Alternative to Performance Ratings
One favored solution appears to be assessing an employee's performance according to his/her completion of assigned work goals in comparison with other employees, while accounting for the complexity of these goals and their impact on the overall performance of the organization. In this case, the employee's compensation is linked to the degree of goal attainment adjusted, as noted, according to the complexity and impact on the organization's overall performance. Thus, performance achievement is determined by a global measure of goal attainment, namely by end results. Is this indicative of true performance? More than 30 years ago, Landy and Farr (1983) pointed to a striking finding: Many “objective” measures exhibit low levels of reliability and consistency across equivalent indices. For example, the correlation between different indices of absenteeism approaches zero. Moreover, objective measures of output, sales, and the like are not available for every job, especially for managerial jobs. Furthermore, when they are available, they usually exhibit criterion deficiency because some components of work performance do not lend themselves to measurement, whereas others are contaminated by contextual factors.
For instance, we suspect that sales of explosives-detection devices will grow rapidly in regions of the world susceptible to terror attacks. Thus, the volume of sales will rise, leading to an assessment that the performance of the salespeople has been exceptional, whereas in fact the performance level should be attributed to circumstantial factors. Would it be accurate to ascribe this sort of performance excellence solely to the competence and efforts of the salespeople?
Conversely, if an employee has done his/her best, yet nonetheless has failed to attain his/her set goals due to uncontrollable factors, would it be just and fair to penalize him/her? Ignoring deficiencies in the process leading to goal attainment, and the context in which that process is embedded, would hamper performance improvement and accurate feedback. Only if we center on the process rather than on end results can we detect what has gone wrong, provide timely and credible feedback, and thereby hope to enhance employee performance. Thus, it is no surprise that performance appraisal experts for the past 20+ years have made a strong case for concentrating on behaviors rather than results (e.g., Murphy & Cleveland, 1995).
Two “Disappointing Interventions”
We contend that the two interventions discussed by Colquitt, Murphy, and Ollander-Krane as “disappointing,” rating format and rater training, are anything but disappointing. It is true, unfortunately, that empirical findings have not corroborated the psychometric superiority of one rating format over others. But is this a justifiable reason to label rating format research as disappointing? Tziner and his colleagues (Tziner, Joanis, & Murphy, 2000; Tziner, Kopelman, & Livneh, 1993; Tziner & Latham, 1989; Tziner & Murphy, 1999) have shown that too much weight has been placed on rating accuracy. Over 30 years ago, Bernardin and Beatty (1984) pointed out that ratees’ reactions to appraisal systems are more likely than their psychometric qualities to make a significant contribution to sustaining the viability of appraisal systems. Regardless of the accuracy and psychometric characteristics of the performance ratings, an appraisal system will be rendered useless, and probably sink into decay, if it does not elicit positive reactions from both raters and ratees (Hedge & Borman, 1995).
Thus, it is not surprising that, as Levy and Williams (2004) state in their review, “Perhaps no area within the PA (performance appraisal) literature has seen such a dramatic increase in research attention since 1990 as ratee reactions to PA processes” (p. 889). Specifically in regard to rating formats, Tziner and colleagues found in a series of articles that the type of rating format used can influence goal clarity, goal acceptance, goal commitment, and satisfaction (e.g., Tziner, Kopelman, & Joanis, 1997; Tziner et al., 1993; Tziner & Latham, 1989). Roch, Sternburgh, and Caputo (2007) found that, in general, absolute formats (which compare individuals with standards) are seen as more fair than relative formats (which compare individuals with their peers) and that format differences can influence perceived interpersonal justice, especially if employees do not trust their supervisors (Roch, 2015). Thus, it appears that the type of rating format used can influence employees’ attitudes and justice perceptions, both of which have implications for employee performance, including task performance and organizational citizenship behaviors. Improved job satisfaction, organizational commitment, and so forth are worth real dollars (Cascio, 2000).
Even though, as mentioned by Colquitt and colleagues, we cannot determine in an organizational setting whether one rating format can more accurately assess employee performance than another format, we can evaluate whether a rating format is useful in promoting positive organizational attitudes and behavior. Type of rating format does matter.
Rater training also matters, especially frame-of-reference (FOR) training. We were surprised that Colquitt and colleagues discussed rater error training (RET) in depth. Woehr and Huffcutt (1994) pointed out the problems with RET over 20 years ago, and as mentioned in the rater training meta-analysis by Roch, Woehr, Mishra, and Kieszczynska (2012), RET has practically disappeared from the literature in the last 20 years.
It has been well established that FOR training improves rating accuracy (Roch et al., 2012; Woehr & Huffcutt, 1994). However, we believe that the main advantage of FOR training is that it helps to bring everyone “on the same page”: It gives everyone, both raters and ratees, the same definition of performance. It is not important whether the definition is the “accurate” one, just that the definition is the one espoused by the organization. Raters may still see different aspects of employee performance and thus disagree, but at least they will have a common definition of performance. It is no surprise that in the last 10 years, FOR has been implemented in a wide variety of contexts, including assessing language proficiency (Dierdorff, Surface, & Brown, 2010), modeling competency (Lievens & Sanchez, 2007), and evaluating biodata items (Lundstrom, 2007).
Rater training may also help with the underlying problem of performance appraisal systems: rater motivation. We have long known that performance ratings are viewed as a management tool, one used to achieve certain ends, and not a measurement tool (Longenecker, Sims, & Gioia, 1987). Research has shown that performance rating inaccuracy is linked not only to rating format deficiencies and raters’ cognitive limitations but also to raters’ deliberate, volitional distortion of performance ratings. These distortions emanate from a gamut of motives and considerations, mostly related to a wish to promote valued individual goals (e.g., supervisors avoid giving performance ratings that may antagonize employees; supervisors give low performance ratings because they fear that their employees will be transferred to another boss; supervisors avoid giving low performance ratings because they fear violent behavior on the part of their employees).
Rater training cannot deter raters from providing distorted ratings but can give raters the tools needed to provide ratings consistent with the organization's viewpoints and, more important, convey a message that the organization values performance ratings enough to invest in rater training. In a recent survey of 101 U.S. firms, Gorman, Meriac, Ray, and Roddy (2015) found that over 76% of these firms used rater training, with FOR training as the most common type of training. Even more important, human resource executives whose firms provided rater training rated their performance appraisal systems as more effective than did executives whose firms offered no training. Controlling for firm size, the firms offering rater training had higher revenue than firms not offering rater training. Colquitt, Murphy, and Ollander-Krane remark that “neither variation on rater training has been successful in markedly improving ratings in organizations” (Adler et al., p. 225). However, data collected 20 years later present a different picture. Even though we do not know whether rater training has improved performance ratings in organizations, it appears that rater training can improve perceived performance appraisal effectiveness.
“Weak Criteria”
Colquitt, Murphy, and Ollander-Krane suggest that two criteria can be used to evaluate performance ratings: rating accuracy and rating agreement. However, these are the proverbial “strawmen,” easily knocked down. Murphy, Balzer, Sulsky, and colleagues wrote an excellent series of articles over 20 years ago suggesting that outside of laboratory contexts, accuracy and rater errors should not be used to evaluate performance ratings (e.g., Balzer & Sulsky, 1992; Murphy & Balzer, 1989; Murphy, Jako, & Anhalt, 1993; Sulsky & Balzer, 1988). There is almost no disagreement regarding this assessment; why are Colquitt and colleagues making this point again today?
Similarly, there is a growing acceptance of the ecological validity argument in regard to performance ratings (e.g., Lance, Baranik, Lau, & Scharlau, 2009). Proponents of this argument contend that given individual raters’ differences in respect to expectations, goals, and so on, raters may be more or less attuned to specific behaviors. Thus, lack of rater agreement may be a result of differences in the raters’ perspectives and represent unique, but valid, observations rather than error. To a certain extent, rater differences in respect to expectations, goals, and personal characteristics can be reduced by means of training, such as training to enhance one's self-efficacy, but raters will still see different behaviors (not all raters will be watching ratee performance at the same time, and individual performance varies). Thus, rating agreement is not a useful criterion for evaluating the quality of performance ratings. Given the problems with both rating accuracy and rater agreement, it is no surprise that, as mentioned earlier, research investigating ratee reactions to performance appraisal processes has seen such growth since 1990 (Levy & Williams, 2004).
Organizational interventions should be designed and conducted to change the characteristics of the organization and of its performance appraisal system, thereby increasing positive perceptions of both. Such changes will subsequently affect rating behavior.
Our contention is this: Accuracy is not the most important thing. We, in organizational behavior and human resource management, put enormous effort into finding the practices through which we can improve work behaviors and attitudes. If so, why not judge the value of interventions such as rating format and rater training in terms of their ability to improve work attitudes and behaviors, an improvement that has financial value? So, let us not throw out performance ratings because of their perceived inaccuracy (which may or may not reflect reality), but let us judge them by their value in improving work behaviors and attitudes, which consequently improve the organization's performance. This, as mentioned, has financial value.