In their focal article, Tett, Hundley, and Christiansen (2017) stated in multiple places that if there are good reasons to expect moderating effect(s), the application of an overall validity generalization (VG) analysis (meta-analysis) is “moot,” “irrelevant,” “minimally useful,” and “a misrepresentation of the data.” They used multiple examples and, in particular, a hypothetical example about the relationship between agreeableness and job performance. Four noteworthy problems with the above statements, other similar statements elsewhere in Tett et al.’s article, and their underlying assumptions are discussed below along with alternative perspectives.
VG as a Method Should Be Distinguished From VG as a Practice
Throughout the article, Tett et al. (2017) did not make it clear whether they intended to challenge VG as a statistical method or certain improper or outdated practices of VG, but it appears that they challenged the accuracy and usefulness of VG as a method on the basis of improper or outdated applications of VG, as noted below. This confusion is problematic because VG (originally developed by Schmidt and Hunter [1977]) is a well-established statistical method/tool whose statistical accuracy and efficiency have been verified in many articles (e.g., Field, 2001) and books by external evaluators (e.g., Schulze, 2004). This does not mean that VG cannot be misused or abused in its applications; it can be, like other well-established statistical analysis methods (e.g., regression, hierarchical linear modeling, structural equation modeling). Thus, there is a need to clearly distinguish VG as a method from VG as a practice. Like some recent challenges to VG and meta-analysis (e.g., James & McIntyre, 2010; Muchinsky & Raines, 2013), Tett et al. also failed to make this distinction and, thus, confused the readers of their article, as discussed in some detail in this commentary.
VG (or Lack of It) Should Be Treated as a Matter of Degree, Not Dichotomy
Tett et al. (2017) stated that VG is determined using some rules such as the 75% rule and/or the 90% CV rule.¹ For example, they stated that “the 75% rule (Schmidt & Hunter, 1977) holds that if %VE > 75%, situational generalizability of mean rho may be inferred” (p. 425) and that “VG dichotomizes the continuum of correlation strength (James & McIntyre, 2010); in terms of the 90% [CV] > 0 rule, either the 90% CV falls above 0, conferring VG, or it does not, failing to confer VG” (p. 426; bracket added for clarity). However, more recent books and articles about VG have made it clear that when examining VG analysis results, we need to regard VG (and situational specificity) as a matter of degree, not dichotomy. In fact, for this reason, Hunter and Schmidt dropped the 75% “rule of thumb,” which was originally included in the first edition of their meta-analysis book (Hunter & Schmidt, 1990), from the second and third editions of the book (Hunter & Schmidt, 2004; Schmidt & Hunter, 2015). Note that like many other well-established statistical analysis methods, meta-analysis (VG as a method) is a constantly evolving research synthesis tool, and meta-analysts should be aware of major advancements and refinements of the method (Cortina, Aguinis, & DeShon, 2017; Schmidt, 2008). Thus, VG results should also be interpreted as a matter of degree, not a matter of dichotomy (VG or not), in order not to be subject to the same fallacy created by null hypothesis significance testing (NHST; significant or not). Meta-analysis should be practiced by scientists, not judges (with a “yes” or “no” switch in their heads). In the same vein:
It is important to note, however, that validity generalization can be justified in many cases even if the remaining variance [i.e., SD(rho)] is not zero. That is, validity generalization can be justified in many cases in which the hypothesis of situational specificity cannot be definitively rejected. (Pearlman, Schmidt, & Hunter, 1980, p. 376; bracket added)
Finally, to unstrap the straitjacket of NHST or such dichotomous heuristics, the degree of VG should be carefully gauged by triangulating all available meta-analytic results and, if possible, relevant prior VG results as discussed next (e.g., Pearlman et al., 1980; Salgado, Anderson, Moscoso, Bertua, & Fruyt, 2003).
The Degree of VG Should Be Gauged by Triangulating All Available Meta-Analytic Results
Tett et al. (2017) stated in multiple places in their article, using multiple examples, that if there are good reasons to expect some moderating effect(s), results from an overall VG analysis (meta-analysis) are “moot,” “irrelevant,” “minimally useful,” and “a misrepresentation of the data.” They developed a hypothetical, and atypical, example of an overall VG analysis about the bidirectional relationship between agreeableness and job performance across different samples (e.g., positive relationships for jobs “where caring for others is especially valued” and negative relationships for jobs where being “tough skinned is favored”). Before commenting more on the specific hypothetical example, we would like to note two things. First, this example by Tett et al. does not represent a problematic meta-analysis practice that mixes apples, oranges, and pears in meta-analysis (e.g., mixing many different personality traits in the same meta-analysis and concluding that personality does not matter in predicting performance because of the low mean rho and a huge amount of variability [SD(rho)] across input validities; Cortina, 2003). Second, the vast majority of moderator analyses in meta-analysis/VG do not appear to result in negative mean validities for predictors such as cognitive ability tests (Hunter & Hunter, 1984; Salgado et al., 2003), conscientiousness (Barrick & Mount, 1991), work sample tests (Roth, Bobko, & McFarland, 2005), assessment centers (Gaugler, Rosenthal, Thornton, & Bentson, 1987), grade point average (GPA; Roth, BeVier, Switzer, & Schippmann, 1996), or employment interviews (Huffcutt & Arthur, 1994). 
Instead, such analyses typically result in varying degrees of positive relationships, suggesting there could be useful validities at multiple values or levels of moderators (e.g., positive validities for GPA across multiple conditions). In a very real sense, there is an argument that the data support validity across multiple levels of a moderator (e.g., see Hunter & Hunter's [1984] work on useful levels of validity across levels of job complexity) and that the judgment of validity does generalize across those moderator levels.
Now to Tett et al.’s (2017) example: Although the percentage of jobs in which a low level of agreeableness is valued is much smaller than the percentage of jobs in which a high level of agreeableness is valued, let us conservatively assume that we have six validation studies on the relationship between agreeableness and job performance with a sample size of 150 each (a reasonable sample size found in many validation studies), and that the observed validities are .09 (a reasonable value for this trait; Barrick & Mount, 1991) in three studies and –.09 in the remaining three studies. Assuming no measurement error and no range restriction in all six input studies (just to make the example easy to understand), overall VG analysis results will show that the mean rho is .00 and SD(rho) is .04. The 80% credibility interval ranges from –.05 to .05, and the 95% confidence interval ranges from –.07 to .07; the percentage of variance due to artifacts (%VE; sampling error alone in this case) is 83%. If meta-analysts followed the VG practices discussed in Tett et al.’s article, they would be confused given that the 90% CV “rule of thumb” would suggest that VG is not present, whereas the 75% “rule of thumb” would suggest that VG is present. Again, VG should not be determined in a dichotomous manner using only part of the meta-analytic results, although improper and outdated practices of VG may have determined VG in this fashion. Instead, the aforementioned VG results suggest that we cannot rule out the possibility that the sign of the input validities is mostly artifactual due to sampling error, not due to occupational differences, given that the reliability of the validity distribution (vector) is very low at .17 (= 1 – .83) or, equivalently, that the correlation between the observed validities and sampling error is .91 (= SQRT of .83). 
That is, we simply cannot completely trust any of the observed validities in the distribution (at face value) because the magnitude and sign of those validities are mostly due to sampling error.² Obviously, this is a more parsimonious explanation for the apparent variation in the entire validity distribution. Furthermore, it is well known that %VE is a percent-based relative index (in this case, .83 = .0067 / .0081) and, thus, should be triangulated by carefully considering its components,³ as well as all other meta-analytic results such as k, total N, SD(rho), SE(mean rho),⁴ and their extensions (the 80% credibility and 95% confidence intervals).
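The figures in the hypothetical six-study example above can be reproduced with a short calculation. The following is a minimal sketch, assuming a bare-bones (sampling-error-only) analysis with sample-size weights, a sampling error variance of (1 – mean r²)² / (mean N – 1), and a standard error of the mean equal to SD(r) / SQRT(k); the function name `bare_bones` and the normal-curve multipliers (1.2816 and 1.96) are illustrative choices, not details given in the commentary.

```python
import math

# Hypothetical inputs from the worked example: six studies, N = 150 each,
# three observed validities of .09 and three of -.09.
rs = [0.09, 0.09, 0.09, -0.09, -0.09, -0.09]
ns = [150] * 6


def bare_bones(rs, ns):
    """Bare-bones meta-analysis of correlations (sampling error only).

    Assumes no measurement error and no range restriction, as in the example.
    """
    k = len(rs)
    total_n = sum(ns)
    # Sample-size-weighted mean and observed variance of the validities.
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected sampling error variance (assumed formula, see lead-in).
    var_err = (1 - mean_r ** 2) ** 2 / (total_n / k - 1)
    # Residual variance attributed to rho; truncated at zero.
    sd_rho = math.sqrt(max(var_obs - var_err, 0.0))
    se_mean = math.sqrt(var_obs / k)
    pct_ve = var_err / var_obs
    return {
        "k": k,
        "total_n": total_n,
        "mean_r": mean_r,
        "var_obs": var_obs,
        "var_err": var_err,
        "pct_ve": pct_ve,
        "sd_rho": sd_rho,
        "se_mean": se_mean,
        "cred_80": (mean_r - 1.2816 * sd_rho, mean_r + 1.2816 * sd_rho),
        "ci_95": (mean_r - 1.96 * se_mean, mean_r + 1.96 * se_mean),
        # Reliability of the validity vector and its square-root counterpart.
        "reliability": 1 - pct_ve,
        "corr_with_error": math.sqrt(pct_ve),
    }


results = bare_bones(rs, ns)
for name, value in results.items():
    print(name, value)
```

Under these assumptions, the rounded output agrees with the values reported above: %VE of about 83% (.0067 / .0081), SD(rho) of about .04, an 80% credibility interval of roughly –.05 to .05, and a 95% confidence interval of roughly –.07 to .07. Printing the components alongside the ratio makes it easier to see how small both the numerator and the denominator of %VE are in this example.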
The Mean Rho From an Overall VG Analysis Is Useful as Long as the Analysis Is Properly Conducted
Tett et al. (2017) claimed that the mean rho estimated from an overall VG analysis is “moot” or “irrelevant” if there are good reasons to expect moderating effects. Simply put, the overall mean rho estimated via meta-analysis, regardless of its magnitude, is the best estimate for mean rho of the grand population and thus represents the grand population (that may subsume subpopulations) more accurately than any other single value. The estimated subgroup mean rho via meta-analysis is the best estimate for mean rho of the subpopulation and thus accurately represents the subpopulation. For clarity, we note that each subpopulation may have some known or unknown moderators (i.e., next lower-level subpopulations), and thus mean rho, not rho, is used here according to the primary principle of the random-effects model (Schmidt, Oh, & Hayes, 2009). Of course, not surprisingly, the overall mean rho estimate does not more accurately represent subpopulations (i.e., some subsets/subgroups of the entire validity distribution) than corresponding subgroup mean rho estimates, and likewise, any of the subgroup mean rho estimates do not more accurately represent the grand population (the entire validity distribution) than the overall mean rho estimate. Accordingly, Tett et al.’s claim that mean rho estimated from an overall VG analysis misrepresents the entire validity distribution if some moderating effects are expected is inappropriate as long as the VG analysis is properly conducted. Of course, if an overall VG analysis was conducted improperly by mixing different constructs (e.g., general mental ability, proactive personality) or different methods (e.g., work sample tests, employment interviews, situational judgment tests) as predictors of job performance, the mean rho estimated from the overall analysis would be almost moot and potentially meaningless (Cortina, 2003). 
As noted above, mean rho should not be the sole focus of VG.
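One way to make the overall-versus-subgroup point concrete is a minimal sketch using the six hypothetical validities from the agreeableness example. It assumes that “represents more accurately” can be operationalized as a smaller sample-size-weighted mean squared deviation of the observed validities from a single summary value; that operationalization, the split into two subgroups of three studies, and the helper names `weighted_mean` and `weighted_mse` are illustrative assumptions rather than definitions taken from the commentary.

```python
# Hypothetical studies from the agreeableness example (N = 150 each):
# the first three form a "positive" subgroup, the last three a "negative" one.
rs = [0.09, 0.09, 0.09, -0.09, -0.09, -0.09]
ns = [150] * 6


def weighted_mean(rs, ns):
    """Sample-size-weighted mean validity."""
    return sum(n * r for r, n in zip(rs, ns)) / sum(ns)


def weighted_mse(rs, ns, value):
    """Sample-size-weighted mean squared deviation of validities from one value."""
    return sum(n * (r - value) ** 2 for r, n in zip(rs, ns)) / sum(ns)


overall_mean = weighted_mean(rs, ns)
positive_mean = weighted_mean(rs[:3], ns[:3])

# Summarizing the entire distribution: the overall mean deviates less.
whole_by_overall = weighted_mse(rs, ns, overall_mean)
whole_by_subgroup = weighted_mse(rs, ns, positive_mean)

# Summarizing the positive subgroup: its own mean deviates less.
sub_by_overall = weighted_mse(rs[:3], ns[:3], overall_mean)
sub_by_subgroup = weighted_mse(rs[:3], ns[:3], positive_mean)

print("entire distribution:", whole_by_overall, "vs", whole_by_subgroup)
print("positive subgroup:  ", sub_by_overall, "vs", sub_by_subgroup)
```

Under this assumed criterion, the overall mean summarizes the entire validity distribution with a smaller squared deviation (about .0081) than the positive-subgroup mean does (about .0162), whereas the subgroup mean summarizes its own subgroup better than the overall mean does. Each estimate is the better single value for the population it is meant to describe.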
Conclusion
In conclusion, we want to draw the readers’ attention to the fact that VG is a well-established statistical method/tool that can be properly used, misused, and abused like other well-established statistical analysis methods. Like the results of many other statistical analysis methods, VG analysis results should be interpreted as a matter of degree, not as a matter of dichotomy (VG or not), in order to be scientifically useful and not to be subject to the same fallacy created by NHST (which ironically VG was designed to address). In addition, VG analysis results should also be interpreted while considering and triangulating all available meta-analytic results, not just mean rho and/or SD(rho). Given that VG is a constantly evolving statistical method like many other statistical methods, meta-analysts should not stick to outdated practices and methods but keep abreast of important advancements and refinements in both VG practices and methods. Finally, there is a need to distinguish VG as a method from VG as a practice, and hence, improper or outdated VG practices should not be a basis for challenging VG as a state-of-the-art method.