“Effectiveness” within personnel selection is not a unitary construct. It refers to the extent to which an organization can successfully map job tasks onto necessary worker knowledge, skills, and abilities; measure these attributes; and recruit a deep applicant pool possessing these qualities (Gatewood et al., 2015). It also refers to an organization’s capability to mitigate legal risk, via compliance with equal employment opportunity laws (Equal Employment Opportunity Commission, 1978; Gutman et al., 2012). Finally, it refers to ensuring that those interacting with the selection process (both internal and external stakeholders) experience it as job-related and fair, and that the resulting hiring decisions promote the organization’s diversity and inclusion goals.
Unfortunately, organizations can face trade-offs when it comes to simultaneously attaining these inter-related yet at times competing goals (e.g., Pyburn et al., 2008). For example, whereas a firm may have access to a hiring tool that effectively predicts future job performance, the tool may not be compliant with legal guidance (e.g., the job-relatedness standard specified by the Uniform Guidelines; Equal Employment Opportunity Commission, 1978); it may be seen as disadvantaging minority groups (e.g., a physical ability test seen as unfair by women or disabled applicants); or it may cause legitimate adverse impact, where groups protected by legislation (e.g., Title VII of the Civil Rights Act of 1964) are systematically underselected by an otherwise valid tool or process (e.g., as is the case with general cognitive ability tests that often systematically underselect racial minorities). This is further complicated by the fact that much equal employment opportunity (EEO) legislation in the United States prohibits “radical” forms of discrimination reduction, such as giving preferential treatment to minorities in efforts to increase workforce diversity (except when court ordered or part of a temporary voluntary settlement agreement).
As such, the field of industrial and organizational (I-O) psychology has expended considerable effort examining how these tradeoffs might be effectively balanced in order to maximize the aforementioned effectiveness criteria. To date, these efforts have met with mixed success. Strategically integrating recruiting and selection efforts has shown some potential. For example, while general minority recruiting is unlikely to result in greater diversity of selected applicants if adverse impact-prone selection tools are used (Tam et al., 2004), there is preliminary support for qualification-based minority recruiting (Newman et al., 2013; Newman & Lyon, 2009). And while more sophisticated psychometric innovations, such as banding (i.e., using standard errors to create score bands where applicants within a band are considered equal and then making diversity-related hires within a band), have shown practical potential, they have been viewed by the courts as legally noncompliant (e.g., Henle, 2004; Campion et al., 2001).
In this practice forum, we briefly trace these efforts, exploring both the technical aspects of the proposed methods and the technicalities on which they have been challenged in the court system. We then review a more recent method that has been proposed in the psychometric literature, Pareto-optimization, which seeks to simultaneously maximize the prediction of job performance and minimize adverse impact against legally protected groups. As we review below, whereas this method has promise and overcomes some of the pitfalls of previous methods, it has yet to be widely applied within organizations and, consequently, vetted in the court system through legal challenge. We seek to contribute to both the research on and the practice of equal employment opportunity by providing a user-friendly, yet thorough, description of the Pareto-optimization method. We also provide materials to aid in the application of the method (ranging from a practical description of how to use a publicly available Pareto-optimization tool, to an in-depth, technical description of how the method performs the estimations). Finally, we discuss how the method could potentially be challenged legally as violating Title VII and what an organization could present (both proactively and reactively) as a defense against such challenges. By demystifying Pareto-optimization, our hope is that this presentation, critique, and analysis will encourage organizations to consider employing the method within their selection systems. Further, we hope to encourage discourse among I-O psychologists and legal professionals regarding its defensibility if challenged on legal grounds.
Strategies for simultaneously maximizing diversity and validity
As mentioned above, some selection procedures that strongly predict job performanceFootnote 1 show systematic differences in scores across demographic groups such as race or gender (Bobko & Roth, 2013; Ployhart & Holtz, 2008), which hinder the goals of diversity and legal defensibility by selecting minority applicants at lower rates than majority applicants.Footnote 2 This problem has been labeled the diversity-validity dilemma (Pyburn et al., 2008). The most salient example of this dilemma stems from the use of cognitive ability tests for personnel selection, which have been shown to be among the strongest predictors of job performance (corrected r = .51; Schmidt & Hunter, 1998; Roth et al., 2011), but also demonstrate higher race-related subgroup differences compared to other common selection procedures (Roth et al., 2001).Footnote 3
A number of strategies have been proposed for mitigating this dilemma. For example, a surge of work discussing the use of score banding began in the early 1990s (cf. Campion et al., 2001). This method involves the use of standard errors to create score bands. Within a score band, candidates are considered to have statistically equivalent scores and thus cannot be rank-ordered by score. As such, preferential selection based on demographic status is one strategy that can be used to select candidates within a band to increase the representation of women and minorities in the workforce, while minimizing the compromise to validity. Unfortunately, this strategy has faced two major hurdles in its implementation. The first is an issue of recruitment. If an organization is not successful in recruiting a diverse pool of highly qualified applicants to begin with, then top score bands could potentially lack diversity, thus constraining what banding can do to improve diversity (Newman et al., 2013). The second is an issue of legality. The technique has in fact been legally challenged on the grounds of discriminating against White applicants, and it has been argued that using demographic status as a selection criterion within a band could be viewed as violating EEO law (i.e., interpreted as a race/sex/etc.-based decision; Henle, 2004; Sackett & Roth, 1991).Footnote 4 Whereas other, more legally appropriate selection criteria could be used to choose candidates within a band, these alternative methods may do little to resolve validity-diversity tradeoffs (Cascio et al., 2009).
A second approach involves strategically choosing, ordering, and combining various predictors in a way that minimizes validity-diversity tradeoffs (Ployhart & Holtz, 2008). This might involve the use of a variety of (valid, non-overlapping) predictors instead of, or in combination with, general cognitive ability tests. These predictors include tests of more fine-grained cognitive facets (e.g., verbal ability, mathematical ability), noncognitive predictors (e.g., personality), and predictor measures that are not tests (e.g., interviews, assessment centers; Bobko et al., 1999; Sackett & Ellingson, 1997). Illustrating this, Finch et al. (2009) showed through a simulation study that using different combinations of predictors in different stages of a selection procedure can, to varying extents, affect both the amount of performance predicted by the selection system and the adverse impact of the selection system for minority applicants.Footnote 5 Further, when job performance is conceptualized broadly to include task performance and contextual performance (rather than task performance only), it can be effectively predicted by a wider range of noncognitive predictors (that have smaller subgroup differences; Hattrup et al., 1997).
Despite the promise of these various approaches, a systematic method for carrying out such “optimization” is lacking. Instead, the majority of previous efforts rely on labor-intensive, trial-and-error-based analyses, and have been more academic exercises (to explore what is possible) than clear, actionable solutions for practitioners to implement. A defined process that allows for a more systematic identification of how predictors should be weighted across specific contexts could reduce the guesswork required within selection decisions and allow greater precision in reaching organizational goals for employee performance and diversity. The Pareto-optimal weighting technique (or “Pareto-optimization”) has been offered as just such a process.Footnote 6
Pareto-optimization as a potential way forward
The application of Pareto-optimization in personnel selection was first introduced by De Corte et al. (2007). This technique, which borrows from the multiple-objective optimization literature in engineering and economics, was proposed specifically to allow for personnel selection decisions that simultaneously optimize both the expected job performance and the diversity of new hires. Previous work (e.g., De Corte et al., 2008, in press; Song, Wee, et al., 2017; Wee et al., 2014) has shown that the technique has the potential to improve diversity (i.e., increase the number of job offers extended to minority applicants) with no loss in expected job performance.
Pareto-optimization involves intervening on the weights applied to the items or predictors comprising the selection system. It can be compared to more common methods such as unit-weighting, where all predictors are given equal weights, and regression-weighting, where regression analysis is used to determine predictor weights that maximize the prediction of job performance. Pareto-optimization differs from these techniques in that the weights for each predictor are statistically determined to simultaneously optimize two (or more) criteria (e.g., job performance and diversity), as compared to optimizing only one criterion (e.g., job performance, such as when regression weights are estimated).
As an illustration, consider an organization that has identified three predictors for use in its selection system: a cognitive ability test, a conscientiousness (personality) measure, and a physical ability assessment. Screening candidates using these three tools results in a score on each predictor. Unit weighting involves taking the sum or average of the three predictor scores and rank-ordering candidates on the basis of the total or average scores. Unit weighting makes intuitive sense, and research has supported its effective use in making valid selection decisions (e.g., Einhorn & Hogarth, 1975), which has contributed to its popularity in practice (Bobko et al., 2007).
Regression weighting involves obtaining predictor scores from current employees, along with an accurate measure of their job performance (the criterion).Footnote 7 The data are then used to fit a regression model that determines the weighting scheme maximizing (optimizing) the prediction of the criterion. These regression weights are then used to calculate a weighted predictor composite score for each applicant, which allows for the rank-ordering of the applicants in a top-down selection process. Although regression weighting is more labor intensive and requires reliable performance data, its promise for maximizing criterion-related validity has also made it a popular technique within organizations (Bobko, 2001).
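To make the contrast concrete, the following R sketch (R being the language of the ParetoR tool discussed below) illustrates unit weighting and regression weighting for the three-predictor example. The data, variable names, and coefficients are hypothetical and randomly generated; this is a minimal illustration rather than a full validation workflow.

```r
# Hypothetical incumbent data: three predictor scores and a performance rating
set.seed(1)
n <- 300
incumbents <- data.frame(cognitive = rnorm(n),
                         conscient = rnorm(n),
                         physical  = rnorm(n))
# Performance simulated as a noisy function of the predictors (illustration only)
incumbents$performance <- with(incumbents,
  .50 * cognitive + .25 * conscient + .15 * physical + rnorm(n))

# Unit weighting: standardize each predictor, then average (equal weights)
z <- scale(incumbents[, c("cognitive", "conscient", "physical")])
unit_composite <- rowMeans(z)

# Regression weighting: weights chosen to maximize prediction of performance
fit <- lm(performance ~ cognitive + conscient + physical, data = incumbents)
reg_composite <- predict(fit)

# Criterion-related validity of each composite in this (calibration) sample
cor(unit_composite, incumbents$performance)
cor(reg_composite, incumbents$performance)
```

Pareto-optimal weighting, described next, replaces the single-objective regression step with weights chosen to balance this criterion-related validity against adverse impact.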
Pareto-optimal weighting is similar to regression weighting in that it also seeks “optimized” composite scores. However, it differs from regression weighting in that it aims to optimize two (or more) outcomes simultaneously (De Corte et al., 2007). For example, if an organization wants to simultaneously minimize certain subgroup differences/adverse impact and maximize job performance in new hires, Pareto-optimization could be utilized to derive a set of weights that provides the best possible solution (i.e., maximizing performance prediction at a predetermined threshold for adverse impact). That is, given a desired level of diversity among new hires, a maximum level of expected job performance can be obtained. Likewise, given a desired level of expected job performance, a maximum level of diversity among new hires can be obtained.
Continuing our example from above, where an organization has in place three selection predictors for a particular job (cognitive ability, conscientiousness, and physical ability), let us further assume that it has interest in reducing Black/WhiteFootnote 8 subgroup differences to the greatest extent possible without sacrificing its ability to effectively predict applicants’ future job performance, and wishes to do so via Pareto-optimization. Figure 1 illustrates a potential Pareto-optimal tradeoff curve that could be used to attain this goal. The horizontal axis represents the adverse impact ratio (i.e., the selection rate of Black applicants relative to the selection rate of White applicants).Footnote 9 Throughout our example, we reference the four-fifths or 80% rule (i.e., that the selection rate of one protected subgroup should not be less than 80% of the majority subgroup’s selection rate; EEOC, 1978), while at the same time acknowledging it is not the only (or often best) way to demonstrate adverse impact.Footnote 10 The vertical axis represents the possible range of criterion-related validity estimates (i.e., the varying levels at which selection composite scores correlate with job performance). Each point on the curve represents a set of potential predictor weights. Continuing our example with cognitive ability, conscientiousness, and physical ability as predictors, each point on the Pareto-optimal curve represents a set of three predictor weights, one for each of the three predictors. The negative slope of the Pareto-optimal curve illustrates the diversity-performance tradeoff. If the curve is steep, it means the organization must sacrifice a large amount of expected job performance to gain a small decrease in adverse impact. On the other hand, if the curve is flat, it means that the organization would not have to give up much expected job performance to get a large payoff in terms of adverse impact reduction.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_fig1.png?pub-status=live)
Figure 1. An example Pareto-optimal trade-off curve. Job performance (validity) is the correlation between the weighted predictor composite score and job performance score. Diversity of new hires is represented as the Black/White adverse impact (AI) ratio. Point A (red dot) represents the solution where job performance validity is maximal; Point C (blue dot) represents the solution where adverse impact ratio is maximal. Point B (green dot) represents the Pareto-optimal solution where the prediction of future job performance and the minimization of adverse impact against Black applicants are considered equally important.
Organizations can effectively use such curves to determine the specific weighting scheme where, to the extent it is simultaneously possible, certain subgroup differences are minimized and job performance validity is maximized. For example, the weighting scheme at Point A in Figure 1 would provide maximal job performance and high adverse impact (where the selection rate of Black applicants is far less than 80% that of White applicants). This would be akin to regression weighting, where optimization occurs around a single criterion only (job performance). Similarly, Point C represents the weighting scheme that minimizes the difference in Black/White selection rates with no consideration of job performance validity. In contrast, Point B represents a weighting scheme where the prediction of future job performance and the minimization of adverse impact against Black applicants are considered equally important.
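As a concrete note on the horizontal axis of Figure 1, the adverse impact ratio can be computed directly from selection counts, as in the short R sketch below. The counts are hypothetical, and the 80% check mirrors the four-fifths rule of thumb referenced above.

```r
# Hypothetical selection outcomes under one candidate weighting scheme
black_applicants <- 120; black_hired <- 18
white_applicants <- 480; white_hired <- 110

black_rate <- black_hired / black_applicants   # .150
white_rate <- white_hired / white_applicants   # .229
ai_ratio   <- black_rate / white_rate          # ~.65

# Four-fifths (80%) rule of thumb from the Uniform Guidelines
ai_ratio >= .80   # FALSE: this scheme would flag potential adverse impact
```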
Applying Pareto-optimal weighting in the real world
Although the psychometric/empirical research on Pareto-optimization may seem “statistics-heavy,” its application as an actionable tool within organizations is quite straightforward. Below, we provide three general levels of detail to aid in the implementation of Pareto-optimization: (a) the steps human resource (HR) practitioners would need to take to collect/obtain data from current employees to estimate Pareto-optimal solutions; (b) the steps that would be taken once this information is collected to generate the Pareto-optimal curve and associated predictor weights; and (c) the technical details pertaining to how exactly the weights are estimated. We expect that different readers (HR practitioners, psychometric consultants, attorneys, expert witnesses) may find different aspects of these descriptions useful.
Table 1 provides a step-by-step guide for collecting the information needed to compute the input statistics that feed the Pareto-optimization algorithm. The first step involves the collection of predictor (e.g., scores from cognitive ability tests, personality measures, physical ability tests) and criterion (e.g., job performance and adverse impact) data from existing employees.Footnote 11 Using these data, in Step 2, one can compute (a) predictor intercorrelations, (b) job performance criterion validity estimates for each predictor (i.e., the correlation between each predictor score and job performance ratings), and (c) subgroup differences on each predictor (d; the standardized group-mean difference between two groups; e.g., Black/White differences, or differences between men and women). This information is then used in Step 3 as input into the Pareto-optimization algorithm to obtain the predictor weights and Pareto curve.
Table 1. A step-by-step guide for implementing Pareto-optimization within personnel selection
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_tab1.png?pub-status=live)
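To make Step 2 of Table 1 concrete, the R sketch below computes the three required inputs (predictor intercorrelations, criterion-related validities, and subgroup d values) from a hypothetical incumbent data set. The variable names, the random data, and the pooled-standard-deviation formula for d are our own illustrative choices, not requirements of any particular Pareto-optimization tool.

```r
# Hypothetical incumbent data: predictor scores, performance ratings, and race
set.seed(2)
n <- 300
incumbents <- data.frame(cognitive   = rnorm(n),
                         conscient   = rnorm(n),
                         physical    = rnorm(n),
                         performance = rnorm(n),
                         race        = sample(c("Black", "White"), n,
                                              replace = TRUE, prob = c(.3, .7)))
predictors <- c("cognitive", "conscient", "physical")

# Step 2a/2b: predictor intercorrelations and criterion-related validities
R_matrix   <- cor(incumbents[, c(predictors, "performance")])
validities <- R_matrix[predictors, "performance"]

# Step 2c: standardized Black/White subgroup difference (d) on each predictor
cohens_d <- function(x, group) {
  x_w <- x[group == "White"]; x_b <- x[group == "Black"]
  pooled_sd <- sqrt(((length(x_w) - 1) * var(x_w) + (length(x_b) - 1) * var(x_b)) /
                    (length(x_w) + length(x_b) - 2))
  (mean(x_w) - mean(x_b)) / pooled_sd   # positive d = higher White mean
}
subgroup_d <- sapply(incumbents[predictors], cohens_d, group = incumbents$race)

R_matrix; validities; subgroup_d   # Step 3: feed these into the Pareto tool
```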
There are at least three tools that can be used to carry out Pareto-optimization in personnel selection: (a) a FORTRAN program, TROFSS (De Corte et al., 2007); (b) an R package, “ParetoR” (Song et al., 2017); and (c) a point-and-click web application, the ParetoR Shiny app (Song et al., 2017). For each of these tools, users input the data and estimates described above in order to generate (1) the Pareto-optimal predictor weights, (2) criterion solutions (i.e., job performance validity and AI ratios), and (3) the Pareto-optimal trade-off curve. The Appendix provides the technical details pertaining to how the Pareto-optimal weights are generated.
Figure 2 shows a screenshot of the ParetoR Shiny app. The web application consists of three parts: (1) “Input” (red box, left, and Figure 3); (2) “Output: Plots” (green box, top right, and Figure 4); and (3) “Output: Table” (blue box, bottom right, and Figure 5). To start, users specify as input (a) the selection ratio (the expected percentage of applicants who will be extended job offers), (b) the expected proportion of “minority”Footnote 12 applicants (following our example, this would be the expected proportion of Black applicants),Footnote 13 (c) the number of predictors, (d) the predictor-criterion correlation matrix (obtained from the incumbent sample), and (e) predictor subgroup differences (obtained from the incumbent sample; following our example, this would be Black/White subgroup differences). Once this information is entered, the user clicks the “Get Pareto-Optimal Solution!” button in the “Input” panel.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_fig2.png?pub-status=live)
Figure 2. An example application of Pareto-optimization using the ParetoR Shiny app web application (https://qchelseasong.shinyapps.io/ParetoR/). The “Input” box (red box, left) shows the input panel in which users will specify the following input data: (a) selection ratio, (b) expected proportion of minority applicants, (c) number of predictors, (d) predictor-criterion correlation matrix, and (e) subgroup differences (see Table 1, Step 2 for details). The “Output: Plots” box (green box, top right) shows two output plots: (1) Plot A: Pareto-optimal trade-off curve [i.e., job performance validity (vertical axis) and AI ratio (horizontal axis) trade-off curve], similar to Figure 1, each point (Pareto point) on the trade-off curve represents a set of predictor weights; (2) Plot B: predictor weights across different Pareto points (the three predictors at each point correspond to the Pareto point in Plot A). The “Output: Table” box (blue box, bottom right) shows the AI ratio, job performance criterion validity, and predictor weights corresponding to each Pareto point (each row). Expanded views of each of the parts of this figure are shown in Figures 3–5.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_fig3.png?pub-status=live)
Figure 3. An expanded view of Figure 2: “Input.”
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_fig4.png?pub-status=live)
Figure 4. An expanded view of Figure 2: “Output: Plots.”
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_fig5.png?pub-status=live)
Figure 5. An expanded view of Figure 2: “Output: Table.”
The Pareto-optimal solutions will be displayed in the “Output: Plots” and “Output: Table” sections on the right. Plot A in the “Output: Plots” section (Figure 2 (green box) and Figure 4) is a Pareto curve, similar to Figure 1. The vertical axis in Plot A displays criterion-related validities (performance outcomes), whereas the horizontal axis displays adverse impact ratios (diversity outcomes). The Shiny app provides 21 Pareto points, or 21 evenly spaced solutions. Each point (Pareto point) on the trade-off curve represents a set of predictor weights (three predictors in our ongoing example) that simultaneously optimizes both job performance (criterion validity) and the Black/White adverse impact ratio. In the example, as the user examines the curve from left to right, the sets of predictor weights provide progressively less job performance criterion validity and more favorable Black/White adverse impact ratios.
Plot B presents the predictor weights across different Pareto points. In the example, more weight is given to Predictor 2 when job performance criterion validity is maximal (and the Black/White adverse impact ratio is not maximal), and Predictor 3 is weighted more when the Black/White adverse impact ratio is maximal (and job performance is not maximal). This is because Predictor 2 is the strongest predictor of job performance (r = .52; see “Correlation Matrix” in the “Input” panel) but is also associated with the highest Black/White subgroup d (d = .72; see “Subgroup Difference” in the “Input” panel). In contrast, Predictor 3 is the weakest predictor of job performance (r = .22, see “Correlation Matrix” in the “Input” panel) but is also associated with the lowest Black/White subgroup d (d = –.09, see “Subgroup Difference” in the “Input” panel).
The “Output: Table” box (Figure 2, blue box, bottom right; Figure 5 shows an expanded view) presents the specific adverse impact ratio, job performance criterion validity, and predictor weights corresponding to each Pareto point (each row), plotted in the “Output: Plots” section. Based on the selection outcomes (i.e., adverse impact ratio, job performance criterion validity) listed in the table, users can select the combination of (in this case three) predictor weights that leads to their preferred outcome (out of the 21 sets of predictor weights). For example, an organization might choose the solution that results in a Black/White AI ratio of .82 and job performance criterion validity of .36 (Figure 5, the row with an arrow). This is the solution out of the 21 Pareto-optimal solutions that provides the highest job performance criterion validity for an adverse impact ratio that is greater than .80 (the four-fifths rule often referred to in court and within the Uniform Guidelines Footnote 14).
In this example, if these were the primary subgroups of interest, and if compliance with the four-fifths rule was a goal (along with the best possible criterion-related validity given this goal), in subsequent selection processes users would apply weights of .01, .23, and .76 to Predictors 1, 2, and 3, respectively.Footnote 15 However, the organization may also want to consider the Pareto curve for other subgroup differences (e.g., women/men). Our example provides a situation where this might very well be the case, and highlights one complexity of Pareto-optimization: simultaneously considering multiple subgroups. Whereas the use of cognitive ability tests has historically shown adverse impact against racial minority groups, the use of physical ability tests has historically shown adverse impact against women as compared to men. Thus, generating the Pareto curve considering sex-based subgroups will likely produce a different set of “optimal” weights. This highlights the need for users’ specific selection goals to guide the analyses they choose to carry out. It also suggests considering ahead of time how the opposing “optimal” weighting schemes for different subgroup comparisons will be reconciled when making final decisions about the scoring and weighting of predictors. We discuss this issue further in the Practical Recommendations section.
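To illustrate how the chosen row from the output table would then be used, the R sketch below applies the example weights (.01, .23, and .76) to a hypothetical applicant pool, makes top-down selections at an assumed 20% selection ratio, and checks the realized Black/White adverse impact ratio. The applicant data and selection ratio are invented for illustration; in practice the weights would come directly from the Pareto output rather than being hard-coded.

```r
# Hypothetical applicant pool: standardized predictor scores plus race
set.seed(3)
n_app <- 500
applicants <- data.frame(cognitive = rnorm(n_app),
                         conscient = rnorm(n_app),
                         physical  = rnorm(n_app),
                         race      = sample(c("Black", "White"), n_app,
                                            replace = TRUE, prob = c(.25, .75)))

# Pareto-optimal weights chosen from the output table (Predictors 1-3)
w <- c(cognitive = .01, conscient = .23, physical = .76)

# Weighted composite and top-down selection at a 20% selection ratio
applicants$composite <- drop(as.matrix(applicants[, names(w)]) %*% w)
cutoff <- quantile(applicants$composite, probs = 1 - .20)
applicants$selected <- applicants$composite >= cutoff

# Realized Black/White adverse impact ratio among these applicants
rates <- tapply(applicants$selected, applicants$race, mean)
rates["Black"] / rates["White"]
```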
Considerations of the legal defensibility of Pareto-optimization
Whereas, statistically speaking, Pareto-optimization offers a promising solution for dealing with diversity-validity tradeoffs, this does not necessarily mean that it has the power to hold up to legal scrutiny. This was the very issue our field faced following promising research on the banding technique for personnel selection (Henle, 2004). That is, although the method offered a solution for increasing minority selection rates using predictors with strong job performance criterion validity, the courts determined that giving preference to minority group members within a band is illegal, even though their selection scores can be considered equivalent to majority group members in the same band. As such, it is important that we identify ways Pareto-optimization might be legally challenged, alongside possible defenses organizations might offer in response to such challenges (as well as proactive actions organizations might take to avoid such scrutiny).
As was the case with banding, given that the use of Pareto-optimization is aimed at increasing the representation of underrepresented groups in the workplace (or at least decreasing the reduction of minority hiring rates that might be caused by using validity-based regression weights alone), challenges claiming traditional adverse impact (against minority members) may be unlikely. What could occur, however, are challenges in the form of “majority” group members (e.g., Whites, menFootnote 16) claiming so-called “reverse discrimination” in the form of (intentional) disparate treatment (since considering diversity is explicitly and purposely a goal of the method). Such claims are not unprecedented in general, as evidenced by the cases discussed below. That being said, an extensive search of case law and legal discourse revealed no reports of Pareto-optimization having been legally challenged in any way. Our search also did not reveal any discussion of the method as explicitly legally defensible, therefore providing what we consider to be a clean slate for considering the legality of Pareto-optimization.
Challenge: Demographic information used to determine weights
One specific legal challenge that might be made to the application of Pareto-optimization is that, because demographic information is used to compute the weights, the method might violate Title VII of the Civil Rights Act (which prohibits selection decisions on the basis of race, color, religion, sex, or national origin). This would be a disparate treatment claim, where an argument is made that the organization intentionally took protected group status into consideration when making selection decisions. One defense against such a claim is that the method does not consider any demographic information from the applicant pool itself; nor does it consider demographic information of any individual job incumbent. Rather, the only demographic information used to determine the Pareto-optimal weights is group-level subgroup differences among current employees on predictor and criterion variables (De Corte et al., 2007). Thus, as with selection systems using unit or regression weights, any individual applicant (regardless of his or her demographic group status) has the best chance of being selected for a job if he or she scores highly on predictors that have the greatest weights.
Challenge: Method advantages underrepresented groups
A second challenge that might be brought forth concerning the use of Pareto-optimization is whether the method violates Title VII by virtue of creating an advantage for underrepresented groups (e.g., women, racial minorities). The argument would be that, by incorporating information on incumbent subgroup differences alongside criterion-related validity estimates to create the Pareto curve, organizations are consciously choosing to sacrifice some level of validity (“performance”) in order to gain diversity, which could be interpreted as (“illegal”) preferential treatment (disparate treatment) in favor of “minority” group members.
In Hayden v. County of Nassau (1999), Nassau was under a consent decree with the Department of Justice. Motivated by a goal to create as little adverse impact in selection as possible, Nassau consciously chose to use only nine sections of a 25-section exam that was taken by over 25,000 applicants (in order to retain as much criterion-related validity as possible while preventing as much adverse impact as possible). Importantly, when choosing the test sections on which applicants’ scores would actually be considered in the selection process, the county rejected a different subset of exam components that led to a better minority selection ratio but worse criterion validity for performance. The Second Circuit Court ruled that, although the test redesign took race into account, it was scored in a race-neutral fashion and therefore was acceptable.
This ruling, however, is interpreted by some as having been overturned by the Supreme Court ruling in Ricci v. DeStefano (2009)—except for instances when a consent decree is in place (Gutman et al., 2012). In this case, the New Haven Fire Department was using an exam as part of a selection process for promotions to the ranks of lieutenant and captain. After administering the test, New Haven feared it would face an adverse impact liability given that the results would have led to no Black candidates being promoted. Consequently, it invalidated the test results and used an alternative approach to make promotion decisions. Majority group candidates then filed a disparate treatment claim, arguing that race was used as a factor in determining what predictor to use (and not use) in making promotion decisions. In a split vote, the Supreme Court supported this reasoning in its ruling, setting the standard that “before an employer can engage in intentional discrimination for the asserted purpose of avoiding or remedying an unintentional, disparate impact, the employer must have a strong basis in evidence to believe it will be subject to disparate-impact liability if it fails to take the race-conscious, discriminatory action” (Ricci v. DeStefano, 2009). In a dissent supported by three other justices, Justice Ginsburg noted the heavy burden on an organization to show a strong basis in evidence when taking actions aimed at mitigating adverse impact, stating, “This court has repeatedly emphasized that the statute ‘should not be read to thwart’ efforts at voluntary compliance…. The strong-basis-in-evidence standard, however, as barely described in general, and cavalierly applied in this case, makes voluntary compliance a hazardous venture” (Ricci v. DeStefano, 2009, Ginsburg dissenting opinion).
This ruling might, at first blush, be interpreted as precedent against using Pareto-optimized weights, as New Haven did explicitly take adverse impact—and therefore race—into account when deciding whether and how to use test scores. Indeed, case law recommends great care be taken in how predictors are chosen and combined, and legal analyses suggest that Ricci “highlights the tension between the requirement to seek alternatives with less (or no) adverse impact and disparate treatment rules that permit non-minorities to claim disparate treatment if alternatives are used. Therefore, employers … may be sued for either using or not using alternatives” (Gutman et al., 2011, p. 151).
That having been said, there are a number of differences between what occurred at Nassau County and New Haven and the scenario we describe in the current article, where an organization proactively decides to employ Pareto-optimal weights to address validity-diversity tradeoffs. First, in our scenario, tests have not already been administered to job applicants in the way they were at Nassau and New Haven. That is, incumbent data are used in determining how predictors will be weighted a priori. In this way, consistent with best practice, tests are piloted and validated in advance and, in determining if and how they will be used and weighted, are evaluated for their ability both to predict variance in job performance and to minimize adverse impact. This is consistent with the holistic view of validity that professional standards have come to embrace (e.g., the Society for Industrial and Organizational Psychology, 2018), which emphasizes both prediction and fairness as evidence for the effectiveness of selection tests.
Second, much of the field of personnel selection is predicated on making smart decisions in determining what predictors to use and how predictor scores will be combined (e.g., Sackett & Lievens, 2008). It is considered “industry standard” to identify predictors that can be supported by job analysis and, in deciding how to use these predictors, to consider other factors, such as the potential for adverse impact, cost, and so forth (Ployhart & Holtz, 2008). The use of personality assessment within selection contexts is often the result of such an approach. The only difference with the use of Pareto-optimal weights is that these considerations are quantified more systematically, allowing such decisions to be made more effectively. Again, considering the potential for subgroup differences within the context of a validation study is completely consistent with professional standards (e.g., the Principles for the Validation and Use of Personnel Selection Procedures [SIOP, 2018]; the Uniform Guidelines on Employee Selection Procedures [EEOC, 1978]).
Third, the use of Pareto-optimal weights does not necessarily disadvantage majority applicants. As stated above, no within-group adjustments are made; rather, an additional criterion (adverse impact reduction) is applied as implementation/combination strategies are considered. Further, as explained above, in using the Pareto curve to choose weights, the organization can decide what thresholds to use in terms of both job performance validity and adverse impact. Many psychometric discussions of this method (e.g., Sackett et al., 2010) use examples that set the adverse impact ratio to .80 in order to reflect the four-fifths standard often used by the courts. However, the adverse impact ratio does not have to be set at exactly .80 (and as mentioned in the footnotes above, alternative metrics to the adverse impact ratio could be used here as well). Rather, an organization can look to the Pareto curve with the goal of improving diversity (i.e., reducing adverse impact) in ways that do not sacrifice job performance validity. The extent to which an organization will be able to attain this goal is, of course, dependent on the context itself (e.g., the predictors used, the presence of subgroup differences, the selection ratio, the proportion of minority applicants in the applicant pool).
Challenge: Does the method hold up to the Daubert standards?
A third way in which the legality of the use of Pareto-optimization might be questioned is related to the Daubert standard for the admissibility of expert testimony (originally established in Daubert v. Merrell Dow Pharmaceuticals Inc., 1993). This standard has often been used to evaluate a scientific or technical method that is argued to be rigorous and acceptable by a defendant or plaintiff’s expert witness. There are five illustrative factors (criteria) used within legal contexts to support a method: (1) it can be or has been tested previously, (2) it has been published in peer-reviewed outlets, (3) its known or potential level of imprecision is acceptable, (4) there is evidence that a scientific community accepts the method (i.e., the method has widespread acceptance within the community and is not viewed with a large amount of skepticism), and (5) the method will be judged by the courts based on its inherent characteristics as opposed to the conclusions of the analysis.
The Pareto-optimal weighting technique fares well when held up against each of these criteria. First, this method has been tested in several peer-reviewed articles (summarized in Table 2), meeting criteria (1) and (2). The top section of Table 2 reviews the peer-reviewed articles and presentations that have both introduced and tested the validity of the technique. As an example, De Corte et al. (2008) demonstrated how the Pareto-optimal method could simultaneously provide improved job performance and diversity outcomes as compared to both unit weighting and regression weighting (methods that have been widely accepted within the court system; see Black Law Enforcement Officers Assoc. v. Akron, 1986; Kaye, 2001; Reynolds v. Ala. DOT, 2003).
Table 2. Summary of papers and presentations on the use of Pareto-optimization in personnel selection
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_tab2.png?pub-status=live)
With regard to criterion (3) involving accuracy, when carrying out Pareto-optimization, confidence intervals can be created around an expected outcome (e.g., expected job performance criterion validity given a particular AI ratio) in order to take into consideration imprecision with regard to the data and measures.Footnote 17 Further, Song et al. (2017) examined the cross-sample validity of Pareto-optimal weights. Cross-sample validity refers to the extent to which an optimal weighting solution for one calibration sample (e.g., the incumbent sample on which a selection system’s weighting scheme was established) similarly predicts the outcomes of interest for a second validation sample (e.g., the applicant sample on which a selection weighting scheme could be used). Because optimization methods such as Pareto-optimization aim to maximize their model fit in the calibration sample, the resulting weights tend to overfit, leading to smaller predictive validity in the validation sample as compared to the calibration sample (i.e., validity shrinkage). Song and colleagues demonstrated that when the calibration sample is sufficiently large (e.g., 100 participants for the set of cognitive selection predictors examined in their study), Pareto-optimization outperforms unit weighting in terms of maximizing both performance validity and diversity (represented by the adverse impact ratio), even after accounting for the possible loss in predictive validity in the validation sample.
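The calibration/validation logic behind this shrinkage argument can be checked with a simple holdout analysis, sketched in R below. The data are hypothetical, and regression weights are used as a stand-in for weights supplied by a Pareto-optimization tool; the point is only to show how validity in the calibration sample is compared with validity in a separate validation sample.

```r
# Hypothetical incumbent data with three predictors and a performance criterion
set.seed(4)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$performance <- .4 * dat$x1 + .3 * dat$x2 + .1 * dat$x3 + rnorm(n)

# Split into calibration and validation halves
idx   <- sample(seq_len(n), size = n / 2)
calib <- dat[idx, ]
valid <- dat[-idx, ]

# Weights derived on the calibration sample only (regression weights here;
# Pareto-optimal weights would likewise be derived on the calibration sample)
w <- coef(lm(performance ~ x1 + x2 + x3, data = calib))[-1]
composite <- function(d) drop(as.matrix(d[, names(w)]) %*% w)

# Validity in calibration vs. validation samples; the drop reflects shrinkage
cor(composite(calib), calib$performance)
cor(composite(valid), valid$performance)
```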
With regard to criterion (4), Pareto-optimal weighting appears to have garnered an acceptable level of support within the scientific community. Table 2 demonstrates that this method has been extensively discussed in academic writing within the applied psychological testing community. Not only has the method been explicitly examined in several empirical peer-reviewed articles (e.g., De Corte et al., 2007, 2008; Druart & De Corte, 2012a, 2012b; Wee et al., 2014) and commentaries on these articles (e.g., Kehoe, 2008; Potosky et al., 2008; Sackett & Lievens, 2008), but many reviews and book chapters (e.g., Aiken & Hanges, 2017; Oswald et al., 2014; Russell et al., 2014; Sackett et al., 2010) have also discussed the method as a promising development within personnel selection contexts.
Finally, criterion (5) requires that the method can be judged by the courts based on its inherent characteristics (as opposed to the consequences of implementing the method in a particular context). Although we are unaware of any examples of the method being challenged and therefore judged by the courts, we believe that the summaries and arguments provided in this article along with the expansive work listed in Table 2 demonstrate the inherent viability of this method within selection systems (and thus the opportunity to judge its characteristics without bias).
Overall, despite the absence of case law on Pareto-optimal weighting methods, our examination of the extent to which the method meets the Daubert standards demonstrates that the method has a reasonable likelihood of being considered both legally and scientifically valid.
Practical recommendations
We close with the presentation of Table 3, which provides some additional practical tips to keep in mind when applying Pareto-optimization. This includes carefully considering how job performance is operationalized, the samples used for calibration and pilot testing, the timing of data collection, and how selection decisions will be made once optimized composite scores are computed. It also contains a reminder to users of the benefit of collecting content validity evidence for all predictors, and to not get so caught up in the “metrics” that careful, qualitative consideration of item-level content is neglected. Finally, we recommend systematic and continuous legal auditing, in ways that protect the organization from legal scrutiny as it seeks to proactively improve the performance and diversity of its workforce. Here we highlight some of the more complex issues inherent to implementing Pareto-optimal weights within personnel selection.
Table 3. A checklist of the key decisions for adopting Pareto-optimization for personnel selection
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_tab3.png?pub-status=live)
* If a pilot trial is not practically feasible, see Song et al. (2017) for computational methods to evaluate the Pareto-optimal predictor weights when applied to the applicant sample.
Individual predictors are still vulnerable to legal challenge
Throughout this article, we have provided evidence that the Pareto-optimization method shows promise for reducing adverse impact, maintaining job performance prediction, and withstanding legal scrutiny. That being said, organizations may still be vulnerable to adverse impact claims made about individual predictors within their selection systems. Specifically, in Connecticut v. Teal (1982), the Supreme Court ruled that a lack of adverse impact in the total selection system (i.e., the bottom-line defense) does not preclude plaintiffs from successfully showing disparate impact against individual components of the selection system. Therefore, each selection predictor utilized within a selection system must demonstrate sufficient, individual evidence of validity (i.e., job relatedness as specified in the Uniform Guidelines [EEOC, 1978]), and alternative predictors should not exist that measure the same knowledge, skills, abilities, and other characteristics (KSAOs) with similar effectiveness but less adverse impact (cf. Griggs v. Duke Power Co., 1971).
Choice of performance criterion matters
As we have highlighted throughout this article, there are a number of issues to consider pertinent to the measurement of job performance. Although our case example employed a unidimensional measure of task performance as the performance criterion, a wider conceptualization of job performance allows for a wider array of predictors, including those that are more resistant to adverse impact (e.g., personality measures). This is relevant for Pareto-optimization (as well as other predictor-weighting methods), in that a wider “criterion space” provides more latitude for estimating predictor weighting schemes that maximize the prediction of performance and minimize adverse impact. Research on the use of multiple performance criteria within Pareto-optimization is still developing. This work has combined multiple performance criteria (e.g., task performance, contextual performance) into a weighted composite using prespecified weights. For example, De Corte et al. (2007) created a weighted sum of standardized task performance and contextual performance scores, with task performance weighted three times that of contextual performance. The authors chose this 3:1 (task to contextual) weighting based on past research and subsequently examined the performance outcome of the Pareto-optimal solutions as the mean weighted-performance score obtained by the selected candidates. They found that there was still a relatively steep trade-off between performance and diversity criteria, even when contextual performance was included as a part of overall performance. Decisions on the types of performance to include and the computation of composite performance measures should be based on the specific organization and job context. Importantly, all job performance measures should be preceded by and based on job analysis (EEOC, 1978; Gatewood et al., 2015).
A second set of performance-related issues to consider pertains to the psychometric quality of the performance measures used to calculate the criterion-related validities that are input into the Pareto-optimization algorithm. Because performance data often take the form of supervisor ratings, the reliability of the resulting performance scores often suffers due to insufficient rater training and the lack of clear rubrics with which to make ratings. Thus, in order to obtain accurate validity estimates as input into the Pareto-optimization algorithm, improvements might need to be made to the performance measurement system (again, based on job analysis). Further, because these data are collected on incumbents rather than applicants, performance (as well as predictor) scores are generally range restricted. Thus, the predictor intercorrelations and the criterion-related validity of each predictor need to be corrected for range restriction (De Corte, 2014; De Corte et al., 2011; Roth et al., 2011).
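As one illustration of such a correction, the R sketch below implements the familiar univariate (Thorndike Case II) correction for direct range restriction. The input values are hypothetical, and in applications with several simultaneously restricted predictors, multivariate corrections are generally more appropriate.

```r
# Univariate (Thorndike Case II) correction for direct range restriction:
# r_c = (r * u) / sqrt(1 + r^2 * (u^2 - 1)), with u = unrestricted SD / restricted SD
correct_rr <- function(r, sd_unrestricted, sd_restricted) {
  u <- sd_unrestricted / sd_restricted
  (r * u) / sqrt(1 + r^2 * (u^2 - 1))
}

# Hypothetical example: an observed incumbent validity of .25, with an applicant
# (unrestricted) SD of 1.0 and an incumbent (restricted) SD of 0.7
correct_rr(r = .25, sd_unrestricted = 1.0, sd_restricted = 0.7)   # approx. .35
```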
Finally, we should note that although our case example used job performance criterion validity as input into the Pareto algorithm, there are other computations that could be input here instead. One alternative is expected average job performance (also referred to as the “expected average criterion score” and the “average job performance of the selected employees”), which is the average of the expected job performance score of the selected applicants (see De Corte, 2014; De Corte et al., 2007, in press; Druart & De Corte, 2012a, 2012b). Another alternative is selection utility, which refers to the job performance gain of Pareto-optimization over random selection, after taking into account costs (for details, see the Appendix and De Corte et al., 2007).
Multiple hurdle (and other) selection strategies
Pareto-optimization is most often discussed as a weighting technique to be applied within a compensatory, top-down selection process. However, it is also applicable to multiple hurdle or multistage selection designs (see De Corte et al., 2011). The analytical procedures for multistage Pareto-optimization are generally similar to the single-stage scenario, with the exception of the analytical expression of selection outcomes (e.g., job performance and diversity), which considers the noncompensatory characteristics of the predictors (see De Corte et al., 2006, for details). De Corte (2014) provides a computational tool, COPOSS, to implement Pareto-optimization within a multistage scenario, while De Corte et al. (2011) provide a software program, SSOV, as a decision aid for designing Pareto-optimal selection systems (including the multistage selection scenario). Both tools, as well as their tutorials, are available for download at https://users.ugent.be/~wdecorte/software.html. Compared to the single-stage setting, multistage optimization is more cost-effective but will likely yield less optimal validity/diversity trade-off solutions. This is because multistage optimization usually can use only part of the predictor information at a time, whereas single-stage optimization makes use of all available predictors simultaneously. However, given the complexity of selection systems (including predictor characteristics, applicant pool distribution, and contextual constraints), the superiority of a single- vs. multistage strategy depends on the context.
Judgment calls still required, especially when multiple subgroup comparisons come into play
As we have noted, although Pareto-optimization provides a systematic and quantitative approach to selecting predictor weights that seek to maximize job performance prediction and minimize adverse impact, the method still requires judgment calls at multiple stages of its implementation. The organization must carefully consider not only how to measure performance and which predictors to use, but also how validity and diversity will be prioritized within the selection system.
Further, it must decide which subgroup contrasts are relevant for consideration, and it must face the reality that different (perhaps opposing) weighting solutions may be “optimal” for different contrasts (e.g., the solution that maximizes validity and minimizes Black/White subgroup differences may differ from that which minimizes subgroup differences between men and women). To address this issue, Song and Tang (2020) developed an updated Pareto-optimal technique to simultaneously consider multiple subgroup comparisons, which includes multiple subgroups within the same demographic category (e.g., race), as well as multiple demographic categories (e.g., race and gender). The magnitude of the validity-diversity tradeoff in multi-subgroup optimization is influenced by (1) the subgroup mean differences between the majority group and minority groups and (2) the subgroup mean differences among the minority groups. Pareto-optimal weighting for multiple subgroups is currently being evaluated via Monte Carlo simulation across various selection scenarios (e.g., proportion of minority applicants, selection ratio, predictor sets). This research will provide guidance on how likely “opposing” Pareto-optimal solutions among different subgroup contrasts actually are, and ways in which Pareto-optimal weighting schemes that consider multiple comparisons simultaneously could be obtained.
Conclusion
In this practice forum, we sought to highlight the tensions organizations face when seeking to create and implement effective, fair, and legally compliant personnel selection systems. We traced the history of innovations presented by I-O psychologists in reconciling the so-called diversity-validity tradeoff and presented Pareto-optimization as a potential way forward to systematically optimize performance and diversity criteria. We then provided a primer on the method at varying levels of sophistication and presented user-friendly tools for implementing the technique in practice. Finally, we scrutinized the method from an EEO law perspective and, in doing so, offered potential defenses that might be raised to justify the method if challenged.
It is important to note that discussion of Pareto-optimization is generally limited to academic outlets. Our search revealed no case analyses describing the method as used in practice, nor legal analysis such as that provided here. We hope this article will encourage practitioners to submit examples to this forum to better highlight the benefits and challenges associated with the application of Pareto-optimization, and for those in the legal arena to weigh in with their views on the legal appropriateness of the method.
APPENDIX. Pareto optimization: Technical details
Pareto-optimal weighting for multi-objective optimization (to create Pareto curves; e.g., Figure 1) can be implemented in a variety of ways, one of which is labeled normal boundary intersection (NBI; see Das & Dennis [1998] for a foundational introduction to this method). The aim of the algorithm is to find evenly spaced sets of solutions on the Pareto curve that optimize multiple criteria (e.g., diversity and job performance) under certain constraints (e.g., the Pareto-optimal predictor weights add up to 1).
An example of NBI is shown in Figure A1. The horizontal axis shows the demographic spread of new hires (represented by adverse impact ratio), whereas the vertical axis shows their expected job performance (represented by the job performance validity of the predictor composite). The blue oval (which includes both the solid and dotted blue border) represents the solution space of all possible solutions under a certain constraint. For example, to find a unique solution, one constraint could be that all predictor weights add up to 1.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200727094614541-0738:S175494262000019X:S175494262000019X_fig6.png?pub-status=live)
Figure A1. An illustration of the normal boundary intersection (NBI) algorithm.
There are three main steps in the NBI algorithm; a simplified R sketch of the resulting trade-off computation follows the step descriptions.
Step 1: Find the endpoints (e.g., Points A and C in Figure 1) and corresponding predictor weights. Specifically, the SLSQP algorithmFootnote 18 is used to find one set of predictor weights (e.g., Point A) where only job performance is maximized, and another set of predictor weights (e.g., Point C) where only diversity (represented using the adverse impact ratio) is maximized. The adverse impact ratio and job performance validity of the two endpoints are also estimated.
Step 2: Linear interpolation of evenly-spaced solutions between the endpoints. The Pareto points between the two endpoints (found in Step 1) can be estimated by first creating a line that connects the two endpoints (i.e., the orange line in Figure A1) and specifying evenly spaced points along this line. The number of points along the line (i.e., including the two end points; yellow dots in Figure A1) equals the user-specified number of Pareto solutions (e.g., 21 evenly spaced solutions).
Step 3: Projection of evenly spaced solutions between the endpoints. At each Pareto point, the SLSQP algorithm is again used to find the optimal set of weights. Specifically, the algorithm will project in a perpendicular direction from the initial criterion line (i.e., yellow arrows in Figure A1) until it reaches the border of the solution space (i.e., blue oval in Figure A1), finding a Pareto-optimal solution (i.e., a blue dot in Figure A1). This process will iterate through all Pareto points (e.g., 21 points) until the optimal predictor weights for each Pareto point are obtained.
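To make these steps tangible, the R sketch below traces an approximate validity-diversity trade-off curve. It is a simplified scalarization over a grid of emphasis values rather than a faithful NBI/SLSQP implementation; the predictor intercorrelations, validities, subgroup ds, selection ratio, and minority proportion are hypothetical (loosely echoing the illustrative values mentioned for the Shiny-app example), and composite validity and the adverse impact ratio are obtained with standard normal-theory approximations.

```r
# Hypothetical inputs from an incumbent study
Rm   <- matrix(c(1, .30, .10,
                 .30, 1, .20,
                 .10, .20, 1), nrow = 3)   # predictor intercorrelations
v    <- c(.30, .52, .22)                   # criterion-related validities
d    <- c(.40, .72, -.09)                  # Black/White subgroup ds
sr   <- .20                                # selection ratio
prop <- .25                                # proportion of minority applicants

# Composite-level validity and subgroup difference for a weight vector w
comp_validity <- function(w) sum(w * v) / sqrt(drop(t(w) %*% Rm %*% w))
comp_d        <- function(w) sum(w * d) / sqrt(drop(t(w) %*% Rm %*% w))

# Adverse impact ratio under normal-theory assumptions:
# majority composite ~ N(0, 1), minority composite ~ N(-dc, 1)
ai_ratio <- function(w) {
  dc  <- comp_d(w)
  gap <- function(cut) prop * (1 - pnorm(cut + dc)) +
                       (1 - prop) * (1 - pnorm(cut)) - sr
  cut <- uniroot(gap, c(-10, 10))$root        # overall cutoff matching sr
  (1 - pnorm(cut + dc)) / (1 - pnorm(cut))    # minority rate / majority rate
}

# Trace an approximate trade-off curve by varying the emphasis (lambda) placed
# on diversity versus validity (a scalarization, not NBI proper)
to_w <- function(theta) abs(theta) / sum(abs(theta))   # nonnegative, sums to 1
front <- t(sapply(seq(0, 1, length.out = 21), function(lambda) {
  obj <- function(theta) {
    w <- to_w(theta)
    -((1 - lambda) * comp_validity(w) + lambda * ai_ratio(w))
  }
  w <- to_w(optim(c(1, 1, 1), obj)$par)
  c(w1 = w[1], w2 = w[2], w3 = w[3],
    validity = comp_validity(w), AI_ratio = ai_ratio(w))
}))

plot(front[, "AI_ratio"], front[, "validity"], type = "b",
     xlab = "Adverse impact ratio", ylab = "Composite validity")
```

The actual NBI procedure instead anchors the two single-objective endpoints and projects evenly spaced points from the line connecting them, which tends to yield a more evenly distributed set of Pareto solutions than the simple scalarization shown here.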