1. Introduction
For more than half a century, experimental social psychologists have (1) demonstrated the many ways people are treated differently because of their race, age, sex, and other social categories and (2) used these findings to explain why group disparities exist in the real world. From racial disparities in fatal police shootings and school discipline, to sex disparities in science, technology, engineering, and mathematics (STEM) engagement and corporate leadership, social psychologists have overwhelmingly concluded that the stereotypes in the heads of decision-makers play a substantial role in causing group disparities, whether or not people agree with or even consciously acknowledge such stereotypes (Devine, 1989; Greenwald & Banaji, 1995). The logic among social psychologists has been the following: If we can show in an experiment that people are treated differently based on their outward appearances when we present them as equal in all other respects, then in the real world such differential treatment exists and is a major cause of why outcomes differ across groups (see, e.g., Greenwald & Krieger, 2006; Kang & Banaji, 2006). As just one example illustrating the way in which experimental demonstrations of decision-maker bias have been tied to disparate outcomes, Moss-Racusin, Dovidio, Brescoll, Graham, and Handelsman (2012) state clearly that their research "informs the debate on possible causes of the gender disparity in academic science by providing unique experimental evidence that science faculty of both genders exhibit bias against female undergraduates" (p. 16477).
The purpose of this article is to show that standard practice in experimental social psychology is fundamentally flawed, so much so that findings from these studies cannot be used to draw any substantive conclusions about the nature of real-world disparities – despite the ubiquitous practice of drawing exactly these conclusions. Three problems inherent in the current approach render it impotent for this purpose. First, critical pieces of information used by actual decision-makers are absent in experimental studies (missing information flaw). Second, effects of biased decision-making are rarely understood in the context of other important influences on group outcomes, such as the behaviors of targets themselves (missing forces flaw). Third, there is no systematic study of whether the contingencies required to produce experimental bias are present in actual decisions (missing contingencies flaw). These three flaws can lead researchers to vastly overestimate the role of stereotyping as a causal process, producing experimental stereotyping effects even when stereotyping plays no role in real decisions or in causing group disparities. Although current experimental studies can provide important information about stereotyping processes per se, they cannot and do not provide information about the nature of group disparities. That is, the contribution of stereotyping and bias research is misunderstood and misused.
I first describe the standard "research cycle" of stereotyping and bias studies in experimental social psychology. I then lay out the flaws inherent in this approach at the abstract level and apply the analysis to three research topics in social psychology: police officers' decision to use deadly force, implicit bias, and school disciplinary policy. Next, I describe what experimental studies of bias can tell us and how researchers generally misinterpret the nature of such studies, and I speculate on the ways this research tradition has skewed the understanding of the human mind that has been exported from our discipline to the culture at large. I then connect this critique to related critiques within psychology and to similar problems that have arisen in other fields. In the final section, I chart out an alternative path that might be more effective for studying group disparities.
Throughout this paper, I focus on the familiar social psychological demonstrations of categorical bias: experiments in which participants respond differently to targets from different social categories. Although I focus on studies that posit stereotype activation as the culprit for such differential responding (as this is a long-standing way of understanding such effects, e.g., Duncan, 1976), nearly all of the current analysis is applicable to bias caused by other sources. I focus on experimental social psychology because this area has had a considerable impact on the discussion of group disparities, but this is not to say that similar critiques cannot be leveled against other areas and disciplines. The current critique is also distinct from related critiques of mundane realism or external validity, which I discuss in more detail later. Instead, this critique is about how social psychologists are fundamentally misguided in how they approach the study of group disparities, which distorts the nature of the decision under study and leads to incorrect conclusions about the conditions under which decisions will be more or less biased. Although psychology has no shortage of problems to be addressed (e.g., Srivastava, 2016), I limit my discussion in this paper to the misuse of experimental social psychology in explaining group disparities.
Before getting into the details of the argument, it is important to provide two cautionary notes. First, I am addressing the question of whether decision-maker bias produces group disparities in the immediate outcomes of that decision (and whether experimental social psychology can inform this process). This is seen in the example of a police officer's decision to shoot and racial disparities in being shot by police, or of a search committee's hiring decision and sex disparities in STEM employment. The current analysis does not address or dispute the possibility that decision-maker bias may enter earlier in the chain of events that leads to the decision in question. For example, police officers may show bias in the decision to engage in discretionary stopping of Black citizens or high school teachers may show bias in discouraging female students from pursuing STEM careers.
Second, the current analysis relies substantially on the fact that the distributions of behaviors, personality, character, preferences, abilities, and so on are not equal across different demographic groups (and that this fact is not appropriately considered by experimental social psychologists). I make no claims about the origin of these group differences in terms of the degree to which they are caused by individual decision-makers, "structural" forces beyond individual actions, genetic factors, incentive structures because of government policies, and so on. The point here is not to claim that group differences are inherent to people (although they might very well be) or that there are no broader social influences on human behavior. There may be systematic bias that produces group differences in the distributions of important characteristics. For the purposes of the present argument, the distal causes of group differences are irrelevant because these causes are separable from the question of whether group disparities are because of biased decision-making for specific outcomes. For example, why men and women differ in their interest in things versus people is a separate question from whether faculty search committees are biased against women in hiring for STEM positions.
On both these points, there is the possibility that bias “earlier” in the causal chain eventually leads to disparities on a later outcome, even while decision-makers show no bias on that later outcome. Of course, claims of “earlier” bias also require evidence, and if the available evidence is merely more of the same demonstrations from experimental social psychology, then these studies suffer from the same flaws described here and are, therefore, not convincing evidence.
2. The standard approach
The standard research cycle begins with an observation that groups differ in their real-world outcomes and the desire to understand the causes of such disparities. Simply, we see that members of some groups get better or worse outcomes than members of other groups and we want to know why. It is, perhaps, natural that social psychologists would start with the assumption that stereotypes – categorical information stored in a decision-maker's mind – play a meaningful role in producing these group differences. To gather evidence in support of this possibility, researchers design experiments in which participants make judgments of targets who vary only with respect to the social categories to which they belong. For example, to study the role of race in police officers' decision to use deadly force, researchers show participants pictures of Black and White men who do not vary in how they are presented in any way other than their race (as in the First-Person Shooter Task [FPST]; Correll, Park, Judd, & Wittenbrink, 2002). If participants shoot unarmed Blacks more than unarmed Whites, one can be sure that the race of the target played a causal role in participants' decisions because the experimenter has presented the groups in an identical way on all other dimensions (such as their posture, the frequency of holding a gun, facial expressions, etc.). Making all groups exactly equal in how they are presented in an experiment allows the researcher to conclude that the decision-maker (and not differences in the behavior of targets themselves) is responsible for biased responses directed at targets from different groups. It would be difficult to overstate the ubiquity of this approach in experimental social psychology; it is the paragon of systematic design and is understood as the method for studying the biasing effects of categories.
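To make the logic of the design concrete, the bias score from such a task can be sketched in a few lines. The trial counts below are invented purely for illustration, and this is only one common way of scoring such tasks (comparing rates of shooting unarmed targets across race), not a description of any particular study's analysis:

```python
# Hypothetical trial counts from an FPST-style task: only unarmed-target
# trials matter for this bias score, because shooting an unarmed target
# is an error that cannot be justified by the target's armed status.
trials = {
    ("black", "unarmed"): {"shoot": 12, "dont_shoot": 38},
    ("white", "unarmed"): {"shoot": 6,  "dont_shoot": 44},
}

def shoot_rate(race):
    """Proportion of unarmed targets of a given race that were shot."""
    cell = trials[(race, "unarmed")]
    return cell["shoot"] / (cell["shoot"] + cell["dont_shoot"])

# Because all other features are equated by design, any nonzero difference
# is attributable to target race.
bias = shoot_rate("black") - shoot_rate("white")
```

The inference licensed by this arithmetic concerns the experiment itself; the article's point is that researchers then extend it to real-world shootings, which is where the trouble begins.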
Having established that social categories impact participants' decisions in an experiment, researchers return to the original real-world disparity and conclude that the same processes observed in the lab explain these disparities as well (see, e.g., Moss-Racusin et al., 2012, for a prototypical example). That is, if stereotypes cause people to treat targets differently when there are no real behavioral differences in experimental stimuli, then this same biased treatment on the part of decision-makers is at play in the real world and can account for a meaningful amount of the disparities we see across groups. Researchers then complete the circle by using their experimental findings as evidence for designing interventions intended to reduce the disparity of interest.
3. Critical flaws of the standard paradigm
The standard experimental approach in social psychology contains three fundamental flaws that prevent the findings of experimental studies from being directly applied to the study of group disparities: the flaw of missing information, the flaw of missing forces, and the flaw of missing contingencies. The first flaw is that the decision components used by real-world decision-makers are absent in our experiments; in other words, information that is available to and used by actual decision-makers is removed from our experimental studies. The second flaw is that other influences on group outcomes – such as actual behavioral differences across groups – are not integrated into our designs, analyses, or conclusions. The third flaw is the lack of systematic study of whether the contingencies required to produce experimental bias are present in real decisions; along with this is the understudied question of whether the experimental landscape changes the motivation and ability of decision-makers. These flaws are fatal in the sense that any one of them can reveal experimental stereotyping effects even when no such effects exist in real decisions. I first describe these flaws at the general level and then show how they are evident in three different research areas in experimental social psychology. Descriptions and examples of these flaws are summarized in Table 1.
Table 1. Three flaws inherent to experimental social psychology studies of bias
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220512135301077-0069:S0140525X21000017:S0140525X21000017_tab1.png?pub-status=live)
3.1 The missing information flaw
For reasons good and bad, experimental studies of categorical bias in social psychology are massive simplifications of real-world decision landscapes. The problem is that this simplification removes information that may play a strong or even critical role in real decisions. When this happens, three distortions may occur. First, the missing information may have more powerful effects than social category information and may overwhelm any categorical influence in real decisions; when such forces are removed, all that remains is the categorical influence, which is then revealed in the experiment. Second, removing these variables may leave experimental participants with no useful information to render a judgment other than the target's social category; although categorical information may be used minimally or not at all in real decisions, experimental participants now use it because of the absence of any other kind of diagnostic information. Third, the presence or absence of such information may change the underlying decision process itself, leaving researchers with a distorted understanding of the cognitive dynamics at play in real decisions. In all cases, researchers are at risk of incorrectly concluding that the reliable and replicable effects of categories observed in their experiments are present in the real world and have the same effects on outcomes. In short, social psychologists may fundamentally misunderstand the nature of a decision if their experimental methods strip away critical features present in real decision-makers' environments.
This first flaw reflects a fallacy in the justification for using experimental studies, which is to presume that any information that can affect outcomes in an experimental setting does have the same effect in the real world. Said differently, the fallacy is the unstated belief that adding additional information to the decision landscape will not change the nature or magnitude of an experimental effect and the missing information can therefore safely be ignored.
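This fallacy can be illustrated with a toy signal-integration model. Suppose a decision-maker combines a category-based prior with an individuating cue in a Bayesian fashion; the Bayesian form and all numbers here are assumptions chosen only for illustration, not a claim about actual cognition:

```python
# Toy model: a judge's posterior belief in "threat" given a category prior
# and, optionally, a diagnostic individuating cue (likelihood ratio).

def posterior_threat(prior, cue_likelihood_ratio=None):
    """Posterior probability of 'threat' after optionally updating on a cue."""
    odds = prior / (1 - prior)
    if cue_likelihood_ratio is not None:
        odds *= cue_likelihood_ratio
    return odds / (1 + odds)

# Category priors differ modestly across two groups (invented numbers).
p_a, p_b = 0.55, 0.45

# Experiment-like setting: the category is the ONLY information available,
# so judgments track the priors directly.
gap_no_info = posterior_threat(p_a) - posterior_threat(p_b)

# Real-decision-like setting: a strongly diagnostic cue (LR = 9) is present,
# and the same priors now contribute far less to the final judgment.
gap_with_cue = posterior_threat(p_a, 9) - posterior_threat(p_b, 9)
```

In this sketch the category gap in judgments is 0.10 when the category is all participants have to go on, but shrinks to under 0.04 once a diagnostic cue is restored, even though the decision-maker's prior never changed. Adding back missing information can thus change the apparent magnitude of categorical bias without any change in the decision-maker.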
3.2 The missing forces flaw
The second flaw of the current experimental approach is that researchers do not interpret experimental effects in light of the other causal forces which impact group outcomes in the real world. Primary among these forces are the behavior of the targets themselves and the cognitive, motivational, and behavioral differences that exist across groups. This flaw is important because if there are strong influences on group outcomes besides biased treatment, then it follows that experimental participants may show reliable decision-maker bias – even very strong bias – while such bias exerts no discernible effect on real outcomes. If true, social psychologists may be perpetually disappointed in the state of the world because their recommended interventions of removing decision-maker bias will not yield equal outcomes or even reduce group disparities. Indeed, depending on the strength of group differences, social psychologists may be diverting resources away from effective interventions and toward those that will have little effect on reducing disparities.
This flaw reflects the fallacy that researchers believe they can safely ignore the degree to which the stimuli used in experimental studies match the distributional properties of the real-world groups they represent. One reason for this disregard may be the belief that all groups have roughly identical distributions on important underlying causal characteristics. Yet this assumption is incorrect, as groups differ (and often markedly so) on important personality, motivational, and cognitive dimensions – in other words, on the interest and ability factors that relate to nearly all outcomes (see, e.g., ACT, 2017; Andreoni et al., 2019; Beaver et al., 2013; Benbow & Stanley, 1980; Benbow, Lubinski, Shea, & Eftekhari-Sanjani, 2000; Byrnes, Miller, & Schafer, 1999; Ceci & Williams, 2010; Cesario, Johnson, & Terrill, 2019; Diekman, Steinberg, Brown, Belanger, & Clark, 2017; Gottfredson, 1998; Halpern et al., 2007; Hsia, 1988; Hsin & Xie, 2014; Jussim, Cain, Crawford, Harber, & Cohen, 2009; Jussim, Crawford, Anglin, Chambers, et al., 2015a; Jussim, Crawford, & Rubinstein, 2015c; Lee & Ashton, 2020; Lippa, 1998; Lu, Nisbett, & Morris, 2020; Lubinski & Benbow, 1992; Lynn, 2004; Lynn & Irwing, 2004; McLanahan & Percheski, 2008; Roth, Bevier, Bobko, Switzer, & Tyler, 2001; Sowell, 2005, 2008; Su, Rounds, & Armstrong, 2009; Tregle, Nix, & Alpert, 2019; Wright, Morgan, Coyne, Beaver, & Barnes, 2014). In understanding the role of decision-maker bias in producing disparate outcomes, it is necessary to compare and interpret the size of categorical bias effects with the size of these behavioral differences across groups.
Methodologically, this flaw is guaranteed because target stimuli are presented as equal on all dimensions except for social category membership. Statistically, this flaw is guaranteed because analytic models either do not incorporate information about real-world behavioral differences or, when they do, treat such differences as control variables whose (often very strong) relationship to the outcome of interest is ignored. These decisions shift researchers' attention away from the role that causal forces beyond categorical bias may have on group disparities in the real world – or at least, allow researchers to relegate these forces to a brief mention in the Introduction of their papers. To the extent that groups differ in important ways, and such differences have strong effects on obtained outcomes, the role of perceiver bias and stereotyping will be overstated.
Although the first flaw concerns removing everything but categorical information from the experiment, this second flaw concerns failing to interpret those experimental categorical effects in light of other known forces on group outcomes. This failure can lead to overemphasizing the role of perceiver bias, as revealed by experimental methods and “statistically significant” model coefficients (while ignoring variance explained or effect sizes).
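The arithmetic of this second flaw can be made explicit with a toy model. All numbers below are invented purely for illustration and represent no real group or outcome; only the relative magnitudes matter:

```python
# Toy model: P(outcome) = P(behavior) * P(action | behavior), plus a small
# additive decision-maker bias applied against one group.

def outcome_rate(behavior_rate, action_given_behavior, bias=0.0):
    """Probability that a group member receives the outcome."""
    return behavior_rate * action_given_behavior + bias

# Suppose the behavior that triggers the outcome is twice as common in
# group A (0.20 vs. 0.10), and decision-makers also apply a small bias
# (+0.02) against group A.
rate_a_biased = outcome_rate(0.20, 0.50, bias=0.02)   # 0.12
rate_b = outcome_rate(0.10, 0.50)                     # 0.05

# Disparity with decision-maker bias present...
disparity_with_bias = rate_a_biased - rate_b

# ...and with that bias completely eliminated.
rate_a_unbiased = outcome_rate(0.20, 0.50)            # 0.10
disparity_without_bias = rate_a_unbiased - rate_b
```

Here a real, nonzero decision-maker bias exists, and an experiment would detect it, yet removing it entirely shrinks the hypothetical disparity only from 0.07 to 0.05, because the behavioral base-rate difference carries most of the gap. Demonstrating the bias by itself says nothing about its relative contribution.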
3.3 The missing contingencies flaw
The third critical flaw is the failure to study whether the precise contingencies needed to produce categorical bias in our experiments are realized in real-world decision situations. Whether the conditions required for experimental demonstrations of bias are present in the real world obviously informs the degree to which such demonstrations can explain group disparities. However, it is also important because such contingencies relate to the motivation and ability of decision-makers in experimental tasks, and it is known that motivation and ability are critical variables for categories to have biasing effects on judgments.
This flaw reflects another fallacy present in experimental studies of bias, which is to ignore the contrived nature of experiments and the total control experimenters exercise over all aspects of the participant's experience. Indeed, contingencies are "missing" in not one but two ways. First, social psychologists do not explore whether the conditions needed for experimental bias are present in the real world. But second, discussion of these conditions is missing when social psychologists advocate for their research outside of academic psychology, where "contingent" and "conditional" bias now becomes "widespread" and "pervasive" bias (e.g., Greenwald & Krieger, 2006; Kang & Banaji, 2006).
Specific contingencies are required for category information to bias a person's decisions; stereotype effects do not occur uniformly for all people or under all conditions. For categories to bias decisions, clear diagnostic or individuating information must be absent and perceivers must lack the ability or motivation to control the biasing influence of categories. When decision-makers have adequate ability and motivation to control the effects of categorical information, or when information is unambiguous (as with strong individuating information or applicability of a single concept; Higgins, 1996), categories have little to no biasing effect on judgments (e.g., Darley & Gross, 1983; Dovidio & Gaertner, 2000; Koch, D'Mello, & Sackett, 2015; Krueger & Rothbart, 1988; Locksley, Borgida, Brekke, & Hepburn, 1980; see Jussim, 2012b; Jussim et al., 2015c; Kunda & Spencer, 2003). As stated unequivocally in a summary by Kunda and Thagard (1996) over two decades ago, "It is clear … that the target's behavior has been shown to undermine the effects of stereotypes based on all the major social categories" (p. 292).
Given that some set of contingencies is required, researchers must outline the precise contingencies needed to give rise to bias in the lab and detail the degree to which those experimental contingencies are present in real decisions. Assuming researchers can, in fact, show that these necessary contingencies are reproduced with regularity in the real world, researchers are also responsible for keeping these contingencies front and center when discussing their work in applied contexts, so as not to overextend claims of bias.
Experimental contingencies are also important because they relate to the roles of ability and motivation in biased decision-making. Ability and motivation have been the twin variables in nearly every major model of impression formation, persuasion, and decision-making in social cognition for decades (Bargh, 1999; Devine, 1989; Fazio, 1990; Fiske & Neuberg, 1990; Petty & Wegener, 1999; Smith & DeCoster, 2000). Given this, it is surprising that social cognitive researchers have not systematically studied whether novice or experimental participants match expert or real-world decision-makers on these two dimensions.
First, regarding ability, it has long been known that experts use different information, and use the same information differently, relative to novices (see, e.g., Klein, 1998; Koch et al., 2015; Levine, Resnick, & Higgins, 1993; Logan, 2018; although not always, see Miller, 2019). In experimental studies of stereotyping, there are no serious attempts to train participants before having them render a judgment. If trained decision-makers use information in the decision landscape differently than do untrained participants, this represents an important difference in ability between the two groups. If experts attend to different decision components or use these components differently than novices, and this difference changes the effect of social categories on the ultimate decision, then the conclusion of widespread bias in real decisions based on findings from undergraduate participants will be unwarranted.
The experimental situation itself can also be understood as impacting participants' ability in important ways. As described above, the simplified experimental methods used in studies of bias remove important sources of information used by real decision-makers. Said differently, researchers change the nature of accuracy and bias in decision-makers when they fail to give participants information that is available in real decisions, information which can allow participants to make decisions in unbiased (or, at least, less biased) ways.
Besides the ability differences between expert decision-makers and naive experimental participants, there are surely important motivational differences as well. Some research has tried to increase participants' motivation to provide unbiased decisions by rewarding accurate decisions or increasing personal relevance, but whether such manipulations produce motivations similar to those found outside experimental contexts is unknown. And importantly, experimental participants simply do not bear the costs of their decisions in ways that are required of many real-world decision-makers, a fact which can change the link between intentions and behavior (e.g., Sowell, 2008; Tetlock, 1985). For naive participants making imaginary decisions about hypothetical targets, there is no effect on their lives once the experiment ends.
3.4 Summary
The three critical flaws of the experimental approach to the study of bias and group disparities can be summarized as follows. If the information used by actual decision-makers in real-world decision landscapes is absent in experimental studies of these decisions, one's understanding of the decision under study can be dramatically skewed. Merely demonstrating bias conveys nothing about the strength of that bias relative to other causal forces on group outcomes. Moreover, there is a failure to specify the required contingencies for experimental demonstrations of bias and explore whether such contingencies are present in real decisions. Finally, if actual decision-makers use information differently or have different motivations and abilities than experimental decision-makers, there is no guarantee that bias will be observed outside experimental contexts. For these reasons, claims of ubiquitous bias among real-world decision-makers may be overstated.
4. Experimental studies of bias: Three topics
Having identified the problems inherent to experimental studies of bias at the general level, I now turn to demonstrating how these problems appear in practice. I chose the three topics discussed next because they cover a range of characteristics. Shooter bias is a narrow topic with nearly two decades of research behind it and represents a prototypical social psychological research program. Implicit bias is a much broader topic but one that has had a major effect on the public's understanding of group disparities. School disciplinary policies are a relatively new topic, but an important one with broad interest beyond the discipline of psychology.
4.1 Shooter bias
For nearly two decades, researchers have studied the question of racial bias in police officers' decisions to use deadly force. Without question, the most common experimental task used is the FPST, in which participants are shown pictures of armed and unarmed Black or White men and asked to press buttons labeled "shoot" and "don't shoot" (Correll et al., in press; see Cesario & Carrillo, in press, for a summary). How does this research fare with respect to the three fundamental flaws of experimental social psychology? (Fig. 1).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220512135301077-0069:S0140525X21000017:S0140525X21000017_fig1.png?pub-status=live)
Figure 1. Example trial of the First-Person Shooter Task, the most common experimental method for understanding police officers' decisions to shoot.
4.1.1 Shooter bias: The missing information flaw
With respect to the first flaw, every relevant piece of information used by police officers in the decision to shoot has been removed from the standard experimental task, except for the single variable of whether or not targets are holding guns (an effect that overwhelms all other effects in both real and experimental decisions). Although a small number of exceptions exist and are discussed below, this has been true of virtually all studies using the FPST (see Cesario & Carrillo, in press). These missing variables include: dispatch information about the citizen and why the officer has been called to the scene, neighborhood information, past encounters with the citizen, how the interaction has unfolded leading up to the decision point (e.g., has the citizen been compliant thus far?), the physical movements of the citizen at the moment of the decision point, the goal of the officer at the scene, whether other officers are present, whether non-lethal tactics have already been used, and so on.
Officers report that all these factors matter, and indeed officers are trained to attend to these factors and integrate them into their dynamic, continuously updating decision to use deadly force as the interaction with the citizen unfolds. Of course, whether and the extent to which any of these pieces of information actually affect officers' decisions are empirical questions. Yet by not including these features, researchers simply have no idea whether their experimental methods are adequately capturing officers' decision processes. Researchers may be fundamentally misunderstanding the underlying cognitive decision dynamics if factors that impact those dynamics in real decision-makers have no possibility of impacting experimental participants, simply because researchers have failed to include such factors in their studies. Thus, we can ask, what happens to racial bias – at both the behavioral and the cognitive process levels – when such information is introduced into the experiment?
As one example of how the conclusions from experimental studies can drastically change if we introduce information used by officers in real decisions, Johnson, Cesario, and Pleskac (2018) conducted a series of studies examining the role of dispatch information in the decision to shoot. Participants completed a standard FPST, but with an important modification: On some trials, participants were given dispatch information at the start of each trial that contained the race of the target, whether the target had a weapon (correct 75% of the time), or both pieces of information. As shown in Figure 2, although untrained undergraduates showed the standard race bias effect when no dispatch information was given, dispatch information of any type eliminated race bias in the decision to shoot. Thus, a single change to the standard, simplified experimental task to include the most important and relevant information that officers have in real shootings eliminated the biasing effects of race. This calls into question our ability to draw conclusions about real-world cases of police shootings from simplified experimental paradigms. More generally, it illustrates the importance of ensuring that the decision landscape for participants in experimental laboratory tasks contains those factors used by real-world decision-makers.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220512135301077-0069:S0140525X21000017:S0140525X21000017_fig2.png?pub-status=live)
Figure 2. In the standard First-Person Shooter Task (“No Info”), undergraduate participants showed racial bias in the decision to shoot. When provided with prior dispatch information about target race or presence of a weapon (“Prior Info”), participants showed no evidence of racial bias in the decision to shoot. Black and white bars refer to target race. Modified from Johnson et al. (Reference Johnson, Cesario and Pleskac2018).
This point is consistent with research by Correll and colleagues (Correll, Wittenbrink, Park, Judd, & Goyle, Reference Correll, Wittenbrink, Park, Judd and Goyle2011; but see Pleskac, Cesario, & Johnson, Reference Pleskac, Cesario and Johnson2018), who manipulated the neighborhood background in which targets in the FPST appeared. In nearly all uses of the FPST, targets are presented in neutral, uninformative backgrounds (office buildings, parks, etc.). These researchers manipulated whether targets appeared in neutral backgrounds or dangerous, urban backgrounds. Placing targets in the dangerous backgrounds completely eliminated racial bias in the decision to shoot. To the extent that real-world police shootings occur in dangerous neighborhoods or situations, this seriously calls into question the degree to which our experimental findings inform our understanding of police officer racial bias.
As another attempt to reintroduce those factors present in actual decisions but missing in experimental studies, at least three independent research groups have used some version of an immersive shooting simulator similar to those used for training by law enforcement (Cox, Devine, Plant, & Schwartz, Reference Cox, Devine, Plant and Schwartz2014; James, James, & Vila, Reference James, James and Vila2016; James, Vila, & Daratha, Reference James, Vila and Daratha2013; James, Klinger, & Vila, Reference James, Klinger and Vila2014; Pleskac, Johnson, Cesario, Terrill, & Gagnon, Reference Pleskac, Johnson, Cesario, Terrill and Gagnonunder review). As depicted in the right panel of Figure 3, participants in such studies stand in front of a projection screen and watch life-sized videos recorded from a first-person point of view. These videos are of policing scenarios similar to those encountered by law enforcement (e.g., traffic stops and domestic disturbances). Participants verbally interact with individuals in the videos as they unfold over time, during which participants must decide whether to use deadly force. This response is made using a modified handgun; when the trigger is pulled, cycling of the firearm occurs through a compressed air connection, which provides recoil and initiates the sound of a handgun firing through a set of speakers. Officers routinely report being highly involved with these scenarios and display strong emotional states, attesting to the realism of the method.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220512135301077-0069:S0140525X21000017:S0140525X21000017_fig3.png?pub-status=live)
Figure 3. Left panel: Participant completing the standard laboratory First-Person Shooter Task. Right panel: A participant-officer completing an immersive shooting simulator, with video from officer's perspective superimposed in lower right corner.
Importantly, such a methodological change is not merely about recreating surface-level similarity to the decision to shoot in terms of the participant's experience (e.g., pressing a button vs. holding a gun). This method allows researchers to introduce back into the decision landscape those factors which simplified tasks remove but which officers report as being important.
Cesario and Carrillo (Reference Cesario, Carrillo, Carlston, Johnson and Hugenbergin press) came to two main conclusions in their summary of the research on shooting simulator studies. First, among the studies that manipulated the scenarios to which officers responded, there was strong evidence of the importance of the scenario and the specific actors on officers' decisions – stronger than the effects of suspect race. In Pleskac et al. (Reference Pleskac, Johnson, Cesario, Terrill and Gagnon2019), for example, variance in officers' decisions was primarily explained by the different scenarios in the videos (e.g., serving a warrant for armed robbery vs. failure to pay for child support) and the behavior of the different actors in the videos – features that the standard FPST removes entirely from the decision landscape. The second main conclusion was that studies using shooting simulators do not provide strong evidence of anti-Black bias in officers' decisions. Indeed, in all possible tests of racial bias across such studies, only about 5% showed anti-Black bias in officers' decisions. In contrast, almost 40% of tests showed anti-White bias in officers' decisions.
Although these results are inconsistent with claims from experimental social psychologists regarding the overwhelming importance of racial stereotypes in decisions to shoot, these results are consistent with the many analyses of actual police shootings that have revealed the importance of context and suspect behavior (see, e.g., Cesario et al., Reference Cesario, Johnson and Terrill2019; Fryer, Reference Fryer2016; Fyfe, Reference Fyfe1980; Geller & Karales, Reference Geller and Karales1981; Inn, Wheeler, & Sparling, Reference Inn, Wheeler and Sparling1977; Klinger, Rosenfeld, Isom, & Deckard, Reference Klinger, Rosenfeld, Isom and Deckard2016; Loughlin & Flora, Reference Loughlin and Flora2017; Ma, Graves, & Alvarado, Reference Ma, Graves and Alvarado2019; Mentch, Reference Mentch2020; Ross, Winterhalder, & McElreath, Reference Ross, Winterhalder and McElreath2021; Shjarback & Nix, Reference Shjarback and Nix2020; Tregle et al., Reference Tregle, Nix and Alpert2019; Wheeler, Phillips, Worrall, & Bishopp, Reference Wheeler, Phillips, Worrall and Bishopp2017; Worrall, Bishopp, Zinser, Wheeler, & Phillips, Reference Worrall, Bishopp, Zinser, Wheeler and Phillips2018).
4.1.2 Shooter bias: The missing forces flaw
With respect to the second flaw, it is clear that experimental social psychologists have ignored the contexts of actual deadly force decisions and the multiple influences on group disparities in fatal shootings, including the behavior of citizens themselves and whether such behavior varies across groups. There have been almost no serious attempts to connect experimental research to systematic analyses of fatal police shootings from the criminal justice literature, with nothing more than superficial citations of such research and no substantive input on how studies are designed or how research is conducted. Indeed, nearly a decade passed from the first publication using the FPST before researchers thought to ask about the very basic variable of neighborhood dangerousness (Correll et al., Reference Correll, Wittenbrink, Park, Judd and Goyle2011), and 15 years passed before experimental social psychologists asked about whether different violent crime rates play a role in explaining racial disparities (Cesario et al., Reference Cesario, Johnson and Terrill2019; Scott, Ma, Sadler, & Correll, Reference Scott, Ma, Sadler and Correll2017).
In shooter bias studies, Black and White targets are shown holding guns with the same frequency; in other words, they are presented in equal proportions in those situations for which deadly force is relevant. The logic is that, if experimental participants are more likely to shoot Black targets in the FPST, then this same racial bias in the heads of police officers explains the per capita racial disparity in being shot. Yet for the results to apply, it must be the case that Black and White citizens are present in deadly force situations with equal likelihood in the real world; otherwise, factors such as differential exposure to the police may be sufficient to explain racial disparities.
In contrast to the underlying assumption in experimental studies, there is clear evidence that (1) the context of violent crime is an overwhelming influence on officers' decisions to shoot and (2) violent crime rates differ across racial groups (e.g., Barnes, Jorgensen, Beaver, Boutwell, & Wright, Reference Barnes, Jorgensen, Beaver, Boutwell and Wright2015; Cesario et al., Reference Cesario, Johnson and Terrill2019; Klinger et al., Reference Klinger, Rosenfeld, Isom and Deckard2016; Ma et al., Reference Ma, Graves and Alvarado2019; Miller et al., Reference Miller, Lawrence, Carlson, Hendrie, Randall, Rockett and Spicer2017; Nix, Campbell, Byers, & Alpert, Reference Nix, Campbell, Byers and Alpert2017; Tregle et al., Reference Tregle, Nix and Alpert2019; Wheeler et al., Reference Wheeler, Phillips, Worrall and Bishopp2017; Worrall et al., Reference Worrall, Bishopp, Zinser, Wheeler and Phillips2018). Police officers do not use deadly force equally across all policing situations. The modal police shooting is one in which officers have been called by dispatch to the scene of a possible crime and are confronted with an armed citizen posing a deadly threat to the officer or to other citizens. It is also the case that violent crime rates differ very starkly across racial groups. Indeed, recent research suggests that the different rates of exposure to police through violent crime situations greatly – if not entirely – account for the overall per capita disparities in being fatally shot by police (Cesario et al., Reference Cesario, Johnson and Terrill2019; Fryer, Reference Fryer2016; Mentch, Reference Mentch2020; Ross et al., Reference Ross, Winterhalder and McElreath2021; Tregle et al., Reference Tregle, Nix and Alpert2019).
Once fatal police shootings are understood from this angle, it becomes clear that social psychologists have misunderstood this topic in their experimental approaches. Rather than first studying the nature of police shootings and then building experimental investigations around that understanding, researchers instead first created experimental worlds in which all group members are equal, under the assumption that this matched the actual behavior of groups and that their experimental findings would shed light on the disparate outcomes of those group members.
When it comes to explaining group disparities, researchers clearly prioritize their experimental findings over other possible causal forces on group outcomes. For example, of 18 recently published papers on shooter bias from experimental social psychology, only two raise the possibility that different behaviors of Black and White citizens might play a role in Black citizens' overrepresentation in being shot by the police (a possibility dismissed in one paper with indirect evidence and dismissed in the other paper with reference to a single article). This was true even when authors recognized that behavioral differences might account for other disparities, such as how the greater aggressiveness and criminality of men account for why they are more likely to be shot than women (Plant, Goplen, & Kunstman, Reference Plant, Goplen and Kunstman2011).
An important point concerning “blaming the victim” needs to be raised here, and this applies not only to fatal shootings but to all disparities. It is necessary to keep causal analysis distinct from “blaming the victim,” or in Felson's (Reference Felson1991) terms, to not use a blame analysis framework where a causal analysis framework is needed. Whatever the causal factors that lead an individual to one or another outcome, such factors can be described without the language of blame and responsibility. To say that a proximate cause of police shootings is involvement in crime is not to cast blame on a person for their own shooting, and certainly such an explanation should not be misapplied to those cases where criminal involvement is not present. But neither should a person's behavior be off-limits as part of a causal analysis merely because that person belongs to a minority group.
4.1.3 Shooter bias: The missing contingencies flaw
Research on shooter bias clearly illustrates the third flaw, the lack of attention to experimental contingencies and whether there are differences in motivation and ability between experimental and real-world decision-makers. Evidence of racial bias is reliably obtained with untrained citizens completing the FPST (Mekawi & Bresin, Reference Mekawi and Bresin2015), but the task has specific parameters that are required for such bias to be realized. For example, in the FPST, the target appears on the screen holding an object and the participant must make a decision within a response window relative to target onset. Thus, target race and object are presented simultaneously and responses after, say, 650 ms are considered errors.
The important question is whether these contingencies match the nature of actual police shootings. They do not. Officers almost always have information about citizen race much, much sooner than when the decision to shoot is made (and certainly well outside the window for ruling out controlled processing), and officers almost always have some interaction with the citizen before deciding to shoot. As noted above, experimental FPST participants are also given zero information about the situation surrounding the decision, a fact that matches no police shooting.
More important, the FPST is a task about misidentifying harmless objects as weapons. However, evidence of racial bias in the FPST has been used to make claims about widespread police officer bias in the decision to shoot. What has not been questioned is the degree to which fatal police shootings are actually about misidentification of harmless objects. If police shootings rarely involve the misidentification of objects under neutral conditions (which is the focus of the FPST), then it might be misleading to apply findings from the FPST to explain racial bias in fatal shootings more broadly. In fact, we estimated that the number of fatal shootings in which officers misidentify harmless objects as weapons is around 30 incidents per year (Cesario et al., Reference Cesario, Johnson and Terrill2019). To the extent that error rates on the FPST are informative for understanding racial bias, the task may be applicable only to an extremely infrequent event within a much larger set of related events. Indeed, considering that there are over 75,000,000 police–citizen contacts per year (Davis, Whyde, & Langton, Reference Davis, Whyde and Langton2018), this suggests the error rate for officers misidentifying a harmless object as a weapon – the central question of the FPST – is on the order of less than one in a million.
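The back-of-the-envelope arithmetic behind this estimate can be made explicit. The sketch below uses only the two figures cited in the text (~30 misidentification shootings per year; ~75,000,000 police–citizen contacts per year):

```python
# Figures cited in the text: ~30 fatal shootings per year involving a
# misidentified harmless object (Cesario et al., 2019) and ~75,000,000
# police-citizen contacts per year (Davis, Whyde, & Langton, 2018).
misidentification_shootings_per_year = 30
police_citizen_contacts_per_year = 75_000_000

# Error rate per contact: the event the FPST is designed to study.
error_rate = misidentification_shootings_per_year / police_citizen_contacts_per_year
print(f"Error rate per contact: {error_rate:.1e}")  # 4.0e-07, i.e., less than one in a million
```

On these figures, the misidentification error rate is 0.4 per million contacts, which is the basis for the "less than one in a million" claim above.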
One could salvage the FPST by replying that the task still tells us something important about officers' decisions during these very infrequent events. Moreover, infrequent events can be tremendously important, and the tragic cases where an officer makes a clear error and shoots a citizen reaching for his wallet are the events that we as citizens should care the most about. However, two problems remain. First, the most reliable effect in the FPST is on response times and not on error rates; meta-analysis indicates that there is not a reliable effect of target race on shooting unarmed targets (Mekawi & Bresin, Reference Mekawi and Bresin2015). Second, such an argument requires ignoring the problems described above, which can change the applicability of such results to real-world cases. For example, the FPST assumes equal encounter rates with the police (as 50% of trials are White targets and 50% of trials are Black targets). However, if officers have differential contact with Black citizens (because of bias in discretionary stopping of citizens or simply because of different violation rates between Black and White citizens), then racial disparity in being shot while reaching for a wallet may exist while officers show no bias in the actual decision to shoot. A constant, race-blind error rate on the part of the police would still result in a greater proportion of Black Americans being shot while reaching for their wallets (see Cesario, Reference Cesario2021; Ross et al., Reference Ross, Winterhalder and McElreath2021).
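The base-rate logic in the preceding paragraph can be made concrete with a toy calculation. All numbers below are hypothetical, chosen only to illustrate how a constant, race-blind error rate combined with unequal per-capita contact rates still yields a per-capita disparity in errors:

```python
# Hypothetical illustration: identical decision-level error rate for every
# contact, but Group B has twice the per-capita contact rate of Group A.
error_rate = 1e-6  # same misidentification error rate applied to every contact
contacts = {"Group A": 10_000_000, "Group B": 4_000_000}      # 5% vs. 10% of population
population = {"Group A": 200_000_000, "Group B": 40_000_000}

per_capita = {}
for group in contacts:
    expected_errors = contacts[group] * error_rate
    per_capita[group] = expected_errors / population[group]
    print(f"{group}: {expected_errors:.0f} expected errors, "
          f"per-capita rate {per_capita[group]:.1e}")
```

Despite the decision-level error rate being identical for both groups, Group B's per-capita rate of being erroneously shot is twice Group A's, simply because its per-capita contact rate is twice as high. This is the sense in which a race-blind error rate can coexist with a per-capita racial disparity.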
What about the failure to consider possible motivation and ability differences between real-world and experimental decision-makers? Social psychologists have overwhelmingly used convenience samples of naive undergraduates to study the decision to shoot (see Cesario & Carrillo, Reference Cesario, Carrillo, Carlston, Johnson and Hugenbergin press), participants for whom the decision is inconsequential and who have no training in how to make such a decision. Yet police officers typically receive over 1,000 hours of use of force training (Morrison, Reference Morrison2006; Stickle, Reference Stickle2016). It would be surprising if the ability to detect and classify objects, and the cognitive processes underlying such performance, was similar for experienced officers and undergraduates who have never made a single such decision in their lives. Interestingly, Correll et al. (Reference Correll, Park, Judd and Wittenbrink2002) issued exactly this caution in the very first study on the FPST (“it is not yet clear that Shooter Bias actually exists among police officers … there is no reason to assume that this effect will generalize beyond [lay samples],” p. 1328). Yet despite this and later warnings (Cox & Devine, Reference Cox and Devine2016), researchers continued to apply studies from undergraduates to police officers, even as data came to light that police officers did not show the same bias (e.g., Correll et al., Reference Correll, Park, Judd, Wittenbrink, Sadler and Keesee2007, Reference Correll, Hudson, Guillermo and Ma2014).
The fact that trained officers may use information in the decision landscape differently than untrained undergraduates represents an important ability difference between the two groups. If experts attend to different decision components or use these components differently than novices, and this difference changes the effect of target race on the ultimate decision, then the conclusion of widespread race bias in officers' deadly force decisions based on findings from undergraduate participants will be unwarranted. Indeed, sworn officers typically show little to no bias in the behavioral decision to shoot with the standard FPST (e.g., Akinola, Reference Akinola2009; Correll et al., Reference Correll, Park, Judd, Wittenbrink, Sadler and Keesee2007; Johnson et al., Reference Johnson, Cesario and Pleskac2018; Ma & Correll, Reference Ma and Correll2011; Sim, Correll, & Sadler, Reference Sim, Correll and Sadler2013; Taylor, Reference Taylor2011), and this is especially true for studies using immersive shooting simulators such as the one described above (e.g., James et al., Reference James, Vila and Daratha2013, Reference James, Klinger and Vila2014, Reference James, James and Vila2016). Cesario and Carrillo (Reference Cesario, Carrillo, Carlston, Johnson and Hugenbergin press) summarized studies in which sworn officers completed the standard FPST and found that out of 64 possible tests for racial bias, only ~25% showed anti-Black bias whereas ~70% showed no bias on the part of officers in one direction or the other.
As a direct means of demonstrating the importance of collecting data with trained experts rather than naive undergraduates, Johnson et al. (Reference Johnson, Cesario and Pleskac2018) tested for differences between officers and students in the underlying cognitive dynamics of the decision to shoot. Was there evidence that trained versus untrained individuals were making the decision in a different way or using race differently during the decision process? Trained officers and untrained undergraduates completed the standard laboratory FPST. These researchers then modeled the data from each group using a drift diffusion model. This model describes the decision to shoot as a sequential sampling process in which people start with a prior bias to shoot or not and accumulate evidence over time until a threshold required for a decision is reached. The details regarding this modeling can be found elsewhere (see Johnson, Hopwood, Cesario, & Pleskac, Reference Johnson, Hopwood, Cesario and Pleskac2017; Pleskac et al., Reference Pleskac, Cesario and Johnson2018); for now, the important point is that the model allows for an understanding of how the cognitive processes underlying the decision to shoot might vary between untrained and trained participants.
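The sequential-sampling logic of the drift diffusion model can be sketched in a few lines. This is not the model-fitting code used by Johnson et al.; it is a minimal random-walk approximation with arbitrary parameter values, intended only to make the three components named in the text (start point, evidence accumulation, decision threshold) concrete:

```python
import random

def simulate_ddm_trial(drift, threshold, start=0.0, noise=1.0, dt=0.001, rng=random):
    """Minimal drift diffusion trial: evidence accumulates from `start` until
    it crosses +threshold ("shoot") or -threshold ("don't shoot").
    Returns (decision, response_time_in_seconds)."""
    evidence, t = start, 0.0
    while abs(evidence) < threshold:
        # Drift (systematic evidence) plus Gaussian noise, Euler-discretized.
        evidence += drift * dt + noise * rng.gauss(0.0, dt ** 0.5)
        t += dt
    return ("shoot" if evidence >= threshold else "don't shoot"), t

# A higher threshold (as observed for trained officers) means more evidence
# is accumulated before responding: slower but less error-prone decisions.
random.seed(1)
trials = [simulate_ddm_trial(drift=1.0, threshold=1.0) for _ in range(200)]
print("Mean RT (s):", sum(rt for _, rt in trials) / len(trials))
```

In this framework, a race effect on the accumulation process would appear as a race-dependent drift rate, and the officers' higher decision thresholds correspond to a larger `threshold` parameter; neither value here is estimated from data.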
In these data, trained officers showed no racial bias in their behavioral decisions, despite untrained undergraduates showing such bias. More important, cognitive modeling of the decision data revealed why officers did not show bias in their behavioral responses. Officers showed two major differences compared to untrained undergraduates in the underlying decision components. First, race did not affect the manner in which officers accumulated evidence about whether to shoot. For untrained undergraduates, their processing of the object held by the target was “contaminated” by the target's race: When a harmless object was held by a Black target, the processing of his race interfered with processing of the object being held, pushing participants toward a “shoot” decision (resulting in more false alarms). Officers showed no such effect of race. They were able to extract information about the object in the person's hand independent of the target's race. Second, officers set higher thresholds for making a decision, accumulating more evidence before making a decision. In combination, these two components eliminated the effect of race on officers' behavioral decisions, an effect robustly observed in untrained participants.
Among trained officers, then, the decision process operated differently and race did not have the same effects on the underlying decision components as it did on untrained participants. Failure to understand or appreciate these differences leads researchers to inappropriately apply the results from undergraduates – who have no training and have never had to make such a decision before entering the lab – to expert decision-makers.
4.2 Implicit bias and group disparities
It would be difficult to find a concept from experimental social psychology that has spread more quickly and widely outside academia than implicit bias. There is no question that implicit bias research (1) has been used to explain why groups in contemporary American society obtain unequal outcomes and (2) has relied almost exclusively on studies using indirect measures such as the Implicit Association Test (IAT).Footnote 3 Other writings have critiqued the theoretical and measurement aspects of implicit bias research (Arkes & Tetlock, Reference Arkes and Tetlock2004; Blanton & Jaccard, Reference Blanton and Jaccard2008; Blanton, Jaccard, Gonzales, & Christie, Reference Blanton, Jaccard, Gonzales and Christie2006; Blanton et al., Reference Blanton, Jaccard, Klick, Mellers, Mitchell and Tetlock2009; Blanton, Jaccard, Strauts, Mitchell, & Tetlock, Reference Blanton, Jaccard, Strauts, Mitchell and Tetlock2015; Corneille & Hütter, Reference Corneille and Hütter2020; Fiedler, Messner, & Bluemke, Reference Fiedler, Messner and Bluemke2006; Mitchell, Reference Mitchell, Crawford and Jussim2018; Oswald, Mitchell, Blanton, Jaccard, & Tetlock, Reference Oswald, Mitchell, Blanton, Jaccard and Tetlock2013; Schimmack, Reference Schimmack2020), so I restrict my discussion of this topic to those aspects most relevant to the question of explaining group disparities.
4.2.1 Implicit bias: The missing information flaw
In prototypical implicit bias research, as in studies using the IAT or other indirect measurement techniques (Fazio, Jackson, Dunton, & Williams, Reference Fazio, Jackson, Dunton and Williams1995; Greenwald, McGhee, & Schwartz, Reference Greenwald, McGhee and Schwartz1998), every possible source of information which could impact a person's judgment and behavior is stripped from the measurement of these unconscious or uncontrollable processes. In the best-case scenario, participants are shown cropped photos of faces belonging to different group members and make rapid categorizations of these faces; in the worst-case scenario, there are no group members whatsoever and group labels (e.g., “Black”) serve as target stimuli instead. No information other than category membership is available to participants and button-press differences on the order of a fraction of a second are the outcome of interest. Additionally, research has shown that implicit or indirect measures can be sensitive to context information (see, e.g., Barden, Maddux, Petty, & Brewer, Reference Barden, Maddux, Petty and Brewer2004; Blair, Reference Blair2002; Gawronski, Reference Gawronski2019; Gawronski & Sritharan, Reference Gawronski, Sritharan, Gawronski and Payne2010; Wittenbrink, Judd, & Park, Reference Wittenbrink, Judd and Park2001). The fact that humans exist and are perceived only in contexts, and not isolated against empty backgrounds, should prompt meaningful discussion about the degree to which such context-less implicit bias measures will predict bias in real decisions.
4.2.2 Implicit bias: The missing forces flaw
Implicit bias research also reflects the second flaw outlined in this paper, which is that the effects of implicit bias are not appropriately compared to other influences on group outcomes. Consider the example of sex differences in STEM participation. Women and men do not have identical profiles of ability and interest relevant to STEM performance, and much research has explored the implications of these factors (Benbow & Stanley, Reference Benbow and Stanley1980; Benbow et al., Reference Benbow, Lubinski, Shea and Eftekhari-Sanjani2000; Ceci & Williams, Reference Ceci and Williams2010; Reference Ceci and Williams2011; Ceci, Williams, & Barnett, Reference Ceci, Williams and Barnett2009; Cheng, Reference Cheng2020; Cortés & Pan, Reference Cortés and Pan2020; Hakim, Reference Hakim2006; Halpern et al., Reference Halpern, Benbow, Geary, Gur, Hyde and Gernsbacher2007; Kleven, Landais, & Søgaard, Reference Kleven, Landais and Søgaard2019; Kleven, Landais, Posch, Steinhauer, & Zweimüller, Reference Kleven, Landais, Posch, Steinhauer and Zweimüller2020; Lubinski & Benbow, Reference Lubinski and Benbow1992; Su & Rounds, Reference Su and Rounds2015; Su et al., Reference Su, Rounds and Armstrong2009; Valla & Ceci, Reference Valla and Ceci2014). It is questionable how millisecond differences in measured associations correspond to the factors that actually shape group disparities, given that such differences have not been integrated into the larger dynamics of STEM engagement and performance.
Although there is variation in the published literature, studies claiming to demonstrate the importance of implicit bias in explaining group outcomes often do not measure these other forces at all (e.g., Cvencek, Greenwald, & Meltzoff, Reference Cvencek, Greenwald and Meltzoff2011a; Cvencek, Meltzoff, & Greenwald, Reference Cvencek, Meltzoff and Greenwald2011b), do not compare the size of implicit bias effects to the size of these other forces (e.g., Nosek & Smyth, Reference Nosek and Smyth2011), treat these other forces as control variables without directly comparing the size of implicit bias effects to these variables (e.g., Kiefer & Sekaquaptewa, Reference Kiefer and Sekaquaptewa2007), or treat such forces as a predicted variable resulting from implicit bias rather than the reverse (e.g., Nosek, Banaji, & Greenwald, Reference Nosek, Banaji and Greenwald2002; Nosek et al., Reference Nosek, Smyth, Sriram, Lindner, Devos, Ayala and Greenwald2009).
4.2.3 Implicit bias: The missing contingencies flaw
As for the third flaw, there has been a striking failure to explore whether the precise experimental contingencies required to demonstrate implicit bias in the lab correspond in some reasonable way to the contingencies present during real-life decisions. These contingencies include the twin features of the lack of ability and motivation, as well as the specific experimental details needed to reveal bias on indirect measures.
Consider some of the necessary experimental contingencies required both for the measurement of implicit cognition and for observing the effects of implicit bias on decision-making and behavior. Perhaps the central defining feature of implicit cognition is awareness (Greenwald & Banaji, Reference Greenwald and Banaji1995), and as such implicit measures are supposed to “neither inform the subject of what is being assessed nor request self-report concerning it” (p. 5).Footnote 4 A first-order, foundational question then is whether people are aware of their biases, aware of what is being assessed during the measurement of these biases, or aware of the effects of their biases. After all, if one defines implicit bias as discrimination based on “unconscious” processes and argues that implicit bias is so important as to have implications for legal doctrine in the United States (Greenwald & Krieger, Reference Greenwald and Krieger2006; Kang & Banaji, Reference Kang and Banaji2006), then certainly the basic question of awareness must have been thoroughly settled by now. As Gawronski (Reference Gawronski2019) describes, however, there is currently no convincing evidence that people are uniquely unaware of their biases or the effects of their biases.Footnote 5 It is striking that the concept of implicit bias has been pushed into federal policy at the highest levels of the U.S. government without any convincing evidence concerning even basic questions about the measurement or the effects of implicit bias.
Indirect measurement techniques (as a means of assessing stereotype associations) require specific contingencies to reveal bias on the part of participants. Take for example the IAT (Greenwald et al., Reference Greenwald, McGhee and Schwartz1998). As with other control tasks, such as the Stroop task, no one shows bias in their decisions if given sufficient time to respond.Footnote 6 Thus, a speeded response is a required condition for measurement of implicit bias so that controlled cognitive processes will be prevented or attenuated from impacting responses. In this way, implicit measures can assess “implicit attitudes by measuring their underlying automatic evaluation” (Greenwald et al., Reference Greenwald, McGhee and Schwartz1998, p. 1464), as opposed to measuring a more controlled evaluation elicited by the stimulus.
In addition to measurement, there are also necessary conditions to demonstrate the effects of implicit bias on behavior and decision-making. Consider the central claim by implicit bias researchers that automatically activated associations influence us even when we don't want them to (e.g., "implicit biases are especially intriguing, and also especially problematic, because they can produce behavior that diverges from a person's avowed or endorsed beliefs or principles," Greenwald & Krieger, Reference Greenwald and Krieger2006, p. 951). Given this, people must be in a decision situation where controlled processes – that is, what we want – cannot play a role, conditions where we want to respond in unbiased ways but are unable to do so. This requires that a person lacks the ability to exercise controlled processes, as in a decision with a short response time window. Without this feature, the decision situation is no longer one in which we are unable to produce the desired, unbiased response. Good experimental practice and inference would then require that, in implicit bias research, both contingencies are in place: People do not want to be influenced by categorical information but are in decision situations where such controlled processes cannot influence responses. That none of the studies recently presented as strong evidence for the behavioral prediction of implicit bias ensured these contingencies were met suggests that this practice is not widespread (Jost, Reference Jost2019).
As a final, critical contingency, as noted earlier there is overwhelming evidence that categorical bias is overridden when decision-makers are provided with individuating information (e.g., Kunda & Thagard, Reference Kunda and Thagard1996). In the measurement of implicit bias, no individuating information is ever presented; yet it is common to apply laboratory findings of implicit bias to real decisions which contain strong individuating information, such as in hiring decisions or decisions about one's own career choice (where a person clearly has interest and ability information; e.g., Nosek & Smyth, Reference Nosek and Smyth2011).
In terms of explaining group disparities, it follows that the bulk of the underrepresentation for any group must be because of an underrepresentation of people who are ambiguous with respect to their performance at the task at hand, because it is only these people for whom decisions will be affected by implicit bias on the part of decision-makers. In the case of “gatekeepers” making biased decisions against potential STEM students (Nosek & Smyth, Reference Nosek and Smyth2011), the “A” student and the “F” student are both unaffected by implicit bias on the part of the guidance counselor (because there is unambiguous positive and negative individuating information, respectively). This means that the sex disparity must consist of “C” students who would have become successful in STEM careers had implicit bias not caused the guidance counselor to unintentionally steer those students out of a STEM track. It is the responsibility of implicit bias proponents to show this is the case.
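To see why the size of the ambiguous middle matters, consider a minimal numerical sketch (all quantities are invented for illustration, not estimates from any study): if unambiguous “A” and “F” cases are decided identically for both groups and bias operates only on ambiguous “C” cases, then the disparity that decision-maker bias can produce is capped by the share of ambiguous cases in the pool.

```python
# Hypothetical sketch: bias that operates only on ambiguous cases can produce
# a group disparity no larger than (share of ambiguous cases) x (size of bias).

def selection_rate(frac_a, frac_c, frac_f, c_select_prob):
    """Overall selection rate when all 'A' cases are selected, no 'F' cases
    are selected, and ambiguous 'C' cases are selected with some probability."""
    return frac_a * 1.0 + frac_c * c_select_prob + frac_f * 0.0

# Assumed pool composition: 30% clear selects, 40% ambiguous, 30% clear rejects.
FRAC_A, FRAC_C, FRAC_F = 0.3, 0.4, 0.3

# Ambiguous cases from the favored group are selected at 0.5; bias lowers
# this to 0.3 for the disfavored group (both numbers invented).
rate_favored = selection_rate(FRAC_A, FRAC_C, FRAC_F, c_select_prob=0.5)
rate_disfavored = selection_rate(FRAC_A, FRAC_C, FRAC_F, c_select_prob=0.3)

# Disparity = frac_c * (0.5 - 0.3); it shrinks toward zero as FRAC_C shrinks.
disparity = rate_favored - rate_disfavored
```

Under these assumptions the disparity is 0.4 × 0.2 = 0.08; if only 10% of cases were ambiguous, the same bias could produce at most a 0.02 disparity. The sketch is simply an accounting identity, but it makes explicit what the “C student” argument requires proponents to establish empirically.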
Implicit bias research, then, provides another example of the fundamental weaknesses of experimental social psychology when explaining group disparities. Without providing any relevant information to participants, researchers obtain evidence of the biasing effects of category information. Such associations as measured by millisecond response time differences – obtained under conditions completely discordant with the real world and which do not correspond to the presumed psychological constructs of interest in a straightforward way (see, e.g., Blanton, Jaccard, Christie, & Gonzales, Reference Blanton, Jaccard, Christie and Gonzales2007; Uhlmann, Brescoll, & Paluck, Reference Uhlmann, Brescoll and Paluck2006) – are proposed to explain complex and sizable group disparities. Little effort is made to integrate these differences into a detailed model which includes other, strong influences on outcomes or specification of the real-world performance parameters. These weaknesses are consistent with the poor performance of implicit bias measures in predicting discriminatory behavior (see, e.g., Blanton et al., Reference Blanton, Jaccard, Klick, Mellers, Mitchell and Tetlock2009; Oswald et al., Reference Oswald, Mitchell, Blanton, Jaccard and Tetlock2013, Reference Oswald, Mitchell, Blanton, Jaccard and Tetlock2015).Footnote 7
4.3 Racial disparities in school disciplinary outcomes
A final example is recent research in experimental social psychology on racial disparities in school disciplinary outcomes. There are well-known racial disparities in suspensions and expulsions, with Black schoolchildren more likely to receive such outcomes than White, Hispanic, or Asian schoolchildren (Lhamon & Samuels, Reference Lhamon and Samuels2014). At issue is why this per capita disparity exists and whether distorted interpretation of behavior because of racial stereotypes explains such disparities. That is, are schoolteachers interpreting the same behavior on the part of Black and White schoolchildren differently and, therefore, referring them for disciplinary action at different rates, even while the behavior of Black and White children is the same?
Experimental social psychologists have followed the familiar pattern of instructing participants to make punitive judgments of hypothetical schoolchildren from simple written scenarios, with targets who are presented as equal on every dimension other than their race. After observing an effect of target race on disciplinary decisions, researchers then loop back and claim that such findings can help explain why racial disparities in real classrooms exist (Jarvis & Okonofua, Reference Jarvis and Okonofua2020; Okonofua & Eberhardt, Reference Okonofua and Eberhardt2015).
An analysis of this research reveals the three flaws identified above. The information provided to participants in these experimental studies consists of impoverished descriptions of real teacher–child experiences, removing important information that real decision-makers could use, such as a child's history of behavior in the classroom, the other children involved, the teacher's current intentions and behavior, or even the general context surrounding the event. All the knowledge that the teacher has concerning the student's history and past behavior simply cannot play a role in their experimental judgments. This is important because the distribution of student disciplinary action is highly skewed and is principally tied to specific students; the question is not about the average, generic student but about specific students at the tail end of a distribution. For example, in one large survey of teacher referrals for disciplinary action, 93% of the 22,000 students recorded did not receive a single referral, 4% received only one referral, and six students received more than 20 referrals each (Rocque & Paternoster, Reference Rocque and Paternoster2011). Experiments are about group average effects (e.g., “Does a sample of participants show an average difference in disciplining unknown, nonspecific Black or White students?”), but the distribution of disciplinary actions suggests that this misses the nature of the actual topic under study.
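The point about skew can be made concrete with a toy example (the counts below are invented for a hypothetical school of 1,000 students, loosely patterned on the skew described above, not the Rocque & Paternoster data): when referrals are concentrated in a handful of students, the “average, generic student” an experiment asks about generates almost none of the actual disciplinary actions.

```python
# Toy illustration of a heavily skewed referral distribution.
# Each tuple is (number of students, referrals per student); numbers invented.
distribution = [(930, 0), (40, 1), (24, 3), (6, 25)]

total_students = sum(n for n, _ in distribution)        # 1,000 students
total_referrals = sum(n * r for n, r in distribution)   # 40 + 72 + 150 = 262

# The six highest-referral students (0.6% of the school) account for the
# majority of all referrals in this toy distribution.
top_six_share = (6 * 25) / total_referrals

# The mean per student, which is what a group-average experiment targets,
# is a fraction of one referral and describes almost no actual student.
mean_per_student = total_referrals / total_students
```

In this sketch, 0.6% of students generate roughly 57% of referrals, while the per-student mean is about 0.26 referrals, a value that describes neither the 93% with zero referrals nor the few students driving the totals. This is the sense in which an experiment about the average student can miss the phenomenon.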
If such information plays a strong role in real decisions, then by removing it researchers prevent teachers from making unbiased judgments and force participants to rely on the only diagnostic information given to them. For example, studies on race and classroom discipline give teachers a student's name (manipulated to be either a common Black or White name) and a one-paragraph description of an event (“You tell DeShawn to pick his head up and get to work. He only picks his head up”). Whether these vignettes contain the information used by teachers when making real disciplinary decisions is unknown.
These experimental designs also fail to consider the possible influence of other factors that may play a role in a child's behavior, such as socioeconomic status, family structure, cultural norms for the teacher–child relationship, parental expectations, interest in school, delay of gratification, and so on, all of which differ across racial groups and would reasonably be expected to relate to behavioral differences in the classroom (Andreoni et al., Reference Andreoni, Kuhn, List, Samek, Sokal and Sprenger2019; DeNavas-Walt, Proctor, & Smith, Reference DeNavas-Walt, Proctor and Smith2013; Heriot & Somin, Reference Heriot and Somin2018; Hsin & Xie, Reference Hsin and Xie2014; McLanahan & Percheski, Reference McLanahan and Percheski2008; Musu-Gillette et al., Reference Musu-Gillette, Zhang, Wang, Zhang, Kemp, Diliberti and Oudekerk2018; Price-Williams & Ramirez, Reference Price-Williams and Ramirez1974; Rocque & Paternoster, Reference Rocque and Paternoster2011; Wright et al., Reference Wright, Morgan, Coyne, Beaver and Barnes2014; Zytkoskee et al., Reference Zytkoskee, Strickland and Watson1971). Whatever the size of participants' racial bias in disciplining hypothetical Black versus White schoolchildren in an experimental situation, one cannot draw any conclusions about whether such categorical biases impact disciplinary outcomes in the real world because the experimental bias effect is not understood in relation to these other factors. An assumption justifying the design of such studies is the expectation that children who differ in myriad important ways should behave identically in the classroom.
As support for this claim, consider a recent paper on race and school suspensions by experimental social psychologists, which begins by stating that racial differences in school suspension are “not fully explained by racial differences in socioeconomic status or in student misbehavior” (Okonofua & Eberhardt, Reference Okonofua and Eberhardt2015, p. 617). No report is given of how much of the racial disparity is explained by these factors, just that some non-zero amount remains. As evidence for this claim, six citations are provided, but none of these citations measure student behavior and show that Black and White students are behaving similarly. Indeed, one of these citations states “The ideal test … would be to compare observed student behavior with school disciplinary data. Those data were not available for this study, nor are we aware of any other investigation that has directly observed student behaviors” (Skiba, Michael, Nardo, & Peterson, Reference Skiba, Michael, Nardo and Peterson2002, p. 325).Footnote 8 In contrast, Wright et al. (Reference Wright, Morgan, Coyne, Beaver and Barnes2014) did find that racial differences in school suspension rates were fully accounted for by prior behavioral problems of the student. The point is not to single out these researchers (as such claims are broadly made by nearly everyone doing similar research), but instead to illustrate an additional example of the problems identified above.
Moreover, using experimental social psychology to explain school suspensions and expulsions reflects the third flaw as well: A lack of attention to the actual contingencies needed to produce stereotyping effects in the lab and whether such contingencies resemble real-world situations. As noted earlier, stereotyping effects occur under conditions of ambiguity and are absent or small when perceivers have individuating information or are judging unambiguous behaviors. To the extent that teachers are misconstruing or misinterpreting students' behaviors because of stereotypes held about different racial groups, those effects are therefore predicted to occur in the absence of individuating information or for ambiguous behaviors. How categorical bias might reveal itself in long-term interactions such as teacher–student relationships, where plenty of individuating information is available, is not established.
Some research by leading scholars within social psychology on school disciplinary disparities has tried to take a more dynamic perspective. For example, Okonofua, Walton, and Eberhardt (Reference Okonofua, Walton and Eberhardt2016) propose that the teacher–student relationship can devolve over time and that initial stereotype effects can increase in strength as teachers' expectations and worries about minority students' behavior affect students' behavior in the classroom (see also Madon et al., Reference Madon, Jussim, Guyll, Nofziger, Salib, Willard and Scherr2018; Martell, Lane, & Emrich, Reference Martell, Lane and Emrich1996). Of course, whether initial teacher concerns about classroom management eventually lead Black students to enact those behaviors that would get them expelled, when they would not have otherwise done so absent such expectations, is unclear. Nor are the effects of such expectations set within the context and force of the other strong influences on students' outcomes listed earlier.
5. What do experimental studies of bias tell us?
To say that studies in experimental social psychology cannot tell us about real-world group disparities is not to say that such studies are worthless. These studies provide a wealth of information about the function and process of storing and using categorical information. However, if researchers want to know about real-world group disparities, such findings cannot provide them with the information they seek.
The standard way of interpreting experimental stereotyping findings has already been described: Experimental evidence that participants are biased against identical targets from different groups reflects the power of stereotypes to affect individual decision-makers. The assumption that the same processes operate in the real world means that removing decision-maker bias will result in groups obtaining roughly similar (or at least substantially more similar) outcomes.
Yet is this interpretation the correct one? An alternative interpretation of the results of experimental studies of bias starts with the understanding that people learn the conditional probabilities of the behavior of different groups as they navigate their social worlds. In other words, groups differ in their characteristics and people pick up on this, storing diagnostic information about relative group differences even if imperfectly so (Eagly, Wood, & Diekman, Reference Eagly, Wood, Diekman, Eckes and Trautner2000; Eagly, Nater, Miller, Kaufmann, & Sczesny, Reference Eagly, Nater, Miller, Kaufmann and Sczesny2020; Jussim et al., Reference Jussim, Cain, Crawford, Harber and Cohen2009, Reference Jussim, Crawford, Anglin, Chambers, Stevens and Cohen2015a, Reference Jussim, Crawford and Rubinstein2015c; Koenig & Eagly, Reference Koenig and Eagly2014; McCauley, Stitt, & Segal, Reference McCauley, Stitt and Segal1980).
Then, they enter a social psychology experiment on bias. They are asked to render a judgment about a target without being given diagnostic or distinguishing individuating information. Under such conditions, they end up using the information that they have come to learn as being probabilistically accurate in their daily lives, and categorical influence dominates.
Thus, through a kind of methodological trickery, the experimenter has created a world in which information that is probabilistically predictive in everyday life becomes completely inaccurate given the systematic design of our experiments. This interpretation is consistent with a view of stereotyping that describes perceivers as forming conditional probabilities and that emphasizes how categorical effects are most likely under conditions of ambiguity and uncertainty, when no strong individuating information is present (Krueger & Rothbart, Reference Krueger and Rothbart1988; Kunda & Thagard, Reference Kunda and Thagard1996; Lick, Alter, & Freeman, Reference Lick, Alter and Freeman2018; McCauley et al., Reference McCauley, Stitt and Segal1980). Given the design of most experiments, it is not surprising that there are decades of laboratory studies showing stereotyping effects. To be clear, this provides no information about whether this type of categorical influence leads to disparate outcomes across groups. It does reveal that experimenters are skilled at creating worlds whose landscapes do not match the real world in any way, and that participants fail to behave perfectly according to the standards of the experimenter when placed in such worlds.
In light of this reframing, what does the standard interpretation of experimental studies reveal about researchers' assumptions of how minds should and do operate? Throughout this paper, I have noted that the standard experimental design presents targets “who vary only with respect to the social categories to which they belong.” What do researchers intend when they design stimuli in this way? In doing this, researchers intend to make targets equal on all dimensions relevant to the decision at hand. For example, in the FPST, the single relevant piece of information in the decision to “shoot” is whether the target is holding a gun or not. If participants are influenced by anything other than the object in the target's hand, then researchers conclude that participants are making erroneous decisions – that is, they are showing bias. This includes cases when participants are influenced by factors related to a person's race that are probabilistically related to threat or handgun use, for example, having been previously arrested for a violent crime. Similarly, in studies of STEM hiring, the single relevant piece of information is the qualification of the applicant as revealed by the resume; being influenced by anything other than this information is treated as biased, erroneous decision-making.
What this illustrates is the researcher's belief that participants are wrong to use any information other than the information deemed relevant by the researcher. This includes information that the participant has learned prior to entering the experiment, information that may be probabilistically accurate in everyday life. In the mind of the researcher, participants should not use information within the experiment that may actually lead to more accurate decisions outside the experiment – not because such information is reliably incorrect, but because the experimenter has artificially made it incorrect. The researcher demands that participants be accurate as defined by the decision landscape of the experiment, no matter how disconnected this landscape is from the real world. Researchers thus require a kind of blank slate worldism of their participants in judging accuracy and bias, where information from one world must be erased when moving to the next. Such a demand on the part of social psychologists in fact violates a core tenet of good prediction, which is the use of priors in updating posterior prediction. Bayes' rule would require participants in social psychology experiments to include the target's categorical information in their judgments (though of course the effect of categorical information should depend on the strength of the data, as it does).
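The Bayesian point can be illustrated with a minimal sketch (all base rates and likelihood ratios below are hypothetical): a judge following Bayes' rule combines a category base rate (the prior) with individuating evidence (the likelihood), and the influence of the prior shrinks as the evidence strengthens, which is exactly the pattern the stereotyping literature reports.

```python
# Minimal Bayesian updating sketch with invented numbers.

def posterior(prior, likelihood_ratio):
    """P(trait | evidence) via odds form of Bayes' rule:
    posterior odds = prior odds x likelihood ratio of the evidence."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Two groups with different assumed base rates of some trait.
p_group_a, p_group_b = 0.10, 0.30

# Likelihood ratios for the same observed behavior: ambiguous individuating
# information is weakly diagnostic; unambiguous information is strongly so.
weak_evidence, strong_evidence = 2.0, 50.0

# Gap between groups' posteriors = residual influence of category membership.
gap_weak = posterior(p_group_b, weak_evidence) - posterior(p_group_a, weak_evidence)
gap_strong = posterior(p_group_b, strong_evidence) - posterior(p_group_a, strong_evidence)
# gap_strong < gap_weak: strong individuating data swamps the categorical prior.
```

Under these assumed numbers, the between-group gap in posterior judgments is roughly 0.28 with weak evidence but shrinks to roughly 0.11 with strong evidence. A normative Bayesian judge therefore uses category information most when individuating information is weakest, which is precisely the condition the standard experiment creates.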
6. Broader consequences
Beyond the specific conclusions about group disparities, experimental social psychology has had a significant – and potentially misleading – impact on broader questions about the human mind and human nature. This research has led directly to the widespread attention currently given to the topic of implicit bias. Originally, dual process models in social psychology supported a satisficer view of the human mind, one in which people did “good enough” (and were thus subject to bias) unless motivation and ability were high (Fazio, Reference Fazio1990; Fiske & Neuberg, Reference Fiske and Neuberg1990; Petty & Wegener, Reference Petty, Wegener, Chaiken and Trope1999; Smith & DeCoster, Reference Smith and DeCoster2000). Importantly, such models were explicit that biasing effects were conditional (Bargh, Reference Bargh, Uleman and Bargh1989); they were not present at all times and for all people.
As experimental studies of categorical bias proliferated and as demonstrations of bias became more attractive than demonstrations of accuracy (e.g., Higgins & Bargh, Reference Higgins and Bargh1987; Jussim, Reference Jussim2012b; Jussim et al., Reference Jussim, Cain, Crawford, Harber and Cohen2009), the published literature left one with the impression of widespread, inescapable error in decision-making, and the important point that bias occurs only under specific experimental conditions took a backseat to the more attractive story of widespread bias in real-world decisions (Greenwald & Krieger, Reference Greenwald and Krieger2006). Moreover, as social psychology moved further away from actual behavior and increasingly focused on millisecond reaction times, whether such differences mattered for actual decisions became increasingly unclear.
At the same time, demographic groups in the United States continued to obtain unequal outcomes despite little overt, official discrimination for several decades (and in places such as academia, preferential policies in favor of underrepresented groups), coupled with increasingly egalitarian attitudes. These disparities presented a puzzle. If groups were not being overtly barred from entry and decision-makers widely expressed egalitarian beliefs, what was causing persistent disparities?
Enter the concept of implicit bias, supported by experimental social psychology studies on categorical bias (Greenwald & Banaji, Reference Greenwald and Banaji1995). As this research was taken up by people outside the research community, the understanding of the human mind morphed from “under certain conditions, bias may emerge” to “unconscious bias is ever-present and impossible to control,” with a lack of attention to those studies showing individual variation in automatically activated concepts (e.g., Fazio et al., Reference Fazio, Jackson, Dunton and Williams1995). By now, this view is ubiquitous and claims of uncontrollable, unavoidable, pervasive, unconscious bias can be found anywhere one cares to look.
Such a view of the human mind, however, is in no way justified by the experimental studies on which it is built. There is so little overlap between our experimental parameters and the parameters of real-world decisions that the popular view of the human mind as swamped with uncontrollable bias is premature. It is troubling that researchers have not devoted serious research attention to exploring this gap.
At the same time that social psychologists have been using their findings to explain group disparities, people outside academia have enthusiastically adopted these claims. This has been true throughout popular culture, government organizations, the legal system, and the corporate world. In the case of police shootings, the claim that implicit bias is responsible for racial disparities is widely broadcast in newspaper accounts of fatal police shootings, with studies from experimental social psychology cited as evidence (e.g., Carey & Goode, Reference Carey and Goode2016; Dreifus, Reference Dreifus2015; Kristof, Reference Kristof2014; Lopez, Reference Lopez2017). In the case of school disciplinary disparities, President Obama's 2014 “Dear Colleague” letter on the “Nondiscriminatory Administration of School Discipline” was explicit in rejecting the idea that actual behavioral differences across racial groups contribute meaningfully to the corresponding disparities in school suspensions. It also named implicit bias training as a possible solution for ensuring that school personnel administer discipline in a non-discriminatory manner. It is difficult to overstate how widespread this belief has become in the last decade, driven primarily if not wholly by research from experimental social psychologists. Indeed, some researchers have actively pushed this agenda, appearing on televised news programs, holding press conferences, writing advocacy pieces, and testifying in court (as described in, e.g., Mitchell, Reference Mitchell, Crawford and Jussim2018).
7. Related critiques
Although I focus on social psychology experiments in this paper, related critiques have been made in other literatures. A brief review of these critiques, some of which are general methodological critiques and some of which are specific to group disparities, provides additional support to the current argument.
On the question of group disparities specifically, Heckman's (Reference Heckman1998) analysis of racial and gender disparities in employment supports the present argument. In typical “audit studies” (e.g., Bertrand & Mullainathan, Reference Bertrand and Mullainathan2004), a set of prospective employers are sent resumes that are identical except for the race of the applicant; research typically finds that Black applicants receive fewer callbacks for interviews than White applicants. Such findings are then used as evidence that actual racial disparities in employment are because of discrimination on the part of employers. Thus, the general format of experimental labor market studies is the same as the social psychology research described in the current paper: If we can show average levels of race-based differential treatment between hypothetical people who are otherwise presented as equal, then this same differential treatment is responsible for actual group disparities.
Heckman argued that average levels of market-wide discrimination cannot necessarily be applied to real people engaged in real transactions, because such transactions do not occur at the market-wide level. Employment transactions are between specific people and specific firms, and if the people and firms in experimental studies do not match the characteristics of real people and firms in the market, then experimental results are irrelevant for explaining real group disparities. Suppose an experimental audit study finds that employers at Goldman Sachs engage in discrimination against Black applicants. If it is the case that Black applicants do not apply to Goldman Sachs, or that actual Black applicants do not have the resumes that would make them competitive at Goldman Sachs, then whether employers at Goldman Sachs discriminate against artificial Black applicants tells us nothing about why Blacks may be under-employed there or anywhere else in the financial market.
There is the same problem in labor market studies as in studies in experimental social psychology: A lack of attention to the degree of overlap between the characteristics of real group members and the characteristics of our hypothetical experimental targets. And this failure, as in social psychology, distorts our understanding of the nature of group disparities. As Heckman summarized, “A careful reading of the entire body of available evidence confirms that most of the disparity in earnings between blacks and whites in the labor market of the 1990s is due to the differences in skills they bring to the market, and not to discrimination within the labor market” (p. 101; see also Neal & Johnson, Reference Neal and Johnson1996).
In terms of broad methodological critiques, similar concerns have been raised in the field of judgment and decision-making (JDM). Hogarth (Reference Hogarth1981), for example, highlighted the discrepancy between the discrete judgments used in experimental JDM research and the continuous, interactive judgments frequently found in the real world. He used this discrepancy to show how researchers' failure to incorporate the role of feedback in experimental decision tasks could lead to distorted conclusions. Specifically, he demonstrated that decisions characterized as “biased” in discrete judgments could be understood as functional when decisions were continuous. Similarly, a major thrust of Gigerenzer and colleagues' research program has been to show that the structure of the decision environment is a crucial consideration for a full understanding of accurate and inaccurate decisions. Failure to appreciate the relation between the organism and its environment can lead to misleading conclusions about the nature of human rationality and decision-making (Dhami, Hertwig, & Hoffrage, Reference Dhami, Hertwig and Hoffrage2004; Gigerenzer, Hoffrage, & Kleinbölting, Reference Gigerenzer, Hoffrage and Kleinbölting1991; Pleskac & Hertwig, Reference Pleskac and Hertwig2014). Tetlock (Reference Tetlock1985) also analyzed the nature of JDM research and noted how laboratory studies lacked accountability for decision-makers, a key component inherent to most real-world decisions and one which can change the nature of decisions. Thus, there is precedent for being concerned about social psychologists' lack of interest in the degree to which their experimental tasks reflect the decision landscape in which actual decisions are made or whether the characteristics of real decision-makers match those in our experimental settings.
Relatedly, Eagly and colleagues' research on gender differences in leadership style provides supportive evidence for the arguments advanced here (Eagly & Johannesen-Schmidt, Reference Eagly and Johannesen-Schmidt2001; Eagly & Johnson, Reference Eagly and Johnson1990). These researchers found that some gender differences in leadership style were larger in laboratory studies compared to studies conducted in actual organizational settings. This methodological difference can be understood in the terms described here: the failure to include real-world information in laboratory studies. Specifically, actual roles in organizational settings contain role requirements, which can exert powerful effects on behavior regardless of the person occupying the role. In laboratory studies, in contrast, this influence is absent, hence the greater potential for gender to exert an influence on leadership behavior in this context.
To be fair, within social psychology there are some lines of research on stereotyping and disparate outcomes that do consider group behavioral differences as an important part of the causal chain producing group disparities. For example, Diekman et al. (Reference Diekman, Steinberg, Brown, Belanger and Clark2017) have proposed a goal congruity model to help understand sex differences in STEM participation. In this model, the communal goals that people have, in combination with their beliefs about how different STEM and non-STEM careers can fulfill those needs, impact STEM engagement and ultimately career choice. Importantly, this model accounts for at least some of the sex disparities in STEM participation by taking seriously the sizable male–female difference in communal goals.
Finally, the current paper is most closely related to broad concerns in the experimental literature on external validity. Part of the current analysis raises multiple concerns regarding the external validity of experimental social psychology, and this is certainly not new. However, this paper goes beyond past treatments in several ways. First, it outlines which features of the typical experimental investigations are threats to external validity and analyzes how the fallacies and assumptions underlying researchers' approaches to the question of group disparities directly lead to choices that undermine external validity. Second, the current paper is not a broad indictment of the external validity of typical experimental social psychology. The standard experimental social psychology study can tell us much about how categorical information is formed and used, and I raise no issue with the external validity of those studies. Instead, the concern here is specifically with the use of these findings to explain real-world disparate outcomes. Finally, the current paper goes beyond typical external validity concerns because, even if the external validity of current studies were improved, the problems inherent to this approach are so fundamental that the findings still could not be applied to explain group disparities. For example, if distributional differences between men and women on STEM-related attributes are not taken into account when explaining group disparities in STEM participation, then irrespective of any changes to the experimental process researchers will still misunderstand the nature of this disparity. One way of thinking about the relationship between the current analysis and past critiques of external validity is that this paper uses those past critiques as a vehicle for a broader, more systematic dismantling of current experimental studies on bias.
On external validity, relevant data supporting the current argument come from Mitchell (Reference Mitchell2012), who compared effect sizes of laboratory studies to field studies. Although the relationship between the two was strong and positive, this varied by subfield in important ways. Social psychology not only had a lower correspondence between lab and field studies than some other subareas, but social psychology was also the subfield in which the sign of the effect reversed most often. Although the purpose of Mitchell's analysis was not to identify all the features that impact lab-field correlations, the relatively poor performance of social psychology can be understood within the current framework: to the extent that lab studies suffer from the three flaws outlined here, the correspondence of these experimental effects once behavior returns to the field will be low. Of course, not all the social psychology studies in Mitchell's analysis were of decision-maker bias, but other analyses have found similar, supportive effects (e.g., Eagly & Johnson, Reference Eagly and Johnson1990; Koch et al., Reference Koch, D'Mello and Sackett2015).
8. A new (or at least rehashed) approach
If the current approach to understanding group disparities is not just misguided but fundamentally flawed, what might be an alternative, more productive research cycle? Although it would be nice to claim a completely new approach to studying these important topics, what follows is largely a rehashing and reemphasizing of other, better recommendations that have already been made, for example, by Dasgupta and Stout (Reference Dasgupta and Stout2012) and Mortensen and Cialdini (Reference Mortensen and Cialdini2010), with some further elaboration and connection to other critiques from the past several decades. The major difference is that I begin by explicitly noting that in many (perhaps most) cases of studying group disparities, we may end up concluding that experimental social psychology cannot contribute, or at least will take a distant backseat to other approaches.
Studies of group disparities on any outcome should begin first and foremost with a task analysis of the decision itself as it exists outside the laboratory. This would involve detailed discussions with those individuals responsible for making such decisions, ideally including novice and expert decision-makers. Researchers might also meaningfully enhance the quality of their models by completing training protocols themselves, to learn how the decision is supposed to unfold (at least as formally instructed). In the case of police shootings, beginning at this step would likely have led to a drastically different methodology used by experimental social psychologists, one which incorporated actual features of deadly force decisions.
The second step in the process involves the study of members of groups who are obtaining disparate outcomes on the topic of interest (both more and less desirable outcomes), including behavioral, personality, or other individual differences relevant to the topic at hand. This can often be useful in confirming that the factors identified by decision-makers in step 1 are, in fact, relevant. This step is also important for placing any categorical bias effects in the context of the size of these performance-related differences. Beyond giving us a more accurate understanding of the nature of group disparities, this can also provide information about the strength of different interventions to reduce such disparities. The expectation about what the world will look like after eliminating all decision-maker bias is very different depending on whether there are no differences or large differences across groups.
In the case of shooter bias, an initial task analysis would have revealed that the context and behavior of the target citizen is critical and that the context of violent crime is a central part of the officer's decision to shoot. The second step would have led to the recognition that there are very sizable differences across groups in violent crime rates and to an appreciation that any biasing effects of race on an officer's decision must be placed in the context of these behavioral differences. The same is true, for example, of intellectual performance differences across groups, where sometimes average differences do not exist but differences are large at the extreme tails, and other times average group differences do exist and are sizable (Ceci & Williams, Reference Ceci and Williams2010; Fryer & Levitt, Reference Fryer and Levitt2010; Halpern et al., Reference Halpern, Benbow, Geary, Gur, Hyde and Gernsbacher2007; Hsin & Xie, Reference Hsin and Xie2014; Lubinski & Benbow, Reference Lubinski and Benbow1992). Demographic disparities in, for example, college grade point averages (GPAs), majors, and graduation rates must be understood in the context of these sizable incoming differences across racial and ethnic groups (e.g., ACT, 2017), and interventions that do not address these differences at the core are unlikely to stem the cascading and continuing differences over time.
Only after the first two non-experimental steps comes the third step of designing experiments informed by the data already obtained. This will almost always necessitate more involved and difficult studies with non-student samples; what follows would likely be a steep decline in both the number of studies conducted and the proportion of studies involving undergraduate convenience samples.
The final step in relating back to the real-world disparities of interest involves integrating the size of categorical effects from experimental tasks with the sizes of other effects on a group's outcomes, for example, behavioral and personality differences across groups. This is something that will be specific to the domain under study as it is unlikely that many of the same factors impact outcomes to the same extent across domains (but see Gottfredson, Reference Gottfredson1997, Reference Gottfredson1998, Reference Gottfredson2004).
This call for a new approach to research complements other, previous concerns about the approach of standard psychological science. Already noted are the proposals by Dasgupta and Stout (Reference Dasgupta and Stout2012) and Mortensen and Cialdini (Reference Mortensen and Cialdini2010). Other recent examples include Rozin's (Reference Rozin2009) assessment of how changes to the reward structure in psychology would improve the science. As he stated (emphasis added):
In such cases, as with the nth study (where n > 10) on a particular phenomenon or claim, it is appropriate to determine whether proper controls have been conducted, whether alternative accounts have been dealt with, and whether there are any errors in thinking or experimentation. But first, we have to find out what it is that we will be studying, what its properties are, and its generality outside of the laboratory and across cultures.
Aligned with Rozin's critique, the current study pushes back against a movement that gained momentum with the emergence of social cognition in the late 1970s and perspectives such as Mook's "In Defense of External Invalidity" (Mook, Reference Mook1983). These forces promoted the importance of systematic design and justified the measurement of small differences in highly impoverished experimental settings, without consideration of whether the decisions made in these studies related in clear ways to the actual decisions that, ultimately, we care so much about (see also Ring, Reference Ring1967). Another way of framing the problem is to suggest that social psychology has been more focused on publishing demonstrations of bias than on fully understanding the nature of group disparities through the pursuit of a "strong inference" model (Platt, Reference Platt1964).
9. Conclusion
What can experimental social psychology tell us about why different segments of society are not evenly represented across all outcomes? Experimental studies of categorical bias can and do tell us about the functions and processes of storing group-based information. However, the disconnect between the experimental parameters of these studies and the conditions surrounding real-world decisions makes our experiments irrelevant when it comes to understanding the complex dynamics of group disparities. Of course, there is individual-level bias and discrimination; tribalism and intergroup bias are features of all human minds. But if the goal is to study systematic categorical bias and its effects on group outcomes, a different approach is needed. I describe one possible new approach for experimental social psychology, one which begins not with the assumptions of academic researchers holding the goal of demonstrating bias but instead with an analysis of the actual decision itself. Such an approach would not only change the relevance of social psychology for understanding group disparities, but may also correct some of the misleading claims about the human mind that have extended out from academia in the last two decades.
Acknowledgments
I thank Michael Bailey, E. Tory Higgins, Lee Jussim, Calvin Lai, Richard Lucas, and three anonymous colleagues for productive discussions and feedback on earlier drafts of this study. Alice Eagly and two anonymous reviewers provided outstanding and critical comments that greatly increased the quality of this manuscript. This study also benefitted from discussions with friends and colleagues at the Duck Conference on Social Cognition (2017).
Financial support
This study is based on work supported by the National Science Foundation under Grants No. 1230281 and 1756092.
Conflict of interest
None.