Audit experiments are used to measure discrimination in many domains (employment: Bertrand and Mullainathan 2004; legislator responsiveness: Butler and Broockman 2011; housing: Fang, Guess, and Humphreys 2018). What audit studies have in common is that they estimate the average difference in response rates depending on randomly varied characteristics (such as race or gender) of a requester. Scholars conducting audit experiments often seek to extend their analyses beyond the effect on response itself to the effect on the quality of the response. Because response is a consequence of treatment, answering these important questions well is complicated by post-treatment bias (Montgomery, Nyhan, and Torres 2018). In this note, I consider a common form of post-treatment bias that occurs in audit experiments.
As an instructive example, consider White, Nathan, and Faller (2015), an audit experiment in which election officials were sent emails from putatively non-Latino White or Latino names asking “I’ve been hearing a lot about voter ID laws on the news. What do I need to do to vote?” Whereas non-Latino White names received a response 70.5% of the time, Latino names received a response 64.8% of the time, a statistically and substantively significant difference of −5.7 percentage points. In a secondary analysis, the authors further estimate the effect on the friendliness of the emails, conditional on response.
Response is a post-treatment outcome; conditioning on post-treatment outcomes “de-randomizes” an experiment in the sense that the resulting treatment and control groups no longer have potential outcomes that are equivalent in expectation. Seen another way, conditioning on a post-treatment outcome induces confounding. This problem is relatively widespread: seven of the 20 legislative audit experiments analyzed in Costa (2017) and nine of the 29 employment audit studies analyzed in Quillian et al. (2017) inappropriately condition on response.
In this setting, a subject might be one of the four types in Table 1. $R_i(Z)$ is the response potential outcome depending on whether subject $i$ is assigned a putatively non-Latino White name ($Z = 0$) or a putatively Latino name ($Z = 1$). Together, $R_i(1)$ and $R_i(0)$ indicate whether a subject is an Always-Responder, an If-Treated-Responder, an If-Untreated-Responder, or a Never-Responder. The friendliness potential outcome $Y_i(Z)$ is undefined if a subject does not respond, implying that the average treatment effect of the Latino name on friendliness does not exist for subjects who do not respond in one condition or the other. The average effect of treatment on Always-Responders, $E[Y_i(1) - Y_i(0) \mid R_i(0) = R_i(1) = 1]$, does exist, but estimating it is not straightforward because we do not have complete information on who is an Always-Responder.
Table 1 Types of Subjects

| $R_i(0)$ | $R_i(1)$ | Type |
| --- | --- | --- |
| 1 | 1 | Always-Responder |
| 0 | 1 | If-Treated-Responder |
| 1 | 0 | If-Untreated-Responder |
| 0 | 0 | Never-Responder |
Analysts have three main choices:
Bounds. Zhang and Rubin (2003) develop bounds around the average effect for subjects whose outcomes are never “truncated by death,” regardless of treatment assignment; their result applies immediately to the audit-study case (see Aronow, Baron, and Pinson (2018) for an application and extension of these bounds in political science). The estimates correspond to the most pessimistic and most optimistic scenarios for the average treatment effect among Always-Responders. These bounds often have very large (or even infinite) width, so their scientific utility will vary by application.
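To make the trimming logic concrete, the sketch below computes bounds of this kind under an added simplifying assumption of monotonicity (the treatment can only suppress response, never create it); Zhang and Rubin’s assumption-free bounds, used in the reanalysis below, are wider. The function name, variable names, and inputs are illustrative, not the original analysis code.

```python
import numpy as np

def trimming_bounds(y_treat_resp, y_ctrl_resp, p_treat, p_ctrl):
    """Trimming bounds on the ATE among Always-Responders, assuming
    monotonicity (response under treatment implies response under
    control, so p_treat <= p_ctrl). Inputs: outcomes among responders
    in each arm, plus the two response rates."""
    # Under monotonicity, every treated responder is an Always-Responder,
    # so E[Y(1) | Always-Responder] is point-identified.
    e_y1 = np.mean(y_treat_resp)

    # Always-Responders make up only a share q of control responders; the
    # rest are If-Untreated-Responders. Averaging the bottom or top q of
    # the control outcomes gives the two extreme scenarios for E[Y(0) | AR].
    q = p_treat / p_ctrl
    y0 = np.sort(np.asarray(y_ctrl_resp))
    k = max(1, int(round(q * len(y0))))
    e_y0_lo, e_y0_hi = y0[:k].mean(), y0[-k:].mean()

    return e_y1 - e_y0_hi, e_y1 - e_y0_lo  # (lower, upper) bound on the ATE
```

With friendliness scores for the responders in each arm and the response rates reported above, `trimming_bounds(y_t, y_c, 0.648, 0.705)` returns the pessimistic and optimistic estimates in one call.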
Find Always-Responders. If we were to assume that a particular group of subjects consisted entirely of Always-Responders, we could directly estimate the effect of treatment on the quality of response within that group. One check on the plausibility of the “Always-Responders” assumption is that the response rate in both the treatment and control groups must equal 100%. The assumption can of course still be incorrect, as some treated units might not have responded if untreated (or vice versa). Bendick, Jackson, and Reinoso (1994) implicitly invoke an “Always-Responders” assumption in an analysis that subsets the data to firms that offered jobs to both White and Black confederates before estimating the average effect of race on the salary offered within this group.
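A minimal sketch of this subsetting approach, using simulated matched-pair data (the column names and the data itself are hypothetical stand-ins, not the original variables from White, Nathan, and Faller (2015)):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for matched-pair audit data: one row per pair, with
# response indicators and friendliness scores for each pair member.
rng = np.random.default_rng(1)
n_pairs = 1000
df = pd.DataFrame({
    "resp_treat": rng.binomial(1, 0.648, n_pairs),
    "resp_ctrl": rng.binomial(1, 0.705, n_pairs),
    "friendly_treat": rng.normal(58, 15, n_pairs),
    "friendly_ctrl": rng.normal(63, 15, n_pairs),
})

# Keep only pairs in which both members responded, and assume
# (unverifiably) that these pairs consist of Always-Responders.
both = df[(df["resp_treat"] == 1) & (df["resp_ctrl"] == 1)]

diffs = both["friendly_treat"] - both["friendly_ctrl"]
ate_hat = diffs.mean()                        # within-pair difference in means
se_hat = diffs.std(ddof=1) / len(diffs)**0.5  # SE of the mean pair difference
print(f"ATE among putative Always-Responders: {ate_hat:.1f} (SE = {se_hat:.1f})")
```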
Redefine the outcome. Analysts can change the outcome variable to $Y^*_i(Z)$, which is equal to $Y_i(Z)$ if $R_i(Z) = 1$ and 0 otherwise. Crucially, this means that emails never sent are “not friendly.” The average effect of treatment on this new dependent variable, $E[Y^*_i(1) - Y^*_i(0)]$, is well-defined. Kalla, Rosenbluth, and Teele (2018) use this approach; White, Nathan, and Faller (2015) report in their footnote 29 that they ran this analysis as well.
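The redefinition itself is a one-liner. The sketch below uses simulated data whose response rates merely echo those reported above; everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
z = rng.integers(0, 2, n)                            # 1 = Latino name, 0 = White name
r = rng.binomial(1, np.where(z == 1, 0.648, 0.705))  # response indicator
y = rng.normal(60, 15, n)                            # friendliness, meaningful only if r == 1

# Redefined outcome: nonresponse is coded as zero ("not friendly"),
# so Y* exists for every subject and the ATE is well-defined.
y_star = np.where(r == 1, y, 0.0)
ate_star = y_star[z == 1].mean() - y_star[z == 0].mean()
print(f"ATE on the redefined outcome: {ate_star:.1f}")
```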
In my reanalysis of White, Nathan, and Faller (2015), I provide examples of all three approaches. Following the procedure in Zhang and Rubin (2003), I estimate the lower bound to be −66 points and the upper bound to be 65 points. These bounds are themselves subject to sampling variability, which I estimate via the nonparametric bootstrap (sketched below). To find Always-Responders, I take advantage of the original experiment’s matched-pair design and subset the data to the 719 matched pairs in which both the treated and untreated pair member responded. Under the unverifiable assumption that both pair members would also have responded had the treatment assignment been switched, I estimate the ATE among this subgroup to be −5.5 points (SE = 2.5 points). Finally, using the redefined outcome, I estimate the ATE to be a 6-point decrease in friendliness (SE = 1.7 points). As it happens, this estimate is statistically significant, while the naive estimate (−3.5 points, SE = 2.0 points) is not. Table 2 displays all four estimates.
Table 2 Reanalysis of White, Nathan, and Faller (2015)

| Approach | Estimate | SE |
| --- | --- | --- |
| Naive (conditional on response) | −3.5 | 2.0 |
| Zhang and Rubin (2003) bounds | [−66, 65] | |
| Always-Responders (matched pairs) | −5.5 | 2.5 |
| Redefined outcome | −6.0 | 1.7 |
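One way to gauge the sampling variability of the bounds in Table 2 is the nonparametric bootstrap mentioned above. The sketch below reuses the hypothetical `trimming_bounds()` function from the earlier sketch and assumes the per-subject response indicators and outcomes are NumPy arrays (outcome values for nonresponders are ignored, since the response indicator masks them):

```python
import numpy as np

def bootstrap_bounds(r_t, y_t, r_c, y_c, n_boot=2000, seed=0):
    """Nonparametric bootstrap of the trimming bounds: resample subjects
    with replacement within each arm, then recompute response rates and
    bounds on each draw. Reuses trimming_bounds() defined earlier."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        i = rng.integers(0, len(r_t), len(r_t))  # resampled treated arm
        j = rng.integers(0, len(r_c), len(r_c))  # resampled control arm
        rt, yt, rc, yc = r_t[i], y_t[i], r_c[j], y_c[j]
        draws.append(trimming_bounds(yt[rt == 1], yc[rc == 1],
                                     rt.mean(), rc.mean()))
    return np.asarray(draws)  # one (lower, upper) pair per bootstrap draw
```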
In this note, I have outlined how some of the causal quantities we seek to estimate in audit experiments do not exist. We can either attempt to recover estimates of the effect for a subgroup of units (the Always-Responders), or we can redefine the outcome so that the average treatment effect is defined. The approaches outlined here have applications beyond audit experiments. Rondeau and List (2008) seek to estimate the effect of a treatment on the size of donations; they inappropriately condition on units making any donation. Björkman and Svensson (2009) seek to estimate the effects of a monitoring intervention on child health; they inappropriately condition on infant survival. The applicability of each of the three “solutions” to the problem will depend on the substantive area, but conditioning on post-treatment variables should be avoided in all cases.