1 Introduction
Social scientists are often interested in estimating the marginal, or population average, effects of treatment in the presence of post-treatment confounding. Post-treatment confounding is common in studies of time-varying treatments, where confounders of future treatments may be affected by prior treatments. For example, political scientists study how the timing and frequency of negative advertising during political campaigns affect election outcomes (e.g., Lau, Sigelman, and Rovner 2007; Blackwell 2013). In this context, the decision to run negative advertisements at any given point during a campaign is affected by a candidate’s position in recent polling data, which itself is affected by negative advertising conducted previously. Post-treatment confounding is also common in analyses of causal mediation, where confounders for the effect of the mediator on the outcome may be affected by treatment. For example, when assessing the role of morality in mediating the effects of shared democracy on public support for war, post-treatment variables, such as beliefs about the threat posed by the adversary, may affect both the perceived morality of war and support for military action (Tomz and Weeks 2013).
Adjusting for post-treatment confounders using conventional methods, for example, by naively conditioning, stratifying, or otherwise balancing on them, may engender two different types of bias (Robins 1986, 2000). First, adjusting naively for post-treatment confounders leads to bias from overcontrol of intermediate pathways because it blocks, or “controls away”, the effect of treatment on the outcome that operates through these variables. Second, adjusting naively for post-treatment confounders can lead to collider-stratification bias if these variables are also affected by unobserved determinants of the outcome, as conditioning on a variable generates a spurious association between its common causes even when these common causes are unconditionally independent (Pearl 2009).
Marginal structural models (MSMs) and the associated method of inverse probability weighting (IPW) avoid these biases and are capable of consistently estimating treatment effects under fairly general conditions (Robins 2000; Robins, Hernán, and Brumback 2000; VanderWeele 2015). Compared with more traditional models for time-series cross-sectional data (e.g., fixed effects regression models), MSMs with IPW can better accommodate dynamic causal relationships (Imai and Kim 2019). Specifically, unlike conventional methods, this approach allows past treatments to affect current outcomes (i.e., “carryover effects”) and past outcomes to affect current treatment (i.e., “feedback effects”). Because of this flexibility, political scientists have increasingly used MSMs with IPW to draw causal inferences from longitudinal data (e.g., Zhukov 2017; Ladam, Harden, and Windett 2018; Simmons and Creamer 2019).
Nevertheless, IPW has several important limitations. First, IPW requires models for the conditional distributions of exposure to treatment and/or the mediator, and prior research indicates that it is highly sensitive to their misspecification (e.g., Kang and Schafer 2007). Second, even if these models are correctly specified, IPW is relatively inefficient, and it is susceptible to large finite-sample biases when confounders strongly predict the exposures of interest (Wang et al. 2006; Cole and Hernán 2008).Footnote 1 Finally, when the exposures of interest are continuous, IPW tends to perform poorly because estimates of conditional densities are often unreliable (e.g., Vansteelandt 2009; Naimi et al. 2014).
Several remedies have been proposed to improve the efficiency and robustness of IPW. For example, Cole and Hernán (2008) suggest truncating or censoring extreme weights to obtain more precise estimates. With this approach, however, the improved precision comes at the cost of greater bias. More recently, Imai and Ratkovic (2014, 2015) propose constructing weights for an MSM with covariate balancing propensity scores (CBPS). By integrating a large set of balancing conditions when estimating propensity scores, this method is less sensitive to model misspecification. But estimating CBPS can be computationally demanding, and because of the practical difficulties associated with modeling conditional densities, this method remains challenging to use with continuous exposures, even in the cross-sectional setting (Fong et al. 2018).
In this paper, we propose an alternative method of constructing weights for MSMs, which we call “residual balancing”. Briefly, the method is implemented in two stages. First, a model for the conditional mean of each post-treatment confounder, given past treatments and confounders, is estimated and then used to construct residual terms. Second, a set of weights is constructed using Hainmueller’s (2012) entropy balancing method such that, in the weighted sample, (a) the residualized confounders are orthogonal to future exposures, past treatments, and past confounders, and (b) their discrepancy with a set of base weights (e.g., survey sampling weights) is minimized. Thus, our proposed method is an extension of Hainmueller’s (2012) entropy balancing procedure to the longitudinal setting. It exactly balances sample moments for each of the post-treatment confounders across future exposures, conditional on the observed past, without explicit models for the conditional distributions of exposure to treatment and/or a mediator.Footnote 2
Residual balancing has a number of advantages over both conventional methods of covariate adjustment and IPW and its variants. First, by appropriately residualizing the post-treatment confounders, the proposed method avoids bias due to overcontrol and collider stratification, unlike conventional methods that condition, stratify, or otherwise balance on these variables naively. Second, residual balancing is relatively robust to the model misspecification bias that commonly afflicts IPW and its variants. Third, residual balancing is also more efficient than IPW because it tends to avoid highly variable and extreme weights by minimizing their relative entropy with respect to a set of base weights. Fourth, in contrast to CBPS, residual balancing is computationally attractive in that the weighting solution is quickly obtained even with a large number of confounders, time periods, and observations. Finally, because it does not require models for the conditional distributions of the exposures, residual balancing is easy to use when treatments and/or mediators are continuous. This advantage may be especially important in political science applications, where continuous exposures commonly arise in analyses of time-series cross-sectional data (e.g., Blackwell 2013). An open-source R package, rbw, is available for implementing the proposed method, as is a Stata package with similar functionality.
In the sections that follow, we first briefly review MSMs and the method of IPW. Next, we introduce the method of residual balancing and conduct a set of simulation studies to evaluate its performance relative to IPW and its variants. We then illustrate the method empirically by estimating the cumulative effect of negative advertising on election outcomes as well as the controlled direct effect (CDE) of shared democracy on public support for war. We conclude by discussing the method’s limitations along with possible remedies.
2 MSMs and IPW: A Review
In this section, we briefly review MSMs and the method of IPW (Robins 2000; Robins, Hernán, and Brumback 2000). Consider first a study with $T\geqslant 2$ time points where interest is in the effect of a time-varying treatment, $A_{t}$ ($1\leqslant t\leqslant T$), on an end-of-study outcome, $Y$. At each time point, there is also a vector of observed time-varying confounders, $L_{t}$, that may be affected by prior treatments. Following convention, we use overbars to denote the treatment history, $\overline{A}_{t}=(A_{1},\ldots ,A_{t})$, and confounder history, $\overline{L}_{t}=(L_{1},\ldots ,L_{t})$, up to time $t$. Similarly, we denote an individual’s complete treatment and confounder histories through the end of follow-up by $\overline{A}=\overline{A}_{T}$ and $\overline{L}=\overline{L}_{T}$, respectively. Finally, we use $Y(\overline{a})$ to denote the potential outcome under the particular treatment history $\overline{a}$.
An MSM is a model for the marginal mean of the potential outcomes, which can be expressed in general form as follows:

$\mathbb{E}[Y(\overline{a})]=\mu(\overline{a};\beta),\qquad (1)$

where $\mu(\cdot)$ is some function of treatment history, $\overline{a}$, and a parameter vector, $\beta$, that captures the marginal effects of interest. For example, with a large number of time points and a binary treatment, a common parameterization is

$\mathbb{E}[Y(\overline{a})]=\beta_{0}+\beta_{1}\,\text{cum}(\overline{a}),\qquad (2)$

where $\text{cum}(\overline{a})=\sum _{t=1}^{T}a_{t}$ denotes the total number of time periods on treatment and $\beta_{1}$ captures the marginal effect of one additional wave on treatment. Of course, many other parameterizations are possible.
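As a concrete illustration of equation (2), with $T=5$ treatment periods the contrast between the always-treated and never-treated regimes is $\mathbb{E}[Y(1,1,1,1,1)]-\mathbb{E}[Y(0,0,0,0,0)]=5\beta_{1}$, that is, five times the per-wave effect.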
An MSM can be identified from observed data under three key assumptions:
- (1) consistency, which requires that, for any unit, if $\overline{A}=\overline{a}$, then $Y=Y(\overline{a})$;
- (2) sequential ignorability, which requires that treatment at each time point must not be confounded by unobserved factors conditional on past treatments and observed confounders or, formally, that $Y(\overline{a})\bot \!\!\!\bot A_{t}|\overline{A}_{t-1},\overline{L}_{t}$ for any treatment sequence $\overline{a}$; and
- (3) positivity, which requires that treatment assignment must not be deterministic or, formally, that $f(A_{t}=a_{t}|\overline{A}_{t-1}=\overline{a}_{t-1},\overline{L}_{t}=\overline{l}_{t})>0$ for any treatment condition $a_{t}$ if $f(\overline{A}_{t-1}=\overline{a}_{t-1},\overline{L}_{t}=\overline{l}_{t})>0$, where $f(\cdot)$ denotes a probability mass or density function.
When these assumptions are satisfied, an MSM can be consistently estimated using the method of IPW.
IPW estimation involves fitting a model for the conditional mean of the observed outcome given an individual’s treatment history using weights that balance, in expectation, past confounders across treatment at each time point. The IPW for individual $i$ is defined as

$w_{i}=\prod _{t=1}^{T}\frac{1}{f(A_{t}=a_{it}|\overline{A}_{t-1}=\overline{a}_{i,t-1},\overline{L}_{t}=\overline{l}_{it})},\qquad (3)$

where the $\overline{A}_{t-1}=\overline{a}_{i,t-1}$ term can be ignored when $t=1$. Since the denominator of equation (3) can be very small, some units may end up with extremely large weights, leading to highly variable estimates. To mitigate this problem, Robins, Hernán, and Brumback (2000) suggest using a so-called “stabilized” weight, which is defined as

$sw_{i}=\prod _{t=1}^{T}\frac{f(A_{t}=a_{it}|\overline{A}_{t-1}=\overline{a}_{i,t-1})}{f(A_{t}=a_{it}|\overline{A}_{t-1}=\overline{a}_{i,t-1},\overline{L}_{t}=\overline{l}_{it})}.\qquad (4)$

Sometimes, the probabilities in both the numerator and denominator are also made conditional on a set of baseline or time-invariant confounders $X$:

$sw_{i}=\prod _{t=1}^{T}\frac{f(A_{t}=a_{it}|\overline{A}_{t-1}=\overline{a}_{i,t-1},X=x_{i})}{f(A_{t}=a_{it}|\overline{A}_{t-1}=\overline{a}_{i,t-1},\overline{L}_{t}=\overline{l}_{it},X=x_{i})}.\qquad (5)$
In such cases, these variables need to be included in the MSM to properly adjust for confounding, which is unproblematic because they cannot be affected by treatment and thus conditioning on them will not engender bias due to overcontrol or collider stratification.
In practice, both the numerator and the denominator of the stabilized weight need to be estimated. When treatment is binary, the denominator is typically estimated using a generalized linear model (GLM), with the logit or probit link function, for treatment at each time point, while the numerator is estimated using a constrained version of this model that omits the time-varying confounders. When treatment is continuous, models are needed to estimate the conditional densities in both the numerator and the denominator of the weight. After weights have been computed, the marginal effects of interest are estimated by fitting a model for the conditional mean of $Y$, given $\overline{A}_{t}$ (and also possibly $X$) with weights equal to $sw_{i}$. When both this model and the models for treatment assignment are correctly specified, this procedure yields consistent estimates for all marginal means of the potential outcomes, $\mathbb{E}[Y(\overline{a})]$, and thus for any marginal effect of interest, provided that the identification assumptions outlined previously are satisfied.
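To make this two-model estimation concrete, the sketch below computes stabilized weights for a binary treatment observed over three waves. It is a minimal illustration rather than any package’s interface: the wide-format data frame dat and its column names (a1–a3 for treatments, l1–l3 for confounders, y for the outcome) are hypothetical, and robust or bootstrapped standard errors would still be needed for the weighted MSM.

```r
# Minimal sketch: stabilized IPW for a binary treatment at T = 3 waves.
# `dat` is a hypothetical wide-format data frame with columns a1, a2, a3
# (binary treatments), l1, l2, l3 (time-varying confounders), and y (outcome).
stabilized_ipw <- function(dat) {
  sw <- rep(1, nrow(dat))
  for (t in 1:3) {
    a <- dat[[paste0("a", t)]]
    # Denominator model: treatment given past treatments and confounder history
    den_form <- switch(t, a1 ~ l1, a2 ~ a1 + l1 + l2, a3 ~ a1 + a2 + l1 + l2 + l3)
    # Numerator model: treatment given past treatments only
    num_form <- switch(t, a1 ~ 1, a2 ~ a1, a3 ~ a1 + a2)
    p_den <- predict(glm(den_form, family = binomial, data = dat), type = "response")
    p_num <- predict(glm(num_form, family = binomial, data = dat), type = "response")
    # Multiply in the ratio of probabilities of the treatment actually received
    sw <- sw * ifelse(a == 1, p_num, 1 - p_num) / ifelse(a == 1, p_den, 1 - p_den)
  }
  sw
}

# The weighted MSM (with robust standard errors computed separately):
# msm <- lm(y ~ a1 + a2 + a3, data = dat, weights = stabilized_ipw(dat))
```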
As shown in prior studies (e.g., Kang and Schafer 2007), IPW estimates of marginal effects can be highly sensitive to misspecification of the models used to construct the weights. To address this limitation, Imai and Ratkovic (2014, 2015) developed the method of CBPS to estimate the denominator in equation (4) for binary treatments. With a logit model for treatment at each time point, this method augments the score conditions of the likelihood function with a set of covariate balance conditions. Because the total number of score and balance conditions exceeds the number of model parameters to be estimated, the generalized method of moments is used to minimize imbalance in the weighted sample. This method of incorporating balance conditions into model-based estimation of the weights tends to reduce the bias that results when the treatment models are misspecified.
MSMs and IPW estimation can also be used to examine causal mediation (VanderWeele 2015). Consider now a study with a point-in-time treatment, $A$, a putative mediator measured at some point following treatment, $M$, and an end-of-study outcome, $Y$. Suppose that both treatment and the mediator are confounded by a vector of observed baseline covariates, denoted by $X$, and that the mediator is additionally confounded by a vector of observed post-treatment covariates, denoted by $Z$, which may be affected by the treatment received earlier. In this setting, the potential outcomes of interest are denoted by $Y(a,m)$.
As before, an MSM models the marginal mean of the potential outcomes. If, for example, treatment and the mediator are both binary, a saturated MSM can be expressed as follows:

$\mathbb{E}[Y(a,m)]=\alpha_{0}+\alpha_{1}a+\alpha_{2}m+\alpha_{3}am.\qquad (6)$

From this model, the CDE of treatment is given by $\text{CDE}(m)=\mathbb{E}[Y(1,m)-Y(0,m)]=\alpha_{1}+\alpha_{3}m$, which measures the strength of the causal relationship between treatment and the outcome when the mediator is fixed at a given value, $m$, for all individuals (Pearl 2001; Robins 2003). This estimand is useful for assessing causal mediation because it helps to adjudicate between alternative explanations for a treatment effect. For example, the difference between a total effect and the $\text{CDE}(m)$ may be interpreted as the degree to which the mediator contributes to a causal mechanism that transmits the effect of treatment on the outcome (Acharya, Blackwell, and Sen 2016; Zhou and Wodtke 2019).
MSMs for the joint effects of a treatment and mediator, like equation (6), can be identified under essentially the same assumptions as outlined previously. In this context, the consistency assumption requires that $Y=Y(a,m)$ if $A=a$ and $M=m$; sequential ignorability requires that both treatment and the mediator must be unconfounded conditional on the observed past or, formally, that $Y(a,m)\bot \!\!\!\bot A|X$ and $Y(a,m)\bot \!\!\!\bot M|X,A,Z$; and positivity requires that both treatment and the mediator are not deterministic functions of past variables. Similarly, the stabilized IPW are here defined as

$sw_{i}^{\ast }=\frac{f(A=a_{i})}{f(A=a_{i}|X=x_{i})}\times \frac{f(M=m_{i}|A=a_{i})}{f(M=m_{i}|X=x_{i},A=a_{i},Z=z_{i})},\qquad (7)$

and they must be estimated using appropriate models for the conditional probabilities and/or densities that compose this expression. After weights have been computed, the marginal effects of interest—here, the $\text{CDE}(m)$—are estimated by fitting a model for the conditional mean of $Y$ given $A$ and $M$ with weights equal to $sw_{i}^{\ast }$. Alternatively, it is also possible to define the weights as $sw_{i}^{\dagger }=\frac{f(M=m_{i}|X=x_{i},A=a_{i})}{f(M=m_{i}|X=x_{i},A=a_{i},Z=z_{i})}$, in which case $X$ must be included in the MSM to properly adjust for confounding. Adjusting for $X$ in the MSM is unproblematic because these variables are not post-treatment confounders, unlike $Z$.
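For a binary mediator, the simpler weight $sw_{i}^{\dagger }$ can be computed from two logistic regressions, as in the minimal sketch below; the data frame d and its column names (y, a, m, x, z) are hypothetical.

```r
# Minimal sketch of sw_dagger for a binary mediator M (hypothetical data frame `d`).
num_fit <- glm(m ~ x + a, family = binomial, data = d)      # f(M | X, A)
den_fit <- glm(m ~ x + a + z, family = binomial, data = d)  # f(M | X, A, Z)
p_num <- predict(num_fit, type = "response")
p_den <- predict(den_fit, type = "response")
sw_dagger <- ifelse(d$m == 1, p_num, 1 - p_num) / ifelse(d$m == 1, p_den, 1 - p_den)
# CDE model (X included as regressors): lm(y ~ a * m + x, data = d, weights = sw_dagger),
# with robust or jackknife standard errors.
```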
3 Residual Balancing
In this section, we motivate and explain the method of residual balancing. We first focus on analyses of time-varying treatment effects, and then we outline how the method is easily adapted for studies of causal mediation. Finally, we discuss the advantages and limitations of residual balancing compared with IPW as well as the similarities and differences between residual balancing and the CBPS method.
3.1 Rationale
To explain the method of residual balancing, it is useful to begin with Robins’ (1986) g-computation formula. The g-computation formula factorizes the marginal mean of the potential outcome, $Y(\overline{a})$, as follows:

$\mathbb{E}[Y(\overline{a})]=\int _{\overline{l}}\mathbb{E}[Y|\overline{A}=\overline{a},\overline{L}=\overline{l}]\prod _{t=1}^{T}f(l_{t}|\overline{l}_{t-1},\overline{a}_{t-1})\,d\overline{l}.\qquad (8)$

In contrast, the conditional mean of the observed outcome $Y$, given $\overline{A}=\overline{a}$, can be factorized into

$\mathbb{E}[Y|\overline{A}=\overline{a}]=\int _{\overline{l}}\mathbb{E}[Y|\overline{A}=\overline{a},\overline{L}=\overline{l}]\prod _{t=1}^{T}f(l_{t}|\overline{l}_{t-1},\overline{a})\,d\overline{l}.\qquad (9)$

A comparison of equation (8) with equation (9) indicates that weighting the observed population by

$W_{l}=\prod _{t=1}^{T}\frac{f(l_{t}|\overline{l}_{t-1},\overline{a}_{t-1})}{f(l_{t}|\overline{l}_{t-1},\overline{a})}\qquad (10)$

would yield a pseudo-population in which $f^{\ast }(l_{t}|\overline{l}_{t-1},\overline{a})=f^{\ast }(l_{t}|\overline{l}_{t-1},\overline{a}_{t-1})=f(l_{t}|\overline{l}_{t-1},\overline{a}_{t-1})$ and thus $\mathbb{E}^{\ast }[Y|\overline{A}=\overline{a}]=\mathbb{E}^{\ast }[Y(\overline{a})]=\mathbb{E}[Y(\overline{a})]$, where the asterisk denotes quantities in the weighted pseudo-population.Footnote 3 Because $L_{t}$ is often high-dimensional, estimation of the conditional densities in equation (10) is practically difficult.
Nevertheless, the condition that $f^{\ast }(l_{t}|\overline{l}_{t-1},\overline{a})=f^{\ast }(l_{t}|\overline{l}_{t-1},\overline{a}_{t-1})=f(l_{t}|\overline{l}_{t-1},\overline{a}_{t-1})$ implies that, in the pseudo-population, the following moment condition would hold for any scalar function $g(\cdot )$ of $L_{t}$:

$\mathbb{E}^{\ast }[g(L_{t})|\overline{L}_{t-1},\overline{A}]=\mathbb{E}[g(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}].\qquad (11)$

This moment condition can be equivalently expressed as

$\mathbb{E}^{\ast }[\delta (g(L_{t}))|\overline{L}_{t-1},\overline{A}]=0,\qquad (12)$

where $\delta (g(L_{t}))=g(L_{t})-\mathbb{E}[g(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}]$ is a residual transformation of $g(L_{t})$ with respect to its conditional mean given the observed past. The moment condition in equation (12), in turn, implies that for any scalar function $h(\cdot )$ of $\overline{L}_{t-1}$ and $\overline{A}$, $\delta (g(L_{t}))$ and $h(\overline{L}_{t-1},\overline{A})$ are uncorrelated, that is,

$\mathbb{E}^{\ast }[\delta (g(L_{t}))\,h(\overline{L}_{t-1},\overline{A})]=\mathbb{E}^{\ast }[\delta (g(L_{t}))]\,\mathbb{E}^{\ast }[h(\overline{L}_{t-1},\overline{A})]=0,\qquad (13)$

where the second equality follows from the fact that $\mathbb{E}^{\ast }[\delta (g(L_{t}))]=\mathbb{E}^{\ast }\mathbb{E}^{\ast }[\delta (g(L_{t}))|\overline{L}_{t-1},\overline{A}]=0$.
The method of residual balancing emulates the moment conditions (13) that would hold in the pseudo-population were it possible to weight by $W_{l}$. In other words, it emulates the moment conditions (13) that would be expected in a sequentially randomized experiment. Specifically, this is accomplished by (a) specifying a set of $g(\cdot )$ functions, $G(L_{t})=\{g_{1}(L_{t}),\ldots ,g_{J_{t}}(L_{t})\}$, and a set of $h(\cdot )$ functions, $H(\overline{L}_{t-1},\overline{A})=\{h_{1}(\overline{L}_{t-1},\overline{A}),\ldots ,h_{K_{t}}(\overline{L}_{t-1},\overline{A})\}$; (b) computing a set of residual terms, $\delta (g(L_{t}))=g(L_{t})-\mathbb{E}[g(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}]$, from the observed data; and then (c) finding a set of weights such that, for any $j$, $k$, and $t$, the cross-moment of $\delta (g_{j}(l_{it}))$ and $h_{k}(\overline{l}_{i,t-1},\overline{a}_{i})$ is zero in the weighted data. Hence, it involves finding a set of nonnegative weights, denoted by $rbw_{i}$, subject to the following balancing conditions:

$\sum _{i=1}^{n}rbw_{i}\,\delta (g_{j}(l_{it}))\,h_{k}(\overline{l}_{i,t-1},\overline{a}_{i})=0,\quad 1\leqslant j\leqslant J_{t},\,1\leqslant k\leqslant K_{t},\,1\leqslant t\leqslant T,\qquad (14)$

or, expressed more succinctly,

$\sum _{i=1}^{n}rbw_{i}\,c_{ir}=0,\quad 1\leqslant r\leqslant n_{c},\qquad (15)$

where $c_{ir}$ is the $r$th element of $\mathbf{c}_{i}=\{\delta (g_{j}(l_{it}))h_{k}(\overline{l}_{i,t-1},\overline{a}_{i});1\leqslant j\leqslant J_{t},1\leqslant k\leqslant K_{t},1\leqslant t\leqslant T\}$ and $n_{c}=\sum _{t=1}^{T}J_{t}K_{t}$ is the total number of balancing conditions. The conditions in equation (14) stipulate that the residualized confounders at each time point are balanced across future treatments, past treatments, and past confounders, or some function thereof. In this way, the proposed method adjusts for post-treatment confounding without engendering bias due to overcontrol or collider stratification, as the residualized confounders are balanced across future treatments while (appropriately) remaining orthogonal to the observed past.
As long as the convex hull of $\{\mathbf{c}_{i};1\leqslant i\leqslant n\}$ contains $\mathbf{0}$, finding the weighting solution is an underidentified (or just-identified) problem. Following Hainmueller (2012), we minimize the relative entropy between $rbw_{i}$ and a set of base weights $q_{i}$ (e.g., a vector of ones or survey sampling weights),Footnote 4

$\min _{\{rbw_{i}\}}\sum _{i=1}^{n}rbw_{i}\log (rbw_{i}/q_{i}),\qquad (16)$

subject to the $n_{c}$ balancing conditions. This is a constrained optimization problem that can be solved using Lagrange multipliers. Technical details can be found in Supplementary Material A (see also Hainmueller 2012).
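As an illustration of this optimization step, the sketch below solves the dual of the entropy minimization problem in base R. The function name eb_weights and its interface are assumptions for exposition, not the rbw package API, and feasibility and convergence checks are omitted.

```r
# Minimal sketch of the entropy-balancing step. C is the n x r matrix whose
# rows are the c_i vectors from equation (15); q is a vector of base weights.
# The primal solution has the form rbw_i proportional to q_i * exp(-c_i'lambda),
# so lambda can be found by minimizing the convex dual objective
# log(sum_i q_i * exp(-c_i'lambda)), whose gradient is minus the weighted
# mean of each balancing column (zero at the solution, i.e., exact balance).
eb_weights <- function(C, q = rep(1, nrow(C))) {
  dual <- function(lambda) log(sum(q * exp(-C %*% lambda)))
  grad <- function(lambda) {
    w <- as.vector(q * exp(-C %*% lambda))
    -colSums(C * w) / sum(w)
  }
  opt <- optim(rep(0, ncol(C)), dual, grad, method = "BFGS")
  w <- as.vector(q * exp(-C %*% opt$par))
  w / mean(w)  # rescale so the weights average to one
}
```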
In Figure 1, we illustrate the logic of residual balancing with a directed acyclic graph, which describes the causal relationships between a time-varying treatment $A_{t}$, a vector of time-varying confounders $L_{t}$, and an end-of-study outcome $Y$ with two time periods $t=1,2$. Weighting is intended to create a pseudo-population in which the confounding arrows $L_{1}\rightarrow A_{1}$, $L_{1}\rightarrow A_{2}$, and $L_{2}\rightarrow A_{2}$ are “broken”, that is, a pseudo-population in which (a) $L_{1}$ no longer predicts $A_{1}$ or $A_{2}$ and (b) $L_{2}$ no longer predicts $A_{2}$, given $L_{1}$ and $A_{1}$. The first condition requires $L_{1}$ to be marginally independent of both $A_{1}$ and $A_{2}$. Thus, any function of $L_{1}$ should be uncorrelated with any function of $A_{1}$ and $A_{2}$ in the weighted population. The second condition, by contrast, requires $L_{2}$ to be conditionally independent of $A_{2}$, given $L_{1}$ and $A_{1}$. To this end, we could divide the original population into a number of strata defined by $L_{1}$ and $A_{1}$ and then balance $L_{2}$ across levels of $A_{2}$ within each stratum. This approach, however, becomes impractical when $L_{1}$ and $A_{1}$ are continuous and/or multidimensional. To circumvent this problem, our method invokes a model for the conditional mean of $L_{2}$ (or some function of $L_{2}$), given $L_{1}$ and $A_{1}$, and it then balances the residuals from this model across levels of $A_{2}$ and levels of $(L_{1},A_{1})$. This procedure breaks the confounding arrow $L_{2}\rightarrow A_{2}$ but preserves the causal arrow $A_{1}\rightarrow L_{2}$, thereby adjusting properly for the observed post-treatment confounders while avoiding bias due to overcontrol and collider stratification. Taken together, the balancing conditions for both $L_{1}$ and $L_{2}$ yield a weighted population in which all the confounding arrows ($L_{1}\rightarrow A_{1}$, $L_{1}\rightarrow A_{2}$, and $L_{2}\rightarrow A_{2}$) are “broken” and all the other arrows are left intact. An MSM can then be fit to this population in order to estimate the average causal effects of $A_{1}$ and $A_{2}$ on $Y$.
3.2 Implementation
In practice, residual balancing requires specifying a set of $g(\cdot )$ functions that constitute $G(L_{t})$. A natural choice is to set $g_{j}(L_{t})=L_{jt}$, where $L_{jt}$ is the $j$th element of the covariate vector $L_{t}$. If there is concern about confounding by higher-order or interaction terms, they can also be included in $G(L_{t})$. Then, the residual terms, $\delta (g(L_{t}))$, need to be estimated from the data. Because $\delta (g(L_{t}))=g(L_{t})-\mathbb{E}[g(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}]$, they can be estimated by fitting GLMs for $g(L_{t})$ and then extracting the response residuals, $\hat{\delta }(g(L_{t}))=g(L_{t})-m(\hat{\beta }_{t}^{T}r(\overline{L}_{t-1},\overline{A}_{t-1}))$, where $r(\overline{L}_{t-1},\overline{A}_{t-1})=[r_{1}(\overline{L}_{t-1},\overline{A}_{t-1}),\ldots ,r_{L_{t}}(\overline{L}_{t-1},\overline{A}_{t-1})]$ is a vector of regressors and $m(\cdot )$ denotes the inverse link function of the GLM.
In addition, residual balancing requires specifying a set of $h(\cdot )$ functions that constitute $H(\overline{L}_{t-1},\overline{A})$. Because weighting is intended to neutralize the relationship between $L_{t}$ and future treatments, we suggest including all future treatments, $A_{t},A_{t+1},\ldots ,A_{T}$, in $H(\overline{L}_{t-1},\overline{A})$. However, if it is reasonable to assume that the effects of $L_{t}$ on future treatments stop at $A_{t^{\prime }}$, where $t\leqslant t^{\prime }<T$, treatments beyond time $t^{\prime }$ may be excluded from $H(\overline{L}_{t-1},\overline{A})$. Equation (13) additionally indicates that $\delta (g(L_{t}))$ should be uncorrelated with past treatments, $\overline{A}_{t-1}$, and past confounders, $\overline{L}_{t-1}$, in the weighted pseudo-population. Because $\mathbb{E}[\delta (g(L_{t}))|\overline{L}_{t-1},\overline{A}_{t-1}]=0$ by construction, zero correlation is guaranteed in the original unweighted population, and when the GLMs for $g(L_{t})$ are Gaussian, binomial, or Poisson regressions with canonical links, the score equations ensure that the response residuals, $\hat{\delta }(g(L_{t}))$, are orthogonal to the regressors $r(\overline{L}_{t-1},\overline{A}_{t-1})$ in the original sample. But to ensure that the response residuals, $\hat{\delta }(g(L_{t}))$, are also orthogonal to the regressors in the weighted sample, we suggest including all elements of $r(\overline{L}_{t-1},\overline{A}_{t-1})$ in $H(\overline{L}_{t-1},\overline{A})$.
In general, then, $H(\overline{L}_{t-1},\overline{A})$ should include all future treatments as well as all regressors in the GLMs for $g(L_{t})$, including an intercept. A reassuring property of this specification for $H(\overline{L}_{t-1},\overline{A})$ is that if the GLMs for $g(L_{t})$ are Gaussian, binomial, or Poisson regressions with canonical links and they are fit to the weighted sample with all future treatments, $A_{t},A_{t+1},\ldots ,A_{T}$, as additional regressors, the coefficients on future treatments will all be exactly zero and the coefficients on $r(\overline{L}_{t-1},\overline{A}_{t-1})$ will be the same as those in the original sample. Therefore, when the GLMs for $g(L_{t})$ are correctly specified, the first moments of $g(L_{t})$ are guaranteed to be balanced across future treatments, conditional on past treatments and confounders, as would be expected in a scenario where treatment is unconfounded by $\overline{L}_{t}$.
In sum, a typical implementation of residual balancing for estimating the marginal effects of a time-varying treatment proceeds in two steps:
- (1) At each time point $t$ and for each confounder $j$, fit a linear, logistic, or Poisson regression of $l_{ijt}$, as appropriate given its level of measurement, on $\overline{l}_{i,t-1}$ and $\overline{a}_{i,t-1}$, and then compute the response residuals, $\hat{\delta }(l_{ijt})$.
- (2) Find a set of weights, $rbw_{i}$, such that:
  - (a) in the weighted sample, the residuals, $\hat{\delta }(l_{ijt})$, are orthogonal to all future treatments and the regressors of $l_{ijt}$;
  - (b) the relative entropy between $rbw_{i}$ and the base weights, $q_{i}$, is minimized.

The weighting solution can then be used to fit any MSM of interest.
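The sketch below assembles these two steps for a simple case with two waves and one time-varying confounder per wave. It is an illustration of the procedure under stated assumptions, not the rbw package interface: the data frame dat and its columns (l1, l2, a1, a2, y) are hypothetical, and eb_weights() refers to the entropy-balancing helper sketched in Section 3.1.

```r
# Residual balancing with T = 2 waves and one time-varying confounder per wave.
# `dat` is a hypothetical data frame with columns l1, l2, a1, a2, and y;
# eb_weights() is the entropy-balancing helper sketched in Section 3.1.

# Step 1: residualize each confounder with respect to the observed past.
d1 <- dat$l1 - mean(dat$l1)          # wave 1: only an intercept model, so center l1
m2 <- lm(l2 ~ l1 + a1, data = dat)   # conditional-mean model for l2 given the past
d2 <- residuals(m2)

# Step 2: build the balancing conditions of equation (14): each residual must be
# orthogonal to future treatments and to the regressors of its own model.
C <- cbind(d1, d1 * dat$a1, d1 * dat$a2,   # delta(l1) x {1, a1, a2}
           d2 * dat$a2,                    # delta(l2) x a2 (the future treatment)
           d2 * model.matrix(m2))          # delta(l2) x {1, l1, a1}
rbw <- eb_weights(C)

# Step 3: fit the MSM of interest with the residual balancing weights
# (standard errors should come from a robust or jackknife estimator).
msm <- lm(y ~ a1 + a2, data = dat, weights = rbw)
```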
3.3 Application to Causal Mediation
Residual balancing can also be used to estimate an MSM for the joint effects of a point-in-time treatment, $A$, and mediator, $M$, in the presence of both baseline confounders, $X$, and a set of post-treatment confounders, $Z$, for the mediator–outcome relationship. In this setting, residual balancing is implemented using essentially the same procedure as outlined previously but with several minor adaptations. First, for each baseline confounder $X_{j}$, compute the response residuals, $\hat{\delta }(x_{ij})$, by centering it around its sample mean. Then, for each post-treatment confounder $Z_{j}$, fit a linear, logistic, or Poisson regression of $z_{ij}$, depending on its level of measurement, on $x_{i}$ and $a_{i}$, and then compute the response residuals, $\hat{\delta }(z_{ij})$. Finally, find a set of weights, $rbw_{i}$, such that, in the weighted sample, the baseline residuals $\hat{\delta }(x_{ij})$ are orthogonal to both treatment $a_{i}$ and the mediator $m_{i}$; the post-treatment residuals $\hat{\delta }(z_{ij})$ are orthogonal to treatment, the mediator, and the pretreatment confounders $x_{ij}$; and the relative entropy between $rbw_{i}$ and the base weights $q_{i}$ is minimized. The weighting solution can then be used to fit any MSM for the joint effects of the treatment and mediator on the outcome, from which the CDE of interest is constructed. Alternatively, it is also possible to skip the first step and construct weights that only balance the residualized post-treatment confounders, in which case the baseline confounders $X$ must be included as regressors in the MSM.
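A compact sketch of this adaptation, again under hypothetical variable names (a data frame d with outcome y, treatment a, mediator m, a baseline confounder x, and a post-treatment confounder z) and reusing the eb_weights() helper from Section 3.1:

```r
# Residual balancing weights for the CDE of A on Y, controlling for M.
dx <- d$x - mean(d$x)            # baseline confounder: center on its sample mean
mz <- lm(z ~ x + a, data = d)    # conditional-mean model for the post-treatment confounder
dz <- residuals(mz)

C <- cbind(dx, dx * d$a, dx * d$m,   # delta(x) x {1, a, m}
           dz * d$m,                 # delta(z) x m
           dz * model.matrix(mz))    # delta(z) x {1, x, a}
rbw <- eb_weights(C)

# MSM for the joint effects of treatment and mediator; CDE(m) = alpha1 + alpha3 * m.
cde_fit <- lm(y ~ a * m, data = d, weights = rbw)
```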
3.4 Comparison with Existing Methods
Compared with IPW, residual balancing has both advantages and limitations. On the one hand, because it does not require explicit models for the conditional distribution of exposure to treatment and/or a mediator, residual balancing is robust to the bias that results when these models are misspecified, and it is easy to use with both binary and continuous exposures. Also, by minimizing the relative entropy between the balancing weights and the base weights, the method tends to avoid highly variable and extreme weights, thus yielding more stable estimates of causal effects.
On the other hand, residual balancing requires models for the conditional means of the post-treatment confounders (or transformations thereof). When these models are misspecified, the moment condition in equation (11) is only partially achieved: equation (12) implies that future treatments (i.e., $A_{t},A_{t+1},\ldots ,A_{T}$) may still be unconfounded in the weighted pseudo-population, but the pseudo-population no longer mimics the original unweighted population. As a result, estimates of marginal effects based on residual balancing weights may be biased. In addition, even when models for $\mathbb{E}[g(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}]$ are correctly specified, residual balancing estimates of marginal effects may still be biased if the balancing conditions are insufficient. For example, if both the treatment and outcome are affected by the product of two confounders, say $L_{1t}L_{2t}$, but $L_{1t}$ and $L_{2t}$ are only included separately in the $G(L_{t})$ functions, uncontrolled confounding may still be present in the weighted sample, leading to bias.
Residual balancing is similar to the CBPS method (Imai and Ratkovic 2015) in that it seeks a set of weights that balance time-varying confounders across future treatments by explicitly specifying a set of balancing conditions. Residual balancing differs from CBPS, however, in two important respects. First, unlike CBPS, residual balancing can easily accommodate continuous treatments and/or mediators. As mentioned previously, this is because residual balancing does not require parametric models for exposure to treatment and/or a mediator, and, thus, it can balance confounders across both binary and continuous treatments using a common set of balancing conditions (equation (14)). CBPS, by contrast, is based on a parametric logistic model for the propensity score, and it is, therefore, limited to settings with binary treatments and/or mediators.
Second, residual balancing allows for the specification of more flexible and parsimonious balancing conditions than those specified with the CBPS method. In fact, the balancing conditions specified by CBPS can also be generated within the residual balancing framework. To see the connection, note that CBPS attempts to balance the time-varying confounders across all possible sequences of future treatments within each possible history of past treatments. Thus, for each confounder $j$, there are $2^{t-1}\times (2^{T-t+1}-1)=2^{T}-2^{t-1}$ balancing conditions at time $t$. Summing over $t$ and $j$, the total number of balancing conditions associated with CBPS is $n_{c}^{\text{CBPS}}=J[(T-1)2^{T}+1]$. Because $n_{c}^{\text{CBPS}}\sim O(J\cdot T\cdot 2^{T})$, the number of balancing conditions can easily exceed the sample size, in which case they are, at best, approximated (even without the method’s parametric constraints). With residual balancing, the number of balancing conditions $n_{c}=\sum _{t=1}^{T}J_{t}K_{t}$ depends on the specification of $G(L_{t})$ and $H(\overline{L}_{t-1},\overline{A})$. As mentioned previously, a natural specification of $G(L_{t})$ is $\{L_{1t},L_{2t},\ldots ,L_{Jt}\}$. If $\mathbb{E}[g_{j}(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}]$ is then modeled with a saturated GLM of $L_{jt}$ on $\overline{A}_{t-1}$ only and $H(\overline{L}_{t-1},\overline{A})$ is defined as a set of dummy variables for each possible sequence of future treatments interacted with each possible history of past treatments, the balancing conditions in equation (14) would be equivalent to those for the CBPS method.
With residual balancing, however, $G(L_{t})$, $\mathbb{E}[g_{j}(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}]$, and $H(\overline{L}_{t-1},\overline{A})$ can be specified more flexibly. For example, when a parsimonious GLM is used to fit $\mathbb{E}[g_{j}(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}]$, and only the $L_{t}$ regressors of $g_{j}(L_{t})$ and the $T-t+1$ future treatments are included in $H(\overline{L}_{t-1},\overline{A})$, the number of balancing conditions will be $n_{c}=J\sum _{t=1}^{T}(T-t+1+L_{t})$, which is substantially smaller than $n_{c}^{\text{CBPS}}$. In large and even moderately sized samples, these balancing conditions can often be satisfied exactly.
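For example, with $J=4$ time-varying confounders, $T=3$ time periods, and conditional-mean models that use only the $L_{t}=2$ regressors $\{1,A_{t-1}\}$ (as in the simulations below), CBPS implies $n_{c}^{\text{CBPS}}=4[(3-1)2^{3}+1]=68$ balancing conditions, whereas residual balancing requires only $n_{c}=4\sum _{t=1}^{3}(3-t+1+2)=48$. The gap widens rapidly with the number of time periods: at $T=5$, the corresponding counts are $4[(5-1)2^{5}+1]=516$ versus $4\times 25=100$.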
4 Simulation Experiments
In this section, we conduct a set of simulation studies to assess the performance of residual balancing for estimating marginal effects with (a) a binary time-varying treatment under correct model specification, (b) a binary time-varying treatment under incorrect model specification, (c) a continuous time-varying treatment under correct model specification, and (d) a continuous time-varying treatment under incorrect model specification. In each of these four settings, we compare residual balancing with four variants of IPW: conventional IPW with weights estimated from GLMs (IPW-GLM), IPW with weights estimated from GLMs and then censored (IPW-GLM-Censored), IPW with weights estimated from CBPS (IPW-CBPS), and as a benchmark, IPW with weights based on the true exposure probabilities (IPW-Truth). Because the CBPS method has not been extended for continuous treatments in the time-varying setting, we assess the performance of IPW-CBPS only for binary treatments.Footnote 5
The data generating process (DGP) in our simulations is similar to that of Imai and Ratkovic (2015). It involves four time-varying covariates measured at $T=3$ time periods with a sample of $n=1000$. At each time $t$, the covariates $L_{t}$ are determined by treatment at time $t-1$ and a multiplicative error: $L_{t}=(U_{t}\epsilon _{1t},U_{t}\epsilon _{2t},|U_{t}\epsilon _{3t}|,|U_{t}\epsilon _{4t}|)$, where $U_{1}=1$, $U_{t}=(5/3)+(2/3)A_{t-1}$ for $t>1$ and $\epsilon _{jt}\sim N(0,1)$ for $1\leqslant j\leqslant 4$. Treatment at each time $t$ depends on prior treatment at time $t-1$ and the covariates $L_{t}$. Specifically, when treatment is binary, it is generated as a Bernoulli draw with probability $p=\text{logit}^{-1}[-A_{t-1}+\gamma ^{T}L_{t}+(-0.5)^{t}]$, and when treatment is continuous, it is generated as $A_{t}\sim N(\mu _{t}=-A_{t-1}+\gamma ^{T}L_{t}+(-0.5)^{t},\sigma _{t}^{2}=2^{2})$, where $A_{0}=0$ and $\gamma =\alpha (1,-0.5,0.25,0.1)^{T}$. Here, we use the $\alpha $ parameter to control the level of treatment–outcome confounding. We consider two values of $\alpha $, 0.4 and 0.8, corresponding to scenarios where treatment–outcome confounding is weak and strong, respectively. Finally, the outcome is generated as $Y\sim N(\mu =250-10\sum _{t=1}^{3}A_{t}+\sum _{t=1}^{3}\delta ^{T}L_{t},\sigma ^{2}=5^{2})$, where $\delta =(27.4,13.7,13.7,13.7)^{T}$. To assess the impact of model misspecification, we use the same DGP, but we recode the “observed” covariates as nonlinear transformations of the “true” covariates: specifically, $L_{t}^{\ast }=(L_{1t}^{3},6\cdot L_{2t},\log (L_{3t}+1),1/(L_{4t}+1))^{T}$. We then use only the transformed covariates, $L_{t}^{\ast }$, to implement IPW, its variants, and residual balancing. For IPW and its variants, using the transformed covariates leads to misspecification of the treatment assignment model. For residual balancing, the conditional mean model for $L_{jt}^{\ast }$ is still correct when treatment is binary but incorrect when treatment is continuous. However, in both cases, using the transformed covariates (instead of the original covariates) leads to misspecification of the balancing conditions.
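For concreteness, the sketch below generates one simulated sample under the binary-treatment version of this DGP; the function name and organization are our own, and the continuous-treatment and misspecified variants follow analogously.

```r
# One simulated sample from the binary-treatment DGP (n = 1000, T = 3).
simulate_dgp <- function(n = 1000, alpha = 0.4) {
  gamma <- alpha * c(1, -0.5, 0.25, 0.1)
  delta <- c(27.4, 13.7, 13.7, 13.7)
  A <- matrix(0, n, 3)
  L <- vector("list", 3)
  A_prev <- rep(0, n)                              # A_0 = 0
  for (t in 1:3) {
    U <- if (t == 1) rep(1, n) else 5/3 + (2/3) * A_prev
    eps <- matrix(rnorm(n * 4), n, 4)
    L[[t]] <- cbind(U * eps[, 1], U * eps[, 2], abs(U * eps[, 3]), abs(U * eps[, 4]))
    p <- plogis(-A_prev + L[[t]] %*% gamma + (-0.5)^t)
    A[, t] <- rbinom(n, 1, p)
    A_prev <- A[, t]
  }
  y <- rnorm(n, mean = 250 - 10 * rowSums(A) +
               L[[1]] %*% delta + L[[2]] %*% delta + L[[3]] %*% delta, sd = 5)
  data.frame(a = A, l1 = L[[1]], l2 = L[[2]], l3 = L[[3]], y = y)
}
```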
For each scenario described previously, we generate 2500 random samples. Then, for each sample, we construct weights using IPW-GLM, IPW-GLM-Censored, IPW-CBPS, and residual balancing. With IPW-GLM, we estimate the weights using logistic regression for binary treatments and normal linear models for continuous treatments, assuming homoskedastic errors. With IPW-GLM-Censored, we follow Cole and Hernán’s (2008) example and censor weights at the 1st and 99th percentiles. With IPW-CBPS, we estimate weights using the methods proposed by Imai and Ratkovic (2015) with the function CBMSM() in the R package CBPS. With residual balancing, we set $G(L_{t})=L_{t}$, estimate the residual terms from linear models for $L_{t}$ with prior treatment $A_{t-1}$ as a regressor, and include in $H(\overline{L}_{t-1},\overline{A})$ both $A_{t}$ and the regressors in the model for $L_{t}$ (i.e., 1 and $A_{t-1}$). Finally, with each set of weights, we fit an MSM by regressing the outcome $Y$ on the three treatment variables $\{A_{1},A_{2},A_{3}\}$ and denote their coefficient estimates as $\hat{\beta }_{1}$, $\hat{\beta }_{2}$, and $\hat{\beta }_{3}$. We obtain the true values of these coefficients by simulating potential outcomes with the g-computation formula, regressing them on the treatment variables, and averaging their coefficients over a large number of simulations. The performance of each method is evaluated using the simulated sampling distributions of $\hat{\beta }_{1}$, $\hat{\beta }_{2}$, and $\hat{\beta }_{3}$.
Figure 2 presents results from simulations with a binary treatment. Specifically, this figure displays a set of violin plots, which show the sampling distributions of $\hat{\beta }_{1}$, $\hat{\beta }_{2}$, and $\hat{\beta }_{3}$ centered at the true values of these coefficients. In these plots, black dots represent means of the sampling distributions, and the shaded distributions highlight the estimator with the smallest root mean squared error (RMSE) in each scenario. The first two panels show the sampling distributions of the parameter estimates under correct model specification. Comparing the first and second panels, we see, first, that IPW and its variants suffer from finite-sample bias and may have skewed sampling distributions, especially when the covariates are strongly predictive of treatment. By contrast, residual balancing is roughly unbiased, and its estimates appear approximately normally distributed, regardless of the level of confounding. Second, the results indicate that residual balancing is much more efficient than IPW-GLM, especially when the level of confounding is high. In addition, with a high level of confounding, both IPW-GLM-Censored and IPW-CBPS yield much less variable estimates than IPW-GLM, but this gain in precision comes at the expense of greater bias. Residual balancing, by contrast, improves efficiency without inducing bias.
The last two panels of Figure 2 show the sampling distributions of parameter estimates under misspecified models where $L_{t}$ is measured incorrectly. In these simulations, the treatment assignment models for IPW and the balancing conditions for residual balancing are misspecified. As indicated by its extreme level of sampling variation, IPW-GLM is highly unstable when models for the conditional probability of treatment are misspecified. Consistent with Imai and Ratkovic (2015), IPW-CBPS appears more robust to model misspecification, as reflected in its substantially smaller sampling variation compared with IPW-GLM. However, this improvement in precision comes at the cost of greater bias. In addition, censoring the IPW also appears to substantially improve the method’s performance in the presence of misspecification. In fact, IPW-GLM-Censored outperforms IPW-CBPS in these simulations. Nevertheless, despite the improvements achieved by censoring the weights or using CBPS, residual balancing consistently produces the most accurate and efficient estimates across nearly all scenarios, even though its balancing conditions are incorrectly specified.
Figure 3 presents another set of violin plots based on simulations with a continuous treatment. As shown in the first two panels, when both the treatment assignment models and the confounder models are correctly specified, the bias for IPW and its variants increases substantially with the level of confounding. Residual balancing, by contrast, is approximately unbiased across both levels of confounding. Moreover, residual balancing consistently outperforms IPW and its variants in terms of efficiency. For example, residual balancing is the most accurate and precise estimator for $\beta _{2}$ and $\beta _{3}$ under both high and low levels of confounding, and for $\beta _{1}$, the performance of residual balancing is comparable to that of IPW-GLM-Censored.
The last two panels of Figure 3 present sampling distributions under misspecified models where $L_{t}$ is measured incorrectly. In these simulations, the treatment assignment models for IPW are misspecified, as are both the confounder models and the balancing conditions used with residual balancing. Consistent with the results discussed previously, this figure also indicates that IPW-GLM is extremely biased and inefficient under incorrect models for treatment, that censoring the weights reduces bias and improves efficiency, and that residual balancing yields by far the most accurate and efficient estimator among all methods. Indeed, residual balancing outperforms even IPW based on the true treatment densities, despite the fact that its confounder models and balancing conditions are both misspecified.
5 The Cumulative Effect of Negative Advertising on Vote Shares
In this section, we illustrate residual balancing by estimating the cumulative effect of negative campaign advertising on election outcomes (Lau, Sigelman, and Rovner 2007; Blackwell 2013; Imai and Ratkovic 2015). Drawing on U.S. Senate and gubernatorial elections from 2000 to 2006, Blackwell (2013) used MSMs with IPW to evaluate the cumulative effects of negative campaign advertising on election outcomes for 114 Democratic candidates. MSMs are appropriate for this problem because campaign advertising is a dynamic process plagued by post-treatment confounding. For example, candidates adjust their campaign strategies on the basis of current polling results, where trailing candidates are more likely to “go negative” than leading candidates. At the same time, polling results change over time and are likely affected by a candidate’s previous use of negative advertising.
Treatment, $A_{t}$, in this analysis is the proportion of campaign advertisements that are “negative” (i.e., that mention the opposing candidate) in each campaign week. Because IPW tends to perform poorly with continuous treatments, we also consider a binary version of treatment, $B_{t}$, for which the proportion of negative advertisements is dichotomized using a cutoff of 10%, as in Blackwell (2013). The time-varying confounders, $L_{t}$, included in this analysis are the Democratic share in the polls and the share of undecided voters in the previous campaign week. This analysis also uses a set of baseline confounders, $X$, including total campaign length, election year, incumbency status, and whether the election is for the Senate or a governor’s office. The outcome, $Y$, is the Democratic share of the two-party vote.
Following Imai and Ratkovic (2015), we focus on the final five weeks preceding the election and estimate an MSM for the binary version of treatment with the form

$\mathbb{E}[Y(\overline{b})|X]=\theta _{0}+\theta _{1}\,\text{cum}(\overline{b})+\theta _{2}\,\text{cum}(\overline{b})V+\theta _{3}^{T}X,\qquad (17)$

and an MSM for the continuous treatment with the form

$\mathbb{E}[Y(\overline{a})|X]=\beta _{0}+\beta _{1}\,\text{avg}(\overline{a})+\beta _{2}\,\text{avg}(\overline{a})V+\beta _{3}^{T}X.\qquad (18)$

In these models, $\text{cum}(\overline{b})$ denotes the total number of campaign weeks for which more than 10% of the candidate’s advertising was negative, $\text{avg}(\overline{a})$ denotes the average proportion of advertisements that were negative over the final five weeks of the campaign, and $V$ is an indicator of incumbency status used to construct interaction terms that allow the effect of negative advertising to differ between incumbents and nonincumbents.Footnote 6 Thus, the effect of an additional week with more than 10% negative advertising for nonincumbents is $\theta _{1}$, and for incumbents, it is $\theta _{1}+\theta _{2}$. Similarly, $\beta _{1}$ and $\beta _{1}+\beta _{2}$ correspond to the effects of a 1 percentage point increase in negative advertising for nonincumbents and incumbents, respectively. To facilitate comparison of results across the different versions of treatment, we report estimates for the effects of a 10 percentage point increase in negative advertising—that is, $10\beta _{1}$ and $10(\beta _{1}+\beta _{2})$.
We estimate these models with both IPW methods and residual balancing. Specifically, we first implement IPW-GLM by fitting, at each time point, a logistic regression of the dichotomized treatment on both time-varying confounders and baseline confounders, and then constructing the IPW using equation (5). Second, we implement IPW-CBPS with the same treatment assignment model using the function CBMSM() in the R package CBPS. Finally, we implement residual balancing by, first, fitting linear models for each covariate in $L_{t}$ ($t\geqslant 2$) with lagged values of treatment and the time-varying confounders as regressors, and then extracting residual terms $\hat{\delta }(L_{t})$. For each covariate in $L_{1}$, the residual term is computed as the deviation from its sample mean. Next, we find a set of minimum entropy weights such that, in the weighted sample, $\hat{\delta }(L_{t})$ is orthogonal to treatment at time $t$ and the regressors of $L_{jt}$. We compute estimates of standard errors using both the robust (i.e., “sandwich”) variance estimatorFootnote 7 and the jackknife method.Footnote 8 R code for implementing residual balancing in this analysis is available in Part C of the Supplementary Material.
Note to Table 1: For the dichotomized treatment, results represent the estimated marginal effects of an additional week with more than 10% negative advertising. For the continuous treatment, results represent the estimated marginal effects of a 10 percentage point increase in the average proportion of negative advertisements across all campaign weeks. The two numbers in each parenthesis are the robust (i.e., “sandwich”) and jackknife standard errors, respectively.
Results from these analyses are presented in Table 1, where the first two columns contain IPW-GLM, IPW-CBPS, and residual balancing estimates based on the dichotomized version of treatment. For nonincumbent candidates, these results suggest that the effect of negative advertising is positive. However, both IPW-CBPS and residual balancing yield point estimates that are considerably smaller than IPW-GLM. While IPW-GLM suggests that an additional week with more than 10% negative advertising increases a candidate’s vote share by 1.42 percentage points, on average, the estimated effect is reduced to 0.78 percentage points for IPW-CBPS and 0.98 percentage points for residual balancing. For incumbent candidates, all three methods indicate that negative advertising has a substantively large negative effect on vote shares. Residual balancing, for example, suggests that an additional week with more than 10% negative advertising decreases a candidate’s vote share by 1.67 percentage points, on average.
The last two columns of Table 1 present results based on the continuous version of treatment. Because IPW-CBPS has not been extended for continuous treatments in the time-varying setting, we focus on estimates from IPW-GLM and residual balancing. Overall, these results are quite consistent with those based on the dichotomized treatment. For nonincumbents, the effect of negative advertising appears to be positive, although the estimate from residual balancing is relatively small. For incumbents, both methods suggest a sizable negative effect. According to the residual balancing estimate, a 10 percentage point increase in the proportion of negative advertising throughout the final five weeks of the campaign reduces a candidate’s vote share by about one percentage point, on average.
6 The Controlled Direct Effect of Shared Democracy on Public Support for War
In this section, we reanalyze data from Tomz and Weeks (2013) to estimate the CDE of shared democracy on public support for war, controlling for a respondent’s perceived morality of war. With a nationally representative sample of 1273 US adults, Tomz and Weeks (2013) conducted a survey experiment to analyze the role of public opinion in the democratic peace, that is, the empirical regularity that democracies almost never fight each other. In this experiment, they presented respondents with a situation in which a country was developing nuclear weapons and, when describing the situation, they randomly and independently varied three characteristics of the country: its political regime (whether it was a democracy), alliance status (whether it had signed a military alliance with the United States), and economic ties (whether it had high levels of trade with the United States). They then asked respondents about their levels of support for a preventive military strike against the country’s nuclear facilities. The authors found that individuals are substantially less supportive of military action against democracies than against otherwise identical autocracies.
To investigate the causal mechanisms through which shared democracy reduces public support for war, Tomz and Weeks (2013) also measured each respondent’s beliefs about the threat posed by the potential adversary (threat), the cost of military intervention (cost), and the likelihood of victory (success). In addition, the authors assessed each respondent’s moral concerns about using military force (morality). With these data, they conducted a causal mediation analysis and found that shared democracy reduces public support for war, primarily by changing perceptions of the threat and morality of using military force. In this analysis, the authors examined the role of each mediator separately by assuming that they operate independently and do not influence one another. However, it is likely that one’s perception of morality is partly influenced by beliefs about the threat, cost, and likelihood of success, which also affect support for war directly. Thus, in the following analysis, we treat these variables as post-treatment confounders and reassess the mediating role of morality accordingly.
In these data, the outcome, $Y$, is a measure of support for war on a five-point scale; treatment, $A$, denotes whether the country developing nuclear weapons was presented as a democracy; the mediator, $M$, is a dummy variable indicating whether the respondent thought it would be morally wrong to strike; the baseline covariates $X$ include dummy variables for each of the two other randomized treatments (alliance status and economic ties) as well as a number of demographic and attitudinal controls; and the post-treatment confounders $Z$ include measures of the respondent’s beliefs about threat, cost, and likelihood of success.Footnote 9 We estimate the CDE of shared democracy, controlling for perceptions of morality, using an MSM with form

$\mathbb{E}[Y(a,m)|X]=\alpha _{0}+\alpha _{1}a+\alpha _{2}m+\alpha _{3}am+\alpha _{4}^{T}X.\qquad (19)$

In this model, we control for baseline covariates because, although treatment is randomly assigned, they may still confound the mediator–outcome relationship.Footnote 10 The CDE is given by $\text{CDE}(m)=\alpha _{1}+\alpha _{3}m$, where $\alpha _{1}$ measures the effect of shared democracy on support for war if none of the respondents had moral reservations about military intervention and $\alpha _{1}+\alpha _{3}$ measures the effect of shared democracy on support for war if all respondents thought it would be morally wrong to strike.
We estimate this model with both IPW-GLM and residual balancing weights. Specifically, we first implement IPW-GLM by fitting a logit model for $M$ with $X$, $A$, and $Z$ as regressors, by fitting a second logit model for $M$ with only $X$ and $A$ as regressors, and then by using the fitted values from these models to estimate a set of weights with the following form: $sw_{i}^{\dagger }=\frac{\mathbb{P}(M=m_{i}|X=x_{i},A=a_{i})}{\mathbb{P}(M=m_{i}|X=x_{i},A=a_{i},Z=z_{i})}$. Second, we implement residual balancing by fitting a linear model for each post-treatment confounder in $Z$ with $X$ and $A$ as regressors, computing residual terms $\hat{\delta }(Z)$, and then finding a set of minimum entropy weights such that, in the weighted sample, $\hat{\delta }(Z)$ is orthogonal to $M$ and the regressors of $Z$. Standard errors are computed using the robust (i.e., “sandwich”) variance estimator and the jackknife method. R code for implementing residual balancing in this analysis is available in Part C of the Supplementary Material.
Note to Table 2: Coefficients of pretreatment covariates are omitted. For ease of interpretation, all pretreatment covariates are centered at their means. The two numbers in each parenthesis are robust (i.e., “sandwich”) standard errors and jackknife standard errors, respectively.
As a benchmark, the first column of Table 2 presents an estimate of the total treatment effect from a regression of $Y$ on $X$ and $A$ . Consistent with the original study, we find that shared democracy significantly reduces public support for war—specifically, by 0.35 points on the five-point scale, or about 0.25 standard deviations. The next two columns present IPW and residual balancing estimates, respectively, for model (19). In this model, the “main effect” of shared democracy represents the estimated CDE if respondents had no moral reservations about military intervention, and the sum of this coefficient and the interaction term represents the estimated CDE if respondents did have moral reservations.
IPW and residual balancing yield somewhat different estimates of these effects. According to IPW, the estimated CDE of shared democracy is $-0.20$ if respondents had no moral concerns about war and $-0.25$ if respondents thought it was morally wrong to strike. According to residual balancing, by contrast, the estimated CDE is $-0.36$ if respondents had no moral concerns about war and $-0.22$ if respondents thought military intervention was morally wrong. Notwithstanding these differences, both IPW and residual balancing suggest that most of the total effect is “direct”, that is, transmitted through pathways other than morality.
7 Discussion and Conclusion
Post-treatment confounding arises in analyses of both time-varying treatments and causal mediation, where it complicates the use of conventional regression, matching, and balancing methods for causal inference. To adjust for this type of confounding, researchers most often use MSMs along with the associated method of IPW estimation (Robins Reference Robins, Halloran and Berry2000; Robins, Hernan, and Brumback Reference Robins, Hernan and Brumback2000; VanderWeele Reference VanderWeele2015). IPW, however, is highly sensitive to model misspecification, relatively inefficient, susceptible to finite-sample bias, and difficult to use with continuous treatments. Several remedies for these problems have been proposed, such as censoring the weights (Cole and Hernán Reference Cole and Hernán2008) or constructing them with CBPS (Imai and Ratkovic Reference Imai and Ratkovic2014, Reference Imai and Ratkovic2015), but these corrections are not without their own limitations.
In this article, we proposed the method of residual balancing for constructing weights that can be used to estimate MSMs. Like IPW, residual balancing avoids the bias that afflicts conventional methods of covariate adjustment when some or all of the covariates are post-treatment confounders. Unlike IPW, residual balancing does not require models for the conditional distribution of exposure to treatment and/or a mediator. Rather, it entails modeling only the conditional means of the post-treatment confounders, and because it simultaneously imposes covariate balancing and minimum entropy conditions on the weights, the method is both more efficient and more robust to model misspecification than IPW. It is also much easier to use with continuous treatments, obviating the need for the arbitrary quantile binning often employed in practice (e.g., Wodtke, Harding, and Elwert Reference Wodtke, Harding and Elwert2011; Blackwell Reference Blackwell2013).
Residual balancing also appears to outperform IPW even when the weights are constructed with CBPS, which likewise incorporates explicit balancing conditions into the estimation of the conditional probabilities of exposure. The reason, we believe, is that IPW with CBPS is torn between two conflicting goals. On the one hand, it imposes a parametric logistic model on the propensity score, which limits the number of balancing conditions that can be satisfied. On the other hand, it attempts to balance the time-varying confounders across all possible sequences of future treatments within all possible histories of prior treatments, generating an extremely large number of balancing conditions. The search for covariate balancing weights is therefore almost always an overidentified problem with CBPS, yielding weights that can, at best, satisfy the balancing conditions only approximately. In this situation, IPW with CBPS may remain biased if certain important balancing conditions are not well satisfied in the weighted sample. By contrast, residual balancing does not impose a parametric structure on the conditional probability/density of the exposure. Moreover, it models the conditional means of the time-varying confounders and balances only their residuals across a parsimonious representation of future treatments and the observed past. The search for residual balancing weights is therefore often an underidentified problem, leading to exact, rather than approximate, balance in the weighted sample.
Despite its many advantages, residual balancing is still limited in several ways. First, it requires modeling the conditional means of the post-treatment confounders (or transformations thereof). As noted earlier, when these models are misspecified, the pseudo-population created by the residual balancing weights will no longer mimic the original unweighted population, rendering estimates of marginal effects biased for the target quantities of interest. This problem might be mitigated in practice by combining residual balancing with a sensitivity analysis that assesses the robustness of estimates to different parametric models for the post-treatment confounders. Another remedy might involve fitting nonparametric or semiparametric models for $\mathbb{E}[g(L_{t})|\overline{L}_{t-1},\overline{A}_{t-1}]$, although this may engender inferential problems (e.g., a lack of $\sqrt{n}$-consistency; see Newey Reference Newey1994), and additional research is needed to better understand the method’s performance with these types of models for the post-treatment confounders.
Second, even when models for the conditional means of the post-treatment confounders are correctly specified, residual balancing estimates of marginal effects may still be biased if the balancing conditions are insufficient. In practice, this bias can be mitigated by including more functions (e.g., cross-product and higher-order terms) in $G(L_{t})$ . Nevertheless, if there are a large number of time-varying confounders, inclusion of their cross-product and higher-order terms would multiply the number of balancing conditions, making exact balance more difficult to achieve. In those cases, the balancing conditions in equation (15) may need to be relaxed to allow for approximate, rather than exact, balance (e.g., Wang and Zubizarreta Reference Wang and ZubizarretaForthcoming). We leave this extension for future work.
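As a rough illustration of the first of these remedies, and continuing the hedged R sketch from the empirical example above (with the same hypothetical variable names, where C is the balance-condition matrix and bal_vars collects the mediator and the regressors of $Z$), additional functions of the post-treatment confounders can be modeled, residualized, and appended to the set of balancing conditions before the weights are re-solved:

```r
# Hypothetical continuation of the earlier sketch: enrich G(L_t) with a square
# and a cross-product term, residualize each given X and A, and append the
# implied balance conditions to C before re-solving the dual for the weights.
extra_fns <- with(dat, list(threat_sq = threat^2, threat_x_cost = threat * cost))
extra_res <- lapply(extra_fns,
                    function(g) resid(lm(g ~ x1 + x2 + democ, data = dat)))
C_rich <- cbind(C, do.call(cbind, lapply(extra_res, function(r) r * bal_vars)))
```

Each added function multiplies the number of balance conditions by the number of columns of bal_vars, which is precisely why exact balance becomes harder to achieve as the set of time-varying confounders grows and why approximate balance may then be preferable.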
Another important direction for future research will be to further investigate the theoretical properties of residual balancing. For example, consistency may be established if the method can be recast as a form of IPW with treatment probabilities/densities estimated from a proper scoring rule (an objective function that is not necessarily the log-likelihood). As Zhao and Percival (Reference Zhao and Percival2017) show, when treatment is binary and the estimand is the average treatment effect on the treated, entropy balancing weights can be recast as IPW estimated from a tailored objective function that differs from the Bernoulli likelihood. However, this relationship does not hold when the estimand is the average treatment effect (ATE). Specifically, Zhao (Reference Zhao2019) shows that IPW for the ATE can be viewed as a set of covariate balancing weights only when a different loss function ($\sum_{i}(w_{i}-1)\log(w_{i}-1)-w_{i}$), rather than the entropy loss ($\sum_{i}w_{i}\log w_{i}$), is used in the optimization problem. This result suggests that alternative loss functions may be required to establish a formal link between residual balancing and IPW. Future work should therefore explore the properties and performance of residual balancing with a variety of loss functions, including but not limited to the entropy loss on which we focus in the present study.
These limitations notwithstanding, residual balancing appears to provide an efficient and robust method of constructing weights for MSMs. It should therefore find wide application in analyses of time-varying treatments and causal mediation, wherever post-treatment confounding presents itself. To facilitate its implementation in practice, we have developed an open-source R package, rbw, for constructing residual balancing weights, which is available from GitHub: https://github.com/xiangzhou09/rbw. A Stata package with similar functionality is also available from GitHub: https://github.com/gtwodtke/rbw. In addition, Part C of the Supplementary Material provides R code illustrating the use of rbw in our two empirical examples.
Acknowledgements
The authors benefited from communications with Justin Esarey, Kosuke Imai, Gary King, José Zubizarreta, and participants of the Applied Statistics Workshop at Harvard University, the Political Methodology Speaker Series at MIT, and the 35th Annual Meeting of the Society for Political Methodology at Brigham Young University.
Data Availability Statement
Replication data are available in Zhou and Wodtke (Reference Zhou and Wodtke2020).
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2020.2.