
Messy Data, Robust Inference? Navigating Obstacles to Inference with bigKRLS

Published online by Cambridge University Press:  26 September 2018

Pete Mohanty*
Affiliation:
Stanford University, Statistics, Sequoia Hall, 390 Serra Mall, Stanford, CA 94305, USA. Email: pmohanty@stanford.edu
Robert Shaffer
Affiliation:
Department of Government, The University of Texas at Austin, Batts Hall 2.116, Austin, TX 78712-1704, USA. Email: rbshaffer@utexas.edu

Abstract

Complex models are of increasing interest to social scientists. Researchers interested in prediction generally favor flexible, robust approaches, while those interested in causation are often interested in modeling nuanced treatment structures and confounding relationships. Unfortunately, estimators of complex models often scale poorly, especially if they seek to maintain interpretability. In this paper, we present an example of such a conundrum and show how optimization can alleviate the worst of these concerns. Specifically, we introduce bigKRLS, which offers a variety of statistical and computational improvements to the Hainmueller and Hazlett (2013) Kernel-Regularized Least Squares (KRLS) approach. As part of our improvements, we decrease the estimator’s single-core runtime by 50% and reduce the estimator’s peak memory usage by an order of magnitude. We also improve uncertainty estimates for the model’s average marginal effect estimates—which we test both in simulation and in practice—and introduce new visual and statistical tools designed to assist with inference under the model. We further demonstrate the value of our improvements through an analysis of the 2016 presidential election, an analysis that would have been impractical or even infeasible for many users with existing software.

Type
Articles
Copyright
Copyright © The Author(s) 2018. Published by Cambridge University Press on behalf of the Society for Political Methodology. 

1 Introduction

Statistical performance and interpretability are desirable attributes for any modeling approach, particularly in social science research. As ongoing political commentary reminds us, both academic and broader communities care about whether predictions are accurate, robust, and interpretable. In some applications, prediction may be a goal in and of itself, whether or not the model in question illuminates underlying causal mechanisms. However, even in these settings, interpretability is a helpful trait, allowing researchers to check assumptions and guard against overfitting.

Unfortunately—but unsurprisingly—modeling strategies that excel at these criteria often exhibit severe scalability constraints. For a concrete example, consider the Hainmueller and Hazlett (Reference Hainmueller and Hazlett2013) Kernel-Regularized Least Squares (KRLS) approach. KRLS offers a desirable balance of interpretability, flexibility, and theoretical guarantees, primarily through the pointwise marginal effects estimates and their corresponding averages. However, pointwise marginal effects are costly to estimate in both time and memory, whether via KRLS or through related techniques such as Bayesian additive regression trees (Chipman et al. Reference Chipman, George and McCulloch2010) or LASSOplus (Ratkovic and Tingley Reference Ratkovic and Tingley2017). As a result, optimization (both in memory and speed) is important to make KRLS a practical choice in many applied settings.

In this paper, we present a series of improvements to KRLS’s statistical performance, speed, and memory usage, which we implement in the bigKRLS package. Compared with the original KRLS library,Footnote 1 our algorithm decreases runtime by approximately 75%Footnote 2 and reduces peak memory usage by approximately an order of magnitude. These improvements allow users to straightforwardly fit models via KRLS to larger datasets ( $N>3,500$ ) on personal machines, which was not possible using existing implementations. We also develop an updated significance test for the average marginal effect (AME) estimates—which we justify both theoretically and in simulation—and a new inferential statistic designed to help identify the presence of heterogeneous effects.

After describing our methodological contributions, we illustrate the practical utility of bigKRLS through an extended examination of the so-called “communities in crisis” explanation for the 2016 presidential election. Our estimates—which are robust both to our significance correction and to a series of cross-validations—straightforwardly address hypotheses advanced by postelection commentary. Due to sample size constraints, the model we estimate would have been impractical to fit on a personal computer using the original KRLS algorithm, but runs smoothly with bigKRLS, highlighting the importance of optimization work for applied political science tasks.

2 Data Science as Interpretability versus Complexity

2.1 Desirable estimator properties

When selecting an estimator, there are an array of properties that we might value. For example, we might want our estimator to be unbiased or efficient, or we might want it to minimize some particular loss function (e.g., mean squared error). In theoretical settings, we generally assume that our model of interest captures the “true” data-generating process; however, in applied settings, we are usually—and, often, rightly—skeptical of such assumptions. As a result, we also want estimators to be robust against violations of potentially problematic modeling assumptions (e.g., incorrect functional form or omitted variables). For this reason, we sometimes place a premium on predictive accuracy against held-out test data.

Besides these traits, however, we also favor models whose results are easily interpreted. Compared with the traits described above, “interpretability” does not possess a particularly precise definition. However, we can colloquially view a model as more “interpretable” to the extent that its estimates allow researchers to answer useful questions with minimal additional effort. A model like linear regression, for example, directly estimates coefficients that offer information about the marginal effect of some covariates $\mathbf{X}$ on a dependent variable $\mathbf{y}$ . By contrast, prediction-oriented models like random forests (Breiman Reference Breiman2001) offer more limited options. If a researcher wishes to learn about the data-generating process using a random forest, her choices are either to inspect relatively uninformative summary statistics such as variable importance or to generate first difference estimates for particular values of interest through perturbations of the input data.

Many attributes of a model can influence its interpretability. For example, models that offer simple, familiar estimates such as average treatment effect estimates alongside more nuanced counterparts can make their contents more accessible, serving readers with different backgrounds and levels of experience. Similarly, modeling strategies that reduce the number of nonzero effect estimates or the complexity of their functional form tend to ease interpretation. Regularization constraints—which are explicitly designed to allow researchers to ignore some parameters by shrinking their values to or near zero (Hastie, Tibshirani, and Wainwright Reference Hastie, Tibshirani and Wainwright2015)—offer a direct example of this kind of strategy.Footnote 3

Importantly, we do not mean to suggest that these are the only traits that contribute to model interpretability, or that interpretability (however defined) is the only standard by which a model ought to be judged. Depending on the application, researchers might be willing to employ a less interpretable model in exchange for improved predictive performance or model fit. In general, however, we argue that all of these traits represent important modeling goals, which need to be balanced in context.

2.2 The complexity frontier

Unfortunately, estimators that excel at both interpretability and prediction in the face of challenging data-generating processes are often highly complex. Here, we use “complexity” in the algorithmic sense, referring to the CPU and memory resources needed to estimate a model given the size of the inputs (Papadimitriou Reference Papadimitriou2003). Algorithmic complexity is usually represented using order notation: so, an $O(N)$ algorithm is one whose complexity grows linearly with $N$ , and an $O(\log (N))$ algorithm is one whose complexity grows logarithmically with $N$ .Footnote 4 For example, linear regression with $N$ observations and $P$ covariates has complexity approximately $O(P^{2}N)$ (since calculating $\mathbf{X}^{\prime }\mathbf{X}$ dominates other calculations involved in generating $\hat{\unicode[STIX]{x1D6FD}}_{OLS}$ ).Footnote 5 Since $N$ is usually much larger than $P$ , estimating a linear model via Ordinary Least Squares (OLS) has complexity which is approximately linear with respect to $N$ .
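To make the order-notation claim concrete, the following R sketch computes OLS coefficients in the manner footnote 5 describes, forming $\mathbf{X}^{\prime }\mathbf{X}$ (the $O(P^{2}N)$ step) and then solving via a Cholesky factorization rather than inverting $\mathbf{X}^{\prime }\mathbf{X}$ directly. It is an illustration of the complexity argument only, not code used elsewhere in this paper.

```r
# OLS via X'X and a Cholesky solve: forming X'X is the O(P^2 N) bottleneck,
# while the factorization and triangular solves are only O(P^3).
set.seed(1)
N <- 10000; P <- 20
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))   # design matrix with intercept
y <- X %*% rnorm(P) + rnorm(N)

XtX <- crossprod(X)       # X'X: O(P^2 N)
Xty <- crossprod(X, y)    # X'y: O(P N)
R <- chol(XtX)            # upper-triangular Cholesky factor: O(P^3)
beta_hat <- backsolve(R, forwardsolve(t(R), Xty))

max(abs(beta_hat - coef(lm(y ~ X - 1))))   # matches lm() up to rounding error
```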

At one end of the spectrum, under appropriate assumptions linear regression directly calculates causally interpretable effects, but it is sensitive to model specification choices and often predicts relatively poorly. On the other end of the spectrum, decision trees do not calculate causally interpretable effect estimates, but are highly flexible, make few assumptions about the data-generating process, and often possess excellent out-of-sample performance. In exchange for these desirable properties, however, decision trees are substantially more complex than ordinary linear regression. In rough terms, a single decision tree has complexity $O(N\log (N)^{2})+O(PN\log (N))$ .Footnote 6 Generally, decision trees perform better when used in an ensemble approach such as a random forest (Breiman Reference Breiman2001), leading users to generate hundreds or thousands of such trees for any given application.

Models that attempt to optimize all of these traits simultaneously quickly encounter what we might call the computational complexity frontier. Flexibility with respect to functional form, sparsity constraints, and related modeling strategies all impose a substantial computational burden, rendering them impractical for particularly large datasets. Importantly, in many applications, interpretability also factors into this tradeoff. Again, regularization strategies (e.g., LASSO) and cross-validated parameter selection approaches offer canonical examples of this relationship.Footnote 7

3 Complexity and Interpretability with KRLS

Kernel regularization techniques have a long history in the computer science and statistics literatures (Rifkin, Yeo, and Poggio Reference Rifkin, Yeo and Poggio2003; Yu, Xu, and Gong Reference Yu, Xu, Gong, Koller, Schuurmans, Bengio and Bottou2009).Footnote 8 Hainmueller and Hazlett (Reference Hainmueller and Hazlett2013) demonstrate that this approach is useful for inference as well as prediction, primarily through the derivative estimators they develop. However, the approach they present also offers a stark example of the tradeoffs between statistical performance, interpretability, and complexity we describe in the previous section (see Table 1). To build intuition, we briefly discuss key features of the KRLS approach in this section.Footnote 9

3.1 An overview of KRLS

Begin by considering the following model:

$$\begin{eqnarray}\mathbf{y}_{i}=c_{1}k(\mathbf{X}_{i},\mathbf{X}_{1})+c_{2}k(\mathbf{X}_{i},\mathbf{X}_{2})+\cdots +c_{N}k(\mathbf{X}_{i},\mathbf{X}_{N})+\unicode[STIX]{x1D750}_{i},\end{eqnarray}$$

where $\mathbf{c}$ represents a vector of coefficients and $k$ represents a kernel function that quantifies the pairwise similarity between two vectors of data. Defining $\mathbf{K}$ as a matrix of such similarity values, we can express the model for the full sample as:

$$\begin{eqnarray}\mathbf{y}=\mathbf{Kc}+\unicode[STIX]{x1D750}.\end{eqnarray}$$

Viewed from this perspective, this model treats the dependent variable as a linear and additive combination of the pairwise similarity between a given observation and each other observation in the dataset, as calculated using the predictor matrix $\mathbf{X}$ . These similarity values are then weighted by a set of so-called “choice coefficients” $\mathbf{c}$ , which reflect each observation’s influence on the conditional expectation function.

In principle, any similarity kernel function $k$ can be used to estimate the model, but we focus here on the Gaussian kernel:Footnote 10

$$\begin{eqnarray}k(\mathbf{X}_{i},\mathbf{X}_{j})=e^{-||\mathbf{X}_{i}-\mathbf{X}_{j}||^{2}/\unicode[STIX]{x1D70E}^{2}},\end{eqnarray}$$

where $||\mathbf{X}_{i}-\mathbf{X}_{j}||$ denotes Euclidean distance and $\unicode[STIX]{x1D70E}^{2}$ denotes a researcher-specified bandwidth parameter. In practice, Hainmueller and Hazlett (Reference Hainmueller and Hazlett2013) recommend setting $\unicode[STIX]{x1D70E}^{2}=P$ (the number of predictor variables), which we adopt throughout this paper.
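For concreteness, the short R sketch below builds the Gaussian kernel on standardized predictors with $\unicode[STIX]{x1D70E}^{2}=P$ , as recommended above. It is a minimal illustration of the formula rather than the bigKRLS implementation, which handles this step with optimized routines.

```r
# Gaussian kernel on standardized predictors, with bandwidth sigma^2 = P.
gaussian_kernel <- function(X) {
  X <- scale(X)                             # KRLS-style standardization
  P <- ncol(X)
  sq_norms <- rowSums(X^2)
  # squared Euclidean distances via ||xi||^2 + ||xj||^2 - 2 xi'xj
  D2 <- outer(sq_norms, sq_norms, "+") - 2 * tcrossprod(X)
  D2[D2 < 0] <- 0                           # guard against small negative rounding errors
  exp(-D2 / P)                              # K_ij = exp(-||xi - xj||^2 / sigma^2)
}

set.seed(1)
X <- matrix(rnorm(200 * 3), 200, 3)
K <- gaussian_kernel(X)                     # 200 x 200, with ones on the diagonal
```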

Table 1. Overview of the KRLS estimation procedure.

Here $i,j=1,2,\ldots ,N$ index observations (so that all pairs are ultimately covered) and $p=1,2,\ldots ,P$ indexes the explanatory $x$ variables. Note that steps 4–6 are followed by uncertainty estimates, for which closed-form expressions also exist, along with proofs of a number of desirable properties such as consistency (Hainmueller and Hazlett Reference Hainmueller and Hazlett2013).

$^{\text{i}}$ Using worst-case results for a divide-and-conquer algorithm, which we employ here (Demmel Reference Demmel1997, p. 220–221). For runtime improvements via eigentruncation, see Section 4.2.1.

$^{\text{ii}}$ Using Golden Section Search given $\mathbf{y}$ , $\mathbf{E}$ , and $\mathbf{v}$ . Note that this value also depends on a tolerance parameter, which is set by the user.

Since this model involves estimating one parameter for each observation, to rule out degenerate solutions we replace $\mathbf{c}$ with $\mathbf{c}^{\ast }$ , where $\mathbf{c}^{\ast }$ is defined using a Tikhonov regularization strategy:

$$\begin{eqnarray}\mathbf{c}^{\ast }=\underset{c\in \mathbb{R}^{N}}{\text{argmin}}[(\mathbf{y}-\mathbf{Kc})^{\prime }(\mathbf{y}-\mathbf{Kc})+\unicode[STIX]{x1D706}\mathbf{c}^{\prime }\mathbf{Kc}].\end{eqnarray}$$

Here $\unicode[STIX]{x1D706}$ is a regularization parameter, selected to minimize leave-one-out loss. As we document in Section 4.2.2, $\unicode[STIX]{x1D706}$ is computationally demanding to select; however, once $\unicode[STIX]{x1D706}$ is selected this approach yields a closed-form expression $\hat{\mathbf{c}}_{KRLS}^{\ast }=(\mathbf{K}+\unicode[STIX]{x1D706}I)^{-1}\mathbf{y}$ , which can be calculated straightforwardly. Under appropriate functional form and error structure assumptions, both $\hat{\mathbf{c}}_{KRLS}^{\ast }$ and $\hat{\mathbf{y}}_{KRLS}^{\ast }$ are unbiased and consistent estimators of their population equivalents $\mathbf{c}^{\ast }$ and $\mathbf{y}^{\ast }$ , with closed-form expressions for both the estimators and their variances (Hainmueller and Hazlett Reference Hainmueller and Hazlett2013).Footnote 11 In simulations, both we and Hainmueller and Hazlett (Reference Hainmueller and Hazlett2013) show that KRLS is competitive with respect to out-of-sample predictive performance compared with related approaches (see Supplementary Appendix D.2 for details).
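Given a kernel and a fixed value of $\unicode[STIX]{x1D706}$ , the closed-form solution is a single linear solve. The R sketch below continues from the gaussian_kernel() helper above and fixes $\unicode[STIX]{x1D706}$ by hand purely for illustration; in KRLS and bigKRLS, $\unicode[STIX]{x1D706}$ is instead selected by the leave-one-out search discussed in Section 4.2.2.

```r
# Closed-form coefficients for a fixed lambda (didactic only).
set.seed(2)
N <- 200
X <- matrix(rnorm(N * 3), N, 3)
y <- as.numeric(scale(sin(X[, 1]) + X[, 2]^2 + rnorm(N, sd = 0.3)))

K <- gaussian_kernel(X)                       # helper from the previous sketch
lambda <- 0.5                                 # placeholder value
c_hat <- solve(K + lambda * diag(N), y)       # c* = (K + lambda I)^{-1} y
y_hat <- K %*% c_hat                          # fitted values
cor(y, y_hat)^2                               # in-sample fit
```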

3.2 Opening the black box

In contrast with other flexible modeling approaches, the kernel and regularized coefficients offer a natural way to express the effects of variables contained in the model (see Figure 1). In particular, since $\hat{\mathbf{y}}^{\ast }$ has a closed-form expression, for continuous predictors we can estimate the marginal effect of a given predictor at any observed point $\mathbf{X}_{j,p}$ by taking the derivativeFootnote 12 of the predicted values with respect to the point of interest:

$$\begin{eqnarray}\widehat{\frac{\unicode[STIX]{x1D6FF}\mathbf{y}^{\ast }}{\unicode[STIX]{x1D6FF}\mathbf{X}_{j,p}}}=\frac{-2}{\unicode[STIX]{x1D70E}^{2}}\mathop{\sum }_{i}^{N}{\hat{c}}_{i}^{\ast }\mathbf{K}_{i,j}(\mathbf{X}_{i,p}-\mathbf{X}_{j,p}).\end{eqnarray}$$

Since the regularization constraints imposed on the estimated coefficient vector $\hat{\mathbf{c}}^{\ast }$ serve to shrink its values, many of the pairwise comparisons embedded in this expression have little or no effect on the final estimated derivative value. These shrinkage constraints therefore simplify and smooth the estimated derivatives, consistent with the logic of regularization. Because the distribution of the regularized coefficients is in general unknown, constructing reliable measures of uncertainty around each pointwise marginal effect is challenging. As a result, these estimates are best used for exploration rather than inference. However, for researchers who are mindful of these limitations, these estimates offer a useful summary of the observed effect structure.

Because these pointwise marginal effects can be expressed in closed form, we can straightforwardly produce summaries of their content. Defining $\hat{\unicode[STIX]{x1D6E5}}$ as an $N\times P$ matrix of partial derivatives, we can define the AME of each variable as $\hat{\unicode[STIX]{x1D6E5}}_{AME}=[\frac{1}{N}\sum _{j}\widehat{\frac{\unicode[STIX]{x1D6FF}\mathbf{y}^{\ast }}{\unicode[STIX]{x1D6FF}\mathbf{X}_{p,j}}}\forall p\in \{1,2,\ldots P\}]$ , or the column means of $\hat{\unicode[STIX]{x1D6E5}}$ . Hypotheses about $\hat{\unicode[STIX]{x1D6E5}}_{AME}$ can be evaluated with standard hypothesis testing tools, and offer a highly interpretable summary of the overall effect of each variable.
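The sketch below transcribes the derivative expression and the AME definition directly, continuing from the K, c_hat, and X objects created in the earlier sketches. The estimates are on the standardized scale, and the loop is left unoptimized to keep the indexing transparent; bigKRLS computes these quantities in optimized code.

```r
# Pointwise derivatives and AMEs from the closed-form expression above.
Xs <- scale(X)
P <- ncol(Xs)
sigma2 <- P

pointwise_derivs <- sapply(seq_len(P), function(p) {
  sapply(seq_len(nrow(Xs)), function(j) {
    # derivative of y-hat with respect to X_{j,p}
    (-2 / sigma2) * sum(c_hat * K[, j] * (Xs[, p] - Xs[j, p]))
  })
})                                           # N x P matrix of pointwise derivatives

ame <- colMeans(pointwise_derivs)            # average marginal effect of each predictor
```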

Figure 1. “Actual” marginal effects.

Unfortunately, the flexibility, statistical performance, and interpretability of this approach come at a cost. The pairwise structure of the model makes it expensive to estimate, even before the marginal effects are computed. In bigKRLS—our implementation of KRLS—the algorithm’s peak memory requirements are $O(N^{2})$ . Assuming derivatives are estimated, the original KRLS implementation has peak memory requirements of $O(PN^{2})$ ; our figure is therefore a substantial improvement over the original estimation routine, but the approach remains difficult to scale.Footnote 13 From a runtime standpoint, KRLS’s requirements are similarly onerous, with total runtime complexity $O(N^{3})$ . For comparison, decision trees have runtime complexity $O(N\log (N)^{2})+O(PN\log (N))$ .

4 Algorithmic and Statistical Optimization Using bigKRLS

4.1 Overview

The improvements we present in this paper can be divided into two rough categories. First, from an algorithmic standpoint we reimplement all major functions using the bigmemory, Rcpp, and parallel packages in R, allowing researchers to easily parallelize model estimation. We also develop new algorithms for kernel regularization and first differences. Second, to improve inference, we propose a degrees-of-freedom correction for the model’s AME estimates and a new measure of effect heterogeneity contained in the model. We justify our degrees-of-freedom correction both theoretically and in simulation, and find that it performs well in the settings we examine. The correction improves coverage most notably for complex data-generating processes and smaller samples.

Put together, our algorithmic changes reduce peak memory consumption from $O((P+21)N^{2})$ to $O(5N^{2})$ in our implementation. Crucially, unlike KRLS, the memory footprint of bigKRLS does not depend on $P$ , the number of explanatory variables.Footnote 14 Runtime using bigKRLS and the original KRLS implementation is roughly comparable when $N$ and $P$ are small and all predictors are continuous. However, in most applied settings, bigKRLS is substantially faster. In simulation results for a dataset consisting of 10 binary and 10 continuous predictors, for example, we report approximately 50% decreased wall-clock time when running on a single core. When bigKRLS is set to use multiple processors (not an option with KRLS), a task that takes KRLS just over two hours can be done by bigKRLS in twenty minutes (Figure 3).

Figure 2. Published sample sizes. Source: American Journal of Political Science and American Political Science Review, January 2015 through January 2017 (own calculation).

Figure 3. Runtime comparisons (worst case).

How important are these improvements? For illustration purposes, we surveyed all empirical articles published in the American Journal of Political Science (AJPS) and the American Political Science Review (APSR) from January 2015 to January 2017, and recorded sample sizes for each dataset used in those articles ( $N=279$ ).Footnote 15 In the time frame we surveyed, approximately 43% of datasets were too large for the original KRLS implementation. By contrast, with similar dimensions bigKRLS can handle datasets up to approximately $N=14,000$ on a personal machine before reaching the 8 GB cutoff, opening an additional 20% of published datasets for estimation.Footnote 16
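The cutoffs quoted here follow from back-of-the-envelope arithmetic on the peak footprints reported above, assuming 8 bytes per double-precision number; the R sketch below reproduces the calculation.

```r
# Approximate peak memory (GB) implied by the footprints discussed above.
peak_gb <- function(N, P, bytes = 8) {
  c(KRLS    = (P + 21) * N^2 * bytes / 1e9,   # original implementation, with binary predictors
    bigKRLS = 5 * N^2 * bytes / 1e9)          # our implementation
}

peak_gb(N = 3500, P = 67)    # the original KRLS implementation is already at roughly the 8 GB mark
peak_gb(N = 14000, P = 67)   # bigKRLS stays under 8 GB; KRLS would need well over 100 GB
```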

These improvements, we argue, are substantial. Statistical models are only useful to the extent that they can be employed in practice. While high-complexity methods like KRLS are unlikely to be usable in truly “big” data settings, our improvements allow a noticeably greater proportion of applied researchers to use the modeling approach we present.

4.2 Algorithmic improvements

4.2.1 A leaner first differences estimator

For binary explanatory variables, KRLS estimates first differences.Footnote 17 The original algorithm for this procedure functions as follows. Suppose $\mathbf{X}_{b}$ is a column that contains a binary variable. Construct two copies of $\mathbf{X}$ , denoted as $\mathbf{X}_{\{\mathbf{0}\}}$ and $\mathbf{X}_{\{\mathbf{1}\}}$ , which are modified such that all observations in the $b$ th column of the copies are equal to 0 and 1, respectively. Combine the original matrix and the two modified copies into a new matrix $\mathbf{X}_{\text{new}}^{\prime }=[\mathbf{X}\mid \mathbf{X}_{\{\mathbf{0}\}}\mid \mathbf{X}_{\{\mathbf{1}\}}]^{\prime }$ , and construct a new similarity kernel. This step is temporary, but has a memory footprint of $9N^{2}$ (!). Finally, save the two submatrices of the kernel corresponding to the respective counterfactual comparisons between $\mathbf{X}_{\{\mathbf{0}\}}$ , $\mathbf{X}_{\{\mathbf{1}\}}$ , and the observed data $\mathbf{X}$ .

Our leaner implementation can also be expressed in terms of potential outcomes (Keele Reference Keele2015). The goal is to minimize the computational burden of obtaining the vector of differences for the scenario in which everyone was counterfactually assigned to one group versus the other. Let $\mathbf{K}_{\{1\}}$ and $\mathbf{K}_{\{0\}}$ be the counterfactual kernels.Footnote 18 The first differences are:

$$\begin{eqnarray}\unicode[STIX]{x1D6FF}_{b}=\mathbf{y}_{\{1\}}-\mathbf{y}_{\{0\}}=\mathbf{K}_{\{1\}}\mathbf{c}^{\ast }-\mathbf{K}_{\{0\}}\mathbf{c}^{\ast }=(\mathbf{K}_{\{1\}}-\mathbf{K}_{\{0\}})\mathbf{c}^{\ast }.\end{eqnarray}$$

As with the AMEs of continuous variables, the mean $\bar{\hat{\unicode[STIX]{x1D6FF}}}_{\mathbf{b}}$ is used as the point estimate that appears in the regression table. The variance of that mean first difference is:

$$\begin{eqnarray}\hat{\unicode[STIX]{x1D70E}}_{\unicode[STIX]{x1D6FF}_{\mathbf{b}}}^{\boldsymbol{ 2}}=\mathbf{h}^{\prime }(\mathbf{K}_{\text{new}}\hat{\unicode[STIX]{x1D72E}}_{\boldsymbol{ c}})\mathbf{K}_{\text{new}}^{\prime }\mathbf{h},\end{eqnarray}$$

where $\mathbf{h}$ is a vector of constants,Footnote 19   $\mathbf{K}_{\text{new}}$ is a partitioned matrix with the counterfactual kernels, and $\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}}$ is the variance–covariance matrix of the coefficients (Hainmueller and Hazlett Reference Hainmueller and Hazlett2013). Though highly interpretable, first difference calculations are computationally daunting because the peak memory footprint is $O(6N^{2})$ : $O(2N^{2})$ for $\mathbf{K}_{\text{new}}$ and another $O(4N^{2})$ for $\hat{\unicode[STIX]{x1D748}}_{\unicode[STIX]{x1D739}_{\mathbf{b}}}^{\mathbf{2}}$ . The following insight allowed us to derive a more computationally friendly algorithm.

Consider the similarity score $\mathbf{K}_{i,j}$ . We can manipulate this quantity as follows:

$$\begin{eqnarray}\displaystyle \mathbf{K}_{i,j} & = & \displaystyle e^{-||\mathbf{x}_{i}-\mathbf{x}_{j}||^{2}/\unicode[STIX]{x1D70E}^{2}}\nonumber\\ \displaystyle & = & \displaystyle e^{-[(\mathbf{x}_{i,1}-\mathbf{x}_{j,1})^{2}+(\mathbf{x}_{i,2}-\mathbf{x}_{j,2})^{2}+\cdots +(\mathbf{x}_{i,b}-\mathbf{x}_{j,b})^{2}+\cdots \,]/\unicode[STIX]{x1D70E}^{2}}\nonumber\\ \displaystyle & = & \displaystyle e^{-(\mathbf{x}_{i,b}-\mathbf{x}_{j,b})^{2}/\unicode[STIX]{x1D70E}^{2}}e^{-[(\mathbf{x}_{i,1}-\mathbf{x}_{j,1})^{2}+(\mathbf{x}_{i,2}-\mathbf{x}_{j,2})^{2}+\cdots \,]/\unicode[STIX]{x1D70E}^{2}}\nonumber\\ \displaystyle & = & \displaystyle e^{-(\mathbf{x}_{i,b}-\mathbf{x}_{j,b})^{2}/\unicode[STIX]{x1D70E}^{2}}\mathbf{K}_{i,j}^{\ast }.\nonumber\end{eqnarray}$$

These manipulations allow us to reexpress the quantity of interest in terms of $\mathbf{K}_{i,j}^{\ast }$ , the observed similarity on dimensions other than $b$ , and the constant $\unicode[STIX]{x1D719}=\exp (-\frac{1}{\unicode[STIX]{x1D70E}_{\mathbf{X}_{b}}^{2}\unicode[STIX]{x1D70E}^{2}})$ , the kernel factor corresponding to the only nonzero pairwise distance on the binary dimension, where $\unicode[STIX]{x1D70E}_{\mathbf{X}_{b}}^{2}$ is the variance of the binary variable. The first differences can thus be expressed wholly in terms of the observed kernel and $\unicode[STIX]{x1D719}$ , as shown in Figure 4, so our algorithm avoids constructing the costly temporary matrix required in the original implementation.
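The R sketch below is a schematic version of this reexpression: it assumes, as in KRLS, that the columns of $\mathbf{X}$ were standardized before the kernel was built, and it recovers the two counterfactual kernels from the observed kernel, the raw 0/1 indicator, and $\unicode[STIX]{x1D719}$ alone. It illustrates the logic of Figure 4 rather than reproducing the bigKRLS internals; K, b01, and c_hat are placeholders for the observed kernel, the binary column, and the estimated coefficients.

```r
# Counterfactual kernels for a binary column b, recovered from the observed
# kernel K, the raw 0/1 indicator b01, and phi -- no 3N-row copy of the data
# and no 9 N^2 temporary kernel is ever built.
counterfactual_kernels <- function(K, b01, sigma2) {
  # phi: kernel factor for the only nonzero pairwise distance on the
  # standardized binary dimension (var(b01) enters via the standardization)
  phi <- exp(-1 / (var(b01) * sigma2))
  N <- nrow(K)

  # K*: similarity on all dimensions other than b; pairs that disagree on b
  # had the factor phi baked into K, so divide it back out
  K_star <- K
  disagree <- outer(b01, b01, "!=")
  K_star[disagree] <- K_star[disagree] / phi

  # Row i takes the counterfactual value (1 or 0); column j keeps its observed
  # value, so the remaining factor depends only on b01[j]
  K1 <- K_star * matrix(ifelse(b01 == 1, 1, phi), N, N, byrow = TRUE)
  K0 <- K_star * matrix(ifelse(b01 == 0, 1, phi), N, N, byrow = TRUE)
  list(K1 = K1, K0 = K0)
}

# First differences: delta_b = (K1 - K0) %*% c_hat, averaged for the table
# ck <- counterfactual_kernels(K, b01 = X[, b], sigma2 = ncol(X))   # X[, b] a 0/1 column
# delta_b <- (ck$K1 - ck$K0) %*% c_hat
```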

Figure 4. Reexpressed kernel for first differences estimation.

Building on this observation, we took the following steps to make the variance–covariance calculation more tractable.

(1) Though $(\mathbf{K}_{\text{new}}\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}})\mathbf{K}_{\text{new}}^{\prime }$ is $2N\times 2N$ it is possible to focus the calculations on four $N\times N$ submatrices:

$$\begin{eqnarray}\displaystyle (\mathbf{K}_{\text{new}}\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}})\mathbf{K}_{\text{new}}^{\prime }=\left[\begin{array}{@{}c@{}}\mathbf{K}_{\{1\}}\\ \mathbf{K}_{\{0\}}\end{array}\right]\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}}\left[\begin{array}{@{}cc@{}}\mathbf{K}_{\{1\}}^{\prime } & \mathbf{K}_{\{0\}}^{\prime }\end{array}\right]=\left[\begin{array}{@{}cc@{}}\mathbf{K}_{\{1\}}\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}}\mathbf{K}_{\{1\}}^{\prime } & \mathbf{K}_{\{1\}}\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}}\mathbf{K}_{\{0\}}^{\prime }\\ \mathbf{K}_{\{0\}}\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}}\mathbf{K}_{\{1\}}^{\prime } & \mathbf{K}_{\{0\}}\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}}\mathbf{K}_{\{0\}}^{\prime }\end{array}\right]. & & \displaystyle \nonumber\end{eqnarray}$$

Each (sub)matrix in the final term functions as a weight on the observed variances and covariances in the various counterfactual scenarios. Partitioning the matrix in this manner allows us to avoid constructing the full $2N\times 2N$ matrix directly.

(2) Though $\mathbf{h}$ is simply an auxiliary vector that facilitates averaging, $\mathbf{h}^{\prime }(\mathbf{K}_{\text{new}}\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}})\mathbf{K}_{\text{new}}^{\prime }\mathbf{h}$ presents different opportunities for factoring than $(\mathbf{K}_{\text{new}}\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}})\mathbf{K}_{\text{new}}^{\prime }$ . Our algorithm factors out individual elements of $\hat{\unicode[STIX]{x1D72E}}_{\mathbf{c}}$ as far as possible. Along with an expanded version of Figure 4 that expresses all possible products of two counterfactual similarity scores, this factoring reduces the computational complexity by an order of magnitude by avoiding an otherwise intractable inner loop.

Other factorizations may exist that further optimize either speed or memory—but not both. Our first differences algorithm, for example, can be reexpressed as a triple loop with no additional memory overhead; however, that formulation sacrifices vectorization speedups which our current setup exploits. In the implementation we present, we create two $N\times N$ temporary matrices, which is both an improvement over six and no worse than any other part of the algorithm. Consistent with our experience with bigKRLS, speed tests show that our algorithm is no slower than a purely linear algebra approach.

To illustrate why this advance is important, consider dyadic data. Because of the pairwise structure of the kernel, KRLS is tailor-made for international relations, where data are frequently organized into country dyads. However, such analyses often require at least 150 binary indicator variables for nation states. In Section 5, we use 50 binary variables for US states, which is similarly prohibitive on many machines with the original KRLS implementation. With bigKRLS, this is no longer an issue.

4.2.2 Lowering the cost of kernel regularization

For KRLS, the regularization parameter $\unicode[STIX]{x1D706}$ represents the degree of skepticism toward idiosyncratic features of the data-generating process. In bigKRLS we introduce a leaner version of the Golden Section Search algorithm described by Rifkin and Lippert (Reference Rifkin and Lippert2007) and adopted by Hainmueller and Hazlett (Reference Hainmueller and Hazlett2013). bigKRLS’s approach to selecting $\unicode[STIX]{x1D706}$ offers a lower memory footprint and speed gains at the sample sizes where bigKRLS has the most to offer to political science researchers.

The Golden Section Search strategy is as follows. Though $\unicode[STIX]{x1D706}$ cannot be obtained analytically, it can be selected (to arbitrary precision) through an iterative procedure that depends primarily on the eigendecomposition of the kernel.Footnote 20 $\unicode[STIX]{x1D706}$ is selected to minimize the sum of squared leave-one-out errors, $LOOE=\sum _{i}(\frac{{\hat{c}}_{i}^{\ast }}{\mathbf{G}_{i,i}^{-1}})^{2}$ , where $\mathbf{G}=\mathbf{K}+\unicode[STIX]{x1D706}\mathbf{I}$ is the “ridge” version of the kernel. The key computational challenge is $\mathbf{G}^{-1}$ , which is used to calculate candidate coefficient values and $LOOE$ at each step.

For computational ease, $\mathbf{G}^{-1}$ (or even $\mathbf{G}$ ) is not obtained directly. Instead, $\mathbf{K}$ is replaced by its eigendecomposition $\mathbf{Q}\unicode[STIX]{x1D726}\mathbf{Q}^{\prime }$ , where $\mathbf{Q}$ is a matrix containing the eigenvectors of $\mathbf{K}$ and $\unicode[STIX]{x1D726}$ is a diagonal matrix of the corresponding eigenvalues. Expanding the resulting expression for $\mathbf{G}^{-1}$ yields

$$\begin{eqnarray}\mathbf{G}_{i,j}^{-1}=\mathop{\sum }_{k=1}^{N_{Q}}\frac{\mathbf{Q}_{i,k}\ast \mathbf{Q}_{j,k}}{\unicode[STIX]{x1D706}+\unicode[STIX]{x1D726}_{\text{k,k}}},\end{eqnarray}$$

where $N_{Q}$ is the number of eigenvalues actually used in this computation.

Existing implementations of this approach usually begin by calculating the full matrix $\mathbf{Q}(\unicode[STIX]{x1D726}+\unicode[STIX]{x1D706}\mathbf{I})^{-1}\mathbf{Q}^{\prime }$ and then using its values to calculate candidate coefficient values. In our implementation, we instead build this matrix in a column-by-column manner. This strategy has two advantages. First, since the only purpose of constructing $\mathbf{G}^{-1}$ is to calculate $\hat{\mathbf{c}}^{\ast }$ , we can avoid placing the full $\mathbf{G}^{-1}$ matrix in memory by accumulating each column of $\mathbf{G}^{-1}$ into the coefficient vector as it is computed. Second, since $\mathbf{G}^{-1}$ is symmetric we can skip calculating the elements of each column that correspond to the upper triangle of the matrix. As shown in Table 2, this implementation offers a noticeable performance boost over existing implementations.Footnote 21
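In plain R, one algebraically equivalent way to avoid storing $\mathbf{G}^{-1}$ is sketched below: both $\hat{\mathbf{c}}^{\ast }$ and the diagonal of $\mathbf{G}^{-1}$ (all the leave-one-out formula requires) are accumulated directly from the eigenvectors. The production code instead accumulates columns in compiled code and additionally exploits symmetry, so this should be read as an illustration of the idea rather than the bigKRLS routine itself.

```r
# Leave-one-out error for a candidate lambda, computed from the eigendecomposition
# of K without materializing G^{-1} = (K + lambda I)^{-1}.
looe <- function(lambda, eig, y) {
  Q <- eig$vectors                       # eigenvectors of K (N x N_Q)
  d <- eig$values                        # eigenvalues of K (length N_Q)
  w <- 1 / (lambda + d)

  c_hat  <- Q %*% (w * crossprod(Q, y))  # c* = Q diag(w) Q' y
  g_diag <- colSums(w * t(Q^2))          # diag(G^{-1})_i = sum_k w_k Q_{i,k}^2

  sum((c_hat / g_diag)^2)                # sum of squared leave-one-out errors
}

# Example: evaluate the objective for a few candidate values
# (a golden section search over lambda is used in practice).
# eig <- eigen(K, symmetric = TRUE)
# sapply(c(0.01, 0.1, 1, 10), looe, eig = eig, y = y)
```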

Table 2. Alternatives for the spectral decomposition’s cumulative effect on runtime.

As an empirical benchmark, we isolate the portions of the algorithm that depend on the eigenvectors and eigenvalues (i.e., the eigendecomposition, $\unicode[STIX]{x1D706}$ search, and estimating the coefficients and their variance–covariance matrix but not the kernel or marginal effects). We report the runtime of this portion on the 2016 election data presented in Section 5 ( $N=3106,P=68$ ) in seconds on a 2017 MacBook Pro (2.3 GHz Intel Core i5 with 16 GB of RAM). For the eigentruncation case, $\unicode[STIX]{x1D716}=0.01$ and, in the final column, $N_{Q}=50$ .

As implied by the preceding discussion, not all eigenvalues and vectors are necessary for selecting $\unicode[STIX]{x1D706}$ . In most applied settings, the vast majority of the eigenvalues are very small, and can be safely ignored during the calculations described above. bigKRLS facilitates two types of eigentruncation: specifying that (1) only $N_{Q}$ eigenvectors and eigenvalues be calculatedFootnote 22 and/or (2) defining an $\unicode[STIX]{x1D716}$ such that only those eigenvectors and eigenvalues will be used where the eigenvalue is at least $100\unicode[STIX]{x1D716}\%$ as large as the largest eigenvalue. Optimizing either of these strategies ex ante is difficult; however, since runtime for this section of the algorithm depends on $N_{Q}$ rather than the proportion of variance explained by the retained eigenvalues/vectors, even a conservative decision rule will often produce substantial speed gains. As a default in bigKRLS we set $\unicode[STIX]{x1D716}=0.001$ , which produces virtually identical results to those generated using the full decomposition in both simulated examples and our application in Section 5.
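A minimal version of the $\unicode[STIX]{x1D716}$ rule, written against the output of base R’s eigen() purely for illustration (in bigKRLS the corresponding choices are set through the fitting function’s arguments):

```r
# The epsilon rule: keep only eigenpairs whose eigenvalue is at least
# 100 * epsilon % as large as the largest eigenvalue (default epsilon = 0.001).
truncate_eigen <- function(eig, eps = 0.001) {
  keep <- eig$values >= eps * max(eig$values)
  list(values = eig$values[keep],
       vectors = eig$vectors[, keep, drop = FALSE])
}

# eig <- eigen(K, symmetric = TRUE)
# eig_small <- truncate_eigen(eig, eps = 0.001)
# length(eig_small$values)    # N_Q, the number of retained eigenpairs
```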

4.3 Inferential tools

4.3.1 Degrees of freedom for AMEs

One of KRLS’s key strengths is its ability to offer both nuanced effect estimates as well as high-level summaries of each effect. The key high-level quantities produced by the model are the AME estimates, for which Hainmueller and Hazlett (Reference Hainmueller and Hazlett2013) derive closed-form expressions for both the values of the AMEs and their variances. They then use these quantities to derive a Student’s $t$ -test with $N-P$ degrees of freedom to assess statistical significance of each individual AME estimate. This approach works well for simple data-generating processes but not for more complex problems, as we show in simulation (introduced below and detailed in Supplementary Appendix C). For many realistic cases this test yields overly narrow confidence intervals coupled with misleadingly low $p$ values.

To address this issue, we propose an uncertainty correction using the effective degrees of freedom from the Tikhonov penalty used by KRLS. Since KRLS estimates $N$ choice coefficients, ${\hat{c}}^{\ast }$ , each of which is a parameter with an $L_{2}$ penalty, the effective degree of freedom for the model is

$$\begin{eqnarray}N_{\mathit{effective}}=N-\mathop{\sum }_{k=1}^{N_{Q}}\frac{\unicode[STIX]{x1D726}_{k,k}}{\unicode[STIX]{x1D726}_{k,k}+\unicode[STIX]{x1D706}},\end{eqnarray}$$

where $N$ is the original sample size, $\unicode[STIX]{x1D726}_{k,k}$ is the $k$ th eigenvalue of $\mathbf{K}$ , and $\unicode[STIX]{x1D706}$ is the model’s regularization parameter (Hastie, Tibshirani, and Wainwright Reference Hastie, Tibshirani and Wainwright2015, 61–68).Footnote 23 Provided $N_{Q}$ (the number of eigenvalues and eigenvectors actually used) is large enough to select $\unicode[STIX]{x1D706}$ reliably, $N_{\mathit{effective}}$ should yield essentially the same estimate as if $N_{Q}=N$ : because the eigenvalues are constrained in total magnitude and heavily skewed, the omitted eigenvalues are small and contribute terms near zero to the sum.Footnote 24
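In code, the correction is one line given the retained eigenvalues and the selected $\unicode[STIX]{x1D706}$ ; a minimal sketch, reusing the eig object from the previous section:

```r
# Effective sample size after the Tikhonov penalty:
# N_eff = N - sum_k Lambda_k / (Lambda_k + lambda)
effective_n <- function(N, eigenvalues, lambda) {
  N - sum(eigenvalues / (eigenvalues + lambda))
}

# effective_n(N = nrow(K), eigenvalues = eig$values, lambda = lambda)
```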

To test the performance of this approach, we conducted a series of simulation studies (see Supplementary Appendix C for details). Our experiments suggest that this correction is most impactful when the true data-generating process is highly nonlinear and the sample size is smaller, offering approximately a 10-point increase in empirical coverage in this scenario. When the true data-generating process is simpler, the difference in coverage rates between corrected and uncorrected approaches vanishes. However, since KRLS is most appealing for applied work when researchers suspect that the true effect structure is more complex, we argue that this result offers substantial justification for our correction.

4.3.2 Detecting effect heterogeneity

While useful, inspecting AME estimates alone can conceal substantial effect heterogeneity. To help bridge the gap between high-level AMEs and the more nuanced pointwise derivative estimates, we introduce two statistics, which we term $R_{K}^{2}$ and $R_{AME}^{2}$ :

$$\begin{eqnarray}\displaystyle R_{AME}^{2} & = & \displaystyle cor(y,{\hat{y}}_{AME})^{2},\nonumber\\ \displaystyle R_{K}^{2} & = & \displaystyle cor(y,{\hat{y}})^{2},\nonumber\end{eqnarray}$$

with ${\hat{y}}=K{\hat{c}}$ and ${\hat{y}}_{AME}=\mathbf{X}\hat{\unicode[STIX]{x1D6E5}}_{AME}^{\prime }$ , where $\hat{\unicode[STIX]{x1D6E5}}_{AME}$ denotes the vector of AMEs. Phrased differently, $R_{K}^{2}$ denotes the pseudo- $R^{2}$ calculated with the kernel and all $N$ coefficient estimates, while $R_{AME}^{2}$ denotes the pseudo- $R^{2}$ calculated using predictions from $\mathbf{X}$ and its estimated partial derivatives alone. Intuitively, these quantities will be similar when $\mathbf{y}$ can be well approximated by a linear combination of the columns of $\mathbf{X}$ . When $\mathbf{y}$ is a more idiosyncratic function better modeled by the pairwise similarity of the observations, $R_{K}^{2}$ will outperform the AMEs, often dramatically. Note that, since $R_{AME}^{2}$ is not based on values selected to optimize any particular loss function, its performance can be unstable. Unlike $R_{K}^{2}$ , which tends to be relatively consistent in and out of sample, $R_{AME}^{2}$ is often “pessimistic” in the sense that it can be noticeably smaller on training than test data.
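Both statistics are simple to compute once the model has been fit; the sketch below continues from the K, c_hat, X, y, and ame objects defined in the earlier sketches.

```r
# Pseudo-R^2 from the full kernel fit versus from the AMEs alone.
y_hat_K   <- K %*% c_hat                 # predictions using the kernel and all N coefficients
y_hat_AME <- scale(X) %*% ame            # predictions using X and the AMEs only

R2_K   <- cor(y, y_hat_K)^2
R2_AME <- cor(y, y_hat_AME)^2
c(R2_K = R2_K, R2_AME = R2_AME)          # a large gap suggests effect heterogeneity
```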

5 Application: The Trump Effect in “Communities in Crisis”

As an example application of bigKRLS, we analyze county-level results from the 2016 presidential election, with a focus on the so-called “communities in crisis” hypothesis (described in detail in the following section). This application highlights two key strengths of bigKRLS: scalability and the ability to handle binary predictors gracefully. Because we include state as a predictor, our resulting model contains more than 50 binary variables. Peak memory requirements in the original KRLS implementation scale with the number of predictors, whereas in bigKRLS they do not, so moving to our implementation reduces peak memory consumption by more than an order of magnitude.

5.1 Overview

In both popular and academic discussions (e.g. Guo Reference Guo2016; Monnat Reference Monnat2016; Siegel Reference Siegel2016), a number of commentators argued that Donald Trump’s success in the 2016 election was partly attributable to his appeal in “communities in crisis.” As shown by Case and Deaton (Reference Case and Deaton2015), suicides, drug overdoses, and other so-called “deaths of despair” rose sharply among non-Hispanic Whites over the decades preceding the election, leading to a decrease in overall life expectancy within this population. Combined with declining economic opportunities, commentators argued, declining public health outcomes fostered a sense of dissatisfaction with traditional elites in afflicted areas. As a result, members of these communities may have been unusually inclined to vote for Trump relative to “establishment” Republican candidates.

To investigate this hypothesis, we use bigKRLS to model county-level voting patterns in the 2016 presidential election. Our dependent variable in this model is $\unicode[STIX]{x1D6E5}GOP$ , defined as the difference between two-party vote shares for Donald Trump in 2016 and Mitt Romney in 2012 (%Trump $-$ %Romney). We focus on county-level data for data availability reasons: because of privacy considerations, the county is the most granular unit for which relevant official US sources such as the Census Bureau and the Centers for Disease Control and Prevention release data publicly.

Our key independent variables are the county-level age-adjusted all-cause mortality rate (per 1,000 individuals) and the difference in three-year mortality rates for the periods preceding the 2016 and 2012 elections. These variables are intended to capture the “communities in crisis” hypothesis, with a particular focus on highlighting communities in which public health crises emerged between election cycles. We also include standard racial, macroeconomic, and education variables, along with geolocation information for each county and state-level dummy variables (described in Supplementary Appendix B). As we note at the outset of this section, including state-level dummies would have been impractical without the improvements we introduce in this paper, highlighting the utility of our approach.Footnote 25
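For readers interested in the workflow, a sketch of the model fit is below. The dGOP vector and X_county matrix are placeholders for the constructed variables described above, and the call assumes the package’s bigKRLS(y, X) interface and its summary() method; eigentruncation and parallelization arguments are omitted here.

```r
# Sketch of the election model workflow (variable construction omitted).
library(bigKRLS)

# X_county: mortality, change in mortality, demographic, macroeconomic, and
# education measures, latitude/longitude, and state dummy variables.
fit <- bigKRLS(y = dGOP, X = X_county)

summary(fit)   # AMEs for continuous predictors, first differences for binary ones
```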

Table 3. Average marginal effect estimates.

Estimates for latitude, longitude, and state omitted for brevity. The dependent variable is the change in GOP vote share in the presidential election, 2012–2016, measured in percentage points. $p_{uc}$ denotes uncorrected $p$ values generated using a $t$ -test with $N-P$ degrees of freedom; $p_{c}$ denotes corrected $p$ values with $N_{\mathit{effective}}=2,892$ . $N=3,106$ , $R_{K}^{2}=0.83$ , and $R_{\mathit{AME}}^{2}=0.31$ . For definitions, see Section 4.3.

5.2 Average effect estimates

Average marginal effect estimates for this model are given in Table 3. Unsurprisingly, the model fits the data well. For our application, $R_{K}^{2}=0.83$ , suggesting a reasonable level of in-sample fit.Footnote 26 Nearly all predictors easily reach conventional levels of statistical significance, with intuitive signs. On average, Trump performed better in Whiter, older, poorer, and lower-education localities. As hypothesized, Trump also received a larger two-party vote share than Romney in higher-mortality counties, though the effect size is not particularly large. Averaged across the country, a one-standard-deviation ( ${\approx}1.47$ ) increase in age-adjusted mortality is estimated to produce approximately a 0.25 percentage-point increase in $\unicode[STIX]{x1D6E5}GOP$ . These findings match the basic contours of the “communities in crisis” hypothesis: relative to previous Republican candidates, Trump performed particularly well in localities facing substantial hardships. $\unicode[STIX]{x1D6E5}$ Mortality is the main exception to this pattern, and does not reach conventional levels of statistical significance. This result is likely due to a lack of variability; since our study only covers a six-year period, large changes in mortality rates are rare.

As described in Section 4.3, uncorrected $p$ values for KRLS AMEs are suspect for more complex data-generating processes. In Table 3, we give both the corrected and the uncorrected $p$ values for this model, calculated using the effective degrees-of-freedom correction given previously ( $N_{\mathit{effective}}=2,825$ ). Since the sample size and effects detected by this analysis are both reasonably large, implementing the degrees-of-freedom correction we propose does not change any conclusions regarding statistical significance. At least in this case, our sample size appears to be sufficiently large relative to the complexity of the data-generating process to limit the impact of our correction.

5.3 Spatial first differences

While useful, inspecting the AME estimates can conceal substantial effect heterogeneity. In this case, the $R_{AME}^{2}$ is only 0.31 but $R_{K}^{2}=0.83$ on the full sample. As we discuss in Supplementary Appendix D.1, the out-of-sample difference between these values is likely smaller than their in-sample difference. However, as a coarse indicator of effect heterogeneity, the gap between these quantities is suggestive.

Since KRLS offers closed-form estimates for the variance of the predicted values produced by the model, a simple way to explore effect heterogeneity is to estimate perturbation-based first differences, in which we “perturb” the variable of interest and examine the perturbation’s predicted impact (and the variance of that impact) on the dependent variable. In this section, we take precisely this approach. For each county, we calculate $\hat{\mathbf{y}}_{test}-\hat{\mathbf{y}}$ , where $\hat{\mathbf{y}}_{test}$ represents the predicted value for the counterfactual scenario in which we perturb the mortality variable—our key variable of interest—by some fixed constant $\unicode[STIX]{x1D70F}$ . We operationalize $\unicode[STIX]{x1D70F}$ as the difference between the $95$ th and $5$ th percentiles of the mortality variable ( ${\approx}3.2$ standard deviations).

Besides its interpretive benefits, this approach also offers an important theoretical payoff. Both $\hat{\mathbf{y}}_{test}$ and $\hat{\mathbf{y}}$ are distributed multivariate normal with known variance–covariance matrices (Hainmueller and Hazlett Reference Hainmueller and Hazlett2013). As a result, the marginal distribution of each predicted difference is distributed univariate Gaussian, with variance equal to the sum of the corresponding diagonal elements of the variance–covariance matrices for each set of predicted values. In Supplementary Appendix E.1, we use these facts to derive a pointwise hypothesis test of whether these first differences differ significantly from zero.
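A sketch of the resulting test is below. The predicted values and their pointwise variances are placeholders for quantities computed from the fitted model under the observed and perturbed mortality values; the test statistic follows the variance expression just described, and significance is assessed after a Benjamini–Hochberg correction, as in Figure 5.

```r
# Pointwise test for perturbation-based first differences.
first_diff_test <- function(yhat_test, yhat, var_test, var_obs, alpha = 0.05) {
  diff <- yhat_test - yhat                 # predicted effect of the perturbation
  se   <- sqrt(var_test + var_obs)         # sum of the corresponding diagonal variances
  z    <- diff / se
  p    <- 2 * pnorm(-abs(z))               # pointwise Gaussian test
  p_bh <- p.adjust(p, method = "BH")       # Benjamini-Hochberg correction
  data.frame(diff = diff, se = se, p = p, p_bh = p_bh,
             significant = p_bh < alpha)
}
```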

Figure 5 plots the estimates generated using this procedure. Nationwide, after applying a Benjamini–Hochberg correction approximately 12% of estimates are distinguishable from zero, with most significant county-level estimates clustered in the West, Mountain West, and Midwest.Footnote 27 Notably, though the counterfactual scenario we simulated involves a large change in mortality rate, most individual estimates are still insignificant, echoing the small AME size we note in the previous section. However, even with this small average effect size, the magnitude of our county-level estimates remains highly variable. Our estimates are largest in the West and Mountain West, but we also estimate noticeable (and significant) effects in key swing states like Pennsylvania, Ohio, and Michigan. In these latter states, the mortality increase we model produces a predicted change of ${\sim}1{-}2$ percentage points in $\unicode[STIX]{x1D6E5}GOP$ , which is substantial given the small margins by which the presidential election was decided in these states.Footnote 28 These results are consistent with Monnat (Reference Monnat2016)’s findings, which suggest Trump’s overperformance in high-mortality counties was regionally contingent.

Figure 5. Mortality first differences.

Strikingly, in some regions of the country, the predicted relationship between mortality and $\unicode[STIX]{x1D6E5}GOP$ presidential vote share was actually negative. This effect is particularly pronounced (and statistically significant) in parts of Kentucky and West Virginia, but is also noticeable in some neighboring Ohio and Illinois counties. One policy-driven explanation for this finding relates to state-level Medicaid expansion decisions following the passage of the Affordable Care Act. Under the “communities in crisis” hypothesis, the primary causal mechanism is a local dissatisfaction with political elites, and particularly with elite responses to poverty and poverty-related public health crises. In states like Kentucky and West Virginia that chose to expand Medicaid following the passage of the Affordable Care Act, high-mortality counties likely received a substantial portion of new Medicaid spending, which may have buttressed their faith in “establishment” politicians.

Though not conclusive evidence, we argue that these results are at least suggestive. While the model we estimate clearly cannot distinguish the Medicaid expansion’s effect on voter preference from unmeasured local predispositions, our results suggest that policy context matters. Far-reaching policy programs like the Affordable Care Act’s Medicaid expansion provisions might plausibly condition voter receptiveness to the anti-elite message offered by the Trump campaign. Thus, KRLS’s flexibility points to mechanisms worthy of future inquiry.

5.4 Interpreting pointwise marginal effects

In addition to geographic heterogeneity, the “communities in crisis” hypothesis implies that mortality’s effect should be conditioned by two other factors. First, in line with most postelection commentary, Trump’s appeal should be strongest in White “communities in crisis.” In other words, we should expect the effect of increasing mortality rates to be strongest in communities with larger White populations. Figure 6 weakly supports this hypothesis; however, plotting pointwise marginal effects in this manner suggests that the size of this relationship is small at best, with both majority–minority and majority-White counties displaying roughly similar effect sizes.

Figure 6. Marginal effect of age-adjusted mortality on $\unicode[STIX]{x1D6E5}GOP$ presidential vote share, 2012–16, by proportion of White population in each county.

Second, we should also expect the marginal effect of mortality to be increasing. Marginal increases in mortality, in other words, should have a relatively small effect in low-mortality counties, and a much larger one in high-mortality locations (as mortality rates approach “crisis” status). However, as shown in Figure 7, the estimated marginal effect of mortality actually peaks in mid-mortality counties and declines as mortality increases. Based on these results, true “crisis” communities (those with the highest mortality rates) appear to have been less responsive to mortality differences than their moderate-mortality counterparts, suggesting that the mortality effect is largely concentrated within the latter group of localities.

We hasten to emphasize that the discussion of pointwise marginal effects in this section necessarily possesses an exploratory quality. Unlike the first difference estimates we present in the previous section, pointwise hypothesis tests for these quantities are difficult to construct. Developing appropriate pointwise uncertainty estimators represents a fruitful direction for future research.

Figure 7. Marginal effect of age-adjusted mortality on $\unicode[STIX]{x1D6E5}GOP$ presidential vote share, 2012–16, by mortality.

6 Conclusion

In recent years, researchers have become increasingly interested in methods and models that combine canonical statistical properties with flexibility, robustness, and predictiveness. Some modeling approaches in this area also emphasize interpretability, which we argue should be viewed as a coequal goal with the other traits mentioned above. The Hainmueller and Hazlett (Reference Hainmueller and Hazlett2013) KRLS paradigm balances many of these concerns and has the capacity to contribute to social science research at a number of stages. Inevitably, KRLS offers no free lunch. By attempting to couple a flexible statistical model with interpretable effect estimates, KRLS encounters a steep scalability curve. We introduce bigKRLS not with the hopes of eliminating the computational burden of $N\times N$ calculations but rather in an effort to push the frontier (in terms of both $N$ and $P$ ) for a variety of important political problems. For most applications, our improvements reduce runtime by about 75% and reduce memory usage by an order of magnitude. Our proposed $p$ -value correction for the model’s AME further improves on the model’s statistical properties, particularly for complex data-generating processes estimated using smaller samples.

There are a number of exciting areas for future work. Optimizing kernel regularization to scale to truly “big data” applications without compromising inference or interpretability is an open area of research. Though kernels are theoretically well suited to high dimensions (Diaconis, Goel, and Holmes Reference Diaconis, Goel and Holmes2008; El Karoui Reference El Karoui2010), large numbers of $x$ variables still create practical problems, both computational and interpretative. Statistically, recent theoretical advances in selective inference (Taylor and Tibshirani Reference Taylor and Tibshirani2015), risk estimation for tuning parameters (Tibshirani and Rosset Reference Tibshirani and Rosset2016), and subsampling (Boutsidis, Mahoney, and Drineas Reference Boutsidis, Mahoney and Drineas2009; Gu, Jeon, and Lin Reference Gu, Jeon and Lin2013; Homrighausen and McDonald Reference Homrighausen and McDonald2016) offer paths forward in high-dimensional space. In the meantime, our analysis suggests that KRLS can produce interpretable and theoretically useful estimates on a perennial topic of interest: the behavior of American voters.

Supplementary material

For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2018.33.

Footnotes

Authors’ note: Authors, who have contributed equally to this project, are listed alphabetically. This project has benefited immensely from feedback at the Stanford University Statistics Seminar, April 18, 2017; useR! 2016, hosted by Stanford University; American Political Science Association 2016; the International Methods Colloquium, hosted by Justin Esarey on November 11, 2016; the Stevens Institute of Technology on February 27, 2017; and the Bay Area R Users Group Official Meetups, hosted by Treasure Data (May 2016), Santa Clara University (October 2016), and GRAIL (June 2017). Thanks in particular to Susan Holmes, Joseph Rickert, Stefan Wager, Stephen Jessee, Christopher Wlezien, Trevor Hastie, Christian Fong, Luke Sonnet, Chad Hazlett, Kristyn Karl, Jacob Berman, Jonathan Katz, Gaurav Sood, Maraam Dwidar, and anonymous reviewers for additional comments (mistakes, of course, are ours). Pete Mohanty thanks Stanford University’s Vice Provost for Undergraduate Education for research leave. For replication materials, see Mohanty and Shaffer (2018).

Contributing Editor: Jonathan N. Katz

1 In this manuscript, “KRLS” refers to the method of estimation, whereas the KRLS R package refers to the software implementation described in a companion piece by Ferwerda, Hainmueller, and Hazlett (Reference Ferwerda, Hainmueller and Hazlett2017).

2 Assuming parallelization is used. Single-core runtime is also approximately 50% faster with bigKRLS.

3 Arguably, we might view Bayesian posterior probabilities as a good example of an “interpretable” procedure. In a direct sense, many regularization strategies can be justified as a particular prior structure (Wahba Reference Wahba1983; Tibshirani Reference Tibshirani1996). More broadly, as Gill (Reference Gill1999), Jackman (Reference Jackman2009), and others argue, the frequentist null hypothesis testing paradigm is notoriously difficult to properly interpret. By contrast, researchers can straightforwardly calculate probabilities of interest such as $P(\unicode[STIX]{x1D6FD}>0|X)$ under the Bayesian paradigm without reference to counterfactuals.

That said, many researchers find Bayesian priors confusing or arbitrary. To a certain extent, this disagreement is a question of whether one locates the primary interpretive dilemma at the beginning or the end of the analysis. Bayesian versions of kernel-regularized regression are relevant to this discussion but beyond the scope of this paper. See, for example, Zhang, Dai, and Jordan (Reference Zhang, Dai and Jordan2011) for further discussion.

4 Since order notation is designed to describe the limiting complexity of a given algorithm as the size of the inputs grows arbitrarily large, constants and lower-order terms are usually omitted from order-notation statements. However, if a high level of precision is necessary to compare a pair of algorithms (as in Section 4.1), constant terms can be included in the complexity statement.

5 Assuming $N$ is substantially larger than $P$ and a Cholesky decomposition of $\mathbf{X}^{\prime }\mathbf{X}$ is used to calculate $\hat{\unicode[STIX]{x1D6FD}}=(\mathbf{X}^{\prime }\mathbf{X})^{-1}\mathbf{X}^{\prime }\mathbf{y}$ rather than inverting $\mathbf{X}^{\prime }\mathbf{X}$ directly.

6 With fairly pessimistic assumptions regarding tree growth rates (Witten, Frank, and Hall Reference Witten, Frank and Hall2011, p. 199–200).

7 The complexity frontier phenomenon has become increasingly relevant for applied political science work. For example, as Imai, Lo, and Olmsted (Reference Imai, Lo and Olmsted2016) document, workhorse political science ideal point models take days to run on standard datasets (e.g. Congressional roll-call votes), limiting researchers’ ability to estimate these models in data-intensive settings. Imai, Lo, and Olmsted address this issue by proposing an expectation–maximization estimator, which produces results similar to standard approaches two to three orders of magnitude more quickly.

8 For an interesting example involving facial expression recognition during televised debates, see Eleftheriadis, Rudovic, and Pantic (2015).

9 See Hainmueller and Hazlett (2013) for additional details.

10 See Hainmueller and Hazlett (2013) for a comparison of various kernel functions and additional theoretical justification for this choice. Broadly, we view the choice of kernel as a preprocessing decision, which researchers can adjust based on their particular problem domain. Other kernels besides the Gaussian kernel are certainly justified in some settings; however, simulation results in Supplementary Appendix D.2 and in Hainmueller and Hazlett (2013) suggest that this option represents a reasonable default choice.

11 In principle, if $\mathbf{K}=\mathbf{I}$ the KRLS coefficient estimates converge to those obtained via a ridge regression approach in which we regress $\mathbf{y}$ on $\mathbf{K}$ (up to a scaling factor). This relationship can be confirmed in R using $\text{glmnet}(\ldots,\text{alpha}=0)$. However, the (multidimensional) Chebyshev inequality implies $p(\mathbf{K}=\mathbf{I})=0$ at this or similar bandwidths. For observable kernels ($\mathbf{K}\neq \mathbf{I}$), the relationship between the KRLS coefficients and the ridge coefficients is not constant across iterations.
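As a quick check of the algebraic point (using base R rather than glmnet, whose internal standardization and lambda scaling we set aside here), a minimal sketch of the degenerate $\mathbf{K}=\mathbf{I}$ case might look as follows; all object names are ours.

# Illustrative only: when K = I, the KRLS-style solution (K + lambda I)^{-1} y and
# the ridge solution from regressing y on K, (K'K + lambda I)^{-1} K'y,
# both reduce to y / (1 + lambda).
set.seed(1)
N <- 100
y <- rnorm(N)
lambda <- 0.5
K <- diag(N)                                        # the degenerate K = I case

c_krls  <- solve(K + lambda * diag(N), y)
c_ridge <- solve(crossprod(K) + lambda * diag(N), crossprod(K, y))

all.equal(as.numeric(c_krls), as.numeric(c_ridge))  # TRUE
all.equal(as.numeric(c_krls), y / (1 + lambda))     # TRUE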

12 In the discrete variable case, we estimate first differences rather than derivatives, yielding a different estimator; however, the interpretation of this quantity remains similar. For details, see Section 3.3.

13 Even when $P$ is small, bigKRLS’s peak memory usage is lower: it is $O(5N^{2})$, compared with $O((P+10)N^{2})$ for KRLS, plus an additional $O(11N^{2})$ term if any of the predictors are binary. In addition to the changes discussed in Section 3.2, our algorithm also differs from the original implementation in that it constructs the simple distance matrices “just in time” for estimation and removes large matrices the moment they are no longer needed.

14 If derivatives are not estimated, the two algorithms have a fairly similar memory footprint outside of the $\lambda$ selection routine (see Section 4.2.2 for details). However, as we argue in Section 3.2, in most social science applications the model’s derivative estimates are its most attractive quality.

15 For replication materials, see Mohanty and Shaffer (2018).

16 Double-precision real numbers require 8 bytes of storage each in most scientific computing languages, including C, C++, and R. Assuming 8 GB of available memory, at least one binary predictor, and $P=67$ (as in the 2016 election model in Section 5), the original KRLS implementation will have insufficient memory if $N>3,500$. If only 4 GB are available (R’s default for many Windows laptops), the cutoffs for KRLS and bigKRLS would be $N=2,500$ and $N=10,000$, respectively, suggesting that bigKRLS can estimate 20% more publishable datasets.
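These cutoffs follow from back-of-the-envelope arithmetic. The sketch below reproduces it under the matrix counts given in footnote 13 (roughly $(P+10)+11$ simultaneous $N\times N$ matrices for the original implementation with a binary predictor, versus $5$ for bigKRLS), treating a GB as $2^{30}$ bytes; the helper function max_N is our own shorthand, not part of either package.

# Illustrative arithmetic only: largest N whose peak memory fits the budget,
# given how many N x N double-precision (8-byte) matrices are held at once.
max_N <- function(gb_available, n_matrices, bytes_per_double = 8) {
  floor(sqrt(gb_available * 2^30 / (n_matrices * bytes_per_double)))
}

P <- 67                           # predictors in the Section 5 election model
max_N(8, (P + 10) + 11)           # original KRLS, 8 GB: about 3,500
max_N(4, (P + 10) + 11)           # original KRLS, 4 GB: about 2,500
max_N(4, 5)                       # bigKRLS, 4 GB: about 10,000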

17 A nearly identical procedure is used for out-of-sample prediction.

18 How closely the first differences resemble an experiment depends on the entropy of $\mathbf{K}_{\{1\}}$ and $\mathbf{K}_{\{0\}}$ (Hazlett 2016).

19 The first $N$ entries of $\mathbf{h}$ are $\frac{1}{N}$ and the next $N$ are $-\frac{1}{N}$.
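Concretely, a minimal sketch of how such a contrast vector could be built and applied is given below. The stacked fitted-value object is a placeholder of our own; it stands in for the $2N$ fitted values implied by the reexpressed kernel in Figure 4, not for the package’s internal objects.

# Illustrative only: average first difference as a contrast over stacked fits.
# yhat_stacked is assumed to hold 2N fitted values: the first N with the binary
# predictor set to 1 for every observation, the next N with it set to 0.
N <- 1000
yhat_stacked <- rnorm(2 * N)              # placeholder for real fitted values
h <- c(rep(1 / N, N), rep(-1 / N, N))     # the contrast vector described above
avg_first_difference <- sum(h * yhat_stacked)
avg_first_difference                      # mean of first block minus mean of second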

20 Mercer’s theorem enables regularization because the kernel’s eigendecomposition takes a known form even in high-dimensional space, ultimately enabling $\lambda$ to be found in a finite, unidimensional space (Beck and Ben-Tal 2006; Hastie, Tibshirani, and Friedman 2008; El Karoui 2010).
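To fix ideas, the sketch below runs a generic one-dimensional search for $\lambda$ that reuses a single eigendecomposition of the kernel and scores candidates by leave-one-out error. It is not the golden section routine bigKRLS actually uses (see Supplementary Appendix A), and the simulated data, Gaussian kernel with bandwidth $P$, and grid of candidate values are our own simplifications.

# Illustrative only: one-dimensional search for lambda in kernel ridge regression,
# reusing one eigendecomposition of K across candidate values.
set.seed(1)
N <- 300; P <- 3
X <- scale(matrix(rnorm(N * P), N, P))
y <- as.numeric(sin(X[, 1]) + 0.25 * rnorm(N))

K <- exp(-as.matrix(dist(X))^2 / P)       # Gaussian kernel, bandwidth P
eig <- eigen(K, symmetric = TRUE)
Q <- eig$vectors; e <- eig$values

loo_error <- function(lambda) {
  shrink <- e / (e + lambda)                        # eigenvalues of K (K + lambda I)^{-1}
  S_diag <- as.numeric(Q^2 %*% shrink)              # diagonal of that smoother matrix
  yhat   <- as.numeric(Q %*% (shrink * crossprod(Q, y)))
  sum(((y - yhat) / (1 - S_diag))^2)                # leave-one-out sum of squares
}

grid <- 10^seq(-4, 2, length.out = 50)
lambda_hat <- grid[which.min(sapply(grid, loo_error))]
lambda_hat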

21 See Supplementary Appendix A for details. In practice, convergence for the applied problems we study in this paper takes 5–20 iterations. Without eigentruncation, at $N=5,000$ each iteration takes 3.8 seconds with the new algorithm versus 16.1 seconds with the original; at $N=10,000$, each iteration takes approximately 8 minutes versus approximately 13 minutes.

22 This option is also available in the original KRLS implementation.

23 This quantity is often expressed in terms of squared singular values. However, since $\mathbf{K}$ is positive semidefinite, the squared singular values and eigenvalues are equivalent in this case. We express $N_{\mathit{Effective}}$ as a function of the eigenvalues to remain consistent with the bigKRLS software architecture (which uses a decomposition of the symmetric kernel rather than a singular value decomposition) and related work (see, e.g., Rifkin and Lippert 2007). In their paper, Hainmueller and Hazlett (2013) note this relationship but do not use it as part of their degrees-of-freedom calculations.
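For readers who want a concrete calculation, the sketch below computes an effective-degrees-of-freedom-style quantity from the kernel’s eigenvalues, assuming the familiar $\mathrm{tr}\{\mathbf{K}(\mathbf{K}+\lambda\mathbf{I})^{-1}\}$ form; we do not claim this reproduces bigKRLS’s $N_{\mathit{Effective}}$ exactly, and the simulated data, bandwidth, and $\lambda$ value are our own illustration.

# Illustrative only: an effective-sample-size-style quantity from K's eigenvalues,
# assuming the trace(K (K + lambda I)^{-1}) form.
set.seed(1)
N <- 300; P <- 3
X <- scale(matrix(rnorm(N * P), N, P))
K <- exp(-as.matrix(dist(X))^2 / P)

e <- eigen(K, symmetric = TRUE, only.values = TRUE)$values
lambda <- 1                               # placeholder regularization value
N_effective <- sum(e / (e + lambda))
N_effective                               # between 0 and N; shrinks as lambda grows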

24 Unlike $\mathbf{X}\mathbf{X}'$ and $\mathbf{X}'\mathbf{X}$, $\mathbf{K}$ does not have exactly $P$ nonzero eigenvalues (but $\mathbf{K}$’s eigenvalues are real, positive values that sum to $N$). For detail on the kernel’s spectrum, see El Karoui (2010). For the application to this algorithm, see Section 4.2.2.
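Both properties are easy to verify numerically; a small sketch (with simulated, standardized data and a Gaussian kernel of bandwidth $P$, our own illustrative choices) follows.

# Illustrative only: for a Gaussian kernel the diagonal entries are all 1, so the
# (real, nonnegative) eigenvalues sum to trace(K) = N, and typically far more than
# P of them are nonzero.
set.seed(1)
N <- 200; P <- 5
X <- scale(matrix(rnorm(N * P), N, P))
K <- exp(-as.matrix(dist(X))^2 / P)

e <- eigen(K, symmetric = TRUE, only.values = TRUE)$values
c(sum_of_eigenvalues = sum(e), N = N)     # equal up to numerical error
sum(e > 1e-8)                             # far more than P nonzero eigenvalues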

25 Since the complexity of the original R implementation depends on both the number of predictor variables and the presence of binary variables, the original KRLS implementation is impractical at $N>3,000$ with the predictor variables we include.

26 See Supplementary Appendix D.2 for model fit comparison between KRLS and other approaches.

27 See Supplementary Appendix E.1 for details.

28 See Supplementary Appendix E.2 for a more detailed effect size plot.

References

Beck, A., and Ben-Tal, A. 2006. On the solution of the Tikhonov regularization of the total least squares problem. Journal of Optimization 17:98–118.
Boutsidis, C., Mahoney, M. W., and Drineas, P. 2009. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 968–977.
Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.
Case, A., and Deaton, A. 2015. Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century. Proceedings of the National Academy of Sciences 112(49):15078–15083.
Chipman, H. A., George, E. I., and McCulloch, R. E. 2010. BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1):266–298.
Demmel, J. W. 1997. Applied numerical linear algebra. Philadelphia, PA: SIAM.
Diaconis, P., Goel, S., and Holmes, S. 2008. Horseshoes in multidimensional scaling and local kernel methods. The Annals of Applied Statistics 2:777–807.
El Karoui, N. 2010. The spectrum of kernel random matrices. The Annals of Statistics 38(1):1–50.
Eleftheriadis, S., Rudovic, O., and Pantic, M. 2015. Discriminative shared Gaussian processes for multiview and view-invariant facial expression recognition. IEEE Transactions on Image Processing 24(1):189–204.
Ferwerda, J., Hainmueller, J., and Hazlett, C. 2017. Kernel-based regularized least squares in R (KRLS) and Stata (krls). Journal of Statistical Software 79(3):1–26.
Gill, J. 1999. The insignificance of null hypothesis significance testing. Political Research Quarterly 52(3):647–674.
Gu, C., Jeon, Y., and Lin, Y. 2013. Nonparametric density estimation in high-dimensions. Statistica Sinica 23(3):1131–1153.
Guo, J. 2016. Death predicts whether people vote for Donald Trump. Washington Post. Available at: https://www.washingtonpost.com/news/wonk/wp/2016/03/04/death-predicts-whether-people-vote-for-donald-trump/?utm_term=.7d2dd542d4cd.
Hainmueller, J., and Hazlett, C. 2013. Kernel regularized least squares: Reducing misspecification bias with a flexible and interpretable machine learning approach. Political Analysis 22(2):143–168.
Hastie, T., Tibshirani, R., and Friedman, J. 2008. The elements of statistical learning. 2nd edn. New York: Springer.
Hastie, T., Tibshirani, R., and Wainwright, M. 2015. Statistical learning with sparsity: The LASSO and generalizations. New York: CRC Press.
Hazlett, C. 2016. Kernel balancing: A flexible non-parametric weighting procedure for estimating causal effects. Available at: https://arxiv.org/abs/1605.00155.
Homrighausen, D., and McDonald, D. J. 2016. On the Nyström and column-sampling methods for the approximate principal components analysis of large datasets. Journal of Computational and Graphical Statistics 25(2):344–362.
Imai, K., Lo, J., and Olmsted, J. 2016. Fast estimation of ideal points with massive data. American Political Science Review 110(4):631–656.
Jackman, S. 2009. Bayesian analysis for the social sciences. West Sussex: John Wiley & Sons.
Keele, L. 2015. The statistics of causal inference: A view from political methodology. Political Analysis 23:313–335.
Mohanty, P., and Shaffer, R. 2018. Replication data for: Messy data, robust inference? Navigating obstacles to inference with bigKRLS. https://doi.org/10.7910/DVN/CYYLOK, Harvard Dataverse, V1.
Monnat, S. M. 2016. Deaths of despair and support for Trump in the 2016 presidential election. Pennsylvania State University Department of Agricultural Economic Research Brief. Available at: https://aese.psu.edu/directory/smm67/Election.16.pdf.
Papadimitriou, C. H. 2003. Computational complexity. In Encyclopedia of computer science. Chichester, UK: John Wiley and Sons Ltd, pp. 260–265.
Ratkovic, M., and Tingley, D. 2017. Sparse estimation and uncertainty with application to subgroup analysis. Political Analysis 25(1):1–40.
Rifkin, R., Yeo, G., and Poggio, T. 2003. Regularized least-squares classification. NATO Science Series Sub Series III Computer and Systems Sciences 190:131–154.
Rifkin, R. M., and Lippert, R. A. 2007. Notes on regularized least squares. Computer Science and Artificial Intelligence Laboratory Technical Report.
Siegel, Z. 2016. Is the opioid crisis partly to blame for President Trump? Slate Magazine. Available at: http://www.slate.com/articles/health_and_science/medical_examiner/2016/12/the_trump_heroin_connection_is_still_unclear.html.
Taylor, J., and Tibshirani, R. J. 2015. Statistical learning and selective inference. Proceedings of the National Academy of Sciences 112:7629–7634.
Tibshirani, R. 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society 58:267–288.
Tibshirani, R. J., and Rosset, S. 2016. Excess optimism: How biased is the apparent error of an estimator tuned by SURE? Preprint, arXiv:1612.09415.
Wahba, G. 1983. Bayesian “confidence intervals”. Journal of the Royal Statistical Society 45(1):133–150.
Witten, I. H., Frank, E., and Hall, M. A. 2011. Data mining: Practical machine learning tools and techniques. 3rd edn. Burlington, MA: Elsevier.
Yu, K., Xu, W., and Gong, Y. 2009. Deep learning with kernel regularization for visual recognition. In Advances in neural information processing systems, ed. Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. Red Hook, NY: Curran Associates, pp. 1889–1896.
Zhang, Z., Dai, G., and Jordan, M. I. 2011. Bayesian generalized kernel mixed models. Journal of Machine Learning Research 12:111–139.
Figure and table captions:

Table 1. Overview of the KRLS estimation procedure.
Figure 1. “Actual” marginal effects.
Figure 2. Published sample sizes. Source: American Journal of Political Science and American Political Science Review, January 2015 through January 2017 (own calculation).
Figure 3. Runtime comparisons (worst case).
Figure 4. Reexpressed kernel for first differences estimation.
Table 2. Alternatives for the spectral decomposition’s cumulative effect on runtime.
Table 3. Average marginal effect estimates.
Figure 5. Mortality first differences.
Figure 6. Marginal effect of age-adjusted mortality on $\Delta$GOP presidential vote share, 2012–16, by proportion of White population in each county.
Figure 7. Marginal effect of age-adjusted mortality on $\Delta$GOP presidential vote share, 2012–16, by mortality.

Supplementary material: Mohanty and Shaffer supplementary material (File, 2.9 MB).