
Modeling Context-Dependent Latent Effect Heterogeneity

Published online by Cambridge University Press:  20 May 2019

Diogo Ferrari*
Affiliation:
Department of Political Science, University of Michigan, 505 South State Street, 5700 Haven Hall, Ann Arbor, MI 48104, USA. Email: diogoferrari@gmail.com

Abstract

Classical generalized linear models assume that marginal effects are homogeneous in the population given the observed covariates. Researchers can never be sure a priori if that assumption is adequate. Recent literature in statistics and political science has proposed models that use Dirichlet process priors to deal with the possibility of latent heterogeneity in the covariate effects. In this paper, we extend and generalize those approaches and propose a hierarchical Dirichlet process of generalized linear models in which the latent heterogeneity can depend on context-level features. Such a model is important in comparative analyses when the data comes from different countries and the latent heterogeneity can be a function of country-level features. We provide a Gibbs sampler for the general model, a special Gibbs sampler for gaussian outcome variables, and a Hamiltonian Monte Carlo within Gibbs to handle discrete outcome variables. We demonstrate the importance of accounting for latent heterogeneity with a Monte Carlo exercise and with two applications that replicate recent scholarly work. We show how Simpson’s paradox can emerge in the empirical analysis if latent heterogeneity is ignored and how the proposed model can be used to estimate heterogeneity in the effect of covariates.

Type
Articles
Copyright
Copyright © The Author(s) 2019. Published by Cambridge University Press on behalf of the Society for Political Methodology. 

1 Introduction

This paper proposes a model to deal with context-dependent latent heterogeneity in the effect of covariates in generalized linear models (GLMs). Generalized linear models, including those with mixed effects, remain among the most widely used tools for multivariate analysis in political science. Among the many assumptions such models require (e.g., conditional independence), researchers must assume that no important covariates were left out. In that regard, much has been said in political science, statistics, and econometrics about the problems caused by omitting additive covariates from the model, but much less about the issues surrounding unobserved confounders that condition the effects of observed covariates. Conditioning features can lead to a well-known phenomenon in statistics called Simpson’s paradox (a.k.a. the aggregation paradox): an effect found when data are aggregated can be completely different or even reversed when data are separated into groups (Pearson, Lee, and Bramley-Moore Reference Pearson, Lee and Bramley-Moore1899; Yule Reference Yule1903; Simpson Reference Simpson1951; Blyth Reference Blyth1972). The crucial point connecting the paradox and omitted variables is that, in the typical situation, researchers can never be sure a priori that there are no latent or unobserved groups (a.k.a. clusters) with heterogeneous effects, nor how many of them exist.
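Simpson’s paradox as described above is easy to reproduce numerically. The following sketch (purely illustrative; the two latent groups, their slopes, and their offsets are invented for the example) produces a pooled regression slope whose sign is the reverse of the slope within each latent group:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    x = x - x.mean()
    return float(x @ (y - y.mean()) / (x @ x))

# Two latent groups: within each, y increases with x (slope +1),
# but the group means are offset so the pooled slope is negative.
x1 = rng.normal(0, 1, 500); y1 = 1.0 * x1 + rng.normal(0, 0.3, 500)
x2 = rng.normal(4, 1, 500); y2 = 1.0 * x2 - 8 + rng.normal(0, 0.3, 500)

x = np.concatenate([x1, x2]); y = np.concatenate([y1, y2])

print(ols_slope(x1, y1))  # within-group: close to +1
print(ols_slope(x2, y2))  # within-group: close to +1
print(ols_slope(x, y))    # pooled: negative -- the paradox
```

If group membership were observed, fitting the regressions separately would recover the correct positive effect; the difficulty addressed in this paper is precisely that such membership is often latent.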

Consider, for example, the study of voters’ preferences for redistribution. It is well known that features such as income and race can affect support for redistributive policies (Alesina and Angeletos Reference Alesina and Angeletos2005; Rehm Reference Rehm2009; Shayo Reference Shayo2009; Alesina and Giuliano Reference Alesina, Giuliano, Benhabib, Bisin and Jackson2010), but the effects of observed factors like income and race can be heterogeneous among subpopulations due to unobserved factors such as motivation, personal history, and ability (Stegmueller Reference Stegmueller2013). Consequently, the effect of income, for example, estimated when data are aggregated can be very different from the effect that would be estimated if we had observed motivation and considered low- and high-motivation groups separately or treated the income effect as conditional upon motivation.

Although that problem occurs in all scientific disciplines, perhaps it is more salient in the social sciences because, to mention a few reasons, problems have high dimensionality and often many dimensions remain unmeasured; data are often difficult to collect or are unavailable for privacy or other reasons; culture-specific aspects are not well measured; some subjects may conceal information from researchers purposefully; or researchers may simply be unaware of possible latent interactive factors (Stegmueller Reference Stegmueller2013; Traunmuller, Murr, and Gill Reference Traunmuller, Murr and Gill2015).

Especially in comparative politics, an additional layer of complication seems likely: the latent heterogeneity can depend on context-level features. For instance, some researchers have shown that the effect of income is conditional on country-level variables such as the progressivity of the tax system (Beramendi and Rehm Reference Beramendi and Rehm2016), the levels of inequality and crime rates (Rueda and Stegmueller Reference Rueda and Stegmueller2016), national identity (Johnston et al. Reference Johnston, Banting, Kymlicka and Soroka2010), and the existing levels of redistribution (Svallfors Reference Svallfors1997; Arts and Gelissen Reference Arts and Gelissen2001). If there is latent effect heterogeneity due to unobserved factors like motivation or personal experiential history among the population from a given context, say a country, it is very likely that a different heterogeneity manifests in other countries. In other words, suppose the effect of income is heterogeneous between two groups of voters (clusters) in the United States (the context) and we do not know the group membership of individual voters. We would expect to see heterogeneity among the population in another context, say Italy, but we should not expect to find the same two latent groups in Italy (or in other contexts). Maybe there are more or fewer latent groups in Italy, or maybe some latent subpopulations are similar in Italy and the United States. For instance, high- and low-motivation Italians and Americans may have welfare opinions similarly affected by their income, but Italians’ personal experiences of crime may modify the income effect on welfare support very differently than Americans’ do. In sum, the characteristics of the within-context heterogeneity (clustering) can vary from one context (e.g., country) to another, and that within-context heterogeneity may depend on the characteristics of the context itself.

Practitioners in political science have long recognized these challenges. The possibility of omitting relevant conditioning factors, in conjunction with cross-context differences, has been stressed as an important source of radical skepticism regarding the results of observational and experimental empirical investigation in the social sciences in general, and in comparative analysis in particular (Przeworski Reference Przeworski, Boix Boix and Stokes2007; Stokes Reference Stokes and Teele2014).

The literature has proposed different approaches to address effect heterogeneity. The appropriate approach depends on whether the grouping features are known and measured. When the groups are observed, classical approaches include mixed models with gaussian-distributed random effects (e.g., hierarchical linear models (HLMs)). Suppose, for example, that we are analyzing data from many countries (contexts) and in each country there are different subpopulations with heterogeneous effects. If we knew the subpopulation to which each individual belongs, we could use a classical mixed-effects model at the country and subpopulation levels. However, the distributional assumption on the random effects in such an approach is often criticized because of the single modality, light tails, and symmetry of the normal distribution, which imposes unnecessary and often unjustifiable constraints on the analysis in the empirical modeling stage (Verbeke and Lesaffre Reference Verbeke and Lesaffre1997; Heinzl and Tutz Reference Heinzl and Tutz2013). In addition, such an approach only works if the heterogeneous groups within each context are known and observed, and researchers have to assume that there is no other latent or unobserved feature that can cause effect heterogeneity. Some modeling approaches work for single-context cases in which subpopulation membership is unknown or unobserved. When one wants to investigate subpopulations with latent heterogeneity within a given context and the number of subpopulations is known or assumed to be finite and fixed, finite mixture models (FMMs) are often used (Ng et al. Reference Ng, McLachlan, Wang, Ben-Tovim Jones and Ng2006; De la Cruz-Mesía, Quintana, and Marshall Reference De la Cruz-Mesía, Quintana and Marshall2008; Villarroel, Marshall, and Barón Reference Villarroel, Marshall and Barón2009). More commonly, however, researchers do not know if or how many latent heterogeneous subpopulations exist within a given context.
Recent contributions in the statistics literature have proposed models that use a Dirichlet process prior (DPP) to deal with these single-context cases with unknown subpopulation heterogeneity/clustering (Mukhopadhyay and Gelfand Reference Mukhopadhyay and Gelfand1997; Kleinman and Ibrahim Reference Kleinman and Ibrahim1998b; Hannah, Blei, and Powell Reference Hannah, Blei and Powell2011; Heinzl and Tutz Reference Heinzl and Tutz2013). DPP models have been used in the marketing literature to model the error term with a flexible distribution and the heterogeneity of consumers’ demand in discrete choice models (Rossi, Allenby, and McCulloch Reference Rossi, Allenby and McCulloch2006; Rossi Reference Rossi2014), and in latent instrumental variable (LIV) models to deal with endogeneity of covariates (Ebbes et al. Reference Ebbes, Wedel, Böckenholt and Steerneman2005; Ebbes, Wedel, and Böckenholt Reference Ebbes, Wedel and Böckenholt2009). Related work has also been developed in the econometrics and program evaluation literature to study effect heterogeneity of training programs (Aakvik, Heckman, and Vytlacil Reference Aakvik, Heckman and Vytlacil2005; Chen Reference Chen2007; Heckman and Vytlacil Reference Heckman and Vytlacil2007; Ichimura and Todd Reference Ichimura and Todd2007; Matzkin Reference Matzkin2007). DPP models have been applied in political science to study the lengths of time political appointees stay in their appointed positions (Gill and Casella Reference Gill and Casella2009), political priorities of senators (Grimmer Reference Grimmer2009), intraparty voting (Spirling and Quinn Reference Spirling and Quinn2010), immigrant turnout in elections (Traunmuller, Murr, and Gill Reference Traunmuller, Murr and Gill2015), and dynamic aspects of preferences for redistribution (Stegmueller Reference Stegmueller2013).

Those DPP approaches, however, have three limitations. First, they are usually designed for specific types of dependent variables, e.g., outcome variables measured on an ordered scale. Second, particularly in the political science literature, previous works have used the DPP mostly as a prior for the intercept (or error) term only. Third, and most importantly, previous works were not designed to study cases in which the latent heterogeneity is context-dependent.

To redress these limitations, this paper proposes a Dirichlet mixture of generalized linear models in which the within-context effect heterogeneity (clustering) can be context-dependent. The proposed model is a generalization, from the point of view of the expectation of the dependent variable, of the usual generalized linear model (GLM), classical generalized linear mixed models (GLMM), finite mixture models (FMMs), and current single-context Dirichlet mixtures of generalized linear models. The proposed model has several advantages over those special cases.

First, when there are multiple contexts, for instance in cross-country comparative analysis, the model can be used to investigate if country features are associated with latent heterogeneity in the covariate effects; that is, if country-level features affect the number and the characteristics of the subpopulation clusters.

Second, the proposed Dirichlet mixture of generalized linear models is developed in its full generality to handle Dirichlet mixtures of any distribution in the exponential family, investigate heterogeneity not only in the error term but in the effect of any observed covariates, and, as mentioned, study how such heterogeneity varies with context-level features. This paper implements two special cases: binary and continuous outcomes, modeled using Bernoulli and gaussian distributions, respectively. The algorithms for estimation of these special cases are presented, but an MCMC algorithm with a Gibbs sampler is derived for the more general model, so it can easily be extended to other outcome variable distributions.

Third, as a generalization of the other models, it can be used in any situation in which one of the more specialized models is well justified. If, in fact, one believes that a single GLM can be used across contexts and there is no latent heterogeneity in the population, the proposed model can be estimated and it will produce results for the conditional expectation of the dependent variable similar to those estimated using a GLM. If there is just one context, but unknown clusters, it can be used instead of the single-context Dirichlet mixture of GLMs. The analogous situation holds for the other special cases; i.e., whenever the researcher would estimate a GLMM or a finite mixture model (FMM), the proposed model can be used, and it has two additional advantages: the number of latent clusters, which is allowed to grow with the size of the data, is estimated simultaneously; and, if the data comes from different contexts, the effect of the context on the characteristics of the clusters is also investigated.

Fourth, the model estimates cluster memberships, so we can classify the data points into (latent) groups. The clusters differ in terms of the vector of linear coefficients that connect covariates to the outcome variable. Thus, the model can be used to study and characterize the heterogeneity in the effect of the covariates within and across contexts.

Fifth, the statistics and epidemiology literature has proposed approaches for dealing with Simpson’s paradox based on domain knowledge (Hernán, Clayton, and Keiding Reference Hernán, Clayton and Keiding2011) and estimation diagnostics (Kievit et al. Reference Kievit, Frankenhuis, Waldorp and Borsboom2013). Its formal aspects and its connection to other problems have also been studied (Samuels Reference Samuels1993; Hernán, Clayton, and Keiding Reference Hernán, Clayton and Keiding2011; Pearl Reference Pearl2011, Reference Pearl2014). However, to the best of our knowledge, the literature has not proposed any modeling solution. We connect the model proposed here with Simpson’s paradox in the context of generalized linear models and show how it can be used to detect the occurrence of the paradox and to deal with such problems by estimating the cluster-specific effects.

The rest of the paper is organized as follows. The next section presents the model. The following section demonstrates how the proposed model is connected to classical GLMs, mixed models, FMMs, and the econometric models mentioned above that use a DPP to deal with heterogeneity in single-context analyses. Section 4 develops MCMC algorithms to estimate the model in its full generality and for two special cases of outcome variables. In Section 5 we conduct a Monte Carlo exercise to study the frequentist properties of the estimation. The estimation is tested against a large variety of scenarios with and without latent heterogeneity. The section also illustrates how the model can be used to deal with Simpson’s paradox in the context of generalized linear models, and it compares the results of a GLM estimated by maximum likelihood (MLE) with those produced by the proposed model using the MCMC developed in Section 4. Section 6 uses the model to analyze real data sets. It replicates previous studies and shows how the model uncovers latent heterogeneity and Simpson’s paradox. Finally, the conclusions are presented.

2 The Model

To restate the problem: first, we want to use a generalized linear model to estimate the effect of the covariates $X_{i}$ on $y_{i}$. Second, we want to take into account the possibility that the effect of the covariates is heterogeneous across different subgroups whose defining features are latent or were not observed. In other words, there might be latent subpopulations of individuals for which the covariates have a different relationship with the outcome. Third, we want to allow this latent subpopulation heterogeneity or clustering to be investigated both for data that comes from a single context and for data from multiple contexts. Finally, when the observed population comes from different contexts (e.g., different countries and different years), we want to investigate whether context-level features change not only the effect of observed covariates on the outcome but also the existence and the characteristics of latent subpopulations in which the observed covariates have different effects.

The model that deals with such problems can be developed as follows. For each observation $i$, suppose we have a set of observed covariates $X_{i}^{\prime }\in \mathbb{R}^{D_{x}}$ and an outcome variable $y_{i}$. Denote $X_{i}=(1,X_{i}^{\prime })$. Let $K$ denote the number of heterogeneous groups in the population, which is allowed to grow with the population size, and let $Z_{i}$ indicate the group of $i$. $Z_{i}$ and $K$ may or may not be known or observed. When $Z_{i}$ is not observed, we use the term “clusters” instead of “groups.” Denote $C_{i}$ the context of $i$, so $C_{i}=j$ indicates that observation $i$ comes from the context indexed by $j\in \{1,\ldots ,J\}$, where $J$ is the number of contexts.

For the purpose of illustration and as a toy example, suppose we want to investigate the effect of income and race on voters’ support for welfare policies in different countries. Then $X_{i}$ are measures of income and race of individual  $i$ , and $y_{i}$ is his degree of support for welfare policies. The variable $C_{i}=j$ indicates the country where $i$ lives, and data are collected in $J$ countries. Suppose further that in each country the population is divided into types of individuals with different personal experiences with class and racial conflict. The types are not observed but we suspect the effect of income and race is conditional on the type. The latent variable $Z_{i}=k$ indicates that $i$ is type $k$ and $K$ denotes the number of different types.

If $p(\,)$ denotes a distribution in the exponential family, $g$ a link function, and $\unicode[STIX]{x1D703}=(\unicode[STIX]{x1D6FD},\unicode[STIX]{x1D70E})$ , then the group- and context-specific GLM is given by:

(1) $$\begin{eqnarray}y_{i}\mid Z_{i},X_{i},C_{i},\unicode[STIX]{x1D703}_{C_{i}Z_{i}}\sim p(y_{i}\mid X_{i},\unicode[STIX]{x1D703}_{C_{i}Z_{i}})\ni \mathbb{E}[y_{i}\mid \cdot ]=\unicode[STIX]{x1D707}_{i}=g^{-1}(X_{i}^{T}\unicode[STIX]{x1D6FD}_{C_{i}Z_{i}}),\quad Z_{i}=1,\ldots ,K.\end{eqnarray}$$

If $Z_{i}$ were observed and $K$ therefore known, one could use classical mixed-effects models to estimate group- and context-specific heterogeneous effects. If $K$ were known, but $Z_{i}$ latent or unobserved, one option would be to use finite mixture models for the estimation (Gaffney Reference Gaffney2003; Ng et al. Reference Ng, McLachlan, Wang, Ben-Tovim Jones and Ng2006; De la Cruz-Mesía, Quintana, and Marshall Reference De la Cruz-Mesía, Quintana and Marshall2008; Villarroel, Marshall, and Barón Reference Villarroel, Marshall and Barón2009). When $Z_{i}$ is latent and $K$ is unknown, some authors have proposed models that place a DPP on $\unicode[STIX]{x1D703}$ in order to estimate cluster-specific effects (Mukhopadhyay and Gelfand Reference Mukhopadhyay and Gelfand1997; Kleinman and Ibrahim Reference Kleinman and Ibrahim1998a,Reference Kleinman and Ibrahimb; Dorazio et al. Reference Dorazio, Mukherjee, Zhang, Ghosh, Jelks and Jordan2008; Gill and Casella Reference Gill and Casella2009; Heinzl and Tutz Reference Heinzl and Tutz2013; Stegmueller Reference Stegmueller2013; Traunmuller, Murr, and Gill Reference Traunmuller, Murr and Gill2015). We refer here to such models as Dirichlet process generalized linear models (dpGLM), following Hannah, Blei, and Powell (Reference Hannah, Blei and Powell2011). Contrary to their formulation, however, we assume that $X_{i}$ is given. If we denote by ${\mathcal{D}}{\mathcal{P}}(\unicode[STIX]{x1D6FC},G)$ the Dirichlet process with concentration parameter $\unicode[STIX]{x1D6FC}$ and base measure $G$, the GLM is modified in the following way to produce the dpGLM:

(2) $$\begin{eqnarray}\begin{array}{@{}l@{}}G\mid \unicode[STIX]{x1D6FC}_{o},G_{o}\sim {\mathcal{D}}{\mathcal{P}}(\unicode[STIX]{x1D6FC}_{o},G_{o})\\ \unicode[STIX]{x1D703}_{i}\mid G\sim G\\ y_{i}\mid X_{i},\unicode[STIX]{x1D703}_{i}\sim p(y_{i}\mid X_{i},\unicode[STIX]{x1D703}_{i}),\quad \mathbb{E}[y_{i}\mid \cdot ]=\unicode[STIX]{x1D707}_{i}=g^{-1}(X_{i}^{T}\unicode[STIX]{x1D6FD}_{i}).\end{array}\end{eqnarray}$$
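What makes (2) a clustering model is that a draw $G$ from a Dirichlet process is almost surely discrete, so distinct observations receive identical coefficient draws with positive probability. A minimal sketch of that property via the Chinese restaurant process representation of the DPP (the concentration value alpha = 2 and the sample size are arbitrary choices for illustration):

```python
import numpy as np

def crp_partition(n, alpha, rng):
    """Sample cluster labels for n observations from a Dirichlet
    process via the Chinese restaurant process: observation i joins
    an existing cluster k with probability n_k / (i + alpha), or a
    new cluster with probability alpha / (i + alpha)."""
    labels = [0]
    counts = [1]
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= i + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(int(k))
    return labels

rng = np.random.default_rng(1)
labels = crp_partition(1000, alpha=2.0, rng=rng)
# The number of distinct clusters grows roughly like alpha * log(n),
# far fewer than n: ties among the theta_i are what create clusters.
print(len(set(labels)))
```

The same mechanism underlies the stick-breaking construction used later in the paper; the CRP is simply its marginal representation over partitions.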

Authors have warned that using DPP can lead to biased estimators, and it is known that neither weak consistency nor asymptotic unbiasedness are guaranteed in general in DPP models (Diaconis and Freedman Reference Diaconis and Freedman1986; Ghosal, Ghosh, and Ramamoorthi Reference Ghosal, Ghosh and Ramamoorthi1999; Tokdar Reference Tokdar2006; Kyung et al. Reference Kyung, Gill and Casella2010). Although bias will always be present due to the bayesian priors, Hannah, Blei, and Powell (Reference Hannah, Blei and Powell2011) demonstrated that the dpGLM satisfies the conditions that guarantee weak consistency of the joint posterior distribution and consistency of the regression estimates (see also Tokdar Reference Tokdar2006).

The dpGLM lacks the hierarchical clustering structure that we would like to have in the model, namely that the clusters can be a function of higher-level context features in a multi-context analysis. We want to preserve the structure of the dpGLM and the DPP (because there might be unknown clusters with heterogeneous effects and unknown cluster membership) but include such context dependency (because the heterogeneity may depend on the context characteristics).

Some authors have proposed different approaches to model hierarchical clustering and to create dependencies among multiple Dirichlet processes (Mallick and Walker Reference Mallick and Walker1997; Carota and Parmigiani Reference Carota and Parmigiani2002; De Iorio et al. Reference De Iorio, Müller, Rosner and MacEachern2004; Müller, Quintana, and Rosner Reference Müller, Quintana and Rosner2004; Teh et al. Reference Teh, Jordan, Beal and Blei2006). We can generalize and combine these approaches with the dpGLM in the following way. Let $W_{j}^{\prime }\in \mathbb{R}^{D_{w}}$ denote the context-level features of context $j$ and $J$ the total number of contexts as before. Let $W_{j}=(1,W_{j}^{\prime })$ .

In our toy example, $W_{j}$ can be thought of as the level of economic development and inequality of the country (context)  $j$ . It means we want to investigate if the effect of income and race (the observed covariates $X_{i}$ ) on support for redistribution ( $y_{i}$ ) varies with the degree of economic development and inequality of the country ( $W_{j}$ ). Moreover, we also want to investigate if the effect of those observed covariates (income and race) is different among within-country subpopulations whose membership ( $Z_{i}$ ) is unobserved. Finally, we want to verify if those subpopulations vary from one country (context) to another due to a country’s level of inequality and economic development (context-level features  $W_{j}$ ).

The model can be modified in the following way to introduce context-level dependency among DPP:

(3) $$\begin{eqnarray}\begin{array}{@{}l@{}}G_{j}\mid \unicode[STIX]{x1D6FC}_{o},G_{o},W_{j}\sim {\mathcal{D}}{\mathcal{P}}(\unicode[STIX]{x1D6FC}_{o},G_{o}(W_{j}))\\ \unicode[STIX]{x1D703}_{ji}\mid G_{j}\sim G_{j}\\ y_{i}\mid X_{i},C_{i},\unicode[STIX]{x1D703}_{ji}\sim p(y_{i}\mid X_{i},\unicode[STIX]{x1D703}_{ji}),\quad \mathbb{E}[y_{i}\mid \cdot ]=\unicode[STIX]{x1D707}_{ji}=g^{-1}(X_{i}^{T}\unicode[STIX]{x1D6FD}_{ji}).\end{array}\end{eqnarray}$$

We refer to model (3) as the hierarchical Dirichlet process generalized linear model (hdpGLM). It generalizes the GLM and the dpGLM and provides a hierarchical clustering structure that is context-dependent. Because it generalizes GLMs, it can be used even if there is neither heterogeneity nor multiple contexts. The advantage of using the hdpGLM is that clusters can be uncovered when the researcher is uncertain about the existence of heterogeneous effects. A model selection procedure can be adopted to decide whether the GLM or the hdpGLM results are adequate for the data at hand (Mukhopadhyay and Gelfand Reference Mukhopadhyay and Gelfand1997).

To complete the formulation of the model and connect (3) and (1), let $Z_{ik}$ denote the case in which individual $i$ belongs to the subpopulation indexed by $k$, that is, $Z_{i}=k$. Let $C_{ij}$ indicate that individual $i$ belongs to context (country) $j$, that is, $C_{i}=j$. We can parameterize the effect of the context-level covariates $W$ with $\unicode[STIX]{x1D70F}\in \mathbb{R}^{(D_{w}+1)\times (D_{x}+1)}$ and rewrite model (3) using the stick-breaking construction (Sethuraman Reference Sethuraman1994; Teh et al. Reference Teh, Jordan, Beal and Blei2006). The resulting model in its full generality is the following:

(4) $$\begin{eqnarray}\begin{array}{@{}l@{}}V_{l}\mid \unicode[STIX]{x1D6FC}_{o}\sim \text{Beta}(1,\unicode[STIX]{x1D6FC}_{o})\\ \unicode[STIX]{x1D70B}_{k}=\left\{\begin{array}{@{}ll@{}}V_{1},\quad & k=1,\\ \displaystyle V_{k}\mathop{\prod }_{l=1}^{k-1}(1-V_{l}),\quad & k>1,\end{array}\right.\\ \begin{array}{@{}ll@{}}Z_{i}\mid \unicode[STIX]{x1D70B}\sim \text{Cat}(\unicode[STIX]{x1D70B}),\quad & \unicode[STIX]{x1D70B}\in \unicode[STIX]{x1D6E5}^{\infty }\\ \unicode[STIX]{x1D70F}_{d}\sim p(\unicode[STIX]{x1D70F}_{d}),\quad & d=1,\ldots ,D_{x}+1\\ \unicode[STIX]{x1D703}_{kj}\mid Z_{ik},\unicode[STIX]{x1D70F},C_{ij},W\sim p(\unicode[STIX]{x1D703}_{jk}\mid W,\unicode[STIX]{x1D70F}),\quad & j=1,\ldots ,J\end{array}\\ \begin{array}{@{}rcl@{}}y_{i}\mid Z_{ik},\unicode[STIX]{x1D703}_{kj},X_{i},C_{ij}\sim p(y_{i}\mid Z_{ik},C_{ij},X_{i},\unicode[STIX]{x1D703}_{kj})\ & \ni \ & \mathbb{E}[y_{i}\mid Z_{ik},\unicode[STIX]{x1D703}_{kj},X_{i},C_{ij}]=g^{-1}(X_{i}^{T}\unicode[STIX]{x1D6FD}_{kj}),\\ \ & \ & p(y_{i}\mid Z_{ik},C_{ij},X_{i},\unicode[STIX]{x1D703}_{kj})~\text{from exponential family}.\!\!\!\!\!\!\!\end{array}\end{array}\end{eqnarray}$$
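The weights in the stick-breaking construction of (4) can be computed directly from the Beta draws. A minimal sketch, truncated at an arbitrary 50 components (the truncation level and the value of alpha are illustrative choices, not part of the model):

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights: V_l ~ Beta(1, alpha),
    pi_k = V_k * prod_{l<k} (1 - V_l)."""
    v = rng.beta(1.0, alpha, size=K)
    # Length of stick remaining before each break: 1, (1-V_1), (1-V_1)(1-V_2), ...
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

rng = np.random.default_rng(2)
pi = stick_breaking(alpha=1.0, K=50, rng=rng)

# Cluster labels Z_i ~ Cat(pi), renormalizing the truncated weights.
z = rng.choice(50, size=10_000, p=pi / pi.sum())
print(pi[:5], pi.sum())
```

With 50 components the leftover stick mass is negligible, so the truncated weights sum to essentially one; in the paper's samplers the truncation is handled within the MCMC rather than fixed in advance.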

We further assume that, for $\unicode[STIX]{x1D703}=(\unicode[STIX]{x1D6FD},\unicode[STIX]{x1D70E})$:

$$\begin{eqnarray}\begin{array}{@{}ll@{}}\unicode[STIX]{x1D70F}_{d}\mid \unicode[STIX]{x1D707}_{\unicode[STIX]{x1D70F}},\unicode[STIX]{x1D6F4}_{\unicode[STIX]{x1D70F}}\sim N_{D_{w}+1}(\unicode[STIX]{x1D707}_{\unicode[STIX]{x1D70F}},\unicode[STIX]{x1D6F4}_{\unicode[STIX]{x1D70F}}),\quad & d=1,\ldots ,D_{x}+1\\ \unicode[STIX]{x1D6FD}_{kj}\mid Z_{ik},\unicode[STIX]{x1D70F},C_{ij},W\sim N_{D_{x}+1}([W_{j}^{T}\unicode[STIX]{x1D70F}]^{T},\unicode[STIX]{x1D6F4}_{\unicode[STIX]{x1D6FD}}),\quad & j=1,\ldots ,J.\end{array}\end{eqnarray}$$

So, the variable $\unicode[STIX]{x1D70F}_{d}$ is a vector of linear coefficients of the country-level features. It determines the average effect of the individual-level features $X_{id}$ on the outcome $y_{i}$ in the cluster  $k$ .

In our toy example, $\unicode[STIX]{x1D70F}_{1}=(\unicode[STIX]{x1D70F}_{11},\unicode[STIX]{x1D70F}_{21})$ would be the linear effect of inequality and economic development (the context-level features) on $\unicode[STIX]{x1D6FD}_{1k}$, the linear effect of income (an observed covariate) on support for redistribution among people with a similar history of social and racial conflict (the unobserved type indexed by $k$). The parameter $\unicode[STIX]{x1D70F}_{2}=(\unicode[STIX]{x1D70F}_{12},\unicode[STIX]{x1D70F}_{22})$ would be the linear effect of inequality and economic development on $\unicode[STIX]{x1D6FD}_{2k}$, which is the effect of race on support for redistribution among people of type $k$. Therefore, we have a DPP clustering model that is context-dependent because the linear coefficients $\unicode[STIX]{x1D6FD}$ of the outcome variable depend on the cluster probability $\unicode[STIX]{x1D70B}$ (through $Z_{i}$) and on the context-level features $W$ (through their linear effect $\unicode[STIX]{x1D70F}$).
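To make the generative process in (4) concrete, the sketch below simulates data from an hdpGLM with a gaussian outcome and identity link. All dimensions and parameter values are invented for illustration; the point is the chain from context features W_j through the cluster coefficients beta_kj to the outcomes y_i:

```python
import numpy as np

rng = np.random.default_rng(3)
J, K, Dx, Dw = 3, 4, 2, 2     # contexts, truncation level, covariates, context features
n_per_context = 200

W = np.column_stack([np.ones(J), rng.normal(size=(J, Dw))])   # W_j = (1, W_j')
tau = rng.normal(size=(Dw + 1, Dx + 1))                       # context-level coefficients

# Stick-breaking weights pi (shared alpha_o, truncated at K)
v = rng.beta(1.0, 1.0, size=K)
pi = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
pi = pi / pi.sum()

# Cluster coefficients beta_kj ~ N(W_j^T tau, sigma_beta^2 I): the same tau
# shifts every context's clusters, but each context draws its own betas,
# so the clustering is context-dependent.
beta = np.stack([[W[j] @ tau + 0.5 * rng.normal(size=Dx + 1)
                  for _ in range(K)] for j in range(J)])      # shape (J, K, Dx+1)

y, C = [], []
for j in range(J):
    z = rng.choice(K, size=n_per_context, p=pi)               # Z_i ~ Cat(pi)
    X = np.column_stack([np.ones(n_per_context),
                         rng.normal(size=(n_per_context, Dx))])  # X_i = (1, X_i')
    mu = np.einsum("nd,nd->n", X, beta[j, z])                 # identity link: mu = X'beta
    y.append(mu + rng.normal(0.0, 0.5, n_per_context))        # gaussian outcome
    C.append(np.full(n_per_context, j))

y = np.concatenate(y)
print(y.shape)  # (600,)
```

Inference then runs this process in reverse: given (y, X, C, W), the samplers of Section 4 recover the cluster labels, the cluster coefficients, and tau.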

3 Generalized Linear Models, Finite Mixture Models, and hdpGLM

This section shows the relationship between the hdpGLM and the classical GLMs, GLMM, FMMs, and dpGLM in terms of the structure of the average parameters of the outcome variable  $y_{i}$ . In that sense, the hdpGLM can be viewed as a generalization of the other models. That generalization allows us to estimate latent or unobserved heterogeneity in the population in terms of how the covariates and the outcome are linearly related when the number of heterogeneous groups is not known in advance. The section also explores some connections between hdpGLM, latent instrumental variable (LIV) and latent-index models, which are approaches that use DPP with regression models.

As before, we denote $X_{i}=(1,X_{i}^{\prime })\in \mathbb{R}^{(D_{x}+1)\times 1}$ the observed characteristics of unit $i$, and $Z_{i}\in \{0,1\}^{\unicode[STIX]{x1D705}}$ the design variable indicating the group (or cluster) of $i$. The parameter $\unicode[STIX]{x1D705}$ represents the number of clusters. Let $\unicode[STIX]{x1D6FE}_{i}\in \mathbb{R}^{(D_{x}+1)\times \unicode[STIX]{x1D705}}$ be the cluster-specific matrix of linear coefficients such that $\unicode[STIX]{x1D6FE}_{ik}\in \mathbb{R}^{(D_{x}+1)\times 1}$, the $k^{\text{th}}$ column of $\unicode[STIX]{x1D6FE}_{i}$, contains the linear coefficients of cluster $k$. The most general formulation of the GLM, in which every individual and group has its own set of linear coefficients, is:

(5) $$\begin{eqnarray}y_{i}\mid X_{i},\unicode[STIX]{x1D6FD}_{i},\unicode[STIX]{x1D6FE}_{i}\sim p(y_{i}\mid X_{i},\unicode[STIX]{x1D6FD}_{i},\unicode[STIX]{x1D6FE}_{i})\ni \mathbb{E}[y_{i}\mid \cdot ]=\unicode[STIX]{x1D707}_{i}=g^{-1}(X_{i}^{T}\unicode[STIX]{x1D6FD}_{i}+X_{i}^{T}\unicode[STIX]{x1D6FE}_{i}Z_{i}).\end{eqnarray}$$

Define $\unicode[STIX]{x1D702}_{i}=X_{i}^{T}\unicode[STIX]{x1D6FD}_{i}+X_{i}^{T}\unicode[STIX]{x1D6FE}_{i}Z_{i}$ and for simplicity let $D_{x}=1$ . We can write

$$\begin{eqnarray}\displaystyle \unicode[STIX]{x1D702}_{i} & = & \displaystyle (\unicode[STIX]{x1D6FD}_{oi}+\unicode[STIX]{x1D6FE}_{oi}Z_{i})+(\unicode[STIX]{x1D6FD}_{1i}+\unicode[STIX]{x1D6FE}_{1i}Z_{i})X_{i}^{\prime }\nonumber\\ \displaystyle \unicode[STIX]{x1D702}_{ik} & = & \displaystyle (\unicode[STIX]{x1D6FD}_{oi}+\unicode[STIX]{x1D6FE}_{oik})+(\unicode[STIX]{x1D6FD}_{1i}+\unicode[STIX]{x1D6FE}_{1ik})X_{i}^{\prime }.\nonumber\end{eqnarray}$$

Classical GLMs, GLMMs, FMMs, or hdpGLMs emerge from model (5) depending on what we know or believe about $\unicode[STIX]{x1D705},Z_{i},\unicode[STIX]{x1D6FE}_{i}$ and  $\unicode[STIX]{x1D6FD}_{i}$ . More precisely, it depends on the structural assumptions we impose on those parameters.

The classical GLM can be interpreted in two ways. Either one assumes $\gamma_{i}=0$ for all $i$ and $\beta_{i}=\beta$, which gives

(6) $$\eta_{i}=\beta_{o}+\beta_{1}X_{i}^{\prime }$$

or one assumes $\kappa=1$, which gives

(7) $$\eta_{i}=(\beta_{o}+\gamma_{o})+(\beta_{1}+\gamma_{1})X_{i}^{\prime }=\theta_{o}+\theta_{1}X_{i}^{\prime }.$$

Clearly, (6) and (7) are structurally equivalent, and treating either $\theta$ in (7) or $\beta$ in (6) as the parameter to be estimated should produce the same results.

When $\kappa>1$, $Z_{i}$ is observed, and one believes $\gamma\neq 0$, the common approach is to use fixed-, random-, or mixed-effects models. For fixed effects, one either assumes that each group $k$ has its own fixed intercept term $\theta_{ok}$, or both its own fixed intercept and slope ($\theta_{ok},\theta_{1k}$). Classical models with random effects similarly assume that each observed group has its own intercept (and/or slope) but also that, instead of being fixed, they are drawn from a common distribution. A gaussian distribution with zero mean is the standard choice for the random effects (Hayashi Reference Hayashi2000; Woodridge Reference Woodridge2002), but one can also have group-specific averages (Gelman and Hill Reference Gelman and Hill2007) (see Table 1). Mixed models use a combination of random and fixed effects.

Table 1. Relationship between GLM, GLMM, FMM and hdpGLM based on structural assumptions on $\kappa$, $Z_{i}$, and $\varphi_{i}$.

* GLM: Generalized Linear Models; FE: Fixed Effects in the intercept (FE (I)) or in both the intercept and slope (FE (I + S)); RE: Random Effects in the intercept (RE (I)) or in both the intercept and slope (RE (I + S)); FMM: finite mixture models.

When $Z_{i}$ is not observed, it is obviously not possible to use classical mixed models for the group heterogeneity. When one does not observe $Z_{i}$, but $\kappa$ is known or assumed to be finite and fixed, a finite mixture model is usually used (Lenk and DeSarbo Reference Lenk and DeSarbo2000). If we let $Z_{i}\sim \text{Cat}(\pi)$, $\pi\in \Delta^{\kappa}$, then

(8) $$\eta_{i}^{\prime }=\mathbb{E}[\eta_{i}\mid X_{i}]=(\beta_{o}+\gamma_{o}\pi)+(\beta_{1}+\gamma_{1}\pi)X_{i}^{\prime }=\theta_{o}+\theta_{1}X_{i}^{\prime }$$

and

$$\eta_{ik}^{\prime }=\mathbb{E}[\eta_{i}\mid X_{i}]=(\beta_{o}+\gamma_{o}\pi_{k})+(\beta_{1}+\gamma_{1}\pi_{k})X_{i}^{\prime }=\theta_{ok}+\theta_{1k}X_{i}^{\prime }$$

which implies a finite mixture distribution for $y_{i}$ , that is,

(9) $$\begin{aligned}Z_{i}\mid \pi &\sim \text{Cat}(\pi),\quad \pi\in \Delta^{\kappa}\\ y_{i}\mid X_{i},Z_{ik},\theta_{k} &\sim p(y_{i}\mid \mu_{ik}).\end{aligned}$$

By averaging over the $\kappa$ clusters we get

$$\overline{\eta}_{i}^{\prime }=\overline{\theta}_{o}+\overline{\theta}_{1}X_{i}^{\prime }.$$

Again, this has the same basic structure as the classical GLM.

The dpGLM generalizes that structure by allowing $\kappa$ to be undetermined. It emerges naturally from finite mixtures when there might be clusters in the population that are latent or that were not measured and, in addition, the exact number of clusters is unknown. By letting $\kappa\rightarrow \infty$ in the finite mixture model in (9), and by putting a prior on $\theta$ and a stick-breaking prior on $\pi$, we have the dpGLM, as described in (10) (Teh et al. Reference Teh, Jordan, Beal and Blei2006; Hannah, Blei, and Powell Reference Hannah, Blei and Powell2011).

(10) $$\begin{aligned}V_{l}\mid \alpha_{o} &\sim \text{Beta}(1,\alpha_{o})\\ \pi_{k} &= \begin{cases}V_{1}, & k=1,\\ V_{k}\prod_{l=1}^{k-1}(1-V_{l}), & k>1,\end{cases}\\ Z_{i}\mid \pi &\sim \text{Cat}(\pi),\quad \pi\in \Delta^{\infty}\\ \theta_{Z_{i}}\mid Z_{i} &\sim p_{\theta}\\ y_{i}\mid X_{i},Z_{ik},\theta_{k} &\sim p(y_{i}\mid \mu_{ik}).\end{aligned}$$
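The truncated stick-breaking construction in (10) is straightforward to simulate. The sketch below (Python, for illustration only; the authors' implementation is the R package hdpGLM) draws the weights $\pi$ at a truncation level $K$ and illustrates how smaller concentration parameters $\alpha_{o}$ put most of the mass on a few clusters.

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights: V_l ~ Beta(1, alpha),
    pi_k = V_k * prod_{l<k}(1 - V_l), with the last stick closed."""
    V = rng.beta(1.0, alpha, size=K)
    V[-1] = 1.0  # truncation: assign all remaining mass to cluster K
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * stick_left

rng = np.random.default_rng(0)
pi_small = stick_breaking(alpha=1.0, K=50, rng=rng)   # small alpha: few clusters
pi_large = stick_breaking(alpha=20.0, K=50, rng=rng)  # large alpha: many clusters

# Each draw is a valid probability vector over the K truncated atoms.
assert np.isclose(pi_small.sum(), 1.0) and np.isclose(pi_large.sum(), 1.0)
# Inverse Simpson index: the effective number of occupied clusters.
print(1 / np.sum(pi_small**2), 1 / np.sum(pi_large**2))
```

Closing the last stick ($V_{K}=1$) makes the truncated weights sum to one exactly, which is the device used by the blocked Gibbs sampler in Section 4.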

Starting with the dpGLM, by restricting the possible number of clusters to be finite ($\kappa<\infty$) and treating $\pi$ and $\theta$ as fixed, we are again back to the FMM. If, in addition, we either average out the clusters and treat those averaged elements as the fixed parameters to estimate, or we set $\kappa=1$, we have the classical GLM. In sum, the GLM can be viewed as a special case of the dpGLM.

Finally, the hdpGLM proposed here generalizes that structure to account for the possibility of context-dependent clustering. It does that by letting the linear coefficients of the clusters be a function of context-level covariates. We modify model (10) by adding the parameter $\tau$ and context-level information $W$. Given $J$ different contexts, the context-level covariates $W\in \mathbb{R}^{J\times (D_{w}+1)}$, and the variable $C_{i}$ that indicates the context to which $i$ belongs, we obtain the hdpGLM by adding the following structure to the dpGLM:

(11) $$\begin{aligned}\tau_{d} &\sim p(\tau_{d}), & d&=1,\ldots ,D_{x}+1\\ \theta_{Z_{i}C_{i}}\mid Z_{ik},\tau,C_{ij},W &\sim p(\theta_{jk}\mid W,\tau), & j&=1,\ldots ,J.\end{aligned}$$

Hence, if there is just one context ($J=1$) we recover the dpGLM, which demonstrates the connection between the hdpGLM and the other models. Table 1 summarizes the connections among them.
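To make the hierarchical layer in (11) concrete, the following sketch (Python; all numeric values are hypothetical, not taken from the paper) draws cluster-specific slopes whose expectations are a linear function of a context-level feature, which is exactly the structure the hdpGLM places on $\theta_{jk}$:

```python
import numpy as np

rng = np.random.default_rng(42)
J, K = 10, 2            # contexts and clusters (hypothetical values)

# Context-level design: intercept plus one feature W_1 per context.
W = np.column_stack([np.ones(J), rng.normal(size=J)])
tau = np.array([0.0, 2.0])   # hypothetical effect of W_1 on E[beta_1]

# Cluster slope for covariate X_1 in context j, cluster k:
# centered on W_j' tau, with small dispersion around the context mean.
beta_1 = (W @ tau)[:, None] + rng.normal(scale=0.1, size=(J, K))

# The cluster slopes track the context feature, as Equation (11) intends.
for k in range(K):
    corr = np.corrcoef(W[:, 1], beta_1[:, k])[0, 1]
    assert corr > 0.9
```

Setting the second entry of `tau` to zero instead would reproduce the opposite case discussed later, in which a coefficient's expectation does not depend on context-level features.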

The hdpGLM model is also structurally connected to latent instrumental variable (LIV) models (Ebbes, Böckenholt, and Wedel Reference Ebbes, Böckenholt and Wedel2004; Ebbes et al. Reference Ebbes, Wedel, Böckenholt and Steerneman2005; Ebbes, Wedel, and Böckenholt Reference Ebbes, Wedel and Böckenholt2009). Such models can be used to deal with endogenous covariates. The main feature of the LIV is the introduction of a latent categorical instrumental variable, which turns the instrumental variable (IV) regression model into a FMM. To see this, consider a simple example of a classical IV model with endogenous covariate $x_{1}$, a gaussian outcome, and the instrumental variable $z$ (index $i$ omitted for simplicity):

(12) $$\begin{cases}y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\varepsilon\\ x_{1}=\gamma_{0}+\gamma_{2}x_{2}+\gamma_{3}z+\nu.\end{cases}$$

For $\varphi_{0}=\beta_{0}+\beta_{1}\gamma_{0}$, $\varphi_{2}=\beta_{1}\gamma_{2}+\beta_{2}$, $\varphi_{3}=\beta_{1}\gamma_{3}$, and $\varepsilon^{\prime }=\beta_{1}\nu+\varepsilon$ we have the reduced form:

(13) $$y=\varphi_{0}+\varphi_{2}x_{2}+\varphi_{3}z+\varepsilon^{\prime }.$$

The LIV approach defines a latent $K$-level categorical random variable $Z_{i}$ to be used instead of the instrument $z_{i}$. Each group $k$ is assumed to have its own mean value $\varphi_{0k}$. This leads to a FMM with $K$ latent groups such that the outcome is given by:

(14) $$y=(\varphi_{0}+\varphi_{0k})+\varphi_{2}x_{2}+\varepsilon^{\prime }=\theta_{0k}+\theta_{2}x_{2}+\varepsilon^{\prime }$$

or equivalently, and assuming group-specific errors:

(15) $$\begin{aligned}y&=\beta_{0}+\beta_{1}x_{1k}+\beta_{2}x_{2}+\varepsilon\\ x_{1k}&=\gamma_{0}+\gamma_{2}x_{2}+\varphi_{0k}+\nu.\end{aligned}$$

Ebbes et al. (Reference Ebbes, Wedel, Böckenholt and Steerneman2005) and Ebbes, Wedel, and Böckenholt (Reference Ebbes, Wedel and Böckenholt2009) proposed (15), called LIV, to deal with endogeneity in the regressors. From (14) and (15) we can see how it is structurally connected to the hdpGLM. One difference between the LIV and the model presented here is that in the former the number of latent groups needs to be selected in advance, before the parameters are estimated, which is a feature of any FMM. Moreover, the LIV is designed for single-context estimation, that is, the endogeneity and the instrument are not context-dependent. The main difference, however, is that the LIV approach models the joint distribution of the endogenous covariates and the outcome because its goal is to deal with endogeneity of the covariates, while here the covariates are assumed to be exogenous. The hdpGLM leads to a LIV model if we truncate the DPP, restrict the hdpGLM to its non-hierarchical version (dpGLM) with a gaussian outcome, and model the distribution of the endogenous covariates as in the equations above.

Finally, the hdpGLM also has a close connection with the latent-index model, which was designed to deal with single-context heterogeneity (Aakvik, Heckman, and Vytlacil Reference Aakvik, Heckman and Vytlacil2005; Rossi Reference Rossi2014). Aakvik, Heckman, and Vytlacil (Reference Aakvik, Heckman and Vytlacil2005) propose a model in which the marginal effects are heterogeneous in the population and indexed by a continuous latent random variable. They also provide a special case with two latent groups by using a binary transformation of that (gaussian distributed) latent index, which produces a mixture model with two latent components. The model here generalizes that continuous latent-index approach in two ways. First, it imposes a much more flexible distribution on the latent index and allows us to estimate effect heterogeneity when there is an unknown, finite or countably infinite, number of groups. Their gaussian index model can be approximated by a countably infinite partition of the real line and a symmetric unimodal discrete distribution on that partition; such a distribution is embedded in the structure of the model, and the DPP can naturally be used to estimate the distribution of the indexes. Second, the model here generalizes their approach by adding a context-dependent structure to the latent heterogeneity.

4 Estimation

There are many options in the literature to estimate models that use DPPs (Ishwaran and Zarepour Reference Ishwaran and Zarepour2000; Neal Reference Neal2000; Blei et al. Reference Blei and Jordan2006; Walker Reference Walker2007). Here we extend the approach proposed by Ishwaran and James (Reference Ishwaran and James2001). To implement a (blocked) Gibbs sampler for a DPP model, one of the algorithms they propose uses a truncated version of the stick-breaking construction in conjunction with the generalized Dirichlet distribution. We extend their basic algorithm in two ways. First, we incorporate the hierarchical structure of the model proposed here and develop a Gibbs sampler in its full generality. Second, we derive the sampler for two special cases: a continuous outcome variable $y_{i}$, modeled using a gaussian distribution, and a binary outcome variable, modeled using a Bernoulli distribution with a logistic transformation of the average parameter. For the gaussian outcome, the Gibbs update can be used for all parameters. Therefore, in practice, the estimation for that special case shows good convergence diagnostics within a few thousand iterations and can be performed in a relatively short time, depending on the size of the data set. For the binary outcome, the Gibbs update is available for all parameters except the linear coefficients of the generalized linear model. We therefore extend the algorithm and implement a Metropolis–Hastings update within Gibbs to sample the linear coefficients ($\beta$) using Riemann manifold Hamiltonian Monte Carlo (Neal Reference Neal2000; Shahbaba and Neal Reference Shahbaba and Radford2009; Neal et al. Reference Neal2011). The R package hdpGLM contains the implementation of the model with the algorithms presented here.

The truncation of the DPP used in the MCMC algorithm restricts the mixing probability parameter $\pi\in \Delta^{\infty}$ described in (4) to $\pi\in \Delta^{K}$. To estimate the model properly, we set a large value for $K$ and monitor the estimation to check the maximum number of clusters the sampler used to allocate the data points during the iterations. If it reaches $K$ at any point, we increase $K$ and repeat the process. By selecting a $K$ much larger than the number of clusters the sampler activates during the estimation, we make sure the truncation does not change the estimated results.
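That monitoring step can be automated: after each run, count the largest number of distinct cluster labels used in any saved iteration and compare it with the truncation level. A minimal sketch (Python; the label array here is simulated, standing in for output of the actual sampler):

```python
import numpy as np

def max_occupied(Z_draws):
    """Largest number of distinct cluster labels across saved iterations.
    Z_draws has shape (iterations, n), with labels in 1..K."""
    return max(len(np.unique(z)) for z in Z_draws)

K = 50                                           # truncation level
rng = np.random.default_rng(3)
# Simulated label draws: 200 saved iterations, 500 observations each.
Z_draws = rng.integers(1, 6, size=(200, 500))    # at most 5 labels ever used

occupied = max_occupied(Z_draws)
# The truncation never binds here; if occupied reached K, one would
# increase K and rerun the sampler.
assert occupied < K
```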

As before, let $\mathbf{X}\in \mathbb{R}^{n\times (D_{x}+1)}$ denote the individual-level covariates, including a column of ones for the intercept term, and $n$ the number of data points across all contexts. Denote by $C=(C_{1},\ldots ,C_{n})$, with $C_{i}\in \{1,\ldots ,J\}$, the variable that indicates the context to which $i$ belongs, and let $\mathbf{W}\in \mathbb{R}^{J\times (D_{w}+1)}$ be the $(D_{w}+1)$-dimensional context-level features of the $J$ contexts. Finally, let $Z=(Z_{1},\ldots ,Z_{n})$.

The following additional notation is used to derive the algorithm: $Z^{\ast }$ denotes the unique values of $Z$, and $Z^{\ast C}$ the values between 1 and $K$ that are not in $Z^{\ast }$. We denote by $Z_{j}^{\ast }$ the unique values of $Z$ in context $j$, and by $Z_{j}^{\ast C}$ its complement in $j$; $I_{k}$ is the set of indexes $i$ of the data points assigned to cluster $k$, $N_{k}$ the total number of data points in $k$, and $X_{jk}$ (or $y_{jk}$) the covariates (outcome variable) of the observations $i$ in context $j$ assigned to cluster $k$.

Given the most general formulation of the hdpGLM in (4) and the truncation used for the sampler, we have the following proposition (see proof in Appendix A):

Proposition 1 (Blocked Gibbs sampler for hdpGLM).

A blocked Gibbs sampler for the model described in (4) with $\pi\in \Delta^{K}$ is given by Algorithm 1.

A special case of the model described in (4) occurs when $y_{i}$ is gaussian distributed. Let $N_{d}(\mu,\Sigma)$ denote a $d$-dimensional multivariate gaussian distribution. Then, for $\theta=(\beta,\sigma)$, we have a Gibbs sampler for all parameters if we use the following distributions for $\tau$, $\beta$, and $\sigma$ (see proof in Appendix A):

(16) $$\begin{aligned}\tau_{d}\mid \mu_{\tau},\Sigma_{\tau} &\sim N_{D_{w}+1}(0,\Sigma_{\tau}), & d&=1,\ldots ,D_{x}+1\\ \beta_{kj}\mid Z_{ik},\tau,C_{ij},\mathbf{W} &\sim N_{D_{x}+1}([W_{j}^{T}\tau]^{T},\sigma_{\beta}I), & j&=1,\ldots ,J,\;k=1,\ldots ,K\\ \sigma_{k}^{2}\mid Z_{ik} &\sim \text{Scale-inv-}\chi^{2}(\nu,s^{2})\\ \varepsilon_{i}\mid \sigma_{k},Z_{ik} &\sim N(0,\sigma_{k})\\ y_{i} &= X_{i}^{T}\beta_{Z_{i}C_{i}}+\varepsilon_{i}\end{aligned}$$

Proposition 2 (Gibbs for hdpGLM with gaussian mixtures).

The Gibbs sampler for the model described in (16) is given by Algorithm 2.
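To see why the gaussian case admits a pure Gibbs sampler, note that conditional on the cluster assignments, each $\beta_{kj}$ has a conjugate normal full conditional: a Bayesian linear regression with prior mean $(W_{j}^{T}\tau)^{T}$. The sketch below (Python; a single cluster with known variances, shown as an illustration of the conjugacy rather than the paper's Algorithm 2) computes that full conditional:

```python
import numpy as np

def beta_full_conditional(X, y, m, sigma2, sigma2_beta):
    """Mean and covariance of beta | y, X under
    beta ~ N(m, sigma2_beta * I) and y | X, beta ~ N(X beta, sigma2 * I)."""
    D = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(D) / sigma2_beta
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y / sigma2 + m / sigma2_beta)
    return mean, cov

rng = np.random.default_rng(7)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

mean, cov = beta_full_conditional(X, y, m=np.zeros(2), sigma2=0.25, sigma2_beta=10.0)
# With this much data the full conditional concentrates near the truth;
# a Gibbs step would then draw beta ~ N(mean, cov).
assert np.allclose(mean, beta_true, atol=0.1)
```

In the sampler itself, `m` would be $(W_{j}^{T}\tau)^{T}$ and `X`, `y` would be the observations currently assigned to cluster $k$ in context $j$.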

When the outcome variable $y_{i}$ in model (4) is binomially distributed, or in general has a distribution without a conjugate prior for the linear coefficients, the full conditional of the parameters $\theta$ (or $\beta$) is not standard and we cannot sample from it directly. To deal with such cases we use a Riemann manifold Hamiltonian Monte Carlo (RMHMC) update (Girolami and Calderhead Reference Girolami and Calderhead2011) within Gibbs to sample the $\beta$ coefficients. We can still sample all the other parameters as before. For the sake of completeness, the RMHMC algorithm is presented in the supplementary material.

The random variable of interest is $\beta_{kj}\in \mathbb{R}^{D_{x}+1}$, called the position variable of the Hamiltonian Monte Carlo (HMC) algorithm (Neal et al. Reference Neal2011), and we denote by $v\in \mathbb{R}^{D_{x}+1}$ the ancillary (momentum) variable such that $v\sim N_{D_{x}+1}(0,G(\beta_{kj}))$. The Hamiltonian for our model is defined by

(17) $$\begin{aligned}H(\beta_{kj},v) &= U(\beta_{kj},v)+K(\beta_{kj},v)\\ &= -\ln p(\beta_{kj}\mid \cdot )+\frac{D_{x}+1}{2}\ln (2\pi)+\frac{1}{2}\left[\ln (\det [G(\beta_{kj})])+v^{T}G(\beta_{kj})^{-1}v\right]\end{aligned}$$

whose gradients are

(18) $$\begin{aligned}\nabla_{v}H(\beta_{kj},v) &= G(\beta_{kj})^{-1}v\\ \nabla_{\beta_{kj}}H(\beta_{kj},v) &= -\Big[\nabla_{\beta_{kj}}U(\beta_{kj},v)-\tfrac{1}{2}\text{tr}\{G(\beta_{kj})^{-1}\nabla_{\beta_{kj}}G(\beta_{kj})\}\\ &\qquad +\tfrac{1}{2}(v^{T}G(\beta_{kj})^{-1}G(\beta_{kj})^{-1}v)\nabla_{\beta_{kj}}G(\beta_{kj})\Big].\end{aligned}$$

The Hamiltonian equations are solved using the generalized Störmer–Verlet leapfrog integrator (Calin and Chang Reference Calin and Chang2006; Girolami and Calderhead Reference Girolami and Calderhead2011). For $L$ leapfrog steps of size $\varepsilon$, and $l=1,\ldots ,L$, it is given by:

(19) $$\begin{aligned}v^{l+\varepsilon/2} &= v^{l}-\frac{\varepsilon}{2}\nabla_{\beta_{kj}}H(\beta_{kj}^{l},v^{l+\varepsilon/2})\\ \beta_{kj}^{l+\varepsilon} &= \beta_{kj}^{l}+\frac{\varepsilon}{2}\left[\nabla_{v}H(\beta_{kj}^{l},v^{l+\varepsilon/2})+\nabla_{v}H(\beta_{kj}^{l+\varepsilon},v^{l+\varepsilon/2})\right]\\ v^{l+\varepsilon} &= v^{l+\varepsilon/2}-\frac{\varepsilon}{2}\nabla_{\beta_{kj}}H(\beta_{kj}^{l+\varepsilon},v^{l+\varepsilon/2}).\end{aligned}$$

When $y_{i}$ is binomial, that is, the distribution of $y_{i}$ in model (4) is defined by

$$y_{i}\sim \text{Bin}(p_{kj}),\quad p_{kj}=\frac{1}{1+e^{-X_{i}^{T}\beta_{kj}}}$$

then, given Equation (A 4), the elements of the RMHMC for the hdpGLM when $k\in Z_{j}^{\ast }$ are defined by the following equations:

$$\begin{aligned}U(\beta_{kj}) &= -\ln p(\beta_{kj}\mid \cdot )\\ &\propto -\Bigg[-\frac{D_{x}+1}{2}\ln 2\pi-\frac{1}{2}\ln (\det (\Sigma_{\beta}))-\frac{1}{2}(\beta_{kj}-(W_{j}^{T}\tau)^{T})^{T}\Sigma_{\beta}^{-1}(\beta_{kj}-(W_{j}^{T}\tau)^{T})\\ &\qquad -\sum_{i\in I_{k}}y_{i}\ln (1+e^{-X_{i}^{T}\beta_{kj}})-\sum_{i\in I_{k}}(1-y_{i})\ln (1+e^{X_{i}^{T}\beta_{kj}})\Bigg]\\ \nabla_{\beta_{kj}}U(\beta_{kj}) &= -\Bigg[-(\beta_{kj}-(W_{j}^{T}\tau)^{T})^{T}\Sigma_{\beta}^{-1}+\sum_{i\in I_{k}}X_{i}y_{i}p(y_{i}=0\mid \cdot )-\sum_{i\in I_{k}}X_{i}(1-y_{i})p(y_{i}=1\mid \cdot )\Bigg].\end{aligned}$$

In practice we use $G(\beta_{kj})=I_{(D_{x}+1)\times (D_{x}+1)}$, which is the most widely used approach in applications (Liu Reference Liu2008; Neal et al. Reference Neal2011). It also simplifies Equations (17), (18), and (19) substantially. Using $v\sim N_{D_{x}+1}(0,I)$, the integrator reduces to the standard Störmer–Verlet leapfrog integrator (Duane et al. Reference Duane, Kennedy, Pendleton and Roweth1987; Neal et al. Reference Neal2011). We follow that approach in this paper.
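With $G=I$ the update reduces to plain HMC. The sketch below (Python) implements the standard leapfrog integrator for a logistic potential of the form of $U(\beta_{kj})$ above, with a hypothetical zero-mean prior standing in for $(W_{j}^{T}\tau)^{T}$; a correct integrator should approximately conserve the Hamiltonian $H=U+\tfrac{1}{2}v^{T}v$:

```python
import numpy as np

def U_and_grad(beta, X, y, prior_var=10.0):
    """Potential U = -log p(beta | .) for logistic regression with a
    hypothetical N(0, prior_var * I) prior (up to an additive constant)."""
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    U = np.sum(y * np.log1p(np.exp(-eta)) + (1 - y) * np.log1p(np.exp(eta)))
    U += 0.5 * beta @ beta / prior_var
    grad = X.T @ (p - y) + beta / prior_var
    return U, grad

def leapfrog(beta, v, eps, L, X, y):
    """Standard Stormer--Verlet leapfrog with identity metric G = I."""
    _, g = U_and_grad(beta, X, y)
    v = v - 0.5 * eps * g                    # initial momentum half-step
    for step in range(L):
        beta = beta + eps * v                # position full step
        _, g = U_and_grad(beta, X, y)
        if step < L - 1:
            v = v - eps * g                  # momentum full steps in between
    v = v - 0.5 * eps * g                    # final momentum half-step
    return beta, v

rng = np.random.default_rng(11)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.random(n) < 0.5).astype(float)

beta0, v0 = np.zeros(2), rng.normal(size=2)
U0, _ = U_and_grad(beta0, X, y)
beta1, v1 = leapfrog(beta0, v0, eps=0.005, L=20, X=X, y=y)
U1, _ = U_and_grad(beta1, X, y)

# Energy is approximately conserved along the trajectory, which is what
# keeps the Metropolis-Hastings acceptance rate high.
assert abs((U1 + 0.5 * v1 @ v1) - (U0 + 0.5 * v0 @ v0)) < 0.1
```

In the sampler within Gibbs, the endpoint `(beta1, v1)` would be accepted or rejected with the usual Metropolis–Hastings correction based on the change in $H$.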

5 Monte Carlo Simulation

In this section, we conduct a Monte Carlo exerciseFootnote 2 to demonstrate the properties of the estimates produced by the algorithms developed in Section 4. The exercise is divided into three parts. First, we reproduce a situation that often occurs in practice when one omits factors that condition the association between the variable of interest and the outcome. To do that, we compare results produced by the hdpGLM with those produced by the GLM, both when there is no latent heterogeneity in the population and when there are latent clusters. We show how Simpson's paradox can emerge in the latter case and how it is uncovered by the proposed model. In the second part of the MC exercise, we simulate a large variety of possible scenarios, each with different types of heterogeneity and numbers of observed covariates, to show that the model performs well in a wide range of situations. We evaluate the frequentist properties of the estimators in each case, particularly their coverage probability (Carlin and Louis Reference Carlin and Louis2000; Little et al. Reference Little2011). Lastly, we compare the predictive performance of the GLM and the hdpGLM for different possible numbers of clusters in terms of root-mean-squared error (RMSE).

We start by comparing the estimates produced by the GLM and by the hdpGLM with and without latent heterogeneity in the population. We generated data sets from two parameter configurations with three continuous covariates sampled from a gaussian distribution. In the first data set, there is no effect heterogeneity; hence, a single GLM would be appropriate because the effects of the three covariates are homogeneous in the population. In the second data set, we let the effect of one covariate be conditional on a latent factor such that it has opposite signs and similar magnitude for half of the population. The effects of the other two covariates are homogeneous. The data sets used in this exercise contain 2000 observations.

As a toy example, we can think of the first covariate as representing income, the second age, the third the degree of racial fragmentation in the neighborhood, and the outcome the degree of support for redistributive policies. The effect of income on support for redistribution depends on a latent feature, say, whether individuals have experienced economic reward due to their effort and hard work, as opposed to luck or family monetary heritage. The latent heterogeneous effect of income can occur, for instance, if more income means less support for redistribution only for those who believe upward mobility can be achieved through effort and hard work.
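A data-generating process of the kind just described can be sketched as follows (Python; the coefficient values are hypothetical, chosen only to reproduce the sign-flip pattern). The pooled regression slope on the first covariate is pulled toward zero, while the cluster-specific slopes are strong and opposite-signed:

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 2000
X = rng.normal(size=(n, 3))              # e.g., income, age, fragmentation
cluster = rng.integers(0, 2, size=n)     # latent group (e.g., meritocracy belief)

# Covariate 1: opposite-signed effect across the latent clusters;
# covariates 2 and 3: homogeneous effects.
b1 = np.where(cluster == 0, 2.0, -2.0)
y = 1.0 + b1 * X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2] \
    + rng.normal(scale=0.5, size=n)

def ols(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

pooled = ols(X, y)                                          # ignores clusters
within = [ols(X[cluster == k], y[cluster == k]) for k in (0, 1)]

# Aggregation masks the heterogeneity: the pooled slope on X_1 is near zero,
# while each subpopulation has a large effect of opposite sign.
assert abs(pooled[1]) < 0.5
assert within[0][1] > 1.5 and within[1][1] < -1.5
```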

Table 2 compares the point estimates and their confidence intervals produced by estimating a GLM using MLE with the posterior averages and the 95% HPD intervals produced by estimating the hdpGLM with the MCMC proposed here. We can compare the estimates with the true values, displayed in the fourth column of the table. After estimating the hdpGLM, we classified the data into clusters using the estimated cluster probability of each data point, assigning each observation to the cluster to which it has the highest probability of belonging. The indexes of the clusters occupied by data points are displayed in the first column of the table. The upper half of Table 2 shows the estimates when the data come from a population with no heterogeneity. All data points were classified into the same single cluster by the hdpGLM. The estimates of the two models are very similar; in fact, they are identical up to two significant figures, as shown in the table. The lower half of the table presents the results of the estimation using the GLM and the hdpGLM for the second data set, with heterogeneous effects in the first covariate $X_{1}$. We used the same procedure just described to classify the data into clusters. The hdpGLM estimated two clusters in virtually all repetitions of the procedure. The results produced by the GLM and the hdpGLM are very similar for covariates 2 and 3 ($\beta_{3}$ and $\beta_{4}$), whose effects are homogeneous in the population. For the hdpGLM, the estimated linear effects of those covariates are indistinguishable across the two estimated clusters, as expected. However, for the heterogeneous effect $\beta_{1}$ (e.g., income), the GLM estimated a positive effect when in fact there are two subpopulations, one with a positive and another with a negative effect. The hdpGLM, on the other hand, estimated the marginal effect of $X_{1}$ correctly for both clusters.

Table 2. Comparing estimates of GLM (estimated using MLE) and hdpGLM (estimated using MCMC) with and without latent heterogeneity in the population.

Table 2 contains an example of Simpson's paradox: the aggregate effect found for $X_{1}$ when one uses the GLM and ignores the clusters is quite different from the effect found when the clusters are considered. We can see this more clearly in Figure 1. The lines represent the fitted values using the MLE estimates for the GLM model and the fitted values using the posterior average for the hdpGLM. In the left panel, we can compare the estimated marginal effects produced by each model. In the right panels, we see the data points after they were clustered by the hdpGLM; these panels also display the fitted values. We would have reached incomplete conclusions using the GLM in this situation: the effect is positive and significant for the MLE estimates of the GLM but, in fact, it is negative for half of the population, and for the other half it is positive and larger than the GLM estimate.

Figure 1. Comparing marginal effect estimated using GLM (MLE estimator) and hdpGLM (MCMC posterior average) when there are 3 clusters in the population.

The general takeaway from these results is that when the GLM is well specified and there is no effect heterogeneity due to latent features, using the hdpGLM will not harm the estimation. When there are clusters with heterogeneous effects, the GLM will produce accurate aggregate results but can nevertheless be incorrect for each one of the subpopulations. Table 2 and Figure 1 demonstrate that the hdpGLM reduces to the GLM when there is no heterogeneity (see Section 3). When the assumptions that justify the adoption of the GLM hold, the hdpGLM can still be used, and it estimates the mean values of the linear parameters quite close to the MLE estimates of the GLM. When there was heterogeneity, the hdpGLM classified the data correctly into two clusters, the marginal effects were correctly estimated, and Simpson's paradox was uncovered.

Next, in order to evaluate the performance of the hdpGLM in a wide range of possible scenarios, we randomly generated 10 different sets of parameters. To make the Monte Carlo exercise faster and easy to visualize, we simulate data for a single context ($J=1$) with a continuous outcome variable. An example with context-dependent heterogeneity ($J>1$) is presented in the sequel. Examples with binary outcome variables are provided in the supplementary material.

For each parameter set, we randomly generated 100 data sets. The number of clusters $K$ and the number of covariates in each case were also randomly generated. We allowed the heterogeneity to occur in the effects of all covariates. Values of the linear coefficients range from $-20$ to 20. We estimated the hdpGLM for each one of the 1,000 data sets (10 parameter sets times 100 data sets per parameter set), and all the usual convergence diagnostics were conducted (Geweke Reference Geweke1992; Cowles and Carlin Reference Cowles and Carlin1996; Brooks and Gelman Reference Brooks and Gelman1998; Flegal Reference Flegal2008; Flegal, Haran, and Jones Reference Flegal, Haran and Jones2008). The high posterior density (HPD) intervals were computed across the data sets generated by each parameter set, and so was the posterior average.

Table 3 summarizes the coverage probability of the linear coefficients $\unicode[STIX]{x1D6FD}$ for each of the 10 parameter sets along with the cluster estimation. The first column indicates the number of covariates in each parameter set, and the second indicates the true number of clusters in the population. The third through fifth columns display summaries of the estimation across the 100 data sets generated by each parameter set: the mean, minimum, and maximum number of clusters the data points were assigned to after the estimation. As before, we assigned the data points to clusters based on the maximum estimated probability of cluster membership. The sixth column shows the proportion of the time the data were classified into the correct number of clusters across the replications. The table also displays the minimum and the average coverage across linear coefficients for each parameter set. For instance, the second line displays a case in which there are two latent clusters in the population and five covariates. The sixth column indicates that the data points were classified into two clusters in 96% of the estimations performed using the 100 data sets generated by that parameter set. There are 10 linear coefficients across clusters for that case (5 linear coefficients per cluster). Among those 10 linear coefficients, the minimum coverage probability was 90%, meaning that even the linear parameter with the worst coverage was still correctly estimated 90% of the time, where by correct estimation we mean that the true value was within the 95% HPD interval. So in at least 90 out of 100 cases, the true values were contained in the 95% HPD interval for all the linear parameters. One may argue that such good coverage occurs because the posterior intervals are too wide, so the last column of the table displays the maximum average width of the HPD intervals among the linear coefficients in each case. As the intervals are generally small, we can be confident that the model and the estimation procedure proposed here have good coverage probability and that such results are not due to large variance of the posterior distribution. Another possible objection is that the number of replications is too small. In the supplementary material we provide a much larger MC exercise for two additional parameter sets with 1,000 replications each. The supplementary material also contains tables with the MC standard error for all simulations and all linear parameters $\unicode[STIX]{x1D6FD}$. The results are similar to those presented here.
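To fix ideas, the coverage computation used throughout this exercise can be sketched in a few lines of Python. This is an illustrative sketch rather than the replication code: `hpd_interval` returns the shortest interval containing 95% of the posterior draws, `coverage` counts how often the true coefficient falls inside that interval across replications, and the toy sampler standing in for the MCMC output is our own construction.

```python
import numpy as np

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing `prob` of the posterior draws (unimodal case)."""
    x = np.sort(np.asarray(draws))
    n = len(x)
    m = int(np.ceil(prob * n))            # number of draws the interval must contain
    widths = x[m - 1:] - x[: n - m + 1]   # width of every candidate interval
    i = int(np.argmin(widths))
    return x[i], x[i + m - 1]

def coverage(true_beta, replications, prob=0.95):
    """Fraction of replications whose HPD interval contains the true coefficient."""
    hits = [lo <= true_beta <= hi
            for lo, hi in (hpd_interval(r, prob) for r in replications)]
    return float(np.mean(hits))

# toy stand-in for the sampler: the point estimate varies around the truth
rng = np.random.default_rng(42)
true_beta, se = 2.0, 0.1
reps = [rng.normal(rng.normal(true_beta, se), se, size=2000) for _ in range(200)]
```

With this well-calibrated toy sampler, `coverage(true_beta, reps)` comes out close to the nominal 95%.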

Table 3. Summary of the performance of the hdpGLM when estimating number of clusters (K) and linear coefficients ( $\unicode[STIX]{x1D6FD}$ ) across 100 replications generated by 10 different parameter sets.

Now we turn to a full example of estimation with context-dependent latent heterogeneity. For this example, we used ten contexts $(J=10)$ and two covariates $(D_{x}=2)$. We let the expectation of the effect $\unicode[STIX]{x1D6FD}_{1}$ of the first covariate $X_{1}$ be a function of the context-level feature $W_{1}$, but not the expectation of the linear effect $\unicode[STIX]{x1D6FD}_{2}$ of the second covariate $X_{2}$. In other words, we randomly sampled $\unicode[STIX]{x1D70F}_{11}$ (the effect of $W_{1}$ on the expectation of $\unicode[STIX]{x1D6FD}_{1}$) from its prior distribution and set $\unicode[STIX]{x1D70F}_{12}$ (the effect of $W_{1}$ on the expectation of $\unicode[STIX]{x1D6FD}_{2}$) to zero. We set the number of clusters to two $(K=2)$. Figure 2 shows the result of the estimation. The left panel of the figure shows the posterior distribution of the linear coefficients in each context; the vertical lines indicate the true values, and the posterior clearly concentrates around the true values of the clusters. The top-right panel shows the estimated posterior averages of $\unicode[STIX]{x1D6FD}_{1}$ and $\unicode[STIX]{x1D6FD}_{2}$ for each cluster, in each context, as a function of the context feature $W_{1}$. The expectation of $\unicode[STIX]{x1D6FD}_{1}$ in the clusters is clearly a positive function of $W_{1}$, but that is not the case for $\unicode[STIX]{x1D6FD}_{2}$. Finally, the bottom-right panel shows the posterior expectation of $\unicode[STIX]{x1D70F}$. The estimated values are quite close to the true values and within a small 95% HPD interval.
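To make the simulation design concrete, a data-generating process of this kind can be sketched as follows. This is an illustrative Python sketch; the noise scales, seed, and variable names are our own choices, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
J, K, n_j = 10, 2, 200               # contexts, latent clusters, observations per context
# tau[d] = (intercept, slope) of E[beta_d | W_1]; tau[1][1] = 0 so beta_2 ignores W_1
tau = np.array([[1.0, 2.0],          # E[beta_1 | W_1] = 1 + 2 * W_1
                [0.5, 0.0]])         # E[beta_2 | W_1] = 0.5
W1 = rng.normal(size=J)              # one context-level feature per context

contexts = []
for j in range(J):
    mean_beta = tau[:, 0] + tau[:, 1] * W1[j]              # context-dependent expectation
    beta = mean_beta + rng.normal(scale=2.0, size=(K, 2))  # cluster-specific coefficients
    Z = rng.integers(K, size=n_j)                          # latent cluster memberships
    X = rng.normal(size=(n_j, 2))
    y = np.einsum('nd,nd->n', X, beta[Z]) + rng.normal(scale=0.5, size=n_j)
    contexts.append({'X': X, 'y': y, 'Z': Z, 'beta': beta})
```

Each context carries its own pair of cluster-specific coefficient vectors, and the expectation of the first coefficient shifts with $W_{1}$ while the second does not.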

Figure 2. Output of estimation of hdpGLM model for a data set with 10 contexts and positive effect of context-level feature $W_{1}$ on the marginal effect $\unicode[STIX]{x1D6FD}_{1}$ of $X_{1}$ .

To complete this section, we compared the predictive performance of the GLM and the hdpGLM in terms of RMSE for different numbers of latent clusters. We randomly generated 30 parameter sets, with the number of clusters ranging from 1 to 30. For each case, we generated 10 data sets, estimated both the GLM and the hdpGLM, and computed the RMSE. Figure 3 compares the predictive performance of the two models. The RMSE of the hdpGLM stays low regardless of the number of clusters, as expected.
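The intuition behind this comparison can be reproduced with a small illustrative sketch: when two latent clusters have opposite slopes, a pooled linear fit leaves a large residual error, while a fit that respects the cluster structure (here an oracle that knows the memberships, standing in for what the hdpGLM estimates) stays near the noise level. The numbers below are illustrative, not those of the Monte Carlo exercise.

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.integers(2, size=n)                       # two latent clusters
betas = np.array([[0.0, 3.0], [0.0, -3.0]])       # opposite slopes (Simpson-like setup)
y = np.einsum('nd,nd->n', X, betas[Z]) + rng.normal(scale=0.5, size=n)

# pooled fit ignores the latent clusters
b_pooled, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse_pooled = rmse(y, X @ b_pooled)

# cluster-wise fit (oracle memberships) recovers the noise level
yhat = np.empty(n)
for k in range(2):
    mask = Z == k
    b_k, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    yhat[mask] = X[mask] @ b_k
rmse_clustered = rmse(y, yhat)
```

In this setup the pooled fit's RMSE is several times the noise standard deviation, while the cluster-wise fit stays near it.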

Figure 3. Comparing performance of hdpGLM and GLM using root-mean-squared error (RMSE) as a function of the number of clusters in the data.

All estimations in this section and in the following use $(\unicode[STIX]{x1D707}_{\unicode[STIX]{x1D70F}_{d}},\unicode[STIX]{x1D70E}_{\unicode[STIX]{x1D70F}_{d}}I,\unicode[STIX]{x1D70E}_{\unicode[STIX]{x1D6FD}_{kj}}I,s^{2},\unicode[STIX]{x1D708},\unicode[STIX]{x1D6FC}_{0})=(0,10I,10I,10,10,1)$ as the prior parametrization, where $I$ represents the identity matrix. Those values allow reasonably large variation in the underlying random variables, and the simulation results have shown that they produce good coverage and small 95% HPD intervals in a wide variety of situations. The supplementary material contains details of a prior perturbation study. Briefly, it shows that on average the model is not very sensitive to different prior settings, but in the worst case, for certain combinations of prior parameters, the model can demand very large data sets to escape the influence of the prior specification. This is especially true for extreme values of the concentration parameter $\unicode[STIX]{x1D6FC}$ and for values that produce a highly dispersed scaled inverse-$\unicode[STIX]{x1D712}^{2}$ distribution, which can be generated by low values (below five) of the scale parameter $s^{2}$. For details, see the supplementary material.

6 Empirical Application

In this section, we illustrate some applications of the model by replicating empirical studies and comparing the original results with the ones produced by the hdpGLM estimates.

We start with Bechtel, Hainmueller, and Margalit (2014), who present a study of Germany using online and telephone survey data. The paper investigates why some voters agree with bailout payments for other countries. The dependent variable is a dichotomous measure coded 1 if the person is against bailout payments for over-indebted EU countries and 0 otherwise. They find that social dispositions, in particular feelings of cosmopolitanism and altruism, are the strongest predictors of attitudes toward providing financial help to other countries. The left panel of Figure 4 reproduces their results and displays the marginal effects of their three main variables. The right panel shows the estimation of the hdpGLM using an indicator variable for German states. The panel shows the effect of (very high) cosmopolitanism on support for bailout payments in each region. For most of the states there is no latent heterogeneity. Moreover, the aggregate average effect estimated using a GLM is similar, in most cases, to the posterior average effect found by the hdpGLM in each state. One exception is Lower Saxony, in which we see Simpson’s paradox: there are two clusters with opposite effects of very high cosmopolitanism on support for the bailout. Although further investigation would be needed to provide a substantive account of these results, the example shows how the hdpGLM can be used to estimate context-dependent heterogeneity.

Figure 4. Left panel shows the effect of altruism, cosmopolitanism, and income on support for bailout estimated by GLM, reproducing Table 3 of Bechtel, Hainmueller, and Margalit (2014). Right panel shows latent heterogeneity in the marginal effect of very high cosmopolitanism as a function of German states.

For the second empirical application, we replicate Newman, Johnston, and Lown (2015). Using national surveys conducted in the USA, they investigate whether residential proximity to inequality affects US citizens’ beliefs in meritocracy, defined as the idea that the economic system rewards individuals based on their hard work and ability. The data set contains individual- and county-level covariates. They show that the association between an individual’s income and the probability of rejecting meritocracy is conditional on the level of inequality in the county: low-income individuals become more likely to reject meritocracy when inequality increases. We reproduce their results for white residents, as presented in Table 1 of their paper, using a linear probability model. In their results, income and the percentage of blacks in the county do not matter alone, but the interaction between income and inequality is significant. We focus here on that result. We estimated the hdpGLM using the same individual- and county-level covariates they included in their model. The county covariates are inequality, county income, percentage of blacks, percentage of votes for Bush in 2004, and county population. The estimation of the hdpGLM found no latent heterogeneity in 1,633 out of 1,688 counties. Two latent clusters were estimated in 54 counties, and three latent clusters in one of them. Table 4 compares the MLE estimates of the GLM with the posterior expectations of the hdpGLM, averaged across counties with no latent heterogeneity. The values in that table are similar.

Table 4. GLM vs hdpGLM estimates for counties with no latent heterogeneity.

The left panel of Figure 5 shows the posterior distribution of the income effect in 20 randomly sampled counties. The GLM and the hdpGLM estimates agree in many cases, but in some counties there are latent heterogeneous groups and the estimates of the two models disagree. In the county with index 543, for instance, there are three latent groups: one in which income plays no role and two with opposite income effects. That case is an example of Simpson’s paradox in the effect of income in that county.

As discussed, one of the advantages of the hdpGLM is that we can evaluate whether context- (county-) level variables have any effect after we take into account the latent heterogeneity in the effect of observed individual-level covariates. Newman, Johnston, and Lown (2015) found that inequality conditions the effect of income on the probability of rejecting meritocracy. However, when we take into account the latent heterogeneity of the income effect in each county, that conditional effect disappears, as can be seen in Figure 5. The top-right panel of the figure shows the posterior expectation of each cluster within each of the 1,688 counties. The bottom-right panel shows the posterior distribution of $\unicode[STIX]{x1D70F}_{11}$, the effect of inequality on the expectation of the effect of income, for each county and cluster. The results indicate that inequality does not change the effect of income once we consider latent heterogeneity in the effect of covariates.

Figure 5. Posterior distribution of income effect for 20 selected counties (left panel), posterior expectation of income effect in each cluster as function of inequality (top-right panel), and posterior distribution of the effect of inequality on the income effect (bottom right).

In sum, the hdpGLM can be used to investigate latent heterogeneity in the effect of observed covariates in generalized linear models. When there is no heterogeneity, the results of the GLM and the hdpGLM are similar. When there is latent heterogeneity, the GLM can produce estimates that are incorrect for all or some subpopulations: by using a GLM, one is simply assuming that Simpson’s paradox does not occur in the analysis. The hdpGLM can be used instead to estimate the heterogeneous effects, cluster the data into groups, uncover Simpson’s paradox, and evaluate whether the effect of context-level features remains relevant after latent heterogeneity is considered.

7 Final Discussion

Researchers in any academic discipline can never be sure a priori that there is no effect heterogeneity caused by latent or omitted variables in their investigation. In other words, we are never sure whether there are latent subpopulations in which the average effect found using the aggregated data is different or even reversed. When such subpopulations exist, using a GLM or GLMM can produce an incomplete picture and, in the worst-case scenario, a completely misleading conclusion. This is true in analyses using either observational or experimental data. It is therefore desirable to use a method that is robust to latent heterogeneity. Moreover, when data comes from different contexts, for instance, different states or different countries, it is common to assume that the effect of observed covariates varies from context to context due to context-level features. Likewise, it is also desirable to consider that the latent heterogeneity of the covariate effects within each context (e.g., country) can vary from context to context due to context-level features.

We have provided a model to deal with those issues. The model is designed to estimate marginal effects in linear models and to assess whether there are latent subpopulations in which the marginal effects differ. If data comes from different contexts, the model also estimates whether the existence of such subpopulations and their specific marginal effects are functions of context-level features. We have shown that the proposed model causes no harm when the GLM is correct (that is, when there are no latent heterogeneous effects) but correctly estimates the heterogeneous effects when they exist.

Similar to the GLM, however, the proposed model requires specifying which observed covariates are included in the model and how. This is not a trivial task, and it can affect the estimation as much as it does for GLMs. Further research is needed to develop methods to compare and select which observed covariates should be used and how. However, if for any reason one believes a specific set of covariates is adequate for a GLM, the proposed model can be used instead, with the advantage that it is robust to heterogeneous subpopulation effects.

Appendix A. Markov Chain Monte Carlo Algorithms

The proof of Proposition 1 is as follows.

Proof (Blocked Gibbs sampler for hdpGLM).

The full conditional of $\unicode[STIX]{x1D70F}$ is given by

$$\begin{eqnarray}p(\unicode[STIX]{x1D70F}\mid \unicode[STIX]{x1D703},\unicode[STIX]{x1D70B},Z,y,X,W,C)\propto p(\unicode[STIX]{x1D703}\mid W,\unicode[STIX]{x1D70F})p(\unicode[STIX]{x1D70F})\propto \mathop{\prod }_{d=1}^{D_{x}+1}p(\unicode[STIX]{x1D703}_{d}\mid W,\unicode[STIX]{x1D70F}_{d})p(\unicode[STIX]{x1D70F}_{d}).\end{eqnarray}$$

For each $d=1,\ldots ,D_{x}+1$ we have

$$\begin{eqnarray}p(\unicode[STIX]{x1D70F}_{d}\mid \cdot )\propto p(\unicode[STIX]{x1D703}_{d}\mid W,\unicode[STIX]{x1D70F}_{d})p(\unicode[STIX]{x1D70F}_{d}).\end{eqnarray}$$

The full conditional for $\unicode[STIX]{x1D703}$ is

$$\begin{eqnarray}\displaystyle & & \displaystyle p(\unicode[STIX]{x1D703}\mid \unicode[STIX]{x1D70F},\unicode[STIX]{x1D70B},y,X,W,Z,C)\propto p(y\mid \unicode[STIX]{x1D703},X,W,Z,C)p(\unicode[STIX]{x1D703}\mid W,\unicode[STIX]{x1D70F})\nonumber\\ \displaystyle & & \displaystyle \quad =\mathop{\prod }_{j=1}^{J}\mathop{\prod }_{i:Z_{i}\in Z_{j}^{\ast }}p(y_{i}\mid X_{i},Z_{i},C_{i},\unicode[STIX]{x1D703}_{C_{i},Z_{i}})p(\unicode[STIX]{x1D703}_{C_{i}Z_{i}}\mid \unicode[STIX]{x1D70F},W_{j})\mathop{\prod }_{i:Z_{i}\in Z_{j}^{\ast C}}p(\unicode[STIX]{x1D703}_{C_{i}Z_{i}}\mid W_{j},\unicode[STIX]{x1D70F})\nonumber\\ \displaystyle & & \displaystyle \quad =\mathop{\prod }_{j=1}^{J}\left[\left(\mathop{\prod }_{k\in Z_{j}^{\ast }}p(\unicode[STIX]{x1D703}_{jk}\mid W_{j},\unicode[STIX]{x1D70F})\mathop{\prod }_{i:Z_{i}=k}p(y_{i}\mid X_{i},\unicode[STIX]{x1D703}_{jk})\right)\left(\mathop{\prod }_{k\in Z_{j}^{\ast C}}p(\unicode[STIX]{x1D703}_{jk}\mid W_{j},\unicode[STIX]{x1D70F})\right)\right].\nonumber\end{eqnarray}$$

Therefore, for all $j=1,\ldots ,J$ and $k=1,\ldots ,K$ , we have

(A 1) $$\begin{eqnarray}p(\unicode[STIX]{x1D703}_{jk}\mid \cdot )\propto \left\{\begin{array}{@{}ll@{}}\displaystyle p(\unicode[STIX]{x1D703}_{jk}\mid W_{j},\unicode[STIX]{x1D70F})\mathop{\prod }_{i:Z_{i}=k}p(y_{i}\mid X_{i},\unicode[STIX]{x1D703}_{jk}),\quad & \text{if}~k\in Z_{j}^{\ast },\\ p(\unicode[STIX]{x1D703}_{jk}\mid W_{j},\unicode[STIX]{x1D70F}),\quad & \text{if}~k\in Z_{j}^{\ast C}.\end{array}\right.\end{eqnarray}$$

For the variable $Z$ , the full conditional is given by

$$\begin{eqnarray}\displaystyle p(Z\mid \unicode[STIX]{x1D70F},\unicode[STIX]{x1D703},\unicode[STIX]{x1D70B},y,X,W,C) & \propto & \displaystyle p(y\mid \unicode[STIX]{x1D703},Z,X,W,C)p(Z\mid \unicode[STIX]{x1D70B})\nonumber\\ \displaystyle & = & \displaystyle \mathop{\prod }_{i=1}^{n}p(y_{i}\mid \unicode[STIX]{x1D703}_{C_{i}Z_{i}},X_{i},C_{i},Z_{i})p(Z_{i}\mid \unicode[STIX]{x1D70B}).\nonumber\end{eqnarray}$$

Therefore, for all $i=1,\ldots ,n$ we have

$$\begin{eqnarray}p(Z_{i}=k\mid \cdot )\propto \unicode[STIX]{x1D70B}_{k}p(y_{i}\mid \unicode[STIX]{x1D703}_{C_{i}k},X_{i},C_{i})\end{eqnarray}$$

or similarly

(A 2) $$\begin{eqnarray}p(Z_{i}\mid \cdot )\propto \mathop{\sum }_{k=1}^{K}p_{ik}I(Z_{i}=k)\ni p_{ik}=\unicode[STIX]{x1D70B}_{k}p(y_{i}\mid \unicode[STIX]{x1D703}_{C_{i}k},X_{i},C_{i}).\end{eqnarray}$$

Finally, for $\unicode[STIX]{x1D70B}$ the full conditional is

$$\begin{eqnarray}p(\unicode[STIX]{x1D70B}\mid \unicode[STIX]{x1D70F},\unicode[STIX]{x1D703},Z,y,X,W,C)\propto p(Z\mid \unicode[STIX]{x1D70B})p(\unicode[STIX]{x1D70B})=\mathop{\prod }_{i=1}^{n}p(Z_{i}\mid \unicode[STIX]{x1D70B})p(\unicode[STIX]{x1D70B}).\end{eqnarray}$$

Now, for simplicity, let $\unicode[STIX]{x1D70B}\sim \text{Dir}(\unicode[STIX]{x1D6FC}/K)$ . The connection between this distribution and the stick-breaking process described in (4) can be found in Ishwaran and James (2001). Then we have

$$\begin{eqnarray}p(\unicode[STIX]{x1D70B}\mid \unicode[STIX]{x1D70F},\unicode[STIX]{x1D703},Z,y,X,W,C)\propto \mathop{\prod }_{i=1}^{n}\left(\mathop{\prod }_{k=1}^{K}\unicode[STIX]{x1D70B}_{k}^{I(Z_{i}=k)}\right)\mathop{\prod }_{k=1}^{K}\unicode[STIX]{x1D70B}_{k}^{\unicode[STIX]{x1D6FC}/K-1}=\mathop{\prod }_{k=1}^{K}\unicode[STIX]{x1D70B}_{k}^{N_{k}+(\unicode[STIX]{x1D6FC}/K)-1}.\end{eqnarray}$$

Therefore,

(A 3) $$\begin{eqnarray}\hspace{96.0pt}p(\unicode[STIX]{x1D70B}\mid \cdot )\propto \text{Dir}\left(N_{1}+\frac{\unicode[STIX]{x1D6FC}}{K},\ldots ,N_{K}+\frac{\unicode[STIX]{x1D6FC}}{K}\right).\hspace{96.0pt}\square\end{eqnarray}$$
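Purely as an illustration, separate from the proof, the two sampling steps in (A 2) and (A 3) can be coded for a gaussian outcome roughly as follows. All function and variable names are ours; `beta` stacks the context- and cluster-specific coefficients.

```python
import numpy as np

def sample_Z(y, X, C, beta, sigma2, pi, rng):
    """Draw each Z_i with p(Z_i = k | .) proportional to pi_k p(y_i | .), as in (A 2).

    beta : (J, K, D) coefficients per context j and cluster k
    sigma2, pi : length-K arrays of cluster variances and mixing weights
    """
    n, K = len(y), len(pi)
    Z = np.empty(n, dtype=int)
    for i in range(n):
        mu = beta[C[i]] @ X[i]                           # (K,) cluster-specific means
        logw = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2)
                - 0.5 * (y[i] - mu) ** 2 / sigma2)
        w = np.exp(logw - logw.max())                    # stabilize before normalizing
        Z[i] = rng.choice(K, p=w / w.sum())
    return Z

def sample_pi(Z, K, alpha, rng):
    """Draw pi ~ Dir(N_1 + alpha/K, ..., N_K + alpha/K), as in (A 3)."""
    counts = np.bincount(Z, minlength=K)
    return rng.dirichlet(counts + alpha / K)
```

With two well-separated clusters, `sample_Z` recovers the memberships almost deterministically, and `sample_pi` concentrates near the empirical cluster proportions.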

The proof of Proposition 2 is as follows.

Proof (Gibbs for hdpGLM with gaussian mixtures).

Considering the results in Proposition 1 and the model described in (16), we have the following. For  $\unicode[STIX]{x1D70F}$ , for each $d=1,\ldots ,D_{x}+1$

$$\begin{eqnarray}p(\unicode[STIX]{x1D70F}_{d}\mid \cdot )\propto p(\unicode[STIX]{x1D6FD}_{d}\mid W,\unicode[STIX]{x1D70F}_{d})p(\unicode[STIX]{x1D70F}_{d})=\mathop{\prod }_{k=1}^{K}\left[\mathop{\prod }_{j=1}^{J}p(\unicode[STIX]{x1D6FD}_{dkj}\mid W,\unicode[STIX]{x1D70F}_{d})p(\unicode[STIX]{x1D70F}_{d})\right].\end{eqnarray}$$

But by conjugacy of the gaussian distributions, we have $\prod _{j=1}^{J}p(\unicode[STIX]{x1D6FD}_{dkj}\,\,|\,\,W,\unicode[STIX]{x1D70F}_{d})p(\unicode[STIX]{x1D70F}_{d})\propto N_{D_{w}+1}(\unicode[STIX]{x1D707}_{A}^{(k)},\unicode[STIX]{x1D6F4}_{A})$ where

$$\begin{eqnarray}\displaystyle S_{A} & = & \displaystyle (\unicode[STIX]{x1D6F4}_{\unicode[STIX]{x1D70F}}^{-1}\unicode[STIX]{x1D70E}_{\unicode[STIX]{x1D6FD}}^{2}+W^{T}W)^{-1}\nonumber\\ \displaystyle \unicode[STIX]{x1D6F4}_{A} & = & \displaystyle S_{A}\unicode[STIX]{x1D70E}_{\unicode[STIX]{x1D6FD}}^{2}\nonumber\\ \displaystyle \unicode[STIX]{x1D707}_{A}^{(k)} & = & \displaystyle S_{A}W^{T}\unicode[STIX]{x1D6FD}_{dk}.\nonumber\end{eqnarray}$$

Therefore

$$\begin{eqnarray}p(\unicode[STIX]{x1D70F}_{d}\mid \cdot )\propto \mathop{\prod }_{k=1}^{K}N_{D_{w}+1}(\unicode[STIX]{x1D707}_{A}^{(k)},\unicode[STIX]{x1D6F4}_{A})\propto \exp \left\{-\frac{1}{2}\left[\unicode[STIX]{x1D70F}_{d}^{T}(K\unicode[STIX]{x1D6F4}_{A}^{-1})\unicode[STIX]{x1D70F}_{d}-2\unicode[STIX]{x1D70F}_{d}^{T}\unicode[STIX]{x1D6F4}_{A}^{-1}\left(\mathop{\sum }_{k=1}^{K}\unicode[STIX]{x1D707}_{A}^{(k)}\right)\right]\right\}.\end{eqnarray}$$

If we denote $\overline{\unicode[STIX]{x1D6F4}}_{\unicode[STIX]{x1D70F}_{d}}=\frac{1}{K}\unicode[STIX]{x1D6F4}_{A}$ and $\overline{\unicode[STIX]{x1D707}}_{\unicode[STIX]{x1D70F}_{d}}=\frac{1}{K}\sum _{k=1}^{K}\unicode[STIX]{x1D707}_{A}^{(k)}$ then

$$\begin{eqnarray}\unicode[STIX]{x1D70F}_{d}\mid \cdot \propto N_{D_{w}+1}(\overline{\unicode[STIX]{x1D707}}_{\unicode[STIX]{x1D70F}_{d}},\overline{\unicode[STIX]{x1D6F4}}_{\unicode[STIX]{x1D70F}_{d}}).\end{eqnarray}$$
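As an illustration of this update (separate from the proof; the names are ours, and `W` is the context design matrix with an intercept column):

```python
import numpy as np

def sample_tau_d(beta_d, W, Sigma_tau, sigma2_beta, rng):
    """Draw tau_d ~ N(mu_bar, Sigma_bar) from its full conditional.

    beta_d : (K, J) array with beta_d[k, j] = beta_{dkj}
    W      : (J, Dw + 1) context-level design matrix (intercept included)
    """
    S_A = np.linalg.inv(np.linalg.inv(Sigma_tau) * sigma2_beta + W.T @ W)
    Sigma_A = S_A * sigma2_beta
    mu_A = beta_d @ W @ S_A        # row k equals S_A W^T beta_dk (S_A is symmetric)
    K = beta_d.shape[0]
    return rng.multivariate_normal(mu_A.mean(axis=0), Sigma_A / K)
```

When the cluster-specific coefficients are generated as $W\unicode[STIX]{x1D70F}$ plus small noise, a draw from this full conditional lands near the generating $\unicode[STIX]{x1D70F}$.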

The full conditional for $\unicode[STIX]{x1D6FD}$ is

$$\begin{eqnarray}\displaystyle & & \displaystyle p(\unicode[STIX]{x1D6FD}\mid \unicode[STIX]{x1D70F},\unicode[STIX]{x1D70E}^{2},\unicode[STIX]{x1D70B},y,X,W,Z,C)\propto p(y\mid \unicode[STIX]{x1D6FD},\unicode[STIX]{x1D70E}^{2},X,W,Z,C)p(\unicode[STIX]{x1D6FD}\mid W,\unicode[STIX]{x1D70F})\nonumber\\ \displaystyle & & \displaystyle \quad =\mathop{\prod }_{j=1}^{J}\mathop{\prod }_{i:Z_{i}\in Z_{j}^{\ast }}p(y_{i}\mid X_{i},Z_{i},C_{i},\unicode[STIX]{x1D6FD}_{C_{i},Z_{i}},\unicode[STIX]{x1D70E}_{Z_{i}}^{2})p(\unicode[STIX]{x1D6FD}_{C_{i}Z_{i}}\mid \unicode[STIX]{x1D70F},W_{j})\mathop{\prod }_{i:Z_{i}\in Z_{j}^{\ast C}}p(\unicode[STIX]{x1D6FD}_{C_{i}Z_{i}}\mid W_{j},\unicode[STIX]{x1D70F})\nonumber\\ \displaystyle & & \displaystyle \quad =\mathop{\prod }_{j=1}^{J}\left[\left(\mathop{\prod }_{k\in Z_{j}^{\ast }}p(\unicode[STIX]{x1D6FD}_{jk}\mid W_{j},\unicode[STIX]{x1D70F})\mathop{\prod }_{i:Z_{i}=k}p(y_{i}\mid X_{i},\unicode[STIX]{x1D6FD}_{jk},\unicode[STIX]{x1D70E}_{k}^{2})\right)\left(\mathop{\prod }_{k\in Z_{j}^{\ast C}}p(\unicode[STIX]{x1D6FD}_{jk}\mid W_{j},\unicode[STIX]{x1D70F})\right)\right].\nonumber\end{eqnarray}$$

Therefore, for all $j=1,\ldots ,J$ and $k=1,\ldots ,K$ , we have

(A 4) $$\begin{eqnarray}p(\unicode[STIX]{x1D6FD}_{jk}\mid \cdot )\propto \left\{\begin{array}{@{}ll@{}}\displaystyle p(\unicode[STIX]{x1D6FD}_{jk}\mid W_{j},\unicode[STIX]{x1D70F})\mathop{\prod }_{i:Z_{i}=k}p(y_{i}\mid X_{i},\unicode[STIX]{x1D6FD}_{jk},\unicode[STIX]{x1D70E}_{k}^{2}),\quad & \text{if}~k\in Z_{j}^{\ast },\\ p(\unicode[STIX]{x1D6FD}_{jk}\mid W_{j},\unicode[STIX]{x1D70F}),\quad & \text{if}~k\in Z_{j}^{\ast C}.\end{array}\right.\end{eqnarray}$$

Denote $X_{kj}=\{X_{i}\mid C_{i}=j,Z_{i}=k\}$ and $y_{kj}=\{y_{i}\mid C_{i}=j,Z_{i}=k\}$ . It is clear from (A 4), (16), and the conjugacy of the normal distribution that for $k\in Z_{j}^{\ast }$

$$\begin{eqnarray}\begin{array}{@{}llrcl@{}}\unicode[STIX]{x1D6FD}_{jk}\mid \cdot \propto N_{D_{x}+1}(\overline{\unicode[STIX]{x1D707}}_{\unicode[STIX]{x1D6FD}},\overline{\unicode[STIX]{x1D6F4}}_{\unicode[STIX]{x1D6FD}})\quad & \text{where}\ & S_{\unicode[STIX]{x1D6FD}}\ & =\ & (\unicode[STIX]{x1D6F4}_{\unicode[STIX]{x1D6FD}}^{-1}\unicode[STIX]{x1D70E}_{k}^{2}+X_{kj}^{T}X_{kj})^{-1},\quad \overline{\unicode[STIX]{x1D6F4}}_{\unicode[STIX]{x1D6FD}}=S_{\unicode[STIX]{x1D6FD}}\unicode[STIX]{x1D70E}_{k}^{2}\\ \quad & \ & \overline{\unicode[STIX]{x1D707}}_{\unicode[STIX]{x1D6FD}}\ & =\ & \displaystyle S_{\unicode[STIX]{x1D6FD}}\left[\unicode[STIX]{x1D6F4}_{\unicode[STIX]{x1D6FD}}^{-1}(W_{j}^{T}\unicode[STIX]{x1D70F})^{T}+\frac{X_{kj}^{T}y_{kj}}{\unicode[STIX]{x1D70E}_{k}^{2}}\right]\unicode[STIX]{x1D70E}_{k}^{2}.\end{array}\end{eqnarray}$$
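An illustrative implementation of this conjugate update for an occupied cluster follows (names are ours; `m_prior` stands for the prior mean $(W_{j}^{T}\unicode[STIX]{x1D70F})^{T}$):

```python
import numpy as np

def sample_beta_jk(y_kj, X_kj, m_prior, Sigma_beta, sigma2_k, rng):
    """Draw beta_jk ~ N(mu_bar, Sigma_bar) for an occupied cluster (k in Z_j^*).

    For an empty cluster one would simply draw from the prior N(m_prior, Sigma_beta).
    """
    Sigma_inv = np.linalg.inv(Sigma_beta)
    S = np.linalg.inv(Sigma_inv * sigma2_k + X_kj.T @ X_kj)      # S_beta
    mean = S @ (Sigma_inv @ m_prior * sigma2_k + X_kj.T @ y_kj)  # mu_bar
    return rng.multivariate_normal(mean, S * sigma2_k)           # Sigma_bar = S_beta * sigma_k^2
```

With many observations and a diffuse prior, a draw from this full conditional lands near the least-squares estimate, as it should.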

The full conditional for $\unicode[STIX]{x1D70E}^{2}$ is

$$\begin{eqnarray}\displaystyle & & \displaystyle p(\unicode[STIX]{x1D70E}^{2}\mid \unicode[STIX]{x1D70F},\unicode[STIX]{x1D6FD},\unicode[STIX]{x1D70B},Z,y,X,W,C)\propto p(y\mid \unicode[STIX]{x1D6FD},\unicode[STIX]{x1D70E}^{2},Z,X,W,C)p(\unicode[STIX]{x1D70E}^{2})\nonumber\\ \displaystyle & & \displaystyle \quad =\mathop{\prod }_{i=1}^{n}p(y_{i}\mid \unicode[STIX]{x1D6FD}_{C_{i}Z_{i}},\unicode[STIX]{x1D70E}_{Z_{i}}^{2},X_{i},Z_{i},C_{i})p(\unicode[STIX]{x1D70E}_{Z_{i}}^{2})\nonumber\\ \displaystyle & & \displaystyle \quad =\left(\mathop{\prod }_{k\in Z^{\ast }}p(\unicode[STIX]{x1D70E}_{k}^{2})\mathop{\prod }_{i:Z_{i}=k}p(y_{i}\mid \unicode[STIX]{x1D6FD}_{C_{i}k},X_{i},C_{i})\right)\left(\mathop{\prod }_{k\in Z^{\ast C}}p(\unicode[STIX]{x1D70E}_{k}^{2})\right).\nonumber\end{eqnarray}$$

Therefore, for all $k=1,\ldots ,K$ we have

(A 5) $$\begin{eqnarray}p(\unicode[STIX]{x1D70E}_{k}^{2}\mid \cdot )\propto \left\{\begin{array}{@{}ll@{}}\displaystyle p(\unicode[STIX]{x1D70E}_{k}^{2})\mathop{\prod }_{i:Z_{i}=k}p(y_{i}\mid \unicode[STIX]{x1D6FD}_{C_{i}k},X_{i},C_{i}),\quad & \text{if}~k\in Z^{\ast }\\ p(\unicode[STIX]{x1D70E}_{k}^{2}),\quad & \text{if}~k\in Z^{\ast C}.\end{array}\right.\end{eqnarray}$$

Given the full conditional of $\unicode[STIX]{x1D70E}^{2}$ in (A 5), the distributions in (16), and the fact that the scaled inverse $\unicode[STIX]{x1D712}^{2}$ distribution is a conjugate prior for a gaussian likelihood with known mean, which is the case for the full conditional, it is straightforward to see that for $k\in Z^{\ast }$ , $X_{k}=\{X_{i}\mid Z_{i}=k\}$ , and $y_{k}=\{y_{i}\mid Z_{i}=k\}$

$$\begin{eqnarray}\unicode[STIX]{x1D70E}_{k}^{2}\mid \cdot \propto \text{Scale-inv-}\unicode[STIX]{x1D712}^{2}(\overline{\unicode[STIX]{x1D708}},\overline{s}^{2})\end{eqnarray}$$

where

$$\begin{eqnarray}\overline{\unicode[STIX]{x1D708}}=\unicode[STIX]{x1D708}+N_{k},\quad \overline{s}^{2}=\frac{\unicode[STIX]{x1D708}s^{2}+N_{k}{\hat{s}}^{2}}{\unicode[STIX]{x1D708}+N_{k}},\quad {\hat{s}}^{2}=\frac{1}{N_{k}}(y_{k}-X_{k}\unicode[STIX]{x1D6FD}_{k})^{T}(y_{k}-X_{k}\unicode[STIX]{x1D6FD}_{k}).\end{eqnarray}$$
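This update can be implemented through the equivalence between the scaled inverse-$\unicode[STIX]{x1D712}^{2}$ and inverse-gamma distributions, $\text{Scale-inv-}\unicode[STIX]{x1D712}^{2}(a,b)=\text{Inv-Gamma}(a/2,ab/2)$. The sketch below is illustrative and the names are ours; an empty cluster would instead draw from the prior.

```python
import numpy as np

def sample_sigma2_k(y_k, X_k, beta_k, nu, s2, rng):
    """Draw sigma_k^2 ~ Scale-inv-chi2(nu_bar, s2_bar) for an occupied cluster,
    via the equivalence Scale-inv-chi2(a, b) = Inv-Gamma(a/2, a*b/2)."""
    N_k = len(y_k)
    resid = y_k - X_k @ beta_k
    s2_hat = float(resid @ resid) / N_k            # within-cluster residual variance
    nu_bar = nu + N_k
    s2_bar = (nu * s2 + N_k * s2_hat) / nu_bar
    g = rng.gamma(shape=nu_bar / 2.0, scale=2.0 / (nu_bar * s2_bar))
    return 1.0 / g                                 # inverse of a gamma draw
```

With a large cluster, draws from this full conditional concentrate near the residual variance, as expected.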

The full conditionals for $Z$ and $\unicode[STIX]{x1D70B}$ are as in (A 2) and (A 3), respectively.◻

Supplementary material

For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2019.13.

Footnotes

Author’s note: The author is thankful to Robert Franzese, Walter Mebane, Kevin Quinn, and Long Nguyen, as well as participants of the 2018 PolMeth and 2018 APSA annual meetings, for helpful comments on previous versions of this manuscript. The author also thanks the editor Jeff Gill and two anonymous reviewers for their invaluable suggestions. Replication materials are publicly available on the Political Analysis Harvard Dataverse (Ferrari 2018) as well as on the author’s website.

Contributing Editor: Jeff Gill

1 The clustering property of the DPP will not be reviewed here because there are already good sources explaining that feature of the Dirichlet process prior. The interested reader can check Teh et al. (2006) and Müller and Mitra (2013) and the references therein.

2 See Ferrari (2018) for replication materials.

References

Aakvik, A., Heckman, J. J., and Vytlacil, E. J. 2005. “Estimating Treatment Effects for Discrete Outcomes When Responses to Treatment Vary: An Application to Norwegian Vocational Rehabilitation Programs.” Journal of Econometrics 125(1–2):15–51.
Alesina, A., and Angeletos, G.-M. 2005. “Fairness and Redistribution.” The American Economic Review 95(4):960–980.
Alesina, A., and Giuliano, P. 2010. “Preferences for Redistribution.” In Handbook of Social Economics, edited by Benhabib, J., Bisin, A., and Jackson, M. O., 93–131. Amsterdam: Elsevier.
Arts, W., and Gelissen, J. 2001. “Welfare States, Solidarity and Justice Principles: Does the Type Really Matter?” Acta Sociologica 44(4):283–299.
Bechtel, M. M., Hainmueller, J., and Margalit, Y. 2014. “Preferences for International Redistribution: The Divide Over the Eurozone Bailouts.” American Journal of Political Science 58(4):835–856.
Beramendi, P., and Rehm, P. 2016. “Who Gives, Who Gains? Progressivity and Preferences.” Comparative Political Studies 49(4):529–563.
Blei, D. M., and Jordan, M. I. 2006. “Variational Inference for Dirichlet Process Mixtures.” Bayesian Analysis 1(1):121–143.
Blyth, C. R. 1972. “On Simpson’s Paradox and the Sure-Thing Principle.” Journal of the American Statistical Association 67(338):364–366.
Brooks, S. P., and Gelman, A. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7(4):434–455.
Calin, O., and Chang, D.-C. 2006. Geometric Mechanics on Riemannian Manifolds: Applications to Partial Differential Equations. Birkhäuser: Springer Science & Business Media.
Carlin, B. P., and Louis, T. A. 2000. Bayes and Empirical Bayes Methods for Data Analysis, 2nd edn. Boca Raton, FL: Chapman & Hall/CRC.
Carota, C., and Parmigiani, G. 2002. “Semiparametric Regression for Count Data.” Biometrika 89(2):265–281.
Chen, X. 2007. “Large Sample Sieve Estimation of Semi-Nonparametric Models.” Handbook of Econometrics 6:5549–5632.
Cowles, M. K., and Carlin, B. P. 1996. “Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review.” Journal of the American Statistical Association 91(434):883–904.
De Iorio, M., Müller, P., Rosner, G. L., and MacEachern, S. N. 2004. “An ANOVA Model for Dependent Random Measures.” Journal of the American Statistical Association 99(465):205–215.
De la Cruz-Mesía, R., Quintana, F. A., and Marshall, G. 2008. “Model-Based Clustering for Longitudinal Data.” Computational Statistics & Data Analysis 52(3):1441–1457.
Diaconis, P., and Freedman, D. 1986. “On the Consistency of Bayes Estimates.” Annals of Statistics 14(1):1–26.
Dorazio, R. M., Mukherjee, B., Zhang, L., Ghosh, M., Jelks, H. L., and Jordan, F. 2008. “Modeling Unobserved Sources of Heterogeneity in Animal Abundance Using a Dirichlet Process Prior.” Biometrics 64(2):635–644.
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. 1987. “Hybrid Monte Carlo.” Physics Letters B 195(2):216–222.
Ebbes, P., Wedel, M., and Böckenholt, U. 2009. “Frugal IV Alternatives to Identify the Parameter for an Endogenous Regressor.” Journal of Applied Econometrics 24(3):446–468.
Ebbes, P., Wedel, M., Böckenholt, U., and Steerneman, T. 2005. “Solving and Testing for Regressor-Error (in)Dependence When No Instrumental Variables are Available: With New Evidence for the Effect of Education on Income.” Quantitative Marketing and Economics 3(4):365–392.
Ebbes, P., Böckenholt, U., and Wedel, M. 2004. “Regressor and Random-Effects Dependencies in Multilevel Models.” Statistica Neerlandica 58(2):161–178.
Ferrari, D. 2018. “Replication Data for: Modeling Context-Dependent Latent Effect Heterogeneity.” https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WB9XLZ.
Flegal, J. M., Haran, M., and Jones, G. L. 2008. “Markov Chain Monte Carlo: Can We Trust the Third Significant Figure?” Statistical Science 23(2):250–260.
Flegal, J. M. 2008. “Monte Carlo Standard Errors for Markov Chain Monte Carlo.” PhD thesis, University of Minnesota.
Gaffney, S. 2003. “Curve Clustering with Random Effects Regression Mixtures.” In Ninth International Workshop on Artificial Intelligence and Statistics, AISTATS. Key West, FL.
Gelman, A., and Hill, J. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
Geweke, J. 1992. “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments.” In Bayesian Statistics, 4th edn, 169–193. Oxford: Oxford University Press.
Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. V. 1999. “Consistent Semiparametric Bayesian Inference about a Location Parameter.” Journal of Statistical Planning and Inference 77(2):181–193.
Gill, J., and Casella, G. 2009. “Nonparametric Priors for Ordinal Bayesian Social Science Models: Specification and Estimation.” Journal of the American Statistical Association 104(486):453–454.
Girolami, M., and Calderhead, B. 2011. “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2):123–214.
Grimmer, J. 2009. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases.” Political Analysis 18(1):1–35.
Hannah, L. A., Blei, D. M., and Powell, W. B. 2011. “Dirichlet Process Mixtures of Generalized Linear Models.” Journal of Machine Learning Research 12(Jun):1923–1953.
Hayashi, F. 2000. Econometrics, vol. 1. Princeton, NJ: Princeton University Press.
Heckman, J. J., and Vytlacil, E. J. 2007. “Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast their Effects in New Environments.” Handbook of Econometrics 6:4875–5143.
Heinzl, F., and Tutz, G. 2013. “Clustering in Linear Mixed Models with Approximate Dirichlet Process Mixtures Using EM Algorithm.” Statistical Modelling 13(1):41–67.
Hernán, M. A., Clayton, D., and Keiding, N. 2011. “The Simpson’s Paradox Unraveled.” International Journal of Epidemiology 40(3):780–785.
Ichimura, H., and Todd, P. E. 2007. “Implementing Nonparametric and Semiparametric Estimators.” Handbook of Econometrics 6:5369–5468.
Ishwaran, H., and James, L. F. 2001. “Gibbs Sampling Methods for Stick-Breaking Priors.” Journal of the American Statistical Association 96(453):161–173.
Ishwaran, H., and Zarepour, M. 2000. “Markov Chain Monte Carlo in Approximate Dirichlet and Beta Two-Parameter Process Hierarchical Models.” Biometrika 87(2):371–390.
Johnston, R., Banting, K., Kymlicka, W., and Soroka, S. 2010. “National Identity and Support for the Welfare State.” Canadian Journal of Political Science 43(2):349–377.
Kievit, R., Frankenhuis, W. E., Waldorp, L., and Borsboom, D. 2013. “Simpson’s Paradox in Psychological Science: A Practical Guide.” Frontiers in Psychology 4(513):1–14.
Kleinman, K. P., and Ibrahim, J. G. 1998a. “A Semi-Parametric Bayesian Approach to Generalized Linear Mixed Models.” Statistics in Medicine 17(22):2579–2596.
Kleinman, K. P., and Ibrahim, J. G. 1998b. “A Semiparametric Bayesian Approach to the Random Effects Model.” Biometrics 54(3):921–938.
Kyung, M., Gill, J., and Casella, G. 2010. “Estimation in Dirichlet Random Effects Models.” The Annals of Statistics 38(2):979–1009.
Lenk, P. J., and DeSarbo, W. S. 2000. “Bayesian Inference for Finite Mixtures of Generalized Linear Models with Random Effects.” Psychometrika 65(1):93–119.
Little, R. 2011. “Calibrated Bayes, for Statistics in General, and Missing Data in Particular.” Statistical Science 26(2):162–174.
Liu, J. S. 2008. Monte Carlo Strategies in Scientific Computing. New York: Springer Science & Business Media.
Mallick, B. K., and Walker, S. G.. 1997. “Combining Information from Several Experiments with Nonparametric Priors.” Biometrika 84(3):697706.Google Scholar
Matzkin, R. L. 2007. “Nonparametric Identification.” Handbook of Econometrics 6:53075368.Google Scholar
Mukhopadhyay, S., and Gelfand, A. E.. 1997. “Dirichlet Process Mixed Generalized Linear Models.” Journal of the American Statistical Association 92(438):633639.Google Scholar
Müller, P., and Mitra, R.. 2013. “Bayesian Nonparametric Inference-Why and How.” Bayesian Analysis 8(2):269302.Google Scholar
Müller, P., Quintana, F., and Rosner, G.. 2004. “A Method for Combining Inference Across Related Nonparametric Bayesian Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(3):735749.Google Scholar
Neal, R. M. 2000. “Markov Chain Sampling Methods for Dirichlet Process Mixture Models.” Journal of Computational and Graphical Statistics 9(2):249265.Google Scholar
Neal, R. M. 2011. "MCMC Using Hamiltonian Dynamics." In Handbook of Markov Chain Monte Carlo. New York, NY: CRC Press.
Newman, B. J., Johnston, C. D., and Lown, P. L. 2015. "False Consciousness or Class Awareness? Local Income Inequality, Personal Economic Position, and Belief in American Meritocracy." American Journal of Political Science 59(2):326–340.
Ng, S.-K., McLachlan, G. J., Wang, K., Ben-Tovim Jones, L., and Ng, S.-W. 2006. "A Mixture Model with Random-Effects Components for Clustering Correlated Gene-Expression Profiles." Bioinformatics 22(14):1745–1752.
Pearl, J. 2011. "Simpson's Paradox: An Anatomy." Technical Report, UCLA Department of Statistics, Los Angeles, CA. https://escholarship.org/uc/item/3s62r0d6.
Pearl, J. 2014. "Comment: Understanding Simpson's Paradox." The American Statistician 68(1):8–13.
Pearson, K., Lee, A., and Bramley-Moore, L. 1899. "Mathematical Contributions to the Theory of Evolution. VI. Genetic (Reproductive) Selection: Inheritance of Fertility in Man, and of Fecundity in Thoroughbred Racehorses." Philosophical Transactions of the Royal Society of London Series A 192:257–330.
Przeworski, A. 2007. "Is the Science of Comparative Politics Possible?" In The Oxford Handbook of Comparative Politics, edited by Boix, C. and Stokes, S. C. Oxford Handbooks Online.
Rehm, P. 2009. "Risks and Redistribution: An Individual-Level Analysis." Comparative Political Studies 42(7):855–881.
Rossi, P. 2014. Bayesian Non- and Semi-Parametric Methods and Applications. Princeton, NJ: Princeton University Press.
Rossi, P. E., Allenby, G. M., and McCulloch, R. 2006. Bayesian Statistics and Marketing. Chichester: John Wiley & Sons.
Rueda, D., and Stegmueller, D. 2016. "The Externalities of Inequality: Fear of Crime and Preferences for Redistribution in Western Europe." American Journal of Political Science 60(2):472–489.
Samuels, M. L. 1993. "Simpson's Paradox and Related Phenomena." Journal of the American Statistical Association 88(421):81–88.
Sethuraman, J. 1994. "A Constructive Definition of Dirichlet Priors." Statistica Sinica 4:639–650.
Shahbaba, B., and Neal, R. 2009. "Nonlinear Models Using Dirichlet Process Mixtures." Journal of Machine Learning Research 10(Aug):1829–1850.
Shayo, M. 2009. "A Model of Social Identity with an Application to Political Economy: Nation, Class, and Redistribution." American Political Science Review 103(2):147–174.
Simpson, E. H. 1951. "The Interpretation of Interaction in Contingency Tables." Journal of the Royal Statistical Society. Series B (Methodological) 13(2):238–241.
Spirling, A., and Quinn, K. 2010. "Identifying Intraparty Voting Blocs in the UK House of Commons." Journal of the American Statistical Association 105(490):447–457.
Stegmueller, D. 2013. "Modeling Dynamic Preferences: A Bayesian Robust Dynamic Latent Ordered Probit Model." Political Analysis 21(3):314–333.
Stokes, S. C. 2014. "A Defense of Observational Research." In Field Experiments and their Critics: Essays on the Uses and Abuses of Experimentation in the Social Sciences, edited by Teele, D. L., 33–57. New Haven, CT: Yale University Press.
Svallfors, S. 1997. "Worlds of Welfare and Attitudes to Redistribution: A Comparison of Eight Western Nations." European Sociological Review 13(3):283–304.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. 2006. "Hierarchical Dirichlet Processes." Journal of the American Statistical Association 101:1566–1581.
Tokdar, S. T. 2006. "Posterior Consistency of Dirichlet Location-Scale Mixture of Normals in Density Estimation and Regression." Sankhyā: The Indian Journal of Statistics 68(1):90–110.
Traunmüller, R., Murr, A., and Gill, J. 2015. "Modeling Latent Information in Voting Data with Dirichlet Process Priors." Political Analysis 23(1):1, http://dx.doi.org/10.1093/pan/mpu018.
Verbeke, G., and Lesaffre, E. 1997. "The Effect of Misspecifying the Random-Effects Distribution in Linear Mixed Models for Longitudinal Data." Computational Statistics & Data Analysis 23(4):541–556.
Villarroel, L., Marshall, G., and Barón, A. E. 2009. "Cluster Analysis Using Multivariate Mixed Effects Models." Statistics in Medicine 28(20):2552–2565.
Walker, S. G. 2007. "Sampling the Dirichlet Mixture Model with Slices." Communications in Statistics – Simulation and Computation 36(1):45–54.
Wooldridge, J. M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge and London: MIT Press.
Yule, G. U. 1903. "Notes on the Theory of Association of Attributes in Statistics." Biometrika 2(2):121–134.
Table 1. Relationship between GLM, GLMM, FMM and hdpGLM based on structural assumptions on $\kappa$, $Z_{i}$, and $\varphi_{i}$.

Table 2. Comparing estimates of GLM (estimated using MLE) and hdpGLM (estimated using MCMC) with and without latent heterogeneity in the population.

Figure 1. Comparing marginal effect estimated using GLM (MLE estimator) and hdpGLM (MCMC posterior average) when there are 3 clusters in the population.

Table 3. Summary of the performance of the hdpGLM when estimating the number of clusters ($K$) and linear coefficients ($\beta$) across 100 replications generated by 10 different parameter sets.

Figure 2. Output of estimation of the hdpGLM model for a data set with 10 contexts and a positive effect of the context-level feature $W_{1}$ on the marginal effect $\beta_{1}$ of $X_{1}$.

Figure 3. Comparing performance of hdpGLM and GLM using root-mean-squared error (RMSE) as a function of the number of clusters in the data.

Figure 4. Left panel shows the effect of altruism, cosmopolitanism, and income on support for bailout estimated by GLM, reproducing Table 3 of Bechtel, Hainmueller, and Margalit (2014). Right panel shows latent heterogeneity in the marginal effect of very high cosmopolitanism as a function of German states.

Table 4. GLM vs hdpGLM estimates for counties with no latent heterogeneity.

Figure 5. Posterior distribution of income effect for 20 selected counties (left panel), posterior expectation of income effect in each cluster as a function of inequality (top-right panel), and posterior distribution of the effect of inequality on the income effect (bottom right).
Supplementary material: Ferrari supplementary material (File, 10.7 MB).