1 Introduction
This paper proposes a model to deal with context-dependent latent heterogeneity in the effect of covariates in generalized linear models (GLMs). Generalized linear models, including those with mixed effects, remain among the most used tools for multivariate analysis in political science. Among the many assumptions such models require, e.g., conditional independence, researchers need to assume that no important covariate was left out. In that regard, much has been said in political science, statistics, and econometrics about the problems caused by omitting additive covariates from the model, but much less about the issues surrounding unobserved confounders that condition the effects of observed covariates. Conditioning features can lead to a well-known phenomenon in statistics called Simpson’s paradox (a.k.a. the aggregation paradox): an effect found when data are aggregated can be completely different or even reversed when data are separated into groups (Pearson, Lee, and Bramley-Moore 1899; Yule 1903; Simpson 1951; Blyth 1972). The crucial point connecting the paradox to omitted variables is that, in the typical situation, researchers can never be sure a priori that there are no latent or unobserved groups—a.k.a. clusters—with heterogeneous effects, nor how many of them exist.
Consider, for example, the study of voters’ preferences for redistribution. It is well known that features such as income and race can affect support for redistributive policies (Alesina and Angeletos 2005; Rehm 2009; Shayo 2009; Alesina and Giuliano 2010), but the effects of observed factors like income and race can be heterogeneous among subpopulations due to unobserved factors such as motivation, personal history, and ability (Stegmueller 2013). Consequently, the estimated effect of income, e.g., found when data are aggregated can be very different from the effect that would be estimated if we had observed motivation and considered low- and high-motivation groups separately, or considered the income effect as conditional upon motivation.
Although that problem occurs in all scientific disciplines, it is perhaps more salient in the social sciences because, to mention a few reasons, problems have high dimensionality and often many dimensions remain unmeasured; data are often difficult to collect or are unavailable for privacy or other reasons; culture-specific aspects are not well measured; some subjects may purposefully conceal information from researchers; or researchers may simply be unaware of possible latent interactive factors (Stegmueller 2013; Traunmuller, Murr, and Gill 2015).
Especially in comparative politics, an additional layer of complication seems likely: the latent heterogeneity can depend on context-level features. For instance, some researchers have shown that the effect of income is conditional on country-level variables such as the progressivity of the tax system (Beramendi and Rehm 2016), the levels of inequality and crime rates (Rueda and Stegmueller 2016), national identity (Johnston et al. 2010), and the existing levels of redistribution (Svallfors 1997; Arts and Gelissen 2001). If there is latent effect heterogeneity due to unobserved factors like motivation or personal experiential history among the population of a given context, say a country, it is very likely that a different heterogeneity manifests in other countries. In other words, suppose the effect of income is heterogeneous between two groups of voters (clusters) in the United States (the context) and we do not know the group membership of individual voters. We would expect to see heterogeneity among the population in another context, say Italy, but we should not expect to find the same two latent groups in Italy (or in other contexts). Maybe there are more or fewer latent groups in Italy, or maybe some latent subpopulations are similar in Italy and the United States. For instance, high- and low-motivation Italians and Americans may have welfare opinions similarly affected by their income, but Italians’ personal experiences of crime may modify the income effect on welfare support very differently than Americans’ experiences do. In sum, the characteristics of the within-context heterogeneity (clustering) can vary from one context (e.g., country) to another, and that within-context heterogeneity may depend on the characteristics of the context itself.
Practitioners in political science have long recognized these challenges. The possibility of omitting relevant conditioning factors, in conjunction with cross-context differences, has been stressed as an important source of radical skepticism regarding the results of observational and experimental empirical investigation in the social sciences in general, and in comparative analysis in particular (Przeworski 2007; Stokes 2014).
The literature has proposed different approaches to address effect heterogeneity. The approaches depend on whether the grouping features are known and measured. When the groups are observed, classical approaches include mixed models with gaussian distributed random effects (e.g., hierarchical linear models (HLMs)). Suppose, for example, that we are analyzing data from many countries (contexts) and in each country there are different subpopulations with heterogeneous effects. If we knew the subpopulation to which each individual belongs, we could use a classical mixed-effects model at the country and subpopulation levels. However, the distributional assumption on the random effect in such an approach is often criticized because of the single modality, light tails, and symmetry of the normal distribution, which imposes unnecessary and often unjustifiable constraints on the analysis in the empirical modeling stage (Verbeke and Lesaffre 1997; Heinzl and Tutz 2013). In addition, such an approach only works if the heterogeneous groups within each context are known and observed; researchers have to assume that there is no other latent or unobserved feature that can cause effect heterogeneity. There are some modeling approaches that work for single-context cases in which subpopulation membership is unknown or unobserved. When one wants to investigate subpopulations with latent heterogeneity within a given context and the number of subpopulations is known or is assumed to be finite and fixed, finite mixture models (FMMs) are often used (Ng et al. 2006; De la Cruz-Mesía, Quintana, and Marshall 2008; Villarroel, Marshall, and Barón 2009). More commonly, however, researchers do not know if or how many latent heterogeneous subpopulations exist within a given context. Recent contributions in the statistics literature have proposed models that use a Dirichlet process prior (DPP) to deal with these single-context cases with unknown subpopulation heterogeneity/clustering (Mukhopadhyay and Gelfand 1997; Kleinman and Ibrahim 1998b; Hannah, Blei, and Powell 2011; Heinzl and Tutz 2013). Models using the DPP have been used in the marketing literature to model the error term with a flexible distribution, the heterogeneity of consumers’ demand in discrete choice models (Rossi, Allenby, and McCulloch 2006; Rossi 2014), and in latent instrumental variable (LIV) models to deal with endogeneity of covariates (Ebbes et al. 2005; Ebbes, Wedel, and Böckenholt 2009). Related work has also been developed in the econometrics and program evaluation literature to study effect heterogeneity of training programs (Aakvik, Heckman, and Vytlacil 2005; Chen 2007; Heckman and Vytlacil 2007; Ichimura and Todd 2007; Matzkin 2007).
DPP models have been applied in political science to study the lengths of time political appointees stay in their appointed positions (Gill and Casella 2009), political priorities of senators (Grimmer 2009), intraparty voting (Spirling and Quinn 2010), immigrant turnout in elections (Traunmuller, Murr, and Gill 2015), and dynamic aspects of preferences for redistribution (Stegmueller 2013).
Those DPP approaches, however, have three limitations. First, they are usually designed for specific types of dependent variables, e.g., outcome variables measured on an ordered scale. Second, particularly in the political science literature, previous works have used the DPP mostly as a prior for the intercept (or error) term only. Third and more importantly, previous works were not designed to study cases in which the latent heterogeneity is context-dependent.
To redress these limitations, this paper proposes a Dirichlet mixture of generalized linear models in which the within-context effect heterogeneity (clustering) can be context-dependent. The proposed model is a generalization, from the point of view of the expectation of the dependent variable, of the usual generalized linear model (GLM), classical generalized linear mixed models (GLMMs), finite mixture models (FMMs), and current single-context Dirichlet mixtures of generalized linear models. The proposed model has several advantages over those special cases.
First, when there are multiple contexts, for instance in cross-country comparative analysis, the model can be used to investigate if country features are associated with latent heterogeneity in the covariate effects; that is, if country-level features affect the number and the characteristics of the subpopulation clusters.
Second, the proposed Dirichlet mixture of generalized linear models is developed in its full generality to handle Dirichlet mixtures of any distribution in the exponential family, to investigate heterogeneity not only in the error term but in the effect of any observed covariate, and, as mentioned, to study how such heterogeneity varies with context-level features. This paper implements two special cases: binary and continuous outcomes, modeled using Bernoulli and gaussian distributions, respectively. The algorithms for estimating these special cases are presented, but the MCMC algorithm with a Gibbs sampler is derived for the more general model, so it can easily be extended to other outcome variable distributions.
Third, as a generalization of the other models, it can be used in any situation in which one of the more specialized models is well justified. If, in fact, one believes that a single GLM can be used across contexts and there is no latent heterogeneity in the population, the proposed model can be estimated and it will produce results for the conditional expectation of the dependent variable similar to those estimated using a GLM. If there is just one context, but unknown clusters, it can be used instead of the single-context Dirichlet mixture of GLMs. The analogous situation holds for the other special cases; i.e., whenever the researcher is estimating a GLMM or a finite mixture model (FMM), the proposed model can be used, and it has two additional advantages: the number of latent clusters, which is allowed to grow with the size of the data, is simultaneously estimated; and, as already mentioned, if the data come from different contexts, the effect of the context on the characteristics of the clusters is also investigated.
Fourth, the model estimates cluster memberships, so we can classify the data points into (latent) groups. The clusters differ in terms of the vector of linear coefficients that connects covariates to the outcome variable, so the model can be used to study and characterize the heterogeneity in the effect of the covariates within and across contexts.
Fifth, the statistics and epidemiology literature has proposed approaches for dealing with Simpson’s paradox based on domain knowledge (Hernán, Clayton, and Keiding 2011) and estimation diagnostics (Kievit et al. 2013). Its formal aspects and its connection to other problems have also been studied (Samuels 1993; Hernán, Clayton, and Keiding 2011; Pearl 2011, 2014). However, to the best of our knowledge, the literature has not proposed any modeling solution. We connect the model proposed here to Simpson’s paradox in the context of generalized linear models and show how it can be used to detect the occurrence of the paradox and to deal with such problems by estimating the cluster-specific effects.
The rest of the paper is organized as follows. The next section presents the model. The following section then demonstrates how the proposed model is connected to classical GLMs, mixed models, FMMs, and the econometric models mentioned above that use the DPP to deal with heterogeneity in single-context analyses. Section 4 develops MCMC algorithms to estimate the model in its full generality and for two special cases of outcome variables. In Section 5 we conduct a Monte Carlo exercise to study the frequentist properties of the estimation. The estimation is tested against a large variety of scenarios with and without latent heterogeneity. The section also illustrates how the model can be used to deal with Simpson’s paradox in the context of generalized linear models, and it compares the results of a GLM estimated by maximum likelihood (MLE) with those produced by the proposed model using the MCMC developed in Section 4. Section 6 uses the model to analyze real data sets. It replicates some studies and shows how the model uncovers latent heterogeneity and Simpson’s paradox. Finally, the conclusions are presented.
2 The Model
To restate the problem: first, we want to use a generalized linear model to estimate the effect of the covariates $X_i$ on $y_i$. Second, we want to take into account the possibility that the effect of the covariates is heterogeneous across different subgroups whose defining features are latent or were not observed. In other words, there might be latent subpopulations of individuals for which the covariates have a different relationship with the outcome. Third, we want to allow this latent subpopulation heterogeneity or clustering to be investigated both for data that come from a single context and for data that come from multiple contexts. Finally, when the observed population comes from different contexts (e.g., different countries and different years), we want to investigate whether context-level features change not only the effect of observed covariates on the outcome but also the existence and the characteristics of latent subpopulations in which the observed covariates have different effects.
The model that deals with such problems can be developed as follows. For each observation $i$, suppose we have a set of observed covariates $X_i^{\prime} \in \mathbb{R}^{D_x}$ and an outcome variable $y_i$. Denote $X_i = (1, X_i^{\prime})$. Let $K$ denote the number of heterogeneous groups in the population, which can be larger if the population is larger, and let $Z_i$ indicate the group of $i$. $Z_i$ and $K$ may or may not be known or observed. When $Z_i$ is not observed, we use the term “clusters” instead of “groups.” Denote by $C_i$ the context of $i$, so $C_i = j$ indicates that observation $i$ comes from the context indexed by $j \in \{1, \ldots, J\}$, where $J$ is the number of contexts.
For the purpose of illustration and as a toy example, suppose we want to investigate the effect of income and race on voters’ support for welfare policies in different countries. Then $X_i$ are measures of income and race of individual $i$, and $y_i$ is his degree of support for welfare policies. The variable $C_i = j$ indicates the country where $i$ lives, and data are collected in $J$ countries. Suppose further that in each country the population is divided into types of individuals with different personal experiences with class and racial conflict. The types are not observed, but we suspect the effect of income and race is conditional on the type. The latent variable $Z_i = k$ indicates that $i$ is type $k$, and $K$ denotes the number of different types.
If $p(\cdot)$ denotes a distribution in the exponential family, $g$ a link function, and $\theta = (\beta, \sigma)$, then the group- and context-specific GLM is given by:

[Equation (1)]
If $Z_i$ were observed and $K$ therefore known, one could use classical mixed-effects models to estimate group- and context-specific heterogeneous effects. If $K$ were known, but $Z_i$ latent or unobserved, one option would be to use finite mixture models for the estimation (Gaffney 2003; Ng et al. 2006; De la Cruz-Mesía, Quintana, and Marshall 2008; Villarroel, Marshall, and Barón 2009). When $Z_i$ is latent and $K$ is unknown, some authors have proposed models that use a DPP on $\theta$ in order to estimate cluster-specific effects (Mukhopadhyay and Gelfand 1997; Kleinman and Ibrahim 1998a,b; Dorazio et al. 2008; Gill and Casella 2009; Heinzl and Tutz 2013; Stegmueller 2013; Traunmuller, Murr, and Gill 2015). We refer here to such models as the Dirichlet process generalized linear model (dpGLM), as adopted by Hannah, Blei, and Powell (2011). Contrary to their formulation, however, we assume that $X_i$ is given. If we denote by $\mathcal{DP}(\alpha, G)$ the Dirichlet process with location parameter $\alpha$ and base measure $G$, the GLM is modified in the following way to produce the dpGLM:

[Equation (2)]
Authors have warned that using a DPP can lead to biased estimators, and it is known that neither weak consistency nor asymptotic unbiasedness is guaranteed in general in DPP models (Diaconis and Freedman 1986; Ghosal, Ghosh, and Ramamoorthi 1999; Tokdar 2006; Kyung et al. 2010). Although bias will always be present due to the bayesian priors, Hannah, Blei, and Powell (2011) demonstrated that the dpGLM satisfies the conditions that guarantee weak consistency of the joint posterior distribution and consistency of the regression estimates (see also Tokdar 2006).
The dpGLM lacks the hierarchical clustering structure that we would like to have in the model, namely that the clusters can be a function of higher-level context features in a multi-context analysis. We want to preserve the structure of the dpGLM and the DPP—because there might be unknown clusters with heterogeneous effects and unknown cluster membership—but include such context dependency—because the heterogeneity may depend on the context characteristics.
Some authors have proposed different approaches to model hierarchical clustering and to create dependencies among multiple Dirichlet processes (Mallick and Walker 1997; Carota and Parmigiani 2002; De Iorio et al. 2004; Müller, Quintana, and Rosner 2004; Teh et al. 2006). We can generalize and combine these approaches with the dpGLM in the following way. Let $W_j^{\prime} \in \mathbb{R}^{D_w}$ denote the context-level features of context $j$ and $J$ the total number of contexts as before. Let $W_j = (1, W_j^{\prime})$.
In our toy example, $W_j$ can be thought of as the levels of economic development and inequality of country (context) $j$. That is, we want to investigate whether the effect of income and race (the observed covariates $X_i$) on support for redistribution ($y_i$) varies with the degree of economic development and inequality of the country ($W_j$). Moreover, we also want to investigate whether the effect of those observed covariates (income and race) differs among within-country subpopulations whose membership ($Z_i$) is unobserved. Finally, we want to verify whether those subpopulations vary from one country (context) to another due to a country’s level of inequality and economic development (the context-level features $W_j$).
The model can be modified in the following way to introduce context-level dependency among DPPs:

[Equation (3)]
We refer to model (3) as the hierarchical Dirichlet process generalized linear model (hdpGLM). It generalizes the GLM and the dpGLM and provides a hierarchical clustering structure that is context-dependent. Because it generalizes GLMs, it can be used even if there is neither heterogeneity nor multiple contexts. The advantage of using the hdpGLM is that clusters can be uncovered when the researcher is uncertain about the existence of heterogeneous effects. A model selection procedure can be adopted to decide whether the GLM or the hdpGLM results are adequate for the data at hand (Mukhopadhyay and Gelfand 1997).
To complete the formulation of the model and connect (3) and (1), denote by $Z_{ik}$ the case in which individual $i$ belongs to the subpopulation indexed by $k$, that is, $Z_i = k$. Let $C_{ij}$ indicate that individual $i$ belongs to context (country) $j$, that is, $C_i = j$. We can parameterize the effect of the context-level covariates $W$ with $\tau \in \mathbb{R}^{(D_w+1) \times (D_x+1)}$ and rewrite model (3) using the stick-breaking construction (Sethuraman 1994; Teh et al. 2006). The resulting model in its full generality is the following:

[Equation (4)]
We further assume that, for $\theta = (\beta, \sigma)$,

[Equation]
So, the variable $\tau_d$ is a vector of linear coefficients of the country-level features. It determines the average effect of the individual-level feature $X_{id}$ on the outcome $y_i$ in cluster $k$.
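For reference, the stick-breaking construction invoked in (4) has the following textbook form (Sethuraman 1994); the display below is a generic sketch with base measure $G_0$, not the paper's exact parameterization:

$$v_k \mid \alpha \sim \text{Beta}(1, \alpha), \qquad \pi_k = v_k \prod_{l=1}^{k-1} (1 - v_l), \qquad Z_i \mid \pi \sim \text{Cat}(\pi), \qquad \theta_k \overset{iid}{\sim} G_0,$$

so that $\sum_{k=1}^{\infty} \pi_k = 1$ almost surely and each observation draws its cluster-specific parameters through $Z_i$.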
In our toy example, $\tau_1 = (\tau_{11}, \tau_{21})$ would be the linear effect of inequality and economic development (the context-level features) on $\beta_{1k}$, the linear effect of income (an observed covariate) on support for redistribution among people with a similar history of social and racial conflict—the unobserved types indexed by $k$. The parameter $\tau_2 = (\tau_{12}, \tau_{22})$ would be the linear effect of inequality and economic development on $\beta_{2k}$, which is the effect of race on support for redistribution among people of type $k$. Therefore, we have a DPP clustering model that is context-dependent because the linear coefficients $\beta$ of the outcome variable depend on the cluster probability $\pi$ (through $Z_i$) and on the context-level features $W$ (through their linear effect $\tau$).
3 Generalized Linear Models, Finite Mixture Models, and hdpGLM
This section shows the relationship between the hdpGLM and the classical GLM, GLMM, FMM, and dpGLM in terms of the structure of the average parameters of the outcome variable $y_i$. In that sense, the hdpGLM can be viewed as a generalization of the other models. That generalization allows us to estimate latent or unobserved heterogeneity in the population in terms of how the covariates and the outcome are linearly related when the number of heterogeneous groups is not known in advance. The section also explores some connections between the hdpGLM, latent instrumental variable (LIV), and latent-index models, which are approaches that use the DPP with regression models.
As before, we denote by $X_i = (1, X_i^{\prime}) \in \mathbb{R}^{(D_x+1) \times 1}$ the observed characteristics of unit $i$ and by $Z_i \in \{0,1\}^{\kappa}$ the design variable indicating the group (or cluster) of $i$. The parameter $\kappa$ represents the number of clusters. Let $\gamma_i \in \mathbb{R}^{(D_x+1) \times \kappa}$ be the cluster-specific matrix of linear coefficients such that $\gamma_{ik} \in \mathbb{R}^{(D_x+1) \times 1}$, the $k^{\text{th}}$ column of $\gamma_i$, contains the linear coefficients of cluster $k$. The most general formulation of the GLM, in which every individual and group has its own set of linear coefficients, is:

[Equation (5)]
Define $\eta_i = X_i^T \beta_i + X_i^T \gamma_i Z_i$ and for simplicity let $D_x = 1$. We can write

[Equation]
Classical GLMs, GLMMs, FMMs, or hdpGLMs emerge from model (5) depending on what we know or believe about $\kappa$, $Z_i$, $\gamma_i$, and $\beta_i$. More precisely, it depends on the structural assumptions we impose on those parameters.
Classical GLM can be interpreted in two ways. Either one assumes $\gamma_i = 0$ for all $i$ and $\beta_i = \beta$, which gives

[Equation (6)]

or one assumes $\kappa = 1$, which gives

[Equation (7)]
Clearly, (6) and (7) are structurally equivalent, and treating either $\theta$ in (7) or $\beta$ in (6) as the parameter to be estimated should produce the same results.
When $\kappa > 1$, $Z_i$ is observed, and one believes $\gamma \neq 0$, the common approach is to use fixed-, random-, or mixed-effects models. For fixed effects, one either assumes that each group $k$ has its own fixed intercept term $\theta_{0k}$, or both its own fixed intercept and slope ($\theta_{0k}, \theta_{1k}$). Classical models with random effects similarly assume that each observed group has its own intercept (and/or slope), but also that, instead of being fixed, they are drawn from a common distribution. A gaussian distribution with zero mean is the standard choice for the random effects (Hayashi 2000; Woodridge 2002), but one can also have group-specific averages (Gelman and Hill 2007) (see Table 1). Mixed models use a combination of random and fixed effects.
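As an illustration in the notation above (a generic sketch, not the paper's own display), a random-intercept specification for observed groups $k$ is

$$g\big(\text{E}[y_i \mid X_i, Z_i = k]\big) = X_i^T \beta + \theta_{0k}, \qquad \theta_{0k} \sim N(0, \sigma_{\theta}^2),$$

and a random slope is added analogously as $\theta_{1k} x_i$ with $\theta_{1k} \sim N(0, \sigma_1^2)$.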
Table 1. Relationship between GLM, GLMM, FMM, and hdpGLM based on structural assumptions on $\kappa$, $Z_i$, and $\varphi_i$.

* GLM: generalized linear models; FE: fixed effect in the intercept (FE (I)) or in both the intercept and slope (FE (I + S)); RE: random effect in the intercept (RE (I)) or in both the intercept and slope (RE (I + S)); FMM: finite mixture models.
When $Z_i$ is not observed, it is obviously not possible to use classical mixed models for the group heterogeneity. When one does not observe $Z_i$, but $\kappa$ is known or assumed to be finite and fixed, a finite mixture model is usually used (Lenk and DeSarbo 2000). If we let $Z_i \sim \text{Cat}(\pi)$, $\pi \in \Delta^{\kappa}$, then

[Equation (8)]
and

[Equation]

which implies a finite mixture distribution for $y_i$, that is,

[Equation (9)]
By averaging over $\kappa$ we get

[Equation]

Again, it has the same basic structure as the classical GLM.
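In generic notation, and as a sketch consistent with (8) and (9) rather than the paper's exact display, the implied mixture density is

$$p(y_i \mid X_i) = \sum_{k=1}^{\kappa} \pi_k \, p\big(y_i \mid g^{-1}(X_i^T \theta_k)\big), \qquad \pi \in \Delta^{\kappa},$$

so that $\text{E}[y_i \mid X_i] = \sum_k \pi_k \, g^{-1}(X_i^T \theta_k)$.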
The dpGLM generalizes that structure by allowing $\kappa$ to be undetermined. It emerges naturally from finite mixtures when there might be clusters in the population that are latent or were not measured and, in addition, we do not know exactly the number of clusters. By letting $\kappa \rightarrow \infty$ in the finite mixture model in (9), and by putting a prior on $\theta$ and a stick-breaking prior on $\pi$, we have the dpGLM, as described in (10) (Teh et al. 2006; Hannah, Blei, and Powell 2011).

[Equation (10)]
Starting with the dpGLM, by restricting the possible number of clusters to be finite ($\kappa < \infty$) and treating $\pi$ and $\theta$ as fixed, we are back to the FMM. If, in addition, we either average out the clusters and treat those averaged elements as the fixed parameters to estimate, or if $\kappa = 1$, we have the classical GLM. In sum, the GLM can be viewed as a special case of the dpGLM.
Finally, the hdpGLM proposed here generalizes that structure to account for the possibility of context-dependent clustering. It does that by letting the linear coefficients of the clusters be a function of context-level covariates. We modify model (10) by adding the parameter $\tau$ and context-level information $W$. Given $J$ different contexts, the context-level covariates $W \in \mathbb{R}^{J \times (D_w+1)}$, and the variable $C_i$ that indicates the context to which $i$ belongs, we obtain the hdpGLM by adding the following structure to the dpGLM:

[Equation (11)]
Hence, if there is just one context ($J = 1$), we have the dpGLM again, which demonstrates the connection between the hdpGLM and the other models. Table 1 summarizes the connections between them.
The hdpGLM is also structurally connected to latent instrumental variable (LIV) models (Ebbes, Böckenholt, and Wedel 2004; Ebbes et al. 2005; Ebbes, Wedel, and Böckenholt 2009). Such models can be used to deal with endogenous covariates. The main feature of the LIV is the introduction of a latent categorical instrumental variable, which turns the instrumental variable (IV) regression model into a FMM. To see this, consider a simple example of a classical IV model with endogenous covariate $x_1$, a gaussian outcome, and instrumental variable $z$ (index $i$ omitted for simplicity):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \qquad x_1 = \gamma_0 + \gamma_2 x_2 + \gamma_3 z + \nu. \qquad (12)$$
For $\phi_0 = \beta_0 + \beta_1 \gamma_0$, $\phi_2 = \beta_1 \gamma_2 + \beta_2$, $\phi_3 = \beta_1 \gamma_3$, and $\epsilon^{\prime} = \beta_1 \nu + \epsilon$, we have the reduced form:

$$y = \phi_0 + \phi_2 x_2 + \phi_3 z + \epsilon^{\prime}. \qquad (13)$$
The LIV approach defines a latent $K$-level categorical random variable $Z_i$ to be used instead of the instrument $z_i$. Each group $k$ is assumed to have its own mean value $\varphi_{0k}$. This leads to a FMM with $K$ latent groups such that the outcome is given by:

[Equation (14)]

or equivalently, and assuming group-specific errors:

[Equation (15)]

[Equation]
Ebbes et al. (2005) and Ebbes, Wedel, and Böckenholt (2009) proposed (15), the LIV, to deal with endogeneity in the regressors. From (14) and (15), we can see how it is structurally connected to the hdpGLM. One difference between the LIV and the model presented here is that in the former the number of latent groups must be selected in advance, before the parameters are estimated, which is a feature of any FMM. Moreover, the LIV is designed for single-context estimation; that is, the endogeneity and the instrument are not context-dependent. The main difference, however, is that the LIV approach uses the joint distribution of the endogenous covariates and the outcome, due to its goal of dealing with endogeneity of the covariate, while here the covariates are assumed to be exogenous. The hdpGLM leads to a LIV model if we truncate the DPP, restrict the hdpGLM to its non-hierarchical version (dpGLM) with gaussian outcome, and model the distribution of the endogenous covariates as in the equations above.
Finally, the hdpGLM also has a close connection to the latent-index model designed to deal with single-context heterogeneity (Aakvik, Heckman, and Vytlacil 2005; Rossi 2014). Aakvik, Heckman, and Vytlacil (2005) propose a model in which the marginal effects are heterogeneous in the population and indexed by a continuous latent random variable. They also provide a special case with two latent groups, obtained by a binary transformation of that (gaussian distributed) latent index, which produces a mixture model with two latent components. The model here generalizes that continuous latent-index approach in two ways. First, it imposes a much more flexible distribution on the latent index and allows us to estimate effect heterogeneity when there is an unknown number of finite or countably infinite groups. Their gaussian index model can be approximated by a countably infinite partition of the real line and a symmetric unimodal discrete distribution on that partition; this is embedded in the structure of the model, and the DPP can naturally be used to estimate such a distribution of the indexes. Second, the model here generalizes their approach by adding a context-dependent structure to the latent heterogeneity.
4 Estimation
There are many options in the literature to estimate models that use the DPP (Ishwaran and Zarepour 2000; Neal 2000; Blei and Jordan 2006; Walker 2007). Here we extend the approach proposed by Ishwaran and James (2001). In order to implement a (blocked) Gibbs sampler for a DPP model, one of the algorithms they propose uses a truncated version of the stick-breaking construction in conjunction with the generalized Dirichlet distribution. We extend their basic algorithm in two ways. First, we incorporate the hierarchical structure of the model proposed here and develop a Gibbs sampler in its full generality. Second, we derive the sampler for two special cases: a continuous outcome variable $y_i$, modeled using a gaussian distribution, and a binary outcome variable, modeled using a Bernoulli distribution with a logistic transformation of the average parameter. For the gaussian outcome, the Gibbs update can be used for all parameters. Therefore, in practice, the estimation for that special case shows good convergence diagnostics within a few thousand iterations and can be performed in a relatively short time, depending on the size of the data set. For the binary outcome, the Gibbs update is available for all parameters except the linear coefficients of the generalized linear model. So we extend the algorithm and implement a Metropolis–Hastings update within Gibbs to sample the linear coefficients ($\beta$) using Riemann manifold Hamiltonian Monte Carlo (Neal 2000; Shahbaba and Neal 2009; Neal 2011). The R package hdpGLM contains the implementation of the model with the algorithms presented here.
The truncation of the DPP used in the MCMC algorithm restricts the mixing probability parameter $\pi \in \Delta^{\infty}$ described in (4) to $\pi \in \Delta^{K}$. To estimate the model properly, we set a large value for $K$ and monitor the estimation to check the maximum number of clusters the sampler used to allocate the data points during the iterations. If it reached $K$ at any point, we increase its value and repeat the process. By selecting a $K$ much larger than the number of clusters the sampler activates during the estimation, we make sure the truncation is not changing the estimated results.
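As an illustration of this truncation check, the sketch below (ours, not the hdpGLM package code) draws mixing weights from a Beta$(1, \alpha)$ stick-breaking prior truncated at $K$ and counts how many components are actually occupied by a sample of $n$ allocations; in the sampler, the analogous count is monitored across iterations.

```python
import numpy as np

rng = np.random.default_rng(42)

def truncated_stick_breaking(alpha, K):
    """Mixing weights pi from a stick-breaking prior truncated at K."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # close the last stick so the K weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

alpha, K, n = 1.0, 50, 2000
pi = truncated_stick_breaking(alpha, K)
z = rng.choice(K, size=n, p=pi)  # cluster allocations under the prior
# If the number of occupied clusters ever approaches K, raise K and rerun.
print("occupied clusters:", np.unique(z).size, "out of K =", K)
```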
As before, let $\mathbf{X} \in \mathbb{R}^{n \times (D_x+1)}$ denote the individual-level covariates, including a column of ones for the intercept term, and $n$ the number of data points including all contexts. Denote $C = (C_1, \ldots, C_n)$, with $C_i \in \{1, \ldots, J\}$ the variable that indicates the context to which $i$ belongs, and let $\mathbf{W} \in \mathbb{R}^{J \times (D_w+1)}$ be the $(D_w+1)$-dimensional context-level features of the $J$ contexts. Finally, let $Z = (Z_1, \ldots, Z_n)$.
The following additional notation is used to derive the algorithm: $Z^{\ast}$ denotes the unique values of $Z$, and $Z^{\ast C}$ the values between 1 and $K$ that are not in $Z^{\ast}$. We denote by $Z_j^{\ast}$ the unique values of $Z$ in context $j$, and by $Z_j^{\ast C}$ its complement in $j$; $I_k$ is the set of indexes $i$ of the data points assigned to cluster $k$; $N_k$ is the total number of data points in $k$; and $X_{jk}$ (or $y_{jk}$) denotes the covariates (outcome variable) of the observations $i$ in context $j$ assigned to cluster $k$.
Given the most general formulation of the hdpGLM in (4) and the truncation used for the sampler, we have the following proposition (see proof in Appendix A):

Proposition 1 (Blocked Gibbs sampler for hdpGLM). A blocked Gibbs sampler for the model described in (4) with $\pi \in \Delta^{K}$ is given by Algorithm 1.
A special case of the model described in (4) occurs when $y_i$ is gaussian distributed. Let $N_d(\mu, \Sigma)$ denote a $d$-dimensional multivariate gaussian distribution. Then, for $\theta = (\beta, \sigma)$, we can have a Gibbs sampler for all parameters if we use the following distributions for $\tau$, $\beta$, and $\sigma$ (see proof in Appendix A):

[Equation (16)]
Proposition 2 (Gibbs for hdpGLM with gaussian mixtures). The Gibbs sampler for the model described in (16) is given by Algorithm 2.
When the outcome variable $y_i$ in model (4) is binomial distributed or, in general, has a distribution without a conjugate prior for the linear coefficients, the full conditional of the parameters $\theta$ (or $\beta$) is not standard and we cannot sample from it directly. To deal with such cases, we use a Riemann manifold Hamiltonian Monte Carlo (RMHMC) update (Girolami and Calderhead 2011) within Gibbs to sample the $\beta$ coefficients. We can still sample all the other parameters as before. For the sake of completeness, the RMHMC algorithm is presented in the supplementary material.
The random variable of interest is $\beta_{kj} \in \mathbb{R}^{D_x+1}$, called the position variable of the Hamiltonian Monte Carlo (HMC) algorithm (Neal 2011), and we denote by $v \in \mathbb{R}^{D_x+1}$ the ancillary variable (momentum) such that $v \sim N_{D_x+1}(0, G(\beta_{kj}))$. The Hamiltonian for our model is defined by

[Equation (17)]
whose solution is

[Equation (18)]
The Hamiltonian equations are solved using the generalized Störmer–Verlet leapfrog integrator (Calin and Chang 2006; Girolami and Calderhead 2011). For $L$ leapfrog steps of size $\epsilon$, and $l = 1, \ldots, L$, it is given by:

[Equation (19)]
When $y_i$ is binomial, that is, the distribution of $y_i$ in model (4) is defined by

[Equation]

then, given Equation (A4), the elements of the RMHMC for the hdpGLM when $k \in Z_j^{\ast}$ are defined by the following equations:

[Equation]
In practice we use $G(\beta_{kj}) = I_{(D_x+1) \times (D_x+1)}$, which is the most widely used approach in applications (Liu 2008; Neal 2011). It also simplifies Equations (17), (18), and (19) substantially. Using $v \sim N_{D_x+1}(0, I)$, the integrator reduces to the standard Störmer–Verlet leapfrog integrator (Duane et al. 1987; Neal 2011). We follow that approach in this paper.
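For concreteness, the sketch below shows the standard leapfrog update with identity metric for a Bernoulli-logit cluster, using a gaussian prior on the coefficients. It is a generic illustration of the update described above, not the package's implementation, and the prior variance and step-size values are arbitrary.

```python
import numpy as np

def grad_log_post(beta, X, y, prior_var=10.0):
    """Gradient of a Bernoulli-logit log-likelihood plus a N(0, prior_var I) prior."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (y - p) - beta / prior_var

def leapfrog(beta, v, grad, eps, L):
    """Standard Stormer-Verlet leapfrog with identity mass matrix (G = I)."""
    v = v + 0.5 * eps * grad(beta)      # half step for the momentum
    for step in range(L):
        beta = beta + eps * v           # full step for the position
        if step < L - 1:
            v = v + eps * grad(beta)    # full momentum steps in between
    v = v + 0.5 * eps * grad(beta)      # final half step for the momentum
    return beta, v

# Toy data for one cluster: 100 observations, intercept plus one covariate.
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = rng.integers(0, 2, size=100).astype(float)
beta0, v0 = np.zeros(2), rng.normal(size=2)  # momentum drawn from N(0, I)
beta1, v1 = leapfrog(beta0, v0, lambda b: grad_log_post(b, X, y), 0.01, 20)
```

A full update would follow this trajectory with a Metropolis accept/reject step based on the change in the Hamiltonian.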
5 Monte Carlo Simulation
In this section, we conduct a Monte Carlo exercise to demonstrate the properties of the estimates produced by the algorithms developed in Section 4. The exercise is divided into three parts. First, we reproduce a situation that often occurs in practice when one omits factors that condition the association between the variable of interest and the outcome. To do so, we compare results produced by the hdpGLM with those produced by a GLM, both when there is no latent heterogeneity in the population and when there are latent clusters. We show how Simpson’s paradox can emerge in the latter case and how the proposed model uncovers it. In the second part of the MC exercise, we simulate a large variety of possible scenarios, each with different types of heterogeneity and numbers of observed covariates, to show that the model performs well in a large variety of situations. We evaluate the frequentist properties of the estimators in each case, particularly their coverage probability (Carlin and Louis 2000; Little et al. 2011). Lastly, we compare the predictive performance of the GLM and the hdpGLM for different possible numbers of clusters in terms of root-mean-squared error (RMSE).
We start by comparing the estimates produced by the GLM and by the hdpGLM with and without latent heterogeneity in the population. We generated data sets from two parameter configurations with 3 continuous covariates sampled from a gaussian distribution. In the first data set, there is no effect heterogeneity. Hence, a single GLM would be appropriate because the effects of those three covariates are homogeneous in the population. In the second data set, we let the effect of one covariate be conditional on a latent factor such that it has opposite signs and similar magnitude for half of the population. The effects of the other two covariates are homogeneous. Data sets used in this exercise contain 2,000 observations.
As a toy example, we can think of the first covariate as income, the second as age, the third as the degree of racial fragmentation in the neighborhood, and the outcome as the degree of support for redistributive policies. The effect of income on support for redistribution depends on a latent feature, say, whether individuals have experienced economic reward due to their effort and hard work, as opposed to luck or family monetary heritage. The latent heterogeneous effect of income can occur, for instance, if more income means less support for redistribution only for those who believe upward mobility can be achieved through effort and hard work.
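To make the setup concrete, the sketch below generates data of this kind with made-up coefficient values (the true values used in the paper's exercise are not reproduced here) and contrasts the pooled least-squares fit with cluster-specific fits, which is exactly where the sign reversal appears:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Three gaussian covariates; a latent type flips the sign of the first
# covariate's effect for half the population (coefficients are illustrative).
X = rng.normal(size=(n, 3))
z = rng.integers(0, 2, size=n)              # latent cluster membership
beta1 = np.where(z == 0, 2.0, -2.0)         # heterogeneous effect of X_1
y = 1.0 + beta1 * X[:, 0] + 0.5 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
pooled = np.linalg.lstsq(A, y, rcond=None)[0]
cluster = [np.linalg.lstsq(A[z == k], y[z == k], rcond=None)[0] for k in (0, 1)]
print("pooled beta_1:   ", round(pooled[1], 2))   # washed-out average
print("cluster beta_1's:", round(cluster[0][1], 2), round(cluster[1][1], 2))
```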
Table 2 compares the point estimates and confidence intervals produced by estimating a GLM using MLE with the posterior averages and 95% HPD intervals produced by estimating the hdpGLM with the MCMC proposed here. We can compare the estimates with the true values, displayed in the fourth column of the table. After estimating the hdpGLM, we classified the data into clusters using the estimated cluster probability of each data point: we assigned each observation to the cluster to which it has the highest probability of belonging. The indexes of the clusters occupied by data points are displayed in the first column of the table. The upper half of Table 2 shows the estimates when the data come from a population in which there is no heterogeneity. All data points were classified into the same single cluster by the hdpGLM. The estimates of the two models are very similar; in fact, they are identical up to two significant figures, as shown in the table. The lower half of the table presents the results of the estimation using the GLM and the hdpGLM for the second data set, with heterogeneous effects in the first covariate $X_1$. We used the same procedure just described to classify the data into clusters. The hdpGLM estimated two clusters in virtually all repetitions of the procedure. The results produced by the GLM and the hdpGLM are very similar for covariates 2 and 3 ($\beta_3$ and $\beta_4$), whose effects are homogeneous in the population. For the hdpGLM, the estimated linear effects of those covariates are indistinguishable between the two estimated clusters, as expected. However, for the heterogeneous effect $\beta_1$ (e.g., income), the GLM estimated a positive effect when in fact there are two subpopulations, one with a positive and another with a negative effect. The hdpGLM, on the other hand, estimated the marginal effect of $X_1$ correctly for both clusters.
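The classification rule used above reduces to taking, for each observation, the cluster with the highest estimated membership probability; a minimal sketch, with hypothetical probabilities:

```python
import numpy as np

# prob[i, k]: estimated probability that observation i belongs to cluster k,
# e.g., the share of post-burn-in MCMC iterations that assign i to k.
prob = np.array([[0.92, 0.08],
                 [0.15, 0.85],
                 [0.55, 0.45]])
assignment = prob.argmax(axis=1)  # modal cluster for each observation
print(assignment)                 # -> [0 1 0]
```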
Table 2. Comparing estimates of GLM (estimated using MLE) and hdpGLM (estimated using MCMC) with and without latent heterogeneity in the population.
Table 2 contains an example of Simpson’s paradox: the aggregate effect found for $X_1$ when one uses the GLM and ignores the clusters is quite different from the effect found when the clusters are considered. We can see this more clearly in Figure 1. The lines represent the fitted values using the MLE estimates for the GLM and the fitted values using the posterior average for the hdpGLM. In the left panel, we can compare the estimated marginal effects produced by each model. In the right panels, we see the data points after they were clustered by the hdpGLM, along with the fitted values. We would have reached incomplete conclusions using the GLM in such a situation: the effect is positive and significant for the MLE estimates of the GLM but, in fact, it is negative for half of the population, and for the other half it has a larger positive effect than the GLM estimated.
Figure 1. Comparing marginal effect estimated using GLM (MLE estimator) and hdpGLM (MCMC posterior average) when there are 3 clusters in the population.
The general takeaway from these results is that when the GLM is well specified and there is no effect heterogeneity due to latent features, using the hdpGLM will not harm the estimation. When there are clusters with heterogeneous effects, the GLM will produce accurate aggregate results but can nevertheless be incorrect for each one of the subpopulations. Table 2 and Figure 1 demonstrate that the hdpGLM reduces to the GLM when there is no heterogeneity (see Section 3). When the assumptions that justify the adoption of the GLM hold, the hdpGLM can still be used, and it estimates mean values of the linear parameters quite close to those produced by MLE estimates of the GLM. When there was heterogeneity, the hdpGLM classified the data correctly into two clusters, the marginal effects were correctly estimated, and Simpson’s paradox was uncovered.
Next, in order to evaluate the performance of the hdpGLM in a wide range of possible scenarios, we randomly generated 10 different sets of parameters. To make the Monte Carlo exercise faster and easier to visualize, we simulate data for a single context ($J = 1$) with a continuous outcome variable. An example with context-dependent heterogeneity ($J > 1$) is presented in the sequel. Examples with binary outcome variables are provided in the supplementary material.
For each parameter set, we randomly generated 100 data sets. The number of clusters $K$ and the number of covariates in each case were also randomly generated. We allowed the heterogeneity to occur in the effects of all covariates. Values of the linear coefficients range from $-20$ to 20. We estimated the hdpGLM for each one of the 1,000 data sets (10 parameter sets times 100 data sets per parameter set), and all the usual convergence diagnostics were conducted (Geweke 1992; Cowles and Carlin 1996; Brooks and Gelman 1998; Flegal 2008; Flegal, Haran, and Jones 2008). The high posterior density (HPD) intervals were computed across the data sets generated by each parameter set, and so was the posterior average.
Table 3 summarizes the coverage probability of the linear coefficients $\beta$ for each one of the 10 parameter sets, along with the cluster estimation. The first column indicates the number of covariates in each parameter set, and the second indicates the true number of clusters in the population. The third through fifth columns display summaries of the estimation across the 100 data sets generated by each parameter set: the mean, minimum, and maximum number of clusters to which the data points were assigned after the estimation. As before, we assigned the data points to clusters based on the maximum estimated probability of cluster membership. The sixth column shows the proportion of times the data were classified into the correct number of clusters across the replications. The table also displays the minimum and the average coverage across linear coefficients for each parameter set. For instance, the second line displays a case in which there are two latent clusters in the population and five covariates. The sixth column indicates that the data points were classified into two clusters in 96% of the estimations performed using the 100 data sets generated by that parameter set. There are 10 linear coefficients across clusters in that case (5 linear coefficients per cluster). Among those 10 linear coefficients, the minimum coverage probability was 90%, which means that the linear parameter whose estimation had the worst coverage was still correctly estimated 90% of the time. By correct estimation we mean that the true value was within the 95% HPD interval. So in at least 90 out of 100 cases, the true values were contained in the 95% HPD interval for all the linear parameters. One may argue that such good coverage probability occurs because the posterior intervals are too wide, so we display in the last column of the table the maximum average width of the HPD intervals among the linear coefficients in each case. As the intervals are generally small, we can be confident that the model and the estimation procedure proposed here have good coverage probability and that such results are not due to large variance of the posterior distribution. Another possible objection is that the number of replications is too small. In the supplementary material we provide a much larger MC exercise for two additional parameter sets with 1,000 replications each. The supplementary material also contains tables with the MC standard errors for all simulations and for all linear parameters $\beta$. The results are similar to those presented here.
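As a sketch of how such coverage is computed (using an equal-tailed interval as a simple stand-in for the HPD interval, and simulated draws in place of real MCMC output):

```python
import numpy as np

rng = np.random.default_rng(1)

def covered(draws, true_value, level=0.95):
    """True if true_value lies in the equal-tailed credible interval."""
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])
    return lo <= true_value <= hi

# Toy stand-in for 100 replications: each replication's posterior is centered
# near, but not exactly at, the true value, as a real posterior would be.
true_beta, post_sd = 2.0, 0.1
hits = []
for _ in range(100):
    center = rng.normal(true_beta, post_sd)
    hits.append(covered(rng.normal(center, post_sd, size=4000), true_beta))
print("estimated coverage:", np.mean(hits))
```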
Table 3. Summary of the performance of the hdpGLM when estimating the number of clusters ($K$) and linear coefficients ($\beta$) across 100 replications generated by each of 10 different parameter sets.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_tab3.gif?pub-status=live)
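To make the computation concrete, the following is a minimal sketch of how the coverage reported in Table 3 could be computed from posterior draws. This is an illustration in Python, not the authors' replication code, and all function names are ours.

```python
import numpy as np

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing `prob` of the posterior draws."""
    sorted_draws = np.sort(draws)
    n = len(sorted_draws)
    m = int(np.ceil(prob * n))          # number of points the interval must contain
    widths = sorted_draws[m - 1:] - sorted_draws[:n - m + 1]
    j = int(np.argmin(widths))          # start of the shortest window
    return sorted_draws[j], sorted_draws[j + m - 1]

def coverage(true_betas, draws_by_dataset):
    """Per-coefficient fraction of replications whose 95% HPDI covers the truth."""
    hits = []
    for draws in draws_by_dataset:      # one (iterations x coefficients) array per data set
        bounds = [hpd_interval(draws[:, d]) for d in range(draws.shape[1])]
        lo, hi = map(np.array, zip(*bounds))
        hits.append((lo <= true_betas) & (true_betas <= hi))
    return np.mean(hits, axis=0)
```

The minimum and mean of the values returned by `coverage` across coefficients correspond to the coverage columns of Table 3.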
Now we turn to a full example of an estimation with context-dependent latent heterogeneity. For this example, we used ten contexts ($J=10$) and two covariates ($D_{x}=2$). We let the expectation of the effect $\beta_{1}$ of the first covariate $X_{1}$ be a function of the context-level feature $W_{1}$, but the expectation of the linear effect $\beta_{2}$ of the second covariate $X_{2}$ is not a function of context-level features. In other words, we randomly sampled $\tau_{11}$ (the effect of $W_{1}$ on the expectation of $\beta_{1}$) from its prior distribution and set $\tau_{12}$ (the effect of $W_{1}$ on the expectation of $\beta_{2}$) to zero. We set the number of clusters to two ($K=2$). Figure 2 shows the result of the estimation. The left panel of the figure shows the posterior distribution of the linear coefficients in each context; the vertical lines indicate the true values. The posterior is clearly concentrated around the true cluster values. The top-right panel shows the estimated posterior averages of $\beta_{1}$ and $\beta_{2}$ for each cluster, in each context, as a function of the context feature $W_{1}$. The expectation of $\beta_{1}$ in each cluster is clearly a positive function of $W_{1}$, but that is not the case for $\beta_{2}$. Finally, the bottom-right panel shows the posterior expectation of $\tau$. The estimated values are quite close to the true values and fall within narrow 95% HPD intervals.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_fig2g.gif?pub-status=live)
Figure 2. Output of the estimation of the hdpGLM model for a data set with 10 contexts and a positive effect of the context-level feature $W_{1}$ on the marginal effect $\beta_{1}$ of $X_{1}$.
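For concreteness, a data-generating process of the kind just described might look as follows. The distributions and scale values here (e.g., the value of $\tau_{11}$ and the cluster standard deviations) are our own illustrative assumptions, not the exact generator used in the paper.

```python
import numpy as np

# Illustrative DGP: J = 10 contexts, two covariates, K = 2 latent clusters,
# with E[beta_1] a linear function of the context feature W_1 (tau_11 > 0)
# and E[beta_2] unrelated to it (tau_12 = 0).
rng = np.random.default_rng(1)
J, K, n_per = 10, 2, 300
tau11, tau12 = 5.0, 0.0        # effect of W_1 on E[beta_1]; zero for beta_2
W1 = rng.normal(size=J)        # one context-level feature per context

X, y, context = [], [], []
for j in range(J):
    # cluster-specific coefficients, centered on the context-level regression
    b1 = rng.normal(tau11 * W1[j], 2.0, size=K)   # depends on W_1
    b2 = rng.normal(tau12 * W1[j], 2.0, size=K)   # does not depend on W_1
    for _ in range(n_per):
        k = rng.integers(K)                       # latent cluster membership
        x = rng.normal(size=2)
        y.append(b1[k] * x[0] + b2[k] * x[1] + rng.normal())
        X.append(x)
        context.append(j)
X, y, context = np.array(X), np.array(y), np.array(context)
```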
To complete this section, we compared the predictive performance of the GLM and the hdpGLM in terms of RMSE for different numbers of latent clusters. We randomly generated 30 parameter sets, with the number of clusters ranging from 1 to 30. For each case, we generated 10 data sets, estimated both the GLM and the hdpGLM, and computed the RMSE. Figure 3 compares the predictive performance of the two models: the RMSE of the hdpGLM remains low regardless of the number of clusters, as expected.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_fig3g.gif?pub-status=live)
Figure 3. Comparing performance of hdpGLM and GLM using root-mean-squared error (RMSE) as a function of the number of clusters in the data.
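The RMSE computation itself is simple. Here is a sketch of the benchmark side, with `pooled_ols_predictions` standing in for the plain GLM fit (the names are ours):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted outcomes."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pooled_ols_predictions(X, y):
    """Least-squares predictions that ignore latent clusters (the plain GLM)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef
```

For each simulated data set, `rmse(y, pooled_ols_predictions(X, y))` would be compared against the RMSE of the hdpGLM's posterior-predictive means.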
All estimations in this section and the following use $(\mu_{\tau_{d}},\sigma_{\tau_{d}}I,\sigma_{\beta_{kj}}I,s^{2},\nu,\alpha_{0})=(0,10I,10I,10,10,1)$ as the prior parametrization, where $I$ represents the identity matrix. Those values give a reasonably large variation for the underlying random variables, and the simulation results have shown that they produce good coverage and small 95% HPD intervals in a large variety of situations. The supplementary material contains details of a prior perturbation study. Briefly, it shows that on average the model is not very sensitive to different prior settings, but in the worst case, for certain combinations of prior parameters, the model can demand very large data sets to escape the influence of the prior specification. This is true especially for extreme values of the concentration parameter $\alpha$ and for values that produce a highly dispersed scaled inverse-$\chi^{2}$ distribution, which can be generated by low values (below five) of the scale parameter $s^{2}$. For details, see the supplementary material.
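For reference, a scaled inverse-$\chi^{2}(\nu,s^{2})$ draw can be generated as $\nu s^{2}/\chi^{2}_{\nu}$. A quick sketch for inspecting how the prior's location and spread change with the settings discussed above (the particular values compared are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
nu = 10.0
for s2 in (10.0, 1.0):                  # the default scale vs. a lower value
    # Scale-inv-chi2(nu, s2) draw: nu * s2 divided by a chi-squared(nu) draw
    draws = nu * s2 / rng.chisquare(nu, size=100_000)
    print(f"s2={s2}: mean={draws.mean():.2f}, sd={draws.std():.2f}")
```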
6 Empirical Application
In this section, we illustrate some applications of the model by replicating empirical studies and comparing the original results with the ones produced by the hdpGLM estimates.
We start with Bechtel, Hainmueller, and Margalit (Reference Bechtel, Hainmueller and Margalit2014), who present a study in Germany using online and telephone survey data. The paper investigates why some voters agree with bailout payments for other countries. The dependent variable is a dichotomous measure coded 1 if the person is against bailout payments for over-indebted EU countries and 0 otherwise. They find that social dispositions, in particular feelings of cosmopolitanism and altruism, are the strongest predictors of attitudes toward providing financial help to other countries. The left panel of Figure 4 reproduces their results and displays the marginal effects of their three main variables. The right panel shows the estimation of the hdpGLM using an indicator variable for German states; it displays the effect of (very high) cosmopolitanism on support for bailout payments in each state. We see that for most of the states there is no latent heterogeneity. Moreover, the aggregate average effect estimated using a GLM is, in most cases, similar to the posterior average effect found by the hdpGLM in each state. One exception is Lower Saxony, where we see Simpson's paradox: there are two clusters with opposite effects of very high cosmopolitanism on support for bailout payments. Although further investigation would be needed to provide a substantive account of these results, they show how the hdpGLM can be used to estimate context-dependent heterogeneity.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_fig4g.gif?pub-status=live)
Figure 4. Left panel shows the effect of altruism, cosmopolitanism, and income on support for bailout payments estimated by GLM, reproducing Table 3 of Bechtel, Hainmueller, and Margalit (Reference Bechtel, Hainmueller and Margalit2014). Right panel shows latent heterogeneity in the marginal effect of very high cosmopolitanism as a function of German state.
For the second empirical application, we replicate Newman, Johnston, and Lown (Reference Newman, Johnston and Lown2015). Using national surveys conducted in the USA, they investigate whether residential proximity to inequality affects US citizens' beliefs in meritocracy, defined as the idea that the economic system rewards individuals based on their hard work and ability. The data set contains individual- and county-level covariates. They show that the association between an individual's income and the probability of rejecting meritocracy is conditional on the level of inequality in the county: low-income individuals become more likely to reject meritocracy as inequality increases. We reproduce their results for white residents, as presented in Table 1 of their paper, using a linear probability model. In their results, income and the percentage of Black residents in the county have no effect on their own, but the interaction between income and inequality is significant. We focus here on that result. We estimated the hdpGLM using the same individual- and county-level covariates they included in their model. The county-level covariates are inequality, county income, percentage of Black residents, percentage of votes for Bush in 2004, and county population. The hdpGLM estimation found no latent heterogeneity in 1,633 out of 1,688 counties. Two latent clusters were estimated in 54 counties, and three latent clusters in one of them. Table 4 compares the MLE estimates of the GLM with the posterior expectations of the hdpGLM, averaged across counties with no latent heterogeneity. The table shows that those values are similar.
Table 4. GLM vs hdpGLM estimates for counties with no latent heterogeneity.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_tab4.gif?pub-status=live)
The left panel of Figure 5 shows the posterior distribution of the income effect in 20 randomly sampled counties. The GLM and hdpGLM estimates agree in many cases, but in some counties there are latent heterogeneous groups and the estimates of the two models disagree. In the county with index 543, for instance, there are three latent groups: one in which income plays no role and two with opposite income effects. That case is an example of Simpson's paradox in the effect of income in that county.
As discussed, one of the advantages of the hdpGLM is that we can evaluate whether context-level (here, county-level) variables have any effect after we take into account the latent heterogeneity in the effect of observed individual-level covariates. Newman, Johnston, and Lown (Reference Newman, Johnston and Lown2015) found that inequality conditions the effect of income on the probability of rejecting meritocracy. However, when we take into account the latent heterogeneity of the income effect in each county, that conditional effect disappears, as Figure 5 shows. In the top-right panel of the figure, we see the posterior expectation of each cluster within each of the 1,688 counties. In the bottom right, we see the posterior distribution of $\tau_{11}$, the effect of inequality on the expectation of the income effect for each county and cluster. The results indicate that inequality does not change the effect of income once latent heterogeneity in the effect of covariates is considered.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_fig5g.gif?pub-status=live)
Figure 5. Posterior distribution of the income effect for 20 selected counties (left panel), posterior expectation of the income effect in each cluster as a function of inequality (top-right panel), and posterior distribution of the effect of inequality on the income effect (bottom-right panel).
As we can see, the hdpGLM can be used to investigate latent heterogeneity in the effect of observed covariates in generalized linear models. When there is no heterogeneity, the results of the GLM and the hdpGLM are similar. When there is latent heterogeneity, the GLM can produce estimates that are incorrect for all or some subpopulations; by using the GLM, one is simply assuming that Simpson's paradox does not occur in the analysis. The hdpGLM can be used instead to estimate the heterogeneous effects, cluster the data into groups, uncover Simpson's paradox, and evaluate whether the effect of context-level features remains relevant after latent heterogeneity is considered.
7 Final Discussion
Researchers in any academic discipline can never be sure a priori that there is no effect heterogeneity caused by latent or omitted variables in their investigation. In other words, we are never sure whether there are latent subpopulations in which the average effect found using the aggregated data is different or even reversed. When such subpopulations exist, using a GLM or GLMM can produce an incomplete picture and, in the worst-case scenario, a completely misleading conclusion. This is true in analyses using either observational or experimental data, so it is desirable to use a method that is robust to latent heterogeneity. Moreover, when data come from different contexts, for instance different states or countries, it is common to assume that the effect of observed covariates varies from context to context due to context-level features. Likewise, it is also desirable to consider that the latent heterogeneity of the covariate effects within each context (e.g., country) can vary from context to context (from country to country) due to context-level features.
We have provided a model to deal with those issues. The model is designed to estimate marginal effects in linear models and to assess whether there are latent subpopulations in which the marginal effects differ. If data come from different contexts, the model also estimates whether the existence of such subpopulations and their specific marginal effects are functions of context-level features. We have shown that the proposed model causes no harm when the GLM is correct (that is, when there are no latent heterogeneous effects) but correctly estimates the heterogeneous effects when they exist.
Similar to the GLM, however, the proposed model requires specifying which observed covariates are included in the model and how. This is not a trivial task, and it can affect the estimation as much as it does for GLMs. Further research is needed to develop methods to compare and select which observed covariates should be used and how. However, if for any reason one believes a specific set of covariates is adequate for a GLM, the proposed model can be used instead, with the advantage that it will be robust to heterogeneous effects across subpopulations.
Appendix A. Markov Chain Monte Carlo Algorithms
The proof of Proposition 1 is as follows.
Proof (Blocked Gibbs sampler for hdpGLM).
The full conditional of $\tau$ is given by
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU10.gif?pub-status=live)
For each $d=1,\ldots,D_{x}+1$ we have
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU11.gif?pub-status=live)
The full conditional for $\theta$ is
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU12.gif?pub-status=live)
Therefore, for all $j=1,\ldots,J$ and $k=1,\ldots,K$, we have
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqn20.gif?pub-status=live)
For the variable $Z$, the full conditional is given by
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU13.gif?pub-status=live)
Therefore, for all $i=1,\ldots,n$ we have
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU14.gif?pub-status=live)
or, equivalently,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqn21.gif?pub-status=live)
Finally, for $\pi$ the full conditional is
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU15.gif?pub-status=live)
Now, for simplicity, let $\pi\sim\text{Dir}(\alpha/K)$. The connection between this distribution and the stick-breaking process described in (4) can be found in Ishwaran and James (Reference Ishwaran and James2001). Then we have
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU16.gif?pub-status=live)
Therefore,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqn22.gif?pub-status=live)
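Since (A 3) appears above only as an image, it may help to record the standard conjugate Dirichlet-multinomial update it rests on, written here in simplified single-context notation that may differ from the paper's exact parametrization:

$$p(\pi\mid Z)\;\propto\;\prod_{k=1}^{K}\pi_{k}^{\alpha/K-1}\;\prod_{i=1}^{n}\pi_{Z_{i}}\;=\;\prod_{k=1}^{K}\pi_{k}^{\alpha/K+n_{k}-1},\qquad n_{k}=\#\{i:Z_{i}=k\},$$

so that $\pi\mid Z\sim\text{Dir}(\alpha/K+n_{1},\ldots,\alpha/K+n_{K})$.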
The proof of Proposition 2 is as follows.
Proof (Gibbs sampler for hdpGLM with Gaussian mixtures).
Considering the results in Proposition 1 and the model described in (16), we have the following. For $\tau$, for each $d=1,\ldots,D_{x}+1$,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU17.gif?pub-status=live)
But by the conjugacy of Gaussian distributions, we have $\prod_{j=1}^{J}p(\beta_{dkj}\mid W,\tau_{d})\,p(\tau_{d})\propto N_{D_{w}+1}(\mu_{A}^{(k)},\Sigma_{A})$
where
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU18.gif?pub-status=live)
Therefore
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU19.gif?pub-status=live)
If we denote $\overline{\Sigma}_{\tau_{d}}=\frac{1}{K}\Sigma_{A}$ and $\overline{\mu}_{\tau_{d}}=\frac{1}{K}\sum_{k=1}^{K}\mu_{A}^{(k)}$, then
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU20.gif?pub-status=live)
The full conditional for $\beta$ is
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU21.gif?pub-status=live)
Therefore, for all $j=1,\ldots,J$ and $k=1,\ldots,K$, we have
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqn23.gif?pub-status=live)
Denote $X_{kj}=\{X_{i}\mid C_{i}=j,Z_{i}=k\}$ and $y_{kj}=\{y_{i}\mid C_{i}=j,Z_{i}=k\}$. It is then clear from (A 4), (16), and the conjugacy of the normal distribution that for $k\in Z_{j}^{\ast}$
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU22.gif?pub-status=live)
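Because the equation above is rendered only as an image, we record the generic conjugate-normal posterior it instantiates. The notation here (prior $\beta_{kj}\sim N(\mu_{0},\sigma_{\beta}^{2}I)$ with $\mu_{0}$ the context-level prediction from $W$ and $\tau$, and known variance $\sigma_{k}^{2}$) is our own simplification and may differ from the paper's exact parametrization:

$$\beta_{kj}\mid\cdot\;\sim\;N(\mu^{\ast},\Sigma^{\ast}),\qquad \Sigma^{\ast}=\Big(\frac{1}{\sigma_{\beta}^{2}}I+\frac{1}{\sigma_{k}^{2}}X_{kj}^{\top}X_{kj}\Big)^{-1},\qquad \mu^{\ast}=\Sigma^{\ast}\Big(\frac{1}{\sigma_{\beta}^{2}}\mu_{0}+\frac{1}{\sigma_{k}^{2}}X_{kj}^{\top}y_{kj}\Big).$$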
The full conditional for $\sigma^{2}$ is
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU23.gif?pub-status=live)
Therefore, for all $k=1,\ldots,K$ we have
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqn24.gif?pub-status=live)
Given the full conditional of $\sigma^{2}$ in (A 5), the distributions in (16), and the fact that the scaled inverse-$\chi^{2}$ distribution is a conjugate prior for a Gaussian likelihood with known mean (which is the case for this full conditional), it is straightforward to see that, for $k\in Z^{\ast}$, $X_{k}=\{X_{i}\mid Z_{i}=k\}$, and $y_{k}=\{y_{i}\mid Z_{i}=k\}$,
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU24.gif?pub-status=live)
where
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191203100811243-0017:S1047198719000135:S1047198719000135_eqnU25.gif?pub-status=live)
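Likewise, since the updated parameters are given only in the image, we note that the standard scaled inverse-$\chi^{2}$ posterior takes the following form, again a reconstruction under our own notation:

$$\sigma_{k}^{2}\mid\cdot\;\sim\;\text{Scale-inv-}\chi^{2}\Big(\nu+n_{k},\ \frac{\nu s^{2}+\mathrm{SSR}_{k}}{\nu+n_{k}}\Big),\qquad \mathrm{SSR}_{k}=\sum_{i:Z_{i}=k}\big(y_{i}-X_{i}\beta_{kC_{i}}\big)^{2},$$

where $n_{k}$ is the number of observations assigned to cluster $k$.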
The full conditionals for $Z$ and $\pi$ are as in (A 2) and (A 3), respectively.◻
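To make the flow of the sampler concrete, the following is a minimal, self-contained sketch in Python of a blocked Gibbs pass for a single-context Gaussian mixture of regressions under the $\text{Dir}(\alpha/K)$ approximation. It is an illustration of the conditionals derived above, not the authors' implementation; all names and default hyperparameters are ours, and the context level (the $\tau$ and $W$ updates) is omitted for brevity.

```python
import numpy as np

def gibbs_mixture_regression(X, y, K=5, alpha=1.0, s2=10.0, nu=10.0,
                             sigma_beta2=10.0, iters=2000, seed=0):
    """Single-context blocked Gibbs sampler for a finite mixture of regressions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = rng.normal(0.0, 1.0, size=(K, d))   # cluster-wise coefficients
    sigma2 = np.full(K, 1.0)                   # cluster-wise error variances
    pi = np.full(K, 1.0 / K)                   # mixture weights
    draws = []
    for _ in range(iters):
        # Z update: p(Z_i = k) proportional to pi_k * N(y_i; X_i beta_k, sigma2_k)
        mean = X @ beta.T                                       # shape (n, K)
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2)
                - 0.5 * (y[:, None] - mean) ** 2 / sigma2)
        logp -= logp.max(axis=1, keepdims=True)                 # stabilize
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        Z = np.array([rng.choice(K, p=p[i]) for i in range(n)])
        # pi update: conjugate Dirichlet form, Dir(alpha/K + n_k)
        pi = rng.dirichlet(alpha / K + np.bincount(Z, minlength=K))
        # beta and sigma2 updates: conjugate normal and scaled inverse-chi2
        for k in range(K):
            Xk, yk = X[Z == k], y[Z == k]
            prec = np.eye(d) / sigma_beta2 + Xk.T @ Xk / sigma2[k]
            cov = np.linalg.inv(prec)
            mu = cov @ (Xk.T @ yk / sigma2[k])    # prior mean set to 0 here
            beta[k] = rng.multivariate_normal(mu, cov)
            resid = yk - Xk @ beta[k]
            nu_n = nu + len(yk)
            s2_n = (nu * s2 + resid @ resid) / nu_n
            sigma2[k] = nu_n * s2_n / rng.chisquare(nu_n)  # scaled inv-chi2 draw
        draws.append((beta.copy(), sigma2.copy(), pi.copy()))
    return draws
```

A full hdpGLM sampler would add the conjugate $\tau$ update from Proposition 2 and context-specific mixture weights inside the same loop.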
Supplementary material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2019.13.