1 Introduction
Ideal point estimation is critical to understanding many important political questions. In settings as diverse as voting in legislatures or the US Supreme Court, campaign donations, and survey responses, these models have revolutionized political science and are crucial to our understanding of complex phenomena where actors have latent preferences. Whilst many ways to analyze these data exist, a common approach, item response theory (IRT), specifies a generative model for the observed outcomes and estimates the underlying parameters of interest.Footnote 1 Most existing IRT frameworks focus on generative models for binomial outcomes, although recent work has provided extensions to ordinal data in a Bayesian framework (Martin, Quinn, and Park 2011; Imai, Lo, and Olmsted 2016). Whilst important, these extensions miss a key type of data in political science: multinomial, or unordered categorical, data. Typically, such questions are either excluded from Bayesian IRT models or treated as ordinal. It is possible to extend existing frameworks to include multinomial data modeled via the classic form of the multinomial logistic regression; however, this would likely require estimation techniques that scale poorly to large datasets or further approximations to the underlying likelihood function.
This paper addresses these problems and pushes this literature forward by creating a multinomial framework for ideal point estimation (mIRT). The framework has two elements. First, it relies on a different representation of multinomial data (the "stick-breaking representation"; Linderman, Johnson, and Adams 2015) that remains tractable whilst also containing binary and ordinal data as special cases. Thus, this framework not only permits the analysis of purely multinomial data but also allows scaling of data that includes any combination of binary, ordinal, and multinomial responses. Second, I show that this model can be estimated exactly, without approximation, using a Gibbs sampler or an Expectation–Maximization (EM) algorithm via a special form of data augmentation (Polson, Scott, and Windle 2013); the EM algorithm allows the researcher to recover the posterior mode of the parameters of interest exactly, up to error from stopping the algorithm before "perfect" convergence is achieved. This framework can thus be seen as an important extension of the path-breaking work of Imai, Lo, and Olmsted (2016) on fast ideal point estimation to a more complex set of generative models whilst also allowing exact inference. One contribution of this paper is therefore to bring fast and tractable estimation techniques to the existing work on multinomial ideal point models in political science (e.g., Groseclose and Milyo 2005; Lo 2013; Hill and Tausanovitch 2015) as well as a longer tradition in the psychometric literature (e.g., Bock 1972).Footnote 2
More broadly, this framework is extremely flexible and can serve as the basis for specifying more complicated models whilst maintaining the same simple inference procedure and not requiring a move to approximate methods. For example, adding covariates to the generative model (e.g., Bailey and Maltzman 2011), dynamic smoothing of ideal points (Martin and Quinn 2002), or modeling networks (e.g., Barberá 2015) can all be accommodated whilst maintaining a model that can be estimated via a Gibbs sampler or an exact EM algorithm; these extensions require only fairly simple modifications of the corresponding Gibbs updates or the $M$-step in the EM algorithm.Footnote 3
An additional improvement of this framework over existing EM implementations is that it allows the easy (and exact) fitting of multidimensional models for ordinal and multinomial data that are not available in existing frameworks, e.g., Imai, Lo, and Olmsted (2016). Identification concerns for these multidimensional models can be addressed by the well-known techniques in Rivers (2003) and are discussed in detail in Appendix B.
The paper proceeds as follows. First, it outlines the data generating process that underlies the mIRT; it then discusses particular features of the stick-breaking representation and argues that it provides a different but flexible way of modeling multinomial data. Second, it shows how Pólya-Gamma data augmentation leads to a simple and exact estimation procedure for this model. As an application, I suggest that nonresponse in survey data can be meaningfully analyzed as a separate multinomial category. Using the American National Election Study (ANES), I focus on a scale of "moral values." I show that rather than treating nonresponse as missing at random, it can be modeled using a multinomial framework. This allows us to explore how social desirability (e.g., deliberately not responding to a question) interacts with the underlying latent scale. The analysis is exploratory but suggests that the bias toward nonresponse is strongest when questions focus on particular social groups (e.g., women, Christians, and homosexuals) rather than on the moral fabric of society as a whole. The evidence suggests that while conservatives are more likely to exhibit this "shyness" when responding (e.g., not responding rather than giving morally conservative attitudes on women and homosexuals), moral liberals exhibit a similar shyness when asked to evaluate certain aspects of Christianity.
2 Stick-Breaking Ideal Point Models
Ideal point models in political science address the following question: given some observed set of outcomes, e.g., votes, how can researchers recover both the underlying ideal points and the parameters that determine how these ideal points are translated into outcomes? I assume, for simplicity, that there are no missing data (or that they are coded into some distinct "category" of response),Footnote 4 and that there are $I$ individuals indexed by $i$ who answer (vote) on $J$ questions indexed by $j$. Define $y_{ij}$ as the answer by person $i$ on question $j$. The core binary case assumes the following (where "yes" is 1 and "no" is 0), assuming a logistic link:Footnote 5

$$\Pr(y_{ij}=1) = \frac{\exp(\psi_{ij})}{1+\exp(\psi_{ij})}. \qquad (1)$$
Models may differ in how they specify $\psi_{ij}$, but the most common approach posits a linear formulation for $\psi_{ij}$ with the following parameters: $x_i$ is individual $i$'s ideal point, an $s \times 1$ vector; $\beta_j$ is a question-specific vector of discrimination parameters; and $\kappa_j$ is a scalar intercept. To generalize this to multinomial or ordinal outcomes, I rely on a "stick-breaking" representation (Linderman, Johnson, and Adams 2015), also known as a "continuation logit" representation (Mare 1980; see Agresti (2002) for a more general discussion).Footnote 6
This decomposes a choice between multiple outcomes into a series of pairwise choices; the intuition is that a choice with multiple options can be considered sequentially. Individual $i$ first decides whether to pick option "A" ("A" or "not A"). If they choose "not A," they then consider whether to pick "B" or "not B," conditional on not picking A. The name "stick-breaking" comes from the fact that one can think of the probability that individual $i$ assigns to the outcomes as constituting a "stick" of length one. The first choice "breaks off" part of the stick and assigns it to the probability of choice A. The second choice takes the remainder of the stick and breaks off another chunk that is assigned to choice B. This procedure is repeated for all but one category (the final choice is determined given all previous choices) to generate the probability distribution over $i$'s choices for bill or question $j$. It can be shown that the stick-breaking representation of a multinomial random variable is equivalent to the "standard" formulation.Footnote 7
To formally outline the generative model, assume there are $K_j$ choices for question $j$ and that the researcher imposes some ordering on them from $k=1,\ldots,K_j$. Define $O_k$ as the set of outcomes that occur before $k$ in the ordering. Thus, calling $\sigma_{ij}^{k}$ the probability of person $i$ on question $j$ choosing $k$ given that they have not "stopped" before $k$, it can be written as follows:

$$\sigma_{ij}^{k} = \Pr\bigl(y_{ij}=k \mid y_{ij}\notin O_{k}\bigr). \qquad (2)$$
As noted above, this formalizes a "sequential" process: on question $j$, $i$ first decides whether to choose $k=1$. If they decide against choosing $k=1$, they then decide whether to pick $k=2$ given that they have not picked $k=1$. Crucially for the estimation later, these binary choices are independent. Whilst this may seem counterintuitive, consider the following stylized example. Respondent $i$ on question $j$ flips $K_j-1$ independent coins with probabilities of heads equal to the corresponding $\sigma_{ij}^{k}$. They examine the coins and "reveal" their outcome as described above: $y_{ij}=1$ if the first coin is "heads," $y_{ij}=2$ if the first coin is "tails" and the second coin is "heads," and so on. An important implication of this stopping rule is that for some outcome $k$, all coin flips for outcomes $k+1$ or greater are irrelevant to whether $k$ is revealed.
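The coin-flip process can be sketched directly. The following minimal simulation (illustrative, not from the paper) draws one outcome from the conditional "break" probabilities and checks that the implied long-run frequencies match the corresponding multinomial probabilities:

```python
import random

def stick_breaking_draw(sigmas, rng):
    """Draw one multinomial outcome via the sequential coin-flip scheme.

    sigmas: conditional break probabilities sigma^1, ..., sigma^{K-1};
    category K is chosen if every coin comes up tails. Labels are 1..K.
    """
    for k, s in enumerate(sigmas, start=1):
        if rng.random() < s:  # coin k comes up heads -> stop at category k
            return k
    return len(sigmas) + 1    # all tails -> final category

# Stick-breaks implying outcome probabilities <0.6, 0.1, 0.3>:
# sigma^1 = 0.6, sigma^2 = 0.1 / (1 - 0.6) = 0.25.
rng = random.Random(42)
draws = [stick_breaking_draw([0.6, 0.25], rng) for _ in range(100_000)]
freqs = [draws.count(k) / len(draws) for k in (1, 2, 3)]
# freqs should be close to (0.6, 0.1, 0.3)
```

Note how the draw never inspects coins beyond the first "heads," mirroring the stopping rule described above.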
This independence between stick-breaking decisions is crucial to the tractability of the stick-breaking representation and encodes the analogue of the independence of irrelevant alternatives (IIA) assumption in this framework.Footnote 8 Consider a three-level question about party identification: "Do you think of yourself as a Democrat ($D$), Republican ($R$), or Independent ($I$)?" The traditional multinomial representation would assign probabilities to each of the three outcomes, say $\langle 0.6, 0.1, 0.3\rangle$. The IIA assumption in this context can be written as follows:

$$\frac{\Pr(D)}{\Pr(R)} = \frac{\Pr(D \mid \text{Answer}\in\{D,R\})}{\Pr(R \mid \text{Answer}\in\{D,R\})}.$$

It states that the ratio of the probability of choosing "Democrat" to "Republican" is constant even if the choice of "Independent" were removed. However, if the probability assigned to "Independent" might split unevenly across the other categories, this assumption could be thought of as relatively restrictive. Numerically, it can be shown that $\Pr(D \mid \text{Answer}\in\{D,R\}) = 0.6/0.7 \approx 0.86$.
Now consider the question in a stick-breaking framework where the order of the outcomes is $\{I,D,R\}$. Respondents are thought to reason in the following way. First: "Do I think of myself as an Independent?" (yes or no). This has a probability of $0.3$ of the respondent saying "yes." Then: "given that I can only pick from $D$ or $R$, which would I choose?" The key assumption in the stick-breaking representation is that the answer to the second question does not depend on whether one answered $I$ or not-$I$ to the first question. The implied probability is $\Pr(D \mid \text{Answer}\in\{D,R\}) = 0.6/0.7 \approx 0.86$, exactly as in the traditional formulation of the multinomial choice question!Footnote 9
Thus, whilst the way one formulates the IIA assumption may appear different in the stick-breaking representation, it encodes a very similar assumption to the one placed in the classic formulation of multinomial choices. The equivalence between IIA in both frameworks comes from the fact that any multinomial distribution can be factorized by rearranging the density into a series of binary stick-breaking choices for any arbitrary ordering of the choice categories.
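This factorization is mechanical, and it can help to see it as code. A small sketch (illustrative, not from the paper) converting between outcome probabilities and stick-breaks for the party-identification example, under the ordering $\{I, D, R\}$:

```python
def to_stick_breaks(probs):
    """Factorize a multinomial pmf into conditional stick-break probabilities.

    probs: outcome probabilities p^1, ..., p^K in the chosen ordering.
    Returns sigma^1, ..., sigma^{K-1}, where sigma^k is p^k divided by the
    probability mass remaining after categories before k are ruled out.
    """
    sigmas, remaining = [], 1.0
    for p in probs[:-1]:
        sigmas.append(p / remaining)  # share of the remaining stick broken off
        remaining -= p
    return sigmas

def from_stick_breaks(sigmas):
    """Invert the factorization: recover outcome probabilities from the breaks."""
    probs, remaining = [], 1.0
    for s in sigmas:
        probs.append(s * remaining)
        remaining *= 1.0 - s
    probs.append(remaining)  # final category takes whatever stick is left
    return probs

# Ordering {I, D, R} with probabilities <0.3, 0.6, 0.1>:
sigmas = to_stick_breaks([0.3, 0.6, 0.1])
# sigma^1 = 0.3 (break off "Independent"), sigma^2 = 0.6 / 0.7
```

Because the two functions are exact inverses for any ordering, no probability mass is lost or created by the rearrangement, which is the sense in which the factorization claim holds.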
Thus, any difference between the two frameworks comes in how one models the probability of each outcome (in the classic multinomial case) or the stick-breaks. Both frameworks traditionally rely on a linear link and thus encode different functional form assumptions, although neither is inherently better or worse; they are merely different models. A clear analogue comes from modeling ordinal data in a regression context, where there are at least three ways of parameterizing ordinal choices. Whilst the most classic formulation is a cumulative logistic regression, other options exist: researchers could choose a stick-breaking representation similar to the one above (continuation logit) or an adjacent-category regression that models whether some observation $y_{ij}$ is equal to category $k$ or category $k+1$ (Agresti 2002). The use of a linear systematic component leads to coefficients with different interpretations, although the hope is that this functional form is sufficiently flexible to yield similar predicted probabilities across different covariate profiles. Appendix A justifies the use of a stick-breaking specification instead of a classic multinomial (or "softmax") specification in extensive detail, making the case that the two-parameter IRT specification is sufficiently flexible that the order of the categories is unimportant for the key quantities of interest (the ideal points and the predicted probabilities).Footnote 10 These results are not definitive, and researchers should therefore try multiple orderings to ensure that the correlations are high; in every scenario attempted in this paper, however, the results are highly invariant to permutations of the ordering (correlations above 0.99), even using permutations that are deliberately "bad."
I adopt the stick-breaking parameterization following an intuition from Linderman, Johnson, and Adams (2015); they note that some complex Bayesian models, e.g., correlated topic models, can be made easily tractable by using this representation of multinomial data, as it reduces to a series of binary choices rather than requiring the complicated softmax formulation associated with the traditional multinomial logistic parameterization. I use their intuition and derive results for a different class of model: two-parameter IRT models. This is the workhorse ideal point model in political science; it states that the ideal point $x_i$ is linearly combined with a question- and level-specific "discrimination" parameter $\beta_j^{k}$ as well as an intercept $\kappa_j^{k}$ to generate the predicted probabilities. The stick-breaking formulation is shown in Equation (3):

$$\sigma_{ij}^{k} = \frac{\exp\bigl(\psi_{ij}^{k}\bigr)}{1+\exp\bigl(\psi_{ij}^{k}\bigr)}, \qquad \psi_{ij}^{k} = \kappa_j^{k} + x_i^{\top}\beta_j^{k}. \qquad (3)$$
From this parameterization, the outcome probabilities of some choice $k$ for respondent $i$ on question $j$ ($p_{ij}^{k}$) can be backed out from the stick-breaks, leading to the following identities:

$$p_{ij}^{k} = \sigma_{ij}^{k}\prod_{m<k}\bigl(1-\sigma_{ij}^{m}\bigr), \quad k=1,\ldots,K_j-1, \qquad (4)$$

$$p_{ij}^{K_j} = \prod_{m<K_j}\bigl(1-\sigma_{ij}^{m}\bigr). \qquad (5)$$

Note that for $k=1$, the stick-breaking probability is the "raw" probability of choosing category one, i.e., $\sigma_{ij}^{1} = p_{ij}^{1}$. When choosing an ordering, a key point to keep in mind is that the predicted probability of the first category given this formulation of $\psi_{ij}^{n}$ will be monotonically increasing or decreasing as the ideal point $x_i$ changes. Phrased differently, the baseline category should be chosen such that our subject-specific knowledge suggests that the probability of choosing the baseline outcome increases smoothly from zero to one (or decreases from one to zero) as the ideal point $x_i$ moves across the real line.Footnote 11 This requires the researcher's substantive knowledge about what they think the underlying latent dimension will roughly map onto. In most survey settings, a plausible choice is to put an extreme outcome as the baseline category, as that is the outcome most plausibly related monotonically to the ideal point. If there is no substantive guide, however, Appendix A shows that the estimated ideal points are likely robust to incorrectly specifying the first category and discusses how to use various model selection techniques to choose between orderings.
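To make the mapping from ideal point to outcome probabilities concrete, here is a minimal sketch of the logistic stick-breaking link for one question in one dimension. The linear form `psi = kappa + beta * x` is an assumption of this sketch (sign conventions may differ from the paper's exact parameterization):

```python
import math

def sb_probs(x, betas, kappas):
    """Outcome probabilities for one question under a logistic
    stick-breaking IRT link.

    x: scalar ideal point (one dimension for simplicity);
    betas, kappas: K-1 discrimination/intercept parameters, one per break.
    Assumes psi^k = kappa^k + beta^k * x (an illustrative convention).
    """
    probs, remaining = [], 1.0
    for b, k in zip(betas, kappas):
        sigma = 1.0 / (1.0 + math.exp(-(k + b * x)))  # logistic stick-break
        probs.append(sigma * remaining)               # mass broken off here
        remaining *= 1.0 - sigma                      # stick left over
    probs.append(remaining)  # final category: leftover stick
    return probs

probs = sb_probs(x=1.0, betas=[2.0, -1.0], kappas=[0.0, 0.5])
# probs is a valid pmf over 3 categories
```

With a positive first discrimination parameter, the probability of the first category rises monotonically in $x$, which is the behavior the baseline-category advice above relies on.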
If the data are truly ordinal, this order should be used for each question.Footnote 12 This framework thus allows for different numbers of categories across questions, e.g., in a survey with 5-point and 7-point scales. This is an improvement over existing implementations, e.g., Imai, Lo, and Olmsted (2016), that require variational approximations and collapsing scales down to three categories to analyze ordinal data.
From the above notation, and recalling that there are $I$ individuals answering $J$ questions, the full likelihood function can be written compactly using the definition of $\Pr(y_{ij}=k)$ in terms of the stick-breaks shown in the previous equations. To introduce some additional notation that makes the subsequent results tidier, define $y_{ij}^{\prime}$ as the minimum of the observed $y_{ij}$ and the highest modeled category $K_j-1$; this denotes that if $y_{ij}=K_j$, it is not modeled directly, as it is defined implicitly by the constraint that the probabilities of all choices sum to one.

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{n=1}^{y_{ij}^{\prime}} \bigl(\sigma_{ij}^{n}\bigr)^{\mathbb{1}[y_{ij}=n]}\bigl(1-\sigma_{ij}^{n}\bigr)^{\mathbb{1}[y_{ij}>n]} = \prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{n=1}^{y_{ij}^{\prime}} \frac{\exp\bigl(\psi_{ij}^{n}\,\mathbb{1}[y_{ij}=n]\bigr)}{1+\exp\bigl(\psi_{ij}^{n}\bigr)}. \qquad (6)$$
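The truncation at $y_{ij}^{\prime}$ is easy to verify numerically: only the stick-breaks a respondent actually "reaches" enter the likelihood, and the per-category probabilities still sum to one. A minimal sketch (illustrative; the parameterization of $\psi$ is left abstract here):

```python
import math

def obs_loglik(y, psis):
    """Log-likelihood of one observation in the stick-breaking form.

    y: observed category in 1..K; psis: linear predictors psi^1..psi^{K-1},
    one per stick-break. Only breaks up to y' = min(y, K-1) contribute;
    later breaks are never reached, mirroring the y' truncation above.
    """
    K = len(psis) + 1
    ll = 0.0
    for n in range(1, min(y, K - 1) + 1):
        psi = psis[n - 1]
        # log sigma^n if the process stops at n, log(1 - sigma^n) otherwise
        ll += (psi if y == n else 0.0) - math.log1p(math.exp(psi))
    return ll
```

Exponentiating `obs_loglik` for each $y \in \{1,\ldots,K\}$ recovers a proper probability distribution, confirming that omitting the final category loses nothing.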
3 Estimation
Estimation in this framework can be done using Markov Chain Monte Carlo (MCMC) methods via a Gibbs sampler or an Expectation–Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). I focus on the latter as it is much faster (Imai, Lo, and Olmsted 2016), although the requisite MCMC updates are stated implicitly in the $M$-steps.Footnote 13 The crux of either estimation method relies on transforming the logistic link to become tractable by relying on a recent innovation in Bayesian statistics.Footnote 14 The key identity comes from Polson, Scott, and Windle (2013), who in turn drew on a detailed analysis by Biane, Pitman, and Yor (2001). They define a Pólya-Gamma random variable $\omega \sim PG(b,c)$ ($b>0$; $c>0$) as an infinite sum of independent gamma random variables, scaled in a particular fashion (Polson, Scott, and Windle 2013, p. 1341):

$$\omega \overset{D}{=} \frac{1}{2\pi^{2}}\sum_{k=1}^{\infty}\frac{g_{k}}{(k-1/2)^{2} + c^{2}/(4\pi^{2})}, \qquad g_{k} \overset{iid}{\sim} \text{Ga}(b,1). \qquad (7)$$
The $b$ parameter governs the type of gamma variable being summed together, and the $c$ parameter acts as an "exponential tilt."Footnote 15
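One useful consequence of this construction is that the mean of a $PG(1,c)$ variable has the closed form $\tanh(c/2)/(2c)$, which is precisely the quantity needed in the $E$-step later. A quick numerical check (illustrative, not from the paper), truncating the infinite sum and using the fact that a $\text{Ga}(1,1)$ variable has mean one:

```python
import math

def pg_mean_from_sum(c, terms=200_000):
    """Mean of omega ~ PG(1, c) via the infinite-sum representation.

    Since E[Ga(1,1)] = 1, taking expectations termwise gives
    E[omega] = (1 / (2 pi^2)) * sum_k 1 / ((k - 1/2)^2 + c^2 / (4 pi^2)).
    """
    a = c ** 2 / (4.0 * math.pi ** 2)
    s = sum(1.0 / ((k - 0.5) ** 2 + a) for k in range(1, terms + 1))
    return s / (2.0 * math.pi ** 2)

def pg_mean_closed(c):
    """Closed form: E[omega | c] = tanh(c/2) / (2c), with limit 1/4 at c = 0."""
    return 0.25 if c == 0.0 else math.tanh(c / 2.0) / (2.0 * c)

# the truncated sum agrees with the closed form to several decimal places
```

The agreement of the two functions is a direct check that the carefully chosen $(k-1/2)^2$ denominators in Equation (7) are what produce the simple $\tanh$ expression exploited below.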
This carefully constructed variable leads to a powerful identity; returning to the notation above, Polson, Scott, and Windle (2013) demonstrate that for any $\psi_{ij}\in\mathbb{R}$, and where $\omega_{ij}\sim PG(1,0)$:

$$\frac{\exp(\psi_{ij})^{a}}{1+\exp(\psi_{ij})} = \frac{1}{2}\exp\Bigl(\bigl(a-\tfrac{1}{2}\bigr)\psi_{ij}\Bigr)\int_{0}^{\infty}\exp\bigl(-\omega_{ij}\psi_{ij}^{2}/2\bigr)\,p(\omega_{ij})\,d\omega_{ij}. \qquad (8)$$

The power of this augmentation is that if one augments each stick-breaking choice with $PG(1,0)$ random variables, the complete data log-likelihood becomes quadratic in the $\psi_{ij}^{n}$.Footnote 16
More broadly, this data augmentation makes models with logistic links as tractable as traditional probit models. Considering some observed choice $y_{ij}$, the complete data likelihood for this observation, after augmenting with the Pólya-Gamma random variables, is proportional to:

$$p\bigl(y_{ij}, \boldsymbol{\omega}_{ij} \mid \boldsymbol{\theta}\bigr) \propto \prod_{n=1}^{y_{ij}^{\prime}} \exp\Bigl(\bigl(\mathbb{1}[y_{ij}=n]-\tfrac{1}{2}\bigr)\psi_{ij}^{n} - \omega_{ij}^{n}\bigl(\psi_{ij}^{n}\bigr)^{2}/2\Bigr)\, p\bigl(\omega_{ij}^{n}\bigr). \qquad (9)$$
This data augmentation allows us to use an exact EM algorithm to find either the maximum-likelihood estimate of the parameters of interest $\boldsymbol{\theta}=(\boldsymbol{\kappa}_j^{n}, \boldsymbol{\beta}_j^{n}, \boldsymbol{x}_i)$ from this data generating process or, more commonly in ideal point estimation, estimates of the posterior mode (maximum a posteriori estimates) of $\boldsymbol{\theta}$ when priors are included. I follow the latter tradition and add independent normal priors on $\kappa_j^{n}$, $\beta_j^{n}$, and $x_i$, with mean zero and variances $\Sigma_{\kappa}$, $\Sigma_{\beta}$, and $\Sigma_{x}$, respectively. I will sometimes denote the prior distribution by $p(\boldsymbol{x}_i, \boldsymbol{\beta}_j^{n}, \boldsymbol{\kappa}_j^{n})$ for simplicity.Footnote 17
The EM algorithm provides a way of finding $\boldsymbol{\theta}$ given that the stick-breaking generative process can be augmented with the relevant Pólya-Gamma variables as noted above. One can write the maximization problem as follows, where $\boldsymbol{\omega}$ denotes the collection of (independent) augmented Pólya-Gamma variables being integrated over:

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\; p(\boldsymbol{\theta}) \int p(\boldsymbol{y}, \boldsymbol{\omega} \mid \boldsymbol{\theta})\, d\boldsymbol{\omega}. \qquad (10)$$
Defining $\boldsymbol{\theta}^{(t)}$ as the vector of parameters obtained at iteration $t$ of the EM algorithm, the $Q$ function is defined as the expectation of the log of the integrand with respect to $p(\boldsymbol{\omega} \mid \boldsymbol{y}, \boldsymbol{\theta}^{(t-1)})$:

$$Q\bigl(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t-1)}\bigr) = E_{\boldsymbol{\omega} \mid \boldsymbol{y}, \boldsymbol{\theta}^{(t-1)}}\Bigl[\ln p(\boldsymbol{y}, \boldsymbol{\omega} \mid \boldsymbol{\theta}) + \ln p(\boldsymbol{\theta})\Bigr]. \qquad (11)$$
As Dempster, Laird, and Rubin (1977) show, iteratively updating the $Q$ function using an $E$ (Expectation) and $M$ (Maximization) step obtains an estimate of $\boldsymbol{\theta}$. The $E$-step takes the conditional expectation of each $\omega_{ij}^{n}$ given the previous values of the parameters $\boldsymbol{\theta}^{(t-1)}$; denote this by $(\omega_{ij}^{n})^{\ast}$. Using further results in Polson, Scott, and Windle (2013), it can be shown that each $\omega_{ij}^{n}$, conditional on the current updates of the parameters, is distributed $PG(1, \psi_{ij}^{n})$. Thus, its expectation is defined below:

$$\bigl(\omega_{ij}^{n}\bigr)^{\ast} = E\bigl[\omega_{ij}^{n} \mid \psi_{ij}^{n}\bigr] = \frac{1}{2\psi_{ij}^{n}}\tanh\bigl(\psi_{ij}^{n}/2\bigr).$$
The $M$-step finds the next update for $\boldsymbol{\theta}$, i.e., $\boldsymbol{\theta}^{(t)}$, by maximizing the $Q$ function with respect to $\boldsymbol{\theta}$ given $\boldsymbol{\theta}^{(t-1)}$ via the results from the associated $E$-step: $\boldsymbol{\theta}^{(t)} = \max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t-1)})$. This is most easily done using a conditional EM algorithm (Meng and Rubin 1993), i.e., maximizing $Q$ with respect to one block of parameters whilst holding the others constant. Once the first block of parameters is updated, those new values are plugged into the $Q$ function and then the next block of parameters is updated. Thus, when applying the $M$-steps below, the relevant components of $\boldsymbol{\theta}^{(t-1)}$ would be replaced with the $\boldsymbol{\theta}^{(t)}$ updates found in the previous conditional $M$-step. The $M$-steps can be derived simply given the quadratic nature of the $Q$ function. For completeness, I write out the multinomial $M$-steps for a multidimensional model, assuming the order of iteration is the $x_i$, then the $\beta_j^{n}$, and then the $\kappa_j^{n}$.
$$x_i^{(t)} = \Bigl(\Sigma_x^{-1} + \sum_{j=1}^{J}\sum_{n=1}^{y_{ij}^{\prime}} \bigl(\omega_{ij}^{n}\bigr)^{\ast}\, \beta_j^{n}\beta_j^{n\top}\Bigr)^{-1} \sum_{j=1}^{J}\sum_{n=1}^{y_{ij}^{\prime}} \beta_j^{n}\Bigl(\mathbb{1}[y_{ij}=n] - \tfrac{1}{2} - \bigl(\omega_{ij}^{n}\bigr)^{\ast}\kappa_j^{n}\Bigr) \qquad (12)$$

$$\beta_j^{n,(t)} = \Bigl(\Sigma_{\beta}^{-1} + \sum_{i:\, y_{ij}^{\prime}\geq n} \bigl(\omega_{ij}^{n}\bigr)^{\ast}\, x_i x_i^{\top}\Bigr)^{-1} \sum_{i:\, y_{ij}^{\prime}\geq n} x_i\Bigl(\mathbb{1}[y_{ij}=n] - \tfrac{1}{2} - \bigl(\omega_{ij}^{n}\bigr)^{\ast}\kappa_j^{n}\Bigr) \qquad (13)$$

$$\kappa_j^{n,(t)} = \Bigl(\Sigma_{\kappa}^{-1} + \sum_{i:\, y_{ij}^{\prime}\geq n} \bigl(\omega_{ij}^{n}\bigr)^{\ast}\Bigr)^{-1} \sum_{i:\, y_{ij}^{\prime}\geq n} \Bigl(\mathbb{1}[y_{ij}=n] - \tfrac{1}{2} - \bigl(\omega_{ij}^{n}\bigr)^{\ast}\, x_i^{\top}\beta_j^{n}\Bigr) \qquad (14)$$
Thus, by cycling through the $E$-step and the conditional $M$-steps, one can rapidly and exactly estimate multinomial models with a logistic link using Pólya-Gamma data augmentation. This demonstrates that for a wide class of models specified with some $\psi_{ij}^{n}$, the mIRT admits a simple $E$-step; as long as the $M$-step is tractable, these models can be estimated using a fast EM algorithm instead of time-consuming MCMC methods.
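The full cycle is compact enough to sketch end to end. The following is a simplified, illustrative implementation for the binary special case ($K_j = 2$) in one dimension, not the paper's software: it assumes $\psi_{ij} = \kappa_j + \beta_j x_i$, uses the closed-form $E[\omega \mid \psi] = \tanh(\psi/2)/(2\psi)$ in the $E$-step, and performs conditional $M$-steps for $x$, $\beta$, and $\kappa$ in turn with normal priors:

```python
import math, random

def em_irt_binary(Y, n_iter=100, var_x=1.0, var_b=25.0, var_k=25.0):
    """Polya-Gamma EM for a one-dimensional binary logistic IRT model.

    Y: I x J list of 0/1 responses. A sketch of the binary special case of
    the algorithm described above (psi = kappa + beta * x is an assumed
    convention). Returns MAP-style estimates (x, beta, kappa).
    """
    I, J = len(Y), len(Y[0])
    x, b, k = [0.0] * I, [1.0] * J, [0.0] * J

    def Ew(psi):
        # E-step weight: E[omega | psi] = tanh(psi/2) / (2 psi); limit 1/4 at 0
        return 0.25 if abs(psi) < 1e-8 else math.tanh(psi / 2.0) / (2.0 * psi)

    for _ in range(n_iter):
        # E-step at the previous parameter values
        w = [[Ew(k[j] + b[j] * x[i]) for j in range(J)] for i in range(I)]
        # conditional M-step for each ideal point x_i
        for i in range(I):
            num = sum(b[j] * ((Y[i][j] - 0.5) - w[i][j] * k[j]) for j in range(J))
            den = sum(w[i][j] * b[j] ** 2 for j in range(J)) + 1.0 / var_x
            x[i] = num / den
        # conditional M-steps for discrimination b_j and intercept k_j,
        # plugging in the freshly updated x (conditional EM ordering)
        for j in range(J):
            num = sum(x[i] * ((Y[i][j] - 0.5) - w[i][j] * k[j]) for i in range(I))
            den = sum(w[i][j] * x[i] ** 2 for i in range(I)) + 1.0 / var_b
            b[j] = num / den
            num = sum((Y[i][j] - 0.5) - w[i][j] * b[j] * x[i] for i in range(I))
            den = sum(w[i][j] for i in range(I)) + 1.0 / var_k
            k[j] = num / den
    return x, b, k
```

Each conditional update is a closed-form weighted ridge regression, which is exactly the tractability that the quadratic $Q$ function buys; note also the familiar sign and rotational invariance of ideal points, so recovered $x$ is only identified up to reflection.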
When choosing priors for this model, I follow convention and place independent standard normal priors on each $x_i$ and independent normal $N(0,25)$ priors on each $\beta_j^{n}$. For the priors on the cut points, my default choice is based on the observed stick-breaking frequencies: for each $\kappa_j^{n}$, I calculate the implied empirical stick-breaking probability for the category and use that as the mean of a diffuse normal prior, i.e., with variance 25.Footnote 18
4 Extensions of the General Model
Beyond the model shown above, the mIRT is extremely flexible in that one can include many different formulations for $\psi_{ij}^{n}$ (the latent utility of a choice) whilst maintaining the ability to estimate the model exactly using the fast EM algorithm shown above. Indeed, as long as $\psi_{ij}^{n}$ remains some function of parameters and observed data, the mIRT's $E$-step will remain unchanged, although its $M$-step must differ to address the particular functional form imposed. A different type of extension involves changing the priors on the parameters to capture other generative processes; the most common such extension, dynamic smoothing (Martin and Quinn 2002), can be accommodated with ease. This section briefly sketches how three extensions (covariates, network effects, and dynamic smoothing) can be implemented in this model.
4.1 Covariates
Some authors, e.g., Bailey and Maltzman (2011), suggest that adding observed covariates to ideal point models may improve inferences as well as allow us to understand other salient features of the data generating process. Define some vector of covariates $z_{ij}$ observed for individual $i$ on question $j$. A simple way to add this to the generative process is to define $\psi_{ij}^{n}$ as follows:

$$\psi_{ij}^{n} = \kappa_j^{n} + x_i^{\top}\beta_j^{n} + z_{ij}^{\top}\tau_j^{n}.$$

Since $z_{ij}$ is observed, this adds a set of $\tau_j^{n}$ coefficients to be estimated in the $M$-step. These updates have a form similar to that for $\beta_j^{n}$ and do not change the underlying procedure for the $E$- and $M$-steps in any materially difficult fashion.
4.2 Network models
A new frontier in ideal point modeling recognizes that network effects may govern behavior (e.g., Barberá 2015); the most classic example treats a binary $y_{ij}$ as either a "link" or "no link." A multinomial interpretation might distinguish "strong friend," "friend," and "not friend." This relationship can be modeled simply in the mIRT by again changing the definition of $\psi_{ij}^{n}$. Following Imai, Lo, and Olmsted (2016)'s presentation in one dimension for simplicity, one could define a $\psi_{ij}^{n}$ to capture this effect as

$$\psi_{ij}^{n} = \kappa_j^{n} - \beta_j^{n}\bigl(x_i - x_j\bigr)^{2}.$$

This model can again be estimated with a nearly identical $E$-step. The $M$-step is more complicated but can be done exactly by solving the implied cubic equation in the first-order condition for the $x_i$ ideal points.Footnote 19
4.3 Dynamic smoothing
A common extension of ideal point models links periods with the same respondents by using a "dynamic ideal point" framework (Martin and Quinn 2002). This model induces persistence in a person's ideal points over time by specifying a prior that depends on the ideal point of the respondent in the previous period. Specifically, the prior for the ideal point $x_i^{(g)}$, where $i$ now indexes time and $g$ indexes individuals (e.g., John Kerry $g$ in the 107th Congress $i$), is:

$$x_i^{(g)} \sim N\bigl(x_{i-1}^{(g)}, \Delta\bigr).$$

The intuitive interpretation of this specification is that our prior for an MP's ideal point at time $i$ is their ideal point at time $i-1$ plus noise. $\Delta$, fixed by the researcher in Martin and Quinn (2002)'s approach, defines the variance of this "noise": as $\Delta$ tends to infinity, the model becomes equivalent to estimating separate ideal points in each period, whilst $\Delta$ tending to zero implies a single ideal point for each MP across all periods. This allows for a "smoothing" of ideal points across time whilst sometimes allowing discontinuous change. This extension is computationally simple to include in the unified IRT framework for any of the above data types, as it simply involves changing the prior and thus does not affect the $E$-step. The $M$-step is derived in Appendix C.
5 Validation of the Model
To show that the mIRT model generates plausible results given its stick-breaking representation, this section performs two series of tests. First, I run simulations to show that the mIRT successfully recovers the underlying ideal points, although I note some important caveats about categories with few observations. Next, I show that on four canonical datasets, the mIRT recovers ideal points that are highly correlated with those from the major alternative EM estimation framework—emIRT (Imai, Lo, and Olmsted 2016)—as well as with results from MCMC estimation procedures.Footnote 20
5.1 Simulated data
As no existing framework for ideal point estimation has implemented multinomial outcomes using a stick-breaking representation in conjunction with an EM algorithm, I examine how my method fares using simulated data. I generate simulated data with 2000 individuals and 100 questions using the data generating process described above. Each question $j$ has some number of outcomes $K_{j}$ drawn randomly from the set $\{2,\ldots,M\}$ where $M\in\{3,5,10,15,20\}$.Footnote 21
The mIRT method converges quickly and Figure 1 compares the estimates with the truth. It shows that the ideal points are strongly correlated with the truth across all $M$.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20190924112624508-0294:S1047198718000311:S1047198718000311_fig1g.gif?pub-status=live)
Figure 1. Simulated multinomial data: ideal points. Note: Each panel indicates the $M$, i.e., that each question $j$ is sampled from $K_{j}\in\{2,\ldots,M\}$. The correlation between the estimates and the truth is shown on each plot.
However, when analyzing multinomial data, there is an important further caveat that researchers should examine: if certain categories contain few observations, the question parameters may be imprecisely estimated and/or the data will not dominate the prior. To examine this, I plot the correlation of $\beta_{j}^{n}$ with the true values. Figure 2 again shows the correlations are high, though they decrease as $M$ increases.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20190924112624508-0294:S1047198718000311:S1047198718000311_fig2g.gif?pub-status=live)
Figure 2. Simulated multinomial data: discrimination parameters. Note: Each panel indicates the $M$, i.e., that each question $j$ is sampled from $K_{j}\in\{2,\ldots,M\}$. The correlation between the estimates and the truth is shown on each plot.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20190924112624508-0294:S1047198718000311:S1047198718000311_fig3g.gif?pub-status=live)
Figure 3. Empirical tests of the mIRT. Note: emIRT represents the implementation of the model in Imai, Lo, and Olmsted (Reference Imai, Lo and Olmsted2016). The baseline results are NOMINATE in (a) and MCMC procedures for the Supreme Court (Martin and Quinn Reference Martin and Quinn2002), the United Nations (Bailey, Strezhnev, and Voeten Reference Bailey, Strezhnev and Voeten2017), and the Ashai Todai survey (Imai, Lo, and Olmsted (Reference Imai, Lo and Olmsted2016)’s hand coded MCMC estimation).
I examine this further in Table 1; I split the $\beta_{j}^{n}$ into three groups based on how many votes are recorded in the corresponding category, i.e., for how many $i$ does $y_{ij}=n$. I look at the correlation of the $\beta_{j}^{n}$ with the true values in the lower quartile, the middle two quartiles, and the upper quartile of observations. This will tell us whether, in categories with fairly few observations, there is cause for concern about whether the $\beta_{j}^{n}$ are accurately recovered.
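The quartile split behind Table 1 is simple to reproduce. A minimal sketch with synthetic stand-in values (the function and the simulated inputs are hypothetical, not the paper’s replication code):

```python
import numpy as np

def correlation_by_count_quartile(beta_hat, beta_true, counts):
    """Correlate estimated and true discrimination parameters within
    groups defined by category response counts: the lower quartile,
    the middle two quartiles, and the upper quartile."""
    q25, q75 = np.quantile(counts, [0.25, 0.75])
    groups = {
        "lower": counts <= q25,
        "middle": (counts > q25) & (counts < q75),
        "upper": counts >= q75,
    }
    return {name: float(np.corrcoef(beta_hat[mask], beta_true[mask])[0, 1])
            for name, mask in groups.items() if mask.sum() > 1}

# Synthetic check: noisy estimates of 400 true parameters.
rng = np.random.default_rng(2)
beta_true = rng.normal(size=400)
beta_hat = beta_true + rng.normal(scale=0.3, size=400)
counts = rng.integers(5, 500, size=400)
res = correlation_by_count_quartile(beta_hat, beta_true, counts)
```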
Table 1. Correlation of multinomial question parameters.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20190924112624508-0294:S1047198718000311:S1047198718000311_tab1.gif?pub-status=live)
There is a decline in the correlation in the lowest quartile, which makes sense given that there are fewer than 40 observations (out of 2000 for each question) in the lower quartile of response categories when $M=20$. Yet, despite the weaker correlations for the $\beta_{j}^{n}$ in that category, it is reassuring that this does not contaminate the estimates of the ideal points when pooling across all questions. The key point of this analysis is that researchers should be cautious about including categories with very few responses and, if possible, attempt to collapse those categories to estimate the $\beta_{j}^{n}$ more precisely. Further, if one wishes to make claims based on the question parameters (e.g., generate predicted probabilities of choosing categories across options), using the parametric bootstrap or draws from the posterior is advisable insofar as this will capture the uncertainty of those less-used categories.
5.2 Empirical data
Stepping back into the simple case of binary data, I briefly show that the mIRT recovers very similar results to existing models on canonical datasets. This is to be expected for binary data insofar as the only difference is the use of a logistic versus probit link function, although this confirms that the EM algorithm using the Pólya-Gamma augmentation “works” to return similar results. Figure 3 shows the results from four canonical datasets. I show results from the mIRT, the emIRT that uses variational approximations and an EM algorithm (Imai, Lo, and Olmsted Reference Imai, Lo and Olmsted2016), as well as the non-EM based canonical method (NOMINATE in the case of Congress and MCMC methods for the other examples). The correlation coefficient is printed in the upper left corner of each plot. First, using a binary model on the 82nd Congress, I compare the mIRT against the emIRT implementation and NOMINATE (Poole and Rosenthal Reference Poole and Rosenthal1997). Next, I run a dynamic binary model on the US Supreme Court from 1946 and again report the mIRT’s results alongside the emIRT and MCMC results from Martin and Quinn (Reference Martin and Quinn2002).Footnote 22 Third, I examine a dynamic ordinal model (multinomial in the mIRT) for analyzing votes in the United Nations compared against the MCMC estimation in Bailey, Strezhnev, and Voeten (Reference Bailey, Strezhnev and Voeten2017). Finally, I run a multinomial model on the (large) Ashai Todai voter survey used in Imai, Lo, and Olmsted (Reference Imai, Lo and Olmsted2016) and show results against both emIRT results and an MCMC estimation reported in the same paper.Footnote 23
The mIRT returns results highly correlated with those from other estimation methods in all cases. The slight differences that appear are perhaps due to a conjunction of various factors: (i) the difference in tail behavior between logistic and probit links; (ii) the use of a multinomial framework for the UN and Ashai Todai data; (iii) the variational approximations used in Imai, Lo, and Olmsted (Reference Imai, Lo and Olmsted2016); (iv) the stick-breaking functional form of the mIRT; (v) the fact that MCMC methods typically report the posterior mean whereas the EM approaches target the posterior mode.
6 Multinomial Data in Survey Responses: Dealing with Nonresponse
Turning from the legislative domain to that of survey responses, most social science surveys ask questions with binary, ordinal, or multinomial choices. Existing scaling methods can easily accommodate binary data; for ordinal data, the most common practice is to treat it as continuous (either implicitly or explicitly), perhaps after applying some transformation. However, existing methods almost never include multinomial outcomes when constructing the latent scale as there is simply no credible way to pretend they are continuous. Besides leaving out questions that could help more precisely estimate the underlying latent scale, excluding these items also means that researchers are unable to see how they load onto the underlying latent dimension.
More worryingly, one could also think of binary and ordinal survey questions as, in fact, always being inherently multinomial because of nonresponse. The fact that respondents can deliberately choose not to respond to a question (or are sometimes even prompted to “skip” if they “don’t know”) means that they are introducing a category of “nonresponse” that cannot be easily compared to the other outcomes. Traditional methods assume that these nonresponses are missing at random and thus they are either dropped from the estimation or, more commonly, imputed using some procedure. Yet, as existing research that directly analyzes nonresponse shows, these individuals are systematically different on observable characteristics (Berinsky Reference Berinsky1999, Reference Berinsky2002) and thus the missing at random imputation assumptions may not be credible.
Thus, a more principled solution to nonresponse is to treat them as a valid category that is scaled alongside the “intended” responses as part of the generative model. The mIRT provides exactly the framework to do so; I begin by applying it to a scale of “moral values” formed by pooling together approximately 25 questions from the 2008 ANES.Footnote 24 The questions used in the scale are highly typical of survey response items; most are typically viewed as “ordinal” questions where respondents are asked to pick from a moderate number of choices (four to seven) in response to some question.Footnote 25 Whilst some are classically “ordinal,” others are more complicated; they provide a series of options that the survey designers believed were ordinal but are more qualitative. For example, consider the question of abortion. It asks respondents to pick from one of the four choices:Footnote 26
1. By law, abortion should never be permitted.
2. The law should permit abortion ONLY in case of rape, incest or when the woman’s life is in danger.
3. The law should permit abortion for reasons OTHER THAN rape, incest or danger to the woman’s life, but only after the need for the abortion has been clearly established.
4. By law, a woman should always be able to obtain an abortion as a matter of personal choice.
Other available responses are an “other” (volunteered by the respondent) as well as a classic “don’t know” response. Even though the four provided options seem to be ordered in roughly increasing restrictiveness, it is perhaps not something that researchers would be perfectly happy to assume was true a priori. More importantly, even if the data are ordinal, the typical approach for modeling ordinal data places strong assumptions on the nature of responses—that they are “parallel regressions” leading to the famous “proportional odds” implication with a logistic link discussed above. Treating the abortion question as multinomial allows us to have a more flexible structure to let the data reveal itself and then researchers can examine the quantities of interest ex post to see whether the recovered parameters do suggest an ordinal structure.
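To make this contrast concrete, consider a hedged sketch in the paper’s notation (the proportional-odds form below is the textbook ordered-logit specification, not an equation from this paper, and writing the stick-breaking utility as $\beta_j^{n} x_i - \kappa_j^{n}$ is an assumption about the exact parameterization): the proportional-odds model forces a single discrimination $\beta_j$ on every cumulative split, whereas the stick-breaking representation gives each conditional split its own parameters.

$$\Pr(y_{ij} \le n) = \operatorname{logit}^{-1}\left(\kappa_j^{n} - \beta_j x_i\right) \quad \text{(proportional odds: a single } \beta_j\text{)}$$

$$\Pr(y_{ij} = n \mid y_{ij} \ge n) = \operatorname{logit}^{-1}\left(\beta_j^{n} x_i - \kappa_j^{n}\right) \quad \text{(stick-breaking: category-specific } \beta_j^{n}\text{)}$$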
Second, note that the abortion question provides a “don’t know” option; many other questions in this scale provide a similar option or even have a response category of “haven’t thought much about it.” Indeed, some questions use a “full filter” and allow the respondent to “skip” the question if they claim to lack knowledge about the issue. The feeling thermometers present this option most clearly by providing a “I have not heard of this group” response. In general, nonrespondents are likely to be different than those who do respond (Berinsky Reference Berinsky1999, Reference Berinsky2002), and thus one might also conjecture that they hold different ideological positions on the underlying moral values scale. Thus, by not modeling their “don’t know” or nonresponse more broadly defined, researchers both lose information to help efficiently estimate the positions of these individuals and risk creating scales that are biased for certain respondents, i.e., more extreme individuals might appear more moderate because they skipped questions for which their true beliefs were more extreme.
Given these concerns and the substantively interesting question of whether nonresponse has an ideological slant on certain questions, I estimated a multinomial model where all questions are treated as multinomial and questions with nontrivial levels of nonresponse (i.e., more than 1%) are modeled as a separate discrete category.
6.1 Different scalings of moral values
To begin, it is worth comparing the raw estimated ideal points from two models: first, a factor analysis model similar to that in Ansolabehere, Rodden, and Snyder (Reference Ansolabehere, Rodden and Snyder2008);Footnote 27 second, a multinomial model that treats nonresponse as a separate category for analysis. Figure 4 plots the results, with respondents who gave a nonresponse to at least one question shown as filled circles. The results are quite similar, which makes sense given that most questions see fairly low levels of nonresponse and there are sufficiently many questions to construct reliable scales using any of the standard approaches.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20190924112624508-0294:S1047198718000311:S1047198718000311_fig4g.gif?pub-status=live)
Figure 4. Comparison of scaling methods for ANES moral values. Note: Individuals who did not respond to at least one question (409 out of 2102 respondents) are indicated using filled (i.e., not hollow) circles. The correlation between the methods is 0.898.
6.2 Question-specific analysis of nonresponse
Besides recovering ideal points based on a variety of more complex questions, I transformed the estimated parameters $(\beta_{j}^{n},\kappa_{j}^{n})$ to get predicted probabilities of answering, $\text{Pr}(y_{ij}=k)$, for each question under consideration and therefore see how the underlying latent scale predicts nonresponse. As multinomial models have parameters that are challenging to interpret, showing predicted probabilities for particular questions as ideal points vary is a concise and visually interpretable way of showing how the ideal points map onto outcome probabilities.
To get an estimate of uncertainty, Imai, Lo, and Olmsted (Reference Imai, Lo and Olmsted2016) suggest the parametric bootstrap (Lewis and Poole Reference Lewis and Poole2004; Carroll et al. Reference Carroll, Lewis, Lo, Poole and Rosenthal2009), i.e., take the EM estimates as the truth and generate some number of simulated datasets that are scaled using the original procedure. They note, however, this sits somewhat uneasily with the Bayesian nature of the model as it represents a measure of “sampling variability” rather than a true exploration of the posterior in a fully Bayesian sense. Given the size of the data in question here, I adopt a different approach: I use the EM estimates as the starting values for the Gibbs Sampler implementation of the mIRT. As this means that the sampler starts at the posterior mode, one should expect rapid convergence. Thus, the Gibbs Sampler can be run for a short period of time (and much shorter than from random starting values) to approximate the uncertainty in the posterior, in the region of highest density.Footnote 28 I can then calculate the predicted probabilities for each set of parameters across the ideal points commonly observed.Footnote 29 By taking the 95% credible interval, I can show the predicted probabilities with an estimate of the associated uncertainty. To begin, I plot the predicted probabilities for two questions of interest:
Same-Sex Marriage (083214): “Should same-sex couples be ALLOWED to marry, or should they NOT BE ALLOWED to marry?”
- 1 Marriage: Should be allowed
- 3 No Marriage: Should not be allowed
- 5 Civil Unions: Should not be allowed to marry but should be allowed to legally form a civil union
- 7 Other: This includes respondents who volunteered some other answer (32; 1.5%).
- NA Refusal: This includes the respondents who did not provide a valid answer (55; 2.5%).
Prayer (083183): “People practice their religion in different ways. Outside of attending religious services, do you pray SEVERAL TIMES A DAY, ONCE A DAY, A FEW TIMES A WEEK, ONCE A WEEK OR LESS, or NEVER?”
- 1 Several Times a Day
- 2 Once A Day
- 3 A Few Times A Week
- 4 Once A Week Or Less
- 5 Never
- 7 Other
- NA Refusal: This lumps together 11 respondents who did not submit a valid response (all volunteered some “other” response). As this category is nearly empty, I do not include it as a category and treat the missing data as idiosyncratic and impute it using the data augmentation approach described in Appendix D.
The results are in Figure 5. The left panel shows the results for same-sex marriage. The posterior median of the predicted probabilities is shown in a solid black line. Negative values on this scale represent those who are morally liberal and positive values indicate moral conservatism. Note that even though the “civil unions” option was listed last (coded as “5”), it in fact occurs in the middle, representing the preferred choice of moderates. Thus, even though the model specified the wrong ordering (i.e., putting it third), the predicted probabilities are sensible. There is a small bump for “refusals” on the morally conservative side; I return to this in the next section.
For the question on prayer, consider the right panel. Even though there are many options provided, they are scaled in the “correct” order by the mIRT. Moving from right to left, the probabilities of being in a category of frequent prayer decrease. The modes of the predicted probabilities are also ordered in the expected fashion.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20190924112624508-0294:S1047198718000311:S1047198718000311_fig5g.gif?pub-status=live)
Figure 5. Predicted probabilities for moral questions. Note: The category labels and question wordings are outlined in the main text. The dashed lines indicate the 25th and 75th percentiles of the estimated ideal points. Negative values indicate morally liberal responses. Uncertainty around the predicted probabilities is shown using the 95% credible interval from posterior simulations.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20190924112624508-0294:S1047198718000311:S1047198718000311_fig6g.gif?pub-status=live)
Figure 6. Probability of nonresponse. Note: Each panel shows the probability of nonresponse for a particular question. The dashed lines indicate the 25th and 75th percentiles of the estimated ideal points. Negative values indicate morally liberal responses. Uncertainty around the predicted probabilities is shown using the 95% credible interval from posterior simulations and 95% quantiles from the parametric bootstrap.
To show the nonresponse probabilities for all questions, Figure 6 shows the probability of the nonresponse category across the ten questions where there were nonnegligible levels of nonresponse. Recall that the left of the scale (negative ideal points) indicates moral liberalism. The dashed vertical lines mark out the 25th and 75th percentiles. Credible intervals are shown from the posterior draws with the posterior median indicated by a solid line.
This figure is striking in that it shows clear evidence of ideological “shyness” for certain types of respondents. Especially when given feeling thermometers, moral conservatives have a modest probability of refusing to answer the question (a predicted probability of around 0.10, which is about the level of some of the lesser used “intermediate” categories on the feeling thermometer). The questions where this occurs are especially interesting: When asked to evaluate homosexuals (or LGBT individuals more generally when asked about same-sex marriage), one can see a quite distinct pattern where moral conservatives are less willing to provide the consistent response of a conservative attitude. A similar pattern appears when respondents are asked to evaluate feminists. This perhaps suggests social desirability bias; moral conservatives may be less willing to admit to a view that they think interviewers would judge them for and thus take the option of nonresponse. Most interestingly, there is a similar pattern for “Christian fundamentalists” (feeling thermometer). This question is striking in that it has a very high level of “do not recognize this group” responses as well as standard levels of other nonresponse. This suggests that moral conservatives may take umbrage at being asked to evaluate a group referred to by a fairly pejorative term and thus refuse to answer the question either by refusing to acknowledge the legitimacy of the group label (“I do not recognize this group”) or by skipping the question entirely. It is striking to compare this against the feeling thermometer for “Christians” (sans fundamentalist), where moral conservatives report highly positive feelings.
For that question, nonresponse levels are very low and not sharply ideologically biased; thus, when asked to rate their religion as a whole, the vast majority of morally conservative respondents do provide a (highly positive) answer, but they refuse to do so when asked to admit to being sympathetic toward a pejoratively defined group.
Yet, ideological nonresponse is not exclusively the domain of moral conservatives. On questions regarding religion, especially views on the Bible, liberals show nonresponse. Specifically, the results suggest that moral liberals feel some cross-pressure not to adopt the most extreme response on the question about the Bible (“the Bible is the work of man and not God”) and thus will say “don’t know” to avoid the question. Again, social desirability is a plausible explanation; even moral liberals may feel uncomfortable taking a fairly strong anti-Christian stance in front of an interviewer.
Overall, this section has shown that there are gains to taking the “don’t know” and other nonresponses seriously when estimating ideal point models. Whilst these results are preliminary, they suggest an interesting direction of future research that tries to peer more deeply into the nonresponse category in our standard social science surveys to see whether it is masking ideological extremism. Further, as the results are reasonably subtle as to which questions show ideological nonresponse (and the direction is not always driven by moral conservatives), scaling nonresponse in a flexible way allows these patterns to reveal themselves rather than being imposed a priori by researchers.
The model outlined in this paper (mIRT) provides a novel way of doing so; existing Bayesian implementations of ideal point models do not permit the tractable analysis of multinomial outcomes and would have required assuming ordinality and thus placing nonresponse at some point in the scale using prior information. Given the nuanced results above, it is not implausible to think that researchers might disagree about the correct placement of the nonresponse category in an ordinal framework, and thus a method that avoids the researcher having to take a strong stand before the analysis is desirable. Further, the mIRT allowed the quick and flexible scaling of questions with different numbers of outcomes (from 5 to 10); it was not necessary to collapse questions down to three categories (as required by Imai, Lo, and Olmsted (Reference Imai, Lo and Olmsted2016)) and, indeed, many of the moral questions analyzed here cannot plausibly be so recoded.
Future extensions of this preliminary investigation into nonresponse could involve integrating models that predict nonresponse using covariates (Berinsky Reference Berinsky1999, Reference Berinsky2002), which would allow us both to flexibly model nonresponse and to scale our questions of interest using an “all-in-one” framework. In terms of survey design, this should also cause researchers to reconsider the use of feeling thermometers and whether the “I do not know who this group is” filter should be applied. Especially when considering groups that are described in perhaps contested or controversial ways (e.g., “Christian fundamentalists”), the possibility of nonresponse as a way of dissenting from the description of the group might bias the results that are obtained.
7 Conclusion
This paper brought together two developments in Bayesian statistics (stick-breaking representation of multinomial choice; Pólya-Gamma data augmentation) and applied them to ideal points for the first time. This allowed me to derive a conceptually simple and elegant representation for flexibly modeling multinomial data. Estimation is similarly clean and can be done using an exact EM algorithm to find the posterior mode or a Gibbs Sampler to recover the full posterior. This model, the mIRT, includes most of the canonical models in political science as special cases as well as allowing the analysis of complex forms of survey data (e.g., many-valued ordinal and multinomial responses) for the first time using an estimation procedure (the EM algorithm) that also allows feasible scaling to large datasets.
The main contribution of the mIRT is its flexibility in allowing researchers to modify the terms in the “utility” of choices (the $\psi_{ij}^{n}$) to easily create more theoretically rich models to analyze questions across a wide variety of domains. As an example, I applied the mIRT model to scaling nonresponse in the ANES. I demonstrated that the flexibility of this model allows us to uncover patterns of ideological nonresponse; for a sizeable number of questions on moral issues, nonresponse is not missing at random: Rather, ideologically extreme individuals (particularly conservatives) will skip or not respond to questions that would require them to give a response that might be seen as socially undesirable. For example, it seems that moral conservatives are somewhat unwilling to admit opposition to policies for legal remedies to discrimination against homosexuals, whilst moral liberals tend to be shyer about admitting views on the Bible that suggest it is “the work of man.”
Beyond unifying core models and improving speed, the key benefit of the mIRT is that it easily admits theoretically interesting extensions whilst staying in the same framework of a stick-breaking multinomial, with binary outcomes being an important special case. The fact that estimation can be done not only via a clear MCMC framework but also via a simple EM algorithm without the need for variational approximations means that sophisticated models generated using the mIRT can be easily scaled up to estimate models based on large datasets without undue computational demands. A caveat of the mIRT is that it requires the researcher to impose some ordering on the response categories; however, Appendix A shows in extensive detail that in all scenarios considered in this paper, the estimated ideal points are very highly correlated regardless of the choice of ordering, even if one chooses a deliberately bad ordering. Whilst preliminary, Appendix A also sketches a theoretical justification for why this is the case; it shows that the stick-breaking method represents an approximation of the classic multinomial framework and thus, at least for the types of models considered in this paper, may explain why the results are so robust to the choice of ordering. More theoretical work on this question and understanding exactly when the choice of ordering becomes significant is an open area for future research. Preliminary work suggests that when there are very many (e.g., one hundred or more) categories and/or categories that are sparsely populated, the ordering may become more important.
However, for many applications, the stick-breaking parameterization has important benefits for inference (exact EM or simple Gibbs Samplers) and provides a flexible base on which to construct more complicated ideal point models that better reflect the interesting underlying structure of the particular questions. Thus, with the caveats of the mIRT held in mind, the framework developed in this paper will hopefully permit researchers to write and estimate more sophisticated models to scale many types of data as well as reducing the reliance on “bespoke” models that are difficult to translate into other domains.
Supplementary material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2018.31.