
Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora

Published online by Cambridge University Press: 03 July 2019

Ludovic Rheault*
Assistant Professor, Department of Political Science and Munk School of Global Affairs and Public Policy, University of Toronto, Canada. Email: ludovic.rheault@utoronto.ca

Christopher Cochrane
Associate Professor, Department of Political Science, University of Toronto, Canada. Email: christopher.cochrane@utoronto.ca

Abstract

Word embeddings, the coefficients from neural network models predicting the use of words in context, have now become inescapable in applications involving natural language processing. Despite a few studies in political science, the potential of this methodology for the analysis of political texts has yet to be fully uncovered. This paper introduces models of word embeddings augmented with political metadata and trained on large-scale parliamentary corpora from Britain, Canada, and the United States. We fit these models with indicator variables of the party affiliation of members of parliament, which we refer to as party embeddings. We illustrate how these embeddings can be used to produce scaling estimates of ideological placement and other quantities of interest for political research. To validate the methodology, we assess our results against indicators from the Comparative Manifestos Project, surveys of experts, and measures based on roll-call votes. Our findings suggest that party embeddings are successful at capturing latent concepts such as ideology, and the approach provides researchers with an integrated framework for studying political language.

Copyright © The Author(s) 2019. Published by Cambridge University Press on behalf of the Society for Political Methodology. 

1 Introduction

The representation of meaning is a fundamental objective in natural language processing. As a basic illustration, consider queries performed with a search engine. We ideally want computers to return documents that are relevant to the substantive meaning of a query, just like a human being would interpret it, rather than simply the records with an exact word match. To achieve such results, early methods such as latent semantic analysis relied on low-rank approximations of word frequencies to score the semantic similarity between texts and rank them by relevance (Deerwester et al. 1990; Manning, Raghavan, and Schütze 2009, chap. 18). The new state of the art in meaning representation is word embeddings, or word vectors, the parameter estimates of artificial neural networks designed to predict the occurrence of a word by the surrounding words in a text sequence. Consistent with the use theory of meaning (Wittgenstein 2009), these embeddings have been shown to capture semantic properties of language, revealed by an ability to solve analogies and identify synonyms (Mikolov, Sutskever, et al. 2013; Pennington, Socher, and Manning 2014). Despite a broad appeal across disciplines, the use of word embeddings to analyze political texts remains a new field of research.[1] The aim of this paper is to examine the reliability of the methodology for the detection of a latent concept such as political ideology. In particular, we show that neural networks for word embeddings can be augmented with metadata available in parliamentary corpora. We illustrate the properties of this method using publicly available corpora from Britain, Canada, and the United States and assess its validity using external indicators.

The proposed methodology addresses at least three shortcomings associated with currently available textual indicators of ideological placement. First, as opposed to measures based on word frequencies, our neural networks are trained to predict the use of language in context. Put another way, the method accounts for a party’s usage of words, given the surrounding text. Second, our approach can easily accommodate control variables, factors that could otherwise confound the placement of parties or politicians. For example, we account for the government–opposition dynamics that have foiled ideology indicators applied to Westminster systems in the past and filter out their influence to achieve more accurate estimates of party placement. Third, the methodology allows us to map political actors and language in a common vector space. This means that we can situate actors of interest based on their proximity to political concepts. Using a single model of embeddings, researchers can rank political actors relative to these concepts using a variety of metrics for vector arithmetic. We demonstrate such implementations in our empirical section.

Our results suggest that word embeddings are a promising tool for expanding the possibilities of political research based on textual data. In particular, we find that scaling estimates of party placement derived from the embeddings for the metadata—which we call party embeddings—are strongly correlated with human-annotated and roll-call vote measures of left–right ideology. We compare the methodology with WordFish, which represents the most popular alternative for text-based scaling of political ideology. The two methods share similarities in that both can be estimated without the need for annotated documents. This comparison illustrates how embedding models are well suited to account for the evolution of language over time. In contrast to other text-based scaling methods, our methodology also allows us to map political actors in a multi-dimensional space. Finally, we demonstrate how the methodology can help to advance substantive research in comparative politics by replicating the model in two Westminster-style democracies. For such countries in particular, the ability to include control variables proves useful to account for the effect of institutional characteristics.

We start by situating our methodology in the broader literature on textual analysis in the next section. We then introduce the methodology concretely in Section 3. Section 4 discusses preprocessing steps and software implementation. Next, an empirical section illustrates concrete applications of the methodology: we demonstrate strategies for the retrieval of ideological placement, validate the results against external benchmarks, and compare the properties of the method against WordFish. We also illustrate an application at the legislator level and briefly address the question of uncertainty estimates for prospective users. Finally, we discuss intended uses and raise some warnings regarding interpretation.

2 Relations to Previous Work

Two of the most popular approaches in political science for the extraction of ideology from texts are WordScores (Laver, Benoit, and Garry 2003) and WordFish (Slapin and Proksch 2008).[2] The first relies on a sample of labeled documents, for instance, party manifestos annotated by experts. The relative probabilities of word occurrences in the labeled documents serve to produce scores for each word, which can be viewed as indicators of their ideological load. Next, the scores can be applied to the words found in new documents to estimate their ideological placement. In fact, this approach can be compared to methods of supervised machine learning (Bishop 2006), where a computer is trained to predict the class of a labeled set of documents based on their observed features (e.g. their words). WordFish, on the other hand, relies on party annotations only. The methodology consists of fitting a regression model where word counts are projected onto party–year parameters, using an expectation maximization algorithm (Slapin and Proksch 2008). This approach avoids the reliance on expert annotations and amounts to estimating the specificity of word usage by party at different points in time.

Neither of these approaches, however, takes into account the role of words in context. Put another way, they ignore semantics. Although theoretically both WordScores and WordFish could be expanded to include $n$-grams (sequences of more than one word), this comes at an increased computational cost. There are so many different combinations of words in the English language that it rapidly becomes inefficient to count them. This problem has been addressed recently in Gentzkow, Shapiro, and Taddy (2016) and presented as a curse of dimensionality. Using a large number of words may be inefficient when tracking down ideological slants from textual data since a high feature–document ratio overstates the variance across the target corpora (Taddy 2013; Gentzkow, Shapiro, and Taddy 2016). But for a few exceptions (Sim et al. 2013; Iyyer et al. 2014), models of political language face a trade-off between ignoring the role of words in context and dealing with high-dimensional variables.[3] Instead of relying on word frequencies, word embedding models aim to capture and represent relations between words using co-occurrences, which sidesteps the high feature–document ratio problem while allowing researchers to move beyond counts of words taken in isolation.

In the context of studies based on parliamentary debates, an additional concern is the ability to account for other institutional elements, such as the difference in tone between the government and the opposition. A dangerous shortcut would consist of attributing any observed differences between party speeches to ideology. In Westminster-style parliaments, the cabinet will use a different vocabulary than opposition parties due to the nature of their legislative functions. For instance, opposition parties will invoke ministerial positions frequently when addressing their counterparts. Hirst et al. (2014) show that machine-learning models used to classify texts by ideology have a tendency to be confounded with government–opposition language. As a result, temporal trends can also be obscured by procedural changes in the way government and opposition parties interact in parliament. A similar issue has been found to affect methods based on roll-call votes to infer ideology, where government–opposition dynamics can dominate the first dimension of the estimates (Spirling and McLean 2007; Hix and Noury 2016). Our proposed methodology can accommodate the inclusion of additional “control variables” to filter out the effect of these institutional factors.

We argue that neural network models of language represent a natural way forward for scholars attempting to measure a concept such as political ideology. The patterns of ideas that constitute ideologies are not as straightforward as is often assumed (Freeden 1998; Cochrane 2015). Ideologies are emergent phenomena that are tangible but irreducible. They are tangible in that they genuinely structure political thinking, disagreement, and behavior; they are irreducible in that no one actor or idea, or group of actors or ideas, constitutes the core from which an ideology emanates. Second-degree connections between words and phrases give rise to meaningful clusters that fade from view when we analyze ideas, actors, or subsets of these things in isolation from their broader context. People know ideology when they see it but struggle to say precisely what it is because they cannot resolve the irresolvable (Cochrane 2015). This property of ideology has bedeviled past analysis. Since neural network models are designed to capture complex interactions between the inputs—in our case, context words and indicator variables for political actors—they are well adapted for the study of concepts that should theoretically emerge from such interactions.

The neural network models used in this paper are derived from word embeddings (Mikolov, Chen, et al. 2013; Pennington, Socher, and Manning 2014; Levy, Goldberg, and Dagan 2015). Such models have gained wide popularity in many disciplines, including, more recently, political science. For example, Preoţiuc-Pietro et al. (2017) use word embeddings to generate lists of topic-specific words automatically by exploiting the ability of embeddings to find semantic relations between words. Glavaš, Nanni, and Ponzetto (2017) utilize a word embedding model expanded to multiple languages as their main input to classify sentences from party manifestos by topic, which they evaluate against data from the Comparative Manifestos Project (CMP). Rheault et al. (2016) rely on word embeddings to automatically adapt sentiment lexicons to the domain of politics, which avoids problems associated with off-the-shelf sentiment dictionaries that often attribute an emotional valence to politically relevant words such as “health,” “war,” or “education.” In this study, we expand on traditional word embedding models to include political variables (for an illustration using political texts, see Nay 2016). We examine how such models can be used to study the ideological leanings of political actors across legislatures.

3 Methodology

Models for word embeddings have been explored thoroughly in the literature, but we need to introduce them summarily to facilitate the exposition of our approach. This section also adopts a notation familiar to social science scholars. Our implementation uses shallow neural networks, that is, statistical models containing one layer of latent variables—or hidden nodes—between the input and output data.[4] The outcome variable $w_{t}$ is the word occurring at position $t$ in the corpus. The variable $w_{t}$ is multinomial with $V$ categories corresponding to the size of the vocabulary. The input variables in the model are the surrounding words appearing in a window $\Delta$ before and after the outcome word, which we denote as $\boldsymbol{w}_{\Delta}=(w_{t-\Delta},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+\Delta})$. The window is symmetrical to the left and to the right, which is the specification we use for this study, although nonsymmetrical windows are possible, for instance, if one wishes to give more consideration to the previous words in a sequence than to the following ones. Simply put, word embedding models consist of predicting $w_{t}$ from $\boldsymbol{w}_{\Delta}$.

The neural network can be subdivided into two components. Let $z_{m}$ represent a hidden node, with $m\in\{1,\ldots,M\}$, where $M$ is the dimension of the hidden layer. Each node can be expressed as a function of the inputs:

(1) $$z_{m}=f(\boldsymbol{w}_{\Delta}^{\prime}\boldsymbol{\beta}_{m}).$$

In machine learning, $f$ is called an activation function. In the case of word embedding models such as the one we rely upon, that function is simply the average value of $\boldsymbol{w}_{\Delta}^{\prime}\boldsymbol{\beta}_{m}$ across all input words (see Mikolov, Chen, et al. 2013). Since each word in the vocabulary can be treated as an indicator variable, Equation (1) can be expressed equivalently as

(2) $$z_{m}=\frac{1}{2\Delta}\sum_{w_{v}\in \boldsymbol{w}_{\Delta}}\beta_{v,m}$$

that is, a hidden node is the average of coefficients $\beta_{v,m}$ specific to a word $w_{v}$ if that word is present in the context of the target word $w_{t}$. In turn, the vector of hidden nodes $\boldsymbol{z}=(z_{1},\ldots,z_{M})$ is the average of the $M$-dimensional vectors of coefficients $\boldsymbol{\beta}_{v}$, for all words $v$ occurring in $\boldsymbol{w}_{\Delta}$:

(3) $$\boldsymbol{z}=\frac{1}{2\Delta}\sum_{w_{v}\in \boldsymbol{w}_{\Delta}}\boldsymbol{\beta}_{v}.$$

Upon estimation, these vectors $\boldsymbol{\beta}_{v}$ are the word embeddings of interest.

The remaining component of the model expresses the probability of the target word $w_{t}$ as a function of the hidden nodes. Similar to the multinomial logit regression framework, commonly used to model vote choice, a latent variable representing a specific word $i$ can be expressed as a linear function of the hidden nodes: $u_{it}^{\ast}=\alpha_{i}+\boldsymbol{z}^{\prime}\boldsymbol{\mu}_{i}$. The probability $P(w_{t}=i)$ given the surrounding words corresponds to

(4) $$P(w_{t}=i\mid \boldsymbol{w}_{\Delta})=\frac{e^{\alpha_{i}+\boldsymbol{z}^{\prime}\boldsymbol{\mu}_{i}}}{\sum_{v=1}^{V}e^{\alpha_{v}+\boldsymbol{z}^{\prime}\boldsymbol{\mu}_{v}}}.$$

The full model can be written compactly using nested functions and dropping some indices for simplicity:

(5) $$P(w_{t}\mid \boldsymbol{w}_{\Delta})=g\big(\boldsymbol{\alpha},\boldsymbol{\mu},f(\boldsymbol{w}_{\Delta}^{\prime}\boldsymbol{\beta})\big).$$

As can be seen with the visual depiction in Figure 1, the embeddings $\boldsymbol{\beta}$ link each input word to the hidden nodes.[5] The parameters of the model can be fitted by minimizing the cross-entropy using variants of stochastic gradient descent. We rely on negative sampling to fit the predicted probabilities in Equation (4) (see Mikolov, Sutskever, et al. 2013). In an influential study, Pennington, Socher, and Manning (2014) have shown that a corresponding model can be represented as a log-bilinear Poisson regression using the word–word co-occurrence matrix of a corpus as data. However, the implementation we use here facilitates the inclusion of metadata by preserving individual words as units of analysis.

Figure 1. Example of model with word and party embeddings. The figure shows a schematic example of input and output data in a model with $M=5$ and a window $\Delta=3$. The model includes a variable indicating the party affiliation and parliament of the politician making the speech.
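To make the mechanics concrete, the following minimal NumPy sketch traces one forward pass of the network: context embeddings are averaged into the hidden layer of Equation (3), and Equation (4) converts the hidden nodes into a probability distribution over the vocabulary. The dimensions and random values are toy choices of our own; this is an illustration, not the paper’s replication code, and actual training replaces the full softmax with negative sampling, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
V, M, delta = 1000, 200, 20        # toy vocabulary size, hidden nodes, window

B = rng.normal(0, 0.1, (V, M))     # input embeddings: one beta vector per word
Mu = rng.normal(0, 0.1, (V, M))    # output weights: one mu vector per word
alpha = np.zeros(V)                # word-specific intercepts

context_ids = rng.integers(0, V, 2 * delta)  # indices of the 2*delta context words

z = B[context_ids].mean(axis=0)    # Equation (3): average the context embeddings

logits = alpha + Mu @ z            # latent score for every candidate target word
p = np.exp(logits - logits.max())  # subtract the max for numerical stability
p /= p.sum()                       # Equation (4): softmax over the vocabulary
print(p.shape, round(p.sum(), 6))  # (1000,) 1.0
```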

The basic model introduced above can be expanded to include additional input variables, which is our main interest in this paper. A common implementation uses indicator variables for documents or segments of text of interest, in addition to the context words (Le and Mikolov 2014).[6] The approach was originally called paragraph vectors or document vectors. More generally, other types of metadata can be entered in Equation (1) to account for properties of interest at the document level, which is the approach we adopt here. In our implementation, we focus primarily on indicator variables measuring the party affiliation of a member of parliament (MP) or a congressperson uttering a speech. The inner component of the expanded model can be represented as

(6) $$z_{m}=f(\boldsymbol{w}_{\Delta}^{\prime}\boldsymbol{\beta}_{m}+\boldsymbol{x}^{\prime}\boldsymbol{\zeta}_{m}),$$

where $\boldsymbol{x}$ is a vector of metadata, and the rest of the specification is the same as before. In addition to party affiliation, it is straightforward to account for attributes with the potential to affect the use of language and confound party-specific estimates. We mentioned government status earlier (cabinet vs. noncabinet positions or party in power vs. opposition). For Canada, a country where federal politics is characterized by persistent regional divides, a relevant variable would be the province of the MP. Just like words have their embeddings, each variable entered in $\boldsymbol{x}$ has a corresponding vector $\boldsymbol{\zeta}$ of dimension $M$.

Observe that the resulting vectors $\boldsymbol{\zeta}$ have commonalities with the WordFish estimator of party placement. In their WordFish model, Slapin and Proksch (2008) predict word counts with party–year indicator variables. The resulting parameters are interpreted as the ideological placement of parties. The model introduced in (6) achieves a similar goal. The key difference is that our model is estimated at the word level, while taking into account the context ($\boldsymbol{w}_{\Delta}$) in which a word occurs. The hidden layer serves an important purpose by capturing interactions between the metadata and these context words. Moreover, the dimension of the hidden layer will determine the size of what we refer to as party embeddings in what follows, that is, the estimated parameters for each party. Rather than a single point estimate, we fit a vector of dimension $M$. A benefit is that these party embeddings can be compared against the rest of the corpus vocabulary in a common vector space, as we illustrate below.

Specifically, our implementation uses party–parliament pairs as indicator variables, for several reasons. First, fitting combinations of parties and time periods allows us to reproduce the nature of the WordFish model as closely as possible: each party within a given parliament or Congress has a specific embedding. Second, this approach accounts for the possibility that the language and issues debated by each party may evolve from one parliament to the next; parties are allowed to “move” over time in the vector space. Finally, we rely on parliaments/Congresses, rather than years, to facilitate external validity tests against roll-call vote measures and annotations based on party manifestos, which are published at the election preceding the beginning of each parliament. Of course, the possible specifications are virtually endless and may differ in future applications. But we believe that the models we present are consistent with existing practice and provide a useful ground for a detailed assessment.
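To fix ideas, the sketch below shows one way party–parliament indicators can be attached to speeches in the document-vector framework, using gensim-style TaggedDocument objects. The field names and sample speeches are invented for illustration: each tag pairs a party label with the starting year of a Congress, alongside a separate Congress dummy of the kind used in our US models.

```python
from gensim.models.doc2vec import TaggedDocument

# Hypothetical preprocessed speeches with speaker metadata.
speeches = [
    {"tokens": ["reducing", "taxes", "free_enterprise"], "party": "Rep", "year": 2011},
    {"tokens": ["civil_rights", "decent_housing"], "party": "Dem", "year": 2011},
]

documents = [
    TaggedDocument(
        words=s["tokens"],
        # One tag per party-parliament pair, plus a Congress dummy that
        # absorbs period-specific language shared by both parties.
        tags=[f"{s['party']}_{s['year']}", f"Congress_{s['year']}"],
    )
    for s in speeches
]
print(documents[0].tags)  # ['Rep_2011', 'Congress_2011']
```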

4 Data and Software Implementation

Models of word embeddings have been shown to perform best when fitted on large corpora that are adapted to the domain of interest (Lai, Liu, and Xu 2016). For the purpose of this study, we rely on three publicly available collections of digitized parliamentary debates overlapping a century of history in the United States, Britain, and Canada. Replicating the results in three polities helps to demonstrate that the proposed methodology is general in application. The United States corpus is the version released by Gentzkow, Shapiro, and Taddy (2016). Our version of the British Hansard corpus is hosted on the Political Mashup website.[7] Finally, the Canadian Hansard corpus is described in Beelen et al. (2017) and released as linked and open data on www.lipad.ca. Each resource is enriched with a similar set of metadata about speakers, such as party affiliations and functions. The first section of the online appendix describes each resource in more detail.

We considered speeches made by the major parties in each corpus. The United States corpus ranges from 1873 to 2016 (43rd to 114th Congress). We present results for the House of Representatives and the Senate separately and restrict our attention to voting members affiliated with the Democratic and Republican parties. For the United Kingdom, the corpus covers the period from 1935 to 2014. We restrict our focus to the three major party labels: Labour, Liberal-Democrats, and Conservatives. For Canada, we use the entirety of the available corpus, which covers a period ranging between 1901 and 2018, from the 9th to the 42nd Parliament. The corpus represents over 3 million speeches after restricting our attention to five major parties (Conservatives, Liberals, New Democratic Party, Bloc Québécois, and Reform Party/Canadian Alliance). We removed speeches from the Speaker of the House of Commons in Britain and Canada, whose role is nonpartisan.

4.1 Steps for Implementation

We fit the embedding models on each corpus using custom scripts based on the gensim library for Python (Řehůřek and Sojka 2010), which builds upon the implementation of document vectors proposed by Le and Mikolov (2014). This model itself extends the original C implementation of the word embeddings model proposed by Mikolov, Sutskever, et al. (2013). The library uses asynchronous stochastic gradient descent and relies on the method of negative sampling to fit the softmax function at the output of the neural networks. Our scripts are released openly, and the source code for the aforementioned library is also available publicly.[8] Fitting the models involves some decisions pertaining to text preprocessing and the choice of hyperparameters. We discuss each of these decision processes in turn.

4.1.1 Text Preprocessing

Preprocessing decisions when working with textual data may have nontrivial consequences (see Denny and Spirling 2018). Word embedding models are often implemented with little text preprocessing. One of the most popular implementations, however, that of Mikolov, Sutskever, et al. (2013), relies on subsampling—the random removal of frequent words from the context window during estimation. Subsampling essentially achieves a goal similar to stop word removal by limiting the influence of overly frequent tokens. This strategy was shown to improve the accuracy of word embeddings for tasks involving semantic relations. In our case, we want the models to learn information about terms that are substantively meaningful for politics. Simply put, the co-occurrence of the terms “reducing” and “taxes” is more meaningful for learning political language than that of the terms “the” and “taxes.”

For this study, we preprocessed the text by removing digits and words with two letters or fewer, as well as a list of English stop words enriched to remove overly common procedural words such as “speaker” (or “chairwoman/chairman” in the United States), used in most speeches due to decorum. Even though our corpora comprise hundreds of millions of words, they are smaller than the corpora used in the original implementations of word embeddings. The removal of these common words ensures that we observe many instances of substantively relevant words used in the same context during the training stage. We tested models with and without stop words removed. Although removing procedural words only has a marginal effect on our methodology, we find that the removal of English stop words does improve the accuracy of our models for tasks such as ideology detection. We also limit the vocabulary to tokens with a minimum count of 50 occurrences. This avoids fitting embeddings for words with few training examples.
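A minimal sketch of this filtering step is shown below; the procedural terms added to the stop list are illustrative, and the actual lists used in the paper are longer.

```python
import re
from gensim.parsing.preprocessing import STOPWORDS

PROCEDURAL = {"speaker", "chairman", "chairwoman"}  # illustrative additions
STOP = set(STOPWORDS) | PROCEDURAL

def tokenize(speech):
    """Lowercase, then drop digits, stop words, and words of two letters or fewer."""
    tokens = re.findall(r"[a-z]+", speech.lower())  # keeps alphabetic tokens only
    return [t for t in tokens if len(t) > 2 and t not in STOP]

print(tokenize("Mr. Speaker, we are reducing taxes in 2016 for small businesses."))
# ['reducing', 'taxes', 'small', 'businesses']
```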

Finally, our models include not only words but also common phrases. We proceed with two passes of a collocation detection algorithm, which is applied to each corpus prior to estimation. Collocations (words used frequently together) are merged as single entities, which means that with two passes of the algorithm, we capture phrases of up to four words.[9] Since phrases longer than four words are very sparse, we stop after two runs of collocation detection on each corpus. Although not entirely necessary for the methodology, we find that the inclusion of phrases facilitates interpretation for political research, where multi-word entities are frequent and common expressions may have specific meanings (e.g. “civil rights”). The online appendix includes a table with the most frequent phrases for each corpus.
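In gensim, the two passes can be implemented with the Phrases class, as in the sketch below. The toy corpus and the min_count/threshold values are chosen only so the example fires on tiny data; real corpora call for stricter settings, and the paper’s exact thresholds are not assumed here.

```python
from gensim.models.phrases import Phrases, Phraser

# Toy corpus: an iterable of token lists from the preprocessing step.
sentences = [["civil", "rights", "act"], ["civil", "rights", "movement"]] * 50

# First pass merges frequent bigrams, e.g. civil + rights -> civil_rights.
bigram = Phraser(Phrases(sentences, min_count=5, threshold=0.01))
first_pass = [bigram[s] for s in sentences]

# Second pass runs on the merged output, capturing phrases of up to four words.
multigram = Phraser(Phrases(first_pass, min_count=5, threshold=0.01))
second_pass = [multigram[s] for s in first_pass]

print(second_pass[0])  # ['civil_rights_act']
```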

4.1.2 Fitting the Model

For the most part, we implemented our models using hyperparameters set at the default values in the original algorithms proposed by Mikolov, Sutskever, et al. (2013) and Le and Mikolov (2014). We fit each model with a learning rate of 0.025 and five epochs—that is, five passes over all speeches in each corpus. Previous work using word embeddings often relies on hidden layers of size 100, 200, or 300. The main text reports models with hidden layers of 200 nodes, which we find to be reliable for applied research. The online appendix provides additional information on parameterization and its influence on the output. In essence, we find that modifications to these default hyperparameters do not provide substantial improvements to the results presented in this paper. Hence, choosing the values mentioned above appears to be a reasonable starting point and avoids overfitting the parameters to the characteristics of a specific corpus in ways that may not generalize over time.

The only departure from default hyperparameters is the choice of window size. We rely on a window $\Delta$ of $\pm 20$ words.[10] In contrast, implementations in the field of computer science often use window sizes of 5 or 10 words (see e.g. Levy, Goldberg, and Dagan 2015). This choice is based on our own examination of parliamentary corpora. We find that the discursive style of members of parliament is more verbose than the language typically found on the web. The chosen size roughly corresponds to the average length of a sentence in the British corpus (19.56 tokens per sentence on average) and the Canadian one (20.62 tokens per sentence). Moreover, the topics discussed in individual speeches tend to extend over several sentences. As a result, a window of 20 context words takes into account information from the previous and following sentences. We find this choice to be appropriate for our methodology. Our online appendix also reports sensitivity tests regarding the window size.
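Assembling the preceding choices, a gensim Doc2Vec configuration consistent with the settings reported in this section might look as follows. This is a schematic sketch rather than our released scripts: `documents` stands for the full set of tagged speeches built as illustrated earlier, and the tag names are the hypothetical ones used there.

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    documents,        # TaggedDocument objects with party-parliament tags
    dm=1,             # distributed-memory architecture (context + tag vectors)
    dm_mean=1,        # average the input vectors, as in Equations (3) and (6)
    vector_size=200,  # M = 200 hidden nodes
    window=20,        # symmetric window of +/- 20 context words
    alpha=0.025,      # initial learning rate
    negative=5,       # negative sampling instead of the full softmax
    min_count=50,     # ignore tokens with fewer than 50 occurrences
    epochs=5,         # five passes over all speeches
    workers=4,        # parallel training threads
)

party_vec = model.dv["Dem_2011"]  # a 200-dimensional party embedding
word_vec = model.wv["taxes"]      # word embeddings live in the same space
```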

5 Empirical Illustrations

Upon estimation of the models, the embeddings can be used to compute several quantities of interest. In this section, we emphasize an approach to ideological scaling based on low-dimensional projections from the party embeddings.[11] We introduce tools facilitating interpretation and discuss validity assessments against three external sources of data on party ideology. Next, we compare the method against WordFish to emphasize a desirable property of word embedding models: the ability to account for changes in word usage over time. Finally, we illustrate how the methodology can be applied to individual legislators and discuss the question of uncertainty measures.

5.1 Party Embeddings

We start by assessing the models fitted with party-specific indicator variables. The objective is to project the $M$ -dimensional party embeddings into a substantively meaningful vector space. These party embeddings can be visualized in two dimensions using standard dimensionality reduction techniques such as principal component analysis (PCA), which we rely upon in this section.

In plain terms, PCA finds the one-dimensional component that maximizes the variance across the vectors of party embeddings (see e.g. Hastie, Tibshirani, and Friedman 2009, chap. 14.5). The next component is calculated the same way, subject to the constraint of zero covariance with the first component. Additional components could be computed, but our analysis focuses on two-dimensional projections to simplify visualizations. If the speeches made by members of different parties are primarily characterized by ideological conflicts, as is normally assumed in unsupervised techniques for ideology scaling, we can reasonably expect the first component to describe the ideological placement of parties. The second component will capture the next most important source of semantic variation across parties in a legislature. To facilitate the interpretation of these components, we can use the model’s word embeddings and examine the concepts most strongly associated with each dimension.
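The projection itself takes a few lines with scikit-learn: stack the party–parliament vectors from the fitted model, fit a two-component PCA, and, if needed, flip the sign of an axis for readability. This sketch continues the hypothetical tag names used in the earlier examples.

```python
import numpy as np
from sklearn.decomposition import PCA

# Gather every party-parliament embedding, leaving out the Congress dummies.
tags = [t for t in model.dv.index_to_key if not t.startswith("Congress_")]
X = np.vstack([model.dv[t] for t in tags])  # shape: (n_party_parliaments, 200)

pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # column 0: first principal component

# Orient the scale so that conservative parties appear on the right.
if coords[tags.index("Rep_2011"), 0] < 0:
    coords[:, 0] *= -1
```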

Starting with the US corpus, Figure 2(a) plots the party embeddings in a two-dimensional space for the House of Representatives. We label some data points using an abbreviation of the party name and the beginning year of a Congress; for instance, the embedding $\boldsymbol{\zeta}_{\text{Dem 2011}}$ denotes the Democratic party in the Congress starting in 2011 (the 112th Congress). The only adjustment that may be relevant to perform is orienting the scale in a manner intuitive for interpretation, for instance, by multiplying the values of a component by $-1$ such that conservative parties appear on the right. Each model includes party–Congress indicator variables as well as separate dummy variables for Congresses, which account for temporal changes in the discourse. To facilitate visualization of the results, panels (b) and (c) in Figure 2 plot the party embeddings as time series for the first and second components, respectively.

Figure 2. Party placement in the US House (1873–2016). The figure shows a two-dimensional projection of the two principal components of party embeddings for the US House of Representatives (a) and time-series plots for each of the two components separately in (b) and (c).

Our methodology captures ideological shifts that occurred during the 20th century. Although both major parties were originally close to the center of the first dimension, which we interpret as the left–right or liberal–conservative divide, they begin to drift apart around the New Deal era in the 1930s, a period usually associated with the fifth party system. The trend culminates in a period of marked polarization from the late 1990s to the most recent Congresses. The most spectacular shift is probably the one occurring on the second dimension, which we interpret as a South–North divide (we oriented the South to the bottom and the North to the top). The change reflects a well-documented realignment between Northern and Southern states that has occurred since the New Deal and civil rights eras (Shafer and Johnston 2009; Sundquist 2011). A similar trajectory appears in both the House and the Senate corpora (we report equivalent figures for the Senate in the online appendix). Although Republicans initially became associated with issues of Northern states, the opposite is true today—the two parties eventually switched sides completely. The recent era appears particularly polarized on both axes, which is consistent with a body of literature documenting party polarization. On the other hand, we do not find evidence of polarization on the principal component in the late 19th century, contrary to indicators based on vote data (Poole and Rosenthal 2007) but consistent with Gentzkow, Shapiro, and Taddy (2016).

5.1.1 Interpreting Axes

The models have desirable properties for interpreting the principal components in substantive terms. Since both words and party indicators are used as inputs in the same neural networks, we can project the embeddings associated with each word from the corpus vocabulary onto the principal components just estimated. Next, we can rank the words based on their distance from specific coordinates. For instance, the point $(10,0)$ in Figure 2 is located on the right end of the first component. The words and phrases closest to that location can help researchers to interpret the meaning of that axis.[12]
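In code, this amounts to applying the fitted PCA transform to the word embeddings and sorting by Euclidean distance from the chosen coordinate, as in this sketch (continuing from the objects above; any sign flip applied to the party axis should be applied to the word projections as well).

```python
import numpy as np

words = model.wv.index_to_key
W = pca.transform(model.wv.vectors)  # project all word embeddings onto the PCs

def closest_words(point, k=20):
    """Return the k words nearest to a coordinate in the 2-D projection."""
    dists = np.linalg.norm(W - np.asarray(point), axis=1)
    return [words[i] for i in np.argsort(dists)[:k]]

print(closest_words((10, 0)))  # terms near the right end of the first component
```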

To illustrate, Table 1 reports the 20 expressions with the shortest Euclidean distances to the four cardinal points of the two-dimensional space for the US House of Representatives.[13] We use the minimum and maximum values of the party projections on each axis to determine the cardinal points. Expressions located closest to the left end of the first component include “civil rights,” “racism,” “decent housing,” and “poorest,” indicating that these terms are semantically associated with this end of the ideological spectrum. These words refer to topics one would expect in the language of liberal (or left-wing) parties in the United States. Conversely, terms like “bureaucracy,” “free enterprise,” and “red tape” are associated with the right. As for the second dimension, the keywords refer to Southern and Northern locations or industries associated with each region, which supports our interpretation of that axis as a South–North divide.

Table 1. Interpreting PCA axes.

We should emphasize that these lists may contain unexpected locutions, for instance, the “Missouri River” among the terms closest to the right edge of the first component. As argued earlier, political ideology cannot be reduced easily to any single idea or core. Attempting to summarize concepts such as liberal or conservative ideologies with individual words entails losing the context-dependent nature of semantics, which our model is designed to capture. For instance, some words contained in the lists of Table 1 may reflect idiosyncratic speaking styles of some Members of Congress on each side of the spectrum. Therefore, the interpretation ultimately involves the domain knowledge of the researcher to detect an overarching pattern. In this case, we believe the word associations provide relatively straightforward clues that facilitate a substantive interpretation of each axis. Since there are several different ways to explore relations between words and political actors in this model, an objective for future research would be to develop robust techniques for interpretation.

Figure 3. Party placement in Britain (1935–2014) and Canada (1901–2017). The figure shows the two principal components of party embeddings for the British and Canadian parliaments.

5.1.2 Replication in Parliamentary Democracies

Next, we illustrate that the methodology is generalizable across polities by replicating the same steps using the British and Canadian parliamentary corpora. To begin, Figure 3(a) reports a visualization of party embeddings for Britain. In addition to party and parliament indicators, the model includes a variable measuring whether an MP is a member of the cabinet or not. As can be seen, political parties are once again appropriately clustered together in the vector space: speeches made by members of the same group tend to resemble each other across parliaments. Moreover, the parties are correctly ordered relative to intuitive expectations about political ideology. Focusing on the first principal component ($x$-axis), the Labour party appears on one end of the spectrum, the Liberal-Democrats occupy the center, whereas the embeddings for Conservatives are clustered on the other side. In fact, without any intervention needed on our end, the model correctly captures well-accepted claims about ideological shifts within the British party system over time (see e.g. Clarke et al. 2004). For instance, the party embeddings for Conservatives during the Thatcher era (Cons 1979, 1983, and 1987) are located farther out on the right end of the axis, whereas Labour’s shift toward the center during the “New Labour” era (Labour 1997, 2001, and 2005), under the leadership of Tony Blair, is also apparent. The second component captures dynamics opposing the party in power and the opposition, with parties forming a government appearing at the top of the $y$-axis.

Finally, Figure 3(b) reports the results for Canada.[14] Once more, the first principal component can be readily interpreted in terms of left–right ideological placement. The Conservatives appear closer to the right, whereas the left-wing New Democratic Party (which is merged with its predecessor, the Co-operative Commonwealth Federation) is correctly located on the other end of the spectrum. The Reform/Canadian Alliance, a splinter from the Conservatives generally viewed as the most conservative political party in the Canadian system (see Cochrane 2010), appears at the extreme right of the first dimension, consistent with substantive expectations. In the case of Canada, the second principal component can be easily interpreted as a division between parties reflecting their views of the federation. The secessionist party, the Bloc Québécois, appears clustered on one end of the $y$-axis, whereas federalist parties are grouped on the other side. Such a division also resurfaces in models based on vote data (see Godbout and Høyland 2013; Johnston 2017).

5.1.3 Validity Assessments

To assess the external validity of estimates derived from our models, we evaluate the predicted ideological placement against gold standards: ideology scores based on roll-call votes (for the US), surveys of experts, and measures based on the CMP data. For each gold standard, we report the Pearson correlation coefficient with the first principal component of our party embeddings. We also report the pairwise accuracy, that is, the percentage of pairs of party placements that are consistently ordered relative to the gold standard. Pairwise accuracy accounts for all possible comparisons, within parties and across parties. Table 2 presents the results.

Table 2. Accuracy of party placement against gold standards.

The first gold standard used for the US is the average DW-NOMINATE score (first dimension) from the Voteview project for the House and Senate (Poole and Rosenthal 2007). Expert surveys are standardized measures of left–right party placement from three sources (Castles and Mair 1984; Huber and Inglehart 1995; Benoit and Laver 2006). The other three references are the Rile measure of party placement based on the 2017 version of the Comparative Manifestos Project (CMP) dataset (Budge and Laver 1992; Budge et al. 2001), the Vanilla measure of left–right placement (Gabel and Huber 2000), and the Legacy measure from Cochrane (2015). The pairwise accuracy metric counts the percentage of correct ideological orderings for all possible pairs of parties and parliaments.
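Pairwise accuracy has a direct implementation: enumerate all pairs of placements and count the share ordered the same way by both measures. The sketch below uses toy numbers; ties are ignored for simplicity.

```python
from itertools import combinations

def pairwise_accuracy(estimates, gold):
    """Share of pairs (i, j) ranked in the same order by both measures."""
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum(
        (estimates[i] - estimates[j]) * (gold[i] - gold[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)

# Toy example: three party-parliament placements against a gold standard.
print(pairwise_accuracy([-1.2, 0.1, 0.9], [-0.8, 0.3, 0.5]))  # 1.0
```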

Starting with the US corpus, we consider DW-NOMINATE estimates based on roll-call votes from the 67th to the 114th Congress, retrieved from the latest version of the Voteview project (Poole and Rosenthal 2007).[15] We use the first dimension of the aggregate Voteview indicator, which measures the average placement of Congress members by party over time. Our ideological placement is strongly correlated with the Voteview scores ($r\approx 0.92$), and the pairwise accuracy, a more conservative metric, is around 85% for both the House and the Senate. These tests provide preliminary support for the conclusion that our model produces reliable estimates of ideological placement.

To further validate our methodology, we rely upon a second set of benchmarks based on human evaluations, namely expert surveys. Such surveys have been conducted sporadically in the discipline, asking country experts to locate national parties on a left–right scale. The average expert position is commonly used to examine party ideology at fixed points in time (see Benoit and Laver 2006). We retrieved expert surveys from three different sources covering the three countries under study in this paper (Castles and Mair 1984; Huber and Inglehart 1995; Benoit and Laver 2006). Two points on measurement should be emphasized. First, since expert surveys come from different sources, they may vary in construction and measurement scales. As a result, we standardize the scores before combining data points from different sources. We compute $z$-scores for each of the three surveys by subtracting the mean for all parties across the three countries and then dividing by the standard deviation. Second, we should point out that expert surveys provide us with only a few data points, in contrast to the other benchmarks reported in Table 2. In the United States, we retrieved expert party placements from three sources, which means six data points (i.e. Democrats and Republicans at three different points in time).

The third and fourth rows of Table 2 report the Pearson correlation coefficient and pairwise accuracy of our ideological placement—again, the first principal component of the party embeddings—evaluated against expert surveys. For both the US House and Senate, the two goodness-of-fit scores suggest that our methodology produces ideology scores consistent with the views of experts. The correlation coefficients are very high ($r\approx 0.98$), and the pairwise accuracy reaches 100% for the Senate. Despite the low number of comparison points, validating against a different source helps to give further credence to the interpretation of our unsupervised method of party placement. Expert surveys represent a more challenging test for the other two countries, which contain multiple parties, hence additional data points. The fit with expert surveys in Canada and Britain remains very strong, however, with correlation coefficients near or above 0.9.

Finally, we validate our ideological placement variables using data from the CMP (Budge and Laver 1992; Budge et al. 2001). The CMP is based on party manifestos and relies on human annotations to score the orientation of political parties on a number of policy items, following a normalized scheme. We test whether the three ideology indicators derived from the project’s data are consistent with our estimated placement of the same party in the parliament that immediately follows. The Rile measure is the original left/right measure in the CMP. It is an additive index composed of 26 policy-related items, as described in Budge et al. (2001). The Rile metric, however, excludes important dimensions of ideology in the CMP, such as equality and the environment. The Vanilla measure proposed by Gabel and Huber (2000) uses all 56 items in the CMP and weights them according to their loadings on the first unrotated dimension of a factor analysis. Finally, the Legacy measure is a weighted index based on a network analysis of party positions and a model that assigns exponentially decaying weights to party positions in prior elections (Cochrane 2015).

Overall, the CMP-based indicators are consistent with our ideology scores, in particular when considering the Vanilla and Legacy measures. For the US, the correlation is positive but very modest when considering the more basic “right minus left” measure (Rile). The correlation reaches $r\approx 0.9$, however, when considering the Legacy score.[16] Looking at the British case, party placements appear positively correlated with the three external indicators, ranging from $r\approx 0.68$ for the Rile indicator up to $r\approx 0.76$ and $r\approx 0.88$ using more robust ideology metrics based on the same data. As for pairwise accuracy, between 75% and 83% of comparisons against CMP-based measures are ordered consistently. Using the Canadian corpus, the fit with the CMP data is also strong across the three gold standards. Overall, these results provide strong evidence that our party embeddings are related to left–right ideology as coded by humans using party manifestos.

Figure 4. Comparison with WordFish estimates. The figure reports WordFish estimates for Democratic and Republican parties in the US House of Representatives, fitted on the full corpus (a) or using five Congresses from 2007 to 2015 (b).

5.1.4 Comparison with WordFish

We now emphasize the properties of these embeddings using a comparison with estimates from the WordFish model.[17] The proposed methodology differs from WordFish in at least two important ways. Our model fits embeddings by predicting words from political variables and a window of context words; each word in the corpus represents a distinct unit of analysis. In contrast, WordFish relies on word frequencies tabulated within groupings of interest. The data required to fit the WordFish estimator can be contained in a term–document matrix (for full expositions of this model, see Slapin and Proksch 2008; Proksch and Slapin 2010; Lowe and Benoit 2013). This matrix representation loses the original sequence in which the terms were used in the document, and this information loss is where the two models fundamentally differ. A second difference is that WordFish produces a single estimate of party placement, whereas the embeddings contain a flexible number of dimensions, which can be projected onto lower-dimensional spaces, as illustrated above.

To illustrate these differences, we compute WordFish estimates on the entire US House corpus, restricting the vocabulary to the 50,000 most frequent terms.[18] We combined speeches by members of the same party into single “documents” for each Congress, following the approach used in Proksch and Slapin (2010). Figure 4(a) plots the estimated party positions for the full time period, which can be contrasted against our placement in Figure 2(b). When fitting the WordFish model on the full corpus, the estimator fails to capture the ideological distinctiveness of the parties. In fact, the dominant dimension appears to be the distinction between the language used in the late 19th century and that of more recent Congresses. Put another way, the estimates for both parties appear confounded by the change in language. Both parties are located next to each other over time, and the estimates bear no resemblance to benchmarks from either the Voteview project or the results based on our party embeddings.
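For reference, the aggregation step behind this comparison (one “document” per party–Congress) is a simple group-by over the speech table; the column names in this sketch are invented for illustration.

```python
import pandas as pd

# Hypothetical speech-level table with party and Congress metadata.
speeches = pd.DataFrame({
    "party": ["Dem", "Dem", "Rep"],
    "congress": [112, 112, 112],
    "text": ["civil rights", "decent housing", "free enterprise"],
})

# Concatenate all speeches from the same party within a Congress.
docs = (
    speeches.groupby(["party", "congress"])["text"]
    .apply(" ".join)
    .reset_index(name="document")
)
print(docs)
```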

The sharp contrast reflects a central property of the proposed methodology—and of word embeddings more generally—namely the ability to place equivalent expressions in proximity in a vector space.[19] Because the names and issues debated in early Congresses tend to be markedly different from those in recent Congresses, the Democrats (or Republicans) of the past may seem disconnected from the Democrats (Republicans) of the present when considering word usage alone. In contrast, word usage across parties within a specific time period will appear very similar, since both parties are debating the same topics. By accounting for the fact that different words can express the same meaning, a model based on word embeddings can capture continuity in a party’s position even when word usage changes over time. For example, the term “Moros” (referring to the Moro Rebellion in the Philippines in the late 19th and early 20th century) has a word embedding similar to that of the term “ISIL” (the Islamic State of Iraq and the Levant).[20] Even though ISIL was never mentioned in debates from the 19th century, the embedding model is trained to recognize that both terms share a related meaning.

When fitting the WordFish estimator on a shorter time span (the five most recent Congresses in the corpus), the model then places the two parties farther apart, and estimates become more consistent with the ideological divide expected between Democrats and Republicans (see Figure 4(b)). We calculate the accuracy of both implementations of the WordFish estimator against the Voteview benchmarks in Table 3. Although WordFish does not perform well for studying long periods of time, a shorter model produces estimates that are closer to those achieved with the first principal component of the proposed methodology. Both methods have benefits and limitations, and we should point out that an advantage of WordFish is that the model can be estimated on smaller corpora, as opposed to embedding models, which require more training data.

Table 3. Accuracy of party placement in the US House: WordFish and party embeddings.

Accuracy scores for WordFish estimates and party embeddings against the DW-NOMINATE scores (first dimension) from the Voteview project (Poole and Rosenthal 2007). The pairwise accuracy metric counts the percentage of correct ideological orderings for all possible pairs of parties and Congresses.

5.2 Legislator Embeddings

The methodology introduced in this paper can be adapted to the study of individual legislators, rather than parties. The specification is similar to that of the model used for political parties, with the difference that legislator–Congress indicator variables replace the party–Congress ones. We illustrate such an implementation by examining a model comprising embeddings fitted at the level of individual Senators in the US corpus. Since the number of speeches per legislator is lower than for a party as a whole, we compensate by increasing the number of epochs during training. The model discussed below is trained with 20 epochs.

Figure 5 reports the two principal components of the embeddings associated with individual Senators in the 114th Congress. The figure illustrates the expected division of Senators along party lines, although the clustering is nowhere near as tight as that typically obtained using scaling methods based on roll-call vote data. For instance, some Republican Senators appear mixed together with their Democratic counterparts in the center of the spectrum. This can be explained in part by the fact that language reflects a much wider variance in positions than binary vote data, an issue that could be explored in future research. The centrist Republicans in Figure 5 include Robert Jones Portman, Lamar Alexander, and Susan Collins, who were generally ranked as moderates in the external sources of ideology ratings discussed below.

Figure 5. Ideological placement of Senators (114th Congress).

To assess whether the low-dimensional projection captures political ideology, we proceed once more with a validation against external measures. By focusing on a recent Congress, we are able to use a variety of measures of ideological placement for US Senators: the Voteview scores for individual Senators, ratings from the American Conservative Union (ACU), and GovTrack.us ideology scores.[21] Table 4 reports Pearson correlation coefficients and pairwise accuracy scores using the first principal component of our model and each gold standard. Since our corpus is restricted to members affiliated with major party labels, this analysis excludes two independent Senators. For each gold standard considered, we obtain correlation coefficients of at least 0.85. The fit with the GovTrack ideology scores is particularly strong. Overall, these accuracy results support the conclusion that the first principal component extracted from our “Senator embeddings” is related to political ideology.

Table 4. Accuracy of Senator ideological placement.

The DW-NOMINATE score is the main indicator from the Voteview data, and the Nokken–Poole score is an alternative implementation based on Nokken and Poole (2004). The ACU ratings are for the year indicated or, alternatively, the lifelong rating attributed to each Senator. The methodology used for the GovTrack ideology scores is described in the source website.

5.3 Uncertainty Estimates

A natural question regarding models based on word embeddings is whether they can be complemented with measures of uncertainty, such as the standard errors and confidence intervals commonly reported in inference-based research. For instance, it would be convenient to express uncertainty around the predicted placement of a political party. Unfortunately, deriving measures of uncertainty for neural network models, even of relatively modest complexity, remains an active area of research (see Gentzkow, Kelly, and Taddy 2017). Bootstrap methods would be impractical given the size of the corpora and the computing time needed to fit each model. Machine learning aims at prediction rather than inference, and the preferred validation methods assess model specification using predictive accuracy metrics, as we did above, rather than measures of uncertainty (Mullainathan and Spiess 2017). However, solutions are available for producing measures of uncertainty around quantities of interest once the raw embeddings are accepted as point estimates. For example, when computing cosine similarities using lists of words, researchers working with word embeddings have relied on techniques such as permutation $t$-tests (Caliskan, Bryson, and Narayanan 2017) or bootstrap estimators (Garg et al. 2018).
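As a minimal sketch of the latter approach, in the spirit of Garg et al. (2018), one can resample a word list with replacement while holding the embeddings fixed as point estimates. The random vectors below merely stand in for real embeddings, and the function name is our own:

```python
import numpy as np

rng = np.random.default_rng(42)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def bootstrap_similarity(party_vec, word_vecs, n_boot=1000):
    """Bootstrap the similarity between a party vector and the mean vector
    of a word list by resampling the word list with replacement, treating
    the embeddings themselves as fixed point estimates."""
    word_vecs = np.asarray(word_vecs)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(word_vecs), len(word_vecs))
        stats.append(cosine(party_vec, word_vecs[idx].mean(axis=0)))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(np.mean(stats)), (lo, hi)

# Toy example with random vectors standing in for real embeddings:
party = rng.normal(size=50)
words = rng.normal(size=(20, 50))
mean_sim, ci = bootstrap_similarity(party, words)
print(mean_sim, ci)
```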

6 Intended Uses and Caveats

The proposed methodology can support research agendas in a number of subfields of political science. In comparative politics, scaling estimates based on roll-call votes are often problematic when exported outside the United States, in particular to countries with high levels of party discipline or where the executive sits in parliament. The party embedding model introduced here affords comparative scholars new opportunities to study party ideology whenever a collection of digitized parliamentary corpora is available. Since the model can be adapted to the study of individual legislators, the technique also provides an additional resource for research on intra-party conflict, a central focus of recent research involving textual analysis of parliamentary speeches (see e.g. Proksch and Slapin 2015; Bäck and Debus 2016; Lauderdale and Herzog 2016; Schwarz, Traber, and Benoit 2017; Proksch et al. 2018). The model is also well suited to research on political representation (Powell 2004; Bird 2010). In particular, party embeddings allow researchers to go beyond descriptive representation and examine how the language of political actors relates to issues and groups of interest. The results presented earlier suggest a connection between parties and minorities in the US Congress, and the method could be adapted for studies of the representation of other groups across legislatures. Finally, the model can serve as a basis for additional methodological tools for textual analysis that go beyond word counts, with the goal of studying political semantics.

Some words of caution are warranted, however, when interpreting text-based indicators of ideological placement. There are well-known limitations to scaling estimates based on roll-call votes, and these often carry over to applications based on textual data. For instance, strategic agenda-setting decisions may limit opportunities to vote on some issues, and nonrandom abstentions during votes can obscure the true ideal point of a legislator (Clinton 2012). In a similar fashion, the distribution of issues in parliamentary corpora may be constrained by who controls the agenda during a given legislature, and some legislators may strategically refrain from expressing their true opinions. These constraints on legislative debates also vary across polities: the daily order of business in the Canadian House of Commons, for instance, is dominated by the government, whereas other legislatures offer extended opportunities to opposition parties. Researchers working with the proposed methodology should gain familiarity with the traits characterizing a legislature and the behavioral constraints that may affect their conclusions.

7 Conclusion

Methods based on word embeddings have the potential to offer a useful, integrated framework for studying political texts. By relying on artificial neural networks that account for interactions between words and political variables, the methodology proposed in this paper is naturally suited to research on latent properties of political discourse such as ideology. Because the approach goes beyond word counts, it is better able to account for changes in word usage over time than existing methods. The present paper relied on a custom implementation of word embeddings trained on parliamentary corpora and augmented with input variables measuring party affiliation. As illustrated in our empirical section, the resulting “party embeddings” can be used to characterize the content of parliamentary speeches and to place political actors in a vector space representing ideological dimensions. Our validity assessments suggest that low-dimensional projections from the model are generally consistent with human-annotated indicators of ideology, expert surveys, and, in the case of Congress, roll-call vote measures. Judging by the wide adoption of word embeddings in real-world applications of natural language processing, we believe the technique is bound to gain popularity in the social sciences. In fact, this paper has probably only scratched the surface of the potential of word embeddings for political analysis.

Despite the promise of (augmented) word embeddings for political research, scholars should be wary of some limitations of the methodology. First, fitting the models requires a large corpus of text documents to fully benefit from their properties for semantic analysis. Corpora of digitized parliamentary debates appear reasonably large for this type of research, but smaller collections of texts may not achieve the same level of accuracy. Second, obtaining uncertainty estimates for the neural network models used to fit the embeddings requires further research. Although we have illustrated a variety of techniques for model validation, the methodology could eventually be implemented within frameworks such as Bayesian neural networks and variational inference (see e.g. MacKay 1992; Tran et al. 2017). This would facilitate the estimation of confidence intervals, although at this stage the computational cost remains prohibitive. Moreover, the quantities of interest to political scientists will typically be derivatives of the embeddings (for instance, a measure of similarity between words and parties), which implies compounding uncertainty estimates across the different steps involved. On the other hand, we have shown that the methodology can produce reliable indicators of party placement while requiring few arbitrary decisions, eliminating a source of variance associated with individual judgments.

Supplementary material

For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2019.26.

Footnotes

Authors’ note: We thank participants in the annual meeting of the Society for Political Methodology, the Canadian Political Science Association annual conference, the Advanced Computational Linguistics seminar at the University of Toronto, as well as anonymous reviewers for their helpful comments. Replication data is available through the Political Analysis Dataverse (Rheault and Cochrane 2019).

Contributing Editor: Jeff Gill

1 Examples of recent applications in political research include Rheault et al. (2016), Glavaš, Nanni, and Ponzetto (2017), and Preoţiuc-Pietro et al. (2017).

2 See also Lauderdale and Herzog (2016) for an extension of the method to legislative speeches and Kim, Londregan, and Ratkovic (2018) for an expanded model combining both text and votes.

3 Other examples of text-based methods for the detection of ideology based on word and phrase occurrences include Gentzkow and Shapiro (2010), Diermeier et al. (2012), and Jensen et al. (2012).

4 For the purpose of our presentation, we follow the steps of the model that Mikolov, Chen, et al. (2013) call continuous bag-of-words (CBOW).

5 In fact, Mikolov, Chen, et al. (2013) proposed two approaches: one in which the word embeddings are the link coefficients between input words and the hidden nodes (CBOW), and another in which the outcome and the inputs are switched (called skip-gram), in effect predicting the surrounding context from the word rather than the reverse.

6 The type of model described here is called “distributed memory” in the original article (Le and Mikolov 2014).

8 Our materials are available at Rheault and Cochrane (2019) and in a dedicated GitHub repository (https://github.com/lrheault/partyembed).

9 The algorithm used to detect phrases is based on the original implementation of word embeddings proposed by Mikolov, Sutskever, et al. (2013). Each pass combines pairs of words frequently used together into a single expression, for instance “united_kingdom.” By applying a second pass, one-word and two-word expressions can be merged, resulting in phrases of up to four words.
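A minimal sketch of this two-pass procedure, using the Phrases module of the gensim library (a version of 4 or later is assumed); the toy corpus and thresholds are chosen only so the merges fire, not to match the replication settings:

```python
from gensim.models.phrases import Phrases

# Toy corpus of tokenized speeches:
sentences = [
    ["united", "kingdom", "voted", "today"],
    ["united", "kingdom", "parliament", "met"],
    ["debate", "in", "united", "kingdom", "parliament"],
]

# First pass: merge frequent word pairs into expressions such as "united_kingdom".
bigrams = Phrases(sentences, min_count=1, threshold=0.5, delimiter="_")

# Second pass over the transformed corpus: merged expressions can combine
# with unigrams or other bigrams, yielding phrases of up to four words.
fourgrams = Phrases(bigrams[sentences], min_count=1, threshold=0.5, delimiter="_")

print(fourgrams[bigrams[["debate", "in", "united", "kingdom", "parliament"]]])
# -> ['debate', 'in', 'united_kingdom_parliament']
```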

10 Note that during estimation, we weight words according to their position in the context window, with the $t$th word weighted by a factor of $1/t$. At the beginning or end of a speech, the context window is truncated.
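Under one reading of this scheme (we assume here that $t$ indexes a context word's distance from the target word, so that nearer words contribute more), the weighted averaging of context vectors looks as follows; the function name is our own illustration:

```python
import numpy as np

def weighted_context_average(context_vectors):
    """Average context-word vectors with weights 1/t, where t is assumed to
    index a word's distance from the target within the (possibly truncated)
    context window."""
    context_vectors = np.asarray(context_vectors)
    weights = 1.0 / np.arange(1, len(context_vectors) + 1)
    return (weights[:, None] * context_vectors).sum(axis=0) / weights.sum()

# Sanity check: averaging identical vectors returns the same vector.
print(weighted_context_average(np.ones((3, 5))))
```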

11 In the online appendix, we demonstrate how users can create projections on a customized space for the study of political ideology and compute other quantities of interest from the raw embeddings such as indicators of party polarization.

12 We also present an alternative tool for interpretation in the online appendix.

13 To focus on common expressions, we restrict our attention to the 50,000 most frequent words and phrases of up to three words in the vocabulary.

14 For that country, our model includes variables measuring whether the MP making the speech belongs to the cabinet or not, whether they belong to the party in power or the opposition, and their province of origin.

15 Note that for Westminster systems, vote-based estimates are less reliable indicators of ideology, and we did not consider them as a benchmark.

16 Note that for the United States, the scores are only available for Presidential election years, whereas the corpus is divided into two-year Congresses. However, the CMP data are available from the election year 1920 in the US.

17 Grimmer and Stewart (2013) previously illustrated a limitation of WordFish when applied to individual legislators in Congress, a case in which the model does not perform well at capturing the ideological distance expected between members of opposite political parties.

18 We use the implementation available in the quanteda package in R, which is based on Lowe and Benoit (2013).

19 We are indebted to an anonymous reviewer for their helpful suggestions regarding this discussion.

20 The cosine similarity between the two terms’ embeddings is 0.48.

21 The ACU ratings for 2016 were retrieved from http://acuratings.conservative.org/acu-federal-legislative-ratings/ on August 29, 2018. GovTrack report cards are also for 2016 and were retrieved the same day at https://www.govtrack.us/congress/members/report-cards/2016/senate/ideology.

References

Bäck, H., and Debus, M. 2016. Political Parties, Parliaments and Legislative Speechmaking. New York: Palgrave Macmillan.
Beelen, K., Thijm, T. A., Cochrane, C., Halvemaan, K., Hirst, G., Kimmins, M., Lijbrink, S., Marx, M., Naderi, N., Polyanovsky, R., Rheault, L., and Whyte, T. 2017. “Digitization of the Canadian Parliamentary Debates.” Canadian Journal of Political Science 50(3):849–864.
Benoit, K., and Laver, M. 2006. Party Policy in Modern Democracies. New York: Routledge.
Bird, K. 2010. “Patterns of Substantive Representation Among Visible Minority MPs: Evidence from Canada’s House of Commons.” In The Political Representation of Immigrants and Minorities, edited by Bird, K., Saalfeld, T., and Wüst, A. M. New York: Routledge.
Bishop, C. M. 2006. Pattern Recognition and Machine Learning. New York: Springer.
Budge, I., Klingemann, H.-D., Volkens, A., Bara, J., and Tanenbaum, E. 2001. Mapping Policy Preferences: Estimates for Parties, Electors, and Governments (1945–1998). Oxford: Oxford University Press.
Budge, I., and Laver, M. J., eds. 1992. Party Policy and Government Coalitions. London: Palgrave Macmillan UK.
Caliskan, A., Bryson, J. J., and Narayanan, A. 2017. “Semantics Derived Automatically from Language Corpora Contain Human-Like Biases.” Science 356(6334):183–186.
Castles, F. G., and Mair, P. 1984. “Left–Right Political Scales: Some ‘Expert’ Judgments.” European Journal of Political Research 12(1):73–88.
Clarke, H. D., Sanders, D., Stewart, M. C., and Whiteley, P. 2004. Political Choice in Britain. Oxford: Oxford University Press.
Clinton, J. D. 2012. “Using Roll Call Estimates to Test Models of Politics.” Annual Review of Political Science 15:79–99.
Cochrane, C. 2010. “Left/Right Ideology and Canadian Politics.” Canadian Journal of Political Science 45(3):583–605.
Cochrane, C. 2015. Left and Right: The Small World of Political Ideas. Montreal and Kingston: McGill-Queen’s University Press.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41(6):391–407.
Denny, M. J., and Spirling, A. 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It.” Political Analysis 26(2):168–189.
Diermeier, D., Godbout, J.-F., Yu, B., and Kaufmann, S. 2012. “Language and Ideology in Congress.” British Journal of Political Science 42(1):31–55.
Freeden, M. 1998. Ideology and Political Theory: A Conceptual Approach. Oxford: Oxford University Press.
Gabel, M. J., and Huber, J. D. 2000. “Putting Parties in Their Place: Inferring Party Left–Right Ideological Positions from Party Manifestos Data.” American Journal of Political Science 44(1):94–103.
Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115(16):E3635–E3644.
Gentzkow, M., Kelly, B. T., and Taddy, M. 2017. “Text as Data.” NBER Working Paper w23276.
Gentzkow, M., and Shapiro, J. M. 2010. “What Drives Media Slant? Evidence from U.S. Daily Newspapers.” Econometrica 78(1):35–71.
Gentzkow, M., Shapiro, J. M., and Taddy, M. 2016. “Measuring Polarization in High-Dimensional Data: Method and Application to Congressional Speech.” NBER Working Paper 22423.
Glavaš, G., Nanni, F., and Ponzetto, S. P. 2017. “Cross-Lingual Classification of Topics in Political Texts.” In Proceedings of the 2017 ACL Workshop on Natural Language Processing and Computational Social Science, 42–46. Association for Computational Linguistics.
Godbout, J.-F., and Høyland, B. 2013. “The Emergence of Parties in the Canadian House of Commons (1867–1908).” Canadian Journal of Political Science 46(4):773–797.
Grimmer, J., and Stewart, B. M. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3):267–297.
Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning. Berlin: Springer.
Hirst, G., Riabinin, Y., Graham, J., Boizot-Roche, M., and Morris, C. 2014. “Text to Ideology or Text to Party Status?” In From Text to Political Positions: Text Analysis across Disciplines, edited by Kaal, B., Maks, I., and van Elfrinkhof, A., 93–116. Amsterdam: John Benjamins Publishing Company.
Hix, S., and Noury, A. 2016. “Government–Opposition or Left–Right? The Institutional Determinants of Voting in Legislatures.” Political Science Research and Methods 4(2):249–273.
Huber, J., and Inglehart, R. 1995. “Expert Interpretations of Party Space and Party Locations in 42 Societies.” Party Politics 1(1):73–111.
Iyyer, M., Enns, P., Boyd-Graber, J., and Resnik, P. 2014. “Political Ideology Detection Using Recursive Neural Networks.” In Proceedings of the 2014 Annual Meeting of the Association for Computational Linguistics, 1113–1122. Association for Computational Linguistics.
Jensen, J., Kaplan, E., Naidu, S., and Wilse-Samson, L. 2012. “Political Polarization and the Dynamics of Political Language: Evidence from 130 Years of Partisan Speech.” Brookings Papers on Economic Activity Fall:1–81.
Johnston, R. 2017. The Canadian Party System: An Analytic History. Vancouver: UBC Press.
Kim, I. S., Londregan, J., and Ratkovic, M. 2018. “Estimating Spatial Preferences from Votes and Text.” Political Analysis 26(2):210–229.
Lai, S., Liu, K., Xu, J., and Zhao, L. 2016. “How to Generate Good Word Embedding?” IEEE Intelligent Systems 31(6):5–14.
Lauderdale, B. E., and Herzog, A. 2016. “Measuring Political Positions from Legislative Speech.” Political Analysis 24(3):374–394.
Laver, M., Benoit, K., and Garry, J. 2003. “Extracting Policy Positions from Political Texts Using Words as Data.” American Political Science Review 97(2):311–331.
Le, Q., and Mikolov, T. 2014. “Distributed Representations of Sentences and Documents.” In Proceedings of the 31st International Conference on Machine Learning, edited by Xing, E. P., and Jebara, T., II-1188–II-1196. PMLR.
Levy, O., Goldberg, Y., and Dagan, I. 2015. “Improving Distributional Similarity with Lessons Learned from Word Embeddings.” Transactions of the Association for Computational Linguistics 3:211–225.
Lowe, W., and Benoit, K. 2013. “Validating Estimates of Latent Traits from Textual Data Using Human Judgment as a Benchmark.” Political Analysis 21(3):298–313.
MacKay, D. J. C. 1992. “A Practical Bayesian Framework for Backpropagation Networks.” Neural Computation 4(3):448–472.
Manning, C. D., Raghavan, P., and Schütze, H. 2009. An Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. 2013. “Distributed Representations of Words and Phrases and their Compositionality.” In Proceedings of the 26th International Conference on Neural Information Processing Systems, 3111–3119. Neural Information Processing Systems Foundation.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. “Efficient Estimation of Word Representations in Vector Space.” In Proceedings of Workshop at ICLR, 1–12. International Conference on Representation Learning.
Mullainathan, S., and Spiess, J. 2017. “Machine Learning: An Applied Econometric Approach.” Journal of Economic Perspectives 31(2):87–106.
Nay, J. J. 2016. “Gov2Vec: Learning Distributed Representations of Institutions and Their Legal Text.” In Proceedings of the 2016 EMNLP Workshop on Natural Language Processing and Computational Social Science, 49–54. Association for Computational Linguistics.
Nokken, T. P., and Poole, K. T. 2004. “Congressional Party Defection in American History.” Legislative Studies Quarterly 29(4):545–568.
Pennington, J., Socher, R., and Manning, C. D. 2014. “GloVe: Global Vectors for Word Representation.” In Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Association for Computational Linguistics.
Poole, K. T., and Rosenthal, H. L. 2007. Ideology and Congress. New York: Transaction Publishers.
Powell, G. B. 2004. “Political Representation in Comparative Politics.” Annual Review of Political Science 7(1):273–296.
Preoţiuc-Pietro, D., Liu, Y., Hopkins, D., and Ungar, L. 2017. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users.” In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 729–740. Association for Computational Linguistics.
Proksch, S.-O., and Slapin, J. B. 2010. “Position Taking in European Parliament Speeches.” British Journal of Political Science 40(3):587–611.
Proksch, S.-O., and Slapin, J. B. 2015. The Politics of Parliamentary Debate. Cambridge: Cambridge University Press.
Proksch, S.-O., Lowe, W., Wäckerle, J., and Soroka, S. 2018. “Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches.” Legislative Studies Quarterly 0(0):1–35.
Řehůřek, R., and Sojka, P. 2010. “Software Framework for Topic Modelling with Large Corpora.” In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. European Language Resources Association.
Rheault, L., and Cochrane, C. 2019. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” https://doi.org/10.7910/DVN/K0OYQF, Harvard Dataverse.
Rheault, L., Beelen, K., Cochrane, C., and Hirst, G. 2016. “Measuring Emotion in Parliamentary Debates with Automated Textual Analysis.” PLoS ONE 11(12):e0168843.
Schwarz, D., Traber, D., and Benoit, K. 2017. “Estimating Intra-Party Preferences: Comparing Speeches to Votes.” Political Science Research and Methods 5(2):379–396.
Shafer, B. E., and Johnston, R. 2009. The End of Southern Exceptionalism: Class, Race, and Partisan Change in the Postwar South. Cambridge: Harvard University Press.
Sim, Y., Acree, B. D. L., Gross, J. H., and Smith, N. A. 2013. “Measuring Ideological Proportions in Political Speeches.” In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 91–101. Association for Computational Linguistics.
Slapin, J. B., and Proksch, S.-O. 2008. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52(3):705–722.
Spirling, A., and McLean, I. 2007. “UK OC OK? Interpreting Optimal Classification Scores for the UK House of Commons.” Political Analysis 15(1):85–96.
Sundquist, J. L. 2011. Dynamics of the Party System. Washington, DC: Brookings Institution Press.
Taddy, M. 2013. “Multinomial Inverse Regression for Text Analysis.” Journal of the American Statistical Association 108(503):755–770.
Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., and Blei, D. M. 2017. “Deep Probabilistic Programming.” In Proceedings of the 5th International Conference on Learning Representations, 1–18.
Wittgenstein, L. 2009. Philosophical Investigations. West Sussex, UK: Blackwell.
