1 Introduction
The representation of meaning is a fundamental objective in natural language processing. As a basic illustration, consider queries performed with a search engine. We ideally want computers to return documents that are relevant to the substantive meaning of a query, just like a human being would interpret it, rather than simply the records with an exact word match. To achieve such results, early methods such as latent semantic analysis relied on low-rank approximations of word frequencies to score the semantic similarity between texts and rank them by relevance (Deerwester et al. 1990; Manning, Raghavan, and Schütze 2009, chap. 18). The new state of the art in meaning representation is word embeddings, or word vectors, the parameter estimates of artificial neural networks designed to predict the occurrence of a word by the surrounding words in a text sequence. Consistent with the use theory of meaning (Wittgenstein 2009), these embeddings have been shown to capture semantic properties of language, revealed by an ability to solve analogies and identify synonyms (Mikolov, Sutskever, et al. 2013; Pennington, Socher, and Manning 2014). Despite a broad appeal across disciplines, the use of word embeddings to analyze political texts remains a new field of research. The aim of this paper is to examine the reliability of the methodology for the detection of a latent concept such as political ideology. In particular, we show that neural networks for word embeddings can be augmented with metadata available in parliamentary corpora. We illustrate the properties of this method using publicly available corpora from Britain, Canada, and the United States and assess its validity using external indicators.
The proposed methodology addresses at least three shortcomings associated with textual indicators of ideological placement currently available. First, as opposed to measures based on word frequencies, the estimates from our neural networks are trained to predict the use of language in context. Put another way, the method accounts for a party’s usage of words, given the surrounding text. Second, our approach can easily accommodate control variables, factors that could otherwise confound the placement of parties or politicians. For example, we account for the government–opposition dynamics that have foiled ideology indicators applied to Westminster systems in the past and filter out their influence to achieve more accurate estimates of party placement. Third, the methodology allows us to map political actors and language in a common vector space. This means that we can situate actors of interest based on their proximity to political concepts. Using a single model of embeddings, researchers can rank political actors relative to these concepts using a variety of metrics for vector arithmetic. We demonstrate such implementations in our empirical section.
Our results suggest that word embeddings are a promising tool for expanding the possibilities of political research based on textual data. In particular, we find that scaling estimates of party placement derived from the embeddings for the metadata—which we call party embeddings—are strongly correlated with human-annotated and roll-call vote measures of left–right ideology. We compare the methodology with WordFish, which represents the most popular alternative for text-based scaling of political ideology. The two methods share similarities in that both can be estimated without the need for annotated documents. This comparison illustrates how embedding models are well suited to account for the evolution of language over time. In contrast to other text-based scaling methods, our methodology also allows researchers to map political actors in a multi-dimensional space. Finally, we demonstrate how the methodology can help to advance substantive research in comparative politics by replicating the model in two Westminster-style democracies. For such countries in particular, the ability to include control variables proves useful to account for the effect of institutional characteristics.
We start by situating our methodology in the broader literature on textual analysis in the next section. We then introduce the methodology concretely in Section 3. Section 4 discusses preprocessing steps and software implementation. Next, an empirical section illustrates concrete applications of the methodology: we demonstrate strategies for the retrieval of ideological placement, validate the results against external benchmarks, and compare the properties of the method against WordFish. We also illustrate an application at the legislator level and briefly address the question of uncertainty estimates for prospective users. Finally, we discuss intended uses and raise some warnings regarding interpretation.
2 Relations to Previous Work
Two of the most popular approaches in political science for the extraction of ideology from texts are WordScores (Laver, Benoit, and Garry 2003) and WordFish (Slapin and Proksch 2008). The first relies on a sample of labeled documents, for instance, party manifestos annotated by experts. The relative probabilities of word occurrences in the labeled documents serve to produce scores for each word, which can be viewed as indicators of their ideological load. Next, the scores can be applied to the words found in new documents to estimate their ideological placement. In fact, this approach can be compared to methods of supervised machine learning (Bishop 2006), where a computer is trained to predict the class of a labeled set of documents based on their observed features (e.g. their words). WordFish, on the other hand, relies on party annotations only. The methodology consists of fitting a regression model where word counts are projected onto party–year parameters, using an expectation maximization algorithm (Slapin and Proksch 2008). This approach avoids the reliance on expert annotations and amounts to estimating the specificity of word usage by party at different points in time.
Neither of these approaches, however, takes into account the role of words in context. Put another way, they ignore semantics. Although theoretically both WordScores and WordFish could be expanded to include $n$-grams (sequences of more than one word), this comes at an increased computational cost. There are so many different combinations of words in the English language that it rapidly becomes inefficient to count them. This problem has been addressed recently in Gentzkow, Shapiro, and Taddy (2016) and presented as a curse of dimensionality. Using a large number of words may be inefficient when tracking down ideological slants from textual data since a high feature–document ratio overstates the variance across the target corpora (Taddy 2013; Gentzkow, Shapiro, and Taddy 2016). But for a few exceptions (Sim et al. 2013; Iyyer et al. 2014), models of political language face a trade-off between ignoring the role of words in context and dealing with high-dimensional variables. Instead of relying on word frequencies, word embedding models aim to capture and represent relations between words using co-occurrences, which sidesteps the high feature–document ratio problems while allowing researchers to move beyond counts of words taken in isolation.
In the context of studies based on parliamentary debates, an additional concern is the ability to account for other institutional elements such as the difference in tone between the government and the opposition. A dangerous shortcut would consist of attributing any observed differences between party speeches to ideology. In Westminster-style parliaments, the cabinet will use a different vocabulary than opposition parties due to the nature of these legislative functions. For instance, opposition parties will invoke ministerial positions frequently when addressing their counterparts. Hirst et al. (2014) show that machine-learning models used to classify texts by ideology have a tendency to be confounded with government–opposition language. As a result, temporal trends can also be obscured by procedural changes in the way government and opposition parties interact in parliament. A similar issue has been found to affect methods based on roll-call votes to infer ideology, where government–opposition dynamics can dominate the first dimension of the estimates (Spirling and McLean 2007; Hix and Noury 2016). Our proposed methodology can accommodate the inclusion of additional “control variables” to filter out the effect of these institutional factors.
We argue that neural network models of language represent the natural way for scholars to move forward when attempting to measure a concept such as political ideology. The patterns of ideas that constitute ideologies are not as straightforward as is often assumed (Freeden 1998; Cochrane 2015). Ideologies are emergent phenomena that are tangible but irreducible. Ideologies are tangible in that they genuinely structure political thinking, disagreement, and behavior; they are irreducible, however, in that no one actor or idea, or group of actors or ideas, constitutes the core from which an ideology emanates. Second-degree connections between words and phrases give rise to meaningful clusters that fade from view when we analyze ideas, actors, or subsets of these things in isolation from their broader context. People know ideology when they see it but struggle to say precisely what it is because they cannot resolve the irresolvable (Cochrane 2015). This property of ideology has bedeviled past analysis. Since neural network models are designed to capture complex interactions between the inputs—in our case, context words and indicator variables for political actors—they are well adapted for the study of concepts that should theoretically emerge from such interactions.
The neural network models used in this paper are derived from word embeddings (Mikolov, Chen, et al. 2013; Pennington, Socher, and Manning 2014; Levy, Goldberg, and Dagan 2015). Such models have gained wide popularity in many disciplines, including, more recently, political science. For example, Preoţiuc-Pietro et al. (2017) use word embeddings to generate lists of topic-specific words automatically by exploiting the ability of embeddings to find semantic relations between words. Glavaš, Nanni, and Ponzetto (2017) utilize a word embedding model expanded to multiple languages as their main input to classify sentences from party manifestos by topic, which they evaluate against data from the Comparative Manifestos Project (CMP). Rheault et al. (2016) rely on word embeddings to automatically adapt sentiment lexicons to the domain of politics, which avoids problems associated with off-the-shelf sentiment dictionaries that often attribute an emotional valence to politically relevant words such as “health,” “war,” or “education.” In this study, we expand on traditional word embedding models to include political variables (for an illustration using political texts, see Nay 2016). We examine how such models can be used to study the ideological leanings of political actors across legislatures.
3 Methodology
Models for word embeddings have been explored thoroughly in the literature, but we need to introduce them summarily to facilitate the exposition of our approach. This section also adopts a notation familiar to social science scholars. Our implementation uses shallow neural networks, that is, statistical models containing one layer of latent variables—or hidden nodes—between the input and output data. The outcome variable $w_{t}$ is the word occurring at position $t$ in the corpus. The variable $w_{t}$ is multinomial with $V$ categories corresponding to the size of the vocabulary. The input variables in the model are the surrounding words appearing in a window $\Delta$ before and after the outcome word, which we denote as $\boldsymbol{w}_{\Delta}=(w_{t-\Delta},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+\Delta})$. The window is symmetrical to the left and to the right, which is the specification we use for this study, although nonsymmetrical windows are possible, for instance, if one wishes to give more consideration to the previous words in a sequence than to the following ones. Simply put, word embedding models consist of predicting $w_{t}$ from $\boldsymbol{w}_{\Delta}$.
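Before turning to the model itself, the construction of the input data can be made concrete with a short sketch. The function below is a minimal illustration rather than part of our released scripts; the name `context_windows` and the toy sentence are ours for exposition only.

```python
from typing import Iterator, List, Tuple

def context_windows(tokens: List[str], delta: int = 2) -> Iterator[Tuple[List[str], str]]:
    """Yield (context words w_Delta, target word w_t) for each position t."""
    for t, target in enumerate(tokens):
        left = tokens[max(0, t - delta):t]       # up to delta words before w_t
        right = tokens[t + 1:t + 1 + delta]      # up to delta words after w_t
        yield left + right, target

# Each pair is one training example: predict the target from its context.
for context, target in context_windows("we must reduce taxes on families".split()):
    print(context, "->", target)
```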
The neural network can be subdivided into two components. Let $z_{m}$ represent a hidden node, with $m\in\{1,\ldots,M\}$ and where $M$ is the dimension of the hidden layer. Each node can be expressed as a function of the inputs:

$$z_{m}=f(\boldsymbol{w}_{\Delta}^{\prime}\boldsymbol{\beta}_{m}).\qquad(1)$$
In machine learning, $f$ is called an activation function. In the case of word embedding models such as the one we rely upon, that function is simply the average value of $\boldsymbol{w}_{\Delta}^{\prime}\boldsymbol{\beta}_{m}$ across all input words (see Mikolov, Chen, et al. 2013). Since each word in the vocabulary can be treated as an indicator variable, Equation (1) can be expressed equivalently as

$$z_{m}=\frac{1}{2\Delta}\sum_{v\in\boldsymbol{w}_{\Delta}}\beta_{v,m},\qquad(2)$$
that is, a hidden node is the average of coefficients $\beta_{v,m}$ specific to a word $w_{v}$ if that word is present in the context of the target word $w_{t}$. In turn, the vector of hidden nodes $\boldsymbol{z}=(z_{1},\ldots,z_{M})$ is the average of the $M$-dimensional vectors of coefficients $\boldsymbol{\beta}_{v}$, for all words $v$ occurring in $\boldsymbol{w}_{\Delta}$:

$$\boldsymbol{z}=\frac{1}{2\Delta}\sum_{v\in\boldsymbol{w}_{\Delta}}\boldsymbol{\beta}_{v}.\qquad(3)$$
Upon estimation, these vectors $\boldsymbol{\beta}_{v}$ are the word embeddings of interest.
The remaining component of the model expresses the probability of the target word $w_{t}$ as a function of the hidden nodes. Similar to the multinomial logit regression framework, commonly used to model vote choice, a latent variable representing a specific word $i$ can be expressed as a linear function of the hidden nodes: $u_{it}^{\ast}=\alpha_{i}+\boldsymbol{z}^{\prime}\boldsymbol{\mu}_{i}$. The probability $P(w_{t}=i)$ given the surrounding words corresponds to

$$P(w_{t}=i\mid\boldsymbol{w}_{\Delta})=\frac{\exp(u_{it}^{\ast})}{\sum_{j=1}^{V}\exp(u_{jt}^{\ast})}.\qquad(4)$$
The full model can be written compactly using nested functions and dropping some indices for simplicity:

$$P(w_{t}\mid\boldsymbol{w}_{\Delta})=\operatorname{softmax}\big(\alpha+\boldsymbol{\mu}^{\prime}f(\boldsymbol{\beta}^{\prime}\boldsymbol{w}_{\Delta})\big).\qquad(5)$$
As can be seen with the visual depiction in Figure 1, the embeddings $\boldsymbol{\beta}$ link each input word to the hidden nodes. The parameters of the model can be fitted by minimizing the cross-entropy using variants of stochastic gradient descent. We rely on negative sampling to fit the predicted probabilities in Equation (4) (see Mikolov, Sutskever, et al. 2013). In an influential study, Pennington, Socher, and Manning (2014) have shown that a corresponding model can be represented as a log-bilinear Poisson regression using the word–word co-occurrence matrix of a corpus as data. However, the implementation we use here facilitates the inclusion of metadata by preserving individual words as units of analysis.
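For intuition, the forward pass described by Equations (1)–(4) can be written in a few lines of numpy. This is a toy rendering with a full softmax; as noted above, the actual estimation relies on negative sampling, and the sizes and random values below are placeholders.

```python
import numpy as np

V, M = 1000, 200                          # vocabulary size, hidden-layer dimension
rng = np.random.default_rng(0)
B = rng.normal(scale=0.1, size=(V, M))    # input embeddings, rows are beta_v
Mu = rng.normal(scale=0.1, size=(M, V))   # output weights mu_i
alpha = np.zeros(V)                       # intercepts alpha_i

context_ids = np.array([3, 17, 42, 99])   # indices of the words in w_Delta
z = B[context_ids].mean(axis=0)           # Equations (1)-(3): average of beta_v
u = alpha + z @ Mu                        # latent scores u*_it
p = np.exp(u - u.max())                   # Equation (4): softmax over the vocabulary
p /= p.sum()
print(p.argmax(), p.max())                # most probable target word and its probability
```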
The basic model introduced above can be expanded to include additional input variables, which is our main interest in this paper. A common implementation uses indicator variables for documents or segments of text of interest, in addition to the context words (Le and Mikolov 2014). The approach was originally called paragraph vectors or document vectors. More generally, other types of metadata can be entered in Equation (1) to account for properties of interest at the document level, which is the approach we adopt here. In our implementation, we focus primarily on indicator variables measuring the party affiliation of a member of parliament (MP) or a congressperson uttering a speech. The inner component of the expanded model can be represented as

$$\boldsymbol{z}=f(\boldsymbol{w}_{\Delta}^{\prime}\boldsymbol{\beta}+\boldsymbol{x}^{\prime}\boldsymbol{\zeta}),\qquad(6)$$
where $\boldsymbol{x}$ is a vector of metadata, and the rest of the specification is the same as before. In addition to party affiliation, it is straightforward to account for attributes with the potential to affect the use of language and confound party-specific estimates. We mentioned government status earlier (cabinet vs. noncabinet positions or party in power vs. opposition). For Canada, a country where federal politics is characterized by persistent regional divides, a relevant variable would be the province of the MP. Just like words have their embeddings, each variable entered in $\boldsymbol{x}$ has a corresponding vector $\boldsymbol{\zeta}$ of dimension $M$.
Observe that the resulting vectors $\boldsymbol{\zeta}$ have commonalities with the WordFish estimator of party placement. In their WordFish model, Slapin and Proksch (2008) predict word counts with party–year indicator variables. The resulting parameters are interpreted as the ideological placement of parties. The model introduced in (6) achieves a similar goal. The key difference is that our model is estimated at the word level, while taking into account the context ($\boldsymbol{w}_{\Delta}$) in which a word occurs. The hidden layer serves an important purpose by capturing interactions between the metadata and these context words. Moreover, the dimension of the hidden layer will determine the size of what we refer to as party embeddings in what follows, that is, the estimated parameters for each party. Rather than a single point estimate, we fit a vector of dimension $M$. A benefit is that these party embeddings can be compared against the rest of the corpus vocabulary in a common vector space, as we illustrate below.
Specifically, our implementation uses party–parliament pairs as indicator variables for a number of reasons. First, fitting combinations of parties and time periods allows us to reproduce the nature of the WordFish model as closely as possible: each party within a given parliament or Congress has a specific embedding. This approach has the benefit of accounting for the possibility that the language and issues debated by each party may evolve from one parliament to the next. Parties are allowed to “move” over time in the vector space. We rely on parliaments/Congresses, rather than years, to facilitate external validity tests against roll-call vote measures and annotations based on party manifestos, which are published at the election preceding the beginning of each parliament. Of course, the possible specifications are virtually endless and may differ in future applications. But we believe that the models we present are consistent with existing practice and provide a useful basis for a detailed assessment.
4 Data and Software Implementation
Models of word embeddings have been shown to perform best when fitted on large corpora that are adapted to the domain of interest (Lai et al. 2016). For the purpose of this study, we rely on three publicly available collections of digitized parliamentary debates overlapping a century of history in the United States, Britain, and Canada. Replicating the results in three polities helps to demonstrate that the proposed methodology is general in application. The United States corpus is the version released by Gentzkow, Shapiro, and Taddy (2016). Our version of the British Hansard corpus is hosted on the Political Mashup website. Finally, the Canadian Hansard corpus is described in Beelen et al. (2017) and released as linked and open data on www.lipad.ca. Each resource is enriched with a similar set of metadata about speakers, such as party affiliations and functions. The first section of the online appendix describes each resource in more detail.
We considered speeches made by the major parties in each corpus. The United States corpus ranges from 1873 to 2016 (43rd to 114th Congress). We present results for the House of Representatives and the Senate separately and restrict our attention to voting members affiliated with the Democratic and Republican parties. For the United Kingdom, the corpus covers the period from 1935 to 2014. We restrict our focus to the three major party labels: Labour, Liberal-Democrats, and Conservatives. For Canada, we use the entirety of the available corpus, which covers a period ranging between 1901 and 2018, from the 9th to the 42nd Parliament. The corpus represents over 3 million speeches after restricting our attention to five major parties (Conservatives, Liberals, New Democratic Party, Bloc Québécois, and Reform Party/Canadian Alliance). We removed speeches from the Speaker of the House of Commons in Britain and Canada, whose role is nonpartisan.
4.1 Steps for Implementation
We fit the embedding models on each corpus using custom scripts based on the Python library of Řehůřek and Sojka (2010), which builds upon the implementation of document vectors proposed by Le and Mikolov (2014). That implementation itself extends the original code for the word embeddings model proposed by Mikolov, Sutskever, et al. (2013). The library uses asynchronous stochastic gradient descent and relies on the method of negative sampling to fit the softmax function at the output of the neural networks. Our scripts are released openly, and the source code for the aforementioned library is also available publicly. Fitting the models involves some decisions pertaining to text preprocessing and the choice of hyperparameters. We discuss each of these decision processes in turn.
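To fix ideas, the sketch below shows how speeches can be packaged in the input format expected by the library's Doc2Vec interface, with one tag for the party–parliament pair and one for the parliament itself (the latter absorbing period-specific language, as in the models reported below). The tag labels and toy speeches are illustrative conventions, not the exact scheme of our released scripts.

```python
from gensim.models.doc2vec import TaggedDocument

speeches = [
    {"tokens": ["reducing", "taxes", "helps", "families"],
     "party": "Rep", "congress": 112},
    {"tokens": ["civil_rights", "must", "be", "protected"],
     "party": "Dem", "congress": 112},
]

# One party-Congress tag and one Congress tag per speech.
tagged = [
    TaggedDocument(words=s["tokens"],
                   tags=[f"{s['party']}_{s['congress']}", f"Cong_{s['congress']}"])
    for s in speeches
]
```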
4.1.1 Text Preprocessing
Preprocessing decisions when working with textual data may have nontrivial consequences (see Denny and Spirling 2018). Word embedding models are often implemented with little text preprocessing. However, one of the most popular implementations, that of Mikolov, Sutskever, et al. (2013), relies on subsampling—the random removal of frequent words from the context window during estimation. Subsampling essentially achieves a goal similar to stop word removal by limiting the influence of overly frequent tokens. This strategy was shown to improve the accuracy of word embeddings for tasks involving semantic relations. In our case, we want the models to learn information about terms that are substantively meaningful for politics. Simply put, the co-occurrence of the terms “reducing” and “taxes” is more meaningful for learning political language than that of the terms “the” and “taxes.”
For this study, we preprocessed the text by removing digits and words with two letters or fewer, as well as a list of English stop words enriched to remove overly common procedural words such as “speaker” (or “chairwoman/chairman” in the United States), used in most speeches due to decorum. Even though our corpora comprise hundreds of millions of words, they are smaller than the corpora used in the original implementations of word embeddings. The removal of these common words ensures that we observe many instances of substantively relevant words used in the same context during the training stage. We tested models with and without stop words removed. Although removing procedural words only has a marginal effect on our methodology, we find that the removal of English stop words does improve the accuracy of our models for tasks such as ideology detection. We also limit the vocabulary to tokens with a minimum count of 50 occurrences. This avoids fitting embeddings for words with few training examples.
Finally, our models include not only words but also common phrases. We proceed with two passes of a collocation detection algorithm, which is applied to each corpus prior to estimation. Collocations (words used frequently together) are merged as single entities, which means that with two passes of the algorithm, we capture phrases of up to four words. Since phrases longer than four words are very sparse, we stop after two runs of collocation detection on each corpus. Although not entirely necessary for the methodology, we find that the inclusion of phrases facilitates interpretation for political research, where multi-word entities are frequent and common expressions may have specific meanings (e.g. “civil rights”). The online appendix includes a table with the most frequent phrases for each corpus.
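The preprocessing pipeline can be sketched as follows. The stop word list is truncated for brevity and the thresholds are illustrative; in practice we apply the enriched list described above and a minimum count of 50.

```python
import re
from gensim.models.phrases import Phrases

PROCEDURAL = {"speaker", "chairman", "chairwoman"}               # procedural terms (partial)
STOPWORDS = {"the", "and", "that", "will", "this"} | PROCEDURAL  # placeholder stop list

def preprocess(speech: str):
    tokens = re.findall(r"[a-z]+", speech.lower())   # drops digits and punctuation
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]

raw_speeches = [
    "Mr. Speaker, reducing taxes will help the families of this country.",
    "The red tape of this bureaucracy burdens free enterprise.",
]
corpus = [preprocess(s) for s in raw_speeches]

bigrams = Phrases(corpus, min_count=50)     # first pass: merges two-word collocations
corpus = [bigrams[doc] for doc in corpus]
fourgrams = Phrases(corpus, min_count=50)   # second pass: phrases of up to four words
corpus = [fourgrams[doc] for doc in corpus]
```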
4.1.2 Fitting the Model
For the most part, we implemented our models using hyperparameters set at default values in the original algorithms proposed by Mikolov, Sutskever, et al. (2013) and Le and Mikolov (2014). We fit each model with a learning rate of 0.025 and five epochs—that is, five passes over all speeches in each corpus. Previous work using word embeddings often relies on hidden layers of size 100, 200, or 300. The main text reports models with hidden layers of 200 nodes, which we find to be reliable for applied research. The online appendix provides additional information on parameterization and its influence on the output. In essence, we find that modifications to these default hyperparameters do not provide substantial improvements to the results presented in this paper. Hence, choosing the values mentioned above appears to be a reasonable starting point and avoids overfitting the parameters to the characteristics of a specific corpus in ways that may not generalize over time.
The only departure from default hyperparameters is the choice of a window size. We rely on a window $\Delta$ of $\pm 20$ words. In contrast, implementations in the field of computer science often have window sizes of 5 or 10 words (see e.g. Levy, Goldberg, and Dagan 2015). This choice is based on our own examination of parliamentary corpora. We find that the discursive style of members of parliament is more verbose than the language typically found on the web. The chosen size roughly corresponds to the average length of a sentence in the British corpus (19.56 tokens per sentence on average) and the Canadian one (20.62 tokens per sentence). Moreover, the topics discussed in individual speeches tend to expand over several sentences. As a result, a window of 20 context words takes into account information from the previous and following sentences. We find this choice to be appropriate for our methodology. Our online appendix also reports sensitivity tests regarding the window size.
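Putting these pieces together, the fitting stage reduces to a single call. The sketch below uses the hyperparameters reported in this section and assumes the `tagged` documents constructed earlier; incidental settings such as the number of workers are arbitrary, and gensim version 4 or later is assumed.

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    documents=tagged,   # TaggedDocument objects with party-parliament tags
    dm=1,               # distributed-memory model: context words plus document tags
    dm_mean=1,          # average (rather than sum) the input vectors, as in Equation (1)
    vector_size=200,    # M, the dimension of the hidden layer
    window=20,          # Delta: 20 context words on each side
    min_count=50,       # discard tokens with fewer than 50 occurrences
    alpha=0.025,        # initial learning rate
    negative=5,         # negative sampling
    epochs=5,           # five passes over the corpus
    workers=4,
)

party_vec = model.dv["Dem_112"]   # a party embedding zeta, of dimension M
word_vec = model.wv["taxes"]      # a word embedding beta, of dimension M
```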
5 Empirical Illustrations
Upon estimation of the models, the embeddings can be used to compute several quantities of interest. In this section, we emphasize an approach to ideological scaling based on low-dimensional projections from the party embeddings. We introduce tools facilitating interpretation and discuss validity assessments against three external sources of data on party ideology. Next, we compare the method against WordFish to emphasize a desirable property of word embedding models: the ability to account for changes in word usage over time. Finally, we illustrate how the methodology can be applied to individual legislators and discuss the question of uncertainty measures.
5.1 Party Embeddings
We start by assessing the models fitted with party-specific indicator variables. The objective is to project the $M$ -dimensional party embeddings into a substantively meaningful vector space. These party embeddings can be visualized in two dimensions using standard dimensionality reduction techniques such as principal component analysis (PCA), which we rely upon in this section.
In plain terms, PCA finds the one-dimensional component that maximizes the variance across the vectors of party embeddings (see e.g. Hastie, Tibshirani, and Friedman 2009, chap. 14.5). The next component is calculated the same way by imposing a constraint of zero covariance with the first component. Additional components could be computed, but our analysis focuses on two-dimensional projections to simplify visualizations. If the speeches made by members of different parties are primarily characterized by ideological conflicts, as is normally assumed in unsupervised techniques for ideology scaling, we can reasonably expect the first component to describe the ideological placement of parties. The second component will capture the next most important source of semantic variation across parties in a legislature. To facilitate the interpretation of these components, we can use the model’s word embeddings and examine the concepts most strongly associated with each dimension.
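Concretely, the projection takes a few lines with scikit-learn. The sketch assumes a fitted `model` with the tagging scheme used earlier; the Congress dummies are excluded so that only party–parliament embeddings enter the PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Collect party-parliament embeddings, leaving out the Congress dummy tags.
tags = [t for t in model.dv.index_to_key if not t.startswith("Cong_")]
X = np.vstack([model.dv[t] for t in tags])

pca = PCA(n_components=2)
coords = pca.fit_transform(X)               # first two principal components

for tag, (x, y) in zip(tags, coords):
    print(f"{tag}: ({x:.2f}, {y:.2f})")     # flip the sign of an axis if needed
```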
Starting with the US corpus, Figure 2(a) plots the party embeddings in a two-dimensional space for the House of Representatives. We label some data points using an abbreviation of the party name and the beginning year of a Congress; for instance, the embedding $\boldsymbol{\zeta}_{\text{Dem 2011}}$ means the Democratic party in the Congress starting in 2011 (the 112th Congress). The only adjustment that may be relevant to perform is orienting the scale in a manner intuitive for interpretation, for instance, by multiplying the values of a component by $-1$ such that conservative parties appear on the right. Each model includes party–Congress indicator variables as well as separate dummy variables for Congresses, which account for temporal changes in the discourse. To facilitate visualization of the results, panels (b) and (c) in Figure 2 plot the party embeddings as time series, respectively, for the first and second component.
Our methodology captures ideological shifts that occurred during the 20th century. Although both major parties were originally close to the center of the first dimension, which we interpret as the left–right or liberal–conservative divide, they begin to drift apart around the New Deal era in the 1930s, a period usually associated with the fifth party system. The trend culminates with a period of marked polarization from the late 1990s to the most recent Congresses. The most spectacular shift is probably the one occurring on the second dimension, which we interpret as a South–North divide (we oriented the South to the bottom and the North to the top). The change reflects a well-documented realignment between Northern and Southern states that occurred since the New Deal and the civil rights eras (Shafer and Johnston 2009; Sundquist 2011). A similar trajectory is manifested using both the House and the Senate corpora (we report equivalent figures for the Senate in the online appendix). Although Republicans initially became associated with issues of Northern states, the opposite is true today—the two parties eventually switched sides completely. The recent era appears particularly polarized on both axes, which is consistent with a body of literature documenting party polarization. On the other hand, we do not find evidence of polarization on the principal component in the late 19th century, contrary to indicators based on vote data (Poole and Rosenthal 2007) but consistent with Gentzkow, Shapiro, and Taddy (2016).
5.1.1 Interpreting Axes
The models have desirable properties for interpreting the principal components in substantive terms. Since both words and party indicators are used as inputs in the same neural networks, we can project the embeddings associated with each word from the corpus vocabulary onto the principal components just estimated. Next, we can rank the words based on their distance from specific coordinates. For instance, the point $(10,0)$ in Figure 2 is located on the right end of the first component. The words and phrases closest to that location can help researchers to interpret the meaning of that axis.
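In code, the ranking amounts to projecting the vocabulary onto the fitted components and sorting by distance. A sketch, reusing `model` and `pca` from above, with the point $(10,0)$ from the example in the text:

```python
import numpy as np

words = model.wv.index_to_key
W = pca.transform(np.vstack([model.wv[w] for w in words]))  # words in the party space

point = np.array([10.0, 0.0])               # right end of the first component
dist = np.linalg.norm(W - point, axis=1)    # Euclidean distance to the cardinal point
for i in np.argsort(dist)[:20]:             # the 20 closest expressions
    print(words[i], round(float(dist[i]), 2))
```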
To illustrate, Table 1 reports the 20 expressions with the shortest Euclidean distances to the four cardinal points of the two-dimensional space for the US House of Representatives. We use the minimum and maximum values of the party projections on each axis to determine the cardinal points. Expressions located closest to the left end of the first component comprise “civil rights,” “racism,” “decent housing,” and “poorest,” indicating that these terms are semantically associated with this end of the ideological spectrum. These words refer to topics one would expect in the language of liberal (or left-wing) parties in the United States. Conversely, terms like “bureaucracy,” “free enterprise,” and “red tape” are associated with the right. As for the second dimension, the keywords refer to Southern and Northern locations or industries associated with each region, which supports our interpretation of that axis as a South–North divide.
We should emphasize that these lists may contain unexpected locutions, for instance, the “Missouri River” among the terms closest to the right edge of the first component. As argued earlier, political ideology cannot be reduced easily to any single idea or core. Attempting to summarize concepts such as liberal or conservative ideologies with individual words entails losing the context-dependent nature of semantics, which our model is designed to capture. For instance, some words contained in the lists of Table 1 may reflect idiosyncratic speaking styles of some Members of Congress on each side of the spectrum. Therefore, the interpretation ultimately involves the domain knowledge of the researcher to detect an overarching pattern. In this case, we believe the word associations provide relatively straightforward clues that facilitate a substantive interpretation of each axis. Since there are several different ways to explore relations between words and political actors in this model, an objective for future research would be to develop robust techniques for interpretation.
5.1.2 Replication in Parliamentary Democracies
Next, we illustrate that the methodology is generalizable across polities by replicating the same steps using the British and Canadian parliamentary corpora. To begin, Figure 3(a) reports a visualization of party embeddings for Britain. In addition to party and parliament indicators, the model includes a variable measuring whether an MP is a member of the cabinet or not. As can be seen, political parties are once again appropriately clustered together in the vector space: speeches made by members of the same group tend to resemble each other across parliaments. Moreover, the parties are correctly ordered relative to intuitive expectations about political ideology. Focusing on the first principal component ($x$-axis), the Labour party appears on one end of the spectrum, the Liberal-Democrats occupy the center, whereas the embeddings for Conservatives are clustered on the other side. In fact, without any intervention needed on our end, the model correctly captures well-accepted claims about ideological shifts within the British party system over time (see e.g. Clarke et al. 2004). For instance, the party embeddings for Conservatives during the Thatcher era (Cons 1979, 1983, and 1987) rank farther out on the right end of the axis, whereas Labour’s shift toward the center during the “New Labour” era (Labour 1997, 2001, and 2005), under the leadership of Tony Blair, is also apparent. The second component captures dynamics opposing the party in power and the opposition, with parties forming a government appearing at the top of the $y$-axis.
Finally, Figure 3(b) reports the results for Canada. Once more, the first principal component can be readily interpreted in terms of the left–right ideological placement. The Conservatives appear closer to the right, whereas the left-wing New Democratic Party (which is merged with its predecessor, the Co-operative Commonwealth Federation) is correctly located on the other end of the spectrum. The Reform/Canadian Alliance split from the Conservatives, generally viewed as the most conservative political party in the Canadian system (see Cochrane 2010), appears at the extreme right of the first dimension, consistent with substantive expectations. In the case of Canada, the second principal component can be easily interpreted as a division between parties reflecting their views of the federation. The secessionist party, the Bloc Québécois, appears clustered on one end of the $y$-axis, whereas federalist parties are grouped on the other side. Such a division also resurfaces in models based on vote data (see Godbout and Høyland 2013; Johnston 2017).
5.1.3 Validity Assessments
To assess the external validity of estimates derived from our models, we evaluate the predicted ideological placement against gold standards: ideology scores based on roll-call votes (for the US), surveys of experts, and measures based on the CMP data. For each gold standard, we report the Pearson correlation coefficient with the first principal component of our party embeddings. We also report the pairwise accuracy, that is, the percentage of pairs of party placements that are consistently ordered relative to the gold standard. Pairwise accuracy accounts for all possible comparisons, within parties and across parties. Table 2 presents the results.
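Both measures are straightforward to compute. The sketch below uses toy numbers in place of our estimates and gold standards:

```python
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

def pairwise_accuracy(estimates, gold):
    """Share of all pairs whose ordering matches the gold standard."""
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((estimates[i] - estimates[j]) * (gold[i] - gold[j]) > 0
                for i, j in pairs)
    return agree / len(pairs)

estimates = np.array([-1.2, -0.8, 0.1, 0.9, 1.4])   # first principal component (toy)
gold = np.array([-0.9, -1.0, 0.2, 0.7, 1.1])        # e.g. DW-NOMINATE averages (toy)

r, _ = pearsonr(estimates, gold)
print(round(r, 2), round(pairwise_accuracy(estimates, gold), 2))
```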
The first gold standard used for the US is the average DW-NOMINATE score (first dimension) from the Voteview project for House and Senate (Poole and Rosenthal 2007). Expert surveys are standardized measures of left–right party placement from three sources (Castles and Mair 1984; Huber and Inglehart 1995; Benoit and Laver 2006). The other three references are the Rile measure of party placement based on the 2017 version of the Comparative Manifestos Project (CMP) dataset (Budge and Laver 1992; Budge et al. 2001), the Vanilla measure of left–right placement (Gabel and Huber 2000), and the Legacy measure from Cochrane (2015). The pairwise accuracy metric counts the percentage of correct ideological orderings for all possible pairs of parties and parliaments.
Starting with the US corpus, we consider DW-NOMINATE estimates based on roll-call votes from the 67th to the 114th Congress and retrieved from the latest version of the Voteview project (Poole and Rosenthal 2007). We use the first dimension of the aggregate Voteview indicator, which measures the average placement of Congress members by party over time. Our ideological placement is strongly correlated with the Voteview scores ($r\approx 0.92$), and the pairwise accuracy, a more conservative metric, is around 85% for both the House and the Senate. These tests provide preliminary support to the conclusion that our model produces reliable estimates of ideological placement.
To further validate our methodology, we rely upon a second set of benchmarks based on human evaluations, namely expert surveys. Such surveys have been conducted sporadically in the discipline, asking country experts to locate national parties on a left–right scale. The average expert position is commonly used to examine party ideology at fixed points in time (see Benoit and Laver 2006). We retrieved expert surveys from three different sources covering the three countries under study in this paper (Castles and Mair 1984; Huber and Inglehart 1995; Benoit and Laver 2006). Two points on measurement should be emphasized. First, since expert surveys come from different sources, they may vary in construction and measurement scales. As a result, we standardize the scores before combining data points from different sources. We compute $z$-scores for each of the three surveys by subtracting the mean for all parties across the three countries and then dividing by the standard deviation. Second, we should point out that expert surveys provide us with only a few data points, in contrast to the other benchmarks reported in Table 2. In the United States, we retrieved expert party placements from three sources, which means six data points (i.e. Democrats and Republicans at three different points in time).
The third and fourth rows of Table 2 report the Pearson correlation coefficient and pairwise accuracy of our ideological placement—again, the first principal component of the party embeddings—evaluated against expert surveys. For both the US House and Senate, the two goodness-of-fit scores suggest that our methodology produces ideology scores consistent with the views of experts. The correlation coefficients are very high ($r\approx 0.98$), and the pairwise accuracy reaches 100% for the Senate. Despite the low number of comparison points, validating with a different source helps to give further credence to the interpretation of our unsupervised method of party placement. Expert surveys represent a more challenging test for the other two countries, which contain multiple parties, hence additional data points. The fit with expert surveys in Canada and Britain remains very strong, however, with correlation coefficients near or above 0.9.
Finally, we validate our ideological placement variables using data from the CMP (Budge and Laver 1992; Budge et al. 2001). The CMP is based on party manifestos and relies on human annotations to score the orientation of political parties on a number of policy items, following a normalized scheme. We test whether the three ideology indicators derived from the project’s data are consistent with our estimated placement of the same party in the parliament that immediately follows. The Rile measure is the original left/right measure in the CMP. It is an additive index composed of 26 policy-related items, as described in Budge et al. (2001). The Rile metric, however, excludes important dimensions of ideology of the CMP, such as equality and the environment. The Vanilla measure proposed by Gabel and Huber (2000) uses all 56 items in the CMP and weights them according to their loadings on the first unrotated dimension of a factor analysis. Finally, the Legacy measure is a weighted index based on a network analysis of party positions and a model that assigns exponentially decaying weights to party positions in prior elections (Cochrane 2015).
Overall, the CMP-based indicators are consistent with our ideology scores, in particular when considering the Vanilla and Legacy measures. For the US, the correlation is positive but very modest when considering the more basic “right minus left” measure (Rile). The correlation reaches $r\approx 0.9$, however, when considering the Legacy score. Looking at the British case, party placements appear positively correlated with the three external indicators, ranging from $r\approx 0.68$, when considering the Rile indicator, up to $r\approx 0.76$ and $r\approx 0.88$ using more robust ideology metrics based on the same data. As for pairwise accuracy, between 75% and 83% of comparisons against CMP-based measures are ordered consistently. Using the Canadian corpus, the fit with the CMP data is also strong across the three gold standards. Overall, these results provide strong evidence that our party embeddings are related to left–right ideology as coded by humans using party manifestos.
5.1.4 Comparison with WordFish
We now emphasize the properties of these embeddings using a comparison with estimates from the WordFish model. The proposed methodology differs from WordFish in at least two important ways. Our model fits embeddings by predicting words from political variables and a window of context words. Each word in the corpus represents a distinct unit of analysis. In contrast, WordFish relies on word frequencies tabulated within groupings of interest. The data required to fit the WordFish estimator can be contained in a term–document matrix (for full expositions of this model, see Slapin and Proksch 2008; Proksch and Slapin 2010; Lowe and Benoit 2013). This matrix representation loses the original sequence in which the terms were used in the document, and the information loss is where the two models fundamentally differ. A second difference is that WordFish produces a single estimate of party placement, whereas the embeddings contain a flexible number of dimensions, which can be projected onto lower dimensional spaces, as illustrated above.
To illustrate these differences, we compute WordFish estimates on the entire US House corpus, restricting the vocabulary to the 50,000 most frequent terms. We combined speeches by members of the same party into single “documents” for each Congress, following the approach used in Proksch and Slapin (2010). Figure 4(a) plots the estimated party positions for the full time period, which can be contrasted against our placement in Figure 2(b). When fitting the WordFish model on the full corpus, the estimator fails to capture the ideological distinctiveness of the parties. In fact, the dominant dimension appears to be the distinction between the language used in the late 19th century and that of more recent Congresses. Put another way, the estimates for both parties appear confounded by the change in language. Both parties are located next to each other over time, and the estimates bear no resemblance to benchmarks from either the Voteview project or the results based on our party embeddings.
The sharp contrast reflects a central property of the proposed methodology—and of word embeddings more generally—namely the ability to place equivalent expressions at proximity in a vector space. Because the names and issues debated in early Congresses tend to be markedly different from those used in recent Congresses, the Democrats (or Republicans) of the past may seem disconnected from the Democrats (Republicans) of the present when considering word usage alone. In contrast, word usage across parties within a specific time period will appear very similar since both parties are debating the same topics. By accounting for the fact that different words can express the same meaning, a model based on word embeddings can account for continuity in a party’s position even when word usage changes over time. For example, the term “Moros” (referring to the Moro Rebellion in the Philippines, an insurrection of the late 19th and early 20th centuries) has a word embedding similar to that of the term “ISIL” (the Islamic State of Iraq and the Levant). Even though ISIL was never mentioned in debates from the 19th century, the embedding model is trained to recognize that both terms share a related meaning.
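Such relations are easy to inspect from a fitted model: word similarities are cosine similarities between embeddings. The token spellings below are assumptions about the preprocessed vocabulary:

```python
# Cosine similarity between two terms, and the nearest neighbours of one of them.
print(model.wv.similarity("moros", "isil"))
print(model.wv.most_similar("isil", topn=10))
```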
When fitting the WordFish estimator on a shorter time span (the five most recent Congresses in the corpus), the model then places the two parties farther apart, and estimates become more consistent with the ideological divide expected between Democrats and Republicans (see Figure 4(b)). We calculate the accuracy of both implementations of the WordFish estimator against the Voteview benchmarks in Table 3. Although WordFish does not perform well for studying long periods of time, a shorter model produces estimates that are closer to those achieved with the first principal component of the proposed methodology. Both methods have benefits and limitations, and we should point out that an advantage of WordFish is that the model can be estimated on smaller corpora, as opposed to embedding models, which require more training data.
Accuracy scores for WordFish estimates and party embeddings against the DW-NOMINATE scores (first dimension) from the Voteview project (Poole and Rosenthal 2007). The pairwise accuracy metric counts the percentage of correct ideological orderings for all possible pairs of parties and Congresses.
5.2 Legislator Embeddings
The methodology introduced in this paper can be adapted to the study of individual legislators, rather than parties. The specification is similar to that of the model used for political parties, with the difference that legislator–Congress indicator variables replace the party–Congress ones. We illustrate such an implementation by examining a model comprising embeddings fitted at the level of individual Senators in the US corpus. Since the number of speeches per legislator is lower than for a party as a whole, we compensate by increasing the number of epochs during training. The model discussed below is trained with 20 epochs.
Figure 5 reports the two principal components of the embeddings associated with individual Senators in the 114th Congress. The figure illustrates the expected division of Senators along party lines, although the clustering is nowhere near as tight as that typically obtained using scaling methods based on roll-call vote data. For instance, some Republican Senators appear mixed together with Democratic counterparts in the center of the spectrum. This can be explained in part by the fact that language reflects a much wider variance in positions than binary vote data, an issue that could be explored in future research. The centrist Republicans in Figure 5 include Robert Jones Portman, Lamar Alexander, and Susan Collins, who were generally ranked as moderates in the external sources of ideology ratings discussed below.
To assess whether the low-dimensional projection captures political ideology, we proceed once more with a validation against external measures. By focusing on a recent Congress, we are able to use a variety of measures of ideological placement for US Senators: the Voteview scores for individual Senators, ratings from the American Conservative Union (ACU), and GovTrack.us ideology scores. Table 4 reports Pearson correlation coefficients and pairwise accuracy scores using the first principal component of our model and each gold standard. Since our corpus is restricted to members affiliated with major party labels, this analysis excludes two independent Senators. For each gold standard considered, we obtain correlation coefficients of at least 0.85. The fit with the GovTrack ideology scores is particularly strong. Overall, these accuracy results support the conclusion that the first principal component extracted from our “Senator embeddings” is related to political ideology.
The DW-NOMINATE score is the main indicator from the Voteview data, and the Nokken–Poole is an alternative implementation based on Nokken and Poole (2004). The ACU ratings are for the year indicated or, alternatively, the life-long rating attributed to each Senator. The methodology used for the GovTrack ideology scores is described in the source website.
5.3 Uncertainty Estimates
A natural question regarding models based on word embeddings is whether they can be fitted with measures of uncertainty, such as standard errors and confidence intervals commonly reported in inference-based research. For instance, it would be convenient to express uncertainty around a prediction of a political party’s placement. Unfortunately, deriving measures of uncertainty with neural network models, even with a relatively modest degree of complexity, remains an area of research under development (see Gentzkow, Kelly, and Taddy 2017). Bootstrap methods would be impractical due to the size of the corpora and the computing time needed to fit each model. Machine learning aims at prediction rather than inference, and the preferred validation methods are designed to assess model specification using predictive accuracy metrics, as we did above, rather than measures of uncertainty (Mullainathan and Spiess 2017). However, there are solutions available for producing measures of uncertainty for quantities of interest after accepting the raw embeddings as point estimates. For example, when computing cosine similarities using lists of words, researchers working with word embeddings have relied on techniques such as permutation $t$-tests (Caliskan, Bryson, and Narayanan 2017) or bootstrap estimators (Garg et al. 2018).
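As a minimal illustration of the latter strategy, the sketch below bootstraps a word list and recomputes an average cosine similarity to a party embedding, in the spirit of Garg et al. (2018). The lexicon and party tag are placeholders, not quantities from our analysis:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

lexicon = ["taxes", "deficit", "regulation", "welfare", "immigration"]
party = model.dv["Rep_112"]                  # a fitted party embedding

rng = np.random.default_rng(0)
stats = []
for _ in range(1000):                        # resample the word list with replacement
    sample = rng.choice(lexicon, size=len(lexicon), replace=True)
    stats.append(np.mean([cosine(model.wv[w], party) for w in sample]))

lo, hi = np.percentile(stats, [2.5, 97.5])   # 95% bootstrap interval
print(round(lo, 3), round(hi, 3))
```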
6 Intended Uses and Caveats
The proposed methodology can help to support research agendas in a number of subfields of political science. In comparative politics, the use of scaling estimates based on roll-call votes is often problematic when exported outside the United States, in particular for countries with high levels of party discipline or where the executive sits inside the parliament. The party embedding model introduced here affords comparative scholars new opportunities to study party ideology, whenever a collection of digitized parliamentary corpora is available. Since the model can be adapted to the study of individual legislators, the technique also provides an additional resource for research on intra-party conflict, a central focus of recent research involving textual analysis of parliamentary speeches (see e.g. Proksch and Slapin 2015; Bäck and Debus 2016; Lauderdale and Herzog 2016; Schwarz, Traber, and Benoit 2017; Proksch et al. 2018). Next, the model is well adapted for research on political representation (Powell 2004; Bird 2010). In particular, the party embeddings can allow researchers to go beyond descriptive representation and examine how the language of political actors relates to issues and groups of interest. The results presented earlier suggest a connection between parties and minorities in the US Congress, and the method could be adapted for studies on the representation of other groups of interest across legislatures. Finally, the model can serve as a basis to develop additional methodological tools for textual analysis going beyond word counts, with the goal of studying political semantics.
Some words of caution are warranted, however, when interpreting text-based indicators of ideological placement. There are well-known limitations to scaling estimates based on roll-call votes, and those are often relevant for applications based on textual data. For instance, strategic agenda-setting decisions may limit opportunities to vote on some issues, and nonrandom abstentions during votes could obscure the true ideal point of a legislator (Clinton 2012). In a similar fashion, the distribution of issues in parliamentary corpora may be constrained by who controls the agenda during a given legislature, and some legislators could strategically refrain from expressing their true opinions. These constraints on legislative debates may also vary across polities. For instance, the daily order of business in the Canadian House of Commons is dominated by the government, whereas other legislatures may offer extended opportunities to opposition parties. Researchers working with the proposed methodology should make sure to gain familiarity with the traits characterizing a legislature and the behavioral constraints that may affect their conclusions.
7 Conclusion
Methods based on word embeddings have the potential to offer a useful, integrated framework for studying political texts. By relying on artificial neural networks that take into account interactions between words and political variables, the methodology proposed in this paper is naturally suited for research focusing on latent properties of political discourse such as ideology. This approach has the benefit of going beyond word counts, which improves the ability to account for changes in word usage over time, in contrast to existing methods. The present paper relied on a custom implementation of word embeddings trained on parliamentary corpora and augmented to include input variables measuring party affiliations. As illustrated in our empirical section, the resulting “party embeddings” can be utilized to characterize the content of parliamentary speeches and place political actors in a vector space representing ideological dimensions. Our validity assessments suggest that low-dimensional projections from the model are generally consistent with human-annotated indicators of ideology, expert surveys, and, in the case of Congress, roll-call vote measures. Judging by the wide adoption of word embeddings in real-world applications based on natural language processing, we believe that the technique is bound to gain in popularity in the social sciences. In fact, our paper has probably scratched only the surface of the potential of word embeddings for political analysis.
Despite the promise of (augmented) word embeddings for political research, scholars should be wary of some limitations with the methodology. First, fitting the models requires a large corpus of text documents to fully benefit from their properties for semantic analysis. We find that corpora of digitized parliamentary debates appear reasonably large for this type of research, but smaller collections of texts may not achieve the same level of accuracy. Second, obtaining uncertainty estimates for the neural network models necessary to fit the embeddings would require further research. Although we have illustrated a variety of techniques for model validation, the methodology could eventually be implemented using frameworks like Bayesian neural networks and variational inference (see e.g. MacKay 1992; Tran et al. 2017). This would facilitate the estimation of confidence intervals, although at this stage the cost in terms of computational time remains prohibitive. Moreover, the quantities of interest to political scientists will likely be derivatives of the embeddings—for instance, a measure of similarity between words and parties—which implies compounding uncertainty estimates across the different steps involved. On the other hand, we have shown that the methodology can be used with a limited number of arbitrary decisions and produce reliable indicators of party placement, eliminating a source of variance associated with individual judgments.
Supplementary material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2019.26.