1 Introduction
Many core concepts in the social sciences are not directly observable. To study democracy, culture, or ideology, we must first build a measure and make inferences about unobservable concepts from observed data. Methods for handling this problem have varied markedly over time and across fields. Congress scholars developed multiple tools to infer member ideology from roll-call behavior (e.g., Clinton, Jackman, and Rivers Reference Clinton, Jackman and Rivers2004; Poole and Rosenthal Reference Poole and Rosenthal1985) whereas survey researchers rely on tools such as factor analysis to infer traits such as “tolerance” from survey responses (e.g., Gibson and Bingham Reference Gibson and Bingham1982).
Recently, social scientists have turned toward text-as-data methods as a way to derive measures from written text, supplementing a long tradition of manual content analysis with computer-assisted techniques. Unsupervised probabilistic topic models (TMs) have emerged as a particularly popular strategy for analysis since their introduction to political science by Quinn et al. (Reference Quinn, Monroe, Colaresi, Crespin and Radev2010). TMs are attractive because they both discover a set of themes in the text and annotate documents with these themes. Due to their ease of use and scalability, TMs have become a standard method for measuring concepts in text.
Yet, TMs were not originally designed for the measurement use-case. Blei, Ng, and Jordan (Reference Blei, Ng and Jordan2003) present latent Dirichlet allocation as a tool for information retrieval, document classification, and collaborative filtering. Given this shift in focus, the scholars who introduced the “topics as measures” tradition to political science emphasized the necessity of robust validation (Grimmer Reference Grimmer2010; Quinn et al. Reference Quinn, Monroe, Colaresi, Crespin and Radev2010), with Grimmer and Stewart (Reference Grimmer and Stewart2013) naming a key principle for text methods, “validate, validate, validate.” Early work was excruciatingly careful to validate the substantive meaning of the topics through carefully constructed application-specific criteria and bespoke evaluations. Yet as we have routinized TMs, validation has received less emphasis and less space on the page. In our review of recent practice in top political science journals below, we show that over half of articles using TMs report only a list of words associated with the topic and only a handful of articles report fit statistics.Footnote 1 Meanwhile, extensive, application-specific validations are rarer still.
This status quo presents a challenge. On the one hand, we have the ability to measure important concepts using immense collections of documents that previous generations could neither have collected nor analyzed. On the other hand, the value of these findings increasingly rests entirely on our confidence in the authors’ qualitative interpretations, which cannot be succinctly reported.Footnote 2 The most important step for addressing this challenge is renewed attention to validation, but by their very nature customized, application-specific validations are difficult to formalize and routinize.
In this article, we take a different approach. We design and test a suite of validation exercises designed to capture human judgment which can be used in a wide range of settings. Our procedure refines a prior crowdsourcing method for validating topic quality (Chang et al. Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009) and presents a new design for validating the researcher-assigned topic labels. We provide software tools and practical guidance that make all our validations straightforward to run. Crucially, our goal is not to supplant bespoke validation exercises but to supplement them. Although no single method can validate TMs for all settings, our aim is to re-emphasize the importance of validation and begin a dialogue between methodologists and applied researchers on improving best practices for validating topics as measures.
In the next section, we review how TMs are validated in the social sciences, drawing on a new survey of articles in top political science journals. Section 3 lays out our principles in designing new crowdsourced tasks and introduces our running example. We then outline and evaluate our designs for validating topic coherence (Section 4) and label quality (Section 5). We conclude by discussing the limitations of our designs and future directions for what we hope is only the first of many new methods for validating topics as measures.
2 How Topic Models are Used and Validated
In the social sciences, researchers quickly uncovered the potential of TMs for measuring key concepts in textual data. Political Science in particular has witnessed important work in all subfields where TMs measure latent traits, including senators’ home styles in press releases (Grimmer Reference Grimmer2013), freedom of expression in human rights reports (Bagozzi and Berliner Reference Bagozzi and Berliner2018), religion in political discourse (Blaydes, Grimmer, and McQueen Reference Blaydes, Grimmer and McQueen2018), styles of radical rhetoric (Karell and Freedman Reference Karell and Freedman2019), and more. In other works, the models are used to explore new conceptualizations which may in turn be measured using a different sample or a different approach (Grimmer and King Reference Grimmer and King2011; Pan and Chen Reference Pan and Chen2018).
This trend is promising in that this approach opens up important new lines of inquiry—especially in the context of the explosion of new textual data sources online. At the same time, the move toward measurement is worrying if we are running ahead of ourselves. Do these topics measure what they are supposed to measure? How would we know? We lack an established standard for affirming that a topic measures a particular concept.Footnote 3 In this section, we describe why TM validation is an essential task. We then briefly characterize early approaches to validation and conclude with a review of recent empirical practices.
2.1 The Importance of Topic Validation
The strength and weakness of TMs is that topics are simultaneously learned and assigned to documents. Thus, the researchers must, first, infer whether or not there are any coherent topics, second, place a conceptual label on those topics, and only then assess whether that concept is measured well. In this more open-ended process the potential for creative interpretation is vastly expanded—with all of the advantages and disadvantages that brings. The interpretation and adequacy of the topics are not justified by the model fitting process—those motivating assumptions were simply conveniences not structural assumptions about the world to which we are committed (Grimmer, Roberts, and Stewart Reference Grimmer, Roberts and Stewart2021). Instead, our confidence in the topics as measures comes from the validation that comes after the model is fit (Grimmer and Stewart Reference Grimmer and Stewart2013). This places a heavy burden on the validation exercises because they provide our primary way of assessing whether the topics measure the concept well relative to an externally determined definition.
A further complication is that TMs are typically fit, validated, and analyzed in a single manuscript. By contrast, NOMINATE was extensively validated before widespread adoption (e.g., Poole and Rosenthal Reference Poole and Rosenthal1985) and subsequently used in thousands of studies. Novel psychological batteries are often reported in stand-alone publications (e.g., Cacioppo and Petty Reference Cacioppo and Petty1982; Pratto et al. Reference Pratto, Sidanius, Stallworth and Malle1994), or at the very least subjected to common reporting standards. In other words, the common practice of one-time-use TMs means that research teams are typically going about this process alone.
The inherent difficulty of validation is critical for how readers and researchers alike understand downstream inferences. Subtle differences in topic meanings can matter, and outputs like the most probable words under a topic are, in our experience, rarely unambiguous. Whether a topic relates to “reproductive rights” or “healthcare,” for instance, can be difficult for a reader to ascertain based on these kinds of model outputs.Footnote 4 Yet showing that, for instance, female legislators are more likely to discuss “healthcare” has very different substantive implications than finding they are more likely to discuss “reproductive rights.”Footnote 5
Understanding when validation is needed is complicated by the ostensibly confirmatory, hypothesis-testing style of most quantitative work in the social sciences. Published work often erodes the difference between confirming an ex ante hypothesis and a data-driven discovery (Egami et al. Reference Egami, Fong, Grimmer, Roberts and Stewart2018)—settings that require different kinds of validation. Of course, this tension is not unique to TMs and, in fact, echoes debates about exploratory and confirmatory factor analysis of a previous era (see Armstrong Reference Armstrong1967).
2.1.1 Early approaches to validation
The early TM literature in political science followed a common pattern for validation (Grimmer Reference Grimmer2010; Grimmer and Stewart Reference Grimmer and Stewart2013; Quinn et al. Reference Quinn, Monroe, Colaresi, Crespin and Radev2010). First, estimate a variety of models, examine word lists, and carefully read documents which are highly associated with each topic. Then, in combination with theory, evaluate the predictive validity of the topics by checking that they are responsive to external events, their convergent validity by showing that they align with other measures, and their hypothesis validity by showing that they can usefully test theoretically interesting hypotheses. These latter steps are what we call bespoke validations and are highly specific to the study under consideration. For example, Grimmer (Reference Grimmer2010) shows in an analysis of U.S. Senate press releases that senators talk more frequently about issues related to committees they chair. This is an intuitive evaluation that the model is detecting something we are ex ante confident is true, but that expectation is specific to this setting. In short, this approach is heavy on “shoe-leather” effort and involves a great deal of customization—but it is also the gold standard of validation.
2.2 A Review of Recent Practices
How are TMs validated in more recent articles published in top journals? To assess current practices in the field, we identified all articles published in the American Political Science Review, American Journal of Political Science, and Journal of Politics from January 1, 2018 to January 2021Footnote 6 that included the phrase “topic model.” Out of the 20 articles, the topic serves as an outcome variable in 13 and as a predictor in 8.Footnote 7
We created three dichotomous variables reflecting the most common classes of validation strategies reported: topic-specific word lists, fit statistics, and bespoke validation of individual topic meanings.Footnote 8 Notably, we have omitted “authors reading the text” which—while an essential form of validation—cannot be clearly demonstrated to the reader and thus is not fully public in the sense of King, Keohane, and Verba (Reference King, Keohane and Verba1994).Footnote 9 The results of our analysis are summarized in Figure 1. We did not explicitly exclude articles that used TMs for nonmeasurement purposes because we found this too difficult to assess reliably; thus, the 20 articles should be taken as the size of our sample, but not necessarily as the number of articles that would ideally have reported validations of meaning.
Figure 1 Survey of practices in topic model analysis in top political science journals.
2.2.1 Topic-specific word lists
The most common form of validation—used in 19 of the 20 articles—is presenting word lists for at least some subset of topics.Footnote 10 These could be either the most probable words in the topic under the model or alternative criteria such as frequency and exclusivity (FREX) (Roberts, Stewart, and Airoldi Reference Roberts, Stewart and Airoldi2016) and are sometimes reported in word clouds. In practice such lists help to establish content validity (e.g., does the measure include indicators we would expect it to include? Does the measure exclude indicators that are extraneous or ambiguous?).Footnote 11 The word lists allow readers to assess (if imperfectly) whether or not words are correlated with the assigned topic label as they might expect. If a topic is supposed to represent the European debt crisis, for instance, it is comforting to see that top words for the topic include word stems like: “eurozone,” “bank,” “crisi,” “currenc,” and “greec” (Barnes and Hicks Reference Barnes and Hicks2018).
However, 11 of those 19 articles provide only word lists. These are often short and rarely provide numerical information about the probability under the model. Although lists can be intuitive, they are rarely unambiguous. In the European debt crisis topic above we also see “year,” “last,” “auster,” and “deficit.” The first two words are ambiguous, and the last two seem more associated with other topic labels (Austerity Trade-Offs and Macro/Fiscal) in the article (Barnes and Hicks Reference Barnes and Hicks2018). Stripped of their context, word lists are difficult to assess, leaving readers with little basis for forming their own judgment.
2.2.2 Fit statistics
Beyond word sets, 4 of the 20 articles also reported fit statistics such as held-out log likelihood (Wallach et al. Reference Wallach, Murray, Salakhutdinov and Mimno2009) or surrogate scores such as “semantic coherence” (Mimno et al. Reference Mimno, Wallach, Talley, Leenders and McCallum2011; Roberts et al. Reference Roberts2014).Footnote 12 This provides a sense of whether or not the model is over-fitting, and some previous research shows that surrogates correlate with human judgments.
2.2.3 Bespoke approaches
Five articles reported additional validations of topic meaning designed especially for their case to establish construct validity (does the measure relate to the claimed concept?). Blumenau and Lauderdale (Reference Blumenau and Lauderdale2018) coded 200 documents as to whether the document was related to the Euro crisis with the goal of finding topics that maximized predictions of crisis-related votes. In a supplemental analysis, Dietrich, Hayes, and O’Brien (Reference Dietrich, Hayes and O’Brien2019) qualitatively identify partisan topics and show Republicans/Democrats speak more about their topics. Motolinia (Reference Motolinia2020) fit a TM with 450 topics and reported validations for two relevant theoretical expectations (see their figure 2). Barberá et al. (Reference Barberá2019) provided considerable information about topics including a custom websiteFootnote 13 showing high frequency words and example documents, and reported a validation against external events for one of the 46 topics. Arguably, the most thorough reported validation was in Parthasarathy, Rao, and Palaniswamy (Reference Parthasarathy, Rao and Palaniswamy2019), which validates topics against theoretical predictions and survey responses from human observers of public deliberations in India. What counts as a bespoke validation is unavoidably subjective, but we emphasize here that we are considering bespoke validation of individual topic meanings which excludes many other valuable analyses.Footnote 14
2.2.4 Summary of findings
We emphasize that our analysis is limited to validations reported to readers. In many cases, the topics were validated in additional ways that could not be (or at least were not) reported. For instance, Blaydes et al. (Reference Blaydes, Grimmer and McQueen2018, p.1155) write, “Our research team also evaluated the model qualitatively …, selecting the specification and final model that provided the most substantive clarity.” This is an essential part of the process, but isn’t easily visible to the reader. The reader can see the reported high-probability words (Table 1 in Blaydes et al. (Reference Blaydes, Grimmer and McQueen2018)) and qualitative descriptions of topics (Supplementary Appendix C in Blaydes et al. (Reference Blaydes, Grimmer and McQueen2018)). Careful qualitative evaluation is arguably the most important validation, but it is not easily communicated.
Our point is not to call any of these findings into question, but merely to characterize common approaches to validation. Articles with bespoke validations are not necessarily validated well, and articles without bespoke validations are not necessarily validated poorly. Our results do show that there is limited agreement on what kinds of validations of topic meaning should be shown to the reader. Twelve of 20 articles report only key words. Four of 20 report fit statistics. Five report external validation of topic meaning. Just one article reports all three forms of validation we coded (Barberá et al. Reference Barberá2019).Footnote 15
Thus, our overall finding is that aside from word lists, which are near universal, there are few consistently used validation practices. Not surprisingly, extensive customized validations appear relatively rarely. This suggests the need for more validations that can be customized to the measurement task at hand, but can also be quickly and precisely conveyed to readers. Toward this end, we present an approach based on crowdsourced coding of word sets, documents, and topic labels. We emphasize again that this should not be seen as a substitute for theory-driven custom validation exercises or extensive reading, but rather as an additional tool.
3 Designing and Assessing an Off-the-Shelf Evaluation
In this article, we pursue the goal of designing an off-the-shelf evaluation for TMs that leverages the human ability to assess words and documents in context, can be easily and transparently communicated to readers, and is less burdensome than alternatives such as training expert coders or machine learning classifiers. We develop two classes of designs: the first extends the intrusion tasks of Chang et al. (Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009) to evaluate the semantic coherence of a given TM (Section 4), and the second validates that a set of topics corresponds to their researcher-assigned labels (Section 5). Before presenting our method, we review the Chang et al. (Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009) approach in Section 3.1, introduce our design principles in Section 3.2, and describe the data we use for evaluation in Section 3.3.
3.1 Using the Wisdom of the Crowds
In an agenda-setting article, Chang et al. (Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009) introduced a set of crowd-sourced tasks for evaluating TMs.Footnote 16 The core idea is to transform the validation task into short games which—if they are completed with high accuracy—imply a high quality model. The common structure for the two original tasks is shown in Figure 2.
Figure 2 A diagram for the common structure of crowd-sourced validation tasks.
In each, a question (B) is presented to the coders and they must choose from options (C). Section (A) provides additional context for some tasks such as a document.
The first task in Chang et al. (Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009), Word Intrusion (WI), is designed to detect topics which are semantically cohesive. We present workers with five words such as: tax, payment, gun, spending, and debt. Four of these words are chosen randomly from the high-probability words for a given topic and an “intruder” word is chosen from the high-probability words of a different topic. The human is then asked to identify the “most irrelevant” of the words—the intruder—which in the case above is gun. If the topic is semantically coherent, the words from the topic should have clear relevance to each other and the intruder stands out. An example for each task structure is shown in Appendix S2.
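To make the task construction concrete, the following is a minimal R sketch of assembling a single WI item. The objects `beta` (a K × V matrix of topic-word probabilities) and `vocab` (the corresponding word list) are placeholders for whatever a fitted model returns; for an STM fit, `beta` can typically be recovered as `exp(fit$beta$logbeta[[1]])` and `vocab` as `fit$vocab`. The uniform draw from the top 20 words follows the original Chang et al. design rather than the mass-weighted variant we describe in Section 4.

```r
# Minimal sketch of one Word Intrusion (WI) item, assuming:
#   beta  - K x V matrix of topic-word probabilities (rows sum to 1)
#   vocab - character vector of length V
make_wi_item <- function(beta, vocab, topic, n_top = 20) {
  top_n <- function(k) vocab[order(beta[k, ], decreasing = TRUE)[1:n_top]]
  shown <- sample(top_n(topic), 4)              # four words from the focal topic
  others <- setdiff(seq_len(nrow(beta)), topic)
  other <- others[sample.int(length(others), 1)]
  intruder <- sample(setdiff(top_n(other), top_n(topic)), 1)  # one intruder word
  list(options = sample(c(shown, intruder)),    # shuffled for display
       answer = intruder)
}
```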
The second task, Top 8 Word Set Intrusion (T8WSI), detects coherence of topics within a document.Footnote 17 We present the coders with an actual document (or a snippet from the document) and four sets of eight words such as
[Example display of four eight-word sets omitted.]
Each of the four word sets contains the eight highest probability words for a topic. Three of these topics correspond to the highest probability topics for the displayed document, whereas one is a low-probability topic for that document. The human is asked to identify the word set that does not belong—which in this case is (day…vacation). Here, the worker has cues both from the document itself and from the pattern of co-occurrence across topics.
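As a sketch of how a T8WSI item can be assembled, the code below draws the three word sets from a document's highest-probability topics and the intruder set from one of its lowest-probability topics. The objects `theta` (document-topic proportions), `beta`, `vocab`, and `texts` are assumed inputs; for an STM fit, `theta` is available as `fit$theta`. This is illustrative rather than our exact implementation.

```r
# Sketch of one Top 8 Word Set Intrusion (T8WSI) item, assuming:
#   theta - D x K matrix of document-topic proportions
#   beta  - K x V matrix of topic-word probabilities
#   vocab - character vector of length V
#   texts - character vector of the original documents (or snippets)
make_t8wsi_item <- function(theta, beta, vocab, texts, doc) {
  ranked <- order(theta[doc, ], decreasing = TRUE)
  top3 <- ranked[1:3]                          # three highest-probability topics
  intruder_topic <- sample(tail(ranked, 5), 1) # one low-probability topic
  top8 <- function(k) vocab[order(beta[k, ], decreasing = TRUE)[1:8]]
  word_sets <- lapply(c(top3, intruder_topic), top8)
  ord <- sample(4)                             # shuffle display order
  list(snippet = texts[doc],
       options = word_sets[ord],
       answer = which(ord == 4))               # position of the intruder set
}
```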
When these tasks can be completed with high accuracy by workers, it demonstrates that words within a topic are coherent (WI) and that the topics that co-occur within a document are coherent (T8WSI). Yet, they do not include the researcher-assigned labels for the topics and thus cannot demonstrate that topics represent what the researcher describes them as measuring. In Section 4, we improve on these existing designs for evaluating coherence and in Section 5, we introduce new designs for validating the labels.
3.2 Principles
We design the tasks to be generalizable, discriminative, reliable, and easy to use. All tasks we present are generalizable to any mixed-membership model that represents a topic as a distribution over words and two of our designs also work with single-membership models. The approach also generalizes to different substantive settings, varying document collection sizes, lengths of documents, and number of topics. We design the tasks to discriminate based on model quality, which involves ensuring that successful completion is correlated with higher quality models, but also that the tasks are of medium difficulty to avoid ceiling or floor effects. Further, even though these tasks involve subjective judgments, we demonstrate that they are reliable by showing that results are stable under replication.
Finally, this innovation is only helpful if scholars actually employ these techniques. Despite being highly cited, the approach in Chang et al. (Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009) is rarely used in the academic literature and, as we already demonstrated in our review, extensive validations of TMs are rare. Thus, we prioritize ease of use and develop software to help users implement our methods.Footnote 18 Using workers on Amazon Mechanical Turk (MTurk) we were able to get results quickly and cheaply (usually in an afternoon and for less than $50 per task/model). For the researcher there is a fixed cost in getting set up on MTurk, building training modules for the workers, and creating a set of gold standard human intelligence tasks (HITs). But it does not require additional specialized skills and it is less arduous than alternatives such as establishing coding procedures for research assistants and/or training supervised classification algorithms. In addition to the software, we provide additional guidance and directions in the Supplementary Appendix.
3.3 Empirical Illustration
As an empirical testbed, we collected U.S. senators’ Facebook pages from the 115th Congress and applied a series of common preprocessing steps.Footnote 19 We fit five structural topic models (STM; Roberts et al. Reference Roberts, Stewart, Tingley, Airoldi, Burges, Bottou, Welling, Ghahramani and Weinberger2013, Roberts, Stewart, and Airoldi Reference Roberts, Stewart and Airoldi2016) using 163,642 documents.Footnote 20 In order to establish a clear benchmark for a flawed model, we estimate Model 1, a 10-topic STM run for only a single iteration of the expectation maximization algorithm. Even this model appears reasonable at first glance because of the initialization procedure in STM, thus making for a strong test.Footnote 21 We then fit three standard STM models with 10, 50, and 100 topics (Models 2–4). We do not have prior expectations of the quality ordering of these models. Finally, in order to provide a model which is almost certainly overfit given the length of the documents, we fit a 500 topic model (Model 5).
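For readers who want to reproduce this kind of setup, the sketch below shows the model-fitting step with the stm package in R. The data frame `fb` with a `text` column is a hypothetical stand-in for the collected Facebook posts, and the preprocessing calls are illustrative rather than our exact recipe (see footnote 19 and the Supplementary Appendix for the specifics).

```r
library(stm)

# Hypothetical input: data frame `fb` with a `text` column of Facebook posts.
processed <- textProcessor(fb$text, metadata = fb)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Model 1: deliberately flawed baseline, stopped after one EM iteration.
m1 <- stm(out$documents, out$vocab, K = 10, max.em.its = 1, seed = 1)

# Models 2-5: run to convergence with K = 10, 50, 100, and 500 topics.
fits <- lapply(c(10, 50, 100, 500), function(k) {
  stm(out$documents, out$vocab, K = k, init.type = "Spectral", seed = 1)
})
```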
4 Coherence Evaluations
We present three task structures designed to pick out distinctive and coherent topics. This aligns closely with the stated goals of analysts in the social sciences. For instance, Kim (Reference Kim2018, Appendix, p. 39) justifies the choice of 25 topics stating, “models with the lower number of topics do not capture distinct topics, while the model with 30 topics does not provide additional categories that are meaningful for interpretation.” Similarly, Barnes and Hicks (Reference Barnes and Hicks2018, 346, footnote 13) say they chose the number of topics, “at which the topics content could be interpreted as substantively meaningful and distinct.”
Table 1 summarizes all three task structures, where column names correspond to the annotations in the sample diagram from Figure 2. The WI and the T8WSI tasks are slight alterations from the methods in Chang et al. (Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009) (discussed above). The primary difference is that we combine the probability mass for words with a common root and randomly draw words according to their mass (in contrast with drawing words uniformly). The term “probability mass” here refers to the topic-specific probability assigned to a given token (remembering that topics are represented as word distributions). Combining the probabilities in this way is a bit like stemming the word after the modeling is complete. This allows us to show complete words to the human coders while also preventing multiple words with a common root from appearing in the same task.
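A minimal sketch of this stem-collapsing, mass-weighted draw is shown below, using the SnowballC stemmer. The `beta` and `vocab` objects are as in the earlier sketches; the particular stemmer and tie-breaking rule here are illustrative assumptions rather than our precise implementation.

```r
library(SnowballC)

# Sketch: combine topic-word probability mass over common stems, then draw
# display words proportional to that mass (without replacement).
draw_topic_words <- function(beta, vocab, topic, n_words = 4) {
  probs <- beta[topic, ]
  stems <- wordStem(vocab, language = "english")
  stem_mass <- tapply(probs, stems, sum)               # total mass per stem
  chosen <- sample(names(stem_mass), n_words, prob = stem_mass)
  # Show the most probable complete word for each chosen stem,
  # so no two displayed words share a root.
  vapply(chosen, function(s) {
    idx <- which(stems == s)
    vocab[idx[which.max(probs[idx])]]
  }, character(1))
}
```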
Table 1 Task structures for coherence evaluations.
In our initial testing, we found that the WI and T8WSI tasks were often too difficult for coders, reducing their power to discriminate. Further, T8WSI is sensitive to the words included in the “top eight,” making the results more arbitrary and again less informative. To address these concerns, we designed a new task, Random 4 Word Set Intrusion (R4WSI) which we summarize in the final row of Table 1.
In R4WSI, we present the coder with four different sets of four words such as,
[Example display of four four-word sets omitted.]
Similar to WI, three of these sets of words are chosen from the same topic, while an intruder word set comes from a different topic.Footnote 22 The coder’s goal is to identify the intruder word set (here serve…fight). In this new design, coders have access to 12 words from the nonintruder topic and thus more context to identify a common theme resulting in more informative decisions.
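A sketch of the R4WSI item construction follows. For brevity it draws words uniformly from each topic's top-20 list; in practice the mass-weighted, stem-collapsed draw sketched above can be substituted. Object names are assumptions as before.

```r
# Sketch of one Random 4 Word Set Intrusion (R4WSI) item, assuming
# beta (K x V topic-word probabilities) and vocab as above.
make_r4wsi_item <- function(beta, vocab, topic, n_top = 20) {
  top_n <- function(k) vocab[order(beta[k, ], decreasing = TRUE)[1:n_top]]
  focal <- sample(top_n(topic), 12)              # 12 words from the focal topic
  sets <- split(focal, rep(1:3, each = 4))       # split into three 4-word sets
  others <- setdiff(seq_len(nrow(beta)), topic)
  other <- others[sample.int(length(others), 1)]
  intruder <- sample(setdiff(top_n(other), focal), 4)  # 4-word intruder set
  all_sets <- c(sets, list(intruder))
  ord <- sample(4)                               # shuffle display order
  list(options = all_sets[ord], answer = which(ord == 4))
}
```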
We tested these three task structures using workers with master certifications from Amazon’s Mechanical Turk (AMT) from March to July, 2020. To qualify, workers had to complete an online training module described in Appendix S6. The training explains the task, provides background about the document set, and walks workers through examples to ensure they understand their goals. In Appendix S4, we emphasize that these training modules are critical for screening workers with the requisite skills and knowledge and putting the tasks in context for the coders.
We paid $0.04/task for WI, $0.08/task for T8WSI, and $0.06/task for R4WSI (which corresponds to roughly $15 per hour on average). For each task structure we posted 500 tasks, which Amazon calls HITs, for all five models. To assess the consistency of task structures, we then posted these exact same tasks again. To monitor the quality of the work, we randomly mixed in a gold-standard HIT every ten HITs.Footnote 23 In total, workers completed 16,500 tasks. However, a single batch of 500 HITs—a typical case for an applied researcher—takes only a few hours with total costs in the range of $25–$60.
Figure 3 shows the results for all five of our models on each of the three tasks. The first two lighter bars indicate the two identical runs and the third, darker bar indicates the pooled results of those runs. We also indicate when the difference in means is significant across model pairs with connecting dotted lines, where the numbers represent p-values for a difference-in-proportions test ($n=2,000$). We make three observations. First, all task structures easily identified the nonconverged baseline (Model 1) as the worst, which provides a check that this approach can identify a model known to be a relatively poor fit. Second, all of them are able to identify over-fitting, as the 500-topic model (Model 5) appears to be worse than the 100-topic model (Model 4) in all task structures. Third, all of the task structures are reliable in that they provide nearly indistinguishable estimates across runs when we include 500 tasks.
Figure 3 Results for coherence evaluations. Note: The 95% confidence intervals are presented. The two light bars represent two identical trials (500 HITs each). The dark bar represents the pooled result (1,000 HITs). When two models yield significantly different results, the p-value is noted. (Significance tests are difference in proportions as calculated by the prop.test function in R.) No identical trials (two light bars) are significantly different from each other. The gray horizontal line represents the correct rate from random guessing.
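The pairwise comparisons in Figure 3 use a standard difference-in-proportions test. A sketch with hypothetical counts (correct answers out of 1,000 pooled HITs per model) is below; the function is base R's `prop.test`, as noted in the figure.

```r
# Difference-in-proportions test between two models' pooled accuracy.
# Counts are hypothetical: correct answers out of 1,000 pooled HITs each.
correct <- c(model4 = 845, model5 = 790)
total   <- c(1000, 1000)
prop.test(x = correct, n = total)   # reports the p-value and a 95% CI
```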
Overall, these results provide evidence of several advantages of the R4WSI task structure. The estimated held-out log likelihoods for Models 1–5 are $-8.316$, $-7.981$, $-7.767$, $-7.705$, and $-7.984$, respectively (higher is better). This rank ordering (with the 500-topic model scoring lower than the versions with 10 or 50 topics) is consistent with R4WSI but not with WI and T8WSI. R4WSI also more clearly distinguishes the unconverged Model 1 as inferior. The higher accuracy rates suggest that the R4WSI task is indeed easier for workers to understand and complete, with workers identifying the intruder nearly 85% of the time for Model 4. This suggests that the model has identified meaningful and coherent patterns in the document set that humans can reliably recognize. Although all the tasks appear reasonably effective, we recommend the R4WSI task for applied use.
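One standard way to obtain held-out log likelihoods for STM fits is the document-completion approach implemented in the stm package, sketched below for the 50-topic model. It assumes the `out` object from the earlier fitting sketch, and it is a general illustration of the quantity rather than a line-by-line reproduction of our computation.

```r
library(stm)

# Hold out a portion of words in a subset of documents, refit, then score
# the held-out words (a higher mean per-word log likelihood is better).
heldout <- make.heldout(out$documents, out$vocab, seed = 1)
fit50 <- stm(heldout$documents, heldout$vocab, K = 50,
             init.type = "Spectral", seed = 1)
eval.heldout(fit50, heldout$missing)$expected.heldout
```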
5 Label Validation
In social science research, scholars typically place conceptual labels on topics that indicate the concept they are measuring. The accuracy of these labels may have relatively low stakes if topics are only used for prediction (e.g., Blumenau and Lauderdale Reference Blumenau and Lauderdale2018). However, in the majority of applications we reviewed, the stakes are high as the label communicates to the reader the nature of the evidence that the text provides about a theoretical claim of interest (e.g., Barnes and Hicks Reference Barnes and Hicks2018; Gilardi, Shipan, and Wüest Reference Gilardi, Shipan and Wüest2021; Horowitz et al. Reference Horowitz2019; Magaloni and Rodriguez Reference Magaloni and Rodriguez2020). In many cases, the individual labels may be important, but play a less central role in the analysis than the label assigned to a cluster of topics which share a common trait of interest (e.g., Barberá et al. Reference Barberá2019; Dietrich, Hayes, and O’Brien Reference Dietrich, Hayes and O’Brien2019; Lacombe Reference Lacombe2019; Martin and McCrain Reference Martin and McCrain2019; Motolinia Reference Motolinia2020). Reflecting the differences in social science usage of TMs, these concerns of label validity are largely unaddressed by the designs that originated in computer science (Chang et al. Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009).
We develop label validations for these use cases and test them on the 100-topic model (Model 4). First, we ask, “Are the conceptual labels sufficiently precise and nonoverlapping to allow us to distinguish between closely related topics?” Specifically, we identified 10 topics related to domestic policies and focus our analysis on only these topics. Second, we ask, “Can we usefully distinguish two broad conceptual categories of discussion from each other?” Specifically, we identified 10 topics related to the military and foreign affairs and focus on coders’ ability to distinguish between these topics and the “domestic” policy topics.Footnote 24
A problem for validating any new validation method is that we lack an unambiguous ground truth—many possible labels would accurately describe the contents of a topic and many labels would not. Ideally, our task will allow us to discriminate between higher and lower quality labels. In our empirical case, we need to produce a set of labels for which we have strong a priori expectations.
Members of our research team independently labeled each of the 100 topics. Each of us carefully read the high-probability and FREX words (Roberts, Stewart, and Airoldi Reference Roberts, Stewart and Airoldi2016), as well as 50 representative documents per topic (Grimmer and Stewart Reference Grimmer and Stewart2013). From the topics that all of us deemed coherent, we picked 10 domestic topics and 10 international/military topics where the labels were most consistent. The final labels for each are shown in Table 2 with additional details in the Supplementary Appendix. We refer to these as “careful coder” labels. To provide a contrast, we asked research assistants to create their own set of labels based only on the high-probability and FREX words (i.e., without looking at the documents). These labels, which we refer to as “cursory coder” labels, are shown in the second column of Table 2. Our expectation is that the careful coder labels are better labels (and thus should score more highly on the tasks) but that the cursory coder provides a reasonably strong baseline.Footnote 25
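The per-topic inspection behind the careful coder labels can be reproduced with standard stm utilities, as in the sketch below. The objects `fit100` (the 100-topic model) and `out$meta$text` (the document texts) are assumed names, and the topic index is hypothetical.

```r
library(stm)

k <- 37  # hypothetical topic index

# High-probability and FREX words for the topic.
labelTopics(fit100, topics = k, n = 10)

# Fifty representative documents for the same topic.
findThoughts(fit100, texts = out$meta$text, topics = k, n = 50)
```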
Table 2 Labels to validate.
5.1 Novel Task Structures
We designed two task structures to evaluate label quality which are summarized in Table 3: Label Intrusion and Optimal Label.Footnote 26 In the Label Intrusion (LI) task the coder is shown a text and four possible topic labels. Three of the labels come from the 3 topics most associated with the document and 1 is selected from the remaining 7 labels (“Within Category”) or 7 plus the 10 international labels (“Across Categories”). The coder is asked to identify the intruder, mimicking the word set intrusion design.
Table 3 Task structures for label validation.
a. Top 3 predicted topics among the 10 domestic topics. b. Top 1 predicted topic among the 10 domestic topics.
The second task, Optimal Label (OL), presents a document and four labels. One label is for the highest probability topic and the other 3 labels are chosen randomly from the remaining 9 domestic labels (“Within Category”) or 9 plus the 10 international labels (“Across Categories”). The coder is asked to identify the best label. This optimal label task structure is similar to the validation exercises already common in the literature where research assistants are asked to divide documents into predefined categories to assess topic quality (Grimmer Reference Grimmer2013). This task structure has the advantage of being the most directly interpretable since it essentially asks coders to confirm or refute the conceptual labels assigned to the documents and measures their accuracy in doing so.
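The sketch below assembles one Optimal Label item from the document-topic proportions restricted to the labeled topics; the Label Intrusion item is built analogously, taking the labels of the top three topics and drawing a single intruder label instead. The `labels` object (a character vector of labels named by topic index) and the other inputs are assumptions, and whether the distractors come from the “within” or “across” condition is determined simply by which label set is passed in.

```r
# Sketch of one Optimal Label (OL) item, assuming:
#   theta  - D x K document-topic proportions (100-topic model)
#   labels - character vector of labels, named by topic index (e.g., "37")
#   texts  - character vector of documents (or snippets)
make_ol_item <- function(theta, labels, texts, doc, n_options = 4) {
  topics <- as.integer(names(labels))
  best <- topics[which.max(theta[doc, topics])]    # highest-probability labeled topic
  correct <- labels[as.character(best)]
  distractors <- sample(labels[names(labels) != as.character(best)],
                        n_options - 1)
  list(snippet = texts[doc],
       options = sample(c(correct, distractors)),  # shuffled for display
       answer = unname(correct))
}
```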
In addition, we anticipated that discriminating between only domestic topics would be harder than discriminating between topics where intruders could be either domestic or military/international topics. That is, discriminating between conceptually similar topics (e.g., Drug Abuse vs. Healthcare/Reproductive Rights) is understandably a “harder test” than discriminating between clearly distinct topics (e.g., Drug Abuse vs. Terrorism).
5.2 Results
For each task/coder combination we created 500 tasks (plus 50 gold-standard HITs for evaluation purposes) that were coded by trained workers on AMT for $0.08 per HIT. These were then repeated. In total, workers completed 8,800 HITs and the results are shown in Figure 4. We focus only on documents about domestic policy (see Table 3, footnotes a and b) so that each task (Label Intrusion, Optimal Label) uses a common set of documents for the two variants (Across, Within).
Figure 4 Results for label validations. Note: The 95% confidence intervals are presented, where the two light bars represent two identical trials (500 HITs each) and the dark bar represents the pooled result (1,000 HITs). p-values are based on the pooled set of tasks based on a difference-in-proportions test. No identical trials are significantly different from each other. The gray horizontal line represents the correct rate from random guessing.
The results are positive for both tasks. First, with 500 HITs, the results across runs are reliable, with the rank orderings of the label sets being indistinguishable across repetitions. Second, the results are consistent across task structures in identifying the careful coder labels as being superior. Finally, in Table 4, we disaggregate the accuracy rates of the “across” condition to show that workers achieve much higher correct rates when the goal is to distinguish the true label(s) from a different category of labels (domestic vs. international/military policies). For instance, when all three intruders crossed this conceptual boundary, coders were able to choose the correct optimal label 96.4% of the time for the careful coder labels, while that figure falls to 78.8% when intruders were limited to other domestic topics.
Table 4 Disaggregation of Figure 4 accuracy rates for “across” condition by intruder categories.
Note: All documents included are about domestic policy, so a cross-category option is any international label. It is possible to have zero international labels (as in the “within” condition) because in the “across” category condition we are randomly selecting labels from both categories.
Both the LI and OL tasks are reasonable choices for applied researchers. The LI task only works for mixed membership models and will be most effective when most documents strongly express multiple topics (and capturing more than the top label is particularly important). The OL task is more easily interpretable and can work for both single- and mixed-membership models, but relies on the ability of the coder to pick out the single best label which can be difficult in documents that are best represented by a mixture over many topics. In both designs, the researcher must also choose whether to draw the comparison topics from a set of conceptually related topics or from across broad categories. Closely related topics represent a harder test, but when the primary research claim is about the broader category, the task of making fine-grained distinctions may be unnecessarily difficult.
In our particular application, the results suggest that coders can easily make distinctions between broader policy categories (e.g., domestic and international policy debates). When looking only within a narrow set of topics, however, our results indicate a need for caution. When considering only the 10 domestic policy topics, the coders could identify an intruder only 70.3% of the time for the “careful coder” labels and less than half (49.0%) of the time for the cursory coders. This suggests that the careful coder labels are substantially better, but depending on the downstream task, even 70% might be concerning. The corresponding numbers for the OL task (78.8% and 71.7%) corroborate this finding, indicating that the careful coder labels are better, but we should put less faith in the fine-grained distinctions.
6 Limitations
Our goal is not to present the final word on this methodological question, but rather to begin a dialogue. Our collective work on validating topics as measures is just getting started. With this in mind, we highlight three limitations.
Limitation 1: These Designs Should Not Replace Bespoke Validation
When it comes to validation, there is no substitute for testing a measure against substantive, theoretically driven expectations in a bespoke validation. As Brady (Reference Brady, Brady and Collier2010, 78) writes, “Measurement … is not the same as quantification, and it must be guided by theories that emphasize the relationship of one measure to another.” Yet, as we noted in our review, bespoke validations appear so infrequently in the published literature that it may be helpful to extend the toolbox with new options.
The central advantage of these new tasks is that they are low cost, reliable, and easy to communicate to a reader. For any given application, there is likely a custom-designed solution which will be superior, but our tasks provide an approach that researchers can reach for in most circumstances. In the best case scenario, our proposed tasks would offer a complement to essential but difficult to convey validation methods such as close reading of the underlying text.
The ongoing need for bespoke validation is inextricably connected to the fact that we do not have access to a ground truth to benchmark validations against and thus we cannot guarantee that they will be accurate in general. Our coherence evaluations help to ensure that the topics convey a clear concept and are distinguishable from each other while the label validation exercises ensure that the researcher-assigned labels are sufficiently accurate to be distinguished among themselves. Importantly, using human judgments, our validations occupy a space between expert assessments and statistical metrics which lack any human judgment at all.Footnote 27
Limitation 2: These Designs Have Limited Scope
Although a major advantage of our designs is that they are more general than a given bespoke strategy, there are nonetheless some limitations in scope arising from the simplification inherent in the tasks. To begin, the documents have to be accessible to the workers. At a minimum, documents have to be in a language the workers can read. Mechanical Turk relies primarily on a U.S.-based workforce, but Pavlick et al. (Reference Pavlick, Post, Irvine, Kachaev and Callison-Burch2014) show that it is possible to find workers with specific language skills, and our experience shows that only a small number of workers are needed to complete these coding tasks. There are also alternative crowdsourcing platforms with more international workers (Benoit et al. Reference Benoit, Conway, Lauderdale, Laver and Mikhaylov2016).Footnote 28 Still, future research is needed to show that this approach is feasible for non-English texts. In addition, several of the task structures require coders to read documents or excerpts. This is reasonable for social media posts and other short texts that are the basis of most applications of TMs to date. Our document set is particularly well-positioned to use this technique, but that in turn makes it a comparatively easy case. Future work might explore how to best handle excerpting long documents or training workers for specialized texts (e.g., Blaydes, Grimmer, and McQueen Reference Blaydes, Grimmer and McQueen2018).
A more subtle limitation is that the representation of topics using a fixed number (e.g., 20) of the most probable words can present challenges in certain model fits. TMs can have very sparse distributions over the vocabulary, particularly with large numbers of topics, large vocabularies, or when fit with collapsed Gibbs sampling. If the topic is too sparse, the later words in the top 20 might have close to zero probability, making the words essentially random. If stop words are not removed, the vocabulary can include high frequency words which are probable under all topics and thus also not informative.Footnote 29 This is another instance in which text preprocessing decisions may play a consequential role in unsupervised learning (Denny and Spirling Reference Denny and Spirling2018). Because these concerns will arise in the creation of the training module for the workers, researchers will know in practice when this issue is arising and can adjust accordingly (e.g., by considering a smaller number of words).
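A quick diagnostic for this problem is to check whether the lowest-ranked of the displayed “top” words still carries meaningful probability mass. The sketch below flags suspect topics; the cut-off is an arbitrary illustrative value, not a recommended standard.

```r
# Flag topics whose 20th-ranked word has negligible probability,
# assuming beta is the K x V topic-word probability matrix.
check_top_words <- function(beta, n_top = 20, floor_prob = 1e-5) {
  p_last <- apply(beta, 1, function(p) sort(p, decreasing = TRUE)[n_top])
  which(p_last < floor_prob)   # topics whose later "top" words are nearly random
}
```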
We also emphasize that these designs cannot evaluate all properties necessary for accurate measurement. For example, many researchers use topics as outcomes in a regression. When estimating a conditional expectation, we want to know not only that the label is associated with the topic loadings, but that they are proper interval scales (so that the mean is meaningful). These validation designs do nothing to assess these properties, and further work is needed to establish under what circumstances topic probabilities can be used as interval estimates of latent traits.
Limitation 3: Results Are Difficult to Interpret in Isolation
A final limitation is the difficulty of interpreting the results in isolation. Above, we focus in large part on the relative accuracy of the tasks across models or label sets, but in practice applied researchers may be evaluating only a single model. If Model 3 scores 61.6% on the T8WSI task, is this good or bad? Is it comparable to performance on a completely different data set? Documents which involve more complex material or technical vocabularies may lead to poorer scores not because the models are worse, but simply because the task is inherently harder.
Readers may naturally want to adopt some cut-off heuristic where models or labels that score below a particular threshold are not acceptable for publication. We note that this would be problematic and would fall into many of the traps that bedevil the debate over p-values. Thus, finding the right way to compare evidence across datasets remains an open challenge, although one that exists for any kind of validation metric (including model fit statistics and bespoke evaluations). Authors will need to provide readers with context for evaluating and interpreting these numbers, perhaps by evaluating multiple models or using multiple validation methods. At a minimum, as readers we should expect to see that coders substantially exceed the threshold for random guessing (which is marked in all our plots). Still, as we accumulate more evidence about such validation exercises, it may become possible to get a better sense of what an “adequate” score will be.
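As a minimal reporting baseline, authors can test whether pooled accuracy exceeds the random-guessing rate of 25% for a four-option task. The counts below are hypothetical.

```r
# One-sided binomial test against the 25% random-guessing baseline
# for a four-option task (counts are hypothetical).
binom.test(x = 616, n = 1000, p = 0.25, alternative = "greater")
```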
7 Conclusion
The text-as-data movement is exciting, in part, because it comes with a rapidly expanding evidence base in the social sciences (King Reference King, King, Schlozman and Nie2009). The conventional sources of data such as surveys or voting records are giving way to study-specific, text-based datasets collected from the Internet or other digital sources. This means that individual scholars are increasingly taking on the role of designing unique measurements for each study built from messy, unstructured, textual records. While greatly extending the scope of the social sciences, this expansion places new burdens of validation on researchers which must be met with new, widely applicable tools.
We have taken a step in this direction by improving upon the existing crowd-sourced tasks of Chang et al. (Reference Chang, Gerrish, Wang, Boyd-Graber, Blei, Bengio, Schuurmans, Lafferty, Williams and Culotta2009) and extending them to create new designs that assess how well a set of labels represents the corresponding topics. We tested these task structures using a novel topic model fit to Facebook posts by U.S. Senators, and provided evidence that the method is reliable and allows for discrimination between models, based on semantic coherence, and between labels, based on their conceptual appropriateness for specific documents. These kinds of crowd-sourced judgments allow us to leverage the ability of humans to understand natural language without experiencing the scale issues of relying on experts.
Recognizing that such advancements are only helpful if they are straightforward enough for researchers to apply in their own work, we have built an R package which automates much of the work of launching these tasks. Although they do require a fixed cost in time and effort to set up, they are a straightforward way to include external human judgment. Our evaluations were all completed in less than 3 days and sometimes in only a few hours. Further, while certainly not free, the 500 task runs we used here are fairly affordable with costs ranging between $25 and $60. Nonetheless, there are still improvements to be made in terms of best practices for worker recruitment, training, and task structure. This is particularly true as the workforce and platforms are moving targets and future work might discover new challenges or new ways to ensure data reliability.
The social sciences have reimagined topic models for a purpose very different from the original goals of information retrieval in computer science. Yet these new ambitions bring with them new responsibilities to validate topic models with the same high standards we apply to other measures in the social sciences. Early topic modeling work handled this with extensive bespoke validations, but as topic model fitting has become routinized, validation practices have not kept pace. In short, there is no free lunch: any method used for measurement—unsupervised topic models, supervised document classification, or any nontext approach—requires validation to ensure that the learned measurement is valid. This paper makes what is hopefully only one of many efforts to give renewed attention to measurement validation for text-as-data methods in the social sciences.
Acknowledgments
We are grateful for helpful comments from Kevin Quinn, Arthur Spirling, three anonymous reviewers, and audiences at the 2019 Summer meeting of the Society of Political Methodology, IV Latin American Political Methods Meeting, The International Methods Colloquium, Washington University in St. Louis, New York University, the University of California, Davis and the University of California, San Diego. A number of scholars whose work we reviewed in Section 2 also graciously provided comments including Benjamin E. Bagozzi, Pablo Barberá, Lisa Blaydes, Kaiping Chen, Germán Feierherd, Matthew Hayes, Tim Hicks, Michael Horowitz, Matthew J. Lacombe, Benjamin Lauderdale, Kenneth Lowande, Lucia Motolinia, Richard A. Nielsen, Jennifer Pan, Margaret Roberts, Arturas Rozenas, and Dustin Tingley. We thank Ryden Butler for providing the data and Ryden Butler, Bryant Moy, Erin Rossiter, and Michelle Torres for research assistance.
Funding
Funding for this project was provided by the Weidenbaum Center on the Economy, Government, and Public Policy at Washington University in St. Louis, the National Science Foundation RIDIR program award number 1738288 and The Eunice Kennedy Shriver National Institute of Child Health & Human Development of the National Institutes of Health under Award Number P2CHD047879.
Data Availability Statement
Replication code for this article is available at Ying, Montgomery, and Stewart (Reference Ying, Montgomery and Stewart2021) at https://doi.org/10.7910/DVN/S02EBF.
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2021.33.