1 Introduction
The analysis of text is central to a large and growing number of research questions in the social sciences. While analysts have long been interested in the tone and content of such things as media coverage of the economy (De Boef and Kellstedt Reference De Boef and Kellstedt2004; Tetlock Reference Tetlock2007; Goidel et al. Reference Goidel, Procopio, Terrell and Wu2010; Soroka, Stecula, and Wlezien Reference Soroka, Stecula and Wlezien2015), congressional bills (Hillard, Purpura, and Wilkerson Reference Hillard, Purpura and Wilkerson2008; Jurka et al. Reference Jurka, Collingwood, Boydstun, Grossman and van Atteveldt2013), party platforms (Laver, Benoit, and Garry Reference Laver, Benoit and Garry2003; Monroe, Colaresi, and Quinn Reference Monroe, Colaresi and Quinn2008; Grimmer, Messing, and Westwood Reference Grimmer, Messing and Westwood2012), and presidential campaigns (Eshbaugh-Soha Reference Eshbaugh-Soha2010), the advent of automated text classification methods combined with the broad reach of digital text archives has led to an explosion in the extent and scope of textual analysis. Whereas researchers were once limited to analyses based on text that was read and hand-coded by humans, machine coding by dictionaries and supervised machine learning (SML) tools is now the norm (Grimmer and Stewart Reference Grimmer and Stewart2013). The time and cost of the analysis of text has thus dropped precipitously. But the use of automated methods for text analysis requires the analyst to make multiple decisions that are often given little consideration yet have consequences that are neither obvious nor benign.
Before proceeding to classify documents, the analyst must: (1) select a corpus; and (2) choose whether to use a dictionary method or a machine learning method to classify each document in the corpus. If an SML method is selected, the analyst must also: (3) decide how to produce the training dataset—select the unit of analysis, the number of objects (i.e., documents or units of text) to code, and the number of coders to assign to each object.Footnote 1 In each section below, we first identify the options open to the analyst and present the theoretical trade-offs associated with each. Second, we offer empirical evidence illustrating the degree to which these decisions matter for our ability to predict the tone of coverage of the US national economy in the New York Times, as perceived by human readers. Third, we provide recommendations to the analyst on how to best evaluate their choices. Throughout, our goal is to provide a guide for analysts facing these decisions in their own work.Footnote 2
Some of what we present here may seem self-evident. If one chooses the wrong corpus to code, for example, it is intuitive that no coding scheme will accurately capture the “truth.” Yet less obvious decisions can also matter a lot. We show that two reasonable attempts to select the same population of news stories can produce dramatically different outcomes. In our running example, using keyword searches produces a larger corpus than using predefined subject categories (developed by LexisNexis), with a higher proportion of relevant articles. Since keywords also offer the advantage of transparency over using subject categories, we conclude that keywords are to be preferred over subject categories. If the analyst will be using SML to produce a training dataset, we show that it makes surprisingly little difference whether the analyst codes text at the sentence level or the article-segment level. Thus, we suggest taking advantage of the efficiency of coding at the segment level. We also show that maximizing the number of objects coded rather than having fewer objects, each coded by more coders, provides the most efficient means to optimize the performance of SML methods. Finally, we demonstrate that SML outperforms dictionary methods on a number of different criteria, including accuracy and precision, and thus we conclude that the use of SML is to be preferred, provided the analyst is able to produce a sufficiently high-quality/quantity training dataset.
Before proceeding to describe the decisions at hand, we make two key assumptions. First, we assume the analyst’s goal is to produce a measure of tone that accurately represents the tone of the text as read by humans.Footnote 3 Second, we assume, on average, the tone of a given text is interpreted by all people in the same way; in other words, there is a single “true” tone inherent in the text that has merely to be extracted. Of course, this second assumption is harder to maintain. Yet we rely on the extensive literature on the concept of the wisdom of the crowds—the idea that aggregating multiple independent judgments about a particular question can lead to the correct answer, even if those individual assessments are coming from individuals with low levels of information.Footnote 4 Thus we proceed with these assumptions in describing below each decision the analyst must make.
2 Selecting the Corpus: Keywords versus Subject Categories
The first decision confronting the analyst of text is how to select the corpus. The analyst must first define the universe, or source, of text. The universe may be well defined and relatively small, for example, the written record of all legislative speeches in Canada last year, or it may be broad in scope, for example, the set of all text produced by the “media.” Next, the analyst must define the population, the set of objects (e.g., articles, bills, tweets) in the universe relevant to the analysis. The population may correspond to the universe, but often the analyst will be interested in a subset of documents in the universe, such as those on a particular topic. Finally, the analyst selects the set of documents that defines the corpus to be classified.
The challenge is to adopt a sampling strategy that produces a corpus that mimics the population.Footnote 5 We want to include all relevant objects (i.e., minimize false negatives) and exclude any irrelevant objects (i.e., minimize false positives). Because we do not know a priori whether any article is in the population, the analyst is bound to work with a corpus that includes irrelevant texts. These irrelevant texts add noise to measures produced from the sample and add cost to the production of a training dataset (for analysts using SML). In contrast, a strategy that excludes relevant texts at best produces a noisier measure by increasing sampling variation in any measure produced from the sample corpus.
In addition to the concern about relevance is a concern about representation. For example, with every decision about which words or terms to include or omit from a keyword search, we run the risk of introducing bias. We might find that an expanded set of keywords yields a larger and highly relevant corpus, but if the added keywords are disproportionately negatively toned, or disproportionately related to one aspect of the economy compared to another vis-à-vis the population, then this highly relevant corpus would be of lower quality. The vagaries of language make this a real possibility. However, careful application of keyword expansion can minimize the potential for this type of error. In short, the analyst should strive for a keyword search that maximizes both relevance and representation vis-à-vis the population of interest.
One of two sampling strategies is typically used to select a corpus. In the first strategy, the analyst selects texts based on subject categories produced by an entity that has already categorized the documents (e.g., by topic). For example, the media monitoring site LexisNexis has developed an extensive set of hierarchical topic categories, and the media data provider ProQuest offers a long list of fine-grained topic categories, each identified through a combination of human discretion and machine learning.Footnote 6
In the second strategy, the analyst relies on a Boolean regular expression search using keywords (or key terms). Typically the analyst generates a list of keywords expected to distinguish between articles relevant to the topic compared to irrelevant articles. For example, in looking for articles about the economy, the analyst would likely choose “unemployment” as a keyword. There is a burden on the analyst to choose terms that capture documents representative of the population being studied. An analyst looking to examine articles to measure tone of the economy who started with a keyword set including “recession”, “depression”, and “layoffs” but omitting “boom” and “expansion” runs the risk of producing a biased corpus. But once the analyst chooses a small set of core keywords, there are established algorithms an analyst can use to move to a larger set.Footnote 7
What are the relative advantages of these two sampling strategies? We might expect corpora selected using subject categories defined at least in part by humans to be relatively more likely than keyword-generated samples to capture relevant documents and omit irrelevant documents precisely because humans were involved in their creation. If humans categorize the text synchronously with its production, category labels may also account for differences in vocabulary specific to any given point in time. However, if subject categories rely on human coders, changing coders could cause a change in categorization independent of actual content, and this drift would be invisible to the analyst. More significantly, and specifically in the case of text categorized by media providers such as LexisNexis and ProQuest, the means of assigning individual objects to the subject categories provided by the archive (or even by an original news source) are often proprietary.
The resulting absence of transparency is a serious problem for scientific research, even if the categorization is highly accurate (a point on which there is no evidence). As a direct result, the search cannot be replicated in other contexts, whether across publications or countries, nor can a dataset built from it be updated. Finally, the categorization rules used by the archiver may change over time. In the case of LexisNexis and ProQuest, not only do the rules change, but so does the list of available subject categories. As of 2019, LexisNexis no longer even provides a full list of subject categories.Footnote 8
The second strategy, using a keyword search, gives the analyst control over the breadth of the search. In this case, the search is easily transported to and replicable across alternative or additional universes of documents. Of course, if the analyst chooses to do a keyword search, the choice of keywords becomes “key.” There are many reasons any keyword search can be problematic: relevant terms can change over time, different publications can use overlooked synonyms, and so on.Footnote 9
Here, we compare the results produced by using these two strategies to generate corpora of newspaper articles from our predefined universe (The New York Times), intended to measure the tone of news coverage of the US national economy.Footnote 10 As an example of the first strategy, Soroka, Stecula, and Wlezien (Reference Soroka, Stecula and Wlezien2015) selected a corpus of texts from the universe of the New York Times from 1980–2011 based on media-provided subject categories using LexisNexis.Footnote 11 As an example of the second strategy, we used a keyword search of the New York Times covering the same time period using ProQuest.Footnote 12 We compare the relative size of the two corpora, their overlap, the proportion of relevant articles in each, and the resulting measures of tone produced by each. On the face of it, there is little reason to claim that one strategy will necessarily be better at reproducing the population of articles about the US economy from the New York Times and thus a better measure of tone. The two strategies have the same goal, and one would hope they would produce similar corpora.
The subject category search listed by Soroka et al. captured articles indexed in at least one of the following LexisNexis defined subcategories of the subject “Economic Conditions”: Deflation, Economic Decline, Economic Depression, Economic Growth, Economic Recovery, Inflation, or Recession. They also captured articles in the following LexisNexis subcategories of the subject “Economic Indicators”: Average Earnings, Consumer Credit, Consumer Prices, Consumer Spending, Employment Rates, Existing Home Sales, Money Supply, New Home Sales, Productivity, Retail Trade Figures, Unemployment Rates, or Wholesale Prices.Footnote 13 This corpus is the basis of the comparisons below.
To generate a sample of economic news stories using a keyword search, we downloaded all articles from the New York Times archived in ProQuest with any of the following terms: employment, unemployment, inflation, consumer price index, GDP, gross domestic product, interest rates, household income, per capita income, stock market, federal reserve, consumer sentiment, recession, economic crisis, economic recovery, globalization, outsourcing, trade deficit, consumer spending, full employment, average wage, federal deficit, budget deficit, gas price, price of gas, deflation, existing home sales, new home sales, productivity, retail trade figures, wholesale prices AND United States.Footnote 14, Footnote 15, Footnote 16
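To make the filtering step concrete, the sketch below applies a whole-word, case-insensitive Boolean filter of this form to in-memory text (for example, to re-check a downloaded corpus). The matching rules and the toy articles are illustrative assumptions rather than the ProQuest query syntax.

```python
import re

# Subset of the keyword list used in the text; the matching rules here are
# illustrative assumptions, not the ProQuest search syntax.
KEYWORDS = [
    "unemployment", "inflation", "consumer price index", "gross domestic product",
    "interest rates", "stock market", "federal reserve", "recession",
    "economic recovery", "trade deficit", "consumer spending", "budget deficit",
]

# One whole-word, case-insensitive pattern per keyword or phrase.
PATTERNS = [re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE) for k in KEYWORDS]
US_PATTERN = re.compile(r"\bUnited States\b")

def matches_query(text: str) -> bool:
    """True if the article mentions any economic keyword AND 'United States'."""
    return bool(US_PATTERN.search(text)) and any(p.search(text) for p in PATTERNS)

# Hypothetical usage on a small in-memory corpus.
articles = [
    "Unemployment in the United States fell to 4.1 percent last month.",
    "The local theater season opens with three new productions.",
]
corpus = [a for a in articles if matches_query(a)]
print(len(corpus))  # -> 1
```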
Overall the keyword search produced a corpus containing nearly twice as many articles as the subject category corpus (30,787 vs. 18,895).Footnote 17 This gap in corpus size raises at least two questions: first, do the two corpora share common dynamics, and second, is the smaller corpus simply a subset of the bigger one? Addressing the first question, the two series move in parallel (correlation $\rho = 0.71$). Figure 1 shows (stacked) the number of articles unique to the subject category corpus (top), the number unique to the keyword corpus (bottom), and the number common to both (middle).Footnote 18

Notably, not only do the series correlate strongly, but the spikes in each corpus also correspond to periods of economic crisis. Turning to the second question, the subject category corpus is not at all a subset of the keyword corpus. Overall only 13.9% of the articles in the keyword corpus are included in the subject category corpus and only 22.7% of the articles in the subject category corpus are found in the keyword corpus. In other words, if we were to code the tone of media coverage based on the keyword corpus we would omit 77.3% of the articles in the subject category corpus, while if we relied on the subject category corpus, we would omit 86.1% of the articles in the keyword corpus. There is, in short, shockingly little article overlap between two corpora produced using reasonable strategies designed to capture the same population: the set of New York Times articles relevant to the state of the US economy.
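A minimal sketch of how such overlap shares can be computed, assuming articles can be keyed on a normalized headline and publication date; the matching procedure actually used is described in the appendix and may differ.

```python
def article_key(article):
    # Key articles on a normalized headline plus publication date (an assumption).
    return (article["headline"].strip().lower(), article["date"])

def overlap_shares(keyword_corpus, subject_corpus):
    kw_keys = {article_key(a) for a in keyword_corpus}
    sc_keys = {article_key(a) for a in subject_corpus}
    common = kw_keys & sc_keys
    # Share of each corpus that also appears in the other corpus.
    return len(common) / len(kw_keys), len(common) / len(sc_keys)

# In the running example these shares are roughly 0.139 for the keyword
# corpus and 0.227 for the subject category corpus.
```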
Figure 1. Comparing Articles Unique to and Common Between Corpora: Stacked Annual Counts New York Times, 1980–2011. Note: See text for details explaining the generation of each corpus. See Appendix Footnote 1 for a description of the methods used to calculate article overlap.
Having more articles does not necessarily indicate that one corpus is better than the other. The lack of overlap may indicate the subject category search is too narrow and/or the keyword search is too broad. Perhaps the use of subject categories eliminates articles that provide no information about the state of the US national economy, despite containing terms used in the keyword search. In order to assess these possibilities, we recruited coders through the online crowd-coding platform CrowdFlower (now Figure Eight), who coded the relevance of: (1) 1,000 randomly selected articles unique to the subject category corpus; (2) 1,000 randomly selected articles unique to the keyword corpus, and (3) 1,000 randomly selected articles in both corpora.Footnote 19 We present the results in Table 1.Footnote 20
Table 1. Proportion of Relevant Articles by Corpus.
| Corpus | Proportion of relevant articles |
|---|---|
| Unique to subject category corpus | 0.37 |
| Unique to keyword corpus | 0.42 |
| In both corpora | 0.44 |
Note: Cell entries indicate the proportion of articles in each dataset (and their overlap) coded as providing information about how the US economy is doing. One thousand articles from each dataset were coded by three CrowdFlower workers located in the US. Each coder was assigned a weight based on her overall performance before computing the proportion of articles deemed relevant. If two out of three (weighted) coders concluded an article was relevant, the aggregate response is coded as “relevant”.
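A minimal sketch of the weighted aggregation described in the note; the coder weights are treated as given inputs, since the exact performance-weighting formula is not specified here.

```python
from typing import List, Tuple

def aggregate_relevance(codings: List[Tuple[float, int]]) -> int:
    """codings: (coder_weight, relevance judgment in {0, 1}) for each coder."""
    total_weight = sum(w for w, _ in codings)
    relevant_weight = sum(w for w, r in codings if r == 1)
    # Weighted majority rule: with equal weights this reduces to the
    # two-out-of-three rule described in the table note.
    return int(relevant_weight > total_weight / 2)

print(aggregate_relevance([(0.9, 1), (0.8, 1), (0.6, 0)]))  # -> 1 (relevant)
```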
Overall, both search strategies yield a sample with a large proportion of irrelevant articles, suggesting the searches are too broad.Footnote 21 Unsurprisingly the proportion of relevant articles was highest, 0.44, in articles that appeared in both the subject category and keyword corpora. Nearly the same proportion of articles unique to the keyword corpus was coded as relevant (0.42), while the proportion of articles unique to the subject category corpus coded relevant dropped by 13.5%, to 0.37. This suggests the LexisNexis subject categories do not provide any assurance that an article provides “information about the state of the economy.” Because the set of relevant articles in each corpus is really a sample of the population of articles about the economy and since we want to estimate the population values, we prefer a larger to a smaller sample, all else being equal. In this case, the subject category corpus has 7,291 relevant articles versus the keyword corpus with 13,016.Footnote 22 Thus the keyword dataset would give us on average 34 relevant articles per month with which to estimate tone, compared to 19 from the subject category dataset. Further, the keyword dataset is not providing more observations at a cost of higher noise: the proportion of irrelevant articles in the keyword corpus is lower than the proportion of irrelevant articles in the subject category corpus.
These comparisons demonstrate that the given keyword and subject category searches produced highly distinct corpora and that both corpora contained large portions of irrelevant articles. Do these differences matter? The highly unique content of each corpus suggests the potential for bias in both measures of tone. And the large proportion of irrelevant articles suggests both resulting measures of tone will contain measurement error. But given that we do not know the true tone as captured by a corpus that includes all relevant articles and no irrelevant articles (i.e., in the population of articles on the US national economy), we cannot address these concerns directly.Footnote 23 We can, however, determine how much the differences between the two corpora affect the estimated measures of tone. Applying Lexicoder, the dictionary used by Soroka, Stecula, and Wlezien (Reference Soroka, Stecula and Wlezien2015), to both corpora we find a correlation of 0.48 between the two monthly series while application of our SML algorithm resulted in a correlation of 0.59.Footnote 24 Longitudinal changes in tone are often the quantity of interest and the correlations of changes in tone are much lower, 0.19 and 0.36 using Lexicoder and SML, respectively. These low correlations are due in part to measurement error in each series, but these are disturbingly low correlations for two series designed to measure exactly the same thing. Our analysis suggests that regardless of whether one uses a dictionary method or an SML method, the resulting estimates of tone may vary significantly depending on the method used for choosing the corpus in the first place.
The extent to which our findings generalize is unclear—keyword searches may be ineffective and subject categorization may be quite good in other cases. However, keyword searches are within the analyst’s control, transparent, reproducible, and portable. Subject category searches are not. We thus recommend analysts use keyword searches rather than subject categories, but that they do so with great care. Whether using a manual approach to keyword generationFootnote 25 or a computational query expansion approach it is critical that the analyst pay attention to selecting keywords that are both relevant to the population of interest and representative of the population of interest. For relevance, analysts can follow a simple iterative process (sketched below): start with two searches, one narrow and one broader, and code (a sample of) each corpus for relevance. If the broader search returns more objects than the narrow one without a decline in the proportion of relevant objects, repeat the process using the broader search as the baseline and comparing it to an even broader one. As soon as a broader search yields a decline in the proportion of relevant objects, return to the previous search as the optimal keyword set.
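A minimal sketch of this iterative procedure, assuming the analyst supplies a search routine and a routine that codes a sample of a corpus for relevance; both are placeholders for steps carried out outside the code.

```python
def choose_keyword_set(candidate_sets, run_search, relevance_rate):
    """candidate_sets: keyword sets ordered from narrowest to broadest.

    run_search(keywords) -> list of matching documents (assumed helper).
    relevance_rate(corpus) -> proportion of a coded sample judged relevant
    (assumed helper, e.g., based on crowd coding).
    """
    best = candidate_sets[0]
    best_corpus = run_search(best)
    best_rate = relevance_rate(best_corpus)
    for broader in candidate_sets[1:]:
        corpus = run_search(broader)
        rate = relevance_rate(corpus)
        # Keep broadening only while the corpus grows without a drop in the
        # proportion of relevant articles.
        if len(corpus) > len(best_corpus) and rate >= best_rate:
            best, best_corpus, best_rate = broader, corpus, rate
        else:
            break
    return best
```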
Note there will always be a risk of introducing bias by omitting relevant articles in a non-random way. Thus, we recommend analysts utilize established keyword expansion methods but also domain expertise (good old-fashioned subject-area research) so as to expand the keywords in a way that does not skew the sample toward an unrepresentative portion of the population of interest. There is potentially a large payoff to this simple use of human intervention early on.
3 Creating a Training Dataset: Two Crucial Decisions
Once the analyst selects a corpus, there are two fundamental options for coding sentiment (beyond traditional manual content analysis): dictionary methods and SML methods. Before comparing these approaches, we consider decisions the analyst must make to carry out a necessary step for applying SML: producing a training dataset.Footnote 26 To do so, the analyst must: a) choose a unit of analysis for coding, b) choose coders, and c) decide how many coders to have code each document.Footnote 27
To understand the significance of these decisions, recall that the purpose of the training data is to train a classifier. We estimate a model of sentiment ($Y$) as labeled by humans as a function of the text of the objects, the features of which compose the independent variables. Our goal is to develop a model that best predicts the outcome out of sample. We know that to get the best possible estimates of the parameters of the model we must be concerned with measurement error about $Y$ in our sample, the size of our sample, and the variance of our independent variables. Since, as we see below, measurement error about $Y$ will be a function of the quality of coders and the number of coders per object, it is impossible to consider quality of coders, number of coders, and training set size independently. Given the likely existence of a budget constraint we will need to make a choice between more coders per object and more objects coded. Also, the unit of analysis (e.g., sentences or articles) selected for human coding will affect the amount of information contained in the training dataset, and thus the precision of our estimates.
In what follows we present the theoretical trade-offs associated with the different choices an analyst might make when confronted with these decisions, as well as empirical evidence and guidelines. Our goal in the running example is to develop the best measure of tone of New York Times coverage of the US national economy, 1947 to 2014, where best refers to the measure that best predicts the tone as perceived by human readers. Throughout, unless otherwise noted, we use a binary classifier trained from the coding produced using a 9-point ordinal scale (where 1 is very negative and 9 is very positive) collapsed such that $1{-}4=0$ and $6{-}9=1$. If a coder used the midpoint (5), we did not use the item in our training dataset.Footnote 28
The machine learning algorithm used to train the classifier is logistic regression with an L2 penalty, where the features are the 75,000 most frequent stemmed unigrams, bigrams, and trigrams appearing in at least three documents and no more than 80% of all documents (stopwords are included).Footnote 29
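A minimal scikit-learn sketch of a classifier with this specification; the use of raw count features, the solver, and the toy usage are our assumptions, and the stemming step applied in the paper is omitted for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Unigrams, bigrams, and trigrams; keep terms in at least 3 documents and at
# most 80% of documents; cap the vocabulary at 75,000 features; keep stopwords.
# Stemming (applied in the paper) would be added via a custom preprocessor.
pipeline = Pipeline([
    ("features", CountVectorizer(ngram_range=(1, 3), min_df=3, max_df=0.80,
                                 max_features=75000, stop_words=None)),
    ("clf", LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000)),
])

# Hypothetical usage: X_train is a list of segment texts, y_train the binary
# labels (0 = negative, 1 = positive) after collapsing the 9-point scale.
# pipeline.fit(X_train, y_train)
# predicted = pipeline.predict(X_test)
```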
Analyses draw on a number of different training datasets where the sample size, unit of analysis coded, the type and number of coders, and number of objects coded vary in accord with the comparisons of interest. Each dataset is named for the sample of objects coded (identified by a number from 1 to 5 for our five samples), the unit of analysis coded (S for sentences, A for article segments), and coders used (U for undergraduates, C for CrowdFlower workers). For example, Dataset 5AC denotes sample number five, based on article-segment-level coding by crowd coders. For article-segment coding, we use the first five sentences of the article.Footnote 30 (See Appendix Table 1 for details.)
For the purpose of assessing out-of-sample accuracy we have two “ground truth” datasets. In the first (which we call CF Truth), ten CrowdFlower workers coded 4,400 article segments randomly selected from the corpus. We then utilized the set of 442 segments that were coded as relevant by at least seven of the ten coders, defining “truth” as the average tone coded for each segment. If the average coding was neutral (5), the segment was omitted. The second “ground truth” dataset (UG Truth) is based on Dataset 3SU (Appendix Table 1) in which between 2 and 14 undergraduates coded 4,195 sentences using a 5-category coding scheme (negative, mixed, neutral, not sure, positive) from articles selected at random from the corpus.Footnote 31 We defined each sentence as positive or negative based on a majority rule among the codings (if there was a tie, the sentence is coded neutral/mixed). The tone of each article segment was defined by aggregating the individual sentences coded in it, again following a majority rule so that a segment is coded as positive in UG Truth if a majority of the first five sentences classified as either positive or negative are classified as positive.
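A minimal sketch of the majority-rule aggregation used to build UG Truth; the handling of segment-level ties is our assumption, since the text does not specify it.

```python
def sentence_label(codings):
    """Majority label among coders; ties are coded neutral/mixed, as in the text."""
    counts = {c: codings.count(c) for c in set(codings)}
    top = max(counts.values())
    winners = [c for c, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else "neutral/mixed"

def segment_label(sentence_labels):
    """Segment is positive if a majority of its toned sentences are positive."""
    toned = [s for s in sentence_labels if s in ("positive", "negative")]
    if not toned:
        return "neutral/mixed"
    pos = sum(1 for s in toned if s == "positive")
    # Segment-level ties default to "negative" here (an assumption).
    return "positive" if pos > len(toned) / 2 else "negative"

print(segment_label(["positive", "negative", "positive", "neutral", "positive"]))  # -> positive
```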
3.1 Selecting a Unit of Analysis: Segments versus Sentences
Should an SML classifier be trained using coding that matches the unit of interest to be classified (e.g., a news article), or a smaller unit within it (e.g., a sentence)? Arguably the dictum “code the unit of analysis to be classified” should be our default: if we wish to code articles for tone, we should train the classifier based on article-level human coding. Indeed, we have no reason to expect that people reading an article come away assessing its tone as a simple sum of its component sentences.
However, our goal in developing a training dataset is to obtain estimates of the weights to assign to each feature in the text in order to predict the tone of an article. There are at least two reasons to think that sentence-level coding may be a better way to achieve this goal. First, if articles contain sentences not relevant to the tone of the article, these would add noise to article-level classification. But using sentence-level coding, irrelevant sentences can be excluded. Second, if individual sentences contain features with a single valence (i.e., either all positive or all negative), but articles contain both positive and negative sentences, then information will be lost if the coder must choose a single label for the entire article. Of course, if articles consist of uniformly toned sentences, any benefit of coding at the sentence level is likely lost. Empirically it is unclear whether we would be better off coding sentences or articles.
Here, we do not compare sentence-level coding to article-level coding directly, but rather compare sentence-level coding to “segment”-level coding, using the first five or so sentences in an article as a segment. Although a segment as we define it is not nearly as long as an article, it retains the key distinction that underlies our comparison of interest, namely that it contains multiple sentences.Footnote 32
Below we discuss the distribution of relevant and irrelevant sentences within relevant and irrelevant segments in our data, and then discuss the distribution of positive and negative sentences within positive and negative segments.Footnote 33 Then we compare the out-of-sample predictive accuracy of a classifier based on segment-level coding to one based on sentence-level coding. We evaluate the effect of unit of analysis using two training datasets. In the first (Dataset 1SC in Appendix Table 1), three CrowdFlower coders coded 2,000 segments randomly selected from the corpus. In the second (Dataset 1AC) three CrowdFlower coders coded each of the sentences in these same segments individually.Footnote 34
We first compute the average number of sentences coded as relevant and irrelevant in cases where an article was unanimously coded as relevant by all three coders. We find that on average slightly more sentences are coded as irrelevant (2.64) than relevant (2.33) (see Appendix Table 8). This finding raises concerns about using segments as the unit of analysis, since a segment-level classifier would learn from features in the irrelevant sentences, while a sentence-level classifier could ignore them.
Next, we examine the average count of positive and negative sentences in positive and negative segments for the subset of 1,789 segments coded as relevant by at least one coder. We find that among the set of segments all coders agreed were positive, an average of just under one sentence (0.91) was coded positive by all coders, while fewer than a third as many sentences were on average coded as having negative tone (0.27). Negative segments tended to contain one (1.00) negative sentence and essentially no positive sentences (0.08). The homogeneity of sentences within negative segments suggests coding at the segment level might do very well. The results are more mixed in positive segments.Footnote 35 If we have equal numbers of positive and negative segments, then approximately one in five negative sentences will be contained in a positive segment. That could create some error when coded at the segment level.Footnote 36
Finally, in order to assess the performance of classifiers trained on each unit of analysis, we produce two classifiers: one by coding tone at the sentence level and one by coding tone at the segment level.Footnote 37 We compare out-of-sample accuracy of segment classification based on each of the classifiers using the CF Truth dataset (where accuracy is measured at the segment level). The out-of-sample accuracy scores of data coded at the sentence and segment levels are 0.700 and 0.693, respectively. The choice of unit of analysis has, in this case, surprisingly little consequence, suggesting there is little to be gained by going through the additional expense and processing burden associated with breaking larger units into sentences and coding at the sentence level. While other datasets might yield different outcomes, we think analysts could proceed with segment-level coding.
3.2 Allocating Total Codings: More Documents versus More Coders
Having decided the unit to be coded and assuming a budget constraint the analyst faces another decision: should each coder label a unique set of documents and thus have one coding per document, or should multiple coders code the same set of documents to produce multiple codings per document, but on a smaller set of documents? To provide an example of the problem, assume four coders of equal quality. Further assume an available budget of $100 and that each document coded by each coder costs ten cents such that the analyst can afford 1,000 total codings. If the analyst uses each coder equally, that is, each will code 250 documents, the relevant question is how to allocate those codings: each coder could code 250 unique documents (1,000 unique documents, one coding each), all four coders could code the same 250 documents (250 unique documents, four codings each), the coders could be split into two pairs with each pair coding its own set of 250 unique documents (500 unique documents, each coded twice), and so on.Footnote 38
The answer is readily apparent if the problem is framed in terms of levels of observations and clustering. If we have multiple coders coding the same document, then we have only observed one instance of the relationship between the features of the document and the true outcome, though we have multiple measures of it. Thus estimates of the classifier weights will be less precise, that is, $\hat{\beta}$ will be further from the truth, and our estimates of $\hat{Y}$, the sentiment of the text, will be less accurate than if each coder labeled a unique set of documents and we expanded the sample size of the set where we observe the relationship between the features of documents and the true outcome. In other words, coding additional documents provides more information than does having an additional coder, coder $i$, code a document already coded by coder $j$. Consider the limiting case where all coders code with no error. Having a second coder code a document provides zero information and cannot improve the estimates of the relationship between the features of the document and the outcome. However, coding an additional document provides one new datapoint, increasing our sample size and thus our statistical power. Intuitively, the benefit of more documents over more coders would increase as coder accuracy increases.
We perform several simulations to examine the impact of coding each document with multiple coders versus more unique documents with fewer coders per document. The goal is to achieve the greatest out-of-sample accuracy with a classifier trained on a given number of Total Codings, where we vary the number of unique documents and number of coders per document. To mimic our actual coding tasks, we generate 20,000 documents with a true value between 0 and 1 based on an underlying linear model using 50 independent variables, converted to a probability with a logit link function. We then simulate unbiased coders with variance of 0.81 to produce a continuous coding of a subset of documents.Footnote 39 Finally, we convert each continuous coding to a binary (0/1) classification. Using these codings we estimate an L2 logit regression.
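A minimal sketch of one cell of such a simulation, under assumed distributions for the covariates and coefficients (standard normal) that the text does not specify; coder noise is Gaussian with variance 0.81, and accuracy is evaluated against the full set of simulated documents as an out-of-sample proxy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 20,000 documents, 50 features, true probabilities via a logit link.
n_docs, n_feat = 20_000, 50
X = rng.normal(size=(n_docs, n_feat))
beta = rng.normal(size=n_feat)              # assumed coefficient distribution
p_true = 1 / (1 + np.exp(-X @ beta))
y_true = (p_true > 0.5).astype(int)

def simulate_codings(n_unique, n_coders):
    """Train on n_unique documents, each coded by n_coders noisy coders."""
    idx = rng.choice(n_docs, size=n_unique, replace=False)
    rows, labels = [], []
    for i in idx:
        for _ in range(n_coders):
            noisy = p_true[i] + rng.normal(scale=np.sqrt(0.81))  # unbiased coder
            rows.append(X[i])
            labels.append(int(noisy > 0.5))                      # binary coding
    return np.array(rows), np.array(labels)

# Hold total codings fixed at 240: one coder on 240 docs vs. four coders on 60.
for n_coders, n_unique in [(1, 240), (2, 120), (4, 60)]:
    Xc, yc = simulate_codings(n_unique, n_coders)
    clf = LogisticRegression(penalty="l2", solver="liblinear").fit(Xc, yc)
    print(n_coders, "coder(s):", round(clf.score(X, y_true), 3))
```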
Figure 2. Accuracy with Constant Number of Total Codings. Note: Results are based on simulations described in the text. Plotted points are jittered based on the difference from mean to clearly indicate ordering.
Figure 2 shows accuracy based on a given number of total codings $\mathit{TC}$ achieved with different combinations of number of coders, $j$, ranging from 1 to 4, and number of unique documents, $n\in \{240, 480, 960, 1920, 3840\}$. For example, the first vertical set of codings shows mean accuracy rates achieved with one coder coding 240 unique objects, 2 coders coding 120 unique objects, 3 coders coding 80 unique objects, and 4 coders coding 60 unique objects. The results show that for any given number of total codings, predictive accuracy is always higher with fewer coders: $PCP_{\mathit{TC}|j}>PCP_{\mathit{TC}|(j+k)}\;\forall k>0$, where PCP stands for Percent Correctly Predicted (accuracy).
These simulations demonstrate that the analyst seeking to optimize predictive accuracy for any fixed number of total codings should maximize the number of unique documents coded. While increasing the number of coders for each document can improve the accuracy of the classifier (see Appendix Section 7), the informational gains from increasing the number of documents coded are greater than from increasing the number of codings of a given document. This does not obviate the need to have multiple coders code at least a subset of documents, namely to determine coder quality and to select the best set of coders to use for the task at hand. But once the better coders are identified, the optimal strategy is to proceed with one coder per document.
4 Selecting a Classification Method: Supervised Machine Learning versus Dictionaries
Dictionary methods and SML methods constitute the two primary approaches for coding the tone of large amounts of text. Here, we describe each method, discuss the advantages and disadvantages of each, and assess the ability of a number of dictionaries and SML classifiers (1) to correctly classify documents labeled by humans and (2) to distinguish between more and less positive documents.Footnote 40
A dictionary is a user-identified set of features or terms relevant to the coding task where each feature is assigned a weight that reflects the feature’s user-specified contribution to the measure to be produced, usually +1 for positive and -1 for negative features. The analyst then applies some decision rule, such as summing over all the weighted feature values, to create a score for the document. By construction, dictionaries code documents on an ordinal scale, that is, they sort documents as to which are more or less positive or negative relative to one another. If an analyst wants to know which articles are positive or negative, they need to identify a cut point (zero point). We may assume an article with more positive terms than negative terms is positive, but we do not know ex ante if human readers would agree. If one is interested in relative tone, for example, if one wanted to compare the tone of documents over time, the uncertainty about the cut point is not an issue.
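A minimal sketch of such a dictionary scoring rule; the word list is a tiny placeholder rather than any published dictionary, and treating a positive sum as a "positive" article is exactly the untested cut-point assumption discussed above.

```python
# Placeholder dictionary with +1/-1 weights; not an actual published dictionary.
DICTIONARY = {"growth": 1, "recovery": 1, "boom": 1,
              "recession": -1, "layoffs": -1, "decline": -1}

def dictionary_score(text: str) -> int:
    """Sum the weights of all dictionary terms appearing in the text."""
    tokens = text.lower().split()
    return sum(DICTIONARY.get(t, 0) for t in tokens)

# Without a validated cut point, scores only order documents from more
# negative to more positive; score > 0 as "positive" is an assumption.
print(dictionary_score("the recovery stalled amid layoffs and decline"))  # -> -1
```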
The analyst selecting SML follows three broad steps. First, a sample of the corpus (the training dataset) is coded (classified) by humans for tone, or whatever attribute is being measured (the text is labeled). Then a classification method (machine learning algorithm) is selected and the classifier is trained to predict the label assigned by the coders within the training dataset.Footnote 41 In this way the classifier “learns” the relevant features of the dataset and how these features are related to the labels. Multiple classification methods are generally applied and tested for minimum levels of accuracy using cross-validation to determine the best classifier. Finally, the chosen classifier is applied to the entire corpus to predict the sentiment of all unclassified articles (those not labeled by humans).
Dictionary and SML methods allow analysts to code vast amounts of text that would not be possible with human coding, and each presents unique advantages but also challenges. One advantage of dictionaries is that many have already been created for a variety of tasks, including measuring the tone of text. If an established dictionary is a good fit for the task at hand, then it is relatively straightforward to apply it. However, if an appropriate dictionary does not already exist, the analyst must create one. Because creating a dictionary requires identifying features and assigning weights to them, it is a difficult and time-consuming task. Fortunately, humans have been “trained” on a lifetime of interactions with language and thus can bring a tremendous amount of prior information to the table to assign weights to features. Of course, this prior information meets many practical limitations. Most dictionaries will code unigrams, since if the dictionary is expanded to include bigrams or trigrams the number of potential features increases quickly and adequate feature selection becomes untenable. And all dictionaries necessarily consider a limited and subjective set of features, meaning not all features in the corpus and relevant to the analysis will be in the dictionary. It is important, then, that analysts carefully vet their selection of terms. For example, Muddiman and Stroud (Reference Muddiman and Stroud2017) construct dictionaries by asking human coders to identify words for inclusion, and then calculate the inter-coder reliability of coder suggestions. Further, in assigning weights to each feature, analysts must make the assumption that they know the importance of each feature in the dictionary and that all text not included in the dictionary has no bearing on the tone of the text.Footnote 42 Thus, even with rigorous validation, dictionaries necessarily limit the amount of information that can be learned from the text.
In contrast, when using SML the relevant features of the text and their weights are estimated from the data.Footnote 43 The feature space is thus likely to be both larger and more comprehensive than that used in a dictionary. Further, SML can more readily accommodate the use of n-grams or co-occurrences as features and thus partially incorporate the context in which words appear. Finally, since SML methods are trained on data where humans have labeled an article as “positive” or “negative”, they estimate a true zero point and can classify individual documents as positive or negative. The end result is that much more information drives the subsequent classification of text.
But SML presents its own challenges. Most notably it requires the production of a large training dataset coded by humans and built from a random set of texts in which the features in the population of texts are well represented. Creating the training dataset itself requires the analyst to decide a unit of analysis to code, the number of coders to use per object, and the number of objects to be coded. These decisions, as we show above, can affect the measure of tone produced. In addition, it is not clear how generalizable any training dataset is. For example, it may not be true that a classifier trained on data from the New York Times is optimal for classifying text from USA Today or that a classifier trained on data from one decade will optimally classify articles from a different decade.
Dictionary methods allow the analyst to bypass these tasks and their accompanying challenges entirely. Yet, the production of a human-coded training dataset for use with SML allows the analyst to evaluate the performance of the classifier with measures of accuracy and precision using cross-validation. Analysts using dictionaries typically have no (readily available) human-coded documents with which to evaluate classifier performance. Even when using dictionaries tested by their designers, there is no guarantee that the test of the dictionary on one corpus for one task or within one domain (e.g., newspaper articles) validates the dictionary’s use on a different corpus for a different task or domain (e.g., tweets). In fact the evaluation of the accuracy of dictionaries is difficult precisely because of the issue discussed earlier, that they have no natural cut point to distinguish between positive and negative documents. The only way to evaluate performance of a dictionary is to have humans code a sample of the corpus and examine whether the dictionary assigns higher scores to positive documents and lower scores to negative documents as evaluated by humans. For example, Young and Soroka (Reference Young and Soroka2012) evaluate Lexicoder by binning documents based on scores assigned by human coders and reporting the average Lexicoder score for documents in each bin. By showing that the Lexicoder score for each bin is correlated with the human score, they validate the performance of Lexicoder.Footnote 44 Analysts using dictionaries “off-the-shelf” could perform a similar exercise for their applications, but at that point the benefits of using a dictionary begin to deteriorate. In any case, analysts using dictionaries should take care both in validating the inclusion of terms to begin with and validating that text containing those terms has the intended sentiment.
Given the advantages and disadvantages of the two methods, how should the analyst think about which is likely to perform better? Before moving to empirics, we can perform a thought experiment. If we assume both the dictionary and the training dataset are of high quality, then we already know that if we consider SML classifiers that utilize only words as features, it is mathematically impossible for dictionaries to do as well as an SML model trained on a large enough dataset if we are testing for accuracy within sample. The dictionary comes with a hard-wired set of parameter values for the importance of a predetermined set of features. The SML model will estimate parameter values optimized to minimize error of the classifier on the training dataset. Thus, SML will necessarily outperform the dictionary on that sample. So the relevant question is, which does better out of sample? Here, too, since the SML model is trained on a sample of the data, it is guaranteed to do better than a dictionary as long as it is trained on a large enough random sample. As the sample converges to the population—or as the training dataset contains an ever increasing proportion of words encountered—SML has to do better than a dictionary, as the estimated parameter values will converge to the true parameter values.
While a dictionary cannot compete with a classifier trained on a representative and large enough training dataset, in any given task dictionaries may outperform SML if these conditions are not met. Dictionaries bring rich prior information to the classification task: humans may produce a topic-specific dictionary that would require a large training dataset to outperform. Similarly, a poor training dataset may not contain enough (or good enough) information to outperform a given dictionary. Below we compare the performance of a number of dictionaries with SML classifiers in the context of coding sentiment about the economy in the New York Times, and we examine the role of the size of the training dataset in this comparison in order to assess the utility of both methods.
4.1 Comparing Classification Methods
The first step in comparing the two approaches is to identify the dictionaries and SML classifiers we wish to compare. We consider three widely used sentiment dictionaries. First, SentiStrength is a general sentiment dictionary optimized for short texts (Thelwall et al. Reference Thelwall, Buckley, Paltoglou, Cai and Kappas2010). It produces a positive and negative score for each document based on the word score associated with the strongest positive word (between 0 and 4) and the strongest negative word (between 0 and -4) in the document that are also contained in the dictionary. The authors do not generate a net tone score for each document; we do so by summing the positive and negative sentiment scores, such that document scores range from -4 to +4. Second, Lexicoder is a sentiment dictionary designed specifically for political text (Young and Soroka Reference Young and Soroka2012). It assigns every n-gram in a given text a binary indicator if that n-gram is in its dictionary, coding for whether it is positive or negative. Sentiment scores for documents are then calculated as the number of positive minus the number of negative terms divided by the total number of terms in the document. Third, Hopkins et al. (Reference Hopkins, Kim and Kim2017) created a relatively simple dictionary composed of just twenty-one economic terms.Footnote 45 They calculate the fraction of articles per month mentioning each of the terms. The fractions are summed, with positive and negative words having opposite signs, to calculate net tone in a given time interval. We extend their logic to predict article-level scores by summing the number of unique positive stems and subtracting the number of unique negative stems in an article to produce a measure of sentiment.
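A minimal sketch of the Lexicoder-style and Hopkins-style article scoring rules just described, with placeholder word lists standing in for the actual dictionaries.

```python
# Placeholder word sets; the real dictionaries are much larger.
POSITIVE = {"growth", "recovery", "gains"}
NEGATIVE = {"recession", "layoffs", "losses"}

def lexicoder_style_score(tokens):
    """(positive matches - negative matches) / total tokens."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

def hopkins_style_score(tokens):
    """Unique positive stems minus unique negative stems (the article-level
    extension described in the text)."""
    return len(POSITIVE & set(tokens)) - len(NEGATIVE & set(tokens))
```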
We train an SML classifier using a dataset generated from 4,400 unique articles (Dataset 5AC in Appendix Table 1) in the New York Times randomly sampled from the years 1947 to 2014. Between three and ten CrowdFlower workers coded each article for relevance. At least one coder coded 4,070 articles as relevant, with an average of 2.51 coders coding each relevant article for tone using the 9-point scale. The optimal classifier was selected from a set of single-level classifiers including logistic regression (with L2 penalty), Lasso, ElasticNet, SVM, Random Forest, and AdaBoost.Footnote 46 Based on accuracy and precision evaluated using UG Truth and CF Truth, we selected regularized logistic regression with an L2 penalty, with up to 75,000 n-grams appearing in at least three documents and no more than 80% of all documents, including stopwords, and with stemming.Footnote 47 Appendix Table 12 presents the n-grams most predictive of positive and negative tone.
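A minimal sketch of this model-selection step using scikit-learn, with default hyperparameters standing in for the tuning actually performed and cross-validated accuracy as the comparison metric (the paper also evaluates precision against the ground truth datasets).

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Candidate classifiers mirroring the list in the text (defaults, not tuned).
candidates = {
    "logit_l2": LogisticRegression(penalty="l2", solver="liblinear"),
    "lasso_logit": LogisticRegression(penalty="l1", solver="liblinear"),
    "elasticnet_logit": LogisticRegression(penalty="elasticnet", solver="saga",
                                           l1_ratio=0.5, max_iter=5000),
    "svm": LinearSVC(),
    "random_forest": RandomForestClassifier(),
    "adaboost": AdaBoostClassifier(),
}

def compare_models(texts, labels):
    """Report 5-fold cross-validated accuracy for each candidate classifier."""
    for name, model in candidates.items():
        pipe = make_pipeline(
            CountVectorizer(ngram_range=(1, 3), min_df=3, max_df=0.80,
                            max_features=75000),
            model,
        )
        scores = cross_val_score(pipe, texts, labels, cv=5, scoring="accuracy")
        print(name, round(scores.mean(), 3))
```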
Figure 3. Performance of SML and Dictionary Classifiers—Accuracy and Precision. Note: Accuracy (percent correctly classified) and precision (percent of positive predictions that are correct) for the ground truth dataset coded by ten CrowdFlower coders. The dashed vertical lines indicate the baseline level of accuracy or precision (on any category) if the modal category is always predicted. The corpus used in the analysis is based on the keyword search of The New York Times 1980–2011 (see the text for details).
Figure 4. Accuracy of the SML Classifier as a Function of Size of the Training Dataset. Note: We drew ten random samples of 250 articles each from the full training dataset (Dataset 5AC, Appendix Table 1) of 8,750 unique codings of 4,400 unique articles (three to five crowd coders labeled each article) in the New York Times randomly sampled from the years 1947 to 2014. Using the same method as discussed in the text, we estimated the parameters of the SML classifier on each of these ten samples. We then used each of these estimates of the classifier to predict the tone of articles in CF Truth. We repeated this process for sample sizes of 250 to 8,750 by increments of 250, recording the percent of articles correctly classified.
We can now compare the performance of each approach. We begin by assessing them against the CF Truth dataset, comparing the percent of articles for which each approach (1) correctly predicts the direction of tone coded at the article-segment level by humans (accuracy) and (2) for articles predicted to be positive, the percent of predictions that align with human annotations (precision).Footnote 48, Footnote 49 Then, for SML, we consider the role of training dataset size and the threshold selected for classification. Finally, we assess accuracy for the baseline SML classifier and Lexicoder for articles humans have coded as particularly negative or positive and those about which our coders are more ambivalent.Footnote 50
4.1.1 Accuracy and Precision
Figure 3 presents the accuracy (left panel) and relative precision (right panel) of the dictionary and SML approaches. We include a dotted line in each panel of the figure to represent the percent of articles in the modal category. Any classifier can achieve this level of accuracy simply by always assigning each document to the modal category. Figure 3 shows that only SML outperforms the naive guess of the modal category. The baseline SML classifier correctly predicts coding by crowd workers in 71.0% of the articles they coded. In comparison, SentiStrength correctly predicts 60.5%, Lexicoder 58.6%, and the Hopkins 21-Word Method 56.9% of the articles in CF Truth. The relative performance of the SML classifier is even more pronounced with respect to precision, which is the more difficult task here as positive articles are the rare category. Positive predictions from our baseline SML model are correct 71.3% of the time, compared with 37.5% for SentiStrength, 45.7% for Lexicoder, and 38.5% for the Hopkins 21-Word Method. In sum, each of the dictionaries is both less accurate and less precise than the baseline SML model.Footnote 51
Figure 5. Receiver Operator Characteristic Curve: Lexicoder versus SML. Note: The x-axis gives the false positive rate—the proportion of all negatively toned articles in CF Truth that were classified as positively toned—and the y-axis gives the true positive rate—the proportion of all positively toned articles in CF Truth that were classified as positive. Each point on the curve represents the misclassification rate for a given classification threshold. The corpus used in the analysis is based on the keyword search of The New York Times 1980–2011.
What is the role of training dataset size in explaining the better accuracy and precision rates of the SML classifier? To answer this question, we drew ten random samples of 250 articles each from the full training dataset (Dataset 5AC). Using the same method as above, we estimated the parameters of the SML classifier on each of these ten samples. We then used each of these estimates of the classifier to predict the tone of articles in CF Truth, recording accuracy, precision, and recall for each replication.Footnote 52 We repeated this process for sample sizes of 250 to 8,750 by increments of 250. Figure 4 presents the accuracy results, with shaded areas indicating the 95% confidence interval. The x-axis gives the size of the training dataset and the y-axis reports the average accuracy in CF Truth for the given sample size. The final point represents the full training dataset, and as such there is only one accuracy rate (and thus no confidence interval).
What do we learn from this exercise? Using the smallest training dataset (250), the accuracy of the SML classifier equals the percent of articles in the modal category (about 63%). Further, accuracy improves quickly as the size of the training dataset increases. With 2,000 observations, SML is quite accurate, and there appears to be very little return for a training dataset with more than 3,000 articles. While it is clear that in this case 250 articles is not a large enough training dataset to develop an accurate SML classifier, even using this small training dataset the SML classifier has greater accuracy with respect to CF Truth than that obtained by any of the dictionaries.Footnote 53
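A minimal sketch of this resampling exercise; the `build_pipeline` argument stands in for the classifier specification described above, and the training and ground-truth arrays are assumed to be available.

```python
import numpy as np

def learning_curve(train_texts, train_labels, truth_texts, truth_labels,
                   build_pipeline, sizes=range(250, 8751, 250), n_reps=10, seed=0):
    """For each training-set size, fit the classifier on n_reps random
    subsamples and record mean/sd of accuracy on the ground-truth set."""
    rng = np.random.default_rng(seed)
    train_texts = np.array(train_texts, dtype=object)
    train_labels = np.array(train_labels)
    results = {}
    for size in sizes:
        accs = []
        for _ in range(n_reps):
            idx = rng.choice(len(train_texts), size=min(size, len(train_texts)),
                             replace=False)
            pipe = build_pipeline()           # e.g., the pipeline sketched earlier
            pipe.fit(train_texts[idx], train_labels[idx])
            accs.append(pipe.score(truth_texts, truth_labels))
        results[size] = (np.mean(accs), np.std(accs))
    return results
```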
Figure 6. Classification Accuracy in CF Truth as a Function of Article Score (Lexicoder) and Predicted Probability an Article is Positive (SML Classification). Note: Dictionary scores and SML predicted probabilities are assigned to each article in the CF Truth dataset. Articles are then assigned to a decile based on this score. Each block or circle on the graph represents accuracy within each decile, which is determined based on coding from CF Truth. The corpus used in the analysis is based on the keyword search of The New York Times 1980–2011.
An alternative way to compare SML to dictionary classifiers is to use a receiver operator characteristic, or ROC, curve. An ROC curve shows the ability of each classifier to correctly predict whether the tone of an article is positive in CF Truth for any given classification threshold. In other words, it provides a visual description of a classifier’s ability to separate negative from positive articles across all possible classification rules. Figure 5 presents the ROC curve for the baseline SML classifier and the Lexicoder dictionary.Footnote
54
The x-axis gives the false positive rate (the proportion of all negatively toned articles in CF Truth that were classified as positively toned) and the y-axis gives the true positive rate (the proportion of all positively toned articles in CF Truth that were classified as positive). Each point on the curve represents the misclassification rate for a given classification threshold. Two things are of note. First, for almost any classification threshold, the SML classifier gives a higher true positive rate than Lexicoder. Only in the extreme case in which articles are classified as positive only if the predicted probability generated by the classifier is very close to 1.0 (top right corner of the figure) does Lexicoder misclassify articles slightly less often. Second, the larger the area under the ROC curve (AUC), the better the classifier’s performance. In this case, the AUC of the SML classifier (0.744) is significantly greater ($p = 0.00$) than that of Lexicoder (0.602). This finding confirms that the SML classifier has a greater ability to distinguish between more and less positive articles.
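For readers who want to reproduce this style of comparison, the sketch below shows how ROC curves and AUCs can be computed once each article has a hand-coded label, an SML predicted probability, and a dictionary score. It is a minimal illustration using scikit-learn; the array names are hypothetical, and the significance test for the difference in AUCs is not included.

```python
# Sketch of the ROC/AUC comparison: both the SML predicted probability and the
# dictionary sentiment score provide a continuous ranking of articles, so each
# can be swept across all possible classification thresholds.
from sklearn.metrics import roc_curve, roc_auc_score

def compare_roc(y_true, sml_prob, dict_score):
    """y_true: 1 = positive in the hand-coded truth set; sml_prob and
    dict_score: per-article SML predicted probability and dictionary score."""
    fpr_sml, tpr_sml, _ = roc_curve(y_true, sml_prob)
    fpr_dic, tpr_dic, _ = roc_curve(y_true, dict_score)
    return {
        "sml": (fpr_sml, tpr_sml, roc_auc_score(y_true, sml_prob)),
        "dictionary": (fpr_dic, tpr_dic, roc_auc_score(y_true, dict_score)),
    }

# Each (fpr, tpr) pair traces a curve; the AUC summarizes it. A formal test of
# the difference between two AUCs (e.g., DeLong's test) is available in
# packages such as R's pROC but is not part of scikit-learn.
```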
4.1.2 Ability to Discriminate
One potential shortcoming of focusing on predictive accuracy is that, even if SML is better at separating negative from positive articles, dictionaries might be better at capturing the gradient of sentiment, from very negative to very positive. If that were the case, dictionaries could still do well when comparing changes in sentiment across articles or between groups of articles, which is often exactly what we are trying to do when we examine changes in tone from month to month.
To examine how well each method gauges relative tone, we conduct an additional validation exercise, similar to that performed by Young and Soroka (Reference Young and Soroka2012), to assess the performance of Lexicoder relative to the SML classifier. Instead of reporting accuracy at the article level, we split our CF Truth sample into deciles according to (1) the sentiment score assigned by Lexicoder and (2) the predicted probability according to the SML classifier. We then measure the proportion of articles that crowd workers classified as positive within each decile. In other words, we look at the 10% of articles with the lowest sentiment score according to each method and count how many articles in CF Truth are positive within this bucket; we then repeat this step for each of the remaining deciles.
As Figure 6 shows, while articles in each successive bin according to the dictionary scores were generally more likely to have been labeled as positive in CF Truth, the differences are not as striking as with the binning according to SML. The groups of articles the dictionary places in the top five bins are largely indistinguishable in terms of the percent of articles labeled positive in CF Truth, and only half of the articles with the highest dictionary scores were coded positive in CF Truth. The SML classifier shows a clearer ability to distinguish the tone of articles over most of the range, and over 75% of articles in the top decile of SML predicted probabilities were labeled positive in CF Truth. In short, even when it comes to the relative ranking of articles, the dictionary does not perform as well as SML: it is unable to accurately distinguish less positive from more positive articles over much of the range of dictionary scores.
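The binning exercise itself reduces to a few lines once per-article scores and truth labels are in hand. The sketch below is a minimal pandas version under the assumption that scores are continuous and labels are binary; the function and variable names are illustrative rather than taken from our replication code.

```python
# Sketch of the decile-binning check: rank articles by a continuous score
# (dictionary sentiment or SML predicted probability), cut them into ten
# equal-sized bins, and compute the share of articles hand-coded as positive
# within each bin.
import pandas as pd

def positive_rate_by_decile(score, y_true):
    """score: per-article dictionary score or SML predicted probability;
    y_true: 1 if the article was coded positive in the truth set."""
    df = pd.DataFrame({"score": score, "positive": y_true})
    # Rank first so qcut handles heavily tied scores without dropping bins
    df["decile"] = pd.qcut(df["score"].rank(method="first"),
                           10, labels=range(1, 11))
    return df.groupby("decile", observed=True)["positive"].mean()

# A classifier that discriminates well should show this proportion rising
# steadily from the bottom decile to the top, as in the SML panel of Figure 6.
```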
4.2 Selecting a Classification Method: Conclusions from the Evidence
Across the range of metrics considered here, SML almost always outperformed the dictionaries. In analyses based on a full training dataset—produced with either CrowdFlower workers or undergraduates—SML was more accurate and had greater precision than any of the dictionaries. Moreover, when testing smaller samples of the CF training dataset, the SML classifier was more accurate and had greater precision even when trained on only 250 articles. Further, our binning analysis with Lexicoder showed that Lexicoder was not as clearly able to distinguish the relative tone of articles in CF Truth as was SML; and the ROC curve showed that the accuracy of SML outperformed that of Lexicoder regardless of the threshold used for classification. Our advice to analysts is to use SML techniques to develop measures of tone rather than to rely on dictionaries.
5 Recommendations for Analysts of Text
The opportunities afforded by vast electronic text archives and machine classification of text for the measurement of a number of concepts, including tone, are in a real sense unlimited. Yet in a rush to take advantage of the opportunities, it is easy to overlook some important questions and to underappreciate the consequences of some decisions.
Here, we have discussed just a few of the decisions that face analysts in this realm. Our most striking, and perhaps surprising, finding is that something as simple as how we choose the corpus of text to analyze can have huge consequences for the measure we produce. Perhaps more importantly, we found that analyses based on the two distinct sets of documents produced very different measures of the quantity of interest: sentiment about the economy. For the sake of transparency and portability, we recommend the analyst use keyword searches, rather than proprietary subject classifications. When deciding on the unit to be coded, we found that coding article segments was more efficient for our task than coding sentences. Further, segment-level coding has the advantage that the human coders are working closer to the level of object that is to be classified (here, the article), and it has the nontrivial advantage that it is cheaper and more easily implemented in practice. Thus, while it is possible that coding at the sentence level would produce a more precise classifier in other applications, our results suggest that coding at the segment level seems to be the best default. We also find that the best course of action in terms of classifier accuracy is to maximize the number of unique objects coded, irrespective of the selected coder pool or the application of interest. Doing so produces more efficient estimates than having additional coders code an object. Finally, based on multiple tests, we recommend using SML for sentiment analysis rather than dictionaries. Using SML does require the production of a training dataset, which is a nontrivial effort. But the math is clear: given a large enough training dataset, SML has to outperform a dictionary. And, in our case at least, the size of the training dataset required was not very large.
From these specific recommendations, we can distill two overarching pieces of advice: (1) use transparent and reproducible methods in selecting a corpus, and (2) classify by machine, but verify by human means. Our evidence also suggests two broader lessons. First, for analysts using text as data, there are decisions at every turn, and even the ones we assume are benign may have meaningful downstream consequences. Second, every research question and every text-as-data enterprise is unique. Analysts should do their own testing to determine how the decisions they are making affect the substance of their conclusions, and be mindful and transparent at all stages in the process.
Acknowledgments
We thank Ken Benoit, Bryce Dietrich, Justin Grimmer, Jennifer Pan, and Arthur Spirling for their helpful comments and suggestions to a previous version of the paper. We thank Christopher Schwarz for writing code for simulations for the section on more documents versus more coders (Figure 2). We are indebted to Stuart Soroka and Christopher Wlezien for generously sharing data used in their past work, and for constructive comments on previous versions of this paper.
Funding
This material is based in part on work supported by the National Science Foundation under IGERT Grant DGE-1144860, Big Data Social Science, and funding provided by the Moore and Sloan Foundations. Nagler’s work was supported in part by a grant from the INSPIRE program of the National Science Foundation (Award SES-1248077).
Data Availability Statement
Replication code for this article has been published in Code Ocean, a computational reproducibility platform that enables users to run the code, and can be viewed interactively at doi:10.24433/CO.4630956.v1. A preservation copy of the same code and data can also be accessed via Dataverse at doi:10.7910/DVN/MXKRDE.
Supplementary material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2020.8.