1. Introduction
Event data provide high-resolution and high-volume information about political events. Event datasets can be coded either by hand or with the aid of software, the latter a process referred to here as “automated event coding.” While automated event coding promises reproducible, timely, and exhaustive data, several outstanding challenges limit its practical use to a subset of the problems of interest to social scientists. Among these challenges is dictionary generation. Current automated event coding solutions require large dictionaries of actors, events, and event characteristics to be populated a priori so that pattern matching can identify those dictionary entries in the raw text of the news stories from which event data will be generated. The dictionaries are hand-coded and therefore suffer from many of the same limitations as hand-coded event datasets: they are costly to produce, require frequent updates, are not reproducible, and are vulnerable to the forgetfulness or oversight of human coders. This paper presents a novel method for generating dictionaries for event coding that ameliorates these problems. Automated dictionary generation (ADG) promises to allow researchers to rapidly generate novel datasets tailored to their research questions rather than adapting their research questions to fit existing event datasets.Footnote 1 By lowering the costs of dictionary generation, ADG enables researchers to better adapt existing event coding software to new domains and to iterate rapidly on their datasets.
This paper proceeds by first discussing existing methods for producing event data in political science. Next the ADG method itself is detailed. The paper then offers an example application of this method and introduces an event dataset on cybersecurity: CYLICON, the CYber LexICON event dataset.Footnote 2 This application consists of the generation and updating of verb, actor, agent, issue, and synset dictionaries. It is shown that ADG enables the expansion of automated event coding to new domains, and therefore new problem sets, with a minimal amount of researcher effort. The paper concludes with a brief discussion of directions for future research in automated event coding.
2. Event data in political science
Political event data are produced both by hand and via automated processes. Most datasets of political events are still coded manually. This process is costly, time-consuming, and irreproducible. However, hand-coded event data are popular due to the perceived control they afford researchers in leveraging their expertise to code events precisely. Hand coding also allows researchers to collect information from multiple sources to construct event records with details that may not be available from any single source. Notable hand-coded event datasets include the Armed Conflict Location and Event Dataset, the International Crisis Behavior dataset, the Militarized Interstate Dispute dataset, and the Conflict and Peace Databank (Azar 1980; Brecher and Wilkenfeld 2000; Raleigh et al. 2010; Palmer et al. 2015; Brecher et al. 2016).
Since the mid-1990s, automated coding efforts for event datasets have grown in popularity (Schrodt 1998, 2011; Schrodt and Brackle 2013; Ward et al. 2013; Boschee et al. 2015; Caerus Associates 2015). In just the past few years, several event datasets have been introduced in political science: the Global Database of Events, Language, and Tone (GDELT), the Integrated Crisis Early Warning System (ICEWS) dataset, the Open Event Data Alliance's Phoenix dataset, and the Cline Center's Historical Phoenix Dataset (Leetaru and Schrodt 2013; Boschee et al. 2015; Open Event Data Alliance 2015b; Althaus et al. 2017).Footnote 3 These datasets provide information on individual events, usually at the daily level, with specific details about the actors involved. They also often provide geographic information at a subnational level. These datasets are enormous, typically comprising millions of events.Footnote 4
The event datasets listed above are built from streams of open-source news stories. The stories are processed through software that uses pre-defined dictionaries to infer the actors and actions they describe. Common software packages for this purpose include TABARI (Textual Analysis by Augmented Replacement Instructions) and PETRARCH (Python Engine for Text Resolution And Related Coding Hierarchy), both of which are successors to KEDS (Kansas Event Data System) (Schrodt 1998, 2011; Open Event Data Alliance 2015a).Footnote 5 The Open Event Data Alliance, authors of PETRARCH, provide Figure 1 to illustrate their event-coding process. Raw stories are first collected from online sources. These are uploaded to a database and formatted to the specifications required by TABARI (or PETRARCH). The stories are then passed to TABARI (or PETRARCH), which uses the supplied dictionaries to produce structured data. The data are then de-duplicated using a one-a-day filter that removes multiple identically-coded event records from the same day. The resulting data are then uploaded to a server for distribution. Under ideal circumstances, human interaction is required only to select appropriate news sources, to devise an ontology for the resulting structured data, and to populate the necessary dictionaries. However, this last step, dictionary creation, requires a substantial level of effort. The CAMEO verb dictionary used by PETRARCH and the Phoenix dataset is nearly 15,000 lines long and includes very specific phrases that would not necessarily be apparent to researchers a priori.Footnote 6 The country-actors dictionary, just one of multiple actor dictionaries utilized by Phoenix, is nearly 55,000 lines long. As of 2014, the ICEWS actor dictionary was over 102,000 lines long. Furthermore, as the relevant actors and language evolve, these dictionaries require regular updates to keep the event data current. Excerpts from the verb, country-actors, and synset dictionaries provided with PETRARCH are given in Table 1.
Figure 1. The Phoenix pipeline (Open Event Data Alliance 2015c).
Table 1. Excerpts from dictionaries supplied with PETRARCH.
The purpose of event-coding dictionaries, like those used by TABARI and PETRARCH, is to provide an exhaustive list of the terms and phrases that map to a set of labels. In a fully automated event-coding solution, both the ontology and the dictionary could be produced without human intervention. The effort described here, however, focuses on the latter challenge: automating the process of synonym and near-synonym extraction and classification given a known ontology.
PETRARCH's dictionary structure includes a verb dictionary, three distinct actor dictionaries, an agents dictionary, an issues dictionary, and a discard dictionary. The verb dictionary categorizes verb phrases into the sets of predetermined actions described by event data. The three actor dictionaries categorize persons and named organizations by their affiliations (i.e. country, organization type) and their roles with respect to the domain of interest. These dictionaries also resolve multiple spellings or representations of an entity's name into a single canonical representation. The default PETRARCH coding scheme provides three actor dictionaries: country-affiliated actors, international actors, and non-state military actors. The agents dictionary describes how to classify unnamed entities. For example, the agents dictionary maps “thief” and “trafficker” to criminal. The issues dictionary identifies phrases common to the domain of interest in order to label news by topic. For example, the current Phoenix issues dictionary tags issues like foreign aid, retaliation, and security services. Finally, the discard dictionary identifies phrases that disqualify sentences or stories from being coded entirely. This helps to remove stories that might otherwise be erroneously coded. For example, sports reporting is omitted because it often uses the language of warfare to describe “victories,” “defeats,” and teams being “destroyed.”
The common CAMEO coding scheme is not a comprehensive description of public interactions between politically relevant actors and agents. For researchers interested in types of interaction that do not conform to the existing dictionary structure, the creation of new dictionaries is a necessary but costly step. The Phoenix verb dictionary contains many thousands of verbs and phrases parsed according to a particular format and organized within a predetermined ontology. Currently, not only must researchers do this parsing and organization by hand, but they must also begin with a comprehensive list of verbs and phrases that will comprise the dictionary. Historically, the work of identifying verb phrases and classifying them has been done by undergraduate or graduate research assistants. This is time-consuming, expensive, and difficult to reproduce. The coding decisions made by research assistants are supposed to follow prescribed rules but their actual judgments are not auditable. Tools adapted from machine learning and natural language processing can be leveraged to ameliorate these challenges of event data generation. The technique presented here relies primarily on a word embedding model called word2vec.
3. A method for automated dictionary generation
The ADG process consists of four steps. (1) First, techniques common to natural language processing (NLP) tasks are used to pre-process the text corpus that is to be event-coded. This step is necessary both for event coding by PETRARCH and for the dictionary creation process. (2) Word2vec, a neural network language model (NNLM), is then used to learn a vector-space representation of the entire vocabulary. (3) Seed words and phrases, chosen according to a pre-defined ontology, are used to extract synonymous and near-synonymous words and phrases from the word2vec model that will populate the dictionaries. (4) Finally, a set of post-processing heuristics is applied to prune and format the dictionaries. While this entire process consists of multiple steps, the researcher is responsible only for supplying an ontology in the form of a small set of seed words and phrases. The process is diagrammed in Figure 2 and described in detail below. While the examples provided are drawn from the application of ADG to cybersecurity, the process is domain agnostic and can be applied to a wide variety of event domains.
Figure 2. ADG pipeline.
3.1. Step 1: Pre-processing
Every story in the corpus that is to be event-coded is parsed and part-of-speech tagged using a shift-reduce parser, the fastest parser available from Stanford's CoreNLP (Bauer 2014).Footnote 7 Additionally, CoreNLP's named entity recognizer (NER) is used to tag named entities as one of time, location, organization, person, money, percent, and date (Finkel et al. 2005).
Once the corpus has been parsed and named entities have been identified, two versions of the annotated text are saved. The first version is a representation of each sentence's parse tree to be input into PETRARCH. The second version of the annotated corpus is formed by appending to each word both its entity-type tag and its part-of-speech (POS) tag. For example, the word “hackers” is transformed into “hackers:O:NNS” where “O” indicates that this word is not a named entity and “NNS” indicates a plural noun. “Snowden:PERSON:NNP” indicates that “Snowden” refers to a person and is a singular proper noun.Footnote 8 POS- and NER-tagging each word and phrase in the corpus retains sufficient information about each term to post-process the resulting dictionary entries.
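To make the tagging format concrete, the minimal Python sketch below rewrites annotated tokens into the word:NER:POS form described above. The helper function and the tuple input format are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the tag-appending step. Assumes tokens have already been
# annotated (e.g., by CoreNLP) and are available as (word, ner_tag, pos_tag)
# tuples; the function name and input format are illustrative.

def append_tags(tagged_tokens):
    """Rewrite each token as word:NER:POS, e.g., 'Snowden:PERSON:NNP'."""
    return [f"{word}:{ner}:{pos}" for word, ner, pos in tagged_tokens]

tokens = [("Snowden", "PERSON", "NNP"), ("hackers", "O", "NNS")]
print(append_tags(tokens))  # ['Snowden:PERSON:NNP', 'hackers:O:NNS']
```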
The NER- and POS-tagged corpus is then processed to produce multi-word phrases. The method chosen here for deriving phrases from the corpus is recommended by Mikolov et al. (2013) and implemented in Rehurek and Sojka (2010). A robust literature on phrase detection exists but is out of scope for review here.Footnote 9 Candidate bigrams (two-word phrases) are scored according to their frequency relative to the frequencies with which their constituent words occur independently:
$$\mathrm{score}(w_1, w_2) = \frac{\mathrm{count}(w_1 w_2) - \delta}{\mathrm{count}(w_1) \times \mathrm{count}(w_2)} \qquad (1)$$
The words $w_1$ and $w_2$ are concatenated into a single multi-word term, $w_1\_w_2$, if $\mathrm{score}(w_1, w_2)$ surpasses a pre-defined threshold. $\delta$ is a discount factor that prevents spurious phrases from being formed by infrequently-occurring words. In order to produce phrases consisting of more than just two words, this algorithm is run iteratively. An example of this pre-processing is given in Figure 3.
Figure 3. Example of pre-processing.
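As an illustration of this step, the sketch below uses gensim's Phrases class, which implements a scoring rule of the form in Equation (1) (gensim's min_count parameter plays the role of the discount δ). The toy corpus and thresholds are illustrative only; realistic settings require a large corpus.

```python
# Sketch of iterative phrase detection with gensim (toy corpus and thresholds).
from gensim.models.phrases import Phrases

sentences = [
    ["a:O:DT", "denial:O:NN", "of:O:IN", "service:O:NN", "attack:O:NN"],
    ["the:O:DT", "denial:O:NN", "of:O:IN", "service:O:NN", "attack:O:NN", "ended:O:VBD"],
]

# One pass joins frequent bigrams; a second pass over the transformed corpus
# allows longer phrases (e.g., denial_of_service) to form.
bigrams = Phrases(sentences, min_count=1, threshold=0.1)
trigrams = Phrases(bigrams[sentences], min_count=1, threshold=0.1)
print(trigrams[bigrams[sentences[0]]])
```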
3.2. Step 2: Vocabulary modeling
Once the text data have been tagged and phrases have been formed, a model is required to identify terms and phrases that are synonymous with the seed phrases. Word2vec is chosen for this purpose. The word2vec model is a single-hidden-layer, fully-connected, feed-forward neural network that has been shown to learn the meanings of words given their contexts in natural language texts. Word2vec produces word vectors, real-valued vector representations of words, from raw text input in a process called embedding (Rehurek and Sojka 2010; Mikolov et al. 2013). These word vectors are low-dimensional numeric representations of a vocabulary that preserve the syntactic and semantic relationships between words. Word2vec learns the meaning of words from the contexts in which they are found in the text. The importance of a word's context follows from the distributional hypothesis, an assumption on which word2vec relies. Harris (1954), in describing the distributional hypothesis, explains that words more similar in meaning will occur in more similar contexts than words that are dissimilar in meaning. Rubenstein and Goodenough (1965) demonstrate that “there is a positive relationship between the degree of synonymy (semantic similarity) existing between a pair of words and the degree to which their contexts are similar.”
Word2vec is actually a family of models that includes both a skipgram-based variant and a continuous bag of words (CBOW) variant.Footnote 10 The skipgram model takes as input a one-hot-encoded (dummy variable) vector of length $V$, where $V$ is the size of the vocabulary, in which all values are 0 except for the target word, $w_i$, which is coded 1. The skipgram model then attempts to predict the context words that are most likely to be found adjacent to the target word. Context words, $\{w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k}\}$, are those words that fall within a window of size $k$ on either side of the target word.Footnote 11 The skipgram model therefore estimates a function, $f(w_i)$, that maps target word $w_i$ to its likely context words, $\{w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k}\}$. The output of the skipgram model is a softmax-normalized vector of length $V$ whose elements represent the probabilities that each corresponding word will appear in the context window of the input word.Footnote 12 The CBOW variant is the reverse of the skipgram model and predicts a target word given its context. Both CBOW and skipgram models can be estimated with any of several software packages, including the one used here, gensim (Rehurek and Sojka 2010).Footnote 13
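A minimal training sketch using gensim (4.x API; earlier versions use size instead of vector_size) is shown below. The toy corpus stands in for the full pre-processed, phrase-merged corpus, and the hyperparameters are illustrative; nearest-neighbor queries are only meaningful for a model trained on a large corpus.

```python
# Sketch of fitting a skipgram word2vec model with gensim (4.x API).
from gensim.models import Word2Vec

# Toy corpus standing in for the full pre-processed, phrase-merged corpus.
corpus = [
    ["hackers", "breached", "the", "network"],
    ["attackers", "infiltrated", "the", "server"],
    ["hackers", "defaced", "the", "website"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # D, the embedding dimension
    window=5,        # k, the context window size
    sg=1,            # 1 = skipgram, 0 = CBOW
    min_count=1,
    seed=1,
)

# Nearest neighbors by cosine similarity; on a real corpus this surfaces
# synonyms and near-synonyms of the query term.
print(model.wv.most_similar("hackers", topn=3))
```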
Word2vec consists of two weight matrices: an input weight matrix and an output weight matrix. By multiplying the input vector (shape 1 × V) with the input weight matrix (shape V × D), a D-dimensional vector representation of the input word, its word vector, is formed. This vector representation is then multiplied by the output weight matrix (shape D × V) to produce the model's output layer.Footnote 14 The softmax function (“activation”) is applied to this output layer. Because D ≪ V, the hidden layer compresses the sparse input vectors into relatively small, dense vectors.Footnote 15 These word vectors are of interest because they encode semantic and syntactic relationships between words and can be used to measure word similarities. Furthermore, algebraic operations on this vector space produce intuitive results. The canonical example of this is the analogy task, often demonstrated by showing that:
$$\overrightarrow{\mathrm{king}} - \overrightarrow{\mathrm{man}} + \overrightarrow{\mathrm{woman}} \approx \overrightarrow{\mathrm{queen}} \qquad (2)$$
By adding the vector representation of “king” to the vector representation of “woman” and subtracting the vector representation of “man,” a well-trained word2vec model will produce a vector very near to the vector representation of “queen” (i.e. king:man::queen:woman).Footnote 16 Why word vectors exhibit these linear relationships is the subject of active research (Pennington et al. 2014; Arora et al. 2016).Footnote 17 English word embedding models are typically evaluated with a standard set of analogies like that offered by Mikolov et al. (2013) to test a model's ability to represent 14 categories of semantic and syntactic relationships.
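The matrix operations described above can be made concrete with a short numpy sketch. The dimensions are toy values and the weights are random, so this illustrates only the shapes and flow of computation, not a trained model.

```python
# Toy illustration of the skipgram forward pass: one-hot input (1 x V),
# input weights (V x D), output weights (D x V), softmax over the vocabulary.
import numpy as np

V, D = 10, 4                     # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))   # rows are the word vectors
W_out = rng.normal(size=(D, V))

x = np.zeros(V)
x[3] = 1.0                       # one-hot vector for target word w_3

h = x @ W_in                     # hidden layer = word vector of w_3 (1 x D)
scores = h @ W_out               # output layer (1 x V)
probs = np.exp(scores) / np.exp(scores).sum()  # softmax: context-word probabilities
print(probs.round(3))
```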
3.3. Step 3: Term and phrase extraction
Learning the corpus with word2vec allows us to easily identify synonyms or near-synonyms of our seed words and phrases. Given a seed phrase, a string search is performed on the model's vocabulary and all words and phrases that contain the given seed word or phrase are selected. The word vectors associated with the resulting words and phrases are retrieved and averaged element-wise to produce a single category-wide vector: the element-wise sum of all word vectors $\overrightarrow{w}$ in category $C_i$ is computed and then $l_2$-normalized, $\sum\nolimits_{\overrightarrow{w} \in C_i} \overrightarrow{w} \, \big/ \, \big\Vert \sum\nolimits_{\overrightarrow{w} \in C_i} \overrightarrow{w} \big\Vert_2$ (after normalization, the sum and the element-wise average point in the same direction).Footnote 18 Then, the top $n_i$ most similar terms and phrases to each mean category vector are extracted from the word2vec model. Similar words and phrases are identified by first computing the cosine similarities of all word vectors with the category mean vector. Cosine similarity, defined as $(\vec{X} \cdot \vec{Y}) / (\Vert \vec{X} \Vert \times \Vert \vec{Y} \Vert)$, is a measure of the angle between two vectors and is particularly useful for comparing high-dimensional vectors. Cosine similarity is used to rank-order all terms and phrases in the word2vec model's vocabulary by their similarity to the mean category vector, in descending order. The top $s$ most similar terms and phrases are chosen as candidates to populate the relevant category in the event-coding dictionary.Footnote 19
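The extraction step can be sketched as follows, assuming `model` is a trained gensim Word2Vec model as in the previous step; the function name, seed, and thresholds are illustrative rather than the paper's actual code.

```python
# Sketch of seed-based term and phrase extraction (illustrative names/values).
import numpy as np

def category_candidates(model, seed, topn=100, min_similarity=0.6):
    """Terms most similar to the mean vector of vocabulary entries containing `seed`."""
    # String search over the vocabulary for entries containing the seed.
    matches = [w for w in model.wv.key_to_index if seed in w]
    if not matches:
        return []
    # Element-wise sum of the matching word vectors, then l2-normalize.
    category_vec = np.sum([model.wv[w] for w in matches], axis=0)
    category_vec = category_vec / np.linalg.norm(category_vec)
    # Rank the vocabulary by cosine similarity to the category vector and keep
    # candidates above the similarity threshold.
    ranked = model.wv.similar_by_vector(category_vec, topn=topn)
    return [(w, sim) for w, sim in ranked if sim >= min_similarity]

# Hypothetical usage: candidate verb phrases for a "defacement" category.
# print(category_candidates(model, "defaced"))
```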
3.4. Step 4: Post-processing
Extracted terms are then post-processed according to a set of rules associated with the dictionary they are meant to comprise. These post-processing steps can be automated. The set of post-processing rules can be found in the online appendix. The post-processing is necessary to coerce the extracted terms and phrases into the dictionary formats expected by PETRARCH. This involves, among other things, grouping verb phrases by their common verbs and tagging each dictionary entry with a category tag. A post-processing filter that removes phrases from the verb dictionary if they do not include at least one verb is also applied.
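One of these rules, the verb filter, can be sketched as follows; it relies on the POS tags appended during pre-processing, and the candidate phrases shown are illustrative.

```python
# Sketch of the verb-filter rule: drop candidate verb-dictionary phrases that
# contain no verb, using the word:NER:POS format from pre-processing.

def contains_verb(phrase):
    """True if any token in a word:NER:POS phrase carries a verb POS tag (VB*)."""
    return any(tok.rsplit(":", 1)[-1].startswith("VB") for tok in phrase.split("_"))

candidates = ["breached:O:VBD_the:O:DT_server:O:NN", "data:O:NNS_breach:O:NN"]
print([p for p in candidates if contains_verb(p)])  # only the phrase with a verb survives
```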
ADG represents a major step towards fully-automated event-data coding for novel domains. Because this process can be carried out largely without human interaction and the contents of the dictionaries are a function of the raw data that are to be event-coded, the dictionaries can be updated in tandem with the event dataset itself; new verb phrases, actors, or agents can be learned by the underlying models as they enter the relevant domain's vocabulary. Additionally, because the process described herein relies only on a small amount of initial researcher input and the raw text data themselves, event data generation is made more fully reproducible from start to finish.
4. CYLICON: a cyber event dataset
The ADG method for event coding is now applied to a novel domain for event data: cybersecurity. First, a cybersecurity ontology is selected and seed phrases are chosen to represent each category of that ontology. Five dictionaries are generated: verbs, actors, agents, synsets, and issues.Footnote 20 Only one seed phrase is provided per category.Footnote 21 Seed phrases are shown in Tables 2 and 3 and in the online appendix. For each seed term or phrase, the average vector of all terms and phrases containing the seed is computed and similar terms and phrases are identified according to the ADG procedure described above. The extracted candidate terms and phrases are then post-processed and formatted into PETRARCH-style dictionaries; no manual changes have been made to the dictionaries at any point after the input of the 26 seed phrases (one per category).
Table 2. Verb dictionary seeds
Table 3. Actor & agent dictionary seeds
Ten categories of events are identified for the verb dictionary: defacements, DDOS events, infiltrations, leaks, infections, vulnerability discoveries, arrests, patches, phishing attacks, and censorship incidents.Footnote 22 The ten seed phrases are representative examples of verb phrases for each category.Footnote 23 These are chosen by the researcher. The extracted verb dictionary contains 640 verbs and phrases after de-duplication and post-processing. The number of extracted phrases is due, in large part, to the minimum similarity threshold that is set by the researcher; terms and phrases must surpass this threshold with respect to the average category vector in order to be included in the final dictionaries. Here, a minimum cosine similarity of 0.6 has been chosen.
The new categories of actors and agents introduced in CYLICON include hackers, researchers, users, whistleblowers, and antivirus companies/organizations. These categories are appended to the existing actor and agent classifications already found in the default Phoenix dictionaries. New issue categories are appended to the issues already supplied with PETRARCH and include TOR, 0Day, hacktivism, DDOS, social engineering, and state-sponsorship. Synsets are produced for categories including hardware, virus, web asset, software, and computer.Footnote 24
The selected text corpus represents a convenience sample of 77,410 documents collected from online sources including cybersecurity-related blogs and news sites. Roughly 22,000 articles are sourced from the news section of www.softpedia.com. The remaining stories are largely sourced from blogs and technology-oriented news sites, the largest of which include feed aggregators, theregister.com, csoonline.com, circleid.com, and darkreading.com. There are 1,231 unique sources represented in the corpus. These sources are not a representative sample of cybersecurity events and were instead selected due to their relatively high concentration of relevant cybersecurity event stories. Collection occurred during 2014 and the latter part of 2015 and was inconsistent over time due to heterogeneity among sources with respect to the availability of archival text.
CYLICON includes 671 events in total. Arrests make up the largest category with 211 events, followed by infiltration (200), leaks (97), defacements (97), patches (19), infections (19), DDOS attacks (17), vulnerability discoveries (5), phishing attacks (5), and censorship incidents (1). Infiltration is a common category because many verb phrases from cybersecurity reporting accurately map to it. For example, phrases that include the words “breached” and “hacked” are often classified as infiltration by the ADG process. Additionally, when websites are defaced, it is common for reports to describe the websites as having been “breached and defaced,” indicating that the incident could be accurately assigned to either or both categories. Often, popular reporting on cybersecurity is not precise enough to distinguish the characteristics of a particular “hacking” event in a single sentence. Because of this, a bias towards infiltration coding is induced. If the coded sentence explains that a target was “hacked” and a second sentence explains that the event resulted in the defacement of the target's website, PETRARCH will fail to connect the defacement to the hacking event and will therefore code the event as an infiltration rather than a defacement. The discovery of vulnerabilities, issuance of patches, and phishing attempts, while very common, often go unreported in the news sources utilized here. They also tend not to conform to the source-action-target triple expected by PETRARCH. Of the 640 verb phrases in the CYLICON dictionaries, 157 account for all of the coded events. This is a 15-fold increase over the size of the verb seed dictionary.
The geographic distribution of actors involved in cyberspace according to CYLICON is shown in Figure 4. This map corresponds to conventional wisdom about the most active actors in cybersecurity-related events (The Economist 2012; Akamai 2015; Clapper 2015). However, the map is not representative of the entire CYLICON dataset; not all relevant actors are geo-coded. Of 1,338 total coded actors, 1,245 are assigned to specific countries. PETRARCH attempts to assign country codes to actors and agents when they can be inferred from the text; for example, the phrase “Syrian hackers” may be coded as SYRHAC. Actors affiliated with international organizations or otherwise unaffiliated with specific countries are, of course, not included in the map. Country associations for cybersecurity-based actors and agents have not been inferred for CYLICON.Footnote 25 The US is the most prominent country in CYLICON with 473 events, followed by China (145), Great Britain (60), India (43), Pakistan (38), and Russia (38). In total, 82 unique countries are represented.
Figure 4. Spatial distribution of actors in CYLICON. White (NA) values indicate that no events in CYLICON identify an actor from a given country.
Because event data from PETRARCH are dyadic, we can also examine country-pair interactions. Figure 5 represents the most common dyadic pairs in CYLICON. Chord plots, common in network analysis applications, represent the volume of interaction between nodes or, in this case, countries. This particular chord plot is non-directed and does not include self-connections. The top 12 countries (by volume of events) are plotted and the remaining 70 are grouped into the category “other” for visual clarity. The larger edges conform to the expectations of Valeriano and Maness (2014); regional pairs and rivals are apparent in the graph. The US is most active with China and Russia. India and Pakistan account for the majority of one another's cyber events. Iran interacts primarily with the US and Israel.
Figure 5. Top country dyads in CYLICON.
To better illustrate the successes and shortcomings of CYLICON, a selection of events is examined alongside the original text. Event codes are indicated by the triplet ACTOR1 ACTOR2 ACTION preceding each sentence. Selected sentences and their corresponding data are enumerated in the list below, beginning with examples of accurate coding and ending with examples of inaccurate coding. Commentary follows.
1. ISR USAELIGOV INFILTRATED: “According to FBI, in the Year 2000 Israeli Mossad had penetrated secret communications throughout the Clinton administration, even Presidential phone lines.”Footnote 26
2. USACOP EST ARRESTED: “After the Estonian masterminds were apprehended by the FBI, the DNSChanger Working Group was established and the cleaning process began.” (Kovacs 2012b)
3. MYS PHLGOVMEDHAC DDOS: “After Anonymous Malaysia launched distributed denial-of-service (DDOS) attacks against several Philippines government websites, Filipino hackers went on the offensive, defacing a large number of commercial websites.” (Kovacs 2013)
4. USA USAMIL INFECTED: “US officials did not provide details on the status of the ‘corrupt’ software installed on DoD computers, but common sense points us to believe it was removed back in 2013.” (Cimpanu 2015)
5. BGDMED BGD DEFACED: “A Bangladeshi publisher of secular books has been hacked to death in the capital Dhaka in the second attack of its kind on Saturday, police say.” (BBC 2015)
6. IRNGOVGOVMIL USA INFILTRATED: “Head of Iran's Civil Defense Organization Gholam Reza Jalali told the agency that the country never hacked financial institutions from the United States.” (Kovacs 2012a)
The first four examples are all accurately coded by PETRARCH. Item 1 is correctly identified as an instance of infiltration and the actors are accurate if imprecise (PETRARCH codes Mossad as ISR rather than ISRSPY). In Item 2, the ADG process identified “were apprehended” as indicative of arrest. While Item 3 is correctly labeled a DDOS event, PETRARCH has mistakenly associated the term “hackers” with the target actor rather than with Anonymous Malaysia. Item 4 highlights the difficulty associated with coding infection events. The “corrupt software installed” indicates that a malware infection event has occurred. However, as is often the case with infection events, a source actor is not described. In this case the target actor is accurately identified but the source actor is coded as the US, which is not supported by the given text. Note that none of the verb phrases in Items 1, 2, or 4 were included in the seed terms.
Items 5 and 6 were incorrectly coded. The incorrect coding in Item 5 resulted from the dual meaning of the verb “hacked.” It is possible that with a larger ontology, one that includes both computer infiltration and murder, “hacked_to_death” would be accurately coded. However, without a method for automatically pruning erroneously-coded phrases from the dictionaries, edge cases like this must be identified and removed by hand. No manual pruning has been performed on these dictionaries, and so edge cases remain. Item 6 is incorrectly coded because the sentence itself is a denial of the action that was identified. An Iranian official denied that his country had hacked into financial institutions in the US, but PETRARCH interpreted the sentence to mean that the event had, in fact, occurred.Footnote 27
All CYLICON events have been reviewed manually and scored to help quantify the efficacy of automatically-generated event data dictionaries. The text content associated with each event is inspected and event codes are manually assigned without any knowledge of the CYLICON-assigned codes. When multiple events are explicitly described (e.g. “…have breached and defaced…”), all appropriate events are assigned. When only one event is described (e.g. “…have defaced…”), only that specific event is assigned. When the language is ambiguous, all reasonable assignments are made but the event is also labeled as “ambiguous.” Only the action (event type) field is evaluated, as only the verb dictionary was produced entirely via ADG. The CYLICON actor, agent, and issue dictionaries are a combination of the Phoenix hand-coded dictionaries and automatically-generated dictionaries and are therefore not evaluated. Events are scored as correct if the associated action code from CYLICON is among the manually-identified event types for a given sentence. Events are scored as ambiguous if the associated action code from CYLICON is among the manually-identified event types but the text itself is ambiguous rather than explicit. For example, “Hackers have attacked servers…” is ambiguous because it could reasonably describe a DDOS event, an infiltration event, or a defacement. Events are considered incorrect if they fall into neither of the above two cases.
Table 4 presents the results of this review by event category. Overall accuracy, the number of correctly-coded events and ambiguous events divided by the total number of coded events, is 70 percent. If ambiguous events are instead considered inaccurate, the accuracy of coded events falls to 65 percent. These values are in line with or above the reported human coder performance on top-level event categories. King and Lowe (2003) report that trained undergraduates can correctly classify events by their aggregate (top-level) event category between 39 and 62 percent of the time.Footnote 28 Schrodt and Brackle (2013) report machine-coding accuracy percentages for TABARI on the ICEWS project in the low- to mid-70s. The false positive rate, the percentage of sentences incorrectly determined by PETRARCH to contain any event, is 16 percent.Footnote 29 This performance is achieved despite requiring only minimal researcher-hours and one seed phrase per category.
Table 4. Accuracy by event category
5. Conclusion
The ADG process described here allows researchers to quickly produce novel event datasets specific to their topics of interest. With minimal input from the researcher, ADG produces dictionaries of pre-categorized words and phrases for use with existing event coding software. In a demonstration of its application, ADG was used to populate and update a set of dictionaries for coding events in an entirely new domain for event data—that of cybersecurity.
While ADG takes a substantial step in the direction of a fully-automated event coding solution, work remains to be done in this area. Event coding software itself, like PETRARCH, remains largely heuristic-based. The stacking of multiple analysis techniques for sentence parsing, phrase-extraction, and named entity recognition, among others, compounds errors that lead to sub-optimal event coding. Future efforts should leverage advances in machine learning to minimize the application of heuristics and the stacking of text pre- and post-processing steps.Footnote 30
End-to-end event coding models may, for instance, facilitate the customization of event datasets through transfer learning.Footnote 31 For example, a model may be trained to produce CAMEO-coded event data from news and then adapted, with the help of a relatively small training set, to produce cybersecurity event data instead. This would allow novel event datasets to be generated for user-specific purposes with only a small number of “gold standard” training samples. An extension to the ADG process presented here would replace the word2vec component with a bilingual embedding model like BilBOWA (Gouws et al. 2015). BilBOWA requires only a parallel bilingual corpus in order to align separate word embedding models in two different languages and could therefore be used in the ADG process to extract bilingual dictionaries.
ADG demonstrates that even unstructured text can be converted into structured data suitable for social science inquiry with minimal researcher input. As machine learning and neural network-based models continue to advance the state-of-the-art in data analysis across fields, their application to the social sciences promises to similarly revolutionize how we measure, interpret, and understand political phenomena.
Supplementary Material
The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2019.1
Author ORCIDs
Benjamin J. Radford, 0000-0002-8440-0655.
Acknowledgments
Benjamin J. Radford received his Ph.D. in political science from Duke University (benjamin.radford@gmail.com). The author thanks Michael D. Ward, Scott De Marchi, Kyle Beardsley, Mark J.C. Crescenzi, and three exceptional anonymous reviewers for their support and feedback on drafts of this work.