1. Introduction
This anniversary issue gives us an opportunity to look back at the past 25 years of information extraction (IE) and to consider what has changed and why these changes have occurred.
First, a few definitions:
IE is the automatic identification and classification of instances of user-specified types of entities, relations, and events from text. The output is structured information (e.g., a database) which can be readily interpreted by other applications. The specification may take the form of examples or verbal descriptions of the information to be extracted. Texts which the user considers equivalent should be mapped to the same output structure.
Although there are exceptions, the information to be extracted is limited to specific individuals and specific events. Generic information, conditional information, statements of knowledge, and beliefs are excluded. These restrictions are intended to make the task more tractable and the output easier to interpret than general language understanding.
This is distinguished from open IE, which reduces text to a set of elementary sentences (subject–verb–object triples) for human consumption or search (but does not necessarily involve the collapsing of alternative verbal descriptions to a canonical form).
Although common evaluation corpora are now an integral component of most areas of NLP, shared US Government evaluations have played a particularly large role in the development of IE. Although these evaluations are referred to as “conferences,” they involve much more: the specification of a task, the implementation of a system for the task by participants, the release of test data, and its processing and scoring. The past 30 years have seen three major series of evaluations:
- MUC
(Message Understanding Conference) Begun in 1988 to find a way to evaluate IE, the MUCs established IE as a major application of NLP (Sundheim 1996).
- ACE
(Automatic Content Extraction) Replaced the filling of one complex and task-specific template with a few dozen more general relations and events. Produced large amounts of annotated training data, fostering the development of supervised methods (Doddington et al. 2004).
- KBP
(Knowledge Base Population) Increased the scale of the data to be processed, with the goal of creating a unified database connecting tens of thousands of entities with about 40 relations and then answering questions about selected entities. Provided minimal annotated training data, thereby encouraging semi-supervised methods (Ji and Grishman 2011).
The regular evaluations of IE in turn have served as a model for evaluations in many other areas of NLP.
The frequent evaluations (every 1 or 2 years) make it possible to get an accurate picture of the varied approaches to IE over the past 30 years. Each participant in an evaluation was required to provide a (multi-page) system description. The participants included both universities and industry, and were motivated (by the possibilities of Government contracts) to incorporate what they believed was “best practice”, not just the most publishable methods.
Since the conferences have all been organized by US Government agencies, it is not surprising that the initial participants were primarily from the US. But as the meetings progressed, they took on a more international character. By 2005, 6 out of 15 groups were non-US. By 2010, only 7 of the 20 KBP participants were from the US and 6 were from Europe, with the balance widely distributed.
2. Before corpora: rule-based systems
If we turn back the clock to 1994—25 years ago—and the start of this journal, we will find a new NLP technology being introduced to the wider world.
The information explosion of the last decade has placed increasing demands on processing and analyzing large volumes of online data. In response, the Advanced Research Projects Agency (ARPA) has been supporting research to develop a new technology called IE. IE is a type of document processing which captures and outputs factual information contained within a document. Similar to an information retrieval system, an IE system responds to a user’s information need. Whereas an Information Retrieval (IR) system identifies a subset of documents in a large text database or, in a library scenario, a subset of resources in a library, an IE system identifies a subset of information within a document (Okurowski 1993).
This announcement was based on a series of MUCs which had defined the task of IE and its evaluation. The conference series began in 1988 with invitations to a meeting (“MUC-1”) at NOSC (Naval Ocean Systems Center) to discuss how IE might be evaluated. In order to be able to compare systems, there was agreement on the need for a shared template capturing the most important information in a document. Systems would be judged on how accurately they filled these template slots. MUC-2 represented a trial run of such an evaluation; MUC-3 agreed on scoring using recall, precision, and F measure. (F measure, the harmonic mean of recall and precision, was suggested as the primary metric for assigning a rank to participating systems.)
MUC-1 and MUC-2 both used Navy exercise message traffic (“rainforms” and “opreps”) as the corpus.
The template planned for MUC-2 is shown in Appendix A.
MUC-3 and 4 used news about terrorism in Latin America (Chinchor et al. 1993). A sample message is shown in Appendix B along with one of the filled templates generated from this message.
As the task got better defined, the number of participants grew. By MUC-5, in 1993, there were 16 participants, evenly divided between universities and companies (primarily defense contractors) (MUC 1993). The MUC tasks were getting larger in other respects as well. Participants in MUC-5 had a choice of two extraction topics (joint ventures or microelectronics) and two languages (English or Japanese). The templates were substantially more complex than in prior years. Following MUC-5, the tasks were simplified to limit the effort required to participate. To emphasize faster development of IE systems for new domains, the time from the release of training material to the actual evaluation was reduced to one month. MUC-6 involved executive succession; MUC-7 involved rocket launches.
Participation in multiple MUCs had led to some convergence of extraction architecture: a long pipeline of components with some familiar names (Hobbs 1993). It quickly became clear, for example, that a preprocessor to identify names was essential. But there were still basic areas of disagreement.
2.1 To parse or not to parse
One disagreement concerned full-sentence parsing. The job of IE is to analyze the structure of the input text and then, guided by that structure, to generate the specified output relations. The question is how much structure to build. One possible answer is to build a full parse tree, thus defining the role of every word in the sentence. However, this was not so easy to do in 1990. Grammars were constructed by hand and were either too tight (failed to parse 1/3 to 1/2 of sentences) or too loose (produced dozens of parses). The typical solution was to combine a tight grammar with a mechanism to recover a partial parse if no full sentence parse was possible. For MUC-5, half the participants (8) tried to generate full parses of each sentence; it is not always clear how successful they were. Most of these sites cited some linguistic formalism: GB (Government-Binding Theory), LFG (Lexical Functional Grammar), HPSG (Head-driven Phrase Structure Grammar), and CCG (Combinatory Categorial Grammar) were represented at MUC-5.
The primary alternative to a full parse was partial parsing (chunking). This was faster and more reliable, but it only generated some of the required structure; semantic patterns had to do the rest. Consider, for example, the “name” event for hiring an executive. It may appear as a simple active sentence, “IBM named Fred president” (pattern company named person position), a passive sentence, “Fred was named president of IBM,” a relative clause, “Fred, who was named president of IBM,” etc. This worked well for a sentence expressing a single event, but consider a sentence expressing two events:
Fred, who was named president of IBM last year, suddenly resigned yesterday.
The pattern for the relative clause still matches, but the other event (Fred … resigned) is split in two. Creating a full set of patterns to handle all these cases is quite tricky.
The SRI team provided a neat solution. They implemented event rules which were applied nondeterministically and could skip selected constituents. For example, the simple active sentence pattern for “resigned” was extended to
person relativePronoun (nounGroup | other)* verbGroup (nounGroup | other)* resigned
which could skip the relative clause, matching both “Fred” and “resigned.” Because the patterns are applied nondeterministically, both patterns would match and two events would be reported. The resulting system, FASTUS, was fast and effective (Hobbs et al. 1993, 1997). The SRI researchers were careful to point out that this solution was suitable for IE but not for a general language understanding task which needs to capture the relation between events.
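To make the mechanism concrete, here is a minimal sketch (in Python, not SRI's actual rule formalism) of matching such a pattern over a chunk sequence: each sentence is reduced to a sequence of typed chunks, the pattern is written as a regular expression over those chunks, and intervening noun groups and other material can be skipped. The chunk labels and the encoding of chunks as strings are illustrative assumptions.

```python
import re

# Hypothetical FASTUS-style matching over chunks rather than raw text.
# Each chunk is rendered as "type:head" (spaces in heads replaced by "_").
chunks = [("nounGroup", "Fred"), ("relativePronoun", "who"),
          ("verbGroup", "was named"), ("nounGroup", "president"),
          ("other", "of"), ("nounGroup", "IBM"), ("other", "last year"),
          ("verbGroup", "resigned"), ("other", "yesterday")]
text = " ".join(f"{t}:{w.replace(' ', '_')}" for t, w in chunks)

# person relativePronoun (nounGroup | other)* verbGroup (nounGroup | other)* resigned
RESIGN = re.compile(
    r"nounGroup:(\w+) relativePronoun:\w+"      # person, "who"
    r"(?: (?:nounGroup|other):\w+)*"            # skipped constituents
    r" verbGroup:\w+"                           # e.g. "was_named"
    r"(?: (?:nounGroup|other):\w+)*"            # more skipped constituents
    r" verbGroup:resigned")                     # the trigger

m = RESIGN.search(text)
if m:
    print({"event": "resign", "person": m.group(1)})   # {'event': 'resign', 'person': 'Fred'}
```

Because the “name” pattern for the relative clause is applied independently over the same chunk sequence, both events are reported, mirroring the nondeterministic behavior described above.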
At this time, the first corpus-trained systems, for part-of-speech tagging, became available (Church 1988). They were significantly more accurate than their rule-based predecessors and began to find some limited use in MUC-5.
2.2 Building a domain model
Once the input data have been syntactically analyzed, we must detect mentions of interest, identify their arguments, and generate the output structure. This was generally achieved through a process of semantic pattern matching, although described in different terms by different sites. The patterns consisted of English words, domain-specific word classes, and syntactic roles. If the system generated a full-sentence parse tree, the pattern had to match a subtree; if the system generated sequences of chunks, the pattern had to match a subsequence.
Studying the source texts and building the domain model remains something of a craft. If the word classes are too general or the patterns too brief, the system will overextract (low precision). More than likely, some patterns will be omitted and the system will underextract (low recall).
The one site which explored the possibility of (partially) automating this process was the group from the University of Massachusetts, Amherst, that participated in MUC-4. Most MUC task specifications included a small number (typically 100) of hand-tagged example documents. For MUC-3 and MUC-4, the Government provided these 100 annotated documents, but also over 1000 unlabeled documents, half of which were on the same topic. This provided an opening for a semi-supervised learner. The documents were divided into those that included a relevant event (in this case, a terrorist incident) and those that did not. This was a much smaller job than annotating the documents with their slot fillers. Meanwhile, the corpus was parsed and, for every noun phrase in the corpus, its immediate context (generally a subject–verb–object structure) was recorded. Then they computed, for every context, the fraction of documents containing that context which were relevant to the extraction task. The contexts were ranked, and the top-ranked ones were collected as promising extraction patterns (Riloff 1996). This set of patterns was as effective at IE as a set of manually selected patterns.
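The document-relevance statistic at the heart of this approach is easy to sketch. The following is a simplified illustration (the actual AutoSlog-TS scoring and thresholds differ); doc_patterns and relevant_flags are assumed inputs produced by a parser and by the document-level relevance labels.

```python
from collections import Counter

# Simplified relevance-rate ranking in the spirit of Riloff (1996).
# Each document is reduced to the set of candidate patterns (e.g. "<subj> was bombed")
# harvested from its parsed noun-phrase contexts.
def rank_patterns(doc_patterns, relevant_flags, min_freq=3):
    """doc_patterns: list of sets of patterns, one per document.
       relevant_flags: parallel list of booleans (document-level relevance labels)."""
    in_relevant = Counter()
    total = Counter()
    for patterns, relevant in zip(doc_patterns, relevant_flags):
        for p in patterns:
            total[p] += 1
            if relevant:
                in_relevant[p] += 1
    scored = []
    for p, n in total.items():
        if n < min_freq:
            continue                      # ignore very rare patterns
        rel_rate = in_relevant[p] / n     # fraction of docs containing p that are relevant
        scored.append((rel_rate, n, p))
    # rank by relevance rate, breaking ties by frequency
    return sorted(scored, reverse=True)

# The top-ranked patterns are then reviewed (or used directly) as extraction patterns.
```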
Completion of the Penn TreeBank in the mid-1990s (Marcus et al. 1993) led to a series of treebank-trained parsers of increasing accuracy (Collins 1996) and made full-sentence parsing more competitive. This came too late to have a significant influence on the remaining two MUCs—BBN was the only site to incorporate a treebank-based parser (Miller et al. 1998)—but it left the field well prepared for supervised methods which required accurate parsers.
2.3 Dividing the task
Up through MUC-5, the only way to participate in a MUC was to create a complete system to fill event templates, which might require several component subsystems. To encourage development of these components, MUC-6 split off three tasks, named entity tagging, coreference, and template element, with separate evaluations (Grishman and Sundheim 1996). These were seen as more general, scenario-independent tasks. The original task was dubbed scenario template. This brought greater attention to these tasks and favored the rise of NLP specialists who concentrated on one task. Having a separate evaluation also made it feasible to “plug and play.” MUC-7 added a fifth task, the template relation task.
The named entity task in particular quickly took on a life of its own. It had a lot going for it. It was easy to explain. It was not too difficult to implement a system (using hand-coded rules) which exhibited useful performance. It became a separate task at just the time that machine-learning methods were being introduced. And it was useful by itself.
Finally, after MUC-7, questions were raised about the value of continued MUCs. Some of the templates were very specific; MUC-5 included a template with more than 40 slots. This led to a lot of work not directly relevant to IE technology. Scores of the top performers seemed to have topped out at F = 50–60. A working group was formed which recommended extracting a set of elementary events and their arguments rather than a monolithic template (Hirschman et al. 1999). This became a basic theme of the ACE program, which started in 2001.
3. Supervised methods: ACE
3.1 Entity, relation, and event
In ACE, the information in each document is represented by a set of entities, relations, and events. There are seven types of entities, six types of relations, and eight types of events. The types are shown in Appendix C; each type is further divided into subtypes (not shown). Relations are binary; events may have any number of arguments. With minor exceptions, arguments must be entities or temporal expressions (thus excluding relations or events which take other events as arguments). The arguments to a relation or event must appear in the same sentence; this makes annotation more tractable. It also simplifies modeling because it reduces relation tagging to a classification task (classifying all pairs of entities in the same sentence).
The annotated corpora of the ACE evaluations are still widely used as benchmarks for IE. In particular, the three types of data structures produced for the 2005 evaluation are still being used to annotate additional data (Aguilar et al. 2014).
Another basic theme was supervised training. It had become clear from the MUCs and contemporaneous NLP research that annotating training data could be an effective way of improving extraction performance. To support such training, a sizable investment was made in corpus annotation. New corpora were released annually. The largest, for ACE 2005, was 300,000 words of English and comparable amounts of Chinese and Arabic.
In addition, to gauge the robustness of the extraction, one release included noisy output from audio transcripts and OCR (optical character recognition), but this was not further pursued.
As we have already noted, in the early 1990s there was a shift in the core NLP tasks to corpus-trained models, initially for part-of-speech tagging and then for parsing, which greatly improved the quality of intermediate results.
We will consider in turn the most popular models for each type of IE structure: named entities, entities, relations, and events.
3.2 Named entity
The general role of this component is to identify and classify all the names in our corpus. More abstractly, its job is to encapsulate all the messy, ad hoc structures which are not part of the core language. In addition to names, this may include addresses, times of day, and chemical formulas (Nadeau and Sekine 2007).
This is essentially a sequence labeling problem and is typically solved by an MEMM (Maximum Entropy Markov Model) or a CRF (Conditional Random Field) at the token level (Nadeau and Sekine 2007). There is a small benefit from taking into account global features which capture name consistency across documents: if the same name appears in two documents, we favor the analysis which assigns the two instances the same name type (Finkel et al. 2005). A large number of features are required to classify names not seen in training—primarily shape, prefixes, and suffixes. In some of the top systems, this feature-based approach has been replaced by one which combines dual sequence models, one at the token level and one at the character level (Klein et al. 2003).
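As an illustration of the kind of features involved, here is a hedged sketch of token-level feature extraction for an MEMM/CRF tagger; the exact feature set varied from system to system, and this one is only indicative.

```python
import re

# Illustrative token-level features (shape, prefixes, suffixes, context words)
# of the kind used by MEMM/CRF named entity taggers.
def word_shape(w):
    """Map 'McDonald-3' -> 'XxXxxxxx-d', then compress repeated symbols -> 'XxXx-d'."""
    s = re.sub(r"[A-Z]", "X", w)
    s = re.sub(r"[a-z]", "x", s)
    s = re.sub(r"[0-9]", "d", s)
    return re.sub(r"(.)\1+", r"\1", s)

def token_features(tokens, i):
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "shape": word_shape(w),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "is_title": w.istitle(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<S>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
    }

# Each sentence becomes a list of feature dicts, paired with BIO tags
# (B-PER, I-PER, B-ORG, ..., O) for training the sequence model.
sentence = ["Fred", "was", "named", "president", "of", "IBM", "."]
X = [token_features(sentence, i) for i in range(len(sentence))]
```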
3.3 Entities
Entity generation will typically operate on parser output. It has two principal functions: grouping together coreferential phrases and assigning each group a semantic type. ACE has seven entity semantic types, shown in Appendix C. Groups not in one of these seven types are dropped. What remains is a set of entities, each consisting of a set of entity mentions.
Several types of models have been used for coreference, principally mention-mention models (which first classify each pair of entity mentions as to the probability of coreference and then resolve conflicts) and mention-entity models (which make a single pass over a document, processing entity mentions in text order and either assigning each mention to a previously created entity or constructing a new entity) (Ng 2017).
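A minimal sketch of the mention-pair approach, with a simple best-scoring-antecedent linking step, might look as follows; the pair scorer here is a hand-written stand-in for a trained classifier, and the feature checks are illustrative.

```python
# Mention-pair coreference sketch: score pairs, then cluster greedily in text order.
def pair_score(antecedent, mention):
    """Placeholder for a trained mention-pair classifier returning P(coreferent)."""
    score = 0.0
    if antecedent["head"].lower() == mention["head"].lower():
        score += 0.7                       # head/string match
    if antecedent["type"] == mention["type"]:
        score += 0.2                       # compatible semantic types
    return min(score, 1.0)

def resolve(mentions, threshold=0.5):
    """Process mentions in text order; link each to its best-scoring prior entity."""
    entities = []                          # each entity is a list of mentions
    for m in mentions:
        best, best_score = None, threshold
        for entity in entities:
            s = max(pair_score(a, m) for a in entity)
            if s > best_score:
                best, best_score = entity, s
        if best is None:
            entities.append([m])           # start a new entity
        else:
            best.append(m)
    return entities

mentions = [{"head": "Fred", "type": "PER"},
            {"head": "IBM", "type": "ORG"},
            {"head": "Fred", "type": "PER"}]
print(len(resolve(mentions)))              # -> 2 entities
```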
3.4 Relations
As we noted earlier, because relations hold between pairs of entities in the same sentence, it is possible to treat relation tagging as a classification problem, classifying each pair as a relation type or none. Extensive studies were made using maximum entropy methods and trying a wide variety of features, including words, entity types, and dependency relations (Kambhatla 2004; Jiang and Zhai 2007). Kernel methods have also been used successfully (Zhao and Grishman 2005).
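A sketch of this pair-classification setup is shown below; logistic regression stands in for the maximum entropy models of the period (the two are closely related), and the features and data structures are illustrative assumptions rather than any particular system's design.

```python
from itertools import combinations
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pair_features(tokens, e1, e2):
    """e1, e2: entity mentions with token spans (start inclusive, end exclusive) and types;
       entities are assumed to be listed in textual order."""
    between = tokens[e1["end"]:e2["start"]]
    return {
        "type_pair": e1["type"] + "_" + e2["type"],
        "words_between": "_".join(w.lower() for w in between),
        "num_between": len(between),
        "head1": tokens[e1["end"] - 1].lower(),
        "head2": tokens[e2["end"] - 1].lower(),
    }

def candidates(sentence):
    """Yield one feature dict per pair of entity mentions in the sentence."""
    for e1, e2 in combinations(sentence["entities"], 2):
        yield pair_features(sentence["tokens"], e1, e2)

sentence = {"tokens": "Fred joined the board of IBM .".split(),
            "entities": [{"start": 0, "end": 1, "type": "PER"},
                         {"start": 5, "end": 6, "type": "ORG"}]}
X_feats = list(candidates(sentence))
# y = ["ORG-AFF"]   # gold label for the (Fred, IBM) pair, from ACE-style annotation

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)
# clf.fit(vec.fit_transform(X_feats), y)   # training over all annotated pairs
```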
3.5 Events
A proper treatment of events is more challenging because it involves the interaction of the trigger (the principal word defining the event) and multiple arguments. It is consequently a structured prediction task. The simplest solution is to decide first on the type of event, if any, and then to analyze the arguments (Ahn 2006). This, however, loses considerable accuracy because the meaning of many common verbs depends on the arguments they take. For example, firing a person is a different type of event than firing a rocket.
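A schematic sketch of this staged pipeline appears below; the classifiers are trivial placeholders (a real system would use trained models), but the structure shows why the approach struggles: the trigger “fired” is typed before its arguments are examined, so the person/rocket distinction just mentioned cannot influence the decision.

```python
# Staged (non-joint) event extraction: type the trigger, then classify arguments.
def classify_trigger(tokens, i):
    """Placeholder trigger classifier: returns an event type or None."""
    lexicon = {"fired": "End-Position", "attacked": "Attack", "died": "Die"}
    return lexicon.get(tokens[i].lower())   # a real system uses a trained classifier

def classify_argument(tokens, trigger_index, entity):
    """Placeholder argument-role classifier for one (trigger, entity) pair."""
    return "Person" if entity["type"] == "PER" else None

def extract_events(sentence):
    events = []
    for i in range(len(sentence["tokens"])):
        etype = classify_trigger(sentence["tokens"], i)
        if etype is None:
            continue
        args = []
        for ent in sentence["entities"]:
            role = classify_argument(sentence["tokens"], i, ent)
            if role:
                args.append((role, ent["head"]))
        events.append({"type": etype, "trigger": sentence["tokens"][i], "args": args})
    return events

sentence = {"tokens": "The board fired the president yesterday .".split(),
            "entities": [{"head": "board", "type": "ORG"},
                         {"head": "president", "type": "PER"}]}
print(extract_events(sentence))
# [{'type': 'End-Position', 'trigger': 'fired', 'args': [('Person', 'president')]}]
```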
A better solution is to use joint inference: jointly optimize over a combination of label choices when these choices interact. Besides the interaction of event type with event arguments just noted, there are interactions between the types of adjacent events (attacks often co-occur with deaths) (Li et al. 2013).
Event extraction is followed by event coreference, whose role is to identify multiple mentions of the same event. As was the case for entity coreference, there are several viable strategies, including mention-pair models and mention-ranking models (Lu and Ng 2018). These models rely on the argument structures of the mentions: they classify a pair of event mentions as potentially coreferential if the event types are consistent and the argument values are compatible. Some examples of compatible arguments can be learned through bootstrapping, but performance is modest (Huang et al. 2019). The problem in part is that many cases of event coreference are complicated, involving containment or partial overlap.
4. Semi-supervised methods
ACE was a success in terms of producing annotated corpora and research results, but there were issues it did not address. In particular, it treated documents separately, whereas many realistic tasks involved large numbers of interrelated documents. Information about an individual may need to be pieced together from several documents. To address these questions, NIST (the US National Institute of Standards and Technology) organized the annual “Text Analysis Conference” and its central task, “Knowledge Base Population” (KBP) (Ji and Grishman 2011). Starting in 2009, the KBP task added additional components year by year. We will describe the “Cold Start” variant as of 2017, when the data sets were largely complete.
Participants were given a large collection of unannotated documents, a mix of newspaper articles and blogs, two to four million documents in each of English, Chinese, and Spanish. A small portion of these, 30,000 documents in each language, served as the test corpus; sites were expected to build a graph in which each node represented an individual, organization, GPE (Geo-Political Entity), location, or facility mentioned in the test collection. Associated with each type of node was a set of properties, whose values could be a number, a date, a string, or another node in the network. For example, a person node would have an age property whose value is an integer and a city_of_birth property whose value is a GPE node.
In addition, sites had to link the entities to the arguments of events appearing in the test collection.
Compared to ACE, the test corpora were about two orders of magnitude larger. At this scale, complete manual annotation of the test corpus was not feasible. Scoring was done by sampling: NIST selected some names mentioned in the test corpus and checked whether (1) the system had created a node for this name and (2) the node had the desired property. Training documents for the various annotation tasks were minimal—small samples the first year a task was run, augmented in subsequent years by the annotations required for scoring.
The large volumes of unannotated data and the lack of annotated training data encouraged experimentation with semi-supervised methods—learning from partially labeled data. Most direct was the generalization of the earlier work in MUC-4 to bootstrapping, an iterative strategy starting from a small labeled seed. Bootstrapping was successfully applied to scenario template (Yangarber et al. 2000), named entities (Collins and Singer 1999), and relations (Agichtein and Gravano 2000). However, success was not always assured; adding an incorrect element might lead the bootstrapping badly astray.
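The following is a minimal sketch of such a bootstrapping loop for a single relation, in the spirit of systems like Snowball (Agichtein and Gravano 2000); representing contexts as literal middle strings and omitting pattern confidence scoring are simplifications, and the comments hint at why bootstrapping can drift.

```python
# Seed-based bootstrapping of relation patterns and pairs (simplified sketch).
def bootstrap(corpus, seed_pairs, iterations=5, top_k=5):
    """corpus: list of (x, middle_words, y) contexts extracted from the text.
       seed_pairs: small set of known (x, y) pairs for the target relation."""
    pairs = set(seed_pairs)
    patterns = set()
    for _ in range(iterations):
        # 1. contexts that connect known pairs become candidate patterns
        counts = {}
        for x, middle, y in corpus:
            if (x, y) in pairs:
                counts[middle] = counts.get(middle, 0) + 1
        patterns |= {m for m, _ in sorted(counts.items(),
                                          key=lambda kv: -kv[1])[:top_k]}
        # 2. apply patterns to the corpus to acquire new pairs; noise can creep in
        #    here, which is why real systems score and filter patterns and pairs
        for x, middle, y in corpus:
            if middle in patterns:
                pairs.add((x, y))
    return pairs, patterns

corpus = [("Microsoft", "is headquartered in", "Redmond"),
          ("IBM", "is headquartered in", "Armonk"),
          ("Google", ", based in", "Mountain View")]
print(bootstrap(corpus, {("Microsoft", "Redmond")}))
```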
Participants were also provided with a large database, BaseKB. This enabled researchers to explore an approach to training a relation classifier termed distant supervision (Mintz et al. 2009). The basic idea of distant supervision is to convert an existing set of facts into an annotated corpus and then use the annotated corpus to train a classifier in the usual way. Suppose we have a database with a relation R consisting of pairs $\langle x_{1}, y_{1}\rangle, \langle x_{2}, y_{2}\rangle, \ldots$, and that some of these pairs appear in the corpus separated by word sequences $w_{i}$. We will annotate every sequence $w_{i}$ as expressing relation R.
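Schematically, the labeling step looks like the sketch below (entity matching by exact string is a simplification, and the example facts are illustrative); note that the second sentence is labeled even though it does not express the relation, which is exactly the noise problem discussed next.

```python
# Distant supervision: label every sentence containing both members of a known pair.
def distant_label(sentences, kb_pairs, relation):
    """sentences: list of dicts with 'tokens' and 'entities' (surface strings).
       kb_pairs: set of (x, y) pairs known to stand in `relation`."""
    labeled = []
    for s in sentences:
        ents = set(s["entities"])
        for x, y in kb_pairs:
            if x in ents and y in ents:
                # strong (and often wrong) assumption: this sentence expresses R(x, y)
                labeled.append((s, x, y, relation))
    return labeled

kb = {("Barack Obama", "Hawaii")}
sents = [{"tokens": "Barack Obama was born in Hawaii .".split(),
          "entities": ["Barack Obama", "Hawaii"]},
         {"tokens": "Barack Obama visited Hawaii last week .".split(),
          "entities": ["Barack Obama", "Hawaii"]}]           # a false positive
print(len(distant_label(sents, kb, "born_in")))              # -> 2 (one is noise)
```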
The basic model makes strong assumptions which are not satisfied by realistic data. It assumes that if a pair $\langle x_{i}, y_{i}\rangle$ matches a sentence in the corpus, then that sentence expresses relation R. Violations of this assumption lead to a noisy annotated corpus, with many false positives and false negatives. An alternative MIML (Multi-Instance Multi-Label) model requires only that at least one instance of the pair represent a relation and allows the pair to represent more than one relation label. This model leads on average to cleaner annotations (Surdeanu et al. 2012). Further improvements can be made by combining the distant supervision with some manually annotated data (Pershina et al. 2014).
More radical approaches are also being tried, including few-shot methods and even zero-shot methods. These address the situation where we have an event extractor which can recognize N event types and now want to add the ability to recognize an (N+1)st event type. In a few-shot method, a small amount of training data is provided; in a zero-shot method, no additional training data is provided. Huang et al. (2018) propose to ground the event types and event instances in a shared semantic space based on the arguments to the event and then, given a new event instance, assign it to the closest type. Levy et al. (2017) convert a relation into a set of questions and then rely on a reading comprehension system to answer these questions.
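As a toy illustration of the zero-shot idea, the sketch below assigns a new trigger mention to the nearest event type in a shared vector space; the vectors are tiny hand-made stand-ins for learned embeddings of type names and trigger contexts.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def zero_shot_type(trigger_vec, type_vecs):
    """type_vecs: mapping from event type name to the embedding of its name/definition."""
    return max(type_vecs, key=lambda t: cosine(trigger_vec, type_vecs[t]))

# toy 3-dimensional stand-ins for learned embeddings
types = {"Attack": np.array([1.0, 0.0, 0.0]),
         "Elect": np.array([0.0, 1.0, 0.0]),
         "Transport": np.array([0.0, 0.0, 1.0])}
trigger = np.array([0.9, 0.1, 0.0])       # embedding of "troops attacked the village"
print(zero_shot_type(trigger, types))     # -> Attack
```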
Whether distant supervision can outperform hand-built patterns or supervised training depends on several factors. Preparing patterns by hand requires considerable skill and insight but may yield a relatively clean (high precision) system. The preparation of an annotated corpus may require less skill but more time. Distant supervision requires the least labor but may produce the noisiest model. Most likely the best method will involve some combination of these approaches.
5. Deep learning
Advances in deep learning (multilayer neural networks) have had a dramatic effect on all of NLP over the past few years; IE was no exception.
Neural networks provide a major advantage over the trainable models which preceded them (primarily maximum entropy models): given sufficient training data and time, they can capture arbitrary functions of their inputs. That means that they do not require manual feature engineering. On the other hand, the time factor may be significant; training times of one or two weeks are not unusual.
The most widespread change brought about by neural networks was the way in which words are represented. Although prior models had made some use of smoothing lexical dependencies, words were generally treated as discrete symbols. If a vector representation was required, it took the form of a sparse one-hot vector. Practical neural networks, however, required a representation using continuous-valued, low-dimension vectors. In effect, each word is represented by a point in d-space, termed its word embedding. Several methods were developed which captured the semantic properties of the vocabulary, in particular that words which are semantically similar appear close together in d-space.
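The contrast between the two representations can be made concrete with a small sketch; the embedding matrix here is random, standing in for vectors learned by methods such as word2vec or GloVe.

```python
import numpy as np

vocab = {"the": 0, "president": 1, "chairman": 2, "banana": 3}
V, d = len(vocab), 8

one_hot = np.eye(V)                      # sparse, V-dimensional, all words equidistant
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))              # dense d-dimensional word embeddings

def embedding(word):
    return E[vocab[word]]                # lookup: word -> point in d-space

# With trained embeddings, "president" and "chairman" end up close together in
# d-space while "banana" is far away; with these random stand-ins they do not.
```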
We note here some aspects of the deep learning IE models. The primary network types currently in use are CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) using LSTMs (Long Short-Term Memories) (Yin et al. 2017).
Named entities. The best named entity performance is currently obtained by combining a dual token/character model with contextualized word embeddings (Akbik et al. 2019). Performance on the standard test set (the Reuters newswire used by CoNLL for the 2003 evaluation) has improved from an F measure of 89 in 2003 to an F of 93 (Li et al. 2018).
Relations. CNNs offer a particularly simple network structure, but the convolution operates within a fixed window size, which may limit the ability to capture dependencies spanning the entire sentence. ACE relations are mostly realized at close range, with the entities separated by fewer than four words. This makes it reasonable to implement relation extraction using a CNN; Nguyen and Grishman (2015) reported good results with windows of two, three, and four tokens.
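A hedged sketch of such a CNN classifier (here in PyTorch) is shown below: convolutions with window sizes of two, three, and four tokens over word embeddings, max-pooled and fed to a linear output layer. The hyperparameters are illustrative, and position features (important in published CNN relation models) are omitted for brevity.

```python
import torch
import torch.nn as nn

class CNNRelationClassifier(nn.Module):
    def __init__(self, vocab_size, num_relations, emb_dim=100, n_filters=150):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=k) for k in (2, 3, 4)])
        self.out = nn.Linear(3 * n_filters, num_relations)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.out(torch.cat(pooled, dim=1))  # (batch, num_relations)

model = CNNRelationClassifier(vocab_size=20000, num_relations=7)
logits = model(torch.randint(0, 20000, (8, 40)))   # batch of 8 padded sentences
```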
Events. As noted above, event extraction can involve multiple interactions which may benefit from joint inference. In a neural network, these interactions can be captured directly through a set of “memory matrices” whose values are assigned as part of the network training and then used for event trigger and argument prediction (Nguyen et al. 2016).
Event extraction is in substantial part a matter of word sense disambiguation. But until recently each word was assigned a single word embedding, which could not capture sense distinctions. Contextualized word embeddings relax that constraint, making the embedding dependent on context. Using contextualized word embeddings on the ACE corpus improves event classification by about two points of F measure (Lu and Nguyen 2018).
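To illustrate, the sketch below (assuming the Hugging Face transformers package) extracts a contextualized vector for a trigger word, so that “fired” receives different representations in different sentences; the model choice and the single-wordpiece lookup are simplifying assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def trigger_vector(sentence, trigger):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, 768)
    ids = enc["input_ids"][0].tolist()
    i = ids.index(tok.convert_tokens_to_ids(trigger))     # first wordpiece of trigger
    return hidden[i]

v1 = trigger_vector("The company fired its president.", "fired")
v2 = trigger_vector("The rebels fired rockets at the base.", "fired")
# v1 and v2 differ, reflecting the two senses of "fired".
```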
6. User-generated media
Another significant addition of the last few years was the processing of user-generated data. Twitter was founded in 2006; currently about 500,000,000 tweets are sent each day. Automatically monitored tweets provide a source of current activities second to none, so they have become a target for NLP developers (Panem et al. 2014). There is now an annual workshop on the analysis of such informal communication, WNUT (Workshop on Noisy User-Generated Text, web site http://noisy-text.github.io/).
But the tweets are quite different from the well-edited texts of newswires which had been the target of most NLP. The tweets may contain many variant spellings, little or no punctuation, and newly coined terms. In consequence, taggers which were trained on edited text performed poorly on tweets (e.g., a top-ranked named entity tagger which obtained an F score over 90% on the standard Reuters test corpus obtained an F score of about 40% on tweet corpora).
The WNUT workshops include an annual multi-site evaluation, but the performance of these tweet-optimized systems was not much better; top performance in the 2016 evaluation was F = 52% (Strauss et al. 2016). Generally the taggers used designs similar to those described above, principally CRFs and RNNs built using LSTMs. Because individual tweets provide much less context, tweet taggers must rely more on name lists (e.g., gazetteers). Taking advantage of global consistency—a preference for assigning the same token the same tag in different tweets—is also important (Ritter et al. 2011; Liu et al. 2011; Cherry and Guo 2015). (As we noted earlier, global consistency also plays a role, but a smaller one, in tagging edited text.)
7. Evaluation
At first glance, IE evaluation seems rather straightforward. We agreed already at MUC-3 to score using recall, precision, and F measure. We prepare a key and compare it to the IE system’s response, computing recall (the number of correct fills in the response divided by the number of fills in the key) and precision (the number of correct fills divided by the total number of fills in the response), and then compute the F measure using

$F = \dfrac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
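In code, the slot-level computation is simply the following (key and response are represented here as sets of (slot, fill) pairs; real MUC scoring also had to handle partial credit and template alignment, discussed below):

```python
# Slot-level recall, precision and F measure over key (gold) and response (system) fills.
def score(key, response):
    correct = len(key & response)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return recall, precision, f

key = {("PERPETRATOR", "SHINING PATH"), ("TARGET", "PRC EMBASSY"), ("DATE", "25 OCT 89")}
resp = {("PERPETRATOR", "SHINING PATH"), ("TARGET", "EMBASSY")}
print(score(key, resp))   # (0.333..., 0.5, 0.4)
```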
It quickly became clear that things would not be so simple. Systems were supposed to generate one template per event; if a document reported two events, two templates should be filled. However, the system response did not explicitly specify how to pair up the templates in the key and response. To address this issue, possible alignments of key and response templates were generated and scored, and the maximum score was reported (MUC no date). A similar problem arose at a smaller scale if there were multiple participants in an event. In general this recall/precision model provided satisfactory and intuitive scores when new tasks were added to MUC. The one exception was the coreference task. One scoring scheme was originally designed and an elegant alternative was proposed at the MUC conference, but neither seemed intuitive. To this day there are disagreements regarding coreference scoring metrics (Luo 2005).
When MUC was divided into four and later five tasks, each was given its own scoring metric, which made sense since each task might be used independently. The ACE evaluation, in contrast, was based on a set of parallel cost models for entities, relations, and events. Each model combined detection, classification, clustering (i.e., coreference), and additional features. The official score (“ACE value”) was based on all these factors, suitably weighted. A positive value is assigned to each element correctly recognized and a false alarm penalty is charged for each incorrect output. The score could be negative if the number of errors exceeded the number of correctly identified elements (Doddington et al. 2004). This is a standard ROC (Receiver Operating Characteristic) model but was not intuitive to the participants; in consequence, it was used for formal Government reports but little used in the published literature.
In place of the cost model, most researchers report recall/precision scores for relations and events. These scores are highly dependent on the accuracy of entity extraction since only entities can serve as arguments of relations and events. To isolate improvements to relation and event extraction, most researchers assume that the relation or event extractor is provided with perfect information about entities. This has the effect of producing higher (more optimistic) scores than would be obtained by running a real entity extractor.
With the shift to deep-learning taggers which are capable of representation learning, some researchers now assume the relation tagger has minimal information regarding entities—only their position in the sentence, not their semantic type. These shifts—reflecting changing research goals—must be taken into account when comparing tagger performance.
8. Looking ahead
We have briefly described the wide range of approaches that have been developed over the past 25 years for building IE systems, and the gradual rise in task performance which has accompanied the introduction of these approaches. The result is a growing set of applications in finance (Ding et al. 2015), medicine (Wang et al. 2018), and science (Peters et al. 2014). Still, performance (F score) after more than 25 years of development has only advanced from the low 60s to the low 70s on standard event classification benchmarks, and there are serious obstacles to be faced in further improving the scores. What are our prospects?
(1) In some regards, the standard benchmarks (drawn from newswires and blogs) are particularly difficult because the range of topics is so broad, increasing the risk of event misclassification. Most applications involve a narrower range of topics and so yield higher performance than the benchmarks.
(2) There will be errors and uncertainties in the human annotations which limit the score we can get. This applies even to texts carefully prepared using dual annotation and reconciliation, such as the ACE corpora. Annotating relations requires identifying two endpoints, which are easily missed. Relatively abstract categories will lead to uncertainties in classification for both relations and events (Min and Grishman 2012). We should embrace this vagueness as part of the power of natural language and take account of it in our evaluations.
(3) There will be examples which require world knowledge and inference. For instance, the ACE events include a phone event (a subtype of contact). Given the sentence “Fred phoned Jim and he later returned the call,” the system must be able to infer that Jim later called Fred. Handling such cases properly may require a deeper modeling of the events. This is much more feasible in a narrow domain.
(4) Insufficient training data. We expect that we would get several percent improvement in event extraction just by doubling the amount of ACE training data. But “just” may not be an appropriate word when the data were a major government investment. Going forward, we could not afford a similar investment for everyone who wants an IE system of their own. Here we may be saved by semi-supervised or unsupervised methods. At a minimum, unsupervised systems could provide cores of relation and event types, which can then be extended and adjusted for particular users, using some form of domain adaptation.
(5) Pipeline problems. IE remains a multi-stage process where earlier stages may introduce errors which are magnified by later stages. Joint inference strategies can reduce this effect.
And we should keep in mind that deep learning is still a young technology from which we can expect continuing improvements in machine learning, just as the advent of Bidirectional Encoder Representations from Transformers (BERT) and contextualized embeddings has given many systems a boost of late (Devlin et al. 2018). So our prospects for continued improvement seem pretty good.
As performance improves, the number of applications which become commercially viable will continue to grow. To maintain market share for their platforms, every one of the “tech giants” (along with multiple start-ups) can now be counted on to provide an NLP API including all of the elements of the pipeline, and to update it steadily, bringing state-of-the-art NLP components much closer to IE applications. In this market-driven environment there may be less demand for the Government to guide research by funding fresh evaluations.
Acknowledgements
This work was supported in part by DARPA/I2O and US Army Research Office Contract No. W911NF-18-C-0003 under the World Modelers program. The views, opinions, and/or findings contained in this article are those of the author and should not be interpreted as representing the official views or policies, either expressed or implied, of the Department of Defense or the US Government. This document does not contain technology or technical data controlled under either the US International Traffic in Arms Regulations or the US Export Administration Regulations. The author wishes to thank the reviewers for their suggestions regarding topics to include in the paper.
Note
MUC 3-7 proceedings are available through the ACL Anthology at https://www.aclweb.org/anthology/
Appendix A. MUC-2 template
MUC-1 and 2 involved Navy messages. MUC-1 was exploratory and did not involve a shared template. The first shared template was developed for MUC-2 (Sundheim 1996).
Some of the slots in the template are multiple choice, such as the FORCE INITIATING EVENT slot; alternative fills are separated by commas. Other slots, such as the ID slots, are preferably filled with specific vessel IDs, locations, and times when those are available.
Appendix B. Sample message and template for MUC-3
B.1 Message
TST1-MUC3-0099
LIMA, 25 OCT 89 (EFE) -- [TEXT] POLICE HAVE REPORTED THAT TERRORISTS TONIGHT BOMBED THE EMBASSIES OF THE PRC AND THE SOVIET UNION. THE BOMBS CAUSED DAMAGE BUT NO INJURIES.
A CAR-BOMB EXPLODED IN FRONT OF THE PRC EMBASSY, WHICH IS IN THE LIMA RESIDENTIAL DISTRICT OF SAN ISIDRO. MEANWHILE, TWO BOMBS WERE THROWN AT A USSR EMBASSY VEHICLE THAT WAS PARKED IN FRONT OF THE EMBASSY LOCATED IN ORRANTIA DISTRICT, NEAR SAN ISIDRO.
POLICE SAID THE ATTACKS WERE CARRIED OUT ALMOST SIMULTANEOUSLY AND THAT THE BOMBS BROKE WINDOWS AND DESTROYED THE TWO VEHICLES.
NO ONE HAS CLAIMED RESPONSIBILITY FOR THE ATTACKS SO FAR. POLICE SOURCES, HOWEVER, HAVE SAID THE ATTACKS COULD HAVE BEEN CARRIED OUT BY THE MAOIST “SHINING PATH” GROUP OR THE GUEVARIST “TUPAC AMARU REVOLUTIONARY MOVEMENT” (MRTA) GROUP. THE SOURCES ALSO SAID THAT THE SHINING PATH HAS ATTACKED SOVIET INTERESTS IN PERU IN THE PAST.
IN JULY 1989 THE SHINING PATH BOMBED A BUS CARRYING NEARLY 50 SOVIET MARINES INTO THE PORT OF EL CALLAO. FIFTEEN SOVIET MARINES WERE WOUNDED.
SOME 3 YEARS AGO TWO MARINES DIED FOLLOWING A SHINING PATH BOMBING OF A MARKET USED BY SOVIET MARINES.
IN ANOTHER INCIDENT 3 YEARS AGO, A SHINING PATH MILITANT WAS KILLED BY SOVIET EMBASSY GUARDS INSIDE THE EMBASSY COMPOUND. THE TERRORIST WAS CARRYING DYNAMITE.
THE ATTACKS TODAY COME AFTER SHINING PATH ATTACKS DURING WHICH LEAST 10 BUSES WERE BURNED THROUGHOUT LIMA ON 24 OCT.
B.2 A filled scenario template
This is one of three templates which should be generated for this message. The full set appears in MUC (1991). The “/” separates alternative correct slot fills.
Appendix C. ACE entities, relations, and events
C.1 Entities
A GPE is a location with a government, such as a city, state, or country. Mentions of a GPE may refer to the land mass (“He traveled to Florida”), the population (“Florida loves orange juice”), or the government (“Florida declared a state of emergency”).
C.2 Relations
A relation expresses a relationship between two entities which are mentioned in the same sentence.
C.3 Events
These 8 event types are divided into 33 subtypes.