1 Introduction
The automatic extraction of temporal information from unstructured text plays an important role in the distillation of knowledge from large text corpora. For example, in automatic question answering tasks, extracting temporal information is necessary for a system to answer questions regarding when a particular event occurred (Mirroshandel and Ghassem-Sani 2012). Similarly, in timeline construction tasks, extracting temporal information is necessary for arranging events in chronological order (Caselli et al. 2015; Laparra, Aldabe, and Rigau 2015).
To spur research in the development of systems for extracting temporal information, a competition has been held several times in conjunction with the International Workshop on Semantic Evaluation: TempEval-1 in 2007 (Verhagen et al. 2009), TempEval-2 in 2010 (Verhagen et al. 2010), and TempEval-3 in 2013 (UzZaman et al. 2013). The competition organizers defined a series of temporal information extraction tasks, gave the competitors corpora of manually annotated documents for several languages, and asked the competitors to develop systems addressing one or more of the tasks for one or more languages. The systems developed relied heavily on the manually annotated corpora. Thus, porting these systems to languages not considered in the competition requires acquiring large corpora of manually annotated documents in the target languages. Acquiring such corpora is difficult owing to the complexity of temporal information extraction annotation.
One strategy for addressing this difficulty is to reduce or eliminate the need for manually annotated corpora in the target languages. For other Natural Language Processing (NLP) problems (e.g., named entity recognition, part of speech tagging, parsing, etc.), researchers have adopted this strategy using a technique called annotation projection, which, generically put, proceeds as follows. Obtain a sentence-aligned parallel corpus between the source language (typically English) and the target language. Apply the tool for addressing the NLP problem of interest to the source language side of the parallel corpus. Use automatically generated word alignments to project annotation information to the target language side. Finally, train a system to address the NLP problem using the automatically annotated target language corpus (and, optionally, any available manually annotated target language documents).
We developed an annotation projection technique for producing an annotated, target language corpus and a pipeline that takes the annotated corpus and builds a system for temporal information extraction. We carried out an English (source) to French (target) case study by applying our annotation projection approach and pipeline to an English–French parallel corpus to build a French temporal information extraction system and comparing that to a system built by applying our pipeline to a manually annotated French corpus. It is important to note that our annotation projection technique for building temporal information extraction systems is not limited to English and French. It can be applied to any source and target language pair if the following resources are available.
A sentence-aligned, source and target language parallel corpus wherein each source language document has a document creation time.
Source language: a temporal information extraction system.
Target language: a tokenizer, sentence splitter, stemmer, constituency parser, and a temporal expression recognizer.
The next subsection describes in greater detail the temporal information extraction problem we address. To our knowledge, ours is the first paper examining annotation projection as applied to temporal information extraction. Section 2 discusses related work. Section 3 describes our annotation projection technique for producing an annotated, target language corpus that can be used to build a system addressing the temporal information extraction problem defined in the next subsection. Section 4 discusses our pipeline for training a target language information extraction system from an annotated corpus. The pipeline applies unchanged to annotations produced via annotation projection or made manually. The section also discusses how the pipeline is applied to a new (test) document to assign annotations. Section 5 describes the experiments we performed as part of our English (source) to French (target) case study. The section also describes the data we used (the parallel and manually annotated corpora), the scoring metrics, and our experiments’ methodology. Section 6 discusses our results, including an error analysis. Section 7 briefly summarizes the paper and provides conclusions and some directions for future work.
1.1 Problem definition
The series of tasks defined for the TempEval-3 competition, denoted A, B, and C, form the basis of the temporal information extraction problem we address. Given a text document and the document creation time, solve the following tasks:
Identify the extents of temporal expressions. This is a simplified version of TempEval-3 task A, since temporal normalization is not required for temporal relation identification. We made this change because we do not address the problem of building a temporal normalizer in the target language.
Identify the event triggers and their associated class and tense. Like TempEval-3 task B, each event trigger consists of a single word. Unlike task B, aspect, polarity, and modality determination are not required. We made this change because of data annotation sparsity.
Identify the temporal relation between pairs of event triggers in the same sentence, between event triggers and times in the same sentence, and between event triggers and document creation times. Each pair is assigned one of four temporal labels: NO-RELATION, BEFORE, AFTER, or OVERLAP. This is a simplified version of TempEval-3 task C, since the BEFORE, AFTER, and OVERLAP labels resulted from merging some of the 14 labels used in task C. However, unlike task C, temporal relations between main event triggers in consecutive sentences are not required. We made these simplifications because of data annotation sparsity (the label collapse) and simplicity (restricting to within-sentence relations).
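The text does not specify which of the 14 task C labels collapse into each coarse label. The sketch below shows one plausible grouping for illustration only; the mapping itself is our assumption, and only the four output labels come from the problem definition.

```python
# Hypothetical collapse of TimeML-style fine-grained relation labels into
# the four coarse labels used here. The exact grouping is an assumption.
LABEL_COLLAPSE = {
    "BEFORE": "BEFORE", "IBEFORE": "BEFORE",
    "AFTER": "AFTER", "IAFTER": "AFTER",
    # remaining labels treated as some form of temporal overlap
    "INCLUDES": "OVERLAP", "IS_INCLUDED": "OVERLAP", "DURING": "OVERLAP",
    "SIMULTANEOUS": "OVERLAP", "IDENTITY": "OVERLAP",
    "BEGINS": "OVERLAP", "BEGUN_BY": "OVERLAP",
    "ENDS": "OVERLAP", "ENDED_BY": "OVERLAP",
}

def collapse(label):
    """Map a fine-grained label to a coarse one; pairs with no
    annotated relation default to NO-RELATION."""
    return LABEL_COLLAPSE.get(label, "NO-RELATION")
```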
Figure 1 depicts an example of temporal information extraction applied to a document consisting of a single sentence.
2 Related work
Related work is organized into several categories.
Work that is most closely related to our research. This includes efforts to apply annotation projection and closely related ideas to temporal information extraction and other problems. It also includes efforts to apply unsupervised learning to temporal information extraction.
Work that addresses temporal expression recognition and normalization, namely, identifying temporal expressions in free text and deducing a normalized version of the time expressed (e.g., “the third day of December in 2016” could be normalized to 2016-12-03).
Work that addresses local temporal relation classification using fully supervised learning techniques trained on manually annotated data. Temporal relation classification involves the assignment of temporal labels (e.g., BEFORE, AFTER) to pairs drawn from pre-identified event triggers, time expressions, and the document creation time. Local refers to the fact that each pair is classified independently of all others.
Work that addresses variations on temporal information extraction. This includes work that addresses broader problems than temporal relation classification. For example, works that perform “end-to-end” extraction (times, events, temporal relations) from unstructured text are included here.
2.1 Most closely related work
Spreyer and Frank (2008) applied annotation projection to build, for German, a temporal expression recognizer, an event trigger recognizer, and a subordinate relation classifier. Subordinate relationships hold between two event triggers and indicate when one event instance subordinates another in a way specified by one of the following labels: INTENSIONAL, EVIDENTIAL, NEG_EVIDENTIAL, FACTIVE, COUNTER_FACTIVE, and CONDITIONAL. Spreyer and Frank’s main idea was to use various heuristics to filter projections, resulting in less noisy training data. We employ the same idea with a similar, but not identical, set of filtering heuristics. The key difference between their work and ours is that we focus on different kinds of relationships: temporal relations between pairs of event triggers and between event triggers and time expressions.
Jarzebowski and Przepiorkowski (2012) projected events and temporal relations across an English–Polish parallel corpus. However, they manually corrected the Polish temporal relation labels before training a Polish temporal relation extraction system. Minard et al. (2016) manually annotated event triggers and time expressions on both sides of English–Italian, English–Spanish, and English–Dutch parallel corpora. They manually aligned the triggers and time expressions, applied an English temporal relation extractor, and projected the temporal relations to the target language. Forascu and Tufis (2012) produced a Romanian corpus with time expressions, event triggers, and temporal relation labels by manually translating an annotated English corpus. Costa and Branco (2010) produced an annotated Portuguese corpus in similar fashion, except they used automatic translation followed by manual correction.
Annotation projection has been applied to a variety of NLP problems not directly related to temporal information extraction. The original papers on the subject were published by Yarowsky and Ngai (2001) and Yarowsky, Ngai, and Wicentowski (2001). Therein, annotation projection was used to build target language part-of-speech (POS) taggers, base-phrase chunkers, and named entity recognizers. A variety of more recent approaches have adopted the strategy of not projecting annotations directly (after filtering), as we do, but rather projecting other forms of information which, ideally, dampen the impact of noise. One class of more recent approaches has focused on applying posterior regularization, a technique from machine learning, to improve annotation projection for producing dependency parsers (Ganchev, Gillenwater, and Taskar 2009), POS taggers (Das and Petrov 2011; Ganchev and Das 2013; He, Gillenwater, and Taskar 2013), and named entity recognizers (Wang and Manning 2014). These approaches project constraint information which influences the target language learner through regularization. Similarly, Tackstrom et al. (2013) project type constraints and develop a method for training POS taggers from the constraint lattice. Another class of approaches has focused on neural-network-based embeddings that span multiple languages (Gouws and Sogaard 2015; Luong, Pham, and Manning 2015; Zennaki, Semmar, and Besacier 2016). These approaches have the major benefit of not requiring word alignments, a significant source of noise (Gouws and Sogaard’s approach does not even require a parallel corpus).
Mirroshandel, Ghassem-Sani, and Khayyamian (2011) developed an unsupervised approach to temporal relation classification between event triggers. They assumed that the event triggers are pre-identified in an otherwise unannotated corpus. They developed an Expectation Maximization (EM) style algorithm for inferring temporal label probabilities on event trigger pairs.
2.2 Temporal expression recognition and normalization
Strötgen and Gertz (2016) provide a detailed and comprehensive discussion of temporal expression recognition and normalization techniques. Early work includes that of Jang, Baldwin, and Mani (2004), who developed a temporal expression recognizer and normalizer for Korean. Subsequently, the Database Systems Research Group at Heidelberg University in Germany developed a temporal expression recognizer and normalizer for English called HeidelTime (Strötgen and Gertz 2010). It was designed to be easily adapted to new languages (Strötgen and Gertz 2013). Its most important design feature is the separation between the patterns and rules specifying temporal extraction and normalization information and the algorithms for applying those patterns and rules to unstructured text. The algorithms are designed to be language-independent. Currently, HeidelTime has been adapted to 13 languages. For example, Manfredi, Strötgen, Zell, and Gertz (2014), Moriceau and Tannier (2014), and Skukan, Glavaš, and Šnajder (2014) manually adapted HeidelTime to Italian, French, and Croatian, respectively. Furthermore, Strötgen and Gertz (2015) show how HeidelTime can be automatically extended to more than 200 languages (with some loss of fidelity). Given the availability of HeidelTime in many languages and its adaptability to new ones, we do not use annotation projection for time expression identification. Instead, we use HeidelTime.
2.3 Local temporal relation classification
Chambers, Wang, and Jurafsky (2007) built a two-stage classifier for predicting temporal labels between pairs of English event triggers. The first stage predicts attributes of event triggers (tense, aspect, class, polarity, modality), and the second stage predicts the temporal labels using the predicted event attributes as features. Torbati et al. (2013) addressed the same problem for Persian as well as English. For classification, they used a Support Vector Machine with combinations of simple and complex (e.g., subsequence and tree) kernels. Mirroshandel et al. (2011) improved a fully supervised event trigger–event trigger temporal relation classifier with a bootstrapping step based on topic similarity and a “one type of temporal relation per discourse” hypothesis.
D’Souza and Ng (2013) built a classifier for predicting temporal labels between pairs of English event triggers, pairs of event triggers and time expressions, and pairs of event triggers and document creation times. They used a single classifier for all three types of relations (event trigger–event trigger, event trigger–time expression, event trigger–document creation time) and many classifier features involving complex linguistic information (e.g., WordNet relations). Mirza and Tonelli (2014) addressed the same problem but used a simpler set of features and examined the impact of each type of feature used.
2.4 Variations of temporal information extraction
Several researchers have addressed temporal relation classification, recognizing that dependencies exist between relations. For example, if event e1 occurs before e2, which overlaps e3, then e3 cannot occur before e1. A variety of methods have been employed to take advantage of these dependencies. Yoshikawa et al. (2009) and Fairholm (2014) used Markov Logic Networks to encode such dependencies probabilistically (e.g., an event with future tense should not occur before the document creation time). Do, Lu, and Roth (2012) adopted an approach that jointly optimizes individual relation classification decisions along with global constraints. Chambers et al. (2014) developed an approach that applies individual relation classification decisions in a cascading fashion, with earlier classifications affecting later ones through transitive closure. Laokulrat, Miwa, and Tsuruoka (2015) developed an approach based on stacking to take advantage of dependencies between relation classifications deemed close in a time-graph.
Bethard (2013) developed a system, ClearTK-TimeML, for addressing temporal information extraction as defined by the full series of TempEval-3 tasks. His approach was based on the use of simple features and only local relation classifications. We based our temporal relation extraction pipeline on his, as described in Section 4. Jeong et al. (2015) and Mirza and Minard (2014) developed systems for addressing temporal information extraction in Korean and Italian, respectively. Ling and Weld (2010) and Glavas and Snajder (2015) addressed a temporal information extraction problem (in English) similar to, but not the same as, that defined in TempEval-3.
3 Annotation projection for producing a temporal information extraction annotated corpus
This section describes our annotation projection technique for producing an annotated, target language corpus that can be used to build a system to address the temporal information extraction problem defined in Section 1.1 (see Figure 2). The next section describes our pipeline for training a target language information extraction system from an annotated corpus.
As described earlier, our annotation projection approach starts with a sentence-aligned parallel corpus of source and target language documents wherein each source document has an assigned document creation time. Like Spreyer and Frank (2008) and Jarzebowski and Przepiorkowski (2012), our approach uses a set of heuristics for aggressive filtering to reduce noise. Key to our heuristics is a token alignment filter that uses a user-defined threshold, ProjProbT (Projection Probability Threshold). Intuitively, a source language token is not aligned with any target language token if the maximum alignment probability between the source language token and all target language tokens in the paired sentence does not exceed ProjProbT. The alignments are the basis for the following projection filtering heuristics.
If the token in a source language event trigger does not align with any target language token, then the trigger is not projected.
If the tokens in a source language time expression do not align with a non-empty, contiguous set of target language tokens, then the time expression is not projected.
If a source language temporal relation does not have both of its entities (event trigger and time expression) projected, then the temporal relation is not projected.
Finally, a filter is applied to drop entire target language documents if enough projections were filtered or if some projections conflict. This filter uses two user-defined thresholds, EventPerT and TLinkPerT (Event Percentage Threshold and TLink Percentage Threshold). Our entire annotation projection procedure proceeds as follows.
(1) Project the document creation times and preprocess the parallel corpus. For each source language document in the parallel corpus, assign the document creation time to the associated target language document. In all documents on both sides of the parallel corpus, add whitespace around all occurrences of the following characters: ‘.’, ‘,’, ‘!’, ‘?’. The remaining steps assume tokenization is defined by whitespace.
(2) Automatically annotate the source language documents. Apply a temporal information extraction system to automatically identify event triggers, time expressions, and temporal relations in the source language documents.
(3) Generate a token alignment function for each sentence pair and project the annotations. Apply the Berkeley Aligner (Liang, Taskar, and Klein 2006) to generate an alignment probability (possibly zero) between all pairs of source and target language tokens for each sentence pair. For each source–target language sentence pair, define a function, Align, from source language tokens to target language tokens as follows.
(a) Let e[1],…,e[n] and f[1],…,f[m] denote the tokens in the source and target language sentence, respectively. Let P(e[i],f[j]) denote the alignment probability assigned by the Berkeley Aligner to the source language token e[i] and target language token f[j].
\begin{equation}
Align(e[i]) \stackrel{\mathrm{def}}{=}
\begin{cases}
\mathop{\mathrm{argmax}}\limits_{f[j]} \{P(e[i], f[j]) : 1 \le j \le m\} & \text{if } \exists j \text{ such that } P(e[i], f[j]) > ProjProbT\\
\text{undefined} & \text{otherwise.}
\end{cases}
\end{equation}

(b) For all event triggers, e[i], identified in the source language sentence:
(i) if Align(e[i]) is undefined, then the event annotation is not projected, else identify the target language token Align(e[i]) as an event trigger and assign it the same event attributes (aspect, class, tense, modality, polarity) as e[i] was assigned.
(c) For all time expressions, {e[i], e[i+1],…, e[i+τ−1]}, identified in the source language sentence,
(i) if the set {Align(e[i+k]) : 0 ≤ k < τ and Align(e[i+k]) is defined} is empty or, when sorted in increasing order, does not form a consecutive sequence of integers, then the time expression is not projected; else identify that set of target language tokens as a time expression and assign it the same temporal type that the source language time expression was assigned. Note the normalized time is not used by the pipeline, so it is not projected.
(d) For all temporal relations identified involving the sentence:
(i) between an event trigger in the sentence and the document creation time, if the event trigger is projected, then the temporal relation label is assigned to the target language event trigger and document creation time;
(ii) between two event triggers in the sentence, if the triggers are both projected, then the temporal relation label is assigned to the target language event triggers;
(iii) between an event trigger and a time expression in the sentence, if the trigger and time expression are projected, then the temporal relation label is assigned to the target language event trigger and time expression.
(4) Filter target language documents. Drop all target language documents if any of the following hold.
(a) Less than EventPerT percent of the event triggers in the associated source language document were projected to the target language document.
(b) Less than TLinkPerT percent of the temporal relations in the associated source language document were projected to the target language document.
(c) There is a pair of projected event triggers that share the same token, a projected event trigger whose token is among the tokens of a projected time expression, or a pair of projected time expressions whose tokens overlap.
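Steps 3(a) and 3(c) can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: indices are 0-based here, and the function names are ours.

```python
def align_fn(P, m, proj_prob_t):
    """Build the Align function of step 3(a): P[i][j] is the aligner's
    probability between source token e[i] and target token f[j]. A source
    token maps to its best-scoring target token only if that score
    exceeds ProjProbT; otherwise the alignment is undefined (None)."""
    def align(i):
        best_j = max(range(m), key=lambda j: P[i][j])
        return best_j if P[i][best_j] > proj_prob_t else None
    return align

def project_time_expression(indices, align):
    """Step 3(c) contiguity test: project a source time expression
    (given by its source token indices) only if the defined alignments
    form a non-empty set of consecutive target positions."""
    targets = sorted({align(i) for i in indices if align(i) is not None})
    if not targets or targets != list(range(targets[0], targets[-1] + 1)):
        return None  # filtered, not projected
    return targets
```

For example, with a token whose best alignment probability is below the threshold, `align` returns None and any event trigger at that position is dropped, matching the first filtering heuristic.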
4 Our temporal information extraction pipeline
This section describes our pipeline for (a) training a target language temporal information extraction system from an annotated corpus, and (b) applying the system to a new corpus of target language documents. The pipeline operates in two different modes: training mode and application mode (see Figure 3).
4.1 Training mode
The pipeline is provided an annotated corpus of target language documents (produced by projection or manually) and produces a temporal information extraction system, essentially a collection of classifiers. The annotations are expected to include time expression markers, event trigger markers with values for attributes tense and class, and temporal relation markers (with a temporal label, e.g., BEFORE) between pairs of event triggers, event triggers and document creation times, and event triggers and time expressions.
The documents are first preprocessed: tokenized, sentence-split, constituency parsed, and stems generated for each token. Next, using the MALLET version 2.0.7 Java library (McCallum 2002), a collection of Laplace-prior logistic regression classifiers is trained from the annotations and the preprocessed documents. The prior variance parameter is selected from the set $\{\sqrt{10^{(i-4)}} : i = 0,1,\ldots,6\}$, as described in Genkin, Lewis, and Madigan (2007). The selection is made to maximize the five-fold cross-validation F-score.
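The hyperparameter search can be sketched with scikit-learn in place of MALLET. This is an analogy, not the paper's implementation: a Laplace prior corresponds to L1 regularization, and mapping the prior variance onto scikit-learn's C (larger variance means weaker regularization) is an approximation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_prior_variance(X, y):
    """Try the seven candidate variances sqrt(10^(i-4)), i = 0..6, and
    keep the one with the best 5-fold cross-validated F-score."""
    variances = [np.sqrt(10 ** (i - 4)) for i in range(7)]
    return max(
        variances,
        key=lambda v: cross_val_score(
            LogisticRegression(penalty="l1", solver="liblinear", C=v),
            X, y, cv=5, scoring="f1").mean())
```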
The features for the classifiers are based on those used in Bethard (Reference Bethard2013), D’Souza and Ng (Reference D’souza and Ng2013), and Mirza and Tonelli (Reference Mirza and Tonelli2014).
Event trigger identification
The classifier assigns to a token one of two labels indicating whether or not the token is an event trigger. A training instance is created for each token with POS V, VS, VINF, VPP, VPR, N, or NC. The instance consists of the following features: all token unigrams three to the left through three to the right of the current token; the stem of the current token; the POS of the current token; the POS bigrams including the token to the left and right of the current token; the grandparent in the parse tree of the current token; the path from the great-grandparent to the root; and the leftmost and rightmost leaves of the grandparent.
When training over the corpus with annotations produced using annotation projection, the number of instances labeled as not event triggers is typically much larger than the number of instances labeled as triggers. Under-sampling is applied to mitigate class imbalance problems. Specifically, for each instance labeled as not an event, drop the instance with probability 0.25.
When training over either corpus, a classification threshold probability is set using five-fold cross-validation. When the classifier is applied to a token, if the event trigger label probability exceeds the threshold, the token is identified as an event trigger. The threshold is set as follows. For each fold: train a classifier on the union of the other folds; calculate the maximum F-score over the fold with respect to all possible classification probability thresholds. Set the final classification threshold to the value that produced the maximum F-score over all the folds.
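The threshold search described above can be sketched in pure Python. Candidate thresholds are taken from the predicted probabilities themselves, and prediction uses p ≥ t; both are our simplifying assumptions.

```python
def f_score(probs, labels, t):
    """F-score of predicting 'event trigger' whenever probability >= t."""
    pred = [p >= t for p in probs]
    tp = sum(1 for p, y in zip(pred, labels) if p and y)
    fp = sum(1 for p, y in zip(pred, labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def select_threshold(fold_probs, fold_labels):
    """fold_probs[k] holds the trigger-label probabilities a classifier
    (trained on the other folds) assigned to fold k's tokens, and
    fold_labels[k] the gold 0/1 labels. Return the threshold achieving
    the maximum per-fold F-score over all folds."""
    best_t, best_f = 0.5, -1.0
    for probs, labels in zip(fold_probs, fold_labels):
        for t in sorted(set(probs)):
            f = f_score(probs, labels, t)
            if f > best_f:
                best_f, best_t = f, t
    return best_t
```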
The remaining classifiers involve more than two labels and do not use a classification threshold probability. They assign the label with the maximum probability.
Event class classifier
The classifier assigns to event triggers one of the following labels: OCCURRENCE, STATE-ISTATE, ASPECTUAL, IACTION, or PERCEPTION-REPORTING. A training instance is created for each token annotated as an event trigger and with verb POS. The instance consists of the following features: the POS and stem of the current token.
Event tense classifier
The classifier assigns to a token identified as an event trigger one of the following labels: PAST, PRESENT, FUTURE, or IMPERFECT-FUTURE-NONE. A training instance is created for each token annotated as an event trigger and with verb POS. The instance consists of the following features: the POS of the current token, the last two characters of the current token, and all prepositions and adverbs three to the left through the current token.
The temporal relation classifiers assign one of the labels NONE, AFTER, BEFORE, or OVERLAP to pairs of event triggers in the same sentence, triggers and document creation times, and triggers and time expressions in the same sentence. A separate classifier is trained for each of these three types of temporal relation.
Temporal relation between event triggers classifier
A training instance is created for each pair of event triggers in the same sentence that consists of the following features: the class label for each trigger and true/false if the labels are equal; the same for the tense labels; if the left trigger has a PP ancestor in the parse tree, the leftmost leaf whose POS is P and that has the same PP ancestor (dominating preposition); the same for the right trigger; the concatenation of the previous two features; if the left trigger has a VN or VPinf or VPpart ancestor, the POS of the leftmost token that has the same ancestor (dominating verb phrase POS); the same for the right trigger; the concatenation of the previous two features; the tokens and stems for the left and right triggers; all tokens and stems between the triggers; the POS of the left and right triggers and true/false if the POSs are equal; the number of triggers or time expressions between the triggers; all leaves of the grandparent of the left trigger; the same for the right trigger; and the path between the grandparents of the triggers.
When training over the corpus with annotations made manually, under-sampling is applied such that all instances with label NONE are dropped with probability 0.5.
Temporal relation between event trigger and document creation time classifier
A training instance is created for each event trigger that consists of the following features: the class and tense labels for trigger; the token, stem, and POS of the trigger; the grandparent and great-grandparent of the trigger in the parse tree; the dominating preposition of the trigger (if one exists); and the dominating verb phrase POS of the trigger (if one exists).
When training over the corpus with annotations made manually, under-sampling is applied such that all instances with label NONE are dropped with probability 0.3.
Temporal relation between event trigger and time expression classifier
A training instance is created for each event trigger and time expression in the same sentence that consists of the following features: the same features as described for triggers and document creation times; for each token in the time expression, the same features as described for triggers and document creation times except class and tense labels; the concatenation of the dominating preposition of the trigger (if one exists) and the dominating preposition of each token in the time expression (if one exists); true/false if the trigger is to the left of the time expression; the temporal type of the time expression; true/false if the POS of the trigger matches one of the POSs in the time expression; the verbs and prepositions among the tokens 5 to the right of the trigger; and the number of triggers or time expressions between the trigger and time expression.
4.2 Application mode
The pipeline takes a corpus of target language documents along with their creation times and produces annotations as described earlier. The documents are first preprocessed as described in the training mode. Temporal expressions and their types in the documents are identified. The event trigger classifier is applied to each token with POS tag V, VS, VINF, VPP, VPR, N, or NC. For tokens identified as triggers, those with POS tag V, VS, VINF, VPP, or VPR, have their event class and event tense set by the classifiers, those with POS tag N or NC have their event class set to OCCURRENCE, and those with any other POS tag have their event class set to STATE-I_STATE. The three temporal relation classifiers are applied to each pair of event triggers in the same sentence and to each trigger and time expression in the same sentence. The classifiers are also applied to each trigger independently to capture trigger–document creation time relations.
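The POS-based fallback for assigning an event class in application mode can be sketched as follows. The POS tags are those named in the text; `classify` stands in for the trained event class classifier.

```python
VERB_POS = {"V", "VS", "VINF", "VPP", "VPR"}
NOUN_POS = {"N", "NC"}

def assign_event_class(pos, classify):
    """For a token already identified as an event trigger: verbs get the
    classifier's decision, nouns default to OCCURRENCE, and anything
    else to STATE-I_STATE."""
    if pos in VERB_POS:
        return classify(pos)
    if pos in NOUN_POS:
        return "OCCURRENCE"
    return "STATE-I_STATE"
```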
5 Experiments
This section describes the experiments we conducted as part of our English (source) to French (target) temporal information extraction case study. Section 5.1 describes the production of the two temporal information extraction systems, TIE-man and TIE-proj. Section 5.2 describes the procedure used to evaluate and compare these systems.
5.1 Producing the temporal information extraction systems
TIE-proj was produced from a French training dataset whose annotations were generated through projection. TIE-man was produced from a different French training dataset whose annotations were generated manually. Next, we discuss how projection was used to produce the first training dataset. Following that, we discuss how both training datasets were used to produce TIE-proj and TIE-man.
5.1.1 Applying annotation projection
The News Commentary Corpus is a sentence-aligned English–French parallel corpus of news articles (Bojar et al. Reference Bojar, Buck, Callison-Burch, Federmann, Haddow, Koehn, Monz, Post, Soricut and Specia2013). We preprocessed the corpus (both languages) by adding whitespace around all occurrences of the characters “.”, “,”, “!”, and “?”, then tokenizing by whitespace. The resulting preprocessed corpus consisted of 183,251 sentence pairs, 5,147,190 French tokens, and 4,498,021 English tokens. We found mistakes in the sentence alignments by identifying sentence pairs with large differences in their numbers of tokens and then manually examining those pairs. Because of these sentence alignment mistakes, we employed a heuristic to aggressively filter them from the corpus. A document pair is dropped if either of the following holds:
The document pair contains a sentence pair where
\begin{equation}
\frac{\left|\textit{Number of English Tokens} - \textit{Number of French Tokens}\right|}{\max\left\{\textit{Number of English Tokens},\ \textit{Number of French Tokens}\right\}} > 0.5.
\end{equation}
The number of occurrences of “.” not preceded by a capital letter in the English document is not the same as the number of occurrences in the French document.
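The filtering heuristic above can be sketched as follows; the helper names and the use of Python's `re` module are illustrative, not the paper's implementation.

```python
import re

def tokenize(text):
    """Pad '.', ',', '!', '?' with whitespace, then split on whitespace,
    as in the preprocessing described above."""
    return re.sub(r'([.,!?])', r' \1 ', text).split()

def drop_document_pair(en_sents, fr_sents):
    """Return True if an aligned document pair should be filtered out.

    `en_sents` and `fr_sents` are aligned lists of raw sentence strings.
    """
    for en, fr in zip(en_sents, fr_sents):
        n_en, n_fr = len(tokenize(en)), len(tokenize(fr))
        # Condition 1: some sentence pair's token counts differ by more
        # than half the larger count.
        if abs(n_en - n_fr) / max(n_en, n_fr, 1) > 0.5:
            return True
    # Condition 2: counts of '.' not preceded by a capital letter differ
    # between the two documents.
    pattern = r'(?<![A-Z])\.'
    en_dots = sum(len(re.findall(pattern, s)) for s in en_sents)
    fr_dots = sum(len(re.findall(pattern, s)) for s in fr_sents)
    return en_dots != fr_dots
```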
We applied the annotation projection procedure described in Section 3 with ProjProbT = 0.9, EventPerT = 70, and TLinkPerT = 70. TipSem (Llorens, Saquete, and Navarro Reference Llorens, Saquete and Navarro2010) was used to automatically identify event triggers, time expressions, and temporal relations in the documents on the English side of the parallel corpus.Footnote g The temporal relation labels were collapsed down to AFTER, BEFORE, or OVERLAP to reduce sparsity. The documents on the French side were sentence-split and constituency parsed using version 3.6.0 of the Stanford CoreNLP Java library with model “frenchFactored.ser.gz” (Manning et al. Reference Manning, Surdeanu, Bauer, Finkel, Bethard and McClosky2014). Stems were generated for each French token using the Apache Lucene version 6.0.0 Java class “org.tartarus.snowball.ext.FrenchStemmer” (The Apache Software Foundation 2016).
The result was an annotated French corpus; see the “PROJECTION” columns of Table 1.
5.1.2 Training information extraction systems
The French TimeBank corpus consists of news articles manually annotated according to the ISO-TimeML standard (Bittar et al. Reference Bittar, Amsili, Denis and Danlos2011); see the “TIMEBANK” column of Table 1. We randomly split the French TimeBank into training (76 documents) and testing (25 documents) parts. We applied our pipeline, in training mode, to the training part to build TIE-man and to the projection-annotated French corpus to produce TIE-proj. The temporal expressions and their types were identified using HeidelTime version 2.1 with language setting FRENCH, document type setting NEWS, and POS tagger setting STANFORDPOSTAGGER (Strötgen and Gertz Reference Strötgen and Gertz2013).
5.2 Evaluating and comparing the temporal information extraction systems
We carried out the following procedure (trial) 20 times.
(1) Produce TIE-man and TIE-proj as described earlier, then apply each, using our pipeline in application mode, to the testing part of the manually annotated French corpus. The temporal expressions and their types are identified using HeidelTime version 2.1 with language setting FRENCH, document type setting NEWS, and POS tagger setting STANFORDPOSTAGGER (Strötgen and Gertz Reference Strötgen and Gertz2013).
(2) Generate the following types of scores for TIE-man and TIE-proj.
a. Accuracy of event trigger class and tense determination.
b. Precision, recall, and F-score of event trigger identification.
c. Precision, recall, and F-score of temporal relation identification for each of the three types of temporal relations: between event triggers (Event–Event) in the same sentence, between event triggers and document creation times (Event–DCT), and between event triggers and time expressions in the same sentence (Event–Time).
d. A label-collapsed version of precision, recall, and F-score of temporal relation identification, where the ground truth temporal relation labels are collapsed down to two: a relation is present, or no relation is present (NONE). The labels produced by the temporal information extraction system are collapsed in the same way immediately prior to scoring (but not during training).
We computed the average, over the 20 trials, of each type of score and the 0.95 confidence interval estimated by the central limit theorem.Footnote h These averages are reported in the next section.
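The per-trial averaging can be written as below; a normal-approximation (central limit theorem) interval with 1.96 as the 0.95 quantile factor, which is an assumption about the exact estimator used.

```python
import math

def mean_and_ci_halflength(scores, z=1.96):
    """Average and 0.95 confidence-interval half-length over per-trial
    scores, using the normal approximation from the central limit theorem."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with the (n - 1) denominator.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean, half
```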
The class and tense accuracy scores are computed with respect to the set of tokens that are identified as event triggers both in the ground truth and by the temporal information extraction system; specifically, the fraction of these tokens whose system-assigned class (respectively, tense) label matches the ground truth label. True positives (TPs), false positives (FPs), and false negatives (FNs) for 2b, 2c, and 2d are defined as follows.
Event triggers TPs, FPs, FNs:
A trigger identified by the system is a TP if it is also identified in the ground truth.
A trigger identified by the system is an FP if it is not identified in the ground truth.
A trigger identified in the ground truth is an FN if it is not identified by the system.
Event–Event temporal relations TPs, FPs, FNs:
A pair of same-sentence triggers identified by the system is a TP if both are also identified in the ground truth, and the system-assigned temporal relation label matches the ground truth label.
A pair of same-sentence triggers identified by the system is an FP if at least one is not identified in the ground truth, or both are identified in the ground truth but the system-assigned temporal relation label does not match the ground truth label.
A pair of same-sentence triggers identified in the ground truth is an FN if at least one is not identified by the system.
Event–DCT temporal relations TPs, FPs, FNs:
A trigger identified by the system is a TP if it is identified as a trigger in the ground truth and the system-assigned temporal relation label matches the ground truth label.
A trigger identified by the system is an FP if it is not identified as a trigger in the ground truth, or it is identified as a trigger in the ground truth, but the system-assigned temporal relation label does not match the ground truth label.
A trigger identified in the ground truth is an FN if it is not identified as a trigger by the system.
Event–Time temporal relations TPs, FPs, FNs:
A same-sentence trigger and time expression identified by the system is a TP if both hold:
◦ The trigger is identified in the ground truth and the time expression overlaps (i.e., shares at least one token in common) with a time expression identified in the ground truth,
◦ The system-assigned temporal relation label matches the ground truth label.
A same-sentence trigger and time expression identified by the system is an FP if either holds:
◦ The trigger is not identified in the ground truth or the time expression does not overlap with any time expression identified in the ground truth,
◦ The trigger is identified in the ground truth and the time expression overlaps with a time expression identified in the ground truth, but the system-assigned temporal relation label does not match the ground truth label.
A same-sentence trigger and time expression identified in the ground truth is an FN if the trigger is not identified in the ground truth or the time expression does not overlap with any time expression identified in the ground truth.
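For event trigger identification (score 2b), the definitions above reduce to set operations over token positions; a minimal sketch, with the function name and set-of-positions encoding assumed for illustration:

```python
def trigger_prf(system_triggers, gold_triggers):
    """Precision, recall, and F-score for event trigger identification.

    Both arguments are sets of token positions identified as triggers.
    Following the definitions above: TPs are system triggers also in the
    ground truth, FPs are system triggers not in the ground truth, and
    FNs are ground truth triggers missed by the system.
    """
    tp = len(system_triggers & gold_triggers)
    fp = len(system_triggers - gold_triggers)
    fn = len(gold_triggers - system_triggers)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```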
6 Results
In all cases, the 0.95 confidence interval half-length was <0.023 (usually much less). For brevity, only the averages are reported. Tables 2 and 3 show results regarding the event trigger and attribute identification part of the pipeline. In Table 2, “Verbs” indicates the method wherein all tokens with verb POSs are identified as event triggers.
In Table 3, “Manual Majority Label” indicates the method wherein all event triggers are assigned the class and tense labels that appeared on the majority of event triggers in the training data with annotations made manually, OCCURRENCE and PAST, respectively. “Proj Majority Label” indicates the method wherein all event triggers are assigned the class and tense labels that appeared on the majority of event triggers in the training data with annotations produced by projection, OCCURRENCE and PRESENT, respectively.
Table 4 shows results for the full pipeline; “AVG FLC” refers to the label-collapsed average F-scores. Entries containing * had values less than 0.04. Some observations can be highlighted from these tables.
TIE-man achieves average F-scores less than 0.35 in all cases, suggesting that the temporal information extraction problem is challenging even with manually annotated data.
TIE-proj achieves an average F-score 12.5% larger than TIE-man for temporal relations between event triggers and document creation times (and approximately the same score when temporal labels are collapsed).
TIE-proj achieves very low average F-scores for the other two types of temporal relations. However, when the temporal labels are collapsed, the TIE-proj scores increase dramatically. The cause of the low scores is addressed in the error analysis below.
TIE-proj achieves an average F-score 8.5% lower than TIE-man for event trigger identification, average accuracy 4.6% lower for trigger class identification, and average accuracy 1% higher for trigger tense identification.
6.1 Error analysis
To better understand the sources of errorFootnote j in the temporal information extraction systems, we modified the application mode of the pipeline so that manually identified event triggers and time expressions are used in place of those identified automatically. Table 5 shows results for the full pipeline when manually identified triggers and automatically identified time expressions were used (“MAN EVENTS, AUTO TIMES” column) and when manually identified triggers and time expressions were used (“MAN EVENTS, MAN TIMES” column). Table 6 shows results for the same situation but using collapsed labels in scoring.
Entries containing ** had values less than 0.06.
As seen in Tables 5 and 6, substantial improvements are observed when manually identified triggers are used (particularly for collapsed label scoring), with smaller additional gains when manually identified time expressions are also used. For example, consider TIE-proj identifying temporal relations between triggers and time expressions, scored with collapsed temporal labels (the last row of Table 6). When manually identified triggers and automatically identified time expressions were used, a 72% improvement in average F-score is observed for TIE-proj (an average F-score of 0.525 vs. 0.305). When manually identified triggers and time expressions were used, a 95% improvement in average F-score is observed (0.595 vs. 0.305).
Finally, we sought to uncover the reason for the very low average F-scores for TIE-proj for temporal relations between triggers and for temporal relations between triggers and time expressions (the * and ** entries in Tables 4 and 5). In the case of relations between triggers and time expressions, the low average F-scores are largely due to the fact that no relation annotations with an OVERLAP label were made via projection,Footnote k but 72% of the TimeBank relation annotations (from which the testing part is drawn) had an OVERLAP label (see Table 1).
In the case of relations between triggers, 9% of the relation annotations produced by projection had an OVERLAP label, while 56% of the TimeBank relation annotations had such a label. While this difference partially explains the low average F-scores for TIE-proj, we also examined the annotations produced by TipSem on the English side of the parallel corpus to further understand the cause. We manually examined 119 randomly selected temporal relation labels assigned to pairs of triggers. Eighty-seven percent had valid triggers and, of those, only 11% had a correct label. Clearly, TipSem labeling inaccuracies contributed to the low average F-scores.
7 Conclusions and future work
In this paper, we addressed the problem of building systems for temporal information extraction, which we define as the following three-step process: (i) identify the extents of temporal expressions, (ii) identify the event triggers and their associated class and tense, and (iii) identify the temporal relation between pairs of event triggers in the same sentence, between event triggers and times in the same sentence, and between event triggers and document creation times. We focused on the situation in which no manually annotated data are available in a target language, but a temporal information extraction system is available for a source language, as is a parallel corpus between the source and target languages (some additional assumptions are discussed in Section 1). We developed an annotation projection approach and examined its effectiveness in an English (source) to French (target) case study.
We found that, even using manually annotated data to build a temporal information extraction system, F-scores were relatively low (<0.35), suggesting that the problem is challenging. Our annotation projection approach performed well (relative to the system built from manually annotated data) on event trigger detection, on tense and class prediction, and on event–document creation time temporal relation prediction. However, it performed poorly on the other kinds of temporal relation prediction (see the error analysis in Section 6 for more details).
We suggest several directions for improving annotation projection for temporal information extraction.
Improve event trigger and attribute identification. When manually identified triggers (and their attributes) were used, average F-scores for TIE-proj improved substantially. This implies that substantial improvements in event trigger and attribute identification would translate to substantial improvements overall.
Improve temporal information extraction on the English side of the parallel corpus, specifically temporal relation label identification. We observed very low accuracy in TipSem’s temporal relation label identification for relations between event triggers.Footnote l These TipSem labeling inaccuracies contributed to the low average F-scores we observed for TIE-proj.
Employ a projection approach based on posterior regularization. Doing so would allow temporal information extraction systems to be trained using a more principled approach to mitigating noise than the heuristic filtering we applied. For other NLP problems, research has shown the benefits of posterior regularization (He, Gillenwater, and Taskar Reference He, Gillenwater and Taskar2013). However, to utilize posterior regularization, we need probabilities associated with the temporal annotations assigned to the English side of the parallel corpus, which TipSem does not provide.
Employ a projection approach based on cross-lingual neural network-based word embeddings. These approaches have the benefit of not needing word alignments, a significant source of noise.
Finally, an interesting additional direction for future work is to combine the EM-style, unsupervised learning approach of Mirroshandel and Ghassem-Sani (Reference Mirroshandel and Ghassem-Sani2012) with annotation projection.
Acknowledgements
Our MITRE colleagues John Prange, Rob Case, Nathan Giles, and Rod Holland provided valuable feedback while this research was in progress. Our MITRE colleague Lee Mickle improved our paper through careful editing. Professor Steven Bethard at the University of Arizona provided more details regarding his ClearTK-TimeML system for temporal information extraction.
Funding
This technical data deliverable was developed using contract funds under the terms of Basic Contract No. W15P7T-13-C-A802.