
Annotation projection for temporal information extraction

Published online by Cambridge University Press:  15 May 2019

Chris R. Giannella*
Affiliation:
The MITRE Corporation, 7515 Colshire Dr., McLean, VA 22102, USA
Ransom K. Winder
Affiliation:
The MITRE Corporation, 7515 Colshire Dr., McLean, VA 22102, USA
Joseph P. Jubinski
Affiliation:
The MITRE Corporation, 7515 Colshire Dr., McLean, VA 22102, USA
*
*Corresponding author. Email: cgiannella@mitre.org

Abstract

Approaches to building temporal information extraction systems typically rely on large, manually annotated corpora. Thus, porting these systems to new languages requires acquiring large corpora of manually annotated documents in the new languages. Acquiring such corpora is difficult owing to the complexity of temporal information extraction annotation. One strategy for addressing this difficulty is to reduce or eliminate the need for manually annotated corpora through annotation projection. This technique utilizes a temporal information extraction system for a source language (typically English) to automatically annotate the source language side of a parallel corpus. It then uses automatically generated word alignments to project the annotations, thereby creating noisily annotated target language training data. We developed an annotation projection technique for producing target language temporal information extraction systems. We carried out an English (source) to French (target) case study wherein we compared a French temporal information extraction system built using annotation projection with one built using a manually annotated French corpus. While annotation projection has been applied to building other kinds of Natural Language Processing tools (e.g., named entity recognizers), to our knowledge, this is the first paper examining annotation projection as applied to temporal information extraction where no manual corrections of the target language annotations were made. We found that F-scores were relatively low (<0.35) even for the system built from manually annotated data, suggesting that the problem is challenging regardless of how the training data are obtained.
Our annotation projection approach performed well (relative to the system built from manually annotated data) on some aspects of temporal information extraction (e.g., event–document creation time temporal relation prediction), but it performed poorly on other kinds of temporal relation prediction (e.g., event–event and event–time).

Type
Article
Copyright
© Cambridge University Press 2019 

1 Introduction

The automatic extraction of temporal information from unstructured text plays an important role in the distillation of knowledge from large text corpora. For example, in automatic question answering tasks, extracting temporal information is necessary for a system to answer questions regarding when a particular event occurred (Mirroshandel and Ghassem-Sani 2012). Similarly, in timeline construction tasks, extracting temporal information is necessary for arranging events in chronological order (Caselli et al. 2015; Laparra, Aldabe, and Rigau 2015).

To spur research in the development of systems for extracting temporal information, a competition has been held several times in conjunction with the International Workshop on Semantic Evaluation: TempEval-1 in 2007 (Verhagen et al. 2009), TempEval-2 in 2010 (Verhagen et al. 2010), and TempEval-3 in 2013 (UzZaman et al. 2013). The competition organizers defined a series of temporal information extraction tasks, gave the competitors corpora of manually annotated documents for several languages, and asked the competitors to develop systems addressing one or more of the tasks for one or more languages. The systems developed relied heavily on the manually annotated corpora. Thus, porting these systems to languages not considered in the competition requires acquiring large corpora of manually annotated documents in the target languages. Acquiring such corpora is difficult owing to the complexity of temporal information extraction annotation.

One strategy for addressing this difficulty is to reduce or eliminate the need for manually annotated corpora in the target languages. For other Natural Language Processing (NLP) problems (e.g., named entity recognition, part-of-speech tagging, parsing), researchers have adopted this strategy using a technique called annotation projection, which, generically put, proceeds as follows. Obtain a sentence-aligned parallel corpus between the source language (typically English) and the target language. Apply the tool for addressing the NLP problem of interest to the source language side of the parallel corpus. Use automatically generated word alignments to project annotation information to the target language side. Finally, train a system to address the NLP problem using the automatically annotated target language corpus (and, optionally, any available manually annotated target language documents).

We developed an annotation projection technique for producing an annotated, target language corpus and a pipeline that takes the annotated corpus and builds a system for temporal information extraction. We carried out an English (source) to French (target) case study by applying our annotation projection approach and pipeline to an English–French parallel corpus to build a French temporal information extraction system and comparing that to a system built by applying our pipeline to a manually annotated French corpus. It is important to note that our annotation projection technique for building temporal information extraction systems is not limited to English and French. It can be applied to any source and target language pair if the following resources are available.

  • A sentence-aligned, source and target language parallel corpus wherein each source language document has a document creation time.

  • Source language: a temporal information extraction system.

  • Target language: a tokenizer, sentence splitter, stemmer, constituency parser, and a temporal expression recognizer.

The next subsection describes in greater detail the temporal information extraction problem we address. To our knowledge, ours is the first paper examining annotation projection as applied to temporal information extraction. Section 2 discusses related work. Section 3 describes our annotation projection technique for producing an annotated, target language corpus that can be used to build a system to address the temporal information extraction problem defined in the next subsection. Section 4 discusses our pipeline for training a target language information extraction system from an annotated corpus. The pipeline applies unchanged to annotations produced via annotation projection or made manually. The section also discusses how the pipeline is applied to a new (test) document to assign annotations. Section 5 describes the experiments we performed as part of our English (source) to French (target) case study. The section also describes the data we used (the parallel and manually annotated corpora), the scoring metrics, and our experiments’ methodology. Section 6 discusses our results, including an error analysis. Section 7 briefly summarizes the paper and provides conclusions and some directions for future work.

1.1 Problem definition

The series of tasks defined for the TempEval-3 competition, denoted A, B, and C, form the basis of the temporal information extraction problem we address. Given a text document and the document creation time, solve the following tasks:

  • Identify the extents of temporal expressions. This is a simplified version of TempEval-3 task A, since temporal normalization is not required for temporal relation identification. We made this change because we do not address the problem of building a temporal normalizer in the target language.

  • Identify the event triggers and their associated class and tense. Like TempEval-3 task B, each event trigger consists of a single word. Unlike task B, aspect, polarity, and modality determination are not required. We made this change because of data annotation sparsity.

  • Identify the temporal relation between pairs of event triggers in the same sentence, between event triggers and times in the same sentence, and between event triggers and document creation times. Each pair is assigned one of four temporal labels: NO-RELATION, BEFORE, AFTER, or OVERLAP. This is a simplified version of TempEval-3 task C, since the BEFORE, AFTER, and OVERLAP labels resulted from merging some of the 14 labels used in task C. However, unlike task C, temporal relations between main event triggers in consecutive sentences are not required. We made these simplifications because of data annotation sparsity (the label collapse) and simplicity (the restriction to within-sentence relations).
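The three tasks above operate over a small, fixed set of annotation types. A minimal sketch of the underlying data model, using hypothetical class and field names of our own (the paper does not prescribe an implementation), might look like:

```python
from dataclasses import dataclass
from typing import Optional

# The four temporal labels from the (simplified) task C definition.
TEMPORAL_LABELS = {"NO-RELATION", "BEFORE", "AFTER", "OVERLAP"}

@dataclass
class TimeExpression:
    """Task A (simplified): only the extent is identified, no normalization."""
    start: int  # token offsets of the extent
    end: int

@dataclass
class EventTrigger:
    """Task B (simplified): a single-word trigger with class and tense only."""
    token_index: int
    event_class: str  # e.g. "OCCURRENCE"
    tense: str        # e.g. "PAST"

@dataclass
class TemporalRelation:
    """Task C (simplified): a labeled pair; target None denotes the
    document creation time (DCT)."""
    source: EventTrigger
    target: Optional[object]  # EventTrigger, TimeExpression, or None for DCT
    label: str

    def __post_init__(self):
        if self.label not in TEMPORAL_LABELS:
            raise ValueError(f"unknown temporal label: {self.label}")
```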

Figure 1 depicts an example of temporal information extraction applied to a document consisting of a single sentence.

Figure 1. An example of temporal information extraction. Two event trigger words, “exploded” and “said,” and one time expression other than the document creation time, “August 7, 1998,” have been identified, along with four temporal relations.

2 Related work

Related work is organized into several categories.

  • Work that is most closely related to our research. This includes efforts to apply annotation projection and closely related ideas to temporal information extraction and other problems. It also includes efforts to apply unsupervised learning to temporal information extraction.

  • Work that addresses temporal expression recognition and normalization, namely, identifying temporal expressions in free text and deducing a normalized version of the time expressed (e.g., “the third day of December in 2016” could be normalized to 2016-12-03).

  • Work that addresses local temporal relation classification using fully supervised learning techniques developed for manually annotated training data. Temporal relation classification involves the assignment of temporal labels (e.g., BEFORE, AFTER) to pairs of pre-identified event triggers, time expressions, or the document creation time. “Local” refers to the fact that each pair is classified independently of all others.

  • Work that addresses variations on temporal information extraction. This includes work that addresses broader problems than temporal relation classification. For example, works that perform “end-to-end” extraction (times, events, temporal relations) from unstructured text are included here.

2.1 Most closely related work

Spreyer and Frank (2008) applied annotation projection to build, for German, a temporal expression recognizer, an event trigger recognizer, and a subordinate relation classifier. Subordinate relationships hold between two event triggers and indicate that one event instance subordinates another in a way specified by one of the following labels: INTENSIONAL, EVIDENTIAL, NEG_EVIDENTIAL, FACTIVE, COUNTER_FACTIVE, and CONDITIONAL. Spreyer and Frank’s main idea was to use various heuristics to filter projections, resulting in less noisy training data. We employ the same idea with a similar, but not identical, set of filtering heuristics. The key difference between their work and ours is that we focus on different kinds of relationships: temporal relations between pairs of event triggers and between event triggers and time expressions.

Jarzebowski and Przepiorkowski (2012) projected events and temporal relations across an English–Polish parallel corpus. However, they manually corrected the Polish temporal relation labels before training a Polish temporal relation extraction system. Minard et al. (2016) manually annotated event triggers and time expressions on both sides of English–Italian, English–Spanish, and English–Dutch parallel corpora. They manually aligned the triggers and time expressions, applied an English temporal relation extractor, and projected the temporal relations to the target language. Forascu and Tufis (2012) produced a Romanian corpus with time expressions, event triggers, and temporal relation labels by manually translating an annotated English corpus. Costa and Branco (2010) produced an annotated Portuguese corpus in similar fashion, except they used automatic translation followed by manual correction.

Annotation projection has been applied to a variety of NLP problems not directly related to temporal information extraction. The original papers on the subject were published by Yarowsky and Ngai (2001) and Yarowsky, Ngai, and Wicentowski (2001). Therein, annotation projection was used to build target language part-of-speech (POS) taggers, base-phrase chunkers, and named entity recognizers. A variety of more recent approaches have adopted the strategy of not projecting annotations directly (after filtering), as we do, but rather projecting other forms of information which, ideally, dampen the impact of noise. One class of more recent approaches has focused on applying a technique from machine learning, posterior regularization, to improve annotation projection for producing dependency parsers (Ganchev, Gillenwater, and Taskar 2009), POS taggers (Das and Petrov 2011; Ganchev and Das 2013; He, Gillenwater, and Taskar 2013), and named entity recognizers (Wang and Manning 2014). These approaches project constraint information which influences the target language learner through regularization. Similarly, Tackstrom et al. (2013) project type constraints and develop a method for training POS taggers from the constraint lattice. Another class of approaches has focused on neural-network-based embeddings that span multiple languages (Gouws and Sogaard 2015; Luong, Pham, and Manning 2015; Zennaki, Semmar, and Besacier 2016). These approaches have the major benefit of not requiring word alignments, a significant source of noise (Gouws and Sogaard’s approach does not even require a parallel corpus).

Mirroshandel, Ghassem-Sani, and Khayyamian (2011) developed an unsupervised approach to temporal relation classification between event triggers. They assumed that the event triggers are pre-identified in an otherwise unannotated corpus. They developed an Expectation Maximization (EM) style algorithm for inferring temporal label probabilities on event trigger pairs.

2.2 Temporal expression recognition and normalization

Strötgen and Gertz (2016) provide a detailed and comprehensive discussion of temporal expression recognition and normalization techniques. Early work includes that of Jang, Baldwin, and Mani (2004), who developed a temporal expression recognizer and normalizer for Korean. Subsequently, the Database Systems Research Group at Heidelberg University in Germany developed a temporal expression recognizer and normalizer for English called HeidelTime (Strötgen and Gertz 2010). It was designed to be easily adapted to new languages (Strötgen and Gertz 2013). Its most important design feature is the separation between the patterns and rules specifying temporal extraction and normalization information and the algorithms for applying those patterns and rules to unstructured text. The algorithms are designed to be language-independent. HeidelTime has currently been adapted to 13 languages. For example, Manfredi, Strötgen, Zell, and Gertz (2014), Moriceau and Tannier (2014), and Skukan, Glavaš, and Šnajder (2014) manually adapted HeidelTime to Italian, French, and Croatian, respectively. Furthermore, Strötgen and Gertz (2015) show how HeidelTime can be automatically extended to more than 200 languages (with some loss of fidelity). Given the availability of HeidelTime in many languages and its adaptability to new ones, we do not use annotation projection for time expression identification. Instead, we use HeidelTime.

2.3 Local temporal relation classification

Chambers, Wang, and Jurafsky (2007) built a two-stage classifier for predicting temporal labels between pairs of English event triggers. The first stage predicts attributes of event triggers (tense, aspect, class, polarity, modality), and the second stage predicts the temporal labels using the predicted event attributes as features. Torbati et al. (2013) addressed the same problem for Persian as well as English. For classification, they used a Support Vector Machine with combinations of simple and complex (e.g., subsequence and tree) kernels. Mirroshandel et al. (2011) used a bootstrapping approach to improve a fully supervised event trigger–event trigger temporal relation classification approach. Their fully supervised approach was followed by a bootstrapping step based on topic similarity and a “one type of temporal relation per discourse” hypothesis.

D’Souza and Ng (2013) built a classifier for predicting temporal labels between pairs of English event triggers, pairs of event triggers and time expressions, and pairs of event triggers and document creation times. They used a single classifier for all three types of relations (event trigger–event trigger, event trigger–time expression, event trigger–document creation time) and many classifier features involving complex linguistic information (e.g., WordNet relations). Mirza and Tonelli (2014) addressed the same problem but used a simpler set of features and examined the impact of each type of feature used.

2.4 Variations of temporal information extraction

Several researchers have addressed temporal relation classification, recognizing that dependencies exist between relations. For example, if event e1 occurs before e2, which overlaps e3, then e3 cannot occur before e1. A variety of methods have been employed to take advantage of these dependencies. Yoshikawa et al. (2009) and Fairholm (2014) used Markov Logic Networks to account for dependency information in a probabilistic setting (e.g., an event with future tense should not occur before the document creation time). Do, Lu, and Roth (2012) adopted an approach that jointly optimizes individual relation classification decisions along with global constraints. Chambers et al. (2014) developed an approach that applies individual relation classification decisions in a cascading fashion, with earlier classifications affecting later ones through transitive closure. Laokulrat, Miwa, and Tsuruoka (2015) developed an approach based on stacking to take advantage of dependencies between relation classifications deemed close in a time-graph.

Bethard (2013) developed a system, ClearTK-TimeML, for addressing temporal information extraction as defined by the full series of tasks in TempEval-3. His approach was based on the use of simple features and only local relation classifications. We based our temporal relation extraction pipeline on his, as described in Section 4. Jeong et al. (2015) and Mirza and Minard (2014) developed systems for addressing temporal information extraction in Korean and Italian, respectively. Ling and Weld (2010) and Glavas and Snajder (2015) addressed a temporal information extraction problem (in English) similar to, but not the same as, that defined in TempEval-3.

3 Annotation projection for producing a temporal information extraction annotated corpus

This section describes our annotation projection technique for producing an annotated, target language corpus that can be used to build a system to address the temporal information extraction problem defined in Section 1.1 (see Figure 2). The next section describes our pipeline for training a target language information extraction system from an annotated corpus.

Figure 2. The annotation projection procedure.

As described earlier, our annotation projection approach starts with a sentence-aligned parallel corpus of source and target language documents wherein each source document has an assigned document creation time. Like Spreyer and Frank (2008) and Jarzebowski and Przepiorkowski (2012), our approach uses a set of heuristics for aggressive filtering to reduce noise. Key to our heuristics is a token alignment filter that uses a user-defined threshold, ProjProbT (Projection Probability Threshold). Intuitively, a source language token is not aligned with any target language token if the maximum alignment probability between the source language token and all target language tokens in the paired sentence does not exceed ProjProbT. The alignments are the basis for the following projection filtering heuristics.

  • If the token in a source language event trigger does not align with any target language token, then the trigger is not projected.

  • If the tokens in a source language time expression do not align with a non-empty, contiguous set of target language tokens, then the time expression is not projected.

  • If a source language temporal relation does not have both of its entities (event trigger and time expression) projected, then the temporal relation is not projected.

Finally, a filter is applied to drop entire target language documents if enough projections were filtered or if some projections conflict. This filter uses two user-defined thresholds, EventPerT and TLinkPerT (Event Percentage Threshold and TLink Percentage Threshold). Our entire annotation projection procedure proceeds as follows.

  (1) Project the document creation times and preprocess the parallel corpus. For each source language document in the parallel corpus, assign the document creation time to the associated target language document. In all documents on both sides of the parallel corpus, add whitespace around all occurrences of the following characters: ‘.’, ‘,’, ‘!’, ‘?’. The remaining steps assume tokenization is defined by whitespace.

  (2) Automatically annotate the source language documents. Apply a temporal information extraction system to automatically identify event triggers, time expressions, and temporal relations in the source language documents.

  (3) Generate a token alignment function for each sentence pair and project the annotations. Apply the Berkeley Aligner (Liang, Taskar, and Klein 2006) to generate an alignment probability (possibly zero) between all pairs of source and target language tokens for each sentence pair. For each source–target language sentence pair, define a function, Align, from source language tokens to target language tokens as follows.

    (a) Let e[1],…,e[n] and f[1],…,f[m] denote the tokens in the source and target language sentence, respectively. Let P(e[i],f[j]) denote the alignment probability assigned by the Berkeley Aligner to the source language token e[i] and target language token f[j].

      \begin{equation} Align(e[i]) \stackrel{\mathrm{def}}{=} \begin{cases} \operatorname{argmax}_{f[j],\, 1 \le j \le m} P(e[i], f[j]) & \text{if } \exists j \text{ such that } P(e[i], f[j]) > ProjProbT,\\ \text{undefined} & \text{otherwise.} \end{cases} \end{equation}
    (b) For all event triggers, e[i], identified in the source language sentence:

      (i) if Align(e[i]) is undefined, then the event annotation is not projected; else identify the target language token Align(e[i]) as an event trigger and assign it the same event attributes (aspect, class, tense, modality, polarity) as e[i] was assigned.

    (c) For all time expressions, {e[i], e[i+1],…, e[i+τ−1]}, identified in the source language sentence:

      (i) if the set {Align(e[i+k]): 0 ≤ k < τ and Align(e[i+k]) is defined} is empty or, when sorted increasing, does not form a consecutive sequence of integers, then the time expression is not projected; else identify that set of target language tokens as a time expression and assign it the same temporal type that the source language time expression was assigned. Note that the normalized time is not used by the pipeline, so it is not projected.

    (d) For all temporal relations identified involving the sentence:

      (i) between an event trigger in the sentence and the document creation time, if the event trigger is projected, then the temporal relation label is assigned to the target language event trigger and document creation time;

      (ii) between two event triggers in the sentence, if the triggers are both projected, then the temporal relation label is assigned to the target language event triggers;

      (iii) between an event trigger and a time expression in the sentence, if the trigger and time expression are both projected, then the temporal relation label is assigned to the target language event trigger and time expression.

  (4) Filter target language documents. Drop a target language document if any of the following hold.

    (a) Less than EventPerT percent of the event triggers in the associated source language document were projected to the target language document.

    (b) Less than TLinkPerT percent of the temporal relations in the associated source language document were projected to the target language document.

    (c) Projections conflict: a pair of projected event triggers share the same token, a projected event trigger’s token is among the tokens of a projected time expression, or a pair of projected time expressions have overlapping tokens.
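Steps (3) and (4) above can be sketched in code. The function names and input shapes here (a per-source-token dictionary of alignment probabilities, and pre-computed projection counts) are our own illustrative assumptions, not the authors' implementation:

```python
def align(i, probs, proj_prob_t):
    """Step 3(a): map source token i to the target token with the highest
    alignment probability, or None if no probability exceeds ProjProbT."""
    best_j, best_p = None, 0.0
    for j, p in probs.get(i, {}).items():
        if p > best_p:
            best_j, best_p = j, p
    return best_j if best_p > proj_prob_t else None

def project_time_expression(token_ids, probs, proj_prob_t):
    """Step 3(c): project a multi-token time expression only if its aligned
    target tokens form a non-empty, contiguous span; otherwise return None."""
    aligned = sorted({align(i, probs, proj_prob_t) for i in token_ids} - {None})
    if not aligned:
        return None
    if aligned != list(range(aligned[0], aligned[-1] + 1)):
        return None  # aligned tokens are not contiguous
    return aligned

def keep_document(n_events_src, n_events_proj, n_tlinks_src, n_tlinks_proj,
                  event_per_t, tlink_per_t):
    """Steps 4(a)-(b): keep the document only if enough event triggers and
    temporal relations (TLinks) survived projection."""
    event_pct = 100.0 * n_events_proj / n_events_src if n_events_src else 100.0
    tlink_pct = 100.0 * n_tlinks_proj / n_tlinks_src if n_tlinks_src else 100.0
    return event_pct >= event_per_t and tlink_pct >= tlink_per_t
```

The conflict check of step 4(c) would additionally scan the projected annotations for shared or overlapping tokens before keeping a document.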

4 Our temporal information extraction pipeline

This section describes our pipeline for (a) training a target language temporal information extraction system from an annotated corpus, and (b) applying the system to a new corpus of target language documents. The pipeline operates in two different modes: training mode and application mode (see Figure 3).

Figure 3. The temporal information extraction pipelines. TIE-man and TIE-proj denote the temporal information extraction systems that result from the training pipeline.

4.1 Training mode

The pipeline is provided an annotated corpus of target language documents (produced by projection or manually) and produces a temporal information extraction system, essentially a collection of classifiers. The annotations are expected to include time expression markers, event trigger markers with values for attributes tense and class, and temporal relation markers (with a temporal label, e.g., BEFORE) between pairs of event triggers, event triggers and document creation times, and event triggers and time expressions.

The documents are first preprocessed: tokenized, sentence-split, constituency parsed, and stems generated for each token. Next, using the MALLET version 2.0.7 Java library (McCallum 2002), a collection of Laplace-prior logistic regression classifiers is trained from the annotations and the preprocessed documents. The prior variance parameter is selected from the set $\{\sqrt{10^{(i-4)}} : i = 0, 1, \ldots, 6\}$, as described in Genkin, Lewis, and Madigan (2007). The selection is made to maximize the five-fold cross-validation F-score.
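The prior variance grid is small enough to enumerate directly. A sketch of the selection loop, with a caller-supplied cross-validation scorer standing in for MALLET's trainer (the function name is our own):

```python
import math

# Candidate prior variances: sqrt(10^(i-4)) for i = 0..6,
# i.e. 0.01, ~0.0316, 0.1, ~0.316, 1.0, ~3.16, 10.0.
VARIANCE_GRID = [math.sqrt(10 ** (i - 4)) for i in range(7)]

def select_prior_variance(cv_f_score, grid=VARIANCE_GRID):
    """Pick the variance whose five-fold cross-validation F-score is
    highest. `cv_f_score` is any callable mapping a variance to an
    F-score; in the pipeline it would wrap MALLET's Laplace-prior
    logistic regression trainer."""
    return max(grid, key=cv_f_score)
```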

The features for the classifiers are based on those used in Bethard (2013), D’Souza and Ng (2013), and Mirza and Tonelli (2014).

Event trigger identification

The classifier assigns to a token one of two labels indicating whether or not the token is an event trigger. A training instance is created for each token with POS V, VS, VINF, VPP, VPR, N, or NC. The instance consists of the following features: all token unigrams three to the left through three to the right of the current token; the stem of the current token; the POS of the current token; the POS bigrams including the token to the left and right of the current token; the grandparent in the parse tree of the current token; the path from the great-grandparent to the root; and the leftmost and rightmost leaves of the grandparent.
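As a concrete illustration, the token-level portion of this feature set (the window unigrams, stem, POS, and POS bigrams) could be assembled as below; the parse-tree features are omitted, and the feature names are our own:

```python
def trigger_features(tokens, stems, pos_tags, i):
    """Token-level features for event trigger identification at position i:
    unigrams in a +/-3 window, the stem and POS of the token, and the POS
    bigrams with the neighboring tokens. Parse-tree features (grandparent,
    path to root, leftmost/rightmost leaves) are omitted in this sketch."""
    feats = {}
    for k in range(-3, 4):  # unigrams three to the left through three to the right
        j = i + k
        if 0 <= j < len(tokens):
            feats[f"unigram[{k}]"] = tokens[j]
    feats["stem"] = stems[i]
    feats["pos"] = pos_tags[i]
    if i > 0:
        feats["pos_bigram_left"] = pos_tags[i - 1] + "_" + pos_tags[i]
    if i + 1 < len(tokens):
        feats["pos_bigram_right"] = pos_tags[i] + "_" + pos_tags[i + 1]
    return feats
```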

When training over the corpus with annotations produced using annotation projection, the number of instances labeled as not event triggers is typically much larger than the number labeled as triggers. Under-sampling is applied to mitigate this class imbalance: each instance labeled as not an event trigger is dropped with probability 0.25.

When training over either corpus, a classification threshold probability is set using five-fold cross-validation. When the classifier is applied to a token, if the event trigger label probability exceeds the threshold, the token is identified as an event trigger. The threshold is set as follows. For each fold: train a classifier on the union of the other folds; calculate the maximum F-score over the fold with respect to all possible classification probability thresholds. Set the final classification threshold to the value that produced the maximum F-score over all the folds.
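The threshold search described above reduces to sweeping candidate cutoffs and scoring each. A sketch for a single scored fold (the per-fold bookkeeping is simplified away, and the function name is our own):

```python
def best_threshold(labels, probs):
    """Return (threshold, F-score) maximizing the F-score of
    'is event trigger' predictions, sweeping every predicted probability
    as a candidate cutoff. `labels` are gold booleans; `probs` are the
    classifier's trigger probabilities. In the pipeline this is run per
    cross-validation fold and the best-scoring threshold overall is kept."""
    def f_score(thresh):
        tp = sum(1 for y, p in zip(labels, probs) if y and p > thresh)
        fp = sum(1 for y, p in zip(labels, probs) if not y and p > thresh)
        fn = sum(1 for y, p in zip(labels, probs) if y and p <= thresh)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)
    candidates = sorted(set(probs) | {0.0})
    return max(((t, f_score(t)) for t in candidates), key=lambda x: x[1])
```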

The remaining classifiers involve more than two labels and do not use a classification threshold probability. They assign the label with the maximum probability.

Event class classifier

The classifier assigns to event triggers one of the following labels: OCCURRENCE, STATE-ISTATE, ASPECTUAL, IACTION, or PERCEPTION-REPORTING. A training instance is created for each token annotated as an event trigger and with verb POS. The instance consists of the following features: the POS and stem of the current token.

Event tense classifier

The classifier assigns to a token identified as an event trigger one of the following labels: PAST, PRESENT, FUTURE, or IMPERFECT-FUTURE-NONE. A training instance is created for each token annotated as an event trigger and with verb POS. The instance consists of the following features: the POS of the current token, the last two characters of the current token, and all prepositions and adverbs from three to the left through the current token.

The temporal relation classifiers assign one of the labels NONE, AFTER, BEFORE, or OVERLAP to pairs of event triggers in the same sentence, triggers and document creation times, and triggers and time expressions in the same sentence. A separate classifier is trained for each of these three types of temporal relation.
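Candidate generation for these three classifiers reduces to enumerating within-sentence pairs plus one trigger–document-creation-time pair per trigger. A sketch under assumed input shapes of our own choosing:

```python
from itertools import combinations

def relation_candidates(sentences):
    """Enumerate classification instances for the three temporal relation
    classifiers. `sentences` is a list of (triggers, time_expressions)
    pairs, each element a list of identifiers. Returns the
    (trigger-trigger, trigger-DCT, trigger-time) candidate lists; each
    candidate is then labeled NONE, AFTER, BEFORE, or OVERLAP."""
    ee, edct, et = [], [], []
    for triggers, times in sentences:
        ee.extend(combinations(triggers, 2))        # event-event, same sentence
        edct.extend((t, "DCT") for t in triggers)   # event-document creation time
        et.extend((t, x) for t in triggers for x in times)  # event-time, same sentence
    return ee, edct, et
```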

Temporal relation between event triggers classifier

A training instance is created for each pair of event triggers in the same sentence that consists of the following features: the class label for each trigger and true/false if the labels are equal; the same for the tense labels; if the left trigger has a PP ancestor in the parse tree, the leftmost leaf whose POS is P and that has the same PP ancestor (dominating preposition); the same for the right trigger; the concatenation of the previous two features; if the left trigger has a VN or VPinf or VPpart ancestor, the POS of the leftmost token that has the same ancestor (dominating verb phrase POS); the same for the right trigger; the concatenation of the previous two features; the tokens and stems for the left and right triggers; all tokens and stems between the triggers; the POS of the left and right triggers and true/false if the POSs are equal; the number of triggers or time expressions between the triggers; all leaves of the grandparent of the left trigger; the same for the right trigger; and the path between the grandparents of the triggers.

When training over the corpus with annotations made manually, under-sampling is applied such that all instances with label NONE are dropped with probability 0.5.

Temporal relation between event trigger and document creation time classifier

A training instance is created for each event trigger that consists of the following features: the class and tense labels for the trigger; the token, stem, and POS of the trigger; the grandparent and great-grandparent of the trigger in the parse tree; the dominating preposition of the trigger (if one exists); and the dominating verb phrase POS of the trigger (if one exists).

When training over the corpus with annotations made manually, under-sampling is applied such that all instances with label NONE are dropped with probability 0.3.

Temporal relation between event trigger and time expression classifier

A training instance is created for each event trigger and time expression in the same sentence that consists of the following features: the same features as described for triggers and document creation times; for each token in the time expression, the same features as described for triggers and document creation times except the class and tense labels; the concatenation of the dominating preposition of the trigger (if one exists) and the dominating preposition of each token in the time expression (if one exists); true/false if the trigger is to the left of the time expression; the temporal type of the time expression; true/false if the POS of the trigger matches one of the POSs in the time expression; the verbs and prepositions among the five tokens to the right of the trigger; and the number of triggers or time expressions between the trigger and time expression.

4.2 Application mode

The pipeline takes a corpus of target language documents, along with their creation times, and produces annotations as described earlier. The documents are first preprocessed as described for the training mode. Temporal expressions and their types in the documents are identified. The event trigger classifier is applied to each token with POS tag V, VS, VINF, VPP, VPR, N, or NC. For tokens identified as triggers: those with POS tag V, VS, VINF, VPP, or VPR have their event class and event tense set by the classifiers; those with POS tag N or NC have their event class set to OCCURRENCE; and those with any other POS tag have their event class set to STATE-I_STATE. The three temporal relation classifiers are applied to each pair of event triggers in the same sentence and to each trigger and time expression in the same sentence. The classifiers are also applied to each trigger independently to capture trigger–document creation time relations.
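The POS-based dispatch for event class and tense can be sketched as below (the classifier arguments are illustrative stand-ins for the trained models):

```python
VERB_POS = {"V", "VS", "VINF", "VPP", "VPR"}
NOUN_POS = {"N", "NC"}

def assign_event_attributes(token, pos, class_clf, tense_clf):
    """Return (event class, event tense) for an identified trigger,
    following the POS-based rules described above."""
    if pos in VERB_POS:
        return class_clf(token), tense_clf(token)
    if pos in NOUN_POS:
        return "OCCURRENCE", None  # nominal triggers: fixed class, no tense
    return "STATE-I_STATE", None   # any other POS: fixed class, no tense
```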

5 Experiments

This section describes the experiments we conducted as part of our English (source) to French (target) temporal information extraction case study. Section 5.1 describes the production of the two temporal information extraction systems, TIE-man and TIE-proj. Section 5.2 describes the procedure used to evaluate and compare these systems.

5.1 Producing the temporal information extraction systems

TIE-proj was produced from a French training dataset whose annotations were generated through projection. TIE-man was produced from a different French training dataset whose annotations were generated manually. Next, we discuss how projection was used to produce the first training dataset. Following that, we discuss how both training datasets were used to produce TIE-proj and TIE-man.

5.1.1 Applying annotation projection

The News Commentary Corpus is a sentence-aligned English–French parallel corpus of news articles (Bojar et al. 2013). We preprocessed the corpus (both languages) by adding whitespace around all occurrences of the characters “.”, “,”, “!”, and “?”, then tokenizing on whitespace. The resulting preprocessed corpus consisted of 183,251 sentence pairs, 5,147,190 French tokens, and 4,498,021 English tokens. We found mistakes in the sentence alignments by identifying sentence pairs with large differences between their numbers of tokens, then manually examining the pairs. Because of these sentence alignment mistakes, we employed a heuristic to aggressively filter them from the corpus. A document pair is dropped if either of the following holds:

  • The document pair contains a sentence pair where

    \begin{equation} \frac{\left| \text{Number of English Tokens} - \text{Number of French Tokens} \right|}{\max\left\{ \text{Number of English Tokens},\ \text{Number of French Tokens} \right\}} > 0.5. \end{equation}
  • The number of occurrences of “.” not preceded by a capital letter in the English document is not the same as the number of occurrences in the French document.

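A sketch of this filtering heuristic, assuming each document is represented as a list of token lists (one per sentence); the function names are illustrative:

```python
import re

def keep_document_pair(en_sents, fr_sents):
    """Return False if the aligned document pair trips either filter rule."""
    # Rule 1: a sentence pair whose token counts differ by more than half
    # the longer sentence's length.
    for en, fr in zip(en_sents, fr_sents):
        longer = max(len(en), len(fr))
        if longer and abs(len(en) - len(fr)) / longer > 0.5:
            return False
    # Rule 2: differing counts of "." not preceded by a capital letter.
    def dot_count(sents):
        text = " ".join(" ".join(s) for s in sents)
        return len(re.findall(r"(?<![A-Z])\.", text))
    return dot_count(en_sents) == dot_count(fr_sents)
```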
We applied the annotation projection procedure described in Section 3 with ProjProbT = 0.9, EventPerT = 70, and TLinkPerT = 70. TipSem (Llorens, Saquete, and Navarro 2010) was used to automatically identify event triggers, time expressions, and temporal relations in the documents on the English side of the parallel corpus.Footnote g The temporal relation labels were collapsed down to AFTER, BEFORE, or OVERLAP to reduce sparsity. The documents on the French side were sentence-split and constituency parsed using version 3.6.0 of the Stanford CoreNLP Java library with model “frenchFactored.ser.gz” (Manning et al. 2014). Stems were generated for each French token using the Apache Lucene version 6.0.0 Java class “org.tartarus.snowball.ext.FrenchStemmer” (The Apache Software Foundation 2016).

The result was an annotated French corpus; see the “PROJECTION” columns of Table 1.

Table 1. Statistics regarding the French corpora

5.1.2 Training information extraction systems

The French TimeBank corpus consists of news articles manually annotated according to the ISO-TimeML standards (Bittar et al. 2011); see the “TIMEBANK” column of Table 1. We randomly split the French TimeBank into training (76 documents) and testing (25 documents) parts. We applied our pipeline, in training mode, to the training part to build TIE-man and to the projection-annotated French corpus to produce TIE-proj. The temporal expressions and their types were identified using HeidelTime version 2.1 with language setting FRENCH, document type setting NEWS, and POS tagger setting STANFORDPOSTAGGER (Strötgen and Gertz 2013).

5.2 Evaluating and comparing the temporal information extraction systems

We carried out the following procedure (trial) 20 times.

  1. (1) Produce TIE-man and TIE-proj as described earlier, then apply each, using our pipeline in application mode, to the testing part of the manually annotated French corpus. The temporal expressions and their types are identified using HeidelTime version 2.1 with language setting FRENCH, document type setting NEWS, and POS tagger setting STANFORDPOSTAGGER (Strötgen and Gertz 2013).

  2. (2) Generate the following types of scores for TIE-man and TIE-proj.

    1. a. Accuracy of event trigger class and tense determination.

    2. b. Precision, recall, and F-score of event trigger identification.

    3. c. Precision, recall, and F-score of temporal relation identification for each of the three types of temporal relations: between event triggers (Event–Event) in the same sentence, between event triggers and document creation times (Event–DCT), and between event triggers and time expressions in the same sentence (Event–Time).

    4. d. A label collapsed version of precision, recall, and F-score of temporal relation identification where the ground truth temporal relation labels are collapsed down to two, indicating that a relation is present or not (NONE). The labels produced by the temporal information extraction system are collapsed in the same way immediately prior to scoring (but not during training).

We computed the average, over the 20 trials, of each type of score and the 0.95 confidence interval estimated by the central limit theorem.Footnote h These averages are reported in the next section.
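The per-trial averaging and CLT-based interval estimate can be computed as below (a sketch; z = 1.96 for a 0.95 interval):

```python
import math
import statistics

def mean_and_ci_half(scores, z=1.96):
    """Average of the trial scores and the 0.95 confidence interval
    half-length, z * s / sqrt(n), from the central limit theorem."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, half
```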

The class and tense accuracy scores are computed with respect to the set of tokens that are identified as event triggers in the ground truth and identified as event triggers by the temporal information extraction system—specifically, the fraction of this set of tokens whose system-assigned class and tense label, respectively, match its ground truth label. True positives (TPs), false positives (FPs), and false negatives (FNs) for 2b, 2c, and 2d are defined as follows.

Event triggers TPs, FPs, FNs:

  • A trigger identified by the system is a TP if it is also identified in the ground truth.

  • A trigger identified by the system is an FP if it is not identified in the ground truth.

  • A trigger identified in the ground truth is an FN if it is not identified by the system.
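These trigger definitions reduce to set operations over trigger positions; a minimal sketch (the position representation is illustrative):

```python
def trigger_prf(system, gold):
    """Precision, recall, and F-score for event trigger identification,
    where `system` and `gold` are sets of trigger positions."""
    tp = len(system & gold)   # identified by both
    fp = len(system - gold)   # identified only by the system
    fn = len(gold - system)   # identified only in the ground truth
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```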

Event–Event temporal relations TPs, FPs, FNs:

  • A pair of same-sentence triggers identified by the system is a TP if both are also identified in the ground truth, and the system-assigned temporal relation label matches the ground truth label.

  • A pair of same-sentence triggers identified by the system is an FP if at least one is not identified in the ground truth, or both are identified in the ground truth but the system-assigned temporal relation label does not match the ground truth label.

  • A pair of same-sentence triggers identified in the ground truth is an FN if at least one is not identified by the system.

Event–DCT temporal relations TPs, FPs, FNs:

  • A trigger identified by the system is a TP if it is identified as a trigger in the ground truth and the system-assigned temporal relation label matches the ground truth label.

  • A trigger identified by the system is an FP if it is not identified as a trigger in the ground truth, or it is identified as a trigger in the ground truth, but the system-assigned temporal relation label does not match the ground truth label.

  • A trigger identified in the ground truth is an FN if it is not identified as a trigger by the system.

Event–Time temporal relations TPs, FPs, FNs:

  • A same-sentence trigger and time expression identified by the system is a TP if both hold:

    • The trigger is identified in the ground truth and the time expression overlaps (i.e., shares at least one token in common) with a time expression identified in the ground truth,

    • The system-assigned temporal relation label matches the ground truth label.

  • A same-sentence trigger and time expression identified by the system is an FP if either holds:

    • The trigger is not identified in the ground truth or the time expression does not overlap with any time expression identified in the ground truth,

    • The trigger is identified in the ground truth and the time expression overlaps with a time expression identified in the ground truth, but the system-assigned temporal relation label does not match the ground truth label.

  • A same-sentence trigger and time expression identified in the ground truth is an FN if the trigger is not identified in the ground truth or the time expression does not overlap with any time expression identified in the ground truth.

6 Results

In all cases, the 0.95 confidence interval half-length was <0.023 (usually much less). For brevity, only the averages are reported. Tables 2 and 3 show results regarding the event trigger and attribute identification part of the pipeline. In Table 2, “Verbs” indicates the method wherein all tokens with verb POSs are identified as event triggers.

Table 2. Event trigger identification results

Table 3. Event trigger class, tense identification resultsFootnote i

In Table 3, “Manual Majority Label” indicates the method wherein all event triggers are assigned the class and tense labels that appeared on the majority of event triggers in the training data with annotation made manually, OCCURRENCE and PAST, respectively. “Proj Majority Label” indicates the method wherein all event triggers are assigned the class and tense labels that appeared on the majority of event triggers in the training data with annotations produced by projection, OCCURRENCE and PRESENT, respectively.

Table 4 shows results for the full pipeline; “AVG FLC” refers to the label-collapsed average F-scores. Some observations can be highlighted from these tables.

  • TIE-man achieves average F-scores less than 0.35 in all cases, suggesting that the temporal information extraction problem is challenging even with manually annotated data.

  • TIE-proj achieves an average F-score 12.5% larger than TIE-man for temporal relations between event triggers and document creation times (approximately the same score when temporal labels are collapsed).

  • TIE-proj achieves very low average F-scores for the other two types of temporal relations. However, when the temporal labels are collapsed, the TIE-proj scores increase dramatically. The cause of the low scores is addressed in the error analysis below.

  • TIE-proj achieves an average F-score 8.5% lower than TIE-man for event trigger identification, average accuracy 4.6% lower for trigger class identification, and average accuracy 1% higher for trigger tense identification.

Table 4. Full pipeline temporal relation identification results

Entries containing * had values less than 0.04.

6.1 Error analysis

To better understand the sources of errorFootnote j in the temporal information extraction systems, we modified the application mode of the pipeline so that manually identified event triggers and time expressions are used instead of those identified automatically. Table 5 shows results for the full pipeline when manually identified triggers and automatically identified time expressions were used (“MAN EVENTS, AUTO TIMES” column) and when manually identified triggers and time expressions were used (“MAN EVENTS, MAN TIMES” column). Table 6 shows results for the same situation but using collapsed labels in scoring.

Table 5. Modified full pipeline temporal information extraction results

Entries containing ** had values less than 0.06.

Table 6. Modified full pipeline temporal information extraction results, collapsed label scoring

As seen in Tables 5 and 6, substantial improvements are observed when manually annotated triggers were used (particularly for collapsed label scoring), and somewhat smaller further improvements when manually annotated time expressions were also used. For example, consider TIE-proj for identifying temporal relations between triggers and time expressions when collapsed temporal labels were used for scoring (the last row of Table 6). When manually identified triggers and automatically identified time expressions were used, a 72% improvement in average F-score is observed for TIE-proj (an average F-score of 0.305 vs. 0.525). When manually identified triggers and time expressions were used, a 95% improvement is observed (an average F-score of 0.305 vs. 0.595).

Finally, we sought to uncover the reason for the very low average F-scores for TIE-proj for temporal relations between triggers and for temporal relations between triggers and time expressions (the * and ** entries in Tables 4 and 5). In the case of relations between triggers and time expressions, the low-average F-scores are largely due to the fact that no relation annotations with an OVERLAP label were made via projection,Footnote k but 72% of the TimeBank relation annotations (from which the testing part is drawn) had an OVERLAP label (see Table 1).

In the case of relations between triggers, 9% of the relation annotations produced by projection had an OVERLAP label, while 56% of the TimeBank relation annotations had such a label. While this difference partially explains the low-average F-scores for TIE-proj, we examined annotations produced by TipSem on the English side of the parallel corpus to further understand the cause of the low scores. We manually examined 119 randomly selected temporal relation labels assigned to pairs of triggers. Eighty-seven percent had valid triggers and, of those, only 11% had a correct label. Clearly, TipSem labeling inaccuracies contributed to the low-average F-scores.

7 Conclusions and future work

In this paper, we addressed the problem of building systems for temporal information extraction, which we define as the following three-step process: (i) identify the extents of temporal expressions, (ii) identify the event triggers and their associated class and tense, and (iii) identify the temporal relations between pairs of event triggers in the same sentence, between event triggers and time expressions in the same sentence, and between event triggers and document creation times. We focused on the situation in which no manually annotated data are available in a target language, but a temporal information extraction system is available for a source language, as is a parallel corpus between the source and target languages (some additional assumptions are discussed in Section 1). We developed an annotation projection approach and examined its effectiveness in an English (source) to French (target) case study.

We found that, even using manually annotated data to build a temporal information extraction system, F-scores were relatively low (<0.35), suggesting that the problem is challenging even with manually annotated data. Our annotation projection approach performed well (relative to the system built from manually annotated data) on event trigger detection, tense and class prediction, and event–document creation time temporal relation prediction. However, our projection approach performed poorly on the other kinds of temporal relation prediction (see the error analysis in Section 6 for more details).

We suggest several directions for improving annotation projection for temporal information extraction.

  • Improve event trigger and attribute identification. When manually identified triggers (and their attributes) were used, average F-scores for TIE-proj improved substantially. This implies that substantial improvements in event trigger and attribute identification would translate to substantial improvements overall.

  • Improve temporal information extraction on the English side of the parallel corpus—specifically in temporal relation label identification. We observed a very low accuracy in TipSem’s temporal relation label identification for relations between event triggers.Footnote l These TipSem labeling inaccuracies contributed to the low-average F-scores we observed for TIE-proj.

  • Employ a projection approach based on posterior regularization. Doing so would allow temporal information extraction systems to be trained using a more principled approach to mitigating noise than the heuristic filtering we applied. For other NLP problems, research has shown the benefits of posterior regularization (He, Gillenwater, and Taskar 2013). However, to utilize posterior regularization, we need probabilities associated with the temporal annotations assigned to the English side of the parallel corpus, which TipSem does not provide.

  • Employ a projection approach based on cross-lingual neural network-based word embeddings. These approaches have the benefit of not needing word alignments, a significant source of noise.

Finally, an interesting additional direction for future work is to combine the EM-style, unsupervised learning approach of Mirroshandel and Ghassem-Sani (2012) with annotation projection.

Acknowledgements

Our MITRE colleagues John Prange, Rob Case, Nathan Giles, and Rod Holland provided valuable feedback while this research was in progress. Our MITRE colleague Lee Mickle improved our paper through careful editing. Professor Steven Bethard at the University of Arizona provided more details regarding his ClearTK-TimeML system for temporal information extraction.

Funding

This technical data deliverable was developed using contract funds under the terms of Basic Contract No. W15P7T-13-C-A802.

Footnotes

a While corpora for six languages were provided, competitors only addressed tasks in two languages; 17 competitor systems were submitted for English and three for Spanish.

b Using a held-out test set of manually annotated French documents.

c Recall from Section 1.1 that event triggers are single tokens.

d TLink is the name used for temporal relation tags in TimeML.

e While it is possible to use HeidelTime on the target language and use the indicated time expressions in the projection process, doing so introduces potential conflicts. Namely, conflicts between the projected time expressions from the source language (English) and those indicated in the target language. To avoid this conflict, we chose to not apply HeidelTime on the target language during the projection process.

f The class and tense labels result from collapsing the original set of labels as indicated by the hyphens. We did not create classifiers for event aspect, modality, and polarity since there were very few of these annotations in the manually annotated data set.

g We use TipSem because it is an easy-to-use, pre-trained English temporal information extraction system.

h Repeated trials were conducted because the process of training a classifier is stochastic since the hyper-parameter is selected using five-fold cross-validation over a random partition.

i Six percent of the events in the manually annotated French corpus had as their class: EVENT_CONTAINER, CAUSE, or MODAL. Since these classes were not generated by TipSem, they do not appear in the training data with annotations produced by projection. They were ignored in the evaluation.

j We do not have the resources necessary to analyze the error caused by word alignment mistakes as that would require having a parallel corpus with manually produced word alignments.

k Of the relations between triggers and time expressions identified by TipSem on the English side of the parallel corpus, only 9% had an OVERLAP label.

l We did not examine TipSem’s temporal relation label identification accuracy for relations between event triggers and time expressions. However, given the low percentage of OVERLAP labels identified, we believe that low accuracies would be observed.

References

Bethard, S. (2013). ClearTK-TimeML: A minimalist approach to TempEval 2013. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-13) as part of the 51st Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp. 10–14.
Bittar, A., Amsili, P., Denis, P. and Danios, L. (2011). French TimeBank: An ISO-TimeML annotated reference corpus. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, pp. 130–134.
Bojar, O., Buck, C., Callison-Burch, C., Federmann, C., Haddow, B., Koehn, P., Monz, C., Post, M., Soricut, R. and Specia, L. (2013). Findings of the 2013 workshop on statistical machine translation. In Proceedings of the 8th Workshop on Statistical Machine Translation. Association for Computational Linguistics, pp. 1–44.
Caselli, T., Fokkens, A., Morante, R. and Vossen, P. (2015). SPINOZA_VU: An NLP pipeline for cross document TimeLines. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval-15) as part of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp. 787–791.
Chambers, N., Wang, S. and Jurafsky, D. (2007). Classifying temporal relations between events. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL) – Interactive Poster and Demonstration Session. Association for Computational Linguistics, pp. 173–176.
Chambers, N., Cassidy, T., McDowell, B. and Bethard, S. (2014). Dense event ordering with a multi-pass architecture. Transactions of the Association for Computational Linguistics 2, 273–284.
Costa, F. and Branco, A. (2010). Temporal information processing of a new language: Fast porting with minimal resources. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp. 671–677.
D’souza, J. and Ng, V. (2013). Classifying temporal relations with rich linguistic knowledge. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, pp. 918–927.
Das, D. and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, pp. 600–609.
Do, Q., Lu, W. and Roth, D. (2012). Joint inference for event timeline construction. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing (EMNLP) and Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, pp. 677–687.
Forascu, C. and Tufis, D. (2012). Romanian TimeBank: An annotated parallel corpus for temporal information. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC). European Language Resource Association, pp. 3762–3766.
Fairholm, W.O. (2014). Annotation of Temporal Relations Using Markov Logic Networks and Temporal Centering. Master’s Thesis, Guelph, Ontario, Canada: School of Computer Science, University of Guelph.
Ganchev, K. and Das, D. (2013). Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 1996–2006.
Ganchev, K., Gillenwater, J. and Taskar, B. (2009). Dependency grammar induction via Bitext projection constraints. In Proceedings of the 47th Annual Meeting of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, pp. 369–377.
Genkin, A., Lewis, D. and Madigan, D. (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics (American Statistical Association and the American Society for Quality) 49(3), 291–304. doi: 10.1198/004017007000000245.
Glavas, G. and Snajder, J. (2015). Construction and evaluation of event graphs. Natural Language Engineering 21(4), 607–652. doi: 10.1017/S1351324914000060.
Gouws, S. and Sogaard, A. (2015). Simple task-specific bilingual word embedding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). Association for Computational Linguistics, pp. 1386–1390.
He, L., Gillenwater, J. and Taskar, B. (2013). Graph-based posterior regularization for semi-supervised structured prediction. In Proceedings of the 17th Conference on Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, pp. 38–46.
Jang, S.B., Baldwin, J. and Mani, I. (2004). Automatic TIMEX2 tagging of Korean news. ACM Transactions on Asian Languages Information Processing 3(1), 51–65. doi: 10.1145/1017068.1017072.
Jarzebowski, P. and Przepiorkowski, A. (2012). Temporal information extraction with cross-language projected data. In Isahara, H. and Kanzaki, K. (eds), Advances in Natural Language Processing, Lecture Notes in Computer Science, vol. 7614, Springer, Berlin, Heidelberg, pp. 198–209.
Jeong, Y.-S., Kim, Z.M., Do, H.-W., Lim, C.-G. and Choi, H.-J. (2015). Temporal information extraction from Korean texts. In Proceedings of the 19th Conference on Computational Language Learning (CoNLL). Association for Computational Linguistics, pp. 279–288.
Kozhevnikov, M. and Titov, I. (2014). Cross-lingual model transfer using feature representation projection. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp. 579–585.
Laokulrat, N., Miwa, M. and Tsuruoka, Y. (2015). Stacking approach to temporal relation classification with temporal inference. Journal of Natural Language Processing 22(3), 171–196. doi: 10.5715/jnlp.22.171.
Laparra, E., Aldabe, I. and Rigau, G. (2015). Document level time-anchoring for TimeLine extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL) and the 7th International Joint Conference on Natural Language Processing (IJCNLP). Association for Computational Linguistics, pp. 358–364.
Liang, P., Taskar, B. and Klein, D. (2006). Alignment by agreement. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL). Association of Computational Linguistics, pp. 104–111.
Ling, X. and Weld, D. (2010). Temporal information extraction. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. The AAAI Press, pp. 1385–1390.
Llorens, H., Saquete, E. and Navarro, B. (2010). TIPSem (English and Spanish): Evaluating CRFs and semantic roles in TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval-10) as part of the 48th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp. 284–291.
Luong, M.-T., Pham, H. and Manning, C. (2015). Bilingual word representation with monolingual quality in mind. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). Association for Computational Linguistics, pp. 151–159.
Manfredi, G., Strotgen, J., Zell, J. and Gertz, M. (2014). HeidelTime at EVENTI: Tuning Italian resources and addressing TimeML’s empty tags. In Proceedings of the 1st Italian Conference on Computational Linguistics (CLiC-it) & the 4th International Workshop EVALITA, pp. 39–43.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. and McClosky, D. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations. Association for Computational Linguistics, pp. 55–60.
McCallum, A. (2002). Available at http://mallet.cs.umass.edu (accessed 16 July 2013).
Minard, A.-L., Speranza, M., Urizar, R., Altuna, B., van Erp, M., Schoen, A. and van Son, C. (2016). MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC). European Languages Resources Association, pp. 4417–4422.
Mirroshandel, S.A. and Ghassem-Sani, G. (2012). Towards unsupervised learning of temporal relations between events. Journal of Artificial Intelligence Research 45, 125–163. doi: 10.1613/jair.3693.
Mirroshandel, S.A., Ghassem-Sani, G. and Khayyamian, M. (2011). Using syntactic-based kernels for classifying temporal relations. Journal of Computer Science and Technology 26(1), 68–80. doi: 10.1007/s11390-011-9416-7.
Mirza, P. and Minard, A.-L. (2014). FBK-HLT-time: A complete Italian temporal processing system for EVENTI-Evalita 2014. In Proceedings of the 1st Italian Conference on Computational Linguistics (CLiC-it) & the 4th International Workshop EVALITA, pp. 44–49.
Mirza, P. and Tonelli, S. (2014). Classifying temporal relations with simple features. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Association for Computational Linguistics, pp. 308–317.
Moriceau, V. and Tannier, X. (2014). French resources for extraction and normalization of temporal expressions with HeidelTime. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC). The European Language Resources Association, pp. 3239–3243.
Skukan, L., Glavaš, G. and Šnajder, J. (2014). Heideltime.HR: Extracting and normalizing temporal expressions in Croatian. In Proceedings of the 9th Language Technologies Conference. Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia, pp. 99–103.
Spreyer, K. and Frank, A. (2008). Projection-based acquisition of a temporal labeller. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP), pp. 489–496.
Strötgen, J. and Gertz, M. (2010). HeidelTime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval-10) as part of the 48th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp. 321324.Google Scholar
Strötgen, J. and Gertz, M. (2013). Multilingual and cross-domain temporal tagging. Language Resources and Evaluation 47(2), 269298. doi: 10.1007/s10579-012-9179-y.CrossRefGoogle Scholar
Strötgen, J. and Gertz, M. (2015). A baseline temporal tagger for all languages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 541547.CrossRefGoogle Scholar
Strötgen, J. and Gertz, M. (2016). Domain-sensitive temporal tagging. Synthesis Lectures on Human Language Technologies 9(3), 1151.CrossRefGoogle Scholar
Tackstrom, O., Das, D., Petrov, S., McDonald, R. and Nivre, J. (2013). Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics 1, 112.CrossRefGoogle Scholar
The Apache Software Foundation. (2016). Apache Lucene 6.0.0 documentation. April 7. Available at https://lucene.apache.org/core/6_0_0/index.html (accessed 26 May 2016).Google Scholar
Torbati, M., Ghassem-Sani, G., Mirroshandel, S., Yaghoobzadeh, Y. and Hosseini, N. (2013). Temporal relation classification in Persian and English contexts. In Proceedings of the Recent Advances in Natural Language Processing (RANLP), pp. 261269.Google Scholar
UzZaman, N., Llorens, H., Allen, J., Derczynski, L., Verhagen, M. and Pustejovsky, J. (2013). SemEval-2013 Task 1: TempEval-3: Evaluating events, time expressions and temporal relations. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-13) as part of the 51st Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp. 19.Google Scholar
Verhagen, M., Gaizauskas, R., Schilder, F. and Pustejovsky, J. (2009). The TempEval challenge: identifying temporal relations in text. Language Resources and Evaluation 43(2), 161179. doi: 10.1007/s10579-009-9086-z.CrossRefGoogle Scholar
Verhagen, M., Saurí, R., Caselli, T. and Pustejovsky, J. (2010). SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval-10) as part of the 48th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp. 5762.Google Scholar
Wang, M. and Manning, C. (2014). Cross-lingual projected expectation regularization for weakly supervised learning. Transactions of the Association for Computational Linguistics 2, 5566.CrossRefGoogle Scholar
Yarowski, D. and Ngai, G. (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL). Association for Computational Linguistics.Google Scholar
Yarowsky, D., Ngai, G. and Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the 1st International Conference on Human Language Technology Research. Association for Computational Linguistics, pp. 18.Google Scholar
Yoshikawa, K., Riedel, S., Asahara, M. and Matsumoto, Y. (2009). Jointly identifying temporal relations with Markov logic. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL) and the 4th International Joint Conference on Natural Language Processing (IJCNLP) of the Asian Federation of Natural Language Processing (AFNLP). Association of Computational Linguistics and Asian Federation of Natural Language Processing, pp. 405413.Google Scholar
Zennaki, O., Semmar, N. and Besacier, L. (2016). Inducing multilingual text analysis tools using bidirectional recurrent neural networks. In Proceedings of the 26th International Conference on Computational Linguistics (COLING). The Association for Computational Linguistics (ACL), pp. 450460.Google Scholar
Figure 1. An example of temporal information extraction. Two event trigger words ("exploded," "said") and one non-document-creation time expression ("August 7, 1998") have been identified, along with four temporal relations.

Figure 2. The annotation projection procedure.

Figure 3. The temporal information extraction pipelines. TIE-man and TIE-proj denote the temporal information extraction systems that result from the training pipeline.

Table 1. Statistics regarding the French corpora

Table 2. Event trigger identification results

Table 3. Event trigger class and tense identification results

Table 4. Full pipeline temporal relation identification results

Table 5. Modified full pipeline temporal information extraction results

Table 6. Modified full pipeline temporal information extraction results, collapsed-label scoring