1 Introduction
A great number of data-driven approaches to anaphora resolution (also known in nlp as coreference) have recently been proposed, considerably pushing forward the state of the art in the field (see, e.g., Durrett and Klein 2013; Lee et al. 2013; Björkelund and Kuhn 2014; Fernandes et al. 2014; Martschat and Strube 2015; Clark and Manning 2016; Lee et al. 2017; see also Pradhan et al. 2012 for a comparative analysis of some of these systems). A key reason for these advances has been the creation of larger and more linguistically motivated gold annotated corpora, in particular OntoNotes (Weischedel et al. 2011), and the success of recent evaluation campaigns using these new resources (Recasens et al. 2010; Pradhan et al. 2011, 2012). Most of the recently proposed approaches, however, still focus on the accurate modeling of relatively easy cases of anaphoric reference. For example, Durrett and Klein (2013) build one of the best-performing systems through extensive feature engineering for “easy victories,” avoiding “uphill battles” for more complex cases. This can be explained by (i) the relative simplicity of the OntoNotes annotation scheme and (ii) the intrinsic difficulty of the task once we go beyond “easy victories.” We believe that the time is ripe for a dataset that better approximates the true complexity of the phenomenon of anaphoric reference. Such datasets now exist for languages other than English—for example, ANCORA for Catalan and Spanish (Recasens and Martí 2010), the Prague Dependency Treebank for Czech (Nedoluzhko et al. 2009a), or tüba-d/z for German (Hinrichs et al. 2005)—but not yet for English.
This paper presents the second release of the arrau corpus,Footnote a a multigenre corpus of English providing large-scale annotations of a broad range of anaphoric phenomena and of linguistic information relevant to anaphora resolution. arrau has been under development for over 10 years, and several features distinguish it from similar projects.
First, it supports a more complex and linguistically motivated annotation scheme for anaphora than any existing corpus for English and most corpora for other languages, covering, for example, non-referring expressions, bridging references, and discourse deixis. Moreover, additional discourse-level information is available from third parties for subsets of arrau (e.g., the rhetorical structure annotations of Carlson et al. 2002 for the rst domain). This enables a more thorough analysis of these phenomena and creates training material for algorithms that model these tasks jointly.
Second, the arrau guidelines specify the annotation of a number of semantic properties of mentions, most importantly of genericity. Identifying generic usages of nominal expressions is still an understudied task, and we believe that the release of a corpus annotated simultaneously for anaphora and genericity can provide much needed data.
Third, the corpus covers, in addition to news, a variety of genres so far poorly studied, such as dialog (the trains data) and fiction (the Pear Stories). Spontaneous dialog and fiction are not covered by most commonly used coreference corpora.Footnote b Although several linguistic studies focus on genre-specific discourse coherence and anaphora properties (Neumann 2013; Kunz and Lapshinova-Koltunski 2015), only very few approaches aim at empirical analysis or per-genre modeling of coreference (Uryupina and Poesio 2012; Grishina and Stede 2015). In recent work, Kunz et al. (2016) provide a comprehensive data-driven analysis of different linguistic phenomena related to anaphoricity, demonstrating considerable genre-specific differences. We believe that anaphora, like many other discourse-related phenomena, raises many challenging genre-specific problems, and the arrau corpus opens up numerous research paths in this direction.
Fourth, anaphoric ambiguity is annotated. Ambiguous anaphoric expressions constitute truly challenging examples that cannot be tackled with current methods for coreference resolution. Moreover, the most commonly used corpora (Doddington et al. 2004; Weischedel et al. 2011) only focus on identity anaphora—the task of identifying multiple mentions of the same discourse entity—and thus cannot represent anaphoric ambiguity. By annotating ambiguous anaphoric expressions, we take a first step toward a thorough investigation of anaphoric ambiguity.
Finally, during the 10 years in which the arrau dataset has been under development, we have had the opportunity not only to extend the annotation and the size of the corpus, but also, crucially, to continuously revise the annotation and improve its quality. In this paper, we describe the second major release of the corpus, whose development has been motivated not only by the objective of increasing the corpus size, particularly regarding spoken data, but also by improving the annotation quality and consistency in a number of ways, including via several automatic consistency checks. This is in contrast with other corpora, where subsequent releases, if any, expand the text collection and only fix occasional manually attested errors. We believe that the computational linguistics community can benefit considerably from cleaner and more curated datasets. This requires a methodology for data cleaning and maintenance that is currently in its infancy, with only a few studies (e.g., Dickinson and Meurers 2003; Dickinson and Lee 2008) investigating possibilities for automatic error identification in manually annotated resources. Moreover, the few existing efforts are not supported by the data creation/labeling projects: to our knowledge, the common practice in annotating textual data does not go beyond ensuring high agreement between human coders, using, for example, κ or Krippendorff’s α (Carletta 1996; Artstein and Poesio 2008). Corpus creators rarely make use of automatic means of data verification, such as specific consistency checks or error analysis for automatic systems trained and tested on the data. While our approach is far from being the final word on this, we think it is a first step in the right direction.
The two versions of the arrau corpus were presented at the Language Resources and Evaluation conference (Poesio and Artstein 2008; Uryupina et al. 2016a), but this paper greatly expands upon the content of these two lrec papers, providing an extensive overview of the annotation guidelines and their motivation and a range of previously unpublished statistics about the linguistically more advanced features of arrau.
The second release of the arrau corpus, in mmax2 format and including the original annotations of the Penn Treebank from which the markablesFootnote c were extracted, is available from ldc, but the sub-corpora of this version of arrau that consist of anaphoric annotations of ldc corpora such as the rst Discourse Treebank and the trains-93 corpus can only be distributed for free to groups that acquire a license for the original corpora. However, the dataset extracted from arrau for the crac 2018 Shared Task (see Section 5.3) is freely available through ldc.
The rest of the paper is organized as follows. Section 2 provides an overview of the annotation guidelines. Section 3 discusses the corpus development between the two versions. Finally, Section 4 compares arrau against other datasets annotated for coreference.
2 Annotation methodology
The goal of the arrau project was to develop methods to annotate and interpret the more challenging cases of anaphoric reference, including in particular reference to abstract objects. A key aspect of this work was to use coding schemes based on extensive reliability tests to create large-scale annotated resources that could be used to study these types of anaphoric reference. Building on the gnome guidelines (Poesio 2000a, as discussed in Poesio 2004a,b), which already provided reliability-tested annotation schemes for aspects of anaphoric annotation such as bridging reference (Clark 1975) and were used, for example, to create the dataset in Poesio et al. (2004a,b), we developed and tested extended annotation guidelines (Poesio and Artstein 2008) aimed specifically at abstract anaphora and ambiguity (Poesio and Artstein 2005a; Artstein and Poesio 2006). These annotation guidelines, distributed with the corpus and available from the project website, also provide detailed instructions for identifying markable boundaries and marking non-referentiality and non-anaphoricity, as well as a wide range of mention attributes such as genericity. In this section, we summarize these guidelines and, more generally, the methods adopted in the creation of the corpus, focusing on the most distinctive features of the arrau annotation.Footnote d
2.1 Genres
Some of the best known anaphoric corpora, particularly for English and particularly at the time when the arrau annotation was started, consist entirely of documents either in the news or broadcast genres. One of the objectives of the arrau annotation was to cover a greater variety of genres.
The corpus does include a substantial amount of news text, a sub-corpus or domain (we will use the term domain throughout to refer to arrau’s sub-corpora) called RST and consisting of the entire subset of the Penn Treebank that was annotated in the rst treebank (Carlson et al. 2003). We annotated news data so that researchers could compare results on arrau with results on other news datasets, and we chose these documents because they had already been annotated in a number of ways—not only syntactically (e.g., through the Penn Treebank; Marcus et al. 1993) and for their argument structure (e.g., through the Propbank; Palmer et al. 2005) but also for rhetorical structure (Carlson et al. 2003). This dataset would therefore allow the study of the effect of these other types of linguistic information on anaphora resolution and vice versa.Footnote e
But in addition to RST, arrau includes three more domains, covering genres important from the point of view of discourse analysis but not normally covered by anaphoric corpora. Specifically, the TRAINS domain of arrau includes all the task-oriented dialogs in the trains-93 corpusFootnote f; the PEAR domain consists of the complete collection of spoken narratives in the Pear Stories that provided some of the early evidence on salience and anaphoric reference (Chafe 1980); and the gnome domain includes documents from the medical and art history genres covered by the gnome corpus (Poesio 2004a, 2000b) used to study both local and global salience (Poesio et al. 2004a, 2006a).
The same coding scheme was used for all domains, but separate guidelines were written for the textual domains and the spoken dialog domains; the distinct coding schemes are included in the documentation of the corpus as man_anno_gnome and man_anno_trains, respectively.
Table 1 provides basic statistics about the four arrau domains.Footnote g Both the RST and gnome domains consist of carefully edited texts with complex grammatical sentences. This results in long markables, often either multiword named entities (for example, full names of organizations) or complex nps. Markable detection for these domains requires a high-quality parser. Particularly in the gnome domain, synonyms and bridging references abound. Successful interpretation and resolution of such expressions would require sophisticated name-matching and aliasing techniques and advanced semantic features, going beyond head-noun compatibility.
Table 1. Corpus statistics for the four arrau domains
The PEAR and TRAINS domains, by contrast, consist of spontaneous speech. The language in these domains mostly consists of short utterances, often ungrammatical and/or containing disfluencies. PEAR and TRAINS markables are therefore on average much shorter, with many one-word markables, mostly pronouns. Discontinuous markables (see Section 2.2) are present in both PEAR and TRAINS, although they are not very common. So for these domains, markable detection might better be implemented through a chunker robust to noisy ungrammatical input. As far as anaphora resolution is concerned, however, ambiguity and references to abstract objects (e.g., plans in TRAINS) abound, as well as demonstratives used deictically. So salience features and context modeling become key factors.
To summarize, arrau contains documents from four domains, representing different genres, mostly not covered by other corpora. These genres pose challenging problems for the next generation of coreference resolvers, requiring complex techniques for accurate preprocessing and resolution.
2.2 Markables in arrau
arrau belongs to the “new wave” of anaphorically annotated corpora that were created after the re-examination of annotation schemes for anaphora started with the Discourse Resource Initiative and the MATE and gnome projects (Passonneau 1997; Poesio et al. 1999; van Deemter and Kibble 2000). These new corpora—other examples include ANCORA (Recasens and Martí 2010), COREA (Hendrickx et al. 2008), OntoNotes (Pradhan et al. 2007), the anaphoric annotation of the Prague Dependency Treebank (Nedoluzhko et al. 2009a), and tüba-d/z (Hinrichs et al. 2005)—employed annotation schemes rooted in linguistic theory rather than aiming to capture domain-relevant knowledge as done in the earlier muc and ace corpora; for instance, the entire np is typically marked. Not all of these corpora, however, consider all nps as markables. Some older corpora had imposed syntactic restrictions on markables—for instance, in many older corpora only pronouns are annotated (Ge et al. 1998). Other older corpora imposed semantic restrictions: for instance, in the ace corpora, only entities of semantic types of interest are considered. But even some of the “new generation” corpora still restrict mentions depending on their referentiality/anaphoricity properties: for instance, in OntoNotes neither expletives nor singletons are annotated (for a discussion of the state of the art in anaphoric annotation, see Poesio et al. 2016).
By contrast, according to the arrau guidelines (which for text follow the earlier gnome guidelines,Footnote h see below for the dialog guidelines), all nps are considered markables, even when they are non-referring, like the predicative a busy place in (1) (we discuss in Section 2.3 which nps are considered non-referring in arrau), or when they do not corefer with any other mention and thus form a singleton coreference chain all by themselves. Moreover, non-referring markables are manually sub-classified. In addition, possessive pronouns are marked as well, and all premodifiers are marked when the entity referred to is mentioned again, for example, in the case of the proper name US in (2), and when the premodifier refers to a kind, like exchange-rate in (3).
(1) It seems to be [a busy place]
(2) … The Treasury Department said that the [US]1 trade deficit may worsen next year after two years of significant improvement… The statement was the [US]1 government’s first acknowledgment of what other groups, such as the International Monetary Fund, have been predicting for months.
(3) The Treasury report, which is required annually by a provision of the 1988 trade act, again took South Korea to task for its [exchange-rate]1 policies. “We believe there have continued to be indications of [exchange-rate]1 manipulation …
In arrau, the full np is marked with all its modifiers; in addition, a min attribute is marked, as in the muc corpora: for nominal markables, min corresponds to the head noun, whereas for named entities (modified or not) min corresponds to the proper name:
(4) [[min Alan Spoon]min, recently named Newsweek president], said Newsweek’s ad rates would increase 5% in January.
Discontinuous markables
One of the distinctive features of arrau is its support for discontinuous markables—markables built out of non-contiguous material. Discontinuous chunks are problematic for many corpus annotation formats (Amoia et al. 2011), and thus many guidelines developed for various linguistic phenomena only allow continuous constituents to be labeled.
Discontinuous markables, however, are common in dialog, for instance, in cases of so-called collaborative completions (Poesio and Rieser 2010) illustrated by (5), where the mention an orange screw with a slit is constructed out of utterances 1.2 and 1.3.
(5)
For this reason, Müller included functionality for the annotation of discontinuous markables in the mmax2 annotation tool (Müller and Strube 2006), developed to support his research on anaphora resolution in dialog (Müller 2008); in mmax2, spans can be arbitrary sequences of tokens. However, discontinuous markables also provide a way to include in a markable all the information provided by the text, for example, in cases of coordination where the two coordinated nps share some information, illustrated by (6). In this example, the two names Anna Snezak and Morris Snezak are coordinated, but the last name Snezak appears only once. Discontinuous markables make it possible to include both the segment of text marked as part 1 and the one marked as part 2 in the same markable. Similarly in (7).
(6) …after owners [part1 Anna]part1 and Morris [part2 Snezak]part2…
(7) So he doesn’t have to play [part1 the same Mozart]part1 and Strauss [part2 concertos]part2 over and over again.
Discontinuous markables are typically ignored in anaphora resolution: state-of-the-art mention detection systems always output continuous chunks, and the publicly available SemEval and conll coreference scorers (Pradhan et al. 2014) assume numbered brackets as mention boundaries, which cannot encode discontinuous fragments. To make arrau usable for these purposes, markables can be discontinuous but minimal spans cannot be. This way, all the markables in arrau can be aligned to contiguous sequences of tokens.
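As an illustration, the following sketch (our own, with a hypothetical, simplified markable representation rather than the actual arrau data structures) shows how a possibly discontinuous markable can be reduced to a contiguous span via its minimal span, which is what makes arrau compatible with bracket-based scorers.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Markable:
    """Hypothetical, simplified view of an arrau markable."""
    parts: List[Tuple[int, int]]   # token spans (start, end), inclusive; more than one part = discontinuous
    min_span: Tuple[int, int]      # minimal span (head noun or proper name), contiguous by design

def to_contiguous_span(m: Markable) -> Tuple[int, int]:
    """Return a contiguous (start, end) span usable with bracket-based scorers.

    Continuous markables are kept as they are; discontinuous ones are reduced
    to their minimal span, which in arrau is always a contiguous token sequence.
    """
    if len(m.parts) == 1:
        return m.parts[0]
    return m.min_span

# Toy example with made-up token indices, loosely modeled on (6); the choice of
# minimal span here is illustrative only.
anna_snezak = Markable(parts=[(2, 2), (5, 5)], min_span=(2, 2))
print(to_contiguous_span(anna_snezak))   # (2, 2)
```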
2.3 Markable properties
All markables are manually annotated for a variety of properties according to the gnome guidelines (Poesio 2000a): these include morphosyntactic agreement (gender, number, and person), grammatical function, and the semantic type of the entity: person, animate, concrete, organization, space, time, plan (for actions), numerical, or abstract.Footnote i The guidelines and reliability studies leading to this scheme are discussed in Poesio (2004b, 2000b). In this section, we discuss in detail only two additional attributes, specifying the referential status of a markable and the genericity status of mentions. The reference attribute specifies the logical form status of the markable: referring, expletive, quantificational, or predicative. Genericity is annotated following a scheme developed in gnome after experiments based on the official annotation manual had shown poor reliability for this attribute. We discuss each attribute in turn.
Referring and non-referring markables
Most anaphorically annotated corpora focus on referring markables, or mentions proper: markables that refer to discourse entities and participate in anaphoric relations. This decision, primarily motivated by reasons of cost, makes it difficult, however, to train models able to recognize and interpret non-referring markables—nominal expressions that do not refer to a discourse entity. It has been shown that filtering out at least some types of non-referring expressions can improve the performance of a coreference resolver (Uryupina et al. 2016b). For a corpus like OntoNotes, in which non-referring expressions are not annotated, separate classifiers are required to develop such a filter—for example, Björkelund and Farkas (2012) trained a pre-filtering classifier for non-anaphoric it, you, and we on the OntoNotes data.
In arrau, all nominal expressions are treated as markables, including non-referring nominal expressions. The annotation scheme and guidelines are based on those developed for gnome, where the lftype attribute (κ = .73) was used to distinguish between referring expressions proper (called terms in gnome) and several types of non-referring interpretations of nps, including expletives (8), predicatives (9, 10), quantifiers (11), and coordinations (12) (Poesio 2000a, 2004b). In arrau, coders are asked, first of all, to classify markables as referring or non-referring. If a markable is classified as referring, coders are then asked whether that expression is discourse-old or discourse-new (Prince 1992), and in the first case, to identify its antecedent (see Section 2.4). If the markable is classified as non-referring, coders have to either assign it to one of the gnome categories of non-reference, or label it as idiomatic (13), or as an incomplete or fragmentary expression (14).
(8) And [there]non–referring’s a ladder coming out o of the tree and [there]non–referring’s a man at the top of the ladder
(9) It see it seems to be [a busy place]non–referring
(10) 1 ml of the prepared solution for injection contains 0.25 mg ([8 million IU]non–referring) of Interferon beta-1b.
(11) [Most of the analysts polled last week by Dow Jones International News Service in Frankfurt, Tokyo, London and New York]non–referring expect the US dollar to ease only mildly in November.
(12) Mr. Sutton recalls: “When I left, I sat down with [[Charlie Rangel], [Basil Paterson] and [David]]non–referring, and David said, ‘Who will run for borough president?
(13) so that would um if we left at six in the morning would that make [sense]non–referring six (mumble)
(14) U: okay then um okay then originally we need to have um the one boxcar go to [oranum]non–referring go to Corning from Elmira
The choice of marking quantifiers and coordination in arrau as non-referring is possibly the most controversial decision we took. The quantifier Most of the analysts polled last week by Dow Jones International News Service in (11) is marked as non-referring. Similarly, whereas we asked coders to mark the individual noun phrases Charlie Rangel, Basil Paterson, and David in (12) as referring markables that can participate in anaphoric relations, the embedding coordinate np is marked as non-referring. These decisions mean that any expression anaphorically related to such a quantifier cannot be marked as such. However, plural anaphora to antecedents introduced by coordination can be annotated, as discussed in Section 2.4. In the case of quantifiers, the decision was motivated by the high disagreement that we observed among our coders when they were left free to mark a quantifier as either referring or non-referring. For the case of coordination, the reasons were more complex; we discuss them when explaining how plural anaphora is handled. Both decisions might be reconsidered in a future release of arrau.
Table 2 shows the distribution of various types of non-referring markables in the entire corpus and in the four individual domains, overall and for each type of non-referring markable. As could be expected, the distribution of non-referring expressions is genre-specific. Thus, the two domains with spontaneously generated, non-curated texts (TRAINS and PEAR) have a large number of incomplete or fragmentary expressions, which are virtually non-existent in RST and gnome documents. Idioms are common in all the genres except gnome—a collection of medical leaflets written in a very formal language. Predicative non-referring expressions, especially appositions, are more common in news.
Table 2. Distribution of non-referring markables in arrau
Genericity
The guidelines for genericity adopted for the gnome corpus were developed to distinguish generic uses of nominal expressions (as in Dogs bark) from non-generic cases (as in I saw dogs in the street). Developing reliable guidelines for this type of annotation proved quite a challenge, and two schemes were conceived before developing one achieving sufficient reliability. The first scheme attempted to capture the type/token distinction—a distinction similar to that between generic and specific entities made in the ace-2 coding scheme—but this type of judgment proved difficult to agree on, in particular for mentions referring to substances such as oil or to chemical components of medicines such as oestradiol, as illustrated in (15). The result was that this simple scheme only achieved a very modest level of reliability (κ = .33).
(15) Not that [oil]generic suddenly is a sure thing again.
A second scheme was then developed in which a new value, undersp-generic, was introduced as the value to be used for all references to substances.Footnote j The new scheme achieved better reliability, but still only κ = 0.55. The biggest remaining problem was quantifiers (including definites and indefinites). Our annotators found it very hard to agree on whether a quantified np used (non-generically) to quantify over a specific set of individuals at a particular spatio-temporal location, as in Many lecturers went on strike (on March 16th, 2004), should be marked as generic or not. A third and last scheme was therefore developed, in which separate values were introduced for each type of quantifier, as well as new guidelines, according to which the annotation of the genericity attribute is carried out following a decision tree going from the easiest cases to the more complex ones. Coders are first asked to check whether the nominal is in the scope of an explicit operator, such as a conditional like if (as in (16)), an individual quantifier such as every or most (iquant) (as in (17)), a temporal quantifier like always or once (as in (18)), a modal (as in (19)), or an instruction (as in (20)). In these cases, the nominal is not marked as generic, but as being in the scope of the appropriate operator. If no such explicit quantifier/operator is present, coders are asked to check whether the nominal refers to semantic objects whose genericity is left underspecified, such as substances (e.g., gold), as in (21), seen before, or in (22). Finally, the annotator is asked whether the sentence in which the markable occurs is generic, and in this case, to mark the nominal as generic-yes if it refers generically, as in (23), or generic-no otherwise. (A sketch of this decision procedure as code is given after the examples below.) With these instructions, reasonable intercoder agreement was finally achieved (κ = .82) (Poesio 2004b).
(16) New York State Comptroller Edward Regan predicts a $ 1.3 billion budget gap for the city’s next fiscal year, a gap that could grow if there is [a recession]generic.
(operator-conditional)
(17) Mr. Uhr said that Mr. Petrie or his company have been accumulating Deb Shops stock for several years, each time issuing [a similar regulatory statement]generic.
(operator-iquant)
(18) In addition , once [money]generic is raised , [investors]generic usually have no way of knowing how [it]generic is spent.
(operator-tquant)
(19) They argue that their own languages should have [equal weight]generic, although recent surveys indicate that the majority of the country’s population understands Filipino more than any other language.
(operator-modal)
(20) Use [alcohol wipes]generic to clean the tops of the vials move in one direction and use one wipe per vial.
(operator-instruction)
(21) Not that [oil]generic suddenly is a sure thing again.
(underspecified-substance, RST)
(22) 1 ml of [the prepared solution for injection]generic contains 0.25 mg ( 8 million IU ) of [Interferon beta-1b]generic.
(underspecified-substance, gnome)
(23) In its report to Congress on [international economic policies]generic, the Treasury said that any improvement in the broadest measures of trade, known as the current account.
(generic-yes)
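To make the decision tree explicit, the following sketch (ours, not part of the arrau tooling) renders it as a small classification function; every attribute name below is a placeholder for a judgment that in arrau is made by a human coder, not computed automatically.

```python
from types import SimpleNamespace

def genericity_label(markable) -> str:
    """Toy rendering of the gnome/arrau decision tree for the genericity attribute."""
    # 1. Explicit operators take priority: the nominal is marked as being
    #    in the scope of the operator rather than as generic.
    if markable.in_scope_of_conditional:        # e.g., "if there is a recession" (16)
        return "operator-conditional"
    if markable.in_scope_of_individual_quant:   # e.g., "each time issuing a similar statement" (17)
        return "operator-iquant"
    if markable.in_scope_of_temporal_quant:     # e.g., "once money is raised" (18)
        return "operator-tquant"
    if markable.in_scope_of_modal:              # e.g., "should have equal weight" (19)
        return "operator-modal"
    if markable.in_instruction:                 # e.g., "Use alcohol wipes to clean ..." (20)
        return "operator-instruction"
    # 2. Substances and similar objects whose genericity is left open.
    if markable.refers_to_substance:            # e.g., "oil" (21), "Interferon beta-1b" (22)
        return "underspecified-substance"
    # 3. Otherwise, decide on the basis of the embedding sentence.
    if markable.in_generic_sentence and markable.refers_generically:
        return "generic-yes"                    # e.g., "international economic policies" (23)
    return "generic-no"

# Toy markable standing in for "money" in example (18):
money = SimpleNamespace(in_scope_of_conditional=False, in_scope_of_individual_quant=False,
                        in_scope_of_temporal_quant=True, in_scope_of_modal=False,
                        in_instruction=False, refers_to_substance=False,
                        in_generic_sentence=False, refers_generically=False)
print(genericity_label(money))   # operator-tquant
```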
Genericity was already marked according to these guidelines in the first release of arrau (Poesio and Artstein 2008), but its annotation was only partially checked. One of the main revisions carried out for the second release of the corpus was a systematic check that the annotation of this attribute was consistent with the guidelines. The distribution of generics and quantifiers in the separate arrau domains resulting from this verification is shown in Table 3. In total 2252 mentions were annotated as generic (2% of the total number of markables), 3167 as being bound by some other operator (3%), and 1.4% as underspecified.
Table 3. Distribution of generic mentions in arrau
2.4 Range of relations
The arrau guidelines support annotation of different types of anaphoric relations. All referring markables are marked as either discourse new or old. Discourse new mentions introduce new entities and thus are not marked as being coreferent with an entity already introduced (antecedent). For discourse old mentions, an antecedent can be identified, either of type phrase (in case the antecedent was introduced using a nominal expression) or segment (not introduced by a nominal expression, for the cases of discourse deixis).Footnote k In addition, referring nps can be marked as related to a previously mentioned discourse entity in order to identify them as examples of associative or bridging anaphora. We discuss the three most distinctive types of annotation in arrau—bridging anaphora, plural anaphora, and discourse deixis—in turn.
Bridging anaphora
Annotating—indeed, identifying—bridging anaphora in a reliable way is a difficult task (Poesio and Vieira 1998; Vieira 1998), which is one of the reasons why so few large-scale corpora for anaphora include this type of annotation (apart from our own work, we are only aware of a few attempts to do so; see Section 4.4 for a discussion of this work and Poesio et al. (2016) for additional discussion of larger corpora, some of which also include anaphora). The arrau guidelines for bridging anaphora are based on a series of experiments that started with the work of Vieira (1998) and Poesio and Vieira (1998) and continued in the gnome project (Poesio 2004b). Vieira and Poesio attempted to annotate the full range of bridging references as discussed, for example, in Passonneau (1997) and Poesio et al. (1999), but only achieved very poor agreement. In gnome, attempts were made to identify a subset of the relations that could be annotated reliably (Poesio 2004b), finding that reasonable reliability could be achieved by limiting the annotation to three types of relations: element-of, as in (24), where the middle is a bridging reference to the middle of the three horizontal zones; subset, as in (25), where Polygonal openwork rings incorporating an inscription in (u2) is a bridging reference to two gold finger rings in (u1) based on an inverse subset relation; and a generalized possession relation poss, covering both part-of relations, as in (26), and general possession relations, as in (27). The element relation was also used to annotate certain types of other anaphora, as in (28).
(24) The sixteen panels are each divided into [three horizontal zones]1, [the middle]→1 containing a letter
(25) (u1) [Two gold finger-rings from Roman Britain (2nd–3rd century AD)]1.
(u2) [Polygonal openwork rings incorporating an inscription]→1 are a distinctive type found throughout the Empire.
(26) (u1) [These “egg vases”]1 are of exceptional quality
(u2) basketwork bases support [egg-shaped bodies]→1
(u3) and bundles of straw form [the handles]→1
(27) (u1) [The Getty museums microscope]1 still works,
(u2) and [the case]→1 is fitted with a drawer filled with the necessary attachments.
(28) (u39) [The two stands]1 are of the same date as the coffers, but were originally designed to hold rectangular cabinets.
(u42) [One stand]→1 was adapted in the late 1700s or early 1800s century to make it the same height as [the other]→1.
Poesio (2004b) found that coders following the gnome guidelines achieved good precision but low recall on identifying bridging references. When asked to mark mentions as either discourse-new, discourse-old, or bridging according to the gnome definition of bridging, coders agreed on the type of relation for bridging references in 95.2% of the cases, but each of them only spotted about one third of bridging references on average, and typically different bridging references, so that only 22% of bridging references were marked as such by all annotators.
The arrau Release 1 guidelines followed the gnome guidelines, but with an extension and a simplification. Annotators were asked to mark a mention as related to a particular antecedent if it stood to that antecedent in one of the relations identified in gnome (indeed, the same examples were used) or, in addition, in one of two further relations (although the reliability of this extension was not tested):
other, for other nps, broadly following the guidelines in Modjeska (2003);
an undersp-rel relation for “obvious cases of bridging that didn’t fit any other category.”
In arrau Release 1, however, coders were not asked to specify the relation—effectively, any associative bridging reference was considered a case of “underspecified relation.” In arrau Release 2, the annotation of bridging references was revised for the rst domain, and coders were asked to mark the specific relations in that domain only. The resulting statistics about bridging references in arrau Version 2 are shown in Table 4. A total of 5512 bridging references were marked, but a classification of the relations was only provided for the 3777 bridging references identified in the rst domain. In the table, we write P+S+E+O+U as the category for the bridging references in the other domains, which are currently not classified. We intend to provide a classification of these bridging references, as well as to re-check the existing classifications, in Release 3 of the corpus, currently planned for 2018.
Table 4. Distribution of bridging references in arrau
Plural anaphora
Until recently, no data-driven studies attempted to model plural anaphora specifically, except for the simplest case of plural reference to a plural antecedent, as in (29).Footnote l
(29) (u1) from Avon going to Dansville pick up [the three boxcars]1
(u2) go to Corning load [them]1 and …
This is because some types of plural reference are intrinsically difficult, both for annotation and resolution. We believe therefore that a dataset annotated for plural anaphora in a principled way will open several challenging research possibilities.
One example of the more complex forms of plural anaphora is plural reference to sets of objects introduced by listing their elements, as in the following toy examples.
(30) a. [Mr. Luzon and his team]1, however, say [they]1 aren’t interested in a merger.
b. Mr. Luzon agreed with his team that [they]? aren’t interested in a merger.
Anaphoric annotation schemes that do require coders to mark plural reference to antecedents introduced by coordination do so by assuming that the coordination Mr. Luzon and his team in (30a) (an actual example from the rst portion of arrau) introduces a discourse entity, and by asking coders to link they to that entity. Indeed, this is the approach that was followed in gnome. This approach will not, however, work for the very similar (30b) (our own), since in this example there is no longer a constituent for Mr. Luzon and his team—so they becomes a discourse-new mention with no antecedent. The approach to annotating plurals adopted in arrau was based on the belief that these two very similar cases of plural reference should be treated in the same way. In arrau, we annotate plural anaphors to sets of individually introduced entities as bridging references to each member of the corresponding set, encoding an element-of bridging relation. Thus, in (30a) as well as in (30b), “They” is linked to both “Mr. Luzon” and “his team” individually (see the sketch below). Note that such annotation allows for a more uniform interpretation of plural reference to individually introduced entities.
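As an illustration of how this encoding can be consumed downstream, the sketch below (our own; the pointer representation and relation name are hypothetical simplifications of the mark-up) collects the element-of bridging pointers of a plural anaphor into a single antecedent set, so that (30a) and (30b) receive the same representation.

```python
def plural_antecedent_set(anaphor, pointers):
    """Collect the antecedent set of a plural anaphor annotated arrau-style.

    `pointers` is assumed to be a list of (source, relation, target) triples
    extracted from the mark-up; for "they" in (30a)/(30b) the element-of
    pointers target "Mr. Luzon" and "his team" individually.
    """
    return {target for source, relation, target in pointers
            if source == anaphor and relation == "element-of"}

# Hypothetical pointer list for example (30):
pointers = [("they", "element-of", "Mr. Luzon"),
            ("they", "element-of", "his team")]
print(plural_antecedent_set("they", pointers))   # {'Mr. Luzon', 'his team'}
```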
Discourse deixis
The term discourse deixis was introduced by Webber (1991) to indicate reference to abstract entities that have not been introduced in the discourse through a nominal expression,Footnote m as in the following example from the trains corpus, where that in utterance 7.6 refers to the plan of shipping boxcars of oranges to Elmira.
(31)
Discourse deixis in its full form is a very complex form of reference, both to annotate (Artstein and Poesio 2006; Dipper and Zinsmeister 2012) and to resolve (Marasović et al. 2017). Very few anaphoric annotation projects have attempted annotating discourse deixis in its entirety (Artstein and Poesio 2006; Dipper and Zinsmeister 2012; Kolhatkar 2014); more typical is a partial annotation, as in the work of Byron and Navarretta, who annotated pronominal reference to abstract objects (Byron and Allen 1998; Navarretta 2000); in OntoNotes, where event anaphora was marked (Pradhan et al. 2007); and in the work of Kolhatkar (2014), which focused on so-called shell nouns. As a result, very few systems have attempted resolving this type of anaphor (Eckert and Strube 2000; Byron 2002; Kolhatkar and Hirst 2012; Marasović et al. 2017).
Discourse deixis was one of the “difficult cases of anaphora” on which the arrau project focused, and a number of annotation experiments were conducted (Artstein and Poesio 2006), resulting in guidelines according to which
1. A coder specifying that a referring expression is discourse old is asked whether its antecedent was introduced using a phrase (mention) or segment (discourse segment)
2. Coders choosing segment as the type of antecedent have to mark a sequence of (predefined) clauses
Artstein and Poesio (2006) point out that measuring disagreement on this type of annotation requires making a number of assumptions, and that values of Krippendorff’s α (Krippendorff 2004) ranging from 0.45 to 0.9 can be achieved depending on which assumptions are made.
The statistics about discourse deixis in arrau Version 2 are shown in Table 5. A total of 1633 cases of discourse deixis were identified. It is worth noting that the trains sub-domain contains more than half the total cases of discourse deixis even though it is less than half the size of the rst sub-domain. (We intend to re-check the annotation in Release 3 of the corpus, currently planned for 2018.)
Table 5. Distribution of discourse deixis in the subdomains of arrau
Anaphoric ambiguity
A number of studies have shown that anaphoric expressions both in dialog and text can be ambiguous (Poesio and Reyle 2001; Poesio et al. 2006b; Versley 2008; Recasens et al. 2011). A classic illustration is Example (32), from the trains corpus (Poesio and Reyle 2001). The pronoun it in (u2) could refer equally well to engine E2 or the boxcar at Elmira. Studies carried out as part of arrau showed that such examples were fairly common in the trains corpus, and that different coders would interpret them differently (Poesio and Artstein 2005a; Poesio et al. 2006b). Other studies have shown that occurrences of it can be ambiguous between an expletive and a discourse deixis interpretation (Gundel et al. 2002).
(32) (u1) M: can we.. kindly hook up … uh … [engine E2]1 to [the boxcar at.. Elmira]2
(u2) M: +and+ send [it]1,2 to Corning as soon as possible please
The arrau coding scheme accommodates this. Referring markables can be marked as ambiguous between a discourse-new and a discourse-old interpretation; discourse-old mentions can be marked as ambiguous between a discourse-deictic and a phrase reading; and both phrase and segment mentions can be marked as ambiguous between two distinct interpretations. The annotated corpus contains examples of ambiguous anaphoric expressions from text as well, as in the following example.
(33) Criticism of [the Abbie Hoffman segment]1 is particularly scathing among people who knew and loved the man. <…> Both women say they also find it distasteful that [CBS News is apparently concentrating on Mr. Hoffman’s problems as a manic-depressive]2. “[This]1,2 is dangerous and misrepresents Abbie’s life,” says Ms. Lawrenson, who has had an advance look at the 36-page script.
In (33), the anaphoric mention “This” is ambiguous between “the Abbie Hoffman segment” (identity anaphora) and “CBS News is apparently concentrating on Mr. Hoffman’s problems as a manic-depressive” (discourse deixis).
The extent of ambiguity in anaphoric interpretation found using the arrau scheme was analyzed in a study reported in Poesio and Artstein (2005a). A total of 18 subjects were asked to annotate dialogs from the trains subdomain of arrau with a scheme allowing them to mark ambiguity. Poesio and Artstein reported that a minimum of 10% of markables in the trains corpus were marked as explicitly ambiguous. They also found, however, that a much higher percentage of markables, up to 40%, were implicitly ambiguous—that is, were annotated differently by different subjects. In Poesio and Artstein (2005b), methods for computing agreement in a scheme allowing for ambiguity were proposed, based on developing extended distance metrics for α (Krippendorff 2004; Artstein and Poesio 2008). Values of α between .58 and .67 were reported depending on the type of distance metric used and the choice of markables.
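For readers who wish to reproduce this kind of computation, the sketch below implements plain Krippendorff’s α with a pluggable distance function; it does not reproduce the extended distance metrics of Poesio and Artstein (2005b), which would replace the simple nominal distance used in the example. The toy reliability data are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations

def krippendorff_alpha(data, distance):
    """Krippendorff's alpha via the standard coincidence-matrix formulation.

    data: list of units, each a list of labels assigned by different coders
          (units coded by fewer than two coders are ignored).
    distance: dissimilarity d(a, b); its square is used as the difference
              function, so for nominal data pass a 0/1 distance.
    """
    coincidence = defaultdict(float)   # o[(c, k)], summed over units
    totals = defaultdict(float)        # n_c, marginal totals
    n = 0.0
    for unit in data:
        m = len(unit)
        if m < 2:
            continue
        for a, b in permutations(unit, 2):          # ordered pairs from different coders
            coincidence[(a, b)] += 1.0 / (m - 1)
        for a in unit:
            totals[a] += 1.0
        n += m
    d_o = sum(w * distance(a, b) ** 2 for (a, b), w in coincidence.items()) / n
    d_e = sum(totals[a] * totals[b] * distance(a, b) ** 2
              for a in totals for b in totals) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e > 0 else 1.0

# Nominal example: three coders labeling four markables as discourse old/new.
data = [["DO", "DO", "DO"], ["DO", "DN", "DO"], ["DN", "DN", "DN"], ["DO", "DO", "DN"]]
nominal = lambda a, b: 0.0 if a == b else 1.0
print(round(krippendorff_alpha(data, nominal), 3))
```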
Statistics about anaphoric ambiguity in arrau Version 2 can be found in Table 6. The first column of the table shows the category of the first interpretation of the ambiguous markable: discourse old (either phrase or segment), discourse new, or non-referring. The second column shows the second interpretation indicated by the coder: again discourse old (phrase) but with a different antecedent, discourse new, discourse deixis, or non-referring. A total of 234 cases of ambiguous markables were identified, which is a very small fraction of the around 100,000 markables in arrau Version 2; the results of Poesio and Artstein (2005a) suggest, however, that this figure substantially underestimates the actual extent of ambiguity, at least by a factor of 4. The majority of these ambiguities (75%) are between two discourse-old interpretations with different antecedents, but there are also several cases of DN/DO ambiguity and DO/DD ambiguity. We also note that in all cases of ambiguity the first interpretation chosen is discourse old; this is because the instructions explicitly require coders to choose DO as the first interpretation if the ambiguity is between a discourse-old interpretation and some other interpretation.
Table 6. Distribution of ambiguity in the subdomains of arrau
2.5 Reliability of the coding scheme, summarized
Table 7 summarizes the reliability of the different aspects of the arrau coding scheme presented in this section.
Table 7. Reliability of the several aspects of the arrau coding scheme
2.6 Annotation tool and markup scheme
arrau was annotated using the mmax2 annotation tool (Müller and Strube 2006). MMAX2 is based on token standoff technology: the annotated anaphoric information is stored in a phrase level whose markables point to a base layer in which each token is represented by a separate xml element. Because of the need to encode ambiguity and bridging references, anaphoric information is encoded using mmax2 pointers, linking together pairs of mentions and specifying the discourse relations between them. This is in contrast with the commonly used set-based annotations (e.g., in the OntoNotes scheme), where each mention is only labeled with the id of the corresponding discourse entity and no relations are annotated. Note that set-based annotation for identity anaphora can be induced from such pointers in a straightforward way, as sketched below.
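As a concrete illustration of this last point, the following sketch (our own; the element and attribute names are simplified stand-ins, not the exact mmax2/arrau schema) reads a toy phrase-level file in which discourse-old markables carry antecedent pointers, and induces set-based coreference chains with a union-find pass.

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for an mmax2 phrase-level file: each markable has an id,
# a token span, and (for discourse-old mentions) a pointer to its antecedent.
PHRASE_LEVEL = """
<markables>
  <markable id="m1" span="word_1..word_2" category="new"/>
  <markable id="m2" span="word_7" category="old" antecedent="m1"/>
  <markable id="m3" span="word_12..word_13" category="new"/>
  <markable id="m4" span="word_20" category="old" antecedent="m2"/>
</markables>
"""

def induce_chains(xml_text):
    """Turn antecedent pointers into set-based coreference chains (union-find)."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    root = ET.fromstring(xml_text)
    for m in root.iter("markable"):
        parent.setdefault(m.get("id"), m.get("id"))
    for m in root.iter("markable"):
        ant = m.get("antecedent")
        if ant is not None:
            parent.setdefault(ant, ant)
            parent[find(m.get("id"))] = find(ant)   # union anaphor and antecedent

    chains = {}
    for mid in parent:
        chains.setdefault(find(mid), set()).add(mid)
    return list(chains.values())

# Two chains are induced: {m1, m2, m4} and the singleton {m3}.
print(induce_chains(PHRASE_LEVEL))
```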
3 From arrau 1 to arrau 2: Checking annotation consistency
The first release of arrau (Poesio and Artstein 2008) was made publicly available in 2008. The second release of arrau has augmented the corpus by annotating all the documents available within the trains and rst datasets. This has resulted in a significant increase in data size. This quantitative improvement is especially important for the TRAINS domain, since it provides a unique, large collection of dialogs annotated with anaphoric information. More statistics for both releases of arrau are provided in Table 8.
Table 8. Corpus statistics for two releases of arrau
Most importantly, between the two releases we have invested considerable effort in enforcing annotation consistency. We believe that a large and complex annotation project such as arrau, undergoing several rounds of manual adjudication and revision, should implement specific measures for preserving and improving data quality. Unfortunately, the nlp community does not pay enough attention to data consistency beyond inter-annotator agreement. A notable exception is a series of studies by Dickinson and Meurers (2003) and Boyd et al. (2008) on enforcing consistency in syntactic treebanks, as well as more recent approaches (Dickinson and Lee 2008; Hollenstein et al. 2016) to identifying errors in basic semantic annotation (predicate-argument structure, multi-word expressions, and super-sense tagging). These studies rely on corpus statistics (e.g., n-gram or production rule frequencies) to identify annotation anomalies. Differently from these studies, we assess the interaction between multiple annotation layers and derive constraints to identify inconsistencies and thus improve the overall labeling. A similar approach, albeit on a much smaller scale, was adopted by Frank et al. (2012) for improving the labeling quality of the automatic annotation of multiple NLP phenomena in a domain adaptation experiment.
In what follows, we describe our effort to enforce the formal consistency of the arrau data, in the hope of starting a discussion and making first steps toward establishing good practice in this respect. The arrau scheme assumes simultaneous labeling of a variety of closely related phenomena, and therefore different parts of the mark-up can be used to derive constraints for semi-automatic clean-up. For example, we can ensure that a non-referring markable is not marked as participating in a coreference chain. All the violating cases can be extracted automatically and then further checked and re-annotated manually. In a few cases, these constraints revealed intriguing cases of anaphoric expressions. Mostly, however, they have helped us identify and eliminate clear annotation errors.
3.1 Enforcing annotation consistency in arrau
A significant effort has been devoted to improving not only the quantity, but also the quality of the material annotated within the arrau project. To this end, we have implemented the following measures for the second release of the dataset (a sketch of how such checks can be automated follows the list):
Minimal and maximal spans, genericity, and referentiality have been (re-)annotated for all the documents. This enforces consistency across domains and allows for more principled cross-domain studies of the relevant phenomena. We have expanded our annotation of reference and genericity to all the domains, adopting a more principled approach. This has resulted in a more consistent annotation of reference: more than 10% additional non-referring markables have been added to the documents already covered in arrau-1. For genericity, the first release only attempted a pilot annotation for the RST domain.
All the unspecified attributes have been re-annotated.
Morphological attributes have been checked across coreference chains. For example, a typical chain should not include two mentions of different gender. All the violating cases have been assessed manually.
Semantic type has been checked for consistency across coreference chains.
All the non-referring markables have been checked to exclude their participation in coreference chains. While the annotation scheme does not allow non-referentials to be anaphors, no mmax2 functionality prevents a non-referring markable from being selected as an antecedent.
All the mentions labeled as discourse-old have been assigned an antecedent.
Basic bracketing constraints have been enforced: no nominal markables should intersect each other or sentence boundaries.
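The checks listed above lend themselves to straightforward automation. The following sketch (with a hypothetical mention representation; it is not one of the arrau release scripts) flags three of the inconsistency types we searched for: non-referring markables used as antecedents, discourse-old mentions without an antecedent, and gender clashes within a coreference chain.

```python
from collections import defaultdict

def check_consistency(mentions):
    """Flag some of the inconsistencies described above.

    `mentions` is a list of dicts with illustrative keys:
      id, reference ('referring'/'non-referring'), category ('old'/'new'),
      antecedent (id or None), chain (chain id or None), gender.
    Returns human-readable problem reports for manual re-annotation.
    """
    problems = []
    by_id = {m["id"]: m for m in mentions}
    chains = defaultdict(list)

    for m in mentions:
        if m.get("chain") is not None:
            chains[m["chain"]].append(m)
        # Discourse-old mentions must point to an antecedent.
        if m.get("category") == "old" and m.get("antecedent") is None:
            problems.append(f"{m['id']}: discourse-old without antecedent")
        # Non-referring markables must not serve as antecedents.
        ant = by_id.get(m.get("antecedent"))
        if ant is not None and ant.get("reference") == "non-referring":
            problems.append(f"{m['id']}: antecedent {ant['id']} is non-referring")

    # Morphological agreement within chains (here: gender only).
    for cid, chain in chains.items():
        genders = {m["gender"] for m in chain if m.get("gender")}
        if len(genders) > 1:
            problems.append(f"chain {cid}: conflicting genders {sorted(genders)}")
    return problems

# Toy input reproducing the three error types discussed in the text.
mentions = [
    {"id": "m1", "reference": "non-referring", "category": "new", "antecedent": None, "chain": None, "gender": None},
    {"id": "m2", "reference": "referring", "category": "old", "antecedent": "m1", "chain": "c1", "gender": "masc"},
    {"id": "m3", "reference": "referring", "category": "old", "antecedent": None, "chain": "c1", "gender": "fem"},
]
print(check_consistency(mentions))
```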
The result of this effort has been two-fold. On the one hand, we have identified and removed various typos and inconsistencies that inevitably arise as a result of manual annotation. Table 9 shows the number of problematic cases for the three most common types of errors. Most of these cases are plain annotation mistakes: sometimes an incorrect labeling is introduced at the initial annotation stage; more often, however, the errors are by-products of post-corrections, either by the supervisor or by the annotators themselves.
Table 9. Enforcing annotation quality: inconsistency statistics for the first release of arrau, most common types of errors
For example, in (34), the annotator erroneously assigned an incorrect semantic type (space) to a mention of the dollar. In (35), the annotator marked That as discourse old, but failed to provide a suitable (segment) antecedent. In (36), the annotator marked a non-referring markable as an antecedent, not distinguishing between coreference and other anaphoric phenomena. Finding such errors manually can be very tedious, as it requires careful supervision of each markable and all its attributes. The availability of multiple annotation levels, by contrast, allows such mistakes to be listed immediately.
(34) … thus dumping $[\textrm{dollar}]_1^{\textrm{abstract}}$ demand… Japanese institutions are comfortable with $[\textrm{the dollar}]_1^{\textrm{space}}$ anywhere between current levels and 135 yen.
(35) … production could increase to 23 millions or 24 millions barrels a day … [That] would send prices plummeting…
(36) We weren’t allowed to do $[\textrm{any due diligence}]_1^{\textrm{non-referential}}$ because of competitive reasons. If we had, [it]1 might have scared us.
The following example illustrates a rather common problem with annotation projects that undergo several rounds of manual correction and adjudication. While each revision may fix some errors locally, state-of-the-art annotation tools do not provide functionality for ensuring global data consistency.
(37) [Mr Dinkins]1’ position papers have more consistently reflected anti-development sentiment. [He]2 favors a form of commercial rent control.
Here, the (rather large) coreference chain for Mr Dinkins underwent several revisions, with individual mentions being deleted and re-annotated. As a result, some other annotations, for example, the one for He, became corrupted. Note that the mention He was not re-annotated per se; it merely contained a link to a mention that underwent deletion and re-annotation.
On the other hand, our quality control procedures have revealed, through identifying conflicting attributes within coreference chains, cases of coreference that are problematic for annotators and therefore lead to inconsistent labeling. We have identified two types of difficulties. First, some examples require practical decisions that should have been discussed in the guidelines. Consider the following snippets:
(38) [Mr. Wathen]1 says. “Their approach didn’t work, $[\textrm{mine}]_1^{\textrm{abstract}}$ is.”
(39) Currency analysts around the world have toned down their assessment of $[\textrm{the dollar}]_1^{\textrm{concrete}}$’s near-term performance… He said he expects U.S. interest rates to decline, dragging $[\textrm{the dollar}]_1^{\textrm{abstract}}$ to $[\textrm{around 1.80 marks}]_2^{\textrm{abstract}}$… I can’t really see it dropping far below $[\textrm{1.80 marks}]_2^{\textrm{num}}$.
In (38), the annotators had difficulty labeling the mention mine, since the guidelines have no specific instructions on how to label this type of possessive. This resulted in an inconsistent labeling of mine as an abstract object (thus referring to Mr. Wathen’s approach) coreferent with the person entity (Mr. Wathen). For (39), the guidelines provide no explicit instructions for assigning a semantic class to currencies, resulting in very inconsistent labeling, with four different values within the same document. Clearly, no annotation guidelines are perfectly complete, so we believe that semi-automatic consistency checks can help identify and clarify such issues and consequently lead to better schemes with higher inter-annotator agreement.
Second, some semantic and discourse-level phenomena are intrinsically difficult to annotate. In particular, we have seen many inconsistent semantic class labelings. These cover cases where annotators cannot decide reliably on a unique semantic class for the whole chain, for example, cases of regular metonymy:
(40) $[\text{Kellog}]_1^{\text{organization}}$’s spokesman said … “As $[\text{we}]_1^{\text{person}}$ regain our leadership….”
Analyzing the data consistency logs, we have identified a number of truly challenging cases of coreference, both in terms of annotation and of automatic resolution. These cases often fall into the category of near-identity coreference (Recasens et al. Reference Recasens, Hovy and Martí2011). For example, in (41), survey, data and figures are very closely related mentions. It can be argued that they all refer to the same entity—at the same time, it can also be argued that figures, representing data, are part of survey, making the case for bridging relations. Example (42) shows another tricky case, posing a challenge especially for automatic coreference resolution algorithms. Here, the same entity is described from two very different angles, using two mentions that are semantically rather dissimilar.
(41) [The Confederation of British Industry’s latest survey]1 shows… But despite mounting recession fears, [government data]1 don’t yet show the economy grinding to a halt… [The latest government figures]1 said retail prices in September were up 7.6% from a year earlier.
(42) Nearby Pasadena, Texas, police reported that [104 people]1 had been taken to area hospitals, but a spokeswoman said [that toll]1 could rise.
Near-identity coreference presents a true challenge for the community, yet it is essential for the correct interpretation of textual inputs, especially in more complex domains (e.g., fiction) with evolving entities. For example, a Machine Reading system equipped with a strong coreference resolver can suggest an informative answer (The Confederation of British Industry’s latest survey) to such queries as Which source is optimistic about the current economic situation? or Where can I find the data on the recent retail price trends?—whereas without coreference, the answer would be rather superficial and not helpful for the user (government data or the latest government figures).
The detailed analysis of such examples constitutes a part of our ongoing work. Note that producing a non-negligible number of challenging examples has only been possible as a by-product of our thorough, linguistically motivated annotation, for example, through a conflict between coreference and non-referentiality annotations.
4 Related work: ARRAU vs. other anaphoric corpora
A number of anaphorically annotated corpora have appeared in the past two decades—an extensive overview can be found in Poesio et al. (Reference Poesio, Pradhan, Recasens, Rodriguez, Versley, Poesio, Stuckardt and Versley2016)—but very few of these cover the range of genres and the types of anaphoric relations annotated in arrau, such as bridging reference and discourse deixis, and much of this work started after the arrau annotation began. In this section, we discuss first of all the main differences between arrau and the two most commonly used corpora annotated for coreference in English, ace (Doddington et al. Reference Doddington, Mitchell, Przybocki, Ramshaw, Strassell and Weischedel2004) and OntoNotes (Pradhan et al. Reference Pradhan, Ramshaw, Marcus, Palmer, Weischedel and Xue2011; Weischedel et al. Reference Weischedel, Hovy, Marcus, Palmer, Belvin, Pradhan, Ramshaw, Xue, Olive, Christianson and McCary2011; Pradhan et al. Reference Pradhan, Moschitti, Xue, Uryupina and Zhang2012). We then discuss related work on annotating genres other than news, semantic properties of mentions such as referentiality and genericity, bridging references, and discourse deixis.
4.1 ARRAU vs. ACE vs. OntoNotes
Table 10 provides a summary of the most distinctive features of arrau as opposed to ace and OntoNotes.
The most prominent feature of arrau is its rich linguistically motivated annotation of mentions and relations between them. Thus, unlike ace and OntoNotes, arrau combines identity coreference with a number of related phenomena, such as referentiality, genericity, discourse deixis, and bridging. Moreover, we allow for ambiguity between different relations. The other datasets focus mainly on identity anaphora, with references to events being annotated in OntoNotes. We believe that it is very important to have the same corpus annotated for different anaphora-related phenomena to allow for deeper linguistic analysis and joint modeling. In this respect, arrau follows the line adopted by the Prague Dependency Tree Bank (Nedoluzhko et al. Reference Nedoluzhko, Mírovský, Ocelák and Pergler2009b), where several anaphoric relations are encoded for the same textual material, going beyond identity anaphora.Footnote n
Table 10. Comparison across anaphorically annotated corpora
Information marked + is annotated, – is not annotated, and ± is partially annotated.
In arrau, each markable is annotated with its minimal and maximal span. This solution is in line with the ace annotation guidelines; it is unfortunate that it was not adopted in OntoNotes, presumably to decrease the annotation cost and thus augment the corpus size. The maximal span corresponds to the full noun phrase, whereas the minimal span corresponds to the head noun or to the bare named entity for complex ne-nominals. With the latest developments in parsing technology, it might seem redundant to include minimal spans in the manual annotation directly: using dependencies or constituents with head-finding rules, one might expect to extract the minimal span for each np rather reliably. It has been shown, however, that naive parsing-based heuristics do not lead to the best performance, and a coreference resolver might benefit considerably from explicit or latent identification of minimal spans or heads (Zhekova and Kübler Reference Zhekova and Kübler2013; Peng et al. Reference Peng, Chang and Roth2015). Moreover, explicitly annotated minimal spans allow for better lenient matching, which has been shown to improve the training procedure of coreference resolvers through better alignment of automatically extracted and gold mentions (Kummerfeld et al. Reference Kummerfeld, Bansal, Burkett and Klein2011). Finally, minimal spans can be intrinsically difficult to extract from non-conventional documents, such as dialog transcripts or social media, due to the low quality of parsing technology for such data [cf. an overview of parsing technology across domains/genres (Versley Reference Versley2005), as well as a recent study discussing numerous problems related to syntactic parsing specific to conversational data (Nasr et al. Reference Nasr, Damnati, Guerraz and Bechet2016)]. We believe therefore that the combination of minimal and maximal spans is the most reliable way of annotating mention boundaries for coreference. The second release of arrau provides minimal and maximal spans for all the domains.
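To illustrate why parsing-based extraction of minimal spans is less straightforward than it may appear, the following hypothetical sketch implements the kind of naive heuristic alluded to above; it is not part of the arrau tooling, and the token lists are illustrative.

```python
# Naive, purely illustrative head-finding heuristic: guess the minimal span as the
# last token before the first postmodifier. Real parse-based heuristics are more
# elaborate, but they fail in similar ways on complex ne-nominals and noisy parses,
# which is one reason ARRAU annotates minimal spans explicitly.

POSTMODIFIER_STARTERS = {"of", "in", "on", "for", "with", "that", "which", "who", ","}

def naive_min_span(tokens):
    """Return (start, end) token indices (inclusive) of a guessed minimal span."""
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lower() in POSTMODIFIER_STARTERS:
            return (i - 1, i - 1)              # last token before the postmodifier
    return (len(tokens) - 1, len(tokens) - 1)  # default: final token of the NP

print(naive_min_span(["the", "dollar"]))
# -> (1, 1): "dollar", as intended
print(naive_min_span(["the", "Confederation", "of", "British", "Industry"]))
# -> (1, 1): "Confederation" only, whereas the annotated minimal span for a
#    complex ne-nominal would be the whole bare named entity
```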
Consistent with linguistic views on nominal expressions, arrau supports discontinuous mentions. The ace mark-up could potentially allow for discontinuous mentions, but the guidelines explicitly instruct the annotators to always select contiguous chunks. The conll mark-up is not expressive enough to support discontinuous mentions.
In arrau, all types of markables are annotated. In particular, we label singletons (mentions that do not participate in coreference chains) and non-referring nps. The ace guidelines restrict the annotation scope to referentials,Footnote o whereas in OntoNotes only co-referential mentions are marked, not singletons. Our corpus statistics show that non-referring markables and singleton mentions account for up to one third of all the markables. Again, restricting the annotation scope allows for reducing the manual effort per document and thus for increasing the corpus size. However, a dataset with all nominal expressions annotated provides material for training mention detection systems. Mention detection for OntoNotes (Kummerfeld et al. Reference Kummerfeld, Bansal, Burkett and Klein2011; Uryupina and Moschitti Reference Uryupina and Moschitti2013) is a non-trivial problem that is further aggravated by the fact that singletons are removed, so that direct training becomes hardly possible.
Each markable in arrau is annotated with its basic morphological properties: number, gender and semantic class. This allows, again, for training markable-level classifiers to assign these features automatically. As with minimal spans, this task can be attempted via heuristics based on parse trees; however, one can expect higher performance if these tasks are addressed in a data-driven way.
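As an illustration of what such a data-driven attribute classifier might look like, the sketch below trains a toy semantic class classifier with scikit-learn; the features, labels and training examples are purely illustrative and do not reproduce the arrau attribute inventory.

```python
# A minimal sketch of a markable-level attribute classifier (semantic class),
# assuming scikit-learn is available. Feature names and labels are hypothetical.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def markable_features(tokens, head):
    # Simple surface features of the markable and its head.
    return {
        "head": head.lower(),
        "first_token": tokens[0].lower(),
        "length": len(tokens),
        "head_capitalized": head[0].isupper(),
    }

# Toy training data; in practice the features would be extracted from the
# annotated markables and paired with their gold attribute values.
train_X = [
    markable_features(["the", "dollar"], "dollar"),
    markable_features(["Mr.", "Wathen"], "Wathen"),
    markable_features(["current", "levels"], "levels"),
]
train_y = ["abstract", "person", "abstract"]

clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_X, train_y)
print(clf.predict([markable_features(["the", "yen"], "yen")]))
```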
The text collections used in arrau have been annotated for a variety of relevant discourse-level properties by other projects. For example, our news documents are taken from the RST treebank, and thus further annotations can be induced from RST to investigate possible interactions between coreference and rhetorical structure.Footnote p The OntoNotes dataset, by contrast, provides valuable gold annotations of lower-level phenomena (e.g., gold part-of-speech tags or parse trees), but does not, to our knowledge, provide deep discourse-level annotations apart from coreference.Footnote q We believe that a careful analysis of the overlapping documents, annotated within both the arrau and OntoNotes schemes, will provide valuable insights for computational modeling of coreference/anaphora.
To summarize, the arrau dataset provides a high-quality refined annotation of anaphora and related phenomena. It relies on much more detailed and specific annotation guidelines than other commonly used corpora. We believe therefore that while the OntoNotes corpus, being much larger, is of crucial importance for data-intensive modeling of linguistically easier cases of coreference, arrau can be valuable, on the one hand, for deeper linguistically oriented analysis of complex cases and, on the other hand, for learning models for related phenomena (genericity, referentiality, etc.).
4.2 Genres: beyond news
When the arrau annotation started in 2004, the main available corpora for studying coreference/anaphora resolution in English, such as muc and ace, focused on news content; there were a few resources covering other types of text, such as the gnome corpus already mentioned; but the only corpora covering English spoken dialog were Byron and Allen’s annotation of pronouns in the trains corpus (Byron and Allen Reference Byron and Allen1998) and the annotation of part of the sherlock corpus of task-oriented instructional dialog included in gnome and used for the study of anaphora and discourse structure reported in Poesio et al. (Reference Poesio, Patel and Di Eugenio2006a).Footnote r
In the years since, the situation has improved. Corpora covering textual genres other than news now exist, such as nlp4events of instructional manuals (Hasler et al. Reference Hasler, Orasan and Naumann2006), the genia corpus of biomedical text (Nguyen et al. Reference Nguyen, Kim and Tsujii2008) or the TED talks dataset annotated for coreference within the ParCor project (Guillou et al. Reference Guillou, Hardmeier, Smith, Tiedemann and Webber2014). For dialog, Müller annotated pronominal reference in the ICSI spoken multi-partner conversation corpus (Müller Reference Müller2008). Most importantly, the latest releases of OntoNotes and of the Prague Dependency Treebank include substantial amounts of spoken text—for instance, release 5 of OntoNotes contains, apart from newswire and broadcast news, a substantial amount of broadcast conversation and telephone conversation, as well as web data. And the recently created gecco corpus (Lapshinova-Koltunski and Kunz Reference Lapshinova-Koltunski and Kunz2014) covers a variety of genres including spoken language used both formally and informally (but as far as we know this corpus is not yet available).
4.3 Mention attributes
Referentiality
As mentioned above, the annotation schemes for coreference used in the best-known resources for English (the muc and ace corpora, OntoNotes) do not require the annotation of non-referring expressions, or even singletons. But several of what we have called the “new wave” of linguistically motivated corpora, particularly those with a syntactic definition of mention, do, although typically (non-)referentiality is indicated in a more indirect way than in arrau. In tüba/dz (Hinrichs et al. Reference Hinrichs, Kübler and Naumann2005), for instance, an expletive attribute is used to mark pleonastic instances of the impersonal third person singular pronoun es (it). No other types of non-referentiality seem to be marked. In ancora (Recasens and Martí Reference Recasens, àrquez, Sapena, Mart, Taulé, Hoste, Poesio and Versley2010), mentions are automatically extracted from the syntactic tree, and an entityref attribute is used to mark referring nps, so non-referring mentions can be identified although again the representation is not quite so explicit as in arrau.Footnote s
Genericity
When the arrau annotation started, we were only aware of one attempt at marking the genericity status of mentions apart from our own efforts in gnome—the annotation of the entity-class attribute in ace-2, with values generic and specific— but there has been some more work in this area since.
The ace-2 Entity Detection and Tracking Guidelines do provide instructions for distinguishing generic from specific mentions, relying heavily on examples to address the problems we encountered in gnome. We do not know, however, whether these difficulties were in fact solved, as we are not aware of any results regarding the reliability of these guidelines. This annotation was nevertheless used to train one of the first models of automatic genericity classification (Reiter and Frank Reference Reiter and Frank2010). Herbelot and Copestake (Reference Herbelot, Copestake, Featherston and Winkler2008) developed an interesting scheme, strongly rooted in the literature on genericity in formal linguistics (Carlson and Pelletier Reference Carlson and Pelletier1995), still attempting to capture genericity and specificity but treating them as two separate dimensions of classification, as done in Carlson and Pelletier (Reference Carlson and Pelletier1995). The first version of the scheme provides a label for generic entities (gen), which would be the equivalent of the arrau label generic-yes; one for non-generic, specific entities (spec); and one for non-generic, non-specific entities (non-spec). In addition, the label amb is used for references ambiguous between a generic and a non-generic reading (like undersp-gen in the arrau scheme), and the label group for references to subgroups of a generic entity. This version of the scheme achieves a reliability similar to that of the second version of the gnome scheme for genericity. The authors then developed a second set of guidelines, based on the same scheme but providing detailed instructions for a number of special cases; with this second set of guidelines they achieve a reliability of κ = 0.74. In the Prague Dependency Treebank, all nominals are marked as generic or specific, and coreference relations are only marked between nominals with the same category (generic or specific). Recently, a systematic analysis of coreference with generic nps was carried out by Nedoluzhko (Reference Nedoluzhko2013). Finally, regarding manual annotation, we should mention the recent and very interesting work by Friedrich et al. (Reference Friedrich, Palmer, Sorensen and Pinkal2015), who annotated clauses and subjects for genericity, a type of annotation that would be a very useful preliminary step towards the annotation of genericity of mentions in arrau. Friedrich and colleagues reported an average agreement of around κ = .56.
4.4 Other corpora annotated for more complex forms of anaphoric reference
Corpora annotated for bridging references
At the time the arrau annotation started, there had not been many other attempts to annotate bridging reference apart from our own work as part of the Vieira / Poesio corpus (Poesio and Vieira Reference Poesio and Vieira1998) and the gnome corpus (Poesio Reference Poesio2004a),Footnote t but there have been a number of efforts since, many of which have attempted to annotate a broader range of the bridging relations identified in the early literature (Passonneau Reference Passonneau1997; Davies et al. Reference Davies, Poesio, Bruneseaux and Romary1998; Vieira Reference Vieira1998).
One of the most ambitious such efforts in terms of coverage of relations is the work by Gardent and Manuélian as part of the annotation of the DeDe corpus (Gardent and Manuélian Reference Gardent and Manuélian2005). Gardent and Manuélian annotate a range of bridging relations including, apart from the part relations encoded in arrau, a more general circumstantial relation covering a variety of relations. The annotation was also carried out using mmax2, and the markup scheme is highly compatible with that used in arrau. No agreement results were, however, reported as far as we are aware.
Possibly the most extensive effort towards annotating bridging carried out in parallel with the annotation of arrau is the annotation of bridging coreference in the Prague Dependency Treebank (Nedoluzhko et al. Reference Nedoluzhko, Mírovský, Ocelák and Pergler2009a). Nedoluzhko et al. distinguish, apart from part and subset relations,
A funct relation covering function-value relations, as proposed in mate (Davies et al. Reference Davies, Poesio, Bruneseaux and Romary1998)
A new contrast relation covering relations between opposites (People don’t chew, it’s cows who chew)
A more general underspecified group rest, which is used for capturing other types of bridging references such as event argument.
Nedoluzhko et al. measured interannotator agreement using a combination of F1 values (for the antecedent) and κ (for the relation), achieving F1=.59 for the antecedent, and κ = .88 for the relation.
Another substantial annotation effort was carried out by Hou, Markert and Strube (Markert et al. Reference Markert, Hou and Strube2012), who annotated ISNotes, a corpus of 11,000 nps in 50 texts taken from the wsj portion of OntoNotes, for information status, building on the scheme by Nissim et al. (Reference Nissim, Dingare, Carletta and Steedman2004) but also annotating the anchors of 663 bridging nps. Their scheme also expands on the definition of mediated from Nissim et al. by including other anaphora among the bridging references, as well as the funct relation. An interesting analysis of the differences between the notion of bridging reference annotated in arrau and that annotated in ISNotes can be found in Roesiger (Reference Roesiger2018).
Finally, we should mention that there has been quite a lot of research on bridging reference in corpus linguistics, which, while not producing usable corpora, did involve developing annotation guidelines—a notable example being the work by Botley (Reference Botley2006).
Discourse Deixis
Annotating discourse deixis is another task not tackled in all large-scale anaphoric annotations, but there have been a number of efforts preceding, parallel with, and subsequent to the effort in arrau and the already discussed studies of agreement in discourse deixis annotation (Artstein and Poesio Reference Artstein, Poesio, Schlangen and Fernandez2006).
Prior to arrau, we should mention the seminal work by Byron, who annotated pronominals and demonstratives in the trains-93 corpus, including abstract objects (Byron and Allen Reference Byron and Allen1998), and used the data to develop the first resolver of references to abstract objects we are aware of (Byron Reference Byron2002); Navarretta, who in parallel with Byron carried out similar studies of abstract reference in Danish and Italian (Navarretta Reference Navarretta2000, Reference Navarretta and Johansson2008); and Eckert and Strube (Reference Eckert and Strube2000), who analyzed references in dialogs to both concrete and abstract objects. We should also mention the illuminating work by the corpus linguists Gundel, Hedberg, Hegarty, and associates on reference to “clausally introduced entities” (Gundel et al. Reference Gundel, Hegarty and Borthen2003, Reference Gundel, Hedberg and Zacharski2002), which was an important influence on our own work.
The most notable effort carried out in parallel with the work on arrau is the work in OntoNotes on annotating event anaphora, an important category of reference to abstract objects.
Following the first work in arrau on annotating discourse deixis, a number of studies appeared that attempted to annotate a comparable subset of the phenomenon in other languages. Most notable among these efforts are the work in ancora on discourse deixis in Catalan and Spanish (Recasens Reference Recasens and Johansson2008) and the work by Dipper and Zinsmeister (Reference Dipper and Zinsmeister2012) on abstract anaphora in German. More recently, a very systematic analysis and annotation of another subset of discourse deixis, so-called shell nouns, has been carried out by Kolhatkar (Reference Kolhatkar2014) and Kolhatkar and Hirst (Reference Kolhatkar and Hirst2012).
Ambiguity
Anaphoric ambiguity is a very understudied phenomenon, and there have been hardly any other attempts to create a corpus in which the ambiguity of expressions is marked. The coreference annotation carried out by Krasavina and Chiarcos (Reference Krasavina and Chiarcos2007) as part of the work on the Potsdam Commentary Corpus (Stede Reference Stede2004) is the only other coreference annotation scheme we are aware of that asks coders to mark ambiguity. The guidelines produced by Chiarcos and Krasavina (Reference Chiarcos and Krasavina2005) require coders to use the ambiguity mention attribute to indicate ambiguous mentions and the type of ambiguity: for instance, ambig-ante if the mention is clearly discourse-old but it is not clear what the antecedent is, or ambig-expl for instances of es (“it”) that could be interpreted either as anaphoric or as expletive.
We must admit, however, that the results of Poesio and Artstein (Reference Poesio, Artstein and Meyers2005a) convinced us that this aspect of the arrau guidelines clearly needs rethinking, and in our research since we have taken a completely different direction, aiming to capture implicit ambiguity rather than the explicit ambiguity marked in arrau. Indeed, this aim was one of the primary motivations behind the Phrase Detectives project (Poesio et al. Reference Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi2013), which has developed a Game-With-A-Purpose to elicit from players multiple interpretations for anaphoric expressions (25 on average and as many as 32 in some cases). The Phrase Detectives game uses a simplified version of the arrau coding scheme, but all interpretations are stored, and preliminary analysis suggests that around 40% of mentions have at least two interpretations selected by at least two players. A first, small subset of the Phrase Detectives corpus was recently released via ldc (Chamberlain et al. Reference Chamberlain, Poesio and Kruschwitz2016). More recently, this line of research has led us to start the dali project,Footnote u in which, besides the development of more Games-With-A-Purpose to study anaphora, methods to compare and analyze these interpretations will also be developed.
5 Anaphora resolution with arrau
In this section, we briefly discuss anaphora resolution work that has used arrau.
5.1 Identity anaphora
Rodriguez (Reference Rodriguez2010) used a preliminary release of arrau 2—about half the size of the final release, but already annotated with MIN information—to carry out a comparative analysis of anaphora resolution in English and Italian. Using bart (Versley et al. Reference Versley, Ponzetto, Poesio, Eidelman, Jern, Smith, Yang and Moschitti2008), he compared the difficulty of anaphora resolution in arrau and in the two more widely used corpora at the time, MUC-7 and ACE02. He also studied the effect of using MIN information to ascribe partial credit (50%) whenever a system markable overlaps with the minimal span of a gold markable and the boundaries of the system markable do not exceed those of the gold markable, as done in muc. He found that assigning such partial credit substantially improves the scores.
Uryupina and Poesio (Reference Uryupina and Poesio2012) explored the effect of domain adaptation in anaphora resolution, comparing the results obtained by training different versions of bart separately for each domain or on the entire dataset. They did this on both arrau 2 and OntoNotes, thus providing what is to our knowledge the only comparison between the two corpora in terms of system performance. Table 11 summarizes the results.
Table 11. Results from Uryupina and Poesio (Reference Uryupina and Poesio2012): running bart on different arrau genres and on different OntoNotes genres (muc score)
5.2 Discourse Deixis
Marasović et al. (Reference Marasović, Born, Opitz and Frank2017) developed an approach to abstract anaphora resolution based on bi-directional LSTMs to produce representations of the anaphor and the candidate sentence, and a mention ranking component adapted from the systems by Clark and Manning (Reference Clark and Manning2016) and Wiseman et al. (Reference Wiseman, Rush, Shieber and Weston2015). The system was tested using both the dataset by Kolhatkar et al. (Reference Kolhatkar, Zinsmeister and Hirst2013) (for shell nouns) and the discourse deixis cases in arrau.
5.3 The CRAC 2018 shared task
The first evaluation campaign based on arrau (Poesio et al. Reference Poesio, Grishina, Kolhatkar, Moosavi, Roesiger, Roussel, Simonjetz, Uma, Uryupina, Yu and Zinsmeister2018) was organized in connection with the 2018 naacl Workshop on Computational Models of Reference, Anaphora and Coreference (crac).Footnote v The shared task was composed of three subtasks: Task 1 on identity anaphora resolution, Task 2 on bridging reference, and Task 3 on discourse deixis.
Datasets
Three separate datasets were made available for the three distinct tasks. All three datasets were in the format used for the evalita-2011 evaluation campaign (Uryupina and Poesio 2013), which in turn was derived from the tabular conll-style format used in the semeval 2010 shared task on multilingual anaphora (Recasens et al. Reference Recasens and Martí2010). (See Poesio et al. Reference Poesio, Grishina, Kolhatkar, Moosavi, Roesiger, Roussel, Simonjetz, Uma, Uryupina, Yu and Zinsmeister2018 or the shared task page for further details on the format.) Three of the arrau sub-corpora were used: pear, rst and trains.
Evaluation Scripts
Three evaluation scripts were developed for the three tasks.
The coreference evaluation script developed by Moosavi and Strube, which, in turn, builds upon the official conll implementation (Pradhan et al. Reference Pradhan, Luo, Recasens, Hovy, Ng and Strube2014), was modified to produce the scorer for Task 1. We will refer to this script as ‘the extended coreference scorer’.Footnote w The extended scorer, when run excluding non-referring expressions and singletons and ignoring MIN information, evaluates a system’s response using the same metrics (indeed, a reimplementation of the same code) as the standard conll evaluation script, v8 (Pradhan et al. Reference Pradhan, Luo, Recasens, Hovy, Ng and Strube2014).Footnote x When required to use MIN information, the extended scorer follows the muc convention, and considers a mention boundary correct if it contains the MIN and doesn’t go beyond the annotated maximum boundary. When singletons are to be considered, singletons are also included in the scores (all metrics apart from muc can deal with singletons). Finally, when run in all-markables mode, the script scores referring and non-referring expressions separately. Referring expressions are scored using the conll metrics; for non-referring expressions, the script evaluates P, R and F1 at non-referring expression identification. The extended coreference scorer is available from Moosavi’s github at https://github.com/ns-moosavi/coval.
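Concretely, the MIN convention amounts to a simple boundary test; the helper below is a hypothetical sketch of that test, not code taken from the extended scorer.

```python
def matches_gold(sys_start, sys_end, min_start, min_end, max_start, max_end):
    """MUC-style MIN matching: a predicted mention boundary counts as correct
    if it covers the annotated MIN span and stays within the annotated maximal
    boundary. All offsets are inclusive token indices."""
    covers_min = sys_start <= min_start and sys_end >= min_end
    within_max = sys_start >= max_start and sys_end <= max_end
    return covers_min and within_max

# Gold markable spanning tokens 0-7, with MIN at token 7 (e.g. the head noun):
assert matches_gold(6, 7, 7, 7, 0, 7)       # shorter span covering the head: accepted
assert not matches_gold(7, 9, 7, 7, 0, 7)   # extends past the maximal boundary: rejected
```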
The evaluation script for Task 2 is based on the evaluation method proposed by Hou et al. (Reference Hou, Markert and Strube2013). The script separately measures precision and recall at anchor entity recognition (e.g., whether set_3 is the right coreference chain) and at anchor markable detection (i.e., whether markable_308 is the appropriate markable of set_3). Note that whereas the identification of the anchoring entity is considered correct whenever the right coreference chain is identified, irrespective of the particular anchor markable chosen, the identification of the anchor markable is strict, i.e., it is only considered correct if the same markable as annotated is found.
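The following sketch illustrates the distinction between the two measures; the data structures are hypothetical, and the code is not taken from the shared task script.

```python
def bridging_scores(system, gold, chain_of):
    """system, gold: dicts mapping a bridging markable id to its predicted /
    annotated anchor markable id; chain_of: markable id -> coreference chain id.
    Anchor-entity recognition is lenient with respect to which markable of the
    gold chain is chosen; anchor-markable detection requires the exact markable."""
    entity_hits = markable_hits = 0
    for bridge, gold_anchor in gold.items():
        sys_anchor = system.get(bridge)
        if sys_anchor is None:
            continue
        same_chain = (chain_of.get(sys_anchor) is not None
                      and chain_of.get(sys_anchor) == chain_of.get(gold_anchor))
        entity_hits += int(same_chain)                     # right chain, any markable of it
        markable_hits += int(sys_anchor == gold_anchor)    # exactly the annotated markable
    precision_entity = entity_hits / len(system) if system else 0.0
    recall_entity = entity_hits / len(gold) if gold else 0.0
    recall_markable = markable_hits / len(gold) if gold else 0.0
    return precision_entity, recall_entity, recall_markable
```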
Finally, the evaluation script for Task 3 computes the success@n metric proposed by Kolhatkar and Hirst (Reference Kolhatkar and Hirst2014) and also used by Marasović et al. (2017). success@n is the proportion of instances where the gold answer—the unit label—occurs within a system’s first n choices. (success@1 is standard precision.)
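A minimal sketch of the metric, with a hypothetical signature rather than the shared task implementation:

```python
def success_at_n(ranked_candidates, gold_units, n):
    """success@n: the proportion of anaphors whose gold unit label appears among
    the system's top-n ranked candidates.
    ranked_candidates: one list of candidate unit labels per anaphor, best first;
    gold_units: the annotated unit label for each anaphor."""
    if not gold_units:
        return 0.0
    hits = sum(1 for cands, gold in zip(ranked_candidates, gold_units)
               if gold in cands[:n])
    return hits / len(gold_units)

# success_at_n(preds, golds, 1) reduces to standard precision over the ranked output.
```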
Task 1: Markable detection
One of the important differences between corpora for anaphora / coreference lies in the definition of mentions (or markables, in this case). In order to compare the difficulty of markable detection in arrau with that of mention extraction in OntoNotes, we ran two markable extractors on both corpora: a few versions of a mention extractor based on the Stanford core pipeline, and our own implementation of an lstm architecture for markable detection (see Poesio et al. Reference Poesio, Grishina, Kolhatkar, Moosavi, Roesiger, Roussel, Simonjetz, Uma, Uryupina, Yu and Zinsmeister2018 for details). Two versions of this markable detector were run on the OntoNotes dataset, one optimized for F1, one for recall. The results are shown in Table 12.
Table 12. Markable detection in arrau and OntoNotes
The results suggest that markable extraction in arrau is considerably easier than mention extraction in OntoNotes. This might be due to the differences in markable definition, since singletons and non-referring nps have to be excluded in OntoNotes. But the accuracy gap might also be a result of the domain differences between OntoNotes and arrau. To test this, we ran the Stanford pipeline on the wsj portion of the OntoNotes test set. The highest score on the wsj portion is obtained by the rule-based version of the pipeline and is lower (43.1% F1) than that for the entire set. This suggests that the differences in performance are due to the more straightforward notion of markable used in arrau.
Task 1
The Stanford CORE deterministic coreference resolver (Lee et al. Reference Lee, Chang, Peirsman, Chambers, Surdeanu and Jurafsky2013) was run on the rst subset of the dataset for Task 1 as a baseline, using the division into training, development and test sets built for this subdomain in the shared task. The system was run both on gold and on predicted mentions, and evaluated using both the conll official scorer and the extended coreference scorer, first ignoring singletons and non-referring markables, then including them.
The first 10 lines of Table 13 show the results obtained by the Stanford deterministic coreference resolver when run over gold markables, scored using both the extended coreference scorer and the conll official scorer excluding both singletons (4161 markables) and non-referring markables (1391)—i.e., the same conditions as in the standard conll evaluations. In these conditions, the extended coreference scorer and the conll official scorer obtain the same scores modulo rounding. The following lines in Table 13 show the results when including singletons in the assessment; for this evaluation, the Stanford deterministic coreference resolver was made to output singletons instead of removing them prior to evaluation. When non-referring markables are included as well, the results for referring expressions remain identical, but in addition, the scorer outputs the results on those separately. (The Stanford deterministic coreference resolver does not attempt to identify non-referring markables, hence all values are 0.)
Table 13. Baseline results on Task 1. Gold markables
The first conclusion that can be drawn from this table is that the results achieved by the Stanford resolver on gold markables on this dataset are broadly comparable to the results the system achieved on gold markables in the conll 2011 shared task (a conll score of 60.7). The second observation is that the system appears quite good at identifying singletons, as its conll score in that case is over ten percentage points higher—in other words, the system is very much penalized when evaluated in the conll setting, where singletons are excluded.
Table 14 shows the results obtained by the Stanford deterministic coreference resolver when evaluated on predicted markables instead of gold markables. These are the results that are more directly comparable with those obtained by this system in the conll 2011 shared task. We can see a substantial drop in conll score, from 58.3 on predicted markables in the conll 2011 shared task to 43.2 on predicted markables with the Task 1 dataset. Most likely, this indicates that some degree of optimization to the characteristics of the conll dataset was carried out in the system, even though the system is not trained.
Table 14. Baseline results on Task 1 with predicted mentions, without MIN information
Finally, Table 15 shows the effect of using the MIN information. As can be seen from the table, this yields a gain of about five percentage points.
Table 15. Baseline results on Task 1 with predicted mentions, using MIN information
Task 2
One aspect of anaphoric interpretation for which there were no previous results with arrau is bridging reference. One group from the University of Stuttgart participated in this subtask (Roesiger Reference Roesiger2018). We summarize here the results; for further detail, see the paper.
Roesiger developed two systems, one rule-based and one ML-based. The results obtained by these systems on all three subdomains are summarized in Table 16. The three columns present the results of the two systems when (i) attempting to resolve all gold bridging references; (ii) producing results only when the system is reasonably confident; and (iii) both identifying and resolving bridging references. These results appear broadly comparable to those obtained by Hou et al. (Reference Hou, Markert and Strube2013) over the ISNotes corpus as far as the rst and trains domains are concerned, but much lower for the pear domain—although, given the small number of bridging references in this domain (354), not too much should be read into this. See Roesiger (Reference Roesiger2018) for some interesting hypotheses regarding the differences between the two corpora.
Table 16. Roesiger’s results on Task 2 for all domains
6 Conclusion
This paper presents arrau—a publicly available corpus of anaphora, annotated according to linguistically motivated guidelines. The dataset contains documents from four different genres for a total of 350K tokens.
arrau supports rich annotation of individual mentions: apart from morphosyntactic properties, we mark semantic type, genericity and referentiality. For the latter two properties, we also provide fine-grained subclassification. Apart from identity coreference, arrau guidelines cover bridging references and discourse deixis, thus providing data for joint modeling of these phenomena. We believe that the resulting resource provides valuable data for the next generation of anaphora resolvers. A few interesting studies in this direction were carried out as a result of the crac 2018 Shared Task (Poesio et al. Reference Poesio, Grishina, Kolhatkar, Moosavi, Roesiger, Roussel, Simonjetz, Uma, Uryupina, Yu and Zinsmeister2018).
The annotation scheme developed for arrau could also prove useful for future research. It has already been employed in other projects, for example, the creation of the LIVEMemories corpus of anaphora in Italian (Rodriguez et al. Reference Rodriguez, Delogu, Versley, Stemle and Poesio2010), containing texts from Wikipedia and blogs. The main distinguishing feature of the LIVEMemories coding scheme with respect to that of arrau is the incorporation of the mate / venex proposals concerning incorporated clitics and zeros in standoff schemes whose base layer is words (instead of an annotation of morphologically decomposed argument structure, as in the Prague Dependency Treebank). A second project using the arrau guidelines is the creation of the sensei corpus,Footnote y consisting of annotations of online forums in English (from The Guardian newspaper) and Italian (from La Repubblica newspaper) following similar guidelines.
Acknowledgements
The arrau corpus has been under development over several years and we are grateful to the many funding agencies that contributed to its development. Initial work was in part supported by the EPSRC-funded ARRAU Project (GR/S76434/01). Subsequent work was funded in part by the LiveMemories project, funded by the Provincia of Trento; in part by the EU Project H2020 5G-CogNet; and in part by the ERC Project dali, ERC-Adg-2015.