
InferPortOIE: A Portuguese Open Information Extraction system with inferences

Published online by Cambridge University Press:  14 December 2018

Cleiton Fernando Lima Sena
Affiliation:
Formalisms and Semantic Applications Research Group (FORMAS), LaSiD/DCC/IME—Federal University of Bahia (UFBA), Av. Adhemar de Barros, s/n, Campus de Ondina, Salvador, Bahia, Brazil
Daniela Barreiro Claro*
Affiliation:
Formalisms and Semantic Applications Research Group (FORMAS), LaSiD/DCC/IME—Federal University of Bahia (UFBA), Av. Adhemar de Barros, s/n, Campus de Ondina, Salvador, Bahia, Brazil
*Corresponding author. Email: dclaro@ufba.br

Abstract

Nowadays, there is an increasing amount of digital data. On the Web, a vast collection of heterogeneous data is generated daily, and a significant portion of it is available in natural language. Open Information Extraction (Open IE) enables the extraction of facts from large quantities of texts written in natural language. In this work, we propose an Open IE method to extract facts from texts written in Portuguese. We developed two new rules that generalize inference by transitivity and by symmetry. Consequently, this approach increases the number of implicit facts extracted from a sentence. Our novel symmetric inference approach is based on a list of symmetric features. Our results confirmed that our method outperforms closely related work both in precision and in the number of valid extractions. Considering the number of minimal facts, our approach is equivalent to the most relevant methods in the literature.

Type
Article
Copyright
© Cambridge University Press 2018 

1. Introduction

There is an increasing amount of digital data stored around the world. On the Web, the rise and popularity of the Internet have spawned a vast collection of heterogeneous data (Chang et al. 2006). A significant portion of this data is available in natural language. Manually analyzing the relevant information in such a large volume of data is a time-consuming task. For this reason, Information Extraction (IE), also called traditional IE (Wu and Weld 2010), emerged to automatically identify useful patterns in textual documents (Soderland 1999). This task initially focused on small, homogeneous, and well-known domains. However, some problems have been identified, such as the low number of facts extracted from texts of different domains and the human intervention required to extract new facts (Banko and Etzioni 2008). A new approach, called Open Information Extraction (Open IE), arose to overcome some of these difficulties. Open IE aims to extract semantic information from texts of different domains in the form of verbs (or verbal phrases) and their arguments (Mausam et al. 2012).

Open IE, when compared to traditional IE, produces more invalid facts (Banko and Etzioni 2008). A fact is invalid when the information extracted is not consistent with the information described in the sentence (Fader, Soderland, and Etzioni 2011). For example, in the sentence Peter, friend of Mary, traveled out of town, the fact (Mary, traveled out of, town) is invalid because the sentence does not state that Mary traveled. The correct fact would be (Peter, traveled out of, town). This type of error occurs because Open IE approaches such as TextRunner (Banko et al. 2007), Wikipedia-based Open Extractor (WOE) (Wu and Weld 2010), and ReVerb (Fader et al. 2011) have difficulty dealing with human writing style. A more in-depth analysis showed that the main problem with these methods is the handling of commas. Disregarding the commas in a sentence might increase the number of invalid facts and cause valid extractions to be missed.

It is also worth mentioning that extracting facts that are implicit in texts has been a challenge for Open IE approaches. For example, in the sentence Horntown is a city located in Oklahoma, current Open IE methods might extract the following facts: number 1 (Horntown, is, a city) and/or number 2 (a city, located in, Oklahoma). However, fact number 3 (Horntown, located in, Oklahoma) and fact number 4 (a city, is, Horntown), which are implicit in the sentence, are not extracted. Bast and Haussmann (2014) address this kind of problem, but their work applies to the English language and is limited to transitive inference. That is, only fact number 3 is extracted. Sena, Glauber, and Claro (2017), a Portuguese approach, use both types of inference: rules of transitivity and symmetry. For the same sentence, however, their method extracts only fact number 3, because it can apply only one rule at a time to a sentence, that is, it infers either a transitive or a symmetric fact. Moreover, the approach of Sena et al. (2017) is restricted to a few patterns of transitivity and symmetry.

In this paper, we propose an inference method for Portuguese texts that takes the sentence structure, including commas, into account. Moreover, our inferential method generalizes previous works to infer new implicit facts. Both transitive and symmetric inferences are extracted from a single sentence by means of general rules, which increases the number of facts extracted per sentence. This novel method can improve both the quality and the number of valid facts through a more careful analysis of the sentence that considers particular cases of the Portuguese language. As a consequence, our inferential method can retrieve a significant number of valid facts, particularly those implied by an inference rule.

The remainder of this paper is organized as follows: Section 2 presents background on the Open IE area and Section 3 describes our method. In Section 4 we show the settings and criteria of our experiments. Section 5 details our results and Section 6 describes some of our extractions, comparing our approach with closely related works. We discuss the results in Section 7. Section 8 presents our conclusions and future work.

2. Background

Open IE is not limited to previously defined facts. It may be applied to a set of arbitrary textual documents (Etzioni et al. 2008; Fader et al. 2011). A fact extracted through Open IE is composed of a relation between a pair of entities (Faruqui and Kumar 2015) defined in a triple form t = (e1, rel, e2), where rel corresponds to the relation between the arguments (e1, e2). In the sentence The card is in the drawer., the triple (The card, is in, the drawer) is extracted without the need to specify the relation is in or its arguments The card and the drawer. The major strengths of Open IE are: (i) domain independence; (ii) high coverage of facts; and (iii) scalability for the Web (Del Corro and Gemulla 2013).

2.1 Related works

Open IE systems were first introduced by TextRunner (Banko et al. 2007), which uses a self-supervised approach trained on positive and negative examples in English. A classifier is trained to extract the facts. Other systems, such as WOE (Wu and Weld 2010), are based on a heuristic combination of information from Wikipedia infoboxes. WOE operates in two modes: WOEpos, which uses a Part-of-Speech tagger (POS tagger); and WOEparse, which uses a dependency parser.

The second generation of Open IE systems does not rely on machine learning to represent the relationships. ReVerb (Fader et al. 2011) is the leading approach; it uses syntactic and lexical constraints to extract arguments and relations expressed by verbs in English sentences. According to the authors, the methodology adopted by ReVerb corrects some problems identified in previous systems (Etzioni et al. 2008; Wu and Weld 2010). One of these problems is the extraction of incoherent facts, that is, relations with no meaningful interpretation. In the sentence The guide contains dead links and omits sites, an incoherent relation would be contains omits. Also in the second generation of Open IE systems, the method in Sena et al. (2017) deals with the Portuguese language. This approach proposes an adaptation of the ReVerb methodology (Fader et al. 2011) to Portuguese and a syntactic constraint to identify nominal phrases. Moreover, their method includes an inference approach to extract new facts, using a binary Support Vector Machine (SVM) classifier to distinguish between transitive and symmetric classes. However, as reported in Sena et al. (2017), the trained model presents a considerable classification error (17%), which generates a high number of invalid extractions.

A new generation of methods uses dependency parsing between the morphological classes of words together with a set of rules for detecting useful parts (clauses) of a sentence. An example of this approach is DepOE (Gamallo, Garcia, and Fernández-Lanza 2012), which uses a rule-based parser to extract facts from texts and supports English, Spanish, Portuguese, and Galician. For Portuguese, the authors present only the number of extracted facts and compare it with their English results; the number of Portuguese extractions was 15× smaller than the number of English extractions. ArgOE (Gamallo and Garcia 2015) also works with different languages, one of which is Portuguese. The authors evaluated their method for Portuguese on a very small set of sentences (103 in total) from a particular domain (ecological), which makes their method difficult to compare. Their results indicated that, unlike for English, the approach does not produce satisfactory results for Portuguese (53%).

Open IE still produces a significant number of invalid facts when applied to multidomain and large-scale texts. A more in-depth analysis shows that a considerable number of extractions are lost because implicit facts are not yet extracted. Recently, methods have been developed to treat this issue. One type of method is based on inference, which can increase the number of useful facts. Open IE methods that deal with inference nevertheless remain an open matter. Our method follows this trend and exploits implicit facts to increase the number of useful facts extracted.

2.2 Inference

Inference is a mechanism used to derive new facts from texts (Bast and Haussmann 2014). We describe two types of inference: transitive and symmetric.

Following Bast and Haussmann (2014), a sentence has a transitive relation if it follows this pattern: If A relation B and B relation C then A relation C. Our example, Horntown is a city located in Oklahoma, is classified as transitive because it follows this pattern: Horntown (A) is (relation) a city (B) located in (relation) Oklahoma (C). The sentence thus provides a set of premises from which a new fact can be inferred: Horntown (A) located in (relation) Oklahoma (C). Applying transitivity makes it possible both to identify new implicit facts and to increase the number of facts extracted from texts.

According to Sena et al. (2017), a sentence has a symmetric relation if it follows this kind of pattern: If A relation B then B relation A. The sentence John is a friend of Mary might be classified as symmetric, since it follows this pattern: John (A) is a friend of (relation) Mary (B), and Mary (B) is a friend of (relation) John (A). Since both facts are true, the symmetry is valid. As with the transitive approach, applying symmetry increases the number of new facts extracted from texts.

2.3 Open issues

Although the Open IE area is growing fast and many methods have been proposed, some open points need to be addressed. The first one concerns the accuracy of extracted facts. Open IE methods extract a high number of unusual facts (Banko et al. 2007). This may be due to the nature of the open domain, in which there is no specification of what kind of information is extracted and no domain restriction. Moreover, facts are extracted from heterogeneous textual documents.

Another factor that may contribute to the high number of unusual facts concerns the creation of the corpus. The corpus is usually composed of random sentences, which may contain non-informative facts and thus hinder the extraction of concrete facts. The high number of non-informative extractions challenges the application of Open IE methods in practical tasks, such as the population of an ontology. Another open issue concerns the extraction of facts that do not have a verbal phrase; systems such as OLLIE (Mausam et al. 2012) and ClausIE (Del Corro and Gemulla 2013) address this question. Considering the sentence “Barack Obama, president of the USA,” both methods extract the fact (Barack Obama, is, president of the USA), adding a synthetic verb “is” as a verbal phrase even though it is not explicit in the sentence.

The last open issue is related to multiple arguments. Most methods extract binary facts; however, works such as Kraken (Akbik and Loser 2012) have adopted the extraction of multiple arguments (n-ary).

Although the community has worked on these issues, they mostly remain open problems. In this work, we handle binary facts without adding synthetic verbs. In future work, we intend to tackle extraction without a verbal phrase and to treat unusual extractions to improve accuracy.

3. InferPortOIE

InferPortOIE is an Open IE method to extract facts from texts in Portuguese. This method takes into account the structure of writing, especially sentences with asyndetic coordination. In addition, it builds on the methodology of ReVerb (Fader et al. 2011) as adapted to the Portuguese language (Sena et al. 2017), in which the facts are extracted in a triple format t = (arg1, rel, arg2). InferPortOIE proposes two new rules that generalize inference both by transitivity and by symmetry, thus increasing the number of extractions in a sentence. A new specific rule for symmetric reasoning is proposed based on a list of symmetric verbs reported in Godoy (2008). We divided InferPortOIE into six steps: Pre-processing, Syntactic Constraint, Treatment of Particular Cases, Inference Detection, Transitivity Constraint, and Symmetric Constraint.

3.1 Pre-processing

InferPortOIE uses CoGroo (Moura Silva 2013) as a POS tagger and an NP chunker for the Portuguese language. The CoGroo project is a Portuguese-language spell checker whose analyzers are trained with Brazilian and European Portuguese.

The POS tagger output of CoGroo separates preposition+article and/or preposition+pronoun contractions, changing the original structure of the sentence. It is therefore necessary to regroup the parts that CoGroo separates in some Portuguese contractions. For example, in the sentence The key is on the table (in Portuguese A chave está na mesa), CoGroo splits the contraction na into preposition+article, em + a, respectively. After regrouping the preposition+article, the sentence returns to its original form, so the split affects neither its structure nor the extraction of facts.
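The regrouping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the contraction table is a small illustrative subset, the function name `regroup_contractions` is hypothetical, and the tagger output is assumed to be a plain list of lowercase tokens rather than CoGroo's actual data structures.

```python
# Illustrative subset of Portuguese preposition+article contractions.
# A real system would need the full inventory, including pronouns.
CONTRACTIONS = {
    ("em", "a"): "na",
    ("em", "o"): "no",
    ("em", "as"): "nas",
    ("em", "os"): "nos",
    ("de", "a"): "da",
    ("de", "o"): "do",
    ("de", "as"): "das",
    ("de", "os"): "dos",
    ("por", "a"): "pela",
    ("por", "o"): "pelo",
}


def regroup_contractions(tokens, original_words):
    """Merge preposition+article pairs that the tagger split apart.

    tokens: tagger output as a list of lowercase strings (assumed format).
    original_words: set of lowercase words of the raw sentence; a pair is
    merged only if its contraction really occurs in the raw sentence.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        contraction = CONTRACTIONS.get(pair)
        if contraction is not None and contraction in original_words:
            merged.append(contraction)
            i += 2  # consume both the preposition and the article
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["a", "chave", "está", "em", "a", "mesa"]
original = set("a chave está na mesa".split())
print(regroup_contractions(tokens, original))
```

Checking the contraction against the words of the raw sentence is a simplification chosen here so that a preposition and an article that were genuinely written apart are not merged by mistake.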

3.2 Syntactic constraint

Most of the relationships in a sentence written in Portuguese are expressed through verbs, as stated by Fader et al. (2011). Table 1 presents the syntactic constraint. A relation (rel) can be composed of a verb; a verb followed by a preposition; or a verb followed by a sequence of words (nouns, adjectives, adverbs, and pronouns) and ending with a preposition.

Table 1. Syntactic constraint to extract relations in Portuguese texts based on Fader et al. (2011)

Table 2 presents three examples, one for each pattern used to identify a relation. In sentence 1, the V pattern matches a single verb. In sentence 2, the VP pattern corresponds to a verb immediately followed by a preposition. In sentence 3, the VWP pattern corresponds to a verb followed by an adjective and ending with a preposition.

Table 2. Examples of the pattern application from Table 1
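The three relation patterns (V, VP, and VW*P) can be expressed as a regular expression over a sequence of POS tags. The sketch below is only an illustration under stated assumptions: the tag names (V, PREP, N, ADJ, ADV, PRON, DET, PROP) form a simplified hypothetical tag set rather than CoGroo's, the function name `find_relations` is hypothetical, and English glosses are used instead of Portuguese tokens.

```python
import re

# Map a simplified, hypothetical tag set to one-letter codes:
# V = verb, P = preposition, W = noun/adjective/adverb/pronoun.
TAG_CODES = {"V": "V", "PREP": "P", "N": "W", "ADJ": "W", "ADV": "W", "PRON": "W"}

# V | V P | V W* P : a verb, optionally followed by words and a closing preposition.
RELATION = re.compile(r"V(?:W*P)?")


def find_relations(tagged):
    """Return the relation phrases found in a list of (word, tag) pairs."""
    # One character per token, so regex offsets are token indexes.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    relations = []
    for match in RELATION.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        relations.append(" ".join(words))
    return relations


sentence = [
    ("Horntown", "PROP"), ("is", "V"), ("a", "DET"), ("city", "N"),
    ("located", "V"), ("in", "PREP"), ("Oklahoma", "PROP"),
]
print(find_relations(sentence))
```

Encoding each tag as a single character keeps match offsets aligned with token positions, which makes it straightforward to search for the arguments to the left and right of each relation afterwards.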

After identifying a possible relationship in the sentence, the next step is to search for arguments to the left of the relation (arg1) and then to the right (arg2). The identification of arguments follows the syntactic constraint presented in Table 3. These arguments are formed by noun phrases (NPs), which can be composed of a noun, pronoun, or adjective, or can be a combination of NPs connected through prepositions. Finally, InferPortOIE extracts the triple t = (arg1, rel, arg2), with a relation and two arguments.

Table 3. The constraints to identify the arguments (Sena et al. 2017)

3.3 Treatment of particular cases

3.3.1 Asyndetic coordination

Given the structure of some sentences written in Portuguese, specific treatments are needed to correctly identify the relationships and arguments of the extracted facts. We examined situations in which asyndetic coordination (footnote a) occurs, following the Portuguese grammar (Neto and Infante 2003). We made this decision because InferPortOIE does not use a dependency parser (PT-BR), which would process the structure of the sentence more deeply. However, there is no guarantee that a dependency parser would cover particular cases such as asyndetic coordination in the Portuguese language. Consider the sentence Château-Gontier is a commune of France, is located in the department of Mayenne, is headquartered in the region of the Loire. Without treatment, InferPortOIE might extract the following facts: (Château-Gontier, is a commune of, France) and (a commune of France, is located in, the department of Mayenne). These facts are correct; however, new facts can be extracted by identifying the presence of asyndetic coordination. InferPortOIE searches for the leftmost NPs in the sentence that satisfy the identified relation. As a consequence, InferPortOIE can extract two new facts: (a commune of France, is headquartered in, the region of the Loire) and (Château-Gontier, is located in, the department of Mayenne). Note that no conjunctions link the coordinated clauses and no adjuncts indicate that one clause has a syntactic function in another. Thus, this example is characterized as asyndetic coordination.

3.3.2 Adjacent nominal phrases

We also consider cases of nominal phrases that are either a proper name or a noun followed by a comma, in order to extract implicit facts from the sentence. For example, in the sentence Most plants can be designated as herbs, vines, lianas, etc., without our treatment only the fact (Most plants, can be designated as, herbs) would be extracted. InferPortOIE, however, identifies a punctuation mark (comma) after the last element of the extracted fact (herbs) and verifies whether the next word is a noun or a proper name. If so, it generates a new fact with that word, and the process repeats until the condition no longer holds. In this sentence, InferPortOIE extracts the following additional facts: (Most plants, can be designated as, vines) and (Most plants, can be designated as, lianas).
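The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `expand_enumeration`, the simplified tags (N for noun, PROP for proper name), and the token-index argument are hypothetical and are not taken from the InferPortOIE implementation.

```python
NOUN_TAGS = {"N", "PROP"}  # simplified, hypothetical tags for noun and proper name


def expand_enumeration(fact, tagged, end_index):
    """Generate extra facts for nouns listed after the second argument.

    fact: the extracted triple (arg1, rel, arg2).
    tagged: the sentence as a list of (word, tag) pairs.
    end_index: position in `tagged` of the last token of arg2.
    """
    arg1, rel, _ = fact
    new_facts = []
    i = end_index + 1
    # Repeat while a comma is immediately followed by a noun or proper name.
    while (i + 1 < len(tagged)
           and tagged[i][0] == ","
           and tagged[i + 1][1] in NOUN_TAGS):
        new_facts.append((arg1, rel, tagged[i + 1][0]))
        i += 2
    return new_facts


tagged = [
    ("Most", "DET"), ("plants", "N"), ("can", "V"), ("be", "V"),
    ("designated", "V"), ("as", "PREP"), ("herbs", "N"), (",", "PUNCT"),
    ("vines", "N"), (",", "PUNCT"), ("lianas", "N"), (",", "PUNCT"),
    ("etc.", "ADV"),
]
fact = ("Most plants", "can be designated as", "herbs")
for new_fact in expand_enumeration(fact, tagged, end_index=6):
    print(new_fact)
```

In this sketch the loop stops at etc. because it is not tagged as a noun, which mirrors the stopping condition described in the text.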

3.4 Inference detection

The inference process aims to establish whether there is symmetry or transitivity in a sentence. This step checks whether the sentence matches at least one of the restrictions presented in Table 4. CoGroo (Moura Silva 2013) was used to obtain the lemma (footnote b) of each word directly from the sentence. For example, in the sentence New York is a city, the lemma form might be (be a city), which is identified by pattern 1 (IS-A) of Table 4. Verbs in Portuguese have many inflected forms, so specifying rules for each variation of each verb would be unfeasible. We argue that applying lemmas directly to the sentence can reduce the error rate of the method. Moreover, we analyze whether the same sentence has both transitive and symmetric features. If both are present, InferPortOIE extracts both facts. In contrast, Sena et al. (2017) use a binary classifier to choose between the transitive and symmetric classes, and therefore do not identify both features in the same sentence.

Table 4. Our patterns for transitive and symmetric features in lemma forms

Adapted from Sena et al. (2017).

3.5 Transitivity constraint

When a sentence is detected as transitive, at least two facts must be extracted to determine which type of transitivity it exhibits. Table 5 lists the patterns for transitivity in a sentence. Among these patterns, we highlight number 1, proposed in this article, which we consider a generalization of transitive inference. In this case, a relation of type IS-A followed by any other kind of relation is sufficient to apply our transitivity method. For example, in the sentence Peter is a boy who lives in the farm, the lemmas of each word are identified and checked. The sentence is classified as transitive because it follows pattern number 1 of Table 5. The facts extracted are: (Peter, is, a boy); (a boy, lives in, the farm); and, by transitivity, (Peter, lives in, the farm), all of which are valid. We emphasize that these patterns can also be considered in reverse. For example, pattern number 1 can also be recognized as arg_i ANY-REL arg_{i+1} IS-A arg_{i+2}. To simplify the presentation, we show only one of the transitivity orders for each type.

Table 5. Patterns for transitive sentences

Adapted from Sena et al. (2017).

a Any relationship identified.
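The generalized transitivity rule (an IS-A relation followed by any relation) can be sketched as below. This is an illustration only: the lemma table is a tiny hypothetical stand-in for the lemmatizer output, the function names are hypothetical, arguments are matched by exact string equality, and only the forward order of the pattern is covered because the text does not spell out the fact inferred in the reverse order.

```python
# Hypothetical lemma lookup standing in for the lemmatizer output.
LEMMAS = {"is": "be", "are": "be", "was": "be", "were": "be"}


def is_a_relation(rel):
    """True when the relation is a bare copula, i.e. its lemma is 'be'."""
    return LEMMAS.get(rel.lower(), rel.lower()) == "be"


def infer_transitive(triples):
    """Apply: (A, IS-A, B) and (B, ANY-REL, C)  =>  (A, ANY-REL, C)."""
    inferred = []
    for a, rel1, b in triples:
        if not is_a_relation(rel1):
            continue
        for b2, rel2, c in triples:
            if b2 != b:
                continue  # the second fact must start where the first ends
            candidate = (a, rel2, c)
            if candidate not in triples and candidate not in inferred:
                inferred.append(candidate)
    return inferred


triples = [("Peter", "is", "a boy"), ("a boy", "lives in", "the farm")]
print(infer_transitive(triples))
```

Matching the shared argument by exact string equality is the simplest possible choice; a fuller treatment would presumably compare the underlying noun phrases.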

3.6 Symmetric constraint

Our symmetric constraint has two simple forms, presented in Table 6. Pattern number 1 represents the generalization of symmetric inference, and it requires only one extracted fact. For example, in the sentence Goran Višnjic (Šibenik, 9 of September of 1972), is a Croatian actor settled in the United States of America, the lemmas of each word are identified and the sentence is classified as symmetric. It follows pattern number 1 of Table 6, with the facts: (Goran Višnjic, is, a Croatian actor); and, by symmetry, (a Croatian actor, is, Goran Višnjic). The indefinite article “a” in the nominal phrase “a Croatian actor” ensures that, among all Croatian actors, there is one who is “Goran Višnjic.” This procedure is applied to all facts extracted by this rule.

Table 6. Pattern for symmetric sentences

Adapted from Sena et al. (2017).

Another important feature of symmetric inference is symmetric verbs. We investigated a list of verbs with symmetric features reported in Godoy (2008), some of which are listed in Table 7.

Table 7. Some verbs with symmetric features addressed by Godoy (2008)

The sentence Mary talked to John might be classified as symmetric according to pattern number 2 of Table 6. Inference based on symmetric verbs is more accurate. In this case, the extracted facts are: (Mary, talked to, John); and, by symmetry, (John, talked to, Mary).
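Both symmetric patterns can be sketched in a few lines. The code below is an illustration under stated assumptions: the lemma table and the one-element verb set are hypothetical placeholders (the actual list comes from Godoy (2008) and is not reproduced here), the function names are hypothetical, and pattern 1 is restricted to a bare copula so that relations such as is a commune of are not inverted.

```python
# Hypothetical stand-ins for the lemmatizer output and the symmetric verb list.
LEMMAS = {"is": "be", "talked": "talk"}
SYMMETRIC_VERBS = {"talk"}  # placeholder; not the list from Godoy (2008)


def head_lemma(rel):
    """Lemma of the first word of the relation phrase."""
    head = rel.split()[0].lower()
    return LEMMAS.get(head, head)


def infer_symmetric(triple):
    """Return the mirrored triple if a symmetric pattern applies, else None."""
    arg1, rel, arg2 = triple
    lemma = head_lemma(rel)
    # Pattern 1 (assumed form): a bare IS-A relation.
    if lemma == "be" and len(rel.split()) == 1:
        return (arg2, rel, arg1)
    # Pattern 2: the head verb belongs to the symmetric verb list.
    if lemma in SYMMETRIC_VERBS:
        return (arg2, rel, arg1)
    return None


print(infer_symmetric(("Mary", "talked to", "John")))
print(infer_symmetric(("Goran Višnjic", "is", "a Croatian actor")))
```

Because the function handles one triple at a time, it can be applied to the same sentence as a transitive rule, which matches the idea of extracting both kinds of inferred facts from a single sentence.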

3.7 InferPortOIE workflow

InferPortOIE starts by processing the sentences with the POS tagger and NP chunker analyzers. Once the words of each sentence are labeled, we apply the syntactic constraint to extract facts from the sentences. For the sentence in Figure 1, the following relations are obtained: number 1 (is) and number 2 (located in). Relation number 1 is obtained from the V pattern, in which the relation consists of a verb, while relation number 2 is obtained from the VP pattern, which is a verb followed by a preposition. After identifying those relations, we use our new syntactic constraint to identify the arguments of each relation. In this example, we have the following arguments: relation number 1 (Havana, a city) and relation number 2 (a city, Midwestern). The arguments (Havana, a city) are derived from the B-NP and B-NP I-NP patterns, respectively. The pattern B-NP indicates that the argument (Havana) is the beginning of a nominal phrase, and the pattern B-NP I-NP indicates that the argument (a city) is the beginning of a nominal phrase followed by a token inside the same nominal phrase. Since no more elements belong to this syntactic group, the search for the arguments of relation number 1 ends.

Figure 1. InferPortOIE workflow.

The argument (a city) of relation number 2 is derived from the pattern B-NP I-NP and the argument (Midwestern) of relation number 2 is derived from the pattern B-NP. After all relations and arguments in the sentence are identified, the extractions are as follows: triple number 1 (Havana, is, a city) and triple number 2 (a city, located in, Midwestern). Finally, all extracted triples are passed to the inference step. Applying the inference patterns, InferPortOIE can identify two new relationships in this example. The first comes from the transitivity rules, specifically from pattern number 1 of Table 5 (generalization rule): it combines argument number 1 of triple number 1, the relation of triple number 2, and argument number 2 of triple number 2, building a new extraction (triple number 3) with the new fact (Havana, located in, Midwestern). The second comes from the symmetry rules, from pattern number 1 of Table 6 (generalization rule): it combines argument number 2 of triple number 1, the relation of triple number 1, and argument number 1 of triple number 1, building a new extraction (triple number 4) with the new fact (a city, is, Havana).

4. Experimental setup

We compare our method against two state-of-the-art methods: ArgOE (Gamallo and Garcia 2015) and Sena et al. (2017) (from now on called SGC_2017). This comparison was based on five criteria: (a) the number of facts extracted, (b) the number of valid facts, (c) the number of minimal facts, as proposed in Bast and Haussmann (2013), (d) the precision accurate (prec-c), and (e) the precision minimality (prec-m). The precision accurate measure is the ratio of the total number of valid facts to the total number of extracted facts (Equation 1), while the precision minimality measure is the ratio of the total number of minimal facts to the total number of valid facts (Equation 2).

\begin{equation}
\mathrm{PrecisionAccurate} = \frac{\#(\text{valid facts})}{\#(\text{extracted facts})} \tag{1}
\end{equation}

\begin{equation}
\mathrm{PrecisionMinimality} = \frac{\#(\text{minimal facts})}{\#(\text{valid facts})} \tag{2}
\end{equation}
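Equations 1 and 2 translate directly into code. The following is a minimal sketch with hypothetical function names; returning 0.0 for an empty denominator is a convention chosen here, not something the paper specifies. The example counts are the InferPortOIE+ figures reported for WIKI-200 in Section 5.1 (434 extracted facts, 286 valid).

```python
def precision_accurate(valid_facts, extracted_facts):
    """Equation 1: valid facts divided by extracted facts."""
    if extracted_facts == 0:
        return 0.0  # convention chosen for this sketch
    return valid_facts / extracted_facts


def precision_minimality(minimal_facts, valid_facts):
    """Equation 2: minimal facts divided by valid facts."""
    if valid_facts == 0:
        return 0.0  # convention chosen for this sketch
    return minimal_facts / valid_facts


# InferPortOIE+ on WIKI-200: 286 valid facts out of 434 extracted.
print(f"prec-c = {precision_accurate(286, 434):.3f}")
```

With these counts, Equation 1 gives a precision accurate of roughly 0.659.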

Accuracy in this work is measured by the consistency of an extracted fact with the information in the sentence (Fader et al. 2011; Mausam et al. 2012). For the sentence John is, according to friends, at home, the fact (John, is, according to friends) is incoherent because it does not convey the same information as the sentence. On the other hand, a coherent fact would be (John, is, at home), because this information is consistent with the sentence.

A minimal fact is defined as a fact that cannot be decomposed into new facts from its arguments (Bast and Haussmann 2013). In this work, minimality was evaluated considering only valid facts. For the sentence Conus tabidus is a species of gastropod of genus Conus, an extracted fact can be (Conus tabidus, is a species of, gastropod of genus Conus), which is coherent with the original sentence. From this valid fact, two minimal facts could be extracted: (Conus tabidus, is a species of, gastropod) and (Conus tabidus, is a species of, genus Conus).

It is worth mentioning that recall is hard to calculate for Open IE systems because of their open nature (Mausam 2016). There is often a fact that the methods are not able to identify but that a human, for example, could identify. For this reason, we consider the yield measure, which is proportional to recall and represents the number of correct extractions (Mausam et al. 2012; Xavier, de Lima, and Souza 2015). In this paper, the yield is given by the number of valid extracted facts.

For each extraction, we classify whether the fact is valid or invalid and whether it is minimal. This process was carried out by two experts, and an extracted fact was considered valid only if both experts marked it as valid. We used the Kappa coefficient (Carletta 1996) to measure the degree of agreement between the evaluations made by the two experts.
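The agreement computation can be sketched as follows. The text does not state which Kappa variant was used; this sketch assumes Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function names and the example labels are hypothetical.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty set")
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of marginals.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def consensus_valid(labels_a, labels_b):
    """A fact counts as valid only if both experts marked it valid."""
    return [a and b for a, b in zip(labels_a, labels_b)]


expert_1 = [True, True, False, False]  # made-up judgments
expert_2 = [True, False, False, False]
print(cohen_kappa(expert_1, expert_2))
print(consensus_valid(expert_1, expert_2))
```

The `consensus_valid` helper reflects the rule stated above, under which a disagreement between the experts counts the fact as invalid.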

The most relevant works on Open IE have used random sentences to evaluate their methods. Usually, these methods use data sets of a few hundred sentences: TextRunner (Etzioni et al. 2008), 400 sentences; WOE (Wu and Weld 2010), 300 sentences; ReVerb (Fader et al. 2011), 500 sentences; DepOE (Gamallo et al. 2012), 200 sentences; OLLIE (Mausam et al. 2012), 300 sentences; and CSD (Bast and Haussmann 2013), 400 sentences. Based on these numbers, we evaluated InferPortOIE on two data sets written in Portuguese: 200 random sentences from the Portuguese Wikipedia (footnote c), from now on called WIKI-200, and 200 random sentences from the Corpus of Electronic Texts Extracts NILC/Folha de São Paulo newspaper (CETENFolha) (footnote d), version 2008, made up of journalistic texts on a wide range of topics (sports, politics, cooking, etc.), from now on called CETEN-200.

5. Results

We compare InferPortOIE against two state-of-the-art methods in a fair comparison. To evaluate the impact of inference, we consider two variants: InferPortOIE−, which is a version with no inference, and InferPortOIE+, which is our inference approach.

5.1 Evaluation on WIKI-200

On the WIKI-200 data set, InferPortOIE+ extracted 434 facts, of which the experts considered 286 valid, while InferPortOIE− extracted 332 facts, of which 213 were valid. Figure 2 shows that neither InferPortOIE+ nor InferPortOIE− obtained a higher number of extractions than SGC_2017 (Sena et al. 2017). This is due to the inference classifier in SGC_2017, which has a high error rate (17%) and generates many invalid extractions. On the other hand, in terms of the number of valid facts, our approach (InferPortOIE+) obtained the best performance, while InferPortOIE− was similar to the SGC_2017 method. Considering the number of minimal facts, our approach achieves good performance in comparison with the other methods.

Figure 2. Results of the number of extracted facts, valid extracted facts, and minimal extracted facts in WIKI-200.

Table 8 reports the precision of the evaluated methods. On the precision accurate criterion (prec-c), InferPortOIE+ and InferPortOIE− outperform all other methods. On the precision minimality criterion (prec-m), ArgOE (Gamallo and Garcia 2015) obtained a higher percentage than ours. However, this trade-off was expected, since ArgOE produces far fewer extractions than the other methods, which favors its precision.

Table 8. Results of precision accurate (prec-c) and precision minimality (prec-m) by the evaluated methods in WIKI-200

Finally, we obtained Kappa coefficients of 88.6% and 91.4% for valid facts and minimal facts, respectively. In both cases, the coefficients indicate that the evaluations are highly reliable, supporting the results obtained on this data set.
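For readers unfamiliar with the Kappa statistic, the following sketch shows how an agreement coefficient of this kind can be computed. It assumes two annotators and Cohen's formulation; the text does not state the number of evaluators or the exact variant used, and the labels below are invented for illustration only.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments for ten extracted facts (not data from the paper).
judge_1 = ["valid"] * 6 + ["invalid"] * 4
judge_2 = ["valid"] * 5 + ["invalid"] * 5
print(round(cohen_kappa(judge_1, judge_2), 3))  # 0.8
```

Here the judges agree on nine of ten facts (observed agreement 0.9) while chance agreement is 0.5, which gives a kappa of 0.8.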

5.2 Evaluation on CETEN-200

In the CETEN-200 data set, InferPortOIE+ extracted 356 facts, of which the experts considered 195 valid, while InferPortOIE− extracted 340 facts, of which 190 were valid. Figure 3 shows that neither InferPortOIE+ nor InferPortOIE− obtained a higher number of extractions than SGC_2017. In the number of valid facts, neither InferPortOIE+ nor InferPortOIE− matched SGC_2017, because SGC_2017 extracts more facts than our approach. However, as argued before, SGC_2017 also extracts many invalid facts owing to the high error rate (17%) of its classifier, which inflates its number of extractions. In the number of minimal facts extracted, all methods performed similarly.

Figure 3. Results of the number of extracted facts, valid extracted facts, and minimal extracted facts in CETEN-200.

In Table 9, on the precision accurate criterion (prec-c), both InferPortOIE+ and InferPortOIE− outperform the other methods. On precision minimality (prec-m), InferPortOIE+ and InferPortOIE− performed similarly, and both were better than the other methods. It is noteworthy that the amount of inference data is an influencing factor: in this data set, the inference data were not very representative (about 8%).

Table 9. Results of precision accurate (prec-c) and precision minimality (prec-m) by the evaluated methods in CETEN-200

In this data set, we obtained Kappa coefficients of 81.5% and 80.3% for valid facts and minimal facts, respectively. Although these values are lower than in the WIKI-200 data set, they still indicate a good degree of reliability in the evaluations, supporting the results obtained by InferPortOIE.

6. Discussions

Open IE methods extract facts without determining the type of relation in advance. This freedom raises a problem: incoherent extractions. In this section, we discuss some of the extractions obtained by InferPortOIE, ArgOE, and SGC_2017 on the data sets used. We organize this discussion in two parts. First, we present some sentences and the facts extracted by InferPortOIE; second, we compare some facts extracted by InferPortOIE with those extracted by ArgOE and SGC_2017.

6.1 InferPortOIE

From the sentence Homem-Serpente é uma raça humanoide fictícia criado pelo texano Robert (Snake-Man is a fictional humanoid race created by the Texan Robert), InferPortOIE extracts the facts presented in Table 10. All extracted facts were classified as valid. Triple 3 was inferred by transitivity and Triple 4 by symmetry.

Table 10. Example of valid transitivity and symmetry extractions by InferPortOIE

Considering the sentence Del james (nascido em 5 de Fevereiro de 1964 em New Rochelle, Nova Iorque, EUA) é um escritor e jornalista conhecido por ter trabalhado com Axl Rose da banda Guns N’ Roses e com Chuck Billy da banda Testament no álbum The Ritual (Del James (born on February 5, 1964, in New Rochelle, New York, USA) is a writer and journalist known for having worked with Axl Rose from the band Guns N’ Roses and with Chuck Billy from the band Testament on the album The Ritual), InferPortOIE extracts the triples presented in Table 11, all of which were valid. These triples result from our treatment of a particular case, the identification of adjacent NPs. InferPortOIE identifies, from the commas, that more nominal phrases follow “New_Rochelle” and extracts two more facts with the arguments “New_York” and “USA.” On verifying that no nominal phrase follows the last argument (USA), InferPortOIE ends the search for adjacent nominal phrases related to the relation “born on.”

Table 11. Example of valid extractions identifying adjacent NPs
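The adjacent-NP search described above can be sketched as a short loop over chunks. This is a minimal illustration under stated assumptions, not the InferPortOIE implementation: the chunk tags (`NP`, `COMMA`, `OTHER`), the function name `adjacent_np_triples`, and the exact argument strings are hypothetical, and the triples of Table 11 are not reproduced here.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]        # (text, tag); tags are assumed labels
Triple = Tuple[str, str, str]  # (argument 1, relation, argument 2)


def adjacent_np_triples(arg1: str, relation: str, chunks: List[Chunk]) -> List[Triple]:
    """Emit one triple per NP in a comma-separated run that follows the relation."""
    triples: List[Triple] = []
    if not chunks or chunks[0][1] != "NP":
        return triples
    triples.append((arg1, relation, chunks[0][0]))
    i = 1
    # Keep going while the next two chunks are a comma followed by an NP.
    while i + 1 < len(chunks) and chunks[i][1] == "COMMA" and chunks[i + 1][1] == "NP":
        triples.append((arg1, relation, chunks[i + 1][0]))
        i += 2
    return triples


# Chunks that follow the relation in the Del James sentence (assumed tagging).
chunks = [
    ("New_Rochelle", "NP"), (",", "COMMA"),
    ("New_York", "NP"), (",", "COMMA"),
    ("USA", "NP"), (")", "OTHER"),
]
for triple in adjacent_np_triples("Del_James", "born on", chunks):
    print(triple)
```

The loop stops at the closing parenthesis because no further comma and NP pair follows, which mirrors the stopping condition described in the text.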

Considering the sentence O ídolo nacional Hristo Stoichkov, provavelmente o maior craque da Bulgária em todos os tempos, está registrado no Barcelona da Espanha (The national idol Hristo Stoichkov, probably the greatest ace of Bulgaria of all time, is registered in Barcelona of Spain), Table 12 presents the only extracted fact (considered valid), which results from our treatment of particular cases. Here, InferPortOIE identifies an asyndetic coordination pattern in the sentence.

Table 12. Example of valid extractions identifying asyndetic coordination

Our evaluation shows that InferPortOIE extracts a higher number of valid facts than ArgOE and SGC_2017. However, part of the extracted facts remains incoherent. Considering the sentence Tomáš Vaclík (Ostrava, 29 de março de 1989) é um futebolista profissional checo que atua como goleiro, atualmente defende o FC Basel (Tomáš Vaclík (Ostrava, March 29, 1989) is a Czech professional footballer who plays as a goalkeeper, currently defending FC Basel), InferPortOIE extracts the facts presented in Table 13. We consider facts 2, 5, and 7 invalid. Triple 2 was classified as invalid because the method mistakenly identified the argument Ostrava as a person referring to Tomáš Vaclík, when in fact Ostrava is a place. Consequently, triples 5 and 7, which come from the inference method, were also classified as invalid. The other triples were considered valid.

Table 13. Example of invalid extraction with transitive and symmetry inference performed by InferPortOIE

6.2 ArgOE against InferPortOIE

Table 14 compares the facts extracted by ArgOE and InferPortOIE from the sentence A primeira final japonesa aconteceu em 92 (The first Japanese final happened in 92). ArgOE extracts coherent information, whereas our method does not. InferPortOIE extracts an invalid triple because it could not correctly identify argument 1 of the triple (Japanese). This occurred because our syntactic restriction for identifying nominal phrases labeled part of the sentence (the first Japanese final) as two different nominal phrases. Part of the problem may also lie with the CoGrOO chunker (Moura Silva 2013), which did not label that sentence precisely.

Table 14. Example that ArgOE is better than InferPortOIE

Table 15 again compares both methods, now on the sentence Pantaleão Mansi, muito lúcido e inteligente, diz viver como se estivesse no século passado (Pantaleão Mansi, very lucid and intelligent, says to live as if in the last century). Unlike the previous example, ArgOE performed worse than InferPortOIE here. ArgOE was unable to extract any triple, since it could not deal with particular features of the Portuguese language, such as asyndetic coordination patterns, even though it uses a dependency parser for Portuguese. InferPortOIE, on the other hand, handled this pattern and extracted a valid triple from the sentence.

Table 15. Example that InferPortOIE is better than ArgOE

Table 16 shows another comparison, for the sentence Serviços de emergência como hospitais e prontos-socorros funcionarão normalmente nos três dias do feriado (Emergency services such as hospitals and emergency rooms will operate normally during the three days of the holiday). We consider both facts valid. In this example, InferPortOIE extracts more detail than ArgOE by adding the information that emergency rooms will operate normally. However, we believe that both methods succeeded at this task.

Table 16. Example that ArgOE and InferPortOIE are equivalent

Table 17 presents a quantitative comparison between ArgOE and InferPortOIE. Two samples of 25 sentences each were randomly selected from the CETEN-200 and WIKI-200 data sets, totaling 50 sentences. The numbers of valid and invalid facts extracted by both methods are reported per sentence. Table 17 shows that InferPortOIE obtained a higher number of valid facts and a lower number of invalid facts in both data sets.

Table 17. Quantitative comparison between InferPortOIE and ArgOE in a 25 sentences sample set randomly extracted from CETEN-200 and WIKI-200 data sets

VF—Number of valid facts; IVF—Number of invalid facts.

6.3 SGC_2017 against InferPortOIE

Table 18 compares the facts extracted by InferPortOIE and SGC_2017 (Sena et al. 2017) from the sentence O Association Sportive d’Origine Arménienne de Valence foi um clube de futebol francês (The Association Sportive d’Origine Arménienne de Valence was a French football club). SGC_2017 extracted coherent information from the sentence, whereas InferPortOIE did not obtain a coherent fact. This occurs because the POS tagger failed on the span d’Origine Arménienne de Valence, wrongly classifying “d’Origine” as a verb. Proper names are not translated into Portuguese, and the analyzer made a mistake on this foreign name. In addition, the phrase d’Origine Arménienne de Valence should be part of argument 1 and not part of the relation.

Table 18. Example that SGC_2017 is better than InferPortOIE

Table 19 shows the facts extracted from the sentence A Grande Praça, o coração da cidade que já foi centro administrativo e religioso, é formada por um conjunto de templos, pirâmides e acrópoles (The Great Square, the heart of the city that was once an administrative and religious center, is formed by a set of temples, pyramids, and acropolises). InferPortOIE extracted coherent information through its treatment of asyndetic coordination, while SGC_2017 did not. The fact extracted by SGC_2017 does not correspond to the information contained in the sentence because argument 1 of the triple is incoherent.

Table 19. Example that InferPortOIE is better than SGC_2017

Table 20 again compares the facts extracted by both methods, for the sentence Os vivaldinos sabem que há uma enxurrada de mentiras sobre o funcionalismo brasileiro (The vivaldinos know that there is a flash flood of lies about the Brazilian functionalism). InferPortOIE extracted a more informative fact from this sentence by completing argument 2 of the triple with Brazilian functionalism. However, both extracted facts were classified as valid.

Table 20. Example that SGC_2017 and InferPortOIE are equivalent

Table 21 presents a quantitative comparison between SGC_2017 and InferPortOIE. Two samples of 25 sentences each were randomly selected from the CETEN-200 and WIKI-200 data sets, totaling 50 sentences. These sentences were completely distinct from those used in the comparison between ArgOE and InferPortOIE. The numbers of valid and invalid facts extracted by both methods are reported per sentence. Table 21 shows that InferPortOIE obtained a higher number of valid facts and a lower number of invalid facts in the WIKI-200 sample (WIKI-25). In the CETEN-200 sample (CETEN-25), SGC_2017 obtained a higher number of valid facts per sentence, whereas InferPortOIE obtained a lower number of invalid facts per sentence.

Table 21. Quantitative comparison between InferPortOIE and SGC_2017 in a 25 sentences sample set randomly extracted from CETEN-200 and WIKI-200 data sets

VF—Number of valid facts; IVF—Number of invalid facts.

7. Threats to validity

In the WIKI-200 data set, according to Figure 2, InferPortOIE was superior in almost all the evaluated criteria, behind only ArgOE (Gamallo and Garcia 2015) in the precision minimality measure. However, this trade-off occurred because ArgOE obtained a much lower number of extractions than our method. In the CETEN-200 data set, according to Figure 3, all methods obtained inferior results in comparison with WIKI-200 regarding precision accurate. This could be due to the construction of the CETEN-200 data set: it is formed by journalistic texts from several domains, and almost all sentences use elaborate language, which makes the facts hard to extract. In both data sets, we verified that SGC_2017 was superior in the number of extracted facts. Although these results suggest its superiority, its SVM classifier generates a high number of invalid extractions, which explains the high number of extractions.

We also verified that InferPortOIE may fail, especially in the rules of transitive and symmetric inference, in which we treat relations of type IS-A as general relations from which to infer new facts. Moreover, our treatment of asyndetic coordination cannot cover all the possibilities of the Portuguese language, since sentences can be written in many different forms.

Finally, our results confirmed that InferPortOIE was superior both in precision and in the number of valid extractions, owing to our attention to writing aspects of the Portuguese language and to the generalization of our inference approach.

8. Conclusions and future work

We described our inferential method (InferPortOIE) to extract facts from texts written in Portuguese. InferPortOIE takes into account the written structure of a text. We use the transitive and symmetric inference methods proposed by Bast and Haussmann (2014) and Sena et al. (2017). Unlike them, InferPortOIE can generalize transitive and symmetric facts and can extract both kinds of inference from the same sentence, when applicable. InferPortOIE makes the following contributions:

  • we adapt and adjust the method proposed in Fader et al. (2011) to extract facts from texts written in Portuguese;

  • we propose to identify relationships in sentences that have asyndetic coordination patterns and arguments in sentences that have adjacent NPs, increasing the number of useful facts extracted;

  • we create new rules of transitive and symmetric inference based on the methods proposed in Bast and Haussmann (2014) and Sena et al. (2017), generalizing transitive and symmetric facts and thus increasing the number of facts extracted;

  • we also propose a new way of identifying transitivity and symmetry, extracting both types of facts in a single sentence, when applicable; and

  • we propose a list of verbs with symmetry features based on Godoy (2008), increasing the coverage of our symmetric inference method.
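The role of the symmetric verb list in the last contribution can be illustrated with a small sketch. It assumes that each extracted fact carries the lemma of its main verb and that a fact is mirrored whenever that lemma appears in the list. The `Fact` class, the function `infer_symmetric`, and the two lemmas are hypothetical illustrations; they do not reproduce the actual patterns or the verb list of the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Fact:
    arg1: str
    relation: str
    arg2: str
    verb_lemma: str  # lemma of the relation's main verb (assumed to be available)


# Illustrative lemmas only; the paper's list is based on Godoy (2008).
SYMMETRIC_LEMMAS: FrozenSet[str] = frozenset({"casar", "conversar"})


def infer_symmetric(facts: List[Fact],
                    symmetric_lemmas: FrozenSet[str] = SYMMETRIC_LEMMAS) -> List[Fact]:
    """Return the mirrored facts licensed by verbs with symmetry features."""
    known = set(facts)
    inferred: List[Fact] = []
    for fact in facts:
        if fact.verb_lemma in symmetric_lemmas:
            mirrored = Fact(fact.arg2, fact.relation, fact.arg1, fact.verb_lemma)
            if mirrored not in known:  # avoid duplicating existing facts
                known.add(mirrored)
                inferred.append(mirrored)
    return inferred


facts = [
    Fact("Ana", "casou com", "Pedro", "casar"),
    Fact("Ana", "mora em", "Salvador", "morar"),
]
print(infer_symmetric(facts))
```

Only the first fact is mirrored, because "morar" is not in the assumed list. Enlarging the list directly enlarges the set of sentences that can yield a symmetric fact.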

We verified that the generalization of our transitivity rule sometimes fails. Consider the sentence Peter is a student of a college that is in the center of the city, which may be classified as transitive. The facts extracted from this sentence may be (Peter, is a student of, a college); (a college, is in the center of, the city); and, by transitivity, (Peter, is in the center of, the city). In this case, the fact extracted by transitivity is incoherent, since the relation is in the center of refers to the argument a college and not to the argument Peter. Another open question in InferPortOIE is the treatment of coreference: sentences written with pronouns are not handled well by InferPortOIE. A further open aspect observed in our experiments is the identification of context. For example, in the sentence John will be approved in college if he obtains a score greater than or equal to 7, most current Open IE methods extract the fact (John, will be approved in, college). However, this fact is only valid if John obtains a score greater than or equal to 7 (the context). A new version of InferPortOIE is therefore envisioned to handle these open issues.
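The transitivity failure discussed above can be reproduced with a naive chaining rule. The sketch below is not the paper's transitivity pattern; it is a simplified assumption in which any two facts that share a middle argument are chained. The function name `chain_transitive` is hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (argument 1, relation, argument 2)


def chain_transitive(triples: List[Triple]) -> List[Triple]:
    """Chain (a, r1, b) and (b, r2, c) into (a, r2, c), with no semantic check."""
    inferred: List[Triple] = []
    for a1, _, a2 in triples:
        for b1, relation, b2 in triples:
            if a2 == b1 and a1 != b2:
                candidate = (a1, relation, b2)
                if candidate not in triples and candidate not in inferred:
                    inferred.append(candidate)
    return inferred


extracted = [
    ("Peter", "is a student of", "a college"),
    ("a college", "is in the center of", "the city"),
]
print(chain_transitive(extracted))
# [('Peter', 'is in the center of', 'the city')]
```

The rule has no way to know that the second relation describes the college rather than Peter, so it produces exactly the incoherent fact described in the text. Any fix would need information about which argument a relation actually modifies.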

Author ORCIDs

Daniela Barreiro Claro, 0000-0001-8586-1042

Footnotes

a Asyndetic coordination refers to coordinated clauses that are not introduced by conjunctions but by articulating elements such as the period and the comma.

b A lemma is the canonical form of a word; for example, the verb form is has the general form to be.

References

Akbik, A. and Loser, A. (2012). KrakeN: N-ary facts in open information extraction. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, AKBC-WEKEX ’12. Montreal, Canada: Association for Computational Linguistics (ACL), pp. 52–56.
Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M. and Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., pp. 2670–2676.
Banko, M. and Etzioni, O. (2008). The Tradeoffs Between Open and Traditional Relation Extraction, vol. 8. Stroudsburg, PA, USA: Association for Computational Linguistics (ACL), pp. 28–36.
Bast, H. and Haussmann, E. (2013). Open information extraction via contextual sentence decomposition. In 2013 IEEE Seventh International Conference on Semantic Computing (ICSC). Irvine, CA, USA: IEEE, pp. 154–159.
Bast, H. and Haussmann, E. (2014). More informative open information extraction via simple inference. In Proceedings of the 36th European Conference on IR Research on Advances in Information Retrieval, ECIR 2014, vol. 8416. New York, NY, USA: Springer-Verlag New York, Inc., pp. 585–590.
Carletta, J. (1996). Assessing agreement on classification tasks: The Kappa statistic. Computational Linguistics 22(2), 249–254.
Chang, C.-H., Kayed, M., Girgis, M.R. and Shaala, K.F. (2006). A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10), 1411–1428.
Del Corro, L. and Gemulla, R. (2013). ClausIE: Clause-based open information extraction. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13. New York, NY, USA: ACM, pp. 355–366.
Etzioni, O., Banko, M., Soderland, S. and Weld, D.S. (2008). Open information extraction from the web. Communications of the ACM 51(12), 68–74.
Fader, A., Soderland, S. and Etzioni, O. (2011). Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 1535–1545.
Faruqui, M. and Kumar, S. (2015). Multilingual Open Relation Extraction Using Cross-lingual Projection. arXiv preprint arXiv:1503.06450 (May–June), pp. 1351–1356.
Gamallo, P. and Garcia, M. (2015). Multilingual Open Information Extraction. Cham: Springer International Publishing, pp. 711–722.
Gamallo, P., Garcia, M. and Fernández-Lanza, S. (2012). Dependency-based open information extraction. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, ROBUS-UNSUP ’12. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 10–18.
Godoy, L. (2008). Os verbos recíprocos no PB: interface sintaxe-semântica lexical. Dissertation (Mestrado em Estudos Linguísticos), Faculdade de Letras, UFMG, Belo Horizonte.
Mausam (2016). Open information extraction systems and downstream applications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16. New York, NY, USA: AAAI Press, pp. 4074–4077.
Mausam, Schmitz, M., Bart, R., Soderland, S. and Etzioni, O. (2012). Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’12. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 523–534.
Moura Silva, W.D.C.d. (2013). Improving the Corrector Gramatical CoGrOO. PhD Thesis, University of São Paulo.
Neto, P.C. and Infante, U. (2003). Gramática da Língua Portuguesa. São Paulo: Scipione.
Sena, C.F.L., Glauber, R. and Claro, D.B. (2017). Inference approach to enhance a Portuguese open information extraction. In Proceedings of the 19th International Conference on Enterprise Information Systems, ICEIS, vol. 1. Porto, Portugal: ScitePress for INSTICC, pp. 442–451.
Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning 34(1-3), 233–272.
Wu, F. and Weld, D.S. (2010). Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 118–127.
Xavier, C.C., de Lima, V.L.S. and Souza, M. (2015). Open information extraction based on lexical semantics. Journal of the Brazilian Computer Society 21(1), 1–14.