
Antonio Pareja-Lora, María Blume, Barbara C. Lust & Christian Chiarcos (eds.), Development of linguistic linked open data resources for collaborative data-intensive research in the language sciences. Cambridge: The MIT Press, 2019. Pp. xxi + 247.

Published online by Cambridge University Press: 07 June 2021

FRANCES GILLIS-WEBBER*
Affiliation: Department of Computer Science, University of Cape Town, Cape Town, 7700, South Africa
fran@fynbosch.com

Type: Reviews
Copyright: © The Author(s), 2021. Published by Cambridge University Press

Linked Data is a set of principles for making heterogeneous data collections accessible on the World Wide Web. Within the context of linguistics, where diverse data such as annotated corpora, terminologies, and language data of little-known languages are produced, these linguistic datasets can be represented as Linked Data, thus facilitating openness and reuse. The principles of Linked Data, published in 2006 by Tim Berners-Lee, the inventor of the Web, are relatively simple (Berners-Lee 2006):

(1) Use URIs as names for things.
(2) Use HTTP URIs so that people can look up those names.
(3) When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
(4) Include links to other URIs, so that more things can be discovered.

To expand on these points further: in (1), a URI is a uniform resource identifier, similar to the uniform resource locator (URL) typed into a browser’s address bar, except that where the latter serves only to provide a location, the former serves to uniquely identify (RDF Working Group 2014a). For (2), URIs can be used as identifiers only (meaning they are not resolvable), although with Linked Data it is preferable that, if a user types a URI into their browser’s address bar, a web page is returned showing useful information about that entity/concept/object. This information can be returned as an HTML web page for the human user, but if the URI is accessed by a machine, then ideally the same information should be presented in structured markup, making the content machine-readable. Within the context of Linked Data, this markup is typically Resource Description Framework (RDF). RDF allows short statements to be written in the form of ‘triples’, consisting of a subject and an object, with a predicate in between (RDF Working Group 2014a), similar to English’s subject-verb-object word order. RDF statements can be serialised (written) in several formats, with the Turtle (Terse RDF Triple Language) format being the most human-readable (RDF Working Group 2014a). This addresses (3); for (4), within that structured markup, links to other resources, using URIs, should also be included.

A linguistic dataset can be published on the web using Linked Data principles and be represented by a single URI. Alternatively, the data within the dataset can itself be published as Linked Data, in which case the dataset consists of many URIs. Linked Data falls within the context of the Semantic Web, a notion of the web conceptualised by Berners-Lee, Hendler & Lassila (2001), where data on a web page is structured so that there can be a shared understanding of its meaning by both human and machine, with RDF as a core standard. Other core standards of the Semantic Web include the Web Ontology Language (OWL) and the Simple Knowledge Organization System (SKOS), both of which build on RDF (RDF Working Group 2014b). Controlled vocabularies, thesauri, and other taxonomies intended for reuse would typically be encoded using SKOS or OWL.
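As a minimal illustration of the triple structure described above (an example of my own, not one from the book), the following statements are serialised in Turtle: the subject is Lexvo’s URI for English, rdfs:label and rdfs:seeAlso are standard W3C properties, and the second triple links out to another resource, in keeping with principle (4).

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

  # Subject: Lexvo's URI for English; predicate: rdfs:label; object: a string literal.
  <http://lexvo.org/id/iso639-3/eng> rdfs:label "English"@en .

  # A second triple linking the same subject to another resource on the web.
  <http://lexvo.org/id/iso639-3/eng> rdfs:seeAlso <http://dbpedia.org/resource/English_language> .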

The book consists of an introduction, followed by eleven chapters, each separately authored. The area of language acquisition is the primary focus, as the principles of Linked Data offer tremendous potential when applied to diverse and large language datasets encoded from raw multimodal data. The main theme of the book is the interoperability of linguistic and language resources for publication on the web, ideally with an open licence. Interoperability can be considered within two contexts: syntactic and semantic. The former refers to agreed standards and formats for the exchange of data; the latter refers to a shared agreement on the meaning of the data exchanged. Linked Data aims to address both for resources on the web: syntactic interoperability is possible because of the common underlying standards, and semantic interoperability can be achieved across resources by the shared use of ontologies, controlled vocabularies, and, at the most granular level, URIs.

In Chapter 1, ‘Open Data – Linked Data – Linked Open Data – Linguistic Linked Open Data (LLOD): A general introduction’, Christian Chiarcos and Antonio Pareja-Lora introduce the main concepts of the book, namely Open Data, Linked Data, Linked Open Data, and Linked Open Data in linguistics. Some small examples of RDF statements are provided, serialised in Turtle. For a more thorough description of Linked Data and how it can be applied to language data, the reader should proceed to Chapter 4, ‘Linguistic Linked Open Data and under-resourced languages: From collection to application’. In this chapter, Steven Moran and Christian Chiarcos argue for the benefits of Linked Data, which they consider within the context of under-resourced languages.

Moving back to Chapter 2, ‘Whither GOLD?’, the General Ontology for Linguistic Description (GOLD) is introduced by D. Terence Langendoen as the first OWL ontology developed for linguistic description. Although GOLD appears to no longer be actively maintained, the ontology remains available via a persistent URI, hosted by LINGUIST List. Nancy Ide follows with linguistic annotations in Chapter 3, ‘Management, sustainability, and interoperability of linguistic annotations’, focusing primarily on their history in the quest for interoperability. In Chapter 5, ‘A data category repository for language resources’, Kara Warburton and Sue Ellen Wright introduce DatCatInfo, an online data category repository for language resources, detailing its history: it originated as ISOcat, a registry developed by the International Organization for Standardization Technical Committee 37 for Language and Terminology (ISO/TC 37). Each data category in DatCatInfo is uniquely identified by a URI, but these URIs are not resolvable and there appears to be no support for RDF to date.

In Chapter 6, ‘Describing research data with CMDI – challenges to establish contact with Linked Open Data’, the Common Language Resources and Technology Infrastructure, known as CLARIN, is described by Thorsten Trippel and Claus Zinn. This infrastructure is a distributed network across more than 36 institutions worldwide that makes a range of language tools and data available to the researcher. CLARIN also consists of a Component Registry and a Concept Registry: the former relates to profiles and components for resource types, and the latter relates to concepts, similar to DatCatInfo, although the concepts are skewed predominantly towards metadata. A persistent URI is provided for each profile, component, and concept; these URIs do resolve, but there is no RDF support. Moving on to Chapter 7, ‘Expressing language resource metadata as Linked Data: The case of the Open Language Archives Community’, Gary F. Simons and Steven Bird introduce the Open Language Archives Community (OLAC), which, similar to CLARIN, aims to aggregate the metadata records of language resources for linguistic research across 60 participating archives into a single catalogue. OLAC’s language resource metadata is encoded according to Linked Data principles, so there is RDF support and the URIs resolve.

DatCatInfo’s issue of duplicate and near-duplicate data categories is documented in the book and remains unresolved to this day. As an alternative to prescribing a standard terminology (which is unlikely to be possible anyway, except for the most common of terms), terminology harmonisation can be achieved by mapping a data category to a suitable class in a reference ontology such as GOLD. This serves to ‘ground’ the data category and thus assists with semantic interoperability. This concept of ‘distributed terminology harmonisation’ can be taken even further by using, for example, the reference model of the Ontologies of Linguistic Annotation (OLiA) (Chiarcos & Sukhareva 2015). OLiA was developed to serve as an intermediate layer to GOLD and other reference ontologies and models. A data category from any of the many terminology repositories available (the Universal Dependencies, UniMorph, and LexInfo are some such examples) can be mapped to OLiA. OLiA in turn maps to external reference resources, such as GOLD and ISOcat (the precursor of DatCatInfo). This creates a ‘shared semantic space’ between the various vocabularies without requiring the use of one standard terminology for linguistics (Chiarcos, Fäth & Abromeit 2020: 5675).
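To make this mapping approach concrete, the following short Turtle sketch (my own illustration, not an example from the book) links a hypothetical tagset category to an OLiA reference concept, which is in turn linked to GOLD; the class URIs and linking axioms are indicative only and should be checked against the current OLiA and GOLD models.

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix olia: <http://purl.org/olia/olia.owl#> .
  @prefix gold: <http://purl.org/linguistics/gold/> .

  # A hypothetical data category from a local tagset, mapped to an OLiA reference concept.
  <http://example.org/my-tagset#NN> rdfs:subClassOf olia:CommonNoun .

  # The OLiA concept is in turn linked to the GOLD reference ontology.
  olia:CommonNoun rdfs:subClassOf gold:CommonNoun .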

Briefly considering linguistic annotation schemes, as shown above, the conceptual content can be associated with data categories which support RDF, either by mapping to a repository or by using a mediator such as OLiA. The physical representation of annotations can be modelled in RDF as well. In addition to the models proposed by Ide for representing linguistic annotations, CoNLL-RDF (see footnote 2) and POWLA are two further examples of annotation formats for RDF (Cimiano et al. 2020). As column-based formats remain extremely popular, however, a suggestion here is to include an additional column containing a URI for each word, thus providing partial support for interoperability within this format. Although proposed for CoNLL TSV (Cimiano et al. 2020), in principle this could be applied to any TSV format. An example is shown below in (1) for the token ‘Friday’.
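The original example (1) is not reproduced here; a reconstruction along the lines described, with a simplified set of columns and an added final column holding the DBpedia URI for ‘Friday’, might look as follows (the URI is real; the other annotation values are illustrative only).

  # ID   FORM     LEMMA    POS     URI
  4      Friday   Friday   PROPN   http://dbpedia.org/resource/Friday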

The URI is from DBpedia, a knowledge base compiled from the information boxes on Wikipedia pages. Wikidata is another available knowledge base; there, the corresponding URI is https://www.wikidata.org/wiki/Q130. Both are Linked Open Data projects.

Utterances would be modelled in RDF as string literals, each with a language tag appended. An example is shown in (2), serialised in Turtle, with the language tag highlighted in bold.
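Example (2) itself is not reproduced here; a minimal reconstruction of the kind of statement described (with an invented utterance URI and text, and without the bold highlighting) might be:

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

  # A hypothetical utterance resource whose text is an English string literal;
  # the language tag is the '@en' appended to the literal.
  <http://example.org/corpus/utterance/42> rdf:value "See you on Friday"@en .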

The language tag should conform to BCP 47 and typically includes a language code from the ISO 639 standard (Parts 1, 2, or 3). ISO 639 provides language codes for the world’s main languages, but if a lesser-known language or dialect for which no ISO language code exists needs to be identified, then this is set in the private-use portion of the tag. As this private-use portion is user-defined, and thus not standardised, a pattern has been proposed by Gillis-Webber & Tittel (2019) to facilitate reuse.
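For illustration only (this shows the general BCP 47 private-use mechanism, not the specific pattern proposed by Gillis-Webber & Tittel), a language tag for an unlisted variety could combine the ISO 639-3 code ‘mis’ (uncoded languages) with a private-use subtag:

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

  # 'mis' is the ISO 639 code for uncoded languages; everything after '-x-'
  # is private use and carries no standardised meaning outside the project.
  <http://example.org/corpus/utterance/43> rdf:value "an utterance in an unlisted variety"@mis-x-example .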

Continuing with the theme of language identification, in Chapter 10, ‘Challenges for the development of Linked Open Data for research in multilingualism’, María Blume, Isabelle Barrière, Cristina Dye, and Carissa Kang discuss the challenges in the replication and reanalysis of linguistic research due to many complex factors. For example, as there is no single definition in the field of what constitutes a multilingual person, this term may be interpreted differently from one study to the next, making comparison difficult. However, with Linked Data, there is the potential to create extensive metadata, enabling the reuse (and extension) of language profiles defined for a study. To do this, rich metadata needs to be defined to account for, for example, the language(s) spoken, the cross-linguistic differences of each participant, and the multimodal markup of the resultant language data. However, as a starting point, each language needs to be accurately identified with an associated URI. Resolvable URIs for ISO 639 language codes are provided by Lexvo.org. If a lesser-known language or dialect for which no ISO language code exists needs to be identified, then an alternative catalogue can be used. Glottolog is the most comprehensive catalogue available, particularly for under-resourced languages, and it has been encoded using SKOS (Hammarström et al. 2020). However, although there is a persistent identifier for each language, there is no URI which returns structured markup for that language (Gillis-Webber, Tittel & Keet 2019). If a language is not available in a catalogue, then the language can be described in RDF. The Model for Language Annotation, an OWL ontology, is one such example which can be used; it also allows sociolects, ethnolects, and idiolects to be modelled (Gillis-Webber et al. 2019).
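As a small sketch of what such identification could look like in practice (my own illustration, with a hypothetical dataset URI), the object language of a resource can be stated with the Dublin Core property dct:language pointing at a resolvable Lexvo URI:

  @prefix dct: <http://purl.org/dc/terms/> .

  # The object language of a hypothetical dataset, identified by the resolvable
  # Lexvo URI for the ISO 639-3 code 'xho' (isiXhosa).
  <http://example.org/dataset/child-corpus-01> dct:language <http://lexvo.org/id/iso639-3/xho> .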

Chapters 8 and 9 are dedicated to the challenges of representing research data as Linked Data within the field of language acquisition, using available tools. In Chapter 8, ‘TalkBank resources for psycholinguistic analysis and clinical practice’, Nan Bernstein Ratner and Brian MacWhinney describe TalkBank as ‘the largest open repository of data on spoken language’, which had its origins in the Child Language Data Exchange System (CHILDES). Other language banks include AphasiaBank, PhonBank, TBIBank, and many more, with each bank using the same transcription format and standards. The authors have opted to link corpora to other corpora, instead of linking to external resources directly within the transcriptions. They are also in the process of developing methods to automatically compute quantitative measures for the purpose of comparing corpora or subsections within a corpus. In Chapter 9, ‘Enabling new collaboration and research capabilities in language sciences: Management of language acquisition data and metadata with the data transcription and analysis tool’, María Blume, Antonio Pareja-Lora, Suzanne Flynn, Claire Foley, Ted Caldwell, James Reidy, Jonathan Masci, and Barbara Lust introduce the Data Transcription and Analysis (DTA) tool, which allows for cross-linguistic comparison. The DTA tool is situated within the Virtual Linguistic Lab (VLL), a framework developed to facilitate collaborative research. Although the data captured within the DTA tool is neither stored as RDF nor exportable in RDF format, the authors recognise the value of Linked Data and have started explorations in this regard.

In closing, the Linguistic Linked Open Data (LLOD) cloud (https://linguistic-lod.org) is mentioned several times throughout the book. Each node in the online diagram represents a (meta)dataset, and clicking on a node takes the user to that dataset. In March 2015, there were 16 corpora available as Linked Data in the LLOD cloud; at the time of writing, there are in excess of 65. As has been shown by the editors, there are significant challenges still to overcome in order to represent linguistic research data according to the principles of Linked Data. Making data Linked Data-aware can therefore be an easier first step, starting with URIs, expressive metadata, and the inclusion thereof in one of the repositories introduced in the book. For the reader new to Linked Data, the biggest challenge may be the concepts: there is a lot of jargon and the concepts can seem highly complex at first glance. For further reading, Moran and Chiarcos have provided a repository of do-it-yourself tutorials (Chiarcos n.d.). For a more thorough exploration of Linguistic Linked Data, the book by Cimiano et al. (2020) is recommended.

Footnotes

[1] This work was financially supported by the Hasso Plattner Institute for Digital Engineering through the HPI Research School at the University of Cape Town.

[2] Within the field of computational linguistics or natural language processing (NLP), an outcome of the Conference on Natural Language Learning is CoNLL TSV, a widely used community standard whose format is tab-separated values (TSV). CoNLL TSV has since been converted to RDF as CoNLL-RDF.

References

Berners-Lee, Tim. 2006. Linked Data. https://www.w3.org/DesignIssues/LinkedData.html (accessed 10 March 2021).
Berners-Lee, Tim, Hendler, James & Lassila, Ora. 2001. The Semantic Web. Scientific American 284, 34–43.
Chiarcos, Christian. n.d. Teach yourself LLOD. http://acoli.informatik.uni-frankfurt.de/resources/llod/ (accessed 10 March 2021).
Chiarcos, Christian, Fäth, Christian & Abromeit, Frank. 2020. Annotation interoperability for the post-ISOCat era. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 5668–5677.
Chiarcos, Christian & Sukhareva, Maria. 2015. OLiA – Ontologies of Linguistic Annotation. Semantic Web 6, 379–386. doi:10.3233/SW-140167.
Cimiano, Philipp, Chiarcos, Christian, McCrae, John P. & Gracia, Jorge. 2020. Linguistic Linked Data: Representation, generation and applications. Cham: Springer. doi:10.1007/978-3-030-30225-2.
Gillis-Webber, Frances & Tittel, Sabine. 2019. The shortcomings of language tags for Linked Data when modelling lesser-known languages. Proceedings of the 2nd Conference on Language, Data and Knowledge (LDK 2019), 4:1–4:15. Leipzig, Germany.
Gillis-Webber, Frances, Tittel, Sabine & Keet, C. Maria. 2019. A model for language annotations on the Web. In Villazón-Terrazas, Boris & Hidalgo-Delgado, Yusniel (eds.), Knowledge graphs and Semantic Web, 1–16. Cham: Springer.
Hammarström, Harald, Forkel, Robert, Haspelmath, Martin & Bank, Sebastian. 2020. Glottolog 4.3. https://glottolog.org/glottolog/glottologinformation (accessed 13 March 2021).
RDF Working Group. 2014a. RDF 1.1 Primer: W3C Working Group Note 24 June 2014. https://www.w3.org/TR/rdf11-primer/ (accessed 10 March 2021).
RDF Working Group. 2014b. Resource Description Framework (RDF). https://www.w3.org/RDF/ (accessed 10 March 2021).