Overcoming Challenges in Corpus Construction: The Spoken British National Corpus 2014, by Robbie Love. New York: Routledge, 2020. ISBN 978-1-138-36737-1, xviii + 202 pages

Published online by Cambridge University Press:  11 May 2021

Siqi Liu*
Affiliation:
Department of English Language and Literature, Korea Maritime and Ocean University, Yeongdo-Gu, Busan, Republic of Korea E-mail: liusiqi1225@foxmail.com

Type
Book Review
Copyright
© The Author(s), 2021. Published by Cambridge University Press

Developing spoken corpora is an important focus of linguistic research and natural language processing. As “one of the biggest available corpora of spoken British English” (Nesselhauf & Römer, 2007, p. 297), the Spoken British National Corpus 1994 (Spoken BNC1994), created between 1991 and 1994, has been a highly productive resource for research (e.g., Gabrielatos, 2013; McEnery, 2005). However, it is problematic to use a corpus compiled in the early 1990s to explore features of “present-day” English. To provide a new data source, the Spoken British National Corpus 2014 (Spoken BNC2014) project, led by Tony McEnery and Claire Dembry, was completed and released in 2017, and its Extensible Markup Language (XML) files became downloadable in 2018. Robbie Love, who contributed to all stages of the corpus compilation, documented the entire process in this book, Overcoming Challenges in Corpus Construction: The Spoken British National Corpus 2014. It presents a thorough description of the design and construction of the Spoken BNC2014. It also offers a valuable resource for corpus linguistics, applied linguistics, natural language processing, and English language studies, and provides new insights into the future creation and application of spoken corpora.

This book consists of nine chapters. After a general introduction in Chapter 1, the volume comprises three parts, namely Before Corpus Construction (Chapters 2 and 3); During Corpus Construction (Chapters 4–7); and After Corpus Construction (Chapters 8 and 9).

Before Corpus Construction focuses on the theoretical background and design of the corpus. In Chapter 2, the author introduces the Spoken BNC1994 and explains why a new Spoken BNC is needed. The Spoken BNC1994 is composed of two parts: the demographically sampled part (Spoken BNC1994DS) and the context-governed part (Spoken BNC1994CG). Consisting of around 5 million words produced by 1408 speakers, the Spoken BNC1994DS comprises transcripts of informal, “everyday spontaneous interactions” (Leech et al., 2001, p. 2). The Spoken BNC1994CG, containing around 7 million words produced by 3986 speakers, is task-oriented and formal. Although several spoken English corpora have been compiled and developed since the Spoken BNC1994, they are either not publicly accessible for commercial reasons (e.g., the Cambridge and Nottingham Corpus of Discourse in English) or not available without payment of a license fee (e.g., the British component of the International Corpus of English). Hence, the Spoken BNC1994, as a widely accessible spoken English corpus, still plays an essential role in natural language processing and linguistic research. However, since it was released two decades ago, researchers now need a new spoken corpus of English.

After discussing the motivation for constructing a new Spoken BNC, Chapter 3 focuses on corpus design with considerable attention paid to representativeness. The author introduces the concept of representativeness, discusses how previously compiled spoken corpora have been designed, examines the representativeness of the Spoken BNC1994, and finally proposes the novel design of Spoken BNC2014.

Representativeness refers to “the extent to which a sample includes the full range of variability in a population” (Biber, 1993, p. 243). It can be evaluated in two ways: target domain representativeness, which “represents the range of text types in a population” (ibid. p. 243); and linguistic representativeness, which represents “the extent to which it includes the range of linguistic distributions in the population” (ibid. p. 243).

The issue of representativeness is often discussed and may never be entirely overcome. What the corpus compiler can do is take steps to maximize the representativeness of the corpus. According to Crowdy, representativeness is accomplished by “sampling a spread of language producers in terms of age, gender, …, and recording their language output over a set period of time” (1993, p. 257). There are many methods of collecting a sample from a population. Such methods lie on a continuum from probability sampling to convenience (i.e., non-probability) sampling (Phillips & Egbert, 2018). Focusing on the Spoken BNC1994, the author discusses the distinction between probability sampling and convenience sampling. When choosing and evaluating a sampling method, “considerations of efficiency and cost effectiveness must be balanced against higher degrees of representativeness” (Biber, 1993, p. 244). As described by the author, the Spoken BNC1994 is “a compromise between what would maximize representativeness and what was possible in practice” (Love, 2020, p. 33). Like most corpora, the Spoken BNC1994 adopted convenience sampling to collect data, despite its drawbacks. The author also summarizes the design approaches of several other spoken corpus compilation projects, concluding that some element of convenience sampling is inevitable in this process. Hence, like the Spoken BNC1994, the new spoken corpus adopts convenience sampling for practical reasons.

The author then introduces the target domain of the original design of the Spoken BNC2014, viz., “Informal spoken British English, produced by L1 speakers of British English in the mid-2010s, whereby ‘British English’ comprises four major varieties: English, Scottish, Welsh and Northern Irish English” (ibid. p. 45). The other aspect of representativeness, linguistic representativeness, is discussed in Chapter 8, since the range of linguistic distributions can be obtained only after the corpus has been compiled.

Moving from theory to practice, the author next discusses how the Spoken BNC2014 was constructed and how the challenges encountered were overcome in the part called During Corpus Construction. This part presents the data collection procedure, transcription, and processing in detail. Chapter 4 concerns data collection, including recruitment, metadata, and audio data. Public participation in scientific research (PPSR; Shirk et al., 2012) was adopted for the data collection of the Spoken BNC2014. Specifically, the Spoken BNC2014 adopted the contributory model of PPSR, in which members of the public contribute data to the project under construction. Moreover, unlike the Spoken BNC1994DS, which recruited participants privately, the Spoken BNC2014 used several strategies to promote the project to the public and recorded a richer set of metadata. In addition, the new Spoken BNC adopted a novel method for audio data collection. Instead of using recording devices provided by the researchers, the contributors recorded their conversations on their own smartphones.

Chapters 5 and 6 explore the aspects of transcription including the transcription of the audio data and speaker identification, respectively. Chapter 5 explains the decisions about the transcription of the recordings. The author first discusses whether to adopt human or automated transcription, then justifies the rejection of automated transcription. After presenting the shortcomings of the old Spoken BNC transcription scheme, the author introduces the newly developed bespoke transcription scheme for the Spoken BNC2014. Finally, Chapter 5 presents the transcription process for the new Spoken BNC, which involved a group of 20 trained transcribers transcribing the audio data. Since the audio recordings were provided remotely by the contributors, Chapter 6 discusses the confidence and accuracy in speaker identification. It was found that the transcribers were likely to find it challenging to assign speaker ID codes to the speakers’ turns when several people were involved in the recordings. The findings revealed that while certainty was high, the inter-rater agreement was only fair. Finally, the author discusses how the users of Spoken BNC2014 can mitigate the potential effects of this phenomenon.

The volume then comes to the final stage of the construction of the Spoken BNC2014: processing and dissemination. Chapter 7 can be divided into three sections: format conversion, annotation, and public dissemination. First, the author explores the conversion of Microsoft Word documents into XML through a PHP script. Then, he introduces the annotation scheme that was applied to the XML files, including part of speech (POS), lemma, and semantic categories. The final section discusses the public dissemination of the corpus. The initial release was via the CQPweb server at Lancaster University (Hardie, 2012) in September 2017, while the XML files and metadata became downloadable in November 2018.

In the part named After Corpus Construction, Chapter 8 summarizes the completed Spoken BNC2014 and evaluates it. Focusing on the major demographic strata of the speakers (e.g., gender, age range, socioeconomic status, regional dialect), the author describes how the main speaker metadata categories are populated in the new Spoken BNC, and evaluates the target domain representativeness and linguistic representativeness of the corpus. It is argued that the Spoken BNC2014 is representative of “informal spoken English, produced by L1 speakers of British English, in England, in the mid-2010s” and at least “eight major part-of-speech categories which account for over 85% of all tokens in the corpus” (Love, 2020, p. 190). However, it should not be considered representative of “(a) English as spoken across the whole of the UK or (b) any type of spoken register beyond informal conversation” (ibid. p. 5). Finally, Chapter 9 comments on the future construction and application of the corpus. It also introduces some projects that build on or further develop the Spoken BNC2014 and so add value to the corpus, such as the BNC secondary data analysis project (Brezina, Gablasova & Reichelt, 2018) and the transcription of the BBC “Listening Project” recordings (McEnery, 2017).

To summarize, the Spoken BNC2014, created between 2012 and 2016, comprises 11,422,617 words of transcribed content produced by 668 speakers across 1251 recordings. It is a corpus of informal British English conversation in the mid-2010s, comparable to the conversational component of the Spoken BNC1994, that is, the Spoken BNC1994DS. This new corpus provides a fresh data source for researchers interested in corpus linguistics, natural language processing, and related fields. In all, the book is informative and written with great clarity. One minor issue is that the titles and subtitles are not numbered, which makes it a little difficult for readers to grasp the general structure of each chapter. However, this minor defect does not obscure the book's great virtues. It provides a theoretical and practical guide to every stage of spoken corpus compilation, with intricate details of the possible challenges and how they can be overcome. Therefore, it is of great value not only for corpus users, who may refer to this book to gain a thorough understanding of this corpus, but also for corpus compilers interested in constructing their own corpora.

Acknowledgment

The author would like to thank the book review editor of NLE, Dr. Diana Maynard, for her detailed comments and helpful suggestions.

References

Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing 8(4): 243–257.
Brezina, V., Gablasova, D. & Reichelt, S. (2018). BNClab. Lancaster University. Retrieved March 2021, from http://corpora.lancs.ac.uk/bnclab
Crowdy, S. (1993). Spoken corpus design. Literary and Linguistic Computing 8(4): 259–265.
Gabrielatos, C. (2013). If-conditionals in ICLE and the BNC: A success story for teaching or learning? In Granger, S., Gilquin, G. & Meunier, F. (eds.) Twenty Years of Learner Corpus Research: Looking Back, Moving Ahead. pp. 155–166. Louvain-la-Neuve, Belgium: Presses Universitaires de Louvain.
Hardie, A. (2012). CQPweb: Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics 17(3): 380–409.
Leech, G., Rayson, P. & Wilson, A. (2001). Word Frequencies in Written and Spoken English: Based on the British National Corpus. Harlow: Pearson Education Limited.
Love, R. (2020). Overcoming Challenges in Corpus Construction: The Spoken British National Corpus 2014. New York: Routledge.
McEnery, T. (2005). Swearing in English: Bad Language, Purity and Power from 1586 to the Present. London: Routledge.
McEnery, T. (2017). Introducing a new project with the British library. Retrieved March 2021, from http://cass.lancs.ac.uk/introducing-a-new-project-with-the-bbc-and-the-british-library/
Nesselhauf, N. & Römer, U. (2007). Lexical-grammatical patterns in spoken English: The case of the progressive with future time reference. International Journal of Corpus Linguistics 12(3): 297–333.
Phillips, J. C. & Egbert, J. (2018). Advancing law and corpus linguistics: Importing principles and practices from survey and content analysis methodologies to improve corpus design and analysis. Brigham Young University Law Review 2017(6): 1589–1620.
Shirk, J. L., Ballard, H. L., Wilderman, C. C., Phillips, T., Wiggins, A., Jordan, R., … Bonney, R. (2012). Public participation in scientific research: A framework for deliberate design. Ecology and Society 17(2): 29.