
Martin Weisser, Practical Corpus Linguistics: An Introduction to Corpus-based Language Analysis. Chichester: Wiley-Blackwell, 2016. Pp. xviii + 287.

Published online by Cambridge University Press:  11 June 2018

Aleksander Siwicki*
Affiliation:
Faculty of Applied Linguistics, University of Warsaw, ul. Krakowskie Przedmieście 26/28, 00–927 Warsaw, Poland. a.siwicki@student.uw.edu.pl

Book Review

Copyright © Nordic Association of Linguistics 2018

A considerable number of corpus-linguistics textbooks are available on the market as corpus linguistics becomes an increasingly widespread area of linguistic research. Martin Weisser's Practical Corpus Linguistics: An Introduction to Corpus-based Language Analysis is one recent publication that may prove useful to academic teachers planning to design an introductory course in English corpus linguistics.

Since modern corpus studies rely almost exclusively on computerised methods, corpus-linguistics textbooks are notoriously prone to becoming quickly outdated. While other publications in this field may seem either partially obsolete (for instance, Tognini-Bonelli 2001) or not structured suitably for an introductory course (for instance, O'Keeffe & McCarthy 2010 or Lüdeling & Kytö 2008), Weisser's book is free from both these shortcomings. It is designed for both students and teachers dealing with this area of linguistics. As the topics and exercises are arranged in such a way that each section builds on the previous one, the structure of the book provides a blueprint for a coherent course in corpus linguistics.

The book is divided into 12 chapters, including an introduction and a concluding chapter; the latter sums up the content of the book and suggests further perspectives on how corpus linguistics can be used. Chapter 1 serves as an introduction and briefly discusses the term ‘linguistic data’. It also includes information on how the book is structured and how it can be used by academic teachers. In Chapter 2, ‘What's out there? A general introduction to corpora’, the author defines the term ‘corpus’ and provides some basic insights into various formats and types of corpora. The chapter includes a historical overview of early synchronic corpora (pp. 15–20), both written (such as the Brown Corpus, the Lancaster–Oslo/Bergen Corpus, and the Frown Corpus) and spoken (such as the Survey of English Usage Corpus, the London–Lund Corpus, and the Spoken English Corpus). The author pays close attention to two factors that define a well-compiled corpus: balance and representativeness. He observes that most corpora are dominated by written texts, which creates a distorted image of how language is used by its speakers. This is because spoken materials are ‘much more difficult to obtain and handle than written ones, as well as more costly to process’ (p. 20), which in turn reflects a ‘literature-focussed society and education’ (p. 16). Modern mega corpora (the British National Corpus (BNC), the American National Corpus, the Corpus of Contemporary American English (COCA), and the Corpus of Global Web-Based English) and diachronic corpora (such as the Helsinki Corpus, the Lampeter Corpus, and the Corpus of Historical American English) are also presented (pp. 20–21). As far as various types of corpora are concerned, the author makes a distinction between general and specific corpora, and provides several examples of academic, learner, and pragmatically annotated corpora (pp. 21–25).

Chapters 2–11 include a number of exercises intended for linguistics students and teachers of introductory courses in corpus linguistics. While Chapter 2 contains exercises based on reflection and understanding (‘do you think . . .?’, ‘compare . . .’, ‘try to understand. . .’, ‘try to evaluate . . .’), further chapters come with more practical tasks tailored to the topics covered. Each of these chapters concludes with detailed and carefully written answers and solutions to the exercises.

Chapter 3, the last descriptive chapter, explains to the reader the general structure of a corpus and touches on issues such as sampling (pp. 30–31) and the size (pp. 31–32) of corpora. This chapter also introduces subjects that lie outside the scope of ‘traditional’ linguistics. Namely, the author describes the design of electronic texts, outlining meta-data (pp. 33–38) and discussing the most popular encodings used in computer documents (pp. 38–40). This section is vital for the subsequent chapters, especially Chapter 11, which is devoted to the technical side of corpus linguistics.

In Chapter 4, the author provides instructions on how users can compile their own corpora. He also offers insightful exercises and detailed advice on how to find written as well as spoken materials, how to extract them, and how to create files that are supported by most computer tools used to analyse corpus data. This chapter may prove to be essential for (junior) researchers who prefer preparing their own reference materials to analysing corpora compiled by other researchers. A clear advantage of this section (and the book as a whole) is that it provides step-by-step instructions that can be used even by (as the author puts it, p. 11) less ‘technically literate’ linguists. On the other hand, readers who are more proficient in IT may find such detailed tutorials superfluous. Indeed, what seems to be missing is a clear division between instructions designed for ‘technophobic’ users and those directed at more seasoned corpus researchers. As a result, technically advanced researchers have to make their way through all the instructions before they are able to pick out issues they are not yet familiar with.

Chapter 5 focuses on concordancing. The author presents functions offered by one of the most widely used corpus tools, AntConc, available free of charge at http://www.laurenceanthony.net/. This section is focused on practical advice and provides a number of exercises and screenshots from AntConc that illustrate concrete steps to be followed by less advanced users. Functions described by the author range from opening texts that make up a corpus to actions such as saving, pruning, and reusing the results of concordancing (pp. 75–77). The only potential shortcoming of the way in which concordancing is explained here has to do with the fact that the AntConc tool is still being developed. Because of this, any future versions of the programme may diverge from the build described in the book (AntConc, version 3.4.3w). In fact, the current version of AntConc (3.5.2w, March 2018) does include some options that are not offered by its older version, such as the ‘Show every nth row’ functionality. The author could address this problem by including the book's exercises on his web page (http://martinweisser.org/) and updating them to reflect any significant modifications applied to the tool. Unfortunately, to date, there is only one exercise related to concordancing on the accompanying web page, but this exercise is not based on AntConc. Instead, Weisser offers his own corpus tools, without describing them in the book under review.

Chapter 6, ‘Regular expressions’, shows the reader how to use regexes in corpus-linguistic work. Regexes, or regular expressions, are an alternative method of running searches on any computerised corpus. They make use of metacharacters that make it possible to match, for instance, any alphanumeric character, any element found zero or more times, any non-digit, and much more. The chapter relies to a great extent on exercises available on the author's web page (http://martinweisser.org/pract_cl/regExes.html). This chapter includes information on – and tasks related to – issues such as character classes (Sections 6.1 and 6.2), quantification thereof (Section 6.3), and functions such as anchoring, grouping, and alternation (Section 6.4). In Section 6.5, further (AntConc) exercises can be found.
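
The metacharacter classes the chapter describes can be tried out directly. A minimal sketch in Python's standard regex syntax (the sample sentence and patterns are my own illustrations, not examples from the book):

```python
import re

text = "In 1994, the BNC reached 100 million words."

# \w+ matches runs of alphanumeric characters (plus underscore)
words = re.findall(r"\w+", text)

# \d+ matches one or more digits
numbers = re.findall(r"\d+", text)

# Grouping and alternation: match either 'million' or 'billion'
scale = re.search(r"(million|billion)", text).group(1)

# Anchoring: ^ ties the match to the start of the string
starts_with_in = re.match(r"^In\b", text) is not None
```

The same patterns can be pasted into AntConc's regex search mode, which is what Section 6.5's exercises build on.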

Chapter 7 is a short overview of morpho-syntactic (part-of-speech, PoS) tagging. The author compares two popular tagsets, Penn Treebank and CLAWS 7, and proposes a tag-your-own-data exercise based on the latter tagset. The author's insights into various drawbacks of automatic taggers are valuable from an academic perspective. He argues that traditional taggers (like CLAWS) lack linguistic ‘knowledge’ in that they cannot recognize and appropriately categorize exceptional cases and some more complex structures such as longer noun phrases. For this reason, and owing to the way they are constructed, they are not accurate enough. As a result, an additional burden of manual post-tagging is imposed on the researcher, who is advised to always verify the output of the automatic tagger. The author also mentions a tagger that is not limited to English, the TreeTagger; given the book's focus on English, though, this topic is not explored further.
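
Tagged output of the kind discussed here is often delivered in a simple word_TAG format. A hypothetical sketch of working with such output (the sentence and the CLAWS-style labels are invented for illustration):

```python
# A hypothetical line of PoS-tagged output in word_TAG format
tagged = "The_AT0 cat_NN1 sat_VVD on_PRP the_AT0 mat_NN1"

# Split each token into its word and its tag
pairs = [tok.rsplit("_", 1) for tok in tagged.split()]

# Collect all nouns (tags starting with NN) for manual checking,
# since the chapter advises always verifying the tagger's output
nouns = [word for word, tag in pairs if tag.startswith("NN")]
```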

In Chapters 8–10, the author shifts his focus from smaller corpora to mega corpora, mainly the BNC and COCA. In Chapter 8, he provides a large number of exercises based on running queries (pp. 123–132 for BNCWeb and pp. 133–137 for COCA) as well as detailed solutions to those exercises. Chapter 9, ‘Basic frequency analysis – or what can (single) words tell us about texts?’, is closely related to Chapter 8 and begins with some reflections on the basic textual unit used in corpus analysis: the word. The author convincingly argues that choosing an orthographic word as a basic unit of word-class tagging (which is the case with most corpora) has some shortcomings. He gives a few examples of compounds that may cause tagging issues, such as ice cream (which, when it is not hyphenated, consists of two textual units that are tagged as separate words). On the other hand, as Weisser argues, there is no fixed rule on how to represent composite words, as it ‘seems to be a case of whatever form becomes conventionally more or less accepted by a majority of people in/for its written form’ (p. 148). Other examples of English lexical structures that may be problematic in corpus linguistics, according to the author, include multi-word units (such as phrasal verbs or multi-word conjunctions like as far as), contractions (can't, won't, etc.) and the possessive marker ’s, in addition to polysemous expressions (such as bank, which can represent a noun with several meanings/senses as well as a verb).
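
The tokenisation problems Weisser raises are easy to reproduce. A deliberately naive tokeniser (my own illustration, not the book's method) treats ice cream as two units, breaks the contraction at the apostrophe, and detaches the possessive marker:

```python
import re

def tokenize(text):
    # Naive tokeniser: runs of word characters only
    return re.findall(r"\w+", text)

ice = tokenize("ice cream")   # two separate textual units
cant = tokenize("can't")      # contraction split at the apostrophe
poss = tokenize("the cat's")  # possessive 's detached from its noun
```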

Further sections of Chapter 9 are devoted to frequency analysis and include issues such as creating word frequency lists and using stop words (words or phrases excluded from search results) in AntConc (pp. 151–160) and BNCWeb (pp. 160–169) as well as comparing two subcorpora in terms of keywords (pp. 169–171 for AntConc and pp. 172–174 for BNCWeb). Section 9.5, ‘Comparing and reporting frequency counts’, is vital for analytic purposes when dealing with two (or more) unequally large corpora. The thrust of this section is to offer guidance on how to normalise the results of frequency analysis so that relative frequency is taken into account instead of raw frequency. The author rightly observes (p. 146) that basic frequency analysis, which is based on comparing corpus information carried by single words, is limited in its usefulness:

[W]hat we want to do here is to try and develop an understanding of how much, but perhaps also to some extent how little, lists of single words can tell us about the texts they occur in.
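
The normalisation described in Section 9.5 reduces to simple arithmetic: a raw count is divided by corpus size and scaled to a common base, typically occurrences per million words. A sketch with invented figures:

```python
def per_million(raw_count, corpus_size):
    # Relative frequency scaled to occurrences per million words;
    # multiplying first keeps the integer arithmetic exact
    return raw_count * 1_000_000 / corpus_size

# Invented counts: the same raw count means very different things
# in a one-million-word corpus and a hundred-million-word corpus
freq_small = per_million(250, 1_000_000)
freq_large = per_million(250, 100_000_000)
```

Comparing freq_small and freq_large rather than the raw counts is what makes results from unequally sized corpora commensurable.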

Since investigating individual words may not be sufficient for many linguistic purposes, the ability to deal with multi-word units is crucial for any researcher. This topic is covered in Chapter 10, ‘Exploring words in context’. As an introduction to this chapter, the author provides an exercise related to establishing units of textual analysis. While it is assumed that the key unit that one should deal with in linguistics is the sentence, it is necessary to define what a sentence is, especially from the point of view of spoken corpora. When transcribing recorded speech, researchers are expected to set sentence boundaries themselves in the written form of a spoken corpus. In this respect, Weisser follows the segmentation method based on c-units as defined by Biber et al. (1999). In Section 10.3, the author outlines the notion of n-grams as distinct from both lexical bundles and word clusters (which may be seen as an umbrella term for these two). Subsequent sections include exercises for finding, and making sense of, collocations and colligations while using BNCWeb (pp. 198–202 and pp. 210–211), COCA (pp. 202–205 and pp. 211–212), and AntConc (pp. 205–207 and pp. 209–210).
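
The notion of n-grams outlined in Section 10.3 — every contiguous sequence of n tokens — can be sketched in a few lines (the sample phrase is my own illustration):

```python
def ngrams(tokens, n):
    # All contiguous sequences of n tokens, in text order
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "as far as I know".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Frequent recurring n-grams of this kind are the raw material from which lexical bundles and clusters are identified.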

Chapter 11, ‘Understanding markup and annotation’, stands out from the other practical chapters (4–10) in that it focuses on technical IT issues rather than linguistically driven topics. This chapter explores various annotation types, ranging from orthographic annotation to error coding (p. 228), and includes a timeline of annotation markup languages (SGML, HTML, and XML). In Section 11.2, the author explains why using XML and CSS (Cascading Style Sheets) may be beneficial for corpus-linguistic research. His explanation is plausible, since visual aids such as colour coding significantly lower the burden of dealing with large amounts of corpus data. Next, a number of linguistic annotation exercises are proposed that are based on what the author calls ‘Simple XML’ (pp. 236–239). Section 11.4 includes CSS-based exercises for visual markup of computerised corpora. While their importance should not be underestimated, they stand out considerably in difficulty from the earlier exercises on running queries and compiling corpora.
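
XML-annotated corpus data of the kind Chapter 11 discusses can be processed with entirely standard tools. A minimal sketch using Python's xml.etree module (the element and attribute names here are invented for illustration and are not the book's ‘Simple XML’ scheme):

```python
import xml.etree.ElementTree as ET

# Hypothetical PoS-annotated sentence fragment
fragment = '<s><w pos="AT0">the</w> <w pos="NN1">corpus</w></s>'

# Parse the fragment and recover each word with its annotation
sentence = ET.fromstring(fragment)
annotated = [(w.text, w.get("pos")) for w in sentence.iter("w")]
```

A CSS rule keyed to the pos attribute could then colour-code each word class in a browser, which is the kind of visual markup Section 11.4 exercises.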

One of the most prominent advantages of the book, especially from a student's point of view, is that the author provides extensive meta-commentary on his work. One example is his prediction that introductory exercises ‘may not yet appear very useful to you, and you might well be tempted to give up if things appear too abstract’. However, as the author promises, if the reader follows his instructions, ‘he or she [will] definitely be able to profit from [them] in [his or her] work in corpus linguistics’ (p. 83). Such meta-commentaries are beneficial for two reasons. First, they are proof that the content and structure of the book have been carefully thought out. Second, they make the potential reader aware that each section and exercise is essential in his or her education in corpus linguistics.

Overall, the book fulfils the objective defined in its introductory chapter (p. 1): it can serve as reference material that teaches readers how to use corpus data in their research. Thanks to its high instructional value, the book will no doubt prove useful to students and teachers of linguistics as well as up-and-coming academics. Those who have not previously used corpus methods in their research are also likely to benefit.

References

Anthony, Laurence. 2018. AntConc [Computer software]. Retrieved from http://www.laurenceanthony.net/software/antconc/, 30 March 2018.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad & Edward Finegan. 1999. Longman Grammar of Spoken and Written English. London: Longman.
Lüdeling, Anke & Merja Kytö (eds.). 2008. Corpus Linguistics: An International Handbook, vol. 1. Berlin: Walter de Gruyter.
O'Keeffe, Anne & Michael McCarthy (eds.). 2010. The Routledge Handbook of Corpus Linguistics. London: Routledge.
Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work. Amsterdam: John Benjamins.