Work in computational linguistics has until recently dealt almost exclusively with modern languages. Most known techniques in computational linguistics rely on statistical models that first have to be generated on the basis of data that has been correctly annotated. Only then can they predict the analysis of unseen data. Manual annotation of a dataset is a very time-consuming and costly endeavour, and while computational linguistics has numerous applications that readily attract commercial funding, few customers demand that their new mobile phone should give them directions in Latin.
Given the focus on modern languages, it makes good sense to write a book about the challenges involved in applying computational linguistic techniques to historical languages. Philologists, linguists and computer scientists have to learn from each other (and understand each other's research priorities) to make this possible. Barbara McGillivray has taken this idea one step further and written a book specifically about computational linguistics applied to Latin. This too makes sense, not because Latin is unlike any other historical language, but because some of the resources that make it possible to analyse Latin linguistic data computationally have recently become available. We now, for example, have morphosyntactically annotated corpora of Latin texts, which are freely available for anyone to use.
M.'s goal is to illustrate the advantages of a computational approach and to show how well-known computational methods can be applied to Latin linguistic data. She explicitly states that the book is a methodological contribution and the reader should not expect novel linguistic insights. It is also clearly not intended as a step-by-step guide, although parts of the book certainly read like a textbook introduction.
The introductory chapters (chs 1 and 2) explain the motivation for Latin computational linguistics and provide a brief overview of existing work. M. then turns to three case studies. The case studies are based on real linguistic research questions and to some extent replicate research that has been done manually. In the first case study (ch. 3) she constructs a valency lexicon of Latin verbs by extracting argument frames from manually annotated Latin corpora. In the next case study (chs 4 and 5) she uses the valency lexicon and statistical methods to model selectional preferences in terms of more abstract semantic classes. The final case study (chs 6 and 7) evaluates the correspondence between Latin pre-verbs and the morphosyntactic realization of verbal arguments.
It is challenging to write about this topic in a manner that is accessible to both computational linguists and classicists. Presumably with this in mind, M. has organized two of the case studies so that one chapter explains the method in general terms while the following chapter provides the statistical background. This does make it easier for the reader to skip the more technical parts if so inclined, but it sometimes leaves the reader with questions that are not answered properly until the second chapter. Problems of a similar nature arise throughout the book. For example, technical terms, like ‘the synset score’ and ‘F’, are used before they are defined; others, like ‘shared verb-slot’, are never explained; and quantitative data given in tables do not always match data given in the text, as on p. 154 and table 6.2. This is frustrating for the reader who wonders if s/he has misunderstood something crucial about the method.
Of more serious concern is M.'s attitude to the linguistic analysis of her data. The Latin corpus is inherently diachronic, and existing techniques in computational linguistics do not usually take this into account. M. is, of course, aware of this and discusses the need to adapt existing techniques to this scenario. It would clearly be beyond the scope of this book to tackle this problem, so M. instead controls statistically for diachronic effects and makes a few unavoidable compromises along the way. This is a reasonable method, but, in a book like this, one would expect the author to discuss the effect of such compromises and thus also the linguistic relevance of the results.
M. deserves much praise for devoting an entire book to this emerging field and for including three advanced case studies that address non-trivial research questions. It is possible for readers with very different backgrounds to gain an up-to-date overview of the field and appreciate some of the challenges and trade-offs involved. However, to convince classicists and linguists that computational linguistic methods can be fruitfully applied to Latin, one has to approach the subject matter with more linguistic sophistication and demonstrate that the methods can produce results that are meaningful to Latin linguists as well as computational linguists.