Corpora of Black South African English
Black South African English (henceforth BSAfE) has received more attention from corpus linguists than any other variety in the country. In the early 2000s, two corpus projects were initiated more or less simultaneously: the corpus of spoken Xhosa-English, compiled by Vivian de Klerk at Rhodes University, and the Tswana Learner English (TLE) corpus that I compiled. A number of further corpora have also been compiled in the more recent past, many of which are available in the public domain for research purposes. [Footnote 1]
Before examining the findings of the various studies, it is important to consider why corpus linguistics is a fruitful approach to the study of a New/Outer Circle Variety of English. When the late Sydney Greenbaum (1988) initiated the International Corpus of English (ICE) project, Schmied (1990) took up the challenge to compile an ICE-East Africa. He noted that there were a number of unique challenges to overcome in collecting all the data categories, and also in identifying the types of speakers to include in the corpus, especially in terms of their language proficiency and language backgrounds. Since then, a number of ICE-corpora have been completed for non-native varieties. They enable researchers of varieties of English to look for patterns of similarities and differences across various countries, larger sub-continental groupings and, especially, across the native/non-native divide. A full ICE-corpus has not yet been compiled for BSAfE, but a wide enough range of data types is available to allow for a similar range of research possibilities.
The most important advantage of a corpus for research on non-native varieties is to ground statements about the incidence of a particular feature in proper data (see Minow, 2010: 1). While intuitive judgements are useful and valuable as data for some linguistic research questions, they are much less reliable when it comes to matters of frequency. The reason for this is the human inclination to overestimate the frequency of rare events. Kahneman (2011: 322–33) explains why this happens: the mere fact that an event can be recalled from memory tends to bias the observer to overestimate its frequency, a process he calls the confirmation bias (2011: 81). Thus, grammatical descriptions of BSAfE may potentially claim that a particular linguistic form (e.g. the use of the unmarked verb form in past time contexts) is ‘characteristic’ of BSAfE, but corpus evidence allows Minow (2010: 111) to conclude that this phenomenon occurs in only 15% of possible past tense contexts, whereas the standard form of past tense marking is present in 85% of the possible contexts.
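The kind of calculation behind such a statement can be illustrated with a minimal Python sketch. It is not Minow's actual procedure: the function name `variant_rates` and the raw counts are hypothetical, chosen only so that the result reproduces the 15%/85% split reported above.

```python
from collections import Counter


def variant_rates(observations):
    """Return each variant's share of all possible contexts, as a percentage."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no contexts observed")
    return {variant: 100 * n / total for variant, n in counts.items()}


# Hypothetical annotation of 200 past-time contexts (illustrative counts only,
# not figures taken from any of the corpora discussed here).
contexts = ["marked"] * 170 + ["unmarked"] * 30
rates = variant_rates(contexts)
print(rates)
```

The point of counting every possible context, rather than only the attested non-standard forms, is that the rate is then expressed against the right denominator.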
The other advantages are related to the advantages of corpus linguistics in general. By studying corpora of naturally occurring data, the linguist is able to understand the functions of language in use much better, particularly since many instances of a particular linguistic feature can be retrieved to enable a comprehensive overview of the functional possibilities. Moreover, particularly within the more inductively oriented corpus-driven approach, new and unexpected features of a language can be discovered, patterns that occur too far apart for the human analyst to spot the correspondences in active memory. A look at much of the published corpus work on BSAfE shows that this opportunity is not yet fully capitalised on. The ‘comparative fallacy’ of determining what is different from some standard variety remains a methodological characteristic of current research (see Van Rooy, 2008).
Given such advantages, this article sets out to synthesise the insights that have been gained from corpus linguistic investigations into the grammatical features of BSAfE. A number of features related to the noun phrase and the verb phrase have been identified by researchers, and supported by corpus data. However, corpus data have also indicated that some features are either quite rare, or clearly not stabilised features, but rather transitional phenomena that disappear when language proficiency increases. The established features will be presented first, before looking at those linguistic features that are shown not to be stable features of BSAfE.
Established features
The progressive aspect has a long history of scholarly attention in BSAfE and many other New Englishes. The standard account has focused on the ‘extension of the progressive to stative verbs’ (Gough, 1996: 61; Mesthrie, 2008: 489), which corpus analysis confirms (De Klerk, 2006: 140). Siebers (2012: 149) and Van Rooy (2006) show that the progressive is still by far most frequent with activity verbs, so the ‘extension’ does not alter the core possibilities of the construction. Furthermore, Minow (2010: 144) points out that the frequency of the progressive is inversely related to the proficiency levels of the speakers in her corpus: the more proficient a speaker, the less frequent the progressive is. These two observations may lead us to suspect that the extension of the progressive is a learner language phenomenon, which is bound to disappear as speakers adjust their grammatical usage with increased proficiency.
A functional approach, which goes beyond observing the presence of a particular feature, paints a different picture, though. Van Rooy (2006) analyses a sample of 100 progressives from the TLE, and concludes that the underlying semantics of the construction is consistently different from the native speaker prototype of a dynamic event with limited duration. Rather, extended duration is profiled by the majority of progressive usages in the data. Once extended, rather than limited, duration is profiled, the construction is equally compatible with dynamic and stative predicates. Thus, Van Rooy (2006) argues against the interpretation that the progressive is ‘extended to stative verbs’. Siebers (2012: 150–3) likewise indicates that a large number of instances of the progressive in her corpus are compatible with extended, rather than limited, duration. In ongoing work that I am doing on the semantics of the progressive when used with stative verbs, it emerges that about half of all the examples are used with the semantics of extended duration (out of more than 500 drawn from all the corpora I had access to, including the XE1, VW, TLE and my own corpora). By contrast, about a quarter of the examples show the standard usage of states with limited duration, but just about the same number of instances denote states with unlimited duration, i.e. permanent states. [Footnote 2] Thus, corpus linguistic research on the progressive refutes the simplistic view that the progressive is merely extended to stative contexts, as in the following example (from Siebers, 2012: 151):
(1) So the way we are thinking, we're thinking like like like whites we mustn't have a – we mustn’t have this part of the body out on the open, you must cover it.
Siebers (2012: 151) points out that, in context, a general attitude is denoted rather than a temporary state, which is clearly a meaning that is not ascribed to the progressive in standard grammars. However, the temporal meaning in example (2), taken from Van Rooy (2006: 57), which is fully consistent with example (1), points to extended duration as well, even if the predicate denotes a dynamic event:
(2) In prison is where they graduate in their criminal activities because all the criminals are infested there they become more wicked and dangerous because the society is treating them like outcasts or the worst sinners.
A second set of verb phrase features relates to the expression of modality. Older accounts have identified a number of unique expressions, especially the occurrence of the form ‘can be able to’ (Gough, 1996: 63). De Klerk (2006: 150) confirms the presence of this form in her XE1 corpus. Using her data alongside the TLE, Van Rooy (2011) takes a closer look at the semantics of ‘can be able to’ in examples such as the following (from Van Rooy, 2011: 200):
(3) People become sick for a long time and this caused Aids because this deseas will kill all your imune system and the body can’t be able to diffend itself against other deseases.
Semantically, ‘can’ conveys the sense of extrinsic possibility in combination with the expression ‘be able to’, which is a semi-modal expression synonymous with the intrinsic ability sense of ‘can’. Thus, unlike accounts that suggest redundancy or hyperclarity in the use of ‘can’ with ‘be able to’, a functional analysis of corpus data shows that the expression is not, logically speaking, problematic, but merely not conventionalised in present-day native varieties of English, unlike in the Early Modern English Period, when it was used more widely, and even made it into the King James Bible translation (Crystal, 2008).
As far as the noun phrase is concerned, corpus research on BSAfE also yields confirmation of some existing research, but elaborates on existing insights at the same time. The resumptive pronoun (or left dislocation) strategy has been reported for BSAfE since the earliest descriptions. Mesthrie (1997), in a pre-corpus study that draws on an extensive ‘corpus’ of sociolinguistic interviews, offers a substantial body of insight into the pragmatic functions and syntactic environments in which various topicalisation phenomena occur in spoken BSAfE. He notes that the construction is much more frequent in BSAfE than in other varieties, and while it shares some of the functions with other varieties, such as the reintroduction of given information, it sometimes functions in contexts where no prominent pragmatic function is prevalent. Two syntactic environments, with partitive of-constructions and relative clauses, are identified, as well as the high frequency of the combination ‘people they’, as illustrated by the following example from Mesthrie (1997):
(4) The people, they got nothing to eat.
Corpus studies by De Klerk (2006: 140), Minow (2010: 193) and Siebers (2012: 204–6) all confirm Mesthrie's account, using three different spoken corpora. Botha (2012: 176–88) likewise confirms the existing accounts, but what is different is that her data are drawn from the written student work in the TLE. She finds that a much bigger proportion of instances can be attributed to referent tracking in especially relative clause and partitive constructions (2012: 187–8), and also explains that the high frequency of the ‘people they’ combination should not be overinterpreted: ‘people’ is simply by far the most frequent lexical noun in the entire TLE, and it is thus entirely expected that the most frequent noun should be the one that enters into the most frequent combination with a resumptive pronoun as well (2012: 176).
An area in which pre-corpus research is considerably less precise is the use of articles. There are three logically possible ways in which BSAfE articles may differ from native varieties, all of which have been reported in the literature: articles are omitted (Gough, 1996: 61), or inserted (Mesthrie, 2008: 496), or substituted for each other. Greenbaum and Mbali (2002: 241–3) mention all three possibilities in the same article. De Klerk's corpus analysis confirms the use of articles with non-count nouns (2006: 146), and she furthermore identifies a range of usages that would be unacceptable in native varieties, thereby largely confirming pre-corpus accounts of the unsystematic use of articles.
Minow (2010) and Siebers (2012) add to our understanding by showing that the omission and substitution of articles, compared to native norms, occurs with a rather low frequency: between them they find article occurrence rates of between 87% and 97%, depending on the corpus and on whether the article is indefinite or definite. The insertion of articles in positions where no overt article would be used in native varieties is the most frequent deviation from the norm, and both find that native-like usage increases with proficiency levels. Siebers (2012: 120–1) identifies one idiomatic expression that is characteristic of BSAfE usage, the form ‘kind of a NOUN’, exemplified by the following example:
(5) Because if you go and look for a job, you must be doing some kind of a research in order to know what kind of company or what kind of institution is that (M1).
Botha (2012) identifies a few more systematic patterns of different usage in BSAfE. Firstly, BSAfE seems to use articles more widely than native varieties before human institutions: besides ‘go to the bank/shop’, as in native varieties, BSAfE speakers also prefer the formulation ‘go to the school/university/hospital/jail’ (Botha, 2012: 257). Secondly, indefinite articles are used more widely in noun phrases with non-particular interpretations where such nouns are conventionally construed as uncountable; thus as well as the native-like ‘to have a better life’, BSAfE speakers also select ‘to have a time to relax’ (Botha, 2012: 265). Lastly, Botha (2012: 253–4) also identifies the use of the definite article with ascriptive nominals in BSAfE, for instance:
(6) It's the question of loyalty.
Botha (2012: 278) concurs with Siebers (2012: 131) that the basic underlying system of article usage is the same in BSAfE as in native varieties. To the one idiomatic alternative in BSAfE that Siebers identifies, Botha adds three more. In her conclusion, Botha (2012: 78) notes that the small differences between BSAfE and native varieties are due to alternative constructions that are conventionalised in BSAfE. These alternatives develop in the leaky edges where native speaker grammar is also less regular.
The distinctive use of quantifiers in BSAfE has been noted in pre-corpus accounts as well. Gough (1996: 62–3) lists a number of features that can be grouped together under the broad umbrella of quantification (the use of ‘too much’, ‘very’ and ‘very much’, ‘some few’, the ‘other…other’ construction, ‘the most thing’ and ‘X's first time’), whereas Mesthrie (2008: 495–6) includes the ‘other…other’ construction in his discussion of subordination and coordination. De Klerk (2006: 143) identifies instances of almost all the constructions listed by Gough in her XE1 corpus, and thus confirms their presence in the data.
Botha (2012: 309–14) points out that most of the exceptional usages noted by De Klerk and pre-corpus accounts of BSAfE are due to systematic extensions of other uses of quantifiers that do resemble native varieties more closely. The construction ‘most of the NOUNs’ occurs almost ten times as often in the TLE as in the native speaker control corpus; while this construction is perfectly acceptable in native speaker writing, it is quite rare there. A blended construction, not used by native speakers, is the construction ‘most of NOUNs’, which appears to take features of the ‘most of the NOUNs’ and ‘most NOUNs’ constructions. In fact, Botha (2012: 312) concurs with Mesthrie's (2006: 139–40) undeletion account where the ‘of’ is inserted, rather than an article omitted, in the construction. Thus, the following example (from Botha, 2012: 312) shows that the ‘of’ will probably be omitted in native varieties, since the context does not require the quantified head noun to be definite:
(7) Most of club owners complain about the standard of soccer in our country.
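A comparison such as ‘almost ten times as often’ presupposes that raw counts are normalised for corpus size. The following Python sketch shows one common way of doing this (occurrences per million words). It is an illustration only, not Botha's method: the helper names, the surface pattern, the hit counts and the corpus sizes are all hypothetical.

```python
import re


def per_million(hits, corpus_size):
    """Normalise a raw hit count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


def count_pattern(text, pattern):
    """Count case-insensitive, non-overlapping regex matches in a text."""
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Crude surface pattern for 'most of the NOUNs'. It matches any word ending
# in -s after 'most of the', so real work would need part-of-speech tagging.
MOST_OF_THE = r"\bmost of the \w+s\b"

sample = ("Most of the students agree. Most of club owners complain. "
          "Most of the teams lost.")
hits = count_pattern(sample, MOST_OF_THE)

# Hypothetical hit counts and corpus sizes, for illustration only.
learner_rate = per_million(50, 200_000)
control_rate = per_million(13, 500_000)
ratio = learner_rate / control_rate
print(hits, learner_rate, control_rate, round(ratio, 1))
```

Without normalisation, a larger control corpus could show more raw hits than the learner corpus even where the construction is relatively rarer in it.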
Botha (2012: 302–7) also offers an account of the ‘some few’ construction. She first points out that the form ‘some’ is used more extensively in BSAfE than in native varieties, due to its function as an overt marker of indefiniteness in contexts where the indefinite article is not typically found (like indefinite plurals), as illustrated by the following example (2012: 305):
(8) So you cannot play soccer each and every week and after that you’ve been paid some peanuts.
She proceeds to show that in contexts where ‘some’ combines with ‘few’, ‘some’ is used in its determinative role, and not its quantifier role. Hence, no conflict should arise, within the BSAfE system, in the combination ‘some few’. Native varieties would permit ‘a few + NOUNs’ in this context, exceptionally combining the indefinite article with a plural noun. The two instances of the expression in the TLE can both be paraphrased as ‘a few’ in native varieties: ‘After some few days’ and ‘In some few years ago’. Thus, the ‘some few’ construction is not so much a case of unusual quantification, but follows from the functional extension of ‘some’ to complement ‘a/an’ as marker of indefiniteness.
One morphological feature of the noun that has received attention in research on BSAfE is the distinction between mass and count nouns. In pre-corpus overviews (Mesthrie, 2008: 497; Gough, 1996: 61), the claim is made that non-count nouns are used as if they are count nouns. De Klerk (2006: 146) reports data from her XE1 corpus that show that certain mass nouns are used with plural suffixes (homeworks, equipments, moneys and advices). However, apart from ‘homeworks’, which occurs in the plural 13 times while the singular is used only 7 times, all other items are used in the singular form the majority of the time. Siebers (2012: 134) furthermore finds that such usage occurs mainly with the least proficient speakers in the XE2 corpus.
Botha (2012: 318) finds that the form ‘equipments’ is the most frequent pluralised mass noun in the TLE corpus, and it is the only form that has more plural usages than singular usages. She argues, however, that it is not so much that the contrast between mass and count nouns is violated, but rather that a number of nouns are re-analysed as count nouns, where native varieties of English typically use such nouns as mass nouns. The nouns that are most likely to undergo such reanalysis are nouns that refer to countable objects, such as ‘equipment’ or ‘furniture’. Abstract mass nouns, such as ‘information’ or ‘advice’, are the other type that undergo such reanalysis. These are variously construable as unbounded or bounded objects, as is shown by the contrast between German and English as far as ‘information’ is concerned: German ‘die Information’ can be pluralised to ‘die Informationen’.
Non-features
In the absence of a systematic database of language, from a range of speakers, it is difficult to determine which observations in a small sample constitute a ‘pattern’ of BSAfE usage, and which are individual errors or transitional phenomena. Both Gough (1996) and Buthelezi (1995) concede that they mainly drew on student writing as the source of evidence for their discussion of BSAfE features, while other writers, such as Greenbaum and Mbali (2002), did not intend to identify properties of a variety, but specifically intended to pinpoint recurring errors in student writing. The availability of bigger corpora enables researchers to quantify the extent to which a particular feature occurs, and if the frequency is negligible in the face of an overwhelming trend to select the native-like variant (or some other variant), then there is little reason to regard such a feature as an established feature of BSAfE. Minow (2010: 3) argues that a stable feature of BSAfE is one that is used to some degree by all speakers, regardless of proficiency level, while a feature that is restricted to the least proficient speakers in the data should not be regarded as a stable feature. Siebers (2012: 187) refers to the view of Romaine (2005: 427) that a feature which appears 80–90% of the time should be regarded as acquired by a speaker, and the remainder should be regarded as performance errors.
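The two criteria just described can be expressed as simple checks. The Python sketch below is one possible operationalisation, not the procedure used by Minow or Siebers: the function names, the proficiency bands, the choice of 0.8 as the lower bound of Romaine's 80–90% range, and all counts are assumptions made for illustration.

```python
def is_stable_feature(uses_by_level):
    """Minow-style check (assumed operationalisation): the feature is
    attested at least once in every proficiency band."""
    if not uses_by_level:
        raise ValueError("no proficiency bands given")
    return all(count > 0 for count in uses_by_level.values())


def is_acquired(standard_uses, total_contexts, threshold=0.8):
    """Romaine-style check (assumed operationalisation): the standard form
    appears in at least `threshold` of the possible contexts."""
    if total_contexts <= 0:
        raise ValueError("total_contexts must be positive")
    return standard_uses / total_contexts >= threshold


# Hypothetical counts per proficiency band, for illustration only.
feature_a = {"low": 12, "mid": 7, "high": 3}   # attested in all bands
feature_b = {"low": 9, "mid": 0, "high": 0}    # least proficient band only

print(is_stable_feature(feature_a), is_stable_feature(feature_b))
print(is_acquired(93, 100), is_acquired(60, 100))
```

On this reading, a non-standard variant that surfaces in the remaining fraction of contexts, or only in the lowest proficiency band, would be treated as a performance error or transitional phenomenon rather than as a feature of the variety.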
Based on the analyses of various researchers, the following features are not regarded as stable features of BSAfE, but are occasional performance errors instead:
• The omission of the suffix -ly on adverbs such as ‘quickly’ (De Klerk, 2006: 153);
• Subject–verb concord errors (Siebers, 2012: 187);
• Overgeneralisation of the past tense suffix -ed on irregular verbs (De Klerk, 2006: 145);
• Use of ‘does + VERB’ to express present tense (De Klerk, 2006: 153);
• The neutralisation of the contrast between themself/themselves (De Klerk, 2006: 154);
• Conflation of pronoun gender by using ‘he’ and ‘she’ for the same referent (from my own unpublished analyses, incidence below 5% even in the writing of school pupils, and below 2% in the writing of adults).
Conclusion
Corpus linguistic research into BSAfE has enriched our understanding of the grammatical features of this variety in three ways. On a purely quantitative level, it has provided support for claims made in research based on less extensive data sets or less formalised analyses, to put the discussion of BSAfE on a surer footing. It has also shown that some features that are attributed to the grammar of BSAfE should rather be regarded as performance errors, since they represent a small minority of variants that differ from the majority variants in the data. Finally, for at least some features, such as the use of the progressive aspect (Van Rooy, 2006; Siebers, 2012), ‘can be able to’ as modal expression (Van Rooy, 2011), and articles and quantifiers (Botha, 2012), functionally oriented corpus analyses have added to our understanding of features that were previously noticed mainly for their deviance, without a thorough grasp of ways in which they are consistent with the grammar of BSAfE.
Corpus linguistic research shows us that, at this point, the grammar of BSAfE is to a large extent similar to that of native varieties of English. Differences mainly reside in a small number of new grammatical constructions of fairly restricted scope. In the leaky corners of grammar, a few constructions that are unique to BSAfE have attained (or are close to attaining) stability, and should therefore be regarded as characteristic features of this variety (even if they are shared by other New Varieties of English elsewhere on the African continent or further afield).
BERTUS VAN ROOY is professor in English Language Studies at the Vaal Triangle Campus of the North-West University in Vanderbijlpark, South Africa. He is a past president of the International Association for World Englishes. His current interests include the grammatical features of non-native varieties of English, and the development of South African varieties of English in the nineteenth and twentieth centuries. He works on the compilation of a number of synchronic and diachronic corpora of varieties of South African English. Email: Bertus.VanRooy@nwu.ac.za