1. Introduction
In recent years, a variety of usage-based approaches to modeling language have attracted increasing interest as theoretical alternatives to the generative grammar paradigm associated with the work of Chomsky (1965, 1986). Prominent among these alternatives are Bybee’s Network Model (Bybee, 1995), Langacker’s Cognitive Grammar (Langacker, 2008), Goldberg’s Construction Grammar (Goldberg, 1995, 2006), and Croft’s Radical Construction Grammar (Croft, 2004). They have all promoted an alternative view of the most basic nature of language in which the linguistic generalizations that underlie our ability to speak and understand a language emerge directly from our experience with that language. This view contrasts sharply with that introduced by Chomsky (1965), in which our cognitive basis for speaking and understanding a language is seen as resulting in large part from the activation of an innate neural framework that structures one’s experience with a language according to largely predetermined instructions. While the proponents of the usage-based approaches do not deny that humans have a species-specific biological specialization for language, they nevertheless attribute most of the specific structures and forms that constitute one’s language to the operation of general cognitive processes and learning mechanisms that apply to one’s linguistic input. Within recent discussions of the usage-based alternatives, CATEGORIZATION has emerged as one of the fundamental cognitive operations underlying linguistic behavior, if not the fundamental one (Bybee, 2010, 2013; Croft & Cruse, 2004; Goldberg, 1995, 2006; Langacker, 2008; Taylor, 1995, 2015).
Linguists are gradually coming to appreciate the critical significance of categorization to linguistic structure. The role of categorization is especially prominent in cognitive grammar, which invokes it for several basic functions not generally associated with it. … It is therefore essential that we develop a coherent view of the process of categorization and the structures it produces. (Langacker, 2008, p. 369)
Categorization is the most pervasive of these [domain-general cognitive] processes as it interacts with the others. By categorization I mean the similarity or identity matching that occurs when words and phrases and their component parts are recognized and matched to stored representations. The resulting categories are the foundation of the linguistic system. (Bybee, 2010, p. 6)
Despite this increasing appeal to categorization as the cognitive process central to our understanding of language acquisition and use, few of these linguists have yet offered explicit interpretations of how linguistic categories might be represented within the brain, nor have they explored in any detail what their different notions of categorization might entail in the way of cognitive capabilities and processes. In this paper I review a large body of evidence regarding the cognitive characteristics of categorization, and I argue that this evidence strongly implicates an instance-based, or exemplar-based, framework for both domain-general categorization and linguistic categorization. I then argue that of the three different instance-based theories of categorization now being tested within linguistics, the Analogical Model proposed in Skousen (1989, 1992) provides the most cogent account of linguistic categorization.
With only a few exceptions to date (cf. Bybee, 2013; Pierrehumbert, 2001), the proponents of usage-based linguistics have generally assumed, either implicitly or explicitly, some sort of prototype theory based largely on the work of Rosch (1975) for explaining the categorization of linguistic forms and usages, and they almost always envision those categories as instantiated in some sort of connectionist representation à la Rumelhart and McClelland (1986) (e.g., Bybee, 2010; Croft & Cruse, 2004; Elman, Bates, Johnson, Karmiloff-Smith, Parisi, & Plunkett, 1996; Lakoff, 1987; Langacker, 2008; Taylor, 1995, 2015). Even the proponents of the generative-based ‘dual mechanisms’ model of language, as developed in Pinker’s Words and Rules model (Pinker, 1999), have concluded that an adequate model of language must incorporate connectionist-like components. They see them as necessary for accounting for linguistic forms and behaviors that appear to arise from the operation of associative learning mechanisms such as those modeled by connectionist networks. Those generativists, however, see this aspect of linguistic knowledge as making up only a relatively minor part of one’s knowledge of one’s language. They continue to argue that by far the greater part of one’s knowledge of a language resides in symbolic-rule-based generalizations that do not show the usual cognitive characteristics of associative learning and category induction.
The major research question, one that has been debated extensively in a large and still accumulating literature of point–counterpoint, is whether the connectionist models that Pinker and his colleagues acknowledge they must include in their dual-systems model to account for exceptional forms and behaviors are also sufficient to account for the more systematic aspects of regular linguistic behavior.
While the debate between the generativists and the connectionists has gone on within linguistics, there has been a largely independent but parallel debate among cognitive psychologists over how best to explain and characterize categorization per se as a domain-general cognitive behavior. Within the latter debate, for the past three and a half decades, two major theoretical approaches to modeling categorization have competed in the research literature: the prototype approaches, often modeled as connectionist systems, and an exemplar-based, or instance-based, approach linked most often to the work of Nosofsky (1986). Although both approaches have been applied to the modeling of linguistic behavior, the connectionist approaches, as already noted, have been used much more extensively in modeling such aspects of language as inflectional morphology, especially the English past-tense forms. The exemplar-based approaches, on the other hand, have attracted only relatively limited interest among linguists to date, as in Nakisa, Plunkett, and Hahn’s (2000) application of Nosofsky’s Generalized Context Model (GCM) to inflectional morphology, and in the application of Daelemans’ Tilburg Memory-Based Learner (TiMBL; Daelemans & van den Bosch, 2005) and Skousen’s Analogical Model (AM; Skousen, 1989) to various linguistic issues. In the next section, I shall first review some of the major theoretical and empirical issues that arise in comparing and evaluating the connectionist-prototype approaches to categorization with the exemplar-based, or instance-based, approaches. Then I will review the crucial theoretical and empirical differences exhibited by the three exemplar-based approaches just cited.
2. Some key empirical issues regarding domain-general categorization
2.1. prototype effects
Modern discussions of categorization derive largely from the work on concept learning and categorization pioneered by Eleanor Rosch and her colleagues (e.g., Rosch, 1975, 1977; Rosch & Lloyd, 1978; Rosch & Mervis, 1975). Rosch and her colleagues identified a set of robust behavioral effects – the so-called prototype effects – that appear to characterize most acts of perceptually based categorization. These prototype effects have come to be seen as the signature characteristics of domain-general cognitive processes of categorization, and when linguists assert that some hypothesized linguistic category shows these prototype effects (e.g., Bybee, 2010; Croft & Cruse, 2004; Lakoff, 1987; Langacker, 2008; Pinker, 1999; Taylor, 2002, 2015), they are arguing that those linguistic categories do not differ in nature from categories identified in other cognitive domains.
Although Rosch herself avoided characterizations of what sorts of mental representations might underlie the prototype effects that she described, others have proposed a variety of cognitive mechanisms. Both psychologists and linguists have hypothesized that cognitive processes operate on mental categories by referring to some sort of mental representation of those categories that has been abstracted away from one’s actual perceptual experiences with exemplars of those categories and that has come to be represented in some sort of summary, or schematic, form (cf. Alba & Hasher, 1983; Croft & Cruse, 2004; Elman, 2004, 2009; Estes, 1994; Joanisse & Seidenberg, 1999; McClelland & Rumelhart, 1986; Murphy, 2002; Rumelhart & McClelland, 1986; Taylor, 1995, 2015; Tomasello, 2003, 2006). Such prototypes are said to represent the central perceptual and functional tendencies of their respective categories and to provide for default interpretations of characteristics that may be under-specified in any given instance of usage. In comparing how people behave toward the presumed prototype of a category with how they behave toward various non-prototypical members of a category, researchers have often reported the following prototype effects:
Graded internal structure.
Given a collection of category members, participants will show high degrees of agreement with one another in ranking the representativeness of those items as members of the given category (e.g., Rips, Shoben, & Smith, 1973).
Feature correlations.
Asked to list the salient features that are characteristic or descriptive of a given category, participants will include many of the same features early in their lists, and those features will correlate with one another most strongly in the category members that the participants have judged as most representative of a category (e.g., Rips et al., 1973).
Weighted features / Cue validity.
Certain features or cues often prove to be more diagnostic of a category than others and may therefore be weighted more heavily, especially in the context of certain other features (e.g., Posner & Keele, 1968).
Fuzzy boundaries.
Given a set of items to categorize, participants will show high levels of agreement for which items clearly are or are not members of a given category, but they will disagree with one another about borderline cases and may even respond inconsistently to such items across repeated classification trials (cf. Hampton, 2006; Labov, 1973; Rosch, 1975, 1977).
Speed of classification.
In concept-learning and classification experiments, the more similar a test item is to a prototype, the faster and more accurately participants classify it (e.g., Rips et al., 1973).
False memory.
On recall tests, participants are more likely to exhibit a false recognition memory for a previously unseen item that closely resembles a prototypical item than they are for new items that are less similar to the prototype (e.g., Posner & Keele, 1968, 1970).
Word association.
Words for items that are closer to a category prototype, e.g., table or chair for the category FURNITURE, are more likely to evoke one another in word association tasks than are words for items that are further from the prototype (e.g., ottoman, armoire) (Moss, Hare, Day, & Tyler, 1994; Rosch, 1975).
Semantic priming.
Words that are both close to a category prototype, such as robin and sparrow for the category BIRD, are more likely to facilitate, i.e., prime, the subsequent recognition of one another in a variety of word-recognition tasks than are words that are more distant from the prototype (e.g., Rosch, 1975).
2.2. exemplar-based effects
Even as Rosch and her colleagues were still in the early stages of developing and testing prototype theory, Medin and Schaffer (1978) were beginning to develop and test what would become the major competing model of concept learning and categorization, the instance-based, or exemplar-based, models. Their Context Theory of Classification categorized new items not by comparing them to some extracted prototype representation of a category but by comparing them directly to memories for experiences with individual exemplars of items that shared perceptual features with the new item. The model assigned a new item to a category on the basis of its similarity to one or more of those previously encountered exemplars that had already been assigned to a category. Medin and Schaffer showed not only that their model could account for the prototype effects then being described, but also that it accounted for certain additional behaviors and effects not readily accounted for by the prototype models, as well as for certain effects that had been misunderstood within the prototype framework. They showed, for example, that, under the right conditions, participants actually could classify new examples that were further from a prototype, even outlier examples, just as quickly as – and sometimes even faster than – they could a new example that was more similar to the prototype. This effect became just one of several important exemplar-based effects to be identified experimentally which appear to be empirically inconsistent with pure prototype models.
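The core mechanism of the Context Theory can be made concrete. The Python sketch below implements its multiplicative similarity rule; the binary feature vectors, the stored exemplars, and the mismatch parameter s are all invented for illustration and are not Medin and Schaffer’s actual materials.

```python
# A minimal sketch of the multiplicative similarity rule in Medin and
# Schaffer's (1978) Context Theory. Exemplars and the parameter s are
# invented for illustration.

def similarity(probe, exemplar, s=0.3):
    """Matching features contribute 1; each mismatch multiplies in the
    parameter s (0 < s < 1), so similarity falls off sharply."""
    sim = 1.0
    for p, e in zip(probe, exemplar):
        sim *= 1.0 if p == e else s
    return sim

def classify(probe, memory, s=0.3):
    """Sum the probe's similarity to every stored exemplar of each category;
    response probabilities follow the relative summed similarities."""
    totals = {}
    for features, category in memory:
        totals[category] = totals.get(category, 0.0) + similarity(probe, features, s)
    grand = sum(totals.values())
    return {cat: t / grand for cat, t in totals.items()}

# Hypothetical exemplars already categorized in memory:
memory = [((1, 1, 1, 0), 'A'), ((1, 0, 1, 0), 'A'), ((1, 1, 0, 1), 'A'),
          ((0, 0, 0, 1), 'B'), ((0, 1, 0, 0), 'B')]

# A new item is compared directly to every remembered exemplar:
probs = classify((1, 1, 1, 1), memory)
```

Note that no prototype is ever computed: the category response emerges entirely from summed similarities to individual remembered exemplars.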
The exemplar-based effects summarized here are behavioral effects that any general model of categorization behavior must account for in addition to the prototype effects listed above and include the following:
Faster classification of exemplars.
Medin and Schaffer (1978) showed that the studies which had reported that the speed of categorization for test items correlated with the degree of similarity to the prototype had confounded similarity to a prototype with other aspects of graded category structure and with differences in one’s learning experiences (see also Smith & Samuelson, 1997; Whittlesea, 1997). Once they had controlled for the confounds, Medin and Schaffer showed that participants could categorize test items that were closer to an outlier example of a category than they were to its prototype just as quickly as they could a test item that was closer to the prototype than to an outlier. In other words, similarity to outlier exemplars could influence the speed of classification just as much as similarity to prototypes could.
Flexible internal structure.
The early work in prototype theory (e.g., Rosch, 1975) had established that naturally occurring categories exhibit graded internal structure. That is, given examples from any familiar category, participants can rank those members consistently from most typical to least typical. This graded internal structure also reflected the feature correlations reported by participants and was thought to reflect the probabilistic schematic structure that resulted from category learning (Rosch & Mervis, 1975). Such structures were further thought by some to become resident schematic generalizations in one’s long-term memory, where they would remain available in the form of weighted features for subsequent categorizations long after the experiences that had led to learning a category had been forgotten (e.g., Hampton, 2006; Posner & Keele, 1970). Barsalou (1987) and Smith (1990) showed, however, that neither the implied prototype nor the feature correlations were so immutable as supposed, and that indeed each could shift so as to accommodate different contexts and different tasks. The concept BIRD, for example, showed a different internal structure in the context BARNYARD than in the context BACKYARD or SEASHORE. Smith (1990) and Whittlesea (1997) showed further that even a single exposure to a less typical example could skew the internal structure of one’s categories, a finding which implied that categories were not fixed knowledge structures with immutable feature weightings and feature correlations, but were actually more readily mutable than the prototype model implied.
Ad hoc categorizations.
Barsalou (1983, 1985) showed that categories created on the fly, such as ‘things to eat on a diet’, showed the same graded internal structure, fuzzy boundaries, and other internal structural characteristics as did well-established resident categories. This finding suggested that whatever the mental basis was for the prototype effects seen in references to well-established categories, those same effects were equally evident for temporary categories formed ad hoc as needed.
Compound categorizations.
The categorization effects evoked by modified nouns do not appear to arise from the conjoining of the conceptual components associated with their constituent prototypes (Barsalou, 1983; Földiák, 1998; Hampton, 1997; Osherson & Smith, 1981). The prototype effects evoked by the compound noun pet fish, for example, do not reflect a simple conjoining of the concepts of prototypical PETS with prototypical FISH. A feature typical of a FISH need not carry the same weight as a feature of the compound category PET FISH. Indeed, a modifier (noun or adjective) plus a head noun often gives rise to ad hoc categories having emergent features that are not characteristic of either constituent concept alone (Cohen & Murphy, 1987; Hampton, 1997; Murphy, 1988; Springer & Murphy, 1992). For example, a stone lion is not and never was ALIVE, and it is an ARTIFACT even though neither constituent concept is. As those researchers, and others, have noted, the compounds appear to invite participants to search their memories for experiences with exemplars of the referents being identified by a compound and to create a new, ad hoc, category for it on the spot based on the examples found in memory. Hampton even suggested, but did not pursue, a possible analogical process operating on memories for just such exemplars.
Semantic priming.
As was noted above, category members that are closer to their shared prototype will prime one another better than will less prototypical items. Whittlesea (1987), Malt (1989), and Smith (1990) showed, however, that outlier members that resemble one another physically or functionally can also prime one another’s recognition or categorization just as quickly as, and sometimes better than, a more prototypical exemplar can. Prototypical members of semantic categories also prime their category labels better than do less typical members (e.g., the word hammer primes tool better than auger does). These effects, however, do not appear to reflect a shared categorization based on shared physical features. The word hammer, for example, both primes and evokes as an association the word nail better than it does the perceptually and functionally more similar mallet or maul (Bowden & Beeman, 1998; Malt, 1989; Wettler, Rapp, & Sedlmeier, 2005). Similarly, bread primes and evokes butter better than it does roll or bun. In further examination of these phenomena, Bowden and Beeman (1998), Buchanan, Brown, Cabeza, and Maitson (1999), Moss et al. (1994), Stuart and Hulme (2000), and Westbury, Buchanan, and Brown (2002) all replicated an effect first reported in Deese (1960) and elaborated on in Ratcliff and McKoon (1988). They showed that how often content words are used in close proximity to one another in large corpora predicts both word-association effects and word-priming effects better than shared membership in a given category appears to. For example, pie, luck, and belly prime pot better than pan does.
Word association.
Parallel to the priming effects just described, subsequent research has also shown that similarity to an outlying member of a category can predict word associations as accurately as can similarity to a presumed prototype. More interestingly, however, the frequency with which a pair of words co-occurs near one another across usages again predicts word-association responses better than does membership in a semantic category (Bowden & Beeman, 1998; Buchanan et al., 1999; Malt, 1989; Ratcliff & McKoon, 1988; Westbury et al., 2002; Wettler et al., 2005).
False memories.
While a list of items from a given category, e.g., pear, orange, peach, nectarine, plum, etc., may induce a false memory for having also seen or heard another example of that category, e.g., apple, Buchanan et al. (1999) and Westbury et al. (2002) showed that a list of words that co-occur frequently with a target word is more likely to create a false memory for that word than is a set of other category members. For example, sour, pie, core, rotten, red, adam’s are more likely to induce a false memory for apple than would the list of other fruits given above.
All three of the effects just described – semantic priming, word association, and false memories – appear to have more to do with how frequently words occur close to one another in instances of usage than with membership in a semantic category, effects neither predicted nor accounted for by prototype theory. Interestingly, Prior and Bentin (2008) have shown recently that people form stronger and longer-lasting associations among pairs of words that they encountered incidentally as co-constituents within a sentence than they do when the same pairs of words appear explicitly in a paired-association learning task.
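The corpus measure driving these three effects, how often two words occur near one another in usage, is easy to state procedurally. The following Python sketch counts within-window co-occurrences; the mini-corpus and the window size are invented purely for illustration.

```python
from collections import Counter

# A toy rendering of the corpus measure behind these effects: how often two
# words occur within a few tokens of one another. Corpus and window size
# are invented.

def cooccurrence_counts(tokens, window=4):
    """Count unordered word pairs occurring within `window` tokens."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            counts[frozenset((w, v))] += 1
    return counts

corpus = ("bread and butter on the table "
          "she spread butter on the bread "
          "a fresh roll with no butter").split()

counts = cooccurrence_counts(corpus)

# In this invented corpus, bread and butter co-occur within the window more
# often than bread and roll do, despite roll being the closer category mate:
assert counts[frozenset(('bread', 'butter'))] > counts[frozenset(('bread', 'roll'))]
```

In the studies cited above, association and priming strength track counts like these computed over large corpora better than they track shared category membership.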
Sensory details.
Most prototype theories imply that in the process of concept learning, prototypical characteristics of category exemplars are somehow abstracted away from one’s individual experiences with exemplars of a category and schematized into some sort of prototype representation of the category. Connectionist models, in particular, do not retain records of their experiences with individual exemplars, much less records of the perceptual details of those exemplars that appear to be insignificant for purposes of categorization (cf. Hinton, 1981; McClelland & Rumelhart, 1986; Smolensky, 1988). Unfortunately for such approaches, there now exists a growing body of research which shows not only that memories for experiences are not nearly so schematic as the connectionist-prototype theorists have assumed but also that those sensory details actually are important to categorization behavior. While it is true, as Sachs (1967, 1974) had shown, that people notice changes in the meaning of a written text better than they notice changes in surface syntactic form (e.g., active versus passive), Alba and Hasher (1983) pointed out that, contrary to oft-repeated claims, the participants in Sachs’ studies also noticed changes in surface syntactic form at levels well above chance. Alba and Hasher went on to show that even such seemingly irrelevant features as a speaker’s voice in spoken examples or the font, color, and case of printed examples could affect how well people were able to recall material that had been presented to them.
Many subsequent studies have continued to confirm the richness of such sensory memory for experiences and its effects on priming, recall, and categorization (e.g., Church & Schacter, 1994; Goldinger, 1996; Humphreys, Evett, Quinlan, & Besner, 1987; Pierrehumbert, 2001, 2002; Schacter & Church, 1992). Tenpenny (1995) demonstrated that the effects of those physical details could persist for up to a year after the original presentation in an experiment. Recent studies of lexical access to stylistic and dialectal variations in pronunciation and code-switching have shown that even such minor phonetic variations as the alternative pronunciations of French fenêtre ‘window’, [fənεtR] versus [fnεtR], appear to be stored as memories for separate lexical entries within an individual speaker (Bürki, Alario, & Frauenfelder, 2011), and Gahl (2008) showed that speakers perceive, record in memory, and reproduce, completely without conscious awareness, minute physical differences in vowel length in words such as thyme versus time, nouns that for all intents and purposes seem completely homophonous to native speakers of English, even to linguists. Finally, new research on embodied meaning provides strong evidence of words, phrases, and sentences being associated with rich sensorimotor memories for the contexts within which those items have been experienced (e.g., Barsalou, 1999; Bergen, 2012; Glenberg, 1999).
Being able to demonstrate that people do record and retain sensory details as part of their memories for their experiences is important to theories of categorization because Whittlesea (1987, 1997) and Ross and Makin (1999), among others, have shown that seemingly insignificant details such as those actually can influence categorization behavior. Even a single exposure to an exemplar can lead a participant in a category-learning experiment to recalibrate the apparent central tendency of her or his prototype effects significantly.
Latent inhibition and perceptual learning.
There are at least two other, seemingly contradictory, effects that also reflect the postponed influence of perceptual details on the subsequent learning of categories: latent inhibition and perceptual learning. Latent inhibition (Lipp, Siddle, & Vaitl, 1992; McLaren & Mackintosh, 2000) appears when participants have been ‘exposed to’ a stimulus several times, but with no obvious significance or outcome associated with it. If, later, under the right conditions, that stimulus does become significant and learners have to begin to associate it with a particular new outcome, it will take those learners longer to learn that new association than it would have taken them to learn the new response to a completely new stimulus. It appears as though memories for the perceptual features of that insignificant stimulus somehow interfere with the learners’ efforts to associate it with the new response.
In perceptual learning (McLaren, Kaye, & Mackintosh, 1989; McLaren & Mackintosh, 2000), learners see and have to operate on two or more very similar, but still distinct, stimuli in the absence of any possibly differentiating feedback. In some versions of these studies, the two stimuli are learned as exemplars of the same category in a category-learning task (and, crucially, they need to share their common features with the category prototype for the effect to emerge). Contrary to the intuitions expressed by those researchers, learners who had encountered those previously undiscriminated stimuli earlier will learn to discriminate them faster in a new discrimination-learning task than they will two new stimuli that they have not seen before. This result, too, implies memories for those previously uninformative perceptual features, except that in this case those memories appear to facilitate new learning, whereas in the previous case, latent inhibition, they appear to retard it.
Superficially, these two phenomena, latent inhibition and perceptual learning, appear contradictory, but both present difficulties for connectionist models of category learning because such models ignore irrelevant features in comparing perceptual experiences to schematic prototypes. The Analogical Model described below accounts for both the latent inhibition effects and the perceptual learning effects, albeit for reasons different from those described in McLaren and Mackintosh. It assumes that the sensory details of each of the training episodes are indeed recorded in memory, as suggested by the independent evidence on sensory memory cited above. On subsequent trials, whether training trials or test trials, a given stimulus will collectively evoke those trials in memory that resemble it, along with the outcome associated with that stimulus in each remembered trial, or with no outcome if none was associated with that stimulus within a given episode evoked from memory. In the case of latent inhibition, as training progresses, each new trial will evoke memories both for the accumulating training trials, with their correct responses, and for the earlier episodes of exposure to the stimulus with no particular outcome associated with it, thus giving the appearance of retarded learning.
Accounting for the perceptual learning effects within AM is more involved. In applying AM as described below, we assume that even during the initial phase, in which the stimuli are not discriminated from one another, they do, nonetheless, undergo supracontext comparison (approximately feature comparison, but see below) in the process of being assigned to the implied categorization. Since the items have to share some features with the presumed prototype for the effect to emerge, some of the supracontexts (feature comparisons) for the undifferentiated test items will necessarily correspond to features of that prototype. This means that in the second phase, when the learner begins to learn to assign one of those stimuli to a new, distinct category, the learner will have memories for episodes of experience in which those supracontexts continue to assign the item to the target category correctly. For an item now being assigned to a different category, those supracontexts that once assigned it to the original category will now predict contradictory results and will generally lead, therefore (as described below), to heterogeneous supracontexts that will no longer even be considered in arriving at the new categorization being learned.
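The supracontext machinery invoked in the last two paragraphs can be sketched in code. This is a deliberately simplified rendering of Skousen’s model: the feature vectors and outcomes are invented, and the homogeneity test below (keep a supracontext only if all its exemplars share one outcome) is a shortcut that omits Skousen’s fuller treatment of non-deterministic supracontexts, as is the squared-count stand-in for pointer counting.

```python
from itertools import product
from collections import Counter

# A simplified sketch of supracontexts in Skousen's Analogical Model.
# Exemplars are invented; homogeneity and pointer counting are simplified.

def supracontexts(context):
    """Every way of generalizing the given context by masking features
    (None means 'any value allowed in this position')."""
    for mask in product([True, False], repeat=len(context)):
        yield tuple(c if keep else None for c, keep in zip(context, mask))

def matches(supra, exemplar):
    return all(s is None or s == e for s, e in zip(supra, exemplar))

def predict(context, memory):
    """Vote over homogeneous supracontexts, weighting each by the square
    of its exemplar count (a stand-in for Skousen's pointer counting)."""
    votes = Counter()
    for supra in supracontexts(context):
        hits = [outcome for exemplar, outcome in memory if matches(supra, exemplar)]
        if len(set(hits)) == 1:   # non-empty and homogeneous (simplified test)
            votes[hits[0]] += len(hits) ** 2
    return votes

# Hypothetical remembered exemplars with their outcomes:
memory = [(('a', 'b', 'c'), 'X'), (('a', 'b', 'd'), 'X'), (('e', 'b', 'd'), 'Y')]
votes = predict(('a', 'b', 'e'), memory)
```

Supracontexts that gather exemplars with conflicting outcomes are heterogeneous and contribute no votes, which is the mechanism appealed to above for the second phase of perceptual learning.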
Non-linear separability.
The fuzzy areas created by overlapping category boundaries that appear to be inherent to virtually all naturally occurring categories, and that are deliberately designed into many of the artificially created categories used in concept-learning experiments, gave rise to an issue that has sometimes been the focus of intense study in efforts to distinguish prototype theories of categorization from exemplar-based theories: linear separability. Two categories are said to be linearly separable if the features represented in the network collectively distinguish all the members of one category from all the members of the other category. In other words, one could, in principle, draw a line between the two categories that would divide them deterministically from one another. In non-linearly separable categories, however, there are members of one category that actually are more similar to the prototype of another category than they are to the prototype of their own category. Many naturally occurring categories appear to be non-linearly separable (e.g., Estes, Reference Estes1986a, Reference Estes1986b; Medin & Schwanenflugel, Reference Medin and Schwanenflugel1981; Murphy, Reference Murphy2002; Nosofsky, Reference Nosofsky1992; Whittlesea, Reference Whittlesea1987, Reference Whittlesea, Lamberts and Shanks1997), and non-linear separability appears to be a common characteristic of linguistic categories as well. Footnote 2
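Non-linear separability can be made concrete with a small numeric sketch. In the hypothetical Python example below, the function names, the four binary features, and the stimulus values are all illustrative assumptions, not drawn from the studies cited; the point is simply that two categories are non-linearly separable when some member sits closer to the other category's prototype than to its own:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def prototype(category):
    # Feature-wise mean of the category's members.
    n = len(category)
    return [sum(member[i] for member in category) / n
            for i in range(len(category[0]))]

def nonlinearly_separable(cat_a, cat_b):
    # True if some member lies closer to the *other* category's prototype.
    proto_a, proto_b = prototype(cat_a), prototype(cat_b)
    return (any(euclidean(x, proto_b) < euclidean(x, proto_a) for x in cat_a) or
            any(euclidean(x, proto_a) < euclidean(x, proto_b) for x in cat_b))

cat_a = [[1, 1, 1, 0], [1, 0, 1, 1], [0, 1, 1, 1]]
cat_b = [[0, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]]  # last member is an outlier
print(nonlinearly_separable(cat_a, cat_b))  # True: the outlier is nearer A's prototype
```

A prototype model that classifies by nearest prototype would necessarily misclassify such an outlier, which is why this property became a diagnostic in the prototype-versus-exemplar debate.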
Connectionist models that incorporate so-called ‘hidden units’ can acquire non-linearly separable categories by developing unique representations for each non-linearly separable item in which different combinations of features are associated with different hidden units. However, as McClelland and Rumelhart (1986, p. 210) noted, those systems can do so only if they contain at least one hidden unit for every member of the competing categories. In effect, the connectionist model must become a de facto exemplar-based model because it must create and maintain a unique representation for each different item presented to it during training. This requirement creates obvious theoretical and practical problems because it requires that the connectionist network somehow must access an additional hidden unit each time a new type of item is encountered and somehow incorporate it into the network representation of all of the previously encountered exemplars.
3. Theoretical models of categorization
3.1. connectionist models of categorization
Given that an increasing number of usage-based linguists have described prototype effects in their data (see, for example, Bybee, Reference Bybee2010; Croft & Cruse, Reference Croft and Cruse2004; Goldberg, Reference Goldberg1995, Reference Goldberg2006; Lakoff, Reference Lakoff1987; Langacker, Reference Langacker2008; Taylor, 2002, 2015; Tomasello, Reference Tomasello2003, Reference Tomasello2006), it is not surprising that many of them have found connectionist models particularly attractive as a theoretical framework, even if only tacitly so. Connectionist models were developed in large part specifically to model the prototype effects summarized above (e.g., Hinton, Reference Hinton, Hinton and Anderson1981; McClelland & Rumelhart, Reference McClelland, Rumelhart, McClelland and Rumelhart1986; Smolensky, Reference Smolensky1988). Through repeated presentation of representative members of a category to a network of massively interconnected features that can represent the salient components of those examples, the network comes to represent its experiences with those examples collectively as patterns of associations of varying weights among those features. The theoretical and empirical strengths and weaknesses of such models have been discussed and debated extensively over the past two and a half decades, both as models of linguistic behavior (cf. Chandler, Reference Chandler, Skousen, Lonsdale and Parkinson2002, Reference Chandler2010; McClelland & Patterson, Reference McClelland and Patterson2002; Pinker & Prince, Reference Pinker and Prince1988; Pinker & Ullman, Reference Pinker and Ullman2002; Skousen, Reference Skousen1989, Reference Skousen, Skousen, Lonsdale and Parkinson2002a) and as more general models of cognitive behavior (cf. Crick, Reference Crick1989; Grossberg, Reference Grossberg1987; Murphy, Reference Murphy2002; Smolensky, Reference Smolensky1988).
Unfortunately, connectionist models, which were developed specifically to model prototype effects, do not readily accommodate the exemplar-based effects described above. The heart of the matter, of course, lies in the most fundamental difference between the two types of theoretical approaches to categorization. The connectionist models represent categories as schematic compilations of experiences with examples, but they do not retain individualized records – memories – of those experiences or of the sensory and contextual details of those encounters that do not seem relevant to categorization. It is those schematic representations that come to define a category, and the probability of a new example being assigned to one category or another will be a function of its similarity to those abstracted category prototypes. To date, the proponents of connectionist models of categorization have not proposed mechanisms to account for behavior toward outliers, the contributions of incidental sensory details to categorization (i.e., perceptual learning and latent inhibition), the effects of frequent co-occurrence in corpora, the formulation of ad hoc categories on the fly, or the compounding of categories.
3.2. exemplar-based models of categorization
To date, three different exemplar-based models of categorization have been applied to the modeling of language: Nosofsky’s Generalized Context Model (GCM) (Nosofsky, Reference Nosofsky1984, Reference Nosofsky1988), the Tilburg Memory-Based Learning model (TiMBL) as described most recently in Daelemans and van den Bosch (Reference Daelemans and van den Bosch2005), and the Analogical Model (AM) developed in Skousen (Reference Skousen1989, Reference Skousen1992). These exemplar-based models are all based on the deceptively simple notion that we recognize and interpret the significance of current experiences by comparing them directly to memories for previous experiences, rather than by comparing them to schematized representations that have been abstracted away from those collective experiences. Thus, all three of those exemplar-based models posit the ongoing accumulation of rich sensory memories over one’s lifetime and some procedure for comparing the sensory input of a current experience to the representations in memory of one or more of those earlier experiences. Within this theoretical framework, exemplars are viewed as comprising both the memory record for instances of sensory experiences and a record of the interpretation or significance that was associated with the components of that sensory experience at the time they were experienced. As described below, those exemplars – sensory records plus their interpretations – then become the basis in turn for interpreting and evaluating the significance of new instances of sensory input.
The three exemplar-based models discussed here all include algorithms for comparing a current experience – such as a new instance of linguistic usage – with memories for similar experiences. If a ‘match’ is found in memory – leaving aside for the moment what constitutes a match – then that match becomes the basis for interpreting and operating on the new experience. If no sufficiently similar experience is found in memory to count as a match, then those models have procedures for comparing the new instance to instances in memory that resemble it, and for selecting one or more of those remembered instances to act as the basis for interpreting the new instance analogically. How the three exemplar-based models accomplish these procedures is what distinguishes one model from another; and, as we shall see, these differences also raise other theoretical considerations that may lead us to prefer one model over another.
Nosofsky (Reference Nosofsky1984, Reference Nosofsky1986) developed his Generalized Context Model directly from the work of Medin and Schaffer (Reference Medin and Schaffer1978), and within the domain of experimental cognitive psychology it has become the most widely tested exemplar-based model of concept learning and categorization. The GCM determines the probability of assigning a test item to a given category by comparing its degree of collective similarity to all the items in that category that share any features with the test item versus its degree of collective similarity to all items in all categories that share any features with it. The model treats each feature mathematically as a dimension in a multidimensional psychological space, with the number of dimensions determined by the number of features needed to represent the examples. The values for those dimensions can then be interpreted metaphorically as vectors within that multidimensional space and can be used to locate each item within that space by plotting it according to the value of each of its dimensional vectors. Once an example is located within the multidimensional space, one can use simple geometry to measure and compare its psychological distance – or similarity – to every other example in the space objectively (this is not Nosofsky’s characterization). The probability of an item being assigned to a given category is then the sum of its similarities (based on distances) to all the members of that category divided by the sum of its similarities to all members of all categories represented within that space.
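The ratio just described can be sketched in a few lines. The following is an illustrative simplification, not Nosofsky's implementation: the function name, the exponential decay of similarity with Euclidean distance, and the scaling parameter `c` are assumptions, and the response-bias and feature-weighting terms discussed next are omitted:

```python
import math

def gcm_probability(probe, exemplars, category, c=1.0):
    # Similarity decays exponentially with Euclidean distance (scale c).
    def similarity(x, y):
        return math.exp(-c * math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))
    total = sum(similarity(probe, ex) for ex, _ in exemplars)
    in_category = sum(similarity(probe, ex) for ex, cat in exemplars
                      if cat == category)
    return in_category / total

# Four exemplars located in a two-dimensional psychological space:
exemplars = [([1, 1], 'A'), ([1, 0], 'A'), ([0, 0], 'B'), ([0, 1], 'B')]
p_a = gcm_probability([1, 1], exemplars, 'A')  # ~0.69: the probe favors A
```

Because every exemplar contributes to both sums, an item resembling a stored outlier can still be classified with its own category, which a pure prototype comparison cannot do.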
Nosofsky (Reference Nosofsky1986, Reference Nosofsky1988) understood that perceived similarity is not an objective cognitive primitive and that response choices in categorization studies often are not bias free. He, therefore, factored two other variables into his model that have proven crucial to the successful performance of the GCM, response biases and feature-weighting multipliers. Values for both variables are determined in advance in separate experimental procedures. Response bias is just what the term indicates, a participant’s tendency to respond with one category label more often than another. Feature weighting reflects the changes in the perceived weight – or importance – of a feature when it occurs in a specified context of other features. Both variables, the response bias and the differential weighting of features within different contexts, have proven essential to the successes of the GCM in modeling human performance on categorization tasks, including linguistic categorizations (Keuleers, Reference Keuleers2008).
The memory-based learning models (MBL) described in Daelemans and van den Bosch (Reference Daelemans and van den Bosch2005), and especially the Tilburg version of that model (TiMBL), were developed initially as computational models of natural-language processing rather than as computer models of the cognitive processes underlying language. Consequently, the development and testing of MBL models have sometimes been motivated more by considerations of computational tractability and efficacy than by neuropsychological plausibility. Whereas Nosofsky’s GCM (Reference Nosofsky1986, Reference Nosofsky1988, 1992) compares the features of a test probe with feature values represented collectively for all of the examples held in memory, TiMBL compares a test probe feature by feature with each of the exemplars held in a database – memory – individually, and then, in some versions, selects the exemplar in memory that is most similar to the test probe – the ‘nearest neighbor’ – as the basis for analogical interpretation. In the case of a tie – two or more equally similar exemplars in memory – researchers will typically have TiMBL choose the more frequently recurring exemplar, or they may have it select one of the alternatives randomly.
Although nearest-neighbor models, such as TiMBL, work well for computational models of natural-language processing – in that they almost always yield an acceptable response – other studies have shown that they are not empirically accurate as cognitive models of categorization. With predictable probabilities, for example, people will provide past-tense forms for nonce verbs that clearly are not motivated by analogy with the most similar verbs in a language (Chandler, Reference Chandler1998; Skousen, Reference Skousen1989, Reference Skousen, Blevins and Blevins2009; van den Bosch, Reference van den Bosch, Skousen, Lonsdale and Parkinson2002). Therefore, researchers interested in modeling cognitive behavior more realistically often expand the pool of candidate exemplars to the k-NN set, where k is an integer specifying the number of nearest neighbors – the most similar exemplars – to be considered. A decision rule then selects an exemplar from the pool either randomly or based on some criterion such as frequency of occurrence.
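The k-NN procedure with a frequency-based decision rule can be sketched as follows. This is a hypothetical illustration, not TiMBL's actual code: the Hamming-distance metric, the (onset, nucleus, coda) encoding, and the toy verb data are all assumptions:

```python
from collections import Counter

def knn_outcome(probe, exemplars, k=3):
    # Rank exemplars by Hamming distance to the probe, pool the k nearest,
    # then let a frequency-based decision rule pick the outcome.
    def distance(x, y):
        return sum(a != b for a, b in zip(x, y))
    ranked = sorted(exemplars, key=lambda ex: distance(probe, ex[0]))
    pool = ranked[:k]
    counts = Counter(outcome for _, outcome in pool)
    return counts.most_common(1)[0][0]

# Toy past-tense data over (onset, nucleus, coda) feature tuples:
data = [(('r', 'i', 'k'), 'regular'),     # cf. reek -> reeked
        (('r', 'aj', 'd'), 'irregular'),  # cf. ride -> rode
        (('r', 'aj', 'z'), 'irregular'),  # cf. rise -> rose
        (('l', 'aj', 'k'), 'regular')]    # cf. like -> liked
print(knn_outcome(('r', 'aj', 'k'), data))
```

With k=1 the model commits to the single nearest neighbor; widening k lets less similar exemplars vote, which is what allows responses not motivated by the single most similar verb.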
Developers of the MBL models encountered the same difficulties and issues with their similarity metrics that Nosofsky (Reference Nosofsky1986, Reference Nosofsky1988) had with his. Simple summations of similarities and differences do not reflect human performance on outlier examples accurately, nor can such models learn to apply non-linearly separable categories correctly to new instances. The TiMBL researchers chose a solution similar to Nosofsky’s – feature weighting – although they derive the weighting values in a different way. Whereas Nosofsky uses a separate experimental procedure to estimate how much weight human participants give to different features in different contexts (Nosofsky, Reference Nosofsky1986, Reference Nosofsky1988), Daelemans and van den Bosch (Reference Daelemans and van den Bosch2005) simply calculate mathematically the amount of information gain contributed by different features in different contexts (different combinations of features).
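Information gain in this sense can be computed directly from a set of exemplars. The sketch below is illustrative only, not the TiMBL implementation: a feature position is weighted by how much knowing its value reduces entropy over the outcomes:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(exemplars, position):
    # IG = H(outcome) - H(outcome | feature value at this position).
    outcomes = [out for _, out in exemplars]
    groups = {}
    for feats, out in exemplars:
        groups.setdefault(feats[position], []).append(out)
    conditional = sum(len(g) / len(exemplars) * entropy(g)
                      for g in groups.values())
    return entropy(outcomes) - conditional

# Position 0 fully predicts the outcome; position 1 carries no information:
pairs = [(('a', 'x'), 'R'), (('a', 'y'), 'R'), (('b', 'x'), 'I'), (('b', 'y'), 'I')]
print(information_gain(pairs, 0), information_gain(pairs, 1))  # 1.0 0.0
```

The contrast with the GCM is only in how the weights are obtained: here they fall out of the corpus statistics rather than out of a separate experimental procedure.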
Comparisons of the TiMBL-IG models – models that incorporate feature weightings based on information gain – with Nosofsky’s GCM with feature weightings (Nosofsky, Reference Nosofsky1986) show virtually indistinguishable performance (Keuleers, Reference Keuleers2008). Keuleers, in fact, argued that the two models he compared are formally equivalent in their predictive power. Other research (Daelemans, Reference Daelemans, Skousen, Lonsdale and Parkinson2002; Eddington, Reference Eddington, Skousen, Lonsdale and Parkinson2002) has also shown the TiMBL-IG models to perform comparably to Skousen’s (1989) Analogical Model, although for different reasons.
Since the three exemplar-based models discussed in this paper all compare new input directly to memories for individual examples of past experiences, rather than to presumed prototype representations, they all classify a new item that resembles a known outlier example of a category just as quickly as they will an item that resembles a more typical example. Performance on non-linearly separable categories, however, is more complex, and the three exemplar-based models of categorization differ significantly in how they resolve the interpretation of non-linearly separable outliers, but each depends crucially on the fact that the system has access to memories for experiences with those individual examples.
Once researchers began to use the strongest form of Nosofsky’s GCM (see especially, Keuleers, Reference Keuleers2008), all three of these models have been shown to perform comparably to one another, seldom showing statistically significant differences in their outcomes (cf. Chandler, Reference Chandler, Skousen, Lonsdale and Parkinson2002, Reference Chandler2010; Daelemans, Reference Daelemans, Skousen, Lonsdale and Parkinson2002; Eddington, Reference Eddington, Skousen, Lonsdale and Parkinson2002, Reference Eddington2004; Keuleers, Reference Keuleers2008). The proponents of the different models have, therefore, had to appeal to other criteria for choosing among them (cf. Chandler, Reference Chandler and Eddington2009, Reference Chandler2010; Eddington, Reference Eddington2000, Reference Eddington, Skousen, Lonsdale and Parkinson2002; Ernestus & Baayen, Reference Ernestus and Baayen2003; Keuleers, Reference Keuleers2008). As described in detail elsewhere (cf. Chandler, Reference Chandler2010; Eddington, Reference Eddington, Skousen, Lonsdale and Parkinson2002; Ernestus & Baayen, Reference Ernestus and Baayen2003), Skousen’s Analogical Model appears to be theoretically simpler in that it posits fewer special theoretical structures and operations than do either the GCM or TiMBL. One key issue is feature weighting. Unlike AM, both the GCM and TiMBL require that the features representing the different linguistic forms be weighted to reflect how much information a given feature or combination of features conveys within a given context. These feature weightings must be determined through separate experimental and computational operations applied to a set of stimuli ahead of time, and they are crucial to the successful performance of both models, including their ability to operate on non-linearly separable categories correctly (cf. Albright & Hayes, Reference Albright and Hayes2003; Chandler, Reference Chandler2010; Daelemans, Reference Daelemans, Skousen, Lonsdale and Parkinson2002; Eddington, Reference Eddington, Skousen, Lonsdale and Parkinson2002, 2004; Keuleers, Reference Keuleers2008). Unfortunately, such weightings cannot be interpreted theoretically as resident components of the mental representations of the linguistic forms being modeled. The weightings are likely to differ for different collections of forms, and the weightings needed to model one linguistic operation, such as verb inflection, do not necessarily apply to other operations, such as predicting pronunciation from spelling (e.g., Eddington, 2004; Keuleers, Reference Keuleers2008). These characteristics of feature weighting make them extremely problematic for any model that incorporates imperfect memory as a variable (see below). Given imperfect memory, which appears to be a real component of categorization, the sets of linguistic forms being compared could differ somewhat for any given instance of usage. As described next, AM accommodates the different information values of different features and feature combinations in a different way, and does not, therefore, posit feature weighting of the sort crucial to the successful operation of the other two models.
3.3. the analogical model
Working independently of other exemplar-based modeling researchers, Skousen (Reference Skousen1989) developed and began testing an exemplar-based alternative both to the symbolic-rule approach of generative linguistics and to the connectionist approaches. Like other exemplar-based models, AM interprets new instances of sensory input by comparing them with exemplars of previous instances stored in memory, and then selecting one or more of those exemplars to serve as the basis for interpreting or operating on a new instance analogically. The nature of the comparison, however, differs from those used in the other two exemplar-based models in two important respects. First, in AM all features of a given type (e.g., phonological or graphemic or semantic) are considered to carry equal weight, and thus there are no differential weightings to be discovered and applied. Footnote 3 Second, AM compares not only the individual features of the test forms and remembered forms, but also all of the combinations of features shared between the target form and the remembered exemplars. Each separate comparison of features and feature combinations identifies exemplars in the dataset (memory) that are possible bases for an analogical interpretation of the new input. If AM finds a match in memory, then that match becomes the basis for interpreting the new instance. If AM does not find an exemplar in memory that is sufficiently similar to the new instance to count as a match, then the model compiles a set of candidate exemplars that share one or more features and feature combinations with the new instance. The model next applies an information-theoretic heuristic to remove from the pool of candidate exemplars just those that introduce an increase in uncertainty about the interpretation of the new instance. Although not developed originally for this purpose, this procedure turns out to be what overcomes the problem of non-linearly separable categorizations for AM.
The set of candidate exemplars that results from these comparisons are all equally likely choices to serve as the basis for an analogical interpretation, and the model chooses one of those exemplars randomly, or it chooses the largest like-behaving subset of the pool (the plurality), to serve as the basis for an analogical interpretation of the new input.
3.4. details of the analogical model
The Analogical Model has been described in detail and tested extensively elsewhere (e.g., Chandler, Reference Chandler, Skousen, Lonsdale and Parkinson2002, Reference Chandler2010, unpublished observations; Daelemans, Reference Daelemans, Skousen, Lonsdale and Parkinson2002; Eddington, Reference Eddington, Skousen, Lonsdale and Parkinson2002, 2004; Ernestus & Baayen, Reference Ernestus and Baayen2003; Skousen, Reference Skousen1989, Reference Skousen1992, Reference Skousen, Skousen, Lonsdale and Parkinson2002a, Reference Skousen, Skousen, Lonsdale and Parkinson2002b, Reference Skousen, Blevins and Blevins2009). It consists of three major components (see Figure 1): (i) the dataset is the set of exemplars amassed in long-term memory (LTM) and available as a basis for an analogical operation on a current target form; (ii) the core of the Analogical Model is an algorithm for choosing from the dataset the exemplar or exemplars that will become the basis for the analogical interpretation of the target form, an instance of new input. This algorithm includes a set of procedures for comparing the features of the target form systematically with those of each exemplar in the dataset and identifying those exemplars that share the corresponding features with the target form. A second set of procedures evaluates the exemplars identified by a particular set of shared features for whether those forms increase uncertainty, in an information-theoretic sense, about the analogical outcomes. Footnote 4 Exemplars that are associated with feature comparisons (supracontexts) that increase uncertainty about the analogical behavior are eliminated from further consideration. The remaining exemplars selected from the dataset are compiled into an analogical set, the set of forms from the dataset (from memory) that may provide the basis for an analogical interpretation of the target form. 
(iii) Finally, one of two decision rules selects one or more of the forms in the analogical set to act as the basis for the analogical interpretation of the target form.
Within AM, the dataset represents theoretically the set of exemplars of linguistic experience amassed in one’s long-term memory. While many of the details of dataset search and access have yet to be developed explicitly, the working assumption is that it consists of instances, tokens, of linguistic usages whose significance has been interpreted previously. Thus, exemplars of came and walked have been interpreted previously as signifying the past-tense meaning of come and walk, respectively, and the physical realizations of those expressions and of the linguistic and non-linguistic contexts accompanying them on a given occasion are all consolidated into an episode of experience in long-term memory. We do not assume that those memory representations record fully and accurately all of the sensory information that might have been available to an individual during a given sensory episode. On the contrary, we assume that, due to the vagaries of attention, sensory scanning, and perceptual salience, memories for exemplars will reflect the inconsistent effects of stimulus sampling as those experiences are recorded (cf. Estes, Reference Estes, Arrow, Karlin and Suppes1959, Reference Estes1994).
The evidence cited earlier showing that even very tiny differences of linguistic form – such as alternative pronunciations of the French fenêtre or the subconscious but consistent differences in vowel length seen in pronunciations of time versus thyme – may be stored as separate lexical entries confirms Skousen’s intuition (Skousen, Reference Skousen1989, p. 54) that ‘in principle’ token representations are preferable. Footnote 5 Indeed, token representations in the dataset seem necessary for the proper interpretation and application of imperfect memory as discussed below. However, in moving from the dataset (exemplars recorded in long-term memory) to the analogical set, the set of linguistic forms selected as a/the basis for analogical interpretation, token representations (in the form of frequency data) give blatantly incorrect results (Chandler, Reference Chandler and Eddington2009; Eddington, Reference Eddington2004; Skousen, Reference Skousen1989, Reference Skousen, Skousen, Lonsdale and Parkinson2002a, Reference Skousen, Blevins and Blevins2009). The correct empirical interpretation of the model appears to be that each homogeneous supracontext selects from the dataset only one representative token of a form by ignoring the random fluctuations evident among all the tokens of that form that are compatible with the given supracontext. Thus, one’s dataset for come may contain literally tens of thousands of tokens of the word, but the supracontext /k_m/ will ignore the random fluctuations of sensory detail represented in those tens of thousands of exemplars that are compatible with the context in choosing one as the representative form in the analogical set. This interpretation of exemplars in the analogical set allows AM to achieve very high levels of accuracy.
In the computer implementations of AM Footnote 6 the theoretical dataset in LTM is simulated as a corpus of linguistic forms representing the particular behavior of interest: past-tense verb forms in the current example. Unlike many models of associative learning, AM does not posit memory decay as a system variable, but, as described below, we also do not assume that the model will access every exemplar in the dataset on every occasion. Much research on verbal memory suggests that, in healthy brains, exemplars of experience remain in memory permanently (cf. Bowers, Mattys, & Gage, Reference Bowers, Mattys and Gage2009; Oh, Au, & Jun, Reference Oh, Au and Jun2010; Shanks, Reference Shanks1995). On the other hand, much research also shows that examples recalled successfully on one occasion may not be recalled on another, yet may be accessed successfully once again on a third occasion (cf. Estes, Reference Estes, Arrow, Karlin and Suppes1959, Reference Estes1986a). Therefore, rather than a memory decay variable, Skousen incorporated an imperfect memory factor into the model (Skousen, Reference Skousen1989, Reference Skousen1992, Reference Skousen, Skousen, Lonsdale and Parkinson2002a). In practice, this factor is usually set at 0.5, which is to say that on any given occasion there is a 50–50 chance of accessing a given token in the dataset. As shown formally in Skousen (Reference Skousen1998, Reference Skousen, Blevins and Blevins2009), this value for imperfect memory gives the model the predictive power of the Pearson X² statistic applied to a decision table.
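Operationally, the imperfect-memory factor amounts to an independent coin flip per stored token on each occasion of access. A minimal sketch (the function name and the token list are hypothetical):

```python
import random

def accessible_exemplars(dataset, p=0.5, rng=None):
    # Each stored token is accessible on this occasion with probability p;
    # nothing decays permanently out of the dataset.
    rng = rng or random.Random()
    return [ex for ex in dataset if rng.random() < p]

tokens = ['came', 'walked', 'ran', 'talked', 'saw', 'helped']
sample = accessible_exemplars(tokens, rng=random.Random(0))
# On average, about half of the tokens survive any given access.
```

Because the sampling is redone on every occasion, the same exemplar can be missed once and recalled later, matching the recall pattern described above without positing decay.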
The core component of AM is the algorithm for selecting from the dataset the set of remembered forms that will become candidates for determining the analogical inflection of a target form. For example, in predicting the past-tense form of a basic verb, the program begins by comparing the linear sequence of variables (e.g., phonemes for the examples given in this paper) encoded for the target form (e.g., come /kəm/, walk /wak/, rike /rajk/) systematically with every exemplar accessed in the dataset (allowing for the effects of imperfect memory). Figure 2 illustrates the effects of this procedure as the set of supracontexts (actually the power set of the phonemes making up the target form) that are created by subtracting phonemes and sets of phonemes from the target form (the nonce verb rike) one by one while preserving the linear order of the remaining phonemes. Each supracontext is compared to the exemplars in the dataset (allowing, again, for the effects of imperfect memory) and all of the exemplars that share that supracontext with the target form are selected.
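The construction of supracontexts as the power set of the target's variables can be sketched as follows. This is an illustrative reimplementation, not Skousen's program; '_' marks a position subtracted from the target while the remaining phonemes keep their linear order:

```python
from itertools import combinations

def supracontexts(target):
    # The power set of the target's variable positions: every pattern formed
    # by keeping some positions and wildcarding ('_') the rest, in order.
    n = len(target)
    return [tuple(target[i] if i in keep else '_' for i in range(n))
            for r in range(n + 1)
            for keep in combinations(range(n), r)]

def matches(context, form):
    # An exemplar is selected by a supracontext if it agrees at every
    # non-wildcard position.
    return all(c == '_' or c == f for c, f in zip(context, form))

print(supracontexts(('r', 'aj', 'k')))  # 8 supracontexts, from /_ _ _/ to /r aj k/
```

For a three-phoneme target such as rike this yields 2³ = 8 supracontexts, each of which then gathers the dataset exemplars compatible with it.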
The next step is to eliminate from further consideration the forms selected by those supracontexts whose exemplars cause an increase in uncertainty regarding the analogical behavior of the target form. Such supracontexts are said to be heterogeneous. Supracontexts whose associated exemplars do not increase uncertainty about the analogical outcomes are said to be homogeneous. Over the years, AM programmers have used different computational procedures for testing whether a given supracontext is homogeneous or heterogeneous, but the result is the same. The test eliminates just those supracontexts that include exemplars that are more like the target form (share more features with it) than the supracontext itself is (for example, reek /rik/ and ride /rajd/ each share more sounds with the target form /rajk/ than the supracontext /r _ _/ does). Furthermore, those very forms (reek and ride) also increase the number of disagreements (uncertainty about the outcome) over those forms that share only the /r/ with the target form. Footnote 7
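Under simplifying assumptions, the test can be sketched as: a supracontext is homogeneous if its selected exemplars are deterministic (all share one outcome), or if they all fall within a single subcontext (a subcontext being defined by where an exemplar agrees with the target among the wildcarded positions). This is an illustrative reduction of Skousen's criterion, not his implementation, and the data are hypothetical:

```python
def matches(context, form):
    return all(c == '_' or c == f for c, f in zip(context, form))

def subcontext(form, target, context):
    # A subcontext is the pattern of agreement with the target at the
    # wildcarded positions of the supracontext.
    return tuple(form[i] == target[i]
                 for i, c in enumerate(context) if c == '_')

def is_homogeneous(context, target, exemplars):
    matching = [(f, o) for f, o in exemplars if matches(context, f)]
    outcomes = {o for _, o in matching}
    if len(outcomes) <= 1:
        return True   # deterministic: no uncertainty to increase
    subs = {subcontext(f, target, context) for f, _ in matching}
    return len(subs) == 1  # non-deterministic but undivided

target = ('r', 'aj', 'k')                 # nonce rike
data = [(('r', 'i', 'k'), 'regular'),     # reek
        (('r', 'aj', 'd'), 'irregular')]  # ride
print(is_homogeneous(('r', '_', '_'), target, data))   # False: heterogeneous
print(is_homogeneous(('r', 'aj', '_'), target, data))  # True
```

Here /r _ _/ is eliminated exactly as in the text: reek and ride occupy different subcontexts and disagree in outcome, so keeping the supracontext would add uncertainty.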
The final operation within this component of AM is for each homogeneous supracontext to contribute its exemplars to the analogical set, the set of exemplars that will serve as possible bases for the analogical operation on the target form. Exemplars that appear in more than one homogeneous supracontext will appear in the analogical set once for each of those supracontexts. This is how AM models the fact that exemplars that share more features with a target form are more likely to influence the behavior of that form. Figure 3 shows the analogical set derived for the nonce verb rike.
The final procedure within the AM algorithm is to choose one or more of the exemplars included in the analogical set as the basis for the analogical interpretation of the target form. Skousen (Reference Skousen1989) identifies two different decision rules for choosing a form, or forms, from the analogical set to serve as the basis for an analogical outcome; both rules are motivated independently in research on probability learning theory and choice behavior (cf. Ashby, Reference Ashby and Ashby1992; Derks & Paclisanu, Reference Derks and Paclisanu1967; Estes, Reference Estes, Arrow, Karlin and Suppes1959, Reference Estes1986a, Reference Estes1986b, Reference Estes1994; Messick & Solley, Reference Messick and Solley1957; Skousen, Reference Skousen1989). The random selection rule simply selects any one of the exemplars in the analogical set randomly as the basis for an analogical response. Using this rule means that the probability of choosing one possible outcome over another represented in the analogical set will be a straightforward function of how often those outcomes are represented in the analogical set. The alternative decision rule, selection by plurality, identifies the outcome represented the greatest number of times in the analogical set and selects it as the basis for the analogical interpretation of the target form. Using this rule creates the appearance of hypothesis testing in participants’ responses. People apparently can exercise some modicum of strategic control over which rule to apply in a given situation (see, especially, Ashby, Reference Ashby and Ashby1992; Derks & Paclisanu, Reference Derks and Paclisanu1967; Estes, Reference Estes, Arrow, Karlin and Suppes1959, Reference Estes1986a, Reference Estes1986b, Reference Estes1994; Messick & Solley, Reference Messick and Solley1957).
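The two decision rules can be sketched over a toy analogical set. The nonce outcomes below ('riked', 'ruck') and the exemplar pointers are hypothetical illustrations, not data from the studies cited:

```python
import random
from collections import Counter

def choose_outcome(analogical_set, rule='plurality', rng=None):
    # 'random': pick any pointer, so an outcome's probability tracks its
    # share of the analogical set. 'plurality': pick the outcome with the
    # most pointers in the set.
    if rule == 'random':
        rng = rng or random.Random()
        return rng.choice(analogical_set)[1]
    counts = Counter(outcome for _, outcome in analogical_set)
    return counts.most_common(1)[0][0]

# Toy analogical set of (exemplar, predicted outcome) pointers for rike:
analogical_set = [('like', 'riked'), ('hike', 'riked'),
                  ('strike', 'ruck'), ('bike', 'riked')]
print(choose_outcome(analogical_set))  # 'riked' by plurality (3 of 4 pointers)
```

Under the random rule this set would yield 'riked' about three times in four, matching probability; under plurality it yields 'riked' every time, matching apparent hypothesis testing.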
3.5. The psychological plausibility of the Analogical Model
3.5.1. The experiential basis of ‘exemplars’
In the computer implementations of exemplar-based models of language behavior, researchers draw their datasets of exemplars from corpora of examples illustrating the behavior of interest for a given study. Thus, linguists using a model to predict the past-tense forms of English verbs draw their datasets, lists of English verbs and their associated past-tense forms, from a corpus of actual past-tense verb usages. No one argues seriously, however, that what we call the mental lexicon consists of such lists of extracted words. Indeed, we do not appear to carry around dictionaries in our heads (see, for example, Baayen, Reference Baayen2010; McDonald & Shillcock, Reference McDonald and Shillcock2001; Shaoul & Westbury, Reference Shaoul and Westbury2011). Instead, we assume that an exemplar of a given linguistic usage is extracted somehow from one’s memory for an episode of experience in which the brain has recorded both the sensory correlates of that experience (including the sensory components of the speech itself and of the non-linguistic context in which that episode of speech usage occurred) and the brain’s interpretation of the significance of those sensory components. Thus, the brain does not simply record a collection of utterances containing verb forms. It also notes which parts of those phonological forms are interpreted within a given context as expressing the contextualized embodied meaning of a given verb (discussed below) and which parts express the past-tense component of a given instance of verb usage. The brain does not simply record the sensory representations of the interlocutors’ speech and of the communicative context. It also records its interpretation of the significance of those components, the results of comparing the new input to the remembered exemplars.
3.5.2. Exemplars as episodic memories for the sensory representations of experience
All of the exemplar-based models described above assume a database of long-term memories for episodes of experience that embody rich memories for sensorimotor details, any or all of which are potentially available to participate in the analogical processes. The behavioral evidence justifying this assumption was cited above in the description of sensory details in exemplar-based categorization. Thus, we are positing the psychological reality of that theoretical construct, the memory episode, as the basis for widely observed categorization behaviors in general and for such behaviors seen in the acquisition and use of language in particular. The validation of the theoretical construct depends ultimately on our identifying what memory episodes consist of, what the nature and structure of their internal content is, what delimits an episode, and how the relevant information can be extracted from remembered episodes in order to apply it to the interpretation of a new episode of experience that shares only some features with the memories for the older experiences. Since examining each of these issues in detail exceeds the scope of this paper, my more modest goal here is simply to demonstrate the plausibility of certain tentative answers to those questions and, more importantly, to show that nothing we know to date about how the brain processes and remembers experiences contradicts the working assumptions of the Analogical Model described in this paper.
The sensory-rich episodic memories described earlier that we are positing to account for the phenomena of exemplar-based concept learning and categorization have close affinities to, and find independent motivation in, several other research threads on cognition and linguistic behavior. These threads include Barsalou’s theory of Perceptual Symbol Systems (cf. Barsalou, Reference Barsalou1999; Barsalou, Simmons, Barbey, & Wilson, Reference Barsalou, Simmons, Barbey and Wilson2003; Goldstone & Barsalou, Reference Goldstone and Barsalou1998), the ‘embodied language hypothesis’ (Bergen, Reference Bergen2012; Feldman & Narayanyan, Reference Feldman and Narayanyan2004; Glenberg, Reference Glenberg, Rickheit and Habel1999; Papeo, Negri, Zadini, & Rumiati, Reference Papeo, Negri, Zadini and Rumiati2010), the generation of ‘affordances’ and ‘construals’ triggered by linguistic input (Wisniewski, Reference Wisniewski1997), and “the dynamic application of ‘event knowledge’” observed by Elman (Reference Elman2004, Reference Elman2011) and Bergen (Reference Bergen2012) in people’s on-line interpretations of instances of language usage. These independent lines of research have contributed further behavioral and neurophysiological evidence not only that concepts are represented and processed within the same distributed cortical areas as were the original sensory percepts that motivated them, but also that those embodied concepts retain rich vestiges of their sensory origins. Beyond simply remembering sensory details about experience, these research threads all confirm that people can and do infer sensorimotor information for instances of categorization and language usage beyond that traditionally thought to be conveyed by words and syntax alone, information evidently grounded in sensorimotor memory for specific instances of experience with the objects and relationships invoked through those linguistic expressions.
Skousen (Reference Skousen1989, Reference Skousen1992) developed his Analogical Model independently of the work on exemplar-based concept learning and categorization behavior then in progress within cognitive psychology. His immediate goal was to account for probabilistic language behavior such as that described by Labov (Reference Labov1970) and others. Labov had found that speakers of languages exhibit very consistent probabilities in their variable usages, such as whether they pronounced the final [t] or [d] in words such as burnt or hand. Linguists (e.g., Bybee, Reference Bybee1985, Reference Bybee, Gruber, Higgins, Olson and Wysocki1988) could identify some of the linguistic and contextual variables that made the pronunciation of these sounds more likely or less likely, but collectively those identifiable variables always left a significant proportion of the variance unexplained. Skousen’s efforts to account for probabilistic behavior in language usage more fully led to his development of the Analogical Model, and much subsequent research such as that cited elsewhere in this paper has confirmed the empirical strength of the model. His model, however, was motivated largely by his efforts to account for empirical data on language behavior theoretically, and not so much by an intention to develop a psychological model of language.
3.5.3. Supracontexts and the comparison of linguistic forms in the Analogical Model
Based on work such as that of Labov (Reference Labov1970) and Bybee (Reference Bybee1985, Reference Bybee, Gruber, Higgins, Olson and Wysocki1988), Skousen recognized that the probability of choosing any one linguistic form over another would be some function not only of the relative frequency of the competing forms in a person’s linguistic experience but also of the similarity of the target form to other forms in memory. His first task, therefore, was to devise a procedure for comparing a target form to other forms stored in memory. Many years of research in linguistics have shown that more fully specified ‘rules’, or contexts, take precedence over more general rules, or contexts, in predicting behavior. For example, the past tense of the fully specified verb run (i.e., all three letters) is irregular, ran, whereas the past tenses of verbs sharing only a subset of letters with run are almost always regular. Footnote 8 Therefore, Skousen chose words from memory to compare with a target form by first removing just one feature from the target form and comparing the remainder to all forms in memory sharing that supracontext with the target form. He continued by removing additional features progressively, as illustrated by the supracontexts shown in Figure 2.
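The progressive removal of features can be sketched as follows. This is a minimal Python illustration: the three-way split of the nonce verb rike into the features ("r", "i", "ke") is an invented encoding for the sake of the sketch, not Skousen’s actual feature set, and "*" stands for a removed (wildcarded) feature.

```python
from itertools import combinations

def supracontexts(target):
    """Yield every supracontext of `target`: each way of keeping a subset of
    feature positions fixed while treating the remaining positions as
    wildcards ('*'), ordered from most to least specific."""
    n = len(target)
    for k in range(n, -1, -1):
        for kept in combinations(range(n), k):
            yield tuple(target[i] if i in kept else "*" for i in range(n))

def matches(form, supracontext):
    """A stored form fits a supracontext if it agrees on every fixed position."""
    return all(s == "*" or s == f for f, s in zip(form, supracontext))

target = ("r", "i", "ke")               # hypothetical feature split of 'rike'
all_contexts = list(supracontexts(target))
print(len(all_contexts))                # 8: 2**3 subsets of three features
```

A stored form such as ("l", "i", "ke") then fits exactly those supracontexts whose fixed positions it shares with the target, which is how forms of varying similarity are drawn into the comparison.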
Collectively, the forms selected from memory through comparison to the respective supracontexts represent the forms in memory that resemble the target form to a greater or lesser degree. Because a form in memory that shares more than one feature with the target form will be selected by each supracontext that also includes at least one of those features, the forms selected by the supracontexts collectively will reflect the fact that forms in memory sharing more features with a target form will also have a greater probability of predicting or influencing the behavior of that form. Skousen’s (Reference Skousen1989) procedure for comparing supracontexts replicated independently the similarity effects reported in Hayes-Roth and Hayes-Roth (Reference Hayes-Roth and Hayes-Roth1977) and confirmed in Logan (Reference Logan2002). Overall, models of concept learning that compare the formal equivalent of supracontexts outperform models that rely only on the simple comparison of individual features. Another effect replicated by Skousen’s comparison of supracontexts in AM is the finding reported by Shepard (Reference Shepard1987) that the subjective judgment of similarity between two items and the probability of interpreting those two forms as members of the same category vary approximately exponentially (Shepard’s characterization) as a function of the number of features shared by those two forms. Supracontext comparisons capture this relationship because, although the supracontexts themselves represent the potential influence of the features exponentially, not all supracontexts will ultimately contribute forms to the analogical set: (i) some supracontexts will not correspond to any forms that actually exist in memory, and (ii) not all supracontexts will turn out to be homogeneous.
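The exponential relationship falls out of the combinatorics of the supracontexts: a stored form that agrees with the target on k of its n features fits exactly those supracontexts whose fixed positions are all among the shared ones, that is, 2^k of the 2^n supracontexts. A small Python check, under an invented three-feature encoding of the forms (the feature split is hypothetical, chosen only to make the counting visible):

```python
from itertools import combinations

def matching_supracontexts(form, target):
    """Count the supracontexts of `target` (each a subset of fixed feature
    positions) that `form` satisfies, alongside the predicted count 2**k
    for a form sharing k features with the target."""
    n = len(target)
    shared = sum(f == t for f, t in zip(form, target))
    count = 0
    for k in range(n + 1):
        for kept in combinations(range(n), k):
            # `form` fits this supracontext if it agrees on every kept position
            if all(form[i] == target[i] for i in kept):
                count += 1
    return count, 2 ** shared

target = ("r", "i", "ke")   # hypothetical feature split of the nonce verb
for form in [("r", "i", "ke"), ("l", "i", "ke"), ("r", "u", "n")]:
    count, predicted = matching_supracontexts(form, target)
    assert count == predicted   # influence grows exponentially in shared features
```

As the paragraph above notes, the homogeneity filter then prunes this raw exponential weighting, which is why the observed relationship is only approximately exponential.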
3.5.4. The durability of episodic memories
Since at least the early work on verbal memory by Ebbinghaus (1964 [1885]), the phenomenon of forgetting material once learned has played a central role in learning theory. Under controlled laboratory settings, the rates of forgetting can be predicted accurately (e.g., Wickelgren, Reference Wickelgren and Estes1976). Many learning theorists accommodate this fact about learning and memory by incorporating a variable for memory ‘decay’ into their models. The connectionist models of category learning that have followed from the work of Rumelhart and McClelland (Reference Rumelhart, McClelland, Rumelhart and McClelland1986) have all included memory decay factors, and Estes (Reference Estes1986a, Reference Estes1986b) has shown that such decay factors are necessary in order for those models to closely match the empirical evidence on human category learning. Despite this common theoretical assumption, however, the evidence on learning and forgetting suggests that, so long as one’s brain remains relatively healthy, humans actually do not ‘forget’ experiences that they have encoded into long-term memories (Shanks, Reference Shanks1995). A more likely interpretation of the phenomenon of forgetting appears to be that in the absence of individualizing sensory information in an input stimulus, the brain simply has great difficulty isolating and accessing a specific exemplar of commonly recurring experiences from among the mass of memories for similar experiences stored in LTM.
In order to accommodate these conflicting phenomena, both Estes (Reference Estes1986a, Reference Estes1986b) and Skousen (Reference Skousen1989, Reference Skousen1992) chose to model those effects as ‘imperfect memory’ rather than as ‘memory decay’. Imperfect memory implies – accurately – that memory for a given episode of experience may be accessed successfully on one occasion but not on another. Estes found that allowing his imperfect memory factor to vary between 0.05 and 0.10 allowed his category-learning model to closely approximate the performance of different human participants. Skousen initially adopted the value of 0.50 for his imperfect memory factor for theoretical reasons based on information theory and uncertainty. That value gives AM power equivalent to the Pearson X² statistic in predicting outcomes from decision tables. Nonetheless, that value has also proven to allow AM to track the performance of human participants closely on a variety of linguistic tasks (often with better than 90% accuracy). Chandler (Reference Chandler, Skousen, Lonsdale and Parkinson2002, unpublished observations) showed that varying the imperfect memory value also allowed AM to track the performance of patients with a variety of neurological impairments on a battery of verb-inflection tasks.
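Imperfect memory can be sketched as a per-occasion sampling of the exemplar base. This is a hypothetical Python sketch: reading the factor as the probability that a stored exemplar is accessible on a given occasion is an assumption about how to operationalize it, and the exemplars are invented.

```python
import random

def accessible_exemplars(memory, retention=0.5, seed=None):
    """On a given occasion, each stored exemplar is independently accessible
    with probability `retention` (0.50 here, following Skousen's value, under
    the assumption that the factor names the probability of access)."""
    rng = random.Random(seed)
    return [exemplar for exemplar in memory if rng.random() < retention]

memory = [("hike", "-ed"), ("like", "-ed"), ("ride", "rode"), ("strike", "struck")]

# The same episode may be accessible on one occasion but not on another:
occasion_1 = accessible_exemplars(memory, seed=1)
occasion_2 = accessible_exemplars(memory, seed=2)
```

Under this reading, nothing is ever erased from the exemplar base; only the sample of it consulted on a given occasion varies, which is the contrast with a decay parameter that permanently attenuates stored traces.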
Currently, we have no way of determining whether every episode of experience lays down a new episodic memory in LTM, nor do we know whether episodic memories are indeed permanent. Jacoby and Brooks (Reference Jacoby, Brooks and Bower1984) and Whittlesea (Reference Whittlesea, Lamberts and Shanks1997) showed that even a single, brief presentation of an exemplar during a category-learning task could affect the speed of classification of that exemplar up to a year following the single presentation, and that that once-seen item could prime the categorization of similar exemplars. These data appear to confirm the imperfect-memory approach adopted by Estes (Reference Estes1986a, Reference Estes1986b) and Skousen (Reference Skousen1989, Reference Skousen1992), and appear incompatible with the memory-decay approach incorporated into connectionist theories.
Recent research has shown that even into late middle age, people can exhibit the influence of linguistic experiences laid down in very early childhood and subsequently lost to conscious recall. Oh et al. (Reference Oh, Au and Jun2010) and Bowers et al. (Reference Bowers, Mattys and Gage2009) studied adults who had been exposed to, and had begun to use, very different languages in early childhood before being removed from those early linguistic environments as young children. As adults, those participants had lost all conscious memory for their early childhood languages and experiences. Yet, when they attempted to relearn the languages that they had once been in the process of acquiring as toddlers, they were able to learn non-English phonological contrasts and certain other linguistic features significantly better than were age-matched controls who had had no previous experience with those languages. These results suggest that, even after fifty years had passed, the participants were still able to benefit from memories for linguistic experiences laid down in early childhood.
3.5.5. Probability and uncertainty
Having used supracontexts to identify a set of forms in memory that resemble a target form to a lesser or greater degree, Skousen’s next task was to find a psychologically plausible way to model the application of the information represented in those forms to the analogical interpretation of the target form. He needed to predict the probability that a speaker would choose one of the behaviors represented in the forms selected by the supracontexts to apply to the target form. As Skousen (Reference Skousen1998) pointed out, one can calculate the probability distribution for those contexts and their associated outcomes if one has access to the relative frequencies of usage for them. However, he continued, there are no known neural mechanisms for recording and accessing such frequency information or for carrying out the complex mathematical calculations used to arrive at those distributions.
Skousen’s solution to this conundrum, and one of the most important insights of his model, was his discovery that the model could accomplish the same effect through simple inspection of the examples chosen from memory by the supracontexts. The test for heterogeneity examines each supracontext one by one and determines, as noted earlier, (i) whether it selects any examples from long-term memory that are more similar to the target form than that supracontext itself is, and (ii) if there are such forms, whether any of them introduce greater uncertainty about the possible outcomes (behaviors) than the forms selected by the supracontext alone do. Skousen (Reference Skousen1989, Reference Skousen1992, Reference Skousen1998) called this procedure a ‘natural statistic’ because it exhibits the same predictive power as a traditional statistical test but without positing that the brain actually effects such a calculation, and, as noted earlier, it gives the model the same predictive power as the Pearson X² statistic applied to a contingency table.
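A simplified version of this inspection can be sketched in Python. The sketch implements one common simplification of the test, not Skousen’s full procedure: a supracontext counts as homogeneous if its exemplars all share one outcome, or if they all come from a single more specific supracontext, so that widening the context adds no new uncertainty. The data are invented.

```python
def is_homogeneous(exemplars, occupied_subcontexts):
    """Simplified homogeneity check. `exemplars` are the (form, outcome) pairs
    selected by a supracontext; `occupied_subcontexts` are the groupings of
    those exemplars by more specific supracontexts (closer to the target)."""
    outcomes = {outcome for _form, outcome in exemplars}
    if len(outcomes) <= 1:
        return True          # no disagreement about the outcome at all
    nonempty = [sub for sub in occupied_subcontexts if sub]
    # Disagreement is tolerated only if it is wholly contained in a single
    # closer subcontext, i.e., the wider context adds no new uncertainty.
    return len(nonempty) == 1 and len(nonempty[0]) == len(exemplars)

# Mixed outcomes spread across two closer subcontexts: heterogeneous.
assert not is_homogeneous(
    [("ride", "rode"), ("hike", "-ed")],
    [[("ride", "rode")], [("hike", "-ed")]],
)
# A single shared outcome: homogeneous regardless of the subcontexts.
assert is_homogeneous(
    [("hike", "-ed"), ("like", "-ed")],
    [[("hike", "-ed")], [("like", "-ed")]],
)
```

The point of the sketch is that the decision rests on simple set comparisons over the selected exemplars; no frequency distributions or statistical calculations are computed, which is what makes the procedure a ‘natural statistic’ in Skousen’s sense.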
3.5.6. Decision rules
The exemplars in memory that become activated because they correspond to homogeneous supracontexts become in turn part of the analogical set, the candidate forms that will serve as the basis for the analogical interpretation of the new input form. As noted above, Skousen (Reference Skousen1989) proposed two decision rules for the Analogical Model: random selection, choosing any one of the forms represented in the analogical set at random to serve as the basis for the analogical interpretation of the target form; and selection by plurality, selecting the outcome, or response, represented most often in the analogical set.
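The two rules can be sketched directly over an analogical set represented as a multiset of (form, outcome) occurrences. This is a hypothetical Python illustration; the counts below are invented, and the occurrence counts stand for per-supracontext appearances rather than raw token frequencies.

```python
import random
from collections import Counter

def random_selection(analogical_set, rng=random):
    """Pick one exemplar occurrence at random: each outcome's probability of
    being chosen equals its share of the analogical set."""
    form, outcome = rng.choice(list(analogical_set.elements()))
    return outcome

def selection_by_plurality(analogical_set):
    """Pick the outcome represented the greatest number of times."""
    outcome_counts = Counter()
    for (form, outcome), n in analogical_set.items():
        outcome_counts[outcome] += n
    return outcome_counts.most_common(1)[0][0]

# Invented analogical set for a nonce verb.
analogical_set = Counter({("hike", "-ed"): 3, ("like", "-ed"): 2, ("ride", "rode"): 1})

print(selection_by_plurality(analogical_set))  # "-ed" (5 of the 6 occurrences)
```

Random selection reproduces probability matching (responses in proportion to the set), while plurality always returns the majority outcome, which is what gives responses their deterministic, hypothesis-testing appearance.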
There is abundant evidence that people can exercise at least some strategic, meta-cognitive control over their choice of which decision rule to apply on a given occasion (e.g., Ashby, Reference Ashby and Ashby1992; Derks & Paclisanu, Reference Derks and Paclisanu1967; Estes, Reference Estes1994; Lovett, Reference Lovett, Anderson and LaBiere1998; Messick & Solley, Reference Messick and Solley1957; Simon, Reference Simon and Simon1957). These studies largely agree that, in the absence of any particular conscious motivation for preferring one outcome over another, people of all ages (preschool through adult) apply the equivalent of random selection as their default response strategy. That is, the proportion of their use of alternative responses appears to reflect the proportions of those alternatives in their learning histories. If, however, people sense some tangible advantage to preferring one outcome over another, they quickly and consciously learn to maximize their gain by selecting the most frequent outcome consistently – the plurality rule. Footnote 9 When participants do adopt the plurality rule, they give the appearance of responding deterministically, that is, of hypothesis testing, which of course is, in a sense, what they are doing. Nosofsky and Zaki (Reference Nosofsky and Zaki1998) showed evidence both of different participants in categorization experiments using different response strategies and of participants changing strategies during the course of an experiment. Chandler (Reference Chandler2010, unpublished observations) has also cited evidence of response-strategy shifting in the results for past-tense verb inflection data reported by Albright and Hayes (Reference Albright and Hayes2003) and by Ullman, Pancheva, Love, Yee, Swinney, and Hickok (Reference Ullman, Pancheva, Love, Yee, Swinney and Hickok2005).
4. Conclusions and implications
In developing usage-based models of language, an increasing number of linguists have singled out categorization as perhaps the most important domain-general cognitive process underlying the acquisition and use of language (cf. Bybee, Reference Bybee1995; Croft, Reference Croft, Fried and Östman2004; Croft & Cruse, Reference Croft and Cruse2004; Goldberg, Reference Goldberg1995, Reference Goldberg2006; Lakoff, Reference Lakoff1987; Langacker, Reference Langacker2008; Taylor, Reference Taylor1995, Reference Taylor, Dabrowska and Divjak2015; Tomasello, Reference Tomasello2003, Reference Tomasello2006). In this paper, I have reviewed a large body of research which shows that prototype-based models of concept learning and categorization, especially as implemented in connectionist networks, are theoretically and empirically incapable of accounting not only for many of the commonly cited prototype effects but also for many other findings about concept learning and categorization not generally considered by the proponents of prototype theory. Any model of concept learning and categorization must accommodate that evidence and the exemplar-based behaviors cited in this paper. Prototype-based models and, in particular, the extant connectionist implementations of those models, do not appear capable of doing so.
The three exemplar-based models cited in this paper appear to be fully capable of accounting both for those prototype effects and for additional exemplar-based effects that were neither anticipated nor recognized by the proponents of prototype theories. These include the unanticipated effects of sensory details on recall and categorization, and the construction of ad hoc categories and compound categories on the fly. Of those three exemplar-based models, however, Skousen’s (Reference Skousen1989) Analogical Model appears to account for the behaviors while positing the simplest and most straightforward set of underlying theoretical procedures and assumptions.
The major implication of this conclusion for linguistic theory is that our knowledge of linguistic categories, and perhaps of language more generally, does not consist of resident linguistic generalizations, a grammar, that have been abstracted away from our experiences with exemplars of linguistic usage. Instead, the phenomena of categories and of categorization appear to be better explained by positing a mechanism and a set of procedures by which we compare current instances of linguistic usage systematically to memories for previous instances of similar usages in order to arrive at a formulation or interpretation of the new instance on the fly. Such models not only account for categorization behavior more successfully than do the prototype-extraction models, but they also obviate the need to try to identify learning processes and mechanisms that can account for such extractions, an effort that has challenged both linguistic theorizing and psychological theorizing for at least the past half century.