1. Introduction
Research on the cognitive basis of language comprehension has developed considerably since the principles of cognitive linguistics were first formulated. Along with furthering the field of ‘simulation semantics’, researchers have become increasingly concerned with the interactions between meaning and processes in domains such as perception, motor action, and emotion (e.g., Barsalou, 1999; Bergen, 2012; Glenberg & Kaschak, 2002; Zwaan, 2003). The dynamic view of conceptualization as prompted by linguistic input that underlies this paradigm has been explicitly cited as consonant with some views of semantics within cognitive linguistics, such as cognitive grammar (Langacker, 2008: Ch. 14) and conceptual metaphor theory (Gibbs, 2006).
Regrettably, however, the development of this framework has taken place largely independently of two other developments in language research, namely increased interest in (i) the social, contextual, and pragmatic aspects of linguistic communication; and (ii) the variably multimodal nature of spoken language. In this paper, we argue that as a result of divorcing language from its ‘canonical encounter’, i.e., spoken conversation (Clark, 1973), it is not evident how simulation theories of language comprehension extend to real-life situations. After evaluating the complexity of the process of language comprehension in view of four prevalent perspectives on interpersonal communication, this paper proposes an integrative account that provides the background for assessing the role of mental simulation in everyday language comprehension.
The following presents a brief overview of the state of the art of simulation-based theories of language understanding. Next, we assess how comprehension in a laboratory setting differs from comprehension in spoken interaction and show that the explanatory scope of simulation theories is confined to a non-canonical form of ‘language’ and an overly simplistic conception of ‘comprehension’. Finally, we discuss a number of connections between simulation semantics and aspects of face-to-face communication, and touch upon some methodological implications.
1.1. Simulation theories of language comprehension
Simulation-based theories of comprehension have come about in opposition to the premise that meaning relies on abstract, atomic symbols. Copiously supported by experimental research, simulation theories hold that language comprehension engages partial re-enactment of perceptual, motoric, and affective memory traces, insofar as they are relevant to the concept or situation described. Evidence for this proposal has been provided on different levels of linguistic complexity.
1.2. Word comprehension
Since the late 1990s, brain imaging studies have demonstrated that processing isolated nouns and verbs recruits substrates of memory systems that correspond to their content. Listening to descriptions of objects, for instance, has been shown to elicit modality-specific neural activation in brain areas related to their perceptual features (Chao, Haxby, & Martin, 1999; Pulvermüller, Mohr, & Schleichert, 1999). Words describing physical actions, likewise, induce activation in motor areas specific to the parts of the body they are performed with (Hauk, Johnsrude, & Pulvermüller, 2004; Hauk & Pulvermüller, 2004; Vigliocco, Warren, Siri, Arciuli, Scott, & Wise, 2006). These and other findings have been taken to support the idea that the ‘symbols’ on which language operates are abstracted away from embodied experiences, thus grounded in perceptual, motor, and emotive systems (Barsalou, 1999; Evans, 2009).
1.3. Sentence comprehension
Comparable evidence has been obtained in neuroscientific studies where participants were exposed to sentences rather than single words. Desai, Binder, Conant, and Seidenberg (2010), for example, observed modality-specific activation of cortical areas in response to sentences describing motor actions or visual scenes. Tettamanti et al. (2005) and Aziz-Zadeh, Wilson, Rizzolatti, and Iacoboni (2006) report language-induced activation in premotor areas corresponding to foot, hand, or mouth actions, in close correspondence with the type of action described in the stimulus sentences.
More informative results with respect to semantic representation of sentence meaning come from behavioral research. A range of experimental studies has suggested that, in sentence comprehension, people conceptualize perceptual and motor details beyond the propositions presented explicitly. Zwaan, Stanfield, and Yaxley (2002), for example, found that after reading ‘The ranger saw the eagle in the sky’, participants were faster to respond to an image of a bird with spread wings than to an image of a bird with closed wings, whereas this effect was reversed after reading ‘The ranger saw the eagle in the nest’. Employing analogous research strategies, scholars have provided converging evidence that mental simulations of sentence content encode rigorous perceptual detail of the objects described, including their spatial orientation (Stanfield & Zwaan, 2001), color (Connell, 2007), visibility (Yaxley & Zwaan, 2007), spatial characteristics (Bergen, Lindsay, Matlock, & Narayanan, 2007; Winter & Bergen, 2012), trajectory of motion (Kaschak et al., 2005; Matlock, 2004; Zwaan, Madden, Yaxley, & Aveyard, 2004), and viewpoint (Borghi, Glenberg, & Kaschak, 2004).
Similar interaction effects have been reported in motor and affective domains. Comprehension of sentences describing actions (Klatzky, Pellegrino, McCloskey, & Doherty, 1989) and emotional situations (Havas, Glenberg, & Rinck, 2007) has been found to be sensitive to the comprehender’s current bodily state. Conversely, experimental evidence has shown that reading sentences in which physical movement is implied affects subsequent performance of actual motor actions, insofar as the performed and implied actions are compatible (Bergen & Wheeler, 2005; Glenberg & Kaschak, 2002; Zwaan & Taylor, 2006). The motor simulations involved in processing action descriptions, furthermore, appear to be sensitive to the affordances of the objects described, i.e., the way they can be manipulated or interacted with (Borghi & Riggio, 2009; Chwilla, Kolk, & Vissers, 2007; Kaschak & Glenberg, 2000; Masson, Bub, & Warren, 2008).
In an attempt to bring together these and other findings, Zwaan (2003) hypothesizes that during sentence comprehension, people construct vicarious and holistic mental simulations of the expressed content in an immersed fashion, i.e., as if they are actually part of the referential scene. Similar hypotheses have been put forward by Barsalou (2003, 2005).
1.4. Discourse comprehension
The study of language comprehension on the level of stretches of written discourse longer than single sentences has developed largely independently of the research mentioned so far. Discourse comprehension research has focused mainly on people’s ability to keep track of the local and global coherence in a text and the kinds of inferences they draw in doing so. Such inferences involve the spatial, temporal, and causal structure of the described situations, as well as the traits, goals, and motives of the relevant characters (Zwaan, Magliano, & Graesser, 1995). The mental constructs subsuming the totality of these inferences, in addition to the information explicitly presented, are often referred to as situation models (Graesser, Millis, & Zwaan, 1997; Van Dijk & Kintsch, 1983; Zwaan & Radvansky, 1998). The neurocognitive underpinnings of situation models have only relatively recently been gaining serious attention, along with advances in research on embodied sentence processing. Zwaan (2009, p. 1145), for instance, argues that simulation theories of comprehension “fill the theoretical gap” between situation model theories of event relations and embodied theories of conceptualization. Empirical support for recruitment of sensorimotor systems in discourse comprehension is available (e.g., Speer, Zacks, & Reynolds, 2007; Wallentin, Nielsen, Vuust, Dohn, Roepstorff, & Lund, 2011), but the relation between simulation semantics and situation model theory currently remains underspecified (see also Section 3.2).
1.5. Language comprehension?
Simulation semantics, in its current state, is not devoid of controversy. Current debates mostly center on the questions of whether abstract language engages mental simulation in the same way as concrete language (e.g., Dove, 2010) and whether simulation actually has a functional role with respect to comprehension (Mahon & Caramazza, 2008; Willems & Francken, 2012). A question less often addressed is what exactly it means to claim that simulation semantics provides ‘a theory of language comprehension’. To what degree, if at all, is the way people process isolated, written excerpts of language in the laboratory similar to the way people comprehend each other in everyday life? Some researchers of (narrative) discourse comprehension have argued that the stimuli they deploy provide a good model for everyday language use, claiming that:
[n]arrative text has a close correspondence to everyday experiences in contextually specific situations, […] both narrative texts and everyday experiences involve people performing actions in pursuit of goals, the occurrence of obstacles to goals, and emotional reactions to events. (Graesser, Singer, & Trabasso, 1994, p. 372)
Whereas it might be true that narrative text comprehension, given its contextualized character, is more natural than the comprehension of isolated sentences or words, it is still, in many ways, an artificial form of language, encountered in relatively late stages of language acquisition. The ecological validity of the research discussed in the previous sections, therefore, might not be as evident as previously supposed. In line with Willems and Francken (2012), we contend that the role of mental simulation in the process of language comprehension is to be understood against the background of a situated and interactive view on language use. As a first step toward laying the foundations of such a theory, the following sections review four different perspectives on language comprehension in everyday spoken interaction.
2. Comprehension in the laboratory vs. comprehension in interpersonal communication
Social-interactional factors have long been considered beyond, or at most peripheral to, cognitive linguistic approaches to meaning (see Geeraerts, 2010). Although critiques of the cognitivist take on language processing are not new (cf. Parker, 1992), the indisputable importance of such factors for the study of linguistic meaning has only recently become more widely accepted. As Croft (2009, p. 395) argues:
[T]he foundations of cognitive linguistics […] are too solipsistic, that is, too much ‘inside the head’. In order to be successful, cognitive linguistics must go ‘outside the head’ and incorporate a social-interactional perspective on the nature of language.
Literature from social psychology, indeed, demonstrates that language comprehension in interpersonal communication encompasses much more than text comprehension does, and can be approached from a variety of perspectives (Krauss & Fussell, 1996). In addition to constructing semantic representations, language comprehension in face-to-face interaction draws upon various pragmatic and social-interactive capacities. Moreover, it involves the dynamic integration of verbal, intonational, and gestural cues. In the following, we outline four perspectives on how ‘language comprehension’ in interpersonal communication can be defined, highlighting factors that have remained largely outside the current purview of simulation theories of comprehension.
2.1. Comprehension as intersubjective conformity
Understanding a sentence like ‘The ranger saw the eagle in the sky’ in a decontextualized, experimental setting entails taking hold of the type of situation this sentence may refer to. In interpersonal communication, by contrast, comprehension involves pursuing intersubjective conformity; that is, comprehenders need to assess which particular tokens of entities or events the speaker is most likely to refer to given the current context (e.g., a particular eagle or a particular instance of seeing that eagle). The human capacity for reference resolution in contextualized communicative situations, thus, is integral to situated language comprehension.
One of the first accounts to take the relation between context and referential meaning seriously was Barwise and Perry’s (1981, 1983) influential situation semantics, which proposed that the meaning of an utterance is determined by the set of alternative interpretations that the communicative situation makes available, and that the correct interpretation can be deduced by means of parameter setting. Somewhat independently, reference disambiguation has become a topic of interest in social psychological approaches to language, as an aspect of ‘grounding’ (Clark & Brennan, 1991; Clark & Marshall, 1981). On Clark’s (1996 and elsewhere) account, language users understand referential expressions through assessment of their common ground, i.e., the degree to which current beliefs, knowledge, and suppositions are shared. Reference resolution, on this view, can be thought of as an active search process, whereby the search space is restricted to the assumed overlap in viewpoint and experiential memory between the interlocutors. Others have hypothesized that much of the problem of reference resolution may be resolved in a more automatic way, by dint of ad-hoc referential routines established during dialogue (Pickering & Garrod, 2004).
2.2. Comprehension as intention recognition
Plausibly reminiscent of the computational paradigm of the mind dominant during the previous decades, many experimental studies on language processing implicitly assume language’s main goal to be ‘the transfer of information’ (cf. Shannon & Weaver, 1948). As Austin (1962) and other pragmaticists have pointed out, however, one can do much more with words than assert information. A more contemporary view, as proposed in Relevance Theory (and elsewhere), is that linguistic expression serves to bring about “contextual effects in an individual” (Sperber & Wilson, 1986, p. 265). The primary function of linguistic utterance, accordingly, is to elicit either an overt (e.g., verbal) response by the addressee or a covert effect (e.g., a change of beliefs or intentions).
From the addressee’s point of view, this ‘intentionalist’ paradigm of communication implies that the essence of language understanding is not to seize the literal (referential) meaning of an utterance, but to successfully infer what effect the speaker aims to bring about. Akin to the debate on the cognitive underpinnings of common ground assessment, theories on the mechanisms underlying intention recognition vary along the automatic–effortful continuum (Carruthers & Smith, 1996). Some ‘categorical’ aspects of intention are commonly believed to be understood directly through routinized associations with linguistic elements (e.g., through grammatical marking of illocutionary force), while more ad-hoc aspects of the speaker’s communicative intention have been argued to recruit effortful computations, taking into account presumptions about the speaker’s current mental state (Goldman, 1992).
2.3. Comprehension as dynamic anticipation
A third limitation of many current accounts of comprehension is their conception of language use as a unidirectional event. Despite abundant criticism of the ‘myth of the isolated mind’ inherent in such approaches (Bakhtin, 1981; Clark, 1973, 1996; Linell, 2007; Stolorow & Atwood, 1992), it is still common practice in psycholinguistic and cognitive linguistic research to consider language processing as taking place inside one individual brain. Alternative, dialogical approaches to comprehension, by contrast, hold that linguistic communication should not be regarded as an encapsulated event, but as a dynamic and interactive process in which meanings and intentions are cooperatively negotiated. Linguistic communication, according to this view, can be characterized as a joint activity: a process whereby interlocutors collaborate to achieve a shared conception of the ongoing discourse.
Within dialogical models, a discrepancy exists in terms of the types of interaction considered relevant to the dialogue. A first type of model is predominantly concerned with interactions among interlocutors themselves, whereby dialogue is regarded as a process of jointly constructing shared discourse models (Bakhtin, 1981; Garrod & Anderson, 1987; Garrod & Pickering, 2004). A second, more radical version of ‘dialogism’ maintains that an accurate model of language use should not merely take interactions between interlocutors into account, but should take the dynamics of the communicative setting in which the interaction takes place as a starting point (e.g., Linell, 1998, 2007). Language comprehension and other cognitive capacities, on this view, are to be understood as residing in a continual dialectic between the interlocutors and their environment, and to be studied in relation to the situational continuum in which they are embedded: “Rather than being the seat of epistemically private mental representations, the brain functions to regulate the body’s interactions with its ecosocial environment” (Thibault, 2005, p. 152). Neither the semantic nor the pragmatic–intentional dimensions of comprehension, accordingly, should be seen as encapsulated events. Rather, these aspects of comprehension can be thought of as instrumental to the more general cognitive capacities for projection and anticipation in dynamic interaction. Or, as Linell (2007, p. 611) puts it:
[Meaning] potentiality is related to creativity and adaptability, to the principled capacity of language to meet the communicative needs of ever changing situations […] To understand an utterance in real time, we must be able to predict the continuation and project one’s own and others’ possible next actions.
2.4. Comprehension as multimodal
A final dimension of face-to-face language comprehension which has largely remained in the background pertains to language’s multimodal nature. A few examples aside (e.g., Richardson, Spivey, Barsalou, & McRae, 2003; Winter & Bergen, 2012), cues for constructing mental simulations in the experiments discussed all have the form of written text. The foundations of simulation theories, thus, rely on a form of language that is sentence-based and unimodal, in sharp contrast to the form language takes in spoken interaction (cf. Chafe, 1994: Ch. 2; Linell, 1982/2005; Ochs, Schegloff, & Thompson, 1996).
A first difference is that extemporaneous speech, being a time-constrained activity, involves an on-line process of ‘information packaging’ (Chafe & Tannen, 1987). Chafe (1994, p. 109) and others have posited that speakers distribute their messages over prosodically delineated intonation units so as to conform to their own processing constraints and those of the addressee. Using the term ‘tone unit’ (TU) instead of intonation unit, Altenberg (1987, p. 46) claims that “it is in terms of TUs – rather than any specific grammatical unit – that speakers organize and present information in discourse, and it is through TUs that listeners perceive and understand this information”.
Information packaging has profound implications for discourse-anticipatory dimensions of language comprehension. Prosodic contours are known to mark an utterance’s information structure and to be revealing with respect to the flow and continuation of the discourse (Bolinger, 1986; Brazil, 1997). Listeners, as demonstrated experimentally (e.g., Swerts & Geluykens, 1994), exploit melodic and pausal cues to process local and global aspects of discourse structure. Comprehension of spoken language, thus, is to be seen as an incremental process that imposes processing constraints different from those of written language.
Second, face-to-face communication allows for extensive use of ostensive behaviors other than oral expression, including manual and facial gestures. The type of gestures most often studied in this context, those with a representational function, primarily pertains to semantic dimensions of comprehension (e.g., Beattie & Shovelton, 1999; McNeill, 1992). Such gestures can be communicative in various ways (Kendon, 2004), for instance by providing spatial content in a way that speech is not suited for, by providing information that is additional to what is conveyed verbally, or by providing additional cues in case speech comprehension is difficult (Hostetter, 2011; Kendon, 1994).
In addition to their relevance to semantic processes, co-speech gestures have been associated with a range of functions in the realm of pragmatics. Kendon (2004, pp. 158–159) mentions three main pragmatic functions of gesture, namely modal functions (which alter the frame in terms of which what is being said is to be interpreted), performative functions (which indicate the kind of speech act the person is engaged in), and parsing functions (which mark, for example, the logical structure of what is being uttered). In the latter function, speakers may, for example, use their hands to contrast two opposing positions in a debate, or to sum up a list of points. Experimental studies, furthermore, have foregrounded gesture’s role in disambiguation of verbal expressions and interactive grounding (Clark & Krych, 2004; Holler & Beattie, 2003; Kelly, Özyürek, & Maris, 2010). In interactional terms, such gestures allow the addressee to predict the upcoming material and prepare their own reactions.
Altogether, intonation and co-speech gestures are (often simultaneously) relevant to the semantic, pragmatic, and anticipatory aspects of spoken language comprehension.
2.5. A reconciled view
The previous sections have set out two fundamental differences between language comprehension inside and outside the laboratory. The first concerns the kind of process that comprehension is (active and continuous rather than passive and event-like). The second concerns the kind of stimulus that language is (multimodal to varying degrees and distributed across conversational turns, rather than consisting of unimodal units). Language comprehension in face-to-face communication, all in all, is a dynamic and multi-faceted activity, accomplished on the basis of verbal as well as co-verbal signaling.
How are these facts to be reconciled into a coherent processing model? Notwithstanding that there exists some tension between the perspectives outlined in the sections above in terms of their underlying presumptions, they are not fully incompatible. Rather, they can be seen as mirroring different layers of a hierarchically organized processing architecture. In line with Clark’s (1999) view of speech acts as comprising ‘action ladders’, the multilayeredness of comprehension can be thought of as reflecting language’s role as a coordination device for communication, and communication’s role as a coordination device for other types of joint action (cf. Clark, 1996; Croft, 2009). Comprehension in the sense of establishing intersubjective conformity, accordingly, is at least to some extent conditional on intention recognition, which can be considered subordinate to the more general skill of engaging in dynamic interaction.
These different types of process, however, are not to be regarded as modular or merely sequentially organized. Ample evidence has shown that linguistic and more general (socio-)cognitive processes continually interact in a top-down fashion as well (e.g., Hagoort, Hald, Bastiaansen, & Petersson, 2004; Van Berkum, Van Den Brink, Tesink, Kos, & Hagoort, 2008). A felicitous processing model should therefore acknowledge that comprehension emerges out of the bi-directional interplay between linguistic and more general communicative capacities, and, from a mechanistic point of view, is to be defined “across multiple coupled dynamical systems” (Wilson & Golonka, 2013, p. 10).
Figure 1 provides a simplified sketch of a model that incorporates these considerations: general socio-cognitive processes and more specific linguistic processes are portrayed as constituting a hierarchically organized, mutually interactive network. The term ‘analysis’ is used after Bergen and Chang (2005), referring to the process of extracting the parameters according to which a mental simulation is performed from the perceived utterance. The other three levels correspond to the dimensions of comprehension discussed in this section. The arrows on the left indicate that all different subcomponents of face-to-face language comprehension are potentially sensitive to gesture, intonation, and other co-verbal behaviors.
This model is contrasted with the ‘experimental’ take on language comprehension, where co-verbal, discursive, and social-contextual factors are typically factored out. This comparison, notably, is not meant to suggest that the experimenters in question do not (at least in theory) acknowledge the importance of social-pragmatic processes, but rather to point out that factoring out such facets of comprehension for the benefit of experimental control may come at the expense of the ecological validity of this type of research.
Does this mean, then, that social-pragmatic processes and co-verbal aspects of communication are to be a concern for (simulation) semanticists? Many would contend that the inclusion of such factors blurs the scope of what semantics is about. For two related reasons, however, these issues are worth considering. The first is that, from a cognitive point of view, the scope of semantics is barely delineable in the first place. As argued by Langacker (1987, 1997) and others, the meaning of contextualized utterances is never devoid of pragmatics. Because the usage events from which linguistic meanings are abstracted have predominantly taken place in social-interactive settings, all utterances have an inherent socio-pragmatic import. A second reason is that simulation theories are often not only presented as an approach to semantics, but also explicitly put forward as a theory of language comprehension (e.g., Bergen & Chang, 2005; Glenberg & Robertson, 1999; Zwaan, 2003). If this ambition is taken seriously, the interfaces between semantic, social-pragmatic, and multimodal aspects of communication need more thorough examination. The following section raises a number of issues and questions relevant to pursuing this goal.
3. Simulation semantics in the context of interactive, spoken language comprehension
Here we discuss some issues and open questions that are relevant to assessing the explanatory scope of current simulation theories in the light of a broader, interactional perspective on comprehension.
3.1. Mental simulation and mentalizing
By factoring out the communicator and situational context, experimental studies such as those reviewed earlier limit their participants’ freedom for meaning construction to the individual mind. This arguably makes an unnatural appeal to the participant’s language resources, which have been claimed to be “designed to be completed only in situated meaning-making” (Linell, 2007, p. 611). In other words, this research can be said to capture the subjective, but not the intersubjective nature of meaning.
As discussed before, the human capacity for social-pragmatic dimensions of comprehension, such as the inference of others’ viewpoint and communicative intentions, has been ascribed to automatic, associative systems as well as more effortful ‘mentalizing’ mechanisms (cf. Carruthers & Smith, 1996). Here we discuss in turn how each of these types of mechanism relates to language-driven mental simulation.
Associative accounts of intention recognition maintain that mental states are in essence inaccessible, and that social behavior is essentially understood by virtue of regularities (or ‘rules’) in social-interactive experiences (Gopnik, 1995). This type of account is quite readily commensurable with the principles of simulation semantics, insofar as mental simulations are not approached as static, internal entities, but as “dynamic, generalized associations which always act relative to the environment” (Robinson, 2000, p. 260). That is, simulation semantics has the potential to extend to socially embedded language use, simply by acknowledging that the experiential knowledge people re-enact during language comprehension does not only involve perception and bodily action, but also contextualized patterns of goal-oriented communication with other individuals. This extension, in fact, follows quite naturally from the usage-based paradigm that lies at its core.
Other theorists have argued that intention recognition amounts to generating a mental simulation of the interlocutor’s behavior, so as to “replicate, mimic, or impersonate with the mental life of the target agent” (Gallese & Goldman, 1998, p. 497). The question of how this relates to language-driven simulation, inevitably raised by this terminological resemblance, is a subject of controversy. Some have argued that both types of ‘simulation’ draw on embodied mechanisms and ultimately reside in patterns of activity in substrates of the mirror neuron system (e.g., Gallese, 2007). Evidence from neuroimaging (Willems, de Boer, de Ruiter, Noordzij, Hagoort, & Toni, 2010), electrophysiology (Egorova, Shtyrov, & Pulvermüller, 2013), and research on aphasia (Willems, Benn, Hagoort, Toni, & Varley, 2011), however, suggests that semantic and pragmatic aspects of comprehension rely on largely distinct neural substrates (for a review, see Willems & Varley, 2010). The terminological resemblance of the two ‘simulation theories’, hence, does not seem to reflect full functional or neural overlap.
Various questions still remain with respect to the relation between simulation semantics and social-cognitive capacities. A first issue concerns the notion of ‘inference’: do the different types of inference involved in sentence comprehension (Section 1.3), namely situation model construction (Section 1.4) and intention recognition (Section 2.2), rely on the same (type of) neural mechanism? A second question concerns the notion of ‘perspective’, which seems to play an important role in the construction of mental simulations (Borghi et al., 2004; Brunyé, Ditman, Mahoney, Augustyn, & Taylor, 2009; Zwaan, 2003), as well as in social aspects of communication (Krauss & Fussell, 1988; Sperber & Wilson, 1986): To what extent does the human capacity for perspective reallocation constitute an interface between mental imagery and mentalizing capacities? Finally, one may wonder to what extent the vicarious, immersed character of mental simulations has a functional role with respect to pragmatic inference. Do the representational details of language-driven mental simulations constitute a source for the hearer to draw upon in understanding the behavioral implications of an utterance, or is the vividness of language-induced imagery merely epiphenomenal to the brain’s associative nature, and irrelevant for social aspects of comprehension?
3.2. Mental simulation and dialogism
Some dialogue-oriented theories have proposed the ultimate goal of communication to be the alignment of situation models (Menenti, Pickering, & Garrod, 2012; Pickering & Garrod, 2004), and have argued that embodiment (of various kinds) plays a substantial role in accomplishing this (Pickering & Garrod, 2009). This view of dialogical language understanding, as the co-construction of shared representations, is as yet far from commensurable with simulation semantics. Little is known about the role of mental simulation in the process of integrating individual utterance meanings into a broader representation of the ongoing dialogue (despite Zwaan’s, 2009, p. 1145, optimism that simulation theories have the potential to ‘bridge the gap’ between sentence comprehension and discourse comprehension). For instance, we do not know much about how long mental simulations persist over time as the dialogue unfolds, and what memory systems modulate their accessibility during turn-taking. In addition, there is a paucity of research on the way mental simulations relate to aspects of dialogue such as back-channel responses, ellipses, and interactive repair strategies.
In view of the discussion of ‘strong dialogism’ in Section 2.3, moreover, one might argue that dialogical language comprehension is best approached without recourse to representational notions such as ‘situation model’ in the first place. As Linell (2007) argues, when acknowledging that language understanding is part of a continual series of interactions with the environment:
the emphasis shifts from representation to control, interaction and intervention. While we surely need knowledge of and assumptions about the world, the various corresponding ‘representations’ are largely subordinated to interaction and intervention in the world. (Linell, 2007, p. 613, emphasis in the original)
Mental simulations and situation models, accordingly, may have no reality independent of the predictions that they allow the language user to make and the actions that they serve to prepare. This shifts the burden (for the comprehender) from internal cognitive processes to the direct employment of perceptual and actional resources for engaging in (joint) action (Varela, Thompson, & Rosch, 1991; Wilson & Golonka, 2013). Comprehension, accordingly, is not to be seen as “the calculations and representation of a knowledge structure in the mind”, but rather as “the state of the cognitive system at a certain point in time in relation to the world around it” (Robinson, 2000, p. 260).
Such models ostensibly align with the view that mental simulations are instrumental to action preparation and prediction (e.g., Barsalou, 2009; Willems & Hagoort, 2007; Zwaan & Taylor, 2006). However, important theoretical issues have to be resolved for these perspectives to be truly commensurable. Because stronger versions of dialogism (e.g., Varela et al., 1991; Wilson & Golonka, 2013) reject the notion of mental representation tout court, they are in essence at odds with the basic principles of simulation semantics. Whether this tension can be resolved in a constructive fashion remains a topic of dispute, about which some have expressed optimism. Van Elk, Slors, and Bekkering (2010), for instance, propose a procedural rather than representational interpretation of mental simulation theory. Rączaszek-Leonardi (2009), alternatively, proposes a view of linguistic symbols as constraints on interactional dynamics. These views invite reframing the debate on embodiment in terms of whether grounded (embodied) symbols constrain interaction differently from abstract (disembodied) symbols.
3.3. Mental simulation, multimodality and apprehensive flexibility
The potential connections between simulation semantics and gesture research are plentiful, from the point of view of both language production and comprehension (Marghetis & Bergen, in press). Hostetter and Alibali’s (2008) influential Gesture as Simulated Action framework, for instance, hypothesizes that co-speech gestures originate in spatial-motoric simulations performed during speech production. Conversely, it has been argued that, during comprehension, verbal and co-verbal components of expression are integrated into a shared representation of meaning through continual interactions (Kelly et al., 2010; Özyürek, Willems, Kita, & Hagoort, 2007). In view of the apparent involvement of mirror neurons in gesture perception (Bernardis & Gentilucci, 2006; Montgomery, Isenberg, & Haxby, 2007; Skipper, Goldin-Meadow, Nusbaum, & Small, 2007), it has furthermore been hypothesized that the representational resources involved in verbally evoked motor imagery can be ‘merged’ with the motor resonance elicited by gestures: “activation from the gesture can summate with activation from speech and contextual information to substantially reduce uncertainty as to what needs to be simulated” (Glenberg & Gallese, 2012, p. 917).
Interestingly, the cortical networks involved in understanding manual behaviors have been shown to be sensitive to contextual factors such as the cultural background of the speaker (Molnar-Szakacs, Wu, Robles, & Iacoboni, 2007) and the communicative relevance of these behaviors (Skipper et al., 2007). Skipper et al. (2007, p. 274), in discussing these findings, speculate that:
if the behavioral goal involves understanding a sentence when speech-associated gestures can be observed, then areas of the cortex involved in the execution of hand movements and semantic aspects of language comprehension are likely to constitute the mirror system […] the human mirror system dynamically changes according to the observed action, and the relevance of that action, to understanding a given behavior.
This proposed flexibility of resources for comprehension is in line with Cienki’s (2012) hypothesis that the scope of behaviors taken into account by a language user depends on the behaviors’ relevance to the current situation. In other words, people variably employ various audible and/or visible behaviors for communicative aims depending on cognitive and contextual affordances and constraints (these constitute what Cienki terms the producer’s scope of relevant behaviors). The hypothesis continues that, likewise, those attending to speakers apprehend a variable scope of the producer’s audible and/or visible behaviors as relevant for communication, varying sometimes moment by moment. Applied to language comprehension, this yields the prediction that different cues for comprehension (e.g., elements of speech, gestures, and prosody) evoke sensorimotor simulations only to the extent that these contribute to engagement in the ongoing interaction. This can offer an explanation for the finding that the semantic resources deployed in comprehension are highly sensitive to task demands and various forms of context (e.g., Sato, Mengarelli, Riggio, Gallese, & Buccino, 2008; Van Dam, Rueschemeyer, Lindemann, & Bekkering, 2010).
3.4. Towards methodological convergence
A fully ecological reassessment of simulation theories, in all directions proposed simultaneously, may not be a realistic objective. The issues outlined in this paper, nonetheless, can inspire future researchers to move in the direction of theorizing about and studying a more natural form of language, as well as a more interactive model of comprehension. The first may simply involve extensions of current experimental studies. By supplementing stimuli with a co-verbal (gestural or prosodic) dimension, a better understanding can be gained of the relation between sensorimotor processes and the variably multimodal nature of spoken language.
Arriving at a more interactive notion of comprehension is a more substantial challenge. A first step is to take context and task demands more seriously. In order to avoid the pitfall of seeing comprehension as a process that takes place entirely inside an individual’s brain, a research program akin to that outlined by Wilson and Golonka (2013) can be of help. This program dictates a careful analysis of experimental task demands and the various resources that may be relevant for satisfying them. In terms of comprehension research, this entails a closer inquiry into the availability of situated and long-term memory resources during (experimental) comprehension tasks. The taxonomy of different ‘levels of situational embedding’ proposed by Zwaan (2014) can be a starting point for analyzing (or modulating) the availability of such resources.
A third direction is to approach comprehension as an activity, rather than a passive process. This involves more than just augmenting current experimental paradigms. Rather, there is a need for future studies to incorporate interactive settings, where participants engage in a shared activity. Inferences about the nature and role of the semantic resources recruited in such interactions may unavoidably be more indirect than those in more traditional lab-based tasks (e.g., to be derived from eye-tracking or gesture analysis), but can be an important diagnostic of the ecological value of previously obtained results. Modeling the way (dis)embodied concepts constrain interactional dynamics, e.g., from a dynamical systems point of view (cf. Dale, Fusaroli, Duran, & Richardson, 2014; Rączaszek-Leonardi, 2009), can furthermore initiate a better understanding of the connection between representation-based and dynamical accounts of comprehension, as well as the role that sensorimotor grounding plays in this respect.
4. Conclusion
Clark (1997, p. 594) asserts that:
[l]anguage understanding is so complex that we have had to cut it into model-sized pieces to study it. But in cutting it up we have also made a number of idealizations, and many of these have become dogmas – premises we take as gospel.
The issues discussed in this paper are natural consequences of this development: whereas taking language comprehension into the laboratory has given rise to many insights into the cognitive nature of semantics, it has at the same time removed language from its natural form and environment. As a consequence, current theories of language comprehension are based on impoverished notions of ‘language’ and ‘comprehension’. In this paper, we have argued that the external validity of experimentally based accounts of language comprehension is more questionable than generally supposed: it is by no means evident whether and how experimentally obtained results on the involvement of mental simulation in comprehension extend to real-life situations, where communication is multimodal to varying degrees and embedded in an interactional setting. Potential connections between simulation semantics and social-pragmatic aspects of comprehension have been discussed, but need much more examination. New types of experimental stimuli and more interactive research paradigms are needed in order to better understand the role of mental simulation in everyday face-to-face language comprehension. Most importantly, language needs to be studied as a variably multimodal phenomenon and the research needs to reflect its primary status as a vehicle for communication in a dynamic environment.