
Are we predictive engines? Perils, prospects, and the puzzle of the porous perceiver

Published online by Cambridge University Press:  10 May 2013

Andy Clark*
Affiliation:
School of Philosophy, Psychology, and Language Sciences, University of Edinburgh, Edinburgh EH12 5AY, Scotland, United Kingdom. andy.clark@ed.ac.uk; http://www.philosophy.ed.ac.uk/people/full-academic/andy-clark.html

Abstract

The target article sketched and explored a mechanism (action-oriented predictive processing) most plausibly associated with core forms of cortical processing. In assessing the attractions and pitfalls of the proposal we should keep that element distinct from larger, though interlocking, issues concerning the nature of adaptive organization in general.

Authors' Response

Copyright © Cambridge University Press 2013

R1. Introduction: Combining challenge and delight

The target article (“Whatever next? Predictive brains, situated agents, and the future of cognitive science”–henceforth WN for short) drew a large and varied set of responses from commentators. This has been a source of both challenge and delight. Challenge, because the variety and depth of the commentaries really demand (at least) a book-length reply, not to mention far more expertise than I possess. Delight, because the wonderfully constructive and expansive nature of those responses already paints a far richer picture of both the perils and the prospects of the emerging approach to cortical computation that I dubbed “action-oriented predictive processing” (henceforth PP for short). In what follows I respond, at least in outline, to three main types of challenge (the “perils” referred to in the title) that the commentaries have raised. I then offer some remarks on the many exciting suggestions concerning complementary perspectives and further applications (the prospects). I end by addressing a kind of conceptual puzzle (I call it “the puzzle of the porous perceiver”) that surfaced in different ways and that helps focus some fundamental questions concerning the nature (and plausibility) of the implied relation between thought, agent, and world.

R2. Perils of prediction

The key perils highlighted by the commentaries concern (1) the proper “pitch” of the target proposal (is it about implementation, algorithm, or something more abstract?); (2) the relation between PP and various other strategies and mechanisms plausibly implicated in human cognitive success; and (3) the nature and adequacy of the treatment of attention as a mechanism for “precision-weighting” prediction error.

R2.1. Questioning pitch

Rasmussen & Eliasmith raise some important worries concerning content and pitch. They agree with the target article on the importance and potency of action-oriented predictive processing (PP), and describe the ideas as compelling, compatible with the empirical data, and potentially unifying as well. But the compatibility, they fear, comes at a price. For, the architectural commitments of PP as I defined it are, they argue, too skimpy as yet to deliver a testable model unifying perception, action, and cognition. I agree. Indeed (as they themselves note) much of the target article argues that PP does not serve to specify the detailed form of a cognitive architecture. I cannot agree with them, however, that the commitments PP does make therefore run the risk of being “empirically vacuous.” Those commitments include the top-down use of a hierarchical probabilistic generative model for both perception and action, the presence of functionally distinct neural populations coding for representation (prediction) and for prediction-error, and the suggestion that predictions flow backwards through the neural hierarchy while only information concerning prediction error flows forwards. The first of these (the widespread, top-down use of probabilistic generative models for perception and action) constitutes a very substantial, but admittedly quite abstract, proposal: namely, that perception and (by a clever variant–see WN, sect. 1.5) action both depend upon a form of “analysis by synthesis” in which observed sensory data is explained by finding the set of hidden causes that are the best candidates for having generated that sensory data in the first place.

Mechanistically, PP depicts the top-down use of (hierarchical) probabilistic generative models as the fundamental form of cortical processing, accommodating central cases of both perception and action, and makes a further suggestion concerning the way this is achieved. That suggestion brings on board the data compression strategy known as “predictive coding” (WN, sect. 1.1) from which it inherits–or so I argued, but see below–a distinctive image of the flow of information: one in which predictions (from the generative model) flow downwards (between regions of the neural hierarchy) and only deviations from what is predicted (in the form of residual errors) flow forwards between such regions. The general form of this proposal (as Bridgeman properly stresses) is not new. It has a long history in mainstream work in neuroscience and psychology that depicts cortex as coding not for properties of the stimulus but for the differences (hence the “news”) between the incoming signal and the expected signal.
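To make those commitments concrete, here is a minimal toy sketch (the linear generative mapping, the array sizes, and the step size are illustrative assumptions, not part of the proposal): a higher level sends down a prediction from its generative model, only the residual error travels back up, and the higher-level estimate of the hidden causes is nudged so as to reduce that error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear generative model: hidden causes -> predicted sensory signal.
W = rng.normal(size=(8, 3))                      # top-down (generative) weights, assumed known here
true_causes = np.array([1.0, -0.5, 2.0])
sensory_input = W @ true_causes + 0.05 * rng.normal(size=8)

estimate = np.zeros(3)                           # higher-level hypothesis about the hidden causes
step_size = 0.02

for _ in range(500):
    prediction = W @ estimate                    # prediction flows downwards
    error = sensory_input - prediction           # only the residual error flows upwards
    estimate += step_size * (W.T @ error)        # the error nudges the higher-level hypothesis

print(np.round(estimate, 2))                     # close to true_causes
```

Even this toy version exhibits the two functionally distinct populations gestured at above: units carrying the prediction, and units carrying the residual error that is all the forward flow contains.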

PP goes further, however, by positing a specific processing regime that seems to require functionally distinct encodings for prediction and prediction error. Spratling notes, helpfully, that the two key elements of this complex (the use of a hierarchical probabilistic generative model, and the predictive coding data compression device) constitute what he describes as an “intermediate-level model”: one that still leaves unspecified a great many important details concerning implementation. Unlike Rasmussen & Eliasmith, however, Spratling notes that: “Such intermediate-level models can identify common computational principles that operate across different structures of the nervous system … and they provide functional explanations of the empirical data that are arguably the most relevant to neuroscience” (emphasis Spratling's). WN aimed to present just such an intermediate-level model. In so doing, it necessarily fell short of providing a detailed architectural specification of the kind Rasmussen & Eliasmith seek. It does, however, aim to pick out a space of models that share some deep assumptions: assumptions that already have (or so I argued–see WN, sect. 2) many distinctive conceptual and empirical consequences.

Spratling then worries (in a kind of inversion of the doubts raised by Rasmussen & Eliasmith) that in one respect, at least, the presentation in WN is rather too specific, too close to one possible (but not compulsory) implementation. The issue here concerns the depiction of error as flowing forwards (i.e., between regions in the hierarchy) and predictions as flowing backwards. WN depicts this as a direct consequence of the predictive coding compression technique. But it is better seen, Spratling convincingly argues, as a feature of one (albeit, as he himself accepts, the standard) implementation of predictive coding. Spratling is right to insist upon the distinction between theory and implementation. It is only by considering the space of alternative implementations that we can start to ask truly pointed experimental questions (of the kind highlighted by Rasmussen & Eliasmith) of the brain: questions that may one day favour one implementation of the key principles, or even none at all. One problem, I suspect, will be that resolving the “what actually flows forward?” issue looks crucial to adjudicating between various close alternatives. But that depends (as Spratling's work shows) upon how we carve the system into levels in the first place, since that determines what counts as flow within a level versus flow between levels. This is not going to be as easy as it sounds, since it is not gross cortical layers but something much more functional (cortical columns, something else?) that is at issue. Experimenters and theorists will thus need to work together to build detailed, testable models whose assumptions (especially concerning what counts as a region or level) are agreed in advance.

Egner & Summerfield describe a number of empirical studies that support the existence both of (visual) surprise signals and of the hierarchical interplay between expectation and surprise. Some of this evidence (e.g., the work by Egner et al. 2010 and by Murray et al. 2002) is discussed in the text, but new evidence (see, e.g., Wyart et al. 2011) continues to emerge. In their commentary Egner & Summerfield stress, however, that complex questions remain concerning the origins of such surprise. Is it locally computed or due to predictions issuing from elsewhere in the brain? My own guess is that both kinds of computation occur, and that complex routing strategies (see Phillips et al. 2010 and essays in von der Malsburg et al. 2010) determine, on a moment-to-moment basis, the bodies of knowledge and evidence relative to which salient (i.e., precise, highly weighted) prediction error is calculated. It is even possible that these routing effects are themselves driven by prediction errors of various kinds, perhaps in the manner sketched by den Ouden et al. (2010). Egner & Summerfield go on to note (see WN, sect. 3.1) the continued absence of firm cellular-level evidence for the existence of functionally distinct neural encodings for expectation and surprise. More positively, they highlight some recent studies (Eliades & Wang 2008; Keller et al. 2012; Meyer & Olson 2011) that offer tantalizing hints of such evidence. Especially striking here is the work by Keller et al. (2012) offering early evidence for the existence of prediction-error neurons in supra-granular layers 2/3, which fits nicely with the classic proposals (implicating superficial pyramidal cells) by Friston (2005), Mumford (1992), and others. Such work represents some early steps along the long and complex journey that cognitive science must undertake if it is to deliver evidence of the kind demanded by Rasmussen & Eliasmith.

Muckli, Petro, & Smith (Muckli et al.) continue this positive trend, describing a range of intriguing and important experimental results that address PP at both abstract and more concrete levels of description. At the abstract level, they present ongoing experiments that aim to isolate the contributions of cortical feedback (downward-flowing prediction) from other processing effects. Such experiments lend considerable support to the most basic tenets of the PP model. Moving on to the more concrete level they suggest, however, that the standard implementation of predictive coding may not do justice to the full swathe of emerging empirical data, some of which (Kok et al. 2012) shows sharpening of some elements of the neuronal signal as well as the kind of dampening mandated by successful “explaining away” of sensory data. However, as mentioned in WN sect. 2.1 (see also comments on Bowman, Filetti, Wyble, & Olivers [Bowman et al.] below), this combination is actually fully compatible with the standard model (see, e.g., remarks in Friston 2005), since explaining away releases intra-level inhibition, resulting in the correlative sharpening of some parts of the activation profile. I agree, however, that more needs to be done to disambiguate and test various nearby empirical possibilities, including the important questions about spatial precision mentioned later in Muckli et al.'s interesting commentary. Such experiments would go some way towards addressing the related issues raised by Silverstein, who worries that PP (by suppressing well-predicted signal elements) might not gracefully accommodate cases in which correlations between stimulus elements are crucial (e.g., when coding for objects) and need to be highlighted by increasing (rather than suppressing) activity. It is worth noting, however, that such correlations form the very heart of the generative models that are used to predict the incoming sensory patterns. This fact, combined with the co-emergence of both sharpening and dampening, makes the PP class of models well-suited to capturing the full gamut of observed effects.

I turn now to the relation between key elements of PP and the depiction of the brain as performing Bayesian inference. Trappenberg & Hollensen note that the space of Bayesian models is large, and they distinguish between demonstrations of Bayesian response profiles in limited experimental tasks and the much grander claim that such specifics flow from something much more general and fundamental. The latter position is most strongly associated with the work of Karl Friston, and is further defended in his revealing commentary. PP is, however, deliberately pitched between these two extremes. It is committed to a general cortical processing strategy that minimizes surprisal using sensorimotor loops that sample the environment while deploying multilevel generative models to predict the ongoing flow of sensation.

Friston's focus is on a presumed biological imperative to reduce surprisal: an imperative obeyed by reducing the organism-computable quantity free energy. Both predictive coding and the Bayesian brain are, Friston argues, results of this surprise minimization mandate. The kinds of processing regime PP describes are thus, Friston claims, the results of surprisal minimization rather than its cause. Friston may be right to stress that, assuming the free energy story as he describes it is correct, predictive coding and the Bayesian brain emerge as direct consequences of that story. But I do not think the target article displays confusion on this matter. Instead, the issue turns on where we want to place our immediate bets, and perhaps on the Aristotelian distinction between proximate and ultimate causation. Thus, the proximal cause (the mechanism) of large amounts of surprisal reduction may well be the operation of a cortical predictive processing regime, even if the ultimate cause (the explanation of the presence of that very mechanism) is a larger biological imperative for surprisal minimization itself. This seems no stranger than saying that the reproductive advantages of distal sensing (an ultimate cause) explain the presence of various specific mechanisms (proximal causes) for distal sensing, such as vision and audition. WN, however, deliberately took no firm position on the full free energy story itself.
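For readers who want the surprisal/free-energy relation made explicit, the standard decomposition (textbook material rather than anything specific to WN; here s is the sensory data, x the hidden causes, m the model, and q(x) an arbitrary recognition density) runs as follows:

```latex
% Free energy F is an upper bound on surprisal, since the KL divergence is non-negative.
\begin{aligned}
  \text{surprisal} &= -\ln p(s \mid m),\\
  F = \mathbb{E}_{q(x)}\!\bigl[\ln q(x) - \ln p(s, x \mid m)\bigr]
    &= -\ln p(s \mid m) + D_{\mathrm{KL}}\!\bigl[q(x)\,\|\,p(x \mid s, m)\bigr]
    \;\ge\; -\ln p(s \mid m).
\end{aligned}
```

Minimizing F (the organism-computable quantity) therefore reduces an upper bound on surprisal, which is one way to cash out the proximate/ultimate distinction drawn above: the mechanism manipulates F, while the imperative concerns surprisal itself.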

Friston also notes, importantly, that other ideas that fit within this general framework include ideas about efficient coding. This is correct, and I regard it as a shortfall of my treatment that space precluded discussion of this issue. For, as Trappenberg & Hollensen nicely point out, dimensionality reduction using generative models will only yield neurally plausible encodings (filters that resemble actual receptive fields in the brain) if there is pressure to minimize both prediction error and the complexity of the encoding itself. The upshot of this is pressure towards various forms of “sparse coding” running alongside the need to reduce prediction error at multiple spatial and temporal scales, and in some acceptably generalizable fashion. Trappenberg & Hollensen suggest that we still lack any concrete model capable of learning to form such sparse representations “due to the world's sparseness” rather than due to the pre-installation of some form of pressure (e.g., an innate hyperprior) towards sparse encodings. But this may be asking too much, given the quite general utility of complexity reduction. Reflecting on the sheer metabolic costs of creating and maintaining internal representations, such a bias seems like a very acceptable ingredient of any “minimal nativism” (Clark 1993).
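To make the shape of that dual pressure concrete, here is a minimal sketch in the spirit of an Olshausen-and-Field-style sparse coding objective (the dictionary, data, and penalty strength are illustrative assumptions, not taken from any of the cited work): the quantity being minimized adds a cost on the activity of the encoding itself to the usual prediction (reconstruction) error.

```python
import numpy as np

def objective(code, data, dictionary, sparsity=0.5):
    """Prediction error plus a complexity (sparsity) cost on the encoding."""
    prediction_error = np.sum((data - dictionary @ code) ** 2)
    complexity_cost = sparsity * np.sum(np.abs(code))   # L1 penalty favours sparse codes
    return prediction_error + complexity_cost

rng = np.random.default_rng(1)
dictionary = rng.normal(size=(16, 32))        # overcomplete set of candidate features
data = dictionary[:, 3] + dictionary[:, 17]   # input genuinely generated by two features

dense_code = np.linalg.pinv(dictionary) @ data      # fits well but spreads activity widely
sparse_code = np.zeros(32)
sparse_code[[3, 17]] = 1.0                          # fits equally well using two features

print(objective(dense_code, data, dictionary))      # penalised for using many features
print(objective(sparse_code, data, dictionary))     # same fit, lower complexity cost
```

The comparison at the end illustrates the point in the text: two codes can explain the input equally well, yet the sparser one wins once the complexity of the encoding is itself costed in.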

R2.2. Other mechanisms

I move now to a second set of perils, or challenges. These challenges concern the relation between PP and various other strategies and mechanisms plausibly implicated in human cognitive success. Ross draws our attention to a large and important body of work on “neuroeconomic models of sub-cognitive reward valuation.” Such models (e.g., Lee & Wang 2009; Glimcher 2010) posit pre-computed reward valuations (outputs from specialized subsystems performing “striatal valuation”) as the inputs to more flexible forms of cortical processing. But despite my intended emphasis on cortical processing, nothing in the PP story was meant to stand in the way of such modes of influence. To be sure, Friston's own (“desert landscape”–see WN, sects. 1.5 and 5.1) attempt to replace reward and value signals with multilevel expectations may at first sight seem inimical to such approaches. But Friston's account ends up accommodating such modes of influence (see, e.g., Friston 2011b), albeit with an importantly different functional and terminological spin. Here (see WN, sect. 3.2, and the commentary by Friston), it is important to recognize that predictions and expectations, in Friston's large-scale free energy treatments, are determined by the shape and nature of the whole agent (morphology, reflexes, and subcortical organization included) and are not merely the products of probabilistic models commanded by sensory and motor cortex. Insights concerning the importance of the mid-brain circuitry are compatible both with PP and with the full “desert landscape” version of Friston's own account. This means, incidentally, that the kind of non-cortical route to a (partial) resolution of the Darkened Room problem suggested by Ross (and hinted at also by Shea) is in fact equally available to Friston. It is also consistent with (though it is not implied by) the more restricted perspective offered by PP, understood as an account of cortical processing.

Ross's concern that PP may be losing sight of the crucial role played by non-cortical (e.g., environmental, morphological, and subcortical) organization is amplified by Anderson & Chemero, who fear that PP puts us on a slippery slope back to full-blown epistemic internalism of the kind I am supposed to have roundly and convincingly (Clark 1997; 2008) rejected. That slope is greased, Anderson & Chemero suggest, by the conflation of two very different senses of prediction. In the first sense, prediction amounts to nothing more than correlation (as in “height predicts weight”), so we might find “predictive processing” wherever we find processing that extracts and exploits correlations. This sense Anderson & Chemero regard as innocent because (involving merely “simple relationships between numbers”) it can be deployed without reliance upon inner models, in what they call a model-free or even “knowledge-free” (I would prefer to say “knowledge-sparse”) fashion so as to make the most of, for example, reliable cross-modal relationships among sensed information. The second sense is more loaded and “allied with abductive inference and hypothesis testing.” It involves the generation of predictions using internal models that posit hidden variables tracking complex causal structure in the body and world. Prediction thus construed is, let us agree, knowledge-rich. Evidence for the utility and ubiquity of prediction in the knowledge-free (or knowledge-sparse) sense provides, just as Anderson & Chemero insist, no evidence for the ubiquity and operation (nor even for the biological possibility) of predictive processing in the second (knowledge-rich) sense.

This is undeniably true and important. But nowhere in the target article did I make or mean to imply such a claim. In displaying the origins of this kind of use of generative models in cognitive science (e.g., Dayan et al.'s [1995] work on the Helmholtz machine, leading to all the work on “deep learning”–see Bengio 2009) I was careful to highlight their role in dealing with cases where successful learning required deriving new representations tracking hidden variables. As the story progressed, the role of complex multilevel models learnt and deployed using bidirectional hierarchies (as most clearly implemented by the cortex) was constantly center stage. The larger free energy story, to be sure, covers both the knowledge-rich and knowledge-sparse cases. From the free energy minimization perspective we might even choose to consider (as does Friston) the whole embodied, embedded agent as “the model” relative to which surprise is (long-term) minimized. But that story, in turn, does not conflate the two senses of prediction either, since it fluidly covers both. Anderson & Chemero suggest that I somehow rely on the (very speculative) model of binocular rivalry to make an illegitimate move from a knowledge-free to a knowledge-rich understanding of prediction. Here, the exposition in WN must be at fault. It may be that they think the account of rivalry plays this role because I preceded it with some remarks on dynamic predictive coding by the retina. But the retinal case, which may indeed be understood as essentially knowledge-sparse and internal-model-free prediction, was meant to illustrate only the predictive coding data compression technique, and not the full PP apparatus. Nor did I intend anything much to turn on the binocular rivalry story itself, which was meant merely as a helpful illustration of how the hypothesis-testing brain might deploy a multi-layered model. It is clear that much more needs to be done to defend and flesh out that account of binocular rivalry (as also pointed out by Sloman).

Anderson & Chemero believe that an account might be given that delivers the rivalry response by appealing solely to “low-level, knowledge-free, redundancy-reducing interactions between the eyes.” This might turn out to be true, thus revealing the case as closer to that of the retinal ganglion cells than to any case involving hierarchical predictive processing as I defined it. There are, however, very many cases that simply cry out for an inner model–invoking approach. Thus, consider the case of handwritten digit recognition. This is a benchmark task in machine learning, and one that Hinton and Nair (2006) convincingly treat using a complex acquired generative model that performs recognition using acquired knowledge about production. The solution is knowledge-rich because the domain itself is highly structured, exhibiting (like the external world in general) many stacked and nested regularities that are best tracked by learning that unearths multiple interacting hidden variables. I do not think that such cases can be dealt with (at least in any remotely neurally plausible fashion) using resources that remain knowledge-free in the sense that Anderson & Chemero suggest. What seems true (Clark 1989; 1997; 2008) is that to whatever extent a system can avoid the effort and expense of learning about such hidden causes, and rely instead on surface statistics and clever tricks, it will most likely do so. Much of the structure we impose (this relates also to the comments by Sloman) upon the designed world is, I suspect, a device for thus reducing elements of the problems we confront to simpler forms (Clark & Thornton 1997). Thus, I fully agree that not all human cognition depends upon the deployment of what Anderson & Chemero call “high-level, knowledge-rich predictive coding.”
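As a purely illustrative sketch of what “recognition using acquired knowledge about production” amounts to (this is emphatically not Hinton and Nair's actual model; the templates, array sizes, and scoring rule below are stand-in assumptions), one can score each candidate class by how well a class-specific generative model explains the observed image:

```python
import numpy as np

def recognize(image, generative_models):
    """Recognition by synthesis: pick the class whose generative model
    best explains (reconstructs) the observed image."""
    scores = {}
    for label, generate in generative_models.items():
        reconstruction = generate(image)                        # class-conditional "best attempt"
        scores[label] = -np.sum((image - reconstruction) ** 2)  # higher score = better explained
    return max(scores, key=scores.get), scores

# Toy usage: each "generative model" here is just a fixed template returned regardless
# of the input; in the real case it would be a learned model with hidden variables.
templates = {
    "1": np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float),
    "7": np.array([1, 1, 1, 0, 0, 1, 0, 0, 1], dtype=float),
}
generative_models = {label: (lambda img, t=t: t) for label, t in templates.items()}
noisy_one = templates["1"] + 0.2 * np.random.default_rng(2).normal(size=9)
print(recognize(noisy_one, generative_models)[0])   # "1"
```

In Hinton and Nair's treatment the generative knowledge concerns how digits are actually produced, and the hidden variables it posits are just the sort of nested structure that knowledge-free surface statistics struggle to capture.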

What kind of overall cognitive organization, it might be asked, does the embodied, embedded agent then display? Is that organization multiply and impenetrably fractured and firewalled, comprising a motley within which a few principled, knowledge-rich responses bed down with unwashed legions of just-good-enough ploys and stratagems? Surely such a motley is incompatible with the hope for any kind of unifying treatment? This issue (let's call it the Motley Challenge) is among the deepest unresolved questions in cognitive science. Buckingham & Goodale join Ross and Anderson & Chemero, and (as I discuss later) Sloman and Froese & Ikegami, in pressing the case for the cognitive motley. Following a crisp description of the many successes of Bayesian (i.e., optimal cue integration, given prior probabilities) models in the field of motor control and psychophysics, Buckingham & Goodale turn to some problem cases–cases where Bayesian style optimal integration seems to fail–using these to argue for a fractured and firewalled cognitive economy displaying “independent sets of priors for motor control and perceptual/cognitive judgments, which ultimately serve quite different functions.” Poster-child for this dislocation is the size-weight illusion in which similar-looking objects appear weight-adjusted so that we judge the smaller one to feel heavier than the larger despite their identical objective weights (a pound of lead feels heavier, indeed, than a pound of feathers). Buckingham & Goodale survey some intriguing recent work on the size-weight illusion, noting that although Bayesian treatments do manage to get a grip on lifting behavior itself, they fail to explain the subjective comparison effect which some describe as “anti-Bayesian” since prior expectancies and sensory information there seem contrasted rather than integrated (Brayanov & Smith 2010).
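For readers unfamiliar with the “optimal cue integration” that Buckingham & Goodale invoke, here is the standard reliability-weighted combination in toy form (the numbers are illustrative, not drawn from any experiment). The size-weight illusion counts as “anti-Bayesian” precisely because the perceptual judgment moves away from the prior expectation, whereas this scheme can only ever pull the estimate towards it:

```python
# Reliability-weighted cue integration: the combined estimate is a
# precision-weighted average of prior expectation and sensory evidence.
# All numbers below are illustrative assumptions.

def integrate(prior_mean, prior_precision, sensory_mean, sensory_precision):
    total_precision = prior_precision + sensory_precision
    mean = (prior_precision * prior_mean + sensory_precision * sensory_mean) / total_precision
    return mean, total_precision   # posterior mean and posterior precision

# Expected heaviness from size cues vs. felt heaviness from lifting:
expected, felt = 8.0, 5.0
mean, _ = integrate(prior_mean=expected, prior_precision=1.0,
                    sensory_mean=felt, sensory_precision=4.0)
print(mean)   # 5.6: pulled towards the expectation, never pushed past the evidence
# The illusion, by contrast, has the judged heaviness move away from the expectation
# (the small object feels heavier than expected), which simple integration cannot produce.
```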

Is this a case of multiple, independently operating priors governing various forms of response under various conditions? Perhaps. The first point I would make in response is that nothing either in PP or in the full free-energy formulation rules this out. For the underlying architecture, by dint of evolution, lifetime learning, or both, may come to include “soft modules” partially insulating some response systems from others. To the extent that this is so, that may be traceable, as Friston suggests, to the relative statistical independence of various key tracked variables. Information about what an object is, for example, tells us little about where it is, and vice versa, a fact that might explain the emergence of distinct (though not fully mutually insulated–see Schenk & McIntosh 2010) “what” and “where” pathways in the visual brain. Returning to the size-weight illusion itself, Zhu and Bingham (2011) show that the perception of relative heaviness marches delicately in step with the affordance of maximum-distance throwability. Perhaps, then, what we have simply labeled as the experience of “heaviness” is, in some deeper ecological sense, the experience of optimal weight-for-size to afford long-distance throwability? If that were true, then the experiences that Buckingham & Goodale describe re-emerge as optimal percepts for throwability, albeit ones that we routinely misconceive as simple but erroneous perceptions of relative object weight. The Zhu and Bingham account is intriguing but remains quite speculative. It reminds us, however, that what looks from one perspective to be a multiple, fragmented, and disconnected cognitive economy may, on deeper examination, turn out to be a well-integrated (though by no means homogeneous) mechanism responding to organism-relevant statistical structure in the environment.

Gerrans continues the theme of fragmentation, resisting the claim that prediction error minimization proceeds seamlessly throughout the cortical hierarchy. His test cases are delusions of alien control. I agree with Gerrans that nothing in the simple story about prediction error minimization explains why it seems that someone else is in control, rather than simply (as in the other cases he mentions) that the action is not under our own control. It is not clear to me, however, why that shortfall should be thought to cast doubt on the more general (“seamlessness”) claim that perception phases gently into cognition, and that the differences concern scale and content rather than underlying mechanism.

Silverstein raises some important challenges both to the suggestion that PP provides an adequately general account of the emergence of delusions and hallucinations in schizophrenia, and (especially) to any attempt to extend that account to cover other cases (such as Charles Bonnet syndrome) in which hallucinations regularly emerge without delusions. Importantly, however, I did not mean to suggest that the integrated perceptuo-doxastic account that helps explain the co-emergence of the two positive symptoms in schizophrenia will apply across the board. What might very reasonably be expected, however, is that other syndromes and patterns (as highlighted by Gerrans) should be explicable using the same broad apparatus, that is, as a result of different forms of compromise to the very same kind of prediction-error–sensitive cognitive economy. In Charles Bonnet syndrome (CBS), gross damage to the visual system input stream (e.g., by lesions to the pathway connecting the eye to the visual cortex, or by macular degeneration) leads to complex hallucinations without delusion. But this pattern begins to make sense if we reflect that the gross damage yields what are effectively localized random inputs that are then subjected to the full apparatus of learnt top-down expectation (see Stephan et al. 2009, p. 515). Recent computational work by Reichert et al. (2010) displays a fully implemented model in which hallucinations emerge in just this broad fashion, reflecting the operation of a hierarchical generative (predictive) model of sensory inputs in which inputs are compared with expectations and mismatches drive further processing. The detailed architecture used by Reichert et al. was, however, a so-called Deep Boltzmann Machine architecture (Salakhutdinov & Hinton 2009), a key component of which was a form of homeostatic regulation in which processing elements learn a preferred activation level to which they tend, unless further constrained, to return.
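The broad logic (though emphatically not the Reichert et al. architecture, and ignoring the homeostatic regulation just mentioned) can be put in a toy Gaussian form: when the reliability assigned to the input channel collapses, inference is dominated by the learnt top-down expectation, so structured "percepts" emerge from what is in fact noise. Everything below (the pattern, the precision values) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

# Learnt expectation about a familiar visual pattern (the "prior"), plus the
# precision (inverse variance) the system assigns to its input channel.
prior_mean = np.array([0.0, 1.0, 1.0, 0.0, 1.0])   # a familiar, structured pattern
prior_precision = 2.0

def infer(sensory_input, sensory_precision):
    """Posterior mean under a simple Gaussian scheme: a precision-weighted
    blend of top-down expectation and bottom-up input."""
    total = prior_precision + sensory_precision
    return (prior_precision * prior_mean + sensory_precision * sensory_input) / total

noise_input = rng.normal(size=5)                    # damaged channel: random input

print(np.round(infer(noise_input, sensory_precision=8.0), 2))  # input-dominated percept
print(np.round(infer(noise_input, sensory_precision=0.2), 2))  # expectation-dominated:
# the structured prior pattern shows through even though the input is pure noise
```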

Phillips draws attention to the important question of how a PP-style system selects the right sub-sets of information upon which to base some current response. Information that is critical for one task may be uninformative or counter-productive for another. Appeals to predictive coding or Bayesian inference alone, he argues, cannot provide this. One way in which we might cast this issue, I suggest, is by considering how to select what, at any given moment, to try to predict. Thus, suppose we have an incoming sensory signal and an associated set of prediction errors. For almost any given purpose, it will be best not to bother about some elements of the sensory signal (in effect, to treat prediction failures there as noise rather than signal). Other aspects, however, ones crucial to the task at hand, will have to be got exactly right (think of trying to spot the four-leaf clover among all the others in a field). To do this, the system must treat even small prediction errors, in respect of such crucial features, as signal and use them to select and nuance the winning top-down model. Within the PP framework, the primary tool for this is, Phillips notes, the use of context-sensitive gain control. This amplifies the effects of specific prediction error signals while allowing other prediction errors to self-cancel (e.g., by having that error unit self-inhibit). The same mechanism allows estimates of the relative reliability of different aspects of the sensory signal to be factored in, and it may underpin the recruitment of problem-specific temporary ensembles of neural resources, effectively gating information flow between areas of the brain (see den Ouden et al. [2009] and essays in von der Malsburg et al. [2010]). On-the-hoof information selection and information coordination of these kinds are, Phillips then argues, a primary achievement of the neurocomputational theory known as “Coherent Infomax” (Kay & Phillips 2010; Phillips et al. 1995). Both Coherent Infomax and PP emphasize the role of prediction in learning and response, and it remains to be determined whether Coherent Infomax is best seen as an alternative or (more likely) a complement to the PP model, amounting perhaps to a more detailed description of a cortical microcircuit able to act as a repeated component in the construction of a PP architecture.
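A minimal sketch of the gain-control idea (the channels, error values, and precision weights are illustrative assumptions, not taken from any of the cited models): the same raw prediction errors drive very different responses depending on which channels the current task treats as signal.

```python
import numpy as np

# Raw prediction errors on four feature channels (illustrative values):
errors = np.array([0.9, 0.1, 0.8, 0.05])

# Context-sensitive gain: task A treats channels 1 and 3 as signal and the rest
# as noise; task B does the opposite. Precision weights implement that selection.
task_A_precision = np.array([0.1, 4.0, 0.1, 4.0])
task_B_precision = np.array([4.0, 0.1, 4.0, 0.1])

def weighted_error(errors, precision):
    # Errors the task deems unreliable are effectively self-cancelled (downweighted);
    # errors on task-critical channels are amplified, however small they are.
    return precision * errors

print(weighted_error(errors, task_A_precision))  # small errors on channels 1 and 3 now dominate
print(weighted_error(errors, task_B_precision))  # large errors on channels 0 and 2 dominate
```

Note how a numerically tiny error on a task-critical channel (the four-leaf-clover case) can outweigh much larger errors that the current precision assignment treats as noise.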

R2.3. Attention and precision

This brings us to our third set of perils: perils relating to the treatment of attention as a device for upping the gain on (hence the estimated “precision” of) selected prediction errors. Bowman et al. raise several important issues concerning the scope and adequacy of this proposal. Some ERP (event-related potential) components (such as P1 and N1), Bowman et al. note, are increased when a target appears repeatedly in the same location. Moreover, there are visual search experiments in which visual distractors, despite their rarity, yield little evoked response, yet pre-described, frequently appearing targets deliver large ones. Can such effects be explained directly by the attention-modulated precision weighting of residual error? A recent fMRI study by Kok et al. (2012) lends elegant support to the PP model of such effects by showing that these are just the kinds of interaction between prediction and attention that the model of precision-weighted prediction error suggests. In particular, Kok et al. show that predicted stimuli that are unattended and task-irrelevant result in reduced activity in early visual cortex (the “silencing” of the predicted, as mandated by simple predictive coding) but that “this pattern reversed when the stimuli were attended and task-relevant” (Kok et al. 2012, p. 2198). The study manipulated spatial attention and prediction by using independent prediction and spatial cues (for the details, see the original paper by Kok et al.) and found that attention reversed the silencing effect of prediction upon the sensory signal, in just the way the precision-weighting account would specify. In addition, Feldman and Friston (2010) present a detailed, simulation-based model in which precision-modulated prediction error is used to optimize perceptual inference in a way that reproduces the ERP and psychophysical responses elicited by the Posner spatial cueing paradigm (see Posner 1980).

Bowman et al. go on to press an important further question concerning feature-based attention. For, feature-based attention seems to allow us to enhance response to a given feature even when it appears at an unpredicted location. In their example, the command to find an instance of bold type may result in attention being captured by a nearby spatial location. If the result of that is to increase the precision-weighting upon prediction error from that spatial location (as PP suggests), that seems to depict the precision weighting as a consequence of attending rather than a cause or implementation of attending. The resolution of this puzzle lies, I suggest, in the potential assignment of precision-weighting at many different levels of the processing hierarchy. Feature-based attention corresponds, intuitively, to increasing the gain on the prediction error units associated with the identity or configuration of a stimulus (e.g., increasing the gain on units responding to the distinctive geometric pattern of a four-leaf clover). Boosting that response (by giving added weight to the relevant kind of sensory prediction error) should enhance detection of that featural cue. Once the cue is provisionally detected, the subject can fixate the right spatial region, now under conditions of “four-leaf-clover-there” expectation. Residual error is then amplified for that feature at that location, and high confidence in the presence of the four-leaf clover can (if you are lucky!) be obtained. Note that attending to the wrong spatial region (e.g., due to incongruent spatial cueing) will actually be counter-productive in such cases. Precision-weighted prediction error, as I understand it, is thus able to encompass both merely spatial and feature-based signal enhancement.

Block & Siegel claim that predictive processing (they speak simply of predictive coding, but they mean to target the full hierarchical, precision-modulated, generative-model based account) is unable to offer any plausible or distinctive account of very basic results such as the attentional enhancement of perceived contrast (Carrasco et al. 2004). In particular, they claim that the PP model fails to capture changes due to attending that precede the calculation of error, and that it falsely predicts a magnification of the changes that follow from attending (consequent upon upping the gain on some of the prediction error). However, I find Block & Siegel's attempted reconstruction of the PP treatment of such cases unclear or else importantly incomplete. In the cases they cite, subjects fixate a central spot with contrast gratings to the left and right. The gratings differ in absolute (actual) contrast. But when subjects are cued to attend (even covertly) to the lower contrast grating, their perception of the contrast there is increased, yielding the (false) judgment that, for example, an attended 70% (actual value) contrast grating is the same as an unattended 82% grating. Block & Siegel suggest that the PP account cannot explain the initial effect here (the false perception of an 82% contrast for the covertly attended 70% contrast grating) as the only error signal–but this is where they misconstrue the story–is the difference between the stable pre-attentive 70% registration and the post-attentive 82% one. But this difference, they worry, wasn't available until after attention had done its work! Worse still, once that difference is available, shouldn't it be amplified once more, as the PP account says that gain on the relevant error units is now increased?

This is an ingenious challenge, but it is based on a misconstrual of the precision-weighting proposal. It is not the case that PP posits an error signal calculated on the basis of a difference between the unattended contrast (registered as 70%) and the subsequently attended contrast (now apparently of 82%). Rather, what attention alters is the expectation of precise sensory information from the attended spatial location. Precision is the inverse of the variance, and it is our “precision expectations” that attention here alters. What seems to be happening, in the case at hand, is that the very fact that we covertly attend to the grating on the left (say) increases our expectations of a precise sensory signal. Under such conditions, the expectation of precise information induces an inflated weighting for sensory error and our subjective estimate of the contrast is distorted as a result. The important point is that the error is not computed, as Block & Siegel seem to suggest, as a difference between some prior (in this case unattended) percept and some current (in this case attended) one. Instead, it is computed directly for the present sensory signal itself, but weighted in the light of our expectation of precise sensory information from that location. Expectations of precision are what, according to PP, is being manipulated by the contrast grating experiment, and PP thus offers a satisfying and distinctive account of the effect itself. This same mechanism explains the general effect of attention on spatial acuity, especially in cases where we alter fixation and where more precise information is indeed then available. Block & Siegel are right to demand that the PP framework confront the full spectrum of established empirical results in this area. But they underestimate the range of apparatus (and the distinctiveness of the accounts) that PP can bring to bear. This is not surprising, since these are early days and much further work is needed. For an excellent taste, however, of the kind of detailed, probing treatment of classic experimental results that is already possible, see Hohwy's (2012) exploration of conscious perception, attention, change blindness, and inattentional blindness from the perspective of precision-modulated predictive processing.
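One toy way to see how manipulating precision expectations can yield the Carrasco-style matching result (this is my own illustrative gloss, not a model from the target article or the cited papers; it adds the further assumption of a weak prior favouring low contrast, and all numbers are made up): attending to a location raises the precision assigned to the signal from that location, so the attended estimate is pulled less strongly towards the low-contrast prior than the unattended one.

```python
def perceived_contrast(actual, sensory_precision, prior_mean=30.0, prior_precision=1.0):
    """Posterior mean of a simple Gaussian estimator with a weak low-contrast prior
    (the prior itself is an extra illustrative assumption, not a claim of PP)."""
    total = prior_precision + sensory_precision
    return (prior_precision * prior_mean + sensory_precision * actual) / total

# Attention raises the precision expected (and assigned) at the attended location.
attended = perceived_contrast(70, sensory_precision=9.0)     # attended 70% grating
unattended = perceived_contrast(82, sensory_precision=2.25)  # unattended 82% grating
print(attended, unattended)   # both come out at 66.0: the two gratings appear matched
```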

R3. Prospects

I have chosen to devote the bulk of this Response to addressing the various perils and pitfalls described above and to some even grander ones to be addressed in section 4 further on. A reassuringly large number of commentators, however, have offered illuminating and wonderfully constructive suggestions concerning ways in which to improve, augment, and extend the general picture. I'm extremely grateful for these suggestions, and plan to pursue several of them at greater length in future work. For present purposes, we can divide the suggestions into two main (though non-exclusive) camps: those which add detail or further dimensions to the core PP account, extending it to embrace additional mental phenomena, such as timing, emotion, language, personal narrative, and high-level forms of “self-expectation”; and those which reach out to the larger organizational forms of music, culture, and group behaviors.

R3.1. New dimensions

Shea usefully points out that perception and action, even assuming they indeed share deep computational commonalities, would still differ in respect of their “direction of fit.” In (rich, world-revealing) perception, we reduce prediction error by selecting a model that explains away the sensory signal. In world-engaging action, we reduce prediction error by altering body and world to conform to our expectations. This is correct, and it helps show how the PP framework, despite offering a single fundamental model of cortical processing, comports with the evident multiplicity and variety of forms of cognitive contact with the world.

Farmer, Brown, & Tanenhaus (Farmer et al.) suggest (this was music to my ears) that the hierarchical prediction machine perspective provides a framework that might one day “unify the literature on prediction in language processing.” They describe, in compelling detail, the many applications of prediction-and-generative-model-based accounts to linguistic phenomena. Language, indeed, is a paradigm case of an environmental cause that exhibits a complex, multilevel structure apt for engagement using hierarchical, generative models. Farmer et al. stress several aspects of language comprehension that are hard to explain using traditional models. All these aspects revolve (it seems to me) around the fact that language comprehension involves not “throwing away” information as processing proceeds, so much as using all the information available (in the signal, in the generative model, and in the context) to get a multi-scale, multi-dimensional grip on the evolving acoustic and semantic content. All manner of probabilistic expectations (including speaker-specific lexical expectations formed “on-the-hoof” as conversation proceeds) are thus brought to bear, and impact not just recognition but production (e.g., your own choice of words), too. Context effects, rampant on-the-hoof probability updating, and cross-cueing are all grist to the PP mill.

The PP framework, Holm & Madison convincingly argue, also lends itself extremely naturally to the treatment of timing and of temporal phenomena. In this regard, Holm & Madison draw our attention to large and important bodies of work that display the complex distribution of temporal control within the brain, and that suggest a tendency of later processing stages and higher areas to specialize in more flexible and longer time-scale (but correlatively less dedicated, and less accurate) forms of time-sensitive control. Such distributions, as they suggest, emerge naturally within the PP framework. They emerge from both the hierarchical form of the generative model and the dynamical and multi-scale nature of key phenomena. More specifically, the brain must learn a generative model of coupled dynamical processes spanning multiple temporal scales (a nice example is Friston and Kiebel's [2009] simulation of birdsong recognition). Holm & Madison (and see comments by Schaefer, Overy, & Nelson [Schaefer et al.]) also make the excellent point that action (e.g., tapping with hands and feet) can be used to bootstrap timing, and to increase the reliability of temporal perception. This provides an interesting instance of the so-called “self-structuring of information” (Pfeifer et al. 2007), a key cognitive mechanism discussed in Clark (2008) and in the target article (see WN, sect. 3.4).
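A toy illustration of why the generative model must itself span timescales (this is a deliberately crude stand-in, not the Friston and Kiebel birdsong model; the signal, rates, and noise level are illustrative assumptions): a slowly drifting hidden variable sets the tempo of a fast process, and a predictor blind to the slow level cannot track the fast signal.

```python
import numpy as np

dt, T = 0.01, 2000
t = np.arange(T) * dt
slow = 2.0 + 1.5 * np.sin(2 * np.pi * 0.1 * t)        # slow drift in "song tempo"
phase = np.cumsum(2 * np.pi * slow * dt)              # fast dynamics driven by the slow level
signal = np.sin(phase) + 0.05 * np.random.default_rng(4).normal(size=T)

# A predictor that ignores the slow level (fixed tempo) mispredicts the fast signal;
# one that tracks the slow variable (here simply given, rather than inferred) does far better.
fixed_prediction = np.sin(2 * np.pi * 2.0 * t)
informed_prediction = np.sin(phase)

print(np.mean((signal - fixed_prediction) ** 2))      # large error
print(np.mean((signal - informed_prediction) ** 2))   # near the noise floor (~0.0025)
```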

Gowaty & Hubbell suggest that all animals are Bayesians engaged in predicting the future on the basis of flexibly updated priors, and that they “imagine” (the scare quotes are theirs) alternatives and make choices among them. This is an intriguing hypothesis, but it is one that remains poised (thanks in part to those scare quotes) between two alternative interpretations. On the one hand, there is the (plausible) claim that elements in the systemic organization of all animals respond sensitively, at various timescales, to environmental contingencies so as to minimize free energy and allow the animals to remain within their envelope of viability. On the other hand, there is the (to me less plausible) claim that all animals ground flexible behavioral response in the top-down deployment of rich, internally represented generative models developed and tuned using prediction-driven learning routines of the kind described by PP. I return to this issue in section 4.

Seth & Critchley sketch a powerful and potentially very important bridge between PP-style work and new cognitive scientific treatments of emotion and affect. The proposed bridge to emotion relies on the idea that interoception (the “sense of the physiological condition of the body”) provides a source of signals equally apt for prediction using the kinds of hierarchical generative models described in the target article. The step to emotion is then accomplished (according to their “interoceptive predictive coding” account–see Seth et al. 2011) by treating emotional feelings as determined by a complex exchange between driving sensory (especially interoceptive) signals and multilevel downwards predictions. Of special interest here are signals and predictions concerning visceral, autonomic, and motor states. Attention to predictions (and pathologies of prediction) concerning these states provides, Seth & Critchley plausibly suggest, a major clue to the nature and genesis of many psychiatric syndromes. Dissociative syndromes, for example, may arise from mistaken assignments of precision (too little, in these cases) to key interoceptive signals. But are emotional feelings here constructed by successful predictions (by analogy to the exteroceptive case)? Or are feelings of emotion more closely tied (see also the comments by Schaefer et al. regarding prediction error in music) to the prediction errors themselves, presenting a world that is an essentially moving target, defined more by what it is not than by what it is? Or might (this is my own favorite) the division between emotional and non-emotional components itself prove illusory, at least in the context of a multi-dimensional, generative model–nearly every aspect of which can be permeated (Barrett & Bar 2009) by goal- and affect-laden expectations that are constantly checked against the full interoceptive and exteroceptive array?

Dennett's fascinating and challenging contribution fits naturally, it seems to me, with the suggestions concerning interoceptive self-monitoring by Seth & Critchley. Just how could some story about neural prediction illuminate, in a deep manner, our ability to experience the baby as cute, the sky as blue, the honey as sweet, or the joke as funny? How, in these cases, does the way things seem to us (the daily “manifest image”) hook up with the way things actually work? The key, Dennett suggests, may lie in our expectations about our own expectations. The cuteness of the baby, if I read Dennett correctly, is nothing over and above our expectations concerning our probable reactions (themselves rooted, if the PP story is correct, in a bunch of probabilistic expectations) to imminent baby-exposure. We expect to feel like cooing and nurturing, and when those expectations (which can, in the manner of action-oriented predictive processing, be partially self-fulfilling) are met, we deem the baby itself cute. This is what Dennett (2009) describes as a “strange inversion,” in which we seem to project our own reactive complexes outward, populating our world with cuteness, sweetness, blueness, and more. I think there is something exactly right, and something that remains unclear, in Dennett's sketch. What seems exactly right is that we ourselves turn up as one crucial item among the many items that we humans model when we model our world. For, we ourselves (not just as organisms but as individuals with unique histories, tendencies, and features) are among the many things we need to get a grip upon if we are to navigate the complex social world, predicting our own and others' responses to new situations, threats, and opportunities.

To that extent (see also Friston 2011a), Dennett is surely right: We must develop a grip (what Dennett describes as a set of “Bayesian expectations”) upon how we ourselves are likely to react, and upon how others model us. Our Umwelt, as Dennett says, is thus populated not just with simple affordances but with complex expectations concerning our own nature and reactions. What remains unclear, I think, is just how this complex of ideas hooks up with the question with which Dennett precedes it, namely, “what makes our manifest image manifest (to us)?” For this, on the face of it, is a question about the origins of consciously perceived properties: the origins of awareness, or of something like it–something special that we have and that the elevator (in Dennett's example) rather plausibly lacks. It does not strike me as impossible that there might be a link here, perhaps even a close one. But how does it go? Is the thought that any system that models itself and has expectations about its own reactive dispositions belongs to the class of the consciously aware? That condition seems both too weak (too easily satisfied by a simple artificial system) and too strong (as there may be conscious agents who fail to meet it). Is it that any system that models itself in that way will at least judge (perhaps self-fulfillingly) that it is consciously aware of certain things, such as the cuteness of babies? That's tempting, but we need to hear more. Or is this really just a story–albeit a neat and important one–about how, assuming a system is somehow conscious of some of the things in its world, those things might (if you are a sufficiently bright and complex social organism under pressure to include yourself in your own generative model) come to include such otherwise elusive items as cuteness, sweetness, funniness, and so on?

Hirsh, Mar, & Peterson (Hirsh et al.) suggest that an important feature of the predictive mosaic, when accounting for distinctively human forms of understanding, might be provided by the incorporation of personal narratives as high-level generative models that structure our predictions in a goal- and affect-laden way. This proposal sits well with the complex of ideas sketched by Dennett and by Seth & Critchley, and it provides, as they note, a hook into the important larger sociocultural circuits (see also comments by Roepstorff, and section 4 further on) that also sculpt and inform human behavior. Personal narratives are often co-constructed with others, and can feed the structures and expectations of society back in as elements of the generative model that an individual uses to make sense of their own acts and choices. Hirsh et al., like Dennett, are thus concerned with bridging the apparent gap between the manifest and scientific image, and accounts that integrate culturally inflected personal narratives into the general picture of prediction- and generative-model-based cognition seem ideally placed to play this important role. Narrative structures, if they are correct, lie towards the very top of the predictive hierarchy, and they influence and can help coordinate processing at every level beneath. It is not obvious to me, however, that personal narrative needs to be the concern of a clearly demarcated higher level. Instead, a narrative may be defined across many levels of the processing hierarchy, and supported (in a graded rather than all-or-none fashion) by multiple interacting bodies of self-expectation.

R3.2. Larger organizational forms

This brings us to some comments that directly target the larger organizational forms of music, culture, and group behaviors. Many aspects of our self-constructed sociocultural world, Paton, Skewes, Frith, & Hohwy (Paton et al.) argue, can be usefully conceptualized as devices that increase the reliability of the sensory input, yielding a better signal for learning and for online response. A simple example might be the use of windscreen wipers in the rain. But especially illuminating, in this regard, are their comments on conversation, ritual, convention, and shared practices. In conversation, speakers and listeners often align their uses (e.g., lexical and grammatical choices–see Pickering & Garrod 2007). This makes good sense under a regime of mutual prediction error reduction. But conversants may also, as Paton et al. intriguingly add, align their mental states in a kind of “fusion of expectation horizons.” When such alignment is achieved, the otherwise blunt and imprecise tools of natural language (see Churchland 1989; 2012) can be better trusted to provide reliable information about another's ideas and mental states. Such a perspective (“neural hermeneutics”; Frith & Wentzer, in press) extends naturally to larger cultural forms, such as ritual and shared practice, which (by virtue of being shared) enhance and ensure the underlying alignment that improves interpersonal precision. Culture, in this sense, emerges as a prime source of shared hyperpriors (high-level shared expectations that condition the lower-level expectations that each agent brings to bear) that help make interpersonal exchange both possible and fruitful. Under such conditions (also highlighted by Roepstorff) we reliably discern each other's mental states, inferring them as further hidden causes in the interpersonal world. Neural hermeneutics may thus contribute to the growing alignment between the humanities and the sciences of mind (Hirsh et al.). At the very least, this offers an encompassing vision that adds significant dimensions to the simple idea of mutual prediction error reduction.

Schaefer et al. combine the themes of mutual prediction error reduction, culture, and affect. Their starting point is the idea that music (both in perception and production) creates a context within which prediction error–mutual prediction error, in the group case–is reduced. But this simple idea, they argue, needs augmenting with considerations of arousal, affect, and the scaffolding effects of culture, training, and musical style. There is, Schaefer et al. suggest, an optimal or preferred level of surprisal at which musical experience leads to maximal (positive) affective response. That level is not uniform across musical types, musical features, or even individuals, some of whom may be more “thrill-seeking” than others. The commentary provides many promising tools for thinking about these variations, but makes one claim that I want to question (or at any rate probe a little), namely, that affect is what “makes prediction error in music … meaningful, and indeed determines its value.” This is tricky ground, but I suspect it is misleading (see also comments on Seth & Critchley) to depict prediction error as, if you like, something that is given in experience, and that itself generates an affective response, rather than as that which (sub-personally) determines the (thoroughly affect-laden) experience itself. I am not convinced, that is to say, that I experience my own prediction errors (though I do, of course, sometimes experience surprise).

R4. Darkened rooms and the puzzle of the porous perceiver

R4.1. Darkened rooms

Several commentators (Anderson & Chemero, Froese & Ikegami, Sloman, and to a lesser extent Little & Sommer) have questioned the idea of surprisal minimization as the underlying imperative driving all forms of cognition and adaptive response. A recurrent thread here is the worry that surprisal minimization alone would incline the error-minimizing agent to find a nice “darkened room” and just stay there until it is dead. Despite explicitly bracketing the full free-energy story, WN did attempt (in sects. 3.2–3.4) to address this worry, with apparently mixed results. Little & Sommer argue that the solution proffered depends unwholesomely upon innate knowledge, or at least upon pre-programmed expectations concerning the shape (itinerant, exploratory) of our own behavior. Froese & Ikegami contend (contrary to the picture briefly explored in WN, sect. 3.2) that good ways of minimizing surprisal will include “stereotypic self-stimulation, catatonic withdrawal from the world, and autistic withdrawal from others.”

Hints of a similar worry can be found in the comments by Schaefer et al., who suggest that musical appreciation involves not the simple quashing of prediction error (perhaps that might be achieved by a repeated pulse?) but attraction towards a kind of sweet spot between predictability and surprise: an “optimal level of surprisal,” albeit one that varies from case to case and between individuals and musical traditions. As a positive proposal, Little & Sommer suggest that we shift our attention from the minimization of prediction error to the maximization of mutual information. That is to say, why not depict the goal as maximizing the mutual information (on this, see also Phillips) between an internal model of estimated causes and the sensory inputs? Minimizing entropy (prediction error) and maximizing mutual information (hence prediction success), Little & Sommer argue, each keep prediction error low, but they differ in the actions they select. A system that seeks to maximize mutual information won't, they suggest, fall into the dark room trap. For, it is driven instead towards a sweet spot between predictability and complexity and will “seek out conditions in which its sensory inputs vary in a complex, but still predictable, fashion.”
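A minimal formal gloss may help fix ideas here (the notation is mine, a standard information-theoretic sketch rather than a rendering of Little & Sommer's own formalism). Writing $s$ for sensory states and $r$ for the model's estimated causes:

$$H(S \mid R) = -\sum_{s,r} p(s,r)\,\log p(s \mid r), \qquad I(S;R) = H(S) - H(S \mid R).$$

Average prediction error tracks the conditional entropy $H(S \mid R)$: how uncertain the input remains once the model has issued its prediction. Maximizing $I(S;R)$ also drives $H(S \mid R)$ down, but it can in addition be served by keeping the marginal entropy $H(S)$ high, that is, by actively seeking out complex, information-rich input streams. Hence the two objectives, while agreeing on low prediction error, can recommend different policies of action.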

Many interesting issues arise at this point. For example, we might also want to minimize mutual information (redundancy) among outputs (as opposed to between inputs and model) so as to achieve sparse, efficient coding (Olshausen & Field 1996). But for present purposes, the main point to make is that any improvement afforded by the move to mutual information is, as far as I can determine, merely cosmetic. Thus, consider a system driven towards some sweet spot between predictability and complexity. For that system, there will be some complex set of inputs (imagine, to be concrete, a delicate, constantly changing stream of music) such that the set of inputs affords, for that agent, the perfect balance between predictability and complexity. The musical stream can be as complex as you like. Perhaps it must be so complex as never quite to repeat itself. Surely the agent must now enter the “musical room” and stay there until it is dead? The musical room, I contend, is as real (and, more important, as unreal) a danger as the darkened room. Notice that you can ramp up the complexity at will. Perhaps the sweet spot involves complex shifts between musical types. Perhaps the location of the sweet spot varies systematically with the different types. Make the scenario as complex as you wish. For that complexity, there is some musical room that now looks set to act as a death trap for that agent.

There is, of course, a perfectly good way out of this. It is to notice, with Friston, that all the key information-theoretic quantities are defined and computed relative to a type of agent–a specific kind of creature whose morphology, nervous system, and neural economy already render it (but only in the specific sense stressed by Friston; more on this shortly) a complex model of its own adaptive niche. As such, the creature, simply because it is the creature that it is, already embodies a complex set of “expectations” concerning moving, eating, playing, exploring, and so forth. It is because surprisal at the very largest scale is minimized against the backdrop of this complex set of creature-defining “expectations” that we need fear neither darkened nor musical (nor meta-musical, nor meta-meta-musical) rooms. The free-energy principle thus subsumes the mutual information approach (for a nice worked example, see Friston et al. 2012). The essential creature-defining backdrop then sets the scene for the deployment (sometimes, in some animals) of PP-style strategies of cortical learning in which hierarchical message passing, by implementing a version of “empirical Bayes,” allows effective learning that is barely, if at all, hostage to initial priors. That learning requires ongoing exposure to rich input streams. It is the backdrop “expectations,” deeply built into the structure of the organism (manifesting as, for example, play, curiosity, hunger, and thirst), that keep the organism alive and the input stream rich, and that promote various beneficial forms of “self-structuring” of the information flow–see Pfeifer et al. (2007).
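For readers who want the formal connection spelled out, a much-simplified statement of the free-energy bound runs as follows (this is my own shorthand for the standard formulation, not a quotation from Friston's commentary):

$$F(s, q) = D_{\mathrm{KL}}\big[q(h)\,\|\,p(h \mid s, m)\big] - \ln p(s \mid m) \;\ge\; -\ln p(s \mid m),$$

where $s$ is the sensory input, $h$ the hidden causes, $q$ the agent's current recognition density, and $m$ the generative model that the creature, by its very phenotype, embodies. Free energy thus upper-bounds surprisal, but surprisal evaluated relative to $m$. That relativization is the crucial point: the itinerant, exploratory “expectations” built into the kind of creature we are already make the darkened (and the musical) room a state of very high surprisal for us.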

This means that the general solution to the darkened room worry that was briefly scouted in WN, section 3.2, is mandatory, and that we must embrace it whatever our cosmetic preferences concerning entropy versus mutual information. This also means that the suggestion (Froese & Ikegami) that enactivism offers an alternative approach, with a distinctive resolution of the dark room issue, is misguided. Indeed, the “two” approaches are, with respect to the darkened room issue at least, essentially identical. Each stresses the autonomous dynamics of the agent. Each depicts agents as moving through space and time in ways determined by “the viability constraints of the organism.” Each grounds value, ultimately, in those viability constraints (which are the essential backdrop to any richer forms of lifetime learning).

Froese & Ikegami also take PP (though they dub it HPM: the “Hierarchical Prediction Machine” story) to task for its commitment to some form of representationalism. This commitment leads, they fear, to an unacceptable internalism (recall also the comments from Anderson & Chemero) and to the unwelcome erection of a veil between mind and world. This issue arises also (although from essentially the opposite direction) in the commentary by Paton et al. Thus, Froese & Ikegami fear that the depiction of the cerebral cortex as commanding probabilistic internal models of the world puts the world “off-limits,” while Paton et al. suggest that my preferred interpretation of the PP model makes the mind–world relation too direct and obscures the genuine sense in which “perception remains an inferred fantasy about what lies behind the veil of input.” I find this strangely cheering, as these diametrically opposed reactions suggest that the account is, as intended, walking a delicate but important line. On the one hand, I want to say that perception–rich, world-revealing perception of the kind that we humans enjoy–involves the top-down deployment of generative models that have come, courtesy of prediction-driven learning within the bidirectional cortical hierarchy, to embody rich, probabilistic knowledge concerning the hidden causes of our sensory inputs. On the other hand, I want to stress that those same learning routines make us extremely porous to the statistical structure of the actual environment, and put us perceptually in touch, in as direct a fashion as is mechanistically possible, with the complex, multilayered world around us.

R4.2. The puzzle of the porous perceiver

This, then, is the promised “puzzle of the porous perceiver”: Can we both experience the world via a top-down generative-model based cascade and be in touch not with a realm of internal fantasy but, precisely, with the world? One superficially tempting way to try to secure a more direct mind–world relation is to follow Froese & Ikegami in rejecting the appeal to internally represented models altogether (we saw hints of this in the comments by Anderson & Chemero too). Thus, they argue that “Properties of the environment do not need to be encoded and transmitted to higher cortical areas, but not because they are already expected by an internal model of the world, but rather because the world is its own best model.” But I do not believe (nor have I ever believed: see, e.g., Clark 1997, Ch. 8) that this strategy can cover all the cases, or that, working alone, it can deliver rich, world-revealing perception of the kind we humans enjoy–conscious perception of a world populated by (among other things) elections, annual rainfall statistics, prayers, paradoxes, and poker hands. To experience a world rich in such multifarious hidden causes we must do some pretty fancy things, at various time-scales, with the incoming energetic flux: things at the heart of which lie, I claim, the prediction-driven acquisition and top-down deployment of probabilistic generative models. I do not believe that prayers, paradoxes, or even poker hands can be their own best model, if that means they can be known without the use of internal representations or inner models of external hidden causes. Worse still, in the cases where we might indeed allow the world, and directly elicited actions upon the world, to do most of the heavy lifting (the termite mound-building strategies mentioned by Sloman are a case in point) it is not obvious that there will–simply by virtue of deploying such strategies alone–be any world-presenting experience at all. What seems exactly right, however, is that brains like ours are masters of what I once heard Sloman describe as “productive laziness.” Hence, we will probably not rely on a rich internal model when the canny use of body or world will serve as well, and many of the internal models that we do use will be partial at best, building in all kinds of calls (see Clark 2008) to embodied, problem-simplifying action. The upshot is that I did not intend (despite the fears of Anderson & Chemero) to depict all of cognition and adaptive response as grounded in the top-down deployment of knowledge-rich internal models. But I do think such models are among the most crucial achievements of cortical processing, and that they condition both online and offline forms of human experience.

Nor did I intend, as Sloman fears (in a kind of reversal of the worry raised by Anderson & Chemero), to reduce all cognition to something like online control. Where Anderson & Chemero subtly mislocate the PP account as an attempt to depict all cognition as rooted in procedures apt only for high-level knowledge-rich response, Sloman subtly mislocates it as an over-ambitious attempt to depict all cognition as rooted in procedures apt only for low-level sensorimotor processing. Sloman thus contrasts prediction with interpretation, and stresses the importance to human (and animal) reasoning of multiple meta-cognitive mechanisms that (he argues) go far beyond the prediction and control of gross sensory and motor signals. In a related vein, Khalil interestingly notes that human cognition includes many “conception-laden processes” (such as choosing our own benchmark for a satisfactory income) that cannot be corrected simply by adjustments that better track gross sensory input.

Fortunately, there are no deep conflicts here. PP aims only to describe a core cortical processing strategy: a strategy that can deliver probabilistic generative models apt both for basic sensorimotor control and for more advanced tasks. The same core strategy can drive the development of generative models that track structure within highly abstract domains, and assertions concerning such domains can indeed resist simple perceptual correction. To say that the mechanisms of (rich, world-presenting) perception are continuous with the mechanisms of (rich, world-presenting) cognition is not to deny this. It may be, however, that learning about some highly abstract domains requires the delivery of structured symbolic inputs, for example, ones couched in the formalisms of language, science, and mathematics. Understanding how prediction-driven learning interacts with the active production and uptake of external symbolic representations and with various forms of designer learning environments is thus a crucial challenge, as Roepstorff also notes.

PP thus functions primarily as an intermediate-level (see Spratling) description of the underlying form of cortical processing. This is the case even though the larger story about free energy minimization (a story I briefly sketched in WN, sect. 1.6, but tried to bracket as raising issues far beyond the scope of the article) aims to encompass far more. As a theory of cortical processing, PP suggests that what we learn to represent are linked sets of probability density distributions, and that these take the form of hierarchical generative models underlying both perception (of the rich, world-presenting variety) and many forms of world-engaging action. Importantly, this leaves plenty of space for other ploys and strategies to coexist with the core PP mechanism. I tried to celebrate that space by making a virtue (WN, sects. 3.2–3.4) out of the free-energy story's failure to specify the full form of a cognitive architecture, envisaging a cooperative project requiring many further insights from evolutionary, situated, embodied, and distributed approaches to understanding mind and adaptive response. Was it then false advertising to offer PP itself as a unifying account? Not, I fondly hope, if PP reveals common computational principles governing knowledge-rich forms of cortical processing (in both the sensory and motor realms), delivers a novel account of attention (as optimizing precision), and reveals prediction error minimization as the common goal of many forms of action, social engagement, and environmental structuring.
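To make that intermediate-level claim concrete, here is a deliberately toy sketch of the kind of mechanism at issue: a two-level, linear predictive-coding network in which predictions flow downwards, precision-weighted prediction errors flow upwards, and inference settles by minimizing the weighted errors. It is my own illustration only; the weights, precisions, and dimensions are arbitrary choices, and nothing in the target article commits PP to this particular scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-level predictive-coding network (illustration only).
# Level 2 predicts level-1 activity; level 1 predicts the sensory input.
W1 = rng.normal(scale=0.3, size=(10, 6))   # generative weights: level 1 -> input
W2 = rng.normal(scale=0.3, size=(6, 3))    # generative weights: level 2 -> level 1
pi0, pi1 = 1.0, 0.5                        # precisions (inverse variances) on each error unit

def infer(x, steps=2000, lr=0.1):
    """Settle the hidden-state estimates by gradient descent on the
    precision-weighted sum of squared prediction errors."""
    r1, r2 = np.zeros(6), np.zeros(3)
    for _ in range(steps):
        e0 = x - W1 @ r1          # sensory prediction error (passed "up")
        e1 = r1 - W2 @ r2         # level-1 prediction error (passed "up")
        # Each level revises its estimate so as to explain away the error
        # below it while respecting the prediction arriving from above.
        r1 += lr * (pi0 * W1.T @ e0 - pi1 * e1)
        r2 += lr * (pi1 * W2.T @ e1)
    return r1, r2

# Sensory input generated from "true" hidden causes, plus a little noise.
true_r2 = np.array([1.0, -0.5, 0.25])
x = W1 @ (W2 @ true_r2) + rng.normal(scale=0.01, size=10)

r1, r2 = infer(x)
print("residual sensory error:", np.round(np.linalg.norm(x - W1 @ r1), 3))
print("inferred top-level causes:", np.round(r2, 2))
```

Attention, on the PP account, corresponds to the setting of the precision terms (here pi0 and pi1): boosting the precision assigned to a given error unit increases the influence of that error on the higher-level estimates that must explain it away.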

There is thus an important difference of emphasis between my treatment and the many seminal treatments by Karl Friston. For as the comments by Friston made clear, he himself sets little store by the difference between what I (like Anderson & Chemero) might describe as knowledge- and inner-model-rich versus knowledge-sparse ways of minimizing free energy and reducing surprisal. Viewed from the loftier perspective of free-energy minimization, the effect is indeed the same. Free-energy reduction can be promoted by the “fit” between morphology and niche, by quick-and-dirty internal-representation-sparse ploys, and by the costlier (but potent) use of prediction-driven learning to infer internally represented probabilistic generative models. But it is, I suspect, only that costlier class of approaches, capable of on-the-hoof learning about complex interanimated webs of hidden causes, that delivers a certain “cognitive package deal.” The package deal bundles together what I have been calling “rich, world-presenting perception,” offline imagination, and understanding (not just apt response) and has a natural extension to intentional, world-directed action (see Clark, forthcoming). Such a package may well be operative, as Gowaty & Hubbell suggested, in the generation of many instances of animal response. It need not implicate solely the neocortex (though that seems to be its natural home). But potent though the package is, it is not the only strategy at work, even in humans, and there may be some animals that do not deploy the strategy at all.

Thus, consider the humble earthworm. The worm is doubtless a wonderful minimizer of free energy, and we might even describe the whole worm (as the comments by Friston suggest) as a kind of free-energy minimizing model of its world. But does the worm command a model of its world parsed into distal causes courtesy of top-down expectations applied in a multilevel manner? This is far from obvious. The worm is capable of sensing, but perhaps it does not thereby experience a perceptual world. If that is right, then not all ways of minimizing free energy are equal, and only some put you in perceptual touch (as opposed to mere causal contact) with a distal world characterized by multiple interacting hidden causes.

This brings us back, finally, to the vexed question of the mind–world relation itself. Where Froese & Ikegami fear that the PP strategy cuts us off from the world, inserting an internal model between us and the distal environment, I believe that it is only courtesy of such models that we (perhaps unlike the earthworm) can experience a distal environment at all! Does that mean that perception presents us (as Paton et al. suggest) with only a fantasy about the world? I continue to resist this way of casting things. Consider (recall the comments by Farmer et al.) the perception of sentence structure during speech processing. It is plausibly only due to the deployment of a rich generative model that a hearer can recover semantic and syntactic constituents from the impinging sound stream. Does that mean that the perception of sentence structure is “an inferred fantasy about what lies behind the veil of input”? Surely it does not. In recovering the right set of interacting distal causes (subjects, objects, meanings, verb-clauses, etc.) we see through the sound stream to the multilayered structure and complex purposes of the linguistic environment itself. This is possible because brains like ours are sensitive statistical sponges open to deep restructuring by the barrage of inputs coming from the world. Moreover, even apparently low-level structural features of cortex (receptive field orientations and spatial frequencies), as Bridgeman very eloquently reminds us, come to reflect the actual statistical profile of the environment, and do so in ways that are surprisingly open to variation by early experience.

Does this commit me to the implausible idea that perception presents us with the world “as it is in itself”? Here, the helpful commentary by König, Wilming, Kaspar, Nagel, & Onat (König et al.) seems to me to get the issue exactly right. Predictions are made, they stress (see also the comments by Bridgeman), in the light of our own action repertoire. This simple (but profound) fact results in reductions of computational complexity by helping to select what features to process, and what things to try to predict. From the huge space of possible ways of parsing the world, given the impinging energetic flux, we select the ways that serve our needs by fitting our action repertoires. Such selection will extend, as Paton et al. have noted (see also Dennett), to ways of parsing and understanding our own bodies and minds. Such parsing enables us to act on the world, imposing further structure on the flow of information, and eventually reshaping the environment itself to suit our needs.

Roepstorff's engaging commentary brings several of these issues into clearer focus by asking in what ways, if any, the PP framework illuminates specifically human forms of cognition. This is a crucial question. The larger free-energy story targets nothing that is specifically human, though (of course) it aims to encompass human cognition. The PP framework seeks to highlight a cortical processing strategy that, though not uniquely human, is plausibly essential to human intelligence and that provides, as mentioned above, a compelling “cognitive package deal.” That package deal delivers, at a single stroke, understanding of complex, interacting distal causes and the ability to generate perception-like states from the top down. It delivers understanding because to perceive the world of distal causes in this way is not just to react appropriately to it. It is to know how that world will evolve and alter across multiple timescales. This, in turn, involves learning to generate perception-like states from the top down. This double innovation, carefully modulated by the precision-weighting of attention, lies (PP claims) at the very heart of many distinctively human forms of cognition. To be sure (recall Gowaty & Hubbell) the same strategy is at work in many nonhuman animals, delivering there too a quite deep understanding of a world of distal causes. What, then, is special about the human case?

Roepstorff points to a potent complex of features of human life, especially our abilities for temporally coordinated social interaction (see also commentaries by Holm & Madison, Paton et al., and Schaefer et al.) and our (surely deeply related) abilities to construct artifacts and designer environments. Versions of all of this occur in other species. But in the human case, the mosaic comes together under the influence of flexible structured symbolic language and an almost obsessive drive to engage in shared cultural practices. We are thus enabled repeatedly to redeploy our core cognitive skills in the transformative context of exposure to patterned sociocultural practices, including the use of symbolic codes (encountered as “material symbols”; Clark 2006a) and complex social routines (Hutchins 1995; Roepstorff et al. 2010). If, as PP claims, one of the most potent inner tools available is deep, prediction-driven learning that locks on to interacting distal hidden causes, we may dimly imagine (WN, sect. 3.4; Clark 2006a; 2008) a virtuous spiral in which our achieved understandings are given concrete and communicable form, and then shared and fed back using structured practices that present us with new patterns.

Such pattern-presenting practices should, as Roepstorff suggests, enable us to develop hierarchical generative models that track ever more rarefied causes spanning the brute and the manufactured environment. By tracking such causes they may also, in innocent ways, help create and propagate them (think of patterned practices such as marriage and music). It is this potentially rich and multilayered interaction between knowledge-rich prediction-driven learning and enculturated, situated cognition that most attracts me to the core PP proposal. These are early days, but I believe PP has the potential to help bridge the gap between simpler forms of embodied and situated response, the self-structuring of information flows, and the full spectrum of socially and technologically inflected human understanding.

References

Barrett, L. F. & Bar, M. (2009) See it with feeling: Affective predictions during object perception. Philosophical Transactions of the Royal Society of London B: Biological Sciences 364(1521):1325–34.
Bengio, Y. (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127.
Brayanov, J. B. & Smith, M. A. (2010) Bayesian and “anti-Bayesian” biases in sensory integration for action and perception in the size–weight illusion. Journal of Neurophysiology 103(3):1518–31.
Carrasco, M., Ling, S. & Read, S. (2004) Attention alters appearance. Nature Neuroscience 7:308–13.
Churchland, P. M. (1989) The neurocomputational perspective. MIT Press/Bradford Books.
Churchland, P. M. (2012) Plato's camera: How the physical brain captures a landscape of abstract universals. MIT Press.
Clark, A. (1989) Microcognition: Philosophy, cognitive science and parallel distributed processing. MIT Press/Bradford Books.
Clark, A. (1993) Minimal rationalism. Mind 102(408):587–610.
Clark, A. (2006a) Language, embodiment and the cognitive niche. Trends in Cognitive Sciences 10(8):370–74.
Clark, A. (2008) Supersizing the mind: Action, embodiment, and cognitive extension. Oxford University Press.
Clark, A. (2012) Dreaming the whole cat: Generative models, predictive processing, and the enactivist conception of perceptual experience. Mind 121(483):753–71.
Clark, A. (forthcoming) Perceiving as predicting. In: Perception and its modalities, ed. Mohan, M., Biggs, S. & Stokes, D. Oxford University Press.
Clark, A. & Thornton, C. (1997) Trading spaces: Computation, representation, and the limits of uninformed learning. Behavioral and Brain Sciences 20(1):57–66.
Dayan, P., Hinton, G. E. & Neal, R. M. (1995) The Helmholtz machine. Neural Computation 7:889–904.
Dennett, D. C. (2009) Darwin's “Strange Inversion of Reasoning”. Proceedings of the National Academy of Sciences USA 106(Suppl. 1):10061–65.
den Ouden, H. E. M., Daunizeau, J., Roiser, J., Friston, K. J. & Stephan, K. E. (2010) Striatal prediction error modulates cortical coupling. Journal of Neuroscience 30:3210–19.
den Ouden, H. E. M., Friston, K. J., Daw, N. D., McIntosh, A. R. & Stephan, K. E. (2009) A dual role for prediction error in associative learning. Cerebral Cortex 19:1175–85.
Egner, T., Monti, J. M. & Summerfield, C. (2010) Expectation and surprise determine neural population responses in the ventral visual stream. Journal of Neuroscience 30(49):16601–608.
Eliades, S. J. & Wang, X. (2008) Neural substrates of vocalization feedback monitoring in primate auditory cortex. Nature 453:1102–106.
Feldman, H. & Friston, K. J. (2010) Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience 4:215. doi: 10.3389/fnhum.2010.00215.
Friston, K. (2005) A theory of cortical responses. Philosophical Transactions of the Royal Society of London B: Biological Sciences 360(1456):815–36.
Friston, K. (2011a) Embodied inference: Or I think therefore I am, if I am what I think. In: The implications of embodiment (Cognition and Communication), ed. Tschacher, W. & Bergomi, C., pp. 89–125. Imprint Academic.
Friston, K. (2011b) What is optimal about motor control? Neuron 72:488–98.
Friston, K., Adams, R. A., Perrinet, L. & Breakspear, M. (2012) Perceptions as hypotheses: Saccades as experiments. Frontiers in Psychology 3:151. doi: 10.3389/fpsyg.2012.00151.
Friston, K. & Kiebel, S. (2009) Cortical circuits for perceptual inference. Neural Networks 22:1093–104.
Frith, C. D. & Wentzer, T. S. (in press) Neural hermeneutics. In: Encyclopedia of philosophy and the social sciences, vol. 1, ed. Kaldis, B. Sage.
Glimcher, P. (2010) Foundations of neuroeconomic analysis. Oxford University Press.
Hinton, G. E. & Nair, V. (2006) Inferring motor programs from images of handwritten digits. In: Advances in neural information processing systems 18: Proceedings of the 2005 NIPS Conference, ed. Weiss, Y., Scholkopf, B. & Platt, J., pp. 515–22. MIT Press.
Hohwy, J. (2012) Attention and conscious perception in the hypothesis testing brain. Frontiers in Psychology 3:96, 1–14. doi: 10.3389/fpsyg.2012.00096.
Hutchins, E. (1995) Cognition in the wild. MIT Press.
Kay, J. & Phillips, W. A. (2010) Coherent Infomax as a computational goal for neural systems. Bulletin of Mathematical Biology 73:344–72. doi: 10.1007/s11538-010-9564-x.
Keller, G. B., Bonhoeffer, T. & Hubener, M. (2012) Sensorimotor mismatch signals in primary visual cortex of the behaving mouse. Neuron 74:809–15.
Kok, P., Rahnev, D., Jehee, J. F., Lau, H. C. & de Lange, F. P. (2011) Attention reverses the effect of prediction in silencing sensory signals. Cerebral Cortex 22:2197–206.
Lee, D. & Wang, X.-J. (2009) Mechanisms for stochastic decision making in the primate frontal cortex: Single-neuron recording and circuit modeling. In: Neuroeconomics: Decision making and the brain, ed. Glimcher, P., Camerer, C., Fehr, E. & Poldrack, R., pp. 481–501. Elsevier.
Meyer, T. & Olson, C. R. (2011) Statistical learning of visual transitions in monkey inferotemporal cortex. Proceedings of the National Academy of Sciences USA 108:19401–406.
Mumford, D. (1992) On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biological Cybernetics 66(3):241–51.
Murray, S. O., Kersten, D., Olshausen, B. A., Schrater, P. & Woods, D. L. (2002) Shape perception reduces activity in human primary visual cortex. Proceedings of the National Academy of Sciences USA 99(23):15164–69.
Olshausen, B. A. & Field, D. J. (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583):607–609.
Pfeifer, R., Lungarella, M., Sporns, O. & Kuniyoshi, Y. (2007) On the information theoretic implications of embodiment – principles and methods. Lecture Notes in Computer Science (LNCS), vol. 4850. Springer.
Phillips, W. A., Kay, J. & Smyth, D. (1995) The discovery of structure by multistream networks of local processors with contextual guidance. Network: Computation in Neural Systems 6:225–46.
Phillips, W. A., von der Malsburg, C. & Singer, W. (2010) Dynamic coordination in brain and mind. In: Strüngmann Forum Report, vol. 5: Dynamic coordination in the brain: From neurons to mind, ed. von der Malsburg, C., Phillips, W. A. & Singer, W., Chapter 1, pp. 1–24. MIT Press.
Pickering, M. J. & Garrod, S. (2007) Do people use language production to make predictions during comprehension? Trends in Cognitive Sciences 11:105–10.
Posner, M. (1980) Orienting of attention. Quarterly Journal of Experimental Psychology 32:3–25.
Reichert, D., Seriès, P. & Storkey, A. (2010) Hallucinations in Charles Bonnet Syndrome induced by homeostasis: A Deep Boltzmann Machine model. Advances in Neural Information Processing Systems 23:2020–28.
Roepstorff, A., Niewohner, J. & Beck, S. (2010) Enculturing brains through patterned practices. Neural Networks 23(8–9):1051–59.
Salakhutdinov, R. & Hinton, G. E. (2009) Deep Boltzmann machines. In: Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 5, ed. van Dyk, D. & Welling, M., pp. 448–55. Journal of Machine Learning Research, published online at http://jmlr.csail.mit.edu/proceedings/papers/v5/
Schenk, T. & McIntosh, R. (2010) Do we have independent visual streams for perception and action? Cognitive Neuroscience 1:52–78.
Seth, A. K., Suzuki, K. & Critchley, H. D. (2011) An interoceptive predictive coding model of conscious presence. Frontiers in Psychology 2:395.
Stephan, K., Friston, K. & Frith, C. (2009) Dysconnection in schizophrenia: From abnormal synaptic plasticity to failures of self-monitoring. Schizophrenia Bulletin 35(3):509–27.
von der Malsburg, C., Phillips, W. A. & Singer, W., eds. (2010) Strüngmann Forum Report, vol. 5: Dynamic coordination in the brain: From neurons to mind. MIT Press.
Wyart, V., Nobre, A. C. & Summerfield, C. (2012) Dissociable prior influences of signal probability and relevance on visual contrast sensitivity. Proceedings of the National Academy of Sciences USA 109:3593–98.
Zhu, Q. & Bingham, G. P. (2011) Human readiness to throw: The size-weight illusion is not an illusion when picking the best objects to throw. Evolution and Human Behavior 32(4):288–93.