Firestone & Scholl (F&S) argue there is no good evidence for cognitive penetration of perception, specifically vision. Instead, they propose, visual processing is informationally encapsulated. Importantly, their version of encapsulation goes beyond Fodor's original proposal “that at least some of the background information at the subject's disposal is inaccessible to at least some of his perceptual mechanisms” (Fodor Reference Fodor1983, p. 66). Their hypothesis is much more ambitious: “perception proceeds without any direct, unmediated influence from cognition” (sect. 5, para. 1). We will refer to this view as total encapsulation.
One possible counterexample to total encapsulation is multisensory modulation. For example, sounds in rapid succession can induce the illusory reappearance of visual flashes (Shams et al. Reference Shams, Kamitani and Shimojo2000). Such reappearances increase objective sensitivity for visual features of the flash (Berger et al. Reference Berger, Martelli and Pelli2003) and are linked to individual structure and function of primary visual cortex (de Haas et al. Reference de Haas, Kanai, Jalkanen and Rees2012; Watkins et al. Reference Watkins, Shams, Tanaka, Haynes and Rees2006). Waving one's hand in front of the eyes can induce visual sensations and enable smooth pursuit eye movements, even in complete darkness (Dieter et al. Reference Dieter, Hu, Knill, Blake and Tadin2014). The duration of sounds can bias the perceived duration of concurrent visual stimuli (Romei et al. Reference Romei, de Haas, Mok and Driver2011), and sensitivity for a brief flash increases parametrically with the duration of a co-occurring sound (de Haas et al. Reference de Haas, Cecere, Cullen, Driver and Romei2013a). The noise level of visual stimulus representations in retinotopic cortex is affected by the (in)congruency of co-occurring sounds (de Haas et al. Reference de Haas, Schwarzkopf, Urner and Rees2013b). Category-specific sounds and visual imagery can be decoded from early visual cortex, even with eyes closed (Vetter et al. Reference Vetter, Smith and Muckli2014), and the same is true for imagined hand actions (Pilgramm et al. Reference Pilgramm, de Haas, Helm, Zentgraf, Stark, Munzert and Krüger2016). At the same time, the location of visual stimuli can bias the perceived origin of sounds (Thomas Reference Thomas1941), and a visible face articulating a syllable can bias the perception of a concurrently presented (different) syllable (McGurk & MacDonald Reference McGurk and MacDonald1976). F&S argue that multisensory effects can be reconciled with total encapsulation. 
The inflexible nature and short latency of such effects would provide evidence that they happen “within perception itself,” rather than reflecting the effect of “more central cognitive processes on perception” (sect. 2.4, para. 1). However, multisensory effects have different temporal latencies and occur at multiple levels of processing, from direct cross-talk between primary sensory areas to top-down feedback from association cortex (de Haas & Rees Reference de Haas, Rees, Shams and Kim2010; Driver & Noesselt Reference Driver and Noesselt2008). They may further be subject to attentional (Navarra et al. Reference Navarra, Alsius, Soto-Faraco and Spence2010), motivational (Bruns et al. Reference Bruns, Maiworm and Röder2014), and expectation-based (Gau & Noppeney Reference Gau and Noppeney2015) modulations. Therefore, the evidence for a strictly horizontal nature of multisensory effects seems ambiguous at best. If total encapsulation hinges on the hypothesis of strictly horizontal effects, this hypothesis needs to be made more precise. Specifically, what type of neural or behavioural evidence could refute it?
A second, perhaps more definitive, counterexample is attentional modulation of vision. F&S acknowledge that attention can change what we see (cf. Anton-Erxleben et al. Reference Anton-Erxleben, Abrams and Carrasco2011; Carrasco et al. Reference Carrasco, Fuller and Ling2008) and that these effects can be under intentional control. For example, voluntary attention can induce changes in the perceived spatial frequency (Abrams et al. Reference Abrams, Barbot and Carrasco2010), contrast (Liu et al. Reference Liu, Abrams and Carrasco2009), and position (Suzuki & Cavanagh Reference Suzuki and Cavanagh1997) of visual stimuli. Withdrawal of attention can induce perceptual blur (Montagna et al. Reference Montagna, Pestilli and Carrasco2009) and reduce visual sensitivity (Carmel et al. Reference Carmel, Thorne, Rees and Lavie2011) and sensory adaptation (Rees et al. Reference Rees, Frith and Lavie1997). Nevertheless, F&S argue for total encapsulation. On such an account, attention would not interfere with visual processing per se but with the input to this process, “similar to changing what we see by moving our eyes” or “turning the lights off” (sect. 4.5, para. 4).
Attention-related spatial distortions and changes in acuity have been linked to effects on the spatial tuning of visual neurons (Anton-Erxleben & Carrasco Reference Anton-Erxleben and Carrasco2013; Baruch & Yeshurun Reference Baruch and Yeshurun2014). Receptive fields can shift and grow towards, or shrink around, attended targets (e.g., Womelsdorf et al. Reference Womelsdorf, Anton-Erxleben and Treue2008). Such effects go beyond mere amplitude modulation and can provide important evidence regarding their locus. In a recent study (de Haas et al. Reference de Haas, Schwarzkopf, Anderson and Rees2014), we investigated the effects of attentional load at fixation on neuronal spatial tuning in early visual cortices. Participants performed either a hard or an easy fixation task while retinotopic mapping stimuli traversed the surrounding visual field. Importantly, stimuli were identical in both conditions – only the task instructions differed. Performing the harder task, and consequently having to pay less attention to the task-irrelevant mapping stimuli (Lavie Reference Lavie2005), yielded a blurrier neural representation of the surround, as well as a centrifugal repulsion of population receptive fields in V1-3 (pRFs; Dumoulin & Wandell Reference Dumoulin and Wandell2008). Importantly, this repulsion in V1-3 was accompanied by a centripetal attraction of pRFs in the intraparietal sulcus (IPS), perhaps because the larger receptive fields in IPS specifically encode the attended location (Klein et al. Reference Klein, Harvey and Dumoulin2014). Critically, retinotopic shifts merely inherited from input modulations cannot trivially explain such opposing shifts, because any such effect should be the same (or very similar) across the visual processing hierarchy.
How can one reconcile these findings with total encapsulation? We can think of only one way: redefining visual processing in a way that excludes processing associated with retinotopic tuning of visual cortex but includes feedback processes from multisensory areas (as outlined in our second paragraph above). Such a redefinition seems hard to reconcile with the widespread evidence that visually tuned neuronal populations in occipital cortex are involved in visual processing. We instead argue that attentional and multisensory modulations are inconsistent with total encapsulation, and that at least here, the line between cognition and perception is blurred. F&S concede that accepting these exceptions to total encapsulation would be far less revolutionary than many of the claims they attack. We second their demand to back up extraordinary claims with rigorous evidence and applaud the standards they propose. Many effects they discuss may indeed fail to meet these standards. But precisely because attentional and multisensory effects are well established, total encapsulation itself strikes us as an extraordinary claim that is not supported by the available evidence.
ACKNOWLEDGMENTS
This work was supported by a research fellowship from the Deutsche Forschungsgemeinschaft (BdH; HA 7574/1-1), European Research Council Starting Grant 310829 (DSS, BdH), and the Wellcome Trust (GR).