No perceptual task takes place in a contextual vacuum. How do we know that an effect is one of perception qua perception, uncontaminated by other cognitive contributions? Experimental instructions alone involve various cognitive factors that guide task performance (Roepstorff & Frith Reference Roepstorff and Frith2004). Even a request to detect simple stimulus features requires participants to understand the instructions (language, memory), keep track of them (working memory), become sensitive to them (attention), and pick up the necessary information to become appropriately sensitive (perception). These processes work in a dynamic parallelism that is required when one participates in any experiment. Any experiment with enough cognitive content to test top-down effects would seem to invoke all of these processes. From this task-level vantage point, the precise role of visual perception under strict modular assumptions seems, to us, difficult to intuit. We are, presumably, seeking theories that can also account for complex natural perceptual acts. Perception must somehow participate with cognition to help guide action in a labile world. Perception operating entirely independently, without any task-based constraints, flirts with hallucination. Additional theoretical and empirical considerations raise further difficulties for Firestone and Scholl's thesis.
First, like Firestone & Scholl (F&S), Fodor (Reference Fodor1983) famously used visual illusions to argue for the modularity of perceptual input systems. Cognition itself, Fodor suggested, was likely too complex to be modular. Ironically, F&S have turned Fodor's thesis on its head; they argue that perceptual input systems may interact as much as they like without violating modularity. But there are some counterexamples. In Jastrow's (Reference Jastrow1899) and Hill's (Reference Hill1915) ambiguous figures, one sees either a duck or a rabbit on the one hand, and either a young woman or an old woman on the other. Yet one can cognitively control which of these one sees. Admittedly, cognition cannot “penetrate” our perception to turn straight lines into curved ones in any arbitrary stimulus; and clearly we cannot see a young woman in Jastrow's duck-rabbit figure. Nonetheless, cognition can change our interpretation of either figure.
Perhaps more compelling are auditory demonstrations with certain impoverished speech signals called sine-wave speech (e.g., Darwin Reference Darwin1997; Remez et al. Reference Remez, Pardo, Piorkowski and Rubin2001). Most of these stimuli sound like strangely squeaking wheels until one is told that they are speech. But sometimes the listener must be told what the utterances are. Then, quite spectacularly, the phenomenology is one of listening to a particular utterance of speech. Unlike visual figures such as those from Jastrow and Hill, this is not a bistable phenomenon; once a person hears a sine-wave signal as speech, he or she cannot fully go back and hear these signals as mere squeaks. Is this not top-down?
Such phenomena – the bistability of certain visual figures and the asymmetric stability of these speechlike sounds, among many others – are not the results of confirmatory research. They are indeed the “amazing demonstrations” that F&S cry out for.
Second, visual neuroscience shows numerous examples of feedback projections to visual cortex, and feedback influences on visual neural processing, that F&S ignore. The primary visual cortex (V1) receives descending projections from a wide range of cortical areas. Although the strongest feedback signals come from nearby visual areas V3 and V4, V1 also receives feedback signals from V5/MT, parahippocampal regions, superior temporal parietal regions, auditory cortex (Clavagnier et al. Reference Clavagnier, Falchier and Kennedy2004), and the amygdala (Amaral et al. Reference Amaral, Behniea and Kelly2003), establishing that the brain shows pervasive top-down connectivity. The next step is to determine what perceptual function descending projections serve. F&S cite a single paper to justify ignoring a massive literature accomplishing this (sect. 2.2, para. 2).
Neurons in V1 exhibit differential responses to the same visual input under a variety of contextual modulations (e.g., David et al. Reference David, Vinje and Gallant2004; Hupé et al. Reference Hupé, James, Payne, Lomber, Girard and Bullier1998; Kapadia et al. Reference Kapadia, Ito, Gilbert and Westheimer1995; Motter Reference Motter1993). Numerous studies with adults have established that selective attention enhances processing of information at the attended location and suppresses distraction (Gandhi et al. Reference Gandhi, Heeger and Boynton1999; Kastner et al. Reference Kastner, Pinsk, De Weerd, Desimone and Ungerleider1999; Markant et al. Reference Markant, Worden and Amso2015b; Slotnick et al. Reference Slotnick, Schwarzbach and Yantis2003). This excitation/suppression mechanism improves the quality of early vision, enhancing contrast sensitivity, acuity, d-prime, and visual processing of attended information (Anton-Erxleben & Carrasco Reference Anton-Erxleben and Carrasco2013; Carrasco Reference Carrasco2011; Lupyan & Spivey Reference Lupyan and Spivey2010; Zhang et al. Reference Zhang, Jamison, Engel, He and He2011). This modulation of visual processing in turn supports improved encoding and recognition for attended information among adults (Rutman et al. Reference Rutman, Clapp, Chadick and Gazzaley2010; Uncapher & Rugg Reference Uncapher and Rugg2009; Zanto & Gazzaley Reference Zanto and Gazzaley2009) and infants (Markant & Amso Reference Markant and Amso2013; Reference Markant and Amso2016; Markant et al. Reference Markant, Oakes and Amso2015a). Recent data indicate that attentional biases can function at higher levels in the cognitive hierarchy (Chua & Gauthier Reference Chua and Gauthier2015), suggesting that attention can serve as a mechanism guiding vision based on category-level biases.
Results like these have spurred the visual neuroscience community to develop new theories to account for how feedback projections change the receptive field properties of neurons throughout visual cortex (Dayan et al. Reference Dayan, Hinton, Neal and Zemel1995; Friston Reference Friston2010; Gregory Reference Gregory1980; Jordan Reference Jordan2013; Kastner & Ungerleider Reference Kastner and Ungerleider2001; Kveraga et al. Reference Kveraga, Ghuman and Bar2007b; Rao & Ballard Reference Rao and Ballard1999; Spratling Reference Spratling2010). It is not clear how F&S's theory of visual perception can claim that recognition of visual input takes place without top-down influences, when the activity of neurons in the primary visual cortex is routinely modulated by contextual feedback signals from downstream cortical subsystems. The role of downstream projections is still under investigation, but theories of visual perception and experience ought to participate in understanding them rather than ignoring them.
F&S are incorrect when they conclude that it is “eminently plausible that there are no top-down effects of cognition on perception” (final paragraph). Indeed, F&S's argument is heavily recycled from a previous BBS contribution (Pylyshyn Reference Pylyshyn1999). Despite their attempt to distinguish their contribution from that one, it suffers from very similar weaknesses identified by past commentary (e.g., Bruce et al. Reference Bruce, Langton and Hill1999; Bullier Reference Bullier1999; Cavanagh Reference Cavanagh1999, among others). F&S are correct when they state early on that, “discovery of substantive top-down effects of cognition on perception would revolutionize our understanding of how the mind is organized” (abstract). Especially in the case of visual perception, that is exactly what has been happening in the field for these past few decades.