The target article comprehensively deconstructs common misconceptions, such as the ideas that models of human vision can be reduced to mechanisms of object recognition, or that useful analogies between neuronal and artificial architectures can be drawn solely from accuracy scores and their correlations with brain activity. We fully agree that such oversimplifications need to be avoided if deep neural networks (DNNs) are to be considered accurate models of vision. Troubles stemming from similar oversimplifications are well known in consciousness research. One of the main obstacles for the field is separating the mechanisms that process visual information from those that transform it into the conscious activity of seeing. Here, we offer a high-level outlook on human vision from this perspective. We believe it could serve as a guiding principle for building more ecologically valid artificial models. It would also lead to better testing criteria for assessing the similarities and differences between humans and DNNs that go beyond object recognition.
Intuitively, when presented with an object, we first see it in all of its details and only then recognize it. However, experimental evidence suggests that, under carefully controlled conditions, individuals can correctly categorize objects while denying seeing them (Lamme, Reference Lamme2020). The discrepancy between objective performance (i.e., correct categorization) and the subjective experience of seeing convincingly illustrates the presence of unconscious processing of perceptual information (Mudrik & Deouell, Reference Mudrik and Deouell2022). It also highlights that categorization may refer to different neural processes depending on the type of object. Face identification is a common example of the fast, automatic processing of a complex set of features that allows us to easily recognize each other. It also demonstrates the problems with taking brain activity as an indicator of successful perception. The fusiform gyrus is selectively activated when participants are presented with images of faces (Fahrenfort et al., Reference Fahrenfort, Snijders, Heinen, van Gaal, Scholte and Lamme2012; Haxby, Hoffman, & Gobbini, Reference Haxby, Hoffman and Gobbini2000). However, this activation can be found even when the participant reports no perception (Axelrod, Bar, & Rees, Reference Axelrod, Bar and Rees2015). Similarly specific neural activations can be observed in response to other complex stimuli (e.g., one's name) during sleep (Andrillon & Kouider, Reference Andrillon and Kouider2020). Therefore, while behavioural responses and brain activity can provide insights into the extent of processing evoked by certain stimuli, they do not equate to conscious vision.
Feature extraction and object categorization are not the only visual processes that can occur without consciousness. There is evidence of interactions between already differentiated objects that alter each other's neural responses when placed close together in the visual field (Lamme, Reference Lamme2020). This includes illusions like the Kanizsa triangle, which requires the integration of multiple objects (Wang, Weng, & He, Reference Wang, Weng and He2012). However, these processes seem to be restricted to local features and are absent when processing requires the integration of information from larger parts of the visual scene. This is precisely where conscious perception starts to play a role, enabling the organization of distinct elements in the visual field into a coherent scene (e.g., figure-ground differentiation; Lamme, Zipser, & Spekreijse, Reference Lamme, Zipser and Spekreijse2002). Experimental evidence suggests that conscious vision allows for better integration of spatially or temporally distributed information, as well as higher precision of visual representations (Ludwig, Reference Ludwig2023). A coherent scene can then be used to guide adequate actions and predict future events. From this perspective, while object recognition is an essential part of the visual processing pipeline, it cannot fulfil the representational function of vision alone.
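To make this kind of test concrete, the following is a minimal sketch (our illustration, not a method from the target article) of how Kanizsa-style stimuli could be generated programmatically to probe whether a DNN is sensitive to illusory contours; the function name and parameters are our own, and the Pillow library is assumed to be available.

```python
from PIL import Image, ImageDraw

def kanizsa_triangle(size: int = 224, radius: int = 30) -> Image.Image:
    """Generate a Kanizsa-triangle stimulus: three black 'pac-man'
    inducer discs whose cut-out wedges imply a white triangle that
    has no physical contour of its own."""
    img = Image.new("L", (size, size), color=255)  # white canvas
    draw = ImageDraw.Draw(img)
    # The disc centres form the vertices of the illusory triangle.
    vertices = [(size // 2, size // 5),
                (size // 5, 4 * size // 5),
                (4 * size // 5, 4 * size // 5)]
    for x, y in vertices:
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=0)
    # Painting a white triangle over the disc centres cuts the wedges,
    # leaving the classic illusory-contour configuration.
    draw.polygon(vertices, fill=255)
    return img
```

A control condition could scramble the orientations of the inducers, so that any difference in a model's response can be attributed to the illusory contour rather than to the local pac-man features.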
Another notion that complicates comparisons between humans and DNNs is temporal integration. Our perception is trained from birth on continuous perceptual input that is highly temporally correlated. Scenes are not part of a randomized stream of unrelated snapshots. Temporal integration enables our visual system to augment the processing of stimuli with information extracted from the immediate past. Such information can involve, for example, changes in the relative positions of individuals or objects. This leads to one of the crucial discrepancies between human and artificial vision (the target article identifies aspects of it in sect. 4.1.1–4.1.7). DNNs are built to classify ensembles of pixels in a digital image, while human brains interpret visual input as two-dimensional (2D) projections of three-dimensional (3D) objects. This imposes restrictions on the possible interpretations of perceptual stimuli (which can lead to mistakes) but ultimately allows the visual system not to rely solely on immediate physical stimulation. This in turn makes perception more stable and useful in the context of interactions with the environment. These processes may occur without human-like consciousness. However, consciousness seems to increase the temporal integration of stimuli, strongly shaping the outcome of visual processing.
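As a toy illustration of what we mean by temporal integration (a sketch under our own assumptions, not a claim about how the brain implements it), a per-frame feature embedding can be blended with a decaying trace of the recent past, so that the current "percept" is never a function of the present frame alone. The class name and the choice of an exponential trace are illustrative.

```python
import torch

class TemporalIntegrator(torch.nn.Module):
    """Blend each incoming frame embedding with an exponentially
    decaying trace of previous frames, so the integrated
    representation carries information from the immediate past."""

    def __init__(self, dim: int, decay: float = 0.8):
        super().__init__()
        self.decay = decay
        self.register_buffer("trace", torch.zeros(dim))

    def forward(self, frame_embedding: torch.Tensor) -> torch.Tensor:
        # Analogous to perceiving temporally correlated input rather
        # than a randomized stream of unrelated snapshots.
        self.trace = self.decay * self.trace + (1 - self.decay) * frame_embedding
        return self.trace
```

One would expect that feeding a sequence of slightly jittered views of the same object through such a trace yields more stable downstream predictions than classifying each frame independently.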
In this commentary, we aimed to justify why consciousness should be taken into account when modelling human vision with DNNs. Similar inspirations from cognitive science have proven very successful in the recent past, as in the case of attention (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez and Polosukhin2017), and some researchers have already proposed consciousness-like mechanisms (Bengio, Reference Bengio2019). However, even in healthy humans, reliable measurement of consciousness is difficult both theoretically (Seth, Dienes, Cleeremans, Overgaard, & Pessoa, Reference Seth, Dienes, Cleeremans, Overgaard and Pessoa2008) and methodologically (Wierzchoń, Paulewicz, Asanowicz, Timmermans, & Cleeremans, Reference Wierzchoń, Paulewicz, Asanowicz, Timmermans and Cleeremans2014). The task is even more challenging if one were to attempt to implement such measurement in artificial neural networks (Timmermans, Schilbach, Pasquali, & Cleeremans, Reference Timmermans, Schilbach, Pasquali and Cleeremans2012). Nevertheless, probing the capability of DNNs to realize functions connected to conscious vision might prove necessary for comparisons between DNNs and humans. To make such comparisons more feasible, we propose a rudimentary operationalization of subjective experience as "context dependence." In the case of visual perception, context can be defined very broadly as all the spatially or temporally distant elements of a visual scene that alter its processing. This also suggests that the global integration of perceptual features is a good approximation of the unifying function of conscious vision. Interestingly, we note that most of the phenomena mentioned in sect. 4.2 of the target article can be reformulated as examples of some form of context dependence, making this overarching principle easy to convey. Showing that DNNs are similar to humans, that is, that they are selectively susceptible to illusions, alter categorization based on other objects in the scene, or demonstrate object invariance, would be a strong argument in favour of functional similarity.
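As a first pass at this operationalization, context dependence could be quantified as the divergence between a model's predictions for an object seen with its surrounding scene and for the same object cropped tightly out of it. The sketch below assumes a torchvision ResNet-50 purely as a stand-in backbone; the function names, the margin parameter, and the choice of KL divergence are our own illustrative assumptions.

```python
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

# A pretrained classifier as a stand-in backbone (an assumption of
# this sketch, not a model endorsed by the target article).
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

def class_probs(img: Image.Image) -> torch.Tensor:
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).softmax(dim=1).squeeze(0)

def context_dependence(scene: Image.Image, box, margin: int = 64) -> float:
    """Compare predictions for an object cropped tightly (no context)
    versus cropped with a margin of surrounding scene (context).
    A fully context-insensitive model would score near zero."""
    left, top, right, bottom = box
    with_context = scene.crop((max(left - margin, 0), max(top - margin, 0),
                               min(right + margin, scene.width),
                               min(bottom + margin, scene.height)))
    p_context = class_probs(with_context)
    p_isolated = class_probs(scene.crop(box))
    # KL divergence as a rudimentary index of how much the surrounding
    # scene altered the processing of the same object.
    return torch.nn.functional.kl_div(
        p_isolated.log(), p_context, reduction="sum").item()
```

A human-like profile on such a measure would show sensitivity to semantically relevant context (e.g., other objects in the scene) rather than to arbitrary pixels near the object.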
Competing interest
None.