Computational modeling has long been used by psychologists to test hypotheses about human cognition and behavior. Prior to the recent rise of deep neural networks (DNNs), most computational models were handcrafted by scientists, who determined their parameters and features. In vision sciences, these models were used to test hypotheses about the mechanisms that enable human object recognition. However, these handcrafted models used simple, engineer-designed features (e.g., Gabor wavelets), which produced low-level representations that could not account for human-level, view-invariant object recognition (Biederman & Kalocsai, 1997; Turk & Pentland, 1991). The main advantage of DNNs over these traditional models is not only that they reach human-level performance in object recognition, but that they do so through hierarchical processing of the visual input that generates high-level, view-invariant visual features. These high-level features are the “missing link” between the low-level representations of the handcrafted models and human-level object classification. They therefore offer psychologists an unprecedented opportunity to test hypotheses about the origin and nature of these high-level representations, which until now were unavailable for exploration.
In the target article, Bowers et al. propose that psychologists should abandon DNNs as models of human vision because they do not produce some of the perceptual effects that are found in humans. However, many of the listed perceptual effects that DNNs fail to produce are also not produced by the traditional handcrafted computational vision models that have been prevalently used to model human vision. Furthermore, although current DNNs are primarily developed for engineering purposes (i.e., best performance), there are myriad ways in which they could and should be modified to better resemble the human mind. For example, current DNNs that are often used to model human face and object recognition (Khaligh-Razavi & Kriegeskorte, 2014; O'Toole & Castillo, 2021; Yamins & DiCarlo, 2016) are trained on static images (Cao, Shen, Xie, Parkhi, & Zisserman, 2018; Deng et al., 2009), whereas human face and object recognition operate on a continuous stream of dynamic, multi-modal information. One recently suggested way to close this gap is to train DNNs on movies generated by head-mounted cameras attached to infants' foreheads (Fausey, Jayaraman, & Smith, 2016), to better model the development of the human visual system (Smith & Slone, 2017). Training DNNs initially on blurred images has also provided insights into the potential advantage of the initial low acuity of infants' vision (Vogelsang et al., 2018). 
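The blurred-image training regime can be sketched as a simple curriculum in which the blur applied to training images decreases as training progresses, loosely mimicking the gradual improvement of infant visual acuity. The linear schedule and the box-blur stand-in below are illustrative assumptions, not the procedure used by Vogelsang et al.:

```python
import numpy as np

def blur_sigma(epoch, total_epochs, start_sigma=4.0):
    """Hypothetical curriculum: blur strength decreases linearly over
    training, reaching zero (sharp images) by the final epoch."""
    return start_sigma * max(0.0, 1.0 - epoch / total_epochs)

def box_blur(img, radius):
    """Crude separable box blur over a 2-D grayscale image, standing in
    for the low spatial acuity of early infant vision."""
    if radius < 1:
        return img
    k = 2 * int(radius) + 1
    kernel = np.ones(k) / k
    # Blur each row, then each column (separable filtering).
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, out)
    return out
```

In a training loop, each batch would be passed through `box_blur` with a radius derived from `blur_sigma(epoch, total_epochs)` before being fed to the network, so early training sees only coarse structure.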
These and many other modifications (e.g., multi-modal, self-supervised image-language training; Radford et al., 2021) to the way DNNs are built and trained may generate perceptual effects that are more human-like (Shoham, Grosbard, Patashnik, Cohen-Or, & Yovel, 2022). Yet even current DNNs can advance our understanding of the nature of the high-level representations that are required for face and object recognition (Abudarham, Grosbard, & Yovel, 2021; Hill et al., 2019), which are still undefined in current neural and cognitive models. This significant computational achievement should not be dismissed.
Bowers et al. further claim that DNNs should be used to test hypotheses rather than solely to make predictions. We fully agree, and further propose that psychologists are best suited to apply this approach by utilizing the same procedures they have used for decades to test hypotheses about the hidden representations of the human mind. Since the early days of psychological science, psychologists have developed a range of elegant experimental and stimulus manipulations to study human vision. The same procedures can now be used to explore the nature of DNNs' high-level hidden representations as potential models of the human mind (Ma & Peters, 2020). For example, the face inversion effect is a robust, extensively studied, and well-established effect in human vision, which refers to the disproportionately large drop in performance that humans show for upside-down compared with upright faces (Cashon & Holt, 2015; Farah, Tanaka, & Drain, 1995; Yin, 1969). Because the low-level features extracted by handcrafted algorithms are similar for upright and inverted faces, these traditional models do not reproduce this effect. Interestingly, a human-like face inversion effect that is larger than an object inversion effect is found in DNNs (Dobs, Martinez, Yuhan, & Kanwisher, 2022; Jacob, Pramod, Katti, & Arun, 2021; Tian, Xie, Song, Hu, & Liu, 2022; Yovel, Grosbard, & Abudarham, 2023). Thus, we can now use the same stimulus and task manipulations that were used to study this effect in numerous human studies to test hypotheses about the mechanisms that may underlie this perceptual effect. 
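As a minimal sketch, measuring an inversion effect in a DNN amounts to comparing face-matching accuracy on upright image pairs with accuracy on the same pairs flipped upside-down. The random-projection `embed` function below is a hypothetical stand-in for a pretrained face-recognition network, not any model used in the studies cited above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained face-recognition DNN:
# a fixed random projection from flattened 32x32 pixels to 64-D embeddings.
W = rng.standard_normal((32 * 32, 64))

def embed(images):
    """Map a batch of 32x32 images to L2-normalized embedding vectors."""
    x = images.reshape(len(images), -1) @ W
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def match_accuracy(pairs_a, pairs_b, same_identity, threshold=0.5):
    """Classify each image pair as same/different identity by thresholding
    the cosine similarity of the two embeddings."""
    sims = np.sum(embed(pairs_a) * embed(pairs_b), axis=1)
    return float(np.mean((sims > threshold) == same_identity))

def inversion_effect(pairs_a, pairs_b, same_identity):
    """Drop in matching accuracy when the same pairs are shown upside-down.
    A clearly positive value indicates a face inversion effect."""
    upright = match_accuracy(pairs_a, pairs_b, same_identity)
    # Flip each image vertically to invert it.
    inverted = match_accuracy(pairs_a[:, ::-1], pairs_b[:, ::-1], same_identity)
    return upright - inverted
```

With this toy embedder the effect will hover around zero; the comparison becomes meaningful when a face-trained DNN is substituted for `embed`, and the same stimulus manipulations used in human studies (inversion, scrambling, spacing changes) can then be applied unchanged.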
Moreover, by manipulating DNNs' training diet, we can examine what type of experience is needed to generate this human-like perceptual effect, which is impossible to test in humans, whose perceptual experience we cannot control. Such an approach has recently been used to address a long-standing debate in cognitive sciences about the domain-specificity versus expertise hypotheses of face recognition (Kanwisher, Gupta, & Dobs, 2023; Yovel et al., 2023).
It was psychologists, not engineers, who first designed these neural networks to model human intelligence (McClelland, McNaughton, & O'Reilly, 1995; Rosenblatt, 1958; Rumelhart, Hinton, & Williams, 1986). It has taken more than 60 years since the psychologist Frank Rosenblatt published his report on the perceptron for technology to reach its present state, in which these hierarchically structured algorithms can be used to study the complexity of human vision. Abandoning DNNs would be a huge oversight for cognitive scientists, who can contribute considerably to the development of more human-like DNNs. It is therefore important that psychologists join the artificial intelligence (AI) research community and study these models in collaboration with engineers and computer scientists. This is a unique time in the history of cognitive sciences, when scientists from these different disciplines share an interest in the same type of computational models that can advance our understanding of human cognition. This opportunity should not be missed by psychological sciences.
Financial support
This study was funded by ISF grant 971/21 and joint NSFC-ISF grant 2383/18 to G. Y.
Competing interest
None.