The target article proposes that deep neural networks (DNNs) be assessed using controlled experiments that evaluate changes in model behaviour as all but one variable is kept constant. Such experiments might provide information about the similarities and differences between brains and DNNs, and hence, spur development of DNNs better able to model the biological visual system. However, in reality work in deep learning is concerned with developing methods that work, irrespective of the biological plausibility of those methods: Deep learning is an engineering endeavour driven by the desire to produce DNNs that perform the “best.” Even in the subdomain where brain-like behaviour is a consideration (Schrimpf et al., 2020), the aim is still to produce DNNs that achieve the best performance. Hence, if controlled experiments were introduced, the results would almost certainly be summarised by a single value so that the performance of competing models could be ranked, and as a consequence there would be little to distinguish these new experimental methods from current ones.
The key issue is what is meant by “best” performance and how it is assessed. While training samples and supervision play a role in deep learning analogous to nurture during brain development, assessment plays a role analogous to that of evolution: Determining which DNNs are seen as successful, and hence, which will become the basis for future research efforts. The evaluation methods accepted as standard by a research community thus have a huge influence on progress in that field. Different evaluation methods might be adopted by different fields: For example, classification accuracy on unseen test data might be accepted in computer vision, while Brain-Score or the sort of controlled experiments advocated by the target article might be used to evaluate models of biological vision. However, as is comprehensively catalogued in the target article, current DNNs suffer from such a range of severe defects that they are clearly inadequate either as models of vision or as reliable methods for computer vision. Both research agendas would, therefore, benefit from more rigorous and comprehensive evaluation methods that can adequately gauge progress.
Given the gross deficits of current DNNs, it seems premature to assess them in fine detail against psychological and neurobiological data. Rather, their performance should be evaluated by testing the ability to generalise to changes in viewing conditions (Hendrycks & Dietterich, 2019; Michaelis et al., 2019; Mu & Gilmer, 2019; Shen et al., 2021), the ability to reject samples from categories that were not seen during training (Hendrycks & Gimpel, 2017; Vaze, Han, Vedaldi, & Zisserman, 2022), the ability to reject exemplars that are unlike images of any object (Kumano, Kera, & Yamasaki, 2022; Nguyen, Yosinski, & Clune, 2015), and robustness to adversarial attacks (Biggio & Roli, 2018; Croce & Hein, 2020; Szegedy et al., 2014).
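To make this kind of multi-axis evaluation concrete, the sketch below (my own illustrative code, not part of any existing benchmark suite) scores a classifier along several of these axes using only standard PyTorch operations. The data loaders for clean, corrupted, unseen-category and “fooling” images are placeholders that would be substituted with benchmarks such as those cited above; adversarial robustness would additionally require accuracy under an attack (e.g., AutoAttack; Croce & Hein, 2020) and is omitted for brevity.

```python
# Illustrative sketch only: assumes a trained PyTorch classifier and
# user-supplied DataLoaders for each evaluation axis.
import torch
import torch.nn.functional as F


@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Top-1 accuracy on a labelled loader (clean or corrupted images)."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total


@torch.no_grad()
def rejection_rate(model, loader, threshold=0.5, device="cpu"):
    """Fraction of inputs rejected as 'unknown' because the maximum softmax
    probability falls below `threshold` (in the spirit of Hendrycks & Gimpel,
    2017). Applied to unseen-category or fooling-image loaders, higher is better."""
    model.eval()
    rejected = total = 0
    for images, _ in loader:
        probs = F.softmax(model(images.to(device)), dim=1)
        rejected += (probs.max(dim=1).values < threshold).sum().item()
        total += images.size(0)
    return rejected / total


def report(model, clean, corrupted, unseen, fooling, device="cpu"):
    """Report each axis separately rather than collapsing them into one rank."""
    return {
        "clean_accuracy": accuracy(model, clean, device),
        "corruption_accuracy": accuracy(model, corrupted, device),
        "unseen_category_rejection": rejection_rate(model, unseen, device=device),
        "fooling_image_rejection": rejection_rate(model, fooling, device=device),
    }
```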
Methods already exist for testing generalisation and robustness of this type; the problem is that they are not routinely used, or that models are assessed using one benchmark but not others. The latter is particularly problematic, as there are likely to be trade-offs between performance on different tasks. The trade-off between adversarial robustness and clean accuracy is well known (Tsipras, Santurkar, Engstrom, Turner, & Madry, 2019), but others are also likely to exist. For example, improving the ability to reject unknown classes is likely to reduce performance on classifying novel samples from known classes, as such exemplars are more likely to be incorrectly seen as unknown. Hence, efforts to develop a model that is less deficient in one respect may be entirely wasted, as the resulting model may be more deficient in another respect. Only when the community routinely requires comprehensive evaluation of models for generalisation and robustness will progress be made in reducing the range of deficits exhibited by models. Once such progress has been made, it will be necessary to expand the range of assessments performed in order to effectively distinguish the performance of competing models and to spur further progress to address other deficiencies. The range of assessments might eventually be expanded to include neurophysiological and psychophysical tests.
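As a toy illustration of the open-set trade-off just described (the confidence values below are invented for illustration, not measurements from any model), raising the rejection threshold discards more unknown-category inputs but also discards more correctly classified samples from known classes:

```python
# Invented maximum-softmax confidences, for illustration only.
known_confidences = [0.95, 0.90, 0.62, 0.58, 0.40]    # images from classes seen in training
unknown_confidences = [0.70, 0.55, 0.45, 0.30, 0.20]  # images from unseen categories

for threshold in (0.35, 0.50, 0.65):
    kept_known = sum(c >= threshold for c in known_confidences) / len(known_confidences)
    rejected_unknown = sum(c < threshold for c in unknown_confidences) / len(unknown_confidences)
    print(f"threshold={threshold:.2f}  known kept: {kept_known:.0%}  unknown rejected: {rejected_unknown:.0%}")
```

At a threshold of 0.35 all known-class samples are kept but only 40% of unknowns are rejected; at 0.65, 80% of unknowns are rejected but 60% of known-class samples are wrongly discarded. A single aggregate score would hide exactly this tension, which is why each axis should be reported separately.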
The assessment regime advocated here can only be applied to models that are capable of processing images, and hence, would not be applicable to many models proposed in the psychology and neuroscience literatures. The target article advocates expanding assessment methods to allow such models to be evaluated and compared to DNNs. However, the ability to process images would seem to me to be a minimum requirement for a model of vision, and models that cannot be scaled to deal with images are not worth evaluating.
To perform well in terms of generalisation and robustness it seems likely that DNNs will require new mechanisms. As Bowers et al. say, it is unclear if suitable mechanisms can be learnt purely from the data. Indeed, even a model trained on 400 million images fails to generalise well (Radford et al., 2021). The target article also points out that biological visual systems do not need to learn many abilities (such as adversarial robustness, tolerance to viewpoint, etc.), and instead these abilities seem to be “built-in.” Brains contain many inductive biases: The nature side of the nature–nurture cooperation that underlies brain development. These biases underlie innate abilities and behaviours (Malhotra, Dujmović, & Bowers, 2022; Zador, 2019) and constrain and guide learning (Johnson, 1999; Zaadnoordijk, Besold, & Cusack, 2022). Hence, as advocated in the target article, and elsewhere (Hassabis, Kumaran, Summerfield, & Botvinick, 2017; Malhotra, Evans, & Bowers, 2020; Zador, 2019), biological insights can potentially inspire new mechanisms that will improve deep learning. However, work in deep learning does not need to be restricted to considering only inductive biases that are biologically inspired, especially as there are currently no suggestions as to how to implement many potentially useful mechanisms that humans appear to use. Indeed, if better models of biological vision are to be developed it is essential that work in neuroscience and psychology contribute useful insights. Unfortunately, the vast majority of such work so far has concentrated on cataloguing “where” and “when” events happen (where an event might be a physical action, neural spiking, fMRI activity, etc.). Such information is of no use to modellers who need information about “how” and “why.”
Financial support
This research received no specific grant from any funding agency, commercial, or not-for-profit sectors.
Competing interest
None.