The target article proposes that deep neural networks (DNNs) be assessed using controlled experiments that evaluate changes in model behaviour as all but one variable is kept constant. Such experiments might provide information about the similarities and differences between brains and DNNs, and hence, spur development of DNNs better able to model the biological visual system. However, in reality work in deep learning is concerned with developing methods that work, irrespective of the biological plausibility of those methods: Deep learning is an engineering endeavour driven by the desire to produce DNNs that perform the “best.” Even in the subdomain where brain-like behaviour is a consideration (Schrimpf et al., 2020), the aim is still to produce DNNs that achieve the best performance. Hence, if controlled experiments were introduced, the results would almost certainly be summarised by a single value so that the performance of competing models could be ranked, and as a consequence there would be little to distinguish these new experimental methods from current ones.
The key issue is what is meant by “best” performance and how it is assessed. While training samples and supervision play a role in deep learning analogous to nurture during brain development, assessment plays a role analogous to that of evolution: Determining which DNNs are seen as successful, and hence, which will become the basis for future research efforts. The evaluation methods accepted as standard by a research community thus have a huge influence on progress in that field. Different evaluation methods might be adopted by different fields: For example, classification accuracy on unseen test data might be accepted in computer vision, while Brain-Score or the sort of controlled experiments advocated by the target article might be used to evaluate models of biological vision. However, as is comprehensively catalogued in the target article, current DNNs suffer from such a range of severe defects that they are clearly inadequate either as models of vision or as reliable methods for computer vision. Both research agendas would, therefore, benefit from more rigorous and comprehensive evaluation methods that can adequately gauge progress.
Given the gross deficits of current DNNs, it seems premature to assess them in fine detail against psychological and neurobiological data. Rather, their performance should be evaluated by testing the ability to generalise to changes in viewing conditions (Hendrycks & Dietterich, 2019; Michaelis et al., 2019; Mu & Gilmer, 2019; Shen et al., 2021), the ability to reject samples from categories that were not seen during training (Hendrycks & Gimpel, 2017; Vaze, Han, Vedaldi, & Zisserman, 2022), the ability to reject exemplars that are unlike images of any object (Kumano, Kera, & Yamasaki, 2022; Nguyen, Yosinski, & Clune, 2015), and robustness to adversarial attacks (Biggio & Roli, 2018; Croce & Hein, 2020; Szegedy et al., 2014).
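To make this kind of multi-axis evaluation concrete, the sketch below (my own illustrative code, not part of any existing benchmark suite) scores a classifier along several of these axes using only standard PyTorch operations. The data loaders for clean, corrupted, unseen-category and “fooling” images are placeholders that would be substituted with benchmarks such as those cited above; adversarial robustness would additionally require accuracy under an attack (e.g., AutoAttack; Croce & Hein, 2020) and is omitted for brevity.

```python
# Illustrative sketch only: assumes a trained PyTorch classifier and
# user-supplied DataLoaders for each evaluation axis.
import torch
import torch.nn.functional as F


@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Top-1 accuracy on a labelled loader (clean or corrupted images)."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total


@torch.no_grad()
def rejection_rate(model, loader, threshold=0.5, device="cpu"):
    """Fraction of inputs rejected as 'unknown' because the maximum softmax
    probability falls below `threshold` (in the spirit of Hendrycks & Gimpel,
    2017). Applied to unseen-category or fooling-image loaders, higher is better."""
    model.eval()
    rejected = total = 0
    for images, _ in loader:
        probs = F.softmax(model(images.to(device)), dim=1)
        rejected += (probs.max(dim=1).values < threshold).sum().item()
        total += images.size(0)
    return rejected / total


def report(model, clean, corrupted, unseen, fooling, device="cpu"):
    """Report each axis separately rather than collapsing them into one rank."""
    return {
        "clean_accuracy": accuracy(model, clean, device),
        "corruption_accuracy": accuracy(model, corrupted, device),
        "unseen_category_rejection": rejection_rate(model, unseen, device=device),
        "fooling_image_rejection": rejection_rate(model, fooling, device=device),
    }
```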
Methods already exist for testing generalisation and robustness of this type; the problem is that they are not routinely used, or that models are assessed using one benchmark but not others. The latter is particularly problematic, as there are likely to be trade-offs between performance on different tasks. The trade-off between adversarial robustness and clean accuracy is well known (Tsipras, Santurkar, Engstrom, Turner, & Madry, 2019), but others are also likely to exist. For example, improving the ability to reject unknown classes is likely to reduce performance on classifying novel samples from known classes, as such exemplars are more likely to be incorrectly seen as unknown. Hence, efforts to develop a model that is less deficient in one respect may be entirely wasted, as the resulting model may be more deficient in another respect. Only when the community routinely requires comprehensive evaluation of models for generalisation and robustness will progress be made in reducing the range of deficits exhibited by models. Once such progress has been made, it will be necessary to expand the range of assessments performed in order to effectively distinguish the performance of competing models and to spur further progress to address other deficiencies. The range of assessments might eventually be expanded to include neurophysiological and psychophysical tests.
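As a toy illustration of the open-set trade-off just described (the confidence values below are invented for illustration, not measurements from any model), raising the rejection threshold discards more unknown-category inputs but also discards more correctly classified samples from known classes:

```python
# Invented maximum-softmax confidences, for illustration only.
known_confidences = [0.95, 0.90, 0.62, 0.58, 0.40]    # images from classes seen in training
unknown_confidences = [0.70, 0.55, 0.45, 0.30, 0.20]  # images from unseen categories

for threshold in (0.35, 0.50, 0.65):
    kept_known = sum(c >= threshold for c in known_confidences) / len(known_confidences)
    rejected_unknown = sum(c < threshold for c in unknown_confidences) / len(unknown_confidences)
    print(f"threshold={threshold:.2f}  known kept: {kept_known:.0%}  unknown rejected: {rejected_unknown:.0%}")
```

At a threshold of 0.35 all known-class samples are kept but only 40% of unknowns are rejected; at 0.65, 80% of unknowns are rejected but 60% of known-class samples are wrongly discarded. A single aggregate score would hide exactly this tension, which is why each axis should be reported separately.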
The assessment regime advocated here can only be applied to models that are capable of processing images, and hence, would not be applicable to many models proposed in the psychology and neuroscience literatures. The target article advocates expanding assessment methods to allow such models to be evaluated and compared to DNNs. However, the ability to process images would seem to me to be a minimum requirement for a model of vision, and models that cannot be scaled to deal with images are not worth evaluating.
To perform well in terms of generalisation and robustness it seems likely that DNNs will require new mechanisms. As Bowers et al. say, it is unclear if suitable mechanisms can be learnt purely from the data. Indeed, even a model trained on 400 million images fails to generalise well (Radford et al., 2021). The target article also points out that biological visual systems do not need to learn many abilities (such as adversarial robustness, tolerance to viewpoint, etc.), and instead these abilities seem to be “built-in.” Brains contain many inductive biases: The nature side of the nature–nurture cooperation that underlies brain development. These biases underlie innate abilities and behaviours (Malhotra, Dujmović, & Bowers, 2022; Zador, 2019) and constrain and guide learning (Johnson, 1999; Zaadnoordijk, Besold, & Cusack, 2022). Hence, as advocated in the target article, and elsewhere (Hassabis, Kumaran, Summerfield, & Botvinick, 2017; Malhotra, Evans, & Bowers, 2020; Zador, 2019), biological insights can potentially inspire new mechanisms that will improve deep learning. However, work in deep learning does not need to be restricted to considering only inductive biases that are biologically inspired, especially as there are currently no suggestions as to how to implement many potentially useful mechanisms that humans appear to use. Indeed, if better models of biological vision are to be developed it is essential that work in neuroscience and psychology contribute useful insights. Unfortunately, the vast majority of such work so far has concentrated on cataloguing “where” and “when” events happen (where an event might be a physical action, neural spiking, fMRI activity, etc.). Such information is of no use to modellers who need information about “how” and “why.”
Financial support
This research received no specific grant from any funding agency, commercial, or not-for-profit sectors.
Competing interest
None.