Bowers et al. discuss the limited connection between the psychological literature on human vision and recent work combining artificial neural networks (ANNs) and benchmark-based statistical evaluation. They are correct that the psychological literature has described behavioral signatures of human vision that ANNs should but do not currently explain. A model of human vision should ideally explain all available neural and behavioral data, including the unprecedentedly rich data from naturalistic benchmarks as well as data from experiments designed to address specific psychological hypotheses. None of the current models (ANNs, handcrafted computational models, and abstractly described psychological theories) meet this challenge.
Importantly, however, the failure of current ANNs to explain all available data does not amount to a refutation of neural network models in general. Falsifying the entire, highly expressive class of ANN models is impossible. ANNs are universal approximators of dynamical systems (Funahashi & Nakamura, 1993; Schäfer & Zimmermann, 2007) and hence can implement any potential computational mechanism. Future ANNs may contain different computational mechanisms that have not yet been explored. ANNs therefore are best understood not as a monolithic falsifiable theory but as a computational language in which particular falsifiable hypotheses can be expressed. Bowers et al.'s long list of cited studies presenting shortcomings of particular models neither demonstrates the failure of the ANN modeling framework in general nor a lack of openness of the field to falsifications of ANN models. Instead, their list of citations rather impressively illustrates the opposite: That the emerging ANN research program (referred to as “neuroconnectionism” in Doerig et al., 2022) is progressive in the sense of Lakatos: It generates a rich variety of falsifiable hypotheses (expressed in the language of ANNs) and advances through model comparison (Doerig et al., 2022). Each shortcoming drives improvement.
For example, the discovery of texture bias in ANNs (Geirhos et al., 2019) has led to a variety of alternative training methods that make ANNs rely more strongly on larger-scale structure in images (e.g., Geirhos et al., 2019; Hermann, Chen, & Kornblith, 2020; Nuriel, Benaim, & Wolf, 2021). Similarly, the discovery of adversarial susceptibility of ANNs (Szegedy et al., 2013) has motivated much research on perceptual robustness (e.g., Cohen, Rosenfeld, & Kolter, 2019; Guo et al., 2022; Madry, Makelov, Schmidt, Tsipras, & Vladu, 2019).
Bowers et al. create a false dichotomy between benchmark studies (e.g., Cichy, Roig, & Oliva, 2019; Kriegeskorte et al., 2008; Nonaka, Majima, Aoki, & Kamitani, 2021; Schrimpf et al., 2018) and controlled psychological experiments. Both approaches test model-based predictions of empirical data. Traditional psychological experiments are designed to test verbally defined theories, minimizing confounders of the independent variables of theoretical interest. In contrast, the numerous experimental conditions included in natural image behavioral and neural benchmarks are high-dimensional, complex, and ecologically relevant. Controlled experiments pose specific questions. They promise to give us theoretically important bits of information but are biased by theoretical assumptions and risk missing the computational challenge of task performance under realistic conditions (Newell, 1973; Olshausen & Field, 2005). Observational studies and experiments with large numbers of natural images pose more general questions. They promise evaluation of many models with comprehensive data under more naturalistic conditions, but risk inconclusive results because they are not designed to adjudicate among alternative computational mechanisms (Rust & Movshon, 2005). Between these extremes lies a rich space of neural and behavioral empirical tests for models of vision. The community should seek models that can account for data across this spectrum, not just one end of it.
Despite their widely discussed shortcomings (e.g., Lindsay, 2021; Peters & Kriegeskorte, 2021; Serre, 2019), ANNs are sometimes referred to as the “current best” models of human vision. This characterization is justified on both a priori and empirical grounds. A priori, ANNs are superior to verbally defined cognitive theories in that they are image-computable, that is, they are fully computationally specified and take images as input. These properties enable ANNs to make quantitative predictions about a broad range of empirical phenomena, rendering ANNs more amenable to falsification. Being fully computationally specified enables them to make quantitative predictions of neural and behavioral responses (an advantage shared with other cognitive computational models). Taking images as inputs enables ANNs to make predictions about neural and behavioral responses to arbitrary visual stimuli. A model that explains only a particular psychological phenomenon is a priori inferior, ceteris paribus, to a model that predicts data across a wide range of conditions and dependent measures. The discrepancies between human vision and current ANNs are “bugs” of particular models, but the fact that we can discover these bugs is a feature of image-computable ANNs, fueling empirical progress.
Since ANNs are image-computable, they enable severe tests of their predictions (superstimuli, adversarial examples, metamers; Bashivan, Kar, & DiCarlo, 2019; Dujmović, Malhotra, & Bowers, 2020; Feather, Durango, Gonzalez, & McDermott, 2019; Walker et al., 2019) and powerful model comparisons (controversial stimuli; Golan, Raju, & Kriegeskorte, 2020).
The empirical reason why ANNs can be called the “current best” models of human vision is that they offer unprecedented mechanistic explanations of the human capacity to make sense of complex, naturalistic inputs. Most basically, ANNs are currently the only models that can recognize objects, parse scenes, or identify faces at performance levels similar to human performance. Furthermore, they offer image-specific predictions of errors (e.g., Geirhos et al., 2021; Rajalingham et al., 2018) and reaction times (e.g., Spoerer, McClure, & Kriegeskorte, 2017). Their predictions are far from perfect but better than those of alternative models. Finally, the intermediate representations of ANNs currently best match the neural representations that underlie human visual capacities (e.g., Dwivedi, Bonner, Cichy, & Roig, 2021; Güçlü & van Gerven, 2015).
In sum, ANNs provide a language that enables us to express and test falsifiable computational models that have extraordinary power and can generalize to a broad range of empirical phenomena. Lakatos (1978) noted that all theories “are born refuted and die refuted” and stressed the importance of comparing competing theories in the light of the evidence. Our studies, then, should compare many models and report both their failures and their relative successes. It is through creation and comparison of many models that our field will progress.
Financial support
This research received no specific funding from any funding agency or commercial or not-for-profit entity.
Competing interest
None.