To say that items do not play a role in visual search is to admit defeat. Even though we lack a precise definition of “item,” it is clear that people do parse their visual environment into objects (the real-world equivalent of items in visual search). In this commentary, we will review evidence that items are essential in visual search; furthermore, we will argue that computer vision – especially deep learning – may offer a solution for the lack of a solid definition of “item.”
In the model of Hulleman & Olivers (H&O), search proceeds on the basis of fixations that are used to scan a visual scene for a target. Although we appreciate its parsimony, the model lacks a crucial aspect of visual search: the decision of where to look next. The model simply assumes that an arbitrary new location is selected. Yet there is abundant evidence that fixation selection is not random but rather results from the integration of top-down and bottom-up influences in a common saccade map (Meeter et al. 2010; Trappenberg et al. 2001). That is, we look mostly at things that are salient or behaviorally relevant (Theeuwes et al. 1998) and, crucially, behavioral relevance is related to how we parse visual input into items (e.g., Einhäuser et al. 2008). Consider repetition priming: People preferentially look at distractor items that resemble previous target items (Becker et al. 2009; Meeter & Van der Stigchel 2013); or its complement, negative priming: People avoid distractor items that resemble previous distractor items (Kristjánsson & Driver 2008). In addition, there are many object-based attention effects in visual search. For example, we tend to shift our attention and gaze within, rather than between, objects (Egly et al. 1994; Theeuwes et al. 2010); and, if an attended object moves, the focus of attention follows (Theeuwes et al. 2013). We could list even more object-based effects, but our main point is: Items matter, whether we know how to define them or not. Therefore, by denying a role for items in visual search, H&O ignore, or at least downplay the importance of, a substantial part of the visual-search literature.
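To make this contrast concrete, the sketch below (our own illustration in Python; the map names and weights are hypothetical, and this is not part of H&O's model) shows fixation selection as the maximum of a priority map that combines bottom-up salience and top-down relevance, rather than as the selection of an arbitrary new location.

```python
# Illustrative sketch (not H&O's model): the next fixation is the peak of a
# priority map that sums bottom-up salience and top-down relevance, minus
# inhibition for locations that have already been inspected.
import numpy as np

def next_fixation(salience, relevance, inhibition, w_bottom_up=1.0, w_top_down=1.0):
    """Pick the next fixation (row, column) from a combined saccade/priority map.

    salience   -- 2D array of bottom-up salience (e.g., local contrast)
    relevance  -- 2D array of top-down relevance (e.g., match to target features)
    inhibition -- 2D array penalizing already-fixated locations
    """
    priority = w_bottom_up * salience + w_top_down * relevance - inhibition
    return np.unravel_index(np.argmax(priority), priority.shape)

# Toy usage: a 10 x 10 scene in which one location is both salient and relevant.
salience = np.random.rand(10, 10)
relevance = np.zeros((10, 10)); relevance[3, 7] = 2.0
inhibition = np.zeros((10, 10))
print(next_fixation(salience, relevance, inhibition))  # most likely (3, 7)
```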
But how can there ever be a role for items in models of visual search if we do not even know what “item” means? Possibly, our language simply lacks the vocabulary to define “item” or “object.” Many researchers, such as David Marr, have speculated that it is impossible to define “object” (Marr 1982) – and we agree. But rather than abandon items altogether (and admit defeat!), we should adopt recent computational approaches to object recognition as an alternative to formal definitions.
Consider a modern deep-learning network: an artificial neural network that consists of many nodes across many layers. (We will not discuss one specific network, but focus on the general architecture that is shared by most networks.) Such models are inspired by the architecture of the human visual system: they implement a complex arrangement of nodes, each of which looks at only a small portion of the input image (Krizhevsky et al. 2012). First, the network is trained on a large set of example images, which can be labeled (e.g., Krizhevsky et al. 2012), unlabeled (e.g., Le et al. 2012), or a mix of both (LeCun et al. 2010). Crucially, in all cases training occurs by example, without explicit definitions. Next, when the trained network is presented with an image, nodes in the lowest layers respond to simple features, such as edges and specific orientations (Lee et al. 2009), reminiscent of neurons in early visual cortex (Hubel & Wiesel 1959). Nodes in higher layers of the network respond to progressively more complex features, until, near the top of the network, nodes have become highly selective object detectors; for example, a node may respond selectively to faces, cats, human body parts, cars, and so forth (Le et al. 2012). These nodes are reminiscent of neurons in the temporal cortex, which also respond selectively to object categories such as faces or hands (Desimone et al. 1984). Importantly, deep-learning networks detect objects in the very real-world scenes that H&O consider problematic (He et al. 2015; Krizhevsky et al. 2012); and they do so without explicit definitions, seemingly like humans do.
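For concreteness, this shared architecture can be sketched in a few lines of PyTorch. The sketch is our own minimal illustration; the layer sizes and the 1,000-category output are assumptions in the spirit of ImageNet-style networks, not a description of any particular published model.

```python
# A minimal sketch of the generic architecture described above (PyTorch).
import torch
from torch import nn

deep_net = nn.Sequential(
    # Lower layers: small receptive fields, respond to edges and orientations.
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # Intermediate layers: larger receptive fields, progressively more complex features.
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # Top layer: after training on example images, individual nodes behave as
    # selective detectors for object categories (faces, cats, cars, ...).
    nn.Flatten(),
    nn.Linear(128 * 28 * 28, 1000),
)

# A 224 x 224 RGB image yields one activation per candidate object category.
image = torch.rand(1, 3, 224, 224)
activations = deep_net(image)   # shape: (1, 1000)
```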
Combining deep-learning networks with traditional visual-search models could explain how people explore their environment, item by item. As a starting point, we could take the model of H&O and replace their bag of items with the active nodes in the high layers of a deep-learning network – that is, nodes that respond selectively to high-level features of the input (for example, cats) and whose activation exceeds a certain threshold (Le et al. 2012). This would provide H&O's model with a bag of items to search through, without the model ever being fed a definition of “item.” Of course, in its simplest form, this combined model is far from perfect. First, it does not explain object-based effects of the kind discussed above. Second, it assumes that the entire visual field is parsed at once and does not take eye movements into account – the very assumption that H&O rightfully want to get away from. But this simple combined model would be a good starting point for uniting cognitive psychology with computer vision. And once principles from both disciplines are combined, improvements readily come to mind. For example, the deep-learning network could be fed eye-centered visual input that takes into account the functional viewing window.
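As a sketch of what this combination could look like (again our own illustration, with hypothetical labels, threshold, and helper functions, continuing the PyTorch example above): thresholded top-layer activations provide the bag of items, which a simple search loop then inspects one item at a time.

```python
# Illustrative sketch of the proposed combination (ours, not an implementation of
# H&O's model): the "items" are top-layer nodes whose activation exceeds a threshold.
import torch

def bag_of_items(top_layer_activations, labels, threshold=0.5):
    """Return the category labels of top-layer nodes with above-threshold activation."""
    active = (top_layer_activations > threshold).nonzero(as_tuple=True)[0]
    return [labels[i] for i in active.tolist()]

def search(bag, target):
    """Inspect the items one by one until the target is found (or the bag is exhausted)."""
    for n_inspections, item in enumerate(bag, start=1):
        if item == target:
            return n_inspections   # number of inspections needed to find the target
    return None                    # target absent

# Toy usage with hypothetical category labels and activations.
labels = ["face", "cat", "car", "tree"]
top_layer_activations = torch.tensor([0.1, 0.9, 0.7, 0.2])
print(search(bag_of_items(top_layer_activations, labels), target="cat"))  # -> 1
```

In a less simplistic version, the order in which items are inspected could itself be driven by the kind of priority map sketched earlier, rather than by the arbitrary order assumed here.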
In conclusion, we feel that H&O have been too quick to admit defeat. They have constructed a parsimonious model that explains visual-search behavior well without requiring items. Now all we need to do is put the item back in.