To say that items do not play a role in visual search is to admit defeat. Even though we lack a precise definition of “item,” it is clear that people do parse their visual environment into objects (the real-world equivalent of items in visual search). In this commentary, we will review evidence that items are essential in visual search; furthermore, we will argue that computer vision – especially deep learning – may offer a solution for the lack of a solid definition of “item.”
In the model of Hulleman & Olivers (H&O), search proceeds on the basis of fixations that are used to scan a visual scene for a target. Although we appreciate its parsimony, the model lacks a crucial aspect of visual search: the decision of where to look next. The model simply assumes that an arbitrary new location is selected. Yet there is abundant evidence that fixation selection is not random but rather results from the integration of top-down and bottom-up influences in a common saccade map (Meeter et al. 2010; Trappenberg et al. 2001). That is, we look mostly at things that are salient or behaviorally relevant (Theeuwes et al. 1998) and, crucially, behavioral relevance is related to how we parse visual input into items (e.g., Einhäuser et al. 2008). Consider repetition priming: People preferentially look at distractor items that resemble previous target items (Becker et al. 2009; Meeter & Van der Stigchel 2013); or its complement, negative priming: People avoid distractor items that resemble previous distractor items (Kristjánsson & Driver 2008). In addition, there are many object-based attention effects in visual search. For example, we tend to shift our attention and gaze within, rather than between, objects (Egly et al. 1994; Theeuwes et al. 2010); and, if an attended object moves, the focus of attention follows (Theeuwes et al. 2013). We could list even more object-based effects, but our main point is: Items matter, whether we know how to define them or not. Therefore, by denying a role for items in visual search, H&O ignore, or at least downplay the importance of, a substantial part of the visual-search literature.
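To make this contrast concrete, the sketch below (our own illustration in Python; the map names and weights are hypothetical, and this is not part of H&O's model) shows fixation selection as the maximum of a priority map that combines bottom-up salience and top-down relevance, rather than as the selection of an arbitrary new location.

```python
# Illustrative sketch (not H&O's model): the next fixation is the peak of a
# priority map that sums bottom-up salience and top-down relevance, minus
# inhibition for locations that have already been inspected.
import numpy as np

def next_fixation(salience, relevance, inhibition, w_bottom_up=1.0, w_top_down=1.0):
    """Pick the next fixation (row, column) from a combined saccade/priority map.

    salience   -- 2D array of bottom-up salience (e.g., local contrast)
    relevance  -- 2D array of top-down relevance (e.g., match to target features)
    inhibition -- 2D array penalizing already-fixated locations
    """
    priority = w_bottom_up * salience + w_top_down * relevance - inhibition
    return np.unravel_index(np.argmax(priority), priority.shape)

# Toy usage: a 10 x 10 scene in which one location is both salient and relevant.
salience = np.random.rand(10, 10)
relevance = np.zeros((10, 10)); relevance[3, 7] = 2.0
inhibition = np.zeros((10, 10))
print(next_fixation(salience, relevance, inhibition))  # most likely (3, 7)
```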
But how can there ever be a role for items in models of visual search if we do not even know what “item” means? Possibly, our language simply lacks the vocabulary to define “item” or “object.” Many researchers, such as David Marr, have speculated that it is impossible to define “object” (Marr 1982) – and we agree. But rather than abandon items altogether (and admit defeat!), we should adopt recent computational approaches to object recognition as an alternative to formal definitions.
Consider a modern deep-learning network: an artificial neural network that consists of many nodes across many layers. (We will not discuss one specific network, but focus on the general architecture that is shared by most networks.) Such models are inspired by the architecture of the human visual system: they implement a complex arrangement of nodes, each of which looks at only a small portion of the input image (Krizhevsky et al. 2012). First, the network is trained on a large set of example images, which can be labeled (e.g., Krizhevsky et al. 2012), unlabeled (e.g., Le et al. 2012), or a mix of both (LeCun et al. 2010). Crucially, in all cases training occurs by example, without explicit definitions. Next, when the trained network is presented with an image, nodes in the lowest layers respond to simple features, such as edges and specific orientations (Lee et al. 2009), reminiscent of neurons in early visual cortex (Hubel & Wiesel 1959). Nodes in higher layers of the network respond to progressively more complex features, until, near the top of the network, nodes have become highly selective object detectors; for example, a node may respond selectively to faces, cats, human body parts, cars, and so forth (Le et al. 2012). These nodes are reminiscent of neurons in the temporal cortex, which also respond selectively to object categories such as faces or hands (Desimone et al. 1984). Importantly, deep-learning networks detect objects in the very real-world scenes that H&O consider problematic (He et al. 2015; Krizhevsky et al. 2012); and they do so without explicit definitions, seemingly like humans do.
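For concreteness, this shared architecture can be sketched in a few lines of PyTorch. The sketch is our own minimal illustration; the layer sizes and the 1,000-category output are assumptions in the spirit of ImageNet-style networks, not a description of any particular published model.

```python
# A minimal sketch of the generic architecture described above (PyTorch).
import torch
from torch import nn

deep_net = nn.Sequential(
    # Lower layers: small receptive fields, respond to edges and orientations.
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # Intermediate layers: larger receptive fields, progressively more complex features.
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # Top layer: after training on example images, individual nodes behave as
    # selective detectors for object categories (faces, cats, cars, ...).
    nn.Flatten(),
    nn.Linear(128 * 28 * 28, 1000),
)

# A 224 x 224 RGB image yields one activation per candidate object category.
image = torch.rand(1, 3, 224, 224)
activations = deep_net(image)   # shape: (1, 1000)
```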
Combining deep-learning networks with traditional visual-search models could explain how people explore their environment, item by item. As a starting point, we could take the model of H&O and replace their bag of items with the active nodes in the high layers of a deep-learning network – that is, nodes that respond selectively to high-level features of the input (for example, cats) and whose activation exceeds a certain threshold (Le et al. 2012). This would provide H&O's model with a bag of items to search through, without the model ever being fed a definition of “item.” Of course, in its simplest form, this combined model is far from perfect. First, it does not explain object-based effects of the kind discussed above. Second, it assumes that the entire visual field is parsed at once and does not take eye movements into account – the very assumption that H&O rightfully want to get away from. But this simple combined model would be a good starting point for uniting cognitive psychology with computer vision. And once principles from both disciplines are combined, improvements readily come to mind. For example, the deep-learning network could be fed eye-centered visual input that takes into account the functional viewing window.
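As a sketch of what this combination could look like (again our own illustration, with hypothetical labels, threshold, and helper functions, continuing the PyTorch example above): thresholded top-layer activations provide the bag of items, which a simple search loop then inspects one item at a time.

```python
# Illustrative sketch of the proposed combination (ours, not an implementation of
# H&O's model): the "items" are top-layer nodes whose activation exceeds a threshold.
import torch

def bag_of_items(top_layer_activations, labels, threshold=0.5):
    """Return the category labels of top-layer nodes with above-threshold activation."""
    active = (top_layer_activations > threshold).nonzero(as_tuple=True)[0]
    return [labels[i] for i in active.tolist()]

def search(bag, target):
    """Inspect the items one by one until the target is found (or the bag is exhausted)."""
    for n_inspections, item in enumerate(bag, start=1):
        if item == target:
            return n_inspections   # number of inspections needed to find the target
    return None                    # target absent

# Toy usage with hypothetical category labels and activations.
labels = ["face", "cat", "car", "tree"]
top_layer_activations = torch.tensor([0.1, 0.9, 0.7, 0.2])
print(search(bag_of_items(top_layer_activations, labels), target="cat"))  # -> 1
```

In a less simplistic version, the order in which items are inspected could itself be driven by the kind of priority map sketched earlier, rather than by the arbitrary order assumed here.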
In conclusion, we feel that H&O have been too quick to admit defeat. They have constructed a parsimonious model that explains visual-search behavior well without requiring items. Now all we need to do is put the item back in.