Hulleman & Olivers (H&O) describe several tricky issues in visual search, relating to relationships among covert attention, eye movements, and the inhomogeneity of the visual field. However, I believe that their proffered resolution of those issues is fundamentally incorrect. Their title promises “the impending demise of the item.” “The focus on the item as the core unit of visual search is rather problematic,” they say (sect. 4). H&O's basic misstep is to look for the core purposes of search in some tricky aspects of the search literature rather than asking why people and animals search. Outside of the lab, we are almost always searching for something. The goal of search may be ambiguously defined (e.g., “threat” in airport luggage or a suspicious mass in a mammogram) and that goal may be hard to segment from the background. Still, that goal is some thing. It is an item that we need to search for because our capacity is limited and we cannot fully process the entire scene at once (Tsotsos 1990). Following Treisman (1996), I would argue that we search because we need to attend to an item to successfully “bind” its features, and we generally need to bind features to recognize items that are the goals of search.
It is true that binding is not always necessary to do laboratory search tasks (see sect. 4.1). In some cases, the unbound image statistics are enough to classify displays as target-present or -absent. Very rapid decisions about the presence or absence of an animal in a scene (Li et al. 2002) would be one example. However, when H&O invoke these abilities and conclude that a model, like our Guided Search (GS) model, “overestimates the role of individual item locations” (sect. 4.2), they are, again, thinking more about the lab than about the world. It may be that some “present/absent decisions are based on parallel extraction of properties of groups of items within local areas” (sect. 4.2), but, as we found while studying the ability to detect breast cancer in a flash, these “gist” signals are often not adequate to locate the target (Evans et al. 2013b). In real-world search, you want to know the actual location of your keys, not just that the image statistics indicate their presence at above chance levels – interesting as that may be.
H&O's model is built on the functional viewing field (FVF) that surrounds each fixation. They point out, quite fairly, that classic models such as GS have tended to ignore the inhomogeneity of the retina and the role of eye movements. As a matter of convenience, we have often used large, vivid stimuli that can be resolved anywhere in the display. We have been interested in the covert deployments of attention and have designed experiments that make the overt deployments of the eyes less critical. H&O correctly argue that models such as GS have been guilty of oversimplification in regarding eye movements as nothing more than coarse indicators of the more rapid deployments of attention. Covert attention is processing items at a rate of, perhaps, 20–40 items per second. That means that approximately 5–10 items are processed on each fixation. GS has not concerned itself very much with how those fixations are chosen; that is an omission. As H&O review, GS proposes a “car wash” model in which the 5–10 items are selected, one after the other, during fixation. As in a car wash, though they enter in series, multiple items are processed at the same time because it probably takes at least 200–300 ms to process a selected item to recognition. H&O might have been proposing that, on each fixation, all items in the FVF enter the car wash at the same time. Discriminating such parallel selection of a clump of items from rapid serial selection of each of those items is extremely difficult. Theoreticians from one camp can almost always account for the data from the other.
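To make the car wash concrete, here is a toy sketch in Python. The specific numbers (one new item selected every ~35 ms, ~250 ms from selection to recognition) are illustrative assumptions consistent with the ranges above, not parameters taken from either model.

```python
SELECTION_INTERVAL_MS = 35  # assumed: one new item selected every ~35 ms (~30 items/sec)
PROCESSING_TIME_MS = 250    # assumed: time from selection to recognition

def items_in_pipeline(t_ms):
    """Number of items simultaneously 'in the car wash' at time t.

    Item i is selected at i * SELECTION_INTERVAL_MS and finishes
    PROCESSING_TIME_MS later; we count items selected but not yet done.
    """
    entered = t_ms // SELECTION_INTERVAL_MS + 1
    if t_ms < PROCESSING_TIME_MS:
        finished = 0
    else:
        finished = (t_ms - PROCESSING_TIME_MS) // SELECTION_INTERVAL_MS + 1
    return entered - finished
```

With these assumptions, the pipeline settles at roughly seven or eight items in process at once, even though only one item is selected at a time.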
However, H&O are not proposing parallel selection of clumps of N items. They want to get rid of the items and parallel process everything within the FVF. This raises questions that are left for future work in their model. Consider a classic conjunction search for red vertical lines among red horizontal and green vertical distractors. How do we tell the difference between an FVF that contains a red, vertical item and one that contains red and vertical features that are not bound to the same item? It is not adequate to propose that the system can determine when red and vertical occur in the same place. Think of black vertical lines on a red background. Red and vertical are in the same location, but observers are not confused into thinking that these are red vertical targets (Wolfe & Bennett 1997).
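The point can be made with a deliberately crude sketch (my construction, not H&O's): if a region's contents are summarized as an unbound tally of features, a genuine red-vertical target and a spurious co-location of “red” and “vertical” look identical.

```python
from collections import Counter

# Each entry is (color, orientation) for one surface in a local region.
# The regions and their feature labels are hypothetical, for illustration.
target_region = [("red", "vertical")]                    # a red vertical bar
decoy_region = [("black", "vertical"), ("red", "flat")]  # black bar on a red background

def unbound_features(region):
    """Pool features over the region with no record of which item they came from."""
    feats = Counter()
    for color, orientation in region:
        feats[color] += 1
        feats[orientation] += 1
    return feats

# Both regions register "red" and "vertical", so a purely unbound
# summary would flag both as containing the conjunction target.
```

Observers, of course, are not fooled by the decoy, which is why some form of binding to an item seems unavoidable.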
If H&O are wrong to condemn the poor item to death, why does their model work so well? Search efficiency, as indexed by the slope of RT × set size functions, is a continuum. In the H&O model, the FVF is a parameter that scales efficiency. Hard tasks produce small FVFs and easy tasks produce big ones (Nakayama 1990). What determines FVF size? That is not clear in this schematic model, so, really, the FVF just serves as a free parameter to scale those slopes appropriately. The model also avoids other difficulties by coding all distractors as “0” and the targets as “1.” A parallel process, operating over the whole FVF, won't have much trouble detecting that target, but applying this model to real stimuli (e.g., conjunctions) might be a challenge.
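The FVF's role as a slope-scaling free parameter can be captured in one line of arithmetic. Suppose, purely for illustration, that each fixation lasts ~250 ms and that everything within an FVF of f items is processed in parallel on that fixation; then each added item costs 250/f ms:

```python
FIXATION_MS = 250  # assumed fixation duration; illustrative only

def predicted_slope(fvf_size):
    """ms added to search time per extra item, given an FVF of fvf_size items."""
    return FIXATION_MS / fvf_size

# A big FVF (easy task) yields a shallow slope, a small FVF a steep one:
# predicted_slope(25) -> 10 ms/item; predicted_slope(5) -> 50 ms/item
```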
Models such as GS also have parametric variations in difficulty that will scale search efficiency. In GS, more guidance by basic features (and by scene structure in current versions of GS) allows attention to be more efficiently directed to likely candidate targets. If you can exclude half of the items because they are, for example, the wrong color, your slope is cut in half. I strongly suspect that any mechanism for controlling FVF size will look a lot like GS's guidance, and I strongly suspect that the item, hard as it is to define, will be there to be selected.
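A similar back-of-the-envelope sketch shows how guidance does the same scaling work in GS. Assume, illustratively, a serial, self-terminating search at some fixed cost per selected item; excluding the wrong-color half of the display halves the candidate set and therefore halves the slope:

```python
PER_ITEM_MS = 50  # assumed cost per selected item; illustrative only

def expected_rt(set_size, guided):
    """Mean target-present RT for serial self-terminating search.

    With color guidance, half the items are rejected up front, leaving
    set_size / 2 candidates; the target is found, on average, halfway
    through whatever candidate set remains.
    """
    candidates = set_size // 2 if guided else set_size
    mean_rank = (candidates + 1) / 2
    return mean_rank * PER_ITEM_MS

def slope(guided):
    """ms/item slope of the RT x set size function (set sizes 8 and 16)."""
    return (expected_rt(16, guided) - expected_rt(8, guided)) / 8
```

Under these assumptions, guidance cuts the slope from 25 ms/item to 12.5 ms/item, exactly the halving described above.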