R1. Introduction
We are grateful to all commentators for their excellent contributions. As befits a good scientific discussion, some argued that we are fundamentally wrong or went too far, while others argued that we are on the right track but did not go far enough. Still others brought to the forefront relevant aspects that we were not aware of, or had not considered within the current context. In all cases, the commentators' views either forced or enabled us to improve our arguments and widen our perspective.
In the target article we made a number of claims, namely that (1) the study of visual search has been governed by the assumption that search proceeds on the basis of individual items; (2) this has prompted a lopsided empirical focus on easy search tasks that have relatively shallow slopes for their RT × set size functions; and (3) this, as has been noted by others, has resulted in ignoring the eye as a major component of search. We argued that the emphasis on item-based processing has led to central, cognitive explanations of visual search (involving bottlenecks in feature binding and template matching). Yet the more likely determinant of visual search performance may be the sensory limitations associated with peripheral vision. Summarized in what we referred to as the Functional Visual Field (FVF), these limitations involve reduced acuity, increased crowding, and overall decreased attention to, and awareness of, peripheral stimuli. Together, they severely reduce the value of "item" as a concept in search. We argued that the field should abandon the item as the major unit of processing and instead adopt the information patterns available within eye fixations as the major determinant of RTs, search slopes, and RT distributions. A simple simulation showed the viability of this approach.
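For readers who want the gist of that simulation, the sketch below gives a minimal version in Python. It is our illustration, not the code of the target article: the 250 ms fixation duration follows the article, whereas the 350 ms residual time, the fixed per-fixation capacity, and the exhaustive quitting rule are simplifying assumptions.

```python
import random

def simulate_trial(set_size, fvf_capacity, target_present,
                   t_fix=0.250, t_residual=0.350):
    """One simulated search trial: every fixation inspects up to
    `fvf_capacity` previously uninspected items, and the trial ends
    when the target is found or all items have been inspected.
    Returns (reaction time in seconds, target reported present)."""
    uninspected = set_size
    fixations = 0
    while uninspected > 0:
        fixations += 1
        n_now = min(fvf_capacity, uninspected)
        # The target is found if it falls within this fixation's FVF.
        if target_present and random.random() < n_now / uninspected:
            return fixations * t_fix + t_residual, True
        uninspected -= n_now
    return fixations * t_fix + t_residual, False

# Slopes emerge from FVF size alone: with a capacity of 30 items the
# RT x set size function is flat ("parallel"); with 1 item it is steep.
for capacity in (30, 3, 1):
    means = []
    for n in (6, 12, 18):
        rts = [simulate_trial(n, capacity, True)[0] for _ in range(5000)]
        means.append(round(sum(rts) / len(rts), 3))
    print(capacity, means)
```

With one item per fixation the sketch reduces to classic serial self-terminating search; the point of the framework is that the same machinery, with a larger FVF, also produces flat slopes.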
We have organized our response according to a number of common themes emerging from the commentaries. The first theme, treated in section R2, revolves around the question of whether we have anything new to offer. The second theme (section R3) involves the repeated objection that items, or objects, are important in visual search. The third theme (section R4) concerns the role of covert attention as a selection mechanism that is independent of eye movement. The fourth theme (section R5) comprises various issues related to top-down and contextual influences on search, as these appear to be omitted from the concept of the FVF. The fifth theme (section R6) is how best to define the FVF. In section R7 we focus on a number of technical issues related to our simulation. Finally, we end with a reflection on where we are after incorporating the commentaries, and discuss future directions (section R8).
R2. Does this approach offer anything new?
Several commentators remark that our framework is not new. For example, Kristjánsson, Chetverikov, & Brinkhuis (Kristjánsson et al.) state that concerns about traditional approaches are part and parcel of parallel models of visual search, and Itti points out that the use of fixations as the conceptual unit is central to most computational theories and models of attention because they use prediction of human fixation locations as the main test for their efficacy. Müller, Liesefeld, Moran, & Usher (Müller et al.) note that SERR (Humphreys & Müller 1993) anticipated several of the ideas we advocate, and Rosenholtz sees clear parallels with recent work in her own lab (Rosenholtz et al. 2012a; Zhang et al. 2015). Kieras & Hornof argue that EPIC (Meyer & Kieras 1997) contains visual modules that already incorporate parts of the FVF concept. The FVF is also similar to what has been referred to as the "area of control" in the work of Prinz (1977), which in turn is similar to the perceptual span in reading (McConkie & Rayner 1975). Cave suggests that Treisman and Gormican (1988) might have had a concept similar to the FVF in mind when they referred to the role of the resolution of the attentional scan in making local feature information available.
As we tried to make clear in our target article, we agree that many if not all components of our framework have been proposed before. A number of authors deserve a prominent position here, as they made earlier proposals about the combination of components. First of all, there is Engel (1977), who showed that the size of the FVF (as measured through target discrimination accuracy at various eccentricities) predicts spontaneous eye movements during visual search. Separate manual RT-related measures of search were not reported, though. Kraiss and Knäeuper (1982) were probably the first to propose a model in which FVF size predicts the number of fixations as well as manual RTs. Geisler and Chou (1995) also related peripheral discrimination accuracy for targets at a known location (referred to as the "accuracy window") to overall manual search RTs for the same targets at unknown locations. Although Geisler and Chou (1995) did not measure eye movements, they nevertheless demonstrated that a simple model assuming a variably sized fixation region (operating within the limits set by the accuracy window) accurately captured the pattern of search RTs for one of their participants. Our framework shares a number of properties with the Geisler and Chou model, including the idea that fixation duration is best assumed to be relatively constant, at around 250 ms. Geisler and Chou (1995) did not assess RT slopes or RT distributions, but here we have demonstrated that an FVF-type approach captures these across the full range of search difficulties.
Zelinsky and Sheinberg (1995) measured eye movements during search and demonstrated that manual RTs on individual trials can be accurately predicted from the number of fixations that people make on those trials, for both easy and moderately difficult searches (see Williams et al. 1997 for similar data). Moreover, Zelinsky and Sheinberg (1995) proposed that it is mainly the number of items that can be processed within a fixation which determines search efficiency – an idea to which our current implementation of the FVF directly corresponds. Finally, Findlay and Gilchrist (2003), in their book on active vision, also explicitly stated that by taking the limitations of peripheral vision into account we could do away with central processes of covert attention as the major delimiter in search – a view that is very reminiscent of the idea that individual items are not the unit of processing. Instead, the major delimiter is how many items can be processed in parallel within a fixation. Thus, Findlay and Gilchrist (2003; see also Eckstein 2011) arguably provide the most comprehensive general framework for the relationship between visual search performance and eye movements. However, they did not offer a formal model or simulation to demonstrate its viability.
The above shows that the concept of the FVF in relation to eye movements and visual search performance has surfaced at least once every decade. However, as we also alluded to in our target article, whereas the FVF has gained a clear foothold in the eye movement strand of the visual search literature, it has failed to do so in the more classic, and therefore perhaps more mainstream, strand of the literature focused on RTs and their slopes (including our own work). Our central aim was to show that this appears to be caused by the latter type of literature being heavily grounded in item-based thinking and the emphasis on covert, cognitive limitations that comes with it. This results in a fundamental clash of two classes of theory, requiring a principled choice. This choice, in our view, falls overwhelmingly in favour of the fixation-based approach. We hope our paper will aid in reconnecting all of the dots, as we believe FVF-based models have not made the impact on the field of visual search that they deserve. Furthermore, we extend the explanatory power of such models by showing that they explain search efficiency across the full range of search types, from easy parallel to extremely difficult serial, while also accounting for RT distributions and, to a great extent, errors. We believe that our effort opens up new research avenues, something that we will return to in section R8.
R3. But items are important, even essential!
Several authors indicate that we too easily abandon the item as the unit of selection. The argument comes in different varieties.
R3.1. Your data can be accommodated by item-based models
Müller et al. note that much of what we know about guidance in visual search stems from item-based experiments and that including any form of guidance for eye movements into our framework would bring it closer into line with traditional item-based models such as Guided Search. Indeed, Moran, Liesefeld, Usher, & Müller (Moran et al.), from the same lab, show that their item-based Competitive Guided Search model (CGS) can simulate our data, with even better fits for reaction times and errors than our own simulation. Not only does CGS show the inversion of the standard deviations from medium to hard search that we highlighted, but its target-absent slopes for difficult search are also much closer to the observed data of Young and Hulleman (2013). In addition, Moran et al. argue that our framework does not cope well with the data of Liesefeld et al. (2016), who reported a type of search where target-present search slopes are flat, but target-absent search slopes are not, and where there are large effects (>100 ms) on the intercept. Parameter adjustment in CGS allows these data to be simulated correctly. Accordingly, Moran et al. feel that it is premature to prefer a fixation-based approach over an item-based approach.
We disagree with this position for several reasons. First, we note that Moran et al. offer no accompanying graphs for fixations. We suspect this is because an item-based approach such as CGS does not allow for a principled way of including eye movements, because the rate at which items are identified in the model (θ/δ) has no fundamental relation to fixation duration. Admittedly, equating identification rate with fixation duration establishes a connection and even yields a good fit for medium search (see the final entry of Table 1 in Moran et al.). The good fit, however, comes at a price: the value of the guidance parameter nearly doubles, the quit-weight parameter increases almost a hundredfold, and the residual time parameter is halved. This brings us to a more fundamental point: In our view, models are meant to engender understanding, rather than merely provide a good fit. If we compare the entries across Table 1 in Moran et al., not a single parameter remains constant. Nor are any of the parameters the same as in the paper first introducing CGS (cf. Table 2 in Moran et al. 2013; except the parameter for motor errors). What seems to be lacking is a clear justification of the parameter values used, whether from experimental observations or otherwise. At the moment, CGS appears more focused on fitting visual search data than on understanding visual search.
Furthermore, it stands to reason that a model with eight free parameters will outperform a model with only a single free parameter; formal model-comparison criteria such as AIC and BIC penalize exactly this kind of flexibility. But what our framework loses in goodness of fit, it more than makes up for in range of description. We think that it is paramount that models of visual search explain fixations and reaction times simultaneously, rather than trying to optimise RT modelling before adding fixations.
In this respect we see the approach of Khani & Ordikhani-Seyedlar as more promising. While they propose an item-based approach grounded in FIT, they do encompass fixations, which we think is crucial. In their account the first fixation produces incomplete feature maps. Rather than complete conjunctions, these maps only contain "loose" conjunctions and clusters of feature similarity, with salient features having a higher chance of entering the maps. These maps are subsequently used to covertly or overtly guide attention. Each fixation, then, leads to more detailed maps. When one or more items reach a threshold of similarity with a target template, these individual items are serially selected to establish whether one of them is the target. The number of items involved in each fixation is determined by the perceptual load (Lavie 1995; Lavie & Tsal 1994).
In a way, the proposal of Khani & Ordikhani-Seyedlar represents a return to the origins of FIT (Treisman & Gelade 1980), when the role of eye movements in search was still acknowledged. Given that the proposal is in its very early stages, we will have to wait to see whether it bears fruit, but we would nevertheless like to make two remarks. First, items actually make their appearance quite late in this proposal: Only when the feature maps are detailed enough are individual items selected (cf. Hochstein & Ahissar 2002). Second, it remains to be seen how this model handles situations in which items move around. Item motion will impinge on the building up of the feature maps over several fixations, because the sources of the features – the items – will have moved position in the meantime. Yet, as shown in Hulleman (2009; 2010), search for a T amongst Ls is robust against motion.
R3.2. The importance of objects
The second type of argument in favour of items is that, clearly, we perceive discrete objects and that, equally clearly, there are object-based effects on attention (Cave, Van der Stigchel & Mathôt, Urale). Likewise, the goal of many, if not most, searches is to select a particular object for identification, inspection, clicking, counting, or picking up (Eimer, Kieras & Hornof, Pasqualotto, Watson, Wolfe). Furthermore, Watson and Wu & Zhao make the case that such goals are often represented through an item-based target template (but see Prinz, who argues against such templates and sees target detection as a disruption of integrated distractor processing across fixations). The importance of objects in visual search is clearly stated by Eimer: “[a]t a more fundamental level, it is difficult to see how objects can be replaced as conceptual units in visual search, given that the visual world is made up of objects, and finding a particular target object is the goal of a typical search task.” As Wolfe expresses it, the goal of search is some thing.
Yes, we wholeheartedly agree: The goal of most searches is to find a specific object. This includes most real-world searches, but also many artificial laboratory tasks, such as the compound search task in which the participant needs to make an additional decision about the target object. However, the goal of the process is not the same as the process itself. As Rosenholtz comments, "just because we recognize 'things' at the output of perception, and employ high-level reasoning about objects, does not mean that our visual systems operate upon presegmented things. This is a common and tempting cognitive error, which can hamper uncovering the true mechanisms of vision." Enns & Watson also cite an interesting remark by Hochberg in this respect: "unlike objects themselves, our perception of objects is not everywhere dense" (Hochberg 1982, p. 214). In fact, given the severe limitations of the periphery, one could argue that real object segmentation is limited to central vision, with the periphery only being able to deliver the coarsest candidates, especially in complex, real-world scenes. It is true, as Van der Stigchel & Mathôt state, that recent deep-learning networks have demonstrated successful parsing of complex, real-world scenes into relevant objects. However, we note as well that, unlike the brain, such algorithms almost invariably work with images of high and homogeneous resolution.
There is indeed also clear behavioural evidence for object-based attention (e.g., Duncan 1984; Egly et al. 1994; Theeuwes et al. 2010; 2013). But although object-based attention may have a modulating influence on selection, again this does not mean that objects form the unit of selection in visual search. It may be telling that none of the cited demonstrations of object-based attention used visual search tasks. Rather, they typically involved sparse displays of at most two objects. That said, we agree with Urale that it is interesting to investigate how object-based attention contributes to shaping the FVF by grouping elements together – something that is also echoed by Kristjánsson et al. and by Wu & Zhao, who argue that objects may be flexibly defined by learning conglomerates of features. In fact, a fixation-based approach may accommodate such grouping mechanisms more naturally than an individual-item approach (cf. Duncan & Humphreys 1989). For example, Töllner & Rangelov note that increasing the number of distractors in a present/absent version of a pop-out task does not influence reaction times (although see Wolfe 1998b), yet the same increase in a compound task benefits search. Within a fixation-based framework this can be explained by the fact that the compound task, unlike the present/absent task, requires precise saccadic targeting, which is likely to benefit more from the improved signal-to-noise ratio allowed by the grouping of the distractors. It is likewise true that there is considerable EEG evidence for the selection of individual items – some of which we will treat in more detail in section R4. Here we wish to make two comments regarding this evidence.
First, the selection of items is typically linked to the N2pc component, which is a spatially lateralized evoked potential. It does not index item selection as such, but rather the spatially selective processing of information at a relevant or interesting location. Spatial selectivity is not the same as item selectivity. Note that this type of experiment often uses sparse displays with two or four clearly segmented items (e.g., Eimer & Grubert 2014; Grubert & Eimer 2015; Woodman & Luck 1999; 2003). This makes it tempting to link the N2pc to individual item processing, but that does not necessarily follow from the evidence so far. As Rosenholtz suggests, we find item-based processing because we design item-based displays.
Second, the observation that spatially selective processing is stronger or takes longer for compound search tasks than for present/absent tasks, resulting in more pronounced N2pcs – as pointed out by Töllner & Rangelov – might simply reflect the fact that compound tasks require more fine-grained discrimination, not that this discrimination is item-based. While these types of EEG experiments are extremely useful in uncovering attentional processes, in our view they do not provide direct evidence that visual search is item-based.
We emphasize that we do not claim that visual search is never object- or item-based (as Van der Stigchel & Mathôt appear to suggest), nor are we of the opinion that we should not use artificial displays consisting of clearly separated items. As we pointed out in the target article, in displays with separate items where target-distractor discrimination is very difficult, inspection of each individual item may be required (resulting in a corresponding FVF). Furthermore, in some tasks target-distractor discrimination is easier, yet still an item-based strategy is required. As Watson suggests, one such task is counting multiple items rather than finding a single one. In a counting task the identity of the individual item matters, because it is important to separate those that already have been counted from those that have not. In other words it is important not only to discriminate targets from distractors, but also to discriminate between targets. In our view this necessitates smaller FVFs to prevent interference from similar but already counted items. Support for this contention comes from the last experiment in Hulleman (2010), where participants had to establish whether there were at least five Ts in a display that also contained Ls. When there were either very few or very many Ts in the display, motion of the items did not influence performance. When there were four, five, or six Ts, however, more errors were made when the items were moving. This demonstrates the interaction between task demands and FVF size. Only when individual item identity is crucial does the FVF become small and does search become item-based.
R3.3. The importance of feature binding
Because items are defined as bound features, item-based approaches to visual search are predicated on feature binding. Wolfe is the most explicit here in arguing “that we search because we need to attend to an item to successfully bind its features, and we generally need to bind features to recognize items that are the goal of search.” Wolfe agrees that binding is not always necessary but also points out that unbound features do not allow for accurate localisation of the target. As a case in point, he wonders how it would be possible to determine the presence of a red vertical target amongst red horizontal and green vertical distractors without binding the red and the vertical into a single item by attending to it. Eimer is less explicit but nevertheless appears to concur with this objection by noting that “[w]hat remains unclear is whether such global area-based mechanisms can detect the presence or absence of targets even in moderately difficult search tasks where no diagnostic low-level saliency signals are available and distractors share features with the target.”
In our view this type of argument is based on a couple of assumptions that do not necessarily hold.
First, consider Eimer's assumption that there are no diagnostic low-level saliency signals available in moderately difficult search where distractors share features with the target. We assume that for Eimer, as for Wolfe, this means a conjunction search, for example, for a red vertical amongst red horizontal and green vertical distractors. Classic item-based thinking dictates that for these types of searches the target item does not carry a unique feature. However, this assumes that the individual features of the individual items provide the only type of information available and that indeed there are no other diagnostic signals possible. This neglects that at several levels a patch containing a red vertical target among red horizontal and green vertical distractors is different from the same patch without a target. Even if the items within a patch do not have distinctive features, the overall image of the patch does. For instance, in our colour/orientation conjunction example, a patch with n distractors will contain x reds, x horizontals, (n−x) greens and (n−x) verticals. So there is a match between red and horizontal and between green and vertical. Replacing one of the distractors with the red vertical target will, depending on the distractor replaced, yield either x reds, (x−1) horizontals, (n−x) greens and (n−x+1) verticals or (x+1) reds, x horizontals, (n−x−1) greens and (n−x) verticals. In both cases, there is now a mismatch between red and horizontal and between green and vertical. So without assuming any item-based binding, it is possible to distinguish between patches with and without a target purely on the basis of summed totals. It is indeed necessary to bring information from colours and orientations together but, as this example shows, this does not have to happen at the level of a fully bound individual item. Rosenholtz lists quite a number of other properties of target and non-target patches that apply here. The presence of a red vertical changes the local spacing and alignment between the red bars, resulting in what resembles red T-like or L-like junctions. It likely changes the spatial frequencies present in the patch, because the red vertical target would probably be adjacent to a green vertical. There is no a priori reason to assume that these signals are not available for visual search.
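This tallying argument is easy to verify with a toy computation. The sketch below (our illustration, not a model of visual cortex) represents a patch only by its summed feature counts, discarding all information about which colour belongs to which orientation, yet it still distinguishes target-present from target-absent patches:

```python
from collections import Counter

def patch_totals(items):
    """Summed feature counts for a patch of (colour, orientation)
    items; no record is kept of which colour goes with which
    orientation, i.e., no binding."""
    totals = Counter()
    for colour, orientation in items:
        totals[colour] += 1
        totals[orientation] += 1
    return totals

# Target-absent patch, n = 4: two red-horizontal and two green-vertical
# distractors, so reds match horizontals and greens match verticals.
absent = [("red", "horizontal")] * 2 + [("green", "vertical")] * 2
# Replace one red-horizontal distractor with the red-vertical target.
present = [("red", "horizontal")] + [("green", "vertical")] * 2 + \
          [("red", "vertical")]

print(patch_totals(absent))   # red: 2, horizontal: 2, green: 2, vertical: 2
print(patch_totals(present))  # red: 2, horizontal: 1, green: 2, vertical: 3
# The mismatch between the red and horizontal totals signals the target
# without any feature ever being bound to an individual item.
```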
Second, feature binding is only needed if one assumes that, unless attended, features remain represented separately throughout the visual system. As discussed by Di Lollo (2012), this assumption was originally based on the work of Hubel and Wiesel (1962; 1977), whose single-cell recordings indicated that neurons in the primary visual cortex responded selectively to orientation or colour but not both. This then implied the need for a mechanism to integrate these features at a later stage by binding them, because a neuron representing one type of feature does not "know" about the neuron representing the other feature. However, it has long since been established that many neurons, also in early visual cortex, respond to integrated features (e.g., Friedman et al. 2003; Seymour et al. 2009; 2010; Shipp et al. 2009). Moreover, there is considerable cross-talk within and between retinotopically organized layers in both feedforward and feedback pathways. This altogether makes binding less of a problem than it once was (Di Lollo 2012; Hochstein & Ahissar 2002; Lamme & Roelfsema 2000). Because many of these integrative mechanisms appear to operate without attention, the role that attention plays in binding remains unclear. We suspect that attending to items may be necessary to distinguish their feature conglomerates at a sufficiently fine resolution, rather than to bind these features together – especially when it involves overt orienting (but not exclusively so; He et al. 1996; Hochstein & Ahissar 2002).
In sum, it is questionable (1) whether binding of item features is really necessary, (2) what relative contribution attention makes to such binding, and (3) whether the visual system is restricted to the features of individual items rather than more global image properties. All in all, we find an insufficient basis for the claim that binding necessitates item-based selection in visual search.
R4. What about covert deployments of attention?
Several authors raised the concern that a focus on eye movements and/or FVFs faces the dilemma of how to explain search in the absence of saccades (Ohl & Rolfs), because our framework must assume that there is then nothing left for covert attention to do (Cave and Kristjánsson et al.). This is at odds with clear evidence for covert attention effects that often precede eye movements (Eimer, Khani & Ordikhani-Seyedlar). Others raise the more general concern that a focus on eye movements may result in missing interesting covert attention effects (Lleras, Cronin, Madison, Wang, & Buetti [Lleras et al.], Watson, Wolfe). Specifically, Wolfe writes that his interest in serial covert deployments has been the reason to design experiments that make eye movements less critical. Lleras et al. state that our framework assumes that parallel searches (occurring within a single fixation) are not very interesting because they are all created equal. They point out that this misses out on subtle but important variations, due to differences in task sets and differences in similarity between the target item and distractor items, that occur even in parallel search without eye movements. This means that another source of variation in visual search stems from the efficiency with which individual items are judged in parallel search.
We would like to reiterate that we present a fixation-based account of visual search, not an eye-movement-based account. This is a subtle but important difference. As we wrote, visual search can clearly occur without eye movements. Yet even when observers are instructed to keep their eyes still, at least one fixation is involved. Accordingly, search will be limited by the retinal and cortical constraints imposed by that fixation, leading to reduced discriminability in peripheral vision. This alone means that not all searches within a fixation are equal: Even when a target falls within the FVF and the mechanism is in essence a parallel one, detection rates will not be homogeneous, as the signal-to-noise ratio is worse in the periphery (e.g., Geisler & Chou 1995). Furthermore, signal-to-noise ratios will differ for different stimulus combinations and set sizes, leading to either subtle (e.g., Buetti et al. 2016) or less subtle set size effects (taken as indicative of serial search). Specific top-down task sets may further shape the priority map, boosting some signals over others, as in Guided Search.
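To illustrate what such inhomogeneity looks like (a back-of-the-envelope sketch; the starting d′ of 3, the rate at which it halves with eccentricity, and the unbiased criterion are all assumed numbers, not fits to data), one can let discriminability fall off with eccentricity and convert it to a single-fixation detection probability:

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hit_rate(eccentricity_deg, d0=3.0, halving_deg=4.0):
    """Detection probability within a single fixation, assuming d'
    halves every `halving_deg` degrees of eccentricity and an
    unbiased yes/no criterion. All parameter values are illustrative."""
    dprime = d0 * 0.5 ** (eccentricity_deg / halving_deg)
    return phi(dprime / 2.0)

for ecc in (0, 2, 4, 8, 16):
    # Detection drops smoothly, not all-or-none, with eccentricity.
    print(ecc, round(hit_rate(ecc), 2))
```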
The end result of the interaction between bottom-up and top-down factors is likely to be a covert selection of a candidate region for further evaluation (e.g., on a secondary detail) or motor response (e.g., an eye movement; see also Cave's commentary). This region might be an item, but that is not necessary. In natural circumstances covert selection is followed by an eye movement, which enables a full-resolution image of the target. In some tasks this is even explicitly required (e.g., Buschman & Miller 2009). In other tasks eye movements are forbidden, but this does not necessarily stop the underlying selection process. Furthermore, the process is not perfect, so multiple covert hot spots may occur when a selected region turns out to be the wrong one. This is especially the case when non-targets are deliberately designed to closely resemble targets. For example, Woodman and Luck (2003) measured EEG while participants saw visual search displays with two salient target candidates (both defined by colour against black distractors) and maintained fixation. They showed that the target candidate near the fovea was selected first, followed by selection of the more peripheral one (as the central one turned out not to be the target after all). In another experiment (Woodman & Luck 1999) they showed that observers first (again covertly) selected the target candidate that carried the most frequent target colour, followed by the candidate that carried a less likely colour (see also the commentary of Töllner & Rangelov for target prevalence effects on the EEG signal). Both findings are perfectly consistent with a non-homogeneous FVF, where essentially parallel bottom-up and top-down processes deliver target candidates, but not necessarily all at the same time.
One could argue that in these specific EEG cases seriality is strongly imposed by the task, given that it is more efficient to inspect the closest or most prevalent item first. As recent work by Eimer and colleagues has shown, there is virtually parallel processing of multiple target candidates when the task is more balanced (Eimer & Grubert 2014; Grubert & Eimer 2015). Note that this is essentially no different from signal detection accounts of search that also assume parallel processing (Eckstein et al. 2000; Palmer et al. 2000; Verghese 2001). Furthermore, when one forces participants to keep the eyes still, as is usually the case in EEG experiments, the relative influence of covert selection will increase. But this is usually not the case in the real world, nor in most laboratory search tasks for that matter.
We do not deny the existence of covert attention, nor the importance of its investigation. Rather, our point is that some visual search theories appear to rely too heavily on covert selection to explain search, in that they assume that search efficiency is predominantly determined by fast covert scanning of items in combination with central, item-based bottlenecks (processing items at a rate of 20–40 items per second, as Wolfe confirms in his commentary). The fixation-based view follows Findlay and Gilchrist's (2003) stance that covert selection is more likely to be the end product of a search process that is primarily determined by limitations in retinal and cortical receptive fields, at most delivering the candidate region for the next fixation. As such, it is not an independent search process that occurs during fixation, but part of the active eye movement mechanisms. As Findlay and Gilchrist (2003) pointed out, there is little to no evidence for a serial covert scan during fixations. Further evidence against covert serial scanning comes from the robustness of search against motion of the items, even for searches that have been deemed serial (Hulleman 2010; Hulleman & Olivers 2014; Young & Hulleman 2013). Serial deployment of covert attention within a fixation would predict a drop in performance, because it becomes harder to distinguish between items that have and have not yet been inspected.
R5. Where to look next: top-down factors
A number of authors point out that an account solely based on an FVF is incomplete because it fails to incorporate important if not crucial mechanisms that determine where people look next in a visual scene. These mechanisms are perhaps best summarized as "top-down" in nature, and they come in a number of varieties: Enns & Watson; Lleras et al.; Shi, Zang, & Geyer [Shi et al.]; Töllner & Rangelov; and Van der Stigchel & Mathôt emphasize the importance of the task, whereas Menneer, Godwin, Liversedge, Hillstrom, Benson, Reichle, & Donnelly [Menneer et al.] and Itti, as well as Crabb & Taylor; Crawford, Litchfield, & Donovan [Crawford et al.]; Laubrock, Engbert, & Cajar [Laubrock et al.]; and Watson, highlight the role of context and scene gist in making predictions, guiding selection, and determining scan strategies. Learning, whether explicit or implicit, and task expertise also play an important role in shaping search (Kristjánsson et al., Van der Kamp & Dicks, Wu & Zhao, Crawford et al., Menneer et al., Van der Stigchel & Mathôt). Kieras & Hornof argue that a full model of such task- and memory-dependent strategies therefore requires an overarching cognitive architecture like EPIC (Meyer & Kieras 1997) or ACT-R (Anderson & Lebiere 1998). Müller et al. note that some form of feature guidance, such as in Guided Search, is required in many search tasks, and Wolfe suspects that these guiding features will be the same as those determining the FVF. Finally, Henriksson & Hari suggest that top-down cues need not be represented at sensory levels but may be of a high-level semantic and even social nature.
Non-visual influences are also accentuated by Van der Kamp & Dicks, as well as Campion (see also Kieras & Hornof; Töllner & Rangelov), who underline the influence of motor requirements on the search process. Because observers normally move about in their environments, there is a continuous perception-action cycle. Van der Kamp & Dicks call for a move away from traditional search tasks where observers are passively watching computer screens. Campion even calls for a move away from the information processing view of cognition that such traditional approaches appear to induce. These views are reminiscent of the Gibsonian ecological approach. They also fit with the active vision approach of Findlay and Gilchrist (2003) that precedes our own. On the other hand, in referring to Julian Hochberg's legacy, Enns & Watson appear to endorse the information processing approach, stating that, "what happens behind the observer's eyes is more important than what happens in front of them (the display items) or even in them (the FVF)."
We welcome all of these important suggestions. We agree that task, context, and actions play an important role in driving selection in laboratory studies, and even more so in real-world environments. We also agree that in real-world circumstances, fixation patterns are generally not random. We restricted our simulations to standard abstract laboratory displays – where such influences are minimized or controlled for – precisely because these are the displays that have generated the RT data in which item-based theories are grounded. We sought to demonstrate that these types of data can be more straightforwardly captured by a model that uses fixations rather than display items as its unit of processing. Given the high level of randomness of laboratory displays, a simulation that assumes random fixation selection (with some restrictions) suffices.
Nevertheless, the decision about where to look next is one of the major research questions arising from a fixation-based approach. We believe that FVF-based accounts provide a more natural and fruitful way of thinking about how this decision comes about than item-based accounts. First of all, we agree with Wolfe that features that have shown a high degree of guidance in classic visual search tasks will result in large FVFs. However, whereas Guided Search assumes feature status, and thus the capability of guiding attention, for at most a handful of visual properties (Wolfe & Horowitz 2004), within an FVF account any visual information that can be discriminated beyond the fovea can, by definition, subserve attentional guidance in visual search. Be it a low-level feature, a semantic category, or the social signal conveyed by a complex facial expression, if observers can distinguish it in the periphery, it can become the next target of fixation. The central point is that such information need not be item-based. What we claim is that once the attention-guiding properties are mapped out, the item as such is no longer necessary for explaining search mechanisms. Note here too Rosenholtz's remark that the same information available in a rich set of image statistics (Keshvari & Rosenholtz 2016) also plausibly underlies scene perception (Ehinger & Rosenholtz, in press; Rosenholtz et al. 2012b). This suggests a common encoding scheme for both extracting the scene context and supporting search. In this respect, Guided Search can be regarded as a representative of classic early selection theories, in which only relatively low-level properties can be used to filter information (Broadbent 1958; Treisman & Gelade 1980). FVF-based theories, on the other hand, are representatives of multiple-level selection theories (Allport 1980; Findlay & Gilchrist 2003; Norman & Shallice 1980; 1986), in which the level of selection is determined by the task requirements and the level of information available in the input.
In our view, then, the FVF is not simply a re-description of bottom-up salience. Classically, the FVF for a certain type of information is measured with a task in which observers actively look for this information in a known location. The FVF is thus an amalgamation of the availability of the information in the input and the top-down modulation of that input. Indeed, it has been shown that the FVF can change size or shape depending on additional task load or the expected spatial distribution of the target information (e.g., Engel 1971; Ikeda & Takeuchi 1975; Williams 1982). This might at least partially aid natural scene search, where targets are often restricted to certain spatial areas. Given the well-documented effectiveness of feature- and object-based attention, the shape or size of the FVF is also likely to be modulated by increasing the gain on specific feature or category distinctions, but to our knowledge there have been few studies looking directly into this (e.g., Põder 2007 has shown how repeating the target feature reduces peripheral crowding, but did not map out the full FVF). Thus, Enns & Watson's assertion that "what happens behind the observer's eyes is more important than what happens … in them (the FVF)" (our italics) is partly tautological: The mechanisms determining the FVF include what happens behind the eyes. That said, Enns & Watson are correct in suggesting that in setting up our account, we wished to emphasize the sensory restrictions in visual processing outside of the fovea, rather than the central cognitive restrictions associated with foveal processing.
Of course, the amalgamation of top-down and bottom-up factors into a single construct makes it vulnerable to becoming circular and unfalsifiable. We will address this issue in section R6.
R6. The nature of the FVF
Several authors have questions about the nature of the FVF or whether it is even possible to come up with an operational definition. Phillips & Takeda feel that the FVF lacks independent motivation, a sentiment also expressed by Kristjánsson et al., Watson, and Little, Eidels, Houpt, & Yang [Little et al.], who mention the risk of circularity: Search is difficult because the FVF is small, and the FVF is small because search is difficult. Furthermore, according to Itti, positing a single FVF size conflates guidance, selection, and enhancement mechanisms. The relation between the FVF and guidance is also touched upon by Wolfe, who thinks that the mechanisms controlling the size of the FVF will look a lot like those controlling guidance in Guided Search. Control of the size of the FVF also comes to the fore in the various comparisons of the FVF to a spotlight (Laubrock et al.; Itti), a zoom lens (Cave), and an attentional window (Kristjánsson et al.), and in its relation to perceptual load (Khani & Ordikhani-Seyedlar). Rosenholtz points out that it is important to make clear that the FVF is not a mechanism whose size is under active control, but an outcome of several mechanisms: It describes the informative visual regions for a particular task.
R6.1. Circularity
We will first address the issue of circularity in the definition of the FVF. Whereas Phillips & Takeda argue that it is possible to provide a mathematical basis for the FVF, we believe that an approach to visual search based on the FVF also allows the circle to be broken in an empirical manner, because it connects performance in visual search to performance in other visual tasks. Several researchers have already led the way. For instance, Engel (1977) determined the "conspicuity area" by having observers detect the appearance of a target. He subsequently tested visual search for that target and related the conspicuity area to the cumulative probability of finding the target. A similar approach was used by Geisler and Chou (1995): They measured an "accuracy window" in a two-alternative forced-choice (2AFC) task, where one of the intervals contained the target and the other contained only distractors. They then found a correlation between the size of the accuracy window and reaction times in a visual search task. So far the number of different types of visual stimuli that have been tested and correlated with search in this way has been limited, but this is a matter of filling in the gaps.
We do note that the method of Geisler and Chou (1995) could still be considered circular, because the 2AFC task used for measuring the FVF is essentially a skeleton version of the search task, and it would therefore be a surprise if there were no correlation. This criticism applies somewhat less to the approach used by Engel (1977), who presented single targets without distractors in the FVF task. But here too one could say that detection predicts detection. Still, we want to argue that this method brings us at least one step further than the item-based approach.
Some non-psychophysical methods might also offer a route out of circularity. One such method is the use of gaze-contingent displays. Young and Hulleman (2013) demonstrated that a very difficult search task is more robust against the masking of non-fixated items than easier search tasks. This indicates that the former has a smaller FVF than the latter. Neurophysiological data, too, may be used as an independent predictor of search performance. For example, Song et al. (2015) recently reported how anatomical characteristics of V1 and V2 cortex predict individual differences in both the precision of neural population tuning and performance on a visual discrimination task at various eccentricities. It would be exciting to see whether the same measures also predict visual search performance.
In summary, although we agree that circularity in the definition of the FVF presents a problem, we do think it is possible to find a solution that will provide a size measure of the FVF that is independent of search performance. Therefore, although Wolfe rightly observes that the size of the FVF acts to scale search slopes in the same way that guidance does in Guided Search, we think that there is a crucial distinction: Only the FVF account seeks to systematically anchor search performance in an independent task.
R6.2. Control over FVF size
The issue of size control goes to the very heart of the nature of the FVF. We agree with Rosenholtz that the FVF is an outcome – rather than a mechanism – with its size delimited by the interaction between retinal and cortical constraints on the one hand, and task demands on the other. Sometimes the retinal and cortical constraints might work in opposite directions (for instance in the Gestalt grouping mentioned by Urale). In any case, the size of the FVF is not actively controlled, but a particular task can only be performed when the FVF does not exceed a certain size. Within the FVF, active modulation (for instance by attentional processes or central perceptual load, Khani & Ordikhani-Seyedlar) might be possible, but this can never be more than modulation. Moreover, this active modulation is not item-based.
As such, the FVF is fundamentally different from attentional zoom lenses and spotlights. The latter are explicitly conceived of as operating covertly, independently from the eye, while the FVF is centred on the current fixation. As LaBerge and Brown (1986, p. 198) put it: "attentional factors dominate in processing visual targets […] retinal sensitivity factors have a minor role, if any." We therefore do not agree with Cave when he suggests that Treisman and Gormican's (1988, p. 17) description of the role of spatial attention is similar to the FVF. Indeed, Treisman and Gormican wrote: "Attention selects a filled location within the master map and thereby temporarily restricts the activity from each feature map to the features that are linked to the selected location. The finer the grain of the scan, the more precise the localization and, as a consequence, the more accurately conjoined the features present in different maps will be." But this alludes to the way covert attention promotes correct feature binding. It ignores the fact that the change in real spatial resolution from peripheral to central vision brought about by overtly fixating an item probably contributes much more to correct object perception.
FVF size control is also at the core of Itti's commentary. He suggests that FVF size is unlikely to be fixed, as it is in our simulations, and that an FVF that is allowed to rapidly change size and form becomes a liability because it would be very difficult to measure in real time. But as pointed out above, we do not see the FVF as an entity that is directly under active control of the observer. Rather, it is the outcome of the interaction between task demands on the one hand and retinal and cortical limitations on the other. Task demands might change over the course of a search, but this will happen in a gradual, predictable manner (and is moreover something that any model of search will have to deal with). Any change in FVF size will follow this change in task demands. Itti also suggests that it may be necessary to separate the FVF into three: a broader FVF for guidance of search, a smaller FVF for selection, and a potentially even smaller FVF for enhancement, because he thinks that using a single FVF size conflates the separate mechanisms of attentional guidance, attentional selection, and attentional enhancement. However, conflation might not necessarily be a drawback. The advantage of using the FVF is that these three presumably separate mechanisms of increasingly "homing in" on the target might actually represent one and the same process operating on increasingly detailed and target-like information. In a recent paper, Zelinsky et al. (2013) argued that guidance and recognition in visual search are two sides of the same coin: eye movement guidance is in essence recognition from the corner of the eye. The first signal that may be recognized is some fuzzy statistical reflection of what might be a target which, after an eye movement has been made, becomes recognition of a more detailed version.
R7. Technical issues
Several authors take issue with some of the more detailed choices we have made in our simulation.
R7.1. The stopping rule
Moran et al. suggest not only that our simple stopping rule leads to poor fits for the target-absent trials in difficult search, but also that replacing it with a more plausible rule will substantially change search RT distributions and error rates, possibly to an extent that puts the desirable properties of our framework at risk. We disagree. In our view, a more plausible stopping rule will actually improve the quality of our simulation. Currently, simulations of target-absent trials are most affected by the poor quality of our simple stopping rule. Furthermore, the influence of the stopping rule increases with decreasing FVF size. In combination, this means that the difficult target-absent condition is most affected. However, this is also the condition where the discrepancy between simulation and observed data is largest. Therefore we expect improvement with a more plausible stopping rule. Crucially, this should lead to fewer target-absent trials with extremely long reaction times and reduced variability in the target-absent trials of difficult search. Consequently one of the major insights of this paper (reversal of SD from medium to difficult search) remains unscathed. So, although a more plausible stopping rule is certainly needed, implementing it is not expected to invalidate the fundamental strength of our framework.
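To show why the stopping rule dominates target-absent behaviour, here is a toy comparison (again our sketch; sampling with replacement and the per-fixation quit probability are assumptions, not the rule we would ultimately propose). When fixations can revisit already-inspected regions, an exhaustive rule produces a long right tail of absent RTs, and even a small quit probability trims that tail:

```python
import random

def absent_fixations(set_size, fvf_capacity, quit_prob=0.0):
    """Fixations on a target-absent trial when each fixation covers a
    random sample of items (revisits allowed). With quit_prob = 0 the
    trial ends only once every item has been seen; with quit_prob > 0
    the model may give up earlier after each unsuccessful fixation."""
    seen = set()
    fixations = 0
    while len(seen) < set_size:
        fixations += 1
        seen.update(random.sample(range(set_size), fvf_capacity))
        if random.random() < quit_prob:
            break  # give up before the display is exhausted
    return fixations

for quit_prob in (0.0, 0.02):
    counts = sorted(absent_fixations(24, 3, quit_prob) for _ in range(5000))
    # Median is similar, but the maximum (the extreme right tail) shrinks.
    print(quit_prob, counts[len(counts) // 2], counts[-1])
```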
Crawford et al. raise an issue related to the use of the stopping rule. They applied the framework to data collected from radiologists who assessed chest radiographs, and noted that there may be an incompatibility in how people reach a target-present or target-absent decision. Crawford et al. report that in their data, target-absent decisions are faster than target-present decisions, whereas in our framework (and in virtually every other model of visual search from fundamental research) target-absent trials are typically slower. Although this could point to basic differences between the fundamental and applied research domains, we believe this is a case of incommensurate definitions of "absent decision." In our approach a target-absent reaction time is the result of searching close to an entire display, failing to find the target, and terminating the trial with a target-absent response. This typically takes longer than searching the display, finding the target, and terminating the trial with a target-present response. For Crawford et al. a target-absent reaction time appears to be something else: A radiologist fixates a particular part of the radiograph, correctly decides that there is no suspicious lesion, and moves on. The target-absent reaction time is taken as the duration of the fixation of this lesion-free zone. Target-present reaction times are defined analogously, except that here the radiologist correctly decides that there is a suspicious lesion. So when Crawford et al. refer to target-absent and target-present RTs, they refer to single fixation events, whereas in standard search tasks a decision RT is the accumulation of several such decisions over the same image. Their comment highlights that the results of fundamental and applied research cannot productively inform each other without consistent definitions.
R7.2. Fixation durations are not constant
Enns & Watson, Henriksson & Hari, Laubrock et al., Little et al., Menneer et al., and Shi et al. all point out that fixation durations vary, for example, in response to task demands, and that the constant value used in our simulation is therefore unrealistic. Ohl & Rolfs make a similar point in stating that there is useful information in the amplitude of saccades, including microsaccades. Little et al. make the valid argument that allowing variability of fixation durations will lead to increased variability in reaction times, and that this variability is positively correlated with the number of fixations. As a result the target-absent condition of difficult search will see the largest increase in variability, perhaps even making the target-absent trials more variable than the target-present trials. We ran some additional simulations with variable fixation durations. It is indeed possible to make difficult target-absent trials more variable than target-present trials, but this takes levels of variability in fixation duration that go far beyond those observed in our own recorded eye movement data. Variable fixation durations will thus not change the basic pattern of our simulation. We also point to Zelinsky and Sheinberg (1995), as well as Geisler and Chou (1995), who showed that it is sufficient to assume a relatively constant fixation duration.
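Little et al.'s concern can also be checked with a quick numerical sketch (ours; the 20 ms standard deviation is an assumed, roughly realistic value, not a fit to our eye movement data). Because independent fixation durations sum, the standard deviation of total search time grows only with the square root of the number of fixations, so even the many-fixation target-absent trials of difficult search gain relatively little extra variability:

```python
import random
import statistics

def total_time(n_fixations, mu=0.250, sigma=0.020):
    """Summed duration of n fixations, each drawn independently from
    a normal distribution (values illustrative)."""
    return sum(random.gauss(mu, sigma) for _ in range(n_fixations))

# Var(sum) = n * sigma**2, so the SD grows as sqrt(n) * sigma:
for n in (4, 16, 64):
    rts = [total_time(n) for _ in range(10000)]
    print(n, round(statistics.stdev(rts), 3))  # ~0.040, ~0.080, ~0.160 s
```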
R7.3. Attentional dwell time and its relation to fixation duration
Eimer questions our equating of dwell time with fixation duration. He points out the discrepancy between our fixation duration (250 ms) and attentional dwell time estimates (300–500 ms). Menneer et al. also suggest that there might be a dissociation between the fixated location and the location whose information is currently being processed when searching scenes. We would like to reply that although there seems to be a consensus that dwell time estimates of 20–40 ms per item in visual search tasks are too low, there is actually much less consensus on a more realistic value. The 300–500 ms mentioned by Eimer seems to be derived from Duncan et al. (1994). However, note that this estimate comes from tasks where two difficult-to-perceive (because masked) targets need to be reported. Consolidating targets for report likely involves additional processing and is quite different from the standard visual search task, where the mere presence or absence of a single target is to be reported. Furthermore, most estimates of dwell time come from studies that did not use a visual search paradigm and presented items sequentially rather than simultaneously. Some of these dwell-time estimates are in the region of 250 ms (Theeuwes et al. 2004) and even go as low as about 100 ms (Wolfe et al. 2000). All in all, this suggests that dwell times may be substantially lower than 500 ms.
We do agree with Menneer et al.'s point that there might be differences between scene search and more standard search tasks in terms of the relation between fixation duration and dwell time. This is one of the challenges facing the development of a unifying framework. One way to accommodate dwell-time variations would of course be to treat them as a contributing factor to the fixation-duration variability discussed in section R7.2.
R8. Conclusions: Where do we stand and where do we go?
In our view one of the most significant outcomes of this discussion is that all commentators seem to agree that it is important to include fixations in theories of visual search. This would constitute a major change for the item-based, attentional strand of the visual search literature, and it should have wider implications. Referring to their own 2003 work, Findlay and Gilchrist (2005, p. 259) wrote: "There has indeed been widening interest more generally in eye scanning and we have even been prepared to suggest that a fundamental theoretical shift is in the process of occurring." Yet more than 10 years down the line, a simple look at some of the most popular current textbooks on cognitive psychology, cognitive neuroscience, and even perception shows that the treatment of visual search seldom includes eye fixations, let alone assigns them a central role (see note 2; Braisby & Gellatly 2012; Eysenck & Keane 2015; Goldstein 2014, 2015; Reisberg 2013; Sternberg 2017; Ward 2015; Wolfe et al. 2015). In fact, most do not go further than Treisman's FIT, plus perhaps an excursion to Duncan and Humphreys' AET or Wolfe's Guided Search. Not only do our students grow up with these textbooks; the books also serve as a theoretical frame of reference for many researchers, including clinicians, who use search tasks only as a tool rather than as a topic of study. Such researchers may not only miss out on a rich empirical source of information, but may also attribute their findings to the wrong mechanisms.
That said, it is also clear that the idea of abandoning the item as the conceptual unit of visual search does not enjoy a similar consensus. The commentators have offered three main arguments for keeping the item: (1) item-based models can accommodate our data; (2) objects are important in behaviour; and (3) feature binding is necessary. However, we do not think that these arguments provide sufficient support for the claim that visual search is essentially an item-based process, nor do we think that the effects of covert attention in visual search make that case. So although the commentators have sharpened our views, they have not changed them: We still see a fixation-based framework as the best way to think about visual search.
In essence, the fixation-based approach returns visual search from being mainly an attentional problem to being mainly a perceptual problem. By using fixations and an FVF based on retinal/cortical limitations (in combination with task demands), the proposed framework makes direct connections with processes involved in crowding, reading, and perception in general. For example, there is no a priori reason to assume that there are no FVFs for semantic or categorical information (Wu & Zhao; de Groot et al. 2016; Lupyan 2008), which after all forms an intrinsic part of the perceptual process.
Apart from making a connection between visual search and other areas in vision science, we also intended our paper as an invitation to new thinking about current problems in visual search, whether in fundamental or in applied research. We feel encouraged that several of the commentators have already made a start with this. Pasqualotto demonstrates how abandoning the item-based approach to visual search facilitates thinking about the common aspects of visual and haptic search. Kristjánsson et al. argue that the FVF approach makes certain testable predictions about priming, specifically that if search is fixation-based, priming ought to be fixation-based too. Menneer et al. also derived new predictions from our framework, which prompted them to reanalyse their data on the prevalence effect in visual search through X-ray images (where rare targets are more easily missed than frequent targets). Their findings were consistent with our framework. Crawford et al. note that expert radiologists examining chest radiographs have scan paths that differ from those of novices, with fewer fixations and larger saccadic amplitudes. This shows the potential of fixation-based approaches to incorporate expertise and learning. Van der Kamp & Dicks point out that successful goalkeepers also have fixation patterns that differ from those of less successful ones, citing Piras and Vickers' (2011) observation that experienced goalkeepers seem particularly interested in the empty space between the non-kicking leg and the ball. Empty space has no role in item-based theories; an FVF approach is more flexible in accommodating this kind of observation, as it does not have to rely on the features of individual objects. There are many aspects of these and other real-world searches that differ from standard laboratory search (see also Crabb & Taylor) and that are not incorporated in our framework, such as the effects of ill-defined targets, low target prevalence, and an unknown number of targets. But by adopting a common framework and common definitions, we believe it will be easier to establish what these differences actually are and what factors give rise to them.
Clearly, there is much more work to be done. First, as also became clear from the commentaries, the FVF needs to be defined properly, through extensive empirical measurement using multiple independent and converging methods, as well as through clever computational modeling (cf. Itti). Interestingly, in 1995 Geisler and Chou (p. 361) wrote that "the low-level mechanisms are not understood well enough at this time to precisely quantify the variations in search information across different search stimuli." In our view, this still holds more than two decades later. There has been no concerted large-scale effort to map out the FVF for the wide range of stimuli that the visual system is sensitive to. Or as Kieras & Hornof put it, we still need to collect the empirical data to more completely parameterize the detectability of visual properties as a function of object eccentricity, size, and density. One reason for not embarking on this effort may be the vast task that lies ahead: By definition, there will be a specific FVF for every type of stimulus contrast. Add to this the effects of spatial attention, top-down attentional sets, context, and experience. Clever down-sampling is therefore required. A second reason may have been the perceived circularity. As we have argued, there are methods in place that at least partially address this.
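To illustrate what such a parameterization might look like (purely hypothetical: the functional form and all constants below are our assumptions, not measured values), one could model single-fixation detectability as a function of eccentricity, target size, and local item density, and define the FVF as the region where detectability exceeds some criterion:

    import math

    def detectability(ecc_deg, size_deg, density):
        """Hypothetical probability of detecting a target within one fixation.
        ecc_deg: eccentricity (degrees); size_deg: target size (degrees);
        density: local item density (items per square degree).
        All constants are illustrative assumptions."""
        acuity = math.exp(-ecc_deg / 10.0)                # acuity falls off with eccentricity
        crowding = 1.0 / (1.0 + 2.0 * density * ecc_deg)  # crowding grows with density and eccentricity
        size_gain = min(1.0, size_deg)                    # larger targets are easier, saturating at 1 degree
        return acuity * crowding * size_gain

    def fvf_radius(size_deg, density, criterion=0.5):
        """FVF radius: the largest eccentricity at which detectability
        still exceeds the criterion (coarse 0.1-degree steps)."""
        ecc = 0.0
        while detectability(ecc + 0.1, size_deg, density) >= criterion:
            ecc += 0.1
        return ecc

On such a scheme every stimulus contrast (and every attentional set, context, or level of experience) would come with its own parameter values, which is precisely why the mapping effort is so vast and why clever down-sampling is needed.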
Little is also known about the dynamics of FVF-based search. How does the visual system determine the length and precision of saccades and the duration of fixations? The FVF is an outcome, not a mechanism, but it appears that this outcome can be used to make the system adapt, perhaps based on the initial fixation or on previous experience with similar displays. The dynamics become even more complicated when we consider that the FVF changes during the search itself. This occurs when the required information changes, for example, from the target-defining feature to the to-be-reported features (as in compound search).
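As a toy illustration of this last point (our assumption, not an implemented model): in compound search, the information needed first (the target-defining feature, say colour) may support a much larger FVF than the information needed next (the to-be-reported detail), so the effective FVF shrinks mid-trial when the required information changes.

    # Hypothetical FVF radii (degrees) for different kinds of required
    # information; the values are assumptions for illustration only.
    FVF_RADIUS = {"target_defining_feature": 8.0, "to_be_reported_feature": 1.5}

    def effective_fvf(search_stage):
        """Sketch of a within-trial FVF change in compound search: a wide
        FVF while locating the target by its defining feature, a narrow
        one once the to-be-reported feature must be resolved."""
        key = ("target_defining_feature" if search_stage == "locate"
               else "to_be_reported_feature")
        return FVF_RADIUS[key]

On this picture, fixation durations and saccade metrics late in a compound-search trial should pattern with the narrow FVF, a prediction that is in principle testable.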
The second area where work clearly needs to be done is how the next fixation location is determined. As the commentators made clear, the roles of task, social, and scene context; learning and expertise; and action requirements are grossly underspecified. To this list we would like to add mechanisms that enable efficient sampling of the display, for example by selecting certain clusters of items, or the space in between (e.g., Zelinsky et al. 1997; Pomplun 2007). Furthermore, each of these effects is likely to be amplified in real-world situations, with more realistic task goals, expectations, and actions. However, we wish to point out that these problems are not specific to our account; item-based approaches have to explain such effects as well. Our message is that the problem of where to look next is best approached from the standpoint of the machinery that does the looking, and that the outcome will also be informative for cases where one considers only covert attention. Note that the still-dominant theories of visual search (FIT, AET, Guided Search) were all developed at a time when eye movement recording was in its infancy and monitors and computers capable of displaying photorealistic pictures were yet to be introduced, so that investigations into the perception of objects were confined to search tasks using simple, clearly defined, static items. Rather than retrofitting the dominant theories to more complex tasks and scenes, we think it is better to use a theoretical framework that starts with what all of these tasks, simple or complex, lab-based or "real-world," have in common: fixations.
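As a sketch of what such a sampling mechanism could look like (a greedy toy example under our own assumptions, not a proposal from the literature), the next fixation could be chosen as the candidate location, including the empty space between items, that brings the most as-yet-uninspected items within the FVF:

    import itertools, math

    def next_fixation(items, inspected, fvf_radius):
        """Greedy sketch: candidate landing points are the item locations
        plus the midpoints between item pairs (fixating empty space can
        cover several items at once); pick the candidate covering the most
        uninspected items. items: list of (x, y); inspected: set of (x, y)."""
        midpoints = [((x1 + x2) / 2, (y1 + y2) / 2)
                     for (x1, y1), (x2, y2) in itertools.combinations(items, 2)]

        def coverage(point):
            return sum(1 for item in items
                       if item not in inspected
                       and math.dist(point, item) <= fvf_radius)

        return max(list(items) + midpoints, key=coverage)

A mechanism of this kind would naturally produce fixations on clusters of items and on the space in between, of the sort referred to above, without ever treating the individual item as the unit of selection.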