Hulleman & Olivers (H&O) criticize the visual search literature for having focused largely on the individual item as the primary unit of selection. As an alternative to this view, the authors propose that (1) visual sampling during fixations is a critical process in visual search, and that (2) factors in addition to items determine the selection of upcoming fixation locations. H&O developed a very parsimonious simulation model, in which the size of the functional visual field (FVF) adapts to search difficulty. Items within the FVF are processed in parallel. Consequently, when search difficulty is very high, the FVF shrinks to a size of one item, effectively producing serial search. When search difficulty is lower, more items are processed in parallel within the FVF. These modeling assumptions were sufficient to qualitatively reproduce much of the canonical data pattern obtained in visual search tasks.
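The mechanism described above can be sketched in a few lines of Python (a toy illustration of the idea, not H&O's actual implementation; the linear mapping from difficulty to FVF size and all parameter values are our assumptions):

```python
import random

def simulate_search(n_items, difficulty, fixation_ms=250, seed=1):
    """Toy search with a difficulty-adapted FVF.

    difficulty in [0, 1]: 0 = trivial, 1 = maximally hard.
    The linear mapping from difficulty to FVF size is an assumption.
    """
    rng = random.Random(seed)
    # FVF shrinks with difficulty, down to a single item (serial search).
    fvf_size = max(1, round(n_items * (1 - difficulty)))
    items = list(range(n_items))
    rng.shuffle(items)                    # random order of fixated clusters
    target = rng.choice(items)
    fixations = 0
    for start in range(0, n_items, fvf_size):
        fixations += 1                    # one fixation processes fvf_size items in parallel
        if target in items[start:start + fvf_size]:
            break
    return fixations, fixations * fixation_ms  # constant fixation duration, as in the model
```

With low difficulty the whole display fits into a single fixation, whereas maximal difficulty degenerates to item-by-item serial search, qualitatively reproducing the pattern the model was designed to capture.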
We applaud H&O for acknowledging the important and long-neglected contribution of eye movement control in guiding the search process, because we believe that many attentional phenomena can be explained by considering oculomotor activity (e.g., Laubrock et al. Reference Laubrock, Engbert and Kliegl2005; Reference Laubrock, Engbert and Kliegl2008). Although not all attention shifts are overt, the neural underpinnings of covert attention shifts are largely identical to those of eye movement control (Corbetta et al. Reference Corbetta, Akbudak, Conturo, Snyder, Ollinger, Drury, Linenweber, Petersen, Raichle, Essen and Shulman1998). Attention research should therefore be able to profit from the advanced models of the spatiotemporal evolution of activations in visual and oculomotor maps as well as from the methods for directly manipulating the FVF.
Gaze-contingent displays are a method to directly manipulate the FVF. For example, in the moving-window technique (McConkie & Rayner Reference McConkie and Rayner1975), information is only visible within a window of variable size that moves in real time with the viewer's gaze. Visual information outside the window is either completely masked or attenuated. A very robust result from studies using this technique is that FVF size is globally adjusted to processing difficulty. In reading research, the size of the FVF is often called the perceptual span, which has been shown to increase with reading development (Rayner Reference Rayner1986; Sperlich et al. Reference Sperlich, Schad and Laubrock2015) and to be dynamically adjusted, for example, when viewing difficult words (Henderson & Ferreira Reference Henderson and Ferreira1990; Schad & Engbert Reference Schad and Engbert2012). In scene perception, parametrically increasing peripheral processing difficulty, for example by selectively removing parts of the spatial frequency spectrum from the peripheral visual field (Fig. 1, top), leads to corresponding reductions in saccade amplitudes (Cajar et al. Reference Cajar, Engbert and Laubrock2016a; Loschky & McConkie Reference Loschky and McConkie2002), suggesting a smaller FVF. These modulations are stronger when coarse features are removed than when fine details are removed (Cajar et al. Reference Cajar, Engbert and Laubrock2016a), reflecting the low spatial resolution of peripheral vision. Conversely, when the filter is applied to the central visual field (Fig. 1, bottom), saccade amplitudes increase, particularly if fine detail is removed, corresponding to the high spatial resolution of foveal vision. Cajar et al. (Reference Cajar, Schneeweiß, Engbert and Laubrock2016b) showed that these very robust modulations of mean saccade amplitude are directly correlated with the distribution of attention (i.e., the perceptual span).
Figure 1. Illustration of gaze-contingent spatial frequency filtering in real-world scenes. The white cross indicates the current gaze position of the viewer. Top: Peripheral low-pass filtering attenuates high spatial frequencies (i.e., fine-grained information) in the peripheral visual field. Bottom: Central high-pass filtering attenuates low spatial frequencies (i.e., coarse-grained information) in the central visual field.
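A minimal sketch of such a gaze-contingent manipulation is given below. It uses a hard-edged circular window with a crude box blur rather than the smooth spatial-frequency filters employed in the actual studies; all function names and parameters are illustrative:

```python
import numpy as np

def box_blur(img, k):
    """Crude k-by-k box blur (stand-in for a proper low-pass filter)."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def peripheral_lowpass(img, gaze, radius, k=5):
    """Keep the original image inside a gaze-centered window; show a
    coarse (low-pass) version outside it (cf. Fig. 1, top)."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    inside = np.hypot(yy - gaze[0], xx - gaze[1]) <= radius  # the FVF window
    return np.where(inside, img, box_blur(img, k))
```

In a real experiment, the window is re-centered on every gaze sample delivered by the eye tracker; swapping the blur for a high-pass filter and inverting the mask yields the central-filtering condition (Fig. 1, bottom).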
Are existing models of saccadic selection compatible with a variable FVF? In biologically plausible models, a critical feature is a spatial map with a limited spotlight of attention (i.e., an FVF-like representation). In addition, a simple memory mechanism (called inhibitory tagging) prevents the model from getting stuck by repeatedly selecting the point of highest saliency. Engbert and colleagues implemented such a dynamic model of eye guidance in scene viewing (Engbert et al. Reference Engbert, Trukenbrod, Barthelmé and Wichmann2015), based on an earlier model of fixational eye movements (Engbert et al. Reference Engbert, Mergenthaler, Sinn and Pikovsky2011). The combination of two interacting maps, one attentional and one inhibitory, reproduced a broad range of spatial statistics in scene viewing. Whereas these models explain the selection of fixation locations fairly well, an additional mechanism that adjusts the zoom lens of attention with respect to foveal processing difficulty (Schad & Engbert Reference Schad and Engbert2012) is necessary to capture modulations of fixation duration.
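The interplay of winner-take-all selection and inhibitory tagging can be sketched as follows (a bare-bones illustration of the principle, not an implementation of the Engbert et al. model; the Gaussian tag and the parameter values are assumptions):

```python
import numpy as np

def scanpath(saliency, n_fix, sigma=1.5, inhibition=0.9):
    """Winner-take-all fixation selection with inhibitory tagging."""
    s = saliency.astype(float).copy()
    h, w = s.shape
    yy, xx = np.mgrid[0:h, 0:w]
    path = []
    for _ in range(n_fix):
        y, x = np.unravel_index(np.argmax(s), s.shape)  # fixate the saliency peak
        path.append((int(y), int(x)))
        # A Gaussian inhibitory tag suppresses the just-fixated region,
        # so the model does not get stuck re-selecting the same peak.
        tag = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        s *= 1 - inhibition * tag
    return path
```

Without the tag, every fixation would land on the same global maximum; with it, the model moves on to progressively less salient regions, producing a scanpath.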
Compared with the complexity of these detailed dynamic models, the H&O model has the advantage of simplicity. However, this comes at the cost of somewhat unrealistic assumptions. For example, H&O assume that fixations have a constant duration of 250 ms and that only the number and distribution of fixations adapt to search difficulty. The authors justify this decision with previous research that found little effect of target discriminability on fixation durations in typical search displays. However, at least for visual search in complex real-world scenes, research shows that fixation durations are indeed affected by search difficulty (e.g., Malcolm & Henderson Reference Malcolm and Henderson2009; Reference Malcolm and Henderson2010).
Thus, not only the selection of fixation locations, but also the control of fixation duration is influenced by the FVF. In particular, mean fixation duration increases when visual information accumulation in regions of the visual field is artificially impaired by means of gaze-contingent spatial filtering (Laubrock et al. Reference Laubrock, Cajar and Engbert2013; Loschky et al. Reference Loschky, McConkie, Yang and Miller2005; Nuthmann Reference Nuthmann2014; Shioiri & Ikeda Reference Shioiri and Ikeda1989). However, this effect is observed only when filtering does not completely remove useful information; otherwise, default timing takes over, meaning that fixation durations fall back to the level observed during unfiltered viewing (e.g., Laubrock et al. Reference Laubrock, Cajar and Engbert2013). This might explain why effects of visual search difficulty are more often reported for the number of fixations than for fixation duration. A critical aspect of a model of fixation duration in visual scenes is parallel and partially independent processing of foveal and peripheral information (Laubrock et al. Reference Laubrock, Cajar and Engbert2013). Given that both FVF size and fixation duration adapt to task difficulty, an important goal for future research is to integrate models of fixation location and fixation duration.
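The default-timing pattern can be captured in a stylized function (all numbers are illustrative assumptions, not fitted values):

```python
def fixation_duration(info_rate, base_ms=250, cost_ms=120):
    """Stylized fixation duration under gaze-contingent filtering.

    info_rate in [0, 1]: fraction of useful information the filter leaves.
    """
    if info_rate <= 0.0 or info_rate >= 1.0:
        # Unfiltered viewing, or filtering that leaves nothing useful:
        # default timing takes over at the baseline duration.
        return base_ms
    # Partial filtering slows information accumulation and prolongs fixations.
    return base_ms + cost_ms * (1 - info_rate)
```

The non-monotonic shape, with longest fixations under intermediate degradation and a return to baseline when filtering removes all useful information, is what makes duration effects easy to miss in coarse difficulty manipulations.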