Can visual search be explained by a model with only one free parameter, the size of the functional visual field (FVF) of a fixation, as suggested by Hulleman & Olivers (H&O)? Considering fixations, rather than individual items, as the primary unit of visual search agrees with the tight connection between eye gaze and information retrieval. H&O demonstrate that their framework successfully captures the variability of reaction times in easy, medium, and difficult searches of elementary visual features. However, beyond laboratory conditions ("find a specific item among very similar distractors"), visual search strategies can hardly be explained by such a simple model, because the search space is poorly specified (e.g., "Where did I leave my keys?", "Is my friend already here?") and the search strategy is affected by, for example, experience, task, memory, and motives. Moreover, some parts of the scene may attract attention and eye gaze automatically because of their social, and not only visual, saliency.
In real-life situations, the search targets are not a priori evenly distributed in the visual field, and the task given to the subject will affect the eye movements (Neider & Zelinsky 2006; Torralba et al. 2006; Yarbus 1967). Moreover, the scene context can provide spatial constraints on the most likely locations of the target(s) within the scene (Neider & Zelinsky 2006; Torralba et al. 2006). The viewing strategy is also affected by expertise: experienced radiologists find abnormalities in mammography images more efficiently than do less-experienced colleagues (Kundel et al. 2007); experts in art history and laypersons view paintings differently (Pihko et al. 2011); and dog experts view interacting dogs differently than do naïve observers (Kujala et al. 2012). Moreover, fixation durations vary depending on the task and scene: Although all fixations may be of about the same duration for homogeneous search displays, short fixations associated with long saccades occur while exploring the general features of a natural scene (ambient processing mode), and long fixations with short saccades take place while examining the focus of interest (focal processing mode; Unema et al. 2005).
H&O suggest that the concept of FVF would allow semantic biases in visual search by accommodating multiple parallel FVFs – for example, a small FVF for the target object and a larger FVF for recognizing the scene. This extension might account for processing within the fixated area, but could it also predict saccade guidance? Predicting eye movements in the real world would require a comprehensive model of the semantic saliency of the scene, which is highly challenging. That said, recent advances in neural-network modeling of artificial visual object recognition (Krizhevsky et al. 2012) could facilitate the modeling of the semantic and contextual features that guide the gaze (Kümmerer et al. 2014).
Finally and importantly, social cues strongly affect natural visual processing. Faces and other social stimuli efficiently attract gaze (Birmingham et al. 2008; Yarbus 1967), insofar as a saccade toward a face can be difficult to suppress (Cerf et al. 2009; Crouzet et al. 2010). Thus, the mere presence of a task-irrelevant face can disrupt visual search by attracting more frequent and longer fixations than do other distractors (Devue et al. 2012). Such viewing behavior contrasts with conventional search tasks, which become more difficult as the resemblance between distractors and target increases. Whereas faces capture attention (and gaze) in healthy subjects, autistic individuals are less distracted by social stimuli in the search scene (Riby et al. 2012) and show reduced saliency of semantic-level features, especially faces and social gaze, during free viewing of natural scenes (Wang et al. 2015). Altogether, social stimuli have such a central role in human behavior and brain function (Hari et al. 2015) that they should not be neglected in models aiming to explain natural visual-search behavior. Peripheral vision can provide effective summary statistics of the global features of the visual field (Rosenholtz 2016), and thus social stimuli, such as faces, outside of foveal vision could significantly affect visual search.
Face recognition represents a special case of visual search – a natural search task could be, for example, to find a friend in a crowd of people. For (Western) faces, the optimal fixation location is just below the eyes (Peterson & Eckstein 2012), and two fixations can be enough to recognize isolated face images (Hsiao & Cottrell 2008). Whether the same is true for faces in their natural context remains to be seen. Overall, it appears that saccades to faces and to scenes are consistent across subjects during initial viewing and become less consistent during later saccades (Castelhano & Henderson 2008). In addition, the initial saccades are consistent across cultures, with saccade endpoints reflecting the optimal fixation locations in face-identification tasks (Or et al. 2015). These findings raise interesting questions related to the neural underpinnings of natural visual search: How does the guidance of initial saccades differ from that of later saccades? At what level of cortical processing do cultural background and expertise affect saccade guidance?
In conclusion, we doubt that "an overarching framework of visual search" can be built without implementing the effects of contextual and social cues. Building a model that can predict an observer's eye movements during natural search tasks in real-world visual environments remains a challenge.