
Those pernicious items

Published online by Cambridge University Press: 24 May 2017

Ruth Rosenholtz*
Affiliation:
Department of Brain & Cognitive Sciences, CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139. rruth@mit.edu http://persci.mit.edu/people/rosenholtz

Abstract

Hulleman & Olivers (H&O) identify a number of problems with item-based thinking and its impact on our understanding of visual search. I detail ways in which thinking in terms of items is worse than the authors suggest. I concur with the broad strokes of the theory they set out, and clarify the relationship between their view and our recent theory of visual search.

Type: Open Peer Commentary
Copyright: © Cambridge University Press 2017

Our impression of a scene usually includes objects and their properties. When crossing the street, we consider the location and speed of a nearby car. However, the fact that we recognize "things" at the output of perception, and reason about objects at a high level, does not mean that our visual systems operate on presegmented things. This is a common and tempting cognitive error, one that can hamper efforts to uncover the true mechanisms of vision.

Objects make for a useful abstraction. It is natural, therefore, that many theories of vision describe processes as operating on objects and their features. For example, preattentive vision has been depicted as encoding item locations and features. (This is distinct from knowing image features, such as the outputs of V1 cells.) According to this view, search is slow because serial selective attention is necessary to bind those features together. Such word models enable easy intuitions and guide new experiments. Furthermore, abstracting from the image input to things and their features can sometimes make modeling tractable, as with signal detection theory.
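To make concrete how this abstraction buys tractability, consider a minimal sketch (my own, not any particular published model) of a signal-detection-style search observer: each item is reduced to a single noisy scalar feature value, and the observer reports "target present" if the maximum value exceeds a criterion. All parameter values here are illustrative assumptions.

```python
# Minimal signal-detection-style search observer (illustrative sketch).
# Each item contributes one noisy feature value; the observer uses a max
# rule: respond "present" if any item's value exceeds a criterion.
import numpy as np

rng = np.random.default_rng(0)

def sdt_search_trial(n_items, d_prime=2.0, criterion=2.5, target_present=True):
    """Simulate one trial: distractor features ~ N(0, 1); the target adds d_prime."""
    samples = rng.normal(0.0, 1.0, n_items)
    if target_present:
        samples[0] += d_prime            # one item carries the target signal
    return samples.max() > criterion     # max-rule decision

for n in (4, 8, 16):
    hits = np.mean([sdt_search_trial(n, target_present=True) for _ in range(5000)])
    fas = np.mean([sdt_search_trial(n, target_present=False) for _ in range(5000)])
    print(f"set size {n:2d}: hit rate {hits:.2f}, false-alarm rate {fas:.2f}")
```

Set-size effects fall out of noise accumulation alone; the modeling is easy precisely because the image, with all its complications, has been abstracted away.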

The authors argue that a focus on items has tainted our ideas about search: it has hampered understanding of search in real-world images, for which the set size (number of items) is ill-defined (Rosenholtz et al. 2007; Wolfe et al. 2011a; Zelinsky 2008); it has led to a focus on selective attention as the limiting mechanism, discounting the role of eye movements; it has caused the field to focus on the easy end of the performance spectrum; and it has led to overestimation of the importance of item location. I would argue that thinking about items is even more pernicious.

Item-based theories have not merely biased our choice of stimuli by limiting use of real-world images. Experimenters often design stimuli to preserve the preeminence of the item. One must avoid alignment, which might produce perceptual groups, or else risk violating assumptions that the items can be treated independently. This is analogous to visual short-term memory experiments that seem designed to give the subject little choice but to remember items; it should not surprise us when slot models do well.

Relatedly, only a handful of experiments have studied the effects of image transformations on search: What if we make the items larger or the bars thinner, change the sign of contrast, add noise to the display, or make the displays denser? Unless these transformations interfere with item visibility, none should have an impact if items are the atoms of search. Yet there is evidence that such transformations do have a significant effect (e.g., Beck et al. 1987; Chang & Rosenholtz 2014; Graham et al. 1992; Rubenstein & Sagi 1996).
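For concreteness, here is a sketch of such transformations applied to a toy display treated as an image rather than as a list of items; the display construction is my own and purely illustrative.

```python
# Toy search display as a grayscale image in [0, 1], mid-gray background.
import numpy as np

def make_display(n_items=8, size=256, bar_h=12, bar_w=3, seed=0):
    """Place n_items dark vertical bars at random positions."""
    rng = np.random.default_rng(seed)
    img = np.full((size, size), 0.5)
    for _ in range(n_items):
        r = int(rng.integers(0, size - bar_h))
        c = int(rng.integers(0, size - bar_w))
        img[r:r + bar_h, c:c + bar_w] = 0.1
    return img

display = make_display()
reversed_contrast = 1.0 - display           # change the sign of contrast
noisy = np.clip(display + np.random.default_rng(1).normal(0.0, 0.1, display.shape), 0.0, 1.0)
thinner = make_display(bar_w=1)             # thinner bars
denser = make_display(n_items=32)           # a denser display
# On a strictly item-based account, none of these variants should affect
# search, so long as each item remains individually visible.
```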

More broadly, it is risky to think of the visual input as consisting of an array of items with particular experimenter-defined features. Vertical rectangular bars also contain horizontal edges; oblique filters will also respond to those bars; the "white space" between items also has features; and some features of the display may have a scale larger than any individual item.
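This is easy to verify at the image level. In the following sketch, hand-built derivative-of-Gaussian kernels stand in for oriented filters (an assumption of mine, not a claim about any particular model); filters at every orientation respond to a single "vertical" bar, because the bar's ends contain horizontal structure.

```python
# Oriented edge filters all respond to a "vertical" bar: its sides drive
# vertical-edge filters, but its top and bottom ends drive horizontal ones.
import numpy as np
from scipy.ndimage import convolve, rotate

def dog_kernel(size=15, sigma=2.0):
    """Horizontal derivative of a Gaussian: a vertical-edge detector."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return -xx * g / sigma**2

img = np.zeros((64, 64))
img[20:44, 30:34] = 1.0                      # one vertical rectangular bar

base = dog_kernel()
for angle in (0, 45, 90):                    # vertical-, oblique-, horizontal-edge filters
    k = rotate(base, angle, reshape=False)
    peak = np.abs(convolve(img, k)).max()
    print(f"{angle:3d}-degree filter, peak response: {peak:.2f}")
```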

The dominance of item-based theories has led to a serious disconnect between theories that essentially operate on experimenter-labeled stimuli (items and their nominal features) and those that operate on actual images. In working with real images, a number of reasonable search strategies do not require items as such, for example, applying a template throughout the image and looking for locations with a strong response (Zelinsky 2008). If one does attempt to implement item-based theories, it quickly becomes clear that neither segmenting the items nor determining their supposed "features" is trivial. One is left with the puzzle of why one is "allowed" to use bound features to "preattentively" segment the image into items, but not to recognize the target.
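As an illustration of such an item-free strategy, here is a minimal normalized cross-correlation sketch: a stand-in of my own devising, not Zelinsky's (2008) actual model. The template is slid across the entire image, and strong responses flag candidate locations, with no segmentation into items anywhere in the pipeline.

```python
# Item-free search by template matching: correlate a target template with
# every image location and pick the peaks; no items are ever segmented.
import numpy as np

def normalized_xcorr(image, template):
    """Normalized cross-correlation map over all valid template positions."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    out = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + th, c:c + tw]
            p = (patch - patch.mean()) / (patch.std() + 1e-8)
            out[r, c] = (p * t).mean()
    return out

rng = np.random.default_rng(0)
image = rng.normal(0.0, 0.05, (40, 40))       # noisy background
template = np.zeros((8, 8))
template[1:7, 3:5] = 1.0                      # a small vertical bar
image[10:18, 20:28] += template               # embed the target
scores = normalized_xcorr(image, template)
print("best match at", np.unravel_index(scores.argmax(), scores.shape))  # ~(10, 20)
```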

It is hard not to think in terms of items. Despite their main thesis, Hulleman & Olivers (H&O) suggest that target-distractor discriminability is important for setting the size of the functional visual field (FVF). Why discriminability of the items? This leaves the puzzle of why search asymmetries abound, as surely target-distractor discriminability is generally the same as distractor-target discriminability. We argue that the major determinant of search performance is crowding (not retinal resolution), which demonstrates that peripheral vision operates over sizeable patches, typically containing multiple items. Discriminability of patches is what matters in this scheme (Rosenholtz et al. 2012b).

The authors allude to one theory attributing crowding to limited attentional resolution (Intriligator & Cavanagh 2001). This is subtly item-centric, presuming that attention aims to select only a single item. We have argued that this is not ideal in real images (Rosenholtz & Wijntjes 2014). Other theories of crowding also describe mechanisms that operate on items (Greenwood et al. 2009; 2012; Levi & Carney 2009; Parkes et al. 2001; Põder & Wagemans 2007; Strasburger 2005; van den Berg et al. 2012). Recently we have shown that a number of results used to test these item-based theories can instead be explained by the information available in a rich set of image statistics (Keshvari & Rosenholtz 2016). These same statistics plausibly underlie scene perception (Ehinger & Rosenholtz 2016; Rosenholtz et al. 2012a), suggesting that a single encoding scheme could both extract the scene context and support search, in agreement with H&O.
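To make the idea of "a rich set of image statistics" concrete, here is a drastically simplified sketch of a summary-statistic patch encoding. The models cited above use a far richer, Portilla-Simoncelli-style statistic set; this stand-in shows only the structural point that exact item positions are pooled away.

```python
# Encode a patch by pooled statistics of oriented filter responses
# (means, variances, cross-orientation correlations), discarding the
# exact positions of any "items" within the patch.
import numpy as np
from scipy.ndimage import convolve, rotate

def oriented_responses(patch, angles=(0, 45, 90, 135), size=9, sigma=1.5):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    base = -xx * np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / sigma**2
    return [convolve(patch, rotate(base, a, reshape=False)) for a in angles]

def patch_statistics(patch):
    """Summary-statistic code: marginal stats plus pairwise correlations."""
    flat = np.stack([r.ravel() for r in oriented_responses(patch)])
    corr = np.corrcoef(flat)[np.triu_indices(flat.shape[0], k=1)]
    return np.concatenate([flat.mean(axis=1), flat.var(axis=1), corr])

# Two patches containing the same bar at different positions get (nearly)
# identical codes: the encoding has thrown away item location.
a = np.zeros((32, 32)); a[4:16, 6:9] = 1.0
b = np.zeros((32, 32)); b[14:26, 20:23] = 1.0
print(np.abs(patch_statistics(a) - patch_statistics(b)).max())  # ~0
```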

The target article presents clever and thoughtful critiques of prevailing theories, and a new model. The parallels to recent work in my lab are fairly clear (Rosenholtz et al. 2012b; Zhang et al. 2015), though differences raise important questions. We agree that search likely involves parallel processing, punctuated by serial shifts of the point of fixation. Peripheral vision limits the information available at a glance. Our view is that, rather than being a mechanism, the FVF might describe the more informative image regions. It would degrade smoothly, with some regions providing more information than others. It need not be contiguous; eccentric uncrowded regions might provide more information than closer crowded ones. The authors are somewhat unclear on these points: Does the FVF have hard edges, outside of which no information exists for telling apart the target and distractor? Is it mechanistic or descriptive? Does some mechanism set the size? If so, how, and why?

References

Beck, J., Sutter, A. & Ivry, R. (1987) Spatial frequency channels and perceptual grouping in texture segregation. Computer Vision, Graphics, and Image Processing 37(2):299–325.
Chang, H. & Rosenholtz, R. (2014) New exploration of classic search tasks. Journal of Vision 14(10):933.
Ehinger, K. A. & Rosenholtz, R. (2016) A general account of peripheral encoding also predicts scene perception performance. Journal of Vision 16(2):13.
Graham, N., Beck, J. & Sutter, A. (1992) Nonlinear processes in spatial-frequency channel models of perceived texture segregation: Effects of sign and amount of contrast. Vision Research 32(4):719–43.
Greenwood, J. A., Bex, P. J. & Dakin, S. C. (2009) Positional averaging explains crowding with letter-like stimuli. Proceedings of the National Academy of Sciences of the United States of America 106(31):13130–35.
Greenwood, J. A., Bex, P. J. & Dakin, S. C. (2012) Crowding follows the binding of relative position and orientation. Journal of Vision 12(3):1–20.
Intriligator, J. & Cavanagh, P. (2001) The spatial resolution of visual attention. Cognitive Psychology 43:171–216. doi: 10.1006/cogp.2001.0755.
Keshvari, S. & Rosenholtz, R. (2016) Pooling of continuous features provides a unifying account of crowding. Journal of Vision 16(3):39, 1–15. doi: 10.1167/16.3.39.
Levi, D. M. & Carney, T. (2009) Crowding in peripheral vision: Why bigger is better. Current Biology 19(23):1988–93.
Parkes, L., Lund, J., Angelucci, A., Solomon, J. A. & Morgan, M. (2001) Compulsory averaging of crowded orientation signals in human vision. Nature Neuroscience 4(7):739–44.
Põder, E. & Wagemans, J. (2007) Crowding with conjunctions of simple features. Journal of Vision 7(2):23. doi: 10.1167/7.2.23.
Rosenholtz, R., Huang, J. & Ehinger, K. A. (2012a) Rethinking the role of top-down attention in vision: Effects attributable to a lossy representation in peripheral vision. Frontiers in Psychology 3(13):1–15. doi: 10.3389/fpsyg.2012.00013.
Rosenholtz, R., Huang, J., Raj, A., Balas, B. J. & Ilie, L. (2012b) A summary statistic representation in peripheral vision explains visual search. Journal of Vision 12(4):14, 1–17. doi: 10.1167/12.4.14.
Rosenholtz, R., Li, Y. & Nakano, L. (2007) Measuring visual clutter. Journal of Vision 7(2):17, 1–22. doi: 10.1167/7.2.17.
Rosenholtz, R. & Wijntjes, M. (2014) Peripheral object recognition with informative natural context. Journal of Vision 14(10):214.
Rubenstein, B. S. & Sagi, D. (1996) Preattentive texture segmentation: The role of line terminations, size, and filter wavelength. Perception and Psychophysics 58(4):489–509.
Strasburger, H. (2005) Unfocused spatial attention underlies the crowding effect in indirect form vision. Journal of Vision 5(11):1024–37.
van den Berg, R., Johnson, A., Martinez Anton, A., Schepers, A. L. & Cornelissen, F. W. (2012) Comparing crowding in human and ideal observers. Journal of Vision 12(6):13. doi: 10.1167/12.6.13.
Wolfe, J. M., Alvarez, G. A., Rosenholtz, R. E., Kuzmova, Y. I. & Sherman, A. M. (2011a) Visual search for arbitrary objects in real scenes. Attention, Perception, and Psychophysics 73:1650–71. doi: 10.3758/s13414-011-0153-3.
Zelinsky, G. J. (2008) A theory of eye movements during target acquisition. Psychological Review 115:787–835. doi: 10.1037/a0013118.
Zhang, X., Huang, J., Yigit-Elliott, S. & Rosenholtz, R. (2015) Cube search, revisited. Journal of Vision 15(3):9, 1–18. doi: 10.1167/15.3.9.