1. Introduction
Research suggests that vision deploys a system of object representations. This system has been called the object file system. Many have held that by virtue of internalizing certain principles about objects, the object file system is tuned to a specific type of entity. In this article, I consider the types of entities to which object files are tuned.
I focus on a view defended by Tyler Burge, Susan Carey, and others on which the object file system is tuned to Spelke-objects. Spelke-objects are characterized by several properties, including three-dimensionality (3-D), cohesion, and boundedness. I will call the view that the object file system is tuned to Spelke-objects the restrictive view. I will argue that we have little reason to accept the restrictive view.
In section 2, I distinguish two versions of the restrictive view, which I call the strong restrictive view and the weak restrictive view. The former view claims that the object file system critically relies on representing 3-D, cohesion, and boundedness, while the latter view claims that object files have the proper function of picking out Spelke-objects. In section 3, I argue that current evidence fails to support the strong restrictive view over a competing position that I call the permissive view. According to the permissive view, the object file system tracks entities by attributing certain well-known perceptual organization cues. In section 4, I argue that if the strong restrictive view is false, then the weak restrictive view is likely false too.
2. The Restrictive Principles
Object representations play an important role in visual attention. Attention naturally “spreads throughout” an object, enabling quicker comparisons of features belonging to the same object than features belonging to different objects (Chen Reference Chen2012). Moreover, multiple-object tracking (MOT) studies have shown that perceivers can attentively track a small set of objects as they move randomly amid distractors (Pylyshyn and Storm Reference Pylyshyn and Storm1988), even as their features change unpredictably (Zhou et al. Reference Zhou, Luo, Zhou, Zhuo and Chen2010).
Object representations are also used in visual working memory (VWM). First, object-reviewing studies have shown that we are faster to reidentify a previously seen feature when it reappears on the same object on which it initially appeared, even if the object shifts location in the interim (Kahneman, Treisman, and Gibbs Reference Kahneman, Treisman and Gibbs1992; Hollingworth and Rasmussen Reference Hollingworth and Rasmussen2010). Second, change-detection studies have shown that it is easier to recall two features from the same object than from different objects (Luck and Vogel Reference Luck and Vogel1997; Olson and Jiang Reference Olson and Jiang2002). Thus, vision binds features to objects, and VWM capacity is set in part by the number of objects stored. There is also evidence that the representations used in VWM are continuous with those used in attentive tracking. For example, both MOT and VWM recruit the intraparietal sulcus (Drew et al. Reference Drew, Horowitz, Wolfe and Vogel2011), and objects held in VWM interfere with performance in MOT (Fougnie and Marois Reference Fougnie and Marois2009).
Researchers have posited an “object file” system (OF-system) to account for these phenomena. The OF-system has three signature characteristics. First, object files sustain reference to entities over time, enabling us to track them from one moment to the next. Second, object files temporarily store features of the entities they pick out. Third, the OF-system is limited in how many entities it can select in parallel. On most views, the OF-system is limited to selecting three to four objects, with capacity slightly increasing between infancy and adulthood (Zosh and Feigenson Reference Zosh, Feigenson, Hood and Santos2009). Limits in this range have been observed for both MOT and VWM, again suggesting a common system for both capacities (Pylyshyn and Storm Reference Pylyshyn and Storm1988; Luck and Vogel Reference Luck and Vogel1997).
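To fix ideas, here is a minimal toy sketch, in Python, of a store with these three signature characteristics (sustained reference, temporary feature storage, and a parallel limit). It is purely illustrative: the class names, methods, and fixed capacity of four are my own assumptions rather than a model drawn from the literature.

```python
from dataclasses import dataclass, field

CAPACITY = 4  # assumed parallel limit of three to four files

@dataclass
class ObjectFile:
    """One object file: an index that sustains reference plus a temporary feature store."""
    index: int                                    # stands in for the file's link to its entity
    features: dict = field(default_factory=dict)  # features bound to that entity (may be empty)

class ObjectFileSystem:
    """Toy OF-system: opens at most CAPACITY files in parallel."""
    def __init__(self):
        self.files = {}

    def select(self, index):
        """Open a file for a newly selected entity, if capacity allows."""
        if index in self.files:
            return self.files[index]
        if len(self.files) >= CAPACITY:
            return None  # selection fails once the parallel limit is reached
        self.files[index] = ObjectFile(index)
        return self.files[index]

    def update(self, index, **features):
        """Tracking: the same file follows its entity; the stored features may change."""
        if index in self.files:
            self.files[index].features.update(features)

    def discard(self, index):
        """Drop a file, for example when persistence cues are violated."""
        self.files.pop(index, None)
```

On this toy picture, a fifth attempted selection simply fails, mirroring the three-to-four-item limits observed in both MOT and VWM.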
Object files have played an important role in recent philosophical accounts of demonstrative reference (Murez and Recanati Reference Murez and Recanati2016). Some have even appealed to object files in explicating the basis of our capacity to think about particulars (Pylyshyn Reference Pylyshyn2007). For these projects, it is no doubt important to consider the basic question of which things object files pick out. For example, if demonstrative reference is fixed through visual attentional channels (Dickie Reference Dickie and Jeshion2010), it is important to know which things those channels deliver.
In several seminal papers, Spelke (Reference Spelke1990, Reference Spelke1994) proposed that infants segment scenes in accordance with certain object principles. Following Spelke, subsequent authors—notably Burge and Carey—have proposed that OF-system processes (selection, tracking, and VWM) are governed by these principles. Burge writes: “To represent something as a body, the individual’s perceptual system must segment a three-dimensional whole from a surround. … Its doing so is governed by principles for identifying cohesiveness and boundedness of three-dimensional volume shapes. And it must be able to track the wholes over time, either in motion or at rest. Tracking depends on attribution of maintenance of cohesiveness and boundedness of volume shapes” (Reference Burge2010, 464). Burge thus proposes that when tracking particular objects (which he calls “bodies”), the visual system relies on the principles of 3-D, cohesion, and boundedness. He claims elsewhere that visual tracking abilities are “emphatically tied to bodies that maintain their cohesion” (461). Carey adopts a similar view: “Object files symbolize physical objects, by which I mean bounded, coherent, 3-D, separable, spatio-temporally continuous wholes” (Reference Carey2009, 97). Let us unpack what these principles mean.
The 3-D principle, on Burge’s construal, requires that objects be volumetric and thus distinct from 2-D retinal regions and 2-D surface patches. Note, however, that while Burge and Carey endorse versions of the 3-D criterion, Spelke does not include it within her set of object principles.
As Burge and Carey acknowledge, the cohesion and boundedness principles are due to Spelke. Spelke’s rules state topological criteria. The cohesion principle is this:
Cohesion: “Two surface points lie on the same object only if the points are linked by a path of connected surface points” (Spelke Reference Spelke1990, 49).
Cohesion entails that objects are material, topologically continuous figures. For any two surface points x and y, if x and y belong to a single object, then y can be reached from x by following a continuous path of surface points. My cell phone satisfies the cohesion constraint because any two points on it are linked via a continuous path of surface points.
The boundedness principle is this:
Boundedness: “Two surface points lie on distinct objects only if no path of connected surface points links them” (Spelke Reference Spelke1990, 49).
Boundedness entails that objects are maximal connected figures. For any two surface points x and y, if there are distinct objects O and O* such that x belongs to O and y belongs to O*, then y cannot be reached from x by following a continuous path of surface points. My cell phone and toaster count as distinct objects under the boundedness principle, while two halves of my phone do not.
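Stated schematically (in my notation, not Spelke’s), with S(x, o) for “surface point x lies on object o” and P(x, y) for “x and y are linked by a path of connected surface points,” the two principles come to:

```latex
\begin{align*}
\textit{Cohesion:} \quad & \forall x\,\forall y\,\forall o\;\bigl[\,S(x,o)\wedge S(y,o)\;\rightarrow\;P(x,y)\,\bigr]\\
\textit{Boundedness:} \quad & \forall x\,\forall y\,\forall o\,\forall o^{*}\;\bigl[\,o\neq o^{*}\wedge S(x,o)\wedge S(y,o^{*})\;\rightarrow\;\neg P(x,y)\,\bigr]
\end{align*}
```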
Some include solidity as another object principle. This principle states that distinct objects cannot physically interpenetrate. Looking-time studies suggest that infants are surprised by solidity violations (Spelke et al. Reference Spelke, Breinlinger, Macomber and Jacobson1992). However, others have questioned whether the OF-system really takes solidity to be a necessary condition on objecthood, since carefully constructed displays can elicit percepts of objects persisting through interpenetration (Scholl and Leslie Reference Scholl, Leslie, Lepore and Pylyshyn1999, 37–38). It is possible that we are surprised by solidity violations but do not always interpret them as disrupting object persistence. In any event, I will not consider solidity in detail here.
By the restrictive view, I mean the view that the principles governing visual object selection and tracking include the principles of 3-D, cohesion, and boundedness. I will refer to these as the “restrictive principles,” and I will call the entities that satisfy them “Spelke-objects.”
We can distinguish two versions of the restrictive view, which differ in the role they assign the restrictive principles. The strong restrictive view holds that in order to select and track an object, the OF-system must represent it as 3-D, cohesive, and bounded. This view is suggested by Burge’s claim that “tracking depends on attribution of maintenance of cohesiveness and boundedness of volume shapes” (Reference Burge2010, 464). The weak restrictive view holds that the restrictive principles characterize the type of thing that the OF-system has the proper function of selecting. On this view, the OF-system is tuned to Spelke-objects, but it need not attribute the properties constitutive of being a Spelke-object during tracking. I will argue that we have little reason to accept either version of the restrictive view.
3. The Strong Restrictive View
This section considers the strong restrictive view. I first contrast this view with a competing view of the properties relied on during tracking, which I call the permissive view. Sections 3.2–3.4 consider the three restrictive principles. In each case, I will argue (i) that the evidence garnered in support of the principle can also be explained by the permissive view, (ii) that there are experimental tests that may help to settle the dispute between these positions, and (iii) that there already exists evidence against the principle.
3.1. The Permissive View
Kimchi (Reference Kimchi2009, 25) characterizes “visual objects” as “elements in the visual scene organized by Gestalt factors into a coherent unit” (see also Brovold and Grush Reference Brovold, Grush, Raftopoulos and Machamer2012; Chen Reference Chen2012). The idea is that the principles guiding visual object individuation are simply the rules of perceptual organization. These rules characterize the properties that selection and tracking processes represent and rely on. I believe that this idea has received insufficient attention in the object file literature.
We can divide perceptual organization principles into grouping principles and parsing principles. The former specify rules for composing smaller entities into larger entities, while the latter specify rules for decomposing larger entities into smaller entities. By “entities,” I will be as noncommittal as possible. Bodies, shadows, parts, and arbitrary mereological sums all count as entities. When I speak of “units,” I mean only those types of entities that can be singled out through perceptual grouping or parsing.
Traditional grouping principles include proximity, similarity, good continuation, and common fate (Wagemans et al. Reference Wagemans, Elder, Kubovy, Palmer, Peterson, Singh and von der Heydt2012). For example, the proximity principle states that entities that are close together tend to be grouped into units, while the common fate principle states that entities that move along similar motion paths tend to be grouped. Newer grouping principles include element connectedness and uniform connectedness (Palmer and Rock Reference Palmer and Rock1994). The former states that connected entities tend to be grouped. The latter states that regions of the visual field that have some uniform property (e.g., color) tend to be treated as units. Examples are shown in figure 1.
Figure 1. Grouping by (a) proximity, (b) common fate, (c) similarity, (d) element connectedness, and (e) uniform connectedness.
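To make the first two of these cues concrete, the following Python sketch (my own illustration, not a model from the grouping literature) links pairs of moving dots that are both close together (proximity) and moving in similar directions (common fate), and returns the connected components as units. The thresholds and function names are arbitrary assumptions, and each cue could of course also be applied on its own.

```python
import math
from itertools import combinations

def group_elements(positions, velocities, dist_thresh=1.0, angle_thresh=0.9):
    """Group elements into units using proximity and common fate cues.

    positions, velocities: lists of (x, y) tuples, one per element.
    Two elements are linked when they are close (proximity) and their motion
    directions are similar (common fate); the returned units are the connected
    components of those links. Both thresholds are arbitrary.
    """
    n = len(positions)
    parent = list(range(n))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    def cos_sim(u, v):
        # Cosine similarity of two velocity vectors (1.0 if either is zero).
        nu, nv = math.hypot(*u), math.hypot(*v)
        return 1.0 if nu == 0 or nv == 0 else (u[0] * v[0] + u[1] * v[1]) / (nu * nv)

    for i, j in combinations(range(n), 2):
        near = math.dist(positions[i], positions[j]) < dist_thresh          # proximity cue
        common_fate = cos_sim(velocities[i], velocities[j]) > angle_thresh  # common fate cue
        if near and common_fate:
            union(i, j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

For example, eight dots arranged as two tight clusters drifting in different directions come back as two units, which is the kind of case that matters for the group-tracking results discussed in section 3.3.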
Two important parsing principles are the minima rule (Hoffman and Richards Reference Hoffman and Richards1984) and the shortcut rule (Singh, Seyranian, and Hoffman Reference Singh, Seyranian and Hoffman1999). The minima rule states that the boundaries between parts of an object tend to be found at negative curvature minima—places at which the contour of a shape is most concave. The shortcut rule states that part divisions tend to be made by linking curvature minima along the shortest paths possible. The decomposition in figures 2a and 2b follows the minima and shortcut rules. There is strong evidence that vision segments objects in this way (Singh and Hoffman Reference Singh and Hoffman2001) and that infants and young children engage in parsing (Tversky Reference Tversky1989; Bhatt et al. Reference Bhatt, Hayden, Kangas, Zieber and Joseph2010). I call parts of objects returned by parsing principles parsable parts.
Figure 2. Decompositions following (a) minima and (b) shortcut rules.
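A crude computational rendering of these two rules (my own simplification, assuming a simple, closed, counterclockwise polygonal contour) treats every concave vertex as a negative curvature minimum and implements only the most basic form of the shortcut rule, joining the closest pair of minima:

```python
import math
from itertools import combinations

def concave_vertices(polygon):
    """Indices of negative-curvature (concave) vertices of a closed, counterclockwise
    polygon: the candidate part boundaries under the minima rule (simplified)."""
    idx = []
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i - 1]
        bx, by = polygon[i]
        cx, cy = polygon[(i + 1) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross < 0:  # the contour turns clockwise at b, so the shape is locally concave here
            idx.append(i)
    return idx

def shortest_cut(polygon):
    """Shortcut rule (simplified): propose the part cut joining the two concave
    vertices that lie closest together; returns None if there are fewer than two."""
    pairs = combinations(concave_vertices(polygon), 2)
    return min(pairs, key=lambda p: math.dist(polygon[p[0]], polygon[p[1]]), default=None)
```

For a dumbbell-shaped outline, for instance, the proposed cut runs across the thin bar, separating one weight from the rest of the figure.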
Grouping and parsing principles operate in natural scenes. Flocks of geese and swarms of bees seem to be things that we can differentiate and track over time. Likewise, it seems as though we can differentiate a person’s arm and track it independently of her torso. But does the OF-system select and track entities such as these?
If so, this raises prima facie difficulties for the strong restrictive view. If the OF-system tracks groups while representing their noncohesion, then tracking does not critically rely on representing cohesion. If the OF-system tracks parsable parts while representing their nonboundedness, then tracking does not critically rely on representing boundedness. Note also that perceptual organization principles apply to both volumetric and 2-D entities. A group of 2-D dots can be organized by proximity, similarity, or common fate. If the OF-system tracks such units while representing their planarity, then tracking does not critically rely on representing 3-D.
By the permissive view, I mean, roughly, the view that the OF-system selects and tracks in accordance with perceptual organization principles and, moreover, that such principles characterize the properties represented and relied on during selection and tracking. The permissive view claims that when the OF-system selects and tracks an entity, it does so because it represents the entity as satisfying certain perceptual organization cues.
Multiple versions of the permissive view are possible, depending on which perceptual organization principles the OF-system follows. If the OF-system follows proximity, but not similarity, then it can treat a group of entities as a single object when the entities are tightly clustered but not when they merely share the same color. In what follows, I have in mind a specific variant of the permissive view. This is the view that, at all ages, object files can select entities either by representing them as satisfying common fate and proximity grouping cues or by representing them as satisfying minima and shortcut parsing cues. This view predicts that in certain situations, attribution of common fate and proximity is sufficient for the OF-system to select an entity, even if it is not represented as 3-D or cohesive. The view also predicts that in certain situations, the OF-system can select parts represented as satisfying the minima and shortcut rules, even if they are not represented as bounded.
Four clarifications are in order. First, the permissive view is consistent with internal connectedness (which figures prominently in the restrictive principles) being a cue to grouping and to tracking. However, unlike the strong restrictive view, the permissive view does not hold that attribution of internal connectedness is required for tracking. It instead places connectedness on a par with the other grouping cues.
Second, although I emphasize proximity and common fate, I believe it is plausible that at later ages, object files select entities on the basis of the other cues as well. There is evidence that certain grouping principles are acquired later than others, through learning or maturation (Quinn and Bhatt Reference Quinn and Bhatt2005). Moreover, below I discuss evidence that adults can select groups defined by similarity or good continuation.
Third, the permissive view does not entail that object files always select entities on the basis of geometrical simplicity. Spelke (Reference Spelke1990) persuasively argues that infants do not automatically organize static scenes by contour alignment or symmetric form. This suggests that during infancy, the OF-system does not follow all of the perceptual organization principles. However, one can accept this while maintaining that infants follow certain organization principles. The common fate and proximity principles do not prioritize relations like symmetry or collinearity.
Fourth, I will not suggest that Spelke-objects, groups, and parts are treated alike by all visual processes. The present issue concerns only whether object files and the processes they subserve are keyed to the former category alone. Other visual processes may be different. I now turn to the evidence for the strong restrictive view.
3.2. Evidence for 3-D?
Many of the tasks used to study object files employ 2-D images. It is also plausible from everyday experience that we can track certain nonvolumetric entities, such as shadows. Prima facie, these observations challenge the strong restrictive view. How can it be that the OF-system critically relies on attributing 3-D, if it so easily tracks 2-D entities?
The obvious response is that such 2-D entities are misrepresented as volumetric. Carey writes:
Does the fact that 2-D bounded entities activate object-files mean that their content is more perceptual—perhaps closed shape? … No, they should not. For computer displays to work, we must present many of the cues for depth in 2-D arrays, and surfaces arranged in 3-D are routinely perceived in such displays. That the system can be fooled into accepting 2-D entities as objects does not mean that it is not representing the stimuli as real objects, just as the fact that the system can be fooled into seeing depth in 2-D displays … does not mean it is not representing the stimuli as arrayed in 3-D space.
(Reference Carey2009, 98)

Carey proposes, then, that computer images in tracking studies are misrepresented as “real” objects with volume. The evidence for this is that 2-D figures are selected only when they supply appropriate cues to depth.
The requirement that objects are “3-D” is somewhat ambiguous. On a stronger reading—which I have assumed so far—this means that objects have volume. But on a weaker reading, it just means that objects are arrayed in 3-D space, standing in depth relations to the observer and to other things. Something can have the second characteristic without the first. Surfaces are generally treated as 2-D entities arrayed in depth.
Even granting that 2-D computer images are represented as located in depth, this does not entail that they are represented as volumetric. So Carey’s response at best rescues the weaker version of the 3-D principle, not the stronger one. As we saw above, Burge understands the 3-D requirement in the stronger way. Carey is less explicit, but she may have in mind the weaker version. However, the weak reading seems quite forced. It counts planar polygons as “3-D,” as long as they are embedded in 3-D space. It should be noted, though, that the 3-D principle may be less important to Carey than the other principles, since her experiments do not generally test it.
Turning to the stronger version of the 3-D principle, there is very little evidence that the OF-system relies on representing tracked entities as volumetric. Stimuli in MOT and object-reviewing studies do supply cues to depth and occlusion (see Scholl and Pylyshyn Reference Scholl and Pylyshyn1999), and MOT performance is better when targets occupy a different depth plane from distractors (Rehman et al. Reference Rehman, Kihara, Matsumoto and Ohtsuka2015). However, such stimuli rarely supply standard cues to volume (e.g., differential shading, texture density, or surface orientation edges). Moreover, such stimuli look nonvolumetric. They look like flat, 2-D figures. As such, there is little evidential support for the claim that tracking relies on attributing volume.
To make progress on this issue, future work should systematically examine object selection and tracking as a function of cues to planarity versus volume. While I have suggested that they look 2-D, the stimuli generally used in midlevel vision tasks are arguably consistent with either a 2-D or 3-D interpretation.
There is, however, suggestive evidence that we can visually select entities that are quite clearly 2-D. De-Wit, Milner, and Kentridge (Reference de-Wit, Milner and Kentridge2012) found that shadows elicit typical object-based attention effects. When cued to a region of a cast shadow, subjects are faster to shift attention to another region of the same shadow than to an equidistant region of another shadow. Furthermore, children can detect shadows and acquire information from them at an early age (Cameron and Gallup Reference Cameron and Gallup1988). It remains to be seen whether shadows can be selected in other OF-system paradigms.
3.3. Evidence for Cohesion?
Studies of infant object cognition and adult midlevel vision have been taken to show that the OF-system relies on representing cohesion. I will argue that the available evidence does not in fact show this, since it can be explained equally well on the hypothesis that the OF-system selects and tracks by representing proximity and common fate cues. I will also discuss positive evidence that object files can select noncohesive entities.
Huntley-Fenner, Carey, and Solimando (Reference Huntley-Fenner, Carey and Solimando2002) contrasted 8-month-olds’ ability to track cohesive objects with their ability to track noncohesive sand piles. Infants in the ‘object’ condition saw a rigid entity that had the same shape, color, and texture as a sand pile. The object was lowered onto a stage while maintaining cohesion. Two screens were then introduced, one of which occluded the object. Next, infants saw another pile-shaped cohesive object lowered behind the second screen. Finally, both screens were removed, revealing either one or two objects. The ‘sand’ condition was identical to the object condition, except that the experimenter poured loose sand out of a cup into piles on the stage, rather than lowering cohesive objects.
In the object condition, infants looked longer toward the unexpected outcome of only one object, while in the sand condition, there was no significant difference in looking times for the two outcomes. This is consistent with the proposal that object files were maintained for cohesive objects (enabling infants to track how many there were) but were discarded for noncohesive sand piles.Footnote 1 The proposal is that once the piles were perceived as noncohesive, the OF-system could no longer track them.
Similarly, vanMarle and Scholl (Reference vanMarle and Scholl2003) compared the standard MOT condition with a condition in which each object disintegrated into small pieces and “poured” from one location to the next. Tracking was significantly worse in the latter case (89% vs. 67% accuracy). More recently, Cheries et al. (Reference Cheries, Mitroff, Wynn and Scholl2008) showed that simply splitting an object into two pieces disrupts infants’ object representations. Infants were divided into a ‘no-split’ condition and a ‘split’ condition. Those in the no-split condition saw one cracker placed in a container and two already disconnected crackers placed in another container. In the split condition, they also saw one cracker placed in one container and two crackers placed in the other container. However, the two crackers resulted from breaking a single larger cracker within view of the infants. Cheries et al. found that infants reliably crawled toward the two-cracker container in the no-split condition, but were at chance in the split condition. They suggest that infants represented the loss of cohesion from splitting and thus discarded the object file for the larger cracker. Furthermore, infants did not have enough time to assign new files to the resulting pair of crackers before they were placed in the container.Footnote 2
However, there is reason to suspect that cohesion violations may not be the key factor underlying tracking failures in these studies. This is because (i) failures in these or similar paradigms are observed with cohesive stimuli, and (ii) successes in these or similar paradigms are observed with noncohesive stimuli.
As regards i, tracking is also impaired when entities retain their cohesion but expand or contract the way a pile of sand changes shape when poured from a cup. For instance, vanMarle and Scholl (Reference vanMarle and Scholl2003) found that when targets expanded and contracted like “slinkies” (while maintaining cohesion), tracking was just as impaired as when they poured from one location to another (see also Howe et al. Reference Howe, Holcombe, Lapierre and Cropper2013). This raises the possibility that the tracking impairments in both Huntley-Fenner et al. (Reference Huntley-Fenner, Carey and Solimando2002) and the pouring condition of vanMarle and Scholl (Reference vanMarle and Scholl2003) may be due to the expansion and contraction involved in pouring, rather than loss of cohesion per se.
As regards ii, there is evidence that tracking is unimpaired for noncohesive groups, as long as entities in a group move as a tight cluster. In a fourth condition of vanMarle and Scholl’s (Reference vanMarle and Scholl2003) study, noncohesive groups of items that moved as clusters were tracked just as well as standard figures in the MOT task. Similarly, Wynn, Bloom, and Chiang (Reference Wynn, Bloom and Chiang2002) found that infants are capable of enumerating perceptual groups, independently of the number of elements in those groups. Infants were habituated to a display containing either two or four moving clusters of three dots. During test trials, they saw displays containing either two groups of four dots or four groups of two dots. Infants who habituated to displays containing two groups looked longer at test displays of four groups, while those habituated to displays containing four groups looked longer at test displays of two groups. Note that both test displays contained the same number of individual dots. A natural explanation, then, is that infants assigned files to each group of dots on the basis of common fate and proximity, enabling them to detect changes in the number of groups.Footnote 3
The foregoing considerations suggest a competing explanation of the results taken to support cohesion. The reason for failures of tracking and individuation in these cases may be that the OF-system is highly sensitive to correlated motion paths among clustered items. On this hypothesis, items are grouped by representing cues to proximity and correlated motion. However, given cues to independent motion among items, items are unlikely to be grouped—or, if they are already grouped, they are likely to become ungrouped. The attribution of proximity and common fate is, on this proposal, sufficient for selection and tracking. Further attribution of cohesion is unnecessary.
Consider again vanMarle and Scholl’s (Reference vanMarle and Scholl2003) slinky condition. The present hypothesis suggests that although the entities maintained cohesion, the visual system detected that the elements comprising them (their front and back edges) moved independently. This could have prevented grouping of the edges into a single object and led to a failure to maintain object files for the slinkies. This proposal also applies to the “pouring” condition of vanMarle and Scholl’s study and to the Huntley-Fenner et al. (Reference Huntley-Fenner, Carey and Solimando2002) task, since the elements of the poured substances provided strong cues to independent motion. Moreover, consistent with the proposal that tracking relies on correlated motion cues, St. Clair, Huff, and Seiffert (Reference St. Clair, Huff and Seiffert2010) found that when texture elements within a target move opposite its motion direction, MOT performance suffers.
A similar hypothesis may explain the data from Cheries et al. (Reference Cheries, Mitroff, Wynn and Scholl2008). When the cracker split apart, the resulting pieces initially followed quite different trajectories. The current hypothesis suggests that because of this, they were no longer perceptually grouped. Again, on this proposal it is not the violation of cohesion that explains tracking failures but, rather, cues to independent motion.
Likewise, in studies demonstrating tracking or individuation success despite lack of cohesion (Wynn et al. Reference Wynn, Bloom and Chiang2002; vanMarle and Scholl Reference vanMarle and Scholl2003), the key feature, on this proposal, is that the elements of noncohesive groups followed similar motion paths. This led them to be grouped into a unit and targeted by an object file.
Might the tracked groups in Wynn et al. (Reference Wynn, Bloom and Chiang2002) and vanMarle and Scholl (Reference vanMarle and Scholl2003) have been represented as cohesive? A defender of the strong restrictive view might appeal to some of Spelke’s earlier work to motivate this proposal. Spelke and colleagues demonstrated that when infants encounter surface fragments that move together on opposite ends of an occluder, they perceive them as belonging to a single, cohesive object. For example, if infants see the ends of a rod move back and forth behind a box, then they will subsequently look longer toward a pair of fragmented rods than toward a single, connected rod (Kellman and Spelke Reference Kellman and Spelke1983; Kellman, Spelke, and Short Reference Kellman, Spelke and Short1986).
Kellman and Spelke concluded that from an early age, infants can compute cohesion from common motion. One might suggest that this occurred in Wynn et al. (Reference Wynn, Bloom and Chiang2002) and vanMarle and Scholl (Reference vanMarle and Scholl2003). However, in Kellman and Spelke’s paradigm, an occluder separated the moving fragments. Accordingly, infants were not provided with any evidence that contradicted a cohesive interpretation. Thus, this evidence does not show that vision infers cohesion from common motion even when contradictory information is available—namely, when discontinuities between fragments are fully visible. The stimuli in Wynn et al. (Reference Wynn, Bloom and Chiang2002) and vanMarle and Scholl (Reference vanMarle and Scholl2003) are of this latter kind: the discontinuities between their elements were fully visible. So, to motivate this response, the defender of the strong restrictive view would need to show that participants in these studies represented cohesion despite inconsistent visual information.
Future research should directly investigate whether infants’ apparent sensitivity to cohesion is really due to common fate or proximity grouping. Are infants also impaired when required to individuate and track “slinky” items, just as adults were in vanMarle and Scholl (Reference vanMarle and Scholl2003)? If so, then common fate likely explains the Huntley-Fenner et al. (Reference Huntley-Fenner, Carey and Solimando2002) results. Furthermore, it is critical to determine whether the clusters of elements tracked in Wynn et al. (Reference Wynn, Bloom and Chiang2002) and vanMarle and Scholl (Reference vanMarle and Scholl2003) were represented as cohesive, despite visible spatial gaps. Do participants perceptually distinguish moving clusters from Spelke-objects but track them anyway? If so, then this is compelling evidence for the permissive view. If not, then the strong restrictive view may yet handle our capacity to track groups.
There is evidence beyond MOT, however, that object files may target noncohesive groups. Like Spelke-objects, perceptual groups structure the storage of information in VWM. Participants are better at recalling features from two grouped items than from two ungrouped items (Woodman, Vecera, and Luck Reference Woodman, Vecera and Luck2003; Peterson and Berryhill Reference Peterson and Berryhill2013). Moreover, Brady and Tenenbaum (Reference Brady and Tenenbaum2013) recently showed that VWM capacity for patterned, many-item displays could be modeled reasonably well on the assumption that participants “chunk” items based on similarity and proximity grouping cues. They estimated a capacity of four groups, mirroring limits generally found for cohesive objects.Footnote 4 Finally, there is evidence that groups elicit typical object-based attention effects. Feldman (Reference Feldman2007) found that participants were faster to compare probes that appeared on line segments grouped by good continuation, symmetry, or parallelism than probes that appeared on ungrouped line segments.
3.4. Evidence for Boundedness?
The boundedness requirement precludes undetached parts from counting as objects. But is there evidence that the OF-system relies on attributing boundedness? If so, then infants and adults should display pronounced deficits when asked to track entities seen as nonbounded. I will argue that while the evidence suggests that object files cannot target arbitrary parts of objects, it does not show that object files cannot target parsable parts of objects. I will then argue that further evidence suggests that object files can target parsable parts.
Although boundedness has received less attention than cohesion, several authors have suggested that vision fails to track undetached parts. For example, Fodor and Pylyshyn (Reference Fodor and Pylyshyn2015) write: “It turns out that subjects can track dumbbells but can’t track their weights (unless we remove the rod that connects them). Connecting the parts of the dumbbell creates a single new object, not just an arrangement of the parts of one (see Scholl, Pylyshyn, and Feldman Reference Scholl, Pylyshyn and Feldman2001). … What subjects see when they see dumbbells is not arrangements of undetached dumbbell parts but, unsurprisingly, dumbbells. The world is the totality of things, not undetached parts of things” (131). Fodor and Pylyshyn suggest that the cited experiment (Scholl et al. Reference Scholl, Pylyshyn and Feldman2001) shows that tracking is selectively keyed to bounded things, rather than to undetached parts.Footnote 5 They also suggest that this helps in responding to Quine’s “gavagai” problem. While I will not be concerned here with referential indeterminacy issues, it is worth asking whether the evidence that they cite actually supports the claim that tracking mechanisms incorporate a strict boundedness constraint.
Fodor and Pylyshyn claim that subjects cannot track the weights of dumbbells unless we remove the rod that connects them. But the study they cite does not substantiate this. Scholl et al. (Reference Scholl, Pylyshyn and Feldman2001) examined a number of tracking conditions, but three are most relevant. The first condition replicated the standard MOT paradigm. In the second, subjects tracked the endpoints of lines as they moved about in the presence of distractors. In the third, subjects tracked the square ends of dumbbells. Dumbbells were composed of two squares linked by a line segment. Importantly, dumbbell weights are likely to be discriminated through perceptual parsing, while the endpoints of lines are not—endpoints are arbitrary parts. So if the OF-system can select parsable but not arbitrary parts, then it should be able to select dumbbell weights but not endpoints.
Consistent with Fodor and Pylyshyn’s claim, subjects were better in the standard MOT condition than in either of the other two, and performance when tracking line endpoints was only slightly better than chance (see also Howe, Incledon, and Little Reference Howe, Incledon and Little2012). However, performance when tracking dumbbell weights was well above chance. Tracking accuracy with weights was roughly 84% versus 92% in the standard condition (Scholl et al. Reference Scholl, Pylyshyn and Feldman2001, 170). Thus, while there was some (statistically significant) decrement associated with tracking weights versus tracking individual boxes, this hardly shows that we cannot track them.
There is also an unfortunate asymmetry in comparing the tracking of “whole” objects with the tracking of their parts. Suppose that object files can select both bounded figures and parsable parts of them. Further, suppose that during tracking, we sometimes mix up a target with one of its parts. In that case, we would expect MOT performance to be slightly better for whole objects than for parts. Tracking a part of an object suffices for reidentifying the whole object at the end of a MOT trial, but tracking a whole object does not suffice for reidentifying one of its parts. Thus, if a participant needs to track a whole object but “swaps” it with one of its parts, she will still perform the task equally well. However, if she needs to track only a part of an object but swaps it with the whole object (or an adjoining part), her performance will suffer. Thus, even if tracking accuracy is somewhat higher for whole objects than for parts, we cannot be sure whether this reflects a systematic deficit in selecting nonbounded entities or occasional object-part swapping.
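To see how much of the decrement swapping alone could produce, consider a purely illustrative calculation (mine, not Scholl et al.’s). Suppose tracking succeeds with probability a whether the target is a whole object or a part, but that with probability p the tracked item is swapped with a structurally related item; a whole-object target swapped with one of its own parts is still reported correctly, whereas a part target swapped with its whole (or an adjoining part) is scored as an error:

```latex
\begin{align*}
\text{Expected accuracy (whole-object targets):}\quad & (1-p)\,a + p\,a = a\\
\text{Expected accuracy (part targets):}\quad & (1-p)\,a + p\cdot 0 = (1-p)\,a
\end{align*}
```

With a = 0.92, a swap rate of p ≈ 0.09 already brings part-tracking accuracy down to about 0.84, matching the reported difference without any systematic deficit in selecting nonbounded entities. This is only a toy model (a swapped part trial might sometimes still be answered correctly by chance), but it illustrates why the comparison is not decisive.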
It is important, then, to consider evidence beyond MOT. In another study taken to support boundedness, Mitroff, Scholl, and Wynn (Reference Scholl and Wynn2005) analyzed the effect of merging on the object-specific preview benefit (OSPB). Participants saw an initial display containing three circles. Letters briefly appeared on each circle and then vanished. Next, the objects took part in one of two events. Either two of the circles merged to become a single circle, or two of the circles approached one another without merging. Finally, a single letter appeared in one of the circles, and subjects had to indicate whether it had appeared earlier. OSPBs are associated with faster “same” responses to letters that reappear on the same object than to letters that reappear on a different object. Mitroff et al. found that merging (but not approaching) eliminated OSPBs for the lower of the two merged objects. This demonstrates that object files treat the merging of two circles into one circle as a transition from two objects to one. But why? One explanation for this is that the OF-system strictly relies on boundedness in determining object persistence. When the merged objects were no longer perceived as bounded, the OF-system concluded that one of them went out of existence and discarded the relevant object file.
However, we need not conclude that the OF-system relies critically on attributing boundedness in order to explain this result, because perceptual parsing rules would also require treating the merged circle as only one object. Since a circle lacks curvature discontinuities, there is no basis for perceptually decomposing it into parts. Thus, even if the OF-system merely relies on grouping and parsing cues, rather than boundedness per se, it should arguably treat the premerge circles as two objects and the postmerge circle as only one.
To make progress on this issue, future work should systematically compare OSPBs for bounded objects, arbitrary parts, and parsable parts. Are OSPBs also found when a preview feature reappears in the same parsable part of an object versus a different one? Moreover, it is critical to determine whether boundedness violations disrupt perception of object persistence in the context of preserved parsability. For example, suppose that two objects merge, but their merging preserves curvature minima that allow the figure to be decomposed into parts (see fig. 3). The strong restrictive view predicts that at least one of the OSPBs should be eliminated in this case, since the resulting entities violate boundedness. If, however, OSPBs are preserved through such transformations, this would support the permissive view.
Figure 3. Merging with preserved parsability. Left, congruent match; right, incongruent match. OSPBs would be indicated by faster reaction times on congruent versus incongruent trials.
Although this question awaits further investigation, there is already evidence to suggest that object files can target parts. First, like Spelke-objects and groups, parsable parts structure information storage in VWM. Two features are easier to recall when they belong to the same part of an object than when they belong to different parts (Xu Reference Xu2002). Moreover, visual attention more readily spreads within a part than across parts (Barenholtz and Feldman Reference Barenholtz and Feldman2003), mirroring typical object-based attention effects. Finally, Porter et al. (Reference Porter, Mazza, Garofalo and Caramazza2016) recently found that parallel enumeration, or “subitizing,” is similarly efficient for parsable parts (protrusions from a disc) and bounded objects. Enumeration time increased very gradually up to four parts (a mark of parallel selection), after which the slope steepened. This matches the OF-system’s set-size limit, and it is commonly held that subitizing recruits the OF-system (Pylyshyn Reference Pylyshyn2007; Carey Reference Carey2009, chap. 4).
3.5. Nonmaterial Things
I have argued that the evidence for the strong restrictive view is inadequate because it can also be explained by the permissive view. The strong restrictive view also faces another challenge. Note that the cohesion and boundedness principles apply to entities composed of surface points. This suggests that object files only target material things. But evidence suggests that this may be incorrect.
Holes are not composed of material surfaces but instead of empty space wholly surrounded by material surfaces. Nonetheless, holes can be tracked just as efficiently as standard stimuli in the MOT paradigm (Horowitz and Kuzmova Reference Horowitz and Kuzmova2011). Moreover, while perceivers are poor at remembering the shapes of arbitrary background regions, they remember the shapes of holes just as well as the shapes of solid objects (Palmer et al. Reference Palmer, Davis, Nelson and Rock2008; although see Bertamini Reference Bertamini2006). There is also evidence that we can track cavities or indentations in objects. Giralt and Bloom (Reference Giralt and Bloom2000) found no significant differences between 3-year-olds’ ability to track and enumerate cavities and their ability to track and enumerate solid objects. Furthermore, indentations can be rapidly individuated in parallel. In addition to parts, Porter et al. (Reference Porter, Mazza, Garofalo and Caramazza2016) found typical subitizing effects for indentations in an object, suggesting similar set-size limits for indentations and Spelke-objects.
One might contend that nonmaterial entities can only be tracked if they are misrepresented as Spelke-objects. However, this is doubtful, since several studies of nonmaterial entities have provided ample information about surface depth. Giralt and Bloom used full-cue physical stimuli, while Horowitz and Kuzmova used holes defined by stereo disparity. Nonetheless, it remains to be seen whether nonmaterial entities elicit typical results in other paradigms, such as object reviewing.
3.6. Multiple Systems?
Thus far, I have assumed a unified theory of tracking mechanisms. Specifically, I have assumed that if we can perform MOT or VWM tasks with both Spelke-objects and non-Spelke-objects, then object files are used in both cases. One might suggest, however, that vision uses multiple systems for selection and tracking. Perhaps the OF-system is just one of them, and it follows the restrictive principles.Footnote 6
There are prima facie reasons to doubt this multisystem account. To make the case that separate systems are responsible for selecting Spelke-objects and non-Spelke-objects, one needs to show that performance differs systematically between these stimulus categories. But this is precisely the assumption I have been questioning. For instance, there is little evidence for substantially different set-size limits across the various cases I have discussed. A multisystem account would render this a coincidence. Thus, if performance on standard OF-system tasks is similar for Spelke-objects and non-Spelke-objects, then this already damages the case for multiple systems.
Alternatively, one might suggest that the tasks I have discussed do not all recruit object files. Perhaps object files are engaged for VWM tasks but not for MOT, rendering MOT evidence irrelevant.Footnote 7 As motivation for this, one might note that we often fail to remember properties of tracked objects during MOT (Bahrami Reference Bahrami2003). This is puzzling, since object files are supposed to store properties of the objects they select. However, although participants are poor at recalling properties during MOT, this may be because MOT is so resource demanding that it often prevents properties from being entered into files (Pylyshyn Reference Pylyshyn2007, 39–40). Tracking may still recruit object files, even if no properties are entered into them.
In any case, I have discussed evidence that VWM selects non-Spelke-objects, so the strong restrictive view’s difficulties are not limited to MOT. Further, recall that there is evidence that common mechanisms are involved in tracking and VWM (Fougnie and Marois Reference Fougnie and Marois2009; Drew et al. Reference Drew, Horowitz, Wolfe and Vogel2011). Likewise, tracking an object in MOT boosts its OSPB (Haladjian and Pylyshyn Reference Haladjian and Pylyshyn2008), as would be expected if the tasks engage the same mechanism. There is, moreover, theoretical precedent for identifying the representations used in MOT with those used in object reviewing (Kahneman et al. Reference Kahneman, Treisman and Gibbs1992; Carey and Xu Reference Carey and Xu2001; Pylyshyn Reference Pylyshyn2007, 37–40). Accordingly, while multisystem accounts are available, they have questionable support.
Perhaps, however, the OF-system’s processing principles differ from context to context. It might rely on the restrictive principles in some cases but not others. In this vein, Mitroff, Arita, and Fleck (Reference Mitroff, Arita and Fleck2009) found that objects defined by subjective contours elicited OSPBs during blocks where they were the only objects presented but not when intermixed with trials involving physically defined objects. It is an empirical question whether such contextual effects carry over to, say, collections and parts. If so, then a context-relativized account of the OF-system’s principles may be viable.
If the OF-system picks out parts, collections, and shadows in addition to Spelke-objects, why call it an object file system at all? Granted, if “object” is interpreted to mean Spelke-object, this label may be misleading. However, the crucial question is whether the system responsible for tracking Spelke-objects is also responsible for tracking 2-D entities, parts, and collections. How to label this system is a further matter.
4. The Weak Restrictive View
I have considered whether the OF-system critically relies on attributing 3-D, cohesion, and boundedness. I have argued that the evidence cited for this view is inadequate and that further evidence raises problems for the view. This challenges the strong restrictive view, but does not directly target the weak restrictive view, which claims only that the OF-system has the proper function of selecting Spelke-objects. Might the weak restrictive view be true, even if the strong restrictive view is false?
Suppose that the strong restrictive view is false. The proponent of the weak restrictive view then needs to argue that the OF-system has the function of selecting Spelke-objects despite readily picking out non-Spelke-objects. One suggestion would be that object files are selectively tuned to Spelke-objects simply because the majority of entities they pick out in natural environments are Spelke-objects. However, this sort of proposal faces familiar problems (Fodor Reference Fodor1987, 99–104). For example, even if the majority of entities we pick out in natural environments are Spelke-objects, an even greater majority are members of a broader kind that includes Spelke-objects alongside certain 2-D entities (shadows or reflections), collections, and parts. So why think that object files are selectively tuned to Spelke-objects, rather than to some broader kind?
A more promising suggestion can be gleaned from Burge (Reference Burge2010). Burge holds that tracking relies on attributing the kind body and that a precondition for representing something as a body is that we represent it as 3-D, cohesive, and bounded. This entails the strong restrictive view. However, Burge also holds there are further properties constitutive of being a body that we need not attribute during tracking—namely, solidity (465–68). The reason, he argues, is that nonsolid entities that satisfy the remaining restrictive principles have not figured in biological explanations of our activities. Tracking can be keyed to solid entities even if it does not rely on attributing solidity, so long as only solid entities figure in biological explanations of our activities. Perhaps we should apply this teleological treatment of solidity to the 3-D, cohesion, and boundedness properties as well.
I believe the most compelling version of the weak restrictive view would combine this teleological approach with the claim that the OF-system tracks objects by attributing perceptual organization cues. On this proposal, the restrictive principles characterize the kind of entity that the OF-system has the proper function of tracking, while perceptual organization principles characterize the properties that it typically attributes during tracking. This attempts to combine the insights of the restrictive and permissive views.
Nevertheless, assuming that we can readily track non-Spelke-objects without representing them as Spelke-objects, the weak restrictive view is dubious. Contrast the current situation with a classic case in which a system picks out entities it does not have the function of picking out. The frog’s “bug detector” fires in response to any small, dark object moving through its field of vision. Nonetheless, many claim that the detector is tuned to flies rather than to other small, dark items, such as pellets. Its firing causes tongue-flicking, which is only adaptive when directed at sources of nourishment. Thus, since flies plausibly figure in biological explanations of tongue-flicking, while pellets do not, the bug detector functions to detect flies.
Note that although tongue-flicking is fast and reflexive, this is not why the bug detector is plausibly tuned to flies. Rather, the reason is that those downstream behaviors caused by the bug detector are only successful given a fly. This might have obtained even if tongue-flicking were controlled by a slow, voluntary mechanism (although this would be far less efficient). The mechanism would perform its function in response to flies but not pellets.
Now consider cases of tracking non-Spelke-objects. We find two important disanalogies. First, there is an explanation of why the bug detector “misfires” in response to black pellets: The frog is unable to discriminate pellets from flies due to lack of visual acuity. But we plausibly can discriminate parts, collections, and shadows from Spelke-objects. So why should the OF-system “misfire” in response to these entities?
Second, we have a compelling story about why the bug detector only performs its function when it detects a fly—tongue-flicking is only successful given a fly. But no such story seems available for the tracking of collections, parts, or 2-D entities. Tracking such things is often adaptive. Tracking a swarm of bees can help us evade it. Selecting the handle of a mug can help us grasp it. And the ability to track the shadows of creatures flying overhead plausibly helped our evolutionary ancestors to avoid predation.Footnote 8 These entities are both trackable and plausibly relevant in biological explanations of our activities.
Thus, if the strong restrictive view is false, then the weak restrictive view is probably false too. There are non-Spelke-objects that it is in our biological interest to select and track. If the OF-system tracks them, then it likely performs its function when doing so.
I want to consider one more argument, due to Dickie (Reference Dickie and Jeshion2010), for the claim that object files are selectively keyed to “ordinary” objects. Dickie proposes that every object file has a “Governing Conception,” characterized as a “cluster of dispositions that determines which rationally driven operations the subject will perform in maintaining the file” (224). Specifically, we are rationally disposed to maintain files through certain changes in the entities they pick out but not through others. Dickie proposes that an object file refers to an entity O only if its Governing Conception is matched to O’s category. Roughly, the file must be maintained only through changes that something of O’s category can undergo. Given this, she argues that object files can pick out ordinary objects, but not shadows or parts, because object files are systematically discarded in response to cues that an object is not ordinary.
Nevertheless, if by “ordinary object” Dickie means “Spelke-object,” then this account faces some of the same difficulties discussed earlier. There is evidence that perceivers can select and track objects despite cues inconsistent with cohesion and boundedness. Thus, it is questionable whether object files’ Governing Conceptions really are matched to Spelke-objects. Perhaps, however, Dickie is using “ordinary object” in a different way.Footnote 9 If so, then her view is distinct from the positions addressed here and should be evaluated separately.
5. Conclusion
A popular view, which I have called the restrictive view, claims that object files are selectively keyed to 3-D, cohesive, bounded entities (Spelke-objects). I have argued that the evidential support for this view is unpersuasive.
I first considered the strong restrictive view, according to which the OF-system relies on attributing 3-D, cohesion, and boundedness during tracking. I argued that the evidence garnered in support of this view can be explained equally well by a permissive view, according to which object files rely on well-known cues to perceptual organization. The evidence for cohesion attribution can be explained by the attribution of proximity and common fate. The evidence for boundedness attribution shows only that we cannot track arbitrary parts of objects, not that we cannot track parsable parts. Work on the selection and tracking of shadows, groups, and parts provides further evidence against the strong restrictive view and in support of the permissive view.
Next, I considered the weak restrictive view, according to which object files have the proper function of picking out Spelke-objects. I argued that if the strong restrictive view is false, then the weak restrictive view is likely false too. There are many non-Spelke-objects that it is adaptive to track. If the OF-system tracks them, then it plausibly performs its function when doing so.
This raises some further questions. Adults conceptually distinguish among bodies, parts, and collections. Children believe that certain objects can persist despite changes in the substances from which they are constituted (Rips and Hespos Reference Rips and Hespos2015), and they generally prefer to apply novel words to “whole” objects, rather than to parts (Markman Reference Markman1990). If object files are not strictly governed by the body/nonbody dichotomy, where do these distinctions come from?
Note that even if the permissive view is correct, this does not entail that Spelke-objects play no role in perception. The distinction between Spelke-objects and non-Spelke-objects may be perceptually salient even if it does not mark the boundaries of what object files track. Indeed, cohesion and boundedness are topological properties, and there is strong evidence that vision represents topology (Chen Reference Chen2005; Green Reference Green2017). Thus, the property of being a Spelke-object may still be visually represented and made available for judgment and language acquisition.
If object files are not selectively keyed to Spelke-objects, then what are they keyed to? One possibility is that there is no easily specifiable category to which they are keyed. Another possibility is that they are keyed to a broader category than Spelke-objects. Note that Spelke-objects, perceptual groups, and parsable parts all tend to exhibit a high degree of internal dependence. Their parts have stronger probabilistic dependencies on one another than on the parts of other entities. This is clear in the case of motion. Parts of my arm are better motion correlated with one another than with parts of my torso. Likewise, a goose is better motion correlated with another goose from the same flock than with one from a different flock. One possibility, then, is that it is adaptive for us to pick out things that exhibit such internal dependence, and the OF-system exploits the rich assortment of cues to internal dependence that natural scenes offer.
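To make the notion of internal dependence slightly more concrete, here is a minimal Python sketch (illustrative only; the function names are mine, and it considers only horizontal displacements) that scores how strongly two elements’ motions covary and averages that score within a putative unit. On the suggestion above, within-unit scores should tend to exceed between-unit scores for arms, flocks, and ordinary bodies alike.

```python
import statistics  # statistics.correlation requires Python 3.10+

def motion_correlation(xs_a, xs_b):
    """Pearson correlation between two elements' frame-to-frame horizontal
    displacements: a crude index of how strongly their motions covary.
    xs_a, xs_b: equal-length lists of x positions over time (nonconstant motion)."""
    dx_a = [b - a for a, b in zip(xs_a, xs_a[1:])]
    dx_b = [b - a for a, b in zip(xs_b, xs_b[1:])]
    return statistics.correlation(dx_a, dx_b)

def internal_dependence(trajectories, unit):
    """Average pairwise motion correlation among the elements of a putative unit.
    trajectories: dict mapping element id -> list of x positions over time.
    unit: list of at least two element ids."""
    pairs = [(i, j) for k, i in enumerate(unit) for j in unit[k + 1:]]
    return statistics.mean(
        motion_correlation(trajectories[i], trajectories[j]) for i, j in pairs
    )
```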