Introduction

The seeing-versus-thinking debate in cognitive science has deep historical roots. One school of thinkers can be said to follow Leonardo da Vinci’s (1452–1519) motto “All our knowledge has its origins in perception”. This motto suggests that perception is a largely autonomous source of knowledge rather than that knowledge is a resource for perception (Firestone & Scholl, 2015; Fodor, 1983; Pylyshyn, 1999; Rock, 1985). This position is also called cognitive impenetrability, which means that vision is largely unaffected by other cognitive domains—not because it is neurally encapsulated but because it is a stable process, wired into the brain and not easily modifiable by knowledge, beliefs, or intentions. This is typically illustrated by visual illusions, which persist even when we know what we are looking at. Another school can be said to follow William Kingdon Clifford (1845–1879), who restricted seeing to sensations and argued that a sensation gives us ideas connected with things because of earlier hands-on experience with things that also caused this sensation (Clifford, 1890). This suggests that thinking transforms sensory input directly into meaningful concepts. It reverberates in the recent idea that, in the visual hierarchy (Footnote 1), all perceptually relevant information is represented by activity patterns in the primary, retinotopic, area V1 and that activity in other visual and non-visual areas is secondary or auxiliary but not representational (Gur, 2015).

Clifford’s idea does not include what is called the Høffding step. Harald Høffding (1843–1931) argued that there must be a stage of visual structuring, or perceptual organization, to transform a two-dimensional (2D) retinal image into a percept of three-dimensional (3D) objects arranged in space (Høffding, 1891). This idea attributes more autonomy to perception but also implies that a stimulus can be perceptually organized in many different ways. Therefore, the early 20th-century Gestaltists proposed the law of Prägnanz, which holds that the visual system settles into relatively stable organizations, characterized by symmetry and simplicity (Koffka, 1935; Köhler, 1929; Wertheimer, 1923). A modern version hereof is the simplicity principle. It defines the complexity of an organization by the amount of information needed to specify it, and holds that the visual system prefers the simplest hierarchical organizations (Hochberg & McAlister, 1953; Leeuwenberg & van der Helm, 2013; van der Helm, 2014). It further postulates that the successive hierarchical levels in such an organization—say, from local features to global structures—are represented at successive levels in the visual hierarchy (van der Helm, 2012).

The Høffding step is also not included in Hermann von Helmholtz’ (1821–1894) idea that visual perception is a process of unconscious inference guided by the likelihood principle (von Helmholtz, 1909/1962). This principle holds that “we perceive the most likely objects or events that would fit the sensory pattern that we are trying to interpret” (Hochberg, 1978). It led, among other things, to the idea that the internal process of perception is veridical, meaning that it captures most truthfully the structure of the external world (e.g., Cohen, 2015; Pizlo, 2015). Visual illusions speak against this, and in fact, I think it is fundamentally unverifiable (see Appendix A). Be that as it may, the Helmholtzian likelihood principle is often taken as a permit to include knowledge in perception models. Some Bayesian models, for instance, test knowledge-based hypotheses against the sensory input (e.g., Friston, 2009). My problem with this is that it is basically a form of template matching, which, at least in human vision research, was abandoned long ago because it is too rigid and too limited to deal with ill-defined categories and novel objects.

The foregoing illustrates that, in the seeing-versus-thinking debate, much depends on how perception is defined. Definitions of perception range from only V1 activity to any cognitive activity that contributes to arriving at unique percepts. In both these extreme options, knowledge plays a large part in determining what we think we see when looking at a visual stimulus—be it (unconscious) phylogenetic knowledge acquired during evolution or (conscious) ontogenetic knowledge acquired during one’s life. However, whereas thinking seems to be a relatively slow process that has been described as involving the sequential activation of sets of neural assemblies (Hebb, 1949), we can detect stimulus features like mirror symmetry at presentation times as short as 50 ms (Csathó, van der Vloed, & van der Helm, 2003; Locher & Wagemans, 1993), while complete percepts seem to be formed within less than 500 ms (Breitmeyer & Ogmen, 2006; Sekuler & Palmer, 1992). These temporal specifications are, admittedly, not necessarily indicative of temporal aspects of the cascade of perceptual processes triggered by a stimulus, but they do suggest that, for thinking to intrude into seeing, it might have a timing problem.

To explore this issue further, this article focuses on the process of perceptual organization. As indicated, perceptual organization is the neuro-cognitive process—in the visual hierarchy—that enables us to perceive scenes as structured wholes consisting of objects arranged in space (Fig. 1ab). This includes the perception of randomly organized spatial elements as well as elements that can be organized, in 2D or 3D, into a single object, multiple objects, partially hidden ones, etc. This presumably automatic process may seem to occur effortlessly in daily life, but by all accounts, it must be both complex and flexible. For a proximal stimulus, the perceptual organization process usually singles out one hypothesis about the distal stimulus from among a myriad of hypotheses that also would fit the proximal stimulus. This means, as Gray (1999) put it, that multiple sets of features at multiple, sometimes overlapping, locations in a stimulus must be grouped in parallel and that the process must cope with a large number of possible combinations simultaneously. This indicates that the combinatorial capacity of the perceptual organization process must be high, which is remarkable considering that it completes in just a few hundred milliseconds.

Fig. 1

Perceptual organization. a A stimulus with a typically perceived organization comprising the two triangular shapes in b, which, therefore, are called compatible parts. c An incompatible part, which is masked by the typically perceived organization and which, therefore, is called an embedded figure. (After Kastens & Ishikawa, 2006)

Perception is, admittedly, broader than perceptual organization, but the latter is a pivotal process between sensory input and percepts, so it is relevant to explore how it might interact with top-down processes. In the next sections, I first sketch an earlier-presented model of perceptual organization, called PATVISH (Perception and ATtention in the VISual Hierarchy; for details, see van der Helm, 2012, 2014, 2016). Using this model, I then argue that the input of the perceptual organization process may be modifiable and its output enrichable, but that the process itself is so fast (or efficient) that it has done most of its job by the time thinking might interfere. By “most”, I mean that the perceptual organization process is not neurally encapsulated and that thinking might have time to intrude—but not much. Regarding the exact degree to which thinking might intrude, this study remains speculative, but its main objective nevertheless is to put speed (or efficiency) forward as a relevant factor in the seeing-versus-thinking debate.

Before I begin, two remarks are in order. First, theoretical studies aim to integrate empirical findings and theoretical ideas into coherent frameworks or to apply such a framework to address topical issues. PATVISH represents a proposed integration of ideas that have gained some sort of support (empirical or otherwise), and in this theoretical study, I apply this proposal to address issues in the seeing-versus-thinking debate. Theoretical research is not empirical research, but it is nevertheless an integral part of the empirical cycle, and at the end of this article, I raise several empirical questions for future investigation. Second, a semantic problem in the seeing-versus-thinking debate is that thinking, or knowledge, is often discussed in terms of attentional effects, even though thinking and attention are not the same. Yet, attention seems the obvious channel through which thinking would affect seeing, and here, I therefore focus on effects of attention on perceptual organization. Through such effects, if any, one might infer effects of thinking.

Modeling perceptual organization

To account for the high combinatorial capacity of the perceptual organization process, PATVISH follows Lamme, Supèr, and Spekreijse (1998) in assuming that this distributed hierarchical process (Footnote 1) comprises three neurally intertwined but functionally distinguishable subprocesses. These subprocesses are taken to be responsible for (a) feedforward extraction of, or tuning to, features to which the visual system is sensitive, (b) horizontal binding of similar features, and (c) recurrent selection and integration of different features (Fig. 2, left-hand panel). Furthermore, adopting the simplicity principle, PATVISH assumes that the process yields a complexity distribution over candidate organizations (i.e., stimulus organizations in terms of wholes and parts; Fig. 2, right-hand panel). Such a complexity C can be converted into a probability proportional to 2^(−C) (normalized over the candidate organizations), which reflects an organization’s probability of being perceived and implies that simpler organizations are more likely to be perceived.
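
To make this conversion concrete, the following minimal Python sketch (not part of PATVISH or PISA themselves) turns complexities of candidate organizations into normalized probabilities under the assumption that the probability of being perceived is proportional to 2^(−C); the candidate organizations and their complexity values are hypothetical illustrations.

```python
# Minimal sketch: complexities C of candidate organizations are mapped to
# probabilities proportional to 2**(-C) and then normalized, so that the
# simplest organization receives the highest probability of being perceived.

def complexity_to_probabilities(complexities):
    weights = {name: 2.0 ** (-c) for name, c in complexities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical complexities (in information units) for three candidate
# organizations of one and the same proximal stimulus.
candidates = {
    "two triangles": 4,
    "hexagon plus diagonals": 6,
    "unrelated fragments": 10,
}

print(complexity_to_probabilities(candidates))
# The simplest candidate ("two triangles") gets the highest probability.
```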

Fig. 2

Processing in the visual hierarchy. A stimulus-driven perceptual organization process (at the left) comprises three intertwined subprocesses (further explained in the text), which, together, yield percepts in the form of hierarchical stimulus organizations (i.e., organizations in terms of wholes and their parts). A task-driven attention process (at the right) may scrutinize such a hierarchical organization—starting at higher levels where relatively global structures are represented, and if required by task and allowed by time, descending to lower levels where relatively local features are represented. (Reproduced from van der Helm, 2016)

The subprocess of feedforward extraction is reminiscent of the neuroscientific idea that, going up in the visual hierarchy, neural cells mediate detection of increasingly complex features (Hubel & Wiesel, 1968). Furthermore, the subprocess of recurrent selection and integration is reminiscent of the connectionist idea that, by parallel distributed processing (PDP), neural activation spreading yields percepts represented by stable activation patterns (Churchland, 1986). In PATVISH, these subprocesses interact like a fountain under increasing water pressure: As the feedforward extraction progresses along ascending connections, each passed level in the visual hierarchy forms the starting point of integrative recurrent processing along descending connections. This yields a gradual buildup from percepts of parts at lower levels in the visual hierarchy to percepts of wholes near its top end (for similar pictures, see Lee & Mumford, 2003; VanRullen & Thorpe, 2002).

Neural activation spreading, as the presumed manifestation of standard PDP, may be a basic phenomenon in the brain, but the brain also exhibits something more sophisticated: neuronal synchronization (see Appendix B). This is the phenomenon that neurons, in transient assemblies, temporarily synchronize their firing activity. Synchronization in the 30–70 Hz gamma band, in particular, has been associated with local visual computations, especially with feature binding in horizontal neural assemblies (Gilbert, 1992). PATVISH’s capstone is its assumption that gamma synchronization is a manifestation of transparallel processing, which means that up to an exponential number of similar features are processed in one go, that is, simultaneously as if only one feature were concerned. The source of this assumption is sketched next (see Appendix C for more technical details).

In PISA—a minimal coding algorithm for strings (van der Helm, 2004, 2015)—regularities such as symmetries and repetitions are extracted to compute simplest hierarchical organizations. To this end, the algorithm implements formal counterparts of the three intertwined but functionally distinguishable subprocesses that are believed to take place in the visual hierarchy. Horizontal binding of similar features, in particular, is implemented by gathering sets of up to an exponential number of similar regularities in special distributed representations, called hyperstrings (see Appendix C). A hyperstring represents those regularities in such a way that they, for all intents and purposes in minimal coding, can be processed further as if they constituted a single regularity. This means that those regularities can be hierarchically recoded in a transparallel fashion, that is, simultaneously as if only one regularity were concerned—thus solving the computationally heavy combinatorial search for simplest hierarchical organizations. This led to the idea that hyperstrings can be seen as formal counterparts of those temporarily synchronized neural assemblies, so that, inversely, synchronization in those transient assemblies can be seen as a manifestation of transparallel processing. Notice that, unlike standard PDP, transparallel processing by hyperstrings is feasible on classical computers, giving them (for some computing tasks) the same extraordinary computing power as that promised by quantum computers (for some other computing tasks; see van der Helm, 2015).
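
As a toy illustration of what "simplest hierarchical organization" means here, the following Python sketch brute-forces the simplest code of a symbol string from a few candidate encodings (literal, repetition, and mirror symmetry). It is emphatically not PISA: the operators and the complexity measure (the number of symbol tokens in a code) are simplified stand-ins, and no hyperstrings or transparallel processing are involved.

```python
# Toy brute-force sketch of minimal coding: among a few candidate encodings of
# a string, pick the one with the lowest complexity (counted here, simplistically,
# as the number of symbol tokens in the code). This is NOT PISA and does not
# use hyperstrings or transparallel processing.

def candidate_codes(s):
    n = len(s)
    yield s, n                                        # literal code
    for k in range(1, n // 2 + 1):                    # repetition: m*(chunk)
        if n % k == 0 and s[:k] * (n // k) == s:
            yield f"{n // k}*({s[:k]})", k + 1
    if n >= 2 and n % 2 == 0 and s == s[::-1]:        # even mirror symmetry: S[(half)]
        half = s[:n // 2]
        yield f"S[({half})]", len(half) + 1

def simplest_code(s):
    return min(candidate_codes(s), key=lambda pair: pair[1])

for string in ["ababab", "abccba", "abcdef"]:
    print(string, "->", simplest_code(string))
# ababab -> ('3*(ab)', 3)   repetition wins
# abccba -> ('S[(abc)]', 4) symmetry wins
# abcdef -> ('abcdef', 6)   no regularity: the literal code remains simplest
```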

The transparallel recoding of similar features yields a hierarchy of feature constellations—that is, in PISA, a hierarchy of hyperstrings, and in PATVISH, a hierarchy of synchronized neural assemblies. From this hierarchy of feature constellations, different features are selected to be integrated into percepts. Thus, transparallel processing underlies the perceptual integration capability—as distinct from the feedforward extraction of visual features. This distinction between extraction and integration agrees with that between base-grouping and incremental grouping as put forward by Roelfsema (2006; see also Lamme & Roelfsema, 2000; Roelfsema & Houtkamp, 2011), who, however, did not provide a computational account like transparallel processing.

To give a sense of the timing of these processes, the so-called fast feedforward sweep reaches the top end of the visual hierarchy in about 100 ms (Lamme & Roelfsema, 2000; Tovee, 1994). In some cases, this feedforward sweep may be sufficient to detect particular features. For instance, to discriminate between two clear-cut categories—say, animate versus inanimate structures, or rural versus city structures—holistic spatial organizations are not needed because one can rely on a large variety of local features to quickly complete the task (cf. Kirchner & Thorpe, 2006). Feature conjunctions, however, require more than that. For instance, binocular depth information kicks in around 100–200 ms post-stimulus onset (Ritter, 1980). Furthermore, Makin et al. (2016) investigated detection of single and multiple symmetries, repetitions, and Glass patterns, in fairly simple multi-element stimuli. They recorded the sustained posterior negativity (SPN)—an event-related potential (ERP) generated by visual regularities—and found that it correlates highly with behavioural data, particularly around 300–400 ms post-stimulus onset. It is therefore also plausible that processes manifesting synchronization play a part in this—after all, synchronization arises around 150–400 ms post-stimulus onset (Kveraga et al., 2011; Tallon-Baudry & Bertrand, 1999).

In sum, by PATVISH, the process of perceptual organization comprises a gradual buildup—through successive groupings (cf. Palmer, Brooks, & Nelson, 2003) with feedback loops (cf. Lee & Mumford, 2003)—from percepts of parts to percepts of wholes. Such a gradual buildup takes time, so, in principle, it leaves room for top-down processes to intrude and modulate things before a percept has completed. In this sense, PATVISH does not exclude influences from higher cognitive levels. However, it also postulates that—due to transparallel processing—the perceptual organization process is so fast (or efficient) that it, by then, already has done most of its job. This opens a new perspective on the limits of top-down influences on perception. Next, I discuss several implications.

Receptive fields

The perceptual organization process is not neurally encapsulated in the visual hierarchy, but for the moment, suppose it is. Even then, PATVISH implies that it involves top-down processing, namely, by the subprocess of recurrent selection and integration. This subprocess takes pieces of information from a lower level, integrates them at a higher level, and feeds information about the result back to update the lower level (Lee & Mumford, 2003). This has consequences for what is called a neuron’s receptive field (RF).

The classical receptive field (cRF) of a neuron is defined by the region of the retina to which the neuron is connected by way of feedforward connections (Hubel & Wiesel, 1968). Going up in the visual hierarchy, cRFs increase in size, which suggests that neurons at any level in the visual hierarchy can be conceived of as feature detectors, the output of which is simply summed by neurons with larger cRFs at the next level. This also suggests that vision involves only the fast feedforward sweep. However, via horizontal and recurrent connections, a neuron also receives input from neurons at the same and higher levels in the visual hierarchy. This suggests that a neuron is context sensitive, that is, responsive to local features outside its cRF and global features extending beyond its cRF. This context sensitivity—which does not rely on input from higher cognitive levels beyond the visual hierarchy—is not only implied by PATVISH but also supported by neuroscientific evidence (Gilbert, 1992; Lamme et al., 1998; Self et al., 2016; Smith & Muckli, 2010; Vetter, Smith, & Muckli, 2014). To be clear, I think that the cRF remains a useful concept in neuroscientific settings. The foregoing suggests, however, that its definition is too limited to capture a neuron’s effective RF in cognitive settings.
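
The purely feedforward picture just described can be caricatured in a few lines of Python: each unit at the next level merely sums a fixed window of units at the previous level, so classical receptive fields grow with level. The window size and the input pattern are hypothetical; the point is only that this scheme, unlike the context-sensitive processing discussed above, takes no horizontal or recurrent input into account.

```python
# Caricature of pure feedforward summation: each higher-level unit sums a fixed
# window of lower-level units, so its classical receptive field (cRF) covers an
# increasingly large stretch of the retina. No horizontal or recurrent input here.

def next_level(responses, window=2):
    return [sum(responses[i:i + window])
            for i in range(0, len(responses) - window + 1, window)]

retina = [1, 0, 0, 1, 1, 1, 0, 0]   # hypothetical retinal activity (level 0)
level1 = next_level(retina)          # each unit covers 2 retinal positions
level2 = next_level(level1)          # each unit now covers 4 retinal positions

print(level1)  # [1, 1, 2, 0]
print(level2)  # [2, 2]
```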

Attention

In behavioural perception experiments, participants respond to a task on the basis of what they think they saw. Hence, responses are based on perception in combination with task-driven top-down attention. Various forms of attention have been distinguished (Footnote 2), but notice that attention—of whatever form and involving whatever action—is basically the allocation of processing resources (Anderson, 2004). This may imply an enhancement of stimulus information focused on (cf. Nandy, Nassi, & Reynolds, 2017), but it neither prescribes how this information is (or has been) organized nor how it interacts with information outside the focus of attention. For instance, you must have perceived a bright flash before your attention is drawn to it. Furthermore, even if attention is directed specifically to stimulus parts relevant to a task, other stimulus parts may still affect responses to this task (e.g., Palmer & Hemenway, 1978; van der Helm & Treder, 2009).

PATVISH leaves room for attention to have measurable effects throughout the visual hierarchy—for instance, related to preparatory arrangements regarding what is focused on (Self et al., 2016). Its focus, however, is on the processing of stimulus information, and in this context, it postulates that attention also scrutinizes established perceptual organizations in a top-down fashion (Fig. 2, right-hand panel). This means that it starts with global structures represented at higher levels in the visual hierarchy, and if required by task and allowed by time, may descend to local features represented at lower levels. This agrees with reverse hierarchy theory (RHT) as proposed by Hochstein and Ahissar (2002; see also Ahissar & Hochstein, 2004; Wolfe, 2007; for neurophysiological evidence, see Campana et al., 2016). RHT, by the way, focuses mainly on the attention side and, unlike PATVISH, less on processing details at the perception side.
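
To illustrate this top-down scrutiny, the sketch below represents an established percept as a hierarchy of wholes and parts and descends it from global to local, stopping as soon as the task is satisfied or a time budget runs out. The tree, the task predicate, and the cost per inspected node are hypothetical; the sketch only mirrors the coarse-to-fine order described above.

```python
# Minimal sketch of task-driven, top-down scrutiny of a hierarchical percept:
# start at the global level and descend to more local levels only if the task
# requires it and the time budget allows it.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def scrutinize(root: Node, satisfies_task: Callable[[str], bool], budget: int) -> Optional[str]:
    frontier = [root]                    # global structures are inspected first
    while frontier and budget > 0:
        node = frontier.pop(0)
        budget -= 1                      # each inspected node costs time
        if satisfies_task(node.label):
            return node.label
        frontier.extend(node.children)   # descend toward more local features
    return None                          # task not resolved within the budget

# Hypothetical percept: a global "arrow" whose parts are two triangles (cf. Fig. 1).
percept = Node("arrow", [Node("upper triangle"), Node("lower triangle")])
print(scrutinize(percept, lambda label: "triangle" in label, budget=5))
```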

The combination of perceptual organization and attention obviates the idea that perception comprises V1 activity only (Gur, 2015). Inspired by the fact that we can be aware of details, this idea relied on the preservation of details represented in V1. However, the foregoing implies that details are preserved and attainable also if perception is taken to comprise more than just V1 activity. It also agrees with findings that figure-ground segregation—which is part of perceptual organization—can take place outside the focus of attention (i.e., independently of attention, or preattentively), and that not attention but the figure-ground assignment itself is responsible for an enhancement of figures relative to grounds (Hecht, Cosman, & Vecera, 2016; Kimchi & Peterson, 2008). Because of this enhancement, attention may subsequently be drawn more to figures than to grounds (Nelson & Palmer, 2007), but the point is that, by then, the figure-ground segregation already has done its job. Next, I discuss three implications to further illustrate that perceptual organization supplies, fairly autonomously, input for top-down attention.

First, in visual search, a “pop-out” is a target that is detected fast and independently of the number of distractors (e.g., a red item among blue items; Treisman & Gelade, 1980). However, a target is a pop-out not by its own merits but by the merits of the distractors: The search for a target is easier as the distractors are more similar to each other and more different from the target (Duncan & Humphreys, 1989; Wolfe, 2007). Hence, for a target to be detected, properties of all elements have to be processed first (for evidence, see Conci, Toellner, Leszczynski, & Müller, 2011; Conci, Müller, & von Mühlenen, 2013), which may well involve lateral inhibition among similar things so that the target rises above the distractors. As argued in van der Helm (2016), it is therefore plausible that the similarity of the distractors is represented first in lower visual areas and that the representation of the target ends up in higher visual areas. This suggests that a pop-out is a pop-out not because it is (unconsciously) processed first by perceptual processes but because its representation ends up in higher visual areas so that it is among the first things (consciously) encountered by top-down attentional processes.
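
The point that a pop-out owes its status to the distractors can be illustrated with a toy computation: take each item's saliency to be its mean feature difference from all other items (a crude stand-in for lateral inhibition among similar items). With mutually similar distractors the odd item clearly stands out; with mutually dissimilar distractors no item does, and search becomes harder. The feature values and the difference measure are hypothetical.

```python
# Toy illustration: an item's saliency is its mean feature difference from all
# other items, so mutually similar distractors suppress each other and the odd
# item rises above them.

def saliencies(features):
    n = len(features)
    return [sum(abs(f - g) for j, g in enumerate(features) if j != i) / (n - 1)
            for i, f in enumerate(features)]

homogeneous = [0.0, 0.0, 0.0, 0.0, 1.0]    # similar distractors plus one odd item
heterogeneous = [0.0, 0.3, 0.6, 0.9, 1.0]  # mutually dissimilar items

print(saliencies(homogeneous))    # [0.25, 0.25, 0.25, 0.25, 1.0] -> clear pop-out
print(saliencies(heterogeneous))  # values lie much closer together -> no clear pop-out
```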

Second, whereas perceptual organization logically processes parts before wholes, the top-down attentional scrutiny of hierarchical organizations implies that wholes are experienced before parts. The latter explains the phenomenon of global dominance as postulated by the early 20th-century Gestaltists (Koffka, 1935; Köhler, 1929; Wertheimer, 1923). This is the phenomenon that, perceptually, global structures are more important than local features. For instance, we typically classify things on the basis of their perceived global structures rather than on the basis of their physical local features, and their perceived global structures determine which local features we perceive as their parts. This global dominance has been confirmed in behavioural studies (for a review, see Wagemans et al., 2012), in which it has been specified further by notions such as global precedence (Navon, 1977), configural superiority (Pomerantz, Sager, & Stoever, 1977), primacy of holistic properties (Kimchi, 1992), and superstructure dominance (Leeuwenberg & van der Helm, 1991; Leeuwenberg, van der Helm, & van Lier, 1994). It also agrees with Hochstein and Ahissar’s (2002) RHT and Campana et al.’s (2016) neurophysiological evidence.

Third, what if the perceptual integration of local features into global structures is hampered? By PATVISH, this could be caused by impaired gamma synchronization, as, for instance, found in autism spectrum disorders (ASD; Footnote 3; Grice et al., 2001; Maxwell et al., 2015; Sun et al., 2012). Then, top-down attention will hardly encounter perceived global structures and will have better access to embedded figures (Fig. 1ac), that is, to local features that are incompatible with typically perceived global structures (van der Helm, 2016). Better than typical access to embedded figures is exactly what has been found in ASD (Frith, 1989; Jolliffe & Baron-Cohen, 1997; Shah & Frith, 1983).

Perceptual organization and thinking

If standard PDP were the only form of processing in the brain, then everything would influence everything, and seeing and thinking would be inextricable. Synchronization in transient neural assemblies changes the game, however. Higher cognitive functions seem to be mediated by processes manifesting synchronization involving relatively slow oscillations in the 4–30 Hz theta, alpha, and beta bands, whereas perceptual organization seems to be mediated by processes manifesting synchronization involving relatively fast oscillations in the 30–70 Hz gamma band (see Appendix B). By PATVISH, this fast gamma synchronization is a manifestation of transparallel processing, which, in classical computers, has the same extraordinary computing power as that promised by quantum computers. The foregoing implies that it is plausible to make a functional distinction between fairly autonomous perceptual organization and higher cognitive functions.

This functional distinction does not mean that they do not cooperate. After all, as indicated, there are both preperceptual and postperceptual effects of attention. Perceptual organization, however, is like an Olympic 100-m sprint: it may involve preparation beforehand and scrutiny afterwards, but the sprint itself is over in a jiffy. For instance, because it is perceptual organization that organizes scenes into objects arranged in space, objects are the output of perception, not the input—so, object-based attention can only be postperceptual. Furthermore, by PATVISH, candidate organizations are assigned complexity-based probabilities of being perceived. Thus, different organizations may have nearly the same probability of being perceived, which holds, in particular, for visually ambiguous or bistable figures. Then, prolonged viewing or a shift in focus may trigger a switch between such organizations (Suzuki & Peterson, 2000), but notice that these organizations and their bistability had already been supplied by perceptual organization.

The latter illustrates that perceptual organization provides the starting point for subsequent cognitive structuring of, among other things, attention, generalization, learning, and memory (Conci et al., 2013; Kanizsa, 1985; Rock, 1985). For instance, remaining perceptual ambiguities may be resolved by heuristic knowledge such as: light usually comes from above, objects usually are viewed from above, surfaces are usually convex, etc. Furthermore, knowledge can be invoked to recognize the objects supplied by perceptual organization and to enrich percepts to the level of what we call seeing in everyday life. For instance, “I see a chair” is actually short for “I see an object, and based on knowledge, I recognize it as something one can sit on”. This illustrates that the fast and unconscious process that makes you see the object is perceptual organization, while the rest is relatively slow conscious thinking. All in all, I think that perceptual organization is a fairly autonomous process, which, by and large, is unaffected by thinking.

It is true that this article started from a specific model of perceptual organization, but speed (or efficiency) may also be a critical factor in other neuro-cognitive models, and I hope this article stimulates further research into this. Furthermore, this article surely does not settle the seeing-versus-thinking debate. On the one hand, it shows that various (especially early) effects of attention in the visual hierarchy are not necessarily effects on perceptual organization. On the other hand, Peterson and coworkers, for instance, reported evidence for effects of object recognition, memory, and past experience on figure-ground perception (see, e.g., Peterson & Gibson, 1994; Trujillo et al., 2010). Such evidence has to be taken seriously, although it cannot be said to prove cognitive penetrability. Peterson and Gibson (1994, p. 561), for instance, pointed out that the orientation dependence of their results demonstrates that their phenomena are not dependent on semantic knowledge. Furthermore, Firestone and Scholl (2015) argued that such effects merely reflect an increasing sensitivity over time to certain visual features and do not involve effects of knowledge per se. In other words, just as with preparatory attentional arrangements, such effects may apply to the input of the perceptual organization process, but not necessarily to what this process does with the input it receives. Trujillo et al. (2010), for instance, found effects of past experience in early (106–156 ms) ERPs but not in the figural outcomes. Be that as it may, further research certainly is needed, and based on this article, it might be guided by, for instance, the following three questions.

First, transparallel processing is an extraordinarily powerful form of processing that is feasible in classical computers, but does it indeed also underlie gamma synchronization in the visual hierarchy? Further investigation into this question might focus on feature binding in horizontal neural assemblies, which has been associated with gamma synchronization. By PATVISH, this subprocess is a crucial part of visual processing, but thus far, it has been a relatively underexposed topic in cognitive neuroscience.

Second, embedded figures are local features that are incompatible with typically perceived global structures (Fig. 1), and the phenomenon that ASD individuals are better than typical individuals at detecting them has been attributed to either enhanced local processing (Mottron & Burack, 2001) or reduced global processing (Frith, 1989). A critical question then is: are ASD individuals also better at detecting compatible features? Enhanced local processing implies they are, whereas, by PATVISH, reduced global processing implies they are not (van der Helm, 2016).

Third, microgenetic analyses on amodal completion, for instance, have shown that the domain of perception lies within the first 500 ms after stimulus onset (Sekuler & Palmer, 1992; see also Breitmeyer & Ogmen, 2006). By PATVISH, higher cognitive functions take over after that, but are transitional changes visible in, for instance, electroencephalographic data? In research on multiple symmetry perception, a first indication hereof (Makin et al., 2016) helped to reconcile seemingly opposed ideas that actually hold for different time windows after stimulus onset (see Hamada et al., 2016).

Conclusions

The seeing-versus-thinking debate in cognitive science has been muddled by different and unclear definitions of perception. Therefore, this article focused on the well-defined perceptual process that organizes scenes into objects arranged in space, which is a pivotal process between sensory input and percepts. In this perceptual organization process, as modeled here, similar features are hierarchically recoded extremely efficiently, so that the whole process arrives quickly at hierarchical organizations likely to be perceived. Several factors may modify the input of this process but not necessarily what the process does with the input it receives. An important role of attention, in particular, seems to be top-down scrutiny of already-established hierarchical organizations, that is, starting with global structures, and if required by task and allowed by time, descending to local features. Furthermore, thinking processes may, of course, enrich the outcome of the perceptual organization process, but they are relatively slow and can therefore hardly intrude into the process itself. In other words, we think about what we see rather than that we see what we think.