Introduction

Humans tend to group perceptual features together in order to form a coherent whole. Understanding when this happens has been the focus of Gestalt psychology research for over 100 years, and more than a hundred grouping “laws” have been suggested (Wagemans et al. 2012a). Whereas in the past the formulation of these laws was based on subjective experience and was criticized for a lack of scientific rigour, subsequent researchers have developed experimental designs with carefully constructed stimuli (e.g. Gabor stimuli, dot lattices) that allow for parametric control, richer visual displays, and objective measures of grouping effects (Wagemans et al. 2012b). One such approach consists of measuring the impact of salient Emergent Features (EFs) on discriminating visual patterns. These EFs derive from the relationships amongst individual parts rather than from the parts themselves (Pomerantz et al. 1977; Pomerantz & Portillo 2011). We will use the concept of EFs as the basis of our approach, as detailed later.

Recently there has been an explosion of interest in Deep Neural Networks (DNNs) as models of the human visual system for object recognition. Even though DNNs have primarily been designed to solve engineering tasks, reports that the patterns of activation of units in DNNs are similar to neural activations in human and macaque visual systems have led to the view that DNNs can be used as a test bed for simulating biological vision in mammals (Gauthier & Tarr 2016; Kriegeskorte 2015). As a way of formalizing this similarity, a Brain-Score benchmark has been put forward (Schrimpf et al. 2018), which has been enthusiastically embraced by researchers comparing DNNs to human vision.

In the current paper, we explore several DNNs thought to be amongst the best models of human vision and test whether they support various Gestalt grouping phenomena. In particular, we tested whether DNNs are sensitive to some basic principles of organization such as proximity, orientation, and linearity (Experiment 1), and whether they experience Gestalt grouping when presented with more complex stimuli (Experiment 2). We compare networks’ responses with human responses from classic visual perception work (Pomerantz et al. 1977; Pomerantz & Portillo 2011).

Our main research question is as follows: do DNNs exhibit human-like Gestalt grouping effects? We split this question into two sub-questions: are networks sensitive to the basic properties of proximity, orientation, and linearity? And are they sensitive to more complex Gestalt grouping effects? In addition to this primary aim, the investigation will also yield insight into which architectures or training regimes appear most appropriate for acquiring Gestalt properties, and into the extent to which Gestalt phenomena are a learned or innate aspect of the human visual system.

In the following sections, we contextualize these questions and explain how our experiments address them.

Neural Networks as a Model of the Human Visual System

DNNs trained on ImageNet (a dataset consisting of 1000 categories of objects taken across over 1 million photographs, Deng et al. 2010) develop a set of internal feature representations that are statistically similar to the neural representations in human and non-human primate visual systems (Yamins & DiCarlo 2016; Khaligh-Razavi & Kriegeskorte 2014; Schrimpf et al. 2018). A neuronal and behavioural benchmark called Brain-Score has been developed to assess neural networks on their similarity with biological object recognition systems, with DNNs performing much better than all previous approaches (Schrimpf et al. 2018). At the time of writing more than 160 models have been tested on Brain-Score.

In spite of these successes when compared with neuronal data and tested on classification accuracy benchmarks, DNNs often fail on the most basic perceptual properties exhibited by humans (Bowers et al. 2022). For example, DNNs do not possess a human-like shape bias (Geirhos et al. 2018; Malhotra et al. 2020), they appear to discriminate categories based on local instead of global features (Baker et al. 2018b; Malhotra et al. 2022), are much more susceptible to small amounts of image degradation (Geirhos et al. 2018), do not account for humans’ similarity judgements of 3D shapes (German & Jacobs 2020), and fail to support basic visual reasoning such as classifying images as the same or different (Puebla & Bowers 2021). In addition, DNNs often act in surprising, non-human-like ways, such as being fooled by adversarial images (Szegedy et al. 2013; Dujmović et al. 2020) and making bizarre classification errors for familiar objects in unusual poses (Kauderer-Abrams 2017; Gong et al. 2014; Chen et al. 2017). Furthermore, recent work by Xu & Vaziri-Pashkam (2021) failed to find strong neural correlates between DNNs’ internal representations and fMRI signals from high-level visual areas of human participants.

At the same time, some authors have highlighted that DNNs appear to capture some key psychological findings, such as Weber’s Law, sensitivity to scene incongruencies, and the Thatcher effect (Jacob et al. 2021) (but see Bowers et al. 2022 for limitations of these findings). Biscione & Bowers (2021, 2022) also found that networks exhibited strong invariance to several novel object transformations (rotation, scale, change in luminance, translation, and to a lesser degree change in viewpoint), but only after being trained on a correspondingly transformed dataset of different classes, indicating that DNNs can learn the human perceptual property of invariance to object transformations (Blything et al. 2020, 2021). Recently, Yin et al. (2023) found that DNNs trained on words provided a better account of visual priming effects than many classic orthographic coding schemes.

Fig. 1

Pomerantz et al.’s (1977) approach to measuring Emergent Features (EFs). Left: Starting with a base pair, a non-informative context is added to obtain a composite pair. Right: The base pair and composite pair are arranged in a 2x2 grid to form an odd-discrimination task. Participants are asked to indicate the location of the “odd” element, and RTs are measured for the base and the composite grid. With some composite grids (such as the one illustrated in the figure) discrimination is much faster than with the corresponding base grid. Since the added context in the composite is equal for odd and non-odd elements, any facilitation must be due to EFs

Neural Networks and Gestalt

As far as we are aware, Gestalt grouping in DNNs has been explored only in relation to illusory contours and closure. Baker et al. (2018a) tested the degree to which networks could perceive illusory contours after being trained on non-illusory similar shapes (large and thin rectangles). The network successfully predicted the type of shape regardless of whether the contour was normal or illusory, but the authors found no evidence that the network used the same information as humans and concluded that CNNs do not perceive illusory contours. Kim et al. (2021) challenged this conclusion and found that several architectures (including InceptionNet) pretrained on ImageNet exhibited closure on displays of edge fragments. Whether the latter findings reflect Gestalt processes similar to human visual processing remains unclear. Other researchers have focused on modifying architectures or training regimes in order to explore this issue: Lotter et al. (2020) found that a network based on predictive coding (PredNet, one of the 16 networks tested in this work), trained to predict the next frame of video sequences, exhibited disparate phenomena observed in the visual cortex, including the flash-lag effect and illusory contours. The same network also appeared to perceive illusory motion (Watanabe et al. 2018). Using a DNN modified to include feedback connections through predictive coding dynamics, Pang et al. (2021) also showed human-like perception of illusory contours. Illusory contours constitute an important phenomenon in Gestalt psychology, but human perception is mediated by a wider set of grouping principles which, to the best of our knowledge, have not been explored in DNNs.

If DNNs are going to be used as models of human vision, it will be important that they not only do well on various Brain-Score measures but also account for key experimental results reported in psychology (Bowers et al. 2022). Here we have focused on Gestalt rules of organization not only because they play a key role in visual perception and object recognition (Biederman 1987; Perrett & Oram 1993; Spillmann 2009), but because there are existing image datasets and robust empirical phenomena that make them easy to test. Specifically, we consider whether DNNs show sensitivity to EFs, as described next.

Formation of Wholes Through Emergent Features

Gestalt researchers have long studied the emergence of “wholes” from the combination of individual parts, but Gestalts have proven difficult to define and measure. Pomerantz et al. (1977) operationally defined Gestalts as the result of salient Emergent Features (EFs), that is, features that result from the relations amongst individual elements and are not possessed by the individual elements themselves. These EFs behave as though they were elementary themselves and are sometimes detected more quickly than the more basic features from which they arise. The effect can be measured in a discrimination task, is robust to added noise (Moors et al. 2020), and might constitute a fundamental stage in shape processing related to the perception of the non-accidental properties posited by the Recognition-by-Components theory (Biederman 2001; Kubilius et al. 2017).

In Fig. 1, left, we illustrate this idea. As a baseline, we measure how well humans distinguish two stimuli (A and B, the base pair). We then add a new contextual stimulus C to both the A and B images, creating a composite pair. Importantly, stimulus C is not informative by itself in distinguishing AC from BC. Nevertheless, the composite pair is now much more discriminable than the base pair due to the interaction of the context with the base elements (that is, through the creation of Emergent Features). Whenever the context stimulus changes performance on the discrimination task, a Configural Effect (CE) is observed. Improved performance is described as a Configural Superiority Effect (CSE), and this provides a measure of an EF that is the product of a Gestalt organizational principle. By contrast, impaired performance is described as a Configural Inferiority Effect (CIE), and does not reflect an EF but rather a number of possible factors, including additional computational and attentional load, increased similarity, and crowding. That is, for a CSE to manifest itself, the EFs need to be powerful enough to override all these other factors. In a standard procedure, these stimulus sets have been used in an “oddity reaction time (RT) task” in which subjects were asked to determine in which quadrant of a 2x2 grid an “odd” stimulus was presented (Fig. 1, right). Comparing RTs between each base and composite pair allows for the measurement of CSEs and CIEs in various contextual configurations.

The advantage of this approach is that it provides a quantitative measure of CSEs/CIEs rather than relying on subjective judgements. As predicted, many configurations (combinations of characters, line segments forming letters, surfaces, and 3D volumes) result in CIEs or very modest CSEs (Pomerantz & Portillo 2011). However, critically, other configurations show strong CSEs (Pomerantz et al. 1977; Pomerantz & Pristach 1989; Pomerantz & Portillo 2011). In Experiment 1 we test network sensitivity to specific, low-level EFs (proximity, orientation, and linearity) with simple dot configurations following Pomerantz & Portillo (2011). The combination of these and other EFs (and possibly other factors, including high-level features such as shape familiarity or closure) is assumed to give rise to the strong CSEs observed with the complex stimuli used in Pomerantz et al. (1977). In Experiment 2, networks are tested on these stimuli as well as on stimuli that produced strong CIEs in humans.

Where Do the Laws of Perception Come From?

A related question is the role of perceptual experience in acquiring EFs, that is, to what degree the grouping principles are learnt from the visual environment and to what degree they are a function of the innate architecture of the visual system (Todorović 2011). Classic Gestalt psychologists minimized the significance of a learning account (Metzger 1966). These authors conceded that some aspects of visual organization could be based on habit or learning (such as the ability to group particular continuous lines on a paper into letters), but these cases were thought to be the exception and weaker than others (Wertheimer 1923).

Nevertheless, some evidence has emerged supporting the idea that basic Gestalt principles could be the result of learning from the visual environment: Geisler et al. (2001) showed that human performance in contour detection is quantitatively predicted by a grouping rule derived directly from the statistics of natural images; Peterson & Gibson (1994) found that a silhouette is more likely to be assigned as the figure if it suggests a common object (see also Peterson 2019); and, with two different paradigms, Duncan (1984) and Zemel et al. (2002) found that perceptual grouping can be altered with only a small amount of experience in a novel stimulus environment. Other evidence based on RT responses has been collected by Vecera & Farah (1997). However, some configurations appear impenetrable to learning, as demonstrated by the Kanizsa stratification images (Grossberg & Zajac 2017).

In the current work, all networks tested were pre-trained on a large dataset of natural images: either ImageNet (a dataset consisting of a thousand categories of objects taken across over 1 million photographs, Deng et al. 2010) or, in the case of PredNet, the KITTI dataset (Geiger et al. 2013, a car-mounted camera dataset). Most of the networks used here achieve an impressive degree of accuracy on a test set of objects, on par with human performance on ImageNet (He et al. 2015). Therefore, regardless of their plausibility as models of the human ventral pathway, we can use them to test whether learning statistical regularities from a complex domain of 2D natural images enables a network to extract EFs. Some of these regularities might be low-level, such as sensitivity to the proximity of two stimuli or to their orientations with respect to one another; others might be higher-level, such as grouping based on shape familiarity. If we fail to observe grouping principles in our experiments, it could be that the models need to be trained on more complex datasets (e.g. an interactive 3D world) for some Gestalt principles to emerge, or alternatively, that different architectures are required.

Fig. 2

Generation of stimuli for Experiment 1, following Pomerantz & Portillo (2011). Top: Control set: one pair always consists of two randomly pixellated canvases without any stimuli on top; the other pair is obtained by adding one dot at a random location to one of the canvases. Bottom: Starting with a pair of images in which the only discriminant feature is the location of a dot, an additional dot is added, yielding the EF of proximity or orientation. The EF of linearity is obtained by adding a dot to the orientation pair. These three EFs have been found to elicit strong and consistent grouping effects in humans (Pomerantz & Portillo 2011). Each network was tested on 500 randomly generated sets, and each analysis was repeated on three different stroke-over-background conditions (see text)

Outline of the Current Work

We test a wide variety of DNNs (details in “Networks Used’’) on several sets of stimuli. Instead of using the “odd” quadrant task, we presented each image to the neural network and compared its internal activations across sets of images, always presented separately from one another. Each set is composed of two pairs of images, a base and a composite pair (obtained by adding a non-informative context to the base pair as in Fig. 1). For each pair, we computed the Euclidean distance between the activation vectors produced by the images of the pair at each layer of the network. We then compared the Euclidean distances of the composite and the base pair and normalized the difference to range from -1 to 1, obtaining a Network Configural Effect (CE). Most of the results in the main text refer to this Euclidean-based approach, but we also repeated the same analysis using the cosine similarity metric, a measure that is independent of the vectors’ magnitude, which produced slightly different results as discussed in “Magnitude Encoding Information’’. The two metrics are outlined in detail in “Computing Network Configural Effects’’ in “General Methods’’.

A positive CE is a measure of enhanced discrimination (Configural Superiority Effect, CSE); a negative CE, of impaired discrimination (Configural Inferiority Effect, CIE). These measures can then be compared with CSEs/CIEs found in human participants assessed through reaction time (RT) recordings. While we compared humans and networks on both CIEs and CSEs, note that only CSEs are the result of Gestalt grouping, while CIEs correspond to crowding, attentional load, etc. We computed the CEs across the Convolutional and Fully Connected layers of the networks (before the non-linear operation was applied), with a particular emphasis on the output layer, as it appears to be the most appropriate for comparison with human RTs. Nevertheless, we expect CSEs to start emerging at layers earlier than the output layer, given that Gestalt grouping is widely assumed to organize the visual scene for the sake of object recognition (Biederman 1987).

We selected the models based on their historical importance, their performance on standard datasets, their biological plausibility, and their Brain-Score (see the score for each network in Appendix C), and we aimed to use a wide variety of architectures and training regimes. We used a total of 16 networks, all pre-trained on ImageNet (apart from PredNet, which was pretrained on the KITTI dataset): 5 of them are classic convolutional networks, 3 are CNNs that have a direct biological inspiration with a front-end that simulates primate V1, 4 are attention-based networks, and 4 are self-supervised networks. Amongst all the above networks, 3 have recurrent mechanisms. The human RT data were collected from two sources: Pomerantz et al. (1977) and Pomerantz & Portillo (2011). In Experiment 1, we investigated specific, low-level emergent features by generating a wide number of configurations composed of simple dot patterns, as introduced by Pomerantz & Portillo (2011) and illustrated in Fig. 2. Pomerantz & Portillo (2011) found a strong and consistent effect for three EFs: proximity, orientation, and linearity, and therefore we tested whether these same features could be used by the networks to enhance the discriminability of a base image pair.

A multitude of low-level features combine to form Gestalt perceptual grouping for more complex shapes. In Experiment 2 we used 17 sets of stimuli that, albeit still simple, are composed of combinations of line segments and are thus intended to elicit a wide set of emergent features in humans (Pomerantz et al. 1977; the whole set is shown in Fig. 10, left). Five of these sets generated high CSEs in humans, indicating a strong Gestalt grouping effect. It is assumed that the CSEs observed with these complex stimuli are the result of many low-level emergent features. In addition, another five sets generated high CIEs in humans, indicating a strong impact of a combination of attentional load, crowding, etc. Both the CSE and CIE sets are shown in the legends of Figs. 5 and 6. We generalized the results across different background conditions and transformation conditions (rotation, translation, scale, and no transformation). More details about the networks used and the analysis are presented in “General Methods’’.

Fig. 3

Amount of Configural Effect (CE) in networks and humans (human data shown in the bottom-right box, extracted from Pomerantz & Portillo 2011). In humans, proximity, linearity and orientation all produced high CSEs. Notice that each CE is computed with respect to the base pair and, for the networks, using the Euclidean-based method. Most networks not only show consistent CSEs across the three EFs, but also a pattern of responses that matches the one found in humans. Vision Transformers showed less human-like responses, and three out of four self-supervised networks presented an inferiority effect. All figures refer to the condition with black dots on randomly pixellated backgrounds. In this and the following figures, networks within one colour of rectangle are convolutional and those within the other are Vision Transformers; a normal font weight indicates supervised learning, a bold font indicates self-supervised learning, and a coloured font indicates the presence of recurrent mechanisms. In this and the following boxplot figures, each box extends from the first to the third quartile, with a horizontal line at the median. Filled boxes indicate a median > 0, hatched boxes a median < 0. The whiskers extend from the box by 1.5x the inter-quartile range. Networks are ordered according to their Brain-Score, from highest to lowest.

Experiment 1

Methods

In order to study individual Emergent Features (EFs) selectively, Pomerantz & Portillo (2011) designed an odd-discrimination task in which dot patterns were used to create a base and a composite pair. They found that Configural Superiority Effects (CSEs) were consistently exhibited for three EFs: proximity, linearity, and orientation, which respectively generated CSEs of 0.38, 0.36 and 0.22 s (bottom-right box in Fig. 3), corresponding to RT speed-ups of 11%, 29% and 26% (Pomerantz & Portillo 2011). In this experiment, a “base pair” consisted of two images with a single dot placed at different locations, whereas a “composite pair” was generated by adding one or two dots in the same location to both base pair images, in such a way as to elicit different emergent properties: orientation, proximity, or linearity (see Fig. 2). We generated 500 sets (a base pair and a composite pair for each of the three emergent features tested here). Each dot in the base configuration was constrained to be located at least 20 pixels from any other dot and 40 pixels from the border, in order to avoid border effects (Kayhan & van Gemert 2020).
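To make the stimulus-generation procedure concrete, the following is a minimal sketch of how such dot-pattern sets could be produced. It assumes a 224x224 canvas and collapses the distinction between the proximity and orientation composites (which in the original stimuli depends on where the shared context dot falls relative to the base dots); all names are ours, not taken from the original code.

```python
import numpy as np

CANVAS, MARGIN, MIN_DIST = 224, 40, 20  # pixel sizes (canvas size assumed)

def random_dot(rng, existing=()):
    """Sample a dot at least MARGIN px from the border and MIN_DIST px
    from any previously placed dot."""
    while True:
        p = rng.integers(MARGIN, CANVAS - MARGIN, size=2)
        if all(np.linalg.norm(p - q) >= MIN_DIST for q in existing):
            return p

def make_set(rng):
    """One stimulus set: a base pair (one dot at two different locations)
    plus composite pairs whose added context dots are identical across
    the two images of a pair (hence uninformative)."""
    a = random_dot(rng)
    b = random_dot(rng, existing=[a])
    base = ([a], [b])
    ctx1 = random_dot(rng, existing=[a, b])        # one added dot:
    composite = ([a, ctx1], [b, ctx1])             # proximity/orientation
    ctx2 = random_dot(rng, existing=[a, b, ctx1])  # second added dot:
    linearity = ([a, ctx1, ctx2], [b, ctx1, ctx2]) # linearity
    return base, composite, linearity

rng = np.random.default_rng(0)
sets = [make_set(rng) for _ in range(500)]  # 500 randomly generated sets
```

Each list of dot coordinates would then be rendered onto the chosen background (white, black, or randomly pixellated) before being fed to the networks.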

In addition, we employed a control condition to assess the sensitivity of the networks to the type of stimuli used in this study (simple features on a randomly pixellated background), given their markedly different appearance from the dataset used for pretraining (which consisted of natural images). To do so, we computed the similarity between a pair of randomly pixellated canvases, and a pair composed of a randomly pixellated canvas and a randomly pixellated canvas with a single dot (see the top part of Fig. 2). If the pair containing the single-dot canvas is more discriminable than the pair without the dot, this implies that the model is indeed sensitive to these types of images.

CEs in humans were measured through an “odd” discrimination task, and CSEs were established when the added uninformative dot in the composite pair elicited faster discrimination than in the base pair (where only the location of a single dot changed across the two canvases). In networks, the discriminability of each pair was calculated by presenting each image individually and comparing the internal activations of the network between the two elements of each pair. If this within-pair difference was greater for the composite pair (e.g. the “orientation” pair) than for the base pair, this would indicate a superiority effect (CSE), since the uninformative feature added to the composite pair increases the discriminability of the two canvases. Conversely, if the within-pair difference was lower, this would indicate a configural inferiority effect (CIE), where the uninformative feature added to the composite pair decreases the discriminability of the two canvases. Unless otherwise stated, the discriminability values are computed through the Euclidean-based metric detailed in “Computing Network Configural Effects’’.

Each experiment was repeated across 3 stroke-over-background conditions: white-over-black, black-over-white, and black-over-randomly pixellated background. We present the results for the latter, but we obtained very similar results across all three conditions.

Results

The results are shown in Fig. 3 for the output layer and in Fig. 4 for all other layers. At the output layer, all convolutional networks trained with supervision are sensitive to proximity and linearity, and half of them are also sensitive to orientation. These networks also showed a pattern of responses that mimicked that of human participants (see the bottom-right box in Fig. 3): the proximity feature elicited a bigger effect than linearity, and linearity a bigger effect still than orientation.

Fig. 4

Amount of Configural Effect (CE) across all layers, for each network. The control condition (blue) is the only one consistently producing superiority effects across all layers, indicating that the networks are sensitive to the dot stimuli used here. However, the EFs of proximity, orientation, and linearity produced no effect at early stages and an inferiority effect at all but the final stages of processing. Dashed vertical red lines indicate the first fully connected layer in each network. Shaded areas correspond to one standard deviation. Network labels are colour-coded as illustrated in the caption of Fig. 3

Vision Transformers were also always sensitive to proximity, but were inconsistently responsive to the other two features. On the other hand, PredNet and the DINO models showed insensitivity to the three EFs and indeed a negative effect (CIE). The very low responses obtained with these networks could be the result of a general insensitivity to the types of dot stimuli presented in this work. However, the analysis of the control condition (blue line in Fig. 4) shows that all networks used are strongly sensitive to the addition of a single dot from the middle stage of processing onwards (with the exception of some Vision Transformers across the middle layers, which showed a weak but still positive effect for the control condition). In all cases, the added dot increased discrimination across all networks. On the other hand, adding one or two dots in order to elicit the EFs of proximity, linearity, and orientation resulted in a negligible effect in early layers and a strong but negative effect in middle layers, before producing the observed CE at the late stage of processing (see Fig. 4). That is, the additional EFs that make the stimuli more discriminable for humans produced less discriminable representations in the middle stages of the networks. We consider the implications of these results in “Discussion’’. Note that when we repeated the analysis with the cosine similarity approach (as opposed to the Euclidean approach), we found much weaker effects across all networks, and indeed, sometimes the CSEs found in the output layer were reversed into CIEs (see Fig. 8 in Appendix A). This shows that the CSEs are at least partially encoded in the magnitude of the internal representations.

The results presented thus far refer to the condition with black dots over randomly pixellated backgrounds. Across the other two background conditions, the results are generally consistent, with the exception of the Vision Transformers, which showed a higher sensitivity to the background, more often producing CIEs for both white and black backgrounds, sometimes even for the proximity condition.

Fig. 5

A: Networks’ Configural Effects (CEs) at the output layer for the 5 sets producing high CSEs in humans. All supervised, convolutional networks exhibited CSEs for sets 1–3, but overall only three models showed CSEs for all stimuli. The pattern did not appear to resemble human responses (in the boxes; data extracted from Pomerantz et al. 1977). Three of the self-supervised networks presented a negative effect. B: Networks’ CEs at the output layer for the 5 sets producing high inferiority effects (CIEs) in humans. Again, the pattern of responses did not match humans’, and almost all networks exhibited a superiority effect for at least one of these stimuli. The figure refers to the condition with the randomly pixellated background and translation transformation across pairs. See the Fig. 3 caption for details about the boxplots and the colour-code used for labelling the networks

Experiment 2

Methods

In Experiment 2 we investigate whether these models show human-like Gestalt phenomena when presented with more complex stimuli taken from Pomerantz et al. (1977). We used the image pairs from the first two experiments in Pomerantz et al. (1977). Images were arranged in 17 sets, each composed of two pairs: a base and a composite pair (the full set of images, including the added context to each pair, is shown in Fig. 10, left). The sets were composed so that they could elicit a wide variety of CSEs and CIEs. Furthermore, we used 4 different transformation conditions: no transformation, translation (up to 18% of the image size), scale (0.7 to 1.3 times the original image size), and rotation (up to 360 degrees). The same random transformation was kept fixed within a set (that is, with a random translation of 20 pixels, both images of the base pair and both images of the composite pair were translated by 20 pixels). For the three conditions employing transformations, each comparison was repeated 500 times, each time with a different random transformation uniformly sampled within the boundaries defined above.
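As an illustration of how a transformation condition might be implemented, here is a sketch under our own assumptions (torchvision operating on PIL images; `stimulus_set` is a hypothetical list holding the four images of one set). One set of parameters is sampled per repetition and applied identically to every image of the set, as described above.

```python
import random
import torchvision.transforms.functional as TF

def sample_params(kind, img_size=224):
    """Sample random parameters for one transformation condition,
    within the ranges described in the text."""
    if kind == "translate":
        m = int(0.18 * img_size)            # up to 18% of the image size
        return (random.randint(-m, m), random.randint(-m, m))
    if kind == "scale":
        return random.uniform(0.7, 1.3)     # 0.7 to 1.3 times original size
    if kind == "rotate":
        return random.uniform(0.0, 360.0)   # up to 360 degrees
    return None                             # "no transformation" condition

def apply_transform(img, kind, params):
    if kind == "translate":
        return TF.affine(img, angle=0.0, translate=list(params), scale=1.0, shear=[0.0])
    if kind == "scale":
        return TF.affine(img, angle=0.0, translate=[0, 0], scale=params, shear=[0.0])
    if kind == "rotate":
        return TF.rotate(img, params)
    return img

# 500 repetitions, each with fresh random parameters shared within a set
for _ in range(500):
    params = sample_params("translate")
    transformed = [apply_transform(img, "translate", params) for img in stimulus_set]
    # ...present each transformed image to the network and record activations
```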

Fig. 6

Amount of Configural Effect (CE) across all layers, for each network. Similar to the plot in Experiment 1, most networks exhibited limited sensitivity to Gestalt grouping at early layers, and a mostly negative sensitivity during middle stages, until the output layer. This is in contrast to what is expected in human participants (see “Discussion’’). Notable exceptions were AlexNet and, partially, VGG19. This could be connected to the fact that these networks have fully connected layers at their middle-to-late processing stages, and not just at the output layer like all other convolutional networks. Network labels are colour-coded as explained in the caption of Fig. 3

Results

Of the 17 sets of stimuli used by Pomerantz et al. (1977), we first focus on the 5 sets that showed the largest CSEs in humans, corresponding to speed-ups of between 40% and 180% (Fig. 5A; the sets are shown in the legend). The corresponding human CSEs are shown in the bottom-right box in Fig. 5A. When analysing the output layer, sets 1, 2 and 3 indeed produced positive CSEs for all supervised networks (both convolutional and Transformers). By contrast, sets 4 and 5 were positive only for InceptionNet, CORnet-S, and VOneCORnet-S. In contrast with Experiment 1, the relationship amongst the CSEs did not resemble that found in humans (for example, in almost all cases, set 2 elicited a stronger effect than set 1). Similarly to the results of Experiment 1, three out of four self-supervised networks presented a consistent negative effect for all sets (the exception being SimCLR, in both experiments). We then analysed the 5 sets that showed the largest CIEs in humans (Fig. 5B). We found that not only did the pattern of responses fail to resemble human RTs, but almost all networks showed a superiority effect for at least one of these sets.

The analysis across all networks layers (Fig. 6) confirmed the pattern observed in Experiment 1, with negligible effects in early layers, and a negative effect in middle layers.

We observed that the two convolutional networks that possessed fully connected layers in addition to the output (AlexNet and VGG19) demonstrated a CSE across all fully connected layers, suggesting that only this type of layer, and not a convolutional one, may exhibit CSEs. We further explore this point, and its connection to studies of Gestalt effects using human neural data, in the “Discussion’’.

We extended the analysis to all 17 sets of stimuli by plotting human CEs against network CEs in Fig. 10 in Appendix B. To quantify whether human and network CEs are correlated, we computed Kendall’s tau correlation coefficient between them. The relation was always positive, apart from the PredNet and DINO ViT-S/8 models, but in no case was it significant at \(p<0.01\).
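In code, this correlation check amounts to little more than the following (a sketch; `human_ces` and `network_ces` are hypothetical arrays holding one CE value per stimulus set):

```python
from scipy.stats import kendalltau

# human_ces, network_ces: one Configural Effect value per stimulus set
tau, p_value = kendalltau(human_ces, network_ces)
print(f"Kendall's tau = {tau:.3f}, p = {p_value:.4f}")  # significance assessed at p < 0.01
```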

The results presented thus far refer to the condition with a randomly pixellated background and translation transformation. We obtained similar results with the other two background conditions (white and black) and transformation conditions (scale, rotation, and no transformation). When repeating the analysis with the cosine similarity metric, we again found an overall reduction of the effect across all layers, which resulted in a larger number of CIEs.

Fig. 7

Testing the contribution of shape familiarity on sets 1 and 2 by changing the contextual image in order to generate unfamiliar composite pairs. Only networks exhibiting CSEs for these sets were tested. The modified contextual image is shown with a red border. Networks did not seem to be affected by the use of an unfamiliar shape (with the only exception of SimCLR for set 2). Therefore, the CSEs found with these shapes are the result of other features, such as closedness. See Fig. 3 for details about the boxplots and the colour-code used for labelling the networks

The Effect of Familiarity

We investigated whether the CSEs consistently found across some sets of items were the result of familiarity rather than emergent features. That is, it is possible that the networks acquired a sensitivity to some basic shapes as a consequence of being trained on ImageNet, and that this makes these shapes more salient and discriminable when compared to different shapes or non-shapes. For example, in sets 1 and 2, adding a non-informative stimulus turned the diagonal lines into a familiar geometrical shape (a triangle). In humans, the increased discriminability that results in strong CSEs is assumed to be the result of several emergent features (symmetry, closedness; Pomerantz et al. 1977). On the other hand, it is possible that the networks are sensitive to these images because they have experienced these shapes during training.

We provided an initial test of this hypothesis by modifying sets 1 and 2 (the sets whose composite pairs included a triangular shape) so that they no longer contained familiar shapes. We re-ran the analysis only for those networks that exhibited CSEs for sets 1 and 2. The results (Fig. 7) indicated that most networks still exhibited the same degree of Gestalt grouping effects for sets 1 and 2 (the only exception being SimCLR, which seemed to be sensitive to the particular triangular shape presented in set 2). Therefore, we can reject the hypothesis that the networks were simply responding to the familiar shape; they may instead be sensitive to other features such as closedness.

Overall, Experiment 2 produced mixed results. Only a subset of the stimulus sets that produced large CSEs in humans evoked high CSEs in networks, and almost uniquely at the final layer. In no case did the pattern of CSEs across stimuli resemble that produced by humans. Furthermore, networks sometimes showed CSEs for stimuli where humans most strongly produced CIEs. On the other hand, we verified that the networks were not simply basing their responses on familiarity.

Discussion

We will now elaborate on the extent to which our experiments can answer the questions raised in the “Introduction’’ and then discuss how our findings relate to previous work assessing Gestalt organizational principles in DNNs.

Do Modern DNNs Exhibit Human-Like Gestalt Grouping?

Overall, we found mixed support for the hypothesis that DNNs support Gestalt grouping in a human-like fashion. It is indeed remarkable that most convolutional networks trained with supervision did show CSEs for proximity, linearity, and (less frequently) orientation (Experiment 1), and furthermore, that the size of the effects mirrored human performance. However, for all 16 networks, the sensitivity to these properties at all layers earlier than the output layer was either negligible or reversed. Furthermore, the networks did not show a consistent pattern of responses for complex stimuli: in Experiment 2, only three of the five sets of stimuli that produce CSEs in humans gave rise to consistent CSEs in convolutional networks, and again only at the output layer; the other two showed either a weak effect or the opposite, an inferiority effect. Furthermore, in this experiment, the pattern of CSEs across different sets did not resemble that observed in human responses. EFs are powerful Gestalt properties that are not only subjectively compelling (see Fig. 1) but that support fast RTs in participants, similar to other low-level visual features that “pop out” (Treisman 1998). The essential role of EFs is reflected in their importance in figure-ground segregation (Wagemans et al. 2012a) as well as in building representations of object parts through the perceptual generation of non-accidental properties (Biederman 1987; Kubilius et al. 2017). Therefore, it would be expected that EFs emerge in the human visual stream earlier than the stage at which objects are classified. In fact, research has suggested that Gestalt phenomena do not emerge at low-level retinotopic areas, but rather in shape-selective regions such as the lateral occipital complex (LOC) (Kubilius et al. 2011), even though their effect has neural correlates as early as V1 (Fox et al. 2017). Note that LOC corresponds to an earlier processing stage than object identification in the inferotemporal cortex (IT). This is in striking contrast with what we found here, where the output layer, often associated with IT, was in most cases the only layer exhibiting any Gestalt effect. Note that we do expect to find CSEs in the output layer, but the lack of CSEs (and instead, the strong CIEs) in all earlier layers points to a clear discrepancy between network and human visual processing.

It is possible that the crucial property of the output layer that allows the emergence of CSEs is that it is a fully connected (FC) layer (as opposed to a convolutional layer). In the current work, amongst the convolutional networks, only AlexNet and VGG19 had multiple FC layers. Interestingly, AlexNet did seem to produce CSEs even at “internal” FC layers in both experiments, and VGG19 did so for sets 1–3 in Experiment 2, but generally failed to produce CSEs at all in Experiment 1, even at the output layer (in fact, it was the worst performing convolutional network in that experiment).

This hints at the possibility that Gestalt effects in DNNs can emerge prior to the output layer, and that the reason effects were restricted to the output layer in most models is that there was only one FC layer. Why fully connected layers but not convolutional layers support some CSEs is not currently clear. But it suggests that the architectures of most current DNNs are incompatible with the Gestalt organizational principles observed in humans. Furthermore, we noticed that even though the layers taken into account during the analysis of the Vision Transformers were all FC, these networks did not produce Gestalt grouping consistently, which might be due to the fundamentally different mechanism they use.

Differences Amongst Networks

When considering only the output layer, we found a clear difference between convolutional, supervised networks (which appeared to be the most sensitive to some basic properties used by humans, such as proximity, linearity, and orientation) and self-supervised networks (which often produced the opposite effect to humans). Vision Transformers appeared to be somewhere in between, producing weak and inconsistent human-like effects. Networks possessing a recurrent mechanism did not appear to perform differently than the other networks. Amongst the self-supervised networks, it is surprising that PredNet (a self-supervised, convolutional, and recurrent architecture), which has been shown to be sensitive to several types of illusions (Watanabe et al. 2018; Lotter et al. 2020), performed extremely poorly and in fact always showed inferiority effects. Interestingly, this model also scored poorly on Brain-Score (see the Table in Appendix C). On the other hand, SimCLR, amongst the lowest scoring models used here on the Brain-Score benchmark, did show sensitivity to proximity and linearity, and some CSEs for sets 2, 4, and 5 in Experiment 2. Overall, when considering the output layer, we see a clear distinction between networks that were trained with supervision and possessed a convolutional architecture, which showed more human-like grouping effects, and all other models. However, when considering intermediate layers of processing, the picture was much more homogeneous: most networks did not show any difference in discriminability between composite and base pairs during early processing, and the two pairs became even less discriminable at the middle stages of processing (resulting in CIEs). Almost all networks presented an inferiority effect regardless of architecture, training regime, and Brain-Score.

Magnitude Encoding Information

We observed a small but important difference between the results computed with the normalized Euclidean distance and with cosine similarity (a measure independent of the magnitude of the internal activations). The effects when measured through the latter were strongly reduced and sometimes reversed (that is, a response appearing as a CSE with the Euclidean distance became a CIE with cosine similarity). We believe the Euclidean distance to be the better metric for two reasons. First, the magnitude of the internal activations is used in the networks’ internal computations, through non-linearities such as ReLU. Second, even if the classification itself is commonly computed by disregarding the magnitude of the last layer’s activations (through argmax), it is nevertheless important to establish that the information is present and could potentially be used if additional layers were appended.

Acquisition of Gestalt Grouping Principles

The degree to which some basic grouping properties can be learnt has been a controversial topic for many decades (see “Introduction’’). Whether or not DNNs are good models of the human visual system, they are excellent at extracting statistical regularities from a dataset, and therefore provide a test of whether some grouping phenomena are implicitly encoded within the statistics of a particular dataset.

The finding that the networks only showed the right pattern of CSEs for low-level features (Experiment 1) suggests that the simplified training environment provided by 2D images might be insufficient for acquiring more complex Gestalt principles, or for combining the low-level features in a human-like way. Still, it is possible that using a more realistic dataset (e.g. a 3D environment), or more naturalistic training setups (e.g. with a reward signal, as in the reinforcement learning approach), could result in the acquisition of a wide array of grouping principles, but this is pure speculation for now. On the other hand, we also note that PredNet was trained on sequences of realistic images (as opposed to the static images used for all other networks), and it was one of the models most dissimilar to humans in terms of showing any type of Gestalt grouping.

It is also interesting to note that Gestalt grouping has been obtained in artificial networks that have very different architectures than DNNs (Grossberg et al. 1997; Francis et al. 2017; Herzog et al. 2003). In other cases, architectural modifications have been added to DNNs to solve tasks that appear to underlie Gestalt grouping (Linsley et al. 2018). Highly specific and unnatural training setups (e.g. training on a checkerboard-like pattern) have also resulted in the emergence of some configural properties in convolutional DNNs (Keshvari et al. 2021). This suggests that Gestalt principles might be obtained only with an appropriate architecture or with highly specific training, and are not simply the result of statistical feature extraction from the retinal images.

Conclusion

Overall, the results presented here highlight a partial disconnect between DNNs trained on natural 2D images and human vision. Testing 16 networks that varied across architectures and training setups on stimuli that elicit strong Gestalt grouping in humans (measured as increased discrimination in an “oddity reaction time task”), we found that the models almost universally showed no grouping effect across early layers, and a negative effect (that is, decreased discrimination) across middle and late stages of processing up to the output layer. This is in contrast to the human visual processing system, where the effect of emergent features has been observed at intermediate stages of processing. At the output layer, only convolutional networks trained with supervision consistently acquired low-level visual features that elicited grouping effects strongly resembling those produced by humans. At the same time, the sensitivity to low-level grouping features in trained networks did not fully transfer to more complex stimuli, which instead produced grouping effects that only inconsistently mimicked human behaviour and sometimes deviated strongly from it. Although our findings highlight the limitations of current DNNs in capturing Gestalt effects, our results do suggest the possibility that human-like Gestalt effects may emerge in response to training in some network architectures. More generally, this work highlights the importance of comparing network performance with well-established psychological phenomena, which have been largely ignored when comparing DNNs to the human brain.

General Methods

Networks Used

We selected 16 networks based on their historical importance, their performance on standard datasets, and their biological plausibility. We used as a point of reference the Brain-Score value (Schrimpf et al. 2018), indicating the amount of variance explained by the model across several benchmarks. AlexNet (Krizhevsky et al. 2012), VGG19 (Simonyan & Zisserman 2014), and ResNet (He et al. 2016) (we used ResNet-152) are classic networks that have often been tested on several cognitive phenomena (Schrimpf et al. 2018; Baker et al. 2018a; Biscione & Bowers 2022), with mixed results. InceptionNet (Szegedy et al. 2015) was shown by Kim et al. (2021) to be sensitive to the effect of Gestalt closure and thus seemed suited to this battery of tests (we used InceptionNet V3). In DenseNet (Huang et al. 2017) each convolutional layer is connected with every other layer. A smaller member of this family of networks, which we used (DenseNet-201), has been shown to possess human-like translation invariance (Biscione & Bowers 2021).

We also tested two model families specifically developed to be biologically plausible and to provide a good match with primate neural data. The “CORnet” model family (Kubilius et al. 2019) aimed to incrementally build a network architecture by adding recurrent and skip connections while monitoring both classification accuracy and agreement with a body of primate brain neural data. From this family, CORnet-S was selected as the best CORnet architecture.

The VOneNet family (Dapello et al. 2020) was developed to better match the structure of the primate visual cortex. Each VOneNet contains a fixed-weight neural network front-end that simulates primate V1, called the VOneBlock, followed by a neural network back-end adapted from current CNN vision models. We used two versions of VOneNet: one with a CORnet-S backend (VOneCORnet-S), and the other with a ResNet50 backend (VOneResnet50).

Vision Transformers (Mehrer et al. 2021) are an attention-based family of models which achieve high accuracy on vision tasks and have been found to produce error patterns more consistent with those of humans (Tuli et al. 2021). We used ViT-B/16, ViT-B/32, ViT-L/16, and ViT-L/32, where B/L indicates either a “base” or a “large” model, and the number indicates the patch size.

Self-supervised models aim at building meaningful representations from unlabelled data that can be used for solving downstream tasks, and the deep contrastive embedding they use has been suggested as a biologically plausible computational theory of the primate visual system (Zhuang et al. 2020). We used four self-supervised networks: DINO ViT-S/8, DINO ViT-B/8 (Caron et al. 2021), SimCLR-ResNet18 (Chen et al. 2020), and PredNet (Lotter et al. 2018); the name after the hyphen or space indicates the network’s backbone. PredNet is particularly interesting for this work as it is inspired by the predictive coding theory in the neuroscience literature (Rao & Ballard 1999) and can mimic several effects of visual perception, including illusory contours, the flash-lag effect (Lotter et al. 2017), and illusory motion (Watanabe et al. 2018).

VOneResnet50, DenseNet, ResNet-152 and VGG19 are in the top 10 on Brain-Score at the time of writing, all with a score higher than 0.4 (the highest scoring network, with 0.465, was a version of EffNet which we were not able to obtain). Amongst the convolutional, supervised models, AlexNet scored the lowest with 0.38. In spite of their success on vision tasks, Vision Transformers do not score particularly highly on Brain-Score (ViT-B/32 scored the highest with 0.355; all other ViTs scored between 0.16 and 0.2). Similarly, SimCLR’s score is about as high as AlexNet’s (\(\sim 0.38\)). Surprisingly, PredNet is the lowest scoring model amongst those tested here (0.195). All values are shown in Appendix C.

All networks were pretrained on ImageNet. When feeding the images into the networks, we first resized them to the size used during each network’s pretraining (that is, 224x224 for all networks except InceptionNet, which used 299x299). Furthermore, all images were normalized with the mean and standard deviation used during ImageNet pretraining. We analysed the activations of all Convolutional and Fully Connected layers before the non-linearity operation was applied.
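A minimal sketch of this preprocessing and layer-recording pipeline, assuming PyTorch/torchvision and using AlexNet as an example (the hook-based extraction is our reconstruction, not the original code):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Preprocessing matching ImageNet pretraining: resize and normalize
preprocess = T.Compose([
    T.Resize((224, 224)),   # 299x299 for InceptionNet
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.alexnet(weights="IMAGENET1K_V1").eval()
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # The Conv/Linear output is recorded before any subsequent ReLU
        # module runs, i.e. before the non-linearity is applied; clone()
        # protects the copy from in-place ReLUs
        activations[name] = output.detach().flatten().clone()
    return hook

for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(preprocess(img).unsqueeze(0))  # img: one PIL stimulus image
# 'activations' now maps each layer name to a flattened activation vector
```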

Computing Network Configural Effects

The main difficulty in comparing Pomerantz’s behavioural results with neural networks is that the behavioural results were based on RTs, which DNNs do not produce (there are some exceptions, but the models commonly scoring high on Brain-Score do not possess this feature). However, since we have direct access to the models’ internal representations, we can nevertheless obtain a measure of stimulus discriminability. We presented a pair of images to the network; for each image, we recorded the activation value of every unit of a given layer, obtaining an activation vector for each image and each layer. The “distance” between the two activation vectors then corresponds to a measure of discriminability for the image pair at a particular layer. We compared these measures to human RTs: a high distance corresponds to high discriminability, which corresponds to fast RTs, and vice versa.

We used two different ways of computing the distance between the activation vectors \(d^{l}({\textbf {x}})\) (where \(l\) indicates a specific layer, and \({\textbf {x}}\) is an input image): a Euclidean-based method:

$$\begin{aligned} D^{l}({\textbf {a}}, {\textbf {b}}) = \left\| d^{l}({\textbf {a}}) - d^{l}({\textbf {b}}) \right\| , \end{aligned}$$

and a cosine similarity metric, which is invariant to the magnitude of the internal activations:

$$\begin{aligned} C^{l}({\textbf {a}}, {\textbf {b}}) = \frac{d^{l}({\textbf {a}}) \cdot d^{l}({\textbf {b}})}{\left\| d^{l}({\textbf {a}}) \right\| \left\| d^{l}({\textbf {b}}) \right\| }. \end{aligned}$$

We refer to the general difference between composite and base pairs as the Configural Effect (CE). Using the same approach outlined in Pomerantz & Portillo (2011), we obtained the networks’ Configural Superiority Effects (CSEs) and Configural Inferiority Effects (CIEs) by computing, for each distance metric, the difference across the two pairs of stimuli: a base and a composite pair. The composite pair is obtained by adding a non-informative feature to each image of the base pair. For humans, the CE is simply computed as \(RT_{base} - RT_{composite}\), with positive values indicating a CSE and negative values a CIE.

For networks, CEs with the Euclidean distance can be computed as follows (Jacob et al. 2021):

$$\begin{aligned} NetworkCE = \frac{D^{l}(composite_{a}, composite_{b}) - D^{l}(base_{a}, base_{b})}{D^{l}(composite_{a}, composite_{b}) + D^{l}(base_{a}, base_{b})} \end{aligned}$$


Since \(D^{l}\) is a measure of dissimilarity, a \(D^{l}\) higher for the composite pair than for the base pair indicates that the uninformative feature added to the composite pair made the two images more dissimilar from one another (that is, more discriminable), which corresponds to a CSE.

Fig. 8

Results from Experiment 1 with the cosine similarity approach, as opposed to the Euclidean-based method presented in the main text. Amount of Configural Effect (CE) in networks and humans (bottom-right box, data extracted from Pomerantz & Portillo 2011). Even though the pattern of responses is similar across humans and networks (that is, the order from the most to the least CE is proximity, linearity, and orientation), DNNs often exhibited CIEs, in contrast with the Euclidean-based approach with which they expressed CSEs more frequently, especially the convolutional architectures. No network but SimCLR showed a CSE for orientation, and many networks failed to show a CSE for linearity

Alternatively, we can compute the CE using cosine similarity:

$$\begin{aligned} NetworkCE = C^{l}(base_{a}, base_{b}) - C^{l}(composite_{a}, composite_{b}). \end{aligned}$$

Since cosine similarity is a measure of similarity rather than distance, a base pair that is more similar than the composite pair will again indicate a CSE. In both cases, a positive NetworkCE indicates a CSE, and a negative NetworkCE indicates a CIE.
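Putting the two metrics together, the per-layer CE computation could be sketched as follows (activation vectors obtained as in “Networks Used’’; the function names are ours):

```python
import torch
import torch.nn.functional as F

def network_ce_euclidean(base_a, base_b, comp_a, comp_b):
    """Normalized Euclidean CE: positive values indicate a CSE,
    negative values a CIE. Inputs are activation vectors d^l(x)."""
    d_base = torch.norm(base_a - base_b)
    d_comp = torch.norm(comp_a - comp_b)
    return ((d_comp - d_base) / (d_comp + d_base)).item()

def network_ce_cosine(base_a, base_b, comp_a, comp_b):
    """Cosine-based CE: similarity of the base pair minus that of the
    composite pair (invariant to activation magnitude)."""
    c_base = F.cosine_similarity(base_a, base_b, dim=0)
    c_comp = F.cosine_similarity(comp_a, comp_b, dim=0)
    return (c_base - c_comp).item()
```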

Note that the Euclidean and cosine similarity approaches to computing the CE have different scales, which in turn differ from the scale of human CEs computed through RTs.

We computed these metrics for each layer in the networks, in the same order in which the layers are traversed during the feedforward pass. That is, for the networks in this work using recurrence (CORnet-S, VOneCORnet-S, PredNet), we computed the distance at the recurrent layers multiple times, once for each time a layer was traversed, and in the same order in which its activations were computed.