Introduction

Humans tend to group perceptual features together in order to form a coherent whole. Understanding when this happens has been the focus of Gestalt psychology research for over 100 years, and more than a hundred grouping “laws” have been suggested (Wagemans et al. 2012a). Whereas in the past the formulation of these laws was based on subjective experience and was criticized for a lack of scientific rigour, subsequent researchers have developed experimental designs with carefully constructed stimuli (e.g. Gabor stimuli, dot lattices) that allow for parametric control, richer visual displays, and objective measures of grouping effects (Wagemans et al. 2012b). One such approach consists of measuring the impact of salient Emergent Features (EFs) on discriminating visual patterns. These EFs derive from the relationships amongst individual parts rather than from the parts themselves (Pomerantz et al. 1977; Pomerantz & Portillo 2011). We will use the concept of EFs as the basis of our approach, as detailed later.

Recently there has been an explosion of interest in Deep Neural Networks (DNNs) as models of the human visual system for object recognition. Even though DNNs have primarily been designed to solve engineering tasks, reports that the patterns of activation of units in DNNs are similar to neural activations in human and macaque visual systems have led to the view that DNNs can be used as a test bed for simulating biological vision in mammals (Gauthier & Tarr 2016; Kriegeskorte 2015). As a way of formalizing this similarity, a Brain-Score benchmark has been put forward (Schrimpf et al. 2018), which has been enthusiastically embraced by researchers comparing DNNs to human vision.

In the current paper, we explore several DNNs thought to be amongst the best models of human vision and test whether they support various Gestalt grouping phenomena. In particular, we tested whether DNNs are sensitive to some basic principles of organization such as proximity, orientation, and linearity (Experiment 1), and whether they experience Gestalt grouping when presented with more complex stimuli (Experiment 2). We compare networks’ responses with human responses from classic visual perception work (Pomerantz et al. 1977; Pomerantz & Portillo 2011).

Our main research question is as follows: do DNNs exhibit human-like Gestalt grouping effects? We split this question into two sub-questions: are networks sensitive to the basic properties of proximity, orientation, and linearity? And are they sensitive to more complex Gestalt grouping effects? In addition to this primary aim, the investigation will also yield insight into which architectures or training regimes appear most appropriate for acquiring Gestalt properties, and into the extent to which Gestalt phenomena are a learned or innate aspect of the human visual system.

In the following sections, we contextualize these questions and explain how our experiments address them.

Neural Networks as a Model of the Human Visual System

DNNs trained on ImageNet (a dataset consisting of 1000 categories of objects taken across over 1 million photographs, Deng et al. 2010) develop a set of internal feature representations that are statistically similar to the neural representations in human and non-human primate visual systems (Yamins & DiCarlo 2016; Khaligh-Razavi & Kriegeskorte 2014; Schrimpf et al. 2018). A neuronal and behavioural benchmark called Brain-Score has been developed to assess neural networks on their similarity with biological object recognition systems, with DNNs performing much better than all previous approaches (Schrimpf et al. 2018). At the time of writing more than 160 models have been tested on Brain-Score.

In spite of these successes when compared with neuronal data and tested on classification accuracy benchmarks, DNNs often fail on the most basic perceptual properties exhibited by humans (Bowers et al. 2022). For example, DNNs do not possess a human-like shape bias (Geirhos et al. 2018; Malhotra et al. 2020), they appear to discriminate categories based on local instead of global features (Baker et al. 2018b; Malhotra et al. 2022), are much more susceptible to small amounts of image degradation (Geirhos et al. 2018), do not account for humans’ similarity judgements of 3D shapes (German & Jacobs 2020), and fail to support basic visual reasoning such as classifying images as the same or different (Puebla & Bowers 2021). In addition, DNNs often act in surprising, non-human-like ways, such as being fooled by adversarial images (Szegedy et al. 2013; Dujmović et al. 2020) and making bizarre classification errors for familiar objects in unusual poses (Kauderer-Abrams 2017; Gong et al. 2014; Chen et al. 2017). Furthermore, recent work by Xu & Vaziri-Pashkam (2021) failed to find strong neural correlates between DNNs’ internal representations and fMRI signals from high-level visual areas of human participants.

At the same time, some authors have highlighted that DNNs appear to capture some key psychological findings, such as Weber’s Law, sensitivity to scene incongruencies, and the Thatcher effect (Jacob et al. 2021) (but see Bowers et al. 2022 for limitations of these findings). Biscione & Bowers (2021, 2022) also found that networks exhibited strong invariance to several novel object transformations (rotation, scale, change in luminance, translation, and to a lesser degree change in viewpoint), but only after being trained on a correspondingly transformed dataset of different classes, indicating that DNNs can learn the human perceptual property of invariance to object transformations (Blything et al. 2020, 2021). Recently, Yin et al. (2023) found that DNNs trained on words provided a better account of visual priming effects than many classic orthographic coding schemes.

Fig. 1

Pomerantz et al.’s (1977) approach to measuring Emergent Features (EFs). Left: Starting with a base pair, a non-informative context is added to obtain a composite pair. Right: The base pair and composite pair are arranged in a 2x2 grid to form an odd-discrimination task. Participants are asked to indicate the location of the “odd” element, and RTs are measured for the base and the composite grid. With some composite grids (such as the one illustrated in the figure) discrimination is much faster than with the corresponding base grid. Since the added context in the composite is equal for odd and non-odd elements, any facilitation must be due to EFs

Neural Networks and Gestalt

As far as we are aware, Gestalt grouping in DNNs has been explored only in relation to illusory contours and closure. Baker et al. (2018a) tested the degree to which networks could perceive illusory contours after being trained on non-illusory similar shapes (large and thin rectangles). The network successfully predicted the type of shape regardless of whether the contour was normal or illusory, but the authors found no evidence that the network used the same information as humans and concluded that CNNs do not perceive illusory contours. Kim et al. (2021) challenged this conclusion and found that several architectures (including InceptionNet) pretrained on ImageNet exhibited closure on displays of edge fragments. Whether the latter findings reflect Gestalt processes similar to human visual processing remains unclear. Other researchers have focused on modifying architectures or training regimes in order to explore this issue: Lotter et al. (2020) found that a network based on predictive coding (PredNet, one of the 16 networks tested in this work), trained to predict the next frame of video sequences, exhibited disparate phenomena observed in the visual cortex, including the flash-lag effect and illusory contours. The same network also appeared to perceive illusory motion (Watanabe et al. 2018). Using a DNN modified to include feedback connections through predictive coding dynamics, Pang et al. (2021) also showed human-like perception of illusory contours. Illusory contours constitute an important phenomenon in Gestalt psychology, but human perception is mediated by a wider set of grouping principles which, to the best of our knowledge, have not been explored in DNNs.

If DNNs are going to be used as models of human vision, it will be important that they not only do well on various Brain-Score measures but also account for key experimental results reported in psychology (Bowers et al. 2022). Here we have focused on Gestalt rules of organization not only because they play a key role in visual perception and object recognition (Biederman 1987; Perrett & Oram 1993; Spillmann 2009), but because there are existing image datasets and robust empirical phenomena that make them easy to test. Specifically, we consider whether DNNs show sensitivity to EFs, as described next.

Formation of Wholes Through Emergent Features

Gestalt researchers have long studied the emergence of “wholes” from the combination of individual parts, but Gestalts have proven difficult to define and measure. Pomerantz et al. (1977) operationally defined Gestalts as the result of salient Emergent Features (EFs), that is, features that result from the relations amongst individual elements and are not possessed by the individual elements themselves. These EFs behave as though they were elementary themselves and are sometimes detected more quickly than the more basic features from which they arise. The effect can be measured in a discrimination task, is robust to added noise (Moors et al. 2020), and might constitute a fundamental stage in shape processing related to the perception of the non-accidental properties posited by the Recognition-by-Components theory (Biederman 2001; Kubilius et al. 2017).

In Fig. 1, left, we illustrate this idea. As a baseline, we measure how well humans distinguish two stimuli (A and B, the base pair). We then add a new contextual stimulus C to both the A and B images, creating a composite pair. Importantly, stimulus C is not informative by itself in distinguishing AC from BC. Nevertheless, the composite pair is now much more discriminable than the base pair due to the interaction of the context with the base elements (that is, through the creation of Emergent Features). Whenever the context stimulus changes performance on the discrimination task, a Configural Effect (CE) is observed. Improved performance is described as a Configural Superiority Effect (CSE), and this provides a measure of an EF that is the product of a Gestalt organizational principle. By contrast, impaired performance is described as a Configural Inferiority Effect (CIE), and does not reflect an EF but rather a number of possible factors, including additional computational and attentional load, increased similarity, and crowding. That is, for a CSE to manifest itself, the EFs need to be powerful enough to override all these other factors. In a standard procedure, these stimulus sets have been used in an “oddity reaction time (RT) task” in which subjects were asked to determine in which quadrant of a 2x2 grid an “odd” stimulus was presented (Fig. 1, right). Comparing RTs between each base and composite pair allows for the measurement of CSEs and CIEs in various contextual configurations.

The advantage of this approach is that it provides a quantitative measure of CSEs/CIEs rather than relying on subjective judgements. As predicted, many configurations (combinations of characters, line segments forming letters, surfaces, and 3D volumes) result in CIEs or very modest CSEs (Pomerantz & Portillo 2011). However, critically, other configurations show strong CSEs (Pomerantz et al. 1977; Pomerantz & Pristach 1989; Pomerantz & Portillo 2011). In Experiment 1 we test network sensitivity to specific, low-level EFs (proximity, orientation, and linearity) with simple dot configurations following Pomerantz & Portillo (2011). The combination of these and other EFs (and possibly other factors, including high-level features such as shape familiarity or closure) is assumed to give rise to the strong CSEs observed with the complex stimuli used in Pomerantz et al. (1977). In Experiment 2, networks are tested on these stimuli as well as on stimuli that produced strong CIEs in humans.

Where Do the Laws of Perception Come From?

A related question is the role of perceptual experience in acquiring EFs, that is, to what degree the grouping principles are learnt from the visual environment and to what degree they are a function of the innate architecture of the visual system (Todorović 2011). Classic Gestalt psychologists minimized the significance of a learning account (Metzger 1966). These authors conceded that some aspects of visual organization could be based on habit or learning (such as the ability to group particular continuous lines on a paper into letters), but these cases were thought to be the exception and weaker than others (Wertheimer 1923).

Nevertheless, some evidence has emerged supporting the idea that basic Gestalt principles could be the result of learning from the visual environment: Geisler et al. (2001) showed that human performance in contour detection is quantitatively predicted by a grouping rule derived directly from the statistics of natural images; Peterson & Gibson (1994) found that a silhouette is more likely to be assigned as the figure if it suggests a common object (see also Peterson 2019); and, with two different paradigms, Duncan (1984) and Zemel et al. (2002) found that perceptual grouping can be altered with only a small amount of experience in a novel stimulus environment. Other evidence based on RT responses has been collected by Vecera & Farah (1997). However, some configurations appear impenetrable to learning, as demonstrated by the Kanizsa stratification images (Grossberg & Zajac 2017).

In the current work, all networks tested were pre-trained on a large dataset of natural images: either ImageNet (a dataset consisting of a thousand categories of objects taken across over 1 million photographs, Deng et al. 2010) or, in the case of PredNet, the KITTI dataset (Geiger et al. 2013, a car-mounted camera dataset). Most of the networks used here achieve an impressive degree of accuracy on a test set of objects, on par with human performance on ImageNet (He et al. 2015). Therefore, regardless of their plausibility as models of the human ventral pathway, we can use them to test whether learning statistical regularities from a complex domain of 2D natural images enables a network to extract EFs. Some of these regularities might be low-level, such as sensitivity to the proximity of two stimuli or to their orientations with respect to one another; others might be higher-level, such as grouping based on shape familiarity. If we fail to observe grouping principles in our experiments, it could be that the models need to be trained on more complex datasets (e.g. an interactive 3D world) for some Gestalt principles to emerge, or alternatively, that different architectures are required.

Fig. 2

Generation of stimuli for Experiment 1, following Pomerantz & Portillo (2011). Top: Control set: one pair always consists of two randomly pixellated canvases without any stimuli on top; the other pair is obtained by adding one dot at a random location to one of the canvases. Bottom: Starting with a pair of images in which the only discriminant feature is the location of a dot, an additional dot is added, yielding the EF of proximity or orientation. The EF of linearity is obtained by adding a dot to the orientation pair. These three EFs have been found to elicit strong and consistent grouping effects in humans (Pomerantz & Portillo 2011). Each network was tested on 500 randomly generated sets, and each analysis was repeated on three different stroke-over-background conditions (see text)

Outline of the Current Work

We test a wide variety of DNNs (details in “Networks Used’’) on several sets of stimuli. Instead of using the “odd” quadrant task, we presented each image to the neural network and compared its internal activations across sets of images, always presented separately from one another. Each set is composed of two pairs of images, a base and a composite pair (obtained by adding a non-informative context to the base pair as in Fig. 1). For each pair, we computed the Euclidean distance between the activation vectors produced by the images of the pair at each layer of the network. We then compared the Euclidean distances of the composite and the base pair and normalized the difference to range from -1 to 1, obtaining a Network Configural Effect (CE). Most of the results in the main text refer to this Euclidean-based approach, but we also repeated the same analysis using the cosine similarity metric, a measure that is independent of the vectors’ magnitude, which produced slightly different results as discussed in “Magnitude Encoding Information’’. The two metrics are outlined in detail in “Computing Network Configural Effects’’ in “General Methods’’.

A positive CE is a measure of enhanced discrimination (Configural Superiority Effect, CSE); a negative CE, of impaired discrimination (Configural Inferiority Effect, CIE). These measures can then be compared with CSEs/CIEs found in human participants assessed through reaction time (RT) recordings. While we compared humans and networks on both CIEs and CSEs, note that only CSEs are the result of Gestalt grouping, while CIEs correspond to crowding, attentional load, etc. We computed the CEs across the Convolutional and Fully Connected layers of the networks (before the non-linear operation was applied), with a particular emphasis on the output layer, as it appears to be the most appropriate for comparison with human RTs. Nevertheless, we expect CSEs to start emerging at layers earlier than the output layer, given that Gestalt grouping is widely assumed to organize the visual scene for the sake of object recognition (Biederman 1987).

We selected the models based on their historical importance, their performance on standard datasets, their biological plausibility, and their Brain-Score (see the score for each network in Appendix C), and we aimed to use a wide variety of architectures and training regimes. We used a total of 16 networks, all pre-trained on ImageNet (apart from PredNet, which was pretrained on the KITTI dataset): 5 of them are classic convolutional networks, 3 are CNNs that have a direct biological inspiration with a front-end that simulates primate V1, 4 are attention-based networks, and 4 are self-supervised networks. Amongst all the above networks, 3 have recurrent mechanisms. The human RT data were collected from two sources: Pomerantz et al. (1977) and Pomerantz & Portillo (2011). In Experiment 1, we investigated specific, low-level emergent features by generating a wide number of configurations composed of simple dot patterns, as introduced by Pomerantz & Portillo (2011) and illustrated in Fig. 2. Pomerantz & Portillo (2011) found a strong and consistent effect for three EFs: proximity, orientation, and linearity, and therefore we tested whether these same features could be used by the networks to enhance the discriminability of a base image pair.

A multitude of low-level features combine to form Gestalt perceptual grouping for more complex shapes. In Experiment 2 we used 17 sets of stimuli that, albeit still simple, are composed of combinations of line segments and are thus intended to elicit a wide set of emergent features in humans (Pomerantz et al. 1977; the whole set is shown in Fig. 10, left). Five of these sets generated high CSEs in humans, indicating a strong Gestalt grouping effect. It is assumed that the CSEs observed with these complex stimuli are the result of many low-level emergent features. In addition, another five sets generated high CIEs in humans, indicating a strong impact of a combination of attentional load, crowding, etc. Both the CSE and CIE sets are shown in the legends of Figs. 5 and 6. We generalized the results across different background conditions and transformation conditions (rotation, translation, scale, and no transformation). More details about the networks used and the analysis are presented in “General Methods’’.

Fig. 3

Amount of Configural Effect (CE) in networks and humans (human data shown in the bottom-right box, extracted from Pomerantz & Portillo 2011). In humans, proximity, linearity and orientation all produced high CSEs. Notice that each CE is computed with respect to the base pair and, for the networks, using the Euclidean-based method. Most networks not only show consistent CSEs across the three EFs, but also a pattern of responses that matches the one found in humans. Vision Transformers showed less human-like responses, and three out of four self-supervised networks presented an inferiority effect. All figures refer to the condition with black dots on randomly pixellated backgrounds. In this and the following figures, networks within one colour of rectangle are convolutional and those within the other are Vision Transformers; a normal font weight indicates supervised learning, a bold font indicates self-supervised learning, and a coloured font indicates the presence of recurrent mechanisms. In this and the following boxplot figures, each box extends from the first to the third quartile, with a horizontal line at the median. Filled boxes indicate a median > 0, hatched boxes a median < 0. The whiskers extend from the box by 1.5x the inter-quartile range. Networks are ordered according to their Brain-Score, from highest to lowest.

Experiment 1

Methods

In order to study individual Emergent Features (EFs) selectively, Pomerantz & Portillo (2011) designed an odd-discrimination task in which dot patterns were used to create a base and a composite pair. They found that Configural Superiority Effects (CSEs) were consistently exhibited for three EFs: proximity, linearity, and orientation, which respectively generated CSEs of 0.38, 0.36 and 0.22 s (bottom-right box in Fig. 3), corresponding to RT speed-ups of 11%, 29% and 26% (Pomerantz & Portillo 2011). In this experiment, a “base pair” consisted of two images with a single dot placed at different locations, whereas a “composite pair” was generated by adding one or two dots in the same location to both base pair images, in such a way as to elicit different emergent properties: orientation, proximity, or linearity (see Fig. 2). We generated 500 sets (a base pair and a composite pair for each of the three emergent features tested here). Each dot in the base configuration was constrained to be located at least 20 pixels from any other dot and 40 pixels from the border, in order to avoid border effects (Kayhan & van Gemert 2020).
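To make the stimulus-generation procedure concrete, the following is a minimal sketch of how such dot-pattern sets could be produced. It assumes a 224x224 canvas and collapses the distinction between the proximity and orientation composites (which in the original stimuli depends on where the shared context dot falls relative to the base dots); all names are ours, not taken from the original code.

```python
import numpy as np

CANVAS, MARGIN, MIN_DIST = 224, 40, 20  # pixel sizes (canvas size assumed)

def random_dot(rng, existing=()):
    """Sample a dot at least MARGIN px from the border and MIN_DIST px
    from any previously placed dot."""
    while True:
        p = rng.integers(MARGIN, CANVAS - MARGIN, size=2)
        if all(np.linalg.norm(p - q) >= MIN_DIST for q in existing):
            return p

def make_set(rng):
    """One stimulus set: a base pair (one dot at two different locations)
    plus composite pairs whose added context dots are identical across
    the two images of a pair (hence uninformative)."""
    a = random_dot(rng)
    b = random_dot(rng, existing=[a])
    base = ([a], [b])
    ctx1 = random_dot(rng, existing=[a, b])        # one added dot:
    composite = ([a, ctx1], [b, ctx1])             # proximity/orientation
    ctx2 = random_dot(rng, existing=[a, b, ctx1])  # second added dot:
    linearity = ([a, ctx1, ctx2], [b, ctx1, ctx2]) # linearity
    return base, composite, linearity

rng = np.random.default_rng(0)
sets = [make_set(rng) for _ in range(500)]  # 500 randomly generated sets
```

Each list of dot coordinates would then be rendered onto the chosen background (white, black, or randomly pixellated) before being fed to the networks.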

In addition, we employed a control condition to assess the sensitivity of the networks to the type of stimuli used in this study (simple features on a randomly pixellated background), given their markedly different appearance from the dataset used for pretraining (which consisted of natural images). To do so, we computed the similarity between a pair of randomly pixellated canvases, and a pair composed of a randomly pixellated canvas and a randomly pixellated canvas with a single dot (see the top part of Fig. 2). If the pair containing the single-dot canvas is more discriminable than the pair without the dot, this implies that the model is indeed sensitive to these types of images.

CEs in humans were measured through an “odd” discrimination task, and CSEs were established when the added uninformative dot in the composite pair elicited faster discrimination than in the base pair (where only the location of a single dot changed across the two canvases). In networks, the discriminability of each pair was calculated by presenting each image individually and comparing the internal activations of the network between the two elements of each pair. If this within-pair difference was greater for the composite pair (e.g. the “orientation” pair) than for the base pair, this would indicate a superiority effect (CSE), since the uninformative feature added to the composite pair increases the discriminability of the two canvases. Conversely, if the within-pair difference was lower, this would indicate a configural inferiority effect (CIE), where the uninformative feature added to the composite pair decreases the discriminability of the two canvases. Unless otherwise stated, the discriminability values are computed through the Euclidean-based metric detailed in “Computing Network Configural Effects’’.

Each experiment was repeated across 3 stroke-over-background conditions: white-over-black, black-over-white, and black-over-randomly pixellated background. We present the results for the latter, but we obtained very similar results across all three conditions.

Results

The results are shown in Fig. 3 for the output layer and in Fig. 4 for all other layers. At the output layer, all convolutional networks trained with supervision are sensitive to proximity and linearity, and half of them are also sensitive to orientation. These networks also showed a pattern of responses that mimicked that of human participants (see the bottom-right box in Fig. 3): the proximity feature elicited a bigger effect than linearity, and linearity a bigger effect still than orientation.

Fig. 4

Amount of Configural Effect (CE) across all layers, for each network. The control condition (blue) is the only one consistently producing superiority effects across all layers, indicating that the networks are sensitive to the dot stimuli used here. However, the EFs of proximity, orientation, and linearity produced no effect at early stages and an inferiority effect at all but the final stages of processing. Dashed vertical red lines indicate the first fully connected layer in each network. Shaded areas correspond to one standard deviation. Network labels are colour-coded as illustrated in the caption of Fig. 3

Vision Transformers were also always sensitive to proximity, but were inconsistently responsive to the other two features. On the other hand, PredNet and the DINO models showed insensitivity to the three EFs and indeed a negative effect (CIE). The very low responses obtained with these networks could be the result of a general insensitivity to the types of dot stimuli presented in this work. However, the analysis of the control condition (blue line in Fig. 4) shows that all networks used are strongly sensitive to the addition of a single dot from the middle stage of processing onwards (with the exception of some Vision Transformers across the middle layers, which showed a weak but still positive effect for the control condition). In all cases, the added dot increased discrimination across all networks. On the other hand, adding one or two dots in order to elicit the EFs of proximity, linearity, and orientation resulted in a negligible effect in early layers and a strong but negative effect in middle layers, before producing the observed CE at the late stage of processing (see Fig. 4). That is, the additional EFs that make the stimuli more discriminable for humans produced less discriminable representations in the middle stages of the networks. We consider the implications of these results in “Discussion’’. Note that when we repeated the analysis with the cosine similarity approach (as opposed to the Euclidean approach), we found much weaker effects across all networks, and indeed, sometimes the CSEs found in the output layer were reversed into CIEs (see Fig. 8 in Appendix A). This shows that the CSEs are at least partially encoded in the magnitude of the internal representations.

The results presented thus far refer to the condition with black dots over randomly pixellated backgrounds. Across the other two background conditions, the results are generally consistent, with the exception of the Vision Transformers, which showed a higher sensitivity to the background, more often producing CIEs for both white and black backgrounds, sometimes even for the proximity condition.

Fig. 5

A: Networks’ Configural Effects (CEs) at the output layer for the 5 sets producing high CSEs in humans. All supervised, convolutional networks exhibited CSEs for sets 1–3, but overall only three models showed CSEs for all stimuli. The pattern did not appear to resemble human responses (in the boxes; data extracted from Pomerantz et al. 1977). Three of the self-supervised networks presented a negative effect. B: Networks’ CEs at the output layer for the 5 sets producing high inferiority effects (CIEs) in humans. Again, the pattern of responses did not match humans’, and almost all networks exhibited a superiority effect for at least one of these stimuli. The figure refers to the condition with the randomly pixellated background and translation transformation across pairs. See the Fig. 3 caption for details about the boxplots and the colour-code used for labelling the networks

Experiment 2

Methods

In Experiment 2 we investigate whether these models show human-like Gestalt phenomena when presented with more complex stimuli taken from Pomerantz et al. (1977). We used the image pairs from the first two experiments in Pomerantz et al. (1977). Images were arranged in 17 sets, each composed of two pairs: a base and a composite pair (the full set of images, including the added context to each pair, is shown in Fig. 10, left). The sets were composed so that they could elicit a wide variety of CSEs and CIEs. Furthermore, we used 4 different transformation conditions: no transformation, translation (up to 18% of the image size), scale (0.7 to 1.3 times the original image size), and rotation (up to 360 degrees). The same random transformation was kept fixed within a set (that is, with a random translation of 20 pixels, both images of the base pair and both images of the composite pair were translated by 20 pixels). For the three conditions employing transformations, each comparison was repeated 500 times, each time with a different random transformation uniformly sampled within the boundaries defined above.
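As an illustration of how a transformation condition might be implemented, here is a sketch under our own assumptions (torchvision operating on PIL images; `stimulus_set` is a hypothetical list holding the four images of one set). One set of parameters is sampled per repetition and applied identically to every image of the set, as described above.

```python
import random
import torchvision.transforms.functional as TF

def sample_params(kind, img_size=224):
    """Sample random parameters for one transformation condition,
    within the ranges described in the text."""
    if kind == "translate":
        m = int(0.18 * img_size)            # up to 18% of the image size
        return (random.randint(-m, m), random.randint(-m, m))
    if kind == "scale":
        return random.uniform(0.7, 1.3)     # 0.7 to 1.3 times original size
    if kind == "rotate":
        return random.uniform(0.0, 360.0)   # up to 360 degrees
    return None                             # "no transformation" condition

def apply_transform(img, kind, params):
    if kind == "translate":
        return TF.affine(img, angle=0.0, translate=list(params), scale=1.0, shear=[0.0])
    if kind == "scale":
        return TF.affine(img, angle=0.0, translate=[0, 0], scale=params, shear=[0.0])
    if kind == "rotate":
        return TF.rotate(img, params)
    return img

# 500 repetitions, each with fresh random parameters shared within a set
for _ in range(500):
    params = sample_params("translate")
    transformed = [apply_transform(img, "translate", params) for img in stimulus_set]
    # ...present each transformed image to the network and record activations
```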

Fig. 6

Amount of Configural Effect (CE) across all layers, for each network. Similar to the plot in Experiment 1, most networks exhibited limited sensitivity to Gestalt grouping at early layers, and a mostly negative sensitivity during middle stages, until the output layer. This is in contrast to what is expected in human participants (see “Discussion’’). Notable exceptions were AlexNet and, partially, VGG19. This could be connected to the fact that these networks have fully connected layers at their middle-to-late processing stages, and not just at the output layer like all other convolutional networks. Network labels are colour-coded as explained in the caption of Fig. 3

Results

Of the 17 sets of stimuli used by Pomerantz et al. (1977), we first focus on the 5 sets that showed the largest CSEs in humans, corresponding to speed-ups of between 40% and 180% (Fig. 5A; the sets are shown in the legend). The corresponding human CSEs are shown in the bottom-right box in Fig. 5A. When analysing the output layer, sets 1, 2 and 3 indeed produced positive CSEs for all supervised networks (both convolutional and Transformers). By contrast, sets 4 and 5 were positive only for InceptionNet, CORnet-S, and VOneCORnet-S. In contrast with Experiment 1, the relationship amongst the CSEs did not resemble that found in humans (for example, in almost all cases, set 2 elicited a stronger effect than set 1). Similarly to the results of Experiment 1, three out of four self-supervised networks presented a consistent negative effect for all sets (the exception being SimCLR, in both experiments). We then analysed the 5 sets that showed the largest CIEs in humans (Fig. 5B). We found that not only did the pattern of responses fail to resemble human RTs, but almost all networks showed a superiority effect for at least one of these sets.

The analysis across all networks layers (Fig. 6) confirmed the pattern observed in Experiment 1, with negligible effects in early layers, and a negative effect in middle layers.

We observed that the two convolutional networks that possessed fully connected layers in addition to the output (AlexNet and VGG19) demonstrated a CSE across all fully connected layers, suggesting that only this type of layer, and not a convolutional one, may exhibit CSEs. We further explore this point, and its connection to studies of Gestalt effects using human neural data, in the “Discussion’’.

We extended the analysis to all 17 sets of stimuli by plotting human CEs against network CEs in Fig. 10 in Appendix B. To quantify whether human and network CEs are correlated, we computed Kendall’s tau correlation coefficient between them. The relation was always positive, apart from the PredNet and DINO ViT-S/8 models, but in no case was it significant at \(p<0.01\).
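In code, this correlation check amounts to little more than the following (a sketch; `human_ces` and `network_ces` are hypothetical arrays holding one CE value per stimulus set):

```python
from scipy.stats import kendalltau

# human_ces, network_ces: one Configural Effect value per stimulus set
tau, p_value = kendalltau(human_ces, network_ces)
print(f"Kendall's tau = {tau:.3f}, p = {p_value:.4f}")  # significance assessed at p < 0.01
```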

The results presented thus far refer to the condition with a randomly pixellated background and translation transformation. We obtained similar results with the other two background conditions (white and black) and transformation conditions (scale, rotation, and no transformation). When repeating the analysis with the cosine similarity metric, we again found an overall reduction of the effect across all layers, which resulted in a larger number of CIEs.

Fig. 7

Testing the contribution of shape familiarity on sets 1 and 2 by changing the contextual image in order to generate unfamiliar composite pairs. Only networks exhibiting CSEs for these sets were tested. The modified contextual image is shown with a red border. Networks did not seem to be affected by the use of an unfamiliar shape (with the only exception of SimCLR for set 2). Therefore, the CSEs found with these shapes are the result of other features, such as closedness. See Fig. 3 for details about the boxplots and the colour-code used for labelling the networks

The Effect of Familiarity

We investigated whether the CSEs consistently found across some sets of items were the result of familiarity rather than emergent features. That is, it is possible that the networks acquired a sensitivity to some basic shapes as a consequence of being trained on ImageNet, and that this makes these shapes more salient and discriminable when compared to different shapes or non-shapes. For example, in sets 1 and 2, adding a non-informative stimulus turned the diagonal lines into a familiar geometrical shape (a triangle). In humans, the increased discriminability that results in strong CSEs is assumed to be the result of several emergent features (symmetry, closedness; Pomerantz et al. 1977). On the other hand, it is possible that the networks are sensitive to these images because they have experienced these shapes during training.

We provided an initial test of this hypothesis by modifying sets 1 and 2 (the sets whose composite pairs included a triangular shape) so that they no longer contained familiar shapes. We re-ran the analysis only for those networks that exhibited CSEs for sets 1 and 2. The results (Fig. 7) indicated that most networks still exhibited the same degree of Gestalt grouping effects for sets 1 and 2 (the only exception being SimCLR, which seemed to be sensitive to the particular triangular shape presented in set 2). Therefore, we can reject the hypothesis that the networks were simply responding to the familiar shape; they may instead be sensitive to other features such as closedness.

Overall, Experiment 2 produced mixed results. Only a subset of the stimulus sets that produced large CSEs in humans evoked high CSEs in networks, and almost uniquely at the final layer. In no case did the pattern of CSEs across stimuli resemble that produced by humans. Furthermore, networks sometimes showed CSEs for stimuli where humans most strongly produced CIEs. On the other hand, we verified that the networks were not simply basing their responses on familiarity.

Discussion

We will now elaborate on the extent to which our experiments can answer the questions raised in the “Introduction’’ and then discuss how our findings relate to previous work assessing Gestalt organizational principles in DNNs.

Do Modern DNNs Exhibit Human-Like Gestalt Grouping?

Overall, we found mixed support for the hypothesis that DNNs support Gestalt grouping in a human-like fashion. It is indeed remarkable that most convolutional networks trained with supervision did show CSEs for proximity, linearity, and (less frequently) orientation (Experiment 1), and furthermore, that the size of the effects mirrored human performance. However, for all 16 networks, the sensitivity to these properties at all layers earlier than the output layer was either negligible or reversed. Furthermore, the networks did not show a consistent pattern of responses for complex stimuli: in Experiment 2, only three of the five sets of stimuli that produce CSEs in humans gave rise to consistent CSEs in convolutional networks, and again only at the output layer; the other two showed either a weak effect or the opposite, an inferiority effect. Furthermore, in this experiment, the pattern of CSEs across different sets did not resemble that observed in human responses. EFs are powerful Gestalt properties that are not only subjectively compelling (see Fig. 1) but that support fast RTs in participants, similar to other low-level visual features that “pop out” (Treisman 1998). The essential role of EFs is reflected in their importance in figure-ground segregation (Wagemans et al. 2012a) as well as in building representations of object parts through the perceptual generation of non-accidental properties (Biederman 1987; Kubilius et al. 2017). Therefore, it would be expected that EFs emerge in the human visual stream earlier than the stage at which objects are classified. In fact, research has suggested that Gestalt phenomena do not emerge at low-level retinotopic areas, but rather in shape-selective regions such as the lateral occipital complex (LOC) (Kubilius et al. 2011), even though their effect has neural correlates as early as V1 (Fox et al. 2017). Note that LOC corresponds to an earlier processing stage than object identification in the inferotemporal cortex (IT). This is in striking contrast with what we found here, where the output layer, often associated with IT, was in most cases the only layer exhibiting any Gestalt effect. Note that we do expect to find CSEs in the output layer, but the lack of CSEs (and instead, the strong CIEs) in all earlier layers points to a clear discrepancy between network and human visual processing.

It is possible that the crucial property of the output layer that allows the emergence of CSEs is that it is a fully connected (FC) layer (as opposed to a convolutional layer). In the current work, amongst the convolutional networks, only AlexNet and VGG19 had multiple FC layers. Interestingly, AlexNet did seem to produce CSEs even at “internal” FC layers in both experiments, and VGG19 did so for sets 1–3 in Experiment 2, but generally failed to produce CSEs at all in Experiment 1, even at the output layer (in fact, it was the worst performing convolutional network in that experiment).

This hints at the possibility that Gestalt effects in DNNs can emerge prior to the output layer, and that the reason effects were restricted to the output layer in most models is that there was only one FC layer. Why fully connected layers but not convolutional layers support some CSEs is not currently clear. But it suggests that the architectures of most current DNNs are incompatible with the Gestalt organizational principles observed in humans. Furthermore, we noticed that even though the layers taken into account during the analysis of the Vision Transformers were all FC, these networks did not produce Gestalt grouping consistently, which might be due to the fundamentally different mechanism they use.

Differences Amongst Networks

When considering only the output layer, we found a clear difference between convolutional, supervised networks (which appeared to be the most sensitive to some basic properties used by humans, such as proximity, linearity, and orientation) and self-supervised networks (which often produced the opposite effect to humans). Vision Transformers appeared to be somewhere in between, producing weak and inconsistent human-like effects. Networks possessing a recurrent mechanism did not appear to perform differently than the other networks. Amongst the self-supervised networks, it is surprising that PredNet (a self-supervised, convolutional, and recurrent architecture), which has been shown to be sensitive to several types of illusions (Watanabe et al. 2018; Lotter et al. 2020), performed extremely poorly and in fact always showed inferiority effects. Interestingly, this model also scored poorly on Brain-Score (see the Table in Appendix C). On the other hand, SimCLR, amongst the lowest scoring models used here on the Brain-Score benchmark, did show sensitivity to proximity and linearity, and some CSEs for sets 2, 4, and 5 in Experiment 2. Overall, when considering the output layer, we see a clear distinction between networks that were trained with supervision and possessed a convolutional architecture, which showed more human-like grouping effects, and all other models. However, when considering intermediate layers of processing, the picture was much more homogeneous: most networks did not show any difference in discriminability between composite and base pairs during early processing, and the two pairs became even less discriminable at the middle stages of processing (resulting in CIEs). Almost all networks presented an inferiority effect regardless of architecture, training regime, and Brain-Score.

Magnitude Encoding Information

We observed a small but important difference between the results computed with the normalized Euclidean distance and with cosine similarity (a measure independent of the magnitude of the internal activations). The effects when measured through the latter were strongly reduced and sometimes reversed (that is, a response appearing as a CSE with the Euclidean distance became a CIE with cosine similarity). We believe the Euclidean distance to be the better metric for two reasons. First, the magnitude of the internal activations is used in the networks’ internal computations, through non-linearities such as ReLU. Second, even if the classification itself is commonly computed by disregarding the magnitude of the last layer’s activations (through argmax), it is nevertheless important to establish that the information is present and could potentially be used if additional layers were appended.

Acquisition of Gestalt Grouping Principles

The degree to which some basic grouping properties can be learnt has been a controversial topic for many decades (see “Introduction’’). Whether or not DNNs are good models of the human visual system, they are excellent at extracting statistical regularities from a dataset, and therefore provide a test of whether some grouping phenomena are implicitly encoded within the statistics of a particular dataset.

The finding that the networks only showed the right pattern of CSEs for low-level features (Experiment 1) suggests that the simplified training environment provided by 2D images might be insufficient for acquiring more complex Gestalt principles, or for combining the low-level features in a human-like way. Still, it is possible that using a more realistic dataset (e.g. a 3D environment), or more naturalistic training setups (e.g. with a reward signal, as in the reinforcement learning approach), could result in the acquisition of a wide array of grouping principles, but this is pure speculation for now. On the other hand, we also note that PredNet was trained on sequences of realistic images (as opposed to the static images used for all other networks), and it was one of the models most dissimilar to humans in terms of showing any type of Gestalt grouping.

It is also interesting to note that Gestalt grouping has been obtained in artificial networks that have very different architectures than DNNs (Grossberg et al. 1997; Francis et al. 2017; Herzog et al. 2003). In other cases, architectural modifications have been added to DNNs to solve tasks that appear to underlie Gestalt grouping (Linsley et al. 2018). Highly specific and unnatural training setups (e.g. training on a checkerboard-like pattern) have also resulted in the emergence of some configural properties in convolutional DNNs (Keshvari et al. 2021). This suggests that Gestalt principles might be obtained only with an appropriate architecture or with highly specific training, and are not simply the result of statistical feature extraction from the retinal images.

Conclusion

Overall, the results presented here highlight a partial disconnect between DNNs trained on natural 2D images and human vision. Testing 16 networks that varied across architectures and training setups on stimuli that elicit strong Gestalt grouping in humans (measured as increased discrimination in an “oddity reaction time task”), we found that the models almost universally showed no grouping effect across early layers, and a negative effect (that is, decreased discrimination) across middle and late stages of processing up to the output layer. This is in contrast to the human visual processing system, where the effect of emergent features has been observed at intermediate stages of processing. At the output layer, only convolutional networks trained with supervision consistently acquired low-level visual features that elicited grouping effects strongly resembling those produced by humans. At the same time, the sensitivity to low-level grouping features in trained networks did not fully transfer to more complex stimuli, which instead produced grouping effects that only inconsistently mimicked human behaviour and sometimes deviated strongly from it. Although our findings highlight the limitations of current DNNs in capturing Gestalt effects, our results do suggest the possibility that human-like Gestalt effects may emerge in response to training in some network architectures. More generally, this work highlights the importance of comparing network performance with well-established psychological phenomena, which have been largely ignored when comparing DNNs to the human brain.

General Methods

Networks Used

We selected 16 networks based on their historical importance, their performance on standard datasets, and their biological plausibility. We used as a point of reference the Brain-Score value (Schrimpf et al. 2018), indicating the amount of variance explained by the model across several benchmarks. AlexNet (Krizhevsky et al. 2012), VGG19 (Simonyan & Zisserman 2014), and ResNet (He et al. 2016) (we used ResNet-152) are classic networks that have often been tested on several cognitive phenomena (Schrimpf et al. 2018; Baker et al. 2018a; Biscione & Bowers 2022), with mixed results. InceptionNet (Szegedy et al. 2015) was shown by Kim et al. (2021) to be sensitive to the effect of Gestalt closure and thus seemed suited to this battery of tests (we used InceptionNet V3). In DenseNet (Huang et al. 2017) each convolutional layer is connected with every other layer. A smaller member of this family of networks, which we used (DenseNet-201), has been shown to possess human-like translation invariance (Biscione & Bowers 2021).

We also tested two model families specifically developed to be biologically plausible and to provide a good match with primate neural data. The “CORnet” model family (Kubilius et al. 2019) aimed to incrementally build a network architecture by adding recurrent and skip connections while monitoring both classification accuracy and agreement with a body of primate brain neural data. From this family, CORnet-S was selected as the best CORnet architecture.

The VOneNet family (Dapello et al. 2020) was developed to better match the structure of the primate visual cortex. Each VOneNet contains a fixed-weight neural network front-end that simulates primate V1, called the VOneBlock, followed by a neural network back-end adapted from current CNN vision models. We used two versions of VOneNet: one with a CORnet-S backend (VOneCORnet-S), and the other with a ResNet50 backend (VOneResnet50).

Vision Transformers (Mehrer et al. 2021) are an attention-based family of models which achieve high accuracy on vision tasks and have been found to produce error patterns more consistent with those of humans (Tuli et al. 2021). We used ViT-B/16, ViT-B/32, ViT-L/16, and ViT-L/32, where B/L indicates either a “base” or a “large” model, and the number indicates the patch size.

Self-supervised models aim at building meaningful representations from unlabelled data that can be used for solving downstream tasks, and the deep contrastive embedding they use has been suggested as a biologically plausible computational theory of the primate visual system (Zhuang et al. 2020). We used four self-supervised networks: DINO ViT-S/8, DINO ViT-B/8 (Caron et al. 2021), SimCLR-ResNet18 (Chen et al. 2020), and PredNet (Lotter et al. 2018); the name after the hyphen or space indicates the network’s backbone. PredNet is particularly interesting for this work as it is inspired by the predictive coding theory in the neuroscience literature (Rao & Ballard 1999) and can mimic several effects of visual perception, including illusory contours, the flash-lag effect (Lotter et al. 2017), and illusory motion (Watanabe et al. 2018).

VOneResnet50, DenseNet, ResNet-152 and VGG19 are in the top 10 on Brain-Score at the time of writing, all with a score higher than 0.4 (the highest scoring network, with 0.465, was a version of EffNet which we were not able to obtain). Amongst the convolutional, supervised models, AlexNet scored the lowest with 0.38. In spite of their success on vision tasks, Vision Transformers do not score particularly highly on Brain-Score (ViT-B/32 scored the highest with 0.355; all other ViTs scored between 0.16 and 0.2). Similarly, SimCLR’s score is about as high as AlexNet’s (\(\sim 0.38\)). Surprisingly, PredNet is the lowest scoring model amongst those tested here (0.195). All values are shown in Appendix C.

All networks were pretrained on ImageNet. When feeding the images into the networks, we first resized them to the size used during each network’s pretraining (that is, 224x224 for all networks except InceptionNet, which used 299x299). Furthermore, all images were normalized with the mean and standard deviation used during ImageNet pretraining. We analysed the activations of all Convolutional and Fully Connected layers before the non-linearity operation was applied.
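A minimal sketch of this preprocessing and layer-recording pipeline, assuming PyTorch/torchvision and using AlexNet as an example (the hook-based extraction is our reconstruction, not the original code):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Preprocessing matching ImageNet pretraining: resize and normalize
preprocess = T.Compose([
    T.Resize((224, 224)),   # 299x299 for InceptionNet
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.alexnet(weights="IMAGENET1K_V1").eval()
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # The Conv/Linear output is recorded before any subsequent ReLU
        # module runs, i.e. before the non-linearity is applied; clone()
        # protects the copy from in-place ReLUs
        activations[name] = output.detach().flatten().clone()
    return hook

for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(preprocess(img).unsqueeze(0))  # img: one PIL stimulus image
# 'activations' now maps each layer name to a flattened activation vector
```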

Computing Network Configural Effects

The main difficulty in comparing Pomerantz’s behavioural results with neural networks is that the behavioural results were based on RTs, which DNNs do not produce (there are some exceptions, but the models commonly scoring high on Brain-Score do not possess this feature). However, since we have direct access to the models’ internal representations, we can nevertheless obtain a measure of stimulus discriminability. We presented a pair of images to the network; for each image, we recorded the activation value of every unit of a given layer, obtaining an activation vector for each image and each layer. The “distance” between the two activation vectors then corresponds to a measure of discriminability for the image pair at a particular layer. We compared these measures to human RTs: a high distance corresponds to high discriminability, which corresponds to fast RTs, and vice versa.

We used two different ways of computing the distance between the activation vectors \(d^{l}({\textbf {x}})\) (where \(l\) indicates a specific layer, and \({\textbf {x}}\) is an input image): a Euclidean-based method:

$$\begin{aligned} D^{l}({\textbf {a}}, {\textbf {b}}) = \left\| d^{l}({\textbf {a}}) - d^{l}({\textbf {b}}) \right\| , \end{aligned}$$

and a cosine similarity metric, which is invariant to the magnitude of the internal activations:

$$\begin{aligned} C^{l}({\textbf {a}}, {\textbf {b}}) = \frac{d^{l}({\textbf {a}}) \cdot d^{l}({\textbf {b}})}{\left\| d^{l}({\textbf {a}}) \right\| \left\| d^{l}({\textbf {b}}) \right\| }. \end{aligned}$$

We refer to the general difference between composite and base pairs as the Configural Effect (CE). Using the same approach outlined in Pomerantz & Portillo (2011), we obtained the networks’ Configural Superiority Effects (CSEs) and Configural Inferiority Effects (CIEs) by computing, for each distance metric, the difference across the two pairs of stimuli: a base and a composite pair. The composite pair is obtained by adding a non-informative feature to each image of the base pair. For humans, the CE is simply computed as \(RT_{base} - RT_{composite}\), with positive values indicating a CSE and negative values a CIE.

For networks, CEs with the Euclidean distance can be computed as follows (Jacob et al. 2021):

$$\begin{aligned} NetworkCE = \frac{D^{l}(composite_{a}, composite_{b}) - D^{l}(base_{a}, base_{b})}{D^{l}(composite_{a}, composite_{b}) + D^{l}(base_{a}, base_{b})} \end{aligned}$$


Since \(D^{l}\) is a measure of dissimilarity, a \(D^{l}\) higher for the composite pair than for the base pair indicates that the uninformative feature added to the composite pair made the two images more dissimilar from one another (that is, more discriminable), which corresponds to a CSE.

Fig. 8

Results from Experiment 1 with the cosine similarity approach, as opposed to the Euclidean-based method presented in the main text. Amount of Configural Effect (CE) in networks and humans (bottom-right box, data extracted from Pomerantz & Portillo 2011). Even though the pattern of responses is similar across humans and networks (that is, the order from the most to the least CE is proximity, linearity, and orientation), DNNs often exhibited CIEs, in contrast with the Euclidean-based approach with which they expressed CSEs more frequently, especially the convolutional architectures. No network but SimCLR showed a CSE for orientation, and many networks failed to show a CSE for linearity

Alternatively, we can compute the CE using cosine similarity:

$$\begin{aligned} NetworkCE = C^{l}(base_{a}, base_{b}) - C^{l}(composite_{a}, composite_{b}). \end{aligned}$$

Since cosine similarity is a measure of similarity rather than distance, a base pair that is more similar than the composite pair will again indicate a CSE. In both cases, a positive NetworkCE indicates a CSE, and a negative NetworkCE indicates a CIE.
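Putting the two metrics together, the per-layer CE computation could be sketched as follows (activation vectors obtained as in “Networks Used’’; the function names are ours):

```python
import torch
import torch.nn.functional as F

def network_ce_euclidean(base_a, base_b, comp_a, comp_b):
    """Normalized Euclidean CE: positive values indicate a CSE,
    negative values a CIE. Inputs are activation vectors d^l(x)."""
    d_base = torch.norm(base_a - base_b)
    d_comp = torch.norm(comp_a - comp_b)
    return ((d_comp - d_base) / (d_comp + d_base)).item()

def network_ce_cosine(base_a, base_b, comp_a, comp_b):
    """Cosine-based CE: similarity of the base pair minus that of the
    composite pair (invariant to activation magnitude)."""
    c_base = F.cosine_similarity(base_a, base_b, dim=0)
    c_comp = F.cosine_similarity(comp_a, comp_b, dim=0)
    return (c_base - c_comp).item()
```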

Note that the Euclidean and cosine similarity approaches to computing the CE have different scales, which in turn differ from the scale of human CEs computed through RTs.

We computed these metrics for each layer in the networks, in the same order in which the layers are traversed during the feedforward pass. That is, for the networks in this work using recurrence (CORnet-S, VOneCORnet-S, PredNet), we computed the distance at the recurrent layers multiple times, once for each time a layer was traversed, and in the same order in which its activations were computed.