Benchmarking and Analysis of Unsupervised Object Segmentation from Real-world Single Images

In this paper, we study the problem of unsupervised object segmentation from single images. We do not introduce a new algorithm, but systematically investigate the effectiveness of existing unsupervised models on challenging real-world images. We first introduce seven complexity factors to quantitatively measure the distributions of background and foreground object biases in appearance and geometry for datasets with human annotations. With the aid of these factors, we empirically find that, not surprisingly, existing unsupervised models fail to segment generic objects in real-world images, although they can easily achieve excellent performance on numerous simple synthetic datasets, due to the vast gap in objectness biases between synthetic and real images. By conducting extensive experiments on multiple groups of ablated real-world datasets, we ultimately find that the key factors underlying the failure of existing unsupervised models on real-world images are the challenging distributions of background and foreground object biases in appearance and geometry. Because of this, the inductive biases introduced in existing unsupervised models can hardly capture the diverse object distributions. Our research results suggest that future work should exploit more explicit objectness biases in the network design.


Introduction
The ability to automatically identify individual objects from complex visual observations is a central aspect of human intelligence (Spelke et al., 1992). It serves as a key building block for higher-level cognition tasks such as planning and reasoning (Greff et al., 2020). In recent years, a plethora of models have been proposed to segment objects from single static images in an unsupervised fashion: from the early AIR (Eslami et al., 2016) and MONet (Burgess et al., 2019) to the recent SPACE (Lin et al., 2020), SlotAtt (Locatello et al., 2020), GENESIS-V2 (Engelcke et al., 2021), etc. These models jointly learn to represent and segment multiple objects from a single image, without needing any human annotations during training. This process is often called perceptual grouping/binding or object-centric learning. These methods and their variants have achieved impressive segmentation results on numerous synthetic datasets such as dSprites (Matthey et al., 2017) and CLEVR (Johnson et al., 2017). Such advances come with great expectations that unsupervised techniques will likely close the gap with fully-supervised methods for real-world visual understanding. However, little work has systematically investigated the true potential of the emerging unsupervised object segmentation models on complex real-world images such as the COCO dataset (Lin et al., 2014). This naturally raises an essential question: Is it promising or even possible to segment generic objects from real-world single images using (existing) unsupervised methods?
What is an object? Answering the above question involves another fundamental question: what is an object? Exactly 100 years ago in Gestalt psychology, Wertheimer (Wertheimer, 1923) first introduced a set of perceptual grouping and figure-ground organization principles to heuristically define visual data as individual objects and/or backgrounds.

Fig. 1: SlotAtt can perfectly segment simple objects on three synthetic images (left-hand side) but clearly fails on three real-world images (right-hand side).

Perceptual grouping principles such as proximity and similarity specify how foreground visual elements are grouped into individual objects, whereas figure-ground organization principles investigate how visual data are separated into foreground against background, e.g., a relatively small area tends to be seen as foreground (Wagemans et al., 2012). However, these principles are highly subjective, whilst real-world scenes and objects are far more complex, with extremely diverse appearances and shapes. Therefore, it is practically impossible to quantitatively define what an object is, i.e., the objectness, from visual inputs (e.g., a set of image pixels). Nevertheless, to thoroughly understand whether unsupervised methods can truly learn objectness akin to the psychological process of humans, it is vital to investigate the underlying factors that potentially facilitate or otherwise hinder the ability of unsupervised models. In this regard, by drawing on Gestalt principles, we instead define a series of new factors to quantitatively measure the complexity of individual foreground objects and the background in Section 2. By taking into account both appearance and geometry, our complexity factors explicitly assess the difficulty of segmenting single objects and the background. For example, it is harder for unsupervised methods to segment a chair with colorful textures from a cluttered background than a single-color ball from a clean background. With the aid of these factors, we extensively study whether and how existing unsupervised models can discover objects in Section 4.
What is the problem of unsupervised object segmentation from single images? A large number of models (Yuan et al., 2022) aim to tackle the problem of unsupervised object segmentation from single images. They share several key problem settings: 1) all training images do not have any human annotations; 2) every single image has multiple objects, optionally with a textured background; 3) each image is treated as a static data point without any dynamic or temporal information; 4) all models are trained from scratch without requiring any networks pretrained on additional datasets. Ultimately, the goal of these models is to segment all individual objects as accurately as the ground truth human annotations. In this paper, we regard these settings as the basic and necessary part of unsupervised object segmentation from single images, and empirically evaluate how well existing models can perform on real-world images.
Contributions and findings. This paper addresses the essential question regarding the potential of unsupervised segmentation of generic objects from real-world single images. Our contributions are:
- We first introduce 4 complexity factors to quantitatively measure the difficulty of individual objects and 3 factors to quantify backgrounds. These factors are key to investigating the true potential of existing unsupervised models.
- We extensively evaluate current unsupervised approaches in a large-scale experimental study. We implement 4 representative unsupervised methods and train more than 200 models on 4 groups of curated datasets from scratch.
The datasets, code and pretrained models are available at https://github.com/vLAR-group/UnsupObjSeg
- We analyze our experimental results and find that: 1) existing unsupervised object segmentation models cannot discover generic objects from single real-world images, although they can achieve outstanding performance on synthetic datasets, as qualitatively illustrated in Figure 1; 2) the challenging distributions of both foreground object and background biases in appearance and geometry in real-world images are the key factors incurring the failure of existing models; 3) the inductive biases introduced in existing unsupervised models are fundamentally mismatched with the objectness biases exhibited in real-world images, and they therefore fail to discover the real objectness.
Related Work. Recently, ClevrTex (Karazija et al., 2021) and the concurrent work (Papa et al., 2022) also study unsupervised object segmentation on single images. Through evaluation on (complex) synthetic datasets only, both works focus on benchmarking the effectiveness of particular network designs of baselines. By comparison, our paper aims to explore how the objectness distribution gaps between synthetic and real-world images incur the failure of existing models. The recent work by (Weis et al., 2021), which investigates video object discovery, is orthogonal to ours as motion signals do not exist in single images.
Scope of this research. In addition to our core study on purely unsupervised object segmentation approaches for single images, we also include discussions of more recent methods such as Odin (Hénaff et al., 2022), DINOSAUR (Seitzer et al., 2022), FreeSOLO (Wang et al., 2022), and CutLER (Wang et al., 2023) that require models pretrained on monolithic object images such as ImageNet (Russakovsky et al., 2015) to obtain objectness biases, in Section 4.7. However, this paper does not investigate object segmentation from saliency maps (Wang et al., 2021), static multi-views (Yuan et al., 2022), or dynamic videos (Singh et al., 2022a,b), because the input information of these methods is clearly different from that of single images.
A preliminary version of this work was published in (Yang and Yang, 2022), and our new technical extensions include: 1) 3 additional complexity factors to quantitatively measure the difficulty of the background in each image in Sec-

Complexity Factors
As illustrated in the three left images of Figure 2, an individual object, represented by a set of color pixels within a mask, can vary significantly given different types of appearance and geometric shape. A specific scene, represented by a set of objects placed on a clean canvas, can also differ vastly given different relative appearances and geometric layouts between objects, as illustrated in the middle images. If these objects are instead placed on diverse backgrounds, they tend to appear even more different, as shown in the right images.
Unarguably, such variations and complexity of appearance and geometry at the object, scene, and background levels directly affect humans' ability to precisely separate all objects. Naturally, the performance of unsupervised segmentation models is also expected to be influenced by these variations. In this regard, we carefully define the following three groups of factors to quantitatively describe the complexity of visual scenes in datasets.

Object-level Complexity Factors
All the information of an object can be described by its appearance and geometry. Therefore, we define the two factors below to measure the complexity of appearance and geometry respectively. Notably, both factors are invariant to the object scale.

Fig. 3: Illustrations for Object Color Gradient (object image, grayscale image, gradient, inner gradient).
Object Color Gradient: This factor calculates how frequently the appearance changes within the object mask. As shown in Figure 3, given the RGB image and mask of an object, we first convert the RGB image into grayscale and then apply a Sobel filter (Sobel and Feldman, 1973) to compute the horizontal and vertical gradients for each pixel within the mask. The final gradient value is obtained by averaging over all object pixels. Note that the object boundary pixels are removed to avoid interference from the background.
Numerically, the higher this factor, the more complex the texture and/or lighting effect of the object, and therefore the harder it is likely to be to segment.
Object Shape Concavity: This factor evaluates how irregular the object boundary is. As shown in Figure 4, given an object (binary) mask, denoted as M_obj ∈ R^{H×W}, we first find the smallest convex polygon mask M_cvx ∈ R^{H×W} that surrounds the object mask using an existing algorithm (Eddins, 2011), and then compute the object shape concavity value as 1 − |M_obj| / |M_cvx|, where |·| denotes the mask area. Clearly, the higher this factor, the more irregular the object shape, and the trickier segmentation becomes.
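A rasterised sketch of this factor is below, where the convex hull is filled by testing every pixel centre against the hull edges; the referenced algorithm (Eddins, 2011) is not reproduced here, so this NumPy version is only an approximation with hypothetical names:

```python
import numpy as np

def _hull(points):
    """Andrew's monotone chain convex hull; points: iterable of (x, y)."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    pts = sorted(map(tuple, points))
    def chain(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = chain(pts), chain(pts[::-1])
    return np.array(lower[:-1] + upper[:-1], dtype=np.float64)

def shape_concavity(mask):
    """1 - |M_obj| / |M_cvx|: fraction of the convex hull NOT covered by
    the object. mask: (H, W) bool. The hull is rasterised by testing every
    pixel centre against each hull edge (half-plane tests)."""
    ys, xs = np.nonzero(mask)
    hull = _hull(np.stack([xs, ys], axis=1))
    if len(hull) < 3:                       # degenerate: point or line
        return 0.0
    H, W = mask.shape
    gx, gy = np.meshgrid(np.arange(W), np.arange(H))
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1).astype(np.float64)
    inside = np.ones(len(grid), dtype=bool)
    for a, b in zip(hull, np.roll(hull, -1, axis=0)):
        edge, rel = b - a, grid - a
        inside &= edge[0] * rel[:, 1] - edge[1] * rel[:, 0] >= -1e-9
    return 1.0 - mask.sum() / inside.sum()
```

A convex mask (e.g. a filled rectangle) scores 0, while an L-shaped mask scores above 0 because its hull covers pixels the object does not.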

Scene-level Complexity Factors
Given an image, in addition to the object-level complexity, the spatial and appearance relationships between all objects can also incur extra difficulty for segmentation. We define the following two factors to quantify the complexity of relative appearance and geometry between objects in an image.
Inter-object Color Similarity: This factor assesses the appearance similarity between all objects in the same image. As shown in Figure 5, we first calculate the average color of each object, and then compute the pairwise Euclidean distances between object colors, obtaining a K × K matrix where K is the number of objects. The average color distance is calculated by averaging the matrix excluding the diagonal entries, and the final inter-object color similarity is computed as: 1 − average color distance / (255 × √3). Intuitively, the higher this factor, the more similar all objects appear, the less distinctive each object is, and the harder it is to separate each object.
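Under the definition above, the factor could be computed roughly as follows (a NumPy sketch; the function name is ours):

```python
import numpy as np

def inter_object_color_similarity(rgb, masks):
    """rgb: (H, W, 3) uint8; masks: list of (H, W) bool masks, one per object.
    Mean colour per object -> pairwise Euclidean distances -> off-diagonal
    average, normalised by the largest possible RGB distance 255 * sqrt(3)."""
    means = np.stack([rgb[m].astype(np.float64).mean(axis=0) for m in masks])
    dist = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
    K = len(masks)
    avg = dist[~np.eye(K, dtype=bool)].mean()   # exclude self-distances
    return 1.0 - avg / (255.0 * np.sqrt(3))
```

A pure-black and a pure-white object give a similarity near 0, while identically colored objects give exactly 1.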
Inter-object Shape Variation: This factor measures the relative geometry diversity between all K objects in an image. We first find the diagonal vector of each object's bounding box. Then we calculate the differences between each pair of diagonal vectors, resulting in K(K−1)/2 vectors (denoted as black vectors in Figure 7). The final inter-object shape variation is the averaged norm of these K(K−1)/2 vectors. The higher this factor, the more diverse and imbalanced in size the objects within an image are, and therefore segmenting both gigantic and tiny objects is likely harder.
By capturing the appearance and geometry of objects at both the object and scene levels, the four factors in Sections 2.1 & 2.2 are designed to quantify image complexity with clean backgrounds only. For illustration, Figure 6 shows sample images for the four factors at different values. The higher the values, the more complex the foreground objects are at both the object and scene levels.
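This factor could be sketched as follows, assuming axis-aligned bounding boxes measured in pixels (a NumPy sketch; the function name is ours):

```python
import numpy as np
from itertools import combinations

def inter_object_shape_variation(masks):
    """Average norm of the pairwise differences between bounding-box
    diagonal vectors, over the K(K-1)/2 object pairs."""
    diags = []
    for m in masks:
        ys, xs = np.nonzero(m)
        # Diagonal vector of the axis-aligned bounding box: (width, height).
        diags.append(np.array([xs.max() - xs.min() + 1,
                               ys.max() - ys.min() + 1], dtype=np.float64))
    return float(np.mean([np.linalg.norm(a - b)
                          for a, b in combinations(diags, 2)]))
```

Equally sized boxes give 0; a 2×2 box against a 5×5 box gives 3√2.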
In fact, these factors are carefully selected from more than 10 candidates because they are empirically more suitable to differentiate the gaps between synthetic and real-world images, and they eventually serve as key indicators to diagnose existing unsupervised models in Section 4. Details of other candidates are in the appendix.

Background Complexity Factors
Since the diversity of image backgrounds also plays a critical role in distinguishing or confusing foreground objects, we introduce the following three factors to measure the complexity of backgrounds. The background here is defined as the collection of all pixels which do not belong to any considered object in a single image.
Background Color Gradient: This factor measures how frequently the appearance changes within background pixels. The calculation is the same as for Object Color Gradient, as illustrated in Figure 8. The background boundary is also removed to avoid interference from foreground objects. The higher this factor, the more complex the texture and/or lighting of the background, and therefore the more difficult both the background and the objects may be to segment.

Fig. 8: Illustrations for Background Color Gradient (background image, grayscale image, background boundary, background gradient, background inner gradient).
Background-Foreground Color Similarity: This factor measures the appearance similarity between background and foreground pixels in an image. Specifically, we first calculate the Euclidean distance between the RGB value of each background pixel and that of each foreground pixel, resulting in a U × V matrix where U and V represent the number of pixels in the background and foreground respectively. We then treat it as a cost matrix on which the Hungarian algorithm is applied to obtain an optimal matching cost between all background and foreground pixels. The final background-foreground color similarity is computed as: 1 − cost value. Intuitively, the higher this factor, the more similar the background and foreground appear, and the harder it is to separate them. Details are in the appendix.
Background Shape Irregularity: This factor measures the irregularity of the background shape, i.e., the full contour where the background intersects with all foreground objects. As illustrated in Figure 9, given the (binary) background mask, we first divide its contour into subcontours, where each subcontour is fully self-connected and encloses a region. In fact, that region may originally contain one or more foreground objects, which means that the previously designed Inter-object Shape Variation factor for foreground objects cannot simply be reused to measure the shape complexity of the background. For each enclosed region, we compute its maximal inscribed convex set based on an existing algorithm (Borgefors and Strand, 2005), whose details are provided in the appendix. The area of each enclosed region is denoted as A_i, and the area of its inscribed convex set as C_i. The irregularity of each subcontour is then calculated as 1 − C_i/A_i. The final Background Shape Irregularity score for each image is the average over all subcontours: (1/N) Σ_i (1 − C_i/A_i), where N is the total number of enclosed regions within the image. Clearly, the higher this factor, the more irregular the background shape (contour), and therefore the harder it is to separate background pixels from foreground objects.
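The assignment step of the Background-Foreground Color Similarity factor could be sketched as below. For illustration only, we brute-force the optimal assignment over a tiny, equally sized subsample of pixels; real code would run scipy.optimize.linear_sum_assignment on the full U × V cost matrix, and the exact normalisation is in the paper's appendix, so the one used here is an assumption:

```python
import numpy as np
from itertools import permutations

def bg_fg_color_similarity(bg_pixels, fg_pixels):
    """bg_pixels, fg_pixels: (N, 3) float RGB arrays (equally sized, tiny
    subsamples here, so the optimal assignment can be brute-forced).
    Returns 1 - mean matched colour distance, normalised to [0, 1]."""
    cost = np.linalg.norm(bg_pixels[:, None, :] - fg_pixels[None, :, :],
                          axis=-1) / (255.0 * np.sqrt(3))
    n, idx = len(bg_pixels), np.arange(len(bg_pixels))
    best = min(cost[idx, list(perm)].mean() for perm in permutations(range(n)))
    return 1.0 - best
```

A black background against a white foreground gives a similarity near 0; identical pixel sets give exactly 1.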
Together, these three factors are designed to quantify the complexity of image backgrounds in terms of appearance and shape. Figure 10 shows sample images for the three background complexity factors at different values. The original full images are also included for reference.

Considered Methods
A range of works have explored unsupervised object segmentation in recent years. They are typically formulated as (variational) autoencoders (AE/VAE) (Kingma and Welling, 2014) or generative adversarial networks (GANs) (Goodfellow et al., 2014). GAN-based models (Chen et al., 2019; Arandjelovic and Zisserman, 2019; Bielski and Favaro, 2019; van Steenkiste et al., 2020; Azadi et al., 2020; Voynov et al., 2021; Abdal et al., 2021) are usually limited to identifying a single foreground object and can hardly discover multiple objects due to training instabilities, and are therefore not considered in this paper. Another line of work adopts Energy-based Models (EBMs) (LeCun et al., 2006; Du et al., 2021; Liu et al., 2022) for scene decomposition. However, these do not explicitly generate object masks but instead encode objects into energy functions, and are therefore not discussed in this paper either. As shown in Table 1, the majority of existing models are based on AE/VAE and can generally be divided into two groups according to the object representation:
- Factor-based models: Each object is represented by explicit factors such as size, position, appearance, etc., and the whole image is a spatial organization of multiple objects. Such a representation explicitly enforces objects to be bounded within particular regions.
- Layer-based models: Each object is represented by an image layer, i.e., a binary mask, and the whole image is a spatial mixture of multiple object layers. Intuitively, this representation does not impose strict spatial constraints and is instead more flexible to cluster similar pixels as objects.
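The layer-based spatial mixture can be sketched generically as below (a softmax-normalised mixture over K slots; this illustrates the representation only and does not follow any specific baseline's architecture):

```python
import numpy as np

def compose_layers(rgb_layers, mask_logits):
    """Spatial-mixture composition used by layer-based models.
    rgb_layers: (K, H, W, 3) per-slot RGB predictions;
    mask_logits: (K, H, W) unnormalised mask logits.
    Masks are softmax-normalised across slots, so each pixel's colour is a
    convex combination of the K layers."""
    logits = mask_logits - mask_logits.max(axis=0, keepdims=True)
    masks = np.exp(logits)
    masks /= masks.sum(axis=0, keepdims=True)            # sums to 1 per pixel
    image = (masks[..., None] * rgb_layers).sum(axis=0)  # (H, W, 3)
    return image, masks
```

Because the masks sum to one at every pixel, each pixel is softly attributed to exactly one object layer in the limit of confident logits.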
To decompose input images into objects against backgrounds, these approaches introduce different types of network architectures, losses, and regularization terms as inductive biases. These biases broadly include: 1) variational encoding, which encourages the disentanglement of latent variables; 2) iterative inference, which likely ends up with better scene representations under occlusions; 3) object relationship regularization, such as depth estimation and autoregressive priors, which aims at capturing the dependency between multiple objects; and many other biases. With different combinations of these biases, many methods have shown outstanding performance on synthetic datasets. Among them, we select 4 representative models for investigation: 1) AIR (Eslami et al., 2016), 2) MONet (Burgess et al., 2019), 3) IODINE (Greff et al., 2019), and 4) SlotAtt (Locatello et al., 2020). We also add the fully-supervised Mask R-CNN (He et al., 2017) as an additional baseline for a comprehensive comparison in Sections 4.1 ∼ 4.6.
More recently, a number of new methods such as Odin (Hénaff et al., 2022), DINOSAUR (Seitzer et al., 2022), FreeSOLO (Wang et al., 2022), and CutLER (Wang et al., 2023) make use of models pretrained on monolithic object images such as ImageNet to segment objects from real-world images. We additionally evaluate and discuss the representative method DINOSAUR (Seitzer et al., 2022) in Section 4.7.
Naturally, objects and scenes in different datasets tend to have very different types of biases. For example, the objects in dSprites tend to have a single-color bias, while those in COCO do not; the objects in ScanNet_bg tend to be influenced by cluttered backgrounds, while those in ScanNet do not. Generally, the object-level biases can be divided into: 1) appearance biases, including different textures and lighting effects of objects, and 2) geometry biases, including object shapes and occlusions. Similarly, the scene-level biases include: 1) appearance biases such as the color similarity between all objects, and 2) geometry biases such as the diversity of all object shapes. The background biases can be divided into: 1) appearance biases, including the textures and lighting effects of backgrounds and the color similarity between background and foreground, and 2) geometry biases, such as the irregularity of the background shape. In fact, our complexity factors introduced in Section 2 are designed to capture these biases well. Table 3 in the appendix qualitatively summarizes the biases of the datasets in Groups 1/2/3/4. We hypothesize that the large gaps in biases between synthetic and real-world datasets have a huge impact on the effectiveness of existing models.
To guarantee the fairness and consistency of all experiments, we carefully prepare the four groups of datasets using the same protocols below. Preparation details of each dataset are provided in the appendix.
-All images are rerendered or cropped with the same resolution of 128 × 128.
-Each image has about 2 to 6 solid objects, with a blank background in Groups 1&2, a synthetic background in Group 3, and a (semi-)real background in Group 4.
-Each dataset has about 10000 images for training and 2000 images for testing.
The datasets in Groups 1&2 are primarily used to evaluate how the four object- and scene-level complexity factors affect the object segmentation performance of existing methods, while the datasets in Groups 3&4 are used to investigate how the diversity of backgrounds affects segmentation performance.

Considered Metrics
Having the 12 representative datasets and existing unsupervised methods at hand, we choose the following metrics to evaluate object segmentation performance: 1) the AP score, which is widely used for object detection and segmentation (Everingham et al., 2015); 2) the PQ score, which is used to measure non-overlapping panoptic segmentation (Kirillov et al., 2019); and 3) Precision and Recall scores. A predicted mask is considered correct if its IoU against a ground truth mask is above 0.5. All objects are treated as a single class. The blank background is not taken into account for foreground object segmentation. To compute AP, we simply treat the mean value of the soft object mask as the object confidence score. In order to quantitatively analyze over-/under-segmentation issues, we additionally calculate ARP/ARR scores (Zimmermann et al., 2023) and the ARI score (Rand, 1971). Note that the alternative segmentation covering (SC) metric (Arbeláez et al., 2011) is not considered as it can be easily saturated.

Fig. 11: Quantitative results of object segmentation on 7 datasets from Groups 1&2 with blank backgrounds.
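The Precision/Recall protocol above could be sketched with a greedy IoU matching (the exact matching scheme, e.g. Hungarian vs. greedy, is an assumption here):

```python
import numpy as np

def match_masks(pred_masks, gt_masks, iou_thresh=0.5):
    """Greedy IoU matching: a predicted mask counts as a true positive if
    its best IoU with a still-unmatched ground-truth mask is above 0.5
    (all objects treated as a single class). Returns (precision, recall)."""
    matched, tp = set(), 0
    for p in pred_masks:
        best_iou, best_j = 0.0, None
        for j, g in enumerate(gt_masks):
            if j in matched:
                continue
            union = np.logical_or(p, g).sum()
            iou = np.logical_and(p, g).sum() / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > iou_thresh:
            tp += 1
            matched.add(best_j)
    precision = tp / len(pred_masks) if len(pred_masks) else 0.0
    recall = tp / len(gt_masks) if len(gt_masks) else 0.0
    return precision, recall
```

For example, one perfect prediction against two ground-truth objects yields precision 1.0 and recall 0.5, exposing under-segmentation.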
Key Experimental Results

Can current unsupervised models succeed on real-world datasets?
We first evaluate the performance of all five baselines on the synthetic datasets (dSprites, Tetris, CLEVR) in Group 1, as well as on the datasets MOViC, YCB, ScanNet, and COCO in Group 2. It is important to note that all images in these datasets have blank backgrounds only. Both quantitative and qualitative results are presented in Figures 11&14. Detailed breakdown ARP/ARR scores for comparing over-/under-segmentation are provided in Figure 12. Notably, on synthetic datasets, all methods, especially the recent baselines IODINE and SlotAtt, demonstrate satisfactory segmentation outcomes. However, it is not surprising that all unsupervised methods encounter significant challenges when applied to real-world datasets, despite the blank backgrounds in all images. In addition, from the ARP/ARR scores, the methods AIR, MONet and IODINE are more prone to over-segmentation, while SlotAtt tends to exhibit under-segmentation across all 7 datasets. This observation is aligned with the results in (Zimmermann et al., 2023). We then evaluate three baselines (IODINE/SlotAtt/Mask R-CNN, chosen for their better abilities to model backgrounds) on the datasets (CLEVR_bg/MOViC_bg/YCB_bg/ScanNet_bg/COCO_bg) from Groups 3&4, where all images have either synthetic or real-world backgrounds. Figures 13&15 show the quantitative and qualitative results respectively. We can see that both unsupervised IODINE and SlotAtt achieve excellent performance on the synthetic dataset CLEVR_bg, but again fail on real-world datasets with real backgrounds.
Preliminary Diagnosis: To diagnose the failure on real-world images, we hypothesize that it is caused by the gaps in biases between synthetic and real-world datasets. In this regard, we quantitatively compute the distributions of all 7 complexity factors on the four groups of datasets. In particular, the two object-level factors, i.e., Object Color Gradient and Object Shape Concavity, and the two scene-level factors, i.e., Inter-object Color Similarity and Inter-object Shape Variation, are computed on the seven training splits of the Group 1&2 datasets (dSprites/Tetris/CLEVR, MOViC/YCB/ScanNet/COCO). The three background-level factors, i.e., Background Color Gradient, Background-Foreground Color Similarity and Background Shape Irregularity, are computed for each image of the training splits of the Group 3&4 datasets (CLEVR_bg/MOViC_bg/YCB_bg/ScanNet_bg/COCO_bg).
As shown in the top row of Figure 16, we can see that: 1) for the two object-level factors (Subfigs 1-a and 1-b), the complexity scores of synthetic datasets are substantially lower than those of real-world datasets, with the semi-realistic MOViC in between. This implies that synthetic objects are more likely to have uniform colors and convex shapes; 2) for the two scene-level factors (Subfigs 1-c and 1-d), the images in synthetic datasets tend to include less similar objects in terms of color, which means that multiple objects in real-world scenes are less distinctive in appearance. In addition, multiple objects in synthetic scenes tend to have similar sizes, whereas real-world scenes usually have diverse object scales within single images. As shown in the bottom row of Figure 16, we can see that: 1) real-world image backgrounds are more likely to have non-uniform colors and irregular shapes (Subfigs 2-a and 2-b); 2) synthetic images tend to have more distinctive backgrounds against foreground objects than real-world images (Subfig 2-c).
To validate whether these distribution biases are the true reasons incurring the failure, we conduct extensive ablative experiments in Sections 4.2, 4.3, 4.4, and 4.5. Figure 17 shows the new distributions of the two object-level and two scene-level factors on the -C ablated datasets. We can see that the Object Color Gradient becomes all zeros in Figure 17 (1-a), even simpler than on the three synthetic datasets, yet the distributions of the other three factors are almost the same as the original ones, i.e., Figure 17 (1-b)(1-c)(1-d). Figure 17 (2-b) shows that the distributions of Object Shape Concavity on the -S ablated datasets now become similar to those of the synthetic datasets, while the distributions of the other three factors are unaffected, i.e., Figure 17 (2-a)(2-c)(2-d) being similar to Figure 16 (1-a)(1-c)(1-d). Note that, for the -C+S ablations, the distributions are the same as shown in Figures 17 (1-a), (2-b), (1-c) or (2-c), and (1-d) or (2-d). Having the three groups of object-level ablated real-world datasets, we then train and evaluate our four unsupervised baselines from scratch on each of the ablated datasets separately.
Brief Analysis: Figure 20 shows the quantitative segmentation results and the ARP/ARR scores for over-/under-segmentation. Figure 18 qualitatively presents the evolution of segmentation performance for IODINE according to the adjustment of object-level complexity. More qualitative results for all approaches can be found in Figure 40 in the appendix. We can see that: 1) Once the pixels of real-world objects are replaced by their mean colors, i.e., no color gradients, the object segmentation performance is significantly improved for almost all methods. The ARP and ARR scores tend to be closer with the -C ablation, indicating that both over- and under-segmentation are alleviated. 2) Reducing the irregularity of real-world object shapes can also improve object segmentation, although not significantly. Simplifying object shapes does not effectively reduce the gap between ARP and ARR scores; in most cases, the over-/under-segmentation issues remain the same as on the unablated datasets. 3) Overall, these results show that existing methods are more likely to learn objectness represented by uniform colors and/or regular shapes. However, compared with Figure 11, the results on the current ablated datasets in Figure 20(a) still lag behind the synthetic datasets. This means that there should be other factors that also potentially affect the object segmentation of existing models.

Fig. 19: Distributions of four foreground factors on datasets ablated at the scene level.

Brief Analysis: Figure 20 shows the quantitative segmentation results and ARP/ARR/ARI scores for over-/under-segmentation. Figure 21 presents the qualitative performance of SlotAtt according to the decrease of scene-level complexity. More qualitative results for all methods can be found in Figure 41 of the appendix. We can see that: 1) Once the textures of real-world images are replaced by more distinctive textures, i.e., with a lower similarity between object appearances, the segmentation performance is remarkably boosted for almost all methods. For MONet and IODINE, the gap between ARR and ARP scores is effectively reduced, indicating a mitigation of over-segmentation (for MONet) and under-segmentation (for IODINE). 2) Normalizing object sizes across images can also reasonably improve the segmentation performance of AIR through the alleviation of over-segmentation. Its effectiveness, however, is less obvious for the other three approaches. 3) Overall, these results clearly show that existing unsupervised models significantly favor objectness with distinctive appearances in single images. However, compared with Figure 11, the results on the current scene-level ablated datasets are still inferior to those on synthetic datasets, meaning that the scene-level factors alone are not enough to explain the performance gap.

Since these ablations are conducted independently, the new distributions of the four complexity factors on the current jointly ablated datasets are the same as in Figures 17 (1-a)(2-b) and Figure 19 (1-c)(2-d). We train and evaluate the four baselines from scratch on each of the ablated datasets separately.
Brief Analysis: Figure 20 shows the quantitative segmentation results and the ARP/ARR/ARI scores for comparing over-/under-segmentation. Figure 22 presents the qualitative performance of MONet according to the decrease of object-level and scene-level complexity. More qualitative results for all methods can be found in Figure 42 in the appendix. We can see that: 1) Combining the two object-level ablations with the scene-level color ablation (-C+S+T) can significantly improve segmentation results, especially for SlotAtt as shown in Figure 20(a). The ARP and ARR scores tend to be closer, suggesting an alleviation of over-/under-segmentation. Combining the two object-level ablations with the scene-level shape ablation (-C+S+U), however, has a notable effect only on AIR, where its over-segmentation issue is relieved. 2) If the challenging real-world objects and images are ablated at both the object and scene levels, the segmentation performance of all unsupervised models reaches the same level as on the three synthetic datasets shown in Figure 11. 3) Overall, these three groups of experiments demonstrate that the failure of unsupervised models on real-world images involves both object- and scene-level dataset biases.

How do background factors affect current models?
In this section, we investigate to what extent the distributions of background-level factors affect the segmentation performance by conducting the following ablative experiments.
-Ablation of Background Color Gradient: For each image of the datasets in Group 4, we replace the background pixels with their average RGB color. The four ablated datasets are denoted as: MOViC bg -C/YCB bg -C/ScanNet bg -C/COCO bg -C.
-Ablation of Background-Foreground Color Similarity: For each image of the Group 4 datasets, we replace the background texture with a new one from the DTD database as shown in Figure 23. We select the most distinctive color against the foreground pixels. In this way, the foreground objects are more distinguishable against the background, while the background gradients are roughly preserved. The four ablated datasets are denoted as: MOViC bg -T/YCB bg -T/ScanNet bg -T/COCO bg -T.
-Ablation of Background Shape Irregularity: For each image, we first find all connected subcontours of its background. For each region enclosed by a subcontour, we find the smallest convex hull (Eddins, 2011) that surrounds it.
The original enclosed regions are enlarged to their corresponding convex hulls and filled by shifting around the original region pixels. In this way, the background shape becomes more regular, while the background appearance stays the same. The four ablated datasets are denoted as: MOViC bg -S/YCB bg -S/ScanNet bg -S/COCO bg -S.
Comparing Figure 24 (1-c) and Figure 16 (2-c), we observe that the Background-Foreground Color Similarity is inevitably increased as a byproduct. Background Shape Irregularity (Subfig 3-b) decreases, while the Background Color Gradient (Subfig 3-a) and Background-Foreground Color Similarity (Subfig 3-c) remain similar to the original MOViC bg /YCB bg /ScanNet bg /COCO bg . We train and evaluate SlotAtt and IODINE from scratch on each of the ablated datasets. Note that the performance of background segmentation is separately measured by a Recall score, denoted as BG Recall, where a predicted background mask is considered correct if its IoU against a ground-truth background mask is above 0.5.
Brief Analysis: Figures 25 & 26 show the quantitative and qualitative results respectively. We can see that: 1) Removing the Background Color Gradient or replacing the background texture with one more discriminative against the foreground objects can largely increase the BG Recall score. 2) Making the background contours more regular alone can hardly benefit the segmentation of objects and backgrounds. 3) Overall, the background can be easily recognized if it has a simple and discriminative color, which however cannot fundamentally alleviate the difficulty of segmenting individual foreground objects. This further confirms that the four object- and scene-level complexity factors introduced in Sections 2.1 & 2.2 play an essential role in the object segmentation of existing models.
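The BG Recall score described above can be sketched as follows. This is a minimal NumPy sketch written by us for illustration (the function names are not from any released code); it counts a ground-truth background mask as matched when some predicted background mask overlaps it with IoU above 0.5.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def bg_recall(pred_bg_masks, gt_bg_masks, thresh: float = 0.5) -> float:
    """Fraction of ground-truth background masks matched by at least
    one predicted background mask with IoU above `thresh`."""
    matched = sum(
        1 for gt in gt_bg_masks
        if any(iou(p, gt) > thresh for p in pred_bg_masks)
    )
    return matched / len(gt_bg_masks)
```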

Why do current unsupervised models fail on real-world datasets?
As demonstrated in Sections 4.2/4.3/4.4, once the object- and scene-level complexity factors are removed from the challenging real-world datasets, existing unsupervised models can perform as well as on the synthetic datasets, as qualitatively illustrated in Figure 42. From this, we can safely conclude that the inductive biases designed in existing unsupervised models are far from able to match and fully capture the true and complex objectness biases exhibited in real-world images. Nevertheless, from Figure 20, each baseline tends to favor different objectness biases. In particular,
-AIR (Eslami et al., 2016): As a factor-based model, AIR has a strong spatial-locality bias. Despite its poor segmentation performance across all datasets, there is a notable improvement when inter-object shape variation is ablated from the real-world datasets (U/T+U/C+S+U/C+S+T+U). More convincingly, even when all other three factors are ablated (C+S+T), it can hardly be improved, showing that object shape variation is a key factor for AIR. Qualitatively (Figure 14) and quantitatively (Figure 12), AIR tends to use multiple bounding boxes for single real-world objects, which leads to over-segmentation. We can see from Figure 20(b) that the uniform-scale ablation, especially on the -T+U datasets, can effectively decrease the gap between ARP and ARR scores, which implies an alleviated over-segmentation. Since AIR is designed to attend to and infer objects as bounding boxes and does not explicitly model backgrounds, the background-level factors are not studied here.
-MONet (Burgess et al., 2019): MONet is more sensitive to color-related factors than to shape-related factors. The ablations of object color gradient and inter-object color similarity significantly improve its performance, while the ablations of object shape concavity and inter-object shape variation make little difference. For the two color-related factors, the scene-level one is more important than the object-level factor. From this, we can see that
MONet has a strong dependency on color. Similar colors tend to be grouped together while different colors are separated apart. From Figure 20(b), we observe that the gaps between ARP and ARR scores are relatively small for all experiments on MONet, even when the AP scores are low. This suggests there is both over-segmentation and under-segmentation in the results. An object with various colors tends to be over-segmented, and different objects with similar colors are likely to be grouped together. Due to its heavy reliance on color clues, MONet cannot identify backgrounds without regular and distinctive colors. Thus, background-level factors are not studied for it.
-IODINE (Greff et al., 2019): IODINE also has a heavy dependency on the object-/scene-level color-related factors. However, different from MONet, the ablation of object color gradient brings better performance than that of inter-object color similarity. We speculate this is because the regularization on the shape latent alleviates under-segmentation by biasing towards more regular shapes. From Figure 20(b), we observe both under-/over-segmentation, and the gaps between ARR and ARP scores are not obvious in most cases.
-SlotAtt (Locatello et al., 2020): Unsurprisingly, as shown in Figure 25, it is observed that MOViC and COCO are more prone to under-segmentation while YCB and ScanNet are more prone to over-segmentation. From Figure 16, it is observed that MOViC and COCO have less distinguishable colors while YCB and ScanNet have more complex shapes. This also illustrates that both color and shape are important factors for SlotAtt. As shown in Figure 25, SlotAtt is not able to segment real-world backgrounds. The background-foreground separation in SlotAtt is particularly sensitive to the color contrast between the background and foreground. However, similar to IODINE, a distinguishable background cannot alleviate the burden of foreground object segmentation, as it does not fundamentally remove the complex object- and scene-level factors in real-world images.
The models pretrained on
monolithic object images have recently achieved promising results on real-world datasets. FreeSOLO (Wang et al., 2022) and CutLER (Wang et al., 2023) generate pseudo labels from pretrained features and train detectors iteratively with the pseudo labels. Odin (Hénaff et al., 2022) and DINOSAUR (Seitzer et al., 2022) make use of features pretrained on ImageNet (Russakovsky et al., 2015) and apply different clustering strategies. Among this new line of works, we select the representative DINOSAUR to evaluate its segmentation performance on the datasets of Groups 1/2/4.
Brief Analysis: As shown in Figure 27, DINOSAUR exhibits clear limitations on the datasets in Groups 1 & 2 where all images just have blank backgrounds. Primarily, this is because DINOSAUR relies heavily on pretrained ViT features, specifically those extracted from models like DINO (Caron et al., 2021) trained on the ImageNet dataset. Such features can hardly generalize to images in the Groups 1 & 2 datasets, where the synthetic objects and blank backgrounds are significantly different from those in ImageNet.
As shown in Figure 28 and Table 2, as expected, DINOSAUR demonstrates better and more reasonable object segmentation results on the 4 datasets in Group 4. In particular, it shows favorable FG-ARI and mean Best Overlap (mBO) scores, but tends to over-segment objects when the number of available slots exceeds the number of objects present within images. Moreover, it often fails to identify backgrounds as complete entities, instead segmenting them into fragmented pieces. That is why the precision score tends to be lower than the recall score, and the AP score is further diminished.
Overall, by leveraging large-scale pretrained real-world image features, DINOSAUR shows promising performance in segmenting objects in real-world images thanks to the feature grouping capability of slot-attention-based mechanisms (Yu et al., 2022; Xu et al., 2022). Nevertheless, it is still in its infancy at accurately identifying generic objects and separating complex backgrounds. We conjecture that more explicit object biases need to be encoded into the pretrained models or the unsupervised learning process, although the existing features pretrained on monolithic object images may implicitly carry the concept of objectness. More advanced research works are expected in this direction in the future.

Conclusions
We systematically show that existing unsupervised methods are practically unable to segment generic objects from single real-world images, and investigate the underlying factors that incur the failure. With the aid of our carefully designed seven object-, scene-, and background-level complexity factors, we conduct extensive experiments on multiple groups of ablated real-world objects and images, and safely conclude that the distributions of object-, scene-, and background-level biases in appearance and geometry of real-world datasets are particularly diverse and indiscriminative, such that current unsupervised models cannot segment real objects or backgrounds. Based on this finding, we suggest two main directions for future study: 1) to exploit more discriminative objectness biases such as object motions, which expressively describe the ownership of visual pixels, as recently explored in (Tangemann et al., 2021; Chen et al., 2022; Bear et al., 2020) for 2D images and in (Song and Yang, 2022) for 3D point clouds; 2) to leverage pretrained features from single-object-dominant datasets, which explicitly regard each image as an object, as recently studied in (Caron et al., 2021; Hénaff et al., 2022; Seitzer et al., 2023).

In addition to the primary four complexity factors in Sections 2.1 & 2.2, we also explore other potential complexity factors to quantitatively measure the distributions of object- and scene-level biases in appearance and geometry. Basically, we aim to consider as many aspects as possible to investigate the key factors underlying the distribution gaps between synthetic and real-world datasets. However, we empirically find that these candidate factors do not show significant discrepancies between synthetic and real-world datasets. Details are shown below.

A.1.1 Candidates of Object-level Complexity Factors
-Object Color Count: This factor is defined as the total number of unique colors within an object mask. Basically, this is to simply measure the diversity of object colors.
-Object Color Entropy: Inspired by the Shannon entropy (Shannon, 1948), we calculate the entropy value at each pixel by applying a 3 × 3 filter on the grayscale image converted from RGB. In particular, for each pixel, its color value is a discrete value in [0, 255]. We compute its entropy score: H(x) = −∑_{i=1}^{n} p(x_i) log_2 p(x_i), where p(x_i) denotes the probability of a specific color value x_i within the 3 × 3 neighbourhood. Basically, this factor aims to measure the color diversity within 3 × 3 image patches. The higher this factor, the more frequently the object color changes in small local areas.
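The per-pixel entropy above can be sketched as follows. This is an illustrative NumPy implementation of ours (not the paper's code): for every pixel we take its 3 × 3 neighbourhood, estimate the empirical distribution of grayscale values, and compute the Shannon entropy.

```python
import numpy as np

def patch_entropy(gray: np.ndarray) -> np.ndarray:
    """Per-pixel Shannon entropy over a 3x3 neighbourhood of a
    grayscale image with integer values in [0, 255]."""
    h, w = gray.shape
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            # neighbourhood is clipped at the image border
            patch = gray[max(0, y - 1):y + 2, max(0, x - 1):x + 2].ravel()
            _, counts = np.unique(patch, return_counts=True)
            p = counts / counts.sum()
            out[y, x] = -(p * np.log2(p)).sum()
    return out
```

The object color entropy factor would then be the mean of this map over an object's mask.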
-Object Shape Non-rectangularity: Given the binary mask of an object (M_obj ∈ R^{H×W}), we first calculate its axis-aligned bounding box (M_bbox ∈ R^{H×W}). Object shape non-rectangularity is calculated as 1 − ∑M_obj / ∑M_bbox. Similar to object shape concavity, this factor is also designed to measure the complexity of object shapes. However, this factor is more likely to be affected by the object orientation since it takes axis-aligned bounding boxes as a reference.
-Object Shape Incompactness: There are two similar methods to quantify the compactness of object shapes. The first one is the Polsby-Popper test (Polsby and Popper, 1991): PP(M_obj) = 4π A(M_obj) / P(M_obj)^2. The other is the Schwartzberg (Schwartzberg, 1965) compactness score: S(M_obj) = (2π √(A(M_obj)/π)) / P(M_obj). In both formulas, P(M_obj) is the object perimeter and A(M_obj) is the object area.
For simplicity, we choose PP(M_obj) to calculate the object shape incompactness score: 1 − PP(M_obj).
-Object Shape Discontinuity: Given an object mask (M_obj ∈ R^{H×W}), we first find the largest connected component (M_lcc ∈ R^{H×W}) in its binary mask. The discontinuity of shape is calculated as: 1 − ∑M_lcc / ∑M_obj. This factor is to evaluate how continuous an object shape is.
-Object Shape Decentralization: Given an object mask, we first calculate its centroid (x̄, ȳ) by averaging all pixel coordinates in the object. Then, the second moment of this object is calculated as the mean squared distance of object pixels to the centroid: m = (1/∑M_obj) ∑_{(x,y)∈M_obj} ((x − x̄)^2 + (y − ȳ)^2), where (x, y) are the coordinates of pixels within the object. The higher this factor, the less centralized the object shape is.
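A few of these shape-factor candidates can be sketched directly from the formulas above. This is our own illustrative NumPy sketch (the normalization of the second moment by the object area is our assumption, as noted in the text):

```python
import numpy as np

def non_rectangularity(mask: np.ndarray) -> float:
    """1 - sum(M_obj) / sum(M_bbox) with an axis-aligned bounding box."""
    ys, xs = np.nonzero(mask)
    bbox_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    return 1.0 - mask.sum() / bbox_area

def incompactness(area: float, perimeter: float) -> float:
    """1 - PP(M_obj), with the Polsby-Popper score PP = 4*pi*A / P^2."""
    return 1.0 - 4.0 * np.pi * area / perimeter ** 2

def decentralization(mask: np.ndarray) -> float:
    """Second moment: mean squared distance of object pixels to the
    object centroid (area normalization is our assumption)."""
    ys, xs = np.nonzero(mask)
    return ((ys - ys.mean()) ** 2 + (xs - xs.mean()) ** 2).mean()
```

For instance, a perfect circle has PP = 1 and hence zero incompactness, while a full rectangular mask has zero non-rectangularity.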
As shown in Figure 29, we compare the distributions of the object-level factor candidates on both synthetic and real-world datasets. It can be seen that the majority of these factors do not show significant gaps between the simple synthetic and the challenging real-world datasets. Therefore, we do not conduct the relevant ablation experiments.

A.1.2 Candidates of Scene-level Complexity Factors
-Inter-object Color Similarity with Chamfer Distance: In the calculation of this factor, we first convert each pixel into a point in RGB space. In this way, each object can be represented by a point set in the RGB space. This factor is calculated between each pair of objects by measuring the Chamfer distance of the two point sets in the RGB space. Since the Chamfer distance is an asymmetric measurement, we calculate and average the bidirectional Chamfer distances. Compared with the Euclidean distance, this measurement favors the most similar colors between two objects.
-Inter-object Color Similarity with Hausdorff Distance: This factor is similar to the previous one. The only difference is that we replace the Chamfer distance with the Hausdorff distance. The Hausdorff distance is also a directed and asymmetric measurement, so the final score is the average of the distance values in both directions.
-Inter-object Shape Similarity over Boundaries: For each object mask, we first find its boundary using the method in (Cheng et al., 2021), and then crop it with its axis-aligned bounding box. Each bounding box is scaled to fit into a unit box with its original aspect ratio. Lastly, we calculate the IoU between the boundaries of two objects to measure their shape similarity.
-Inter-object Shape Entropy between Boundaries: We first combine all object masks into a single image by assigning different indices to different objects. Then we compute the entropy of each pixel with a 3 × 3 filter. The final factor score is calculated by averaging all non-zero entropy values. Note that the interior part of objects and the background are not considered because their entropy values will always be zero. Basically, this factor is designed to evaluate how crowded an image is. The higher this factor, the more objects are spatially adjacent.
-Inter-object Proximity between Centroids: We first calculate the centroid (x̄, ȳ) of each object by averaging all pixel coordinates in the object mask. Euclidean distances between object
centroids are then computed pairwise before they are averaged to be the final factor score. This factor is designed to measure the spatial proximity of multiple objects in a single image.
-Inter-object Proximity with Chamfer Distance: To measure the spatial proximity between objects, we also calculate the spatial Chamfer distance between objects. Specifically, each object is represented by a set of (h/w) coordinates, and the average of the pairwise bidirectional Chamfer distances is calculated as the proximity score for each image.
-Inter-object Area Variation: In order to measure the variation of objects in terms of their scale, we first calculate the area of each object, and then compute the pairwise absolute difference for all object areas, obtaining a K × K matrix. The final inter-object area variation is the average of the matrix excluding the diagonal entries.
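The symmetrised Chamfer distance used by the color- and proximity-based candidates above can be sketched as follows (our own illustrative NumPy sketch; for RGB similarity each point is a pixel color, for spatial proximity it is an (h, w) coordinate):

```python
import numpy as np

def chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetrised Chamfer distance between two point sets of shape
    (N, D) and (M, D), e.g. pixels in RGB space (D = 3)."""
    # (N, M) matrix of pairwise Euclidean distances
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # average the two directed (asymmetric) Chamfer distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

Replacing the `min`-then-`mean` reduction with `min`-then-`max` would give the directed Hausdorff distance mentioned above.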
As shown in Figure 30, we compare the distributions of the scene-level candidate factors on both synthetic and real-world datasets. We can see that both the inter-object color similarity with Chamfer and Hausdorff distances share similar distribution gaps with our primary inter-object color similarity factor defined in Section 2. The remaining four candidate factors relating to inter-object shape complexity do not show significant distribution gaps between the synthetic and real-world datasets. In this regard, we choose not to conduct ablation experiments on these six candidate factors. The distribution of Inter-object Area Variation is presented in Figure 31. Overall, the inter-object area variation for synthetic datasets is higher than for real-world datasets. However, comparing Inter-object Area Variation in Figure 31 and Inter-object Shape Variation in Figure 2 (1-d), we find that the factor values for Inter-object Shape Variation have smaller variation within each dataset, indicating that it better captures the scene-level property of datasets. Besides, Inter-object Shape Variation takes into account not only the scale of objects but also their orientations. Thus, we select Inter-object Shape Variation over Inter-object Area Variation.

A.1.3 Complexity Factors Measured Across Dataset
Object-level complexity factors are calculated for each object as shown in Figure 32 (1-a)(1-b). Scene-level complexity factors are calculated for each image as shown in Figure 32 (1-c)(1-d). We additionally measure the four complexity factors across the datasets as shown in Figure 32. To measure the object-level complexity factors across the datasets, we directly calculate the average of all object complexity values. To measure the scene-level complexity factors across the datasets, we make the color/shape comparison across the dataset and calculate the factor value. We find that the complexity factors calculated across the dataset follow similar patterns as those calculated per object/image.

A.2 More Details of Background-level Complexity Factors
-Background-Foreground Color Similarity: Given all background pixels and all foreground pixels in a single image, we calculate the Euclidean distance between each background pixel and each foreground pixel in the RGB space. This results in a U × V matrix E, where U and V represent the number of pixels in the background and foreground respectively. We treat 255√3 − E as the cost matrix for the Hungarian algorithm and solve the optimal assignment between foreground pixels and background pixels. Each background pixel is thus assigned to a distant foreground pixel in the RGB space. Given the assignment between foreground pixels and background pixels, the background-foreground color distance is calculated as the Euclidean distance between the assigned pairs. The final factor value for background-foreground color similarity is computed as: 1 − (background-foreground color distance) / (255√3). The higher this factor, the more similar the background and foreground appear to be, and the harder they are to separate.
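The assignment step above can be sketched with SciPy's Hungarian solver. This is our own illustrative sketch (assuming `scipy.optimize.linear_sum_assignment`, which minimizes total cost, hence the 255√3 − E cost matrix from the text maximizes the assigned distances):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

MAX_D = 255 * np.sqrt(3)  # largest possible distance in RGB space

def bg_fg_color_similarity(bg: np.ndarray, fg: np.ndarray) -> float:
    """bg: (U, 3) background pixels, fg: (V, 3) foreground pixels.
    Solve the maximum-distance assignment, then map the mean assigned
    distance to a similarity in [0, 1]."""
    e = np.linalg.norm(bg[:, None, :].astype(float) - fg[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(MAX_D - e)  # minimise cost = maximise distance
    return 1.0 - e[rows, cols].mean() / MAX_D
```

A pure black background against a pure white foreground yields similarity 0, and identical colors yield similarity 1.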
-Background Shape Irregularity: In order to compute the maximal inscribed convex set of a connected region, we follow the algorithm proposed by Borgefors and Strand (2005). The main idea is to iteratively find the deepest concavity and remove it while discarding as little of the region as possible. We calculate the constrained distance transform (Piper and Granum, 1987) for the convex deficiency D, where the region R is the constraint. On the calculated distance transform map, the deepest concavity dc ∈ R^2 is the element with the largest distance label as shown in Figure 33(d). In order to remove the deepest concavity dc, we cut the region with a straight line from dc in 8 directions (nπ/4, n = 0, 1, ..., 7) as shown in Figure 34. The cut that generates the smallest sub-region is selected and that smallest sub-region is removed. The above process is repeated until the remaining sub-region is convex (the distance label of dc ≤ 3). The remaining convex region is the maximal inscribed convex set.

For all experiments, the output component distribution is an independent pixel-wise Gaussian with fixed scales (std = 0.7). The attention network is a standard UNet (Ronneberger et al., 2015) with five blocks. Each block consists of the following: a 3 × 3 bias-free convolution with stride 1, followed by instance normalization with a learned bias term, followed by a ReLU activation, and finally downsampled or upsampled by a factor of 2 using nearest-neighbour resizing (no resizing occurs in the last block of each path).

A.3 Implementation Details of Baselines
IODINE (Greff et al., 2019)
-Source Code: We use the official implementation at: https://github.com/deepmind/deepmind-research/tree/master/iodine
-Important Adaptations: The architecture is set the same as what is used for the CLEVR dataset (Johnson et al., 2017) in the original paper (Greff et al., 2019).
-Training Details: Since we use a single GPU for the training of all models, the batch size is adjusted to 4 and the learning rate to 0.0001 × 1/8. The number of slots K is set to 7 and the number of inference iterations T to 5. We train each model for 500K iterations until the loss is fully converged.
-Implementation Details: The convolutional layers in the encoder and decoder have five layers with size-5 kernels, strides of [1,2,1,2,1], and filter sizes of [32,32,64,64,64] and [64,32,32,32,32] respectively.

The stride for the backbone is set to [4,8,16,32,64]. The maximum number of instances to detect in one image is 100. The minimum confidence is 0.5 and the NMS ratio is 0.3. The scale of the RPN anchors is (32,64,128,256,512) and the stride is 2.
-Important Adaptations: We use a similar model architecture as for the MOViC dataset in the original paper. In light of the fact that our datasets only consist of 2-6 objects, we adapt the number of slots to be 7. Since we use a single GPU for training, we adopt a batch size of 64.

A.4 Details of Benchmark Datasets
In this section, we present the details of the synthetic and real-world datasets. Figure 35 presents sample images from all four dataset groups. Table 3 qualitatively summarizes the biases of the datasets in Groups 1/2/3/4.
dSprites (Matthey et al., 2017) To generate a specific image for this dataset, we first sample a random integer K from a uniform distribution over the interval [2, 6] as the number of objects in that image. Then, K object shapes are selected from the binary dSprites dataset (Matthey et al., 2017), also in a uniformly random manner. Each object is assigned a random RGB color by sampling three random integers from a uniform distribution over the interval [0, 255]. In total, we generate 10000 images for training and 2000 for testing.
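The per-image sampling procedure above can be sketched as follows (our own minimal sketch; the function name and signature are ours, not from the paper's generation scripts):

```python
import random

def sample_scene_config(rng: random.Random, num_shapes_available: int):
    """Sample one image configuration: an object count K ~ U[2, 6],
    then K shape indices and K random RGB colours in [0, 255]."""
    k = rng.randint(2, 6)
    shapes = [rng.randrange(num_shapes_available) for _ in range(k)]
    colors = [tuple(rng.randint(0, 255) for _ in range(3)) for _ in range(k)]
    return shapes, colors
```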
Tetris (Kabra et al., 2019) For each image in this dataset, we first sample a random integer K from a uniform distribution over the interval [2, 6] as the number of objects in the image. To render one tetris-like object onto the canvas, we randomly pick a Tetris object from a randomly selected image from (Kabra et al., 2019). Each object is resized to 88 × 88 and then placed onto the canvas. The position of each object is also sampled from a uniform distribution subject to 2 criteria: 1) all objects shall be on the canvas with complete shapes; 2) all objects shall not overlap with each other.
CLEVR (Johnson et al., 2017) We first generate CLEVR images following https://github.com/facebookresearch/clevr-dataset-gen, where the number of objects per image is restricted to between 3 and 6. Given the generated images with a resolution of 640 × 480, we perform center-cropping and then resize them to 128 × 128. Then, we remove tiny objects which have fewer than 35 pixels from each image. Subsequently, the images with fewer than 2 objects are removed. Consistent with the previous 2 synthetic datasets, all images have a black background.
MOViC (Greff et al., 2022) We use the training and validation sets of the released MOVi dataset from gs://kubric-public/tfds. Specifically, the downsampled variant of MOVi-C at a resolution of 128 × 128 is used. In the original MOVi-C dataset, each image consists of 3 ∼ 10 objects. Only images with 2 ∼ 6 objects are kept. All background pixels are replaced by the black color.
YCB (Calli et al., 2017) We sample single frames from the YCB video dataset (Calli et al., 2017) every 20 frames. Given the sampled frames with a resolution of 640 × 480, we first center-crop and then resize them to 128 × 128. Then, the images consisting of fewer than 2 or more than 6 objects are removed. Similarly, all background pixels are replaced by the black color.
ScanNet (Dai et al., 2017) We sample single frames from the ScanNet dataset (Dai et al., 2017) every 20 frames. Given the selected frames with a resolution of 1296 × 968, we first center-crop the images to a size of 800 × 800 and then resize them to 128 × 128. For each resized image, we remove objects that contain more than 128 × 128 × 0.2 pixels or fewer than 128 × 128 × 0.007 pixels. The images with fewer than 2 or more than 6 objects are also dropped. All background pixels are replaced by the black color.
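The object- and image-filtering rule above (also reused for COCO) can be sketched as follows; this is our own illustrative sketch, with the parameter names chosen by us:

```python
def keep_image(object_pixel_counts, res=128, max_frac=0.2, min_frac=0.007):
    """Drop too-large and too-small objects, then keep the image only
    if 2-6 objects remain. Returns (keep_flag, surviving_counts)."""
    lo, hi = res * res * min_frac, res * res * max_frac
    kept = [n for n in object_pixel_counts if lo <= n <= hi]
    return 2 <= len(kept) <= 6, kept
```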
COCO (Lin et al., 2014) Given images in COCO-2017 (Lin et al., 2014) with various resolutions, we first center-crop and then resize the images to 128 × 128. For each resized image, we use the same criteria applied to ScanNet to remove too-large and too-small objects. The images with 2 ∼ 6 objects are kept. All background pixels are replaced by the black color. Since the number of images that meet all requirements is less than 2,000 (only 1,597) in the official validation split, we additionally select 403 different images from the official training split for testing. There is no overlap between our training and testing splits.
CLEVR bg , MOViC bg / YCB bg /ScanNet bg /COCO bg These five datasets are created in the same way as the above (CLEVR, MOViC/ YCB/ ScanNet/ COCO) datasets, with synthetic or real-world backgrounds being added back to each image.

A.5 Examples of the ablated datasets
Figure 36 presents before/after views of images ablated with object-level factors and scene-level factors respectively. Figure 37 shows examples of the joint ablations at both the object and scene level.

A.8 Additional Joint Ablations on Object-and Scene-Level Factors
In addition to the experiments in Section 4.4, we generate 6 additional groups of datasets ablated with different combinations of object- and scene-level factors as follows, and conduct the corresponding experiments. Experiment results can be found in Figures 38 & 39 and Table 10.

Additional Ablated Datasets:
-Ablation of Object Color Gradient and Inter-object Color Similarity: In each image of the three real-world datasets, we replace the object color by averaging all pixels of a distinctive texture, and keep the original shape unchanged, getting three ablated datasets: YCB-C+T/ScanNet-C+T/COCO-C+T.

This can be seen from the fact that, for the datasets without ablations in appearance, i.e., the (S+U) ablated datasets, the object segmentation performance is inferior. By contrast, the object segmentation accuracy can be greatly improved on the datasets with only the appearance factors ablated, i.e., the (C+T) datasets. Meanwhile, more regular shapes and uniform scales of objects still have a significant positive influence on the success of object segmentation, especially when the appearance factors are also ablated.

To be specific, AIR (Eslami et al., 2016) is quite sensitive to the scale of objects, apart from object color gradient and inter-object color similarity. MONet (Burgess et al., 2019) can obtain performance comparable to that on the simple synthetic datasets once object color gradient and inter-object color similarity are ablated. All four factors are closely relevant to the results of IODINE (Greff et al., 2019) and SlotAtt (Locatello et al., 2020).

Fig. 2: Complexity in appearance and geometry for objects and scenes.

Fig. 6: Sample objects and scenes for the four factors at different complexity values.

Fig. 10: Sample image backgrounds for the three factors at different complexity values.

Fig. 13: Quantitative results of object segmentation on 5 datasets of Groups 3 & 4 with synthetic or real-world backgrounds.

Fig. 14: Qualitative results of object segmentation from five methods on datasets of Groups 1 & 2 with blank backgrounds.
How do object-level factors affect current models? In this section, we aim to verify to what extent the distributions of object-level factors affect the segmentation performance by conducting the following three ablative experiments. Examples of the object-level ablations can be found in the appendix, Figure 36.
-Ablation of Object Color Gradient: For each object of the Group 2 datasets (MOViC/YCB/ScanNet/COCO), we replace all pixel colors by their average RGB value within each object mask, without touching the object shapes. In this way, the color gradients of each object are totally erased, thus removing the potential impact of Object Color Gradient. The ablated datasets are: MOViC-C/YCB-C/ScanNet-C/COCO-C.
-Ablation of Object Shape Concavity: For each object in the Group 2 datasets (MOViC/YCB/ScanNet/COCO), we find the smallest convex hull (Eddins, 2011) for its object mask and then fill the empty pixels by shifting original object pixels. Basically, this ablation aims to only reduce the irregularity of object shapes, yet retain the distributions of color gradients. The ablated datasets are: MOViC-S/YCB-S/ScanNet-S/COCO-S.
-Ablation of Object Color Gradient and Shape Concavity: We combine the above two ablations for each object, getting the datasets: MOViC-C+S/YCB-C+S/ScanNet-C+S/COCO-C+S.
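The first ablation (-C, erasing per-object color gradients) can be sketched as follows. This is our own minimal NumPy sketch of the idea, not the authors' pipeline: every pixel inside an object mask is replaced by the object's mean RGB color, leaving the shape untouched.

```python
import numpy as np

def ablate_color_gradient(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """image: (H, W, 3) float array; mask: (H, W) boolean object mask.
    Replace all pixels inside the mask with their average RGB value."""
    out = image.copy()
    out[mask] = image[mask].mean(axis=0)  # one flat color per object
    return out
```

Applying this per object removes the Object Color Gradient factor while leaving the Object Shape Concavity distribution unchanged.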

Fig. 17: Distributions of the four foreground factors on the datasets ablated at the object level.

4.3
How do scene-level factors affect current models? In this section, we investigate to what extent the distributions of scene-level factors affect the segmentation performance with three ablative experiments. Image examples of the scene-level ablations can be found in Figure 36 in the appendix.
-Ablation of Inter-object Color Similarity: In each image of the Group 2 datasets (MOViC/YCB/ScanNet/COCO), we replace all object textures with a set of new distinctive textures from the DTD database (Cimpoi et al., 2014), as shown in Figure 23. In this way, the multiple objects look more distinctive in appearance, while the per-object texture gradients are roughly preserved. The ablated datasets are denoted as: MOViC-T/YCB-T/ScanNet-T/COCO-T.
-Ablation of Inter-object Shape Variation: In each image of the Group 2 datasets, we normalize the scales of multiple objects by shrinking or expanding the diagonal length of their bounding boxes, such that the new object sizes tend to be uniform. For each object, its shape and texture are linearly scaled up or down. Basically, this aims to remove the diversity of object sizes within single images. The ablated datasets are denoted as: MOViC-U/YCB-U/ScanNet-U/COCO-U.
-Ablation of both Inter-object Color Similarity and Shape Variation: We simply combine the above two ablation strategies. The ablated datasets are denoted as: MOViC-T+U/YCB-T+U/ScanNet-T+U/COCO-T+U.

Figure 19 (2-d) shows that the distributions of Inter-object Shape Variation for the ablated datasets now become similar to those of the synthetic datasets, whereas the distributions of the other three factors remain the same as the original ones, i.e., Figure 19 (2-a)(2-b)(2-c) are similar to Figure 16 (1-a)(1-b)(1-c). Note that, for the ablated -T+U datasets, the distributions are as shown in Figure 19 (1-a)(2-b)(1-c)(2-d). Having these three groups of scene-level ablated real-world datasets, we then train and evaluate our four unsupervised baselines from scratch on each of the ablated datasets separately.
Brief Analysis: Figure 20 shows the quantitative segmentation results and the ARP/ARR/ARI scores of baselines on the Group 2 datasets for analyzing over-/under-segmentation. Figure 21 presents the qualitative performance of SlotAtt as the scene-level complexity decreases. More qualitative results for all methods can be found in Figure 41 of the appendix. We can see that: 1) Once the textures of real-world images are replaced by more distinctive ones, the segmentation performance of all baselines improves significantly.

Fig. 20: The letters C/S/C+S denote the three ablated datasets in Section 4.2; T/U/T+U denote the three ablated datasets in Section 4.3; C+S+T/C+S+U/C+S+T+U denote the three ablated datasets in Section 4.4.

4.4 How do object- and scene-level factors jointly affect current models?
In this section, we study how the object- and scene-level factors jointly affect the segmentation performance. We conduct the following three ablative experiments. Image examples of joint ablations are in Figure 37 in the appendix.
- Ablation of Object Color Gradient, Object Shape Concavity, and Inter-object Color Similarity: In each image of the Group 2 datasets (MOViC/YCB/ScanNet/COCO), we replace each object's color by averaging all pixels of its new distinctive texture, and also replace the object shape with its convex hull. The ablated datasets are denoted as: MOViC-C+S+T/YCB-C+S+T/ScanNet-C+S+T/COCO-C+S+T.
- Ablation of Object Color Gradient, Object Shape Concavity, and Inter-object Shape Variation: In each image of the Group 2 datasets, we replace each object's color by averaging its own texture, and replace the object shape with its convex hull, followed by size normalization. The ablated datasets are denoted as: MOViC-C+S+U/YCB-C+S+U/ScanNet-C+S+U/COCO-C+S+U.
- Ablation of all four factors: We aggressively combine all four ablations, yielding the datasets: MOViC-C+S+T+U/YCB-C+S+T+U/ScanNet-C+S+T+U/COCO-C+S+T+U.

Fig. 24: Distributions of three background-level factors on the three types of ablated datasets in Section 4.5.

Fig. 29: Distributions of additional candidates of Object-level Complexity Factors.

Fig. 31: Distributions of additional candidates of Inter-object Area Variation.

A.6 Details of Experimental Results in Section 4.1
Tables 4 & 5 present the detailed results of the experiments conducted in Section 4.1. Standard deviations for all scores are calculated over 3 independent runs.
A.7 Details of Experimental Results in Sections 4.2, 4.3, 4.4 & 4.5
Tables 6, 7, 8 & 9 present the detailed results of the experiments conducted in Sections 4.2, 4.3, 4.4 & 4.5. Standard deviations for all scores are calculated over 3 independent runs.

Fig. 36: Example images of real-world datasets ablated with object-level and scene-level factors.

Fig. 37: Example images of real-world datasets ablated with object-level and scene-level factors (continued).
Fig. 38: Quantitative results of baselines on Group 2 datasets and their variants in Section A.8. (Panel columns: Original, Ablation C+T, GT, AIR, MONet, IODINE, SlotAtt; and Original, Ablation C+U, GT, AIR, MONet, IODINE, SlotAtt.)

Fig. 39: Qualitative results of additional joint ablations on both object- and scene-level factors in Section A.8.

Fig. 16: Distributions of the 7 complexity factors on the four groups of datasets. (Panels include: Object Color Gradient, Object Shape Concavity, Inter-object Color Similarity, Inter-object Shape Variation.)

Table 2: Quantitative object segmentation results of DINOSAUR on the four datasets (MOViC-bg/YCB-bg/ScanNet-bg/COCO-bg) of Group 4 with backgrounds. All scores are in percentage (%).
AIR (Eslami et al., 2016)
- Source Code: We refer to https://pyro.ai/examples/air.html and https://github.com/addtt/attend-infer-repeat-pytorch for the implementation.
- Important Adaptations: We use an additional parameter to weight the KL divergence loss against the reconstruction loss. For each experiment, we choose the KL weight from 1, 10, 25 and 50, and keep the run with the highest AP score.
- Training Details: All experiments with AIR (Eslami et al., 2016) are conducted with a batch size of 64. The learning rate is set to 1e−4 for training the inference networks and decoders, and 1e−3 for the baselines, the same as in the original paper. Since the number of objects in our datasets ranges between 2 and 6, we set the maximum number of inference steps to 6 for all experiments. All models are trained on a single GPU for 1000 epochs. We perform an evaluation every 50 epochs and select the highest AP score.
- Implementation Details: The LSTMs have 256 cell units and object appearances are encoded with 50 units. The hidden dimensions of the appearance encoder and decoder are both 200. Images are normalized to values between 0 and 1, and the likelihood function is a Gaussian with a fixed standard deviation of 0.3. The presence prior p(n) is fixed to a geometric distribution that favors sparse reconstructions (p=0.01). The location prior is a normal distribution with mean [3, 0, 0] and standard deviation [0.2, 1, 1], representing the scale, x-position and y-position respectively. The appearance prior is a standard normal distribution with dimension 50.
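The weighted objective described in the adaptations above can be sketched as follows. This is our own minimal per-image formulation, not the repository's code: constant terms of the Gaussian log-likelihood are dropped, and the function name and signature are illustrative.

```python
import numpy as np

def weighted_air_loss(x, x_recon, kl, beta=25.0, sigma=0.3):
    """Sketch of a beta-weighted ELBO-style loss for AIR.

    x, x_recon: image arrays of identical shape; kl: scalar KL divergence
    of the latents from their priors; beta: KL weight chosen per experiment
    from {1, 10, 25, 50}; sigma: fixed Gaussian likelihood std (0.3).
    """
    # Negative Gaussian log-likelihood up to additive constants.
    nll = 0.5 * np.sum((x - x_recon) ** 2) / sigma ** 2
    return nll + beta * kl
```

Sweeping `beta` and keeping the run with the highest AP score then matches the model-selection procedure described above.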
SlotAtt (Locatello et al., 2020)
- Implementation Details: Fully connected layers are used at the lowest resolution. The autoregressive prior is implemented as an LSTM with 256 units. The conditional distribution is parameterized by a multilayer perceptron (MLP) with two hidden layers of 256 units each. The architecture is the same as that used for the CLEVR dataset (Johnson et al., 2017) in the original paper (Locatello et al., 2020).
- Training Details: All experiments with SlotAtt are conducted with a batch size of 32 and a learning rate selected from [4e−4, 4e−5]. The number of slots K is set to 7 and the number of iterations T to 3. All models are trained on a single GPU for 500K iterations until the loss is fully converged.
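For reference, one of the T = 3 Slot Attention iterations can be sketched as below. This is a simplified illustration: the GRU and residual-MLP slot updates of the full model (Locatello et al., 2020) are replaced by the attention-weighted mean, and all names and shapes are our own.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, inputs, W_q, W_k, W_v, eps=1e-8):
    """One simplified Slot Attention iteration.

    slots: (K, D) current slot vectors; inputs: (N, D) flattened image
    features; W_q/W_k/W_v: (D, D) linear projections.
    """
    q = slots @ W_q                                        # (K, D) queries
    k = inputs @ W_k                                       # (N, D) keys
    v = inputs @ W_v                                       # (N, D) values
    # Softmax over the SLOT axis: slots compete for each input feature.
    attn = softmax(k @ q.T / np.sqrt(q.shape[1]), axis=1)  # (N, K)
    # Normalize over inputs so each slot takes a weighted mean of values.
    attn = attn / (attn.sum(axis=0, keepdims=True) + eps)
    return attn.T @ v                                      # (K, D)
```

The competition induced by the softmax over slots is what encourages each slot to bind to a distinct object region.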
- Training Details: We train DINOSAUR using the Adam optimizer with a learning rate of 4e−4, a linear learning-rate warm-up over 10 000 optimization steps, and an exponentially decaying learning-rate schedule. All models are trained on a single GPU with a batch size of 64.
- Implementation Details: We use a ViT with patch size 8 as the feature extractor and an MLP decoder for all datasets considered. The ViT-S/8 we use has a token dimensionality of 384 and 6 attention heads. All models use 12 Transformer blocks, linear patch embedding and additive positional encoding. The output of the last block (without applying the final layer norm) is passed to the Slot Attention module and used in the feature reconstruction loss, after removing the entry corresponding to the CLS token. For the pre-trained weights, we use the timm library for DINO; the specific timm model name is vit_small_patch8_224_dino. We use a four-layer MLP with ReLU activations and hidden layers of size 1024.
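The DINOSAUR learning-rate schedule above (linear warm-up, then exponential decay) can be sketched as follows. The warm-up length and base rate come from the text; `decay_rate` and `decay_steps` are assumed placeholder values, since only the schedule's form is stated.

```python
def dinosaur_lr(step, base_lr=4e-4, warmup_steps=10_000,
                decay_rate=0.5, decay_steps=100_000):
    """Linear warm-up for warmup_steps, then exponential decay.

    Halves the rate every decay_steps after warm-up (assumed constants).
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warm-up
    return base_lr * decay_rate ** ((step - warmup_steps) / decay_steps)
```

In practice such a schedule is usually applied per optimization step via the optimizer's parameter groups.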

Table 10: Quantitative results on additional datasets ablated with both object- and scene-level factors. Standard deviations are calculated over 3 runs (marked in blue).