
Introduction

The visual system organizes complex scenes using strategies for forming coherent and concise representations, rather than passively receiving all (millions of) bits of information hitting our retinas at any given moment. One powerful heuristic is representing sets of similar objects as an ensemble using summary statistics. Ensemble coding provides global information about a group of items in the entire image, such as the average across multiple dimensions (Alvarez & Oliva, 2008; Ariely, 2001; Bauer, 2009; Chong & Treisman, 2003; Dakin & Watt, 1997; Haberman & Whitney, 2007, 2009), the variance (Morgan, Chubb, & Solomon, 2008; Solomon, 2010), or the approximate number of items (Chong & Evans, 2011; Feigenson, Dehaene, & Spelke, 2004; Halberda, Sires, & Feigenson, 2006). Such global information can be extracted through a pooling process across multiple objects. It provides a quick and precise description of the image as a whole, even when the number of objects to be averaged exceeds cognitive capacity, which is severely constrained by selective attention and working memory systems (e.g., Cowan, 2001; Luck & Vogel, 1997; Pylyshyn & Storm, 1988), and when little conscious access or selective attention to the image is available (e.g., Alvarez, 2011; Alvarez & Oliva, 2008; Alvarez & Oliva, 2009; Ariely, 2001; Corbett & Oriet, 2011; Im & Halberda, 2013; Parkes, Lund, Angelucci, Solomon, & Morgan, 2001).

Ensemble coding is of great value in our everyday perception and cognition of visual scenes. If we look around, we always find some redundancy and regularity in real-world images: Buildings in a city, trees in a forest, and fruit on a bush, for example, are often seen as groups of similar but not identical objects. For most everyday needs, we may not need to store all individuating information from these scenes. We can instead extract only summary statistics of a scene to capture the overall layout, pattern, and gist concisely and compactly. Such perceptual ability to extract ensemble representations allows our brain to “do more with less” and better interact with the complex, dynamic visual world. For example, representing and storing an ensemble (e.g., average) of multiple objects helps the visual system maintain and recall an image better. At the object level, only a few items (up to three or four) can be remembered at a time; the rest may be missed entirely due to the limited memory capacity. When attempting to recall missed objects, one would have to make random guesses. However, a higher-level representation of the average extracted from all the objects in the image can guide one to recall the missed object to some extent, by retrieving values biased toward the average and reducing the overall expected error (Brady & Alvarez, 2011; Im & Chong, 2014). Say you try to remember the different colors of six disks in an image and report the colors you remember. If you remember that the disks were in “cool” colors on average (even if you cannot remember the exact color of each disk), you will likely reduce the overall error by choosing six colors from only the continuum of “cool” colors and avoiding “warm” colors. However, if you remember only the colors of three disks and completely missed the others (without remembering average information), you cannot avoid making extreme errors when you recall the color of one of the three forgotten disks (e.g., by randomly choosing “red” for a disk that is turquoise).

Previous studies have demonstrated that multiple sets of objects, up to three or four, can be extracted in parallel as higher-order units for perception and memory (e.g., Halberda et al., 2006; Im & Chong, 2014; Im, Park, & Chong, 2015). The limit on the number of ensembles that can be extracted and remembered at any given time also converges with the well-documented three-or-four-object limits of visual attention (e.g., Pylyshyn & Storm, 1988) and working memory (e.g., Luck & Vogel, 1997; Zhang & Luck, 2008). Such convergence illustrates how items in an image can be represented hierarchically (e.g., as individuals or ensembles). These different units of visual processing allow complementary information about the image (e.g., local vs. global) to be available to an observer at the same time. In a display of 20 dots in four different colors (five red, five blue, five yellow, and five green dots), for example, an observer can represent 20 individual dots, ensemble features of four color sets, and even those of a superset. Previous work has empirically investigated such a notion of hierarchical coding in visual perception and memory (Brady & Alvarez, 2011; Corbett, 2017; Halberda et al., 2006; Im & Chong, 2014; Im, Zhong, & Halberda, 2016). This work collectively suggests that the nature of visual representations in perception and memory is constructive, hierarchical, and interactive across multiple levels of abstraction. Hierarchical and constructive visual representation in perception and memory can be made possible when the extraction of ensemble features is as rapid as that of individuals, so that both levels of representation are available to interact with each other. Indeed, previous studies have shown that ensembles can be extracted from groups of objects very rapidly (e.g., Im et al., 2016; Leib, Kosovicheva, & Whitney, 2016). The question that remains to be addressed is how the visual system utilizes rapidly extracted ensemble features to facilitate the perceptual and cognitive processes underlying hierarchical coding of complex, cluttered visual scenes. In the current study, we report three experiments that empirically tested the roles that global ensemble representations, created from multiple items in a visual image, play in rapid visual segmentation, categorization, and perceptual grouping of visual arrays. These new findings provide insight into how ensemble representations can be extracted from multiple sets of similar objects in a visual array and serve as perceptual bases for hierarchical coding of the scene.

In the context of texture perception and visual search, it has been shown that the visual system can split multiple textons or individual items into clearly distinguishable (e.g., pre-attentive; Julesz, 1981) global subsets, and the roles of various spatial factors have been widely discussed as principal determinants of subset formation: local proximity and local contrasts (Bacon & Egeth, 1991; Bravo & Nakayama, 1992; Itti & Koch, 2001; Treisman, 1988; Wolfe, 1994), as well as more global factors, such as abrupt violation of spatial statistics over a region (Nothdurft, 1992, 1993). Because spatial interactions play an important part in these models, the explanations for grouping and segmentation rely strongly on well-established mechanisms of space-based, retinotopic interactions akin to lateral inhibition (e.g., Knierim & van Essen, 1992). However, little prior work has examined how individual objects can be categorized into discrete, higher-level sets when the spatial layout does not support strong organization of similar elements into compact patches lying apart from dissimilar elements. Perceiving a set of apples among leaves and branches is an example showing how common the categorization of spatially intermixed sets can be in real-world perception.

Recently, Utochkin (2015) suggested that ensemble summary statistics can be a candidate representation that supports the rapid categorization of multiple objects into sets in a visual image. Ensemble representation can be “spatially abstract” (or “spatially blind” in an extreme case), in the sense that once summary statistics are extracted in another feature domain such as size or orientation, they do not have to retain exact knowledge of how individual elements are located. Thus, ensemble representation can be well suited to the rapid categorization of spatially intermixed items of different kinds. Human observers appear to be very sensitive to how the features of individual items in a visual array are distributed, such that their perceptual ability to segment and discriminate groups of items is systematically influenced by the shape of the feature distribution that is tested (e.g., Chetverikov, Campana, & Kristjánsson, 2016, 2017; Chong & Treisman, 2003; Corbett, Wurnitsch, Schwartz, & Whitney, 2012; Im & Halberda, 2013; Oriet & Hozempa, 2016; Rosenholtz, 2000; Utochkin, Khvostov, & Stakina, 2018; Utochkin & Yurevich, 2016).

According to Utochkin (2015; also see Utochkin & Yurevich, 2016; Utochkin et al., 2018), the central concept that is related to categorization is segmentability. Segmentability is derived from the shape of a feature distribution – in particular, from its peaks and pits. If individual features of all presented items smoothly cover the entire range of displayed features, thus forming either no sharp peak (as in the uniform distribution) or a single peak (as in the Gaussian distribution), such an ensemble is non-segmentable and is likely to be categorized as consisting of items of one kind. In contrast, if individual features are distributed unevenly, forming dense clusters (peaks) separated by relatively large gaps within the range, then such an ensemble is segmentable and is likely perceived as consisting of several categorical groups. A similar idea of the distributional difference as the determinant of efficient segmentation was suggested for explaining pop-out visual search (Hochstein, Pavlovskaya, Bonneh, & Soroker, 2018; Rosenholtz, 1999, 2001).

In their previous work, Utochkin and colleagues (Utochkin & Yurevich, 2016; Utochkin et al., 2018) manipulated the shape of ensemble distributions to empirically test whether it supported categorical grouping in the manner predicted by the segmentability hypothesis. To test the effects of distribution shape, they used indirect measurements, such as proportion correct or response time (RT), in visual search or texture discrimination task paradigms. In these paradigms, higher accuracy or faster RTs were considered to reflect greater segmentability between a target and distractors or between two different texture patches. Yet, they did not explicitly ask or instruct human observers to report whether they perceived objects in a visual array as belonging to the same or different categories based on the perceived feature distributions. Therefore, the first aim of the current study was to examine how human observers use such feature distributions for rapid categorization of individual objects into subsets in a visual image (Experiment 1). The second aim was to examine further how rapid categorization based on one feature dimension (e.g., size) serves as the basis for segmenting subsets of individual objects to mediate the extraction of ensemble summary statistics in another feature dimension (e.g., location).

Experiment 1

Rapid categorization by size: Assigning individual objects into subsets in a visual image

In Experiment 1, we first examined how human observers segment objects in an image into categorical subsets relying on the feature distributions of the objects. Although previous work has shown that human observers are sensitive to feature distributions in a visual image, none of it has directly tested how a feature distribution is utilized when observers perceive “subsets” in the image and categorize individual items into those subsets. Here we tested the hypothesis that human observers can categorize individual objects into two subsets in an image very rapidly, relying on summary statistics about the whole image that describe how the individual features of the objects are distributed (e.g., mean, median, variability, and the shape of the distribution). We first conducted a simple study using a straightforward and explicit approach by asking participants to categorize an individual item into one of two subgroups in a visual array based on the feature dimension of size (e.g., “does this circle belong to a larger set or a smaller set in this image?”). Although perceptual categorization by size (e.g., small vs. large sets) has not yet been tested in the context of ensemble coding, it is a testable and viable category. It is easy to imagine sorting a pile of apples into two piles, so that you can use the pile of small apples to make an apple pie and the pile of large apples to eat raw. We examined how participants’ categorization performance was affected by the shapes of the feature distributions of individual objects in the image. We predicted that participants’ ability to categorize individual objects into subsets would vary systematically with the global properties of the feature distribution (e.g., mean, variance, and the smoothness of the distribution presumably affecting ensemble segmentability) and with the individual objects’ relative positions in the distribution.

Method

Participants

Twenty undergraduate students (13 females; age range: 18–27 years) of the Higher School of Economics took part in the experiment for extra course credits. All reported having normal or corrected-to-normal vision, normal color vision, and no neurological problems. All were naïve as to the purpose of the experiment. Written informed consent was obtained for the experiment from the participants in accordance with the Declaration of Helsinki.

Apparatus and stimuli

The stimuli were developed and presented via PsychoPy v1.82 for Linux Ubuntu (Peirce et al., 2019) on a standard VGA monitor (19-in. screen diagonal, 75-Hz refresh rate, resolution of 1,200 × 800 pixels, which was 30.65° × 20.43° in visual angle). Observers responded by pressing keys on a computer keyboard with their dominant hand.

Each visual stimulus contained a set of 16 white circles (superset) randomly positioned on a gray background. The distribution of individual sizes of the white circles was divided into two categories (subsets) containing eight items each: “small” (ranging from ~.3° to ~1.3° in visual angle across all trials) and “large” (ranging from ~.7° to ~2.3° across the entire experiment). The overlap between the entire ranges of “small” and “large” ensured that the categorical belongingness of an item with a particular size could change from trial to trial, thereby encouraging participants to “calibrate” their impression of an item’s category based only on the current trial. Item sizes in one half of the circles were always smaller than the median of the whole ensemble, and item sizes in the other half were always greater than the median. The grand mean diameter varied over a broad enough range (from 0.8° to 1.5°) across trials to ensure that the “categorical boundary” varied unpredictably and that observers relied on an impression from the current trial only, rather than on the memory of the grand mean diameter of the whole ensemble from previous trials. The individual sizes of the circles could be drawn from one of three distributions varying in range and shape, detailed in the following:

1) Two-peaked distribution: To generate a size distribution that contains two narrow distributions of highly separable subsets, we used a large-set:small-set mean ratio of 3:1, where the gap between the smallest circle of the large set and the largest circle of the small set was about as large as 100% of the smaller set’s mean. Within each subset, individual sizes differed from the set mean by -21%, -15%, -9%, -3%, 3%, 9%, 15%, and 21%. It can be seen in Fig. 1A that the size distributions for each of the sets were relatively narrow, but together they formed two clusters of sizes separated by a substantial gap. We predicted that such size distributions would make segmentation and categorization of two subsets within a superset relatively easy.

2) Smooth narrow distribution: We generated a smooth narrow distribution to provide the same relative range within each category as in the two-peaked distribution condition (from -21% to 21% of the set mean) but with a smaller separation between the sets, not much exceeding the separation within each set. To generate the size distributions of the two sets, we used a large-set:small-set mean ratio of 1.7:1. In this case, the transition between sets was only slightly bigger than the transition within each set (Fig. 1B). We predicted that this condition would make the segmentation of two subsets based on the size distributions more difficult than in the two-peaked distribution condition, since there was no discontinuity in the overall distribution that could support peak separation.

3) Smooth fat distribution: We generated a smooth distribution with a greater bandwidth from which the individual sizes belonging to both subsets were drawn, ensuring that the overall range between the smallest and biggest items of the superset was about the same as in the two-peaked distribution. Here, the large-set:small-set mean ratio was 2.4:1, and individual sizes within each set differed from the set mean by -35%, -25%, -15%, -5%, 5%, 15%, 25%, and 35% (Fig. 1C). We predicted that segmenting and categorizing two subsets in this condition would be difficult, since the two distributions of the subsets were not easily segmentable.

Fig. 1

Histograms of size distributions used in Experiment 1: (A) two-peaked distribution; (B) smooth narrow distribution; (C) smooth fat distribution. The shaded regions show the space between the large and small subsets where the category changes. The solid vertical line depicts the superset grand mean and the dashed lines depict subset mean sizes. Note that, although in each subset the sizes cover the same relative range (percentage of the subset mean size), the absolute ranges are different, as they are scaled to fit Weber’s law

Figure 1 describes how the individual sizes in the small and large sets were generated and distributed based on our size-generation algorithm. Because the actual mean sizes of the small and large sets varied on every trial, we plotted the distributions of sizes as proportions of the mean sizes of the small and large subsets. As shown in Fig. 1, individual sizes were spaced equally (by 6% from their adjacent neighbors) within their categories (as they were scaled relative to the categorical mean sizes). On the other hand, the individual sizes ended up being spaced unequally in terms of the entire ensemble, so that the absolute step size between items within the large set was always bigger than that within the small set, according to the large-set:small-set mean ratio. This asymmetry resulting from our size-generation algorithm in fact complied with Weber’s law, according to which the perceived difference between two sizes is approximately proportional to their sizes, in the domains of both individual size and mean size (e.g., Allik et al., 2013). This way of generating individual circle sizes in the small and large sets was implemented in previous work and has been shown to ensure that the perceived variability of individual members (e.g., variance or range) is roughly the same across the categories (e.g., Khvostov & Utochkin, 2019).
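To make this generation procedure concrete, the following is a minimal sketch (our reconstruction in Python, not the original PsychoPy code) of how individual diameters could be produced for the three distribution conditions; the function name, the way the grand mean is split into subset means, and the sampling of the grand mean are our assumptions based on the description above.

```python
import random

# Relative deviations of individual items from their subset mean, per condition,
# taken from the descriptions of the three distributions above.
DEVIATIONS = {
    "two_peaked":    [-0.21, -0.15, -0.09, -0.03, 0.03, 0.09, 0.15, 0.21],
    "smooth_narrow": [-0.21, -0.15, -0.09, -0.03, 0.03, 0.09, 0.15, 0.21],
    "smooth_fat":    [-0.35, -0.25, -0.15, -0.05, 0.05, 0.15, 0.25, 0.35],
}

# Large-set : small-set mean ratios for the three conditions (from the text).
MEAN_RATIOS = {"two_peaked": 3.0, "smooth_narrow": 1.7, "smooth_fat": 2.4}

def generate_sizes(condition, grand_mean_deg):
    """Return 16 circle diameters (deg): 8 'small' and 8 'large' items.

    Within each subset, items deviate from the subset mean by fixed
    percentages, so the absolute spacing scales with the subset mean
    (the Weber-law-consistent property noted in the text).
    """
    ratio = MEAN_RATIOS[condition]
    # Split the grand mean into small- and large-subset means with the given
    # ratio (a simplifying assumption; the paper does not spell out this step).
    small_mean = 2 * grand_mean_deg / (1 + ratio)
    large_mean = small_mean * ratio
    small = [small_mean * (1 + d) for d in DEVIATIONS[condition]]
    large = [large_mean * (1 + d) for d in DEVIATIONS[condition]]
    return small + large

# Example: one trial with the grand mean drawn from the 0.8-1.5 deg range.
sizes = generate_sizes("two_peaked", random.uniform(0.8, 1.5))
```

Because the deviations are defined as percentages of the subset mean, the absolute spacing automatically grows with the subset mean, which is the Weber-law-consistent property noted above.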

Moreover, this algorithm also made the whole feature distribution inherently skewed, resulting in asymmetric probability density. As a result, the grand mean, suggested to be one of the robust ensemble representatives (Alvarez, 2011; Ariely, 2001; Chong & Treisman, 2003; Khayat & Hochstein, 2018, 2019), was shifted to the right compared to the median (which is defined as a categorical boundary in our task). In other words, the smallest items from the large category were always closer to the grand mean than the largest items of the small category. As can be seen in Fig. 1B and C, the smallest items of the large category were even smaller than the grand mean in the smooth narrow and smooth fat conditions. As will be seen later, this mismatch between the task-defined categorical boundary (median) and the grand mean would provide a sensitive case to query the type of the internal rule (e.g., whether observers rely more on the median or the grand mean) for establishing that perceptual boundary.

Procedure

Experimental sessions were run in a darkened room. Participants were seated approximately 50 cm from a monitor. On each trial, they were instructed to categorize a probed item as small or large, depending on its relative size within the whole set, including all the items presented briefly in the visual array. The categorization rule was explicitly stated as median-based: Participants were told to answer whether the probed item belonged to the smallest or the largest half of the set.

A sample trial of Experiment 1 is shown in Fig. 2. After a ready signal, a stimulus image containing all 16 circles was presented for 100 ms. After the stimulus image, all but one of the circles disappeared, and participants had to indicate which of the subsets – the large set or the small set – the remaining circle (the test circle) seemed to belong to. Participants responded whether this circle belonged to the “small” or the “large” category by pressing the “left” or the “right” button, respectively. After the response was made, feedback was provided. Participants completed a total of 576 trials (16 relative sizes × 3 size distributions × 12 repetitions per condition). Twelve trials were added at the beginning of the experimental session for practice but were excluded from data analysis.

Fig. 2

An example of a trial in Experiment 1

Results and discussion

Trials with excessively fast responses (< 200 ms) were excluded from the analysis. Data from one participant who made such excessively fast responses ~56% of the time were also excluded from the analysis. Therefore, the data from 19 participants were analyzed. Overall, less than 0.3% of trials were excluded from the data analysis of these participants due to excessively fast responses.

Participants’ response accuracy for categorizing an item into one of the two subsets showed clear V-shaped curves for all three types of size distribution: two-peaked, smooth narrow, and smooth fat. Figure 3A plots the participants’ percentage of correct responses as a function of the distance between the size of the single circle to be categorized and the categorical boundary (median) of the entire set of circles shown in the visual image. Each of the 16 items shown in the visual image had its own unique size, and the absolute distance from the categorical boundary strongly depended on the type of size distribution. To resolve and control for such variations, we merged each pair of neighboring sizes, starting with the smallest, and ranked them: We assigned ranks -4 to -1 to the items of the “small” category (with -4 being the smallest item) and ranks 1 to 4 to the “large” category (with 4 being the largest item). Therefore, the ranks define the relative position of a probed circle, both within a category (e.g., either the “small” or the “large” category) and away from the categorical boundary.
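As an illustration, a minimal sketch of this rank-assignment scheme (our reconstruction, not the original analysis code):

```python
def assign_ranks(sizes):
    """Map the 16 item sizes of one display onto ranks -4..-1 (small half)
    and +1..+4 (large half), merging neighboring sizes pairwise after
    sorting, as described in the text."""
    order = sorted(sizes)
    ranks = {}
    for i, size in enumerate(order):
        pair = i // 2            # 0..7: index of the merged pair of neighbors
        rank = pair - 4          # -4..3
        if rank >= 0:
            rank += 1            # skip 0 so ranks run -4..-1, +1..+4
        ranks[size] = rank
    return ranks
```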

Fig. 3

Percent correct in the categorization task as a function of the relative item size in Experiment 1: (A) for different types of size distribution and (B) overall performance (averaged across the distribution types). The vertical dashed line shows the categorical boundary (relative position 0), which was not actually presented. Error bars denote the standard error of the mean

Not surprisingly, participants’ accuracy for item categorization was worst when the test circle to be categorized was close to the categorical boundary (e.g., ±1), but it systematically improved as the size of the circle deviated more from the categorical boundary. This observation was confirmed by a significant main effect of the size distance rank (F(7,126) = 125.96, p < 0.001, η2G = 0.68) in a two-way repeated-measures ANOVA with the factors of size distance rank (eight levels: -4, -3, -2, -1, +1, +2, +3, and +4) and distribution type (three conditions: two-peaked, smooth narrow, and smooth fat distributions).

From the same ANOVA, we also found a significant main effect of distribution type (F(2,36) = 157.14, p < 0.001, η2G = 0.40). Specifically, the overall categorization accuracy was better for the two-peaked distribution than for the other two conditions (smooth narrow and smooth fat), which was further confirmed by post hoc contrast analyses (two-peaked vs. smooth narrow: t(18) = 14.39, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 3.3; two-peaked vs. smooth fat: t(18) = 10.55, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 2.4). For the two-peaked distribution, categorization accuracy mostly reached a near plateau, except for the point at size rank +1. Conversely, the other two conditions showed categorization accuracy that strongly depended on the size difference between the test circle and the mean of the superset. In turn, participants were overall more accurate with the smooth fat distribution than with the smooth narrow one (t(18) = 9.99, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 2.3), presumably because the former included some exemplars more distinct from the categorical boundary (at least two extreme values at both tails of the size distribution). This pattern of results suggests that the ease of categorical parsing varied systematically across the distribution types of the superset. The participants appeared to be sensitive to the size distributions of the circles and capable of categorizing individual members into one of the subsets based on the size distributions shown in the visual stimuli.

Moreover, both the range and smoothness of the distributions seemed to affect categorization accuracy systematically, as supported by the significant interaction (F(14,252) = 33.06, p < 0.001, η2G = 0.39) between the two factors – size distance rank and distribution type. This significant interaction reflects the increasing depth of the “dips” of the V-shaped functions across the distribution types (Fig. 3A). That is, participants showed quite similar levels of accuracy for the extreme items at the “tails” (e.g., ranks -4 and +4), whereas the decrement in accuracy near the center of the distribution (especially ranks -1 and +1) was much greater in the two smooth distributions than in the two-peaked distribution. This pattern presumably arose because the middle items of the smooth distributions were closer to the boundary and the between-category gap was smaller than in the two-peaked distribution.
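For readers who wish to reproduce this kind of analysis, a two-way repeated-measures ANOVA of this design can be run with standard tools; the sketch below assumes a long-format data file with hypothetical column names (the paper does not describe its analysis software), and it reports F tests rather than generalized eta squared.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# One row per participant x rank x distribution cell, with columns 'subject',
# 'rank' (-4..-1, +1..+4), 'distribution' (two_peaked / smooth_narrow /
# smooth_fat), and 'accuracy' (proportion correct). File and column names
# are hypothetical.
df = pd.read_csv("exp1_accuracy_long.csv")

aov = AnovaRM(df, depvar="accuracy", subject="subject",
              within=["rank", "distribution"]).fit()
print(aov)  # F tests for the two main effects and their interaction
```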

Finally, we found an interesting asymmetry in our V-shaped functions. For all three types of size distributions, the worst categorization performance was observed at rank +1, suggesting that participants made more errors when the probed test circle was the smallest circle in the large category (but slightly larger than the categorical boundary), erroneously categorizing it into the small subset. As a result, the V-shaped functions were all shifted to the right relative to the task-defined categorical boundary. When collapsed across the distribution types (Fig. 3B), this shift was most clear: The observed categorization accuracy at rank -1 matched that at rank +2, rank -2 matched rank +3, and rank -3 matched rank +4 (ps > 0.33, Cohen’s ds < 0.23), whereas symmetrical ranks yielded strongly asymmetrical results, with a systematic advantage for the small category (ps < 0.001, Bonferroni-corrected α = 0.002, ds > 1.1; except rank -4 vs. rank +4, p = 0.3, d = 0.25). These results suggest that participants made more errors when they had to categorize items from the large category than from the small category, especially when these items were relatively close to the categorical boundary. To recap, the way we generated individual sizes made the whole distribution skewed, such that the grand mean was shifted more towards the smallest items of the large category than towards the largest items of the small category. Importantly, in the smooth narrow and smooth fat distributions, rank +1 items were greater than the median but smaller than the grand mean. For these particular points, we observed that accuracy dropped to chance (smooth narrow: accuracy = 0.52, smooth fat: accuracy = 0.41, Fig. 3A). At the same time, in the two-peaked distribution, where even the smallest item of the “large” category was considerably greater than the grand mean, the drop in accuracy at rank +1 was not that dramatic (accuracy = 0.74, Fig. 3A). Therefore, the whole pattern of asymmetry, with the “rank +1 effect” closely tracking an item’s actual position relative to the grand mean, suggests that the internal categorical boundary was shifted in the direction of the grand mean of the superset. The finding that this shift occurred despite the instruction to use the median criterion of categorization also suggests that people tend to rely automatically on a representation of the mean as a categorical boundary.

Experiment 2

Categorization in the domain of size and ensemble extraction in the domain of location

In Experiment 1, we showed that participants could very rapidly categorize an individual object into one of two subsets based on the average size of the superset. The ease of such segmentation and categorization varied systematically both with the size difference between the individual object to be categorized and the mean of the superset and with the shape of the feature distribution of all the items shown in the visual array. In Experiment 2, we extended the findings from Experiment 1 and further tested how participants use the distributional properties of an ensemble in one feature dimension (e.g., size) to parse items into categories and make task-relevant, category-specific judgments in another feature dimension. Returning to the apple-sorting example, you might also want to make sure you choose apples that are ripe enough as well as big enough. You would first compare the overall size of each pile of apples to check which pile is relatively larger, but then also compare other qualities, such as the overall color or hue of the two piles (smaller and larger sets), for the final decision. Experiment 2 sought to characterize such a process: ensemble-based segmentation in one feature dimension (e.g., size) for the extraction of ensembles in another feature dimension (e.g., centroid).

In many previous studies, ensemble extraction in a particular feature domain (e.g., average size, numerosity, and so on) has been tested independently from other features. For example, estimation of the average size or numerosity of multiple subsets was tested by using different color cues that are discrete and separable enough for each of the subsets (e.g., Chong & Treisman, 2005; Halberda et al., 2006; Im & Chong, 2014; Utochkin & Vostrikov, 2017). In other studies, location (e.g., spatial separation between subsets of circles) was utilized for segmentation of subsets (e.g., left vs. right sets; Chong & Treisman, 2003; Corbett, Wurnitsch, Schwartz, & Whitney, 2012; Epstein & Emmanouil, 2017). Many of these studies assume that multiple subsets can be (almost) perfectly segmented based on the color or spatial cues prior to ensemble extraction. Although the results collectively suggest that human observers are quite capable of segmenting subsets by color cues or spatial separation, this process is not necessarily “cost-free,” given that combining color and location cues for segmentation can significantly improve participants’ performance on ensemble extraction from multiple subsets, compared to when only one feature is provided as a segmentation cue (e.g., Im, Park, & Chong, 2015).

It has been shown that within-subset ensemble judgments can be penetrable to the influence of another, irrelevant subset (Inverso, Sun, Chubb, Wright, & Sperling, 2016; Oriet & Brand, 2013; Utochkin et al., 2018). Here, we hypothesize that the degree of such penetrability should depend on the ease and robustness of segmentation of subsets based on the shape of the distribution in the feature domain of categorization. That is, if the estimated summary differs between two subsets and these subsets can be segmented into two different categories quite easily (e.g., as in the two-peaked distribution in Experiment 1), then the ensemble representation extracted by participants would be closer to the genuine summary of the subsets. In contrast, if subsets are hardly distinguishable and less separable as two categories, then participants’ ensemble estimation of the subsets should be reported with greater error, biased towards the common summary of all subsets (e.g., grand mean), possibly because some elements of the irrelevant subset are confused with relevant elements. Furthermore, we predicted that the ease of subset segmentation determined by the shape of the distribution in one feature domain (e.g., size) would systematically modulate the precision of ensemble extraction in another visual feature domain (e.g., location).

Participants

Twenty-one students (16 females; age range: 19–24 years) of the Higher School of Economics took part in Experiment 2. All reported having normal or corrected-to-normal vision, normal color vision, and no neurological problems. All were naïve as to the purpose of the experiment. Written informed consent was obtained for the experiment from the participants in accordance with the Declaration of Helsinki.

Apparatus and stimuli

In Experiment 2, participants responded by using a mouse connected to the computer to indicate a position in the display with their dominant hand. As in Experiment 1, each stimulus included white circles of different sizes presented on a gray background. As in Experiment 1, a superset contained 16 circles that were divided into two categories (subsets; large and small) based on their sizes, such that the “large” subset contained the eight larger circles in the superset, whereas the “small” subset contained the eight smaller circles. The grand mean sizes and size ranges for the “small” and “large” subsets were generated the same way as in Experiment 1. The detailed procedures and parameters for creating the three different size distributions (two-peaked, smooth narrow, and smooth fat) were identical to those described in Experiment 1.

The locations of individual circles were pre-generated for all the stimuli using the following algorithm (see Fig. 4 for a visual illustration). First, all the (x, y) coordinates for the center locations of the circles were randomly chosen within the central display area (1,100 × 700 pixels, or 28.05° × 17.85° in visual angle), which was surrounded by a margin of 2.55°. The 16 randomly chosen locations were then assigned to one of three imaginary parts (left, middle, and right) based on their x-coordinates. Specifically, if the x-coordinates were within the range of [1–280], the (x, y) coordinates were assigned to circles of subset 1 (which could be either the large or the small subset), so that the centroid of subset 1 was slightly shifted towards the left. If the x-coordinates were within the range of [821–1,100], the locations were assigned to circles of subset 2, so that its centroid was slightly shifted towards the right. Because we wanted to ensure that circles of both subsets were spatially intermixed in the center of the display, half of the (x, y) coordinates within the central area (with the range of [301–900] for the x-coordinates) were randomly chosen and allocated to subset 1, and the rest were allocated to subset 2. This spatial arrangement allowed us to ensure that some of the circles of the two subsets were spatially intermixed, so that simply clicking somewhere in the left visual field for one subset and in the right visual field for the other subset would not systematically improve participants’ centroid extraction. It also ensured that the stimulus always appeared to occupy locations randomly chosen from the same designated area across conditions and over trials.
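A rough sketch of this location-assignment logic follows (our reconstruction, not the original stimulus code). The pixel boundaries follow the ranges quoted above, the central strip is simplified to everything between the left and right bands, and the original additionally ensured exactly eight circles per subset, which this simplified version does not enforce.

```python
import random

def generate_locations(n_items=16, width=1100, height=700):
    """Randomly place items and split them into two subsets whose centroids
    are shifted left and right, while keeping central items intermixed."""
    points = [(random.uniform(1, width), random.uniform(1, height))
              for _ in range(n_items)]
    subset1, subset2, central = [], [], []
    for x, y in points:
        if x <= 280:        # left band -> subset 1 (centroid shifted left)
            subset1.append((x, y))
        elif x >= 821:      # right band -> subset 2 (centroid shifted right)
            subset2.append((x, y))
        else:               # central strip (quoted as x in [301-900] in the
            central.append((x, y))  # text; simplified to everything in between)
    # Items in the central strip are split randomly between the two subsets,
    # so the subsets remain spatially intermixed in the middle of the display.
    random.shuffle(central)
    half = len(central) // 2
    subset1 += central[:half]
    subset2 += central[half:]
    return subset1, subset2
```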

Fig. 4

Visual illustration of the areas (left, middle, and right) for assigning locations of circles. The (x,y) coordinates that were first randomly generated were assigned to one of these areas based on the x coordinates

Procedure

Experimental sessions were run in a darkened room. Participants were seated approximately 50 cm from the monitor. Prior to the experiment, participants received six demo trials intended to provide them with a sense of the “centroid.” In these trials, we showed them groups of circles for an unlimited time, giving them an opportunity to estimate the centroid of the circles in a completely visible stimulus. To report the estimated centroid, participants moved a mouse cursor (a small black cross) around the screen and clicked the left mouse button to make their response. Immediately after the click, the cursor stayed on the screen, and a red cross appeared to mark the position of the correct answer, so the distance between the two crosses served as feedback about response precision. Each experimental trial (shown in Fig. 5) began with a pre-cue that informed participants of which set (small subset, large subset, or all items) they had to attend to. The stimulus image was presented for 500 ms. Immediately after the stimulus, an empty screen was presented with a mouse cursor for adjusting the centroid of the pre-cued set. The initial location of the mouse cursor was randomly determined so that it would not systematically bias participants’ responses. After a response was made, a feedback display was presented with the original image returned and a red cross indicating the correct centroid location. The pre-cued set of circles was shown solid on the feedback screen, whereas the irrelevant circles (if any) were shown as outlines.

Fig. 5

An example trial of Experiment 2 demonstrating the “attend-subset” task

There were three conditions in Experiment 2: attended-subset, half-set-only, and superset. In the attended-subset condition, participants were instructed by a pre-cue to attend to either the large set or the small set (defined by the median, as in Experiment 1) in the following stimulus image and then report the average location (centroid) of the attended set only. In the half-set-only condition, only one subset (either the small or the large subset) was presented, so that participants were not required to segment or parse any subsets from the display. Finally, the superset condition did not require participants to segment any subsets either, even though both small and large subsets were presented in the stimulus image. Thus, participants’ performance in the attended-subset condition would reflect their perceptual ability to categorize individuals into two subsets and extract an ensemble representation from a subset. Their performance in the half-set-only condition would reflect their ability to extract the centroid of a subset when there is no need to segment subsets at all. Finally, performance in the superset condition would tell us whether the shape of the size distribution of all the individual objects still influences participants’ perceptual ability to extract the centroid of the superset, even when they did not need to segment subsets. In both the half-set-only and the superset conditions, participants were pre-cued that they would have to attend to all the circles and then report their centroid. The trials of all the conditions (attended-subset, half-set-only, and superset) were randomly interleaved, rather than blocked, to ensure that participants would not consistently employ different strategies across the conditions. In each condition, we used the three types of size distributions for individual circles from Experiment 1: two-peaked, smooth narrow, and smooth fat. Thus, we had a 3 (condition: attended-subset, half-set-only, and superset) × 3 (size distribution: two-peaked, smooth narrow, and smooth fat) experimental design.

Results and discussion

For each trial, the correct centroid was calculated as the average of the (x, y) coordinates of all the set members. As a measurement of error, we calculated the Euclidean distance between the actual centroid of the set to be extracted and the location at which participants pointed with the mouse cursor. Figure 6A summarizes the mean error for each condition (attended-subset, half-set-only, or superset) and each type of size distribution (two-peaked, smooth narrow, or smooth fat). We first observed that the attend-subset condition showed greater response errors than the other two conditions for all three distribution types. A two-way repeated-measures ANOVA with the factors of task condition (three levels: attended-subset, half-set-only, and superset) and distribution type (three levels: two-peaked, smooth narrow, and smooth fat) confirmed this observation with a strong main effect of task condition (F(2,40) = 94.45, p < 0.001, η2G = 0.50). Further contrast analyses showed that when participants did not have to segment subsets of circles, as in the half-set-only and superset conditions, their accuracy was significantly better for all three distribution conditions (attend-subset vs. half-set: t(20) = 11.56, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 2.52; attend-subset vs. superset: t(20) = 9.12, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 1.99). We also found a significant but weak main effect of distribution type (F(2,40) = 15.02, p < 0.001, η2G = 0.02) and a significant but weak interaction between the two factors (F(4,80) = 5.19, p < 0.001, η2G = 0.04).
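For clarity, the centroid and the error measure can be expressed in a few lines (a sketch, not the original analysis code):

```python
import math

def centroid(points):
    """Mean (x, y) location of a set of items."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def centroid_error(points, response_xy):
    """Euclidean distance between the true centroid of the cued set and the
    location the participant clicked (the error measure used here)."""
    cx, cy = centroid(points)
    return math.hypot(response_xy[0] - cx, response_xy[1] - cy)
```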

Fig. 6

(A) Centroid positioning error as a function of the task and the size distribution in Experiment 2. Error bars denote the standard error of the mean. (B) The spatial distribution of correct answers (large dots) and participants’ responses (small dots) on a screen in the various conditions of Experiment 2

The finding that the half-set-only and the superset tasks were performed better than the attend-subset tasks indicates that there was a cost of subset selection for later extraction of ensembles. This finding is in line with previous evidence that segmenting subsets of items based on size or orientation is a strong limiting factor for ensemble tasks (Inverso et al., 2016; Oriet & Brand, 2013; Utochkin et al., 2018; however, the cost may be minimal in the domain of color – see Sun, Chubb, Wright, & Sperling, 2016a). Note that in previous studies, category-defining features could be both extremely distinct between categories and homogeneous within the category to make segmentation easy (e.g., strictly vertical and strictly horizontal lines – Inverso et al., 2016; Oriet & Brand, 2013; or extremely short and extremely long lines – Utochkin et al., 2018). Even in these cases, however, observers were imperfect at reporting subset summaries. Compared to the previous studies, category-defining features in the current study (e.g., size) were distributed in a more heterogeneous and continuous manner; thus, observers were more frequently confused overall when segmentation of subsets was required.

Of principal interest to us, we observed that when participants had to segment one of the categories to extract and report the centroid of the category in the attend-subset condition, their accuracy for the two-peaked distributions was better than the other two distributions (two-peaked vs. smooth narrow: t(20) = 3.96, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 0.86; two-peaked vs. smooth fat: t(20) = 4.25, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 0.93). There was no difference, however, between the two smooth distributions – narrow versus fat (t(20) = 0.68, p = 0.51, Cohen’s d = 0.15). This result suggests that enhanced peak separation of the feature distribution (i.e., good segmentability) facilitates the extraction of ensemble features from the two independent categories. The two-peaked distributions with larger separation yielded the best accuracy for the attend-subset task, presumably providing the best categorical separation.

In the superset condition, in which participants were to extract the centroid of all the circles without any segmentation of subsets, we found that precision for displays containing the smooth fat distributions was better than for the two other distributions (smooth fat vs. two-peaked: t(20) = 3.62, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 0.79; smooth fat vs. smooth narrow: t(20) = 6.78, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 1.48). On the other hand, in the half-set-only condition, in which only circles sampled from one single subset were presented, we found that the smooth narrow distribution yielded impaired precision compared to the two other distributions (smooth narrow vs. two-peaked: t(20) = 3.89, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 0.85; smooth narrow vs. smooth fat: t(20) = 3.14, p = 0.005, Bonferroni-corrected α = 0.017, Cohen’s d = 0.69). This result suggests that participants were still sensitive to the distribution of the subsets of circles even when segmentation of subsets was not required or helpful for extracting the centroids of the entire sets. Moreover, the fact that the two-peaked distribution was not any better than the other two distribution types in the superset condition suggests that the two-peaked distribution improved participants’ performance on ensemble extraction only when subset categorization was necessary. Further research beyond the main focus of the current article is necessary to understand the nature of these effects.

Although the centroid task has been utilized as a useful tool to measure ensemble perception in many previous studies (Alvarez & Oliva, 2008; Inverso et al., 2016; Sun et al., 2016a, b; Rodriguez-Cintron, Wright, Chubb, & Sperling, 2019), its use in our experiment revealed some practical challenges that are worth noting. The first reasonable concern is that participants’ accuracy for extracting centroids in the three conditions (attended-subset, half-set-only, and superset) could be affected differently by the overall spatial arrangements of objects in these conditions, not solely by how the objects were categorized in the size domain. This is particularly the case because the physically limited display was already quite packed with multiple circles, which inevitably caused the superset condition, which included all the circles in the visual array (thus twice as many circles as in the pre-cued subset condition), to have centroids closer to the center of the screen than in the attended-subset and half-set-only conditions (see Fig. 6B). However, our current findings cannot be explained merely by the concentrated pattern of centroids in the superset condition: If this were the case, participants would have been more precise in the superset condition than in both the attend-subset and half-set-only conditions by simply choosing locations around the center. Instead, we observed that participants’ accuracy for centroid extraction in the superset condition was comparable to that in the half-set-only condition, despite their different profiles of centroid concentration (see Fig. 6A).

One might argue that the greater error in the attend-subset condition reflected the possibility that participants had simply reported the centroid of everything (e.g., the superset) instead of trying to categorize items to determine the centroid of the attended subset only. To evaluate this possibility, we tested whether participants’ centroid estimates in the attend-subset trials lay in the same area as their estimated centroids in the superset trials. We found that the estimated centroid locations in the attend-subset trials were almost twice as far from the superset centroid as in the genuine superset trials (7.6° vs. 4°, respectively; comparison: t(20) = 7.31, p < 0.001, Cohen’s d = 1.6). This result suggests that, in the subset trials, observers did not merely report the centroid of all the items (e.g., the superset), but marked locations toward the extracted centroid of the pre-cued subset by categorizing individual items and segmenting subsets. The following analysis further shows that the greater errors in the attend-subset condition are more likely to reflect the additional processing noise and cost resulting from incomplete segmentation of the pre-cued subset from the superset, with responses reflecting a mixture of the centroid of the pre-cued subset and the global centroid of the superset.

Bias towards the center

Note that the unidimensional Euclidean distance, as the measurement of centroid localization errors, does not necessarily capture the whole spectrum of changes in the two-dimensional space of the screen. Specifically, it does not provide a single measure of directionality that would allow us to unambiguously estimate the direction of the bias: either towards or away from a certain location. In fact, the same quantitative change in Euclidean distance can be caused either by attraction in one dimension or by repulsion in another. Therefore, here we used a “triangular” algorithm to evaluate the directionality of the bias. We calculated (1) the mean of the distances between participants’ responses and the correct answers (the actual centroids of the attended subsets), here termed the Response-Subset distance (mean = 7.3°, SD = 5.3°), (2) the mean of the distances between participants’ responses and the centroids of the supersets, here termed the Response-Superset distance (mean = 7.6°, SD = 5.0°), and (3) the mean of the distances between the correct answers (the actual centroid locations of the attended subsets) and the centroids of the supersets, here termed the Subset-Superset distance (mean = 10.2°, SD = 3.2°). We found that both the Response-Subset distance and the Response-Superset distance were shorter than the Subset-Superset distance (Response-Subset vs. Subset-Superset: t(19) = 5.60, p < 0.001, Cohen’s d = 1.2; Response-Superset vs. Subset-Superset: t(19) = 6.44, p < 0.001, Cohen’s d = 1.4). This suggests that the average response lay roughly in the area between the subset centroid and the superset centroid. In other words, participants marked the subset centroid in its real neighborhood but with a substantial shift towards the superset centroid. If we imagine a triangle with sides corresponding to the Subset-Superset, Response-Subset, and Response-Superset distances, then the observed ratio of side lengths corresponds to a configuration in which the apex representing the observer’s response projects onto the Subset-Superset segment: This is a rough geometrical approximation of the bias towards the superset centroid. As a counter-example, if our observers had guessed random locations around the superset centroid, then the Response-Subset distance would have been greater than or equal to the Subset-Superset distance. As another counter-example, if our observers had been unbiased, then the Response-Superset distance would have been equal to the Subset-Superset distance. Finally, if our observers had been biased away from the superset (repulsion), then the Response-Superset distance would have been greater than the Subset-Superset distance.
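The logic of this “triangular” comparison can be summarized in a short sketch (our reconstruction; the function names are hypothetical):

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def bias_distances(response, subset_centroid, superset_centroid):
    """Return the three distances used to diagnose the direction of the bias."""
    return {
        "response_subset":   dist(response, subset_centroid),
        "response_superset": dist(response, superset_centroid),
        "subset_superset":   dist(subset_centroid, superset_centroid),
    }

# Interpretation, as argued in the text: if both the response-subset and the
# response-superset distances are shorter than the subset-superset distance,
# the response lies between the two centroids, i.e., it is attracted towards
# the superset centroid; a response-superset distance exceeding the
# subset-superset distance would instead indicate repulsion.
```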

One more reservation regarding Experiment 2 is that, unlike in Experiment 1, objects from the large and small categories were not mixed in space entirely randomly. To recap, the two subsets had centroids that were deliberately shifted apart, such that some category members were intermixed within the region where the subsets overlapped, whereas others, outside the overlap, were not. This inherently resulted in a size gradient that could be used to judge the category center without category parsing (Rodriguez-Cintron et al., 2019). We acknowledge that this is a plausible alternative strategy. However, it does not undermine our main findings. We found that centroids of subsets drawn from two-peaked distributions were localized more precisely than those drawn from smooth distributions, regardless of the ranges of the feature distributions (narrow or fat). This pattern cannot be fully explained by the alternative strategy of gradient-based centroid localization, because the smooth fat distributions provide a much stronger gradient (and therefore more efficient gradient-based centroid localization) than the smooth narrow distributions. Rather, it seems to reflect the fact that the efficiency of centroid localization depended on how abrupt the difference between the subsets was, as defined by the shape of the feature distribution. Such an ability to see an abrupt transition between subsets in an uncertain spatial structure (like the overlapping region of both subsets in our experiment) is exactly what is meant by the concept of feature-based segmentability (Utochkin et al., 2018).

Experiment 3

Categorization in the domain of length and ensemble extraction in the domain of orientation

Given the practical limitations of the centroid paradigm (such as the limited screen space or the inherent size gradient that arises when the small and large subsets are shifted apart to provide a substantial centroid difference), we conducted Experiment 3 to further test our hypothesis that ensemble distributional properties determine the segmentability of subsets for categorization and category-specific ensemble extraction in different feature domains. Here we used orientation instead of the centroid to minimize the challenges caused by the limited space of the display and to replicate and generalize the findings of Experiment 2. Orientation and size are traditionally assumed to be separable feature dimensions, or at least to interact only very weakly (e.g., Ashby & Lee, 1991; Garner & Felfoldy, 1970; Shepard, 1964; Ward, 1985; but also see Potts, Melara, & Marks, 1998). Unlike the centroid task, there is no need to create an arbitrary spatial shift between subsets to make them separable in the domain of orientation. Therefore, different subsets can be completely spatially intermixed, and no gradient cues are available. If we could replicate the effects of the distribution types on the precision and the bias in extracting the average orientation, we could conclude that the segmentability of a subset based on a different feature dimension modulates the ease of extraction of an ensemble representation from that subset.

Participants

A new group of 20 undergraduate students (17 females; age range: 18–28 years) of the Higher School of Economics took part in the experiment for extra course credits. All reported having normal or corrected-to-normal vision, normal color vision, and no neurological problems. Written informed consent was obtained for the experiment from the participants in accordance with the Declaration of Helsinki.

Apparatus and stimuli

The apparatus was the same as in Experiment 2. Sets of 16 white lines were presented as stimuli within a 16° × 16° square visual array. This visual array was divided into 16 cells (4 × 4) by an invisible grid (each cell subtending 4° × 4°). Each cell contained a single item of a set. Within each cell, an item could be randomly jittered within ± 0.8° in both the horizontal and vertical directions. All lines had a fixed width of 0.16°. The lengths of the lines varied from 0.6° to 3.7° and were drawn from one of the length distributions described below. Overall, these distributions followed the three types from Experiments 1 and 2 (two-peaked, smooth narrow, and smooth fat). However, since our stimuli in Experiment 3 were lines characterized by length rather than circles characterized by diameter, we changed the specific values. These values were adapted from the study by Utochkin et al. (2018), where length distributions were manipulated to provide high or low levels of segmentability of subsets.

1) Two-peaked distribution (i.e., bimodal distribution with large separation). The mean length of lines in the “long” subset was 3.35°, and the mean length of the “short” subset was 0.7°. For each subset of eight lines, individual lengths were approximately ± 3% and ± 9% of the subset (categorical) mean, with each length assigned to two lines.

2) Smooth narrow distribution. The mean length of lines in the “long” subset was 2.4°, and the mean length of the “short” subset was 1.7°. For each subset of eight lines, individual lengths were approximately ± 3% and ± 9% of the subset (categorical) mean. Therefore, each subset had the same relative range as the categories of the two-peaked distributions, but the overall superset range was much smaller.

  3. 3)

    Smooth fat distribution. The mean length of lines in the “long” subset was 2.8°, and the mean length of the “short” subset was 1.24°. For each subset of eight lines, individual lengths were approximately ± 10% and ± 35% of the subset (categorical) mean. Therefore, each category had a greater relative range compared with the previous two distributions, but the overall superset range was comparable with the two-peaked one.

The grand mean orientation was randomly chosen on each trial from the range between 1° and 180°. The difference in mean orientation between the “long” and “short” subsets was fixed at 30°. Whether the mean orientation of one subset was tilted clockwise or counter-clockwise relative to the other was randomly determined on each trial. Thus, the grand mean orientation of the superset was always ± 15° away from the mean orientation of each subset. The range of orientations within each subset was 60°, such that individual orientations were separated by ± 30°, ± 22°, ± 14°, and ± 5° from the subset mean. Within each subset, the lengths and orientations of individual members formed random conjunctions, so that there was no systematic or predictable correlation between these two features within a subset. As in Experiment 1, items from the two subsets were positioned randomly and spatially intermixed.
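For concreteness, the following Python sketch illustrates how line lengths and orientations could be generated for a single trial under the parameters described above. It is a minimal illustration only; the function name, the use of NumPy, and the random-assignment details are our assumptions rather than the code actually used to run the experiment.

```python
import numpy as np

def make_trial(distribution="two_peaked", mean_sep_deg=30.0, rng=None):
    """Illustrative generation of 16 line stimuli (8 'short' + 8 'long').

    Lengths follow the distribution parameters described above; orientations
    are drawn so that the two subset means differ by `mean_sep_deg` and each
    subset spans a 60-deg range around its own mean.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Subset mean lengths (deg of visual angle) and relative offsets per distribution
    params = {
        "two_peaked":    dict(means=(0.70, 3.35), offsets=(0.03, 0.09)),
        "smooth_narrow": dict(means=(1.70, 2.40), offsets=(0.03, 0.09)),
        "smooth_fat":    dict(means=(1.24, 2.80), offsets=(0.10, 0.35)),
    }[distribution]

    lengths = []
    for m in params["means"]:                 # 'short' subset first, then 'long'
        for off in params["offsets"]:         # +/- 3% and 9% (or 10% and 35%)
            lengths += [m * (1 - off)] * 2 + [m * (1 + off)] * 2   # each length on two lines
    lengths = np.array(lengths)               # 16 lengths, 8 per subset

    # Orientations: grand mean is random; subset means lie +/- 15 deg from it
    grand_mean = rng.uniform(1, 180)
    sign = rng.choice([-1, 1])                # which subset is tilted clockwise
    subset_means = (grand_mean - sign * mean_sep_deg / 2,
                    grand_mean + sign * mean_sep_deg / 2)
    within = np.array([-30, -22, -14, -5, 5, 14, 22, 30])   # 60-deg subset range

    orientations = np.concatenate([(m + within) % 180 for m in subset_means])
    # Random length-orientation conjunctions within each subset
    for s in (slice(0, 8), slice(8, 16)):
        orientations[s] = rng.permutation(orientations[s])

    return lengths, orientations
```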

Procedure

The overall design and procedure of Experiment 3 were similar to those of Experiment 2, except that participants reported the mean orientation, instead of the centroid, of a specific set, depending on the cue provided before each trial. As before, there were three conditions: attend to a subset of lines defined relative to the median length (pre-cued with either “Long” or “Short”), attend to the superset of all 16 lines, or attend to a half-set presented alone (both the superset and the half-set were pre-cued with “All”). A sample trial of Experiment 3 is illustrated in Fig. 7. After the pre-cue was presented for 1 s, a stimulus containing the lines was presented for 500 ms. The stimulus was followed by a response display in which observers adjusted the orientation of a probe to report the mean orientation of the pre-cued set. The probe was a line 1.6° in length with a randomly determined initial orientation, presented at the center of an otherwise empty screen and surrounded by a black ring with a white slider. By dragging the mouse around the ring, participants could rotate the line (the slider moved along the ring accordingly) to choose any orientation between 1° and 180°. To record the answer, participants pressed the space bar and immediately received feedback showing their adjusted orientation and the correct answer. Experiment 3 consisted of 540 trials: 3 distributions (two-peaked, smooth narrow, and smooth fat) × 3 conditions (attend-subset, half-set-only, and superset) × 60 repetitions. At the beginning of the experiment, participants completed 15 practice trials.

Fig. 7 An example trial of Experiment 3 demonstrating the “Superset” task

Results and discussion

Precision

On each trial, we calculated the circular error, the angular difference between the participant’s response and the correct response (Error = Response – Correct), with a circular error of 0° corresponding to a perfectly correct answer and a circular error of ± 90° corresponding to the maximum possible error. From the distribution of errors, we calculated the standard deviation (SD) as an estimate of the precision of orientation averaging, such that a greater SD indicates less precise averaging.
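A minimal sketch of this error computation, assuming orientations are coded in a 180° circular space (the helper names and the use of NumPy are illustrative assumptions, not the authors’ analysis code):

```python
import numpy as np

def circular_error(response_deg, correct_deg):
    """Signed angular difference in a 180-deg orientation space, in (-90, +90]."""
    err = (np.asarray(response_deg, float) - np.asarray(correct_deg, float)) % 180.0
    return np.where(err > 90.0, err - 180.0, err)

def precision_sd(responses, corrects):
    """SD of the circular-error distribution; a larger SD means less precise averaging."""
    return np.std(circular_error(responses, corrects), ddof=1)
```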

First of all, we observed that the half-set condition showed better performance (i.e., a smaller standard deviation) than the other two conditions for all three distribution types. In support of this, a two-way repeated-measures ANOVA with task condition (three levels: attend-subset, half-set-only, and superset) and distribution type (three levels: two-peaked, smooth narrow, and smooth fat) as factors showed a significant main effect of task condition (F(2,38) = 83.17, p < 0.001, η²G = 0.28). The main effect of distribution type was non-significant (F(2,38) = 1.30, p = 0.28, η²G = 0.004). We also found a significant, though small, interaction between the two factors (F(4,76) = 5.91, p < 0.001, η²G = 0.03). The effect of task condition on the SD of the circular error distribution was further tested using contrast analyses, with the half-set-only condition (average SD = 21°) being more precise than the attend-subset and superset conditions (average SDs = 28° in both; comparisons: half-set vs. subset: t(19) = 14.19, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 3.17; half-set vs. superset: t(19) = 11.71, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 2.62; Fig. 8A). It is not surprising that the mean orientation of the half-set was estimated more precisely than the mean orientation of the superset, because the physical orientation range of half-sets was smaller than that of supersets (60° vs. 90°). This is also consistent with previous reports of averaging error increasing with the range of feature values (Corbett et al., 2012; Dakin, 2001; Fouriezos, Rubenfeld, & Capstick, 2008; Im & Halberda, 2013; Marchant, Simons, & de Fockert, 2013; Maule & Franklin, 2015; Utochkin & Tiurina, 2014). On the other hand, precision in the attend-subset condition was worse than in the half-set-only condition, although the physical range of the relevant category was the same. This result replicates the findings of Experiment 2, suggesting that ensemble extraction from a subset is noisy due to an imperfect selection and segmentation process.
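In principle, these planned contrasts can be reproduced with paired t-tests against a Bonferroni-adjusted alpha, with Cohen’s d computed for paired samples (d_z = mean difference / SD of differences, which is consistent with the reported values given t and n = 20). The sketch below uses hypothetical per-participant arrays of error SDs and illustrates the procedure only; it is not the authors’ analysis code.

```python
import numpy as np
from scipy import stats

def bonferroni_contrasts(sd_half, sd_subset, sd_superset, n_comparisons=3):
    """Paired contrasts of the half-set-only condition against the other two
    conditions. Inputs are hypothetical arrays with one error SD per participant."""
    alpha = 0.05 / n_comparisons                    # Bonferroni-corrected alpha (~0.017)
    results = {}
    for name, other in (("half vs. subset", sd_subset),
                        ("half vs. superset", sd_superset)):
        t, p = stats.ttest_rel(sd_half, other)
        diff = np.asarray(other) - np.asarray(sd_half)      # positive if half-set is more precise
        d = np.mean(diff) / np.std(diff, ddof=1)            # Cohen's d for paired samples (d_z)
        results[name] = dict(t=t, p=p, cohens_d=d, significant=p < alpha)
    return results
```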

Fig. 8 The results of Experiment 3: (A) error standard deviation (SD) in the average orientation-adjustment task as a function of condition and length distribution; (B) bias towards the grand mean orientation as a function of the length distribution in the “attend-subset” condition. Error bars denote the standard error of the mean

We next examined the effects of the shape of the feature distribution. We found that when observers had to report the mean orientation of the attended subset, their SD was smaller for the two-peaked distributions (average SD = 27°) than for the smooth narrow distributions (average SD = 30°; comparison: t(19) = 4.01, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 0.90), although the difference from the smooth fat distribution did not reach significance.

Bias towards the mean of the superset

To examine whether participants’ estimates of the mean orientation of the attended subset were biased towards the grand mean of the superset, we conducted an additional test on adjusted error distributions. This adjustment was performed to evaluate the direction of each error relative to the superset mean. We flipped the signs of the raw error values (Error = Response – Correct) for the trials in which the mean orientation of the subset was greater than the mean orientation of the superset. Through this transformation, any response biased towards the mean of the superset had a positive error value, whereas any response deviating away from the mean of the superset had a negative error value. We then calculated the mean of the distribution of these transformed error values to quantify the magnitude of the systematic bias towards or away from the grand mean of the superset. The signed error values are particularly informative for our purpose because they indicate both the direction of the bias (towards or away from the mean of the superset) and its magnitude (if present).
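A minimal sketch of this sign-flipping transformation (variable names are hypothetical; the circular_error helper is the same as in the precision sketch above):

```python
import numpy as np

def circular_error(response_deg, correct_deg):
    """Signed angular difference in a 180-deg orientation space, in (-90, +90]."""
    err = (np.asarray(response_deg, float) - np.asarray(correct_deg, float)) % 180.0
    return np.where(err > 90.0, err - 180.0, err)

def bias_towards_superset(responses, subset_means, superset_means):
    """Mean signed error, with positive values indicating a bias of the reported
    subset mean towards the grand mean of the superset."""
    err = circular_error(responses, subset_means)
    # Flip the sign on trials where the subset mean is greater than the superset mean,
    # so that 'towards the grand mean' always maps onto positive values.
    flip = np.asarray(subset_means, float) > np.asarray(superset_means, float)
    return np.where(flip, -err, err).mean()
```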

Figure 8B summarizes the mean biases for the three types of distributions. Overall, we found considerable positive biases for all of them (means = 7–10°; one-sample t-tests for differences from 0°: all ts(19) > 6, all ps < 0.001, Bonferroni-corrected α = 0.017, all Cohen’s ds > 1.3). Because the physical distance between the correct answer (the true mean of the subset) and the mean of the superset was 15°, these biases were smaller than the distance between the correct answer and the grand mean (one-sample t-tests for differences from 15°: all ts(19) > 4, all ps < 0.001, Bonferroni-corrected α = 0.017, all Cohen’s ds > 0.9). Therefore, we can conclude that participants’ estimates of the mean orientation of the subset were strongly biased towards the mean of the superset, although the estimates were not based entirely on the grand mean itself. This result is consistent with our finding in Experiment 2. Moreover, participants’ biases in mean orientation estimation were greater when the length distribution was smooth narrow (mean = 10.5°, SD = 5.4°) than when it was two-peaked (mean = 7.3°, SD = 4.8°; comparison: t(19) = 4.01, p < 0.001, Bonferroni-corrected α = 0.017, Cohen’s d = 0.90). However, no significant difference was observed between the smooth fat and the two-peaked distributions.

It is worth highlighting a notable difference in performance for the superset condition of Experiment 3 compared to that of Experiment 2. In Experiment 3, response accuracy in the superset condition was comparable to that in the attend-subset condition, although worse than in the half-set condition (Fig. 8A). In Experiment 2, however, participants’ centroid positioning errors in the superset condition were consistently comparable to those in the half-set-only condition and significantly better than those in the attend-subset condition (see Fig. 6A). One reasonable explanation for this discrepancy relates to the practical limitations of the centroid paradigm acknowledged in Experiment 2. Because of the limited screen space, the centroids in the superset condition tended to be centralized, compared to the centroid positions in the attend-subset condition, in which the locations of the two subsets were spatially separated (see Fig. 6B). In Experiment 3, where this limitation was removed (all the lines could be accommodated with sufficient spacing within the same visual area), response accuracy in the superset condition was equivalent to that in the attend-subset condition rather than the half-set condition. Therefore, when the distribution of the superset was better equated to the other two conditions, as in Experiment 3, the half-set condition showed a clear advantage over both the attend-subset and the superset conditions.

The main finding of Experiment 3 was that the two-peaked length distribution allowed for more precise and easier segmentation than the smooth narrow distribution, which in turn resulted in more precise and less biased (although not perfect) ensemble estimates from the segmented subsets in another feature domain (orientation). In this respect, our results replicate those of Experiment 2 and generalize them across different visual features.

General discussion

The primary goal of this study was to test how the statistical structure of multiple intermixed objects (mean, median, range, and the shape of the feature distribution) affects the ease of making categorical discriminations (e.g., segmentation or parsing) between groups of objects and modulates the extraction of overall ensemble statistics from these groups. Instead of estimating categorical effects using indirect or implicit manipulations (e.g., Khayat & Hochstein, 2019; Utochkin et al., 2018; Utochkin & Yurevich, 2016), the current study strived to make the categorization task as explicit and direct as possible: participants engaged with the categorization component directly, following an instruction to use an overt categorization rule (half-split by size). We report three main novel findings. First, participants could categorize individual objects into subsets – large versus small, or long versus short – based on the ensemble central tendency (presumably the grand mean extracted from the superset; Experiment 1) and use these segmentation rules for ensemble extraction in another feature domain: segmentation by size and ensemble extraction by location (Experiment 2), or segmentation by length and ensemble extraction by orientation (Experiment 3). Second, participants’ categorization performance was sensitive to and dependent on the shape of the distribution in the feature domain used for segmentation. Finally, the ease of segmenting and categorizing subgroups based on one feature dimension also systematically affected the precision of extraction of ensemble summary statistics from the subgroups in another feature dimension (e.g., centroid or orientation).

The mean value as a categorical boundary

In their recent work, Khayat and Hochstein (2018, 2019) made the insightful point that an ensemble’s central tendency (e.g., the mean feature) can be implicitly extracted from a series of briefly shown items and serves as the best representative of the set – the item most likely to be reported as having been presented (even if it actually was not). There is a strong resemblance between the retrieval of the ensemble mean and the retrieval of a highly typical object in a categorization task (e.g., Khayat & Hochstein, 2019). This resemblance suggests that ensemble statistics can indeed be naturally related to categorization and that the mean object can be considered a “prototype” of the ensemble members.

We suggest that our findings (especially those of Experiment 1) demonstrate the flip side of the effects found by Khayat and Hochstein (2018, 2019). The most typical representative of all items is, at the same time, the point of maximum ambiguity (the boundary) when the items need to be parsed into separate categories. Whereas Khayat and Hochstein demonstrated that the probability of an item being recognized as a set member decreases as a function of its distance from the mean in feature space (or along a typicality scale), we demonstrated that the probability of accurate categorization increases as a function of distance from the mean. We consider our findings to mirror, in reverse, the patterns observed by Khayat and Hochstein (2018, 2019), possibly because of the different nature of the tasks used.

Although our categorization rule explicitly used the median size as a boundary, our use of skewed superset distributions and the resulting asymmetry of the categorization function (see Fig. 3, and also Experiment 1 for a detailed explanation) showed that observers instead relied on the grand mean as a more natural representation of the categorical boundary. The fact that they did so unintentionally, contrary to the task instruction, and without extended practice extends our view of the functionality of the ensemble mean: it serves not only as an explicit approximation of summarized ensemble properties (Chong & Treisman, 2003) but also as a powerful tool for “naturally” organizing visual and cognitive representations across many tasks.

Role of feature distribution: Ensemble categorization as a probabilistic process

Our results showed that the distance effects on categorization are strongly modulated by the feature distributions of the stimuli, which differed in range and shape (e.g., smoothness). In addition, Experiments 2 and 3 consistently showed that two-peaked distributions improved the precision of ensemble estimates from the segmented subsets compared with the smooth narrow distributions (note that the results for the smooth fat distributions were not consistent). This finding suggests important properties of ensemble-based categorization. First, ensemble-based categorization does not rely on an all-or-none process of definitive classification based on a boundary rule. Rather, it is a probabilistic and integrative decision that depends on many factors, including the resemblance among items within the same subset, their deviation from items belonging to other subsets, and the disposition of the overall group that includes all the items (e.g., the superset that provides their common mean “prototype”). This idea is not novel in the categorization literature (e.g., Rosch & Mervis, 1975). Second, the probability of categorizing items as large or small depended on both the absolute and the relative distance of an item from the categorical boundary. The role of absolute distance was supported by the following findings: the smooth narrow distributions, containing only sizes densely distributed around the grand mean, yielded the lowest categorization accuracy, whereas the two-peaked distributions, containing only sizes substantially separated from the grand mean, yielded much better categorization accuracy. A similar pattern was also found in the subset summary judgments of Experiments 2 and 3. The smooth fat distribution, containing both middle and extreme sizes, yielded an intermediate categorization rate. The effect of relative distance from the categorical boundary is demonstrated by the fact that items with extreme sizes (smallest or largest) were categorized with almost the same accuracy regardless of the distribution range and shape.

Our findings on the role of individual items’ deviations from the grand mean in ensemble-based categorization also provide insight into the concept of segmentability. Previous work by Utochkin and colleagues (Utochkin, 2015; Utochkin et al., 2018; Utochkin & Yurevich, 2016) suggested that the shape of the feature distribution – whether it is smooth/single-peaked (non-segmentable) or bumpy/multi-peaked (segmentable) – is critical for determining whether all items belong to the same or different categories. Such a segmentability “rule” can be a useful heuristic for categorization, as it plausibly reflects the distributional properties of objects of the same and different kinds in the real world (Utochkin, 2015). However, our new data show that ensemble categorization need not be based solely on an analysis of peaks and gaps in the feature distribution; it can also be explained by the presence of elements similar to a common “prototype” (the ensemble mean), which causes considerable categorical confusion. There are many such elements in smooth distributions, especially narrow-range ones, which makes categorization into two groups more difficult. In contrast, there is less categorical confusion in two-peaked distributions because they do not contain many “prototypical” elements; moreover, the common “prototype” in the form of the grand mean might not be encoded at all in such ensembles (see Treue, Hol, & Rauber, 2000).

Categorization and segmentation of spatially intermixed subsets based on size

In Experiments 2 and 3, we found that the ability to accurately extract ensemble statistics was significantly impaired and biased when items from two subsets were presented spatially intermixed, compared to when a set was presented in isolation (the half-set-only condition). Our results suggest, in line with a number of other studies (Inverso et al., 2016; Oriet & Brand, 2013; Utochkin et al., 2018), that many feature dimensions such as size, length, or orientation are not perfect bases for global attentional “filtering,” possibly due to inherent noise. On the other hand, many studies demonstrating a much better ability to compute ensemble statistics independently within spatially intermixed subsets (e.g., Chong & Treisman, 2005; Halberda et al., 2006; Im & Chong, 2014) used color for segmentation. The visual system appears to be well tuned to even small color differences and can use them for global selection of spatially intermixed subsets (Sun et al., 2016a, b), although there is also an indication that parsing solely by color is still imperfect and can be improved by an additional cue such as spatial separation (see Im, Park, & Chong, 2015). Various sensory and perceptual dimensions may thus differ in their potential to drive global segmentation and categorization in complex scenes. To better understand the hierarchical nature of visual grouping and scene perception in more complex and realistic images, future research will need to investigate how feature-based attention in various domains guides segmentation and categorization of objects, sets, and scenes. One promising direction is to establish effective measurements that quantify how imperfect segmentation leads to noisy ensemble representations of subsets. One potential approach would be to rely on modeling and simulation (e.g., Sun et al., 2016a, b) to evaluate how ensemble representations of subsets in a visual array are degraded by mistakenly including members of the distractor set, omitting members of the target set, or both.
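As one illustration of what such a simulation could look like, the toy sketch below (our own simplified example, not a model taken from the cited studies) estimates how intrusions from the distractor subset and omissions from the target subset bias and add noise to the recovered subset mean for a linear feature such as size or length:

```python
import numpy as np

def simulate_selection_noise(target, distractor, p_omit=0.2, p_intrude=0.2,
                             n_sims=10_000, rng=None):
    """Toy simulation: each target item may be omitted and each distractor item
    may intrude into the pooled sample; returns the bias and SD of the sample
    mean relative to the true target mean."""
    rng = np.random.default_rng() if rng is None else rng
    target = np.asarray(target, float)
    distractor = np.asarray(distractor, float)
    means = np.empty(n_sims)
    for i in range(n_sims):
        keep = target[rng.random(target.size) > p_omit]        # surviving target items
        intr = distractor[rng.random(distractor.size) < p_intrude]  # intruding distractors
        sample = np.concatenate([keep, intr])
        means[i] = sample.mean() if sample.size else np.nan
    bias = np.nanmean(means) - target.mean()    # shift towards the distractor mean
    noise = np.nanstd(means)                    # added variability of the estimate
    return bias, noise
```

Running such a simulation with the “short” and “long” length values used above would, for instance, show the estimated “short” mean drifting towards the “long” mean as the intrusion rate grows, qualitatively mimicking the bias towards the grand mean observed in the attend-subset condition.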

In Experiments 2 and 3, we observed that the superset condition, in which participants did not have to categorize individual objects into subgroups at all, also showed effects of the shape of the distribution in the other feature dimension. Although the specific patterns were not completely identical, both superset conditions showed that a two-peaked distribution of the feature irrelevant to the ensemble extraction task (size in Experiment 2 and length in Experiment 3) was not helpful, or was even detrimental, relative to the smooth (narrow and fat) distributions. This finding suggests that, even when they did not have to categorize, participants’ ensemble representations were sensitive to the shape of the feature distribution of objects in a visual image. Previous work has examined the effects of the shape of the feature distribution only within a single feature dimension. For example, the variance of the Gaussian distribution from which individual object sizes were drawn systematically affected the precision of mean-size extraction (Im & Halberda, 2013) and the magnitude of the adaptation aftereffect to mean size (Corbett, Wurnitsch, Schwartz, & Whitney, 2012); and the discrimination threshold for mean size increased when the two sets to be compared were drawn from different size distributions (e.g., one set from a two-peaked distribution vs. the other from a uniform distribution), compared to when the two sets had the same size distribution. Thus, the current finding that the distribution shape of one feature of an ensemble systematically modulates ensemble extraction in another feature is novel.

Hierarchical coding empowered by rapid categorization for ensembles of objects

We propose that the segmentability effects found in our experiments have more far-reaching implications for understanding the organization of visual perception. Segmentability reflects a biologically justified strategy for perceiving heterogeneous items depending on the perceptual context. When highly dissimilar and, hence, segmentable items are mixed together, the visual system tends to emphasize their differences, because it is more likely that they represent different types of objects. However, when a smooth transition is provided between the same dissimilar features, the items are likely to be merely extreme variants of the same object type. This perceptual principle mirrors how physical features are distributed among natural objects. Therefore, segmentability can be considered a relatively low-level basis for the rapid perceptual categorization of multiple objects (Utochkin, 2015). Each such segmentable category can also serve as an appropriate unit of global scene analysis.