Do Semantic Parts Emerge in Convolutional Neural Networks?

Semantic object parts can be useful for several visual recognition tasks. Lately, these tasks have been addressed using Convolutional Neural Networks (CNN), achieving outstanding results. In this work we study whether CNNs learn semantic parts in their internal representation. We investigate the responses of convolutional filters and try to associate their stimuli with semantic parts. We perform two extensive quantitative analyses. First, we use ground-truth part bounding-boxes from the PASCAL-Part dataset to determine how many of those semantic parts emerge in the CNN. We explore this emergence for different layers, network depths, and supervision levels. Second, we collect human judgements in order to study what fraction of all filters systematically fire on any semantic part, even if not annotated in PASCAL-Part. Moreover, we explore several connections between discriminative power and semantics. We find out which are the most discriminative filters for object recognition, and analyze whether they respond to semantic parts or to other image patches. We also investigate the other direction: we determine which semantic parts are the most discriminative and whether they correspond to those parts emerging in the network. This enables us to gain an even deeper understanding of the role of semantic parts in the network.


Introduction
Semantic parts are object regions interpretable by humans (e.g. wheel, leg) and play a fundamental role in several visual recognition tasks. For this reason, semantic part-based models have gained significant attention in the last few years.
The key advantages of exploiting semantic part representations are that parts have lower intra-class variability than whole objects, they deal better with pose variation, and their configuration provides useful information about the aspect of the object. The most notable examples of works on semantic part models are fine-grained recognition [8,9,10], generic object detection [5], articulated pose estimation [11,12,13], and attribute prediction [14,15,16].
In this paper we look into these two worlds and address the following question: does a CNN learn semantic parts in its internal representation? In order to answer it, we investigate whether the network's convolutional filters learn to respond to semantic parts of objects. Previous works [1,2,3,4] have studied the matter by visually inspecting filter responses to check if they look like semantic parts. Based on this qualitative analysis, these works suggest that semantic parts do emerge in CNNs. Here we go a step further and perform a quantitative evaluation using ground-truth bounding-boxes of parts, thus providing a more conclusive answer to the question. We examine the different stimuli of the filters and try to associate them with semantic parts, taking advantage of the available ground-truth part location annotations in the PASCAL-Part dataset [5].
As an analysis tool, we turn filters into part detectors based on their responses to stimuli. If some filters systematically respond to a certain semantic part, their detectors will perform well, and hence we can conclude that they do represent the semantic part. Given the difficulty of the task, while building the detectors we assist the filters in several ways. The actual image region to which a filter responds typically does not accurately cover the extent of a semantic part. We refine this region by a regressor trained to map it to a part's ground-truth bounding-box. Moreover, as suggested by other works [26,27,28], a single semantic part might emerge as distributed across several filters. For this reason, we also consider filter combinations as part detectors, and automatically select the optimal combination of filters for a semantic part using a Genetic Algorithm.
We present an extensive analysis evaluating different network layers, architectures, and supervision levels. Results show that 34 out of 123 semantic parts emerge in AlexNet [6] finetuned for object detection [7]. This is a modest number, despite all favorable conditions we have engineered into the evaluation and all assists we have given to the network. This result demystifies the findings of [1,2,3,4] and shows that the network learns to associate filters to part classes, but only for some of them and often to a weak degree. In general, these semantic parts are those that are large or very discriminative for the object class (e.g., torso, head, wheel). Furthermore, we find that some filters that respond to parts are shared across several object classes, for example a single filter firing for wheels of cars, bicycles, and buses. Another interesting discovery is that the emergence of parts grows with the depth of the layer within a network. However, deeper architectures like [17] do not seem to significantly promote a stronger emergence of parts. Similarly, the supervision level does not seem to make a substantial difference either. This suggests that the part emergence is ubiquitous and comparable across architectures and supervision levels. Finally, we explore the possibility of the network responding to parts as recurrent discriminative patches, rather than truly semantic parts. We observe that each class is associated with an average of nine discriminative filters. Interestingly, 60% of these are also semantic. The overlap between which filters are discriminative and/or semantic might be the reason why previous works [1,2,3,4] have suggested a stronger emergence of semantic parts, based on visual inspection only.

Related Work
Analyzing CNNs. CNN-based representations are unintuitive and there is no clear understanding of why they perform so well or how they could be improved.
In an attempt to better understand the properties of a CNN, some recent vision works have focused on analyzing their internal representations [29,30,31,32,1,2,3,4,33]. Some of these investigated properties of the network, like stability [29], feature transferability [30], equivariance, invariance and equivalence [31], the ability to reconstruct the input [32], and how the number of layers, filters and parameters affects the network performance [3,33].
More related to this paper are [1,2,3,4], which look at the convolutional filters. Zeiler and Fergus [1] use deconvolutional networks to visualize locally optimal visual inputs for individual filters. Simonyan et al. [2] use a gradient-based visualization technique to highlight the areas of an image discriminative for an object class. Agrawal et al. [3] show that the feature representations are distributed across object classes. Zhou et al. [4] show that the layers of a network learn to recognize visual elements at different levels of abstraction (e.g. edges, textures, objects and scenes). All these works make an interesting observation: filter responses can often be linked to objects and semantic parts. Nevertheless, they base this observation on visual inspection only. Instead, we present an extensive quantitative analysis on whether filters can be associated with semantic parts and to which degree. We transform the filters into part detectors and evaluate their performance on ground-truth part bounding-boxes from the PASCAL-Part dataset [5]. We believe this methodology goes a step further than previous works and supports more conclusive answers to the quest for semantic parts.

Filters as intermediate part representations for recognition.
Several works use filter responses for recognition tasks [16,26,27,28,34]. Simon et al. [26] train part detectors for fine-grained recognition, while Gkioxari et al. [16] train them for action and attribute classification. Furthermore, Simon et al. [27] learn constellations of filter activation patterns, and Xiao et al. [28] cluster groups of filters responding to different bird parts. All these works assume that the convolutional layers of a network are related to semantic parts. In this paper we try to shed some light on this assumption and hopefully inspire more works on exploiting the network's internal structure for recognition.

Methodology
Network architecture. Standard image classification CNNs such as [6,17] process an input image through a sequence of layers of various types, and finally output a class probability vector. Each layer i takes the output of the previous layer x_{i−1} as input, and produces its output x_i by applying up to four operations: convolution, nonlinearity, pooling, and normalization. The convolution operation slides a set of learned filters of different sizes and strides over the input. The nonlinearity of choice for many networks is the Rectified Linear Unit (ReLU) [6], and it is applied right after the convolution.
Goal. Our goal is understanding whether the convolutional filters learned by the network respond to semantic parts. In order to do so, we investigate the image regions to which a filter responds and try to associate them with a particular part. Fig. 1 presents an overview of our approach. Let f_i^j be the j-th convolutional filter of the i-th layer, including also the ReLU.

Fig. 1: Overview of our approach for a layer 5 filter. Each local maximum of the filter's feature map leads to a stimulus detection (red). We transform each detection with a regressor trained to map it to a bounding-box tightly covering a semantic part (green).

Each pixel in a feature map
is the activation value of filter f_i^j applied to a particular position in the feature maps x_{i−1} of the previous layer. The resolution of the feature map depends on the layer, decreasing as we advance through the network. Fig. 1 shows feature maps for layers 1, 2, and 5. When a filter responds to a particular stimulus in its input, the corresponding region on the feature map has a high activation value. By studying the stimuli that cause a filter to fire, we can characterize them and decide whether they correspond to a semantic object part.

Stimulus detections from activations
The value a_{c,r} of each particular activation α, located at position (c, r) of feature map x_i^j, indicates the response of the filter to a corresponding region in its input x_{i−1}. By recursively back-propagating this region down the layers, we can reconstruct the actual receptive field on the input image, i.e. the whole image region on which the filter acted. The size of the receptive field varies depending on the layer, from the actual size of the filter for the first convolutional layer, up to a much larger image region on the top layer. For each feature map, we select all its local maxima as activations with high response. Each of these activations will lead to a stimulus detection in the image. The location of such a detection is defined by the center of the receptive field of the activation, whereas its size varies depending on the layer. Fig. 1 shows an example, where the two local maxima of feature map x_5^j lead to the stimulus detections depicted in red.

Regressing to part bounding-boxes. The receptive field of an activation gives a rough indication about the location of the stimulus. However, it rarely covers a part tightly enough to associate the stimulus with a part instance (fig. 2). In general, the receptive field of high layers is significantly larger than the part ground-truth bounding-box, especially for small classes like ear. Moreover, while the receptive field is always square, some classes have other aspect ratios (e.g. legs). Finally, the response of a filter to a part might not occur in its center, but at an offset instead (e.g. on the bottom area, fig. 2(d-e)).
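The stimulus-detection step can be sketched as follows. This is a minimal illustration, not the authors' code: it finds the local maxima of a 2D feature map and maps each one back to an image location, assuming the layer's effective stride and offset are known (both are hypothetical parameters here, not values from the paper).

```python
import numpy as np

def local_maxima(fmap):
    """Return (row, col, value) for strict local maxima of a 2D feature map,
    comparing each cell against its 8-connected neighbourhood."""
    H, W = fmap.shape
    padded = np.pad(fmap, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for r in range(H):
        for c in range(W):
            window = padded[r:r + 3, c:c + 3]
            # a peak must be positive and strictly larger than all neighbours
            if fmap[r, c] > 0 and fmap[r, c] == window.max() \
                    and (window == fmap[r, c]).sum() == 1:
                peaks.append((r, c, fmap[r, c]))
    return peaks

def receptive_field_center(r, c, stride, offset):
    """Map a feature-map position back to image coordinates, given the
    layer's effective stride and offset (hypothetical values)."""
    return (offset + r * stride, offset + c * stride)
```

Each returned peak then becomes one stimulus detection, centered at its receptive-field center with the layer-dependent size.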
In order to factor out these elements, we assist each filter with a bounding-box regression mechanism that refines its stimulus detection for each part class. The regressor applies a 4D transformation, i.e. translation and scaling along width and height. We believe that if a filter fires systematically on many instances of a part class at the same relative location (in 4D), then we can grant that filter a 'part detector' status. This implies that the filter responds to that part, even if the actual receptive field does not tightly cover it. For the rest of the paper, all stimulus detections include this regression step unless stated otherwise.
We train one regressor for each part class and filter. Let {G^l} be the set of all ground-truth bounding-boxes for the part in the training set. Each instance bounding-box G^l is defined by its center coordinates (G^l_x, G^l_y), width G^l_w, and height G^l_h. We train the regressor on K pairs of activations and ground-truth part bounding-boxes {(α_k, G^k)}. Let (c_x, c_y) be the center of the receptive field on the image for a particular feature map activation α of value a_{c,r}, and let w, h be its width and height (w = h as all receptive fields are square). We pair each activation with an instance bounding-box G^l of the corresponding image if (c_x, c_y) lies inside it. We learn a 4D transformation (d_x, d_y, d_w, d_h) to predict a part bounding-box from α's receptive field, where each d_* is a function of γ(α) = (c_x, c_y, a_{c−1,r−1}, a_{c−1,r}, ..., a_{c+1,r+1}). Therefore, the regression depends on the center of the receptive field and on the values of the 3x3 neighborhood of the activation on the feature map. Note that it is independent of w and h as these are fixed for a given layer. Each d_* is a linear combination of the elements in γ(α) with a weight vector w_*, where * can be x, y, w, or h, i.e. d_*(α) = w_*^T γ(α). We set regression targets

    t_x = (G_x − c_x)/w,   t_y = (G_y − c_y)/h,   t_w = log(G_w/w),   t_h = log(G_h/h),

and optimize the following weighted least-squares objective

    w_* = argmin_ŵ Σ_k a_k (t_*^k − ŵ^T γ(α_k))^2,

where a_k is the activation value of α_k. In practice, this tries to transform the position, size and aspect-ratio of the original receptive field of the activations into the bounding-boxes in {G^l}. Fig.
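A sketch of this fit, under stated assumptions: R-CNN-style targets as above, and a weighted least-squares solve where each training pair is weighted by its activation value. `Gamma` stands in for the stacked features γ(α_k); the small ridge term is our addition for numerical stability, not from the paper.

```python
import numpy as np

def regression_targets(G, center, w, h):
    """R-CNN-style targets mapping a square receptive field (center, w, h)
    to a ground-truth part box G = (Gx, Gy, Gw, Gh)."""
    Gx, Gy, Gw, Gh = G
    cx, cy = center
    return np.array([(Gx - cx) / w, (Gy - cy) / h,
                     np.log(Gw / w), np.log(Gh / h)])

def fit_weighted_ls(Gamma, T, weights, reg=1e-6):
    """Solve the weighted least-squares objective for all four targets at once.
    Gamma: K x D matrix of features gamma(alpha_k); T: K x 4 targets;
    weights: the K activation values a_k weighting each example."""
    W = np.diag(weights)
    A = Gamma.T @ W @ Gamma + reg * np.eye(Gamma.shape[1])
    return np.linalg.solve(A, Gamma.T @ W @ T)  # D x 4, one column per w_*
```

At test time, the predicted d_* values are applied in the inverse direction of the targets to turn a receptive field into a part box.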
2 presents some examples of our bounding-box regression for 6 different parts. For each part, we show the feature map of a layer 5 filter and both the original receptive field (red) and the regressed box (green) of some activations. We can see how, given a strong activation on the feature map, the regressor not only refines the center of the detection, but also successfully captures its extent. Some classes are naturally more challenging, like dog-tail in fig. 2(f), due to higher size and aspect-ratio variance or lack of satisfactory training examples.
Evaluating filters as part detectors. For each filter and part combination, we need to evaluate the performance of the filter as a detector of that part. We take all the local maxima of the filter's feature map for every input image and compute their stimulus detections, applying Non-Maxima Suppression [35] to remove duplicate detections. We consider a stimulus detection as correct if it has an intersection-over-union ≥ 0.4 with any ground-truth bounding-box of the part, which is the usual condition for part detection [5]. All other detections are considered false positives. A filter is a good part detector if it has high recall but a small number of false positives, indicating that when it fires, it is because the part is present. Therefore, we use Average Precision (AP) to evaluate the filters as part detectors, following the PASCAL VOC [36] protocol.
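The evaluation criterion can be sketched as below: a simplified, single-image version of PASCAL-style AP with the IoU ≥ 0.4 matching rule, greedy matching of at most one detection per ground-truth box, and 11-point interpolation. The full VOC protocol aggregates over all images; this sketch only illustrates the mechanics.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, gts, thr=0.4):
    """detections: list of (score, box); gts: list of boxes.
    Each ground-truth box can be matched by at most one detection."""
    detections = sorted(detections, key=lambda d: -d[0])
    matched = [False] * len(gts)
    tp, fp = [], []
    for score, box in detections:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            o = iou(box, g)
            if o > best and not matched[j]:
                best, best_j = o, j
        if best >= thr:
            matched[best_j] = True
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gts), 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # 11-point interpolated AP as in PASCAL VOC
    return float(np.mean([precision[recall >= t].max() if (recall >= t).any() else 0.0
                          for t in np.linspace(0, 1, 11)]))
```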

Filter combinations
Several works [3,4,28] noted that one filter alone is often insufficient to cover the spectrum of appearance variation of an object class. We believe that this holds also for part classes. For this reason, we present here a technique to automatically select the optimal combination of filters for a part class.
For a given network layer, the search space consists of binary vectors z = [z_1, z_2, ..., z_N], where N is the number of filters in the layer. If z_i = 1, then the i-th filter is included in the combination. We consider the stimulus detections of a filter combination as the set union of the individual detections of each filter in it. Ideally, a good filter combination should make a better part detector than the individual filters in it. Good combinations should include complementary filters that jointly detect a greater number of part instances, increasing recall. At the same time, the filters in the combination should not add many false positives. Therefore, we can use the collective AP of the filter combination as the objective function to be maximized:

    z* = argmax_z AP( ∪_{i : z_i = 1} det_i ),

where det_i indicates the stimulus detections of the i-th filter. We use a Genetic Algorithm (GA) [37] to optimize this objective function. GAs are iterative search methods inspired by natural evolution. At every generation, the algorithm evaluates the 'fitness' of a set of search points (population). Then, the GA performs three genetic operations to create the next generation: selection, crossover and mutation. In our case, each member of the population (chromosome) is a binary vector z as defined above. Our fitness function is the AP of the filter combination. In our experiments, we use a population of 200 chromosomes and run the GA for 100 generations. We use Stochastic Universal Sampling [37]. We set the crossover and mutation probabilities to 0.7 and 0.3, respectively. We bias the initialization towards a small number of filters by setting the probability P(z_i = 1) = 0.02, ∀i. This leads to an average combination of 5 filters when N = 256 in the initial population.
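The search can be sketched as a minimal GA over binary selection vectors. Simple fitness-proportional selection stands in for the Stochastic Universal Sampling used in the paper, with single-point crossover and one-bit mutation; the fitness function is passed in (in the paper it would be the AP of the filter combination).

```python
import numpy as np

def genetic_search(fitness, n_filters, pop=200, gens=100, p_init=0.02,
                   p_cross=0.7, p_mut=0.3, seed=0):
    """Maximize `fitness` over binary vectors z of length n_filters."""
    rng = np.random.default_rng(seed)
    Z = (rng.random((pop, n_filters)) < p_init).astype(int)  # sparse init
    best_z, best_f = None, -np.inf
    for _ in range(gens):
        f = np.array([fitness(z) for z in Z])
        if f.max() > best_f:                        # track best-so-far
            best_f, best_z = f.max(), Z[f.argmax()].copy()
        probs = f - f.min() + 1e-9                  # fitness-proportional selection
        probs = probs / probs.sum()
        parents = Z[rng.choice(pop, size=pop, p=probs)]
        children = parents.copy()
        for i in range(0, pop - 1, 2):              # single-point crossover
            if rng.random() < p_cross:
                cut = rng.integers(1, n_filters)
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        for i in range(pop):                        # one-bit mutation
            if rng.random() < p_mut:
                children[i, rng.integers(n_filters)] ^= 1
        Z = children
    return best_z, best_f
```

On a toy fitness (number of bits matching a target selection), this converges to the target within a few dozen generations.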

AlexNet for object detection
In this section we analyze the role of convolutional filters in AlexNet and test whether some of them can be associated with semantic parts. In order to do so, we design our settings to favor the emergence of this association.

Experimental settings
Dataset. We evaluate filters on the recent PASCAL-Part dataset [5], which augments PASCAL VOC 2010 [36] with pixelwise semantic part annotations. For our experiments we fit a bounding-box to each part segmentation mask. We use the train subset and evaluate all parts listed in PASCAL-Part with some minor refinements: we discard fine-grained labels (e.g. 'car wheel front-left' and 'car wheel back-left' are both mapped to car-wheel) and merge contiguous subparts of the same larger part (e.g. 'person upper arm' and 'person lower arm' become a single part person-arm). The final dataset contains 123 parts of 16 object classes.
AlexNet. One of the most popular networks in computer vision is the CNN model of Krizhevsky et al. [6], winner of the ILSVRC 2012 image classification challenge [38]. It is commonly referred to as AlexNet. This network has 5 convolutional layers followed by 3 fully connected layers. The number of filters at each of the convolutional layers L is: 96 (L1), 256 (L2), 384 (L3), 384 (L4), and 256 (L5). The filter size changes across layers, from 11x11 for L1, to 5x5 for L2, and to 3x3 for L3, L4, and L5.
Training. We use the publicly available AlexNet network of [7] trained for object class detection (for the 20 classes in PASCAL VOC + background) using ground-truth bounding-boxes. Note how these bounding-boxes provide a coordinate frame common across all object instances. This makes it easier for the network to learn parts, as it removes variability due to scale changes (as the convolutional filters have fixed size) and presents different instances of the same part class at rather stable positions within the image. We refer to this network as AlexNet-Object. The network is trained on the train set of PASCAL VOC 2012. Note how this set is a superset of PASCAL VOC 2010 train, on which we analyze whether filters correspond to semantic parts.
Finally, we assist each of its filters by providing a bounding-box regression mechanism that refines its stimulus detections to each part class (sec. 3.1) and we learn the optimal combination of filters for a part class using a GA (sec. 3.2).
Evaluation settings. We restrict the network inputs to ground-truth object bounding-boxes. More specifically, for each part class we look at the filter responses only inside the instances of its object class and ignore the background. For example, for cow-head we only analyze cow ground-truth bounding-boxes. Furthermore, before inputting a bounding-box to the network we follow the R-CNN pre-processing procedure [7], which includes adding a small amount of background context and warping to a fixed image size. An example of an input bounding-box is shown in fig. 1. These settings are designed to be favorable to the emergence of parts, as this is the exact input seen by AlexNet-Object during training and we ignore image background that does not contain parts.
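For concreteness, that pre-processing can be sketched roughly as follows. This assumes the common R-CNN defaults of a 227x227 network input and about 16 pixels of context at the warped scale, and uses nearest-neighbour warping; the actual implementation in [7] differs in details (e.g. mean subtraction and interpolation method).

```python
import numpy as np

def preprocess_box(image, box, out_size=227, context=16):
    """Enlarge `box` so roughly `context` pixels of background surround the
    object after warping, crop (clipping at image borders), and warp to a
    fixed out_size x out_size input with nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    pad_x = context * (x2 - x1) / (out_size - 2 * context)
    pad_y = context * (y2 - y1) / (out_size - 2 * context)
    x1, x2 = int(max(0, x1 - pad_x)), int(min(image.shape[1], x2 + pad_x))
    y1, y2 = int(max(0, y1 - pad_y)), int(min(image.shape[0], y2 + pad_y))
    crop = image[y1:y2, x1:x2]
    rows = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    cols = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[rows][:, cols]
```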

Results
Table 1 shows results for a few parts of seven object classes in terms of average precision (AP). Results on all 123 parts of the 16 object classes are in the supplementary material. For each part class and network layer, the table reports the AP of the best individual filter in the layer ('Best'), the increase in performance over the best filter thanks to selecting a combination of filters with our GA ('GA'), and the number of filters in that combination ('nFilters'). Moreover, the last row of the table reports the mAP over all 123 part classes. Several interesting facts arise from these results.
Need for regression. In order to quantify how much the bounding-box regression mechanism of sec. 3.1 helps, we performed part detection using the non-regressed receptive fields. On AlexNet-Object layer 5, taking the single best filter for each part class achieves an mAP of 7.7. This is very low compared to the mAP of 22.7 achieved by assisting the filters with the regression. Moreover, results show that the receptive field is only able to detect large parts (e.g. bird-torso, bottle-body, cow-torso, etc.). This is not surprising, as the receptive field of layer 5 covers most of the object surface (fig. 2). Instead, filters with regressed receptive fields can detect much smaller parts (e.g. cat-ear, cow-muzzle, person-hair), as the regressor shrinks the area covered by the receptive field and adapts its aspect ratio to the one of the part. We conclude that the receptive field alone cannot perform part detection and regression is necessary.

Differences between layers. Overall, the higher the network layer, the higher the performance. This is consistent with previous observations [1,4] that the first layers of the network respond to generic corners and other edge/color junctions, while higher levels capture more complex structures. Nonetheless, it seems that some of the best individual filters of the very first layers can already perform detection to a weak degree when helped by our regression (e.g. bike-wheel).
Differences between part classes. Performance varies greatly across part classes. For example, some parts (e.g. aeroplane-tail, bike-headlights, horse-eye and person-ear) are clearly not represented by any filter nor filter combination, as their AP is steady at 0 across all layers. On other parts (e.g. bike-wheel, cat-head and horse-torso), instead, the network achieves good detection performance, proving that some of the filters can be associated with these parts.
Filter combinations. Performing part detection using a combination of filters (GA) always performs better than (or equal to) the single best filter. This is interesting, as it shows that different filters learn different appearance variations of the same part class. Moreover, combining multiple filters improves part detection performance more for deeper layers. This suggests that deeper layers are more class-specific, i.e. they dedicate more filters to learning the appearance of specific object/part classes. This can be observed by looking not only at the improvement in performance brought by the GA, but also at the number of filters that the GA selects. Clearly, filters in L1 are so far from being parts that even selecting many filters does not bring significant improvements (+0.6 mAP only). Instead, in L4 and L5 there are more semantic filters and the GA combination helps more (+2.4 mAP and +4.6 mAP, respectively). Interestingly, for L5 the improvement is higher than for L4, yet the number of filters combined is lower. This further shows that filters in higher layers better represent semantic parts.
GA analysis. The AP improvement provided by our GA for some parts is remarkable, like for aeroplane-body (+17.0), horse-head (+11.6) and cow-head (+17.1). While these results suggest that our GA is doing a good job in selecting filter combinations, here we compare against a much simpler method that selects the top few best filters for a part class. We refer to it as TopFilters. We let both methods select the same number of filters and evaluate their combinations in terms of mAP. Our GA consistently outperforms TopFilters (27.3 vs 22.1 mAP, layer 5). The problem with TopFilters is that often the top individual best filters capture the same visual aspect of a part. Instead, our GA can select filters that complement each other and work well jointly (indeed 57% of the filters it selects are not TopFilters). We can see this phenomenon in fig. 3. In the blue car (fig. 3b), TopFilters is able to detect two wheels correctly, but fails to fit a tight bounding-box around the third wheel that appears much smaller. Similarly, in the other car (fig. 3a) TopFilters fails to correctly localize the very large wheel. Instead, our GA is capable in both cases of localizing all wheels correctly. Furthermore, note how for more challenging parts the GA seems to be able to fit tighter bounding-boxes, achieving more accurate detections (fig. 3c-f).
Filter sharing across part classes. We looked into which filters were selected by our GA and noticed that some are shared across different part classes. By looking at these filters' detections (fig. 4), it is clear that some filters are representative for a generic part and work well on all object classes containing it.
Instance coverage. Table 1 presents high AP results for several part classes, showing how some filters can indeed act as part detectors. However, as AP conflates both recall and false-positives, it does not easily reveal how many part instances the filters cover. To answer this question, we show in fig. 5 recall vs. false-positives curves for several part classes. For each part class, we take the top 3 filters of layer 5, and compare them to the filter combination returned by the GA. We can see how the combination reaches higher AP not only by having fewer false positives in the low recall regime, but also by reaching considerably higher recall levels than the individual filters. For some part classes, the filter combination covers as many as 80% of its instances (e.g. car-door, bike-wheel, dog-head). For the more challenging classes, neither the individual filters nor the combination achieve high recall levels, suggesting that the convolutional filters have not learned to respond to these parts (e.g. cat-eye, horse-ear).
How many semantic parts emerge in AlexNet-Object? So far we discussed part detection performance for all individual filters of AlexNet-Object and their combinations. Here we want to answer the main bottom-line question: for how many part classes does a detector emerge? We answer this for two criteria: AP and instance coverage.
For AP, we consider a part to emerge if the detection AP for the best filter combination in the best layer (L5) exceeds 30. This is a rather generous threshold, which represents the level above which the part can be somewhat reliably detected. Under these conditions, 34 out of the 123 semantic part classes emerge. This is a modest number, despite all favorable conditions we have engineered into the evaluation and all assists we have given to the network (including bounding-box regression and optimal filter combinations).
For coverage, instead, results are more positive. We consider that a filter combination covers a part when it reaches a recall level above 50%, regardless of false-positives. According to this criterion, 71 out of the 123 part classes are covered, which is greater than the number of part detectors found according to AP. This indicates that, although there are filter combinations covering many instances of many part classes, their number of false positives is also high.
Based on all this evidence, we conclude that the network does contain filter combinations that can cover some part classes well, but they do not fire exclusively on the part, making them weak part detectors. This demystifies the visual observations of [1,2,3,4]. Moreover, the part classes covered by such semantic filters tend to either cover a large image area, such as torso or head, or be very discriminative for their object class, such as wheels for vehicles and wings for birds. Most small or less discriminative parts, such as headlight, eye or tail, are not represented well in the network filters.

Other network architectures and levels of supervision
In this section we explore how the level of supervision provided during network training and the network architecture affect what the filters learn.
Networks and training. We consider two additional networks, one with a different supervision level (AlexNet-Image) and one with a different architecture (VGG16-Object). AlexNet-Image [6] is trained for image classification on 1.3M images of 1000 object classes in ILSVRC 2012 [38]. We use the publicly available model from [39]. Note how this network has not seen object bounding-boxes during training. For this reason, we expect its filters to learn less about semantic parts than AlexNet-Object. VGG16-Object is the 16-layer network of [17], finetuned for object detection [7]. While its general structure is similar to AlexNet, it is deeper and the filters are smaller (3x3 in all layers), leading to better image classification [17] and object detection [20] performance. Its convolutional layers can be grouped in 5 blocks. The first two blocks contain 2 layers each, with 64 and 128 filters, respectively. The next block contains 3 layers of 256 filters. Finally, the last 2 blocks contain 3 layers of 512 filters each.

Results. Table 2 presents results for these two networks and AlexNet-Object. For both AlexNet architectures, we focus on the last three convolutional layers, as we observed in sec. 4.2 that filters in the first two layers correspond poorly to semantic parts. Analogously, for VGG16-Object we present the top layer of each of the last 3 blocks of the network. Each column of table 2 corresponds to an object class and shows the AP result obtained by the GA filter combination, averaged over all parts of the object class (see supplementary material for results on individual part classes). Results confirm for the two new networks the trend observed for AlexNet-Object: filters of higher layers are more responsive to semantic parts. Interestingly, against our expectations, AlexNet-Image and AlexNet-Object perform about the same. This shows that the network's inclination to learn semantic parts is already present even when trained for whole image classification, which in turn suggests that object
parts are useful for that task too. Moreover, parts do not seem to emerge more in the deeper VGG16-Object. This suggests that having additional layers does not encourage learning filters that model semantic parts better.

Parts as discriminative patches
CNNs are trained for recognition tasks, e.g. image classification or object detection. The training procedure maximizes an objective function related to recognition performance. Therefore, it is sensible to assume that the network filters learn to respond to image patches discriminative for the object classes in the training set. However, these discriminative filters need not correspond to semantic parts. In this section we investigate to which degree the network learns such discriminative filters. Moreover, we test whether some discriminative filters are also semantic (e.g. wheels are very discriminative for recognizing cars).
Discriminative filters. We investigate whether layer 5 filters of AlexNet-Object respond to recurrent discriminative image patches, by assessing how discriminative each filter is for each object class. We use the following measure of the discriminativeness of a filter f_j for a particular object class. First, we record the output score s_i of the network on an input image I_i. Then, we compute a second score s_i^j using the same network but ignoring filter f_j. We achieve this by zeroing the filter's feature map x_j, which means a_{c,r} = 0, ∀ a_{c,r} ∈ x_j. Finally, we define the discriminativeness of filter f_j as the score difference averaged over the set I of all images of the object class:

    δ_j = (1/|I|) Σ_{I_i ∈ I} (s_i − s_i^j).

In practice, δ_j indicates how much filter f_j contributes to the classification score of the class. Fig. 6a shows an example of these score differences for class car.
Only a few filters have high δ values, indicating that they are truly discriminative for the class. The remaining filters have low values attributable to random noise. We consider f_j to be a discriminative filter if δ_j > 2σ, where σ is the standard deviation of the distribution of δ_k, k ∈ {1, ..., 256}. For the car class, only 7 filters are discriminative under this definition. Fig. 6b shows an example of the receptive field centers of the activations of the top 5 most discriminative filters, which appear to be distributed over several locations on the car. Interestingly, averaged over all classes, we find that only 9 out of the 256 filters in layer 5 are discriminative for a particular class. The total number of discriminative filters in the network, over all 16 object classes, amounts to 104. This shows that the discriminative filters are largely distributed across different object classes, with very little sharing, as also observed by [3]. The network obtains its discriminative power from just a few filters specialized to each class.
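The 2σ selection rule can be sketched as follows, using a hypothetical δ distribution where most values are noise and a few are clearly high:

```python
import statistics

def discriminative_filters(deltas, k=2.0):
    """Indices j whose delta_j exceeds k standard deviations of the distribution."""
    sigma = statistics.pstdev(deltas)
    return [j for j, d in enumerate(deltas) if d > k * sigma]

# Hypothetical delta values: mostly noise, with two clearly discriminative filters.
deltas = [0.01, -0.02, 0.9, 0.03, 1.1, 0.0, -0.01, 0.02]
print(discriminative_filters(deltas))  # [2, 4]
```

In the paper the distribution has 256 entries (one per layer 5 filter) and the threshold typically selects only a handful per class.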
Discriminative and semantic filters. We now investigate the connection between discriminative and semantic filters. Fig. 6c presents an example for class car. We can see how the discriminative filters are also semantic for many parts. Some filters are shared across semantic parts: for example, leftside and rightside correspond to the same two filters, one of which also corresponds to wheel. Similarly, doors and windows share most of their associated filters. However, there are also highly discriminative filters that are not semantic, e.g. filter 8. Fig. 7 shows examples for other classes, where we can observe some other interesting patterns. For example, wheels are extremely discriminative for class bicycle, in contrast to class car, where discriminative filters are more evenly distributed. Since wheels are generally large in bicycle images, some filters specialize to subparts of the wheel, such as its bottom area. Another interesting observation is that the discriminativeness of a semantic part may depend on the object class to which it belongs. For example, class cat accumulates 5 of its most discriminative filters on parts of the head. On the other hand, class horse tends to prefer parts of the body, such as the legs, devoting just 1 discriminative filter to the head. Besides firing on subparts, some discriminative filters fire on superparts, either assemblies of multiple parts or a single part with some additional region (e.g. filter 206 for class bird is associated with both wing and tail).
We count how many of the discriminative filters of each object class are also semantic. Analogously to sec. 4.2, we define a filter as semantic if its performance as a detector for a part class exceeds AP 30. On average, we find that 5.5 out of the 9 discriminative filters for an object class are also semantic. Therefore, about 60% of the discriminative filters are also semantic. Perhaps this is why several works [1,4,26] have hypothesized that convolutional filters respond to actual semantic parts. In reality this is only partially true, as many filters simply respond to discriminative patches.
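The overlap count can be illustrated with a small sketch. The filter ids and AP values below are hypothetical, standing in for the per-filter part-detection APs measured in sec. 4.2:

```python
def semantic_fraction(discriminative, best_part_ap):
    """Fraction of discriminative filters whose best part-detection AP exceeds 30."""
    semantic = [j for j in discriminative if best_part_ap.get(j, 0.0) > 30.0]
    return len(semantic) / len(discriminative)

# Hypothetical discriminative filter ids and their best part-detection AP (in %).
disc = [8, 17, 42, 101, 206]
ap = {8: 12.0, 17: 45.0, 42: 33.5, 101: 28.0, 206: 51.0}
print(semantic_fraction(disc, ap))  # 0.6
```

With the paper's averages (5.5 semantic out of 9 discriminative per class), this fraction comes out to roughly 0.6 as well.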

Conclusions
We have analyzed the emergence of semantic parts in CNNs. We have investigated whether the network's filters learn to respond to semantic parts, associating filter stimuli with ground-truth part bounding-boxes in order to perform a quantitative evaluation across different layers, network architectures, and supervision levels. Despite promoting this emergence by providing favorable settings and multiple assists, we found that only 34 out of 123 semantic parts in the PASCAL-Part dataset [5] emerge in AlexNet [6] finetuned for object detection [7]. Interestingly, different levels of supervision and network architectures do not significantly affect the emergence of parts. Finally, we have studied the response to another type of part: recurrent discriminative patches. We have found that the network discriminates using only a few filters specialized to each class, about 60% of which also correspond to semantic parts.

Supplementary Material
In this section we present the complete results for all part classes for the three different settings evaluated in our work. For each part class and network layer, the table reports the AP of the best individual filter in the layer ('Best'), the increase in performance over the best filter obtained by selecting a combination of filters with our GA ('GA'), and the number of filters in that combination ('nFilters'). Table 3 presents results for all parts for AlexNet-Object, for all five convolutional layers. Table 2 presents results for all parts for AlexNet-Image, for the last three convolutional layers. Finally, Table 3 presents results for all parts for VGG16-Object, for the last convolutional layer of each of the last 3 blocks.

Fig. 2: Examples of stimulus detections for layer 5 filters. For each part class we show a feature map on the left, where we highlight the strongest activation in red. On the right, we show the corresponding original receptive field and the regressed box.

Fig. 3: Part detection examples obtained by the combination of filters selected by our GA (top) or by TopFilters (bottom). Different box colors correspond to different filters' detections. Note how the GA better selects filters that complement each other.

Fig. 4: Detections performed by filters 141, 133, and 236 of AlexNet-Object (L5). The filters are specific to a part and work well on several object classes containing it.

Fig. 5: Recall vs. false positives curves for six part classes using AlexNet-Object's layer 5 filters. For each part class we show the curves for the top three individually best filters and for the combination of filters selected by our GA.

Fig. 6: Discriminative filters for object class car. (a) How discriminative the filters of AlexNet-Object (layer 5) are for car detection (higher values are more discriminative). (b) The activations of the five most discriminative filters. (c) Which of the ten most discriminative filters for car are also semantic filters for its parts.

Fig. 7: Activations of the five most discriminative filters on different object classes (top) and filters that are both discriminative and semantic (bottom).

Table 1: Part detection results in terms of AP on the train set of PASCAL-Part for AlexNet-Object. Best is the AP of the best individual filter, whereas GA indicates the increment over Best obtained by selecting the combination of (nFilters) filters.

Table 3: Part detection results in terms of AP on the train set of PASCAL-Part for AlexNet-Object. Best is the AP of the best individual filter, whereas GA indicates the increment over Best obtained by selecting the combination of (nFilters) filters.

Table 2: Part detection results in terms of AP on the train set of PASCAL-Part for AlexNet-Image. Best is the AP of the best individual filter, whereas GA indicates the increment over Best obtained by selecting the combination of (nFilters) filters.

Table 3: Part detection results in terms of AP on the train set of PASCAL-Part for VGG16-Object. Best is the AP of the best individual filter, whereas GA indicates the increment over Best obtained by selecting the combination of (nFilters) filters.