1 Introduction

Following the success of convolutional neural networks (CNNs) for large scale image classification [2, 3], remarkable efforts have been made to deliver state-of-the-art performance on this task. Along with more complex and elaborate architectures, many techniques concerning parameter initialization, optimization and regularization have also been developed to achieve better performance. Although various aspects of CNNs have been investigated, the design of the convolution kernels, which can be considered one of the fundamental problems, has barely been studied. Some studies examined how the size of kernels affects performance [4], leading to a recent trend of stacking small kernels (e.g. \(3\times 3\)) in deep layers of CNNs. However, analysis of the shapes of kernels is mostly left untouched. Although, intuitively, there seems to be no latitude in designing the shape of convolution kernels (especially \({3\times 3}\) kernels), in this work we suggest that designing the shapes of kernels is feasible and practical, and we analyze its effect on performance.

In the early studies of biological vision [5–7], it was observed that the receptive fields (RFs) of neurons are arranged in an approximately hexagonal lattice. A recent work reported an interesting result that an irregular lattice with appropriately adjusted asymmetric RFs can be accurate in representation of visual patterns [8]. Intriguingly, hexagonal-shaped filters and lattice structures have been analyzed and employed for solving various problems in computer vision and image processing [9, 10]. In this work, motivated by these studies, we propose a method for designing the kernel shapes in CNNs. Specifically, we propose a method to use an asymmetric shape, which simulates hexagonal lattices, for convolution kernels (see Figs. 3 and 4), and then deploy kernels with this shape in different orientations for different layers of CNNs (Sect. 2).

Fig. 1.

Examples of visualization of ROI (Sect. 3) in two images (a) for CNNs equipped with kernels with (b) square, and (c) our proposed “quasi-hexagonal” shapes (Sect. 2). The pixels marked in red are those with the maximum contribution to the classification scores of the correct classes. For (b), these pixels tend to be concentrated on local, specific parts of the object, whereas for (c), they are distributed across multiple local parts of the object. See the text for more details. (Color figure online)

This design of kernel shapes brings multiple advantages. Firstly, as will be shown in the experimental results (Sect. 4.1), CNNs which employ the proposed design method are able to achieve comparable or even better classification performance, compared to CNNs which are constructed using the same architectures (same depth and output channels for each layer) but employing square (\({3\times 3}\)) kernels. Thus, a notable improvement in computational efficiency (a 22 % reduction in the number of parameters and in training time) can be achieved, as the proposed kernels include fewer weights than \({3\times 3}\) kernels. Meanwhile, increasing the number of output channels of our proposed models (to keep the number of parameters the same as in the corresponding models with square kernels) leads to a further improvement in performance.

Secondly, CNNs which employ our proposed kernels learn to extract discriminative features in a more flexible and robust manner. This results in better robustness to various types of noise in natural images that could make classification erroneous, such as occlusions. Figure 1 shows examples of visualization of features extracted using fully-trained CNNs equipped with and without our proposed kernels, which are obtained by the method introduced in Sect. 3. These depict the image pixels that have the maximum contribution to the classification score of the correct class (shown in red). For CNNs equipped with our proposed kernels, these pixels tend to be less concentrated on local regions and instead distributed across a number of sub-regions, as compared to CNNs with standard square kernels. This property helps prevent erroneous classification due to occlusions, as will be shown in the experimental results. It also helps to explain why the CNNs equipped with our proposed kernels perform on par with the CNNs equipped with square kernels despite having fewer parameters. The contributions of the paper are summarized as follows:

  1. We propose a method to design convolution kernels in deep layers of CNNs, which is inspired by hexagonal lattice structures employed for solving various problems of computer vision and image processing.

  2. We examine classification performance of CNNs equipped with our kernels, and compare the results with state-of-the-art CNNs equipped with square kernels using benchmark datasets, namely ImageNet and CIFAR 10/100. The experimental results show that the proposed method is superior to the state-of-the-art CNN models in terms of computational time and/or classification performance.

  3. We introduce a method for visualization of features to qualitatively analyze the effect of kernel design on classification. Additionally, we analyze the robustness of CNNs equipped with and without our kernel design to occlusion by measuring their classification accuracy when some regions on input images are occluded.

Fig. 2.

(a) Visualization of a subset of kernels \(\mathbf {K}^S_{a,l} \in {\mathbb {R}^{K \times K}}\), where K is the size of kernel, at the first convolution layer \(l=1\) of AlexNet [3] trained on ImageNet. (b) An average kernel \(\hat{\mathbf {K}}_S=\frac{1}{A} \sum _{a=1}^{A}|\mathbf {K}^S_{a,l}|\) is depicted at the top-left part. Each bar in the histogram shows a cumulative distribution of values over each channel, c.

2 Our Approach

We propose a method for designing the shape of convolution kernels employed for image classification. The proposed method enables us to reduce the computational time of training CNNs by providing more compact representations, while preserving the classification performance.

In CNNs [3, 4, 11], an input image (or feature map) \(\mathbf {I}\in {\mathbb {R}^{W\times H \times C}}\) is convolved with a series of square shaped kernels \(\mathbf {K}^S\in {\mathbb {R}^{K\times K \times C}}\) through its hierarchy. The convolution operation \(\mathbf {K}^S*\mathbf {I}\) can be considered as sampling of the image \(\mathbf {I}\), and extraction of discriminative information with learned representations. Figure 2 shows a subset of learned kernels \(\mathbf {K}^S\), and the kernel \(\hat{\mathbf {K}}_S\) averaged over all the kernels employed at the first layer of AlexNet [3]. Distribution of values of \(\hat{\mathbf {K}}_S\) shows that most of the weights at the corner take values close to zero, thus making less contribution for representing features at the higher layers. If a computationally efficient and compressed model is desired, additional methods need to be employed, such as pruning these diluted parameters during fine-tuning [12].

Fig. 3.

(a) Our proposed kernel. (b) It can approximate a hexagonal kernel by shifting through direction D. (c) A set of kernel candidates which are denoted as design patterns “U”, “R”, “D”, “L” from left to right.

Fig. 4.

(a) Employment of the proposed method in CNNs by stacking small size “quasi-hexagonal” kernels. (b) The kernels employed at different layers of a two-layer CNN will induce the same pattern of RFs on images observed in (a), if only the kernels designed with the same patterns are used, independent of order of their employment.

2.1 Designing Shape of Convolution Kernels

In this work, we address the aforementioned problems by designing shapes of kernels on a two-dimensional coordinate system. For each channel of a given image \(\mathbf {I}\), we associate each pixel \(I_{i,j} \in \mathbf {I}\) at each coordinate \((i,j)\) with a lattice point (i.e., a point with integer coordinates) in a square grid (Fig. 3a) [13, 14]. If two lattice points in the grid are distinct and each coordinate of one differs from the corresponding coordinate of the other by at most 1, then they are called 8-adjacent [13, 14]. An 8-neighbor of a lattice point \(I_{i,j} \in \mathbf {I}\) is a point that is 8-adjacent to \(I_{i,j}\). We define \(N_9[I_{i,j}]\) as a set consisting of a pixel \(I_{i,j} \in \mathbf {I}\), and its 8 nearest neighbors (Fig. 3a). A shape of a quasi-hexagonal kernel \(\mathbf {K}^H(D_{p,q}) \subset N_9[I_{i,j}]\) is defined as

$$\begin{aligned} \mathbf {K}^H(D_{p,q}) = \{N_9[I_{i,j}] \cap N_9[I_{i+p,j+q}] \cup I_{i-p,j-q} \} \end{aligned}$$
(1)

where \({D_{p,q}\in \mathcal {D} }\) is a random variable that indicates the design pattern used for the shape of \(\mathbf {K}^H(D_{p,q})\), and takes values from \(\mathcal {D}=\{(-1,0),(1,0),(0,-1),(0,1)\}\) (see Fig. 3c). Then, convolution of the proposed quasi-hexagonal kernel \(\mathbf {K}^H(D_{p,q})\) on a neighborhood centered at a pixel located at \((x,y)\) on an image \(\mathbf {I}\) is defined as

$$\begin{aligned} I_{x,y}*\mathbf {K}^H(D_{p,q})= \sum _{s,t} \mathbf {K}^H_{s,t}(D_{p,q}) I_{x-s,y-t}. \end{aligned}$$
(2)
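As an illustration of Eqs. (1) and (2), the following minimal NumPy sketch builds the seven-element support of a quasi-hexagonal kernel for each direction in \(\mathcal {D}\) and applies it to a single channel. The function names `qh_mask` and `qh_conv2d` are hypothetical, and the zero padding of 1 with stride 1 follows the layer settings stated in the caption of Table 1; this is not the authors' Caffe implementation.

```python
import numpy as np

DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # the set D below Eq. (1)

def qh_mask(p, q):
    """Boolean 3x3 support of K^H(D_{p,q}); mask[a + 1, b + 1] is offset (a, b)."""
    mask = np.zeros((3, 3), dtype=bool)
    for a in (-1, 0, 1):
        for b in (-1, 0, 1):
            # (a, b) always lies in N9 of the centre; test N9 of the shifted centre.
            in_shifted = abs(a - p) <= 1 and abs(b - q) <= 1
            is_opposite = (a, b) == (-p, -q)   # the single added element I_{i-p, j-q}
            mask[a + 1, b + 1] = in_shifted or is_opposite
    return mask

def qh_conv2d(image, weights, p, q):
    """Eq. (2) for one channel: masked 3x3 convolution, stride 1, zero padding 1."""
    kernel = np.where(qh_mask(p, q), weights, 0.0)
    padded = np.pad(image, 1)                  # constant zero padding
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for s in (-1, 0, 1):
        for t in (-1, 0, 1):
            # padded[1 - s + x, 1 - t + y] equals I_{x - s, y - t}
            out += kernel[s + 1, t + 1] * padded[1 - s:1 - s + h, 1 - t:1 - t + w]
    return out

for p, q in DIRECTIONS:
    print((p, q), int(qh_mask(p, q).sum()))    # each support has 7 elements
```

Each mask keeps 7 of the 9 positions of \(N_9\). Which direction corresponds to which of the labels “U”, “R”, “D”, “L” depends on the axis convention of Fig. 3 and is left open here.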

2.2 Properties of Receptive Fields and Quasi-Hexagonal Kernels

Aiming at more flexible representation of shapes of natural objects which may diverge from a fixed square shape, we stack “quasi-hexagonal” kernels designed with different shapes, as shown in Fig. 4. For each convolution layer, we randomly select \(D_{p,q}\in \mathcal {D} \) according to a uniform distribution to design kernels. Random selection of design patterns of kernels is feasible because the shapes of RFs do not depend on the order in which kernels are employed, as long as kernels designed with the same patterns are used by the corresponding units (see Fig. 4b). Therefore, if a CNN model is deep enough, then RFs with a more stable shape will be induced at the last layer, compared to the RFs of middle layer units.

We carry out a Monte Carlo simulation to examine this property using different kernel arrangements. Given an image \(\mathbf {I}\in {\mathbb {R}^{W\times H}}\), we first define a stochastic matrix \({\mathcal {M}\in {\mathbb {R}^{W\times H}}}\). The elements of the matrix are random variables \({\mathcal {M}_{i,j}\in {[0,1]}}\) whose values represent the probability that a pixel \({I_{i,j} \in \mathbf {I}}\) is covered by an RF. Next, we define \({\hat{\mathcal {M}} \triangleq \frac{1}{K}\sum _{k=1}^{K} \mathcal {M}_S^k}\) as an average of RFs for a set of kernel arrangements \( \{ \mathcal {M}_S^k \} _{k=1} ^K\), where K here denotes the number of arrangements. Then, the difference between \(\mathcal {M}_S^k\) and the average \(\hat{\mathcal {M}}\) is computed using

$$\begin{aligned} d(\hat{\mathcal {M}},\mathcal {M}_{S}^k) = \Vert \hat{\mathcal {M}} - \mathcal {M}_{S}^k \Vert _F^2/(WH), \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _F^2\) is the squared Frobenius norm [15]. Note that we obtain a better approximation to the average RF as the distance decreases. The average \(\mu _d\) and standard deviation \(\sigma _d\) given in Fig. 5 show that a better approximation to the average RF is obtained if kernels used at different layers are integrated at a greater depth.
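A simulation of this kind can be sketched as follows. This is only one possible reading of the setup, under stated assumptions: each \(\mathcal {M}_S^k\) is taken to be the binary coverage map of the RF induced by one random arrangement of design patterns, the grid is just large enough to hold an RF of the given depth, and the names `qh_offsets`, `receptive_field` and `rf_distance_stats` are hypothetical.

```python
import numpy as np

DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def qh_offsets(p, q):
    """Offsets (a, b) in the support of K^H(D_{p,q}) as given by Eq. (1)."""
    return [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)
            if (abs(a - p) <= 1 and abs(b - q) <= 1) or (a, b) == (-p, -q)]

def receptive_field(directions):
    """Binary RF on the input induced by stacking one kernel per direction."""
    depth = len(directions)
    size = 2 * depth + 1                 # large enough that np.roll never wraps
    rf = np.zeros((size, size), dtype=bool)
    rf[depth, depth] = True
    for p, q in directions:
        grown = np.zeros_like(rf)
        for a, b in qh_offsets(p, q):
            # Eq. (2) reads I_{x-s, y-t}, so each offset enters with a minus sign.
            grown |= np.roll(rf, shift=(-a, -b), axis=(0, 1))
        rf = grown
    return rf.astype(float)

def rf_distance_stats(depth, n_arrangements=500, seed=0):
    """Mean and standard deviation of d in Eq. (3) over random arrangements."""
    rng = np.random.default_rng(seed)
    rfs = np.stack([
        receptive_field([DIRECTIONS[i] for i in rng.integers(len(DIRECTIONS), size=depth)])
        for _ in range(n_arrangements)
    ])
    mean_rf = rfs.mean(axis=0)           # the average RF
    w, h = mean_rf.shape
    d = ((rfs - mean_rf) ** 2).sum(axis=(1, 2)) / (w * h)
    return d.mean(), d.std()

for depth in (2, 4, 8):
    print(depth, rf_distance_stats(depth))
```

Because growing the RF layer by layer is a Minkowski sum, which is commutative, `receptive_field` returns the same map for any ordering of a given multiset of directions, which is the order-independence property illustrated in Fig. 4b. The numbers printed by this sketch are not expected to reproduce those of Fig. 5, since the grid size and sampling details of the original simulation are not specified here.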

Fig. 5.

In (a), (b), (c) and (d), the left and right panels show an average shape of kernels that emerged from 5000 different shape configurations, and the shape of a kernel designed using a single shape configuration, respectively. It can be seen that the average and variance of d decrease as the kernels are computed at deeper layers. In other words, at deeper layers of CNNs, randomly generated configurations of shapes of kernels can provide better approximations to average shapes of kernels.

3 Visualization of Regions of Interest

We propose a method to visualize the features detected in RFs and the ROI of the image. Following the feature visualization approach suggested in [16], our proposed method provides a saliency map by back-propagating the classification score for a given image and a class. Given a CNN consisting of L layers, the score vector for an input image \(\mathbf {I} \in \mathbb {R}^{H \times W \times C}\) is defined as

$$\begin{aligned} \mathbf {S} = F_{1}(\mathbf {W}^{1}, F_{2}(\mathbf {W}^{2},\ldots , F_{L}(I,\mathbf {W}^{L}))), \end{aligned}$$
(4)

where \(\mathbf {W}^{L}\) denotes the weights of the kernel \(\mathbf {K}_{L}\) at the \(L^{th}\) layer, and \(S^\mathcal {C}\) is the \(\mathcal {C}^{th}\) element of \(\mathbf {S}\) representing the classification score for the \(\mathcal {C}^{th}\) class. At the \(l^{th}\) layer, we compute a feature map \(\mathbf {M}^{l}\) for each unit \(u^{l}_{i,j,k}\in {\mathbf {M}^{l}}\), which takes values from its receptive field \(\mathcal {R}(u^{l}_{i,j,k})\), and generate a new feature map in which all the units except \(u^{l}_{i,j,k}\) are set to 0. Then, we feed this new feature map to the tail of the CNN to calculate its score vector as

(5)

Thereby, we obtain a score map \(\mathbb {S}^{l}\) for all the units of \(\mathbf {M}^{l}\), from which we choose the top N most contributing units, i.e. the units with the N highest scores. Then, we back-propagate their score \(\mathbf {S}^{\mathcal {C}}(u^{l}_{i,j,k})\) for the correct (target) class label towards the forepart of the CNN to rank the contribution of each pixel \(p \in \mathbf {I}\) to the score as

$$\begin{aligned} \mathbb {S}^{l} (\mathcal {C},u^{l}_{i,j,k}) = F^{-1}_{1}(\mathbf {W}^{1}, F^{-1}_{2}(\mathbf {W}^{{2}}, \ldots , F^{-1}_{l}(\mathbf {S}^{\mathcal {C}}(u^{l}_{i,j,k}),\mathbf {W}^{l}))), \end{aligned}$$
(6)

where \(\mathbb {S}^{l} (\mathcal {C},u^{l}_{i,j,k})\) is a score map that has the same dimensions as the image \(\mathbf {I}\), and that records the contribution of each pixel \(p \in \mathbf {I}\) to the \(\mathcal {C}^{th}\) class. Here we choose the top \(\varOmega \) units \(\{\mathfrak {u}^l_\omega \}^\varOmega _{\omega =1}\) with the highest score \(\mathbf {S}^{\mathcal {C}}\), where \(\mathfrak {u}^l_\omega \) is the \(\omega ^{th}\) unit employed at the \(l^{th}\) layer. Then, we compute the incorporated saliency map \(\mathbf {L}^{\mathcal {C},l}\in \mathbb {R}^{H\times W}\) extracted at the \(l^{th}\) layer for the \(\mathcal {C}^{th}\) class as follows

$$\begin{aligned} \mathbf {L}^{\mathcal {C},l} = \sum _{\omega } |\mathbb {S}^{l} (\mathcal {C},\mathfrak {u}^{l}_{\omega })|, \end{aligned}$$
(7)

where \(| \cdot |\) is the absolute value function. Finally, the ROI defined by a set of merged RFs, \(\{\mathcal {R}(\mathfrak {u}^l_\omega )\}^\varOmega _{\omega =1}\), is depicted as a non-zero region in \(\mathbf {L}^{\mathcal {C},l}\).
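The unit-selection step and Eq. (7) can be sketched in NumPy as below. The sketch makes several assumptions: `tail` is a hypothetical callable standing in for the layers after layer l, the per-unit score maps of Eq. (6) are assumed to have been computed elsewhere and already reduced to one \(H \times W\) map per unit, and the function names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def per_unit_scores(feature_map, tail, target_class):
    """Target-class score obtained when only one unit of the feature map is kept."""
    scores = np.zeros(feature_map.shape, dtype=float)
    for idx in np.ndindex(feature_map.shape):
        isolated = np.zeros_like(feature_map)
        isolated[idx] = feature_map[idx]       # all other units are set to 0
        scores[idx] = tail(isolated)[target_class]
    return scores

def incorporated_saliency(score_maps, unit_scores, omega):
    """Eq. (7): sum of absolute score maps over the omega highest-scoring units."""
    score_maps = np.asarray(score_maps, dtype=float)    # shape (n_units, H, W)
    unit_scores = np.asarray(unit_scores, dtype=float)  # shape (n_units,)
    top = np.argsort(unit_scores)[::-1][:omega]         # indices of the top units
    saliency = np.abs(score_maps[top]).sum(axis=0)
    roi = saliency > 0                                   # non-zero region of the map
    return saliency, roi

# Toy usage with three units on a 2x2 image (values are made up).
maps = [[[1, -2], [0, 0]], [[0, 0], [0, 3]], [[5, 5], [5, 5]]]
saliency, roi = incorporated_saliency(maps, unit_scores=[0.9, 0.8, 0.1], omega=2)
print(saliency)
print(roi)
```

In the toy call, only the first two units are selected, so the third map does not contribute even though its entries are the largest.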

4 Experiments

In Sect. 4.1, we examine classification performance of CNNs implementing the proposed methods using two benchmark datasets, CIFAR-10/100 [1] and ILSVRC-2012 (a subset of ImageNet [2]). We then analyze the relationship between shape of kernels, ROI and localization of feature detections on images, and examine the robustness of CNNs for classification of occluded images. We implemented the CNN models using the Caffe framework [17]; implementation details of the algorithms and additional results are provided in the supplemental material.

4.1 Classification Performance

Experiments on CIFAR Datasets. A list of CNN models used in experiments is given in Table 1a. We used the ConvPool-CNN-C model proposed in [18] as our base model (BASE-A). We employed our method in three different models: (i) QH-A retains the structure of the BASE-A by just implementing kernels using the proposed methods, (ii) QH-B models a larger number of feature maps compared to QH-A such that QH-B and BASE-A have the same number of parameters, (iii) QH-C is a larger model which is used for examination of generalization properties (over/under-fitting) of the proposed QH-models. Following [18] we implement dropout on the input image and at each max pooling layer. We also utilized most of the hyper-parameters suggested in [18] for training the models.

Table 1. CNN configurations. The convolution layer parameters are denoted as \({<}\mathrm{duplication}{>}\times \mathrm{conv}{<}\mathrm{kernel}{>}\,-\,{<}\mathrm{number~of~channels}>\). Each convolution layer is followed by a rectified linear unit (ReLU). ReLU and dropout layers are not shown for brevity. All the conv-3\(\,\times \,\)3/QH/FK layers use stride 1 and pad 1.

Since our proposed kernels have fewer parameters compared to 3 \(\times \) 3 square shaped kernels, by retaining the same structure as BASE-A, QH-A may benefit from the regularization effect of having fewer parameters in total, which helps prevent over-fitting. In order to analyze this regularization property of the proposed method, we implemented a reference model, called BASE-REF, with conv-FK (fragmented kernel) layers, which have \(3\times 3\) convolution kernels in which the values of two randomly selected parameters are set to 0 (to keep the number of effective parameters the same as that of quasi-hexagonal kernels). In another reference model (QH-EXT), shape patterns of kernels (Sect. 2) are chosen to be the same (\(<R, \ldots , R>\) in this implementation). Moreover, we introduced two additional variants of models using (i) different kernel sizes for max pooling (-pool4), and (ii) an additional dropout layer before global average pooling (-AD).
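For concreteness, a fragmented-kernel support of the kind used in BASE-REF could be generated as in the sketch below. The text only states that two randomly selected parameters of a \(3\times 3\) kernel are set to 0; whether the centre may be removed, and whether the selection is made per kernel or per layer, are assumptions of this sketch, and `fragmented_mask` is a hypothetical name.

```python
import numpy as np

def fragmented_mask(rng):
    """3x3 boolean mask with two randomly chosen positions switched off."""
    mask = np.ones(9, dtype=bool)
    # Assumption: any two distinct positions, including the centre, may be removed.
    mask[rng.choice(9, size=2, replace=False)] = False
    return mask.reshape(3, 3)

rng = np.random.default_rng(0)
fk = fragmented_mask(rng)
print(fk.astype(int))
print(int(fk.sum()))   # 7 effective weights, as for a quasi-hexagonal kernel
```

Such a mask has the same number of effective weights as a quasi-hexagonal kernel but, in general, not its shape, which is what makes it usable as a control for the effect of parameter count alone.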

Results given in Table 2 show that the proposed QH-A has comparable performance to the base CNN models that employ square shape kernels, despite a smaller number of parameters. Meanwhile, a significant decrease in accuracy appears in the BASE-REF model, which employs the same number of parameters as QH-A. This suggests that the performance of our proposed model cannot be explained solely by the regularization effect of using a smaller number of parameters. The inferior performance of the QH-EXT model indicates the effectiveness of randomly selecting kernels as described in Sect. 2. Moreover, it can also be observed that the additional dropout and the larger pooling size improve the classification performance of both BASE-A and the proposed QH-A by a similar magnitude. This observation implies a general compatibility between the square kernels and the proposed kernels (Table 3).

Table 2. Comparison of classification errors using CIFAR-10 dataset (Single models trained without data augmentation).
Table 3. Comparison of classification error of models using CIFAR-10/100 datasets (Single models trained without data augmentation).

Additionally, we compare the proposed methods with state-of-the-art methods for CIFAR-10 and CIFAR-100 datasets. For CIFAR-100, we used the same models implemented for CIFAR-10 with the same hyper-parameters. The results given in Table 3 show that our base model with an additional dropout (BASE-A-AD) provides comparable classification performance for CIFAR-10, and outperforms the state-of-the-art models for CIFAR-100. Moreover, our proposed models (QH-B-AD and QH-C-AD) improve the classification accuracy by adopting more feature maps.

Experiments on ImageNet. We use an up-scaled version of the BASE-A model for CIFAR-10/100 as our base model, which stacks 11 convolution layers with regular 3 \(\times \) 3 square kernels, followed by a 1 \(\times \) 1 convolution layer and a global average pooling layer. Then, we modified the base model with three different types of kernels: (i) our proposed quasi-hexagonal kernels (denoted as conv-QH layer), (ii) reference kernels where we remove an element located at a corner and one of its adjacent elements located at an edge of a standard 3 \(\times \) 3 square shape kernel (conv-UB), (iii) reference kernels where we remove an element from a corner and an element from the diagonally opposite corner of a standard 3 \(\times \) 3 square shape kernel (conv-DIA). Notice that unlike the fragmented kernels we employed in the last experiment, these two reference kernels can also be used to generate the aforementioned shapes of RFs. However, unlike the proposed quasi-hexagonal kernels, we cannot assure that these kernels can be used to simulate hexagonal processing. Configurations of the CNN models are given in Table 1b. Dropout [24] is used on an input image at the first layer (with dropout ratio 0.2), and after the last conv-3 \(\times \) 3 layer. We employ a simple method for fixing the size of train and test samples to \(256 \times 256\) [4], and a patch of \(224 \times 224\) is cropped and fed into the network during training. Additional data augmentation methods, such as random color shift [3], are not employed, for the sake of fast convergence.

Classification results are given in Table 4. The results show that the performance of the reference models is slightly better than that of the base model. Notice that since the base model is relatively over-fitted (top5 accuracy for training sets is \(\ge \)97 %), these two reference models are likely to benefit from the regularization effect brought by a smaller number of parameters. Meanwhile, our proposed QH-BASE outperformed all the reference models, implying the validity of the proposed quasi-hexagonal kernels in approximating hexagonal processing. Detailed analyses concerning compactness of models are provided in the next section.

Table 4. Comparison of classification accuracy using validation set of ILSVRC-2012.
Table 5. Comparison of number of parameters and computational time of different models.

Analysis of Relationship Between Compactness of Models and Classification Performance. In this section, we analyze the compactness of learned models for ImageNet and CIFAR-10 datasets. We provide a comparison of the number of parameters and computational time of the models in Table 5. The results show that, in the experimental analyses for the CIFAR-10 dataset, the QH-A model has a comparable performance to the base model with fewer parameters and less computational time. If we keep the same number of parameters (QH-B), then classification accuracy improves for similar computational time. Meanwhile, in the experimental analyses for the ImageNet dataset, our proposed model shows significant improvement in both model size and computational time.
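The parameter saving per layer follows directly from the kernel supports: a quasi-hexagonal kernel keeps 7 of the 9 weights of a \(3\times 3\) kernel (Eq. 1). The short sketch below works through that arithmetic. The 96-channel layer is a made-up example rather than a layer taken from Table 1, and the widening factor assumes that input and output channels are scaled uniformly, which need not be how QH-B was actually configured.

```python
import math

SQUARE_WEIGHTS = 3 * 3   # weights in a standard 3x3 kernel, per input channel
QH_WEIGHTS = 7           # weights in a quasi-hexagonal kernel (Eq. 1)

def conv_weights(c_in, c_out, weights_per_kernel):
    """Number of weights in one convolution layer (biases ignored)."""
    return c_in * c_out * weights_per_kernel

# Relative saving when a 3x3 kernel is replaced by a quasi-hexagonal one.
reduction = 1.0 - QH_WEIGHTS / SQUARE_WEIGHTS
print(f"reduction per layer: {reduction:.1%}")   # prints "reduction per layer: 22.2%"

# Hypothetical layer with 96 input and 96 output channels.
print(conv_weights(96, 96, SQUARE_WEIGHTS), conv_weights(96, 96, QH_WEIGHTS))

# Assumption: scaling both channel counts by the same factor; the weight count
# of the square-kernel layer is then matched at a factor of sqrt(9 / 7).
widen = math.sqrt(SQUARE_WEIGHTS / QH_WEIGHTS)
print(f"uniform widening factor: {widen:.3f}")
```

The per-layer figure of about 22 % matches the reduction quoted in the Introduction; whole-model figures also depend on layers that are not replaced, such as \(1\times 1\) convolutions.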

We conducted another set of experiments to analyze the relationship between the classification performance and the number of training samples using the CIFAR-10 dataset. The results given in Table 6 show that the QH-A-AD model provides a comparable performance with the base model, and the QH-B-AD model provides a better classification accuracy compared to the base model, as the number of training samples decreases. In an extreme case where only 1000 training samples are selected, QH-A-AD and QH-B-AD outperform the base model by 0.7 % and 3.1 %, respectively, which indicates the effectiveness of the proposed method.

Table 6. Comparison of classification error between models BASE-A-AD, QH-A-AD and QH-B-AD with different number of training samples on CIFAR-10 dataset.

4.2 Visualization of Regions of Interest

Figure 6 shows some examples of visualizations depicted using our method proposed in Sect. 3. Saliency maps are normalized and image contrast is slightly raised to improve visualization of images. We observed that for most of these correctly classified testing images, both the BASE model equipped with square kernels and the proposed QH-BASE model equipped with quasi-hexagonal kernels are able to present an ROI that roughly specifies the location and some basic shape of the target objects, and vice versa. Since the ROI is directly determined by RFs of neurons with strong reactions toward special features, this observation suggests that the relevance between learned representations and target objects is crucial for recognition and classification using large-scale datasets of natural images such as ImageNet.

However, some obvious differences between the ROI of the base model and the proposed model can be observed: (i) The ROI of the base model usually involves more background than that of the proposed model. That is, relative to the pixels with strong contributions, the percentage of pixels that do not essentially contribute to the classification score is generally higher in the base model. (ii) Features learned using the square kernels are more likely to be detected within clusters on special parts of the objects. The accumulation of the features located in these clusters results in a superior contribution, compared to the features that are scattered on the images. For instance, in the base model, more neurons have their RFs located in the heads of hare and parrots, thus the heads obtain higher classification scores than other parts of the body. (iii) As a result of (ii), some duplicated important features (e.g., the supporting parts of cart and seats of coach) are overlooked by the most strongly activated high-level neurons in the base model. Meanwhile, our proposed model with quasi-hexagonal kernels is more likely to obtain discriminative features that are spatially distributed on the whole object. In order to further analyze the results obtained by employing the square kernel and the proposed kernels for object recognition, we provide a set of experiments using occluded images in the next section.

Fig. 6.

Examples of visualization of ROI. A ROI demonstrates a union of RFs of the top 40 activated neurons at the last max pooling layer. The pixels marked with red color indicate their contribution to classification score, representing the activated features located at them. Borderlines of ROI are represented using yellow frames. Top 5 class predictions provided by the models are also given, and the correct (target) class is given using orange color. (Color figure online)

4.3 Occlusion and Spatially Distributed Representations

The analyses given in the last section imply that the base CNN models equipped with the square kernel could be vulnerable when recognizing objects in occluded scenes, which is a very common scenario in computer vision tasks. In order to analyze the robustness of the methods to partial occlusion of images, we prepare a set of locally occluded images using the following method. (i) We randomly select 1249 images that are correctly classified by both the base and proposed models using the validation set of ILSVRC-2012 [2]. (ii) We select the Top1 or Top5 elements with the highest classification score at the last maxpool layers of a selected model and calculate the ROI defined by their RFs, as we described in Sect. 3. (iii) Within the ROI, we choose 1–10% of pixels that provide the most contribution, and then occlude each of the selected pixels with a small circular occlusion mask (with radius \(r=5\) pixels), which is filled by black (Bla.) or randomly generated colors (Mot.) drawn from a uniform distribution. In total, we generate 120 different occlusion datasets (149880 different occluded images). Table 7 shows the classification accuracy on the occluded images. The results show that our proposed quasi-hexagonal kernel model is more robust than the square kernel model in this task of object recognition under targeted occlusion. Some sample images are shown in Fig. 7.
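Step (iii) can be sketched in NumPy as follows, under stated assumptions: the saliency map is taken to be non-zero exactly on the ROI, the image is an \(H \times W \times 3\) array of 8-bit values, and in the random mode one color is drawn per mask (the text does not say whether colors are drawn per mask or per pixel). The name `occlude` and its parameters are hypothetical.

```python
import numpy as np

def occlude(image, saliency, fraction=0.05, radius=5, mode="black", seed=0):
    """Cover the most contributing ROI pixels with small circular masks."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = saliency.shape
    roi = np.flatnonzero(saliency > 0)                 # flat indices of ROI pixels
    n_pick = max(1, int(round(fraction * roi.size)))   # e.g. 1-10 % of the ROI
    order = np.argsort(saliency.ravel()[roi])[::-1]    # strongest contribution first
    picked = roi[order[:n_pick]]
    yy, xx = np.mgrid[:h, :w]
    for cy, cx in zip(*np.unravel_index(picked, (h, w))):
        disk = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        if mode == "black":
            out[disk] = 0
        else:
            # Assumption: one uniformly drawn color per occlusion mask.
            out[disk] = rng.integers(0, 256, size=image.shape[-1])
    return out

# Toy usage: a white image with a single salient pixel.
image = np.full((20, 20, 3), 255, dtype=np.uint8)
saliency = np.zeros((20, 20))
saliency[10, 10] = 1.0
occluded = occlude(image, saliency, fraction=0.1, radius=2)
print(int((occluded.sum(axis=2) == 0).sum()))   # number of blacked-out pixels
```

The image count quoted above is consistent with applying every occlusion condition to every selected image: 120 datasets of 1249 images each give 149880 occluded images.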

Table 7. Performances on the occlusion datasets. Each column shows the classification accuracy (%) of test models in different occlusion conditions. In the first row, BASE/QH-BASE/VGG indicate the models used for generating occlusion, Top1/Top5 indicate the numbers of selected neurons that control the size of occluded region, Bla./Mot. indicate the patterns of occlusion.
Fig. 7.

Analysis of robustness of different models to occlusion. We use the same proposed method to select neurons and visualize their RFs for each model (see Sect. 3). The comparison between the ROI shown in Fig. 6 suggests that the proposed model overcomes the occlusion by detecting features that are spatially distributed on target objects. It can also be seen that the classification accuracy of the base model decreases although the ROI of the base model seems to be more adaptive to the shape of objects. This also suggests that the involvement of background may make it hard for the CNNs to discriminate background from useful features.

5 Conclusion

In this work, we analyze the effects of shapes of convolution kernels on feature representations learned in CNNs and on classification performance. We first propose a method to design the shape of kernels in CNNs. We then propose a feature visualization method for visualization of pixel-wise classification score maps of learned features. It is observed that the compact representations obtained using the proposed kernels are beneficial for the classification accuracy. In the experimental analyses, we obtained competitive performance on the ImageNet and CIFAR datasets. Moreover, our proposed methods enable us to implement CNNs with fewer parameters and less computational time compared to the baseline CNN models. Additionally, the proposed method improves the robustness of the baseline models to occlusion for classification of partially occluded images. These results support the effectiveness of the proposed method for designing the shape of convolution kernels in CNNs for image classification. In future work, we plan to apply the proposed method to other tasks such as object detection and segmentation.