Convolutional neural networks excel in a number of computer vision tasks. One of their most crucial architectural elements is the effective receptive field size, which has to be set manually to accommodate a specific task. Standard solutions involve large kernels, down/up-sampling and dilated convolutions. These require testing a variety of dilation and down/up-sampling factors and result in non-compact networks with a large number of parameters. We address this issue by proposing a new convolution filter composed of displaced aggregation units (DAUs). DAUs learn spatial displacements and adapt the receptive field sizes of individual convolution filters to a given problem, thus reducing the need for hand-crafted modifications. DAUs provide a seamless substitution of convolutional filters in existing state-of-the-art architectures, which we demonstrate on AlexNet, ResNet50, ResNet101, DeepLab and SRN-DeblurNet. The benefits of this design are demonstrated on a variety of computer vision tasks and datasets, such as image classification (ILSVRC 2012), semantic segmentation (PASCAL VOC 2011, Cityscapes) and blind image de-blurring (GOPRO). Results show that DAUs allocate parameters efficiently, resulting in networks that are up to 4\(\times \) more compact in terms of the number of parameters at similar or better performance.
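To make the idea of a filter built from displaced units more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a filter is a weighted sum of a few units, each of which reads the input at its own spatial displacement. The function names `sample_displaced` and `dau_like_filter` are hypothetical, the displacements are restricted to whole pixels and fixed by hand, and nothing here is learned; the abstract only states that DAUs learn their displacements, so how that learning is done is outside this sketch.

```python
import numpy as np


def sample_displaced(x, dy, dx):
    """Return out with out[i, j] = x[i + dy, j + dx], zero outside the image.

    Illustrative helper (hypothetical name); integer displacements only.
    """
    h, w = x.shape
    out = np.zeros_like(x)
    # A displacement larger than the image leaves nothing to copy.
    if abs(dy) >= h or abs(dx) >= w:
        return out
    # Copy the overlapping region between the image and its displaced copy.
    out[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)] = \
        x[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)]
    return out


def dau_like_filter(x, weights, displacements):
    """Sum of weighted, displaced copies of a single-channel input.

    Assumption of this sketch: each unit is one weight plus one (dy, dx)
    displacement, so K units cost 3 * K numbers however far they reach.
    """
    y = np.zeros_like(x)
    for w_k, (dy, dx) in zip(weights, displacements):
        y += w_k * sample_displaced(x, dy, dx)
    return y


# Usage: two units applied to an impulse image.
x = np.zeros((7, 7))
x[3, 3] = 1.0
y = dau_like_filter(x, weights=[2.0, -1.0], displacements=[(0, 0), (2, -1)])
print(np.argwhere(y != 0))  # positions where the two units respond
```

Under these assumptions, the reach of the filter is set by the displacement values rather than by the size of a dense kernel, which is one way to read the abstract's claim that the receptive field can adapt without adding parameters.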
Our current implementation in CUDA allows only distances up to 4 or 8 pixels. This limitation can be overcome by modifying the implementation.
DAU layers with a stride operation are not yet implemented.
The current implementation of DAUs requires an even number of channels.
The authors would like to thank Hector Basevi for his valuable comments and suggestions on improving the paper. This work was supported in part by the following research projects and programs: Project GOSTOP C3330-16-529000, DIVID J2-9433 and ViAMaRo L2-6765, Program P2-0214 financed by the Slovenian Research Agency ARRS, and the MURI Project financed by MoD/Dstl and EPSRC through the EP/N019415/1 Grant. We thank Vitjan Zavrtanik for his contribution to porting the DAUs to the TensorFlow framework.
Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.
Cite this article
Tabernik, D., Kristan, M. & Leonardis, A. Spatially-Adaptive Filter Units for Compact and Efficient Deep Neural Networks. Int J Comput Vis 128, 2049–2067 (2020). https://doi.org/10.1007/s11263-019-01282-1
- Compact ConvNets
- Efficient ConvNets
- Displacement units
- Adjustable receptive fields