Spatially-Adaptive Filter Units for Compact and Efficient Deep Neural Networks
Convolutional neural networks excel in a number of computer vision tasks. One of their most crucial architectural elements is the effective receptive field size, which has to be manually set to accommodate a specific task. Standard solutions involve large kernels, down/up-sampling and dilated convolutions. These require testing a variety of dilation and down/up-sampling factors and result in non-compact networks with a large number of parameters. We address this issue by proposing a new convolution filter composed of displaced aggregation units (DAUs). DAUs learn spatial displacements and adapt the receptive field sizes of individual convolution filters to a given problem, thus reducing the need for hand-crafted modifications. DAUs can seamlessly replace convolutional filters in existing state-of-the-art architectures, which we demonstrate on AlexNet, ResNet50, ResNet101, DeepLab and SRN-DeblurNet. The benefits of this design are demonstrated on a variety of computer vision tasks and datasets, such as image classification (ILSVRC 2012), semantic segmentation (PASCAL VOC 2011, Cityscapes) and blind image de-blurring (GOPRO). Results show that DAUs allocate parameters efficiently, yielding networks with up to 4\(\times\) fewer parameters at similar or better performance.
Keywords: Compact ConvNets, Efficient ConvNets, Displacement units, Adjustable receptive fields
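The idea in the abstract, a filter built from a few units that each carry a weight and a learned spatial displacement, can be illustrated with a small sketch. This is not the authors' implementation: the fixed-width Gaussian used to render each unit, the value of `sigma`, and the names `dau_kernel` and `param_count` are assumptions made purely for illustration, and the numbers are toy values rather than results from the paper.

```python
import math


def dau_kernel(units, size, sigma=0.5):
    """Render a dense size x size kernel from displaced units (illustrative).

    Each unit is a tuple (weight, dx, dy): a scalar weight and a real-valued
    displacement from the kernel centre. Assumption: every unit contributes
    a normalized Gaussian bump of fixed width `sigma` at its displacement.
    """
    centre = (size - 1) / 2.0
    kernel = [[0.0] * size for _ in range(size)]
    for weight, dx, dy in units:
        # Unnormalized Gaussian centred at (centre + dx, centre + dy).
        bump = [
            [
                math.exp(
                    -((x - centre - dx) ** 2 + (y - centre - dy) ** 2)
                    / (2.0 * sigma ** 2)
                )
                for x in range(size)
            ]
            for y in range(size)
        ]
        total = sum(map(sum, bump))
        # Normalize so each unit adds exactly `weight` to the kernel sum.
        for y in range(size):
            for x in range(size):
                kernel[y][x] += weight * bump[y][x] / total
    return kernel


def param_count(n_units):
    """Parameters per filter: one weight plus two displacement coordinates per unit."""
    return 3 * n_units


if __name__ == "__main__":
    # Two hypothetical units covering a 9 x 9 spatial extent.
    units = [(1.0, 2.0, 0.0), (-0.5, -3.0, 1.0)]
    kernel = dau_kernel(units, size=9)
    print("unit parameters:", param_count(len(units)))  # 6
    print("dense 9x9 kernel parameters:", 9 * 9)        # 81
```

In this toy setting the two units span the same 9 x 9 extent as a dense kernel while storing 6 values instead of 81, which is the kind of parameter saving the abstract describes. Because the displacements are real-valued, they could in principle be adjusted by gradient descent, whereas a dilation factor is a fixed integer chosen by hand.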
The authors would like to thank Hector Basevi for his valuable comments and suggestions on improving the paper. This work was supported in part by the following research projects and programs: Project GOSTOP C3330-16-529000, DIVID J2-9433 and ViAMaRo L2-6765, Program P2-0214 financed by the Slovenian Research Agency (ARRS), and the MURI Project financed by MoD/Dstl and EPSRC through Grant EP/N019415/1. We thank Vitjan Zavrtanik for his contribution in porting the DAUs to the TensorFlow framework.