Incorporating Side Information by Adaptive Convolution

Abstract

Computer vision tasks often come with side information that can help solve the task. For example, in crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems using traditional hand-crafted features, it has not been fully utilized in deep-learning-based counting systems. To incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), in which the convolution filter weights adapt to the current scene context via the side information. In particular, we model the filter weights as a low-dimensional manifold within the high-dimensional space of filter weights. The filter weights are generated by a learned “filter manifold” sub-network whose input is the side information. With the help of side information and adaptive weights, the ACNN can disentangle the variations related to the side information and extract discriminative features related to the current context (e.g., camera perspective, noise level, blur kernel parameters). We demonstrate the effectiveness of ACNN with side information on three tasks: crowd counting, corrupted digit recognition, and image deblurring. Our experiments show that ACNN improves performance over a plain CNN with a similar number of parameters and achieves performance comparable to or better than the state of the art on the crowd counting task. Since existing crowd counting datasets do not contain ground-truth side information, we collect a new dataset with the ground-truth camera angle and height as side information. We also perform ablation experiments, mainly for crowd counting, to study the usefulness of the side information and the effect of the placement of the adaptive convolutional layers, in order to gain insight into ACNNs.
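
The idea of generating convolution filters from side information can be illustrated with a minimal NumPy sketch. It is not the architecture from the paper: the two-layer sub-network, the tanh activation, the layer sizes, the random initialisation, and the function names (`init_filter_manifold`, `generate_filters`, `adaptive_conv`) are all assumptions made for illustration. The sketch only shows that the same image is convolved with different filters when the side information changes.

```python
import numpy as np


def init_filter_manifold(rng, side_dim, hidden_dim, out_ch, in_ch, k):
    """Randomly initialise a small MLP that outputs conv weights and biases.

    Hypothetical stand-in for a learned "filter manifold" sub-network.
    """
    n_out = out_ch * in_ch * k * k + out_ch  # filter weights followed by biases
    return {
        "W1": rng.normal(0.0, 0.5, (side_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.5, (hidden_dim, n_out)),
        "b2": np.zeros(n_out),
        "shape": (out_ch, in_ch, k, k),
    }


def generate_filters(params, z):
    """Map a side-information vector z to a filter bank and a bias vector."""
    h = np.tanh(z @ params["W1"] + params["b1"])  # activation is an assumption
    flat = h @ params["W2"] + params["b2"]
    out_ch = params["shape"][0]
    weights = flat[:-out_ch].reshape(params["shape"])
    bias = flat[-out_ch:]
    return weights, bias


def adaptive_conv(x, weights, bias):
    """'Valid' cross-correlation of x (in_ch, H, W) with the generated filters."""
    k = weights.shape[-1]
    # patches has shape (in_ch, H-k+1, W-k+1, k, k)
    patches = np.lib.stride_tricks.sliding_window_view(x, (k, k), axis=(1, 2))
    out = np.einsum("chwij,ocij->ohw", patches, weights)
    return out + bias[:, None, None]


rng = np.random.default_rng(0)
params = init_filter_manifold(rng, side_dim=2, hidden_dim=8, out_ch=4, in_ch=1, k=3)
image = rng.normal(size=(1, 16, 16))

# Two hypothetical side-information vectors, e.g. normalised (angle, height).
low = adaptive_conv(image, *generate_filters(params, np.array([0.2, 0.5])))
high = adaptive_conv(image, *generate_filters(params, np.array([0.9, 0.5])))
print(low.shape)  # (4, 14, 14)
```

In a trained model the sub-network parameters would be learned jointly with the rest of the network, so that nearby side-information values yield nearby filters; here they are random and serve only to show the data flow.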

Notes

  1. The perspective value at a pixel location is proportional to the size of the object if the object exists there.

  2. To reduce clutter, here we do not show the bias term for the convolution.

  3. The mean absolute difference (MAD) between the density maps generated using the original perspective maps and our perspective maps is 0.475 on average, and [0.029, 0.818, 0.800, 0.597, 0.131] respectively on the five test scenes.

  4. The MAD between the original density maps and those using single Gaussian kernels is 2.893 on average, and [0.582, 4.491, 1.946, 7.078, 0.368] respectively on the five test scenes (using our perspective map). The large differences on scenes 2 and 4 arise because the ROI boundary cuts through the most crowded regions in those scenes.

  5. CSRNet terms the first ten convolution layers of VGG the front-end, which is more commonly referred to as the back-end elsewhere.

  6. On the clean MNIST dataset, the 2-conv and 4-conv CNN architectures achieve 0.81% and 0.69% error, while the current state of the art is \(\sim\)0.23% error (Ciresan et al. 2012).

References

  1. Arteta, C., Lempitsky, V., Noble, J. A., & Zisserman, A. (2014). Interactive object counting. In ECCV

  2. Burger, H. C., Schuler, C. J., & Harmeling, S. (2012). Image denoising: Can plain neural networks compete with BM3D? In CVPR

  3. Chan, A. B., & Vasconcelos, N. (2009). Bayesian Poisson regression for crowd counting. In ICCV

  4. Chan, A. B., Liang, Z. S. J., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR. IEEE.

  5. Chan, A. B., & Vasconcelos, N. (2012). Counting people with low-level features and Bayesian regression. IEEE Transactions on Image Processing, 21, 2160–2177.

  6. Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In CVPR

  7. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In CVPR

  8. De Brabandere, B., Jia, X., Tuytelaars, T., & Van Gool, L. (2016). Dynamic filter networks. In NIPS

  9. Dozat, T. (2015). Incorporating Nesterov momentum into Adam. Technical report, Stanford University. http://cs229.stanford.edu/proj2015/054report.pdf

  10. Eigen, D., Krishnan, D., & Fergus, R. (2013). Restoring an image taken through a window covered with dirt or rain. In ICCV

  11. Fiaschi, L., Nair, R., Koethe, U., & Hamprecht, F. (2012). Learning to count with regression forest and structured labels. In ICPR

  12. Gharbi, M., Chaurasia, G., Paris, S., & Durand, F. (2016). Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG).

  13. Ha, D., Dai, A., & Le, Q. V. (2017). HyperNetworks. In ICLR

  14. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR

  15. Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853–899.

  16. Idrees, H., Saleemi, I., Seibert, C., & Shah, M. (2013). Multi-source multi-scale counting in extremely dense crowd images. In CVPR

  17. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML

  18. Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. In NIPS

  19. Kang, D., & Chan, A. (2018). Crowd counting by adaptively fusing predictions from an image pyramid. In BMVC

  20. Kang, D., Dhar, D., & Chan, A. (2017). Incorporating side information by adaptive convolution. In NIPS

  21. Kang, D., Ma, Z., & Chan, A. B. (2018). Beyond counting: Comparisons of density maps for crowd analysis tasks–Counting, detection, and tracking. IEEE Transactions on Circuits and Systems for Video Technology, 29, 1408–1422.

  22. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  23. Klein, B., Wolf, L., & Afek, Y. (2015). A dynamic convolutional layer for short range weather prediction. In CVPR

  24. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS

  25. Lempitsky, V., & Zisserman, A. (2010). Learning to count objects in images. In NIPS

  26. Li, S., Liu, Z. Q., & Chan, A. B. (2015). Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In IJCV

  27. Li, Y., Zhang, X., & Chen, D. (2018). CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR

  28. Liu, R., Li, Z., & Jia, J. (2008). Image partial blur detection and classification. In CVPR

  29. Ma, Z., Yu, L., & Chan, A. B. (2015). Small instance detection by integer programming on object density maps. In CVPR

  30. Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In ICML

  31. Niu, Z., Zhou, M., Wang, L., Gao, X., & Hua, G. (2016). Ordinal regression with multiple output CNN for age estimation. In CVPR

  32. Onoro-Rubio, D., & López-Sastre, R. J. (2016). Towards perspective-free object counting with deep learning. In ECCV

  33. Pech-Pacheco, J. L., Cristóbal, G., Chamorro-Martinez, J., & Fernández-Valdivia, J. (2000). Diatom autofocusing in brightfield microscopy: A comparative study. In ICPR

  34. Ren, W., Kang, D., Tang, Y., & Chan, A. (2017). Fusing crowd density maps and visual object trackers for people tracking in crowd scenes. In CVPR

  35. Rodriguez, M., Laptev, I., Sivic, J., & Audibert, J. Y. (2011). Density-aware person detection and tracking in crowds. In ICCV

  36. Rothe, R., Timofte, R., & Van Gool, L. (2015). DEX: Deep expectation of apparent age from a single image. In ICCVW

  37. Sam, D. B., Surya, S., & Babu, R. V. (2017). Switching convolutional neural network for crowd counting. In CVPR

  38. Shi, J., Xu, L., & Jia, J. (2014). Discriminative blur detection features. In CVPR

  39. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR

  40. Sindagi, V. A., & Patel, V. M. (2017). Generating high-quality crowd density maps using contextual pyramid CNNs. In ICCV

  41. Sun, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In NIPS

  42. Xu, L., Ren, J. S., Liu, C., & Jia, J. (2014). Deep convolutional neural network for image deconvolution. In NIPS

  43. Zhang, C., Li, H., Wang, X., & Yang, X. (2015). Cross-scene crowd counting via deep convolutional neural networks. In CVPR

  44. Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2014). Facial landmark detection by deep multi-task learning. In ECCV

  45. Zhang, L., Shi, M., & Chen, Q. (2018). Crowd counting via scale-adaptive convolutional neural network. In WACV

  46. Zhang, Y., Zhou, D., Chen, S., Gao, S., & Ma, Y. (2016). Single-image crowd counting via multi-column convolutional neural network. In CVPR

Acknowledgements

The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. [T32-101/15-R] and CityU 11212518). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Author information

Corresponding author

Correspondence to Di Kang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by S. Soatto.

About this article

Cite this article

Kang, D., Dhar, D. & Chan, A.B. Incorporating Side Information by Adaptive Convolution. Int J Comput Vis 128, 2897–2918 (2020). https://doi.org/10.1007/s11263-020-01345-8

Keywords

  • Convolutional neural network (CNN)
  • Deep learning
  • Crowd counting