
Energy-Based Models for Deep Probabilistic Regression

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12365)

Abstract

While deep learning-based classification is generally tackled using standardized approaches, a wide variety of techniques are employed for regression. In computer vision, one particularly popular such technique is that of confidence-based regression, which entails predicting a confidence value for each input-target pair (x, y). While this approach has demonstrated impressive results, it requires important task-dependent design choices, and the predicted confidences lack a natural probabilistic meaning. We address these issues by proposing a general and conceptually simple regression method with a clear probabilistic interpretation. In our proposed approach, we create an energy-based model of the conditional target density p(y|x), using a deep neural network to predict the un-normalized density from (x, y). This model of p(y|x) is trained by directly minimizing the associated negative log-likelihood, approximated using Monte Carlo sampling. We perform comprehensive experiments on four computer vision regression tasks. Our approach outperforms direct regression, as well as other probabilistic and confidence-based methods. Notably, our model achieves a 2.2% AP improvement over Faster-RCNN for object detection on the COCO dataset, and sets a new state-of-the-art on visual tracking when applied for bounding box estimation. In contrast to confidence-based methods, our approach is also shown to be directly applicable to more general tasks such as age and head-pose estimation. Code is available at https://github.com/fregu856/ebms_regression.
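To make the training objective in the abstract concrete, the sketch below illustrates the importance-sampled negative log-likelihood -log p(y|x) = log Z(x) - f_θ(x, y), where Z(x) is the intractable normalizer of the energy-based model. It is a minimal illustration only, assuming a single Gaussian proposal centred at the ground-truth target and a hypothetical scorer interface `f_theta(x_feats, y)`; it is not the authors' reference implementation, which is available in the linked repository.

```python
import torch

def ebm_nll_loss(f_theta, x_feats, y_true, num_samples=1024, sigma=0.1):
    """Monte Carlo approximation of -log p(y|x) = log Z(x) - f_theta(x, y),
    where Z(x) = integral of exp(f_theta(x, y)) dy is estimated with
    importance sampling from a proposal q(y).

    f_theta: hypothetical scorer taking image features and targets and
             returning one scalar score per (x, y) pair.
    y_true:  ground-truth targets of shape (B, D).
    """
    # Proposal q(y): a Gaussian centred on the ground-truth target
    # (a single component stands in for a richer mixture proposal).
    q = torch.distributions.Normal(y_true, sigma)
    y_samples = q.sample((num_samples,))            # (M, B, D)
    log_q = q.log_prob(y_samples).sum(dim=-1)       # (M, B)

    # Scores for the sampled targets and for the true target.
    f_samples = f_theta(x_feats, y_samples)         # (M, B)
    f_true = f_theta(x_feats, y_true)               # (B,)

    # log Z(x) ~= log( (1/M) * sum_m exp(f_theta(x, y_m)) / q(y_m) ),
    # computed stably with logsumexp.
    log_m = torch.log(torch.tensor(float(num_samples)))
    log_Z = torch.logsumexp(f_samples - log_q, dim=0) - log_m

    # Negative log-likelihood averaged over the batch.
    return (log_Z - f_true).mean()
```

In words: the normalizer is replaced by a self-normalized importance-sampling estimate over M targets drawn near the ground truth, so the loss can be minimized with ordinary stochastic gradient descent on the scorer network.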


Acknowledgments

This research was supported by the Swedish Foundation for Strategic Research via ASSEMBLE, the Swedish Research Council via Learning flexible models for nonlinear dynamics, the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, and Nvidia.

Supplementary material

Supplementary material 1 (PDF, 859 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Information Technology, Uppsala University, Uppsala, Sweden
  2. Computer Vision Lab, ETH Zürich, Zürich, Switzerland
