A unifying representation for pixel-precise distance estimation

Multimedia Tools and Applications

Abstract

We propose a new representation of distance information, based on the size of portrayed subjects, that is independent of any specific acquisition device. In this representation, each pixel of an image is associated with the real-world size of what it depicts. With it, datasets acquired with different devices can be combined directly to build more powerful models, and monocular distance estimation can be performed on images from devices that were never used during training. To assess these advantages, we used the representation to train a fully convolutional neural network that predicts, with pixel precision, the size of the subjects depicted in the image as a proxy for their distance. Experimental results show that the representation, by allowing heterogeneous training datasets to be combined, lets the trained network achieve better results at test time.
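As a rough illustration of the idea, the sketch below converts a metric depth map into a per-pixel size map and back. It assumes an ideal pinhole camera, where a pixel observing a surface at depth Z covers roughly Z / f metres (f being the focal length in pixels). This relation, the function names, and the camera values are assumptions made for the example; the abstract does not state the formula the authors actually use.

```python
import numpy as np


def depth_to_pixel_size(depth_m, focal_px):
    """Convert a depth map (metres) to a size map (metres per pixel).

    Assumes an ideal pinhole camera: a pixel at depth Z spans about
    Z / focal_px metres. This is an illustrative assumption, not the
    paper's stated formula.
    """
    depth = np.asarray(depth_m, dtype=np.float64)
    return depth / float(focal_px)


def pixel_size_to_depth(size_m_per_px, focal_px):
    """Invert the mapping above when the focal length (pixels) is known."""
    size = np.asarray(size_m_per_px, dtype=np.float64)
    return size * float(focal_px)


# Hypothetical ground truth from two different cameras (values invented).
depth_cam_a = np.array([[10.0, 20.0]])  # metres, focal length 1000 px
depth_cam_b = np.array([[5.0, 10.0]])   # metres, focal length 500 px

# Both maps become the same device-independent training target.
size_a = depth_to_pixel_size(depth_cam_a, focal_px=1000.0)
size_b = depth_to_pixel_size(depth_cam_b, focal_px=500.0)
print(np.allclose(size_a, size_b))
```

In this toy case the two cameras see subjects at different distances, yet each pixel covers the same real-world width, so the size maps coincide. A network trained on such targets would not need to know which device produced an image; metric distance could then be recovered afterwards with `pixel_size_to_depth` whenever the focal length is available.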

Notes

  1. All sizes are linear sizes, not surface areas, so each can be either a height or a width. The experiments in this paper always use width.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Author information

Correspondence to Marco Buzzelli.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bianco, S., Buzzelli, M. & Schettini, R. A unifying representation for pixel-precise distance estimation. Multimed Tools Appl 78, 13767–13786 (2019). https://doi.org/10.1007/s11042-018-6568-2
