Monocular Object Detection Using 3D Geometric Primitives

  • Peter Carr
  • Yaser Sheikh
  • Iain Matthews
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7572)


Multiview object detection methods achieve robustness in adverse imaging conditions by exploiting projective consistency across views. In this paper, we present an algorithm that achieves performance comparable to multiview methods from a single camera by employing geometric primitives as proxies for the true 3D shape of objects, such as pedestrians or vehicles. Our key insight is that for a calibrated camera, geometric primitives produce predetermined location-specific patterns in occupancy maps. We use these to define spatially-varying kernel functions of projected shape. This leads to an analytical formation model of occupancy maps as the convolution of locations and projected shape kernels. We estimate object locations by deconvolving the occupancy map using an efficient template similarity scheme. The number of objects and their positions are determined using the mean shift algorithm. The approach is highly parallel because the occupancy probability of a particular geometric primitive at each ground location is an independent computation. The algorithm extends to multiple cameras without requiring significant bandwidth. We demonstrate comparable performance to multiview methods and show robust, realtime object detection on full resolution HD video in a variety of challenging imaging conditions.
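The final step described in the abstract, recovering the number of objects and their positions with mean shift, can be illustrated with a minimal sketch. It assumes the detector has already produced a set of candidate ground-plane locations with occupancy scores. The function name `mean_shift_modes`, the fixed isotropic Gaussian bandwidth, and the merge radius are assumptions made for illustration and are not taken from the paper.

```python
import math


def mean_shift_modes(points, weights, bandwidth=0.5, tol=1e-4,
                     max_iter=200, merge_dist=None):
    """Find modes of a weighted 2D point set with Gaussian-kernel mean shift.

    points     : list of (x, y) ground-plane candidates (hypothetical input)
    weights    : list of non-negative occupancy scores, one per point
    bandwidth  : Gaussian kernel standard deviation (assumed fixed here)
    merge_dist : converged points closer than this count as one mode
    """
    if merge_dist is None:
        merge_dist = bandwidth
    modes = []
    for x, y in points:
        # Climb the weighted kernel density estimate from each candidate.
        for _ in range(max_iter):
            sw = sx = sy = 0.0
            for (px, py), w in zip(points, weights):
                d2 = (px - x) ** 2 + (py - y) ** 2
                k = w * math.exp(-0.5 * d2 / bandwidth ** 2)
                sw += k
                sx += k * px
                sy += k * py
            if sw == 0.0:
                break  # no support nearby; stay where we are
            nx, ny = sx / sw, sy / sw
            shift = math.hypot(nx - x, ny - y)
            x, y = nx, ny
            if shift < tol:
                break
        # Keep the converged point only if it is not a known mode already.
        if all(math.hypot(mx - x, my - y) >= merge_dist for mx, my in modes):
            modes.append((x, y))
    return modes


if __name__ == "__main__":
    # Two well-separated groups of candidates, all with equal score.
    candidates = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                  (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    scores = [1.0] * len(candidates)
    for mx, my in mean_shift_modes(candidates, scores):
        print(f"object near ({mx:.2f}, {my:.2f})")
```

Because every candidate is shifted independently of the others, the outer loop parallelises naturally. The number of detected objects is simply the number of distinct modes, so it does not have to be specified in advance.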


Keywords: Object Detection · Ground Plane · Object Location · Multiple Camera · Ground Location



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Peter Carr (1)
  • Yaser Sheikh (2)
  • Iain Matthews (1, 2)

  1. Disney Research, Pittsburgh, USA
  2. Carnegie Mellon University, USA
