Extracting Moving People from Internet Videos

  • Juan Carlos Niebles
  • Bohyung Han
  • Andras Ferencz
  • Li Fei-Fei
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5305)


We propose a fully automatic framework to detect and extract arbitrary human motion volumes from real-world videos collected from YouTube. Our system is composed of two stages. In the first stage, a person detector provides crude information about the possible locations of humans, and a constrained clustering algorithm then groups the detections and rejects false positives based on appearance similarity and spatio-temporal coherence. In the second stage, we apply a top-down pictorial structure model to complete the extraction of the humans in arbitrary motion. During this procedure, a density propagation technique based on a mixture of Gaussians is employed to propagate temporal information in a principled way. This greatly reduces the search space for the measurement step in the inference stage. We demonstrate the initial success of this framework both quantitatively and qualitatively on a number of YouTube videos.
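As a rough illustration of why propagating a mixture of Gaussians can shrink the search space, the sketch below shows a one-dimensional prediction step followed by sampling. It is not the authors' implementation: the constant-velocity motion model, the `Component` class, the function names, and all numeric values are assumptions made for this example, and the measurement update that would re-weight the components is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class Component:
    weight: float  # mixture weight; weights are assumed to sum to 1
    mean: float    # e.g. horizontal position of a body part, in pixels
    var: float     # variance of this component


def predict(mixture, velocity, process_var):
    """Prediction step (hypothetical linear motion model): shift each
    component by the assumed velocity and widen it by the process noise.
    Mixture weights are left unchanged."""
    return [Component(c.weight, c.mean + velocity, c.var + process_var)
            for c in mixture]


def sample(mixture, n, rng):
    """Draw n candidate positions from the mixture: pick a component
    according to its weight, then sample from that Gaussian."""
    weights = [c.weight for c in mixture]
    chosen = rng.choices(mixture, weights=weights, k=n)
    return [rng.gauss(c.mean, c.var ** 0.5) for c in chosen]


# Illustrative values only: two hypotheses about where a limb was last seen.
prior = [Component(0.7, 100.0, 4.0), Component(0.3, 140.0, 9.0)]
predicted = predict(prior, velocity=5.0, process_var=1.0)

# Candidates are drawn only near the predicted modes, so a measurement
# step would evaluate 200 positions instead of every pixel in the frame.
candidates = sample(predicted, n=200, rng=random.Random(0))
```

In this simplified setting, the number of positions to evaluate depends on how many samples are drawn rather than on the image size, which is the intuition behind the reduced search space described above.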





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Juan Carlos Niebles (1, 2)
  • Bohyung Han (3)
  • Andras Ferencz (3)
  • Li Fei-Fei (1)

  1. Princeton University, Princeton, USA
  2. Universidad del Norte, Colombia
  3. Mobileye Vision Technologies, Princeton, USA
