Abstract
We present a random forest-based framework for real time head pose estimation from depth images and extend it to localize a set of facial features in 3D. Our algorithm takes a voting approach, where each patch extracted from the depth image can directly cast a vote for the head pose or each of the facial features. Our system proves capable of handling large rotations, partial occlusions, and the noisy depth data acquired using commercial sensors. Moreover, the algorithm works on each frame independently and achieves real time performance without resorting to parallel computations on a GPU. We present extensive experiments on publicly available, challenging datasets and present a new annotated head pose database recorded using a Microsoft Kinect.
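The voting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Vote` class, the `uncertainty` field, and the `max_uncertainty` threshold are hypothetical names introduced here, and a plain component-wise mean stands in for whatever aggregation the full method uses. It only shows how many per-patch votes might be filtered and combined into a single pose estimate.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence, Tuple


@dataclass
class Vote:
    """One patch's vote (hypothetical structure, for illustration only)."""
    # Assumed layout: head position (x, y, z) followed by (yaw, pitch, roll).
    pose: Tuple[float, float, float, float, float, float]
    # Assumed confidence measure: lower means the patch is trusted more.
    uncertainty: float


def aggregate_votes(votes: Sequence[Vote], max_uncertainty: float) -> Tuple[float, ...]:
    """Discard uncertain votes, then average the rest component-wise.

    Averaging angles directly is only reasonable when they stay far from
    the wrap-around point; this sketch assumes that is the case.
    """
    kept = [v for v in votes if v.uncertainty <= max_uncertainty]
    if not kept:
        raise ValueError("no vote passed the uncertainty threshold")
    return tuple(fmean(v.pose[i] for v in kept) for i in range(6))


if __name__ == "__main__":
    votes = [
        Vote((0.0, 0.0, 800.0, 10.0, 0.0, 0.0), 0.1),
        Vote((2.0, 0.0, 802.0, 12.0, 0.0, 0.0), 0.2),
        # An outlier, e.g. from an occluded patch; it is filtered out below.
        Vote((500.0, 500.0, 100.0, 90.0, 90.0, 90.0), 5.0),
    ]
    print(aggregate_votes(votes, max_uncertainty=1.0))
```

Filtering before averaging is what lets a scheme like this tolerate occluded or noisy patches: an unreliable patch contributes nothing rather than pulling the estimate away from the consensus.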
Notes
Most of the datasets are publicly available at http://www.vision.ee.ethz.ch/datasets.
Because of the proprietary license for Paysan et al. (2009), we cannot share the above database. The PCA model, however, can be obtained from the University of Basel.
We used the source code provided by the authors.
Commercially available: http://www.faceshift.com.
References
Amberg, B., & Vetter, T. (2011). Optimal landmark detection using shape models and branch and bound. In International conference on computer vision.
Balasubramanian, V. N., Ye, J., & Panchanathan, S. (2007). Biased manifold embedding: A framework for person-independent head pose estimation. In IEEE conference on computer vision and pattern recognition.
Belhumeur, P. N., Jacobs, D. W., Kriegman, D. J., & Kumar, N. (2011). Localizing parts of faces using a consensus of exemplars. In IEEE conference on computer vision and pattern recognition.
Besl, P., & McKay, N. (1992). A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2), 239–256.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3d faces. In ACM international conference on computer graphics and interactive techniques (SIGGRAPH) (pp. 187–194).
Breidt, M., Buelthoff, H., & Curio, C. (2011). Robust semantic analysis by synthesis of 3d facial motion. In Automatic face and gesture recognition.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Monterey: Wadsworth and Brooks.
Breitenstein, M. D., Jensen, J., Hoilund, C., Moeslund, T. B., & Van Gool, L. (2009). Head pose estimation from passive stereo images. In Scandinavian conference on image analysis.
Breitenstein, M. D., Kuettel, D., Weise, T., Van Gool, L., & Pfister, H. (2008). Real-time face pose estimation from single range images. In IEEE conference on computer vision and pattern recognition.
Cai, Q., Gallup, D., Zhang, C., & Zhang, Z. (2010). 3d deformable face tracking with a commodity depth camera. In European conference on computer vision.
Chang, K. I., Bowyer, K. W., & Flynn, P. J. (2006). Multiple nose region matching for 3d face recognition under varying facial expression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1695–1700.
Chen, L., Zhang, L., Hu, Y., Li, M., & Zhang, H. (2003). Head pose estimation using fisher manifold learning. In Analysis and modeling of faces and gestures.
Chua, C. S., & Jarvis, R. (1997). Point signatures: A new representation for 3d object recognition. International Journal of Computer Vision, 25, 63–85.
Colbry, D., Stockman, G., & Jain, A. (2005). Detection of anchor points for 3d face verification. In IEEE conference on computer vision and pattern recognition.
Cootes, T. F., Edwards, G. J., & Taylor, C. J. (2001). Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 681–685.
Cootes, T. F., Wheeler, G. V., Walker, K. N., & Taylor, C. J. (2002). View-based active appearance models. Image and Vision Computing, 20(9–10), 657–664.
Criminisi, A., Shotton, J., & Konukoglu, E. (2011). Decision forests for classification, regression, density estimation, manifold learning and semi-supervised learning. Tech. Rep. TR-2011-114, Microsoft Research.
Criminisi, A., Shotton, J., Robertson, D., & Konukoglu, E. (2010). Regression forests for efficient anatomy detection and localization in ct studies. In Recognition techniques and applications in medical imaging.
Cristinacce, D., & Cootes, T. (2008). Automatic feature localisation with constrained local models. Pattern Recognition, 41(10), 3054–3067.
Dantone, M., Gall, J., Fanelli, G., & Van Gool, L. (2012). Real-time facial feature detection using conditional regression forests. In IEEE conference on computer vision and pattern recognition.
Dorai, C., & Jain, A. K. (1997). COSMOS—A representation scheme for 3D Free-Form objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(10), 1115–1130.
Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! my name is… buffy—automatic naming of characters in tv video. In British machine vision conference.
Fanelli, G., Gall, J., Romsdorfer, H., Weise, T., & Van Gool, L. (2010). A 3-d audio-visual corpus of affective communication. IEEE Transactions on Multimedia, 12(6), 591–598.
Fanelli, G., Gall, J., & Van Gool, L. (2011a). Real time head pose estimation with random regression forests. In IEEE conference on computer vision and pattern recognition.
Fanelli, G., Weise, T., Gall, J., & Van Gool, L. (2011b). Real time head pose estimation from consumer depth cameras. In German association for pattern recognition.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.
Gall, J., & Lempitsky, V. (2009). Class-specific hough forests for object detection. In IEEE conference on computer vision and pattern recognition.
Gall, J., Yao, A., Razavi, N., Van Gool, L., & Lempitsky, V. (2011). Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Girshick, R., Shotton, J., Kohli, P., Criminisi, A., & Fitzgibbon, A. (2011). Efficient regression of general-activity human poses from depth images. In International conference on computer vision.
Gross, R., Matthews, I., & Baker, S. (2005). Generic vs. person specific active appearance models. Image and Vision Computing, 23(12), 1080–1093.
Huang, C., Ding, X., & Fang, C. (2010). Head pose estimation based on random forests for multiclass classification. In International conference on pattern recognition.
Jones, M., & Viola, P. (2003). Fast multi-view face detection. Tech. Rep. TR2003-096, Mitsubishi Electric Research Laboratories.
Ju, Q., O’Keefe, S., & Austin, J. (2009). Binary neural network based 3d facial feature localization. In International joint conference on neural networks.
Kakadiaris, I. A., Passalis, G., Toderici, G., Murtuza, M. N., Lu, Y., Karampatziakis, N., & Theoharis, T. (2007). Three-dimensional face recognition in the presence of facial expressions: an annotated deformable model approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4), 640–649.
Leibe, B., Leonardis, A., & Schiele, B. (2008). Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1–3), 259–289.
Lepetit, V., Lagger, P., & Fua, P. (2005). Randomized trees for real-time keypoint recognition. In IEEE conference on computer vision and pattern recognition.
Li, H., Adams, B., Guibas, L. J., & Pauly, M. (2009). Robust single-view geometry and motion reconstruction. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia), 28(5).
Lu, X., & Jain, A. K. (2006). Automatic feature extraction for multiview 3d face recognition. In Automatic face and gesture recognition.
Martins, P., & Batista, J. (2008). Accurate single view model-based head pose estimation. In Automatic face and gesture recognition.
Matthews, I., & Baker, S. (2003). Active appearance models revisited. International Journal of Computer Vision, 60(2), 135–164.
Mehryar, S., Martin, K., Plataniotis, K., & Stergiopoulos, S. (2010). Automatic landmark detection for 3d face image processing. In Evolutionary computation.
Mian, A., Bennamoun, M., & Owens, R. (2006). Automatic 3d face detection, normalization and recognition. In 3D data processing, visualization, and transmission.
Morency, L. P., Sundberg, P., & Darrell, T. (2003). Pose estimation using 3d view-based eigenspaces. In Automatic face and gesture recognition.
Morency, L. P., Whitehill, J., & Movellan, J. R. (2008). Generalized adaptive view-based appearance model: integrated framework for monocular head pose estimation. In Automatic face and gesture recognition.
Mpiperis, I., Malassiotis, S., & Strintzis, M. (2008). Bilinear models for 3-d face and facial expression recognition. IEEE Transactions on Information Forensics and Security, 3(3), 498–511.
Murphy-Chutorian, E., & Trivedi, M. (2009). Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 607–626.
Nair, P., & Cavallaro, A. (2009). 3-d face detection, landmark localization, and registration using a point distribution model. IEEE Transactions on Multimedia, 11(4), 611–623.
Okada, R. (2009). Discriminative generalized hough transform for object detection. In International conference on computer vision.
Osadchy, M., Miller, M. L., & LeCun, Y. (2005). Synergistic face detection and pose estimation with energy-based models. In Neural information processing systems.
Papageorgiou, C., Oren, M., & Poggio, T. (1998). A general framework for object detection. In International conference on computer vision.
Paysan, P., Knothe, R., Amberg, B., Romdhani, S., & Vetter, T. (2009). A 3d face model for pose and illumination invariant face recognition. In Advanced video and signal based surveillance.
Ramnath, K., Koterba, S., Xiao, J., Hu, C., Matthews, I., Baker, S., Cohn, J., & Kanade, T. (2008). Multi-view aam fitting and construction. International Journal of Computer Vision, 76(2), 183–204.
Seemann, E., Nickel, K., & Stiefelhagen, R. (2004). Head pose estimation using stereo vision for human-robot interaction. In Automatic face and gesture recognition.
Segundo, M., Silva, L., Bellon, O., & Queirolo, C. (2010). Automatic face segmentation and facial landmark detection in range images. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 40(5), 1319–1330.
Sharp, T. (2008). Implementing decision trees and forests on a GPU. In European conference on computer vision.
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In IEEE conference on computer vision and pattern recognition.
Shotton, J., Johnson, M., & Cipolla, R. (2008). Semantic texton forests for image categorization and segmentation. In IEEE conference on computer vision and pattern recognition.
Storer, M., Urschler, M., & Bischof, H. (2009). 3d-mam: 3d morphable appearance model for efficient fine head pose estimation from still images. In Workshop on subspace methods.
Sun, Y., & Yin, L. (2008). Automatic pose estimation of 3d facial models. In International conference on pattern recognition.
Valstar, M., Martinez, B., Binefa, X., & Pantic, M. (2010). Facial point detection using boosted regression and graph models. In IEEE conference on computer vision and pattern recognition.
Vatahska, T., Bennewitz, M., & Behnke, S. (2007). Feature-based head pose estimation from images. In International conference on humanoid robots.
Viola, P., & Jones, M. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154.
Wang, Y., Chua, C., & Ho, Y. (2002). Facial feature detection and face recognition from 2d and 3d images. Pattern Recognition Letters, 23(10), 1191–1202.
Weise, T., Bouaziz, S., Li, H., & Pauly, M. (2011). Realtime performance-based facial animation. In ACM international conference on computer graphics and interactive techniques (SIGGRAPH).
Weise, T., Leibe, B., & Van Gool, L. (2007). Fast 3d scanning with automatic motion compensation. In IEEE conference on computer vision and pattern recognition.
Weise, T., Li, H., Van Gool, L., & Pauly, M. (2009a). Face/off live facial puppetry. In Symposium on computer animation.
Weise, T., Wismer, T., Leibe, B., & Van Gool, L. (2009b). In-hand scanning with online loop closure. In 3-D digital imaging and modeling.
Whitehill, J., & Movellan, J. R. (2008). A discriminative approach to frame-by-frame head pose tracking. In Automatic face and gesture recognition.
Yao, A., Gall, J., & Van Gool, L. (2010). A hough transform-based voting framework for action recognition. In IEEE conference on computer vision and pattern recognition.
Yin, L., Wei, X., Sun, Y., Wang, J., & Rosato, M. J. (2006). A 3d facial expression database for facial behavior research. In Face and gesture recognition.
Yu, T. H., & Moon, Y. S. (2008). A novel genetic algorithm for 3d facial landmark localization. In Biometrics: theory, applications and systems.
Zhao, X., Dellandréa, E., Chen, L., & Kakadiaris, I. (2011). Accurate landmarking of three-dimensional facial data in the presence of facial expressions and occlusions using a three-dimensional statistical facial feature model. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 41(5), 1417–1428.
Acknowledgements
We thank Thibaut Weise for useful code and discussions. We acknowledge financial support from EU projects RADHAR (FP7-ICT-248873) and TANGO (FP7-ICT-249858), and from the SNF project Vision-supported Speech-based Human Machine Interaction (200021-130224).
Cite this article
Fanelli, G., Dantone, M., Gall, J. et al. Random Forests for Real Time 3D Face Analysis. Int J Comput Vis 101, 437–458 (2013). https://doi.org/10.1007/s11263-012-0549-0
DOI: https://doi.org/10.1007/s11263-012-0549-0