Multimedia Tools and Applications, Volume 72, Issue 3, pp 2439–2467

Incremental learning patch-based bag of facial words representation for face recognition in videos

  • Chao Wang
  • Yunhong Wang
  • Zhaoxiang Zhang
  • Yiding Wang


Video-based face recognition is a fundamental topic in image processing and video analysis, and it presents various challenges and opportunities. In this paper, we introduce an incremental learning approach to video-based face recognition that efficiently exploits the spatiotemporal information in videos. Face image sequences are incrementally clustered based on their descriptors, and representative face images are selected from each cluster. An incremental algorithm for creating facial visual words then constructs a codebook from the descriptors of the representative face images. Next, each descriptor extracted from an image patch is quantized against the facial visual words and converted into a code, and the codes from each region are pooled into a histogram. The representation of a face image is generated by concatenating the histograms from all regions, and this representation is used for categorization. During online recognition, a similarity score matrix and a voting algorithm are employed to determine the identity of a face video. Recognition is performed online as the face video sequence streams in, and the proposed method gives nearly real-time feedback. The proposed method achieves a 100% verification rate on the Honda/UCSD database and 82% on the YouTube database. Experimental results demonstrate the effectiveness and flexibility of the proposed method.
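The representation and voting steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the codebook is assumed to be given (the incremental clustering that builds it is not shown), hard nearest-codeword assignment, histogram intersection as the similarity measure, and simple majority voting are all assumptions, and the function names (`quantize`, `face_representation`, `vote_identity`) are hypothetical.

```python
import numpy as np

def quantize(descriptors, codebook):
    """Map each descriptor (row) to the index of its nearest codeword."""
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def face_representation(region_descriptors, codebook):
    """Concatenate one L1-normalised code histogram per face region.

    region_descriptors: list of (n_patches, dim) arrays, one per region.
    """
    k = codebook.shape[0]
    hists = []
    for desc in region_descriptors:
        codes = quantize(desc, codebook)
        hist = np.bincount(codes, minlength=k).astype(float)
        hists.append(hist / max(hist.sum(), 1.0))  # guard against empty regions
    return np.concatenate(hists)

def histogram_intersection(a, b):
    """Similarity of two histograms (assumed measure, not from the paper)."""
    return np.minimum(a, b).sum()

def vote_identity(frame_reps, gallery):
    """Let each frame vote for its most similar gallery identity.

    gallery: dict mapping identity name -> representation vector.
    Returns the winning identity and the frames-by-identities score matrix.
    """
    names = list(gallery)
    scores = np.array([[histogram_intersection(f, gallery[n]) for n in names]
                       for f in frame_reps])
    votes = np.bincount(scores.argmax(axis=1), minlength=len(names))
    return names[int(votes.argmax())], scores
```

In an online setting, `vote_identity` could be re-run as each new frame arrives, so the predicted identity is updated while the video is still streaming.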


Keywords: Video analysis; Face recognition; Biometrics; Incremental learning; Bag of words



This work is funded by the National Basic Research Program of China (No. 2010CB327902), the National Natural Science Foundation of China (No. 61005016, No. 61061130560), the National High-tech R&D Program of China (2011AA010502), the Open Projects Program of National Laboratory of Pattern Recognition, and the Fundamental Research Funds for the Central Universities.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Chao Wang (1)
  • Yunhong Wang (1)
  • Zhaoxiang Zhang (1, corresponding author)
  • Yiding Wang (2)

  1. Laboratory of Intelligent Recognition and Image Processing, Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, Beijing, China
  2. School of Information Engineering, North China University of Technology, Beijing, China
