International Journal of Computer Vision, Volume 109, Issue 1–2, pp 42–59

Weakly-Supervised Cross-Domain Dictionary Learning for Visual Recognition

  • Fan Zhu
  • Ling Shao


We address the visual categorization problem and present a method that utilizes weakly labeled data from other visual domains as auxiliary source data for enhancing the original learning system. The proposed method aims to expand the intra-class diversity of the original training data through collaboration with the source data. In order to bring the original target-domain data and the auxiliary source-domain data into the same feature space, we introduce a weakly-supervised cross-domain dictionary learning method, which learns a reconstructive, discriminative and domain-adaptive dictionary pair and the corresponding classifier parameters without using any prior information. The method operates at a high level and can be applied to different cross-domain applications. To build up the auxiliary domain data, we manually collect images from Web pages, and select human actions of specific categories from a different dataset. The proposed method is evaluated on human action recognition, image classification and event recognition tasks using the UCF YouTube dataset, the Caltech101/256 datasets and the Kodak dataset, respectively, achieving outstanding results.
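The core idea above — learning a single dictionary over pooled target and auxiliary source samples so both domains share one feature space — follows the standard dictionary-learning alternation between sparse coding and dictionary updates. The sketch below illustrates that generic alternation in NumPy; it is a minimal K-SVD-style illustration under simplifying assumptions, not the authors' exact formulation (it omits the discriminative and classifier-parameter terms of the paper's objective).

```python
import numpy as np

def learn_shared_dictionary(X_target, X_source, n_atoms=32, sparsity=5,
                            n_iter=10, seed=0):
    """Alternate sparse coding and dictionary updates on the pooled
    target + source samples, so both domains share one dictionary.
    Generic sketch only; the paper's objective adds discriminative terms."""
    X = np.hstack([X_target, X_source])          # (d, n_t + n_s) pooled data
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)               # unit-norm atoms

    for _ in range(n_iter):
        # Sparse coding step: greedy orthogonal matching pursuit per sample.
        A = np.zeros((n_atoms, X.shape[1]))
        for j in range(X.shape[1]):
            residual, idx = X[:, j].copy(), []
            for _ in range(sparsity):
                idx.append(int(np.argmax(np.abs(D.T @ residual))))
                coef, *_ = np.linalg.lstsq(D[:, idx], X[:, j], rcond=None)
                residual = X[:, j] - D[:, idx] @ coef
            A[idx, j] = coef
        # Dictionary update step: least-squares fit, then renormalize atoms.
        D = X @ np.linalg.pinv(A)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, A
```

Because the sparse codes of target and source samples live in the same atom space, a classifier trained on the codes can draw on both domains, which is the mechanism the abstract refers to as expanding intra-class diversity.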


Keywords: Visual categorization · Image classification · Human action recognition · Event recognition · Transfer learning · Weakly-supervised dictionary learning



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing, China
  2. Department of Electronic and Electrical Engineering, The University of Sheffield, Sheffield, UK
