
Journal of Computer Science and Technology, Volume 32, Issue 3, pp. 480–493

Captioning Videos Using Large-Scale Image Corpus

  • Xiao-Yu Du
  • Yang Yang
  • Liu Yang
  • Fu-Min Shen
  • Zhi-Guang Qin
  • Jin-Hui Tang (corresponding author)
Regular Paper

Abstract

Video captioning is the task of assigning complex high-level semantic descriptions (e.g., sentences or paragraphs) to video data. Different from previous video analysis techniques such as video annotation, video event detection and action recognition, video captioning is much closer to human cognition, with a smaller semantic gap. However, the scarcity of captioned video data severely limits the development of video captioning. In this paper, we propose a novel video captioning approach that describes videos by leveraging a freely-available image corpus with abundant literal knowledge. There are two key aspects of our approach: 1) an effective integration strategy bridging videos and images, and 2) high efficiency in handling ever-increasing training data. To achieve these goals, we adopt sophisticated visual hashing techniques to efficiently index and search large-scale images for relevant captions, which offers high extensibility to evolving data and the corresponding semantics. Extensive experimental results on various real-world visual datasets show the effectiveness of our approach with different hashing techniques, e.g., LSH (locality-sensitive hashing), PCA-ITQ (principal component analysis iterative quantization) and supervised discrete hashing, as compared with the state-of-the-art methods. It is worth noting that the empirical computational cost of our approach is much lower than that of an existing method, i.e., it takes 1/256 of the memory requirement and 1/64 of the time cost of the method of Devlin et al.
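To make the retrieval pipeline described above concrete, the following is a minimal Python sketch of caption transfer via random-hyperplane LSH (in the spirit of Charikar [35]): image features are binarized into short codes offline, a video frame's feature is hashed the same way, and the captions of the nearest images in Hamming space are returned as candidates. The feature dimension, image corpus, and caption list are hypothetical placeholders, not the paper's actual data or implementation.

```python
import numpy as np

def lsh_codes(features, hyperplanes):
    # Random-hyperplane LSH: each bit is the sign of one projection.
    return (features @ hyperplanes.T > 0).astype(np.uint8)

def hamming_search(query_code, db_codes, k=5):
    # Rank database codes by Hamming distance to the query code.
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
dim, bits = 4096, 64  # hypothetical CNN feature dimension and code length

# Hypothetical captioned image corpus: one feature and caption per image.
image_feats = rng.standard_normal((10_000, dim))
captions = [f"caption of image {i}" for i in range(10_000)]

hyperplanes = rng.standard_normal((bits, dim))
db_codes = lsh_codes(image_feats, hyperplanes)  # index once, offline

# Placeholder for a CNN feature extracted from one sampled video frame.
frame_feat = rng.standard_normal((1, dim))
query_code = lsh_codes(frame_feat, hyperplanes)
for idx in hamming_search(query_code, db_codes, k=5):
    print(captions[idx])  # candidate captions transferred to the video
```

Because only the binarization step differs across hashing techniques, swapping lsh_codes for codes learned by PCA-ITQ or supervised discrete hashing would leave the Hamming search unchanged, which is presumably what makes the approach extensible across hash functions and growing corpora.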

Keywords

video captioning; hashing; image captioning


Supplementary material

ESM 1: 11390_2017_1738_MOESM1_ESM.pdf (PDF, 183 kB)

References

[1] Song Y, Tang J H, Liu F, Yan S C. Body surface context: A new robust feature for action recognition from depth videos. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(6): 952-964.
[2] Qi G J, Hua X S, Rui Y, Tang J H, Mei T, Zhang H J. Correlative multi-label video annotation. In Proc. the 15th ACM International Conference on Multimedia, Sept. 2007, pp.17-26.
[3] Chen J W, Cui Y, Ye G N, Liu D, Chang S F. Event-driven semantic concept discovery by exploiting weakly tagged Internet images. In Proc. International Conference on Multimedia Retrieval, Apr. 2014.
[4] Tang J H, Shu X B, Qi G J, Li Z C, Wang M, Yan S C, Jain R. Tri-clustered tensor completion for social-aware image tag refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, doi: 10.1109/TPAMI.2016.2608882.
[5] Tang J H, Shu X B, Li Z C, Qi G J, Wang J D. Generalized deep transfer networks for knowledge propagation in heterogeneous domains. ACM Transactions on Multimedia Computing, Communications, and Applications, 2016, 12(4s): Article No. 68.
[6] Li Z C, Tang J H. Weakly supervised deep matrix factorization for social image understanding. IEEE Transactions on Image Processing, 2017, 26(1): 276-288.
[7] Li Z C, Liu J, Tang J H, Lu H Q. Robust structured subspace learning for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(10): 2085-2098.
[8] Yang Y, Zhang H W, Zhang M X, Shen F M, Li X L. Visual coding in a semantic hierarchy. In Proc. the 23rd ACM International Conference on Multimedia, Oct. 2015, pp.59-68.
[9] Yatskar M, Zettlemoyer L, Farhadi A. Situation recognition: Visual semantic role labeling for image understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp.5534-5542.
[10] Yang Y, Zha Z J, Gao Y, Zhu X F, Chua T S. Exploiting web images for semantic video indexing via robust sample-specific loss. IEEE Transactions on Multimedia, 2014, 16(6): 1677-1689.
[11] Yang Y, Yang Y, Shen H T. Effective transfer tagging from image to video. ACM Transactions on Multimedia Computing, Communications, and Applications, 2013, 9(2): Article No. 14.
[12] Li Z C, Liu J, Yang Y, Zhou X F, Lu H Q. Clustering-guided sparse structural learning for unsupervised feature selection. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(9): 2138-2150.
[13] Wang J D, Shen H T, Song J K, Ji J Q. Hashing for similarity search: A survey. arXiv:1408.2927, 2014. https://arxiv.org/abs/1408.2927, Apr. 2017.
[14] Tang J H, Li Z C, Wang M, Zhao R Z. Neighborhood discriminant hashing for large-scale image retrieval. IEEE Transactions on Image Processing, 2015, 24(9): 2827-2840.
[15] Gong Y C, Lazebnik S. Iterative quantization: A Procrustean approach to learning binary codes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2011, pp.817-824.
[16] Shen F M, Shen C H, Liu W, Shen H T. Supervised discrete hashing. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2015, pp.37-45.
[17] Devlin J, Gupta S, Girshick R, Mitchell M, Zitnick C L. Exploring nearest neighbor approaches for image captioning. arXiv:1505.04467, 2015. https://arxiv.org/abs/1505.04467, Apr. 2017.
[18] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In Proc. the 25th International Conference on Neural Information Processing Systems, Dec. 2012, pp.1097-1105.
[19] Zhu Z, Liang D, Zhang S H, Huang X L, Li B, Hu S M. Traffic-sign detection and classification in the wild. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp.2110-2118.
[20] Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S. Recurrent neural network based language model. In Proc. the 11th Annual Conference of the International Speech Communication Association, Sep. 2010, pp.1045-1048.
[21] Song J, Tang S L, Xiao J, Wu F, Zhang Z F. LSTM-in-LSTM for generating long descriptions of images. Computational Visual Media, 2016, 2(4): 379-388.
[22] Jia Y Q, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T. Caffe: Convolutional architecture for fast feature embedding. In Proc. the 22nd ACM International Conference on Multimedia, Nov. 2014, pp.675-678.
[23] Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2015, pp.1-9.
[24] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014. https://arxiv.org/abs/1409.1556, Apr. 2017.
[25] Razavian A S, Azizpour H, Sullivan J, Carlsson S. CNN features off-the-shelf: An astounding baseline for recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2014, pp.512-519.
[26] Norris J R. Markov Chains. Cambridge University Press, 1998.
[27] Moon T K. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 1996, 13(6): 47-60.
[28] Berger A L, Pietra V J D, Pietra S A D. A maximum entropy approach to natural language processing. Computational Linguistics, 1996, 22(1): 39-71.
[29] Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models. In Proc. the 31st International Conference on Machine Learning, Jun. 2014, pp.595-603.
[30] Wu Q, Shen C H, van den Hengel A, Liu L Q, Dick A. Image captioning with an intermediate attributes layer. arXiv:1506.01144v1, 2015. https://www.arxiv.org/abs/1506.01144v1, Apr. 2017.
[31] Gao H Y, Mao J H, Zhou J, Huang Z H, Wang L, Xu W. Are you talking to a machine? Dataset and methods for multilingual image question answering. arXiv:1505.05612, 2015. https://arxiv.org/abs/1505.05612, Apr. 2017.
[32] Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389, 2014. https://www.arxiv.org/abs/1411.4389, Apr. 2017.
[33] Chen X L, Fang H, Lin T Y, Vedantam R, Gupta S, Dollár P, Zitnick C L. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015. https://www.arxiv.org/abs/1504.00325, Apr. 2017.
[34] Ordonez V, Kulkarni G, Berg T L. Im2Text: Describing images using 1 million captioned photographs. In Proc. Neural Information Processing Systems, Dec. 2011, pp.1143-1151.
[35] Charikar M S. Similarity estimation techniques from rounding algorithms. In Proc. the 34th Annual ACM Symposium on Theory of Computing, May 2002, pp.380-388.
[36] Shen F M, Zhou X, Yang Y, Song J K, Shen H T, Tao D C. A fast optimization method for general binary code learning. IEEE Transactions on Image Processing, 2016, 25(12): 5610-5621.
[37] Yang Y, Luo Y S, Chen W L, Shen F M, Shao J, Shen H T. Zero-shot hashing via transferring supervised knowledge. In Proc. ACM Conference on Multimedia, Oct. 2016, pp.1286-1295.
[38] Shen F M, Shen C H, Shi Q F, van den Hengel A, Tang Z M, Shen H T. Hashing on nonlinear manifolds. IEEE Transactions on Image Processing, 2015, 24(6): 1839-1851.
[39] Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing. In Proc. the 25th International Conference on Very Large Data Bases, Sept. 1999, pp.518-529.
[40] Yang Y, Shen F M, Shen H T, Li H X, Li X L. Robust discrete spectral hashing for large-scale image semantic indexing. IEEE Transactions on Big Data, 2015, 1(4): 162-171.
[41] Song J K, Yang Y, Yang Y, Huang Z, Shen H T. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proc. ACM SIGMOD International Conference on Management of Data, Jun. 2013, pp.785-796.
[42] Strecha C, Bronstein A, Bronstein M, Fua P. LDAHash: Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(1): 66-78.
[43] Luo Y D, Yang Y, Shen F M, Huang Z, Zhou P, Shen H T. Robust discrete code modeling for supervised hashing. Pattern Recognition, 2017, doi: 10.1016/j.patcog.2017.02.034.
[44] Ballas N, Yao L, Pal C, Courville A. Delving deeper into convolutional networks for learning video representations. arXiv:1511.06432, 2015. https://arxiv.org/abs/1511.06432, Apr. 2017.
[45] Mazloom M, Li X R, Snoek C G M. TagBook: A semantic video representation without supervision for event detection. arXiv:1510.02899v2, 2015. https://arxiv.org/abs/1510.02899v2, Apr. 2017.
[46] Lowe D G. Object recognition from local scale-invariant features. In Proc. the 7th IEEE International Conference on Computer Vision, Sep. 1999, pp.1150-1157.
[47] Xu M, Duan L Y, Cai J F, Chia L T, Xu C S, Tian Q. HMM-based audio keyword generation. In Proc. Pacific-Rim Conference on Multimedia, Dec. 2004, pp.566-574.
[48] Shetty R, Laaksonen J. Video captioning with recurrent networks based on frame- and video-level features and visual content classification. arXiv:1512.02949, 2015. https://arxiv.org/abs/1512.02949, Apr. 2017.
[49] Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In Proc. IEEE International Conference on Computer Vision, Dec. 2015, pp.4507-4515.
[50] Pan P B, Xu Z W, Yang Y, Wu F, Zhuang Y T. Hierarchical recurrent neural encoder for video representation with application to captioning. arXiv:1511.03476, 2015. https://arxiv.org/abs/1511.03476, Apr. 2017.
[51] Sener O, Zamir A R, Savarese S, Saxena A. Unsupervised semantic parsing of video collections. In Proc. IEEE International Conference on Computer Vision, Dec. 2015, pp.4480-4488.
[52] Papineni K, Roukos S, Ward T, Zhu W J. BLEU: A method for automatic evaluation of machine translation. In Proc. the 40th Annual Meeting of the Association for Computational Linguistics, Jul. 2002, pp.311-318.
[53] Denkowski M, Lavie A. Meteor Universal: Language specific translation evaluation for any target language. In Proc. the 9th Workshop on Statistical Machine Translation, Apr. 2014, pp.376-380.
[54] Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based image description evaluation. arXiv:1411.5726, 2014. https://arxiv.org/abs/1411.5726, Apr. 2017.
[55] Lin C Y. ROUGE: A package for automatic evaluation of summaries. In Proc. ACL-04 Workshop on Text Summarization Branches Out, Jul. 2004, pp.74-81.
[56] Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, Darrell T, Saenko K. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proc. IEEE International Conference on Computer Vision, Dec. 2013, pp.2712-2719.
[57] Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R J. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proc. the 25th International Conference on Computational Linguistics, Aug. 2014, pp.1218-1227.

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Xiao-Yu Du (1, 2)
  • Yang Yang (3, 4)
  • Liu Yang (1, 5)
  • Fu-Min Shen (3, 4)
  • Zhi-Guang Qin (1)
  • Jin-Hui Tang (1, 6) (corresponding author)

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
  2. School of Software Engineering, Chengdu University of Information Technology, Chengdu, China
  3. Center for Future Media, University of Electronic Science and Technology of China, Chengdu, China
  4. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
  5. West China Hospital of Stomatology, Sichuan University, Chengdu, China
  6. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
