Abstract
Action classification in still images has been a popular research topic in computer vision. Labelling large-scale datasets for action classification requires tremendous manual work and is hard to scale up. Moreover, the action categories in such datasets are pre-defined and their vocabularies are fixed, yet humans may describe the same action with different phrases, which makes vocabulary expansion difficult for traditional fully-supervised methods. We observe that large amounts of images with sentence descriptions are readily available on the Internet. These descriptions can be regarded as weak labels for the images; they contain rich information and can be used to learn flexible expressions of action categories. We propose a method that learns an Action Concept Tree (ACT) and an Action Semantic Alignment (ASA) model for classification from image-description data via a two-stage learning process. We also build a new dataset for the task of learning actions from descriptions. Experimental results show that our method outperforms several baseline methods significantly.
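The abstract gives no implementation details, but a minimal sketch may help make the ACT idea concrete. The snippet below is an assumption-laden illustration, not the authors' method: it heuristically extracts verb-phrase action concepts from captions, groups the specific phrases (e.g. "riding horse", "riding bike") under their parent verb to form a shallow concept tree, and includes a cosine-similarity helper as a generic stand-in for an image-to-concept alignment score in the spirit of the ASA model. All identifiers (CORPUS, VERBS, build_concept_tree, cosine) are hypothetical.

```python
# Hypothetical sketch of building an action-concept tree from captions.
# The concept-extraction heuristic, tree layout, and similarity scoring
# are illustrative assumptions, not the paper's implementation.
from collections import defaultdict
import math

# Toy image-description corpus: (image_id, caption) pairs.
CORPUS = [
    ("img1", "a man is riding a horse on the beach"),
    ("img2", "a woman riding a bike down the street"),
    ("img3", "two children playing soccer in a park"),
    ("img4", "a boy playing guitar on stage"),
]

# A small hand-picked verb list stands in for real POS tagging
# (an assumption; a full system would use an NLP pipeline).
VERBS = {"riding", "playing", "throwing", "holding"}
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "down", "two"}


def extract_concepts(caption):
    """Heuristically extract (verb, verb+object) concept pairs from a caption."""
    tokens = [t for t in caption.lower().split() if t not in STOPWORDS]
    concepts = []
    for i, tok in enumerate(tokens):
        if tok in VERBS and i + 1 < len(tokens):
            concepts.append((tok, f"{tok} {tokens[i + 1]}"))
    return concepts


def build_concept_tree(corpus):
    """Group specific verb-phrase concepts under their parent verb,
    keeping the set of images that supports each leaf concept."""
    tree = defaultdict(lambda: defaultdict(set))
    for image_id, caption in corpus:
        for verb, phrase in extract_concepts(caption):
            tree[verb][phrase].add(image_id)
    return tree


def cosine(u, v):
    """Cosine similarity: a generic stand-in for scoring how well an
    image embedding aligns with a concept embedding (hypothetical ASA step)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    tree = build_concept_tree(CORPUS)
    for verb, children in tree.items():
        print(verb)
        for phrase, images in children.items():
            print(f"  {phrase}: {sorted(images)}")
```

Running the sketch prints a two-level tree ("riding" over "riding horse" and "riding bike"; "playing" over "playing soccer" and "playing guitar"), illustrating how free-form descriptions can yield action concepts at multiple levels of specificity without a fixed label vocabulary.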
Acknowledgement
This research was supported, in part, by the Office of Naval Research under grant N00014-13-1-0493. We would like to thank Chen Sun for valuable discussions.