Abstract
In this paper we propose an approach for multi-modal image retrieval in multi-labelled images. A multi-modal deep network architecture is formulated to jointly model sketches and text as input query modalities in a common embedding space, which is then further aligned with the image feature space. Our architecture also relies on salient object detection through a supervised LSTM-based visual attention model learned from convolutional features. Both the alignment between the queries and the image and the supervision of the attention on the images are obtained by generalizing the Hungarian algorithm with different loss functions. This permits encoding object-based features and their alignment with the query irrespective of whether the different objects co-occur in the training set. We validate our approach on standard single- and multi-object datasets, achieving state-of-the-art performance on every dataset.
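The core of the alignment step is an assignment problem: each query embedding (sketch or text) must be matched to one attended object region so that the total embedding distance is minimal, which is exactly what the Hungarian algorithm solves. The following is a minimal, illustrative sketch of that objective only; it uses exhaustive search over assignments rather than the paper's O(n³) Hungarian solver, and the function names and toy embeddings are assumptions, not the authors' code.

```python
from itertools import permutations
from math import sqrt

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def best_assignment(query_embs, object_embs):
    """Match each query embedding to a distinct object embedding,
    minimising total cosine distance.  Exhaustive search shown for
    clarity; the Hungarian algorithm optimises the same cost matrix
    in polynomial time."""
    n = len(query_embs)
    cost = [[cosine_distance(q, o) for o in object_embs] for q in query_embs]
    best_pairs, best_cost = None, float("inf")
    for perm in permutations(range(len(object_embs)), n):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_pairs, best_cost = [(i, perm[i]) for i in range(n)], c
    return best_pairs, best_cost
```

With two queries and three detected objects, the search assigns each query to its nearest distinct region, mirroring how the alignment loss supervises both retrieval and attention in the proposed framework.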
Acknowledgement
This work has been partially supported by the European Union's Marie Skłodowska-Curie grant agreement No. 665919 (H2020-MSCA-COFUND-2014:665919:CVPR:01), the Spanish projects TIN2015-70924-C2-2-R and TIN2014-52072-P, and the CERCA Program of Generalitat de Catalunya.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Dey, S., Dutta, A., Ghosh, S.K., Valveny, E., Lladós, J., Pal, U. (2019). Aligning Salient Objects to Queries: A Multi-modal and Multi-object Image Retrieval Framework. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science(), vol 11362. Springer, Cham. https://doi.org/10.1007/978-3-030-20890-5_16
Print ISBN: 978-3-030-20889-9
Online ISBN: 978-3-030-20890-5