Picture it in your mind: generating high level visual representations from textual descriptions

Abstract

In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose various neural network models of increasing complexity that learn to generate, from a short descriptive text, a high level visual representation in a visual feature space such as the pool5 layer of the ResNet-152 or the fc6–fc7 layers of an AlexNet trained on the ILSVRC12 and Places databases. The Text2Vis models we explore include (1) a relatively simple regressor network relying on a bag-of-words representation of the textual descriptions, (2) a deep recurrent network that is sensitive to word order, and (3) a wide and deep model that combines a stacked LSTM deep network with a wide regressor network. We compare the models we propose with other search strategies, including textual search methods that exploit state-of-the-art caption generation models to index the image collection.
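
As a concrete illustration of the overall idea, the sketch below implements a minimal bag-of-words regressor in the spirit of model (1): a textual query is mapped directly to a point in the visual feature space (here the 2048-dimensional pool5 activations of ResNet-152), and retrieval then reduces to a nearest-neighbour search among precomputed image features. The framework (PyTorch), the layer sizes, and the MSE objective are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of a bag-of-words text-to-visual-feature regressor
# (assumed hyperparameters; not the paper's exact architecture).
import torch
import torch.nn as nn

class BowText2Vis(nn.Module):
    def __init__(self, vocab_size, visual_dim=2048, hidden_dim=1024):
        super().__init__()
        # visual_dim=2048 matches ResNet-152 pool5; hidden_dim is arbitrary here.
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, bow):
        # bow: (batch, vocab_size) bag-of-words vectors of the textual queries
        return self.net(bow)

model = BowText2Vis(vocab_size=10000)
loss_fn = nn.MSELoss()                    # regression onto the visual features
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

bow = torch.rand(8, 10000)                # toy batch of query vectors
target = torch.rand(8, 2048)              # toy pool5 features of the described images
loss = loss_fn(model(bow), target)
loss.backward()
optimizer.step()
```

At query time only the text side changes: the predicted vector is compared (e.g., by cosine or Euclidean distance) against image features that are already stored in the index, which is why updating the translation model does not require reprocessing the image collection.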

Notes

  1. http://image-net.org/challenges/LSVRC/2012.

  2. http://places.csail.mit.edu/index.html.

  3. Publicly available at http://mscoco.org/.

  4. The dataset actually contains a few images with more than five captions; in such cases we took the first five listed.

  5. We considered the part-of-speech patterns: ‘NOUN-VERB’, ‘NOUN-VERB-VERB’, ‘ADJ-NOUN’, ‘VERB-PRT’, ‘VERB-VERB’, ‘NUM-NOUN’, and ‘NOUN-NOUN’ (a matching sketch follows these notes).

  6. They reported slightly better results with the marginal ranking loss (MRL), a cost function that takes two visual vectors for each example, one considered relevant and one irrelevant to the textual description (a generic formulation is given after these notes). However, the relevance judgments used to generate the training triplets relied on the user-click logs available in their dataset.

  7. https://github.com/tylin/coco-caption.

  8. The LCS (longest common subsequence) is a common in-order sequence of words; matching it is similar to matching word n-grams but less stringent, since other words may appear between the words of the LCS (an implementation sketch follows these notes).
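
To make note 5 concrete, the snippet below is a small self-contained sketch of matching those part-of-speech patterns against an already POS-tagged caption; the tagged example, the helper function, and its name are invented for illustration and are not code from the paper.

```python
# Match universal-POS tag patterns (from note 5) against a tagged caption.
PATTERNS = [("NOUN", "VERB"), ("NOUN", "VERB", "VERB"), ("ADJ", "NOUN"),
            ("VERB", "PRT"), ("VERB", "VERB"), ("NUM", "NOUN"), ("NOUN", "NOUN")]

def match_pos_patterns(tagged_tokens, patterns=PATTERNS):
    """tagged_tokens: list of (word, universal_POS_tag) pairs."""
    words = [w for w, _ in tagged_tokens]
    tags = [t for _, t in tagged_tokens]
    matches = []
    for i in range(len(tagged_tokens)):
        for p in patterns:
            if tuple(tags[i:i + len(p)]) == p:
                matches.append(" ".join(words[i:i + len(p)]))
    return matches

# Toy tagged caption (tags follow the universal tagset).
caption = [("two", "NUM"), ("dogs", "NOUN"), ("playing", "VERB"), ("with", "ADP"),
           ("a", "DET"), ("red", "ADJ"), ("ball", "NOUN")]
print(match_pos_patterns(caption))  # ['two dogs', 'dogs playing', 'red ball']
```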
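
For reference, the marginal ranking loss mentioned in note 6 is usually written as the following triplet loss; this is the generic textbook formulation, not necessarily the exact variant used in the cited work:

```latex
% f(t): visual vector predicted from text t; v^+ a relevant and v^- an
% irrelevant visual vector; s(.,.) a similarity (e.g., cosine); m > 0 a margin.
L(t, v^+, v^-) = \max\bigl(0,\ m - s\bigl(f(t), v^+\bigr) + s\bigl(f(t), v^-\bigr)\bigr)
```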
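
Note 8 refers to the longest common subsequence; the function below is the standard dynamic-programming computation of its length over word lists (as used, e.g., by ROUGE-L). It is a generic sketch, not code taken from the paper or from the coco-caption toolkit.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists.

    Matched words must appear in the same order in both lists, but other
    words may occur between them, which is why LCS matching is less
    stringent than exact n-gram matching."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

print(lcs_length("a dog runs on the beach".split(),
                 "a small dog runs along the sandy beach".split()))  # 5
```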

Author information

Corresponding author

Correspondence to Andrea Esuli.

About this article

Cite this article

Carrara, F., Esuli, A., Fagni, T. et al. Picture it in your mind: generating high level visual representations from textual descriptions. Inf Retrieval J 21, 208–229 (2018). https://doi.org/10.1007/s10791-017-9318-6

Keywords

  • Image retrieval
  • Cross-media retrieval
  • Text representation