Abstract
Taking pictures of delicious food and sharing them on social media has become a popular trend. Recommending recipes along with these pictures would benefit users who want to cook a particular dish, yet such a feature is not yet available. The challenge of recipe retrieval arises from two aspects. First, current food-recognition technology scales only to a few hundred categories, which is far from practical for recognizing tens of thousands of food categories. Second, even a single food category can have recipe variants that differ in ingredient composition. Finding the best-match recipe therefore requires knowledge of ingredients, which is a fine-grained recognition problem. In this paper, we consider the problem from the viewpoint of cross-modality analysis. Given a large number of image-recipe pairs acquired from the Internet, a joint space is learnt to locally capture the ingredient correspondence between images and recipes. Because learning happens at the region level for images and the ingredient level for recipes, the model can generalize recognition to unseen food categories. Furthermore, the embedded multi-modal ingredient feature sheds light on the retrieval of best-match recipes. On an in-house dataset, our model doubles the retrieval performance of DeViSE, a popular cross-modality model that does not consider region information during learning.
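To make the idea of a locally learnt joint space concrete, the sketch below shows one way to attend over image regions with ingredient-level recipe features, project both modalities into a shared space, and train with a pairwise ranking loss for cross-modal retrieval. It is a minimal illustration under assumed design choices (PyTorch, mean-pooled ingredient embeddings as the attention query, a hinge ranking loss, and all dimensions), not the paper's exact stacked attention architecture.

```python
# Hypothetical sketch of a joint image-recipe embedding with attention over
# image regions and ingredient-level recipe features. Module names, dimensions,
# and the ranking loss are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionIngredientEmbedding(nn.Module):
    def __init__(self, region_dim=2048, word_dim=300, joint_dim=512, vocab_size=3000):
        super().__init__()
        self.ingredient_emb = nn.Embedding(vocab_size, word_dim)  # ingredient-level recipe input
        self.region_proj = nn.Linear(region_dim, joint_dim)       # image-region features -> joint space
        self.recipe_proj = nn.Linear(word_dim, joint_dim)         # ingredient features -> joint space

    def forward(self, regions, ingredient_ids):
        # regions:        (B, R, region_dim), e.g. cells of a CNN feature map
        # ingredient_ids: (B, N) integer ids of ingredient words in the recipe
        r = self.region_proj(regions)                               # (B, R, D)
        w = self.recipe_proj(self.ingredient_emb(ingredient_ids))  # (B, N, D)

        recipe_vec = F.normalize(w.mean(dim=1), dim=-1)            # recipe query vector
        # Attention of the recipe over image regions: weight regions by relevance.
        attn = torch.softmax(torch.einsum('brd,bd->br', r, recipe_vec), dim=-1)
        image_vec = F.normalize(torch.einsum('br,brd->bd', attn, r), dim=-1)
        return image_vec, recipe_vec

def pairwise_ranking_loss(image_vec, recipe_vec, margin=0.3):
    # Hinge loss pushing matched image-recipe pairs above mismatched ones.
    sim = image_vec @ recipe_vec.t()                  # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarities of matched pairs
    cost_img = (margin + sim - pos).clamp(min=0)      # image retrieving wrong recipe
    cost_rec = (margin + sim - pos.t()).clamp(min=0)  # recipe retrieving wrong image
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_img.masked_fill(mask, 0).mean() + cost_rec.masked_fill(mask, 0).mean()

# Usage with random tensors standing in for real features:
model = RegionIngredientEmbedding()
regions = torch.randn(4, 49, 2048)             # e.g. a 7x7 CNN grid per image
ingredients = torch.randint(0, 3000, (4, 10))  # 10 ingredient ids per recipe
img_v, rec_v = model(regions, ingredients)
loss = pairwise_ranking_loss(img_v, rec_v)
```

At retrieval time, recipes would be ranked for a query image by cosine similarity in the joint space; because the image side is pooled through region-level attention keyed on ingredients, the same mechanism can in principle match ingredient combinations unseen as whole dishes during training.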
References
Aizawa K, Ogawa M (2015) Foodlog: multimedia tool for healthcare applications. IEEE Multimed 22(2):4–8
Andrew G, Arora R, Bilmes JA, Livescu K (2013) Deep canonical correlation analysis. In: Proceedings of international conference on machine learning, pp 1247–1255
Beijbom O, Joshi N, Morris D, Saponas S, Khullar S (2015) Menu-match: restaurant-specific food logging from images. In: Proceedings of IEEE workshop on applications of computer and vision, pp 844–851
Bossard L, Guillaumin M, Van Gool L (2014) Food-101–mining discriminative components with random forests. In: Proceedings of european conference on computer vision, pp 446–461
Chen J, Ngo CW (2016) Deep-based ingredient recognition for cooking recipe retrieval. In: Proceedings of ACM international conference on multimedia
Chen J, Pang L, Ngo CW (2017) Cross-modal recipe retrieval: how to cook this dish?. In: Proceedings of international conference on multimedia modeling. Springer, pp 588–600
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In: Proceedings of international conference on machine learning, pp 647–655
Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Mikolov T et al. (2013) Devise: a deep visual-semantic embedding model. In: Proceedings of neural information processing systems, pp 2121–2129
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 580–587
Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int J Comput Vis 106(2):210–233
Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664
Karpathy A, Joulin A, Li FF (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: Proceedings of neural information processing systems, pp 1889–1897
Kawano Y, Yanai K (2014) Foodcam-256: a large-scale real-time mobile food recognition system employing high-dimensional features and compression of classifier weights. In: Proceedings of ACM international conference on multimedia, pp 761–762
Kitamura K, Yamasaki T, Aizawa K (2008) Food log by analyzing food images. In: Proceedings of ACM international conference on multimedia, pp 999–1000
Maruyama T, Kawano Y, Yanai K (2012) Real-time mobile recipe recommendation system using food ingredient recognition. In: Proceedings of ACM international workshop on interactive multimedia on mobile and portable devices, pp 27–34
Matsuda Y, Hoashi H, Yanai K (2012) Recognition of multiple-food images by detecting candidate regions. In: Proceedings of international conference on multimedia and expo
Matsunaga H, Doman K, Hirayama T, Ide I, Deguchi D, Murase H (2015) Tastes and textures estimation of foods based on the analysis of its ingredients list and image. In: New trends in image analysis and processing–ICIAP 2015 workshops, pp 326–333
Meyers A, Johnston N, Rathod V, Korattikara A, Gorban A, Silberman N, Guadarrama S, Papandreou G, Huang J, Murphy KP (2015) Im2calories: towards an automated mobile vision food diary. In: Proceedings of IEEE international conference on computer vision, pp 1233–1241
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of neural information processing systems, pp 3111–3119
Probst Y, Nguyen DT, Rollo M, Li W (2015) mHealth diet and nutrition guidance. mHealth
Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: Proceedings of ACM international conference on multimedia, pp 251–260
Rosipal R, Krämer N (2006) Overview and recent advances in partial least squares. In: Subspace, latent structure and feature selection. Springer, pp 34–51
Salvador A, Hynes N, Aytar Y, Marin J, Ofli F, Weber I, Torralba A (2017) Learning cross-modal embeddings for cooking recipes and food images. In: Proceedings of IEEE conference on computer vision and pattern recognition
Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Su H, Lin TW, Li CT, Shan MK, Chang J (2014) Automatic recipe cuisine classification by ingredients. In: Proceedings of the 2014 ACM international joint conference on pervasive and ubiquitous computing: adjunct publication, pp 565–570
Wang X, Kumar D, Thome N, Cord M, Precioso F (2015) Recipe recognition with large multimodal food dataset. In: Proceedings of international conference on multimedia and expo workshop, pp 1–6
Xie H, Yu L, Li Q (2010) A hybrid semantic item model for recipe search by example. In: Proceedings of 2010 IEEE international symposium on multimedia (ISM), pp 254–259
Xu R, Herranz L, Jiang S, Wang S, Song X, Jain R (2015) Geolocalized modeling for dish recognition. IEEE Trans Multimed 17(8):1187–1199
Yamakata Y, Imahori S, Maeta H, Mori S (2016) A method for extracting major workflow composed of ingredients, tools and actions from cooking procedural text. In: 8th workshop on multimedia for cooking and eating activities
Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: Proceedings of international conference on machine learning, pp 3441–3450
Yang Z, He X, Gao J, Deng L, Smola A (2015) Stacked attention networks for image question answering. arXiv:1511.02274
Zhang W, Yu Q, Siddiquie B, Divakaran A, Sawhney H (2015) Snap-n-eat: food recognition and nutrition estimation on a smartphone. J Diabetes Sci Technol 9(3):525–533
Acknowledgements
The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 11203517).
Cite this article
Chen, JJ., Pang, L. & Ngo, CW. Cross-modal recipe retrieval with stacked attention model. Multimed Tools Appl 77, 29457–29473 (2018). https://doi.org/10.1007/s11042-018-5964-y