Abstract
Taking pictures of delicious food and sharing them on social media has become a popular trend. Recommending recipes along with these pictures would benefit users who want to cook a particular dish, yet such a feature is not yet available. The challenge of recipe retrieval arises from two aspects. First, current food-recognition technology scales only to a few hundred categories, which is far from practical for recognizing tens of thousands of food categories. Second, even a single food category can have recipe variants that differ in ingredient composition. Finding the best-match recipe therefore requires knowledge of ingredients, which is a fine-grained recognition problem. In this paper, we consider the problem from the viewpoint of cross-modality analysis. Given a large number of image-recipe pairs acquired from the Internet, a joint space is learnt to locally capture the ingredient correspondence between images and recipes. Because learning happens at the region level for images and the ingredient level for recipes, the model can generalize recognition to unseen food categories. Furthermore, the embedded multi-modal ingredient feature sheds light on the retrieval of best-match recipes. On an in-house dataset, our model doubles the retrieval performance of DeViSE, a popular cross-modality model that does not consider region information during learning.
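To make the idea of a locally learnt joint space concrete, the sketch below shows one way to attend over image regions with ingredient-level recipe features, project both modalities into a shared space, and train with a pairwise ranking loss for cross-modal retrieval. It is a minimal illustration under assumed design choices (PyTorch, mean-pooled ingredient embeddings as the attention query, a hinge ranking loss, and all dimensions), not the paper's exact stacked attention architecture.

```python
# Hypothetical sketch of a joint image-recipe embedding with attention over
# image regions and ingredient-level recipe features. Module names, dimensions,
# and the ranking loss are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionIngredientEmbedding(nn.Module):
    def __init__(self, region_dim=2048, word_dim=300, joint_dim=512, vocab_size=3000):
        super().__init__()
        self.ingredient_emb = nn.Embedding(vocab_size, word_dim)  # ingredient-level recipe input
        self.region_proj = nn.Linear(region_dim, joint_dim)       # image-region features -> joint space
        self.recipe_proj = nn.Linear(word_dim, joint_dim)         # ingredient features -> joint space

    def forward(self, regions, ingredient_ids):
        # regions:        (B, R, region_dim), e.g. cells of a CNN feature map
        # ingredient_ids: (B, N) integer ids of ingredient words in the recipe
        r = self.region_proj(regions)                               # (B, R, D)
        w = self.recipe_proj(self.ingredient_emb(ingredient_ids))  # (B, N, D)

        recipe_vec = F.normalize(w.mean(dim=1), dim=-1)            # recipe query vector
        # Attention of the recipe over image regions: weight regions by relevance.
        attn = torch.softmax(torch.einsum('brd,bd->br', r, recipe_vec), dim=-1)
        image_vec = F.normalize(torch.einsum('br,brd->bd', attn, r), dim=-1)
        return image_vec, recipe_vec

def pairwise_ranking_loss(image_vec, recipe_vec, margin=0.3):
    # Hinge loss pushing matched image-recipe pairs above mismatched ones.
    sim = image_vec @ recipe_vec.t()                  # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarities of matched pairs
    cost_img = (margin + sim - pos).clamp(min=0)      # image retrieving wrong recipe
    cost_rec = (margin + sim - pos.t()).clamp(min=0)  # recipe retrieving wrong image
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_img.masked_fill(mask, 0).mean() + cost_rec.masked_fill(mask, 0).mean()

# Usage with random tensors standing in for real features:
model = RegionIngredientEmbedding()
regions = torch.randn(4, 49, 2048)             # e.g. a 7x7 CNN grid per image
ingredients = torch.randint(0, 3000, (4, 10))  # 10 ingredient ids per recipe
img_v, rec_v = model(regions, ingredients)
loss = pairwise_ranking_loss(img_v, rec_v)
```

At retrieval time, recipes would be ranked for a query image by cosine similarity in the joint space; because the image side is pooled through region-level attention keyed on ingredients, the same mechanism can in principle match ingredient combinations unseen as whole dishes during training.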
References
Aizawa K, Ogawa M (2015) Foodlog: multimedia tool for healthcare applications. IEEE Multimed 22(2):4–8
Andrew G, Arora R, Bilmes JA, Livescu K (2013) Deep canonical correlation analysis. In: Proceedings of international conference on machine learning, pp 1247–1255
Beijbom O, Joshi N, Morris D, Saponas S, Khullar S (2015) Menu-match: restaurant-specific food logging from images. In: Proceedings of IEEE workshop on applications of computer and vision, pp 844–851
Bossard L, Guillaumin M, Van Gool L (2014) Food-101–mining discriminative components with random forests. In: Proceedings of european conference on computer vision, pp 446–461
Chen J, Ngo CW (2016) Deep-based ingredient recognition for cooking recipe retrieval. In: Proceedings of ACM international conference on multimedia
Chen J, Pang L, Ngo CW (2017) Cross-modal recipe retrieval: how to cook this dish?. In: Proceedings of international conference on multimedia modeling. Springer, pp 588–600
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In: Proceedings of international conference on machine learning, pp 647–655
Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Mikolov T et al. (2013) Devise: a deep visual-semantic embedding model. In: Proceedings of neural information processing systems, pp 2121–2129
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 580–587
Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int J Comput Vis 106(2):210–233
Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664
Karpathy A, Joulin A, Li FF (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: Proceedings of neural information processing systems, pp 1889–1897
Kawano Y, Yanai K (2014) Foodcam-256: a large-scale real-time mobile food recognition system employing high-dimensional features and compression of classifier weights. In: Proceedings of ACM international conference on multimedia, pp 761–762
Kitamura K, Yamasaki T, Aizawa K (2008) Food log by analyzing food images. In: Proceedings of ACM international conference on multimedia, pp 999–1000
Maruyama T, Kawano Y, Yanai K (2012) Real-time mobile recipe recommendation system using food ingredient recognition. In: Proceedings of ACM international workshop on interactive multimedia on mobile and portable devices, pp 27–34
Matsuda Y, Hoashi H, Yanai K (2012) Recognition of multiple-food images by detecting candidate regions. In: Proceedings of international conference on multimedia and expo
Matsunaga H, Doman K, Hirayama T, Ide I, Deguchi D, Murase H (2015) Tastes and textures estimation of foods based on the analysis of its ingredients list and image. In: New trends in image analysis and processing–ICIAP 2015 workshops, pp 326–333
Meyers A, Johnston N, Rathod V, Korattikara A, Gorban A, Silberman N, Guadarrama S, Papandreou G, Huang J, Murphy KP (2015) Im2calories: towards an automated mobile vision food diary. In: Proceedings of IEEE international conference on computer vision, pp 1233–1241
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of neural information processing systems, pp 3111–3119
Probst Y, Nguyen DT, Rollo M, Li W (2015) mHealth diet and nutrition guidance. mHealth
Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: Proceedings of ACM international conference on multimedia, pp 251–260
Rosipal R, Krämer N (2006) Overview and recent advances in partial least squares. In: Subspace, latent structure and feature selection. Springer, pp 34–51
Salvador A, Hynes N, Aytar Y, Marin J, Ofli F, Weber I, Torralba A (2017) Learning cross-modal embeddings for cooking recipes and food images. In: Proceedings of IEEE conference on computer vision and pattern recognition
Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Su H, Lin TW, Li CT, Shan MK, Chang J (2014) Automatic recipe cuisine classification by ingredients. In: Proceedings of the 2014 ACM international joint conference on pervasive and ubiquitous computing: adjunct publication, pp 565–570
Wang X, Kumar D, Thome N, Cord M, Precioso F (2015) Recipe recognition with large multimodal food dataset. In: Proceedings of international conference on multimedia and expo workshop, pp 1–6
Xie H, Yu L, Li Q (2010) A hybrid semantic item model for recipe search by example. In: Proceedings of 2010 IEEE international symposium on multimedia (ISM), pp 254–259
Xu R, Herranz L, Jiang S, Wang S, Song X, Jain R (2015) Geolocalized modeling for dish recognition. IEEE Trans Multimed 17(8):1187–1199
Yamakata Y, Imahori S, Maeta H, Mori S (2016) A method for extracting major workflow composed of ingredients, tools and actions from cooking procedural text. In: 8th workshop on multimedia for cooking and eating activities
Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: Proceedings of international conference on machine learning, pp 3441–3450
Yang Z, He X, Gao J, Deng L, Smola A (2015) Stacked attention networks for image question answering. arXiv:1511.02274
Zhang W, Yu Q, Siddiquie B, Divakaran A, Sawhney H (2015) Snap-n-eat: food recognition and nutrition estimation on a smartphone. J Diabetes Sci Technol 9(3):525–533
Acknowledgements
The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 11203517).
Cite this article
Chen, JJ., Pang, L. & Ngo, CW. Cross-modal recipe retrieval with stacked attention model. Multimed Tools Appl 77, 29457–29473 (2018). https://doi.org/10.1007/s11042-018-5964-y