1 Introduction

Background

Food is an essential part of human life, not only as a biological need to sustain our daily activities and to keep an adequate health status, but also for mood balancing, leisure, and self-satisfaction. The complex function of food has thus led to the aphorism “eating for living, or living for eating”, indicating the different attitude towards food as need or as pleasure. The rapid evolution of multimedia technologies immediately reflected this natural human attitude, and it is nowadays common practice to immortalize dishes and meals through digital pictures and to share convivial or individual food-related experiences, like a particularly well done self-made dish, or a particularly yummy and well-presented restaurant meal. Just to provide an example of how much social media are focused on food, at the time of writing this report, the hashtag #food in Instagram appears in more than 484 million posts, while various other associated hashtags easily reach 100 million pictures (like #foodporn, #foodie, #instafood, etc.). At the same time, the rise in the importance of food in media communication has led to the emergence of new professions such as “food blogger” or “food influencer”, people who extensively use digital media to inform about recipes, dishes, restaurants for reviewing or marketing purposes [72]. Concurrently, the recent explosion of artificial intelligence (AI) has affected the performance and experience of multimedia systems across all domains. As a result, various applications related to food computing are continually being designed and are routinely used for activities associated with everyday meals. Out of the increasing interest to support various needs and the recent availability of public data, a new computing field called food computing concerned with automated food analysis has recently emerged [2, 93].

Problem

The main challenges addressed by the field relate to the classification and recognition of food images, which are considered more difficult than standard image classification tasks for the following reasons:

  • Data variability: numerous environmental and technical factors can become nuisances that affect the performance of food classification, such as lighting conditions, noise, occlusions, camera angle and the quality of images. Furthermore, variations in appearance due to different cooking styles, ingredients, and culinary cultures can complicate the classification problem [6].

  • Visual variability: Automatic classification of food from images is a fine-grained classification problem [46], and it is affected by two significant issues: inter-class similarity (low inter-class variance) and high intra-class variance. Inter-class similarity relates to food items that look alike despite belonging to different categories. For instance, dishes from different categories, such as a salad and a pizza, may share certain appearance characteristics, such as round shapes, vibrant colors, and toppings. Intra-class variance, instead, refers to images within the same food category that exhibit considerable visual variations due to factors such as cooking styles, ingredients, presentation, and cultural influences. For example, pizzas with different crusts, toppings, or cooking times all fall under the same category. Figure 1 shows some examples of inter-class and intra-class variance.

Fig. 1 Inter-class and intra-class variance in the Food-101 [9] data set. Top row: inter-class visual similarity. Bottom row: high intra-class variability

This field is especially fueled by deep learning and Convolutional Neural Networks (CNNs), which have extensively improved the accuracy of object detection, identification, and localization from single pictures [104]. Hence, in the context of food computing, machine learning approaches have been applied especially for: food detection [104, 105], food recognition [17, 76, 101, 104, 113, 128], food segmentation [28, 30, 73, 83, 103], food-tray analysis [2, 87, 95], food classification [2, 4, 19, 97, 102], ingredient recognition [13, 57, 85], food quality estimation [40, 51, 55], calorie counting [23, 56, 65, 99], and portion estimation [22, 43].

Fig. 2 Mobile applications for food analysis. (a) Kawano et al.’s app [50]. (b) Ingredients recognition and a cooking recipe recommendation [69]. (c) Calories estimation [113]. (d) Real-time mobile application classification on Pizza-Styles [29]. (e) Real-time mobile application classification on the GCC-30 data set [29]

Numerous efforts have been geared towards health-related targets in order to provide nutritional guidelines to users, such as calories and nutrition estimation [3, 113], food recommendation related to specific health conditions [93], ingredients recognition for people suffering from allergies, and many more (see Fig. 2).

Aim and contributions

Recent surveys about food computing [11, 53, 64, 78, 110] mostly target health-related applications due to their enormous impact on society: they review the technical aspects of computer vision approaches employed for recognition and classification. In contrast, this report surveys recent literature from a data perspective: we place special emphasis on the data sets used in and generated by previous work. In particular, we wish to understand data sizes, geographical coverage, and how multimedia and social media technologies in food computing leverage these data sets. Our main contributions to the field are:

  1. We provide a critical analysis of recently published AI-based methods for automatic food computing, with a focus on the data sets used and generated.

  2. We provide a critical analysis of recently published data sets and investigate their coverage in terms of represented cultural and regional environments, with the goal of geographical and geo-referenced classification. To this end, we release a public web resource listing the currently available data sets, and we indicate which areas of the world are still not covered. Researchers can access our web resource at https://slowdeepfood.github.io/datasets/.

  3. We discuss remaining challenges in the field from a multimedia perspective, the future of food computing for personal and regional applications, and the challenging connections to robotics for automatic food creation. To this end, we try to indicate possible directions for future research efforts.

Methods

We survey more than 100 papers, with topics related to:

  • application of machine learning and deep learning to food computing tasks, like food detection, food recognition, and food classification tasks;

  • available food image data sets for training and testing machine learning models;

  • available food computing applications.

Search queries

We obtained the corpus of surveyed papers through searches on popular digital libraries: Google Scholar, IEEE Xplore, Springer, ACM Digital Library, and arXiv. We used the following query, combining relevant keywords: (“Machine learning” OR “Neural network” OR “deep learning”) AND (“Food applications” OR “Food detection” OR “Recognition” OR “Food computing”) AND “Data set*”. The body of research in this area is growing rapidly, and this survey covers the period between 2010 and 2022. Descriptive statistics of published papers according to their category and year are shown in Fig. 3, left.
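The boolean structure of the query can be made explicit with a small local filter over titles or abstracts. The following is only a sketch under our own assumptions: each digital library interprets phrases and wildcards differently, the function names `matches_query` and `in_period` are hypothetical, and “Data set*” is read as a prefix wildcard on the two-word phrase (so the one-word spelling “dataset” would not match).

```python
import re

# Keyword groups copied from the query above; terms are OR-ed within a group.
METHOD_TERMS = ("machine learning", "neural network", "deep learning")
TOPIC_TERMS = ("food applications", "food detection", "recognition", "food computing")
# Assumption: "Data set*" is a prefix wildcard ("data set", "data sets", ...).
DATA_SET_PATTERN = re.compile(r"\bdata set\w*", re.IGNORECASE)


def matches_query(text: str) -> bool:
    """Return True if text satisfies (methods) AND (topics) AND "Data set*"."""
    lowered = text.lower()
    has_method = any(term in lowered for term in METHOD_TERMS)
    has_topic = any(term in lowered for term in TOPIC_TERMS)
    has_data_set = DATA_SET_PATTERN.search(text) is not None
    return has_method and has_topic and has_data_set


def in_period(year: int, first: int = 2010, last: int = 2022) -> bool:
    """Publication-year filter for the surveyed period (bounds inclusive)."""
    return first <= year <= last
```

A record would be retained only if both `matches_query` and `in_period` hold; the peer-review and language criteria described in the next subsection are not captured by this filter.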

Inclusion & exclusion criteria

In this survey, we only consider peer-reviewed papers and arXiv pre-prints that were published between 2010 and 2022. We excluded all papers written in languages other than English. We furthermore excluded papers that present a food computing methodology that is not specific to a given data set.

Article organization

The rest of this article is organized as follows. Section 2 presents the machine learning (Subsection 2.1) and deep learning approaches (Subsection 2.2) applied to food analysis. Section 3 provides a critical analysis of food data sets, and a description of the web resource for publicly available data that we created. Finally, Section 4 highlights the remaining challenges in food recognition and classification and suggests potential avenues for future investigations.

2 Overview of food classification approaches

The aim of this survey is not to provide an extensive overview of all methods developed for addressing the food classification challenges; we refer readers to the recent surveys specifically targeting that topic. Although many new frameworks have been proposed since, Min et al. [78] provide a complete review of food computing up to 2019, mostly targeting the use of machine learning approaches for classification of images containing food-related content. Additional surveys [53, 64, 110] focus more on volume quantification and caloric estimates for dietary assessment.

Fig. 3 Left: Categorization of the reviewed literature based on year and number of papers. This survey covers the period from 2010 to 2022 and focuses on machine learning, deep learning approaches, applications, and data sets in the food computing domain. Right: An overview of machine learning and deep learning pipelines. (a) Traditional machine learning approaches require manual feature extraction. (b) Modern deep learning approaches remove the human labeling bottleneck and automate all processes in an end-to-end framework

Here, we will provide a brief analysis of current technologies and the data sets used, and we provide guidelines for future development and applications. In general, food classification methods can be subdivided into two macro categories, corresponding to two different periods of technological advance in the field of machine learning, especially in computer vision and image processing. We observe:

  • a first period characterized by the use of traditional (i.e., “shallow”) machine learning methods, more or less spanning the time between 2010 and 2016;

  • a second period characterized by the use of deep learning and transfer learning, that started around 2016 when CNNs began to gain popularity in the computer vision community.

Figure 3 (right) illustrates the two macro categories of image classification in the food computing domain.

2.1 Traditional machine learning approaches

We characterize traditional machine learning as being composed of building blocks like modeling, extracting and quantifying geometry, and designing visual and categorical features from images. The process involves human engineering efforts and subjective analysis for modeling and discriminating the most descriptive and significant features for a given task. Since an exhaustive review of such methods is out of the scope of this survey, we only briefly review the most common methods for feature composition and supervised classification in the context of food computing. We then discuss their practical limitations.

Starting from feature design, the following popular feature-based composition methods have been considered by the community and successfully applied in food classification tasks.

  • Gabor filters [88] are linear filters that perform a directional frequency analysis around a point of interest. They are motivated by an attempt to emulate the human visual system. Gabor filters can be understood as band-pass filters obtained by modulating a Gaussian kernel with a complex, sinusoidal planar wave.

  • Local Binary Patterns (LBP) [15] are visual feature vectors obtained by partitioning the image into uniform cells, and by deriving a bit-string according to the comparison between neighboring pixels. The resulting bit-string is then used for creating a normalized feature histogram.

  • Bag of Features (BoF, or bag of visual words) [24, 37, 112] techniques cluster local features into a codebook of visual words and then encode each image against that codebook for classification.

  • Histograms of Oriented Gradients (HOG) [70, 94] consider the occurrences of discretized gradient orientations in portions of an image. A subsequent binning process on a uniform grid is used to compute a histogram that can be used as a feature vector for classification.

  • Scale Invariant Feature Transforms (SIFT) [70] consist of extracting key points of objects. Candidate matching of features is then performed using the Euclidean distance between feature vectors. The method benefits from efficient hashing on top of a generalised Hough transform.

  • Bag-of-Textons [24, 117] is inspired by the Bag-of-Words model commonly used in natural language processing. In the Bag-of-Words model, documents are represented as collections of individual words, focusing on their frequency of occurrence. The Bag-of-Textons model represents an image as a collection of local texture patterns and their spatial arrangement. Bag-of-Textons has been widely used in computer vision studies for texture analysis and image classification.

  • Pairwise Rotation Invariant Co-occurrence Local Binary Pattern (PRICoLBP) [24, 90] enhances LBP by incorporating multi-orientation, multi-scale, and multi-channel information. Unlike LBP, which considers only a single circular neighborhood around each pixel, PRICoLBP instead employs pairwise circular neighborhoods. Each neighborhood consists of a pair of points at a fixed distance and angle from the center pixel.

  • Speeded Up Robust Features (SURF) [47] is inspired by the SIFT descriptor but is several times faster and more robust against image transformations. It uses an integer approximation of the determinant of a Hessian blob detector [60], replacing the original scale space [61] with the sum of the Haar wavelet response around the point of interest for performing candidate matching.
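To make the feature-design step concrete, the following sketch computes a basic Local Binary Pattern descriptor in plain Python. It is an illustration under simplifying assumptions, not the implementation used in the cited works: the image is a list of lists of gray values, only the 8 immediate neighbours are compared, and the whole image is treated as a single cell (in practice, per-cell histograms are usually concatenated). The names `lbp_code` and `lbp_histogram` are ours.

```python
from typing import List

# Offsets of the 8 neighbours, visited clockwise starting at the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image: List[List[int]], row: int, col: int) -> int:
    """8-bit LBP code of an interior pixel: bit i is set if neighbour i >= centre."""
    centre = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        if image[row + dr][col + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image: List[List[int]]) -> List[float]:
    """Normalized 256-bin histogram of LBP codes over all interior pixels."""
    counts = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            counts[lbp_code(image, row, col)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 256


# Tiny grayscale patch used only for illustration.
patch = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
```

The resulting 256-dimensional histogram is the kind of hand-crafted feature vector that is then passed to one of the classifiers discussed next.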

Concerning the classification task, the following methodologies have been considered.

  • K-Nearest Neighbors (KNN) [12] performs supervised classification by capturing the idea of similarity (or proximity, or closeness) through distance evaluations between the feature vectors. A voting scheme among the K nearest labeled samples is used to assign a class, which induces a partition of the feature space.

  • Support Vector Machines (SVM) [37] try to compute separation hypersurfaces in the feature space by minimizing a loss function defining the soft margin of the separation. Various kernels are available to define the shape of the separation surface.

  • Multiple Kernel Learning (MKL) [37] tries various combinations of different kernels with different parameterizations, chosen from larger kernel sets. An optimizer decides how to choose the best kernel or combination of kernels.

  • Random Forests (RF) [9, 75] construct many decision trees as building blocks and use a majority-voting scheme for performing classification.

  • Near Duplicate Image Retrieval (NDIR) refers to the task of identifying and retrieving images that are visually similar or nearly identical to a given query image from a large database of images. Farinella et al. [24] use NDIR on UNICT-FD889 [24] to evaluate the performance of the three image descriptors Bag-of-Textons, PRICoLBP, and SIFT.

  • Fisher vectors [49, 123] use the Fisher kernel for patch aggregation. After local features are extracted using SIFT and HOG, they are encoded into representations such as BoF or Fisher Vectors (FV). The BoF representation involves clustering the local features and creating a histogram of the cluster assignments, representing the frequency of different visual patterns in the image. Conversely, FV captures the statistical properties of the local features using the mean and covariance matrix.
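As an illustration of the classification step, the sketch below implements K-nearest-neighbour prediction with Euclidean distance and majority voting. It is a minimal example under our own assumptions: the two-dimensional feature vectors and the labels are invented toy data, and `knn_predict` is a hypothetical helper rather than code from any of the cited frameworks.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Label the query by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors with hypothetical labels.
train_set = [
    ((0.9, 0.1), "pizza"), ((0.8, 0.2), "pizza"), ((0.7, 0.3), "pizza"),
    ((0.1, 0.9), "salad"), ((0.2, 0.8), "salad"), ((0.3, 0.7), "salad"),
]
```

Calling `knn_predict(train_set, (0.85, 0.15))` votes among the three closest samples. With an even `k`, ties are possible and are resolved here in favour of the label encountered first among the nearest samples, which is a design shortcut rather than a recommendation.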

Table 1 Traditional Machine Learning approaches applied to food classification

Most of the proposed food recognition methods mix and match various feature composition techniques with the aforementioned supervised classification methods. Table 1 provides an overview of the various attempts together with the reported classification accuracy. We point out here that traditional methodologies hardly reach \(85\%\) accuracy, indicating a performance wall. Consequently, the obtained performance cannot be considered adequate for many practical applications, especially for dietary assessment. Moreover, during the period 2010–2016 there was a lack of standardization in defining common benchmarks for evaluating the technologies, and most papers used their own image databases. This fact makes it difficult to carry out a consistent comparison between the various frameworks in terms of performance.

2.2 Deep learning approaches

Like in other application domains related to image analysis, the introduction and rapid success of deep neural networks coupled with practical training schemes dramatically affected the food computing field. Within a few years, most researchers in the community were dedicating their efforts towards exploiting various deep learning methods for food analysis tasks. As a result, an increasing number of end-to-end frameworks were presented and released for practical applications. Concurrently, various food databases were compiled and released to provide standardized benchmarks for the proposed methodologies. In the rest of this survey, we will try to categorize the various technologies from a data set perspective. Regarding the proposed classification frameworks, we identified the following two macro categories:

  • frameworks based on the design of customized deep convolutional neural networks (DCNNs) that mix and match various layers to form a hierarchy able to extract latent features to be used for classification [62, 63, 68, 124];

  • frameworks exploiting pre-trained convolutional neural networks through transfer learning [10]. Transfer learning gained significant attention in recent years for achieving excellent performance at comparatively little computational training cost [2, 18, 29, 42, 46, 81, 109, 122].

2.2.1 Frameworks based on customized deep CNNs

The customized DCNN methods have the advantage of integrating “domain knowledge”: they try to explicitly model specific characteristics of food images for specific tasks. Therefore, various customized deep learning architectures have been proposed for food classification. Liu et al. [62] customized the GoogLeNet architecture [106] by modifying the convolutional and pooling layers to automatically derive the food information (e.g., food type and portion size) from images acquired with smartphones. Martinel et al. [68] proposed Wide-Slice Residual Networks (WISeR), which incorporate two main branches within a single network, a residual network branch and a slice network branch, and introduce a slice convolution block able to capture the vertical food layers. The outputs of the deep residual blocks are combined with the slice convolution output to improve the classification score for specific food categories. Pandey et al. [86] proposed a multi-layer ensemble network (EnsembleNet) for food recognition that combines three fine-tuned CNNs, AlexNet [54], GoogLeNet [106], and ResNet [35], whose classifiers work as an ensemble. Inspired by Adversarial Erasing (AE) [120], Qiu et al. [91] proposed a hybrid adversarial network architecture called PAR-Net. This network consists of three networks: a primary network to maintain the base accuracy of classifying an input image, an auxiliary network that mines discriminative food regions, and a region network that classifies the resulting mined regions. For targeting visual food recognition on mobile devices, Zhao et al. [127] presented a student-teacher architecture [36] called Joint-learning Distilled Network (JDNet). JDNet performs simultaneous student-teacher training at different levels of abstraction by exploiting instance activation maps at various resolutions.

Jiang et al. [44] proposed a scheme called Multi-Scale Multi-View Feature Aggregation (MSMVFA). This scheme enables two-level fusion: first, it combines features of different scales for each feature type, and then it aggregates features from multiple views with varying levels of detail. This approach aims to generate a fine-grained representation that is more resilient, discriminative, and comprehensive, leading to improved food recognition. In order to incorporate multiple semantic features in the modeling process, Liang et al. [58] proposed a multi-task learning approach called Multi-View Attention Network (MVANet). MVANet uses a multi-view attention mechanism [100] to automatically adjust the weights of different semantic features in order to enable the interaction between different tasks. Similarly, Jiang et al. [44, 79] exploit distinctive spatial arrangements and common semantic patterns in food images for developing an Ingredient-Guided Cascaded Multi-Attention Network (IG-CMAN). IG-CMAN tries to localize image regions at multiple scales, ranging from category-level to ingredient-level in a coarse-to-fine manner. On the technical side, IG-CMAN uses a Spatial Transformer [41] for generating attentional regions and combines them with Long Short Term Memory [38, 116] to sequentially discover diverse attentional regions at ingredient levels.

Min et al. [80] introduced an approach called Stacked Global-Local Attention Network (SGLANet) that simultaneously captures both global and local features, enhancing the overall recognition performance. Min et al. [81] proposed the Progressive Region Enhancement Network (PRENet), which comprises progressive local feature learning and region feature enhancement. In progressive local feature learning, a training strategy is employed to acquire complementary multi-scale finer local features, such as diverse ingredient-related information. The region feature enhancement employs self-attention to integrate more comprehensive contexts with multiple scales into local features, thereby improving their representation.

Finally, some frameworks tried to exploit the advantages of different CNNs by designing ensembles [86] or by considering voting schemes, as in the framework called “TastyNet” [14].
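One common way to combine several classifiers is soft voting, in which the per-class probabilities of all models are averaged before the top class is selected. The sketch below illustrates this generic idea only; it is not the combination rule of EnsembleNet [86] or TastyNet [14], and the function name `soft_vote` and the probability values are hypothetical.

```python
from typing import Dict, List


def soft_vote(predictions: List[Dict[str, float]]) -> str:
    """Average per-class probabilities over all models and return the top class."""
    if not predictions:
        raise ValueError("need at least one model prediction")
    classes = predictions[0].keys()
    averaged = {
        c: sum(p[c] for p in predictions) / len(predictions) for c in classes
    }
    return max(averaged, key=averaged.get)


# Hypothetical softmax outputs of three fine-tuned CNNs for a single image.
model_outputs = [
    {"pizza": 0.60, "salad": 0.30, "pasta": 0.10},
    {"pizza": 0.25, "salad": 0.65, "pasta": 0.10},
    {"pizza": 0.55, "salad": 0.35, "pasta": 0.10},
]
```

In this toy case the second model prefers “salad”, but the averaged scores favour “pizza”, which shows how an ensemble can smooth over the error of a single member.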

2.2.2 Frameworks based on transfer learning

Various food classification frameworks have exploited transfer learning [2, 18, 29, 42, 46, 81, 109, 122] by considering the following generic CNN architectures:

  • Inception [107, 108] networks are deep neural networks consisting of repeating blocks, called Inception blocks, where the output of one block acts as the input to the next. Inception has been used in three food classification architectures [32, 109, 121]. Specifically, Hassannejad et al. [32] fine-tuned a pre-trained Inception architecture for classifying food images, Tahir et al. [109] used InceptionNet as a feature extractor for open-ended continual incremental learning, and Wibisono et al. [121] customized InceptionNet for the classification of traditional Indonesian food;

  • GoogLeNet [106] is a type of convolutional neural network based on the Inception architecture. It uses Inception modules, which allow the network to choose between multiple convolutional filter sizes in each block. An Inception network stacks these modules on top of each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. It was used for transfer learning in two frameworks [63, 75]: Meyers et al. [75] applied GoogLeNet to predict which foods are present in a meal and to look up the corresponding nutritional facts, while Liu et al. [63] incorporated GoogLeNet in a food recognition system employing an edge computing-based service computing paradigm;

  • DenseNet [39] is a type of convolutional neural network that introduced the concept of dense connections between every layer in a feed-forward pattern, ensuring optimal information flow throughout the network. For food classification, Tahir et al. [109] used DenseNet as a feature extractor for open-ended continual learning;

  • The Residual Network (ResNet) [35] architecture incorporates skip connections, which enable the network to skip one or more layers. These connections allow the model to learn residual functions, capturing the difference between the input and the output of a layer. By skipping layers, the network can propagate the gradient signal more effectively during training, addressing the problem of degradation that often occurs in deeper networks. It has been used extensively in food classification frameworks [18, 42, 46, 109, 122]. Specifically, Tahir et al. [109] used ResNet as a feature extractor for continual learning, Ciocca et al. [18] fine-tuned ResNet on Food524DB for food image classification, Jalal et al. [42] incorporated ResNet-101 to train a classifier named KenyanFTR (Kenyan Food Type Recognizer) to classify 13 dishes in Kenya, Kaur et al. [46] used a pre-trained ResNet-101 on the FoodX-251 data set for the food classification task, and finally Won et al. [122] utilized pre-trained ResNet-50 together with Inception-ResNet-V2 on various food data sets (i.e., UEC Food-256 [48], Food-101 [9] and VIREO Food-172 [12]) for fine-grained food classification;

  • EfficientNet [111] is an architecture designed to achieve state-of-the-art performance on image classification tasks while maintaining a relatively small model size and computational cost. The main intuition behind EfficientNet is the “compound scaling method”, which uniformly scales the network's depth, width, and input resolution. It has been used in two food classification frameworks [27, 29]: Gilal et al. [29] used EfficientNet to train custom classification models in the context of a framework for creating custom food classification tools for regional gastronomy, and Foret et al. [27] applied Sharpness-Aware Minimization (SAM), an optimization procedure, when training EfficientNet and tested the resulting model on the classification of the Food-101 data set.
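The compound scaling idea can be illustrated numerically. In the sketch below, a single coefficient phi controls depth, width, and resolution through fixed bases; the base values 1.2, 1.1, and 1.15 are the ones we recall from the EfficientNet paper [111] and should be treated as assumptions to be checked against it. The helper names are ours, and the FLOPs estimate is the usual rough proportionality (depth times squared width times squared resolution), not a measured cost.

```python
# Assumed base coefficients (to be verified against [111]):
# depth base ALPHA, width base BETA, resolution base GAMMA.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float) -> dict:
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi: float) -> float:
    """Rough FLOPs growth, proportional to depth * width^2 * resolution^2."""
    s = compound_scale(phi)
    return s["depth"] * s["width"] ** 2 * s["resolution"] ** 2
```

With these bases, each unit increase of phi roughly doubles the estimated compute, which is what makes the family convenient for trading accuracy against deployment cost on mobile platforms.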

2.2.3 Performance comparison

Table 2 compiles the accuracy of all discussed deep learning technologies for better comparison of the performance of the methods described so far, organized by the benchmark data set used. The table clearly underscores the current trend towards transfer learning on top of high performance architectures. At the time of writing, the best accuracies are obtained using the EfficientNet family of networks [27, 29]. EfficientNets have the advantage of providing control over training times and lightweight models that can be deployed on mobile platforms.

Table 2 Accuracy of deep learning architectures on publicly available data sets

In the following sections, we provide a more accurate analysis of public domain data sets and a critical discussion to identify gaps and limitations.

3 Analysis of food data sets

Concurrently with the development of technologies for automated analysis of food images, researchers compiled a large corpus of image databases to be used for various tasks, such as training artificial intelligence models or serving as public benchmarks for comparing various methods. The proliferation of public image databases benefited from the growth of the internet, the widespread availability of modern smart devices, and the digital revolution [82]. In general, available data sources can be categorized into three main types: catering websites, social media, and cameras. In recent years, the availability of huge online food data collections has contributed to the explosion of websites for sharing recipes and food information, such as Yummly, Meishijie, and Allrecipes.

Fig. 4 (a) and (b) Recipes with nutrition and ingredients taken from Yummly. (c) Recipes taken from Meishijie. (d) Recipes taken from Allrecipes

As an example, Yummly’s website contains information related to eleven cuisines of different countries and more than two million recipes with ingredients and nutrition. Figure 4(a) and (b) show some examples from Yummly. Each recipe includes cuisine category, dish name, food image, a list of ingredients, and nutritional information.

Furthermore, some recipe websites provide rich social information, such as comments and ratings, which can be helpful for tasks such as recipe recommendation [114] and prediction of recipe rating [126]. In addition to recipe sharing websites, social media such as Facebook, Flickr, Twitter, Instagram, YouTube and Foursquare are also considerable food-related data sources. For instance, Culotta [20] investigated whether linguistic patterns in Twitter correlate with health-related statistics. Abbar, Mejova and Weber [1] merged Twitter demographic details and food names to model the value-diabetes correlation. Besides textual data, recent research [74, 84] has used huge collections of food images from social media for the investigation of food perception and eating behaviors. Given the popularity of cameras embedded in smartphones and wearable devices [118], collecting food images directly from cameras has also become common. For example, researchers have started capturing food images for visual food comprehension in restaurants or canteens [17, 21]. In addition to food images, Damen et al. [21] used a head-mounted GoPro camera for collecting videos of cooking sessions.

In any case, given the extremely high online availability, a huge number of food data collections have been compiled and made available to the public. In Table 3 we provide a collection of the food databases published over the last decade, together with the corresponding references, statistical information, the task for which they were compiled, and the provenance of the food specialties considered. Most of the available databases were used for training and testing automatic classification of food and the recognition of dishes inside scenes or trays (N=20). This is mainly driven by the increasing success of deep CNNs. More recently, image databases with additional metadata were compiled for addressing more application-oriented tasks, like calorie estimation for dietary purposes (N=3), recipe retrieval (N=2), or understanding the nutritional content (N=2). In the following, we will provide a more detailed analysis of the public databases by focusing on two aspects: the relationship between data complexity and performance, and the geographical distribution.

Table 3 Publicly available food data sets

3.1 Complexity analysis

We performed a statistical analysis of the most popular food databases according to their size and accuracy. Our analysis targets food classification tasks and we consider the methods reported in Table 2.

Fig. 5 Top-1 classification accuracy on the most popular databases: histogram plot for comparing performance of classification methods

Figure 5 provides a direct comparison of classification methods on the most popular databases, namely UEC Food-100 [70], UEC Food-256 [49], VIREO Food-172 [12], and ETH Food-101 [9]. We note that for those data sets perfect classification has not yet been achieved: at the time of this writing, the best Top-1 accuracies are: \(89.58\%\) [68] for UEC Food-100, \(83.15\%\) [68] for UEC Food-256, \(91.34\%\) [122] for VIREO Food-172, and \(96.18\%\) [27] for ETH Food-101.
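For readers less familiar with the metric, Top-1 accuracy is the fraction of test images whose highest-scoring predicted class equals the ground-truth class; Top-k relaxes this to the k highest-scoring classes. The following sketch computes both from raw class scores. The scores and labels are invented for illustration and do not correspond to any result in Table 2; `top_k_accuracy` is a hypothetical helper.

```python
from typing import List, Sequence


def top_k_accuracy(scores: List[Sequence[float]], labels: List[int], k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and aligned")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical class scores for four images over three classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]
labels = [0, 2, 1, 2]
```

On this toy input the second image is misclassified at k = 1 but recovered at k = 2, which is why Top-k figures with k greater than one are always at least as high as Top-1.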

Fig. 6 Complexity analysis: data set comparison of accuracy performance with respect to the number of categories and number of images. Top: bubble plots indicating the accuracy compared to the number of classes (left) and the number of images (right). Bottom: scatter plot in semi-logarithmic scale comparing the number of classes and number of images

We then performed an analysis of the relationship between data set complexity and accuracy: Figure 6 shows two bubble plots and one scatter plot for comparing the database complexity and the attained accuracy. From these plots we conclude that databases containing more food categories, like UNICT-FD889 [24] or ISIA Food 200 [79] and 500 [80], are still challenging for classification methods. For the first case (UNICT-FD889), an additional source of complexity is the low ratio between the number of images and the number of categories (around four images per category). Since future applications will need models that scale with ever-growing databases, it is paramount that practitioners start considering iterative and continual learning approaches.
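The images-per-category ratio mentioned above is straightforward to compute. In the sketch below, the data set sizes are assumptions based on commonly reported figures (3,583 images over 889 dishes for UNICT-FD889, and 101,000 images over 101 classes for ETH Food-101) and should be checked against Table 3; the helper name is ours.

```python
def images_per_class(n_images: int, n_classes: int) -> float:
    """Average number of images available per category."""
    if n_classes <= 0:
        raise ValueError("n_classes must be positive")
    return n_images / n_classes


# Assumed (n_images, n_classes) pairs; verify against Table 3 before reuse.
datasets = {
    "UNICT-FD889": (3583, 889),
    "ETH Food-101": (101000, 101),
}
ratios = {name: images_per_class(*size) for name, size in datasets.items()}
```

Under these assumptions the ratio is about 4 for UNICT-FD889 and 1,000 for ETH Food-101, a gap of more than two orders of magnitude in the amount of supervision available per class.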

There is also a clear need for technologies that can incorporate a continually growing number of categories and address the fine-grained classification challenges resulting from this growth. One promising framework in this direction was recently presented by He et al. [34]. They propose a method based on clustering and exemplar selection for storing the most representative data belonging to each learned food category, and they demonstrate it on a reduced version of Food-2K [81].
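To make the idea of exemplar selection more concrete, the following is a minimal sketch of one simple heuristic: keep the samples whose feature vectors lie closest to the class mean. This is not the method of He et al. [34], whose details are not reproduced here; the function names, the toy two-dimensional features, and the budget `k` are all assumptions introduced for illustration.

```python
import math


def class_mean(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_exemplars(vectors, k):
    """Return the indices of the k vectors closest to the class mean.

    Illustrative nearest-to-mean heuristic only; real continual learning
    systems typically use more elaborate selection rules.
    """
    mean = class_mean(vectors)
    distances = [math.dist(v, mean) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: distances[i])
    return ranked[:k]


# Hypothetical 2-D features for the images of one food category.
features = [[0.0, 0.0], [1.0, 1.0], [1.2, 1.0], [5.0, 5.0], [2.0, 2.0]]
exemplar_ids = select_exemplars(features, k=2)
print(exemplar_ids)
```

In a continual learning setting, only the selected exemplars of previously learned categories would be stored and replayed when new categories arrive, which keeps memory bounded as the taxonomy grows.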

Fig. 7 Data set complexity clusters: scatter plot in semi-logarithmic scale comparing the number of classes and the number of images, with clusters grouping moderate and high complexity data sets

Finally, Fig. 7 illustrates the two groups identified in our analysis of food data sets: moderate and high complexity data sets.

  • Moderate complexity data sets: data sets containing from 646 to around 10K images, historically used to train food classification models based on both traditional schemes and deep learning architectures.

  • High complexity data sets: data sets containing from approximately 10K to millions of images, better suited to training higher complexity deep learning models.

Models can be trained relatively quickly on the moderate complexity data sets using traditional machine learning algorithms, thanks to their small size, whereas training on the high complexity data sets takes longer because of both the larger data set sizes and the greater complexity of the deep learning algorithms involved.
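The grouping above can be sketched in a few lines of code. The snippet below is only an illustration under stated assumptions: the 10K-image boundary is the approximate one given in the text, while the `FoodDataset` class, its field names, and the three toy entries are hypothetical and do not correspond to rows of Table 3. It also computes the images-per-category ratio discussed earlier as an additional source of complexity.

```python
from dataclasses import dataclass

# Approximate boundary between the two groups, as stated in the text.
HIGH_COMPLEXITY_MIN_IMAGES = 10_000


@dataclass
class FoodDataset:
    """Hypothetical record describing one food data set."""
    name: str
    num_classes: int
    num_images: int

    @property
    def images_per_class(self) -> float:
        """Average number of images available per category."""
        return self.num_images / self.num_classes

    @property
    def complexity_group(self) -> str:
        """'moderate' below the image threshold, 'high' at or above it."""
        if self.num_images >= HIGH_COMPLEXITY_MIN_IMAGES:
            return "high"
        return "moderate"


# Toy entries for illustration only (not taken from Table 3).
datasets = [
    FoodDataset("toy-small", num_classes=50, num_images=646),
    FoodDataset("toy-sparse", num_classes=900, num_images=3_600),
    FoodDataset("toy-large", num_classes=100, num_images=100_000),
]

groups = {d.name: d.complexity_group for d in datasets}
for d in datasets:
    print(f"{d.name}: {d.complexity_group}, {d.images_per_class:.1f} images/class")
```

Note that a data set such as the hypothetical `toy-sparse` falls in the moderate group by image count while still being hard in practice, because it offers only about four images per category.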

3.2 Geographic and gastronomic analysis

Besides the previous complexity analysis, we also analyzed the geographical distribution of publicly available data sets for food computing. We mapped each data set to its corresponding region and plotted it on a world map with geo-located glyphs. We then created an open resource web page (footnote 4), in which the food computing community can gather information about the most significant food databases. The geographic distribution provides visual information on which parts of the world are well represented by food databases and which are still missing. Figure 8 shows a view of the website’s geographic map: each circle marker on the world map represents a data set, and the size of the circle indicates the size of the data set (i.e., the number of images).

Fig. 8 Geographic distribution of food data sets: with this survey, we also release an open source web page that gathers publicly available data sets under a single source. We mapped each data set to its geo-location and original source. Each circle marker on the world map represents a data set and its size, with a link to the original source

Figure 9 gives examples of the diversity in food data sets, which is due to differences in cooking styles and culinary cultures, like pizza styles, sushi, Arabic food, Chinese food, etc.

Fig. 9 Visualization of food data sets with some samples taken from each data set

4 Challenges and future work

Despite the impressive progress in food computing technologies, many challenges remain unsolved and there is ample room for improvement in many parts of the processing pipeline. As a logical conclusion of our survey, we highlight here a number of problems and a few possible development directions that we expect will stimulate research efforts in the field in the coming years.

Fig. 10 Recipe disassembly: traditional Roman pasta preparations can be obtained by different compositions of ingredients, starting from the basic “Cacio e Pepe” to reach the popular “Carbonara” and “Amatriciana”

First of all, as shown in Sec. 3, the geographic distribution of available data sets is not uniform, and many important gastronomic areas are not represented at all. This is because most data sets were created for stress-testing automatic processing methods, and they are too general to be applied to specific culinary styles, preparation methods, and regions. Many international organizations, like IGCAT (International Institute of Gastronomy, Culture, Arts and Tourism; footnote 5) or SlowFood (footnote 6), regularly promote initiatives for raising awareness about the importance of cultural food uniqueness, as well as for highlighting distinctive food cultures. We believe that data customizations relevant to different cultures can contribute to preventing the disappearance of local food traditions, thus stimulating creativity, educating for better nutrition, and improving sustainable tourism standards. We expect future efforts to create databases representing gastronomic regions of different extents, and we plan to contribute to this field by targeting various areas not considered until now. We would also like to mention other initiatives like TasteAtlas (footnote 7), which attempts to provide a world atlas of traditional dishes by featuring an interactive global food map with dish icons shown in their respective regions. In this context, Gilal et al. [29] recently proposed a framework that is able to create customized models for different gastronomies by using image databases compiled through semi-automatic filtering of downloaded images. Moreover, as suggested by the analysis of current technologies, we expect that future architectures and models will be able to scale with respect to the taxonomies and food specialties represented, similarly to popular music recognition applications. To achieve these goals, food computing will need to incorporate the latest deep learning technologies, with particular focus on online continual learning [34, 109], few-shot learning [45], and imbalanced classification [26].

Another important problem to consider is artificial intelligence for food reverse engineering. In this context, “reverse engineering” seeks to automatically decompose a plate by recovering the steps for creating it, thus extracting a recipe from the final dish. Here, we give a simple example taken from traditional Roman cuisine, concerning the preparation of pasta from simple ingredients, that shows the connections between popular recipes. In Fig. 10 we show how, starting from the basic “Cacio e Pepe” (cacio cheese and pepper), we can obtain the famous “Carbonara” and “Amatriciana”, passing through “Gricia”, just by adding different simple ingredients. An advanced food computing system should be able to automatically recover the steps for obtaining the plate, paving the way to applications such as driving robotic systems for automatic food creation and replication. In the last five years, start-up companies like Moley (footnote 8), Creator (footnote 9), and Picnic (footnote 10) have made impressive progress in developing prototype robo-kitchens that are able to provide a full cooking takeover and to fully substitute human intervention, either for residential use or for burger and pizza restaurants. These kinds of robotic systems can definitely benefit from integration with automatic food computing frameworks. We expect that scenarios from popular science fiction will become realistic within a few years: in the future, an input picture of a plate will be enough to drive a trained automatic system for recognition, recipe disassembly, and finally physical reproduction. The synergy between robotics companies and the artificial intelligence community will be decisive in speeding up this process.
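The composition chain of Fig. 10 can be represented as a small data structure in which each recipe is a set of ingredients and each derivation step is a set difference. The sketch below is only illustrative: the ingredient lists are simplified assumptions based on common descriptions of these dishes (they are not taken from the figure), and the names `RECIPES`, `added_ingredients`, and `derivation_steps` are hypothetical.

```python
# Simplified ingredient sets (assumption: common descriptions of each dish).
RECIPES = {
    "Cacio e Pepe": frozenset({"pasta", "pecorino", "black pepper"}),
    "Gricia": frozenset({"pasta", "pecorino", "black pepper", "guanciale"}),
    "Carbonara": frozenset(
        {"pasta", "pecorino", "black pepper", "guanciale", "egg"}
    ),
    "Amatriciana": frozenset(
        {"pasta", "pecorino", "black pepper", "guanciale", "tomato"}
    ),
}


def added_ingredients(base, derived):
    """Ingredients present in `derived` but not in `base`."""
    return RECIPES[derived] - RECIPES[base]


def derivation_steps(path):
    """List (from, to, added ingredients) for consecutive recipes in `path`."""
    return [
        (a, b, sorted(added_ingredients(a, b)))
        for a, b in zip(path, path[1:])
    ]


steps = derivation_steps(["Cacio e Pepe", "Gricia", "Carbonara"])
for source, target, extra in steps:
    print(f"{source} -> {target}: add {', '.join(extra)}")
```

A recipe disassembly system would have to solve the inverse problem, inferring such a chain from an image of the final dish, which is considerably harder than this forward composition.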