1 Introduction

The Double Burden of Malnutrition (DBM) is defined as the coexistence of undernutrition and overweight, a global issue affecting populations across all regions of the world. According to the World Health Organization (WHO), an estimated 39% of the adult population is currently overweight, and this figure is projected to reach 50% by 2030. In addition, Non-Communicable Diseases (NCDs) have spread over the last century, driven mainly by poor eating behaviours and a lack of physical activity, among other factors. These NCDs, such as diabetes, cardiovascular disease, and cancer, cause millions of deaths every year, underscoring the critical need for healthy dietary planning to mitigate these risks [26, 66].

Historically, strategies for promoting healthier diets have been based on recommendations tailored to the general population. National and international organisations have created food pyramids, which serve as guidelines for daily dietary choices across various food groups [40, 58]. Figure 1 provides a graphical representation of a typical food pyramid, categorising food intake into 6 nutritional levels based on intake frequency. Personalising these general-population recommendations [25, 28], together with the rapid deployment of smart devices and Artificial Intelligence (AI) methods [20, 68], is expected to revolutionise the promotion of healthier lifestyles.

Fig. 1

AI4Food nutritional food pyramid. Top levels (1, 2, and 3) indicate lower food intake frequency, whereas bottom levels (4, 5, and 6) indicate higher food intake frequency

In addition, health-related information, such as nutrition and physical activity data, can now easily be acquired through smartphones and wearable devices [2, 5]. As a result, a large amount of data has been generated in recent years, and its analysis, often referred to as food computing, can provide valuable insights. Food computing encompasses the acquisition and analysis of heterogeneous food data to address food-related issues across domains such as medicine, biology, gastronomy, and agronomy [44]. For instance, a popular and user-friendly way of capturing a person’s eating habits is to take pictures of the food consumed. Consequently, millions of food images have been shared on social networks, and new computer vision applications based on food detection and recognition have emerged [3, 17, 46, 47].

Current studies are limited in scope, focusing primarily on the application of computer vision techniques to food images alone, e.g., the task of food recognition. The AI4Food-NutritionDB database intends to reduce the gap between the resources generated by computer vision experts and the guidance of nutrition experts. First, we introduce a food image database that incorporates a nutrition taxonomy covering, among other aspects, the nutritional levels of the popular food pyramid and their subcategories. Second, we introduce a benchmark based on novel deep learning models to automatically assess several nutritional aspects and scenarios (i.e., intra- and inter-database). A graphical diagram of the proposed study is shown in Fig. 2.

In this article, we also present our interdisciplinary framework named AI4Food [54], which aims to reduce the gap between computer scientists and nutritionists. Our overall objective is to foster a new generation of technologies focused on modelling users’ habits, including food diet and physical activity patterns. AI4Food-NutritionDB comprises a database and a taxonomy aimed at improving current methods and resources for research in food computing. One of the primary objectives of the AI4Food framework is to create a configurable software environment capable of generating synthetic diets, including food images, to simulate different user profiles depending on lifestyles and eating behaviours. This functionality has two potential applications, among many others: i) the automatic proposal of healthy diets to the final user, frequently updated according to the specific user’s habits, and ii) the automatic and continuous analysis of the user’s eating habits from the food pictures taken, providing recommendations to improve them. To achieve this goal, this article focuses on the generation of a food nutrition image database that includes a taxonomy derived from international nutritional guidelines.

Fig. 2

Graphical diagram of the proposed study. First, we graphically show the AI4Food-NutritionDB database generation and the proposed nutrition taxonomy (i.e., nutritional level, category, subcategory, and food product). Second, the AI4Food-NutritionDB benchmark is presented, analysing the proposed nutrition taxonomy and scenarios (i.e., intra- and inter-database)

Fig. 3

Description of the AI4Food-NutritionDB food image database framework and taxonomy. This database is generated using food images from seven different state-of-the-art databases, i.e., UECFood-256 [37], Food-101 [8], Food-11 [57], FruitVeg-81 [64], MAFood-121 [3], ISIA Food-500 [46], and VIPER-FoodNet [39]. The AI4Food-NutritionDB database comprises 6 nutritional levels, 19 main categories, 73 subcategories, and 893 final products with over 500K food images

The main contributions of the present study are:

  • AI4Food-NutritionDB database. To the best of our knowledge, this is the first nutrition database that combines food images with a nutrition taxonomy. The proposed taxonomy includes four levels of categorisation: 6 nutritional levels (see Fig. 1), 19 main categories (e.g., “Meat”, the family of vertebrate animals), 73 subcategories (e.g., specific products such as “White Meat”), and 893 final food products (e.g., “Chicken”). In addition, each subcategory is assigned a type of dish (e.g., “Appetizer” or “Main Dish”) according to its healthiness and food quantity. Figure 3 provides a graphical description of the database and its associated taxonomy. AI4Food-NutritionDB opens the door to new food computing approaches in terms of food intake frequency, quality, and categorisation.

  • Proposal of a standard experimental protocol and benchmark, including different recognition tasks (category, subcategory, and final product). The experimental protocol comprises both intra- and inter-database scenarios, ensuring a robust evaluation.

  • Free release to the research community of the described datasets, protocols, and multiple deep learning models. These models can serve as pre-trained models, achieving accurate recognition results when applied to other challenging food databases. All these resources are publicly available in our GitHub repository.

The remainder of the article is organised as follows. Section 2 presents state-of-the-art studies related to food computing and food image databases. Section 3 explains the design of the AI4Food-NutritionDB food image database. Section 4 describes the proposed standard experimental protocol and the benchmark results obtained on the AI4Food-NutritionDB database using deep learning techniques. Finally, conclusions and future research directions are drawn in Section 5.

2 Related works

2.1 Food computing

Food computing has become a very active research topic in recent years, applying computational approaches to food-related areas. Methods based on computer vision, data mining, and machine learning, among others, have been used to analyse large amounts of food images obtained from the Internet, social platforms, and smartphones. In general, food computing covers a wide range of tasks, including food segmentation, recognition, and recommendation, with applications in fields such as health, biology, and agriculture [1, 44].

Among these tasks, food recognition is one of the most popular in the literature. It consists of detecting and classifying food images using different techniques. Traditional approaches rely on visual features such as shape, colour, and texture for food product detection [59]. Scale Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and Local Binary Patterns (LBP) are popular descriptors used in the literature as feature extractors [61]. For classification, Support Vector Machine (SVM) and K-Nearest Neighbour (KNN) algorithms are the most common choices to differentiate food products and categories [51]. However, traditional approaches are ineffective on challenging databases, and deep learning techniques have shown better performance [42, 47]. In particular, complex architectures based on Convolutional Neural Networks (CNNs) perform feature extraction and classification jointly, achieving accuracy (Acc.) rates above 80% [4, 34]. For instance, Min et al. [47] used the Squeeze-and-Excitation Network (SENet) architecture [31], achieving a high inter-database generalisation capacity with a 91.45% Top-1 Acc. on the VireoFood-172 database. They also considered other deep learning architectures in [46, 47], for instance, the Stacked Global-Local Attention Network (SGLANet) and the Progressive Region Enhancement Network (PRENet). Experiments were carried out on the ISIA Food-500 and Food2K databases [46, 47], achieving Top-1 Acc. results of 64.74% and 83.62%, respectively.
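
To make the contrast concrete, the following is a minimal sketch of such a traditional pipeline (HOG descriptors fed to an SVM classifier) using scikit-image and scikit-learn; the random images and labels are toy stand-ins for a real food image dataset, and the hyperparameters are illustrative choices rather than values from any of the cited studies.

```python
# Minimal sketch of a traditional food-recognition pipeline
# (HOG features + SVM classifier), as described above.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def extract_hog(image, size=(128, 128)):
    """Resize, convert to grayscale, and compute a HOG descriptor."""
    gray = rgb2gray(resize(image, size, anti_aliasing=True))
    return hog(gray, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))

rng = np.random.default_rng(0)
images = [rng.random((256, 256, 3)) for _ in range(40)]  # toy RGB images
labels = rng.integers(0, 2, size=40)                     # two toy classes

features = np.array([extract_hog(img) for img in images])
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

clf = SVC(kernel="rbf", C=10.0)  # RBF-kernel SVM, a common choice
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```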

It is important to highlight that most published studies focus on food recognition at the final food product level (using the name of the dish as the label, e.g., “Pasta alla Norma”) or at the main category level (e.g., “Fast Food”). In the present article, however, we analyse the task of food recognition based on the proposed nutrition taxonomy (6 nutritional levels, 19 main categories, 73 subcategories, and 893 final food products), as this is needed for many real applications, particularly those related to healthy dietary practices. In addition, each subcategory is assigned a type of dish (e.g., “Appetizer” or “Main Dish”) according to its healthiness and food quantity.

Table 1 State-of-the-art food image databases

2.2 Food image databases

Many food image databases have emerged in recent years, including a wide range of food products from various world areas. These databases can be grouped according to three different acquisition protocols found in the literature (i.e., self-collected, web scraping, and combination). Table 1 provides a summary of these databases and their metadata, including the protocol used, the number of classes and images, and the world region:

  • Self-collected: this consists of taking food images with a camera or smartphone in controlled or semi-controlled environments. Although databases such as PFID [13], UNIMIB2015 [16], and UNIMIB2016 [17] include a large variety of food products, the total number of images is relatively low (< 20K images) due to the extensive manual process. Similarly, UNICT-FD889 [23] and UNICT-FD1200 [24] are two databases with 889 and 1,200 final food products, respectively, representing food dishes from different parts of the world (e.g., English, Japanese, Indian, and Italian, among others). In addition, FruitVeg-81 [64] contains more than 15,000 fruit and vegetable images, whereas Mixed-Dish [21] is a food image database of 164 Asian food products. Finally, F4H [9] and Food-Pics Extended [7] are two databases captured in a controlled scenario with a plain background.

  • Web scraping: web scraping techniques are employed to acquire large amounts of food images. In contrast to self-collected protocols, thousands of food images can easily be gathered from social and web platforms. Some databases focus on food products from specific regions of the world, for instance, traditional Japanese and Chinese dishes (e.g., Food50 [35], UECFood-256 [37], and VireoFood-251 [11]), while others include dishes from Europe and North America (e.g., Food-101 [8] and UPMC Food-101 [65]). Additionally, TurkishFoods-15 [29], KenyanFood13 [33], and VIPER-FoodNet [39] are three food image databases from Turkey, Kenya, and the United States, respectively. Other databases, such as Instagram 800k [53], Food500 [43], ISIA Food-200 [45], and FoodX-251 [36], include food dishes from several regions of the world. Similarly to FruitVeg-81, VegFru [30] contains only fruit and vegetable images. ISIA Food-500 [46] is a database with 500 final food products and, lastly, Food2K [47] is a recent database with around 1M food images organised into 2K food products.

  • Combination: this consists of creating new food image databases by combining data from existing ones. For instance, Food201-Segmented [49] is derived from the Food-101 database, supplemented with food tags obtained through crowd-sourcing. Food-11 [57] is a database created primarily from three different databases (Food-101, UECFood-100, and UECFood-256) and grouped into 11 main food categories. The Multi-Attribute Food-121 (MAFood-121) [3] database covers the top-11 most popular cuisines in the world according to Google Trends (such as French, Mexican, or Vietnamese cuisines) and comprises 121 final food products and more than 21K food images.

To summarise, various food image databases have been presented in the literature considering different acquisition approaches and conditions. However, none of them has previously incorporated a nutrition taxonomy that assesses the quality, quantity, and intake frequency of foods based on images. The database proposed in this study offers a nutritional categorisation that facilitates the development of a new generation of food computing algorithms, fostering their use in various food-related areas.

3 AI4Food-NutritionDB database

The proposed AI4Food-NutritionDB is the first nutrition database that combines food images with a nutrition taxonomy. This taxonomy includes four levels of categorisation: 6 nutritional levels (see Fig. 1), 19 main categories (e.g., “Meat”, the family of vertebrate animals), 73 subcategories (e.g., specific products such as “White Meat”), and 893 final food products (e.g., “Chicken”). In addition, each subcategory is assigned a type of dish (e.g., “Appetizer” or “Main Dish”) considering factors related to healthiness and food quantity. Figure 3 provides a graphical description of the database. AI4Food-NutritionDB has been built by combining food images from seven different databases, encompassing food products from all over the world. We next provide all the information regarding the source databases (Section 3.1) and the construction process of the AI4Food-NutritionDB (Section 3.2).

3.1 Source food image databases

Seven state-of-the-art food image databases were selected to construct our database. These databases encompass various world regions and exhibit different characteristics.

3.1.1 UECFood-256 [37]

UECFood-256 contains 256 food products and more than 30K Japanese food images collected from different platforms such as Bing Image Search, Flickr, and Twitter (web scraping acquisition). In addition, the authors employed Amazon Mechanical Turk (AMT) for image selection and labelling.

3.1.2 Food-101 [8]

This database comprises over 100K food images and 101 unique food products from various world regions. All the images were sourced from the FoodSpotting application, a social platform where individuals uploaded and shared food images.

3.1.3 Food-11 [57]

Singla et al. analysed eating behaviour in the United States to construct a database covering some of the most commonly consumed food groups. They defined 11 general food categories based on the United States Department of Agriculture (USDA) guidelines: bread, dairy products, dessert, eggs, fried food, meat, noodles/pasta, rice, seafood, soups, and vegetables/fruits. They combined three different databases (Food-101, UECFood-100, and UECFood-256) and two social platforms (Flickr and Instagram) to accumulate more than 16K food images.

3.1.4 FruitVeg-81 [64]

Many state-of-the-art food image databases include few fruit or vegetable food products. As a distinctive feature, this database focuses precisely on these highly underrepresented groups: FruitVeg-81 contains 81 different fruit and vegetable food products acquired following the self-collected acquisition protocol.

3.1.5 MAFood-121 [3]

Considering the 11 most popular cuisines in the world (according to Google Trends), Aguilar et al. released the MAFood-121 database. It contains 121 unique food products and around 21K food images grouped into 10 main categories (bread, eggs, fried food, meat, noodles/pasta, rice, seafood, soup, dumplings, and vegetables). They followed the combination acquisition protocol, using three state-of-the-art public databases (Food-101, UECFood-256, and TurkishFoods-15) and a private one.

3.1.6 ISIA Food-500 [46]

ISIA Food-500 is a database released in 2020. All its food images (around 400K) are organised into 500 different food products and were acquired from the Google, Baidu, and Bing search engines, including both Western and Eastern cuisines. Following a similar approach to other databases, the authors categorised all food products into 11 major groups: meat, cereal, vegetables, seafood, fruits, dairy products, bakery products, fat products, pastries, drinks, and eggs.

3.1.7 VIPER-FoodNet [39]

Similar to the Food-11 database, VIPER-FoodNet is an 82-food-product database whose classes were selected based on the most commonly consumed items in the United States according to the What We Eat In America (WWEIA) database. All the images were obtained through web scraping, specifically from Google Images.

As a result, the proposed AI4Food-NutritionDB initially comprises 1,152 food products with 586,914 food images. This diverse database represents traditional dishes from several world areas: Food-101 contributes Western dishes, UECFood-256 traditional Japanese dishes, and VIPER-FoodNet typical dishes from the United States. In addition, the ISIA Food-500 database provides 500 food products from various countries, and the FruitVeg-81 database includes fruit and vegetable images from several world regions. Finally, it is important to highlight that the Food-11 and MAFood-121 databases were created from some of the previous databases; post-processing was therefore conducted to remove duplicated images.

3.2 Food product categorisation

Each of the 1,152 food products obtained in the previous stage is individually processed for classification into the following levels: i) nutritional level, ii) category, iii) subcategory, and iv) type of dish. Table 2 summarises information from the AI4Food-NutritionDB database, including the levels, the number of products, and the type of dish for each subcategory. For completeness, Fig. 4 provides a graphical representation of the categories, subcategories, and nutritional levels considered in AI4Food-NutritionDB: each subcategory features one or two food images labelled with its corresponding nutritional level, and subcategories are further grouped into main categories.
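
To make the four-level categorisation concrete, the sketch below models a single taxonomy entry as a Python dataclass. The field names follow the labels used in this section; the class itself and the pyramid level assigned in the example are illustrative constructs, not an official schema released with the database.

```python
from dataclasses import dataclass

@dataclass
class TaxonomyEntry:
    """One food product in the four-level AI4Food-NutritionDB taxonomy.

    Illustrative structure only; field names follow the labels used in
    the text, not a schema shipped with the database.
    """
    product: str            # final food product (893 classes)
    subcategory: str        # specific product group (73 classes)
    category: str           # main category (19 classes)
    nutritional_level: int  # pyramid level, 1 (low intake) .. 6 (high intake)
    dish_type: str          # e.g. "Main Dish", "Appetizer", "Dessert"

# Example built from the labels mentioned in the text:
chicken = TaxonomyEntry(
    product="Chicken",
    subcategory="White Meat",
    category="Meat",
    nutritional_level=3,   # assumed level, for illustration only
    dish_type="Main Dish",
)
```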

Table 2 Description of the categories, subcategories, number of products, nutritional levels, and types of dish considered in the AI4Food-NutritionDB database

3.2.1 Process of categorisation

Three different stages are considered to classify each food product into subcategories, categories, and types of dish. In the first stage, the food product’s taxonomy is extracted using the FoodOn ontology [22], which provides supercategories and subcategories for the corresponding food products. Although this ontology comprises over 9K food products, a high percentage of the analysed items are not covered by FoodOn, particularly those translated from their original language (e.g., the Danish dish æbleflæsk). The second stage involves querying the food term on the TasteAtlas web platform, which contains around 10K traditional dishes from all over the world, together with metadata such as ingredients, dish type, and food region.

Fig. 4

Graphical representation of the categories and subcategories within the AI4Food-NutritionDB food image database. Note that the placement of the categories is intended to align with the pyramid in Fig. 1. As can be observed, this positioning generates ambiguities and discrepancies (e.g., for mixed and cooked food) that were resolved as described in Section 3.2

Finally, in the third stage, each final food product is classified into subcategories and categories, and all examined food terms undergo a review and unification process. This step involves eliminating terms that do not comply with the established criteria and merging those with similar characteristics. The outcome of this process is a collection of 893 final food products.
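
The three-stage process can be summarised with the following hedged sketch, where the two lookup tables are toy stand-ins for the FoodOn and TasteAtlas queries (neither reflects a real API of those services, and their entries are illustrative only) and unresolved items are flagged for the manual review stage.

```python
# Hedged sketch of the three-stage categorisation described above.
FOODON = {"chicken": ("White Meat", "Meat")}        # stage-1 source
TASTEATLAS = {"æbleflæsk": ("Red Meat", "Meat")}    # stage-2 fallback

def categorise(product_name):
    """Return (subcategory, category) or flag the item for review."""
    key = product_name.lower()
    entry = FOODON.get(key)          # Stage 1: ontology lookup
    if entry is None:
        entry = TASTEATLAS.get(key)  # Stage 2: TasteAtlas metadata
    # Stage 3: review and unification; unresolved items go to experts.
    return entry if entry is not None else "manual review"

print(categorise("Chicken"))     # ('White Meat', 'Meat')
print(categorise("æbleflæsk"))   # found only via the stage-2 fallback
```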

3.2.2 Nutritional level

The nutritional level indicates the intake frequency of a specific food product and is determined by the popular nutritional pyramids proposed by national and international organisations, such as the United States Department of Agriculture (USDA) pyramid [40] and the Spanish Society of Community Nutrition (SENC) pyramid [58]. Figure 1 provides a graphical representation of the typical food pyramid considered for AI4Food-NutritionDB, based on 6 different nutritional levels. A lower nutritional level (at the pyramid’s top) implies limited consumption, whereas a higher level (at the pyramid’s bottom) denotes greater intake.

Most food products align with the different nutritional levels proposed in the pyramid, allowing a direct nutritional level assignment. However, some food products or subcategories, e.g., “Fried Vegetables” or “Rice and Fish”, are not directly contemplated by any nutritional level. For all these ambiguous cases, the nutrition experts of the AI4Food framework manually defined the appropriate nutritional level.

3.2.3 Dish type

Following a similar process to the nutritional level assignment, each subcategory is assigned a dish type to differentiate it from the others that can be found during a meal, since the quantity of each dish varies significantly. Seven different types of dish are defined, following the guidelines established in [50] (a toy mapping is sketched after the list):

  • Main Dish: this type of dish represents most of the defined subcategories and includes both first and second courses.

  • Appetizer: this dish is usually consumed before the main dish and in relatively smaller quantities. “Pâté”, “Cheese”, and “Other Types of Bread” are included here.

  • Snack: similar to an appetizer, but consumed at any time of the day. All the “Salty Snack” subcategories and “Sauce” belong to this type.

  • Dessert: usually consumed at the end of a meal, dessert often consists of sweet food products. In this study, the “Fruits” and “Toast” subcategories, as well as the “Sweet Products” category, are included in the Dessert dish type.

  • Side Dish: served alongside main dishes; examples include “Fries” and “Side Dish Salad”.

  • Bread: this basic food product is usually eaten alongside main dishes. In this case, only the “Bread” subcategory is included.

  • Drinks: this type is represented by the “Drinks” products.
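
As referenced above, the subcategory-to-dish-type assignment can be viewed as a simple lookup with “Main Dish” as the default; the toy mapping below covers only the examples named in the list, not the full 73-subcategory table from the database.

```python
# Toy mapping from subcategory to dish type, following the assignments
# described in the list above; entries are illustrative, not the full
# 73-subcategory table released with the database.
DISH_TYPE = {
    "Pâté": "Appetizer",
    "Cheese": "Appetizer",
    "Other Types of Bread": "Appetizer",
    "Sauce": "Snack",
    "Fruits": "Dessert",
    "Toast": "Dessert",
    "Fries": "Side Dish",
    "Side Dish Salad": "Side Dish",
    "Bread": "Bread",
    "Drinks": "Drinks",
}

def dish_type(subcategory):
    # Most subcategories are main dishes, so that is the default.
    return DISH_TYPE.get(subcategory, "Main Dish")
```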

As a result, the AI4Food-NutritionDB database comprises 558,676 food images grouped into 6 nutritional levels, 19 main categories, 73 subcategories, and 893 food products, as depicted in Table 2.

4 AI4Food-NutritionDB benchmark

This section describes the proposed standard experimental protocol and the benchmark evaluation of AI4Food-NutritionDB, based on the nutrition taxonomy (category, subcategory, and final product). First, the deep learning recognition systems are described in Section 4.1. Then, Section 4.2 presents the proposed experimental protocol. Finally, Sections 4.3 and 4.4 provide the recognition results achieved in the intra- and inter-database scenarios, respectively.

In addition, the complete experimental protocol, benchmark evaluation, and pre-trained models are available in our GitHub repository, which contains detailed documentation and instructions for reproducing our experiments and scenarios.

4.1 Proposed food recognition systems

The proposed food recognition systems utilise state-of-the-art CNN architectures, namely Xception [15] and EfficientNetV2 [62]. These architectures have been selected due to their outstanding performance in computer vision tasks such as food recognition, deepfake detection, and image classification in general [12, 48, 63]. First, the Xception architecture is inspired by Inception [60], replacing the Inception modules with depthwise separable convolutions. Second, EfficientNetV2 is an optimised model within the EfficientNet family of architectures, able to achieve better results with fewer parameters than other models on challenging databases such as ImageNet [38].

In this study, we follow the same training approach considered in [63], using a model pre-trained on ImageNet in which the last fully-connected layers are replaced to match the number of classes of each experiment. All the weights of the model up to the fully-connected layers are then frozen, and the new layers are re-trained for 10 epochs. Subsequently, the entire network is trained for 50 more epochs, choosing the best-performing model in terms of validation accuracy. The same configuration is used for all experiments: an Adam optimiser based on binary cross-entropy with a learning rate of \(10^{-3}\), and \(\beta_1\) and \(\beta_2\) of 0.9 and 0.999, respectively. In addition, training and testing are performed with an image size of 224\(\times \)224. The experimental protocol was executed on an NVIDIA GeForce RTX 4090 GPU using the Keras library.
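
The two-stage schedule described above can be sketched in Keras as follows. The hyperparameters (optimiser, learning rate, loss, image size, epochs) follow the text; the EfficientNetV2-S variant, the toy input batches, and the checkpoint path are our assumptions for illustration, since the article does not specify them.

```python
import tensorflow as tf

num_classes = 19  # category level; 73 for subcategories, 893 for products

# Backbone pre-trained on ImageNet; the S variant is an assumption.
base = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")

# Replace the last fully-connected layers with the number of classes.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

def compile_model(m):
    m.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=1e-3, beta_1=0.9, beta_2=0.999),
        loss="binary_crossentropy",  # loss function stated in the text
        metrics=["accuracy", tf.keras.metrics.TopKCategoricalAccuracy(k=5)])

# Toy batches standing in for the real 224x224 training pipelines.
x = tf.random.uniform((8, 224, 224, 3))
y = tf.one_hot(tf.random.uniform((8,), maxval=num_classes, dtype=tf.int32),
               num_classes)
train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(4)
val_ds = train_ds

# Stage 1: freeze the backbone and train only the new head for 10 epochs.
base.trainable = False
compile_model(model)
model.fit(train_ds, validation_data=val_ds, epochs=10)

# Stage 2: unfreeze the whole network, train 50 more epochs, and keep
# the checkpoint with the best validation accuracy.
base.trainable = True
compile_model(model)
ckpt = tf.keras.callbacks.ModelCheckpoint(
    "best_model.keras", monitor="val_accuracy", save_best_only=True)
model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[ckpt])
```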

4.2 Experimental protocol

For reproducibility, we adopt the same experimental protocol considered in the collected databases, dividing them into development and test subsets following each corresponding subdivision. The development subset is further divided into train and validation subsets. However, three of the collected databases (FruitVeg-81, UECFood-256, and Food-101) do not provide such a division. In these cases, we employ a procedure similar to the one presented in [63]: around 80% of the images comprise the development subset, with the train and validation subsets distributed as around 80% and 20% of the development subset, respectively, and the remaining images (around 20%) form the test subset. It is important to remark that no images are duplicated across the three subsets (train, validation, and test) in any of the seven databases. Similarly to [47], Top-1 (Top-1 Acc.) and Top-5 (Top-5 Acc.) classification accuracy are used as evaluation metrics.
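
For clarity, Top-k accuracy counts a sample as correct when its true class appears among the k highest-scoring classes; a small NumPy sketch with random scores standing in for real model outputs:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k top-scored
    classes. `scores` has shape (n_samples, n_classes); `labels` holds
    integer class indices."""
    top_k = np.argsort(scores, axis=1)[:, -k:]       # k best classes
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy usage with random scores over the 19 main categories:
rng = np.random.default_rng(0)
scores = rng.random((1000, 19))
labels = rng.integers(0, 19, size=1000)
print(top_k_accuracy(scores, labels, k=1))   # ~1/19 for random scores
print(top_k_accuracy(scores, labels, k=5))   # ~5/19 for random scores
```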

Table 3 Intra-database results in terms of Top-1 and Top-5 Acc. over the AI4Food-NutritionDB database

4.3 Intra-database results

Three different scenarios are considered for the intra-database evaluation of the AI4Food-NutritionDB database. Each scenario represents a different level of granularity, defined by the number of categories (19), subcategories (73), and final products (893). Table 3 summarises the performances obtained for the different intra-database scenarios and deep learning architectures considered in the AI4Food-NutritionDB. For completeness, we also include the results achieved on the individual databases included in AI4Food-NutritionDB, highlighting in bold the best results for each dataset and nutrition taxonomy level. This allows us to also assess the models’ performance across the different subsets. Regarding the whole AI4Food-NutritionDB database, the category scenario shows the best results, with 77.74% Top-1 Acc. and 97.78% Top-5 Acc. for Xception, and 82.04% Top-1 Acc. and 98.45% Top-5 Acc. for EfficientNetV2. However, the performance drops significantly as the granularity becomes finer for both architectures. For example, for the EfficientNetV2 architecture, the Top-1 Acc. decreases from 82.04% to 77.66% and 66.28% for the subcategory (73 classes) and product (893 classes) analysis, respectively. This decrease is mainly due to the similarity in appearance among different subcategories (e.g., “White Meat” and “Red Meat”), final products (e.g., “Pizza Carbonara” and “Pizza Pugliese”), or even the same food cooked in several manners (e.g., “Baked Salmon” and “Cured Salmon”). Regarding the individual datasets, FruitVeg-81 shows the best results in general for both deep learning architectures, classifying the different fruits and vegetables almost perfectly (over 98% Top-1 and Top-5 Acc. for all categorisation scenarios). In contrast, the VIPER-FoodNet dataset obtains the worst results in each categorisation scenario, as its images sometimes contain food products with mixed ingredients (e.g., different types of beans, meat, and pasta). Finally, in terms of deep learning architecture, EfficientNetV2 outperforms Xception in all scenarios (category, subcategory, and product) of the AI4Food-NutritionDB for both the Top-1 and Top-5 Acc. metrics. These results highlight the potential of the state-of-the-art EfficientNetV2 architecture for the nutrition taxonomy proposed in the present article.

Table 4 Inter-database results in terms of Top-1 and Top-5 Acc. over the VireoFood-251 database

4.4 Inter-database results

To assess the generalisation ability of our deep learning models pre-trained with the proposed AI4Food-NutritionDB, we include an inter-database scenario using the challenging VireoFood-251 food image database [11], an extended version of VireoFood-172 [10]. This database comprises over 169K food images distributed across 251 Chinese food plates, none of which are included in AI4Food-NutritionDB. In this experiment, we consider two different scenarios based on the training process. First, we consider the Xception and EfficientNetV2 architectures pre-trained only with ImageNet [38]. Second, we consider both architectures pre-trained with AI4Food-NutritionDB. In the latter case, three different models are considered, each trained at a different level of granularity following our proposed nutrition taxonomy (category, subcategory, and final product). To reproduce the experimental protocol proposed by the authors, we train only the last fully-connected layers of each pre-trained model for 30 epochs, freezing the rest of the model. Table 4 shows the final test results obtained in each scenario for the final product categorisation (i.e., 251 Chinese food plates). Again, the best performances are marked in bold for each deep learning model. The results indicate that using our models pre-trained with AI4Food-NutritionDB improves both Top-1 and Top-5 Acc. in comparison with the models pre-trained only with ImageNet. For instance, for the Xception architecture, the model pre-trained with AI4Food-NutritionDB achieves 82.10% Top-1 and 95.71% Top-5 Acc. for the final product categorisation, much better than the 58.91% Top-1 and 83.78% Top-5 Acc. obtained with the ImageNet model. For the EfficientNetV2 architecture, results are even better, with 88.80% Top-1 Acc. and 98.07% Top-5 Acc. Therefore, the deep learning models trained with the proposed AI4Food-NutritionDB can effectively serve as reliable pre-trained models, achieving accurate recognition results on unseen food databases.
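
A hedged Keras sketch of this transfer setup is shown below. The backbone stands in for one of the released pre-trained models (built here with random weights only to keep the sketch self-contained; in practice it would be loaded from disk), only the new 251-class head is trainable, and the toy batch and learning rate are illustrative assumptions rather than values stated for this experiment.

```python
import tensorflow as tf

# Stand-in for one of the released AI4Food-NutritionDB models; in
# practice it would be loaded with tf.keras.models.load_model(...).
backbone = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights=None, input_shape=(224, 224, 3),
    pooling="avg")
backbone.trainable = False  # freeze everything except the new head

# New head for the 251 VireoFood-251 food products.
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(251, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # lr assumed
    loss="binary_crossentropy",  # loss reused from Section 4.1
    metrics=["accuracy", tf.keras.metrics.TopKCategoricalAccuracy(k=5)])

# Toy batch standing in for the real VireoFood-251 training pipeline.
x = tf.random.uniform((8, 224, 224, 3))
y = tf.one_hot(tf.random.uniform((8,), maxval=251, dtype=tf.int32), 251)
vireo_train = tf.data.Dataset.from_tensor_slices((x, y)).batch(4)

# Only the head is trainable; train it for 30 epochs as described above.
model.fit(vireo_train, epochs=30)
```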

5 Conclusion and future study

This article presents AI4Food-NutritionDB, the first database with a nutrition taxonomy and over 560K food images. Furthermore, we propose a standardised experimental protocol and benchmark for the AI4Food-NutritionDB, utilising food recognition systems based on two state-of-the-art architectures. Our evaluation encompasses both intra- and inter-database scenarios across different food recognition levels. Finally, we show that our models pre-trained on AI4Food-NutritionDB can improve state-of-the-art food recognition systems in challenging scenarios. Our contribution facilitates the development of novel food computing approaches that foster a better understanding of what we eat.

This study opens up several future research lines, including improving the database by incorporating new taxonomy levels defined by nutrition experts (e.g., based on the nutritional composition of the ingredients or of the prepared food at hand). In addition, behavioural habits (e.g., physical activity, sleep quality) are key factors strongly related to the impact of nutrition on our health [55]. Future studies will benefit significantly from incorporating comprehensive multimodal models of user habits towards personalised interventions adapted to individual characteristics and needs [54]. For instance, new studies could focus on the impact of glucose from food intake on metabolic health or the impact of sleep quality on dietary patterns [27]. We also plan to integrate statistical [32] and human-readable food descriptors through recent Large Language Models (LLMs) [19] into our framework to improve both the classification rates and the interpretability of our models [6].