1 Introduction

Nutrition and health organizations offer specialized and curated resources describing food and food composition, often under open access licenses. The most widely used resource is the database of the United States Department of Agriculture (USDA), which collects and harmonizes food facts from academic and industrial sources [10]. In Europe, the primary reference is the European Food Information Resource Network (EuroFIR), which compiles data from different European countries’ databases [15]. There are also private initiatives such as i-Diet [22], an information system focused on Spanish cuisine that helps nutritionists create personalized diets. Along with nutritional information, i-Diet includes food item labels in Spanish and English.

These resources differ in scope and focus and usually struggle to capture the peculiarities of regional cuisines and the specificity of local products. At the same time, diet recommendation systems must be localized to the patients’ context and need to be effective. Based on this principle, the Stance4Health project aims at developing a personalized and localized nutrition service that will optimize the gut microbiota activity and long-term consumer commitment. The absence of wide-scope large databases, including regional and local products at the European and Spanish levels [16], makes it necessary in Stance4Health to combine the resources mentioned above, e.g., USDA and i-Diet. However, this task is not trivial, since these databases have significant differences in structure, semantics, and coverage. These differences, along with the vagueness associated with the language (e.g., a mapping of two equivalent items with different levels of specialization), call for flexible approaches to calculate the mappings.

In this paper, we propose a methodology based on a word embedding model to map food item databases from their respective short descriptions in English. Similarity between items is calculated by using a (fuzzy) distance metric. In particular, we use this methodology to map the i-Diet and USDA databases: given an i-Diet food item, we calculate the most similar USDA item by measuring the distance between their embedding representations, obtained by encoding the short text associated with each of them with the learnt model.

In contrast to similar works, we use a larger corpus to train the language model and consider the complete recipes instead of just the ingredient list. This approach allows us to find matches between items that are different but have a similar role in several preparations, e.g., hazelnut and almond butter. This contribution could be used to cross-link food items used in different regional cuisines and to propose ingredient substitutions (or even new fusion dishes). More importantly, we expect that the mapped databases will support personalized nutrition in Stance4Health, as well as other Food Computing applications such as recipe nutrients calculation before and after cooking.

The remainder of this paper is structured as follows. In the following section, we contextualize our work within the recent literature on food item mapping and Food Computing. In Sect. 3, we further describe the data sources used in the study: USDA, i-Diet, and the corpus of recipes. Afterward, we describe the methodological approach (Sect. 4) and the experiments carried out (Sect. 5). In Sect. 6, we analyze and interpret the results. The paper closes with the conclusions of the work and some promising directions for future work.

2 Related Work

Food Computing researchers have long acknowledged the need for a standard and open food and food components resource considering regional cuisines and cultural differences [20]. Given the effort required for such development and the absence of a central organization, the usual procedure is to extend the USDA database according to application needs [11]. In this regard, database and ontology merging and alignment techniques can be applied to find similarities and links between item registries automatically [21].

Food databases’ principal elements are meals and ingredients. Therefore, it is possible to leverage ingredient detection and cuisine prediction methods to match food items based on their constituents. For instance, in [28] and [25], we can find algorithms to classify recipes by country from their ingredients. Similarly, in [19], the authors identified cuisine by using topics extracted from the recipe’s text. Predictive models have also been used to translate typical dishes from one region to another by applying an encoder-decoder Deep Learning architecture [12].

From a broader perspective, other research works studied the relation among ingredients and cooking methods from food data descriptions, as in [1]. These relationships can be reused to match food items in different databases. Our work follows the same strategy, but we learn a language model based on embeddings instead of a network of ingredients that appear together in recipes. Our approach has some advantages over the latter, such as avoiding the need for precise ingredient identification in the texts. This identification problem has been extensively addressed in the literature, mostly by applying customized parsers and statistical natural language processing, with limited results, e.g., [6, 7, 29].

Regarding the use of Deep Learning for recipe text processing, Food2Vec used a word embedding model trained only with the list of ingredients included in recipes [2]. In contrast, we also use the text describing the cooking instructions. Therefore, we obtain close encodings for ingredients that appear together in recipes (as in Food2Vec), but also for those that are involved in similar preparations (which is useful for cross-cultural item matching). The Recipe2Vec [5] tool does encode the whole text, although it focuses on recipe comparison and retrieval and is not publicly available. Food images were used in [26] to enhance the embedding model. Since we do not have image information in the recipes of our corpus, analyzing the possible improvement after incorporating images remains as future work.

Furthermore, we must take into account that recipe texts mention food brands, which often replace the ingredients themselves. Moreover, brand information also appears in the USDA database. Consequently, our language model must be able to deal with such terms. We follow the guidelines of [8], which identified semantically-related terms, including brands, with an embedding model.

Finally, we can use several metrics to measure the distance between two words encoded according to the model [9] and, more interestingly, between two short texts [13]. In this context, similarity techniques combining token-based similarity and Fuzzy Logic [30] can be applied to obtain the mappings. We leverage and validate these approaches to formulate a fuzzy distance metric to tackle both vagueness of the language and syntactic/semantic content within the tokens.

3 Data

We used the English recipe corpus published by archive.org to build the word embedding model. This corpus collates recipes extracted from several websites, e.g., BBC Food Recipe, Epicurious, Cookstr, and AllRecipes. The final corpus includes 267,071 texts. The number of records corresponding to each recipe source can be seen in Table 1.

Table 1. Recipe corpus: sources and number of records

As mentioned in the introduction, the databases used in this work are i-Diet and the USDA Food Composition Databases. i-Diet is a proprietary database that provides nutritional content of food items usually found in Spanish diets. The USDA database, in turn, contains more extensive and more detailed data, since its scope goes beyond the use in diet recommendations. Examples of their structure and fields are respectively shown in Tables 2 and 3. Due to the nature of the databases, item descriptions have a substantial variability.

Each register in the i-Diet Food Composition Database corresponds to a food item, which can be a complete meal or an ingredient. A food item register consists of an identification number, a description of the item in Spanish, the corresponding translation of the description into English, and the food group to which the item belongs in Spanish. Translations in i-Diet have been performed manually by nutritionists. Additionally, each register includes numerical fields corresponding to the nutritional values of the item. The mapping procedure only uses the English description field; others are discarded.

Table 2. Example of food items in the i-Diet Food Composition Database

The structure of the USDA Food Composition Database is similar. Each food item register in USDA encompasses an identification number, a short description of the item, a food group category, and the category description. The rest of the fields are related to the item nutritional facts (mostly major and minor nutrient values). The mapping only uses the description field; the others are discarded.

Table 3. Example of food items in the USDA Food Composition Database

4 Methods

Our methodology is organized into five main steps: (1) data preprocessing, (2) word embedding model training and parameter tuning, (3) definition of the distance metrics, (4) calculation of mappings by computing the Word Mover’s Distance between pairs of short texts from the encodings obtained with the trained model, and (5) validation of the mappings. These steps are further described in the following sections.

4.1 Data Preprocessing

Although the recipe corpus was already collated and published on the web in a readable format, an extra preprocessing stage was required to prepare the data to train the model:

  1. We extracted the data from the text files, i.e., the ingredient list and the cooking instructions. (Note that we did not consider ingredients and instructions separately.) These two pieces of data were filtered and saved in text files, one per recipe.

  2. We performed a typical text cleaning process: conversion to lowercase; removal of punctuation marks, digits and special characters; removal of stop words; and lemmatization.

  3. The clean data was used to train a bigram model to detect compound words. For this step, we used the Software Framework for Topic Modelling with Large Corpora [24]. English stop words were also imported from this module.

The steps above were applied to the cooking instructions presented in the recipes. For example, the recipe text “Combine nutritional yeast, salt, cumin, garlic powder, onion powder, paprika, chili powder, and cayenne pepper in a small bowl.” is turned into “combin nutrit yeast salt cumin garlic_powder onion powder paprika_chili powder cayenn pepper small_bow” after the preprocessing phase.
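
The cleaning and compound-word steps above can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, lemmatization is omitted, and compound words are detected with a plain co-occurrence count rather than the scoring used by the topic modelling framework [24]. The function names and the `min_count` parameter are ours, not the paper's.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; the paper imports a full English list.
STOP_WORDS = {"a", "an", "and", "in", "the", "of", "to", "with"}

def clean(text):
    """Lowercase, drop punctuation/digits/special characters and stop words."""
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]

def learn_bigrams(docs, min_count=2):
    """Return adjacent token pairs seen at least min_count times (simplified)."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}

def merge_bigrams(doc, bigrams):
    """Join known bigrams with an underscore, scanning left to right."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            out.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

# Two made-up recipe sentences, used only to exercise the functions.
recipes = [
    "Combine garlic powder, salt, and 2 tsp cumin in a small bowl.",
    "Add the garlic powder to the small bowl.",
]
docs = [clean(r) for r in recipes]
bigrams = learn_bigrams(docs)
merged = [merge_bigrams(d, bigrams) for d in docs]
print(merged[0])
```

With these two sentences, "garlic powder" and "small bowl" each occur twice and are therefore merged into single tokens, which mirrors the compound tokens visible in the example above.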

4.2 Model Training and Parameter Tuning

We built the language model from a corpus of text recipes by using Word2Vec [17, 18], an unsupervised Deep Learning algorithm for the creation of word embeddings. An embedding is a numeric vector, each component coding a feature, which represents a language unit while preserving its semantics [3]. That is, two related language units (e.g., words) will have encodings located close to each other in the embedding space. This allows us to operate with the embeddings in a meaningful way; e.g., \(\langle \) king \(\rangle \) - \(\langle \) man \(\rangle \) + \(\langle \) woman \(\rangle \) = \(\langle \) queen \(\rangle \). There are other algorithms for learning word embeddings that can be used for the same purpose, such as GloVe [23] and fastText [4].

Since a generic word embedding model does not encompass such a specific domain as food from a nutritional context, a Word2Vec model was trained on the preprocessed corpus by using the Continuous Bag Of Words (CBOW) implementation, also provided by the Software Framework for Topic Modelling with Large Corpora [24]. We trained the model using the cooking instructions as a whole entry to the training model, instead of processing every sentence from each recipe separately. The nature of the text of the corpus, with short sentences and frequent anaphora, suggests that this is the most suitable approach. Experimental work and comparison to other works confirmed this assumption [27].

4.3 Distance Metrics

Let \(S_{i}\) be the textual representation of an item, and let \(T_{i}=\lbrace t_{1},...,t_{n}\rbrace \) be the token set obtained by preprocessing that item. For example, for the item k whose textual representation is \(S_{k}\)=“Canned fish, average”, the corresponding \(T_{k}\) would be \(\lbrace {``can", ``fish", ``averag"} \rbrace \).

We formulate the mapping problem between two items as finding the minimal distance of an item token set against every item token set from the other database. For that purpose, the different distance metrics listed below were compared.

Crisp Distance Metrics

  • Jaccard Distance: JACCARD is a token-based distance metric which quantifies the distance based on the lexical difference between the token sets [30]:

    $$\begin{aligned} JACCARD(S_{1},S_{2})=1-\frac{\left| T_{1} \cap T_{2}\right| }{\left| T_{1} \right| +\left| T_{2} \right| -\left| T_{1}\cap T_{2} \right| } \end{aligned}$$
    (1)
  • Word Mover’s Distance: WMD treats a text document as a cloud of words, each word represented as a point in the vector embedding space [14]. The distance between two clouds is quantified by the minimum cumulative distance that words from one text document need to travel to match exactly the point cloud of the other text document. To calculate the distance between two single words, the Euclidean distance between the corresponding vector representations is used. Therefore, WMD takes advantage of the semantic information provided by the word embedding model.

  • Hybrid Distance: Preliminary studies within this work showed that using a single distance measure, either lexical or semantic, strongly reduces the precision of the model. Therefore, we propose a hybrid distance measure formulated as a weighted combination of Jaccard and Word Mover’s Distances.

    $$\begin{aligned} HDISTANCE(S_{1},S_{2})=w\,JACCARD(S_{1},S_{2}) + (1-w)\,WMD(S_{1},S_{2}) \end{aligned}$$
    (2)

    where \(w\in \mathbb {R}\) and \(0 \le w \le 1\).
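
As a rough illustration of Eqs. (1) and (2), the sketch below computes the Jaccard distance and a weighted combination with a WMD-like term. Several parts are assumptions rather than the authors' implementation: exact WMD requires solving an optimal transport problem, so a relaxed nearest-neighbour approximation stands in for it; the 2-D vectors are made-up toy values; and neither term is normalized, since the text does not say whether the two distances are rescaled before being combined.

```python
import math

def jaccard_distance(t1, t2):
    """Eq. (1): one minus the share of common tokens among all distinct tokens."""
    a, b = set(t1), set(t2)
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def relaxed_wmd(t1, t2, vectors):
    """Stand-in for WMD (NOT the exact metric): every token travels to its
    nearest token in the other text; the larger directed average is kept."""
    def directed(src, dst):
        return sum(min(euclidean(vectors[s], vectors[d]) for d in dst)
                   for s in src) / len(src)
    return max(directed(t1, t2), directed(t2, t1))

def hybrid_distance(t1, t2, vectors, w=0.2):
    """Eq. (2): weighted combination of the lexical and the semantic distance."""
    return w * jaccard_distance(t1, t2) + (1 - w) * relaxed_wmd(t1, t2, vectors)

# Hypothetical toy embeddings; a trained model would supply 300-d vectors.
vectors = {"can": (0.0, 0.0), "fish": (1.0, 0.0),
           "tuna": (1.0, 1.0), "averag": (0.0, 3.0)}
a = ["can", "fish", "averag"]
b = ["can", "tuna"]
print(jaccard_distance(a, b), hybrid_distance(a, b, vectors))
```

The default `w=0.2` matches the value reported in the experiments; any other value in [0, 1] shifts the balance between lexical overlap and embedding-based distance.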

Fuzzy Distance Metrics

  • Fuzzy Jaccard Distance [30]: This metric consists of a combination of token-based similarity and character-based similarity to determine the fuzzy overlap set. The Jaccard Distance described above is used to measure the distance between tokens, and a threshold determines which ones belong to the fuzzy overlap set. This latter parameter has been empirically tuned to 0.2.

    $$\begin{aligned} FJACCARD _{\delta }(S_{1},S_{2})=\frac{\left| T_{1} \overset{\sim }{\cap }_{\delta } T_{2}\right| }{\left| T_{1} \right| +\left| T_{2} \right| -\left| T_{1}\overset{\sim }{\cap }_{\delta } T_{2} \right| } \end{aligned}$$
    (3)

    \(\delta =0.2\)

  • Fuzzy Document Distance: We propose a fuzzy approach to the distance between short documents, considering each document as a token set. The distance between two sets is calculated from the Euclidean distance between the token vectors of both sets. These vectors correspond to the numerical representation obtained from the previously trained word embedding model. The fuzzy function is described as follows:

    $$\begin{aligned} FDIST(S_{1},S_{2})=\frac{\sum _{x\in T_{1}\cup T_{2}}\min (\mu _{T_{1}})(x) \times \min (\mu _{T_{2}})(x)}{\sum _{x\in T_{1}}\mu _{T_{1}}(x)+\sum _{x\in T_{2}}\mu _{T_{2}}(x)-\sum _{x\in T_{1}\cup T_{2}}\min (\mu _{T_{1}})(x) \times \min (\mu _{T_{2}})(x)} \end{aligned}$$

    $$\begin{aligned} \mu _{T_{i}}(x)= \left\{ \begin{array}{lcc} sigmoid(\frac{1}{distance(t_{i},x)}) &{} &{} 0< distance(t_{i},x) < \infty \\ \\ 1 &{} &{} distance(t_{i},x) = 0 \\ \\ 0 &{} &{} distance(t_{i},x) = \infty \end{array} \right. \end{aligned}$$
    (4)

    where \(distance(t_{i},x)\) is the Euclidean distance between \(t_{i}\) and x. Note that the membership of a token x in a set \(S_{i}\) is defined using the minimum distance of x to every token in \(S_{i}\).
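
The membership function of Eq. (4) can be sketched as below, assuming that the distance used is the minimum Euclidean distance from x to any token of the set, as the note above states. The function names and the 2-D vectors are hypothetical. The sketch covers only the membership function, not the full fuzzy document distance.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def membership(x, tokens, vectors):
    """Eq. (4): membership of token x in a token set, driven by the
    distance from x to its nearest token in that set."""
    d = min((euclidean(vectors[x], vectors[t]) for t in tokens),
            default=math.inf)  # an empty set is treated as infinitely far
    if d == 0:
        return 1.0
    if math.isinf(d):
        return 0.0
    return 1.0 / (1.0 + math.exp(-1.0 / d))  # sigmoid(1 / d)

# Hypothetical toy embeddings, used only for illustration.
vectors = {"can": (0.0, 0.0), "fish": (1.0, 0.0), "tuna": (1.0, 1.0)}
print(membership("fish", ["can", "fish"], vectors))  # token is in the set
print(membership("tuna", ["fish"], vectors))         # nearest token at distance 1
```

Since sigmoid(1/d) tends to 0.5 as d grows, any token at a finite distance keeps a membership above 0.5 under this definition; only an infinite distance yields 0.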

4.4 Mapping Food Items

Once the embedding model is available, it can be used to compare the similarity of two words. To this aim, as already introduced, we tested different metrics to get the most accurate results. Our mapping procedure calculates item mappings for each i-Diet register. That is, for each i-Diet item, we obtain the distance between its English description and the description of every USDA item. The algorithm finally returns the USDA item that minimizes the distance, i.e., the most likely match. We tackle the mapping as a multilabel classification problem, in which there are as many labels as USDA items, plus the “No matches” label (which represents the case where an item has no possible match in the other database).
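
A minimal sketch of this nearest-item search is given below. It accepts any distance function over token lists; the Jaccard distance is used here only to keep the example self-contained. The item identifiers, token lists, and the `rank_candidates` helper are hypothetical, and the “No matches” label is not handled.

```python
def jaccard_distance(t1, t2):
    """Token-based distance from Eq. (1), used here as a simple stand-in."""
    a, b = set(t1), set(t2)
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def rank_candidates(source_tokens, targets, distance, k=10):
    """Return the ids of the k target items closest to the source, nearest first.
    Ties are broken by id because (distance, id) tuples are sorted."""
    scored = sorted((distance(source_tokens, tokens), item_id)
                    for item_id, tokens in targets.items())
    return [item_id for _, item_id in scored[:k]]

# Hypothetical preprocessed target descriptions.
usda = {
    "U1": ["fish", "tuna", "can"],
    "U2": ["butter", "almond"],
    "U3": ["fish", "salmon", "raw"],
}
best = rank_candidates(["can", "fish", "averag"], usda, jaccard_distance, k=2)
print(best)
```

Returning a ranked list rather than a single item makes it possible to inspect the first candidates, which is what the less restrictive validation described in the next section relies on.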

4.5 Validation

A nutrition expert validated the quality of the mappings by verifying their exactness. Note that, in some cases, there may be more than one best candidate mapping (i.e., with the same quality). This situation typically happens when items in one database are more general (hypernym) than the corresponding items in the other one (hyponyms). In these cases, the validation labels the mapping as correct as long as one of the possible best mappings is retrieved.

Different flexibility levels have been considered to assess the robustness of the model. We obtained the number of i-Diet items for which the best possible matching is achieved. We also calculated a less restrictive accuracy value, which allows us to determine the number of items whose best matching is found among the first ten candidates from the whole USDA database.
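
The accuracy at different flexibility levels can be computed along the following lines. This is a sketch with made-up rankings and labels: `accepted` holds, for each source item, the set of target items that an expert would consider a best match (more than one when several candidates are equally good), and the function name is ours.

```python
def top_k_accuracy(rankings, accepted, ks=(1, 2, 3, 5, 10)):
    """Percentage of source items with an accepted match among the first k candidates."""
    result = {}
    for k in ks:
        hits = sum(1 for item, ranked in rankings.items()
                   if accepted[item] & set(ranked[:k]))
        result[k] = 100.0 * hits / len(rankings)
    return result

# Hypothetical ranked candidates per source item, nearest first.
rankings = {
    "i1": ["U1", "U3", "U2"],
    "i2": ["U2", "U1", "U3"],
    "i3": ["U3", "U2", "U1"],
    "i4": ["U2", "U3", "U1"],
}
# Hypothetical expert labels; "i2" has two equally good matches, "i4" has none listed.
accepted = {"i1": {"U1"}, "i2": {"U1", "U3"}, "i3": {"U1"}, "i4": {"U9"}}

scores = top_k_accuracy(rankings, accepted, ks=(1, 2, 3))
print(scores)
```

An item counts as correct at level k as soon as any of its accepted matches appears in the first k positions, which follows the rule that retrieving one of the possible best mappings is enough.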

5 Experiments

The embedding model was trained for 30 epochs with vector dimensionality set to 300 and a window of size 5. Words that appear fewer than three times in the whole corpus are ignored. The final model yielded a vocabulary of 11,288 words. Mappings were calculated for every i-Diet food item (735 items). One human expert manually assigned the validation label of each mapping.

The results of the validation of the mappings with the different metrics are shown in Table 4. The first column, “Top 1”, shows, for each metric, the percentage of items whose best possible matching is achieved by the model. The remaining columns show, respectively, the percentage of items for which the best matching is found among the 2, 3, 5, or 10 best candidates. The weight parameter of (2) was empirically tuned to achieve the optimal performance (\(w=0.2\)).

Table 4. Accuracy of the model (\(\%\)) obtained with the different metrics

A sample of the final results is provided in Tables 5 and 6. In both cases, matchings are carried out using the distance metric with the best performance (see Table 4). Both tables have the same structure. In the first column, we show the original i-Diet item name. Columns 2 to 4 show the results of the mapping: from left to right, the English description of the source i-Diet item, the description of the mapped USDA item, and the distance between them. The last column corresponds to the most accurate mapping identified manually.

6 Discussion

Table 5 shows a selection of successful mappings between i-Diet and USDA, i.e., mappings labeled as correct. Rows (1) to (3) show that when equivalent items had a similar text description in both databases, the model was able to match them properly. Note that a lower distance value of a mapping with respect to another one does not necessarily entail that it is better. The relative values of the distance metric are useful to select the best match for a given item, but not to compare different mappings.

Table 5. Selected examples of correct mappings

Rows (4) to (6) illustrate more difficult mappings that were correctly solved by the procedure. In these cases, the model was capable of matching item descriptions even though one of them was slightly less specific than the other. In particular, row (4) includes a commercial brand. Rows (5) and (6) correspond to cases in which the model can map a broad (i-Diet) description with a more precise one (in USDA). Last but not least, the rows (7) to (9) show correct mappings that were not as obvious as the previous ones.

We also found some limitations to our approach, as depicted in Table 6, largely due to the coverage of the corpus and errors in translations in i-Diet from the original Spanish item description into the English one. First, in rows (1) and (2), we can see that items with no real translation and that are never used in the English recipe corpus were not mapped. We expected this behavior since there is no proper embedding for the terms used in the description. Accordingly, a more diverse corpus should be used, including recipes for local cuisines.

Table 6. Selected examples of not found, acceptable, approximate, and wrong mappings

Besides, rows (3) and (4) illustrate mappings where the Spanish text is poorly translated, and therefore the mapped item has a slightly different meaning. In these cases, mappings are marked as acceptable because, despite their similar semantics, there is a better match in USDA. These problems could be addressed by manually editing the translations or by using a (more accurate) machine translation system.

Rows (5) to (7) show approximate mappings in which the linked USDA item is semantically related, but the association is not correct or can be improved. It is interesting to highlight that row (5) includes a food brand that is correctly identified. Also, row (5) shows a case of mapping a local item to a replacement with similar usage.

Finally, rows (8) and (9) depict incorrect mappings due to the limitations of the corpus and the (infrequent) case that the i-Diet item is more specific than the possible USDA candidates. The last column of (9) shows that dealing with hypernyms and hyponyms is difficult, and can lead to several possible candidate mappings in USDA for one item in i-Diet.

As shown in Table 4, the fuzzy metrics improved the outcomes obtained with crisp approaches. From the obtained results we can draw the conclusion that vagueness of the language can make Fuzzy Logic a suitable option to tackle the matching task. Given the dimensionality and complexity of the problem, the results are reasonably accurate.

7 Conclusions and Future Work

This research work was motivated by the need for mapping two food composition databases with different scopes. This problem poses additional obstacles when the food items correspond to different regions and local cuisines. We created a word embedding model to address these issues and showed that this technique has the potential to facilitate working with non-overlapping data resources in the Food Computing domain. Our model worked well with regional brands and was able to some extent to identify substitute items used in similar preparations. Fuzzy distance metrics showed better performance than crisp alternatives.

For the future, we plan to improve the mappings by training the embedding model with a larger-scale recipe corpus and by improving the translations of Spanish item descriptions into English in i-Diet. A relevant aspect of our approach that can be further explored is the capability for finding ingredient replacements in recipes, which also entails using more imprecise knowledge. These replacements can either refer to the same item expressed differently, or to similar ingredients more often used in a particular region or cuisine. This kind of situation cannot be addressed by more traditional techniques –e.g., regex and concordances– without resorting to a specialized and comprehensive knowledge base. The absence of such resources is indeed the original motivation for our work. This same idea can be applied to recipe retrieval and automatic generation of recipes.

This work only considered English text recipes from the web. Consequently, some bias is introduced, since popular dishes from other countries may not be sufficiently represented in the collected corpus. Nevertheless, since international dishes have been introduced in cuisines from all over the world, we consider that this corpus is suitable to generate useful word embeddings. We acknowledge that including typical recipes from other cuisines would help to improve the model performance. Likewise, more sophisticated measures can be added and combined with the implemented ones. Additionally, Machine Translation techniques can be applied to the Spanish text descriptions in order to reduce the errors generated by the manual translations. Also, we plan to research a multi-modal extension of this work, combining short text embeddings with the numerical fields from Food Composition Databases and other media resources, e.g., images.