1 Introduction

Food has always been at the heart of human life. In the past, people had to identify and store food to survive, whereas nowadays people are more concerned with dietary needs including essential nutrition, health, taste, calories, and social occasions [8, 16]. Owing to the growing overload of food-related multimedia content, food recommender systems (RSs) are becoming increasingly attractive to people worldwide. Long-term unhealthy eating habits are harmful to people’s health and carry risks such as the development of chronic diseases. Given the importance of healthy eating habits, RSs are now used as an efficient tool for making informed food choices according to individual health conditions, thereby helping people develop healthy eating habits and reduce unnoticed health risks [32, 54,55,56].

Generally speaking, RSs save time and money by using a series of algorithms to analyze users’ food behaviors and ratings so as to recommend the most relevant and appealing foods to users [31]. Nevertheless, several challenges (e.g., diversity, adaptation and fluctuation) still hinder the further development and application of RSs. The diversity challenge lies in the fact that food RSs must handle the diverse food preferences of all individuals, e.g., different taste preferences, perceptual abilities, cognitive restrictions, cultural backgrounds, and even genetic influences [40, 41, 58]. The adaptation challenge means that, because food trends change quickly, food RSs are expected to adapt constantly to the latest food fashions in order to provide up-to-date food suggestions [9]. The fluctuation challenge implies that it is unreasonable to supply users with a one-size-fits-all food recommendation, given the dramatic fluctuation of users’ food preferences [44]. In the face of these three challenges, there is a practical need to develop a novel recommendation technique that helps people select reasonable and personalized food plans based on individually diverse, rapidly changing and dramatically fluctuating food preferences.

Collaborative filtering (CF) and content-based filtering (CBF) are two widely used recommendation techniques for food preference learning. Typically, CF works by taking into account the food preferences of users with similar tastes, while CBF focuses on the attributes (e.g., ingredients, nutrition, and reviews) of the food itself [11]. Although CF and CBF achieve reasonable performance for preference learning, they adapt poorly to changes in user preferences or food content. To overcome this problem, it is crucial to incorporate the dynamic pattern of user-item interactions into the recommendation process so as to better predict users’ future preferences and optimize recommendation results accordingly. In addition, considering such a dynamic pattern can also provide valuable insights into the interaction between users and recommendation applications/systems, and therefore help improve the user experience [36].

Recently, artificial neural networks have been successfully applied to RSs owing to their strong feature extraction abilities [7, 22, 35, 39, 42, 47, 49]. For example, a convolutional sequence embedding (Caser) RS has been proposed in [42] for product recommendation, where a convolutional neural network (CNN) is employed to capture the sequential features by analyzing the embedding matrix. It is worth mentioning that the embedding matrix can be treated as the “image” of the items in the latent space. Experimental results demonstrate the effectiveness of the proposed Caser RS in extracting sequential patterns by taking both sequential patterns and general preferences into account. In [49], a convolutional attention network has been put forward to explore the user behaviors by unifying a general RS and a sequential RS. A user-based recurrent neural network (RNN) has been developed in [7] for sequence prediction by integrating user information so as to provide personalized recommendation. In [22], a multi-period product RS has been introduced for online food recommendation, where an RNN-based recommendation model is developed to provide product recommendation in multiple time periods.

As a popular RNN, the long short-term memory (LSTM) network has been widely adopted in RSs in the hope of comprehensively investigating the dynamic features of user-item interactions [2, 18, 22]. It should be noted that the LSTM network has shown competitive performance in capturing both long-term and short-term patterns, which contributes to a comprehensive investigation of user behavior in RSs. Motivated by the above discussions, it is natural to employ LSTM networks to study user-item interaction sequences in order to carry out personalized food recommendation. In this paper, a sequence-based recommendation approach is proposed to lay an effective and systematic basis for establishing food RSs. A traditional LSTM network is adopted to reflect users’ food preferences and generate accurate recommendations. Furthermore, the proposed LSTM-based RS is tested on a public data set, and the experiments verify the promising performance of our approach for food recommendation.

The main contributions of this paper are threefold: (1) a unified framework is proposed for food recommendation that leverages feedback from sequences as well as historical interactions to model users’ long- and short-term preferences; (2) a traditional LSTM network is employed to extract the user representation by considering the user habits over a certain time period; and (3) a series of numerical experiments is conducted on a real-world data set to validate the effectiveness of the developed RS. In summary, the established RS is capable of effectively modeling users’ long- and short-term preferences and providing more accurate and diverse food recommendation in comparison with existing food recommendation techniques.

The remainder of this paper is organized as follows. In Sect. 2, the related work is presented on existing solutions for food recommendation. A sequence-based model is developed in Sect. 3 for food recommendation. The experimental results are discussed in Sect. 4 and appropriate evaluation metrics are carefully selected for evaluating the algorithm performance. Section 5 concludes the paper while pointing out the future directions.

2 Related work

In the food domain, RSs play an important role in promoting healthy eating behaviors, for example by suggesting healthier food substitutes to users. Food RSs can be divided into three types based on the information used for food recommendation [43]. The first type adopts users’ food preferences for recommendation, e.g., the search terms or ingredient inputs of users have been utilized in [5] to conduct recipe recommendation. The second type leverages the health and nutritional needs of users for recommendation, e.g., a food plan has been generated in [33] based on healthy ingredients instead of harmful ones. The third type finds a trade-off between user preferences and nutritional needs, e.g., a healthy and nutritional meal plan has been made for the elderly in [38] by taking advantage of information from both user preferences and food nutrition.

Food RSs differ from general RSs in three respects. First, food RSs have to consider more factors when making recommendations, e.g., users’ nutritional needs, weight goals, and health problems, which are primary factors for food recommendation. Second, different domain knowledge and food databases (e.g., nutritional, medical, and dietary information) are required by the RSs to supply users with healthier food suggestions. Third, the unique characteristics of various foods (e.g., cooking methods, preparation time, and ingredient combination effects) have to be taken into account when making recommendations. In summary, more factors and information should be considered by the RSs in order to provide users with effective and healthy food plans [12].

To improve the accuracy of food RSs, it is crucial to consider both users’ dynamic preferences and historical neighbor feedback. In this paper, sequence-based recommender systems (SRSs) are introduced to capture the dynamic preferences of users. Modeling the sequential pattern of users’ behaviors allows the RSs to understand the evolution of user tastes over time, thereby providing better recommendation [51]. It is worth mentioning that SRSs differ from traditional RSs in that SRSs account for the order of items in users’ historical behaviors, and thus both the timing and the frequency of interactions are taken into account for recommendation [10]. So far, SRSs have been successfully applied in many applications such as e-commerce, music, and news recommendation [3, 14, 19,20,21, 34].

Incorporating sequence-based RSs into food recommendation has several advantages. First, food recommendation is often time-sensitive, and suggestions are expected to be interactive. For example, if a customer has ordered a steak, it is reasonable for RSs to recommend a salad as a starter and a glass of red wine as an accompaniment. Second, SRSs make it possible to model the rather complicated couplings/interactions among different foods that are consumed together. For example, SRSs can capture the fact that consuming bread increases the likelihood of subsequently consuming milk. Third, SRSs are effective at handling implicit feedback, which is more reliable than explicit feedback (ratings) that is not always available [46].

Existing solutions for SRSs mainly fall into two categories: Markov chain models and deep learning techniques. A Markov chain model treats the users’ behavior as a sequence of states, and items are recommended based on the state transition probabilities [1, 17, 37]. As for the deep learning techniques, typical examples are CNNs, RNNs, and graph neural networks, which have been widely used in a variety of sequence-based RSs [6, 7, 18, 42, 49]. Although SRSs have been implemented in a variety of domains, they have rarely been considered in the food domain, because food recommendation is a highly contextualized and personalized task, which makes the satisfactory design of SRSs significantly more difficult. As such, we are motivated to investigate a specialized food recommendation approach that effectively integrates SRSs with other types of RSs to provide a comprehensive and personalized solution for food recommendation.

3 Sequence-based deep learning model for food recommendation

In this section, the recommendation task is formulated and the proposed model is described in detail. We start by introducing some key concepts required for the model.

3.1 Problem formulation

Consider a set of m users \({\mathcal {U}}=\left\{ u_{1}, u_{2}, \ldots , u_{m}\right\}\) and a set of n items \({\mathcal {I}}=\left\{ i_{1}, i_{2}, \ldots , i_{n}\right\}\), where m and n are the sizes of the user set and item set, respectively. Each user \(u_{i}\) is associated with a list of items \({\mathcal {S}}^{u_i}\) ordered according to the user’s action sequence. For each user \(u_{i}\), the prediction task can be written as:

$$\begin{aligned} {\mathcal {S}}_{t-L}^{u_{i}}, \ldots , {\mathcal {S}}_{t-2}^{u_{i}}, {\mathcal {S}}_{t-1}^{u_{i}} \rightarrow {\mathcal {S}}_t^{u_{i}} \end{aligned}$$
(1)

where the subscript t of \({\mathcal {S}}^{u_i}\) denotes the temporal order in which actions happen. Given the sequence \({\mathcal {S}}_{t-L}^{u_i}, \ldots , {\mathcal {S}}_{t-2}^{u_i}, {\mathcal {S}}_{t-1}^{u_i}\), the model tries to predict \({\mathcal {S}}_t^{u_i}\), i.e., the next item with which the user will interact. Two sequential patterns are considered in this paper, i.e., the point-level sequential pattern (PSP) and the union-level sequential pattern (USP).
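As an illustration of the prediction task in Eq. (1), the following minimal sketch turns one user's ordered item list into (previous L items, next item) pairs with a sliding window. The function name `make_training_pairs` and the recipe identifiers are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def make_training_pairs(sequence: List[int], L: int) -> List[Tuple[List[int], int]]:
    """Slide a window of length L over one user's ordered item sequence.

    Each pair is (the L previous items, the next item), mirroring Eq. (1).
    """
    pairs = []
    for t in range(L, len(sequence)):
        pairs.append((sequence[t - L:t], sequence[t]))
    return pairs


# Hypothetical recipe ids for a single user, in chronological order.
user_sequence = [12, 7, 33, 7, 51]
pairs = make_training_pairs(user_sequence, L=3)
for history, target in pairs:
    print(history, "->", target)
```

A sequence shorter than L + 1 yields no pairs, so in practice short histories would need padding or a smaller window; which of the two the authors use is not stated.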

3.1.1 Point-level sequential pattern

The point-level sequential pattern (PSP) is a type of sequence-to-point learning model, where predictions are made based on all the actions that have occurred up to a certain time point [57]. As shown in Fig. 1a, the output \({\mathcal {S}}_{t}^{u_i}\) is a predicted item for the next action. All of the previous points influence the target independently. For example, if we have a list of items ingested by a person over the course of a day, we may be interested in finding two possible patterns, e.g., “coffee is usually consumed after dessert” and “chips are usually consumed after fish.”

3.1.2 Union-level sequential pattern

The USP tries to predict the behavior of users by aggregating multiple interactions [23]. The pattern is based on the assumption that the union of all the previous events is a reasonable predictor for the next event in the sequence. Different from the PSP, the USP is able to identify items that are frequently co-consumed, such as the combination of breakfast and lunch. Figure 1b shows an illustration of the USP, where several previous actions jointly influence the target action. The LSTM network is employed in our proposed approach to mine both the PSP and the USP that exist in users’ behaviors.

Fig. 1 Point- and union-level sequential patterns

3.2 Modeling and learning

The proposed model mainly consists of two units, i.e., the LSTM unit and the CF unit, where the LSTM network attempts to discover long- and short-term preferences that exist in users’ interaction sequences and to determine the latent representation of the user in the embedding layer. The sequential learning process allows the model to learn behavior patterns of users’ preferences at both point- and union-level, which enables the model to make more accurate predictions about their future behaviors.

Specifically, the user-item interaction at each time step is transformed into a one-hot encoded vector as the network input. Then, these vectors are mapped to low-dimensional dense vectors through the embedding layer and passed to the LSTM network to capture the behavior pattern of each user. Afterward, the user embedding vector is calculated by averaging the trained embedding vectors. The resulting user embeddings are fed as input vectors to the CF unit.

The CF unit makes recommendation by suggesting items (liked by other users with similar tastes) to target users. Cosine similarity is used to quantify the correlation between two users. After obtaining the similarity matrix, the most similar group of users (to the target users) is identified with the most liked item selected and forwarded to the target users as recommendation.

3.2.1 LSTM

Recurrent neural networks (RNNs) use data patterns to predict the probability of future events based on the sequential characteristics of the data [30]. Various ordinal or temporal problems can be solved using this method, such as language translation, natural language processing, speech recognition, and image captioning. In contrast to traditional deep neural networks, which assume that inputs and outputs are independent of each other, RNNs incorporate input and output information from previous inputs to influence the current input and output. The LSTM network, as a variation of the traditional RNN, is designed to better retain information over a long period of time.

In addition to learning the non-linear and non-stationary nature of sequential data, the LSTM network has the advantage of preserving information in memory for a long period of time, which is in line with the goal of capturing the union-level pattern. The LSTM network controls the flow of information using three gates: the forget gate, the input gate, and the output gate, where the forget gate determines which information requires attention and which may be ignored by using the update function given as follows:

$$\begin{aligned} f_{t}=\sigma \left( W_{f}\cdot \left[ h_{t-1}, x_{t}\right] +b_{f}\right) \end{aligned}$$
(2)

where a sigmoid layer is applied to the input of the unit at time t and the previous hidden state, denoted by \(x_{t}\) and \(h_{t-1}\), respectively. The next step is to determine what information should be stored in the current cell state. First, the input gate layer determines which values to update. Then, a tanh layer formulates a vector of new candidate values, denoted as \({\tilde{C}}_{t}\), which can be added to the state. These two layers combine to produce an update to the current state, which is defined as follows:

$$\begin{aligned} i_{t}= & {} \sigma \left( W_{i} \cdot \left[ h_{t-1}, x_{t}\right] +b_{i}\right) \end{aligned}$$
(3)
$$\begin{aligned} {\tilde{C}}_{t}= & {} \tanh \left( W_{C} \cdot \left[ h_{t-1}, x_{t}\right] +b_{C}\right) \end{aligned}$$
(4)
$$\begin{aligned} C_{t}= & {} f_{t} * C_{t-1}+i_{t} * {\tilde{C}}_{t} \end{aligned}$$
(5)

The new cell state \(C_{t}\) is decided by the old state \(f_{t} * C_{t-1}\) and the new candidate value \(i_{t} * {\tilde{C}}_{t}\). Finally, the output is generated from the current internal cell state \(C_{t}\).

$$\begin{aligned} o_{t}= & {} \sigma \left( W_{o} \cdot \left[ h_{t-1}, x_{t}\right] +b_{o}\right) \end{aligned}$$
(6)
$$\begin{aligned} h_{t}= & {} o_{t} * \tanh \left( C_{t}\right) \end{aligned}$$
(7)

where the current input \(x_{t}\) and the previous hidden state \(h_{t-1}\) are passed into the sigmoid function to decide which parts of the cell state are to be output. Then, the new cell state passes through the tanh function. These two outputs are multiplied point by point. The final hidden state \(h_{t}\) is used for prediction.
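To make the gate updates concrete, a single LSTM step following Eqs. (2)-(7) can be sketched in plain Python as below. This is an illustrative sketch rather than the authors' implementation: the helper names (`affine`, `lstm_step`, `zero_params`) and the all-zero demo weights are assumptions chosen so that the result is easy to check by hand.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, v: Vector, b: Vector) -> Vector:
    """Return W . v + b for a row-major weight matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              p: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM time step; returns the new hidden state h_t and cell state C_t."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(p["W_f"], concat, p["b_f"])]        # Eq. (2)
    i_t = [sigmoid(z) for z in affine(p["W_i"], concat, p["b_i"])]        # Eq. (3)
    c_tilde = [math.tanh(z) for z in affine(p["W_C"], concat, p["b_C"])]  # Eq. (4)
    c_t = [f * c + i * g
           for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]              # Eq. (5)
    o_t = [sigmoid(z) for z in affine(p["W_o"], concat, p["b_o"])]        # Eq. (6)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                    # Eq. (7)
    return h_t, c_t


def zero_params(hidden: int, inputs: int) -> Dict[str, list]:
    """Hypothetical all-zero parameters, used only to keep the demo checkable."""
    params: Dict[str, list] = {}
    for gate in ("f", "i", "C", "o"):
        params["W_" + gate] = [[0.0] * (hidden + inputs) for _ in range(hidden)]
        params["b_" + gate] = [0.0] * hidden
    return params


h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], zero_params(hidden=2, inputs=1))
print(h_t, c_t)
```

With zero weights every gate evaluates to 0.5 and the candidate vector to 0, so the cell state is simply halved; trained weights would of course give input-dependent gates.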

3.2.2 Customized LSTM network

The LSTM network is used in this paper to learn the user representation. The input of the LSTM network is the item of the actual interaction, while the output is the predicted item with which a user tends to interact at the next time step. The item is first converted to a one-hot encoding vector, where the length of the vector equals the number of items. Here, only the coordinate corresponding to the active item is one, and the remaining coordinates are zeros. Then, the one-hot encoding is mapped to a learnable, low-dimensional vector through the embedding layer. After retrieving the pre-trained item embeddings, the user embedding can be calculated by averaging the item embeddings. Note that the pre-training process is independent for each user, and therefore the averaged embedding can be used as a reasonable representation of each user. Figure 2 depicts the structure of the LSTM unit. Additional embedding layers are added between the input and the LSTM layer, and the output is the predicted preference of the items.
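The encoding and averaging steps described above can be sketched as follows. The embedding matrix here is a hypothetical hand-written 3 x 2 example; in the paper it would be learned together with the LSTM network, and the function names are illustrative only.

```python
from typing import List

Vector = List[float]


def one_hot(index: int, n_items: int) -> Vector:
    """Vector of length n_items with a single one at the active item."""
    vec = [0.0] * n_items
    vec[index] = 1.0
    return vec


def embed(one_hot_vec: Vector, embedding_matrix: List[Vector]) -> Vector:
    """Multiply a one-hot row vector by an (n_items x d) embedding matrix.

    Because the input is one-hot, this simply selects one row of the matrix.
    """
    d = len(embedding_matrix[0])
    return [sum(one_hot_vec[i] * embedding_matrix[i][k]
                for i in range(len(one_hot_vec)))
            for k in range(d)]


def user_embedding(item_indices: List[int], embedding_matrix: List[Vector]) -> Vector:
    """Average the embeddings of the items a user has interacted with."""
    n_items = len(embedding_matrix)
    vectors = [embed(one_hot(i, n_items), embedding_matrix) for i in item_indices]
    d = len(embedding_matrix[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]


# Hypothetical embedding matrix for three items in a two-dimensional latent space.
E = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]
print(user_embedding([0, 2], E))
```

In a real implementation the one-hot multiplication is usually replaced by a direct row lookup, which gives the same result at far lower cost.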

Fig. 2 General structure of the network

3.2.3 CF unit

The CF unit starts with user embedding that represents the individual interest of each user. To find the user group most similar to the target user, the similarity between each pair of users is calculated using the cosine similarity measurement. The cosine similarity is defined as:

$$\begin{aligned} {\text {sim}}({\varvec{x}}, {\varvec{y}})=\frac{{\varvec{x}} \cdot {\varvec{y}}}{\Vert {\varvec{x}}\Vert \Vert {\varvec{y}} \Vert } \end{aligned}$$
(8)

where \(\Vert \cdot \Vert\) is the Euclidean norm of vector “\(\cdot\)”. Conceptually, \(\Vert \cdot \Vert\) is the length of the vector. The measure computes the cosine of the angle between vectors x and y. The greater the cosine value is, the more similar the tastes of the two users are. The next step is to generate the recommendation. The top N most liked items are retrieved from the target user’s neighborhood based on their popularity, and the recommendation lists are ranked according to relevance and popularity. Table 1 provides the similarity matrix acquired from the CF unit.
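The CF step can be illustrated with the short sketch below, which computes Eq. (8) and then ranks the items liked by the k most similar users by how many of those neighbours liked them. The neighbourhood size k, the tie-breaking by item id, the exclusion of items the target user already has, and the user names and data are all assumptions made for this example; the paper does not specify them.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def cosine_similarity(x: List[float], y: List[float]) -> float:
    """Eq. (8); returns 0.0 for a zero vector (a convention chosen here)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def recommend(target: str, embeddings: Dict[str, List[float]],
              liked: Dict[str, Set[int]], k: int, n: int) -> List[int]:
    """Top-n items among the k nearest neighbours, ranked by popularity."""
    sims = sorted(((cosine_similarity(embeddings[target], vec), user)
                   for user, vec in embeddings.items() if user != target),
                  reverse=True)
    neighbours = [user for _, user in sims[:k]]
    counts = Counter(item for user in neighbours for item in liked[user]
                     if item not in liked[target])
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


# Hypothetical user embeddings and liked recipe ids.
embeddings = {"alice": [1.0, 0.0], "bob": [0.9, 0.1],
              "carol": [0.8, 0.2], "dave": [0.0, 1.0]}
liked = {"alice": {1}, "bob": {1, 2, 3}, "carol": {2, 4}, "dave": {5}}
print(recommend("alice", embeddings, liked, k=2, n=2))
```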

Table 1 Similarity matrix

4 Experiments and results

In this section, the proposed model is evaluated against popular baselines on one of the most popular food data sets, i.e., the Food.com data set, formerly known as the GeniusKitchen.com data set.

4.1 Data set

The website Food.com is arguably the largest food-oriented website, attracting 1.5 billion visits every year, and the adopted data set comprises 180K+ recipes as well as 700K+ reviews that cover user interactions over 18 years. Each interaction in the data set consists of a user identifier, a recipe identifier, and the corresponding rating and date. For better model performance, the explicit feedback has been converted to implicit feedback.

Table 2 Food.com data set

Table 2 shows some examples from the Food.com data set. The user_id and recipe_id represent the user identifier and the recipe identifier, respectively. The date indicates the record time of the entry, and each interaction indicates that the user has consumed the item.
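A possible preprocessing step is sketched below: rated interactions are turned into per-user chronological item sequences, and the last 20% of each sequence is held out as described in Sect. 4.2. The rows are invented, and two choices are assumptions rather than details from the paper: every rated interaction is treated as a positive implicit signal, and the test size is rounded down with a minimum of one item.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical rows in the (user_id, recipe_id, date, rating) layout.
rows = [
    (1, 101, "2005-03-02", 4),
    (2, 205, "2004-11-20", 5),
    (1, 102, "2003-07-15", 2),
    (1, 103, "2007-01-09", 5),
]


def to_implicit_sequences(rows: List[Tuple[int, int, str, int]]) -> Dict[int, List[int]]:
    """Drop the rating value and order each user's recipes by date.

    Assumption: any rated interaction counts as an implicit 'consumed' event.
    """
    per_user = defaultdict(list)
    for user_id, recipe_id, day, _rating in rows:
        per_user[user_id].append((date.fromisoformat(day), recipe_id))
    return {user: [recipe for _, recipe in sorted(events)]
            for user, events in per_user.items()}


def split_chronological(sequence: List[int],
                        test_fraction: float = 0.2) -> Tuple[List[int], List[int]]:
    """Hold out the last part of a sequence; rounding down is an assumption."""
    n_test = max(1, int(len(sequence) * test_fraction))
    return sequence[:-n_test], sequence[-n_test:]


sequences = to_implicit_sequences(rows)
train, test = split_chronological([102, 101, 103, 104, 105])
print(sequences, train, test)
```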

4.2 Evaluation metrics

The goal of the experiments is to evaluate the quality and performance of the proposed approach against various baselines. For each user, the last 20% of interactions are held out as the test set and the remaining data are utilized for training. The performance of the utilized RSs is measured by precision@N, recall@N, mean average precision (MAP), as well as mean reciprocal rank (MRR). Precision refers to the fraction of retrieved items that are relevant, while recall indicates the fraction of relevant items that are retrieved. Precision@N and Recall@N are defined as:

$$\begin{aligned} {\text {Prec}} @ N= & {} \frac{\left| R \bigcap {\hat{R}}_{1: N}\right| }{N} \end{aligned}$$
(9)
$$\begin{aligned} {\text {Recall}} @ N= & {} \frac{\left| R \bigcap {\hat{R}}_{1: N}\right| }{|R|} \end{aligned}$$
(10)

where \({\hat{R}}_{1: N}\) denotes a list of top-N predicted items for a user and R denotes the last 20% of actions in the test set. To evaluate the overall performance of the approach, the MAP and MRR are used. The MAP is widely used in the RS for its ability to provide general estimation of model performance. The MAP is the average of the average precision (AP) defined by:

$$\begin{aligned} \textrm{AP}=\frac{\sum _{N=1}^{|{\hat{R}}|} {\text {Prec}} @ N \times {\text {rel}}(N)}{|{\hat{R}}|} \end{aligned}$$
(11)

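where \({\text {rel}}(N) = 1\) if the Nth item is in the same ranking order in both the prediction and test sets, and \({\text {rel}}(N) = 0\) otherwise. The MRR is used to assess the performance of the CF unit and is calculated as the mean of the reciprocal ranks of the items retrieved by the approach. Following the standard definition, with Q denoting the set of evaluation queries and \(\textrm{rank}_i\) the rank position of the first relevant item returned for the ith query, the MRR is defined as: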

$$\begin{aligned} \textrm{MRR}=\frac{1}{|Q|} \sum _{i=1}^{|Q|} \frac{1}{\textrm{rank}_i}. \end{aligned}$$
(12)
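The metrics in Eqs. (9)-(12) can be sketched as below. Two readings are assumptions of this sketch: rel(N) is taken as the common hit indicator (1 when the item at rank N is in the test set), and the reciprocal rank uses the first relevant item, with 0 when no relevant item is returned. The relevant sets and ranked lists are hypothetical.

```python
from typing import List, Sequence, Set


def precision_at_n(relevant: Set[int], ranked: Sequence[int], n: int) -> float:
    """Eq. (9): share of the top-n predictions that are relevant."""
    return len(relevant & set(ranked[:n])) / n


def recall_at_n(relevant: Set[int], ranked: Sequence[int], n: int) -> float:
    """Eq. (10): share of the relevant items found in the top-n predictions."""
    return len(relevant & set(ranked[:n])) / len(relevant)


def average_precision(relevant: Set[int], ranked: Sequence[int]) -> float:
    """Eq. (11), with rel(N) read as a hit indicator (an assumption)."""
    total = 0.0
    for n in range(1, len(ranked) + 1):
        if ranked[n - 1] in relevant:
            total += precision_at_n(relevant, ranked, n)
    return total / len(ranked)


def reciprocal_rank(relevant: Set[int], ranked: Sequence[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is returned."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Two hypothetical users: (held-out relevant items, ranked predictions).
users = [({3, 5}, [5, 1, 3, 9]), ({2}, [7, 2])]
map_score = mean([average_precision(r, p) for r, p in users])
mrr_score = mean([reciprocal_rank(r, p) for r, p in users])  # Eq. (12)
print(map_score, mrr_score)
```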

4.3 Experiment setting

In this paper, three widely used baselines (including the Item k nearest neighbor (Item-KNN) algorithm [25], the Meta-Prod2vec collaborative filtering (MPCF) algorithm [45], and the convolutional sequence embedding recommendation (Caser) algorithm [42]) are selected as the benchmark.

  • Item-KNN: Item-KNN recommends items similar to the target item, and similarity is defined as the cosine similarity between the vectors of the user interaction history.

  • MPCF: The Meta-Prod2vec method computes low-dimensional embeddings of items based on previous interactions with the items. The representation of a user is calculated as the mean of the products consumed by the user.

  • Caser: A personalized top-N sequential recommendation framework which uses CNNs for sequence modelling.
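As a rough illustration of the simplest baseline above, the sketch below computes item-to-item cosine similarity on binary interaction vectors, each represented as the set of users who interacted with the item. It is a simplified reading of Item-KNN and not the reference implementation of [25]; the function names and interaction data are hypothetical.

```python
import math
from typing import Dict, List, Set


def item_cosine(users_a: Set[int], users_b: Set[int]) -> float:
    """Cosine similarity of two binary item vectors given as sets of user ids.

    For binary vectors the dot product is the overlap size and each norm is
    the square root of the set size.
    """
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def similar_items(target: int, item_users: Dict[int, Set[int]], k: int) -> List[int]:
    """Return the k items most similar to the target (ties broken by item id)."""
    scored = [(item_cosine(item_users[target], users), item)
              for item, users in item_users.items() if item != target]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:k]]


# Hypothetical map from recipe id to the set of users who interacted with it.
item_users = {10: {1, 2, 3}, 11: {1, 2}, 12: {3}, 13: {4}}
print(similar_items(10, item_users, k=2))
```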

The number of recommended items is set to 1, 5, and 10 in the experiment to evaluate the performance of the utilized RSs. The learning rate, the minimum epoch and the mini-batch size of the MPCF algorithm, the Caser algorithm and the proposed approach are set to 0.001, 50 and 128, respectively. The numbers of horizontal filters and vertical filters of the Caser algorithm are set to 16 and 4, respectively. In the proposed approach, the number of layers and the number of hidden neurons are set to 1 and 30, respectively.

4.4 Performance comparison

The evaluation results of the three baselines and the proposed approach are presented in Table 3, where the best performer in each row is highlighted in bold, and the last column reports the improvement of the proposed approach over the best baseline in percentages. As shown in Table 3, the proposed method outperforms the Item-KNN method, the MPCF method, and the Caser method in terms of Prec@5, Prec@10, Recall@5, MAP and MRR. In addition, the proposed method obtains the second-best results in terms of Prec@1, Recall@1 and Recall@10 compared with the three baseline methods. In general, we can conclude that the proposed method outperforms the baseline methods on most of the chosen evaluation metrics. It should also be noted that sequential RSs (e.g., MPCF and Caser) outperform the Item-KNN method (which is a traditional RS), suggesting that considering sequential patterns in user behaviors leads to higher accuracy.

Table 3 Performance comparison

In our experiment, the embedding dimension is a key hyper-parameter which is optimized through the model selection process. To find the best embedding dimension, we vary it from 10 to 100 and compare the MAP of two baselines with that of the proposed model at each dimension, as shown in Fig. 3.

Fig. 3 MAP (y-axis) vs. the number of latent dimensions d (x-axis)

Fig. 4 Comparing Prec@10 and Recall@10 of the proposed solution against three baselines

Fig. 5 Comparing MAP and MRR of the proposed solution against three baselines

Figure 3 shows the MAP of the two baselines and the proposed model for different embedding dimensions. The MPCF method, the Caser method, and the proposed method all achieve their best performance with an embedding dimension of 30. It should be noted that performance does not improve as the dimension increases further. Overall, the proposed model beats the strongest baseline over the selected range and shows a steadier trend than the baselines, which supports the stability of the approach. Figures 4 and 5 compare the proposed solution against Item-KNN, MPCF and Caser on four metrics in the form of bar charts.

5 Conclusion

In this paper, a novel sequence-based recommendation approach has been developed to solve food recommendation tasks. Specifically, LSTM networks are used to model user-item interactions, and CF techniques are adopted to make recommendations. Experimental results show reasonable performance gains over popular baselines, including sequence-based RSs. Some future research directions include (1) the adoption of additional information (e.g., images, reviews and browsing history); (2) the proposal of explainable and personalized food recommendation; and (3) the introduction of more advanced machine learning techniques for cross-domain recommendations, see e.g. [4, 13, 15, 24, 26,27,28,29, 48, 50, 52, 53, 59].