1 Introduction

A user’s rating indicates his/her attitude toward a purchased item. Rating prediction aims to predict a user’s ratings on unrated items, which may reflect his/her potential interest in these items. Collaborative filtering (CF) approaches, which mainly depend on historical ratings, have attracted great research interest and become the dominant method in recommender systems. As a typical CF technique, matrix factorization (MF) learns the latent features of users and items by decomposing the user-item rating matrix and then uses these two feature vectors to predict the rating that the user would assign to the item.

MF is the most widely used technique for rating prediction. However, MF-based methods suffer from the data sparsity problem, and the predicted rating lacks interpretability as to why the user gives a high or low score. To tackle these issues, textual reviews have become a key complementary data source to enhance the performance and interpretability of the rating prediction task [1, 8, 26, 41]. In particular, owing to their power in nonlinearly combining different types of information, deep neural networks have brought impressive progress to this problem [3, 4, 6, 24, 35, 42].

The pioneering work by Zheng et al. [42] proposed the DeepCoNN model to represent both users and items in a joint manner using all the reviews of users and items. As shown in [3], the target review, which is written by the target user for the target item, provides much of the predictive value for rating prediction. The performance of the DeepCoNN model [42] drops severely when the target reviews are omitted. Indeed, the target review usually contains the target user’s preference on the target item’s attributes or properties and is closely related to the rating score. However, the target review is not available at test time in real-world recommendation settings. Subsequent studies along this line therefore never access the target reviews in the validation and test sets, so as to simulate a real-world scenario. Clearly, the inherent limitation of these methods is that they are unable to utilize the most predictive target review.

In light of this, we propose a novel framework, namely AGTR, to generate the target review for rating prediction. Our model has two distinguishing characteristics. Firstly, we generate the target review with rating tailored generative adversarial nets (RTGAN), which incorporate the rating into the objective function in addition to the user’s and the item’s historical reviews. Secondly, we develop a neural latent factor module (NLFM) to accurately predict the rating score by learning from the generated target review, which encodes the user’s specific preference on the item. In this way, the target review naturally provides guidance for the rating prediction task beyond the above-mentioned review-aware deep recommendation approaches [3, 4, 6, 24, 35]. Meanwhile, the rating drives the RTGAN module to produce a target review conveying sentiment consistent with the rating score.

We are aware of a few existing studies on generating reviews [5, 25, 37] or abstractive tips [21]. However, our AGTR model is fundamentally different from the NRT [21], MT [25], and CAML [5] models, in the sense that none of these approaches directly utilizes the target review for rating prediction. Although the neural memory (NM) model proposed by Wang and Zhang [37] also integrates the target review in its prediction step, our model differs from NM in both the review generation and rating prediction modules. We present a conditional GAN architecture for review generation, whereas NM [37] uses the sequence-to-sequence (seq2seq) [33] generative model. More importantly, we design a novel neural latent factor model that stresses the target review to make good use of its predictive ability, while NM simply feeds the target review as an input to the last layer of rating prediction.

We have evaluated the proposed AGTR model on both the rating prediction and review generation tasks. Empirical evaluation on four real-world datasets shows that our model achieves state-of-the-art performance on both tasks.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the problem definition and preliminaries. Section 4 presents our AGTR model in detail. Section 5 gives the experimental results. Finally, Sect. 6 concludes the paper.

2 Related Work

We summarize the research progress in review-aware rating prediction, categorized into traditional methods and deep learning-based methods. We omit the classic collaborative filtering methods that do not use text reviews.

2.1 Traditional Methods

When integrating review texts, traditional methods can be roughly classified into three categories. The first is to extract useful textual information such as topics or aspects from review texts, learn latent factors from ratings, and then link the textual information and latent factors together using linear [2, 26, 34, 40] or Bayesian combination [23, 38]. The second is to extend the latent factor model [7, 13, 28, 29, 41] to encode the textual influence. The third is to modify graphical models to include latent factors from ratings [1, 8, 11, 36].

2.2 Deep Learning-Based Methods

The first type of deep learning-based methods only uses historical reviews without generating the target review. These approaches differ mainly in how they combine reviews with ratings. For example, NARRE [4] jointly learns hidden latent features for users and items using two parallel neural networks with an attention mechanism. TARMF [24] adopts a neural network for mutual learning between reviews and ratings, where the features from reviews are optimized by an attention-based GRU network. A\(^3\)NCF [6] extracts features from reviews using topic models and fuses them with the embeddings from ratings; it then captures a user’s attention on the item with an attention network. MPCN [35] presents a pointer-based co-attention mechanism which can extract multiple interactions between user and item reviews.

The second type of deep learning-based methods generates the target review, but not all of them exploit its predictive ability. Having illustrated this issue in the introduction, here we discuss how these methods generate target reviews. NRT [21] mainly aims to enhance explainability by generating tips with a standard generative model built on the GRU architecture. NM [37] adopts the seq2seq [33] technique for review generation. Meanwhile, MT [25] uses an adversarial training process which helps overcome the exposure bias problem in seq2seq models.

Our proposed AGTR model falls into the second type of deep learning-based methods and has the following three advantages. Firstly, exploiting target reviews provides a target-dependent modeling of user and item characteristics. Secondly, our model incorporates the rating as one of the conditions in both the generator and the discriminator when generating the target review with the GAN mechanism. This is different from the previous MT [25] method, which relies purely on reviews without the guidance of ratings. As we will show in the experimental part, the incorporation of ratings helps the model generate high-quality target reviews and further improves the rating prediction task. Thirdly, when utilizing the target reviews, the previous MT method adopts a traditional MF method for rating prediction, which does not take the target review into consideration. In contrast, our model fully leverages the target review with a carefully designed neural latent factor model.

3 Problem Definition and Preliminary

This section presents the problem definition and preliminaries.

3.1 Problem Definition

The goal of rating prediction is to predict the rating of a given user for an item. Additionally, in our task, we also generate a target review in the form of a sentence that the user would write for the item. More formally, let \({\mathcal {U}}\) be a user set, \({\mathcal {I}}\) be an item set, and \({\mathcal {D}}\) be a set of reviews on the items in \({\mathcal {I}}\) written by the users in \({\mathcal {U}}\). Each review \(d_{ui}\) written by user u on item i has an accompanying rating \(r_{ui}\) indicating u’s overall satisfaction toward i. We refer to all historical reviews written by the user, excluding the one on item i, as the user’s historical review document \(d_{u}\). Similarly, the set of historical reviews on item i, excluding the one written by u, is referred to as the item’s historical review document \(d_{i}\). Each training instance is denoted as a sextuple (u, i, \(d_{ui}\), \(r_{ui}\), \(d_{u}\), \(d_{i}\)). The goal is to predict a rating \({\hat{r}}_{ui}\) and generate a synthetic target review \(s_{ui}\) for each item i that u has not interacted with.

For ease of presentation, we summarize the notations in Table 1.

Table 1 Notations used in this paper

3.2 Preliminary

One key property of our model is to utilize generative adversarial nets (GAN) to generate the target review. GAN was proposed by Goodfellow et al. [9] and is inspired by the two-player zero-sum game. The main components in GAN are a generator G and a discriminator D. These two components are trained simultaneously under the adversarial learning idea. The optimization process of GAN is a min–max game, where the target is to reach a Nash equilibrium [30]. Later on, Salimans et al. [32] presented several new architectural features and training procedures aimed at improving the stability of training and the perceptual quality of GAN samples. Mirza and Osindero [27] constructed conditional adversarial nets where both the generator and the discriminator are conditioned on auxiliary information such as class labels or data from other modalities. Yu et al. [39] further combined the sequence-to-sequence model with generative adversarial nets. GAN has been applied to research fields such as vision tasks [32] and speech and language processing [20, 39].

In this subsection, we give a brief introduction to the original GAN and the closely related conditional GAN and sequence GAN. The generator G in GAN is used to capture the data distribution, while the discriminator D estimates the probability that a sample comes from the training data rather than from G. To learn the distribution \(p_g\) over data x in G, we first define a prior on input noise variables \(p_z(z)\). We then define two mapping functions: \(G(z,\theta _g)\), which maps the prior noise distribution \(p_z(z)\) to the data space, and \(D(x,\theta _d)\), which outputs a single scalar representing the probability that x comes from the training data rather than from \(p_g\). Both G and D can be represented by multilayer perceptrons with parameters \(\theta _g\) and \(\theta _d\), respectively.

The objective of D is to maximize the probability of assigning the correct label to both training examples and samples from G. Meanwhile, the objective of G is to minimize \(\log (1 - D(G(z)))\). That is to say, D and G play a two-player min–max game with the objective function L(D, G) defined as follows.

$$\begin{aligned} \small \underset{G}{\min } \, \underset{D}{\max } \, L(D,G) = E_{x \sim p_\mathrm{data}(x)}[\log D(x)] + E_{z \sim p_z (z)}[\log (1-D(G(z)))] \end{aligned}$$
(1)

Generative models which learn complex distributions via variational inference, e.g., VAE [18] and its variants [15, 17], suffer from the difficulty of approximating intractable probabilistic computations. In contrast, GAN does not assume the generated data belongs to a predefined distribution. Instead, it uses a discriminative model to guide the training of the generative model and has succeeded in generating real-valued data. However, GAN also has the disadvantage that there is no control over the data being generated. Hence, Mirza and Osindero [27] extended GAN to a conditional model to incorporate extra information. Formally, the objective function of the conditional GAN can be defined as follows.

$$\begin{aligned} \small \underset{G}{\min } \, \underset{D}{\max } \, V(D,G) = E_{x \sim p_\mathrm{data}(x)}[\log D(x|y)] + E_{z \sim p_z (z)}[\log (1-D(G(z|y)))] \end{aligned}$$
(2)
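To make the conditional objective in Eq. (2) concrete, the following minimal PyTorch sketch performs one alternating update of D and G. The networks G and D, their optimizers, and the condition tensor y are caller-supplied placeholders, not part of the original formulation; D is assumed to output a probability in (0, 1).

```python
import torch

def conditional_gan_step(G, D, x_real, y, opt_g, opt_d, z_dim=64, eps=1e-8):
    """One alternating update of the conditional min-max game in Eq. (2)."""
    z = torch.randn(x_real.size(0), z_dim)
    # Discriminator step: maximize log D(x|y) + log(1 - D(G(z|y)|y)).
    d_loss = -(torch.log(D(x_real, y) + eps).mean()
               + torch.log(1 - D(G(z, y).detach(), y) + eps).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: minimize log(1 - D(G(z|y)|y)).
    g_loss = torch.log(1 - D(G(z, y), y) + eps).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```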

When GAN is introduced into sequence generation, there are two main challenges. Firstly, sequence data is discrete and requires a sampling procedure when generating each word; hence, it is hard to pass the gradient update from the discriminative model D to the generative model G. Secondly, D can only assess a complete sequence. However, words are generated one by one, and it is nontrivial to balance the current score and the future one for a partially generated sequence. To address these problems, Yu et al. [39] propose a sequence generation GAN framework which adopts reinforcement learning and a Monte Carlo search strategy, and Guo et al. [10] further present the LeakGAN model for generating long sequences.

Since our task is to generate the target review, which is usually a long sequence, we adopt LeakGAN as the basic framework and extend it by combining the conditional GAN into the model. We present the details in the description of our model below.

4 Our Proposed AGTR Model

In this section, we introduce our proposed AGTR model. We begin with the overall architecture and then go to the details of two modules.

4.1 Model Overview

Our AGTR model consists of two modules. One is the rating tailored GAN (RTGAN), which takes the rating as an important condition in the generator and the discriminator of GAN for review generation. The other is the neural latent factor module (NLFM), which leverages the generated target review along with the historical reviews for rating prediction using a neural network. The overall architecture of our model is shown in Fig. 1.

Fig. 1 The architecture of our AGTR model

4.2 Rating Tailored GAN (RTGAN) Module

We have two basic assumptions for generating the synthetic target review \(s_{ui}\). Firstly, \(s_{ui}\) should reflect user u’s preferences and item i’s features. Secondly, the sentiment expressed in \(s_{ui}\) should be consistent with the rating score \(r_{ui}\). Following these assumptions, we design our rating tailored GAN (RTGAN) module conditioned on three types of information: (1) the user’s historical review document \(d_u\) to capture u’s preferences, (2) the item’s historical review document \(d_i\) to represent i’s features, and (3) the rating \(r_{ui}\) of user u for item i to serve as a constraint. During training, we learn a generator G that uses the three types of condition information to produce a synthetic review, and a discriminator D that distinguishes it from the real one.

4.2.1 Condition Information Encoder

We first introduce the condition information encoder (the left gray part in Fig. 1). It maps the three types of condition information into the user’s general preference embedding \({\mathbf {g}}_{u}\), the item’s feature embedding \({\mathbf {g}}_{i}\), and the rating embedding \({\mathbf {h}}_{ui}\).

We take the process of mapping the user’s review document \(d_u\) to his/her preference embedding \({\mathbf {g}}_{u}\) as an example. Each word in \(d_u\) is randomly initialized as a d-dimensional vector, and each review in \(d_u\) is transformed into a matrix with the fixed length T (padded with 0 if necessary). Since text processing is not the focus of this study, we adopt the same TextCNN [4] approach to encode each review in \(d_u\). Essentially, TextCNN can be summarized as a CNN structure followed by an attention mechanism. The convolution layer consists of m neurons. Each neuron is associated with a filter \({{\mathbf {K}}}\in {\mathbb {R}}^{t\times d}\) which produces features by applying the convolution operator on word vectors. Let \({\mathbf {V}}_{ul}\) be the embedding matrix corresponding to the lth review in \(d_u\); the jth neuron in the CNN produces its feature as:

$$\begin{aligned} \small {\mathbf {z}}_j=\sigma ({\mathbf {V}}_{ul}*{\mathbf {K}}_j+b_{j}), \end{aligned}$$
(3)

where \(*\) is the convolution operator, \(b_{j}\) is a bias term, and \(\sigma\) is the nonlinear ReLU activation function. We then apply a max-pooling operation to obtain the output feature \({\mathbf {o}}_{j}\) corresponding to this neuron. By concatenating the outputs from all m neurons, the convolution layer produces the embedding \({\mathbf {o}}_{ul}\) of the review \(d_{ul}\) as:

$$\begin{aligned} \small {\mathbf {o}}_{ul} = [{\mathbf {o}}_{1},{\mathbf {o}}_{2},{\mathbf {o}}_{3},\ldots , {\mathbf {o}}_{m}] \end{aligned}$$
(4)

After getting the embedding for each review in \(d_u\), the attention mechanism is adopted to obtain the weights for these reviews. The attention score \(a_{ul}^{*}\) for review \(d_{ul}\) is defined as:

$$\begin{aligned} \small a_{ul}^{*} = {\mathbf {h}}_a^\mathrm{T}ReLU({\mathbf {W}}_{O}{\mathbf {o}}_{ul} + {\mathbf {W}}_{i} {\mathbf {i}}_{ul}+b_{1})+b_{2}, \end{aligned}$$
(5)

where \({\mathbf {h}}_a\in {\mathbb {R}}^{t}\), \({\mathbf {W}}_O\in {\mathbb {R}}^{t\times k_{1}}\), \({\mathbf {W}}_{i}\in {\mathbb {R}}^{t\times k_{2}}\), \(b_{1}\in {\mathbb {R}}^{t}\), and \(b_{2}\in {\mathbb {R}}^{1}\) are model parameters, and \({\mathbf {i}}_{ul}\in {\mathbb {R}}^{k_2}\) is the embedding of the item for which the user wrote this review.

A softmax function is used to normalize \(a_{ul}^{*}\) to obtain the final attention weight \(a_{ul}\). User u’s general preference embedding \({\mathbf {g}}_{u}\) is then calculated as the attention-weighted sum of all reviews \(d_{ul}\in d_u\), i.e.,

$$\begin{aligned} \small {\mathbf {g}}_{u} = \sum \limits _{l=1}^{|d_u|} a_{ul}{\mathbf {o}}_{ul} \end{aligned}$$
(6)
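For concreteness, the sketch below implements the review encoder of Eqs. (3)–(6) in PyTorch. The hyper-parameter values and module layout are illustrative assumptions; only the convolution, max-pooling, attention, and weighted-sum steps follow the equations above (note that, following the paper’s notation, t serves as both the filter width and the attention hidden size).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReviewEncoder(nn.Module):
    def __init__(self, d=32, m=100, t=3, k2=32):
        super().__init__()
        self.conv = nn.Conv1d(d, m, kernel_size=t)   # m filters K_j of width t (Eq. 3)
        self.W_O = nn.Linear(m, t, bias=False)       # W_O in Eq. (5)
        self.W_i = nn.Linear(k2, t, bias=True)       # W_i; the bias plays the role of b_1
        self.h_a = nn.Linear(t, 1, bias=True)        # h_a; the bias plays the role of b_2

    def forward(self, reviews, item_embs):
        # reviews:   (L, T, d) word-embedding matrices V_ul of the L reviews in d_u
        # item_embs: (L, k2)   embeddings i_ul of the items the reviews were written for
        z = F.relu(self.conv(reviews.transpose(1, 2)))   # (L, m, T-t+1): Eq. (3)
        o = z.max(dim=2).values                          # max-pooling -> o_ul, Eq. (4)
        a_star = self.h_a(F.relu(self.W_O(o) + self.W_i(item_embs)))  # Eq. (5)
        a = F.softmax(a_star.squeeze(1), dim=0)          # normalized attention a_ul
        return (a.unsqueeze(1) * o).sum(dim=0)           # g_u, Eq. (6)
```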

The item’s review document \(d_i\) is mapped to its feature embedding \({\mathbf {g}}_{i}\) in the same way. We first encode each review \(d_{il}\) into an embedding \({\mathbf {o}}_{il}\) using the convolution and pooling operations in TextCNN. We then employ the attention mechanism to combine the review embeddings with different weights.

$$\begin{aligned} \small {\mathbf {g}}_{i} = \sum \limits _{l=1}^{|d_i|} a_{il}{\mathbf {o}}_{il} \end{aligned}$$
(7)

The mapping from the original rating \(r_{ui}\) to a one-hot embedding \({\mathbf {h}}_{ui}\) is straightforward. We simply discretize the rating \(r_{ui}\) into an m-dimensional vector (\(m=5\) in our case). If the value falls into an interval, the corresponding dimension is set to 1 and the other dimensions are set to 0. For example, a rating \(r_{ui}=3.78\) is mapped into \({\mathbf {h}}_{ui} = (0, 0, 0, 1, 0)^\mathrm{T}\). Note that the rating \(r_{ui}\) is known only during training. During validation or testing, we use a basic rating from the NLFM module instead; the details are given later.
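As a small illustration, the discretization can be written as the following hypothetical helper (the function name and interval convention are our assumptions):

```python
import numpy as np

def rating_to_onehot(r, m=5):
    """Bucket a (possibly fractional) rating into one of m one-hot intervals:
    (0,1] -> dim 0, (1,2] -> dim 1, ..., (4,5] -> dim 4."""
    h = np.zeros(m)
    idx = min(max(int(np.ceil(r)) - 1, 0), m - 1)
    h[idx] = 1.0
    return h

rating_to_onehot(3.78)  # -> array([0., 0., 0., 1., 0.]), matching h_ui above
```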

4.2.2 RTGAN for Target Review Generation

A good number of generative methods have been proposed for text generation in recent years, such as seq2seq [33]-based models, SeqGAN [39], and RankGAN [22]. Since the reviews are usually long (with an average length > 40 words), we adopt the state-of-the-art LeakGAN [10] model to generate reviews in this paper and extend it by incorporating the three types of condition information into both the generator and the discriminator.

Conditional generator Starting from a random state, LeakGAN generates texts via the adversarial generation of synthetic texts against real texts. This implies that, if we simply adopted LeakGAN in our model, the generated reviews would only be ensured to be written in a human-like style. However, we need to generate the target review that a specific user would write for a specific item.

In order to provide additional information for guiding the target review generation, we combine LeakGAN with the conditional GAN by taking the three types of information as the condition of the generator in LeakGAN. We call the combination of these three types of information the condition vector \({\mathbf {c}}_{ui}\), defined as:

$$\begin{aligned} \small {\mathbf {c}}_{ui} = {\mathbf {g}}_{u}\oplus {\mathbf {g}}_{i}\oplus ({\mathbf {W}}_{r}*{\mathbf {h}}_{ui}), \end{aligned}$$
(8)

where \({\mathbf {W}}_{r}\) is a mapping matrix that transforms the sparse \({\mathbf {h}}_{ui}\) into a dense vector, and \(\oplus\) denotes vector concatenation.

Similar to many text generation methods [10, 25], we employ a decoder GRU to iteratively generate a review word by word. Different from these methods, the decoder layer in our RTGAN module is conditioned on \({\mathbf {c}}_{ui}\), which is the combination of three types of information. By doing so, our generator produces a synthetic target review that reflects not only the user u’s preferences but also the item i’s features. Moreover, the sentiment contained in the synthetic review is also forced to match the rating score.

To ensure that the condition information is maintained during the generation process, the condition vector \({\mathbf {c}}_{ui}\) is concatenated with the word vector before being fed into the decoder GRU at each time step. Suppose \({\mathbf {x}}_t\) is the embedding of the current word being processed at time step t; the concatenated vector \({\mathbf {x}}_{t}^{'}={\mathbf {c}}_{ui}{\oplus }{\mathbf {x}}_{t}\) is input into the decoder GRU to obtain the hidden state \({\mathbf {h}}_{t}\). The hidden state \({\mathbf {h}}_{t}\) is then multiplied by an output projection matrix and passed through a softmax over all the words in the vocabulary to obtain the probability of each word in the current context. Finally, the output word \(y_{t}\) at time t is sampled from the multinomial distribution given by the softmax layer.
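A minimal sketch of one conditioned decoding step follows; it assumes a torch.nn.GRUCell and a linear output projection, which may differ from the actual implementation:

```python
import torch
import torch.nn.functional as F

def decode_step(gru_cell, out_proj, c_ui, x_t, h_prev):
    """One step of the conditioned decoder: x'_t = c_ui (+) x_t enters the GRU,
    and the next word y_t is sampled from a softmax over the vocabulary."""
    x_cat = torch.cat([c_ui, x_t], dim=-1)         # x'_t: condition vector concat word vector
    h_t = gru_cell(x_cat, h_prev)                  # hidden state h_t
    probs = F.softmax(out_proj(h_t), dim=-1)       # distribution over the vocabulary
    y_t = torch.multinomial(probs, num_samples=1)  # sample the output word
    return y_t, h_t
```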

The difference between the generator in our RTGAN module and that in LeakGAN is that our generator is conditioned on the additional information as discussed above. For learning, we follow the generator training method in LeakGAN [10] by adopting a hierarchical architecture to effectively generate long texts.

Briefly, the hierarchical generator G consists of a high-level MANAGER module and a low-level WORKER module. At each step, the MANAGER receives a leaked feature vector \({\mathbf {f}}_t\) (the last layer of the discriminator D) and uses \({\mathbf {f}}_t\) to form the guiding goal vector \({\mathbf {g}}_t\) for the WORKER module. Compared to the scalar classification probability of D, the leaked feature vector \({\mathbf {f}}_t\) is a much more informative guiding signal for G, since it indicates the position of the currently generated words in the extracted feature space.

The loss for the MANAGER module is defined as:

$$\begin{aligned} L_{G_M} = -\sum \limits _{t=1}^{T} Q({\mathbf {f}}_{t}, {\mathbf {g}}_{t})\cdot d_\mathrm{cos}({\mathbf {f}}_{t+c}-{\mathbf {f}}_{t}, {\mathbf {g}}_{t}), \end{aligned}$$
(9)

where \(Q({\mathbf {f}}_{t},{\mathbf {g}}_{t})\) is the expected reward (the classification probability output by D) under the current policy, \(d_\mathrm{cos}\) represents the cosine similarity between the change of the discriminator’s leaked feature representation after the c-step transition (from \({\mathbf {f}}_{t}\) to \({\mathbf {f}}_{t+c}\)) and the goal vector \({\mathbf {g}}_{t}\), and T is the maximum sequence length we set for a review. The loss function forces the goal vector to match the transition in the feature space while achieving a high reward. Meanwhile, the loss for the WORKER module is defined as:

$$\begin{aligned} \small L_{G_W} = - \sum \limits _{t=1}^{T} r_{t}^{I}\cdot p(y_t|s_{t-1}, {\mathbf {c}}_{ui}), \end{aligned}$$
(10)

where \(p(y_t|s_{t-1}, {\mathbf {c}}_{ui})\) denotes the conditional generative probability of the next token \(y_t\) given the sequence \(s_{t-1} = [y_0, y_1,\ldots ,y_{t-1}]\) and the condition vector \({\mathbf {c}}_{ui}\) in the WORKER module, and \(r_{t}^{I}\) is the intrinsic reward defined as:

$$\begin{aligned} \small r_{t}^{I} = \frac{1}{c} \sum \limits _{i=1}^{c} d_\mathrm{cos} ({\mathbf {f}}_{t}-{\mathbf {f}}_{t-i},{\mathbf {g}}_{t-i}) \end{aligned}$$
(11)

The objective of G is to minimize \(L_{G_M}\) and \(L_{G_W}\); the two modules are trained alternately, with one fixed while the other is updated.

Conditional discriminator The discriminator learns to distinguish the ground-truth review \(d_{ui}\) from the synthetic one \(s_{ui}\). We adopt the same CNN structure as in the generator to process review texts, obtaining the embeddings \({\mathbf {d}}_{ui}\) for \(d_{ui}\) and \({\mathbf {s}}_{ui}\) for \(s_{ui}\), respectively. Different from a standard discriminator that only distinguishes the real sample from the synthetic one, our discriminator needs to determine whether the review is related to the user and the item, and whether the review was written by the user for this item. Therefore, we take the condition information \({\mathbf {c}}_{ui}\) into account in the discrimination as well. The loss for the discriminator D is defined as:

$$\begin{aligned} \small L_D = - (\log (D({\mathbf {d}}_{ui}|{\mathbf {c}}_{ui})) + \log (1-D({\mathbf {s}}_{ui}|{\mathbf {c}}_{ui}))), \end{aligned}$$
(12)

where \(D(\cdot )\) is the probability function computed by applying a softmax layer to the concatenation of \({\mathbf {d}}_{ui}\) (or \({\mathbf {s}}_{ui}\)) and \({\mathbf {c}}_{ui}\). The objective of D is to maximize the probability of classifying the ground-truth review as positive and to minimize the probability of classifying the synthetic one as authentic.

Training G and D in the RTGAN module is an adversarial process. The goal of the generator is to produce synthetic reviews indistinguishable enough to fool the discriminator, while the discriminator aims to separate synthetic from ground-truth reviews as accurately as possible. Hence, we iteratively train G and D to reach an equilibrium. The procedure for generating a target review \({\hat{s}}_{ui}\) is illustrated in Algorithm 1.

Algorithm 1 Generating the target review \({\hat{s}}_{ui}\)

In Algorithm 1, we generate the words of a sentence one by one. We first get the word embedding \(\mathbf {x}_t\) in line 2 and concatenate it with the condition vector \(\mathbf {c}_{ui}\) to get a new word embedding \(\mathbf {x}_{t}^{'}\) in line 3. Lines 4–7 follow the standard generation procedure in LeakGAN. Specifically, in line 4, the WORKER module takes the new word embedding \(\mathbf {x}_{t}^{'}\) as input and, using an LSTM, outputs a matrix \(\mathbf {O}_t\) which represents the current vectors for all words. Line 5 sends the current sentence \(S_t = [x_1, x_2,\ldots ,x_t]\) into the discriminator and gets the leaked feature vector \(\mathbf {f}_t\). In line 6, the MANAGER, also implemented by an LSTM, takes the extracted feature vector \(\mathbf {f}_t\) as input and outputs a goal vector \(\mathbf {g}_t\), which is in turn fed into the WORKER module to guide the generation of the next word in line 7. Line 8 obtains the next word \(x_{t+1}\). Finally, line 9 returns the entire sentence.
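The loop below sketches Algorithm 1 as just described. The module interfaces (embed, worker, manager, leak_features, goal_proj) are assumed callables standing in for the word embedding table, the WORKER LSTM, the MANAGER LSTM, the discriminator’s leaked feature extractor, and the goal projection; they are not the paper’s exact signatures.

```python
import torch
import torch.nn.functional as F

def generate_review(embed, worker, manager, leak_features, goal_proj,
                    c_ui, bos_id, T=40):
    x_t = torch.tensor([bos_id])                            # start token
    h_w = h_m = None
    sentence = [x_t]
    for _ in range(T):
        x_cat = torch.cat([c_ui, embed(x_t).squeeze(0)], dim=-1)  # lines 2-3
        O_t, h_w = worker(x_cat, h_w)                       # line 4: WORKER output matrix O_t
        f_t = leak_features(torch.cat(sentence))            # line 5: leaked features f_t from D
        g_t, h_m = manager(f_t, h_m)                        # line 6: MANAGER emits goal g_t
        logits = O_t @ goal_proj(g_t)                       # line 7: goal guides word scores
        x_t = torch.multinomial(F.softmax(logits, -1), 1)   # line 8: sample the next word
        sentence.append(x_t)
    return torch.cat(sentence)                              # line 9: the full sentence
```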

4.3 Neural Latent Factor Model (NLFM) Module

Inspired by the neural latent factor models in [4, 12], we propose our NLFM module by extending these neural models in the following ways. Firstly, we represent the general latent factors of the user and the item based merely on historical reviews, without ratings. Secondly, we additionally exploit the special latent factors which encode the user’s preference on the item expressed in the target review.

Specifically, the embeddings of user preferences and item features, i.e., \(\mathbf {g}_u\) and \(\mathbf {g}_i\), are passed from the RTGAN module, and we map them through a hidden layer to obtain the general latent factors of the user and the item. To obtain the special latent factors, we transform the target review \(d_{ui}\) (\(s_{ui}\) at test time) through a CNN structure and a hidden layer as follows:

$$\begin{aligned} \small {\mathbf {p}}_u = \tan h({\mathbf {W}}_{su}*\text {CNN}(d_{ui})+b_{su}), \end{aligned}$$
(13)
$$\begin{aligned} \small {\mathbf {p}}_i = \tan h({\mathbf {W}}_{si}*\text {CNN}(d_{ui})+b_{si}), \end{aligned}$$
(14)

where \(\text {CNN}()\) is a convolutional neural network that maps the target review \(d_{ui}\) into a feature vector, \({\mathbf {W}}_{su}\) and \({\mathbf {W}}_{si}\) are projection matrices, and \(b_{su}\) and \(b_{si}\) are biases.

Combining the general and special latent factors together, we can obtain the user’s and item’s overall representations:

$$\begin{aligned} \small {\mathbf {f}}_{u} = \tan h({\mathbf {W}}_{gu}*{\mathbf {g}}_{u}) + \tan h({\mathbf {W}}_{pu}*{\mathbf {p}}_{u}), \end{aligned}$$
(15)
$$\begin{aligned} \small {\mathbf {f}}_{i} = \tan h({\mathbf {W}}_{gi}*{\mathbf {g}}_{i}) + \tan h({\mathbf {W}}_{pi}*{\mathbf {p}}_{i}), \end{aligned}$$
(16)

where \({\mathbf {W}}_{gu}\), \({\mathbf {W}}_{pu}\), \({\mathbf {W}}_{gi}\), \({\mathbf {W}}_{pi}\) are weight matrices.

We then pass these two overall representations \({\mathbf {f}}_{u}\) and \({\mathbf {f}}_{i}\) to a prediction layer to get a real-valued rating \({\hat{r}}_{ui}\):

$$\begin{aligned} \small {\hat{r}}_{ui} = {\mathbf {f}}_{u}^\mathrm{T} {\mathbf {f}}_{i} + b_{u} + b_i +b, \end{aligned}$$
(17)

where \(b_u\), \(b_i\), and b denote the user bias, item bias, and global bias, respectively. Clearly, our predicted rating \({\hat{r}}_{ui}\) encodes the general user interests and item features as well as the user’s specific interest in this item.

Since rating prediction is actually a regression problem, a commonly used squared loss is adopted as the objective function for our NLFM module:

$$\begin{aligned} \small L_r = \sum \limits _{u\in {\mathcal {U}},\, i\in {\mathcal {I}}}({\hat{r}}_{ui}-r_{ui})^{2}, \end{aligned}$$
(18)

where \({\mathcal {U}}\) and \({\mathcal {I}}\) denote the user and item sets, respectively, and \(r_{ui}\) is the ground-truth rating assigned by u to i.
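The NLFM computation of Eqs. (13)–(18) can be sketched as follows; the CNN review encoder, embedding sizes, and module names are illustrative assumptions rather than the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class NLFM(nn.Module):
    def __init__(self, k, n_users, n_items, review_cnn):
        super().__init__()
        self.cnn = review_cnn                                    # maps d_ui (or s_ui) to a k-dim vector
        self.W_su, self.W_si = nn.Linear(k, k), nn.Linear(k, k)  # Eqs. (13)-(14); biases are b_su, b_si
        self.W_gu = nn.Linear(k, k, bias=False)                  # Eq. (15)
        self.W_pu = nn.Linear(k, k, bias=False)
        self.W_gi = nn.Linear(k, k, bias=False)                  # Eq. (16)
        self.W_pi = nn.Linear(k, k, bias=False)
        self.b_u = nn.Embedding(n_users, 1)                      # user bias
        self.b_i = nn.Embedding(n_items, 1)                      # item bias
        self.b = nn.Parameter(torch.zeros(1))                    # global bias

    def forward(self, u, i, g_u, g_i, target_review):
        feat = self.cnn(target_review)                           # CNN(d_ui)
        p_u = torch.tanh(self.W_su(feat))                        # Eq. (13)
        p_i = torch.tanh(self.W_si(feat))                        # Eq. (14)
        f_u = torch.tanh(self.W_gu(g_u)) + torch.tanh(self.W_pu(p_u))  # Eq. (15)
        f_i = torch.tanh(self.W_gi(g_i)) + torch.tanh(self.W_pi(p_i))  # Eq. (16)
        return ((f_u * f_i).sum(-1) + self.b_u(u).squeeze(-1)
                + self.b_i(i).squeeze(-1) + self.b)              # Eq. (17)

# Training uses the squared loss of Eq. (18):
# loss = ((model(u, i, g_u, g_i, d_ui) - r_ui) ** 2).sum()
```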

4.4 Training and Prediction

We train the RTGAN and NLFM modules iteratively. Since the two modules share the parameters of the historical review encoder layer, these parameters are updated in both phases.

At validation and test time, we first compute a basic rating using the user’s and item’s embeddings saved in NLFM after training. We then input this basic rating as a condition to RTGAN to generate the synthetic target review. Finally, the generated review is fed into NLFM to get the final rating score. Note that although we add the RTGAN module in order to generate and utilize the synthetic review, the rating prediction task in our AGTR model can still be performed offline, like MF methods. The procedure for the entire AGTR model is presented in Algorithm 2.

Algorithm 2 The overall training and prediction procedure of AGTR

In Algorithm 2, line 1 initializes the parameters and line 2 pre-trains the generator. Lines 3–5 train the NLFM module, and lines 10–16 train the RTGAN module, where the generator G is trained in lines 10–13 and the discriminator D in lines 14–16. Line 18 inputs the generated \({\hat{s}}_{ui}\) into the NLFM module to get the final prediction \({\hat{r}}_{ui}\). Finally, line 19 returns the generated target reviews and the predicted ratings.
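At a high level, the test-time steps (line 18 of Algorithm 2 together with the procedure described above) can be sketched as follows. The helpers basic_rating and generate_review are hypothetical wrappers around NLFM’s saved embeddings and the RTGAN generator, and how the basic rating is formed without a target review is our assumption; rating_to_onehot is the helper sketched in Sect. 4.2.1.

```python
def predict_rating(u, i, nlfm, basic_rating, generate_review, g_u, g_i):
    """Test-time pipeline of AGTR (a sketch under the stated assumptions)."""
    r_basic = basic_rating(u, i)             # 1) basic rating from saved embeddings
    h_ui = rating_to_onehot(r_basic)         # 2) discretize as in Sect. 4.2.1
    s_ui = generate_review(g_u, g_i, h_ui)   # 3) RTGAN produces the synthetic review
    return nlfm(u, i, g_u, g_i, s_ui)        # 4) final rating from NLFM (Eq. 17)
```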

Finally, we analyze the time complexity of our AGTR model. For the NLFM module, the time complexity of calculating the predicted rating is O(Td), where T and d are the fixed review length and embedding size, respectively. For the RTGAN module, we encode the user’s and the item’s historical reviews to generate the target review; hence, its time complexity is O(CTd), where \(C=\max \{\Vert d_{u}\Vert , \Vert d_{i}\Vert \}\), and \(\Vert d_{u}\Vert\) and \(\Vert d_{i}\Vert\) are the numbers of reviews of the user and the item, respectively. We train the AGTR model over \({|\mathcal{R}|}\) training instances, where \({|\mathcal{R}|}\) denotes the number of user-item pairs. Therefore, the overall time complexity of training is \(O({|\mathcal{R}|}CTd)\).

5 Experiments

5.1 Experimental Setup

Datasets We conduct experiments on two publicly accessible data sources: Amazon product reviews and Yelp 2017. We use three product categories from Amazon: Patio, Lawn and Garden; Automotive; and Grocery and Gourmet Food. We take the 5-core version for experiments, following the previous studies [4, 6, 35]; in this version, each user or item has at least 5 interactions. For all datasets, we extract the textual reviews as well as the numerical ratings. The basic statistics of the datasets are shown in Table 2.

Table 2 Statistics of the datasets


5.2 Evaluation Metrics

For rating prediction, we employ Mean Absolute Error (MAE) [21] and Mean Squared Error (MSE) [25, 35, 37] as evaluation metrics. For review generation, we report results in terms of negative log-likelihood (NLL) [10, 39], perplexity [25], and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [21, 37]. All these metrics are widely used in text generation and recommender systems.

MAE and MSE measure the prediction accuracy for continuous variables (in our case, ratings); they are defined as follows.

$$\begin{aligned} {\text{MAE}} = \frac{1}{m}\sum _{i=1}^{m}|(r_i-{\hat{r}}_i)| \end{aligned}$$
(19)
$$\begin{aligned} {\text{MSE}} = \frac{1}{m}\sum _{i=1}^{m}{(r_i-{\hat{r}}_i)}^{2}, \end{aligned}$$
(20)

where m is the number of user-item interactions, \(r_i\) is the ground-truth rating, and \({{\hat{r}}_i}\) is the predicted rating. For MAE and MSE, smaller values indicate better prediction performance.

NLL is defined as follows:

$$\begin{aligned} \small \mathrm{NLL} = -\sum _{i=1}^{T} p(y_i)\log (q(x_i)), \end{aligned}$$
(21)

where T is the length of the generated review, \(p(y_i)\) is the distribution probability of word \(y_i\) in the real data, and \(q(x_i)\) is the generation probability of word \(x_i\).

The perplexity metric is defined as the exponential of the average negative log-likelihood per word:

$$\begin{aligned} \small {Perplexity} = e^{\frac{-\sum \log (p(w))}{N}}, \end{aligned}$$
(22)

where p(w) is the probability of word w in the test set and N is the total number of words. Lower perplexity implies higher log-likelihood and indicates a better language model.

ROUGE is a classic metric in text summarization and machine translation. We adopt it to evaluate the quality of the generated reviews. There are three commonly used variants: ROUGE-1, ROUGE-2, and ROUGE-L, defined as follows.

$$\begin{aligned} \small \mathrm{ROUGE}\text{-}N({\hat{s}}_{ui}) = \frac{\sum _{ng\in s_{ui}}{\text {Count}}_\mathrm{match}(ng(s_{ui}, {\hat{s}}_{ui}))}{\sum _{ng\in s'}{\text {Count}}(ng(s'))}, \end{aligned}$$
(23)

where ng is the n-gram and \({\text {Count}}(ng(s'))\) is the number of n-grams in \(s'\) (either \({\hat{s}}_{ui}\) or \(s_{ui}\)). For ROUGE-1 and ROUGE-2, \({\text {Count}}_\mathrm{match}(ng(s_{ui}, {\hat{s}}_{ui}))\) is the number of uni-grams and bi-grams, respectively, co-occurring in \(s_{ui}\) and \({{\hat{s}}_{ui}}\). For ROUGE-L, it denotes the length of the longest common subsequence of \(s_{ui}\) and \({\hat{s}}_{ui}\). We report ROUGE-1, ROUGE-2, and ROUGE-L scores to evaluate the quality of the generated reviews at different granularities.
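As a concrete illustration, a minimal recall-oriented ROUGE-N implementation might look like the following sketch; actual evaluations typically rely on standard ROUGE toolkits.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Recall-oriented ROUGE-N: clipped n-gram overlap normalized by the
    number of n-grams in the reference (ground-truth) review."""
    ngrams = lambda toks: Counter(zip(*[toks[i:] for i in range(n)]))
    ref, cand = ngrams(reference), ngrams(candidate)
    match = sum((ref & cand).values())      # Count_match: clipped co-occurrences
    return match / max(sum(ref.values()), 1)

rouge_n("the mower works great".split(),
        "the mower is great".split())       # -> 0.75 (3 of 4 reference uni-grams matched)
```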

5.3 Compared Methods

We compare our AGTR model with the following state-of-the-art methods.

SentiRec [14] first encodes each review into a fixed-size vector using CNN and then generates recommendations using vector-encoded reviews.

MPCN [35] exploits a review-level co-attention mechanism to determine the most informative reviews and obtain the representations of users and items.

A\(^{3}\)NCF [6] designs a new topic model to extract user preferences and item characteristics from review texts and then feeds them into a neural network for rating prediction.

ALFM [7] develops an aspect-aware latent factor model where a new topic model is integrated to model user preferences and item features from different aspects.

NARRE [4] processes each review using CNN and adopts an attention mechanism to build the recommendation model and select useful reviews simultaneously.

TARMF [24] adopts attention-based RNN to extract textual features and maximizes the similarity between latent factors and textual features.

MT [25] jointly learns to perform rating prediction and recommendation explanation by combining MF for rating prediction with SeqGAN [39] for review generation.

NRT [21] uses MF and generation networks to combine ratings, reviews, and tips for rating prediction and abstractive tips generation.

NM [37] uses a single neural network to model users and products, and generates customized product representations using a deep memory network, from which customized ratings and reviews are constructed jointly.

CAML [5] uses an encoder–selector–decoder architecture to model cross-knowledge transfer for both the recommendation task and the explanation task under a multi-task framework.

In addition to the above baselines, we evaluate two variants of MT and of our AGTR model. Specifically, MT-lg replaces SeqGAN [39] in the review generation module of MT [25] with the LeakGAN [10] used in our model, to exclude the potential influence of using different generation models. AGTR-r removes the rating condition from the generation module of our AGTR model to investigate the effect of the rating tailored GAN.

We do not compare our model with other review-based rating prediction methods such as DeepCoNN [42] and TransNet [3], nor with traditional methods such as NMF [19], FM [31], and NeuMF [12] which do not use reviews. These methods have been shown to be weaker than the baselines [7, 24, 35] used in our experiments; thus, we only report improvements over the baselines.

5.4 Parameter Settings

Each dataset is divided into 80%/10%/10% splits for training, validation, and testing, respectively. We train the model on the training set and tune the hyper-parameters on the validation set. The ground-truth reviews in the training set are used to train the model. Note that those in the validation and testing sets are never accessed; instead, only the generated target reviews are used for validation and testing.

The parameters of all baselines are the same as those in the corresponding original papers. For our AGTR model, we set the dimensionality to 32 for all embeddings of users, items, and word latent factors. In review generation, the maximum review length T is set to 40 words, and other parameters such as the kernel size of the CNN are the same as those in LeakGAN. We use Adam [16] for optimization, with learning rate 0.002, minibatch size 64, and dropout ratio 0.5 for all the datasets.

5.5 Rating Prediction

The results of all methods for rating prediction are presented in Table 3: (1) the upper six rows, from SentiRec to TARMF, are the first type of review-aware rating prediction methods, which do not generate target reviews; (2) the middle five rows, from MT to CAML, are the second type, which generate target reviews/tips; (3) the last two rows are our AGTR model and its variant. From Table 3, we make the following observations.

Table 3 Rating prediction performance in terms of MAE and MSE

Firstly, our AGTR model statistically significantly outperforms all baselines in terms of the MAE and MSE metrics on three of the four datasets. The baselines’ performances fluctuate across datasets: MPCN, ALFM, and CAML each rank second best in some cases. This shows that it is hard for any single method to achieve consistently better performance, owing to the characteristics of the different datasets. In contrast, our model achieves the best performance on the Garden, Automotive, and Yelp datasets. CAML is the best on Grocery; however, the difference between our model and CAML on this dataset is not significant. All these results clearly demonstrate the effectiveness of our model.

Secondly, among the six methods of the first type, ALFM and NARRE are generally better than the others. Both methods differentiate the importance of each review or each aspect, which suggests that a fine-grained analysis of the reviews has a great impact on the rating prediction task. Among the five methods of the second type, CAML benefits substantially from the joint training of the two tasks under the multi-task framework. Moreover, NM performs better than MT and NRT, which only generate but do not integrate target reviews for rating prediction. Both observations clearly show the predictive ability of target reviews. Our AGTR model’s superior performance over NM can be attributed to our carefully designed NLFM module, which makes the best use of the target review, and to the higher quality of our generated reviews obtained with the help of rating tailored adversarial learning.

Thirdly, MT-lg is better than the original MT, suggesting the importance of the underlying generative model. On the other hand, AGTR-r performs worse than AGTR, showing that the rating condition plays a critical role in generating reviews consistent with the rating scores. However, the enhanced MT-lg is still worse than our simplified variant AGTR-r, indicating that our NLFM module performs much better than the matrix factorization model in MT. NRT is designed for abstractive tips generation, which explains its inferior performance.

5.6 Review Generation

This section evaluates the performance of our AGTR model on review generation by comparing it with the baselines of the second type. The results in terms of NLL, perplexity, and ROUGE are presented in Tables 4, 5, and 6, respectively. Recall that for NLL and perplexity, smaller scores indicate better performance, whereas for ROUGE, larger scores are better. The best results are in bold, and the second best ones (excluding our AGTR-r variant) are italicized.

Table 4 Review generation performance in terms of NLL
Table 5 Review generation performance in terms of Perplexity

From Tables 4, 5, and 6, it is clear that our AGTR model generates the best or the second best reviews in terms of the NLL, perplexity, and ROUGE metrics on all datasets. Moreover, AGTR-r’s results are not as good as those of AGTR. This, once again, demonstrates that our strategy of taking the rating as a condition in GAN helps generate high-quality reviews.

Among the baselines, CAML is the second best in terms of the NLL and ROUGE metrics in most cases. This suggests that CAML can generate good reviews with the help of supervision from the rating subtask under the multi-task learning framework. However, under the perplexity metric, the second best methods are MT-lg and NRT. We conclude that none of these methods always performs best in terms of all metrics on all datasets, because the metrics focus on different factors. Also note that the perplexity values are quite large. This is probably caused by the removal of stop words from reviews, which breaks the continuity between adjacent words. Finally, we find that both NRT and NM perform relatively poorly. The reason might be that they only adopt maximum likelihood estimation to generate reviews without exploiting an adversarial network. On the other hand, MT-lg is better than MT, indicating that LeakGAN performs better than SeqGAN.

Table 6 Review generation performance in terms of ROUGE

5.7 Case Study

To capture more details, we provide several examples in Table 7 to analyze the relevance between the generated synthetic reviews/ratings and the real ones.

Table 7 Examples of the predicted ratings and the generated reviews (Ref. denotes the ground-truth review and rating)

As can be seen, our AGTR model produces the highest rating score, i.e., 4.85, which is very close to the real score. Furthermore, our generated review suitably expresses the strong positive sentiment reflected by the full score, and it is the most similar to the real review. We also need to point out that the words in the latter half of our generated review are not very accurate. This also happens in other generated reviews. The reason is that the sentences in the training data have varying lengths, but when training the model, all sentences must have a fixed length. Hence, long sentences are truncated, and short sentences are padded with random words after the original ones. That is to say, the latter part of some training sentences is not semantically related, and consequently the network is unable to generate accurate words for the latter part of a sentence.

5.8 Parameter Analysis

In this subsection, we investigate the effects of two parameters: the number of latent factors and the maximum review length. We first examine the effect of the latent factor size in Fig. 2a, b. We can see that, as the number of latent factors increases, the performance improves, since more latent factors bring better representation capability. The model achieves the best performance when the number of latent factors is 32 on almost all datasets. However, too many latent factors, e.g., 64, may cause over-fitting and degrade performance.

Fig. 2 Performance with different numbers of latent factors and maximum review lengths

We then study the effect of the maximum review length in Fig. 2c, d. When the maximum length is small, the part of the text that exceeds the specified length is truncated during preprocessing, which results in information loss: the smaller the specified length, the more information is missing, and thus the performance decreases. When the maximum length increases, reviews shorter than the threshold must be padded, and the irrelevant padded words bring noise to the model, which harms its performance. Hence, a reasonable maximum review length is about 40 words.

6 Conclusion

In this paper, we presented a novel AGTR model to leverage the predictive ability of target reviews. We developed a unified framework that generates target reviews using a rating tailored GAN and performs rating prediction with a neural latent factor model that exploits the generated target review in addition to historical reviews. We conducted extensive experiments on four real-world datasets. The results demonstrate that our model achieves state-of-the-art performance on both the rating prediction and review generation tasks.

As for future work, one possible direction is to generate target reviews of variable length. The second is to enhance the interaction between the two modules under a multi-task framework. The third is to develop a new approach for review generation instead of extending LeakGAN, which might be explored as a separate problem rather than as a component of our rating prediction task.