1 Introduction

Reading news to stay up-to-date with the latest information has been an integral part of human life. In the modern day, the World Wide Web provides us with abundant online news resources, enabling us to keep up with current events. However, due to the large number of news articles and the proliferation of news websites, users may feel overwhelmed when deciding what to read and where to read it. As a result, online news services such as Google NewsFootnote 1 and Bing NewsFootnote 2 try to solve this problem by aggregating many news sources and generating a personalized reading list for each user based on their preferences. This strategy of recommending news tailored to each user has proven an effective way to serve user reading interests [1,2,3,4,5].

However, news recommendation poses several challenges compared to traditional recommendation problems. First and foremost, unlike movies or shopping items, news articles are time-sensitive items. The value and relevance of a news article deteriorate quickly over a short period of time because fresh news itemsFootnote 3 arrive frequently. Due to this time-sensitive property, traditional methods like collaborative filtering [6], which depend on the identity (ID) of users or items, do not work efficiently. Second, the content of a news article contains dense textual data, which also encodes the latent preferences of the users. For instance, certain individuals may read only sports news and discard the rest; this is a strong signal that the “Sports” category encodes one of their long-term preferences. As another example, some users sporadically click on the latest news whose title or content relates to celebrities, a behaviour that shows a strong short-term interest signal, in which the article’s content exhibits certain words or knowledge patterns. This second problem prompts a strong need to mine a user’s reading history to infer her latent preferences, whether long-term, short-term, or a mixture of both. Other aspects, such as diversity, also play a significant role in news recommendation. Users should be able to find the types of news they are highly interested in, but also be able to explore other news that may pique their curiosity. Such diversity can significantly improve users’ satisfaction and retain their loyalty to online news services. Essentially, the core tasks in solving the news recommendation problem are i) capturing a user’s preferences from their reading history and ii) understanding all of the signals from a news article.

Figure 1 illustrates a scenario where understanding the different contexts between a user and a news item is very important. In this scenario, each news item has several features, such as the category, the knowledge entities inside the news, and the title. Each user selects different news items among the articles shown to them. It is clear that User-1 only selects items in the Sports category, which defines her long-term preferences. User-2 and User-3 demonstrate different behaviours: User-2 pays attention to news items that contain knowledge entities about countries, while User-3 only reads the latest news. Clearly, these behaviours reflect their recent preferences.

Figure 1: Example of a news recommendation scenario. The horizontal axis shows multi-aspect properties of each news article. The vertical axis shows interactions of each user. Each user makes different choices depending on the properties of the news articles that they pay attention to.

Given these challenges, researchers and industry partners resort to deep learning, and also spend a significant amount of time collecting the right datasets to facilitate the development of news recommendation. In this paper, we address the news recommendation problem by proposing a novel deep learning model named CupMar, which learns both a contextual user profile and a news article content representation. The main components of CupMar are i) the News Encoder (NE) and ii) the User-Profile Encoder (UE). The NE infers the representation of a news article from its important properties such as the category, title, and abstract content. Self-attention and attention mechanisms are used to learn the news content effectively. In addition, following recent successes in using knowledge entities for the news recommendation task [7], we also enrich the learning of the news representation by adding knowledge entities taken from the WikiData knowledge graphFootnote 4 to the feature list of the NE. The UE contains two submodules: the Long-term Preferences latent Extractor (LPE) and the Recent Preferences latent Extractor (RPE). We are strongly motivated by the observation that the reading history of a user encodes both her long-term and her current interests. Thus, by using both the LPE and RPE submodules, we can learn the representation of the user’s contextual profile. The news representation from the NE’s output and the user-profile representation from the UE’s output are used together to calculate an interaction score, which helps us identify highly relevant candidate news items for each user. Hence, online news services can recommend a ranked list of suitable news articles to their users, thereby improving their recommendation quality and increasing user satisfaction.

We perform extensive experiments on the Microsoft News Dataset (MIND) [8], and the results show that our approach improves the performance of the news recommendation task. Our source code is also available online for reproducibility purposesFootnote 5. In a nutshell, the main contributions of this paper are as follows:

  • We introduce a novel deep learning model, CupMar, to solve the news recommendation challenge. CupMar leverages its major components, NE and UE, to learn user and news article representations, and uses the Score Rating component to rank relevant articles and recommend them to users.

  • We propose two strategies to infer user and news article representations with the CupMar model’s main components. The NE component uses the multi-aspect properties of a news article and an ensemble of advanced neural network layers to accurately learn a news article representation. The UE component looks at a user’s news reading history and learns her contextual profile, including long-term and temporary preferences, to derive the user representation.

  • We conduct extensive experiments with the CupMar model on the popular MIND dataset. The CupMar model shows state-of-the-art performance against all the baselines, thus demonstrating the effectiveness of our approach.

This work is an extended version of our previous work accepted at the 22nd International Conference on Web Information Systems Engineering (WISE 2021) [9]. Compared to the previous work, we provide a more in-depth explanation of the enhanced CupMar approach, a more thorough discussion of the literature, additional experiments on a new dataset, and a detailed elaboration on the evaluation process. We have also made the source code of this research publicly available to the research community. The rest of this paper is organized as follows. In Section 2, we discuss related work on the news recommendation problem. We then introduce the CupMar model design in Section 3, and describe the technical details of the News Encoder and the User-Profile Encoder in Sections 4 and 5, respectively. The experimentation and evaluation are described in Section 6. Finally, we provide concluding remarks in Section 7.

2 Related works

News recommendation is a popular and essential task in the fields of natural language processing (NLP) and recommender systems [1, 10,11,12]. A number of online businesses rely heavily on this task to tailor personalized experiences for millions of users [5, 13,14,15]. The main approach to solving the news recommendation problem is to accurately learn the news article and user representations [16]. Hence, several popular works rely on different feature engineering strategies to build their own news article and user representations [4, 5, 17,18,19,20,21,22]. For example, Latent Dirichlet Allocation (LDA) has been used to generate topic distribution features to infer a news representation for each session, with the user representation inferred from all the news in her session [23]. Another noteworthy method is the Explicit Localized Semantic Analysis (ELSA) proposed by Son et al. [24] for location-based news recommendation, where location and topic signals are calculated from Wikipedia posts as the news representation. Nevertheless, the downside of manual feature engineering is its dependence on expert domain knowledge, which is not always available. Additionally, traditional NLP methods do not incorporate word context and word order well enough to derive semantic meaning and learn user and news article representations effectively [10].

Owing to the popularity of deep learning methods, there have been many efforts in recent years to address the aforementioned issues in news recommendation [3, 4, 17]. For instance, the work of Wang et al. [7] infers the news item representation from the news title using a knowledge-entity-aware method with a Convolutional Neural Network (CNN) layer, and learns the user representation from her browsing history. Another approach, by Okura et al. [5], takes advantage of the denoising autoencoder [25] to learn the news article representation; they combine this technique with a Gated Recurrent Unit (GRU) neural network layer to learn the user representation from her history records. A recent deep learning approach is proposed by Wu et al. [26], where the attention mechanism is applied at the word level and the news level to learn a news article representation, and the user’s ID embedding also serves as a query vector for these attentions.

Additionally, as recommendation engines become more and more relevant in modern services, researchers also combine different neural approaches to make use of several information signals within news articles. For instance, the authors of [27] propose to use a knowledge graph to enhance and distil signals from the document representation, and show improved performance. In another work, the authors of [28] leverage implicit negative feedback from user interactions based on reading time and clicks, resulting in improved accuracy.

Our proposed CupMar model also takes advantage of deep neural networks to solve the news recommendation problem. However, the most prominent features that distinguish our model from the aforementioned models are:

  • The utilization of multi-aspect properties in each news article, where each property is first encoded differently and then merged to derive the final news representation;

  • The combination of both long-term and short-term interactions to infer the user representation. We rely on an ensemble of multiple advanced neural network mechanisms to automatically capture the similarity between a user representation and a candidate news article representation.

3 The CupMar model

In this section, we briefly introduce the CupMar (Contextual User-Profile and Multi-aspect Articles Representation) model, as shown in Figure 2. The CupMar model comprises two major components. The first component is the NE (News Encoder), which uses multiple neural network mechanisms on a news article’s multi-aspect properties to learn its representation in the form of a news vector. The second component is the UE (User-Profile Encoder), which is further divided into two submodules: the LPE (Long-term Preferences latent Extractor) and the RPE (Recent Preferences latent Extractor). The LPE is responsible for understanding a user’s long-term latent preferences, while the RPE is responsible for extracting temporary preferences from a user’s reading history. The latent vectors of the LPE and RPE are concatenated to form a contextual user-profile vector. Finally, a Score Rating component takes both the candidate news vector and the contextual user-profile vector as inputs to predict the interaction score between these two entities.

Figure 2: The overall design of the CupMar model. CupMar has two main components. The first component is the NE (News Encoder), which learns a news article representation. The second component is the UE (User-Profile Encoder), which derives a contextual user representation thanks to its submodules, the LPE and RPE. The final interaction score is calculated by the Score Rating component.

To accurately train CupMar for the news recommendation task, there are several things we need to address. First, we need a scoring function to measure the interaction score between a user representation and a news article representation. One fast and effective method that meets this requirement is the dot product operation, as applied in the well-known work of Okura et al. [5]. Hence, we use the dot product operation to compute the interaction probability inside the final Score Rating component of the CupMar model, as illustrated in Figure 2. If we have a user-profile u with representation vector ru and a candidate news article n with representation vector rn, then we calculate the interaction score between them as \(\boldsymbol {s}(u,n) = \boldsymbol {r}^{\mathsf {{T}}}_{\boldsymbol {u}} \boldsymbol {r}_{\boldsymbol {n}}\).
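For concreteness, this scoring step can be sketched in a few lines of PyTorch (a minimal illustration; tensor shapes and names are our own, not those of the released code):

```python
import torch

def interaction_score(r_u: torch.Tensor, r_n: torch.Tensor) -> torch.Tensor:
    """Dot-product score s(u, n) = r_u^T r_n for a batch of pairs.

    r_u: (batch, d) user-profile vectors from the UE
    r_n: (batch, d) candidate news vectors from the NE
    """
    return (r_u * r_n).sum(dim=-1)  # (batch,)
```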

Second, we cast the news recommendation problem as a classification task and use the negative sampling technique during model training [29]. When a user is presented with multiple news articles, the articles clicked by the user are the positive samples, whereas K randomly sampled articles that are not clicked by the user are the negative samples. The CupMar model then learns to infer the interaction probability between the positive and K negative news articles, formulating this as a (K + 1)-class prediction task. The loss function is the negative log-likelihood of the positive samples. As such, the total training loss over all positive samples is calculated as follows:

$$\text{loss} = - \sum\limits_{i=1}^{P} \log \frac{\exp(\boldsymbol{s}(u,{n}_{i}^{pos}))}{\exp(\boldsymbol{s}(u,{n}_{i}^{pos})) + {\sum}_{k=1}^{K} \exp(\boldsymbol{s}(u,{n}_{i,k}^{neg}))}$$
(1)

where P is the number of positive training samples, \({n}_{i}^{pos}\) is the ith positive sample in one news session, and \({n}_{i,k}^{neg}\) is the kth negative sample for the ith positive sample.
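Equation (1) is equivalent to a cross-entropy loss where each positive sample and its K negatives form one (K + 1)-class instance with the positive as the target class. A minimal PyTorch sketch under that reading (names are illustrative):

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(pos_scores: torch.Tensor,
                           neg_scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of Eq. (1).

    pos_scores: (P,)   s(u, n_i^pos) for the P positive samples
    neg_scores: (P, K) s(u, n_{i,k}^neg) for the K negatives of each positive
    """
    # Column 0 holds the positive score, so the target class is always 0.
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    # Summing over samples reproduces the outer sum in Eq. (1).
    return F.cross_entropy(logits, targets, reduction="sum")
```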

In the sequel, the technical details of NE and UE will be described in Sections 4 and 5, respectively.

4 Learning the news representation

The task of the CupMar News Encoder (NE) is to learn the representation of a news article. A news article contains several pieces of useful information, such as the news category, news title, news body content, and knowledge entities, as depicted in Figure 3. It is essential to leverage all of these pieces of information to derive a meaningful representation for downstream machine learning tasks. As such, for each news item, we use five main features to encode its representation vector. We denote a news item as n = {c, sc, k, t, b}, where \(c \in C\) is the category feature in the set C of all categories in the dataset, and \(sc \in C\) is the subcategory feature. We have \(k \in K\) as the knowledge entity feature in the set K of all knowledge entities in the dataset. We have t as the news title feature with T words, hence \(t=[{w^{t}_{1}}, {w^{t}_{2}},\dots ,{w^{t}_{T}}]\), where \(w^{t} \in W\) is a word in the title t and W is the set of all distinct words in the dataset. Similarly, b is the news body content feature with B words, hence \(b=[{w^{b}_{1}}, {w^{b}_{2}}, \dots , {w^{b}_{B}}]\), where \(w^{b} \in W\) is a word in the body b.

Figure 3: The News Encoder (NE) component design. Multi-aspect properties such as the category, knowledge entities, and content of the news article are processed in different ways via multiple neural network layers. All of these property vectors are concatenated and passed through a Dense layer to derive the final representation vector.

First, we derive the vector rc from both the category c and subcategory sc of the news article. The category and subcategory features give us clear information about the topic of the news article, and they also serve as strong signals for a user’s long-term preferences. The vector rc is formulated as follows:

$$\begin{array}{@{}rcl@{}} & \boldsymbol{r}_{\boldsymbol{c}} = \text{ReLU}(\mathbf{W}_{\mathbf{c}} \times [e_{c} \parallel e_{sc}] + \boldsymbol{b}_{\boldsymbol{c}}), \end{array}$$
(2)

where Wc and bc are the weight and bias parameters of the Densec (feed-forward) layer in Figure 3, \([e_{c} \parallel e_{sc}]\) is the concatenation of the category embedding ec of category c and the subcategory embedding esc of subcategory sc, and ReLU is the non-linear activation function [30].

Likewise, we perform a similar procedure to learn the vector rk for the knowledge entities k of the news article. Since one article can contain multiple knowledge entities, we perform a mean operation on their embeddings before feeding them into the Densek layer, as illustrated in Figure 3. The formulation is as follows:

$$\begin{array}{@{}rcl@{}} & \boldsymbol{r}_{\boldsymbol{k}} = \text{ReLU}(\mathbf{W}_{\mathbf{k}} \times \boldsymbol{\mu}(e_{k_{1}}, e_{k_{2}}, \dots, e_{k_{n}}) + \boldsymbol{b}_{\boldsymbol{k}}), \end{array}$$
(3)

where Wk and bk are the weight and bias parameters of the Densek layer, and \(\boldsymbol {\mu }(e_{k_{1}}, e_{k_{2}}, \dots , e_{k_{n}})\) is the mean of the n knowledge entity embeddings ek in the article.
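The two dense encoders of (2) and (3) can be sketched as follows (a rough PyTorch rendering under our own dimension choices; the subcategory shares the category embedding table since sc ∈ C):

```python
import torch
import torch.nn as nn

class CategoryEntityEncoder(nn.Module):
    """Sketch of Eqs. (2)-(3): Dense_c over [e_c || e_sc] and Dense_k over
    the mean of the knowledge entity embeddings."""

    def __init__(self, n_categories: int, n_entities: int,
                 emb_dim: int = 100, out_dim: int = 100):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories, emb_dim)
        self.ent_emb = nn.Embedding(n_entities, emb_dim)
        self.dense_c = nn.Linear(2 * emb_dim, out_dim)  # Dense_c
        self.dense_k = nn.Linear(emb_dim, out_dim)      # Dense_k

    def forward(self, cat, subcat, entities):
        # Eq. (2): r_c = ReLU(W_c [e_c || e_sc] + b_c)
        e = torch.cat([self.cat_emb(cat), self.cat_emb(subcat)], dim=-1)
        r_c = torch.relu(self.dense_c(e))
        # Eq. (3): r_k = ReLU(W_k mu(e_k1, ..., e_kn) + b_k)
        r_k = torch.relu(self.dense_k(self.ent_emb(entities).mean(dim=1)))
        return r_c, r_k
```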

The most important feature of a news article is the content itself. We want to learn the representation from both the news article’s title and body content. Primarily, we want to know how each word interacts with its surrounding words. Therefore, we apply both the attention and multi-head self-attention mechanisms popularized by the work of Vaswani et al. [31]. The formulation to learn the representation rtb of the news article content in the title and the body is as follows:

$$\begin{array}{@{}rcl@{}} & \boldsymbol{r}_{\textbf{tb}} = \textbf{Att}(\textbf{Heads}(e_{{w^{t}_{1}}}, e_{{w^{t}_{2}}}, \dots, e_{{w^{t}_{T}}}, e_{{w^{b}_{1}}}, e_{{w^{b}_{2}}}, \dots, e_{{w^{b}_{B}}})), \end{array}$$
(4)

where \(\textbf {Heads}(e_{w_{1}}, \dots , e_{w_{i}})\) is a word-level multi-head self-attention layer [31] over each word embedding \(e_{w_{i}}\). This layer contains h heads, where h is a hyperparameter. The kth head hk learns the representation of word wi as follows:

$$\begin{array}{@{}rcl@{}} && \boldsymbol{h}^{\boldsymbol{w}}_{\boldsymbol{i},\boldsymbol{k}} = \mathbf{V}^{w}_{k} \left( \sum\limits_{j=1}^{T+B} \boldsymbol{a}^{\boldsymbol{k}}_{\boldsymbol{i},\boldsymbol{j}} e_{w_{j}}\right), \end{array}$$
(5)
$$\begin{array}{@{}rcl@{}} && \boldsymbol{a}^{\boldsymbol{k}}_{\boldsymbol{i},\boldsymbol{j}} = \frac{\exp(e^{\mathsf{T}}_{w_{i}} \mathbf{Q}^{w}_{k} e_{w_{j}})}{{\sum}_{m=1}^{T+B} \exp(e^{\mathsf{T}}_{w_{i}} \mathbf{Q}^{w}_{k} e_{w_{m}})}, \end{array}$$
(6)

where \(\mathbf {Q}^{w}_{k}\) and \(\mathbf {V}^{w}_{k}\) are the weight parameters of the kth head, (⋅)T is the transpose operation, T + B is the total number of words in the title and body, and \(\boldsymbol {a}^{\boldsymbol {k}}_{\boldsymbol {i},\boldsymbol {j}}\) is the interaction weight between words i and j. The final representation of each word wi is the concatenation of all the self-attention heads, that is, \(\boldsymbol {h}^{\boldsymbol {w}}_{\boldsymbol {i}} = [\boldsymbol {h}^{\boldsymbol {w}}_{\boldsymbol {i},\boldsymbol {1}} \parallel \boldsymbol {h}^{\boldsymbol {w}}_{\boldsymbol {i},\boldsymbol {2}} \parallel {\dots } \parallel \boldsymbol {h}^{\boldsymbol {w}}_{\boldsymbol {i},\boldsymbol {h}}]\); hence we have Heads\((e_{w_{1}}, \dots , e_{w_{i}}) = \{\boldsymbol {h}^{\boldsymbol {w}}_{\boldsymbol {1}}, \dots , \boldsymbol {h}^{\boldsymbol {w}}_{\boldsymbol {i}} \}\). Subsequently, the Att(Heads) function of the attention layer attends to each word’s self-attention representation \(\boldsymbol {h}^{\boldsymbol {w}}_{\boldsymbol {i}}\). The formula for deriving the attention weight \(\boldsymbol {\alpha }^{\boldsymbol {w}}_{\boldsymbol {i}}\) of each word is:

$$\begin{array}{@{}rcl@{}} && \boldsymbol{b}^{\boldsymbol{w}}_{i} = \boldsymbol{q}^{\mathsf{T}}_{w} tanh (\mathbf{V}_{w} \times \boldsymbol{h}^{\boldsymbol{w}}_{\boldsymbol{i}} + \boldsymbol{v}_{w}), \end{array}$$
(7)
$$\begin{array}{@{}rcl@{}} && \boldsymbol{\alpha}^{\boldsymbol{w}}_{\boldsymbol{i}} = \frac{\exp(\boldsymbol{b}^{\boldsymbol{w}}_{\boldsymbol{i}})}{{\sum}_{j=1}^{T+B}\exp(\boldsymbol{b}^{\boldsymbol{w}}_{\boldsymbol{j}})}, \end{array}$$
(8)

where Vw and vw are the attention weight and bias parameters, and qw is the query vector. After all of the attention weights are calculated, the content vector rtb of a news article is computed as:

$${\boldsymbol r}_{\mathbf{tb}}=\sum\limits_{i=1}^{T+B}\boldsymbol\alpha_{\boldsymbol i}^{\boldsymbol w}\boldsymbol h_{\boldsymbol i}^{\boldsymbol w}.$$
(9)
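A compact sketch of (4)-(9), substituting torch.nn.MultiheadAttention for the hand-written heads of (5)-(6) (an approximation, since the built-in layer adds its own output projection; dimensions follow the hyperparameters reported in Section 6):

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Sketch of Eqs. (4)-(9): word-level multi-head self-attention over the
    concatenated title and body words, then additive attention pooling."""

    def __init__(self, word_dim: int = 300, n_heads: int = 10,
                 head_dim: int = 10):
        super().__init__()
        d = n_heads * head_dim  # total dimension of all heads (100)
        self.proj = nn.Linear(word_dim, d)
        self.heads = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.att_dense = nn.Linear(d, d)           # V_w, v_w in Eq. (7)
        self.query = nn.Parameter(torch.randn(d))  # q_w in Eq. (7)

    def forward(self, words: torch.Tensor) -> torch.Tensor:
        # words: (batch, T+B, word_dim) embeddings of the title + body words
        h = self.proj(words)
        h, _ = self.heads(h, h, h)                      # Heads(...)
        b = torch.tanh(self.att_dense(h)) @ self.query  # Eq. (7)
        alpha = torch.softmax(b, dim=-1)                # Eq. (8)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)     # Eq. (9): r_tb
```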

Finally, we concatenate the multi-aspect vectors rc, rk and rtb and let the final Densene layer extract the most prominent patterns of the news article to produce the multi-aspect news representation vector rne, as illustrated in Figure 3, using the following formula:

$$\begin{array}{@{}rcl@{}} & \boldsymbol{r}_{\textbf{ne}} = \text{ReLU}(\mathbf{W}_{\textbf{ne}} \times [\boldsymbol{r}_{\boldsymbol{c}} \parallel \boldsymbol{r}_{\boldsymbol{k}} \parallel \boldsymbol{r}_{\textbf{tb}}] + \boldsymbol{b}_{\textbf{ne}}), \end{array}$$
(10)

where Wne and bne are the weight and bias parameters of the Densene layer, and \([\boldsymbol{r}_{\boldsymbol{c}} \parallel \boldsymbol{r}_{\boldsymbol{k}} \parallel \boldsymbol{r}_{\textbf{tb}}]\) is the concatenation of the multi-aspect vectors of the news article.
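The fusion step of (10) then reduces to a single dense layer over the concatenated vectors, for example (dimensions again our own):

```python
import torch
import torch.nn as nn

dense_ne = nn.Linear(3 * 100, 100)  # Dense_ne over [r_c || r_k || r_tb]

def fuse_news_vector(r_c, r_k, r_tb):
    # Eq. (10): r_ne = ReLU(W_ne [r_c || r_k || r_tb] + b_ne)
    return torch.relu(dense_ne(torch.cat([r_c, r_k, r_tb], dim=-1)))
```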

5 Learning the user representation

The CupMar User-Profile Encoder (UE) is responsible for learning the user representation from her news reading history. Figure 2 shows the complete architecture of the CupMar model, where the left-side portion visualizes the UE component and its submodules. A user’s reading habits can exhibit both long-term and recent preferences. To extract both of these signals, the UE uses its two submodules, the Long-term Preferences latent Extractor (LPE) and the Recent Preferences latent Extractor (RPE). One might think that the UE’s submodules need the user’s complete reading history to do their job. However, we do not use the complete history, due to its high computational cost and low extraction performance. Instead, by sampling the reading history of the last several days, one can infer the long-term preferences of a user by paying attention to her most frequent reading topics. Likewise, it is also feasible to extract her recent interests by paying attention to the news article titles, bodies, and embedded knowledge entities. That is the advantage of this sampling strategy. The following sections dive into the details of each submodule.

5.1 Long-term preferences latent extractor

The sole purpose of the LPE is to learn the long-term preferences of a user throughout her news reading history. It looks for frequent signals that signify repetitive behaviours of a user. For example, a user who keeps reading entertainment news over multiple sessions clearly indicates a strong preference for entertainment content.

We argue that, in real life, the news genres or topics in a user’s history records serve as a strong indication of the user’s general, long-term preferences. Additionally, the unique characteristics of a user further refine her choices. For instance, a fan of basketball is more likely to check sports news about the National Basketball Association (NBA) than badminton news. Therefore, to capture such long-term preferences, we assign each user a unique embedding vector based on her identification (ID), and accumulate the embeddings of the most frequent categories in the user’s history news records, together with the corresponding knowledge entity embeddings. The procedure for extracting a user’s long-term latent preferences vector rlpe is detailed in Algorithm 1.

Algorithm 1: Extraction of the long-term latent preferences vector rlpe

First, we initialize each user with a unique user embedding vector \(\boldsymbol {e}_{\boldsymbol {u}_{\boldsymbol {i}}}\) using a UserEmbedding layer (line 2). Second, we find the L most frequent categories in the user’s history records and store them in the set C (lines 3 to 7). Third, we initialize the long-term latent preferences vector rlpe as a zero-vector, then accumulate into rlpe the category embedding vectors of the categories in C, obtained from the category embedding layer CateEmbedding, together with the knowledge entity embedding vectors in the set D, obtained from the knowledge entity embedding layer (lines 8 to 15). Then, we average rlpe by the count value, which counts the total number of news articles whose category is in C (line 16). Finally, we concatenate rlpe with the user embedding \(\boldsymbol {e}_{\boldsymbol {u}_{\boldsymbol {i}}}\) (line 17). Using this algorithm, we can extract both the user’s long-term preferences and her unique characteristics into the representative vector rlpe.
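Since the pseudo-code of Algorithm 1 is not reproduced here, the following Python sketch reconstructs it from the description above; the exact handling of the entity set D and the sampled history format are our assumptions, and L is a hypothetical default:

```python
import torch
from collections import Counter

def extract_lpe(user_id, history, user_emb, cate_emb, ent_emb, L=3):
    """Sketch of Algorithm 1. `history` is a list of (category_id,
    entity_ids) pairs from the user's sampled reading records; the
    category and entity embeddings are assumed to share one dimension."""
    e_u = user_emb(torch.tensor(user_id))                     # line 2
    top_cats = Counter(c for c, _ in history).most_common(L)  # lines 3-7
    C = {c for c, _ in top_cats}
    r_lpe = torch.zeros(cate_emb.embedding_dim)               # line 8
    count = 0
    for c, entities in history:                               # lines 9-15
        if c in C:
            r_lpe = r_lpe + cate_emb(torch.tensor(c))
            for k in entities:                                # set D
                r_lpe = r_lpe + ent_emb(torch.tensor(k))
            count += 1
    r_lpe = r_lpe / max(count, 1)                             # line 16
    return torch.cat([r_lpe, e_u])                            # line 17
```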

5.2 Recent preferences latent extractor

The RPE learns the recent preferences of a user via a Gated Recurrent Unit (GRU) neural network layer. We denote by Z the number of news articles a user has recently read; for Z news articles in chronological order, the set of news records is denoted as \(N = \{n_{1},n_{2},\dots , n_{Z}\}\). The RPE derives the recent preferences latent vector rrpe of a user using the GRU layer as follows:

$$\begin{array}{@{}rcl@{}} \boldsymbol{z}_{\boldsymbol{t}} &=& \sigma(W_{z}[h_{t-1}, \boldsymbol{NE}(n_{t})]), \end{array}$$
(11)
$$\begin{array}{@{}rcl@{}} \boldsymbol{r}_{\boldsymbol{t}} &=& \sigma(W_{r}[h_{t-1}, \boldsymbol{NE}(n_{t})]), \end{array}$$
(12)
$$\begin{array}{@{}rcl@{}} \widetilde{\boldsymbol{h}_{\boldsymbol{t}}} &=& tanh(W_{\widetilde{h}}[r_{t} \odot h_{t-1}, \boldsymbol{NE}(n_{t}) ]), \end{array}$$
(13)
$$\begin{array}{@{}rcl@{}} \boldsymbol{h}_{\boldsymbol{t}} &=& \boldsymbol{z}_{\boldsymbol{t}} \odot \boldsymbol{h}_{\boldsymbol{t-1}} + (1 - \boldsymbol{z}_{\boldsymbol{t}}) \odot \widetilde{\boldsymbol{h}_{\boldsymbol{t}}}, \end{array}$$
(14)

where σ is the sigmoid function, ⊙ is the element-wise product, Wr, Wz, and \(W_{\widetilde {h}}\) are the GRU’s network weights, and NE(⋅) is the News Encoder function described in Section 4. With the initial hidden state vector h0 initialized as a zero-vector, we repeat the process with the GRU network until we reach the last hidden state vector hZ. Thus, the RPE vector is rrpe = hZ.
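In practice, (11)-(14) are the standard GRU recurrence, so the RPE can be sketched with torch.nn.GRU running over the NE vectors of the Z most recent articles (the hidden size is illustrative):

```python
import torch
import torch.nn as nn

class RPE(nn.Module):
    """Sketch of Eqs. (11)-(14): a GRU over the news vectors NE(n_1..n_Z)."""

    def __init__(self, news_dim: int = 100, hidden_dim: int = 100):
        super().__init__()
        self.gru = nn.GRU(news_dim, hidden_dim, batch_first=True)

    def forward(self, news_vecs: torch.Tensor) -> torch.Tensor:
        # news_vecs: (batch, Z, news_dim), in chronological order
        batch = news_vecs.size(0)
        h0 = torch.zeros(1, batch, self.gru.hidden_size)  # zero initial state
        _, h_z = self.gru(news_vecs, h0)
        return h_z.squeeze(0)  # r_rpe = h_Z, the last hidden state
```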

5.3 The representation of contextual user-profile

Given the two contextual vectors of a user, the long-term latent preferences vector rlpe and the recent latent preferences vector rrpe, the final contextual user-profile vector rue is calculated as follows:

$$\boldsymbol{r}_{\textbf{ue}} = \text{ReLU}(\mathbf{W}_{\textbf{ue}} \times [\boldsymbol{r}_{\textbf{lpe}} \parallel \boldsymbol{r}_{\textbf{rpe}}] + \boldsymbol{b}_{\textbf{ue}}),$$
(15)

where Wue and bue are the weight and bias parameters of the Denseue layer (illustrated in Figure 2), and \([\boldsymbol{r}_{\textbf{lpe}} \parallel \boldsymbol{r}_{\textbf{rpe}}]\) is the concatenation of the two contextual vectors learned in the preceding sections. The use of both contextual vectors is the key ingredient that helps the CupMar model achieve better scores, as we discuss in the later sections.

6 Evaluations of CupMar

In this section, we describe the evaluation process and give a detailed performance analysis of the internal components of the proposed CupMar model against several baselines.

6.1 Experimental dataset

There has been a shortage of quality datasets for news recommendation research. Fortunately, the recent work of Wu et al. [8] introduces the large-scale MIND dataset, which serves as a benchmark dataset for news recommendation. We conduct all our experiments on this high-quality dataset. MIND is collected from user behaviour logs of the Microsoft News websiteFootnote 6. It contains more than 150,000 news articles and more than 15 million behaviour logs generated by one million users. Each news item comes with rich textual attributes such as the category, subcategory, title, body, and embedded knowledge entities. Additionally, the MIND dataset also comes in a smaller version called MIND-small, which is suitable for quick prototyping and validation; MIND-small accounts for 5% of the total dataset. The research community has quickly adopted MIND as a robust benchmarking dataset for news recommendation, as shown in [3, 4, 17, 32]. We run our evaluation on both versions of the MIND dataset. Table 1 summarizes the statistics of the MIND dataset.

Table 1 MIND dataset statistics [8]

Before training, we perform preprocessing steps to bring the MIND dataset into an appropriate format for CupMar. We first convert all words, categories, and knowledge entities into integer indices for embedding purposes. Then, for each news session, we choose one positive sample and four random negative samples, and repeat this process five times. Hence, for every log session in the dataset, we generate five training log sessions, resulting in an even larger training set for the CupMar model. This gives us a balanced ratio of positive and negative pairs of input signals and improves the model accuracy.
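A rough sketch of this sample-generation step (field names are hypothetical; negatives are drawn from the shown-but-not-clicked articles of the same session, per Section 3):

```python
import random

def expand_session(session, n_neg=4, n_repeat=5):
    """Turn one behaviour-log session into five training samples, each pairing
    one clicked article with four randomly sampled non-clicked ones."""
    samples = []
    for _ in range(n_repeat):
        pos = random.choice(session["clicked"])
        negs = random.sample(session["not_clicked"], n_neg)
        samples.append({"user": session["user_id"], "pos": pos, "neg": negs})
    return samples
```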

6.2 Experimental environment

For evaluation, we apply the same settings to the different variations of our CupMar model. The categorical embedding dimension is 100 for the category, subcategory, and knowledge entity features. We use the popular pre-trained FastText word embeddings [33] (note that we used GloVe word embeddings [34] in our conference paper), with an embedding dimension of 300. We use dropout with a drop rate of 30% to prevent overfitting. The Adam optimizer [35] is used to optimize the network. The batch size is 32 and the learning rate is 0.001. We also select four negative samples for each positive sample, emulating a 5-class classification task as mentioned in the previous sections, to be compatible with the training methods outlined in [3, 4]. We choose the number of self-attention heads to be 10, and each head has a dimension of 10; thus the total dimension of all heads for each word vector is 100.
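For reference, these settings collected in one place (a plain dictionary; the key names are ours):

```python
CONFIG = {
    "cat_emb_dim": 100,       # category / subcategory / entity embeddings
    "word_emb": "fasttext",   # pre-trained, dimension 300
    "word_emb_dim": 300,
    "dropout": 0.3,
    "optimizer": "adam",
    "batch_size": 32,
    "learning_rate": 1e-3,
    "n_negatives": 4,         # 5-class classification per positive
    "attn_heads": 10,
    "head_dim": 10,           # 10 heads x 10 dims = 100 per word
}
```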

For the evaluation metrics, we use ranking metrics to benchmark the performance of the models: the Area Under the ROC Curve (AUC), the Mean Reciprocal Rank (MRR), and the Normalized Discounted Cumulative Gain (nDCG). Each model is evaluated five times and the average scores are reported. We report our scores on both the MIND and MIND-small datasets.
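As a reference for how these ranking metrics are computed per impression (AUC can be taken from sklearn.metrics.roc_auc_score; the functions below follow common MIND evaluation conventions, not necessarily our exact script):

```python
import numpy as np

def mrr(labels, scores):
    """Mean reciprocal rank of the clicked items within one impression."""
    order = np.argsort(scores)[::-1]
    ranks = np.flatnonzero(np.asarray(labels)[order] == 1) + 1
    return float((1.0 / ranks).mean())

def ndcg_at_k(labels, scores, k=10):
    """nDCG@k with binary relevance labels."""
    order = np.argsort(scores)[::-1]
    gains = np.asarray(labels)[order][:k]
    dcg = float((gains / np.log2(np.arange(2, gains.size + 2))).sum())
    ideal = np.sort(np.asarray(labels))[::-1][:k]
    idcg = float((ideal / np.log2(np.arange(2, ideal.size + 2))).sum())
    return dcg / idcg if idcg > 0 else 0.0
```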

6.3 Baseline models

Our CupMar model is evaluated against the following baseline models:

  • Factorization Machines (FM) [36]: FM is a state-of-the-art model for many recommendation problems, based on the matrix factorization approach. In our evaluation, we define the user representation as the combination of all TF-IDF signals extracted from the titles and bodies of a user’s history news. The news article representation includes the TF-IDF features of its title and body, and the one-hot encodings of its category and subcategory. The input to the FM model is the concatenation of the user and candidate news article representations.

  • CNN [37]: We adopt the CNN model proposed by Kim as one of the baselines. It applies convolution with max-over-time pooling on the text to learn the news article representation from the title and body.

  • DKN [7]: The Deep Knowledge-Aware Network for news recommendation is a deep learning model that leverages a CNN and knowledge-entity-aware attention on the news articles to derive the user and news representations.

  • HiFiArk [4]: The High-Fidelity Archive Network is another robust deep learning model for the news recommendation task. It compresses a user’s news reading history into compact vectors and stores them as archives during the offline stage. During the online stage, these compact vectors are used to infer the user’s interest in candidate news.

  • NRMS [3]: Neural News Recommendation with Multi-head Self-attention is a recent deep news recommendation model. Its news encoder uses multi-head self-attention to learn word interactions, and its user encoder uses an attention mechanism to extract user preferences.

  • CUPCate: CUPCate is a simple variant of our CupMar model. In this model, we only consider the category and subcategory features in the news encoder. The user profile is then encoded by averaging the representations of all history news records. We developed this as a simple baseline during our experimentation.

  • CUPShort: CUPShort is another variant of the CupMar model. It is identical to CupMar, but without the LPE submodule; CUPShort only learns to extract the recent preferences of a user via its GRU layer. With CUPShort, we can gauge the effectiveness of the LPE submodule in the full CupMar model.

  • CUPLong: CUPLong is another variant of the CupMar model. It is identical to CupMar, but without the RPE submodule. CUPLong only learns to extract the long-term preferences of a user.

6.4 Evaluation results

By training and evaluating the CupMar model on the MIND and MIND-small datasets, we obtain the performance results shown in Table 2. Interestingly, we achieve state-of-the-art scores on the MIND dataset but not on the MIND-small dataset; we explain this, together with other observations, in detail below.

Table 2 Performance comparison of the CupMar model with other methods over the MIND and MIND-small datasets

First, our CupMar model achieves state-of-the-art scores and outperforms all the baseline methods on the MIND dataset. CupMar is followed closely by NRMS and HiFiArk, the two strongest-performing baseline models for news recommendation. This result shows that using multi-aspect properties for news encoding and leveraging contextual user-profile signals can significantly boost the learning capability of a deep learning model for the news recommendation task. Moreover, our CupMar model also achieves slightly better performance from having the LPE submodule than the CUPShort model, which only employs the RPE submodule, as can be seen from the small gap in their scores. We give a detailed analysis of this point in the later sections.

Second, the deep neural network models clearly show superior performance compared to the matrix factorization FM approach. The better performance of neural network models can be explained by their high learning capacity: due to their large number of weight parameters, neural network models have the ability to tackle the complicated task of news recommendation. Another piece of evidence supporting this statement is the ranking scores of our simple CUPCate model, which has the lowest scores across all metrics on the MIND dataset. The most likely reason is its low number of parameters, due to its crude design of only using two categorical features and one dense layer.

Third, we observe an interesting phenomenon: our CupMar model does not perform well when it is trained and evaluated on the MIND-small dataset, which contains about 5% of the samples of the full MIND dataset. The CupMar model ranks third, while the top spots belong to the NRMS and HiFiArk models. After careful examination, we believe that the use of multi-aspect properties and multiple advanced neural network layers, such as self-attention heads, an attention layer, a GRU layer, and several Dense layers, significantly increases the number of weight parameters in the CupMar model; we have 40% more weight parameters than our implemented NRMS model. Although the high number of parameters helps CupMar generalize better over large datasets, the model is underfit when trained on smaller datasets. This is a limitation we want to address in future work; for now, we suggest a lower bound of 10% of the total MIND samples for training the model to a good performance.

6.5 Detailed analysis on contextual user-profile

In this section, we analyze our CupMar model’s performance concerning the use of contextual information, which is handled by the LPE and RPE submodules. We create two variant models, called CUPShort and CUPLong, respectively. The CUPShort model only uses the RPE submodule inside the UE component to tackle the task, while the CUPLong model only uses the LPE submodule. Then we compare the inference scores of each of them to other models to see the changes in the performance. In particular, we compare with CupMar as the full model, CUPCate as the simple baseline, and CNN as a neural network model with high learning capacity for text representation. According to the results shown in Figures 4 and 5, we can see that leveraging both the long-term and recent-term contexts can strongly boost the performance of the CupMar model. CupMar always has higher scores than both CUPShort and CUPLong across all three different metrics. This clearly shows the effectiveness of the contextual information of a user in the news recommendation task.

Figure 4: Evaluation results for the analysis of the LPE submodule in comparison to other methods.

Figure 5: Evaluation results for the analysis of the RPE submodule in comparison to other methods.

We also want to answer a further question: which user contextual aspect contributes more to the CupMar model? Looking at the percentage gap between their respective scores and those of the full CupMar model, we can confirm that the RPE submodule contributes more to the performance of the CupMar model: CUPShort scores (summed over all metrics) lower than CupMar by only 3.6%, while CUPLong’s scores show a gap of 14%. This result shows that the recent preferences contribute more to the user representation than the long-term preferences, which makes sense, since a user’s recent preferences usually also include her long-term preferences.

6.6 Detailed analysis on multi-aspect properties

In this section, we run further evaluations to compare the effectiveness of using multi-aspect properties in the News Encoder (NE) with other approaches. Similar to the analysis of the user contexts, we deploy a model variant called CUPSeq, where, instead of the self-attention mechanism and categorical features, we use a Seq2Seq [38] architecture with a recurrent neural network (RNN) over the news title and body to infer the rne vector explained in Section 4. We then compare the evaluation scores of CUPSeq with those of other approaches, including the full CupMar model, the NRMS model with its self-attention layer, the CNN model with its convolution operation on text data, and the CUPCate baseline with only categorical features. The experimental results are depicted in Figure 6.

Figure 6: Evaluation results for the analysis of using multi-aspect properties to encode a news representation.

At first glance, we can see that using advanced neural network mechanisms such as Seq2Seq or a self-attention layer outperforms the simple baseline using categorical features, since the CupMar, CUPSeq, CNN, and NRMS models all score significantly higher than the CUPCate model. Notably, this also signifies that the body text of a news article contributes more information to the neural models than other signals, since both CUPSeq and CNN only employ the textual data of the title and body. Additionally, we observe that a more sophisticated architecture such as a self-attention layer can learn more effectively than older approaches such as RNNs and CNNs, because the NRMS model achieves better scores than CUPSeq and CNN. Nevertheless, we do see the benefits of using multi-aspect properties in the CupMar model, as CupMar outscores all other models, albeit only slightly better than the NRMS model. This demonstrates the strong performance of the proposed CupMar model.

7 Conclusion

In this paper, we propose a novel deep neural network called CupMar for the challenging task of news recommendation. Making personalized news recommendations requires understanding both the textual information of a news item and the user’s context in terms of long-term and recent preferences from her history records. To resolve these issues, at the heart of our proposed CupMar model are the News Encoder and the User-Profile Encoder. More specifically, the News Encoder learns the news article representation from various features such as the category, subcategory, knowledge entities inside the article, the article title, and the news body; it uses self-attention, attention, and dense layers to effectively combine all the necessary signals to represent a news article. The User-Profile Encoder, in turn, uses the user’s recent historical news data with its dense textual information to infer both long-term and recent-term signals for the user representation, thanks to its two submodules, the Long-term Preferences latent Extractor and the Recent Preferences latent Extractor with its GRU network layer. We perform an extensive evaluation of the CupMar model on the popular MIND dataset, and CupMar shows better performance than all the baselines.

For future work, we plan to enhance the CupMar model with respect to recommendation serendipity. We plan to develop a new interaction score based on both the click probability and the diversification of the candidate news items relative to the user’s historical news reading data. This can help the model suggest a more diversified news list to its users and increase exploration as well as satisfaction.