1 Introduction

MF [4] projects users and items into a low-dimensional latent space, and the missing entries of the original matrix can then be recovered by the dot product between user and item latent vectors. Recently, LLORMA [7] has been shown to be more effective than traditional MF: the original matrix is divided into several smaller sub-matrices, whose local structures can be exploited for better low-rank approximation. In each sub-matrix, standard MF is applied to generate sub-matrix-specific latent vectors for both users and items.

The above techniques achieve good performance in rating prediction when high-quality explicit feedback, such as ratings that directly indicate users’ preferences, is available. However, explicit feedback is not easy to obtain, and rating prediction cannot be used directly for top-n item recommendation. Compared with explicit feedback, implicit feedback is more common and far more abundant. A user is said to have discovered an item if her behavior, such as listening to, watching or visiting the item, is recorded as implicit feedback; otherwise, the user is assumed to be unaware of the item. Different from explicit feedback, the numerical values describing implicit feedback are nonnegative and very likely to be noisy [10].

Therefore, we consider top-n item recommendation based on implicit feedback datasets. Specifically, we assume that the implicit feedback matrix is not globally low rank but that some of its sub-matrices are low rank. Instead of decomposing the original matrix, we decompose these sub-matrices. We propose Local Weighted Matrix Factorization (LWMF), which integrates LLORMA [7] with WMF [10] by employing a kernel function to capture local structure and a weight function to model user preference. Sub-matrix factorization in LWMF also relieves the sparsity problem, since the density of the sub-matrices is much higher than that of the original matrix. Two key issues of such a sub-matrix-ensemble method are (1) how to generate the sub-matrices and (2) how to set the ensemble weights of the sub-matrices. For the first issue, we propose a heuristic method, DCGASC, to select sub-matrices that approximate the original matrix well. For the second issue, we adopt the kernel function to model local structure and the weight function to capture user preferences.

The main contributions can be summarized as follows:

  • We propose LWMF, which integrates LLORMA with WMF to recommend items on implicit feedback datasets. LWMF exploits local structure by dividing the original matrix into sub-matrices and relieves the sparsity problem.

  • Based on the kernel function, we propose DCGASC (Discounted Cumulative Gain Anchor Point Set Cover) to select sub-matrices that better approximate the original matrix. We also provide a theoretical sub-modularity analysis of the DCGASC objective function.

  • For the item recommendation problem, we further propose a variant, user-based LWMF, which is better suited to item recommendation and achieves better performance.

  • Extensive experiments on real datasets are conducted to compare LWMF with the state-of-the-art WMF algorithm. The experimental results demonstrate the effectiveness of our proposed solutions.

The rest of the paper is organized as follows: Section 2 reviews related work, and Sect. 3 presents preliminaries on MF (matrix factorization), WMF and LLORMA. Section 4 describes LWMF, including the heuristic method DCGASC for selecting sub-matrices and the learning algorithm for the local latent vectors. Experimental evaluations on real datasets are given in Sect. 5. Conclusions and future work follow in Sect. 6.

2 Related Work

One of the most traditional and popular approaches to recommender systems is KNN [1]. Item-based KNN recommends similar items using similarity measures between items (e.g., cosine similarity, Jaccard similarity and Pearson correlation). MF [24] methods play an important role among model-based CF methods, which aim to learn latent factors from the user-item matrix. MF usually achieves better performance than KNN-based methods, especially on rating prediction. Recently, several studies have focused on using an ensemble of sub-matrices for better low-rank approximation, including DFC [5], LLORMA [7, 8], ACCAMS [9] and WEMAREC [26]. These methods partition the original matrix into several smaller sub-matrices, and a local MF is applied to each sub-matrix individually. The final predictions are obtained from the ensemble of multiple local MFs. Typically, clustering-based techniques with heuristic adaptations are used for sub-matrix generation. We give a brief review of these studies. Mackey et al. [5] introduce a Divide-Factor-Combine (DFC) framework, in which the expensive task of matrix factorization is randomly divided into smaller subproblems. LLORMA [7, 8] uses a nonparametric kernel smoothing method to search for nearest neighbors, and WEMAREC [26] employs Bregman co-clustering [30] to partition the original matrix. However, these methods focus on explicit feedback datasets, while most feedback is implicit, such as listening counts, click counts and check-ins. Explicit feedback is not always available, whereas implicit feedback is abundant and common. Hence, Hu et al. [10] and Pan et al. [11, 12] propose weighted matrix factorization (WMF) to model implicit feedback with alternating least squares (ALS). Specifically, Hu et al. [10] present a whole-data-based learning approach that assigns a uniform weight to missing entries, i.e., all zero entries receive the same weight. Pan et al. [11, 12] propose a sample-based approach that samples negative instances from the missing data and adopts nonuniform weighting.

To improve the efficiency of WMF, several approaches have been proposed. Pilaszy et al. [27] design an approximate solution to ALS, presenting novel and fast ALS variants for both implicit and explicit feedback datasets. Recently, Devooght et al. [28] propose a randomized block coordinate descent (RCD) learner, a dynamic framework that reduces the complexity. Further, He et al. [24] design an algorithm based on the element-wise alternating least squares (eALS) technique to optimize an MF model with variably weighted missing data. Other related work on implicit feedback datasets concerns ranking methods, such as BPR [13] and pairwise learning [14]. As the size of the training data explodes, ranking methods need efficient sampling techniques to reduce complexity. Finally, the BPR framework has been adapted to many special scenarios, such as recommending music [15], news [16], TV shows [17] and POIs [18, 19], utilizing additional information (e.g., a POI recommender considers geographical information) to improve prediction performance.

Our method employs a kernel function to capture local structure and a weight function to model user preferences. For parameter learning, we adapt eALS to learn the latent factors efficiently.

3 Preliminary

In this section, we present preliminaries on basic MF, weighted MF for implicit datasets and the local matrix factorization method LLORMA. A glossary of the notations used in the paper is listed in Table 1. In what follows, we denote matrices by bold capital letters and sets by calligraphic letters. Superscripts such as \({\mathbf {R}}^h\) denote different sub-matrices, and subscripts on matrices denote data indices. For example, \({\mathbf {R}}^h_{um}\) denotes the entry for the u-th user and the m-th item of the h-th sub-matrix. In addition, row vectors carry the transpose superscript \(^\top\); otherwise, vectors are column vectors by default.

Table 1 Notations used in the paper

3.1 Matrix Factorization

MF is a dimensionality reduction technique that has been widely used in recommender systems, especially for rating prediction [3, 4]. Owing to its attractive accuracy and scalability, MF has played a vital role in recent recommendation competitions, such as the Netflix Prize, KDD Cup 2011 Recommending Music Items and the Alibaba Big Data Competition. Given a sparse matrix \({\mathbf {R}} \in {\mathbb {R}}^{N\times M}\) with indicator matrix \({\mathbf {I}}\) and latent factor number \(K \ll \min \{N,M\}\), the aim of MF is:

$$\begin{aligned} \min _{{\mathbf {P}},{\mathbf {Q}}}\sum _{u=1}^{N}\sum _{m=1}^{M}{\mathbf {I}}_{um}\left( {\mathbf {R}}_{um}-{\mathbf {P}}^\top _u {\mathbf {Q}}_m\right) ^2 \end{aligned}$$
(1)

where \({\mathbf {R}}_{um}\) is the observed score of the u-th user for the m-th item, and \({\mathbf {P}}_u\) and \({\mathbf {Q}}_m\) are the latent vectors of the u-th user and the m-th item, respectively. To avoid overfitting, regularization terms are usually added to the squared-error objective, so the task becomes minimizing \(\sum _{u=1}^{N}\sum _{m=1}^{M}{\mathbf {I}}_{um}({\mathbf {R}}_{um}-{\mathbf {P}}^\top _u {\mathbf {Q}}_m)^2+\lambda _{\mathbf {P}}\Vert {\mathbf {P}}\Vert _F^2+\lambda _{\mathbf {Q}}\Vert {\mathbf {Q}}\Vert _F^2\). The parameters \(\lambda _{\mathbf {P}}\) and \(\lambda _{\mathbf {Q}}\) control the magnitudes of the latent feature matrices \({\mathbf {P}}\) and \({\mathbf {Q}}\). Stochastic gradient descent is often used to learn the parameters [4].
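
As an illustration, a minimal SGD sketch of this regularized objective might look as follows (a toy numpy implementation of our own, not the authors' code; the names `R`, `I`, `mf_sgd` are ours):

```python
import numpy as np

def mf_sgd(R, I, K=20, lam=0.01, lr=0.01, epochs=50, seed=0):
    """Minimal SGD for regularized MF: minimize
    sum_{u,m} I_um (R_um - P_u^T Q_m)^2 + lam (||P||_F^2 + ||Q||_F^2)."""
    rng = np.random.default_rng(seed)
    N, M = R.shape
    P = 0.1 * rng.standard_normal((N, K))   # user latent vectors
    Q = 0.1 * rng.standard_normal((M, K))   # item latent vectors
    users, items = np.nonzero(I)            # iterate over observed entries only
    for _ in range(epochs):
        for u, m in zip(users, items):
            err = R[u, m] - P[u] @ Q[m]
            P[u] += lr * (err * Q[m] - lam * P[u])   # gradient step for P_u
            Q[m] += lr * (err * P[u] - lam * Q[m])   # gradient step for Q_m
    return P, Q
```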

3.2 Weighted Matrix Factorization

Hu et al. [10] and Pan et al. [11, 12] argue that the original MF is designed for explicit feedback datasets, especially rating prediction, and is not suitable for implicit feedback. They therefore propose weighted matrix factorization (WMF) to handle implicit feedback. Recently, WMF has been widely used in TV show, music and point-of-interest recommendation. To utilize the undiscovered items and to distinguish discovered from undiscovered items, a weight matrix is added to MF:

$$\begin{aligned} {\mathbf {W}}_{um} = 1+\log \left( 1+{\mathbf {R}}_{um}\times 10^\varepsilon \right) \end{aligned}$$
(2)

where the constant \(\varepsilon\) controls the rate of increase. Taking the weights of implicit feedback into account, the optimization problem is reformulated as follows:

$$\begin{aligned} \min _{{\mathbf {P}},{\mathbf {Q}}}\sum _{u=1}^{N}\sum _{m=1}^{M}{\mathbf {W}}_{um}\left( {\mathbf {C}}_{um}-{\mathbf {P}}^\top _u {\mathbf {Q}}_m\right) ^2+\lambda _{\mathbf {P}}\Vert {\mathbf {P}}\Vert _F^2+\lambda _{\mathbf {Q}}\Vert {\mathbf {Q}}\Vert _F^2 \end{aligned}$$
(3)

where each entry \({\mathbf {C}}_{um}\) of the binarized 0/1 matrix \({\mathbf {C}}\) indicates whether the u-th user has discovered the m-th item:

$$\begin{aligned} {\mathbf {C}}_{um}= {\left\{ \begin{array}{ll} 1&{} {\mathbf {R}}_{um}>0\\ 0&{} {\mathbf {R}}_{um}=0 \end{array}\right. }. \end{aligned}$$
(4)
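
For concreteness, a small sketch of how Eqs. (2) and (4) turn a raw implicit-feedback count matrix into the confidence weights \({\mathbf {W}}\) and the binarized matrix \({\mathbf {C}}\) (illustrative numpy code; the function name is ours):

```python
import numpy as np

def wmf_inputs(R, eps=2.0):
    """Build WMF inputs from an implicit count matrix R (Eqs. 2 and 4)."""
    W = 1.0 + np.log1p(R * 10.0 ** eps)   # confidence weights, Eq. (2)
    C = (R > 0).astype(float)             # binarized preferences, Eq. (4)
    return W, C
```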

3.3 Low-rank Matrix Approximation

Lee et al. [7, 8] proposed LLORMA, which assumes the matrix is locally low rank rather than globally low rank. That is, while the entire rating matrix \({\mathbf {R}}\) is not low rank, a sub-matrix \({\mathbf {R}}^h\) restricted to certain similar users and items is. In other words, the entire matrix \({\mathbf {R}}\) is composed of a set of low-rank sub-matrices \({\mathcal {R}}= \{{\mathbf {R}}^1,{\mathbf {R}}^2,\ldots ,{\mathbf {R}}^H\}\) with a weight matrix set \({\mathcal {T}}=\{{\mathbf {T}}^1,{\mathbf {T}}^2,\ldots ,{\mathbf {T}}^H\}\), where \({\mathbf {T}}^h_{um}\) is the weight of entry \({\mathbf {R}}^h_{um}\) in sub-matrix \({\mathbf {R}}^h\):

$$\begin{aligned} {\mathbf {R}}_{um}\approx \frac{1}{{\mathbf {Z}}_{um}}\sum _{h=1}^H {\mathbf {T}}_{um}^h{\mathbf {R}}_{um}^h \end{aligned}$$
(5)

where \({\mathbf {Z}}_{um} = \sum _{h=1}^H {\mathbf {T}}_{um}^h\). LLORMA uses the MF introduced in Sect. 3.1 to approximate each sub-matrix \({\mathbf {R}}^h\). If the matrix exhibits such local structure, this approach achieves good accuracy in rating prediction [7].

Fig. 1 Local matrix factorization

4 Local Weighted Matrix Factorization

In this section, we introduce our proposed model LWMF and further propose a heuristic method to select sub-matrices. Finally, we adopt fast element-wise ALS to learn the local latent vectors \({\mathbf {P}}^h\) and \({\mathbf {Q}}^h\).

4.1 Our Proposed Model

Following LLORMA, we first select sub-matrices from the original matrix, and then decompose each sub-matrix by WMF, as shown in Fig. 1. We propose LWMF, which integrates LLORMA with WMF to recommend top-n items on implicit datasets. Each binarized sub-matrix \({\mathbf {C}}^h\) is estimated by the WMF of Sect. 3.2 as follows:

$$\begin{aligned}&\min _{{\mathbf {P}}^h,{\mathbf {Q}}^h}\sum _{u=1}^{N}\sum _{m=1}^{M}{\mathbf {T}}_{um}^h{\mathbf {W}}_{um}^h({\mathbf {C}}_{um}^h-{{\mathbf {P}}_{u}^h}^\top {\mathbf {Q}}_{m}^h)^2 \nonumber \\&\quad +\lambda _{\mathbf {P}}^h\Vert {\mathbf {P}}^h\Vert _F^2+\lambda _{\mathbf {Q}}^h\Vert {\mathbf {Q}}^h\Vert _F^2 \end{aligned}$$
(6)

where \(\lambda _{\mathbf {P}}^h\) and \(\lambda _{\mathbf {Q}}^h\) are the user and item regularization parameters of the sub-matrix. The original binarized matrix \({\mathbf {C}}\) can then be approximated by the set of sub-matrices \({\mathcal {C}}= \{{\mathbf {C}}^1,{\mathbf {C}}^2,\ldots ,{\mathbf {C}}^H\}\):

$$\begin{aligned} {\mathbf {C}}_{um} \approx \frac{1}{{\mathbf {Z}}_{um}}\sum _{h=1}^H {\mathbf {T}}_{um}^h{{\mathbf {P}}^h_u}^\top {\mathbf {Q}}^h_m \end{aligned}$$
(7)

where \({\mathbf {Z}}_{um} = \sum _{h=1}^H {\mathbf {T}}_{um}^{h}\) is the normalizer and \({\mathbf {T}}^{h}_{um}\) is the weight of entry \({\mathbf {C}}^{h}_{um}\) in sub-matrix \({\mathbf {C}}^{h}\). Two key issues of such a sub-matrix-ensemble method are (1) how to generate the sub-matrices and (2) how to set the ensemble weights of the sub-matrices.
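
To make the ensemble of Eq. (7) concrete, the following sketch combines the local predictions of the H sub-models with their kernel weights (an illustrative implementation under our own naming, not the authors' code):

```python
import numpy as np

def lwmf_predict(u, m, P_list, Q_list, T_list):
    """Ensemble prediction for entry (u, m), Eq. (7).

    P_list[h], Q_list[h] : local latent factor matrices of sub-matrix h
    T_list[h]            : (N, M) kernel weight matrix T^h
    """
    num, Z = 0.0, 0.0
    for P_h, Q_h, T_h in zip(P_list, Q_list, T_list):
        num += T_h[u, m] * (P_h[u] @ Q_h[m])   # T^h_um * P^h_u . Q^h_m
        Z += T_h[u, m]                          # normalizer Z_um
    return num / Z if Z > 0 else 0.0
```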

Following LLORMA, to obtain a sub-matrix we first pick a data point \(a_i=\langle u_i, m_i \rangle\) from the data point set \({\mathcal {A}}=\{a_1, a_2,\ldots , a_{|{\mathbf {R}}|}\}\) as the anchor point \({\hat{a}}_h\). We then calculate the relevance between the anchor point and every other data point using a similarity measure or kernel function. Finally, we choose the data points whose relevance exceeds a constant to compose the sub-matrix, so the data points in the selected sub-matrix are similar to each other. Selecting more anchor points yields more sub-matrices.

Concretely, we use the Epanechnikov kernel to measure the relationship between two data points \(a_i=(u_i,m_i)\) and \(a_j = (u_j,m_j)\). It is computed as the product of a user Epanechnikov kernel \(E_{b} (u_i, u_j)\) and an item Epanechnikov kernel \(E_{b} (m_i, m_j)\):

$$\begin{aligned} E (a_i, a_j)= E_{b} (u_i, u_j)\times E_{b} (m_i, m_j) \end{aligned}$$
(8)

where

$$\begin{aligned}&E_{b} (u_i, u_j)\varpropto (1-d(u_i,u_j)^2) \,{\mathbf {1}}_{\{d(u_i,u_j)\le {b}\}}\\&E_{b} (m_i, m_j)\varpropto (1-d(m_i,m_j)^2) \,{\mathbf {1}}_{\{d(m_i,m_j)\le {b}\}} \end{aligned}$$

and b is the bandwidth parameter of the kernel. The distance between two users (for the user kernel) or two items (for the item kernel) is computed from their latent vectors, where the initial user and item latent factors are learned by WMF. Accordingly, the distance between users \(u_i\) and \(u_j\) is \(d(u_i, u_j)=\mathrm{arccos}(\frac{{\mathbf {P}}_{u_i}\cdot {\mathbf {P}}_{u_j}}{\Vert {\mathbf {P}}_{u_i}\Vert \cdot \Vert {\mathbf {P}}_{u_j}\Vert })\), where \({\mathbf {P}}_{u_i}\) and \({\mathbf {P}}_{u_j}\) are the latent vectors of the \(u_i\)-th and \(u_j\)-th users. The distance between items is computed in the same way. Given the anchor point \({\hat{a}}_h=\langle {\hat{u}}_h, {\hat{m}}_h \rangle\), we set the weight of the user-item pair \(\langle u_j,m_j \rangle\) in sub-matrix \({\mathbf {R}}^h\) to \({\mathbf {T}}^h_{u_jm_j} = E ({\hat{a}}_h, a_j)\), and the sub-matrix regularization parameters to \(\lambda _{\mathbf {P}}^h = \lambda _{\mathbf {P}} E_{b} ({\hat{u}}_h, u_j)\) and \(\lambda _{\mathbf {Q}}^h=\lambda _{\mathbf {Q}} E_{b} ({\hat{m}}_h, m_j)\).
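
A sketch of this product kernel, built from the arccosine distance over the WMF latent vectors, is given below (illustrative numpy code; the function and variable names are ours):

```python
import numpy as np

def arccos_distance(x, y):
    """Angular distance between two latent vectors, in [0, pi]."""
    cos = np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1.0, 1.0)
    return np.arccos(cos)

def epanechnikov(d, b):
    """E_b proportional to (1 - d^2) on {d <= b}, and 0 otherwise."""
    return (1.0 - d ** 2) * (d <= b)

def pair_kernel(anchor, point, P, Q, b=0.8):
    """Product kernel E(a_h, a_j) of Eq. (8) for anchor (u_h, m_h) and point (u_j, m_j)."""
    u_h, m_h = anchor
    u_j, m_j = point
    return (epanechnikov(arccos_distance(P[u_h], P[u_j]), b)
            * epanechnikov(arccos_distance(Q[m_h], Q[m_j]), b))
```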

Therefore, each anchor point defines a sub-matrix, and selecting the sub-matrix set \({\mathcal {C}}\) amounts to selecting a set of anchor points \(\hat{{\mathcal {A}}}=\{{\hat{a}}_1, {\hat{a}}_2,\ldots , {\hat{a}}_H\}\). The details of anchor point set selection are discussed in the next section.

4.2 Anchor Point Set Selection

Intuitively, the sub-matrix set \({\mathcal {C}}= \{{\mathbf {C}}^1,{\mathbf {C}}^2,\ldots ,{\mathbf {C}}^H\}\) should cover the original matrix \({\mathbf {C}}\), i.e., \({\mathbf {C}} = \cup _{{\mathbf {C}}^h\in {\mathcal {C}}}{\mathbf {C}}^h\), so that \({\mathcal {C}}\) approximates the original matrix \({\mathbf {C}}\) better than a set that does not cover it. The anchor point selection problem can therefore be reduced to a set cover problem.

4.2.1 Anchor Point Set Cover (ASC)

We treat all the nonzero user-item pairs, i.e., the data point set \({\mathcal {A}}=\{a_1,a_2,\ldots ,a_{|{\mathbf {R}}|}\}\), as the candidate anchor point set. Every candidate point \(a_i\) covers itself and several other candidate points, denoted by \({\mathcal {A}}^i=\{a_i, a_{i1}, a_{i2},\ldots , a_{iD}\}\subset {\mathcal {A}}\). We then propose a naive anchor point cover method, called Anchor Point Set Cover (ASC), which returns an anchor point set \(\hat{{\mathcal {A}}} \subset {\mathcal {A}}\) such that

$$\begin{aligned} \max J(\hat{{\mathcal {A}}})= & {} |\cup _{i\in \hat{{\mathcal {A}}}}{\mathcal {A}}^i| \nonumber \\ s.t. |\hat{{\mathcal {A}}}|= & {} H \end{aligned}$$
(9)

The ASC objective is clearly sub-modular and monotone [25], so the greedy algorithm achieves a \(1-\frac{1}{e}\) approximation of the optimum.

4.2.2 Discounted Cumulative Gain Anchor Point Set Cover (DCGASC)

However, the set cover problem only requires each point to be covered once, and covering each training point only once is not enough: covering the training data multiple times also helps the final recommendation. Although performance improves with more covers, the gain diminishes, similar to the ranking quality measures NDCG (normalized discounted cumulative gain) [22] and ERR (expected reciprocal rank) [21] in IR (information retrieval). The premise of NDCG and ERR is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced in proportion to the position of the result. Borrowing this discounting idea, we propose a heuristic method, called Discounted Cumulative Gain Anchor Point Set Cover (DCGASC), which returns an ordered anchor point list \(\hat{{\mathcal {A}}} = \{{\hat{a}}_{1},{\hat{a}}_{2},\ldots ,{\hat{a}}_{H}\} \subset {\mathcal {A}}\) such that

$$\begin{aligned}&\max J(\hat{{\mathcal {A}}}) = \sum _{h=1}^{H}\sum _{a_l \in \hat{{\mathcal {A}}}^{h}}\alpha ^{o_{lh}-1}(1-\mathrm{max}_{h'\in \{1,\ldots ,h-1\}} E_b({\hat{a}}_{h}, {\hat{a}}_{h'})) \nonumber \\&\quad \mathrm{s.t.}\ |\hat{{\mathcal {A}}}| = H \end{aligned}$$
(10)

where \(o_{lh}\) denotes the number of times \(a_l\) is covered by the selected anchor points \(\{{\hat{a}}_1,{\hat{a}}_2,\ldots ,{\hat{a}}_h\}\), and \(\alpha \in (0,1)\) is the discount parameter. When a point \(a_l\) has already been covered by an anchor point, the gain of covering it again is reduced. When \(\alpha = 0\), the problem reduces to the set cover problem, and when \(\alpha = 1\), the method simply selects the point that covers the most other points at each step. The \((1-\mathrm{max}_{h'\in \{1,\ldots ,h-1\}} E_b({\hat{a}}_{h}, {\hat{a}}_{h'}))\) term means that DCGASC tends to select points that are far from the already selected anchor points. Below we prove that \(J(\cdot )\) is sub-modular and monotone.

Theorem 1

The DCGASC function is sub-modular and monotone nondecreasing.

Proof

Let \({\mathcal {S}}=\{{\hat{a}}_1,{\hat{a}}_2,\ldots ,{\hat{a}}_{H-1}\}\) and \({\mathcal {V}}=\{{\hat{a}}_1,{\hat{a}}_2,\ldots ,{\hat{a}}_{H-1},\ldots , {\hat{a}}_{X-1}\}\) be anchor point sets with \(X\ge H\), and let \(a_i = {{\hat{a}}_X} \in {\mathcal {A}} \backslash {\mathcal {V}}\) be the next selected anchor point. We have that

$$\begin{aligned}&J({\mathcal {V}}\cup \{{\hat{a}}_X\})-J({\mathcal {V}}) \nonumber \\&\quad = \sum _{h=1}^{X}\sum _{a_l \in \hat{{\mathcal {A}}}^{h}}\alpha ^{o_{lh}-1}(1-\mathrm{max}_{h'\in \{1,\ldots ,h-1\}} E_b({\hat{a}}_{h}, {\hat{a}}_{h'})) \nonumber \\&\qquad - \sum _{h=1}^{X-1}\sum _{a_l \in \hat{{\mathcal {A}}}^{h}}\alpha ^{o_{lh}-1}(1-\mathrm{max}_{h'\in \{1,\ldots ,h-1\}} E_b({\hat{a}}_{h}, {\hat{a}}_{h'})) \nonumber \\&\quad = \sum _{a_l \in \mathcal {{\hat{A}}}^X}\alpha ^{o_{lX}-1}(1-\mathrm{max}_{h'\in \{1,\ldots ,X-1\}} E_b({\hat{a}}_X, {\hat{a}}_{h'}))\ge 0 \end{aligned}$$
(11)

and

$$\begin{aligned}&J({\mathcal {S}}\cup \{{\hat{a}}_X\})-J({\mathcal {S}})-(J({\mathcal {V}}\cup \{{\hat{a}}_X\})-J({\mathcal {V}}))\nonumber \\&\quad =\sum _{a_l \in \mathcal {{\hat{A}}}^X}\alpha ^{o_{lX'}-1}(1-\mathrm{max}_{h'\in \{1,\ldots ,H-1\}} E_b({\hat{a}}_X, {\hat{a}}_{h'}))\nonumber \\&\qquad -\sum _{a_l \in \mathcal {{\hat{A}}}^X}\alpha ^{o_{lX}-1}(1-\mathrm{max}_{h'\in \{1,\ldots ,X-1\}} E_b({\hat{a}}_X, {\hat{a}}_{h'})) \end{aligned}$$
(12)

where \(o_{lX'}\) denotes the number of times \(a_l\) is covered by the anchor points \({\mathcal {S}}\cup \{{\hat{a}}_X\}\). Because the cover counts satisfy \(o_{lX'}\leqslant o_{lX}\), the discount parameter \(\alpha \in [0,1]\), and \(\mathrm{max}_{h'\in \{1,\ldots ,H-1\}} E_b({\hat{a}}_{X},{\hat{a}}_{h'})\le \mathrm{max}_{h'\in \{1,\ldots ,X-1\}} E_b({\hat{a}}_{X},{\hat{a}}_{h'})\), we have \(J({\mathcal {S}}\cup \{{\hat{a}}_X\})-J({\mathcal {S}})-(J({\mathcal {V}}\cup \{{\hat{a}}_X\})-J({\mathcal {V}}))\ge 0\). Therefore, the DCGASC function is monotone and sub-modular.

Due to the monotonicity and sub-modularity of the DCGASC function, the greedy Algorithm 1 provides a theoretical approximation guarantee of factor \(1-\frac{1}{e}\), as described in [23]. Algorithm 1 first selects the anchor point that covers the most other points and then repeatedly uses the marginal gain of Eq. (12) to select the subsequent anchor points.

Algorithm 1: Greedy DCGASC anchor point selection
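
A compact sketch of the greedy selection in Algorithm 1, under our reading of Eq. (10), is shown below: each candidate anchor \(a_i\) covers a fixed set \({\mathcal {A}}^i\), the gain of a point already covered c times is discounted by \(\alpha ^{c}\), and the total gain is damped by the candidate's maximal kernel similarity to the anchors selected so far. The user-based variant of Sect. 4.4 follows the same procedure with user-only cover sets and kernels. This is illustrative code with names of our own, not the authors' implementation:

```python
import numpy as np

def dcgasc_greedy(cover_sets, kernel, H, alpha=0.4):
    """Greedy DCGASC anchor selection (sketch of Algorithm 1).

    cover_sets[i] : set of data-point indices covered by candidate anchor i
    kernel[i][j]  : E_b(a_i, a_j), kernel similarity between candidates i and j
    H             : number of anchors to select
    """
    n = len(cover_sets)
    counts = np.zeros(n)          # o_l: how often point l is already covered
    anchors = []
    for _ in range(H):
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in anchors:
                continue
            # damping: 1 - max kernel similarity to the already selected anchors
            damp = 1.0 - max((kernel[i][a] for a in anchors), default=0.0)
            gain = damp * sum(alpha ** counts[l] for l in cover_sets[i])
            if gain > best_gain:
                best, best_gain = i, gain
        anchors.append(best)
        for l in cover_sets[best]:
            counts[l] += 1        # discount future coverage of these points
    return anchors
```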

4.3 Learning Algorithm

Alternating least squares (ALS) is a popular approach to optimizing weighted matrix factorization [10]. He et al. [24] proposed a fast element-wise ALS learning algorithm that optimizes each coordinate of a latent vector with the others fixed and speeds up learning by avoiding the massive repeated computations introduced by the weighted missing data. In this paper, we use the element-wise ALS learning algorithm to learn the sub-matrix latent vectors. More specifically, the latent factors of the u-th user are updated based on

$$\begin{aligned} {\mathbf {P}}_{uk}^h = \frac{\sum _{m\in {\mathcal {M}}^h}\left( {\mathbf {C}}_{um}-{{\hat{\mathbf{C}}}}_{um,k}^h\right) {\mathbf {T}}_{um}^h{\mathbf {W}}_{um}{\mathbf {Q}}_{mk}^h}{\sum _{m\in {\mathcal {M}}^h}{\mathbf {T}}_{um}^h{\mathbf {W}}_{um}{\mathbf {Q}}_{mk}^h{\mathbf {Q}}_{mk}^h+\lambda _{\mathbf {P}}^h} \end{aligned}$$
(13)

where \({\mathcal {M}}^h\) denotes the set of item indices in the h-th sub-matrix, and \({{\hat{\mathbf{C}}}}_{um,k}^h = {{\hat{\mathbf{C}}}}_{um}^h-{\mathbf {P}}_{uk}^h{\mathbf {Q}}_{mk}^h\) is the prediction without the component of latent factor k, where \({{\hat{\mathbf{C}}}}_{um}^h\) is the predicted score. Note that \({\mathbf {C}}_{um}\) and \({\mathbf {W}}_{um}\) are the same across different sub-matrices. The sub-matrix weight \({\mathbf {T}}_{um}^h\) is the only difference between Eq. (13) and the original WMF update, which could lead to high running time. Fortunately, since \({\mathbf {T}}^h_{um} = E_{b} ({\hat{u}}_h, u)\times E_{b} ({\hat{m}}_h, m)\) and \(\lambda _{\mathbf {P}}^h = \lambda _{\mathbf {P}} E_{b} ({\hat{u}}_h, u)\), we can also speed up learning by caching the massive repeated computations. First, \(E_{b} ({\hat{u}}_h, u)\) appears in both the numerator and the denominator, so it cancels; note that if \(E_{b} ({\hat{u}}_h, u)=0\), the latent vector \({\mathbf {P}}_u^h\) need not be computed. We then focus on the numerator:

$$\begin{aligned} \sum _{m\in {\mathcal {M}}^h}&\left( {\mathbf {C}}_{um}-{{\hat{\mathbf{C}}}}_{um,k}^h\right) E_{b} ({\hat{m}}_h, m){\mathbf {W}}_{um}{\mathbf {Q}}_{mk}^h \nonumber \\ =&\sum _{m\in {\mathcal {M}}_u^h}\left[ {\mathbf {W}}_{um}{\mathbf {C}}_{um}-({\mathbf {W}}_{um}-1){{\hat{\mathbf{C}}}}_{um,k}^h\right] E_{b} ({\hat{m}}_h, m){\mathbf {Q}}_{mk}^h\nonumber \\&-\sum _{m\in {\mathcal {M}}^h} E_{b} ({\hat{m}}_h, m){{\hat{\mathbf{C}}}}_{um,k}^h{\mathbf {Q}}_{mk}^h \end{aligned}$$
(14)

where \({\mathcal {M}}_u^h\) is the set of items discovered by the u-th user in the h-th sub-matrix. Because \(E_{b} ({\hat{m}}_h, m)\) is the same for different users, the caching trick can also be used here. The \(\sum _{m\in {\mathcal {M}}^h} E_{b} ({\hat{m}}_h, m){{\hat{\mathbf{C}}}}_{um,k}^h{\mathbf {Q}}_{mk}^h\) term can be sped up:

$$\begin{aligned}&\sum _{m\in {\mathcal {M}}^h} E_{b} ({\hat{m}}_h, m){{\hat{\mathbf{C}}}}_{um,k}^h{\mathbf {Q}}_{mk}^h \nonumber \\&\quad = \sum _{f\ne k}{\mathbf {P}}_{uf}^h\sum _{m\in {\mathcal {M}}^h} E_{b} ({\hat{m}}_h, m){\mathbf {Q}}_{mk}^h{\mathbf {Q}}_{mf}^h \end{aligned}$$
(15)

Thus \(\sum _{m\in {\mathcal {M}}^h} E_{b} ({\hat{m}}_h, m){\mathbf {Q}}_{mk}^h{\mathbf {Q}}_{mf}^h\) can be pre-computed and reused when updating the latent vectors of all users. Similarly, the same caching can be used in the denominator. Defining \({\mathbf {S}}^{{\mathbf {Q}}^h}=\sum _{m\in {\mathcal {M}}^h} E_{b} ({\hat{m}}_h, m){\mathbf {Q}}_m^h{\mathbf {Q}}_m^{h\top }\), Eq. (13) can be calculated as:

$$\begin{aligned} {\mathbf {P}}_{uk}^h&= \left\{ \sum _{m\in {\mathcal {M}}_u^h}\left[ {\mathbf {W}}_{um}{\mathbf {C}}_{um}-({\mathbf {W}}_{um}-1){{\hat{\mathbf{C}}}}_{um,k}^h\right] E_{b} ({\hat{m}}_h, m){\mathbf {Q}}_{mk}^h\right. \nonumber \\&\quad \left. -\sum _{f\ne k}{\mathbf {P}}_{uf}^h{\mathbf {S}}^{{\mathbf {Q}}^h}_{fk}\right\} /\left\{ \sum _{m\in {\mathcal {M}}^h_u} E_{b} ({\hat{m}}_h, m)({\mathbf {W}}_{um}-1){\mathbf {Q}}_{mk}^h{\mathbf {Q}}_{mk}^h\right. \nonumber \\&\quad \left. +\,{\mathbf {S}}^{{\mathbf {Q}}^h}_{kk}+\lambda _{\mathbf {P}}\right\} \end{aligned}$$
(16)

where \({\mathbf {S}}^{{\mathbf {Q}}^h}_{fk}\) is the (f, k)-th element of \({\mathbf {S}}^{{\mathbf {Q}}^h}\). Similarly, defining \({\mathbf {S}}^{{\mathbf {P}}^h}=\sum _{u\in {\mathcal {U}}^h} E_{b} ({\hat{u}}_h, u){\mathbf {P}}_u^h{\mathbf {P}}_u^{h\top }\), the update of the item latent vectors is:

$$\begin{aligned} {\mathbf {Q}}_{mk}^h&= \left\{ \sum _{u\in {\mathcal {U}}_m^h}\left[ {\mathbf {W}}_{um}{\mathbf {C}}_{um}-({\mathbf {W}}_{um}-1){{\hat{\mathbf{C}}}}_{um,k}^h\right] E_{b} ({\hat{u}}_h, u){\mathbf {P}}_{uk}^h\right. \nonumber \\&\quad \left. -\sum _{f\ne k}{\mathbf {Q}}_{mf}^h{\mathbf {S}}^{{\mathbf {P}}^h}_{fk}\right\} /\left\{ \sum _{u\in {\mathcal {U}}^h_m} E_{b} ({\hat{u}}_h, u)({\mathbf {W}}_{um}-1){\mathbf {P}}_{uk}^h{\mathbf {P}}_{uk}^h\right. \nonumber \\&\quad \left. +\,{\mathbf {S}}^{{\mathbf {P}}^h}_{kk}+\lambda _{\mathbf {Q}}\right\} \end{aligned}$$
(17)

With the local sub-matrix weights, one iteration therefore takes \(O(NK^2+MK^2+|{\mathbf {R}}|K)\) time, the same as fast element-wise ALS [24].
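
The following sketch illustrates one eALS sweep over the user factors of a single sub-matrix according to Eq. (16), including the \({\mathbf {S}}^{{\mathbf {Q}}^h}\) cache (a simplified numpy illustration under our own naming; the item update of Eq. (17) is symmetric):

```python
import numpy as np

def update_user_factors(C, W, Em, P, Q, lam_P):
    """One eALS sweep over the user factors of a single sub-matrix (Eq. 16).

    C     : (N, M) binarized interactions restricted to the sub-matrix
    W     : (N, M) confidence weights of Eq. (2)
    Em    : (M,)   item kernel weights E_b(m_hat_h, m) for this sub-matrix
    P     : (N, K) user factors, updated in place
    Q     : (M, K) item factors, held fixed during this sweep
    lam_P : regularization lambda_P (E_b(u_hat_h, u) has been canceled out)
    """
    N, K = P.shape
    # Cache S^{Q^h} = sum_m E_b(m_hat, m) Q_m Q_m^T, shared by all users
    SQ = (Q * Em[:, None]).T @ Q
    for u in range(N):
        items = np.nonzero(C[u])[0]                  # M_u^h: items u discovered
        if items.size == 0:
            continue
        w, e, q = W[u, items], Em[items], Q[items]
        c_hat = q @ P[u]                             # current predictions on M_u^h
        for k in range(K):
            c_hat_k = c_hat - P[u, k] * q[:, k]      # prediction without factor k
            num = np.sum((w * C[u, items] - (w - 1.0) * c_hat_k) * e * q[:, k])
            num -= P[u] @ SQ[:, k] - P[u, k] * SQ[k, k]
            den = np.sum(e * (w - 1.0) * q[:, k] ** 2) + SQ[k, k] + lam_P
            P[u, k] = num / den
            c_hat = c_hat_k + P[u, k] * q[:, k]      # refresh cached predictions
    return P
```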

Algorithm 2: Learning the local weighted latent vectors

Algorithm 2 summarizes the process of learning the local weighted latent vectors. First, we use fast element-wise ALS [24] to learn the global latent vectors (Line 1). Then we obtain the anchor point set with Algorithm 1. Finally, we adopt fast element-wise ALS to learn the latent vectors of every sub-matrix.

Fig. 2 User-based local matrix factorization

4.4 User-based Local Weighted Matrix Factorization

The LWMF method above uses the selected sub-matrices to model local structure and ignores the global structure. For the item recommendation problem in particular, we should recommend items to a user from among all items. We therefore propose a variant, called user-based Local Weighted Matrix Factorization, which considers only users when selecting the anchor points and puts all items into each sub-matrix. Given the user set \({\mathcal {U}}=\{u_1,u_2,\ldots ,u_N\}\) (all users), where every user \(u_i\) covers itself and several other users denoted by \({\mathcal {U}}^i=\{u_i, u_{i1}, u_{i2},\ldots , u_{iD}\}\), we need to find a user anchor set \(\mathcal {{\hat{U}}}=\{{\hat{u}}_1,{\hat{u}}_2,\ldots ,{\hat{u}}_H\}\) that maximizes the user anchor set cover function:

$$\begin{aligned}&\max J(\mathcal {{\hat{U}}})\nonumber \\&\quad = \sum _{h=1}^{H}\sum _{u_l \in {\mathcal {U}}^h}\alpha ^{o_{lh}-1}(1-\mathrm{max}_{h'\in \{1,\ldots ,h-1\}} E_b({\hat{u}}_h,{\hat{u}}_{h'})) \nonumber \\&\quad \qquad s.t. |\mathcal {{\hat{U}}}| = H \end{aligned}$$
(18)

Obviously, this user-based DCGASC function is also sub-modular and monotone nondecreasing. Figure 2 shows how user-based LWMF selects the sub-matrices. Because items need not be considered, selecting the user anchor point set is much faster. Moreover, user-based LWMF is better suited to the item recommendation problem. As a direct comparison with user-based LWMF, we also implement item-based LWMF, which considers only items when selecting the anchor points and puts all users into each sub-matrix.

5 Experiments

Table 2 Precision and recall comparison on Foursquare and Gowalla, where column “improve” indicates the relative improvements that our approach LWMF achieves relative to the basic WMF results

In this section, we evaluate the proposed method on real datasets. We first introduce the datasets and experimental settings. Then, we compare our method with WMF under specific parameter settings. We also compare results with different numbers of anchor points and two anchor point selection methods.

5.1 Dataset

We choose two real-world datasets from [29]: Foursquare check-ins made in Singapore between August 2010 and July 2011, and Gowalla check-ins made in California and Nevada between February 2009 and October 2010. Both are popular online LBSN datasets.

The Foursquare check-in data comprises 194,108 check-ins made by 2312 users at 5596 POIs, with a density of \(1.50\times 10^{-2}\). The Gowalla check-in data comprises 456,967 check-ins made by 10,162 users at 24,238 POIs, with a density of \(1.86\times 10^{-3}\). Both datasets are very sparse (Table 2).

More details about the two datasets are listed in Table 3. We randomly select 80% of each user’s visited locations as the training set and the remaining 20% as the testing set.

Table 3 Detailed information of Gowalla and Foursquare

5.2 Setting

Fig. 3 Comparison with different number of anchor points. a Precision on Foursquare, b recall on Foursquare, c precision on Gowalla, d recall on Gowalla

Fig. 4 Anchor point set selection methods comparison. a Precision on Foursquare, b recall on Foursquare, c precision on Gowalla, d recall on Gowalla

Fig. 5 Comparison with different discounts of anchor points. a Precision on Foursquare, b recall on Foursquare, c precision on Gowalla, d recall on Gowalla

Next, we describe the parameter settings. The regularization \(\lambda\) is set to 10; recommendation performance is not sensitive to this parameter. The weight parameter \(\varepsilon\) is set to 2 for Foursquare and 3 for Gowalla. We set the bandwidth of the Epanechnikov kernel to \(b=0.8\) and the discount \(\alpha\) of DCGASC to 0.4. We select 100 anchor points for both datasets. In the experiments, we observe that performance improves with more anchor points, but the training time increases accordingly.

We employ Precision@n and Recall@n to measure performance. For the u-th user, let \({\mathcal {I}}^P_u\) be the predicted item list and \({\mathcal {I}}^T_u\) the true list in the testing dataset. Then Precision@n and Recall@n are:

$$\begin{aligned} \mathrm{Precision}@n= & {} \frac{1}{N}\sum _{u=1}^{N}\frac{|{\mathcal {I}}^P_u\bigcap {\mathcal {I}}^T_u|}{n}\\ \mathrm{Recall}@n= & {} \frac{1}{N}\sum _{u=1}^{N}\frac{|{\mathcal {I}}^P_u\bigcap {\mathcal {I}}^T_u|}{|{\mathcal {I}}^T_u|} \end{aligned}$$

where \(|{\mathcal {I}}^P_u|=n\). In our main experiments, we report both metrics at \(n=10\).
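
A direct sketch of the two metrics (illustrative Python; `pred_lists` and `truth_sets` are assumed per-user collections and the names are ours):

```python
def precision_recall_at_n(pred_lists, truth_sets, n=10):
    """Average Precision@n and Recall@n over users with non-empty test sets."""
    precisions, recalls = [], []
    for pred, truth in zip(pred_lists, truth_sets):
        if not truth:
            continue
        hits = len(set(pred[:n]) & set(truth))   # |I^P_u ∩ I^T_u|
        precisions.append(hits / n)
        recalls.append(hits / len(truth))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```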

We compare seven methods for implicit feedback datasets:

  • Most popular: This is the most basic method, which recommends the most popular items to the target user.

  • KNN\(_u\): This is user-based CF method, where user-user similarity is calculated based on the training data.

  • KNN\(_m\): This method is similar to KNN\(_u\), and the difference is that KNN\(_m\) calculates item-item similarity based on the training data. Specifically, we set the neighbor numbers in KNN\(_u\) and KNN\(_m\) to 100.

  • WMF: This is the state-of-the-art method, which is a whole-data-based learning approach setting a uniform weight to missing entries [10, 24].

  • LWMF\(_{\mathrm{both}}\): This is our proposed method, which employs the kernel function to capture local structure and the weight function to model user preferences.

  • LWMF\(_u\): A variant method of LWMF\(_{\mathrm{both}}\) which only considers users to select the anchor points and puts all items into the sub-matrix.

  • LWMF\(_m\): A variant method of LWMF\(_{\mathrm{both}}\) which only considers items to select the anchor points and puts all users into the sub-matrix.

Then, we compare two anchor point selection methods to study the performance of LWMF:

  • Random: Sampling anchor points uniformly from the training dataset, as in [7].

  • Discounted Cumulative Gain Anchor Set Cover (DCGASC): Selecting anchor points by the discounted cumulative gain of coverage; the objective is sub-modular and monotone.

Thus LWMF can be instantiated as two sub-methods: LWMF_Random and LWMF_DCGASC. By default, LWMF refers to LWMF_DCGASC. Each method is run five times independently, and the average score is reported as the performance of each recommendation method.

5.3 Experimental Results

In this section, we discuss the experimental results on Foursquare and Gowalla datasets.

5.3.1 Recommendation Methods Comparison

Table 2 lists the precision and recall of the seven methods described above on the Foursquare and Gowalla datasets. Consistent with the finding in [7] that LLORMA outperforms SVD, LWMF always outperforms WMF. The performance of both WMF and LWMF improves as K increases. However, on Foursquare, when K reaches 40 the performance of both drops, indicating overfitting, so we choose K = 20. On Gowalla, the best performance is obtained with K larger than 40. All LWMF variants clearly outperform WMF in all dimensions. On Gowalla in particular, the precision and recall of LWMF are more than 25% better than those of WMF; when K equals 5, the precision of LWMF\(_{both}\) is 52.56% better than that of WMF. The larger improvements on Foursquare and Gowalla are due to the local structure of the data. For example, a city contains several business districts, and the business POIs within each district are geographically close to each other. As for our three variants, the differences among their performances are not very large, but overall LWMF\(_u\) is better than the other two. LWMF\(_u\) performs the recommendation task based on users, so it can be inferred that selecting anchor points based on users is more reasonable than the other two strategies. We also compare against the three basic methods MostPopular, KNN\(_u\) and KNN\(_m\); the results indicate that our methods outperform all three. Although KNN\(_u\) and KNN\(_m\) are better than LWMF for small K on Gowalla, the performance of LWMF increases with K and ends up far better than KNN\(_u\) and KNN\(_m\).

5.3.2 Comparison with Different Number of Anchor Points

Figure 3 shows the performance of LWMF with different numbers of anchor points. For both datasets, the precision and recall of both LWMF and WMF improve as K increases, and LWMF performs better than WMF when \(K\ge 20\). On Foursquare, LWMF with \(K=20\) and anchor number \(H\ge 20\) outperforms WMF with \(K=20\), while on Gowalla the same holds for \(H \ge 40\) anchor points. As the number of anchor points increases, the performance improves, and with about 50 anchor points we already obtain good performance. Although the training time grows, the gap in matrix factorization running time between LWMF and WMF is small: the running time of WMF is \(O(NK^2+MK^2+|{\mathbf {R}}|K)\), and the sub-matrices of LWMF are much smaller than the original matrix (in both datasets, each sub-matrix is on average about 10% of the original matrix), so factorizing a single sub-matrix is much faster than factorizing the original matrix. Nevertheless, LWMF spends additional time computing the kernel weights between users and items and selecting anchor points.

5.3.3 Anchor Point Set Selection Methods Comparison

Next, we compare the performance of LWMF\(_u\)_Random and LWMF\(_u\) in Fig. 4. The discount parameter \(\alpha\) is set to 0.4, and K is set to 20 for Foursquare and 40 for Gowalla. As Fig. 4 shows, when the number of anchor points is small, LWMF\(_u\) achieves better precision and recall. As the number of anchor points increases, the performance gap narrows. Nevertheless, LWMF\(_u\) outperforms LWMF\(_u\)_Random on both datasets.

5.3.4 Comparison with Different Discounts for DCGASC

Finally, we study the performance of LWMF\(_u\) with different discount parameters. K is set to 20 for Foursquare and 40 for Gowalla. We vary \(\alpha\) over the range (0, 1] in steps of 0.1. Because the results for \(\alpha \in [0.2,0.8]\) are similar, we only plot the curves for \(\alpha \in \{0.2,0.4,0.6,0.8\}\) in Fig. 5. The performance gap among the four discount parameters is small, with \(\alpha = 0.4\) performing slightly better. In general, the performance of LWMF is not sensitive to the discount parameter and mainly depends on the number of anchor points.

6 Conclusion and Future Work

In this paper, we propose LWMF, which selects sub-matrices to better model user behavior and relieves the sparsity problem through sub-matrix factorization. Moreover, we propose DCGASC to select the sub-matrix set, which further improves the performance of LWMF. Extensive experiments on two real datasets demonstrate the effectiveness of our approach compared with the state-of-the-art method WMF.

We plan to study three further directions: (1) speeding up the selection of sub-matrices; (2) in this paper, we first select the sub-matrix set by choosing anchor points and then perform weighted matrix factorization for each sub-matrix, so two separate steps are needed to optimize the objective; we will try to optimize the local matrix factorization within a single objective function; (3) leveraging additional information in LWMF for special scenarios, such as geographical information in POI recommendation.