1 Introduction

With the rapid development of Web 2.0 and wireless communication technologies, we have entered an era of information overload: it has become difficult for users to quickly find the information they need. Two kinds of solutions address this challenge: information retrieval [19] and recommender systems [12]. When users can express their requirements clearly, information retrieval serves them well, e.g., "When does the next match of Real Madrid football club start?" In many cases, however, it is difficult to formulate a specific demand, e.g., "What is the Internet talking about right now?", "Which recent movie is the most interesting?", "Which book should I buy?" Recommender systems can give the answers.

Recommender systems have become an important tool for helping users easily find their favorite items. In general, they fall into three categories: content-based recommendation, collaborative filtering and hybrid recommendation [9]. Content-based recommendation helps users make decisions when content can be abstracted from the items, e.g., news [4], jokes, books, reviews. However, the recommended items tend to be very familiar to the user; this lack of novelty is one of its weak points. The main idea of collaborative filtering is to exploit information about the past behaviors of all users of the system to predict which items the current user will most probably be interested in. Pure collaborative filtering approaches take a matrix of given user–item ratings as their only input. Although collaborative filtering has achieved great success, it has shortcomings: cold-start items cannot be recommended, and popular items are recommended too often. The known limitations of pure content-based recommenders and of collaborative filtering soon led to hybrid recommendation, which combines the advantages of different recommendation techniques. In this work, we focus on collaborative filtering and exploit the implied hierarchical information to improve recommendation performance.

Collaborative filtering [11, 12, 23, 25], the method behind many recommender systems, has been developed for many years and remains a hot research topic. Users' decisions about items (e.g., clicking, purchasing, re-tweeting, commenting) are made in particular environments, often referred to as context. Contexts such as time, location, mood and companion can be collected easily in real-world applications. Compared with conventional recommendation based solely on user–item interactions, context-aware recommendation (CAR) can significantly improve recommendation quality.

For this purpose, a great number of context-aware recommendation methods [10, 13, 21] have been proposed. Among them, Factorization Machines (FM) [21] is currently an influential and popular one. It represents the user–item–context interactions as a linear combination of latent factors to be inferred from the data and treats the latent factors of user, item and context equally. Despite its successful application, the existing FM model makes poor use of hierarchical information. In practice, hierarchies capture broad contextual information at different levels and hence ought to be exploited to improve recommendation quality. The intuition is that locally homogeneous contexts tend to generate similar ratings. For example, many men working in IT departments like to browse technology Web sites in the office during the day, but enjoy visiting sport Web sites at home in the evening. Here, users may be arranged in a hierarchy based on gender or occupation, Web sites may be characterized by their content, and time and location have natural hierarchies.

In this paper, we focus on exploiting hierarchical information to improve recommendation quality. We propose Random Partition Factorization Machines (RPFM), which adopts random decision trees to split the contexts hierarchically and thus better capture their local interplay. More specifically, the user–item–context interactions are first partitioned across the nodes of a decision tree according to their local contexts. An FM model is then fitted to the interactions at each node to capture their tight mutual influence. During prediction, our method walks from the root to the leaves and borrows from predictions at higher levels when the lower levels are sparse. Besides improving rating estimation accuracy, RPFM also reduces over-fitting by building an ensemble over multiple decision trees. The main contributions of the paper are summarized as follows:

  1. FM is one of the most successful approaches for context-aware recommendation; however, it learns a single set of model parameters from the whole training set. We propose the novel RPFM model, which exploits the intuition that homogeneous environments generate similar ratings.

  2. We adopt the k-means clustering method to partition the user–item–context interactions at each node of the decision trees, using the similarity between the latent factor vectors of the FM model. The tuples in the subset at each node are expected to influence each other more strongly.

  3. We conduct experiments on three datasets and compare RPFM with five state-of-the-art context-aware recommendation methods to demonstrate its performance.

The rest of the paper is organized as follows: Sect. 2 reviews related work on context-aware and random partition-based models. Sect. 3 introduces the FM model. Sect. 4 presents the proposed Random Partition Factorization Machines (RPFM), including an algorithm description and a discussion of two state-of-the-art random partition-based models. Sect. 5 reports experimental results on three real datasets. Sect. 6 concludes the paper and outlines future research directions.

2 Related Work

The work presented in this paper is closely related to context-aware recommendation and to random partitioning over tree structures. In the following, we review related work as background for our solution.

2.1 Context-Aware Recommendation

In general, there are three types of integration method [2]: (1) contextual pre-filtering; (2) contextual post-filtering; and (3) contextual modeling. In contrast to the first two, contextual modeling uses all the contextual and user–item information simultaneously to make predictions. More recent works have focused on this third method [10, 13, 24, 27].

Karatzoglou et al. [10] proposed the Multiverse Recommendation model, in which the different types of context are treated as additional dimensions of a data tensor. Factorizing this tensor yields a compact model of the data that can be used to provide context-aware recommendations; however, its computational complexity is too high for real-world scenarios. Rendle [24] showed that the Factorization Machines (FM) model can be applied to context-aware recommendation because a wide variety of context-aware data can be transformed into a prediction task over real-valued feature vectors. Nguyen et al. [16] developed a nonlinear probabilistic algorithm for context-aware recommendation using Gaussian processes, called Gaussian Process Factorization Machines (GPFM), which is applicable to both the explicit and the implicit feedback setting. The most recent approach in terms of prediction accuracy is the COT model [13], which represents the common semantic effects of contexts as a contextual operating tensor and each context as a latent vector. To model the semantic operation of a context combination, it generates a contextual operating matrix from the contextual operating tensor and the latent vectors of the contexts, so that the latent vectors of users and items can be transformed by the contextual operating matrices. However, its computational complexity is also very high.

2.2 Random Partition on Tree Structure

Fan et al. [5] proposed Random Decision Trees, applicable to classification and regression, to partition the rating matrix and build an ensemble. At each intermediate node, the instances are split into two parts according to a randomly selected feature and threshold. Zhong et al. [28] proposed Random Partition Matrix Factorization (RPMF), which builds a tree structure via an efficient random partition technique and explores a low-rank approximation of the current sub-rating matrix at each node. RPMF combines the predictions at each node (non-leaf and leaf) on the decision path from the root to a leaf. Liu et al. [14] handled contextual information by using random decision trees to partition the original user–item rating matrix so that ratings with similar contexts are grouped together; matrix factorization is then employed to predict the missing ratings in each partitioned sub-matrix.

3 Preliminaries

In this section, we briefly review Factorization Machines (FM) which is closely related to our work.

The notations used in this paper are summarized in Table 1.

Table 1 Definition of notations

3.1 Factorization Machines

Factorization Machines (FM), proposed by Rendle [21], is a general predictor that can mimic classical models like biased MF [12], SVD++ [11], PITF [25] or FPMC [23]. The model equation for an FM of degree \(d=2\) is defined as:

$$\begin{aligned} {\hat{y}}(x_{i}) = \omega _{0}+\sum _{j=1}^{p}\omega _{j}x_{i,j}+\sum _{j=1}^{p}\sum _{j'=j+1}^{p}\langle {\mathbf {v}}_{j},{\mathbf {v}}_{j'}\rangle x_{i,j}x_{i,j'}, \end{aligned}$$
(1)

and

$$\begin{aligned} \langle {\mathbf {v}}_{j},{\mathbf {v}}_{j'}\rangle :=\sum _{k=1}^{f}v_{j,k}\cdot v_{j',k}, \end{aligned}$$
(2)

where the model parameters \(\varTheta \) that have to be estimated are:

$$\begin{aligned} \omega _{0}\in {\mathbb {R}},\quad {\mathbf {w}}\in {\mathbb {R}}^{p},\quad \mathbf {V}\in {\mathbb {R}}^{f\times {p}}. \end{aligned}$$
(3)

A column vector \({\mathbf {v}}_{j}\) of \(\mathbf {V}\) represents the jth variable with f factors, where \(f \in {\mathbb {N}}_{0}^{+}\) is the dimensionality of the factorization.

The model equation of a factorization machine in Eq. (1) can be computed in linear time \(O(f\cdot p)\) because the pairwise interactions can be reformulated:

$$\begin{aligned} \sum _{j=1}^{p}\sum _{j'=j+1}^{p}\langle {\mathbf {v}}_{j},{\mathbf {v}}_{j'}\rangle x_{i,j}x_{i,j'} =\frac{1}{2}\sum _{k=1}^{f}\left( \left( \sum _{j=1}^{p}v_{j,k}x_{i,j}\right) ^{2}-\sum _{j=1}^{p}v_{j,k}^2 x_{i,j}^2\right) \end{aligned}$$
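To make the reformulation concrete, the following sketch computes Eq. (1) with the \(O(f\cdot p)\) trick; the function name and the dense NumPy representation are illustrative choices, not part of the original model description.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction of degree d=2 in O(f*p) time.

    x  : feature vector of length p
    w0 : global bias; w : linear weights of length p
    V  : latent factor matrix of shape (f, p); column j belongs to variable j
    """
    s = V @ x                      # s_k = sum_j v_{j,k} * x_j, shape (f,)
    s2 = (V ** 2) @ (x ** 2)       # sum_j v_{j,k}^2 * x_j^2, shape (f,)
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)
```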

Table 2 shows an example of the input format of the training set. Here, there are \(|U|=3\) users, \(|I|=4\) items and \(|L|=4\) locations, each represented by binary indicator variables.

$$\begin{aligned} U &= \{u_{1},u_{2},u_{3}\} \\ I &= \{i_{1},i_{2},i_{3},i_{4}\} \\ L &= \{l_{1},l_{2},l_{3},l_{4}\} \end{aligned}$$

The first tuple \(x_1\) means that user \(u_1\) consumed item \(i_1\) at location \(l_1\) and rated it 4 stars. For simplicity, we only consider categorical features in this paper. Table 3 shows the model parameters learned from the training set shown in Table 2.

Table 2 An example of training set of FM model
Table 3 An example of parameters’ values of FM model
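For illustration, a tuple such as \(x_1=(u_1,i_1,l_1)\) can be encoded as the binary indicator vector that FM consumes; the index maps below are hypothetical and only mirror the toy sets above.

```python
import numpy as np

# Hypothetical index maps mirroring U, I and L from the example above.
users = {"u1": 0, "u2": 1, "u3": 2}
items = {"i1": 0, "i2": 1, "i3": 2, "i4": 3}
locs  = {"l1": 0, "l2": 1, "l3": 2, "l4": 3}

def encode(u, i, l):
    """One-hot encode a (user, item, location) tuple into a vector of length p."""
    x = np.zeros(len(users) + len(items) + len(locs))
    x[users[u]] = 1.0
    x[len(users) + items[i]] = 1.0
    x[len(users) + len(items) + locs[l]] = 1.0
    return x

x1 = encode("u1", "i1", "l1")   # first tuple of Table 2, rated 4 stars
```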

3.2 Extensions to FM

There are many extensions to the FM model. Freudenthaler et al. [6] presented simple and fast structured Bayesian learning for FM. Rendle [22] scaled FM to relational data. Hong et al. [8] proposed co-FM to model user interests and predict individual decisions on Twitter. Qiang et al. [20] exploited ranking FM for microblog retrieval. Loni et al. [15] presented a 'free lunch' enhancement for collaborative filtering with FM. Oentaryo et al. [17] predicted response in mobile advertising with hierarchical importance-aware FM. Cheng et al. [3] proposed a Gradient Boosting Factorization Machine (GBFM) model that incorporates feature selection and FM into a unified framework. To the best of our knowledge, no existing extension integrates the FM model into random decision trees to exploit general context-aware recommendation.

4 Random Partition Factorization Machines

The intuition is that users exhibit similar rating behavior under the same or similar contextual environments. Motivated by Zhong et al. [28], we describe the proposed Random Partition Factorization Machines (RPFM) for context-aware recommendation.

4.1 Algorithm Description

To efficiently take advantage of different contextual information, we adopt the idea of the random decision trees algorithm.

The rationale is to partition the original training set R such that tuples generated by similar users, items or contexts are grouped into the same node. Tuples in the same cluster are expected to be more correlated with each other than those in the original training set R. The main flow is shown in Fig. 1 and Algorithm 2.

To begin with, there is an input parameter S, the structure of the decision trees, which can be generated by Algorithm 1 and determined by cross-validation. S specifies the context used for partitioning at each level and the number of clusters at each node; the maximal depth of the trees can be inferred from it. For instance, if S is 'C2:4,C3:6,C1:10,C0:5', then: (1) at the root node, R is divided into four clusters by k-means according to the similarity between the factor vectors of context \(C_{2}\); subsequently, the set at each node of the 2nd, 3rd and 4th levels is divided into six, ten and five clusters according to the similarity between the factor vectors of contexts \(C_{3}\), \(C_{1}\) and \(C_{0}\), respectively. (2) The maximal depth of each tree is five, since there are four intermediate levels and one terminal level.
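As a sketch, the structure string can be parsed into an ordered list of (context, cluster count) pairs, one per tree level; `parse_structure` is an illustrative helper, not code from the paper.

```python
def parse_structure(S):
    """Parse a structure string like 'C2:4,C3:6,C1:10,C0:5' into
    ordered (context, number_of_clusters) pairs, one per level."""
    levels = []
    for part in S.split(","):
        ctx, k = part.split(":")
        levels.append((ctx, int(k)))
    return levels

levels = parse_structure("C2:4,C3:6,C1:10,C0:5")
# [('C2', 4), ('C3', 6), ('C1', 10), ('C0', 5)]; maximal tree depth = len(levels) + 1
```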

Fig. 1 Random decision trees (one tree)

Algorithm 1 Generating the tree structure S
Algorithm 2 The main flow of RPFM

At each node, we learn the model parameters using the FM model:

$$\begin{aligned} ({\hat{\omega }}_{0},\hat{{\mathbf {w}}},\hat{{\mathbf {V}}}) = \mathop {\arg \min }\limits _{\omega _{0},{\mathbf {w}},{\mathbf {V}}}\sum _{i=1}^{|R|}(y(x_{i})-{\hat{y}}(x_{i}))^2 +\frac{\lambda _{2}}{2}\Vert {\mathbf {w}}\Vert ^{2} +\frac{\lambda _{3}}{2}\Vert {\mathbf {V}}\Vert ^{2} +\lambda \sum _{j=1}^{p}\Vert {\mathbf {v}}_{j}-{\mathbf {v}}_{j}^\mathrm{pa}\Vert ^{2} \end{aligned}$$
(4)

where \(\Vert \cdot \Vert \) is the Frobenius norm and \({\mathbf {V}}^\mathrm{pa}\) is the latent factor matrix at the parent node. The parameter \(\lambda \) controls the extent of regularization. Equation (4) can be solved using two approaches: (1) stochastic gradient descent (SGD), which is very popular for optimizing factorization models because it is simple and works well with different loss functions; the SGD algorithm for FM has linear computational and constant storage complexity [21]. (2) Alternating least squares (ALS), which iteratively solves a least-squares problem per model parameter and updates each parameter with its optimal solution [24]. Here, \({\mathbf {V}}\) is an \(f\times {p}\) matrix, where f is the dimensionality of the factor vectors, \(p=n_{0}+n_{1}+\cdots +n_{m-1}\), \(n_{i}\) is the number of distinct values of context \(C_{i}\), and m is the number of contextual variables. For simplicity, we denote the user set as \(C_{0}\) and the item set as \(C_{1}\). Each \(f\times {n_{i}}\) sub-matrix is the latent representation of context \(C_{i}\), as shown in Table 3. The smaller the distance between the factor vectors of context \(C_{i}\), the greater the similarity.
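A minimal SGD step for Eq. (4) might look as follows, assuming dense NumPy arrays and folding constant factors of the loss gradient into the learning rate; the signature and hyperparameter names are illustrative.

```python
import numpy as np

def sgd_step(x, y, w0, w, V, V_pa, lr=0.01, lam2=0.01, lam3=0.01, lam=0.01):
    """One SGD update for Eq. (4) at a tree node.

    V_pa is the latent factor matrix of the parent node; the last term of
    Eq. (4) pulls V toward V_pa. Constant factors are absorbed into lr.
    """
    s = V @ x                                          # cached sums, shape (f,)
    y_hat = w0 + w @ x + 0.5 * np.sum(s ** 2 - (V ** 2) @ (x ** 2))
    err = y_hat - y
    w0 -= lr * err
    w  -= lr * (err * x + lam2 * w)
    # d y_hat / d v_{j,k} = x_j * s_k - v_{j,k} * x_j^2
    grad_V = err * (np.outer(s, x) - V * (x ** 2))
    V  -= lr * (grad_V + lam3 * V + 2.0 * lam * (V - V_pa))
    return w0, w, V
```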

To partition the training set R, we extract the context and the number of clusters from the tree structure S according to the current level, and group the similar latent vectors of that context using the k-means method. In Table 3, suppose we obtain context \(C_{1}\) (i.e., Item) and number of clusters \(k=2\) from the input parameter S, and the randomly selected initial cluster centers are \(i_{1}\) and \(i_{2}\). The resulting clusters could then be \(\{i_{1},i_{3},i_{4}\}\) and \(\{i_{2}\}\). Finally, the training set at the current node is divided into two groups according to this clustering and each tuple's value of \(C_{1}\); in other words, the current node gets two child nodes. The subset of one child node contains the tuples whose value of \(C_{1}\) is in \(\{i_{1},i_{3},i_{4}\}\); the remaining tuples are assigned to the other child node.
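The following sketch shows one such split, using scikit-learn's KMeans with random initialization as a stand-in for the clustering step; the array layout (`tuples` holding context value ids, `ctx_vectors` holding the chosen context's latent columns of \({\mathbf {V}}\)) is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_node(tuples, ctx_col, ctx_vectors, k, seed=0):
    """Split a node's tuples via k-means over the latent vectors of one context.

    tuples      : int array (n, m); column ctx_col holds the context value ids
    ctx_vectors : array (n_i, f); row v is the latent factor vector of value v
    Returns k subsets, one per child node.
    """
    km = KMeans(n_clusters=k, init="random", n_init=1, random_state=seed)
    labels = km.fit(ctx_vectors).labels_     # cluster id of each context value
    return [tuples[labels[tuples[:, ctx_col]] == c] for c in range(k)]
```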

The partition process stops once one of the following conditions is met: (1) the height of the tree exceeds the limit inferred from the given tree structure parameter S; (2) the number of tuples at each child node of the current node would be less than the minimum support leastL.

During training, each non-leaf node separates the training set according to the clustering result of a specific context, so that the tuples within each subset have more influence on each other; the leaf nodes are responsible for prediction.

Note that different decision trees divide the training set differently, because the k initial cluster centers are selected randomly at each node.

During prediction, a given case \(x_{i}\) from the test set is passed from the root node to a leaf node of each tree using the clustering information stored at each non-leaf node. For instance, suppose S is 'C1:2,C0:3,C2:4' and the test case is \(x_{i}=\{u_{3},i_{1},l_{2}\}\), corresponding to Table 2. From the root node, \(x_{i}\) is transferred to the second-level node (e.g., \(R_{23}\)) whose cluster includes \(i_{1}\); from node \(R_{23}\), it is transferred to the third-level node (e.g., \(R_{33}\)) whose cluster includes \(u_{3}\); and from node \(R_{33}\), it is transferred to the fourth-level node (e.g., \(R_{41}\)) whose cluster includes \(l_{2}\). At the target leaf node, the rating is predicted by Eq. (1) with the parameters learned from that node's training subset. In the end, the predictions from all trees are combined to obtain the final prediction, as shown in Eq. (5):

$$\begin{aligned} {\hat{y}}(x_{i}) = \frac{\sum _{t=1}^{N}{\hat{y}}_{t}(x_{i})}{N}, \end{aligned}$$
(5)

where \({\hat{y}}_{t}(x_{i})\) denotes the prediction for tuple \(x_{i}\) by the tth decision tree and N denotes the number of decision trees.
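Ensemble prediction then reduces to routing \(x_{i}\) down every tree and averaging, as in Eq. (5); the `tree.route` interface returning the leaf's FM parameters is hypothetical, and the sketch reuses the `fm_predict` function shown earlier.

```python
def ensemble_predict(x, trees):
    """Eq. (5): average the per-tree FM predictions for one test case x.

    Assumes each tree exposes route(x) -> (w0, w, V), i.e., it follows the
    stored cluster assignment at every non-leaf node down to a leaf and
    returns that leaf's FM parameters (an illustrative interface).
    """
    total = 0.0
    for tree in trees:
        w0, w, V = tree.route(x)          # descend from root to leaf
        total += fm_predict(x, w0, w, V)  # Eq. (1) with the leaf's parameters
    return total / len(trees)
```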

After partitioning the original training set, the tuples at each leaf node have more influence on each other, so the FM model at each leaf node can achieve high-quality recommendations. By combining multiple predictions from different decision trees, all subsets of strongly correlated tuples are comprehensively investigated, and personalized, accurate context-aware recommendations can be generated.

4.2 Discussion

We discuss the relationship between the proposed RPFM and other state-of-the-art random partition-based methods.

  • Relation to RPMF RPMF, proposed by Zhong et al. [28], works by applying a set of local decomposition processes on sub-rating matrices. There are several differences between RPMF and our proposed RPFM. First, RPMF explores a basic MF model to factorize the user–item rating matrix, whereas RPFM factorizes the user–item–context interactions using the FM model. Second, the decision trees in RPMF are binary trees created by randomly selecting a latent factor from U, V and a splitting point, while those in RPFM are irregular trees generated by the k-means method with randomly selected initial cluster centers. Third, the depth of the decision trees in RPMF can in theory be very large, while in RPFM it is limited by the number of contextual variables. Finally, during prediction, RPMF obtains a partial prediction at every node on the path from the root to a leaf of each decision tree, whereas RPFM makes a prediction only at the leaf node of each decision tree for a given user–item–context tuple. RPMF therefore spends more time on prediction than RPFM.

  • Relation to SoCo Liu et al. [14] proposed SoCo to improve recommendation quality by using contexts and social network information. Here, we only consider the relation between SoCo without social information and RPFM. First, in SoCo the contextual variable \(c_{r}\) used to separate the data at each level of each tree is selected randomly, and the training data at each intermediate node are partitioned according to the value of \(c_{r}\); in RPFM, by contrast, the tree structure is determined by the input parameter S, and each training subset is generated according to the similarity of the latent factor vectors of the selected context. Second, SoCo makes predictions with the basic MF model, whereas RPFM uses the FM model. It is worth noting that SoCo may lose contextual information that could improve recommendation quality when the depth of a tree is less than the number of contextual variables. For instance, node \(R_{22}\) in Fig. 1 has no child nodes because its number of tuples is less than the minimum support; if plain matrix factorization is performed there, the contextual information at node \(R_{22}\) cannot be exploited, whereas RPFM can exploit it. Third, neither users nor items can be used to split the training set in SoCo; in other words, the number of tuples at SoCo's leaf nodes may still be enormous.

5 Experiments

In this section, we empirically investigate whether our proposed RPFM achieves better performance than other state-of-the-art methods on three benchmark datasets. We first describe the datasets and settings of our experiments, then report and analyze the results.

5.1 Datasets

We conduct our experiments on three datasets: the Adom. dataset [1], the Food dataset [18] and the Yahoo! Webscope dataset.

The Adom. dataset [1] contains 1757 ratings by 117 users for 226 movies, together with rich contextual information. The rating scale ranges from 1 (hate) to 13 (absolutely love). Some tuples contain missing values; after removing them, 1464 ratings by 84 users for 192 movies remain. We keep five contextual variables: with whom, day of the week, whether it was the opening weekend, month, and year seen (Table 4).

The Food dataset [18] contains 6360 ratings (1–5 stars) by 212 users for 20 menu items. We select two contextual variables. One captures how hungry the user is (normal, hungry or full); the other describes whether the rated situation was virtual or real.

The Yahoo! Webscope dataset contains 221,367 ratings (1–5 stars) for 11,915 movies by 7,642 users. It has no contextual information, but it contains the user's age and gender. Like [24], we follow [10] and apply their method to generate a modified dataset. That is, we modify the original Yahoo! dataset by replacing the gender feature with a new artificial feature \(C\in \{0,1\}\), assigned randomly to 1 or 0 for each rating. This feature C represents a contextual condition that can affect the rating. We randomly choose 50% of the items, and for these items we randomly pick 50% of their ratings to modify: we increase (or decrease) the rating value by one if \(C=1\) (or \(C=0\)), unless the rating value is already 5 (or 1).
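A sketch of this modification protocol, with an illustrative list-of-dicts rating representation:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_context(ratings, item_ids):
    """Assign a random binary context C to each rating; for 50% of items,
    shift 50% of their ratings by +1 if C=1 (or -1 if C=0), unless the
    rating already sits at the 5 (or 1) boundary."""
    chosen = set(rng.choice(item_ids, size=len(item_ids) // 2, replace=False))
    for r in ratings:                       # r: {"item": ..., "rating": ...}
        r["C"] = int(rng.integers(0, 2))
        if r["item"] in chosen and rng.random() < 0.5:
            if r["C"] == 1 and r["rating"] < 5:
                r["rating"] += 1
            elif r["C"] == 0 and r["rating"] > 1:
                r["rating"] -= 1
    return ratings
```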

Table 4 Data set statistics

5.2 Setup and Metrics

We assess the performance of the models by fivefold cross-validation, using the most popular metrics: mean absolute error (MAE) and root mean square error (RMSE), defined as follows:

$$\begin{aligned} \text {MAE}=\frac{\sum \nolimits _{(x_{i},y_{i})\in \varOmega _\mathrm{test}}|y_{i}-{\hat{y}}(x_{i})|}{|\varOmega _\mathrm{test}|} \end{aligned}$$
(6)
$$\begin{aligned} \text {RMSE}=\sqrt{\frac{\sum \nolimits _{(x_{i},y_{i})\in \varOmega _\mathrm{test}}(y_{i}-{\hat{y}}(x_{i}))^{2}}{|\varOmega _\mathrm{test}|}} \end{aligned}$$
(7)

where \(\varOmega _\mathrm{test}\) denotes the test set, and \(|\varOmega _\mathrm{test}|\) denotes the number of tuples in test set. The smaller the value of MAE or RMSE, the better the performance.
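Both metrics are straightforward to compute; a minimal sketch:

```python
import numpy as np

def mae_rmse(y_true, y_pred):
    """MAE (Eq. 6) and RMSE (Eq. 7) over the test set."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))
```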

5.3 Performance Comparison

We first conduct experiments to assess the performance of the proposed RPFM with different similarity functions in the k-means method. We then compare RPFM with state-of-the-art context-aware methods.

5.3.1 Which Similarity Function Works Best?

The proposed RPFM algorithm uses the k-means method to partition the training set so that the tuples in each training subset influence each other more strongly. There are many metrics for measuring the similarity between tuples, for instance Euclidean distance (Euclid), cosine similarity (Cosine), correlation-based similarity (Pearson) and adjusted cosine similarity (adjCosine). As shown in Table 5, performance differs somewhat across similarity functions, but the differences are not significant. In the following sections, we therefore report performance using Euclidean distance.

Table 5 Performance comparison in terms of different similarity function
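For reference, the four similarity measures compared in Table 5 can be sketched as follows; the adjusted cosine variant assumes per-dimension means taken over all latent vectors, which is one plausible reading.

```python
import numpy as np

def euclid(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def adj_cosine(a, b, dim_means):
    # dim_means: per-dimension means over all latent vectors (an assumption)
    return cosine(a - dim_means, b - dim_means)
```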

5.3.2 Comparison to Factorization-Based Context-Aware Methods

To begin with, we determine the structure of the decision trees, i.e., the input parameter S, by Algorithm 1. The resulting parameters are 'C2:2,C6:2,C5:2,C3:3,C0:2,C4:5,C1:5', 'C3:3,C2:2,C0:5,C1:4' and 'C3:2,C2:2,C0:2,C1:2' for the Adom., Food and Yahoo! datasets, respectively. We set both the learning rate and the regularization to 0.01.

  • FM [21] is easily applied to a wide variety of contexts by specifying only the input data, and achieves fast runtime in both training and prediction.

  • Multiverse Recommendation [10] is a contextual collaborative filtering model using N-dimensional tensor factorization. Different types of context are treated as additional dimensions in the representation of the data as a tensor, whose factorization yields a compact model of the data that can be used to provide context-aware recommendations.

  • COT [13] represents the common semantic effects of contexts as a contextual operating tensor and each context as a latent vector. To model the semantic operation of a context combination, a contextual operating matrix is generated from the contextual operating tensor and the latent vectors of the contexts.

The dimensionality of the latent factor vectors is an important parameter. Although the dimensionality can differ across contexts in Multiverse and COT, we use the same dimensionality for all contexts in order to compare with FM and our proposed RPFM. Since the three datasets differ in scale, we run the models with \(f\in \{2,3,4,5,6,7\}\) on the Adom. dataset, \(f\in \{2,4,6,8,10,12\}\) on the Food dataset and \(f\in \{5,10,15,20,25,30\}\) on the Yahoo! dataset. Figures 2 and 3 show the results of FM, Multiverse, COT and RPFM on the three real-world datasets.

Fig. 2 MAE over three datasets with different dimensionality of latent factor vectors. a Adom. dataset, b Food dataset, c Yahoo! dataset

Fig. 3 RMSE over three datasets with different dimensionality of latent factor vectors. a Adom. dataset, b Food dataset, c Yahoo! dataset

We notice that in all experimental scenarios, RPFM is insensitive to the dimensionality of the latent factor vectors and is more accurate than the other recommendation models. These results indicate that users exhibit similar rating behavior in the homogeneous environments obtained by applying random decision trees to partition the original training set.

High computational complexity in both learning and prediction is one of the main disadvantages of Multiverse and COT, which makes them hard to apply with larger latent dimensionalities. In contrast, the computational complexity of FM and RPFM is linear. To compare the runtime of the models, we measure one full iteration over the whole training set of the Yahoo! dataset. Figure 4 shows that as the dimensionality increases, the learning runtime of RPFM stays well below that of Multiverse and COT, though it is slower than FM; this is expected, since RPFM builds an ensemble, which in turn reduces prediction error.

Fig. 4 Learning runtime in seconds for one iteration over the whole training set (log-y scale) on the Yahoo! dataset with different latent dimensions

5.3.3 Comparison to Random Partition-Based Context-Aware Methods

  • RPMF [28] adopts a random partition approach that uses decision trees to group similar users and items, so that the tuples at each tree node influence each other more strongly. Matrix factorization is then applied at each node to predict the missing ratings.

  • SoCo [14] handles contextual information explicitly, partitioning the training set based on the values of real contexts, and incorporates social network information into recommendation. Since our selected datasets contain no social network information, we consider SoCo without it.

Both the number and the depth of the trees have an important impact on decision tree-based prediction methods. Because of space limitations, we only report experimental results on the Food dataset.

As shown in Fig. 5, RPFM achieves the best performance compared with RPMF and SoCo. We also notice that MAE/RMSE decreases as the number of trees increases, i.e., more trees produce higher accuracy. However, beyond about three trees, the improvements in prediction quality become negligible. We thus conclude that even a small number of trees is sufficient for decision tree-based models.

Fig. 5 Impact of the number of trees on the Food dataset

The depth of the trees, an input parameter of RPMF, can be very large because RPMF randomly selects a latent factor from U, V and a splitting point at each intermediate node while building the decision trees; here, we set the maximal depth to five for RPMF. In SoCo, the maximal depth equals the number of contextual variables excluding user and item, so the maximal depth on the Food dataset is two. In RPFM, however, both user and item can be treated as contextual variables, so the maximal depth on the Food dataset is four. Figure 6 shows that the deeper the trees, the better the prediction quality, and that RPFM outperforms RPMF and SoCo in terms of MAE and RMSE.

Fig. 6 Impact of the depth of trees on the Food dataset

6 Conclusion and Future Work

In this paper, we propose Random Partition Factorization Machines (RPFM) for context-aware recommendation. RPFM adopts random decision trees to partition the original training set using the k-means method. Factorization machines (FM) are then employed to learn the model parameters at each node of the trees and to predict users' missing ratings for items under specific contexts at the leaf nodes. Experimental results demonstrate that RPFM outperforms state-of-the-art context-aware recommendation methods.

There are several directions for future work on RPFM. First, RPFM adopts the k-means method to partition the training set, but there are many other clustering methods [7], such as BIRCH, ROCK, Chameleon and DBSCAN, some of which may achieve better performance. Second, the work at each node in the training phase, such as clustering, partitioning and learning parameters, can be parallelized. Third, prediction involves many floating-point operations at the leaf nodes, which is time-consuming; since GPUs offer powerful floating-point throughput, they could be exploited to accelerate prediction.