This section presents an evaluation of the \(({ CF})^2\) architecture using two case studies. Both case studies simulated the recommender system of a Web site, using past interactions between clients and the Web site to train the collaborative filtering models.
These past interactions were provided by a multi-media company specializing in weather- and traveller-related content and technology. Historical traffic was captured in clickstream form by two on-line marketing and Web analytics applications: Google Analytics and Adobe Omniture. All data containing pseudo-identifiers were collected in accordance with privacy policies, and no personally identifiable information about users was used.
The task being evaluated is defined in the literature as Find Good Items. In this task, the recommender system suggests items to a user, but displays only those that are a “best bet”.
Methodology
The evaluation scenario consists of a dataset D divided into two subsets, a training set T and a validation set V. The training set represents 80% of the dataset and is obtained by random selection from the original dataset without repetition. The remaining 20% represents the validation set.
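A minimal sketch of this split, assuming the interactions have been loaded as a Spark RDD (the path and variable names are illustrative, not taken from the \(({ CF})^2\) implementation):

```python
from pyspark import SparkContext

sc = SparkContext(appName="cf2-evaluation")
# Assumed input: one "user,item,context" record per line (hypothetical path).
dataset = sc.textFile("hdfs:///data/interactions.csv").map(lambda l: l.split(","))
# 80/20 split; randomSplit assigns each record to exactly one subset,
# i.e., selection without repetition.
T, V = dataset.randomSplit([0.8, 0.2], seed=42)
```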
Because an explicit rating is not provided by users, an implicit rating \(r_{{ ui}}\) of 1 is used to indicate that a user u is interested in the requested page i when the user accesses the page. Moreover, because CF models need enough data to generate good recommendations, ratings \(r_{{ ui}}\) whose contextual attributes represented less than 0.1% of the total dataset were removed.
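As a sketch, assigning the implicit ratings and applying the 0.1% threshold (before the split above) could look like the following; the tuple layout is illustrative:

```python
# Attach an implicit rating r_ui = 1 to every page view:
# (user, item, context) -> (user, item, context, 1.0).
ratings = dataset.map(lambda rec: (rec[0], rec[1], rec[2], 1.0))

# Drop ratings whose contextual attribute covers less than 0.1% of the data.
total = ratings.count()
counts = ratings.map(lambda r: (r[2], 1)).reduceByKey(lambda a, b: a + b)
kept = set(counts.filter(lambda kv: kv[1] >= 0.001 * total).keys().collect())
ratings = ratings.filter(lambda r: r[2] in kept)
```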
Each contextual attribute c is represented by a training subset \(T_c\) and a validation subset \(V_c\), containing the ratings of T and V associated with the contextual attribute c. Moreover, because this research aims to show that using contextual attributes as filtering criteria is better than random dataset reduction, the training subset \(T_r^c\) and the validation subset \(V_r^c\) represent randomly selected subsets of T and V with the same sizes as \(T_c\) and \(V_c\). This process is illustrated in Fig. 7. Each box representing a training set \(T_c\) is drawn in a unique color, indicating that the set comprises entities with the same contextual attribute. In contrast, each box representing a training set \(T_r^c\) is depicted with mixed colors, reflecting the randomness of the contextual attributes within the set.
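The construction of \(T_c\) and its random counterpart \(T_r^c\) can be sketched as follows (the sampling fraction makes \(T_r^c\) match the size of \(T_c\) in expectation; the same construction applies to \(V_c\) and \(V_r^c\)):

```python
def contextual_and_random_subsets(T, c, seed=42):
    # T_c: all ratings observed under contextual attribute c.
    T_c = T.filter(lambda r: r[2] == c)
    # T_r^c: a random subset of T with (approximately) the same size as T_c.
    fraction = T_c.count() / float(T.count())
    T_r_c = T.sample(withReplacement=False, fraction=fraction, seed=seed)
    return T_c, T_r_c
```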
Moreover, because the evaluation is performed by comparing the proposed architecture with the traditional approach, the subsets T and V contain the ratings \(r_{{ ui}}\) given by users u to pages i regardless of context. Both sets are used to train and validate the traditional CF model m.
To assess the predictive quality of the models, this research used predictive accuracy metrics: the mean squared error (MSE) and the root mean squared error (RMSE), which are given by the equations
$$\begin{aligned} \hbox {MSE}(V)&= \frac{1}{|V|} \sum _{(u,i) \in V}\left( r_{{ ui}}-\hat{r}_{{ ui}}\right) ^{2} \end{aligned}$$
(1)
$$\begin{aligned} \hbox {RMSE}(V)&= \sqrt{\frac{1}{|V|} \sum _{(u,i) \in V}\left( r_{{ ui}}-\hat{r}_{{ ui}}\right) ^{2}} \end{aligned}$$
(2)
where V is the validation set, |V| is the size of V, \(r_{{ ui}}\) is the true user rating, and \(\hat{r}_{{ ui}}\) is the predicted rating. These and other notations used throughout the section are presented in Table 1.
Table 1 Notations used throughout this paper
Training dataset T was then used to feed the training process by means of the rating storage component. This step ensures the creation of a different model \(m_c\) for each contextual attribute c present in T, as illustrated in Fig. 3.
As with the training dataset, validation is performed by splitting the validation set V into contextual subsets \(V_c\) and using these subsets to compare the predicted rating \(\hat{r}_{{ ui}}\) with the rating \(r_{{ ui}}\) available in \(V_c\). This process is illustrated in Fig. 8.
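As a sketch of this step, under the same illustrative naming as above (a model trained on \(T_c\), and \(V_c\) as an RDD of (user, item, context, rating) tuples), the per-context MSE and RMSE of Eqs. (1) and (2) could be computed with Spark as follows:

```python
from math import sqrt

def evaluate(model, V_c):
    # Predicted ratings for every (user, item) pair in the validation subset.
    user_items = V_c.map(lambda r: (int(r[0]), int(r[1])))
    predictions = model.predictAll(user_items) \
                       .map(lambda p: ((p.user, p.product), p.rating))
    # True (implicit) ratings keyed the same way.
    truth = V_c.map(lambda r: ((int(r[0]), int(r[1])), float(r[3])))
    # Mean of squared errors over the joined pairs, i.e., Eq. (1).
    mse = truth.join(predictions) \
               .map(lambda kv: (kv[1][0] - kv[1][1]) ** 2) \
               .mean()
    return mse, sqrt(mse)  # (MSE, RMSE), Eqs. (1) and (2)
```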
To evaluate the models m, representing the traditional approach, and \(m_r^c\), representing random dataset reduction, the \(({ CF})^2\) implementation was adapted to ignore context.
Case studies
Because \(({ CF})^2\) can use two types of contextual attributes, embedded and inferred, the evaluation process was divided into two case studies, each covering one of these types.
For the CF recommendation engine, both case studies used the matrix factorization technique with the alternating least squares (ALS) algorithm (Koren et al. 2009). ALS is implemented in Spark’s spark.mllib machine learning library, providing an “out-of-the-box” solution that can process large volumes of data. Moreover, this implementation includes a training technique based on the work of Hu et al. (2008), which specializes in training CF models using implicit ratings.
The model parameters were obtained by performing cross-validation on partitions of the training dataset T. Various configurations of the regularization parameter (\(\lambda\)), the number of hidden features, the number of iterations, and the confidence level (\(\alpha\)) were considered. The parameter values that resulted in a minimum stable MSE were chosen and are given in Table 2.
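A minimal sketch of the training call, using the implicit-feedback ALS of Spark’s spark.mllib described above; the hyperparameter values below are placeholders standing in for the values reported in Table 2:

```python
from pyspark.mllib.recommendation import ALS, Rating

def train_context_model(T_c):
    # Implicit-feedback ALS training (Hu et al. 2008) on the contextual
    # subset T_c, assumed to be an RDD of (user, item, context, rating) tuples.
    ratings = T_c.map(lambda r: Rating(int(r[0]), int(r[1]), float(r[3])))
    return ALS.trainImplicit(
        ratings,
        rank=10,        # number of hidden features (placeholder; see Table 2)
        iterations=10,  # number of iterations (placeholder; see Table 2)
        lambda_=0.01,   # regularization parameter (placeholder; see Table 2)
        alpha=40.0,     # confidence level (placeholder; see Table 2)
    )
```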
Table 2 Parameters used to train the CF models
Each case study was executed on a private server with a 24-core Intel Xeon E5-2630 2.3 GHz and 96 GB RAM DDR3 1600 MHz running Ubuntu 14.04.2 LTS.
Case study 1: embedded context
The first case study used clickstream data captured by the Google Analytics tool during the summer of 2016 (June 20 to September 22) to create recommender models based on the contextual attribute “operating system with platform”. To achieve this goal, the data were pre-processed to remove entries captured by the clickstream that did not represent a page view. Such entries usually represent interactions with objects inside a Web page that do not trigger a page change, such as interactions with map objects or social media snippets. After this step, the resulting dataset was filtered to contain only unique values with the following properties:
This process resulted in a dataset containing 130,684,845 unique samples. The number of unique users was 33,624,517, and the number of items (unique URLs) was 9,823,125. The final dataset was then split into a training set T and a validation set V.
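The page-view filter in the pre-processing step above reduces to a single predicate over the clickstream; a minimal sketch, assuming a hypothetical hit_type field that distinguishes page views from in-page event hits:

```python
# Keep only clickstream entries that represent a page view, then
# de-duplicate; `hit_type` and the value "PAGE" are assumptions about
# the export format, not fields documented in this study.
pageviews = clickstream.filter(lambda rec: rec["hit_type"] == "PAGE").distinct()
```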
The training set was then used to create the contextual models \(m_c\), \(m_r^c\), and m, and to calculate the time spent to train them. Moreover, the validation set was used to calculate the MSE of each model.
The performance of the \(({ CF})^2\) architecture is graphically represented in Fig. 9. For presentation purposes, all contextual names were converted to the format “OS/Platform”. The MSE obtained for the traditional approach, represented by model m, was compared with that of each model \(m_c\) and its equivalent model \(m_r^c\) obtained by random reduction. To facilitate interpretation, the MSE for model m is displayed as a dotted reference line.
From Fig. 9, it is clear that using the proposed architecture generally provided better results than the traditional approach and significantly better results than random dataset reduction. This is also corroborated by the MSE and RMSE values obtained for each approach given in Table 3.
Table 3 MSE and RMSE obtained in case study 1
Figure 10 relates the MSE of each model to the dataset size. Analysis of Fig. 10 shows that when contextual dataset reduction is used, the MSE values tend to be lower than the value obtained using the traditional approach, regardless of dataset size. The same is not true for random dataset reduction: with this technique, the error increases as the dataset becomes smaller and converges to the accuracy of the traditional approach as the dataset grows. This is to be expected because there is almost no dataset reduction when the size approaches that of the original. The size of each dataset, classified by contextual attribute, is displayed in Table 4.
Table 4 Size of each dataset of case study 1 classified by contextual attribute
Furthermore, the times (in seconds) taken to train each model m, \(m_c\), and \(m_r^c\) were compared and are illustrated in Fig. 11.
These values indicate that the time to train each model increased almost linearly with dataset size. This was especially true in the case of Spark’s implementation of ALS, but other matrix factorization implementations should also experience a significant reduction in training time when dataset reduction is used.
Case study 2: contextual inference
The second case study used contextual inference to create recommender models based on the contextual attribute “weather condition”. This case study used clickstream data captured by the Adobe Omniture tool between April 1, 2015, and June 30, 2015, and included only visits generated by users in London, ON, Canada. The captured data contained unique entries with the following properties:
After filtering out entries that did not represent a page view and removing duplicate entries, the resulting dataset contained 7,729,696 samples. Because the dataset still lacked the “weather condition” property, the time-of-access and location properties were used to obtain the weather condition for each visit. After removing duplicate entries, the dataset containing the tuples visitor identifier, URL, and weather condition was reduced to 3,181,808 entries. The number of unique users was 656,962, and the number of items (unique URLs) was 3777. The final dataset was then split into a training set T and a validation set V.
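The study does not prescribe a particular mechanism for this inference; as a sketch, assuming a pre-built historical weather table keyed by (location, hour), the enrichment could be expressed as a join:

```python
# Sketch: infer "weather condition" from time of access and location.
# `visits` is assumed to hold (visitor_id, url, timestamp, location) tuples
# and `weather_history` ((location, hour), condition) pairs built from
# archived observations; both names are hypothetical.
def hour_key(rec):
    # Truncate an ISO timestamp "YYYY-MM-DDTHH:MM:SS" to the hour.
    return ((rec[3], rec[2][:13]), (rec[0], rec[1]))

with_weather = visits.map(hour_key) \
    .join(weather_history) \
    .map(lambda kv: (kv[1][0][0], kv[1][0][1], kv[1][1])) \
    .distinct()  # (visitor_id, URL, condition), duplicates removed
```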
The training set was then used to create the contextual models \(m_c\), \(m_r^c\), and m and to calculate the time spent to train them. Moreover, the validation set was used to calculate the MSE of each model, as shown in Fig. 12.
The MSE and RMSE values for each approach are given in Table 5.
Table 5 MSE and RMSE obtained in case study 2
Analysis of these values indicates that \(({ CF})^2\) yielded accuracy similar to the traditional approach while outperforming random dataset reduction. Moreover, the results indicated that accuracy improved as weather conditions deteriorated. The most compelling evidence was the MSE obtained for the “heavy thunderstorms and rain” and “thunderstorm” weather conditions. This occurred because, under these circumstances, rating patterns are shared among users; in other words, users tend to check the same Web pages during bad weather.
Although the MSE improved in some contexts, the results obtained indicate that weather condition is a less suitable contextual attribute than “operating system with platform” for this application. Nevertheless, using a contextual attribute as a filtering criterion for local learning is more appropriate than random selection.
Taking dataset size into consideration, as shown in Table 6, this case study showed behaviour similar to that of the previous case study. Like the first case study, the second showed that the MSE values tended to fluctuate around the value obtained for m, and both case studies showed that \(m_c\) outperformed \(m_r^c\). This situation is shown in Fig. 13, which also displays the MSE value for model m as a dotted reference line.
Table 6 Size of each dataset of case study 2 classified by contextual attribute
An analysis of the time (in seconds) taken to train each model m, \(m_c\), and \(m_r^c\) was conducted, and the results are illustrated in Fig. 14. The results corroborate the findings of the previous case study and show that training time increases almost linearly with dataset size.
Discussion
The presented experiments show that the \(({ CF})^2\) architecture achieves accuracy similar to (or better than) that of collaborative filtering models trained with the complete dataset, while using a small fraction of the data. Whereas the second case study used a single contextual attribute (weather), the first case study combined two attributes, operating system and platform (mobile or not), into a single attribute, OS/Platform. When several attributes are involved, the amount of data may not be sufficient for training once the data are split, and further investigation is needed to determine which attributes should be included. In case study 1, combining attributes was possible because the dataset consisted of over 130 million records, and after splitting, the segments remained sufficiently large.
Both studies used categorical attributes; nevertheless, continuous attributes can be handled by converting them into categorical ones according to ranges. Depending on the attribute, selecting different ranges may result in different recommendation accuracies.
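For instance, a continuous attribute such as temperature could be discretized before being used as a filtering criterion; the ranges below are purely illustrative:

```python
def temperature_category(celsius):
    # Illustrative binning of a continuous attribute into categories;
    # different range choices may yield different recommendation accuracy.
    if celsius < 0:
        return "freezing"
    elif celsius < 15:
        return "cold"
    elif celsius < 25:
        return "mild"
    return "hot"
```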
In the first case study, the proposed approach provided better recommendations than the traditional one (Table 3), whereas in the second, the traditional approach was slightly better (Table 5). As Fig. 12 shows, the \(({ CF})^2\) accuracy improved as conditions deteriorated: accuracy was much better with \(({ CF})^2\) than with the traditional approach for “heavy thunderstorms and rain”. These results indicate that the choice of contextual attribute, as well as the actual values of that attribute, affects the quality of recommendations. Therefore, an important direction for future research in context-aware recommender systems is the selection of the contextual attributes to include in the recommendation process.