Abstract
We outline an inherent flaw of tensor factorization models when latent factors are expressed as a function of side information, and propose a novel method to mitigate it. We coin our methodology kernel fried tensor (KFT) and present it as a large-scale prediction and forecasting tool for high-dimensional data. Our results show superior performance against LightGBM and Field-aware Factorization Machines (FFM), two algorithms with proven track records that are widely used in large-scale prediction. We also develop a variational inference framework for KFT that associates its predictions and forecasts with calibrated uncertainty estimates on several datasets.
1 Introduction and related work
In recent times, industrial prediction problems (Caro and Gallien 2010; Seeger et al. 2016) are not only large scale but also high dimensional (Zhai et al. 2014). Problems of this nature are ubiquitous, the most common setting being recommendation systems (Bobadilla et al. 2013), where we are tasked with predicting user preference (i.e. ratings, buying propensity, etc.) for a product (movies, clothing, etc.). More often than not, the number of users and products far exceeds what a data matrix can practically represent, which characterizes the perils of high dimensionality in modern prediction problems. The choice of models is often limited to boosting models such as LightGBM (Ke et al. 2017), factorization machines (Juan et al. 2016; Rendle 2010), and matrix (Bobadilla et al. 2013) and tensor factorization models (Kolda and Bader 2009; Oseledets 2011). In particular, matrix/tensor factorization models are often used due to their memory-efficient representation of high-dimensional data (Kuang et al. 2014). An additional benefit of factorization machines and matrix/tensor factorization models is their relative ease of extension to a Bayesian formulation, making uncertainty quantification straightforward.
A recurring problem in recommendation systems is the cold start problem (Bobadilla et al. 2012), where new users yield inaccurate predictions due to an absence of historical purchasing behavior. A common technique to overcome the cold start problem is to incorporate side information (Agarwal and Chen 2009; Xu et al. 2015; Zhang et al. 2014), which means adding descriptive covariates about each individual user (or product) to the model. While side information is not immediately applicable to factorization machines, there is extensive literature (Kim et al. 2016; Liu et al. 2019; Narita et al. 2012) on utilizing side information with matrix/tensor factorization models. There are limited choices of models when simultaneously considering scalability, uncertainty quantification, the cold start problem, and, ideally, interpretability (Rudin 2018). We contextualize our contribution by reviewing related works.
Tensor factorization is a generalization of matrix factorization to n dimensions: any technique that applies to tensors is directly applicable to matrices, but not vice versa. We focus on tensor factorization as it offers the most flexibility and generality. In the frequentist setting, Canonical Polyadic (CP) and Tucker decomposition (Kolda and Bader 2009) are the most common factorization methods for tensors, while Tensor-Train (TT) (Oseledets 2011) and Tensor-Ring (TR) (Zhao et al. 2016) decomposition are newer additions focusing on scalability. While Tucker decomposition has admirable analytical properties, the \({\mathcal {O}}(r^d)\) memory cost of storing the core tensor is infeasible in any large-scale application. CP decomposition is superseded by TT-decomposition, which in turn is extended by TR-decomposition. Due to certain pathologies (Batselier 2018) exhibited by TR, only TT remains plausible for a large-scale model. The existing methods LightGBM (Ke et al. 2017) and Field-aware Factorization Machines (FFM) (Juan et al. 2016) are considered the gold standard of large-scale prediction; an overall objective of this paper is to challenge this duopoly. To enhance the performance of tensor models, Du et al. (2016), Wu et al. (2019) and Zhang et al. (2019) applied neural methods to matrix and tensor factorization, where Du et al. (2016) and Zhang et al. (2019) are considered the state of the art in performance. In the Bayesian domain, only FFM carries over, with Saha et al. (2017) introducing a variational coordinate ascent method for factorization machines. There is a rich literature on Bayesian matrix/tensor factorization, ranging from Monte Carlo methods in the pioneering work of Salakhutdinov and Mnih (2007) to variational matrix factorization (Gönen et al. 2013; Kim and Choi 2014) and tensor factorization in Hawkins and Zhang (2018).
Kim and Choi (2014) is of particular interest as they present a large-scale variational model incorporating side information. While coordinate ascent approaches proposed in Hawkins and Zhang (2018), Kim and Choi (2014) and Saha et al. (2017) are useful, we believe that the cyclical update scheme can be constraining for large scale scenarios that rely on parallelism. Further, the updating rules are hard to maintain with changes in model architecture. To mitigate maintenance of complicated gradient updates, we consider automatic differentiation (Paszke et al. 2017b) instead. In the context of maintainable variational inference, it would then suffice to find an analytical expression of the Evidence Lower BOund (ELBO) that is general for a family of models (Hoffman et al. 2013).
Side information is applied in two ways in matrix/tensor factorization models: implicitly, through regularization schemes in He et al. (2017), Narita et al. (2012), Pal and Jenamani (2018) and Zhao et al. (2013), and directly as covariates in Agarwal and Chen (2009), Kim et al. (2016), Kim and Choi (2014) and Zhang et al. (2014). In terms of interpretability, the latter is more desirable, as predictions become an explicit expression of covariates, allowing for direct attribution analysis. A concerning observation from the results in Agarwal and Chen (2009), Kim et al. (2016) and Kim and Choi (2014) is that using side information in covariate form barely improves performance, and even worsens it for the “features only” model in Agarwal and Chen (2009).
In this paper we develop a novel all-purpose large-scale prediction model that strives for a level of versatility existing models lack. Our contribution, Kernel Fried Tensor (KFT), aims to bridge this gap in the literature and answer the following questions:
1. Is there an interpretable tensor model that avoids constraints and complex global dependencies arising from the addition of side information but still makes full use of side information?
2. Can we formulate and characterize this model class in both primal and dual (Reproducing Kernel Hilbert Space) space?
3. Do models in this class compare favourably, for large-scale prediction, to state-of-the-art models such as LightGBM (Ke et al. 2017), FFM (Juan et al. 2016) and existing factorization models?
4. Can we work with these new models in a scalable Bayesian context with calibrated uncertainty estimates?
5. What are the potential pitfalls of using these models? When are they appropriate to use?
In summary, we take on the multifaceted problem of developing a tensor model that scales, is interpretable, is Bayesian, handles side information, and performs on par with the existing gold standard.
The rest of the paper is organized as follows: Sect. 2 illustrates an important limitation of side information applied as covariates to tensors, Sect. 3 introduces and characterizes KFT in both the frequentist and Bayesian settings, Sect. 4 presents experiments, and Sect. 5 provides an ablation study related to Question 5 (Table 1).
2 Background and problem
Tensor models are used in large-scale prediction tasks ranging from bioinformatics to industrial prediction; we work in the latter setting. While existing tensor models are versatile and scalable, they have a flaw: when the latent factors in a tensor factorization are modeled as a function of covariates (Agarwal and Chen 2009; Kim et al. 2016; Kim and Choi 2014; Zhang et al. 2014), the model may be restricted by the global parameter couplings this generates. These couplings reduce tractability at scale. We provide a new family of tensor models which admit side information without model restriction or loss of tractability.
Tensor Train decomposition Before we explain how side information restricts tensor model expressiveness, we set out the background. Consider the task of reconstructing the tensor \(\mathbf{Y}\in {\mathbb {R}}^{n_1 \times n_2 \cdots \times n_P}\). Many existing decomposition techniques (Kolda and Bader 2009) address this problem. We focus on the Tensor Train (TT) decomposition (Oseledets 2011), as it generalizes more readily than existing alternatives.
The n-mode (matrix) product of a tensor \({\mathbf {X}} \in {\mathbb {R}}^{I_1 \times I_2 \times \cdots \times I_N}\) with a matrix \({\mathbf {U}} \in {\mathbb {R}}^{J \times I_n}\) is \({\mathbf {X}} \times _n {\mathbf {U}}\). This product is of size \(I_1 \times \ldots I_{n-1} \times J \times I_{n+1} \times \ldots \times I_N\). Elementwise, the product is
\(({\mathbf {X}} \times _n {\mathbf {U}})_{i_1 \ldots i_{n-1}\, j\, i_{n+1} \ldots i_N} = \sum _{i_n=1}^{I_n} x_{i_1 i_2 \ldots i_N}\, u_{j i_n}.\)
We will also use the following notion of a mode product between tensors: for \(\mathbf{X}\in \mathbf{\mathbb {R}}^{I_1 \times \cdots I_{N-1}\times I_{N}}\), applying the N-th mode product \(\times _N\) with a tensor \(\mathbf{U}\in \mathbf{\mathbb {R}}^{I_N\times K_1 \times K_2}\) gives \(\mathbf{X}\times _N \mathbf{U}\in \mathbf{\mathbb {R}}^{I_1 \times \cdots I_{N-1}\times K_1 \times K_2}\).
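As a concrete sketch, the n-mode product defined above can be implemented in a few lines of NumPy (the helper name `mode_product` is ours, not from the paper's code):

```python
import numpy as np

def mode_product(X, U, n):
    """n-mode product X x_n U: contracts mode n of X (size I_n)
    with the columns of U (shape J x I_n)."""
    Xn = np.moveaxis(X, n, 0)                   # bring mode n to the front
    out = np.tensordot(U, Xn, axes=([1], [0]))  # contract, giving (J, ...)
    return np.moveaxis(out, 0, n)               # move the new axis back

X = np.random.default_rng(0).normal(size=(3, 4, 5))
U = np.random.default_rng(1).normal(size=(7, 4))  # acts on mode 1 (size 4)
Y = mode_product(X, U, 1)
print(Y.shape)   # (3, 7, 5)
```

The same contraction can be written as a single `einsum`; the helper form makes the "replace one mode's size" behaviour explicit.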
In TT, \(\mathbf{Y}\) is decomposed into P latent tensors \(\mathbf{V}_p\in \mathbf{\mathbb {R}}^{R_{p} \times n_p \times R_{p-1} }, p=1,\ldots P\), with \(\mathbf{V}_1 \in \mathbf{\mathbb {R}}^{R_1\times n_1\times 1}\) and \(\mathbf{V}_P \in \mathbf{\mathbb {R}}^{1 \times n_P \times R_{P-1}}\). Here \(R_p\) is the latent dimensionality for each factor in the decomposition, with \(R_1=R_P=1\). Let \(\times _{-1} \mathbf{V}_p\) be the operation of applying the mode product to the last dimension of \(\mathbf{V}_p\). We seek \({\mathbf {V}}_1\ldots {\mathbf {V}}_P\) so that
Suppose that, associated with dimension p, we have \(c_p\)-dimensional side information denoted \(\mathbf{D}_p \in \mathbf{\mathbb {R}}^{n_p \times c_p}\). For example, if p is the dimension representing \(n_p=10{,}000\) different books, then the columns of \(\mathbf{D}_p \in \mathbf{\mathbb {R}}^{10{,}000\times c_p}\) might contain the author of the book, page count etc. Similar to Kim et al. (2016) and Kim and Choi (2014), side information is built into the second dimension of the latent tensor \(\mathbf{V}_p\in \mathbf{\mathbb {R}}^{R_{p} \times c_p \times R_{p-1}}\) using the mode product \(\mathbf{V}_p \times _{2} \mathbf{D}_p\). It should be noted that the middle dimension of \(\mathbf{V}_p\) changed from \(n_p\) to \(c_p\) to accommodate the dimensionality of the side information. For TT decomposition our approximation becomes
The above example illustrates the primal setting, where side information is applied directly. Similarly to Kim et al. (2016), we also consider kernelized side information in a Reproducing Kernel Hilbert Space (RKHS), which we will refer to as the dual setting.
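The primal construction can be sketched end to end: each TT core is combined with its side-information matrix via a mode product, and the cores are then chain-contracted. Shapes are hypothetical, and we use the convention core \((r_{p-1}, c_p, r_p)\) with boundary ranks equal to 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def tt_with_side_info(cores, side):
    """TT reconstruction where each core (r_prev, c_p, r_next) is first
    mapped through its side-information matrix D_p (n_p, c_p) via a
    mode-2 product, then the cores are chain-contracted."""
    mats = [np.einsum('acb,nc->anb', V, D) for V, D in zip(cores, side)]
    out = mats[0]                                # (1, n_1, r_1)
    for M in mats[1:]:
        out = np.einsum('...a,anb->...nb', out, M)
    return out.squeeze(0).squeeze(-1)            # drop boundary ranks

cores = [rng.normal(size=(1, 3, 4)), rng.normal(size=(4, 2, 5)),
         rng.normal(size=(5, 3, 1))]
side  = [rng.normal(size=(10, 3)), rng.normal(size=(8, 2)),
         rng.normal(size=(6, 3))]
Y = tt_with_side_info(cores, side)
print(Y.shape)   # (10, 8, 6)
```

Note that each core's middle dimension is \(c_p\) (the covariate count), while the reconstructed tensor has the item dimensions \(n_p\), exactly as in the text.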
2.1 The problem with adding side information to tensor factorization
Consider a matrix factorization problem for \(\mathbf{Y}\in \mathbf{\mathbb {R}}^{n_1\times n_2}\) with unknown latent factors \(\mathbf{U}\in \mathbf{\mathbb {R}}^{n_1\times R},\mathbf{V}\in \mathbf{\mathbb {R}}^{n_2\times R}\). We are approximating
If we update \(u_{1r}\) in the approximation \(y_{11}\approx \sum _{r=1}^R u_{1r}v_{1r}\), we also change the approximation \(y_{12}\approx \sum _{r=1}^R u_{1r}v_{2r}\), since they share the parameter \(u_{1r}\). However, \(y_{21}\) and \(y_{22}\) remain unchanged. Parameters are coupled across rows and columns, but not globally. This is the standard setup in latent factorization.
Now consider the case where we have \(\mathbf{D}_1 = \mathbf{D}_2= {\mathbf {1}}_{n_1\times n_1}\). We take our latent factors to be a linear function of available side information which leads \(\mathbf{U},\mathbf{V}\) to form \(\mathbf{D}_1\mathbf{U}= \begin{bmatrix} \sum _{i=1}^{n_1} u_{i1} &{} \dots &{} \sum _{i=1}^{n_1} u_{iR} \\ \vdots &{} \ddots &{} \\ \sum _{i=1}^{n_1} u_{i1} &{} &{} \sum _{i=1}^{n_1} u_{iR} \end{bmatrix}\) and \(\mathbf{D}_2\mathbf{V}\) (similar form). It follows that
is a constant matrix! We have lost all model flexibility as we are approximating \(\mathbf{Y}\) with a constant. Now consider a more realistic example with \(\mathbf{D}_1 = \begin{bmatrix} d_{11} &{} \dots &{} d_{1n_1} \\ \vdots &{} \ddots &{} \\ d_{n_11} &{} &{} d_{n_1n_1} \end{bmatrix}\) and \(\mathbf{D}_2 =\begin{bmatrix} z_{11} &{} \dots &{} z_{1n_2} \\ \vdots &{} \ddots &{} \\ z_{n_21} &{} &{} z_{n_2n_2} \end{bmatrix}\). In this case
Again, \(u_{ir}\) appears in all entries of our matrix approximation of \(\mathbf{Y}\). This time, however, changing \(u_{ir}\) will not change all entries by the same amount, but rather by different amounts across entries depending on the entries of \(\mathbf{D}_1\) and \(\mathbf{D}_2\). This connects all entries in the approximating matrix, introduces complex global variable dependence, and makes fitting infeasible at a large scale, as every optimization update is global. The observation applies in both the primal and dual representations. In the primal representation, expressiveness is also restricted, as the rank of our approximation falls with the rank of the side information. In this setting, near-collinearity is a further problem, as it leads to unstable optimization and factorizations that are very sensitive to noise.
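The collapse under constant side information is easy to verify numerically (a toy sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, R = 5, 6, 3
U = rng.normal(size=(n1, R))
V = rng.normal(size=(n2, R))
D1 = np.ones((n1, n1))   # completely uninformative side information
D2 = np.ones((n2, n2))

# latent factors expressed as a linear function of the side information
Yhat = (D1 @ U) @ (D2 @ V).T
print(np.allclose(Yhat, Yhat[0, 0]))   # True: the approximation is constant
```

Every row of `D1 @ U` equals the column sums of `U` (and similarly for `V`), so the product has identical entries everywhere.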
We see that when we add side information we may inadvertently restrict the expressiveness of our model. We formulate a new tensor model with a range and dependence structure unaffected by the addition of side information in either the primal or dual setting.
3 Proposed approach
3.1 Tensors and side information
We seek a tensor model that benefits from additional side information while not forfeiting model flexibility. We introduce two strategies, which we call weighted latent regression (WLR) and latent scaling (LS).
3.1.1 Weighted latent regression
We now return to the previous setting but with additional latent tensors \(\mathbf{U}'\in \mathbf{\mathbb {R}}^{n_1 \times R}\) and \(\mathbf{V}'\in \mathbf{\mathbb {R}}^{n_2 \times R}\). We approximate \(\mathbf{Y}\) as:
By taking the Hadamard product (\(\circ\)) with additional tensors \(\mathbf{U}'\), \(\mathbf{V}'\) we recover the model flexibility and dependence structure of vanilla matrix factorization, as \(\mathbf{U}'\) and \(\mathbf{V}'\) are independent of \(\mathbf{U}\) and \(\mathbf{V}\). Here, changing \(u_{ir}\) would still imply a change in all entries, by magnitudes defined by the side information; however, we can calibrate these changes at a latent, entrywise level by scaling each entry with \(u_{ir}'v_{jr}'\). For any TT decomposition with an additional tensor \(\mathbf{V}_p'\), the factorization becomes
where \(\delta (\cdot )\) denotes Kronecker delta. We interpret this as weighting the regression terms \(\sum _{i,j}d_{pi}z_{qj}u_{ir}v_{jr}\) over indices p, q with \(u_{pr}v_{qr}\) and then summing over latent indices r.
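In the matrix case, the WLR construction and its effect can be sketched as follows (toy sizes; note that the same constant side information that collapsed the plain covariate model no longer collapses the fit):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, R = 5, 6, 3
U, V   = rng.normal(size=(n1, R)), rng.normal(size=(n2, R))
Up, Vp = rng.normal(size=(n1, R)), rng.normal(size=(n2, R))  # U', V'
D1, D2 = np.ones((n1, n1)), np.ones((n2, n2))                # constant D

# weighted latent regression: Hadamard weights U', V' restore
# entrywise flexibility lost to the side-information mode product
Yhat = (Up * (D1 @ U)) @ (Vp * (D2 @ V)).T
print(np.allclose(Yhat, Yhat[0, 0]))   # False: no longer constant
```

With `D1`, `D2` constant, `D1 @ U` and `D2 @ V` are constant-row matrices, yet the elementwise weights make each entry of `Yhat` individually adjustable.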
Interpretability We can decompose the estimate into weights of side information, as illustrated in Fig. 1. A guiding example on analyzing the latent factors is provided in Appendix C.
3.1.2 Latent scaling
An alternative, computationally cheaper procedure is to consider additional latent tensors \(\mathbf{U}_1\in \mathbf{\mathbb {R}}^{n_1\times K},\mathbf{U}_2\in \mathbf{\mathbb {R}}^{n_1\times L},\mathbf{V}_1\in \mathbf{\mathbb {R}}^{n_2\times K},\mathbf{V}_2\in \mathbf{\mathbb {R}}^{n_2\times L}\). We would approximate
We have similarly regained our original model flexibility: independence is reintroduced for each term by scaling (\(\mathbf{V}^s\)) and adding a constant (\(\mathbf{V}^b\)), i.e. a ‘latent scale and bias’ for each regression term. We generalize this to
One may conjecture that adding side information to tensors through a linear operation is counterproductive due to the restrictions it imposes on the approximation, and argue that our proposal of introducing additional tensors to restore model flexibility is futile when side information is likely to be only marginally informative or potentially uninformative. As an example, return to the case of completely non-informative constant side information, \(\mathbf{D}_1={\mathbf {1}}\), \(\mathbf{D}_2={\mathbf {1}}\). In this corner case, both our proposed models reduce to regular factorization: the side-information regression term collapses to a constant, which in conjunction with the added terms reduces to regular tensor factorization without side information.
Interpretability We can decompose the estimate into weights of side information illustrated in Fig. 2.
A comment on identifiability It should be noted that the proposed models need not be identifiable.
To see this, return to the scenario where side information is constant. The term \(\mathbf{V}_p \times _2 \mathbf{D}_p\) has constant rows equal to row sums of \(\mathbf{V}_p\). Any transformation of \(\mathbf{V}_p\) which preserves these row sums leaves the fit unchanged. However, in large-scale industrial prediction, we draw utility from good generalization performance in prediction, and parameter identifiability is secondary.
3.2 Primal regularization terms
In weight space regression, the regularization terms are given by the squared Frobenius norm. The total regularization term would be written as
3.3 RKHS and the representer theorem
We extend our framework to the RKHS dual space formalism, where our extension can intuitively be viewed as a tensorized version of kernel ridge regression (Vovk 2013). The merit of this is to enhance performance by providing an implicit non-linear feature map of the side information using kernels.
First, consider side information \(\mathbf{D}_{p} = \{\mathbf{x}_i^p\}_{i=1}^{n_p}, \mathbf{x}_i^p \in \mathbf{\mathbb {R}}^{c_p}\), kernelized using a kernel function \(k:\mathbf{\mathbb {R}}^{c_p}\times \mathbf{\mathbb {R}}^{c_p}\rightarrow {\mathbb {R}}\). Denote \(k^{p}_{ij}=k(\mathbf{x}_i^p,\mathbf{x}_j^p) \in \mathbf{\mathbb {R}}\) and \(\mathbf{K}_p=k(\mathbf{D}_p,\mathbf{D}_p)\in \mathbf{\mathbb {R}}^{n_p \times n_p}\). Consider a tensor \(\mathbf{V}\in \mathbf{\mathbb {R}}^{R_1\times n_1 \cdots \times n_Q\times R_2}\), where \(Q<P\). Using the Representer theorem (Schölkopf et al. 2001), we can express \(\prod _{q=1}^{Q}\mathbf{V}\times _{q+1} \mathbf{K}_q\) as a function in the RKHS
where \(\mathsf {v}_{r_1,r_2}: \mathbf{\mathbb {R}}^{n_1}\times \cdots \times \mathbf{\mathbb {R}}^{n_Q} \rightarrow \mathbf{\mathbb {R}}\) and \(\mathsf {v}_{r_1,r_2} \in {\mathcal {H}}\), which denotes the RKHS with respect to the kernels \(\prod _{q}^Q \times k_q\). We use mode dot notation \(\times _{q+1}\) here to apply kernelized side information \(\mathbf{K}_q\) to each dimension of size \(n_1 \ldots n_Q\), where \(q+1\) is used to account for the first dimension consisting of \(R_1\).
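For concreteness, here is a minimal sketch of forming a Gram matrix from side information and applying it to a core via a mode product, assuming an RBF kernel (the paper leaves \(k\) generic, so the kernel choice is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_gram(D, lengthscale=1.0):
    """Gram matrix K_p = k(D_p, D_p); RBF kernel assumed for illustration."""
    sq = ((D[:, None, :] - D[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale ** 2))

D = rng.normal(size=(7, 3))          # n_p = 7 rows of c_p = 3 covariates
K = rbf_gram(D)                      # (7, 7) kernelized side information
V = rng.normal(size=(2, 7, 4))       # core whose middle dimension is n_p
VK = np.einsum('acb,nc->anb', V, K)  # V x_2 K_p via a mode product
print(VK.shape)   # (2, 7, 4)
```

In the dual setting the core's middle dimension stays at \(n_p\) (rather than shrinking to \(c_p\) as in the primal case), since \(\mathbf{K}_p\) is square.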
3.4 Dual space regularization term for WLR
Consider applying another tensor with the same shape as \(\mathbf{V}'\) through an elementwise product to robustify \(\mathbf{V}\), i.e. \(\mathbf{V}' \circ \prod _{q=1}^{Q}\mathbf{V}\times _{q+1} \mathbf{K}_q\). Then we have
where the regularization term for \(\mathsf {v}_{r_1,r_2}\) is given by:
and \((\cdot )_{++}\) means summing all elements.
3.5 Dual space regularization term for LS
For LS models, the regularization term is calculated as
where \(\mathsf {v}_{r_1,r_2}= \sum _{n_{1}\ldots n_q}v_{r_{p}n_1...n_qr_{p-1}}\prod _{q=1}^{Q} k(\cdot ,\mathbf{x}_{n_q}^{Q})\) and \(\sum _{r}\langle \mathsf {v}_r,\mathsf {v}_r \rangle _{{\mathcal {H}}} = \left( \left( \prod _{q=1}^Q \mathbf{V}\times _{q+1}\mathbf{K}_q\right) \circ \mathbf{V}\right) _{++}\).
3.6 Scaling with random Fourier features
To make tensors with kernelized side information scalable, we rely on a random Fourier feature (RFF) (Rahimi and Recht 2007) approximation of the true kernels. RFFs approximate a translation-invariant kernel function k using Monte Carlo:
where \(\omega _i\) are frequencies drawn from a normalized non-negative spectral measure \(\varLambda\) of kernel k. Our primary goal in using RFFs is to create a memory efficient, yet expressive method. Thus, we write
with explicit feature map \(\phi :{\mathbb {R}}^{D_p}\rightarrow {\mathbb {R}}^{M}\), and \(M\ll N_p\). In the case of RFFs,
This feature map can be applied in the primal space setting as a computationally cheap alternative to the RKHS dual setting.
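A minimal sketch of the RFF feature map, assuming the RBF kernel (our choice for illustration), so that \(\phi(x)^\top \phi(x') \approx k(x,x')\):

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, M, lengthscale=1.0):
    """Random Fourier feature map for the RBF kernel (assumed choice):
    phi(x) = sqrt(2/M) cos(W x + b), with W ~ N(0, 1/ell^2) drawn from
    the kernel's spectral measure and b ~ U[0, 2*pi]."""
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, M))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

X = rng.normal(size=(50, 3))
Phi = rff_map(X, M=2000)
K_approx = Phi @ Phi.T                       # approximate Gram matrix
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K_exact = np.exp(-sq / 2.0)                  # exact RBF Gram matrix
print(np.abs(K_approx - K_exact).max())      # approximation error
```

The error shrinks at rate \({\mathcal {O}}(1/\sqrt{M})\), while the explicit feature map needs only \({\mathcal {O}}(N_p M)\) memory instead of \({\mathcal {O}}(N_p^2)\).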
A drawback of tensors with kernelized side information is the \({\mathcal {O}}(N_p^2)\) memory growth of kernel matrices. If one of the dimensions has a large \(N_p\) in the dual space setting, we approximate large kernels \(\mathbf{K}_p\) with
where \(\varPhi =\phi (\mathbf{D}_p)\in \mathbf{\mathbb {R}}^{N_p\times M}\). To see that this is a valid approximation, an element \({\bar{v}}_{i_1...i_p}\) in \(\mathbf{V}\times _p \mathbf{K}_p\) is given by \({\bar{v}}_{i_1...i_p} = \sum _{i_p'=1}^{N_p}v_{i_1...i_p'}k(\mathbf{x}_{i_p}^p,\mathbf{x}_{i_p'}^p)\). Using RFFs we have
KFT with RFFs now becomes:
with the regularization term
For a derivation, please refer to Appendix B.
3.7 Kernel fried tensor
Having established our new model, we coin it kernel fried tensor (KFT). Given some loss \({\mathcal {L}}(\mathbf{Y},\tilde{\mathbf{Y}})\) for predictions \(\tilde{\mathbf{Y}}\) the full objective is
where \(\mathbf{V}_p,\mathbf{V}_p',\varTheta _p\) are the parameters of the model, with \(\varTheta _p\) the kernel parameters if we use the RKHS dual formulation. As our proposed model involves mutually dependent components with non-zero mixed partial derivatives, optimizing them jointly with a first-order solver is inappropriate, as mixed partial derivatives are not considered during each gradient step. Inspired by the EM algorithm (Dempster et al. 1977), we summarize our training procedure in Algorithm 1. By updating each parameter group sequentially and independently, we eliminate the effects of mixed partials, leading to accurate gradient updates. For further details, we refer to Appendix E.1.
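Algorithm 1 itself is not reproduced here, but its sequential update scheme can be sketched on plain matrix factorization as a stand-in for KFT (learning rate, sizes and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def alternating_fit(Y, R, iters=2000, lr=0.01):
    """EM-inspired scheme sketched on plain matrix factorization:
    update one parameter group at a time (U with V frozen, then V
    with U frozen), so each gradient step ignores mixed partials."""
    n1, n2 = Y.shape
    U = rng.normal(scale=0.1, size=(n1, R))
    V = rng.normal(scale=0.1, size=(n2, R))
    for _ in range(iters):
        U = U - lr * 2.0 * (U @ V.T - Y) @ V    # U-step, V frozen
        V = V - lr * 2.0 * (U @ V.T - Y).T @ U  # V-step, U frozen
    return U, V

Y = rng.normal(size=(8, 3)) @ rng.normal(size=(6, 3)).T   # rank-3 target
U, V = alternating_fit(Y, R=3)
print(np.linalg.norm(U @ V.T - Y) / np.linalg.norm(Y))    # relative error
```

In KFT the same idea applies per parameter group \(\mathbf{V}_p\), \(\mathbf{V}_p'\), \(\varTheta _p\) rather than per factor matrix.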
3.7.1 Joint features
From a statistical point of view, we are assuming that each of our latent tensors \(\mathbf{V}_p\) factorizes \(\mathbf{Y}\) into P independent components with prior distribution corresponding to \({\mathcal {N}}(\mathsf {v}_{r_{p-1}r_p}^{p} |0,\mathbf{K}_p)\), where \(\mathsf {v}_{r_{p-1}r_p}^{p}\in \mathbf{\mathbb {R}}^{n_p}\) is the \(r_{p-1},r_p\) cell selected from \(\mathbf{V}_p\). We can enrich our approximation by jointly modelling some dimensions p by choosing some \(\mathbf{V}_p\in \mathbf{\mathbb {R}}^{R_{p+1}\times n_{p+1}\times n_{p} \times R_{p-1 }}\). If we denote this dimension p by \(p'\) we have that
and the prior would instead be given as \({\mathcal {N}}(\text {vec}(\mathsf {v}_{r_{p'-1}r_{p'+1}}^{p'}) |0,\mathbf{K}_{p'}\otimes \mathbf{K}_{p'+1})\). Here \(\text {vec}(\mathsf {v}_{r_{p'-1}r_{p'+1}}^{p'}) \in \mathbf{\mathbb {R}}^{n_{p'}n_{p'+1}}\) and \(\text {vec}(\cdot )\) means flattening a tensor to a vector. The cell selected from \(\mathbf{V}_{p'}\) now has a dependency between dimensions \(p'\) and \(p'+1\). We refer to a one-dimensional factorization component of TT as a TT-core, and a multi-dimensional factorization component as a joint TT-core.
3.8 Bayesian inference
We turn to Bayesian inference for uncertainty quantification with KFT. We assume a Gaussian conditional likelihood for an observation \(y_{i_1\ldots i_P}\), taking inspiration from Gönen et al. (2013) and Kim and Choi (2014). For KFT-WLR we have that
The corresponding objective for KFT-LS is
where \(\sigma _y^2\) is a scalar hyperparameter.
3.9 Variational approximation
Our goal is to maximize the posterior distribution \(p(\mathbf{V}_p,\mathbf{V}_p'|\mathbf{Y})\), which is intractable as the likelihood \(p(\mathbf{Y})= \int p(\mathbf{Y}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P')p(\mathbf{V}_1)...p(\mathbf{V}_P) p(\mathbf{V}_1')...p(\mathbf{V}_P')d\mathbf{V}_{1\ldots P}d\mathbf{V}_{1\ldots P}'\) does not have a closed-form solution due to the product of Gaussians. Instead, we use variational approximations for \(\mathbf{V}_p,\mathbf{V}_p'\) by parametrizing distributions of the Gaussian family and optimizing the evidence lower bound (ELBO)
In our framework, we consider univariate and multivariate Gaussians as variational approximations, with corresponding priors, where \(\sigma _y^2\) is interpreted as controlling the weight of the reconstruction term against the KL-term.
3.9.1 Univariate VI
Univariate KL For the case of univariate normal priors, we calculate the KL divergence as
where \(\mu _q,\mu _p\) and \(\sigma _q^2,\sigma _p^2\) are the mean and variance for the variational approximation and prior respectively, where \(\mu _p,\sigma _p^2\) are chosen a priori.
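This closed form is straightforward to implement (a sketch; the function name is ours):

```python
import numpy as np

def kl_univariate(mu_q, s2_q, mu_p, s2_p):
    """KL( N(mu_q, s2_q) || N(mu_p, s2_p) ), applied elementwise
    over the variational parameters."""
    return (0.5 * np.log(s2_p / s2_q)
            + (s2_q + (mu_q - mu_p) ** 2) / (2.0 * s2_p)
            - 0.5)

print(kl_univariate(0.0, 1.0, 0.0, 1.0))   # 0.0 when q equals p
print(kl_univariate(1.0, 1.0, 0.0, 1.0))   # 0.5
```

Because the approximation is mean-field, the KL-term of the ELBO is simply this expression summed over all latent entries.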
Model For a univariate Gaussian variational approximation we assume the following prior structure
with corresponding univariate meanfield approximation
We take \(\mu _p',\sigma _p'^2,\mu _p,\sigma _p^2\) to be hyperparameters.
Weighted latent regression reconstruction term For Weighted latent regression, we express the reconstruction term as
where \(\mathbf{M}_p',\mathbf{M}_p\) and \(\varSigma _p',\varSigma _p\) correspond to the tensors containing the variational parameters \(\mu _{r_{p}i_pr_{p-1}}'\), \(\mu _{r_{p}i_pr_{p-1}}\) and \(\sigma _{r_{p}i_pr_{p-1}}'^2\), \(\sigma _{r_{p}i_pr_{p-1}}^2\) respectively. For the case of RFFs, we approximate \(\varSigma _p \times _2 (\mathbf{K}_p)^2 \approx \varSigma _p \times _2 (\varPhi _p \bullet \varPhi _p)^{\top } \times _2 (\varPhi _p \bullet \varPhi _p)\), where \(\bullet\) is the transposed Khatri–Rao product. It should further be noted that any squared term denotes elementwise squaring. We provide a derivation in Appendix A.1.
Latent scaling reconstruction term For Latent Scaling, we express the reconstruction term as
For details, see Appendix A.2.
3.9.2 Multivariate VI
Multivariate KL The KL divergence between a multivariate normal prior p and a variational approximation q is
where \(\mu _p,\mu _q\) and \(\varSigma _p,\varSigma _q\) are the means and covariances of the prior and the variational approximation, respectively. Inspired by the g-prior (Zellner 1986), we take \(\varSigma _p = \mathbf{K}^{-1}_p\), where \(\mathbf{K}^{-1}_p\) is the inverse kernel covariance matrix of the side information for mode p. When side information is absent, we take \(\varSigma _p = \mathbf{I}\). Another benefit of using the inverse is that it simplifies calculations, since we avoid inverting a dense square matrix in the KL-term. Similar to the univariate case, we choose \(\mu _p\) a priori, although here it is a constant tensor rather than a constant scalar.
Model For the multivariate case, we consider the following priors
where \(Q_p\) is the number of dimensions jointly modeled in each TT-core. For the variational approximations, we have
We take \(\mu _p',\sigma _p'^2,\mu _p\) to be hyperparameters and \(\varSigma _{q}=\mathbf{B}_{q}\mathbf{B}_{q}^{\top }\).
Sampling and parametrization Calculating \(\prod _{q=1}^{Q_i} \otimes \mathbf{B}_{q}\mathbf{B}_{q}^{\top }\) directly will yield a covariance matrix that is prohibitively large. To sample from \(q(\mathbf{V}_p)\) we exploit that positive definite matrices A and B with their Cholesky decompositions \(L_A\) and \(L_B\) have the following property
together with the fact that
where \(\text {vec}(\mathbf{X})\in \mathbf{\mathbb {R}}^{\prod _{i}^N I_i\times R}\). We would then draw a sample \({\mathbf {b}}\sim q(\mathbf{V})\) as
where \(\tilde{\mathbf{z}}\sim {\mathcal {N}}(0,{\mathbf {I}}_{\prod _{p=1}^{P}n_p})\) is reshaped into \(\tilde{\mathbf{z}}\in \mathbf{\mathbb {R}}^{n_1\times \cdots \times n_P}\). We take \(\mathbf{B}_q = \mathbf{ltri} (B_qB_q^{\top })+D_q\) (Ong et al. 2017), where \(B_q\in \mathbf{\mathbb {R}}^{n_q\times r}\), \(D_q\) is a diagonal matrix, and \(\mathbf{ltri}\) denotes taking the lower triangular component of a square matrix, including the diagonal. We choose this parametrization for a linear time-complexity calculation of the determinant in the KL-term, by exploiting that \(\det \left( \varSigma _{q}\right) =\det \left( \mathbf{B}_{q}\mathbf{B}_{q}^{\top }\right) =(\det \left( \mathbf{B}_{q}\right) )^2\). In the RFF case, we take \(\mathbf{B}_q=B_q\) and estimate the covariance as \(\mathbf{B}_q \mathbf{B}_q^{\top } + D_q^2\).
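A sketch of the sampling trick, which draws from \({\mathcal {N}}(0, S_1 \otimes S_2)\) without ever materializing the Kronecker product (toy sizes; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_kron_gaussian(S1, S2):
    """Draw from N(0, S1 kron S2) using chol(S1 kron S2) =
    chol(S1) kron chol(S2) and the vec identity
    (A kron B) vec(Z) = vec(B Z A^T) for column-major vec."""
    L1, L2 = np.linalg.cholesky(S1), np.linalg.cholesky(S2)
    Z = rng.normal(size=(S2.shape[0], S1.shape[0]))
    return (L2 @ Z @ L1.T).flatten(order='F')

# small positive-definite covariances
M1, M2 = rng.normal(size=(3, 3)), rng.normal(size=(4, 4))
S1, S2 = M1 @ M1.T + np.eye(3), M2 @ M2.T + np.eye(4)
b = sample_kron_gaussian(S1, S2)

# sanity check of the vec identity against the explicit Kronecker product
L1, L2 = np.linalg.cholesky(S1), np.linalg.cholesky(S2)
Z = rng.normal(size=(4, 3))
print(np.allclose((L2 @ Z @ L1.T).flatten(order='F'),
                  np.kron(L1, L2) @ Z.flatten(order='F')))   # True
```

The cost is two small Cholesky factors and two matrix products, instead of a Cholesky of the full \(n_1 n_2 \times n_1 n_2\) covariance.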
Weighted latent regression reconstruction term We similarly to the univariate case express the reconstruction term as
where \(\varSigma _p = \mathbf{B}_p \mathbf{B}_p^{\top }\), \({\mathbf {1}}\) denotes a constant one tensor with the same dimensions as \(\varSigma _p'\), \({\bar{1}}\in \mathbf{\mathbb {R}}^{R\times 1}\), where R is the column dimension of \(\mathbf{B}_p\), and \(\varSigma _p'\) is the same as in the univariate case. For RFFs we have that
Latent scaling reconstruction term The latent scaling version has the following expression
For details, see Appendix A.2.
RFFs and KL divergence Using \((\varPhi _p\varPhi _p^{\top })^{-1}\approx (\mathbf{K}_p)^{-1}\) as our prior covariance, we observe that the KL-term presents computational difficulties as a naive approach would require storing \((\mathbf{K}_p)^{-1}\in \mathbf{\mathbb {R}}^{n_p\times n_p}\) in memory. Assuming we take \(\varSigma _p = BB^{\top }, B \in \mathbf{\mathbb {R}}^{n_p \times R}\), we can manage the first term by using the equivalence
Consequently, we have that
We can calculate the second term using (34) and (35). For the third term, we remember Weinstein–Aronszajn’s identity
where \(A\in \mathbf{\mathbb {R}}^{m\times n},B\in \mathbf{\mathbb {R}}^{n\times m}\) and AB is trace class. If we were to take our prior covariance matrix to be \(\varSigma _p = (\varPhi _p\varPhi _p^{\top } + I_{n_p} )^{-1}\approx (\mathbf{K}_p + I_{n_p})^{-1}\) and our posterior covariance matrix to be approximated as \(\varSigma _q = BB^{\top }+I_{n_p}\), we could use Weinstein–Aronszajn’s identity to calculate the third log term in a computationally efficient manner.
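The identity can be checked numerically (toy sizes, with n playing the role of the small rank \(R \ll n_p\)):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 2))   # m = 6, n = 2
B = rng.normal(size=(2, 6))

lhs = np.linalg.det(np.eye(6) + A @ B)   # O(m^3) determinant
rhs = np.linalg.det(np.eye(2) + B @ A)   # O(n^3): much cheaper when n << m
print(np.isclose(lhs, rhs))   # True
```

This is exactly what makes the log-determinant term of the KL tractable: the \(n_p \times n_p\) determinant is swapped for an \(R \times R\) one.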
From a statistical perspective, adding a diagonal to the covariance matrix implies regularizing it by increasing the diagonal variance terms. Taking inspiration from Kim and Teh (2017), we can further choose the magnitude \(\sigma\) of the regularization
The KL expression then becomes
3.9.3 Calibration metric
We evaluate the overall calibration of our variational model using the sum
of the calibration rate \(\xi _{1-2\alpha }\), which we define as
where we consider \(\alpha \in \{0.05,0.15,0.25,0.35,0.45\}\). This calibration rate can be understood as the true (frequentist) coverage probability, with \(1-2\alpha\) the nominal coverage probability. The model is calibrated when the true coverage probability is close to the nominal coverage probability, that is, when \(\varXi\) is small. To ensure that our model finds a meaningful variational approximation, we take our hyperparameter selection criterion to be:
where \(R^2:=1 - \frac{\sum _{i_1\ldots i_n} (y_{i_1\ldots i_n} - {\hat{y}}_{i_1\ldots i_n})^2}{\text {Var}(\mathbf{Y})}\) is the coefficient of determination (Draper and Smith 1966), calculated using the “mean terms” \(\Bigg ( \prod _{p=1}^P \times _{-1} ( \mathbf{M}_p' \circ (\mathbf{M}_p\times _{2} \mathbf{K}_p))\Bigg )\) or \(\Bigg (\prod _{p=1}^P (\mathbf{M}_p^s \circ (\mathbf{M}_p \times _2 \mathbf{K}_p) + \mathbf{M}_p^b) \Bigg )\) as the predictions \({\hat{y}}_{i_1\ldots i_n}\). If we were to use only \(\eta _{\text {criteria}}=\varXi \), we argue that the inductive bias in the choice of \(\alpha\)’s may lead to an approximation that is calibrated per se, but not meaningful, as the modes are incorrect (low \(R^2\) value). Similarly to the frequentist case, we use an EM-inspired optimization strategy in Algorithm 2. The main idea is to find the mode and variance parameters of our variational approximation in a mutually exclusive sequential order, starting with the modes. As in the frequentist case, the reconstruction term of the ELBO has terms that contain both \(\varSigma _p\) and \(\mathbf{M}_p'\), which motivates the EM-inspired approach. For further details, please refer to Appendix E.2.
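The calibration criterion \(\varXi\) can be sketched from posterior predictive samples as follows (our sketch of the criterion, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def calibration_score(samples, y, alphas=(0.05, 0.15, 0.25, 0.35, 0.45)):
    """Xi = sum over alpha of |empirical coverage - (1 - 2*alpha)|,
    with coverage estimated from posterior predictive samples of
    shape (n_samples, n_obs)."""
    score = 0.0
    for a in alphas:
        lo = np.quantile(samples, a, axis=0)
        hi = np.quantile(samples, 1.0 - a, axis=0)
        score += abs(np.mean((y >= lo) & (y <= hi)) - (1.0 - 2.0 * a))
    return score

# a well-specified predictive distribution should score close to zero
samples = rng.normal(size=(4000, 2000))
y = rng.normal(size=2000)
score = calibration_score(samples, y)
print(score)
```

A badly miscalibrated model (e.g. predictive intervals that are far too narrow) drives the coverage terms toward 0 and \(\varXi\) toward its maximum.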
3.10 Extension to forecasting
KFT in its current form is fundamentally unable to accommodate forecasting problems. To see this, we first consider the forecasting problem of predicting an observation \(y_T\) from previous observations \(\mathbf{y }_t=[y_0,\ldots ,y_{t}], \quad 0<t<T\), using a model \(f(\mathbf{y }_t)\). The model is then optimized by minimizing
for all T. Forecasting problems assume that the model does not have access to the future observation \(y_{T+1}\) outside the training set and must instead learn \(y_{T+1}\) through an autoregressive assumption. Imposing this assumption on KFT would imply that the latent factorizations for time index \(T+1\) remain untrained under the current training procedure, as we do not have access to these indices during training. With untrained latent factorizations for \(T+1\), any forecast would at best be random.
However, we can easily extend KFT to be autoregressive by directly applying the approach of Yu et al. (2016).
3.10.1 Frequentist setting
We consider the Temporal Regularized Matrix Factorization (TRMF) framework presented in Yu et al. (2016)
where \(\mathbf{X } \in {\mathbb {R}}^{T\times k}\) is the temporal factorization component in the matrix factorization \(\mathbf{U}\cdot \mathbf{X }^\top\) and \(\mathbf{x }_t\) denotes \(\mathbf{X }\) sliced at time index t. Further, we take \({\mathcal {W}}=\left\{ W^{(l)} \in \text {diag}({\mathbb {R}}^{k})\mid l \in {\mathbb {L}}\right\}\) (i.e. the set of diagonal matrices) and \({\mathbb {L}}=\{l_i<T\mid i=1,\ldots ,I\}\) as the set of time indices to lag. Additionally, the regularization weight \(\eta\) is needed to ensure that \({\mathbf {X}}\) varies smoothly. However, in KFT such regularization already exists and it suffices to consider
Forecasting for WLR. TRMF can be extended to WLR by simply taking \(\mathbf{X } = {\mathbf {V}}_t' \circ ({\mathbf {V}}_t \times _2 \mathbf{D}_t) \in {\mathbb {R}}^{r_{T+1} \times T\times r_{T-1}}\) and \({\mathcal {W}}=\left\{ W^{(l)} \in {\mathbb {R}}^{r_{T+1} \times r_{T-1}}\mid l \in {\mathbb {L}}\right\}\), which then yields
which we coin KFTRegularizer (KFTR). We follow the same training strategy proposed in Yu et al. (2016) by sequentially updating \({\mathcal {F}}=\{{\mathbf {V}}_p,{\mathbf {V}}_p' \mid p=1,\ldots ,P \}\backslash {\mathbf {X}}\), \(\mathbf{X }\) and \({\mathcal {W}}\).
Forecasting for LS. KFTR can also be applied to the LS variant by applying the temporal regularization to all three components \(\mathbf{X}^s = \mathbf{V}^s_t\), \(\mathbf{X}^b = \mathbf{V}^b_t\) and \(\mathbf{X}= \mathbf{V}_t \times \mathbf{D}_t\). We then apply Eq. (49) to each term.
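For illustration, the lagged autoregressive penalty \(\sum _t \Vert \mathbf{x }_t - \sum _{l \in {\mathbb {L}}} W^{(l)} \mathbf{x }_{t-l}\Vert ^2\) underlying this TRMF-style regularization can be sketched as follows (NumPy; diagonal lag weights stored as vectors — an illustrative stand-in, not the released KFTR code):

```python
import numpy as np

def ar_regularizer(X, W, lags):
    """Penalty sum_t || x_t - sum_l W[l] * x_{t-l} ||^2 on a temporal factor
    X of shape (T, k), with one diagonal weight vector W[l] per lag l."""
    L = max(lags)
    T = X.shape[0]
    pred = np.zeros_like(X[L:])
    for l in lags:
        pred += W[l] * X[L - l:T - l]   # diagonal W acts elementwise per factor
    return float(((X[L:] - pred) ** 2).sum())

# A factor that satisfies x_t = x_{t-1} incurs zero lag-1 penalty.
X = np.tile(np.arange(1.0, 4.0), (10, 1))   # constant over time, k = 3
W = {1: np.ones(3)}
print(ar_regularizer(X, W, lags=[1]))   # 0.0
```

In training, this term is added to the loss and the lag weights \(W^{(l)}\) are updated in their own step of the sequential scheme.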
3.10.2 Bayesian setting
KFTR is extended to the Bayesian setting by optimizing the quantity
in addition to the ELBO. The probability distribution \(p(\cdot )\) is assumed to be a univariate normal with fixed variance \(\sigma ^2_{\text {KFTR}}\). We define
as functions of variational variables \(\mathbf{V}, \mathbf{V}'\). Here \(\left[ \cdot \right] _t\) means slicing the tensor at index t. As an autoregressive dependency on \(\mathbf{X}_t\) is required, we take
However \(\mathbf{X}_t\) is composed of variational variables \(\mathbf{V}, \mathbf{V}'\) and thus we can write
The log-expectation is intractable, so we use Jensen’s inequality and optimize a lower bound instead. We then arrive at the following expression
We calculate the expression by taking expectations of the x-terms with respect to \(v,v'\) and arrive at
It should be noted that the above expression considers the WLR case for a univariate meanfield model. For a complete derivation and an expression for the multivariate meanfield model and the LS version, we refer to Appendix A.2.2.
3.11 Complexity analysis
We give a complexity analysis for all variations of KFT in both the frequentist and Bayesian settings.
Theorem 1
KFT has computational complexity
and memory footprint \({\mathcal {O}}\left( P\cdot \left( \max _{p}\left( n_p r_p r_{p-1}\right) + \max _{p}\left( n_p c_p\right) \right) \right)\) for a gradient update on a batch of data. In the dual case, we take \(c_p=n_p\).
We provide a proof in the appendix. It should be noted that one can reduce the complexity and memory footprint by permuting the modes of the tensor such that the larger modes are on the edges, i.e. where \(r_{p}=1\) or \(r_{p-1}=1\). Then \(n_p\) will only scale with \(r_{p-1}\) or \(r_p\).
KFTR complexity. The additional complexity associated with adding an autoregressive regularization term is at worst \({\mathcal {O}}(r_p r_{p-1} K T )\), where K is the number of lags and T the size of the temporal mode. As this term scales only linearly with K, it does not affect the overall complexity of KFT.
4 Experiments
The experiments are divided into analyzing the frequentist and Bayesian versions of KFT. Frequentist KFT is compared against competing methods on prediction and forecasting on various high dimensional datasets; the datasets we use are summarized in Table 2. For Bayesian KFT, we investigate performance with a focus on the calibration of the obtained posterior distributions, by comparing nominal coverage probabilities to the true coverage probabilities.
4.1 Predictive performance of KFT
We compare KFT to the established FFM and LightGBM on the task of prediction on three different datasets: Retail Sales, Movielens-20M, and Alcohol Sales (cf. Table 2). We train KFT using squared loss. LightGBM is a challenging benchmark, as it has continuously received development, engineering, and performance optimization since its inception in 2017. We execute our experiments by running 20 iterations of hyperopt (Bergstra et al. 2013) for all methods to find the optimal hyperparameter configuration, constrained by a memory budget of 16GB (the memory limit of a high-end GPU), for 5 different seeds, where the seed controls how the data is split. We split our data into 60% training, 20% validation, and 20% testing. We report scores in \(R^2\), since it provides a normalized goodness-of-fit score, and measure performance in terms of the \(R^2\)-value on test data. For further details on hyperparameter ranges, data, and preprocessing, see Appendix F. The results are reported in Table 3, where the best results are boldfaced. We observe that KFT has configurations that can outperform the benchmarks by a good margin. Furthermore, the dual space models generally do better than their primal counterparts. We hypothesize that the enhanced expressiveness of kernelized side information is the reason for this.
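The split protocol above can be sketched as follows (a simplified NumPy stand-in for the actual hyperopt-driven pipeline; sizes are illustrative):

```python
import numpy as np

def split_indices(n, seed):
    """Seed-controlled 60/20/20 train/validation/test split, mirroring the
    protocol described above (a sketch, not the released experiment code)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# One split per seed; each hyperparameter search runs against its own split.
for seed in range(5):
    train, val, test = split_indices(10_000, seed)
    assert len(train) == 6_000 and len(val) == 2_000 and len(test) == 2_000
    assert len(np.intersect1d(train, test)) == 0   # disjoint partitions
print("splits ok")
```

Tying the split to the seed makes each of the 5 runs reproducible while still averaging over split variability.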
The next experiment makes a direct comparison of KFT to recent matrix factorization methods tailored for the Movielens-1M and Movielens-10M datasets. The purpose of this experiment is to further demonstrate the competitiveness of KFT on recommendation tasks. Table 4 compares the RMSE of KFT on Movielens-1M and Movielens-10M to the RMSEs of existing methods, as reported in their respective papers. KFT outperforms existing non-neural models and marginally underperforms compared to neural-based state-of-the-art models for matrix factorization. In comparison to tensor factorization, KFT outperforms NTF despite NTF being a neural model. The RMSE is generally higher for tensor factorization, as data becomes sparser with each additional dimension.
For forecasting problems, we replicate the experiments in Yu et al. (2016) and Salinas et al. (2019) for the traffic dataset and Yu et al. (2014) for the CCDS data and compare against KFT in Tables 5 and 6.
We plot forecasts for a selection of datasets and series in Fig. 3.
4.2 Calibration study of Bayesian KFT
We use the same setup as in the frequentist case, modulo the hyperparameter evaluation objective, which instead is (46). For details on hyperparameter choices and data preparation, please refer to Appendix F. We run a regular tensor factorization without side information as a performance benchmark, intended to mimic Hawkins and Zhang (2018). We summarize the results in Table 7. We obtain calibrated variational approximations and observe that models using side information yield better predictive performance but that their calibration becomes slightly worse. By using a Bayesian framework we seem to generally lose some predictive performance compared to the corresponding frequentist methods, except in the case of Movielens-20M. We provide a visualization of the calibration ratios for all datasets in Fig. 4.
We compare KFT to existing Bayesian models with reported results in Table 8, where the model is optimized for \(\eta _{\text {criteria}}\), in contrast to the RMSE used in Kim and Choi (2014). KFT performs well in a Bayesian setting compared to Kim and Choi (2014) and also yields calibrated estimates. For forecasting problems, we apply KFTR to the Traffic and CCDS datasets and report the results in Table 9. We find that the WLR version of KFT does well and yields calibrated forecasts.
We plot forecasts for the Traffic dataset with uncertainty quantification in Fig. 5.
5 Analysis
We demonstrated the practical utility of KFT in both a frequentist and Bayesian context. We now scrutinize the robustness and effectiveness of KFT as a remedy for constraint-imposing side information.
5.1 Does KFT really amend the constraints of directly applying side information?
To validate this, we train KFT-WLR and a naive model on the Alcohol Sales dataset using kernelized side information for 5 hyperparameter searches. We plot the mean training, validation, and test error of the 5 searches (and the standard errors) against epochs in Fig. 6.
5.2 How does KFT perform when applying constant side information?
To answer this question, we replace all side information with a constant \({\mathbf {1}}\) and kernelize it. The results in the first row of Table 10 indicate that KFT indeed is robust towards constant side information, as the performance does not degrade dramatically.
5.3 How does KFT perform when applying noise as side information?
Similar to the previous question, we now replace the side information with standard Gaussian noise instead. The results in the last row of Table 10 indicate that KFT is also robust against noise, and surprisingly performant as well. A possible explanation is that adding Gaussian noise serves as an implicit regularizer, or that the original side information is distributed similarly to standard Gaussian noise. We conclude that KFT is stable against uninformative side information in the form of Gaussian noise.
6 Conclusion
We identified an inherent limitation of side information based tensor regression and proposed a method that removes this limitation. Our proposed KFT method yields competitive performance against state-of-the-art large-scale prediction models on a fixed computational budget. Specifically, as the experiments in Table 3 demonstrate, for at least some cases of real practical interest, Weighted Latent Regression is the most performant configuration. Further, KFT offers extended versatility in terms of calibrated Bayesian variational estimates. Our analysis shows that KFT solves the problems we described in Sect. 2 and is robust to adversarial side information in the form of Gaussian noise. A direction for further development would be to characterize identifiability conditions for KFT and extend the Bayesian framework beyond mean-field variational inference.
Notes
The recursive abbreviation reflects the recursive nature of the autoregressive regularization!
References
Agarwal, D., & Chen, B. C. (2009). Regression-based latent factor models (Vol. ’09, pp. 19–28). Association for Computing Machinery. https://doi.org/10.1145/1557019.1557029
Batselier, K. (2018). The trouble with tensor ring decompositions. Preprint
Bergstra, J., Yamins, D., & Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013).
Bobadilla, J., Ortega, F., Hernando, A., & Bernal, J. (2012). A collaborative filtering approach to mitigate the new user cold start problem. Knowledge-Based Systems, 26, 225–238. https://doi.org/10.1016/j.knosys.2011.07.021
Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. (2013). Recommender systems survey. Knowledge-Based Systems, 46, 109–132. https://doi.org/10.1016/j.knosys.2013.03.012
Caro, F., & Gallien, J. (2010). Inventory management of a fast-fashion retail network. Operations Research. https://doi.org/10.1287/opre.1090.0698
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38. http://www.jstor.org/stable/2984875
Draper, N., & Smith, H. (1966). Applied regression analysis. Wiley series in probability and mathematical statistics. Wiley. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=ppn+022791892&sourceid=fbw_bibsonomy
Du, C., Li, C., Zheng, Y., Zhu, J., & Zhang, B. (2018). Collaborative filtering with user-item co-autoregressive models. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://ojs.aaai.org/index.php/AAAI/article/view/11884
Gönen, M., Khan, S., & Kaski, S. (2013). In S. Dasgupta & D. McAllester (Eds.), Kernelized Bayesian matrix factorization (Vol. 28, pp. 864–872). PMLR. http://proceedings.mlr.press/v28/gonen13a.html
Hawkins, C., & Zhang, Z. (2018). Variational Bayesian Inference for Robust Streaming Tensor Factorization and Completion. 2018 IEEE International Conference on Data Mining (ICDM), 1446–1451.
He, L., Lu, C. T., Ma, G., Wang, S., Shen, L., Yu, P. S., & Ragin, A. B. (2017). In D. Precup & Y. W. Teh (Eds.) Kernelized support tensor machines (Vol. 70, pp. 1442–1451). PMLR, International Convention Centre. http://proceedings.mlr.press/v70/he17a.html
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1), 1303–1347. http://dl.acm.org/citation.cfm?id=2502581.2502622
Juan, Y., Zhuang, Y., Chin, W. S., & Lin, C. J. (2016). Field-aware factorization machines for ctr prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16 (pp. 43–50). ACM. https://doi.org/10.1145/2959100.2959134
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. Y. (2017). Lightgbm: A highly efficient gradient boosting decision tree. In NeurIPS.
Kim, H., Lu, X., Flaxman, S., & Teh, Y. W. (2016). Collaborative filtering with side information: A Gaussian process perspective. Preprint
Kim, H., & Teh, Y.W. (2018). Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes. AISTATS.
Kim, Y. D., & Choi, S. (2014). Scalable variational Bayesian matrix factorization with side information. In AISTATS.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.
Kuang, L., Hao, F., Yang, L. T., Lin, M., Luo, C., & Min, G. (2014). A tensor-based approach for big data representation and dimensionality reduction. IEEE Transactions on Emerging Topics in Computing, 2(03), 280–291. https://doi.org/10.1109/TETC.2014.2330516
Liu, T., Wang, Z., Tang, J., Yang, S., Huang, G. Y., & Liu, Z. (2019). Recommender systems with heterogeneous side information (Vol. ’19, pp. 3027–3033). Association for Computing Machinery. https://doi.org/10.1145/3308558.3313580
Narita, A., Hayashi, K., Tomioka, R., & Kashima, H. (2012). Tensor factorization using auxiliary information. Data Mining and Knowledge Discovery, 25(2), 298–324. https://doi.org/10.1007/s10618-012-0280-z
Ong, V., Nott, D., & Smith, M. (2017). Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27. https://doi.org/10.1080/10618600.2017.1390472
Oseledets, I. V. (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5), 2295–2317. https://doi.org/10.1137/090752286
Pal, B., & Jenamani, M. (2018). Kernelized probabilistic matrix factorization for collaborative filtering: Exploiting projected user and item graph (pp. 437–440). Association for Computing Machinery. https://doi.org/10.1145/3240323.3240402
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L. & Lerer, A. (2017). Automatic Differentiation in PyTorch. NIPS 2017 Workshop on Autodiff.
Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines (Vol. NIPS’07, pp. 1177–1184). Curran Associates Inc.
Rendle, S. (2010). Factorization machines. In G.I. Webb, B. Liu, C. Zhang, D. Gunopulos, & X. Wu (Eds.) ICDM 2010, The 10th IEEE international conference on data mining, Sydney, Australia, 14–17 December 2010 (pp. 995–1000). IEEE Computer Society. https://doi.org/10.1109/ICDM.2010.127
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206–215. https://doi.org/10.1038/s42256-019-0048-x
Saha, A., Misra, R., Acharya, A., & Ravindran, B. (2017). Scalable variational Bayesian factorization machine. Preprint
Salakhutdinov, R., & Mnih, A. (2007). Probabilistic matrix factorization. In Proceedings of the 20th international conference on neural information processing systems, NeurIPS (pp. 1257–1264). Curran Associates Inc. http://dl.acm.org/citation.cfm?id=2981562.2981720
Salinas, D., Flunkert, V., & Gasthaus, J. (2019). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36. https://doi.org/10.1016/j.ijforecast.2019.07.001
Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In COLT/EuroCOLT.
Seeger, M., Salinas, D., & Flunkert, V. (2016). Bayesian intermittent demand forecasting for large inventories. In NeurIPS.
Vovk, V. (2013). Kernel Ridge Regression. Empirical Inference.
Wu, X., Shi, B., Dong, Y., Huang, C., & Chawla, N. V. (2019). Neural tensor factorization for temporal interaction learning (Vol. WSDM ’19, pp. 537–545). Association for Computing Machinery. https://doi.org/10.1145/3289600.3290998
Xu, J., Yao, Y., Tong, H., Tao, X., & Lu, J. (2015). Ice-breaking: Mitigating cold-start recommendation problem by rating comparison. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015) (pp. 3981–3987).
Yu, H., Rao, N.S., & Dhillon, I.S. (2016). Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction. NIPS.
Yu, R., Bahadori, M. T., & Liu, Y. (2014). Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in neural information processing systems (pp. 3491–3499).
Zellner, A. (1986). On Assessing Prior Distributions and Bayesian Regression Analysis With g-Prior Distributions. Basic Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti
Zhai, Y., Ong, Y., & Tsang, I. W. (2014). The emerging “big dimensionality”. IEEE Computational Intelligence Magazine, 9(3), 14–26. https://doi.org/10.1109/MCI.2014.2326099
Zhang, J., Shi, X., Zhao, S., & King, I. (2019). STAR-GCN: Stacked and Reconstructed Graph Convolutional Networks for Recommender Systems. IJCAI.
Zhang, M., Tang, J., Zhang, X., & Xue, X. (2014). Addressing cold start in recommender systems: A semi-supervised co-training algorithm (pp. 73–82). Association for Computing Machinery. https://doi.org/10.1145/2600428.2609599
Zhao, Q., Zhou, G., Adali, T., Zhang, L., & Cichocki, A. (2013). Kernelization of tensor-based models for multiway data analysis: Processing of multidimensional structured data. IEEE Signal Processing Magazine, 30(4), 137–148. https://doi.org/10.1109/MSP.2013.2255334
Zhao, Q., Zhou, G., Xie, S., Zhang, L., & Cichocki, A. (2016). Tensor ring decomposition. Preprint
Editor: Shai Shalev-Shwartz.
Appendices
Appendix A: \({\mathbb {E}}_{q}[\log {p(y_{i_1,...,i_P}|\mathbf{V}_1,...,\mathbf{V}_P,\mathbf{V}_1',...,\mathbf{V}_P')}]\) derivations
A.1 Weighted latent regression
Assuming that \(\sigma _y\) is a constant hyperparameter, we first have that
Our goal is to find the corresponding tensor operations of the sum terms.
A.1.1 Univariate meanfield
We first have that
Then the last term can be written as
where \(\mathbf{M}_p',\mathbf{M}\) and \(\varSigma ',\varSigma\) correspond to the tensors containing the variational parameters \(\mu _{r_{p}i_pr_{p-1}}'\), \(\mu _{r_{p}i_pr_{p-1}}\) and \(\sigma _{r_{p}i_pr_{p-1}}'^2\), \(\sigma _{r_{p}i_pr_{p-1}}^2\) respectively. For the case of RFFs in dual space, we approximate \(\varSigma _p \times _2 (\mathbf{K}_p)^2 \approx \varSigma _p \times _2 (\varPhi _p \bullet \varPhi _p)^{\top } \times _2 (\varPhi _p \bullet \varPhi _p)\), where \(\bullet\) is the transposed Khatri-Rao product. It should further be noted that any squared term means elementwise squaring. With the middle term given by
the full reconstruction term is expressed as:
A.1.2 Multivariate meanfield
Similarly first observe that
It then follows that the third term is calculated as
where \(\varSigma _p = \mathbf{B}_p \mathbf{B}_p^T\), \({\mathbf {1}}\) denotes a constant one tensor with the same dimensions as \(\varSigma _p'\), \({\bar{1}}\in \mathbf{\mathbb {R}}^{R\times 1}\) where R is the column dimension of \(\mathbf{B}_p\), and \(\varSigma _p'\) is the same as in the univariate case. For RFFs we have that
As the middle term remains the same as in the univariate case, our full reconstruction term is
A.2 Latent scaling
Assuming that \(\sigma _y\) is a constant hyperparameter, we first have that
A.2.1 Univariate case
For the univariate case, the squared component of the reconstruction term in the ELBO becomes
To see this, we notice that all of the sums from the weighted latent regression case reappear here as well. The entire expression is given by
A.2.2 Multivariate case
For the multivariate case, the last term becomes
Again, we notice that all of the sums from the weighted latent regression case reappear here as well. The full expression is given by
A.3 Forecasting
We derive an expression for the forecasting term for the WLR case
We first establish that
Consequently,
where \([\cdot ]_\tau\) denotes slicing in the temporal dimension at index \(\tau\). We then calculate the first term
For univariate meanfield \(\mathbf{V},\mathbf{V}'\), taking the expectation we arrive at
If we instead consider a multivariate meanfield \(\mathbf{V},\mathbf{V}'\), the expression becomes
For the third expression, we take
The tensor expression for Expression 1 is given by
This follows directly from equation 76. Expression 2 can be written as
Here the idea is to reduce the memory footprint by calculating a slightly erroneous expression and removing the error, rather than calculating the exact expression directly, due to the difficulty of expressing the cross terms \(w_kw_f\). The above derivation extends directly to the LS case by simply removing the \(\mathbf{V}'\) variables and calculating the forecasting term for each component \(\mathbf{V}_s,\mathbf{V}, \mathbf{V}_b\).
Appendix B: Derivation for WLR regularization term
B.1 RFF regularization term
The regularization term can similarly be generalized to
Appendix C: Complexity proofs
We recall the theorem presented in Sect. 3.11.
Theorem 2
KFT has computational complexity
and memory footprint \({\mathcal {O}}\left( P\cdot \left( \max _{p}\left( n_p r_p r_{p-1}\right) + \max _{p}\left( n_p c_p\right) \right) \right)\) for a gradient update on a batch of data. In the dual case, we take \(c_p=n_p\).
Proof
When training on batches of data, the reconstruction for each datapoint is calculated by sequentially multiplying slices \(\mathbf{V}_p(i_p),\mathbf{V}_p'(i_p)\in {\mathbb {R}}^{r_{p-1} \times r_p}\) from each TT-core. We start with the frequentist model for WLR and LS. We list the operations needed:
1. Any \(\mathbf{V}\times _{p+1}\mathbf{D}_p\) operation. Complexity: \({\mathcal {O}}\left( \max _{p}\left( {n_p c_p r_p r_{p-1}}\right) \right)\). Memory: \({\mathcal {O}}\left( \max _{p}\left( {n_pr_p r_{p-1}}\right) \right)\) + \({\mathcal {O}}\left( \max _{p}\left( n_pc_p\right) \right)\).
2. Any \(\mathbf{V}_p \circ \mathbf{V}_p'\) operation. Complexity: \({\mathcal {O}}\left( \max _{p}\left( {r_p r_{p-1}}\right) \right)\). Memory: \({\mathcal {O}}\left( \max _{p}\left( {r_p r_{p-1}}\right) \right)\).
3. Reconstructing the tensor \(\prod ^P_{p=1}\times _{-1}\mathbf{V}_p\). Complexity: \({\mathcal {O}}\left( \left( \max _{p}r_p\right) ^P \right)\). Memory: \({\mathcal {O}}\left( P\cdot \max _{p}\left( {r_p r_{p-1}}\right) \right)\).
For the Bayesian case, we have the following additional operations in the multivariate case:
1. \((\mathbf{K}_p \cdot \mathbf{B}_p)^2\cdot {\bar{1}}\) component. Complexity: \({\mathcal {O}}\left( \max _{p}\left( n_p^2 k_p \right) \right)\). Here we take \(\mathbf{B}_p\in {\mathbb {R}}^{n_p \times k_p}\) with \(k_p \ll n_p\). Memory: \({\mathcal {O}}\left( \max _{p}\left( n_p^2\right) \right)\).
2. Prior determinant \(\det \left( \varSigma _P\right)\). Complexity: \({\mathcal {O}}(\max _p(M_p^2))\), assuming we use RFFs when \(n_p\) is large. When using RFFs, we expect \(M_p \ll N_p\). Memory: \({\mathcal {O}}(\max _p(n_p\cdot M_p))\).
3. Variational determinant \(\det \left( \varSigma _Q\right)\). Complexity: \({\mathcal {O}}(\max _p(n_p))\). Memory: \({\mathcal {O}}(\max _p(n_p^2))\).
The dominating terms for complexity and memory are \({\mathcal {O}}\left( \max _{p}\left( {n_p c_p r_p r_{p-1}}\right) + \left( \max _{p}r_p\right) ^P \right)\) and \({\mathcal {O}}\left( P\cdot \left( \max _{p}\left( n_p r_p r_{p-1}\right) + \max _{p}\left( n_p c_p\right) \right) \right)\) respectively. \(\square\)
Appendix D: Interpretability example analysis
As a guiding example of utilizing the interpretability of KFT, we analyse the temporal component of WLR applied to the alcohol dataset in the primal setting. We visualize the weights for the time component and in particular the “year” covariate.
Applying an LS model instead, we can directly decompose the predictions into scaling contribution, regression contribution and bias contribution (Fig. 7; Table 11).
Appendix E: Training procedure
E.1 Frequentist
To motivate the sequential updating scheme, consider a data matrix \(\frac{\mathbf{X}}{\sigma }\in \mathbf{\mathbb {R}}^{N\times d}\), where \(\sigma\) is a hyperparameter that controls the scaling of \(\mathbf{X}\), and a target \(\mathbf{Y}\in \mathbf{\mathbb {R}}^{N}\). Assume N is so large that we have to resort to stochastic first order gradient methods to approximate \(\mu \in \mathbf{\mathbb {R}}^d\) in our regression \(\big(\frac{\mathbf{X}}{\sigma }\big)\mu \approx \mathbf{Y}\). Using autograd (Paszke et al. 2017), together with ADAM (Kingma and Ba 2014), we can in practice optimize \(\{\sigma ,\mu \}\) simultaneously in each iteration. However, doing this we commit a fallacy: when updating \(\mu _t = \mu _{t-1}-\frac{\partial {\mathcal {L}}(\sigma _{t-1},\mu _{t-1})}{\partial \mu _{t-1}}\) and \(\sigma _t = \sigma _{t-1}-\frac{\partial {\mathcal {L}}(\sigma _{t-1},\mu _{t-1})}{\partial \sigma _{t-1}}\), we do not account for the mixed partial \(\frac{\partial }{\partial \sigma _{t-1}}\frac{\partial {\mathcal {L}}(\sigma _{t-1},\mu _{t-1})}{\partial \mu _{t-1}} \propto \frac{-1}{\sigma _{t-1}^3}\), assuming \({\mathcal {L}}\) is the mean square error loss. Thus updating \(\{\sigma ,\mu \}\) simultaneously using first order derivatives yields an update error for \(\sigma\), as the updating gradient does not adjust for the mixed partial \(\frac{\partial }{\partial \sigma _{t-1}}\frac{\partial {\mathcal {L}}(\sigma _{t-1},\mu _{t-1})}{\partial \mu _{t-1}}\) when \(\mu\) is being updated at the same time. This scenario extends one-to-one to the variables of KFT, as we would commit a similar fallacy by updating all parameters at once. Hence we take an EM-inspired approach when updating \(\{l_p,\mathbf{V}_p,\mathbf{V}_p'\}\).
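The toy regression above, solved with the EM-inspired sequential updates, can be sketched as follows (plain NumPy gradient descent instead of autograd + ADAM; learning rate and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
mu_true = np.array([1.0, -2.0, 0.5])
Y = (X / 2.0) @ mu_true            # data generated with sigma = 2

mu, sigma, lr = np.zeros(3), 1.0, 0.05

def loss(sigma, mu):
    return np.mean(((X / sigma) @ mu - Y) ** 2)

for _ in range(200):
    # Step 1: update mu with sigma held fixed ...
    resid = (X / sigma) @ mu - Y
    mu = mu - lr * 2.0 * (X / sigma).T @ resid / len(Y)
    # Step 2: ... then sigma with mu held fixed at its refreshed value,
    # the EM-inspired sequential scheme motivated above.
    resid = (X / sigma) @ mu - Y
    sigma = sigma - lr * np.mean(2.0 * resid * (-(X @ mu) / sigma ** 2))

print(loss(sigma, mu) < 0.1)
```

Each parameter group sees a gradient computed against the most recent value of the other group, rather than both moving against a stale joint state.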
E.2 Bayesian: variational inference
As the practical utility of calibrated uncertainty estimates relies on a good fit of the model, we motivate our sequential update scheme as a method to encourage good predictive performance by first finding the optimal modes, determined by \(\mathbf{M},\mathbf{M}'\), and then the associated variances, determined by \(\varSigma ,\varSigma '\). As we are considering the Gaussian meanfield family of models, this strategy is well motivated since the Gaussian distribution is symmetric.
Appendix F: Data and data processing
F.1 Data processing
Our strategy in this paper is to limit data processing to simple operations that do not require excessive engineering, for a fair comparison in both utility and input data. For all models, we carry out the exact same preprocessing, modulo each model's required data format. We apply the following general processing steps:
1. Extract relevant features and parse them to be continuous or categorical.
2. Scale all features using a z-transformation.
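A minimal sketch of the z-transformation applied to each continuous feature (NumPy; the population standard deviation is a simplifying assumption here):

```python
import numpy as np

def z_transform(col):
    """Z-transformation of one feature column: zero mean, unit standard
    deviation (population std; a minimal sketch of step 2 above)."""
    return (col - col.mean()) / col.std()

prices = np.array([9.99, 14.50, 3.20, 27.80])
z = z_transform(prices)
print(abs(z.mean()) < 1e-12, abs(z.std() - 1.0) < 1e-12)
```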
For each model specifically we do the following:
1. KFT: tensorize all data by expressing all main modes (i.e. person, movie, time, etc.) as a tensor, with side information associated with each mode.
2. LightGBM: here, we don't scale the features, as boosting trees generally perform better with unscaled data. In some cases we applied PCA to some of the side information joined onto the data matrix, to keep the memory footprint of the data matrix at a practical size.
3. FFM: here we bin all continuous features, as FFM requires all data to be categorical.
4. Linear regression: for large categorical features, we use feature hashing to avoid data matrices of infeasible sizes. All other categorical features are one-hot encoded.
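The hashing trick for large categorical features can be sketched as follows (stdlib only; the bucket count and the md5 choice are illustrative assumptions, not the paper's configuration):

```python
import hashlib

def hash_feature(value, n_buckets=2 ** 18):
    """Map a high-cardinality categorical value to a fixed-size index space,
    avoiding a one-hot matrix of infeasible width. md5 keeps the mapping
    deterministic across processes (illustrative choice)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Identical categories always land in the same bucket ...
assert hash_feature("user_123456") == hash_feature("user_123456")
# ... while the index space stays bounded regardless of cardinality.
print(0 <= hash_feature("user_123456") < 2 ** 18)
```

The trade-off is occasional collisions between distinct categories, accepted in exchange for a bounded design matrix.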
F.2 Retail sales data
We detail the features of the Retail Sales data in Table 12. Here we choose our modes to be store, articles and time.
F.3 Movielens-20M
We detail the features of the Movielens-20M data in Table 13. Here we choose our modes to be users, movies, and time of rating. For the Movielens-20M data, it should be noted that we filter the movies on existing entries in the side information. This is why we only have roughly 11 million observations rather than 20 million.
F.4 Alcohol sales
We detail the features of the Alcohol Sales data in Table 14. Here we choose our modes to be location, item and time.
Appendix G: Hyperparameter configuration
G.1 KFT hyperparameters
We run all KFT experiments for 10 epochs with 20 hyperparameter search iterations. We consider two decomposition types:
1. P-way latent factorization, where each dimension has a latent component. In principle, this can be thought of as each dimension being independently factorized. Further, we utilize all possible side information.
2. 2-way latent factorization, where time is grouped with another dimension as one latent component and the remaining dimensions are grouped into a second latent component. Here we only consider time as side information, for the purpose of modelling only temporal changes.
Our models are generally searched over the configurations described in Table 15. For exact details we refer to the code base.
G.1.1 LightGBM hyperparameters
We provide hyperparameters for LightGBM in Table 16.
G.1.2 FFM hyperparameters
We provide hyperparameters for FFM in Table 17.
G.1.3 Linear regression hyperparameters
We provide hyperparameters for linear regression in Table 18.
G.2 Bayesian hyperparameters
We run all experiments for 25 epochs with 5 hyperparameter search iterations. For exact details we refer to the code base.
Hu, R., Nicholls, G.K. & Sejdinovic, D. Large scale tensor regression using kernels and variational inference. Mach Learn 111, 2663–2713 (2022). https://doi.org/10.1007/s10994-021-06067-7