1 Introduction and related work

In recent times, industrial prediction problems (Caro and Gallien 2010; Seeger et al. 2016) are not only large scale but also high dimensional (Zhai et al. 2014). Problems of this nature are ubiquitous; the most common setting, although by no means the only one, is recommendation systems (Bobadilla et al. 2013). Here we are tasked with predicting user preference (e.g. ratings, buying propensity) for a product (movies, clothing, etc.). More often than not, the number of users and products far exceeds what a dense data-matrix representation can practically handle, which characterizes the perils of high dimensionality in modern prediction problems. The choice of models is often limited to boosting models such as LightGBM (Ke et al. 2017), factorization machines (Juan et al. 2016; Rendle 2010) and matrix (Bobadilla et al. 2013) and tensor factorization models (Kolda and Bader 2009; Oseledets 2011). In particular, matrix/tensor factorization models are often used due to their memory-efficient representation of high dimensional data (Kuang et al. 2014). An additional benefit of factorization machines and matrix/tensor factorization models is their relative ease of extension to a Bayesian formulation, making uncertainty quantification straightforward.

A recurring problem in recommendation systems is the cold start problem (Bobadilla et al. 2012), where new users yield inaccurate predictions due to an absence of historical purchasing behavior. A common technique to overcome the cold start problem is to incorporate side information (Agarwal and Chen 2009; Xu et al. 2015; Zhang et al. 2014), which means adding descriptive covariates about each individual user (or product) to the model. While side information is not immediately applicable to factorization machines, there is extensive literature (Kim et al. 2016; Liu et al. 2019; Narita et al. 2012) on utilizing side information with matrix/tensor factorization models. There are limited choices of models when simultaneously considering scalability, uncertainty quantification, the cold start problem, and, ideally, interpretability (Rudin 2018). We contextualize our contribution by reviewing related works.

Tensor factorization is a generalization of matrix factorization to n dimensions: any technique that applies to tensors is directly applicable to matrices, but not vice versa. We focus on tensor factorization as it offers the most flexibility and generality. In the frequentist setting, Canonical Polyadic (CP) and Tucker decomposition (Kolda and Bader 2009) are the most common factorization methods for tensors, while Tensor-Train (TT) (Oseledets 2011) and Tensor-Ring (TR) (Zhao et al. 2016) decomposition are newer additions focusing on scalability. While Tucker decomposition has admirable analytical properties, the \({\mathcal {O}}(r^d)\) memory storage of the core tensor is infeasible in any large-scale application. CP decomposition is superseded by TT-decomposition, which in turn is extended by TR-decomposition. Due to certain pathologies (Batselier 2018) exhibited by TR, only TT remains plausible for a large-scale model. The existing methods LightGBM (Ke et al. 2017) and Field-Aware Factorization Machines (FFM) (Juan et al. 2016) are considered the gold standard of large scale prediction, and an overall objective of this paper is to challenge this duopoly. To enhance the performance of tensor models, Du et al. (2016), Wu et al. (2019) and Zhang et al. (2019) applied neural methods for matrix and tensor factorization, with Du et al. (2016) and Zhang et al. (2019) considered the state of the art in performance. In the Bayesian domain, only FFM carries over, with Saha et al. (2017) introducing a variational coordinate ascent method for factorization machines. There is rich literature on Bayesian matrix/tensor factorization, ranging from Monte Carlo methods in the pioneering work of Salakhutdinov and Mnih (2007) to variational matrix factorization (Gönen et al. 2013; Kim and Choi 2014) and variational tensor factorization (Hawkins and Zhang 2018). Kim and Choi (2014) is of particular interest as they present a large-scale variational model incorporating side information. While the coordinate ascent approaches proposed in Hawkins and Zhang (2018), Kim and Choi (2014) and Saha et al. (2017) are useful, we believe that the cyclical update scheme can be constraining for large scale scenarios that rely on parallelism. Further, the updating rules are hard to maintain as the model architecture changes. To avoid maintaining complicated gradient updates by hand, we instead rely on automatic differentiation (Paszke et al. 2017b). In the context of maintainable variational inference, it then suffices to find an analytical expression of the Evidence Lower BOund (ELBO) that is general for a family of models (Hoffman et al. 2013).

Side information is applied in two ways for matrix/tensor factorization models: implicitly, through regularization schemes, in He et al. (2017), Narita et al. (2012), Pal and Jenamani (2018) and Zhao et al. (2013), and directly, as covariates, in Agarwal and Chen (2009), Kim et al. (2016), Kim and Choi (2014) and Zhang et al. (2014). In terms of interpretability, the latter is more desirable as predictions become an explicit expression of covariates, allowing for direct attribution analysis. A concerning observation from the results in Agarwal and Chen (2009), Kim et al. (2016) and Kim and Choi (2014) is that using side information in covariate form barely improves performance, and in the “features only” model of Agarwal and Chen (2009) it even worsens it.

In this paper we develop a novel all-purpose large-scale prediction model that strives for a level of versatility existing models lack. Our contribution, Kernel Fried Tensor (KFT), aims to bridge this gap in the literature and answer the following questions:

1. Is there an interpretable tensor model that avoids constraints and complex global dependencies arising from the addition of side information but still makes full use of side information?

2. Can we formulate and characterize this model class in both primal and dual (Reproducing Kernel Hilbert Space) space?

3. Do models in this class compare favourably, for large scale prediction, to state-of-the-art models such as LightGBM (Ke et al. 2017), FFM (Juan et al. 2016) and existing factorization models?

4. Can we work with these new models in a scalable Bayesian context with calibrated uncertainty estimates?

5. What are the potential pitfalls of using these models? When are they appropriate to use?

Together, these questions frame the multifaceted problem we address: developing a tensor model that scales, is interpretable, is Bayesian, handles side information, and performs on par with the existing gold standard.

The rest of the paper is organized as follows: Sect. 2 illustrates an important limitation of side information applied as covariates to tensors, Sect. 3 introduces and characterizes KFT in both the frequentist and Bayesian settings, Sect. 4 presents experiments, and Sect. 5 provides an ablation study related to Question 5 (Table 1).

Table 1 Where our contribution places

2 Background and problem

Tensor models are used in large-scale prediction tasks ranging from bioinformatics to industrial prediction; we work in the latter setting. While existing tensor models are versatile and scalable, they have a flaw: when the latent factors in tensor factorization are modelled as a function of covariates (Agarwal and Chen 2009; Kim et al. 2016; Kim and Choi 2014; Zhang et al. 2014), global parameter couplings are generated that restrict the model and reduce tractability at scale. We provide a new family of tensor models which admit side information without model restriction or loss of tractability.

Tensor Train decomposition Before we explain how side information restricts tensor model expressiveness, we set out the background. Consider the task of reconstructing the tensor \(\mathbf{Y}\in {\mathbb {R}}^{n_1 \times n_2 \cdots \times n_P}\). Many existing decomposition techniques (Kolda and Bader 2009) treat this problem. We focus on the Tensor Train (TT) decomposition (Oseledets 2011) as this generalises more readily than existing alternatives.

The n-mode (matrix) product of a tensor \({\mathbf {X}} \in {\mathbb {R}}^{I_1 \times I_2 \times \cdots \times I_N}\) with a matrix \({\mathbf {U}} \in {\mathbb {R}}^{J \times I_n}\) is \({\mathbf {X}} \times _n {\mathbf {U}}\). This product is of size \(I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N\). Elementwise, the product is

$$\begin{aligned} \left( {\mathbf {X}} \times _n {\mathbf {U}} \right) _{i_1\ldots i_{n-1}\ j\ i_{n+1} \ \ldots \ i_N} = \sum _{i_n=1}^{I_n}x_{i_1i_2\ldots i_N }\ u_{j i_n}. \end{aligned}$$
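As a concrete illustration of the definition above, the following is a minimal NumPy sketch of the n-mode product; the function name and the tiny shapes are illustrative only.

```python
import numpy as np

def mode_n_product(X, U, n):
    """n-mode product X x_n U: contracts mode n of X (size I_n) with the
    columns of U (shape J x I_n) and places the new axis of size J at position n."""
    # tensordot contracts X's axis n with U's axis 1 and appends the new axis
    # of size J at the end; moveaxis puts it back at position n.
    out = np.tensordot(X, U, axes=([n], [1]))
    return np.moveaxis(out, -1, n)

# Tiny check against the elementwise definition.
X = np.random.randn(2, 3, 4)   # I_1 = 2, I_2 = 3, I_3 = 4
U = np.random.randn(5, 3)      # J = 5 acting on mode n = 1 (I_2 = 3)
Y = mode_n_product(X, U, 1)    # shape (2, 5, 4)
assert np.allclose(Y[1, 2, 3], sum(X[1, i, 3] * U[2, i] for i in range(3)))
```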

We will also use the following notion of a mode product with a tensor: for \(\mathbf{X}\in \mathbf{\mathbb {R}}^{I_1 \times \cdots \times I_{N-1}\times I_{N}}\), applying the N-th mode product \(\times _N\) with a tensor \(\mathbf{U}\in \mathbf{\mathbb {R}}^{I_N\times K_1 \times K_2}\) gives \(\mathbf{X}\times _N \mathbf{U}\in \mathbf{\mathbb {R}}^{I_1 \times \cdots \times I_{N-1}\times K_1 \times K_2}\).

In TT, \(\mathbf{Y}\) is decomposed into P latent tensors \(\mathbf{V}_p\in \mathbf{\mathbb {R}}^{R_{p} \times n_p \times R_{p-1} }, p=1,\ldots ,P\), with \(\mathbf{V}_1 \in \mathbf{\mathbb {R}}^{R_1\times n_1\times 1}\) and \(\mathbf{V}_P \in \mathbf{\mathbb {R}}^{1 \times n_P \times R_{P-1}}\). Here \(R_p\) is the latent dimensionality for each factor in the decomposition, with boundary ranks \(R_0=R_P=1\). Let \(\times _{-1} \mathbf{V}_p\) be the operation of applying the mode product to the last dimension of \(\mathbf{V}_p\). We seek \({\mathbf {V}}_1\ldots {\mathbf {V}}_P\) so that

$$\begin{aligned}&\mathbf{Y}\approx \prod _{p=1}^P \times _{-1} \mathbf{V}_p\nonumber \\&y_{i_1...i_P} \approx \sum _{r_0\ldots r_P} \prod _{p=1}^P v_{r_{p}i_p r_{p-1}}. \end{aligned}$$
(1)
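A minimal sketch of the reconstruction in Eq. (1), assuming cores stored with the shape convention \(\mathbf{V}_p\in \mathbf{\mathbb {R}}^{R_{p} \times n_p \times R_{p-1}}\) and boundary ranks equal to 1; all names and sizes are illustrative.

```python
import numpy as np

def tt_reconstruct(cores):
    """Rebuild Y from TT cores V_p of shape (R_p, n_p, R_{p-1}), R_0 = R_P = 1."""
    res = cores[0][:, :, 0].T                            # (n_1, R_1)
    for core in cores[1:]:
        # contract the shared rank index r_{p-1}, then reorder to (..., n_p, R_p)
        res = np.tensordot(res, core, axes=([-1], [2]))  # (..., R_p, n_p)
        res = np.moveaxis(res, -2, -1)                   # (..., n_p, R_p)
    return res[..., 0]                                   # drop R_P = 1

# P = 3 modes of sizes (4, 5, 6) with ranks R_1 = 3, R_2 = 2.
V1, V2, V3 = np.random.randn(3, 4, 1), np.random.randn(2, 5, 3), np.random.randn(1, 6, 2)
Y = tt_reconstruct([V1, V2, V3])                         # shape (4, 5, 6)

# Spot-check one entry against the elementwise sum in Eq. (1).
brute = sum(V1[r1, 1, 0] * V2[r2, 2, r1] * V3[0, 3, r2]
            for r1 in range(3) for r2 in range(2))
assert np.isclose(Y[1, 2, 3], brute)
```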

Suppose that, associated with dimension p, we have \(c_p\)-dimensional side information denoted \(\mathbf{D}_p \in \mathbf{\mathbb {R}}^{n_p \times c_p}\). For example, if p is the dimension representing \(n_p=10{,}000\) different books, then the columns of \(\mathbf{D}_p \in \mathbf{\mathbb {R}}^{10{,}000\times c_p}\) might contain the author of the book, its page count, etc. Similar to Kim et al. (2016) and Kim and Choi (2014), side information is built into the second dimension of the latent tensor \(\mathbf{V}_p\in \mathbf{\mathbb {R}}^{R_{p} \times c_p \times R_{p-1}}\) using the mode product \(\mathbf{V}_p \times _{2} \mathbf{D}_p\). It should be noted that the middle dimension of \(\mathbf{V}_p\) changed from \(n_p\) to \(c_p\) to accommodate the dimensionality of the side information. For TT decomposition our approximation becomes

$$\begin{aligned}&\mathbf{Y}\approx \prod _{p=1}^P \times _{-1} (\mathbf{V}_p\times _{2} \mathbf{D}_p)\nonumber \\&y_{i_1...i_P} \approx \sum _{\begin{array}{c} r_0\ldots r_P\\ i_1'\ldots i_P' \end{array}} \prod _{p=1}^P v_{r_{p}i_p'r_{p-1}}d_{i_p,i_p'} \end{aligned}$$
(2)

The above example illustrates the primal setting, where side information is applied directly. Similarly to Kim et al. (2016), we also consider kernelized side information in a Reproducing Kernel Hilbert Space (RKHS), which we will refer to as the dual setting.

2.1 The problem with adding side information to tensor factorization

Consider a matrix factorization problem for \(\mathbf{Y}\in \mathbf{\mathbb {R}}^{n_1\times n_2}\) with unknown latent factors \(\mathbf{U}\in \mathbf{\mathbb {R}}^{n_1\times R},\mathbf{V}\in \mathbf{\mathbb {R}}^{n_2\times R}\). We are approximating

$$\begin{aligned} \mathbf{Y}= & {} \begin{bmatrix} y_{11} &{} \dots &{} y_{1n_2} \\ \vdots &{} \ddots &{} \\ y_{n_11} &{} &{} y_{n_1n_2} \end{bmatrix} \approx \mathbf{U}\cdot \mathbf{V}^{\top } = \begin{bmatrix} \sum _{r=1}^R u_{1r}v_{1r} &{} \dots &{}\sum _{r=1}^R u_{1r}v_{n_2r} \\ \vdots &{} \ddots &{} \\ \sum _{r=1}^R u_{n_1r}v_{1r} &{} &{} \sum _{r=1}^R u_{n_1r}v_{n_2r} \end{bmatrix}. \end{aligned}$$
(3)

If we update \(u_{1r}\) in the approximation \(y_{11}\approx \sum _{r=1}^R u_{1r}v_{1r}\), we change the approximation \(y_{1,2}\approx \sum _{r=1}^R u_{1r}v_{2r}\) since they share the parameter \(u_{1r}\). However, \(y_{2,1}\) and \(y_{2,2}\) remain unchanged. Parameters are coupled across rows and columns but not globally. This is the standard setup in latent factorization.

Now consider the case where \(\mathbf{D}_1 = {\mathbf {1}}_{n_1\times n_1}\) and \(\mathbf{D}_2= {\mathbf {1}}_{n_2\times n_2}\) are all-ones matrices. Taking our latent factors to be a linear function of the available side information, \(\mathbf{U},\mathbf{V}\) enter as \(\mathbf{D}_1\mathbf{U}= \begin{bmatrix} \sum _{i=1}^{n_1} u_{i1} &{} \dots &{} \sum _{i=1}^{n_1} u_{iR} \\ \vdots &{} \ddots &{} \\ \sum _{i=1}^{n_1} u_{i1} &{} &{} \sum _{i=1}^{n_1} u_{iR} \end{bmatrix}\) and \(\mathbf{D}_2\mathbf{V}\) (of similar form). It follows that

$$\begin{aligned} \begin{aligned}&(\mathbf{D}_1\mathbf{U}) \cdot (\mathbf{D}_2\mathbf{V})^{\top } = \begin{bmatrix} \sum _{rij} u_{ir} v_{jr} &{} \dots &{} \sum _{rij} u_{ir} v_{jr} \\ \vdots &{} \ddots &{} \\ \sum _{rij} u_{ir} v_{jr} &{} &{} \sum _{rij} u_{ir} v_{jr} \end{bmatrix}, \end{aligned} \end{aligned}$$
(4)

is a constant matrix! We have lost all model flexibility as we are approximating \(\mathbf{Y}\) with a constant. Now consider a more realistic example with \(\mathbf{D}_1 = \begin{bmatrix} d_{11} &{} \dots &{} d_{1n_1} \\ \vdots &{} \ddots &{} \\ d_{n_11} &{} &{} d_{n_1n_1} \end{bmatrix}\) and \(\mathbf{D}_2 =\begin{bmatrix} z_{11} &{} \dots &{} z_{1n_2} \\ \vdots &{} \ddots &{} \\ z_{n_21} &{} &{} z_{n_2n_2} \end{bmatrix}\). In this case

$$\begin{aligned} \begin{aligned}&(\mathbf{D}_1\mathbf{U}) \cdot (\mathbf{D}_2\mathbf{V})^{\top } = \begin{bmatrix} \sum _{rij} d_{1i}z_{1j}u_{ir} v_{jr} &{} \dots &{} \sum _{rij} d_{1i}z_{n_2j}u_{ir} v_{jr} \\ \vdots &{} \ddots &{} \\ \sum _{rij} d_{n_1i}z_{1j}u_{ir} v_{jr} &{} &{} \sum _{rij} d_{n_1i}z_{n_2j}u_{ir} v_{jr} \end{bmatrix}. \end{aligned} \end{aligned}$$
(5)

Again, \(u_{ir}\) appears in all entries of our matrix approximation for \(\mathbf{Y}\). However, this time changing \(u_{ir}\) does not change all entries by the same amount; instead, each entry changes by a different amount depending on the entries of \(\mathbf{D}_1\) and \(\mathbf{D}_2\). This connects all entries in the approximating matrix, introduces complex global variable dependence, and makes fitting infeasible at large scale as every optimization update becomes global. The observation applies in both primal and dual representations. In the primal representation, expressiveness is additionally restricted, as the rank of our approximation is capped by the rank of the side information. In this setting, near-collinearity is also a problem, as it leads to unstable optimization and factorizations which are very sensitive to noise.
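The following minimal NumPy sketch illustrates the two cases numerically, assuming all-ones and then generic side information; shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, R = 6, 5, 3
U, V = rng.standard_normal((n1, R)), rng.standard_normal((n2, R))

# Constant (uninformative) side information: Eq. (4) collapses to a constant.
D1, D2 = np.ones((n1, n1)), np.ones((n2, n2))
approx = (D1 @ U) @ (D2 @ V).T
print(np.ptp(approx))   # ~0: every entry equals sum_{rij} u_ir v_jr

# Generic side information: the fit is no longer constant, but every entry
# now depends on every u_ir and v_jr (global coupling, Eq. (5)).
D1, D2 = rng.standard_normal((n1, n1)), rng.standard_normal((n2, n2))
approx = (D1 @ U) @ (D2 @ V).T
```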

We see that when we add side information we may inadvertently restrict the expressiveness of our model. We formulate a new tensor model with a range and dependence structure unaffected by the addition of side information in either the primal or dual setting.

3 Proposed approach

3.1 Tensors and side information

We seek a tensor model that benefits from additional side information while not forfeiting model flexibility. We introduce two strategies, which we call weighted latent regression (WLR) and latent scaling (LS).

3.1.1 Weighted latent regression

We now return to the previous setting but with additional latent tensors \(\mathbf{U}'\in \mathbf{\mathbb {R}}^{n_1 \times R}\) and \(\mathbf{V}'\in \mathbf{\mathbb {R}}^{n_2 \times R}\). We approximate \(\mathbf{Y}\) as:

$$\begin{aligned}&\mathbf{Y}\approx ((\mathbf{D}_1\mathbf{U}) \circ \mathbf{U}') \cdot ((\mathbf{D}_2\mathbf{V}) \circ \mathbf{V}')^{\top } \nonumber \\&\quad =\tiny { \begin{bmatrix} \sum _{rij}u_{1r}'v_{1r}' d_{1i}z_{1j}u_{ir} v_{jr} &{} \dots &{} \sum _{rij} u_{1r}'v_{n_2r}'d_{1i}z_{n_2j}u_{ir} v_{jr} \\ \vdots &{} \ddots &{} \\ \sum _{rij}u_{n_1r}'v_{1r}' d_{n_1i}z_{1j}u_{ir} v_{jr} &{} &{} \sum _{rij}u_{n_1r}'v_{n_2r}' d_{n_1i}z_{n_2j}u_{ir} v_{jr} \end{bmatrix}} \end{aligned}$$
(6)

By taking the Hadamard product (\(\circ\)) with the additional tensors \(\mathbf{U}'\), \(\mathbf{V}'\) we recover the model flexibility and dependence structure of vanilla matrix factorization, as \(\mathbf{U}'\) and \(\mathbf{V}'\) are independent of \(\mathbf{U}\) and \(\mathbf{V}\). Changing \(u_{ir}\) would still imply a change for all entries by magnitudes defined by the side information; however, we can calibrate these changes on a latent entrywise level by scaling each entry with \(u_{ir}'v_{jr}'\). For any TT decomposition with an additional tensor \(\mathbf{V}_p'\), the factorization becomes

$$\begin{aligned}&\mathbf{Y}\approx \prod _{p=1}^P \times _{-1} ( \mathbf{V}_p' \circ (\mathbf{V}_p\times _{2} \mathbf{D}_p))\nonumber \\&y_{i_1...i_P} \approx \sum _{\begin{array}{c} r_1\ldots r_P\\ i_1'\ldots i_P'\\ i_1''\ldots i_P'' \end{array}} \prod _{p=1}^P v_{r_{p}i_p''r_{p-1}}'v_{r_{p}i_p'r_{p-1}}d_{i_p,i_p'}\delta _{i_p}(i_p'') \end{aligned}$$
(7)

where \(\delta (\cdot )\) denotes the Kronecker delta. We interpret this as weighting the regression terms \(\sum _{i,j}d_{pi}z_{qj}u_{ir}v_{jr}\) over indices pq with \(u_{pr}'v_{qr}'\) and then summing over latent indices r.
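A minimal sketch of the matrix case of Eq. (6), assuming random side information; it illustrates how the Hadamard weights \(\mathbf{U}'\), \(\mathbf{V}'\) retain flexibility even when the side information is uninformative.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, R = 6, 5, 3
U,  V  = rng.standard_normal((n1, R)), rng.standard_normal((n2, R))    # latent factors
Up, Vp = rng.standard_normal((n1, R)), rng.standard_normal((n2, R))    # WLR weights U', V'
D1, D2 = rng.standard_normal((n1, n1)), rng.standard_normal((n2, n2))  # side information

# Eq. (6): Y ~ ((D1 U) o U') ((D2 V) o V')^T, with o the Hadamard product.
Y_hat = ((D1 @ U) * Up) @ ((D2 @ V) * Vp).T

# Even with all-ones side information, D1 U and D2 V become constant but the
# entrywise weights U', V' still provide a full factorization of Y.
Y_const_side = ((np.ones((n1, n1)) @ U) * Up) @ ((np.ones((n2, n2)) @ V) * Vp).T
```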

Interpretability We can decompose the estimate into weights of side information, as illustrated in Fig. 1. A guiding example on analyzing the latent factors is provided in Appendix C.

Fig. 1 Decomposing the estimate into interpretable weights

3.1.2 Latent scaling

An alternative, computationally cheaper, procedure is to consider additional latent tensors \(\mathbf{U}_1\in \mathbf{\mathbb {R}}^{n_1\times K},\mathbf{U}_2\in \mathbf{\mathbb {R}}^{n_1\times L},\mathbf{V}_1\in \mathbf{\mathbb {R}}^{n_2\times K},\mathbf{V}_2\in \mathbf{\mathbb {R}}^{n_2\times L}\). We then approximate

$$\begin{aligned}&\mathbf{Y}\approx (\mathbf{U}_1\mathbf{V}_1^{\top })\circ ((\mathbf{D}_1\mathbf{U}) \cdot (\mathbf{D}_2\mathbf{V})^{\top }) + (\mathbf{U}_2\mathbf{V}_2^{\top }) \nonumber \\&\quad =\tiny {\begin{bmatrix} \sum _{k}u_{1k}^{1}v_{1k}^{1} \sum _{rij} d_{1i}z_{1j}u_{ir} v_{jr}+\sum _{l}u_{1l}^{2}v_{1l}^{2} &{} \dots &{} \\ \vdots &{} \ddots &{} \\ \sum _{k}u_{n_1k}^{1}v_{1k}^{1} \sum _{rij} d_{n_1i}z_{1j}u_{ir} v_{jr}+\sum _{l}u_{n_1l}^{2}v_{1l}^{2} &{} \end{bmatrix}} \end{aligned}$$
(8)

We have similarly regained our original model flexibility by reintroducing independence for each term: every regression term is scaled by a ’latent scale’ term (\(\mathbf{V}^s\)) and shifted by a ’latent bias’ term (\(\mathbf{V}^b\)). We generalize this to

$$\begin{aligned}&\mathbf{Y}\approx \left( \prod _{p=1}^P \times _{-1} \mathbf{V}_p^s\right) \circ \left( \prod _{p=1}^P \times _{-1} (\mathbf{V}_p\times _{2} \mathbf{D}_p)\right) +\left( \prod _{p=1}^P \times _{-1} \mathbf{V}_p^b\right) \nonumber \\&y_{i_1...i_P} \approx \sum _{\begin{array}{c} r^s_1\ldots r^s_P \end{array}}\prod _{p=1}^Pv^s_{r_{p}i_pr_{p-1}} \sum _{\begin{array}{c} r_1\ldots r_P\\ i_1'\ldots i_P' \end{array}} \prod _{p=1}^P v_{r_{p}i_p'r_{p-1}}d_{i_p,i_p'}+\sum _{\begin{array}{c} r^b_1\ldots r^b_P \end{array}}\prod _{p=1}^Pv^b_{r_{p}i_pr_{p-1}} \end{aligned}$$
(9)
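A minimal sketch of the matrix case in Eq. (8), assuming random side information; the scale \(\mathbf{U}_1\mathbf{V}_1^{\top }\) and bias \(\mathbf{U}_2\mathbf{V}_2^{\top }\) wrap the side-information regression term.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, R, K, L = 6, 5, 3, 2, 2
U,  V  = rng.standard_normal((n1, R)), rng.standard_normal((n2, R))    # regression factors
U1, V1 = rng.standard_normal((n1, K)), rng.standard_normal((n2, K))    # latent scale
U2, V2 = rng.standard_normal((n1, L)), rng.standard_normal((n2, L))    # latent bias
D1, D2 = rng.standard_normal((n1, n1)), rng.standard_normal((n2, n2))  # side information

# Eq. (8): Y ~ (U1 V1^T) o ((D1 U)(D2 V)^T) + U2 V2^T
Y_hat = (U1 @ V1.T) * ((D1 @ U) @ (D2 @ V).T) + U2 @ V2.T
```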

One may object that adding side information to tensors through a linear operation is counterproductive, given the restrictions it imposes on the approximation, and that our proposal of introducing additional tensors to restore model flexibility is futile when side information is likely to be only marginally informative or potentially uninformative. As an example, return to the case of completely non-informative constant side information, \(\mathbf{D}_1={\mathbf {1}}\), \(\mathbf{D}_2={\mathbf {1}}\). In this corner case, both our proposed models reduce to regular factorization: the side information regression term collapses to a constant, which in conjunction with the added terms reduces to regular matrix/tensor factorization without side information.

Interpretability We can decompose the estimate into weights of side information, as illustrated in Fig. 2.

Fig. 2 Decomposing the estimate into interpretable weights

A comment on identifiability It should be noted that the proposed models need not be identifiable.

To see this, return to the scenario where side information is constant. The term \(\mathbf{V}_p \times _2 \mathbf{D}_p\) then has constant rows, equal to the row sums of \(\mathbf{V}_p\); any transformation of \(\mathbf{V}_p\) which preserves these row sums leaves the fit unchanged. However, in large-scale industrial prediction the utility comes from good generalization performance, and parameter identifiability is secondary.

3.2 Primal regularization terms

In weight space regression, the regularization terms are given by the squared Frobenius norm. The total regularization term would be written as

$$\begin{aligned} \varLambda =\sum _{p} \left( \lambda _{p} \Vert \mathbf{V}_p \Vert ^2_F +\lambda _{p}' \Vert \mathbf{V}_p' \Vert ^2_F \right) \end{aligned}$$
(10)

3.3 RKHS and the representer theorem

We extend our framework to the RKHS dual space formalism, where our extension can intuitively be viewed as a tensorized version of kernel ridge regression (Vovk 2013). The merit of this is to enhance the performance by providing an implicit non-linear feature map of side information using kernels.

Firstly, consider side information \(\mathbf{D}_{p} = \{\mathbf{x}_i^p\}_{i=1}^{n_p}, \mathbf{x}_i^p \in \mathbf{\mathbb {R}}^{c_p}\), which is kernelized using a kernel function \(k:\mathbf{\mathbb {R}}^{c_p}\times \mathbf{\mathbb {R}}^{c_p}\rightarrow {\mathbb {R}}\). Denote \(k^{p}_{ij}=k(\mathbf{x}_i^p,\mathbf{x}_j^p) \in \mathbf{\mathbb {R}}\) and \(\mathbf{K}_p=k(\mathbf{D}_p,\mathbf{D}_p)\in \mathbf{\mathbb {R}}^{n_p \times n_p}\). Consider a \(\mathbf{V}\in \mathbf{\mathbb {R}}^{R_1\times n_1 \cdots \times n_Q\times R_2}\), where \(Q<P\). Using the Representer theorem (Schölkopf et al. 2001) we can express \(\prod _{q=1}^{Q}\mathbf{V}\times _{q+1} \mathbf{K}_q\) as a function in the RKHS

$$\begin{aligned} \mathsf {v}_{r_1,r_2}=\sum _{n_{1}\ldots n_q}v_{r_1n_1...n_qr_2}\prod _{q=1}^{Q} k_q(\cdot ,\mathbf{x}_{n_q}^{q}). \end{aligned}$$
(11)

where \(\mathsf {v}_{r_1,r_2}: \mathbf{\mathbb {R}}^{n_1}\times \cdots \times \mathbf{\mathbb {R}}^{n_Q} \rightarrow \mathbf{\mathbb {R}}\) and \(\mathsf {v}_{r_1,r_2} \in {\mathcal {H}}\), the RKHS induced by the product kernel \(\prod _{q=1}^Q k_q\). We use the mode product notation \(\times _{q+1}\) here to apply kernelized side information \(\mathbf{K}_q\) to each dimension of size \(n_1 \ldots n_Q\); the offset \(q+1\) accounts for the first dimension, of size \(R_1\).
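A minimal sketch of this dual-setting construction, assuming an RBF kernel (any positive semi-definite kernel would do): build \(\mathbf{K}_p=k(\mathbf{D}_p,\mathbf{D}_p)\) and apply it to a core through the mode product.

```python
import numpy as np

def rbf_kernel(D, lengthscale=1.0):
    """K_p = k(D_p, D_p) for an RBF kernel; D has shape (n_p, c_p)."""
    sq_dists = np.sum((D[:, None, :] - D[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def mode_n_product(X, U, n):
    out = np.tensordot(X, U, axes=([n], [1]))
    return np.moveaxis(out, -1, n)

# Core V_p of shape (R_p, n_p, R_{p-1}) and side information D_p of shape (n_p, c_p).
R_p, n_p, R_pm1, c_p = 3, 40, 2, 5
V = np.random.randn(R_p, n_p, R_pm1)
D = np.random.randn(n_p, c_p)
K = rbf_kernel(D)                   # (n_p, n_p)
V_dual = mode_n_product(V, K, 1)    # V_p x_2 K_p (mode 2 is axis 1 here), same shape as V_p
```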

3.4 Dual space regularization term for WLR

Consider applying another tensor \(\mathbf{V}'\) of matching shape through an element-wise product to robustify \(\mathbf{V}\), i.e. \(\mathbf{V}' \circ \prod _{q=1}^{Q}\mathbf{V}\times _{q+1} \mathbf{K}_q\). Then we have

$$\begin{aligned} \mathsf {v}_{r_1r_2}= & {} \left( \sum _{n_{1}'\ldots n_q'}v_{r_1n_1'...n_q'r_2}'\prod _{q=1}^{Q}\delta _{\mathbf{x}_{n_q'}^{q}}(\cdot )\right) \cdot \left( \sum _{n_{1}\ldots n_q}v_{r_1n_1...n_qr_2}\prod _{q=1}^{Q} k(\cdot ,\mathbf{x}_{n_q}^{q})\right) \nonumber \\= & {} \sum _{\begin{array}{c} n_1\ldots n_q\\ n_1'\ldots n_q' \end{array}}v_{r_1n_1...n_qr_2}v_{r_1n_1'...n_q'r_2}'\prod _{q=1}^{Q}\delta _{\mathbf{x}_{n_q'}^{q}}(\cdot )\prod _{q=1}^{Q}k(\cdot ,\mathbf{x}_{n_q}^{q}) \end{aligned}$$
(12)

where the regularization term for \(\mathsf {v}_{r_1,r_2}\) is given by:

$$\begin{aligned} \varLambda= & {} \sum _p \lambda _p \sum _{r_1,r_2} \langle \mathsf {v}_{r_1,r_2},\mathsf {v}_{r_1,r_2} \rangle _{{\mathcal {H}}}\nonumber \\= & {} \sum _p \lambda _p\left( \left( \prod _{q=1}^Q \mathbf{V}\times _{q+1}\mathbf{K}_q\right) \circ \left( (\prod _{q=1}^Q(\mathbf{V}'\circ \mathbf{V}')\times _{q+1} {\mathbf {1}}_{n_q\times n_q})\circ \mathbf{V}\right) \right) _{++} \end{aligned}$$
(13)

and \((\cdot )_{++}\) means summing all elements.

3.5 Dual space regularization term for LS

For LS models, the regularization term is calculated as

$$\begin{aligned} \varLambda = \sum _p \lambda _p \Bigg [\Vert \mathbf{V}_p^s\Vert ^2_F + \sum _{r} \langle \mathsf {v}_r,\mathsf {v}_r \rangle _{{\mathcal {H}}} + \Vert \mathbf{V}_p^b\Vert ^2_F\Bigg ] \end{aligned}$$
(14)

where \(\mathsf {v}_{r_1,r_2}= \sum _{n_{1}\ldots n_q}v_{r_{p}n_1...n_qr_{p-1}}\prod _{q=1}^{Q} k(\cdot ,\mathbf{x}_{n_q}^{q})\) and \(\sum _{r}\langle \mathsf {v}_r,\mathsf {v}_r \rangle _{{\mathcal {H}}} = \left( \left( \prod _{q=1}^Q \mathbf{V}\times _{q+1}\mathbf{K}_q\right) \circ \mathbf{V}\right) _{++}\).

3.6 Scaling with random Fourier features

To make tensors with kernelized side information scalable, we rely on a random Fourier feature (RFF) (Rahimi and Recht 2007) approximation of the true kernels. RFFs approximate a translation-invariant kernel function k using Monte Carlo:

$$\begin{aligned} {\hat{k}}(\mathbf{x},\mathbf{y}) = \frac{2}{M}\sum _{i=1}^{M/2} \big [\cos (\omega _i^T \mathbf{x})\cos (\omega _i^T \mathbf{y}) + \sin (\omega _i^T \mathbf{x}) \sin (\omega _i^T \mathbf{y})\big ] \end{aligned}$$
(15)

where \(\omega _i\) are frequencies drawn from a normalized non-negative spectral measure \(\varLambda\) of kernel k. Our primary goal in using RFFs is to create a memory efficient, yet expressive method. Thus, we write

$$\begin{aligned} {\hat{k}}\left( \mathbf{x}_{i}^p,\mathbf{x}_j^p \right) \approx \phi \left( \mathbf{x}_{i}^p\right) ^{\top }\phi \left( \mathbf{x}_{j}^p\right) \end{aligned}$$
(16)

with explicit feature map \(\phi :{\mathbb {R}}^{D_p}\rightarrow {\mathbb {R}}^{M}\), and \(M\ll N_p\). In the case of RFFs,

$$\begin{aligned} \phi (\cdot )^{\top }=[\cos (\omega _1^T \cdot ),\ldots ,\cos (\omega _{M/2}^T \cdot ),\sin (\omega _1^T \cdot ),\ldots ,\sin (\omega _{M/2}^T \cdot )]. \end{aligned}$$

This feature map can be applied in the primal space setting as a computationally cheap alternative to the RKHS dual setting.
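A minimal sketch of the RFF feature map in Eqs. (15)–(16) for an RBF kernel; the lengthscale and sizes are illustrative.

```python
import numpy as np

def rff_features(X, M, lengthscale=1.0, seed=0):
    """Random Fourier features for the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 l^2));
    Phi @ Phi.T approximates the N x N kernel matrix, Eq. (16)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # The spectral measure of the RBF kernel is Gaussian with scale 1 / lengthscale.
    W = rng.standard_normal((d, M // 2)) / lengthscale
    Z = X @ W
    return np.hstack([np.cos(Z), np.sin(Z)]) * np.sqrt(2.0 / M)

X = np.random.default_rng(1).standard_normal((200, 4))
Phi = rff_features(X, M=512)
K_exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
err = np.max(np.abs(Phi @ Phi.T - K_exact))   # shrinks as M grows
```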

A drawback of tensors with kernelized side information is the \({\mathcal {O}}(N_p^2)\) memory growth of kernel matrices. If one of the dimensions has a large \(N_p\) in the dual space setting, we approximate large kernels \(\mathbf{K}_p\) with

$$\begin{aligned} \mathbf{V}\times _{p+1} \varPhi \times _{p+1} \varPhi ^{\top } \approx \mathbf{V}\times _{p+1} \mathbf{K}_p, \end{aligned}$$
(17)

where \(\varPhi =\phi (\mathbf{D}_p)\in \mathbf{\mathbb {R}}^{N_p\times M}\). To see that this is a valid approximation, an element \({\bar{v}}_{i_1...i_p}\) in \(\mathbf{V}\times _p \mathbf{K}_p\) is given by \({\bar{v}}_{i_1...i_p} = \sum _{i_p'=1}^{N_p}v_{i_1...i_p'}k(\mathbf{x}_{i_p}^p,\mathbf{x}_{i_p'}^p)\). Using RFFs we have

$$\begin{aligned} {\bar{v}}_{i_1...i_p} = \sum _{i_p'=1}^{N_p}\sum _{c=1}^M v_{i_1...i_p'}\varPhi _{i_p'c} \varPhi _{i_p c}=\sum _{i_p'=1}^{N_p}v_{i_1...i_p'}\phi (\mathbf{x}_{i_p'}^p)^{\top } \phi (\mathbf{x}_{i_p}^p). \end{aligned}$$
(18)

KFT with RFFs now becomes:

$$\begin{aligned} \mathsf {v}_{r_p} =\sum _{\begin{array}{c} n_1\ldots n_q \\ n_1'\ldots n_q' \end{array}}v_{r_{p}n_1...n_qr_{p-1}}v_{r_{p}n_1'...n_q'r_{p-1}}'\prod _{q=1}^{Q}\delta _{\mathbf{x}_{n_q'}^{q}}(\cdot )\prod _{q=1}^{Q}\phi (\mathbf{x}_{n_q}^{q})^{\top }\phi (\cdot ) \end{aligned}$$
(19)

with the regularization term

$$\begin{aligned} \sum _{r}\langle \mathsf {v}_r,\mathsf {v}_r \rangle _{{\mathcal {H}}}=\left( \left( \prod _{q=1}^Q \mathbf{V}\times _{q+1} \varPhi _{q}\times _{q+1} \varPhi _q^{\top } \right) \circ \left( \left( \prod _{q=1}^Q(\mathbf{V}'\circ \mathbf{V}')\times _{q+1} {\mathbf {1}}_{n_q\times n_q}\right) \circ \mathbf{V}\right) \right) _{++}. \end{aligned}$$
(20)

For a derivation, please refer to the Appendix B.

3.7 Kernel fried tensor

Having established our new model, we coin it kernel fried tensor (KFT). Given some loss \({\mathcal {L}}(\mathbf{Y},\tilde{\mathbf{Y}})\) for predictions \(\tilde{\mathbf{Y}}\) the full objective is

$$\begin{aligned} \min _{\text {wrt } \mathbf{V}_p,\mathbf{V}_p',\varTheta _p} {\mathcal {L}}(\mathbf{Y},\tilde{\mathbf{Y}}) + \varLambda (\mathbf{V}_p,\mathbf{V}_p',\varTheta _p) \end{aligned}$$
(21)

where \(\mathbf{V}_p,\mathbf{V}_p',\varTheta _p\) are the parameters of the model, with \(\varTheta _p\) the kernel parameters if we use the RKHS dual formulation. As our proposed model involves mutually dependent components with non-zero mixed partial derivatives, optimizing them jointly with a first-order solver is inappropriate, as mixed partial derivatives are not considered during each gradient step. Inspired by the EM algorithm (Dempster et al. 1977), we summarize our training procedure in Algorithm 1. By updating each parameter group sequentially and independently, we eliminate the effects of mixed partials, leading to accurate gradient updates. For further details, we refer to Appendix E.1.

Algorithm 1 Training procedure for KFT
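A hedged PyTorch-style sketch of the alternating, EM-inspired update in Algorithm 1, assuming a model object that returns predictions and the regularization term of Eq. (21) for a batch of indices; the interface and names are hypothetical.

```python
import torch

def train_kft(model, loader, param_groups, epochs=10, lr=1e-2):
    """Cycle over parameter groups (e.g. all V_p, then all V'_p, then kernel
    parameters), updating one group at a time with the others frozen so that
    mixed partial derivatives do not contaminate each gradient step."""
    for _ in range(epochs):
        for group in param_groups:                 # e.g. [V_params, V_prime_params, kernel_params]
            opt = torch.optim.Adam(group, lr=lr)
            for idx, y in loader:                  # tensor indices and observed entries
                opt.zero_grad()
                y_hat, reg = model(idx)            # prediction and regularization, Eq. (21)
                loss = torch.mean((y - y_hat) ** 2) + reg
                loss.backward()
                opt.step()
```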

3.7.1 Joint features

From a statistical point of view, we are assuming that each of our latent tensors \(\mathbf{V}_p\) factorizes \(\mathbf{Y}\) into P independent components with prior distribution corresponding to \({\mathcal {N}}(\mathsf {v}_{r_{p-1}r_p}^{p} |0,\mathbf{K}_p)\), where \(\mathsf {v}_{r_{p-1}r_p}^{p}\in \mathbf{\mathbb {R}}^{n_p}\) is the \(r_{p-1},r_p\) cell selected from \(\mathbf{V}_p\). We can enrich our approximation by jointly modelling some dimensions p by choosing some \(\mathbf{V}_p\in \mathbf{\mathbb {R}}^{R_{p+1}\times n_{p+1}\times n_{p} \times R_{p-1 }}\). If we denote this dimension p by \(p'\) we have that

$$\begin{aligned}&\mathbf{Y}\approx \prod _{p=1}^{p'-1} \times _{-1} (\mathbf{V}_p\times _{2} \mathbf{K}_p) \times _{-1}(\mathbf{V}_{p'}\times _3 \mathbf{K}_{p'} \times _{2} \mathbf{K}_{p'+1}) \prod _{p=p'+2}^P \times _{-1}(\mathbf{V}_p\times _{2} \mathbf{K}_p)\nonumber \\&y_{i_1...i_P}\approx \sum _{\begin{array}{c} r_1\ldots r_P\\ i_1'\ldots i_P' \end{array}} \prod _{p=1}^{p'-1} v_{r_{p}i_p'r_{p-1}}k(\mathbf{x}_{i_p}^p,\mathbf{x}_{i_p'}^p)\cdot v_{r_{p}i_{p'}'i_{p'+1}'r_{p-1}}k(\mathbf{x}_{i_{p'}}^{p'},\mathbf{x}_{i_{p'}'}^{p'})k\left( \mathbf{x}_{i_{p'+1}}^{p'+1},\mathbf{x}_{i_{p'+1}'}^{p'+1}\right) \prod _{p=p'+2}^{P} v_{r_{p}i_p'r_{p-1}}k(\mathbf{x}_{i_p}^p,\mathbf{x}_{i_p'}^p) \end{aligned}$$
(22)

and the prior would instead be given as \({\mathcal {N}}(\text {vec}(\mathsf {v}_{r_{p'-1}r_{p'+1}}^{p'}) |0,\mathbf{K}_{p'}\otimes \mathbf{K}_{p'+1})\). Here \(\text {vec}(\mathsf {v}_{r_{p'-1}r_{p'+1}}^{p'}) \in \mathbf{\mathbb {R}}^{n_{p'}n_{p'+1}}\) and \(\text {vec}(\cdot )\) means flattening a tensor to a vector. The cell selected from \(\mathbf{V}_{p'}\) now has a dependency between dimensions \(p'\) and \(p'+1\). We refer to a one-dimensional factorization component of TT as a TT-core and a multi-dimensional factorization component as a joint TT-core.

3.8 Bayesian inference

We turn to Bayesian inference for uncertainty quantification with KFT. We assume a Gaussian conditional likelihood for an observation \(y_{i_1\ldots i_P}\), taking inspiration from Gönen et al. (2013) and Kim and Choi (2014). For KFT-WLR we have that

$$\begin{aligned}&p(y_{i_1\ldots i_P}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P') \nonumber \\&\quad = {\mathcal {N}}\left( y_{i_1\ldots i_P}\left| \sum _{\begin{array}{c} r_1\ldots r_P\\ i_1'\ldots i_P'\\ i_1''\ldots i_P'' \end{array}} \prod _{p=1}^P v_{r_{p}i_p''r_{p-1}}'v_{r_{p}i_p'r_{p-1}}k(\mathbf{x}_{i_p}^p,\mathbf{x}_{i_p'}^p)\delta _{\mathbf{x}_{i_p}^p}(\mathbf{x}_{i_p''}^p),\sigma _y^2\right. \right) . \end{aligned}$$
(23)

The corresponding likelihood for KFT-LS is

$$\begin{aligned}&p(y_{i_1\ldots i_P}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P') \nonumber \\&\quad = {\mathcal {N}}\left( y_{i_1...i_P}\left| \sum _{\begin{array}{c} r^s_1\ldots r^s_P \end{array}}\prod _{p=1}^Pv^s_{r_{p}i_pr_{p-1}} \sum _{\begin{array}{c} r_1\ldots r_P\\ i_1'\ldots i_P' \end{array}} \prod _{p=1}^P v_{r_{p}i_p'r_{p-1}}k(\mathbf{x}_{i_p}^p,\mathbf{x}_{i_p'}^p)+\sum _{\begin{array}{c} r^b_1\ldots r^b_P \end{array}}\prod _{p=1}^Pv^b_{r_{p}i_pr_{p-1}},\sigma _y^2\right. \right) \end{aligned}$$
(24)

where \(\sigma _y^2\) is a scalar hyperparameter.

3.9 Variational approximation

Our goal is to approximate the posterior distribution \(p(\mathbf{V}_p,\mathbf{V}_p'|\mathbf{Y})\), which is intractable as the evidence \(p(\mathbf{Y})= \int p(\mathbf{Y}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P')p(\mathbf{V}_1)...p(\mathbf{V}_P) p(\mathbf{V}_1')...p(\mathbf{V}_P')d\mathbf{V}_{1\ldots P}d\mathbf{V}_{1\ldots P}'\) has no closed form solution due to the product of Gaussians. Instead we use variational approximations for \(\mathbf{V}_p,\mathbf{V}_p'\) by parametrizing distributions of the Gaussian family and optimizing the evidence lower bound (ELBO)

$$\begin{aligned}&{\mathcal {L}}(y_{i_1\ldots i_P},\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P')= {\mathbb {E}}_{q}[\log {p(y_{i_1\ldots i_P}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P')}]\nonumber \\&\quad -\,\left( \sum _{p=1}^P D_{\text {KL}}(q(\mathbf{V}_p)\Vert p(\mathbf{V}_p))+D_{\text {KL}}(q(\mathbf{V}_p')\Vert p(\mathbf{V}_p'))\right) \end{aligned}$$
(25)

In our framework, we consider univariate and multivariate Gaussian variational approximations with corresponding priors, where \(\sigma _y^2\) is interpreted as controlling the weight of the reconstruction term relative to the KL term.

3.9.1 Univariate VI

Univariate KL For the case of univariate normal priors, we calculate the KL divergence as

$$\begin{aligned} {\displaystyle D_{\mathrm {KL} }({\mathcal {N}}_q\,\Vert \,{\mathcal {N}}_p)={\frac{(\mu _{q}-\mu _{p})^{2}}{2\sigma _{p}^{2}}}+{\frac{1}{2}}\left( {\frac{\sigma _{q}^{2}}{\sigma _{p}^{2}}}-1-\ln {\frac{\sigma _{q}^{2}}{\sigma _{p}^{2}}}\right) } \end{aligned}$$
(26)

where \(\mu _q,\mu _p\) and \(\sigma _q^2,\sigma _p^2\) are the means and variances of the variational approximation and the prior respectively; \(\mu _p,\sigma _p^2\) are chosen a priori.

Model For a univariate Gaussian variational approximation we assume the following prior structure

$$\begin{aligned} p(\mathbf{V}_p')= \prod _{\begin{array}{c} r_p,i_p \end{array}}{\mathcal {N}}(v_{r_{p}i_pr_{p-1}}'|\mu _p',\sigma _p'^2),\quad p(\mathbf{V}_p) = \prod _{r_p,i_p}{\mathcal {N}}(v_{r_{p}i_pr_{p-1}}|\mu _p,\sigma _p^2) \end{aligned}$$
(27)

with corresponding univariate meanfield approximation

$$\begin{aligned}&q(\mathbf{V}_p')= \prod _{\begin{array}{c} r_p,i_p \end{array}}{\mathcal {N}}(v_{r_{p}i_pr_{p-1}}'|\mu _{r_pi_pr_{p-1}}',\sigma _{r_pi_pr_{p-1}}'^2),\nonumber \\&q(\mathbf{V}_p) = \prod _{r_p,i_p}{\mathcal {N}}(v_{r_{p}i_pr_{p-1}}|\mu _{r_pi_pr_{p-1}},\sigma _{r_pi_pr_{p-1}}^2). \end{aligned}$$
(28)

We take \(\mu _p',\sigma _p'^2,\mu _p,\sigma _p^2\) to be hyperparameters.
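A minimal sketch of the univariate KL term of Eq. (26), summed over the mean-field parameters of Eq. (28) for one TT-core; the log-standard-deviation parametrization is an assumption made for numerical convenience.

```python
import torch

def univariate_kl(mu_q, log_sigma_q, mu_p=0.0, sigma_p=1.0):
    """Elementwise KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), Eq. (26), summed
    over all entries of the variational parameter tensors."""
    sigma_q2 = torch.exp(2.0 * log_sigma_q)
    sigma_p2 = sigma_p ** 2
    kl = (mu_q - mu_p) ** 2 / (2.0 * sigma_p2) \
        + 0.5 * (sigma_q2 / sigma_p2 - 1.0 - torch.log(sigma_q2 / sigma_p2))
    return kl.sum()

# Variational parameters for one core V_p of shape (R_p, n_p, R_{p-1}) = (3, 100, 2).
mu_q = torch.zeros(3, 100, 2, requires_grad=True)
log_sigma_q = torch.full((3, 100, 2), -2.0, requires_grad=True)
kl_term = univariate_kl(mu_q, log_sigma_q)
```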

Weighted latent regression reconstruction term For Weighted latent regression, we express the reconstruction term as

$$\begin{aligned}&{\mathbb {E}}_{q}[\log {p(y_{i_1\ldots i_P}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P')}] \nonumber \\&\quad \propto \frac{1}{\sigma _y^2} \Bigg [\mathbf{Y}^2-2\mathbf{Y}\circ \Bigg ( \prod _{p=1}^P \times _{-1} ( \mathbf{M}_p' \circ (\mathbf{M}_p\times _{2} \mathbf{K}_p))\Bigg ) + \Bigg (\prod _p^P \times _{-1} \mathbf{M}_p' \circ (\mathbf{M}_p\times _{2} \mathbf{K}_p)\Bigg )^2\nonumber \\&\qquad + \,\Bigg (\prod _p^P \times _{-1}\varSigma _p' \circ (\varSigma _p\times _{2} (\mathbf{K}_p)^2)\Bigg )+\Bigg (\prod _p^P \times _{-1}(\varSigma _p' \circ (\mathbf{M}_p\times _2 \mathbf{K}_p)^2) \Bigg )\nonumber \\&\qquad +\,\Bigg (\prod _p^P \times _{-1} \mathbf{M}_p'^2\circ (\varSigma _p\times _{2} (\mathbf{K}_p)^2)\Bigg ) \Bigg ] \end{aligned}$$
(29)

where \(\mathbf{M}_p',\mathbf{M}_p\) and \(\varSigma _p',\varSigma _p\) correspond to the tensors containing the variational parameters \(\mu _{r_{p}i_pr_{p-1}}'\), \(\mu _{r_{p}i_pr_{p-1}}\) and \(\sigma _{r_{p}i_pr_{p-1}}'^2\), \(\sigma _{r_{p}i_pr_{p-1}}^2\) respectively. For the case of RFFs, we approximate \(\varSigma _p \times _2 (\mathbf{K}_p)^2 \approx \varSigma _p \times _2 (\varPhi _p \bullet \varPhi _p)^{\top } \times _2 (\varPhi _p \bullet \varPhi _p)\), where \(\bullet\) is the transposed Khatri–Rao product. It should further be noted that any squared term means element-wise squaring. We provide a derivation in Appendix A.1.

Latent scaling reconstruction term For Latent Scaling, we express the reconstruction term as

$$\begin{aligned}&{\mathbb {E}}_{q}[\log {p(y_{i_1\ldots i_P}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P')}] \nonumber \\&\quad \propto \frac{1}{\sigma _y^2} \left[ \mathbf{Y}^2 -2\mathbf{Y}\circ \Bigg (\prod _{p=1}^P (\mathbf{M}_p^s \circ (\mathbf{M}_p \times _2 \mathbf{K}_p) + \mathbf{M}_p^b) \Bigg )\right. \nonumber \\&\qquad +\,\left( \left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p^s\right) ^2+\left( \prod _{p=1}^P \times _{-1} \varSigma _p^s\right) \right) \nonumber \\&\qquad \circ \,\left( \left( \prod _p^P \times _{-1} \mathbf{M}_p\times _{2} \mathbf{K}_p\right) ^2+ \prod _{p=1}^P \times _{-1} \varSigma _p\times _{2} (\mathbf{K}_p)^2\right) \nonumber \\&\qquad +\,2\left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p^s\right) \circ \left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p^b\right) \circ \left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p \times _2 \mathbf{K}_p\right) \nonumber \\&\left. \qquad +\,\left( \left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p^b\right) ^2+\left( \prod _{p=1}^P \times _{-1} \varSigma _p^b\right) \right) \right] . \end{aligned}$$
(30)

For details, see the Appendix A.2.

3.9.2 Multivariate VI

Multivariate KL The KL divergence between a multivariate normal prior p and a multivariate normal variational approximation q is given by

$$\begin{aligned} D_{\text {KL}}({{\mathcal {N}}}_{q}\parallel {{\mathcal {N}}}_{p})={\frac{1}{2}}\left[ {\text {tr}}\left( \varSigma _{p}^{-1}\varSigma _{q}\right) +(\mu _{p}-\mu _{q})^{{\mathsf {T}}}\varSigma _{p}^{-1}(\mu _{p}-\mu _{q})-k+\ln \left( {\frac{\det \varSigma _{p}}{\det \varSigma _{q}}}\right) \right] \end{aligned}$$
(31)

where \(\mu _p,\mu _q\) and \(\varSigma _p,\varSigma _q\) are the means and covariances of the prior and variational approximation respectively. Inspired by the g-prior (Zellner 1986), we take \(\varSigma _p = \mathbf{K}^{-1}_p\), where \(\mathbf{K}^{-1}_p\) is the inverse kernel covariance matrix of side information for mode p. When side information is absent, we take \(\varSigma _p = \mathbf{I}\). Another benefit of using the inverse is that it simplifies calculations, since we avoid inverting a dense square matrix in the KL term. Similar to the univariate case we choose \(\mu _p\) a priori, although here it becomes a constant tensor rather than a constant scalar.

Model For the multivariate case, we consider the following priors

$$\begin{aligned}&p(\mathbf{V}_p')= \prod _{\begin{array}{c} r_p,i_p \end{array}}{\mathcal {N}}(v_{r_{p}i_pr_{p-1}}'|\mu _p',\sigma _p'^2)\nonumber \\&p(\mathbf{V}_p) =\prod _{r_p}{\mathcal {N}}(\text {vec}(\mathsf {v}_{r_p})|\mu _p,\prod _{q=1}^{Q_p}\otimes (\mathbf{K}_q)^{-1}) \end{aligned}$$
(32)

where \(Q_p\) is the number of dimensions jointly modeled in each TT-core. For the variational approximations, we have

$$\begin{aligned}&q(\mathbf{V}_p')=\prod _{\begin{array}{c} r_p,i_p \end{array}}{\mathcal {N}}(v_{r_{p}i_pr_{p-1}}'|\mu _{r_pi_pr_{p-1}}',\sigma _{r_pi_pr_{p-1}}'^2)\nonumber \\&q(\mathbf{V}_p)=\prod _{r_p,i_p}{\mathcal {N}}(\text {vec}(\mathsf {v}_{r_p})|\mu _{r_pi_pr_{p-1}},\prod _{q=1}^{Q_p}\otimes (\mathbf{B}_{q}\mathbf{B}_{q}^{\top })) \end{aligned}$$
(33)

We take \(\mu _p',\sigma _p'^2,\mu _p\) to be hyperparameters and \(\varSigma _{q}=\mathbf{B}_{q}\mathbf{B}_{q}^{\top }\).

Sampling and parametrization Calculating \(\prod _{q=1}^{Q_p} \otimes \mathbf{B}_{q}\mathbf{B}_{q}^{\top }\) directly would yield a covariance matrix that is prohibitively large. To sample from \(q(\mathbf{V}_p)\) we exploit the fact that positive definite matrices A and B with Cholesky decompositions \(L_A\) and \(L_B\) satisfy

$$\begin{aligned} A\otimes B = (L_AL_A^{\top })\otimes (L_BL_B^{\top }) = (L_A\otimes L_B) (L_A\otimes L_B)^{\top }. \end{aligned}$$
(34)

together with the fact that

$$\begin{aligned} \prod _{i=1}^{N} \mathbf{X}\times _{i+1} A_{i} = \left( \prod _{i=1}^N \otimes A_i\right) \cdot \text {vec}(\mathbf{X}), \end{aligned}$$
(35)

where \(\text {vec}(\mathbf{X})\in \mathbf{\mathbb {R}}^{\prod _{i}^N I_i\times R}\). We would then draw a sample \({\mathbf {b}}\sim q(\mathbf{V})\) as

$$\begin{aligned} {\mathbf {b}} = \mu _{r_p} + \prod _{q=1}^{Q_p} \tilde{\mathbf{z}} \times _{q+1} \mathbf{B}_q \end{aligned}$$
(36)

where \(\tilde{\mathbf{z}}\sim {\mathcal {N}}(0,{\mathbf {I}}_{\prod _{q=1}^{Q_p}n_q})\) is reshaped into \(\tilde{\mathbf{z}}\in \mathbf{\mathbb {R}}^{n_1\times \cdots \times n_{Q_p}}\). We take \(\mathbf{B}_q = \mathbf{ltri} (B_qB_q^{\top })+D_q\) (Ong et al. 2017), where \(B_q\in \mathbf{\mathbb {R}}^{n_q\times r}\), \(D_q\) is a diagonal matrix and \(\mathbf{ltri}\) denotes taking the lower triangular component of a square matrix including the diagonal. We choose this parametrization for a linear time-complexity calculation of the determinant in the KL-term by exploiting that \(\det \left( \varSigma _{q}\right) =\det \left( \mathbf{B}_{q}\mathbf{B}_{q}^{\top }\right) =(\det \left( \mathbf{B}_{q}\right) )^2\). In the RFF case, we take \(\mathbf{B}_q=B_q\) and estimate the covariance as \(\mathbf{B}_q \mathbf{B}_q^{\top } + D_q^2\).
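A minimal sketch of the sampling step in Eqs. (34)–(36): each factor \(\mathbf{B}_q\) is applied through a mode product instead of forming the Kronecker-structured covariance. For brevity the factors below are dense lower-triangular matrices rather than the low-rank-plus-diagonal parametrization of Ong et al. (2017), and the rank axes of the core are omitted.

```python
import numpy as np

def mode_n_product(X, U, n):
    out = np.tensordot(X, U, axes=([n], [1]))
    return np.moveaxis(out, -1, n)

def sample_q(mu, B_list):
    """Draw b = mu + z x_1 B_1 x_2 B_2 ... with z ~ N(0, I), equivalent to a draw
    with covariance kron(B_1 B_1^T, B_2 B_2^T, ...) without ever materializing it."""
    z = np.random.standard_normal(mu.shape)
    for q, B in enumerate(B_list):
        z = mode_n_product(z, B, q)
    return mu + z

# Two jointly modelled modes of sizes 50 and 30.
mu = np.zeros((50, 30))
B1 = np.tril(np.random.standard_normal((50, 50)))
B2 = np.tril(np.random.standard_normal((30, 30)))
b = sample_q(mu, [B1, B2])
```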

Weighted latent regression reconstruction term Similarly to the univariate case, we express the reconstruction term as

$$\begin{aligned}&{\mathbb {E}}_{q}[\log {p(y_{i_1\ldots i_P}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P')}] \nonumber \\&\quad \propto \frac{1}{\sigma _y^2} \Bigg [\mathbf{Y}^2-2\mathbf{Y}\circ \Bigg ( \prod _{p=1}^P \times _{-1} ( \mathbf{M}_p' \circ (\mathbf{M}_p\times _{2} \mathbf{K}_p))\Bigg ) + \Bigg (\prod _p^P \times _{-1} \mathbf{M}_p' \circ (\mathbf{M}_p\times _{2} \mathbf{K}_p)\Bigg )^2 \nonumber \\&\qquad +\,\Bigg (\prod _{p}^P\times _{-1}\varSigma _p'\circ \Big ( {\mathbf {1}}\times _2 \text {diag}\big ((\mathbf{K}_p \cdot \mathbf{B}_p)^2\cdot {\bar{1}}\big )\Big ) \Bigg )+\Bigg (\prod _p^P \times _{-1}(\varSigma _p' \circ (\mathbf{M}_p\times _2 \mathbf{K}_p)^2) \Bigg )\nonumber \\&\qquad + \,\Bigg (\prod _{p}^P\times _{-1} \mathbf{M}_p'^2\circ \Big ( {\mathbf {1}}\times _2 \text {diag}\big ((\mathbf{K}_p \cdot \mathbf{B}_p)^2\cdot {\bar{1}}\big )\Big ) \Bigg ) \Bigg ] \end{aligned}$$
(37)

where \(\varSigma _p = \mathbf{B}_p \mathbf{B}_p^{\top }\), \({\mathbf {1}}\) denotes an all-ones tensor with the same dimensions as \(\varSigma _p'\), \({\bar{1}}\in \mathbf{\mathbb {R}}^{R\times 1}\) with R the column dimension of \(\mathbf{B}_p\), and \(\varSigma _p'\) is the same as in the univariate case. For RFFs we have that

$$\begin{aligned} (\mathbf{K}_p \cdot \mathbf{B}_p)^2\cdot {\bar{1}}\approx & {} ((\varPhi _p\cdot \varPhi _p^{\top }) \cdot \mathbf{B}_p)^2 \cdot {\bar{1}} + \text {vec}(D_p^2)\nonumber \\= & {} (\varPhi _p\cdot (\varPhi _p^{\top } \cdot \mathbf{B}_p))^2\cdot {\bar{1}}+ \text {vec}(D_p^2). \end{aligned}$$
(38)

Latent scaling reconstruction term The latent scaling version has the following expression

$$\begin{aligned}&{\mathbb {E}}_{q}[\log {p(y_{i_1\ldots i_P}|\mathbf{V}_1\ldots \mathbf{V}_P,\mathbf{V}_1'\ldots \mathbf{V}_P')}] \nonumber \\&\quad \propto \frac{1}{\sigma _y^2} \left[ \mathbf{Y}^2 -2\mathbf{Y}\circ \Bigg (\prod _{p=1}^P (\mathbf{M}_p^s \circ (\mathbf{M}_p \times _2 \mathbf{K}_p) + \mathbf{M}_p^b) \Bigg )\right. \nonumber \\&\qquad +\,\left( \left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p^s\right) ^2+\left( \prod _{p=1}^P \times _{-1} \varSigma _p^s\right) \right) \nonumber \\&\qquad \circ \left( \left( \prod _p^P \times _{-1} \mathbf{M}_p\times _{2} \mathbf{K}_p\right) ^2+ \prod _{p=1}^P \times _{-1} ({\mathbf {1}}\times _2(\text {diag}((\mathbf{K}_p \cdot \mathbf{B}_p)^2\cdot {\bar{1}})))\right) \nonumber \\&\qquad +\,2\left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p^s\right) \circ \left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p^b\right) \circ \left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p \times _2 \mathbf{K}_p\right) \nonumber \\&\left. \qquad +\,\left( \left( \prod _{p=1}^P \times _{-1} \mathbf{M}_p^b\right) ^2+\left( \prod _{p=1}^P \times _{-1} \varSigma _p^b\right) \right) \right] . \end{aligned}$$
(39)

For details, see the Appendix A.2.

RFFs and KL divergence Using \((\varPhi _p\varPhi _p^{\top })^{-1}\approx (\mathbf{K}_p)^{-1}\) as our prior covariance, we observe that the KL term presents computational difficulties, as a naive approach would require storing \((\mathbf{K}_p)^{-1}\in \mathbf{\mathbb {R}}^{n_p\times n_p}\) in memory. Assuming we take \(\varSigma _q = BB^{\top }, B \in \mathbf{\mathbb {R}}^{n_p \times R}\), we can manage the first term by using the equivalence

$$\begin{aligned} (A \bullet B)\cdot (A \bullet B)^{\top } = (AA^{\top })\circ (BB^{\top }). \end{aligned}$$
(40)

Consequently, we have that

$$\begin{aligned} \text {tr}(\varSigma _{p}^{-1}\varSigma _q)=\bigg (\underbrace{{(\varPhi \varPhi ^{\top })}}_{\text {Using }(\mathbf{K}_p)^{-1}} \circ BB^{\top }\bigg )_{++} = \bigg ( (\varPhi \bullet B)^{\top }\cdot (\varPhi \bullet B)\bigg )_{++}. \end{aligned}$$
(41)

We can calculate the second term using (34) and (35). For the third term, we recall the Weinstein–Aronszajn identity

$$\begin{aligned} \det (I_{m}+AB)=\det (I_{n}+BA) \end{aligned}$$
(42)

where \(A\in \mathbf{\mathbb {R}}^{m\times n},B\in \mathbf{\mathbb {R}}^{n\times m}\) and AB is trace class. If we were to take our prior covariance matrix to be \(\varSigma _p = (\varPhi _p\varPhi _p^{\top } + I_{n_p} )^{-1}\approx (\mathbf{K}_p + I_{n_p})^{-1}\) and our posterior covariance matrix to be approximated as \(\varSigma _q = BB^{\top }+I_{n_p}\), we could use Weinstein–Aronszajn’s identity to calculate the third log term in a computationally efficient manner.

From a statistical perspective, adding a diagonal to the covariance matrix implies regularizing it by increasing the diagonal variance terms. Taking inspiration from Kim and Teh (2017), we can further choose the magnitude \(\sigma\) of the regularization

$$\begin{aligned} \det (\sigma ^2 I_{n} +\varPhi \varPhi ^\top )&=(\sigma ^2)^n \det (I_n+\sigma ^{-2}\varPhi \varPhi ^\top ) \\&= (\sigma ^2)^n \det (I_m+\sigma ^{-2}\varPhi ^\top \varPhi ) \\&= (\sigma ^2)^{n-m} \det (\sigma ^2 I_m+\varPhi ^\top \varPhi ) \\ \end{aligned}$$

The KL expression then becomes

$$\begin{aligned} D_{\text {KL}}({{\mathcal {N}}}_{q}\parallel {{\mathcal {N}}}_{p})& ={\frac{1}{2}}\Bigg [\left( (\varPhi \bullet B)^{\top }\cdot (\varPhi \bullet B)\right) _{++}+\sigma _p^2(B\circ B)_{++}\nonumber \\&\quad +\,\sigma _q^2 (\varPhi \circ \varPhi )_{++}+\sigma _q^2\sigma _p^2N +(\mu _{p}-\mu _{q})^{{\mathsf {T}}}(\varPhi \varPhi ^{\top }+\sigma _p^2I)(\mu _{p}-\mu _{q})\nonumber \\&\quad -\,k+\ln \left( {\frac{|(\sigma _p^2)^{N-I}\det (\varPhi ^{\top }\varPhi +\sigma _p^2I)|^{-1}}{|(\sigma _q^2)^{N-I}\det (B^{\top }B+\sigma _q^2I)|}}\right) \Bigg ]. \end{aligned}$$
(43)
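A minimal NumPy check of the two computational tricks used above, assuming RFF features \(\varPhi\) with \(m\ll n\): the trace term of the KL can be computed without forming any \(n\times n\) matrix, and the log-determinant follows the Weinstein–Aronszajn reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 500, 40, 10
Phi = rng.standard_normal((n, m))   # RFF feature matrix, m << n
B = rng.standard_normal((n, r))     # low-rank factor of the variational covariance
sigma2 = 0.5

# tr(Phi Phi^T B B^T) without any n x n intermediate: tr = ||Phi^T B||_F^2.
trace_naive = np.trace((Phi @ Phi.T) @ (B @ B.T))
trace_cheap = np.sum((Phi.T @ B) ** 2)
assert np.isclose(trace_naive, trace_cheap)

# det(sigma^2 I_n + Phi Phi^T) = sigma^(2(n-m)) det(sigma^2 I_m + Phi^T Phi), Eq. (42).
lhs = np.linalg.slogdet(sigma2 * np.eye(n) + Phi @ Phi.T)[1]
rhs = (n - m) * np.log(sigma2) + np.linalg.slogdet(sigma2 * np.eye(m) + Phi.T @ Phi)[1]
assert np.isclose(lhs, rhs)
```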

3.9.3 Calibration metric

We evaluate the overall calibration of our variational model using the sum

$$\begin{aligned} \varXi = \sum _{\alpha \in \varOmega }|\xi _{1-2\alpha }-(1-2\alpha )| \end{aligned}$$
(44)

of the calibration rate \(\xi _{1-2\alpha }\), which we define as

$$\begin{aligned} \xi _{1-2\alpha } = \frac{\text {number of }y_{i_1\ldots i_P}\text { within }1-2\alpha \text { confidence level}}{\text {total number of }y_{i_1\ldots i_P}} \end{aligned}$$
(45)

where we consider \(\alpha\) to take values in \(\{0.05,0.15,0.25,0.35,0.45\}\). This calibration rate can be understood as the true (frequentist) coverage probability and \(1-2\alpha\) as the nominal coverage probability. The model is calibrated when the true coverage probability is close to the nominal coverage probability, that is, when \(\varXi\) is small. To ensure that our model finds a meaningful variational approximation, we take our hyperparameter selection criterion to be:

$$\begin{aligned} \eta _{\text {criteria}} = \varXi - R^2 \end{aligned}$$
(46)

where \(R^2:=1 - \frac{\sum _{i_1\ldots i_n} (y_{i_1\ldots i_n} - {\hat{y}}_{i_1\ldots i_n})^2}{\text {Var}(\mathbf{Y})}\) is the coefficient of determination (Draper and Smith 1966) calculated using the “mean terms” \(\Bigg ( \prod _{p=1}^P \times _{-1} ( \mathbf{M}_p' \circ (\mathbf{M}_p\times _{2} \mathbf{K}_p))\Bigg )\) or \(\Bigg (\prod _{p=1}^P (\mathbf{M}_p^s \circ (\mathbf{M}_p \times _2 \mathbf{K}_p) + \mathbf{M}_p^b) \Bigg )\) as predictions \({\hat{y}}_{i_1\ldots i_n}\). If we were to use only \(\eta _{\text {criteria}}=\varXi\), there would be an inductive bias in the choice of \(\alpha\)’s, which may lead to an approximation that is calibrated per se but not meaningful, as the modes are incorrect (low \(R^2\) value). Similarly to the frequentist case, we use an EM-inspired optimization strategy in Algorithm 2. The main idea is to find the mode and variance parameters of our variational approximation in a mutually exclusive sequential order, starting with the modes. As in the frequentist case, the reconstruction term of the ELBO has terms containing both \(\varSigma _p\) and \(\mathbf{M}_p'\), which motivates the EM-inspired approach. For further details, please refer to Appendix E.2.

Algorithm 2 Training procedure for Bayesian KFT
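A minimal sketch of the calibration metric in Eqs. (44)–(45) and the selection criterion of Eq. (46), assuming Gaussian predictive intervals built from posterior predictive means and standard deviations.

```python
import numpy as np
from scipy.stats import norm

def calibration_score(y, mu, sigma, alphas=(0.05, 0.15, 0.25, 0.35, 0.45)):
    """Xi = sum_alpha |xi_{1-2alpha} - (1 - 2alpha)|, with xi the fraction of
    observations falling inside the central (1 - 2alpha) predictive interval."""
    xi_total = 0.0
    for a in alphas:
        lo = mu + norm.ppf(a) * sigma          # lower bound of the central interval
        hi = mu + norm.ppf(1.0 - a) * sigma    # upper bound
        coverage = np.mean((y >= lo) & (y <= hi))
        xi_total += abs(coverage - (1.0 - 2.0 * a))
    return xi_total

def selection_criterion(y, mu, sigma):
    """eta = Xi - R^2, Eq. (46), using the predictive means as point predictions."""
    r2 = 1.0 - np.sum((y - mu) ** 2) / np.sum((y - y.mean()) ** 2)
    return calibration_score(y, mu, sigma) - r2
```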

3.10 Extension to forecasting

KFT in its current form is fundamentally unable to accommodate forecasting problems. To see this, we first consider the forecasting problem of predicting observation \(y_T\) from previous observations \(\mathbf{y }_t=[y_0,\ldots ,y_{t}], \quad 0<t<T\) using model \(f(\mathbf{y }_t)\). The model is then optimized by minimizing

$$\begin{aligned} {\mathcal {L}}(y_T,\mathbf{y }_t) = \Vert y_T - f(\mathbf{y }_t)\Vert ^2 \end{aligned}$$
(47)

for all T. Forecasting assumes that the model does not have access to the future observation \(y_{T+1}\) outside the training set and must instead learn to predict it through an autoregressive assumption. Imposing this assumption on KFT would imply that the latent factorizations for time index \(T+1\) remain untrained under the current training procedure, as we do not have access to these indices during training. With untrained latent factorizations for \(T+1\), any forecast would at best be random.

However, we can easily extend KFT to be autoregressive by directly applying the framework of Yu et al. (2016).

3.10.1 Frequentist setting

We consider the Temporal Regularized Matrix Factorization (TRMF) framework presented in Yu et al. (2016)

$$\begin{aligned} {\mathcal {T}}_{\mathrm {AR}}(\mathbf{X } \mid {\mathbb {L}}, {\mathcal {W}}, \eta ):=\frac{1}{2} \sum _{t=m}^{T}\left\| {\varvec{x}}_{t}-\sum _{l \in {\mathbb {L}}} W^{(l)} {\varvec{x}}_{t-l}\right\| ^{2}+\frac{\eta }{2} \sum _{t}\left\| {\varvec{x}}_{t}\right\| ^{2} \end{aligned}$$
(48)

where \(\mathbf{X } \in {\mathbb {R}}^{T\times k}\) is the temporal factorization component in the matrix factorization \(\mathbf{U}\cdot \mathbf{X }^\top\) and \(\mathbf{x }_t\) denotes \(\mathbf{X }\) sliced at time index t. Further, we take \({\mathcal {W}}=\left\{ W^{(l)} \in \text {diag}({\mathbb {R}}^{k})\mid l \in {\mathbb {L}}\right\}\) (i.e. a set of diagonal matrices) and \({\mathbb {L}}=\{l_i<T\mid i=1,\ldots ,I\}\) as the set of time indices to lag. Additionally, the regularization weight \(\eta\) is needed to ensure that \({\mathbf {X}}\) varies smoothly. However, in KFT such regularization already exists and it suffices to consider

$$\begin{aligned} {\mathcal {T}}_{\mathrm {AR}}(\mathbf{X } \mid {\mathbb {L}}, {\mathcal {W}})=\frac{1}{2} \sum _{t=m}^{T}\left\| {\varvec{x}}_{t}-\sum _{l \in {\mathbb {L}}} W^{(l)} {\varvec{x}}_{t-l}\right\| ^{2}. \end{aligned}$$

Forecasting for WLR TRMF can be extended to WLR by simply taking \(\mathbf{X } = {\mathbf {V}}_t' \circ ({\mathbf {V}}_t \times _2 \mathbf{D}_t) \in {\mathbb {R}}^{r_{T+1} \times T\times r_{T-1}}\) and \({\mathcal {W}}=\left\{ W^{(l)} \in {\mathbb {R}}^{r_{T+1} \times r_{T-1}}\mid l \in {\mathbb {L}}\right\}\), which then yields

$$\begin{aligned} {\mathcal {T}}_{\mathrm {AR}}(\mathbf{X } \mid {\mathbb {L}}, {\mathcal {W}})=\frac{1}{2} \sum _{t=m}^{T}\left\| {\varvec{x}}_{t}-\sum _{l \in {\mathbb {L}}} W^{(l)}\circ {\varvec{x}}_{t-l}\right\| ^{2}, \end{aligned}$$
(49)

which we coin KFTRegularizer (KFTR). We follow the same training strategy proposed in Yu et al. (2016) by sequentially updating \({\mathcal {F}}=\{{\mathbf {V}}_p,{\mathbf {V}}_p' \mid p=1,\ldots ,P \}\backslash {\mathbf {X}}\), \(\mathbf{X }\) and \({\mathcal {W}}\).
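A minimal sketch of the temporal regularizer in Eq. (49) for the WLR case, assuming elementwise lag weights of the same shape as a temporal slice; shapes and the lag set are illustrative.

```python
import torch

def kftr_regularizer(X, lags, W):
    """T_AR of Eq. (49): X has shape (r_left, T, r_right) for the temporal mode,
    lags is a list of positive integers and W[i] has shape (r_left, r_right)."""
    m = max(lags)
    total = 0.0
    for t in range(m, X.shape[1]):
        pred = sum(W[i] * X[:, t - l, :] for i, l in enumerate(lags))  # elementwise lag weighting
        total = total + torch.sum((X[:, t, :] - pred) ** 2)
    return 0.5 * total

# X = V'_t o (V_t x_2 D_t) for the time mode; here a random stand-in.
X = torch.randn(3, 60, 2)
lags = [1, 2, 7]
W = [torch.randn(3, 2, requires_grad=True) for _ in lags]
reg = kftr_regularizer(X, lags, W)
```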

Forecasting for LS KFTR can also be applied to the LS variant by applying the temporal regularization to all three components \(\mathbf{X}^s = \mathbf{V}^s_t\), \(\mathbf{X}^b = \mathbf{V}^b_t\) and \(\mathbf{X}= \mathbf{V}_t \times _2 \mathbf{D}_t\). We then apply Eq. (49) to each term.

3.10.2 Bayesian setting

KFTR is extended to the Bayesian setting by optimizing the quantity

$$\begin{aligned} \log {p(\mathbf{X }_T,\ldots , \mathbf{X }_{m})}=\sum _{t=m}^T \log p(\mathbf{X }_t) \end{aligned}$$
(50)

in addition to the ELBO. The probability distribution \(p(\cdot )\) is assumed to be a univariate normal with fixed variance \(\sigma ^2_{\text {KFTR}}\). We define

$$\begin{aligned} {\mathbf {X}}_t = \left[ {\mathbf {V}}_{\text {time}}' \circ ({\mathbf {V}}_{\text {time}} \times _2 \mathbf{D}_{\text {time}})\right] _t \end{aligned}$$
(51)

as functions of variational variables \(\mathbf{V}, \mathbf{V}'\). Here \(\left[ \cdot \right] _t\) means slicing the tensor at index t. As an autoregressive dependency on \(\mathbf{X}_t\) is required, we take

$$\begin{aligned} p(\mathbf{X}_t) = \int p(\mathbf{X}_t\mid \mathbf{X }_{l_1},\ldots ,\mathbf{X }_{l_K})p(\mathbf{X}_{l_1})\ldots p(\mathbf{X}_{l_K}) d\mathbf{X}_{l_1}\ldots \mathbf{X}_{l_K}. \end{aligned}$$
(52)

However \(\mathbf{X}_t\) is composed of variational variables \(\mathbf{V}, \mathbf{V}'\) and thus we can write

$$\begin{aligned} \log p(\mathbf{X}_t)= & {} \log \int p(\mathbf{X}_t\mid \mathbf{X }_{l_1},\ldots ,\mathbf{X }_{l_K})q(\mathbf{V})q(\mathbf{V}') d\mathbf{V}d\mathbf{V}'\nonumber \\= & {} \log {\mathbb {E}}_q[p(\mathbf{X }_T \mid \mathbf{X }_{l_1},\ldots ,\mathbf{X }_{l_K})] \ge {\mathbb {E}}_q[\log p(\mathbf{X }_T \mid \mathbf{X }_{l_1},\ldots ,\mathbf{X }_{l_K})]. \end{aligned}$$
(53)

The log-expectation is intractable, so we use Jensen’s inequality and optimize a lower bound instead. We then arrive at the following expression

$$\begin{aligned}&{\mathbb {E}}_q[\log {p(\mathbf{X }_T \mid \mathbf{X }_{l_1},\ldots ,\mathbf{X }_{l_K})}]\propto \frac{1}{\sigma _{\text {TRMF}}^2}{\mathbb {E}}_q\left[ \left( x_{r_p,i_t=T,r_{p+1}}-\sum _{k=1}^K w_k x_{r_p,i_t=l_k,r_{p+1}} \right) ^2\right] \nonumber \\&\quad = \frac{1}{\sigma _{\text {TRMF}}^2}{\mathbb {E}}_q\left[ \left( x_{r_p,i_t=T,r_{p+1}}^2-2x_{r_p,i_t=T,r_{p+1}}\sum _{k=1}^K w_k x_{r_p,i_t=l_k,r_{p+1}} \right. \right. \nonumber \\&\qquad \left. \left. +(\sum _{k=1}^K w_k x_{r_p,i_t=l_k,r_{p+1}})^2\right) \right] . \end{aligned}$$
(54)

We calculate the expression by taking expectations of the x-terms with respect to \(v,v'\) and arrive at

$$\begin{aligned}&{\mathbb {E}}_q[\log {p(\mathbf{X }_T \mid \mathbf{X }_{l_1},\ldots ,\mathbf{X }_{l_K})}]\propto \left[ (\mathbf{M}'^2+\varSigma ')\circ \left[ (\mathbf{M}\times _2 \mathbf{K})^2+\varSigma \times _2 \mathbf{K}^2\right] \right] _T\nonumber \\&\quad -\,2\left[ \mathbf{M}'\circ (\mathbf{M}\times _2 \mathbf{K})\right] _T \circ \left( \sum _{k=1}^K \mathbf{W }_k \circ \left[ \mathbf{M}'\circ (\mathbf{M}\times _2 \mathbf{K})\right] _{l_k} \right) \nonumber \\&\quad + \,\sum _{k=1}^K \mathbf{W }_k^2\circ \left[ (\mathbf{M}'^2+\varSigma ')\circ \left[ (\mathbf{M}\times _2 \mathbf{K})^2+\varSigma \times _2 \mathbf{K}^2\right] \right] _{l_k}\nonumber \\&\quad +\,\left( \sum _{k=1}^K \mathbf{W }_k \circ \left[ \mathbf{M}'\circ (\mathbf{M}\times _2 \mathbf{K})\right] _{l_k}\right) ^2 - \sum _{k=1}^K \mathbf{W }_k^2 \circ \left[ \mathbf{M}'\circ (\mathbf{M}\times _2 \mathbf{K})\right] _{l_k}^2. \end{aligned}$$
(55)

It should be noted that the above expression considers the WLR case for a univariate meanfield model. For a complete derivation and an expression for the multivariate meanfield model and the LS version, we refer to Appendix A.2.2.

3.11 Complexity analysis

We give a complexity analysis for all variations of KFT in both the frequentist and the Bayesian setting.

Theorem 1

KFT has computational complexity

$$\begin{aligned} {\mathcal {O}}\left( \max _{p}\left( {n_p c_p r_p r_{p-1}}\right) + \left( \max _{p}r_p\right) ^P \right) \end{aligned}$$

and memory footprint \({\mathcal {O}}\left( P\cdot \left( \max _{p}\left( n_p r_p r_{p-1}\right) + \max _{p}\left( n_p c_p\right) \right) \right)\) for a gradient update on a batch of data. In the dual case, we take \(c_p=n_p\).

We provide a proof in the appendix. Note that one can reduce both the complexity and the memory footprint by permuting the modes of the tensor so that the larger modes are placed on the edges, i.e. where \(r_{p}=1\) or \(r_{p-1}=1\); then \(n_p\) only scales with \(r_{p-1}\) or \(r_p\), respectively.
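
To illustrate Theorem 1, the helper below evaluates the stated bounds for a hypothetical mode configuration; the function and the example dimensions are not taken from the paper and only show how the bounds scale, for instance when a large mode is permuted to an edge position.

```python
# Illustrative evaluation of the bounds in Theorem 1 (names and dimensions are
# hypothetical; this is not the measured cost of the actual implementation).
def kft_cost(n, c, r):
    """n[p]: mode sizes, c[p]: side-information dims, r: TT-ranks of length
    P + 1 with r[0] = r[P] = 1, so core p has shape (r[p], n[p], r[p+1])."""
    P = len(n)
    compute = (max(n[p] * c[p] * r[p] * r[p + 1] for p in range(P))
               + max(r) ** P)
    memory = P * (max(n[p] * r[p] * r[p + 1] for p in range(P))
                  + max(n[p] * c[p] for p in range(P)))
    return compute, memory

# Placing the largest mode on an edge (rank 1 on one side) keeps n_p from
# scaling with both adjacent ranks.
print(kft_cost(n=[100_000, 500, 200], c=[10, 8, 4], r=[1, 20, 20, 1]))
print(kft_cost(n=[500, 100_000, 200], c=[8, 10, 4], r=[1, 20, 20, 1]))
```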

KFTR complexity The additional complexity associated with adding the autoregressive regularization term is at worst \({\mathcal {O}}(r_p r_{p-1} K T )\), where K is the number of lags and T the size of the temporal mode. As this term scales only linearly in K, it does not change the overall complexity of KFT.

4 Experiments

The experiments are divided into analyses of the frequentist and Bayesian versions of KFT. Frequentist KFT is compared against competing methods on prediction and forecasting across various high dimensional datasets; the datasets we use are summarized in Table 2. For Bayesian KFT, we investigate predictive performance with a focus on the calibration of the obtained posterior distributions, comparing nominal coverage probabilities to the empirical coverage probabilities.

4.1 Predictive performance of KFT

We compare KFT to the established FFM and LightGBM on the task of prediction on three different datasets: Retail Sales, Movielens-20M, and Alcohol Sales (cf. Table 2). KFT is trained using the squared loss. LightGBM is a challenging benchmark, as it has received continuous development, engineering, and performance optimization since its inception in 2017. We execute our experiments by running 20 iterations of hyperopt (Bergstra et al. 2013) for all methods to find the optimal hyperparameter configuration, constrained to a memory budget of 16GB (the memory limit of a high-end GPU), for 5 different seeds, where the seed controls how the data is split. We split our data into 60% training, 20% validation, and 20% testing. We report performance as the \(R^2\)-value on test data, since \(R^2\) provides a normalized goodness-of-fit score. For further details on hyperparameter ranges, data, and preprocessing, see Appendix F. The results are reported in Table 3, where the best results are boldfaced. We observe that KFT has configurations that outperform the benchmarks by a clear margin. Furthermore, the dual space models generally do better than their primal counterparts; we hypothesize that the enhanced expressiveness of kernelized side information is the reason for this.
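
For reference, the tuning protocol roughly follows the pattern sketched below; the search space, the `fit_and_score` helper, and the hyperparameter names are placeholders rather than the configuration listed in Appendix F.

```python
# Sketch of the tuning protocol (hypothetical search space and helpers;
# the actual ranges are listed in Appendix F).
import numpy as np
from hyperopt import fmin, tpe, hp, Trials

def split(data, seed):
    """60/20/20 train/validation/test split controlled by the seed."""
    idx = np.random.default_rng(seed).permutation(len(data))
    a, b = int(0.6 * len(data)), int(0.8 * len(data))
    return data[idx[:a]], data[idx[a:b]], data[idx[b:]]

space = {                                   # placeholder hyperparameters
    'rank': hp.choice('rank', [10, 20, 40]),
    'reg': hp.loguniform('reg', np.log(1e-4), np.log(1e-1)),
}

def run_seed(data, seed, fit_and_score):
    """fit_and_score(cfg, train, val) trains a model and returns validation R^2."""
    train, val, test = split(data, seed)
    trials = Trials()
    best = fmin(                            # 20 TPE evaluations per seed
        fn=lambda cfg: -fit_and_score(cfg, train, val),   # maximize val R^2
        space=space, algo=tpe.suggest, max_evals=20, trials=trials)
    return best                             # best config; test R^2 is then reported
```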

Table 2 Summary of datasets
Table 3 KFT results
Table 4 Frequentist results in RMSE compared to reported results

The next experiment directly compares KFT to recent matrix factorization methods tailored to the Movielens-1M and Movielens-10M datasets. The purpose of this experiment is to further demonstrate the competitiveness of KFT on recommendation tasks. Table 4 compares the RMSE of KFT on Movielens-1M and Movielens-10M to the RMSEs of existing methods as reported in their respective papers. KFT outperforms existing non-neural models and marginally underperforms compared to neural state-of-the-art models for matrix factorization. In comparison to tensor factorization, KFT outperforms NTF despite NTF being a neural model. The RMSE is generally higher for tensor factorization, as the data becomes sparser with each additional dimension.

For forecasting problems, we replicate the experiments in Yu et al. (2016) and Salinas et al. (2019) for the Traffic dataset and Yu et al. (2014) for the CCDS dataset, and compare them against KFT in Tables 5 and 6.

Table 5 Traffic results
Table 6 CCDS results

We plot forecasts for different series in Fig. 3.

Fig. 3 Forecasts on the Traffic dataset using KFTR.

4.2 Calibration study of Bayesian KFT

We use the same setup as in the frequentist case, except that the hyperparameter evaluation objective is instead given by (46). For details on hyperparameter choices and data preparation, please refer to Appendix F. As a performance benchmark, we run a regular tensor factorization without side information, intended to mimic Hawkins and Zhang (2018). We summarize the results in Table 7. We obtain calibrated variational approximations and observe that models using side information yield better predictive performance, but that their calibration becomes slightly worse. Using a Bayesian framework, we generally lose some predictive performance compared to the corresponding frequentist methods, except in the case of Movielens-20M. We provide a visualization of the calibration ratios for all datasets in Fig. 4.
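
For clarity, the snippet below computes the empirical coverage \(\xi_{1-2\alpha}\) and the deviation \(\varXi_{\alpha} = |\xi_{1-2\alpha} - (1-2\alpha)|\) from posterior predictive samples; the sample-based quantile construction is an illustrative assumption rather than the exact procedure of Sect. 3.9.3.

```python
# Empirical coverage of a nominal (1 - 2*alpha) predictive interval.
# Sample-based quantiles are an illustrative choice, not necessarily the
# construction used in Sect. 3.9.3.
import numpy as np

def coverage_deviation(pred_samples, y, alpha):
    """pred_samples: (S, N) posterior predictive draws, y: (N,) test targets."""
    lo = np.quantile(pred_samples, alpha, axis=0)
    hi = np.quantile(pred_samples, 1.0 - alpha, axis=0)
    xi = np.mean((y >= lo) & (y <= hi))          # empirical coverage xi_{1-2a}
    return xi, abs(xi - (1.0 - 2.0 * alpha))     # (xi_{1-2a}, Xi_alpha)
```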

Table 7 Table of results for Bayesian KFT. Here we denote \(\varXi _{\alpha } =|\xi _{1-2\alpha } - (1-2\alpha )|\), see Sect. 3.9.3 for more details
Fig. 4 Heatmap of \(\xi _{1-2\alpha }\) for all datasets. Here the y-axis is location/userID and x-axis is time, where targets have been aggregated on items. We see that the calibration rate over all aggregates consistently adjusts with changes in \(1-2\alpha\)

We compare KFT to existing Bayesian models with reported results in Table 8, where the model is optimized for \(\eta _{\text {criteria}}\), in contrast to the RMSE used in Kim and Choi (2014). KFT performs well in a Bayesian setting compared to Kim and Choi (2014) and also yields calibrated estimates. For forecasting problems, we apply KFTR to the Traffic and CCDS datasets and report the results in Table 9. We find that the WLR version of KFT does well and yields calibrated forecasts.

Table 8 Results with RMSE and (\(\varXi\))
Table 9 Table of results for Bayesian forecast experiments

We plot forecasts for the Traffic dataset with uncertainty quantification in Fig. 5.

Fig. 5 Forecasts on the Traffic dataset using Bayesian KFTR.

5 Analysis

We demonstrated the practical utility of KFT in both the frequentist and the Bayesian context. We now scrutinize the robustness and effectiveness of KFT as a remedy for the constraints imposed by directly applied side information.

5.1 Does KFT really amend the constraints of directly applied side information?

To investigate this, we train KFT-WLR and a naive model on the Alcohol Sales dataset using kernelized side information, running 5 hyperparameter searches. We plot the mean training, validation, and test errors of the 5 searches (with standard errors) against epochs in Fig. 6.

Fig. 6 Training, validation and test error against epoch for KFT-WLR/KFT-LS and a naive model (with “Dual” side information) on the Alcohol Sales dataset

5.2 How does KFT perform when applying constant side information?

To answer this question, we replace all side information with a constant \({\mathbf {1}}\) and kernelize it. The results in the first row of Table 10 indicate that KFT is indeed robust to constant side information, as the performance does not degrade dramatically.

5.3 How does KFT perform when applying noise as side information?

Similarly to the previous question, we now replace the side information with standard Gaussian noise. The results in the last row of Table 10 indicate that KFT is also robust to noise and remains surprisingly performant. A possible explanation is that the Gaussian noise acts as an implicit regularizer, or that the original side information is distributed similarly to standard Gaussian noise. We conclude that KFT is stable against uninformative side information in the form of Gaussian noise.
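
Both ablations amount to swapping the per-mode side information before kernelization, along the lines of the sketch below; the array names and shapes are placeholders for the actual covariates.

```python
# Replace per-mode side information either with a constant or with standard
# Gaussian noise before kernelization (names and shapes are placeholders).
import numpy as np

def ablate_side_info(side_info, mode='constant', seed=0):
    """side_info: dict mapping each tensor mode p to an (n_p, c_p) array."""
    rng = np.random.default_rng(seed)
    out = {}
    for p, S in side_info.items():
        if mode == 'constant':
            out[p] = np.ones((S.shape[0], 1))        # constant 1 covariate
        else:                                        # 'noise'
            out[p] = rng.standard_normal(S.shape)    # uninformative covariates
    return out                                       # kernelized downstream
```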

Table 10 Results from additional experiments for P-way dual models

6 Conclusion

We identified an inherent limitation of side information based tensor regression and proposed a method that removes this limitation. Our KFT method yields competitive performance against state-of-the-art large-scale prediction models on a fixed computational budget. Specifically, as the experiments in Table 3 demonstrate, Weighted Latent Regression is the most performant configuration for at least some cases of real practical interest. Further, KFT offers extended versatility in the form of calibrated Bayesian variational estimates. Our analysis shows that KFT solves the problems described in Sect. 2 and is robust to adversarial side information in the form of Gaussian noise. A direction for further development would be to characterize identifiability conditions for KFT and to extend the Bayesian framework beyond mean-field variational inference.