Large Scale Tensor Regression using Kernels and Variational Inference

We outline an inherent weakness of tensor factorization models when latent factors are expressed as a function of side information and propose a novel method to mitigate this weakness. We coin our method \textit{Kernel Fried Tensor}(KFT) and present it as a large scale forecasting tool for high dimensional data. Our results show superior performance against \textit{LightGBM} and \textit{Field Aware Factorization Machines}(FFM), two algorithms with proven track records widely used in industrial forecasting. We also develop a variational inference framework for KFT and associate our forecasts with calibrated uncertainty estimates on three large scale datasets. Furthermore, KFT is empirically shown to be robust against uninformative side information in terms of constants and Gaussian noise.


Background and Problem
Large-scale prediction problems arising from the use of machine learning methods in industry are often formulated as an empirical risk minimization task coupled with a dataset and a model. In industrial applications, a high premium is often placed on calibrated uncertainty estimates around predictions. Nowadays, datasets often scale beyond the point where the standard data matrix formalism is practical, and this restricts the choice of models as they must give a computationally tractable analysis. In addition, obtaining calibrated uncertainty estimates becomes challenging. In this paper we focus on tensor factorization models, a F. Author first address Tel.: +123-45-678910 Fax: +123-45-678910 E-mail: fauthor@example.com rich model class which provide tractability at scale. Tensor models are used in large scale prediction tasks ranging from bioinformatics to industrial forecasting; we work in the latter setting. However, while existing tensor models are versatile and scalable, they have a weakness: as we explain below, when we build a model of the latent factors in tensor factorization as a function of covariates [9] [7], the model may be restricted and global parameter couplings are generated. These lead to reduction of tractability at scale. We give a new family of tensor models which admit side information without model restriction or loss of tractability.
Tensor Train decomposition Before we explain how side information restricts tensor model expressiveness, we set out the background. Consider the task of reconstructing the tensor Y ∈ R n1×n2...×n P . Many existing decomposition techniques [11] treat this problem. We focus on the Tensor Train (TT) decomposition [16] as this generalises more readily than existing alternatives.
The n-mode (matrix) product of a tensor X ∈ R I1×I2×...×I N with a matrix U ∈ R J×In is X × n U. This product is of size I 1 × ...I n−1 × J × I n+1 × ... × I n . Elementwise, the product is We will consider the following notion of a mode product between tensor X ∈ R I1×...I N −1 ×I N , applying the N-th mode product × N with a tensor U ∈ R I N ×K1×K2 gives X× N U ∈ R I1×...I N −1 ×K1×K2 .
In TT, Y is decomposed into P latent tensors V p ∈ R Rp×np×Rp−1 ,p = 1,...P , with V 1 ∈ R n1×R1 . Here R p is the latent dimensionality for each factor in the decomposition, with R 1 = R P = 1. Denote by × −1 V p the mode product applied to the last dimension of V p . We seek V 1 ,...,V P so that (1) Suppose that, associated with dimension p, we have c p -dimensional side information denoted D p ∈ R np×cp . For example, if p is the dimension representing n p = 10000 different books, then the columns of D p ∈ R 10000×cp might contain the author of the book, page count etc. Similar to [9] [7], side information is built into the second dimension of the latent tensor V p ∈ R Rp×cp×Rp−1 using the mode product V p × 2 D p . For TT decomposition our approximation becomes The above example illustrates the primal setting where side information is applied directly. Similarly to [7], we also consider kernelized side information in the Reproducing Kernel Hilbert Space (RKHS) which we will refer to as the dual setting.
1. 1 The Problem with adding side information to tensor factorization Consider a matrix factorization problem for Y ∈ R N1×N2 with unknown latent factors U ∈ R N1×R ,V ∈ R N2×R . We are approximating If we update u 1r in the approximation y 11 ≈ R r=1 u 1r v 1r , we change the approximation y 1,2 ≈ R r=1 u 1r v 2r since they share the parameter u 1r . However, y 2,1 and y 2,2 remain unchanged. Parameters are coupled across rows and columns but not globally. This is the standard setup in latent factorization.
Now consider an extreme case where we have D 1 = D 2 = 1 N1×N1 . We take our latent factors to be a linear function of available side information which leads U,V to form is a constant matrix! We have lost all model flexibility as we are approximating Y with a constant. Now consider a more realistic example with Again, u ir appears in all entries in our matrix approximation for Y. However this time changing u ir will not change all entries by the same amount but rather differently across all entries depending on the entries of D 1 and D 2 . This connects all entries in the approximating matrix, introduces complex global variable dependence and makes fitting infeasible at large scale as the optimisation updates globally. The observation applies in both primal and dual representations. In the primal representation there is a restriction in expressiveness as the rank of our approximation falls off with the rank of the side information. In this setting near-colinearity is also a problem as it leads to unstable optimisation and factorisations which are very sensitive to noise.
We see that when we add side information we may inadvertently restrict the expressiveness of our model. We formulate a new tensor model with a range and dependence structure unaffected by addition of side information in either the primal or dual setting.
With the problem illustrated, we go on to answer the following questions: 1. Is there a TT model class that avoids constraints and complex global dependencies arising from addition of side information but still makes full use of side information? 2. Can we formulate and characterise this model class in both primal and dual space? 3. Do models in this class compare favourably, for large scale prediction, to state-of-the-art models such as LightGBM [6] and Field-Aware Factorization Machines [5]? 4. Can we work with these new models in a scalable Bayesian context with calibrated uncertainty estimates? 5. What are the potential pitfalls of using these models? When are they appropriate to use?
2 Proposed Approach

Tensors and side information
We seek a tensor model that benefits from additional side information while not forfeiting model flexibility. We introduce two strategies, which we call Weighted latent regression (WLR) and Latent scaling (LS).

Weighted latent regression
We now return to the previous setting but with additional latent tensors U ∈ R N1×R and V ∈ R N2×R . We approximate Y as: By taking the Hadamard product (•) with additional tensors U , V we recover the model flexibility and dependence structure of vanilla matrix factorization, as U and V are independent of U and V. Here, changing u ir would still imply a change for all entries by magnitudes defined by the side information, however we can calibrate these changes on a latent entrywise level by scaling each entry with u ir v jr . For any TT decomposition with an additional tensor V p , the factorization becomes where δ(·) denotes Kronecker delta. We interpret this as weighting the regression terms i,j d pi z qj u ir v jr over indices p,q with u pr v qr and then summing over latent indices r.

Latent scaling
An alternative computationally cheaper procedure would be to consider additional latent tensors We have similarly regained our original model flexibility where have introduced back independence for each term by scaling (adding a constant) for each regression term with a 'latent scale and bias' term. We generalize this to The reader may conjecture that adding side information to tensors through a linear operation is counterproductive due to the restrictions it imposes on the approximation, and argue that our proposal of introducing additional tensors to increase model flexibility is futile when side information is likely to be marginally informative or potentially uninformative. As an example, return to the case of completely non-informative constant side information, D 1 = 1, D 2 = 1. In this corner case, both our proposed models reduce to regular matrix factorization: the side information regression term collapses to a constant, which in conjunction with the added terms reduces to regular tensor factorization without side information.
A comment on identifiability It should be noted that the proposed models need not be indentifiable. To see this, return to the scenario where side information is constant. The term V p × 2 D p has constant rows equal to row sums of V p .
Any transformation of V p which preserves these row sums leaves the fit unchanged. However, in large-scale industrial forecasting we draw utility from good generalization performance in prediction, and parameter identifiability and interpretation is secondary.

Primal regularization terms
In weight space regression, the regularization terms given by the the squared Frobenius norm. The total regularization term would be written as

RKHS and the Representer Theorem
We extend our framework to also include the RKHS dual space formalism. The extension can intuitively be viewed as a tensorized version of Kernel Ridge Regression. Firstly consider side information Using the Representer theorem [19] we can express Where v r1,r2 : Q q=1 ×R q → R and v r1,r2 ∈ H, which denotes the RKHS with respect to the kernels Q q ×k q .

Dual space regularization term for WLR
Consider applying another tensor with the same shape as V through element wise product to robustify V, i.e. V • Q q=1 V× q+1 K q . Then we have where the regularization term for v r1,r2 is given by: and (·) ++ means summing all elements.

Scaling with Random Fourier Features
To make tensors with kernelized side information scalable, we rely on a Random Fourier Feature [18] (RFFs) approximation of the true kernels. RFFs approximate a translation-invariant kernel function k using Monte Carlo: where ω i are frequencies drawn from a normalized non-negative spectral measure Λ of kernel k. Our primary goal in using RFFs is to create a memory efficient, yet expressive method. Thus, we writê with explicit feature map φ : R Dp → R I , and I N p . In the case of RFFs, φ(·) = [cos(ω T 1 ·),...,cos(ω T I/2 ·),sin(ω T 1 ·),...,sin(ω T I/2 ·)]. This feature map can be applied in the primal space setting as a computationally cheap alternative to the RKHS dual setting.
A drawback of tensors with kernelized side information is the O(N 2 p ) memory growth of kernel matrices. If one of the dimensions has a large N p in the dual space setting, we approximate large kernels K p with KFT with RFFs now becomes: With the regularization term (20) For a derivation, please refer to the appendix.

Kernel Fried Tensor
Having established our new model, we coin it Kernel Fried Tensor (KFT). Given an empirical risk minimization objective L(Y,Ỹ) for predictionsỸ the full objective is where V p ,V p ,Θ p are parameters of the model and Θ p are kernel parameters if we use the RKHS dual formulation.

Training Procedure
As our proposed model involves mutually dependent components with non-zero mixed partial derivatives, optimizing them jointly with a first order solver is inappropriate as mixed partial derivatives will not be considered during each gradient step. Inspired by the EM-algorithm [2], we summarize our training procedure in algorithm 1. By updating each parameter group sequentially and independently, we eliminate the effect of mixed partials. For a more detailed motivation, see the Appendix.

Joint features
From a statistical point of view, we are assuming that each of our latent tensors V p factorizes Y into P independent components with prior distribution corre- and the prior would instead be given as and vec(·) means flattening a tensor to a vector. The cell selected from V p now has a dependency between dimensions p and p +1. We refer to a one dimensional factorization component of TT as a TT-core and a multi dimensional factorization component as a joint TT-core.

Bayesian Inference
To put KFT in a probabilistic framework we turn to Bayesian inference. We assume a Gaussian conditional likelihood for an observation y i1,...,i P with inspiration from [3] [9]. For WLR we have that The corresponding objective for LS is where σ 2 y is a scalar hyperparameter.

Variational Approximation
Our goal is to maximize the posterior distribution ..,P dV 1,...,P does not have a closed form solution due to the product of Gaussians. Instead we use variational approximations for V p ,V p by parametrizing distributions of the Gaussian family and optimize the Evidence Lower BOund(ELBO) In our framework we consider the univariate Gaussian and multivariate Gaussian as variational approximations with corresponding priors where σ 2 y is interpreted to control the weight of the reconstruction term against the KL-term.

Univariate VI
Univariate KL For the case of univariate normal priors, we calculate the KL divergence as where µ q ,µ p and σ 2 q ,σ 2 p are the mean and variance for the prior and variational approximation respectively, where µ p ,σ 2 p are chosen a priori.
Model For a univariate Gaussian variational approximation we assume the following prior structure with corresponding univariate meanfield approximation We take µ p ,σ 2 p ,µ p ,σ 2 p to be hyperparameters.
Weighted latent regression reconstruction term For the case of Weighted latent regression, we express the reconstruction term as It should further be noted that any square term means element wise squaring. We provide a derivation in the appendix.
Latent scaling reconstruction term For the case of Weighted latent regression, we express the reconstruction term as (30) For details, see the appendix.

Multivariate VI
Multivariate KL The KL divergence between a multivariate normal prior p and a variational approximation given by q: Where µ p ,µ q and Σ p ,Σ q are the mean and covariance for the prior and variational respectively. We take Σ p = K −1 , where K −1 is the kernel covariance matrix. The inverse K −1 is inspired by g-prior [20]. Another benefit of using the inverse is that is simplifies calculations, since we now avoid inverting a dense square matrix in the KL-term. Similar to the univariate case we choose µ p a priori, although here it becomes a constant tensor rather than a constant scalar.
Model For the multivariate case we consider the following priors Where Q p is the number of dimensions jointly modeled in each TT-core. For the variational approximations we have We take µ p ,σ 2 p ,µ p to be hyperparameters.

Sampling and parametrization Calculating
Qi q=1 ⊗B q B q directly will yield a covariance matrix that is prohibitively large. To sample from q(V p ) we exploit that positive definite matrices A and B with their Cholesky decompositions L A and L B have the following property (34) together with the fact that where vec(X) ∈ R N i Ii×R . We would then draw a sample b ∼ q(V) as wherez ∼ N (0,I P p=1 np ) is reshaped intoz ∈ R q=1 ×nq . Taking inspiration from [15], we take B q = ltri(B q B q ) + D q , where B q ∈ R nq×r , D q to be a diagonal matrix and ltri denotes taking the lower triangular component of a square matrix including the diagonal. We choose this parametrization for a linear timecomplexity calculation of the determinant in the KL-term. In the RFF case, we take B q = B q and estimate the covariance as B q B q +D 2 q Weighted latent regression reconstruction term We similarly to the univariate case express the reconstruction term as where Σ p = B p B T p , 1 denotes a constant one tensor with the same dimensions as Σ p ,1 ∈ R 1×R where R is the column dimension of B p , I the identity matrix and Σ p is the same as in the univariate case. For RFF's we have that Latent Scaling The latent scaling version has the following expression (39) For details, see the appendix.
RFFs and KL divergence Using (Φ p Φ p ) −1 ≈ (K p ) −1 as our prior covariance, we observe that the KL-term presents computational difficulties as a naive approach would require storing (K p ) −1 ∈ R np×np in memory. Assuming we take Σ p = BB ,B ∈ R np×R , we can manage the first term by using the equivalence (40) Consequently we have that We can calculate the second term using (34) and (35). For the third term we remember Weinstein-Aronszajn's identity det(I m +AB) = det(I n +BA) (42) where A ∈ R m×n ,B ∈ R n×m and AB is trace class. If we were to take our prior covariance matrix to be Σ p = (Φ p Φ p + I np ) −1 ≈ (K p + I np ) −1 and our posterior covariance matrix to be approximated as Σ q = BB + I np , we could use Weinstein-Aronszajn's identity to calculate the third log term in a computationally efficient manner.
From a statistical perspective, adding a diagonal to the covariance matrix implies regularizing it by increasing the diagonal variance terms. Taking inspiration from [8], we can further choose the magnitude σ of the regularization det(σ 2 I +ΦΦ ) = (σ 2 ) n det(I +σ −2 ΦΦ ) The KL expression then becomes

Optimization Scheme
Similarly to the frequentist case, we use an EM inspired optimization strategy in 2. The main idea is to find the modes and variance parameters of our variational approximation in a mutually exclusive sequential order. Informally, we first find the modes and then the uncertainty. Similarly to the frequentist case, the reconstruction term of the ELBO has terms that both contain Σ p and M p which motivates the EM inspired approach.

Calibration metric
We evaluate the overal calibration of our variational model using the sum To ensure that our model finds a meaningful variational approximation, we take our hyperparameter selection criteria to be: where R 2 is calculated using . If we only use η criteria = Ξ, we argue that there is an inductive bias in choosing α's which may lead to an approximation that is calibrated per se, but not meaningful as the modes are incorrect.

Experimental setup and justification
We divide our experiments into two parts, the first comparing our proposed frequentist model to LightGBM and FFM in terms of predictive performance on three datasets with high dimensional covariates. The second part is analyzing our proposed Bayesian model and its performance in terms of calibration and predictive performance.

Frequentist experiments
Our goal here is to showcase the performance of our proposed model in comparison to the more established LightGBM and FFM. We define our proposed model to be successful if it can soundly beat FFM and achieve on-par performance with LightGBM. In particular, LightGBM is a challenging benchmark as it has continuously received development, engineering and performance optimization since its inception in 2017. We execute our experiments by running 20 hyperopt [1] iterations for all methods to find the optimal hyperparameter configuration constrained with a computational budget of 1000 iterations/gradient updates and a memory budget of approximately 16gb (which is the memory limit of a high-end GPU) for 5 different seeds, where the seed controls the randomization of how the data is split. We split our data into 60% training, 20% validation and 20 % testing. We measure the performance of all models in terms of test data R 2 . We compare the performance over the datasets Retail Fashion (∼ 42 million observations), Movielens-20M (∼ 12 million observations) [4] and Alcohol Sales (∼ 3 million observations). For details on hyperparameter choices, data and data preparation, please refer to the appendix. We see in table 1 that KFT has configurations that can outperform the benchmarks with a good margin and that the standard deviations are not overlapping.

Bayesian experiments
We use the same setup as in the frequentist case, modulo the hyperparameter evaluation objective which instead is (47). For details on hyperparameter choices and data preparation, please refer to the appendix. In table 2 we find that the we are able to obtain fairly calibrated variational approximations with the retail dataset having the best performance. By using a Bayesian framework we seem to generally lose some predictive performance except in the case of Movielens. We provide a visualization of the calibration ratio in figure 1,2 and 3.

Analysis
We demonstrated the practical utility of KFT in both a frequentist and Bayesian context. We now scrutinize the robustness and effectiveness of KFT as a remedy for constraint-imposing side information.

Does adding side information really improve performance?
To uncover this, we run a naive tensor model P -way factorization without any side information. The first row in table 3 shows the results for all three datasets. We find that a naive model with no side information is unable to match the performance of a model with full side information for all three datasets, indicating that adding side information is indeed a meaningful endeavor.
We ideally want Ξ α and Ξ to be as close to zero as possible coupled with a high R 2 . Fig. 1: Heatmat of calibration rates for Alcohol sales. Here the y-axis is location and x-axis is time, where sales have been aggregated on items. We see that the calibration rate over all aggregates consistently adjusts with changes in α

Does KFT actually alleviate side information imposed constraints?
We run a P -way model with kernelized side information over 5 seeds with results in the second row of table 3. We see that the experimental results indeed align with the proposed theoretical constraint of adding side information as the performance across all datasets are poor, which empirically demonstrates the need for KFT. Fig. 2: Heatmat of calibration rates for Movielens. Here the y-axis is users and x-axis is time, where ratings have been aggregated on movies. We see that the calibration rate over all aggregates consistently adjusts with changes in α Fig. 3: Heatmat of calibration rates for Retail sales. Here the y-axis is store and x-axis is time, where sales have been aggregated on articles. We see that the calibration rate over all aggregates consistently adjusts with changes in α

How does KFT perform when applying constant side information?
To answer this question, we replace all side information with a constant 1 and kernelize it. The results in the third row of  Similar to the previous question, we now replace the side information with standard Gaussian noise instead. The results in the last row of table 3 indicate that KFT also is robust against noise and surprisingly performant as well. A possible explanation for this is that adding Gaussian noise serves as an implicit regularizer or that the original side information is similarly distributed as standard Gaussian noise. We conclude that KFT is stable against uninformative side information in the form of Gaussian noise.

Conclusion
We identified an inherent limitation of side information based tensor regression and gave a method which removes this limitation. Our proposed KFT method yields competitive performance against state-of-the-art industrial forecasting models on a fixed computational budget. Specifically, as the experiments in table 1 demonstrate, for at least some cases of real practical interest, WLR is the most performant configuration. Further, KFT offers extended versatility in terms of calibrated Bayesian variational estimates. Our analysis shows that KFT solves the problems we described in section 1 and is robust for adversarial side information in the form of Gaussian noise. A direction for further development would be to characterise identifiability conditions for KFT and extend the Bayesian framework beyond mean-field variational inference.

A.1 Data Processing
Our philosophy in this paper is to limit data processing to simple operations that does not require excessive engineering for a fair comparison in both utility and input data. For all the models, we carry out the exact same preprocessing modulo the models requirements of data format. We do the following general processing steps: 1. Extract relevant features and parse them to be continuous or categorical.
2. Scale all features using a z-transformation For each model specifically we do the following: 1. KFT: Tensorize all data by expressing all main modes (i.e. person, movie, time etc) as a tensor with side information associated with each mode. 2. LightGBM: Here, we don't scale the features as boosting trees generally performs better with unscaled data. In some cases we have applied PCA to some of the side information that was joined on the data matrix to decrease the memory footprint of the data matrix to contain it to a practical size. 3. FFM: Here we bin all continuous features, as FFM requires all data to be categorical. 4. Linear regression: For large categorical features, we using feature hashing to avoid data matrices of infeasible sizes. All other categorical features get one-hot encoded.

A.2 Retail Sales Data
We detail the features of the Retail Sales data in table 4. Here we choose our modes to be store, articles and time.

A.3 Movielens
We detail the features of the Retail Sales data in 5. Here we choose our modes to be users, movies and time of rating given. For the Movielens data, it should be noted that we filter the movies on existing entries in the side information. This is why we only have roughly 11 million observations rather than 20 million.

A.4 Alcohol Sales
We detail the features of the Alcohol Sales data in 6. Here we choose our modes to be location, item and time.     Table 7 For exact details we refer to the code base.

B.1.1 LightGBM hyperparameters
We provide hyperparameters for LightGBM in     Table 11 For exact details we refer to the code base.
C Derivation for weighted latent regression regularization term k

C.1 RFF regularization term
The regularization term can similarly be generalized to

D Training procedure motivation
Consider a data matrix X σ ∈ R N ×d where σ is a hyperparameter that controls the scaling of X and a target Y ∈ R N . Assume N is too large and we have to resort to first order stochastic gradient methods to approximate µ ∈ R d in our regression X σ )µ ≈ Y. Exploiting autograd [17], together with ADAM [10] we can in practice optimize {σ,µ} simultaneously for each iteration. However doing this, we commit a fallacy as when updating , we do not account for the mixed assuming mean square error. Thus updating {σ,µ} simultaneously using first order derivatives would yield an update error for µ as the updating gradient does not adjust for the mixed partial ∂ when σ is being updated at the same time. This scenario extends almost one-to-one for KFT, as we would commit a similar fallacy by updating all parameters at once. Hence we take an EM-approach when updating lp,Vp,V p . E E q [logp(y i1,...,iP |V 1 ,...,V p )] derivations

E.1.1 Weighted Latent regression
Assuming that σy is a constant hyperparameter. We first have that Eq[logp(y i 1 ,...,i P |V 1 ,...,Vp,V 1 ,...,V p )] Our goal is to find the corresponding tensor operations of the sum terms.
As the middle term remains the same as in the univariate case our full reconstruction term is Eq[logp(y i 1 ,...,i P |V 1 ,...,Vp,V 1 ,...,

E.1.2 Latent Scaling
Assuming that σy is a constant hyperparameter. We first have that Eq[logp(y i 1 ,...,i P |V 1 ,...,Vp,V 1 ,...,V p )] Too see this, we notice that all of the sums from the weighted latent regression case reappears here as well. The entire expression is given by Eq[logp(y i 1 ,...,i P |V 1 ,...,Vp,V 1 ,..., Multivariate case For multivariate case Again, we notice that all of the sums from the weighted latent regression case reappears here as well. The full expression is given by Eq[logp(y i 1 ,...,i P |V 1 ,...,Vp,V 1 ,...,