Generalized vec trick for fast learning of pairwise kernel models

Pairwise learning corresponds to the supervised learning setting where the goal is to make predictions for pairs of objects. Prominent applications include predicting drug-target or protein-protein interactions, or customer-product preferences. In this work, we present a comprehensive review of pairwise kernels, that have been proposed for incorporating prior knowledge about the relationship between the objects. Specifically, we consider the standard, symmetric and anti-symmetric Kronecker product kernels, metric-learning, Cartesian, ranking, as well as linear, polynomial and Gaussian kernels. Recently, a O(nm + nq) time generalized vec trick algorithm, where n, m, and q denote the number of pairs, drugs and targets, was introduced for training kernel methods with the Kronecker product kernel. This was a significant improvement over previous O(n^2) training methods, since in most real-world applications m,q<<n. In this work we show how all the reviewed kernels can be expressed as sums of Kronecker products, allowing the use of generalized vec trick for speeding up their computation. In the experiments, we demonstrate how the introduced approach allows scaling pairwise kernels to much larger data sets than previously feasible, and provide an extensive comparison of the kernels on a number of biological interaction prediction tasks.


Introduction
The goal of supervised learning is to learn an unknown function f : X → R from a set of training examples Z = {(x i , y i )} n i=1 each consisting of an input x i ∈ X and an associated label y i ∈ R. The learning algorithm returns a function that approximates the true function on the training set with the aim of generalizing to data unseen during the training phase.
In pairwise learning, each input x is viewed as a pair of objects x = (d, t) that we call here drugs d ∈ D and targets t ∈ T . The task may, for example, then be to predict drug and target interaction y = f (d, t) values to test for novel interactions in drug discovery. This view is not unique and the inputs may be considered as paired in many different applications. For example, recommender system literature deals with ratings given to customer and product pairs (Basilico and Hofmann, 2004;Menon and Elkan, 2010;Rendle, 2010). Information retrieval can be formulated as predicting the relevance of query and document pairs (Liu, 2011). Bioinformatics has utilized machine learning for protein-protein (Ben-Hur and Noble, 2005;Ruan et al., 2018), protein-RNA (Bellucci et al., 2011) and drug-target (Gönen, 2012;Pahikkala et al., 2015a;Cichonska et al., 2017Cichonska et al., , 2018 interaction prediction. Other applications include image labeling (Romera-Paredes and Torr, 2015), and link prediction in social networks (Pieter and Koller, 2005) Various terminology and frameworks have been used to describe the general learning problem (see e.g. Waegeman et al. (2019) for overview). These include pairwise (kernel) learning (Ben-Hur and Noble, 2005;Park and Chu, 2009;Cichonska et al., 2017Cichonska et al., , 2018, dyadic prediction (Menon and Elkan, 2010;Pahikkala et al., 2014;Schäfer and Hüllermeier, 2015), pair-input prediction (Park and Marcotte, 2012), graph inference (Vert et al., 2007), link prediction (Pieter and Koller, 2005;Kashima et al., 2009a), relational learning (Pahikkala et al., 2010;Waegeman et al., 2012;Pahikkala et al., 2013), multitask (Bonilla et al., 2007;Bernard et al., 2017) and as a special case zero-shot (Romera-Paredes and Torr, 2015) learning. Different fields often consider different but related pairwise prediction tasks. These tasks can be divided into settings where different methods are applicable and which have varying degrees of difficulty. For example, in recommender systems one often assumes that all customers and products belong to the training set and that there are some example interactions for each customer and each product (Basilico and Hofmann, 2004). Predictions are needed for (customer, product)-pairs where the rating is missing. In this setting, methods based on factorizing the interaction matrix can be used, and no explicit features are required. However, in cold-start problems the task is to predict an interaction of a new customer and product pair, where we do not have any examples with the same customer or product in the training set. Basic factorization methods do not generalize to such settings, rather methods that make use of customer and product features are needed (sometimes called side information). In this work, we restrict our considerations to methods that can generalize to novel drugs and targets, rather than just imputing missing interactions between known ones.
Kernel methods are a standard method in supervised learning. They provide feature based generalization beyond training drugs and targets, and are used as a competitive method especially in the cold start setting. Kernel methods can be applied when the training data either have explicit feature vectors or implicit feature vectors are defined via positive semidefinite kernel functions. When drugs and targets have separate features or kernels, we can use pairwise kernels to define a kernel for the pair. One simple way to define a pairwise kernel, is to concatenate a feature vector for the drug and the target together, and apply a standard kernel such as polynomial or Gaussian on this feature vector. However, a large variety of different kernels specifically defined for pairwise data have been introduced in previous literature, starting with the introduction of the standard (Ben-Hur and Noble, 2005;Basilico and Hofmann, 2004;Oyama and Manning, 2004) and symmetric (Ben-Hur and Noble, 2005) Kronecker product kernels.
In this work we present a comprehensive review on pairwise kernels, and establish a joint framework under which the most commonly used of them can be expressed as linear combinations of Kronecker products. In particular, we cover the following kernels: -Linear kernel -2nd degreee polynomial kernel -Gaussian kernel -Kronecker product kernel (Ben-Hur and Noble, 2005;Basilico and Hofmann, 2004;Oyama and Manning, 2004) symmetric Kronecker product kernel (Ben-Hur and Noble, 2005) anti-symmetric Kronecker product kernel (Pahikkala et al., 2010) -Cartesian kernel (Kashima et al., 2009b) metric learning pairwise kernel (Vert et al., 2007) ranking kernel (Herbrich, 2000;Waegeman et al., 2012) In our framework, we represent the linear combinations corresponding to these kernels with specifically designed operator notation. The notation is not only mathematically elegant but, as we show below, enables the analysis of the kernel properties and assumptions, fast computation and easy implementation.
We start by introducing the two most fundamental assumptions. The first, what we call pairwise data assumption, is that both the drugs and targets tend to appear several times as parts of different inputs in an observed data set. For example, the same drug d i may belong to two different examples (d i , t j ) and (d i , t k ). In particular, if n is the number of observed data and m and q denote the numbers of unique drugs and targets in the data, then n >> m + q. This observation can be used to develop methods with computational shortcuts tailored specifically for the pairwise learning task, as we will elaborate below in more detail.
The second, what we refer to as non-linearity assumption, states another property inherent in pairwise learning problems, which is that the functions to be learned are usually not linear combinations of functions depending only of d or t. The opposite case, where the function can be ex- for all drugs and targets, that is, if drug d 1 is more effective than drug d 2 against target t 1 , then drug d 1 is also more effective than drug d 2 against target t 2 . In other words, there would always be a single drug that would be the best choice for all targets (and vice versa). This is illustrated in Figure 1 with a simple example, where the effect of drug on target is a function of the row and column number parities of both drugs and targets. The 'chessboard' is a true pairwise data set where an interaction exists between even drugs and odd targets, or vice versa, corresponding to a XOR function between their parities, that is unlearnable with linear methods according to the classical result by Minsky and Papert (1969). In contrast, the 'tablecloth' a linear function of interaction strengths of odd drugs and odd targets, which are therefore completely independent of each other. The runtime and in many cases the memory use of kernel solvers grow at least quadratically with respect to the number of pairs, and hence the use of pairwise kernels becomes infeasible in cases where the number of pairs is large. Faster training algorithms that avoid the costly step of building the pairwise kernel matrix have been previously proposed for certain specific cases. A fast solution to compute closed form solution is known for Kronecker product kernel, when minimizing ridge regression loss on so-called complete data that includes labels between all training drugs and targets (Romera-Paredes and Torr, 2015;Pahikkala et al., 2014Pahikkala et al., , 2013Stock et al., 2018Stock et al., , 2020. Kashima et al. (2009a) show how computations for Cartesian kernel can be made faster, and computational shortcuts for speeding up the use of the ranking kernel are known for the ridge regression (Pahikkala et al., 2009) and support vector machine algorithms (Kuo et al., 2014). Yet thus far there has been no unified approach that would allow to plug in any of the commonly used pairwise kernels to a kernel method training algorithm that guarantees better than O(n 2 ) scaling.
An exact computationally efficient algorithm has recently been proposed  for the special case of Kronecker product kernels when the data is not complete. The computational complexity of multiplying a vector with the kernel matrix is reduced from O(n 2 ) to O(nm+nq). This improvement has already shown to have major practical relevance, for example, the winning team of a recently held IDG-DREAM Drug-Kinase Binding Prediction Challenge on developing models for predicting unexplored drug-target potencies, used this algorithm (Cichonska et al., 2021).
In this work, we extend this result to present the first general O(nm + nq) approach that simultaneously covers all the widely used pairwise kernels. This is a major improvement if the pairwise assumption holds, and the approach also allows accelerating the computation of the more traditional standard kernels, such as Gaussian and polynomial, if the data can be decomposed as pairwise. This is made possible by the proposed operator framework.
Finally, we perform an experimental comparison of different pairwise kernels on four different biological data sets, in which we compare the prediction performance, training time, number of training iterations and memory usage. The kernels are compared with each other in the following four different prediction tasks: First, prediction of interaction strength between a drug and a target of which both have been observed in the training data as part of some other drug-target pair with a known interaction strength. Second, interaction strength prediction with novel targets that have not been observed in the training data as part of any known drug-target pair. Third, prediction with novel drugs, and fourth, prediction with both novel drugs and targets. As is shown in by the results, the prediction performances in these four tasks are tremendously different, hence underlying the importance that they should be separately considered in pairwise learning studies. Further, the results indicate that it is not at all self-evident that the expected prediction performance improvements of the nonlinear pairwise kernels over the linear ones implied by the nonlinearity assumption would always translate to practice.
To conclude, the major contributions of this work are as follows: -Review of the standard kernels for pairwise data that establishes a common operator based framework for analysing and implementing the kernels. -The framework allows accelerating the computation of matrix-vector products with the pairwise kernels to O(nm + nq) time, leading to considerably faster training methods. -Comprehensive experimental comparison of the pairwise kernels on biological interaction data sets with four different prediction problems.

Pairwise learning problem
Given the spaces of drugs D and targets T , the possible drug and target pairs are the Cartesian product X = D × T . The label space is denoted Y, where Y = R for regression and Y = {0, 1} for classification. We further denote the joint space of the pairwise inputs and labels as Z = X × Y. The observed data set consists of n labeled drug-target pairs Z obs = (X obs , y) ∈ (D × T × Y) n . We further define d ∈ D n and t ∈ T n to be drug and target sequences such that (d i , t i ) = x i . Finally, we let D obs and T obs denote the sets of drugs and targets observed in the sample and Z obs to denote the set of observed unique drug-target pairs, so that we have m = |D obs | unique drugs and q = |T obs | unique targets, Our goal is to learn a prediction function f : D × T → Y from the training set, such that f can correctly predict the labels for a new pair (d, t) ∈ D × T . The drug d and target t in the new pair may or may not belong to drugs D obs and targets T obs observed during training time. Here, four different settings emerge, as illustrated in Figure 2: 1. d ∈ D obs and t ∈ T obs : prediction for known drugs and targets 2. d ∈ D obs and t / ∈ T obs : prediction for novel targets 3. d / ∈ D obs and t ∈ T obs : prediction for novel drugs 4. d / ∈ D obs and t / ∈ T obs : prediction for novel drugs and targets In the literature specific settings in Figure 2 have been sometimes considered separately. For example, Setting 1 can be solved even without drug or target features using matrix factorization methods (Basilico and Hofmann, 2004). However, the latent representations learned by matrix factorization methods do not generalize to drugs and targets outside the training set (Settings 2-4). The pairwise kernel learning approach considered in this work is applicable in all of the four settings.

Targets Drugs
Recent studies have highlighted, that the prediction performance and optimal choice of kernel and hyperparameters for a pairwise learning method crucially depend on the assumption of how the test pairs overlap with training Setting 3 d / ∈ D obs and t ∈ T obs Split drugs D obs = D train ∪ Dtest Then Setting 4 d / ∈ D obs and t / ∈ T obs Split both targets T obs = T train ∪ Ttest and drugs D obs = D train ∪ Dtest Then if t i ∈ Ttest and d i ∈ Dtest Z ignored , otherwise Table 1 Training and test set split in different settings data (Park and Marcotte, 2012;Pahikkala et al., 2015a;Stock et al., 2020). An experimental observation made over a large variety of different studies was that Setting 1 is usually the easiest to predict accurately, followed by Settings 2 and 3, whereas making accurate predictions in Setting 4 tends to be very challenging. As recommended in previous studies (Park and Marcotte, 2012;Pahikkala et al., 2015a;Stock et al., 2020), we will always generate separate test sets for each of the four settings in the experiments to give a comprehensive view of how the learned prediction functions generalize to different types of test pairs. Depending on the amount of data, this can be implemented either with a single split to training and test sets, or by using cross-validation with repeated splits. The way the data splitting is implemented is defined in Table  1.

Learning algorithm
In this section we present a supervised machine learning approach for learning with pairwise kernels. The computational shortcuts presented in this paper can be used to speed up any optimization approach whose computational complexity is dominated by multiplications of a pairwise kernel matrix with a vector, such as the truncated Newton method . In this paper we focus on kernel ridge regression, as it is a widely used method that admits a closed form solution and simplifies the following considerations.
To learn a prediction function, we consider the regularized empirical risk minimization problem where p ∈ R n are the predicted outputs and y ∈ R n the correct outputs, L a convex nonnegative loss function and λ > 0 a regularization parameter.
To define a kernel learning problem, let k D,T : (D × T ) × (D × T ) → R be a positive semidefinite pairwise kernel function. Denote the kernel matrix containing the kernel evaluations between the drug-target pairs used to train the model as K ∈ R n×n such that K i,j = k D,T ((d i , t i ), (d j , t j )). Choosing the reproducing kernel Hilbert space (RKHS) associated with k D,T as the hypothesis space H for risk minimization, the representer theorem (Schölkopf et al., 2001) implies that the minimizing function can be written as: where a ∈ R n is the vector of dual coefficients. Accordingly, the predictions for the training data can be written with the kernel matrix as p = Ka.
Kernel ridge regression (see e.g. (Poggio and Smale, 2003)) is a special case of the regularized empirical risk minimization, where the loss function is the squared loss L(p, y) = y − p 2 . The optimization problem then has a direct solution in terms of matrix algebra. The ridge regression problem can be formulated as solving the dual parameter vector a ∈ R n : a = argmin a∈R n y − Ka 2 + λa T Ka It can be shown that this corresponds to solving the linear equation: (1) Solving this system with a method that computes K requires at least O(n 2 ) time and memory, which is not practical in many pairwise learning problems, where n can be in the range of 10 5 or more. A much more efficient solution can be found, when the kernel matrix can be expressed as a Kronecker product matrix. Assume we have a drug kernel function k D : D × D → R and a target kernel function k T : T × T → R. The Kronecker product kernel is then defined as the product of the drug and target kernels k D,T ((d, t), (d, t)) = k D (d, d)k T (t, t).
In the following considerations, we also use the following linear operator notation for the kernels. For the drug kernel, D ∈ R D×D such that D d,d = k D (d, d), and for the target kernel T ∈ R T ×T such that T t,t = k T (t, t). For finite domains of drugs and targets, the operators can be considered as matrices, whose rows and columns are indexed with drugs and targets instead of positive integers. Their addition, scalar multiplication, transpose and Kronecker product of these operators also naturally extend to infinite and continuous domains. For example, the operator corresponding to Kronecker product kernel over drug and target kernels is , and with the parenthesis notation we stress that both the rows and columns of the Kronecker product operator are indexed by drug-target pairs. Extending the matrix product is more involved in general but the products considered in this paper are always well-defined. This is enough for our purposes, and hence we avoid going into further technical details.
Let R(d, t) ∈ R n×(D×T ) denote the Kronecker product indexing operator, whose rows are indexed by a sample of n drug-target pairs and columns by all drug-target pairs in the space D × T . Its values, as a function of the sequences d ∈ D n and t ∈ T n , are defined as follows: Below, we omit explicitly writing d and t for clarity when they are clear from the context. In the literature, this type of constructs are sometimes called sampling operators, as they select a finite sample from a space of possibilities. For two samples of data, say X = (d, t) and X = (d, t), the kernel matrix containing all Kronecker product kernel evaluations between data in the first and second sample can then be expressed as The second sample can be, for example, a validation sets used for selecting an appropriate value of the regularization parameter, the number of training iterations, or kernel parameter values. It can also be used for prediction performance evaluation of the final model with a separate test set or in general for performing predictions for data with unknown labels.
Substituting the kernel matrix of evaluations between the training data with itself to (1), we end up to the following linear system: This linear system can be solved iteratively, for example, with the minimal residual method (Saad and Schultz, 1986), combined with early stopping. A single training iteration in Equation 2 requires matrix vector products of the form u ← (R(D ⊗ T)R T + λI)u. Given a vector of parameters u, predictions for another sample of data not used in training can be computed as a single matrix vector product Table 2 presents the relevant dimensions associated to the matrix vector products. We next recollect the following result  concerning matrix-vector products in which the matrix consists of a Kronecker product that is indexed from both left and right sides. This theorem is a generalization of Roth's column lemma (Roth, 1934), often known as the "vectrick".  Theorem 1 (Airola and Pahikkala (2018)) Let Then, the operation can be carried out in O(min(qn+mn, mn+qn)) time using a sparse Kronecker product multiplication algorithm known as the generalized vec-trick (GVT).
The theorem implies that in training, the Kronecker product kernel matrix can be multiplied with a dual parameter vector in O(qn + mn) time. The cost of computing predictions simultaneously for a set of data not used for training is O(min(qn + mn, mn + qn)), where the overlined symbols denote the dimensions of the set for which the predictions are computed. This is much more efficient than the O(n 2 ) or O(nn) costs of explicitly forming the kernel matrices, since typically m, q << n and m, q << n. Table 3 Kernel functions of different pairwise kernels. We show that all of these kernels can be expressed as a combination Kronecker products of a separate drug and target kernel.

Sum of Kronecker products framework for pairwise kernels
In this section we discuss different pairwise kernels presented in the literature and show how they can be expressed as sums of Kronecker products. Each matrix vector product can then be calculated as a sum of individual Knocker product terms. This allows the application of GVT shortcut to all of these kernels, which results in efficient algorithm for both training and making predictions. Table 4 Properties of different pairwise kernels. The middle column denotes whether the kernel allows heterogenous domains and the last column shows the feature map of the kernel. Table 4 highlights an important limitation that applies to some of the kernels. These require homogeneous domains, i.e. they assume both objects in the pair belong to the same domain D = T , so that x = (d, t) ∈ D × D. For the other kernels, we can have heterogeneous domains. Further, the Cartesian kernel is designed to be used in Setting 1 only, as it does not allow generalization to such drugs and targets that are not included in the training data.
The pairwise kernels can be motivated through feature mappings, since different pairwise kernel functions in Table 3 imply different implicit feature mappings for the pair as listed in Table 4. The implicit drug and target feature mappings φ D : D → R r and φ T : T → R s are defined by the drug and target kernels k Then the feature vector of the pair is defined by the feature mapping φ D,T : D × T → R p corresponding to the pairwise kernel k D,T ((d, t), (d, t)) = φ D,T (d, t), φ D,T (d, t) . The claimed feature maps can be proven simply by computing the inner product and checking that it matches the definition of the kernel function. In the following, we discuss the implied pairwise feature vector φ D,T ((d, t)) := (x d,t 1 , ..., x d,t p ) ∈ R p of each pairwise kernel in terms of the drug φ d (d) := (x d 1 , ..., x d r ) ∈ R r and the target φ t (t) := (x t 1 , ..., x t s ) ∈ R s feature vectors. This motivates the kernels and demonstrates the intuition behind using a specific kernel for a specific task.

Linear
The pairwise linear kernel is computed as the linear kernel on the concatenated feature vector. The feature vector is the concatenation of the drug and target feature vectors x d,t = (x d , x t ). The resulting features consists of the union of original drug features (x d i ) i=1...r and target features (x t i ) i=1...s . In this feature mapping, each feature contributes equally to interaction strength in every drug and target pair. Interaction is predicted simply by the presence or absence of certain features in the drug or the target, regardless of which drug and target pair is being tested. Given drug d and target t, the predicted interaction of the drug on the target is given by This implies a global ordering of drugs, where drugs and targets are completely decoupled. If drug d 1 is more effective than drug d 2 against target t 1 , then drug d 1 is also more effective than drug d 2 against target . In the resulting model, some drugs and targets simply have more interactions than others, but there are no interactions between drug and target features. The artificial chessboard problem illustrated in Figure 1 is an example of a data set, that is impossible to model using the pairwise linear kernel.

Polynomial
The pairwise polynomial kernel is computed as the polynomial kernel on the concatenated feature vector. On a second degree polynomial kernel without bias, the feature vector is the tensor product of the concatenated feature vec- The resulting features include three types of terms: self interactions between drug features ( ...s . The self interactions contribute to a global ordering of drugs and targets, similar to the linear kernel. However, the pairwise interactions model actual interactions of drug and target features: a drug and target pair may be interacting if for example the features indicate that a certain chemical structure in a drug binds to a certain receptor on a target.

Gaussian
The pairwise Gaussian kernel is defined as the Gaussian kernel on the concatenated feature vector. This kernel exp( can be expressed as product of Gaussian drug and target kernels. This is a special case of the Kronecker product kernel, and will thus not be considered separately in the following.

Kronecker product
The Kronecker product kernel (Ben-Hur and Noble, 2005;Basilico and Hofmann, 2004;Oyama and Manning, 2004) is computed as the product of drug and target kernels. The feature vector is given as a tensor product of the drug and target feature vectors The resulting feature vector consists of simply all the pairwise interactions (x d i x t j ) i=1...r,j=1...s . These are same as the pairwise interactions in the polynomial kernel with self-interations excluded. The Kronecker product kernel can be motivated as the simplest kernel that models actual pairwise interactions in drug and target features. The Kronecker kernel is an universal kernel, if the drug and target kernels are universal (e.g. Gaussian) (Waegeman et al., 2012).

Symmetric and Anti-symmetric kernels
If we assume homogeneous domains, feature vectors can be written as a sum of symmetric and anti-symmetric parts φ D, ). The symmetric Kronecker kernel (Ben-Hur and Noble, 2005) is motivated by applying the symmetrization to the Kronecker kernel feature vector. This results in a tensor product of the drug and target feature vectors with only symmetric parts When all interactions are known to be symmetric by definition, the symmetric Kronecker kernel is sometimes referred to as the Kronecker kernel in the literature. Several works have analysed the theoretical properties of the symmetric and antisymmetric Kronecker kernels (Pahikkala et al., 2010;Waegeman et al., 2012;Brunner et al., 2012;Pahikkala et al., 2015b;Gnecco, 2017Gnecco, , 2018.

Ranking
The feature vector of the ranking kernel is the difference of drug and target fea- which are assumed to belong to the same domain (Herbrich, 2000;Waegeman et al., 2012). The resulting features consist of pair- The ranking kernel can model ranking representable relations, i.e. relations constructed from some utility function h such which provides a global ranking of drugs based on their feature representation. The ranking kernel can be considered as an anti-symmetric linear kernel as can be observed from the operator notation below. Pahikkala et al. (2009) show that the ranking kernel matrix can be computed using the oriented incidence operator M ∈ R D×n where as M T DM. Since M can be implemented with a sparse matrix, this allows efficient kernel matrix vector multiplication in O(m 2 + n) time without need to use GVT.

MLPK
The MLPK kernel (Vert et al., 2007) is computed as the ranking kernel squared. The feature vector is given by the tensor product of the pairwise differ- ..r,j=1...r . This models interaction of a pair in the terms of how similar the drug and the target in the pair are. The formulation compares both elementwise differences, and possible interactions between the differences. The MLPK kernel can also be motivated as a distance learning problem by adding an extra parameter constraint to the standard SVM optimization problem (Vert et al., 2007). There, the goal is to learn a linear map such that the function is modelled by the Euclidean distance metric between feature vectors: learn a positive

Cartesian
The Cartesian kernel (Kashima et al., 2009b) is computed as the drug kernel when the targets match, and the target kernel when the drugs match. The feature vector is given as a concatenation of the drug feature vector (target specific) and the target feature vector (drug specific) ..m,j=1...s corresponding to drug and target specific features. The full parameter vector w can be partitioned into drug specific (w d ) d∈D and target specific (w t ) t∈T parameters, with separate parameters learned for each drug and target. This means that target features may have different effects, depending on the drug, and vice versa. In this sense the learned model includes pairwise interactions, but it does not utilize information between similar interactions in different pairs and cannot generalize to drugs or targets that have not been seen in the training set. Kashima et al. (2009b) show that the Cartesian kernel can be represented as a Kronecker sum, and thus using the standard vec trick (Roth, 1934) kernel matrix multiplication can be done in O(m 2 q + q 2 m) time. In this work, we improve on this result.

Efficient computation of pairwise kernels
In this section we show how the pairwise kernel matrices of the above described kernels can be conveniently written as sums of Kronecker product matrices. For this purpose, we make the following definitions.

Definition 1 (Commutation and unification operators)
The commutation operator P ∈ R (T ×D)×(D×T ) has its values defined as Note that if the domains D and T are different, the row indexing of any operator will be changed from D × T to T × D if multiplied from left with P. Its inverse operator P T ∈ R (D×T )×(T ×D) is defined analogously, by switching the drug and target domains. The values are also defined similarly when D = T but in this case we use the notation The unification operator Q ∈ R (D×T )×(D×D) has its values defined as: The corresponding unification operator Q ∈ R (T ×D)×(T ×T ) is defined analogously, by switching the drug and target domains. The values are also defined similarly when D = T but in this case we use the notation For convenience, we also give the values of the product of operators PQ ∈ R (D×T )×(T ×T ) : as this product is also heavily used in the forthcoming considerations.
In the following example, we illustrate finite dimensional examples of both commutation and unification operators that are, due to their finiteness, representable as matrices.
Example 1 Consider a finite space of drugs of size |D| = 3 and a finite space of targets |T | = 2. Then, the commutation operator P ∈ R (T ×D)×(D×T ) can be represented as the following matrix: where the rows and columns are arranged in the natural order of the drugtarget pairs and drug-drug pairs, respectively.
From the above definition of the commutation and unification operators, we obtain a cheat sheet of rules indicated by the following lemma: where D ⊙2 denotes the elementwise square of D, and 1 ∈ R T ×T is an operator with all values equal to one.
For the values, we have where Q ∈ R (D×T )×(D×D) , and Finally, if D = T , we further have: Proof The listed results are straightforward operator algebraic manipulations based on Definition 1.

⊓ ⊔
From the above results, we can conclude the following results concerning certain specific pairwise kernels in particular:

Corollary 1 The kernel matrices of the linear, second order polynomial, Kronecker product, Cartesian, symmetric, anti-symmetric, ranking and metric learning pairwise kernels for two samples of data, say X = (d, t) and X = (d, t), can be expressed as R(d, t)K D,T R(d, t), where K D,T is the corresponding operator of all kernel values as follows:
Kernel Proof We first show that the kernel matrices over the whole domain of D and T can be compactly expressed with the operator notation and show the indexed case afterwards.
Now, recall that if we have two samples of data, say X = (d, t) and X = (d, t), and we intend to calculate all kernel evaluations between data in the first sample with the second sample, the matrix consisting of these kernel evaluations is defined as follows:

t)K kernel (D, T)R(d, t) T
By setting d = d and t = t we may as a special case define the kernel matrix for the training data.
We also have the following rules on permuting either of the indexing matrices with the commutation or the unification operator: To obtain the incomplete data pairwise kernel matrix, we multiply the complete data pairwise kernel matrix K kernel (D, T) with the indexing matrix R(d, t) and R(d, t) T from left and right sides, respectively. The complete data pairwise kernel matrix is a sum of permuted Kronecker product matrices, so these results imply different indexing matrices for each term in the sum. We can then apply GVT to each term separately.
We have two ways of calculating an identical matrix-vector product given kernel matrices and sample indices d, t, d, t, with vectors a ∈ R n , u ∈ R t : 1. Use the standard matrix vector product with the kernel matrix: u ← Ka, 2. Use GVT in Theorem (1): u ← vectrick(D, T, d, t, d, t, a).
In computing the pairwise kernel matrix K = R(d, t)K kernel (D, T)R(d, t) T , only elements in the indexing matrices need to be computed. The computational complexity of implementing approach 1 directly is O(nn). Based on Theorem 1 and Corollary 1, the complexity of approach 2 for any of the kernels listed in Table 4 is O(min(qn + mn, mn + qn)). For the training kernel matrix, these complexities can be simplified as O(n 2 ) and O(qn + mn).

Data sets
We apply the pairwise kernel learning framework to four biological data sets. As shown in Table 5, the data sets have quite different characteristics. They vary in the number of samples, ratio of drugs to targets, homogeneity, density, and features. While our data sets belong to the same domain, the different prediction tasks provide an useful benchmark on how pairwise kernels perform over different applications.  Table 5 Data sets used in the experiments. We report for each data set the number of pairs and unique drugs and targets, and whether the data is homogenous. Density is the fraction of drug target pairs that have known labels. We denote the number of drug kernels |D|, target kernels |T| and pairwise kernels |K|.

Heterodimer
Many proteins bind together and form multiprotein structures called protein complexes, which have essential roles in a variety of biological functions. To understand how proteins function, one needs to identify those sets of proteins that form complexes. A significant fraction of known protein complexes are heterodimers, that is, formed by the assembly of only two proteins. Recent research has taken into account information from measured protein-protein interactions and other possible protein information sources in order to develop new methods for predicting complexes, especially for smaller sizes (Ruan et al., 2018;Maruyama, 2011;Ruan et al., 2013). Labels for a heterodimer data set can be generated from databases of curated protein complexes. We created positive and negative examples following a paper which applied Naive Bayes for supervised learning of heterodimers (Maruyama, 2011). The labels are based on CYC2008 (Pu et al., 2008), a comprehensive catalogue of 408 manually curated yeast protein complexes, and WI-PHI (Kiemer et al., 2007), a dataset 49607 (protein,protein)-interactions. A positive (negative) example of a heterodimer is a pair of proteins satisfying the following conditions: 1. is (is not) a heterodimeric protein complex in CYC2008 2. is not (is) a proper subset of any other complex in CYC2008 3. WI-PHI includes the PPI corresponding to it Following research that sought to improve heterodimer predictions (Ruan et al., 2018), we added protein features by considering domain, phylogenetic profile and subcellular localization properties. The idea is that proteins having a similar specification are more likely to form a complex because they are functionally linked. We obtained the domain and subcellular location information from UniProtKB and the phylogenetic profile from KEGG OC (Nakaya et al., 2012). The feature map φ for each of the 1526 proteins is one of three binary vectors (length in parenthesis): 1. φ dom (P i ) j : the j-th domain occurs in the protein P i (2554), 2. φ phylo (P i ) j : the j-th genome contains the homolog P i (768), 3. φ local (P i ) j : the j-th subcellular localization contains the protein P i (83).
We computed the protein kernels D using the Tanimoto kernel on these binary feature vectors. Given binary vectors v and v of length l, it is defined as the ratio of bits set to 1 in both vs. bits set to 1 in either:

Metz
Understanding interactions beween chemical compounds and cellular targets is an important research topic in biology. For example, protein kineases control many aspects of the cell life cycle, and drugs that inhibit specific kineases have been developed to treat several diseases. Large-scale bioactivity assays enable the prediction of interactions across wide panels kinease inhibitors and their potential cellular targets. In particular, supervised machine learning is a promising approach of predicting interactions since it can use structural similarities among the drug compounds and genomic similarities among target proteins.
Labels for an interaction data set were based on biochemical selectivity assays for clinically relevant kinease inhibitors by Metz et al. (2011). The interaction affinity between a ligand molecule (e.g. a drug compound) and a target molecule (e.g. a protein kinease) reflects how tightly the ligand binds to a particular target, quantified using the inhibition constant K i . The smaller the K i bioactivity, the higher the interaction affinity between the chemical compond and the protein kinase. We binarized the real valued interactions using a relatively stringent threshold of K i < 28.18nm into 2798 interacting and 90 558 non-interacting pairs. Following a study that investigated how well machine learning based methods work in different prediction tasks (Pahikkala et al., 2015a), we extracted features for both drugs and targets. Drug features were based on chemical properties, where structural fingerprint similarity was computed as the two dimensional (2D) Tanimoto coefficient based on the structure clustering server at PubChem. Target features were based on genomic data, where sequence similarities were computed using a normalized version of the Smith-Waterman (SW) score. In total, we have 156 drugs and 1421 targets, with a symmetric 156 x 156 (drug, drug)-similarity matrix X d and a symmetric 1421 x 1421 (target,target)-similarity matrix X t . Following the previous study, we used the drug and target similarity matrix rows as feature vectors, computing either a linear kernel k Linear (x i , x j ) = x i , x j or a Gaussian kernel k Gaussian (x i , x j ) = e −γ xi−xj 2 with bandwidth γ = 10 −5 [4]. Assuming that target and drug kernels have the same specification, we then have either linear or Gaussian drug kernels D and target kernels T.

Merget
A study of similar drug bioactivity prediction appears in Cichonska et al. (2018), where the task was also to predict the interaction affinity between drug compounds and protein kinease targets. The authors evaluated the pairwise kronecker kernel resulting from 3210 different combinations of 10 drug and 320 target kernels. Many of the pairwise kernels were created by using varying choices of target kernel hyperparameters. This study is interesting for our purposes, because we can use these kernels to evaluate how different pairwise kernels compare on different features.
The labels were created by processing the drug-target interactions in Merget (Merget et al., 2016) updated with ChEMBL bioactivities (Sorgenfrei et al., 2018). The authors used only drugs that had more than 1% of bioactivities across kineases measured and only kineases with both domain and ATP binding pocket amino acid subsequences at PROSITE (Sigrist et al., 2012). This resulted in 2967 drugs and 226 protein kinases, with a total of 167 995 binding values.
The features were defined directly through multiple kernel functions for both drugs and targets. Drug kernels D were based on Tanimoto kernels using 10 different binary molecular fingerprints obtained with rcdk R package (Guha et al., 2007). Given a fixed choice of hyperparameters they had 9 different protein kernels T: three Gaussian kernels based on gene ontology (GO) annotations, three kernels based Smith-Waterman (SW) sequence similarities, and three generic string (GS) kernels. Gaussian kernels were based on three GO profiles: molecular function, biological process and cellular components. The SW kernels and GS kernels are both based on three possible amino acid sequences: full kinase sequences, kinase domain subsequences and ATP binding pocket subsequences. These kernels used BLOSUM 50 as amino acid descriptors. These 9 protein kernels were originally expanded into 320 different kernels by varying the choice of hyperparameters.

Kernel filling
In the final experiment, we use the data set in Cichonska et al. (2018) to define a novel prediction task that has an even larger data set, in order to use it for scalability experiments. The authors calculated 10 different drug kernels (D i ) i=1...10 , which can be used both as labels and as features in a kernel filling prediction task. Given n = 2967 drugs, each drug kernel is a D i ∈ R 2967×2967 matrix. If some of the 2967 × 2967 = 8803089 possible entries are missing, they can be predicted using another kernel that has these entries. For a choice of kernels i = j, denote Y = vec(D i ) as the label vector and D j as the drug kernel. The drug kernel is plugged into a pairwise kernel to predict the label vector.
To create a smaller data set, we can sample an n × n submatrix from both kernel matrices, and split these entries into n train training samples and use remaining samples as setting 1 test samples. The entries outside the submatrix are test samples in settings 2, 3, and 4. The original data set is dense and real-valued; each (drug, drug)-pair has a latent feature vector encoded by the second kernel. Because we are predicting kernel encoded similarities of n drugs that belong to the same domain, the data set is also homogeneous. We implemented ridge regression with the minimum residual optimization method, which is an iterative method for the numerical solution of a system of linear equations. The matrix vector products required within the minimum residual method were computed with either of two algorithms. Given a vector to be multiplied with a pairwise Kronecker kernel matrix, the baseline algorithm uses the explicit kernel matrix and the standard matrix vector product, whereas the fast method uses the GVT algorithm. We used the scipy.sparse.linalg.minres method in the SciPy library. The method CGKron-RLS in the RLScore software package includes an user friendly implementation of GVT (Pahikkala and Airola, 2016), for example. These two methods are identical except for the calculation of the matrix vector products.
Instead of solving the system completely, the minimum residual method can be run up to a given number of iterations. To speed up training, iterations may be stopped before the least squares solution is reached. In practice, a limited number of iterations is often sufficient to reach optimal model performance, where a separate validation set can be used to check whether model performance increases with more iterations. Limiting the number of iterations is also an effective regularization method, known as early-stopping in the literature. A method that includes early stopping therefore has the number of iterations k as a hyperparameter. Regularization can then be performed either by setting the Tikhonov regularization parameter λ to a small constant and limiting the number of iterations k, or finding an optimal λ and stopping iterations when the model has converged. Figure 3 illustrates the effect of early stopping in the Ki data set. The best validation set AUC was reached either by stopping the training early, or by finding the optimal regularization parameter and running the iterations until convergence.
We implemented early stopping ridge regression as follows. The algorithm fits ridge regression ridge(Z obs , k D , k T , k D,T , setting) given a data set Z obs , drug kernel k D , target kernel k T , pairwise kernel k D,T , and setting. We use 9fold cross-validation, according to the setting (see Table 1), to split the data set into Z train and Z test pair during each round. On each round of cross-validation, the training set Z train is further split into an inner training set Z inner and a validation set Z validation according to the setting. The optimal hyperparameter for the number of iterations k is then found by running the minimum residual algorithm on Z inner until the AUC stops increasing in Z validation for a given number of iterations. The number of iterations required and the observed AUC on the validation set is stored. Finally, the model is fit to the full training set Z train using this many iterations. The resulting model is used to make predictions for the test set Z test , for which the AUC is measured. We tested different pairwise kernels, features and settings in the heterodimers data set. The experiment included every combination of following choices:

D,T
, k MLPK D,T } 3. Setting ∈ {1, 2, 3, 4} splits Z train into 75% Z inner and 25% Z validation . We fit ridge regression in Z inner while the AUC in Z validation is improving.

4.
We then train a model ← ridge(Z train , k D , k T , k D,T , setting) with the optimal number of iterations and calculate the AUC corresponding to Setting ∈ {1, 2, 3, 4} in Z test .
The results in Figure 4 show that the best pairwise kernel strongly depends on features. For domain features, the MLPK is by far the best pairwise kernel with almost perfect predictions. However, for genome and location features the best kernels are the second degree polynomial and symmetric Kronecker kernels by a notable margin. While the best pairwise kernel depends on the underlying features, using different drug kernel (Min/MinMax/Norm) for the binary feature vectors did not have significant effects, so we report only the Tanimoto, or MinMax, kernel. The best kernel does not seem to vary by the setting, but the later settings are slightly more challenging. The linear kernel that excludes pairwise interactions, and simply models some proteins having more interactions than others, offers suprisingly good results. However, it seems that in this data set there are also significant pairwise interactions that the other kernels are able to capture.

Cartesian
Kronecker Linear Poly2D Kernel ( 3. Setting ∈ {1, 2, 3, 4} splits Z train into 75% Z inner and 25% Z validation . We fit ridge regression in Z inner while the AUC in Z validation is improving. 4. We then train a model ← ridge(Z train , k D , k T , k D,T , setting) with the optimal number of iterations and calculate the AUC corresponding to Setting ∈ {1, 2, 3, 4} in Z test .
The results in Figure 5 show that for both Linear and Gaussian drug kernels, the second degree polynomial and Kronecker pairwise kernels have the best and comparable performance, because they also include pairwise interactions. The linear kernel offers suprisingly good results, not very far from optimal, but there are also some pairwise interactions that contribute to the prediction task. The cartesian kernel is not much better than random on the task. There seem to be some benefits from using the Gaussian instead of the linear drug kernel, which are comparable in magnitude to the benefits from modeling pairwise interactions. Regardless of the drug kernels used as features, over different experiments the relative pairwise kernel performance is the same.

Cartesian
Kronecker Linear Poly2D Kernel ( , k Cartesian D,T } 3. Setting ∈ {1, 2, 3, 4} splits Z train into 75% Z inner and 25% Z validation . We fit ridge regression in Z inner while the AUC in Z validation is improving. 4. We then train a model ← ridge(Z train , k D , k T , k D,T , setting) with the optimal number of iterations and calculate the AUC corresponding to Setting ∈ {1, 2, 3, 4} in Z test .
We obtain close to identical results for different (drug kernel, target kernel)pairs, so we only present the first two pairs. The results in Figure 6 closely mirror the Metz data set. Polynomial and Kronecker kernel are the best with comparable performance in all pairs. Linear kernel has almost as good results, even though some pairwise interactions can be found between the drugs and the targets. Cartesian kernel is not much better than random, with an exception in setting 3. Over all possible drug and target kernel pairs, different features do not seem to have much of an effect on prediction performance or relative order of kernels. This is suprising given that the original study was motivated as a method that enables one to use a large mixture of different kernels to improve prediction performance.

Kernel filling
We predict the missing labels in a drug kernel matrix y = vec(D circular ) using another drug kernel matrix D = D estate as features. Different choices of drug kernels for labels and features result in a drastically different absolute prediction performance. However, not much difference is observed in the relative order of the pairwise kernels. For brevity, we therefore report the experiment on these two kernels, as they offered reasonable but not exceptionally high or low prediction performance.
Because there is so much data available in this task, we used separate test sets. For N samples, the data set Z obs is split into a (Z train , Z  } 2. The setting ∈ {1, 2, 3, 4} splits the data set Z train into 75% training set Z inner and 25% validation set Z validation . We iteratively fit early stopping ridge regression in Z inner while the AUC in Z validation is improving, and then save the optimal number of iterations. We then train the model on Z train for that many iterations and evaluate the AUC on Z (setting) test . The number of iterations required to reach an optimal model is shown in Figure 7. The number of iterations depends on the performance it is possible to achieve in a given setting. More iterations are needed to find a more elaborate model when it is possible to achieve a better prediction performance. Setting 1 requires most iterations, setting 2/3 somewhat less, and setting 4 fewest iterations to reach an optimal solution. Fitting the MLPK and symmetric Kronecker kernel seem to require significantly more iterations than other kernels. Note how the number of iterations is very modest, relative to the total number of samples that is theoretically needed to fully solve the linear system.
The CPU time in seconds and memory usage in GiB are shown in Figure  7, respectively. The standard method requires significantly more time than the GVT method. At a time when the standard method ran out of memory, the training was taking over an hour whereas GVT completed in a second. The performance of GVT has a small constant term depending on how many summands of Kronecker kernels are required in the pairwise kernel expression. The Kronecker kernel is fastest of these because it has only one term and the MLPK slowest because it has 10 such terms. The naive method requires significantly more memory because it stores the full O(n 2 ) pairwise kernel matrix whereas GVT only stores the O(m 2 ) drug and O(q 2 ) kernel matrices. Here we have n ≈ 0.5q 2 , which implies complexities O naive (q 4 ) vs. O GVT (q 2 ). The naive method experiments were stopped when N required > 16GiB memory, which did not become an issue with GVT for the size of this data set.
Prediction performance comparisons, quantified with the AUC in Figure  7, are quite complicated because they depend on the setting and the size of the data set. We make the following observations in different settings: 1. Setting 1: The MLPK kernel has slightly higher performance for larger data sets N > 10000. The Kronecker, second degree polynomial, and symmetric Kronecker kernels are comparable to each other and quite close to the MLPK. For medium data sets N ≤ 10000, they perform better and incorporating prior knowledge via symmetrization may provide a small benefit. The linear kernel is significantly worse except for very small data sets N ≤ 1000. 2. Setting 2/3: The settings are equivalent because the domain is homogeneous. The MLPK kernel has worst performance for all data set sizes, and the linear kernel is significantly worse except for the smallest N ≤ 1000 data sets. The Kronecker, second degree polynomial, and symmetric Kronecker kernels have best and almost identical performance. The general prediction accuracy is somewhat lower because the prediction task has become more difficult 3. Setting 4: The results are similar to setting 2/3, but the overall prediction accuracy is even slightly lower.

Comparison to Nyström approximation with Falkon
The method proposed in this article allows computing efficiently the exact solution to the regularized risk minimization problem for a family of commonly used pairwise kernels. In the following experiments, we compare the approach to a standard approximation method that speeds up training by using only a random subset of the training data to represent the learned function. Specifically, we compare the proposed method, implemented in RLScore package (Pahikkala and Airola, 2016), to the Nyström-method based training algorithm implemented in the Falkon package (Rudi et al., 2017;Meanti et al., 2020). The Nyström approximation allows speeding up kernel methods on large data sets, though there is a trade-off in accuracy if the approximation is not sufficient. This introduces an additional hyperparameter: the number of basis vectors N used as an approximation. The method then computes the kernel K ∈ R n×N between data points and basis vectors, and Falkon solves the resulting linear system using a preconditioned conjugate gradient optimizer. Our pairwise kernel implementation inherits the falkon.kernels.kernel.Kernel class from Falkon, and is implemented as c-language extension using Cython to guarantee efficiency. The comparison was implemented on the kernel filling task described in the previous section.
Our experiments in Figure 8 collaborate theoretic results, which state that increasing the number of basis vectors will result in a higher accuracy when properly regularized. The solution converges to the full solution as the number of basis vectors approaches the number of data points. We also saw that limiting the number of basis vectors effectively regularizes the problem, as the model converges quickly and early stopping results in an identical solution. However, a kernel matrix with 1 024 000 samples and 2048 basis vectors already consumes 16GiB memory so we use this as the best approximation. To align the results with RLScore, we use a regularization parameter λ = 1e − 5 and early stopping based on a validation set.
In Figure 9, we perform experiments to compare the Falkon method with N = 32, 128, 512, 2048 basis vectors against RLScore that computes the full solution using GVT, both with the Kronecker product kernel. The experiments are otherwise identical to the previous experiment with the standard kernel method versus the GVT based method. We see that the quality of the approximation increases with the number of basis vectors, almost reaching the AUC of the RLScore implementation. RLScore has lower runtime and lower memory requirement. We conclude that both methods provide a drastically smaller runtime and memory use compared to the standard method, but are quite comparable to each other in computational requirements. RLScore provides slightly better AUC with less computational resources, especially in Setting 1.

Discussion and Conclusion
In this work we reviewed the most commonly used pairwise kernels and introduced an operator based framework for analysing and implementing the kernels. The framework allows applying the generalized vec-trick algorithm  for speeding up matrix-vector products for the kernels, allowing much faster training and prediction than with explicit computation of the kernel matrix. As a specific use case we considered the ridge regression method, but the approach can be also used for speeding up the (sub)gradient computations for other types of regularized kernel methods, such as kernel logistic regression or support vector machines. Our experiments on drug-target data show that the approach allows scaling to much larger problem sizes than without the computational short cuts, and provides better predictive performance with the same computational resources than Falkon (Rudi et al., 2017), a the state-of-the art method for training large-scale kernel machines. Further, the choice of optimal kernel is seen to be highly dependant on both the problem domain and the type of prediction task considered. An interesting observation that can be made from the experimental results is that in many cases the linear pairwise kernel produces results that are competitive with those obtained using the non-linear kernels. This is surpris-ing in the sense that the kernel allows only expressing functions of the form f (d, t) = f d (d) + f t (t) that score the drugs and targets separately without truly modeling interactions between them. It seems implausible that the nonlinearity assumption would not hold in the domain of drug-target interaction prediction, or other similar interaction prediction tasks, since this would imply the existence of "universal drug" that would be the optimal choice for all targets. We observed from our experimental results that, with larger sample sizes (see Figure 7), the nonlinear kernels were better able to capture the nonlinear components of the underlying signal, and the relative strength of the nonlinear part likely determines how large training sample is needed to capture it. An example showing the extreme cases containing either nonlinear or linear signal components only is given in Figure 1. The nonlinear component also may easily "drown" with high dimensional data, as the number of interaction terms increases quickly as its function.
We make the GVT code publicly available as part of the open source RLScore machine learning library (Pahikkala and Airola, 2016), allowing other researchers and developers to make use of the described kernel matrix multiplication short cuts. Our work considers the specific case of pairwise data, an open question remains under what conditions similar efficient methods can be derived in general to nth order tensorial data, which could be a Kronecker product of more than two kernel matrices. For example, the data may consist of triplets (drug, target, cell line) where each object in the triplet has its own kernel. Author's contributions MV wrote the initial draft of the manuscript, with contributions from AA and TP. All the authors edited together and approved the final version. MV was responsible for data processing and experiments, with experimental design done jointly by all the authors. All the authors contributed to writing code. AA and TP jointly planned the study deriving the joint framework for pairwise kernels, with contributions from MV.