The modeling choices made in the basic GaSPK framework described in Sect. 2 are designed to give flexible and powerful modeling capacity, allowing us to obtain high-quality predictive performance. However, the goal of this work is to combine state-of-the-art performance with computational efficiency. As described in Sect. 2, a naïve implementation of GaSPK will not scale well as the amount of data grows, since GP inference typically scales cubically with the number of datapoints. Further, the non-Gaussian likelihood means we are unable to evaluate the posterior analytically, and must make judicious approximate inference choices to ensure scalability.
In this section we address these issues of scalability. We first introduce modeling choices that facilitate scalable inference (Sect. 3.1), then develop a scalable approximate inference scheme in Sect. 3.2. Our inference algorithm alternates between using Laplace’s method to efficiently obtain the approximate posterior distribution \( p(f^c | C) \approx q(f^c | C)\) of characteristic trade-off evaluations, and estimating the user characteristics \({\varGamma }\) and the hyperparameters \(\theta \).
Structured Gaussian processes
When we condition on our finite set of trade-offs T, inferences about the \(f^c\) correspond to posterior inference in a multivariate Gaussian. Evaluating the covariance function k at all pairs of observed trade-offs \((t_1, t_2)\) yields the covariance (kernel) matrix K necessary for this posterior inference. Importantly, the cost of many key operations on K grows cubically in the number of unique trade-offs, which presents naïve inference methods with significant scalability challenges.
However, since our covariance structure factorizes across dimensions (Eq. 2), if we are able to arrange our inputs on a grid, we can formulate our model using Kronecker covariance matrices. Kronecker covariance matrices have favorable factorization and decomposition properties that, as we describe in this section, facilitate scalable inference. While Kronecker-structured covariances have appeared in other preference learning models (Bonilla et al. 2010; Birlutiu et al. 2013), we believe we are the first to exploit their computational advantages in this context.
In particular, as we will see later in this section, Kronecker covariance matrices are particularly appealing when our input space can be expressed as a low-dimensional grid with a fairly small number of possible values along each dimension. Our problem setting is well suited to the use of such a structure. Consumer and econometric research has established that consumers focus on relatively small subsets of attributes, as well as on few possible values thereof, when choosing amongst alternatives, e.g., Caussade et al. (2005) and Hensher (2006). Motivated by this, we consider settings in which (1) the number of users, instances, and observed choices is large and naïve methods are therefore computationally infeasible; (2) trade-offs can be represented by a small number of attributes; and (3) each attribute has a small number of values, or can be discretized. We show that when alternatives can be represented by a small number of attributes and values, it is possible to obtain matrices K which are large, but on which important operations can be performed efficiently. In the empirical evaluations that follow, we demonstrate that this approach not only yields computational advances but also, despite introducing approximations, produces predictive performance that is often superior to what can be achieved with current scalable approaches.
Concretely, we assume that trade-offs can be arranged on a \(d_T\)-dimensional grid, and let \(T_d\) denote the set of unique values that occur on the dth attribute in T. In our electricity tariffs example, trade-offs can be characterized by (1) price differences per kWh, and (2) differences in renewable sources, so that we may have the following unique trade-off values: \({T_1 = \left\{ -0.10, -0.09, \ldots , 0.09, 0.10\right\} }\) and \({T_2 = \left\{ -1, 0, 1 \right\} }\). Not all possible combinations of trade-offs are always observed \((|T| < |T_1| \cdot |T_2| = 63)\), and the covariance matrix \({\widetilde{K} = \left[ k(t, t')\right] _{t, t' \in T}}\) is therefore significantly smaller than \(63 \times 63\). A Gaussian process applied to such a structured input space is known as a structured GP (Saatci 2011).
The key notion of structured GPs is that, rather than working directly with \(\widetilde{K}\), we can instead work with a larger matrix of the form (Saatci 2011):
$$\begin{aligned} K = K_1 \otimes \cdots \otimes K_{d_T} \end{aligned}$$
where \(\otimes \) denotes the Kronecker product. The factors \(K_d\) hold the covariance contributions of the d-th dimension and they are generally much smaller than \(\widetilde{K}\) (in our example, \(K_1 \in {\mathbb {R}}^{21 \times 21}\) and \(K_2 \in {\mathbb {R}}^{3 \times 3}\)). The Kronecker matrix K, on the other hand, holds the covariances between all trade-offs in the Cartesian product \(T_1 \times \cdots \times T_{d_T}\), and it is thus much larger (in our example, \(K \in {\mathbb {R}}^{63 \times 63}\)).
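For concreteness, the following NumPy sketch builds the factor matrices and the full Kronecker covariance for the running electricity-tariff example. It is purely illustrative (not the authors' implementation); the squared-exponential kernel and the length-scales are assumptions made for the example only, and K is materialized here solely to show its size.

```python
import numpy as np

def rbf_kernel(x, lengthscale):
    """Squared-exponential covariance evaluated on a 1-D set of attribute values."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# Unique trade-off values of the running example.
T1 = np.linspace(-0.10, 0.10, 21)   # price differences per kWh
T2 = np.array([-1.0, 0.0, 1.0])     # differences in renewable sources

K1 = rbf_kernel(T1, lengthscale=0.05)   # 21 x 21
K2 = rbf_kernel(T2, lengthscale=1.0)    # 3 x 3

# Covariance over the full Cartesian product of attribute values (63 x 63).
# It is formed explicitly here only for illustration; the point of this
# section is that we never need to materialize K, only K1 and K2.
K = np.kron(K1, K2)
print(K1.shape, K2.shape, K.shape)      # (21, 21) (3, 3) (63, 63)
```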
The significant computational savings enabled by the Kronecker structure of K follow from the fact that, instead of explicitly generating and manipulating \(\widetilde{K}\), it is now possible to operate on the much smaller factors \(K_d\). In this setting, several key matrix operations involving K can be performed efficiently. Most importantly:
- Matrix-vector products of the form Kb can be computed at a cost that is linear in the size of b, in contrast to the quadratic cost entailed by standard matrix-vector products. This follows from the fact that \(\left( K_i\otimes K_j\right) \text{vec}(B) = \text{vec}\left( K_iBK_j^T\right) \), where \(b=\text{vec}(B)\); since the number of nonzero elements of B is the same as the length of b, this operation is linear in the length of b (see the sketch following this list).
As we will see in Algorithm 2, such products are required to find the posterior mode of our GPs and in general dominate the overall computational budget; this speed-up means that they are no longer the dominant computational cost.
- Eigendecompositions of the form \(K = Q {\varLambda } Q^T\) can be computed from the Eigendecompositions of the \(K_d\):
$$\begin{aligned} Q = \bigotimes _{d=1}^{d_T}Q_d\quad {\varLambda } = \bigotimes _{d=1}^{d_T}{\varLambda }_d \end{aligned}$$
at cubic cost in the size of the largest \(K_d\). This is a consequence of the mixed product property of Kronecker products, which states that \((A\otimes B)(C\otimes D) = (AC)\otimes (BD)\) and therefore
$$\begin{aligned} (Q_i\Lambda _iQ_i^T)\otimes (Q_j\Lambda _jQ_j^T) = \left( (Q_i\Lambda _i)\otimes (Q_j\Lambda _j)\right) (Q_i^T\otimes Q_j^T) = (Q_i\otimes Q_j)(\Lambda _i\otimes \Lambda _j)(Q_i\otimes Q_j)^T \end{aligned}$$
In particular, this allows us to efficiently determine the Eigenvectors corresponding to the \(n_e\) largest Eigenvalues of K, which lets us obtain computational speed-ups by replacing K with a low-rank approximation; both operations are sketched in the code below.
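The following NumPy sketch illustrates both operations. It is a minimal illustration under our own naming (kron_mv and kron_top_eig are not the authors' routines); it assumes row-major vectorisation, which is NumPy's default and matches the identity as stated above, and the matrix-vector product is written for the two-factor case of the running example.

```python
import numpy as np
from functools import reduce

def kron_mv(Ki, Kj, b):
    """(Ki ⊗ Kj) b via (Ki ⊗ Kj) vec(B) = vec(Ki B Kj^T), with b = vec(B)
    under row-major vectorisation (NumPy's default reshape order)."""
    B = b.reshape(Ki.shape[1], Kj.shape[1])
    return (Ki @ B @ Kj.T).reshape(-1)

def kron_top_eig(Ks, n_e):
    """Eigenvectors of the n_e largest Eigenvalues of K = K_1 ⊗ ... ⊗ K_D,
    computed from the small per-dimension Eigendecompositions only."""
    lams, vecs = zip(*(np.linalg.eigh(Kd) for Kd in Ks))   # cubic only in each |T_d|
    prods = reduce(np.multiply.outer, lams).reshape(-1)    # all products of factor Eigenvalues
    order = np.argsort(prods)[::-1][:n_e]                  # indices of the n_e largest
    multi = np.unravel_index(order, [len(l) for l in lams])
    Q = np.column_stack([
        reduce(np.kron, [V[:, i] for V, i in zip(vecs, idx)])  # Kronecker product of factor Eigenvectors
        for idx in zip(*multi)
    ])
    return prods[order], Q        # K is approximately Q diag(S) Q^T for large enough n_e

# Quick check on the running example sizes (21 * 3 = 63).
rng = np.random.default_rng(0)
K1 = rng.standard_normal((21, 21)); K1 = K1 @ K1.T   # symmetric PSD stand-ins
K2 = rng.standard_normal((3, 3));   K2 = K2 @ K2.T
b = rng.standard_normal(63)
assert np.allclose(kron_mv(K1, K2, b), np.kron(K1, K2) @ b)
```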
Furthermore, note that all operations can be implemented by considering only the set of unique, observed or predicted trade-offs. This reduces the region under consideration from the large space covered by K to a manageable superset of T. Unobserved trade-offs can be modeled through infinite noise variances in Eq. (3). The corresponding likelihood terms then evaluate to indifference (\(p = 0.5\)), and their derivatives to zero. The latter yield even sparser matrices W and L in Algorithms 2 and 4 below, which can be directly exploited via standard sparse matrix operations.
Approximate inference in GaSPK
The Kronecker structure described above has proved useful in a regression context, but requires careful algorithmic design to ensure its benefits are exploited in the current context. In Sect. 3.2.1, we develop a scalable inference algorithm using Laplace’s method to estimate the posterior distributions \(p(f^c|C,{\varGamma })\).
In a full Bayesian treatment of GaSPK, we would consider \({\varGamma }\) another latent quantity of interest, and infer its posterior distribution. Previous work has addressed similar challenges by either imposing a Gaussian or a Dirichlet process prior on \({\varGamma }\) (Houlsby et al. 2012; Abbasnejad et al. 2013). However, these approaches are computationally expensive, and it can be hard to interpret the resulting joint distribution over weights and characteristics. Instead, we treat \({\varGamma }\) as a parameter to be estimated; in Sect. 3.2.2 we show that we can either find the maximum likelihood value by optimization, or find a heuristic estimator that we show in Sect. 5 performs well in practice at a much lower computational cost.
We combine these two steps in an EM-type algorithm (Dempster et al. 1977) that jointly learns \({\varGamma }\) and the posterior distribution over the \(f^c\). The algorithm is outlined in Algorithm 1.
In the E-step, we use Laplace’s method to approximate the conditional expectation \(E[f^c|C,{\varGamma }]\) with the posterior mode \(\hat{f^c}\), as described in Sect. 3.2.1. We then obtain one of the two estimators for \({\varGamma }\) described in Sect. 3.2.2: an optimization-based estimator that corresponds to the exact M-step but is slow to compute, or a heuristic-based estimator that is significantly faster to compute. In practice, we suggest using the heuristic-based estimator; as we show in Sect. 5, this approach strikes a good balance between predictive performance and computational efficiency.
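The overall structure of Algorithm 1 can be summarized by the schematic loop below. This is a sketch only: laplace_mode and update_gamma_heuristic are hypothetical placeholders for the routines developed in Sects. 3.2.1 and 3.2.2, and the data layout is our own.

```python
import numpy as np

def fit_gaspk(choices, Ks, n_users, n_c, n_iter=20, tol=1e-4):
    """Schematic EM-type loop of Algorithm 1 (placeholders for the E- and M-steps)."""
    n_T = int(np.prod([Kd.shape[0] for Kd in Ks]))
    F = np.zeros((n_c, n_T))                      # posterior modes of the f^c
    Gamma = np.full((n_users, n_c), 1.0 / n_c)    # start from uniform user characteristics
    for _ in range(n_iter):
        # E-step: Laplace approximation of p(f^c | C, Gamma)   (Algorithm 2)
        F_new = laplace_mode(choices, Ks, Gamma)        # hypothetical placeholder
        # M-step: heuristic re-estimation of the user characteristics   (Algorithm 4)
        Gamma = update_gamma_heuristic(choices, F_new)  # hypothetical placeholder
        if np.max(np.abs(F_new - F)) < tol:       # stop once the modes stabilize
            break
        F = F_new
    return F, Gamma
```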
Learning the latent functions \(f^c\) conditioned on \({\varGamma }\)
Inferring the \(f^c\) is complicated by the fact that the posterior \(p(f^c | C,{\varGamma })\) is analytically intractable under the Probit likelihood. Discrete choice models often use sampling-based methods to approximate the posterior (Allenby and Rossi 1998; Train 2003). However, sampling is slow, particularly for high-dimensional models based on GPs. Alternatives include Laplace’s method, Expectation Propagation, and Variational Bayesian methods, all of which seek to approximate \(p(f^c | C)\) with a similar distribution \(q(f^c | C)\) that can be computed and represented efficiently (Bishop 2006).
In this paper we use Laplace’s method, because it is computationally fast and conceptually simple. Laplace’s method is a well known approximation for posterior inference in regular GPs (Rasmussen and Williams 2006) and simpler preference learning scenarios (Chu and Ghahramani 2005). Laplace’s method aims to approximate the true posterior p with a single Gaussian q, centered on the true posterior mode \(\hat{f^c}\), and with a variance matching a second-order Taylor expansion of p at that point (see Fig. 3). Approximating the posterior with a single multivariate Gaussian allows us to conveniently re-use it as the prior in subsequent Bayesian updates, which is important for online and active learning from user interactions (Saar-Tsechansky and Provost 2004). While the approximation can become poor if the true posterior is strongly multi-modal or skewed, prior work has shown that this limitation has no significant impact in the preference learning context, e.g., Chu and Ghahramani (2005).
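As a one-dimensional illustration of the idea (not part of GaSPK itself), the snippet below applies Laplace's method to the posterior obtained from a single Probit observation under a standard normal prior: the mode is found numerically, and the variance is set to the inverse curvature of the negative log posterior at that mode.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Toy posterior: p(f | y) proportional to N(f; 0, 1) * Phi(y f), with y = +1.
y, prior_var = 1.0, 1.0
neg_log_post = lambda f: 0.5 * f ** 2 / prior_var - norm.logcdf(y * f)

f_hat = minimize_scalar(neg_log_post).x     # posterior mode
eps = 1e-4                                  # curvature at the mode via finite differences
curv = (neg_log_post(f_hat + eps) - 2 * neg_log_post(f_hat)
        + neg_log_post(f_hat - eps)) / eps ** 2
q_mean, q_var = f_hat, 1.0 / curv           # Laplace approximation q(f | y) = N(q_mean, q_var)
```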
In principle, we could directly apply the Laplace mode and variance calculations used by Chu and Ghahramani (2005), which assume a full covariance matrix. However, doing so would negate the benefit of using a structured covariance function. Instead, we formulate our calculations to exploit the properties of our covariance matrix, yielding an algorithm which, as we show later in this section, has better scaling properties than directly applying the algorithms of Chu and Ghahramani (2005).
Our development of Laplace inference in GaSPK proceeds in two steps. First, we describe an efficient procedure for finding the posterior mode \(\hat{f}\) (Algorithm 2). We then describe how the posterior variance and predictions for new trade-offs \(t_*\) can be computed (Algorithm 3). Additional mathematical details are provided in “Appendix A”.
The mode \(\hat{f}\) of the posterior is the maximizer of the log posterior \(\log p(f^c|C,{\varGamma }) \propto \log p(C|f^c,{\varGamma }) + \log p(f^c)\) which can be found by setting the first derivative of \(\log p(f^c|C,{\varGamma })\) to zero and solving for \(f^c\). Because the Probit likelihood is log concave, there exists a unique maximum \(\hat{f}\), which we obtain iteratively by using the Newton–Raphson method (Press et al. 2007) with the update step
$$\begin{aligned} \begin{aligned} f^{new}&= (K^{-1}+ W)^{-1}\underbrace{(W f + \nabla \log p(C | f))}_{b} \\&= K (b - L(I + L^T K L)^{-1} L^T K b). \end{aligned} \end{aligned}$$
(6)
We repeatedly assign \(f\leftarrow f^{new}\) and recompute Eq. (6) until f converges. The matrix W in the first line of Eq. (6) denotes the negative Hessian of the log likelihood, \(W = -\nabla \nabla \log p(C|f^c,{\varGamma })\), a sparse matrix consisting of \(n_c \times n_c\) diagonal sub-matrices of size \(n_T \times n_T\). W is computed using Eq. (10) in “Appendix A.1”, where additional computational details regarding our Probit likelihood are also provided. The sparsity of W allows us to compute its Cholesky decomposition \(W = L L^T\) in \(O(n_Tn_c^3)\) time, rather than the \(O(n_c^3n_T^3)\) time that would be typical of a dense matrix. We use this decomposition instead of W in the second line of Eq. (6), eliminating the numerically unstable \(K^{-1}\) and the unwieldy inverse of the first factor in the previous line. All matrices in the second line of Eq. (6) are of size \((n_c n_T) \times (n_c n_T)\) and therefore usually large. However, as we discuss in Sect. 3.1, L has at most \(\frac{n_T n_c (n_c - 1)}{2}\) non-zero elements (fewer if not all possible trade-offs from T are observed), and thus it is never necessary to generate K explicitly.
Using Eq. (6), we can efficiently compute the posterior mode by following the steps outlined in Algorithm 2. Note that all operations in the algorithm are simple matrix operations available in most programming environments. Furthermore, the operations in lines 6 through 8 are all matrix-vector operations which generate vectors as intermediate results. Rather than calculating the inverse in line 7 explicitly, we use conjugate gradients (Press et al. 2007) to solve the system \((I + L^TKL) x = L^T K b\) by repeatedly multiplying the parenthesized term with candidates for x, as in Cunningham et al. (2008).
Because K has Kronecker structure and L consists only of diagonal sub-matrices, multiplications with K and L have linear time and space complexity, hence the overall computational cost is dominated by the \(O(n_Tn_c^3)\) cost of the Cholesky decomposition. Without the Kronecker structure, these multiplications would be \(O(n_T^2n_c^2)\), and their cost would therefore dominate when \(n_T>n_c\).
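The following sketch shows how a single update of Eq. (6) can be organized around these ideas. It is illustrative rather than the authors' implementation: K_mv is an assumed handle that multiplies a vector by the Kronecker-structured K (e.g., via the vec identity of Sect. 3.1), and the sparse Cholesky factor L of W is assumed to have been computed block-wise as described above.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_step(f, grad_loglik, W, L, K_mv):
    """One update of Eq. (6).
    W, L    : sparse matrices with W = L L^T (negative Hessian and its Cholesky factor)
    K_mv(v) : matrix-vector product with the Kronecker-structured covariance K
    """
    n = f.size
    b = W @ f + grad_loglik                    # b = W f + grad log p(C | f)
    Kb = K_mv(b)
    # Solve (I + L^T K L) x = L^T K b by conjugate gradients, using only
    # matrix-vector products (as in Cunningham et al. 2008).
    A = LinearOperator((n, n), matvec=lambda v: v + L.T @ K_mv(L @ v))
    x, _ = cg(A, L.T @ Kb)
    return K_mv(b - L @ x)                     # f_new of Eq. (6)
```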
We next compute the variance \(V_q(f)\) of the approximate posterior q, which can be written as (Rasmussen and Williams 2006):
$$\begin{aligned} V_q(f) = \text {diag}(K) - \text {diag}(K L (I + L^T K L)^{-1} L^T K) \end{aligned}$$
(7)
The computations in Eq. (7) involve full matrix operations, and are therefore more expensive than the matrix-vector operations used for mode-finding. However, we can limit the computations to points of interest \(t_*\) only, which reduces the number of rows in K being considered. To further reduce the size of the involved matrices, we approximate K via a low-rank decomposition with exact diagonal given by:
$$\begin{aligned} K&\approx QSQ^T+ {\varLambda }, \text {~where~} {\varLambda } = \text {diag}(K) - \text {diag}({QSQ^T}) \end{aligned}$$
(8)
Importantly, the decomposition can be efficiently computed when K has Kronecker structure, as discussed in Sect. 3.1. Specifically, the matrix S in Eq. (8) is a diagonal matrix with the \(n_e\) largest Eigenvalues of K on its main diagonal. Q contains the corresponding Eigenvectors, and it has the same number of rows as K but only \(n_e\) columns. \({\varLambda }\) is a diagonal matrix of the same size as K, making the low-rank approximation of K exact on the diagonal (Quiñonero-Candela and Rasmussen 2005; Vanhatalo et al. 2010). The number of Eigenvalues \(n_e\) in the approximation is a user-defined input and can be used to balance computing time against the accuracy of the approximated posterior variance. As we will show below, even small values of \(n_e\) often yield posterior variances close to those obtained with the full matrix K. Under this low-rank approximation, Eq. (7) can be re-written as:
$$\begin{aligned} V_q(f)&\approx \text {diag}(K) - \text {diag}(K L (I + L^T (QSQ^T+ {\varLambda }) L)^{-1} L^T K) \nonumber \\&=\text {diag}(K) - \text {diag}(K {\Pi }K) + \text {diag}(K {\Pi }Q (\underbrace{S^{-1} + Q^T {\Pi }Q}_{P})^{-1} Q^T {\Pi }K) \end{aligned}$$
(9)
where P is a small matrix of size \(n_e \times n_e\), and where \({\Pi }= L (I + L^T {\varLambda } L)^{-1} L^T\) can be computed efficiently, because L is sparse and \({\varLambda }\) is diagonal. \({\Pi }\) itself is also sparse, consisting of \(n_c \times n_c\) diagonal blocks like W. Because K has Kronecker structure, the first two terms in Eq. (9) can be computed efficiently and without resorting to approximations. We address the computation of the third term next.
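A compact sketch of these quantities is given below, using our own naming and assuming that the Eigenvalues S and Eigenvectors Q of the low-rank approximation have been obtained from the per-dimension Eigendecompositions (Sect. 3.1) and that the sparse factor L is available from Algorithm 2; it is an illustration, not the authors' implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def low_rank_terms(S, Q, diag_K, L):
    """Quantities entering Eqs. (8)-(9): the diagonal correction Lambda, the sparse
    matrix Pi = L (I + L^T Lambda L)^{-1} L^T, and the small n_e x n_e matrix P."""
    # Eq. (8): Lambda makes the low-rank approximation exact on the diagonal.
    Lambda = diag_K - np.einsum("ij,j,ij->i", Q, S, Q)      # diag(K) - diag(Q S Q^T)
    # Pi = L (I + L^T Lambda L)^{-1} L^T; the inner matrix inherits the sparse
    # block structure of W, so the solve is cheap and Pi itself stays sparse.
    n = L.shape[0]
    inner = (sp.identity(n) + L.T @ sp.diags(Lambda) @ L).tocsc()
    Pi = L @ spsolve(inner, L.T.tocsc())
    # P = S^{-1} + Q^T Pi Q  (n_e x n_e), as in Eq. (9).
    P = np.diag(1.0 / S) + Q.T @ (Pi @ Q)
    return Lambda, Pi, P
```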
In Algorithm 3, we first calculate the Cholesky factor C of P (line 5), which is subsequently used in solving the system \({\Pi }QC^{-1}\). The product V in line 6 is equivalent to \(n_e\) matrix-vector products with a Kronecker matrix and is computationally inexpensive when \(n_e\) is sufficiently small. In line 7, we exploit the symmetry of the third term in Eq. (9), and the fact that only its diagonal is needed, to reduce calculations to an efficient element-wise product of the smaller V. Finally, in line 9, we use the posterior variances to calculate the predictive probabilities \(p_*\) at the trade-off points \(T_*\) using Eq. (4).
Figure 4 illustrates the output of Algorithms 2 and 3 for the choices of a single user, using data from a popular preference benchmark dataset (Kamishima and Akaho 2009). Panel (a) shows the posterior mode \(\hat{f_u} = E[f_u]\), which is, as expected, high in regions of the trade-off space perceived as favorable, and low otherwise. The bold line indicates the zero boundary \(\hat{f_u} = 0\); it is sufficient as a predictor of future choices when predictive certainty estimates are not required. Importantly, it can be computed using only Algorithm 2 and is therefore very fast.
The key distinguishing feature of our probabilistic approach is the variance estimates shown in Panel (b). As shown, the algorithm correctly identifies the region at the center of the panel where the decision boundary already follows a closely determined course that matches earlier observations (pale yellow coloring, low variance). If additional observations were to be acquired for the purpose of improving predictions, they should instead be located in the upper or lower regions of the decision boundary, where less evidence is currently available (dark red coloring, high variance). Panel (c) shows the combination of both outputs to compute the predictive probabilities \(p(y=+1 | f)\). While the decision boundary at \(p(y = +1|f) = 0.5\) is the same as the one in Panel (a), this panel also incorporates predictive variances by shrinking the predictive probabilities towards indifference \((p = 0.5)\) in high-variance regions [see Eq. (4)]. Consequently, the corridor in which GaSPK is indifferent (intermediate intensity orange coloring, intermediate probabilities) is narrower in areas with extensive evidence from the data, and wider towards the edges of the panel. This information is an important input to subsequent decision-making tasks which require information on whether existing evidence is conclusive enough to make an autonomous decision.
Learning user characteristics
To complete our EM-type algorithm, we must estimate the user characteristics \({\varGamma } = \left[ \gamma ^c_u \right] _{u,c}\) from the data. Recall from Sect. 2 that \(\gamma ^c_u\) denotes the fraction of user u’s behavior explained by characteristic c, that is, \(f_{u}(t) = \sum _{c} \gamma _u^c \cdot f^c(t)\) with \(\sum _c \gamma _u^c = 1\). An exact M-step estimator for \({\varGamma }\), which returns \(\arg \max \prod _i \Phi \left( y_i\sum _c \gamma _{u_i}^c\cdot f^c(t_i)\right) \) s.t. \( \sum _c \gamma _u^c = 1, u=1,\dots , U\), can be obtained using an interior-point optimizer. This yields a (local) optimum for \({\varGamma }\), but is computationally expensive.
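As a sketch of this exact M-step for a single user, one can use a general-purpose constrained optimizer; the snippet below uses SciPy's SLSQP solver in place of the interior-point optimizer mentioned above, and the array layout (F_u holding the values \(f^c(t_i)\) at the user's observed trade-offs) is our own convention.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def exact_m_step_user(F_u, y_u):
    """Exact M-step for one user.
    F_u : (n_c x n_i) array with entries f^c(t_i) at the user's observed trade-offs
    y_u : the user's choices, coded as +/-1
    """
    n_c = F_u.shape[0]
    neg_loglik = lambda g: -np.sum(norm.logcdf(y_u * (g @ F_u)))
    res = minimize(
        neg_loglik,
        x0=np.full(n_c, 1.0 / n_c),                       # start from uniform weights
        constraints=[{"type": "eq", "fun": lambda g: g.sum() - 1.0}],
        method="SLSQP",                                   # stands in for an interior-point solver
    )
    return res.x                                          # a (local) optimum for gamma_u
```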
As an alternative, we propose a heuristic approximation to this M-step, described in Algorithm 4. We note that if \(\gamma _u^c>\gamma _u^{c'}\), then \(f_u\) is likely to be closer to \(f^c\) than to \(f^{c'}\). Therefore, approximating \(f_u\) with \(f^c\) is likely to give a higher likelihood than approximating \(f_u\) with \(f^{c'}\). The heuristic “M-step” in line 3 computes an approximation to the likelihood that characteristic c alone generated the observed choices. Each iteration of the surrounding loop calculates one column of the \({\varGamma }\) matrix, corresponding to one characteristic. The resulting user characteristics are then re-scaled in line 5 so that they sum to one.
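A sketch of this heuristic for a single user is given below. The per-characteristic score of line 3 of Algorithm 4 is approximated here by the stand-alone likelihood of the user's choices under characteristic c; the array layout is as in the previous sketch, and for long choice histories a log-domain computation would be preferable to avoid underflow.

```python
import numpy as np
from scipy.stats import norm

def heuristic_m_step_user(F_u, y_u):
    """Heuristic estimate of gamma_u (one user's row of Gamma, cf. Algorithm 4).
    F_u : (n_c x n_i) array with entries f^c(t_i); y_u : choices coded as +/-1."""
    # Likelihood of the observed choices if characteristic c alone had generated them.
    scores = np.prod(norm.cdf(y_u * F_u), axis=1)
    # Re-scale so that the user's characteristics sum to one (line 5 of Algorithm 4).
    return scores / scores.sum()
```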
As we will see in Sect. 5, while it lacks the theoretical justification of the exact M-step, the heuristic Algorithm 4 empirically obtains good results at a much lower computational cost. Finally, while the number of user characteristics \(n_c\) has to be set manually, we find that, consistent with prior work, our method is insensitive to the choice of this parameter as long as it is not excessively small, e.g., Houlsby et al. (2012).
Learning hyperparameters
As in the case of \({\varGamma }\), a full Bayesian treatment of the hyperparameters, \(\theta = \{ l_d \}\), is prohibitively expensive. Prior work has often resorted either to gradient-based optimization of the marginal likelihood Z, e.g., Chu and Ghahramani (2005), or to heuristics, e.g., Stachniss et al. (2009), to learn the hyperparameters from the data. In the experiments that follow, we employ a heuristic and set the length-scales to the median distance between trade-offs t. This has been found in prior work to be a computationally fast heuristic yielding consistently good empirical results (Houlsby et al. 2012).
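A sketch of this heuristic is shown below; applying it separately per trade-off dimension and excluding zero distances is our reading of the heuristic, and the exact form used in the experiments may differ.

```python
import numpy as np

def median_lengthscales(T_obs):
    """T_obs: (n x d_T) array of observed trade-offs; returns one length-scale l_d per dimension."""
    ls = []
    for d in range(T_obs.shape[1]):
        diffs = np.abs(T_obs[:, d, None] - T_obs[None, :, d])   # pairwise distances along dimension d
        ls.append(np.median(diffs[diffs > 0]))                  # median of the nonzero distances
    return np.array(ls)
```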