Abstract
This paper presents a framework for optimizing the preference learning process. In many real-world applications in which preference learning is involved, the available training data is scarce and obtaining labeled training data is expensive. Fortunately, in many preference learning situations data is available from multiple subjects. We use the multitask formalism to enhance the individual training data by making use of the preference information learned from other subjects. Furthermore, since obtaining labels is expensive, we optimally choose which data to ask a subject to label so as to obtain the most information about her/his preferences. This paradigm, called active learning, has hardly been studied in a multitask formalism. We propose an alternative to the standard criteria in active learning which actively chooses queries by making use of the available preference data from other subjects. The advantages of this alternative are reduced computation costs and reduced time during which subjects are involved. We validate our approach empirically on three real-world data sets involving the preferences of people.
Introduction
Recently, there has been increasing interest within artificial intelligence research in learning the preferences of people (Doyle 2004). Preference learning provides the means for modeling and predicting people's desires, which makes it a crucial aspect of modern applications such as decision support systems (Chajewska et al. 2000), recommender systems (Blythe 2002; Blei et al. 2003), and personalized devices (Clyde et al. 1993; Heskes and de Vries 2005).
A prototypical example of an application of preference learning that we will use in this paper is fitting hearing aids, i.e., tuning hearing-aid parameters so as to maximize user satisfaction. This is a complex task, for three reasons: (1) the high dimensionality of the parameter space, (2) the determinants of hearing-impaired user satisfaction are unknown, and (3) evaluating this satisfaction through listening tests is costly (in terms of patient burden and clinical time investment) and unreliable (due to inconsistent responses). The last point illustrates an important issue that preference learning has to address, namely the limited availability of the labeled data used for model training. Obtaining appropriate training data in preference learning applications requires time and effort from the modeled user. This shortcoming can be addressed by taking advantage of two characteristics of the settings in which preference learning is usually applied. First, the training data is mostly acquired through interactions with the modeled user; and, second, preferences are modeled for multiple users, so that multiple training data sets are available. In order for preference learning methods to be implemented in real-world systems, they must be capable of exploiting all possible sources of information in the most efficient way.
In most situations in which preference learning is involved, data is available from multiple subjects. Thus, even though individual data is scarce and difficult to obtain, we can optimize the learning of the preferences of a new subject by making use of the available data from other subjects. Learning in this setting is well known as multitask or hierarchical learning and has been studied extensively in recent years in machine learning. By using the multitask formalism, the preference data collected for other subjects can be gathered and used as prior information when learning the preferences of a new subject. Furthermore, to deal with the fact that obtaining labeled data is expensive, we can speed up learning by optimally choosing the examples to be queried. At each learning step we can decide which example gives the most information about the subject's preferences. This paradigm, called active learning in the machine learning literature and related to sequential experimental design in statistics, has been studied extensively, but hardly in the multitask setting.
The aim of this work is to present an efficient framework for optimizing the preference learning process. This framework combines active learning and multitask learning in the preference learning context. The contribution of this work is a criterion for active learning designed for the multitask setting. The advantages of this criterion are its clear interpretation and its ease of computation.
The structure of this paper is as follows. First, this section ends by presenting related work on preference learning, multitask learning, and active learning. In Sect. 2 we describe the learning framework. We consider learning from qualitative preference observations in which the subject makes a choice for one of the presented alternatives. This can be modeled using the probabilistic choice models introduced in Sect. 2.1. Learning a utility function representing the preferences of a subject from this type of preference observations is described in Sect. 2.2. Learning the utility function in a multitask setting by making use of the data available from other subjects is considered in Sect. 2.3. In Sect. 3 we present several criteria for selecting the most informative experiments with respect to a subject's preferences. After reviewing some of the standard criteria from experimental design, we propose an alternative criterion which makes use of the preference observations already collected from a community of subjects. We show that this alternative criterion is connected to the standard criteria from experimental design. In Sect. 4 we demonstrate experimentally the usefulness of our framework on three data sets: a subset of the Letor data set, an audiological data set, and a data set of people's preferences for art images. In Sect. 5 we present several conclusions and discuss directions for future research.
Background and related work
In this section we review some studies from preference learning, multitask learning, and active learning related to the work presented in this paper.
Preference learning
Preference learning is the creation of a model from collected data that can be used to model and predict people's desires. A recent and comprehensive overview of preference learning is given in Fürnkranz and Hüllermeier (2010). There are different approaches to preference learning, which can be categorized according to the learning task, the learning technique, and the application area. We briefly enumerate them and state in which category our current work falls (for more details we refer the interested reader to Fürnkranz and Hüllermeier 2010).

1. Based on the application area, preference learning approaches can be divided into the following main groups: (i) information retrieval, e.g., learning to rank the search results of a query or a search engine; (ii) recommender systems, e.g., used by online stores to recommend products to their customers, or for personalized devices; and (iii) bipartite ranking and label ranking, which find applications in disciplines such as medicine and biology. The application scenarios that we consider in the experimental evaluation in Sect. 4 belong to information retrieval (the Letor data set) and recommender systems (the audiological and art data sets).

2. The learning technique divides preference learning approaches into four categories: (i) learning a binary preference relation that compares pairs of alternatives; (ii) model-based approaches that aim at identifying the preference relation by making sufficiently restrictive model assumptions; (iii) local estimation techniques which lead to aggregating preferences; and (iv) learning utility functions by using regression to map instances to target valuations for direct ranking. We focus on the last approach and use a utility function to model a subject's preferences. The utility function is learned in a Bayesian framework.

3. The learning task includes label, instance, and object ranking. Label ranking can be seen as a generalization of classification, where a complete ranking of labels is associated with an instance instead of only a class label. Instance ranking can be seen as a generalization of ordinal classification, where an instance belongs to one among a finite set of classes and the classes have an order. The setting of object ranking has the peculiarity of having no supervision, in the sense that no class label is associated with an object; instead, a finite set of pairwise preferences or other ordering between objects is given. The setting that we consider in this work belongs to the last category.
In many preference learning settings it is important to take the context into account, i.e., context-aware preference learning (Adomavicius et al. 2005). The motivation for context-aware preference learning is that the same subject/user/consumer may use different decision-making strategies and prefer different products in different contexts. For hearing-aid fitting, one of the application scenarios that we use in this work, this means that a user may prefer a certain setting of the hearing-aid parameters when listening to a concert and another setting when taking part in a discussion. In general, context-aware preference learning requires larger data sets, as preferences have to be learned for all contextual situations. The approach that we present in this paper is applicable in this setting as well. While this is an interesting, related topic, it is beyond the scope of the current work.
Multitask learning
The idea behind multitask learning is to utilize labeled data from other "similar" learning tasks in order to improve the performance on a target task. It is inspired by the research on transfer of learning in psychology, more specifically on the dependency of human learning on prior experience. For example, the abilities acquired while learning to walk presumably apply when one learns to run, and knowledge gained while learning to recognize cars could apply when recognizing trucks. The initial foundations for multitask learning were laid by Thrun (1995) and Caruana (1997). The psychological theory of transfer of learning implies similarity between tasks; in a related way, multitask learning assumes similarity between the models of different tasks. For example, Evgeniou et al. (2005) and Argyriou et al. (2008) exploit similarity between the deterministic parts of the models by means of regularization, with a resulting improvement in performance. In this work we implement multitask learning using a Bayesian approach, which assumes that the parameters of the individual models are drawn from the same prior distribution. Examples of the Bayesian approach to multitask learning include Bakker and Heskes (2003), where a mixture of Gaussians is used at the top of the hierarchy; this leads to clustering the tasks, one cluster for each Gaussian in the mixture. In Yu et al. (2005) and Birlutiu et al. (2009) a hierarchical Gaussian process is derived, with a normal-inverse-Wishart distribution used at the top of the hierarchy.
Active learning
Active learning, also known in the statistics literature as sequential experimental design, is suitable for situations in which labeling points is difficult, time-consuming, and expensive. The idea behind active learning is that better performance can be achieved by optimal selection of the training points than by random selection. The scenarios in which active learning can be applied belong to one of the following three categories: (i) generating de novo points for labeling; (ii) stream-based active learning, where the learner decides whether or not to request the label of a given instance; and (iii) pool-based active learning, where queries are selected from a large pool of unlabeled data. In this work we consider the pool-based active learning setting.
Methods for active learning can be roughly divided into two categories: those with and those without an explicitly defined objective function. Uncertainty sampling (Lewis and Gale 1994), Query-by-Committee (Seung et al. 1992; Freund et al. 1997), and variants thereof belong to the latter category. They are based on the idea of selecting the most uncertain data given the previously trained models. The methods with an explicit objective function are often motivated by the theory of experimental design (Fedorov 1972; Chaloner and Verdinelli 1995; Schein and Ungar 2007; Lewi et al. 2009; Dror and Steinberg 2008). The objective function quantifies the expected gain of labeling a particular input, for example in terms of the expected reduction in the entropy of the model parameters (MacKay 1992; Cohn et al. 1996). With respect to the performance of the two categories of methods, Schein and Ungar (2007) show that the methods from the second category perform better but are computationally more expensive, due to retraining the models for each candidate point. A trend is to improve the performance of active learning methods by combining them with heuristics designed either for the context in which they are applied or for the models they use, e.g., making use of the available unlabeled data (McCallum and Nigam 1998; Yu et al. 2006), exploiting the clusters in the data (Dasgupta and Hsu 2008), diversifying the set of hypotheses (Melville and Mooney 2004), or adapting the active learning to Gaussian processes (Chu and Ghahramani 2005a; Brochu et al. 2008; Groot et al. 2010).
Preference learning can benefit from the active learning paradigm. In most preference learning settings labels are given by people in an explicit way. This means that to acquire training preference data, the subjects have to interact with the system and express their preferences explicitly. These situations appear when it is impossible or insufficient to collect training preference data implicitly. For example, when learning preferences in a live system where subjects electronically choose their favorite movie, labeling is done automatically by the selection; but in other scenarios, such as fitting hearing aids, this implicit way of collecting training data cannot be applied. In these situations, it makes sense to use active learning in order to collect the most informative data. There are several studies in the literature that use active learning in a preference learning setting. Brinker (2004) presents some extensions of pool-based active learning to label ranking problems; Xu et al. (2010) address the problem of preference learning using relational models between items; Guo and Sanner (2010) investigate active preference learning for real-time systems; Brochu et al. (2008) propose a criterion for active learning that maximizes the expected improvement at each query without accurately modeling the entire valuation surface. Furthermore, there are several studies which investigate active preference learning for practical applications such as collaborative filtering (Jin and Si 2004; Harpale and Yang 2008; Boutilier et al. 2003), personalized calendar scheduling (Gervasio et al. 2005), and optimizing search results for biomedical documents (Arens 2008).
The difference between our work and the studies on active preference learning mentioned above is that we consider active preference learning in a multitask setting, i.e., we are interested in settings with multiple learning tasks and in how active learning can be implemented efficiently in this case. We propose a criterion for active learning designed for this multitask learning setting. This criterion, which we call the Committee criterion, makes use of the preference observations already collected from a community of subjects. The idea behind the Committee criterion is related to the Query-by-Committee method from active learning, which selects those queries that have maximum disagreement amongst an ensemble of hypotheses. The difference in our case is that the group of subjects for which the preferences were already learned plays the role of the ensemble of hypotheses, instead of an ensemble of models learned on the same task.
Notation
Boldface notation is used for vectors and matrices and normal fonts for their components. Superscripts are used to distinguish between different vectors or matrices and subscripts to address their components. The notation \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\varSigma})\) is used for a multivariate Gaussian with mean μ and covariance matrix Σ. The transpose of a matrix M is denoted by M ^{T}. Capital letters are used for constants and small letters for indices, e.g., i=1,…,I.
Learning framework
The idea of using the preference observations from other subjects to optimize the process of learning the preferences of a new subject can in principle be applied in any preference learning context. In this work, we consider the case of qualitative preference observations, which can be modeled using the probabilistic choice models described in this section.
Probabilistic choice models
In many real-world applications preferences are learned from experiments in which the subject makes a choice for one of the presented alternatives. The motivation for this is that people are very good at making comparisons between alternatives and expressing a preference for one of them, which results in qualitative preference observations. This is in contrast to quantitative preference observations, where people have to assign an absolute rating to each alternative independently. Let X={x _{1},…,x _{ I }} be a set of inputs. Let \({\mathcal {D}}\) be a set of J observed preference comparisons over instances in X corresponding to a subject,

\[\mathcal{D} = \bigl\{ (a_{j}, c_{j}) \mid j = 1, \ldots, J \bigr\}, \quad (1)\]
with \(a_{j} = (\boldsymbol{x}_{i_{1}(j)}, \ldots, \boldsymbol{x}_{i_{A}(j)})\) representing the alternatives presented to the subject and c _{ j } the choice made; i _{1},…,i _{ A }:{1,…,J}→{1,…,I} are index functions such that i _{1}(j) gives the input presented first in the jth preference comparison, and c _{ j }=c means that \(\boldsymbol{x}_{i_{c}(j)}\) is chosen from the A alternatives presented in the jth comparison. For A=2 this setup reduces to pairwise comparisons between two alternatives.
The main idea behind probabilistic choice models is to assume a latent utility function value U(x _{ i }) associated with each input x _{ i } which captures the individual preference of a subject for x _{ i } (the utility function will be formally defined in the next section). In the ideal case the latent function values are consistent with the preference observations. This means that alternative c is preferred over the other alternatives c′ in the jth comparison whenever the utility for c exceeds the utilities for the other alternatives c′, i.e., \(U(\boldsymbol{x}_{i_{c}(j)}) > U(\boldsymbol{x}_{i_{c'}(j)})\). In practice, however, subjects are often inconsistent in their responses. A very inconsistent subject will have a high uncertainty associated with the utility function; this uncertainty is directly taken into account in the probabilistic framework. We define this probabilistic framework using the Bradley-Terry model (Bradley and Terry 1952; Kanninen 2002; Glickman and Jensen 2005), making the standard modeling assumption that the probability that the cth alternative is chosen by the subject in the jth comparison follows a multinomial logistic model, which is defined as

\[p(c_{j} = c \mid a_{j}) = \frac{\exp\bigl( U(\boldsymbol{x}_{i_{c}(j)}) \bigr)}{\sum_{c'=1}^{A} \exp\bigl( U(\boldsymbol{x}_{i_{c'}(j)}) \bigr)}, \quad (2)\]
where “exp” is the exponential function and the other terms are as defined before. Efficiently learning preferences reduces to learning the unknown utility function U as accurately and with as few comparisons as possible.
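As an illustrative sketch (not part of the original formulation, and with hypothetical function names), the multinomial logistic choice probability above is a softmax over the latent utilities of the presented alternatives:

```python
import numpy as np

def choice_probability(utilities, c):
    """P(alternative c is chosen) under the multinomial logistic
    (Bradley-Terry type) choice model, given the latent utilities
    U(x) of the A presented alternatives."""
    u = np.asarray(utilities, dtype=float)
    u = u - u.max()                      # stabilize the exponentials
    p = np.exp(u) / np.exp(u).sum()      # softmax over the alternatives
    return float(p[c])
```

Subtracting the maximum utility before exponentiating leaves the probabilities unchanged but avoids numerical overflow for large utility values.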
One important drawback of the Bradley-Terry model is that it assumes very strong transitivity conditions on preference relations, while some psychological experiments have shown that human preference judgments can violate transitivity (Anand 1993; Tversky 1998). In most situations transitivity violations can be treated as noise. When this is not applicable, specific probabilistic models for human preference judgments which preserve intransitive reciprocal relations have to be designed. This was recently investigated in Pahikkala et al. (2009), which introduced a new kernel function in the framework of regularized least squares that is capable of inferring intransitive reciprocal relations.
The utility function
The utility function U is a real-valued function, U:X→ℝ, which associates with every input x∈X a real number U(x). Each input x∈X is characterized by a set of features, ϕ(x)∈ℝ^{D}. One possible choice for the utility function is to express it as a linear combination of the features,

\[U(\boldsymbol{x}) = \sum_{i=1}^{D} \alpha_{i} \phi_{i}(\boldsymbol{x}), \quad (3)\]
where α=(α _{1},…,α _{ D }) is a vector of weights which captures the importance of each feature of x when evaluating the utility U for a specific subject, and ϕ _{ i }(x) are the components of the vector ϕ(x). The preferences of a subject are thus encoded in the vector α, and learning the utility function reduces to learning α.
In order to make the definition of the utility function more flexible, we can use a semiparametric model in which the utility function is defined as a linear combination of basis functions. The basis functions are defined by a kernel function κ centered on the data points,

\[U(\boldsymbol{x}) = \sum_{i=1}^{I} \alpha_{i} \kappa(\boldsymbol{x}, \boldsymbol{x}_{i}), \quad (4)\]
where the vector α, with dimension I (the number of data points, i.e., the size of the set of inputs X), captures the preferences of the subject. A nonlinear utility function can be obtained by using, for example, a Gaussian kernel,

\[\kappa(\boldsymbol{x}, \boldsymbol{x}') = \exp\biggl( - \frac{\| \boldsymbol{x} - \boldsymbol{x}' \|^{2}}{2 \ell^{2}} \biggr), \quad (5)\]
where ℓ is a length-scale parameter. The two definitions of the utility function in Eqs. (3) and (4) are similar in the sense that both are linear in the parameters. Equation (4) is suited to the case in which the number of features is larger than the number of data points, i.e., D>I, and to introducing nonlinearity in the utility model.
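The two parameterizations of the utility function, Eqs. (3) and (4), can be sketched as follows (the function names are hypothetical, and the exact form of the Gaussian kernel, with the factor 2ℓ² in the denominator, is an assumption):

```python
import numpy as np

def utility_linear(alpha, phi_x):
    """U(x) = sum_i alpha_i * phi_i(x): utility linear in the features."""
    return float(np.dot(alpha, phi_x))

def gaussian_kernel(x, x_prime, ell):
    """Assumed Gaussian kernel with length-scale ell:
    kappa(x, x') = exp(-||x - x'||^2 / (2 ell^2))."""
    d = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * ell ** 2)))

def utility_kernel(alpha, x, X, ell):
    """U(x) = sum_i alpha_i * kappa(x, x_i): kernelized utility with
    one weight alpha_i per data point x_i in X."""
    return float(sum(a * gaussian_kernel(x, xi, ell) for a, xi in zip(alpha, X)))
```

Both forms are linear in α, which is what makes the Bayesian treatment of the next section tractable.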
In order to learn the utility function, we use a Bayesian framework in which we treat the vector of parameters α as a random variable. We consider a Gaussian prior distribution over \(\boldsymbol{\alpha} \sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\varSigma})\), which is updated based on the observations from the preference comparisons using Bayes' rule (where we omit the normalization constant),

\[p(\boldsymbol{\alpha} \mid \mathcal{D}) \propto \mathcal{N}(\boldsymbol{\alpha}; \boldsymbol{\mu}, \boldsymbol{\varSigma}) \prod_{j=1}^{J} p(c_{j} \mid a_{j}, \boldsymbol{\alpha}), \quad (6)\]
with the likelihood terms of the form given in Eq. (2). The choice of the prior will be discussed in the next section. The posterior distribution obtained is approximated by a Gaussian. This is a good approximation because with few data points the posterior is close to the prior, which is Gaussian, and with many data points the posterior again approaches a Gaussian as a consequence of the central limit theorem (Bishop 2006). To perform the approximation of the posterior, a good choice is to use deterministic methods (e.g., Laplace's method (MacKay 2002) or Expectation Propagation (Minka 2001)), since they are computationally cheaper than non-deterministic (sampling) methods and are known to be accurate for these types of models (Glickman and Jensen 2005).
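A minimal sketch of such a Gaussian posterior approximation via Laplace's method, for the pairwise case A=2 (the function name and the use of a generic BFGS optimizer are assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_posterior(mu0, Sigma0, comparisons):
    """Gaussian (Laplace) approximation of the posterior over alpha for
    pairwise comparisons (A = 2). `comparisons` is a list of
    (phi_winner, phi_loser) feature-vector pairs; each contributes a
    likelihood term sigma((phi_w - phi_l) @ alpha), the A = 2 case of
    the multinomial logistic model."""
    P0 = np.linalg.inv(Sigma0)                    # prior precision
    diffs = np.array([w - l for w, l in comparisons])

    def neg_log_post(a):
        z = diffs @ a
        nll = np.sum(np.logaddexp(0.0, -z))       # -sum log sigma(z), stable
        d = a - mu0
        return nll + 0.5 * d @ P0 @ d

    a_map = minimize(neg_log_post, mu0, method="BFGS").x  # posterior mode
    s = 1.0 / (1.0 + np.exp(-(diffs @ a_map)))    # sigma(z) at the mode
    H = P0 + (diffs * (s * (1.0 - s))[:, None]).T @ diffs  # Hessian at mode
    return a_map, np.linalg.inv(H)                # approx. posterior mean, cov
```

The approximation places a Gaussian at the posterior mode, with covariance given by the inverse Hessian of the negative log-posterior there.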
Multitask preference learning
One property which distinguishes preference learning from other learning settings is that in most of the cases in which preference learning is involved, observations are available from multiple subjects. For example, in the case of product recommendation, preferences are learned from multiple consumers. Furthermore, in situations where the training preference data is collected in an explicit way, through interactions with the subject, the individual training data is usually small. As a consequence, it is reasonable to make use of all the available data. In this direction, we make the assumption that each subject is not an isolated case, but belongs to a group of people sharing a common underlying rationale in their way of making preference decisions. This property, combined with the Bayesian framework, allows the transfer of information about preferences between subjects. Basically, we use the preference data previously seen from other subjects to learn an informed prior, which is used as the starting prior when learning the preferences of a new subject. For learning this informed prior, we use Bayesian hierarchical modeling, which assumes that the parameters of the individual models are drawn from the same hierarchical prior distribution. Let us assume that we already have preference data available from a group of M subjects. We make the common assumption of a Gaussian prior distribution, \(p(\boldsymbol{\alpha}^{m}) = \mathcal{N}(\boldsymbol{\alpha}^{m}; \bar{\boldsymbol{\mu}},\bar{\boldsymbol{\varSigma}})\), m=1,…,M, with the same \(\bar{\boldsymbol{\mu} }\) and \(\bar{\boldsymbol{\varSigma}}\) for the preference models of all subjects. This prior is updated using Bayes' rule based on the observations from each subject, resulting in a posterior distribution for each individual subject. The common prior over all task parameters controls the general part of the model.
This common prior is learned from the data belonging to a group of tasks other than the current (new) task for which the learning is performed. Starting from the general model given by the common prior, the model is updated using the observations (data) seen in the current task. These task-specific observations control the task-specific part of the model. The hierarchical prior is obtained by maximizing the penalized log-likelihood of all data in a so-called type-II maximum likelihood approach. This optimization is performed by applying the EM algorithm (Gelman et al. 2003), which reduces to iterating the following steps until convergence:
E-step: Estimate the sufficient statistics (mean μ ^{m} and covariance matrix Σ ^{m}) of the posterior distribution corresponding to each subject m, given the current estimates at step t (\(\bar{\boldsymbol{\mu}}^{(t)}\) and \(\bar{\boldsymbol{\varSigma}}^{(t)}\)) of the hierarchical prior.
M-step: Re-estimate the parameters of the hierarchical prior:
\[\bar{\boldsymbol{\mu}}^{(t+1)} = \frac{1}{M} \sum_{m=1}^{M} \boldsymbol{\mu}^{m}, \quad (7)\]

\[\bar{\boldsymbol{\varSigma}}^{(t+1)} = \frac{1}{M} \sum_{m=1}^{M} \bigl[ \boldsymbol{\varSigma}^{m} + (\boldsymbol{\mu}^{m} - \bar{\boldsymbol{\mu}}^{(t+1)}) (\boldsymbol{\mu}^{m} - \bar{\boldsymbol{\mu}}^{(t+1)})^{T} \bigr]. \quad (8)\]
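A minimal sketch of the M-step, assuming the standard moment-matching updates for a Gaussian hierarchical prior (the function name is hypothetical):

```python
import numpy as np

def m_step(mus, Sigmas):
    """M-step of the EM algorithm for a hierarchical Gaussian prior:
    re-estimate the prior mean and covariance from the per-subject
    posterior means mu^m and covariances Sigma^m found in the E-step."""
    mus = np.asarray(mus, dtype=float)
    mu_bar = mus.mean(axis=0)                     # new prior mean
    devs = mus - mu_bar
    Sigma_bar = np.mean([S + np.outer(d, d) for S, d in zip(Sigmas, devs)],
                        axis=0)                   # new prior covariance
    return mu_bar, Sigma_bar
```

The new prior covariance combines the average within-subject posterior uncertainty with the between-subject spread of the posterior means.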
Active preference learning
In this section we discuss methods for active preference learning. We start from the Query-by-Committee (QBC) method for active learning (Seung et al. 1992) and, based on it, propose some variants of QBC adapted to the setting of preference learning for multiple subjects (Sect. 3.1). Furthermore, we show how these variants of QBC can be naturally linked to hierarchical Bayesian modeling in order to reduce the computations (Sect. 3.2). Finally, we show connections between the proposed variants of QBC and other active learning criteria (Sect. 3.4).
QBC for preference learning
In this section we will discuss how to adapt QBC to our preference learning setting.
The committee members
For the QBC approach to be effective it is important that the committee is made of consistent and representative models. The main idea in this work is to exploit the preference learning setting with multiple subjects and use the learned models of other subjects \(\mathcal{M}_{1},\ldots, \mathcal{M}_{M}\) as committee members when learning the preferences of a new subject.
After choosing the committee we still have to decide upon a suitable criterion for selecting the next examples. A measure of disagreement among the committee members is the most obvious choice, and in the following we consider two alternatives.
Vote criterion
A simple and straightforward way is to consider the labels assigned by the other subjects, e.g., through the Vote criterion defined as

\[\mathrm{Vote}(a) = \max_{c} \sum_{m=1}^{M} \delta(a, c; m), \quad (9)\]
where δ(a,c;m)=1 if \((a,c) \in {\mathcal {D}}_{m}\), and δ(a,c;m)=0 otherwise. The score Vote(a) is minimal when the labels assigned by the committee members are equally distributed over the alternatives (total disagreement) and maximal when all members fully agree. There are two problems with this criterion. First, a comparison a may not have been labeled by a subject m. This can be overcome if we consider the predictions computed from the learned model of subject m and allow each committee member to 'vote' for its winning class. The same idea is implemented in the so-called vote entropy method (Dagan and Engelson 1995), where the entropy is measured over the final classes assigned to an example by the possible models, and not over the class probabilities given by those models. Second, in practical applications just scoring votes turns out to be suboptimal. The reason, as also suggested in McCallum and Nigam (1998), is that the Vote criterion does not take into account the confidences of the committee members' predictions.
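A plausible reading of the Vote criterion, not reproduced from the original formula, is the size of the majority vote over the committee members who labeled a given comparison: this is minimal when the labels are spread evenly and maximal when all members agree, as described above. A sketch (hypothetical function name and data layout):

```python
from collections import Counter

def vote_score(a, labeled_sets):
    """Majority-vote reading of the Vote criterion. `labeled_sets` holds
    one dict per committee member m, mapping a comparison to the
    alternative that member chose; members who did not label `a` are
    simply skipped (the first problem noted in the text)."""
    votes = Counter(D[a] for D in labeled_sets if a in D)
    return max(votes.values()) if votes else 0
```

Low scores flag comparisons with high committee disagreement, which is why a pure vote count can be precomputed and ranked beforehand.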
Committee criterion
We will use the following notation for the predictive probability corresponding to a subject m=1,…,M,

\[p_{m}(c \mid a) = p(c \mid a, \mathcal{M}_{m}). \quad (10)\]
The predictive probability can be computed either by taking into account the entire distribution \(\mathcal {M}_{m} = \mathcal {N}(\boldsymbol{\mu}^{m}, \boldsymbol{\varSigma}^{m})\),

\[p_{m}(c \mid a) = \int p(c \mid a, \boldsymbol{\alpha}) \, \mathcal{N}(\boldsymbol{\alpha}; \boldsymbol{\mu}^{m}, \boldsymbol{\varSigma}^{m}) \, d\boldsymbol{\alpha},\]
or, for computational reasons, we can consider only a point estimate for \(\mathcal {M}_{m}\), for example the mean of the Gaussian distribution, and use it to compute the predictive probabilities using Eq. (2),

\[p_{m}(c \mid a) \approx p(c \mid a, \boldsymbol{\mu}^{m}).\]
Inspired by McCallum and Nigam (1998), we propose to measure disagreement by taking the average prediction of the entire committee and computing the average Kullback-Leibler (KL) divergence of the individual predictions from this average:

\[\mathrm{Dis}(a) = \frac{1}{M} \sum_{m=1}^{M} \mathrm{KL}\bigl[ p_{m}(\cdot \mid a) \,\|\, \bar{p}(\cdot \mid a) \bigr], \quad (11)\]
with \(\bar{p}(\cdot \mid a)\) the average predictive probability of the entire committee, which will be defined more precisely in Sect. 3.2.
The KL divergence for discrete probabilities is defined as

\[\mathrm{KL}[p_{1} \,\|\, p_{2}] = \sum_{c} p_{1}(c) \log \frac{p_{1}(c)}{p_{2}(c)}.\]
The KL divergence can be seen as a distance between probabilities, where we abuse the notion of distance since the KL divergence is not symmetric, i.e., KL[p _{1}‖p _{2}]≠KL[p _{2}‖p _{1}]. This drawback of the KL divergence can be overcome by considering a symmetrized measure, for example KL[p _{1}‖p _{2}]+KL[p _{2}‖p _{1}]. In McCallum and Nigam (1998), the disagreement is computed between committee members constructed based on the current model, i.e., the committee changes with every update and the criterion has to be recomputed with every update. A committee of models learned on different tasks is fixed, and thus selecting examples solely based on it leads to a fixed instead of an active design: all examples can be ranked beforehand (the same applies to the Vote criterion defined above).
To arrive at an active design and take the current model into account, we propose a small modification based on the following intuition. Querying examples on which the committee members disagree makes sense, because it will force the current model to make a choice between options that, according to the committee members, are reasonably plausible. However, when the current model has already "made up its mind" on a particular example, i.e., deviates substantially from the average prediction of the committee based on what it learned from other input/output pairs, it makes no sense to still query that example, even though the committee members might disagree. Taking this consideration into account, we propose the Committee criterion, which assigns a score to a candidate query comparison a through

\[\mathrm{Committee}(a) = \frac{1}{M} \sum_{m=1}^{M} \mathrm{KL}\bigl[ p_{m}(\cdot \mid a) \,\|\, \bar{p}(\cdot \mid a) \bigr] - \gamma \, \mathrm{KL}\bigl[ p(\cdot \mid a) \,\|\, \bar{p}(\cdot \mid a) \bigr], \quad (12)\]
with p(⋅ ∣ a) the current model's predictive probability based on the data seen so far, and γ a parameter that accounts for the degree of similarity between subjects. According to the Committee criterion, the most interesting experiments are those on which the other models disagree (the first term on the right-hand side of Eq. (12)) while the current model is (still) undecided (the second term on the right-hand side of Eq. (12)).
An advantage of the Committee criterion is its computational efficiency: the first term on the right-hand side of Eq. (12), as well as the average predictive probability, can be computed beforehand. The Committee criterion does require computation of the predictive probabilities corresponding to the current model, but this is the least one could expect from an active design. This is to be compared with the QBC criterion (either of the two variants considered), which requires constructing new committee members with each update, and with D-optimal experimental design, which calls for keeping track of variances.
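A hypothesized sketch of the Committee criterion, assuming the form described in the text (committee disagreement minus γ times the current model's deviation from the committee average); for simplicity it uses the arithmetic mean of the member predictions, whereas Sect. 3.2 defines the average via a logarithmic opinion pool:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL divergence KL[p || q]; eps guards against log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def committee_score(member_probs, current_prob, gamma=1.0):
    """Hypothesized Committee criterion: the members' average KL
    disagreement from the committee's average prediction, penalized
    (weight gamma) by how far the current model's prediction already
    deviates from that average."""
    p_bar = np.mean(member_probs, axis=0)   # average committee prediction
    disagreement = np.mean([kl(p, p_bar) for p in member_probs])
    return disagreement - gamma * kl(current_prob, p_bar)
```

The first term can indeed be precomputed for every candidate comparison; only the last KL term changes as the current model is updated.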
Note that we have not made any restriction so far with respect to the probabilistic models used in the active learning design. In the following we will consider only the log-linear models introduced in Sect. 2. They have some nice properties which simplify the computation of the Committee criterion (Sect. 3.2) and provide a natural link to hierarchical Bayesian modeling (Sect. 2.3). The general idea of using the models already learned on the other tasks as the committee members in a QBC-like approach is, of course, also applicable to other models.
Average probability
In this section we discuss how to efficiently compute the average probability used in the Committee criterion in Eq. (12) for the case of log-linear models (Christensen 1997). For linear utility functions the likelihood function defined in Eq. (2) is a log-linear model: the log-odds of the model are linear in the parameters.
Let \(p_m(c|a)\) be the predictive probability defined in Eq. (10). We define the average predictive probability of the committee, \(\bar{p}(c|a)\), as the prediction probability that is closest to the prediction probabilities of the members:

$$ \bar{p}(\cdot|a) = \mathop{\mathrm{argmin}}_{p} \sum_{m=1}^{M} \mathrm{KL}\bigl[p(\cdot|a) \,\|\, p_m(\cdot|a)\bigr] . $$
The solution of the optimization problem above is the so-called logarithmic opinion pool (Bordley 1982)

$$ \bar{p}(c|a) = \frac{1}{Z(a)} \prod_{m=1}^{M} p_m(c|a)^{1/M} , $$
with Z(a) a normalization constant,

$$ Z(a) = \sum_{c} \prod_{m=1}^{M} p_m(c|a)^{1/M} . $$
For log-linear models, the logarithmic opinion pool boils down to a simple averaging of model parameters:

$$ \bar{p}(c|a) = p(c|a, \bar{\boldsymbol{\mu}}) , \quad \text{with } \bar{\boldsymbol{\mu}} = \frac{1}{M} \sum_{m=1}^{M} \boldsymbol{\mu}_m , $$

where \(\boldsymbol{\mu}_m\) denotes the learned parameters of committee member m.
This natural combination of log-linear models and logarithmic opinion pools is the advantage of using the logarithmic opinion pool instead of the linear opinion pool used in (McCallum and Nigam 1998).
As can be seen from the EM updates in Eq. (7), the average \(\bar{\boldsymbol{\mu}}\) in the logarithmic opinion pool is then precisely the mean of the learned hierarchical prior. Summarizing, once we have learned a hierarchical prior from the data available for subjects 1 through M using the EM algorithm, we can start off the new model M+1 from this prior (as is normally done in hierarchical Bayesian learning). On top of this, the same EM algorithm gives us the information we need to compute the Committee criterion that can be used subsequently to select new inputs to label.
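For intuition, the equivalence between the logarithmic opinion pool and parameter averaging in log-linear models can be checked numerically. The sketch below uses a Bradley-Terry-style probability with toy weights of our own; it is an illustration, not the paper’s code.

```python
import numpy as np

def predict(alpha, x):
    """Probability over the two outcomes of a comparison whose feature
    difference is x, under a Bradley-Terry-style log-linear model."""
    p = 1.0 / (1.0 + np.exp(-alpha @ x))
    return np.array([p, 1.0 - p])

rng = np.random.default_rng(0)
x = rng.normal(size=3)              # feature difference of one comparison
alphas = rng.normal(size=(5, 3))    # parameters of a committee of 5 models

# Logarithmic opinion pool: normalized geometric mean of member predictions.
probs = np.array([predict(a, x) for a in alphas])
pool = np.exp(np.log(probs).mean(axis=0))
pool /= pool.sum()

# For log-linear models, this equals predicting with the averaged parameters.
pool_via_mean_params = predict(alphas.mean(axis=0), x)
```

The log-odds of the pool are the average of the members’ log-odds, which for log-linear models are linear in the parameters; hence pooling predictions and averaging parameters coincide.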
Other criteria for active learning
In this section we discuss how several strategies for active learning can be implemented in the learning framework considered here. All the strategies are concerned with evaluating the informativeness of the unlabeled points. Let the new model obtained after incorporating an observation (a,c) be \(\mathcal {M}_{(a,c)} \approx\mathcal{N}(\boldsymbol{\mu}_{(a,c)}, \boldsymbol{\varSigma}_{(a,c)})\).

1.
Uncertainty sampling (Lewis and Gale 1994). In this strategy an active learner chooses for labeling the example for which the model’s predictions are most uncertain. The uncertainty of the predictions can be measured, for example, using Shannon entropy
$$ \mathrm{Uncertainty}(a) = -\sum_c p(c|a, \mathcal{M}) \log p(c|a, \mathcal{M}) . $$(16)

For a binary classifier this strategy reduces to querying points whose prediction probabilities are close to 0.5. Intuitively, this strategy aims at finding the decision boundary as fast as possible, since the boundary lies in the regions where the model is most uncertain.

2.
Variance reduction (MacKay 1992). This strategy, also known in experimental design as D-optimality (Fedorov 1972; Chaloner and Verdinelli 1995; Berger 1994; Ford and Silvey 1980), chooses as the most informative experiments the ones that yield the largest reduction in the model’s uncertainty. The motivation behind this strategy is a result of Geman et al. (1992) which shows that the generalization error can be decomposed into three components: (i) noise (which is independent of the model or training data); (ii) bias (due to the model); (iii) the model’s variance. Since the model cannot influence the noise and the bias components, the future generalization error can only be influenced via the model’s variance. Formally, this criterion can be written as
$$ \mathrm{Variance}(a) = \sum_c p(c|a,\mathcal{M}) \, \mathrm{variance}[\mathcal{M}_{(a,c)}] - \mathrm{variance}[\mathcal{M}] . $$(17)

In the setting considered in this work the variance of the model is captured by the covariance of the Gaussian distribution. In order to use Eq. (17) we need to choose a scalar measure of the variance. We can consider, for example, the log-determinant of the covariance matrix
$$ \mbox{Variance-logdet}(a) = \sum_c p(c|a, \mathcal{M}) \log\det(\boldsymbol{\varSigma}_{(a,c)}) - \log\det(\boldsymbol{\varSigma}) , $$(18)

which amounts to minimizing the entropy of the Gaussian random variable representing the current model, or the trace of the covariance matrix
$$ \mbox{Variance-trace}(a) = \sum_c p(c|a, \mathcal{M}) \operatorname{Tr}(\boldsymbol{\varSigma}_{(a,c)}) - \operatorname{Tr}(\boldsymbol{\varSigma}) . $$(19)
3.
Expected model change (Cohn et al. 1996). This strategy chooses as the most informative query the one which, when added to the training set, would yield the greatest model change. Quantifying the model change depends on the learning framework. For gradient-based optimization the change can be measured via the training gradient, i.e., the vector used to re-estimate parameter values (Settles and Craven 2008). In the Bayesian framework, the model change can be quantified via a distance measure between the current distribution and the posterior distribution obtained after incorporating the candidate point
$$ \mathrm{Change}(a) = \sum_c p(c|a, \mathcal{M}) \, \mathrm{distance}[\mathcal{M}, \mathcal{M}_{(a,c)}] . $$

A suitable distance for our setting is the Kullback-Leibler divergence between distributions, which for two Gaussians has a closed-form expression and can be written as follows
$$ \mathrm{KL}\bigl[\mathcal{M} \,\|\, \mathcal{M}_{(a,c)}\bigr] = \frac{1}{2} \biggl[ \log\frac{\det\boldsymbol{\varSigma}_{(a,c)}}{\det\boldsymbol{\varSigma}} + \operatorname{Tr}\bigl(\boldsymbol{\varSigma}_{(a,c)}^{-1}\boldsymbol{\varSigma}\bigr) + (\boldsymbol{\mu}_{(a,c)}-\boldsymbol{\mu})^{T}\boldsymbol{\varSigma}_{(a,c)}^{-1}(\boldsymbol{\mu}_{(a,c)}-\boldsymbol{\mu}) - d \biggr] , $$(20)

with d the dimension of the parameter vector. The KL divergence between Gaussians is used by Seeger (2008) to design an efficient sequential experimental design in a setting similar to the one used in this work.
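The closed-form divergence referred to above is the standard KL divergence between multivariate Gaussians; it can be sketched as follows (the function name and the toy inputs are ours):

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """KL[N(mu0, cov0) || N(mu1, cov1)] for d-dimensional Gaussians."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
                  + np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d)
```

Swapping the arguments generally gives a different value, matching the asymmetry of the KL divergence noted at the start of this section.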
Uncertainty sampling, QBC, and its variants are attractive due to their applicability in various machine learning settings. Variance reduction and expected model change are robust and in many situations have proved to be the best one can do (Schein and Ungar 2007). Although more robust, the variance reduction and expected model change strategies are computationally more demanding, since for each candidate comparison query and each possible label the induced posterior distribution has to be computed. For the learning setting considered in this study, the posterior distribution computed using Bayes’ theorem (Eq. (6)) does not have an analytical expression and approximations are needed; these approximations are usually costly. In contrast, the variants of QBC proposed in this paper are computationally efficient.
Similarities between criteria
In this section we consider the following active learning criteria: Variance-logdet, Committee, Variance-trace, and Change-KL. We investigate how similar these active learning criteria are and how they can be related. We analyze the modifications the criteria induce in the model when updating the probability model to incorporate the information from new training points. A single update induces a small change in the posterior distribution, which allows for Taylor expansions, keeping only the lowest non-zero contribution. In the following we present the main results of the approximations; some of the details can be found in the Appendix.
As we will show below and in the Appendix, under the assumption that the updates of the posterior distribution for each alternative a and choice c lead to small changes in the model \(\mathcal{M}\), we can approximate the active learning criteria to the form

$$ \mathrm{Criterion}(a) \approx \sum_c p(c|a, \boldsymbol{\alpha}) \, \boldsymbol{g}(c|a, \boldsymbol{\alpha})^{T} \boldsymbol{Q} \, \boldsymbol{g}(c|a, \boldsymbol{\alpha}) , $$(21)

for some vector α and matrix Q, and with \(\boldsymbol{g}(c|a, \boldsymbol{\alpha}) = \partial \log p(c|a, \boldsymbol{\alpha}) / \partial \boldsymbol{\alpha}\) the gradient of the log-probabilities.
The following lemma approximates the Variance-logdet criterion to the form of Eq. (21).
Lemma 1
In a first order approximation, assuming that Σ _{(a,c)} is close to Σ, we can simplify
where μ and Σ represent the mean and covariance of the Gaussian posterior distribution.
Proof
In a first order approximation we have
where we ignored the change from the old α to a new MAP solution depending on c and a. We will use the notation
For a matrix A and ϵ small compared to A, the following holds (see, for example, Boyd and Vandenberghe 2004, p. 642)
Assuming \(\boldsymbol{\varSigma}_{(a,c)}^{-1}\) is close to \(\boldsymbol{\varSigma}^{-1}\), which makes \(H(c|a,\boldsymbol{\mu})\) small, we can use Eq. (23) in Eq. (24) with the substitutions \(A = \boldsymbol{\varSigma}^{-1}\) and \(\boldsymbol{\epsilon} = H(c|a,\boldsymbol{\mu})\) to obtain
The probability that the subject gives response c when presented with the alternatives a follows by integrating \(p(c|a,\boldsymbol{\alpha})\) over the current posterior. We make a second order Taylor expansion of \(p(c|a,\boldsymbol{\alpha})\) around the point μ:
The first order term cancels since the gradient is zero at the maximum solution α=μ. In a lowest order approximation we can ignore the correction to \(p(c|a,\boldsymbol{\alpha})\) to obtain
where for the last step we used the approximation from Eq. (25). To complete the proof of this lemma we use Lemma 3 in the Appendix at the end of the paper, which states a relationship between Hessian and Fisher matrices. □
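The first-order expansion of the log-determinant used in this proof can be verified numerically. The sketch below checks \(\log\det(A+E) \approx \log\det A + \operatorname{Tr}(A^{-1}E)\) on a random positive definite matrix of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)          # symmetric positive definite matrix
E = 1e-4 * rng.normal(size=(4, 4))
E = 0.5 * (E + E.T)                  # small symmetric perturbation

exact = np.log(np.linalg.det(A + E)) - np.log(np.linalg.det(A))
first_order = np.trace(np.linalg.inv(A) @ E)
# The two agree up to terms of order ||E||^2.
```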
Using the same type of approximation, the Committee criterion can be brought into the same form, given in Eq. (21).
Lemma 2
In a lowest order approximation the Committee criterion can be written as
where \(\bar{\boldsymbol{\mu}}\) is the mean of the hierarchical prior learned from the other subjects and
Proof
We make a second order Taylor expansion of the KL divergences from the definition of the Committee criterion in Eq. (12)
around the point \(\bar{\boldsymbol{\mu}}\).
The first order term of the Taylor expansion is:
which cancels since, based on Eq. (28) from the Appendix, \(\sum_{c} p(c|a, \bar{\boldsymbol{\mu}}) \boldsymbol{g}(c|a,\bar{\boldsymbol{\mu}})\) is the zero vector.
The second order term can be rewritten using Lemma 3 as:
Since the other terms cancel, we obtain that the KL divergence between the predictive probabilities can be approximated as
Making this approximation for all the KL divergences in the definition of the Committee criterion in Eq. (12) and computing the sum, we obtain the result stated in this lemma. □
Furthermore, it can also be shown that Variance-trace and Change-KL can be approximated to the same form given in Eq. (21), namely
and
For the derivations of these approximations see Lemmas 4 and 5 in the Appendix.
We will focus on the differences between the Variance-logdet criterion (considered as the reference) and the Committee criterion. The differences between their approximations are as follows.

1.
The gradients \(\boldsymbol{g}(c|a,\cdot)\) are evaluated at different points: the prior hierarchical mean \(\bar{\boldsymbol{\mu}}\) and the current posterior mean μ. This effect is small since μ is still close enough to \(\bar{\boldsymbol{\mu}}\) for a sufficiently accurate approximation of the gradients, in particular at the start of learning, when selecting the right points to label matters most.

2.
The current posterior variance Σ is replaced by \(\tilde{\boldsymbol{\varSigma}}\). The effect of the precise weighting of the gradients is not so important, and again, at the beginning of learning \(\tilde{\boldsymbol{\varSigma}}\) is close to Σ.
The way in which experiments are selected matters most at the beginning of the learning process, when μ is still close to the prior mean \(\bar{\boldsymbol{\mu}}\) and \(\tilde{\boldsymbol{\varSigma}}\) is still close to Σ.
Experimental evaluation
This section presents the experimental evaluation of the framework proposed in this paper. We use pairwise comparison data. The main goal of the experimental evaluation is to show that optimally selecting data for labeling using the Committee criterion achieves higher accuracy than random selection. Furthermore, we also show that the Committee criterion performs in practice similarly to other standard active learning criteria, while having a computational advantage.
Data sets
The following data sets related to the preferences of people were used in the experimental evaluation.
Letor
This data set consists of relevance levels assigned to documents with respect to a given textual query (the OHSUMED data set from Letor 3.0, Qin et al. 2010). The relevances were assessed by human experts using three relevance levels: definitely relevant, partially relevant, and not relevant. We used a subset of this data related to Query 1. It contains 138 references with the following labels: 24 definitely relevant, 26 partially relevant, and 88 not relevant. Each sample is characterized by a 45-dimensional vector of text features extracted from the titles and abstracts of the documents. The features were normalized. Based on this data set we constructed pairwise preferences belonging to 50 subjects, in a way that we describe below. We followed a procedure similar to that of Xu et al. (2010) to turn the relevance levels into pairwise preference comparisons. Since such coarse relevance judgements are considered unrealistic in many real-world applications, Xu et al. (2010) proposed to add uniform noise in the range [−0.5,0.5] to the true relevance levels. This addition preserves the relative order between definitely relevant (respectively partially relevant) documents and partially relevant (respectively not relevant) ones, but randomly breaks ties within each relevance level. To introduce a hierarchical component, we replaced the random tie-breaking of Xu et al. (2010) by a subject-specific one. We do this by replacing the uniform noise with a subject- (and feature-) dependent term as follows. For subject m, a weight vector \(\boldsymbol{\alpha}_m\) is drawn from a zero-mean, fully factorized Gaussian with unit variance, \(\boldsymbol{\alpha}_{m} \sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\).
Given features \(\boldsymbol{x}_i\), the noise terms are then the inner products \(\boldsymbol{\alpha}_{m}^{T} \boldsymbol{x}_{i}\), linearly scaled back to the interval [−0.5,0.5] (so as not to destroy the relative order of the true relevance levels), and the relevance levels are taken to be the true relevance levels plus these noise terms.
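The subject-specific tie-breaking described above can be sketched as follows. The feature matrix and relevance levels are synthetic placeholders, and the min-max rescaling is one straightforward reading of “linearly scaled back to [−0.5, 0.5]”.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(138, 45))            # placeholder document features
relevance = rng.integers(0, 3, size=138)  # placeholder true levels: 0, 1, 2

def subject_relevance(X, relevance, rng):
    """Perturb the true relevance levels with a subject-specific term
    scaled to [-0.5, 0.5]: ties within a level are broken, while the
    order between levels is preserved."""
    alpha_m = rng.normal(size=X.shape[1])        # alpha_m ~ N(0, I)
    noise = X @ alpha_m                          # inner products alpha_m^T x_i
    noise = (noise - noise.min()) / (noise.max() - noise.min()) - 0.5
    return relevance + noise

r_m = subject_relevance(X, relevance, rng)
```

Pairwise comparisons for subject m then follow by comparing the perturbed values r_m for pairs of documents.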
Audio
This data set consists of evaluations of sound quality by 32 subjects (the data set was borrowed from Arehart et al. 2007). Each subject performed 576 pairwise-comparison listening experiments. Each listening experiment presents one sound sample processed with two different settings of the hearing-aid parameters, together with the subject’s choice between the two. The processed sound sample is represented by a 3-dimensional feature vector.
Art
This data set consists of evaluations of art images by 190 subjects (the data set was borrowed from Yu et al. 2003). Each subject was presented a number of images out of a total of 642 and asked to rate each of them as like/dislike (on average, each subject rated 90 images). We considered the 32 subjects who rated more than 120 images. Each image is described by a 275-dimensional feature vector, with features that characterize the image, such as color, shape, and texture. For computational reasons, we reduced this high-dimensional feature vector to a lower dimension. Since most of the features were not very informative for predicting the outcomes, we used only the 10 most informative features (the informativeness of a feature was measured by averaging the correlations between the feature and the observations). Note that this data set does not contain pairwise comparisons like the other two data sets. Each instance carries a binary label, like or dislike, which makes the learning task on this data set a binary classification task. The combination of multi-task and active learning that we propose in this work can still be applied in this case, in the same framework introduced in Sect. 2, by using the logistic regression model instead of the Bradley-Terry model for the likelihood terms in the model from Eq. (6).
Protocol
Our experiments use a leave-one-out scheme in which each subject was considered once as the current/test subject whose preferences need to be learned. For each test subject the learning started from the hierarchical prior learned from the data of the remaining subjects. The data for the test subject was split into 5 folds; 1 fold was used for training and the rest was used for testing. The training data served as a pool out of which points were selected for labeling, either randomly or actively using one of the active learning criteria. The hierarchical prior was updated based on these data points. After every update, predictions were made on the test set using the current model. We used accuracy (the percentage of correct predictions among all predictions) as the measure of performance. The accuracy of the predictions on the test data measures how much we learned about the subject’s preferences. The results were averaged over the 5 splits and over the subjects.
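The protocol can be illustrated with a toy stand-in: linear “subjects” sharing a common weight vector, a prior taken as the mean of the other subjects’ weights (mimicking the hierarchical prior), uncertainty-based selection from the training fold, and perceptron-style updates. Everything here, from the data generator to the update rule, is a simplified placeholder for the paper’s Bayesian machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_subject(w_common, n=50, d=5):
    """Toy subject: noisy linear preferences around a shared weight vector."""
    w = w_common + 0.3 * rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = np.sign(X @ w + 0.1 * rng.normal(size=n))
    return X, y, w

def evaluate_subject(subjects, test_idx, n_queries=10, lr=0.5):
    """One leave-one-out round: start from the mean of the other subjects'
    weights (a stand-in for the hierarchical prior), actively query points
    from the training fold by uncertainty, and test on the held-out folds."""
    X, y, _ = subjects[test_idx]
    prior = np.mean([w for i, (_, _, w) in enumerate(subjects)
                     if i != test_idx], axis=0)
    n_train = len(X) // 5                       # 1 fold to train, 4 to test
    Xtr, ytr = X[:n_train], y[:n_train]
    Xte, yte = X[n_train:], y[n_train:]
    w, pool = prior.copy(), list(range(n_train))
    for _ in range(min(n_queries, n_train)):
        i = pool.pop(int(np.argmin(np.abs(Xtr[pool] @ w))))  # most uncertain
        if np.sign(Xtr[i] @ w) != ytr[i]:
            w = w + lr * ytr[i] * Xtr[i]        # perceptron-style update
    return float(np.mean(np.sign(Xte @ w) == yte))

w_common = rng.normal(size=5)
subjects = [make_subject(w_common) for _ in range(6)]
acc = float(np.mean([evaluate_subject(subjects, i) for i in range(6)]))
```

Swapping the uncertainty score for a random choice, or replacing the prior by a zero vector, reproduces the baselines against which the paper compares.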
Performance
The framework that we propose in this work for optimizing preference learning combines the multi-task formalism with active learning. The multi-task ideas in preference learning are especially useful when the training preference data from a subject is scarce. In this situation it makes sense to use the preference data from other subjects as additional information.
Letor
The pairwise comparisons from the Letor data set were generated by adding noise in the interval [−0.5,0.5] such that the relative order between the three relevance levels is preserved, but ties within each relevance level are broken. As a result, different subjects agree on comparisons between different relevance levels; the data was thus constructed to have an underlying common structure in the preferences of the different subjects. For this reason we expect multi-task learning to improve the performance. In order to validate this hypothesis, we checked whether the preferences of a new subject can be learned more accurately by using the available preference data from other subjects. We compared the hierarchical model with the Gaussian-process method for preference learning of Chu and Ghahramani (2005b), which assumes no prior information. The hierarchical/community prior was obtained by applying the EM algorithm described in Sect. 2.3 in combination with the semi-parametric utility function from Eq. (4); the hierarchical prior was learned from 20 samples from each of the other subjects. The method of Chu and Ghahramani (2005b) was applied with a Gaussian kernel whose parameters were tuned using cross-validation.
Figure 1, left panel, compares the accuracy obtained using the hierarchical model to that of the Gaussian-process method for preference learning of Chu and Ghahramani (2005b) in a non-active setting, i.e., for both models the updates are done with randomly selected training points. The prediction accuracy is shown as a function of the number of data points included in the training set for the test subject. The plots show that the improvement obtained with the hierarchical model depends on the size of the training data. This is in accordance with the expectation that the multi-task formalism is suited for situations in which the available training data is small. The results obtained using a hierarchical prior on the audio and art data sets are similar to those obtained on the Letor data set. Figure 1, right panel, compares the accuracy obtained with random and active selection, both starting from a hierarchical prior. Please note the change in scaling of the y-axis. The active selection was implemented using the Committee criterion. These plots show that the combination of multi-task and active learning indeed improves the performance.
Audio
Figure 2 shows the performance of the Committee criterion (left) and the Variance-logdet criterion (right) versus random selection on the audio data set. The plots show the prediction accuracy (on the y-axis) as a function of the number of updates from the hierarchical prior (on the x-axis). The shaded region indicates the accuracy of 10 random selection runs. The error bars give the standard deviation of the mean accuracy, averaged over the 32 subjects. We used the Committee criterion with γ=0 since the subjects in the committee are quite similar to each other, which is also suggested by the small error bars. The informative prior improves the predictions at the beginning, when no preference observations have yet been made for the new subject: the hierarchical prior alone already gives an accuracy of almost 0.7 on the audio data at the start of learning. The hierarchical prior was learned from 20 randomly selected data points per subject. Committee and Variance-logdet strongly overlap and are considerably better than a random strategy. The audio data set contains a few very informative data points and some that are not informative at all. In some cases the difference between the two sound samples presented in an experiment is so small that the subject cannot hear any difference. Such experiments are not informative, because the subject’s answer is close to random and does not provide any information about the subject’s preferences. The active learning criteria avoid selecting this type of experiment and obtain better performance than random selection. The performance of the other active learning strategies (not shown) is comparable to that of the strategies shown in Fig. 2, except for the Vote criterion, which does not seem to perform better than random. We refer to Sect. 4.5 for an empirical evaluation of the similarities between the active learning criteria considered in Sect. 3.
Art
Figure 3 shows the performance of the Committee criterion (left) and the Variance-logdet criterion (right) versus random selection on the art data set. The plots show the prediction accuracy (on the y-axis) as a function of the number of updates from the hierarchical prior (on the x-axis). The shaded region indicates the accuracy of 10 random selection runs. The error bars give the standard deviation of the mean accuracy, averaged over the subjects. For the art data, which has a higher variability between subjects, the Committee criterion with γ=1 performs slightly better than the Committee criterion with γ=0. The preferences of people for art images are more difficult to predict, since preferences do not depend on low-level characteristics of the image such as texture, color, etc. This is why the accuracy obtained on the art data is lower than that obtained, for example, on the audio data. The Variance-logdet criterion appears to perform slightly better than the Committee criterion. Furthermore, the benefit of active learning over random selection is much smaller. As with the audio data set, the performance of the other active learning strategies (not shown) is comparable to that of the strategies shown in Fig. 3.
Computational complexity
One of the main advantages of the Committee criterion is its computational simplicity in comparison with the standard criteria used in experimental design, exemplified here by the Variance-logdet criterion. For every candidate data point to be included in the training set, the Variance-logdet criterion needs to infer the induced posterior distribution. A standard method for performing this approximation is Laplace’s method, which has a cubic complexity in the dimension of the data features because it involves inversions of the covariance matrices. For more details about other inference methods suited to pairwise comparison data we refer to (Birlutiu and Heskes 2007). All the algorithms presented here are linear in the number of data points, which makes them scalable to large data sets. Table 1 shows a comparison between the execution times of the Variance-logdet and Committee criteria as a function of feature dimension. Since these execution times for a fixed number of updates (one in our case) do not depend on the actual nature of the data, we randomly generated data in order to be able to vary the number of input dimensions. The data was generated with dimension 10, 50, 100, and 200. The time was evaluated for 100 candidate data points and 1 update step. The Committee criterion was computed from data belonging to 20 users. The simulations were performed using Matlab on an Intel Xeon processor with 16 GB of memory running Fedora release 9 with Linux kernel 2.6.27. In the case of the Committee criterion, only KL divergences between predictive probabilities need to be computed. The Committee criterion is clearly much faster than Variance-logdet; furthermore, contrary to the Variance-logdet criterion, its computational complexity is independent of the dimension of the features.
Similarities between criteria
In order to test empirically the approximations and similarities from Sect. 3.4, we computed the Spearman rank correlations (Hollander and Wolfe 1973) between the scores assigned by the criteria when evaluating the informativeness of the data points. The correlations were computed for all the data sets considered in the experimental evaluation: the Letor, audio, and art data sets. For each subject the learning started with the hierarchical prior learned from the data of the other subjects. This prior was updated by taking into account the information from 20 randomly selected data points for each data set. After these updates, we computed the scores assigned by each of the active learning criteria to 50 randomly chosen data points. Figure 4 shows the Spearman rank correlation coefficients for each pair of criteria; the darker the color, the closer to 1 the correlations are and the stronger the similarity between the two criteria. Several observations can be made from this figure: (i) One can notice a darker square in the lower-left part of the figures, for all three data sets. This square involves the Variance-logdet, Change-KL, Variance-trace, and Committee criteria. The correlations between each pair of them are very close to 1, which suggests that these criteria perform very similarly in practice. This is also what the theory from Sect. 3.4 suggests by approximating these criteria to a similar form. (ii) The Variance-logdet and Change-KL criteria have a Spearman rank correlation extremely close to 1; their approximations are proven to be equivalent in Lemma 5 in the Appendix. These two observations also suggest that the approximations of Variance-logdet and Change-KL are very accurate. (iii) The Vote criterion performs in some situations like random selection, since when the number of subjects is much smaller than the number of data points considered, the scores assigned by the Vote criterion are the same for most of the experiments. (iv) The Uncertainty criterion is the most different from the others.
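The rank-correlation computation behind Fig. 4 can be sketched with synthetic criterion scores. The ranks-via-double-argsort implementation below is valid when the scores contain no ties; the score vectors are invented for illustration.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for score vectors without ties."""
    rank = lambda v: np.argsort(np.argsort(v))
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

rng = np.random.default_rng(0)
scores_a = rng.normal(size=50)                     # one criterion's scores
scores_b = scores_a + 0.05 * rng.normal(size=50)   # near-identical ranking
scores_c = rng.normal(size=50)                     # unrelated criterion

rho_close = spearman(scores_a, scores_b)           # close to 1
rho_far = spearman(scores_a, scores_c)             # close to 0
```

A correlation near 1, as between scores_a and scores_b here, corresponds to the dark cells of Fig. 4 where two criteria rank the candidate queries almost identically.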
Conclusions and discussions
This work studied how to exploit models learned in other scenarios to actively and efficiently learn a model for a new scenario. Our approach to active learning in a multi-scenario setting combines a hierarchical Bayesian prior (to learn from related scenarios) with active learning (to learn efficiently by selecting informative examples). Our new Committee criterion, inspired by the Query-by-Committee method, is very similar to the standard criteria from experimental design, in particular in the early stages of active learning, but computationally more efficient. Aside from the computational advantage, the Committee criterion introduces the idea of having the data available from other users collaborate in order to select the most informative experiments to perform with a new user. The same idea is already implicit in the Query-by-Committee algorithm. We show, theoretically and through experiments, that this conceptual idea also works with a committee of people. This can be interpreted as another way of using people as the elements of a machine learning algorithm, which is a very promising research area, as also suggested by Sanborn and Griffiths (2008).
There are several aspects of the approach proposed here that require further attention: (i) The design is myopic in the sense that the active learning criteria look one step ahead at a time when evaluating the informativeness of a data point. A non-myopic design “looks” more than one step ahead and is theoretically closer to the best possible design, but computationally much more expensive. Due to the computational complexity of a non-myopic design, we discussed all the active learning criteria from a myopic perspective; however, a non-myopic perspective can be applied to all of them, similar to the one proposed by Boutilier (2002). (ii) In this work we used log-linear models and Gaussian distributions to model the preference data. The same idea, of using models learned on data from different subjects (or scenarios) to actively select examples for a new subject, can be applied to other models and starting from different priors as well, although the mathematics will be a bit more involved and less intuitive. In particular, considering a mixture of Gaussians as the prior may still be feasible and may lead to an active learning strategy that tries to find those examples that can best discriminate to which mixture component the current model belongs.
References
Adomavicius, A., Sankaranarayanan, R., Sen, S., & Tuzhilin, A. (2005). Incorporating contextual information in recommender systems using a multidimensional approach. ACM Transactions on Information Systems, 23(1), 103–145.
Anand, P. (1993). The philosophy of intransitive preferences. The Economic Journal, 103(417), 337–346.
Arehart, K. H., Kates, J. M., Anderson, C. A., & Harvey, L. O. Jr. (2007). Effects of noise and distortion on speech quality judgments in normalhearing and hearingimpaired listeners. Journal of the Acoustical Society of America, 122(2), 1150–1164.
Arens, R. (2008). Learning SVM ranking function from user feedback using document metadata and active learning in the biomedical domain. In Proceedings of the ECML/PKDD workshop on preference learning (pp. 363–383).
Argyriou, A., Micchelli, C., & Pontil, M. (2008). When is there a representer theorem? Vector versus matrix regularizers. Journal of Machine Learning Research.
Bakker, B., & Heskes, T. (2003). Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4, 83–99.
Berger, M. P. F. (1994). D-optimal sequential sampling designs for item response theory models. Journal of Educational and Behavioral Statistics, 19, 43–56.
Birlutiu, A., & Heskes, T. (2007). Expectation propagation for rating players in sports competitions. In Proceedings of the 11th European conference on principles and practice of knowledge discovery in databases (pp. 374–381).
Birlutiu, A., Groot, P., & Heskes, T. (2009). Multitask preference learning with an application to hearing aid personalization. Neurocomputing, 73(7–9), 1177–1185.
Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin: Springer.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Blythe, J. (2002). Visual exploration and incremental utility elicitation. In Proceedings of the 18th national conference on artificial intelligence (pp. 526–532).
Bordley, R. F. (1982). A multiplicative formula for aggregating probability assessments. Management Science, 28(10), 1137–1148.
Boutilier, C. (2002). A POMDP formulation of preference elicitation problems. In Proceedings of the 18th national conference on artificial intelligence (pp. 239–246).
Boutilier, C., Zemel, R. S., & Marlin, B. (2003). Active collaborative filtering. In Proceedings of the 19th annual conference on uncertainty in artificial intelligence (pp. 98–106).
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs, I: the method of paired comparisons. Biometrika, 39, 324–345.
Brinker, K. (2004). Active learning of label ranking functions. In Proceedings of the 21st international conference on machine learning.
Brochu, E., de Freitas, N., & Ghosh, A. (2008). Active preference learning with discrete choice data. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems (Vol. 20, pp. 409–416). Cambridge: MIT Press.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
Chajewska, U., Koller, D., & Parr, R. (2000). Making rational decisions using adaptive utility elicitation. In Proceedings of the 17th national conference on artificial intelligence (pp. 363–369).
Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design: a review. Statistical Science, 10, 273–304.
Christensen, R. (1997). Log-linear models and logistic regression. Berlin: Springer.
Chu, W., & Ghahramani, Z. (2005a). Extensions of Gaussian processes for ranking: semi-supervised and active learning. In NIPS workshop on learning to rank.
Chu, W., & Ghahramani, Z. (2005b). Preference learning with Gaussian processes. In Proceedings of the 22nd international conference on machine learning.
Clyde, M., Muller, P., & Parmigiani, G. (1993). Optimal design for heart defibrillators. In Case studies in Bayesian statistics (Vol. II, pp. 278–292).
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4(1), 129–145.
Dagan, I., & Engelson, S. P. (1995). Committee-based sampling for training probabilistic classifiers. In Proceedings of the 12th international conference on machine learning (pp. 150–157).
Dasgupta, S., & Hsu, D. (2008). Hierarchical sampling for active learning. In Proceedings of the 25th international conference on machine learning (pp. 208–215).
Doyle, J. (2004). Prospects for preferences. Computational Intelligence, 20(2), 111–136.
Dror, H. A., & Steinberg, D. M. (2008). Sequential experimental designs for generalized linear models. Journal of the American Statistical Association, 103(481), 288–298.
Evgeniou, T., Micchelli, A., & Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6, 615–637.
Fedorov, V. V. (1972). Theory of optimal experiments. New York: Academic Press.
Ford, I., & Silvey, S. D. (1980). A sequentially constructed design for estimating a nonlinear parametric function. Biometrika, 67, 381–388.
Freund, Y., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28(2–3), 133–168.
Fürnkranz, J., & Hüllermeier, E. (2010). Preference learning. Berlin: Springer.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). London: Chapman & Hall/CRC.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Gervasio, M. T., Moffitt, M. D., & Pollack, M. E. (2005). Active preference learning for personalized calendar scheduling assistance. In Proceedings of the 10th international conference on intelligent user interfaces (pp. 90–97).
Glickman, M., & Jensen, S. (2005). Adaptive paired comparison design. Journal of Statistical Planning and Inference, 127, 279–293.
Groot, P. C., Birlutiu, A., & Heskes, T. (2010). Bayesian Monte Carlo for the global optimization of expensive functions. In Proceedings of the 19th European conference on artificial intelligence (pp. 249–254).
Guo, S., & Sanner, S. (2010). Real-time multiattribute Bayesian preference elicitation with pairwise comparison queries. In Proceedings of the 13th international conference on artificial intelligence and statistics (pp. 289–296).
Harpale, S., & Yang, Y. (2008). Personalized active learning for collaborative filtering. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 91–98).
Heskes, T., & de Vries, B. (2005). Incremental utility elicitation for adaptive personalization. In Proceedings of the 17th Belgium-Netherlands conference on artificial intelligence (pp. 127–134).
Hollander, M., & Wolfe, D. A. (1973). Nonparametric statistical methods. New York: Wiley.
Jin, R., & Si, L. (2004). A Bayesian approach toward active learning for collaborative filtering. In Proceedings of the 20th conference on uncertainty in artificial intelligence (pp. 278–285).
Kanninen, B. (2002). Optimal design for multinomial choice experiments. Journal of Marketing Research, 39, 307–317.
Lewi, J., Butera, R., & Paninski, L. (2009). Sequential optimal design of neurophysiology experiments. Neural Computation, 21(3), 619–687.
Lewis, D., & Gale, W. (1994). A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (pp. 3–12).
MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computation, 4, 590–604.
MacKay, D. J. C. (2002). Information theory, inference & learning algorithms. Cambridge: Cambridge University Press.
McCallum, A., & Nigam, K. (1998). Employing EM and pool-based active learning for text classification. In Proceedings of the 15th international conference on machine learning (pp. 350–358).
Melville, P., & Mooney, R. (2004). Diverse ensembles for active learning. In Proceedings of the 21st international conference on machine learning (pp. 584–591).
Minka, T. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT.
Pahikkala, T., Waegeman, W., Tsivtsivadze, E., De Baets, B., & Salakoski, T. (2009). From ranking to intransitive preference learning: rock-paper-scissors and beyond. In Proceedings of the ECML/PKDD workshop on preference learning (pp. 84–100).
Qin, T., Liu, T. Y., Xu, J., & Li, H. (2010). LETOR: a benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4), 346–374.
Sanborn, A. N., & Griffiths, T. L. (2008). Markov chain Monte Carlo with people. In Advances in neural information processing systems.
Schein, A., & Ungar, L. (2007). Active learning for logistic regression: an evaluation. Machine Learning, 68(3), 235–265.
Seeger, M. W. (2008). Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9, 759–813.
Settles, B., & Craven, M. (2008). An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the conference on empirical methods in natural language processing (pp. 1070–1079).
Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by Committee. In Proceedings of the 5th annual workshop on computational learning theory (pp. 287–294).
Thrun, S. (1995). Is learning the nth thing any easier than learning the first? In Advances in neural information processing systems (pp. 640–646).
Tversky, A. (1998). Preference, belief, and similarity. Cambridge: MIT Press.
Xu, Z., Kersting, K., & Joachims, T. (2010). Fast active exploration for link-based preference learning using Gaussian processes. In Proceedings of the 2010 European conference on machine learning and knowledge discovery in databases: part III (pp. 499–514).
Yu, K., Schwaighofer, A., Tresp, V., Ma, W. Y., & Zhang, H. J. (2003). Collaborative ensemble learning: combining collaborative and content-based information filtering. In Proceedings of the 19th conference on uncertainty in artificial intelligence (pp. 616–623).
Yu, K., Tresp, V., & Schwaighofer, A. (2005). Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd international conference on machine learning.
Yu, K., Bi, J., & Tresp, V. (2006). Active learning via transductive experimental design. In Proceedings of the 23rd international conference on machine learning (pp. 1081–1088).
Acknowledgements
We would like to thank Kai Yu for providing the art data set and Wei Chu for making available his code on preference learning with Gaussian Processes.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Editor: James Cussens.
Appendix
In this Appendix we prove the equivalences between the active learning criteria stated in Sect. 3.4. We show that these criteria can all be approximated by the same form, namely
$$\sum_{c} p(c \mid a, \alpha)\, g(c \mid a, \alpha)^{T} Q\, g(c \mid a, \alpha),$$
for some vector α and matrix Q. The approximations for the different criteria differ only in the point α at which the gradients and the probabilities are evaluated, and in the weighting matrix Q of the gradients.
We consider probabilistic choice models of the form given in Eq. (2), which by using the definition of the utility function from Eqs. (3) and (4) can be rewritten as
$$p(c \mid a, \alpha) = \frac{1}{Z(a,\alpha)} \exp\biggl(\sum_{j} \alpha_j\, \phi_j(x_c)\biggr),$$
with Z(a,α) the normalization constant.
We define the derivatives of the log probabilities
$$g_j(c \mid a, \alpha) = \frac{\partial}{\partial \alpha_j} \log p(c \mid a, \alpha).$$
We first prove a lemma which states a relationship between the Hessian and the Fisher matrices which will be used in further proofs.
Lemma 3
For any input a and vector α we have the following relationship between the Hessian and the Fisher matrices:
$$H(a,\alpha) = -F(a,\alpha).$$
Proof
We use the shorthand notation p_{c} = p(c | a, α), g_{cj} = g_{j}(c | a, α), ϕ_{ci} = ϕ_{i}(x_{c}), omitting the dependencies on a and α.
From log p_{c} = ∑_{j} ϕ_{cj} α_{j} − log Z, it is easy to see that
$$g_{cj} = \phi_{cj} - \sum_{c'} p_{c'}\, \phi_{c'j}.$$
Furthermore, using ∂p_{c'}/∂α_{i} = p_{c'} g_{c'i},
$$\frac{\partial g_{cj}}{\partial \alpha_i} = -\sum_{c'} p_{c'}\, g_{c'i}\, \phi_{c'j},$$
and thus
$$\frac{\partial^2 \log p_c}{\partial \alpha_i\, \partial \alpha_j} = -\sum_{c'} p_{c'}\, g_{c'i}\, g_{c'j}.$$
Note that the second derivative is in fact independent of c. We then have
$$H_{ij}(a,\alpha) = \frac{\partial^2 \log p_c}{\partial \alpha_i\, \partial \alpha_j} = -\sum_{c} p_{c}\, g_{ci}\, g_{cj} = -F_{ij}(a,\alpha).$$
□
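The identity of Lemma 3 is easy to sanity-check numerically. The sketch below is an illustration only, not the paper's code: it draws hypothetical random features ϕ and parameters α, builds the softmax choice model log p_c = ∑_j ϕ_cj α_j − log Z, forms the Fisher matrix ∑_c p_c g_c g_c^T, and compares it against a finite-difference Hessian of log p_c.

```python
import numpy as np

# Hypothetical data: K alternatives with d-dimensional features phi_c.
rng = np.random.default_rng(0)
K, d = 4, 3
phi = rng.normal(size=(K, d))      # feature vectors phi_c
alpha = rng.normal(size=d)         # model parameters

def log_p(a):
    """Log-probabilities of the choice model p(c) = exp(phi_c . a) / Z(a)."""
    s = phi @ a
    return s - np.log(np.sum(np.exp(s)))

p = np.exp(log_p(alpha))
g = phi - p @ phi                  # gradients g_cj = phi_cj - sum_c' p_c' phi_c'j
fisher = np.einsum('c,ci,cj->ij', p, g, g)

# Hessian of log p_0 via central finite differences; by the lemma it is
# independent of c and equals minus the Fisher matrix.
eps = 1e-4
hess = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        ei, ej = eps * np.eye(d)[i], eps * np.eye(d)[j]
        hess[i, j] = (log_p(alpha + ei + ej)[0] - log_p(alpha + ei - ej)[0]
                      - log_p(alpha - ei + ej)[0] + log_p(alpha - ei - ej)[0]) / (4 * eps**2)

print(np.allclose(hess, -fisher, atol=1e-5))  # prints True
```

Because the second derivative does not depend on c, the same check passes with any index in place of `[0]`.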
The following lemma proves the approximation of the Variance-trace criterion from Eq. (26).
Lemma 4
In a first order approximation, the Variance-trace criterion boils down to
Proof
We have
Use of Eq. (23) and Lemma 3 gives the result. □
The following lemma proves the approximation of the Change-KL criterion from Eq. (27).
Lemma 5
In a first order approximation, assuming that Σ_{(a,c)} is close to Σ, we have
i.e., the two criteria are indistinguishable.
Proof
We evaluate the terms of the Change-KL criterion one by one.
The first term gives
The second term
To the same lowest order, we obtain for the third term
Collection of all the terms then gives the result. □
Birlutiu, A., Groot, P. & Heskes, T. Efficiently learning the preferences of people. Mach Learn 90, 1–28 (2013). https://doi.org/10.1007/s10994-012-5297-4
Keywords
 Learning preferences
 Active learning
 Experimental design
 Multi-task learning
 Hierarchical modeling