Efficiently learning the preferences of people
Open Access Article
DOI: 10.1007/s10994-012-5297-4
Abstract
This paper presents a framework for optimizing the preference learning process. In many real-world applications that involve preference learning, the available training data is scarce and obtaining labeled training data is expensive. Fortunately, in many preference learning situations data is available from multiple subjects. We use the multitask formalism to enhance the individual training data by making use of the preference information learned from other subjects. Furthermore, since obtaining labels is expensive, we optimally choose which data to ask a subject to label so as to obtain the most information about her/his preferences. This paradigm, called active learning, has hardly been studied in a multitask formalism. We propose an alternative to the standard criteria in active learning which actively chooses queries by making use of the available preference data from other subjects. The advantages of this alternative are its reduced computational cost and the reduced time for which subjects are involved. We empirically validate our approach on three real-world data sets involving the preferences of people.
Keywords
Learning preferences, Active learning, Experimental design, Multitask learning, Hierarchical modeling
1 Introduction
There has been an increasing interest recently in learning the preferences of people within artificial intelligence research (Doyle 2004). Preference learning provides the means for modeling and predicting people’s desires and this makes it a crucial aspect in modern applications such as decision support systems (Chajewska et al. 2000), recommender systems (Blythe 2002; Blei et al. 2003), and personalized devices (Clyde et al. 1993; Heskes and de Vries 2005).
A prototypical example of an application of preference learning that we will use in this paper is fitting hearing aids, i.e., tuning hearing-aid parameters so as to maximize user satisfaction. This is a complex task for three reasons: (1) the parameter space is high-dimensional, (2) the determinants of hearing-impaired user satisfaction are unknown, and (3) the evaluation of this satisfaction through listening tests is costly (in terms of patient burden and clinical time investment) and unreliable (due to inconsistent responses). The last point illustrates an important issue that preference learning has to address, namely the limited availability of labeled data for model training. Obtaining appropriate training data in preference learning applications requires time and effort from the modeled user. This shortcoming can be addressed by taking advantage of two characteristics of the settings in which preference learning is usually applied. First, the training data is mostly acquired through interactions with the modeled user; second, preferences are modeled for multiple users, so that multiple training data sets are available. For preference learning methods to be implemented in real-world systems, they must be capable of exploiting all possible sources of information in the most efficient way.
In most situations in which preference learning is involved, data is available from multiple subjects. Thus, even though individual data is scarce and difficult to obtain, we can optimize the learning of the preferences of a new subject by making use of the available data from other subjects. Learning in this setting is known as multitask or hierarchical learning and has been studied extensively in recent years in machine learning. By using the multitask formalism, the preference data collected for other subjects can be gathered and used as prior information when learning the preferences of a new subject. Furthermore, to deal with the fact that obtaining labeled data is expensive, we can speed up learning by optimally choosing the examples to be queried. At each learning step we can decide which example gives the most information about the subject’s preferences. This paradigm, called active learning in the machine learning literature and related to sequential experimental design in statistics, has been studied extensively, but hardly in the multitask setting.
The aim of this work is to present an efficient framework for optimizing the preference learning process. This framework combines active learning and multitask learning in the preference learning context. The contribution of this work is a criterion for active learning designed for the multitask setting. The advantages of this criterion are its interpretability and its ease of computation.
The structure of this paper is as follows. First, this section ends by presenting related work on preference learning, multitask learning and active learning. In Sect. 2 we describe the learning framework. We consider learning from qualitative preference observations in which the subject makes a choice for one of the presented alternatives. This can be modeled using the probabilistic choice models introduced in Sect. 2.1. Learning a utility function representing the preferences of a subject from this type of preference observations is described in Sect. 2.2. Learning the utility function in a multitask setting by making use of the data available from other subjects is considered in Sect. 2.3. In Sect. 3 we present several criteria for selecting the most informative experiments with respect to a subject’s preferences. After reviewing some of the standard criteria from experimental design, we propose an alternative criterion which makes use of the preference observations collected already from a community of subjects. We show that this alternative criterion is connected to the standard criteria from experimental design. In Sect. 4 we demonstrate experimentally the usefulness of our framework on three data sets, a subset of the Letor data set, an audiological data set and a data set about people’s preferences for art images. In Sect. 5 we present several conclusions and discuss directions for future research.
1.1 Background and related work
In this section we review some studies from preference learning, multitask learning, and active learning related to the work presented in this paper.
1.1.1 Preference learning
 1. Based on the application area, preference learning approaches can be divided into the following main groups: (i) applied to the field of information retrieval, e.g., learning to rank search results of a query or a search engine, (ii) applied to recommender systems, e.g., used by online stores to recommend products to their customers, or for personalized devices, and (iii) bipartite ranking and label ranking, which find applications in disciplines such as medicine and biology. The application scenarios that we consider in the experimental evaluation in Sect. 4 belong to information retrieval (the Letor data set) and recommender systems (the audiological and art data sets).
 2. The learning technique divides the preference learning approaches into four categories: (i) learning a binary preference relation that compares pairs of alternatives, (ii) model-based approaches that aim at identifying the preference relation by making sufficiently restrictive model assumptions, (iii) local estimation techniques which lead to aggregating preferences, and (iv) learning utility functions by using regression to map instances to target valuations for direct ranking. We focus on the last approach and use a utility function in order to model a subject’s preferences. The utility function is learned in a Bayesian framework.
 3. The learning task includes label, instance, and object ranking. Label ranking can be seen as a generalization of classification where a complete ranking of labels is associated with an instance instead of only a class label. Instance ranking can be seen as a generalization of ordinal classification where an instance belongs to one among a finite set of classes and the classes have an order. The setting of object ranking has the peculiarity of having no supervision in the sense that no class label is associated with an object. Instead, a finite set of pairwise preferences or other ordering between objects is given. The setting that we consider in this work belongs to the last category.
In many preference learning settings it is important to take the context into account, i.e., to perform context-aware preference learning (Adomavicius et al. 2005). The motivation for context-aware preference learning is that the same subject/user/consumer may use different decision-making strategies and prefer different products under different contexts. For hearing-aid fitting, which is one of the application scenarios that we use in this work, this means that a user would prefer one setting of the hearing-aid parameters when listening to a concert and another setting when taking part in a discussion. In general, context-aware preferences require bigger data sets, as preferences have to be learned for all contextual situations. The approach that we present in this paper is applicable in this setting as well. While this is an interesting, related topic, it is beyond the scope of the current work.
1.1.2 Multitask learning
The idea behind multitask learning is to utilize labeled data from other “similar” learning tasks in order to improve the performance on a target task. It is inspired by research on transfer of learning in psychology, more specifically on the dependency of human learning on prior experience. For example, the abilities acquired while learning to walk presumably apply when one learns to run, and knowledge gained while learning to recognize cars could apply when recognizing trucks. The initial foundations for multitask learning were laid by Thrun (1995) and Caruana (1997). The psychological theory of transfer of learning implies similarity between tasks. In a related way, multitask learning assumes similarity between the models of different tasks. For example, Evgeniou et al. (2005) and Argyriou et al. (2008) exploit similarity between the deterministic parts of the models by means of regularization, which improves performance. In this work we implement multitask learning using a Bayesian approach. The Bayesian approach to multitask learning assumes that the parameters of the individual models are drawn from the same prior distribution. An example is Bakker and Heskes (2003), where a mixture of Gaussians is used for the top of the hierarchy. This leads to clustering the tasks, one cluster for each Gaussian in the mixture. In Yu et al. (2005) and Birlutiu et al. (2009) a hierarchical Gaussian process is derived with a normal-inverse-Wishart distribution used at the top of the hierarchy.
1.1.3 Active learning
Active learning, also known in the statistics literature as sequential experimental design, is suitable for situations in which labeling points is difficult, time-consuming, and expensive. The idea behind active learning is that optimal selection of the training points achieves better performance than random selection. The scenarios in which active learning can be applied belong to one of the following three categories: (i) generating de novo points for labeling; (ii) stream-based active learning, where the learner decides whether or not to request the label of a given instance; (iii) pool-based active learning, where queries are selected from a large pool of unlabeled data. In this work we consider the pool-based active learning setting.
Methods for active learning can be roughly divided into two categories: those with and those without an explicitly defined objective function. Uncertainty sampling (Lewis and Gale 1994), Query-by-Committee (Seung et al. 1992; Freund et al. 1997) and variants thereof belong to the latter category. They are based on the idea of selecting the most uncertain data given the previously trained models. The methods with an explicit objective function are often motivated by the theory of experimental design (Fedorov 1972; Chaloner and Verdinelli 1995; Schein and Ungar 2007; Lewi et al. 2009; Dror and Steinberg 2008). The objective function quantifies the expected gain of labeling a particular input, for example in terms of the expected reduction in the entropy of the model parameters (MacKay 1992; Cohn et al. 1996). With respect to the performance of the two categories of methods, Schein and Ungar (2007) show that the methods with an explicit objective function perform better but are computationally more expensive, because the models have to be retrained for each candidate point. A trend is to improve the performance of active learning methods by combining them with heuristics designed either for the context in which they are applied or for the models they use, e.g., making use of the unlabeled data available (McCallum and Nigam 1998; Yu et al. 2006), exploiting the clusters in the data (Dasgupta and Hsu 2008), diversifying the set of hypotheses (Melville and Mooney 2004), or adapting active learning to Gaussian processes (Chu and Ghahramani 2005a; Brochu et al. 2008; Groot et al. 2010).
Preference learning can benefit from the active learning paradigm. In most preference learning settings labels are given by people in an explicit way. This means that, to acquire training preference data, the subjects have to interact with the system and express their preferences explicitly. These situations appear when it is impossible or insufficient to collect training preference data implicitly. For example, when learning preferences in a live system where subjects choose their favorite movie electronically, labeling is done automatically by the selection; for other scenarios, such as fitting hearing aids, this implicit way of collecting training data cannot be applied. In these situations, it makes sense to use active learning in order to collect the most informative data. There are several studies in the literature that use active learning in a preference learning setting. Brinker (2004) presents some extensions of pool-based active learning to label ranking problems; Xu et al. (2010) address the problem of preference learning using relational models between items; Guo and Sanner (2010) investigate active preference learning for real-time systems; Brochu et al. (2008) propose a criterion for active learning that maximizes the expected improvement at each query without accurately modeling the entire valuation surface. Furthermore, there are several studies which investigate active preference learning for practical applications such as collaborative filtering (Jin and Si 2004; Harpale and Yang 2008; Boutilier et al. 2003), personalized calendar scheduling (Gervasio et al. 2005), or optimizing search results for biomedical documents (Arens 2008).
The difference between our work and the other studies on active preference learning mentioned above is that we consider active preference learning in a multitask setting, i.e., we are interested in settings with multiple learning tasks and in how active learning can be implemented efficiently in this case. We propose a criterion for active learning designed for this multitask learning setting. This criterion, which we call the Committee criterion, makes use of the preference observations already collected from a community of subjects. The idea behind the Committee criterion is related to the Query-by-Committee method from active learning, which selects those queries that have maximum disagreement amongst an ensemble of hypotheses. The difference in our case is that the group of subjects for which the preferences were already learned plays the role of the ensemble of hypotheses, instead of an ensemble of models learned on the same task.
1.2 Notation
Boldface notation is used for vectors and matrices and normal fonts for their components. Superscripts are used to distinguish between different vectors or matrices and subscripts to address their components. The notation \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\varSigma})\) is used for a multivariate Gaussian with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\varSigma}\). The transpose of a matrix \(\boldsymbol{M}\) is denoted by \(\boldsymbol{M}^{T}\). Capital letters are used for constants and small letters for indices, e.g., i=1,…,I.
2 Learning framework
The idea of using the preference observations from other subjects in order to optimize the process of learning the preferences of a new subject can in principle be applied in any preference learning context. In this work, we consider the case of qualitative preference observations, which can be modeled using the probabilistic choice models described in this section.
2.1 Probabilistic choice models
One important drawback of the Bradley-Terry model is that it assumes very strong transitivity conditions on preference relations, while some psychological experiments have shown that human preference judgments can violate transitivity (Anand 1993; Tversky 1998). In most situations transitivity violations can be considered as noise. When this is not applicable, specific probabilistic models for human preference judgments which preserve intransitive reciprocal relations have to be designed. This was recently investigated by Pahikkala et al. (2009), who introduced a new kernel function in the framework of regularized least squares which is capable of inferring intransitive reciprocal relations.
2.2 The utility function
2.3 Multitask preference learning
 E-step: Estimate the sufficient statistics (mean \(\boldsymbol{\mu}^{m}\) and covariance matrix \(\boldsymbol{\varSigma}^{m}\)) of the posterior distribution corresponding to each subject m, given the current estimates at step t (\(\bar{\boldsymbol{\mu}}^{(t)}\) and \(\bar{\boldsymbol{\varSigma}}^{(t)}\)) of the hierarchical prior.
 M-step: Re-estimate the parameters of the hierarchical prior, Eqs. (7) and (8).
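To illustrate the M-step, the following minimal sketch assumes the standard EM update for a Gaussian hierarchical prior: the new prior mean is the average of the subjects' posterior means, and the new prior covariance is the average posterior covariance plus the scatter of the posterior means around the new prior mean. The equations themselves are not reproduced in this text, so the function name `m_step` and the exact form of the covariance update are assumptions rather than the authors' implementation.

```python
import numpy as np

def m_step(post_means, post_covs):
    """Re-estimate a Gaussian hierarchical prior from per-subject posteriors.

    post_means: array of shape (M, D) with the posterior means mu^m
    post_covs:  array of shape (M, D, D) with the posterior covariances Sigma^m
    Assumed standard update; not taken verbatim from the paper.
    """
    post_means = np.asarray(post_means, dtype=float)
    post_covs = np.asarray(post_covs, dtype=float)
    n_subjects = post_means.shape[0]
    # Prior mean: average of the posterior means.
    prior_mean = post_means.mean(axis=0)
    # Scatter of the posterior means around the new prior mean.
    centered = post_means - prior_mean
    scatter = centered.T @ centered / n_subjects
    # Prior covariance: average posterior covariance plus scatter.
    prior_cov = post_covs.mean(axis=0) + scatter
    return prior_mean, prior_cov
```

In a full EM loop this function would alternate with an E-step that recomputes each subject's posterior under the updated prior.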
3 Active preference learning
In this section we discuss methods for active preference learning. We start from the Query-by-Committee (QBC) method for active learning (Seung et al. 1992) and, based on it, propose some variants of QBC adapted to the setting of preference learning for multiple subjects (Sect. 3.1). Furthermore, we show how these variants of QBC can be naturally linked to hierarchical Bayesian modeling in order to reduce the computations (Sect. 3.2). Finally, we show connections between the proposed variants of QBC and other active learning criteria (Sect. 3.4).
3.1 QBC for preference learning
In this section we will discuss how to adapt QBC to our preference learning setting.
3.1.1 The committee members
For the QBC approach to be effective it is important that the committee is made of consistent and representative models. The main idea in this work is to exploit the preference learning setting with multiple subjects and use the learned models of other subjects \(\mathcal{M}_{1},\ldots, \mathcal{M}_{M}\) as committee members when learning the preferences of a new subject.
After choosing the committee we still have to decide upon a suitable criterion for selecting the next examples. Some measures of disagreement among the committee members appear to be most obvious, and in the following we will consider two alternatives.
3.1.2 Vote criterion
3.1.3 Committee criterion
An advantage of the Committee criterion is its computational efficiency: the first term on the right-hand side of Eq. (12) as well as the average predictive probability can be computed beforehand. The Committee criterion does require computation of the predictive probabilities corresponding to the current model, but this is the least one could expect from an active design. This is to be compared with the QBC criterion (either of the two variants considered), which requires constructing new committee members with each update, and with D-optimal experimental design, which calls for keeping track of variances.
Note that we have not made any restriction so far with respect to the probabilistic models used in the active learning design. In the following we will consider only the log-linear models introduced in Sect. 2. They have some nice properties, which simplify the computation of the Committee criterion (Sect. 3.2) and provide a natural link to hierarchical Bayesian modeling (Sect. 2.3). The general idea of using the already learned models from the other tasks as the committee members in a QBC-like approach is of course also applicable to other models.
3.2 Average probability
In this section we discuss how to efficiently compute the average probability used for computing the Committee criterion in Eq. (12) in the case of log-linear models (Christensen 1997). For linear utility functions the likelihood function defined in Eq. (2) is a log-linear model: the log-odds of the model are linear in the parameter.
As can be seen from the EM updates in Eq. (7), the average \(\bar{\boldsymbol{\mu}}\) in the logarithmic opinion pool is then precisely the mean of the learned hierarchical prior. Summarizing, once we have learned a hierarchical prior from the data available for subjects 1 through M using the EM algorithm, we can start off the new model M+1 from this prior (as is normally done in hierarchical Bayesian learning). On top of this, the same EM algorithm gives us the information we need to compute the Committee criterion that can be used subsequently to select new inputs to label.
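The link between the logarithmic opinion pool and the prior mean can be checked numerically. The sketch below assumes a binary logistic likelihood \(p(c=1 \mid \boldsymbol{x}, \boldsymbol{w}) = \sigma(\boldsymbol{w}^{T}\boldsymbol{x})\) with a linear utility, where \(\boldsymbol{x}\) stands for the input of a query (for a pairwise comparison, the difference between the two feature vectors); the function names are illustrative. Under this assumption, the normalized geometric mean of the committee members' predictive probabilities equals the prediction of a single model whose parameter is the average of the members' parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_opinion_pool(weights, x):
    """Normalized geometric mean of the members' probabilities for c = 1.

    weights: (M, D) array, one parameter vector per committee member
    x:       (D,) input vector of the query
    """
    z = weights @ x                         # one logit per committee member
    log_p1 = np.mean(np.log(sigmoid(z)))    # average log-probability of c = 1
    log_p0 = np.mean(np.log(sigmoid(-z)))   # average log-probability of c = 0
    return np.exp(log_p1) / (np.exp(log_p1) + np.exp(log_p0))

def mean_parameter_prediction(weights, x):
    """Prediction of the single model that uses the average parameter vector."""
    return sigmoid(weights.mean(axis=0) @ x)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))   # five hypothetical committee members
x = rng.normal(size=3)
pooled = log_opinion_pool(W, x)
direct = mean_parameter_prediction(W, x)
```

The two values agree up to floating-point error because the ratio \(\sigma(z)/\sigma(-z) = e^{z}\) turns the geometric mean of the odds into the exponential of the average logit, which is why the pooled prediction needs only the average parameter vector and not the individual members.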
3.3 Other criteria for active learning
 1. Uncertainty sampling (Lewis and Gale 1994). In this strategy an active learner chooses for labeling the example for which the model’s predictions are most uncertain. The uncertainty of the predictions can be measured, for example, using Shannon entropy
$$ \mathrm{Uncertainty}(a) = -\sum_c p(c \mid a, \mathcal{M}) \log p(c \mid a, \mathcal{M}) . \tag{16} $$
For a binary classifier this strategy reduces to querying points whose prediction probabilities are close to 0.5. Intuitively this strategy aims at finding the decision boundary as fast as possible, since this is indicated by the regions where the model is most uncertain.
 2. Variance reduction (MacKay 1992). This strategy, also known in experimental design as D-optimality (Fedorov 1972; Chaloner and Verdinelli 1995; Berger 1994; Ford and Silvey 1980), chooses as the most informative experiments the ones that give the most reduction in the model’s uncertainty. The motivation behind this strategy is a result of Geman et al. (1992) which shows that the generalization error can be decomposed into three components: (i) noise (which is independent of the model or training data); (ii) bias (due to the model); (iii) model’s uncertainty. Since the model cannot influence the noise and the bias components, the future generalization error can only be influenced via the model’s variance. Formally, this criterion can be written as
$$ \mathrm{Variance}(a) = \sum_c p(c \mid a, \mathcal{M}) \, \mathrm{variance}[\mathcal{M}_{(a,c)}] - \mathrm{variance}[\mathcal{M}] . \tag{17} $$
In the setting considered in this work the variance of the model is expressed in the covariance of the Gaussian distribution. In order to use Eq. (17) we need to choose a measure for the variance. We can consider, for example, the log-determinant of the covariance matrix
$$ \mbox{Variance-logdet}(a) = \sum_c p(c \mid a, \mathcal{M}) \log\det(\boldsymbol{\varSigma}_{(a,c)}) - \log\det(\boldsymbol{\varSigma}) , \tag{18} $$
which is actually minimizing the entropy of the Gaussian random variable representing the current model, or the trace of the covariance matrix
$$ \mbox{Variance-trace}(a) = \sum_c p(c \mid a, \mathcal{M}) \operatorname{Tr}(\boldsymbol{\varSigma}_{(a,c)}) - \operatorname{Tr}(\boldsymbol{\varSigma}) . \tag{19} $$
 3. Expected model change (Cohn et al. 1996). This strategy chooses as the most informative query the one which, when added to the training set, would yield the greatest model change. Quantifying the model change depends on the learning framework. For gradient-based optimization the change can be measured via the training gradient, i.e., the vector used to re-estimate parameter values (Settles and Craven 2008). In the Bayesian framework, the model change can be quantified via a distance measure between the current distribution and the posterior distribution obtained after incorporating the candidate point
$$ \mathrm{Change}(a) = \sum_c p(c \mid a, \mathcal{M}) \, \mathrm{distance}[\mathcal{M}, \mathcal{M}_{(a,c)}] . $$
A suitable distance for our setting is the Kullback-Leibler divergence between distributions, which for two Gaussians has a closed-form solution, Eq. (20). The KL divergence between Gaussians is used by Seeger (2008) to design an efficient sequential experimental design in a setting similar to the one used in this work.
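The quantities behind these three criteria can be sketched as follows. This is an illustrative sketch, not the authors' code: `shannon_entropy` computes the entropy of Eq. (16), `variance_logdet` computes the expected change in log-determinant of Eq. (18) from hypothetical post-update covariances supplied by the caller, and `kl_gaussians` is the standard closed-form KL divergence between two multivariate Gaussians. The direction of the divergence used in Eq. (20) is not stated in the surrounding text, so the argument order here is an assumption.

```python
import numpy as np

def shannon_entropy(probs):
    """Entropy of a predictive distribution over the possible outcomes c."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]            # treat 0 * log(0) as 0
    return float(-np.sum(nonzero * np.log(nonzero)))

def variance_logdet(probs, covs_after, cov_now):
    """Expected log-determinant after the update minus the current one.

    probs:      predictive probabilities p(c | a, M), one per outcome c
    covs_after: posterior covariances Sigma_(a,c), one per outcome c
    cov_now:    current posterior covariance Sigma
    """
    expected = sum(p * np.linalg.slogdet(cov)[1]
                   for p, cov in zip(probs, covs_after))
    return float(expected - np.linalg.slogdet(cov_now)[1])

def kl_gaussians(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in closed form."""
    mu0, mu1 = np.asarray(mu0, dtype=float), np.asarray(mu1, dtype=float)
    cov0, cov1 = np.asarray(cov0, dtype=float), np.asarray(cov1, dtype=float)
    dim = mu0.shape[0]
    diff = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    maha_term = diff @ np.linalg.solve(cov1, diff)
    logdet0 = np.linalg.slogdet(cov0)[1]
    logdet1 = np.linalg.slogdet(cov1)[1]
    return float(0.5 * (trace_term + maha_term - dim + logdet1 - logdet0))
```

Evaluating `variance_logdet` or an expected KL change for a candidate requires the updated posterior for every possible outcome, which is the retraining cost that makes these criteria expensive.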
3.4 Similarities between criteria
In this section we consider the following active learning criteria: Variance-logdet, Committee, Variance-trace and Change-KL. We investigate how similar the active learning criteria are and how they can be related. We analyze the modifications induced to the model by the criteria after updating the probability model to incorporate the information from new training points. A single update induces a small change in the posterior distribution, and this allows for Taylor expansions, keeping only the lowest nonzero contribution. In the following we present the main results of the approximations while some of the details can be found in the Appendix.
The following lemma approximates the Variance-logdet criterion to the form from Eq. (21).
Lemma 1
Proof
Using the same type of approximation, the Committee criterion can be approximated to the same form given in Eq. (21).
Lemma 2
 1. The gradients \(g(c \mid a, \cdot)\) are evaluated at different points: the prior hierarchical mean \(\bar{\boldsymbol{\mu}}\) and the current posterior mean μ. This effect is small since μ is still close enough to \(\bar{\boldsymbol{\mu}}\) for a sufficiently accurate approximation of the gradients, in particular at the start of the learning when selecting the right points to label is more important.
 2. The current posterior variance Σ is replaced by \(\tilde{\boldsymbol{\varSigma}}\). The effect of the precise weighting of the gradients is not so important, and again, at the beginning of learning \(\tilde{\boldsymbol{\varSigma}}\) is close to Σ.
4 Experimental evaluation
This section presents the experimental evaluation of the framework proposed in this paper. We use pairwise comparison data. The main goal of the experimental evaluation is to show that optimally selecting data for labeling using the Committee criterion achieves higher accuracy than random selection. Furthermore, we show that the Committee criterion performs in practice similarly to other standard active learning criteria, but in addition has a computational advantage.
4.1 Data sets
The following data sets related to the preferences of people were used in the experimental evaluation.
4.1.1 Letor
This data set consists of relevance levels assigned to documents with respect to a given textual query (the OHSUMED data set from Letor 3.0, Qin et al. 2010). The relevances were assessed by human experts, using three rank scales: definitely relevant, partially relevant, and not relevant. We used a subset of this data related to Query 1. It contains 138 references with the following labels: 24 definitely relevant, 26 partially relevant, and 88 not relevant. Each of the samples is characterized by a 45-dimensional vector consisting of text features extracted from the titles and abstracts of the documents. The features were normalized. Based on this data set we constructed pairwise preferences belonging to 50 subjects in a way that we describe below. We followed a procedure similar to that of Xu et al. (2010) to turn the relevance levels into pairwise preference comparisons. Since such coarse relevance judgments are considered unrealistic in many real-world applications, Xu et al. (2010) proposed to add uniform noise in the range [−0.5,0.5] to the true relevance levels. This addition preserves the relative order between definitely relevant (respectively partially relevant) documents and partially relevant (respectively not relevant) ones, but randomly breaks ties within each relevance level. To introduce a hierarchical component, we replaced the random tie-breaking of Xu et al. (2010) by a subject-specific one. We do this by replacing the uniform noise with a subject (and feature) dependent term as follows. For subject m, a weight vector \(\boldsymbol{\alpha}_{m}\) is drawn from a zero-mean, fully factorized Gaussian with unit variance, \(\boldsymbol{\alpha}_{m} \sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\).
Given features \(\boldsymbol{x}_{i}\), the noise terms are then the inner products \(\boldsymbol{\alpha}_{m}^{T} \boldsymbol{x}_{i}\), linearly scaled back to the interval [−0.5,0.5] (so as not to destroy the relative order of the true relevance levels), and the relevance levels are taken to be the true relevance levels plus these noise terms.
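The construction of subject-specific preferences can be sketched as follows. This is a minimal sketch under stated assumptions: the linear scaling to [−0.5,0.5] is implemented as min-max scaling over the documents of one subject, a pairwise label is obtained by comparing the two noisy relevance levels, and the function names are hypothetical.

```python
import numpy as np

def subject_relevances(features, relevance, rng):
    """Noisy relevance levels for one simulated subject.

    features:  (N, D) array of document features x_i
    relevance: (N,) array of true relevance levels (e.g. 0, 1, 2)
    rng:       numpy random Generator used to draw alpha_m
    """
    features = np.asarray(features, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    alpha = rng.standard_normal(features.shape[1])   # alpha_m ~ N(0, I)
    noise = features @ alpha                          # alpha_m^T x_i
    span = noise.max() - noise.min()
    if span > 0:
        # Assumed form of the linear scaling: min-max to [-0.5, 0.5].
        noise = (noise - noise.min()) / span - 0.5
    else:
        noise = np.zeros_like(noise)
    return relevance + noise

def pairwise_label(scores, i, j):
    """Return 1 if document i is preferred to document j, and 0 otherwise."""
    return int(scores[i] > scores[j])
```

Because the noise never exceeds 0.5 in absolute value, a document from a higher relevance level is never ranked below one from a lower level, so all subjects agree across levels and differ only in how they break ties within a level.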
4.1.2 Audio
This data set consists of evaluations of sound quality from 32 subjects (this data set was borrowed from Arehart et al. 2007). Each subject performed 576 pairwise comparison listening experiments. Each listening experiment represents one sound sample processed with two different settings of the hearing-aid parameters, and the choice for one of the two. The processed sound sample is represented by a 3-dimensional feature vector.
4.1.3 Art
This data set consists of evaluations of art images from 190 subjects (this data set was borrowed from Yu et al. 2003). Each subject was presented with a number of images from a total of 642 images and asked to rate each of them as like/dislike (on average, each subject rated 90 images). We considered the 32 subjects who rated more than 120 images. Each image is described by a 275-dimensional feature vector, with features which characterize the image such as color, shape, texture, etc. For computational reasons, we reduced the high-dimensional feature vector (275 dimensions) to a lower dimension. We noticed that most of the features were not very informative for predicting the outcomes, which is why we used only the 10 most informative features (the informativeness of the features was measured by averaging the correlations between features and observations). Note that this data set does not contain pairwise comparisons like the other two data sets. With each instance, a binary label is associated: like or dislike, which makes the learning task on this data set a binary classification task. The combination of multitask and active learning that we propose in this work can still be applied in this case, in the same framework which was introduced in Sect. 2, by using the logistic regression model instead of the Bradley-Terry model as likelihood terms in the model from Eq. (6).
4.2 Protocol
Our experiments use a leave-one-out scheme in which each subject was considered once as the current/test subject for which the preferences need to be learned. For each test subject the learning started with the hierarchical prior learned from the data of the other remaining subjects. The data for the test subject was split into 5 folds, 1 fold was used for training and the rest was used for testing. The training data was used as a pool out of which points were selected for labeling either randomly or actively using one of the active learning criteria. The hierarchical prior was updated based on these data points. After every update predictions were made on the test set using the current model. We used accuracy (percentage of correct predictions among all the predictions) as a measure of performance. The accuracy of the predictions on the test data measures how much we learned about the subject preferences. The results were averaged over the 5 splits and over the subjects.
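The data-splitting part of this protocol can be sketched as below. The helper names are hypothetical, and the sketch covers only the index bookkeeping (leave-one-subject-out, then one fold as the training pool and the remaining four folds as the test set); model fitting and query selection are left out.

```python
import numpy as np

def leave_one_out(subject_ids):
    """Yield (test_subject, other_subjects) pairs, one per subject."""
    for test in subject_ids:
        yield test, [s for s in subject_ids if s != test]

def pool_and_test_splits(n_items, rng, n_folds=5):
    """Yield (pool_idx, test_idx): one fold is the training pool, the rest is for testing."""
    folds = np.array_split(rng.permutation(n_items), n_folds)
    for k in range(n_folds):
        test_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield folds[k], test_idx
```

Note that, unlike conventional cross-validation, the smaller part of each split is the one used for training, which mirrors the scarce-data regime the framework targets.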
4.3 Performance
The framework that we propose in this work for optimizing preference learning combines the multitask formalism with active learning. The multitask ideas in preference learning are especially useful when very little training preference data is available from a subject. In this situation it makes sense to use the preference data from other subjects as additional information.
4.3.1 Letor
The pairwise comparisons from the Letor data set were generated by adding noise in the interval [−0.5,0.5] such that the relative order between the three relevance levels is preserved, but ties within each relevance level are broken. As a result, different subjects agree on comparisons between different relevance levels. Thus, the data was constructed to have an underlying common structure in the preferences of different subjects. For this reason we expect multitask learning to improve the performance. To validate this hypothesis, we checked whether the preferences of a new subject can be learned more accurately by using the available preference data from other subjects. We compared the hierarchical model with the Gaussian process method for preference learning of Chu and Ghahramani (2005b), which assumes no prior information. The hierarchical/community prior was obtained by applying the EM algorithm described in Sect. 2.3 in combination with the semiparametric utility function from Eq. (4); the hierarchical prior was learned from 20 samples from each of the other subjects. The method of Chu and Ghahramani (2005b) was applied with a Gaussian kernel whose parameters were tuned using cross-validation.
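The construction of subject-specific comparisons can be illustrated with a short sketch. It rests on assumptions that the text does not confirm: relevance levels are coded as the integers 0, 1, 2, the noise is uniform on the interval, and pairs are drawn at random. The function name is hypothetical. Because adjacent levels differ by 1 and the noise is smaller than 0.5 in magnitude, the order between levels cannot flip, while items within a level receive distinct utilities.

```python
import numpy as np

def noisy_pairwise_comparisons(relevance, n_pairs, rng):
    """Simulate one subject's pairwise comparisons from relevance levels.

    Returns index arrays (i, j) and a boolean array that is True where
    item i is preferred over item j. Pairs with i == j are dropped, so
    fewer than n_pairs comparisons may be returned.
    """
    relevance = np.asarray(relevance, dtype=float)
    # Subject-specific utilities: relevance level plus bounded noise.
    utility = relevance + rng.uniform(-0.5, 0.5, size=relevance.shape)
    i = rng.integers(0, len(relevance), size=n_pairs)
    j = rng.integers(0, len(relevance), size=n_pairs)
    keep = i != j
    i, j = i[keep], j[keep]
    return i, j, utility[i] > utility[j]
```

Calling the function with a different random generator per subject yields subjects who agree on every comparison between relevance levels and differ only on comparisons within a level.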
4.3.2 Audio
4.3.3 Art
4.4 Computational complexity
Execution time (in seconds) for the Variancelogdet and Committee criteria as a function of feature dimension

Feature dimension    Variancelogdet (s)    Committee (s)
10                   2.894                 0.014
50                   13.543                0.010
100                  37.926                0.009
200                  172.661               0.010
4.5 Similarities between criteria
5 Conclusions and discussions
This work studied how to exploit models learned on other scenarios to actively learn a model for a new scenario in an efficient way. Our approach to active learning in a multiscenario setting combines a hierarchical Bayesian prior (to learn from related scenarios) with active learning (to learn efficiently by selecting informative examples). Our new Committee criterion, inspired by the Query-by-Committee method, is very similar to the standard criteria from experimental design, in particular in the early stages of active learning, but is computationally more efficient. Aside from the computational advantage, the Committee criterion introduces the idea of having the data available from other users collaborate in order to select the most informative experiments to perform with a new user. The same idea is already implicit in the Query-by-Committee algorithm. We show, theoretically and through experiments, that this conceptual idea also works with a committee of people. This can be interpreted as another way of using people as the elements of a machine learning algorithm, which is a very promising research area, as also suggested by Sanborn and Griffiths (2008).
There are several aspects related to the approach proposed here that require further attention: (i) The design is myopic in the sense that the active learning criteria look only one step ahead when evaluating the informativeness of a data point. A nonmyopic design looks more than one step ahead; it is theoretically closer to the best possible design but computationally much more expensive. Because of the computational complexity of a nonmyopic design, we discussed all the active learning criteria from a myopic perspective. However, a nonmyopic perspective can be applied to all of them, similar to the one proposed by Boutilier (2002). (ii) In this work we used log-linear models and Gaussian distributions to model the preference data. The same idea, of using models learned on data from different subjects (or scenarios) to actively select examples for a new subject, can be applied to other models and to different priors as well, although the mathematics will be somewhat more involved and less intuitive. In particular, considering a mixture of Gaussians as the prior may still be feasible and may lead to an active learning strategy that tries to find those examples that best discriminate to which mixture component the current model belongs.
Acknowledgements
We would like to thank Kai Yu for providing the art data set and Wei Chu for making available his code on preference learning with Gaussian Processes.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Appendix
We first prove a lemma which states a relationship between the Hessian and the Fisher matrices which will be used in further proofs.
Lemma 3
Proof
We use the shorthand notation $p_c = p(c \mid a, \alpha)$, $g_{cj} = g_j(c \mid a, \alpha)$, $\phi_{ci} = \phi_i(x_c)$, omitting the dependencies on $a$ and $\alpha$.
The following lemma proves the approximation of the Variancetrace criterion from Eq. (26).
Lemma 4
Proof
The following lemma proves the approximation of the ChangeKL criterion from Eq. (27).