# Efficiently learning the preferences of people

## Authors

DOI: 10.1007/s10994-012-5297-4

- Cite this article as:
- Birlutiu, A., Groot, P. & Heskes, T. Mach Learn (2013) 90: 1. doi:10.1007/s10994-012-5297-4

## Abstract

This paper presents a framework for optimizing the preference learning process. In many real-world applications in which preference learning is involved, the available training data is scarce and obtaining labeled training data is expensive. Fortunately, in many preference learning situations data is available from multiple subjects. We use the multi-task formalism to enhance the individual training data by making use of the preference information learned from other subjects. Furthermore, since obtaining labels is expensive, we optimally choose which data to ask a subject to label in order to obtain the most information about her/his preferences. This paradigm, called active learning, has hardly been studied in a multi-task formalism. We propose an alternative to the standard criteria in active learning which actively chooses queries by making use of the available preference data from other subjects. The advantages of this alternative are reduced computation costs and a reduced time during which subjects are involved. We empirically validate our approach on three real-world data sets involving the preferences of people.

### Keywords

Learning preferences, Active learning, Experimental design, Multi-task learning, Hierarchical modeling

## 1 Introduction

There has been an increasing interest recently in learning the preferences of people within artificial intelligence research (Doyle 2004). Preference learning provides the means for modeling and predicting people’s desires and this makes it a crucial aspect in modern applications such as decision support systems (Chajewska et al. 2000), recommender systems (Blythe 2002; Blei et al. 2003), and personalized devices (Clyde et al. 1993; Heskes and de Vries 2005).

A prototypical example of an application of preference learning, which we will use in this paper, is fitting hearing-aids, i.e., tuning hearing-aid parameters so as to maximize user satisfaction. This is a complex task for three reasons: (1) the parameter space is high-dimensional, (2) the determinants of hearing-impaired user satisfaction are unknown, and (3) the evaluation of this satisfaction through listening tests is costly (in terms of patient burden and clinical time investment) and unreliable (due to inconsistent responses). The last point illustrates an important issue that preference learning has to address, namely the limited availability of labeled data for model training. Obtaining appropriate training data in preference learning applications requires time and effort from the modeled user. This shortcoming can be addressed by taking advantage of two characteristics of the settings in which preference learning is usually applied. First, the training data is mostly acquired through interactions with the modeled user; second, preferences are modeled for multiple users, so that multiple training data sets are available. For preference learning methods to be implemented in real-world systems, they must be capable of exploiting all available sources of information in the most efficient way.

In most situations in which preference learning is involved, data is available from multiple subjects. Thus, even though individual data is scarce and difficult to obtain, we can optimize the learning of the preferences of a new subject by making use of the available data from other subjects. Learning in this setting is known as multi-task or hierarchical learning and has been studied extensively in machine learning in recent years. By using the multi-task formalism, the preference data collected for other subjects can be gathered and used as prior information when learning the preferences of a new subject. Furthermore, to deal with the fact that obtaining labeled data is expensive, we can speed up learning by optimally choosing the examples to be queried. At each learning step we can decide which example gives the most information about the subject’s preferences. This paradigm, called active learning in the machine learning literature and related to sequential experimental design in statistics, has been studied extensively, but hardly in the multi-task setting.

The aim of this work is to present an efficient framework for optimizing the preference learning process. This framework considers the combination of active learning and multi-task learning in the preference learning context. The contribution of this work is a criterion for active learning designed for the multi-task setting. The advantages of this criterion are its interpretability and its ease of computation.

The structure of this paper is as follows. First, this section ends by presenting related work on preference learning, multi-task learning and active learning. In Sect. 2 we describe the learning framework. We consider learning from qualitative preference observations in which the subject makes a choice for one of the presented alternatives. This can be modeled using the probabilistic choice models introduced in Sect. 2.1. Learning a utility function representing the preferences of a subject from this type of preference observations is described in Sect. 2.2. Learning the utility function in a multi-task setting by making use of the data available from other subjects is considered in Sect. 2.3. In Sect. 3 we present several criteria for selecting the most informative experiments with respect to a subject’s preferences. After reviewing some of the standard criteria from experimental design, we propose an alternative criterion which makes use of the preference observations collected already from a community of subjects. We show that this alternative criterion is connected to the standard criteria from experimental design. In Sect. 4 we demonstrate experimentally the usefulness of our framework on three data sets, a subset of the Letor data set, an audiological data set and a data set about people’s preferences for art images. In Sect. 5 we present several conclusions and discuss directions for future research.

### 1.1 Background and related work

In this section we review some studies from preference learning, multi-task learning, and active learning related to the work presented in this paper.

#### 1.1.1 Preference learning

1. Based on the application area, preference learning approaches can be divided into the following main groups: (i) applied to the field of information retrieval, e.g., learning to rank search results of a query or a search engine, (ii) applied to recommender systems, e.g., used by online stores to recommend products to their customers, or for personalized devices, and (iii) bipartite ranking and label ranking, which find applications in disciplines such as medicine and biology. The application scenarios that we consider in the experimental evaluation in Sect. 4 belong to information retrieval (the Letor data set) and recommender systems (the audiological and art data sets).

2. The learning technique divides the preference learning approaches into four categories: (i) learn a binary preference relation that compares pairs of alternatives, (ii) model-based approach that aims at identifying the preference relation by making sufficiently restrictive model assumptions, (iii) local estimation techniques which lead to aggregating preferences, and (iv) learning utility functions by using regression to map instances to target valuations for direct ranking. We focus on the latter approach and use a utility function in order to model a subject’s preferences. The utility function is learned in a Bayesian framework.

3. The learning task includes label, instance, and object ranking. Label ranking can be seen as a generalization of classification where a complete ranking of labels is associated with an instance instead of only a class label. Instance ranking can be seen as a generalization of ordinal classification where an instance belongs to one among a finite set of classes and the classes have an order. The setting of object ranking has the peculiarity of having no supervision in the sense that no class label is associated with an object. Instead, a finite set of pairwise preferences or other ordering between objects is given. The setting that we consider in this work belongs to the last category.

In many preference learning settings it is important to take the context into account, i.e., to perform context-aware preference learning (Adomavicius et al. 2005). The motivation for context-aware preference learning is that the same subject/user/consumer may use different decision-making strategies and prefer different products under different contexts. For hearing-aid fitting, which is one of the application scenarios that we use in this work, this means that a user would prefer one setting of the hearing-aid parameters when listening to a concert and another setting when taking part in a discussion. In general, context-aware preferences require bigger data sets, as preferences would have to be learned for all contextual situations. The approach that we present in this paper could be applied in this setting as well. While this is an interesting, related topic, it is beyond the scope of the current work.

#### 1.1.2 Multi-task learning

The idea behind multi-task learning is to utilize labeled data from other “similar” learning tasks in order to improve the performance on a target task. It is inspired by the research on transfer of learning in psychology, more specifically on the dependency of human learning on prior experience. For example, the abilities acquired while learning to walk presumably apply when one learns to run, and knowledge gained while learning to recognize cars could apply when recognizing trucks. The initial foundations for multi-task learning were laid by Thrun (1995) and Caruana (1997). The psychological theory of transfer of learning implies similarity between tasks. In a related way, multi-task learning assumes similarity between the models of different tasks. For example, Evgeniou et al. (2005) and Argyriou et al. (2008) exploit similarity between the deterministic parts of the models by means of regularization, which improves performance.

In this work we implement multi-task learning using a Bayesian approach. The Bayesian approach to multi-task learning assumes the parameters of individual models to be drawn from the same prior distribution. One example is Bakker and Heskes (2003), where a mixture of Gaussians is used for the top of the hierarchy. This leads to clustering the tasks, one cluster for each Gaussian in the mixture. In Yu et al. (2005) and Birlutiu et al. (2009) a hierarchical Gaussian Process is derived with a normal-inverse Wishart distribution used at the top of the hierarchy.

#### 1.1.3 Active learning

Active learning, also known in the statistics literature as sequential experimental design, is suitable for situations in which labeling points is difficult, time-consuming, and expensive. The idea behind active learning is that by optimal selection of the training points a better performance can be achieved than with random selection. The scenarios in which active learning can be applied belong to one of the following three categories: (i) generating *de novo* points for labeling; (ii) stream-based active learning where the learner decides whether to request the label of a given instance or not; (iii) pool-based active learning where queries are selected from a large pool of unlabeled data. In this work we consider the pool-based active learning setting.

Methods for active learning can be roughly divided into two categories: those with and those without an explicitly defined objective function. Uncertainty sampling (Lewis and Gale 1994), Query-by-Committee (Seung et al. 1992; Freund et al. 1997) and variants thereof belong to the latter category. They are based on the idea of selecting the most uncertain data given the previously trained models. The methods with an explicit objective function are often motivated by the theory of experimental design (Fedorov 1972; Chaloner and Verdinelli 1995; Schein and Ungar 2007; Lewi et al. 2009; Dror and Steinberg 2008). The objective function quantifies the expected gain of labeling a particular input, for example in terms of the expected reduction in the entropy of the model parameters (MacKay 1992; Cohn et al. 1996). With respect to the performance of the two categories of methods, Schein and Ungar (2007) show that the methods with an explicit objective function perform better but are computationally more expensive, because the models have to be re-trained for each candidate point. A trend is to improve the performance of the active learning methods by combining them with heuristics designed either for the context in which they are applied or for the models they use, e.g., making use of the unlabeled data available (McCallum and Nigam 1998; Yu et al. 2006), exploiting the clusters in the data (Dasgupta and Hsu 2008), diversifying the set of hypotheses (Melville and Mooney 2004), or adapting the active learning to Gaussian processes (Chu and Ghahramani 2005a; Brochu et al. 2008; Groot et al. 2010).

Preference learning can benefit from the active learning paradigm. In most preference learning settings labels are given by people in an explicit way. This means that for acquiring training preference data, the subjects have to interact with the system, and they need to express their preferences explicitly. These situations appear when it is impossible or insufficient to collect training preference data implicitly. For example, when learning preferences in a live system where subjects electronically choose their favorite movie, labelling is done automatically by the selection, but for other scenarios, such as fitting hearing-aids, this implicit way of collecting training data cannot be applied. In these situations, it makes sense to use active learning in order to collect the most informative data. There are several studies in the literature that use active learning in a preference learning setting. Brinker (2004) presents some extensions of pool-based active learning to label ranking problems; Xu et al. (2010) address the problem of preference learning using relational models between items; Guo and Sanner (2010) investigate active preference learning for real-time systems; Brochu et al. (2008) propose a criterion for active learning that maximizes the expected improvement at each query without accurately modeling the entire valuation surface. Furthermore, there are several studies which investigate active preference learning for practical applications such as collaborative filtering (Jin and Si 2004; Harpale and Yang 2008; Boutilier et al. 2003), personalized calendar scheduling (Gervasio et al. 2005), or optimizing search results for biomedical documents (Arens 2008).

The difference between our work and the other studies on active preference learning mentioned above is that we consider active preference learning in a multi-task setting, i.e., we are interested in settings with multiple learning tasks and in how active learning can be implemented efficiently in this case. We propose a criterion for active learning designed for this multi-task learning setting. This criterion, which we call the Committee criterion, makes use of the preference observations already collected from a community of subjects. The idea behind the Committee criterion is related to the Query-by-Committee method from active learning, which selects those queries that have maximum disagreement amongst an ensemble of hypotheses. The difference in our case is that the group of subjects for which the preferences were already learned plays the role of the ensemble of hypotheses, instead of an ensemble of models learned on the same task.

### 1.2 Notation

Boldface notation is used for vectors and matrices and normal fonts for their components. Superscripts are used to distinguish between different vectors or matrices and subscripts to address their components. The notation \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\varSigma})\) is used for a multivariate Gaussian with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\varSigma}\). The transpose of a matrix \(\boldsymbol{M}\) is denoted by \(\boldsymbol{M}^{T}\). Capital letters are used for constants and small letters for indices, e.g., \(i = 1,\ldots,I\).

## 2 Learning framework

The idea of using the preference observations from other subjects in order to optimize the process of learning the preferences of a new subject can be basically applied in any preference learning context. In this work, we consider the case of qualitative preference observations which can be modeled using the probabilistic choice models described in this section.

### 2.1 Probabilistic choice models

Let \(X = \{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{I}\}\) be a set of inputs. Let \({\mathcal {D}}\) be a set of \(J\) observed preference comparisons over instances in \(X\) corresponding to a subject,

$$ \mathcal{D} = \bigl\{ \bigl(i_{1}(j),\ldots,i_{A}(j), c_{j}\bigr) \mid 1 \le j \le J,\ c_{j} \in \{1,\ldots,A\} \bigr\}, \tag{1} $$

with \(c_{j}\) the choice made, \(i_{1},\ldots,i_{A}: \{1,\ldots,J\} \rightarrow \{1,\ldots,I\}\) index functions such that \(i_{1}(j)\) represents the input presented first in the \(j\)th preference comparison and \(c_{j} = c\) means that \(\boldsymbol{x}_{i_{c}(j)}\) is chosen from the \(A\) alternatives presented in the \(j\)th comparison. For \(A = 2\) this setup reduces to pairwise comparisons between two alternatives.

Probabilistic choice models assume a latent utility function value \(U(\boldsymbol{x}_{i})\) associated with each input \(\boldsymbol{x}_{i}\) which captures the individual preference of a subject for \(\boldsymbol{x}_{i}\) (the utility function will be formally defined in the next section). In the ideal case the latent function values are consistent with the preference observations. This means that alternative \(c\) is preferred over the other alternatives \(c'\) in the \(j\)th comparison whenever the utility for \(c\) exceeds the utilities for the other alternatives \(c'\), i.e., \(U(\boldsymbol{x}_{i_{c}(j)}) > U(\boldsymbol{x}_{i_{c'}(j)})\). In practice, however, subjects are often inconsistent in their responses. A very inconsistent subject will have a high uncertainty associated with the utility function; this uncertainty is directly taken into account in the probabilistic framework. We define this probabilistic framework using the Bradley-Terry model (Bradley and Terry 1952; Kanninen 2002; Glickman and Jensen 2005) by making the standard modeling assumption that the probability that the \(c\)th alternative is chosen by the subject in the \(j\)th comparison follows a multinomial logistic model, which is defined as

$$ p\bigl(c_{j} = c \mid \boldsymbol{x}_{i_{1}(j)},\ldots,\boldsymbol{x}_{i_{A}(j)}, U\bigr) = \frac{\exp [U(\boldsymbol{x}_{i_{c}(j)})]}{\sum_{c'=1}^{A} \exp [U(\boldsymbol{x}_{i_{c'}(j)})]} . \tag{2} $$

The goal is to learn the utility function \(U\) as accurately and with as few comparisons as possible.
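The multinomial logistic choice model described above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy; the function names `choice_probabilities` and `log_likelihood` and the 0-based indexing of inputs and choices are illustrative conventions, not part of the original text.

```python
import numpy as np

def choice_probabilities(utilities):
    """Multinomial logistic probabilities over the A presented alternatives."""
    u = np.asarray(utilities, dtype=float)
    u = u - u.max()          # shift for numerical stability; the result is unchanged
    e = np.exp(u)
    return e / e.sum()

def log_likelihood(U, comparisons):
    """Log-probability of all observed choices under the utilities U.

    U           -- utilities of the I inputs, shape (I,)
    comparisons -- list of (indices, c): the 0-based input indices of the A
                   alternatives shown, and the 0-based position of the choice
    """
    U = np.asarray(U, dtype=float)
    return float(sum(np.log(choice_probabilities(U[list(idx)])[c])
                     for idx, c in comparisons))
```

For a pairwise comparison (A = 2) the probability of choosing the first alternative reduces to the logistic function of the utility difference, e.g. `choice_probabilities([2.0, 0.0])[0]` equals `1 / (1 + exp(-2))`.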

One important drawback of the Bradley-Terry model is that it assumes very strong transitivity conditions on preference relations, while some psychological experiments have shown that human preference judgments can violate transitivity (Anand 1993; Tversky 1998). In most situations transitivity violations can be considered as noise. When this is not applicable, specific probabilistic models for human preference judgments which preserve intransitive reciprocal relations have to be designed. This was recently investigated by Pahikkala et al. (2009), who introduced a new kernel function in the framework of regularized least squares which is capable of inferring intransitive reciprocal relations.

### 2.2 The utility function

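The utility function \(U\) is a real-valued function, \(U: X \rightarrow \mathbb{R}\), which associates with every input \(\boldsymbol{x} \in X\) a real number \(U(\boldsymbol{x})\). Each input \(\boldsymbol{x} \in X\) is characterized by a set of features, \(\boldsymbol{\phi}(\boldsymbol{x}) \in \mathbb{R}^{D}\). One possible choice for the utility function is to express it as a linear combination of the features,

$$ U(\boldsymbol{x}) = \sum_{i=1}^{D} \alpha_{i} \phi_{i}(\boldsymbol{x}) = \boldsymbol{\alpha}^{T} \boldsymbol{\phi}(\boldsymbol{x}) , \tag{3} $$

where \(\boldsymbol{\alpha} = (\alpha_{1},\ldots,\alpha_{D})\) is a vector of weights which captures the importance of each feature of \(\boldsymbol{x}\) when evaluating the utility \(U\) for a specific subject, and \(\phi_{i}(\boldsymbol{x})\) are the components of the vector \(\boldsymbol{\phi}(\boldsymbol{x})\). The preferences of a subject are thus encoded in the vector \(\boldsymbol{\alpha}\) and learning the utility function reduces to learning \(\boldsymbol{\alpha}\).

Alternatively, the utility function can be expressed as a linear combination of basis functions defined by a kernel \(\kappa\) centered on the data points,

$$ U(\boldsymbol{x}) = \sum_{i=1}^{I} \alpha_{i} \kappa(\boldsymbol{x}, \boldsymbol{x}_{i}) , \tag{4} $$

where the vector \(\boldsymbol{\alpha}\), with dimension \(I\) (the number of data points, i.e., the size of the set of inputs \(X\)), captures the preferences of the subject. A non-linear utility function can be obtained by using, for example, a Gaussian kernel,

$$ \kappa(\boldsymbol{x}, \boldsymbol{x}') = \exp \biggl( -\frac{\|\boldsymbol{x} - \boldsymbol{x}'\|^{2}}{2 \ell^{2}} \biggr) , \tag{5} $$

where \(\ell\) is a length-scale parameter. The two definitions of the utility function from Eqs. (3) and (4) are similar in the sense that they are both linear in the parameter. Equation (4) is suited when the number of features is larger than the number of data points, i.e., \(D > I\), and for introducing non-linearity in the utility model.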
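The two parameterizations of the utility function can be sketched as follows. This is an illustrative sketch assuming NumPy; the function names are hypothetical, and the normalization of the Gaussian kernel exponent by `2 * ell**2` is one common convention that is assumed here rather than taken from the source.

```python
import numpy as np

def linear_utility(alpha, phi_x):
    """Utility as a linear combination of the D features of one input."""
    return float(np.dot(alpha, phi_x))

def gaussian_kernel(x, x_prime, ell=1.0):
    """Gaussian kernel with length-scale ell (normalization is an assumption)."""
    d = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * ell ** 2)))

def kernel_utility(alpha, X, x, ell=1.0):
    """Utility as a linear combination of kernels centered on the I data points in X."""
    return float(sum(a_i * gaussian_kernel(x, x_i, ell) for a_i, x_i in zip(alpha, X)))
```

In both cases the utility is linear in `alpha`, so the same learning procedure applies; only the length of `alpha` changes (D features versus I data points).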

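We treat \(\boldsymbol{\alpha}\) as a random variable. We consider a Gaussian prior distribution \(\boldsymbol{\alpha} \sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\varSigma})\), which is updated based on the observations from the preference comparisons using Bayes’ rule (where we omitted the normalization constant),

$$ p(\boldsymbol{\alpha} \mid \mathcal{D}) \propto p(\boldsymbol{\alpha}) \prod_{j=1}^{J} p\bigl(c_{j} \mid \boldsymbol{x}_{i_{1}(j)},\ldots,\boldsymbol{x}_{i_{A}(j)}, \boldsymbol{\alpha}\bigr) . \tag{6} $$

### 2.3 Multi-task preference learning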

Suppose that preference data are available for \(M\) subjects. We make the common assumption of a Gaussian prior distribution, \(p(\boldsymbol{\alpha}^{m}) = \mathcal{N}(\boldsymbol{\alpha}^{m}; \bar{\boldsymbol{\mu}},\bar{\boldsymbol{\varSigma}})\), \(m = 1,\ldots,M\), with the same \(\bar{\boldsymbol{\mu} }\) and \(\bar{\boldsymbol{\varSigma}}\) for the preference models of all subjects. This prior is updated using Bayes’ rule based on the observations from each subject, resulting in a posterior distribution for each individual subject. The common prior over all task parameters controls the general part of the model. This common prior is learned from the data belonging to a group of tasks, other than the current (new) task for which the learning is performed. Starting from the general model given by the common prior, the model is updated using the observations (data) seen in the current task. These task-specific observations control the task-specific part of the model. The hierarchical prior is obtained by maximizing the penalized log-likelihood of all data in a so-called type-II maximum likelihood approach. This optimization is performed by applying the EM algorithm (Gelman et al. 2003), which reduces to the iteration (until convergence) of the following steps:

- E-step: Estimate the sufficient statistics (mean \(\boldsymbol{\mu}^{m}\) and covariance matrix \(\boldsymbol{\varSigma}^{m}\)) of the posterior distribution corresponding to each subject \(m\), given the current estimates at step \(t\) (\(\bar{\boldsymbol{\mu}}^{(t)}\) and \(\bar{\boldsymbol{\varSigma}}^{(t)}\)) of the hierarchical prior.
- M-step: Re-estimate the parameters of the hierarchical prior, \(\bar{\boldsymbol{\mu}}^{(t+1)}\) and \(\bar{\boldsymbol{\varSigma}}^{(t+1)}\), from the posterior statistics of all subjects (Eq. (7)).

We denote by \(\bar{\boldsymbol{\mu}}\) and \(\bar{\boldsymbol{\varSigma}}\) the estimates of the hierarchical prior at the step \(T\) when iterations were stopped (at convergence). Once we have learned the hierarchical prior we can use it as an informative prior for the preference model of a new subject in Eq. (6).
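The EM iteration described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes a linear utility with the multinomial logistic likelihood, approximates each subject's posterior by a Gaussian centered at the MAP estimate (a Laplace approximation, found with Newton's method), and uses the standard unpenalized M-step for a hierarchical Gaussian prior, in which the prior mean is the average of the subject means. The penalty term mentioned in the text is not modeled, and all function names are hypothetical.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def grad_and_neg_hessian(alpha, Phi, choices, mu_bar, P_bar):
    """Gradient and negative Hessian of one subject's log posterior at alpha."""
    grad = -P_bar @ (alpha - mu_bar)          # contribution of the Gaussian prior
    H = P_bar.copy()
    for phi, c in zip(Phi, choices):          # phi: (A, D) features of one comparison
        p = softmax(phi @ alpha)
        grad += phi[c] - p @ phi
        H += phi.T @ (np.diag(p) - np.outer(p, p)) @ phi
    return grad, H

def laplace_posterior(Phi, choices, mu_bar, Sigma_bar, n_newton=100, tol=1e-8):
    """E-step for one subject: Gaussian approximation N(mu_m, Sigma_m)."""
    P_bar = np.linalg.inv(Sigma_bar)
    alpha = mu_bar.copy()
    for _ in range(n_newton):                 # Newton iterations towards the MAP estimate
        grad, H = grad_and_neg_hessian(alpha, Phi, choices, mu_bar, P_bar)
        step = np.linalg.solve(H, grad)
        alpha = alpha + step
        if np.linalg.norm(step) < tol:
            break
    _, H = grad_and_neg_hessian(alpha, Phi, choices, mu_bar, P_bar)
    return alpha, np.linalg.inv(H)

def fit_hierarchical_prior(subjects, D, n_em=20):
    """subjects: list of (Phi, choices) with Phi of shape (J, A, D)."""
    mu_bar, Sigma_bar = np.zeros(D), np.eye(D)
    for _ in range(n_em):
        # E-step: posterior mean and covariance of every subject under the current prior
        stats = [laplace_posterior(Phi, c, mu_bar, Sigma_bar) for Phi, c in subjects]
        mus = np.array([mu for mu, _ in stats])
        # M-step (assumed unpenalized form): average mean, spread plus average covariance
        mu_bar = mus.mean(axis=0)
        diff = mus - mu_bar
        Sigma_bar = (diff.T @ diff + sum(S for _, S in stats)) / len(subjects)
    return mu_bar, Sigma_bar
```

The returned `mu_bar` and `Sigma_bar` would then serve as the informative prior when calling `laplace_posterior` on the comparisons of a new subject.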

## 3 Active preference learning

In this section we discuss methods for active preference learning. We start from the Query-by-Committee (QBC) method for active learning (Seung et al. 1992) and, based on it, propose variants of QBC adapted to the setting of preference learning for multiple subjects (Sect. 3.1). Furthermore, we show how these variants of QBC can be naturally linked to hierarchical Bayesian modeling in order to reduce the computations (Sect. 3.2). Finally, we show connections between the proposed variants of QBC and other active learning criteria (Sect. 3.4).

### 3.1 QBC for preference learning

In this section we will discuss how to adapt QBC to our preference learning setting.

#### 3.1.1 The committee members

For the QBC approach to be effective it is important that the committee is made of consistent and representative models. The main idea in this work is to exploit the preference learning setting with multiple subjects and use the learned models of other subjects \(\mathcal{M}_{1},\ldots, \mathcal{M}_{M}\) as committee members when learning the preferences of a new subject.

After choosing the committee we still have to decide upon a suitable criterion for selecting the next examples. Some measures of disagreement among the committee members appear to be most obvious, and in the following we will consider two alternatives.

#### 3.1.2 Vote criterion

The Vote criterion scores a comparison \(a\) from the labels assigned to it by the committee members, where \(\delta(a,c;m) = 1\) if \((a,c) \in {\mathcal {D}}_{m}\), and \(\delta(a,c;m) = 0\) otherwise. The score Vote(\(a\)) is minimal when the labels assigned by the committee members are equally distributed (total disagreement) and maximal when all members fully agree. There are two problems with this criterion. First, a comparison \(a\) may not be labeled by a subject \(m\). This can be overcome if we consider the predictions computed based on the learned model of subject \(m\) and allow each committee member to ‘vote’ for its winning class. The same idea is implemented in the so-called vote entropy method (Dagan and Engelson 1995), in which the entropy is measured over the final classes assigned to an example by the possible models, and not over the class probabilities given by those models. Second, in practical applications just scoring votes turns out to be suboptimal. The reason, as also suggested in (McCallum and Nigam 1998), is that the Vote criterion does not take into account the confidences of the committee members’ predictions.
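A vote-based score of this kind can be sketched as follows. The exact definition of Vote(a) is not reproduced here; the sketch assumes one form that is consistent with the description above, namely the negative entropy of the vote fractions, with each committee member voting for its most probable class. The function name and the input layout are hypothetical.

```python
import numpy as np

def vote_score(member_probs):
    """Negative vote entropy for one candidate comparison.

    member_probs -- array of shape (M, A): predictive probabilities of the
                    M committee members over the A alternatives
    """
    member_probs = np.asarray(member_probs, dtype=float)
    M, A = member_probs.shape
    winners = member_probs.argmax(axis=1)           # each member votes for its winning class
    frac = np.bincount(winners, minlength=A) / M    # fraction of votes per alternative
    nz = frac > 0                                   # 0 * log 0 is taken as 0
    return float(np.sum(frac[nz] * np.log(frac[nz])))
```

The score is 0 (its maximum) when all members agree and reaches its minimum, `-log(A)`, when the votes are spread evenly. Because only the winning class of each member is used, a member predicting 0.51 counts the same as one predicting 0.99, which is the loss of confidence information noted above.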

#### 3.1.3 Committee criterion

The Committee criterion measures the disagreement among the committee members \(m = 1,\ldots,M\) through their predictive probabilities, using the Kullback-Leibler (KL) divergence. The KL-divergence is not symmetric, i.e., in general \(\mathrm{KL}[p_{1} \| p_{2}] \neq \mathrm{KL}[p_{2} \| p_{1}]\). This drawback of the KL-divergence can be overcome by considering a symmetric measure, for example, \(\mathrm{KL}[p_{1} \| p_{2}] + \mathrm{KL}[p_{2} \| p_{1}]\). In (McCallum and Nigam 1998), the disagreement is computed between committee members constructed based on the current model, i.e., the committee changes with every update and the criterion has to be recomputed with every update. A committee of models learned on different tasks is fixed and thus selecting examples solely based on it leads to a fixed instead of an active design: all examples can be ranked beforehand (the same applies to the Vote criterion defined above).

To obtain an active design, the Committee criterion of Eq. (12) additionally takes into account the current model for a candidate comparison \(a\) through \(p(\cdot|a)\), the current model’s predictive probability based on the data seen so far, and \(\gamma\), a parameter that accounts for the degree of similarity between subjects. According to the Committee criterion, the most interesting experiments are those on which the other models disagree (the first term on the right-hand side of Eq. (12)), with the current model (still) undecided (the second term on the right-hand side of Eq. (12)).

An advantage of the Committee criterion is its computational efficiency: the first term on the right-hand side of Eq. (12) as well as the average predictive probability can be computed beforehand. The Committee criterion does require computation of the predictive probabilities corresponding to the current model, but this is the least one could expect from an active design. This is to be compared with the QBC criterion (either of the two variants considered), which requires constructing new committee members with each update, and D-optimal experimental design, which calls for keeping track of variances.
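The building blocks discussed in this section can be illustrated with a short sketch. It shows only the KL divergence between discrete predictive distributions, its symmetrized variant, and a committee disagreement term that depends on the fixed committee alone and can therefore be precomputed for every candidate comparison. The direction of the divergence, the averaging over members, and the function names are assumptions; the exact form of Eq. (12), including how \(\gamma\) weighs the current model, is not reproduced.

```python
import numpy as np

def kl(p, q):
    """KL[p || q] for two discrete distributions with strictly positive entries."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def symmetric_kl(p, q):
    """Symmetrized divergence KL[p || q] + KL[q || p]."""
    return kl(p, q) + kl(q, p)

def committee_disagreement(member_probs, p_bar):
    """Average divergence of the members' predictions from a reference prediction.

    member_probs -- shape (M, A), predictions of the fixed committee for one comparison
    p_bar        -- shape (A,), average predictive probability of the committee
    """
    return float(np.mean([kl(p_m, p_bar) for p_m in member_probs]))
```

Since the committee is fixed, `committee_disagreement` can be evaluated once for the whole pool; only the term involving the current model has to be refreshed after each new observation.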

Note that we have not made any restriction so far with respect to the probabilistic models used in the active learning design. In the following we will consider only the log-linear models introduced in Sect. 2. They have some nice properties, which simplify the computation of the Committee criterion (Sect. 3.2), and provide a natural link to hierarchical Bayesian modeling (Sect. 2.3). The general idea, of using the already learned models from the other tasks as the committee members in a QBC-like approach, is of course also applicable to other models.

### 3.2 Average probability

In this section we discuss how to efficiently compute the average probability used for computing the Committee criterion in Eq. (12) in the case of log-linear models (Christensen 1997). For linear utility functions the likelihood function defined in Eq. (2) is a log-linear model, i.e., the log-odds of the model are linear in the parameter.

Let \(p_{m}(c|a)\) be the predictive probability defined in Eq. (10). We define the average predictive probability of the committee, \(\bar{p}(c|a)\), as the prediction probability that is closest to the prediction probabilities of the members. The solution is a logarithmic opinion pool, with \(Z(a)\) a normalization constant.

As can be seen from the EM updates in Eq. (7), the average \(\bar{\boldsymbol{\mu}}\) in the logarithmic opinion pool is then precisely the mean of the learned hierarchical prior. Summarizing, once we have learned a hierarchical prior from the data available for subjects 1 through *M* using the EM algorithm, we can start off the new model *M*+1 from this prior (as is normally done in hierarchical Bayesian learning). On top of this, the same EM algorithm gives us the information we need to compute the Committee criterion that can be used subsequently to select new inputs to label.
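The link between the logarithmic opinion pool and the averaged parameters can be checked numerically. The sketch below assumes that each committee member predicts with a log-linear (softmax) model evaluated at its own mean parameter vector, and that the pool is the normalized geometric mean of the members' predictions with equal weights; the variable names and the random test data are illustrative only.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def log_opinion_pool(member_probs):
    """Normalized geometric mean of the members' predictive probabilities."""
    mean_log_p = np.log(np.asarray(member_probs, dtype=float)).mean(axis=0)
    return softmax(mean_log_p)        # the normalization plays the role of Z(a)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(3, 4))         # features of A = 3 alternatives, D = 4
mus = rng.normal(size=(5, 4))         # mean parameters of M = 5 committee members

member_probs = np.array([softmax(Phi @ mu) for mu in mus])
pooled = log_opinion_pool(member_probs)
direct = softmax(Phi @ mus.mean(axis=0))   # single model with the averaged parameters
```

Under these assumptions `pooled` and `direct` coincide up to floating-point error, so the average predictive probability can be obtained from one model evaluation with the averaged parameter vector instead of M separate evaluations.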

### 3.3 Other criteria for active learning

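Let the current model be \(\mathcal {M} \approx\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\varSigma})\) and let the model obtained after an update with the observation \((a, c)\) be \(\mathcal {M}_{(a,c)} \approx\mathcal{N}(\boldsymbol{\mu}_{(a,c)}, \boldsymbol{\varSigma}_{(a,c)})\).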

1. *Uncertainty sampling* (Lewis and Gale 1994). In this strategy an active learner chooses for labeling the example for which the model’s predictions are most uncertain. The uncertainty of the predictions can be measured, for example, using Shannon entropy

   $$ \mathrm{Uncertainty} (a) = -\sum_c p(c|a, \mathcal{M}) \log p(c|a, \mathcal{M}) . \tag{16} $$

   For a binary classifier this strategy reduces to querying points whose prediction probabilities are close to 0.5. Intuitively this strategy aims at finding the decision boundary as fast as possible, since this is indicated by the regions where the model is most uncertain.

2. *Variance reduction* (MacKay 1992). This strategy, also known in experimental design as D-optimality (Fedorov 1972; Chaloner and Verdinelli 1995; Berger 1994; Ford and Silvey 1980), chooses as the most informative experiments the ones that give the most reduction in the model’s uncertainty. The motivation behind this strategy is a result of (Geman et al. 1992) which shows that the generalization error can be decomposed into three components: (i) noise (which is independent of the model or training data); (ii) bias (due to the model); (iii) model’s uncertainty. Since the model cannot influence the noise and the bias components, the future generalization error can only be influenced via the model’s variance. Formally, this criterion can be written as

   $$ \mathrm{Variance} (a) = \sum_c p(c|a,\mathcal{M}) \mathrm{variance} [\mathcal{M}_{(a,c)}] - \mathrm{variance} [\mathcal{M}]. \tag{17} $$

   In the setting considered in this work the variance of the model is expressed in the covariance of the Gaussian distribution. In order to use Eq. (17) we need to choose a measure for the variance. We can consider, for example, the log-determinant of the covariance matrix

   $$ \mbox{Variance-logdet} (a) = \sum_c p(c|a, \mathcal {M}) \log\det (\boldsymbol{\varSigma}_{(a,c)}) - \log\det(\boldsymbol{\varSigma}) , \tag{18} $$

   which is actually minimizing the entropy of the Gaussian random variable representing the current model, or the trace of the covariance matrix

   $$ \mbox{Variance-trace} (a) = \sum_c p(c|a, \mathcal {M}) \operatorname {Tr}(\boldsymbol{\varSigma}_{(a,c)}) - \operatorname {Tr}(\boldsymbol{\varSigma}) . \tag{19} $$

3. *Expected model change* (Cohn et al. 1996). This strategy chooses as the most informative query the one which, when added to the training set, would yield the greatest model change. Quantifying the model change depends on the learning framework. For gradient-based optimization the change can be measured via the training gradient, i.e., the vector used to re-estimate parameter values (Settles and Craven 2008). In the Bayesian framework, the model change can be quantified via a distance measure between the current distribution and the posterior distribution obtained after incorporating the candidate point

   $$ \mathrm{Change} (a) = \sum_c p(c|a, \mathcal{M}) \mathrm{distance} [\mathcal{M}, \mathcal{M}_{(a,c)} ] . $$

   A suitable distance for our setting is the Kullback-Leibler divergence between distributions, which for two Gaussians has a closed-form solution. The KL divergence between Gaussians is used by Seeger (2008) to design an efficient sequential experimental design in a setting similar to the one used in this work.
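The criteria listed above can be sketched for Gaussian models as follows. This is a minimal illustration assuming NumPy; the function names are hypothetical, the updated covariances \(\boldsymbol{\varSigma}_{(a,c)}\) are taken as given inputs (computing them requires re-training the model for each candidate), and the KL divergence uses the standard closed form for two multivariate Gaussians, with the argument order left to the caller because the direction used in the source is not shown.

```python
import numpy as np

def uncertainty(p):
    """Shannon entropy of the predictive distribution, as in Eq. (16)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def variance_logdet(p, Sigmas_ac, Sigma):
    """Expected change in log det of the covariance, as in Eq. (18)."""
    logdet = lambda S: np.linalg.slogdet(S)[1]
    return float(sum(p_c * logdet(S) for p_c, S in zip(p, Sigmas_ac)) - logdet(Sigma))

def variance_trace(p, Sigmas_ac, Sigma):
    """Expected change in the trace of the covariance, as in Eq. (19)."""
    return float(sum(p_c * np.trace(S) for p_c, S in zip(p, Sigmas_ac)) - np.trace(Sigma))

def kl_gaussians(mu0, S0, mu1, S1):
    """KL[N(mu0, S0) || N(mu1, S1)] in closed form."""
    D = len(mu0)
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu0, dtype=float)
    S1_inv = np.linalg.inv(S1)
    return float(0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - D
                        + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1]))
```

With these sign conventions, the candidate with the largest `uncertainty` or the most negative `variance_logdet` / `variance_trace` would be queried, while an expected-change criterion would weight `kl_gaussians` for each possible choice by its predictive probability.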

### 3.4 Similarities between criteria

In this section we consider the following active learning criteria: Variance-logdet, Committee, Variance-trace and Change-KL. We investigate how similar the active learning criteria are and how they can be related. We analyze the modifications induced to the model by the criteria after updating the probability model to incorporate the information from new training points. A single update induces a small change in the posterior distribution, and this allows for Taylor expansions, keeping only the lowest non-zero contribution. In the following we present the main results of the approximations while some of the details can be found in the Appendix.
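To make the role of such low-order expansions concrete, the following sketch numerically checks one standard first-order expansion, log det(**A** + **E**) ≈ log det(**A**) + Tr(**A**^{−1}**E**) for a small perturbation **E**. It is offered only as an illustration of the kind of approximation involved; the matrices, the perturbation sizes, and the function names are arbitrary choices made for this example.

```python
import numpy as np


def logdet(M):
    """Log-determinant of a positive definite matrix."""
    return np.linalg.slogdet(M)[1]


def first_order_logdet(A, E):
    """First-order expansion of log det(A + E) around A."""
    return logdet(A) + np.trace(np.linalg.solve(A, E))


rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4.0 * np.eye(4)      # symmetric positive definite matrix
v = rng.normal(size=4)

errors = {}
for scale in (1e-1, 1e-2, 1e-3):
    E = scale * np.outer(v, v)     # small rank-one perturbation
    errors[scale] = abs(logdet(A + E) - first_order_logdet(A, E))
```

The approximation error shrinks roughly quadratically with the size of the perturbation, which is why keeping only the lowest non-zero term is reasonable when a single update changes the posterior only slightly.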

Provided that the alternatives *a* and choice *c* lead to small changes in the model \(\mathcal {M}\), we can approximate the active criteria to the form given in Eq. (21), for some choice of **α** and matrix **Q**, and with **g** the gradient of the log-probabilities.

The following lemma approximates the Variance-logdet criterion to the form from Eq. (21).

### Lemma 1

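*In a first order approximation, assuming that **Σ**_{(a,c)} is close to **Σ**, we can simplify the Variance-logdet criterion to the form of Eq. (21), with the gradients evaluated at **μ** and weighted by **Σ**, where **μ** and **Σ** represent the mean and covariance of the Gaussian posterior distribution*.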

### Proof

[…] **α** to a new MAP solution depending on *c* and *a*. We will use the notation […]. For **A** and *ϵ* small compared to **A**, the following holds (see, for example, Boyd and Vandenberghe 2004, p. 642) […] **Σ**^{−1} which makes **H**(*c*|*a*,**μ**) small, we can use Eq. (23) in Eq. (24) with the following substitutions **A** = **Σ**^{−1} and **H**(*c*|*a*,**μ**) = *ϵ***I** to obtain […]

[…] *c* when presented the alternatives *a* follows by integrating *p*(*c*|*a*,**α**) over the current posterior. We make a second order Taylor expansion of *p*(*c*|*a*,**α**) around the point **μ**. The first order term cancels since the gradient is zero at the maximum solution **α** = **μ**. In a lowest order approximation we can ignore the correction upon *p*(*c*|*a*,**α**) to obtain […] where for the last approximation we used the approximation from Eq. (25). To obtain the proof of this lemma we use Lemma 3 in the Appendix at the end of the paper, which states a relationship between Hessian and Fisher matrices. □

Using the same type of approximation, the Committee criterion can be approximated to the same form given in Eq. (21).

### Lemma 2

*In a lowest order approximation the Committee criterion can be written as* […] *where* \(\bar{\boldsymbol{\mu}}\) *is the mean of the hierarchical prior learned from the other subjects and* […]

- 1.
The gradients **g**(*c*|*a*,⋅) are evaluated at different points: the prior hierarchical mean \(\bar{\boldsymbol{\mu}}\) and the current posterior mean **μ**. This effect is small since **μ** is still close enough to \(\bar{\boldsymbol{\mu}}\) for a sufficiently accurate approximation of the gradients, in particular at the start of the learning when selecting the right points to label is more important.

- 2.
The current posterior variance **Σ** is replaced by \(\tilde{\boldsymbol{\varSigma}}\). The effect of the precise weighting of the gradients is not so important, and again, at the beginning of learning \(\tilde{\boldsymbol{\varSigma}}\) is close to **Σ**.

[…] **μ** is still close to the prior mean \(\bar{\boldsymbol{\mu}}\), and \(\tilde{\boldsymbol{\varSigma}}\) to **Σ**.

## 4 Experimental evaluation

This section presents the experimental evaluation of the framework proposed in this paper. We will use pairwise comparisons data. The main goal of the experimental evaluation section is to show that optimally selecting data for labeling using the Committee criterion achieves higher accuracy than random selection. Furthermore, we also show that the Committee criterion performs in practice similarly to other standard active learning criteria, but in addition has a computational advantage.

### 4.1 Data sets

The following data sets related to the preferences of people were used in the experimental evaluation.

#### 4.1.1 Letor

This data set consists of relevance levels assigned to documents with respect to a given textual query (the OHSUMED data set from Letor 3.0, Qin et al. 2010). The relevances were assessed by human experts on a three-level scale: definitely relevant, partially relevant, and not relevant. We used a subset of this data related to Query 1. It contains 138 references with the following labels: 24 definitely relevant, 26 partially relevant, and 88 not relevant. Each of the samples is characterized by a 45-dimensional vector consisting of text features extracted from the titles and abstracts of the documents. The features were normalized.

Based on this data set we constructed pairwise preferences belonging to 50 subjects in the way described below. We followed a procedure similar to that of Xu et al. (2010) to turn the relevance levels into pairwise preference comparisons. Since such coarse relevance judgements are considered unrealistic in many real-world applications, Xu et al. (2010) proposed to add uniform noise in the range [−0.5,0.5] to the true relevance levels. This addition preserves the relative order between definitely relevant (respectively partially relevant) documents and partially relevant (respectively not relevant) ones, but randomly breaks ties within each relevance level. To introduce a hierarchical component, we replaced the random tie-breaking of Xu et al. (2010) by a subject-specific one. We do this by replacing the uniform noise with a subject (and feature) dependent term as follows. For subject *m*, a weight vector **α**_{m} is drawn from a zero mean fully factorized Gaussian with unit variance, \(\boldsymbol{\alpha}_{m} \sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\). Given features **x**_{i}, the noise terms are then the inner products \(\boldsymbol{\alpha}_{m}^{T} \boldsymbol{x}_{i}\), linearly scaled back to the interval [−0.5,0.5] (so as not to destroy the relative order of the true relevance levels), and the relevance levels are taken to be the true relevance levels plus these noise terms.
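The construction of subject-specific relevance scores described above can be sketched as follows. This is a minimal illustration under stated assumptions: the relevance levels are coded as 2, 1, and 0, the linear scaling to [−0.5,0.5] is done by min-max scaling over the documents of one subject, and the function names and toy data are hypothetical.

```python
import numpy as np


def subject_relevance(levels, features, rng):
    """Noisy relevance scores for one simulated subject."""
    alpha = rng.normal(size=features.shape[1])   # alpha_m ~ N(0, I)
    raw = features @ alpha                       # inner products alpha_m^T x_i
    # Assumption: min-max scaling is used as the linear map to [-0.5, 0.5].
    noise = (raw - raw.min()) / (raw.max() - raw.min()) - 0.5
    return levels + noise


def pairwise_preferences(scores):
    """All ordered pairs (i, j) in which item i is preferred to item j."""
    n = len(scores)
    return [(i, j) for i in range(n) for j in range(n) if scores[i] > scores[j]]


rng = np.random.default_rng(0)
levels = np.array([2, 2, 1, 1, 0, 0, 0])         # assumed numeric coding of the levels
features = rng.normal(size=(7, 45))              # toy stand-in for the 45 text features
scores = subject_relevance(levels, features, rng)
prefs = pairwise_preferences(scores)
```

Because the noise never exceeds 0.5 in absolute value, documents from different relevance levels keep their relative order for every simulated subject, while ties within a level are broken differently depending on the drawn weight vector.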

#### 4.1.2 Audio

This data set consists of evaluations of sound quality from 32 subjects (this data set was borrowed from Arehart et al. 2007). Each subject performed 576 pairwise comparison listening experiments. Each listening experiment represents one sound sample processed with two different settings of the hearing-aid parameters, and the choice for one of the two. The processed sound sample is represented by a 3-dimensional feature vector.

#### 4.1.3 Art

This data set consists of evaluations of art images from 190 subjects (this data set was borrowed from Yu et al. 2003). Each subject was presented with a number of images from a total of 642 images and asked to rate each of them as like or dislike (on average, each subject rated 90 images). We considered the 32 subjects who rated more than 120 images. Each image is described by a 275-dimensional feature vector, with features that characterize the image, such as color, shape, texture, etc. For computational reasons, we reduced this 275-dimensional feature vector to a lower dimension. We noticed that most of the features were not very informative for predicting the outcomes, which is why we used only the 10 most informative features (the informativeness of the features was measured by averaging the correlations between features and observations). Note that this data set does not contain pairwise comparisons like the other two data sets. With each instance a binary label is associated, like or dislike, which makes the learning task on this data set a binary classification task. The combination of multi-task and active learning that we propose in this work can still be applied in this case, in the same framework that was introduced in Sect. 2, by using the logistic regression model instead of the Bradley-Terry model as likelihood terms in the model from Eq. (6).
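One possible reading of the feature selection step is sketched below. It assumes that the informativeness of a feature is the absolute Pearson correlation between that feature and the binary labels, averaged over subjects; the use of the absolute value, the function name, and the synthetic data are assumptions made for this example.

```python
import numpy as np


def top_features(features_by_subject, labels_by_subject, k=10):
    """Indices of the k features with the highest average |correlation|."""
    scores = []
    for X, y in zip(features_by_subject, labels_by_subject):
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        corr = (Xc.T @ yc) / denom           # Pearson correlation per feature
        scores.append(np.abs(corr))          # assumption: sign is ignored
    mean_score = np.mean(scores, axis=0)     # average over subjects
    return np.argsort(mean_score)[::-1][:k]


# Synthetic check: the labels of every subject depend only on feature 3.
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(200, 20)) for _ in range(3)]
ys = [(X[:, 3] > 0).astype(float) for X in Xs]
selected = top_features(Xs, ys, k=5)
```

On the synthetic data the feature that drives the labels is ranked first, which is the behaviour one would expect from a correlation-based filter.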

### 4.2 Protocol

Our experiments use a leave-one-out scheme in which each subject was considered once as the current/test subject for which the preferences need to be learned. For each test subject the learning started with the hierarchical prior learned from the data of the remaining subjects. The data for the test subject was split into 5 folds; 1 fold was used for training and the rest was used for testing. The training data was used as a pool out of which points were selected for labeling, either randomly or actively using one of the active learning criteria. The hierarchical prior was updated based on these data points. After every update, predictions were made on the test set using the current model. We used accuracy (the percentage of correct predictions among all the predictions) as the measure of performance. The accuracy of the predictions on the test data measures how much we learned about the subject’s preferences. The results were averaged over the 5 splits and over the subjects.
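The splitting scheme of this protocol can be sketched as a small generator. Only the data splits are shown; learning the hierarchical prior, selecting points, and updating the model are left out. The function name, the dictionary layout, and the use of a fixed random seed are assumptions made for this example.

```python
import random


def protocol_splits(data_by_subject, n_folds=5, seed=0):
    """Yield (test_subject, other_subjects, pool_indices, test_indices)."""
    rng = random.Random(seed)
    for test_subject, items in data_by_subject.items():
        # The hierarchical prior would be learned from these subjects.
        others = [s for s in data_by_subject if s != test_subject]
        indices = list(range(len(items)))
        rng.shuffle(indices)
        folds = [indices[k::n_folds] for k in range(n_folds)]
        for k, pool in enumerate(folds):
            # One fold is the training pool, the remaining folds form the test set.
            test = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield test_subject, others, pool, test


# Toy data: three subjects with ten observations each.
data = {name: list(range(10)) for name in ("s1", "s2", "s3")}
splits = list(protocol_splits(data))
```

Each subject appears once as the test subject, and for each of its 5 splits the pool holds one fifth of that subject's data while the other four fifths are kept for evaluation.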

### 4.3 Performance

The framework that we propose in this work for optimizing preference learning consists of combining the multi-task formalism with active learning. The multi-task ideas in preference learning are especially useful when the training preference data from a subject is very scarce. In this situation it makes sense to use the preference data from other subjects as additional information.

#### 4.3.1 Letor

The pairwise comparisons from the Letor data set were generated by adding noise in the interval [−0.5,0.5] such that the relative order between the three relevance levels is preserved, but ties within each relevance level are broken. As a result, different subjects do agree on comparisons between different relevance levels. Thus, the data was constructed to have an underlying common structure in the preferences of different subjects. For this reason we expect that multi-task learning would improve the performance. In order to validate this hypothesis, we checked whether the preferences of a new subject can be learned more accurately by using the available preference data from other subjects. We compared the hierarchical model with the Gaussian process method for preference learning of Chu and Ghahramani (2005b), which assumes no prior information. The hierarchical/community prior was obtained by applying the EM algorithm described in Sect. 2.3 in combination with the semiparametric utility function from Eq. (4); the hierarchical prior was learned from 20 samples from each of the other subjects. The method of Chu and Ghahramani (2005b) was applied with a Gaussian kernel in which the kernel parameters were tuned using cross-validation.

In the resulting plots the accuracy is shown on the *y*-axis. The active selection was implemented using the Committee criterion. These plots show that the combination of multi-task and active learning indeed improves the performance.

#### 4.3.2 Audio

Figure 2 shows the accuracy (on the *y*-axis) as a function of the number of updates from the hierarchical prior (on the *x*-axis). The shaded region indicates the accuracy of 10 random selection runs. The error bars give the standard deviation of the mean accuracy, averaged over the 32 subjects. We used the Committee criterion with *γ*=0 since the subjects in the committee are quite similar to each other, which is also suggested by the small error bars. The informative prior improves the predictions at the beginning, when no preferences have yet been observed for the new subject. The hierarchical prior already gives an accuracy of almost 0.7 for the audio data at the beginning of learning. The hierarchical prior was learned from 20 randomly selected data points per subject. Committee and Variance-logdet strongly overlap and are considerably better than a random strategy. The audio data set contains a few very informative data points and some which are not informative. In some cases the difference between the two sound samples presented in an experiment is so small that the subject cannot hear any difference. Such experiments are not informative because a subject’s answer is close to random and does not provide any information with respect to the subject’s preferences. The active learning criteria avoid selecting these types of experiments and obtain better performance than random selection. The performance of the other active learning strategies (not shown) is comparable to the active learning strategies shown in Fig. 2, except for the Vote criterion, which does not seem to perform better than random. We refer to Sect. 4.5 for an empirical evaluation of the similarities between the active learning criteria considered in Sect. 3.

#### 4.3.3 Art

Figure 3 shows the accuracy (on the *y*-axis) as a function of the number of updates from the hierarchical prior (on the *x*-axis). The shaded region indicates the accuracy of 10 random selection runs. The error bars give the standard deviation of the mean accuracy, averaged over the subjects. For the art data, which has a higher variability between subjects, the Committee criterion with *γ*=1 performs slightly better than the Committee criterion with *γ*=0. The preferences of people for art images are more difficult to predict, since preferences do not depend on some low-level characteristics of the image, like texture, color, etc. This is why the accuracy obtained on the art data is lower than the accuracy obtained, for example, on the audio data. The Variance-logdet criterion appears to perform slightly better than the Committee criterion. Furthermore, the benefit of active learning over random selection is much smaller. As in the case of the audio data set, the performance of the other active learning strategies (not shown) is comparable to the active learning strategies shown in Fig. 3.

### 4.4 Computational complexity

Execution time (in seconds) for the Variance-logdet and Committee criteria as a function of the feature dimension

Feature dimension | Variance-logdet (s) | Committee (s) |
---|---|---|
10 | 2.894 | 0.014 |
50 | 13.543 | 0.010 |
100 | 37.926 | 0.009 |
200 | 172.661 | 0.010 |

### 4.5 Similarities between criteria

## 5 Conclusions and discussions

This work studied how to exploit models learned on other scenarios to actively learn a model for a new scenario in an efficient way. Our approach to active learning in a multi-scenario setting combines a hierarchical Bayesian prior (to learn from related scenarios) with active learning (to learn efficiently by selecting informative examples). Our new Committee criterion, inspired by the Query-by-Committee method, is very similar to the standard criteria from experimental design, in particular in the early stages of active learning, but is computationally more efficient. Aside from the computational advantage, the Committee criterion introduces the idea of having the data available from other users collaborate in order to select the most informative experiments to perform with a new user. The same idea is already implicit in the Query-by-Committee algorithm. We show, theoretically and through experiments, that this conceptual idea also works with a committee of people. This can be interpreted as another way of using people as the elements of a machine learning algorithm, which is a very promising research area, as also suggested by Sanborn and Griffiths (2008).

There are several aspects related to the approach proposed here that require further attention: (i) The design is myopic in the sense that the active learning criteria look one step ahead at a time when evaluating the informativeness of a data point. A non-myopic design “looks” more than just one step ahead; it is theoretically closer to the best possible design but computationally much more expensive. Due to the computational complexity of a non-myopic design, we discussed all the active learning criteria from a myopic perspective. However, a non-myopic perspective can be applied to all of them, similar to the one proposed by Boutilier (2002). (ii) In this work we used log-linear models and Gaussian distributions to model the preference data. The same idea, of using models learned on data from different subjects (or scenarios) to actively select examples for a new subject, can be applied to other models and starting from different priors as well, although the mathematics will be a bit more involved and less intuitive. In particular, considering a mixture of Gaussians as the prior may still be feasible and may lead to an active learning strategy that tries to find those examples that can best discriminate to which mixture component the current model belongs.

## Acknowledgements

We would like to thank Kai Yu for providing the art data set and Wei Chu for making available his code on preference learning with Gaussian Processes.

### Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.