1 Introduction

There has recently been increasing interest in learning the preferences of people within artificial intelligence research (Doyle 2004). Preference learning provides the means for modeling and predicting people’s desires, which makes it a crucial aspect of modern applications such as decision support systems (Chajewska et al. 2000), recommender systems (Blythe 2002; Blei et al. 2003), and personalized devices (Clyde et al. 1993; Heskes and de Vries 2005).

A prototypical example of a preference learning application, which we will use throughout this paper, is fitting hearing-aids, i.e., tuning hearing-aid parameters so as to maximize user satisfaction. This is a complex task for three reasons: (1) the parameter space is high-dimensional, (2) the determinants of hearing-impaired user satisfaction are unknown, and (3) evaluating this satisfaction through listening tests is costly (in terms of patient burden and clinical time investment) and unreliable (due to inconsistent responses). The last point illustrates an important issue that preference learning has to address: the limited availability of labeled data for model training. Obtaining appropriate training data in preference learning applications requires time and effort from the modeled user. This shortcoming can be addressed by taking advantage of two characteristics of the settings in which preference learning is usually applied. First, the training data is mostly acquired through interactions with the modeled user; and, second, preferences are modeled for multiple users, so that multiple training data sets are available. For preference learning methods to be implemented in real-world systems, they must be capable of exploiting all possible sources of information in the most efficient way.

In most situations in which preference learning is involved, data is available from multiple subjects. Thus, even though individual data is scarce and difficult to obtain, we can optimize the learning of preferences of a new subject by making use of the available data from other subjects. Learning in this setting is known as multi-task or hierarchical learning and has been studied extensively in recent years in machine learning. By using the multi-task formalism, the preference data collected for other subjects can be gathered and used as prior information when learning the preferences of a new subject. Furthermore, to deal with the fact that obtaining labeled data is expensive, we can speed up learning by optimally choosing the examples to be queried: at each learning step we decide which example gives the most information about the subject’s preferences. This paradigm, called active learning in the machine learning literature and related to sequential experimental design in statistics, has been studied extensively, but hardly in the multi-task setting.

The aim of this work is to present an efficient framework for optimizing the preference learning process. The framework combines active learning with multi-task learning in the preference learning context. The contribution of this work is a criterion for active learning designed for the multi-task setting; its advantages are its clear interpretation and its ease of computation.

The structure of this paper is as follows. First, this section ends by presenting related work on preference learning, multi-task learning and active learning. In Sect. 2 we describe the learning framework. We consider learning from qualitative preference observations in which the subject makes a choice for one of the presented alternatives. This can be modeled using the probabilistic choice models introduced in Sect. 2.1. Learning a utility function representing the preferences of a subject from this type of preference observations is described in Sect. 2.2. Learning the utility function in a multi-task setting, by making use of the data available from other subjects, is considered in Sect. 2.3. In Sect. 3 we present several criteria for selecting the most informative experiments with respect to a subject’s preferences. After reviewing some of the standard criteria from experimental design, we propose an alternative criterion which makes use of the preference observations already collected from a community of subjects. We show that this alternative criterion is connected to the standard criteria from experimental design. In Sect. 4 we demonstrate experimentally the usefulness of our framework on three data sets: a subset of the Letor data set, an audiological data set, and a data set of people’s preferences for art images. In Sect. 5 we present several conclusions and discuss directions for future research.

1.1 Background and related work

In this section we review some studies from preference learning, multi-task learning, and active learning related to the work presented in this paper.

1.1.1 Preference learning

Preference learning is the construction, from collected data, of a model that can be used to predict people’s desires. A recent and comprehensive overview of preference learning is given by Fürnkranz and Hüllermeier (2010). There are different approaches to preference learning, which can be categorized according to the learning task, the learning technique, and the application area. We briefly enumerate them below and state in which category our current work can be included (for more details we refer the interested reader to Fürnkranz and Hüllermeier 2010).

  1.

    Based on the application area, preference learning approaches can be divided into the following main groups: (i) applied to the field of information retrieval, e.g., learning to rank search results of a query or a search engine, (ii) applied to recommender systems, e.g., used by online stores to recommend products to their customers, or for personalized devices, and (iii) bipartite ranking and label ranking, which find applications in disciplines such as medicine and biology. The application scenarios that we consider in the experimental evaluation in Sect. 4 belong to information retrieval (the Letor data set) and recommender systems (the audiological and art data sets).

  2.

    The learning technique divides the preference learning approaches into four categories: (i) learning a binary preference relation that compares pairs of alternatives, (ii) model-based approaches that aim at identifying the preference relation by making sufficiently restrictive model assumptions, (iii) local estimation techniques which lead to aggregating preferences, and (iv) learning utility functions by using regression to map instances to target valuations for direct ranking. We focus on the last of these approaches and use a utility function to model a subject’s preferences. The utility function is learned in a Bayesian framework.

  3.

    The learning task includes label, instance, and object ranking. Label ranking can be seen as a generalization of classification where a complete ranking of labels is associated with an instance instead of only a class label. Instance ranking can be seen as a generalization of ordinal classification where an instance belongs to one among a finite set of classes and the classes have an order. The setting of object ranking has the peculiarity of having no supervision in the sense that no class label is associated with an object. Instead, a finite set of pairwise preferences or other ordering between objects is given. The setting that we consider in this work belongs to the last category.

In many preference learning settings it is important to take the context into account, i.e., context-aware preference learning (Adomavicius et al. 2005). The motivation for context-aware preference learning is that the same subject/user/consumer may use different decision-making strategies and prefer different products in different contexts. For hearing-aid fitting, one of the application scenarios used in this work, this means that a user would prefer one setting of the hearing-aid parameters when listening to a concert and another setting when engaged in a discussion. In general, context-aware preference learning requires larger data sets, as preferences have to be learned for all contextual situations. The approach that we present in this paper is applicable in this setting as well. While this is an interesting, related topic, it is beyond the scope of the current work.

1.1.2 Multi-task learning

The idea behind multi-task learning is to utilize labeled data from other “similar” learning tasks in order to improve the performance on a target task. It is inspired by the research on transfer of learning in psychology, more specifically on the dependency of human learning on prior experience. For example, the abilities acquired while learning to walk presumably apply when one learns to run, and knowledge gained while learning to recognize cars could apply when recognizing trucks. The initial foundations for multi-task learning were laid by Thrun (1995) and Caruana (1997). The psychological theory of transfer of learning implies similarity between tasks; in a related way, multi-task learning assumes similarity between the models of different tasks. For example, Evgeniou et al. (2005) and Argyriou et al. (2008) exploit similarity between the deterministic parts of the models by means of regularization, with a resulting improvement in performance. In this work we implement multi-task learning using a Bayesian approach. The Bayesian approach to multi-task learning assumes the parameters of the individual models to be drawn from the same prior distribution. Examples of the Bayesian approach to multi-task learning include Bakker and Heskes (2003), where a mixture of Gaussians is used at the top of the hierarchy; this leads to clustering the tasks, one cluster for each Gaussian in the mixture. In Yu et al. (2005) and Birlutiu et al. (2009) a hierarchical Gaussian process is derived with a normal-inverse Wishart distribution at the top of the hierarchy.

1.1.3 Active learning

Active learning, also known in the statistics literature as sequential experimental design, is suitable for situations in which labeling points is difficult, time-consuming, or expensive. The idea behind active learning is that a better performance can be achieved by optimally selecting the training points rather than selecting them at random. The scenarios in which active learning can be applied belong to one of the following three categories: (i) generating de novo points for labeling; (ii) stream-based active learning, where the learner decides whether or not to request the label of a given instance; (iii) pool-based active learning, where queries are selected from a large pool of unlabeled data. In this work we consider the pool-based active learning setting.

Methods for active learning can be roughly divided into two categories: those with and those without an explicitly defined objective function. Uncertainty sampling (Lewis and Gale 1994), Query-by-Committee (Seung et al. 1992; Freund et al. 1997), and variants thereof belong to the latter category; they are based on the idea of selecting the data about which the previously trained models are most uncertain. The methods with an explicit objective function are often motivated by the theory of experimental design (Fedorov 1972; Chaloner and Verdinelli 1995; Schein and Ungar 2007; Lewi et al. 2009; Dror and Steinberg 2008). The objective function quantifies the expected gain of labeling a particular input, for example in terms of the expected reduction in the entropy of the model parameters (MacKay 1992; Cohn et al. 1996). With respect to the performance of the two categories of methods, Schein and Ungar (2007) show that the methods with an explicit objective function perform better but are computationally more expensive because the models have to be re-trained for each candidate point. A trend is to improve the performance of active learning methods by combining them with heuristics designed either for the context in which they are applied or for the models they use, e.g., making use of the unlabeled data available (McCallum and Nigam 1998; Yu et al. 2006), exploiting the clusters in the data (Dasgupta and Hsu 2008), diversifying the set of hypotheses (Melville and Mooney 2004), or adapting the active learning to Gaussian processes (Chu and Ghahramani 2005a; Brochu et al. 2008; Groot et al. 2010).

Preference learning can benefit from the active learning paradigm. In most preference learning settings labels are given by people in an explicit way. This means that for acquiring training preference data, the subjects have to interact with the system and express their preferences explicitly. These situations appear when it is impossible or insufficient to collect training preference data implicitly. For example, when learning preferences in a live system where subjects electronically choose their favorite movie, labeling is done automatically by the selection; for other scenarios, such as fitting hearing-aids, this implicit way of collecting training data cannot be applied. In these situations, it makes sense to use active learning in order to collect the most informative data. There are several studies in the literature that use active learning in a preference learning setting. Brinker (2004) presents some extensions of pool-based active learning to label ranking problems; Xu et al. (2010) address the problem of preference learning using relational models between items; Guo and Sanner (2010) investigate active preference learning for real-time systems; Brochu et al. (2008) propose a criterion for active learning that maximizes the expected improvement at each query without accurately modeling the entire valuation surface. Furthermore, there are several studies which investigate active preference learning for practical applications such as collaborative filtering (Jin and Si 2004; Harpale and Yang 2008; Boutilier et al. 2003), personalized calendar scheduling (Gervasio et al. 2005), or optimizing search results for biomedical documents (Arens 2008). The difference between our work and the studies of active preference learning mentioned above is that we consider active preference learning in a multi-task setting, i.e., we are interested in settings with multiple learning tasks and in how active learning can be implemented efficiently in this case. We propose a criterion for active learning designed for this multi-task learning setting. This criterion, which we call the Committee criterion, makes use of the preference observations already collected from a community of subjects. The idea behind the Committee criterion is related to the Query-by-Committee method from active learning, which selects those queries that have maximum disagreement amongst an ensemble of hypotheses. The difference in our case is that the group of subjects, for which the preferences were already learned, plays the role of the ensemble of hypotheses instead of an ensemble of models learned on the same task.

1.2 Notation

Boldface notation is used for vectors and matrices and normal fonts for their components. Superscripts are used to distinguish between different vectors or matrices and subscripts to address their components. The notation \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\varSigma})\) is used for a multivariate Gaussian with mean μ and covariance matrix Σ. The transpose of a matrix M is denoted by M^T. Capital letters are used for constants and small letters for indices, e.g., i=1,…,I.

2 Learning framework

The idea of using preference observations from other subjects in order to optimize the process of learning the preferences of a new subject can, in principle, be applied in any preference learning context. In this work, we consider the case of qualitative preference observations, which can be modeled using the probabilistic choice models described in this section.

2.1 Probabilistic choice models

In many real-world applications preferences are learned from experiments in which the subject makes a choice for one of the presented alternatives. The motivation for this is that people are very good at making comparisons between alternatives and expressing a preference for one of them, which results in qualitative preference observations. This is in contrast to quantitative preference observations, where people have to assign an absolute rating to each alternative independently. Let \(X=\{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_I\}\) be a set of inputs. Let \({\mathcal {D}}\) be a set of J observed preference comparisons over instances in X corresponding to a subject,

$$ {\mathcal {D}}= \bigl\{(a_j, c_j)\mid 1 \leq j \leq J,\ c_j \in\{1,\ldots,A\}\bigr\} $$
(1)

with \(a_{j} = (\boldsymbol{x}_{i_{1}(j)}, \ldots, \boldsymbol{x}_{i_{A}(j)})\) representing the alternatives presented to the subject and \(c_j\) the choice made; \(i_1,\ldots,i_A:\{1,\ldots,J\}\to\{1,\ldots,I\}\) are index functions such that \(i_1(j)\) represents the input presented first in the jth preference comparison, and \(c_j=c\) means that \(\boldsymbol{x}_{i_{c}(j)}\) is chosen from the A alternatives presented in the jth comparison. For A=2 this setup reduces to pairwise comparisons between two alternatives.

The main idea behind probabilistic choice models is to assume a latent utility function value U(x i ) associated with each input x i which captures the individual preference of a subject for x i (the utility function will be formally defined in the next section). In the ideal case the latent function values are consistent with the preference observations. This means that alternative c is preferred over the other alternatives c′ in the jth comparison whenever the utility for c exceeds the utilities for the other alternatives c′, i.e., \(U(\boldsymbol{x}_{i_{c}(j)}) > U(\boldsymbol{x}_{i_{c'}(j)})\). In practice, however, subjects are often inconsistent in their responses. A very inconsistent subject will have a high uncertainty associated with the utility function; this uncertainty is directly taken into account in the probabilistic framework. We define this probabilistic framework using the Bradley-Terry model (Bradley and Terry 1952; Kanninen 2002; Glickman and Jensen 2005) by making a standard modeling assumption that the probability that the cth alternative is chosen by the subject in the jth comparison follows a multinomial logistic model, which is defined as

$$ p(c_j=c| a_j, U) = \frac{ \exp [ U( \boldsymbol{x}_{i_c(j)} ) ]}{ \sum_{c'=1}^A \exp [ U(\boldsymbol{x}_{i_{c'}(j)}) ] }, $$
(2)

where “exp” is the exponential function and the other terms are as defined before. Efficiently learning preferences reduces to learning the unknown utility function U as accurately and with as few comparisons as possible.
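As an illustration (not part of the original formulation), the choice probabilities of Eq. (2) for a linear utility \(U(\boldsymbol{x}) = \boldsymbol{\alpha}^T \boldsymbol{\phi}(\boldsymbol{x})\) could be computed as in the following minimal NumPy sketch; the function name and toy numbers are ours.

```python
import numpy as np

def choice_probabilities(alternatives, alpha):
    """Multinomial logit choice probabilities of Eq. (2) for a linear utility
    U(x) = alpha . phi(x); 'alternatives' holds the feature vectors phi(x) of
    the A presented inputs, one per row."""
    utilities = alternatives @ alpha          # U(x_{i_c(j)}) for each alternative
    utilities -= utilities.max()              # stabilize the softmax numerically
    expu = np.exp(utilities)
    return expu / expu.sum()

# A pairwise comparison (A = 2) between two inputs with D = 3 features
a_j = np.array([[0.2, 1.0, -0.5],
                [0.4, 0.1,  0.3]])
alpha = np.array([1.0, 0.5, -1.0])
print(choice_probabilities(a_j, alpha))       # probability of each alternative being chosen
```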

One important drawback of the Bradley-Terry model is that it assumes very strong transitivity conditions on preference relations, while some psychological experiments have shown that human preference judgments can violate transitivity (Anand 1993; Tversky 1998). In most situations transitivity violations can be considered as noise. When this is not applicable, specific probabilistic models for human preference judgements which preserve intransitive reciprocal relations have to be designed. This was recently investigated by Pahikkala et al. (2009), who introduced a new kernel function in the framework of regularized least squares that is capable of inferring intransitive reciprocal relations.

2.2 The utility function

The utility function U is a real-valued function, U:X→ℝ, which associates with every input x∈X a real number U(x). Each input x∈X is characterized by a set of features, ϕ(x)∈ℝ^D. One possible choice for the utility function is to express it as a linear combination of the features,

$$ U(\boldsymbol{x}) = \sum_{i=1}^D \alpha_i \phi_i(\boldsymbol{x}) , $$
(3)

where \(\boldsymbol{\alpha}=(\alpha_1,\ldots,\alpha_D)\) is a vector of weights which captures the importance of each feature of x when evaluating the utility U for a specific subject, and \(\phi_i(\boldsymbol{x})\) are the components of the vector ϕ(x). The preferences of a subject are thus encoded in the vector α, and learning the utility function reduces to learning α.

In order to make the definition of the utility function more flexible, we can use a semiparametric model in which the utility function is defined as a linear combination of basis functions. The basis functions are defined by a kernel function κ centered on the data points,

$$ U(\boldsymbol{x}) = \sum_{i=1}^I \alpha_i \kappa(\boldsymbol{x},\boldsymbol{x}_i) , $$
(4)

where the vector α with dimension I—the number of data points (the size of the set of inputs X)—captures the preferences of the subject. A non-linear utility function can be obtained by using, for example, a Gaussian kernel,

$$ \kappa_{\mathrm{Gauss}}\bigl(\boldsymbol{x}, \boldsymbol{x}'\bigr) = \exp \Biggl(-\frac{\ell}{2} \sum_{j=1}^{D} \bigl(\phi_j(\boldsymbol{x})-\phi_j\bigl(\boldsymbol{x}'\bigr)\bigr)^2 \Biggr) , $$
(5)

where ℓ is a length-scale parameter. The two definitions of the utility function from Eqs. (3) and (4) are similar in the sense that they are both linear in the parameters. Equation (4) is suited when the number of features is larger than the number of data points, i.e., D>I, and for introducing non-linearity in the utility model.
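To make Eqs. (4) and (5) concrete, here is a minimal sketch of the semiparametric utility with a Gaussian kernel; the function names and toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gaussian_kernel(phi_x, phi_xprime, length_scale):
    """Gaussian kernel of Eq. (5) between two feature vectors."""
    diff = phi_x - phi_xprime
    return np.exp(-0.5 * length_scale * np.dot(diff, diff))

def utility(phi_x, alpha, Phi, length_scale=1.0):
    """Semiparametric utility of Eq. (4): U(x) = sum_i alpha_i k(x, x_i),
    with Phi holding the features of all I inputs in X, one per row."""
    k = np.array([gaussian_kernel(phi_x, phi_xi, length_scale) for phi_xi in Phi])
    return float(alpha @ k)

# Toy example with I = 4 inputs and D = 2 features
rng = np.random.default_rng(0)
Phi = rng.normal(size=(4, 2))
alpha = rng.normal(size=4)
print(utility(Phi[0], alpha, Phi))
```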

In order to learn the utility function, we use a Bayesian framework in which we treat the vector of parameters α as a random variable. We consider a Gaussian prior distribution over \(\boldsymbol{\alpha} \sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\varSigma})\), which is updated based on the observations from the preference comparisons using Bayes’ rule (where we omitted the normalization constant),

$$ p(\boldsymbol{\alpha}|{\mathcal {D}},\boldsymbol{\mu},\boldsymbol{\varSigma}) \propto p( \boldsymbol{\alpha}) \prod_{j=1}^{J} p(c_j| a_j, \boldsymbol{\alpha}) , $$
(6)

with the likelihood terms of the form given in Eq. (2). The choice of the prior will be discussed in the next section. The posterior distribution obtained is approximated by a Gaussian. The Gaussian approximation of the posterior is a good approximation because, with few data points, the posterior is close to the prior, which is a Gaussian, and, with many data points, the posterior again approaches a Gaussian as a consequence of the central limit theorem (Bishop 2006). To perform the approximation of the posterior, a good choice is to use deterministic methods (e.g., Laplace’s method (MacKay 2002) or Expectation Propagation (Minka 2001)), since they are computationally cheaper than non-deterministic ones (sampling) and are known to be accurate for these types of models (Glickman and Jensen 2005).
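As a hedged sketch of one way this Gaussian approximation could be obtained for the linear utility model, the following implements a Laplace approximation of the posterior in Eq. (6) (MAP via Newton steps, covariance from the negative inverse Hessian); the paper also mentions Expectation Propagation as an alternative, and the function names and data layout here are our own assumptions.

```python
import numpy as np

def log_posterior_grad_hess(alpha, data, mu0, Sigma0_inv):
    """Gradient and Hessian of the unnormalized log posterior of Eq. (6) for a
    linear utility U(x) = alpha . phi(x); 'data' is a list of (A_features, c)
    with A_features of shape (A, D) and c the index of the chosen alternative."""
    grad = -Sigma0_inv @ (alpha - mu0)                       # prior contribution
    hess = -Sigma0_inv.copy()
    for A_features, c in data:
        p = np.exp(A_features @ alpha)
        p /= p.sum()                                         # likelihood of Eq. (2)
        mean_phi = p @ A_features
        grad += A_features[c] - mean_phi                     # d log p(c|a,alpha) / d alpha
        hess -= A_features.T @ (np.diag(p) - np.outer(p, p)) @ A_features
    return grad, hess

def laplace_approximation(data, mu0, Sigma0, n_iter=25):
    """Newton iterations to the MAP; the posterior covariance is the negative
    inverse Hessian at the MAP."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    alpha = mu0.copy()
    for _ in range(n_iter):
        grad, hess = log_posterior_grad_hess(alpha, data, mu0, Sigma0_inv)
        alpha = alpha - np.linalg.solve(hess, grad)
    _, hess = log_posterior_grad_hess(alpha, data, mu0, Sigma0_inv)
    return alpha, np.linalg.inv(-hess)
```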

2.3 Multi-task preference learning

One property which distinguishes preference learning from other learning settings is that, in most of the cases in which preference learning is involved, observations are available from multiple subjects. For example, in the case of product recommendation, preferences are learned from multiple consumers. Furthermore, in situations where the training preference data is collected in an explicit way, by interactions with the subject, the amount of individual training data is usually small. As a consequence, it is reasonable to make use of all the data available. In this direction, we make the assumption that each subject is not an isolated case, but belongs to a group of people sharing a common underlying rationale in their way of making preference decisions. This property, combined with the Bayesian framework, allows the transfer of information about preferences between different subjects. Basically, we use the preference data previously seen from other subjects to learn an informed prior which will be used as the starting prior when learning the preferences of a new subject. For learning this informed prior, we use Bayesian hierarchical modeling, which assumes that the parameters of the individual models are drawn from the same hierarchical prior distribution. Let us assume that we already have preference data available from a group of M subjects. We make the common assumption of a Gaussian prior distribution, \(p(\boldsymbol{\alpha}^{m}) = \mathcal{N}(\boldsymbol{\alpha}^{m}; \bar{\boldsymbol{\mu}},\bar{\boldsymbol{\varSigma}})\), m=1,…,M, with the same \(\bar{\boldsymbol{\mu}}\) and \(\bar{\boldsymbol{\varSigma}}\) for the preference models of all subjects. This prior is updated using Bayes’ rule based on the observations from each subject, resulting in a posterior distribution for each individual subject. The common prior over all task parameters controls the general part of the model. This common prior is learned from the data belonging to a group of tasks other than the current (new) task for which the learning is performed. Starting from this general model given by the common prior, the model is updated using the observations (data) seen in the current task; these task-specific observations control the task-specific part of the model. The hierarchical prior is obtained by maximizing the penalized log-likelihood of all data in a so-called type-II maximum likelihood approach. This optimization is performed by applying the EM algorithm (Gelman et al. 2003), which reduces to iterating (until convergence) the following steps:

E-step:

Estimate the sufficient statistics (mean μ m and covariance matrix Σ m) of the posterior distribution corresponding to each subject m, given the current estimates at step t (\(\bar{\boldsymbol{\mu}}^{(t)}\) and \(\bar{\boldsymbol{\varSigma}}^{(t)}\)) of the hierarchical prior.

M-step:

Re-estimate the parameters of the hierarchical prior:

$$ \bar{\boldsymbol{\mu}}^{(t+1)} = \frac{1}{M} \sum_{m=1}^{M} \boldsymbol{\mu}^m , $$
(7)
$$ \bar{\boldsymbol{\varSigma}}^{(t+1)} = \frac{1}{M} \sum_{m=1}^{M} \bigl[ \boldsymbol{\varSigma}^m + \bigl(\boldsymbol{\mu}^m - \bar{\boldsymbol{\mu}}^{(t+1)}\bigr) \bigl(\boldsymbol{\mu}^m - \bar{\boldsymbol{\mu}}^{(t+1)}\bigr)^T \bigr] . $$
(8)

The details of the derivations of Eqs. (7) and (8) can be found in Appendix C of our paper (Birlutiu et al. 2009). The hierarchical prior is set to \(p(\boldsymbol{\alpha}) =\mathcal{N}(\boldsymbol{\alpha}; \bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\varSigma}})\) with \(\bar{\boldsymbol{\mu}} = \bar{\boldsymbol{\mu}}^{(T)}\) and \(\bar{\boldsymbol{\varSigma}} = \bar{\boldsymbol{\varSigma}}^{(T)}\), where T is the step at which the iterations were stopped (at convergence). Once we have learned the hierarchical prior we can use it as an informative prior for the preference model of a new subject in Eq. (6).
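A schematic sketch of this E-step/M-step iteration is given below, under the assumption that a per-subject posterior approximation (e.g., the Laplace sketch above) is available; the function and variable names are ours, not the authors’ implementation.

```python
import numpy as np

def learn_hierarchical_prior(subject_data, posterior_fn, mu_init, Sigma_init, n_iter=20):
    """EM for the hierarchical Gaussian prior of Sect. 2.3.

    subject_data : list of M per-subject preference data sets D_m
    posterior_fn : function(D_m, mu_bar, Sigma_bar) -> (mu_m, Sigma_m), a per-subject
                   posterior approximation, e.g. a Laplace approximation (E-step)
    """
    mu_bar, Sigma_bar = mu_init.copy(), Sigma_init.copy()
    M = len(subject_data)
    for _ in range(n_iter):
        # E-step: sufficient statistics of each subject's posterior under the current prior
        stats = [posterior_fn(D_m, mu_bar, Sigma_bar) for D_m in subject_data]
        # M-step: re-estimate the prior mean and covariance, Eqs. (7) and (8)
        mu_bar = sum(mu_m for mu_m, _ in stats) / M
        Sigma_bar = sum(Sigma_m + np.outer(mu_m - mu_bar, mu_m - mu_bar)
                        for mu_m, Sigma_m in stats) / M
    return mu_bar, Sigma_bar
```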

3 Active preference learning

In this section we discuss methods for active preference learning. We start from the Query-by-Committee (QBC) method for active learning (Seung et al. 1992) and, based on it, propose some variants of QBC adapted to the setting of preference learning for multiple subjects (Sect. 3.1). Furthermore, we show how these variants of QBC can be naturally linked to hierarchical Bayesian modeling in order to reduce the computations (Sect. 3.2). Finally, we show connections between the proposed variants of QBC and other active learning criteria (Sect. 3.4).

3.1 QBC for preference learning

In this section we will discuss how to adapt QBC to our preference learning setting.

3.1.1 The committee members

For the QBC approach to be effective it is important that the committee is made of consistent and representative models. The main idea in this work is to exploit the preference learning setting with multiple subjects and use the learned models of other subjects \(\mathcal{M}_{1},\ldots, \mathcal{M}_{M}\) as committee members when learning the preferences of a new subject.

After choosing the committee we still have to decide upon a suitable criterion for selecting the next examples. A measure of disagreement among the committee members appears to be the most obvious choice, and in the following we consider two alternatives.

3.1.2 Vote criterion

A simple and straightforward way is to consider the labels assigned by the other subjects, e.g., through the Vote criterion defined as

$$ \mbox{Vote}(a) = \max_c \sum_{m=1}^{M} \delta(a,c;m) , $$
(9)

where δ(a,c;m)=1 if \((a,c) \in {\mathcal {D}}_{m}\), and δ(a,c;m)=0 otherwise. The score Vote(a) is minimal when the labels assigned by the committee members are equally distributed (total disagreement) and maximal when all members fully agree. There are two problems with this criterion. First, a comparison a may not be labeled by a subject m. This can be overcome if we consider the predictions computed based on the learned model of subject m and allow each committee member to ‘vote’ for its winning class. The same idea is implemented in the so-called vote entropy method (Dagan and Engelson 1995), where the entropy is measured over the final classes assigned to an example by the possible models, and not over the class probabilities given by those models. Second, in practical applications just scoring votes turns out to be suboptimal. The reason, as also suggested in (McCallum and Nigam 1998), is that the Vote criterion does not take into account the confidences of the committee members’ predictions.
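A minimal sketch of how the Vote criterion of Eq. (9) could be evaluated on the raw labels; the data layout (per-subject collections of (a, c) pairs) is an assumption for illustration only.

```python
from collections import Counter

def vote(a, subject_data):
    """Vote criterion of Eq. (9): size of the largest group of subjects agreeing on the
    label of comparison a; subject_data is a list of per-subject sets of (a, c) pairs."""
    votes = Counter(c for D_m in subject_data for (a_m, c) in D_m if a_m == a)
    return max(votes.values()) if votes else 0
```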

3.1.3 Committee criterion

We will use the following notation for the predictive probability corresponding to a subject m=1,…,M,

$$p_m(c|a) \equiv p(c|a, \mathcal{M}_m) . $$

The predictive probability can be computed either by taking into account the entire distribution \(\mathcal {M}_{m} = \mathcal {N}(\boldsymbol{\mu}^{m}, \boldsymbol{\varSigma}^{m})\)

$$p_m(c|a) = \int p(c|a,\boldsymbol{\alpha}) \mathcal {N}\bigl(\boldsymbol{ \alpha}| \boldsymbol{\mu}^m, \boldsymbol{\varSigma}^m\bigr) d \boldsymbol{\alpha}, $$

or, for computational reasons, we can consider only a point estimate for \(\mathcal {M}_{m}\), for example, the mean of the Gaussian distribution, and use it to compute the predictive probabilities using Eq. (2)

$$ p_m(c|a) \approx p\bigl(c|a,\boldsymbol{\mu}^m\bigr) . $$
(10)

Inspired by McCallum and Nigam (1998), we propose to measure disagreement by taking the average prediction of the entire committee and computing the average Kullback-Leibler (KL) divergence of the individual predictions from the average:

$$ \mbox{Committee}(a) = \sum_{m=1}^M \frac{1}{M} \mathrm {KL}\bigl[\bar{p}(\cdot |a) || p_m(\cdot|a) \bigr], $$
(11)

with \(\bar{p}(\cdot|a)\) the average predictive probability of the entire committee, which will be more precisely defined in Sect. 3.2.

The KL divergence for discrete probabilities is defined as

$$\mathrm {KL}\bigl[p_1(\cdot|a) || p_2(\cdot|a)\bigr] = \sum _c p_1(c|a) \log \biggl( \frac{p_1(c|a)}{p_2(c|a)} \biggr) . $$

The KL divergence can be seen as a distance between probabilities, where we abused the notion of distance, since the KL-divergence is not symmetric, i.e., KL[p 1||p 2]≠KL[p 2||p 1]. This drawback of the KL-divergence can be overcome by considering a symmetric measure, for example, KL[p 1||p 2]+KL[p 2||p 1]. In (McCallum and Nigam 1998), the disagreement is computed between committee members constructed based on the current model, i.e., the committee changes with every update and the criterion has to be recomputed with every update. A committee of models learned on different tasks is fixed and thus selecting examples solely based on it leads to a fixed instead of an active design: all examples can be ranked beforehand (the same applies to the Vote criterion defined above).

To arrive at an active design and take into account the current model, we propose a small modification based on the following intuition. Querying examples on which the committee members disagree makes sense, because it will force the current model to make a choice between options that, according to the committee members, are reasonably plausible. However, when the current model on a particular example already “made up its mind”, i.e., deviates substantially from the average prediction of the committee based on what it learned from other input/output pairs, it makes no sense to still query that example, even though the committee members might disagree. Taking into account this consideration, we propose the Committee criterion which assigns a score to a candidate query comparison a through

$$ \mbox{Committee}(a) = \frac{1}{M} \sum_{m=1}^M \mathrm {KL}\bigl[ \bar{p}(\cdot |a)|| p_m(\cdot|a) \bigr] - \gamma \mathrm {KL}\bigl[ \bar{p}(\cdot|a)|| p(\cdot|a) \bigr] , $$
(12)

with p(⋅|a) the current model’s predictive probability based on the data seen so far and γ a parameter that accounts for the degree of similarity between subjects. According to the Committee criterion, the most interesting experiments are those on which the other models disagree (the first term on the right-hand side of Eq. (12)), with the current model (still) undecided (the second term on the right-hand side of Eq. (12)).

An advantage of the Committee criterion is its computational efficiency: the first term on the right-hand side of Eq. (12) as well as the average predictive probability can be computed beforehand. The Committee criterion does require computation of the predictive probabilities corresponding to the current model, but this is the least one could expect from an active design. This is to be compared with the QBC criterion (any of the two variants considered), which requires constructing new committee members with each update, and D-optimal experimental design, which calls for keeping track of variances.
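To make this computation concrete, the following is a minimal sketch of Eqs. (11)–(12), using point-estimate predictive probabilities (Eq. (10)) and the average predictive probability \(\bar{p}\) defined in Sect. 3.2 (the logarithmic opinion pool); the function names are ours, not the authors’ implementation.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions over the A alternatives."""
    return float(np.sum(p * np.log(p / q)))

def committee_score(p_members, p_current, gamma=1.0):
    """Committee criterion of Eq. (12) for one candidate comparison a.

    p_members : array (M, A), predictive probabilities p_m(.|a) of the committee members
    p_current : array (A,),   predictive probability p(.|a) of the current model
    """
    # average of the committee: logarithmic opinion pool, Eq. (14)
    p_bar = np.exp(np.mean(np.log(p_members), axis=0))
    p_bar /= p_bar.sum()
    disagreement = np.mean([kl(p_bar, p_m) for p_m in p_members])   # precomputable
    return disagreement - gamma * kl(p_bar, p_current)
```

The candidate comparison with the highest score is queried next; \(\bar{p}\) and the disagreement term depend only on the committee and can be computed once for the whole pool, so only the last term has to be re-evaluated as the current model is updated.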

Note that we have not made any restriction so far with respect to the probabilistic models used in the active learning design. In the following we will consider only the log-linear models introduced in Sect. 2. They have some nice properties, which simplify the computation of the Committee criterion (Sect. 3.2), and provide a natural link to hierarchical Bayesian modeling (Sect. 2.3). The general idea, of using the already learned models from the other tasks as the committee members in a QBC-like approach, is of course also applicable to other models.

3.2 Average probability

In this section we discuss how to efficiently compute the average probability used for computing the Committee criterion in Eq. (12) in the case of log-linear models (Christensen 1997). For linear utility functions the likelihood function defined in Eq. (2) is a log-linear model: the log-odds of the model are linear in the parameters.

Let p m (c|a) be the predictive probability defined in Eq. (10). We define the average predictive probability of the committee, \(\bar{p}(c|a)\), as the prediction probability that is closest to the prediction probabilities of the members:

$$ \bar{p}(c|a) \equiv \operatorname {argmin}_{p(c|a)} \sum_{m=1}^M \frac{1}{M} \mathrm {KL}\bigl[p(c|a)||p_m(c|a)\bigr] . $$
(13)

The solution of the optimization from above is the so-called logarithmic opinion pool (Bordley 1982)

$$ \bar{p}(c|a) = \frac{1}{Z(a)} \prod_{m=1}^M \bigl[p_m(c|a) \bigr]^{ \frac{1}{M}} = \frac{1}{Z(a)} \exp \Biggl( \frac{1}{M} \sum_{m=1}^M \log p_m(c|a) \Biggr), $$
(14)

with Z(a) a normalization constant

$$Z(a) = \sum_c \prod_{m=1}^M \bigl[p_m(c|a) \bigr]^{\frac{1}{M}} . $$

For log-linear models, the logarithmic opinion pool boils down to a simple averaging of model parameters:

$$ \bar{p}(c|a) = p(c|a, \bar{\boldsymbol{\mu}}) \quad \mbox{with } \bar{\boldsymbol{\mu}} = \frac{1}{M} \sum_{m=1}^M \boldsymbol{\mu}^m . $$
(15)

This natural combination between log-linear models and logarithmic opinion pools is the advantage of using the logarithmic opinion pool instead of the linear opinion pool used in (McCallum and Nigam 1998).

As can be seen from the EM updates in Eq. (7), the average \(\bar{\boldsymbol{\mu}}\) in the logarithmic opinion pool is then precisely the mean of the learned hierarchical prior. Summarizing, once we have learned a hierarchical prior from the data available for subjects 1 through M using the EM algorithm, we can start off the new model M+1 from this prior (as is normally done in hierarchical Bayesian learning). On top of this, the same EM algorithm gives us the information we need to compute the Committee criterion that can be used subsequently to select new inputs to label.
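A small numerical check of the claim in Eq. (15): for the multinomial logit (log-linear) likelihood, the logarithmic opinion pool of the members’ predictions coincides with the prediction made with the averaged parameters. The toy data below are arbitrary and only serve to illustrate this identity.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

rng = np.random.default_rng(1)
A_features = rng.normal(size=(3, 4))   # one comparison: A = 3 alternatives, D = 4 features
mus = rng.normal(size=(5, 4))          # means mu^m of M = 5 committee members

# Logarithmic opinion pool of the members' predictive probabilities, Eq. (14)
log_probs = np.stack([np.log(softmax(A_features @ mu)) for mu in mus])
pool = np.exp(log_probs.mean(axis=0))
pool /= pool.sum()

# Prediction with the averaged parameters, Eq. (15)
p_avg = softmax(A_features @ mus.mean(axis=0))

print(np.allclose(pool, p_avg))        # True: the two coincide for log-linear models
```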

3.3 Other criteria for active learning

In this section we discuss how several strategies for active learning can be implemented in the learning framework considered here. All these strategies are concerned with evaluating the informativeness of the unlabeled points. Let the new model obtained after incorporating an observation (a,c) be \(\mathcal {M}_{(a,c)} \approx\mathcal{N}(\boldsymbol{\mu}_{(a,c)}, \boldsymbol{\varSigma}_{(a,c)})\).

  1.

    Uncertainty sampling (Lewis and Gale 1994). In this strategy an active learner chooses for labeling the example for which the model’s predictions are most uncertain. The uncertainty of the predictions can be measured, for example, using Shannon entropy

    $$ \mathrm{Uncertainty} (a) = -\sum_c p(c|a, \mathcal{M}) \log p(c|a, \mathcal{M}) . $$
    (16)

    For a binary classifier this strategy reduces to querying points whose prediction probabilities are close to 0.5. Intuitively, this strategy aims at finding the decision boundary as fast as possible, since the decision boundary lies in the regions where the model is most uncertain.

  2.

    Variance reduction (MacKay 1992). This strategy, also known in experimental design as D-optimality (Fedorov 1972; Chaloner and Verdinelli 1995; Berger 1994; Ford and Silvey 1980), chooses as the most informative experiments the ones that give the largest reduction in the model’s uncertainty. The motivation behind this strategy is a result of Geman et al. (1992) which shows that the generalization error can be decomposed into three components: (i) noise (which is independent of the model or training data); (ii) bias (due to the model); (iii) the model’s variance. Since the model cannot influence the noise and the bias components, the future generalization error can only be influenced via the model’s variance. Formally, this criterion can be written as

    $$ \mathrm{Variance} (a) = \sum_c p(c|a,\mathcal{M}) \mathrm{variance} [\mathcal{M}_{(a,c)}] - \mathrm{variance} [\mathcal{M}]. $$
    (17)

    In the setting considered in this work the variance of the model is expressed in the covariance of the Gaussian distribution. In order to use Eq. (17) we need to choose a measure for the variance. We can consider, for example, the log-determinant of the covariance matrix

    $$ \mbox{Variance-logdet} (a) = \sum_c p(c|a, \mathcal {M}) \log\det (\boldsymbol{\varSigma}_{(a,c)}) - \log\det(\boldsymbol{\varSigma}) , $$
    (18)

    which is actually minimizing the entropy of the Gaussian random variable representing the current model, or the trace of the covariance matrix

    $$ \mbox{Variance-trace} (a) = \sum_c p(c|a, \mathcal {M}) \operatorname {Tr}(\boldsymbol{\varSigma}_{(a,c)}) - \operatorname {Tr}(\boldsymbol{\varSigma}) . $$
    (19)
  3.

    Expected model change (Cohn et al. 1996). This strategy chooses as the most informative query the one which when added to the training set would yield the greatest model change. Quantifying the model change depends on the learning framework. For gradient-based optimization the change can be measured via the training gradient, i.e., the vector used to re-estimate parameter values (Settles and Craven 2008). In the Bayesian framework, the model change can be quantified via a distance measure between the current distribution and the posterior distribution obtained after incorporating the candidate point

    $$\mathrm{Change} (a) = \sum_c p(c|a, \mathcal{M}) \mathrm{distance} [\mathcal{M}, \mathcal{M}_{(a,c)} ] . $$

    A suitable distance for our setting is the Kullback-Leibler divergence between distributions, which for two Gaussians has a closed form solution and can be written as follows

    $$ \mathrm {KL}\bigl[\mathcal{M} || \mathcal{M}_{(a,c)}\bigr] = \frac{1}{2} \biggl[ \operatorname {Tr}\bigl(\boldsymbol{\varSigma}_{(a,c)}^{-1}\boldsymbol{\varSigma}\bigr) + (\boldsymbol{\mu}_{(a,c)}-\boldsymbol{\mu})^T \boldsymbol{\varSigma}_{(a,c)}^{-1} (\boldsymbol{\mu}_{(a,c)}-\boldsymbol{\mu}) - D + \log\frac{\det\boldsymbol{\varSigma}_{(a,c)}}{\det\boldsymbol{\varSigma}} \biggr] . $$
    (20)

    The KL divergence between Gaussians is used by Seeger (2008) to design an efficient sequential experimental design in a setting similar to the one used in this work.

Uncertainty sampling, QBC, and its variants are attractive due to their applicability in various machine learning settings. Variance reduction and expected model change are robust and in many situations they have proved to be the best one can do (Schein and Ungar 2007). Although more robust, the variance reduction and expected model change strategies are computationally more demanding, since for each candidate comparison query and each possible label the induced posterior distribution has to be computed. For the learning setting considered in this study, the posterior distribution computed using Bayes’ theorem (Eq. (6)) does not have an analytical expression and approximations are needed for it; these approximations are usually costly. In contrast, the variants of QBC proposed in this paper are computationally efficient.
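To illustrate the cost difference, below is a sketch (with assumed helper names, not the authors’ code) of the Variance-logdet criterion of Eq. (18): for every candidate comparison and every possible label the posterior has to be re-approximated, e.g., with one Laplace or EP update, whereas the Committee criterion above needs only the current model’s predictive probabilities.

```python
import numpy as np

def variance_logdet(a_features, model, update_fn):
    """Variance-logdet criterion of Eq. (18) for one candidate comparison.

    model     : (mu, Sigma), current Gaussian approximation of the posterior
    update_fn : function(a_features, c, model) -> (mu_ac, Sigma_ac), the posterior after
                hypothetically observing choice c (e.g. one Laplace or EP update)
    """
    mu, Sigma = model
    p = np.exp(a_features @ mu)
    p /= p.sum()                                  # predictive probabilities, Eq. (10)
    _, logdet = np.linalg.slogdet(Sigma)
    score = -logdet
    for c, p_c in enumerate(p):                   # one hypothetical re-training per label
        _, Sigma_ac = update_fn(a_features, c, (mu, Sigma))
        _, logdet_ac = np.linalg.slogdet(Sigma_ac)
        score += p_c * logdet_ac
    return score
```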

3.4 Similarities between criteria

In this section we consider the following active learning criteria: Variance-logdet, Committee, Variance-trace and Change-KL. We investigate how similar these active learning criteria are and how they can be related. We analyze the modifications induced in the model by the criteria after updating the probability model to incorporate the information from new training points. A single update induces a small change in the posterior distribution, and this allows for Taylor expansions, keeping only the lowest non-zero contribution. In the following we present the main results of the approximations, while some of the details can be found in the Appendix.

As we will show below and in the Appendix, under the assumption that the updates of the posterior distribution for each alternative a and choice c lead to small changes in the model \(\mathcal {M}\), we can approximate the active learning criteria by expressions of the form

$$ \sum_c p(c|a, \boldsymbol{\alpha}) \boldsymbol{g}(c|a, \boldsymbol{\alpha})^T \boldsymbol{Q} \boldsymbol{g}(c|a,\boldsymbol{\alpha}) , $$
(21)

for some vector α and matrix Q and with g the gradient of the log-probabilities

$$\boldsymbol{g}(c|a, \boldsymbol{\alpha}) \equiv\frac{\partial\log p(c|a,\boldsymbol{\alpha})}{\partial\boldsymbol{\alpha}} . $$
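For the linear utility model of Eq. (3), this gradient has the simple closed form \(\boldsymbol{g}(c|a,\boldsymbol{\alpha}) = \boldsymbol{\phi}(\boldsymbol{x}_{i_c(j)}) - \sum_{c'} p(c'|a,\boldsymbol{\alpha}) \boldsymbol{\phi}(\boldsymbol{x}_{i_{c'}(j)})\), so the quadratic form of Eq. (21) can be evaluated directly, as in the following sketch (our own, with assumed names). With \(\boldsymbol{Q} = \boldsymbol{\varSigma}\) and \(\boldsymbol{\alpha} = \boldsymbol{\mu}\) it gives the approximation of Lemma 1 below; with \(\boldsymbol{Q} = \tilde{\boldsymbol{\varSigma}}\), \(\boldsymbol{\alpha} = \bar{\boldsymbol{\mu}}\), and a factor 1/2 it gives that of Lemma 2.

```python
import numpy as np

def quadratic_form_criterion(a_features, alpha, Q):
    """Evaluates sum_c p(c|a,alpha) g(c|a,alpha)^T Q g(c|a,alpha), Eq. (21), for the
    linear utility model, with g(c|a,alpha) = phi(x_c) - sum_c' p(c'|a,alpha) phi(x_c')."""
    p = np.exp(a_features @ alpha)
    p /= p.sum()
    mean_phi = p @ a_features
    score = 0.0
    for c, p_c in enumerate(p):
        g = a_features[c] - mean_phi              # gradient of the log-probability
        score += p_c * float(g @ Q @ g)
    return score
```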

The following lemma approximates the Variance-logdet criterion to the form from Eq. (21).

Lemma 1

In a first order approximation, assuming that Σ (a,c) is close to Σ, we can simplify

$$ \mathrm{Variance}\mbox{-}\mathrm{logdet}(a) \approx\sum_c p(c|a, \boldsymbol{\mu}) \boldsymbol{g}(c|a, \boldsymbol{\mu})^T \boldsymbol{\varSigma} \boldsymbol{g}(c|a,\boldsymbol{\mu}) , $$
(22)

where μ and Σ represent the mean and covariance of the Gaussian posterior distribution.

Proof

In a first order approximation we have

$$ \boldsymbol{\varSigma}_{(a,c)}^{-1} \approx\boldsymbol{\varSigma}^{-1} -{ \partial^2 \log p(c|a,\boldsymbol{\alpha}) \over\partial\boldsymbol{\alpha} \partial\boldsymbol{\alpha}^{T} } \bigg|_{\boldsymbol{\alpha}=\boldsymbol{\mu}} $$
(23)

where we ignored the change from the old α to a new MAP solution depending on c and a. We will use the notation

$$\boldsymbol{H}(c|a, \boldsymbol{\alpha}) \equiv{\partial^2 \log p(c|a, \boldsymbol{\alpha}) \over\partial\boldsymbol{\alpha}\partial \boldsymbol{\alpha}^T} . $$

For a matrix A and ϵ small compared to A, the following holds (see, for example, Boyd and Vandenberghe 2004, p. 642)

$$ \log\det(\boldsymbol{A} + \epsilon\boldsymbol{I}) \approx\log\det(\boldsymbol{A}) + \operatorname {Tr}\bigl[\boldsymbol{A}^{-1}\epsilon\bigr] . $$
(24)

Assuming \(\boldsymbol{\varSigma}_{(a,c)}^{-1}\) is close to Σ −1 which makes H(c|a,μ) small, we can use Eq. (23) in Eq. (24) with the following substitutions A=Σ −1 and H(c|a,μ)=ϵ I to obtain

$$ \log\det\boldsymbol{\varSigma}_{(a,c)}^{-1} \approx\log\det \boldsymbol{\varSigma}^{-1} - \operatorname {Tr}\bigl[\boldsymbol{\varSigma} \boldsymbol{H}(c|a,\boldsymbol{\mu})\bigr]. $$
(25)

The probability that the subject gives the response c when presented the alternatives a follows by integrating p(c|a,α) over the current posterior. We make a second order Taylor expansion of p(c|a,α) around the point μ:

The first order term cancels since the gradient is zero at the maximum solution α=μ. In a lowest order approximation we can ignore the correction upon p(c|a,α) to obtain

where for the last approximation we used the approximation from Eq. (25). To obtain the proof of this lemma we use Lemma 3 in the Appendix at the end of the paper which states a relationship between Hessian and Fisher matrices. □

Using the same type of approximation, the Committee criterion can be approximated to the same form given in Eq. (21).

Lemma 2

In a lowest order approximation the Committee criterion can be written as

$$ \mathrm{Committee}(a) \approx{1 \over2} \sum_c p(c|a,\bar{\boldsymbol{\mu}}) \boldsymbol{g}(c|a, \bar{\boldsymbol{\mu}})^T \tilde{\boldsymbol{\varSigma}} \boldsymbol{g}(c|a, \bar {\boldsymbol{\mu}}) , $$

where \(\bar{\boldsymbol{\mu}}\) is the mean of the hierarchical prior learned from the other subjects and

$$ \tilde{\boldsymbol{\varSigma}} \equiv{1 \over M} \sum_{m=1}^{M} \bigl( \boldsymbol{\mu}^m- \bar{\boldsymbol{\mu}}\bigr) \bigl( \boldsymbol{\mu}^m - \bar{\boldsymbol{\mu}}\bigr)^T - ( \boldsymbol{\mu}- \bar{\boldsymbol{\mu}} ) (\boldsymbol{\mu}- \bar{\boldsymbol{\mu}})^T . $$

We make a second order Taylor expansion of the KL divergences from the definition of the Committee criterion in Eq. (12)

$$\mathrm {KL}\bigl[ \bar{p}(\cdot|a)|| p(\cdot|a) \bigr] = \sum _c p(c|a, \bar{\boldsymbol{\mu}}) \log \biggl[ {{p(c|a, \bar{\boldsymbol{\mu}}) }\over{ p(c|a, \boldsymbol{\mu}) }}\biggr] , $$

around the point \(\bar{\boldsymbol{\mu}}\).

The first order term of the Taylor expansion is:

$$ -\sum_c p(c|a, \bar{\boldsymbol{\mu}}) { \partial\log p(c|a, \bar{\boldsymbol{\mu}}) \over\partial\boldsymbol{\mu}} \bigg|_{\boldsymbol{\mu}=\bar{\boldsymbol{\mu}}} (\boldsymbol{\mu}-\bar{\boldsymbol{\mu}})^T =\sum _c p(c|a,\bar{\boldsymbol{\mu}}) \boldsymbol{g}(c|a,\bar{\boldsymbol{\mu}}) (\boldsymbol{\mu}-\bar{\boldsymbol{ \mu}})^T $$

which cancels since based on Eq. (28) from the Appendix \(\sum_{c} p(c|a, \bar{\boldsymbol{\mu}}) \boldsymbol{g}(c|a,\bar{\boldsymbol{\mu}})\) is the vector with every component 0.

The second order term can be rewritten using Lemma 3 as:

Since the other terms cancel, we obtain that the KL-divergence between the predictive probabilities can be approximated as

$$\mathrm {KL}\bigl[ \bar{p}(\cdot|a)|| p(\cdot|a) \bigr] = {1 \over2} \sum _c p(c|a, \bar{\boldsymbol{\mu}})\boldsymbol{g}(c|a,\bar{ \boldsymbol{\mu}})^T (\boldsymbol{\mu}- \bar {\boldsymbol{\mu}}) ( \boldsymbol{\mu}- \bar{\boldsymbol{\mu}})^T \boldsymbol{g}(c|a,\bar{ \boldsymbol{\mu} }) . $$

Making this approximation for all the KL-divergences in the definition of the Committee criterion from Eq. (12) and computing the sum we obtain the result stated in this lemma.

Furthermore, it can also be shown that Variance-trace and Change-KL can be approximated to the same form given in Eq. (21), namely

$$ \mbox{Variance-trace}(a) \approx\sum_c p(c|a, \boldsymbol{\mu}) \boldsymbol{g}(c|a,\boldsymbol{\mu})^T \boldsymbol{ \varSigma}^2 \boldsymbol{g}(c|a,\boldsymbol{\mu}) , $$
(26)

and

$$ \mbox{Change-KL}(a) \approx \mbox{Variance-logdet}(a) . $$
(27)

For the derivations of these approximations see Lemmas 4 and 5 in the Appendix.

We will focus on the differences between the Variance-logdet criterion (considered as the reference) and the Committee criterion. The differences between their approximations are as follows.

  1.

    The gradients g(c|a,⋅) are evaluated at different points: the prior hierarchical mean \(\bar{\boldsymbol{\mu}}\) and the current posterior mean μ. This effect is small since μ is still close enough to \(\bar{\boldsymbol{\mu}}\) for a sufficiently accurate approximation of the gradients, in particular at the start of the learning when selecting the right points to label is more important.

  2.

    The current posterior variance Σ is replaced by \(\tilde{\boldsymbol{\varSigma}}\). The effect of the precise weighting of the gradients is not so important, and again, at the beginning of learning \(\tilde{\boldsymbol{\varSigma}}\) is close to Σ.

The way in which experiments are selected is more important at the beginning of the learning process, when μ is still close to the prior mean \(\bar{\boldsymbol{\mu}}\), and \(\tilde{\boldsymbol{\varSigma}}\) to Σ.

4 Experimental evaluation

This section presents the experimental evaluation of the framework proposed in this paper. We will use pairwise comparison data. The main goal of the experimental evaluation is to show that optimally selecting data for labeling using the Committee criterion achieves higher accuracy than random selection. Furthermore, we also show that the Committee criterion performs in practice similarly to other standard active learning criteria, but in addition has a computational advantage.

4.1 Data sets

The following data sets related to the preferences of people were used in the experimental evaluation.

4.1.1 Letor

This data set consists of relevance levels assigned to documents with respect to a given textual query (the OHSUMED data set from Letor 3.0, Qin et al. 2010). The relevances were assessed by human experts using a three-level scale: definitely relevant, partially relevant, and not relevant. We used a subset of this data related to Query 1. It contains 138 references with the following labels: 24 definitely relevant, 26 partially relevant, and 88 not relevant. Each of the samples is characterized by a 45-dimensional vector consisting of text features extracted from the titles and abstracts of the documents. The features were normalized. Based on this data set we constructed pairwise preferences belonging to 50 subjects in the way described below. We followed a procedure similar to (Xu et al. 2010) to turn the relevance levels into pairwise preference comparisons. Since such coarse relevance judgements are considered unrealistic in many real-world applications, Xu et al. (2010) proposed to add uniform noise in the range [−0.5,0.5] to the true relevance levels. This addition preserves the relative order between definitely relevant (respectively partially relevant) documents and partially relevant (respectively not relevant) ones, but randomly breaks ties within each relevance level. To introduce a hierarchical component, we replaced the random tie-breaking of Xu et al. (2010) by a subject-specific one. We do this by replacing the uniform noise with a subject- (and feature-) dependent term as follows. For subject m, a weight vector α m is drawn from a zero mean fully factorized Gaussian with unit variance, \(\boldsymbol{\alpha}_{m} \sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\). Given features x i , the noise terms are the inner products \(\boldsymbol{\alpha}_{m}^{T} \boldsymbol{x}_{i}\), linearly scaled back to the interval [−0.5,0.5] (so as not to destroy the relative order of the true relevance levels), and the relevance levels are taken to be the true relevance levels plus these noise terms.
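A sketch of this construction follows; the code and the exact rescaling are our own illustrative assumptions, since the paper does not spell out the scaling in detail.

```python
import numpy as np

def subject_relevance(true_levels, features, rng):
    """Subject-specific relevance levels for the Letor construction (Sect. 4.1.1).

    true_levels : array (N,), relevance levels in {0, 1, 2}
    features    : array (N, D), normalized document features
    """
    alpha_m = rng.normal(size=features.shape[1])       # subject-specific weight vector
    noise = features @ alpha_m                          # subject- and feature-dependent term
    # rescale linearly to [-0.5, 0.5]: ties are broken, level order is preserved
    noise = (noise - noise.min()) / (noise.max() - noise.min()) - 0.5
    return true_levels + noise

# pairwise preferences then follow by comparing the perturbed relevance of document pairs
```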

4.1.2 Audio

This data set consists of evaluations of sound quality from 32 subjects (this data set was borrowed from Arehart et al. 2007). Each subject performed 576 pairwise comparison listening experiments. Each listening experiment represents one sound sample processed with two different settings of the hearing-aid parameters, and the choice for one of the two. The processed sound sample is represented by a 3-dimensional feature vector.

4.1.3 Art

This data set consists of evaluations of art images from 190 subjects (this data set was borrowed from Yu et al. 2003). Each subject was presented a number of images from a total of 642 images and asked to rate each of them: like/dislike (on average, each subject rated 90 images). We considered the 32 subjects who rated more than 120 images. Each image is described by a 275-dimensional feature vector, with features which characterize the image, such as color, shape, texture, etc. For computational reasons, we reduced this high-dimensional feature vector to a lower dimension. We noticed that most of the features were not very informative for predicting the outcomes, so we used only the 10 most informative features (the informativeness of the features was measured by averaging the correlations between features and observations). Note that this data set does not contain pairwise comparisons like the other two data sets. With each instance, a binary label is associated: like or dislike, which makes the learning task on this data set a binary classification task. The combination of multi-task and active learning that we propose in this work can still be applied in this case in the framework introduced in Sect. 2, by using the logistic regression model instead of the Bradley-Terry model for the likelihood terms in Eq. (6).

4.2 Protocol

Our experiments use a leave-one-out scheme in which each subject was considered once as the current/test subject for which the preferences need to be learned. For each test subject the learning started with the hierarchical prior learned from the data of the other remaining subjects. The data for the test subject was split into 5 folds, 1 fold was used for training and the rest was used for testing. The training data was used as a pool out of which points were selected for labeling either randomly or actively using one of the active learning criteria. The hierarchical prior was updated based on these data points. After every update predictions were made on the test set using the current model. We used accuracy (percentage of correct predictions among all the predictions) as a measure of performance. The accuracy of the predictions on the test data measures how much we learned about the subject preferences. The results were averaged over the 5 splits and over the subjects.
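Schematically, the per-subject part of this protocol could be organized as in the following sketch; the helper functions (select_fn, update_fn, predict_fn) are placeholders rather than the authors’ code, and the averaging over folds and subjects is omitted.

```python
import numpy as np

def evaluate_subject(pool, test_set, prior, select_fn, update_fn, predict_fn, n_queries):
    """One test subject: actively query labeled pairs from the pool, update the model,
    and track test accuracy after every update."""
    model, accuracies = prior, []
    remaining = list(pool)                               # (a, c) pairs held out as the pool
    for _ in range(n_queries):
        a, c = select_fn(remaining, model)               # e.g. argmax of the Committee criterion
        remaining.remove((a, c))
        model = update_fn(model, a, c)                   # Bayesian update, Eq. (6)
        correct = [predict_fn(model, a_t) == c_t for a_t, c_t in test_set]
        accuracies.append(float(np.mean(correct)))
    return accuracies
```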

4.3 Performance

The framework that we propose in this work for optimizing preference learning consists of combining the multi-task formalism with active learning. The multi-task ideas in preference learning are especially useful when the training preference data available for a subject is very limited. In this situation it makes sense to use the preference data from other subjects as additional information.

4.3.1 Letor

The pairwise comparisons from the Letor data set were generated by adding noise in the interval [−0.5,0.5] such that the relative order between the three relevance levels is preserved, but ties within each relevance level are broken. As a result, different subjects do agree on comparisons between different relevance levels. Thus, the data was constructed to have an underlying common structure in the preferences of different subjects, and for this reason we expect multi-task learning to improve the performance. In order to validate this hypothesis, we checked whether the preferences of a new subject can be learned more accurately by using the available preference data from other subjects. We compared the hierarchical model with the Gaussian process method for preference learning of Chu and Ghahramani (2005b), which assumes no prior information. The hierarchical/community prior was obtained by applying the EM algorithm described in Sect. 2.3 in combination with the semiparametric utility function from Eq. (4); the hierarchical prior was learned from 20 samples from each of the other subjects. The method of Chu and Ghahramani (2005b) was applied with a Gaussian kernel in which the kernel parameters were tuned using cross-validation.

Figure 1, left panel, compares the accuracy obtained using the hierarchical model to the Gaussian process method for preference learning of (Chu and Ghahramani 2005b) in a non-active setting, i.e., for both models the updates are done with training points randomly selected. The prediction accuracy is shown as a function of the number of data points included in the training set for the test subject. The plots show that the improvement obtained with the hierarchical model depends on the size of the training data. This is in accordance with the expectation that the multi-task formalism is suited for situations in which the available training data is small. The results obtained using a hierarchical prior on the audio and art data sets are similar to the ones obtained in the case of the Letor data set. Figure 1, right panel, compares the accuracy obtained with random and active selection and starting from a hierarchical prior. Please note the change in scaling of the y-axis. The active selection was implemented using the Committee criterion. These plots show that the combination between multi-task and active learning indeed improves the performance.

Fig. 1

Left: Comparison between the hierarchical model that we discussed in Sect. 2.3 and the Gaussian processes for preference learning model (Chu and Ghahramani 2005b). The setting is non-active, the updates are done using training points randomly selected. Right: Random vs active selection of training points. The performance is evaluated as a function of the number of data points included in the training set. The active selection was implemented using the Committee criterion. The shaded region shows the range of 10 random strategies. The error bars give the standard deviation of the mean accuracy, averaged over the subjects. Please note the change in scaling of the y-axis between the left and right plots

4.3.2 Audio

Figure 2 shows the performance of the Committee criterion (left) and the Variance-logdet criterion (right) versus random selection on the audio data set. The plots show the prediction accuracy (on the y-axis) as a function of the number of updates from the hierarchical prior (on the x-axis). The shaded region indicates the accuracy of 10 random selection runs. The error bars give the standard deviation of the mean accuracy, averaged over the 32 subjects. We used the Committee criterion with γ=0 since the subjects in the committee are quite similar to each other, which is also suggested by the small error bars. The informative prior improves the predictions at the beginning, when no preference observations are yet available for the new subject: the hierarchical prior alone already gives an accuracy of almost 0.7 on the audio data at the start of learning. The hierarchical prior was learned from 20 randomly selected data points per subject. Committee and Variance-logdet strongly overlap and are considerably better than the random strategy. The audio data set contains a few very informative data points and some that are not informative at all. In some cases the difference between the two sound samples presented in an experiment is so small that the subject cannot hear any difference. Such experiments are not informative, because the subject's answer is then close to random and does not provide any information about the subject's preferences. The active learning criteria avoid selecting this type of experiment and obtain better performance than random selection. The performance of the other active learning strategies (not shown) is comparable to that of the strategies shown in Fig. 2, except for the Vote criterion, which does not seem to perform better than random. We refer to Sect. 4.5 for an empirical evaluation of the similarities between the active learning criteria considered in Sect. 3.

Fig. 2

Performance of the Committee criterion on the left and Variance-logdet criterion on the right versus random selection for the audio data set. The plots show the prediction accuracy (on the y-axis) as a function of the number of updates from the hierarchical prior (on the x-axis). The error bars give the standard deviation of the mean accuracy, averaged over the 32 subjects. The shaded region shows the range of 10 random strategies

4.3.3 Art

Figure 3 shows the performance of the Committee criterion (left) and the Variance-logdet criterion (right) versus random selection on the art data set. The plots show the prediction accuracy (on the y-axis) as a function of the number of updates from the hierarchical prior (on the x-axis). The shaded region indicates the accuracy of 10 random selection runs. The error bars give the standard deviation of the mean accuracy, averaged over the subjects. For the art data, which has a higher variability between subjects, the Committee criterion with γ=1 performs slightly better than the Committee criterion with γ=0. People's preferences for art images are more difficult to predict, since they do not depend simply on low-level characteristics of the image such as texture or color. This is why the accuracy obtained on the art data is lower than the accuracy obtained, for example, on the audio data. The Variance-logdet criterion appears to perform slightly better than the Committee criterion. Furthermore, the benefit of active learning over random selection is much smaller than on the audio data. As for the audio data set, the performance of the other active learning strategies (not shown) is comparable to that of the strategies shown in Fig. 3.

Fig. 3

Performance of the Committee criterion on the left and Variance-logdet criterion on the right versus random selection for the art data set. The plots show the prediction accuracy (on the y-axis) as a function of the number of updates from the hierarchical prior (on the x-axis). The error bars give the standard deviation of the mean accuracy, averaged over the subjects. The shaded region shows the range of 10 random strategies

4.4 Computational complexity

One of the main advantages of the Committee criterion is its computational simplicity in comparison with the standard criteria used in experimental design, exemplified here by the Variance-logdet criterion. For every candidate data point to be included in the training set, the Variance-logdet criterion needs to infer the posterior distribution that the candidate would induce. A standard method for performing this approximation is Laplace's method, which has cubic complexity in the dimension of the data features because it involves inversions of covariance matrices. For more details about other inference methods suited to pairwise comparison data we refer to Birlutiu and Heskes (2007). All the algorithms presented here are linear in the number of data points, which makes them scalable to large data sets. Table 1 compares the execution times of the Variance-logdet and Committee criteria as a function of the feature dimension. Since these execution times for a fixed number of updates (one in our case) do not depend on the actual nature of the data, we randomly generated data so as to be able to vary the number of input dimensions; the generated data had dimension 10, 50, 100 and 200. The time was evaluated for 100 candidate data points and one update step. The Committee criterion was computed from data belonging to 20 users. The simulations were performed using Matlab on an Intel Xeon processor with 16 GB of memory running Fedora release 9 with Linux kernel 2.6.27. In the case of the Committee criterion, only KL divergences between predictive probabilities need to be computed. The Committee criterion is clearly much faster than the Variance-logdet criterion; furthermore, contrary to the Variance-logdet criterion, its computational complexity is independent of the dimension of the features.
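To make the contrast concrete, the following is a minimal sketch (not the implementation used in our experiments) of how the two scores could be computed for a single candidate comparison, assuming a logistic choice model on the feature difference of the two alternatives; the exact form of aggregation and the rank-one Laplace-style update are simplifying assumptions, and all names are illustrative.

import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def committee_score(x_diff, committee_weights, current_weights):
    """Committee-style score: disagreement (here, mean Bernoulli KL) between the
    committee members' predictive probabilities and that of the current model.
    Cost per candidate: a few dot products, no covariance inversion."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    p_current = sigmoid(current_weights @ x_diff)
    return np.mean([bernoulli_kl(sigmoid(w @ x_diff), p_current)
                    for w in committee_weights])

def variance_logdet_score(x_diff, post_cov, p_candidate):
    """Variance-logdet-style score: (negative) log-determinant of the posterior
    covariance after a crude Laplace-style rank-one update for the candidate.
    The matrix inversions make the cost cubic in the feature dimension."""
    lam = p_candidate * (1.0 - p_candidate)                 # logistic Hessian weight
    precision = np.linalg.inv(post_cov) + lam * np.outer(x_diff, x_diff)
    _, logdet = np.linalg.slogdet(np.linalg.inv(precision))
    return -logdet                                          # smaller volume = more informative

The point of the sketch is the asymptotics rather than the exact scores: the committee score touches the feature vector only through inner products, whereas the variance-based score manipulates full covariance matrices.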

Table 1 Execution time (in seconds) for the Variance-logdet and Committee criteria as a function of feature dimension

4.5 Similarities between criteria

In order to test empirically the approximations and similarities from Sect. 3.4, we computed the Spearman rank correlations (Hollander and Wolfe 1973) between the scores assigned by the criteria when evaluating the informativeness of the data points. The correlations were computed for all the data sets considered in the experimental evaluation: Letor, audio and art. For each subject the learning started with the hierarchical prior learned from the data of the other subjects. This prior was updated by taking into account the information from 20 randomly selected data points for each data set. After these updates, we computed the scores assigned by each of the active learning criteria to 50 randomly chosen data points. Figure 4 shows the Spearman rank correlation coefficients for each pair of criteria; the darker the color, the closer the correlations are to 1 and the stronger the similarity between the two criteria. There are several observations to be made from this figure. (i) One can notice a darker square in the lower-left part of the figures, for all three data sets. This square involves the Variance-logdet, Change-KL, Variance-trace, and Committee criteria. The correlations between each pair of them are very close to 1, which suggests that in practice these criteria perform very similarly. This is in line with the theory from Sect. 3.4, which reduces these criteria to approximately the same form. (ii) The Variance-logdet and Change-KL criteria have a Spearman rank correlation extremely close to 1; their approximations are proven to be equivalent in Lemma 5 in the Appendix. These two observations also suggest that the approximations of Variance-logdet and Change-KL are very accurate. (iii) The Vote criterion behaves almost randomly in some situations: when the number of subjects is much smaller than the number of data points considered, the scores assigned by the Vote criterion are the same for most of the experiments. (iv) The Uncertainty criterion is the most different from the others.
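For illustration, a correlation matrix like the one shown in Fig. 4 can be obtained along the following lines; the scores below are random placeholders rather than the actual criterion scores, and the criterion names are only labels.

import numpy as np
from scipy.stats import spearmanr

criteria = ["Uncertainty", "Vote", "Committee", "V-trace", "Change-KL", "V-logdet"]
# Placeholder scores: one row per criterion, one column per candidate data point.
scores = np.random.rand(len(criteria), 50)

# Pairwise Spearman rank correlations between the criteria's rankings of the candidates.
corr = np.ones((len(criteria), len(criteria)))
for a in range(len(criteria)):
    for b in range(a + 1, len(criteria)):
        rho, _ = spearmanr(scores[a], scores[b])
        corr[a, b] = corr[b, a] = rho
print(np.round(corr, 2))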

Fig. 4

Spearman rank correlation coefficients for the scores assigned by the active learning criteria on the three data sets: Letor data (left), audio data (center) and art data (right). The darker the color, the higher the correlations, suggesting that the criteria are very similar; the lighter the color, the lower the correlations, suggesting that the criteria are not similar. The Variance-logdet and Variance-trace criteria are referred to in the plots as V-logdet and V-trace. The correlations between the Variance-logdet and Committee criteria are about 0.7 for the Letor data set, 0.95 for the audio data set and around 0.85 for the art data set

5 Conclusions and discussions

This work studied how to exploit models learned in other scenarios to actively and efficiently learn a model for a new scenario. Our approach to active learning in a multi-scenario setting combines a hierarchical Bayesian prior (to learn from related scenarios) with active learning (to learn efficiently by selecting informative examples). Our new Committee criterion, inspired by the Query-by-Committee method, is very similar to the standard criteria from experimental design, in particular in the early stages of active learning, but computationally more efficient. Aside from the computational advantage, the Committee criterion introduces the idea of having the data available from other users collaborate in order to select the most informative experiments to perform with a new user. The same idea is already implicit in the Query-by-Committee algorithm. We show, theoretically and through experiments, that this conceptual idea also works with a committee of people. This can be interpreted as another way of using people as the elements of a machine learning algorithm, which is a very promising research area, as suggested also by Sanborn and Griffiths (2008).

There are several aspects related to the approach proposed here that require further attention. (i) The design is myopic in the sense that the active learning criteria look only one step ahead when evaluating the informativeness of a data point. A non-myopic design looks more than one step ahead and is theoretically closer to the best possible design, but computationally much more expensive. Due to the computational complexity of a non-myopic design, we discussed all the active learning criteria from a myopic perspective; however, a non-myopic perspective, similar to the one proposed by Boutilier (2002), can be applied to all of them. (ii) In this work we used log-linear models and Gaussian distributions to model the preference data. The same idea, of using models learned on data from different subjects (or scenarios) to actively select examples for a new subject, can be applied to other models and starting from different priors as well, although the mathematics will be somewhat more involved and less intuitive. In particular, a mixture of Gaussians as the prior may still be feasible and may lead to an active learning strategy that tries to find those examples that best discriminate to which mixture component the current model belongs.