Attributes for Image Retrieval

Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)


Image retrieval is a computer vision application that people encounter in their everyday lives. To enable accurate retrieval results, a human user needs to be able to communicate in a rich and noiseless way with the retrieval system. We propose semantic visual attributes as a communication channel for search because they are commonly used by humans to describe the world around them. We first propose a new feedback interaction where users can directly comment on how individual properties of retrieved content should be adjusted to more closely match the desired visual content. We then show how to ensure this interaction is as informative as possible, by having the vision system ask those questions that will most increase its certainty over what content is relevant. To ensure that attribute-based statements from the user are not misinterpreted by the system, we model the unique ways in which users employ attribute terms, and develop personalized attribute models. We discover clusters among users in terms of how they use a given attribute term, and consequently discover the distinct “shades of meaning” of these attributes. Our work is a significant step in the direction of bridging the semantic gap between high-level user intent and low-level visual features. We discuss extensions to further increase the utility of attributes for practical search applications.



5.1 Introduction

Semantic visual attributes are properties of the world akin to adjectives (e.g. “furry,” “metallic,” “smiling,” “natural,” etc.) Humans naturally explain the world to each other with attribute-driven descriptions. For example, a person might say “Give me the red cup” or “I wanted shoes that were more formal” or “The actor I am thinking of is older.” Thus, attributes are meaningful to humans. Importantly, they can also be captured with computational models. As such, they are an excellent channel for communication between the user and the system. This property is exploited by a number of works that apply attributes for semantic image retrieval [7, 26, 30, 45, 50, 55, 60].

Image retrieval is a task in which a human user poses a query, using either text or an image, to a search engine, and the engine returns image results. Search is necessary because there is simply too much visual data on the web today, and browsing it to find relevant content is infeasible. When people perform a search, they usually have a very specific idea of what they want to retrieve, and this idea cannot be captured by simple tags or keywords, which are usually category labels. The traditional categories we use in computer vision are insufficiently descriptive of the user’s information need because they are too coarse-grained. For example, a user might want to buy shoes that satisfy certain properties like color, heel height, texture, etc., and these properties cannot be captured by even the most fine-grained categories that might reasonably exist in the world. Similarly, the user might search for stock photography to include in a presentation, and she likely has a very detailed idea of what the photograph she wants to include should look like. One alternative to keyword search is to ask the user to provide example images, but when the user’s target is specific or complex, suitable examples may not be available. Another is to have users sketch what they want to find, but expecting users to draw well is unrealistic. Thus, search via some form of language-based interaction remains a very appealing option.

It is infeasible to pre-assign tags to images that are sufficient to satisfy any future query. Further, due to the “semantic gap” between the system’s low-level image representation and the user’s high-level concept, one-shot retrieval performed by matching images to keywords is unlikely to get the right results. Typically retrieval systems allow the user to iteratively provide feedback on the results retrieved in each round. In this interactive form of search, users mark some images as “relevant” and others as “irrelevant”, and the system adapts its relevance ranking function accordingly [5, 14, 31, 33, 52, 63, 70]. Instead of requesting feedback on some user-chosen subset of the current results, some methods perform active selection of the images to display for feedback, by exploiting the uncertainty in the system’s current model of relevance to find useful exemplars [5, 14, 33, 63, 70].

However, this form of feedback is limited as it forces the retrieval system to guess what about the images was relevant or irrelevant. For example, when a user searches for “black shoes”, retrieves a pair of pointy high-heeled black shoes, and marks them as irrelevant, this might be because she did not want these shoes to be “pointy”, or because she wanted them to be “flat”. However, the system does not know which, and this uncertainty will negatively impact the next set of image results. Furthermore, existing methods which actively select the images for feedback use an approximation for finding the optimal uncertainty reduction, whether in the form of uncertainty sampling [63] or by employing sampling or clustering heuristics [5, 14]. Finally, such methods only consider binary feedback (“this is relevant”/“this is irrelevant”), which is imprecise.

Below, we introduce a method for refining image search results via attributes. A user initiates the search, for instance by providing a set of keywords involving objects or attributes, and the system retrieves images that satisfy those keywords. After the initialization, the user performs relevance feedback; the form of this feedback is where our method’s novelty lies. We propose a new approach which allows the user to give rich feedback based on relative attributes. For example, she can say “Show me images like this, but brighter in color.” This descriptive statement allows the system to adjust the properties of the search results in exactly the way which the user envisions. Notice this new form of feedback is much more informative than the “relevant/irrelevant” binary relevance feedback that previous methods allowed.

Attribute-based search has been explored in [30, 50, 55, 60], but while one-shot attribute-based queries allow a user to more precisely state their goal compared to category-based queries, the full descriptive power of attributes cannot be utilized without a way to quantify to what extent they are present and to refine a search after the query is issued. Furthermore, existing work in attribute-based search [28, 50, 55, 60] assumes one classifier is sufficient to capture all the variability within a given attribute term, but researchers find there is substantial disagreement between users regarding attribute labels [6, 13, 23, 46]. We show how to prevent this disagreement from introducing noise on the user-system communication channel.

Towards the broad goal of interactive search with attributes, we address a number of technical challenges. First, we use attributes to provide a channel on which the user can communicate her information need precisely and with as little effort as possible. We find that, compared to traditional binary relevance feedback, attributes enable more powerful relevance feedback for image search (Sect. 5.2), and show how to further select this feedback so it is as informative as possible (Sect. 5.3). Unlike existing relevance feedback for image retrieval [5, 14, 16, 31, 52, 62, 70], the attribute-based feedback we propose allows the user to communicate with the retrieval system precisely how a set of results lack what the user is looking for. We also investigate how users use the attribute vocabulary during search, and ensure that the models learned for each attribute align with how a user employs the attribute name, which is determined by the user’s individual perception of this attribute (Sect. 5.4). We automatically discover and exploit the commonalities that exist in user perceptions of the same attribute, to reveal the “shades of meaning” of an attribute and learn more robust models (Sect. 5.5). Due to their computational efficiency, the methods we develop are highly relevant to practical applications.

5.2 Comparative Relevance Feedback Using Attributes

In [26], we propose a novel mode of feedback where a user directly describes how high-level properties of image results should be adjusted in order to more closely match her envisioned target images. Using the relevance feedback paradigm, the user first initializes the search with some keywords: either the name of the general class of interest (“shoes”) or some multi-attribute query (“black high-heeled shoes”). Alternatively, the user can provide an image or a sketch [9], and we can use existing query-by-example approaches [37, 48] to retrieve an initial set of results. The system ranks the database images with respect to how well they match the text-based or image-based query. Our system’s job is to refine this initial set of results, through user-given feedback. If no text-based or image-based initialization is possible, the search simply begins with a random set of images for feedback.

The top-ranked images are then displayed to the user, and the feedback-refinement loop begins. For example, when conducting a query on a shopping website, the user might state: “I want shoes like these, but more formal.” When browsing images of potential dates on a dating website, she can say: “I am interested in someone who looks like this, but with longer hair and more smiling.” When searching for stock photos to fit an ad, she might say: “I need a scene similarly bright as this one and more urban than that one.” See Fig. 5.1. Using the resulting constraints in the multi-dimensional attribute space, the system updates its relevance function, re-ranks the pool of images, and displays to the user the images which are most relevant. In this way, rather than simply state which images are (ir)relevant, the user employs semantic terms to say how they are so. We call the approach WhittleSearch, since it allows users to “whittle away” irrelevant portions of the visual feature space via precise, intuitive statements of their attribute preferences.
Fig. 5.1

WhittleSearch allows users to refine image search using relative attribute feedback. In this example, the user initiated the search with the query “black shoes,” retrieved some results, and then asked the system to show images that are “more formal” than the second result and “shinier” than the fourth result. The system then refined the set of search results in accordance with the user’s descriptive feedback. Image reprinted with permission

Throughout, let \(\mathcal{D} = \{I_{1},\dots ,I_{N}\}\) refer to the pool of N database images that are ranked by the system using its current scoring function \(S_t : I_i \rightarrow \mathbb {R}\), where t denotes the iteration of refinement. \(S_t(I_i)\) captures the likelihood that image \(I_i\) is relevant to the user’s information need, given all accumulated feedback received in iterations \(1,\dots ,t-1\). Note that \(S_t\) supplies a (possibly partial) ordering on the images in \(\mathcal {D}\).

At each iteration t, the top \(L < N\) ranked images \(\mathcal {T}_{t} = \{I_{t1},\dots ,I_{tL}\} \subseteq \mathcal {D}\) are displayed to the user for further feedback, where \(S_t(I_{t1}) \ge S_t(I_{t2}) \ge \dots \ge S_t(I_{tL})\). A user then gives feedback of her choosing on any or all of the L results in \(\mathcal {T}_t\). We refer to \(\mathcal {T}_t\) interchangeably as the reference set or top-ranked set.

Offline, our system learns a set of ranking functions, each of which predicts the relative strength of a nameable attribute in an image (e.g. the degree of “shininess,” “furriness,” etc.). First, we describe how relative attribute models are learned, and then how we use these models to enable a new mode of relevance feedback.

5.2.1 Learning to Predict Relative Attributes

We assume we are given a vocabulary of M attributes \(A_1, \dots , A_M\), which may be generic or domain-specific for the image search problem of interest. For example, a domain-specific vocabulary for shoe shopping could contain attributes such as “shininess,” “heel height,” “colorfulness,” etc., whereas for scene descriptions it could contain attributes like “openness,” “naturalness,” and “depth”. It would be too expensive to manually annotate all images with their attribute strength, so we learn to extrapolate from a small set of annotations to a prediction function over all database images as follows.

For each attribute \(A_m\), we obtain supervision on a set of image pairs (ij) in the training set \(\mathcal {I}\). We ask human annotators to judge whether that attribute has a stronger presence in image i or j, or if it is equally strong in both. On each pair we collect five redundant responses from multiple annotators on Amazon Mechanical Turk (MTurk), in order to elicit the most common perception of the attribute and reduce the impact of noisy responses; we use only those responses for which most labelers agree. This yields a set of ordered image pairs \(O_m = \{(i,j)\}\) such that \((i,j) \in O_m \implies i \succ j\), i.e. image i has stronger presence of attribute \(A_m\) than j. Note that making comparative judgments is often more natural for annotators than assigning absolute scores reflecting how much the attribute \(A_m\) is present [44]. Our approach extends the learning process proposed in [44] to incorporate image-level (rather than category-level) relative comparisons, which we show in [27] to more reliably capture those attributes that do not closely follow category boundaries.
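For illustration, the majority-vote filtering of the five redundant annotations per pair might look as follows (the label encoding and agreement threshold are hypothetical, not the exact protocol used in the chapter):

```python
from collections import Counter

def filter_by_majority(responses, min_agree=3):
    """responses: dict mapping an image pair (i, j) to the list of
    annotator labels, each in {'i', 'j', 'equal'} (encoding is
    hypothetical). Keep a pair only when at least `min_agree`
    annotators gave the same label."""
    ordered = []   # pairs (a, b) meaning: a has the attribute more than b
    for (i, j), labels in responses.items():
        label, count = Counter(labels).most_common(1)[0]
        if count < min_agree:
            continue   # too much disagreement: drop the pair
        if label == 'i':
            ordered.append((i, j))
        elif label == 'j':
            ordered.append((j, i))
        # 'equal' majorities could be kept separately as similarity pairs
    return ordered
```

The surviving ordered pairs then form the constraint set \(O_m\) used for training.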

Next, we employ the large-margin formulation of [20] to learn a ranking function for each attribute that orders images in increasing order of attribute strength. The function is of the form \(a_m(\varvec{x}_i) = \varvec{w}_m^T\varvec{x}_i,\) where each image \(I_i\) is represented in \(\mathbb {R}^d\) by a feature vector \(\varvec{x}_i\). We seek a vector \(\varvec{w}_m\) for each \(m=1,\dots ,M\) that enforces a large margin between images at nearby ranks, while also allowing the maximum number of the following constraints to be satisfied: \(\forall (i,j) \in O_m: \varvec{w}_m^T\varvec{x}_i > \varvec{w}_m^T\varvec{x}_j.\) The ranking objective in [20] is reminiscent of standard SVM training and is solved with similar methods; see [20, 27] for details. We apply the learned functions \(a_1,\dots ,a_M\) to an image’s feature descriptor \(\varvec{x}\), in order to predict the extent to which each attribute is present in any novel image. Note that this training is a one-time offline process.
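A minimal numpy sketch of such a pairwise ranker is shown below. It performs subgradient descent on a hinge objective over difference vectors, which is a simplified stand-in for the large-margin solver of [20]; the hyperparameters are illustrative, not tuned values from the chapter.

```python
import numpy as np

def learn_relative_attribute(X, ordered_pairs, epochs=300, lr=0.05, C=10.0):
    """Learn w such that w^T x_i > w^T x_j for each ordered pair (i, j).

    X: (N, d) array of image feature vectors.
    ordered_pairs: list of (i, j) with image i stronger in the attribute.
    """
    diffs = np.asarray([X[i] - X[j] for i, j in ordered_pairs])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # pairs whose margin w^T (x_i - x_j) is still below 1
        viol = diffs[diffs @ w < 1.0]
        # regularizer plus hinge subgradient over violating pairs
        grad = w - C * viol.sum(axis=0) / len(diffs)
        w -= lr * grad
    return w
```

The predicted attribute strength of a novel image with descriptor \(\varvec{x}\) is then simply the dot product `X_new @ w`.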

The predicted attribute values \(a_m(\varvec{x}_i)\) are what we can observe for image \(I_i\). They are a function of (but distinct from) the “true” latent attribute strengths \(A_m(I_i)\). Using standard features and kernels, we find that 75 % of held-out ground truth comparisons are preserved by attribute predictors trained with \(\sim \)200 pairs.

More sophisticated techniques for learning attribute models can be applied. For example, multiple attributes can be modeled jointly [3, 66]. Chapter  4 describes an approach for decorrelating attribute models, Chap.  6 proposes a method to learn fine-grained attribute differences, and [34] proposes to use random forests to improve relative attributes. In [30], the authors describe how to discover localized attributes using a pre-defined set of candidate face regions (e.g. mouth, eyes), and the authors of [54] mine for discriminative object parts. One can also develop a method to directly learn the spatial support of attributes by capturing human intuition about this support, or by discovering what image features change smoothly to make an attribute “appear” in images [67]. Recent work uses deep networks to predict attributes [11, 42, 57, 58], and to adapt attributes across domains [4, 35].

5.2.2 Relative Attribute Feedback

With the ranking functions learned above, we can now map any image from \(\mathcal {D}\) into an M-dimensional space, where each dimension corresponds to the relative rank prediction for one attribute. It is in this feature space that we propose to handle query refinement based on a user’s feedback.

To refine the current search results, the user surveys the L top-ranked images in the displayed set \(\mathcal {T}_t\), and uses some of them as reference images to express her desired visual result. The feedback is of the form “What I want is more/less m than image \(I_{t_f}\)”, where m is an attribute name, and \(I_{t_f}\) is an image in \(\mathcal {T}_t\) (the subscript \(t_f\) denotes it is a reference image at iteration t). Let \(\mathcal {F}=\{(I_{t_f}, m, r)\}_1^K\) denote the set of all accumulated comparative constraints at each iteration, where r is the user response r \(\in \) {“more”, “less”}. The conjunction of all such user feedback statements is used to update the relevance scoring function.

Let \(G_{k,i} \in \{0,1\}\) be a binary random variable representing whether image \(I_i\) satisfies the k-th feedback constraint. For example, if the user’s k-th comparison on attribute m yields response r = “more”, then \(G_{k,i} = 1\) if the database image \(I_i\) has attribute m more than the corresponding reference image \(I_{t_f}\). The estimate of relevance accumulates, over all \(|\mathcal {F}|\) feedback comparisons, the probability that each one is satisfied:
$$\begin{aligned} S_{T}(I_i) = \sum _{k=1}^{|\mathcal {F}|} P(G_{k,i} = 1 | I_i, \mathcal {F}_k). \end{aligned}$$
Using Iverson bracket notation, we compute the probability that an individual constraint is satisfied as:
$$\begin{aligned} P(G_{k,i} = 1 \,|\, I_i, \mathcal {F}_k) = {\left\{ \begin{array}{ll} \left[ a_m(\varvec{x}_i) > a_m(\varvec{x}_{t_f})\right] , &{} \text {if } r = \text {``more''}\\ \left[ a_m(\varvec{x}_i) < a_m(\varvec{x}_{t_f})\right] , &{} \text {if } r = \text {``less''}, \end{array}\right. } \end{aligned}$$
where the bracket evaluates to 1 when its argument holds and 0 otherwise. This simply reflects that images having the appropriate amount of property m are more relevant than those that do not. In the next iteration, we show at the top of the results page those images that satisfy all constraints, followed by images satisfying all but one constraint, etc. The feedback loop is repeated, accepting any additional feedback on the newly top-ranked images, until the user’s target image is found or the budget of interaction effort is expended. The final output is a sorting of the database images in \(\mathcal {D}\) according to their likelihood of being relevant.
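As a concrete sketch, this constraint-counting ranking can be implemented directly on top of the predicted attribute values (attribute names and strengths below are illustrative):

```python
def whittle_score(a_i, feedback):
    """Count how many feedback constraints an image satisfies.

    a_i: dict mapping attribute name -> predicted strength for image I_i.
    feedback: list of (attribute, reference_strength, response) triples,
    with response in {'more', 'less'}.
    """
    score = 0
    for attr, ref, r in feedback:
        if r == 'more' and a_i[attr] > ref:
            score += 1
        elif r == 'less' and a_i[attr] < ref:
            score += 1
    return score
```

Sorting the database by this score in descending order puts images satisfying all constraints first, then those satisfying all but one, and so on.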

Note that these similarity constraints differ from traditional binary relevance feedback, in that they single out an individual attribute. Each attribute feedback statement carves out a relevant region of the M-dimensional attribute feature space, whittling away images not meeting the user’s requirements. Further, the proposed form of relative attribute feedback refines the search in ways that a straightforward multi-attribute [30, 55, 60] query cannot. If a user simply states the attribute labels of interest (“show me black shoes that are shiny and high-heeled”), the system can retrieve the images whose attribute predictions meet those criteria, but since the user’s description is in absolute terms, it cannot be refined based on the retrieved images. In contrast, with access to relative attributes as a mode of communication, for every new set of reference images returned by the system, the user can further refine her description. Similarly to multi-attribute queries, faceted browsing—where the retrieval system organizes documents or products according to several properties (facets) and allows the user to query with different combinations of the facets [64]—is also a form of keyword search with fixed values for the attribute properties. However, this form of search does not suffice when a user’s preferences are very specific and possibly subjective, i.e. it may be difficult to quantize attributes as multiple-valued facets and determine what lies within a range of 0.2–0.4 of “pointiness.”

5.2.3 Experimental Validation

We analyze how the proposed relative attribute feedback can enhance image search compared to classic binary feedback. We use three datasets: the Shoes dataset from the Attribute Discovery Dataset [1], the Public Figures dataset of human faces [29] (PubFig), and the Outdoor Scene Recognition dataset of natural scenes [40] (OSR). The Shoes data contains 14,658 shoe images, which we annotate via Amazon’s Mechanical Turk with ten relative attributes (“pointy at the front,” “open,” “bright in color,” “ornamented,” “shiny,” “high at the heel,” “long on the leg,” “formal,” “sporty,” “feminine”). For PubFig we use the subset from [44], which contains 772 images from 8 people and 11 attributes (“masculine-looking,” “young,” “smiling,” “chubby,” “pointy nose,” etc.). OSR consists of 2,688 images from 8 categories and 6 attributes (“natural,” “open,” “close-depth,” etc.); these attributes are used in [44]. For the image features \(\varvec{x}\), we use GIST [40] and LAB color histograms for Shoes and PubFig, and GIST alone for OSR, since the scenes do not seem well characterized by color.

For each query we select a random target image and score how well the search results match that target after feedback. This target stands in for a user’s mental model; it allows us to prompt multiple subjects for feedback on a well-defined visual concept, and to precisely judge how accurate results are. We measure the NDCG@K [21] correlation between the full ranking computed by \(S_t\) and a ground truth ranking that reflects the perceived relevance of all images in \(\mathcal {D}\).
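For reference, NDCG@K can be computed as below; a minimal sketch in which the ground-truth relevance of each ranked item is assumed given:

```python
import math

def ndcg_at_k(ranked_rels, k):
    """NDCG@K for one ranking. ranked_rels lists the ground-truth
    relevance of each item in the predicted order, best-first."""
    def dcg(rels):
        # discounted cumulative gain over the top k positions
        return sum(r / math.log2(pos + 2) for pos, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; any deviation from the ideal order lowers the score, with errors near the top penalized most.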

As a baseline, we use a “binary relevance feedback” approach that is intended to represent traditional approaches such as [5, 14, 52, 62, 63]. In a binary relevance feedback model, the user identifies a set of relevant images \(\mathcal {R}\) and a set of irrelevant images \(\mathcal {\bar{R}}\) among the current reference set \(\mathcal {T}_t\). In this case, the scoring function \(S^b_t\) is a classifier (or some other statistical model), and the binary feedback supplies positive (the images in \(\mathcal {R}\)) and negative (the images in \(\mathcal {\bar{R}}\)) training examples for that classifier. We employ a support vector machine (SVM) classifier for the binary feedback model due to its strong performance in practice.

We use two methods to generate feedback statements in order to evaluate our method and the baseline. First, we gather attribute comparisons from users on MTurk. Second, to allow testing on a larger scale without incurring a large monetary cost, we also generate feedback automatically, by simulating user responses. For relative constraints, we randomly sample constraints based on the predicted relative attribute values, checking how the target image relates to the reference images. For binary feedback, we analogously sample positive/negative reference examples based on their image feature distance to the true target. When scoring rank, we add Gaussian noise to the predicted attributes (for our method) and the SVM outputs (for the baseline), to coarsely mimic people’s uncertainty in constraint generation.

In Fig. 5.2, we show the rank correlation for our method and the baseline as a function of the number of feedback statements, using 100 queries and automatically generated feedback. A round of feedback consists of a relative attribute constraint (for our method) or a binary relevance label on one image (for the baseline). For all datasets, both methods clearly improve with more feedback, but the precision enabled by attribute feedback yields larger gains in accuracy. The result is intuitive, since with our method users can better express what about the reference image is (ir)relevant to them, whereas with binary feedback they cannot.
Fig. 5.2

Impact of the amount of feedback: while more feedback enhances both methods, the proposed attribute feedback yields faster gains per unit of feedback. Image reprinted with permission

We see similar results when using the feedback generated by real users on MTurk. Attribute feedback largely outperforms binary feedback on Shoes and PubFig, and performs comparably on OSR. One possible reason for the weaker OSR result is that people have more difficulty interpreting the scene attribute meanings (e.g. “amount of perspective” on a scene is less intuitive than “shininess” on shoes). In Sects. 5.4 and 5.5, we propose methods that help account for these ambiguities and differences in user perception.

In [27], we analyze the performance of our system when rather than a batch of feedback statements in a single iteration, one statement is given at a time, and the system iterates. Our method outperforms the binary feedback baseline for all datasets, but on PubFig our advantage is slight, likely due to the strong category-based nature of the PubFig data, which makes it more amenable to binary feedback, i.e. adding positive labels on exemplars of the same person as the target image is quite effective.

Note that while feedback using language (in the form of relative attributes) is clearly richer and more informative than binary relevance feedback, some aspects of desired visual content may be hard to capture in words. In such cases, binary feedback, while imprecise, might offer a more natural alternative. In [26], we propose a hybrid feedback approach that combines relative attribute and binary feedback. Further, one could utilize work in modeling perceptual similarity [18, 61] to more accurately estimate the user’s visual need based on examples that the user identifies.

5.3 Actively Guiding the User’s Relevance Feedback

Having presented the basic system using relative attribute feedback for image search, we now consider the question of which images ought to receive the user’s feedback. Notably, the images believed to be most relevant need not be most informative for reducing the system’s uncertainty. As a result, it might be more beneficial to leave the choice of reference images on which to seek feedback to the system. Thus, we next explore how the system can best select the feedback it requests. The method and results in this section first appeared in [23].

The goal of actively selecting images for feedback is to solicit feedback on those exemplars that would most improve the system’s notion of relevance. Many existing methods exploit classifier uncertainty to find useful exemplars (e.g. [33, 63, 70]), but they have two limitations. First, they elicit traditional binary feedback which is imprecise, as discussed above. This makes it ambiguous how to extrapolate relevance predictions to other images, which in turn clouds the active selection criterion. Second, since ideally they must scan all database images to find the most informative exemplars, they are computationally expensive and often resort to sampling or clustering heuristics [5, 14, 51] or to the over-simplified uncertainty sampling [63] which does not guarantee global uncertainty reduction over the full dataset.

Building on the WhittleSearch concept we introduced above, we next introduce a novel approach that addresses these shortcomings. As before, we assume the user initiates a search and the goal of our method is to then refine the results. We propose to actively guide the user through a coarse-to-fine search using a relative attribute image representation. At each iteration of feedback, the user provides a visual comparison between the attribute in her envisioned target and a “pivot” exemplar, where a pivot separates all database images into two balanced sets. Instead of asking the user to choose both the image and attribute for feedback, in this approach we ask the system to make this choice, so the user is presented with a single image and a single attribute and simply has to provide the value of the comparison (“more”, “less”, or “equally”). In other words, the system interacts with the user through multiple-choice questions of the form: “Is the image you are looking for more, less, (or equally) A than image I?”, where A is a semantic attribute and I is an exemplar from the database being searched. The system actively determines along which of multiple attributes the user’s comparison should next be requested, based on the expected information gain that would result. We show how to limit the scan for candidate questions to just one image (the pivot) per attribute. Thus, the active selection method is efficient both for the system (which analyzes a small number of candidates per iteration) and the user (who locates her content via a small number of well-chosen interactions). See Fig. 5.3.
Fig. 5.3

The active version of WhittleSearch requests feedback in the form of visual attribute comparisons between the user’s target and images selected by the system. To formulate the optimal questions, it unifies an entropy reduction criterion with binary search trees in attribute space. Image reprinted with permission

5.3.1 Attribute Binary Search Trees

We use the same notation as in Sect. 5.2. \(A_m(I_i)\) denotes the true strength and \(a_m(I_i)\) the predicted strength of an attribute m in image \(I_i\). We construct one binary search tree for each attribute \(m=1,\dots ,M\). The tree recursively partitions all database images into two balanced sets, where the key at a given node is the median relative attribute value within the set of images passed to that node. To build the m-th attribute tree, we start at the root with all database images, sort them by their attribute values \(a_m(I_1),\dots ,a_m(I_N)\), and identify the median value. Let \(I_p\) denote the “pivot” image (the one that has the median attribute strength). The images \(I_i\) for which \(a_m(I_i) \le a_m(I_p)\) are passed to the left child, and those for which \(a_m(I_i) > a_m(I_p)\) are passed to the right child. The splitting repeats recursively, each time storing the next pivot image and its relative attribute value at the appropriate node. Note that the relative attribute ranker training and search tree construction are offline procedures.
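The recursive median split described above can be sketched as follows (a simplified construction over predicted attribute values; the node layout is illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PivotNode:
    pivot: int                            # index of the pivot (median) image
    value: float                          # its predicted attribute strength
    left: Optional['PivotNode'] = None    # images with a_m <= value
    right: Optional['PivotNode'] = None   # images with a_m > value

def build_attribute_tree(indices: List[int], a) -> Optional[PivotNode]:
    """Recursively split `indices` at the image whose predicted
    attribute value a[i] is the median of the set."""
    if not indices:
        return None
    order = sorted(indices, key=lambda i: a[i])
    pivot = order[len(order) // 2]
    left = [i for i in order if a[i] <= a[pivot] and i != pivot]
    right = [i for i in order if a[i] > a[pivot]]
    return PivotNode(pivot, a[pivot],
                     build_attribute_tree(left, a),
                     build_attribute_tree(right, a))
```

One tree is built per attribute, so the set of root pivots gives the initial candidate comparisons, and a user response moves the pivot to the corresponding child.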

One could devise a search procedure that requests a comparison to the pivot at each level of a single attribute tree and eliminates the appropriate portion of the database depending on the user’s response. However, such pruning is error-prone because (1) the attribute predictions may not be identical to the attribute strengths a user will perceive, and (2) such pruning ignores the information gain that could result by intelligently choosing the attribute along which a comparison is requested. Instead, we will show how to use comparisons to the pivots in our binary search trees, in order to probabilistically refine the system’s prediction of the relevance/irrelevance of database images to the user’s goal.

5.3.2 Predicting the Relevance of an Image

The output of our search system will be a sorting of the database images \(I_i \in \mathcal {D}\) according to their probability of relevance, given the image content and all user feedback. As before, \(\mathcal {F}=\{(I_{p_m},r)\}_{k=1}^T\) denotes the set of comparative constraints accumulated in the T rounds of feedback so far. The k-th item in \(\mathcal {F}\) consists of a pivot image \(I_{p_m}\) for attribute m, and a user response r \(\in \) {“more”, “less”, “equally”}. \(G_{k,i} \in \{0,1\}\) is a binary random variable representing whether image \(I_i\) satisfies the k-th feedback constraint. Let \(y_i \in \{1, 0\}\) denote the binary label for image \(I_i\), which reflects whether it is relevant to the user (matches her target), or not. The probability of relevance is the probability that all T feedback comparisons in \(\mathcal {F}\) are satisfied, and for numerical stability, we use a sum of log probabilities: \(\log P(y_i = 1 | I_i, \mathcal {F}) = \sum _{k=1}^T \log P(G_{k,i} = 1 | I_i, \mathcal {F}_k).\) This equation is similar to the definition of \(S_{T}(I_i)\) in Sect. 5.2, but we now use a soft score denoting whether an image satisfies a constraint, in order to account for the fact that predicted attributes can deviate from true perceived attribute strengths. The probability that the k-th individual constraint is satisfied, given that the user’s response was r for pivot \(I_{p_m}\), is:
$$\begin{aligned} P(G_{k,i} = 1 \,|\, I_i, \mathcal {F}_k) = {\left\{ \begin{array}{ll} P\left( a_m(I_i) > a_m(I_{p_m})\right) , &{} \text {if } r = \text {``more''}\\ P\left( a_m(I_i) < a_m(I_{p_m})\right) , &{} \text {if } r = \text {``less''}\\ P\left( a_m(I_i) = a_m(I_{p_m})\right) , &{} \text {if } r = \text {``equally''}. \end{array}\right. } \end{aligned}$$
To estimate these probabilities, we map the differences of attribute predictions, i.e. \(a_m(I_i) - a_m(I_p)\) (or \(|a_m(I_i) - a_m(I_p)|\) for “equally”) to probabilistic outputs, using Platt’s method [47].
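A minimal sketch of this soft relevance scoring is given below, using a logistic squashing of attribute differences; the calibration slope and the tolerance for “equally” are hypothetical constants standing in for the Platt-fitted parameters:

```python
import math

def sigmoid(z, alpha=5.0):
    # logistic mapping of a score difference to a probability;
    # alpha is a hypothetical calibration slope (fit to data in practice)
    return 1.0 / (1.0 + math.exp(-alpha * z))

def log_relevance(a_i, feedback):
    """Sum of log probabilities that image I_i satisfies each constraint.

    a_i: dict mapping attribute name -> predicted strength for I_i.
    feedback: list of (attribute, pivot_strength, response) triples,
    with response in {'more', 'less', 'equally'}.
    """
    logp = 0.0
    for attr, a_p, r in feedback:
        d = a_i[attr] - a_p
        if r == 'more':
            p = sigmoid(d)
        elif r == 'less':
            p = sigmoid(-d)
        else:  # 'equally': probability is high when |d| is small
            p = 1.0 - sigmoid(abs(d) - 0.5)
        logp += math.log(max(p, 1e-12))
    return logp
```

Ranking the database by this log score realizes the probabilistic refinement: images whose predicted attributes narrowly miss a constraint are demoted rather than discarded outright.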

5.3.3 Actively Selecting an Informative Comparison

Our system maintains a set of M current pivot images (one per attribute tree) at each iteration, denoted \(\mathcal {P}=\{I_{p_1},\dots ,I_{p_M}\}\). Given the feedback history \(\mathcal {F}\), we want to predict the information gain across all N database images that would result from asking the user how her target image compares to each of the current pivots in \(\mathcal {P}\). We will request a comparison for the pivot that minimizes the expected entropy when used to augment the current set of feedback constraints. Note that selecting a pivot corresponds to selecting both an image as well as an attribute along which we want it to be compared; \(I_{p_m}\) refers to the pivot for attribute m.

The entropy given feedback \(\mathcal {F}\) is:
$$\begin{aligned} H(\mathcal {F}) = - \sum _{i=1}^N \sum _{\ell } P(y_i=\ell | I_i, \mathcal {F}) \log P(y_i = \ell | I_i, \mathcal {F}), \end{aligned}$$
where \(\ell \in \{0,1\}\). Let R be a random variable denoting the user’s response, R \(\in \) {“more”, “less”, “equally”}. We select the next pivot for comparison as:
$$\begin{aligned} I_p^*= \displaystyle \mathop {\text {arg min}}_{I_{p_m} \in \mathcal {P}} \sum _{r} P(R = r | I_{p_m}, \mathcal {F})~~H(\mathcal {F} \cup (I_{p_m}, r)). \end{aligned}$$
Optimizing Eq. 5.5 requires estimating the likelihood of each of the three possible user responses to a question we have not issued yet. In [23], we describe and evaluate three strategies to estimate it; here we describe one. We use cues from the available feedback history to form a “proxy” for the user, essentially borrowing the probability that a new constraint is satisfied from previously seen feedback. Let \(I_b\) be the database image which the system currently ranks highest, i.e. the image that maximizes \(P(y_i=1 | I_i, \mathcal {F})\). We can use this image as a proxy for the target, and compute:
$$\begin{aligned} P(R=r | I_{p_m},\mathcal {F}) = P(G_{c,b} = 1 | I_b, \mathcal {F}_c), \end{aligned}$$
where c indexes the candidate new feedback for a (yet unknown) user response R.
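Putting the selection rule together, a sketch of Eq. 5.5 might look as follows; `posterior` and `response_likelihood` are hypothetical callables standing in for the relevance model of Sect. 5.3.2 and the proxy estimate above, and the toy data is invented:

```python
import math

def entropy(rel_probs):
    """Binary relevance entropy H(F), summed over database images."""
    h = 0.0
    for p in rel_probs:
        for q in (p, 1.0 - p):
            if q > 0.0:
                h -= q * math.log(q)
    return h

def select_pivot(pivots, posterior, response_likelihood):
    """Eq. 5.5 sketch: pick the pivot minimizing expected entropy.
    posterior((pivot, r)): relevance probabilities for all images if
        response r were added to the feedback set (hypothetical helper)
    response_likelihood(pivot, r): proxy estimate of P(R = r)."""
    responses = ("more", "less", "equally")
    return min(pivots, key=lambda piv: sum(
        response_likelihood(piv, r) * entropy(posterior((piv, r)))
        for r in responses))

# Toy example with two database images: a comparison to pivot "A"
# would sharpen the posterior, while pivot "B" would leave it uncertain.
posterior = lambda cand: [0.95, 0.05] if cand[0] == "A" else [0.5, 0.5]
uniform = lambda piv, r: 1.0 / 3
assert select_pivot(["A", "B"], posterior, uniform) == "A"
```

Because only the M current pivots are scored, the loop reflects the O(MN) cost discussed below, rather than scanning every (image, attribute) pair.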

At each iteration, we present the user with the pivot selected with Eq. 5.5 and request the specified attribute comparison. Using the resulting feedback, we first update \(\mathcal {F}\) with the user’s new image-attribute-response constraint. Then we either replace the pivot in \(\mathcal {P}\) for that attribute with its appropriate child pivot (i.e. the left or right child in the binary search tree if the response is “less” or “more”, respectively) or terminate the exploration of this tree (if the response is “equally”). The approach iterates until the user is satisfied with the top-ranked results, or until all of the attribute trees have bottomed out to an “equally” response from the user.

The cost of our selection method per round of feedback is O(MN), where M is the size of the attribute vocabulary, N is the database size, and \(M \ll N\). For each of O(M) pivots which can be used to complement the feedback set, we need to evaluate expected entropy for all N images. In contrast, a traditional information gain approach would scan all database items paired with all attributes, requiring \(O(MN^2)\) time. In comparison to other error reduction methods [2, 5, 14, 25, 39, 51], our method can exploit the structure of rankable visual properties for substantial computational savings.

5.3.4 Experimental Validation

We use the same data and experimental setup as in Sect. 5.2, but now we measure the percentile rank each method assigns to the target at each iteration. We compare our method Active attribute pivots against:
  • Attribute pivots, a version of our method that cycles through pivots in a round-robin fashion;

  • Active attribute exhaustive, which uses entropy to select questions like our method, but evaluates all possible \(M \times N\) candidate questions;

  • Top, which selects the image that has the current highest probability of relevance and pairs it with a random attribute;

  • Passive, which selects a random (image, attribute) pair;

  • Active binary feedback, which asks the user whether the exemplar is similar to the target, and chooses the image with decision value closest to 0, as in [63]; and

  • Passive binary feedback, which works as above, but randomly selects the images for feedback.

To thoroughly test the methods, we conduct experiments where we simulate the user’s responses, similar to Sect. 5.2. Figure 5.4 shows that our method finds the target image more efficiently than any of the baselines. Consistent with results in the previous section, our method significantly outperforms binary relevance feedback. Interestingly, we find that Passive binary feedback is stronger than its active counterpart, likely because images near the decision boundary were often negative, whereas the passive approach samples more diverse instances. Our method substantially improves over the Top approach, which shows that relative attribute feedback alone does not offer the most efficient search if uninformative feedback is given; and over Attribute pivots, which indicates that actively interleaving the trees allows us to focus on attributes that better distinguish the relevant images. It also outperforms Active attribute exhaustive, likely because the attribute trees serve as a form of regularization, helping our method focus on those comparisons that a priori may be most informative. The active exhaustive approach considers entropy reduction resulting from feedback on each possible database image in turn, and can be misled by outliers that seem to have high expected information gain. Furthermore, our method is orders of magnitude faster. On the Shoes, OSR and PubFig datasets, our method requires only 0.05, 0.01 and 0.01 s respectively to make its choice in a single iteration. In contrast, the exhaustive method requires 656.27, 28.20 and 3.42 s.

We present live experiments with real MTurk users in [23]. In those experiments, we achieve a 100–200 raw rank improvement on two datasets, and a negligible 0–10 raw rank loss on PubFig, compared to the strongest baseline, Top. This is very encouraging given the noise in MTurk responses and the difficulty of predicting all attributes reliably. Our information gain predictions on PubFig are imprecise since the facial attributes are difficult for both the system and people to compare reliably (e.g. it is hard to say who among two white people is whiter).

In [27], we show a comparison of the active pivots method presented in this section and the passive WhittleSearch method presented in the previous section. Overall, we find that the pivots method saves users more time, but also asks harder questions, which results in less confident responses from users and, in turn, can lead to erroneous search results. However, our pivots approach reduces the entropy of the system over the relevance of database images faster than the passive method from Sect. 5.2. The choice of which method to use for a given application can be made depending on how long it takes to browse a page of image results, as shown in [27].
Fig. 5.4

Comparison to existing interactive search methods (higher and steeper curves early on are better). Image reprinted with permission

Our system actively guides the search based on visual comparisons, helping a user navigate the image database via relative semantic properties. We experimentally demonstrate the utility of this approach. However, several possible improvements could further increase its utility and improve the user’s search experience. First, two measures of confidence can be incorporated into the active selection formulation: the confidence of attribute models, and the confidence of user responses. The first would ensure that our selection is not misled by noisy attribute predictions, while the second would allow down-weighting of user responses that may be erroneous. Further, we could allow the user to assign different weights to responses about different attributes, if some attributes are more important to the search task than others. In this way, information gain would be higher for attributes that have accurate models and are key to the user’s search goal.

Further, we could define a mixed-initiative framework for search where we are not forced to choose between the user having control over the feedback (as in Sect. 5.2) or the system having this control (as in this section), but can rather alternate between these two options, depending on whether the user or system can provide a more meaningful next feedback statement. For example, if the system’s estimate of what the user’s response should be is incorrect for three consecutive iterations, or if the best potential information gain is lower than some threshold, perhaps the system should relinquish control. On the other hand, if the user does not see any reference images that seem particularly useful for feedback, she should give up control.

5.4 Accounting for Differing User Perceptions of Attributes

In the previous sections, we described the power of relative attribute statements as a form of relevance feedback for search. However, no matter what potential power of feedback we offer a user, search efficiency will suffer if there is noise on the communication channel between the user and the system, i.e. if the user says “A” and the system understands “B”.

Researchers collecting attribute-labeled datasets report significant disagreement among human annotators over the “true” attribute labels [10, 13, 46]. The differences may stem from several factors: the words for attributes are imprecise (when is the cat “overweight” vs. “chubby”?), and their meanings often depend on context (the shoe appears “comfortable” for a wedding, but not for running) and even cultures (languages have differing numbers of color words, ranging from two to eleven). Further, they often stretch to refer to quite distinct object categories (e.g. “pointy” pencil vs. “pointy” shoes). For all such reasons, people inevitably craft their own definitions for visual attributes. Failing to account for user-specific notions of attributes will lead to discrepancies between the user’s precise intent and the message received by the system.

Existing methods learn only a single “mainstream” view of each attribute, forcing a consensus through majority voting. This is the case whether using binary [13, 15, 32] or relative [44] attributes. For binary properties, one takes the majority vote on the attribute present/absent label; for relative properties, on the more/less label. Note that using relative attributes does not resolve the ambiguity problem. The premise of relative attributes is that people may agree more readily on comparisons or strengths than on binary labels, but relative attributes too assume that there is some single, common interpretation of the property, and hence that a single ordering of images from least to most [attribute] is possible.

In this section, we propose to model attributes in a user-specific way, in order to capture the inherent differences in perception. The most straightforward approach for doing so is to learn one function per attribute and per user, from scratch, but this is not scalable. Instead, we pose user-specific attribute learning as an adaptation problem. We leverage any commonalities in perception to learn a generic prediction function, then use a small number of user-labeled examples to adapt that model into a user-specific prediction function. In technical terms, this amounts to imposing regularizers on the learning objective favoring user-specific model parameters that are similar to the generic ones, while still satisfying the user-specific label constraints. In this fashion, the system can learn the user’s perception with fewer labels than if it used a given user’s data alone.

Adaptation [17, 68] requires that the source and target tasks be related, such that it is meaningful to constrain the target parameters to be close to the source’s. In our setting the assumption naturally holds: an attribute is semantically meaningful to all annotators, just with (usually slight) perceptual variations among them.

5.4.1 Adapting Attributes

As before, we learn each attribute of interest separately (i.e. one classifier for “white”, another for “pointy”). An adapted function is user-specific, with one distinct function for each user. Let \(D^\prime \) denote the set of images labeled by majority vote that are used to learn the generic model. We assume the labeled examples originate from a pool of many annotators who collectively represent the “common denominator” in attribute perception. We train a generic attribute model \(f^\prime (\varvec{x}_i)\) from \(D^\prime \). Let D denote the set of user-labeled images, which is typically disjoint from \(D^\prime \). Our adaptive learning objective will take D and \(f^\prime \) as input, and produce an adapted attribute f as output. In this section, we describe how to adapt binary attributes; see [22] for an analogous formulation for adapting relative attributes.

The generic data \(D^\prime = \{\varvec{x}_i^\prime ,y_i^\prime \}_{i=1}^{N^\prime }\) consists of \(N^\prime \) labeled images, with \(y_i^\prime \in \{-1,+1\}\). Let \(f^\prime \) denote the generic binary attribute classifier trained with \(D^\prime \). For a linear support vector machine (SVM), we have \(f^\prime (\varvec{x}) = \varvec{x}^T \varvec{w}^\prime \). To adapt the parameters \(\varvec{w}^\prime \) to account for user-specific data \(D=\{\varvec{x}_i,y_i\}_{i=1}^{N}\), we use the Adaptive SVM [68] objective function:
$$\begin{aligned} \min _{\varvec{w}} \frac{1}{2} \Vert \varvec{w} - \varvec{w}^\prime \Vert ^2 + C \sum _{i=1}^N {\xi }_i, \\ \nonumber \text {subject to}\quad y_i \varvec{x}^T_i \varvec{w} \ge 1 - \xi _i, \forall i, \xi _i \ge 0, \end{aligned}$$
where \(\varvec{w}\) denotes the desired user-specific hyperplane, and C is a constant controlling the tradeoff between misclassification on the user-specific training examples and the regularizer. Note that the objective expands the usual large-margin regularizer \(\Vert \varvec{w}\Vert ^2\) to additionally prefer that \(\varvec{w}\) be similar to \(\varvec{w}^\prime \). In this way, the generic attribute serves as a prior for the user-specific attribute, such that even with small amounts of user-labeled data we can learn an accurate predictor.

The optimal \(\varvec{w}\) is found by solving a quadratic program to maximize the Lagrange dual objective function. This yields the Adaptive SVM decision function: \(f(\varvec{x}) = f^\prime (\varvec{x}) + \sum _{i=1}^N \alpha _i y_i \varvec{x}^T \varvec{x}_i,\) where \(\varvec{\alpha }\) denotes the Lagrange multipliers that define \(\varvec{w}\). Hence, the adapted attribute prediction is a combination of the generic model’s prediction and similarities between the novel input \(\varvec{x}\) and (selected) user-specific instances \(\varvec{x}_i\). Intuitively, a larger weight on a user-specific support vector \(\varvec{x}_i\) is more likely when the generic model \(f^\prime \) mispredicts \(\varvec{x}_i\). Thus, user-specific instances that deviate from the generic model will have more impact on f. For example, suppose a user mostly agrees with the generic notion of “formal” shoes, but, unlike the average annotator, is also inclined to call loafers “formal”. Then the adapted classifier will likely exploit some user-labeled loafer image(s) with nonzero \(\alpha _i\) when predicting whether a shoe would be perceived as formal by that user.
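For illustration, a rough sketch of the adaptation step on toy data follows; unlike the chapter, which solves the dual quadratic program, this sketch minimizes the same primal objective by subgradient descent, and the data, features, and learning-rate settings are invented:

```python
def train_adaptive_svm(data, w_prior, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5 * ||w - w'||^2 + C * sum_i hinge(y_i, x_i . w)
    by subgradient descent (a sketch; the chapter solves the dual QP).
    data: list of (x, y) pairs with y in {-1, +1}
    w_prior: generic hyperplane w' acting as the prior."""
    w = list(w_prior)
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            for j in range(len(w)):
                grad = w[j] - w_prior[j]   # regularizer pulls w toward w'
                if margin < 1:             # hinge-loss subgradient
                    grad -= C * y * x[j]
                w[j] -= lr * grad
    return w

# A user who, unlike the generic model, also calls loafers "formal":
w_generic = [1.0, 0.0]           # generic "formal" uses feature 0 only
user_data = [([0.0, 1.0], 1),    # loafer-like example, labeled formal
             ([-1.0, 0.0], -1)]  # clearly informal example
w_user = train_adaptive_svm(user_data, w_generic)
```

After training, the adapted hyperplane keeps the generic weight on the first feature (where the user agrees with the crowd) and acquires a positive weight on the loafer-like second feature, mirroring the intuition in the text.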

The adaptation strategy promotes efficiency in two ways. First, the human labeling cost is low, since the effort of the extensive label collection required to train the generic models is distributed among many users. Meanwhile, each user only needs to provide a small amount of labeled data. In experiments, we see substantial gains with as few as 12 user-labeled examples. Second, training time is substantially lower than training each user model from scratch by pooling the generic and user-specific data. The cost of training the “big” generic SVM is amortized across all user-specific functions. The efficiency is especially valuable for personalized search.

We obtain the user-specific labeled data D in two ways: by explicitly asking annotators to label informative images (either an uncertain or diverse pool), and by implicitly mining for such data in a user’s history. See [22] for details.

We use the adapted attributes to personalize image search results. Compared to using generic attributes, the personalized results should more closely align with the user’s perception, leading to more precise retrieval of relevant images. For binary attributes, we use the user-specific classifiers to retrieve images that match a multi-attribute query, e.g. “I want images with attributes X, Y, and not Z”. For relative attributes, we use the adapted rankers to retrieve images that agree with comparative relevance feedback, similar to Sects. 5.2 and 5.3. In both cases, the system sorts the database images according to how confidently the adapted attribute predictions agree with the attribute constraints mentioned in the query or feedback. Note that one can directly incorporate our adapted attributes into any existing attribute-search method [26, 30, 55, 60].
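For the binary multi-attribute case, the sorting step can be sketched as follows; summing signed decision values is a simple stand-in for the confidence-based ranking described above, and all attribute names and scores are invented:

```python
def query_score(classifier_scores, query):
    """Confidence that an image matches a multi-attribute query.
    classifier_scores: dict attribute -> signed decision value f(x)
    query: list of (attribute, wanted) pairs, wanted in {True, False}."""
    score = 0.0
    for attr, wanted in query:
        s = classifier_scores[attr]
        score += s if wanted else -s   # negated attributes count against
    return score

# "pointy and formal, not sporty"
query = [("pointy", True), ("formal", True), ("sporty", False)]
imgs = {"heels":   {"pointy": 1.2,  "formal": 0.8,  "sporty": -1.0},
        "sneaker": {"pointy": -0.9, "formal": -1.1, "sporty": 1.5}}
ranked = sorted(imgs, key=lambda i: query_score(imgs[i], query),
                reverse=True)
assert ranked[0] == "heels"
```

With adapted classifiers plugged in for `classifier_scores`, the same sort personalizes the results to the user's perception.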

5.4.2 Experimental Validation

We conduct experiments with 75 unique users on two large datasets: the Shoes dataset and 12 attributes from the SUN Attributes dataset [46], which contains 14,340 scenes. To form descriptors \(\varvec{x}\) for Shoes, we use the GIST and color histograms as before. For SUN, we concatenate the features provided by [46]: GIST, color, HOG, and self-similarity. We cross-validate C for all models, per attribute and user. We compare our User-adaptive approach to three methods:
  • Generic, which learns a model from the generic majority vote data \(D^\prime \) only;

  • Generic+, which adds more generic data to \(D^\prime \) (one generic label for each user-specific label our method uses); and

  • User-exclusive, which uses the same user-specific data as our method, but learns a user-specific model from scratch, without the generic model.

We evaluate generalization accuracy: will adapted attributes better agree with a user’s perception in novel images? To form a generic model for each dataset, we use 100–200 images (or pairs, in the case of relative Shoes attributes) labeled by majority vote. We collect user-specific labels on 60 images/pairs, from each of 10 (Shoes) or 5 (SUN) workers on MTurk. We reserve 10 random user-labeled images per user as a test set in each run. We measure accuracy across 300 random splits.
Fig. 5.5

Attribute prediction accuracy per attribute and per user, as more training data is added. Image reprinted with permission

In Fig. 5.5, we show representative results for individual attributes and individual users. We plot test accuracy as a function of the amount of additional training data beyond the generic pool \(D^\prime \). Generic remains flat, as it gets no additional data. For binary attributes, chance is 50 %; for relative it is 33 %, since there are three possible responses (“more”, “less”, “equally”). Overall, our method more accurately predicts the labels on the held-out user-specific images than any of the baselines. The advantage of adapted attributes over the generic model supports our main claim: we need to account for users’ individual perception when learning attributes. Further, the advantage over the user-exclusive model shows that our approach successfully leverages “universal” perception as a prior; learning from scratch is inferior, particularly if very few user-specific labels are available (see the leftmost point of all plots). With more user-specific labels, the non-adaptive approach can sometimes catch up (see “sporty” in column (a)), but at the expense of a much higher burden on each user. Finally, the Generic+ baseline confirms that our method’s advantage is not simply a matter of having more data available. Generic+ usually gives Generic a bump, but much less than User-adaptive. For example, on “bright in color”, our method improves accuracy by up to 26 %, whereas Generic+ only gains 14 %.

We do see some failure cases, though, as shown in columns (e) and (f). The failures are by definition rather hard to analyze: by focusing on user-specific perception, we lose the ability to filter noisy label responses (e.g. with voting). So, when a user-adapted model misclassifies, we cannot rule out the possibility that the worker herself was inconsistent with her personal perception of the attribute in that test case. Nonetheless, we do see a trend in the failure cases: weaker User-exclusive classifiers. As a result, our model can start to underperform Generic, pulled down by (possibly inconsistent) user responses, as seen in a number of cases where User-exclusive remains close to chance. Another reason for failure (with respect to the user-exclusive model) is user responses that were the opposite of generic responses, in which case the generic prior can cause negative transfer for our method (see “high at the heel” in column (e)). Note that the success of adaptation depends not just on the attribute being learned, but also on individual users, e.g. “high at the heel” in columns (d, e) and “open area” in columns (a, e). One could devise a method that automatically determines when the generic model should be used as a prior.

We find that user-adapted attributes are often strongest when test cases are hardest. See [22] for details. We also show that correctly capturing attribute perception is important for accurate search. Search is a key application where adapted attributes can alleviate inconsistencies between what the user says, and what the (traditionally majority-vote-trained) machine understands. The generalization power of the adapted attributes translates into the search setting: our method is substantially better at finding the images relevant to the user. This result demonstrates how our idea can benefit a number of prior binary attribute search systems [30, 55, 60] and our relative attribute relevance feedback search.

5.5 Discovering Attribute Shades of Meaning

So far, we have discussed generic attribute models, which assume that all users perceive the attribute in the same way, and user-specific models, which assume that each user’s perception is unique. However, while users differ in how they perceive and use attributes, it is likely that there are some commonalities or groupings among them in terms of how they interpret and utilize the attribute vocabulary. We find evidence for this in work on linguistic relativity [12], which examines how culture influences the way we describe objects, shape properties of animals, colors, etc. For example, Russian has two words for what would be shades of “blue” in English, while other languages do not strongly distinguish “blue” and “green”. In other words, if asked whether an object in some image is “blue” or not, people of different countries might be grouped around different answers. We refer to such groupings of users as “schools of thought”.

We can use the groupings of users to discover the “shades of meaning” of an attribute, since users in the same “school” likely subscribe to the same interpretation of the attribute. An attribute “shade” is a visual interpretation of an attribute name that one or more people apply when judging whether that attribute is present in an image. For example, for the attribute “open” in Fig. 5.6, we might discover that some users have peep-toed shoes in mind when they say “open”, while others have flip-flops in mind when they use the same word. Note that for many attributes, such ambiguities in language use cannot be resolved by adjusting the attribute definitions, since people use the same definition differently.

In order to discover schools, we first collect a set of sparse annotations from a large pool of users. We then perform matrix factorization over these labels, and obtain a description of each user that captures the underlying latent factors contributing to the user’s annotations. We cluster users in this latent factor space, and each cluster becomes a “school.”

After we discover the schools of users, we personalize each attribute model towards these schools, rather than towards individual users. Focusing on the commonalities between users allows the system to learn the important biases that users have in interpreting the attribute, as opposed to minor differences in labeling which may stem from factors other than a truly different interpretation.
Fig. 5.6

Our attribute shade discovery method uses the crowd to discover factors responsible for an attribute’s presence, then learns predictive models based on these visual cues. For example, for the attribute open, the method will discover shades of meaning, e.g. peep-toed (open at toe) versus slip-on (open at heel) versus sandal-like (open at toe and heel), which are three visual definitions of openness. Since these shades are not coherent in terms of their global descriptors, they would be difficult to discover using traditional image clustering

5.5.1 Collecting Personal Labels and Label Explanations

We build a Mechanical Turk interface to gather the labels. We use 12 attributes from the Shoes and SUN Attributes datasets that can be defined concisely in language, yet may vary in their visual instantiations. We sample 250–1000 images per attribute. Workers are shown definitions of the attributes from a web dictionary, but no example images. Then, given an image, the worker must provide a binary label, i.e. she must state whether the image does or does not possess a specified attribute. Additionally, for a random set of 5 images, the worker must explain her label in free-form text, and state which image most has the attribute and why. These questions both slow the worker down, helping quality control, and also provide valuable ground truth data for evaluation. To help ensure self-consistency in the labels, we exclude workers who fail to consistently answer 3 repeated questions sprinkled among the 50. This yields annotations from 195 workers per attribute on average.

5.5.2 Discovering Schools and Training Per-School Adapted Models

We use the label data to discover latent factors, which are needed to recover the shades of meaning, separately for each attribute. We retain each worker’s ID, the indices of images she labeled, and how she labeled them. Let M denote the number of unique annotators and N the number of images seen by at least one annotator. Let \(\mathbf {L}\) be the \(M \times N\) label matrix, where \(L_{ij} \in \{0,1,?\}\) is a binary attribute label for image j by annotator i. A ? denotes an unlabeled example (on average only 20 % of the possible image-worker pairs are labeled).

We suppose there is a small number D of unobserved factors that influence the annotators’ labels. This reflects that their decisions are driven by some mid-level visual cues. For example, when deciding whether a shoe looks “ornate”, the latent factors might include presence of buckles, amount of patterned textures, material type, color, and heel height. Assuming a linear factor model, the label matrix \(\mathbf {L}\) can be factored as the product of an \(M \times D\) annotator latent factor matrix \(\mathbf {A}^T\) and a \(D \times N\) image latent factor matrix \(\mathbf {I}\): \(\mathbf {L} = \mathbf {A}^T \mathbf {I}\). We use the probabilistic matrix factorization algorithm (PMF) [53] to factor this partially observed matrix, by finding the best rank-D approximation. We fix \(D=50\), then use the default parameter settings.
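A simplified stand-in for this factorization step is sketched below; it fits the rank-D model by stochastic gradient descent on the observed entries only, capturing the spirit of PMF [53] without its probabilistic priors, and the toy label matrix is invented:

```python
import random

def factorize(labels, D=2, lr=0.1, reg=0.02, epochs=800, seed=0):
    """Rank-D factorization L ~ A^T I of a partially observed label
    matrix, fit by SGD on observed entries only (a simplified stand-in
    for PMF). labels: dict (annotator, image) -> 0/1 label."""
    rng = random.Random(seed)
    A = {a: [rng.gauss(0, 0.1) for _ in range(D)]
         for a in {a for a, _ in labels}}
    I = {j: [rng.gauss(0, 0.1) for _ in range(D)]
         for j in {j for _, j in labels}}
    for _ in range(epochs):
        for (a, j), y in labels.items():
            err = y - sum(u * v for u, v in zip(A[a], I[j]))
            for d in range(D):
                # simultaneous update of annotator and image factors
                A[a][d], I[j][d] = (
                    A[a][d] + lr * (err * I[j][d] - reg * A[a][d]),
                    I[j][d] + lr * (err * A[a][d] - reg * I[j][d]))
    return A, I

# Four annotators, four images; annotators 0,1 vs. 2,3 form two
# "schools" that disagree on images 2 and 3 (75 % of entries observed)
labels = {(0, 0): 1, (0, 1): 1, (0, 2): 1,
          (1, 1): 1, (1, 2): 1, (1, 3): 1,
          (2, 0): 1, (2, 2): 0, (2, 3): 0,
          (3, 0): 1, (3, 1): 1, (3, 3): 0}
A, I = factorize(labels)
preds = {k: sum(u * v for u, v in zip(A[k[0]], I[k[1]])) for k in labels}
```

The recovered annotator vectors \(A_i\) are exactly the latent features that the next step clusters into schools.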

We represent each annotator i in terms of her association with each discovered factor, i.e. the “latent feature vector” for annotator i is \(A_i \in \mathfrak {R}^D\), the i-th column of \(\mathbf {A}\). It represents how much each of the D factors influences that annotator when she decides if the named attribute is present. We pose shade discovery as a grouping problem in the space of these latent features. We apply K-means to the columns of \(\mathbf {A}\) to obtain clusters \(\{\mathcal {S}_1,\dots ,\mathcal {S}_K\}\). We set K automatically per attribute based on the optimal silhouette coefficient within \(K=\{2,\dots ,15\}\). By clustering in the low-dimensional latent space, the method identifies the “schools of thought” underlying the discrete set of labels the annotators provided.
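The grouping step can then be sketched with a plain K-means over the annotator latent vectors; here K is fixed and the latent vectors are invented, whereas the chapter selects K automatically via the silhouette coefficient:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means on annotator latent vectors A_i (the chapter
    chooses K by the optimal silhouette coefficient; fixed here)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # assign each annotator to the nearest cluster center
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
        # recompute each center as the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign

# Toy annotator latent vectors: two "schools of thought" for "open"
A_latent = [(1.0, 0.1), (0.9, 0.0), (1.1, 0.2),     # e.g. peep-toe school
            (-1.0, 0.9), (-0.8, 1.1), (-1.1, 1.0)]  # e.g. flip-flop school
schools = kmeans(A_latent, k=2)
assert len(set(schools[:3])) == 1 and len(set(schools[3:])) == 1
assert schools[0] != schools[3]
```

Each resulting cluster is one "school", whose positively labeled images then feed the per-school adapted classifier described next.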

Finally, we use the positive exemplars in each school to train a predictive model, which can then detect when the particular attribute shade is present in novel images. We train school-specific classifiers that adapt the consensus model. Each school \(\mathcal {S}_k\) is represented by the total pool of images that its annotators labeled as positive. Several annotators in the cluster may have labeled the same image, and their labels need not agree. Thus, we perform majority vote (over just the annotators in \(\mathcal {S}_k\)) to decide whether an image is positive or negative for the shade. We use the images to train a discriminative classifier, using the adaptive SVM objective of Yang et al. [68] to regularize its parameters to be similar to those of the consensus model, as in Sect. 5.4. In other words, we are now personalizing to schools of users, as opposed to individual users. When we need to predict how a user will judge the presence/absence of an attribute, e.g. during image search, we apply the adapted shade model for the school to which the user belongs. Compared to user-adaptive models, each shade model typically leverages more training data than a single user provides. This lets us effectively “borrow” labeled instances from the user’s neighbors in the crowd. Further, the within-school majority vote can be seen as a form of quality control, where we assume consistency within the group. This helps reduce noise in an individual user’s labeling.

The images within a shade can be visually diverse from the point of view of typical global image descriptors, since annotators attuned to that shade’s latent factors could have focused on arbitrarily small parts of the images, or arbitrary subsets of feature modalities (e.g. color, shape, texture). For example, one shade for “open” might focus on shoe toes, while another focuses on shoe heels. Similarly, one shade for “formal” capturing the notion that dark-colored shoes are formal would rely on color alone, while another capturing the notion that shoes with excessively high heels are not formal would rely on shape alone. An approach that attempts to discover shades based on image clustering, as well as non-semantic attribute discovery approaches [8, 38, 43, 49, 59, 69], would be susceptible to the more obvious splits in the feature space which need not directly support the semantic attribute of interest, and would not be able to group images according to these perceived, possibly subtle, cues. Furthermore, discovery methods would be biased by the choice of features, e.g. the set of salient splits in color histogram space would be quite different than those discovered in a dense SIFT feature space. In contrast, our method partitions the images semantically, so even though the training images may be visually diverse, standard discriminative learning methods let us isolate the informative features.

Note that it would be challenging to manually enumerate the attribute shades with words. For example, when asked to explain why an image is “ornamented”, an annotator might comment on the “buckle” or “bow”; yet the latent shade of “ornamented” underlying many users’ labels is more abstract and encompasses combinations of such concrete mid-level cues. Our method uses the structure in the labels to automatically discover these shades.

5.5.3 Experimental Validation

We demonstrate shades’ utility for improving attribute prediction. We compare to the methods from Sect. 5.4, as well as two alternative shade-formation baselines: Attribute discovery, which clusters images in the attribute space produced by a state-of-the-art non-semantic attribute discovery method [49], and Image clusters, an image clustering approach inspired by [36]. We run 30 trials, sampling 20 % of the available labels to obtain on average 10 labels per user.
Table 5.1  Accuracy of predicting perceived attributes, with standard error in parentheses

Attribute  | Shades (ours) | Generic    | User-exclusive | User-adaptive | Attr disc  | Img clust
           | 76.3 (0.3)    | 74.0 (0.4) | 67.8 (0.2)     | 74.8 (0.3)    | 74.5 (0.4) | 74.3 (0.4)
           | 74.6 (0.4)    | 66.5 (0.5) | 65.8 (0.2)     | 71.6 (0.3)    | 68.5 (0.4) | 68.3 (0.4)
           | 62.8 (0.7)    | 56.4 (1.1) | 59.6 (0.5)     | 61.1 (0.6)    | 58.3 (0.8) | 58.6 (0.7)
           | 77.3 (0.6)    | 75.0 (0.7) | 68.7 (0.5)     | 75.5 (0.6)    | 76.0 (0.7) | 75.4 (0.6)
           | 78.8 (0.5)    | 76.2 (0.7) | 69.6 (0.4)     | 77.1 (0.4)    | 77.4 (0.6) | 77.0 (0.6)
           | 70.9 (1.0)    | 69.5 (1.2) | 61.9 (0.5)     | 68.5 (0.9)    | 69.3 (1.2) | 69.8 (1.2)
           | 62.2 (0.9)    | 58.5 (1.4) | 60.5 (1.3)     | 62.0 (1.4)    | 61.2 (1.4) | 61.5 (1.1)
           | 64.5 (0.3)    | 60.5 (0.5) | 58.8 (0.2)     | 63.1 (0.4)    | 60.4 (0.7) | 60.8 (0.7)
           | 62.5 (0.4)    | 61.0 (0.5) | 55.2 (0.2)     | 61.5 (0.4)    | 61.1 (0.4) | 61.0 (0.5)
Open area  | 64.6 (0.6)    | 62.9 (1.0) | 57.9 (0.4)     | 63.5 (0.5)    | 63.5 (0.8) | 62.8 (0.9)
           | 57.3 (0.8)    | 51.2 (0.9) | 56.2 (0.7)     | 56.2 (1.1)    | 52.5 (0.9) | 52.0 (1.1)
           | 67.4 (0.6)    | 66.7 (0.5) | 63.4 (0.5)     | 67.0 (0.5)    | 67.2 (0.5) | 67.2 (0.5)
Table 5.1 shows the results. Our shade discovery is more reliable than Generic, which is the status quo attribute learning approach. For “open”, we achieve an 8-point gain over Generic and User-exclusive, which indicates both how different user perceptions of this attribute are, as well as how useful it is to rely on schools rather than individual users. Shades also outperform our User-adaptive approach. While that method learns personalized models, shades leverage common perceptions and thereby avoid overfitting to a user’s few labeled instances. Finally, neither alternative shade formation method is competitive with our approach. These results demonstrate that for all attributes evaluated, mapping a person’s use of an attribute to a shade allows us to predict attribute presence more accurately. This is achieved at no additional expense for the user.

Figure 5.7 visualizes two shades each, for four of the attributes (see [24] for more). The images are those most frequently labeled as positive by annotators in a shade \(\mathcal {S}_k\). The (stemmed) words are those that appear most frequently in the annotator explanations for that shade, after we remove words that overlap between the two shades. We see the shades capture nuanced visual sub-definitions of the attribute words. For example, for the attribute “ornate,” one shade focuses on straps/buckles (top shade), while another focuses on texture/print/patterns (bottom shade). For “open,” one shade includes open-heeled shoes, while another includes sandals which are open at the front and back. In SUN, the “open area” attribute can be either outside (top) or inside (bottom). For “soothing,” one shade emphasizes scenes conducive to relaxing activities, while another focuses on the aesthetics of the scene.
Fig. 5.7

Top words and images for two shades per attribute (top and bottom for each attribute)

See [24] for results that demonstrate the advantage of using shades for attribute-based search and for an analysis of the purity of the discovered shades. These results show the importance of our shade discovery approach for interactive search: for a user to reliably find “formal” shoes, the system must correctly estimate “formal” in the database images. If the wrong attribute shade is predicted, the wrong image is retrieved. In general, detecting shades is key whenever linguistic attributes are required, which includes applications beyond image search as well (e.g. zero-shot recognition).

In our experiments, we assume that the pool of annotators is fixed, so we can map annotators to schools or shades during the matrix factorization procedure. However, new users could join after that procedure has taken place, so how can we map such new users to a shade? Of course, a user must provide at least some attribute labels to benefit from the shade models, since we need to know which shade to apply. One approach is to add the user to the user-image label matrix \(\mathbf {L}\) and re-factor. Alternatively, we can use the more efficient folding-in heuristic [19]. We can appropriately copy the user’s image labels into a \(1 \times N\) vector \(\mathbf {u}\), where we fill in missing label values by the most common response (0 or 1) for that image from already known users, similarly to an idea used by [44]. We can then compute the product of \(\mathbf {u}\) and the image latent factor matrix \(\mathbf {I}\), resulting in a representation of this new user in the latent factor space. After finding this representation, we use the existing set of cluster centers, and find the closest cluster center for the new user. We can then perform the personalization approach for this user as before, and thus any new user can also receive the benefit from our school discovery. We leave as future work the task of testing our system with late-comer new users.
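The fold-in step described above can be sketched as follows; the label-filling convention, the variable names, and the tiny example are illustrative assumptions layered on the heuristic, not the chapter's exact implementation.

```python
import numpy as np

def fold_in_user(own_labels, fill_labels, I, centers):
    """Map a new user to an existing school via the fold-in heuristic.
    own_labels : dict image_index -> 0/1 labels the new user provided
    fill_labels: length-N vector of the most common label from known
                 users, used for images the new user has not labeled
    I          : N x K image latent factor matrix
    centers    : C x K school cluster centers"""
    u = np.asarray(fill_labels, float).copy()
    for idx, lab in own_labels.items():
        u[idx] = lab                       # overwrite with the user's own labels
    z = u @ I                              # project into the latent factor space
    school = int(np.linalg.norm(centers - z, axis=1).argmin())
    return school, z

# Hypothetical example with N=4 images and K=2 latent factors
I = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
centers = np.array([[2.0, 0.0], [0.0, 2.0]])
school, z = fold_in_user({0: 1, 1: 1}, np.zeros(4), I, centers)
```

Because the product with the image factor matrix is all that is needed, assigning a late-coming user costs far less than re-running the full factorization.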

5.6 Discussion and Conclusion

In this chapter, we proposed an effective new form of feedback for image search using relative attributes. In contrast to traditional binary relevance feedback, which restricts the user’s input to labeling images as “relevant” or “not relevant”, our approach allows the user to precisely indicate how the results compare with her mental model. Next, we studied how to select the reference images used for feedback so that the provided feedback is as informative to the retrieval system as possible. Today’s visual search systems place the burden on the user to initiate useful feedback by labeling images as relevant, and often prioritize showing the user pleasing results over striving to obtain useful feedback. In contrast, we guide the user through a coarse-to-fine search via visual comparisons, and demonstrate that this enables accurate results to be retrieved faster. Further, we showed how to bridge the human and machine perception of attributes by accounting for the variability in user attribute perceptions. While existing work assumes that users agree on the attribute values of images and thus builds a single monolithic model per attribute, we develop personalized attribute models. Our results on two compelling datasets indicate that (1) people do indeed have varying shades of attribute meaning, (2) transferring generic models makes learning those shades more cost-effective than learning from scratch, and (3) accounting for the differences in user perception is essential in image search applications. Finally, we showed how to discover people’s shared biases in perception, then exploit them with visual classifiers that generalize to new images. The discovered shades of attribute meaning allow us to tailor attribute predictions to the user’s “school of thought,” boosting the accuracy of attribute detection.

While attributes are an excellent channel for interactive image retrieval, several issues remain to be solved in order to unleash attributes’ full power for practical applications. First, the accuracy of attribute-based search is still far from what real users would find acceptable. For example, in [23], after 5 iterations of feedback, the viewer still has to browse between 9 and 14 % of the full dataset in order to find the exact image she is looking for. (In contrast, in our simulated experiment using perfect attribute models with added noise, only between 2 and 5 % of the dataset needs to be browsed.) To address this problem, we need to develop more accurate attribute models. Deep learning methods might enable us to make better use of existing annotations, but an orthogonal solution is to learn richer annotations, by involving humans more directly in training models that truly understand what these attributes mean.7 We also need ways to visualize attributes, similar to visualizing object detection models [65], to ensure that a model aligns with the meaning a human ascribes to the attribute, rather than with a property merely correlated with it.

A second problem with existing attribute-based work is that users are confined to using a small vocabulary of attributes to describe the world. We need to enable users to define new attributes on the fly during search, and propose techniques for efficiently learning models for these newly defined attributes. One approach for the latter is to utilize existing models for related attributes as a prior for learning new attribute models.


  1. To derive an attribute vocabulary, one could use [43], which automatically generates splits in visual space and learns from human annotations whether these splits can be described with an attribute; [46], which shows pairs of images to users on Amazon’s Mechanical Turk platform and aggregates terms describing what one image has that the other does not; or [1, 41], which mine text to discover attributes for which reliable computer models can be learned.

  2. The annotations are available at

  3. In Sect. 5.3, we extend this approach to also allow “equally” responses.

  4. As another point of comparison against existing methods, a multi-attribute query baseline that ranks images by how many binary attributes they share with the target image achieves NDCG scores that are 40 % weaker on average than our method when using 40 feedback constraints.

  5. The exhaustive baseline was too expensive to run on all 14K Shoes images. On a 1000-image subset, it performs similarly to how it does on the other datasets.

  6. Below we use the terms “school” and “shade” interchangeably.

  7. Note that non-semantic attributes [49, 56, 69] are not readily applicable to applications that require human-machine communication, as they do not have human-interpretable names.



This research was supported by ONR YIP grant N00014-12-1-0754 and ONR ATL grant N00014-11-1-0105. We would like to thank Devi Parikh for her collaboration on WhittleSearch and feedback on our other work, as well as Ray Mooney for his suggestions for future work.


  1. Berg, T.L., Berg, A.C., Shih, J.: Automatic attribute discovery and characterization from noisy web data. In: European Conference on Computer Vision (ECCV) (2010)
  2. Branson, S., Wah, C., Schroff, F., Babenko, B., Welinder, P., Perona, P., Belongie, S.: Visual recognition with humans in the loop. In: European Conference on Computer Vision (ECCV) (2010)
  3. Chen, L., Zhang, Q., Li, B.: Predicting multiple attributes via relative multi-task learning. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
  4. Chen, Q., Huang, J., Feris, R., Brown, L.M., Dong, J., Yan, S.: Deep domain adaptation for describing people based on fine-grained clothing attributes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  5. Cox, I., Miller, M., Minka, T., Papathomas, T., Yianilos, P.: The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments. IEEE Trans. Image Process. 9(1), 20–37 (2000)
  6. Curran, W., Moore, T., Kulesza, T., Wong, W.K., Todorovic, S., Stumpf, S., White, R., Burnett, M.: Towards recognizing “cool”: can end users help computer vision recognize subjective attributes or objects in images? In: Intelligent User Interfaces (IUI) (2012)
  7. Douze, M., Ramisa, A., Schmid, C.: Combining attributes and Fisher vectors for efficient image retrieval. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
  8. Duan, K., Parikh, D., Crandall, D., Grauman, K.: Discovering localized attributes for fine-grained recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
  9. Eitz, M., Hildebrand, K., Boubekeur, T., Alexa, M.: Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE Trans. Vis. Comput. Graph. 17(11), 1624–1636 (2011)
  10. Endres, I., Farhadi, A., Hoiem, D., Forsyth, D.A.: The benefits and challenges of collecting richer object annotations. In: Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2010)
  11. Escorcia, V., Niebles, J.C., Ghanem, B.: On the relationship between visual attributes and convolutional networks. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  12. Everett, C.: Linguistic Relativity: Evidence Across Languages and Cognitive Domains. Mouton De Gruyter (2013)
  13. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.A.: Describing objects by their attributes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
  14. Ferecatu, M., Geman, D.: Interactive search for image categories by mental matching. In: International Conference on Computer Vision (ICCV) (2007)
  15. Ferrari, V., Zisserman, A.: Learning visual attributes. In: Conference on Neural Information Processing Systems (NIPS) (2007)
  16. Fogarty, J., Tan, D.S., Kapoor, A., Winder, S.: CueFlik: interactive concept learning in image search. In: Conference on Human Factors in Computing Systems (CHI) (2008)
  17. Geng, B., Yang, L., Xu, C., Hua, X.S.: Ranking model adaptation for domain-specific search. IEEE Trans. Knowl. Data Eng. 24(4), 745–758 (2012)
  18. Heim, E., Berger, M., Seversky, L., Hauskrecht, M.: Active perceptual similarity modeling with auxiliary information. arXiv preprint arXiv:1511.02254 (2015)
  19. Hofmann, T.: Probabilistic latent semantic analysis. In: Uncertainty in Artificial Intelligence (UAI) (1999)
  20. Joachims, T.: Optimizing search engines using clickthrough data. In: International Conference on Knowledge Discovery and Data Mining (KDD) (2002)
  21. Kekalainen, J., Jarvelin, K.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002)
  22. Kovashka, A., Grauman, K.: Attribute adaptation for personalized image search. In: International Conference on Computer Vision (ICCV) (2013)
  23. Kovashka, A., Grauman, K.: Attribute pivots for guiding relevance feedback in image search. In: International Conference on Computer Vision (ICCV) (2013)
  24. Kovashka, A., Grauman, K.: Discovering attribute shades of meaning with the crowd. Int. J. Comput. Vis. 114, 56–73 (2015)
  25. Kovashka, A., Vijayanarasimhan, S., Grauman, K.: Actively selecting annotations among objects and attributes. In: International Conference on Computer Vision (ICCV) (2011)
  26. Kovashka, A., Parikh, D., Grauman, K.: WhittleSearch: image search with relative attribute feedback. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
  27. Kovashka, A., Parikh, D., Grauman, K.: WhittleSearch: interactive image search with relative attribute feedback. Int. J. Comput. Vis. 115, 185–210 (2015)
  28. Kumar, N., Belhumeur, P.N., Nayar, S.K.: FaceTracer: a search engine for large collections of images with faces. In: European Conference on Computer Vision (ECCV) (2008)
  29. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: International Conference on Computer Vision (ICCV) (2009)
  30. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Describable visual attributes for face verification and image search. IEEE Trans. Pattern Anal. Mach. Intell. 33(10), 1962–1977 (2011)
  31. Kurita, T., Kato, T.: Learning of personal visual impression for image database systems. In: International Conference on Document Analysis and Recognition (ICDAR) (1993)
  32. Lampert, C., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
  33. Li, B., Chang, E., Li, C.S.: Learning image query concepts via intelligent sampling. In: International Conference on Multimedia and Expo (ICME) (2001)
  34. Li, S., Shan, S., Chen, X.: Relative forest for attribute prediction. In: Asian Conference on Computer Vision (ACCV) (2013)
  35. Liu, S., Kovashka, A.: Adapting attributes using features similar across domains. In: Winter Conference on Applications of Computer Vision (WACV) (2016)
  36. Loeff, N., Alm, C.O., Forsyth, D.A.: Discriminating image senses by clustering with multimodal features. In: Association for Computational Linguistics (ACL) (2006)
  37. Ma, W.Y., Manjunath, B.S.: NeTra: a toolbox for navigating large image databases. Multimedia Syst. 7(3), 184–198 (1999)
  38. Mahajan, D., Sellamanickam, S., Nair, V.: A joint learning framework for attribute models and object descriptions. In: International Conference on Computer Vision (ICCV) (2011)
  39. Mensink, T., Verbeek, J., Csurka, G.: Learning structured prediction models for interactive image labeling. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
  40. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175 (2001)
  41. Ordonez, V., Jagadeesh, V., Di, W., Bhardwaj, A., Piramuthu, R.: Furniture-geek: understanding fine-grained furniture attributes from freely associated text and tags. In: Winter Conference on Applications of Computer Vision (WACV) (2014)
  42. Ozeki, M., Okatani, T.: Understanding convolutional neural networks in terms of category-level attributes. In: Asian Conference on Computer Vision (ACCV) (2014)
  43. Parikh, D., Grauman, K.: Interactively building a discriminative vocabulary of nameable attributes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
  44. Parikh, D., Grauman, K.: Relative attributes. In: International Conference on Computer Vision (ICCV) (2011)
  45. Parikh, D., Grauman, K.: Implied feedback: learning nuances of user behavior in image search. In: International Conference on Computer Vision (ICCV) (2013)
  46. Patterson, G., Hays, J.: SUN attribute database: discovering, annotating, and recognizing scene attributes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
  47. Platt, J.C.: Probabilistic output for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers (1999)
  48. Rasiwasia, N., Moreno, P.J., Vasconcelos, N.: Bridging the gap: query by semantic example. IEEE Trans. Multimedia 9(5), 923–938 (2007)
  49. Rastegari, M., Farhadi, A., Forsyth, D.A.: Attribute discovery via predictable discriminative binary codes. In: European Conference on Computer Vision (ECCV) (2012)
  50. Rastegari, M., Parikh, D., Diba, A., Farhadi, A.: Multi-attribute queries: to merge or not to merge? In: Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
  51. Roy, N., McCallum, A.: Toward optimal active learning through sampling estimation of error reduction. In: International Conference on Machine Learning (ICML) (2001)
  52. Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S.: Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Trans. Circ. Syst. Video Technol. (1998)
  53. Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: International Conference on Machine Learning (ICML) (2008)
  54. Sandeep, R.N., Verma, Y., Jawahar, C.: Relative parts: distinctive parts for learning relative attributes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
  55. Scheirer, W., Kumar, N., Belhumeur, P.N., Boult, T.E.: Multi-attribute spaces: calibration for attribute fusion and similarity search. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
  56. Schwartz, G., Nishino, K.: Automatically discovering local visual material attributes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  57. Shankar, S., Garg, V.K., Cipolla, R.: Deep-carving: discovering visual attributes by carving deep neural nets. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  58. Shao, J., Kang, K., Loy, C.C., Wang, X.: Deeply learned attributes for crowded scene understanding. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  59. Sharmanska, V., Quadrianto, N., Lampert, C.: Augmented attribute representations. In: European Conference on Computer Vision (ECCV) (2012)
  60. Siddiquie, B., Feris, R., Davis, L.: Image ranking and retrieval based on multi-attribute queries. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
  61. Tamuz, O., Liu, C., Belongie, S., Shamir, O., Kalai, A.T.: Adaptively learning the crowd kernel. In: International Conference on Machine Learning (ICML) (2011)
  62. Tieu, K., Viola, P.: Boosting image retrieval. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2000)
  63. Tong, S., Chang, E.: Support vector machine active learning for image retrieval. In: ACM Multimedia (2001)
  64. Tunkelang, D.: Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services (2009)
  65. Vondrick, C., Khosla, A., Malisiewicz, T., Torralba, A.: HOGgles: visualizing object detection features. In: International Conference on Computer Vision (ICCV) (2013)
  66. Wang, X., Ji, Q.: A unified probabilistic approach modeling relationships between attributes and objects. In: International Conference on Computer Vision (ICCV) (2013)
  67. Xiao, F., Lee, Y.J.: Discovering the spatial extent of relative attributes. In: International Conference on Computer Vision (ICCV) (2015)
  68. Yang, J., Yan, R., Hauptmann, A.G.: Adapting SVM classifiers to data with shifted distributions. In: IEEE International Conference on Data Mining (ICDM) Workshops (2007)
  69. Yu, F., Cao, L., Feris, R., Smith, J., Chang, S.F.: Designing category-level attributes for discriminative visual recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
  70. Zhou, X.S., Huang, T.S.: Relevance feedback in image retrieval: a comprehensive review. Multimedia Systems (2003)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. University of Pittsburgh, Pittsburgh, USA
  2. The University of Texas at Austin, Austin, USA
