Deep Filter Banks for Texture Recognition, Description, and Segmentation

Visual textures have played a key role in image understanding because they convey important semantics of images, and because texture representations that pool local image descriptors in an orderless manner have had a tremendous impact in diverse applications. In this paper we make several contributions to texture understanding. First, instead of focusing on texture instance and material category recognition, we propose a human-interpretable vocabulary of texture attributes to describe common texture patterns, complemented by a new describable texture dataset for benchmarking. Second, we look at the problem of recognizing materials and texture attributes in realistic imaging conditions, including when textures appear in clutter, developing corresponding benchmarks on top of the recently proposed OpenSurfaces dataset. Third, we revisit classic texture representations, including bag-of-visual-words and the Fisher vector, in the context of deep learning and show that these have excellent efficiency and generalization properties if the convolutional layers of a deep model are used as filter banks. We obtain in this manner state-of-the-art performance on numerous datasets well beyond textures, an efficient method to apply deep features to image regions, as well as benefits in transferring features from one domain to another.


Introduction
Visual representations based on orderless aggregations of local features, which were originally developed as texture descriptors, have had a widespread influence in image understanding. These models include cornerstones such as the histograms of vector-quantized filter responses of [59] and later generalizations such as the bag-of-visual-words model of [27] and the Fisher vector of [78]. These and other texture models have been successfully applied to a huge variety of visual domains, including problems closer to "texture understanding" such as material recognition, as well as domains such as object categorization and face identification that share little of the appearance of textures.
This paper makes three contributions to texture understanding. The first one is to add a new semantic dimension to the problem. We depart from most of the previous work on visual textures, which focused on texture identification and material recognition, and look instead at the problem of describing generic texture patterns. We do so by developing a vocabulary of forty-seven texture attributes that describe a wide range of texture patterns; we also introduce a large dataset annotated with these attributes, which we call the describable texture dataset (Sect. 2). We then study whether texture attributes can be reliably estimated from images, and for what tasks they are useful. We demonstrate in particular two applications (Sect. 7.1): the first one is to use texture attributes as dimensions along which to organise large collections of texture patterns, such as textiles, wallpapers, and construction materials, for search and retrieval. The second one is to use texture attributes as a compact basis of visual descriptors applicable to other tasks such as material recognition.
The second contribution of the paper is to introduce new data and benchmarks to study texture recognition in realistic settings. While most of the earlier work on texture recognition was carried out in carefully controlled conditions, more recent benchmarks such as the Flickr material dataset (FMD) [89] have emphasized the importance of testing algorithms "in the wild", for example on Internet images. However, even these datasets are somewhat removed from practical applications, as they assume that textures fill the field of view, whereas in applications textures are often observed in clutter. Here we leverage the excellent OpenSurfaces dataset [8] to create novel benchmarks for materials and texture attributes where textures appear both in the wild and in clutter (Sect. 3), and demonstrate promising recognition results in these challenging conditions.
The third contribution is technical and revisits classical ideas in texture modeling in the light of modern local feature descriptors and pooling encoders. While texture representations were extensively used in most areas of image understanding, since the breakthrough work of [54] they have been replaced by deep Convolutional Neural Networks (CNNs). Often CNNs are applied to a problem by using transfer learning, in the sense that the network is first trained on a large-scale image classification task such as the ImageNet ILSVRC challenge [30], and then applied to another domain by exposing the output of a so-called "fully connected layer" as a general-purpose image representation. In this work we illustrate the many benefits of truncating these CNNs earlier, at the level of the convolutional layers (Sect. 4). In this manner, one obtains powerful local image descriptors that, combined with traditional pooling encoders developed for texture representations, result in state-of-the-art recognition accuracy in a diverse set of visual domains, from material and texture attribute recognition to coarse- and fine-grained object categorization and scene classification. We show that a benefit of this approach is that features transfer easily across domains even without fine-tuning the CNN on the target problem. Furthermore, pooling allows us to efficiently evaluate descriptors in image subregions, a fact that we exploit to recognize local image regions without recomputing CNN features from scratch.
This paper is the archival version of two previous publications [24] and [25]. Compared to these two papers, this new version adds a significant number of new experiments and a substantial amount of new discussion.

Describing textures with attributes
This section looks at the problem of automatically describing texture patterns using a general-purpose vocabulary of human-interpretable texture attributes, in a manner similar to how we can vividly characterize the textures shown in Fig. 1. The goal is to design algorithms capable of generating and understanding texture descriptions involving a combination of describable attributes for each texture. Visual attributes have been extensively used in search, to understand complex user queries; in learning, to port textual information back to the visual domain; and in image description, to produce richer accounts of the content of images. Textural properties are an important component of the semantics of images, particularly for objects that are best characterized by a pattern, such as a scarf or the wings of a butterfly [103]. Nevertheless, the attributes of visual textures have been investigated only tangentially so far. Our aim is to fill this gap.

Fig. 1: We address the problem of describing textures by associating to them a collection of attributes. Our goal is to understand and generate automatically human-interpretable descriptions such as the examples above.
Our first contribution is to introduce the Describable Textures Dataset (DTD) [24], a collection of real-world texture images annotated with one or more adjectives selected from a vocabulary of forty-seven English words. These adjectives, or describable texture attributes, are illustrated in Fig. 2 and include words such as banded, cobwebbed, freckled, knitted, and zigzagged. Sect. 2.1 describes this data in more detail. Sect. 2.2 discusses the technical challenges we addressed while designing and collecting DTD, including how the forty-seven texture attributes were selected and how the problem of collecting numerous attributes for a vast number of images was addressed. Sect. 2.3 defines a number of benchmark tasks in DTD. Finally, Sect. 2.4 relates DTD to existing texture datasets.

The Describable Texture Dataset
DTD investigates the problem of texture description, understood as the recognition of describable texture attributes. This problem is complementary to standard texture analysis tasks such as texture identification and material recognition for the following reasons. While describable attributes are correlated with materials, attributes do not imply materials (e.g. veined may equally apply to leaves or marble) and materials do not imply attributes (not all marbles are veined). Describable attributes can be combined to create rich descriptions (Fig. 3; marble can be veined, stratified and cracked at the same time), whereas a typical assumption is that textures are made of a single material. Describable attributes are subjective properties that depend on the imaged object as well as on human judgements, whereas materials are objective. In short, attributes capture properties of textures complementary to materials, supporting human-centric tasks where describing textures is important. At the same time, we will show that texture attributes are also helpful in material recognition (Sect. 7.1).
DTD contains textures in the wild, i.e. texture images extracted from the web rather than captured or generated in a controlled setting. Textures fill the entire image in order to allow studying the problem of texture description independently of texture segmentation, which is instead addressed in Sect. 3. With 5,640 annotated texture images, this dataset aims at supporting real-world applications where the recognition of texture properties is a key component. Collecting images from the Internet is a common approach in categorization and object recognition, and was adopted in material recognition in FMD. This choice trades off the systematic sampling of illumination and viewpoint variations found in datasets such as CUReT, KTH-TIPS, Outex, and Drexel for the capture of real-world variations, reducing the gap with applications. Furthermore, DTD captures empirically human judgements regarding the invariance of describable texture attributes; this invariance is not necessarily reflected in material properties.

Dataset design and collection
This section discusses how DTD was designed and collected, including: selecting the 47 attributes, finding at least 120 representative images for each attribute, and collecting all the attribute labels for each image in the dataset.

[Figure caption: the matrix shows the joint probability p(q, q′) of two attributes occurring together; rows and columns are sorted in the same way as the left image.]

Selecting the describable attributes
Psychological experiments suggest that, while there are a few hundred words that people commonly use to describe textures, this vocabulary is redundant and can be reduced to a much smaller number of representative words. Our starting point is the list of 98 words identified by Bhushan, Rao and Lohse [13]. Their seminal work aimed to achieve for texture recognition what color words have achieved for describing color spaces [12]. However, their work mainly focuses on the cognitive aspects of texture perception, including perceptual similarity and the identification of directions of perceptual texture variability. Since our interest is in the visual aspects of texture, words such as "corrugated" that are more related to surface shape or haptic properties were ignored. Other words such as "messy" that are highly subjective and do not necessarily correspond to well-defined visual features were also ignored. After this screening phase we analyzed the remaining words and merged similar ones such as "coiled", "spiraled" and "corkscrewed" into a single term. This resulted in a set of 47 words, illustrated in Fig. 2.

Bootstrapping the key images
Given the 47 attributes, the next step consisted in collecting a sufficient number (120) of example images representative of each attribute. Initially, a large pool of about a hundred thousand images in total was downloaded from Google and Flickr by entering the attributes and related terms as search queries. Then Amazon Mechanical Turk (AMT) was used to remove low-resolution, poor-quality, and watermarked images, as well as images that were not almost entirely filled with a texture. Next, detailed annotation instructions were created for each of the 47 attributes, including a dictionary definition of each concept and examples of textures that did and did not match the concept. Votes from three AMT annotators were collected for the candidate images of each attribute, and a shortlist of about 200 highly-voted images was further manually checked by the authors to eliminate remaining errors. The result was a selection of 120 key representative images for each attribute.

Sequential joint annotation
So far only the key attribute of each image is known, while any of the remaining 46 attributes may apply as well. Exhaustively collecting annotations for 46 attributes and 5,640 texture images is fairly expensive. To reduce this cost we propose to exploit the correlation and sparsity of the attribute occurrences (Fig. 3). For each attribute q, twelve key images are annotated exhaustively and used to estimate the probability p(q′|q) that another attribute q′ could co-exist with q. Then for the remaining key images of attribute q, annotations are collected only for the attributes q′ with non-negligible probability, assuming that the remaining attributes do not apply. In practice, this requires annotating around 10 attributes per texture instance, instead of 47. This procedure occasionally misses attribute annotations; Fig. 3 evaluates attribute recall by 12-fold cross-validation on the 12 exhaustive annotations for a fixed budget of 10 annotations per image.
A further refinement is to suggest which attributes q′ to annotate based not just on the prior p(q′|q), but also on the appearance of an image x_i. This was done by using the attribute classifiers learned in Sect. 5; after Platt's calibration [81] on a held-out test set, the classifier score c_{q′}(x_i) ∈ R is transformed into a probability p(q′|x_i) = σ(c_{q′}(x_i)), where σ(z) = 1/(1 + e^{−z}) is the sigmoid function. By construction, Platt's calibration reflects the prior probability p(q′) ≈ p_0 = 1/47 of q′ on the validation set. To reflect the probability p(q′|q) instead, the score is adjusted as

c̃_{q′}(x_i) = c_{q′}(x_i) + log [p(q′|q)/(1 − p(q′|q))] − log [p_0/(1 − p_0)],

i.e. the logit is shifted so that the implied prior changes from p_0 to p(q′|q); the adjusted score is then used to find which attributes should be annotated for each image. As shown in Fig. 3, for a fixed annotation budget this method increases attribute recall. Overall, with roughly 10 annotations per image it was possible to recover all of the attributes for at least 75% of the images, and miss one out of four (on average) for another 20%, while keeping the annotation cost to a reasonable level.
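The selection procedure above can be sketched in a few lines, assuming a hypothetical storage format in which the exhaustive annotations of the twelve key images per attribute are kept as binary matrices; the logit shift implementing the prior adjustment is one natural reading of the calibration step described above, not the authors' exact code.

```python
import numpy as np

def cooccurrence_probs(exhaustive_labels):
    """Estimate p(q'|q) from the exhaustively annotated key images.

    exhaustive_labels: dict mapping key attribute q to a binary matrix of
    shape (num_key_images, num_attrs) -- a hypothetical format.
    """
    p = {}
    for q, labels in exhaustive_labels.items():
        p[q] = labels.mean(axis=0)  # fraction of q-images also showing q'
        p[q][q] = 1.0               # the key attribute always applies
    return p

def adjusted_scores(scores, p_cond, q, p0=1.0 / 47):
    """Shift calibrated logits so the implied prior p0 becomes p(q'|q)."""
    logit = lambda p: np.log(p / (1 - p))
    eps = 1e-6
    prior = np.clip(p_cond[q], eps, 1 - eps)
    return scores + logit(prior) - logit(p0)

def attributes_to_annotate(scores, p_cond, q, budget=10):
    """Pick the `budget` attributes most likely to apply to this image."""
    adj = adjusted_scores(scores, p_cond, q)
    return np.argsort(-adj)[:budget]
```

In use, one would estimate `p_cond` once from the twelve exhaustively annotated images of each attribute, then call `attributes_to_annotate` with the Platt-calibrated classifier scores of each remaining image.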

Benchmark tasks
DTD is designed as a public benchmark. The data, including images, annotations, and splits, is available on the web at http://www.robots.ox.ac.uk/~vgg/data/dtd, along with code for evaluation and for reproducing the results in Sect. 5. DTD defines two challenges. The first one, denoted DTD, is the prediction of key attributes, where each image is assigned a single label corresponding to the key attribute defined above. The second one, denoted DTD-J, is the joint prediction of multiple attributes. In this case each image is assigned one or more labels, corresponding to all the attributes that apply to that image.
The first task is evaluated both in terms of classification accuracy (acc) and in terms of mean average precision (mAP), while the second task only in terms of mAP, due to the possibility of multiple labels. The classification accuracy is normalized per class: if ĉ(x), c(x) ∈ {1, . . ., C} are respectively the predicted and ground-truth labels of image x, accuracy is defined as

acc(ĉ) = (1/C) Σ_{c=1}^{C} |{x : c(x) = ĉ(x) = c}| / |{x : c(x) = c}|.    (1)

We define mAP as per the PASCAL VOC 2008 benchmark onward [33]. DTD contains 10 preset splits into equally-sized training, validation and test subsets for easier algorithm comparison. Results on each task are computed for each split and average accuracies are reported.
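The two evaluation measures can be sketched as follows; the average-precision routine follows the all-points (non-interpolated) style used by PASCAL VOC from 2008 onward, though the exact VOC development-kit implementation may differ in minor details.

```python
import numpy as np

def per_class_accuracy(pred, gt, num_classes):
    """Classification accuracy normalized per class, then averaged (eq. 1)."""
    accs = []
    for c in range(num_classes):
        mask = gt == c
        if mask.any():
            accs.append((pred[mask] == c).mean())
    return float(np.mean(accs))

def average_precision(scores, labels):
    """All-points AP: average the precision measured at each positive,
    after ranking the samples by decreasing classifier score."""
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    n_pos = max(labels.sum(), 1)
    return float((precision * labels).sum() / n_pos)
```

mAP is then simply `average_precision` averaged over the attribute classes.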

Recognition of perceptual properties
The study of perceptual properties of textures originated in computer vision as well as in the cognitive sciences. Some of the earliest work on texture perception, conducted by Julesz [51], focussed on pre-attentive aspects of perception. It led to the concept of "textons," primitives such as line terminators, crossings, intersections, etc., that are responsible for pre-attentive discrimination of textures. In computer vision, Tamura et al. [95] identified six common directions of variability of images in the Brodatz dataset: coarse vs. fine; high-contrast vs. low-contrast; directional vs. non-directional; linelike vs. bloblike; regular vs. irregular; and rough vs. smooth. Similar perceptual attributes of texture [3,7] have been found by other researchers.
Our work is motivated by that of Bhushan, Rao and Lohse [13,85]. Their experiments suggest that there is a strong correlation between the structure of the lexical space and perceptual properties of texture. While they studied the psychological aspects of texture perception, the focus of this paper is the challenge of estimating such properties from images automatically. Their work [13], in particular, identified a set of words sufficient to describe a wide variety of texture patterns; the same set of words was used to bootstrap DTD.
While recent work in computer vision has focussed on texture identification and material recognition, notable contributions to the recognition of perceptual properties exist. Most of this work is part of the general research on visual attributes [14,35,55,75,77]. Texture attributes have an important role in describing objects, particularly those that are best characterized by a pattern, such as items of clothing and parts of animals such as birds. Notably, the first work on modern visual attributes by Ferrari et al. [37] focused on the recognition of a few perceptual properties of textures. Later work, such as [11], which mined visual attributes from images on the Internet, also contains some attributes that describe textures. Nevertheless, so far the attributes of textures have been investigated only tangentially. DTD addresses the questions of whether there exists a "universal" set of attributes that can describe a wide range of texture patterns, whether these can be reliably estimated from images, and for what tasks they are useful.
Datasets that focus on the recognition of subjective properties of textures are less common. One example is Pertex [26], containing 300 texture images taken in a controlled setting (Lambertian renderings of 3D reconstructions of real materials) as well as a semantic similarity matrix obtained from human similarity judgments. The work most related to ours is probably that of [70], which analyzed images in the Outex dataset [72] using a subset of the texture attributes that we consider. DTD differs in scope (containing more attributes) and, especially, in the nature of the data (controlled vs. uncontrolled conditions). In particular, working in uncontrolled conditions allows us to transfer the texture attributes to real-world applications, including material recognition in the wild and in clutter, as shown in the experiments.

Fig. 4: Datasets such as Brodatz [16] and CUReT [28] (left) addressed the problem of material instance identification, and others such as KTH-T2b [45] and FMD [89] (right) addressed the problem of material category recognition. Our DTD dataset addresses a very different problem: that of describing a pattern using intuitive attributes (Fig. 1).

Recognition of texture instances and material categories
Most of the recent work in texture recognition focuses on the recognition of texture instances and material categories, as reflected by the development of corresponding benchmarks (Fig. 4). The Brodatz [16] catalogue was used in early work on textures to study the problem of identifying texture instances (e.g. matching half of a texture image given the other half). Others, including CUReT [28], UIUC [57], KTH-TIPS [18,45], Outex [72], the Drexel Texture Database [74], and ALOT [17], address the recognition of specific instances of one or more materials. UMD [108] is similar, but the imaged objects are not necessarily composed of a single material. As textures are imaged under variable truncation, viewpoint, and illumination, these datasets have stimulated the creation of texture representations that are invariant to viewpoint and illumination changes [60,72,97,98]. Frequently, texture understanding is formulated as the problem of recognizing the material of an object rather than a particular texture instance (in this case any two slabs of marble would be considered equal). KTH-T2b [68] is one of the first datasets to address this problem by grouping textures not only by instance, but also by type of material (e.g. "wood").
However, these datasets make the simplifying assumption that textures fill images, and often, there is limited intraclass variability, due to a single or limited number of instances, captured under controlled scale, view-angle and illumination.Thus, they are not representative of the problem of recognizing materials in natural images, where textures appear under poor viewing conditions, low resolution, and in clutter.Addressing this limitation is the main goal of the Flickr Material Database (FMD) [89].FMD samples just one viewpoint and illumination per object, but contains many different object instances grouped in several different material classes.Sect. 3 will introduce datasets addressing the problem of clutter as well.
The performance of recognition algorithms on most of these datasets is close to perfect, with classification accuracies well above 95%; KTH-T2b and FMD are an exception due to their increased complexity. A review of these datasets and classification methodologies is presented in [96], which also proposes a training-free framework to classify textures, significantly improving on other methods. Table 1 and Fig. 4 provide a summary of the nature and size of the various texture datasets used in our experiments.

Recognizing textures in clutter
This section looks at the second contribution of the paper, namely studying the recognition of materials and describable texture attributes not only "in the wild," but also "in clutter". Even in datasets such as FMD and DTD, in fact, each texture instance fills the entire image, which does not match most applications. This section removes this limitation and looks at the problem of recognizing textures imaged in the larger context of a complex natural scene, including the challenging task of automatically segmenting textured image regions.
Rather than collecting a new image dataset from scratch, our starting point is the excellent OpenSurfaces (OS) dataset recently introduced by Bell et al. [8]. OS comprises 25,357 images, each containing a number of high-quality texture/material segments. Many of these segments are annotated with additional attributes such as the material, viewpoint, BRDF estimates, and object class. Experiments focus on the 58,928 segments that contain material annotations. Since material classes are highly unbalanced, we consider only the materials that contain at least 400 examples. This results in 53,915 annotated material segments in 10,422 images spanning 23 different classes. Images are split evenly into training, validation, and test subsets with 3,474 images each. Segment sizes are highly variable, with half of them being relatively small, with an area smaller than 64 × 64 pixels. One issue with crowdsourced collection of segmentations is that not all the pixels in an image are labelled. This makes it difficult to define a complete background class. For our benchmark, several less common materials (including, for example, segments that annotators could not assign to a material) were merged into an "other" class that acts as the background.
This benchmark is similar to the one concurrently proposed by Bell et al. [10]. However, in order to study perceptual properties as well as materials, we also augment the OS dataset with some of the describable attributes of Sect. 2. Since the OS segments do not exhibit all 47 attributes with sufficient frequency, the evaluation is restricted to eleven of them for which it was possible to identify at least 100 matching segments. The attributes were manually labelled in the 53,915 segments retained for materials. We refer to this data as OSA.

Benchmark tasks
As for DTD, the aim is to define standardized image understanding tasks to be used as public benchmarks. The complete list of images, segments, labels, and splits is publicly available at http://www.robots.ox.ac.uk/~vgg/data/dtd.
The benchmarks include two tasks on two complementary semantic domains. The first task is the recognition of texture regions, given the region extent as ground-truth information. This task is instantiated both for materials, denoted OS+R, and for describable texture attributes, denoted OSA+R. Performance in OS+R is measured in terms of classification accuracy and mAP, using the same definition (1) with images replaced by image regions. Performance in OSA+R uses instead mAP, due to the possibility of multiple labels.
The second task is the segmentation and recognition of texture regions, which we also instantiate for materials (OS) and describable texture attributes (OSA). Since not all image pixels are labelled in the ground truth, the performance of a predictor ĉ is measured in terms of per-pixel classification accuracy, pp-acc(ĉ). This is computed using the same formula as (1) with two modifications: first, the images x are replaced by pixels p (extracted from all images in the dataset); second, the ground-truth label c(p) of a pixel may take an additional value 0 to denote pixels that are not labelled in the ground truth (the effect is to ignore them in the computation of accuracy).
In the case of OSA, the per-pixel accuracy is modified such that a class prediction is considered correct if it belongs to any of the ground-truth pixel labels. Furthermore, accuracy is not normalized per class, as this is ill-defined, but by the total number of labelled pixels:

pp-acc(ĉ) = |{p : c(p) ≠ ∅ ∧ ĉ(p) ∈ c(p)}| / |{p : c(p) ≠ ∅}|,

where c(p) is the set of possible labels of pixel p and ∅ denotes the empty set.
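The two per-pixel measures can be sketched as follows (a minimal sketch over flattened pixel arrays; the convention that label 0 means "unlabelled" follows the text above):

```python
import numpy as np

def pp_acc(pred, gt, num_classes):
    """Per-pixel accuracy for OS: normalized per class over classes
    1..num_classes; gt == 0 marks unlabelled pixels, which are ignored."""
    accs = []
    for c in range(1, num_classes + 1):
        mask = gt == c
        if mask.any():
            accs.append((pred[mask] == c).mean())
    return float(np.mean(accs))

def pp_acc_multilabel(pred, gt_sets):
    """Per-pixel accuracy for OSA: a prediction is correct if it is among
    the ground-truth labels; normalized by the number of labelled pixels
    (empty set = unlabelled), not per class."""
    labelled = [i for i, s in enumerate(gt_sets) if s]
    correct = sum(1 for i in labelled if pred[i] in gt_sets[i])
    return correct / len(labelled)
```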

Texture representations
Having presented our contributions to framing the problem of texture description, we now turn to our technical advances towards addressing the resulting problems. We start by revisiting the concept of texture representation and studying how it relates to modern image descriptors based on CNNs.
In general, a visual representation is a map that takes an image x to a vector φ(x) ∈ R^d that facilitates understanding the image content. Understanding is often achieved by learning a linear predictor ⟨φ(x), w⟩ scoring the strength of association between the image and a particular concept, such as an object category.
Among image representations, this paper is particularly interested in the class of texture representations pioneered by the works of [15,60,65,67]. Textures encompass a large diversity of visual patterns, from regular repetitions such as wallpapers, to stochastic processes such as fur, to intermediate cases such as pebbles. Distortions due to viewpoint and other imaging factors further complicate modeling textures. However, one can usually assume that, given a particular texture, appearance variations are statistically independent at long range and can therefore be eliminated by averaging local image statistics over a sufficiently large texture sample. Hence, the defining characteristic of texture representations is to pool information extracted locally and uniformly from the image, by means of local descriptors, in an orderless manner.
The importance of texture representations lies in the fact that they were found to be applicable well beyond textures. For example, until recently many of the best object categorization methods in challenges such as PASCAL VOC [34] and ImageNet ILSVRC [30] were based on variants of texture representations, developed specifically for objects. One of the contributions of this work is to show that these object-optimized texture representations are in fact optimal for a large number of texture-specific problems too (Sect. 5.1.3).
More recently, texture representations have been significantly outperformed by Convolutional Neural Networks (CNNs) in object categorization [54], detection [42], segmentation [44], and in fact in almost all domains of image understanding. Key to the success of CNNs is their ability to leverage large labelled datasets to learn high-quality features. Importantly, CNN features pre-trained on very large datasets were found to transfer to many other domains with a relatively modest adaptation effort [21,42,50,73,86]. Hence, CNNs provide general-purpose image descriptors.
While CNNs generally outperform classical texture representations, it is interesting to ask what the relation between these two methods is and whether they can be fruitfully hybridized. Standard CNN-based methods such as [21,42,50,73,86] can be interpreted as extracting local image descriptors (performed by the so-called "convolutional layers") followed by pooling such features into a global image representation (performed by the "fully-connected (FC) layers"). Here we will show that replacing FC pooling with one of the many pooling mechanisms developed for texture representations has several advantages: (i) a much faster computation of the representation for image subregions, accelerating applications such as detection and segmentation [42,43,46]; (ii) a significantly superior recognition accuracy in several application domains; and (iii) the ability to achieve this superior performance without fine-tuning CNNs, by implicitly reducing the domain-shift problem.
In order to systematically study variants of texture representations φ = φ_e ∘ φ_f, we break them into local descriptor extraction φ_f followed by descriptor pooling φ_e. In this manner, different combinations of each component can be evaluated. Common local descriptors include linear filters, local image patches, local binary patterns, densely-extracted SIFT features, and many others. Since local descriptors are extracted uniformly from the image, they can be seen as banks of (non-linear) filters; we therefore refer to them as filter banks in honor of the pioneering works of [15,39,60,67] and others where the descriptors were the output of actual linear filters. Pooling methods include bag-of-visual-words, variants using soft assignment, and methods extracting higher-order statistics as in the Fisher vector. Since these methods encode the information contained in the local descriptors in a single vector, we refer to them as pooling encoders.
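The decomposition φ = φ_e ∘ φ_f can be illustrated with a toy example: raw patches stand in for the local descriptors φ_f (in practice SIFT, LM responses, or CNN convolutional features), and a bag-of-visual-words histogram stands in for the pooling encoder φ_e (the codebook here is random for brevity; in practice it is learned by k-means).

```python
import numpy as np

def phi_f(image, patch=8, stride=4):
    """Toy filter bank: densely extract raw patches as local descriptors."""
    H, W = image.shape
    desc = [image[r:r + patch, c:c + patch].ravel()
            for r in range(0, H - patch + 1, stride)
            for c in range(0, W - patch + 1, stride)]
    return np.array(desc)

def phi_e(descriptors, codebook):
    """Bag-of-visual-words encoder: assign each descriptor to its nearest
    visual word and pool the assignments into an orderless, L1-normalized
    histogram -- spatial layout is deliberately discarded."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# phi = phi_e . phi_f : an orderless texture representation
rng = np.random.default_rng(0)
image = rng.random((32, 32))
codebook = rng.random((16, 64))  # stand-in for a k-means codebook
phi_x = phi_e(phi_f(image), codebook)
```

Swapping either component, e.g. replacing `phi_e` with a Fisher-vector encoder or `phi_f` with CNN features, changes the representation without altering the overall structure.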
Sect. 4.1 and Sect. 4.2 discuss filter banks and pooling encoders in detail.

Local image descriptors
There is a vast choice of local image descriptors for texture representations. Traditionally, these features were handcrafted, but with the latest generation of deep learning methods it is now customary to learn them from data (although often in an implicit form). Representative examples of these two families of local features are discussed in Sect. 4.1.1 and Sect. 4.1.2, respectively.

Hand-crafted local descriptors
Some of the earliest local image descriptors were developed as linear filter banks in texture recognition. As an evolution of earlier texture filters [15,65], the filter bank of Leung and Malik (LM) [61] includes 48 filters matching bars, edges and spots at various scales and orientations. These filters are first and second derivatives of Gaussians at 6 orientations and 3 scales (36 filters), 8 Laplacian of Gaussian (LoG) filters, and 4 Gaussians. Combinations of the filter responses, identified by vector quantisation (Sect. 4.2.1), were used as the computational basis of the "textons" proposed by Julesz [52]. The MR8 filter bank of [41,97] consists instead of 38 filters, similar to LM. For the two oriented filter types, only the maximum response across orientations is recorded at each scale, reducing the number of responses to 8 (3 scales for each of the two oriented filter types, plus two isotropic filters: a Gaussian and a Laplacian of Gaussian).
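The 36 oriented LM filters can be sketched as follows. This is a minimal reconstruction from the published description (Gaussian derivatives at 3 scales and 6 orientations, with a 3:1 elongation); the exact support, normalization, and scale values of the original LM code may differ.

```python
import numpy as np

def gaussian_derivative_filter(size, sigma, order, theta):
    """One oriented filter: a derivative of a Gaussian along the rotated
    y-axis. order=1 gives an edge filter, order=2 a bar filter."""
    r = (size - 1) / 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    # rotate the coordinate frame by theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    # elongated Gaussian envelope (3:1, following the LM description)
    g = np.exp(-xr ** 2 / (2 * (3 * sigma) ** 2)) * np.exp(-yr ** 2 / (2 * sigma ** 2))
    if order == 1:
        f = -yr / sigma ** 2 * g
    else:
        f = (yr ** 2 / sigma ** 4 - 1 / sigma ** 2) * g
    return f - f.mean()  # zero mean: responses ignore overall brightness

def lm_oriented_bank(size=49, sigmas=(1.0, np.sqrt(2), 2.0), n_orient=6):
    """The 36 oriented LM filters: first and second Gaussian derivatives at
    3 scales and 6 orientations (the 8 LoG and 4 Gaussians are omitted)."""
    thetas = [np.pi * k / n_orient for k in range(n_orient)]
    return [gaussian_derivative_filter(size, s, o, t)
            for o in (1, 2) for s in sigmas for t in thetas]
```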
The importance of using linear filters as local features was later questioned by Varma and Zisserman [97]. The VZ descriptors are in fact small image patches which, remarkably, were shown to outperform LM and MR8 on earlier texture benchmarks such as CUReT. However, as will be demonstrated in the experiments, such trivial local descriptors are not competitive on harder tasks.
Another early local image descriptor is the Local Binary Pattern (LBP) of [71,72], a special case of the texture units of [105]. An LBP d_i = (b_1, ..., b_m) computed at a pixel p_0 is the sequence of bits b_j = [x(p_0) > x(p_j)], comparing the intensity x(p_0) of the central pixel to that of its m neighbors p_j (usually 8, arranged on a circle). LBPs have specialized quantization schemes; the most common one maps the bit string d_i to one of a small number of uniform patterns [72]. The quantized LBPs can be averaged over the image to build a histogram; alternatively, such histograms can be computed for small image patches and used in turn as local image descriptors.
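A minimal sketch of the LBP computation described above, approximating the 8-neighbour circle by the 3 × 3 neighbourhood (function names are ours):

```python
import numpy as np

def lbp_codes(x):
    """Compute 8-neighbour LBP codes for the interior pixels of a
    grayscale image: bit j is [x(p0) > x(pj)] for neighbour pj."""
    # offsets of the 8 neighbours, in a fixed circular order
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = x.shape
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for j, (di, dj) in enumerate(offs):
        nb = x[1 + di:h - 1 + di, 1 + dj:w - 1 + dj]
        codes |= (x[1:-1, 1:-1] > nb).astype(int) << j
    return codes

def lbp_histogram(x, bins=256):
    """Average the quantised patterns over the image into a histogram."""
    h = np.bincount(lbp_codes(x).ravel(), minlength=bins).astype(float)
    return h / h.sum()
```

This sketch quantizes to all 256 raw patterns rather than the uniform patterns of [72]; the mapping to uniform patterns would be an additional lookup table on the codes.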
In the context of object recognition, the best known local descriptor is undoubtedly D. Lowe's SIFT [63]. SIFT is the histogram of occurrences of image gradients, quantized with respect to their location within a patch as well as to their orientation. While SIFT was originally introduced to match object instances, it was later applied to an impressive diversity of tasks, from object categorization to semantic segmentation and face recognition.

Learned local descriptors
Handcrafted image descriptors are nowadays outperformed by features learned using the latest generation of deep CNNs [54]. A CNN can be seen as a composition φ = φ_K ∘ ··· ∘ φ_2 ∘ φ_1 of K layers, where each layer φ_k outputs a feature field x_k ∈ R^{W_k × H_k × D_k}, W_k and H_k are the width and height of the field, and D_k is the number of feature channels. By collecting the D_k responses at a certain spatial location, one obtains a D_k-dimensional descriptor vector. The network is called convolutional if all the layers are implemented as (non-linear) filters, in the sense that they act locally and uniformly on their input. If this is the case, since compositions of filters are filters, the feature field x_k is the result of applying a non-linear filter bank to the image x.
As computation progresses, the resolution of the descriptor fields decreases whereas the number of feature channels increases. Often, the last several layers φ_k of a CNN are called "fully connected" because, if seen as filters, their support is the same as the size of the input field x_{k−1}, so they lack locality. By contrast, earlier layers that act locally will be referred to as "convolutional". If there are C convolutional layers, the CNN φ = φ_e ∘ φ_f can be decomposed into a filter bank of local descriptors φ_f = φ_C ∘ ··· ∘ φ_1, given by the convolutional layers, followed by a pooling encoder φ_e = φ_K ∘ ··· ∘ φ_{C+1}, given by the fully-connected ones.
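Assuming a feature field x_k has already been computed by some convolutional network, treating it as a bag of local descriptors is a simple reshape; the sketch below uses a random array in place of real CNN activations:

```python
import numpy as np

def field_to_descriptors(x_k):
    """Treat a convolutional feature field x_k of shape (H_k, W_k, D_k)
    as a bag of H_k * W_k local descriptors of dimension D_k, one per
    spatial location."""
    h, w, d = x_k.shape
    return x_k.reshape(h * w, d)

# hypothetical feature field, e.g. the output of the last conv layer
field = np.random.RandomState(0).randn(6, 6, 512)
descs = field_to_descriptors(field)
```

Any pooling encoder of Sect. 4.2 can then be applied to `descs` exactly as it would be to SIFT descriptors.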

Pooling encoders
A pooling encoder takes as input the local descriptors extracted from an image x and produces as output a single feature vector φ(x), suitable for tasks such as classification with an SVM. A first important differentiating factor between encoders is whether they discard the spatial configuration of the input features (orderless pooling; Sect. 4.2.1) or retain it (order-sensitive pooling; Sect. 4.2.2). A detail of practical importance, furthermore, is the type of post-processing applied to the pooled vectors (Sect. 4.2.3).

Orderless pooling encoders
An orderless pooling encoder φ_e maps a collection F = (f_1, ..., f_n), f_i ∈ R^D, of local image descriptors to a feature vector φ_e(F) ∈ R^d. The encoder is orderless in the sense that the function φ_e is invariant to permutations of the input F. Furthermore, the encoder can be applied to any number of features; for example, it can be applied to the subset F′ ⊂ F of local descriptors contained in a target image region without recomputing the local descriptors themselves.
All common orderless encoders are obtained by applying a non-linear descriptor encoder η(f_i) ∈ R^d to individual local descriptors and then aggregating the result using a commutative operator such as average or max. For example, average-pooling yields φ̄_e(F) = (1/n) Σ_{i=1}^n η(f_i). The pooled vector φ̄_e(F) is post-processed to obtain the final representation φ_e(F), as discussed later.
The best-known orderless encoder is the Bag of Visual Words (BoVW). This encoder starts by vector-quantizing (VQ) the local features f_i ∈ R^D, assigning each to its closest visual word in a dictionary C = [c_1, ..., c_d] ∈ R^{D×d} of d elements. Visual words can be thought of as "prototype features" and are obtained during training by clustering example local features. The descriptor encoder η_1(f_i) is the one-hot vector indicating the visual word corresponding to f_i, and average-pooling these one-hot vectors yields the histogram of visual word occurrences. BoVW was introduced in the work of [61] to characterize the distribution of textons, defined as configurations of local filter responses, and was then ported to object instance and category understanding by [93] and [27] respectively. It was later extended in several ways, as described below.
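A minimal BoVW encoder can be sketched as follows (the dictionary C would normally come from k-means clustering of training descriptors; here it is passed in directly):

```python
import numpy as np

def bovw(F, C):
    """Bag of visual words: hard-assign each descriptor f_i (rows of F,
    shape (n, D)) to the nearest column of the dictionary C (D x d) and
    average the resulting one-hot codes into a histogram."""
    # squared distances between descriptors and visual words
    d2 = ((F[:, :, None] - C[None, :, :]) ** 2).sum(axis=1)  # (n, d)
    assign = d2.argmin(axis=1)
    hist = np.bincount(assign, minlength=C.shape[1]).astype(float)
    return hist / len(F)
```

Averaging one-hot assignments is exactly what produces the histogram of visual word occurrences mentioned above.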
The kernel codebook encoder [80] assigns each local feature to several visual words, weighted by a degree of membership η_KC(f_i)_k ∝ exp(−λ ‖f_i − c_k‖²), where λ is a parameter controlling the locality of the assignment.
The descriptor code η_KC(f_i) is L1-normalized before aggregation, such that ‖η_KC(f_i)‖_1 = 1. Several related methods use concepts from sparse coding to define the local descriptor encoder [62,112]. Locality-constrained Linear Coding (LLC) [104], in particular, extends soft assignment by making the assignments reconstructive, local, and sparse: the descriptor encoder reconstructs f_i as a linear combination of visual words while allowing only the r ≪ d visual words closest to f_i to have a non-zero coefficient.
In the Vector of Locally-Aggregated Descriptors (VLAD) [49] the descriptor encoder is richer. Local image descriptors are first assigned to their nearest-neighbor visual word in a dictionary of K elements, as in BoVW; the descriptor encoder is then given by η_VLAD(f_i) = η_1(f_i) ⊗ (f_i − C η_1(f_i)), where ⊗ is the Kronecker product. Intuitively, this subtracts from f_i the corresponding visual word C η_1(f_i) and then copies the difference into one of K possible subvectors, one for each visual word. Hence, average-pooling η_VLAD(f_i) accumulates first-order descriptor statistics instead of simple occurrences as in BoVW.
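The VLAD encoder can be sketched in a few lines; as above, the dictionary is assumed given, and the normalization of Sect. 4.2.3 is omitted:

```python
import numpy as np

def vlad(F, C):
    """VLAD: for each descriptor (row of F, shape (n, D)), subtract its
    nearest visual word (column of C, shape (D, K)) and accumulate the
    residual in the sub-vector of that word."""
    D, K = C.shape
    d2 = ((F[:, :, None] - C[None, :, :]) ** 2).sum(axis=1)  # (n, K)
    assign = d2.argmin(axis=1)
    v = np.zeros((K, D))
    for f, k in zip(F, assign):
        v[k] += f - C[:, k]          # first-order residual
    return v.ravel() / len(F)        # average pooling of eta_VLAD(f_i)
```

Note that when every descriptor coincides with its visual word all residuals vanish, so VLAD encodes how the descriptors deviate from the dictionary, not how often each word occurs.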
VLAD can be seen as a variant of the Fisher Vector (FV) [78]. The FV differs from VLAD as follows. First, the quantizer is not K-means but a Gaussian Mixture Model (GMM) with components (π_k, μ_k, Σ_k), k = 1, ..., K, where π_k ∈ R is the prior probability of the component, μ_k ∈ R^D the Gaussian mean, and Σ_k ∈ R^{D×D} the Gaussian covariance (assumed diagonal). Second, hard assignments η_1(f_i) are replaced by soft assignments η_GMM(f_i), given by the posterior probability of each GMM component. Third, the FV descriptor encoder η_FV(f_i) includes both first-order statistics Σ_k^{−1/2}(f_i − μ_k) and second-order ones (see [20,78,79] for details). Hence, average-pooling η_FV(f_i) accumulates both first- and second-order statistics of the local image descriptors.
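A simplified FV sketch with a diagonal GMM, keeping only the structure of the first- and second-order terms and omitting the constant factors involving π_k (see [78,79] for the exact expressions):

```python
import numpy as np

def fisher_vector(F, pi, mu, var):
    """Simplified Fisher vector for a diagonal GMM (pi: (K,), mu and
    var: (K, D)): soft-assign descriptors (rows of F, shape (n, D)) to
    components, then pool the normalised first- and second-order
    residuals."""
    n, D = F.shape
    # log N(f | mu_k, var_k) up to an additive constant, for posteriors
    ll = -0.5 * (((F[:, None, :] - mu[None]) ** 2) / var[None]).sum(-1) \
         - 0.5 * np.log(var).sum(-1)[None, :] + np.log(pi)[None, :]
    q = np.exp(ll - ll.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)                       # (n, K) posteriors
    u = (F[:, None, :] - mu[None]) / np.sqrt(var)[None]  # first order
    v = u ** 2 - 1                                       # second order
    phi1 = (q[:, :, None] * u).mean(0)                 # (K, D)
    phi2 = (q[:, :, None] * v).mean(0)
    return np.concatenate([phi1.ravel(), phi2.ravel()])
```

The output has 2KD dimensions, matching the dimensionality analysis given later in Sect. 5.1.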
All the encoders discussed above use average pooling, except LLC, which uses max pooling.

Order-sensitive pooling encoders
An order-sensitive encoder differs from an orderless encoder in that the map φ_e(F) is not invariant to permutations of the input F. Such an encoder can therefore reflect the layout of the local image descriptors, which may be ineffective or even counter-productive in texture recognition, but is usually helpful in the recognition of objects, scenes, and other structured content.
The most common order-sensitive encoder is the Spatial Pyramid Pooling (SPP) of [58]. SPP transforms any orderless encoder into one with (weak) spatial sensitivity by dividing the image into subregions, computing the encoder for each subregion, and stacking the results. The combined encoder is only sensitive to reassignments of the local descriptors to different subregions.
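SPP can be sketched as a wrapper around any orderless encoder, given the normalized (y, x) location of each descriptor (function names are ours):

```python
import numpy as np

def spatial_pyramid(descs, locs, encoder, grid=(2, 2)):
    """Wrap any orderless encoder with weak spatial sensitivity:
    split the unit square into grid cells, encode the descriptors
    falling in each cell, and stack the results.
    descs: (n, D) descriptors; locs: (n, 2) normalized (y, x) in [0,1)."""
    gy, gx = grid
    cy = np.minimum((locs[:, 0] * gy).astype(int), gy - 1)
    cx = np.minimum((locs[:, 1] * gx).astype(int), gx - 1)
    parts = []
    for iy in range(gy):
        for ix in range(gx):
            sel = (cy == iy) & (cx == ix)
            parts.append(encoder(descs[sel]))
    return np.concatenate(parts)
```

Only moving a descriptor across a cell boundary changes the output, which is exactly the weak spatial sensitivity described above.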
The Fully-Connected (FC) layers of a CNN also form an order-sensitive encoder. Compared to the encoders seen above, FC layers are pre-trained discriminatively, which can be either an advantage or a disadvantage, depending on whether the information they capture transfers to the domain of interest. FC poolers are much less flexible than the encoders seen above, as they work only with a particular type of local descriptors, namely the corresponding CNN convolutional layers. Furthermore, a standard FC pooler can only operate on a well-defined layout of local descriptors (e.g. a 6 × 6 grid), which in turn means that the image needs to be resized to a standard size before the FC encoder can be evaluated. This is particularly expensive when, as in object detection or image segmentation, many image subregions must be considered.

Post-processing
The vector y = φ̄_e(F) obtained by pooling local image descriptors is usually post-processed before being used in a classifier. In the simplest case, this amounts to performing L2 normalization φ_e(F) = y/‖y‖_2. However, this is usually preceded by a non-linear transformation φ_K(y), which is best understood in terms of kernels. A kernel K(y′, y″) specifies a notion of similarity between data points y′ and y″. If K is a positive semidefinite function, then it can always be rewritten as the inner product ⟨φ_K(y′), φ_K(y″)⟩, where φ_K is a suitable pre-processing function called a kernel embedding [64,101]. Typical kernels include the linear, Hellinger's, additive-χ², and exponential-χ² ones, given respectively by ⟨y′, y″⟩, Σ_j √(y′_j y″_j), Σ_j 2 y′_j y″_j / (y′_j + y″_j), and exp(−λ χ²(y′, y″)). In practice, the kernel embedding φ_K can be computed easily only in a few cases, including the linear kernel (φ_K is the identity) and Hellinger's kernel (for each scalar component, φ_Hell.(y) = √y). In the latter case, if y can take negative values, the embedding is extended to the so-called signed square rooting φ_Hell.(y) = sign(y) √|y|.
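The post-processing steps and kernels above can be sketched as follows (the χ² variants follow common definitions; the small constants we add are only to avoid division by zero and are our own choice):

```python
import numpy as np

def signed_sqrt(y):
    """Hellinger (signed square-root) embedding, extended to vectors
    with negative components."""
    return np.sign(y) * np.sqrt(np.abs(y))

def l2_normalize(y, eps=1e-12):
    return y / (np.linalg.norm(y) + eps)

def chi2_kernel(y1, y2, eps=1e-12):
    """Additive chi-squared kernel for non-negative histograms."""
    return (2 * y1 * y2 / (y1 + y2 + eps)).sum()

def exp_chi2_kernel(y1, y2, lam):
    """Exponential chi-squared kernel exp(-lam * chi2(y1, y2))."""
    chi2 = 0.5 * (((y1 - y2) ** 2) / (y1 + y2 + 1e-12)).sum()
    return np.exp(-lam * chi2)
```

For L1-normalized histograms, the additive-χ² kernel satisfies K(y, y) = Σ_j y_j = 1, which is why normalizing the kernel, as described next, matters in practice.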
Even if φ_K is not explicitly computed, any kernel can be used to learn a classifier such as an SVM (kernel trick). In this case, L2-normalizing the kernel embedding φ_K(y) amounts to normalizing the kernel as K̂(y′, y″) = K(y′, y″) / √(K(y′, y′) K(y″, y″)).
All the pooling encoders discussed above are usually followed by post-processing. In particular, the Improved Fisher Vector (IFV) [79] prescribes the use of the signed square-root embedding followed by L2 normalization. VLAD has several standard variants that differ in the post-processing; here we use the one that L2-normalizes the individual VLAD subvectors (one for each visual word) before L2-normalizing the whole vector [4].

Experiments on semantic recognition
So far the paper has introduced novel problems in texture understanding as well as a number of old and new texture representations. The goal of this section is to determine, through extensive experiments, which representations work best for which problem.
Representations are labelled as pairs X-Y, where X is a pooling encoder and Y a local descriptor. For example, FV-SIFT denotes the Fisher vector encoder applied to densely extracted SIFT descriptors, whereas BoVW-CNN denotes the bag-of-visual-words encoder applied on top of CNN convolutional descriptors. Note in particular that the CNN-based image representations as commonly extracted in the literature [21,50,86] implicitly use CNN-based descriptors and the FC pooler, and are therefore denoted here as FC-CNN.
The first set of experiments (Sect. 5.1) evaluates several local image descriptor and pooling encoder combinations on a small number of datasets in order to determine noteworthy representations. The second set of experiments (Sect. 5.2) evaluates the latter on a wide range of image understanding problems in order to establish their breadth of applicability.
The main findings of these experiments can be summarized as follows. First, orderless pooling of SIFT features is superior to specialized texture descriptors in many texture recognition problems. Second, orderless pooling of CNN local descriptors is significantly better than pooling SIFT descriptors, as well as than FC pooling of CNN descriptors. Third, orderless pooling of very deep CNN local descriptors approaches or surpasses the state of the art on many standard problems in texture recognition, coarse and fine-grained object categorization, semantic region recognition, and scene categorization.

General experimental setup
The experiments are centered around two types of local descriptors. The first type are SIFT descriptors extracted densely from the image (denoted DSIFT). SIFT descriptors are sampled with a step of two pixels and the support of the descriptor is scaled such that a SIFT spatial bin has size 8 × 8 pixels. Since there are 4 × 4 spatial bins, the support or "receptive field" of each DSIFT descriptor is 40 × 40 pixels (including a border of half a bin due to bilinear interpolation). Descriptors are 128-dimensional [63], but their dimensionality is further reduced to 80 using PCA in all experiments. Besides improving the classification accuracy, this significantly reduces the size of the Fisher Vector and VLAD encodings.
The second type of local image descriptors are deep convolutional features (denoted CNN), extracted from the convolutional layers of CNNs pre-trained on ImageNet ILSVRC data. Most experiments build on the VGG-M model of [21], as this network performs better than standard networks such as Caffe [50] and AlexNet [54] while having a similar computational cost. The VGG-M convolutional features are extracted as the output of the last convolutional layer, directly from the linear filters excluding ReLU and max pooling, which yields a field of 512-dimensional descriptor vectors. In addition to VGG-M, experiments consider the recent VGG-VD (very deep, with 19 layers) model of Simonyan and Zisserman [92]. The receptive field of CNN descriptors is much larger compared to SIFT: 139 × 139 pixels for VGG-M and 252 × 252 for VGG-VD.
When combined with a pooling encoder, local descriptors are extracted at multiple scales, obtained by rescaling the image by factors 2^s, s = −3, −2.5, ..., 1.5 (but, for efficiency, discarding scales that would make the image larger than 1024² pixels).
The dimensionality of the final representation strongly depends on the encoder type and parameters. For K visual words, BoVW and LLC have K dimensions, VLAD has KD, and FV 2KD, where D is the dimension of the local descriptors. For the FC encoder, the dimensionality is fixed by the CNN architecture; here the representation is extracted from the penultimate FC layer (before the final classification layer) and happens to have 4096 dimensions for all the CNNs considered. In practice, dimensions vary widely, with BoVW, LLC, and FC having comparable dimensionality, and VLAD and FV a much higher one. For example, FV-CNN has about 64·10³ dimensions with K = 64 Gaussian mixture components, versus the 4096 of FC, BoVW, and LLC (when used with K = 4096 visual words). In practice, however, raw dimensions are hardly comparable, as VLAD and FV vectors are usually highly compressible [76]. We verified this by using PCA to reduce FV to 4096 dimensions and observed only a marginal reduction in classification performance on the PASCAL VOC object recognition task, as described below.
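The dimensionality bookkeeping above can be summarized in a one-line helper (the encoder names are ours):

```python
def representation_dim(encoder, K, D):
    """Dimensionality of the pooled representation for K visual words
    (or GMM components) and D-dimensional local descriptors."""
    return {'bovw': K, 'llc': K, 'vlad': K * D, 'fv': 2 * K * D}[encoder]

# e.g. FV-CNN with K = 64 components and 512-D conv features:
# representation_dim('fv', 64, 512) gives 65536, i.e. about 64 * 10^3
```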
Unless otherwise specified, learning uses a standard non-linear SVM solver. Initially, cross-validation was used to select the parameter C of the SVM in the range {0.1, 1, 10, 100}; however, after noting that performance was nearly identical in this range (probably due to the data normalization), C was simply set to the constant 1. Instead, it was found that recalibrating the SVM scores for each class improves classification accuracy (but of course not mAP). Recalibration is obtained by changing the SVM bias and rescaling the SVM weight vector such that the median scores of the negative and positive training samples of each class are mapped to −1 and 1, respectively.
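The recalibration step admits a simple closed form: an affine map a·s + b with a = 2/(m⁺ − m⁻), where m⁺ and m⁻ are the positive and negative median scores. A sketch (function names are ours):

```python
import numpy as np

def recalibrate(scores, labels):
    """Return an affine rescaling of SVM scores mapping the median
    score of the negative and positive training samples to -1 and +1."""
    med_neg = np.median(scores[labels < 0])
    med_pos = np.median(scores[labels > 0])
    a = 2.0 / (med_pos - med_neg)       # rescales the weight vector
    b = -1.0 - a * med_neg              # adjusts the bias
    return lambda s: a * s + b
```

Since the map is monotonic for each class, per-class rankings (and hence AP) are unchanged; only cross-class score comparability, and therefore classification accuracy, improves.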
All the experiments in the paper use the VLFeat library [99] for the computation of SIFT features and the pooling embeddings (BoVW, VLAD, FV). The MatConvNet [100] library is used instead for all the experiments involving CNNs. Further details specific to the setup of each experiment are given below as needed.

Datasets and evaluation measures
The evaluation is performed on a diversity of tasks: the new describable attribute and material recognition benchmarks in DTD and OpenSurfaces, existing ones in FMD and KTH-T2b, object recognition in PASCAL VOC 2007, and scene recognition in MIT Indoor. All experiments follow standard evaluation protocols for each dataset, as detailed below.
DTD (Sect. 2) contains 47 texture classes, one per visual attribute, with 120 images each. Images are equally split into train, test and validation sets. Experiments include the prediction of "key attributes" as well as "joint attributes", as defined in Sect. 2.1, and report accuracy averaged over the 10 default splits provided with the dataset. OpenSurfaces [8] is used in the setup described in Sect. 3 and contains 25,357 images, out of which we selected 10,422, spanning 21 categories. When segments are provided, the dataset is referred to as OS+R, and recognition accuracy is reported on a per-segment basis. We also annotated the segments with the attributes from DTD, calling this subset OSA (and OSA+R for the setup where segments are provided). For the recognition task on OSA+R we report mean average precision, as this is a multi-label dataset.
FMD [89] consists of 1,000 images, 100 for each of ten material categories. The standard evaluation protocol of [89] reports accuracy averaged over the predefined ten splits provided with the dataset. KTH-T2b [17] contains 11 material categories; for each category, images of four samples were captured under various conditions, resulting in 108 images per sample. Following the standard procedure [19,96], images of one material sample are used to train the model and the other three samples to evaluate it, resulting in four possible splits of the data, for which average per-class classification accuracy is reported. MIT Indoor Scenes [84] contains 6,700 images divided into 67 scene categories. There is one split of the data into train (80%) and test (20%) provided with the dataset, and the evaluation metric is average per-class classification accuracy. PASCAL VOC 2007 [34] contains 9,963 images split across 20 object categories. The dataset provides a standard split into training, validation and test data. Performance is reported in terms of mean average precision (mAP), computed using the TRECVID 11-point interpolation scheme [34].

Local image descriptors and kernels comparison
The goal of this section is to establish which local image descriptors work best in a texture representation. The question is relevant because: (i) while SIFT is the de-facto standard handcrafted feature in object and scene recognition, most authors use specialized descriptors for texture recognition, and (ii) learned convolutional features in CNNs have not yet been compared when used as local descriptors (instead, they have been compared to classical image representations when used in combination with their FC layers).
The experiments are carried out on the task of recognizing describable texture attributes in DTD (Sect. 2) using the BoVW encoder. As a byproduct, the experiments determine the relative difficulty of recognizing the 47 perceptual attributes in DTD. After the BoVW representation is extracted, it is used to train a 1-vs-all SVM using the different kernels discussed in Sect. 4.2.3: linear, Hellinger, additive-χ², and exponential-χ². Kernels are normalized as described before. The exponential-χ² kernel requires choosing the parameter λ; this is set to the reciprocal of the mean of the χ² distance matrix of the training BoVW vectors. Before computing the exponential-χ² kernel, furthermore, BoVW vectors are L1-normalized. An important parameter in BoVW is the number K of visual words. K was varied in the range 256, 512, 1024, 2048, 4096, and performance was evaluated on a validation set. Regardless of the local feature and embedding, performance was found to increase with K and to saturate around K = 4096 (although the relative benefit of increasing K was larger for features such as SIFT and CNNs). Therefore K was set to this value in all experiments.

Analysis. Table 2 reports the classification accuracy of the 47 1-vs-all SVM attribute classifiers, computed as in (1).
As often found in the literature, the best kernel was found to be exponential-χ², followed by additive-χ², Hellinger's, and linear kernels. Among the hand-crafted descriptors, dense SIFT significantly outperforms the best specialized texture descriptor on the DTD data (52.3% for BoVW-exp-χ²-SIFT vs 44% for BoVW-exp-χ²-LM). CNN local descriptors handily outperform handcrafted features by a 10-15% recognition accuracy margin. It is also interesting to note that the choice of kernel function has a much stronger effect for image patches and linear filters (e.g. accuracy nearly doubles moving from BoVW-linear-patches to BoVW-exp-χ²-patches) and an almost negligible effect for the much stronger CNN features. Fig. 5 reports the classification accuracy for each attribute in DTD for the BoVW-SIFT, BoVW-VGG-M, and BoVW-VGG-VD descriptors and the additive-χ² kernel. As may be expected, concepts such as chequered, waffled, knitted, and paisley achieve nearly perfect classification, while others such as blotchy, smeared or stained are far harder.

Conclusions.
The conclusions are that (i) SIFT descriptors significantly outperform texture-specific descriptors such as linear filter banks, patches, and LBP on this texture recognition task, and (ii) learned convolutional local descriptors significantly surpass SIFT.

Pooling encoders
The previous section established the primacy of SIFT and CNN local descriptors; this section compares the pooling encoders of Sect. 4.2 applied to these features. Before pooling local descriptors with a FV, these are usually de-correlated by using PCA whitening. Here PCA is applied to SIFT, additionally reducing its dimension to 80, as this was empirically shown to improve recognition performance. The effect of PCA reduction on the convolutional features is studied in Sect. 5.1.6. The improved version of the FV (Sect. 4.2.3) is used in all the experiments and, similarly, for VLAD we apply the signed square root to the resulting encoding, which is then normalized component-wise (Sect. 4.2.3).

Analysis.
Results are reported in Table 3. In terms of orderless encoders, BoVW and LLC result in similar performance for SIFT, while the difference is slightly larger and in favor of LLC for CNN features. Note that BoVW is used with the Hellinger kernel, which contributes to reducing the gap between BoVW and LLC. IFV and VLAD significantly outperform BoVW and LLC in almost all tasks; FV is definitely better than VLAD with SIFT features and about the same with CNN features. CNN features maintain a healthy lead over SIFT features regardless of the encoder used. Importantly, VLAD and FV (and to some extent BoVW and LLC) perform either substantially better than or as well as the original FC encoders. Some of these observations are confirmed by other experiments, such as Table 4.
Next, we compare using CNN features with an orderless encoder (FV-CNN) as opposed to the standard FC layer (FC-CNN). As seen in Table 3 and Table 4, on PASCAL VOC and MIT Indoor the FC-CNN descriptor performs very well, in line with previous results for this class of methods [21]. FV-CNN performs similarly to FC-CNN on PASCAL VOC, KTH-T2b and FMD, but substantially better on DTD, OS+R, and MIT Indoor (e.g. for the latter, +5% for VGG-M and +13% for VGG-VD).
As a sanity check, results are within 1% of those reported in [20] and [21] for the matching experiments on FV-SIFT and FC-VGG-M. The differences in the case of LLC and BoVW with SIFT are easily explained by the fact that, differently from [20], our present experiments do not use SPP and image augmentation.

Conclusions.
The conclusions of these experiments are that (i) IFV and VLAD are preferable to the other orderless pooling encoders, and (ii) orderless pooling encoders such as the FV are at least as good as, and often significantly better than, FC pooling with CNN features.

CNN descriptor variants comparison
This section conducts additional experiments on CNN local descriptors to find the best variants.

Experimental setup.
The same setup as in the previous section is used. We compare the performance of FC-CNN and FV-CNN local descriptors obtained from VGG-M, VGG-VD, as well as the simpler AlexNet [54] CNN, which is widely adopted in the literature. For this experiment the region support is assumed to be known (and equal to the entire image for all the datasets except OS+R and MSRC+R; for CUB+R, it is set to the bounding box of a bird).

Analysis.
Results are reported in Table 4. The analysis here focuses mainly on the texture and material datasets, but conclusions are similar for the other datasets. In general, VGG-M is better than AlexNet and VGG-VD is substantially better than VGG-M (e.g. on FMD, FC-AlexNet obtains 64.8%, FC-VGG-M 70.3% (+5.5%), and FC-VGG-VD 77.4% (+7.1%)). However, switching from FC to FV pooling improves the performance more than switching to a better CNN (e.g. on DTD, going from FC-VGG-M to FC-VGG-VD yields a 7.1% improvement, while going from FC-VGG-M to FV-VGG-M yields an 11.3% improvement). Combining FV-CNN and FC-CNN (by stacking the corresponding image representations) improves the accuracy by 1-2% for VGG-VD, and by up to 3-5% for VGG-M. There is no significant benefit from adding FV-SIFT as well, as the improvement is at most 1%, and in some cases (MIT, FMD) it degrades the performance.
Next, we analyze in detail the effect of depth on the convolutional features. Fig. 6 reports the accuracy of VGG-M and VGG-VD on several datasets for features extracted at increasing depths. The pooling method is fixed to FV and the number of Gaussian centers K is set such that the overall dimensionality 2KD_k of the descriptor is constant. For both VGG-M and VGG-VD, the improvement with increasing depth is substantial and the best performance is obtained by the deepest features (up to 32% absolute accuracy improvement in VGG-M and up to 48% in VGG-VD). Performance increases at a faster rate up to the third convolutional layer (conv3) and then the rate tapers off somewhat. The performance of the earlier layers in VGG-VD is much worse than that of the corresponding layers in VGG-M. In fact, the performance of VGG-VD matches the performance of the deepest (fifth) layer of VGG-M at conv5_1, which has depth 13.
[Fig. 6: Effect of depth on CNN features. The figure reports the performance of VGG-M (left) and VGG-VD (right) local image descriptors pooled with the FV encoder. For each layer the figure shows the size of the receptive field of the local descriptors (denoted [N × N]) as well as, for some layers, the dimension D of the local descriptors and the number K of visual words in the FV representation (denoted D × K). Curves for PASCAL VOC, MIT Indoor, FMD, and DTD are reported; the performance of SIFT local descriptors is shown as a plus (+) mark.]
Finally, we look at the effect of the number of Gaussian components (visual words) in the FV-CNN representation, testing values in the range 1 to 128 in small (1-16) increments. Results are presented in Fig. 7. While there is a substantial improvement in moving from one Gaussian component to about 64 (up to +15% on DTD and up to +6% on OS), there is little if any advantage in increasing the number of components further.

Conclusions.
The conclusions of these experiments are as follows: (i) deeper models substantially improve performance; (ii) switching from FC to FV pooling has an even more substantial impact, particularly for deeper models; (iii) combining FC and FV pooling has a modest benefit, and there is no benefit in integrating SIFT features; (iv) in very deep models, most of the performance gain is realized in the last few layers.

Dimensionality reduction of the CNN descriptors
This section explores the effect of applying dimensionality reduction to the CNN local descriptors before FV pooling.
This experiment investigates the effect of two parameters: the number of Gaussians in the mixture model used by the FV encoder, and the dimensionality of the convolutional features, which we reduce using PCA. Various local descriptor dimensions are evaluated, from 512 (no PCA) down to 32, reporting mAP on PASCAL VOC 2007 as a function of the pooled descriptor dimension. The latter is equal to 2KD, where K is the number of Gaussian centers and D the dimensionality of the local descriptor after PCA reduction.
Results are presented in Fig. 8 for VGG-M and VGG-VD. It can be noted that, for similar values of the total representation dimensionality 2KD, the performance of PCA-reduced descriptors is a little better than not using PCA, provided that the reduction is compensated by a larger number of GMM components. In particular, similarly to what was observed for SIFT in [79], using PCA improves the performance by 1-2 mAP points; furthermore, reducing descriptors to 64 or 80 dimensions appears to give the best performance.

Visualization of descriptors
In this experiment we are interested in understanding which GMM components in the FV-CNN representation code for a particular concept, as well as in determining which areas of the input image contribute the most to the classification score.
In order to do so, let w be the weight vector learned by an SVM classifier for a target class using the FV-CNN representation as input. We partition w into subvectors w_k, one for each GMM component k, and rank components by decreasing norm ‖w_k‖, matching the intuition that GMM components predictive of the target class will receive larger weights. Having identified the top components for a target concept, the CNN local descriptors are extracted from a test image, the descriptors assigned to a top component are selected, and their location is marked on the image. To simplify the visualization, features are extracted at a single scale.
[Fig. 9: Each image shows the location of the CNN local descriptors that map to the FV-CNN components most strongly associated with the "wrinkled", "studded", "swirly", "bubbly", and "sprinkled" classes for a number of example images in DTD. Red, green and black marks correspond to the top three components selected as described in the text.]
As can be noted in Fig. 9 for some indicative texture types in DTD, the strongest GMM components do tend to fire in correspondence with the characteristic features of each texture. Hence, we conclude that the GMM components, while trained in an unsupervised manner, contain clusters that consistently localize features capturing the distinctive characteristics of different texture types.
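The component-ranking step of this visualization can be sketched as follows, assuming (our layout choice for illustration) that the FV is stored as K per-component blocks of length 2D, concatenating each component's first- and second-order parts:

```python
import numpy as np

def top_components(w, K, D, top=3):
    """Split an SVM weight vector learned on an FV representation into
    K per-component sub-vectors w_k of length 2*D, and return the
    indices of the `top` components with the largest norm ||w_k||."""
    sub = w.reshape(K, 2 * D)
    norms = np.linalg.norm(sub, axis=1)
    return np.argsort(-norms)[:top]
```

The returned indices are the components whose assigned descriptors would then be localized in the test image.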

Evaluating texture representations on different domains
The previous section established optimal combinations of local image descriptors and pooling encoders in texture representations. This section investigates the applicability of these representations to a variety of domains, from textures (Sect. 5.2.1) to object and scene recognition (Sect. 5.2.3). It also emphasizes several practical advantages of orderless pooling compared to fully-connected pooling, including alleviating the problem of domain shift in learned descriptors. This section focuses on problems where the goal is to classify either an image as a whole or a known region of an image; texture segmentation is addressed later, in Sect. 6.3.

Texture recognition
Experiments on textures are divided into recognition in controlled conditions (Sect. 5.2.1.3), where the main sources of variability are viewpoint and illumination; recognition in the wild (Sect. 5.2.1.4), characterized by larger intra-class variations; and recognition in the wild and in clutter (Sect. 5.2.1.5), where textures are a small portion of a larger scene.

Datasets and evaluation measures.
In addition to the datasets evaluated in Sect. 5.1 (DTD, OS+R, FMD and KTH-T2b), we consider here the standard benchmarks for texture recognition. CUReT [29] (5,612 images, 61 classes), UIUC [57] (1,000 images, 25 classes), and KTH-TIPS [17] (810 images, 10 classes) were collected in controlled conditions, by photographing the same instance of a material under varying scale, viewing angle and illumination. UMD [108] consists of 1,000 images spread across 25 classes, but was collected in uncontrolled conditions. For these datasets we follow the standard evaluation procedure: half of the images are used for training and the remaining half for testing, and we report accuracy averaged over 10 random splits. The ALOT dataset [17] is similar to the existing texture datasets but significantly larger, with 250 categories. For our experiments we used the protocol of [94], using 20 images per class for training and the rest for testing.

Experimental setup.
For the recognition tasks described in the following subsections, we compare the SIFT, VGG-M, and VGG-VD local descriptors and the FC and FV pooling encoders, as these were determined before to be among the best-performing texture representations. Combinations of such descriptors are evaluated as well.
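As a concrete reference for the FV pooling encoder used throughout these experiments, the following is a minimal NumPy sketch of the improved Fisher vector (first- and second-order GMM statistics followed by signed square-root and L2 normalization). The function name and the plain-loop implementation are our own illustrative choices, not the paper's code; for FV-CNN, the rows of `X` would be the CNN convolutional-layer activations at all spatial positions and scales.

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """Improved Fisher vector of local descriptors X (N x D) under a
    diagonal-covariance GMM with K components.
    weights: (K,) mixture weights; means: (K, D); sigmas: (K, D) std devs."""
    N, D = X.shape
    K = weights.shape[0]
    # Soft-assignment posteriors q[i, k] from Gaussian log-densities
    log_prob = np.empty((N, K))
    for k in range(K):
        z = (X - means[k]) / sigmas[k]
        log_prob[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(z * z, axis=1)
                          - np.sum(np.log(sigmas[k]))
                          - 0.5 * D * np.log(2 * np.pi))
    log_prob -= log_prob.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(log_prob)
    q /= q.sum(axis=1, keepdims=True)
    # First- and second-order statistics per GMM component
    parts = []
    for k in range(K):
        z = (X - means[k]) / sigmas[k]                # (N, D)
        qk = q[:, k:k + 1]
        d_mu = (qk * z).sum(axis=0) / (N * np.sqrt(weights[k]))
        d_sig = (qk * (z * z - 1)).sum(axis=0) / (N * np.sqrt(2 * weights[k]))
        parts += [d_mu, d_sig]
    phi = np.concatenate(parts)                       # 2 * K * D dimensions
    # Signed square-root ("power") and L2 normalization
    phi = np.sign(phi) * np.sqrt(np.abs(phi))
    return phi / np.linalg.norm(phi)
```

The resulting vector has 2KD dimensions, which is why Sect. 5.1.4.1 uses a much smaller codebook for FV than for BoVW.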

5.2.1.3
Texture recognition in controlled conditions. This paragraph evaluates texture representations on datasets collected under controlled conditions (Table 4, section a).
In material recognition, KTH-T2b and ALOT offer a somewhat more interesting challenge. First, there is a significant difference between FC-CNN and FV-CNN (3-6% absolute in KTH-T2b and 8-10% in ALOT), consistent across all the CNNs evaluated. Second, CNN descriptors are significantly better than SIFT on KTH-T2b and ALOT, with absolute accuracy gains of up to 11%.
Compared to the state of the art, FV-SIFT is generally very competitive. On KTH-T2b, FV-SIFT outperforms all recent methods [24] with the exception of [94], which is based on a variant of LBP. The latter is very strong on ALOT too, but there FV-SIFT is virtually as good. On KTH-T2b, [94] is better than most of the deep descriptors as well, but it is still significantly bested by FV-VGG-VD (+5.5%). Nevertheless, this is an example in which a specialized texture descriptor can be competitive with deep features, although of course deep features apply unchanged to several other problems.
On ALOT, FV-CNN with VGG-VD is on par with the result obtained by [6] (98.45%), but their model was trained with 30 images per class instead of 20. The same paper reports even better results, but only when training with 50 images per class or when integrating additional synthetic training data.

5.2.1.4
Texture recognition in the wild. This paragraph evaluates the texture representations on two texture datasets collected "in the wild": FMD (materials) and DTD (describable attributes).
Texture recognition in the wild is more comparable, in terms of the type of intra-class variations, to object recognition than to texture recognition in controlled conditions. Hence, one can expect larger gains in moving from texture-specific descriptors to general-purpose descriptors. This is confirmed by the results. SIFT is competitive with AlexNet and VGG-M features on FMD (within 3% accuracy), but it is significantly worse on DTD (+4.3% for FV-AlexNet and +8.2% for FV-VGG-M). FV-CNN is a little better than FC-CNN (∼3%) on FMD and substantially better on DTD (∼8%). Different CNN architectures exhibit very different performance: moving from AlexNet to VGG-VD, the absolute accuracy improvement is more than 11% across the board.
Compared to the state of the art, FV-SIFT is generally very competitive, outperforming the specialized texture descriptors developed by [83,88] on FMD (and this without using the ground-truth texture segmentations used by [88]). FV-VGG-VD is significantly better still than all these descriptors (+24.7%).
In terms of complementarity of the features, the combination of FC-CNN and FV-CNN improves performance by about 3% across the board, but including FV-SIFT as well (labelled FV-SIFT/FC+FV-VD in the table) does not seem to improve performance further. This is in contrast with the fact that SIFT was found to be fairly complementary to FC-CNN on a variant of AlexNet in [24].

Results for recognition in clutter (OS+R and OSA+R) are reported in Table 4, sections b and c. As before, performance improves with the depth of the CNNs. For example, in material recognition (OS+R) accuracy starts at about 39.1% for FV-SIFT, is about the same for FC-VGG-M (41.3%), and is a little better for FC-VGG-VD (43.4%). However, the benefit of switching from FC encoding to FV encoding is now even more dramatic: on OS+R, FV-VGG-M has accuracy 52.5% (+11.2%) while FV-VGG-VD reaches 59.5% (+16.1%). This clearly demonstrates the advantage of orderless pooling of CNN local descriptors over FC pooling when regions of different sizes and shapes must be evaluated. There is also a significant computational advantage (evaluated further in Sect. 5.2.3) if, as is typical, several regions must be classified: in that case, the CNN features need not be recomputed for each region. Results on OSA+R are entirely analogous.

Object and scene recognition
This section evaluates texture descriptors on tasks other than texture recognition, namely coarse and fine-grained object categorization, scene recognition, and semantic region recognition.

Datasets and evaluation measures.
In addition to the datasets seen before, here we experiment with fine-grained recognition on the CUB [102] dataset. This dataset contains 11,788 images representing 200 species of birds. The images are split approximately in half for training and half for testing, according to the list that accompanies the dataset. Image representations are applied either to the whole image (denoted CUB) or to the region containing the target bird, using ground-truth bounding boxes (CUB+R). Performance in CUB and CUB+R is reported as per-image classification accuracy. For this dataset the local descriptors are again extracted at multiple scales, but now only for the smaller range {0.5, 0.75, 1}, which was found to work better for this task.
Performance is also evaluated on the MSRC dataset, designed to benchmark semantic segmentation algorithms. The dataset contains 591 images, for which some pixels are labelled with one of 23 classes. To be consistent with the results reported in the literature, performance is reported in terms of per-pixel classification accuracy, similar to the measure used for the OS task defined in Sect. 3.1, but not normalized per class.

Compared to the state of the art, the best result obtained on PASCAL VOC is very competitive (85.2% vs 84.9% mAP), but is obtained using a much more straightforward pipeline. On MIT Places the best performance is also substantially superior (+10%) to the current state of the art using deep convolutional networks trained on the MIT Places dataset [111] (this is discussed further below). On the CUB dataset, the best performance is short (∼6%) of the state-of-the-art results of [109]. However, [109] uses a category-specific part detector and corresponding part descriptor, as well as a CNN fine-tuned on the CUB data; by contrast, FV-CNN and FC-CNN are used here as global image descriptors which, furthermore, are the same for all the datasets considered. Compared to the results of [109] without part-based descriptors (but still using a part-based object detector), the best of our global image descriptors performs substantially better (62.1% vs 67.3%).
Results on MSRC+R for semantic segmentation are entirely analogous; it is worth noting that, although ground-truth segments are used in this experiment and hence this number is not comparable with others reported in the literature, the best model achieves an outstanding 99.1% per-pixel classification rate on this dataset.

Conclusions.
The conclusion of this section is that FV-CNN, although inspired by texture representations, is superior to many alternative descriptors in object and scene recognition, including more elaborate constructions. Furthermore, FV-CNN is significantly superior to FC-CNN in this case as well.

Domain transfer
This section investigates in more detail the problem of domain transfer in CNN-based features. So far, the same underlying CNN features, trained on ImageNet's ILSVRC data, were used in all cases. To investigate the effect of the source domain on performance, this section considers, in addition to these networks, new ones trained on the Places dataset [111], comprising about 2.5 million labelled images, to recognize scenes. [111] showed that, applied to the task of scene recognition on MIT Indoor, these features outperform similar ones trained on ILSVRC (denoted CAFFE [50] below), a fact explained by the similarity of domains. We repeat this experiment using FC- and FV-CNN descriptors on top of VGG-M, VGG-VD, PLACES, and CAFFE.
Results are shown in Table 5. The FC-CNN performance is in line with that reported in [111]: in scene recognition with FC-CNN, the same CNN architecture performs better if trained on the Places dataset instead of the ImageNet data (58.6% vs 65.0% accuracy). Nevertheless, stronger CNN architectures such as VGG-M and VGG-VD can approach and outperform PLACES even if trained on ImageNet data (65.0% vs 62.5%/67.6%).
However, when it comes to using the filter banks with FV-CNN, the conclusions are very different. First, FV-CNN outperforms FC-CNN in all cases, with substantial gains of up to ∼11-12% when transferring from ImageNet to MIT Indoor. The gap between FC-CNN and FV-CNN is highest for the VGG-VD models (67.6% vs 81.0%, nearly a 14% difference), a trend also exhibited by the other datasets, as seen in Table 4. Second, the advantage of using domain-specific CNNs disappears. In fact, the same CAFFE model that is 6.4% worse than PLACES with FC-CNN is actually 2.1% better when used with FV-CNN. The conclusion is that FV-CNN appears to be immune, or at least substantially less sensitive, to domain shifts.
Our explanation of this phenomenon is that the convolutional features are substantially less committed to a specific dataset than the fully connected layers.Hence, by using those, FV-CNN tends to be a lot more general than FC-CNN.A second explanation is that PLACES CNN may learn filters that tend to capture the overall spatial structure of the image, whereas CNNs trained on ImageNet tend to focus on localized attributes which may work well with orderless pooling.
Finally, we compare FV-CNN to alternative CNN pooling techniques in the literature. The closest method is that of [43], which uses a similar underlying CNN to extract local image descriptors and VLAD instead of FV for pooling. Notably, however, FV-CNN results on MIT Indoor are markedly better than theirs for both VGG-M and VGG-VD (68.8% vs 74.2%/81.0%, respectively) and marginally better (69.7%; Tables 4 and 5) when the same CAFFE CNN is used. Also, when using VLAD instead of FV for pooling the convolutional-layer descriptors, the performance of our method is still better (68.8% vs 71.2%), as seen in Table 3.
The key difference is that FV-CNN pools convolutional features, whereas [43] pools fully-connected descriptors extracted from square image patches. Thus, even without the spatial information used by [43], FV-CNN is not only substantially faster (an 8.5× speedup when using the same network and three scales) but at least as accurate.

Semantic segmentation
The previous sections considered the problem of recognizing given image regions. This section explores instead the problem of automatically recognizing as well as segmenting such regions in the image.

Experimental setup
Inspired by Cimpoi et al. [24], who successfully ported object description methods to texture descriptors, here we propose a segmentation technique building on ideas from object detection. An increasingly popular approach to object detection, followed for example by [42], is to first propose a number of candidate object regions using low-level image cues, and then to verify a shortlist of such regions using a powerful classifier. Applied to textures, this requires a low-level mechanism to generate textured region proposals, followed by a region classifier. A key advantage of this approach is that it allows applying object-like (FC-CNN) and texture-like (FV-CNN) descriptors alike. After proposal classification, each pixel can be assigned more than one label; this is resolved with a simple voting scheme, also inspired by object detection methods.
The paper explores two such region generation methods: the crisp regions of [47] and the Multiscale Combinatorial Grouping (MCG) of [5]. In both cases, region proposals are generated using low-level image cues, such as color or texture consistency, as specified by the original methods. It would of course be possible to incorporate FC-CNN and FV-CNN among these energy terms, potentially strengthening the region generation mechanism itself. However, this partially contradicts the logic of the scheme, which breaks the problem down into cheaply generating tentative segmentations and then verifying them using a more powerful (and likely more expensive) model. Furthermore, and more importantly, these cues focus on separating texture instances as they appear in each particular image, whereas FC-CNN and FV-CNN are meant to identify a texture class. It is reasonable to expect instance-specific cues (say, the color of a painted wall) to be better for segmentation.
The crisp-region method generates a single partition of the image; hence, individual pixels are labelled by transferring the label of the corresponding region, as determined by the learned predictor. By contrast, MCG generates many thousands of overlapping region proposals per image and requires a mechanism to resolve potentially ambiguous pixel labellings. This is done using the following simple scheme. For each proposed region, its label is set to the highest-scoring class based on the multi-class SVM, and its score to the corresponding class score divided by the region area. Proposals are then sorted by increasing score and "pasted" onto the image sequentially. Since later pastes overwrite earlier ones, this has the effect of considering larger regions before smaller ones and, for regions of the same area, more confident regions after less confident ones.
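The pasting scheme above can be sketched in a few lines; `paste_proposals` and its arguments are hypothetical names for illustration, assuming boolean region masks and precomputed per-region labels and (class score / area) values.

```python
import numpy as np

def paste_proposals(shape, proposals, scores, labels):
    """Resolve overlapping region proposals into a single pixel labelling.
    proposals: list of boolean masks (H x W); scores: per-region winning
    class score divided by region area; labels: winning class per region.
    Regions are pasted in order of increasing score, so larger (hence
    lower-scoring) regions go first and, at equal area, more confident
    regions overwrite less confident ones."""
    out = np.full(shape, -1, dtype=int)   # -1 marks unlabelled pixels
    order = np.argsort(scores)            # increasing score
    for i in order:
        out[proposals[i]] = labels[i]
    return out
```

With this ordering, a small confident region pasted last wins over a large region that covers it.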

Dense-CRF post-processing
The segmentation results delivered by the previous methods can potentially be hampered by occasional failures of the respective front-end superpixel segmentation modules. But we can see the front-end segmentation as providing a convenient way of pooling discriminative information, which can then be refined post hoc through a pixel-level segmentation algorithm.
In particular, a series of recent works [9,23,110] have reported that substantial gains can be obtained by combining CNN classification scores with the densely-connected Conditional Random Field (Dense-CRF) of [53]. Apart from its ability to incorporate information pertaining to image boundaries and color similarity, the Dense-CRF is particularly efficient when used in conjunction with approximate probabilistic inference: the message-passing updates under a fully decomposable mean-field approximation can be expressed as convolutions with a Gaussian kernel in feature space, implemented efficiently using high-dimensional filtering [1].
Inspired by these advances, we have employed the Dense-CRF segmentation algorithm post hoc, with the aim of enhancing our algorithm's ability to localize region boundaries by taking context and low-level image information into account. For this we turn the superpixel classification scores into pixel-level unary terms, interpreting the SVM classifier's scores as the negative energy associated with labelling each pixel with the respective label. Even though Platt scaling could be used to turn the SVM scores into log-probability estimates, we prefer to estimate the transformation by jointly cross-validating the parameters of the SVM-Dense-CRF cascade. In particular, similarly to [23,53], we set the Dense-CRF hyperparameters by cross-validation, performing a grid search to find the values that perform best on a validation set.
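A minimal sketch of the unary-term construction described above, assuming per-superpixel SVM scores and a superpixel index map; the single scaling factor `alpha` is a simplification standing in for the jointly cross-validated score-to-energy transformation, and all names are illustrative.

```python
import numpy as np

def unary_from_svm(region_scores, region_map, alpha=1.0):
    """Turn per-superpixel SVM class scores into pixel-level unary
    energies for a Dense-CRF: the unary of class c at pixel p is the
    negated, scaled score of p's superpixel for c.
    region_scores: (R, C) scores for R superpixels and C classes;
    region_map: (H, W) array of superpixel indices;
    alpha: score-to-energy scaling, tuned by cross-validating the
    whole SVM + Dense-CRF cascade rather than by Platt scaling."""
    return -alpha * region_scores[region_map]   # shape (H, W, C)
```

The resulting (H, W, C) energy volume would then be handed to a Dense-CRF inference routine together with the pairwise (boundary/color) terms.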

Analysis
Results are reported in Table 6. Two datasets are evaluated: OS for material recognition and MSRC for things & stuff. Compared to OS+R, classifying crisp regions results in a drop of about 10% per-pixel classification accuracy for all descriptors; at the same time, this shows that there is ample space for future improvements. On MSRC the best accuracy is 87.0%, just a hair above the best published result of 86.5% [56]. Remarkably, these algorithms use neither dataset-specific training nor CRF-regularised semantic inference: they simply greedily classify regions obtained from a general-purpose segmentation algorithm. CRF post-processing improves the results even further, up to 90.2% on MSRC. Qualitative segmentation results (sampled at random) are given in Figs. 10 and 11.
Results using FV-CNN on MCG proposals are shown in Table 6 in brackets (due to the requirement of computing CNN features from scratch for every region, it was impractical to use FC-CNN with MCG proposals). The results are comparable to those using crisp regions, with 55.7% accuracy on the OS dataset. Other schemes, such as non-maximum suppression of overlapping regions, that are quite successful for object segmentation [44] performed rather poorly in this case. This is probably because, unlike objects, texture information is fairly localized and highly irregularly shaped in an image.
While for recognizing textures, materials, or objects covering the entire image the difference in cost between FC-CNN and FV-CNN is not significant (the latter merely evaluates a few layers fewer), the advantage of FV-CNN becomes clear for segmentation tasks, as FC-CNN requires recomputing the features for every region proposal.

Applications of describable texture attributes
This section explores two applications of the DTD attributes: using them as general-purpose texture descriptors (Sect. 7.1) and as a tool for search and visualization (Sect. 7.2).

Describable attributes as generic texture descriptors
This section explores using the 47 describable attributes of Sect. 2 as a general-purpose texture descriptor. The first step in this construction is to learn a multi-class predictor for the 47 attributes; this predictor is trained on DTD using a texture representation of choice and a multi-class linear SVM, as before. The second step is to evaluate the multi-class predictor to obtain a 47-dimensional descriptor (of class scores) for each image in a target dataset. In this manner one obtains a novel and very compact representation, which is then used to learn a multi-class non-linear SVM classifier, for example for material recognition.

Table 7: DTD for material recognition. Accuracy on material recognition on the KTH-T2b and FMD benchmarks obtained by using as image representation the predictions of the 47 DTD attributes by different methods: FV-SIFT, FV-CNN (using either VGG-M or VGG-VD), or combinations. Accuracies are compared to published state-of-the-art results.
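The two-step construction can be sketched as follows, assuming scikit-learn and placeholder feature arrays standing in for the FV/CNN representations (all names are illustrative, not the paper's code):

```python
# Sketch of the two-step DTD-attribute descriptor. dtd_features/dtd_labels
# and target_features are placeholder arrays standing in for the texture
# representations (e.g. FV-SIFT or FV-CNN) of the DTD and target images.
from sklearn.svm import LinearSVC

def attribute_descriptors(dtd_features, dtd_labels, target_features):
    # Step 1: train a 47-way linear predictor on DTD
    attr_clf = LinearSVC().fit(dtd_features, dtd_labels)
    # Step 2: the 47 class scores become a compact image descriptor
    return attr_clf.decision_function(target_features)   # shape (N, 47)
```

The 47-dimensional descriptors would then feed a second classifier on the target dataset; at this dimensionality a non-linear (RBF) SVM, e.g. `SVC(kernel='rbf')`, is relatively cheap, which is exploited below.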
Results are reported in Table 7 for material recognition on FMD and KTH-T2b. There are two important factors in this experiment. The first is the choice of DTD attribute predictor. Here the best texture representations found before are evaluated: FV-SIFT, FC-CNN, and FV-CNN (using either VGG-M or VGG-VD local descriptors), as well as their combinations. The second is the choice of classifier used to predict a texture material from the 47-dimensional vector of describable attributes; this is either a linear or an RBF SVM.
Using a linear SVM and FV-SIFT to predict the DTD attributes yields promising results: 64.7% classification accuracy on KTH-T2b and 49.2% on FMD. The latter outperforms the specialized aLDA model of [88] combining color, SIFT, and edge-slice features, whose accuracy is 44.6%. Replacing SIFT with CNN image descriptors (FV-CNN) improves results significantly on FMD (49.2% vs 62.8% for VGG-M and 70.8% for VGG-VD) as well as on KTH-T2b (64.7% vs 67.4% and 74.6%, respectively). While these results are not as good as using the best texture representations directly on these datasets, remarkably the dimensionality of the DTD descriptors is two orders of magnitude smaller than all the alternatives.
An advantage of the small dimensionality of the DTD descriptors is that using an RBF classifier instead of a linear one is relatively cheap. Doing so improves performance by 1-3% on both FMD and KTH-T2b across experiments. Overall, the best result of the DTD features on KTH-T2b is 77.1% accuracy, slightly better than the state-of-the-art accuracy of 76.0% of [94]. On FMD the DTD features significantly outperform the state of the art []: 72.17% vs 57.7% accuracy, an improvement of about 15%.
The final experiment compares with the semantic attributes of [70] on the Outex data. Using FV-SIFT and a linear classifier to predict the DTD attributes, performance on the retrieval experiment of [70] is 49.82% mAP, which is not competitive with their result of 63.3% obtained using LBPu (Sect. 4.1). To verify whether this was due to LBPu being particularly well suited to the Outex data, the DTD attributes were trained again using FV on top of the LBPu local image descriptors; doing so, the 47 attributes achieve 64.5% mAP on Outex, while at the same time Table 2 shows that LBPu is not a competitive predictor on DTD itself. This confirms the advantage of LBPu on the Outex dataset.

Search and visualization
This section includes a short qualitative evaluation of the DTD attributes. Perhaps their most appealing property is interpretability; to verify that semantics transfer in a reasonable way across domains, Fig. 12 shows an excellent semantic correlation between the ten categories in KTH-T2b and the attributes in DTD. For example, aluminum foil is found to be wrinkled, while bread is found to be bumpy, pitted, porous, and flecked.
As an additional application of our describable texture attributes, we compute them on a large dataset of 10,000 wallpapers and bedding sets from houzz.com. The 47 attribute classifiers are learned as in Sect. 5 using the FV-SIFT representation and then applied to the 10,000 images to predict the strength of association between each attribute and each image. Classifier scores are re-calibrated on the target data and converted to probabilities by rescaling them to have a maximum value of one over the whole dataset. Fig. 13 shows some example attribute predictions, obtained by selecting, for a number of attributes, an image with a score close to 1 (excluding images used for calibrating the scores) and then listing its two next-best attribute matches. The top two matches tend to be a very good description of each texture or pattern, while the third is a good match in about half of the cases.
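The re-calibration step described above amounts to a per-attribute rescaling. A minimal sketch, assuming each attribute's maximum score over the dataset is positive (the function name is illustrative):

```python
import numpy as np

def calibrate(scores):
    """Re-calibrate raw per-attribute classifier scores on the target
    data by rescaling each attribute's scores so that its maximum over
    the whole dataset equals one (a crude stand-in for probability
    calibration). scores: (N images, 47 attributes)."""
    return scores / scores.max(axis=0, keepdims=True)
```

After this rescaling, an image scoring close to 1 for an attribute is, by construction, among the strongest matches for that attribute in the dataset.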

Conclusions
In this paper we have introduced a dataset of 5,640 images collected "in the wild" that have been jointly labelled with 47 describable texture attributes, and we have used this dataset to study the problem of extracting semantic properties of textures and patterns, addressing real-world, human-centric applications. We have also introduced a novel analysis of material and texture attribute recognition in a large dataset of textures in clutter, derived from the excellent OpenSurfaces dataset. Finally, we have analyzed texture representations in relation to modern deep neural networks. The main finding is that orderless pooling of convolutional neural network features is a remarkably good texture descriptor, versatile enough to double as a scene and object descriptor, resulting in new state-of-the-art performance in several benchmarks.

Fig. 2 :
Fig. 2: The 47 texture words in the describable texture dataset introduced in this paper. Two examples of each attribute are shown to illustrate the significant amount of variability in the data.

Fig. 3 :
Fig. 3: Quality of sequential joint annotations. Each bar shows the average number of occurrences of a given attribute in a DTD image. The horizontal dashed line corresponds to a frequency of 1/47, the minimum given the design of DTD (Sect. 2.2). The black portion of each bar is the number of attributes discovered by the sequential procedure, using only 10 annotations per image (about one fifth of the effort required for exhaustive annotation). The orange portion shows the additional recall obtained by integrating cross-validation into the process. Right: co-occurrence of attributes. The matrix shows the joint probability p(q, q′) of two attributes occurring together (rows and columns are sorted in the same way as the left image).

Fig. 5 :
Fig. 5: Per-class classification accuracy on the DTD data, comparing three local image descriptors: SIFT, VGG-M, and VGG-VD. For all three local descriptors, BoVW with 4096 visual words was used. Classes are sorted by increasing BoVW-CNN-VD accuracy (this number is reported along each bar).

Fig. 7 :
Fig. 7: Effect of the number of Gaussian components in the FV encoder. The figure shows the performance of the FV-VGG-M and FV-VGG-VD representations on the OS and DTD datasets when the number of Gaussian components in the GMM is varied from 1 to 128. Note that the abscissa is scaled logarithmically.

Fig. 8: PCA-reduced FV-CNN. The figure reports the performance of VGG-M (left) and VGG-VD (right) local descriptors on PASCAL VOC 2007 when reducing their dimensionality from 512 to as low as 32 using PCA, in combination with a variable number of GMM components. The horizontal axis reports the total descriptor dimensionality, proportional to the product of the local descriptor dimensionality and the number of GMM components.

Fig. 9:

Fig. 10 :
Fig. 10: OS material recognition results. Example test image with material recognition and segmentation on the OS dataset. (a) Original image. (b) Ground-truth segmentations from the OpenSurfaces repository (note that not all pixels are annotated). (c) FC-CNN and crisp-region proposal segmentation results. (d) Correctly (green) and incorrectly (red) predicted pixels (restricted to the annotated ones). (e-f) The same, but for FV-CNN.

Fig. 12: Descriptions of materials from the KTH-T2b dataset. These words are the most frequent top-scoring texture attributes (from the list of 47 we proposed) when classifying the images from the KTH-T2b dataset. The descriptions are obtained by considering the whole material category, while a single image per material is shown for visualization.

Fig. 13: Example attribute predictions, e.g. cobwebbed (1.00), perforated (0.39), cracked (0.23).

Table 1 :
Comparison of existing texture datasets, in terms of size, collection condition, nature of the classes to be recognized, and whether each class includes a single object/material instance or several instances of the same category.Note that Outex is a meta-collection of textures spanning different datasets and problems.

Table 2:
Comparison of local features and kernels on the DTD data. The table reports classification accuracies.

uses 50 images per class for training and the remaining 50 for testing, and reports classification accuracy averaged over 14 splits. KTH-T2b [68] contains 4,752 images, grouped into 11 material categories.

Table 3 :
Pooling encoder comparisons. The table compares the orderless pooling encoders BoVW, LLC, VLAD, and IFV with either SIFT local descriptors or VGG-M CNN local descriptors (FV-CNN). It also compares pooling convolutional features with the CNN fully-connected layers (FC-CNN). The table reports classification accuracies for all datasets except VOC07 and OS+R, for which mAP-11 [34] and mAP are reported, respectively.
5.1.4.1 Experimental setup. The experimental setup is similar to the previous experiment: the same SIFT and CNN VGG-M descriptors are used; BoVW is used in combination with the Hellinger kernel (the exponential variant is slightly better, but much more expensive); the same K = 4096 codebook size is used with LLC. VLAD and FV use a much smaller codebook, as these representations multiply the dimensionality of the descriptors (Sect. 5.1.1). Since SIFT and CNN features are respectively 128- and 512-dimensional, K is set to 256 and 64, respectively. The impact of varying the number of visual words in the FV representation is analyzed further in Sect. 5.1.5.
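The codebook choices above keep the encoder dimensionalities comparable; as a quick check of the arithmetic (FV stores first- and second-order statistics, 2KD values; VLAD first-order only, KD; BoVW one histogram bin per visual word, K — the helper names are ours):

```python
# Total descriptor dimensionality implied by codebook size K and
# local-descriptor dimensionality D for each pooling encoder.
def fv_dim(K, D):
    return 2 * K * D   # first- and second-order GMM statistics

def vlad_dim(K, D):
    return K * D       # first-order statistics only

print(fv_dim(256, 128))    # FV-SIFT:  2*256*128 = 65536
print(fv_dim(64, 512))     # FV-CNN:   2*64*512  = 65536
print(vlad_dim(256, 128))  # VLAD-SIFT: 256*128  = 32768
```

With K = 256 for 128-dimensional SIFT and K = 64 for 512-dimensional CNN descriptors, the FV representations end up with exactly the same dimensionality, making the comparison between local descriptors fair.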

Table 5 :
Accuracy of various CNNs on the MIT Indoor dataset. PLACES and CAFFE are the same CNN architecture ("AlexNet"), but trained on different datasets (Places and ImageNet, respectively). The domain-specific advantage of training on Places disappears when the convolutional features are used with FV pooling. For all architectures FV-CNN outperforms FC-CNN, and better architectures lead to better overall performance.

Texture recognition in the wild and in clutter. This paragraph evaluates the recognition of materials and describable attributes in clutter. Since there is no standard benchmark for this setting, we introduce here the first analysis of this kind using the OS+R and OSA+R datasets of Sect. 3.1. Recall that the +R suffix indicates that, while textures are imaged in clutter, the classifier is given the ground-truth region segmentation; therefore, the goal of this experiment is to evaluate the effect of realistic viewing conditions on texture recognition, while the problem of segmenting the textures is evaluated later, in Sect. 6.3.

Analysis. Results are reported in Table 4, section d. On PASCAL VOC, MIT Indoor, CUB, and CUB+R the relative performance of the different descriptors is similar to what has been observed above for textures. Compared to the state-of-the-art results in each dataset, FC-CNN and particularly the FV-CNN descriptors are very competitive. The best result obtained in PASCAL VOC is comparable to the current state of the art, set by the deep learning method of [107].