From BoW to CNN: Two Decades of Texture Representation for Texture Classification

Texture is a fundamental characteristic of many types of images, and texture representation is one of the essential and challenging problems in computer vision and pattern recognition which has attracted extensive research attention. Since 2000, texture representations based on Bag of Words (BoW) and on Convolutional Neural Networks (CNNs) have been extensively studied with impressive performance. Given this period of remarkable evolution, this paper aims to present a comprehensive survey of advances in texture representation over the last two decades. More than 200 major publications are cited in this survey covering different aspects of the research, which includes (i) problem description; (ii) recent advances in the broad categories of BoW-based, CNN-based and attribute-based methods; and (iii) evaluation issues, specifically benchmark datasets and state of the art results. In retrospect of what has been achieved so far, the survey discusses open challenges and directions for future research.


Introduction
Our visual world is richly filled with a great variety of textures, present in images ranging from multispectral satellite data to microscopic images of tissue samples (see Fig. 1). As a powerful visual cue, like color, texture provides useful information in identifying objects or regions of interest in images. Texture is different from color in that it refers to the spatial organization of a set of basic elements or primitives (i.e., textons), the fundamental microstructures in natural images and the atoms of preattentive human visual perception [93]. A textured region will obey some statistical properties, exhibiting periodically repeated textons.
As a longstanding, fundamental and challenging problem in the fields of computer vision and pattern recognition, texture analysis has been a topic of intensive research since the 1960s [92] due to its significance both in understanding how the texture perception process works in human vision and in the important role it plays in a wide variety of applications. The analysis of texture traditionally embraces several problems including classification, segmentation, synthesis and shape from texture [224]. Significant progress has been made since the 1990s in the first three areas, with shape from texture receiving comparatively less attention. Typical applications of texture analysis include medical image analysis [48,158,177], quality inspection [250], content based image retrieval [148,216,266], analysis of satellite or aerial imagery [96,83], face analysis [4,49,215,265], biometrics [137,185], object recognition [210,173,263], texture synthesis for computer graphics and image compression [65,66], and robot vision and autonomous navigation for unmanned aerial vehicles. The ever-increasing amount of image and video data due to surveillance, handheld devices, medical imaging, robotics, etc. offers an endless potential for further applications of texture analysis.
Texture representation, i.e., the extraction of features that describe texture information, is at the core of texture analysis. After over five decades of continuous research, many kinds of theories and algorithms have emerged, with major surveys and some representative work as follows. The majority of texture features before 1990 can be found in surveys and comparative studies [39,79,160,195,224,234,245]. Tuceryan and Jain [224] identified five major categories of features for texture discrimination: statistical, geometrical, structural, model based, and filtering based features. In 1996, Ojala et al. [161] carried out a comparative study to evaluate the classification performance of several texture features. In 1999, Randen and Husøy [190] reviewed most major filtering based texture features and performed a comparative performance evaluation for texture segmentation. In 2002, Zhang and Tan [262] reviewed invariant texture analysis methods. The "Handbook of Texture Analysis" edited by Mirmehdi et al. [157] contains representative work on texture analysis, from 2D to 3D, from feature extraction to synthesis, and from texture image acquisition to classification. The book "Computer Vision Using Local Binary Patterns" by Pietikäinen et al. [185] in 2011 provides an excellent overview of the theory of Local Binary Patterns (LBP) and their use in solving various kinds of problems in computer vision, especially in biomedical applications and biometric recognition systems. Huang et al. [86] presented a review of the LBP variants in the application area of facial image analysis. The book "Local Binary Patterns: New Variants and Applications" by Brahnam et al. [21] in 2014 is a collection of several new LBP variants and their applications to face recognition. More recently, Liu et al. [132] presented a taxonomy of recent LBP variants and performed a large scale performance evaluation of forty texture features. Researchers [189,5] presented a review of exemplar based texture synthesis approaches.
The published surveys [39,79,160,194,195,161,183,224,234] mainly reviewed or compared methods prior to 1995. Similarly, the articles [190,262] only covered approaches before 2000. There are more recent surveys [21,86,132,185]; however, they focused exclusively on texture features based on LBP. The emergence of many powerful texture analysis techniques has given rise to a further increase in research activity in texture research since 2000, yet none of these published surveys provides an extensive survey over that time. Given recent developments, we believe that there is a need for an updated survey, motivating this present work. A thorough review and survey of existing work, the focus of this paper, will contribute to more progress in texture analysis. Our goal is to overview the core tasks and key challenges in texture representation approaches, to define taxonomies of representative approaches, to provide a review of texture datasets, and to summarize the performance of the state of the art on publicly available datasets. According to the different visual representations, this survey categorizes the texture representation literature into three broad types: BoW-based, CNN-based, and attribute-based. The BoW-based methods are organized according to their key components. The CNN-based methods are categorized into one of pretrained CNN models, finetuned CNN models, or handcrafted deep convolutional networks.
The remainder of this paper is organized as follows. Related background, including the problem and its applications, the progress made during the past decades, and the challenges of the problem, is summarized in Section 2. In Sections 3 to 5 we give a detailed review of texture representation techniques for texture classification, providing a taxonomy to more clearly group the prominent alternatives. A summary of benchmark texture databases and state of the art performance is given in Section 6. Section 7 concludes the paper with a discussion of promising directions for texture representation.

The Problem
Texture analysis can be divided into four areas: classification, segmentation, synthesis, and shape from texture [224]. Texture classification [110,125,224,236,237] deals with designing algorithms for declaring a given texture region or image as belonging to one of a set of known texture categories for which training samples have been provided. Texture classification may also be a binary hypothesis testing problem, such as differentiating a texture as being within or outside of a given class, for example distinguishing between healthy and pathological tissues in medical image analysis. The goal of texture segmentation is to partition a given image into disjoint regions of homogeneous texture [89,147,194,210]. Texture synthesis is the process of generating new texture images which are perceptually equivalent to a given texture sample [56,65,186,189,243,270]. As textures provide powerful shape cues, approaches for shape from texture attempt to recover the three dimensional shape of a textured object from its image. It should be noted that the concept of "texture" may have different connotations or definitions depending on the given objective. Classification, segmentation, and synthesis are closely related and widely studied, with shape from texture receiving comparatively less attention. Nevertheless, texture representation is at the core of these four problems. Texture representation, together with texture classification, will form the primary focus of this survey.
As a classical pattern recognition problem, texture classification primarily consists of two critical subproblems: texture representation and classification [90]. It is generally agreed that the extraction of powerful texture features plays a relatively more important role, since if poor features are used even the best classifier will fail to achieve good results. While this survey is not explicitly concerned with texture synthesis, studying synthesis can be instructive, for example, classification of textures via analysis by synthesis [65] in which a model is first constructed for synthesizing textures and then inverted for the purposes of classification. As a result, we will include representative texture modeling methods in our discussion.
Fig. 2 The evolution of texture representation over the past decades (see discussion in Section 2.2). The timeline separates texture representations in the early years from texture representations in the new century (the focus of this survey). Milestones shown include: The Texton Theory (Bela Julesz, Nature); MRF Texture Model (Cross and Jain); Laws Filter Masks (Kenneth Laws); Fractal Model (Keller et al.); Gaussian MRF (Chellappa and Chatterjee); Gabor Filters (Mark Turner); Gabor Wavelets (Manjunath and Ma); Unified Theory for Texture Modeling (Zhu et al.); Bag of Textons (Leung and Malik); Video Google (Sivic and Zisserman); Harris Corners and Laplacian Blobs (Lazebnik et al.); LBP for Facial Texture (Ahonen et al.); Textonboost (Shotton et al.); LBP-TOP for Dynamic Texture (Zhao and Pietikäinen); Improved FV (Perronnin et al.); Scattering Convolutional Network (Bruna et al.); DeCAF and IFV (Cimpoi et al.); CNN for Texture Synthesis (Gatys et al.).

Summary of Progress in the Past Decades
Milestones in texture representation over the past decades are listed in Fig. 2. The study of texture analysis can be traced back to the earliest work of Julesz [92] in 1962, who studied the theory of human visual perception of texture and suggested that texture might be modelled using kth order statistics, i.e., the cooccurrence statistics for intensities at k-tuples of pixels. Indeed, early work on texture features in the 1970s, such as the well known Gray Level Cooccurrence Matrix (GLCM) method [80,79], was mainly driven by this perspective. Seeking the essential ingredients, in terms of features and statistics, of human texture perception, in the early 1980s Julesz [93,94] proposed the texton theory to explain preattentive texture discrimination. The theory states that textons (composed of local conspicuous features such as corners, blobs, terminators and crossings) are the elementary units of preattentive human texture perception and that only the first order statistics of textons have perceptual significance: textures having the same texton densities could not be discriminated. Julesz's texton theory has been widely studied and has largely influenced the development of texture analysis methods.
Research on texture features in the late 1980s and the early 1990s mainly focused on two well-established areas:
1. Filtering approaches, which convolve an image with a bank of filters followed by some nonlinearity. One pioneering approach was that of Laws [108], where a bank of separable filters was applied, with subsequent filtering methods including Gabor filters [20,89,225], Gabor wavelets [148], wavelet pyramids [61,144], and simple linear filters like Differences of Gaussians [142].
2. Statistical modelling, which characterizes texture images as arising from probability distributions on random fields, such as a Markov Random Field (MRF) [41,149,32,119] or fractal models [98,146].
At the end of the last century there was a renaissance of texton-based approaches, including the work of Zhu et al. [248,249,270,271,269,272] on the mathematical modelling of textures and textons. A notable stride was the Bag of Textons (BoT) [114] and later Bag of Words (BoW) [42,216,238] approaches, where a dictionary of textons is generated, and images are represented statistically as orderless histograms over the texton dictionary. In the 1990s, the need for invariant feature representations was recognized, to reduce or eliminate sensitivity to variations such as illumination, scale, rotation, viewpoint, etc. This gave rise to the development of local invariant descriptors, particularly milestone texture features such as Scale Invariant Feature Transform (SIFT) [135], Speeded Up Robust Features (SURF) [12] and LBP [163]. Such local handcrafted texture descriptors dominated many domains of computer vision until the turning point in 2012 when deep Convolutional Neural Networks (CNN) [103] achieved record-breaking image classification accuracy. Since that time the research focus has been on deep learning methods for many problems in computer vision, including texture analysis [34,35,36].
The importance of texture representations (such as Gabor filters [148], LBP [163], BoT [114], Fisher Vector (FV) [203], and wavelet Scattering Convolution Networks (ScatNet) [24]) is that they were found to be well applicable to other problems of image understanding and computer vision, such as object recognition [57,201], scene classification [18,36,106,197] and facial image analysis [3,215,265]. For instance, recently many of the best object recognition approaches in challenges such as PASCAL VOC [57] and ImageNet ILSVRC [201] were based on variants of texture representations. Beyond BoT [114] and FV [203], researchers developed Bag of Semantics (BoS) [50,51,106,118,191], which requires classifying image patches using BoT or CNN and considers the class posterior probability vectors as locally extracted semantic descriptors. On the other hand, texture representations optimized for objects were also found to perform well for texture-specific problems [34,35,36]. As a result, the division between texture descriptors and more generic image or video descriptors has been narrowing. The study of texture representation continues to play an important role in computer vision and pattern recognition.

Key Challenges
In spite of several decades of development, most texture features have not been capable of performing at a level sufficient for real-world textures and are computationally too complex to meet the real-time requirements of many computer vision applications. The inherent difficulty in obtaining powerful texture representations lies in balancing two competing goals: high quality representation and high efficiency. High Quality related challenges mainly arise from the large intraclass appearance variations caused by changes in illumination, rotation, scale, blur, noise, occlusion, etc. and from potentially small interclass appearance differences, requiring texture representations to be both highly robust and highly distinctive. Illustrative examples are shown in Fig. 3. A further difficulty is in obtaining sufficient training data in the form of labeled examples, which are frequently available only in limited amounts due to collection time or cost.
High Efficiency related challenges include the potentially large number of different texture categories and their high dimensional representations. Here we have polar opposite motivations: that of big data, with associated grand challenges and the scalability/complexity of huge problems, and that of tiny devices, the growing need for deploying highly compact and efficient texture representations on resource-limited platforms such as embedded and handheld devices.
Fig. 4 The goal of texture representation is to transform the input texture image into a feature vector that describes the properties of the texture, facilitating subsequent tasks such as texture recognition. Usually a texture image is first transformed into a pool of local features, which are then aggregated into a global representation for an entire image or region.

Bag of Words based Texture Representation
The goal of texture representation or texture feature extraction is to transform the input texture image into a feature vector that describes the properties of a texture, facilitating subsequent tasks such as texture classification, as illustrated in Fig. 4. Since texture is a spatial phenomenon, texture representation cannot be based on a single pixel, and generally requires the analysis of patterns over local pixel neighborhoods. Therefore, a texture image is first transformed to a pool of local features, which are then aggregated into a global representation for an entire image or region. Since the properties of texture are usually translationally invariant, most texture representations are based on an orderless aggregation of local texture features, such as a sum or max operation.
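The link between orderless aggregation and translation invariance can be checked on a toy example. The following is a minimal sketch under stated assumptions: the helper names (`local_features`, `orderless_histogram`, `circular_shift`) are hypothetical, raw 2 × 2 patches with wrap-around borders stand in for real local descriptors, and a circular shift stands in for image translation.

```python
from collections import Counter


def local_features(image, size=2):
    """Collect every size x size patch (as a tuple), wrapping at the borders."""
    h, w = len(image), len(image[0])
    feats = []
    for r in range(h):
        for c in range(w):
            patch = tuple(image[(r + dr) % h][(c + dc) % w]
                          for dr in range(size) for dc in range(size))
            feats.append(patch)
    return feats


def orderless_histogram(features):
    """Sum-pool the local features: count occurrences and discard positions."""
    return Counter(features)


def circular_shift(image, dy, dx):
    """Translate the image by (dy, dx) pixels with wrap-around."""
    h, w = len(image), len(image[0])
    return [[image[(r - dy) % h][(c - dx) % w] for c in range(w)]
            for r in range(h)]


image = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 0, 1, 1],
         [1, 1, 0, 0]]
shifted = circular_shift(image, 1, 2)

# The pixel grids differ, but the pooled representation is identical.
same_pooled = (orderless_histogram(local_features(image))
               == orderless_histogram(local_features(shifted)))
```

Because pooling discards the position of each local feature, the shifted image yields the same histogram even though its pixel grid differs from the original.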
Early in 1981, Julesz [93] introduced "textons", which refer to basic image features such as elongated blobs, bars, crosses, and terminators, as the elementary units of preattentive human texture perception. However Julesz's texton studies were limited by their exclusive focus on artificial texture patterns rather than natural textures. In addition, Julesz did not provide a rigorous definition for textons. Subsequently, texton theory fell into disfavor as a model of texture discrimination until the influential work by Leung and Malik [114] who revisited textons and gave an operational definition of a texton as a cluster center in filter response space. This not only enabled textons to be generated automatically from an image, but also opened up the possibility of learning a universal texton dictionary for all images. Texture images can be statistically represented as histograms over a texton dictionary, referred to as the Bag of Textons (BoT) approach. Although BoT was initially developed in the context of texture recognition [114,143], it was introduced / generalized to image retrieval [216] and classification [42], where it was referred to as Bag of Features (BoF) or, more commonly, Bag of Words (BoW). The research community has since witnessed the prominence of the BoW model for over a decade during which many improvements were proposed.

The BoW Pipeline
The BoW pipeline is sketched in Fig. 5, consisting of the following basic steps:
1. Local Patch Extraction. For a given image, a pool of N image patches is extracted over a sparse set of points of interest [110,263], over a fixed grid [101,150,207], or densely at each pixel position [163,236,237].
2. Local Patch Representation. Given the extracted N patches, local texture descriptors are applied to obtain a set or pool of texture features of dimension D. We denote the local features of the N patches in an image as $\{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^D$. Ideally, local descriptors should be distinctive and at the same time robust to a variety of possible image transformations, such as scale, rotation, blur, illumination, and viewpoint changes. High quality local texture descriptors play a critical role in the BoW pipeline.
3. Codebook Generation. The objective of this step is to generate a codebook (i.e., a texton dictionary) with K codewords $\{w_i\}_{i=1}^{K}$, $w_i \in \mathbb{R}^D$, based on training data. The codewords may be learned (e.g., by kmeans [109,236]) or predefined (such as LBP [163]). The size and nature of the codebook affect the subsequent representation and thus its discriminative power. The key here is how to generate a compact and discriminative codebook so as to enable accurate and efficient classification.
4. Feature Encoding. Given the generated codebook and the extracted local texture features $\{x_i\}$ from an image, feature encoding represents each local feature $x_i$ with the codebook, usually by mapping each $x_i$ to one or a number of codewords, resulting in a feature coding vector $v_i$ (e.g., $v_i \in \mathbb{R}^K$). Of all the steps in the BoW pipeline, feature encoding is a core component which links local representation and feature pooling, greatly influencing texture classification in terms of both accuracy and speed. Thus, many studies have focused on developing powerful feature encoding, such as vector quantization / kmeans, sparse coding [139,140,181], Locality constrained Linear Coding (LLC) [240], Vector of Locally Aggregated Descriptors (VLAD) [91], and Fisher Vector (FV) [36,179,203].
5. Feature Pooling. A global feature representation y is produced by using a feature pooling strategy to aggregate the coded feature vectors {v i }. Classical pooling methods include average pooling, max pooling, and Spatial Pyramid Pooling (SPM) [111,223].
6. Feature Classification. The global feature is used as the basis for classification, for which many approaches are possible [90,242]: Nearest Neighbor Classifier (NNC), Support Vector Machines (SVM), neural networks, and random forests. SVM is one of the most widely used classifiers for the BoW based representation.
The remainder of this section will introduce the methods in each component, as summarized in Table 1.
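The six steps above can be sketched end to end on synthetic data. This is a minimal illustration rather than any published implementation, and every choice in it is an assumption of the sketch: raw 3 × 3 patches as local descriptors, plain Lloyd iterations with a deterministic initialization (random or kmeans++ seeding is more usual), hard voting with average pooling, a nearest neighbor classifier with the Chi Square distance, and two toy "textures" (stripes and a checkerboard). All function names are hypothetical.

```python
def extract_patches(image, size=3):
    """Steps 1-2: dense patches, each flattened to a D = size*size vector."""
    h, w = len(image), len(image[0])
    return [[image[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            for r in range(h - size + 1) for c in range(w - size + 1)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x, codebook):
    """Index of the codeword closest to x (ties go to the lowest index)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))


def init_codebook(features, K):
    """Deterministic init for this sketch: the first K distinct features."""
    seen = []
    for f in features:
        if f not in seen:
            seen.append(list(f))
        if len(seen) == K:
            break
    return seen


def kmeans(features, K, iters=10):
    """Step 3: learn the codewords with plain Lloyd iterations."""
    codebook = init_codebook(features, K)
    for _ in range(iters):
        groups = [[] for _ in codebook]
        for x in features:
            groups[nearest(x, codebook)].append(x)
        for k, g in enumerate(groups):
            if g:  # keep the old codeword if its cluster is empty
                codebook[k] = [sum(col) / len(g) for col in zip(*g)]
    return codebook


def encode_and_pool(features, codebook):
    """Steps 4-5: hard voting, then average pooling into a K-bin histogram."""
    hist = [0.0] * len(codebook)
    for x in features:
        hist[nearest(x, codebook)] += 1.0
    return [v / len(features) for v in hist]


def chi2(p, q, eps=1e-10):
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


def classify(hist, train_hists, train_labels):
    """Step 6: nearest neighbor classification under the Chi Square distance."""
    best = min(range(len(train_hists)), key=lambda i: chi2(hist, train_hists[i]))
    return train_labels[best]


def stripes(n, phase):
    return [[float((c + phase) % 2) for c in range(n)] for _ in range(n)]


def checker(n, phase):
    return [[float((r + c + phase) % 2) for c in range(n)] for r in range(n)]


train_images = [stripes(8, 0), checker(8, 0)]
train_labels = ["stripes", "checker"]
train_feats = [extract_patches(im) for im in train_images]
codebook = kmeans([f for feats in train_feats for f in feats], K=4)
train_hists = [encode_and_pool(feats, codebook) for feats in train_feats]

# A shifted version of the striped texture is still recognized.
test_hist = encode_and_pool(extract_patches(stripes(8, 1)), codebook)
predicted = classify(test_hist, train_hists, train_labels)
```

In this toy setting the shifted test image contains the same patches as its training counterpart, so the pooled histograms coincide; real textures require the more robust descriptors, codebooks, and encoders discussed in the rest of this section.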

Local Texture Descriptors
All local texture descriptors aim to provide local representations invariant to contrast, rotation, scale, and possibly other criteria. The primary categorization is whether the descriptor is applied densely, at every pixel, as opposed to sparsely, only at certain locations of interest.

Sparse Texture Descriptors
To develop a sparse texture descriptor, a region of interest detector must be designed which is able to detect a sparse set of regions reliably and stably under various imaging conditions. Typically, the detected regions undergo a geometric normalization, after which local descriptors are applied to encode the image content. A series of region detectors and local descriptors has been proposed, with excellent surveys [154,155,226]. The sparse approach was introduced to texture recognition by Lazebnik et al. [109,110] and followed by Zhang et al. [263].
In [110] two types of complementary region detectors, the Harris affine detector of Mikolajczyk and Schmid [153] and the Laplacian blob detector of Gårding and Lindeberg [64], were used to detect affine covariant regions, meaning that the region content is affine invariant. Each detected region can be thought of as a texture element having a characteristic elliptic shape and a distinctive appearance pattern. In order to achieve affine invariance, each elliptical region was normalized and then two rotation invariant descriptors, the spin image (SPIN) and the Rotation Invariant Feature Transform (RIFT) descriptor, were applied. As a result, for each texture image four feature channels were extracted (two detectors × two descriptors), and for each feature channel kmeans clustering was performed to form its signature. The Earth Mover's Distance (EMD) [200] was used for measuring the similarity between image signatures and NNC was used for classification. The Harris affine regions and Laplacian blobs in combination with SPIN and RIFT descriptors (i.e., the (H+L)(S+R) method) have demonstrated good performance (listed in Table 4) in classifying textures with significant affine variations, evidenced by the classification rate of 96.0% on UIUC with NNC. Although this approach achieves affine invariance, it lacks distinctiveness, since some spatial information is lost due to its feature pooling scheme.
Following Lazebnik et al. [110], Zhang et al. [263] presented an evaluation of multiple region detector types, levels of geometric invariance, multiple local texture descriptors, and SVM classifiers with kernels based on two effective measures for comparing distributions (signatures and EMD distance vs. standard BoW and the Chi Square distance) for texture and object recognition. Regarding local description, Zhang et al. [263] also used the SIFT descriptor in addition to SPIN and RIFT. With SVM classification, Zhang et al. [263] showed significant performance improvement over that of Lazebnik et al. [110], and reported classification rates of 95.3% and 98.7% on CUReT and UIUC respectively. They recommended that practical texture recognition should seek to incorporate multiple types of complementary features, but with local invariance properties not exceeding those absolutely required for a given application. Other local region detectors have also been used for texture description, such as the Scale Descriptors which measure the scales of salient textons [95].
Fig. 5 The BoW pipeline; see Table 1, and also refer to Section 3 for detailed discussion. Features are computed from handcrafted detectors for descriptors like SIFT and RIFT, and densely applied local texture descriptors like handcrafted filters or CNNs. The CNN features can also be computed in an end-to-end manner using finetuned CNN models. These local features are quantized to visual words in a codebook.

Dense Texture Descriptors
The number of features derived from a sparse set of interest points is much smaller than the total number of image pixels, resulting in a compact feature space. However, the sparse approach can be inappropriate for many texture classification tasks:
• Interest point detectors typically produce a sparse output and could miss important texture elements.
• A sparse output in a small image might not produce sufficient regions for robust statistical characterization.
• There are issues regarding the repeatability of the detectors, the stability of the selected regions and the instability of orientation estimation [155].
As a result, extracting local texture features densely at each pixel is the more popular representation, the subject of the following discussion.
(1) Gabor Filters are one of the most popular texture descriptors, motivated by their relation to models of early visual systems of mammals as well as their joint optimum resolution in time and frequency [89,113,148]. As illustrated in Fig. 6, Gabor filters can be considered as orientation and scale tunable edge and bar detectors. The Gabor wavelets are generated by appropriate rotations and dilations from the following product of an elliptical Gaussian and a complex plane wave:
$$\phi(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[-\left(\frac{x^2}{2\sigma_x^2} + \frac{y^2}{2\sigma_y^2}\right)\right] \exp(j 2\pi\omega x),$$
where $\omega$ is the radial center frequency of the filter in the frequency domain, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the elliptical Gaussian along x and y. Thus, a Gabor filter bank is defined by its parameters including frequencies, orientations and the parameters of the Gaussian envelope. In the literature, different parameter settings have been suggested, and filter banks created by these parameter settings work well in general. Details for the derivation of Gabor wavelets and parameter selection can be found in [113,148,180]. Invariant Gabor representations can be found in [78]. According to the experimental study in [97,263], Gabor features [148] fail to meet the expected level of performance in the presence of rotation, affine and scale variations. However, Gabor filters encode structural features from multiple orientations and over a broader range of scales. It has been shown [97] that for large datasets, under varying illumination conditions, Gabor filters can serve as a preprocessing method and be combined with LBP [163] to obtain texture features with reasonable robustness [185,264].
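To make the "orientation and scale tunable" behavior concrete, the following minimal sketch builds a Gabor kernel as an elliptical Gaussian multiplied by a complex plane wave and rotated by an angle `theta`. The function names, the 1/(2πσxσy) normalization, the kernel size, and the toy striped image are all assumptions of this sketch, not parameter settings recommended in the cited literature.

```python
import cmath
import math


def gabor_kernel(size, omega, theta, sigma_x, sigma_y):
    """Complex Gabor kernel: elliptical Gaussian times a complex plane wave."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the wave travels along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            gauss = math.exp(-(xr ** 2 / (2 * sigma_x ** 2)
                               + yr ** 2 / (2 * sigma_y ** 2)))
            gauss /= 2 * math.pi * sigma_x * sigma_y
            row.append(gauss * cmath.exp(2j * math.pi * omega * xr))
        kernel.append(row)
    return kernel


def gabor_bank(size, omegas, n_orient, sigma):
    """One kernel per (frequency, orientation) pair, isotropic envelope."""
    return [gabor_kernel(size, w, k * math.pi / n_orient, sigma, sigma)
            for w in omegas for k in range(n_orient)]


def response_magnitude(image, kernel, r, c):
    """Magnitude of the filter response centered at pixel (r, c)."""
    half = len(kernel) // 2
    acc = 0j
    for i, row in enumerate(kernel):
        for j, kv in enumerate(row):
            acc += kv * image[r + i - half][c + j - half]
    return abs(acc)


# Toy texture: a cosine grating varying along x with frequency 0.25 cycles/pixel.
img = [[math.cos(2 * math.pi * 0.25 * c) for c in range(17)] for _ in range(17)]

aligned = response_magnitude(img, gabor_kernel(9, 0.25, 0.0, 2.0, 2.0), 8, 8)
orthogonal = response_magnitude(
    img, gabor_kernel(9, 0.25, math.pi / 2, 2.0, 2.0), 8, 8)
bank = gabor_bank(9, [0.125, 0.25], 4, 2.0)
```

The kernel tuned to the grating's frequency and orientation responds far more strongly than the one rotated by 90 degrees, which is the selectivity that a bank over several frequencies and orientations exploits.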
(2) Filters by Leung and Malik (LM Filters) [114,143] pioneered the problem of classifying textures under varying viewpoint and illumination. The LM filters used for local texture feature extraction are illustrated in Fig. 8. In particular, they marked a milestone by giving an operational definition of textons: the cluster centers of the filter response vectors. Their work has been widely followed by other researchers [42,110,210,216,236,237]. To handle 3D effects caused by imaging, they proposed 3D textons, which were cluster centers of filter responses over a stack of images with representative viewpoints and lighting, as illustrated in Fig. 9. In their texture classification algorithm, 20 images of each texture were geometrically registered and transformed into 48D local features with the LM Filters. Then the 48D filter response vectors at the same pixel across the 20 selected images were concatenated to obtain a 960D feature vector as the local texture representation, subsequently input into a BoW pipeline for texture classification. A downside of the method is that it is not suitable for classifying a single texture image under unknown imaging conditions, a situation which usually arises in practical applications.

Table 1 A summary of components in the BoW representation pipeline, as was sketched in Fig. 5. Entries are grouped by step; each entry gives the approach followed by its highlights.

Step: Local Texture Descriptors (Section 3.2)
Dense Descriptors
• Gabor Wavelets: Joint optimum resolution in time and frequency; Multiscale and multiorientation analysis.
• LM Filters [114]: First to propose the Bag of Textons (BoT) model (i.e., the BoW model).
• Schmid Filters [204]: Gabor-like filters; Rotation invariant.
• MR8 [236]: Rotationally invariant filters and low-dimensional filter response space.
• Patch Intensity [237]: Challenges the dominant role of filter descriptors and proposes raw image intensity features.
• LBP [163]: Fast binary features with gray scale invariance; Predefined codebook.
• Random Projection [125]: First to introduce compressive sensing and random projection into texture classification.
• Sorted Random Projection [126]: Efficient and effective approach for random projection to achieve rotation invariance.
• Basic Image Features (BIFs) [40]: Introduces BIFs of Griffin and Lillholm into texture classification; Predefined codebook.
• Weber Local Descriptor (WLD) [40]: A descriptor based on Weber's Law.

Step: Codebook Generation (Section 3.3)
• Predefined [40,163]: No codebook learning step; Computationally efficient.
• kmeans clustering [42,114]: Most commonly used method; Cannot capture overlapping distributions in the feature space.
• GMM modeling [36,179,209]: Considers both cluster centers and covariances which describe the spreads of clusters.
• Sparse Coding based learning [181,217]: Sparse representation based; Minimizes reconstruction error of data; Computationally expensive.

Step: Feature Encoding (Section 3.4)
Voting Based Methods: Require a large codebook (usually learned by kmeans); Usually combined with nonlinear SVM.
• Hard Voting [114,236]: Quantizes each feature to the nearest codeword; Fast to compute; Codes are sparse and high dimensional.
• Soft Voting [2,196,232]: Assigns each feature to multiple codewords; Does not minimize reconstruction error.
Fisher Vector (FV) Based Methods: Require a small codebook; Very high dimension; Combine with efficient linear SVM.
• FV [178]: GMM-based; Encodes higher order statistics; Efficient to compute.
• Improved FV (IFV) [34,179,209]: Uses signed square rooting and L2 normalization; State of the art performance in texture classification.
• VLAD [91,34]: A simplified version of FV.
Reconstruction Based Methods: Enforce sparse representation; Explore the manifold structure of data; Minimize reconstruction error.
• Sparse Coding [181,217,256]: Leverages the fact that natural images are sparse; Optimization is computationally expensive.
• Locality constrained Linear Coding (LLC) [34,240]: Local smooth sparsity; Fast computation through approximated LLC.

Step: Feature Classification
• Nearest Neighbor Classifier (NNC) [125,236]: Simple and elegant nonparametric classifier; Popular in texture classification.
• Kernel SVM [263]: Usually in combination with Chi Square for BoW based representation.
• Linear SVM [36]: Suitable for high-dimensional feature representations like FV and VLAD.
(3) The Schmid Filters (S Filters) [204] consist of 13 rotationally invariant Gabor-like filters of the form
$$\phi(x, y) = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \cos\left(\frac{\pi\beta\sqrt{x^2 + y^2}}{\sigma}\right),$$
where $\beta$ is the number of cycles of the harmonic function within the Gaussian envelope of the filter and $\sigma$ is the scale of that envelope. The filters are shown in Fig. 7; as can be seen, all of the filters have rotational symmetry. The rotation-invariant S Filters were shown to outperform the rotation-variant LM Filters in classifying the CUReT textures [236], indicating that rotational invariance is necessary in practical applications.
Fig. 9 Illustration of the process of 3D texton dictionary learning proposed by Leung and Malik [114]. Each image at different lighting and viewing directions is filtered using the filter bank illustrated in Fig. 8. The response vectors are concatenated together to form data vectors of length $N_{fil} N_{im}$. These data vectors are clustered using the kmeans algorithm to obtain the 3D textons.
(4) Maximum Response (MR8) Filters of Varma and Zisserman [236] consist of 38 root filters but only 8 filter responses. The filter bank contains filters at multiple orientations but their outputs are pooled by recording only the maximum filter response across all orientations, in order to achieve rotation invariance. The root filters are a subset of the LM Filters [114] of Fig. 8, retaining the two rotationally symmetric filters, the edge filter, and the bar filter at 3 scales and 6 orientations. Recording only the maximum response across orientations reduces the number of responses from 38 to 8 (3 scales for 2 anisotropic filters, plus 2 isotropic), resulting in the so called MR8 filter bank.
Realizing the shortcomings of Leung and Malik's method [114], Varma and Zisserman [236] attempted to improve the classification of a single texture sample image under unknown imaging conditions, bypassing the registration step and instead learning 2D textons by aggregating filter responses over different images. Experimental results [236] showed that MR8 outperformed the LM Filters and S Filters, indicating that detecting better features and clustering in a lower dimensional feature space can be advantageous. The best result for MR8 is 97.4%, obtained with a dictionary of 2440 textons and a Nearest Neighbor Classifier (NNC) [236]. Later, Hayman et al. [82] showed that an SVM could further enhance the texture classification performance of MR8 features, giving a 98.5% classification rate for the same setup used for texton representation.
(5) Patch Descriptors: Varma and Zisserman [237] challenged the dominant role of filter banks [152,190] in texture analysis, and instead developed a simple Patch Descriptor, keeping the raw pixel intensities of a square neighborhood to form a feature vector, as illustrated in Fig. 10. By replacing filter responses such as those of the LM Filters [190], S Filters [204] and MR8 [236] with the Patch Descriptor in texture classification, Varma and Zisserman [237] observed very good classification performance using extremely compact neighborhoods (3 × 3), and found that for any fixed size of neighborhood the Patch Descriptor leads to superior classification compared to filter banks with the same support.
Two variants of the Patch Descriptor, the Neighborhood Descriptor and the MRF Descriptor, were developed. For the Neighborhood Descriptor, the central pixel is discarded and only the neighborhood vector is used for texton representation. Instead of ignoring the central pixel, the MRF Descriptor explicitly models the joint distribution of the central pixel and its neighbors. The best result, 98.0%, is given by the MRF Descriptor using a 7 × 7 neighborhood with 2440 textons, 90 bins and an NNC. Note that the dimensionality of this MRF representation is very high: 2440 × 90. A clear limitation of the Patch, Neighborhood and MRF Descriptors is sensitivity to nearly any change (brightness, rotation, affine etc.). To achieve rotation invariance, Varma and Zisserman [237] found the dominant orientation of each patch and measured the neighborhood relative to this orientation, reporting a 97.8% classification rate on the UIUC dataset. It is worth mentioning that finding the dominant orientation for each patch is computationally expensive.
(6) Random Projection (RP) and Sorted Random Projection (SRP) features of Liu and Fieguth [125] were inspired by theories of sparse representation and compressed sensing [27,52]. Taking advantage of the sparse nature of textured images, a small set of random features is extracted from local image patches by projecting the local patch feature vectors to a lower dimensional feature subspace. The random projection is a fixed, distance-preserving embedding capable of alleviating the curse of dimensionality [11,69]. The random features are embedded into BoW to perform texture classification. It has been shown that the performance of RP features is superior to that of the Patch Descriptor with equivalent neighborhoods [125], a clear indication that the RP matrix preserves the salient information contained in the local patch and that performing classification in a lower dimensional feature space is advantageous. The best result, 98.5%, is achieved using a 17 × 17 neighborhood with 2440 textons and an NNC.
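The distance-preserving behavior of a random projection can be illustrated with a short sketch. It assumes a Gaussian random matrix scaled by the square root of the output dimension, which is one common choice and not necessarily the matrix used in [125]; the function name, the output dimension of 64 and the synthetic patches are hypothetical.

```python
import numpy as np

def random_projection_features(patches, out_dim, seed=0):
    """Project vectorized patches (N, d) to (N, out_dim) with a fixed random matrix."""
    rng = np.random.default_rng(seed)
    d = patches.shape[1]
    # Gaussian entries scaled so that squared norms are preserved in expectation.
    phi = rng.standard_normal((out_dim, d)) / np.sqrt(out_dim)
    return patches @ phi.T

# 17 x 17 patches give d = 289; synthetic data stands in for real patches.
rng = np.random.default_rng(1)
patches = rng.standard_normal((200, 289))
feats = random_projection_features(patches, out_dim=64)

# Compare pairwise distances before and after projection.
d_in = np.linalg.norm(patches[:100] - patches[100:], axis=1)
d_out = np.linalg.norm(feats[:100] - feats[100:], axis=1)
mean_ratio = float(np.mean(d_out / d_in))   # close to 1 on average
```

The same matrix must be reused for every patch and every image, which is why the seed is fixed: the projection is a fixed embedding, not one resampled per patch.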
Like the Patch Descriptors, the RP features remain sensitive to image rotation. To further improve robustness, Liu et al. [128,126] proposed sorting the RP features, as illustrated in Fig. 11, whereby rings of pixel values are sorted, without any reference orientation, ensuring rotation invariance. Two kinds of local features are used, one based on raw intensities and the other on gradients (radial differences and angular differences). Random functions of the sorted local features are taken to obtain SRP features. It was shown that SRP outperformed RP significantly for robust texture classification [126,128], producing state of the art classification results on CUReT (99.4%), KTHTIPS (99.3%), and UMD (99.3%) with an SVM classifier [126,129].

Fig. 12 A circular neighborhood used to derive an LBP code: a central pixel $x_c$ and its p circularly and evenly spaced neighbors on a circle of radius r.
(7) Local Binary Patterns of Ojala et al. [161] marked the beginning of the LBP methodology, followed by the simpler rotation invariant version of Pietikäinen et al. [184], and later "uniform" patterns to reduce feature dimensionality [163].
Texture representation generally requires the analysis of patterns in local pixel neighborhoods, which are comprehensively described by their joint distribution. However, stable estimation of joint distributions is often infeasible, even for small neighborhoods, because of the combinatorics of joint distributions. Considering the joint distribution

$$g(x_c, x_0, \ldots, x_{p-1})$$

of center pixel $x_c$ and $\{x_n\}_{n=0}^{p-1}$, p equally spaced pixels on a circle of radius r, Ojala et al. [163] argued that much of the information in this joint distribution is conveyed by the joint distribution of differences

$$g(x_0 - x_c, x_1 - x_c, \ldots, x_{p-1} - x_c).$$

The size of the joint histogram was greatly reduced by keeping only the sign of each difference, as illustrated in Fig. 12.
A certain degree of rotation invariance is achieved by cyclic shifts of the LBPs, i.e., grouping together those LBPs that are actually rotated versions of the same underlying pattern. Since the dimensionality of the representation (which grows exponentially with p) is still high, Ojala et al. [163] introduced a uniformity measure to identify p(p−1)+2 uniform LBPs and classified all remaining nonuniform LBPs under a single group. By changing parameters p and r, we can derive LBP for any quantization of the angular space and for any spatial resolution, such that multiscale analysis can be accomplished by combining multiple operators of varying r. The most prominent advantages of LBP are its invariance to monotonic gray scale change, very low computational complexity, and ease of implementation.
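The sign-of-difference coding and the uniformity grouping can be sketched as below. This is a minimal illustration under simplifying assumptions: it fixes p = 8 and r = 1 and uses the square 3 × 3 neighborhood without the bilinear interpolation that a true circular neighborhood requires; the function names are hypothetical.

```python
import numpy as np

# The 8 neighbors of a 3x3 window, listed in circular order around the center.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(img):
    """Basic LBP code (0..255) for every interior pixel of a 2D gray-scale array."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.int64)
    for n, (dy, dx) in enumerate(OFFSETS):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Keep only the sign of the difference x_n - x_c as bit n.
        codes += (neighbor >= center).astype(np.int64) << n
    return codes

def transitions(code, p=8):
    """Number of 0/1 transitions in the circular p-bit pattern."""
    bits = [(code >> n) & 1 for n in range(p)]
    return sum(bits[n] != bits[(n + 1) % p] for n in range(p))

def riu2_label(code, p=8):
    """Rotation invariant uniform label: number of ones if uniform, else p + 1."""
    return bin(code).count("1") if transitions(code, p) <= 2 else p + 1

def lbp_riu2_histogram(img, p=8):
    """Normalized (p + 2)-bin histogram used as the global texture representation."""
    labels = np.array([riu2_label(int(c), p) for c in lbp_codes(img).ravel()])
    hist = np.bincount(labels, minlength=p + 2).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(12, 12))
hist = lbp_riu2_histogram(img)
n_uniform = sum(transitions(c) <= 2 for c in range(256))   # p(p-1)+2 for p = 8
```

Since only the ordering of gray values enters the code, any monotonically increasing gray-scale change leaves the histogram unchanged, and a rotation of the image by 90 degrees only shifts each bit pattern cyclically, which the rotation invariant labels ignore.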
Since [163], LBP started to receive increasing attention in computer vision and pattern recognition, especially texture and facial analysis, with the LBP milestones presented in Fig. 13. As Gabor filters and LBP provide complementary information (LBP captures small and fine details, Gabor filters encode appearance information over a broader range of scales), Zhang et al. [264] proposed Local Gabor Binary Pattern (LGBP) by extracting LBP features from images filtered by Gabor filters of different scales and orientations, to enhance the representation power, followed by subsequent Gabor-LBP approaches [86,132,185]. Additional important LBP variants include LBP-TOP, proposed by Zhao and Pietikäinen [265], a milestone in using LBP for dynamic texture analysis; the Local Ternary Patterns (LTP) of Tan and Triggs [222], introducing a pair of thresholds and a split coding scheme which allows for encoding pixel similarity; the Local Phase Quantization (LPQ) by Ojansivu et al. [164,165] quantizing the Fourier transform phase in local neighborhoods which is, by design, tolerant to most common types of image blurs; the Completed LBP (CLBP) of Guo et al. [76], encoding not only the signs but also the magnitudes of local differences; and the Median Robust Extended LBP (MRELBP) of Liu et al. [131] which enjoys high distinctiveness, low computational complexity, and strong robustness to image rotation and noise.
LBP has also led to compact and efficient binary feature descriptors designed for image matching, with notable ones including Binary Robust Independent Elementary Features (BRIEF) [26], Oriented FAST and Rotated BRIEF (ORB) [199], Binary Robust Invariant Scalable Keypoints (BRISK) [115] and Fast Retina Keypoint (FREAK) [6]. These binary descriptors provide matching performance comparable to that of widely used region descriptors such as SIFT [135] and SURF [12], but are fast to compute and have significantly lower memory requirements, making them especially suitable for applications on resource constrained devices. In summary, for large datasets with rotation variations and no significant illumination related variations, LBP [163] could serve as an effective and efficient approach for texture classification. However, in the presence of significant illumination variations, significant affine transformations, or noise corruption, LBP fails to meet the expected level of performance. MRELBP [131], a recent LBP variant, has been demonstrated to outperform LBP significantly, with near perfect classification performance on two small benchmark datasets (Outex TC10 100% and Outex TC12 99.8%) [131], and obtained the best overall performance in a recent experimental survey [132] evaluating robustness in multiple classification challenges. In general, LBP-based features work well in situations where limited training data are available; in such settings, learning based approaches like MR8, Patch Descriptors and DCNN based representations, which require large amounts of training samples, are significantly outperformed by LBP based ones.
After over 20 years of developments, LBP is no longer just a simple texture operator, but has laid the foundation for a direction of research dealing with local image and video descriptors. A large number of LBP variants have been proposed to improve its robustness and to increase its discriminative power and applicability to different types of problems, and interested readers are referred to excellent surveys [86,132,185]. Recently, although CNN based methods are beginning to dominate, LBP research remains active, as evidenced by significant recent work [77,219,202,116,136,251,259,49].
(8) The Basic Image Features (BIF) approach [40] is similar to LBP [163], in that it is based upon a predefined codebook rather than one learned from training. It therefore shares the advantages of LBP over methods based on codebook learning with clustering. In contrast with LBP, which computes the differences between a pixel and its neighbors, BIF probes an image locally using Gaussian derivative filters [73,72]. Derivative of Gaussians (DtG) filters, consisting of first and second order derivatives of the Gaussian filter, can effectively detect the local basic and symmetry structure of an image, and allow exact rotation invariance to be achieved [61]. BIF feature extraction is summarized in Fig. 14: each pixel in the image is filtered by the DtG filters, and then labeled as the maximizing class. A simple six dimensional BIF histogram can be used as a global texture representation; however, the histogram over these six categories produces too coarse a representation, and therefore others (e.g., Crosier and Griffin [40]) have performed multiscale analysis and calculated joint histograms over multiple scales. Multiscale BIF features achieved very good classification performance on CUReT (98.6%), UIUC (98.8%) and KTHTIPS (98.5%) [40], with an NNC.
(9) The Weber Law Descriptor (WLD) [33] is based on the fact that human perception of a pattern depends not only on the change of a stimulus but also on the original intensity of the stimulus. The WLD consists of two components: differential excitation and orientation. For a small patch of size 3 × 3, shown in Fig. 15, the differential excitation is the relative intensity ratio

$$\xi(x_c) = \arctan\left( \sum_{n=0}^{p-1} \frac{x_n - x_c}{x_c} \right),$$

and the orientation component is derived from the local gradient orientation

$$\theta(x_c) = \arctan\left( \frac{x_7 - x_3}{x_5 - x_1} \right),$$

where the neighbor indices follow the labeling of Fig. 15. Both ξ and θ are quantized into a 2D histogram, offering a global representation. Clearly the use of multiple neighborhood sizes supports a multiscale generalization. Though computationally efficient, WLD features fail to meet the expected level of performance for texture recognition.

Fractal Based Descriptors
Fractal Based Descriptors present a mathematically well founded alternative to dealing with scale [146], however they have not become popular as texture features due to their lack of discriminative power [235]. Recently, inspired by the BoW approach, researchers revisited the fractal method and proposed the MultiFractal Spectrum (MFS) method [253,252,254], invariant to viewpoint changes, nonrigid deformations and local affine illumination changes. The basic MFS method was proposed in [253], where MFS was first defined for simple image features, such as intensity, gradient and Laplacian of Gaussian (LoG). A texture image is first transformed into n feature maps such as intensity, gradient or LoG filter features. Each map is clustered into k clusters (i.e. k codewords) via kmeans. Then, a codeword label map is obtained and is decomposed into k binary feature maps: those pixels assigned to codeword i are labeled with 1 and the remainder as 0. For each binary feature map, the box counting algorithm [254] is used to estimate a fractal dimension feature. Thus, a total of k fractal dimension features are computed for each feature map, forming a kD feature vector (referred to as a fractal spectrum) as the global representation of the image. Finally, for the n different feature maps, n fractal spectrum feature vectors are concatenated as the MFS feature. The MFS representation demonstrated invariance to a number of geometrical changes such as viewpoint changes, nonrigid surface changes and reasonable robustness to illumination changes. However, since it is based on simple features (intensities and gradients) and has very low dimension, it has limited discriminability, and gives classification rates 92.3% and 93.9% on datasets UIUC and UMD respectively.
Later, MFS was improved by replacing the simple image intensity and gradient features with SIFT [252], wavelets [254], and LBP [188]. For instance, the Wavelet based MFS (WMFS) features achieved significantly improved classification performance on UIUC (98.6%) and UMD (98.7%). The downside of the MFS approach is that it requires high resolution images to obtain sufficiently stable features.
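The box counting step of the basic MFS pipeline can be sketched as follows. This is an illustrative estimate only, assuming a codeword label map is already available (the kmeans quantization step is omitted) and that box sizes are powers of two; the function names and box sizes are hypothetical and not taken from [253,254].

```python
import numpy as np

def box_counting_dimension(mask, box_sizes=(1, 2, 4, 8, 16)):
    """Estimate the box-counting (fractal) dimension of a binary 2D mask."""
    mask = np.asarray(mask, dtype=bool)
    counts = []
    for s in box_sizes:
        h = (mask.shape[0] // s) * s
        w = (mask.shape[1] // s) * s
        blocks = mask[:h, :w].reshape(h // s, s, w // s, s)
        occupied = int(blocks.any(axis=(1, 3)).sum())   # boxes containing a 1
        counts.append(max(occupied, 1))                 # avoid log(0) on empty maps
    # Dimension = slope of log N(s) against log(1/s).
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return float(slope)

def fractal_spectrum(label_map, k):
    """k fractal dimensions, one per binary map of a k-codeword label map."""
    return np.array([box_counting_dimension(label_map == i) for i in range(k)])

# Sanity checks on shapes with known dimension.
full = np.ones((64, 64), dtype=bool)       # filled square: dimension 2
line = np.zeros((64, 64), dtype=bool)
line[10, :] = True                          # straight line: dimension 1
```

For n feature maps, the n vectors returned by `fractal_spectrum` would be concatenated to form the MFS feature.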

Codebook Generation
Texture characterization requires the analysis of spatially repeating patterns, which suffice to characterize textures and the pursuit of which has had important implications in a series of practical problems, such as dimensionality reduction, variable decoupling, and biological modelling [168,272]. The extracted set of local texture features is versatile, and yet overly redundant [114]. It can therefore be expected that a set of prototype features (i.e. codewords or textons) must exist which can be used to create global representations of textures in natural images [114,166,272], in a similar way as in speech and language (such as words, phrases and sentences).
There exist a variety of methods for codebook generation. Certain approaches, such as LBP [163] and BIF [40], which we have already discussed, use predefined codebooks, therefore entirely bypassing the codebook learning step.
For approaches requiring a learned codebook, kmeans clustering [110,114,125,237,263] and Gaussian Mixture Models (GMM) [34,36,107,91,179,209] are the most popular and successful methods. GMM modeling considers both cluster centers and covariances, which describe the location and spread/shape of clusters, whereas kmeans clustering cannot capture overlapping distributions in the feature space as it considers only distances to cluster centers, although generalizations to kmeans with multiple prototypes per cluster can allow this limitation to be relaxed. The GMM and kmeans methods learn a codebook in an unsupervised manner, but some recent approaches focus on building more discriminative ones [257,246].
In addition, another significant research thread is reconstruction based codebook learning [1,181,217,240], under the assumption that natural images admit a sparse decomposition in some redundant basis (i.e., dictionary or codebook). These methods focus on learning nonparametric redundant dictionaries that facilitate a sparse representation of the data and minimize the reconstruction error of the data. Because discrimination is the primary goal of texture classification, researchers have proposed to construct discriminative dictionaries that explicitly incorporate category specific information [139,140].  Since the codebook is used as the basis for encoding feature vectors, codebook generation is often interleaved with feature encoding, described next.

Feature Encoding
As illustrated in Fig. 4, a given image is transformed into a pool of local texture features, from which a global image representation is derived by feature encoding with the generated codebook. In the field of texture classification, we group commonly-used encoding strategies into three major categories:
• Voting based [114,236,232,233],
• Fisher Vector based [91,36,179,203], and
• Reconstruction based [139,140,167,181,240].
Comprehensive comparisons of encoding methods in image classification can be found in [30,34,88].
Voting based methods. The most intuitive way to quantize a local feature is to assign it to its nearest codeword in the codebook, also referred to as hard voting [114,236]. A histogram of the quantized local descriptors can be computed by counting the number of local features assigned to each codeword; this histogram constitutes the baseline BoW representation (as illustrated in Fig. 16 (a)) upon which other methods can improve. Formally, it starts by learning a codebook $\{w_i\}_{i=1}^{K}$, usually by kmeans clustering. Given a set of local texture descriptors $\{x_i\}_{i=1}^{N}$ extracted from an image, the encoding $v \in \mathbb{R}^K$ of some descriptor $x$ via hard voting is

$$v(i) = \begin{cases} 1, & \text{if } i = \arg\min_j \|x - w_j\|_2, \\ 0, & \text{otherwise}. \end{cases}$$

The histogram of the set of local descriptors is obtained by aggregating all encoding vectors $\{v_i\}_{i=1}^{N}$ via sum pooling. Hard voting overlooks codeword uncertainty, and may label image features by nonrepresentative codewords. In an improvement to this hard voting scheme, soft voting [2,196,258,232,233] employs several nearest codewords to encode each local feature in a soft manner, such that the weight of each assigned codeword is an inverse function of the distance from the feature, for some kernel definition of distance. Voting based methods yield a histogram representation of dimensionality K, the number of bins in the histogram.

Fisher Vector based methods. By counting the number of occurrences of codewords, the standard BoW histogram representation encodes the zeroth-order statistics of the distribution of descriptors, which is only a rough approximation of the probability density distribution of the local features. The Fisher vector extends the histogram approach by encoding additional information about the distribution of the local descriptors. Based on the original FV encoding [178], improved versions were proposed [37,179] such as the Improved FV (IFV) [179] and VLAD [91].
We briefly describe IFV [179] here, since to the best of our knowledge it achieves the best performance in texture classification [34,35,36,209]. Theory and practical issues regarding FV encoding can be found in [203].
IFV encoding learns a soft codebook with GMM, as shown in Fig. 16 (c). An IFV encoding of a local feature is computed by assigning it to each codeword, in turn, and computing the gradient of the soft assignment with respect to the GMM parameters². The IFV encoding dimensionality is 2DK, where D is the dimensionality of the feature space and K is the number of Gaussian mixture components. BoW can be considered a special case of FV in which the gradient computation is restricted to the mixture weight parameters of the GMM. Unlike BoW, which requires a large codebook size, FV can be computed from a much smaller codebook (typically 64 or 256) and therefore at a lower computational cost at the codebook learning step. On the other hand, the resulting dimension of the FV encoding vector (e.g. tens of thousands) is usually significantly higher than that of BoW (thousands), which makes it unsuitable for nonlinear classifiers; however, it offers good performance even with simple linear classifiers.
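The gradient computation and the resulting 2DK dimensionality can be sketched as below. This is a hedged illustration, assuming a diagonal-covariance GMM whose parameters are already given (the values in the example are arbitrary, not fitted), the usual closed-form gradients with respect to means and variances, and the signed square root followed by L2 normalization that is commonly associated with IFV; the function name is hypothetical.

```python
import numpy as np

def improved_fisher_vector(X, weights, means, variances):
    """Encode descriptors X (N, D) with a K-component diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D). Returns a 2*D*K vector.
    """
    N = X.shape[0]
    diff = X[:, None, :] - means[None, :, :]                      # (N, K, D)
    # Log of the weighted Gaussian densities, then softmax -> soft assignments.
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2 / variances[None], axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                     # (N, K)
    z = diff / np.sqrt(variances)[None]                           # whitened residuals
    # First-order (means) and second-order (variances) gradient statistics.
    g_mu = np.einsum("nk,nkd->kd", gamma, z) / (N * np.sqrt(weights)[:, None])
    g_var = np.einsum("nk,nkd->kd", gamma, z ** 2 - 1) / (N * np.sqrt(2 * weights)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                        # power normalization
    return fv / max(np.linalg.norm(fv), 1e-12)                    # L2 normalization

rng = np.random.default_rng(0)
K, D = 3, 4
X = rng.standard_normal((50, D))
fv = improved_fisher_vector(X, np.full(K, 1.0 / K),
                            rng.standard_normal((K, D)), np.ones((K, D)))
```

The gradient with respect to the mixture weights is omitted, so the output has exactly 2DK entries (24 in this toy setting).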
The VLAD encoding scheme proposed by Jégou et al. [91] can be thought of as a simplified version of FV, in that it typically uses kmeans, rather than GMM, and records only first-order statistics rather than second order. In particular, it records the residuals (the difference between the local features and the codewords), as shown in Fig. 16 (b).
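Hard voting, soft voting and VLAD can be contrasted on the same codebook with a short sketch. It assumes the codebook W is already learned and stored row-wise as a (K, D) array, and it simplifies soft voting by weighting all codewords with a Gaussian kernel rather than only several nearest ones; the kernel width `sigma` and all function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X, W):
    """Squared Euclidean distances between descriptors (N, D) and codewords (K, D)."""
    return ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)     # (N, K)

def hard_voting_histogram(X, W):
    """Baseline BoW: each descriptor votes for its nearest codeword."""
    nearest = pairwise_sq_dists(X, W).argmin(axis=1)
    return np.bincount(nearest, minlength=W.shape[0]).astype(float)

def soft_voting_histogram(X, W, sigma=1.0):
    """Each descriptor spreads one unit of vote, decaying with distance."""
    d2 = pairwise_sq_dists(X, W)
    weights = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights.sum(axis=0)

def vlad(X, W):
    """Sum of residuals to the nearest codeword, concatenated and L2 normalized."""
    nearest = pairwise_sq_dists(X, W).argmin(axis=1)
    out = np.zeros_like(W, dtype=float)
    for k in range(W.shape[0]):
        members = X[nearest == k]
        if len(members):
            out[k] = (members - W[k]).sum(axis=0)
    out = out.ravel()
    return out / max(np.linalg.norm(out), 1e-12)

W = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[0.0, 1.0], [1.0, 0.0], [9.0, 10.0]])
hard = hard_voting_histogram(X, W)     # two votes for codeword 0, one for codeword 1
soft = soft_voting_histogram(X, W)
v = vlad(X, W)
```

The histograms have K entries while the VLAD vector has K × D entries, which reflects the trade-off between codebook size and encoding dimensionality discussed above.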
Reconstruction based methods. Reconstruction based methods aim to obtain an information-preserving encoding vector that allows for the reconstruction of a local feature with a small number of codewords. Typical methods include sparse coding and Locality-constrained Linear Coding (LLC), which are contrasted in Fig. 17. Sparse coding was initially proposed [167] to model natural image statistics, then applied to texture classification [45,139,140,181,217] and later to other problems such as image classification [256] and face recognition [247].
In sparse coding, a local feature $x$ can be well approximated by a sparse decomposition $x \approx Wv$ over the learned codebook $W = [w_1, w_2, \ldots, w_K]$, by leveraging the sparse nature of the underlying image [167]. A sparse encoding can be solved as

$$\min_{v} \|x - Wv\|_2^2 \quad \text{s.t.} \quad \|v\|_0 \le s,$$

where $s$ is a small integer denoting the sparsity level, limiting the number of nonzero entries in $v$, measured as $\|v\|_0$. Learning a redundant codebook that facilitates a sparse representation of the local features is important in sparse coding [1]. Methods in [139,140,181,217] are based on learning C class-specific codebooks, one for each texture class, and approximating each local feature using a constant sparsity $s$. The C different codebooks provide C different reconstruction errors, which can then be used as classification features. In [181,217], the class specific codebooks were optimized for reconstruction, but significant improvements have been shown by optimizing for discriminative power instead [45,139,140], an approach which is, however, associated with high computational cost, especially when the number of texture classes C is large.

Locality-constrained Linear Coding (LLC) [240] projects each local descriptor $x$ down to the local linear subspace spanned by the $q$ codewords in the codebook of size K closest to it (in Euclidean distance), resulting in a K dimensional encoding vector whose entries are all zero except for the indices of the $q$ codewords closest to $x$. The projection of $x$ down to the span of its $q$ closest codewords is solved via

$$\min_{v} \|x - Wv\|_2^2 + \lambda \sum_{i=1}^{K} \left( v(i) \exp\left( \frac{\|x - w_i\|_2}{\sigma} \right) \right)^2 \quad \text{s.t.} \quad \sum_{i=1}^{K} v(i) = 1,$$

where λ is a small regularization constant and σ adjusts the weight decay speed.

In summary, reconstruction based coding has received significant attention since sparse coding was applied for visual classification [139,140,181,217,240]. A theoretical study of the success of sparse coding over vector quantization can be found in [38].

² The derivative with respect to the weights, which is considered to make little contribution to the performance, is removed in IFK [179].
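The fast approximated form of LLC can be sketched as follows: instead of applying the exponential locality penalty to all K codewords, only the q nearest codewords are kept and a small sum-to-one constrained least squares problem is solved. This is a minimal illustration under stated assumptions, not the reference implementation of [240]: the codebook is stored row-wise as a (K, D) array, so the reconstruction is `W.T @ v`, and the regularization constant `reg` and the function name are hypothetical.

```python
import numpy as np

def llc_encode(x, W, q=5, reg=1e-4):
    """Approximated LLC code of descriptor x (D,) over codebook W (K, D)."""
    K = W.shape[0]
    idx = np.argsort(((W - x) ** 2).sum(axis=1))[:q]   # q nearest codewords
    Z = W[idx] - x                                     # codewords centered at x
    C = Z @ Z.T                                        # local covariance, (q, q)
    C += reg * max(np.trace(C), 1e-12) * np.eye(q)     # regularize for stability
    c = np.linalg.solve(C, np.ones(q))
    c /= c.sum()                                       # enforce the sum-to-one constraint
    v = np.zeros(K)
    v[idx] = c                                         # all other entries stay zero
    return v

W = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [-10.0, 5.0]])
x = np.array([1.0, 0.0])               # midpoint of the first two codewords
v = llc_encode(x, W, q=2)
reconstruction = W.T @ v
```

In this toy case the descriptor lies halfway between its two nearest codewords, so each receives weight 0.5 and the reconstruction recovers x exactly.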

Feature Pooling and Classification
The goal of feature pooling [19] is to integrate or combine the coded feature vectors $\{v_i\}_i$, $v_i \in \mathbb{R}^d$, of a given image into a final compact global representation $y$ which is more robust to image transformations and noise. Commonly used pooling methods include sum pooling, average pooling and max pooling [114,237,240]. Boureau et al. [19] presented a theoretical analysis of average pooling and max pooling, and showed that max pooling may be well suited to sparse features. The authors also proposed softer max pooling methods by using a smoother estimate of the expected max-pooled feature and demonstrated improved performance. Another notable pooling method is the mix-order max pooling method, which considers the information of visual word occurrence frequency [127].
Specifically, let $V = [v_1, \ldots, v_N] \in \mathbb{R}^{d \times N}$ denote the coded features from N locations. For $u$ denoting a row of $V$, $u$ is reduced to a single scalar by some operation (sum, average, max), reducing $V$ to a d-dimensional feature vector. Realizing that pooling over the entire image disregards all information regarding spatial dependencies, Lazebnik et al. [111] proposed a simple Spatial Pyramid Matching (SPM) scheme by partitioning the image into increasingly fine subregions and computing histograms of local features found inside each subregion via average or max pooling. The final global representation is a concatenation of all histograms extracted from subregions, resulting in a higher dimensional representation that preserves more spatial information [223].
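The row-wise pooling of V and its spatial pyramid extension can be sketched as below. The sketch assumes feature locations normalized to the unit square and a three-level pyramid with 1 × 1, 2 × 2 and 4 × 4 grids; these grid sizes and the function names are illustrative choices rather than settings prescribed by [111].

```python
import numpy as np

def pool(V, mode="max"):
    """Reduce coded features V (d, N) to a d-dimensional vector, row by row."""
    ops = {"sum": np.sum, "average": np.mean, "max": np.max}
    return ops[mode](V, axis=1)

def spatial_pyramid(V, xy, levels=(1, 2, 4), mode="max"):
    """Concatenate pooled vectors over increasingly fine grids.

    xy: (N, 2) feature locations, normalized to [0, 1).
    """
    d = V.shape[0]
    parts = []
    for g in levels:
        cell = np.minimum((xy * g).astype(int), g - 1)    # grid cell of each feature
        for cy in range(g):
            for cx in range(g):
                inside = (cell[:, 0] == cx) & (cell[:, 1] == cy)
                # Empty cells contribute a zero vector.
                parts.append(pool(V[:, inside], mode) if inside.any() else np.zeros(d))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
V = rng.random((5, 40))        # d = 5 coded dimensions at N = 40 locations
xy = rng.random((40, 2))
spm = spatial_pyramid(V, xy)   # length 5 * (1 + 4 + 16)
```

The first d entries coincide with pooling over the whole image; the remaining entries add the spatial layout at the cost of a 21 times larger representation in this configuration.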
Given a pooled feature, a given texture sample can be classified. Many classification approaches are possible [90,242], although Nearest Neighbor Classifier (NNC) and Support Vector Machine (SVM) are the most widely-used classifiers for the BoW representation. Different distance measures may be used, such as the EMD distance [110,263], KL divergence and the widely-used Chi Square distance [125,237]. For high dimensional BoW features, as with SPM features and multilevel histograms, histogram intersection kernel SVM [71,111,141] is a good and efficient choice. For very high-dimensional features, as with IFV and VLAD, linear SVM may represent a better choice [91,179].
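A nearest neighbor classifier with the Chi Square distance can be sketched in a few lines. The histograms and class labels below are made-up toy values, and the small constant `eps` guarding against empty bins is an implementation assumption.

```python
import numpy as np

def chi_square(h1, h2, eps=1e-12):
    """Chi Square distance between two histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def nnc_predict(query, train_hists, train_labels):
    """Return the label of the training histogram closest to the query."""
    dists = [chi_square(query, h) for h in train_hists]
    return train_labels[int(np.argmin(dists))]

# Toy normalized BoW histograms for two hypothetical texture classes.
train_hists = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.1, 0.8]])
train_labels = ["brick", "grass"]
prediction = nnc_predict(np.array([0.7, 0.2, 0.1]), train_hists, train_labels)
```

For a kernel SVM, the same distance is typically turned into a kernel, for example by exponentiating its negative.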

CNN based Texture Representation
A large number of CNN-based texture representation methods have been proposed in recent years since the record-breaking image classification result [103] achieved in 2012. A key to the success of CNNs is their ability to leverage large labeled datasets to learn high quality features. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated images, an issue which rather constrains the applicability of CNNs in problems with limited training data. A key discovery, in this regard, was that CNN features pretrained on very large datasets were found to transfer well to many other problems, including texture analysis, with a relatively modest adaptation effort [31,36,68,170,208]. In general, the current literature on texture classification includes examples of both employing pretrained generic CNN models and performing finetuning for specific texture classification tasks.
In this survey we will classify CNN based texture representation methods into three categories, which form the basis of the following three sections:
• using pretrained generic CNN models,
• using finetuned CNN models, and
• using handcrafted deep convolutional networks.
These representations have had a widespread influence in image understanding; representative examples of each of these are given in Table 2.

Using Pretrained Generic CNN Models
Given the behavior of CNN transfer, the success of pretrained CNN models lies in the feature extraction and encoding steps. Similar to Section 3, we will describe first some commonly used networks for pretraining and then the feature extraction process.
(1) Popular Generic CNN Models can serve as good choices for extracting features, including AlexNet [103], VGGNet [214], GoogLeNet [220], ResNet [84] and DenseNet [87]. Among these networks, AlexNet was proposed the earliest, and in general the others are deeper and more complex. A full review of these networks is beyond the scope of this paper, and we refer readers to the original papers [84,87,103,214,220] and to excellent surveys [15,31,75,112,133] for additional details. Briefly, as shown in Fig. 18 (b), a typical CNN repeatedly applies the following three operations:
1. Convolution with a number of linear filters,
2. Nonlinearities, such as sigmoid or rectification,
3. Local pooling or subsampling.
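One round of the three operations can be sketched with plain NumPy. This is only an illustration of the data flow: the filters are random rather than learned, the convolution is a "valid" cross-correlation on a single-channel image, and the 2 × 2 max pooling window is an arbitrary choice.

```python
import numpy as np

def conv2d_valid(img, kernels):
    """Filter img (H, W) with kernels (C, k, k); returns (C, H-k+1, W-k+1)."""
    C, k, _ = kernels.shape
    H, W = img.shape
    out = np.empty((C, H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            window = img[i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def relu(x):
    """Rectification nonlinearity."""
    return np.maximum(x, 0.0)

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling on feature maps (C, H, W)."""
    C, H, W = x.shape
    x = x[:, :H // s * s, :W // s * s]
    return x.reshape(C, H // s, s, W // s, s).max(axis=(2, 4))

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
kernels = rng.standard_normal((4, 3, 3))          # 4 random 3 x 3 filters
features = max_pool(relu(conv2d_valid(img, kernels)))
```

Stacking several such rounds gives the hierarchical architecture of a CNN, with spatial resolution shrinking at each pooling step.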
These three operations are highly related to traditional filter bank methods widely used in texture analysis [190], as shown in Fig. 18 (a), with the key differences that the CNN filters are learned directly from data rather than handcrafted, and that CNNs have a hierarchical architecture learning increasingly abstract levels of representation. These three operations are also closely related to the RP approach (Fig. 18 (c)) and the LBP (Fig. 18 (d)). Several large-scale image datasets are usually used for CNN pretraining, among them the commonly used ImageNet dataset, with 1000 classes and 1.2 million images [201], and the scene-centric MITPlaces dataset [267,268].

Table 2

Using Pretrained Generic CNN Models (Section 4.1) … [120,123] and LFV [218]
• AlexNet [103]: Achieved breakthrough image classification result on ImageNet; the historical turning point of feature representation from handcrafted to CNN.
• VGGM [31,36]: Similar complexity to AlexNet, but better texture classification performance.
• VGGVD [214]: Much deeper than AlexNet; much larger model size than AlexNet and VGGM; much better texture recognition performance than AlexNet and VGGM.
• GoogLeNet [220]: Much deeper than AlexNet; small pretrained model size; not often used in texture classification.
• ResNet [84]: Significantly deeper than VGGVD; smaller model size (ResNet 101) than AlexNet.

Using Finetuned CNN Models (Section 4.2): End-to-end learning
• TCNN [9]: Using global average pooling; combining outputs from multiple CONV layers.
• BCNN [122,120]: Introducing a novel and orderless bilinear feature pooling method; generalizing Fisher Vector and VLAD; good representation ability; very high feature dimensionality.
• Compact BCNN [63]: Adopting Random Maclaurin Projection or Tensor Sketch Projection to reduce the dimensionality of bilinear features (e.g. from 262144 (512²) to 8192); maintains similar performance to BCNN.
• FASON [46]: Combining the ideas of TCNN [9] and Compact BCNN [63].
• NetVLAD [10]: Plugging a VLAD like layer into a CNN at the last CONV layer.
• DeepTEN [261]: Similar to NetVLAD [10], integrating an encoding layer on top of CONV layers; generalizing orderless pooling methods such as VLAD and FV in a CNN trained end to end.

Texture Specific Deep Convolutional Models (Section 4.3)
• ScatNet [24]: Uses Gabor wavelets for convolution; mathematical interpretation of CNNs; features are stable to deformations and preserve high frequency information.
• PCANet [29]: Inspired by ScatNet [24], using PCA filters to replace Gabor wavelets; using LBP and histogramming as feature pooling; no local invariance.
Comprehensive evaluations of the feature transfer effect of CNNs for the purpose of texture classification have been conducted in [34,35,36,159], with the following critical insights. During model transfer, features extracted from different layers exhibit different classification performance. Experiments confirm that the fully-connected layers of the CNN, whose role is primarily that of classification, tend to exhibit relatively worse generalization ability and transferability, and therefore would need retraining or finetuning on the transfer target. In contrast, the convolutional layers, which act more as feature extractors, with deeper (spatially coarser) convolutional layers producing progressively more abstract features, generally transfer well. That is, the convolutional descriptors are substantially less committed to a specific dataset than the fully connected descriptors. As a result, the source training set is relevant to classification accuracy on different datasets, and the similarity of the source and target plays a critical role when using a pretrained CNN model [14]. Finally, from [35,36,159] it was found that deeper models transfer better, and that the deepest convolutional descriptors give the best performance, superior to the fully-connected descriptors, when proper encoding techniques are employed (such as FVCNN, i.e., CNN features with a Fisher Vector encoder).
(2) Feature Extraction: A CNN can be viewed as a composition $f_L \circ \cdots \circ f_2 \circ f_1$ of L layers, where the output of each layer $X_l = (f_l \circ \cdots \circ f_2 \circ f_1)(I)$ consists of $D_l$ feature maps of size $W_l \times H_l$. The $D_l$ responses at each spatial location form a $D_l$ dimensional feature vector. The network is called convolutional if all the layers are implemented as filters, in the sense that they act locally and uniformly on their input. From bottom to top layers, the image undergoes convolution, and the receptive field of these convolutional filters and the number of feature channels increase, whereas the size of the feature maps decreases. Usually, the last several layers of a typical CNN are called fully connected (FC) because, if seen as filters, their support is the same as the size of the input $X_{l-1}$, and they therefore lack locality.
The most straightforward approach to CNN based texture classification is to extract the descriptor from the fully connected layers of the network [35,36], e.g., the FC6 or FC7 descriptors in AlexNet [103]. The fully connected layers are pretrained discriminatively, which can be either an advantage or a disadvantage, depending on whether the information that they captured can be transferred to the domain of interest [31,36,68]. The fully connected descriptors have a global receptive field and are usually viewed as global features suitable for classification with an SVM classifier. In contrast, the convolutional layers of a CNN can be used as filter banks to extract local features [35,36,70]. Compared with the global fully-connected descriptors, lower level convolutional descriptors are more robust to image transformations such as translation and occlusion. In [35,36], the features are extracted as the output of a convolutional layer, directly from the linear filters (excluding ReLU and max pooling, if any), and are combined with traditional encoders for global representation. For instance, the last convolutional layer of VGGVD (very deep with 19 layers) [214] yields a set of 512 dimensional descriptor vectors; in [34,35,36] four types of CNNs were considered for feature extraction.
(3) Feature Encoding and Pooling: A set of features extracted from convolutional or fully connected layers resembles a set of texture features as described in Section 3.2, so the traditional feature encoding methods discussed in Section 3.4 can be directly employed. In [36], Cimpoi et al. evaluated several encoders, i.e. standard BoW [114], LLC [240], VLAD [91] and IFV [179] (reviewed in Section 3.4), for CNN features, and showed that the best performance is achieved by IFV. It has been reported that VGGVD+IFV with a linear SVM classifier produced consistently near perfect classification performance on several texture datasets: KTHTIPS (99.8%), UIUC (99.9%), UMD (99.9%) and ALOT (99.5%), as summarized in Table 4. In addition, it obtained significant improvement on very challenging datasets like KTHTIPS2b (81.8%), FMD (79.8%) and DTD (72.3%). However, it only achieved 80.0% and 82.3% on Outex TC10 and Outex TC12 respectively, which are significantly worse than the near perfect performance of MRELBP on these two datasets [132]; a clear indicator that DCNN based features require a large amount of training samples and that they lack local invariance. Song et al. [218] proposed a neural network to transform the FVCNN descriptors into a lower dimensional representation. As shown in Fig. 20, locally transferred FVCNN (LFVCNN) descriptors are obtained by passing the 2KD dimensional FVCNN descriptors of images through a multilayer neural network consisting of fully connected layers, $\ell_2$ normalization layers, and ReLU layers. LFVCNN achieved state of the art results on KTHTIPS2b (82.6%), FMD (82.1%) and DTD (73.8%), as shown in Table 4.
Recently, Gatys et al. [65] showed that the Gram matrix representations extracted from various layers of VGGNet [214] can be inverted for texture synthesis. The work of Gatys et al. [65] ignited a renewed interest in texture synthesis [229]. Notably, the Gram matrix representation used in their approach is identical to the bilinear pooling of CNN features of Lin et al. [122], which was shown to be effective for texture recognition in [120]. Like the traditional encoders introduced in Section 3.4, bilinear feature pooling is an orderless representation of the input image and hence is suitable for modeling textures. The Bilinear CNN (BCNN) descriptors are obtained by computing the outer product of each feature $x_i^l$ with itself, reordered into feature vectors, and subsequently pooled via sum to obtain the final global representation. The dimension of the bilinear descriptor is $(D^l)^2$, which is very high (e.g. $512^2$). It was shown in [120,123] that the texture classification performance of BCNN and FVCNN was virtually identical, indicating that bilinear pooling is as good as Fisher vector pooling for texture recognition. It was also found that the BCNN descriptor of the last convolutional layer performed the best, in agreement with [36].
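The bilinear (Gram matrix) pooling described above can be illustrated with a minimal NumPy sketch. It assumes that the convolutional feature map is available as an array of shape (H, W, D); the function name `bilinear_pool` and the random input are hypothetical and serve only to show that the summed outer products form a descriptor of dimension D squared that does not depend on the spatial order of the local features.

```python
import numpy as np

def bilinear_pool(feature_map):
    """Sum-pool the outer products x x^T over all spatial locations.

    feature_map: array of shape (H, W, D), one D-dimensional local
    descriptor per location (an assumed layout for this sketch).
    Returns a vector of length D * D.
    """
    h, w, d = feature_map.shape
    x = feature_map.reshape(h * w, d)   # N local descriptors, N = H * W
    gram = x.T @ x                      # D x D, equals sum_i outer(x_i, x_i)
    return gram.reshape(d * d)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((7, 7, 16))  # random stand-in for CNN features
desc = bilinear_pool(fmap)

# Shuffling the spatial locations leaves the descriptor unchanged,
# which is the orderless property referred to in the text.
perm = rng.permutation(7 * 7)
shuffled = fmap.reshape(-1, 16)[perm].reshape(7, 7, 16)
desc_shuffled = bilinear_pool(shuffled)
```

With D = 16 the descriptor already has 256 entries, which makes the quadratic growth in dimensionality noted above easy to see.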

Using Finetuned CNN Models:
Pretrained CNN models, discussed in Section 4.1, have achieved impressive performance in texture recognition; however, training in these methods is a multistage pipeline that involves feature extraction, codebook generation, feature encoding and classifier training. Consequently, these methods cannot exploit the full capability of neural networks in representation learning. Generally, finetuning CNN models on task-specific training datasets (or learning from scratch if large-scale task-specific datasets are available) is expected to improve on the already strong performance achieved by pretrained CNN models [31,68]. When using a finetuned CNN model, the global image representation is usually generated in an end-to-end manner; that is, the network will produce a final visual representation without additional explicit encoding or pooling steps, as illustrated in Fig. 5. When finetuning a CNN, the last fully connected layer is modified to have B nodes corresponding to the number of classes in the target dataset. The nature of the datasets used in finetuning is important to learning discriminative CNN features. The pretrained CNN model is capable of discriminating images of different objects or scene classes, but may be less effective in discerning the difference between different textures (material types) since an image in ImageNet may contain different types of textures (materials). The size of the dataset used in finetuning matters as well, since too small a dataset may be inadequate for complete learning.
To the best of our knowledge, the behaviour of a large-scale CNN like VGGNet [214], either finetuned or trained from scratch on a texture dataset, has not been fully explored, almost certainly due to the fact that a large texture dataset on the scale of ImageNet [201] or MITPlaces [267] does not exist. Most existing texture datasets are small, as discussed later in Section 6, and according to [9,120] finetuning a VGGNet [214] or AlexNet [103] on existing texture datasets leads to negligible performance improvement. As shown in Fig. 19 (a), for a typical CNN like VGGNet [214], the output of the last convolutional layer is reshaped into a single feature vector (spatially sensitive) and fed into fully connected layers (i.e., order sensitive pooling). The global spatial information is necessary for analyzing the global shapes of objects; however, it has been realized [9,36,65,120,261] that it is not of great importance for analyzing textures due to the need for orderless representation. The FVCNN descriptor shows higher recognition performance than FCCNN, even if the pretrained VGGVD model is finetuned on the texture dataset (i.e., the finetuned FCCNN descriptor) [36,120]. Therefore, an orderless feature pooling from the output of a convolutional layer is desirable for end-to-end learning. In addition, orderless pooling does not require an input image to be of a fixed size, motivating a series of innovations in designing novel CNN architectures for texture recognition [9,10,46,123,261].
A Texture CNN (TCNN) based on AlexNet, as illustrated in Fig. 19 (b), was developed in [9]. It simply utilizes global average pooling to transform a field of descriptors $X^l \in \mathbb{R}^{W^l \times H^l \times D^l}$ at a given convolutional layer $l$ of a CNN into a $D^l$ dimensional vector which is connected to a fully connected layer. TCNN has fewer parameters and lower complexity than AlexNet. In addition, Andrearczyk and Whelan [9] proposed to fuse the global average pooled vector of an intermediate convolutional layer and that of the last convolutional layer via concatenation, feeding the fused vector to the later fully connected layers, a combination which resembles the hypercolumn feature developed in [81]. Andrearczyk and Whelan [9] observed that finetuning a network that was pretrained on a texture-centric dataset achieves better results on other texture datasets compared to a network pretrained on an object-centric dataset of the same size, and that the size of the dataset on which the network is pretrained or finetuned predominantly influences the performance of the finetuning. These two observations suggest that a very large texture dataset could bring a significant contribution to CNNs applied to texture analysis.
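As a rough sketch of the pooling and fusion steps just described, and not the implementation of [9], the following NumPy code averages a feature map over its spatial dimensions and concatenates the pooled vectors of two layers. The helper names and the array shapes are assumptions chosen for illustration only.

```python
import numpy as np

def global_average_pool(feature_map):
    """Average a (H, W, D) feature map over space, giving a D-vector."""
    return feature_map.mean(axis=(0, 1))

def fused_descriptor(intermediate_map, last_map):
    """Concatenate the pooled vectors of two layers (hypothetical helper)."""
    return np.concatenate([global_average_pool(intermediate_map),
                           global_average_pool(last_map)])

rng = np.random.default_rng(0)
mid = rng.standard_normal((27, 27, 256))   # illustrative shapes only
last = rng.standard_normal((13, 13, 384))
fused = fused_descriptor(mid, last)        # length 256 + 384

# A differently sized input changes only H and W, so the pooled
# descriptor keeps the same length.
last_big = rng.standard_normal((20, 31, 384))
pooled_big = global_average_pool(last_big)
```

Because the spatial dimensions are averaged away, the length of the pooled vector depends only on the number of channels, which is why this kind of orderless pooling does not need a fixed input size.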
In BCNN [123], as shown in Fig. 19 (c), Lin et al. proposed to replace the fully connected layers with an orderless bilinear pooling layer, which was discussed in Section 4.1. This method was successfully applied to texture classification in [120] and obtained slightly better results than FVCNN, however the representational power of bilinear features comes at the cost of very high dimensional feature representations, which induce substantial computational burdens and require large amounts of training data, motivating several improvements on BCNN. Gao et al. [63] proposed compact bilinear pooling, as shown in Fig. 19 (d), which utilizes Random Maclaurin Projection or Tensor Sketch Projection to reduce the dimensionality of bilinear representations while still maintaining similar performance to the full BCNN feature [123] with a 90% reduction in the number of learned parameters. To combine the ideas in [9] and [63], Dai et al. [46] proposed an effective fusion network called FASON (First And Second Order information fusion Network) that combines first and second order information flow, as illustrated in Fig. 19 (e). These two types of features were generated from different convolutional layers and concatenated to form a single feature vector which was connected to a fully connected softmax layer for end to end training. In [100], Kong and Fowlkes proposed to represent the bilinear features as a matrix and applied a low rank bilinear classifier. The resulting classifier can be evaluated without explicitly computing the bilinear feature map which allows for a large reduction in the computational time as well as decreasing the effective number of parameters to be learned.
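To make the Tensor Sketch idea more concrete, the following NumPy sketch projects the outer product of a local descriptor with itself to d dimensions without ever forming the D x D matrix. It is an illustration under stated assumptions rather than the code of [63]: the hash and sign vectors are drawn at random with NumPy, all function names are hypothetical, and the sizes D = 64 and d = 512 are arbitrary.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x to d dimensions: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)
    return out

def tensor_sketch(x, h1, s1, h2, s2, d):
    """Sketch of outer(x, x) as the circular convolution of two count
    sketches, computed in the Fourier domain."""
    f1 = np.fft.fft(count_sketch(x, h1, s1, d))
    f2 = np.fft.fft(count_sketch(x, h2, s2, d))
    return np.real(np.fft.ifft(f1 * f2))

def compact_bilinear_pool(local_descriptors, h1, s1, h2, s2, d):
    """Sum-pool the sketches of all local descriptors of one image."""
    return sum(tensor_sketch(x, h1, s1, h2, s2, d) for x in local_descriptors)

rng = np.random.default_rng(0)
D, d = 64, 512                               # assumed sizes; D * D = 4096
h1, h2 = rng.integers(0, d, D), rng.integers(0, d, D)
s1, s2 = rng.choice([-1.0, 1.0], D), rng.choice([-1.0, 1.0], D)

x = rng.standard_normal(D)
sketch = tensor_sketch(x, h1, s1, h2, s2, d)

# Reference: count sketch of the explicit outer product, using the
# combined hash (h1[i] + h2[j]) mod d and the sign s1[i] * s2[j].
H = (h1[:, None] + h2[None, :]) % d
S = s1[:, None] * s2[None, :]
reference = count_sketch(np.outer(x, x).ravel(), H.ravel(), S.ravel(), d)

pooled = compact_bilinear_pool(rng.standard_normal((49, D)), h1, s1, h2, s2, d)
```

The two count sketches cost O(D) each and the convolution O(d log d), so the explicit outer product is needed here only to build the reference vector for comparison.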
There are some works attempting to integrate CNN and VLAD or FV pooling in an end to end manner. In [10], a NetVLAD network was proposed by plugging a VLAD-like layer into a CNN at the last convolutional layer, which allows end to end training. The model was initially designed for place recognition; however, when applied to texture classification by Song et al. [218] it was found that the classification performance was inferior to FVCNN. Similar to NetVLAD [10], a Deep Texture Encoding Network (DeepTEN) was introduced in [261] by integrating an encoding layer on top of convolutional layers, also generalizing orderless pooling methods such as VLAD and FV in a CNN trained end to end.
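The following NumPy sketch illustrates what a VLAD-like layer with soft assignment computes in its forward pass: each local descriptor contributes its residual to every codeword, weighted by a softmax over negative squared distances. This is a simplified illustration and not the layer of [10] or [261]; in those networks the codewords and assignment parameters are learned by backpropagation, whereas here the centers, the sharpness parameter `alpha`, and the function name `soft_vlad` are fixed, hypothetical choices.

```python
import numpy as np

def soft_vlad(descriptors, centers, alpha=10.0):
    """Aggregate (N, D) descriptors against (K, D) centers into a
    single l2-normalised vector of length K * D."""
    residuals = descriptors[:, None, :] - centers[None, :, :]   # (N, K, D)
    d2 = (residuals ** 2).sum(axis=2)                           # (N, K)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)             # soft assignment weights
    v = (a[:, :, None] * residuals).sum(axis=0)   # (K, D) weighted residuals
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
N, K, D = 49, 4, 8                                # illustrative sizes
descriptors = rng.standard_normal((N, D))
centers = rng.standard_normal((K, D))

v = soft_vlad(descriptors, centers)
v_shuffled = soft_vlad(descriptors[rng.permutation(N)], centers)
```

Since every operation is differentiable with respect to the descriptors and the centers, an aggregation of this form can be placed on top of convolutional layers and trained jointly with them, while remaining orderless.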

Using Handcrafted Deep Convolutional Networks
In addition to the CNN based methods reviewed in Sections 4.1 and 4.2, some "handcrafted" deep convolutional networks [24,29] deserve attention. Recall that a standard CNN architecture (as shown in Fig. 18 (b)) consists of multiple trainable building blocks stacked on top of one another followed by a supervised classifier. Each block generally consists of three layers: a convolutional filter bank layer, a nonlinear layer, and a feature pooling layer. Similar to the CNN architecture, Bruna and Mallat [24] proposed a highly influential Scattering convolution Network (ScatNet), as illustrated in Fig. 21.
The key difference from a CNN, where the convolutional filters are learned from data, is that the convolutional filters in ScatNet are predetermined: they are simply wavelet filters, such as Gabor or Haar wavelets, and no learning is required. Moreover, the ScatNet usually cannot go as deep as a CNN; Bruna and Mallat [24] suggested two convolutional layers, since the energy of the third layer scattering coefficients is negligible. Specifically, as can be seen in Fig. 21, ScatNet cascades wavelet transform convolutions with modulus nonlinearity and averaging poolers. It is shown in [24] that ScatNet computes translation-invariant image representations which are stable to deformations and preserve high frequency information for recognition. As shown in Fig. 21, the average pooled feature vectors from each stage are concatenated to form the global feature representation of an image, which is input into a simple PCA classifier for recognition, and which has demonstrated very high performance in texture recognition [24,212,213,211,132]. It achieved very high classification performance on Outex TC10 (99.7%), Outex TC12 (99.1%), KTHTIPS (99.4%), CUReT (99.8%), UIUC (99.4%) and UMD (99.7%) [24,213,132], but performed poorly on more challenging datasets like DTD (35.7%). A downside of ScatNet is that the feature extraction stage is very time consuming, although the dimensionality of the global representation feature is relatively low (several hundreds). ScatNet has been extended to achieve rotation and scale invariance [212,213,211] and applied to other problems besides texture such as object recognition [173]. Importantly, the mathematical analysis of ScatNet explains important properties of CNN architectures, and it is one of the few works that provides detailed theoretical understanding of CNNs. Fig. 21 contrasts ScatNet with PCANet, proposed by Chan et al. [29], a very simple convolutional network based on trained PCA filters, instead of predefined Gabor wavelets, with LBP encoding [163] and histogramming for feature pooling. Two simple variations of PCANet, RandNet and LDANet, were also introduced in [29], sharing the same topology as PCANet, but their convolutional filters are either random filters as in [125] or learned from Linear Discriminant Analysis (LDA). Compared with ScatNet, feature extraction in PCANet is much faster, but with weaker invariance and texture classification performance [132].
Fig. 21 Illustration of two similar handcrafted deep convolutional networks: ScatNet [24] and PCANet [29].
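To illustrate what "trained PCA filters" means in practice, the following NumPy sketch learns a small filter bank as the leading principal components of mean-removed image patches. It is a simplified, single-stage illustration under assumed settings (3 x 3 patches, four filters, random input images, a hypothetical function name) and not the full PCANet pipeline of [29].

```python
import numpy as np

def learn_pca_filters(images, k=3, num_filters=4):
    """Learn num_filters k x k filters as the leading principal
    components of all overlapping, mean-removed k x k patches."""
    patches = []
    for img in images:
        h, w = img.shape
        for r in range(h - k + 1):
            for c in range(w - k + 1):
                p = img[r:r + k, c:c + k].ravel()
                patches.append(p - p.mean())      # remove the patch mean
    P = np.asarray(patches)                       # (num_patches, k * k)
    cov = P.T @ P / len(P)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues ascending
    top = eigvecs[:, ::-1][:, :num_filters]       # leading eigenvectors
    return top.T.reshape(num_filters, k, k)

rng = np.random.default_rng(0)
images = [rng.standard_normal((16, 16)) for _ in range(5)]  # stand-in data
filters = learn_pca_filters(images)
```

The resulting filters are orthonormal by construction and require no labels or gradient descent, which is consistent with the fast feature extraction attributed to PCANet above.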

Attribute-Based Texture Representation
In recent years, the recognition of texture categories has been extensively studied and has shown substantial progress, partly thanks to the texture representations reviewed in Sections 3 and 4. Despite the rapid progress, particularly with the development of deep learning techniques, we remain far from reaching the goal of comprehensive scene understanding [102]. Although the traditional goal was to recognize texture categories based on their perceptual differences or their material types, textures have other properties, as shown in Fig. 22, where we may speak of a banded shirt, a striped zebra, and a striped tiger. Here, banded and striped are referred to as visual texture attributes [34], which describe texture patterns using human-interpretable semantic words. With texture attributes, the textures shown back in Fig. 3 (d) might all be described as braided, falling into a single category in the Describable Textures Dataset (DTD) database [34].
The study of visual texture attributes [17,34,151] was motivated by the significant interest raised by visual attributes [58,176,175,105]. Visual attributes allow objects to be described in significantly greater detail than a category label and are therefore important towards reaching the goal of comprehensive scene understanding [102], which would support important applications such as detailed image search, question answering, and robotic interactions. Texture attributes are an important component of visual attributes, particularly for objects that are best characterized by a pattern. They can support advanced image search applications, such as more specific queries in image search engines (e.g. a striped skirt, rather than just any skirt). The investigation of texture attributes and detailed semantic texture description offers a significant opportunity to close the semantic gap in texture modeling and to support applications that require fine grained texture description. Nevertheless, only a few papers [17,34,151] have investigated texture attributes thus far, and no systematic study has yet been attempted.
There are three essential issues in studying texture attribute based representation:
1. The identification of a universal texture attribute vocabulary that can describe a wide range of textures;
2. The establishment of a benchmark texture dataset, annotated by semantic attributes;
3. The reliable estimation of texture attributes from images, based on low level texture representations, such as the methods reviewed in Sections 3 and 4.
Tamura et al. [221] proposed a set of six attributes for describing textures: coarseness, contrast, directionality, line-likeness, regularity and roughness. Amadasun and King [8] refined this idea with the five attributes of coarseness, contrast, busyness, complexity, and strength. Later, Bhushan et al. [16] studied texture attributes from the perspective of psychology, asking subjects to cluster a collection of 98 texture adjectives according to similarity and identified eleven major clusters. Recently, inspired by the work in [16,58,175,105], Matthews et al. [151] attempted to enrich texture analysis with semantic attributes. They identified eleven commonly-used texture attributes by selecting a single adjective from each of the eleven clusters identified by Bhushan et al. [16]. Then, with the eleven texture attributes, they released a publicly available human-provided labeling of over 300 classes of texture from the Outex database [162]. For each texture image, instead of asking a subject to simply identify the presence or absence of each texture attribute, Matthews et al. [151] proposed a framework of pairwise comparison, in which a subject was shown two texture images simultaneously and prompted to choose the image exhibiting more of some attribute, motivated by the use of relative attributes [175].
After performing a screening process on the 98 adjectives identified by Bhushan et al. [16], Cimpoi et al. [34] obtained a texture attribute vocabulary of 47 English adjectives and collected a dataset providing 120 example images for each attribute. They furthermore provide a comparison of BoW- and CNN-based texture representation methods for attribute estimation, demonstrating that texture attributes are excellent texture descriptors, transferring between datasets. Bormann et al. [17] introduced a set of seventeen human comprehensible attributes (seven color and ten structural) for color texture characterization. They also collected a new database named Robotics Domain Attributes Database (RDAD) for the indoor service robotics context. They compared five low level texture representation approaches for attribute prediction, and found that not all objects can be described very well with the seventeen attributes. Clearly, which attributes are best suited for a precise description of different object and texture classes deserves further attention.
6 Texture Datasets and Performance

Texture Datasets
Datasets have played an important role throughout the history of visual recognition research. They have been one of the most important factors for the considerable progress in the field, not only as a common ground for measuring and comparing performance of competing algorithms but also pushing the field towards increasingly complicated and challenging problems. With the rapid development of visual recognition approaches, datasets have become progressively more challenging, evidenced by the fact that the recent large scale ImageNet dataset [201] has enabled breakthroughs in visual recognition research. In the big data era, it becomes urgent to further enrich texture datasets to promote future research. In this section, we discuss existing texture image datasets that have been released and commonly used by the research community for texture classification, as summarized in Table 3.
The Brodatz texture database [22], derived from the Brodatz Album [23], is the earliest, the most widely used and the most famous texture database. It has a relatively large number of classes (111), with each class having only one image. Many texture representation approaches exploit the Brodatz database for evaluations [99,125,163,187,190,231], however in most cases the entire database is not utilized, except in some recent studies [67,110,132,182,263]. The database has been criticized because of the lack of intraclass variations such as scale, rotation, perspective and illumination.
The Vision Texture Database (VisTex) [134,239] is an early and well-known database. Built by the MIT Media Lab, it has 167 classes of textures, each with only one image. The VisTex textures are imaged under natural lighting conditions, and have extra visual cues such as shadows, lighting, depth and perspective, and are thus closer in appearance to real-world images. VisTex is often used for texture synthesis or segmentation, but rarely for image-level texture classification.
Since 2000, texture recognition has evolved to classifying real world textures with large intraclass variations due to changes in camera pose and illumination, leading to the development of a number of benchmark texture datasets based on various real-world material instances. Among these, the most famous and widely used is the Columbia-Utrecht Reflectance and Texture (CUReT) dataset [47], with 61 different material textures taken under varying image conditions in a controlled lab environment. The effects of specularities, interreflections, shadowing, and other surface normal variations are evident, as shown in Fig. 3 (a). CUReT is a considerable improvement over Brodatz, where all such effects are absent. Based on the original CUReT, Varma and Zisserman [236] built a subset for texture classification, which became the widely used benchmark to assess classification performance. CUReT has limitations of no significant scale change for most of the textures and limited in-plane rotation. Thus, a discriminative texture feature without rotation invariance can achieve high recognition rates [24].
Noticing the limited scale variation in CUReT, researchers from the Royal Institute of Technology (KTH) introduced a dataset called "KTH Textures under varying Illumination, Pose, and Scale" (KTHTIPS) [82,174] by imaging ten CUReT materials at three different illuminations, three different poses, and nine different distances, but with significantly fewer settings for lighting and viewing angle than CUReT. KTHTIPS was created to extend CUReT in two directions, the first of which is providing variations in scale. Experiments with Brodatz or VisTex used different nonoverlapping subregions from the same image for training and testing; experiments with CUReT or KTHTIPS used different subsets of the images taken of the identical sample for training and testing. KTHTIPS2 was one of the first datasets to offer considerable variations within each class. It groups textures not only by instance, but also by the type of material (e.g., wool). It is built on KTHTIPS and provides a considerable extension by imaging four physical, planar samples of each of eleven materials [174].
The Oulu Texture (Outex) database was collected by the Machine Vision Group at the University of Oulu [162]. It has the largest number of different texture classes (320), with each class having images photographed under three illuminations and nine rotation angles, but with limited scale variations. Based on Outex, a series of benchmark test suites were derived for evaluations of texture classification or segmentation algorithms [162]. Among them, two benchmark datasets, Outex TC00010 and Outex TC00012 [163], designated for testing rotation and illumination invariance, appear commonly in papers. The UIUC (University of Illinois Urbana-Champaign) dataset collected by Lazebnik et al. [110] contains 25 texture classes, with each class having 40 uncalibrated, unregistered images. It has significant variations in scale and viewpoint as well as nonrigid deformations (see Fig. 3 (b)), but has less severe illumination variations than CUReT. The challenges of this database are that there are few sample images per class, but with significant variations within classes. Though UIUC improves over CUReT in terms of large intraclass variations, it is much smaller than CUReT both in the number of classes and the number of images per class. The UMD (University of Maryland) dataset [253] also contains 25 texture classes; similar to UIUC, it has significant viewpoint and scale variations and uncontrolled illumination conditions. As textures are imaged under variable truncation, viewpoint, and illumination, the UIUC and the UMD have stimulated the creation of texture representations that are invariant to significant viewpoint changes.
The Amsterdam Library of Textures (ALOT) database [25] consists of 250 texture classes. It was collected in a controlled lab environment at eight different lighting conditions. Although it has a much larger number of texture classes than UIUC or UMD, it has little scale, rotation and viewpoint variation and is therefore not a very challenging dataset. The Drexel Texture (DreTex) dataset [172] contains 20 different textures, each of which was imaged approximately 2000 times under different (known) illumination directions, at multiple distances, and with different in-plane and out of plane rotations. It contains stochastic and regular textures.
The Raw Food Texture database (RawFooT) has been specially designed to investigate the robustness of texture representation methods with respect to variations in the lighting conditions [44]. It consists of 68 texture classes of raw food, with each class having 46 images acquired under 46 lighting conditions which may differ in the light direction, in the illuminant color, in its intensity, or in a combination of these factors. It has no variations in rotation, viewpoint and scale.
Due to the rapid progress of texture representation approaches, the performance of many methods on the datasets described above is close to saturation, with KTHTIPS2b being an exception due to its increased complexity. However, most datasets introduced above make the simplifying assumption that textures fill images, and often there is limited intraclass variability, due to a single or limited number of instances, captured under controlled scale, viewpoint and illumination. In recent years, researchers have set their sights on more complex recognition problems where textures appear under poor viewing conditions, low resolution, and in realistic cluttered backgrounds. The Flickr Material Database (FMD) [206,207] was built to address some of these limitations, by collecting many different object instances from the Internet grouped in 10 different material categories, with examples shown in Fig. 3 (e). The FMD [206] focuses on identifying materials such as plastic, wood, fiber and glass. The limitation of the FMD dataset is that its size is quite small, containing only 10 material classes with 100 images in each class.
The texture attributes of the DTD [34] build on the work of Bhushan et al. [16], who studied the relationship between commonly used English words and the perceptual properties of textures, identifying a set of words sufficient to describe a wide variety of texture patterns. These human interpretable texture attributes can vividly characterize textures, as shown in Fig. 24. Based on the 47 texture attributes, Cimpoi et al. [34] introduced the corresponding DTD dataset consisting of 120 texture images per attribute, by downloading images from the Internet in an effort to directly support real world applications. The large intraclass variations in the DTD are different from traditional texture datasets like CUReT, UIUC and UMD, in the sense that the images shown in Fig. 3 (d) all belong to the braided class, whereas in a traditional sense these textures should belong to rather different texture categories.
Subsequent to FMD, Bell et al. [13] released OpenSurfaces (OS), which has over 20,000 images from consumer photographs, each containing a number of high-quality texture or material segments. Images in OS have real world context, in contrast to prior databases where each image belongs to one texture category and the texture fills the whole image. OS has over 100,000 segments (as shown in Fig. 25) that can support a variety of applications. Many, but not all, of these segments are annotated with material names, the viewpoint, reflectance, the object names and scene class. The number of segments in each material category can also be highly unbalanced in the OS.
Using the OS dataset as the seed, Bell et al. [14] introduced a large material dataset named the Materials in Context Database (MINC) for material recognition and segmentation in the wild, with samples shown in Fig. 26. MINC has a total of 3 million material samples from 23 different material categories. MINC is more diverse, has more samples in each category, and is much larger than previous datasets. Bell et al. concluded that a large and well-sampled dataset such as MINC is key for real-world material recognition and segmentation.
Concurrent to the work by Bell et al. [14], Cimpoi et al. [36] derived a new dataset from OS to conduct a study of material and describable texture attribute recognition in clutter. Since not all segments in OS have a complete set of annotations, Cimpoi et al. [36] selected a subset of segments annotated with material names, annotated the dataset with eleven texture attributes, and removed those material classes containing fewer than 400 segments. Similarly, the Robotics Domain Attributes Database (RDAD) [17] contains 57 categories of everyday indoor object and surface textures labeled with a set of seventeen texture attributes, collected to address the target domain of everyday objects and surfaces that a service robot might encounter.
Wang et al. [241] introduced a new light-field dataset of materials, called the Light-Field Material Database (LFMD). Since light-fields can capture multiple viewpoints in a single shot, they implicitly contain reflectance information, which should be helpful in material recognition. The goal of LFMD is to investigate whether 4D light-field information improves the performance of material recognition.
Finally, Xue et al. [255] built a material database named the Ground Terrain in Outdoor Scenes (GTOS) to study the use of spatial and angular reflectance information of outdoor ground terrain for material recognition. It consists of over 30,000 images covering 40 classes of outdoor ground terrain under varying weather and lighting conditions.
Table 4 presents a performance summary of representative methods applied to popular benchmark texture datasets. It is clear that major improvements have come from more powerful local texture descriptors such as MRELBP [132,131], ScatNet [24] and CNN-based descriptors [36] and from advanced feature encoding methods like IFV [179]. One of the limitations of CNN based descriptors is their sensitivity to image degradations. Despite the usual advantages of CNN based methods, they come at a cost of very high computational complexity and memory requirements. We believe that traditional texture descriptors, like the efficient LBP and robust variants such as MRELBP, still have merits in cases when real-time computation is a priority or when robustness to image degradation is needed [132].

Performance
As can be seen from Table 4, currently the highest classification scores on Outex TC10, Outex TC12, CUReT, KTHTIPS, UIUC, UMD and ALOT are nearly perfect, in excess of 99.5%, and quite a few texture representation approaches can achieve more than 99.0% accuracy on these datasets. Since the influential work by Cimpoi et al. [34,35,36], who reported near perfect classification accuracies with pretrained CNN features for texture classification, subsequent representative CNN based approaches have not reported results on these datasets because performance is saturated and because the datasets are not large enough to allow finetuning to obtain improved results. The FMD, DTD and KTHTIPS2b are undoubtedly more challenging than other texture datasets, as illustrated by the texture category separation of UIUC and FMD shown in Fig. 27, and these more challenging datasets appear more frequently in recent works. However, since the IFV encoding of VGGVD descriptors [36], the progress on these three datasets has been slow, with incremental improvements in accuracy and efficiency obtained by building more complex or deeper CNN architectures.
As can be observed from Table 4, LBP type methods (LBP [163], MRELBP [131] and BIF [40]), which adopt a predefined codebook, have a much more efficient feature extraction step than the remaining methods listed. For the BoW based methods which require codebook learning, the codebook learning, feature encoding and pooling processes are similar, so the distinguishing factors are the computational cost and the feature dimensionality of the local texture descriptor. Among commonly used local texture descriptors, approaches which first detect local regions of interest and then compute local descriptors, such as SIFT, RIFT and SPIN [110,263], are among the slowest and have relatively high dimensionality. For the CNN based methods developed in [34,35,36], CNN feature extraction is performed on multiple scaled versions of the original texture image, which requires more computational time. In general, using a pretrained CNN and finetuning it is efficient, whereas training a CNN model from scratch is time consuming. According to [132], ScatNet is computationally expensive at the feature extraction stage, though it has medium feature dimensionality. Finally, at the feature classification stage a linear SVM is significantly faster than a kernel SVM.
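To illustrate why a predefined codebook makes feature extraction cheap, the following is a minimal sketch of the basic 3×3 LBP operator in plain Python. It is an illustrative assumption rather than the implementation evaluated in Table 4: it uses the simplest 8-neighbor, radius-1 form without the rotation-invariant or uniform-pattern mappings of [163], and the function name `lbp_histogram` is hypothetical. The point is that the 256 possible codes act as a fixed codebook, so no codebook learning or separate encoding step is needed.

```python
def lbp_histogram(image):
    """Return a normalized 256-bin histogram of basic 3x3 LBP codes.

    `image` is a 2-D list of grayscale values. This is the simplest LBP
    form (8 neighbors, radius 1), used here only as an illustration.
    """
    rows, cols = len(image), len(image[0])
    if rows < 3 or cols < 3:
        raise ValueError("image must be at least 3x3 pixels")

    # Fixed clockwise ordering of the 8 neighbors; each one contributes one bit.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]

    hist = [0] * 256  # one bin per possible code: the "predefined codebook"
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            center = image[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # Threshold each neighbor against the center pixel.
                if image[y + dy][x + dx] >= center:
                    code |= 1 << bit
            hist[code] += 1

    total = (rows - 2) * (cols - 2)  # number of interior pixels
    return [count / total for count in hist]


# Usage: on a constant image every neighbor equals the center,
# so all interior pixels receive the code with all 8 bits set.
flat = [[5] * 4 for _ in range(4)]
flat_hist = lbp_histogram(flat)
```

Each pixel costs eight comparisons and one histogram increment, which is consistent with the efficiency of LBP type methods noted above; learned-codebook BoW pipelines additionally need descriptor computation, codebook learning and encoding.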

Discussion and Conclusion
The importance of texture representations lies in the fact that they have been extended to many problems beyond that of texture itself. As a comprehensive survey on texture representations, this paper has highlighted recent achievements, provided structural categories for the methods according to their roles in feature representation, analyzed their merits and demerits, summarized existing popular texture datasets, and discussed performance for the most representative approaches. Almost any practical application is a compromise among conflicting requirements such as classification accuracy, robustness to image degradations, compactness and efficiency, the amount of training data available, and the cost and power consumption of implementations. Although significant progress has been made, the following discussion identifies a number of promising directions for exploratory research.
Large Scale Texture Dataset Collection. The constantly increasing volume of image and video data creates new opportunities and challenges. The complex variability of big image data reveals the inadequacies of conventional handcrafted texture descriptors and brings opportunities for representation learning techniques, such as deep learning, which aim at learning good representations automatically from data. The recent success of deep learning in image classification and object recognition is inseparable from the availability of large-scale annotated image datasets such as ImageNet [201] and MS COCO [121]. However, deep learning based texture analysis has not kept pace with the rapid progress witnessed in other fields, partially due to the unavailability of a large-scale texture database. As a result there is significant motivation for a good, large-scale texture dataset, which will significantly advance texture analysis.
More Effective and Robust Texture Representations. Despite significant progress in recent years, most texture descriptors, whether handcrafted or learned, have not been capable of performing at a level sufficient for real world textures. The ultimate goal of the community is to develop texture representations that can accurately and robustly discriminate a massive number of texture categories in all possible scenes, at a level comparable to the human visual system. In practical applications, factors such as significant changes in illumination, rotation, viewpoint and scale, and image degradations such as occlusions, image blur and random noise call for more discriminative and robust texture representations. Further input from psychological research on visual perception and from the biology of the human visual system would be welcome.
Compact and Efficient Texture Representations. There is a tension between the demands of big data and the desire for highly compact and efficient feature representations. On the one hand, many existing texture representations are failing to keep pace with the emerging "big dimensionality" [260], leading to a pressing need for new strategies in dealing with scalability, high computational complexity, and storage. On the other hand, there is a growing need for deploying highly compact and resource-efficient feature representations on platforms like low energy embedded vision sensors and handheld devices. Many of the existing descriptors would similarly fail in these contexts. The current general trend in deep CNN architectures has been to develop deeper and more complicated networks, advances which require massive data and power hungry GPUs and which are not suitable for deployment on mobile platforms with limited resources. As a result, there is a growing interest in building compact and efficient CNN-based features [85,192]. While CNNs generally outperform classical texture descriptors, it remains to be seen which approaches will be most effective in resource-limited contexts, and whether some degree of LBP / CNN hybridization might be considered, such as recent lightweight CNN architectures [124,251].
Reduced Dependence on Large Amounts of Data. There are many applications where texture representations are very useful but only limited amounts of annotated training data are available, or where collecting labeled training data is too expensive (such as visual inspection, facial micro-expression recognition, age estimation and medical texture analysis). Possible research directions include the development of learnable local descriptors requiring modest training data, as in [55,136], and the exploration of effective transfer learning.
Semantic Texture Attributes. Progress in image texture representation and understanding, while substantial, has so far been mostly focused on low-level feature representation. However, in order to address advanced human-centric applications, such as detailed image search and human-robot interaction, low-level understanding will not be sufficient. Future efforts should be devoted to going beyond texture identification and categorization, to developing semantic and easily describable texture attributes that can be well predicted from low-level texture representations, and to exploring fine-grained and compositional structure analysis of texture patterns.
Effect of Smaller Image Size. Performance evaluation of texture descriptors is usually done with texture datasets consisting of relatively large images. For a large number of applications, including facial image analysis, interest region description, segmentation, defect detection, and tracking, the ability to analyze small images at high speed is vital. Many existing texture descriptors would fail in this respect, and it would be important to evaluate the performance of new descriptors under these conditions [205].

Acknowledgments
The authors would like to thank the pioneer researchers in texture analysis and other related fields. The authors would also like to express their sincere appreciation to the associate editor and the reviewers for their comments and suggestions. This work has been supported by the Center for Machine Vision and Signal Analysis at the University of Oulu (Finland) and the National Natural Science Foundation of China under Grant 61872379.

Table 4 Performance (%) summarization of some representative methods on popular benchmark texture datasets. All methods used the same splitting strategy for training and testing on each dataset. Specifically, for KTHTIPS2, one image per class is used for training and the remaining three for testing. For Brodatz, please see [110]; for Outex TC10 and Outex TC12 please see [130]. For DTD, 80 images per class are randomly selected for training and the remaining 40 for testing. For all other datasets, half of the samples per class are chosen for training and the remaining half for testing. Results are averaged over a number of random partitionings of training and testing data. All listed results are quoted from the original papers, except those marked with ( ) from [263], and those marked ( ) from [130]. For interested readers, more results on LBP variants can be found in the recent survey [130,132]. Those dimensions with (†) denote feature dimension before the SoftMax layer. For Brodatz, KTHTIPS2, FMD and DTD, the highest classification score is highlighted; for all other datasets classification scores higher than 99% are highlighted.