Compact Deep Color Features for Remote Sensing Scene Classification

Aerial scene classification is a challenging problem in understanding high-resolution remote sensing images. Most recent aerial scene classification approaches are based on Convolutional Neural Networks (CNNs). These CNN models are trained on a large amount of labeled data and the de facto practice is to use RGB patches as input to the networks. However, the importance of color within the deep learning framework is yet to be investigated for aerial scene classification. In this work, we investigate the fusion of several deep color models, trained using color representations, for aerial scene classification. We show that combining several deep color models significantly improves the recognition performance compared to using the RGB network alone. This improvement in classification performance is, however, achieved at the cost of a high-dimensional final image representation. We propose to use an information theoretic compression approach to counter this issue, leading to a compact deep color feature set without any significant loss in accuracy. Comprehensive experiments are performed on five remote sensing scene classification benchmarks: UC-Merced with 21 scene classes, WHU-RS19 with 19 scene types, RSSCN7 with 7 categories, AID with 30 aerial scene classes, and NWPU-RESISC45 with 45 categories. Our results clearly demonstrate that the fusion of deep color features always improves the overall classification performance compared to the standard RGB deep features. On the large-scale NWPU-RESISC45 dataset, our deep color features provide a significant absolute gain of 4.3% over the standard RGB deep features.

Generally, deep convolutional neural networks or CNNs take a fixed-sized image as input to a series of convolution, local normalization and pooling operations (termed as layers). The final layers of the convolutional neural network are fully connected (FC), and are typically used to extract deep features that are generic and used for a variety of vision applications [5], including remote sensing scene classification [54,73]. The standard input to a deep convolutional neural network is RGB pixel values, with training performed on the large-scale ImageNet dataset. Most existing remote sensing scene categorization approaches employ these CNNs, pre-trained on the ImageNet dataset, as feature extractors. The exploration of different color spaces and their combination for remote sensing scene classification is still an open research problem. The work of [56] investigated different color spaces for vehicle color identification. The work of [23] explored YCbCr and RGB color channels for image super resolution. A collaborative facial color feature learning approach, combining multiple color spaces, was proposed by [41] for face recognition. Here, we investigate a variety of color features within a deep learning framework for remote sensing scene classification.
Prior to deep learning, the impact of multiple color features was well studied for object recognition and detection [3,4,39,67]. The work of [67] studied the invariance properties of different color descriptors and showed that different color features are complementary and their combination provides a consistent improvement in overall classification performance. Khan et al. [39] proposed an attention based framework to combine multiple color features with shape features. Within the deep learning framework, the complementary characteristics of these color features are yet to be investigated for remote sensing scene classification. In this work, we investigate the effectiveness of combining multiple color features, within a deep learning framework, for remote sensing scene classification.
As discussed above, the common strategy employed by most remote sensing scene classification methods is to extract deep features from the activations of the FC layers of a pre-trained deep convolutional neural network. However, such a strategy encounters the inherent problem of a high-dimensional final image representation when combining activations from multiple deep color convolutional neural networks. Therefore, it is desirable to obtain a compact final image representation without sacrificing the improvements obtained from the complementary characteristics of multiple deep color features. Recently, Khan et al. [37] proposed to use a divisive information theoretic clustering (DITC) technique [22] to combine heterogeneous texture descriptors for texture classification. Their work showed that a notable reduction in the dimensionality of the final image representation can be obtained using the DITC technique, without significant loss in classification performance. Motivated by this, we propose to use the DITC technique to compress the dimensionality of a multi-color deep representation for remote sensing scene classification. The DITC approach has previously been employed to compress the high dimensionality of bag-of-words based spatial pyramids and hand-crafted heterogeneous texture representations [25,37]. To the best of our knowledge, we are the first to investigate the DITC technique to compress deep multi-color image representations for scene classification in remote sensing images.
Contributions In this work, we study the problem of remote sensing scene classification with the following contributions.
- We investigate the contribution of color in a deep learning framework for scene classification in remote sensing images.
- We further demonstrate the effectiveness of combining multiple color features within the deep learning framework.
- Furthermore, we propose the usage of an information theoretic compression approach to compress high-dimensional multi-color deep features into a compact image representation.

Related Work
The impact of color features for remote sensing scene analysis has been extensively studied [11,24,59,65,76,78]. The work of [76] investigated integrating color information within Gabor features for remote sensing scene classification. dos Santos et al. [24] evaluated a variety of hand-crafted color and texture feature description approaches for remote sensing image classification and retrieval. Chen et al. [11] performed an evaluation of local features, such as structure, texture and color for remote sensing scene classification. The work of [65] studied a variety of feature description baselines in different color spaces for remote sensing images.
Combining multiple hand-crafted color features has also been investigated in the literature [39,60,67]. The work of [67] investigated integrating color and shape features, within the bag-of-words framework, for object recognition. Their evaluation recommends employing opponent color features with the SIFT descriptor for object recognition and also showed the importance of fusing multiple color representations to achieve further improvement in classification performance. Khan et al. [39] proposed an approach where multiple hand-crafted color features are employed to modulate shape features for object recognition. The work of [1] proposed an approach to combine color models that are learned to achieve color invariance for object detection. The work of [21] investigated the impact of color information for point set registration. Different to these previous works using hand-crafted features, recent works have also investigated combining multiple color features within a deep learning framework for face recognition [41], image super resolution [23] and vehicle classification [56]. However, to our knowledge, the effectiveness of combining multiple color features within a deep learning framework is yet to be investigated for remote sensing scene analysis.
In recent years, several deep learning-based approaches have been introduced for remote sensing scene classification. The work of [54] evaluated off-the-shelf CNNs features and compared their performance with low-level descriptors for remote sensing scene classification. Marmanis et al. [49] proposed a two-stream approach where pre-trained CNNs features are first used to represent the images. Then, the extracted representations were input to shallow CNN classifiers. The work of [72] introduced a hybrid architecture where multi-column stacked denoising sparse auto-encoder is combined with Fisher vector to learn features in a hierarchical manner for land use scene classification. Yan et al. [74] proposed an approach based on improved category-specific codebook using kernel collaborative representation based classification which is integrated with Spatial pyramid matching. Their approach then employed SVM classifier to classify remote sensing images. The work of [73] introduced a large-scale remote sensing scene classification dataset and also evaluated several pre-trained CNNs on their large dataset. Cheng et al. [17] introduced a method based on bag of convolutional features where CNNs features are employed in place of hand-crafted local features to construct bag-of-words based image representation. The work of [2] investigated a fusion approach where standard RGB deep features are combined with LBP-based deep texture features to classify remote sensing images. Chen et al. [10] proposed a CNN based approach under the guidance of a human visual attention mechanism. In their approach, a computational visual attention model is utilised to extract salient regions. Afterwards, sparse filters are employed for learning features from extracted regions. A superpixel-guided layer-wise embedding CNN based approach is introduced by [46] to exploit information from both labeled and unlabeled examples. 
The work of [29] introduced a center-based structured metric learning approach where both deep metrics and center points are taken into account to penalize pairwise correlation and class-wise information between categories. Most of these approaches employ CNN models trained using RGB patches as input. Different to these approaches based on the de facto practice of using RGB patches for CNN training, we investigate the contribution of color in a deep learning framework and demonstrate the effectiveness of integrating multiple deep color features. Our extensive experiments on five benchmarks demonstrate the effectiveness of combining multiple deep color features for remote sensing scene classification.

Our Approach
Here, we first discuss the motivation behind the proposed approach. Then, we describe how the deep color models are constructed. Afterwards, we investigate the fusion of deep color features and counter the problem of the high dimensionality of fused deep color features for classification.
Motivation As discussed earlier, most existing state-of-the-art remote sensing image classification methods are based on CNNs. Here, CNNs are generally pre-trained on a large-scale generic object recognition dataset (ImageNet) using raw RGB pixel values as input. Previously, combining multiple hand-crafted local color features has been investigated, within the bag-of-words framework, for object recognition. Motivated by these previous works, we investigate the contribution of color within a deep learning framework (CNNs) and demonstrate the impact of integrating multiple deep color features for remote sensing scene analysis. To the best of our knowledge, we are the first to investigate the effectiveness of integrating multiple color features, within a deep learning framework (CNNs), for remote sensing scene classification.

Deep Color Convolutional Neural Networks
Most existing state-of-the-art remote sensing image classification methods are based on deep models. These deep models are generally pre-trained on a large-scale generic object recognition dataset (ImageNet) using raw RGB pixel values as an input. Here, we analyze a variety of color features within deep learning framework to evaluate their impact on remote sensing scene classification. We investigate the contribution of color in a standard deep convolutional neural network (CNN) architecture [8,43,64].
To analyze the impact of color for remote sensing image classification, we employ a variety of color features popular in object recognition. The selected color features are based on different color space transformations: HSV, YCbCr, Opponent, C, Lab, and the color names. Furthermore, the motivations behind these color representations differ, ranging from photometric invariance to discriminative power.
RGB In this work, we use the standard three-channel RGB color space as the baseline.
HSV In the HSV color space, the H model is scale-invariant. Further, it is shift-invariant with respect to light intensity [67]. The HSV color space has been previously investigated with the hand-crafted SIFT descriptor for scene recognition [7].
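The invariance of hue to scaling and shifting of the light intensity can be checked numerically. The following is a minimal sketch using Python's standard `colorsys` module; the sample pixel values and the scale and shift amounts are arbitrary choices for illustration, not values from our experiments.

```python
import colorsys
import math


def hue(rgb):
    """Return the hue, in [0, 1), of an RGB triple with components in [0, 1]."""
    h, _s, _v = colorsys.rgb_to_hsv(*rgb)
    return h


base = (0.2, 0.4, 0.6)                     # arbitrary example pixel
scaled = tuple(0.5 * c for c in base)      # light intensity scaled by 0.5
shifted = tuple(c + 0.25 for c in base)    # light intensity shifted by 0.25

hue_base, hue_scaled, hue_shifted = hue(base), hue(scaled), hue(shifted)

# Hue is unchanged (up to floating point error) in both cases.
print(math.isclose(hue_base, hue_scaled, abs_tol=1e-9))
print(math.isclose(hue_base, hue_shifted, abs_tol=1e-9))
```

Saturation and value, in contrast, do change under these transformations, which is why only the hue channel carries the invariance.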
YCbCr In the YCbCr color space, Y is the luminance component and Cb and Cr are the blue-difference and red-difference chroma components. This color space is approximately perceptually uniform and has been used previously for remote sensing images [34,65].
Lab The three dimensions of the Lab color space correspond to L for lightness and a and b for the color components green-red and blue-yellow. This color space is perceptually uniform implying that colors at an equal distance are also perceptually equally far apart.
Opponent In this color space, the $O_1$ and $O_2$ channels encode the color information in the image. The $O_3$ channel describes the intensity information. The image is transformed as in [45]:
\[
\begin{pmatrix} O_1 \\ O_2 \\ O_3 \end{pmatrix} =
\begin{pmatrix} \frac{R - G}{\sqrt{2}} \\ \frac{R + G - 2B}{\sqrt{6}} \\ \frac{R + G + B}{\sqrt{3}} \end{pmatrix}
\]
The opponent color representation possesses invariance with respect to specularities. In the evaluation performed by van de Sande et al. [67], the opponent color space, in conjunction with the hand-crafted SIFT feature descriptor, was shown to provide improved results for visual object recognition.
C The C representation, defined as $C = \left( \frac{O_1}{O_3}, \frac{O_2}{O_3}, O_3 \right)^T$, aims at adding photometric invariance with respect to shadow-shading to the opponent representation. The invariance is achieved by normalizing the first two dimensions with the luminance channel $O_3$. The C representation was initially proposed by [27] and later employed with the SIFT descriptor by [67].
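The opponent and C transformations can be sketched per pixel as follows. This is only an illustration: the opponent coefficients are the ones commonly used in the color descriptor literature (for example in [67]) and are assumed here, and the small constant `eps` guarding against division by zero is a hypothetical addition that is not part of the definition.

```python
import math


def rgb_to_opponent(r, g, b):
    """Map an RGB pixel to the opponent channels (O1, O2, O3)."""
    o1 = (r - g) / math.sqrt(2)
    o2 = (r + g - 2 * b) / math.sqrt(6)
    o3 = (r + g + b) / math.sqrt(3)   # intensity channel
    return o1, o2, o3


def opponent_to_c(o1, o2, o3, eps=1e-8):
    """Normalize O1 and O2 by the intensity channel O3 (eps is an assumption)."""
    return o1 / (o3 + eps), o2 / (o3 + eps), o3


pixel = (0.6, 0.3, 0.1)
shaded = tuple(0.5 * v for v in pixel)   # same surface under weaker lighting

c_pixel = opponent_to_c(*rgb_to_opponent(*pixel))
c_shaded = opponent_to_c(*rgb_to_opponent(*shaded))
gray = rgb_to_opponent(0.4, 0.4, 0.4)
```

For the shaded pixel, the first two C components stay essentially the same while the third one halves, which is the shadow-shading invariance described above. A gray pixel has no chromatic content, so its O1 and O2 values are zero.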
Color names Most of the aforementioned color representations aim at employing specific photometric invariance properties. Different to these representations, the color names are linguistic color labels assigned by humans to represent world colors. It involves the assignment of linguistic color labels to image pixels. A linguistic study by [6] identified that the 11 basic color terms of the English language are white, blue, grey, brown, orange, green, red, black, purple, yellow, and pink. The work of [70] proposed an approach to automatically learn color names from images retrieved with Google image search. The descriptor is based on the 11 basic color terms. The color names representation [70], CN, is defined as a feature vector comprising the probability of each color name for an image $Img$:
\[
CN = \{ p(cns_1 \mid Img), p(cns_2 \mid Img), \ldots, p(cns_{11} \mid Img) \}
\]
with
\[
p(cns_j \mid Img) = \frac{1}{P} \sum_{(x,y) \in Img} p(cns_j \mid f(x, y))
\]
Here, $cns_j$ is the $j$-th color name, $f = \{L^*, a^*, b^*\}$, and $x, y$ are the spatial coordinates of the $P$ pixels in the image $Img$. Further, $p(cns_j \mid f)$ is the probability of a color name given a pixel value in the image, computed from an image dataset collected from Google. Since the images are acquired from the web, the issue of retrieving noisy images is addressed by employing the PLSA approach [70].
Figure 2 shows the proposed color fusion in a deep convolutional neural network architecture. We use the same architecture to train all deep color convolutional neural networks. Each deep color network is trained separately. The details of the underlying network architecture are provided in Sect. 4.1.
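As an illustration of the color names descriptor defined earlier in this section, the following minimal sketch averages per-pixel color name probabilities over an image. The per-pixel probabilities are assumed to be given; the learned mapping from pixel values to color names [70] is not reproduced here, and `pixel_probs` and the ordering of `COLOR_NAMES` are hypothetical choices.

```python
# Ordering of the 11 basic color terms is an arbitrary choice for this sketch.
COLOR_NAMES = ["black", "blue", "brown", "grey", "green", "orange",
               "pink", "purple", "red", "white", "yellow"]


def color_name_descriptor(pixel_probs):
    """Average the per-pixel color name probabilities over all P pixels.

    pixel_probs: one entry per pixel, each a list of 11 probabilities
    p(cns_j | f(x, y)) that sum to one.
    """
    num_pixels = len(pixel_probs)
    if num_pixels == 0:
        raise ValueError("image has no pixels")
    return [sum(p[j] for p in pixel_probs) / num_pixels
            for j in range(len(COLOR_NAMES))]


def one_hot(name):
    """Toy probability vector that assigns a pixel entirely to one color name."""
    return [1.0 if n == name else 0.0 for n in COLOR_NAMES]


# Toy image with two pixels: one certainly green, one certainly blue.
descriptor = color_name_descriptor([one_hot("green"), one_hot("blue")])
```

In this toy case the descriptor assigns a probability of 0.5 to green and to blue and zero to the remaining color names.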

Compact Deep Color Features
As discussed earlier, the final layers of the deep convolutional neural network (FC layers) are typically employed to extract deep features since they are generic and previously used for a variety of vision applications [5], including remote sensing scene classification [54,73].
Here, we extract 4096-dimensional activations from both the FC7 (second last) and FC6 (third last) layers of each deep color network. These activations are then concatenated and used as image features. However, the combination of these activations from multiple deep color convolutional neural networks has the disadvantage of being high-dimensional (more than 57k in size) for each image. Here, we propose to use an information theoretic compression approach (DITC) [22] to compress the high-dimensional multi-color deep representation. The DITC algorithm works by discriminatively learning a compact representation of pre-determined size by minimizing the loss in mutual information between clusters and the class labels of training samples. The approach operates on the class-conditional distributions over deep multi-color image representations.
The loss in mutual information is measured in terms of the Kullback-Leibler (KL) divergence between these class-conditional distributions. The KL divergence between two distributions $p_1$ and $p_2$ is defined by
\[
KL(p_1, p_2) = \sum_{x} p_1(x) \log \frac{p_1(x)}{p_2(x)}
\]
It is worth mentioning that the category information is exploited using only the training samples. The high-dimensional deep multi-color image representation is compressed by merging the bins, over the classes, with similar discriminative powers. We refer to [22] for additional details of the DITC algorithm.
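The idea of merging bins with similar class-conditional distributions can be sketched as follows. This is a simplified illustration in the spirit of the divisive clustering of [22], not the implementation used in our experiments: the round-robin initialization, the fixed iteration cap, and the way class-conditional distributions are estimated from accumulated non-negative activations (`class_bin_mass`) are all assumptions made for this example.

```python
import math


def kl_divergence(p, q):
    """KL(p, q) = sum_x p(x) log(p(x) / q(x)); infinite if q(x) = 0 < p(x)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0.0:
            if qx <= 0.0:
                return math.inf
            total += px * math.log(px / qx)
    return total


def ditc(class_bin_mass, n_clusters, n_iters=50):
    """Cluster feature bins by their class-conditional distributions.

    class_bin_mass[c][t]: accumulated activation of bin t over the training
    images of class c. Every bin is assumed to have positive total mass.
    Returns the cluster index of each bin.
    """
    n_classes, n_bins = len(class_bin_mass), len(class_bin_mass[0])
    bin_mass = [sum(class_bin_mass[c][t] for c in range(n_classes))
                for t in range(n_bins)]
    total = sum(bin_mass)
    priors = [m / total for m in bin_mass]                     # p(bin)
    cond = [[class_bin_mass[c][t] / bin_mass[t] for c in range(n_classes)]
            for t in range(n_bins)]                            # p(class | bin)
    assign = [t % n_clusters for t in range(n_bins)]           # assumed init
    for _ in range(n_iters):
        centers = {}
        for j in range(n_clusters):
            members = [t for t in range(n_bins) if assign[t] == j]
            if members:  # prior-weighted mean of the member distributions
                mass = sum(priors[t] for t in members)
                centers[j] = [sum(priors[t] * cond[t][c] for t in members) / mass
                              for c in range(n_classes)]
        # Move every bin to the cluster whose distribution is closest in KL.
        new_assign = [min(centers, key=lambda j: kl_divergence(cond[t], centers[j]))
                      for t in range(n_bins)]
        if new_assign == assign:
            break
        assign = new_assign
    return assign


def compress(feature, assign, n_clusters):
    """Sum the feature values of all bins that share a cluster."""
    out = [0.0] * n_clusters
    for value, j in zip(feature, assign):
        out[j] += value
    return out


# Toy data: 2 classes, 5 bins. Bins 0 and 2 respond to class 0, the rest to class 1.
toy_mass = [[9, 1, 8, 2, 1],
            [1, 9, 2, 8, 9]]
toy_assign = ditc(toy_mass, n_clusters=2)
toy_compact = compress([1, 2, 3, 4, 5], toy_assign, n_clusters=2)
```

On the toy data, the bins that respond to the same class end up in the same cluster, so the 5-dimensional feature is reduced to 2 dimensions while keeping the class-discriminative bins separated.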

Experimental Setup
We first describe the underlying deep convolutional neural network architecture employed to obtain our deep color models. The deep convolutional neural network is based on the VGG architecture and is similar to [43]. The network consists of 5 convolutional layers (C1, C2, C3, C4, and C5) and 3 fully-connected (FC) layers (FC6, FC7 and FC8). The network takes as input an image of size 224 × 224. Throughout our experiments, images are resized to 224 × 224 pixels and then input to the network. The first convolutional layer C1 contains 64 convolutional filters with a filter size of 11 × 11. Its convolution stride is set to 4 and it is followed by max-pooling with a downsampling factor of 2. The second convolutional layer C2 comprises 256 convolutional filters with a filter size of 5 × 5. Its convolution stride is set to 1, its spatial padding is set to 2, and it is followed by max-pooling with a downsampling factor of 2. For the third, fourth and fifth convolutional layers C3, C4 and C5, the number of convolutional filters is 256, the filter size is 3 × 3, and the convolution stride and spatial padding are 1. For the fifth convolutional layer, a max-pooling downsampling factor of 2 is employed. Furthermore, the first two FC layers (FC6 and FC7) are regularised using dropout [43] with the dropout ratio set to 0.5. Finally, the last FC layer FC8 is a multi-class soft-max classifier. Other than the FC8 layer, the activation function for the rest of the weight layers is the Rectified Linear Unit (ReLU) [35,43,50].
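The convolutional part of the architecture described above can be written down as a small configuration, which also makes it easy to count the filter weights per layer. This is a sketch for illustration only: values that the description does not state (the padding of C1, pooling after C3 and C4) are left as `None`, biases are ignored, and the default of three input channels is an assumption that holds for the three-channel color spaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConvLayer:
    name: str
    filters: int
    kernel: int
    stride: int
    padding: Optional[int]  # None where the description states no value
    pool: Optional[int]     # max-pooling downsampling factor, None if not stated


CONV_LAYERS = [
    ConvLayer("C1", filters=64, kernel=11, stride=4, padding=None, pool=2),
    ConvLayer("C2", filters=256, kernel=5, stride=1, padding=2, pool=2),
    ConvLayer("C3", filters=256, kernel=3, stride=1, padding=1, pool=None),
    ConvLayer("C4", filters=256, kernel=3, stride=1, padding=1, pool=None),
    ConvLayer("C5", filters=256, kernel=3, stride=1, padding=1, pool=2),
]


def conv_weight_counts(layers, in_channels=3):
    """Number of filter weights per layer (biases excluded)."""
    counts = {}
    channels = in_channels
    for layer in layers:
        counts[layer.name] = layer.filters * channels * layer.kernel ** 2
        channels = layer.filters  # output channels feed the next layer
    return counts


counts = conv_weight_counts(CONV_LAYERS)
total_conv_weights = sum(counts.values())
```

Under these assumptions, C1 holds 23,232 weights, C2 holds 409,600, and each of C3 to C5 holds 589,824, for roughly 2.2 million convolutional weights in total.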
We train all deep color convolutional neural networks, described in Sect. 3.1, from scratch on the ImageNet 2012 training set. We use the same set of hyper-parameters as in [43] during network training in our experiments. For all CNN training, the learning rate is set to 0.001 and momentum is set to 0.9. The initial learning rate is decreased by a factor of 10 when the validation error stops decreasing. We initialize the layers from a Gaussian distribution with zero mean and variance equal to $10^{-2}$. A similar data augmentation, as in [8], in the form of random crops, horizontal flips, and RGB color jittering is employed during training. For a fair comparison, we train the baseline standard RGB network by increasing the depth of the network architecture by a factor of seven, resulting in the same number of network parameters as our color fusion. Furthermore, pre-trained deep color convolutional neural networks are employed as feature extractors by extracting 4096-dimensional activations from the FC7 and FC6 layers as image features. All the image features are L2-normalized and input to a one-versus-all linear SVM classifier.
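Two small steps of this pipeline, the L2 normalization of the image features and the learning rate reduction, can be sketched as below. The plateau criterion (drop as soon as the latest validation error is not lower than the best earlier one) and the `eps` guard are assumptions made for this example; the exact criterion is not specified above.

```python
import math


def l2_normalize(feature, eps=1e-12):
    """Scale a feature vector to unit Euclidean length (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in feature))
    return [v / max(norm, eps) for v in feature]


def next_learning_rate(lr, val_errors, factor=10.0):
    """Divide lr by `factor` when the latest validation error did not improve."""
    if len(val_errors) >= 2 and val_errors[-1] >= min(val_errors[:-1]):
        return lr / factor
    return lr


unit = l2_normalize([3.0, 4.0])
lr_plateau = next_learning_rate(0.001, [0.50, 0.40, 0.40])    # no improvement
lr_improving = next_learning_rate(0.001, [0.50, 0.40, 0.30])  # still improving
```

Here `unit` has length one, `lr_plateau` is ten times smaller than the initial rate, and `lr_improving` keeps the initial rate.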
Throughout all experiments, the classification results are reported in terms of average recognition accuracy over all scene categories in a remote sensing scene classification dataset. From the classifier, the scene class label providing the highest confidence is assigned to the test image. The overall recognition results are obtained by computing the average classification score over all scene categories in a remote sensing scene classification dataset. As in [16,73], each dataset is randomly split into training and test sets for performance evaluation. For all datasets, the ratio of training to test images is set to 50:50, where images are randomly selected from each aerial scene category. To obtain a reliable performance comparison, we repeat the evaluation procedure ten times. The final classification results are then reported as the mean over these ten runs together with the standard deviation.
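The evaluation protocol above can be sketched as follows: a random 50:50 split within each scene category, accuracy averaged over the categories, and the mean and standard deviation over repeated runs. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the estimator is not specified above.

```python
import random
import statistics


def stratified_split(labels, train_ratio=0.5, seed=0):
    """Randomly split sample indices into train and test sets within each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = int(round(train_ratio * len(shuffled)))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return sorted(train), sorted(test)


def mean_class_accuracy(y_true, y_pred):
    """Average of the per-class recognition accuracies."""
    correct, count = {}, {}
    for t, p in zip(y_true, y_pred):
        count[t] = count.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return sum(correct[c] / count[c] for c in count) / len(count)


def summarize(run_accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(run_accuracies), statistics.stdev(run_accuracies)


labels = ["a"] * 4 + ["b"] * 6
train_idx, test_idx = stratified_split(labels, train_ratio=0.5, seed=1)
acc = mean_class_accuracy(["a", "a", "b", "b"], ["a", "b", "b", "b"])
mean_acc, std_acc = summarize([0.9, 0.8, 0.7])
```

Repeating `stratified_split` with ten different seeds and passing the ten resulting accuracies to `summarize` mirrors the ten-run protocol.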

Datasets
We conduct experiments on multiple datasets (see Fig. 3).
UC-Merced is a commonly used remote sensing dataset [76] that is publicly available and comprises 2100 images. There are 21 classes in this dataset. Some of the scene categories in this dataset are: agriculture, golf course, baseball diamond, dense residential, medium density residential, forest, river, sparse residential, overpass, parking lot, storage tanks, and tennis courts. The images in the dataset are cropped to 256 × 256 pixels and are collected from 20 regions across the USA.
WHU-RS19 is a public dataset [62] with 950 aerial images acquired from Google Earth imagery. There are 50 samples per scene class in the dataset whereas the images are of size 600 × 600 pixels. There are 19 aerial scene categories in this dataset. Some of the scene categories in the dataset are: airport, meadow, pond, parking, port, beach, bridge, river, railway station, viaduct, commercial area, desert, farmland, industrial area, and park. The dataset poses several challenges due to scale and illumination variations.
RSSCN7 is a dataset [84] with seven aerial scene categories: farmland, grassland forest, industrial region, lake, parking lot, residential region, and river. The dataset was released in 2015 and is publicly available. Each aerial scene category contains 400 images. The images in the dataset are of size 400 × 400 pixels, with sampling performed at four different scales.
AID is a large scale public dataset [73] with 10,000 images and 30 aerial scene categories. Some of the scene categories in the dataset are: playground, sparse residential, medium residential, bare land, center, desert, farmland, mountain, park, parking, forest, resort, school, church, square, river, storage tanks, and viaduct. The images in the AID dataset are collected from different countries, including China, USA, Germany, France, UK, and Italy.
NWPU-RESISC45 is another large scale public dataset [16] with 31500 images having 700 images per category. The images in the dataset are of size 256 × 256 pixels. The dataset comprises 45 aerial scene categories where some of the classes in the dataset are: airplane, railway, railway station, bridge, church, stadium, sparse residential, forest, ship, terrace, freeway, storage tank, golf course, lake, ground track field, baseball diamond, mountain, parking lot, wetland, river, and roundabout.
Here, we evaluate different deep color features, described in Sect. 3.1, on five datasets. In all cases, we employ activations from the FC7 layer of the CNNs as deep color features. For the color fusion, we concatenate all the deep color features, resulting in a 28672-dimensional feature vector. Table 1 shows the comparison of deep color features on the UC-Merced, WHU-RS19, RSSCN7, AID, and NWPU-RESISC45 datasets. On the UC-Merced dataset, the baseline approach provides a mean recognition rate of 94.7%. Image features from the color names and HSV based CNNs achieve mean classification scores of 93.7% and 93.8%, respectively. Deep features from the C and Lab color space based CNNs provide mean recognition scores of 93.6% and 93.9%, respectively. Deep features from the opponent color space-based deep convolutional neural network provide an average classification accuracy of 94.5%. Furthermore, the proposed deep color feature fusion significantly improves the classification performance, achieving a mean recognition score of 96.3%. The proposed deep color feature fusion provides an absolute gain of +1.6% in terms of classification performance compared to the baseline standard RGB deep features.
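The late fusion by concatenation, and the resulting dimensionality, can be sketched as follows. The ordering of the color models and the use of placeholder zero vectors in place of real FC7 activations are choices made only for this illustration.

```python
# Seven deep color models: the RGB baseline plus six color representations.
COLOR_MODELS = ["RGB", "HSV", "YCbCr", "Lab", "Opponent", "C", "ColorNames"]
FC_DIM = 4096  # dimensionality of the FC6 and FC7 activations


def fuse(features_by_model, order=COLOR_MODELS):
    """Concatenate per-model feature vectors in a fixed order."""
    fused = []
    for name in order:
        fused.extend(features_by_model[name])
    return fused


# Placeholder activations standing in for the FC7 output of each network.
fc7_features = {name: [0.0] * FC_DIM for name in COLOR_MODELS}
fused_fc7 = fuse(fc7_features)

fc7_only_dim = len(COLOR_MODELS) * FC_DIM       # FC7 features only
fc6_fc7_dim = 2 * len(COLOR_MODELS) * FC_DIM    # FC6 and FC7 features together
```

With seven models, the FC7 fusion has 7 × 4096 = 28672 dimensions, and fusing both FC6 and FC7 gives 57344 dimensions, which is the high dimensionality (more than 57k) addressed by the compression step.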
On the WHU-RS19 dataset, the baseline (RGB) network provides an average recognition rate of 96.0%. Deep features from the color names, C and HSV color space based CNNs achieve mean recognition scores of 95.1%, 94.7% and 94.4%, respectively. Furthermore, image features from the YCbCr and Lab color space based CNNs provide mean recognition rates of 95.4% and 95.0%, respectively. On this dataset, deep features from the opponent color space based deep convolutional neural network achieve similar performance, with a mean classification score of 96.0%, compared to the baseline RGB features (96.0%). Moreover, the combined set of deep color features improves the classification performance with an absolute gain of +1.4%, compared to the baseline standard RGB deep features. Similarly, on the RSSCN7 dataset, deep features from the opponent color space-based deep convolutional neural network provide similar classification results, with a mean recognition rate of 89.4%, compared to the baseline RGB features (89.5%). Furthermore, the classification performance is improved by employing the combined set of deep color features, which obtains an average recognition accuracy of 92.3%.
We also evaluate different deep color features on two recently introduced large scale AID and NWPU-RESISC45 datasets. On the AID dataset, the baseline standard RGB color space-based deep convolutional neural network achieves an average classification score of 90.3%. The deep features from most other color spaces provide slightly inferior results compared to the standard RGB. However, deep features from the opponent color space-based deep convolutional neural network again provide similar performance, with an average classification accuracy of 89.9%, compared to the baseline RGB features. Furthermore, the proposed deep color feature fusion significantly improves the classification performance, with an absolute gain of +3.1% in terms of classification performance, compared to the baseline standard RGB deep features. Finally, on the NWPU-RESISC45 dataset, the baseline standard RGB deep network provides a mean recognition rate of 85.7%. Image features from the color names, HSV and C based deep convolutional neural networks (CNNs) obtain mean classification scores of 83.2%, 83.1% and 82.7%, respectively. Deep features from the YCbCr and Lab color space based deep convolutional neural networks (CNNs) achieve average classification scores of 84.3% and 83.0%, respectively. The proposed deep color feature fusion provides significant improvement in classification performance with an absolute gain of +4.3%, compared to the baseline standard RGB deep features. Figure 4 shows a per-category recognition performance comparison between the deep color feature fusion and the baseline RGB deep convolutional neural network on the NWPU-RESISC45 dataset. The combined set of deep color features provides consistent improvements on 43 out of 45 aerial scene categories compared to the baseline RGB features. Particularly significant gains in classification performance are achieved for the tennis court (+18%), palace (+15%), commercial area (+12%), medium residential (+11%), and basketball court (+11%) aerial scene categories.

Table 1: For each dataset, 50% of the images are used for training and the rest for testing. The best result for each dataset is shown in bold. For all deep models, we use the same network architecture with the same set of parameters. Overall, our late fusion based color fusion always outperforms the standard baseline RGB color space-based deep convolutional neural network. On the large-scale NWPU-RESISC45 dataset, the combined set of deep color features provides significant improvement in classification performance, with an absolute gain of 4.3% over baseline RGB deep features.

Deep Color Features Evaluation
We also perform a comparison between convolutional features (Conv1, Conv2, Conv3, Conv4 and Conv5) and FC features (FC6 and FC7). Table 2 shows the comparison for both the baseline RGB and our color fusion. In all cases, superior classification results are obtained using features from FC layers. Note that no significant improvement in performance is observed when combining convolutional and FC layer features.
To summarize, the deep color feature fusion provides consistent improvements on all five datasets, compared to the baseline RGB features. It is worth mentioning that the most considerable gains in performance are obtained on the large-scale AID and NWPU-RESISC45 datasets. These results suggest that different deep color features possess complementary information as their combination leads to a significant performance boost for remote sensing scene classification. Furthermore, superior results are obtained using features from the FC layers, compared to convolutional layer features.

Table 2: In case of 'Conv', we use features extracted from the five convolutional layers. In case of 'FC', we use features from the two FC layers (FC6 and FC7). For both the baseline and our color fusion, features from the FC layers provide superior performance compared to the convolutional layer features. Bold values highlight the best results.

Compact Deep Color Features
As demonstrated above, the combined set of deep color features always improves the classification performance compared to the baseline RGB. However, this gain in classification performance comes at the cost of high-dimensionality. When fusing deep color features from the FC6 and FC7 layers of the networks, the resulting dimensionality becomes significantly higher (57K). To tackle this issue, we evaluate the compression of deep color feature fusion using the approach described in Sect. 3.2. Table 3 shows the results obtained when compressing the combined set of deep color features using the DITC approach. The final dimensions of the compact deep color fusion image representation are fixed to 8k so that it is similar to the dimensionality of the standard RGB deep features commonly employed for classification. The DITC compression approach compresses the combined set of deep color features from 57k to 8k without any substantial deterioration in classification accuracy for all datasets. In the case of UC-Merced and WHU-RS19, there is even a slight improvement in performance when compressing the combined set of deep color features indicating an increase in discriminative power by removing the redundancy. In the case of the RSSCN7, AID and NWPU-RESISC45 datasets, there is a marginal reduction in accuracy compared to the original combined set of deep color features. In all cases, the compact deep color feature fusion significantly reduces the dimensionality without sacrificing the classification accuracy.
We additionally analyze the extreme compression of the deep color feature fusion and compare it with several commonly used dimensionality reduction techniques: principal component analysis (PCA), partial least squares (PLS) and diffusion maps (DM). Among these existing approaches, PLS is a category-aware statistical dimensionality reduction technique that models relations between sets of observations by means of latent variables. We perform the comparison to obtain very low-dimensional (100 to 500 dimensional) deep color feature fusion based image representations. Figure 5 shows the results of extreme compression (even to 100 dimensions) on the UC-Merced dataset. The DITC compression technique provides superior classification results even in the case of extreme compression of the deep color feature fusion based image representation. Figure 6 shows the results on the NWPU-RESISC45 dataset.

State-of-the-art Performance Comparison
Finally, we compare our compact deep color feature fusion representation with state-of-the-art methods in the literature. Table 4 shows the results on the five remote sensing scene classification datasets. For a fair comparison, we adopt the same sampling setting as [32,62,76], taking 80 images per class for training on the UC-Merced dataset. In the case of the WHU-RS19 dataset, 30 images per aerial scene category are used for training and the rest for testing. For the RSSCN7 and AID datasets, 50 images per aerial scene category are employed for training. Furthermore, 20 images per class are employed for training in the case of the NWPU-RESISC45 dataset.
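The sampling protocol above, a fixed number of training images per class with the remainder used for testing, can be sketched as follows. The function name `split_per_class`, the fixed seed and the toy label list (three classes with 100 images each) are hypothetical and only illustrate the idea; the exact splits used in the experiments are not specified here.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train, seed=0):
    """Pick n_train random samples per class for training; the rest are for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Toy example: 3 classes with 100 images each, 80 per class used for training.
labels = [c for c in range(3) for _ in range(100)]
train_idx, test_idx = split_per_class(labels, n_train=80)
```

Repeating the split with different seeds and averaging the resulting accuracies is one common way to obtain the mean classification accuracies reported in comparisons of this kind.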
Table caption: The best result for each dataset is shown in bold. In all cases, the compact deep color feature fusion achieves favorable performance while being considerably lower-dimensional.

On the UC-Merced dataset, the work of [76] proposed an approach that extends the bag-of-visual-words (BOVW) framework with the spatial co-occurrence kernel, achieving an average classification accuracy of 77.7%. In their work, the integration of color features within a Gabor representation was also investigated, leading to a mean recognition rate of 80.5%. The impact of texture information on remote sensing scene classification has been investigated by previous works [9,58,82]. One such texture description, based on multi-scale completed LBP features, achieved an average classification accuracy of 90.6%. A pyramidal co-occurrence feature representation, accounting for both photometric and geometric aspects of an image, was proposed by [75], achieving a classification accuracy of 77.4%. With the recent advent of deep features, a considerable jump in classification performance has been observed. The work of [72] proposed deep filter banks based on CNNs and obtained a classification accuracy of 92.7%. Previous works have also investigated transferring pre-trained deep features from both the FC and the convolutional (conv) layers of the CNNs. The work of [72] investigated the transferability of deep CNNs with respect to both FC and convolutional layers. In the case of FC layers (Case I: FC features), their approach achieved an accuracy of 96.8%, whereas a mean recognition rate of 96.9% is obtained when using features from convolutional layers (Case II: Conv features) in conjunction with the VLAD encoding strategy. The work of [55] proposed a multi-scale deeply described correlations-based model and achieved an accuracy of 96.9%. Our proposed approach, while being compact, achieves an average classification accuracy of 98.1%.

On the WHU-RS19 dataset, the work of [74], based on a class-specific vocabulary employing kernel collaborative representation, obtained an average classification accuracy of 93.7%. Among the deep learning approaches, the CaffeNet model achieved an average classification accuracy of 94.8%. Our compact deep color feature fusion approach, which also employs FC layer features, obtains an average classification accuracy of 96.6%. The best result (98.6%) on this dataset is obtained by transferring deep features from the conv layers together with the VLAD encoding technique. Such an encoding of conv features is complementary to our approach using FC features. In the case of the RSSCN7 dataset, the work of [71], based on hierarchical coding vectors, obtained a mean recognition accuracy of 86.4%. The work of [72], based on deep filter banks, achieved a mean recognition rate of 90.4%. Our approach outperforms state-of-the-art methods on this dataset with a mean recognition rate of 92.9%.
For the AID dataset, the work of [10] proposed a deep convolutional neural network pretraining approach (SSF-AlexNet) and achieved a mean recognition accuracy of 88.7%. The work of [36] proposed a fusion approach (BAFF) to integrate SIFT and deep features. Their approach achieved a mean recognition rate of 93.6%. Our approach provides superior results with an average classification accuracy of 94.0%. Finally, on the NWPU-RESISC45 dataset, the work of [79] based on single-scale deep features achieved a mean recognition rate of 83.6%. The multi-scale variant of their approach obtained an average classification score of 84.3%. The bag-of-convolutional feature approach of [17] achieved a mean recognition rate of 84.3%. Our approach again provides superior classification performance by achieving an average classification accuracy of 87.5%.

Conclusions
In this paper, we investigated the contribution of color within a deep learning framework (CNNs) for the problem of remote sensing scene classification. We demonstrated that different deep color features possess complementary information and that combining them leads to a significant performance boost for the remote sensing scene classification task. Additionally, we addressed the high dimensionality of the deep color feature fusion and compressed it to obtain a compact final image representation without a significant deterioration in classification performance. To validate our approach, we performed comprehensive experiments on five challenging remote sensing scene classification datasets. The results of our experiments clearly demonstrate the effectiveness of the proposed approach. Table 5 shows the gain in classification performance obtained using the proposed compact deep color feature fusion, compared to the standard RGB deep features, on the five remote sensing scene classification datasets.
A potential future research direction is to investigate the fusion of other available spectral bands (e.g., near infrared) besides RGB in the form of different color transformations. Additionally, integrating other visual cues, such as texture features, with color features in a deep learning framework may improve remote sensing scene classification performance and is therefore a promising research direction. Another future research direction is to investigate the impact of integrating multiple deep color features for other remote sensing image analysis tasks, such as object detection (simultaneous classification and localization).