
1 Introduction

Color images surround us everywhere. The human visual system has an innate ability to discern thousands of colors, but only a fraction of that many shades of gray [19]. The reason is that the retina contains three kinds of cone cells, each sensitive to light of a different range of wavelengths, roughly corresponding to red, green, and blue. The visible colors arise from combinations of these three stimuli. To perceive light intensity, we have only one kind of cell, the rod cells. Under poor lighting conditions we cannot distinguish colors, but we can still recognize light intensity; we then effectively see grayscale images.

Commonly, in natural images we can easily recognize simple objects regardless of whether the images are colored or not. A simple example is a person standing in front of a building. Yet color images contain more information than grayscale ones, for instance allowing us to distinguish a hill in the desert from a snowy hill. Moreover, there are several color spaces in which this information can be represented; their motivation is partly technical and partly methodical [19]. Understanding the different color models and the color information they carry is important in Computer Vision, e.g. for inpainting, colorization, segmentation tasks, and Convolutional Neural Networks (CNNs) [2]. We will focus on CNNs, which provide state-of-the-art tools for object detection, face recognition, or video analysis. To solve the basic task of image classification, a CNN usually takes an input color image of a predetermined size \(length \times height \times 3\) and returns a prediction vector over the possible contents of the image. In this paper, we want to find out more about the information that is hidden in these three channels and the contribution it provides to a deep neural network.

1.1 Related Work

The role of colors in face recognition was investigated by Yip and Sinha [23] in 2002. In experiments with thirty-seven subjects they found that color images perform better than grayscale images, and significantly better when the available shape information is degraded by Gaussian blur. In 2009, Oliveira and Conci [17] showed that HSV color images are advantageous for skin detection, whereby a Hue value H between 6 and 38 indicates skin. Li et al. [16] used different color models for vehicle color recognition via vector matching of templates in 2010. They found that HSV is the best model for color recognition compared to RGB or YUV. Agrawal et al. [1] proposed in 2011 a new approach for image classification using Support Vector Machines and compared accuracies in different color models, which depend on the classifier. One year later, Kanan et al. [12] investigated different grayscale models in image recognition using a Naïve Bayes Nearest Neighbor framework. They came to the result that Intensity (with or without gamma correction) performs best, while Value performs worst. Ruela [20] investigated different color spaces in dermoscopy analysis and showed that color features play a major role in this area. In 2014, Zeiler and Fergus [24] presented an impressive way to visualize deep neural networks and what and how they learn from color images. They used a multi-layered Deconvolutional Network, as proposed by Zeiler et al. [25], to project feature activations back to the input pixel space. The role of color distribution and illuminant estimation for color constancy was studied in 2014 by Cheng et al. [5]. They showed that spatial information does not provide any additional information that cannot be obtained directly from the color distribution. Humans are able to perceive colors constantly and independently of the illumination. Computer Vision problems need color constancy in image processing to make sure that the objects in a scene can be recognized reliably under different illumination conditions. One year later, Barron [3] and Bianco et al. [4] investigated color constancy in Convolutional Neural Networks. Recent work investigated color augmentation techniques in deep skin image analysis [7] in 2017. In 2018, Galiyawala et al. [8] found that color is helpful in deep-learning-based person retrieval tasks in surveillance videos. Also with regard to CNN-based colorization [21], it could be useful to get a deeper understanding of the contribution of color in deep architectures.

Fig. 1. Images of the different color models Luminance, RGB, HSV, SV, YIQ, YUV: horse (CIFAR-10), cockroach (CIFAR-100), desert (FlickrScene), shown in RGB representation. (Color figure online)

1.2 Contribution

Previous work examines the impact of colors only by applying color augmentation techniques to train and test images. Because we transform grayscale images into 3D tensors with three identical channels, we can feed them into the same architecture, which makes it possible to train and test with different color models in one experimental setup. We can therefore deduce whether images are recognized or misclassified solely because of their removed or added color, and create statistics about these cases in the different categories. We call this property of a particular class of a dataset its high or low (in)dependence on color information. On the other hand, inspired by Yip and Sinha [23], we investigate degraded test images to examine the coherence between color and texture. While they only used Gaussian blur to degrade edges, we also add Gaussian noise to our images. We investigate the performance difference depending on which modification is applied to the images presented to the architecture and compare the results to the properties of the dataset or classes. Finally, we observe the activations in higher layers and the change of confidence values under the mentioned conditions. Answers to these questions are of general interest for improving networks, but may even play an important role when classifying images of a special category or when recognizing something in thermal infrared images, which contain no color information.

2 Methodology

In this section, we give a short overview of CNNs. After that, we show the applied augmentation techniques and some examples of color spaces.

2.1 About CNNs

The following description of those aspects of CNNs needed for our methodology will not replace a comprehensive textbook [9] or article [15]; we therefore assume that the reader is familiar with basic terminology such as loss function, max pooling, or activation. In a nutshell, CNN-based image classification presupposes the determination of a feature set allowing the best separation of the training data or, equivalently, the minimization of a loss function. In the simplest case, the unknowns in this function could be the weights connecting the responses of convolutions of the input image with a bank of filters. Some of these filters, also denoted as low-level features, are based on color information (for example, a constant green filter could give a hint about the amount of vegetation), while others rely on textures (Gabor-like filters). For classification problems involving many classes, these filters are propagated to deeper scales using iterated convolutional, max-pooling, and dropout layers. For example, the widely used VGG-16 architecture [22] reduces the image by a factor of 32 before the fully connected layer, that is, the weighted sum of filter responses that makes them vote for each of the classes, is applied. In contrast, flatter architectures are more suitable for classification if the number of classes is low, for reasons of computational efficiency. Since for such flat architectures the color and texture filters play a crucial, decisive part, the two goals of this work are (1) to investigate to what extent a change of color space may influence the classification accuracy and (2) to find out which color space is most robust against color and texture changes between training and test data.
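As a toy illustration of such low-level filters, the following NumPy sketch (with hypothetical filter values, not taken from our architecture) contrasts a constant green color filter with a crude Gabor-like horizontal texture filter applied to one RGB patch:

```python
import numpy as np

# Toy illustration (hypothetical filter values): a constant "green" color filter
# versus a crude horizontal texture filter, both applied to one RGB patch as an
# inner product, as a low-level CNN filter would do.
patch = np.random.rand(5, 5, 3)                      # toy 5x5 RGB patch in [0, 1]

green_filter = np.zeros((5, 5, 3))
green_filter[:, :, 1] = 1.0 / 25.0                   # responds to the average amount of green

x = np.linspace(-1.0, 1.0, 5)
texture_filter = np.tile(np.cos(np.pi * x)[None, :, None], (5, 1, 3)) / 25.0  # horizontal ripple

green_response = float(np.sum(patch * green_filter))     # high for vegetation-like patches
texture_response = float(np.sum(patch * texture_filter)) # sensitive to horizontal structure
print(green_response, texture_response)
```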

2.2 Augmentation Techniques

The human brain interprets color as a physio-psychological phenomenon that is not yet fully understood. Color is represented by the light reflected by an object, and each color blends smoothly into the next; it does not end abruptly. A color model or system is a specification of a coordinate system in which every color is determined by a single point, see [19]. Input images can be given in different representations, and the conversion between these descriptions is a color augmentation technique. The impact of color can also be studied under texture modifications. Therefore we add noise with several distributions, such as random uniform, random gamma, or random Gaussian \({\mathcal {N}}(0,\sigma ^2)\) with different variances \(\sigma ^2\in \{3,8.5,12,25,50\}\). According to our results and the work of da Costa et al. [6], who investigated several types of noise in image classification tasks, the Gaussian distribution \(\mathcal N(0,8.5)\) provides a medium deterioration. Another option to degrade shape is to blur the images using Gaussian blur. We choose \(\sigma =0.5\) and filter size 3, inspired by Yip and Sinha [23]. Several color and texture transformations T of the input images i can be described as:

$$\begin{aligned} T:i\rightarrow T(i). \end{aligned}$$
(1)

Our CNN maps the images to a prediction:

$$\begin{aligned} \mathrm {CNN}:i\rightarrow P,\,\,\,\, \mathrm {CNN}:T(i)\rightarrow P'. \end{aligned}$$
(2)

and we are interested in the relation between \(P'\) and \(P\).
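A minimal sketch of the texture part of such transformations T, assuming pixel values in [0, 255] and using SciPy's Gaussian filter (not the exact implementation used in our experiments), could look as follows:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Sketch of the texture transformations T (assumed pixel range [0, 255]):
# additive Gaussian noise N(0, sigma^2) and Gaussian blur with sigma = 0.5.
def add_gaussian_noise(img, var=8.5):
    noisy = img + np.random.normal(0.0, np.sqrt(var), img.shape)
    return np.clip(noisy, 0.0, 255.0)

def gaussian_blur(img, sigma=0.5):
    # truncate=1.0 limits the kernel radius to 1 pixel, i.e. a 3-tap filter per spatial axis;
    # sigma = 0 on the channel axis leaves the channels untouched.
    return gaussian_filter(img, sigma=(sigma, sigma, 0.0), truncate=1.0)

img = np.random.rand(32, 32, 3) * 255.0               # stand-in for a test image
noisy, blurred = add_gaussian_noise(img), gaussian_blur(img)
```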

Fig. 2. Left: Bus image in Luminance and Value, and the difference of both. Right: Original Car image, with noise, and with Gaussian blur.

2.3 Examples of Color Spaces

This section describes the color spaces considered in this study. Let M be a tensor with entries in \({\mathbb {R}}\) representing a 3-channel color image. The pixels of the RGB color input image with length l and width h

$$\begin{aligned} im_{RGB}=[R,G,B]\in M^{l\times h\times 3} \end{aligned}$$
(3)

are the entries of the matrices \(R,G,B\in M^{l\times h}\). According to [19], an image in RGB representation can be converted to another color model.

For a conversion into grayscale, the following coefficients, which add up to one, are tried and tested. The model is also known as Luminance and is motivated by the fact that the human eye is differently sensitive to red, green, and blue:

$$\begin{aligned} L=[0.2989R+0.5870G+0.1140B]\in M^{l\times h} \end{aligned}$$
(4)

or \(G_L=[L,L,L]\in M^{l\times h\times 3}\) to get a grayscale 3D tensor. Another, simpler grayscale space is Intensity, the mean of the RGB channels:

$$\begin{aligned} I=\frac{1}{3}[R+G+B]\in M^{l\times h} \end{aligned}$$
(5)

or \(G_{I}=[I,I,I]\in M^{l\times h\times 3}\).
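A minimal NumPy sketch of the conversions (4) and (5) for an RGB image stored as a float array, together with the 3-channel tensors \(G_L\) and \(G_I\):

```python
import numpy as np

# Minimal sketch of the grayscale conversions: Luminance (4), Intensity (5),
# and the corresponding 3-channel tensors G_L and G_I fed into the CNN.
def luminance(img_rgb):
    R, G, B = img_rgb[..., 0], img_rgb[..., 1], img_rgb[..., 2]
    return 0.2989 * R + 0.5870 * G + 0.1140 * B      # L, Eq. (4)

def intensity(img_rgb):
    return img_rgb.mean(axis=-1)                      # I, Eq. (5)

def to_three_channels(gray):
    return np.stack([gray, gray, gray], axis=-1)      # G_L or G_I

img = np.random.rand(32, 32, 3)                       # stand-in for an RGB image
G_L = to_three_channels(luminance(img))
G_I = to_three_channels(intensity(img))
```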

To transform an image into HSV (Hue, Saturation, Value) space, we first normalize the image so that \(R,G,B\in [0,1]\). Sort R, G, B by size in descending order and denote them \(\lambda _1,\lambda _2,\lambda _3\); if two are equal, rank R before G before B. Now calculate

$$\begin{aligned} V&=\lambda_1,\qquad S=\begin{cases} 0 & \text{if } \lambda_1=0,\\ \dfrac{\lambda_1-\lambda_3}{\lambda_1} & \text{if } \lambda_1\ne 0, \end{cases}\\ H&=\begin{cases} 0 & \text{if } \lambda_1=\lambda_3,\\ c+\dfrac{\lambda_2-\lambda_3}{6\,(\lambda_1-\lambda_3)} & \text{if } \lambda_1\ne\lambda_3, \end{cases} \qquad\text{where}\quad c=\begin{cases} 0 & \text{if } \lambda_1=R,\\ \tfrac{1}{3} & \text{if } \lambda_1=G,\\ \tfrac{2}{3} & \text{if } \lambda_1=B. \end{cases} \end{aligned}$$
(6)

By equivalent transformations one can show that \(H\in [0,1)\) and \(V,S\in [0,1]\). A similar color space is HSI, which uses I instead of V. The described transformations from RGB to HSV or HSI are not continuous functions. To define a grayscale image in HSV mode, it is not appropriate to consider just the Value channel \(V'=[0,0,V]\): Kanan et al. [12] showed that Value performs worst of all grayscale models. Instead, we take \(G_L\) and, considering (6) and

$$\begin{aligned} R=G=B\Longleftrightarrow \lambda _1=\lambda _3 \end{aligned}$$
(7)

it follows that \(H=S=0,V=L\), i.e. \(L'=[0,0,L]\). In Fig. 2 both models are compared. Another option is to regard the SV mode, where Hue is set to zero. Grayscale images in HSI mode are defined as \(I'=[0,0,I]\).
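A per-pixel sketch of the sorting-based conversion (6), together with the grayscale tensor \(L'\) in HSV mode, is given below; it is written per pixel for clarity, a vectorized version would be used in practice:

```python
import numpy as np

# Per-pixel sketch of the sorting-based RGB -> HSV conversion of Eq. (6), assuming
# R, G, B already normalized to [0, 1]; ties are resolved R before G before B.
def rgb_to_hsv_pixel(r, g, b):
    channels = [(r, 0.0), (g, 1.0 / 3.0), (b, 2.0 / 3.0)]   # (value, offset c)
    lam = sorted(channels, key=lambda t: -t[0])              # stable sort keeps R before G before B on ties
    l1, l2, l3, c = lam[0][0], lam[1][0], lam[2][0], lam[0][1]
    v = l1
    s = 0.0 if l1 == 0 else (l1 - l3) / l1
    h = 0.0 if l1 == l3 else c + (l2 - l3) / (6.0 * (l1 - l3))
    return h, s, v

def grayscale_in_hsv(img_rgb):
    """L' = [0, 0, L]: Hue and Saturation are zero, Value carries the Luminance."""
    L = 0.2989 * img_rgb[..., 0] + 0.5870 * img_rgb[..., 1] + 0.1140 * img_rgb[..., 2]
    zero = np.zeros_like(L)
    return np.stack([zero, zero, L], axis=-1)

print(rgb_to_hsv_pixel(0.5, 0.5, 0.5))   # equal channels -> H = S = 0, V = 0.5
```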

To convert an RGB image with \(R,G,B\in [0,1]\) into a YUV image (Y is the Luminance and U, V are two Chrominance components), we apply the following transformation, see [18]. Using (4), set

$$\begin{aligned} \begin{pmatrix} Y \\ U \\ V\\ \end{pmatrix}=\begin{pmatrix} L \\ (B-Y) \cdot 0.493 \\ (R-Y)\cdot 0.877\\ \end{pmatrix}. \end{aligned}$$
(8)

Grayscale images in YUV mode are characterized by \(L'=[0,0,L].\)

To convert an RGB image with \(R,G,B\in [0,1]\) into a YIQ image (Y is the Luminance and I, Q are two Chrominance components; notice that I is redefined here), we proceed as follows, see [10]. With (4), set

$$\begin{aligned} \begin{pmatrix} Y \\ I \\ Q\\ \end{pmatrix}=\begin{pmatrix} L \\ 0.596R-0.274G-0.322B \\ 0.211R-0.523G+0.312B\\ \end{pmatrix}. \end{aligned}$$
(9)

The coefficients in each of the two chrominance rows add up to zero. Grayscale images in YIQ mode are characterized by \(L'=[0,0,L].\)

The different value ranges of U, V, I, and Q do not matter, as we apply standardization as an augmentation technique before the images are fed into our model.
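A minimal sketch of the conversions (8) and (9) for \(R,G,B\in[0,1]\):

```python
import numpy as np

# Minimal sketch of the RGB -> YUV (8) and RGB -> YIQ (9) conversions,
# assuming an image array of shape (l, h, 3) with R, G, B in [0, 1].
def rgb_to_yuv(img_rgb):
    R, G, B = img_rgb[..., 0], img_rgb[..., 1], img_rgb[..., 2]
    Y = 0.2989 * R + 0.5870 * G + 0.1140 * B        # Luminance L, Eq. (4)
    U = (B - Y) * 0.493
    V = (R - Y) * 0.877
    return np.stack([Y, U, V], axis=-1)

def rgb_to_yiq(img_rgb):
    R, G, B = img_rgb[..., 0], img_rgb[..., 1], img_rgb[..., 2]
    Y = 0.2989 * R + 0.5870 * G + 0.1140 * B
    I = 0.596 * R - 0.274 * G - 0.322 * B
    Q = 0.211 * R - 0.523 * G + 0.312 * B
    return np.stack([Y, I, Q], axis=-1)

img = np.random.rand(32, 32, 3)
yuv, yiq = rgb_to_yuv(img), rgb_to_yiq(img)
```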

3 Experiments

In the following sections we describe our experiments and explain the results.

Fig. 3. PersonFinder dataset: example images with persons (defaced) and background.

Fig. 4. Examples of the four categories [desert, forest, snow, urban] in FlickrScene.

3.1 Data

For our experiments, we consider four image datasets employed in classification tasks. The first is our self-created PersonFinder dataset. It contains 15,083 color training images and 793 test images with a size of \(128\times 64\times 3\) pixels, labeled with [person, background], see Fig. 3. Second, we use the custom-made FlickrScene dataset, which contains 10,000 color landscape images of size \(128\times 128 \times 3\) with four labeled categories [desert, forest, snow, urban], from which 30% were taken as test images; see Fig. 4 for examples. The other datasets we work with are the public datasets CIFAR-10 and CIFAR-100 [14]. Each contains 60,000 color images of \(32\times 32\) pixels, split into 50,000 training examples and 10,000 test examples, which are labeled with one of ten or one hundred classes, respectively. To fit all input images into our architecture, we down-sample them to the same size as the CIFAR data.

3.2 Implementation Details

As a model, we use an established Convolutional Neural Network, shown in Fig. 5. We choose this rather flat architecture because the PersonFinder and CIFAR data perform well with this model and it allows us to focus on our main interest: color and texture information. The architecture has two convolutional layers (conv 1, conv 2) with max pooling (pool 1, pool 2), three fully connected layers (fc 1, fc 2, and softmax), and further state-of-the-art components such as standardization, data augmentation, normalization, and dropout, see [11]. We have modified it accordingly to accept images in any color mode and to visualize the kernels, activations, confidence values, and misclassified categories.
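A Keras sketch of such a flat architecture is given below; the filter counts, kernel sizes, and dropout rate are illustrative assumptions, not the exact configuration of our network:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Sketch of a flat architecture with two conv/pool blocks and three fully
# connected layers (fc 1, fc 2, softmax). All layer sizes are assumptions.
def build_model(num_classes, input_shape=(32, 32, 3)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(64, 5, padding="same", activation="relu"),   # conv 1
        layers.MaxPooling2D(3, strides=2, padding="same"),          # pool 1
        layers.Conv2D(64, 5, padding="same", activation="relu"),    # conv 2
        layers.MaxPooling2D(3, strides=2, padding="same"),          # pool 2
        layers.Flatten(),
        layers.Dense(384, activation="relu"),                       # fc 1
        layers.Dropout(0.5),
        layers.Dense(192, activation="relu"),                       # fc 2
        layers.Dense(num_classes, activation="softmax"),            # softmax
    ])

model = build_model(num_classes=10)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```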

Fig. 5. Graph of our Convolutional Neural Network.

3.3 Experimental Set-Up

We train our neural network with each of the mentioned datasets PersonFinder, CIFAR-10, CIFAR-100, and FlickrScene (all consisting of RGB images). To compare the effects of partially lost shape information, we carry out the following experiments in three modes: normal, with noise, and with blur on the images. Let \(\, {\mathcal {C}}=\{L,I, G_L,G_I,L',I',RGB,HSV, SV, HSI,YUV, YIQ\}\) be the set of representations. The value \(\mathrm {acc}(x,y), x,y\in {\mathcal {C}}\) denotes the test accuracy when training in mode x and testing in mode y. First, the algorithm calculates \(\mathrm {acc}(RGB,RGB)\). After that, we modify the test images before evaluation and convert them into grayscale images, which yields \(\mathrm{acc}(RGB,G_L)\) and \(\mathrm{acc}(RGB,G_I)\). Then we carry out training and testing with several suitable representations and evaluate \(\mathrm {acc}(x,y), x,y\in {\mathcal {C}}\). Suitable means that x is the same as y or a similar color system with added or removed color information, see Fig. 6.
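The following skeleton illustrates the structure of this protocol; convert_space and run_training are placeholders standing in for the conversions of Sect. 2.3 and the CNN of Sect. 3.2, and the listed "suitable" pairs are illustrative only (the actual pairs are those of Fig. 6):

```python
import numpy as np

# Skeleton of the evaluation protocol that fills acc(x, y): train once per color
# representation x, then test on every suitable representation y.
def convert_space(images, space):
    return images                                   # placeholder for the Sect. 2.3 conversions

def run_training(train_images, train_labels):
    return lambda test_images, test_labels: float(np.random.rand())   # placeholder accuracy

suitable = {"RGB": ["RGB", "G_L", "G_I"], "G_L": ["G_L", "RGB"], "HSV": ["HSV", "SV", "L_prime"]}

train_images = train_labels = test_images = test_labels = None        # dataset stand-ins
acc = {}
for x, test_spaces in suitable.items():
    evaluate = run_training(convert_space(train_images, x), train_labels)
    for y in test_spaces:
        acc[(x, y)] = evaluate(convert_space(test_images, y), test_labels)
print(acc)
```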

After the evaluations we calculate the error rates

$$\begin{aligned} \mathrm {err}(x,y)=1-\frac{\mathrm {acc}(x,y)}{\mathrm {acc}(x,x)}, \end{aligned}$$
(10)

\(x,y\in {\mathcal {C}}, x\ne y\). This relative value indicates the relative accuracy change caused solely by removed or added color information. A negative value indicates an improvement through the augmentation, and in general \(\mathrm {acc}(x,x)>\mathrm {acc}(x,y)\Longleftrightarrow \mathrm {err}(x,y)>0\). The results of interest are given in Table 1.
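As a small, self-contained illustration of (10), the following snippet computes \(\mathrm{err}\) from a dictionary of accuracies; the numbers are placeholders, not measured results:

```python
# Illustration of Eq. (10) with placeholder accuracies (not measured results).
acc = {("RGB", "RGB"): 0.86, ("RGB", "G_L"): 0.79,
       ("G_L", "G_L"): 0.82, ("G_L", "RGB"): 0.80}

def err(x, y):
    return 1.0 - acc[(x, y)] / acc[(x, x)]

print(f"err(RGB, G_L) = {err('RGB', 'G_L'):.3f}")   # > 0: accuracy dropped without color
print(f"err(G_L, RGB) = {err('G_L', 'RGB'):.3f}")   # smaller: grayscale training is more robust here
```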

Besides the error rates, we also perform a hard negative analysis to plot the classes that are mostly true negative or false positive solely because of the added or removed color information in the test images, see Table 2. We call these classes color-dependent. Conversely, the classes which are not recognized significantly differently under color modification are called color-independent. To investigate how color (in)dependence evolves through the architecture, we record the number of activation differences in the first fully connected layer for color versus colorless images. We calculate the ratio of how many activations are added or lost, proportional to all output pixels, and obtain \(\mathrm {act}(x,y), x,y\in {\mathcal {C}}, x\ne y\), see Table 3. Finally, we compare the different confidence values obtained after we modified the color or Hue information, respectively. The changes of confidence values with/without color information are given as \(\mathrm {con}(x,y), x,y\in {\mathcal {C}}, x\ne y\), see Table 3.
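The two measures can be sketched as follows (with toy activations and confidence values, not data from our experiments):

```python
import numpy as np

# Toy sketch of act(x, y) and con(x, y): act counts only on/off changes of the
# first fully connected layer's units, con is the change of the confidence value.
rng = np.random.default_rng(0)
fc1_color = np.maximum(rng.normal(size=384), 0)   # toy ReLU activations, color test image
fc1_gray = np.maximum(rng.normal(size=384), 0)    # toy ReLU activations, grayscale test image

act = np.mean((fc1_color > 0) != (fc1_gray > 0))  # proportion of units that switch on/off

conf_color, conf_gray = 0.91, 0.84                # toy softmax confidences of the predicted class
con = conf_color - conf_gray
print(act, con)
```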

Fig. 6. Accuracies with training/test images of different color models: original, with noise, and with blur.

3.4 Results

Impact of Color Models on Accuracy. We first regard the experiments without noise or blurring, see Fig. 6. The reason for \(\mathrm {acc}(L',L')<\mathrm {acc}(SV,SV)<\mathrm {acc}(HSV,HSV)\) on the last three datasets is obviously that SV color images contain more information than L' images and less than HSV images. In PersonFinder, the values are similar. The fact that \(\mathrm {acc}(c,c)>\mathrm {acc}(g,g)\) for \(g\in \mathcal {G'}=\{G_L,G_I,I',L'\}, c\in \mathcal {C'}=\{RGB,HSV,HSI,SV,YUV,YIQ\},\) in each column except PersonFinder indicates that the additional color information is learned by the model, which then performs better than without it. Depending on the dataset used, different color models perform best. In FlickrScene it is YUV, in CIFAR-10 and CIFAR-100 it is RGB, but the results are similar in all color spaces, especially in CIFAR-10. In PersonFinder it is Luminance: we have \(\mathrm {acc}(c,c)<\mathrm {acc}(g,g)\) and the values are similarly high. The model becomes color-independent when applied to PersonFinder data. During training, it focuses on the particular properties of the input: On the one hand, if \(x\in \mathcal {G'}, y\in \mathcal {C'}\), the performance differences are significantly lower because in this case the main learned properties are shape cues, edges, forms, brightness, etc., which are also useful for classifying color images. On the other hand, if \(x\in \mathcal {C'}, y\in \mathcal {G'}\), the difference is higher, except for PersonFinder, because the model now focuses on colors, which is not helpful for classifying grayscale test images.

Color Dependency in Different Data Sets. Furthermore, we notice that the improvement of \(\mathrm {acc}(c,c)>\mathrm {acc}(g,g)\) is more significant in FlickrScene and CIFAR-100 than in CIFAR-10, while \(\mathrm {acc}(g,g)>\mathrm {acc}(g,c)\) is similarly low in CIFAR-10 and CIFAR-100 and higher in FlickrScene. The focus on colors during training seems to be more pronounced in FlickrScene and CIFAR-100 than in CIFAR-10 and is not even present in PersonFinder. To give an explanation, we compare the error rates, see Table 1. When training the neural network with color CIFAR-10 pictures, one gets the proportion \(\mathrm {err}^{(c_{10})}_{(RGB,G_L)}=0.081.\) That means 8% of the previously correctly classified images are no longer recognized only because of the missing color information. One could say that in 8% of the images, the color provides the main information; in 92%, the neural network seems to look mainly at shape cues like forms or edges and at brightness. The CIFAR-10 dataset distinguishes between ten categories like vehicles, animals, etc. This is what one would expect, considering that humans are statistically not reliant on colors in these cases, in contrast to FlickrScene images, which seem to carry much more color information. Here, about 32% of the recognizable images contain necessary color information which the neural network has learned. The categories could be classified by their main colors: sand for desert, white for snow, green and brown for forest, and gray for urban. CIFAR-100 images rely on even more color information, as this dataset has detailed subcategories, e.g. sunflowers and roses, or apples and oranges. Related species of blossoms or fruits may be difficult to distinguish even for humans if only grayscale pictures are available. The error rates of PersonFinder images are very small because of the insensitivity to color modification.

Table 1. Error rates heat map in %.

Training in Grayscale. If one trains the neural network with only grayscale images but feeds in colored ones for testing, the error rates in the first four lines are smaller, see the lower part of Table 1. Here, the algorithm learns forms, edges, and brightness and is predominantly not irritated by color test images. Perhaps even test objects with extraordinary colors, e.g. exotic animals like red frogs, special birds, etc., are recognized better in an application; this could be explored further. In the lowest four lines, when dealing with the HSV, SV, and HSI models, the values stand out. A reason could be that some shapes change because of the discontinuity of the transformation from RGB to HSV or HSI, which can also be noticed in Fig. 1. On the CIFAR-10 and CIFAR-100 datasets, the RGB mode operates best, while on FlickrScene YUV and YIQ are similarly good. The HSV, SV, and HSI modes perform worse because of the above-mentioned loss of texture. This is more crucial for CIFAR-100 images than for CIFAR-10 or FlickrScene images. Only in FlickrScene does \(\mathrm {acc}(L,L)\approx \mathrm {acc}(G_L,G_L)\approx \mathrm {acc}(L',L')\approx \,\mathrm {acc}(I,I)\approx \mathrm {acc}(G_I,G_I)\approx \mathrm {acc}(I',I')\) hold. In the other datasets, the differences are higher and Intensity performs worse than Luminance. This does not agree with the findings of [12].

Degraded Images with Noise and Gaussian Blur. To understand the impact of color, it is also interesting to keep the color information but modify the rest of the image. To do this, we add sensor noise or Gaussian blurring to our test images. As a side effect, this simulates a low-quality camera. The accuracies are recorded in Fig. 6, the error rates are provided in Table 1. In all regarded datasets, the error rates of color model changes are similar, irrespective of whether shapes are degraded or not. The colors of the heat map make this visible. Most robust against color or texture modification is training in the Luminance color system: whether the test images are in grayscale, RGB, YUV, or YIQ, they perform well.

Table 2. Proportion of wrongly classified test images when changing the representation during testing, for the different categories of CIFAR-10 [airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck] and FlickrScene [desert, forest, snow, urban]. Highest and lowest values are highlighted. For the CIFAR-100 and PersonFinder results, see the text.

Hard Negative Analysis. To investigate the impact of color, we regard the change of predictions when adding or removing color information from the test images in several color spaces. We count the differently predicted classes caused by the modification. In Table 2, we print the normalized numbers and observe several accumulations in some classes. Negative numbers indicate that the model has classified more images of a category correctly than before; positive numbers indicate a worsening.

First, we regard CIFAR-10. The category cat performs better in six of the eight regarded modes when test images carry less color information; the various colors of cats lead to confusion and mistakes. Conversely, the bad result in the category deer in three of eight modes suggests that the brown color of the animal and the green color of the background were learned or needed, respectively. Training with HSV or HSI images leads to high misclassification rates in the category ship; lost shapes and the missing blue background color are a possible explanation. Using FlickrScene, we get high values in the category desert, in thirteen of sixteen cases. This category seems to be sensitive to colors; the presence of yellow, orange, or sand tones might be recognized as desert. The category urban has low values because the colors of objects vary strongly, hence the shapes and objects are more important and the impact of color is low here. Snow is classified robustly against color augmentation because mainly the brightness seems to be important. We refrain from printing the CIFAR-100 results; however, we evaluated them and counted the three highest and lowest numbers: Training with color and testing with grayscale images shows a noticeable deterioration in the results of the classes beaver, plain, rabbit, and fox. Improvement was reached in maple, lobster, cockroach, streetcar, forest, and roses. When training with grayscale images and testing with colored ones, we obtain worse results in the categories beaver, rabbit, and fox and better ones in forest, pine, sunflowers, maple, and roses. In comparison with the above results, these categories do not surprise us, for they are wild animals (beaver, rabbit, fox) with similar properties as the deer in CIFAR-10, and a plain landscape similar to the desert in FlickrScene. Largely independent of color, with high texture information, are mainly trees and flowers. Lobster and cockroach have such characteristic shapes that color is secondary. Different streetcars may have all kinds of colors, hence the algorithm did not learn colors and performs better with gray test images, similar to the urban category in FlickrScene. In PersonFinder, we continuously obtain mainly false negative results: about [ ], [ ] when changing the color mode from grayscale to YIQ, SV, or HSV. Some persons are not detected only because of the missing or added color information, but the background was mostly recognized reliably.

Table 3. Prorated differences of the activations of the first fully connected layer (left column of each pair) and of the confidence values (right column) without vs. with color information, for PersonFinder, CIFAR-10, CIFAR-100, and FlickrScene. Lowest and highest values are highlighted.

Color Dependency and Stability. In Table 3, we first print the differences in the number of activations \(\mathrm {act}\) after the first fully connected layer. The second values are the differences of the confidence values \(\mathrm {con}\) when removing or adding color information. Only the absolute differences are counted (activation or not), while differently strong activations remain unnoticed. The best results are small values, and they are found when color information was added, more precisely, when training was executed in the Luminance space and testing in the RGB, YUV, or YIQ space. The worst results are found in the middle of the table, when dealing with HSV color images or removing color information.

In CIFAR, one notices that always \(\mathrm {act}<\mathrm {con}\). Here, changing color information leads to a constant deterioration in the deep layers. In FlickrScene, this only applies to a few values, and in PersonFinder there is no regular connection to be seen. We have already seen that the person classifier is color-independent, hence an activation difference after a rear layer caused by missing or added color should not lead to a difference in the confidence value. Training in \(L'\) and testing in the HSV or SV system constitutes an exception. The reason is the same as described above: deterioration of shapes in these modes.

To examine stability, we decrease Hue and/or Saturation in our input HSV color images in small steps. We notice only correspondingly small deteriorations in the \(\mathrm {act}\) and \(\mathrm {con}\) values. All regarded datasets seem to be stable with respect to moderate color changes.

4 Conclusions and Future Work

In this paper, we investigated the impact of color depending on image quality, datasets, and particular classes. Our method, which makes it possible to apply test images of different color systems to the same architecture, is insightful. We found that, in general, color information plays an important role in image classification and that its contribution becomes evident in more specialized datasets with a high number of classes or special subclasses (e.g. CIFAR-100 and FlickrScene). Furthermore, our results lead to the thesis that some categories like wild animals (deer, rabbit, fox, beaver), ship (CIFAR), or plain landscape and desert (CIFAR-100, FlickrScene) are highly dependent on color information.

Regarding the application, it is more successful to train with RGB (CIFAR) or YIQ (FlickrScene) color images. Training with grayscale, especially with Luminance images, promises to be advantageous if the expected contribution of texture is significantly higher than that of color. This is the case in categories like trees, flowers, certain animals (cat, lobster, cockroach), persons, streetcars, or urban images. An intelligent algorithm then gains color independence. This also applies when detecting persons, because, as expected, different colors of clothes or background should not be relevant. In contrast, the HSV and HSI color spaces are interesting only if one suspects a high impact of color but a lower one of shapes. In our results, HSV is not suitable if texture is degraded, especially if noise is responsible for this. Regarding test images of any color space, Luminance is the most robust color mode for training and moreover the most suitable grayscale mode, not Intensity, as [12] found. In any case, the fact that one deals with smaller images and saves run-time is only a small advantage of the grayscale representation. It would also be possible to perform a similar study considering state-of-the-art families of networks. Our work could be helpful when dealing with thermal infrared images [13], which include no color information, e.g. in video surveillance. It is then natural to train with grayscale images – if no suitable thermal infrared training sets are available – and our results can be applied to this case.