2.1 Introduction

Extracting useful features from a scene is an essential step in any computer vision or multimedia analysis task. Though progress has been made in past decades, it remains difficult for computers to accurately recognize an object or analyze the semantics of an image. In this chapter, we study two contrasting approaches to feature extraction, model-based and data-driven, and then evaluate a hybrid scheme.

One may consider model-based and data-driven to be two mutually exclusive approaches. In practice, however, they are not. Virtually all model construction relies on some information from data, and all data-driven schemes are built upon some model, simple or complex. The key research questions for the feature-extraction task of an object recognition or image annotation application are:

1. Can more data help a model? Given a model, can the availability of more training data improve feature quality, and hence improve annotation accuracy?

2. Can an improved model help a data-driven scheme? Given some fixed amount of data, can a model be enhanced to improve feature quality, and hence improve annotation accuracy?

We first closely examine a model-based deep-learning scheme, which is neuroscience-motivated. Strongly motivated by the fact that the human visual system can effortlessly conduct these tasks, neuroscientists have been developing vision models based on physiological evidence. Though such research may still be in its infancy and several hypotheses remain to be validated, some widely accepted theories have been established. This chapter first presents such a model-based approach. Built upon the pioneering neuroscience work of Hubel [2], all recent models are founded on the theory that visual information is transmitted from the primary visual cortex (V1) over extrastriate visual areas (V2 and V4) to the inferotemporal cortex (IT). IT in turn is a major source of input to the prefrontal cortex (PFC), which is involved in linking perception to memory and action [3]. The pathway from V1 to IT, which is called the visual frontend [4], consists of a number of simple and complex layers. The lower layers attain simple features that are invariant to scale, position, and orientation at the pixel level. Higher layers detect complex features at the object-part level. Pattern reading at the lower layers is unsupervised, whereas recognition at the higher layers involves supervised learning. Computational models proposed by Lee [5] and Serre [6] show such a multi-layer generative approach to be effective in object recognition.

Our empirical study compared features extracted by a neuroscience-motivated deep-learning model with those extracted by a data-driven scheme through an application of image annotation. For the data-driven scheme, we employed features of some widely used pixel aggregates such as shapes and color/texture patches. These features construct a feature space. Given a previously unseen data instance, its annotation is determined through some nearest-neighbor scheme such as \(k\)-NN or kernel methods. The assumption of the data-driven approach is that if the features of two data instances are close (similar) in the feature space, then their target semantics should be the same. For a data-driven scheme to work well, its feature space must be densely populated with training instances so that unseen instances can find a sufficient number of nearest neighbors as their references.
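To make the data-driven assumption concrete, the following is a toy sketch of \(k\)-NN annotation: an unseen image is labeled by the majority label among its nearest neighbors in the feature space. The distance metric (Euclidean) and the function name are illustrative choices, not part of the chapter's method.

```python
import numpy as np
from collections import Counter

def knn_annotate(query, train_features, train_labels, k=5):
    """Label an unseen feature vector by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(np.asarray(train_features) - np.asarray(query), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest training instances
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]               # the most frequent label wins
```

The sketch also makes the scheme's weakness visible: unless the feature space is densely populated, the \(k\) neighbors may not be semantically relevant.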

We made two observations from the results of our experimental study. First, when the number of training instances is small, the model-based deep-learning scheme outperforms the data-driven scheme. Second, while both feature sets commit prediction errors, each does better on certain objects. The model-based scheme tends to do well on objects of a regular, rigid shape with similar interior patterns, whereas the data-driven scheme performs better in recognizing objects with varying perceptual characteristics. These observations establish three guidelines for feature design.

1. Recognizing objects with similar details. For objects that have regular features, invariance should be the top priority of feature extraction. A feature-extraction pipeline that is invariant to scale, position, and orientation requires only a handful of training instances to obtain a good set of features for recognizing this class of objects. For this class of objects, the model-based approach works very well.

2. Recognizing objects with different details. Objects with varying features, such as strawberries in different orientations and environmental settings, or dalmatians with their varying patterns, do not have invariant features to extract. For recognizing such an object class, diversity is the design priority of feature extraction. To achieve diversity, a learning algorithm requires a large number of training instances to collect abundant samples. Therefore, the data-driven approach works better for this class of objects.

3. Recognizing objects within abstract concepts. Classifying objects of different semantics, such as whales and lions both being mammals, or tea cups and beer mugs both being cups, is remote from percepts. Abstract concept classification requires a WordNet-like semantic model.

The first two design principles confirm that feature extraction must consider both feature invariance and feature diversity; but how? A feedforward pathway model designed by Poggio’s group [7] holds promise in obtaining invariant features. However, additional signals must be collected to enhance the diversity aspect. As Serre [5] indicates, feedback signals are transmitted back to V1 to pay attention to details. Biological evidence suggests that a feedback loop in the visual system instructs cells to “see” local details such as color-based shapes and shape-based textures. These insights lead to the design of our deep model-based and data-driven hybrid architecture (DMD), which combines a deep model-based pipeline with a data-driven pipeline to form a six-layer hierarchy. While the model-based pipeline faithfully models a deep learning architecture based on the visual cortex’s feedforward path [8], the data-driven pipeline extracts augmented features in a heuristic-based fashion. The two pipelines join at an unsupervised middle layer, which clusters low-level features into feature patches. This unsupervised layer is a critical step to effectively regularize the feature space [9, 10] for improving subsequent supervised learning, making object prediction both effective and scalable. Finally, at the supervised layer, DMD employs a traditional learning algorithm to map patches to semantics. Empirical studies show that DMD works markedly better than traditional models in image annotation. DMD’s success is due to (1) its simple-to-complex deep pipeline for balancing invariance and selectivity, and (2) its model-based and data-driven hybrid approach for fusing feature specificity and diversity.

In this chapter, we show that a model-based pipeline encounters limitations. As we have explained, a data-driven pipeline is necessary for recognizing objects of different shapes and details. DMD employs both a deep and a hybrid approach to achieve improved performance for the following reasons:

1. Balancing feature invariance and selectivity. DMD implements Serre’s method [8] to achieve a good balance between feature invariance and selectivity. To achieve feature selectivity, DMD conducts multi-band, multi-scale, and multi-orientation convolutions. To achieve invariance, DMD keeps only signals of sufficient strength via pooling operations.

2. Properly using unsupervised learning to regularize supervised learning. At the second, the third, and the fifth layers, DMD introduces unsupervised learning to reduce features so as to prevent the subsequent supervised layer from learning trivial solutions.

3. Augmenting feature specificity with diversity. Through empirical study, we identified that a model-based-only approach cannot effectively recognize irregular objects or objects with diversified patterns; we therefore fuse a data-driven pipeline into DMD. We point out subtle pitfalls in combining model-based and data-driven features and propose a remedy for noise reduction.

2.2 DMD Algorithm

DMD consists of two pipelines organized into six steps. Given a set of training images, the model-based pipeline feeds the training images to the edge-selection step. At the same time, the data-driven pipeline feeds the training images directly to the sparsity-regularization step. We first discuss the model-based pipeline of DMD in Sect. 2.2.1, and then its data-driven pipeline in Sect. 2.2.2.

2.2.1 Model-Based Pipeline

Figure 2.1 depicts how visual information is transmitted from the primary visual cortex (V1) over extrastriate visual areas (V2 and V4) to the IT. Physiological evidence indicates that the cells in V1 largely conduct selection operations, whereas the cells in V2 and V4 conduct pooling operations. Based on this evidence, M. Riesenhuber and T. Poggio’s theory of the feedforward path of object recognition in the cortex [7] establishes a qualitative way to model the ventral stream in the visual cortex. Their model suggests that the visual system consists of multiple layers of computational units where simple S units alternate with complex C units. The S units deal with signal selectivity, whereas the C units deal with invariance. Lower layers attain features that are invariant to scale, position, and orientation at the pixel level. Higher layers detect features at the object-part level. Pattern reading at the lower layers is largely unsupervised, whereas recognition at the higher layers involves supervised models. Recent advancements in deep learning [10] have led to mutual justification that a model-based, hierarchical model enjoys these computational advantages:

  • Deep architectures enjoy advantages over shallow architectures (please consult [11] for details), and

  • Deep architectures initialized with unsupervised learning can enjoy better generalization performance [12].

Fig. 2.1 Information flow in the visual cortex. (See the brain structure in [13])

Motivated by both physiological evidence [8, 14] and computational learning theories [5], we designed DMD’s model-based pipeline with six steps:

1. Edge selection (Sect. 2.2.1.1). This step corresponds to the operation conducted by cells in V1 and V2 [15], which detect edge signals at the pixel level.

2. Edge pooling (Sect. 2.2.1.2). This step also corresponds to cells in V1 and V2. The primary operation is to pool strong, representative edge signals.

3. Sparsity regularization (Sect. 2.2.1.3). To prevent too large a number of features, which can lead to the curse of dimensionality, or features at too low a level, which may lead to trivial solutions, DMD uses this unsupervised step to group edges into patches.

4. Part selection (Sect. 2.2.1.4). There is not yet strong physiological evidence, but it is widely believed that V2 performs part selection and then feeds signals directly to V4. DMD models this step to look for image patches matching the prototypes (patches) produced in the previous step.

5. Part pooling (Sect. 2.2.1.5). Cells in V4 [16], which have larger receptive fields than those in V1, deal with parts. Because of their larger receptive fields, V4’s selectivity is preserved over translation.

6. Supervised learning (Sect. 2.2.1.6). Learning occurs at all steps and certainly at the level of the IT cortex and PFC. This top-most layer employs a supervised learning algorithm to map a patch-activation vector to some objects.

2.2.1.1 Edge Selection

In this step, computational units model the classical simple cells described by Hubel and Wiesel in the primary visual cortex (V1) [17]. A simple selective operation is performed by V1 cells. To model this operation, Serre [8] uses Gabor filters to perform a 2D convolution, Lee [5] suggests using a convolutional restricted Boltzmann machine (RBM), and Ranzato [18] constructs an encoder convolution. We initially employed T. Serre’s strategy since Serre [6, 8] justifies model selection and parameter tuning based on strong physiological evidence, whereas computer scientists often justify their models through contrived experiments on a small set of samples. The input image is converted into a gray-value image, where only the edge information is of interest. The 2D convolution is a summation-like operation whose convolution kernel models the receptive fields of cortical simple cells [19]. Gabor filters of different sizes are applied as the convolution kernel to process the gray-value image \({\mathbf{I}}\) using this formulation:

$$ {\mathbf{F}}_{s}(x,y)=\exp\left(-{{x_0^2+\gamma^2 y_0^2}\over {2 \sigma^2}}\right)\times \cos\left({{2\pi}\over {\lambda}}x_0\right), $$
(2.1)

where

$$ x_0=x \cos\theta+y \sin\theta\;{\hbox{and}}\;y_0=-x \sin\theta+y \cos\theta. $$

In (2.1), \(\gamma\) is the aspect ratio and \(\theta\) is the orientation, which takes values \(0^{\circ},\;45^{\circ},\;90^{\circ},\;{\hbox{and}}\;135^{\circ}.\) Parameters \(\sigma\;{\hbox{and}}\;\lambda\) are the effective width and wavelength, respectively. The Gabor filter forms a 2D matrix whose value at position \((x,y)\) is \({\mathbf{F}}_{s}(x,y).\) The matrix size \((s\times s)\), or the Gabor filter size, ranges from \(7\times7\;{\hbox {to}}\;37\times37\) pixels in intervals of two pixels. Thus there are 64 (16 scales \(\times\) 4 orientations) different receptive field types in total. With different parameters, Gabor filters can cover different orientations and scales and hence increase selectivity. The output of the edge-selection step is produced by 2D convolutions (conv2) of the input image with the \(n_b \times n_s \times n_f = 64\) Gabor filters:

$$ {\mathbf{I}}_{S\_{\rm edge}(i_b,i_s,i_f)} = {\rm conv2}({\mathbf{I}}, {\mathbf{F}}_{i_F}), $$
(2.2)

where

$$ i_F = (i_b\times n_s + i_s)\times n_f + i_f. $$
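As a concrete illustration of (2.1) and (2.2), the sketch below builds a 64-filter Gabor bank (16 sizes, 4 orientations) and convolves it with a gray-value image using SciPy. The per-scale \(\sigma/\lambda\) schedule, the aspect ratio \(\gamma=0.3\), and the zero-mean normalization are assumptions of this sketch rather than the parameter table used in [8].

```python
# Minimal sketch of edge selection (Eqs. 2.1-2.2), assuming an illustrative
# sigma/lambda schedule; not the exact parameters of Serre [8].
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(size, theta, sigma, lam, gamma=0.3):
    """Eq. (2.1): a (size x size) Gabor kernel at orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x0 = x * np.cos(theta) + y * np.sin(theta)
    y0 = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(x0**2 + (gamma * y0)**2) / (2 * sigma**2)) * np.cos(2 * np.pi * x0 / lam)
    return g - g.mean()   # zero-mean normalization (an added convenience, not in Eq. 2.1)

def edge_selection(image):
    """Eq. (2.2): convolve the gray-value image with all 64 filters.

    Output order matches i_F = (i_b * n_s + i_s) * n_f + i_f, since the 16 sizes
    are grouped pairwise into the n_b = 8 bands.
    """
    responses = []
    sizes = range(7, 38, 2)                               # 16 scales: 7x7 ... 37x37
    orientations = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    for s in sizes:
        sigma, lam = 0.8 * s / 2.0, s / 2.0               # assumed schedule per scale
        for theta in orientations:
            f = gabor_filter(s, theta, sigma, lam)
            responses.append(convolve2d(image, f, mode='same'))
    return responses                                      # 64 edge-response maps
```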

2.2.1.2 Edge Pooling

In the previous step, several edge-detection output matrices are produced, which sufficiently support selectivity. At the same time, there is clearly some redundant or noisy information in these matrices. Physiological evidence from studies on cats shows that a MAX-like operation is performed by complex cells [15] to deal with redundancy and noise. To model this MAX-like operation, Serre’s, Lee’s, and Ranzato’s work all agree on applying a MAX operation to the outputs of the simple cells. The response \({\mathbf{I}}_{{\rm edge}(i_b,i_f)}\) of a complex unit corresponds to the response of the strongest of all the neighboring units from the previous edge-selection layer. The output of this edge-pooling layer is as follows:

$$ {\mathbf{I}}_{{\rm edge}(i_b,i_f)}(x,y)=\max_{i_s \in {\mathbf{v}}_s, m \in {\mathcal{N}}(x,y)}{\mathbf{I}}_{S\_{\rm edge}(i_b,i_s,i_f)}(x_m,y_m), $$

where \((x_m,y_m)\) denotes a position within the spatial neighborhood \({\mathcal{N}}(x,y)\) of position \((x,y).\) The max is taken over the two scales of a band within the same spatial neighborhood and the same orientation, as justified by the experiments of Serre [20].
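A minimal sketch of this pooling step follows, assuming the 64 response maps from the edge-selection sketch above (indexed per Eq. 2.2). The spatial neighborhood size is an assumed constant here; any band-dependent schedule or subsampling of the pooled maps is omitted for brevity.

```python
# Sketch of edge pooling: MAX over the two scales of each band and over a local
# spatial neighborhood, per orientation. pool_size is an assumed value.
import numpy as np
from scipy.ndimage import maximum_filter

def edge_pooling(responses, n_bands=8, n_scales=2, n_orients=4, pool_size=8):
    pooled = []
    for i_b in range(n_bands):
        for i_f in range(n_orients):
            # indices follow Eq. (2.2): i_F = (i_b * n_s + i_s) * n_f + i_f
            maps = [responses[(i_b * n_scales + i_s) * n_orients + i_f]
                    for i_s in range(n_scales)]
            scale_max = np.maximum.reduce(maps)                    # max over the band's scales
            pooled.append(maximum_filter(scale_max, size=pool_size))  # local spatial max
    return pooled                                                  # n_bands * n_orients maps
```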

Fig. 2.2 DMD model-based pipeline, steps 1 and 2

2.2.1.3 Sparsity Regularization

A subtle and important step of a deep architecture is to perform proper initialization between layers. The edge-pooling step may produce a huge number of edges. With such a large-sized output, the next layer may risk learning trivial solutions at the pixel level. Both Serre [8] and Ekanadham [21] suggest sparsifying the output of V2 (or the input to V4).

To perform the sparsification, we form pixel patches via sampling. In this way, not only is the size of the input to the part-selection step reduced, but patches larger than pixels can also regularize the learning at the upper layers. The regularization effect is achieved by the fact that parts are formed by neighboring edges, not by edges at random positions; thus, there is no reason to conduct learning directly on the edges. A patch is a region of pixels sampled at a random position of a training image at four orientations. An object can be fully expressed if enough representative patches have been sampled. It is important to note that this sampling step can be performed incrementally when new training images become available. The result of this unsupervised learning step is \(n_p\) prototype patches, where \(n_p\) can initially be set to a large value and then trimmed back by the part-selection step.

In Sect. 2.2.2 we show that the data-driven pipeline also produces patches by sampling a large number of training instances. The two pipelines join at this unsupervised regularization step.
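The sampling itself can be sketched as follows. The sketch assumes each training image’s pooled edge maps have been stacked per orientation into an \((H, W, 4)\) array; the patch sizes and the default \(n_p\) are illustrative settings, to be tuned as discussed in Sect. 2.3.4.

```python
# Sketch of sparsity regularization: sample n_p prototype patches at random
# positions (all four orientations kept) from the pooled edge maps.
import numpy as np

def sample_prototype_patches(pooled_training_maps, n_patches=2000,
                             patch_sizes=(4, 8, 12, 16), seed=0):
    """pooled_training_maps: list of (H, W, n_orients) arrays, one per training image."""
    rng = np.random.default_rng(seed)
    prototypes = []
    for _ in range(n_patches):
        img = pooled_training_maps[rng.integers(len(pooled_training_maps))]
        p = int(rng.choice(patch_sizes))                    # patch size, an assumed set
        y = rng.integers(0, img.shape[0] - p + 1)
        x = rng.integers(0, img.shape[1] - p + 1)
        prototypes.append(img[y:y + p, x:x + p, :].copy())  # one prototype patch
    return prototypes
```

Patches produced by the data-driven pipeline (Sect. 2.2.2) would simply be added to the same prototype pool.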

2.2.1.4 Part Selection

So far, DMD has generated patches via clustering and sampling. This part-selection step finds out which patches may be useful and of which patches an object part is composed. Part-selection units describe a larger region of objects than the edge-detection units, by focusing on parts of the objects. Similar to our approach, Serre’s \(S_2\) units behave as radial basis function (RBF) units, Lee uses a convolutional deep belief network (CDBN), and Ranzato’s algorithm implements a convolutional operation for the decoder. All are consistent with well-known response properties of neurons in the primate inferotemporal cortex (IT).

Serre proposes using a Gaussian-like Euclidean distance to measure similarity between an image and the pre-calculated prototypes (patches). Basically, we would like to find out what patches an object consists of. Analogously, we are constructing a map from object parts to an object using the training images. Once the mapping has been learned, we can then classify an unseen image.

To perform part selection, we examine whether the patches obtained in the regularization step appear frequently enough in the training images. If a patch appears frequently, it can be selected as a part; otherwise, it is discarded for efficiency. For each training image, we match its edge patches with the \(n_p\) prototype patches generated in the previous step. For the \(i_b^{th}\) band of an image’s edge-detection output, we obtain for the \(i_p^{th}\) patch a measure as follows:

$$ {\mathbf{I}}_{S\_{\rm part}(i_b,i_p)}=\exp(-\beta||{\mathbf{X}}_{i_b}-{\mathbf{P}}_{i_p}||^2), $$
(2.3)

where \(\beta\) is the sharpness of the tuning and \({\mathbf{P}}_{i_p}\) is one of the \(n_p\) patches learned during sparsity regularization. \({\mathbf{X}}_{i_b}\) is a transformation of \({\mathbf{I}}_{{\rm edge}(i_b,i_f)}\) with all \(n_f\) orientations merged to fit the size of \({\mathbf{P}}_{i_p}.\) We obtain \(n_b\) measurements of the image for each prototype patch. Hence the total number of measurements that this part-selection step makes is the number of patches times the number of bands, or \(n_p \times n_b.\)

2.2.1.5 Part Pooling

Each image is measured against \(n_p\) patches, and for each patch, \(n_b\) measurements are performed. To aggregate the \(n_b\) measurements into one, we resort to the part-pooling units, which correspond to visual cortical V4 neurons. It has been discovered that a substantial fraction of V4 neurons in rhesus monkeys (Macaca mulatta) take the maximum of their inputs as output [16], or

$$ {\mathbf{v}}_{{\rm part}(i_p)}=\max_{i_b}{\mathbf{I}}_{S\_{\rm part}(i_b,i_p)}. $$
(2.4)

The MAX operation (maximizing similarity is equivalent to minimizing distance) not only maintains feature invariance, but also scales down the feature-vector size. The output of this stage for each training image is a vector of \(n_p\) values, as depicted by the pseudo code in Fig. 2.3.

Fig. 2.3 DMD steps 4 and 5
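Complementing the pseudo code in Fig. 2.3, the sketch below renders (2.3) and (2.4) in Python. It assumes each band’s pooled edge map is an \((H, W, n_f)\) array; taking the best match over spatial positions within a band is an interpretation made by this sketch, and \(\beta\) is left as a tunable constant.

```python
# Sketch of part selection (Eq. 2.3) and part pooling (Eq. 2.4).
import numpy as np

def part_selection(edge_band, prototype, beta=1.0):
    """Eq. (2.3): Gaussian-RBF match of one prototype patch within one band.

    edge_band: (H, W, n_orients) pooled edge map of band i_b.
    prototype: (p, p, n_orients) patch from sparsity regularization.
    """
    p = prototype.shape[0]
    best = 0.0
    for y in range(edge_band.shape[0] - p + 1):
        for x in range(edge_band.shape[1] - p + 1):
            window = edge_band[y:y + p, x:x + p, :]
            dist2 = float(np.sum((window - prototype) ** 2))
            best = max(best, np.exp(-beta * dist2))   # keep the best match in this band
    return best

def part_pooling(edge_bands, prototypes, beta=1.0):
    """Eq. (2.4): for each prototype, keep the strongest response over all bands."""
    return np.array([max(part_selection(band, proto, beta) for band in edge_bands)
                     for proto in prototypes])        # length-n_p patch-activation vector
```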

2.2.1.6 Supervised Learning

At the top layer, DMD performs part-to-object mapping. At this layer, any traditional shallow learning algorithm can work reasonably well. We employ SVMs to perform the task. The input to the SVMs is a set of vector representations of image patches produced by this model-based pipeline and by the data-driven pipeline, which we present next. Each image is represented by a vector of real values, each depicting the image’s perceptual strength matched by a prototype patch.
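One possible realization of this layer, using scikit-learn’s linear SVM, is sketched below; the chapter only states that SVMs are used, so the specific class and the regularization constant are assumptions.

```python
# Sketch of the supervised layer: a linear SVM mapping n_p-dimensional
# patch-activation vectors to object labels.
from sklearn.svm import LinearSVC

def train_supervised_layer(activation_vectors, labels, C=1.0):
    """activation_vectors: one row per image, one column per prototype patch."""
    clf = LinearSVC(C=C)        # C=1.0 is an assumed setting; any shallow learner would do
    return clf.fit(activation_vectors, labels)
```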

2.2.2 Data-Driven Pipeline

The key advantage of the model-based pipeline is feature invariance. For objects that have a rigid body of predictable patterns, such as a watch or a phone, the model-based pipeline can obtain invariant features from a small number of training instances. Indeed, our experimental results presented in Sect. 2.3 show that it takes just five training images to effectively learn the features of a watch and to recognize it. Unfortunately, for objects that can have various appearances such as pizzas with different toppings, the model-based pipeline runs into limitations. The features it learned from the toppings of one pizza cannot help recognize a pizza with different toppings. The key reason for this is that invariance may cause overfitting, and that hurts selectivity.

To remedy the problem, DMD adds a data-driven pipeline. The principal idea is to collect enough examples of an object so that feature selectivity can be improved. By collecting signals from a large number of training data, it is also likely to collect signals of different scales and orientations. In other words, instead of relying solely on a model-based pipeline to deal with invariance, we can collect enough examples to ensure with high probability that the collected examples can cover most transformations of features.

Another duty that the data-driven pipeline fulfills is to remedy a key shortcoming of the model-based pipeline: it considers only the feedforward pathway of the visual system. It is well understood that some complex recognition tasks may require recursive predictions and verifications. Backprojection models [22, 23] and attention models [24] are still in an early stage of development, and hence there is no solid basis for incorporating feedback. DMD uses heuristic-based signal processing subroutines to extract patches for the data-driven pipeline. The extracted patches are merged with those learned in the sparsity-regularization step of the model-based pipeline.

We extracted patches in multiple resolutions to improve invariance [25, 26]. We characterized images by two main features: color and texture. We consider shapes as attributes of these main features.

2.2.2.1 Color Patches

Although the wavelength of visible light ranges from 400 to 700 nm, research [28] shows that the colors that can be named across all cultures are generally limited to eleven. In addition to black and white, the discernible colors are red, yellow, green, blue, brown, purple, pink, orange, and gray.

We first divided color into 12 color bins: 11 bins for culture colors and one bin for outliers [26]. At the coarsest resolution, we characterized color using a color mask of 12 bits. To record color information at finer resolutions, we record eight additional features for each color. These eight features are color histograms, color means (in the H, S, and V channels), color variances (in the H, S, and V channels), and two shape characteristics: elongation and spread. Color elongation characterizes the shape of a color, and spreadness characterizes how that color scatters within the image [29]. We categorize color features by coarse, medium, and fine resolutions.
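The coarsest-resolution features can be sketched as follows: HSV pixels are quantized into the 11 culture colors plus an outlier bin, yielding a 12-bit presence mask and a 12-bin histogram. The hue boundaries and thresholds below are illustrative, not the quantization of [26]; the per-color means, variances, elongation, and spread are omitted for brevity, and brown (a dark orange) is folded into the orange bin.

```python
# Sketch of the coarse color features, assuming illustrative HSV boundaries.
import numpy as np

CULTURE_COLORS = ['black', 'white', 'gray', 'red', 'orange', 'yellow',
                  'green', 'blue', 'purple', 'pink', 'brown', 'other']

def coarse_color_features(hsv, presence_threshold=0.01):
    """hsv: (H, W, 3) array with hue in [0, 360) and saturation/value in [0, 1]."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    bins = np.full(h.shape, CULTURE_COLORS.index('other'))
    # Achromatic pixels first
    bins[v < 0.2] = CULTURE_COLORS.index('black')
    bins[(s < 0.1) & (v > 0.85)] = CULTURE_COLORS.index('white')
    bins[(s < 0.1) & (v >= 0.2) & (v <= 0.85)] = CULTURE_COLORS.index('gray')
    # Chromatic pixels by hue range (red wraps around 360 degrees)
    chrom = (s >= 0.1) & (v >= 0.2)
    for name, lo, hi in [('red', 345, 15), ('orange', 15, 45), ('yellow', 45, 70),
                         ('green', 70, 160), ('blue', 160, 250),
                         ('purple', 250, 290), ('pink', 290, 345)]:
        in_range = ((h >= lo) | (h < hi)) if lo > hi else ((h >= lo) & (h < hi))
        bins[chrom & in_range] = CULTURE_COLORS.index(name)
    hist = np.bincount(bins.ravel(), minlength=12) / bins.size   # 12-bin color histogram
    mask = (hist > presence_threshold).astype(int)               # coarsest level: 12-bit mask
    return mask, hist
```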

2.2.2.2 Texture Patches

Texture is an important cue for image analysis. Studies [30–33] have shown that characterizing texture features in terms of structuredness, orientation, and scale (coarseness) fits well with models of human perception. A wide variety of texture analysis methods have been proposed in the past. We chose a discrete wavelet transformation (DWT) using quadrature mirror filters [31] because of its computational efficiency.

Each wavelet decomposition of a 2D image yields four subimages: a \({\frac{1}{2}}\,{\times}\,{\frac{1}{2}}\) scaled-down version of the input image and its wavelets in three orientations: horizontal, vertical, and diagonal. Decomposing the scaled-down image further, we obtain a tree-structured, or wavelet packet, decomposition. The wavelet image decomposition provides a representation that is easy to interpret. Every subimage contains information of a specific scale and orientation and also retains spatial information. We obtain nine texture combinations from subimages of three scales and three orientations. Since each subimage retains the spatial information of texture, we also compute elongation and spreadness for each texture channel.
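A minimal sketch of the nine scale/orientation texture channels follows, using PyWavelets with a Haar ('db1') filter in place of the quadrature mirror filters of [31]; the mean-energy summary per channel, as well as the omission of elongation and spreadness, are simplifications of this sketch.

```python
# Sketch of the texture features: three levels of 2D DWT, keeping the mean
# energy of the horizontal, vertical, and diagonal subimages at each level.
import numpy as np
import pywt

def texture_features(gray, levels=3):
    """Return 3 scales x 3 orientations = 9 wavelet energy values."""
    feats = []
    approx = np.asarray(gray, dtype=float)
    for _ in range(levels):
        approx, (horiz, vert, diag) = pywt.dwt2(approx, 'db1')   # one decomposition level
        feats.extend([np.mean(horiz ** 2), np.mean(vert ** 2), np.mean(diag ** 2)])
    return feats
```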

2.2.2.3 Feature Fusion

Now, given an image, we can extract the above color and texture information to produce clusters of features. These clusters are similar to the patches generated by the model-based pipeline. All features, whether generated by the model-based or the data-driven pipeline, are fed into the sparsity-regularization step, described in Sect. 2.2.1.3, for subsequent processing. In Sect. 2.3.3 we discuss where fusion can be counter-productive and propose a remedy to reduce noise.

2.3 Experiments

Our experiments were designed to answer three questions:

1. How does a model-based approach compare to a data-driven approach? Which model performs better? Where and why?

2. How does DMD perform compared to an individual approach, model-based or data-driven?

3. How much does the unsupervised regularization step help?

To answer these questions we conducted three experiments:

1. Model-based versus data-driven model.

2. DMD versus individual approaches.

3. Parameter tuning at the regularization step.

2.3.1 Dataset and Setup

To ensure that our experiments cover object appearances of different characteristics (objects with similar details, different details, and within abstract concepts, as described in Sect. 2.1), we collected training and testing data from ImageNet [34]. We selected 10,885 images of 100 categories to cover the above characteristics. We followed the two pipelines of DMD to extract model-based features and data-driven features. For each image category, we cross-validated by using 15 or 30 images for training and the remainder for testing to compute annotation accuracy. Because of the small training-data size, using linear SVMs turned out to be competitive with using advanced kernels. We thus employed linear SVMs to conduct all experiments.

2.3.2 Model-Based versus Data-Driven

This experiment was designed to evaluate individual models and explain where and why model-based or data-driven is more effective.

2.3.2.1 Overall Accuracy

Table 2.1 summarizes the average annotation accuracy of the model-based-only versus the data-driven-only method with 15 and 30 training images, respectively. The table shows the model-based method to be more effective. (We will shortly explain why this average result is not so meaningful.) We next look into individual categories to examine the reasons why.

Table 2.1 Model-based versus data-driven

2.3.2.2 Where Model-Based Works Better

Figure 2.4 shows a set of six categories (example images shown in Table 2.2) where the model-based method performs much better than the data-driven method in class-prediction accuracy (on the \(y\)-axis).

Fig. 2.4 Model-based outperforms data-driven (see color insert)

Table 2.2 Images with a rigid body (see color insert)

First, on some categories such as ‘dollar bills’, ‘Windsor-chair’, and ‘yin-yang’, increasing the training data from 15 to 30 images does not improve annotation accuracy. This is because these objects exhibit precise features (e.g., all dollar bills are the same), and as long as a model-based pipeline can deal with scale/position/orientation invariance, the feature-learning process requires only a small number of examples to capture their idiosyncrasies. The data-driven approach performs quite poorly on these six categories because it lacks the ability to deal with feature invariance. For instance, the data-driven pipeline cannot recognize watches of different sizes and colors, or Windsor-chairs in different orientations.

2.3.2.3 Where Data-Driven Works Better

Data-driven works better than model-based when objects exhibit patterns that are similar but not necessarily identical or scale/position/orientation invariant. Data-driven can work effectively when an object class exhibits some consistent characteristics, such as apples being red or green, and dalmatians having black patches of irregular shapes.

Figure 2.5 shows a set of six categories where data-driven works substantially better than model-based in class-prediction accuracy.

Fig. 2.5 Data-driven outperforms model-based (see color insert)

Table 2.3 displays example images from four categories: ‘strawberry’, ‘sunflower’, ‘dalmatian’, and ‘pizza’. Both ‘strawberry’ and ‘sunflower’ exhibit strong color characteristics, whereas ‘dalmatian’ and ‘pizza’ exhibit predictable texture features. For these categories, once a sufficient number of samples has been collected, semantic prediction can be performed with good accuracy. The model-based pipeline performs poorly in this case because it does not consider color, nor can it find invariant patterns (e.g., pizzas have different shapes and toppings).

Table 2.3 Images with a rigid body (see color insert)

2.3.2.4 Accuracy versus Training Size

We varied the number of training instances for each category from one to ten to further investigate the strengths of model-based and data-driven. The \(x\)-axis of Fig. 2.6 indicates the number of training instances, and the \(y\)-axis the accuracy of the model-based approach minus the accuracy of the data-driven approach. A positive margin means that the model-based approach outperforms the data-driven approach in class prediction, whereas a negative margin means the opposite. When the size of the training data is below five, the model-based approach outperforms the data-driven approach. As we have observed from the previous results, model-based can do well with a small number of training instances on objects with invariant features. On the contrary, a data-driven approach cannot be productive when training instances are scarce, as its name suggests. Also, as we have explained, even when the features of an object class are quite predictable, varying camera and environmental parameters can produce images of different scales, orientations, and colors. A data-driven pipeline is not expected to do well unless it can get ample samples for capturing these variations.

Fig. 2.6 Accuracy comparison (x-axis: number of training instances; y-axis: accuracy of model-based minus data-driven)

2.3.2.5 Discussion

Table 2.4 displays patches of selected categories to illustrate the strengths of the model-based and data-driven approaches, respectively. The patches of ‘watch’ and ‘Windsor-chair’ show patterns of almost identical edges. Model-based thus works well with these objects.

Table 2.4 Images with a rigid body (see color insert)

The ‘dalmatian’ and ‘pizza’ images tell a different story. The patches of these objects are similar but not identical. Besides, these patterns do not have strong edges. The color and texture patterns extracted by the data-driven pipeline can be more productive for recognizing these objects when sufficient samples are available.

Note

Let us revisit the result presented in Table 2.1. It is easy to make the data-driven approach look better by adding more image categories favoring data-driven to the testbed; in other words, the bias of a dataset can favor one approach over the other. Thus, evaluating which model works better cannot be done by looking only at the average accuracy. On a dataset of a few hundred or thousand categories, evaluation should be performed on individual categories.

2.3.3 DMD versus Individual Models

Fusing model-based and data-driven seems like a logical approach. Figure 2.7 shows that DMD outperforms the individual models in classification accuracy for training sizes of both 15 and 30. Table 2.5 reports four selected categories on which we also performed training with up to 700 training instances.

Fig. 2.7 DMD versus individual models

Table 2.5 Fusion accuracy with large training pool

2.3.3.1 Where Fusion Helps

For objects on which the data-driven model works better, adding features from the model-based pipeline can be helpful. The data-driven model needs a large number of examples to characterize an object (typically deformable, as we have discussed), and the features produced by the model-based pipeline can help. On ‘skirts’ and ‘sunflower’, DMD outperforms data-driven when 100 images were used for training ‘skirts’, and 65 for ‘sunflower’. (These are the maximum amounts of training data that can be obtained from ImageNet for these categories.) On these deformable objects, we have explained why the data-driven approach is more productive. Table 2.5 shows that the model-based features employed by DMD can further improve class-prediction accuracy.

2.3.3.2 Where Fusion May Hurt

For objects on which the model-based approach achieves better prediction accuracy, adding features from the data-driven pipeline can be counter-productive. For all object categories in Fig. 2.4 with 30 training instances, adding data-driven features reduces classification accuracy. This is because for those objects where the model-based approach performs well, additional features may contribute noise. For instance, the different colors of watches can make watch recognition harder when color is considered as a feature. Table 2.5 shows that when 30 training instances were used, DMD’s accuracy on ‘watch’ was degraded by 25 percentage points. However, the good news is that when ample samples were provided to capture almost all possible colors of watches, the prediction accuracy improved. When we used 170 watch images to train the ‘watch’ predictor, the accuracy of DMD trails that of data-driven by just 5%. On the category for which we were able to get hold of 700 training instances, both DMD and model-based enjoy the same degree of accuracy.

2.3.3.3 Strength-Dominant Fusion

One key lesson learned from this experiment is that the MAX operator in the pooling steps of DMD is indeed effective in telling features from noise. Therefore, DMD can amplify the stronger signals from either the model-based or the data-driven pipeline. When the signals extracted from an image match well with some patches generated by the model-based pipeline, the model-based classifier should dominate the final class-prediction decision. This simple adjustment to DMD improves its prediction accuracy on objects with rigid bodies, and hence the average prediction accuracy. Another (though obvious) lesson learned is that the number of training instances must be large to make the data-driven pipeline effective. When the training size is small and the model-based pipeline can be effective, the data-driven pipeline should be turned off.
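One way to realize this strength-dominant adjustment is sketched below. The dominance rule (comparing the best model-based patch activation to a threshold) and the threshold value are assumptions of this sketch; the chapter states the principle but not a specific formula.

```python
# Sketch of strength-dominant fusion: when the model-based patch activations
# for an image are strong, let the model-based classifier decide; otherwise
# fall back to the fused DMD classifier.
import numpy as np

def strength_dominant_predict(x_model, x_fused, clf_model, clf_fused, threshold=0.6):
    """x_model: model-based patch-activation vector, values in [0, 1] per Eq. (2.3).
    x_fused: the fused (model-based plus data-driven) feature vector for the same image."""
    strength = float(np.max(x_model))          # best prototype match for this image
    if strength >= threshold:                  # strong match to a rigid-body prototype
        return clf_model.predict([x_model])[0]
    return clf_fused.predict([x_fused])[0]     # otherwise use the full DMD classifier
```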

2.3.4 Regularization Tuning

This experiment examined the effect of the regularization step. Recall that regularization employs unsupervised schemes to achieve both feature selection and feature reduction. Table 2.6 shows the effect of the number of prototype patches \(n_p\) on the final prediction accuracy of DMD. When \(n_p\) is set between \(\hbox{2,000}\,\times\,4\;{\hbox{and}}\;\hbox{2,500}\,\times\,4,\) the prediction accuracy is the best. Too small an \(n_p\) may cause underfitting, and too large an \(n_p\) may cause overfitting. Parameter \(n_p\) determines selectivity, and its best setting must be tuned through an empirical process like this.

Table 2.6 Accuracy for different numbers of patches

2.3.5 Tough Categories

Table 2.7 presents six categories on which DMD cannot perform effectively with a small amount of training data. For instance, the best prediction accuracy with 30 training images on ‘barrel’ and ‘cup’ is 12 and 17%, respectively. These objects exhibit such diversified perceptual features that neither model-based nor data-driven with a small set of training data can capture their characteristics. Furthermore, the ‘cup’ category is more than percepts, as liquid containers of different shapes and colors can be classified as cups. To improve class-prediction accuracy on these challenging categories, more training data ought to be collected to make DMD effective.

Table 2.7 Images with a rigid body (see color insert)

2.4 Related Reading

The CBIR community has been striving to bridge the semantic gap [35] between low-level features and high-level semantics for over a decade. (For a comprehensive survey, please consult [27].) Despite the progress made on both feature extraction and computational learning, these algorithms are far from being completely successful. On feature extraction, the scale-invariant feature transform (SIFT) was considered a big step forward in the last decade for extracting scale-invariant features. SIFT may have improved feature invariance, but it does not effectively deal with feature selectivity. Indeed, SIFT has been shown to be effective in detecting near-replicas of images [36], but on its own it has not been widely successful in more general recognition tasks. As for computational learning, discriminative models such as linear SVMs [37] and generative models such as latent Dirichlet allocation (LDA) [38] have been employed to map features to semantics. However, applying these and similar models directly to low-level features is considered shallow learning, which may be too limited for modeling complex vision problems. Yoshua Bengio [9] argues in theory that some functions simply cannot be represented by architectures that are too shallow.

If shallow learning suffers from limitations, then why hasn’t deep learning been widely adopted? Indeed, neuroscientists have studied how the human vision system works for over 40 years [2]. Ample evidence [39–41] indicates that the human visual system is a pipeline of multiple layers: from sensing pixels and detecting edges to forming patches, recognizing parts, and then composing parts into objects. Physiological evidence strongly suggests that deep learning, rather than shallow learning, is appropriate. Unfortunately, before the work of Hinton in 2006 [10], deep models were not fully embraced, partly because of their high computational intensity and partly because of the well-known problem of local optima. Recent advancements in neuroscience [3] have motivated computer scientists to revisit deep learning in two respects:

1. Unsupervised learning in lower layers. The first couple of layers of the visual system are unsupervised, whereas supervised learning is conducted at the latter layers.

2. Invariance and selectivity tradeoff. Layers in the visual system deal with invariance and selectivity alternately.

These insights have led to recent progress in deep learning. First, Salakhutdinov et al. [42] show that by using an unsupervised learning algorithm to conduct pre-training at each layer, a deep architecture can achieve much better results. Second, Serre [8] shows that by alternating pooling and sampling operations between layers, a balance between feature invariance and selectivity can be achieved. The subsequent work of Lee [5] confirms these insights.

2.5 Concluding Remarks

In this chapter, we first conducted an empirical study to compare the model-based and the data-driven approaches to image annotation. From our experimental results, we learned insights for designing DMD, a hybrid architecture of deep model-based and data-driven learning. We showed the usefulness of unsupervised learning at three steps: edge pooling, sparsity regularization, and part pooling. Unsupervised learning plays a pivotal role in making good tradeoffs between invariance and selectivity, and between specificity and diversity. We also showed that the data-driven pipeline can always be helped by the model-based pipeline. However, the reverse may introduce noise when the amount of training data is scarce. DMD makes proper adjustments to make model-based and data-driven complement each other and achieve good performance.

Besides perceptual features that can be directly extracted from images, researchers have also considered camera parameters, textual information surrounding images, and social metadata, which can be characterized as contextual features. Nevertheless, perceptual feature extraction remains a core research topic of computer vision and image processing. Contextual features can complement image content but cannot substitute for perceptual features. How to combine context and content belongs to the topic of multimodal fusion, which we address in Chaps. 7, 8, and 9.