1 Introduction

One of the most extensively studied problems in computer vision is finding correspondences between images using local feature descriptors, which embed local image patches into vectors. Compared with global features, a local feature represents only part of the image, making it more robust to illumination changes. Recently, local descriptors based on CNN architectures have been shown to significantly outperform handcrafted local descriptors [4, 21, 23], and large datasets are available for training [20, 22]. However, due to distribution discrepancies between datasets, models trained on patches from a training set may not generalize optimally to a testing set, mainly because of variations between domains. For example, patches in the training set may be extracted from images of buildings, while patches in the testing set come mainly from indoor decorations or natural scenery. It is therefore natural to adopt domain adaptation methods to explore the domain-invariant structure shared by the source domain (the labeled training set) and the target domain (the unlabeled testing set).

Recent studies have demonstrated that deep neural networks can learn transferable features and establish knowledge transfer by exploring the invariant factors between different datasets, making features robust to noise [13]. However, most of these studies focus on object recognition, and a systematic study of domain adaptation applied to local descriptors is yet to be done. In this work, we therefore investigate the application of domain adaptation to local descriptors. Our contributions are: (1) we investigate the performance of different CNN based local descriptors combined with the maximum mean discrepancy (MMD) criterion; extensive experiments on the Photo Tour and HPatches datasets show that domain adaptation is effective for local feature descriptors; (2) different from previous domain adaptation methods that focus only on the fully connected layer, we jointly calculate the MMD from both the fully connected layer and the Convolutional layer of the network, considering local descriptors' own traits, which further improves on traditional domain adaptation.

2 Related Work

2.1 Local Descriptors

End-to-end learning of local descriptors based on CNN architectures has been investigated in many studies, showing improvements over state-of-the-art handcrafted descriptors [4, 21, 23]. In [21], the feature layers and the metric layers are learnt in the same network, so a hinge-based loss can be optimized directly on the last metric layer. MatchNet [24] also includes both feature extraction layers and metric layers, using a cross-entropy loss to update the network.

In contrast, [4, 23] directly use the last feature layer as the descriptor of the input patch without training a metric layer, so that the descriptors can be evaluated with traditional criteria. Based on a Siamese network, DeepDesc [4] trains with the L2 distance and adopts a mining strategy to select training samples; however, it requires a large quantity of samples to guarantee its performance. TFeat [23] uses a triplet network to decrease the distance between matching pairs and increase the distance between non-matching pairs. Also based on a triplet loss, L2-Net [26] proposes a progressive sampling method that takes the intermediate layers into consideration.

Another important observation is that multi-scale network architectures can achieve better results compared with single-scale network architectures.

2.2 Domain Adaptation

Transfer learning [19] aims to build models that can cope with the different probability distributions of different domains [3, 8, 10, 16, 19]. Recent studies of deep domain adaptation embed an adaptation layer into the deep network to enhance transferability [5,6,7, 13, 14, 25]. The deep domain confusion network (DDC) by Tzeng et al. [5] uses two CNNs with shared weights, one for the source domain and one for the target domain. The source network is updated by the original task loss, while the discrepancy between the two domains is measured by the MMD metric on the adaptation layer. DDC adapts only a single layer of the network, which may limit the transferability of a multi-layer network. Therefore, Long et al. [13] proposed the deep adaptation network (DAN), which combines multi-layer adaptation with a multi-kernel MMD metric to match the shift between domains. To avoid the mutual influence of layers, the joint adaptation network (JAN) [12], based on a joint maximum mean discrepancy (JMMD) criterion, was proposed to align the shift of the joint distribution across multiple layers of the network. There are also several extensions of DAN that aim to align the distributions of both the classifier and the feature layers. In this work, we only investigate domain adaptation methods based on feature layers.

3 Model

3.1 Maximum Mean Discrepancy (MMD)

In a standard CNN architecture, the features of the last layer transition from general to specific: they become tailored to the source data at the expense of degraded performance on the target task [13]. Hence, to obtain optimal performance after pre-training on the training set, we require the distributions of the fully connected layer features from the source and the target domain to be similar. This can be achieved by adding an MMD term to the original loss function, which bounds the target error by the source error plus a discrepancy measure between the source and the target [18].

MMD is an efficient metric that compares two distributions via a kernel two-sample test [11]. Given samples \(X_S\) and \(X_T\) drawn from the source and the target distribution, the empirical MMD is defined as:

$$\begin{aligned} MMD(X_S,X_T)=\left\| \frac{1}{\left| X_S\right| } \sum _{x_s\in X_S}\varPhi (x_s)-\frac{1}{\left| X_T\right| } \sum _{x_t\in X_T}\varPhi (x_t)\right\| \end{aligned}$$
(1)

where \(\varPhi \) is the feature map of a kernel that embeds the data into a reproducing kernel Hilbert space (RKHS), and the norm is taken in that space; equivalently, MMD is the largest difference in expectations over the functions in the unit ball \(\left\| f\right\| _{H}\le 1\) of the RKHS. This metric compares the distributions of the two domains in a latent space so as to reduce their mismatch. Tzeng et al. [5] and Long et al. [13] later extended MMD to a multi-kernel MMD metric, which enhances the power of the two-sample test while minimizing the Type II error, i.e., the failure to reject a false null hypothesis [13]. The multi-kernel is defined as a weighted combination of several single kernels:

$$\begin{aligned} K\triangleq \left\{ k=\sum _{u=1}^{m} \beta _u k_u : \sum _{u=1}^{m}\beta _u=1, \beta _u \ge 0, \forall u \right\} \end{aligned}$$
(2)

where each \(k_u\) is a single base kernel and the weights \(\left\{ \beta _u\right\} \) are constrained in this way so that the combined kernel remains representative of the distributions being compared. The multi-kernel MMD improves the test power of MMD and leads to better results.
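As an illustration, a minimal multi-kernel MMD estimate over batches of descriptors might look like the following PyTorch sketch. The 128-dimensional inputs, the geometric bandwidth scheme around the mean pairwise distance (in the spirit of [13]), and the uniform kernel weights are all assumptions, not a definitive implementation.

```python
# Minimal multi-kernel MMD sketch (assumptions: 128-dim descriptor batches,
# uniform kernel weights, bandwidths spread around the mean pairwise distance).
import torch


def gaussian_kernel(x, y, bandwidths):
    """Sum of Gaussian kernels exp(-||x - y||^2 / bw) over the given bandwidths."""
    dists = torch.cdist(x, y, p=2) ** 2          # pairwise squared distances
    return sum(torch.exp(-dists / bw) for bw in bandwidths)


def mk_mmd(source, target, num_kernels=3, scale=2.0):
    """Multi-kernel MMD^2 estimate between two feature batches."""
    joint = torch.cat([source, target], dim=0)
    base = torch.cdist(joint, joint, p=2).pow(2).mean().detach()
    bandwidths = [base * scale ** (i - num_kernels // 2) for i in range(num_kernels)]

    k_ss = gaussian_kernel(source, source, bandwidths).mean()
    k_tt = gaussian_kernel(target, target, bandwidths).mean()
    k_st = gaussian_kernel(source, target, bandwidths).mean()
    return k_ss + k_tt - 2 * k_st


if __name__ == "__main__":
    src = torch.randn(64, 128)          # e.g. fc-layer descriptors from the source
    tgt = torch.randn(64, 128) + 0.5    # shifted target features
    print(mk_mmd(src, tgt).item())
```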

3.2 Adaptive Networks

Based on the idea of domain adaptation, we first combine the MMD metric with TFeat [23] to exploit data from both the source and the target domain. Figure 1 illustrates the combined model. TFeat (Fig. 1, left) is a typical CNN based local descriptor. It is comprised of two Convolutional layers and one fully connected layer, each followed by a \(tanh\) activation. The objective function of TFeat is:

$$\begin{aligned} \lambda (\delta ^+,\delta ^-)=max(0,\mu +\delta ^+ - \delta ^-) \end{aligned}$$
(3)

where \(\delta ^+=\left\| Net(x^+)-Net(x)\right\| _2\) is the L2 distance between the matching pair \((x^+,x)\), \(\delta ^-=\left\| Net(x^-)-Net(x)\right\| _2\) is the L2 distance between the non-matching pair \((x^-,x)\), and \(\mu \) is a constant margin. The objective pushes towards \(\delta ^->\mu +\delta ^+\), so the distance between non-matching pairs becomes larger and the distance between matching pairs becomes smaller. A minimal sketch of such a network and loss is given below.
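The following sketch shows a TFeat-style descriptor network together with the hinge-based triplet loss of Eq. (3). The 32x32 grayscale input, the 128-dimensional output, the exact layer sizes and the max-pooling after the first Convolutional layer are assumptions borrowed from common TFeat implementations, not prescribed by this text.

```python
# TFeat-style network and hinge-based triplet loss sketch (layer sizes assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TFeatSketch(nn.Module):
    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7), nn.Tanh(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=6), nn.Tanh(),
        )
        self.fc = nn.Sequential(nn.Linear(64 * 8 * 8, out_dim), nn.Tanh())

    def forward(self, x):                       # x: (B, 1, 32, 32) grayscale patches
        f = self.features(x).flatten(1)
        return self.fc(f)


def triplet_hinge_loss(net, anchor, positive, negative, margin=1.0):
    """max(0, mu + delta_plus - delta_minus), averaged over the batch."""
    d_pos = F.pairwise_distance(net(anchor), net(positive))   # delta^+
    d_neg = F.pairwise_distance(net(anchor), net(negative))   # delta^-
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()
```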

Fig. 1. Left: the original TFeat network, comprised of two Convolutional layers (blue) and one fully connected layer (green). Right: the modified adaptive model. (Color figure online)

In previous studies, deep networks are pre-trained on ImageNet [17]; since our network is rather shallow, we only pre-train TFeat on the original training sets. We then fix the Convolutional layers and update the fully connected layer with the new loss function,

$$\begin{aligned} L = L_C + \lambda MMD(X_S,X_T) \end{aligned}$$
(4)

where \(L_C\) is the original loss function \(\lambda (\delta ^+,\delta ^- )\) of Eq. (3), the MMD term measures the discrepancy between the training set and the testing set, and \(\lambda > 0\) is a penalty parameter that balances the task-specific loss against the discrepancy between the two domains. As pointed out by Gretton et al. [1], the kernel choice is important for the test power of MMD because different kernels map the probability distributions into different RKHSs. We therefore adopt the multi-kernel MMD for local descriptor learning.
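A possible way to compute the combined objective of Eq. (4) is sketched below; it reuses the hypothetical `triplet_hinge_loss` and `mk_mmd` helpers from the sketches above, and the default \(\lambda =0.01\) is only illustrative.

```python
# Sketch of Eq. (4): task loss on labeled source triplets plus weighted MMD
# between source and target fc-layer features (helpers defined in earlier sketches).
def adapted_loss(net, src_anchor, src_pos, src_neg, tgt_patches, lam=0.01):
    task_loss = triplet_hinge_loss(net, src_anchor, src_pos, src_neg)   # L_C
    src_feat = net(src_anchor)          # fc-layer descriptors, source domain
    tgt_feat = net(tgt_patches)         # fc-layer descriptors, unlabeled target
    return task_loss + lam * mk_mmd(src_feat, tgt_feat)
```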

3.3 Joint Adaptation of the Fully Connected Layer and the Convolutional Layer

In [21], it is pointed out that it is important to also use information from the first layer of the network. Therefore, we modify the MMD loss so that features from the first Convolutional layer also enter the MMD metric,

$$\begin{aligned} L = L_C + \lambda \varphi (MMD_{fc}(X_S,X_T),MMD_{cov}(X_S,X_T)) \end{aligned}$$
(5)

where \(\varphi (a,b)\) combines the MMD losses from the fully connected layer and the first Convolutional layer.

There are two ways to train the network with multi-layer MMD; a sketch of both is given below. On the one hand, we could define \(\varphi (a,b)=a+b\), i.e., directly add up the MMD terms of the two layers, as Fig. 2(a) illustrates. On the other hand, as [12] points out, separate adaptation of different layers exerts a mutual influence on the conditional distribution of each layer; therefore \(\varphi (a,b)\) can be defined as \(\varphi (a,b)=a*b\), where \(*\) denotes the joint formulation over the features of the two layers. This modified version is illustrated in Fig. 2(b).
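The two combination strategies could be implemented roughly as follows. The summed variant adds the per-layer MMD terms, while the joint variant follows the JMMD idea of [12] by multiplying the per-layer kernel matrices elementwise before forming the statistic; the fixed bandwidths and the `gaussian_kernel`/`mk_mmd` helpers from the earlier sketch are assumptions.

```python
# Two ways to realize phi(a, b): plain sum vs. a JMMD-style joint statistic [12].
def summed_mmd(src_fc, tgt_fc, src_conv, tgt_conv):
    return mk_mmd(src_fc, tgt_fc) + mk_mmd(src_conv, tgt_conv)


def joint_mmd(src_fc, tgt_fc, src_conv, tgt_conv, bandwidths=(1.0, 2.0, 4.0)):
    # Elementwise product of per-layer kernel matrices models the joint
    # distribution over (fc, conv) features; bandwidths here are placeholders.
    k_ss = (gaussian_kernel(src_fc, src_fc, bandwidths)
            * gaussian_kernel(src_conv, src_conv, bandwidths)).mean()
    k_tt = (gaussian_kernel(tgt_fc, tgt_fc, bandwidths)
            * gaussian_kernel(tgt_conv, tgt_conv, bandwidths)).mean()
    k_st = (gaussian_kernel(src_fc, tgt_fc, bandwidths)
            * gaussian_kernel(src_conv, tgt_conv, bandwidths)).mean()
    return k_ss + k_tt - 2 * k_st
```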

Fig. 2. Two architectures for applying the MMD loss

3.4 Dimension Reduction

In [1], it is shown that high dimensionality decreases the power of MMD to detect discrepancies between distributions. Given the high dimensionality of the Convolutional feature maps, we need to reduce their dimension before calculating the MMD metric. For simplicity, we use average pooling. As the dataset (Fig. 3) shows, location information within patches is less important for domain adaptation, since different subsets contain completely different scenes. Therefore, for dimension reduction we adopt average pooling, which yields a smoother distribution of pixel intensity over the patch.
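A minimal sketch of this reduction step is shown below; `net.features` refers to the hypothetical `TFeatSketch` module above, and the pooling size is a tunable assumption (cf. Table 4).

```python
# Average-pool the first-layer feature maps before computing the conv-layer MMD.
def pooled_conv_features(net, patches, pool_size=4):
    fmap = net.features(patches)                         # (B, C, H, W) conv features
    pooled = F.avg_pool2d(fmap, kernel_size=pool_size)   # smooth spatial detail away
    return pooled.flatten(1)                             # (B, C * H' * W') for MMD
```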

4 Experiment

We combine the CNN based local descriptors with the MMD metric, focusing on the performance improvement that domain adaptation can offer.

4.1 Photo Tour Dataset

The Photo Tour dataset [20] is a standard benchmark for patch training and testing. It consists of around 1M patches from each of three distinct scenes, which we treat as three subsets: Notredame (N, grand building), Liberty (L, statue), and Yosemite (Y, natural park). Each sample consists of two patches and a label that indicates whether they are a matching pair (label = 1) or a non-matching pair (label = 0). Figure 3 illustrates the structure of the dataset with pairs of patches and their labels from the three subsets. For each learning task, we take one subset as the training set and another as the testing set, giving six subset combinations. We evaluate the domain adaptation performance on the six learning tasks N \(\rightarrow \) L, N \(\rightarrow \) Y, L \(\rightarrow \) N, L \(\rightarrow \) Y, Y \(\rightarrow \) N, Y \(\rightarrow \) L (training set \(\rightarrow \) testing set).

We report FPR95, the false positive rate at the distance threshold where 95% of the matching pairs are correctly accepted.
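For concreteness, the metric could be computed as in the sketch below, which assumes an array of descriptor distances and binary labels (1 = matching pair); names are illustrative.

```python
# FPR95 sketch: false positive rate at the threshold that accepts 95% of matches.
import numpy as np


def fpr95(distances, labels):
    distances = np.asarray(distances)
    labels = np.asarray(labels)
    pos = np.sort(distances[labels == 1])
    # Threshold accepting 95% of matching pairs (smaller distance = match).
    threshold = pos[int(0.95 * len(pos)) - 1]
    neg = distances[labels == 0]
    return float(np.mean(neg <= threshold))
```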

Fig. 3. Photo Tour dataset examples

4.2 HPatches Dataset

The HPatches dataset [22] is a standard benchmark for patch testing. It consists of around 2M patches from 116 scenes. The benchmark evaluates local descriptors on three tasks: patch verification, image matching and patch retrieval. We evaluate domain adaptation by training the networks on the Photo Tour dataset and testing on HPatches.

4.3 Evaluation Protocol

For the Photo Tour dataset, we evaluate performance following the protocol below.

TFeat Network. Following the original procedure in [23], we first extract 5M triplets from the training set and pre-train the network, keeping the model with the best result over the training epochs. We then use 5M labeled triplets from the training set and 5M randomly selected unlabeled triplets from the testing set to update the fully connected layer of the pre-trained network with the new loss function, keeping the Convolutional layers fixed, and evaluate the descriptors with FPR95. For the joint adaptation of the fully connected layer and the Convolutional layer, we update the whole network after pre-training. A sketch of this fine-tuning stage is given below.
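The adaptation stage could be organized roughly as follows; the data-loader interfaces, how the unlabeled target batches are formed, and the \(\lambda \), learning-rate and epoch values are assumptions, and `adapted_loss` refers to the earlier sketch.

```python
# Freeze the Convolutional layers and update only the fc layer with Eq. (4).
def adapt_fc_layer(net, source_loader, target_loader, lam=0.01, lr=0.1, epochs=10):
    for p in net.features.parameters():      # keep the Convolutional layers fixed
        p.requires_grad = False
    optimizer = torch.optim.SGD(net.fc.parameters(), lr=lr, momentum=0.9)

    for _ in range(epochs):
        # source_loader yields labeled (anchor, positive, negative) triplets,
        # target_loader yields unlabeled target patches (an assumed interface).
        for (a, p, n), tgt in zip(source_loader, target_loader):
            loss = adapted_loss(net, a, p, n, tgt, lam=lam)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return net
```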

Fig. 4. Siamese network (Color figure online)

Siamese Network. The Siamese network [21] (Fig. 4) is another typical CNN based local descriptor. It consists of three Convolutional layers (blue), two max-pooling layers (red) and two fully connected layers (green), and the output of the last layer is a score indicating whether the two patches match. Compared with TFeat, the Siamese network is trained with matching and non-matching pairs instead of triplets, and its objective function is a hinge-based loss. For adaptation, it follows the protocol above.
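A structural sketch of such a Siamese matcher is given below: shared Convolutional branches followed by fully connected metric layers that output a single matching score. All layer sizes, kernel sizes and the 32x32 input are illustrative assumptions rather than the exact architecture of [21].

```python
# Siamese matching network sketch: shared conv branches + fc metric layers.
import torch
import torch.nn as nn


class SiameseSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
        )
        self.metric = nn.Sequential(
            nn.Linear(2 * 128 * 3 * 3, 256), nn.ReLU(),
            nn.Linear(256, 1),                       # single matching score
        )

    def forward(self, patch_a, patch_b):             # two (B, 1, 32, 32) patches
        fa = self.branch(patch_a).flatten(1)         # shared weights on both patches
        fb = self.branch(patch_b).flatten(1)
        return self.metric(torch.cat([fa, fb], dim=1)).squeeze(1)
```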

4.4 Parameters

When using multi-kernel MMD with a family of m Gaussian kernels \(\{k_u\}_{u=1}^{m}\), we mainly follow the procedure in [13] to set the varying bandwidths \(\gamma _u\). We use stochastic gradient descent (SGD) with momentum 0.9; the learning rate starts at 0.1 and is gradually decreased.
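A minimal optimizer setup matching these stated settings might look as follows; the step size and decay factor of the schedule, and the `train_one_epoch`/`num_epochs` names, are hypothetical since the text does not specify them.

```python
# SGD with momentum 0.9, initial lr 0.1, gradually decayed (decay schedule assumed).
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(num_epochs):
    train_one_epoch(net, optimizer)   # hypothetical training-epoch helper
    scheduler.step()
```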

5 Results

5.1 Performance Changes on \(\lambda \) Variation

On the TFeat network, we first investigate the effect of the parameter \(\lambda \). Table 1 shows how the error rate varies for \(\lambda \in \left\{ 0.005, 0.008, 0.01, 0.02\right\} \) on the N \(\rightarrow \) task, with the number of MMD kernels set to 3. As \(\lambda \) varies, the error rate first decreases and then increases, forming a curve with a clear minimum. This shows that it is important to balance learning task-specific deep features against adapting to the target domain.

Table 1. Performance changes on \(\lambda \) variation.

5.2 Domain Adaptation on Photo Tour Dataset with Fully Connected Layer

For ease of implementation, we set \(\lambda \) to 0.01 for all tasks; the results in Table 2 show that the error rate can still be reduced effectively even though this setting is not necessarily optimal for every task. This demonstrates that MMD can effectively transfer features across domains and further boost the performance of our networks.

Table 2. Results of six learning tasks combining local descriptors with domain adaptation. The first row shows the original results and the second row shows results after domain adaptation.

5.3 Domain Adaptation on HPatches Dataset with Fully Connected Layer

In [22], experiments show that the TFeat network achieves comparatively strong results. We therefore tested the TFeat network on HPatches after domain adaptation. As Fig. 5 shows, all three tasks gain about a 2% increase, which demonstrates the effectiveness of domain adaptation. Moreover, verification within the same sequence and matching under illumination changes gain larger increases; domain adaptation thus has more influence on the harder tasks.

Fig. 5. HPatches evaluation: for each task, '+' denotes the result after domain adaptation. The mean values before and after domain adaptation are shown on the right; all mean values increase by about 2%. 'differ' and 'same' stand for verification on different and same sequences; 'view' and 'illum' denote matching under viewpoint or illumination changes.

5.4 Multi-layer Adaptation

Previous work on domain adaptation considers only the fully connected layers. Given local descriptors' own traits, in particular that the first Convolutional layer contains important information, we add the Convolutional layer to the layer adaptation. Table 3 compares different ways of combining the layers. First, we adopt the traditional approach of updating only the fully connected layer. Then, we simply sum the MMD losses from the fully connected layer and the first Convolutional layer; the results show that the error rate is reduced thanks to the extra information the first layer offers. However, as [12] points out, updating the earlier layers changes the distribution of the following layers, so we also calculate the joint MMD loss following [12], which yields a further improvement.

Table 3. Error rates with different ways of combining MMD losses from fully connected and Convolutional layers

5.5 Performance with Respect to Dimension Reduction

First, we run several average pooling experiments to study how performance varies with the pooling size. As Table 4 shows, the error rate first decreases and then increases, which means there is a balance between reducing the dimension and keeping enough feature information.

Table 4. Error rates with different average pooling size. The first row shows the final dimension of the first layer after pooling.
Fig. 6. The changes of feature maps with three ways of adaptation

6 Discussion

6.1 First Layer Filter Visualization

For the original patch shown on the right, Fig. 6 shows how the feature maps change under the three ways of adaptation (TFeat-fc, TFeat-(fc+cov), TFeat-(fc*cov)). From the first-layer filter visualization, we can see that the features change strongly from fc adaptation to fc+cov adaptation, while there are only small changes from fc+cov adaptation to fc*cov adaptation. This shows that the different ways of domain adaptation indeed influence the features of the first Convolutional layer.

Fig. 7. t-SNE visualization of deep features before and after domain adaptation

6.2 t-SNE Visualization

As the t-SNE visualization [15] (Fig. 7) shows, features from the source (blue) and the target (red) become more compact and mixed with each other after adaptation, whereas most of the original source features lie outside the target features.

7 Conclusion

In this work, we investigate the application of domain adaptation to local descriptors. The experimental results show that domain adaptation methods can further enhance the performance of CNN architecture based local descriptors. The results also demonstrate that it is important to jointly use information from both the first layer and the last layer when calculating the MMD loss. It remains interesting to consider how to reduce the high dimensionality of the Convolutional layer features so that the joint distribution can be better learnt during domain adaptation. Meanwhile, deeper architectures may further boost the performance.