Unsupervised natural image patch learning

A metric for natural image patches is an important tool for analyzing images. An efficient means of learning one is to train a deep network to map an image patch to a vector space, in which the Euclidean distance reflects patch similarity. Previous attempts learned such an embedding in a supervised manner, requiring the availability of many annotated images. In this paper, we present an unsupervised embedding of natural image patches, avoiding the need for annotated images. The key idea is that the similarity of two patches can be learned from the prevalence of their spatial proximity in natural images. Clearly, relying on this simple principle, many spatially nearby pairs are outliers. However, as we show, these outliers do not harm the convergence of the metric learning. We show that our unsupervised embedding approach is more effective than a supervised one or one that uses deep patch representations. Moreover, we show that it naturally lends itself to an efficient self-supervised domain adaptation technique onto a target domain that contains a common foreground object.


INTRODUCTION
Humans can easily understand what they see at different regions in an image, or tell whether two regions are similar or not. However, despite recent progress, such forms of image understanding remain extremely challenging. One way to address image understanding takes inspiration from the ability of human observers to understand image contents, even when viewing through a small observation window. Image understanding can be formalized as the ability to encode the contents of small image patches into representation vectors. To keep such encodings generic, they are not predetermined by certain classes, but instead aim to project image patches into an embedding space, where Euclidean distances correlate with general similarity among image patches. As natural patches form a low dimensional manifold in the space of patches [1], [2], such an embedding of image patches enables various image understanding and segmentation tasks. For example, semantic segmentation reduces to a simple clustering technique based on l2 distances.
The key insight of our work is that such an embedding of image patches can be trained by a neural network in an unsupervised manner. Using semantic annotations allows a direct sampling of positive and negative pairs of patches that can be embedded using a triplet loss [3]. However, data labeling is laborious and expensive. Therefore, only a tiny fraction of the images available online can be utilized by supervised techniques, which necessarily limits the extent of the learning. An unsupervised embedding can also be based on deep patch representations that are learned indirectly by the network, e.g., [4]. However, as we show, explicitly training the network for an embedding can achieve significantly higher performance.
In this work, we introduce an unsupervised patch embedding method, which analyzes natural image patches to define a mapping from a patch to a vector, such that the Euclidean distance between two vectors reflects their perceptual similarity. We observe that the similarity of two patches in natural images correlates with their spatial distance. In other words, patches of coherent or semantic segments tend to be spatially close, hence forming a surprisingly simple but strong correlation between patch similarity and spatial distance. Clearly, not all neighboring patches are similar (see Figure 2). However, as we shall show, these dissimilar close patches are rare enough and uncorrelated, so they introduce only insignificant noise that does not prevent learning.
Our embedding yields deep images, as each patch is mapped to a 128D vector by a deep network. See the visualization of the deep images in the second and fourth rows of Figure 1, obtained by projecting the 128D vectors onto their three principal directions, producing pseudo-RGB images where similar colors correspond to similar embedded points. Using our embedding technique, we further present a domain specialization method. Given a new domain that contains a common foreground object, using self-supervision, we refine the initial embedding results for the specific domain to yield a more accurate embedding.

Fig. 2: Learning patch similarity from spatial distances. The premise is that two patches sampled from the same swatch (colored in red) are more likely to be similar to each other than to a patch sampled in a distant one (colored in blue).
We use a Convolutional Neural Network (CNN) to learn a 128-dimensional embedding space. We train the network on 2.5M natural patches with a triplet-loss objective function. Section 3 explains our embedding framework in detail. Section 4 describes our domain adaptation technique to a target domain that contains a common foreground object. In Section 5, we show that the patch embedding space learned using our method is more effective than embedding spaces that were learned with supervision or those based on handcrafted features or deep patch representations. We further show that fine-tuning the network to a specific domain using self-supervision increases performance further.

RELATED WORK
Our work is closely related to dimensionality reduction and embedding techniques, image patch representation, transfer learning and neural network based optimization. In the following, we highlight directly relevant research.
Image patches can be treated as a collection of partial objects with different textures. Julesz introduced textons [5] as a way to represent texture via second order statistics of small patches. Various filter banks can be used for texture representation [6], e.g., Gabor filters [7]. Also, hierarchical filter responses were used with great success for texture synthesis [8], [9]. All these filters are fixed and not learned from data. In contrast, we learn the embedding by analyzing the distribution of all natural patches, thus avoiding the bias of handcrafted features.
The idea of representing a patch by its pixel values (without attempting dimensionality reduction) had success in various applications [10]; see Barnes and Zhang [11] for a survey. In Section 5, we compare our method against a raw pixel descriptor.
Žbontar and LeCun [12] train a CNN to do stereo matching on image patches. Simo-Serra et al. [13] learn SIFT-like descriptors using a Siamese network. Both of these methods focus on invariance to viewpoint changes, whereas we aim to learn invariance to fluctuations in patch appearance of similar objects.
PatchNet [14] introduces a compact and hierarchical representation of image regions. It uses raw L*a*b* pixel values to represent patches, which we compare against in Section 5. PatchTable [15] proposes an efficient approximate nearest neighbor (ANN) implementation. ANN is an orthogonal and complementary task to patch representation.
Recently, deep networks were used for image region representation and segmentation. Cimpoi et al. [16] use the last convolution layer of a convolutional neural network (CNN) as an image region descriptor. This descriptor is not suitable for patch representation, as it produces a 65K-dimensional vector per patch. Fully Convolutional Networks (FCNs) [17] prove potent for, e.g., image segmentation. We compare to FCNs in Section 5.
Our work is based on Patch2Vec [3], which also uses deep networks to train a meaningful patch representation. However, in contrast to our method, Patch2Vec is a supervised method that requires an annotated segmentation dataset for training.
The ideas of using spatial proximity in image space and temporal proximity in videos have been utilized in the past for self-supervised learning. Isola et al. [18] utilize space and time co-occurrences to learn patch, frame and photo affinities. Wang and Gupta [19] track objects in videos to generate data for a self-supervised learning scheme. Closer to our method, Doersch et al. (UVRL) [4] train a network to predict the spatial relationship between pairs of patches, and use the patch representation to group similar visual concepts. Pathak et al. [20] train a network to predict missing content based on its spatial surrounding. These methods learn the patch representation while training the network for a different task, and the embedding is provided implicitly. In our work, the network is directly trained for patch embedding. We compare our method against UVRL in Section 5.
Given a labeled set in a source domain and an unlabeled set of samples in a target domain, domain adaptation aims to generalize the classifier learned on the source domain to the target domain [21], [22]. It has become common practice to pre-train a classifier on a large labeled image database, such as ImageNet [23], and transfer the parameters to a target domain [24], [25]. See Patel et al. [26] for a survey of recent visual techniques. In our work, we refine our embeddings from the natural image source domain to a target domain that contains a common object. Unlike recent unsupervised domain adaptation techniques [27], [28], in our case neither domain contains labeled data.

PATCH SPACE EMBEDDING
In this work, we take advantage of the fact that there is a strong coherence in the appearance of semantic segments in natural images. It is therefore expected that nearby patches have similar appearance. The correlation between spatial proximity and appearance similarity is learned and encoded in a patch space, where the Euclidean distance between two patches reflects their appearance similarity.
The embedding patch space is learned by training a neural network using a triplet loss over triplets (p_c, p_n, p_f) of patches of size 16 × 16 selected from a collection of natural images, such that p_c is the current patch, p_n is a nearby patch, and p_f is a distant patch; m is the margin value of the loss (set empirically to 0.2).

Fig. 3: Network loss convergence. The graph demonstrates the losses on the training and test data (in yellow and blue, respectively). As can be observed, the loss function is not completely stable due to the presence of outlier swatches. Nonetheless, the learning converges for both sets (starting from a loss of around 0.22, down to 0.07), which demonstrates the network's resilience to outliers.
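As a hedged illustration of a triplet loss of this kind, the sketch below assumes the common hinge form max(0, ||f(p_c) - f(p_n)||^2 - ||f(p_c) - f(p_f)||^2 + m) with squared Euclidean distances. The squared-distance choice and the function name `triplet_loss` are assumptions made for illustration; they are not taken from the text above.

```python
import numpy as np

def triplet_loss(f_c, f_n, f_f, margin=0.2):
    """Hinge-style triplet loss on batches of embeddings of shape (B, D).

    Assumption: squared Euclidean distances, as in common triplet-loss
    formulations; the exact distance form is not confirmed by the text.
    """
    d_near = np.sum((f_c - f_n) ** 2, axis=1)  # distance to the nearby patch
    d_far = np.sum((f_c - f_f) ** 2, axis=1)   # distance to the distant patch
    # The loss is zero once the distant patch is farther than the nearby
    # patch by at least the margin.
    return np.maximum(0.0, d_near - d_far + margin)
```

A triplet contributes to training only while its loss is positive, which is what makes the margin act as a separation target between nearby and distant patches.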
To train our network, we utilize a large number of natural images (5000 images from the MIT-Adobe FiveK Dataset, in our implementation), and for each image we sample six disjoint regions, referred to as swatches. Each swatch consists of a grid of nine patches. A triplet is formed by randomly picking two patches from one swatch, and one from another swatch. The assumption is that the two patches taken from the same swatch are close enough, while the third is distant. In our implementation, the distant patch is always taken from the same image. The above scheme for sampling triplets is illustrated in Figure 2, where only two swatches are illustrated, one in red and one in blue. A triplet is formed by sampling two positive patches from the red swatch, and one negative patch from the blue one. Furthermore, we adopt the principle described in [3] that selects the "hard" examples, i.e., in each epoch, we use triplets on which the network has so far performed poorly. The resulting set of hard examples, N, contains distant patches that the network embedded within the margin m. The network f(p) is trained to create an embedding space that conforms to the training triplets. Once trained, f(p) can embed any given patch by feed-forwarding it through the network, yielding its 128D feature vector.
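The sampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: swatches are taken as 3 × 3 grids of 16 × 16 patches aligned to a coarse grid (which guarantees disjointness), and a triplet is treated as "hard" while its loss is still positive. The placement strategy, the hard-example criterion, and all function names are hypothetical, since the text does not specify them.

```python
import numpy as np

PATCH = 16             # patch side, as in the text
GRID = 3               # assumed: a swatch is a 3x3 grid of patches (nine patches)
SWATCH = PATCH * GRID  # swatch side in pixels

def sample_swatches(image, n_swatches=6, rng=None):
    """Pick disjoint swatch origins as distinct cells of a coarse grid.

    Assumption: grid-aligned placement; the text only says "disjoint".
    """
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = image.shape[0] // SWATCH, image.shape[1] // SWATCH
    if rows * cols < n_swatches:
        raise ValueError("image too small for the requested number of swatches")
    cells = rng.choice(rows * cols, size=n_swatches, replace=False)
    return [((c // cols) * SWATCH, (c % cols) * SWATCH) for c in cells]

def patch_at(image, origin, index):
    """Return patch `index` (0..8) of the swatch whose top-left is `origin`."""
    y = origin[0] + (index // GRID) * PATCH
    x = origin[1] + (index % GRID) * PATCH
    return image[y:y + PATCH, x:x + PATCH]

def sample_triplet(image, origins, rng=None):
    """Two distinct patches from one swatch, one patch from another swatch."""
    rng = np.random.default_rng() if rng is None else rng
    near_sw, far_sw = rng.choice(len(origins), size=2, replace=False)
    i_c, i_n = rng.choice(GRID * GRID, size=2, replace=False)
    i_f = rng.integers(GRID * GRID)
    p_c = patch_at(image, origins[near_sw], i_c)
    p_n = patch_at(image, origins[near_sw], i_n)
    p_f = patch_at(image, origins[far_sw], i_f)
    return p_c, p_n, p_f

def hard_triplet_mask(d_near, d_far, margin=0.2):
    """Assumed criterion: a triplet is 'hard' if the distant patch is not yet
    farther than the nearby patch by at least the margin."""
    return d_far < d_near + margin
```

Because both swatches come from the same `origins` list, the distant patch is always drawn from the same image as the current and nearby patches, matching the description above.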
To cope with outliers, we incorporate a strong regularization into the network. The embedding is constrained to lie on the unit hypersphere, which prevents overfitting. The unit hypersphere provides a structure to the embedding space that is otherwise unbounded.
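One common way to place embeddings on the unit hypersphere is to L2-normalize the network output. The sketch below assumes this is done as a final normalization step; the function name `l2_normalize` and the `eps` guard are illustrative choices, not details from the text.

```python
import numpy as np

def l2_normalize(embeddings, eps=1e-12):
    """Scale each row of an (N, D) array to unit Euclidean norm.

    `eps` avoids division by zero for an all-zero row (an assumption made
    for numerical safety).
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)
```

For unit vectors a and b, the squared distance is 2 - 2 a·b, so it is bounded by 4. This is the sense in which the hypersphere bounds an otherwise unbounded embedding space.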
The architecture of our network is similar to the one used in [3], but with the changes required to support a 16 × 16 patch size (see Figure 5 for the network illustration). Note that inception layers are implemented as detailed in Szegedy et al. [29].
We train the network for 1600 epochs on an NVIDIA GTX 1080. Training takes approximately 24 hours. The network convergence is demonstrated in Figure 3. As the figure illustrates, the losses on the training and test sets (colored in yellow and blue, respectively) are similar. This implies that our basic assumption holds and generalizes well. Furthermore, although the learning converges, the convergence is not completely stable. This may be attributed to the presence of outliers in the swatches, i.e., two patches from the same swatch but not from the same segment, or two patches from different swatches but from the same segment.

DOMAIN SPECIALIZATION
In Section 3, we described an unsupervised technique to embed any natural image patch onto a 128D vector. Given a new domain that contains a common foreground object, we can improve the embedding by fine-tuning the network, or simply training it on patches taken from the new domain. However, we can do better by using the initial embedding obtained by the previously described method to generate a preliminary segmentation. We can then use these rough segments to "supervise" the refined embedding.
To generate the rough segments, the images are first transformed using the patch embedding such that each pixel is mapped to a 128D vector. Next, we apply k-means clustering on each of the deep images using k = 4, followed by a graph-cuts segmentation [30] (see the third row in Figure 4).
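The clustering step can be sketched as below. This is a plain NumPy k-means over the per-pixel embedding vectors of one deep image; the graph-cuts refinement [30] is not included. The deterministic farthest-point initialization, the iteration count, and the function name `kmeans_labels` are assumptions for illustration only.

```python
import numpy as np

def kmeans_labels(deep_image, k=4, n_iter=20):
    """Cluster the pixels of an (H, W, D) deep image; return an (H, W) label map."""
    h, w, d = deep_image.shape
    x = deep_image.reshape(-1, d).astype(np.float64)

    # Assumed initialization: start from the first pixel, then repeatedly
    # add the pixel farthest from all centers chosen so far.
    centers = [x[0]]
    for _ in range(1, k):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d2.argmax()])
    centers = np.stack(centers)

    for _ in range(n_iter):
        # Squared distances from every pixel to every center: shape (H*W, k).
        # Note: this materializes an (H*W, k, D) array, so it suits small images.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels.reshape(h, w)
```

The resulting label map is only a rough segmentation; in the pipeline described above it would still be passed through a spatial regularization step before being used as supervision.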
These segments are then used as supervision for fine-tuning the network, where the triplets are defined based on these segments, i.e., p_c and p_n are taken from the same foreground segment, and p_f is a patch taken from any other segment in the image.
In our experiments, we execute the fine-tuning process for only 400 epochs. This process improves our embedding space and makes it much more coherent (see Table 2 and Figure 8).
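A segment-guided triplet sampler along these lines might look as follows. The sketch assumes that a patch belongs to the segment of its center pixel and that the foreground label is supplied by the caller; neither detail is specified in the text, and the function name `sample_segment_triplet` is hypothetical.

```python
import numpy as np

def sample_segment_triplet(labels, fg_label, patch=16, rng=None):
    """Return top-left corners (y, x) of p_c and p_n (foreground segment)
    and p_f (any other segment) for an (H, W) label map.

    Assumption: a patch is assigned to the segment of its center pixel.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = labels.shape
    half = patch // 2
    # Entry [y, x] holds the label at the center of the patch whose
    # top-left corner is (y, x), for every patch that fits in the image.
    centers = labels[half:h - patch + half + 1, half:w - patch + half + 1]
    fg = np.argwhere(centers == fg_label)
    other = np.argwhere(centers != fg_label)
    if len(fg) < 2 or len(other) < 1:
        raise ValueError("not enough patch positions to form a triplet")
    i_c, i_n = rng.choice(len(fg), size=2, replace=False)
    i_f = rng.integers(len(other))
    return tuple(fg[i_c]), tuple(fg[i_n]), tuple(other[i_f])
```

The returned corners can then be used to crop the three patches from the image, in the same way as for the swatch-based triplets.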

RESULTS
We performed quantitative and qualitative evaluations to analyze the performance of our embedding technique. The quantitative evaluation was conducted on ground truth images from the Berkeley Segmentation Dataset (BSDS500) [31], which contains natural images that span a wide range of objects, as well as images from object-specific internet datasets of Rubinstein et al. [32]. These object-specific datasets further enabled a quantitative evaluation of our domain specialization technique.
To quantitatively demonstrate our improved performance over previous works, we follow the measure reported by Fried et al. [3]. We start by sampling "same segment" and "different segment" pairs of patches and calculate their distance in the embedding space. Next, for a given distance threshold, we predict that all pairs below the threshold are from the same segment, and evaluate the prediction (for all threshold values) by calculating the area under the receiver operating characteristic (ROC) curve. Table 1 contains the full comparison. Notice that [3] is supervised, requiring an annotated segmentation dataset. The comparison to raw RGB pixels provides a more intuitive baseline. On the other hand, the accuracy of a human annotator (bottom row in Table 1) demonstrates the ambiguity of the problem and a level of accuracy that can be considered ideal.

| Method | Accuracy | Unsupervised |
| Raw pixels (RGB) | 0.69 | yes |
| UVRL [4] | 0.70 | yes |
| Patch2Vec [3] | 0.76 | - |
| Ours | 0.78 | yes |
| Human | 0.86 | - |

TABLE 1: Patch Embedding Evaluation. We compare our method to alternative patch representations. We report the AUC scores, using the l2 distance between patch representations as a means to predict whether a pair of patches comes from the same segment or not.
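The AUC measure described above can be computed without sweeping thresholds explicitly, because the area under the ROC curve equals the probability that a random "same segment" pair is closer than a random "different segment" pair, with ties counted as one half. The sketch below uses this equivalence; the function name `pair_auc` is illustrative.

```python
import numpy as np

def pair_auc(dist_same, dist_diff):
    """AUC for predicting 'same segment' from small embedding distances.

    dist_same: distances of pairs known to share a segment.
    dist_diff: distances of pairs known to come from different segments.
    """
    dist_same = np.asarray(dist_same, dtype=np.float64)
    dist_diff = np.asarray(dist_diff, dtype=np.float64)
    # Compare every same-segment pair against every different-segment pair.
    less = (dist_same[:, None] < dist_diff[None, :]).mean()
    ties = (dist_same[:, None] == dist_diff[None, :]).mean()
    return less + 0.5 * ties
```

The pairwise comparison needs memory proportional to the product of the two sample sizes, so for large samples a rank-based computation would be preferable.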
To qualitatively visualize the quality of our embeddings, as previously detailed, we project the 128D vectors onto their three principal directions, which produces pseudo-RGB images where similar colors correspond to similar embedded points. In Figure 6, we visualize our embeddings and compare to the supervised technique of Fried et al. [3] on their training data. As the figure illustrates, our results are more coherent than the ones obtained with supervision, even though our method does not train on these images. In the supplementary material, we provide a comparison for the full BSDS500 dataset. Please refer to that comparison to assess the quality of our results. In Figure 7, we compare to Doersch et al. [4], where the patch representation can also be obtained without supervision. For comparison purposes, we use both their pre-trained weights and the weights retrained on BSDS500. We use their fc6 layer, which performed the best in our tests. Unlike our embeddings, their method does not produce similar embeddings (visualized by similar colors in the figure) for pixels of the same region.
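The pseudo-RGB visualization used in these comparisons can be sketched as a PCA projection of the per-pixel embeddings. The sketch assumes that the principal directions are computed per image and that each projected channel is rescaled to [0, 1] for display; both are illustrative choices, as is the function name `pseudo_rgb`.

```python
import numpy as np

def pseudo_rgb(deep_image):
    """Project an (H, W, D) deep image (D >= 3) onto its top three principal
    directions and rescale each channel to [0, 1]."""
    h, w, d = deep_image.shape
    x = deep_image.reshape(-1, d).astype(np.float64)
    x = x - x.mean(axis=0)  # center the data before PCA
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T     # shape (H*W, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(h, w, 3)
```

Pixels with identical embeddings receive identical colors, and nearby embeddings receive nearby colors, which is what makes coherent regions visible in such images.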
To evaluate our domain specialization technique, we fine-tune our network in two ways. In the first, we retrain the weights by simply training the network on patches taken from the object-specific datasets of Rubinstein et al. [32]. The second is the one we describe in Section 4, where we use self-supervision to refine the results in the new domain.
In Table 2, we report the AUC scores in both settings. Since the ground truth for these datasets contains only foreground-background segmentation (and not a segment for each semantic object), the AUC measure required a slight adjustment. As the background may contain many unrelated segments, we sample "same segment" pairs only from the foreground. As validated in Table 2, our method successfully learns and adjusts to the new domain. Moreover, our self-supervision scheme further boosts the performance.
In Figure 8, we qualitatively demonstrate the improvement over samples belonging to the HORSE, CAR and AIRPLANE datasets; half of them belong to the training set and the other half to the test set. Since we could not tell the samples apart, they are mixed together in the figure. As the figure illustrates, the colors, and thereby the embeddings, of the horses' parts are more compatible and in general more homogeneous. For more results, see the supplementary material.

Fig. 6: The second and third rows show results of Patch2Vec [3] and our unsupervised technique, respectively. The input images, which belong to the training data of [3], are provided in the first row. Note that although our method did not train on these images, the textures are significantly less apparent in our embeddings. This suggests that segments with similar texture are embedded at closer locations in the embedding space.

SUMMARY, LIMITATION, AND FUTURE WORK
We presented an unsupervised patch embedding technique, where the network learns to map natural image patches into 128D codes such that the L2 metric reflects their similarity. We showed that training the network explicitly for embedding with a triplet loss outperforms embeddings that are inferred from deep representations learned for other tasks or designed specifically to learn similarities between patches. Generally speaking, learning to embed by a network has its limitations, as it is applied on the patch level. Feed-forwarding patches through the network is computationally intensive, and analyzing an image as a series of patches is time consuming. Parallel analysis of a multitude of patches, possibly overlapping ones, can significantly accelerate the process.
To refine the performance and transfer the learning into a new domain, we utilize the embedding obtained by the trained network as self-supervision. The embedded image is segmented by a naive method to yield a rough segmentation. As demonstrated, these segments, although imperfect, can successfully supervise the refinement of the network for the given new domain. However, we believe this can be further improved by using more advanced segmentation methods. In the future, we also want to consider conservative segmentation, where the segments may not necessarily cover the entire image, excluding regions with low confidence.
Furthermore, in the future, we would like to utilize our embedding technique to advance segmentation and foreground extraction methods. In particular, we would like to analyze large sets of embedded images, aiming to cosegment the common foreground of a weakly supervised set. We believe that the common foreground object can serve as self-supervision to further improve the embedding performance.

2, the embeddings of the objects (e.g., the horses) are more coherent after the refinement.

Fig. 4: Refining the embedding using self-supervision. Given a new domain that contains a common foreground object (i.e., the input images on the top), we refine our initial embedding (visualized in the second row) by automatically generating semantic guiding segments (colored in unique colors in the third row) for the training images. Our technique yields a more coherent embedding of the common object (visualized in the bottom row).

Fig. 7: Comparison between our embedding and one inferred by deep representations. We compare to UVRL [4] using their pre-trained weights (second row) and also by retraining them on BSDS500 (third row). As demonstrated above, our technique maps pixels from similar regions to closer values.

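TABLE 2: Domain Specialization Evaluation. Above, we report the AUC scores on the object-specific datasets provided by [32] before fine-tuning the network (baseline), after fine-tuning the network on patches from the dataset (fine-tune) and after fine-tuning the network using our self-supervision technique (fine-tuned+self-supervision).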