
1 Introduction

Domain adaptation and transfer learning are widely studied in computer vision and machine learning [1, 2]. They are inspired by the human cognitive capacity to learn new concepts from very few data samples (cf. training a classifier on millions of labeled images from the ImageNet dataset [3]). Generally, given a new (target) task to learn, the arising question is how to identify the so-called commonality [4, 5] between this task and previous (source) tasks, and transfer knowledge from the source tasks to the target one. Therefore, one has to address three questions: what to transfer, how, and when [4].

Domain adaptation and transfer learning utilize annotated and/or unlabeled data to perform the task at hand on the target data, e.g., learning new categories from few annotated samples (supervised domain adaptation [6, 7]) or utilizing available unlabeled data (unsupervised [8, 9] or semi-supervised domain adaptation [7, 10]). Closely related are one- and few-shot learning, which train robust class predictors from one or few samples [11].

Recently, algorithms for supervised, semi-supervised and unsupervised domain adaptation such as Simultaneous Deep Transfer Across Domains and Tasks [7], Second- or Higher-order Transfer (So-HoT) of knowledge [5] and Learning an Invariant Hilbert Space [12], all combined with Convolutional Neural Networks (CNNs) [13, 14], have reached state-of-the-art results of \(\sim \)90% accuracy on classic benchmarks such as the Office dataset [15]. Such good results are due to fine-tuning of CNNs on large-scale datasets such as ImageNet [3]. Indeed, fine-tuning of CNNs is a powerful domain adaptation and transfer learning tool by itself [16, 17]. Thus, these works show saturation of CNN features on the Office [15] dataset and its newer Office+Caltech 10 variant [18].

Therefore, we propose a new dataset for the task of exhibit identification in museum spaces that challenges domain adaptation/fine-tuning due to its significant domain shifts.

For the source domain, we captured the photos in a controlled fashion by Android phones e.g., we ensured that each exhibit is centered and non-occluded in photos. We prevented adverse capturing conditions and did not mix multiple objects per photo unless they were all part of one exhibit. We captured 2–30 photos of each art piece from different viewpoints and distances in their natural settings.

For the target domain, we employed an egocentric setup to ensure in-the-wild capturing process. We equipped 2 volunteers per exhibition with cheap wearable cameras and let them stroll and interact with artworks at their discretion. Such a capturing setup is applicable to preference and recommendation systems e.g., a curator takes training photos of exhibits with an Android phone while visitors stroll with wearable cameras to capture data from the egocentric perspective for a system to reason about the most popular exhibits. Open MIC contains 10 distinct source-target subsets of images from 10 different kinds of museum exhibition spaces, each exhibiting various photometric and geometric challenges, as detailed in Sect. 5.

To demonstrate the intrinsic difficulty of Open MIC, we chose useful baselines in supervised domain adaptation detailed in Sect. 5. They include fine-tuning CNNs on the source and/or target data and training a state-of-the-art So-HoT model [5] which we equip with non-Euclidean distances [19, 20] for robust end-to-end learning.

We provide various evaluation protocols which include: (i) training/evaluation per exhibition subset, (ii) training/testing on the combined set that covers all 866 identity labels, (iii) testing w.r.t. various scene factors annotated by us, such as quality of lighting, motion blur, occlusions, clutter, viewpoint and scale variations, rotations, glares, transparency, non-planarity, clipping, etc.

Moreover, we introduce a new evaluation metric inspired by the following saliency problem: as numerous exhibits can be captured in a target image, we asked our volunteers to enumerate in descending order the labels of the most salient/central exhibits they had interest in at a given time, followed by less salient/distant exhibits. As we ideally want to understand the volunteers’ preferences, the classifier has to decide which detected exhibit is the most salient. We note that the annotation- and classification-related processes are not free of noise. Therefore, we propose not only to look at the top-k accuracy known from ImageNet [3] but also to check if any of the top-k predictions are contained within the top-n fraction of all ground-truth labels enumerated for a target image. We refer to this as the top-k-n measure.

To obtain convincing baselines, we balance the use of an existing approach [5] with our mathematical contributions and evaluations. The So-HoT model [5] uses the Frobenius metric for partial alignment of within-class statistics obtained from CNNs. The hypothesis behind such modeling is that the partially aligned statistics capture the so-called commonality [4, 5] between the source and target domains, thus facilitating knowledge transfer. For the pipeline in Fig. 1, we use two CNN streams of the VGG16 network [14] which correspond to the source and target domains. We build scatter matrices, one per stream per class, from feature vectors of the fc layers. To exploit the geometry of positive definite matrices, we regularize and align scatters by the Jensen-Bregman LogDet Divergence (JBLD) [19] in an end-to-end manner and compare to the Affine-Invariant Riemannian Metric (AIRM) [20, 21]. However, evaluations of gradients of non-Euclidean distances are slow for large matrices. We show by the use of Nyström projections that, with typical numbers of datapoints per source/target per class being \(\sim \)50 in domain adaptation, evaluating such distances is fast and exact.

Our contributions are: (i) we collect/annotate a new challenging Open MIC dataset with domains consisting of images taken by Android phones and wearable cameras, the latter exhibiting a series of realistic distortions due to the egocentric capturing process, (ii) we compute useful baselines, provide various evaluation protocols, statistics and top-k-n results, and include a breakdown of results w.r.t. the scene factors we annotated, (iii) we use the non-Euclidean JBLD and AIRM distances for end-to-end training of the supervised domain adaptation approach and we exploit the Nyström projections to make this training tractable. To the best of our knowledge, these distances have not been used before in supervised domain adaptation due to their high computational complexity.

Fig. 1.

The pipeline. Figure 1a shows the source and target network streams which merge at the classifier level. The classification and alignment losses \(\ell \) and \(\hbar \) take the data \(\varvec{\varLambda }\) and \(\varvec{\varLambda }^{*\!}\) from both streams for end-to-end learning. Loss \(\hbar \) aligns covariances on the manifold of \(\mathcal {S}_{++}^{}\) matrices. Fig. 1b (top) shows alignment along the geodesic path (ours). Fig. 1b (bottom) shows alignment via the Euclidean dist. [5]. At the test time, we use the target stream and the classifier as in Fig. 1c.

2 Related Work

Below we describe the most popular datasets for the problem at hand and explain how Open MIC differs. Subsequently, we describe related domain adaptation approaches.

Datasets. A popular dataset for evaluating against the effect of domain shift is the Office dataset [15] which contains 31 object categories and three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset consist of objects commonly encountered in the office setting, such as keyboards, file cabinets, and laptops. The Amazon domain contains images which were collected from a website of on-line merchants. Its objects appear on clean backgrounds and at a fixed scale. The DSLR domain contains low-noise high-resolution images of objects captured from different viewpoints, while Webcam contains low-resolution images. The Office dataset and its newer extension to the Caltech 10 domain [18] are used in numerous domain adaptation papers [6,7,8,9, 12, 22,23,24].

The Office dataset is primarily used for the transfer of knowledge about object categories between domains. In contrast, our dataset addresses the transfer of instances between domains. Each domain of the Open MIC dataset contains 37–166 specific instances to distinguish between (866 in total), compared to the relatively low number of 31 classes in the Office dataset. Moreover, our target subsets are captured in an egocentric manner, e.g., we did not align objects to the center of images or control the shutter.

A recent large collection of datasets for domain adaptation was proposed in the technical report [25] to study cross-dataset domain shifts in object recognition using the ImageNet, Caltech-256, SUN, and Bing datasets. Even larger is the latest Visual Domain Decathlon challenge [26] which combines datasets such as ImageNet, CIFAR-100, Aircraft, Daimler pedestrian classification, Describable textures, German traffic signs, Omniglot, SVHN, UCF101 Dynamic Images, and VGG-Flowers. In contrast, we target identity recognition across exhibits captured in an egocentric setting, which vary from paintings to sculptures, glass, pottery and figurines. Many artworks in our dataset are fine-grained and hard to distinguish without expert knowledge.

The Office-Home dataset contains domains such as real images, product photos, clipart and simple art impressions of well-aligned objects [27]. The Car Dataset [28] contains ‘easily acquired’ \(\sim \)1M cars of 2657 classes from websites for fine-grained domain adaptation. Approach [29] uses 170 classes and \(\sim \)100 samples per class for attribute-based domain adaptation. Our Open MIC, however, is not limited to instances of cars or rigid objects. With 866 classes, Open MIC contains 10 diverse subsets with paintings, timepieces, sculptures, science exhibits, glasswork, relics, ancient animals, plants, figurines, ceramics, native arts, etc. We captured varied materials, some of which are non-rigid, may emit light, be in motion or appear under large scale and viewpoint changes to form extreme yet realistic domain shifts. In some subsets, we also have large numbers of frames for unsupervised domain adaptation.

Domain adaptation algorithms. Deep learning has been used in the context of domain adaptation in numerous recent works, e.g., [5,6,7, 9, 22,23,24]. These works establish the so-called commonality between domains. In [7], the authors propose to align both domains via the cross entropy which ‘maximally confuses’ both domains for supervised and semi-supervised settings. In [6], the authors capture the ‘interpolating path’ between the source and target domains using linear projections into a low-dimensional subspace on the Grassman manifold. Method [22] learns the transformation between the source and target via a deep regression network. Our model differs in that our source and target network streams co-regularize each other via the JBLD or AIRM distance that respects the non-Euclidean geometry of the source and target matrices (other distances can also be used [31, 32]). We align covariances [5] via a non-Euclidean distance.

Table 1. Frobenius, JBLD and AIRM distances and their properties. These distances operate between a pair of arbitrary matrices \(\varvec{\varSigma }\) and \(\varvec{\varSigma }^*\!\) which are points in \(\mathcal {S}_{++}^{}\) (and/or \(\mathcal {S}_{+}^{}\) for Frobenius).

For visual domains, the domain adaptation can be applied in the spatially-local sense to target so-called roots of domain shift. In [24], the authors utilize so-called ‘domainness maps’ which capture locally the degree of domain specificity. Our work is orthogonal to this method. Our ideas can be extended to a spatially-local setting.

Correlations between the source and target distributions are often used. In [33], a subspace forms a joint representation for the data from different domains. Metric learning [34, 35] can also be applied. In [8, 36], the source and target data are aligned in an unsupervised setting via correlation and Maximum Mean Discrepancy (MMD), respectively. A baseline we use [5] can be seen as an end-to-end trainable MMD with a polynomial kernel, as class-specific source and target distributions are aligned by the kernelized Frobenius norm on tensors. Our work is somewhat related. However, we first project class-specific vector representations from the last fc layers of the source and target CNN streams to a common space via Nyström projections for tractability, and then we combine them with the JBLD or AIRM distance to exploit the positive (semi)definite nature of scatter matrices. We perform end-to-end learning which requires non-trivial derivatives of the JBLD/AIRM distance and Nyström projections for computational efficiency.

3 Background

Below we discuss scatter matrices, Nyström projections, the Jensen-Bregman LogDet (JBLD) divergence [19] and the Affine-Invariant Riemannian Metric (AIRM) [20, 21].

3.1 Notations

Let \(\varvec{x}\in \mathbb {R}^{d}\) be a d-dimensional feature vector. \(\mathcal {I}_{N}\) stands for the index set \(\left\{ 1, 2,\cdots ,N\right\} \). The Frobenius norm of a matrix is given by \(\left\| {\varvec{X}}\right\| _F\!=\!\sqrt{\sum _{m,n} X_{mn}^2}\), where \(X_{mn}\) represents the \(\left( m,n\right) \)-th element of \(\varvec{X}\). The spaces of symmetric positive semidefinite and definite matrices are \(\mathcal {S}_{+}^{d}\) and \(\mathcal {S}_{++}^{d}\). A vector with all coefficients equal to one is denoted by \(1\!\!1\), and \(\varvec{J}_{mn}\) is a matrix of all zeros with one at position \((m,n)\).

3.2 Nyström Approximation

In our work, we rely on Nyström projections, thus, we review their mechanism first.

Proposition 1

Suppose \(\varvec{X}\!\in \!\mathbb {R}^{d\times N}\) and \(\varvec{Z}\!\in \!\mathbb {R}^{d\times N'\!}\) store N feature vectors and \(N'\) pivots (vectors used in approximation) of dimension d in their columns, respectively. Let \(k:\mathbb {R}^{d}\times \mathbb {R}^{d}\rightarrow \mathbb {R}\) be a positive definite kernel. We form two kernel matrices \(\varvec{K}_{\varvec{Z}\varvec{Z}}\!\in \!\mathcal {S}_{++}^{N'\!}\) and \(\varvec{K}_{\varvec{Z}\varvec{X}}\!\in \!\mathbb {R}^{N'\!\!\times \!N}\) with their (ij)-th elements being \(k(\varvec{z}_i,\varvec{z}_j)\) and \(k(\varvec{z}_i,\varvec{x}_j)\), respectively. Then the Nyström feature map \(\tilde{\varvec{\varPhi }}\!\in \!\mathbb {R}^{N'\!\!\times \!N}\!\!\), whose columns correspond to the input vectors in \(\varvec{X}\), and the Nyström approximation of kernel \(\varvec{K}_{\varvec{X}\varvec{X}}\) for which \(k(\varvec{x}_i,\varvec{x}_j)\) is its (ij)-th entry, are given by:

$$\begin{aligned}&\tilde{\varvec{\varPhi }}= \varvec{K}_{\varvec{Z}\varvec{Z}}^{-0.5}\varvec{K}_{\varvec{Z}\varvec{X}} \quad \text {and}\quad \varvec{K}_{\varvec{X}\varvec{X}}\approx \tilde{\varvec{\varPhi }}^T\tilde{\varvec{\varPhi }}. \end{aligned}$$
(1)

Proof

See [37] for details.    \(\square \)

Remark 1

The quality of approximation of (1) depends on the kernel k, data points \(\varvec{X}\), pivots \(\varvec{Z}\) and their number \(N'\!\). In the sequel, we exploit a specific setting under which \(\varvec{K}_{\varvec{X}\varvec{X}}\!=\!\tilde{\varvec{\varPhi }}^T\tilde{\varvec{\varPhi }}\) which indicates no approximation loss.
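To make the mechanism concrete, below is a minimal NumPy sketch of the Nyström feature map from Eq. (1). The RBF kernel, the pivot count and the data sizes are illustrative assumptions for the demo only; the setting exploited later in the paper uses a linear kernel with \(\varvec{Z}\!=\!\varvec{X}\).

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of columns of A and B.
    sq = (A ** 2).sum(0)[:, None] + (B ** 2).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * sq)

def nystrom_feature_map(X, Z, kernel=rbf_kernel, eps=1e-10):
    """X: d x N data, Z: d x N' pivots -> N' x N Nystrom feature map."""
    K_zz = kernel(Z, Z) + eps * np.eye(Z.shape[1])      # small ridge for stability
    K_zx = kernel(Z, X)
    w, V = np.linalg.eigh(K_zz)                         # K_zz^{-0.5} via eigendecomposition
    return (V @ np.diag(1.0 / np.sqrt(w)) @ V.T) @ K_zx # Eq. (1)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 200))                      # d = 64, N = 200
Z = X[:, rng.choice(200, size=50, replace=False)]       # N' = 50 pivots drawn from X
Phi_tilde = nystrom_feature_map(X, Z)
err = np.abs(Phi_tilde.T @ Phi_tilde - rbf_kernel(X, X)).max()
print(err)                                              # small approximation error
```

Increasing the number of pivots (or choosing them to span the data, as in Proposition 3 below) drives the reported error towards zero.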

3.3 Scatter Matrices

We make frequent use of distances \(d^2(\varvec{\varSigma },\varvec{\varSigma }^{*\!})\) that operate between covariances \(\varvec{\varSigma }\!\equiv \!\varvec{\varSigma }(\varvec{\varPhi })\) and \(\varvec{\varSigma }^*\!\!\equiv \!\varvec{\varSigma }(\varvec{\varPhi }^*\!)\) of feature vectors. Therefore, we provide a useful derivative of \(d^2(\varvec{\varSigma },\varvec{\varSigma }^{*\!})\) w.r.t. the feature vectors \(\varvec{\varPhi }\).

Proposition 2

Let \(\varvec{\varPhi }\!=\![\varvec{\phi }_1,\ldots ,\varvec{\phi }_N]\) and \(\varvec{\varPhi }^{*}\!\!=\![\varvec{\phi }^*_1\!,\ldots ,\varvec{\phi }^*_{N^*}\!]\) be feature vectors of quantity N and \(N^{*\!}\) e.g., formed by Eq. (1) and used to evaluate \(\varvec{\varSigma }\) and \(\varvec{\varSigma }^{*}\!\) with \(\varvec{\mu }\) and \(\varvec{\mu }^*\!\) being the mean of \(\varvec{\varPhi }\) and \(\varvec{\varPhi }^*\!\). Then derivatives of \(d^2\!\equiv \!d^2(\varvec{\varSigma },\varvec{\varSigma }^{*})\) w.r.t. \(\varvec{\varPhi }\) and \(\varvec{\varPhi }^{*}\!\) are:

$$\begin{aligned}&\!\!\!\!\textstyle \frac{\partial d^2(\varvec{\varSigma },\varvec{\varSigma }^{*})}{\partial \varvec{\varPhi }}\!=\!\frac{2}{N}\!\frac{\partial d^2}{\partial \varvec{\varSigma }}\!\scriptstyle \left( \varvec{\varPhi }\!-\!\varvec{\mu }1\!\!1^T\right) , \textstyle \frac{\partial d^2(\varvec{\varSigma },\varvec{\varSigma }^{*})}{\partial \varvec{\varPhi }^*}\!=\!\frac{2}{N^*}\!\frac{\partial d^2}{\partial \varvec{\varSigma }^*}\!\scriptstyle \left( \varvec{\varPhi }^{*}\!\!-\!\varvec{\mu }^{*}1\!\!1^T\right) . \end{aligned}$$
(2)

Then let \(\varvec{P}\) be some projection matrix. For \(\varvec{P}\varvec{\varPhi }\) and \(\varvec{P}\varvec{\varPhi }^{*\!}\) with covariances \(\varvec{\varSigma }'\), \(\varvec{\varSigma }'^*\!\), means \(\varvec{\mu }'\), \(\varvec{\mu }'^*\!\) and \(d'^2\!\equiv \!d^2(\varvec{\varSigma }'\!,\varvec{\varSigma }'^*\!)\), we obtain:

$$\begin{aligned}&\!\!\!\!\textstyle \frac{\partial d'^2}{\partial \varvec{\varPhi }}\!=\!\frac{2}{N}\varvec{P}^T\frac{\partial d'^2}{\partial \varvec{\varSigma }'}\!\left( \varvec{P}\varvec{\varPhi }\!-\!\varvec{\mu }'1\!\!1^T\right) ,\quad \frac{\partial d'^2}{\partial \varvec{\varPhi }^*}\!=\!\frac{2}{N^*}\varvec{P}^T\frac{\partial d'^2}{\partial \varvec{\varSigma }'^*}\!\left( \varvec{P}\varvec{\varPhi }^{*}\!\!-\!\varvec{\mu }'^{*}1\!\!1^T\right) . \end{aligned}$$
(3)

Proof

See our supplementary material.    \(\square \)
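Since the proof is deferred to the supplementary material, the following NumPy snippet is a quick numerical sanity check of Eq. (2) for the squared Frobenius distance between covariances (with 1/N normalization); the helper names and toy sizes are illustrative assumptions.

```python
import numpy as np

def covariance(Phi):
    # Covariance with 1/N normalization, as used for the scatter matrices above.
    mu = Phi.mean(axis=1, keepdims=True)
    return (Phi - mu) @ (Phi - mu).T / Phi.shape[1], mu

rng = np.random.default_rng(1)
d, N, Ns = 5, 12, 9
Phi, Phi_s = rng.standard_normal((d, N)), rng.standard_normal((d, Ns))

def d2(P):
    # Squared Frobenius distance between the two covariances.
    return np.sum((covariance(P)[0] - covariance(Phi_s)[0]) ** 2)

# Analytic gradient from Eq. (2); for this distance, dd2/dSigma = 2(Sigma - Sigma*).
S, mu = covariance(Phi)
S_s, _ = covariance(Phi_s)
grad_analytic = (2.0 / N) * (2.0 * (S - S_s)) @ (Phi - mu)

# Central finite differences over every entry of Phi.
grad_fd, eps = np.zeros_like(Phi), 1e-6
for i in range(d):
    for j in range(N):
        E = np.zeros_like(Phi)
        E[i, j] = eps
        grad_fd[i, j] = (d2(Phi + E) - d2(Phi - E)) / (2 * eps)

print(np.abs(grad_analytic - grad_fd).max())  # agreement up to finite-difference error
```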

3.4 Non-Euclidean Distances

In Table 1, we list the distances d with derivatives w.r.t. \(\varvec{\varSigma }\) used in the sequel. We indicate properties such as invariance to rotation (rot.), affine manipulations (aff.) and inversion (inv.). We indicate which distances meet the triangle inequality (Tr. Ineq.) and which are geodesic distances (Geo.). Lastly, we indicate if the distance d and its gradient \(\triangledown _{\varvec{\varSigma }}\) are finite (fin.) or infinite (\(\infty \)) for \(\mathcal {S}_{+}^{}\) matrices. This last property indicates that the JBLD and AIRM distances require some regularization, as our covariances are in \(\mathcal {S}_{+}^{}\).
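For reference, here is a hedged NumPy/SciPy sketch of the three distances using their standard formulations (squared Frobenius, JBLD and squared AIRM); the function names are ours, and the small diagonal regularizer reflects the remark above that JBLD and AIRM need \(\mathcal {S}_{+}\) scatters pushed into \(\mathcal {S}_{++}\).

```python
import numpy as np
from scipy.linalg import logm, sqrtm

def regularize(S, eps=1e-6):
    # Push an S_+ scatter into S_++, as required by JBLD/AIRM.
    return S + eps * np.eye(S.shape[0])

def frobenius2(S1, S2):
    return np.sum((S1 - S2) ** 2)

def jbld2(S1, S2):
    # JBLD(S1, S2) = logdet((S1+S2)/2) - 0.5*logdet(S1*S2).
    S1, S2 = regularize(S1), regularize(S2)
    return (np.linalg.slogdet(0.5 * (S1 + S2))[1]
            - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))

def airm2(S1, S2):
    # AIRM^2(S1, S2) = ||log(S1^{-1/2} S2 S1^{-1/2})||_F^2.
    S1, S2 = regularize(S1), regularize(S2)
    S1_inv_sqrt = np.linalg.inv(sqrtm(S1))
    return np.sum(logm(S1_inv_sqrt @ S2 @ S1_inv_sqrt).real ** 2)

rng = np.random.default_rng(2)
A, B = rng.standard_normal((6, 30)), rng.standard_normal((6, 30))
S1, S2 = A @ A.T / 30, B @ B.T / 30      # two empirical covariances
print(frobenius2(S1, S2), jbld2(S1, S2), airm2(S1, S2))
```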

4 Problem Formulation

In this section, we equip the supervised domain adaptation approach So-HoT [5] with the JBLD and AIRM distances and the Nyström projections to make evaluations fast.

4.1 Supervised Domain Adaptation

Suppose \(\mathcal {I}_{N}\) and \(\mathcal {I}_{N^*}\!\) are the indexes of N source and \(N^*\!\) target training data points. \(\mathcal {I}_{N_c}\) and \(\mathcal {I}_{N_c^*}\!\) are the class-specific indexes for \(c\!\in \!\mathcal {I}_{C}\), where C is the number of classes (exhibit identities). Furthermore, suppose we have feature vectors \(\varvec{\phi }\) from an fc layer of the source network stream, one per image, and their associated labels y. Such pairs are given by \(\varvec{\varLambda }\!\equiv \!\{(\varvec{\phi }_n, y_n)\}_{n\in \mathcal {I}_{N}}\), where \(\varvec{\phi }_n\!\in \!\mathbb {R}^{d}\) and \(y_n\!\in \!\mathcal {I}_{C}\), \(\forall n\!\in \!\mathcal {I}_{N}\). For the target data, by analogy, we define pairs \(\varvec{\varLambda }^{*\!}\!\equiv \!\{(\varvec{\phi }^*_n, y^*_n)\}_{n\in \mathcal {I}_{N}^*}\), where \(\varvec{\phi }^*\!\!\in \!\mathbb {R}^{d}\) and \(y^*_n\!\!\in \!\mathcal {I}_{C}\), \(\forall n\!\in \!\mathcal {I}_{N}^*\). Class-specific sets of feature vectors are given as \(\varvec{\varPhi }_c\!\equiv \!\{\varvec{\phi }^c_n\}_{n\in \mathcal {I}_{N_c}}\) and \(\varvec{\varPhi }_c^*\!\!\equiv \!\{\varvec{\phi }^{*c}_n\}_{n\in \mathcal {I}_{N_c^*\!}}\), \(\forall c\!\in \!\mathcal {I}_{C}\). Then \(\varvec{\varPhi }\!\equiv \!(\varvec{\varPhi }_1,\ldots ,\varvec{\varPhi }_C)\) and \(\varvec{\varPhi }^*\!\!\equiv \!(\varvec{\varPhi }^*_1,\ldots ,\varvec{\varPhi }^*_C)\). We write the asterisk in superscript (e.g. \({\varvec{\phi }}^*\)) to denote variables related to the target network while the source-related variables have no asterisk. Our problem is posed as a trade-off between the classifier and alignment losses \(\ell \) and \(\hbar \). Figure 1 shows our setup. Our loss \(\hbar \) depends on two sets of variables \((\varvec{\varPhi }_1,\ldots ,\varvec{\varPhi }_C)\) and \((\varvec{\varPhi }^*_1,\ldots ,\varvec{\varPhi }^*_C)\) – one set per network stream. Feature vectors \(\varvec{\varPhi }(\varvec{\varTheta })\) and \(\varvec{\varPhi }^*\!(\varvec{\varTheta }^*\!)\) depend on the parameters of the source and target network streams \(\varvec{\varTheta }\) and \(\varvec{\varTheta }^*\!\) that we optimize over. \(\varvec{\varSigma }_c\!\equiv \!\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi }_c))\), \(\varvec{\varSigma }^*_c\!\equiv \!\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi }^*_c))\), \(\varvec{\mu }_c(\varvec{\varPhi })\) and \(\varvec{\mu }^*_c(\varvec{\varPhi }^*)\) denote the covariances and means, respectively, one covariance/mean pair per network stream per class. Specifically, we solve:

(4)

Note that Fig. 1a indicates by the elliptical/curved shape that \(\hbar \) performs the alignment on the \(\mathcal {S}_{+}^{}\) manifold along exact (or approximate) geodesics. For \(\ell \), we employ a generic Softmax loss. For the source and target streams, the matrices \(\varvec{W},\varvec{W}^*\!\!\in \!\mathbb {R}^{d\times C}\) contain unnormalized probabilities. In Eq. (4), separating the class-specific distributions is addressed by \(\ell \), while attracting the within-class scatters of both network streams is handled by \(\hbar \). Variable \(\eta \) controls the proximity between \(\varvec{W}\) and \(\varvec{W}^*\!\), which encourages similarity between the decision boundaries of the classifiers. Coefficients \(\sigma _1\) and \(\sigma _2\) control the degree of the covariance and mean alignment, and \(\tau \) controls the \(\ell _2\)-norm of the vectors \(\varvec{\phi }\).
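The exact objective is Eq. (4); since only its ingredients are described in this paragraph, the NumPy sketch below merely illustrates how those named pieces (the classification losses \(\ell \) for both streams, the per-class alignment \(\hbar \) weighted by \(\sigma _1\), \(\sigma _2\), and the classifier-proximity term weighted by \(\eta \)) can be combined. The helper names, toy data, the Frobenius stand-in for \(d_g\) and the omission of the \(\tau \)-control on \(\left\| \varvec{\phi }\right\| _2\) are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax_loss(W, Phi, y):
    # Cross-entropy of a linear classifier W (d x C) on features Phi (d x N).
    logits = W.T @ Phi
    logits -= logits.max(axis=0, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return -np.mean(np.log(p[y, np.arange(len(y))] + 1e-12))

def cov_mean(Phi):
    mu = Phi.mean(axis=1, keepdims=True)
    return (Phi - mu) @ (Phi - mu).T / Phi.shape[1], mu

def hbar_c(Phi_c, Phi_c_star, dist2, sigma1, sigma2):
    # Per-class alignment: attract within-class covariances and means.
    S, mu = cov_mean(Phi_c)
    S_star, mu_star = cov_mean(Phi_c_star)
    return sigma1 * dist2(S, S_star) + sigma2 * np.sum((mu - mu_star) ** 2)

rng = np.random.default_rng(3)
d, C = 8, 4
W, W_star = rng.standard_normal((d, C)), rng.standard_normal((d, C))
Phi = [rng.standard_normal((d, 30)) for _ in range(C)]      # source features per class
Phi_star = [rng.standard_normal((d, 3)) for _ in range(C)]  # target features per class
frob2 = lambda S1, S2: np.sum((S1 - S2) ** 2)               # stand-in for d_g^2

ell = (softmax_loss(W, np.hstack(Phi), np.repeat(np.arange(C), 30)) +
       softmax_loss(W_star, np.hstack(Phi_star), np.repeat(np.arange(C), 3)))
hbar = sum(hbar_c(Pc, Pc_s, frob2, 0.05, 0.05) for Pc, Pc_s in zip(Phi, Phi_star)) / C
total = ell + hbar + 1.0 * np.sum((W - W_star) ** 2)        # eta = 1 couples W and W*
print(total)
```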

The Nyström projections are denoted by \(\varvec{\varPi }\). Table 1 indicates that backpropagation on the JBLD and AIRM distances involves inversions of \(\varvec{\varSigma }_c\) and \(\varvec{\varSigma }^*_c\) for each \(c\!\in \!\mathcal {I}_{C}\) according to (4). As \(\varvec{\varSigma }_c\) and \(\varvec{\varSigma }^*_c\) are formed from, say, 2048-dimensional feature vectors of the last fc layer, inversions are too costly for fine-tuning, e.g., 4 s per iteration is prohibitive. Thus, we show next how to combine the Nyström projections with \(d_g\).

Fig. 2.

Source subsets of Open MIC. (Top) Paintings (Shn), Clocks (Clk), Sculptures (Scl), Science Exhibits (Sci) and Glasswork (Gls). As 3 images per exhibit demonstrate, we covered different viewpoints and scales during capturing. (Bottom) 3 different art pieces per exhibition such as Cultural Relics (Rel), Natural History Exhibits (Nat), Historical/Cultural Exhibits (Shx), Porcelain (Clv) and Indigenous Arts (Hon). Note the composite scenes of Relics, fine-grained nature of Natural History and Cultural Exhibits and non-planarity of exhibits.

Proposition 3

Let us choose \(\varvec{Z}\!=\!\varvec{X}\!=[\varvec{\varPhi }\!,\varvec{\varPhi }^*\!]\) for pivots and source/target feature vectors, kernel k to be linear, and substitute them into Eq. (1). Then we obtain \(\varvec{\varPi }(\varvec{X})\!=\!\tilde{\varvec{\varPhi }}\!=\!(\varvec{X}^T\!\varvec{X})^{-0.5}\varvec{X}^T\!\varvec{X}\!=\!(\varvec{X}^T\!\varvec{X})^{0.5}\!\), where \(\varvec{\varPi }(\varvec{X})\) is a projection of \(\varvec{X}\) on itself that is isometric, i.e., distances between the column vectors of \((\varvec{X}^T\!\varvec{X})^{0.5}\) correspond to distances between the column vectors of \(\varvec{X}\). Thus, \(\varvec{\varPi }(\varvec{X})\) is an isometric transformation w.r.t. the distances in Table 1, that is \(d^2_g(\varvec{\varSigma }(\varvec{\varPhi }),\varvec{\varSigma }(\varvec{\varPhi }^*\!))\!=\!d^2_g(\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi })),\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi }^*\!)))\).

Proof

Firstly, we note that the following holds:

$$\begin{aligned}&\!\!\!\!\!\!\varvec{K}_{\varvec{X}\varvec{X}}\!=\!\varvec{\varPi }(\varvec{X})^T\!\varvec{\varPi }(\varvec{X})\!=\!(\varvec{X}^T\!\varvec{X})^{0.5}(\varvec{X}^T\!\varvec{X})^{0.5}\!\!\!\!\!\!=\!\varvec{X}^T\!\varvec{X}.\!\!\! \end{aligned}$$
(5)

Note that \(\varvec{\varPi }\) projects \(\varvec{X}\) into a more compact subspace of size \(d'\!\!=\!N\!+\!N^*\!\) if \(d'\!\ll \!d\), which includes the spanning space for \(\varvec{X}\) by construction as \(\varvec{Z}\!=\varvec{X}\). Equation (5) implies that \(\varvec{\varPi }(\varvec{X})\) performs at most a rotation on \(\varvec{X}\), as the dot-product (used to obtain the entries of \(\varvec{K}_{\varvec{X}\varvec{X}}\)), just like the Euclidean distance, is invariant to rotations only, i.e., it has no affine invariance. As the spectra of \((\varvec{X}^T\!\varvec{X})^{0.5}\) and \(\varvec{X}\) are equal, \(\varvec{\varPi }(\varvec{X})\) performs no scaling, shear or inversion. The distances in Table 1 are all rotation-invariant, thus \(d^2_g(\varvec{\varSigma }(\varvec{\varPhi }),\varvec{\varSigma }(\varvec{\varPhi }^*\!))\!=\!d^2_g(\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi })),\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi }^*\!)))\).

A strict proof shows that \((\varvec{Z}^T\!\varvec{Z})^{-0.5}\varvec{Z}^T\!\) is a composite rotation \(\varvec{V}\varvec{U}^T\!\) if the SVD of \(\varvec{Z}\) is \(\varvec{U}\varvec{\lambda }\varvec{V}^T\!\):

$$\begin{aligned}&(\varvec{Z}^T\!\varvec{Z})^{-0.5}\varvec{Z}^T\!=\!(\varvec{V}\varvec{\lambda }^2\varvec{V}^T)^{-0.5}\,\varvec{V}\varvec{\lambda }\varvec{U}^T\!=\!\varvec{V}\varvec{\lambda }^{-1}\varvec{V}^T\varvec{V}\varvec{\lambda }\varvec{U}^T\!=\!\varvec{V}\varvec{U}^T\!. \end{aligned}$$
(6)

   \(\square \)

In practice, for each class \(c\!\in \!\mathcal {I}_{C}\), we choose \(\varvec{X}\!=\!\varvec{Z}\!=[\varvec{\varPhi }_c, \varvec{\varPhi }_c^*]\). Then, as \(\varvec{\varPi }(\varvec{X})\!=\!(\varvec{X}^T\!\varvec{X})^{0.5}\!\), we have \(\varvec{\varPi }(\varvec{\varPhi })\!=\![\varvec{y}_1,\ldots ,\varvec{y}_N]\) and \(\varvec{\varPi }(\varvec{\varPhi }^*\!)\!=\![\varvec{y}_{N\!+\!1},\ldots ,\varvec{y}_{N\!+\!N^*\!}]\), where \(\varvec{Y}\!=\![\varvec{y}_{1},\ldots ,\varvec{y}_{N\!+\!N^*\!}]\!=\!(\varvec{X}^T\!\varvec{X})^{0.5}\!\). With typical \(N\!\approx \!30\) and \(N^*\!\approx \!3\), we obtain covariances of side size \(d'\!\approx \!33\) rather than \(d\!=\!4096\).
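The following self-contained NumPy/SciPy check illustrates this construction under stated assumptions: a standard JBLD formulation and the same 1e-6 diagonal regularizer added in both spaces. We use d = 256 only to keep the demo quick while preserving \(d\!\gg \!d'\); the distance computed on the small \(d'\!\times \!d'\) covariances matches the one computed on the full \(d\!\times \!d\) covariances.

```python
import numpy as np
from scipy.linalg import sqrtm

def covariance(Phi, eps=1e-6):
    # Covariance with 1/N normalization plus a small diagonal regularizer.
    mu = Phi.mean(axis=1, keepdims=True)
    return (Phi - mu) @ (Phi - mu).T / Phi.shape[1] + eps * np.eye(Phi.shape[0])

def jbld2(S1, S2):
    return (np.linalg.slogdet(0.5 * (S1 + S2))[1]
            - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))

rng = np.random.default_rng(4)
d, N, Ns = 256, 30, 3
Phi, Phi_s = rng.standard_normal((d, N)), rng.standard_normal((d, Ns))

# Slow route: full d x d covariances.
full = jbld2(covariance(Phi), covariance(Phi_s))

# Nystrom route: Pi(X) = (X^T X)^0.5 with X = Z = [Phi, Phi*], i.e. d' = N + N*.
X = np.hstack([Phi, Phi_s])
Y = sqrtm(X.T @ X).real
Pi_Phi, Pi_Phi_s = Y[:, :N], Y[:, N:]
small = jbld2(covariance(Pi_Phi), covariance(Pi_Phi_s))

print(full, small)   # the two values agree up to numerical precision
```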

Proposition 4

Typically, the inverse square root \((\varvec{X}^T\!\varvec{X})^{-0.5}\) of the kernel matrix \(\varvec{K}_{\varvec{Z}\varvec{Z}}\!=\!\varvec{X}^T\!\varvec{X}\) can only be differentiated via costly SVD. However, if \(\varvec{X}\!=[\varvec{\varPhi }\!,\varvec{\varPhi }^*\!]\) and \(\varvec{Z}\!=\!\varvec{X}\) as in Proposition 3, and if we consider the chain rule, we require:

$$\begin{aligned}&\textstyle \frac{\partial d^2_g(\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi })),\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi }^*\!)))}{\partial \varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi }))}\odot \frac{\partial \varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi }))}{\partial \varvec{\varPi }(\varvec{\varPhi })}\odot \frac{\partial \varvec{\varPi }(\varvec{\varPhi })}{\partial \varvec{\varPhi }}, \end{aligned}$$
(7)

then \((\varvec{X}^T\!\varvec{X})^{-0.5}\) can be treated as a constant in differentiation:

(8)

Proof

It follows from the rotation-invariance of the Euclidean, JBLD and AIRM distances. Let us write \(\varvec{\varPi }(\varvec{X})\!=\!\varvec{R}\varvec{X}\), where \(\varvec{R}\) is a rotation matrix. Thus, we have: \(d^2_g(\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi })),\varvec{\varSigma }(\varvec{\varPi }(\varvec{\varPhi }^*\!)))\!=\!d^2_g(\varvec{\varSigma }(\varvec{R}\varvec{\varPhi }),\varvec{\varSigma }(\varvec{R}\varvec{\varPhi }^*\!))\!=\!d^2_g(\varvec{R}\varvec{\varSigma }(\varvec{\varPhi })\varvec{R}^T\!,\varvec{R}\varvec{\varSigma }(\varvec{\varPhi }^*\!)\varvec{R}^T)\). Therefore, even if \(\varvec{R}\) depends on \(\varvec{X}\), the distance \(d^2_g\) is unchanged by any choice of valid \(\varvec{R}\), i.e., for the Frobenius norm we have: \(||\varvec{R}\varvec{\varSigma }\varvec{R}^T\!-\!\varvec{R}\varvec{\varSigma }^*\!\varvec{R}^T||_F^2\!=\!\text {Tr}\left( \varvec{R}\varvec{A}^T\!{\varvec{R}}^T\!{\varvec{R}}\varvec{A}{\varvec{R}^T}\right) \!=\!\text {Tr}\left( \varvec{R}^T\!\varvec{R}\varvec{A}^T\!\varvec{A}\right) \!=\!\text {Tr}\left( \varvec{A}^T\!\varvec{A}\right) \!=\!||\varvec{\varSigma }\!-\!\varvec{\varSigma }^*||_F^2\), where \(\varvec{A}\!=\!\varvec{\varSigma }\!-\!\varvec{\varSigma }^*\!\). Therefore, we obtain: \(\frac{\partial ||\varvec{R}\varvec{\varSigma }(\varvec{\varPhi })\varvec{R}^T\!\!-\!\varvec{R}\varvec{\varSigma }(\varvec{\varPhi }^*\!)\varvec{R}^T\!||_F^2}{\partial \varvec{R}\varvec{\varSigma }(\varvec{\varPhi })\varvec{R}^T}\odot \frac{\partial \varvec{R}\varvec{\varSigma }(\varvec{\varPhi })\varvec{R}^T}{\partial \varvec{\varSigma }(\varvec{\varPhi })}\odot \frac{\partial \varvec{\varSigma }(\varvec{\varPhi })}{\partial \varvec{\varPhi }}\!=\! \frac{\partial ||\varvec{\varSigma }(\varvec{\varPhi })\!-\!\varvec{\varSigma }(\varvec{\varPhi }^*\!)\!||_F^2}{\partial \varvec{\varSigma }(\varvec{\varPhi })}\odot \frac{\partial \varvec{\varSigma }(\varvec{\varPhi })}{\partial \varvec{\varPhi }}\), which completes the proof.   \(\square \)

Complexity. The Frobenius norm between covariances, plus their computation, has a combined complexity of \(\mathcal {O}((d'\!\!+\!1)d^2)\), where \(d'\!\!=\!N\!+\!N^*\!\). For non-Euclidean distances, we take into account the dominant cost of evaluating the square root of a matrix and/or inversions by SVD, as well as the cost of building scatter matrices. Thus, we have \(\mathcal {O}((d'\!\!+\!1)d^2 + d^\omega )\), where the constant \(2\!<\!\omega \!<\!2.376\) concerns the complexity of SVD. Lastly, evaluating the Nyström projections, building covariances and running a non-Euclidean distance enjoys \(\mathcal {O}({d'}^2d + (d'\!\!+\!1){d'}^2 + {d'}^\omega )\!=\!\mathcal {O}({d'}^2d)\) complexity for \(d\!\gg \!d'\!\).

For typical \(d' = 33\) and \(d = 2048\), the non-Euclidean distances are \(\sim \!1.7\times \) slower than the Frobenius norm. However, non-Euclidean distances combined with our projections are \(210\times \) and \(124\times \) faster than the naively evaluated non-Euclidean distances and the Frobenius norm, respectively. This cuts the time of each training from a couple of days to 6–8 h. Moreover, while unsupervised methods such as CORAL [8] align only two covariances (source and target), our most demanding supervised protocol operates on 866 classes, which requires aligning \(2\!\times \!866\) covariances. For naive alignment via JBLD, we need 6 days (or much more) to complete. With Nyström projections, JBLD takes \(\sim \)70 h.
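A rough NumPy timing sketch of this argument is given below. It isolates only the dominant eigendecomposition-based square root/inversion step for a \(d\!\times \!d\) scatter versus a \(d'\!\times \!d'\) one; the full comparison above also accounts for the \(\mathcal {O}({d'}^2d)\) projection cost, and absolute times depend on hardware and BLAS.

```python
import time
import numpy as np

def sqrt_and_inv(S):
    # Symmetric eigendecomposition stands in for the costly SVD-based step.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T, V @ np.diag(1.0 / w) @ V.T

rng = np.random.default_rng(5)
for n in (2048, 33):                       # d from the text vs d' = N + N*
    A = rng.standard_normal((n, n))
    S = A @ A.T / n + 1e-6 * np.eye(n)     # a regularized SPD scatter
    t0 = time.perf_counter()
    sqrt_and_inv(S)
    print(n, round(time.perf_counter() - t0, 4), "s")
```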

Fig. 3.

Examples of the target subsets of Open MIC. From left to right, each column illustrates Paintings (Shn), Clocks (Clk), Sculptures (Scl), Science Exhibits (Sci) and Glasswork (Gls), Cultural Relics (Rel), Natural History Exhibits (Nat), Historical/Cultural Exhibits (Shx), Porcelain (Clv) and Indigenous Arts (Hon). Note the variety of photometric and geometric distortions.

5 Experiments

Table 2. Unique exhibit instances (Inst.) and numbers of images of Open MIC in the source (Src.) and target (Tgt.) subsets plus backgrounds (Src+) and (Tgt+). We also have \(\sim \)380K frames (fr).
Table 3. Verification of baseline setups. (Left) Office (\(A\!\rightarrow \!W\) domain shift) on AlexNet, VGG16 and GoogLeNet streams. We compare baseline fine-tuning on the combined source+target domains (S+T), the second-order (So) Euclidean-based method [5] and our JBLD/AIRM distances. (Middle) State of the art. (Right) Open MIC on the (Clk) domain shift with VGG16.

Below we detail our CNN setup, discuss the Open MIC dataset and our evaluations.

Setting. At training and testing time, we use the setting shown in Fig. 1a and c, respectively. The images in our dataset are portrait or landscape oriented. Thus, we extract 3 square patches per image that cover its entire region. For training, these patches are the training data points. For testing, we average over the 3 predictions from a group of patches to label the image. We briefly compare VGG16 [14] and GoogLeNet [40], and the Euclidean, JBLD and AIRM distances on subsets of Office and Open MIC. Table 3 shows that VGG16 and GoogLeNet yield similar scores while JBLD and AIRM beat the Euclidean distance. Thus, we employ VGG16 with JBLD in what follows.
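A minimal Python sketch of this three-patch protocol is given below. The exact placement of the 3 patches is not specified here, so the start/middle/end layout along the longer image side and the `predict_proba` helper are illustrative assumptions.

```python
import numpy as np

def three_square_patches(img):
    """Extract 3 square patches that jointly cover a portrait/landscape image.

    img: H x W x 3 array. The patch side equals the shorter image side; the
    patches are anchored at the start, middle and end of the longer side.
    """
    H, W = img.shape[:2]
    side, long_side = min(H, W), max(H, W)
    offsets = [0, (long_side - side) // 2, long_side - side]
    if H >= W:   # portrait: slide the square along the height
        return [img[o:o + side, :side] for o in offsets]
    return [img[:side, o:o + side] for o in offsets]   # landscape: along the width

def predict_image(img, predict_proba):
    # Average the class probabilities over the 3 patches, then take the argmax.
    probs = np.mean([predict_proba(p) for p in three_square_patches(img)], axis=0)
    return int(np.argmax(probs))

# Toy usage with a dummy 5-class classifier.
rng = np.random.default_rng(6)
dummy = lambda patch: rng.dirichlet(np.ones(5))
print(predict_image(rng.integers(0, 255, size=(480, 640, 3)), dummy))
```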

Parameters. Both streams are pre-trained on ImageNet [3]. We set non-zero learning rates on the fully-connected and the last two convolutional layers of each stream. Fine-tuning of both streams takes 30–100K iterations. We set \(\tau \) to the average value of the \(\ell _2\) norm of fc feature vectors sampled on ImageNet and the hyperplane proximity \(\eta \!=\!1\). The inverse in Eq. (1) and the matrices \(\varvec{\varSigma }\) and \(\varvec{\varSigma }^*\!\) are regularized by adding \(\sim \)1e-6 to their diagonals. Lastly, we cross-validate \(\sigma _1\) and \(\sigma _2\) in the range 0.005–1.

Office. It has the DSLR, Amazon and Webcam domains. For brevity, we check if our pipeline matches results in the literature on the Amazon-Webcam domain shift (\(A\!\rightarrow \!W\)).

Open MIC. The proposed dataset contains 10 distinct source-target subsets of images from 10 different kinds of museum exhibition spaces which are illustrated in Figs. 2 and 3, resp.; see also [41]. They include Paintings from Shenzhen Museum (Shn), the Clock and Watch Gallery (Clk) and the Indian and Chinese Sculptures (Scl) from the Palace Museum, the Xiangyang Science Museum (Sci), the European Glass Art (Gls) and the Collection of Cultural Relics (Rel) from the Hubei Provincial Museum, the Nature, Animals and Plants in Ancient Times (Nat) from Shanghai Natural History Museum, the Comprehensive Historical and Cultural Exhibits from Shaanxi History Museum (Shx), the Sculptures, Pottery and Bronze Figurines from the Cleveland Museum of Arts (Clv), and Indigenous Arts from Honolulu Museum Of Arts (Hon).

For the target data, we annotated each image with the labels of the art pieces visible in it. The wearable cameras were set to capture an image every 10 s and operated in-the-wild, e.g., volunteers had no control over the shutter, focus or centering. Thus, our data exhibits many realistic challenges, e.g., sensor noise, motion blur, occlusions, background clutter, varying viewpoints, scale changes, rotations, glares, transparency, non-planar surfaces, clipping, multiple exhibits, active light, color inconstancy, and very large or small exhibits, to name but a few phenomena visible in Fig. 3. The numbers and statistics regarding the Open MIC dataset are given in Table 2. Every subset contains 37–166 exhibits to identify and 5 train, validation, and test splits. In total, our dataset contains 866 unique exhibit labels, 8560 source (7645 exhibits and 915 backgrounds) and 7596 target (6092 exhibits and 1504 backgrounds, including a few unidentified exhibits) images.

Baselines. We provide baselines such as (i) fine-tuning CNNs on the source subsets (S) and testing on randomly chosen target splits, (ii) fine-tuning on the target only (T) and evaluating on the remaining disjoint target splits, (iii) fine-tuning on the source+target (S+T) and evaluating on the remaining disjoint target splits, and (iv) training the state-of-the-art So-HoT domain adaptation algorithm [5], which we equip with non-Euclidean distances.

We include the following evaluation protocols: (i) training/evaluation per exhibition subset, (ii) training/testing on the combined set with all 866 identity labels, (iii) testing w.r.t. the scene factors annotated by us (Sect. 5.2, Challenge III), and (iv) unsupervised domain adaptation.

Table 4. Challenge I. Open MIC performance on the 10 subsets over 5 data splits. Baselines (S), (T) and (S+T) are given as well as our JBLD approach. We report top-1, top-1-5, top-5-1, top-5-5 accuracies and the combined scores \(\text {Avg}_k\text {top-}k\text {-}k\). See Sect. 5.2 for details.

5.1 Comparison to the State of the Art

Firstly, we validate that our reference method performs on par with or better than the state-of-the-art approaches. Table 3 shows that the JBLD and AIRM distances outperform the Euclidean-based So-HoT method (So) [5] by \(\sim \!1.6\%\) (\(A\!\rightarrow \!W\), Office, VGG16) and \(0.9\%\) (Clk, Open MIC, VGG16), and recent approaches, e.g., [7], by \(\sim \!2.9\%\) accuracy (\(A\!\rightarrow \!W\), Office, AlexNet). We also observe that GoogLeNet outperforms the VGG16-based model by \(\sim \!0.5\%\). Having validated our model, we opt to evaluate our proposed Open MIC dataset on VGG16 streams for consistency with the So-HoT model [5].

Supervised vs. unsupervised domain adaptation. The goal of supervised domain adaptation is to use few source and target training samples per class, all labeled, to mimic human abilities of learning from very few samples. In contrast, the unsupervised case can use large numbers of unlabeled target training samples. We ran our code on the Office-Home dataset [27] which has no supervised protocol. We chose two domain shifts with 20 source and 3 target training images per class (all labeled), which yielded 48.1/49.3% (So) and 49.2/50.5% (JBLD) accuracy. The unsupervised approach [27], which used all available target datapoints, yielded 34.69/29.91% accuracy.

Table 5. Challenge II. Open MIC performance on the combined set over 5 data splits. Baselines (S), (T) and (S+T) are given as well as the second-order (So) method [5] and our JBLD approach.
Table 6. Challenge III. Open MIC performance on the combined set w.r.t. the 12 factors detailed in Sect. 5.2. Top-1 accuracies for the baselines (S), (T), (S+T), and for our JBLD approach are listed.

5.2 Open MIC Challenge

Below we detail our challenges on the Open MIC dataset and present our results.

Challenge I. Below we run our supervised domain adaptation with the JBLD distance per subset. We prepare 5 training, validation and testing splits. For the source data, we use all samples available per class. For the target data, we use 3 samples per class for training and 3 per class for validation, and the rest for testing.

We report top-1 and top-5 accuracies. Moreover, as our target images often contain multiple exhibits, we ask whether any of the top-k predictions match any of the top-n image labels ordered by our expert volunteers according to the perceived saliency. If so, we count it as a correctly recognized image. We count these valid predictions and normalize by the total number of testing images. We denote this measure as top-k-n, where \(k,n\!\in \!\mathcal {I}_{5}\). Lastly, we indicate an area-under-curve type of measure, \(\text {Avg}_k\text {top-}k\text {-}k\), which rewards correct recognition of the most dominant object in the scene and offers some reward if the order of the top predictions is wrong (less dominant objects predicted first).
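A small Python sketch of the top-k-n measure and the combined \(\text {Avg}_k\text {top-}k\text {-}k\) score described above is given below; the function names and the toy lists are illustrative assumptions.

```python
import numpy as np

def top_k_n_accuracy(predictions, ground_truths, k, n):
    """predictions[i]: labels sorted by classifier confidence (most confident
    first); ground_truths[i]: labels sorted by annotated saliency (most salient
    first). An image counts as correct if any of its top-k predictions appears
    among its top-n ground-truth labels."""
    hits = sum(bool(set(p[:k]) & set(g[:n]))
               for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)

def avg_top_k_k(predictions, ground_truths, k_max=5):
    # Area-under-curve style combined score over k = 1..k_max.
    return np.mean([top_k_n_accuracy(predictions, ground_truths, k, k)
                    for k in range(1, k_max + 1)])

# Toy example: 3 test images from a 4-class problem.
preds = [[2, 0, 1], [1, 3, 0], [0, 2, 3]]
gts = [[0, 2], [3], [1]]
print(top_k_n_accuracy(preds, gts, k=1, n=5), avg_top_k_k(preds, gts))
```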

We divided Open MIC into the Shn, Clk, Scl, Sci, Gls, Rel, Nat, Shx, Clv and Hon subsets to allow short 6–8 h runs per experiment. We ran 150 jobs on the (S), (T) and (S+T) baselines and 300 jobs on JBLD: 5 splits \(\times \)10 subsets \(\times \)6 hyperparameter choices. Table 4 shows that the exhibits in the Comprehensive Historical and Cultural Exhibits (Shx) and the Sculptures (Scl) were the hardest to identify, given 48.5 and \(54.4\%\) top-1 accuracy. This is consistent with volunteers’ reports that both exhibitions were crowded, the lighting was dim, and exhibits were occluded, fine-grained and non-planar. Moreover, the baseline of training on the source and testing on the target (S) scored a mere 15.8 and 18.1% top-1 accuracy on the Glass (Gls) and Relics (Rel) subsets due to extreme domain shifts. The easiest to identify were the Sculptures, Pottery and Bronze Figurines (Clv) and the Indigenous Arts (Hon), as both exhibitions were spacious with good lighting. The average top-1 accuracy across all subsets on JBLD is \(64.6\%\). Averages over the baselines (S), (T) and (S+T) are 43.9, 57.8, and \(59.2\%\) top-1 accuracy. To account for the uncertainty of saliency-based labeling (the classifier confusing which exhibit to label), we report our proposed average top-1-5 accuracy of \(71.0\%\). Our average combined score \(\text {Avg}_k\text {top-}k\text {-}k\) is \(79.8\%\). The results show that Open MIC challenges CNNs due to in-the-wild capture with wearable cameras.

Fig. 4.

Examples of difficult to identify exhibits from the target domain in the Open MIC dataset.

Challenge II. Below we evaluate the combined set covering 866 exhibit identities. In this setting, a single experiment runs 80–120 hours. We ran 15 jobs on the (S), (T) and (S+T) baselines and 60 jobs on (So) and JBLD: 2 distances \(\times \)5 splits \(\times \)6 hyperparameter choices. Table 5 shows that our JBLD approach scores \(64.2\%\) top-1 accuracy and outperforms the baselines (S), (T) and (S+T) by 30, 7.7 and \(8.3\%\). Fine-tuning CNNs on the source and testing on the target (S) is a poor performer due to the large domain shift in Open MIC.

Challenge III. For this challenge, we break down the performance on the combined set covering 866 exhibit identities w.r.t. the following 12 factors: object clipping (clp), low lighting (lgt), blur (blr), light glares (glr), background clutter (bgr), occlusions (ocl), in-plane rotations (rot), zoom (zom), tilted viewpoint (vpc), small size/far away (sml), object shadows (shd) and reflections (rfl), plus the clean view (ok). Table 6 shows results averaged over 5 data splits. We note that JBLD outperforms the baselines. The factors most affecting supervised domain adaptation are the small size (sml) of exhibits/distant view, low light (lgt) and blur (blr). The corresponding top-1 accuracies of 34.1, 48.6 and \(51.6\%\) are below our average top-1 accuracy of \(64.2\%\) listed in Table 5. In contrast, images with shadows (shd), zoom (zom) and reflections (rfl) score 70.4, 70.0 and \(67.5\%\) top-1 accuracy (above the \(64.2\%\) average). Our wearable cameras also captured a few clean shots, which score \(81.0\%\) top-1 accuracy. Thus, we claim that domain adaptation methods need to evolve to deal with such adverse factors. Our supplementary material presents further analysis of combined factors. Figure 4 shows hard to recognize instances.

Table 7. Challenge III. (Left) Open MIC performance on the combined set w.r.t. the pairs of 12 factors detailed in Sect. 5.2. Top-1 accuracies for our JBLD approach are listed. The top row shows results w.r.t. the original 12 factors. Color-coded cells are normalized w.r.t. entries of this row. For each column, intense/pale red indicates better/worse results compared to the top cell, respectively. (Right) Target image counts for pairs of factors.

Moreover, Table 7 presents the results (left) and the image counts (right) w.r.t. pairs of factors co-occurring together. The combination of (sml) with (glr), (blr), (bgr), (lgt), (rot) and (vpc) results in 13.5, 21.0, 29.9, 31.2, 32.6 and 33.2% mean top-1 accuracy, respectively. Therefore, these pairs of factors affect the quality of recognition the most.

Challenge IV. For unsupervised domain adaptation algorithms, we use all source data (labeled instances) for training and all target data as unlabeled input. As previously, we extract 3 patches per image and train Learning an Invariant Hilbert Space (IHS) [12], Unsupervised Domain Adaptation with Residual Transfer Networks (RTN) [42] and Joint Adaptation Networks (JAN) [43]. Table 8 shows the results on the 10 subsets of the Open MIC dataset. Unsupervised (IHS), (RTN) and (JAN) scored on average 48.3, 49.1 and 52.1%. For the (Gls) split, which yielded 26.0, 30.5 and 34.2% top-1 accuracy, an extreme domain shift prevented the algorithms from successful adaptation. On (Sci), unsupervised (IHS), (RTN) and (JAN) scored 63.3, 62.2 and 69.8%. On (Hon), they scored 67.3, 71.1 and 72.5%. For simple domain shifts, unsupervised domain adaptation yields visible improvements. For harder domain shifts, the supervised JBLD approach from Table 4 works much better. Lastly, for the (Hon) and (Shx) splits and (JAN), we added 4.3K and 13K unlabeled target frames (1 photo/s) and obtained 74.0% and 32.6% accuracy, a 1.5 and 0.6% increase over using the low number of target images; adding many unlabeled images has only a small positive impact.

Table 8. Unsupervised domain adaptation: Open MIC performance on the 10 subsets.

6 Conclusions

We have collected, annotated and evaluated a new challenging Open MIC dataset with the source and target domains formed by images from Android phones and wearable cameras, respectively. We covered 10 distinct exhibition spaces in 10 different museums to collect realistic in-the-wild target data, in contrast to typical photos for which the users control the shutter. We have provided a number of useful baselines, e.g., breakdowns of results per exhibition, combined scores and an analysis of factors detrimental to domain adaptation and recognition. Unsupervised domain adaptation and few-shot learning methods can also be compared to our baselines. Moreover, we proposed orthogonal improvements to supervised domain adaptation, e.g., we integrated non-trivial non-Euclidean distances and Nyström projections for better results and tractability. We will make our data and evaluation scripts available to researchers.