End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss

Cross-modality retrieval encompasses retrieval tasks where the fetched items are of a different type than the search query, e.g., retrieving pictures relevant to a given text query. The state-of-the-art approach to cross-modality retrieval relies on learning a joint embedding space of the two modalities, where items from either modality are retrieved using nearest-neighbor search. In this work, we introduce a neural network layer based on canonical correlation analysis (CCA) that learns better embedding spaces by analytically computing projections that maximize correlation. In contrast to previous approaches, the CCA layer allows us to combine existing objectives for embedding space learning, such as pairwise ranking losses, with the optimal projections of CCA. We show the effectiveness of our approach for cross-modality retrieval on three different scenarios (text-to-image, audio-sheet-music and zero-shot retrieval), surpassing both Deep CCA and a multi-view network using freely learned projections optimized by a pairwise ranking loss, especially when little training data is available (the code for all three methods is released at: https://github.com/CPJKU/cca_layer).


Introduction
Cross-modality retrieval is the task of retrieving relevant items of a different modality than the search query (e.g., retrieving an image given a text query). One approach to tackle this problem is to define transformations that embed samples from different modalities into a common vector space. We can then project a query into this embedding space and retrieve, via nearest-neighbor search, a corresponding candidate projected from the other modality.
A particularly successful class of models uses parametric nonlinear transformations (e.g., neural networks) for the embedding projections, optimized via a retrieval-specific objective such as a pairwise ranking loss (Kiros et al., 2014; Socher et al., 2014). This loss aims at decreasing the distance (a differentiable function such as Euclidean or cosine distance) between matching items, while increasing it between mismatching ones. Specialized extensions of this loss achieved state-of-the-art results in various domains such as natural language processing (Hermann & Blunsom, 2013), image captioning (Karpathy & Fei-Fei, 2015), and text-to-image retrieval (Vendrov et al., 2016).
In a different approach, Yan & Mikolajczyk (2015) propose to learn a joint embedding of text and images using Deep Canonical Correlation Analysis (DCCA) (Andrew et al., 2013).
Instead of a pairwise ranking loss, DCCA directly optimizes the correlation of learned latent representations of the two views. Given the correlated embedding representations of the two views, it is possible to perform retrieval via cosine distance. In summary, pairwise ranking losses optimize objectives to learn embeddings that are useful for retrieval, and allow incorporating domain knowledge into the loss function. On the other hand, DCCA is designed to maximize correlation-which has already proven to be useful for cross-modality retrieval (Yan & Mikolajczyk, 2015)-but does not allow using loss formulations specialized for the task at hand.
In this paper, we propose a method to combine both approaches in a way that retains their advantages. We develop a Canonical Correlation Analysis Layer (CCAL) that can be inserted into a dual-view neural network to produce an optimally correlated embedding space for its latent representations. We can then apply task-specific loss functions, in particular the pairwise ranking loss, on the output of this layer. To train a network using the CCA layer, we describe how to backpropagate the gradient of this loss function to the dual-view neural network while relying on automatic differentiation tools such as Theano (Theano Development Team, 2016). In our experiments, we show that our proposed method performs better than DCCA and models using a pairwise ranking loss alone, especially when little training data is available.
Figure 1 compares our proposed approach to the alternatives discussed above. DCCA is a loss function that optimizes a dual-view neural network such that its two views will be maximally correlated (Figure 1a). Pairwise ranking losses are loss functions to optimize a dual-view neural network such that its two views are well-suited for nearest-neighbor retrieval in the embedding space (Figure 1b). In our approach, we boost optimization of a pairwise ranking loss based on cosine distance by placing an additional layer, the CCA projection layer, between a dual-view neural network and the optimization target (Figure 1c).

Canonical Correlation Analysis
In this section we review the concepts of CCA, the basis for our methodology. Let x ∈ R^{d_x} and y ∈ R^{d_y} denote two random column vectors with covariances Σ_xx and Σ_yy and cross-covariance Σ_xy. The objective of CCA is to find two matrices A* ∈ R^{d_x×k} and B* ∈ R^{d_y×k} composed of k paired column vectors A_j and B_j (with k ≤ d_x and k ≤ d_y) that project x and y into a common space maximizing their componentwise correlation:

(A*, B*) = argmax_{A,B} corr(A^T x, B^T y).   (1)

Since the objective of CCA is invariant to scaling of the projection matrices, we constrain the projected dimensions to have unit variance. Furthermore, CCA seeks subsequently uncorrelated projection vectors, arriving at the equivalent formulation:

(A*, B*) = argmax_{A^T Σ_xx A = B^T Σ_yy B = I} tr(A^T Σ_xy B).   (2)

Let T = Σ_xx^{-1/2} Σ_xy Σ_yy^{-1/2}, and let T = U diag(d) V^T be the Singular Value Decomposition (SVD) of T with ordered singular values d_i ≥ d_{i+1}. As shown by Mardia et al. (1979), we obtain A* and B* from the top k left- and right-singular vectors of T:

A* = Σ_xx^{-1/2} U_{:k},   B* = Σ_yy^{-1/2} V_{:k}.   (3)

Moreover, the correlation in the projection space is the sum of the top k singular values:

corr(A*^T x, B*^T y) = Σ_{i=1..k} d_i.   (4)

In practice, the covariances and cross-covariance of x and y are usually not known, but estimated from a training set of m paired vectors, expressed as matrices X ∈ R^{d_x×m} and Y ∈ R^{d_y×m}:

Σ̂_xx = 1/(m−1) X̄ X̄^T + r I,   Σ̂_xy = 1/(m−1) X̄ Ȳ^T,

where X̄ = X − μ_x denotes the mean-centered data and Σ̂_yy is defined analogously to Σ̂_xx. The regularization parameter r ensures that the covariance estimates are positive definite. Substituting these estimates for Σ_xx, Σ_xy and Σ_yy, respectively, we can compute A* and B* using Equation (3).
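For concreteness, the closed-form computation just described can be sketched in NumPy. This is an illustrative implementation of Equation (3) with the regularized covariance estimates; the function name and the choice of computing inverse square roots via symmetric eigendecompositions are our own, not taken from the released code:

```python
import numpy as np

def cca_projections(X, Y, k, r=1e-4):
    """Compute CCA projection matrices A*, B* from paired samples.

    X: (d_x, m) and Y: (d_y, m) hold m paired column vectors.
    Returns A* (d_x, k), B* (d_y, k) and the top-k canonical correlations.
    An illustrative sketch; `r` is the covariance regularizer.
    """
    m = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)   # mean-center the data
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / (m - 1) + r * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / (m - 1) + r * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / (m - 1)

    def inv_sqrt(S):
        # inverse matrix square root via symmetric eigendecomposition
        e, V = np.linalg.eigh(S)
        return V @ np.diag(e ** -0.5) @ V.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    T = Sxx_is @ Sxy @ Syy_is
    U, d, Vt = np.linalg.svd(T)              # T = U diag(d) V^T
    A = Sxx_is @ U[:, :k]                    # A* = Sxx^{-1/2} U_k
    B = Syy_is @ Vt.T[:, :k]                 # B* = Syy^{-1/2} V_k
    return A, B, d[:k]
```

On a batch of paired samples, the returned d_i estimate the canonical correlations, and projecting with A* and B* yields componentwise-correlated views.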

Cross-Modality Retrieval Baselines
In this section we review the two most related works forming the basis for our approach. Andrew et al. (2013) propose an extension of CCA to learn parametric nonlinear transformations of two random vectors, such that their correlation is maximized. Let a ∈ R^{d_a} and b ∈ R^{d_b} denote two random vectors, and let x = f(a; Θ_f) and y = g(b; Θ_g) denote their nonlinear transformations, parameterized by Θ_f and Θ_g. DCCA optimizes the parameters Θ_f and Θ_g to maximize the correlation of the topmost hidden representations x and y. For d_x = d_y = k, this objective corresponds to Equation (4), i.e., the sum of all singular values of T, also called the trace norm:

Deep Canonical Correlation Analysis
corr(x, y) = ||T||_tr = tr((T^T T)^{1/2}).   (5)

Andrew et al. (2013) show how to compute the gradient of this Trace Norm Objective (TNO) with respect to x and y. Assuming f and g are differentiable with respect to Θ_f and Θ_g (as is the case for neural networks), this allows optimizing the nonlinear transformations via gradient-based methods. Yan & Mikolajczyk (2015) suggest the following procedure to exploit DCCA for cross-modality retrieval: first, neural networks f and g are trained using the TNO, with a and b representing different views of an entity (e.g., image and text); then, after training is finished, the CCA projections are computed using Equation (3), and all result candidates are projected into the embedding space; finally, at test time, queries of either modality are projected into the embedding space, and the best-matching sample from the other modality is found through nearest-neighbor search using the cosine distance. In our experiments, we will refer to this approach as DCCA-2015. DCCA is limited by design to the objective function described in Equation (5), and only seeks to maximize the correlation in the embedding space. During training, the CCA projection matrices are never computed, nor are the samples projected into the common retrieval space. All the retrieval steps-most importantly, the computation of the CCA projections-are performed only after the networks f and g have been optimized. This restricts potential applications, because we cannot use the projected data as an input to subsequent layers or task-specific objectives. We will show how our approach overcomes this limitation in Section 4.
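As an illustration of the quantity DCCA maximizes, the TNO of Equation (5) can be evaluated on a batch of hidden representations as follows. This is a hedged NumPy sketch; the function name and the regularization constant r are our assumptions, not the authors' implementation:

```python
import numpy as np

def trace_norm_objective(X, Y, r=1e-4):
    """Total correlation ||T||_tr = sum of singular values of T (Eq. 5).

    X, Y: (d, m) batches of hidden representations; the network parts
    f and g producing them are assumed to exist elsewhere.
    """
    m = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / (m - 1) + r * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / (m - 1) + r * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / (m - 1)

    def inv_sqrt(S):
        e, V = np.linalg.eigh(S)
        return V @ np.diag(e ** -0.5) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    # trace norm = sum of singular values of T
    return np.linalg.svd(T, compute_uv=False).sum()
```

For identical views the TNO approaches the embedding dimensionality (every canonical correlation is close to 1), while for independent views it stays near zero; gradient ascent on this value is what drives DCCA.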

Pairwise Ranking Loss
Kiros et al. (2014) learn a multi-modal joint embedding space for images and text. They use the cosine of the angle between two corresponding vectors x and y as a scoring function, i.e., s(x, y) = cos(x, y). Then, they optimize a pairwise ranking loss

L_rank = Σ_x Σ_k max{0, α − s(x, y) + s(x, y_k)},   (6)

where x is an embedded sample of the first modality, y is the matching embedded sample of the second modality, and y_k are the contrastive (mismatching) embedded samples of the second modality (in practice, all mismatching samples in the current mini-batch). This loss encourages an embedding space where the cosine distance between matching samples is lower than the cosine distance between mismatching samples. In this setting, the networks f and g have to learn the embedding projections freely from randomly initialized weights; their topmost layers play the role that the CCA projections play in DCCA. Since the projections are learned from scratch by optimizing a ranking loss, in our experiments we denote this approach by Learned-L_rank. Figure 1b shows a sketch of this paradigm.
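A minimal NumPy sketch of this mini-batch ranking loss (Equation (6)); the vectorized form and the function name are ours, under the assumption that every other sample in the batch serves as a contrastive example:

```python
import numpy as np

def pairwise_ranking_loss(X, Y, alpha=0.5):
    """Cosine-based pairwise ranking loss over a mini-batch (cf. Eq. 6).

    X, Y: (m, k) embedded pairs; row i of X matches row i of Y, and all
    other rows of Y in the batch act as contrastive samples y_k.
    `alpha` is the margin.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = Xn @ Yn.T                       # S[i, j] = cos(x_i, y_j)
    s_match = np.diag(S)                # matching scores s(x_i, y_i)
    hinge = np.maximum(0.0, alpha - s_match[:, None] + S)
    np.fill_diagonal(hinge, 0.0)        # exclude the matching pair itself
    return hinge.sum() / X.shape[0]     # average over queries
```

With a margin of α = 0.5 and perfectly aligned orthonormal embeddings, all hinge terms vanish and the loss is zero; permuting the matches makes it strictly positive.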

Learning with Canonically Correlated Embedding Projections
In the following we explain how to bring both concepts together to enhance cross-modality embedding space learning.

Motivation
We start by providing an intuition on why we expect this combination to be fruitful: DCCA-2015 maximizes the correlation between the latent representations of two different neural networks via the TNO derived from classic CCA. As correlation and cosine distance are related, we can also use such a network for cross-modality retrieval (Yan & Mikolajczyk, 2015). Kiros et al. (2014), on the other hand, learn a cross-modality retrieval embedding by optimizing an objective customized for the task at hand. The motivation for our approach is that we want to benefit from both: a task-specific retrieval objective, and componentwise optimally correlated embedding projections.
To achieve this, we devise a CCA layer that analytically computes the CCA projections A* and B* during training, and projects incoming samples into the embedding space. The projected samples can then be used in subsequent layers, or for computing task-specific losses such as the pairwise ranking loss. Figure 1c illustrates the central idea of our combined approach. Compared to Figure 1b, we insert an additional linear transformation. However, this transformation is not learned (otherwise it could be merged with the previous layer, which is not followed by a nonlinearity). Instead, it is computed to be the transformation that maximizes the componentwise correlation between the two views. A* and B* in Figure 1c are the very projections given by Equation (3).
In theory, optimizing a pairwise ranking loss alone could yield projections equivalent to the ones computed by CCA. In practice, however, we observe that the proposed combination gives much better cross-modality retrieval results (see Sec. 5).
Our design requires backpropagating errors through the analytical computation of the CCA projection matrices. DCCA (Andrew et al., 2013) does not cover this, since projecting the data is not necessary for optimizing the TNO. In the remainder of this section, we discuss how to establish gradient flow (backpropagation) through CCA's optimal projection matrices. In particular, we require the partial derivatives ∂A*/∂(x, y) and ∂B*/∂(x, y) of the projections with respect to their input representations x and y. This will allow us to use CCA as a layer within a multi-modality neural network, instead of only as a final objective (TNO) for correlation maximization.

Gradient of CCA Projections
As mentioned above, we can compute the canonical correlation along with the optimal projection matrices from the singular value decomposition of T. Specifically, we obtain the correlation as corr(A*^T x, B*^T y) = Σ_i d_i, and the projections as in Equation (3). For DCCA, it suffices to compute the gradient of the total correlation w.r.t. x and y in order to backpropagate it through the two networks f and g. Using the chain rule, Andrew et al. (2013) decompose this into the gradients of the total correlation w.r.t. Σ_xx, Σ_xy and Σ_yy, and the gradients of those w.r.t. x and y. Their derivations of the former make use of the fact that both the gradient of Σ_i d_i w.r.t. T and the gradient of ||T||_tr (the trace norm objective in Equation (5)) w.r.t. T^T T have a simple form; see Andrew et al. (2013, Sec. 7) for details.
In our case, where we would like to backpropagate errors through the CCA transformations, we instead need the gradients of the projected data x* = A*^T x and y* = B*^T y w.r.t. x and y, which requires the partial derivatives ∂A*/∂(x, y) and ∂B*/∂(x, y). We could again decompose this into the gradients w.r.t. T, the gradients of T w.r.t. Σ_xx, Σ_xy and Σ_yy, and the gradients of those w.r.t. x and y. However, while the gradients of U and V w.r.t. T are known (Papadopoulo & Lourakis, 2000), they involve solving O((d_x d_y)^2) linear 2×2 systems. Instead, we reformulate the solution to use two symmetric eigendecompositions T T^T = U diag(e) U^T and T^T T = V diag(e) V^T (Petersen & Pedersen, 2012, Eq. 270). This gives us the same left and right singular vectors we would obtain from the SVD, along with the squared singular values (e_i = d_i^2). The gradients of eigenvectors of symmetric real eigensystems have a simple form (Magnus, 1985, Eq. 7), and both T T^T and T^T T are differentiable w.r.t. x and y.
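The equivalence underlying this reformulation is easy to check numerically: the eigenvectors of T T^T and T^T T coincide (up to sign) with the left and right singular vectors of T, and the eigenvalues are the squared singular values. A small illustrative NumPy check:

```python
import numpy as np

rng = np.random.default_rng(7)
T = rng.standard_normal((4, 3))

U_svd, d, Vt_svd = np.linalg.svd(T, full_matrices=False)

e_u, U = np.linalg.eigh(T @ T.T)   # T T^T = U diag(e) U^T
e_v, V = np.linalg.eigh(T.T @ T)   # T^T T = V diag(e) V^T
# eigh returns ascending eigenvalues; reorder descending to match the SVD
U = U[:, ::-1][:, :3]
V = V[:, ::-1]
e = e_v[::-1]

# squared singular values match (e_i = d_i^2) ...
assert np.allclose(np.sqrt(e), d)
# ... and the eigenvector columns agree with the SVD up to sign
for j in range(3):
    assert np.isclose(abs(U[:, j] @ U_svd[:, j]), 1.0)
    assert np.isclose(abs(V[:, j] @ Vt_svd[j]), 1.0)
```

The sign ambiguity visible in the last check is exactly what the sign-flipping step in the CCA layer (see the supplemental materials) resolves.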
To summarize: in order to obtain an efficiently computable definition of the gradient for CCA projections, we have reformulated the forward pass (the computation of the CCA transformations). Our formulation using two eigendecompositions translates into a series of computation steps that are differentiable in a graph-based, auto-differentiating math compiler such as Theano (Theano Development Team, 2016), which, together with the chain rule, gives an efficient implementation of the CCA layer gradient for training our network. For a detailed description of the CCA layer forward pass we refer to Algorithm 1 in the supplemental materials. As the technical implementation is not straightforward, we also discuss the crucial steps in the supplemental materials. Thus, we now have the means to benefit from the optimal CCA projections while still optimizing for a task-specific objective. In particular, we apply the pairwise ranking loss of Equation (6) on top of an intermediate CCA embedding layer. We denote the proposed retrieval network of Figure 1c as CCAL-L_rank in our experiments (CCAL refers to CCA Layer).

Experiments
We evaluate our approach (CCAL-L_rank) in cross-modality retrieval experiments on two image-to-text datasets and one audio-to-sheet-music dataset. For comparison, we consider the approach of Yan & Mikolajczyk (2015) (DCCA-2015), our own implementation of the TNO (denoted by DCCA), as well as the freely learned projection embeddings (Learned-L_rank) optimizing the ranking loss of Kiros et al. (2014).
The task for all three datasets is to retrieve the correct counterpart when given an instance of the other modality as a search query. For retrieval, we use the cosine distance in the embedding space for all approaches. First, we embed all candidate samples of the target modality into the retrieval embedding space. Then, we embed the query element y with the second network and select its nearest neighbor x_j from the target modality.
As evaluation measures, we consider Recall@k (R@k) as well as the Median Rank (MR) and the Mean Average Precision (MAP). The R@k rate (higher is better) is the ratio of queries which have the correct corresponding counterpart within the first k retrieval results. The MR (lower is better) is the median position of the target in a similarity-ordered list of available candidates. Finally, we define the MAP (higher is better) as the mean value of 1/rank over all queries, where rank is again the position of the target in the ordered list of available candidates.
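These three measures can be computed from a query-by-candidate distance matrix in a few lines. A hedged sketch, assuming (as in our setup) that the correct counterpart of query i is candidate i:

```python
import numpy as np

def retrieval_metrics(D, ks=(1, 5, 10)):
    """R@k, Median Rank and MAP from a distance matrix.

    D[i, j] is the distance between query i and candidate j; the correct
    counterpart of query i is candidate i. Illustrative only.
    """
    order = np.argsort(D, axis=1)   # candidates sorted closest-first
    # 1-based rank of the correct candidate for each query
    ranks = 1 + np.argmax(order == np.arange(D.shape[0])[:, None], axis=1)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    median_rank = float(np.median(ranks))
    mean_ap = float(np.mean(1.0 / ranks))
    return recall, median_rank, mean_ap
```

For a distance matrix whose diagonal entries are the unique minima of their rows, every rank is 1, so R@k = 1.0, MR = 1.0 and MAP = 1.0.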

Image-Text Retrieval
In the first part of our experiments, we consider Flickr30k and IAPR TC-12, two publicly available datasets for image-text cross-modality retrieval. Flickr30k consists of image-caption pairs, where each image is annotated with five different textual descriptions. The train-validation-test split for Flickr30k is 28000-1000-1000. In terms of evaluation setup, we follow Protocol 3 of Yan & Mikolajczyk (2015) and concatenate the five available captions into one, meaning that only one, but richer, text annotation remains per image. This is done for all three sets of the split. The second image-text dataset, IAPR TC-12, contains 20000 natural images where only one-but compared to Flickr30k more detailed-caption is available for each image. As no predefined train-validation-test split is provided, we randomly select 1000 images for validation and 2000 for testing, and keep the rest for training. Yan & Mikolajczyk (2015) also use 2000 images for testing, but did not explicitly mention holdout images for validation.
The input to our networks is a 4096-dimensional image feature vector along with a corresponding text vector representation, which has dimensionality 5793 for Flickr30k and 2048 for IAPR TC-12. The image features are computed from the last hidden layer of a network pretrained on ImageNet (layer fc7 of CNN_S by Chatfield et al. (2014)). In terms of text pre-processing, we follow Yan & Mikolajczyk (2015), tokenizing and lemmatizing the raw captions as the first step. Based on the lemmatized captions, we compute l2-normalized TF/IDF vectors, omitting words with an overall occurrence smaller than five for Flickr30k and three for IAPR TC-12, respectively. The image representation is processed by a linear dense layer with 128 units, which is also the dimensionality k of the resulting retrieval embedding. The text vector is fed through two batch-normalized (Ioffe & Szegedy, 2015) dense layers of 1024 units each with the ELU activation function (Clevert et al., 2015). As the last layer of the text representation network, we again apply a dense layer with 128 linear units.
For a fair comparison, we keep the structure and number of parameters of all networks in our experiments the same. The only differences between the networks are the objectives and the hyperparameters used for optimization. Optimization is performed using Stochastic Gradient Descent (SGD) with the adam update rule (Kingma & Ba, 2014) (for details please see our supplemental materials).
Table 1 lists our results on IAPR TC-12. Along with our experiments, we also show the results reported in (Yan & Mikolajczyk, 2015) as a reference (DCCA-2015). However, a direct comparison to our results may not be fair: DCCA-2015 uses a different ImageNet-pretrained network for the image representation, and finetunes this network while we keep it fixed. This is because our interest is in comparing the methods, not in obtaining the best possible results. Our implementation of the TNO (DCCA) uses the same objective as DCCA-2015, but is trained using the same network architecture as our remaining models and permits a direct comparison. Additionally, we repeat each of the experiments 10 times with different initializations and report the mean for each of the evaluation measures.

Table 1: Retrieval results on IAPR TC-12. Columns report R@1, R@5, R@10, MR and MAP for both retrieval directions (Image-to-Text and Text-to-Image).

When taking a closer look at Table 1, we observe that our results achieved by optimizing the TNO (DCCA) surpass the results reported in (Yan & Mikolajczyk, 2015). We already discussed above that the two versions are not directly comparable. However, given this result, we consider our implementation of DCCA a valid baseline for our experiments in Section 5.2, where no results are available in the literature. Looking at the performance of CCAL-L_rank, we further observe that it outperforms all other methods, although the difference to DCCA is not pronounced for all of the measures. Comparing CCAL-L_rank with the freely-learned projection matrices (Learned-L_rank), we observe a much larger performance gap. This is interesting, as in principle the learned projections could converge to exactly the same solution as CCAL-L_rank. We take this as a first quantitative confirmation that the learning process benefits from CCA's optimal projection matrices.
In Table 2, we list our results on the Flickr30k dataset. As above, we show the retrieval performance of Yan & Mikolajczyk (2015) as a baseline along with our results, and observe similar behavior as on IAPR TC-12. Again, we point out the poor performance of the freely-learned projections (Learned-L_rank) in this experiment. Keeping this observation in mind, we will notice a different behavior in the experiments in Section 5.2.
Note that there are various other methods reporting results on Flickr30k (Karpathy et al., 2014; Socher et al., 2014; Mao et al., 2014; Kiros et al., 2014) which partly surpass ours, for example by using more elaborate processing of the textual descriptions or more powerful ImageNet models. We omit these results as we focus on the comparison of DCCA and freely-learned projections with the proposed CCA projection embedding layer.

Audio-Sheet-Music Retrieval
For the second set of experiments, we consider the Nottingham piano midi dataset (Boulanger-Lewandowski et al., 2012). The dataset is a collection of midi files split into train, validation and test set, already used by Dorfer et al. (2016) for experiments on end-to-end score-following in sheet-music images. Here, we tackle the problem of audio-sheet-music retrieval, i.e., matching short snippets of music (audio) to corresponding parts in the sheet music (image). Figure 2 shows examples of such correspondences. We conduct this experiment for two reasons: first, to show the applicability of the proposed method across different domains; second, the data and application are of high practical relevance in the field of Music Information Retrieval (MIR). A system capable of linking sheet music (images) and the corresponding music (audio) would be useful in many content-based musical retrieval scenarios.
In terms of audio preparation, we compute log-frequency spectrograms at a sample rate of 22.05 kHz, with an FFT window size of 2048 and a computation rate of 31.25 frames per second. These spectrograms (136 frequency bins) are then directly fed into the audio part of the cross-modality networks. Figure 2 shows a set of audio-to-sheet correspondences presented to our network for training. One audio excerpt comprises 100 frames, and the dimension of a sheet image snippet is 40×100 pixels. Overall, this results in 270,705 train, 18,046 validation and 16,042 test audio-sheet-music pairs. This is an order of magnitude more training data than for the image-to-text datasets of the previous section.
In the experiments in Section 5.1, we relied on pretrained ImageNet features and relatively shallow fully-connected text-processing networks. The model here differs from this, as it consists of two deep convolutional networks learned entirely from scratch. Our architecture is a VGG-style network (Simonyan & Zisserman, 2014) consisting of sequences of 3×3 convolution stacks followed by 2×2 max pooling. To reduce the dimensionality to the desired correlation space dimensionality k (in this case 32), we insert as a final building block a 1×1 convolution with k feature maps, followed by global average pooling (Lin et al., 2013) (for further details we again refer to the supplemental materials).
Table 3 lists our results on audio-to-sheet-music retrieval. As in the experiments on images and text, the proposed CCA projection embedding layer trained with a pairwise ranking loss outperforms the remaining models. Recalling the results from Section 5.1, we observe an increased performance of the freely-learned embedding projections. On measures such as R@5 or R@10, they achieve performance similar to or better than DCCA. One reason could be that there is an order of magnitude more training data available for this task to learn the projection embedding from random initialization. Still, the proposed combination of both concepts (CCAL-L_rank) achieves the highest retrieval scores.
Table 3: Retrieval results on Nottingham dataset (Audio-to-Sheet-Music Retrieval).

Performance in Small Data Regime
The above results suggest that the benefit of using a CCA projection layer (CCAL-L_rank) over a freely-learned projection becomes most evident when little training data is available.
To examine this assumption, we repeat the audio-to-sheet-music experiment of the previous section, but use only 10% of the original training data (≈ 27000 samples). We stress again that the learned embedding projection of Learned-L_rank could converge to exactly the same solution as the CCA projections of CCAL-L_rank. Table 4 summarizes the low-data-regime results for the three methods. Consistent with our hypothesis, we observe a larger gap between Learned-L_rank and CCAL-L_rank compared to the one obtained with all training data in Table 3. We conclude that a network may be able to learn suitable embedding projections when sufficient training data is available. However, with fewer training samples, the proposed CCA projection layer strongly supports embedding space learning. In addition, we also looked into the retrieval performance of Learned-L_rank and CCAL-L_rank on the training set and observe comparable performance. This indicates that the CCA layer also acts as a regularizer and helps to generalize to unseen samples.
Table 4: Retrieval results on audio-to-sheet-music retrieval when using only 10% of the train data.

Conclusion
We have shown how to use the optimal projection matrices of CCA as the weights of an embedding layer within a multi-view neural network. With this CCA layer, it becomes possible to optimize a specialized loss function (e.g., related to a retrieval task) on top of the correlated latent space provided by CCA. As this requires establishing gradient flow through CCA, we formulate it to allow easy computation of the partial derivatives ∂A*/∂(x, y) and ∂B*/∂(x, y) of CCA's projection matrices A* and B* with respect to the input data x and y. With this formulation, we can incorporate CCA as a building block within multi-modality neural networks that produces maximally correlated projections of its inputs. In our experiments, we use this building block within a cross-modality retrieval setting, optimizing a network to minimize a cosine-distance-based pairwise ranking loss of the componentwise-correlated CCA projections. Experimental results show that when using the cosine distance for retrieval (as is common for correlated views), this is superior to optimizing a network for maximally correlated projections (as done in DCCA), or not using CCA at all. Finally, we emphasize that our CCA layer is a general network component which could provide a useful basis for further research, e.g., as an intermediate processing step for learning binary cross-modality retrieval representations.

Implementation Details
Backpropagating the errors through the CCA projection matrices is not trivial. The optimal CCA projection matrices are given by A* = Σ_xx^{-1/2} U_{:k} and B* = Σ_yy^{-1/2} V_{:k}, where U and V are derived from the singular value decomposition of T = Σ_xx^{-1/2} Σ_xy Σ_yy^{-1/2} = U diag(d) V^T (see Section 2). The main technical challenge is that common auto-differentiation tools such as Theano (Theano Development Team, 2016) or TensorFlow (Abadi et al., 2015) do not provide derivatives for the inverse square root and the singular value decomposition of a matrix. To overcome this, we replace the inverse square root of a matrix by its Cholesky decomposition, as described in (Hardoon et al., 2004). Furthermore, we note that the singular value decomposition is only required to obtain the matrices U and V, and those matrices can alternatively be obtained by solving the eigendecompositions of T T^T = U diag(e) U^T and T^T T = V diag(e) V^T (Petersen & Pedersen, 2012, Eq. 270). This yields the same left and right singular vectors we would obtain from the SVD (except for possibly flipped signs, which are easy to fix), along with the squared singular values (e_i = d_i^2). Note that T T^T and T^T T are symmetric, and that the gradients of eigenvectors of symmetric real eigensystems have a simple form (Magnus, 1985, Eq. 7). Furthermore, T T^T and T^T T are differentiable w.r.t. x and y, enabling a sufficiently efficient implementation in a graph-based, auto-differentiating math compiler.
The next section provides a detailed description of the implementation of the CCA layer.

Forward Pass of CCA Projection Layer
For easier reproducibility, we provide a detailed description of the forward pass of the proposed CCA layer in Algorithm 1. To train the model, we need to propagate the gradient through the CCA layer (backward pass). We rely on auto-differentiation tools (in particular, Theano) implementing the gradient for each individual computation step in the forward pass, and connecting them using the chain rule.
The layer itself takes the latent feature representations (a batch of m paired vectors X ∈ R^{d_x×m} and Y ∈ R^{d_y×m}) of the two network pathways f and g as input and projects them with CCA's analytical projection matrices. At training time, the layer uses the optimal projections computed from the current batch. When applying the layer at test time, it uses the statistics and projections remembered from the last training batch (which can of course be recomputed on a larger training batch to get a more stable estimate).
As not all of the computation steps are obvious, we provide further details for the crucial ones. In lines 12 and 13, we compute the Cholesky factorization instead of the matrix square root, as the latter has no gradients implemented in Theano. As a consequence, we need to transpose C_yy^{-1} when computing T in line 14 (Hardoon et al., 2004). In lines 15 and 16, we compute two eigendecompositions instead of one singular value decomposition (which also has no gradients implemented in Theano). In line 19, we flip the signs of the first projection matrix to match the second, so that only positive correlations remain. This property is required for retrieval with the cosine distance. Finally, in lines 24 and 25, the two views are projected using A* and B*. At test time, we apply the projections computed and stored during training (line 17).
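To make the described steps concrete, here is a hedged NumPy sketch of the forward pass (training-time branch only). It mirrors the Cholesky factorization, the two eigendecompositions, and the sign flipping described above, but the variable names and the regularization constant are ours; it is not a verbatim transcription of Algorithm 1:

```python
import numpy as np

def cca_layer_forward(X, Y, r=1e-3):
    """Sketch of the CCA layer forward pass (training-time branch).

    X: (d_x, m), Y: (d_y, m) batch of hidden representations.
    Uses Cholesky factors in place of inverse matrix square roots and two
    eigendecompositions in place of the SVD. Returns the projected views.
    """
    m = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / (m - 1) + r * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / (m - 1) + r * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / (m - 1)

    Cxx = np.linalg.cholesky(Sxx)          # Sxx = Cxx Cxx^T
    Cyy = np.linalg.cholesky(Syy)
    Cxx_inv = np.linalg.inv(Cxx)
    Cyy_inv = np.linalg.inv(Cyy)
    T = Cxx_inv @ Sxy @ Cyy_inv.T          # note the transpose on Cyy^{-1}

    _, U = np.linalg.eigh(T @ T.T)         # T T^T = U diag(e) U^T
    _, V = np.linalg.eigh(T.T @ T)         # T^T T = V diag(e) V^T
    k = min(X.shape[0], Y.shape[0])
    U = U[:, ::-1][:, :k]                  # eigh sorts ascending; reverse
    V = V[:, ::-1][:, :k]
    # flip signs of U's columns so all componentwise correlations are positive
    s = np.sign(np.einsum('ij,ij->j', U, T @ V))
    U = U * np.where(s == 0, 1.0, s)

    A = Cxx_inv.T @ U                      # A* (cf. Equation (3), via Cholesky)
    B = Cyy_inv.T @ V
    return A.T @ Xc, B.T @ Yc              # projected views x*, y*
```

On the training batch itself, the componentwise correlations of the returned views equal the (nonnegative) canonical correlation estimates, which is what the sign flip guarantees.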
Algorithm 1 Forward Pass of CCA Projection Layer.

Investigations on Correlation Structure
As an additional experiment, we investigate the correlation structure of the learned representations for all three paradigms. For this purpose, we compute the topmost hidden representations x and y of the audio-sheet-music pairs and estimate the canonical correlation coefficients d_i of the respective embedding spaces. For the present example, this yields 32 coefficients, which is the dimensionality k of our retrieval embedding space. The most prominent characteristic in Figure 3 is the high correlation coefficients of the representation learned with DCCA. This structure is expected, as the TNO focuses solely on correlation maximization. However, recalling the results of Table 3, we see that this does not necessarily lead to the best retrieval performance. The freely-learned embedding Learned-L_rank shows the lowest overall correlation, but achieves results comparable to DCCA on this dataset. In terms of overall correlation, CCAL-L_rank is situated in between the two other approaches. We have seen in all our experiments that combining both concepts in a unified retrieval paradigm yields the best retrieval performance across different application domains as well as data regimes. We take this as evidence that componentwise-correlated projections support cosine-distance-based embedding space learning.

Architecture and Optimization
In the following we provide additional details for the experiments carried out in Section 5.

Image-Text Retrieval
We start training with an initial learning rate of either 0.001 (all models on IAPR TC-12, and Learned-L_rank on Flickr30k) or 0.002 (DCCA and CCAL-L_rank on Flickr30k). In addition, we apply an L2 weight decay of 0.0001 and set the batch size to 1000 for all models. The margin parameter α of the ranking loss in Equation (6) is set to 0.5. After no improvement on the validation set for 50 epochs, we divide the learning rate by 10 and reduce the patience to 10 epochs. This learning rate reduction is repeated three times.
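The learning-rate refinement just described can be sketched as a small scheduler. This is illustrative pure-Python code; the function and parameter names are ours and not from the paper's implementation:

```python
def make_lr_schedule(lr_init, patience_init=50, patience_refine=10,
                     factor=10.0, max_reductions=3):
    """Patience-based learning-rate refinement: after `patience_init`
    epochs without validation improvement, divide the learning rate by
    `factor` and drop the patience to `patience_refine`; repeat the
    reduction up to `max_reductions` times."""
    state = {"lr": lr_init, "best": float("inf"), "wait": 0,
             "patience": patience_init, "reductions": 0}

    def step(val_loss):
        # call once per epoch with the current validation loss
        if val_loss < state["best"]:
            state["best"], state["wait"] = val_loss, 0
        else:
            state["wait"] += 1
            if (state["wait"] >= state["patience"]
                    and state["reductions"] < max_reductions):
                state["lr"] /= factor
                state["patience"] = patience_refine
                state["wait"] = 0
                state["reductions"] += 1
        return state["lr"]

    return step
```

With the paper's settings (initial patience 50, refined patience 10, factor 10, three reductions), a stalled run ends up at one thousandth of the initial learning rate.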

Audio-Sheet-Music Retrieval
Table 5 provides details on our audio-sheet-music retrieval architecture.
As in the experiments on images and text, we optimize our networks using adam with an initial learning rate of 0.001 and a batch size of 1000. The refinement strategy is the same as described for the image-text retrieval experiments.

Figure 1 :
Figure 1: Sketches of cross-modality retrieval networks. The proposed model in (c) unifies (a) and (b) and takes advantage of both: componentwise correlated CCA projections and a pairwise ranking loss for cross-modality embedding space learning. We emphasize that our proposal in (c) requires backpropagating the ranking loss L through the analytical computation of the optimally correlated CCA embedding projections A* and B* (compare Equation (3)). We thus need to compute their partial derivatives ∂A*/∂(x, y) and ∂B*/∂(x, y) with respect to the network's hidden representations x and y (addressed in Section 4).

Figure 2 :
Figure 2: Example of the data considered for audio-to-sheet-music (image) retrieval. Top: short snippets of sheet music images. Bottom: spectrogram excerpts of the corresponding music audio.

Figure 3 :
Figure 3: Comparison of the 32 correlation coefficients d_i (the dimensionality of the retrieval space is 32) of the topmost hidden representations x and y on the audio-to-sheet-music dataset, for each optimization paradigm. The maximum possible correlation is 1.0 for each coefficient.

Algorithm 1 (header, excerpt): Input of layer: X ∈ R^{d_x×m} and Y ∈ R^{d_y×m} (hidden representations of the current batch). Returns: X* and Y* (CCA-projected hidden representations). Parameters of layer: means μ_x, μ_y and CCA projection matrices A*, B* (updated from the statistics of the current batch at training time).