Structured (De)composable Representations Trained with Neural Networks

The paper proposes a novel technique for representing templates and instances of concept classes. A template representation refers to the generic representation that captures the characteristics of an entire class. The proposed technique uses end-to-end deep learning to learn structured and composable representations from input images and discrete labels. The obtained representations are based on distance estimates between the distributions given by the class label and those given by contextual information, which are modeled as environments. We show that the representations have a clear structure that allows them to be decomposed into factors representing classes and environments. We evaluate the novel technique on classification and retrieval tasks involving different modalities (visual and language data).


Introduction
We propose a novel technique for representing templates and instances of concept classes that is agnostic with regard to the underlying deep learning model. Starting from raw input images, representations are learned in a classification task where the cross-entropy classification layer is replaced by a fully connected layer that is used to estimate a bounded approximation of the distance between each class distribution and a set of contextual distributions that we call 'environments'. By defining randomized environments, the goal is to capture common sense knowledge about how classes relate to a range of differentiating contexts, and to increase the probability of encountering distinctive diagnostic features. This idea loosely resembles how human long-term memory might achieve retrieval [7] as well as how contextual knowledge is used for semantic encoding [6]. Our experiments confirm the value of such an approach.
In this paper, classes correspond to (visual) object labels, and environments correspond to combinations of contextual labels given by either object labels or image caption keywords. Representations for individual inputs, which we call 'instance representations', form a 2D matrix with rows corresponding to classes and columns corresponding to environments, where each element indicates how much the instance resembles the corresponding class versus environment. The parameters for each environment are defined once at the start by uniformly sampling a randomly chosen number of labels from the set of all available contextual labels. The class representation, which we refer to as 'template', has the form of a template vector: it contains the average distance estimates between the distribution of a class and the distributions of the respective environments. By computing the cosine similarity between the instance representation and all templates, class membership can be determined efficiently (Fig. 1).

Fig. 1. The last layer of a convolutional neural network is replaced with fully-connected layers that map to n_c × n_e outputs f_{i,j}, which are used to create instance representations that are interpretable along contextual dimensions, which we call 'environments'. By computing the cosine similarity, rows are compared to the corresponding class representations, which we refer to as 'templates'.
Template and instance representations are interpretable as they have a fixed structure comprised of distance estimates. This structure is reminiscent of traditional matrix representations in language processing and enables operations along the matrix dimensions. We demonstrate this with a Singular Value Decomposition (SVD), which yields components that determine the values along the rows (classes) and columns (environments), respectively. Those components can then be altered to modify the information content, after which a new representation can be reconstructed. The proposed representations are evaluated in four settings: (1) Multi-label image classification, i.e., object recognition with multiple objects per image; (2) Image retrieval, where we query images that look like existing images but contain altered class labels; (3) Single-label image classification on pre-trained instance representations for a previously unseen label; (4) Rank estimation with regard to compression of the representations.

Contributions.
(1) We propose a new deep learning technique to create structured representations from images, entity classes and their contextual information (environments), based on distance estimates. (2) This leads to template representations that generalize well, as evaluated in a classification task. (3) The obtained representations are interpretable as distances between a class and its environments. They are composable in the sense that they can be modified to reflect a different class membership, as shown in a retrieval task.

Background
We briefly discuss background related to different aspects of our research.
Representing Entities with Respect to Context. In language applications, structured matrices (e.g., document-term matrices) have been used for a long time. Such matrices can be decomposed with SVD or non-negative matrix factorization, and low-rank approximations are found with methods like latent semantic indexing. Typical applications are clustering, classification and retrieval, with the benefit that outcomes can usually be interpreted with respect to the contextual information. Contrary to our work, these earlier methods build representations purely from labels and do not take deep neural network-based features into account. More recently, [11] created an unsupervised sentence representation where each entity is a probability distribution based on the co-occurrence of words.
Distances to Represent Features. The Earth Mover's Distance (EMD), also known as the Wasserstein distance, is a useful metric based on the optimal transport problem that measures the distance between distributions. [3] use a similar idea to define the Word Mover's Distance (WMD), which measures the minimal amount of effort to move Word2Vec-based word embeddings from one document to another. The authors use a matrix representation that expresses the distance between words in the respective documents. They note that the structure is interpretable and performs well on text-based classification tasks.

Random Features.
The Word Mover's Embedding [14] is an unsupervised feature representation for documents, created by concatenating WMD estimates that are computed with respect to arbitrarily chosen feature maps. The authors calculate an approximation of the distance between a pair of documents with the use of a kernel over the feature map. The building blocks of the feature map are documents built from an arbitrary combination of words. This idea is based on the Random Features approximation [9], which suggests that a randomized feature map is useful for approximating a shift-invariant kernel.
Our work can be viewed as a combination of the above ideas: we use distance estimates to create interpretable, structured representations of entities with respect to their contexts. The contextual dimension consists of features that are built from an arbitrary combination of discrete labels. Most importantly, our work differs in the following ways: (1) we use end-to-end deep neural network training to include rich image features when building representations; (2) information from different modalities (visual and language) can be combined.

CoDiR: Method
We first define some notions that are useful to understand the method, which we name Composable Distance-based Representation learning (CoDiR).

Setup and Notations.
Given is a dataset with data samples x ∼ p_data, with non-exclusive class labels c_i, i ∈ {1, ..., n_c}, which in this work are visual object labels (e.g., dog, ball, ...). Image instances s are fed through a (convolutional) neural network N. The outputs of N serve to build templates T_{i,:} ∈ R^{n_e} and instance representations D ∈ R^{n_c × n_e}, with n_e a hyperparameter denoting the number of environments. Each environment is defined with the use of discrete environment labels l_k, k ∈ {1, ..., n_l}, for which we experiment with two types: (1) the same visual object labels as used for the class labels (such that n_l = n_c), and (2) image caption keywords from the set of the n_l most common nouns, adjectives or verbs in the sentence descriptions of the dataset. We refer to the first as 'CoDiR (class)' and to the latter as 'CoDiR (capt)'.
1_{c_i} is shorthand for the indicator function 1_{c_i}(x) = 1 if x ∈ C_i, 0 otherwise, with C_i the set of images with label c_i. Similarly, we denote 1_{l_k}, with C_{l_k} the set of images with label l_k. Each element D_{i,j} is a distance estimate between the distributions p_{c_i} and p_{e_j}, where p_{c_i} is shorthand for p(x = x, x ∈ C_i). Informally, p_{c_i} is the joint distribution modeling the data distribution and class membership c_i. To obtain p_{e_j}, several steps are performed before training. First, the hyperparameter R is set, giving the maximum number of labels per environment. For the j-th environment, we then (1) sample the actual number of labels r_j ∼ U[1, R] ∈ N; (2) sample the labels l^{(j)}_m, m ∈ {1, ..., r_j}, uniformly without replacement from the set of all discrete environment labels l_k, k ∈ {1, ..., n_l}. Now E_j, the set of images for environment e_j, is given by the images that carry at least one of the sampled labels:

E_j = ∪_{m=1}^{r_j} C_{l^{(j)}_m}.

Note that by sampling a random number of labels per environment, as inspired by [14], we ensure diversity in the composition of the environments, with some holding many labels and some few.
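The environment construction above can be sketched in a few lines. The helper names are illustrative, and the code assumes the union reading of E_j, i.e., an image belongs to environment e_j if it carries at least one of the environment's sampled labels:

```python
import random

def sample_environments(n_l, n_e, R, seed=0):
    """Sample n_e environments; environment j holds r_j ~ U[1, R] distinct
    label indices drawn uniformly without replacement from {0, ..., n_l - 1}.
    Function and parameter names are illustrative, not from the paper's code."""
    rng = random.Random(seed)
    environments = []
    for _ in range(n_e):
        r_j = rng.randint(1, R)  # actual number of labels for this environment
        environments.append(set(rng.sample(range(n_l), r_j)))
    return environments

def environment_images(environments, image_labels):
    """E_j: indices of the images carrying at least one of environment j's
    labels (the assumed union reading of the environment definition)."""
    return [
        {i for i, labels in enumerate(image_labels) if labels & env}
        for env in environments
    ]

envs = sample_environments(n_l=300, n_e=5, R=50, seed=42)
# three toy images: two labeled, one without any environment label
members = environment_images(envs, [{0, 1, 2}, {299}, set()])
```

Sampling the environments once before training, with a fixed seed, keeps the distance estimates comparable across epochs.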
Contextual Distance. We propose to represent each image as a 2D feature map that relates distributions of classes to environments. A suitable metric should be compatible with neural network training and should be able to deal with potentially overlapping distributions. A natural candidate is a Wasserstein-based distance function [1]. A key advantage is that the critic can be encouraged to maximize the distance between two distributions, whereas metrics based on the Kullback-Leibler (KL) divergence are not well defined if the distributions have a negligible intersection [1]. In comparison to other neural network-based distance metrics, the Fisher IPM provides particularly stable estimates and has the advantage that any neural network can be used as the critic f, as long as the last layer is a linear, dense layer [5]. The Fisher GAN formulation bounds F, the set of measurable, symmetric and bounded real-valued functions, by defining a data-dependent constraint on its second-order moments. The IPM is given by:

IPM_F(p_{c_i}, p_{e_j}) = sup_{f ∈ F} ( E_{x∼p_{c_i}}[f(x)] − E_{x∼p_{e_j}}[f(x)] ) / √( ½ E_{x∼p_{c_i}}[f²(x)] + ½ E_{x∼p_{e_j}}[f²(x)] )    (1)

In practice, the Fisher IPM is estimated with neural network training, where the numerator in Eq. 1 is maximized while the denominator is expressed as a constraint, enforced with a Lagrange multiplier. While the Fisher IPM is an estimate of the chi-squared distance, the numerator can be viewed as a bounded estimate of the inter-class distance, closely related to the Wasserstein distance [5]. From now on, we denote this approximation of the inter-class distance as the 'distance'. During training, critics f_{i,j} are trained on input images to maximize the Fisher IPM for the distributions p_{c_i} and p_{e_j}, ∀i ∈ {1, ..., n_c}, ∀j ∈ {1, ..., n_e}. The numerator then gives the distance between p_{c_i} and p_{e_j}. We denote

T_{i,j} = Ê_{x∼p_{c_i}}[f_{i,j}(x)] − Ê_{x∼p_{e_j}}[f_{i,j}(x)],

i.e., the evaluation of the estimated distances over the training set. Intuitively, one can see why a matrix T with co-occurrence data contains useful information.
A subset of images containing 'cats', for example, will more closely resemble a subset containing 'dogs' and 'fur' than one containing 'forks' and 'tables'.
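Once per-sample critic outputs are available, the Fisher IPM of Eq. 1 reduces to a simple ratio. The following sketch estimates that ratio empirically from two arrays of critic outputs; it is only the estimator, omitting the training loop and the Lagrange-multiplier constraint:

```python
import numpy as np

def fisher_ipm(f_p, f_q, eps=1e-8):
    """Empirical Fisher IPM ratio for a single critic (cf. Eq. 1): the
    difference of the critic's mean outputs on samples of the two
    distributions, normalised by the square root of the averaged second
    moments. Illustrative sketch, not the paper's implementation.
    f_p: critic outputs f(x) for x ~ p (1D array)
    f_q: critic outputs f(x) for x ~ q (1D array)"""
    numerator = f_p.mean() - f_q.mean()
    denominator = np.sqrt(0.5 * (f_p ** 2).mean() + 0.5 * (f_q ** 2).mean())
    return numerator / (denominator + eps)

# toy critic outputs that separate the two sample sets well
f_class = np.array([0.9, 1.1, 1.0])   # f(x) for x ~ p_ci
f_env = np.array([-1.0, -0.8, -1.2])  # f(x) for x ~ p_ej
score = fisher_ipm(f_class, f_env)
```

During training one would maximize the numerator while keeping the denominator close to one via the constraint, so that the numerator itself becomes the distance estimate.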
Template and Instance Representations. As the template representation for class c_i, we simply use the corresponding row of the learned distance matrix, T_{i,:}. Each element T_{i,j} gives an average distance estimate for how a class c_i relates to environment e_j, where smaller values indicate that class and environment are similar or even (partially) overlap. As the instance representation for an input s, we then propose to use D^(s) ∈ R^{n_c × n_e} with elements given by Eq. 2:

D^(s)_{i,j} = f_{i,j}(s) − Ê_{x∼p_{e_j}}[f_{i,j}(x)],    (2)

where f_{i,j}(s) is simply the output of critic f_{i,j} for the instance s. The result is that for an input s with class label c_i, D^(s)_{i,:} is correlated with T_{i,:}, as its distance estimates with respect to all environments should be similar. Therefore, the cosine similarity between the vector D^(s)_{i,:} and the template T_{i,:} will be large for input samples of class i, and small otherwise.
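A minimal sketch of matching an instance representation against the templates may help. The shapes, names and per-class thresholds (which the text sets on held-out data after training) are assumptions of the example:

```python
import numpy as np

def class_membership(D_s, T, thresholds):
    """Compare each class row of an instance representation D_s (n_c x n_e)
    to the corresponding template row T[i] with cosine similarity, and
    threshold the result per class. Illustrative sketch only."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = np.array([cos(D_s[i], T[i]) for i in range(T.shape[0])])
    return sims > thresholds

# two toy classes, three environments
T = np.array([[2.0, 1.0, 0.5],
              [0.5, 2.0, 2.0]])
D_match = 1.5 * T                      # rows point in the template directions
D_off = np.array([[2.0, 1.0, 0.5],
                  [-0.5, -2.0, -2.0]]) # second row opposes its template
```

Because cosine similarity ignores scale, an instance whose row merely points in the same direction as the template is recognised as a member of that class.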
Such templates can be evaluated, for example, in multi-label classification tasks (see Sect. 4). Finding the classes of an image then simply amounts to computing whether ∀c_i: cos(D^(s)_{i,:}, T_{i,:}) > t_{c_i}, with t_{c_i} a threshold (the level of which is determined during training). From here on, we will use the shorthand notation D^(s) ⊂ c_i to denote cos(D^(s)_{i,:}, T_{i,:}) > t_{c_i}.

Implementation. Training n_c × n_e separate critics is not feasible in practice, so we pass input images through a common neural network for which the classification layer is replaced by n_c × n_e single-layer neural networks, the outputs of which constitute f_{i,j} (see Fig. 1). During training, any given mini-batch will contain inputs with many different c_i and e_j. To maximize Eq. 1 efficiently, instead of feeding separate batches for the samples of x ∼ p_{c_i} and x ∼ p_{e_j}, we use the same mini-batch. Additionally, instead of directly sampling from each p_{c_i} and p_{e_j}, the indicator functions are used to select the relevant samples within the mini-batch and to compute the expectations required. From these quantities, the Fisher IPM can be calculated and optimized. Algorithm 1 explains all of the above in detail.¹ When comparing to similar neural network-based methods, the last layer imposes a slightly larger memory footprint (O(n²) vs O(n)), but training time is comparable as the networks have the same number of layers. After training completes, we perform one additional pass through the training set, where we use 2/3 of the samples to calculate the templates and the remaining 1/3 to set the thresholds for classification.²

(De-)composing Representations. As the CoDiR representations have a clear structure, a Singular Value Decomposition of D^(s), D^(s) = USV, can be performed, such that the rows of U and the columns of V can be interpreted as the corresponding factors contributed by the c_i and the e_j, respectively. This leads to two applications: (1) Composition: by modifying the elements of U, one can easily obtain Ũ with modified information content. By building a new representation D̃^(s) from Ũ, S and V, one thus obtains a representation similar to the original but with modified class membership.
This will be further explained in this section. (2) Compression: the spectral norm of the instance representations is large, and the spectrum is non-flat. One can thus compress the representations substantially by retaining only the first k singular vectors of U and V, creating representations in a lower, k-dimensional space (rank k) without significant loss of classification accuracy. If k = 1, the new representations are (91 + 300)/(91 × 300) ≈ 1.4% of the size of the original representations. We call this method C-CoDiR(k).
Let us consider in detail how to achieve composition. To keep things simple, we only discuss the case of 'CoDiR (capt)'. Given an image s for which D^(s) ⊂ c_+ and D^(s) ⊄ c_−, the goal is to modify D^(s) such that it represents an image for which D^(s) ⊄ c_+ and D^(s) ⊂ c_−, while preserving the contextual information in the environments of D^(s). As an example, consider a D^(s) of an image where D^(s) ⊂ c_dog and the discrete labels from which the environments are built indicate labels such as playing, ball and grass. The goal would then be to modify the representation into D̃^(s) (such that, for example, D̃^(s) ⊂ c_cat and D̃^(s) ⊄ c_dog) without modifying the information in the environments.
To achieve this, consider that by increasing the values of U_{c_+,:}, one can increase the distance estimates with respect to class c_+, thus expressing that the instance no longer belongs to c_+; an analogous modification of U_{c_−,:} expresses membership of c_−, after which D̃^(s) is reconstructed from Ũ, S and V.
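The composition step can be illustrated with a small sketch. Exchanging the class factors (rows of U) of c_+ and c_− is one simple illustrative modification, not necessarily the exact update described in the text, which adjusts the values of the corresponding rows:

```python
import numpy as np

def compose(D_s, c_plus, c_minus):
    """Composition sketch: decompose D_s with an SVD, exchange the class
    factors (rows of U) of c_plus and c_minus, and rebuild. Because the
    environment factors S and Vt are untouched, the contextual structure
    of the representation is preserved."""
    U, S, Vt = np.linalg.svd(D_s, full_matrices=False)
    U_mod = U.copy()
    U_mod[[c_plus, c_minus]] = U[[c_minus, c_plus]]  # swap the two class rows
    return U_mod @ np.diag(S) @ Vt

# toy 3-class, 4-environment representation
D = np.arange(12, dtype=float).reshape(3, 4)
D_tilde = compose(D, 0, 1)
```

Since row i of U S Vᵀ depends only on row i of U, swapping two rows of U swaps exactly those two class rows of the reconstruction and leaves every other row unchanged.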

Experiments
We show how CoDiR compares to a (binary) cross-entropy baseline for multi-label image classification. Additionally, CoDiR's qualities related to (de)composition, compression and rank are examined.

Setup
The experiments are performed on the COCO dataset [4], which contains multiple labels and descriptive captions for each image. We use the 2014 train/val splits of this dataset, as these sets contain the necessary labels for our experiments, and we split the validation set into two equal, arbitrary parts to obtain a validation and a test set for the classification task. We set n_c = 91, i.e., we use all 91 available class labels (which include 11 supercategories that contain other labels, e.g., 'animal' is the supercategory for 'zebra' and 'cat'). An image can contain more than one class label. To construct environments, we use either the class labels, CoDiR (class), or the captions, CoDiR (capt). For the latter, a vocabulary is built of the n_l most frequently occurring adjectives, nouns and verbs. For each image, each of the n_l labels is then assigned if the corresponding vocabulary word occurs in any of the captions. For the retrieval experiment, we select a set of 400 images from the test set and construct their queries.³ All images are randomly cropped and rescaled to 224 × 224 pixels. We use three types of recent state-of-the-art classification models to compare performance: ResNet-18, ResNet-101 [2] and Inception-v3 [13]. For all runs, an Adam optimizer is used with learning rate 5e−3; ρ for the Fisher IPM loss is set to 1e−6. Parameters are found empirically based on performance on the validation set.

Results
Multi-label Image Classification. In this experiment the objects in the image are recognized. For each experiment images are fed through a neural network where the only difference between the baseline and our approach is the last layer.
For the baseline, which we call 'BXENT', the classification model is trained with a binary cross-entropy loss over the outputs, and optimal decision thresholds are determined in the same way as for CoDiR.

With the same underlying architecture, Table 1 shows that the CoDiR method compares favorably to the baselines in terms of F1 score.⁴ When more detailed contextual information is added to the environments, as is the case for CoDiR (capt), our model outperforms the baseline in all cases.⁵ The performance of CoDiR depends on the parameters n_e and R. To measure this influence, the multi-label classification task is performed for different values of n_e. Increasing n_e, the number of environments, generally leads to better performance, although it plateaus after a certain level. For R, the maximum number of labels per environment, an optimal value can also be found empirically between 0 and n_l. The reason is that combining a large number of labels in an environment creates a unique subset against which to compare samples. When R is too large, however, subsets with unique features are no longer created and performance deteriorates. Moreover, even when n_e and R are small, the outcome is not sensitive to the particular choice of environments, suggesting that the number and diversity of environments are more important than their exact composition.

Retrieval.
The experiments here are designed to show the interpretability, composability and compressibility of the CoDiR representations. All models and baselines in these sections are pre-trained on the classification task above. We perform two types of retrieval experiments: (1) NN: the most similar sample to a reference sample is retrieved; (2) M-NN: a sample is retrieved with modified class membership while the contextual information in the environments is retained. Specifically: "Given an input s_r that belongs to class c_+ but not c_−, retrieve the instance in the dataset that is most similar to s_r but belongs to c_− and not c_+", where c_+ and c_− are class labels (see Fig. 2). We will show that CoDiR is well suited for such a task, as its structure can be exploited to create modified representations D̃^(s_r) through decomposition, as explained in Sect. 3. This task is evaluated as shown in Table 2a, where the goal is to achieve a good combination of M-NN PREC and F1% (for the latter, higher percentages are better). We use the highly structured sigmoid outputs of the BXENT (single) and BXENT (joint) models as baselines, denoted as SEM (single) and SEM (joint) respectively.

Table 2b. F1 score for a simple logistic regression on pre-trained representations to classify a previously unseen label ("panting dogs"). For the last three models, n_l = 300. C-CoDiR(5) (capt): 0.10/0.14/0.19.

With SEM (joint) it is possible to directly modify class labels while maintaining all other information. It is thus a 'best-case scenario' baseline that one can strive for, as it combines a good M-NN precision and F1% score. SEM (single), on the other hand, only contains class information and thus presents a best-case scenario for the M-NN precision score yet a worst-case scenario for the F1% score. Additionally, we compare with a simple baseline consisting of CNN features from the penultimate layer of the BXENT (joint) models with n_l = 300.
We also use those features in a Correlation Matching (CM) baseline, which combines different modalities (CNN features and word caption labels) into the same representation space [10]. The representations of these baseline models cannot be composed directly. To compare them on the 'M-NN' task, we therefore define templates as the average feature vector of a particular class. We then modify the representation of a sample s by subtracting the template of c_+ and adding the template of c_−. All representations except SEM (single) are built from the BXENT (joint) models with n_l = 300. For CoDiR, they are built from CoDiR (capt) with n_l = 300.
For all baselines, similarity is computed with the cosine similarity, whereas for CoDiR we exploit its structure: similarity = mean_c cos(D̃^(s_r)_{c,:}, D^(s)_{c,:}), where the mean is taken over all classes c for which cos(D̃^(s_r)_{c,:}, T_{c,:}) > 0.75 × t_c. Here, notations are taken from Sect. 3 and D̃^(s_r) is the modified representation of the reference sample. The similarity is thus calculated over the class dimensions, where classes with low relevance, i.e., those that have a low similarity with the templates, are not taken into account.
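The structured similarity above can be sketched as follows; the names and shapes are assumptions of the example, and the per-class thresholds t_c are taken as given:

```python
import numpy as np

def retrieval_similarity(D_ref, D_cand, T, t, margin=0.75):
    """Structured similarity sketch: mean cosine similarity between the
    class rows of the (possibly modified) reference representation D_ref
    and a candidate D_cand, averaged only over the classes whose reference
    row resembles its template (cosine similarity above margin * t[c])."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    relevant = [c for c in range(T.shape[0])
                if cos(D_ref[c], T[c]) > margin * t[c]]
    if not relevant:
        return 0.0
    return float(np.mean([cos(D_ref[c], D_cand[c]) for c in relevant]))

# toy setup: only class 0 of the reference resembles its template
T = np.array([[1.0, 0.0], [0.0, 1.0]])
t = np.array([0.8, 0.8])
D_ref = np.array([[2.0, 0.1], [-1.0, 0.0]])
```

Averaging only over relevant class rows keeps noisy, irrelevant rows from dominating the retrieval score.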
The advantages of the composability of the representations can be seen in Table 2a, where CoDiR (capt) performs comparably to the fully semantic SEM (joint) representations. CNN (joint) obtains a decent M-NN precision score, thus changing class information well, but at the cost of losing contextual information (low F1%), performing almost as poorly as SEM (single). Whereas CM performs well on the NN task, it does not change the class information accurately and thus (inadvertently) retains most contextual information.
Rank. While the previous section shows that the structure of CoDiR representations provides access to semantic information derived from the labels on which they were trained, we hypothesize that the representations contain additional information beyond those labels, reflecting local, continuous features in the images. To investigate this hypothesis, we perform an experiment, similar to [15], to determine the rank of a matrix composed of 1000 instance representations of the test set. To maintain stability, we take only the first 3 rows (corresponding to 3 classes) and all 300 environments of each representation. Each of these is flattened into a 1D vector of size 900 to construct a matrix of size 1000 × 900. Small singular values are thresholded following [8]. The model used is the CoDiR (capt) ResNet-18 model with n_l = 300. We obtain a rank of 499, which exceeds the number of class and environment labels (3 + 300) within, suggesting that the representations contain additional structure beyond the original labels.
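The rank experiment amounts to stacking flattened representation slices and computing a numerical rank. A small-scale sketch, using random stand-in representations rather than trained ones:

```python
import numpy as np

def representation_rank(reps, n_rows=3):
    """Flatten the first n_rows class rows of each instance representation
    into a vector, stack the vectors into one matrix and compute its
    numerical rank (numpy applies a machine-precision based threshold to
    the small singular values by default). Illustrative sketch."""
    M = np.stack([r[:n_rows].ravel() for r in reps])
    return int(np.linalg.matrix_rank(M))

# stand-in "representations": 10 random 5 x 20 matrices instead of the
# paper's 1000 trained 91 x 300 representations
rng = np.random.default_rng(0)
reps = [rng.normal(size=(5, 20)) for _ in range(10)]
```

Independent random rows give a full-rank stack, while duplicated representations collapse the rank, which is the contrast the experiment measures on trained representations.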
The representations can thus be compressed. Table 2a shows that C-CoDiR with k = 5, denoted as C-CoDiR(5), approaches CoDiR's performance across all defined retrieval tasks. To show that the CoDiR representations contain information beyond the pre-trained labels, we also use cross-validation to perform a binary classification task with a simple logistic regression. A subset of 400 images of dogs is taken from the validation and test sets, of which 24 and 17 respectively are positive samples of the previously unseen label: panting dogs. The outcome in Table 2b shows that CoDiR and C-CoDiR(5) representations outperform the purely semantic representations of the SEM model, which shows that the additional continuous information is valuable.

Conclusion
CoDiR is a novel deep learning method to learn representations that can combine different modalities. The instance representations are obtained from images with a convolutional neural network and are structured along class and environment dimensions. Templates are derived from the instance representations that generalize the class-specific information. In a classification task it is shown that this generalization improves as richer contextual information is added to the environments. When environments are built with labels from image captions, the CoDiR representations consistently outperform their respective baselines. The representations are continuous and have a high rank, as demonstrated by their ability to classify a label that was not seen during pre-training with a simple logistic regression. At the same time, they contain a clear structure which allows for a semantic interpretation of the content. It is shown in a retrieval task that the representations can be decomposed, modified and recomposed to reflect the modified information, while conserving existing information.
CoDiR opens an interesting path for deep learning applications to explore uses of structured representations, similar to how such structured matrices played a central role in many language processing approaches in the past. In zero-shot settings the structure might be exploited, for example, to make compositions of classes and environments that were not seen before. Additionally, further research might explore unsupervised learning or how the method can be applied to other tasks and modalities with alternative building blocks for the environments. While we demonstrate the method with a Wasserstein-based distance, other distance or similarity metrics could be examined in future work.