Abstract
Despite the importance of image representations such as histograms of oriented gradients and deep Convolutional Neural Networks (CNN), our theoretical understanding of them remains limited. Aimed at filling this gap, we investigate two key mathematical properties of representations: equivariance and equivalence. Equivariance studies how transformations of the input image are encoded by the representation, invariance being a special case where a transformation has no effect. Equivalence studies whether two representations, for example two different parameterizations of a CNN, two different layers, or two different CNN architectures, share the same visual information or not. A number of methods to establish these properties empirically are proposed, including introducing transformation and stitching layers in CNNs. These methods are then applied to popular representations to reveal insightful aspects of their structure, including clarifying at which layers in a CNN certain geometric invariances are achieved and how various CNN architectures differ. We identify several predictors of geometric and architectural compatibility, including the spatial resolution of the representation and the complexity and depth of the models. While the focus of the paper is theoretical, direct applications to structured-output regression are demonstrated too.
1 Introduction
Image representations have been a key focus of the research in computer vision for at least two decades. Notable examples include textons (Leung and Malik 2001), histograms of oriented gradients (SIFT, Lowe 2004, and HOG, Dalal and Triggs 2005), bag of visual words (Csurka et al. 2004; Sivic and Zisserman 2003), sparse (Yang et al. 2010) and local coding (Wang et al. 2010), super vector coding (Zhou et al. 2010), VLAD (Jégou et al. 2010), Fisher Vectors (Perronnin and Dance 2006), and, more recently, modern deep neural networks (Krizhevsky et al. 2012; Sermanet et al. 2014; Zeiler and Fergus 2013). Despite this extensive research effort, the development of image representations remains largely empirical, and our theoretical understanding of them is still limited. It is generally believed that a good representation should combine invariance and discriminability, but this characterization is rather vague; furthermore, it is often unclear what invariances are captured by existing representations and how they are obtained.
In this work, we formally investigate image representations in terms of their properties. In full generality, a representation \(\phi \) is a function mapping an image \(\mathbf {x}\) to a vector \(\phi (\mathbf {x})\in \mathbb {R}^d\) and our goal is to establish important statistical properties of such functions. We focus on two such properties. The first one is equivariance, which looks at how the representation output changes upon transformations of the input image. We demonstrate that most representations, including HOG and most of the layers in deep neural networks, change in an easily predictable manner with geometric transformations of the input (Fig. 1). We show that such equivariant transformations can be learned empirically from data (Sect. 5.1) and that, importantly, they amount to simple linear transformations of the representation output (Sects. 5.2 and 5.3). In the case of convolutional networks, we obtain this by introducing and learning a new transformation layer. As a special case of equivariance, by analyzing the learned equivariant transformations we are also able to find and characterize the invariances of the representation. This allows us to quantify geometric invariance and to show how it builds up with the representation depth.
The second part of the manuscript investigates another property, equivalence, which looks at whether different representations, such as different neural networks, capture similar information or not. In the case of CNNs, in particular, the non-convex nature of learning means that the same CNN architecture may result in different models even when retrained on the same data. The question then is whether the resulting differences are substantial or just superficial. To answer this question, we propose to learn stitching layers that allow swapping parts of different architectures, rerouting information between them. Equivalence and coverage are then established if the resulting “Franken-CNNs” perform as well as the original ones (Sect. 6.2).
This paper extends the original conference paper (Lenc and Vedaldi 2015) substantially, by providing extensive results on recent deep neural network architectures, more analysis, and better visualizations. For equivariance, the paper investigates new formulations using alternative loss definitions as well as elementwise feature invariance. For equivalence, the paper systematically explores the equivalence between all layers of neural networks, analyzing for the first time the compatibility between different layers of different neural network architectures.
The rest of the paper is organized as follows. Section 3 discusses the properties of a selection of image representations. Section 5 discusses methods to learn representation equivariance and invariance empirically and presents experiments on shallow (Sect. 5.2) and deep (Sect. 5.3) representations. We also present a simple application of such results to structured-output regression in Sect. 5.4. In Sect. 6.2 we study representation equivalence and show the relation between different deep image representations. Finally, Sect. 7 summarizes our findings.
2 Related Work
The problem of designing invariant or equivariant features has been widely explored in computer vision, as it is a common task to remove nuisance factors from the data (both geometric and photometric).
Invariance to geometric nuisance factors is traditionally achieved either with pose normalization, or by folding an equivariant representation over a group (e.g. by averaging, max-pooling or by exploiting function symmetries) (Cohen and Welling 2016). Both of these principles are taken into account in the architecture of deep CNNs, including the design by Krizhevsky et al. (2012) and related state-of-the-art architectures (He et al. 2016; Simonyan et al. 2013), mainly for translation invariance, which can be extended to other groups as well (Cohen and Welling 2016; Dieleman et al. 2015). This is made even more explicit in the scattering transform of Sifre and Mallat (2013). For pose normalization or feature folding the aim is to obtain invariant image features such that a non-invariant classifier can be used. However, in the case of CNNs, the goal is to get an end-to-end invariant classifier and little is known of how and where these models achieve invariance to other nuisance factors present in the data (such as horizontal flipping).
There are many examples of the general pose normalization methodology in computer vision applications. One of the common approaches is to sample the nuisance feature space sparsely with various “detectors”, such as local feature detectors with different normalization schemes (Lindeberg 1998; Lowe 1999; Mikolajczyk and Schmid 2003), bounding box proposals (Uijlings et al. 2013; Zitnick and Dollar 2014) or a direct regression of the normalized frame (Jaderberg et al. 2015; Ren et al. 2015). Another option is to sample the feature space densely using a grid search (Dalal and Triggs 2005; Felzenszwalb et al. 2009).^{Footnote 1} It is always the detected geometric “frame” which is used to normalize either the image or the features in order to obtain invariant representations. However, in order to be able to normalize the features (due to computational constraints), the features need to be equivariant to the selected geometric factors. A number of authors have looked at incorporating equivariance explicitly in the representations (Schmidt and Roth 2012a; Sohn and Lee 2012).
A second approach to achieving invariance to a group of transformations is to fold the equivariant representation along the manifold induced by the nuisance transformation. This can be as simple as averaging the features (Anselmi et al. 2016), max-pooling (Laptev et al. 2016; Cohen and Welling 2016) or simply by exploiting the group symmetry [such as ignoring the gradient ‘sign’ in Dalal and Triggs (2005) for vertical flip invariance].
In all these examples, invariance is a design aim that may or may not be achieved by a given architecture. By contrast, our aim is not to propose yet another mechanism to learn invariances (Anselmi et al. 2016; Bruna and Mallat 2013; Huang et al. 2007) or equivariance (Dieleman et al. 2016; Schmidt and Roth 2012b), but rather a method to systematically tease out invariance, equivariance, and other properties that a given representation may have. To the best of our knowledge, there is very limited work in conducting this type of analysis. Perhaps the works most closely related to ours only study invariances of neural networks to specific image transformations (Goodfellow et al. 2009; Zeiler and Fergus 2013). In Aubry and Russell (2015), the authors train networks on computer-generated imagery to visually investigate the manifold in the feature space induced by underlying object transformations (such as rotation, style etc.). They show that across layers, the invariance to viewpoint increases with depth (by studying invariances and intrinsic dimensionality), which corroborates our findings. However, unlike that work, we attempt to find whether there exists a transformation in feature space for the whole training dataset instead of computing quantitative statistics on a subset of it. We believe this work is the first to functionally characterize and quantify these properties in a systematic manner, as well as being the first to investigate the equivalence of different representations.
The equivariance maps and steerable filters (Freeman and Adelson 1991) share some of the underlying theory. While conceptually similar, this work searches for linear maps of existing representations, instead of designing representations to achieve steerability. In fact, some more recent works have attempted to design steerable CNN representations (Cohen and Welling 2017) for the CIFAR-10 dataset (Krizhevsky and Hinton 2009).
Another property of image representations studied in this work is equivalence and covering, which tackles the relationship between different representations. In Yosinski et al. (2014), the authors study the transferability of CNN features between different tasks by retraining various parts of the networks. While this may seem similar to our equivalence study of networks trained for different tasks, we do not change existing representations or train new features; we only study the relationship between them.
The work of Li et al. (2015), published one year after our original manuscript (Lenc and Vedaldi 2015), studies different ways to find equivalence between networks trained with different initializations, with the goal of investigating the common factors of different networks quantitatively. While this work is similar to our equivalence chapter, our goal is to find the relationship between representations of different layers and various deep CNN networks with architectural differences or trained for different tasks, in order to better understand the geometry of the representations.
3 Image Representations
An image representation \(\phi \) associates to an image \(\mathbf {x}\) a vector \(\phi (\mathbf {x})\in \mathbb {R}^d\) that encodes the image content in a manner useful for tasks such as classification or regression. We distinguish two important families of representations: traditional “handcrafted” representations such as SIFT and HOG (Sect. 3.1) and modern learnable deep neural networks (Sect. 3.2).
3.1 Traditional Image Representations
Before the advent of modern deep neural networks, computer vision researchers proposed various image representations such as textons (Leung and Malik 2001), histogram of oriented gradients (SIFT Lowe 2004 and HOG Dalal and Triggs 2005), bag of visual words (BoVW) (Csurka et al. 2004; Sivic and Zisserman 2003), sparse (Yang et al. 2010) and local coding (Wang et al. 2010), super vector coding (Zhou et al. 2010), VLAD (Jégou et al. 2010), Fisher Vectors (Perronnin and Dance 2006), and many others.
Such representations are entirely handcrafted, as in the case of SIFT and HOG, or are partially learned using proxy criteria such as K-means clustering, as in the case of BoVW, sparse coding, VLAD, and Fisher Vectors. In this work, HOG (Dalal and Triggs 2005) is selected as a representative of traditional image features. HOG is a variant of the SIFT descriptor which became the predominant representation in image understanding tasks before deep networks. HOG decomposes an image into small blocks (usually of \(8 \times 8\) pixels) and represents each block by a histogram of image gradient orientations. Histograms are further grouped into small partially overlapping \(2 \times 2\) blocks and normalized, building invariance to illumination changes into the representation. Histograms are computed by using weighted bilinear sampling of the image gradients, which results in approximate invariance to small image translations. Similarly, quantization of the gradient orientations and soft assignment of gradients to adjacent orientation bins gives HOG approximate invariance to small image rotations as well.
The SIFT image representation (Lowe 1999), which predates HOG, is conceptually very similar to HOG, with slight differences in the normalization and in the gradient sampling and pooling schemes. The most significant difference is that SIFT was introduced as a descriptor of local image patches, whereas HOG was introduced as a descriptor of the image as a whole, more useful for tasks such as object detection by sliding window. Due to the similarity between HOG and SIFT and due to the fact that HOG can be implemented as a small convolutional neural network (Mahendran and Vedaldi 2016), we focus on HOG in the remainder of the paper.
3.2 Deep Learnable Image Representations
Traditional image representations have been almost entirely replaced by modern deep convolutional neural networks (CNNs). CNNs share many structural elements with representations such as HOG (which, as noted, can be implemented as a small convolutional network); crucially, however, they are based on generic blueprints containing millions of parameters that are learned end-to-end to optimize the performance of the representation on a task of interest, such as image classification. As a result, these representations perform dramatically better than their handcrafted predecessors.
In this paper we investigate three popular families of CNNs: AlexNet-like networks (Krizhevsky et al. 2012) (ANet, CNet, Plcs (Zhou et al. 2014)), VGG-like networks (Simonyan and Zisserman 2015) (Vgg16) and ResNet-like networks (He et al. 2016) (ResN50). Recall that a deep network is a computational chain or graph comprising operations such as linear convolution by filter banks, non-linear activation functions, pooling, and a few other simple operators. Despite differences in the local topology, AlexNet-, VGG-, and ResNet-like networks can generally be decomposed into a number of blocks that operate on tensors of different resolutions, with different blocks connected by down-sampling layers. This subdivision is useful to compare networks, and is summarized in Fig. 2. The performance of the selected model variants in the popular ILSVRC12 benchmark is summarized in Table 1.
In more detail, ANet is the composition of twenty functions, grouped into five convolutional layers (implementing linear filtering, max-pooling, normalization and ReLU operations) and three fully-connected layers (linear filtering and ReLU). In the paper, we analyze the output of convolution layers C1–C5, pooling layers P1, P2, P5, and of the fully connected layers F5 and F7. Features are taken immediately after the application of the linear filters (i.e. before the ReLU) and can be positive or negative, except for P1–5, which are taken after the non-linearity and are non-negative. We also consider the CNet variant of ANet due to its popularity in applications; it differs from ANet only slightly by placing the normalization operator before max pooling.
While ANet contains filters of various sizes, the C3–C5 layers all use \(3 \times 3\) filters only. This design decision was extended in the Vgg16 model to include all convolutional layers. Vgg16 consists of 5 blocks V1–V5, each of which comprises a number of \(3\times 3\) convolutional layers configured to preserve the spatial resolution of the data within a block. Max-pooling operators reduce the spatial resolution between blocks. Similarly to ANet, Vgg16 terminates in 3 fully connected layers. This network has been widely used as a plug-and-play replacement of ANet due to its simplicity and superior performance (Girshick et al. 2014a; He et al. 2014; Long et al. 2015). As with ANet, in our experiments we consider outputs of the last convolution of each block (\(V1_2 \ldots V5_3\)), pooling layers P1–P5 and the fully connected layers F6 and F7.
The ResNet (He et al. 2016) architectures depart from ANet more substantially. The most obvious difference is that they contain a significantly larger number of convolutional layers. Learning such deep networks is made possible by the introduction of residual configurations where the input of a set of linear convolutions is added back to their outputs. ResNet also differs from ANet by the use of a single fully connected layer which performs image classification at the very end of the model; all other layers are convolutional, with the penultimate layer followed by average pooling. Conceptually, the lack of the fully connected layers is similar to the Google Inception network (Szegedy et al. 2015). This architectural difference makes ResNet slightly harder to use as a plug-in replacement for ANet in some applications (Ren et al. 2017), but its performance is generally far better than that of ANet and Vgg16.
We consider a single ResNet variant, ResN50. This model is organized into residual blocks, each comprising several residual units with three convolutional layers, performing dimensionality reduction, \(3 \times 3\) convolution, and dimensionality expansion respectively. In our experiments, we consider outputs of six blocks: the first one, C1, comprising a standard convolutional layer, and five residual blocks R2–R6, each with a \(2\times \) down-sampling during the first convolutional operation of its first residual unit (e.g. \(R2_1\)), implemented as a stride-2 convolution which performs dimensionality reduction. More details about this architecture and the operations performed in each block can be found in He et al. (2016).
4 Properties of Representations
So far a representation \(\phi \) has been described as a function mapping an image to a vector. The design of representations is empirical, guided by intuition and validation of the performance of the representation on tasks of interest, such as image classification. Deep learning has partially automated this empirical design process by optimizing the representation parameters directly on the final task, in an end-to-end fashion.
Although the performance of representations has improved significantly as a consequence of such research efforts, we still do not understand them well from a theoretical viewpoint; this situation has in fact deteriorated with deep learning, as the complexity of deep networks, which are learned as black boxes, has made their interpretation even more challenging. In this paper we aim to shed some light on two important properties of representations: equivariance (Sect. 4.1) and equivalence (Sect. 4.2).
4.1 Equivariance
A popular principle in the design of representations is the idea that a representation should extract from an image information which is useful for interpreting it, for example by recognizing its content, while removing the effect of nuisance factors such as changes in viewpoint or illumination that change the image but not its interpretation. Often, we say that a representation should be invariant to the nuisance factors while at the same time being distinctive for the information of interest (a constant function is invariant but not distinctive).
In order to illustrate this concept, consider the effect on an image \(\mathbf {x}\) of certain transformations g such as rotations, translations, or rescaling. Since in almost all cases the identity of the objects in the image would not be affected by such transformations, it makes sense to seek a representation \(\phi \) which is invariant to the effect of g, i.e. \(\phi (\mathbf {x}) = \phi (g \mathbf {x})\).^{Footnote 2} This notion of invariance, however, requires closure with respect to the transformation group G (Vedaldi and Soatto 2005): given any two transformations \(g,g'\in G\), if \(\phi (\mathbf {x}) = \phi (g \mathbf {x})\) and \(\phi (g \mathbf {x}) = \phi (g' g\mathbf {x})\), then \(\phi (\mathbf {x}) = \phi (g'g\mathbf {x})\) for the combined transformation \(g'g\). Due to the finite resolution and extent of digital images, this is not realistic even for simple transformations; for example, if \(\phi \) is invariant to any scaling factor \(g\not = 1\), it must be invariant to any multiple \(g^n\) as well, even if the scaled image \(g^n\mathbf {x}\) reduces to a single pixel. Even disregarding finiteness issues, many simple transformations close onto 2D diffeomorphisms, resulting in representations that in principle should not distinguish even heavily distorted versions of the same image. In practice, therefore, invariance is often relaxed to insensitivity to bounded transformations: \(\Vert \phi (g\mathbf {x}) - \phi (\mathbf {x})\Vert \le \epsilon \Vert g\Vert \), where \(\Vert g\Vert \) is a measure of the size of the transformation.
A more fundamental problem with invariance is that the definition of a nuisance factor depends on the task at hand, whereas a representation should be useful for several tasks (otherwise there would be no difference between representations and solutions to a specific problem). For example, recognizing objects may be invariant to image translations and rotations, but localizing them clearly is not. Rather than removing factors of variation, therefore, one often seeks representations that untangle such factors, which is sufficient to simplify the solution of specific problems while not preventing others from being solved as well.
Thus, generalizing the concept of invariance, we aim at studying the equivariant properties of representations. A representation \(\phi \) is equivariant with a transformation g of the input image if the transformation can be transferred to the representation output. Formally, equivariance with g is obtained when there exists a map \(M_g : \mathbb {R}^d \rightarrow \mathbb {R}^d\) such that:
\[ \phi (g\mathbf {x}) = M_g\, \phi (\mathbf {x}) \quad \text{for all images } \mathbf {x}. \tag{1} \]
A sufficient condition for the existence of \(M_g\) is that the representation \(\phi \) is invertible, because in this case \(M_g = \phi \circ g \circ \phi ^{-1}\). It is known that representations such as HOG are at least approximately invertible (Vondrick et al. 2013). Hence it is not just the existence, but also the structure of the mapping \(M_g\) that is of interest. In particular, \(M_g\) should be simple, for example a linear function. This is important because the representation is often used in simple predictors such as linear classifiers, or in the case of CNNs, is further processed by linear filters. Furthermore, by requiring the same mapping \(M_g\) to work for any input image, intrinsic geometric properties of the representations are captured. Invariance is a special case of equivariance obtained when \(M_g\) (or a subset of \(M_g\)) acts as the simplest possible transformation, i.e. the identity map.
The nature of the transformation g is in principle arbitrary; in practice, in this paper we will focus on geometric transformations such as affine warps and flips of the image.
As an illustrative example of equivariance, let \(\phi \) denote the HOG (Dalal and Triggs 2005) feature extractor. In this case \(\phi (\mathbf {x})\) can be interpreted as a \(H \times W\) vector field of D-dimensional feature vectors, called “cells” in the HOG terminology. If g denotes image flipping around the vertical axis, then \(\phi (\mathbf {x})\) and \(\phi (g\mathbf {x})\) are related by a well defined permutation of the feature components. This permutation swaps the HOG cells in the horizontal direction and, within each HOG cell, swaps the components corresponding to symmetric orientations of the gradient. Hence the mapping \(M_g\) is a permutation and one has exactly \(\phi (g\mathbf {x}) = M_g \phi (\mathbf {x})\). The same is true for horizontal flips and \(180^{\circ } \) rotations, and, approximately,^{Footnote 3} for \(90^{\circ } \) rotations. HOG implementations (Vedaldi and Fulkerson 2010) do in fact explicitly provide such permutations.
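The flip permutation can be checked numerically on a simplified HOG-like feature. The sketch below is only an illustration under stated assumptions, not the HOG implementation referenced above: the function `hog_cells`, its cell size and its number of bins are hypothetical, block normalization and spatial interpolation are omitted, and the image sides are assumed to be multiples of the cell size.

```python
import numpy as np

def hog_cells(img, cell=4, bins=8):
    """Hypothetical, unnormalized HOG-like cells with soft orientation binning."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi), mapped to a continuous bin coordinate.
    pos = np.mod(np.arctan2(gy, gx), np.pi) / np.pi * bins - 0.5
    lo = np.floor(pos).astype(int)
    w_hi = pos - lo                      # weight given to the upper bin
    H, W = img.shape[0] // cell, img.shape[1] // cell
    rows, cols = np.indices(img.shape) // cell
    hist = np.zeros((H, W, bins))
    # Each gradient votes for its two adjacent orientation bins (cyclically).
    np.add.at(hist, (rows, cols, lo % bins), (1.0 - w_hi) * mag)
    np.add.at(hist, (rows, cols, (lo + 1) % bins), w_hi * mag)
    return hist

rng = np.random.default_rng(0)
x = rng.random((16, 24))
gx_img = x[:, ::-1]                      # g: flip around the vertical axis

phi_x = hog_cells(x)
phi_gx = hog_cells(gx_img)

# M_g as an index permutation of the flattened feature vector:
# reverse the horizontal cell order and mirror the orientation bins.
perm = np.arange(phi_x.size).reshape(phi_x.shape)[:, ::-1, ::-1].ravel()
max_err = np.abs(phi_gx.ravel() - phi_x.ravel()[perm]).max()
```

Under these assumptions `max_err` is at the level of floating-point rounding, so the map \(M_g\) for this toy feature is a pure permutation of the feature components.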
As another remarkable example of equivariance, note that HOG, densely-computed SIFT (DSIFT), and convolutional networks are all convolutional representations in the sense that they are local and translation invariant operators. Barring boundary and sampling effects, convolutional representations are equivariant to translations of the input image by design, which transfer to a corresponding translation of the resulting feature field.
In all such examples, the map \(M_g\) is linear. We will show empirically that this is the case for many more representations and transformations (Sect. 5).
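As a minimal sketch of translation equivariance, consider a toy one-dimensional convolutional representation with circular boundary conditions, which removes the boundary effects mentioned above. The names `phi` and `shift_matrix` and all sizes are illustrative assumptions.

```python
import numpy as np

def phi(x, w):
    """Toy convolutional representation: circular correlation followed by ReLU."""
    n = len(x)
    idx = (np.arange(n)[:, None] + np.arange(len(w))[None, :]) % n
    return np.maximum(x[idx] @ w, 0.0)

def shift_matrix(n, s):
    """Permutation matrix P such that (P @ f)[i] = f[(i - s) % n]."""
    P = np.zeros((n, n))
    P[np.arange(n), (np.arange(n) - s) % n] = 1.0
    return P

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(3)
s = 5

gx = np.roll(x, s)            # g: circular translation of the input by s samples
M_g = shift_matrix(len(x), s) # linear map acting on the feature vector

lhs = phi(gx, w)              # phi(g x)
rhs = M_g @ phi(x, w)         # M_g phi(x)
```

Here `lhs` and `rhs` coincide, and `M_g` is a permutation matrix, so the equivariant map is linear even though `phi` itself is non-linear.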
4.2 Covering and Equivalence
While equivariance looks at how a representation is affected by transformations of the input image, covering studies the relationship between different representations. We say that a representation \(\phi \) covers a representation \(\phi '\), and we write \(\phi \rightarrow \phi '\), if there exists a map \(E_{\phi \rightarrow \phi '}\) such that
\[ \phi '(\mathbf {x}) = E_{\phi \rightarrow \phi '}\, \phi (\mathbf {x}) \quad \text{for all images } \mathbf {x}. \tag{2} \]
Covering captures the idea that \(\phi \) contains at least as much information as \(\phi '\). Algebraically, covering is a transitive and reflexive relation; however, it is a preorder rather than a partial order because \(\phi ' \rightarrow \phi \) and \(\phi \rightarrow \phi '\) do not imply that \(\phi \) and \(\phi '\) are identical (i.e. the \(\rightarrow \) relation is reflexive and transitive but not antisymmetric); rather, in this case we say that they are equivalent, as they both carry the same information.
Note that, if \(\phi \) is invertible, then \(E_{\phi \rightarrow \phi '} = \phi ' \circ \phi ^{-1}\) satisfies this condition; hence, as for the mapping \(M_g\) before, the interest is not just in the existence but also in the structure of the mapping \(E_{\phi \rightarrow \phi '}\).
The reason why covering and equivalence are interesting properties to test for is that there exists a large variety of different image representations. In fact, each time a deep network is learned from data, the non-convex nature of the optimization results in a different and, as we will see, seemingly incompatible neural network. However, as may be expected, these differences are not fundamental, and this can be demonstrated by the existence of simple mappings \(E_{\phi \rightarrow \phi '}\) that bridge them. More interestingly, covering and equivalence can be used to assess differences in the representations computed at different depths in a neural network, as well as to compare different architectures (Sect. 6).
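The asymmetry of covering can be illustrated with two toy linear representations. This is a sketch under stated assumptions: the matrices `B` and `C` and the helper names are hypothetical, \(\phi \) is invertible by construction, and \(\phi '\) discards two of the four input dimensions, so that \(\phi \rightarrow \phi '\) holds but \(\phi ' \rightarrow \phi \) does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# phi(x) = B x with B invertible by construction (singular values 1, 2, 3, 4).
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = Q @ np.diag([1.0, 2.0, 3.0, 4.0])
# phi'(x) = C x keeps only a 2-D summary of x.
C = rng.standard_normal((2, 4))

def phi(X):
    return B @ X

def phi_prime(X):
    return C @ X

X = rng.standard_normal((4, 100))        # 100 toy "images" as columns

# phi covers phi': E = phi' o phi^{-1} is a linear map.
E = C @ np.linalg.inv(B)
forward_err = np.linalg.norm(E @ phi(X) - phi_prime(X))

# Reverse direction: the best linear map from phi'(x) to phi(x) in the
# least-squares sense still leaves a large residual.
E_rev, *_ = np.linalg.lstsq(phi_prime(X).T, phi(X).T, rcond=None)
reverse_err = np.linalg.norm(E_rev.T @ phi_prime(X) - phi(X))
```

In this example `forward_err` is at rounding level while `reverse_err` is not, which mirrors the statement that covering is a preorder and that equivalence requires maps in both directions.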
5 Analysis of Equivariance
Given an image representation \(\phi \), we study its equivariance properties (Sect. 4.1) empirically by learning the map \(M_g\) from data. The approach, based on a structured sparse regression method (Sect. 5.1), is applied to the analysis of both traditional and deep image representations in Sects. 5.2 and 5.3, respectively. Section 5.4 also shows a practical application of these equivariant mappings to object detection using structured-output regression.
The key findings from these experiments are that:

HOG, our representative traditional feature extractor, has a high degree of equivariance with similarity transformations (translation, rotation, flip, scale) up to limitations due to sampling artifacts.

Deep feature extractors such as ANet, Vgg16, and ResN50 are also highly equivariant up to layers that still preserve sufficient spatial resolution, as those better represent geometry. This is also consistent with the fact that such features can be used to perform geometry-oriented tasks, such as object detection in R-CNN and related methods.

We also show that equivariance in deep feature extractors reduces to invariance for those transformations, such as left-right flipping, that are present in data or in data augmentation during training. This effect is more pronounced as depth increases.

Finally, we show that simple reconstruction metrics such as the Euclidean distance between features are not necessarily predictive of classification performance; instead, using a task-oriented regression method learns better equivariant maps in most cases.
5.1 Methods
As our goal is to study the equivariance properties of a given image representation \(\phi \), the equivariant map \(M_g\) of Sect. 4.1 is not available a priori and must be estimated from data, if it exists. This section discusses a number of methods to do so. First, the learning problem is discussed in general (Sect. 5.1.1) and suitable regularizers are proposed (Sect. 5.1.2). Then, efficient versions of the loss (Sect. 5.1.3) and of the map \(M_g\) (Sect. 5.1.4) are given for the special case of CNN representations.
5.1.1 Learning Equivariance
Given a representation \(\phi \) and a transformation g, the goal is to find a mapping \(M_g\) satisfying (1). In the simplest case \(M_g = (A_g, \mathbf {b}_g)\), with \(A_g\in \mathbb {R}^{d\times d}\) and \(\mathbf {b}_g\in \mathbb {R}^d\), is an affine transformation \(\phi (g\mathbf {x}) \approx A_g \phi (\mathbf {x}) + \mathbf {b}_g\). This choice is not as restrictive as it may initially seem: in the examples of Sect. 4.1 \(M_g\) is a permutation, and hence can be implemented by a corresponding permutation matrix \(A_g\).
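Estimating \((A_g,\mathbf {b}_g)\) can be formulated as an empirical risk minimization problem. Given images \(\mathbf {x}_1,\dots ,\mathbf {x}_n\) sampled from a set of natural images, learning amounts to optimizing the regularized reconstruction error
\[ E(A_g, \mathbf {b}_g) = \lambda \mathcal {R}(A_g) + \frac{1}{n} \sum _{i=1}^{n} \ell \bigl (\phi (g\mathbf {x}_i),\, A_g \phi (\mathbf {x}_i) + \mathbf {b}_g\bigr ), \tag{3} \]
where \(\mathcal {R}\) is a regularizer, weighted by \(\lambda \), and \(\ell \) a regression loss.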
The choice of regularizer is particularly important as \(A_g \in \mathbb {R}^{d\times d}\) has \(\varOmega (d^2)\) parameters. Since d can be quite large (for example, in HOG one has \(d = DWH\)), regularization is essential. The standard \(l^2\) regularizer \(\Vert A_g\Vert _F^2\) was found to be inadequate; instead, sparsity-inducing priors work much better for this problem as they encourage \(A_g\) to be similar to a permutation matrix.
5.1.2 Regularizer
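We consider two such sparsity-inducing regularizers. The first regularizer allows \(A_g\) to contain a fixed number k of non-zero entries in each row:
\[ \mathcal {R}_k(A) = {\left\{ \begin{array}{ll} +\infty , &{} \exists i : \Vert A_{i,:}\Vert _0 > k,\\ 0, &{} \text {otherwise}. \end{array}\right. } \tag{4} \]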
Regularizing rows independently reflects the fact that each row is a predictor of a particular component of \(\phi (g\mathbf {x})\).
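A row-wise sparse affine map can be estimated in several ways. The sketch below uses a simplified greedy forward selection per output component, which is an assumption made for illustration and not the solver used in the paper; the function name `fit_row_sparse_affine` and the toy data are hypothetical.

```python
import numpy as np

def fit_row_sparse_affine(F, G, k):
    """Fit G ~ F @ A.T + b with at most k non-zero entries per row of A.

    F: (n, d) array whose rows are phi(x_i).
    G: (n, d_out) array whose rows are phi(g x_i).
    """
    mu_F, mu_G = F.mean(axis=0), G.mean(axis=0)
    Fc, Gc = F - mu_F, G - mu_G          # centering absorbs the bias term
    A = np.zeros((G.shape[1], F.shape[1]))
    for i in range(G.shape[1]):          # each row is an independent predictor
        support, residual = [], Gc[:, i].copy()
        for _ in range(k):
            scores = np.abs(Fc.T @ residual)
            if support:
                scores[support] = -np.inf    # never reselect a feature
            support.append(int(np.argmax(scores)))
            coef, *_ = np.linalg.lstsq(Fc[:, support], Gc[:, i], rcond=None)
            residual = Gc[:, i] - Fc[:, support] @ coef
        A[i, support] = coef
    b = mu_G - A @ mu_F
    return A, b

# Toy check: the true map is a permutation of the components plus a bias.
rng = np.random.default_rng(0)
n, d = 200, 8
F = rng.standard_normal((n, d))
perm = rng.permutation(d)
P = np.eye(d)[perm]                      # P[i, perm[i]] = 1
b_true = rng.standard_normal(d)
G = F @ P.T + b_true

A_hat, b_hat = fit_row_sparse_affine(F, G, k=1)
```

With k = 1 and noiseless data, the recovered `A_hat` matches the permutation matrix `P`, which is the kind of structure the sparsity constraint is meant to encourage.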
The second sparsityinducing regularizer is similar, but exploits the convolutional structure of a representation. Convolutional features are obtained from translation invariant and local operators (nonlinear filters). In this case, the representation \([\phi (\mathbf {x})]_{uvt}\) can be interpreted as a feature field or tensor with spatial indexes (u, v) and feature channel index t. Due to the locality of the representation, the component (u, v, t) of \(\phi (g\mathbf {x})\) should be predictable from a corresponding neighborhood \(\varOmega _{g,m}(u,v)\) of features in tensor \(\phi (\mathbf {x})\) (see Fig. 3). This results in a particular sparsity structure for \(A_g\) that can be imposed by the regularizer
where m denotes the neighborhood size and the indexes of A have been identified with triplets (u, v, t). The neighborhood itself is defined as the \(m \times m\) input feature locations closest to the back-projection of the output feature (u, v).^{Footnote 4} In practice (4) and (5) will be combined in order to limit the number of regression coefficients activated in each neighborhood.
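The neighborhood construction can be sketched as follows for the particular case where g is a rotation about the center of the feature grid. The function name, the choice of rotation center, and the row/column orientation convention are assumptions made for illustration; cells falling outside the grid are simply dropped.

```python
import math

def neighborhood(u, v, angle_deg, m, height, width):
    """Input cells from which output cell (u, v) may be predicted.

    Assumes g is a rotation by `angle_deg` about the centre of a
    height x width grid of cells. The output cell is back-projected
    through g^{-1} and the m x m cells closest to that point are
    returned, clipped to the grid.
    """
    cu, cv = (height - 1) / 2.0, (width - 1) / 2.0
    t = math.radians(angle_deg)
    du, dv = u - cu, v - cv
    # Inverse rotation: apply R(-t) to the offset from the centre.
    bu = cu + math.cos(t) * du + math.sin(t) * dv
    bv = cv - math.sin(t) * du + math.cos(t) * dv
    # First index of the m consecutive cells nearest to each coordinate.
    su = math.floor(bu - (m - 1) / 2.0 + 0.5)
    sv = math.floor(bv - (m - 1) / 2.0 + 0.5)
    return {(i, j)
            for i in range(su, su + m)
            for j in range(sv, sv + m)
            if 0 <= i < height and 0 <= j < width}
```

Only the entries of A that link an output cell to the cells returned for it are allowed to be nonzero, which is what reduces the size of each regression problem.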
5.1.3 Loss and Optimization
As will be shown empirically in Sect. 5.3, the choice of loss \(\ell \) in Eq. (3) is important. For HOG and similar histogram-like representations, a regression loss such as \(l^2\), Hellinger, or \(\chi ^2\) distance works well. Such a loss can also be applied to convolutional architectures, although an end-to-end task-oriented loss can perform better. The \(l^2\) loss can be easily optimized offline, for which we use a direct implementation of least squares or ridge regression, or the implementation by Sjöstrand et al. (2018) of the forward-selection algorithm. Alternatively, for CNNs the Siamese architecture approach described next works well.
Siamese architecture for the \(l^2\) loss. For CNN representations and regression losses such as \(l^2\), the transformation \(M_g\) can also be learned using a Siamese architecture (Bromley et al. 1994). This is illustrated in Fig. 4: one branch of the network computes the representation of the original image \(\phi (\mathbf {x})\) and the second branch computes the representation of \(\psi (M_g \circ \phi (g^{-1} \mathbf {x})) \) while minimizing the \(l^2\) loss between these two representations.
The Siamese approach has several advantages. First, it allows \(M_g\) to be learned using the same methods used to learn the CNN, usually online SGD optimization, which may be more memory efficient than offline solvers. Additionally, a Siamese architecture is more flexible. For example, it is possible to apply \(M_g\) after the output of a convolutional layer, but to compute the \(l^2\) loss after the ReLU operator is applied to the output of the latter. In fact, since ReLU removes the negative components of the representation in any case, accurately reconstructing negative levels may be overkill; the Siamese configuration allows us to test this hypothesis.
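The difference between measuring the \(l^2\) loss before and after the ReLU can be seen in a toy example. The vectors below are made-up values standing in for the two branch outputs; they are not taken from any experiment.

```python
def relu(xs):
    """Element-wise rectification: negative components are set to zero."""
    return [max(0.0, x) for x in xs]

def l2_loss(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Hypothetical pre-ReLU responses of the two branches.
target = [-1.0, 2.0, 0.5]      # stands in for phi(x)
predicted = [-5.0, 2.0, 0.5]   # stands in for M_g phi(g^{-1} x)

# Before the ReLU the mismatch in the negative component is penalized.
loss_before = l2_loss(target, predicted)
# After the ReLU both negative components are clipped to zero.
loss_after = l2_loss(relu(target), relu(predicted))
```

In this example the two vectors differ only in a negative component, so the loss computed after the ReLU is zero while the loss computed before it is not.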
End-to-end loss. In practice, it is unclear whether a regression loss such as \(l^2\) captures the informative content of the features well or whether a different metric should be used instead. In order to sidestep the issue of choosing a metric, we propose to measure the quality of feature reconstruction based on whether the features can still solve the original task.
To this end, consider a CNN \(\zeta \) trained end-to-end on a categorization problem such as the ILSVRC 2012 image classification task (ILSVRC12) (Russakovsky et al. 2015). It is common (Chatfield et al. 2014; Donahue et al. 2013; Razavian et al. 2014) to consider the first several layers \(\phi \) of the network \(\zeta = \psi \circ \phi \) as a general-purpose feature extractor and the last layers \(\psi \) as a classifier using such features. This suggests an alternative objective that preserves the quality of the features \(\phi \) in the original problem:
\[ E(A_g,\mathbf {b}_g) = \lambda \mathcal {R}(A_g) + \frac{1}{n}\sum _{i=1}^n \ell \big (y_i,\, \psi (A_g \phi (g^{-1}\mathbf {x}_i) + \mathbf {b}_g)\big ). \tag{6} \]
Here \(y_i\) denotes the ground truth label of image \(\mathbf {x}_i\) and \(\ell \) is the same classification loss used to train \(\zeta \). Note that in this case \((A_g,\mathbf {b}_g)\) is learned to compensate for the image transformation, which therefore is set to \(g^{-1}\). This formulation is not restricted to CNNs, but applies to any representation \(\phi \) given a target classification or regression task and a corresponding pre-trained classifier \(\psi \) using it. This approach is further illustrated in Fig. 5.
Implementation. For implementation convenience, the Siamese formulations are optimized using the same online stochastic gradient descent algorithm and weight decay used to learn the neural networks in the first place. Learning uses the MatConvNet framework (Vedaldi and Lenc 2014). The transformation layer is implemented with a layer similar to a spatial transformer (Jaderberg et al. 2015) with a fixed sampling grid. The spatial transformation and convolution with \(F_g\) have little influence on the network training speed.
5.1.4 Transformation Layer
The method of Sect. 5.1 can be substantially refined for the case of CNN representations and certain classes of transformations. In fact, the structured sparsity regularizer of (5) encourages \(A_g\) to match the convolutional structure of the representation. If g is an affine transformation, more can be said: up to sampling artifacts, the equivariant transformation \(M_g\) is local and translation invariant, i.e. convolutional. The reason is that an affine transformation g acts uniformly on the image domain^{Footnote 5} so that the same is true for \(M_g\). This has two key advantages: it dramatically reduces the number of parameters to learn and it can be implemented efficiently as an additional layer of a CNN.
Such a transformation layer consists of a permutation layer, which implements the multiplication by a permutation matrix \(P_g\) moving input feature sites (u, v, t) to output feature sites (g(u, v), t), followed by convolution with a bank of D linear filters and scalar biases \((F_g, \mathbf {b}_g)\), each of dimension \(m \times m \times D\). Here m corresponds to the size of the neighborhood \(\varOmega _{g,m}(u,v)\) described in Sect. 5.1. Intuitively, the main purpose of these filters is to permute and interpolate feature channels.
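Note that g(u, v) does not, in general, fall at integer coordinates. To address this issue, the permutation layer \(P_g\) distributes g(u, v) to the nearest \(2\times 2\) sites using bilinear interpolation.^{Footnote 6} The transformation layer allows the learning objective to be rewritten as:
\[ E(F_g,\mathbf {b}_g) = \lambda \mathcal {R}(F_g) + \frac{1}{n}\sum _{i=1}^n \ell \big (y_i,\, \psi (F_g * (P_g \phi (g^{-1}\mathbf {x}_i)) + \mathbf {b}_g)\big ). \tag{7} \]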
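The bilinear distribution of a non-integer location over its nearest \(2\times 2\) sites can be sketched as follows. The function name is hypothetical, and the sketch computes only the four site weights; it does not implement the full permutation layer.

```python
import math

def bilinear_sites(u, v):
    """Spread a real-valued location (u, v) over its four nearest integer sites.

    Returns a dict {(i, j): weight}; the weights are non-negative and sum to 1.
    """
    i0, j0 = math.floor(u), math.floor(v)
    fu, fv = u - i0, v - j0  # fractional offsets within the cell
    return {
        (i0, j0): (1 - fu) * (1 - fv),
        (i0, j0 + 1): (1 - fu) * fv,
        (i0 + 1, j0): fu * (1 - fv),
        (i0 + 1, j0 + 1): fu * fv,
    }
```

When g(u, v) falls exactly on a grid site, all of the weight goes to that site and the operation reduces to an ordinary permutation.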
5.2 Results on Traditional Representations
This section applies the methods of Sect. 5.1 to learn equivariant maps for shallow representations, and HOG features in particular. The first method to be evaluated is sparse regression (Sect. 5.2.1) followed by structured sparsity (Sect. 5.2.2). A qualitative evaluation is given in Sect. 5.2.3.
5.2.1 Sparse Regression
The first experiment (Fig. 6) explores variants of the sparse regression formulation of Eq. (3). The goal is to learn a mapping \(M_g=(A_g,\mathbf {b}_g)\) that predicts the effect of selected image transformations g on the HOG features of an image. For each transformation, the mapping \(M_g\) is learned from 1000 training images by minimizing the regularized empirical risk (6). The performance is measured as the average Hellinger’s distance \(\Vert \phi (g\mathbf {x}) - M_g\phi (\mathbf {x})\Vert _\text {Hell.}\) on a test set of a further 1000 images.^{Footnote 7} Images are randomly sampled from the ILSVRC12 train and validation datasets respectively.
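As a point of reference, a Hellinger-type distance between two non-negative feature vectors can be computed as below. The exact normalization used in the experiments is not restated here, so the unnormalized form (Euclidean distance between element-wise square roots) is an assumption of this sketch.

```python
import math

def hellinger(p, q):
    """Euclidean distance between the element-wise square roots of p and q.

    Assumes both vectors are non-negative (e.g. histogram-like features).
    No 1/sqrt(2) normalization is applied; that choice is an assumption.
    """
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)))
```

Taking square roots compresses large bin values, which is one reason such distances are commonly preferred for histograms.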
This experiment focuses on predicting a small array of \(5\times 5\) HOG cells, which allows full regression matrices to be trained even with naive baseline regression algorithms. Furthermore, the \(5\times 5\) array is predicted from a larger \(9 \times 9\) input array to avoid boundary issues when images are rotated or rescaled. Both these restrictions will be relaxed later. Figure 6 compares the following methods to learn \(M_g\): choosing the identity transformation \(M_g=\mathbf {1}\), learning \(M_g\) by optimizing the objective (3) without regularization (Least Squares, LS), with the Frobenius norm regularizer for different values of \(\lambda \) (Ridge Regression, RR), and with the sparsity-inducing regularizer (4) (Forward-Selection, FS, using Sjöstrand et al. 2018) for a different number k of regression coefficients per output dimension.
As can be seen in Fig. 6, LS overfits badly, which is not surprising given that \(M_g\) contains 1M parameters even for these small HOG arrays. RR performs significantly better, but it is easily outperformed by FS, confirming the very sparse nature of the solution (e.g. for \(k=5\) just 0.2% of the 1M coefficients are nonzero). The best result is obtained by FS with \(k=5\). As expected, the prediction error of FS is zero for a \(180^{\circ } \) rotation as this transformation is exact (Sect. 5.1), but note that LS and RR fail to recover it. As one might expect, errors are smaller for transformations close to identity, although in the case of FS the error remains small throughout the range.
5.2.2 Structured Sparse Regression
The conclusion of the previous experiments is that sparsity is essential to achieve good generalization. However, learning \(M_g\) directly, e.g. by forward-selection or by \(l^1\) regularization, can be quite expensive even if the solution is ultimately sparse. Next, we evaluate using the structured sparsity regularizer of Eq. (5), where each output feature is predicted from a pre-specified neighborhood of input features dependent on the image transformation g. The right plot of Fig. 6 repeats the experiment for a \(45^{\circ } \) rotation, but this time limited to neighborhoods of \(m \times m\) input HOG cells. To be able to span larger intervals of m, an array of \(15 \times 15\) HOG cells is used. Since spatial sparsity is now imposed a priori, LS, RR, and FS perform nearly equivalently for \(m\le 3\), with the best result achieved by FS with \(k=5\) and a small neighborhood of \(m = 3\) cells. There is also a significant computational advantage in structured sparsity (Table 2) as it limits the effective size of the regression problems to be solved. We conclude that structured sparsity is highly preferable over generic sparsity (Fig. 7).
5.2.3 Regression Quality
So far results have been given in terms of the reconstruction error of the features; this paragraph relates this measure to the practical performance of the learned mappings. The first experiment is qualitative and uses the HOGgle technique (Vondrick et al. 2013) to visualize the transformed features. As shown in Fig. 8, the visualizations of \(\phi (g\mathbf {x})\) and \(M_g \phi (\mathbf {x})\) are indeed nearly identical, validating the mapping \(M_g\). The second experiment (Fig. 7) evaluates instead the performance of transformed HOG features quantitatively, in a classification problem. To this end, an SVM classifier \(\langle \mathbf {w}, \phi (\mathbf {x}) \rangle \) is trained to discriminate between dog and cat faces using the data of Parkhi et al. (2011) (using \(15\times 15\) HOG templates, 400 training and 1000 testing images evenly split among cats and dogs). Then a progressively larger rotation or scaling \(g^{-1}\) is applied to the input image and the effect compensated by \(M_g\), computing the SVM score as \(\langle \mathbf {w}, M_g \phi (g^{-1}\mathbf {x}) \rangle \) (equivalently the model is transformed by \(M_g^\top \)). The performance of the compensated classifier is nearly identical to the original classifier for all angles and scales, whereas the uncompensated classifier \(\langle \mathbf {w}, \phi (g^{-1}\mathbf {x})\rangle \) rapidly fails, particularly for rotation. We conclude that equivariant transformations encode visual information effectively.
5.3 Results on Deep Representations
This section extends the experiments of the previous section to deep representations, including investigations with task-oriented losses.
5.3.1 Regression Methods
In this section we validate the parameters of various regression methods and show that the task-oriented loss results in better equivariant maps.
The first experiment (Fig. 9) compares different methods to learn equivariant mappings \(M_g\) in a CNN. The first method (gray and brown lines) is FS, computed for different neighborhood sizes m (line color) and sparsity k (line pattern). The next method (blue line) is the \(l^2\) loss training after the ReLU layer, as specified in Sect. 5.1.3. The last method (orange line) is the task-oriented formulation of Sect. 5.1 using a transformation layer.
The classification error (task-oriented loss, first row), \(l^2\) reconstruction error (second row) and \(l^2\) reconstruction error after the ReLU operation (third row) are reported against the number of training samples seen. As in Sect. 5.1.4, the first of these is the classification error of the compensated network \(\psi \circ M_g \circ \phi (g^{-1}\mathbf {x})\) on ImageNet ILSVRC12 data (the reported error is measured on the validation data, but optimized on the training data). The figure reports the evolution of the loss as more training samples are used. For the purpose of this experiment, g is set to be vertical image flipping. Figure 11 repeats the experiments for the task-oriented objective and rotations g from 0 to 90 degrees (the fact that intermediate rotations are slightly harder to reconstruct suggests that a better \(M_g\) could be learned by addressing interpolation and boundary effects more carefully).
Several observations can be made. First, all methods perform substantially better than doing nothing (which has \(75\%\) top-1 error, red dashed line), recovering most if not all the performance of the original classifier (\(43\%\), green dashed line). This demonstrates that linear equivariant mappings \(M_g\) can be learned successfully for CNNs too. Second, for the shallower features up to C2, FS is better: it requires fewer training samples (as it uses an offline optimizer) and it has a smaller reconstruction error than the task-oriented loss and a comparable classification error. Compared to Sect. 5.2, however, the best setting \(m=3\), \(k=25\) is substantially less sparse. From C3 onward, the task-oriented loss is better, converging to a much lower classification error than FS. FS still achieves a significantly smaller reconstruction error, showing that feature reconstruction is not always predictive of classification performance. Third, the classification error increases somewhat with depth, matching the intuition that deeper layers contain more specialized information: as such, perfectly transforming these layers for transformations which were not experienced during training (e.g. vertical flips) may not be possible.
Because the CNN uses a ReLU non-linearity, one can ask whether optimizing the \(l^2\) loss before the non-linearity is apt for this task. To shed light on this question, we train \(M_g\) using an \(l^2\) loss after the non-linearity (ReLU-OPT). One can see that this still performs slightly worse than the task-specific loss, even though it performs slightly better than FS (which may be due to more training data). However, it is interesting to observe that neither the \(l^2\) loss before nor after the non-linearity is strongly predictive of the target performance. Thus we conclude that the \(l^2\) metric should only be used as a proxy metric in the hidden representation of the CNNs (with respect to the target task).
5.3.2 Comparing Transformation Types
Next we investigate which geometric transformations can be represented by different layers of various CNNs (Fig. 10), considering in particular horizontal and vertical flips, rescaling by half, and rotation of \(90^{\circ } \). We perform this experiment for three CNN models. For ANet and Vgg16 the experiment is additionally performed on two of their fully connected layer representations. This is not applicable to ResN50, which has only the final classifier as a fully connected layer. In all experiments, the training is done for five epochs of \(2 \cdot 10^5\) training samples, using a constant learning rate of \(10^{-2}\).
For transformations such as horizontal flips and scaling, learning equivariant mappings is not better than leaving the features unchanged: this is due to the fact that the CNN implicitly learns to be invariant to such factors. For vertical flips and rotations, however, the learned equivariant mappings substantially reduce the error. In particular, the first few layers for all three investigated networks are easily transformable, confirming their generic nature.
The results also show that finding an equivariant transformation for fully connected layers (or layers with lower spatial resolution in general) is more difficult than for convolutional layers. This is consistent with the fact that the deepest layers of networks contain less spatial information and hence expressing geometric transformations on top of them becomes harder. This is also consistent with the fact that ResN50 shows better equivariance properties for deeper layers compared to Vgg16 and ANet: the reason is that ResN50 preserves spatial information deeper in the architecture.
5.3.3 Qualitative Evaluation
Similarly to the visualization we obtained for the HOG features, we can use the pre-image method of Mahendran and Vedaldi (2016) to invert each deep representation and assess the learned mappings visually. Figure 10 shows the inverse of the maps \(\phi (g\mathbf {x})\) and \(M_g \phi (\mathbf {x})\) for different representations corresponding to different layers of ANet. It also shows the results obtained by inverting with \(P_g \phi (\mathbf {x})\), considering only a permutation matrix \(P_g\) instead of using the fully-fledged map \(M_g\). In this experiment, \(M_g\) is obtained using the task-oriented optimization.
We can see that the pre-images \([M_g \phi (\mathbf {x})]^{-1}\) are nearly always better than the pre-images \([\phi (g\mathbf {x})]^{-1}\), which validates the equivariant map \(M_g\). Furthermore, in all cases the pre-image obtained using \(M_g\) is better than the one obtained using the simple permutation \(P_g\), which confirms that both permutation and feature channel transformation are needed to achieve equivariance.
5.3.4 Geometric Invariances
This section explores the geometric invariance properties of different neural network architectures. This is done by measuring the performance of the hybrid network \(\psi (P_g \phi (g^{-1} \mathbf {x}))\), where the spatial permutation matrix \(P_g\) is used to undo the effect of the geometric transformation in feature space as was done with the task-oriented objective (7). We compare this result to the one obtained previously where \(P_g\) was generalized to the learned equivariant map \(M_g\): the idea is that if the spatial permutation \(P_g\) is sufficient to achieve the same performance as \(M_g\) then the feature channels are already invariant to the nuisance transformation.
The performance of \(P_g\) against \(M_g\) is visualized in Fig. 10 (gray vs orange lines) for the different layers of ANet, Vgg16, and ResN50. We note that the invariance to horizontal flips is obtained progressively with depth. Consequently, the fully convolutional layers have access to a representation which is already invariant to this geometric transformation, which significantly simplifies the image classification task.
We also observe that there is a certain degree of scale invariance in the C5 representation of ANet and Vgg16 networks. This may help to explain why R-CNN object detectors (Girshick 2015; He et al. 2014; Ren et al. 2015) work well. Recall that these methods use a simple spatial resampler such as Spatial Pyramid Pooling to extract features corresponding to objects of different sizes and locations in the image. Resampling spatial coordinates is in principle insufficient to make the extracted region representation invariant to scale changes, unless, as appears to be the case, the feature channel values are also insensitive to scale.
Additionally, it can be seen in Fig. 10 that applying only the permutation \(P_g\) on the lower layers significantly reduces the performance of the network. We can observe that earlier representations are “anti-invariant” since the rest of the network is more sensitive to this nuisance transformation when this is applied in feature space (Figs. 11, 12).
Next, we study the map \(F_g\) to identify which feature channels are invariant: these are the ones that are best predicted by themselves after a transformation. However, invariance is almost never achieved exactly; instead, the degree of invariance of a feature channel is scored as the ratio of the Euclidean norm of the corresponding row of \(F_g\) with the same row after suppressing the “diagonal” component of that row. The p rows of \(F_g\) with the highest invariance score are then replaced by (scaled) rows of the identity matrix. Finally, the performance of the modified transformation \(\bar{F}_g\) is evaluated and accepted if the classification performance does not deteriorate by more than \(5\%\) relative to \(F_g\). The corresponding feature channels for the largest possible p are then considered approximately invariant.
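The scoring and replacement steps can be sketched as follows, under the simplifying assumption that \(F_g\) is represented as a square channel-by-channel matrix (i.e. with \(1\times 1\) filters), so that the “diagonal” component of row i is the single entry (i, i). The function names and the choice of the original diagonal entry as the scale of the identity row are assumptions; the accept/reject test on classification performance is not included.

```python
import math

def invariance_scores(F):
    """Score each row of a square matrix F by ||row|| / ||row without diagonal||.

    A row whose off-diagonal part is zero receives an infinite score.
    """
    scores = []
    for i, row in enumerate(F):
        full = math.sqrt(sum(a * a for a in row))
        off = math.sqrt(sum(a * a for j, a in enumerate(row) if j != i))
        scores.append(math.inf if off == 0 else full / off)
    return scores

def replace_most_invariant(F, p):
    """Replace the p highest-scoring rows of F by scaled identity rows.

    Returns the modified matrix and the sorted indices of the replaced rows.
    The scale used is the original diagonal entry (an assumption).
    """
    scores = invariance_scores(F)
    top = sorted(range(len(F)), key=lambda i: -scores[i])[:p]
    out = [list(row) for row in F]
    for i in top:
        out[i] = [F[i][i] if j == i else 0.0 for j in range(len(F))]
    return out, sorted(top)
```

In the procedure described above, p would then be increased for as long as the modified map keeps the classification performance within the stated tolerance.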
Table 3 reports the result of this analysis for horizontal and vertical flips, rescaling, and \(90^{\circ } \) rotation in the ANet CNN. There are several notable observations. First, for transformations to which the network has achieved invariance, such as horizontal flips and rescaling, this invariance is obtained largely in C3 or C4. Second, invariance does not always increase with depth (for example C1 tends to be more invariant than C2). This is possible because, even if the feature channels within a layer are invariant, the spatial pooling in the subsequent layer may not be. Third, the number of invariant features is significantly smaller for unexpected transformations such as vertical flips and \(90^{\circ } \) rotations, further validating the approach. These results corroborate the finding reported in Fig. 10, first row.
5.4 Application to StructuredOutput Regression
To complement the theoretical investigation thus far, this section shows a direct practical application of the learned equivariant mappings of Sect. 5 to the task of structured-output regression (Taskar et al. 2003). In structured regression an input image \(\mathbf {x}\) is mapped to a label \(\mathbf {y}\) by the function \(\hat{\mathbf {y}}(\mathbf {x}) = {\text {argmax}}_{\mathbf {y},\mathbf {z}} \langle \phi (\mathbf {x},\mathbf {y},\mathbf {z}), \mathbf {w}\rangle \) (direct regression) where \(\mathbf {z}\) is an optional latent variable and \(\phi \) is a joint feature map. If either \(\mathbf {y}\) or \(\mathbf {z}\) include geometric parameters, the joint features can be partially or fully rewritten as \(\phi (\mathbf {x},\mathbf {y},\mathbf {z})= M_{\mathbf {y},\mathbf {z}} \phi (\mathbf {x})\), reducing inference to the maximization of \(\langle M_{\mathbf {y},\mathbf {z}}^\top \mathbf {w}, \phi (\mathbf {x})\rangle \) (equivariant regression). There are two computational advantages to this approach: (i) the representation \(\phi (\mathbf {x})\) needs only to be computed once and (ii) the vectors \(M_{\mathbf {y},\mathbf {z}}^\top \mathbf {w}\) can be precomputed offline.
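The two computational advantages can be made explicit with a small sketch of equivariant regression without latent variables. Matrices are nested lists, the pose labels are arbitrary dictionary keys, and all function names are hypothetical.

```python
def matvec(M, x):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x_k for m, x_k in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def precompute_templates(w, maps):
    """Offline step: one transformed template M_y^T w per candidate pose y."""
    return {y: matvec(transpose(M), w) for y, M in maps.items()}

def predict_pose(templates, phi_x):
    """Online step: phi(x) is computed once and scored against every template."""
    return max(templates, key=lambda y: dot(templates[y], phi_x))
```

The sketch relies on the identity \(\langle \mathbf {w}, M\phi (\mathbf {x})\rangle = \langle M^\top \mathbf {w}, \phi (\mathbf {x})\rangle \), so that no feature needs to be recomputed or transformed at test time.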
This idea is demonstrated on the task of pose estimation, where \(\mathbf {y}= g\) is a geometric transformation in a class \(g^{-1}\in G\) of possible poses of an object. As an example, consider estimating the pose of cat faces in the PASCAL VOC 2007 (VOC07) (Everingham et al. 2007) data taking G either to be (i) rotations or (ii) affine transformations (Fig. 14). The rotations in G are sampled uniformly every 10 degrees and the ground-truth rotation of a face is defined by the line connecting the nose to the midpoint between the eyes. These keypoints are obtained as the center of gravity of the corresponding regions in the VOC07 part annotations (Chen et al. 2014). The affine transformations in G are obtained by clustering the vectors \([\mathbf {c}_l^\top , \mathbf {c}_r^\top ,\mathbf {c}_n^\top ]^\top \) containing the location of eyes and nose of 300 example faces in the VOC07 data.
The clusters are obtained using GMM-EM on the training data and used to map the test data to the same pose classes for evaluation. G then contains the set of affine transformations mapping the keypoints \([\bar{\mathbf {c}}_l^\top , \bar{\mathbf {c}}_r^\top , \bar{\mathbf {c}}_n^\top ]^\top \) in a canonical frame to each cluster center.
The matrices \(M_g\) are pre-learned (from generic images not containing cats) using FS with \(k=5\) and \(m=3\) as in Sect. 5.1. Since cat faces in VOC07 data are usually upright, a second more challenging version of the data (denoted by the symbol \(\circlearrowleft \)) augmented with random image rotations is considered as well. The direct \(\langle \mathbf {w}, \phi (g\mathbf {x})\rangle \) and equivariant \(\langle \mathbf {w}, M_{g}\phi (\mathbf {x})\rangle \) scoring functions are learned using 300 training samples and evaluated on 300 test ones.
Table 4 reports the accuracy and speed obtained for HOG and ANet CNN C3, C4, and C5 features for direct and equivariant regression. The latter is generally as good or nearly as good as direct regression, but up to 22 times faster, further validating the mappings \(M_g\). Figure 13 shows the cumulative error curves for the different regressors.
6 Analysis of Coverage and Equivalence
We now move our attention from equivariance to coverage and equivalence of CNN representations by first adapting the methods developed in the previous section to this analysis (Sect. 6.1) and then using them to study numerous cases of interest (Sect. 6.2).
The key findings from these experiments are:

Different networks trained to perform the same task tend to learn representations that are approximately equivalent.

Deeper and larger representations tend to cover well for shallower and smaller ones, but the converse is not always true. For example, the deeper layers of ANet cover for the shallower layers of the same network, Vgg16 layers cover well for ANet layers, and ResN50 layers cover well for Vgg16 layers. However, Vgg16 layers cannot cover for ResN50 layers.

Coverage and equivalence tend to be better for layers whose output spatial resolution matches. In fact, a layer’s resolution is a better indicator of compatibility than its depth.

When the same network is trained on two different tasks, shallower layers tend to be equivalent, whereas deeper ones tend to be less so, as they become more task-specific.
6.1 Methods
As for the map \(M_g\) in the case of equivariance, the covering map \(E_{\phi \rightarrow \phi '}\) of Eq. (2) must be estimated from data. Fortunately, a number of the algorithms used for estimating \(M_g\) are equally applicable to \(E_{\phi \rightarrow \phi '}\). In particular, the objective (3) can be adapted to the covering problem by replacing \(\phi (g\mathbf {x})\) by \(\phi '(\mathbf {x})\). Following the task-oriented loss formulation of Sect. 5.1, consider two representations \(\phi \) and \(\phi '\) and a predictor \(\psi '\) learned to solve a reference task using the representation \(\phi '\). For example, these could be obtained by decomposing two CNNs \(\zeta = \psi \circ \phi \) and \(\zeta ' = \psi ' \circ \phi '\) trained on the ImageNet ILSVRC12 data (but \(\phi \) could also be learned on a different dataset, with a different network architecture or could be a handcrafted feature representation) (Fig. 14).
The goal is to find a mapping \(E_{\phi \rightarrow \phi '}\) such that \(\phi ' \approx E_{\phi \rightarrow \phi '} \phi \). This map can be seen as a “stitching transformation” allowing \(\psi ' \circ E_{\phi \rightarrow \phi '} \circ \phi \) to perform as well as \(\psi ' \circ \phi '\) on the original classification task. Hence this transformation can be learned by minimizing the loss \(\ell (y_i, \psi ' \circ E_{\phi \rightarrow \phi '} \circ \phi (\mathbf {x}_i))\) with an objective similar to (6), resulting in the architecture of Fig. 15.
In a CNN, the stitching transformation \(E_{\phi \rightarrow \phi '}\) can be implemented as a stitching layer. Given the convolutional structure of the representation, this layer can be implemented as a bank of linear filters. No permutation layer is needed in this case, but it may be necessary to down/upsample the features if the spatial dimensions of \(\phi \) and \(\phi '\) do not match. This is done by using nearest neighbor interpolation for downsampling and bilinear interpolation for upsampling, resulting in a definition similar to (7), where \(P_g\) is defined as upscaling or downscaling based on the spatial resolution of \(\phi \) and \(\phi '\).
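A stitching layer of this kind can be sketched for the downsampling case as a nearest-neighbor resampling followed by a bank of linear filters. For brevity the filters are taken to be \(1\times 1\), which is an assumption of the sketch rather than a stated property of the layer; feature maps are nested lists indexed as height, width, channel, and the function names are hypothetical. Bilinear upsampling is not shown.

```python
def downsample_nearest(feat, out_h, out_w):
    """Nearest-neighbour resampling of an H x W x C feature map (nested lists)."""
    in_h, in_w = len(feat), len(feat[0])
    # Map each output cell centre to the input grid and pick the nearest cell.
    rows = [min(int((i + 0.5) * in_h / out_h), in_h - 1) for i in range(out_h)]
    cols = [min(int((j + 0.5) * in_w / out_w), in_w - 1) for j in range(out_w)]
    return [[list(feat[r][c]) for c in cols] for r in rows]

def stitch_1x1(feat, E, b):
    """Apply the same linear map E (C_out x C_in) and bias b at every site."""
    return [[[sum(e * x for e, x in zip(row, site)) + b_k
              for row, b_k in zip(E, b)]
             for site in line]
            for line in feat]
```

In use, the output of \(\phi \) would first be resampled to the spatial size expected by \(\psi '\) and then passed through the learned filters, for example `stitch_1x1(downsample_nearest(feat, 2, 2), E, b)`.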
In all experiments, training is done for seven epochs with \(2\cdot 10^5\) training samples, using a constant learning rate of \(10^{-2}\). The E map is initialized randomly with the Xavier method (Glorot and Bengio 2010), although we have observed that results are not sensitive to the form of initialization (random matrix, random permutation and identity matrix) or level of weight decay.
6.2 Results
The goal of this experimental section is to assess whether different image representations carry similar information. We perform three different investigations: covering of representations produced by different layers of the same network (Sect. 6.2.1), covering of representations obtained by training the same CNN architecture on different tasks (Sect. 6.2.2), and covering of representations obtained from different CNN architectures (Sect. 6.2.3).
6.2.1 Same Architecture, Different Layers
In the first experiment we “stitch” different layers of the same neural network architecture. This is done to assess the degree of change between different layers and to provide a baseline level of performance for subsequent experiments. Note that, when a layer is stitched to itself, the ideal stitching transformation E is the identity; nevertheless, we still initialize the map E with random noise and learn it from data. Due to the non-convex nature of the optimization, this will not in general recover the identity transformation perfectly, and can be used to assess the performance loss due to the limitations of the optimization procedure (Yosinski et al. (2014) refer to this issue as “fragile co-adaptation”) (Table 5).
Table 6b shows the results of this experiment on the CNet network. We test the stitching of any pair of layers in the architecture, to construct a matrix of results. Each entry in the matrix reports the accuracy of the stitched network on the ILSVRC12 data after learning the map \(E_{\phi \rightarrow \phi '}\) initialized from random noise (without learning, the error rate is 100% in all cases). There are three cases of interest: the diagonal (stitching a layer to itself), the upper diagonal (which amounts to skipping some of the layers) and the lower diagonal (which amounts to recomputing some of the layers twice).
Along the diagonal, there is a modest performance drop as a result of the fragile co-adaptation effect.
For the upper diagonal, skipping layers may reduce the network performance substantially. This is particularly true if one skips C2, but less so when skipping one or more of C3–C5. We note that C3–C5 operate on the same resolution, different to that of C2, so a portion of the drop can be explained by effects of aliasing in downsampling the feature maps in the stitching layer.
For the lower diagonal, rerouting the information through part of the network twice tends to preserve the baseline performance. This suggests that the stitching map E can learn to “undo” the effect of several network layers despite being a simple linear projection. One possible interpretation is that, while layers perform complex operations such as removing the effect of nuisance factors and building invariance, it is easy to reconstruct an equivalent version of the input given the result of such operations. Note that, since deeper layers contain many more feature channels than earlier ones, the map E performs dimensionality reduction. Still, there are limitations: we also evaluated reconstruction of the input image pixels, but in this case the error rate of the stitched network remained \(>\,94\%\).
The asymmetry of the results shows the importance of distinguishing the concepts of coverage (asymmetric) and equivalence (symmetric). Our results can be summarized as follows: “the deep layers of a neural network cover the earlier layers, but not vice versa”.
Table 6b also reports the standard deviation of the results obtained by randomly reinitializing E and relearning it several times. The stability of the results is proportional to their quality, suggesting that learning E is stable when stitching compatible representations and less stable otherwise.
Finally, we note that there is a correlation between the layers’ resolution and their compatibility. This can be observed in the similarity of Table 6a, reporting the resolution change, and Table 6b, reporting the performance of the stitched model. There are, however, subtle differences: for example, within the block P2–C5, where no sampling is performed, C5 is clearly more compatible with C4 than with P2. Similarly, downsampling by a factor of \(2^{1.1}\) can lead to a top-1 error anywhere from \(59.2\%\) to \(95.4\%\). We conclude that downsampling or upsampling may introduce an offset in the scores, but there are still clear differences between the results obtained for the same resampling factor. These results can therefore be used to draw observations about the compatibility of representations.
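As a worked illustration of how a resolution change can be expressed as a power of two, the short sketch below computes the base-2 logarithm of the ratio between two feature-map side lengths. The function name and the sizes 27 and 13 are illustrative assumptions and are not values read from Table 6a.

```python
import math


def log2_resolution_change(source_size: int, target_size: int) -> float:
    """Return log2 of the downsampling factor from source to target.

    Positive values mean the stitching layer must downsample, negative
    values mean it must upsample, and zero means equal resolution.
    """
    if source_size <= 0 or target_size <= 0:
        raise ValueError("feature-map sizes must be positive")
    return math.log2(source_size / target_size)


# Hypothetical side lengths of two feature maps (assumed, not from the paper).
change = log2_resolution_change(27, 13)
```

A non-integer exponent arises whenever the two side lengths are not related by an exact power of two.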
6.2.2 Same Architecture, Different Tasks
Next, we investigate the compatibility of nearly identical architectures trained on the same data twice, or on different data. In more detail, the first several layers \(\phi '\) of the ANet CNN \(\zeta '=\psi '\circ \phi '\) are swapped with layers \(\phi \) from CNet, also trained on the ILSVRC12 data, Plcs (Zhou et al. 2014), trained on the MIT Places data, and PlcsH, trained on a mixture of MIT Places and ILSVRC12 images. These representations have a similar, but not identical, structure and different parameterizations as they are trained independently.
Table 5 reports the top-1 error on ILSVRC12 of the hybrid models \(\psi ' \circ E_{\phi \rightarrow \phi '} \circ \phi \) where the covering map \(E_{\phi \rightarrow \phi '}\) is learned as usual. There are a number of notable facts. First, setting \(E_{\phi \rightarrow \phi '}=\mathbf {1}\), the identity map, yields a top-1 error \(>99\%\) (not shown in the table), confirming that different representations are not directly compatible. Second, a strong level of equivalence can be established up to C4 between ANet and CNet, a slightly weaker level can be established between ANet and PlcsH, and only a poor level of equivalence is observed for the deepest layers of Plcs. Specifically, the C1–2 layers of all networks are almost always interchangeable, whereas C5 is not as interchangeable, particularly for Plcs. This corroborates the intuition that C1–2 are generic image codes, whereas C5 is more task-specific. Still, even in the worst case, performance is dramatically better than chance, demonstrating that all such features are compatible to an extent. Results are also stable over repeated learning of the map E.
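The structure of a hybrid model \(\psi ' \circ E_{\phi \rightarrow \phi '} \circ \phi \) can be sketched with a toy example. Here two hypothetical two-dimensional linear "representations" encode the same input differently, so the identity map fails while a suitable linear map restores the prediction of the original model. All names and matrices are assumptions made for illustration, and E is computed in closed form for brevity, whereas in the experiments it is learned from data.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def inverse_2x2(m):
    """Invert a 2x2 matrix (assumed non-singular)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Two toy representations of the same input (hypothetical parameters).
A = [[2.0, 0.0], [0.0, 1.0]]        # parameters of phi
A_prime = [[0.0, 1.0], [4.0, 0.0]]  # parameters of phi'


def phi(x):
    return matvec(A, x)


def phi_prime(x):
    return matvec(A_prime, x)


def psi_prime(features):
    """Toy head trained on top of phi': predicts the index of the largest feature."""
    return max(range(len(features)), key=lambda i: features[i])


def hybrid(x, E):
    """Evaluate psi' o E o phi on input x."""
    return psi_prime(matvec(E, phi(x)))


identity = [[1.0, 0.0], [0.0, 1.0]]
E = matmul(A_prime, inverse_2x2(A))  # closed-form map from phi to phi'

x = [3.0, 1.0]
reference = psi_prime(phi_prime(x))  # prediction of the original model
with_identity = hybrid(x, identity)  # naive stitching
with_map = hybrid(x, E)              # stitching through E
```

In this toy case naive stitching changes the prediction, while stitching through E reproduces the prediction of the original model.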
6.2.3 Different Architectures, Same Task
The final experiment assesses the equivalence between layers of different neural network architectures trained on the same data. In this case, we stitch the output of the linear convolution layers as well as the output of the pooling layers, after ReLUs. Note that, since the two architectures differ, there is no “obvious” stitching point, so each possibility is evaluated.
ANet\(\rightarrow \)Vgg16 Table 7 shows the effect of replacing a subset of the Vgg16 layers with layers from the ANet network. Generally, ANet can partially cover the Vgg16 layers, but there is almost always a non-negligible performance drop compared to the more powerful Vgg16 configuration. The presence of the ReLU activation functions has little to no influence on coverage.
Contrary to the previous experiment, deeper ANet features fail to cover for earlier Vgg16 features (whereas deeper ANet features can generally cover well for early ANet features). It is possible that the constrained structure of the map \(E_{\phi \rightarrow \phi '}\) fails to capture the required transformation.
Vgg16\(\rightarrow \)ANet Next, Table 8 tests the reverse direction: whether Vgg16 can cover ANet features. The answer is mixed. The output of the Vgg16-P5 layer can cover well for ANet-C2 to P5, even though there is a significant resolution change. In fact, the performance is significantly better than that of ANet alone (reducing the 42.5% top-1 error of ANet to 34.9%), which suggests the degree to which the representational power of Vgg16 is contained in the convolutional layers. The ability of Vgg16-P5 to cover for ANet-C2–P5 may also be explained by the fact that the last three layers of ANet have a structure similar to that of the V4 block of Vgg16, as they all use \(3\times 3\) filters.
On the other hand, the earlier layers of Vgg16 cover significantly less well for ANet features than Vgg16P5.
ResN50\(\rightarrow \)Vgg16 Next, in Table 9 we assess whether ResN50 features can cover Vgg16 features. As seen in Fig. 2, these two architectures differ significantly in their structure; consequently, ResN50 fails to cover well for Vgg16 in most cases. Good performance is, however, obtained by stitching the top layers; for example, ResN50-\(R5_3\) covers Vgg16-P5 well. This suggests that the final layers of ResN50 are more similar to the top convolutional layers of Vgg16 than to its fully connected layers. It also indicates that the kind of information captured at different depths is determined predominantly by the spatial resolution of the features rather than by the depth or complexity of the representation.
Vgg16\(\rightarrow \)ResN50 It was not possible to use Vgg16 features to cover for ResN50 with our method at all. In all cases, the error remained \(>90\%\). We hypothesize that the lack of the residual connections in the Vgg16 network makes the features incompatible with the ResN50 ones.
7 Summary
This paper introduced the idea of studying representations by learning their equivariance and coverage/equivalence properties empirically. It was shown that shallow representations and the first several layers of deep state-of-the-art CNNs transform in an easily predictable manner with image warps. It was also shown that many representations tend to be interchangeable, and hence equivalent, despite differences, even substantial ones, in the architectures. Deeper layers share some of these properties but to a lesser degree, being more task-specific.
A similarity of spatial resolution is a key predictor of representation compatibility; having a sufficiently large spatial resolution is also predictive of the equivariance properties to geometric warps. Furthermore, deeper and larger representations tend to cover well for shallower and smaller ones.
In addition to their use as analytical tools, these methods have practical applications, such as accelerating structured-output regressors in a simple and elegant manner.
Notes
Here, \(g:\mathbb {R}^2\rightarrow \mathbb {R}^2\) is a transformation of the plane. An image \(\mathbf {x}\) is, up to discretization, a function \(\mathbb {R}^2\rightarrow \mathbb {R}^3\). The action \(g\mathbf {x}\) of the transformation on the image results in a new image \([g\mathbf {x}](u)=\mathbf {x}(g^{-1}(u))\).
Most HOG implementations use 9 orientation bins, breaking rotational symmetry.
Formally, denote by (x, y) the coordinates of a pixel in the input image \(\mathbf {x}\) and by \(p:(u,v)\mapsto (x,y)\) the affine function mapping the feature index (u, v) to the center (x, y) of the corresponding receptive field (measurement region) in the input image. Denote by \(\mathcal {N}_k(u,v)\) the k feature locations \((u',v')\) that are closest to (u, v) (the latter can have fractional coordinates) and use this to define the neighborhood of the back-transformed location (u, v) as \(\varOmega _{g,k}(u,v) = \mathcal {N}_k(p^{-1}\circ g^{-1} \circ p(u,v))\).
In the sense that \(g(x+u,y+v) = g(x,y) + (u',v')\).
Better accuracy could be obtained by using image warping techniques. For example, subpixel accuracy can be obtained by upsampling the permutation layer and then allowing the transformation filter to be translation variant (or, equivalently, by introducing a suitable nonlinear mapping between the permutation layer and the transformation filters).
The Hellinger distance \((\sum _i (\sqrt{x_i}-\sqrt{y_i})^2)^{1/2}\) is preferred to the Euclidean distance as the HOG features are histograms.
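As an illustration of the last note, the following minimal sketch computes the Hellinger distance as the Euclidean distance between element-wise square-rooted histograms, without any normalization constant. The function name and the toy histograms are hypothetical.

```python
import math
from typing import Sequence


def hellinger_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between the element-wise square roots of x and y.

    Both inputs are assumed to be non-negative histograms of equal length.
    """
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("histogram entries must be non-negative")
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(x, y)))


# Toy histograms (hypothetical values) with disjoint support.
d = hellinger_distance([1.0, 0.0], [0.0, 1.0])
```

Taking square roots compresses large bin counts, so a few dominant bins influence the distance less than they would under the plain Euclidean distance.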
References
Albanie, S. (2017). Estimates of memory consumption and flops for various convolutional neural networks. https://github.com/albanie/convnetburden. Accessed 8 August 2017.
Anselmi, F., Leibo, J. Z., Rosasco, L., Mutch, J., Tacchetti, A., & Poggio, T. (2016). Unsupervised learning of invariant representations. Theoretical Computer Science, 633, 112–121.
Aubry, M., & Russell, B. C. (2015). Understanding deep features with computer-generated imagery. In The IEEE international conference on computer vision (ICCV).
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1994). Signature verification using a siamese time delay neural network. In Advances in neural information processing systems (pp. 737–744).
Bruna, J., & Mallat, S. (2013). Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1872–1886.
Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.
Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. In Proceedings of the BMVC.
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what you can: Detecting and representing objects using holistic models and body parts. In IEEE conference on computer vision and pattern recognition (CVPR).
Cohen, T., & Welling, M. (2016). Group equivariant convolutional networks. In International conference on machine learning (pp. 2990–2999).
Cohen, T. S., & Welling, M. (2017). Steerable CNNs. In International conference on learning representations.
Csurka, G., Dance, C. R., Dan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In Proceedings of the ECCV workshop on statistical learning in computer vision.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the CVPR.
Dieleman, S., Willett, K. W., & Dambre, J. (2015). Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2), 1441–1459.
Dieleman, S., De Fauw, J., & Kavukcuoglu, K. (2016). Exploiting cyclic symmetry in convolutional neural networks. In International conference on machine learning (pp. 1889–1898).
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2013). Decaf: A deep convolutional activation feature for generic visual recognition. CoRR. arXiv:1310.1531.
Everingham, M., Zisserman, A., Williams, C., & Gool, L. V. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results. Technical report, Pascal Challenge.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object detection with discriminatively trained part based models. PAMI.
Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.
Freeman, W. T., Adelson, E. H., et al. (1991). The design and use of steerable filters. IEEE Transactions on Pattern analysis and machine intelligence, 13(9), 891–906.
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014a). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Girshick, R.B., Donahue, J., Darrell, T., & Malik, J. (2014b). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the CVPR.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Aistats (Vol. 9, pp. 249–256).
Goodfellow, I., Lee, H., Le, Q. V., Saxe, A., & Ng, A. Y. (2009). Measuring invariances in deep networks. In Advances in neural information processing systems (pp. 646–654).
He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision (pp. 346–361). Springer.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Huang, F. J., Boureau, Y. L., LeCun, Y., et al. (2007). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’07 (pp. 1–8). IEEE.
Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., & Garnett, R. (Eds.) Advances in Neural Information Processing Systems (Vol. 28, pp. 2017–2025). Curran Associates, Inc. http://papers.nips.cc/paper/5854spatialtransformernetworks.pdf.
Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In Proceedings of the CVPR.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Proceedings of the NIPS.
Laptev, D., Savinov, N., Buhmann, J. M., & Pollefeys, M. (2016). TI-pooling: Transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 289–297).
Lenc, K., & Vedaldi, A. (2015). Understanding image representations by measuring their equivariance and equivalence. In CVPR oral presentation.
Leung, T., & Malik, J. (2001). Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV, 43(1), 29–44.
Li, Y., Yosinski, J., Clune, J., Lipson, H., & Hopcroft, J. (2015). Convergent learning: Do different neural networks learn the same representations? In Feature extraction: modern questions and challenges (pp. 196–212).
Lindeberg, T. (1998). Principles for automatic scale selection. Technical report ISRN KTH/NA/P 98/14 SE, Royal Institute of Technology.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).
Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Proceedings of the ICCV.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60(2), 91–110.
Mahendran, A., & Vedaldi, A. (2016). Visualizing deep convolutional neural networks using natural preimages. International Journal of Computer Vision, 120(3), 233–255. https://doi.org/10.1007/s1126301609118.
Mikolajczyk, K., & Schmid, C. (2003). A performance evaluation of local descriptors. In Proceedings of the CVPR.
Parkhi, O., Vedaldi, A., Jawahar, C. V., & Zisserman, A. (2011). The truth about cats and dogs. In Proceedings of the ICCV.
Perronnin, F., & Dance, C. (2006). Fisher kernels on visual vocabularies for image categorization. In Proceedings of the CVPR.
Razavian, A. S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In CVPR DeepVision workshop.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99).
Ren, S., He, K., Girshick, R., Zhang, X., & Sun, J. (2017). Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7), 1476–1481. https://doi.org/10.1109/TPAMI.2016.2601099.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. IJCV, 115(3), 211–252.
Schmidt, U., & Roth, S. (2012a). Learning rotation-aware features: From invariant priors to equivariant descriptors. In Proceedings of the CVPR.
Schmidt, U., & Roth, S. (2012b). Learning rotation-aware features: From invariant priors to equivariant descriptors. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2050–2057). IEEE.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2014). Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229.
Sifre, L., & Mallat, S. (2013). Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the CVPR.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR arXiv:1409.1556.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep Fisher networks for large-scale image classification. In Proceedings of the NIPS.
Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of the ICCV.
Sjöstrand, K., Clemmensen, L. H., Larsen, R., & Ersbøll, B. (2018). SpaSM: A MATLAB Toolbox for Sparse Statistical Modeling. Journal of Statistical Software. https://doi.org/10.18637/jss.v084.i10.
Sohn, K., & Lee, H. (2012). Learning invariant representations with local transformations. CoRR arXiv:1206.6418.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).
Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. In Proceedings of the NIPS.
Uijlings, J., van de Sande, K., Gevers, T., & Smeulders, A. (2013). Selective search for object recognition. IJCV.
Vedaldi, A., & Fulkerson, B. (2010). VLFeat—An open and portable library of computer vision algorithms. In Proceedings of the ACM international conference on multimedia.
Vedaldi, A., & Lenc, K. (2014). MatConvNet—Convolutional neural networks for MATLAB. CoRR arXiv:1412.4564.
Vedaldi, A., & Soatto, S. (2005). Features for recognition: Viewpoint invariance for nonplanar scenes. In Proceedings of the ICCV.
Vondrick, C., Khosla, A., Malisiewicz, T., & Torralba, A. (2013). HOGgles: Visualizing object detection features. In Proceedings of the ICCV.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In Proceedings of the CVPR.
Yang, J., Yu, K., & Huang, T. (2010). Supervised translation-invariant sparse coding. In Proceedings of the CVPR.
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (pp. 3320–3328).
Zeiler, M. D., & Fergus, R. (2013). Visualizing and understanding convolutional networks. CoRR arXiv:1311.2901.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using Places database. In Advances in neural information processing systems.
Zhou, X., Yu, K., Zhang, T., & Huang, T. S. (2010). Image classification using super-vector coding of local image descriptors. In Proceedings of the ECCV.
Zitnick, L., & Dollar, P. (2014). Edge boxes: Locating object proposals from edges. In ECCV.
Acknowledgements
We would like to thank Samuel Albanie for help in preparing this manuscript. Karel Lenc was supported by ERC 677195IDIU Oxford Engineering Science DTA.
Additional information
Communicated by M. Hebert.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Lenc, K., Vedaldi, A. Understanding Image Representations by Measuring Their Equivariance and Equivalence. Int J Comput Vis 127, 456–476 (2019). https://doi.org/10.1007/s112630181098y
DOI: https://doi.org/10.1007/s112630181098y