1 Introduction

The recent progress with deep learning architectures [1,2,3,4,5,6,7,8,9,10] has demonstrated that hierarchical feature representations over multiple layers have higher potential compared to approaches based on single layers of receptive fields.

Although theoretical and empirical advances are being made [11,12,13,14,15,16,17,18], we currently lack a comparable understanding of the nonlinearities in deep networks in the way that scale-space theory provides a deep understanding of early visual receptive fields. Training deep networks is still very much an art [19]. Moreover, deep nets sometimes make serious errors. The observed problem with adversarial examples [20,21,22,23,24,25,26] can be taken as an indication that current deep nets may not solve the same type of problem as one might at first expect them to. For these reasons, it is of interest to develop theoretically principled approaches to capture nonlinear hierarchical relations between image structures at different scales as an extension of the regular scale-space paradigm.

A specific limitation of current deep nets is that they are not truly scale covariant. A deep network constructed by repeated application of compact \(3 \times 3\) or \(5 \times 5\) kernels, such as AlexNet [1], VGG-Net [2] or ResNet [5], implies an implicit assumption of a preferred size in the image domain, as induced by the discretization in terms of local \(3 \times 3\) or \(5 \times 5\) kernels of a fixed size. Spatial max pooling over image neighbourhoods of fixed size, such as over \(2 \times 2\) neighbourhoods over multiple layers, also implies that nonlinearities are applied relative to a fixed grid spacing. Thereby, due to the nonlinearities in the deep net, the output from the network may be qualitatively different depending on the specific size of the object in the image domain, which varies because of, e.g., different distances between the object and the observer. To handle this lack of scale covariance, approaches have been developed such as spatial transformer networks [27], sets of subnetworks applied in a multi-scale fashion [28] or combinations of deep nets with image pyramids [29]. Since the size normalization performed by a spatial transformer network is not guaranteed to be truly scale covariant, and since traditional image pyramids imply a loss of image information that can be interpreted as corresponding to undersampling, it is of interest to develop continuous approaches for deep networks that guarantee true scale covariance, or better approximations thereof.

An argument that we want to put forward in this article is that truly scale-covariant deep networks, with their associated extended notion of truly scale-invariant networks, may be conceptually much easier to achieve if we set aside the issues of spatial sampling in the first modelling stage and model the transformations between adjacent layers in the deep network as continuous translation-covariant operators, as opposed to discrete filters. Specifically, we will propose to combine concepts from hierarchical families of CNNs with scale-space theory to define continuous families of hierarchical networks, with each member of the family being a rescaled copy of the base network, in a corresponding way as an input image is embedded into a one-parameter family of images, with scale as the parameter, within the regular scale-space framework. A structural advantage of such a continuous model over a discrete model is that it can guarantee provable scale covariance in the following way: if the computational primitives used for defining a hierarchical network are defined in a multi-scale manner, e.g. from Gaussian derivatives and possibly nonlinear differential expressions constructed from these, and if the scale parameters of the primitives in the higher layers are proportional to the scale parameter in the first layer, then a multi-scale hierarchical network defined over all the scale parameters in the first layer is guaranteed to be truly scale covariant.

This situation is in contrast to the way most deep nets are currently constructed, as a combination of discrete primitives whose scales are instead proportional to the grid spacing. That in turn implies a preferred scale of the computations, which will violate scale covariance unless the image data are resampled into multiple rescaled copies of the input image prior to being used as input to the deep net. If such spatial resampling to different levels of resolution is used, however, it may be harder to combine information between the different multi-scale channels than with a continuous model that preserves the same spatial sampling of the input data. Rescaling the image data prior to later-stage processing may also introduce sampling artefacts.

The subject of this article is to first present a general sufficiency argument for constructing provably scale-covariant hierarchical networks based on a spatially continuous model of the transformations between adjacent layers in the hierarchy. This sufficiency result holds for a very wide class of possible continuous hierarchical networks. Then, we will develop in more detail one example of such a continuous network for capturing nonlinear hierarchical relations between features over multiple scales.

Building upon axiomatic modelling of visual receptive fields in terms of Gaussian derivatives and affine extensions thereof, which can serve as idealized models of simple cells in the primary visual cortex [30,31,32,33], we will propose a functional model for complex cells in terms of an oriented quasi quadrature measure, which combines first- and second-order directional affine Gaussian derivatives according to an energy model [34,35,36,37]. Compared to earlier approaches of related types [38,39,40,41,42], our quasi quadrature model has the conceptual advantage that it is expressed in terms of scale-space theory, in addition to reproducing well the properties of complex cells reported by [34, 43,44,45]. Thereby, this functional model of complex cells allows for a conceptually easy integration with transformation properties, specifically truly provable scale covariance, or a generalization to affine covariance provided that the receptive field responses are computed in terms of affine Gaussian derivatives as opposed to regular Gaussian derivatives.

Then, we will combine such oriented quasi quadrature measures in cascade, building upon the early idea of Fukushima [38] of using Hubel and Wiesel’s findings regarding receptive fields in the primary visual cortex [46,47,48] to build a hierarchical neural network from repeated application of models of simple and complex cells. This will result in a handcrafted network, termed quasi quadrature network, with structural similarities to the scattering network proposed by Bruna and Mallat [41], although expressed in terms of Gaussian derivatives instead of Morlet wavelets.

We will show how the scale-space properties of the quasi quadrature primitive in this representation can be theoretically analysed and how the resulting handcrafted network becomes provably scale covariant and rotation covariant, in such a way that the multi-scale and multi-orientation network commutes with scaling transformations and rotations in the spatial image domain.

As a proof of concept that the proposed methodology can lead to meaningful results, we will experimentally investigate a prototype application to texture classification based on a substantially simplified representation that uses just the average values over image space of the resulting QuasiQuadNet. It will be demonstrated that the resulting approach leads to competitive results compared to classical texture descriptors as well as to other handcrafted networks.

Specifically, we will demonstrate that in the presence of substantial scaling transformations between the training data and the test data, true scale covariance substantially improves the ability to perform predictions or generalizations beyond the variabilities that are spanned by the training data.

1.1 Structure of this Article

Section 2 begins with an overview of related work, with emphasis on related scale-space approaches, deep learning approaches that address the notion of scale, rotation-covariant deep networks, biologically inspired networks, and other handcrafted or structured networks, including hybrid approaches between scale-space theory and deep learning.

As a general motivation for studying hierarchical networks that are based on primitives that are continuous over image space, Sect. 3 then presents a general sufficiency argument that guarantees provable scale covariance for a very wide class of networks defined from layers of scale-space operations coupled in cascade.

To provide an additional theoretical basis for a subclass of such networks that we shall study in more detail in this article, based on functional models of complex cells coupled in cascade, Sect. 4 describes a quasi quadrature measure over a purely 1D signal, which measures the energy of first- and second-order Gaussian derivative responses. Theoretical properties of this entity are analysed with regard to scale selectivity and scale selection properties, and we show how free parameters in the quasi quadrature measure can be determined from closed-form calculations.

In Sect. 5, an oriented extension of the 1D quasi quadrature measure is presented over multiple orientations in image space and is proposed as a functional model that mimics some of the known properties of complex cells, while at the same time being based on axiomatically derived affine Gaussian derivatives that well model the functional properties of simple cells in the primary visual cortex.

In Sect. 6, we propose to couple such quasi quadrature measures in cascade, leading to a class of hierarchical networks based on scale-space operations that we term quasi quadrature networks. We give explicit proofs of scale covariance and rotational covariance of such networks and show examples of the type of information that can be captured in different layers in the hierarchies.

Section 7 then outlines a prototype application to texture analysis based on a substantially mean-reduced version of such a quasi quadrature network, with the feature maps in the different layers reduced to just their mean values over image space. By experiments on three datasets for texture classification, we show that this approach leads to promising results that are comparable to or better than other handcrafted networks or more dedicated handcrafted texture descriptors. We also present experiments on scale prediction and scale generalization, which quantify the performance over scaling transformations for which the variabilities in the testing data are not spanned by corresponding variabilities in the training data.

Finally, Sect. 8 concludes with a summary and discussion.

1.2 Relations to Previous Contribution

This paper constitutes a substantially extended version of a conference paper presented at the SSVM 2019 conference [49], with substantial additions concerning:

  • the motivations underlying the developments of this work and the importance of scale covariance for deep networks (Sect. 1),

  • a wider overview of related work (Sect. 2),

  • the formulation of a general sufficiency result to guarantee scale covariance of hierarchical networks constructed from computational primitives (linear and nonlinear filters) formulated based on scale-space theory (Sect. 3),

  • additional explanations regarding the quasi quadrature measure (Sect. 4) and its oriented affine extension to model functional properties of complex cells (Sect. 5),

  • a better explanation of the quasi quadrature network constructed by coupling oriented quasi quadrature measures in cascade, including a figure illustrating the network architecture, details of the discrete implementation, issues of exact versus approximate covariance or invariance in a practical implementation and experimental results showing examples of the type of information that is computed in different layers of the hierarchy (Sect. 6),

  • a more extensive experimental section showing the results of applying a mean-reduced QuasiQuadNet to texture classification, including additional experiments demonstrating the importance of scale covariance and better overall descriptions of the experiments that could not be given in the conference paper because of space limitations (Sect. 7).

In relation to the SSVM 2019 paper, this paper therefore gives a more general treatment of the notion of scale covariance, with validity for general continuous hierarchical networks, presents more experimental results regarding the prototype application to texture classification and gives overall better descriptions of the subjects treated in the paper, including more extensive references to related literature.

2 Related Work

In the area of scale-space theory, theoretical results have been derived showing that Gaussian kernels and Gaussian derivatives constitute a canonical class of linear receptive fields for an uncommitted vision system [30, 31, 50,51,52,53,54,55,56,57,58,59,60,61,62]. The conditions that specify this uniqueness property are basically linearity, shift invariance and regularity properties combined with different ways of formalizing the notion that new structures should not be created from finer to coarser scales in a multi-scale representation.

The receptive field responses obtained by convolution with such Gaussian kernels and Gaussian derivatives are truly scale covariant—a property that has been used for designing a large number of scale-covariant and scale-invariant feature detectors and image descriptors [36, 63,64,65,66,67,68,69,70,71]. With the generalization to affine covariance and affine invariance based on the notion of affine scale-space [51, 56, 66, 72, 73], these theoretical developments served as a conceptual foundation that opened up for a very successful track of methodology development for image-based matching and recognition in classical computer vision.

In the area of deep learning, approaches to tackle the notion of scale have been developed in different ways. By augmenting the training images with multiple rescaled copies of each training image or by randomly resizing the training images over some scale range (scale jittering), the robustness of a deep net can usually be extended to moderate scaling factors [2, 74]. Another basic data-driven approach consists of training a module to estimate spatial scaling factors from the data by a spatial transformer network [27, 75]. A more structural approach consists of applying deep networks to multiple layers in an image pyramid [29, 76,77,78], or using some other type of multi-channel approach where the input image is rescaled to different resolutions, possibly combined with interactions or pooling between the layers [79,80,81,82]. Variations or extensions of this approach include scale-dependent pooling [83], using sets of subnetworks in a multi-scale fashion [28], using dilated convolutions [84,85,86], scale-adaptive convolutions [87] or adding additional branches of down-samplings and/or up-samplings in each layer of the network [88, 89].

A more specific approach to designing a scale-covariant network is to spatially warp the image data prior to image filtering by a log-polar transformation [90, 91]. Then, scaling and rotation transformations are mapped to mere translations in the transformed domain, although this property only holds provided that the origin of the log-polar transformation can be preserved between the training data and the testing data. Specialized learning approaches for scale-covariant or affine-covariant feature detection have been developed for interest point detection [92, 93].

There is a large literature on approaches to achieve rotation-covariant networks [94,95,96,97,98,99,100,101,102,103] with applications to different domains including astronomy [104], remote sensing [105], medical image analysis [106, 107] and texture classification [108]. There are also approaches to invariant networks based on formalism from group theory [42, 109, 110].

In the context of more general classes of image transformations, it is worth noting that beyond the classes of spatial scaling transformations and spatial affine transformations (including rotations), the framework of generalized axiomatic scale-space theory [111, 112] does also allow for covariance and/or invariance with regard to temporal scaling transformations [113], Galilean transformations and local multiplicative intensity transformations [32, 33].

Concerning biologically inspired neural networks, Fukushima [38] proposed to build upon Hubel and Wiesel’s findings regarding receptive fields in the primary visual cortex (see [48]) to construct a hierarchical neural network from repeated application of models of simple and complex cells. Poggio and his co-workers built on this idea and constructed handcrafted networks based on two layers of such models expressed in terms of Gabor functions [39, 40, 114].

The approach of scattering convolution networks [41, 115, 116] is closely related, where directional odd and even wavelet responses are computed and combined with a nonlinear modulus (magnitude) operator over a set of different orientations in the image domain and over a hierarchy over a dyadic set of scales.

Other types of handcrafted or structured networks have been constructed by applying principal component analysis in cascade [117] or by using Gabor functions as primitives to be modulated by learned filters [118].

Concerning hybrid approaches between scale space and deep learning, Jacobsen et al. [119] construct a hierarchical network from learned linear combinations of Gaussian derivative responses. Shelhamer et al. [120] compose free-form filters with affine Gaussian filters to adapt the receptive field size and shape to the image data.

Concerning the use of a continuous model of the transformation from the input data to the output data in a hierarchical computation structure, which we here develop for deep networks with the motivation of making it possible for the network to fulfil geometric transformation properties of spatial input data, such a notion of continuous transformations from the input to the output was proposed as a model for neural networks prior to the deep learning revolution by Le Roux and Bengio [121], from the viewpoint of an uncountable number of hidden units, with the suggestion that this makes it possible for the network to represent some smooth functions more compactly.

For an overview of texture classification, which we shall later use as an application domain, we refer to the recent survey by Liu et al. [122] and the references therein.

In this work, we aim towards a conceptual bridge between scale-space theory and deep learning, with specific emphasis on handling the variability in image data caused by scaling transformations. We will show that it is possible to design a wide class of possible scale-covariant networks by coupling linear or nonlinear expressions in terms of Gaussian derivatives in cascade. As a proof of concept that such a construction can lead to meaningful results, we will present a specific example of such a network, based on a mathematically and biologically motivated model of complex cells and demonstrate that it is possible to get quite promising performance on texture classification, comparable or better than many classical texture descriptors or other handcrafted networks. Specifically, we will demonstrate how the notion of scale covariance improves the ability to perform predictions or generalizations to scaling variabilities in the testing data that are not spanned by the training data.

We propose that this opens up for studying other hybrid approaches between scale-space theory and deep learning to incorporate explicit modelling of image transformations as a prior in hierarchical networks.

3 General Scale Covariance Property for Continuous Hierarchical Networks

For a visual observer that views a dynamic world, the size of objects in the image domain can vary substantially, because of variations in the distance between the objects and the observer and because of objects having physically different sizes in the world. If we rescale an image pattern by a uniform scaling factor, we would in general like the perception of objects in the underlying scene to be preserved. A natural precursor to achieving such a scale-invariant perception of the world is to have a scale-covariant image representation. Specifically, a scale-covariant image representation can often be used as a basis for constructing scale-invariant image descriptors and/or scale-invariant recognition schemes.

In the area of scale-space theory [30, 50, 52, 53, 56, 57, 59, 61], theoretically well-founded approaches have been developed to handle the notion of scale in image data and to construct scale-covariant and scale-invariant image representations [36, 63,64,65,66,67,68,69, 71, 111]. In this section, we will present a general argument of how these notions can be extended to construct provably scale-covariant hierarchical networks, based on continuous models of the image operations between adjacent layers.

Given an image f(x), consider a multi-scale representation \(L(x;\; s)\) constructed by Gaussian convolution, and then, from this scale-space representation, define a family of scale-parameterized, possibly nonlinear, operators \({\mathcal{D}}_{1,s_1}\) over a continuum of scale parameters \(s_1\):

$$\begin{aligned} F_1(\cdot ;\; s_1) = ({\mathcal{D}}_{1,s_1} \, f)(\cdot ), \end{aligned}$$
(1)

where the effect of the Gaussian smoothing operation is incorporated in the operator \({\mathcal{D}}_{1,s_1}\).

Within the framework of Gaussian scale-space representation [30, 50, 52, 53, 56, 57, 59, 61], we could consider these operators as being formed from sufficiently homogeneous possibly nonlinear combinations of Gaussian derivative operators, such that they under a rescaling of the input domain \(x' = S x\) by a factor of S are guaranteed to obey the scale covariance property:

$$\begin{aligned} F'_1(x';\; s'_1) = S^{\alpha _1} F_1(x;\; s_1) \end{aligned}$$
(2)

for some constant \(\alpha _1\) and some transformation of the scale parameters \(s'_1 = \phi _1(s_1)\). In other words, for any image representation computed over the original image domain x at scale \(s_1\), it should be possible to find a corresponding representation over the transformed domain \(x' = S x\) at scale \(s_1'\) with a possibly transformed magnitude as determined by the relative amplification factor \(S^{\alpha _1}\).

Fig. 1

Commutative diagram for a scale-covariant hierarchical network constructed according to the presented sufficiency result. Provided that the individual differential operators \({\mathcal{D}}_{k,s_k}\) between adjacent layers are scale covariant, which, for example, holds for the class of homogeneous differential expressions of the form (12) as well as self-similar compositions of such operations that additionally satisfy corresponding homogeneity requirements, it follows that it will be possible to perfectly match the corresponding layers \(F_k\) and \(F_k'\) under a scaling transformation of the underlying image domain \(f'(x') = f(x)\) for \(x' = Sx\), provided that the scale parameter \(s_k\) in layer k is proportional to the scale parameter \(s_1\) in the first layer, \(s_k = r_k^2 \, s_1\), for some scalar constants \(r_k\). For such a network constructed from scale-space operations based on the Gaussian scale-space theory framework, the scale parameters in the two domains should be related according to \(s_k' = S^2 s_k\)

Fig. 2

A hierarchical network defined by coupling scale-covariant differential expressions formulated within the continuous scale-space framework will be guaranteed to be provably scale covariant provided that the scale parameters in higher layers \(s_k\) for \(k \ge 2\) are proportional to the scale parameter \(s_1\) in the first layer. If the scale normalization parameter \(\gamma \) in the scale-normalized derivative expressions is equal to one, then general differential expressions in terms of such derivatives can be used based on the transformation property (9). If the scale normalization parameter \(\gamma \) is not equal to one, then one can take homogeneous polynomial differential expressions of the form (12) as well as self-similar transformations of such expressions. (In this schematic illustration, the arguments of the layers \(F_1\), \(F_2\), \(F_{k-1}\) and \(F_k\), which should be \(F_ 1(\cdot ;\; s_1)\), \(F_ 2(\cdot ;\; s_1, s_2)\), \(F_{k-1}(\cdot ;\; s_1, s_2, \dots , s_{k-1})\) and \(F_k(\cdot ;\; s_1, s_2, \dots , s_{k-1}, s_k)\), respectively, have been suppressed to simplify the notation. The argument of the input data f should be \(f(\cdot )\).)

By concatenating a set of corresponding scale-parameterized differential operators \({\mathcal{D}}_{k,s_k}\)

$$\begin{aligned} F_k(\cdot ;\; s_1, \dots , s_{k-1}, s_k) = {\mathcal{D}}_{k,s_k} \, F_{k-1}(\cdot ;\; s_1, \dots , s_{k-1}) \end{aligned}$$
(3)

that obey similar scale covariance properties such that

$$\begin{aligned} F'_k(x';\; s'_1, \dots , s'_k) = S^{\alpha _1} \dots \, S^{\alpha _{k}} \, F_k(x;\; s_1, \dots , s_k), \end{aligned}$$
(4)

it follows that the combined hierarchical network is guaranteed to be provably scale covariant, see Figs. 1 and 2 for schematic illustrations. Specifically, it is natural to choose the scale parameters \(s_k\) in the higher layers proportional to the scale parameter \(s_1\) in the first layer to guarantee scale covariance.

More generally, we could also consider constructing scale-covariant networks from other types of scale-covariant operators that obey similar scaling properties as in Eqs. (2) and (4), for example, expressed in terms of a basis of rescaled Gabor functions or a family of continuously rescaled wavelets. Then, however, the information-reducing properties from finer to coarser scales that hold for representations computed by Gaussian convolution and Gaussian derivatives are not guaranteed to hold. As mentioned above, the Gaussian kernel and the Gaussian derivatives can be uniquely determined from different ways of formalizing the requirement that they should not introduce new image structures from finer to coarser scales in a multi-scale representation [30, 31, 50,51,52,53,54,55,56,57,58,59,60,61,62].

In this overall structure, there is a large flexibility in how to choose the operators \({\mathcal{D}}_{k,s_k}\). Within the family of operators defined from a scale-space representation, we could consider a large class of differential expressions and differential invariants in terms of scale-normalized Gaussian derivatives [36] that guarantee provable scale covariance.

For example, if we choose to express the first differential operator \({\mathcal{D}}_{1,s_1}\) in a basis in terms of scale-normalized derivatives [36] (here with the multi-index notation \(\partial _x^n = \partial _{x_1^{n_1} \ldots x_D^{n_D}}\) for the partial derivatives in D dimensions and \(|n| = n_1 + \dots + n_D\))

$$\begin{aligned} \partial _{\xi ^n} = \partial _{x^n,\gamma \text {-norm}} = s^{|n| \gamma /2} \, \partial _x^n \end{aligned}$$
(5)

computed from a scale-space representation of the input signal

$$\begin{aligned} L_1(\cdot ;\; s_1) = g(\cdot ;\; s_1) * f(\cdot ) \end{aligned}$$
(6)

by convolution with Gaussian kernels

$$\begin{aligned} g(x;\; s) = \frac{1}{(2 \pi s)^{D/2}} \mathrm{e}^{-\frac{|x|^2}{2s}} \end{aligned}$$
(7)

and with s in (5) determined from \(s_1\) in (6), it then follows that under a rescaling of the image domain \(f'(x') = f(x)\) for \(x' = S \, x\) the scale-normalized derivatives transform according to [36, Eq. (20)]

$$\begin{aligned} \partial _{\xi '^n} L'_1(x';\; s'_1) = S^{|n|(\gamma -1)} \, \partial _{\xi ^n} L_1(x;\; s_1) \end{aligned}$$
(8)

provided that the scale parameters are matched according to \(s'_1 = S^2 \, s_1\). Specifically, in the special case of choosing \(\gamma = 1\), the scale-normalized derivatives will be equal

$$\begin{aligned} \partial _{\xi '^n} L'_1(x';\; s'_1) = \partial _{\xi ^n} L_1(x;\; s_1). \end{aligned}$$
(9)

This implies that any scale-parameterized differential operator \({\mathcal{D}}_{1,s_1}\) that can be expressed as a sufficiently regular function \(\psi \)

$$\begin{aligned} {\mathcal{D}}_{1,s_1} \, f = \psi ({\mathcal{J}}_{N,s_1} L_1) \end{aligned}$$
(10)

of the scale-normalized N-jet, \({\mathcal{J}}_{N,s_1} L_1\), of the scale-space representation \(L_1\) of the input image, which is the union of all partial derivatives up to order N

$$\begin{aligned} {\mathcal{J}}_{N,s_1} L_1 = \cup _{1 \le |n| \le N} \, \partial _{\xi ^n} L_1, \end{aligned}$$
(11)

will satisfy the scale covariance property (2) for \(\alpha _1 = 0\). More generally, it is not necessary that all the derivatives are computed at the same scale, although such a choice could possibly be motivated from conceptual simplicity.
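To make the covariance property (9) concrete, the following minimal numerical sketch (our illustration, assuming unit grid spacing, a sampled Gaussian blob as input and discrete Gaussian derivatives computed with scipy; all names are our own) verifies that the scale-normalized first-order derivatives for \(\gamma = 1\) agree at corresponding points \(x' = S x\) when the scale parameters are matched according to \(s'_1 = S^2 s_1\):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

N, s0 = 4096, 50.0                    # grid size; variance of the test blob
x = np.arange(N) - N // 2
f = np.exp(-x**2 / (2 * s0))          # Gaussian blob f(x)

S = 2                                 # spatial scaling factor
f_scaled = np.exp(-x**2 / (2 * S**2 * s0))   # f'(x') = f(x'/S) = f(x)

def scnorm_deriv(f, s, n, gamma=1.0):
    """Scale-normalized derivative s^(n*gamma/2) d^n/dx^n of the
    scale-space representation at scale s (s = variance, unit grid)."""
    return s**(n * gamma / 2) * gaussian_filter1d(f, np.sqrt(s), order=n)

s1 = 10.0
a = scnorm_deriv(f, s1, n=1)                 # over the original domain
b = scnorm_deriv(f_scaled, S**2 * s1, n=1)   # over the rescaled domain

# Eq. (9): the responses agree at corresponding points x' = S x
idx = np.arange(-800, 801)
print(np.max(np.abs(a[N//2 + idx] - b[N//2 + S*idx])))   # close to zero
```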

In the less specific case of choosing \(\gamma \ne 1\), we can consider homogeneous polynomials of scale-normalized derivatives of the form

$$\begin{aligned} {\mathcal{D}}_{s_1} f = \sum _{i=1}^I c_i \prod _{j = 1}^J L_{1,x^{\beta _{ij}}} \end{aligned}$$
(12)

for which the sum of the orders of differentiation in a certain term

$$\begin{aligned} \sum _{j=1}^J |\beta _{ij}| = M \end{aligned}$$
(13)

does not depend on the index i of that term. The corresponding scale-normalized expression with the regular spatial derivatives replaced by \(\gamma \)-normalized derivatives is

$$\begin{aligned} {\mathcal{D}}_{\gamma \text {-norm},s_1} f = s_1^{M\gamma /2} \, {\mathcal{D}}_{s_1} f \end{aligned}$$
(14)

and transforms according to [36, Eq. (25)]

$$\begin{aligned} {\mathcal{D}}'_{\gamma \text {-norm},s'_1} f' = S^{M(\gamma -1)} {\mathcal{D}}_{\gamma \text {-norm},s_1} f \end{aligned}$$
(15)

under any scaling transformation \(f'(x') = f(x)\) for \(x' = S \; x\), provided that the scale levels are appropriately matched, \(s'_1 = S^2 \, s_1\). Such a self-similar form of scaling transformation is also preserved under self-similar transformations \(z \mapsto z^{\delta }\) of such expressions, as well as for a rich family of polynomial combinations and rational expressions of such expressions, as long as the scale covariance property (2) is preserved.

A natural complementary argument to constrain such self-similar compositions is to preserve the dimensionality of the image data, such that each layer \(F_k\) has the same dimensionality \([\hbox {intensity}]\) as the input image f. Suppose that a polynomial is used for constructing a composed nonlinear differential expression \({\mathcal{D}}_{\mathrm{comp},s_1} f\) by combinations of differential expressions of the form (14), and that this composed polynomial is a homogeneous polynomial of order P relative to the underlying partial derivatives \(\partial _{\xi ^n}\) in the N-jet, in the sense that under a rescaling of the magnitude of the original image data f by a factor of \(\beta \), such that \(f'(x') = \beta f(x)\), the differential expression transforms according to

$$\begin{aligned} {\mathcal{D}}'_{\mathrm{comp},s_1'} f' = \beta ^P {\mathcal{D}}_{\mathrm{comp},s_1} f, \end{aligned}$$
(16)

we should then raise that differential expression to the power \(1/P\) to preserve the dimensionality \([\hbox {intensity}]\). A similar argument applies to differential entities formed from rational expressions of differential expressions of the form (14), as long as the scale covariance property (2) is preserved.
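As a concrete illustration of this dimensionality argument (an example added here for clarity, not one of the expressions used later in the paper), consider the homogeneous polynomial \({\mathcal{D}}_{s_1} f = L_{1,x_1}^2 + L_{1,x_2}^2\) over a 2D domain, which is of the form (12) with \(M = 2\) and transforms with \(P = 2\) under intensity rescalings. Raising its \(\gamma \)-normalized counterpart to the power \(1/P = 1/2\) yields the scale-normalized gradient magnitude

$$\begin{aligned} \left( {\mathcal{D}}_{\gamma \text {-norm},s_1} f \right) ^{1/2} = \sqrt{s_1^{\gamma } \left( L_{1,x_1}^2 + L_{1,x_2}^2 \right) }, \end{aligned}$$

which again has dimensionality \([\hbox {intensity}]\). The quasi quadrature measure to be introduced in Sect. 4 follows the same principle, being defined as the square root of an expression that is quadratic (\(P = 2\)) in the underlying scale-normalized derivatives.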

Corresponding reasoning as performed here for the transformation from the input image f to the first layer \(F_1\) applies to the transformations \({\mathcal{D}}_{k,s_k}\) between any pair of adjacent layers \(F_{k-1}\) and \(F_k\). This implies that if the differential operators \({\mathcal{D}}_{k,s_k}\) are chosen from similar families of differential operators as described above for the first differential operator \({\mathcal{D}}_{1,s_1}\), then the entire layered hierarchy will be scale covariant, provided that the scale parameter \(s_k\) in layer k is proportional to the scale parameter \(s_1\) in the first layer, \(s_k = r_k^2 \, s_1\), for some scalar constants \(r_k\) (see Fig. 1). This opens up for a large class of provably scale-covariant continuous hierarchical networks based on differential operators defined from the scale-space framework, where it remains to be determined which of these possible networks lead to desirable properties in other respects. In the following, we will develop one specific way of defining such a scale-covariant continuous network, by choosing these operators based on functional models of complex cells expressed within the Gaussian scale-space paradigm.

4 The Quasi Quadrature Measure Over a 1D Signal

Consider the scale-space representation [50, 52, 53, 56, 57, 59, 61]

$$\begin{aligned} L(\cdot ;\; s) = g(\cdot ;\; s) * f(\cdot ) \end{aligned}$$
(17)

of a 1D signal f(x) defined by convolution with Gaussian kernels

$$\begin{aligned} g(x;\; s) = \frac{1}{\sqrt{2\pi s}} \, \mathrm{e}^{-\frac{x^2}{2s}} \end{aligned}$$
(18)

and with scale-normalized derivatives according to [36]

$$\begin{aligned} \partial _{\xi ^n} = \partial _{x^n,\gamma \text {-norm}} = s^{n \gamma /2} \, \partial _x^n. \end{aligned}$$
(19)

In this section, we will describe a quasi quadrature entity that measures the local energy in the first- and second-order derivatives in the scale-space representation of a 1D signal and analyse its behaviour to image structures over multiple scales. Later in Sect. 5, an oriented extension of this measure to 2D image space will be used for expressing a functional model of complex cells that reproduces some of the known properties of complex cells.

4.1 Quasi Quadrature Measure in 1D

Motivated by the fact that the first-order derivatives primarily respond to the locally odd component of the signal, whereas the second-order derivatives primarily respond to the locally even component of a signal, it is natural to aim at a differential feature detector that combines locally odd and even components in a complementary manner. By specifically combining the first- and second-order scale-normalized derivative responses in a Euclidean way, we obtain a quasi quadrature measure of the form

$$\begin{aligned} {\mathcal{Q}}_{x,\mathrm{norm}} L = \sqrt{\frac{s \, L_x^2 + C \, s^2 \, L_{xx}^2}{s^{\varGamma }}} \end{aligned}$$
(20)

as a modification of the quasi quadrature measures previously proposed and studied in [36, 37], with the scale normalization parameters \(\gamma _1\) and \(\gamma _2\) of the first- and second-order derivatives coupled according to \(\gamma _1 = 1 - \varGamma \) and \(\gamma _2 = 1 - \varGamma /2\) to enable scale covariance, since derivative expressions of different orders can only be added in a scale-covariant manner for the scale-invariant choice of \(\gamma = 1\). This differential entity can be seen as an approximation of the notion of a quadrature pair of an odd and even filter [123], as more traditionally formulated based on a Hilbert transform [124, pp. 267–272], while confined within the family of differential expressions based on Gaussian derivatives.
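For concreteness, a minimal discrete sketch of the measure (20) (our illustration, computing the Gaussian derivatives with scipy on a unit grid and with s denoting the variance of the Gaussian; the function name is our own) could read:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def quasi_quadrature_1d(f, s, C=8/11, Gamma=0.0):
    """1D quasi quadrature measure (20), combining scale-normalized
    first- and second-order Gaussian derivatives at scale s (variance)."""
    sigma = np.sqrt(s)
    Lx = gaussian_filter1d(f, sigma, order=1)
    Lxx = gaussian_filter1d(f, sigma, order=2)
    return np.sqrt((s * Lx**2 + C * s**2 * Lxx**2) / s**Gamma)
```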

Fig. 3

1D Gaussian derivative kernels of orders 0, 1 and 2 for \(s_0 = 1\), with the corresponding 1D quasi quadrature measures (20) computed from them at scale \(s = 1\) for \(C = 8/11\) (horizontal axis: \(x \in [-5, 5]\))

Intuitively, this quasi quadrature operator is intended as a measure of the amount of local changes in the signal, not specific to whether the dominant response comes from odd first-order derivatives or even second-order derivatives, and with additional scale selective properties as will be described later in Sect. 4.3.

If complemented by spatial integration, the components of the quasi quadrature measure are specifically related to the following class of energy measures over the frequency domain (Lindeberg [36, App. A.3]):

$$\begin{aligned} E_{m,\gamma \text {-norm}}&= \int _{x \in {\mathbb {R}}} s^{m\gamma } \, L_{x^m}^2 \, \mathrm{d}x \nonumber \\&= \frac{s^{m\gamma }}{2 \pi } \int _{\omega \in {\mathbb {R}}} | \omega |^{2 m} \, {\hat{g}}^2(\omega ;\; s) \, \mathrm{d}\omega . \end{aligned}$$
(21)

For the specific choice of \(C = 1/2\) and \(\varGamma = 0\), the square of the quasi quadrature measure (20) coincides with the proposals by Loog [125] and Griffin [126] to define a metric of the N-jet in scale space, which can specifically be seen as an approximation of the variance of a signal using a Gaussian spatial weighting function.

Figure 3 shows the result of computing this quasi quadrature measure for a Gaussian peak as well as for its first- and second-order derivatives. As can be seen, the quasi quadrature measure is much less sensitive to the position of the peak than, e.g., the first- or second-order derivative responses. The quasi quadrature measure also shows some degree of spatial insensitivity for a first-order Gaussian derivative (a local edge model) and for a second-order Gaussian derivative.

4.2 Determination of the Parameter C

To determine the weighting parameter C between local second-order and first-order information, let us consider a Gaussian blob \(f(x) = g(x;\; s_0)\) with spatial extent given by \(s_0\) as input model signal.

By using the semi-group property of the Gaussian kernel \(g(\cdot ;\; s_1) * g(\cdot ;\; s_2) = g(\cdot ;\; s_1 + s_2)\), it follows that the scale-space representation is given by \(L(x;\; s) = g(x;\; s_0\,+\,s)\) and that the first- and second-order derivatives of the scale-space representation are

$$\begin{aligned} L_x&= g_x(x;\; s_0+s) = -\frac{x}{(s_0+s)} \, g(x;\; s_0+s), \end{aligned}$$
(22)
$$\begin{aligned} L_{xx}&= g_{xx}(x;\; s_0+s) = \frac{\left( x^2 - (s_0 + s)\right) }{(s_0+s)^2} \, g(x;\; s_0+s), \end{aligned}$$
(23)

from which the quasi quadrature measure (20) can be computed in closed form

$$\begin{aligned}&{\mathcal{Q}}_{x,\mathrm{norm}} L = \nonumber \\&\quad \frac{s^{\frac{1-\varGamma }{2}} \, \mathrm{e}^{-\frac{x^2}{2(s+s_0)}} \sqrt{x^2 (s+s_0)^2 + C \, s \left( s+s_0-x^2\right) ^2}}{\sqrt{2 \pi } \, (s+s_0)^{5/2}}.\nonumber \\ \end{aligned}$$
(24)

By determining the weighting parameter C such that it minimizes the overall ripple in the squared quasi quadrature measure for a Gaussian input

$$\begin{aligned} {\hat{C}} = {\text {argmin}}_{C \ge 0} \int _{x=-\infty }^{\infty } \left( \partial _x({\mathcal{Q}}^2_{x,\mathrm{norm}} L) \right) ^2 \, \mathrm{d}x, \end{aligned}$$
(25)

which is one way of quantifying the desire to have a stable response under small spatial perturbations of the input, we obtain

$$\begin{aligned} {\hat{C}} = \frac{4 (s+s_0)}{11 s}, \end{aligned}$$
(26)

which in the special case of choosing \(s = s_0\) corresponds to \(C = 8/11 \approx 0.727\). This value is very close to the value \(C = 1/\sqrt{2} \approx 0.707\) derived from an equal contribution condition in [37, Eq. (27)] for the special case of choosing \(\varGamma = 0\).
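The closed-form result (26) can also be verified numerically. The following sketch (our illustration, based on the closed-form derivatives (22)-(23) for \(s = s_0 = 1\) and \(\varGamma = 0\)) minimizes a discretized version of the ripple measure (25) over C and should return a value close to \(8/11 \approx 0.727\), up to discretization error:

```python
import numpy as np
from scipy.optimize import minimize_scalar

s = s0 = 1.0
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
g = np.exp(-x**2 / (2 * (s + s0))) / np.sqrt(2 * np.pi * (s + s0))
Lx = -x / (s + s0) * g                        # Eq. (22)
Lxx = (x**2 - (s + s0)) / (s + s0)**2 * g     # Eq. (23)

def ripple(C):
    """Discretized version of the ripple measure (25) for Gamma = 0."""
    Q2 = s * Lx**2 + C * s**2 * Lxx**2
    return np.sum(np.gradient(Q2, dx)**2) * dx

res = minimize_scalar(ripple, bounds=(0.0, 2.0), method='bounded')
print(res.x)   # approximately 8/11 = 0.7272...
```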

4.3 Scale Selection Properties

To analyse how the quasi quadrature measure selectively responds to image structures of different sizes, which is important when computing the quasi quadrature entity at multiple scales, we will in this section analyse the scale selection properties of this entity.

Let us consider the result of using Gaussian derivatives of orders 0, 1 and 2 as models of different types of local input signals, i.e.

$$\begin{aligned} f(x) = g_{x^n}(x;\; s_0) \end{aligned}$$
(27)

for \(n \in \{ 0, 1, 2 \}\). For the zero-order Gaussian kernel, the scale-normalized quasi quadrature measure at the origin is given by:

$$\begin{aligned} \left. {\mathcal{Q}}_{x,\mathrm{norm}} L \right| _{x=0,n=0} = \frac{\sqrt{C} s^{1-\varGamma /2}}{2 \pi (s+s_0)^2}. \end{aligned}$$
(28)

For the first-order Gaussian derivative kernel, the scale-normalized quasi quadrature measure at the origin is

$$\begin{aligned} \left. {\mathcal{Q}}_{x,\mathrm{norm}} L \right| _{x=0,n=1} = \frac{s_0^{1/2} s^{(1-\varGamma )/2}}{2 \pi (s+s_0)^2}, \end{aligned}$$
(29)

whereas for the second-order Gaussian derivative kernel, the scale-normalized quasi quadrature measure at the origin is

$$\begin{aligned} \left. {\mathcal{Q}}_{x,\mathrm{norm}} L \right| _{x=0,n=2} = \frac{3 \sqrt{C} s_0 s^{1-\varGamma /2}}{2 \pi (s+s_0)^3}. \end{aligned}$$
(30)

By differentiating these expressions with respect to the scale parameter s, we find that for a zero-order Gaussian kernel the maximum response over scale is assumed at

$$\begin{aligned} \left. {\hat{s}} \right| _{n=0} = \frac{s_0 \, (2 -\varGamma )}{2+\varGamma }, \end{aligned}$$
(31)

whereas for the first- and second-order derivatives, respectively, the maximum response over scale is assumed at

$$\begin{aligned} \left. \hat{s} \right| _{n=1}&= \frac{s_0 \; (1 -\varGamma )}{3+\varGamma }, \end{aligned}$$
(32)
$$\begin{aligned} \left. \hat{s} \right| _{n=2}&= \frac{s_0 \, (2 - \varGamma )}{4+\varGamma }. \end{aligned}$$
(33)

In the special case of choosing \(\varGamma = 0\), these scale estimates correspond to

$$\begin{aligned} \left. \hat{s} \right| _{n=0}&= s_0, \quad \quad \end{aligned}$$
(34)
$$\begin{aligned} \left. \hat{s} \right| _{n=1}&= \frac{s_0}{3}, \quad \quad \end{aligned}$$
(35)
$$\begin{aligned} \left. \hat{s} \right| _{n=2}&= \frac{s_0}{2}. \end{aligned}$$
(36)

Thus, for a Gaussian input signal, the selected scale level will, for the most scale-invariant choice \(\varGamma = 0\), reflect the spatial extent \(\hat{s} = s_0\) of the blob, whereas if we would like the scale estimate to reflect the scale parameter of the first- and second-order derivative kernels, we would have to choose \(\varGamma = -1\). An alternative motivation for using finer scale levels for the Gaussian derivative kernels is to regard the positive and negative lobes of the Gaussian derivative kernels as substructures of a more complex signal, which would then warrant the use of finer scale levels to reflect the substructures of the signal ((35) and (36)).
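As a small symbolic check (our addition), the scale estimates (31)-(33) can be reproduced by maximizing the scale dependencies of (28)-(30) over s, noting that constant factors do not affect the position of the maximum:

```python
import sympy as sp

s, s0 = sp.symbols('s s0', positive=True)
Gamma = sp.symbols('Gamma', real=True)

# Scale dependencies of Eqs. (28)-(30) at x = 0, constant factors dropped
Q = {0: s**(1 - Gamma/2) / (s + s0)**2,
     1: s**((1 - Gamma)/2) / (s + s0)**2,
     2: s0 * s**(1 - Gamma/2) / (s + s0)**3}

for n, expr in Q.items():
    s_hat = sp.solve(sp.diff(sp.log(expr), s), s)
    print(n, sp.simplify(s_hat[0]))
# n = 0: s0*(2 - Gamma)/(Gamma + 2)   -> Eq. (31)
# n = 1: s0*(1 - Gamma)/(Gamma + 3)   -> Eq. (32)
# n = 2: s0*(2 - Gamma)/(Gamma + 4)   -> Eq. (33)
```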

4.4 Spatial Sensitivity of the Quasi Quadrature Measure

Due to the formulation of the quasi quadrature measure in terms of Gaussian derivatives from the N-jet, the spatial sensitivity (phase dependency) of this entity can be estimated from the first-order component in the local Taylor expansion

$$\begin{aligned} \frac{\sqrt{s} \, \partial _x ({\mathcal{Q}}_{x,\mathrm{norm}} L)}{{\mathcal{Q}}_{x,\mathrm{norm}} L} = \frac{s \, L_{xx} \, (s^{1/2} \, L_x + C \, s^{3/2} \, L_{xxx})}{s \, L_x^2 + C \, s^2 \, L_{xx}^2}, \end{aligned}$$
(37)

where we have expressed this entity in terms of scale-normalized derivatives for \(\gamma = 1\) to emphasize the scale-invariant form of the scale-normalized perturbation measure \(s^{1/2} \, \partial _x ({\mathcal{Q}}_{x,\mathrm{norm}} L)\). Notably, this entity is zero at inflection points where \(L_{xx} = 0\).

4.5 Post-Smoothed Quasi Quadrature Measure

To reduce the spatial sensitivity of the quasi quadrature measure, the definition in Eq. (20) can be complemented by spatial post-smoothing

$$\begin{aligned} ({\overline{\mathcal{Q}}}_{x,\mathrm{norm}} L)(\cdot ;\; s, r^2 s) = g(\cdot ;\; r^2 s) * ({\mathcal{Q}}_{x,\mathrm{norm}} L)(\cdot ;\; s), \end{aligned}$$
(38)

where the parameter r is referred to as the relative post-smoothing scale. When coupling quasi quadrature measures in cascade, this amount of post-smoothing \(r^2 s\) will represent the amount of additional Gaussian smoothing before computing derivatives in the next layer in the hierarchical feature representation.

This spatial post-smoothing operation serves as a scale-covariant spatial pooling operation, notably with the support region, as determined by the integration scale \(r^2 s\), proportional to the current scale level s, as opposed to the standard application of spatial pooling over neighbourhoods of fixed size in most CNNs, which would then imply violations of scale covariance.
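Continuing the 1D sketch from Sect. 4.1 (our illustration; it reuses the quasi_quadrature_1d function defined there), the post-smoothing step (38) amounts to one additional Gaussian convolution at integration scale \(r^2 s\):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def post_smoothed_quasi_quadrature_1d(f, s, r=2.0, C=8/11, Gamma=0.0):
    """Post-smoothed quasi quadrature measure (38): Gaussian smoothing
    of the pointwise response at integration scale r^2 * s (variance)."""
    Q = quasi_quadrature_1d(f, s, C, Gamma)   # sketch from Sect. 4.1
    return gaussian_filter1d(Q, r * np.sqrt(s))
```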

Fig. 4

Example of a colour-opponent receptive field profile for a double-opponent simple cell in the primary visual cortex (V1) as measured by Johnson et al. [127]. (left) Responses to L-cones corresponding to long wavelength red cones, with positive weights represented by red and negative weights by blue. (middle) Responses to M-cones corresponding to medium wavelength green cones, with positive weights represented by red and negative weights by blue. (right) Idealized model of the receptive field from a first-order directional derivative of an affine Gaussian kernel \(\partial _{\varphi }g(x, y;\; \varSigma )\) according to (45) for \(\sigma _1 = \sqrt{\lambda _1} = 0.6\), \(\sigma _2 = \sqrt{\lambda _2} = 0.2\) in units of degrees of visual angle, \(\alpha = 157~\hbox {degrees}\) and with positive weights for the red-green colour-opponent channel \(U = R-G\) with positive values represented by red and negative values by green

5 Oriented Quasi Quadrature Modelling of Complex Cells

In this section, we will consider an extension of the 1D quasi quadrature measure (20) into an oriented quasi quadrature measure over 2D image space of the form

$$\begin{aligned} {\mathcal{Q}}_{\varphi ,\mathrm{norm}} L = \sqrt{\frac{\lambda _{\varphi } \, L_{\varphi }^2 + C \, \lambda _{\varphi }^2 \, L_{\varphi \varphi }^2}{s^{\varGamma }}}, \end{aligned}$$
(39)

where \(L_{\varphi }\) and \(L_{\varphi \varphi }\) denote directional derivatives of an affine Gaussian scale-space representation [51] [56, ch. 15]

$$\begin{aligned} L(\cdot ;\; s, \varSigma ) = g(\cdot ;\; s, \varSigma ) * f(\cdot ) \end{aligned}$$
(40)

of the form

$$\begin{aligned} L_{\varphi }&= \cos \varphi \, L_{x_1} + \sin \varphi \, L_{x_2}, \end{aligned}$$
(41)
$$\begin{aligned} L_{\varphi \varphi }&= \cos ^2 \varphi \, L_{x_1x_1} + 2 \cos \varphi \, \sin \varphi \, L_{x_1x_2} + \sin ^2 \varphi \, L_{x_2x_2}, \end{aligned}$$
(42)

and with \(\lambda _{\varphi }\) denoting the variance of the affine Gaussian kernel (with \(x = (x_1, x_2)^T\))

$$\begin{aligned} g(x;\; s, \varSigma ) = \frac{1}{2 \pi s \sqrt{\det \varSigma }} \mathrm{e}^{-x^T \varSigma ^{-1} x/2s} \end{aligned}$$
(43)

in direction \(\varphi \), preferably with the orientation \(\varphi \) aligned with the direction \(\alpha \) of either of the eigenvectors of the composed spatial covariance matrix \(s \, \varSigma \), with

$$\begin{aligned} \varSigma&=\frac{1}{\max (\lambda _1, \lambda _2)} \nonumber \\&\quad \times \left( \begin{array}{ccc} \lambda _1 \cos ^2 \alpha + \lambda _2 \sin ^2 \alpha \quad &{}\quad (\lambda _1 - \lambda _2) \cos \alpha \, \sin \alpha \\ (\lambda _1 - \lambda _2) \cos \alpha \, \sin \alpha \quad &{}\quad \lambda _1 \sin ^2 \alpha + \lambda _2 \cos ^2 \alpha \end{array} \right) \end{aligned}$$
(44)

normalized such that the main eigenvalue is equal to one.

Fig. 5

Significant eigenvectors of a complex cell in the cat primary visual cortex, as determined by Touryan et al. [43] from the response properties of the cell to a set of natural image stimuli, using a spike-triggered covariance method (STC) that computes the eigenvalues and the eigenvectors of a second-order Wiener kernel using three different parameter settings (cutoff frequencies) in the system identification method (from left to right). Qualitatively, these kernel shapes agree well with the shapes of first- and second-order affine Gaussian derivatives

5.1 Affine Gaussian Derivative Model for Linear Receptive Fields

According to the normative theory for visual receptive fields in Lindeberg [31,32,33, 112], directional derivatives of affine Gaussian kernels constitute a canonical model for visual receptive fields over a 2D spatial domain. Specifically, it was proposed that simple cells in the primary visual cortex (V1) can be modelled by directional derivatives of affine Gaussian kernels, termed affine Gaussian derivatives, of the form

$$\begin{aligned} T_{{\varphi }^{m}}(x_1, x_2;\; s, \varSigma ) = \partial _{\varphi }^{m} \left( g(x_1, x_2;\; s, \varSigma ) \right) . \end{aligned}$$
(45)

Figure 4 shows an example of the spatial dependency of a colour-opponent simple cell that can be well modelled by a first-order affine Gaussian derivative over an R-G colour-opponent channel over image intensities. Corresponding modelling results for non-chromatic receptive fields can be found in [31,32,33].

5.2 Affine Quasi Quadrature Modelling of Complex Cells

Figure 5 shows functional properties of a complex cell as determined from its response properties to natural images, using a spike-triggered covariance method (STC), which computes the eigenvalues and the eigenvectors of a second-order Wiener kernel (Touryan et al. [43]). As can be seen from this figure, the shapes of the eigenvectors determined from the nonlinear Wiener kernel model of the complex cell do qualitatively agree very well with the shapes of corresponding affine Gaussian derivative kernels of orders 1 and 2.

Motivated by this property, that mathematical modelling of functional properties of a biological complex cell in terms of a second-order energy model reveals computational primitives similar to affine Gaussian derivatives, combined with theoretical and experimental motivations for modelling receptive field profiles of simple cells by affine Gaussian derivatives, we propose to model complex cells by a possibly post-smoothed oriented quasi quadrature measure of the form (39)

$$\begin{aligned}&(\overline{\mathcal{Q}}_{\varphi ,\mathrm{norm}} L)(\cdot ;\; s_\mathrm{loc}, s_\mathrm{int}, \varSigma _{\varphi }) = \nonumber \\&\quad \sqrt{g(\cdot ;\; s_\mathrm{int}, \varSigma _{\varphi }) * ({\mathcal{Q}}^2_{\varphi ,\mathrm{norm}} L)(\cdot ;\; s_\mathrm{loc}, \varSigma _{\varphi })} \end{aligned}$$
(46)

where \(s_\mathrm{loc} \,\varSigma _{\varphi }\) represents an affine covariance matrix in direction \(\varphi \) for computing directional derivatives and \(s_\mathrm{int} \, \varSigma _{\varphi }\) represents an affine covariance matrix in the same direction for integrating pointwise affine quasi quadrature measures over a region in image space.

The pointwise affine quasi quadrature measure in this expression \(({\mathcal{Q}}_{\varphi ,\mathrm{norm}} L)(\cdot ;\; s_\mathrm{loc}, \varSigma _{\varphi })\) can be seen as a Gaussian derivative-based analogue of the energy model for complex cells as proposed by Adelson and Bergen [34] and Heeger [35]. It is closely related to a proposal by Koenderink and van Doorn [128] of summing up the squares of first- and second-order derivative responses and nicely compatible with results by De Valois et al. [129], who showed that first- and second-order receptive fields typically occur in pairs that can be modelled as approximate Hilbert pairs.

Specifically, this pointwise differential entity mimics some of the known properties of complex cells in the primary visual cortex as discovered by Hubel and Wiesel [48] in the sense of: (i) being independent of the polarity of the stimuli, (ii) not obeying the superposition principle and (iii) being rather insensitive to the phase of the visual stimuli. The primitive components of the quasi quadrature measure (the directional derivatives) do in turn mimic some of the known properties of simple cells in the primary visual cortex in terms of: (i) precisely localized “on” and “off” subregions with (ii) spatial summation within each subregion, (iii) spatial antagonism between on- and off-subregions and (iv) whose visual responses to stationary or moving spots can be predicted from the spatial subregions.

The addition of a complementary post-smoothing stage in (46) as determined by the affine Gaussian weighting function \(g(\cdot ;\; s_\mathrm{int}, \varSigma _{\varphi })\) is closely related to recent results by Westö and May [130], who have shown that complex cells are better modelled as a combination of two spatial integration steps than a single spatial integration. This spatial post-smoothing stage, which serves as a spatial pooling operation, does additionally decrease the spatial sensitivity of the pointwise quasi quadrature measure and makes it more robust to local spatial perturbations.

By choosing these spatial smoothing and weighting functions as affine Gaussian kernels, we ensure an affine-covariant model of the complex cells, to enable the computation of affine invariants at higher levels in the visual hierarchy.

Fig. 6

Statistics of the orientation selectivity of simple cells and complex cells in the primary visual cortex of the Macaque monkey as reported by Goris et al. [45]. With respect to the affine Gaussian derivative model for the receptive fields of simple and complex cells, the large variability in orientation selectivity reported from these biological measurements implies that we should consider derivatives of affine Gaussian kernels for a large variability in the eccentricity of their shapes, as can be parameterized by, e.g., the ratio between the eigenvalues \(\lambda _1\) and \(\lambda _2\) of the affine covariance matrix \(s \, \varSigma \) (a highly eccentric affine Gaussian derivative kernel will have more narrow orientation selectivity)

The use of multiple affine receptive fields over different shapes of the affine covariance matrices \(\varSigma _{\varphi ,\mathrm{loc}}\) and \(\varSigma _{\varphi ,\mathrm{int}}\) can be motivated by results by Goris et al. [45], who show that there is a large variability in the orientation selectivity of simple and complex cells (see Fig. 6). With respect to this model, this means that we can think of affine covariance matrices of different eccentricities as being present from isotropic to highly eccentric. By considering the full family of positive definite affine covariance matrices, we obtain a fully affine-covariant image representation able to handle local linearizations of the perspective mapping for all possible views of any smooth local surface patch.

With respect to computational modelling of biological vision, the proposed affine quasi quadrature model constitutes a novel functional model of complex cells as previously studied in biological vision by Hubel and Wiesel [46,47,48], Movshon et al. [131], Emerson et al. [132], Touryan et al. [43, 133] and Rust et al. [134] and modelled computationally by Adelson and Bergen [34], Heeger [35], Serre and Riesenhuber [135], Einhäuser et al. [136], Kording et al. [137], Merolla and Boahen [138], Berkes and Wiskott [139], Carandini [140] and Hansard and Horaud [141]. A conceptual novelty of our model, which emulates several of the known properties of complex cells although our understanding of the nonlinearities of complex cells is still limited, is that it is fully expressed based on the mathematically derived affine Gaussian derivative model for visual receptive fields [32] and is therefore possible to relate to natural image transformations as modelled by affine transformations over the spatial domain.

In the following, we will use this quasi quadrature model of complex cells for constructing continuous hierarchical networks.

6 Hierarchies of Oriented Quasi Quadrature Measures

In this first study, let us henceforth for simplicity disregard the variability due to different shapes of the affine receptive fields over different eccentricities and assume that \(\varSigma = I\).

This restriction enables covariance to scaling transformations and rotations, whereas a full treatment of affine quasi quadrature measures over all positive definite covariance matrices for the underlying affine Gaussian smoothing operation would enable full affine covariance.
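Under this restriction, the directional derivatives (41)-(42) can be computed from the five partial derivatives of an ordinary isotropic Gaussian scale-space representation, with \(\lambda _{\varphi } = s\) for all orientations. A minimal sketch (our illustration, on a unit grid with s denoting the variance; the function name is our own):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def oriented_quasi_quadrature(f, phi, s, C=8/11, Gamma=0.0):
    """Oriented quasi quadrature measure (39) for Sigma = I,
    for which lambda_phi = s for every orientation phi."""
    sg = np.sqrt(s)                                  # s is the variance
    Lx1 = gaussian_filter(f, sg, order=(0, 1))       # d/dx1 (axis 1 ~ x1)
    Lx2 = gaussian_filter(f, sg, order=(1, 0))       # d/dx2 (axis 0 ~ x2)
    Lx1x1 = gaussian_filter(f, sg, order=(0, 2))
    Lx1x2 = gaussian_filter(f, sg, order=(1, 1))
    Lx2x2 = gaussian_filter(f, sg, order=(2, 0))
    c, sn = np.cos(phi), np.sin(phi)
    Lphi = c * Lx1 + sn * Lx2                        # Eq. (41)
    Lphiphi = (c**2 * Lx1x1 + 2 * c * sn * Lx1x2     # Eq. (42)
               + sn**2 * Lx2x2)
    return np.sqrt((s * Lphi**2 + C * s**2 * Lphiphi**2) / s**Gamma)
```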

An approach that we shall pursue is to build feature hierarchies by coupling oriented quasi quadrature measures (39) or (46) in cascade

$$\begin{aligned}&F_1(x, \varphi _1) = ({\mathcal{Q}}_{\varphi _1,\mathrm{norm}} \, L)(x) \end{aligned}$$
(47)
$$\begin{aligned}&F_k(x, \varphi _1, \ldots , \varphi _{k-1}, \varphi _k) \nonumber \\&\quad = ({\mathcal{Q}}_{\varphi _k,\mathrm{norm}} \, F_{k-1})(x, \varphi _1, \ldots , \varphi _{k-1}), \end{aligned}$$
(48)

where we have suppressed the notation for the scale levels, which are assumed to be distributed such that the scale parameter at level k is \(s_k = s_0 \, r^{2(k-1)}\) for some \(r > 1\), e.g. \(r = 2\). Assuming that the initial scale-space representation L is computed at scale \(s_0\), such a network can in turn be initiated for different values of \(s_0\), also distributed according to a geometric distribution.

This construction builds upon an early proposal by Fukushima [38] of building a hierarchical neural network from repeated application of models of simple and complex cells [46,47,48], which has later been explored in handcrafted networks based on Gabor functions by Riesenhuber and Poggio [39] and Serre et al. [40] and in the scattering convolution networks by Bruna and Mallat [41]. This idea is also consistent with a proposal by Yamins and DiCarlo [142] of using repeated application of a single hierarchical convolution layer for explaining the computations in the mammalian cortex. With this construction, we obtain a way to define continuous networks that express a corresponding hierarchical architecture based on Gaussian derivative-based models of simple and complex cells within the scale-space framework.

Fig. 7

Schematic illustration of how the quasi quadrature network is constructed from an image, here with a total number of four layers. In the first layer, there is an expansion over all \(M = 8\) orientations, leading to a total number of 2M independent features \(L_{\varphi }\) and \(L_{\varphi \varphi }\) over all M image orientations, from which the dependent feature \({\mathcal{Q}} L\) is then computed according to (39). In the second layer, the maps of \({\mathcal{Q}} L\) for all the M image orientations are used for another expansion over image orientations, such that a total number of \(2 M^2\) independent features \(L_{\varphi }\) and \(L_{\varphi \varphi }\) is computed over all pairs of image orientations. To delimit the complexity of the features in higher layers, there is a pooling stage over image orientations, which sums up the quasi quadrature responses over all the image orientations before further expansions over image orientations are performed at layer \(K = 3\). Thereby, the number of independent features in these layers is delimited by \(2 M^2\) instead of \(2 M^3\). By a corresponding pooling stage before layer 4, the number of independent features in this layer is also delimited by \(2 M^2\) (the grey boxes, which show the independent features \(L_{\varphi }\) and \(L_{\varphi \varphi }\) and the dependent feature \({\mathcal{Q}} L\) that are computed in every layer in the hierarchy, are here only shown for one of the several possible paths through the hierarchy; the combinatorial expansion in layer 2 is also only shown for one of the M orientations in layer 1)

Each new layer in this model implies an expansion over combinations of angles over the different layers in the hierarchy. For example, if we in a discrete implementation discretize the angles \(\varphi \in [0, \pi [\) into M discrete spatial orientations, we will obtain \(M^k\) different features at level k in the hierarchy. To keep the complexity down at higher levels, we will for \(k \ge K\), in a corresponding way as done by Hadji and Wildes [143], introduce a pooling stage over orientations

$$\begin{aligned} ({\mathcal{{P}}}_k F_{k})(x, \varphi _1, \ldots , \varphi _{K-1}) = \sum _{\varphi _k} F_k(x, \varphi _1, \ldots , \varphi _{K-1}, \varphi _k), \end{aligned}$$
(49)

which sums up the responses for all the orientations in the current layer, before the next layer is defined by applying oriented quasi quadrature measures to the pooled responses

$$\begin{aligned}&F_k(x, \varphi _1, \ldots , \varphi _{K-1}, \varphi _k) = \nonumber \\&\quad ({\mathcal{Q}}_{\varphi _k,\mathrm{norm}} \, {\mathcal{{P}}}_{k-1} F_{k-1})(x, \varphi _1, \ldots , \varphi _{K-1}). \end{aligned}$$
(50)

In this way, the number of features at any level will be limited to at most \(M^{K-1}\). The proposed hierarchical feature representation is termed QuasiQuadNet.
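
As a continuation of the cascade sketch above, the pooling stage (49) amounts to summing over the last orientation index (again only a sketch under the same assumptions):

```python
def pool_orientations(layer):
    """Orientation pooling (49): sum the feature maps over the last angle index,
    so that subsequent orientation expansions keep at most M^(K-1) combinations."""
    pooled = {}
    for angles, F in layer.items():
        key = angles[:-1]
        pooled[key] = pooled.get(key, 0.0) + F
    return pooled
```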

Figure 7 gives a schematic illustration of the structure of such a resulting hierarchy using an expansion over \(M = 8\) spatial orientations in the image domain over a total number of four layers with the combinatorial expansion over image orientations delimited from layer \(K = 3\).

6.1 Scale Covariance

A theoretically attractive property of this family of networks is that the networks are provably scale covariant. Given two images f and \(f'\) that are related by a uniform scaling transformation,

$$\begin{aligned} f'(x') = f(x) \quad \quad \hbox {with} \quad \quad x' = S x \end{aligned}$$
(51)

for some \(S > 0\), their corresponding scale-space representations L and \(L'\) will be equal

$$\begin{aligned} L'(x';\; s') = L(x;\; s) \end{aligned}$$
(52)

and so will the scale-normalized derivatives

$$\begin{aligned} s'^{n/2} \, L'_{{x_i'}^n}(x';\; s') = s^{n/2} \, L_{x_i^n}(x;\; s) \end{aligned}$$
(53)

based on \(\gamma = 1\) if the scale levels are matched according to \(s' = S^2 s\) [36, Eqs. (16) and (20)].

This implies that if the initial scale levels \(s_0\) and \(s_0'\) underlying the construction in (47) and (48) are related according to \(s_0' = S^2 s_0\), then the first layers of the feature hierarchy will be related according to [37, Eqs. (55) and (63)]

$$\begin{aligned} F_1'(x', \varphi _1) = S^{-\varGamma } \, F_1(x, \varphi _1). \end{aligned}$$
(54)

Higher layers in the feature hierarchy are in turn related according to

$$\begin{aligned} F_k'(x', \varphi _1, \ldots , \varphi _{k-1}, \varphi _k) = S^{-k \varGamma } \, F_k(x, \varphi _1, \ldots , \varphi _{k-1}, \varphi _k) \end{aligned}$$
(55)

and are specifically equal if \(\varGamma = 0\). This means that it will be possible to perfectly match such hierarchical representations under uniform scaling transformations.
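
This scale covariance can also be verified numerically. The following sketch samples one continuous pattern at two resolutions related by a factor S (so that \(f'(x') = f(x)\) holds by construction) and compares \(\gamma = 1\) scale-normalized first-order derivatives at matched scales \(s' = S^2 s\); the test pattern and the tolerances are of course only illustrative:

```python
# A small numerical check (a sketch, not a proof) of the scale covariance
# property (51)-(55) for gamma = 1 scale-normalized derivatives.
import numpy as np
from scipy.ndimage import gaussian_filter

def sample_pattern(shape, S=1.0):
    # Sample one continuous pattern, so that f'(x') = f(x) with x' = S x.
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    return np.sin(0.20 * x / S) * np.cos(0.15 * y / S)

S = 2.0
f  = sample_pattern((128, 128))
fp = sample_pattern((256, 256), S=S)

sigma = 2.0
# First-order derivative in x, gamma = 1 normalization: multiply by s^(1/2) = sigma.
D  = sigma * gaussian_filter(f, sigma, order=(0, 1))
Dp = (S * sigma) * gaussian_filter(fp, S * sigma, order=(0, 1))

# Compare at corresponding grid points x' = S x, away from the image boundaries.
err = np.abs(Dp[::2, ::2] - D)[16:-16, 16:-16].max()
print(f"max deviation: {err:.2e}")  # small; limited by the discretization only
```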

6.2 Rotation Covariance

Under a rotation of image space by an angle \(\alpha \),

$$\begin{aligned} f'(x') = f(x) \quad \quad \hbox {with} \quad \quad x'= R_{\alpha } x, \end{aligned}$$
(56)

the corresponding feature hierarchies are in turn equal if the orientation angles are related according to \(\varphi '_i = \varphi _i + \alpha \) (\(i = 1\ldots k\))

$$\begin{aligned} F_k'(x', \varphi '_1, \ldots , \varphi '_{k-1}, \varphi '_k) = F_k(x, \varphi _1, \ldots , \varphi _{k-1}, \varphi _k). \end{aligned}$$
(57)
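
For the special case \(\alpha = \pi /2\), the image rotation maps the pixel grid onto itself, so this covariance can be checked essentially exactly also in a discrete implementation; a minimal sketch, reusing the hypothetical quasi_quadrature function from the sketch in Sect. 6:

```python
# Rotation covariance check for alpha = pi/2 (exact on the pixel grid):
# Q applied to the rotated image at angle phi equals the rotated map of
# Q applied to the original image at angle phi + pi/2.
import numpy as np

f = np.random.default_rng(1).standard_normal((64, 64))
phi, sigma = np.pi / 8, 2.0
Q_rotate_first = quasi_quadrature(np.rot90(f), sigma, phi)
Q_rotate_after = np.rot90(quasi_quadrature(f, sigma, phi + np.pi / 2))
print(np.allclose(Q_rotate_first, Q_rotate_after))  # True, up to roundoff
```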

6.3 Exact Versus Approximate Covariance (or Invariance) in a Practical Implementation

The architecture of the quasi quadrature network has been designed to support scale covariance based on image primitives (receptive fields) that obey the general scale covariance property (4) and to support rotational covariance by an explicit expansion over image rotations of the form (48).

Scale Covariance The statement about true scale covariance in Sect. 6.1 holds in the continuous case, provided that we can represent a continuum of scale parameters.

In a practical implementation, it is natural to sample this space into a set of discrete scale levels with a constant scale ratio between adjacent scale levels. Then, the scale-covariant property will be restricted to spatial scaling factors that can be perfectly matched between these scale levels. If the scale levels are expressed in units of \(\sigma = \sqrt{s}\) and if the scale ratio between adjacent scale levels in these units is r, then exact scale covariance will hold for all scaling factors that are integer powers of r, provided that the image resolution and the image size are sufficient to resolve the relevant image structures. For scaling factors in between these discrete values, there will be an approximation error, which could possibly be reduced by a complementary scale interpolation mechanism.
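
As a simple illustration (a sketch; the scale ratio r is a free parameter), the exactly representable scaling factors are the integer powers of r, and any other factor leaves a residual mismatch in log scale:

```python
import numpy as np

def closest_matched_scaling(S, r=np.sqrt(2)):
    """Match a scaling factor S against discrete scale levels sigma_0 r^m.
    Returns the closest exactly representable factor r^m and the residual
    log-scale mismatch, which vanishes when S is an integer power of r."""
    m = np.round(np.log(S) / np.log(r))
    return r**m, float(abs(np.log(S) - m * np.log(r)))

# e.g. closest_matched_scaling(3.0) -> (2.83, 0.059): a residual of about 6 %
```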

Fig. 8

Subset of features in a hierarchy of directional derivatives and quasi quadrature measures computed from a texture image (corduroy/sample_a/42a-scale_4_im_5_col.png) from the KTH-TIPS2 dataset [144] for different combinations of angles in layers 1 and 2 for \(s_0 = 2\). (top row) Original image and first-order directional derivatives \(L_{\varphi }\). (second row) Second-order directional derivatives \(L_{\varphi \varphi }\). (third row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _1} L\) in layer 1. (fourth row) First-order directional derivatives computed from \({\mathcal{Q}}_{\varphi _1} L\) for \(\varphi _1 = 0\). (fifth row) Second-order directional derivatives computed from \({\mathcal{Q}}_{\varphi _1} L\) for \(\varphi _1 = 0\). (sixth row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _2} {\mathcal{Q}}_{\varphi _1} L\) in layer 2 for \(\varphi _1 = 0\). (bottom row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _3} {\mathcal{{P}}}_{\varphi _2} {\mathcal{Q}}_{\varphi _2} {\mathcal{Q}}_{\varphi _1} L\) in layer 3 for \(\varphi _1 = 0\) (the contrast has been reversed for the quasi quadrature measures so that high values are shown as dark and low values as bright; image size: \(200 \times 200\) pixels)

Fig. 9

Subset of features in a hierarchy of directional derivatives and quasi quadrature measures computed from a texture image (wool/sample_a/22a-scale_7_im_3_col.png) from the KTH-TIPS2 dataset [144] for different combinations of angles in layers 1 and 2 for \(s_0 = 2\). (top row) Original image and first-order directional derivatives \(L_{\varphi }\). (second row) Second-order directional derivatives \(L_{\varphi \varphi }\). (third row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _1} L\) in layer 1. (fourth row) First-order directional derivatives computed from \({\mathcal{Q}}_{\varphi _1} L\) for \(\varphi _1 = 0\). (fifth row) Second-order directional derivatives computed from \({\mathcal{Q}}_{\varphi _1} L\) for \(\varphi _1 = 0\). (sixth row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _2} {\mathcal{Q}}_{\varphi _1} L\) in layer 2 for \(\varphi _1 = 0\). (bottom row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _3} {\mathcal{{P}}}_{\varphi _2} {\mathcal{Q}}_{\varphi _2} {\mathcal{Q}}_{\varphi _1} L\) in layer 3 for \(\varphi _1 = 0\) (the contrast has been reversed for the quasi quadrature measures so that high values are shown as dark and low values as bright; image size: \(200 \times 200\) pixels)

Fig. 10

Subset of features in a hierarchy of directional derivatives and quasi quadrature measures computed from an indoor image for different combinations of angles in layers 1 and 2 for \(s_0 = 2\). (top row) Original image and first-order directional derivatives \(L_{\varphi }\). (second row) Second-order directional derivatives \(L_{\varphi \varphi }\). (third row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _1} L\) in layer 1. (fourth row) First-order directional derivatives computed from \({\mathcal{Q}}_{\varphi _1} L\) for \(\varphi _1 = 0\). (fifth row) Second-order directional derivatives computed from \({\mathcal{Q}}_{\varphi _1} L\) for \(\varphi _1 = 0\). (sixth row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _2} {\mathcal{Q}}_{\varphi _1} L\) in layer 2 for \(\varphi _1 = 0\). (seventh row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _2} {\mathcal{Q}}_{\varphi _1} L\) in layer 2 for \(\varphi _1 = \pi /2\). (bottom row) Oriented quasi quadrature measures \({\mathcal{Q}}_{\varphi _3} {\mathcal{{P}}}_{\varphi _2} {\mathcal{Q}}_{\varphi _2} {\mathcal{Q}}_{\varphi _1} L\) in layer 3 for \(\varphi _1 = 0\) (the contrast has been reversed for the quasi quadrature measures so that high values are shown as dark and low values as bright; image size: \(512 \times 350\) pixels)

For a discrete implementation with limited image resolution and limited image size, there will be additional restrictions on how well the discrete implementation approximates the continuous theory. For the implementations underlying this paper, we use a scale-space concept specially designed for discrete signals, computed by separable convolution with the discrete analogue of the Gaussian kernel \(T(n;\; s) = \mathrm{e}^{-s} I_n(s)\) [145], which is defined in terms of the modified Bessel functions of integer order \(I_n(s)\) [146]. This discrete scale-space concept constitutes a numerical approximation of the continuous scale-space concept via a spatial discretization of the diffusion equation, which governs the evolution properties over scale of the Gaussian scale-space concept.
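
Since \(T(n;\; s) = \mathrm{e}^{-s} I_n(s)\) coincides with the exponentially scaled Bessel function available in standard numerical libraries, the kernel is straightforward to realize; the following sketch uses scipy.special.ive (the truncation rule is a practical choice of ours, not prescribed by [145]):

```python
# Sketch: separable smoothing with the discrete analogue of the Gaussian
# kernel T(n; s) = exp(-s) I_n(s), via scipy.special.ive(n, s) = I_n(s) exp(-s).
import numpy as np
from scipy.special import ive
from scipy.ndimage import correlate1d

def discrete_gaussian_kernel(s, eps=1e-8):
    """Kernel weights T(n; s), truncated where they become negligible."""
    N = int(np.ceil(s + 6.0 * np.sqrt(s))) + 1    # generous support estimate
    T = ive(np.arange(-N, N + 1), s)
    T = T[T >= eps * T.max()]                     # symmetric, unimodal in |n|
    return T / T.sum()                            # renormalize truncated mass

def discrete_scale_space(f, s):
    """L(.; s) by separable convolution along both image axes."""
    T = discrete_gaussian_kernel(s)
    return correlate1d(correlate1d(f, T, axis=0), T, axis=1)
```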

Rotational Covariance The statement about true rotational covariance in Sect. 6.2 holds provided that we can represent a continuum of rotation angles. For a continuum of orientation angles, the summation over image orientations in the pooling stage (49) should be replaced by an integral over all the image orientations, to guarantee that exact covariance holds for all rotation angles.

In a practical implementation, it is natural to sample the orientation angles on the unit circle into a set of discrete angles with a constant increment between adjacent angles. Then, the rotation-covariant property will be restricted to the set of discrete rotation angles that are spanned by this discretization. For rotation angles in between, there will be an approximation error, which could possibly be reduced by a suitable interpolation mechanism.

With regard to a discrete implementation, there may be additional deviations in how well the discrete approximations of directional derivatives numerically approximate their continuous counterparts. For the implementation underlying this paper, we complement the discrete scale-space concept in [145] with discrete derivative approximations with scale-space properties [147], where the small-support discrete derivative approximations \(\delta _x = (-1/2, 0, 1/2)\) and \(\delta _{xx} = (1, -2, 1)\) are applied to the discrete scale-space smoothed image data, and directional derivative approximations are then computed from the continuous relationships (41) and (42).
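
In code, these derivative approximations reduce to three-tap correlations followed by steering to the desired orientation; a sketch, where the steering relations below are the standard expressions for first- and second-order directional derivatives and are our assumption about the exact form of (41) and (42):

```python
# Sketch: small-support derivative approximations applied to discretely
# smoothed data L (e.g. from discrete_scale_space above), then steered to
# orientation phi via the standard directional derivative relations.
import numpy as np
from scipy.ndimage import correlate1d

def directional_derivatives(L, phi):
    dx, dxx = np.array([-0.5, 0.0, 0.5]), np.array([1.0, -2.0, 1.0])
    Lx  = correlate1d(L, dx,  axis=1)            # delta_x along columns
    Ly  = correlate1d(L, dx,  axis=0)
    Lxx = correlate1d(L, dxx, axis=1)
    Lyy = correlate1d(L, dxx, axis=0)
    Lxy = correlate1d(Lx, dx, axis=0)            # mixed derivative
    c, s = np.cos(phi), np.sin(phi)
    Lphi    = c * Lx + s * Ly
    Lphiphi = c**2 * Lxx + 2 * c * s * Lxy + s**2 * Lyy
    return Lphi, Lphiphi
```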

Numerical Approximation of a Truly Covariant Continuous Theory Since all steps in the discrete implementation constitute numerical approximations of their corresponding counterparts in the continuous theory, it follows that the discrete implementation will also numerically approximate the desirable covariance properties (or, by extension, invariance properties) with respect to scaling transformations and rotations in the image domain. The accuracy of approximation of the combined system will then be a composed effect of the numerical accuracy of the different primitives.

Table 1 Performance results of the mean-reduced QuasiQuadNet in comparison with a selection of the better methods in the extensive performance evaluation by Liu et al. [157] (our results in slanted font; the column labelled “feat” states whether the image features are fixed (“F”) or learnt (“L”), and the column labelled “class” states whether the classification criterion is fixed (“F”) or learnt (“L”))

6.4 Experiments

Figures 8, 9 and 10 show examples of computing different layers in such a quasi quadrature network for two texture images and an indoor image, with the combinatorial angular expansion for higher layers delimited at layer \(K = 3\).

For the quite regular corduroy image in Fig. 8, we can see that we get clear responses to the stripes of the cloth in layers 1 and 2, with only a minor response in the third layer, corresponding to the slight irregularity in the mid-left part of the original image.

For the mixed regular/irregular wool image in Fig. 9, we get clear responses to the crochet work in layer 1, with additional clear responses to the different types of repeated crochet structures in different subparts of the image in layer 2, whereas in layer 3 the main strong response is due to the intentional overall irregularity in the pattern.

For the indoor scene in Fig. 10, we can note that the responses are strongest along the edges in the scene for all the layers, with some locally stronger responses in layers 2 and 3 appearing near corners or end-stoppings, especially when the orientations of the oriented quasi quadrature measures at higher levels in the hierarchy are orthogonal to the orientation of the oriented quasi quadrature measure in the first layer (\(\varphi _2 \bot \varphi _1\) or \(\varphi _3 \bot \varphi _1\)). For this image, which is not in any way stationary over image space, we can observe that the spatial structure of the scene can be perceived from the pure magnitude responses of the quasi quadrature measure in layer 3 in the hierarchy.

In these qualitative respects, we can see how the proposed quasi quadrature hierarchy is able to reflect nonlinear hierarchical relations between image structures over different scales.

Fig. 11

Sample images from the KTH-TIPS2b texture dataset [144]. This dataset consists of images of 11 classes of textures with four samples from each class. Each sample has been photographed from nine distances leading to nine relative scales, with additionally 12 different pose and illumination conditions for each scale, implying a total number of \(11 \times 4 \times 9 \times 12 = 4752\) images. This figure shows one sample from each class, with varying scale, pose and illumination conditions between the samples (most images of size \(200 \times 200\) pixels)

7 Application to Texture Analysis

In the following, we will use a substantially reduced version of the proposed quasi quadrature network for building an application to texture analysis and evaluate the resulting approach on the KTH-TIPS2b, CUReT and UMD datasets (Figs. 11, 12, 13).

7.1 Mean-Reduced Texture Descriptors

If we make the assumption that a spatial texture should obey certain stationarity properties over image space, we may regard it as reasonable to construct texture descriptors by accumulating statistics of feature responses over the image domain, in terms of, e.g., mean values or histograms.

Inspired by the way the SURF descriptor [68] accumulates mean values and mean absolute values of derivative responses and the way Bruna and Mallat [41] and Hadji and Wildes [143] compute mean values of their hierarchical feature representations, we will initially explore reducing the QuasiQuadNet to just the mean values over the image domain of the following five features:

$$\begin{aligned} \{ \partial _{\varphi } F_{k}, |\partial _{\varphi } F_{k}|, \partial _{\varphi \varphi } F_{k}, |\partial _{\varphi \varphi } F_{k}|, {\mathcal{Q}}_{\varphi } F_{k} \}. \end{aligned}$$
(58)

These types of features are computed for all layers in the feature hierarchy (with \(F_0 = L\)), which leads to a 4000-D descriptor based on \(M = 8\) uniformly distributed orientations in \([0, \pi [\), four layers in the hierarchy delimited in complexity by directional pooling for \(K = 3\), and four initial scale levels \(\sigma _0 = \sqrt{s_0} \in \{ 1, 2, 4, 8 \}\).
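
Given the earlier sketches, the per-layer contribution to such a mean-reduced descriptor can be expressed compactly; the quasi quadrature form and the constant C are again illustrative assumptions:

```python
# Sketch: the five mean-reduced features (58) for one layer output F_k at
# one orientation phi, reusing directional_derivatives from Sect. 6.3.
import numpy as np

def mean_reduced_features(Fk, sigma, phi, C=0.5):
    Lphi, Lphiphi = directional_derivatives(Fk, phi)
    s = sigma**2
    Q = np.sqrt(s * Lphi**2 + C * s**2 * Lphiphi**2)   # assumed form of (39)
    return [Lphi.mean(), np.abs(Lphi).mean(),
            Lphiphi.mean(), np.abs(Lphiphi).mean(), Q.mean()]
```

The full descriptor is then the concatenation of these five mean values over all layers, orientation combinations and initial scale levels.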

Fig. 12

Sample images from the CUReT texture dataset [148]. This dataset consists of images of 61 materials, with a single sample for each class, and each sample viewed under different viewing and illumination conditions. Here, we use the selection of 92 viewing and illumination conditions chosen in [149] leading to a total number of \(61 \times 92 = 5612\) images. This figure shows one image of about every third sample, with varying viewing and illumination conditions between the samples (all images of size \(200 \times 200\) pixels)

Fig. 13

Sample images from the UMD texture dataset [150]. This dataset consists of 25 texture classes, with 40 grey-level images from each class taken from a variety of different distances and viewing directions, thus a total number of \(25 \times 40 = 1000\) images. This figure shows one sample from each of the first 15 classes (all images of size \(1280 \times 960\) pixels)

Fig. 14

Examples of the scaling variations in the KTH-TIPS2 dataset for one sample each from the classes “wool”, “lettuce” and “brown bread” at a subset of five of the scales in the dataset (the sizes labelled “2”, “4”, “6”, “8” and “10” from top to bottom)

Fig. 15

Comparison between scale-covariant matching versus non-covariant matching of texture descriptors on the KTH-TIPS2b dataset [144]. In the experiments underlying this figure, we have used the scale variations in the dataset to perform matching over spatial scaling factors of \(S = \sqrt{2}\), 2, \(2\sqrt{2}\) and 4, here represented as \(\log _2 S\) on the horizontal axis. For non-covariant matching, here represented as red curves, we have used the same scale parameters for the image descriptors in the training data and the test data. For scale-covariant matching, here represented as blue curves, we have adapted the scale levels of the image descriptors to the known scale factor between the training data and the test data. As can be seen from the results, in the presence of substantial scaling variations, the use of scale-covariant matching, as enabled by the provably scale-covariant networks proposed in this article, improves the performance substantially (all these results have been computed with SVM classification of mean-reduced image descriptors from QuasiQuadNets computed from either pure grey-level images or colour images. The results for the pure grey-level descriptors are indicated by ‘o’, whereas the results for the LUV colour descriptors are indicated by ‘*’)

7.2 Texture Classification on the KTH-TIPS2b Dataset

The second column in Table 1 shows the result of applying this approach to the KTH-TIPS2b dataset [144] for texture classification; see Fig. 11 for sample images from this dataset. The KTH-TIPS2b dataset contains images of 11 texture classes (“aluminium foil”, “cork”, “wool”, “lettuce leaf”, “corduroy”, “linen”, “cotton”, “brown bread”, “white bread”, “wood” and “cracker”) with four physical samples from each class and photographs of each sample taken from nine distances, leading to nine relative scales labelled “2”, ..., “10” spanning a factor of 4 in scaling transformations, with additionally 12 different pose and illumination conditions for each scale, leading to a total number of \(11 \times 4 \times 9 \times 12 = 4752\) images. The regular benchmark setup implies that the images from three samples in each class are used for training and the remaining sample in each class for testing, over four permutations. Since several of the samples from the same class are quite different from each other in appearance, this implies a non-trivial benchmark which has not yet been saturated.

When using nearest-neighbour classification on the mean-reduced grey-level descriptor, we get 70.2 % accuracy, and 72.1 % accuracy when computing corresponding features from the LUV channels of a colour-opponent representation. When using SVM classification [156], the accuracy becomes 75.3 % and 78.3 %, respectively. Comparing with the results of an extensive set of other methods in Liu et al. [157], out of which a selection of the better results is listed in Table 1, the results of the mean-reduced QuasiQuadNet are better than classical texture classification methods such as local binary patterns (LBP) [155], binary rotation-invariant noise-tolerant texture descriptors [153] and multidimensional local binary patterns (MDLBP) [154] and also better than other handcrafted networks, such as ScatNet [41], PCANet [117] and RandNet [117]. The performance of the mean-reduced QuasiQuadNet descriptor does, however, not reach the performance of applying SVM classification to Fisher vectors of the filter output in learned convolutional networks (FV-VGGVD, FV-VGGM [151]).

By instead performing the training on every second scale in the dataset (scales “2”, “4”, “6”, “8”, “10”) and the testing on the other scales (“3”, “5”, “7”, “9”), such that the benchmark does not primarily test the generalization between the very few different samples in each class, the classification performance is 98.8 % for the grey-level descriptor and 99.6 % for the LUV descriptor.

7.3 Scale-Covariant Matching of Image Descriptors on the KTH-TIPS2b Dataset

An attractive property of the KTH-TIPS2b dataset is that we can use the controlled scaling variations in this dataset (see Fig. 14) to investigate the influence of scale covariance with respect to image descriptors defined from a provably scale-covariant network. To test this property, we constructed partitionings of the dataset into training sets and test sets with known scaling variations between the data.

The scales in the dataset, which we will henceforth refer to as sizes, labelled from “2” to “10”, span a scaling factor of 4, with a relative scaling factor of \(\root 4 \of {2}\) between adjacent sizes. To cover a set of relative scaling factors \(S \in \{ \sqrt{2}, 2, 2\sqrt{2}, 4 \}\), we partitioned the dataset and adapted the scale parameters of the QuasiQuadNet to the relative scaling factors in the following way:

  • Relative scaling factor \(\sqrt{2}\): Training data at the sizes labelled \(\{5, 6, 9, 10\}\) with image descriptors computed at the scales \(\sigma _0 \in \{1, 2, 4, 8\}\). Test data at the sizes labelled \(\{3, 4, 7, 8\}\) with image descriptors computed at the scales \(\sigma _0 \in \{\sqrt{2}, 2\sqrt{2}, 4\sqrt{2}, 8\sqrt{2}\}\).

  • Relative scaling factor 2: Training data at the sizes labelled \(\{7, 8, 9, 10\}\) with image descriptors computed at the scales \(\sigma _0 \in \{1, 2, 4, 8\}\). Test data at the sizes labelled \(\{3, 4, 5, 6\}\) with image descriptors computed at the scales \(\sigma _0 \in \{2, 4, 8, 16\}\).

  • Relative scaling factor \(2\sqrt{2}\): Training data at the sizes labelled \(\{8, 9, 10\}\) with image descriptors computed at the scales \(\sigma _0 \in \{1, 2, 4, 8\}\). Test data at the sizes labelled \(\{2, 3, 4\}\) with image descriptors computed at the scales \(\sigma _0 \in \{2\sqrt{2}, 4\sqrt{2}, 8\sqrt{2}, 16\sqrt{2}\}\).

  • Relative scaling factor 4: Training data at the size labelled \(\{10\}\) with image descriptors computed at the scales \(\sigma _0 \in \{1, 2, 4, 8\}\). Test data at the size labelled \(\{2\}\) with image descriptors computed at scales \(\sigma _0 \in \{4, 8, 16, 32\}\).

These partitionings between training sets and test sets have thus been constructed in such a way that, for each image descriptor computed from an image in the test set, there should exist a corresponding scale-matched image descriptor in the training set.
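
The rule common to all four partitionings is simply that the scale levels of the test descriptors are multiplied by the relative scaling factor, \(\sigma _0' = S \, \sigma _0\); restated in code form (an illustration only, not part of the benchmark protocol):

```python
import numpy as np

# Scale-covariant matching: test-set descriptors are computed at scale
# levels sigma_0' = S * sigma_0, for training-set levels sigma_0.
train_sigma0 = np.array([1.0, 2.0, 4.0, 8.0])
for S in (np.sqrt(2), 2.0, 2.0 * np.sqrt(2), 4.0):
    print(f"S = {S:.3f}: test sigma_0 = {S * train_sigma0}")
```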

To measure the influence relative to not adapting the scale levels to scale covariance, we also performed non-covariant classification with all the image descriptors, both in the training data and the test data, computed at the scales \(\sigma _0 \in \{1, 2, 4, 8\}\).

The result of this experiment is shown in Fig. 15, which shows graphs of how the accuracy of the texture classification depends on the logarithm of the relative scaling factor \(\log _2 S\) between the training data and the test data (see also Table 2). As can be seen from the graphs, the performance is substantially higher for scale-covariant classification compared to non-covariant classification. Although this task is not influenced by the generalization ability of the image descriptors, as measured in the regular experimental setup for the KTH-TIPS2 dataset, in the sense that images from all the samples are here included in both the training sets and the test sets, there are nevertheless reasons why the image data cannot be perfectly matched: (i) the support regions for the texture descriptors differ in size due to the scaling transformation, which implies that new image details appear in one of the images relative to the other (see Fig. 14 for an illustration), which in turn challenges the stationarity assumption underlying the image texture descriptor, here represented by mean values only, and (ii) the boundary effects at the image boundaries are different between the two image domains, which in particular affects the image features at coarser spatial scales.

Notwithstanding these effects, which arise because the appearance of new image structures under scaling transformations violates full scale covariance over the a priori delimited image domains in the given dataset, the primary purpose of this experiment is to demonstrate conceptually how substantial gains in performance can be obtained by having a scale-covariant network, and how such scale-covariant networks are conceptually easier to construct using a continuous model of the filtering operations in the network. Specifically, scale-space theory, which underpins this treatment, has been developed to handle such scaling variations in a theoretically well-founded manner.

Table 2 Numerical performance values underlying the graphs in Fig. 15, which quantify the performance of texture classification based on mean-reduced texture descriptors from QuasiQuadNets over scaling transformations with scaling factors of \(\sqrt{2}\), 2, \(2 \sqrt{2}\) and 4 for the KTH-TIPS2b dataset

7.4 Matching with Scale-Aggregated Covariant Image Descriptors on the KTH-TIPS2b Dataset

In the previous section, we used a priori known information about the structured amounts of scaling transformations in the KTH-TIPS2 dataset to demonstrate the importance of using scale-covariant image descriptors, as opposed to non-covariant image descriptors, in situations where the scaling transformations are substantial.

A more realistic scenario is that the amount of scaling transformation between the training data and the test data is not a priori known. A useful approach in such a situation is to complement the image descriptors in the training set by scale aggregation, meaning that multiple copies of the image descriptors are computed over some set of scale levels. This enables scale-covariant matching in the sense that, for any image descriptor computed from the test set, we increase as far as possible the likelihood that the classification scheme can find a corresponding scale-matched image descriptor in the training set.

To test the scale sensitivity of a composed texture classification scheme in such a scenario, we computed image descriptors for the training data at the scale-level sets \(\sigma _0 \in \{1, 2, 4, 8\}\), \(\sigma _0 \in \{\sqrt{2}, 2\sqrt{2}, 4\sqrt{2}, 8\sqrt{2}\}\), \(\sigma _0 \in \{2, 4, 8, 16\}\), \(\sigma _0 \in \{2\sqrt{2}, 4\sqrt{2}, 8\sqrt{2}, 16\sqrt{2}\}\) and \(\sigma _0 \in \{4, 8, 16, 32\}\), and computed the image descriptors for the test data at the single scale-level set \(\sigma _0 \in \{1, 2, 4, 8\}\). As training data, we used the images at the single size \(\{2\}\) and as test data the images from a single one of each of the sizes \(\{3, 4, 5, 6, 7, 8, 9, 10\}\), to study the sensitivity to variations in scaling transformations in steps of \(\root 4 \of {2}\) between adjacent sizes.
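
In code form, this scale-aggregation protocol amounts to the following construction (again a restatement for illustration, not part of the benchmark definition):

```python
import numpy as np

# Scale aggregation: each training image contributes one descriptor per
# scale-level set, spanning scaling factors from 1 to 4 in steps of sqrt(2);
# each test image is represented at a single scale-level set only.
base = np.array([1.0, 2.0, 4.0, 8.0])
training_scale_sets = [factor * base
                       for factor in (1.0, np.sqrt(2), 2.0, 2.0 * np.sqrt(2), 4.0)]
test_scale_set = base
```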

Table 3 Numerical performance values underlying the bottom graphs in Fig. 16, which quantify the performance of texture classification based on mean-reduced texture descriptors from QuasiQuadNets over scaling transformations with different scaling factors S

The result of this experiment is shown in the top figure in Fig. 16, which shows graphs of how the accuracy of the texture classification depends on the logarithm of the relative scaling factor \(\log _2 S\) between the training data and the test data. In the top figure, the experiments have been made relative to training data at the single size “2” only, with corresponding test data for each one of the sizes “3”, “4”, ..., “10” in the dataset. In the bottom figure, the average result of a set of more extensive experiments is shown, where each one of the sizes “2”, “3”, ..., “9” has been used for defining scale-aggregated training data, and the test data have then been taken from a single size with a number label greater than that of the training data. The graphs in the bottom figure show the average values over all those graphs for equal relative scaling factors between the training data and the test data (see also Table 3). As can be seen from the graphs, the performance is substantially higher for scale-aggregated matching compared to non-aggregated matching. In this way, the experiment demonstrates how the use of a scale-covariant network enables significantly better performance in situations where there are substantial scaling transformations in the test data that are not spanned by corresponding scaling variations in the training data.

Fig. 16

Comparison between scale-aggregated matching versus non-aggregated matching of texture descriptors on the KTH-TIPS2b dataset [144]. In the experiments underlying this figure, we have used the scale variations in the dataset to perform matching over spatial scaling factors of S between \(\root 4 \of {2}\) and 4 in steps of factors of \(\root 4 \of {2}\), here represented as \(\log _2 S\) on the horizontal axis. For non-aggregated matching, here represented as red curves, we have used the image descriptors at the same single scale-level set \(\sigma _0 \in \{1, 2, 4, 8\}\) in the training data and the test data. For scale-aggregated matching, here represented as blue curves, we have extended the training data with image descriptors over the set of scale levels \(\sigma _0 \in \{1, 2, 4, 8\}\), \(\sigma _0 \in \{\sqrt{2}, 2\sqrt{2}, 4\sqrt{2}, 8\sqrt{2}\}\), \(\sigma _0 \in \{2, 4, 8, 16\}\), \(\sigma _0 \in \{2\sqrt{2}, 4\sqrt{2}, 8\sqrt{2}, 16\sqrt{2}\}\) and \(\sigma _0 \in \{4, 8, 16, 32\}\), to span scaling variations up to a factor of 4 in steps of \(\sqrt{2}\). In the top figure, the experiments have been made relative to training data at size “2” only, with corresponding test data at the sizes “3”, “4”, ..., “10”. In the bottom figure, a set of multiple experiments has been performed with training data at each one of the sizes “2”, “3”, ..., “9”, with test data for the sizes with greater number labels. The curve in the bottom figure shows the average value over all these experiments, averaged over equal relative scaling factors. As can be seen from the results, in the presence of substantial scaling variations, the use of scale-aggregated matching, as enabled by the provably scale-covariant networks proposed in this article, improves the performance substantially (all these results have been computed with SVM classification of mean-reduced image descriptors from QuasiQuadNets computed from either pure grey-level images or colour images. The results for the pure grey-level descriptors are indicated by ‘o’, whereas the results for the LUV colour descriptors are indicated by ‘*’)

A similar way of handling scale variations between training data and test data, by computing the image descriptors over a range of scales, has also been used for texture classification by Crosier and Griffin [158]. This type of scale matching constitutes an integrated part of the scale-space methodology for relating image descriptors computed from image structures that have been subject to scaling transformations in the image domain. Here, we extend this approach for scale generalization to hierarchical or deep networks, where the scale covariance property of our networks makes such scale matching possible.

7.5 Texture Classification on the CUReT Dataset

The third column in Table 1 shows the result of applying a similar texture classification approach as used in Sect. 7.2 to the CUReT texture dataset [148]; see Fig. 12 for sample images from this dataset. The CUReT dataset consists of images of 61 materials, with a single sample for each material, and each sample viewed under 205 different viewing and illumination conditions. For our experiments, we use the selection of 92 cropped images of size \(200 \times 200\) pixels chosen in [149], based on the criterion that a sufficiently large region of texture should be visible for all the materials. This implies a total number of \(61 \times 92 = 5612\) images. Following the standard for this dataset, we measure the average value over a set of random partitionings into training and testing data of equal size.

With SVM classification on the mean-reduced QuasiQuadNet, we get \(98.3~\%\) accuracy for the grey-level descriptor and \(98.6~\%\) for the colour descriptor. This performance is better than the handcrafted PCANet [117] and RandNet [117] and some pure texture descriptors, such as local binary patterns [155], multidimensional local binary patterns (MDLBP) [154] and binary rotation-invariant noise-tolerant texture descriptors [153], and close to the learned networks FV-AlexNet and FV-VGGM [151]. For this dataset, the handcrafted ScatNet [41] does, however, perform better, and so do FV-VGGVD [151] and median robust extended local binary patterns [152].

7.6 Texture Classification on the UMD Dataset

The fourth column in Table 1 shows the result of applying a similar texture classification approach to the UMD texture dataset [150]; see Fig. 13 for sample images from this dataset. The UMD dataset consists of 25 texture classes with 40 grey-level images of size \(1280 \times 960\) pixels from each class, taken from different distances and viewpoints, thus a total number of \(25 \times 40 = 1000\) images. Following the standard for this dataset, we measure the average over random partitionings into training and testing data of equal size. When using the same scale levels \(\sigma _0 \in \{1, 2, 4, 8\}\) for the training data and the test data, we get \(97.1~\%\) accuracy for our mean-reduced grey-level descriptor, which is better than local binary patterns [155], PCANet [117] and RandNet [117].

Noting that this dataset contains significant unstructured scaling variations, which are not taken into account when computing all the image descriptors at the same scale, we also did an experiment with scale-covariant matching, where we expanded the training data to the scale-level sets \(\sigma _0 \in \{1, 2, 4, 8\}\), \(\sigma _0 \in \{\sqrt{2}, 2\sqrt{2}, 4\sqrt{2}, 8\sqrt{2}\}\), \(\sigma _0 \in \{2, 4, 8, 16\}\), \(\sigma _0 \in \{2\sqrt{2}, 4\sqrt{2}, 8\sqrt{2}, 16\sqrt{2}\}\) and \(\sigma _0 \in \{4, 8, 16, 32\}\), and computed the test data at the single scale-level set \(\sigma _0 \in \{2, 4, 8, 16\}\). The intention behind this aggregation of the training data over scales is to make it easier to find a match between the training data and the test data in situations where there are significant scaling transformations between them, specifically when there is a lack of training data at a similar scale as that of given test data. Then, the performance increased from 93.3 to \(95.9~\%\) using NN classification and from 97.1 to \(98.1~\%\) using SVM classification on the UMD dataset.

A corresponding expansion of the training data to cyclic permutations over the underlying angles in the image descriptors in the training data, to achieve rotation-covariant matching, did, however, not improve the results.

8 Summary and Discussion

We have presented a theory for defining handcrafted or structured hierarchical networks by combining linear and nonlinear scale-space operations in cascade. After presenting a general sufficiency condition for constructing networks based on continuous scale-space operations that guarantee provable scale covariance, we have then developed in more detail one specific example of such a network, constructed by applying quasi quadrature responses of first- and second-order directional Gaussian derivatives in cascade.

A main purpose behind this study has been to investigate whether we could start building a bridge between the well-founded theory of scale-space representation and the recent empirical developments in deep learning, while at the same time being inspired by biological vision. The present work is intended as initial work in this direction, where we propose the family of quasi quadrature networks as a new baseline for handcrafted networks with associated provable covariance properties under scaling and rotation transformations.

Specifically, by constructing the network from linear and nonlinear filters defined over a continuous domain, we avoid the restriction to discrete \(3 \times 3\) or \(5 \times 5\) filters in most current deep net approaches, which implies an implicit assumption about a preferred scale in the data, as defined by the grid spacing in the deep net. If the input data to the deep net are rescaled by external factors, such as from varying the distance between an observed object and the observer, the lack of true scale covariance as arising from such preferred scales in the network implies that the nonlinearities in the deep net may affect the data in different ways, depending on the size of a projected object in the image domain.

In early experiments with a substantially mean-reduced representation of our provably scale-covariant QuasiQuadNet, we have demonstrated that it is possible to get quite promising performance on texture classification, comparable to or better than that of other handcrafted networks, although not reaching the performance of applying more refined statistical classification methods to learned CNNs.

By inspection of the full non-reduced feature maps, we have also observed that some representations in higher layers may respond to irregularities in regular textures (defect detection) or corners or end-stoppings in regular scenes.

Concerning extensions of the approach with quasi quadrature networks, we propose to:

  • relax the restriction to isotropic covariance matrices with \(\varSigma = I\) in Sect. 6, to construct hierarchical networks based on more general affine quasi quadrature measures, built from affine Gaussian derivatives computed over varying eccentricities of the underlying affine Gaussian kernels, to enable affine covariance, which will then also enable affine invariance,

  • complement the computation of quasi quadrature responses by a mechanism for divisive normalization [44] to enforce a competition between multiple feature responses and thus increase the selectivity of the image features,

  • explore the spatial relationships in the full feature maps that are suppressed in the mean-reduced representation to make it possible for the resulting image descriptors to encode hierarchical relations between image features over multiple positions in the image domain and

  • incorporate learning mechanisms into the representation.

Specifically, it would be interesting to formulate learning mechanisms that can learn the parameters of a parameterized model for divisive normalization and to formulate learning mechanisms that can combine quasi quadrature responses over different positions in the image domain to support more general object recognition mechanisms than those that can be supported by a stationarity assumption as explored in the prototype application to texture classification developed in Sect. 7.

For the specific application to texture classification in this work, it also seems possible that using more advanced statistical classification methods on the QuasiQuadNet, such as Fisher vectors, could lead to gains in performance compared to the mean-reduced representation used here, based on just the mean values and the mean absolute values of the filter responses in our hierarchical representation.

Concerning more general developments, the general arguments about scale-covariant continuous networks in Sect. 3 open up for studying wider classes of continuous hierarchical networks that guarantee provable scale covariance. We plan to study such extensions in future work.