# A computational theory of visual receptive fields

## Abstract

A receptive field constitutes a region in the visual field where a visual cell or a visual operator responds to visual stimuli. This paper presents a theory for what types of receptive field profiles can be regarded as natural for an *idealized vision system*, given a set of *structural requirements* on the first stages of visual processing that reflect *symmetry properties* of the surrounding world. These symmetry properties include (i) *covariance properties* under scale changes, affine image deformations, and Galilean transformations of space–time as occur for real-world image data as well as specific requirements of (ii) *temporal causality* implying that the future cannot be accessed and (iii) a *time-recursive updating* mechanism of a limited temporal buffer of the past as is necessary for a genuine real-time system. Fundamental structural requirements are also imposed to ensure (iv) mutual consistency and a proper handling of internal representations at *different spatial and temporal scales*. It is shown how a set of *families of idealized receptive field profiles can be derived by necessity* regarding *spatial, spatio-chromatic, and spatio-temporal receptive fields* in terms of Gaussian kernels, Gaussian derivatives, or closely related operators. Such image filters have been successfully used as a *basis* for expressing a large number of visual operations in computer vision, regarding feature detection, feature classification, motion estimation, object recognition, spatio-temporal recognition, and shape estimation. Hence, the associated so-called *scale-space theory* constitutes a both theoretically well-founded and general framework for expressing visual operations. There are very close similarities between receptive field profiles predicted from this scale-space theory and receptive field profiles found by cell recordings in *biological vision*. 
Among the family of receptive field profiles derived by necessity from the assumptions, idealized models with very good qualitative agreement are obtained for (i) spatial on-center/off-surround and off-center/on-surround receptive fields in the fovea and the LGN, (ii) simple cells with spatial directional preference in V1, (iii) spatio-chromatic double-opponent neurons in V1, (iv) space–time separable spatio-temporal receptive fields in the LGN and V1, and (v) non-separable space–time tilted receptive fields in V1, all within the same unified theory. In addition, the paper presents a more general framework for relating and interpreting these receptive fields conceptually and possibly predicting new receptive field profiles as well as for pre-wiring covariance under scaling, affine, and Galilean transformations into the representations of visual stimuli. This paper describes the basic structure of the necessity results concerning receptive field profiles regarding the mathematical foundation of the theory and outlines how the proposed theory could be used in further studies and modelling of biological vision. It is also shown how receptive field responses can be *interpreted physically*, as the superposition of relative variations of surface structure and illumination variations, given a logarithmic brightness scale, and how receptive field measurements will be *invariant* under multiplicative illumination variations and exposure control mechanisms.

### Keywords

Receptive field, Scale space, Gaussian derivative, Scale covariance, Affine covariance, Galilean covariance, Illumination invariance, LGN, Primary visual cortex, Visual area V1, Functional model, Simple cell, Double-opponent cell, Complex cell, Vision, Theoretical neuroscience, Theoretical biology

## 1 Introduction

*relationships* between intensity values at different points. A main reason for this is that the incoming light constitutes an *indirect* source of information depending on the interaction between geometric and material properties of objects in the surrounding world and on external illumination sources. Another fundamental reason why cues to the surrounding world need to be collected over *regions* in the visual field as opposed to at single image points is that the measurement process by itself requires the accumulation of energy over non-infinitesimal support regions over space and time. Such a region in the visual field for which a visual sensor and/or a visual operator responds to visual input or a visual cell responds to visual stimuli is naturally referred to as a *receptive field* (Hubel and Wiesel 1959, 1962) (see Fig. 1).

If one considers the theoretical and algorithmic problems of designing a vision system that is going to make use of incoming reflected light to infer properties of the surrounding world, one may ask what types of image operations should be performed on the image data. Would any type of image operation be reasonable? Specifically, regarding the notion of receptive fields, one may ask what types of receptive field profiles would be reasonable? Is it possible to derive a theoretical model of how receptive fields “ought to” respond to visual data?

Initially, such a problem might be regarded as intractable unless the question can be further specified. It is, however, possible to study this problem systematically using approaches that have been developed in the area of computer vision known as *scale-space theory* (Iijima 1962; Witkin 1983; Koenderink 1984; Koenderink and Doorn 1992; Lindeberg 1994a, b, 2008; Sporring et al. 1996; Florack 1997; ter Haar Romeny 2003). A paradigm that has been developed in this field is to impose *structural constraints* on the first stages of visual processing that reflect *symmetry properties* of the environment. Interestingly, it turns out to be possible to substantially reduce the class of permissible image operations from such arguments.

The subject of this article is to describe how structural requirements on the first stages of visual processing as formulated in scale-space theory can be used for deriving idealized models of receptive fields and implications of how these theoretical results can be used when modelling biological vision. A main theoretical argument is that idealized models for linear receptive fields can be derived *by necessity* given a small set of symmetry requirements that reflect properties of the world that one may naturally require an idealized vision system to be adapted to. In this respect, the treatment bears similarities to approaches in theoretical physics, where symmetry properties are often used as main arguments in the formulation of physical theories of the world. The treatment that will follow will be general in the sense that *spatial, spatio-chromatic, and spatio-temporal receptive fields are encompassed by the same unified theory*.

*scaling transformations* caused by objects of different sizes and at different distances to the observer, (ii) local *affine transformations* caused by variations in the viewing direction relative to the object, (iii) local *Galilean transformations* caused by relative motions between the object and the observer, and (iv) local *multiplicative intensity transformations* caused by illumination variations (see Fig. 2). If the vision system is to maintain a stable perception of the environment, it is natural to require the first stages of visual processing to be robust to such image variations. Formally, one may require the receptive fields to be *covariant* under basic image transformations, which means that the receptive fields should be transformed in a well-behaved and well-understood manner under corresponding image transformations (see Fig. 3). Combined with an additional criterion that the receptive field must not create new structures at coarse scales that do not correspond to simplifications of corresponding finer scale structures, we will describe how these requirements together lead to idealized families of receptive fields (Lindeberg 2011) in good agreement with receptive field measurements reported in the literature (Hubel and Wiesel 1959, 1962; DeAngelis et al. 1995; DeAngelis and Anzai 2004; Conway and Livingstone 2006).

Specifically, explicit *functional models* will be given of spatial and spatio-temporal response properties of LGN neurons and simple cells in V1, which will be compared to related models in terms of Gabor functions (Marcelja 1980; Jones and Palmer 1987a, b), differences of Gaussians (Rodieck 1965), and Gaussian derivatives (Koenderink and Doorn 1987; Young 1987; Young et al. 2001; Young and Lesperance 2001; Lindeberg 1994a, b, 1997, 2011). For chromatic input, the model also accounts for color-opponent spatio-chromatic cells in V1. Notably, the diffusion equations that describe the evolution properties over scale of these linear receptive field models are suitable for implementation on a biological architecture, since the computations can be expressed in terms of communications between neighboring computational units, where either a single computational unit or a group of computational units may be interpreted as corresponding to a neuron or a group of neurons.

Compared to previous approaches of learning receptive field properties and visual models from the statistics of natural image data (Field 1987; van der Schaaf and van Hateren 1996; Olshausen and Field 1996; Rao and Ballard 1998; Simoncelli and Olshausen 2001; Geisler 2008; Hyvärinen et al. 2009; Lörincz et al. 2012), the proposed theoretical model makes it possible to determine spatial and spatio-temporal receptive fields from first principles and thus without need for any explicit training stage or gathering of representative image data. In relation to such learning-based models, the proposed theory provides a *normative* approach that can be seen as describing the solutions that an ideal learning-based system may converge to, if exposed to a sufficiently large and representative set of natural image data. For these reasons, the presented approach should be of interest when modelling biological vision.

We will also show how receptive field responses can be *interpreted physically* as a superposition of relative variations of surface structure and illumination variations, given a logarithmic brightness scale, and how receptive field measurements will be *invariant* under multiplicative illumination variations and exposure control mechanisms. Despite the image measurements fundamentally being of an *indirect* nature, in terms of reflected light from external objects subject to unknown or uncontrolled illumination, this result shows how receptive field measurements can nevertheless be related to *inherent physical properties of objects* in the environment. This result therefore provides a formal justification for using receptive field responses as a basis for visual processes, analogous to the way linear receptive fields in the fovea, LGN and V1 provide the basic input to higher visual areas in biological vision.

We propose that these theoretical results contribute to an increased understanding of the role of early receptive fields in vision. Specifically, if one aims at building a neuro-inspired artificial vision system that solves actual visual tasks, we argue that an approach based on the proposed idealized models of linear receptive fields should require a significantly lower amount of training data compared to approaches that involve specific learning of receptive fields or compared to approaches that are not based on covariant receptive field models. We also argue that the proposed families of covariant receptive fields will be better at handling natural image transformations as resulting from variabilities in relation to the surrounding world.

In their survey of our knowledge of the early visual system, Carandini et al. (2005) emphasize the need for functional models to establish a link between neural biology and perception. Einhäuser and König (2010) argue for the need for normative approaches in vision. This paper can be seen as developing the consequences of such ways of reasoning by deriving *functional models of linear receptive fields* using a normative approach. Due to the formulation of the resulting receptive fields in terms of spatial and spatio-temporal derivatives of convolution kernels, it furthermore becomes feasible to analyze how receptive field responses can be related to properties of the environment using mathematical tools from differential geometry and thereby analyzing possibilities as well as constraints for visual perception.

### 1.1 Outline of the presentation

The treatment will be organized as follows: Sect. 2 formulates a set of structural requirements on the first stages of visual processing with respect to symmetry properties of the surrounding world and in relation to internal representations that are to be computed by an *idealized vision system*. Then, Sect. 3 describes the consequences of these assumptions with regard to intensity images defined over a spatial domain, with extensions to color information in Sect. 4. Sect. 5 develops a corresponding theory for spatio-temporal image data, taking into account the special nature of time-dependent image information.

Section 6 presents a comparison between spatial and spatio-temporal receptive fields measured from biological vision and receptive field profiles generated by the presented spatial, spatio-chromatic, and spatio-temporal scale-space theories, showing a very good qualitative agreement. Section 7 describes how a corresponding foveal scale-space model can be formulated for a foveal sensor to account for a spatially dependent lowest resolution, with suggestions for extensions in Sect. 8.

Section 9 relates the contributions in the paper to previous work in the area in a retrospective manner, and Sect. 10 concludes with a summary and discussion, including an outline of further applications of how the presented theory can be used for modelling biological vision.

## 2 Structural requirements of an idealized visual front end

The notion of a *visual front end* refers to a set of processes at the first stages of visual processing, which are assumed to be of a general nature and whose output can be used as input to different later-stage processes, without being too specifically adapted to a particular task that would limit the applicability to other tasks. Major arguments for the definition of a visual front end are that the first stages of visual processing should be as *uncommitted* as possible and allow initial processing steps to be *shared* between different later-stage visual modules, thus implying a *uniform structure* on the first stages of visual computations (Koenderink et al. 1992; Lindeberg 1994b, Sect. 1.1).

In the following, we will describe a set of structural requirements that can be stated concerning (i) spatial geometry, (ii) spatio-temporal geometry, (iii) the image measurement process with its close relationship to the notion of scale, (iv) internal representations of image data that are to be computed by a general purpose vision system, and (v) the parameterization of image intensity with regard to the influence of illumination variations.

The treatment that will follow can be seen as a unification, abstraction and extension of developments in the area of *scale-space theory* (Iijima 1962; Witkin 1983; Koenderink 1984; Koenderink and Doorn 1992; Lindeberg 1994a, b, 2008; Sporring et al. 1996; Florack 1997; ter Haar Romeny 2003) as obtained during the last decades, see Sect. 9.2 and (Lindeberg 1996, 2011; Weickert et al. 1999; Duits et al. 2004) for complementary surveys. It will then be shown how a generalization of this theory to be presented next can be used for deriving idealized models of receptive fields by necessity, including new extensions for modelling illumination variations in the intensity domain. Specifically, we will describe how these results can be used for computational neuroscience modelling of receptive fields with regard to biological vision.

### 2.1 Static image data over spatial domain

denoted by \(x = (x_1, x_2)^T\), then the problem of defining a set of early visual operations can be formulated in terms of finding a family of operators \(\mathcal{T}_s\) that are to act on \(f\) to produce a family of new intermediate image representations \(L(\cdot ;\; s) = \mathcal{T}_s f\).

#### 2.1.1 Linearity and convolution structure

*linear operator* such that

*shift invariant* such that

*homogeneous across space*.

*convolution* transformation

(Hirschmann and Widder 1955)
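The properties named in this subsection (linearity, shift invariance, and the resulting convolution structure) can be checked numerically. The following is a minimal sketch, assuming a 1-D periodic signal; the helper `apply_operator` and the kernel values are illustrative choices and are not taken from the paper. It verifies that an operator implemented as a circular convolution is linear and commutes with shifts.

```python
import numpy as np

def apply_operator(f, kernel):
    """Hypothetical operator T: circular convolution of f with kernel (via FFT)."""
    spectrum = np.fft.fft(f) * np.fft.fft(kernel, n=f.size)  # kernel is zero-padded
    return np.real(np.fft.ifft(spectrum))

rng = np.random.default_rng(0)
f1 = rng.standard_normal(64)
f2 = rng.standard_normal(64)
kernel = np.array([0.25, 0.5, 0.25])  # illustrative smoothing kernel
a1, a2 = 2.0, -3.0

# Linearity: T(a1 f1 + a2 f2) = a1 T f1 + a2 T f2
linear_ok = np.allclose(
    apply_operator(a1 * f1 + a2 * f2, kernel),
    a1 * apply_operator(f1, kernel) + a2 * apply_operator(f2, kernel),
)

# Shift invariance: applying T to a shifted signal equals shifting the output of T
shift = 5
shift_ok = np.allclose(
    apply_operator(np.roll(f1, shift), kernel),
    np.roll(apply_operator(f1, kernel), shift),
)
```

The periodic boundary is used only so that the shift is exact on a finite grid; on an unbounded domain the same two properties characterize ordinary convolution.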

#### 2.1.2 Image measurements at different scales

*scale* with every image measurement. Let us therefore assume that the parameter \(s\) represents such a scale attribute and let us assume that this scale parameter should always be nonnegative \(s \in {\mathbb R}_+^N\) with the limit case when \(s \downarrow 0\) corresponding to an identity operation

#### 2.1.3 Structural requirements on a scale-space representation

*Semigroup and cascade properties* For such image measurements to be properly related *between* different scales, it is natural to require the operators \(\mathcal{T}_s\) with their associated convolution kernels \(T(\cdot ;\; s)\) to form a *semigroup*

scale levels \(s_1\) and \(s_2\) with \(s_2 \ge s_1\) will obey the *cascade property*

*multi-scale representation*.
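As an illustration of the semigroup and cascade properties, the sketch below assumes a 1-D Gaussian kernel with the scale parameter \(s\) taken as its variance (the Gaussian is named in the abstract; this particular parameterization and the helper `gaussian_kernel` are assumptions made here). It checks that \(T(\cdot ;\; s_1) * T(\cdot ;\; s_2) = T(\cdot ;\; s_1 + s_2)\) and that a coarser representation can be computed from a finer one by smoothing with \(T(\cdot ;\; s_2 - s_1)\).

```python
import numpy as np

def gaussian_kernel(s, radius=60):
    """Sampled 1-D Gaussian with variance s on the integer grid [-radius, radius]."""
    x = np.arange(-radius, radius + 1)
    return np.exp(-x**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)

s1, s2 = 4.0, 9.0

# Semigroup: convolving the kernels at scales s1 and s2 gives the kernel at s1 + s2.
combined = np.convolve(gaussian_kernel(s1), gaussian_kernel(s2), mode="same")
semigroup_error = np.max(np.abs(combined - gaussian_kernel(s1 + s2)))

# Cascade: L(.; s2) = T(.; s2 - s1) * L(.; s1) for s2 >= s1.
f = np.zeros(121)
f[40], f[75] = 1.0, -0.5                      # illustrative input signal
L_s1 = np.convolve(f, gaussian_kernel(s1), mode="same")
L_s2_direct = np.convolve(f, gaussian_kernel(s2), mode="same")
L_s2_cascade = np.convolve(L_s1, gaussian_kernel(s2 - s1), mode="same")
cascade_error = np.max(np.abs(L_s2_direct - L_s2_cascade))
```

Both errors are at the level of floating-point rounding for the scales chosen here; for very fine scales a sampled Gaussian is a poorer approximation and the agreement would degrade.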

*Self-similarity* Regarding the choice of convolution kernels to be used for computing a multi-scale representation, it is natural to require them to be *self-similar* over scale (*scale invariant*) in the sense that each kernel \(T(\cdot ;\; s)\) can be regarded as a rescaled version of some prototype kernel \(\bar{T}(\cdot )\). In the case of a *scalar scale parameter* \(s \in {\mathbb R}_+\), such a condition can be expressed as

*multi-dimensional scale parameter* \(s \in {\mathbb R}_+^N\), the requirement of self-similarity over scale can be generalized into
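For a scalar scale parameter, self-similarity can be illustrated with a 1-D Gaussian. The sketch assumes that the condition takes the form \(T(x;\; s) = \bar{T}(x / \varphi (s)) / \varphi (s)\) with the scale transformation \(\varphi (s) = \sqrt{s}\); both the form and the symbol \(\varphi\) are assumptions made for this example.

```python
import numpy as np

def gaussian(x, s):
    """1-D Gaussian with variance s."""
    return np.exp(-x**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)

def prototype(u):
    """Prototype kernel T-bar: the unit-variance Gaussian."""
    return gaussian(u, 1.0)

x = np.linspace(-10.0, 10.0, 201)
s = 6.25
phi = np.sqrt(s)                      # assumed scale transformation phi(s) = sqrt(s)

# The kernel at scale s is the prototype stretched by phi and renormalized by 1/phi.
rescaled = prototype(x / phi) / phi
self_similar = np.allclose(gaussian(x, s), rescaled)
```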

Together, the requirements of a semigroup structure and self-similarity over scales imply that the parameter \(s\) gets both a (i) *qualitative* interpretation of the notion of scale in terms of an abstract *ordering relation* due to the cascade property in Eq. (9) and (ii) a *quantitative* interpretation of scale, in terms of the *scale-dependent spatial transformations* in Eqs. (10) and (11). When these conditions are simultaneously satisfied, we say that the intermediate representation \(L(\cdot ;\; s)\) constitutes a candidate for being regarded as a (weak) *scale-space representation*.

*Infinitesimal generator* For theoretical analysis, it is preferable if the scale parameter \(s\) can be treated as a continuous parameter and if image representations at adjacent scales can be related by partial differential equations. Such relations can be expressed if the semigroup possesses an *infinitesimal generator* (Hille and Phillips 1957; Pazy 1983)

*directional derivative of the semigroup* along any *positive direction* \(u = (u_1, \dots , u_N)\) in the parameter space
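In standard semigroup notation, the relations described in this paragraph can be sketched as follows. The symbol \(\mathcal{A}\) for the generator and the reading of a positive direction as one with nonnegative components are assumptions made here, not notation confirmed by the text.

```latex
\begin{align}
  \mathcal{A} f
    &= \lim_{h \downarrow 0} \frac{T(\cdot ;\; h) * f - f}{h}, \\
  \partial_s L(\cdot ;\; s)
    &= \mathcal{A}\, L(\cdot ;\; s), \\
  (\mathcal{D}_u L)(x;\; s)
    &= \sum_{k=1}^{N} u_k \, \partial_{s_k} L(x;\; s),
    \qquad u_k \ge 0 .
\end{align}
```

The first line defines the generator as the limit of the difference quotient at scale zero, the second expresses the evolution over a scalar scale parameter as a differential equation, and the third is the ordinary directional derivative in an \(N\)-dimensional scale parameter space.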

*Smoothing property: non-enhancement of local extrema* A further requirement on a scale-space representation is that convolution with the scale-space kernel \(T(\cdot ;\; s)\) should correspond to a *smoothing transformation* in the sense that coarser-scale representations should be guaranteed to constitute *simplifications* of corresponding finer scale representations and that new image structures must not be created at coarser scales \(L(\cdot ;\; s)\) that do not correspond to simplifications of corresponding structures in the original data \(f\).

For one-dimensional signals \(f :{\mathbb R}\rightarrow {\mathbb R}\), such a condition can be formalized as the requirement that the number of local extrema or equivalently the number of zero-crossings in the data must not increase with scale and is referred to as *non-creation of local extrema* (Lindeberg 1990). For higher-dimensional signals, however, it can be shown that there are no non-trivial linear transformations guaranteed to never increase the number of local extrema in an image (Lifshitz and Pizer 1990; Lindeberg 1990).

*non-enhancement of local extrema*). In other words, if at some scale level \(s_0\) a point \((x_0;\; s_0)\) is a maximum (minimum) over the spatial domain \(x\), i.e., for the mapping \(x \mapsto L(x;\; s_0)\), then the derivative with respect to scale at this point must not be positive (negative). For a scale-space representation based on a scalar scale parameter, we should hence require (Lindeberg 1990, 1996):

*every positive direction* in the \(N\)-dimensional parameter space, the directional derivative of the semigroup \((\mathcal{D}_u L)(x;\; s)\) according to Eq. (14) must satisfy (Lindeberg 2011):

*strong restriction* on what convolution kernels \(T(\cdot ;\; s)\) can be regarded as *scale-space kernels*.
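The sign condition at local extrema can be made concrete for a semi-discrete diffusion process. The sketch below assumes that the evolution over scale is governed by \(\partial_s L = \tfrac{1}{2}\, \partial_{xx} L\), discretized with a second difference; this generator is an assumption made for illustration. It checks that the scale derivative is non-positive at every local maximum and non-negative at every local minimum.

```python
import numpy as np

rng = np.random.default_rng(1)
# A test signal standing in for L(x; s0) at some fixed scale s0.
L = np.convolve(rng.standard_normal(200), np.ones(5) / 5.0, mode="same")

# Semi-discrete diffusion: dL/ds = 1/2 * second difference (assumed generator).
dL_ds = np.zeros_like(L)
dL_ds[1:-1] = 0.5 * (L[2:] - 2.0 * L[1:-1] + L[:-2])

# Local extrema among the interior grid points.
interior = np.arange(1, L.size - 1)
is_max = (L[interior] >= L[interior - 1]) & (L[interior] >= L[interior + 1])
is_min = (L[interior] <= L[interior - 1]) & (L[interior] <= L[interior + 1])

# Non-enhancement: maxima must not increase and minima must not decrease with scale.
max_ok = bool(np.all(dL_ds[interior][is_max] <= 0.0))
min_ok = bool(np.all(dL_ds[interior][is_min] >= 0.0))
```

For this generator the condition holds by construction, since the second difference at a local maximum is a sum of two non-positive terms. The force of the requirement lies in the converse direction: it excludes kernels whose generators do not have this sign property.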

*Nonnegativity and normalization* Regarding the convolution kernels \(T(\cdot ;\; s)\), it is natural to require that any scale-space kernel should be *nonnegative*

*unit mass* (unit \(L^1\)-norm)

#### 2.1.4 Requirements regarding spatial geometry

*Rotational symmetry* For a multi-scale representation based on a scalar scale parameter \(s \in {\mathbb R}_+\), it is natural to require the scale-space kernels to be *rotationally symmetric*

For a scale-space representation based on a multi-dimensional scale parameter, one may also consider a weaker requirement of rotational invariance at the level of a family of kernels, for example regarding a set of elongated kernels with different orientations in image space. Then, although the individual kernels in the filter family are not rotationally symmetric as individual filters, a collection or a group of such kernels may nevertheless capture image data of different orientation in a rotationally invariant manner, for example if all image orientations are explicitly represented or if the receptive fields corresponding to different orientations in image space can be related by linear combinations.

*Affine covariance* When considering surface patterns that are being deformed by the perspective transformation from the surface of an object to the image plane, a restriction to rotationally symmetric kernels only will, however, interfere with the image deformations that occur if the viewing direction varies in relation to the surface normal. If we approximate the geometry of an image deformation by the derivative of the perspective mapping and assume that there are no illumination variations, then such an image deformation can be modelled by an *affine transformation*

*local linear approximation of the perspective mapping*.

A natural requirement on an idealized vision system that observes objects whose projections on the image plane are being deformed in different ways depending on the viewing conditions is that the vision system should be able to relate or match the different internal representations of external objects that are acquired under different viewing conditions. Such a requirement is natural to enable a stable interpretation of objects in the world under variations of the orientation of the object relative to the observer, to enable invariance under variations of the viewing direction.

*affine covariance*. Within the class of linear operators \(\mathcal{T}_s\) over a two-dimensional image domain, it is, however, not possible to realize such an affine covariance property within a scale-space concept based on a scalar scale parameter. For two-dimensional image data, such affine covariance can, however, be accomplished within a three-parameter linear scale space.
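One way to see why three parameters suffice in two dimensions is to consider Gaussian kernels parameterized by a symmetric positive definite \(2 \times 2\) covariance matrix \(\Sigma\), which has three free entries. The sketch below assumes this family (the function `affine_gaussian` and the matrices `A` and `Sigma` are illustrative choices) and checks that the family is closed under an affine deformation \(x' = A x\): the deformed kernel is again a member of the family, with covariance \(A \Sigma A^T\).

```python
import numpy as np

def affine_gaussian(x, Sigma):
    """2-D Gaussian with covariance matrix Sigma, evaluated at points x of shape (n, 2)."""
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.einsum("ni,ij,nj->n", x, Sigma_inv, x)   # x^T Sigma^{-1} x per point
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

rng = np.random.default_rng(2)
A = np.array([[1.5, 0.4], [-0.2, 0.8]])      # illustrative affine deformation
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # symmetric 2x2: three free parameters

x_prime = rng.standard_normal((50, 2))       # points in the deformed domain
x = x_prime @ np.linalg.inv(A).T             # corresponding points x = A^{-1} x'

# Kernel pulled back through the deformation (with the Jacobian factor) ...
lhs = affine_gaussian(x, Sigma) / abs(np.linalg.det(A))
# ... equals the kernel of the same family with transformed covariance.
rhs = affine_gaussian(x_prime, A @ Sigma @ A.T)
covariant = np.allclose(lhs, rhs)
```

A rotationally symmetric kernel corresponds to \(\Sigma\) proportional to the identity, and \(A \Sigma A^T\) is then in general no longer of that form, which is the obstruction for a scalar scale parameter.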

### 2.2 Time-dependent image data over a spatio-temporal domain

Regarding spatio-temporal image data \(f(x, t)\), which we assume to be defined on a 2+1D spatio-temporal domain \({\mathbb R}^2 \times {\mathbb R}\) with \(x = (x_1, x_2)^T\) denoting image space and \(t\) denoting time, it is natural to inherit the above-mentioned symmetry requirements expressed for the spatial domain. Hence, corresponding structural requirements as stated in Sects. 2.1.1, 2.1.2, and 2.1.3 should be imposed on a spatio-temporal scale space, with space \(x \in {\mathbb R}^2\) replaced by space–time \((x, t) \in {\mathbb R}^2 \times {\mathbb R}\) and with the scale parameter now encompassing also a notion of *temporal scale* \(\tau \), such that the multi-dimensional scale parameter \(s\) will be of the form \(s = (s_1, \ldots , s_N, \tau )\).

#### 2.2.1 Additional requirements regarding spatio-temporal geometry

*Galilean covariance* For time-dependent image data, it is natural to also take into explicit account the basic fact that objects may move relative to the observer. Specifically, constant velocity motion

*Galilean transformation* of space–time

such a Galilean model can be regarded as a *local linear approximation of a more general motion field* \(x(t) = (x_1(t), x_2(t))^T\).

*Galilean covariance*.
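For reference, constant velocity motion and the associated Galilean transformation of space–time can be written as below. This is a sketch in standard notation; the velocity symbol \(v\), the primed coordinates, and the identity matrix \(I\) are introduced here for illustration.

```latex
\begin{align}
  x(t) &= x(0) + v \, t, \qquad v = (v_1, v_2)^T, \\
  \begin{pmatrix} x' \\ t' \end{pmatrix}
    &= \begin{pmatrix} x + v \, t \\ t \end{pmatrix}
     = \begin{pmatrix} I & v \\ 0 & 1 \end{pmatrix}
       \begin{pmatrix} x \\ t \end{pmatrix}.
\end{align}
```

The transformation skews space–time along the time axis while leaving time itself unchanged, which is why a general motion field \(x(t)\) reduces to this form locally when only its first-order term in \(t\) is retained.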

Again, within the class of linear transformations \(\mathcal{T}_s\), it is not possible to realize such a Galilean covariance property within a spatio-temporal scale concept based solely on a scalar spatial scale parameter \(s \in {\mathbb R}\) and a scalar temporal scale parameter \(\tau \in {\mathbb R}\). As will be shown later, Galilean covariance can, however, be achieved within a four-parameter linear spatio-temporal scale space.

*Temporal causality* When dealing with time-dependent image data, another structural requirement arises because of the basic fact that the future cannot be accessed. Hence, for any real-time computer vision system or a biological organism that interacts with the world, the convolution kernel must be *time-causal* in the sense that its values must be zero regarding any access to the future

#### 2.2.2 Specific constraints regarding a real-time system

*Time recursivity and temporal memory* When dealing with spatio-temporal image data in a real-time setting, we cannot expect the vision system to have direct access to all information from the past, since we cannot assume a computer vision system or a biological organism to store a complete recording of all visual information it has seen.

If we assume that the vision system should compute internal image representations at different temporal scales, the only reasonable approach will therefore be that these computations have to be expressed in terms of computations on some internal temporal buffer \(M(x, t)\), which we assume is to be much more condensed than a complete video recording of the past. Such an internal representation is referred to as a *temporal memory*, and the restriction of the set of possible computations to a combination of the current image data \(f(x, t)\) with such a compact temporal memory \(M(x, t)\) is referred to as *time recursivity*. Specifically, this temporal memory \(M(x, t)\) must be updated over time \(t\) according to some time-recursive model.

- the kernel \(U\) performs the *update on the internal representation* \(L\) while simultaneously respecting a cascade property for \(L\) over spatial scales \(s\), and
- the kernel \(h\) *incorporates new information* from the new image data \(f(x, t)\) that arrive between \(t = t_1\) and \(t = t_2\).
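The combination of temporal causality and time recursivity can be illustrated with a discrete first-order recursive filter. This is a deliberately simple stand-in chosen for this example, not the kernels \(U\) and \(h\) of the text; the function `recursive_smooth` and the time constant `mu` are hypothetical names. The filter keeps a single state value as its temporal memory, updates it from the current sample only, and is equivalent to convolution with a kernel that vanishes for negative time lags.

```python
import numpy as np

def recursive_smooth(f, mu):
    """First-order time-recursive update; `state` is the only memory of the past."""
    a = 1.0 / (1.0 + mu)
    state = 0.0
    out = np.empty(len(f))
    for t, sample in enumerate(f):
        state = state + a * (sample - state)   # uses only the new sample and the state
        out[t] = state
    return out

rng = np.random.default_rng(3)
f = rng.standard_normal(100)     # illustrative temporal signal
mu = 4.0                         # assumed time constant
a = 1.0 / (1.0 + mu)

# Equivalent time-causal kernel: h[k] = a (1 - a)^k for lags k >= 0, zero for k < 0.
h = a * (1.0 - a) ** np.arange(len(f))
causal = np.convolve(f, h)[: len(f)]

matches = np.allclose(recursive_smooth(f, mu), causal)
```

The recursive form never stores past input samples, whereas the convolution form needs the whole history; that difference is the practical content of the time-recursivity requirement.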

*Non-enhancement of local extrema in a time-causal and time-recursive setting* When formalizing the notion of a smoothing operation in a time-causal and time-recursive context, where the internal temporal scale levels \(\tau \) are also used as the internal temporal buffer of past information, it turns out to be both useful and necessary to reformulate the requirement of non-enhancement of local extrema in the following way, to take into account the fact that at any temporal moment \(t_0\), we will have access to image data over space \(x\), spatial scales \(s\), and temporal scales \(\tau \), but no direct access to image data in the future or from the past:

*every positive direction* \(u = (u_1, \ldots , u_N, u_{N+1})\) in the \((N+1)\)-dimensional space consisting of the \(N\)-dimensional spatial scale parameter \(s\) complemented by time \(t\), the directional derivative \((\mathcal{D}_u L)(x, t;\; s, \tau )\) of the spatio-temporal scale-space representation in this direction \(u\) must satisfy (Lindeberg 2011, equations (79)–(80), page 52):

*non-causal* multi-parameter scale-space applied to space–time in a *non-recursive* setting where time \(t\) is treated in an essentially similar way as space \(x\), non-enhancement of local extrema according to (18) implies a strong evolution property *over temporal scales* \(\tau \). The conceptual difference with this *time-recursive formulation* is that the strong temporal smoothing property, as imposed by non-enhancement of local extrema, is instead expressed in terms of the evolution properties *over time* \(t\) and not over temporal scales \(\tau \).

Notably, this formulation of a temporal evolution property has an interesting interpretation of enforcing a smooth (stabilizing) temporal behavior of the internal representation \(L(x, t;\; s, \tau )\) of the surrounding world as the spatio-temporal data \(f(x, t)\) varies over time \(t\).

### 2.3 Influence of illumination variations

The above-mentioned symmetry requirements essentially refer to the geometry of space and space–time and its relation to image measurements over non-infinitesimal regions over space or space–time as formalized into the notion of a scale-space representation. Regarding the actual image intensities, these have so far been assumed to be given beforehand.

We may, however, consider different ways of parameterizing the intensity domain. Essentially, any monotonic intensity transformation will preserve the ordering of the intensity values from dark to bright. The perceptual impression of an image may, however, be substantially different after a nonlinear intensity transformation. Hence, one may ask whether we should assume the image data \(f\) to be proportional to image irradiance \(f \sim I\) (in units of power per unit area), some self-similar power of image irradiance \(f \sim I^{\gamma }\) or whether there is a better choice?

*Logarithmic brightness scale* Given the huge range of brightness variations under natural imaging conditions (a range corresponding to a factor of the order of \(10^{10}\) between the darkest and brightest cases for human vision), it is natural to represent the image brightness on a *logarithmic scale*: \(f \sim \log I\).

#### 2.3.1 Behavior under illumination variations: spatial image data

In this section, we will express properties of a logarithmic brightness scale in relation to a physical illumination model and image measurements in terms of receptive fields.

*Projection model* Consider a *planar perspective camera* model with \(X = (X_1, X_2, X_3)^T\) denoting world coordinates with the \(X_3\)-direction perpendicular to the image plane and with the image coordinates \((x_1, x_2)^T\) for simplicity expressed in units of the focal length \(f\), leading to the perspective projection equations (assuming that \(X_3 > 0\)) \(x_1 = X_1 / X_3\) and \(x_2 = X_2 / X_3\).

*thin lens* with diameter \(d\).

*Model for image irradiance* Then, given that the image irradiance \(I\) is proportional to the surface radiance \(R\) along the direction from a point \(X\) on the surface toward its projection \(X_\mathrm{im} = (x_1, x_2, 1)^T \times f\) on the image plane

*effective f-number*, and with a spatially varying reduction in image intensities toward the periphery of the image (*natural vignetting*) determined by the geometric factor \(\cos ^4 \phi (X)\) with

*Model for surface reflectance* Let us next assume that the *surface reflectance* \(R\) in the direction from the point \(X = (X_1, X_2, X_3)^T\) on the surface toward its projection \(X_\mathrm{im} = (x_1, x_2, 1)^T\) on the image plane can be modelled as proportional to an *albedo factor* \(\rho \) determined by the surface material and the amount of incoming illumination \(i\)

This model behaves similarly to a *Lambertian surface model*, with the extension that the surface may be regarded as “gray” by not reflecting all incident light. Please note, however, that this reflectance model constitutes a substantial simplification of the bidirectional reflectance function and does not comprise, e.g., specularities or materials with diffraction grating effects.

*Combined brightness model* By combining the illumination model in Eqs. (39) and (41) with the logarithmic brightness scale in Eq. (36) and by redefining the functions \(\rho (X)\) and \(i(X)\) such that their values for three-dimensional world coordinates \(X\) can be accessed from corresponding projected image coordinates \(x\) according to \(\rho (x)\) and \(i(x)\), we obtain

$$\begin{aligned} f(x) = \log \rho (x) + \log i(x) + \log C_\mathrm{cam}(\tilde{f}) + V(x) \end{aligned}$$(43)

which provides an explicit model for how the image brightness \(f\) depends on

- (i) *properties of surfaces of objects* in the world as condensed into the spatially dependent albedo factor \(\rho (x)\) with the implicit understanding that this entity may in general refer to different surfaces in the world depending on the viewing direction \((x_1, x_2, 1)^T\) and thus the image position \(x = (x_1, x_2)^T\),
- (ii) *properties of the illumination field* as reflected in the spatially dependent illumination \(i(x)\), which also may refer to the amount of incoming light on different surfaces in the world depending on the value of \(x\),
- (iii) *geometric properties of the camera* as condensed into a dependency on the effective \(f\)-number \(\tilde{f}\) captured by \(C_\mathrm{cam}(\tilde{f})\), and
- (iv) a geometric *natural vignetting* effect of the explicit form \(V(x) = V(x_1, x_2) = - 2 \log (1 + x_1^2 + x_2^2 )\).

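*Invariance to the effective f-number* A noteworthy property of the model in Eq. (43) is that if we disregard effects of focal blur (not modelled here), then the influence due to the internal focal distance \(f\) and the diameter \(d\) of the camera will be *cancelled*, if we differentiate this expression with respect to space \(x\)

$$\begin{aligned} (\partial _{x^{\alpha }} f)(x) = \partial _{x^{\alpha }} \left( \log \rho (x) + \log i(x) + V(x) \right) \end{aligned}$$(44)

where \(\alpha = (\alpha _1, \alpha _2)\) with \(|\alpha | > 0\) denotes the order of differentiation. Hence, spatial derivatives of the logarithmic brightness are *invariant to variations in the effective f-number* (as well as other multiplicative exposure parameters).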

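*Invariance to multiplicative illumination transformations* Moreover, if we consider image measurements from the same scene using a different illumination field \(i'(x)\) proportional to the original illumination field

$$\begin{aligned} i'(x) = C_\mathrm{illum} \, i(x), \end{aligned}$$

then \(\log i'(x) = \log C_\mathrm{illum} + \log i(x)\), and the additive constant vanishes under spatial differentiation. Hence, spatial derivatives of the logarithmic brightness are also *invariant to multiplicative illumination transformations.* The influence of the constant \(\log C_\mathrm{illum}\) will also disappear after filtering with a kernel having integral zero, i.e., equal positive and negative contributions.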

*Relative measurements of physical entities* Furthermore, regarding, e.g., any first-order derivative \(\partial _{x_k}\) with \(k\) equal to 1 or 2

$$\begin{aligned} \partial _{x_k} f(x) = \frac{\partial _{x_k} \rho (x)}{\rho (x)} + \frac{\partial _{x_k} i(x)}{i(x)} + \partial _{x_k} V(x) \end{aligned}$$

the response can be interpreted in terms of *relative variations* of the physical entities surface albedo \(\rho (x)\) and the illumination \(i(x)\) (where we assume these quantities to always be strictly positive and never becoming equal to zero):

- For a smooth surface with a spatially dependent surface pattern \(\rho (X)\), the first term \(\partial _{x_k} \rho /\rho \) reflects inherent *relative spatial variations of this surface pattern* as deformed by the perspective projection model in analogy with the affine deformation model (24).
- The second term \(\partial _{x_k} i/i\) reflects *relative spatial variations in the illumination field* \(i\) as arising from the interaction between the external illumination field \(i(X, \theta (X), \varphi (X))\) and the local surface geometry \((\theta (X), \varphi (X))\) at every surface point \(X\) according to (42).
- The third term \((\partial _{x_k} V)(x) = (\partial _{x_k} V)(x_1, x_2) = - 4 x_k/(1 + x_1^2 + x_2^2)\) constitutes a *geometric bias due to vignetting effects* inherent to the camera. (Please note that the image coordinates in this treatment are expressed in units of the focal length with \(|x| = \sqrt{x_1^2 + x_2^2} \ll 1\) in the central field of view.) This term will disappear for a spherical camera geometry.
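As a small numerical check of the vignetting term, the following sketch differentiates \(V(x) = -2 \log (1 + x_1^2 + x_2^2)\) analytically and by central differences. The function names and the test points are hypothetical choices made for illustration only; the sketch merely confirms that the derivative is \(-4 x_k/(1 + x_1^2 + x_2^2)\), so that the bias is small near the image center and points toward darker values in the periphery.

```python
import math


def vignetting(x1, x2):
    """Log-domain natural vignetting V(x) = -2 log(1 + x1^2 + x2^2)."""
    return -2.0 * math.log(1.0 + x1 * x1 + x2 * x2)


def vignetting_dx(x1, x2, k):
    """Analytic derivative of V with respect to x_k (k = 1 or 2)."""
    xk = x1 if k == 1 else x2
    return -4.0 * xk / (1.0 + x1 * x1 + x2 * x2)


def central_difference(func, x1, x2, k, h=1e-6):
    """Finite-difference estimate of the derivative of func along x_k."""
    if k == 1:
        return (func(x1 + h, x2) - func(x1 - h, x2)) / (2.0 * h)
    return (func(x1, x2 + h) - func(x1, x2 - h)) / (2.0 * h)


if __name__ == "__main__":
    # Image coordinates are in units of the focal length.
    for x1, x2 in [(0.0, 0.0), (0.05, -0.02), (0.3, 0.4)]:
        for k in (1, 2):
            print(x1, x2, k,
                  vignetting_dx(x1, x2, k),
                  central_difference(vignetting, x1, x2, k))
```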

*Invariance properties of spatial receptive fields involving spatial derivatives* There is an interesting relationship between the cancelling of multiplicative illumination transformations in Eq. (44) and image measurements in terms of receptive fields. If we consider the derived internal scale-space representation \(L\) of a signal \(f\) and compute any spatial derivative of this representation according to

$$\begin{aligned} \partial _{x^{\alpha }} L = \partial _{x^{\alpha }} \, \mathcal{T}_s \, f, \end{aligned}$$

then it follows that *the effect of any multiplicative illumination transformation will be invisible to image measurements in terms of receptive fields* \(\partial _{x^{\alpha }} \mathcal{T}_s\) *that involve spatial derivatives*. Similarly, besides effects of focal blur, the intensity dependency due to variations of the effective \(f\)-number \(\tilde{f}\) will also cancel. Hence, with a logarithmic brightness scale, image measurements in terms of receptive fields that involve spatial derivatives (or more generally any receptive field with its integral equal to zero) will be *invariant under multiplicative illumination transformations and exposure conditions*, with the latter corresponding to variations of the exposure time, the aperture and the ISO number of the sensor in a digital camera, or the diameter of the pupil and the photosensitivity of the photoreceptors in biological vision. The remaining response is a superposition of relative variations in surface patterns and illumination variations, with a position-dependent bias due to the vignetting effect.
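The invariance property can be illustrated by a minimal one-dimensional sketch. The synthetic irradiance signal, the scale value, and all function names below are assumptions made for illustration. The sketch takes the logarithm of the irradiance, filters it with a sampled first-order Gaussian derivative kernel (whose coefficients sum to zero), and compares the responses before and after multiplying the irradiance by a constant.

```python
import math


def gaussian_derivative_kernel(s, radius):
    """Sampled first-order derivative of a 1D Gaussian with variance s."""
    kernel = []
    for n in range(-radius, radius + 1):
        g = math.exp(-n * n / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)
        kernel.append(-n / s * g)
    return kernel


def convolve_valid(signal, kernel):
    """Convolution restricted to positions where the kernel fully fits."""
    r = len(kernel) // 2
    return [sum(kernel[j + r] * signal[c - j] for j in range(-r, r + 1))
            for c in range(r, len(signal) - r)]


def log_derivative_response(irradiance, s=4.0, radius=12):
    """Receptive field response computed on a logarithmic brightness scale."""
    f = [math.log(v) for v in irradiance]
    return convolve_valid(f, gaussian_derivative_kernel(s, radius))


def demo_irradiance(n=128):
    """Hypothetical scene: albedo pattern times a smooth illumination field."""
    return [(0.5 + 0.3 * math.sin(0.2 * x)) * (1.0 + 0.5 * math.cos(0.03 * x))
            for x in range(n)]


if __name__ == "__main__":
    base = demo_irradiance()
    brighter = [37.5 * v for v in base]  # multiplicative exposure change
    r1 = log_derivative_response(base)
    r2 = log_derivative_response(brighter)
    print(max(abs(a - b) for a, b in zip(r1, r2)))
```

The same comparison on a linear brightness scale would instead scale all responses by the multiplicative factor.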

It should be noted, however, that these results rely on *differentiability properties* of the image data. For images acquired from a natural world, there will in general be discontinuities in image brightness \(f\), due to discontinuities in depth, surface orientation, illumination, or the albedo of the surface patterns, which implies that we would generally expect to obtain strong spikes in the output if plain derivative operators were applied to natural image data. The use of *receptive field-based derivative operations*, however, regularizes this problem. For the families of smoothing kernels \(T(\cdot ;\, s)\) that can be derived from the requirement of non-enhancement of local extrema, it can be shown that the scale-space representation \(L(\cdot ;\, s)\) will indeed become *infinitely differentiable* after any non-infinitesimal amount of smoothing \(s > 0\) if we assume bounded brightness data \(|f(x)| < C\). Hence, the output from the receptive field-based derivative operators \(\partial _{x^{\alpha }} T(\cdot ;\, s)\) will always be well defined and the validity of the results in Eqs. (44) and (50) can be formally established with \((\partial _{x^{\alpha }} f)(x)\) replaced by \((\partial _{x^{\alpha }} L)(x;\; s)\):

$$\begin{aligned} (\partial _{x^{\alpha }} L)(x;\; s) = \partial _{x^{\alpha }} \, \mathcal{T}_s \left( \log \rho (x) + \log i(x) + V(x) \right) . \end{aligned}$$

In this respect, the spatial smoothing is *necessary* to make these computations of image derivatives valid. The assumption of linearity as a basic scale-space axiom in Eq. (2) can also be motivated from the form of this expression, by making it possible to interpret the receptive field responses as a linear superposition of relative variations in surface patterns and relative variations in the illumination field. Such an interpretation would not be possible if the smoothing operator \(\mathcal{T}_s\) were nonlinear.

*Scale-space properties of receptive field measurements involving spatial derivatives* Due to the linearity property, receptive field measurements involving spatial derivatives \(\partial _{x^{\alpha }} L\) will possess essentially similar scale-space properties over scales as possessed by the zero-order scale-space representation \(L\) of the original illumination pattern \(f\) as described in Sect. 2.1.3, with the main difference that the limit case in Eq. (7) when the scale parameter \(s\) tends to zero has to be replaced by

$$\begin{aligned} \lim _{s \downarrow 0} (\partial _{x^{\alpha }} L)(\cdot ;\; s) = \partial _{x^{\alpha }} f \end{aligned}$$

provided that the image data \(f\) are sufficiently differentiable.

#### 2.3.2 Behavior under illumination variations: spatio-temporal image data

*Invariance properties of spatial receptive fields involving spatio-temporal derivatives* For spatio-temporal image data, the corresponding image formation model becomes

Regarding temporal derivatives, it follows that the influence of the vignetting effect \(V(x)\) will be cancelled by any temporal derivative operator with \(\beta > 0\), since \(V(x)\) does not depend on time. The temporal derivative operator will also suppress the effect of any other solely spatial illumination variation.

*Galilean covariant temporal derivative concept* When considering temporal derivatives of spatio-temporal data computed for an object that moves with image velocity \(v = (v_1, v_2)^T\) relative to the observer, it is natural to consider *velocity-adapted temporal derivatives* \(\partial _{\bar{t}}\) along the direction of motion according to

$$\begin{aligned} \partial _{\bar{t}} = v_1 \, \partial _{x_1} + v_2 \, \partial _{x_2} + \partial _t \end{aligned}$$

so as to obtain *Galilean covariant image representations based on receptive fields involving temporal derivatives*, in analogy with the previous treatment of Galilean covariance in connection with Eq. (31).
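The effect of velocity adaptation can be illustrated with a minimal sketch, in which the moving pattern, the evaluation point, and the function names are hypothetical. For a pattern that translates rigidly with image velocity \(v\), the velocity-adapted temporal derivative \(v_1 \, \partial _{x_1} + v_2 \, \partial _{x_2} + \partial _t\) vanishes, whereas the plain temporal derivative \(\partial _t\) does not.

```python
import math


def moving_pattern(x1, x2, t, v):
    """Hypothetical pattern translating rigidly with image velocity v."""
    u1, u2 = x1 - v[0] * t, x2 - v[1] * t
    return math.sin(1.3 * u1) * math.cos(0.7 * u2)


def partial(func, point, axis, h=1e-5):
    """Central-difference partial derivative of func(x1, x2, t)."""
    upper, lower = list(point), list(point)
    upper[axis] += h
    lower[axis] -= h
    return (func(*upper) - func(*lower)) / (2.0 * h)


def velocity_adapted_derivative(func, point, v):
    """Apply v1 d/dx1 + v2 d/dx2 + d/dt at the given space-time point."""
    return (v[0] * partial(func, point, 0)
            + v[1] * partial(func, point, 1)
            + partial(func, point, 2))


if __name__ == "__main__":
    v = (0.4, -0.2)
    f = lambda x1, x2, t: moving_pattern(x1, x2, t, v)
    point = (0.3, 1.1, 2.0)
    print("plain temporal derivative:", partial(f, point, 2))
    print("velocity-adapted derivative:", velocity_adapted_derivative(f, point, v))
```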

#### 2.3.3 Summary regarding intensity and illumination variations

With a logarithmic brightness scale, *receptive field responses in terms of spatial and spatio-temporal derivatives have a direct physical interpretation* as the superposition of

- relative variations in the albedo of the observed surface patterns corresponding to the term \(\partial _{x^{\alpha } t^{\beta }} \left( \mathcal{T}_s \, \log \rho (x) \right) \) in (54), and
- relative variations in the illumination field corresponding to the term \(\partial _{x^{\alpha } t^{\beta }} \left( \mathcal{T}_s \, \log i(x) \right) \) in (54),

and these responses are *invariant under multiplicative illumination transformations* as well as other multiplicative exposure control mechanisms.

## 3 Spatial domain with pure intensity information

We shall now describe how the structural requirements on an idealized vision system as formulated in Sect. 2.1 restrict the class of possible image operations at the first stages of visual processing. For image data \(f :{\mathbb R}^2 \rightarrow {\mathbb R}\) defined over a *two-dimensional spatial domain*, let us assume that the first stage of visual processing as represented by the operator \(\mathcal{T}_s\) should be (i) *linear*, (ii) *shift invariant*, and (iii) obey a *semigroup structure over spatial scales* \(s\), where we also have to assume (iv) certain *regularity properties* of the semigroup \(\mathcal{T}_s\) *over scale* \(s\) in terms of Sobolev norms^{10} to guarantee sufficient differentiability properties with respect to space \(x \in {\mathbb R}^2\) and scale \(s\). Let us furthermore require (v) *non-enhancement of local extrema* to hold for *any* smooth image function \(f \in C^{\infty }({\mathbb R}^2) \cap L^1({\mathbb R}^2)\).

*affine and translated Gaussian kernels*

### 3.1 Gaussian receptive fields

*Gaussian derivative operators*

*general basis* for expressing image operations such as feature detection, feature classification, surface shape, image matching, and image-based recognition (Iijima 1962; Witkin 1983; Koenderink 1984; Koenderink and Doorn 1992; Lindeberg 1994a, b, 1998a, b, 2008; Florack 1997; Schiele and Crowley 1996, 2000; Lowe 1999, 2004; Chomat et al. 2000; ter Haar Romeny 2003; Linde and Lindeberg 2004, 2012; Bay et al. 2008). Specifically, this receptive field model makes it possible to compute *scale-invariant image features and image descriptors* (Crowley 1981; Crowley and Stern 1984; Lindeberg 1998a, b, 1999, 2013; Lowe 1999, 2004; Schiele and Crowley 2000; Chomat et al. 2000; Bay et al. 2008). Other necessity results concerning Gaussian and Gaussian derivative kernels have been presented by Iijima (1962), Koenderink (1984), Koenderink and Doorn (1992), Babaud et al. (1986), Yuille and Poggio (1986), Lindeberg (1990, 1994b, 1996), and Florack and Haar Romeny (1992).
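The semigroup structure assumed in this section can be checked numerically for Gaussian kernels. The following sketch is restricted to one dimension for brevity and uses a Gaussian parameterized by its variance \(s\); the sampling radius and the scale values are arbitrary choices for illustration. Convolving sampled Gaussians at scales \(s_1\) and \(s_2\) reproduces, up to discretization and truncation errors, the Gaussian at scale \(s_1 + s_2\).

```python
import math


def gaussian(x, s):
    """1D Gaussian kernel with variance (scale parameter) s."""
    return math.exp(-x * x / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)


def sampled_gaussian(s, radius):
    """Gaussian sampled at the integers -radius, ..., radius."""
    return [gaussian(n, s) for n in range(-radius, radius + 1)]


def convolve_full(a, b):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def semigroup_error(s1, s2, radius=40):
    """Largest deviation between g(s1) * g(s2) and g(s1 + s2)."""
    composed = convolve_full(sampled_gaussian(s1, radius),
                             sampled_gaussian(s2, radius))
    direct = sampled_gaussian(s1 + s2, 2 * radius)
    return max(abs(c - d) for c, d in zip(composed, direct))


if __name__ == "__main__":
    print(semigroup_error(4.0, 9.0))
```

For very fine scales relative to the sampling grid, the sampled Gaussian is a poor approximation and the agreement degrades.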

### 3.2 Affine-adapted Gaussian receptive fields

*affine Gaussian kernels*

*closed* under affine transformations, meaning that if we for two affine-related images

With respect to biological vision, the affine Gaussian kernels as well as directional derivatives of these can be used for modelling receptive fields that are oriented in the spatial domain, as will be described in connection with Eq. (111) in Sect. 6. For computational vision, they can be used for computing *affine invariant image features and image descriptors* for, e.g., cues to surface shape, image-based matching, and recognition (Lindeberg 1994b; Lindeberg and Gårding 1997; Baumberg 2000; Mikolajczyk and Schmid 2004; Tuytelaars and Gool 2004; Lazebnik et al. 2005; Rothganger et al. 2006).

### 3.3 Necessity of derived receptive fields in terms of derivatives

^{11} and non-enhancement of local extrema, with the semigroup structure (8) replaced by the cascade property

For directional derivatives that have been derived from elongated kernels whose underlying zero-order convolution kernels are not rotationally symmetric, it should be noted that we have aligned the directions of the directional derivative operators to the orientations of the underlying kernels. A structural motivation for making such an alignment can be obtained from a requirement of a weaker form of rotational symmetry at the group level. If we would like the family of receptive fields to be rotationally symmetric as a group, then it is natural to require the directional derivative operators to be transformed in a similar way as the underlying kernels.

## 4 Spatial domain with color information

To define a corresponding scale-space concept for color images, the simplest approach would be to compute a Gaussian scale-space representation for each color channel individually. Since the values of the color channels will usually be highly correlated, it is, however, preferable to *decorrelate* the dependencies by computing a color-opponent representation. Such a representation is also in good agreement with human vision, where a separation into red/green and yellow/blue color-opponent channels takes place at an early stage in the visual pathways.
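As an illustration of what a color-opponent representation may look like, the following sketch maps RGB values to one intensity channel and two chromatic channels. The specific linear transformation used here is an assumption chosen for illustration (a simple red/green difference and a yellow/blue difference), and the function name is hypothetical; other opponent transformations are possible.

```python
def color_opponent(r, g, b):
    """Map RGB to (intensity, red/green, yellow/blue) channels.

    Assumed illustrative transform, not a definitive model:
      f  = (R + G + B) / 3     intensity
      c1 = (R - G) / 2         red/green opponency
      c2 = (R + G) / 2 - B     yellow/blue opponency
    """
    f = (r + g + b) / 3.0
    c1 = (r - g) / 2.0
    c2 = (r + g) / 2.0 - b
    return f, c1, c2


if __name__ == "__main__":
    for name, rgb in [("gray", (0.5, 0.5, 0.5)),
                      ("red", (1.0, 0.0, 0.0)),
                      ("yellow", (1.0, 1.0, 0.0)),
                      ("blue", (0.0, 0.0, 1.0))]:
        print(name, color_opponent(*rgb))
```

With this choice, achromatic inputs give zero response in both chromatic channels, so that the chromatic channels carry information that is not already present in the intensity channel.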

### 4.1 Gaussian color-opponent receptive fields

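A *Gaussian color-opponent scale-space representation* \((C^{(1)}, C^{(2)})\) can be defined by applying Gaussian convolution to the color-opponent channels \((c^{(1)}, c^{(2)})^T\):

$$\begin{aligned} C^{(k)}(\cdot , \cdot ;\; s) = g(\cdot , \cdot ;\; s) * c^{(k)}(\cdot , \cdot ), \quad k \in \{1, 2\}. \end{aligned}$$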

In Hall et al. (2000), Linde and Lindeberg (2004, 2012), and Sande et al. (2010), it is shown how such spatio-chromatic receptive fields in combination with regular spatial receptive fields can constitute an effective basis for object recognition.

Another type of Gaussian color model has been proposed by Koenderink and later used by Geusebroek and his co-workers (Burghouts and Geusebroek 2009) with receptive fields defined over the spectrum of wavelengths in the color spectrum, corresponding to zero-, first-, and second-order derivatives with respect to wavelength.

## 5 Spatio-temporal image data

### 5.1 Non-causal spatio-temporal receptive fields

Let us first apply a similar way of reasoning as in Sect. 3 with space \(x \in {\mathbb R}^2\) replaced by space–time \((x, t)^T \in {\mathbb R}^2 \times {\mathbb R}\) and disregarding temporal causality, thereby allowing unlimited access to information over both space and time. Given image data \(f :{\mathbb R}^2 \times {\mathbb R}\rightarrow {\mathbb R}\) defined over a 2+1D spatio-temporal domain, let us therefore again assume that the first stage of visual processing as represented by the operator \(\mathcal{T}_s\) should be (i) *linear*, (ii) *shift invariant*, and (iii) obey a *semigroup structure over both spatial and temporal scales* \(s\), where we also assume (iv) certain *regularity properties* of the semigroup \(\mathcal{T}_s\) *over scale* \(s\) in terms of Sobolev norms^{12} to guarantee sufficient differentiability properties with respect to space \(x\), time \(t\) and spatio-temporal scales \(s\). Let us furthermore require (v) *non-enhancement of local extrema* to hold for *any* smooth image function \(f \in C^{\infty }({\mathbb R}^2 \times {\mathbb R}) \cap L^1({\mathbb R}^2 \times {\mathbb R})\) and for any positive scale direction \(s\).

*spatio-temporal Gaussian kernels*

*spatial extent*, (ii) \(\lambda _t\) determines the *temporal extent*, (iii) \(v = (v_1, v_2)^T\) denotes the *image velocity* and (iv) \(\delta \) represents a *temporal delay* and corresponding to a coupling between the spatial and temporal dimensions of the form

*Gaussian spatio-temporal scale space*

*velocity-adapted temporal derivatives*

*Transformation property under Galilean transformations* Under a Galilean transformation of space–time (27), in matrix form written

### 5.2 Time-causal spatio-temporal receptive fields

If we on the other hand with regard to real-time biological vision want to respect both temporal causality and temporal recursivity, we obtain different families of receptive fields. Specifically, two different families of time-causal receptive fields can be derived depending on whether we (i) require a continuous semigroup structure over a continuum of temporal scales or (ii) fix the temporal scale levels to be discrete a priori.

*Time-causal semigroup* Given the requirements of (i) *linearity* and (ii) spatial and temporal *shift invariance*, we require the scale-space kernels to be (iii) *time-causal* and require the visual front end to be (iv) *time recursive* in the sense that the internal image representations \(L(x, t;\; s, \tau )\) at different spatial scales \(s\) and temporal scales \(\tau \) do also constitute a sufficient internal temporal memory \(M(x, t)\) of the past, without any further need for temporal buffering. To adapt the convolution semigroup structure to a time-recursive setting, we require the spatio-temporal scale-space concept

*two-parameter semigroup* over spatial scales \(s\) and time \(t\)

*regularity properties* on the semigroup in terms of Sobolev norms to ensure differentiability (Lindeberg 2011, Appendix E), then (vii) the *time-recursive formulation of non-enhancement of local extrema* in Eq. (34) with respect to a continuum of both spatial and temporal scale levels implies that the semigroup must satisfy the following system of diffusion equations

\(g(x - v t;\; s;\; \varSigma )\) is a *velocity-adapted 2D affine Gaussian kernel* with spatial covariance matrix \(\varSigma \) and

\(\phi (t;\; \tau )\) is a *time-causal smoothing kernel over time* with temporal scale parameter \(\tau \), which is related to the regular one-dimensional Gaussian kernel according to \(\phi (t;\; \tau ) = - \partial _{\tau } g(\tau ;\; t)\). (Please note the shift of the order of the arguments between \(\phi \) and \(g\).)
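The stated relation between the time-causal kernel and the Gaussian kernel can be checked numerically. The sketch below assumes the explicit form \(\phi (t;\; \tau ) = \tfrac{1}{\sqrt{2 \pi } \, t^{3/2}} \, \tau \, e^{-\tau ^2/2t}\) for \(t > 0\) and a one-dimensional Gaussian \(g(x;\; t)\) with variance \(t\); the function names and test values are hypothetical. The derivative with respect to \(\tau \) is approximated by a central difference.

```python
import math


def gaussian_1d(x, variance):
    """Regular 1D Gaussian kernel g(x; variance)."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def time_causal_kernel(t, tau):
    """Time-causal kernel phi(t; tau); identically zero for t <= 0."""
    if t <= 0.0:
        return 0.0
    return tau / (math.sqrt(2.0 * math.pi) * t ** 1.5) * math.exp(-tau * tau / (2.0 * t))


def minus_dtau_gaussian(t, tau, h=1e-6):
    """Central-difference estimate of -d/dtau g(tau; t)."""
    return -(gaussian_1d(tau + h, t) - gaussian_1d(tau - h, t)) / (2.0 * h)


if __name__ == "__main__":
    for t, tau in [(0.5, 1.0), (2.0, 0.7), (10.0, 3.0)]:
        print(t, tau, time_causal_kernel(t, tau), minus_dtau_gaussian(t, tau))
```

Note that the roles of the arguments are exchanged: in \(\phi \), time \(t\) is the variable and \(\tau \) the parameter, whereas in \(g\), the variable is \(\tau \) and the variance is \(t\).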

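*Truncated exponential kernels/first-order integrators* If we on the other hand fix the temporal scale levels to be discrete *a priori*, then an alternative model for time-causal and time-recursive receptive fields can be obtained by performing the temporal smoothing using convolution with *truncated exponential functions*

$$\begin{aligned} h_\mathrm{exp}(t;\; \mu _i) = \begin{cases} \frac{1}{\mu _i} \, e^{-t/\mu _i} & t \ge 0 \\ 0 & t < 0 \end{cases} \end{aligned}$$

with time constants \(\mu _i\), each having the Laplace transform \(1/(1 + \mu _i q)\).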
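A time-recursive realization of such first-order integrators coupled in cascade can be sketched as follows. The backward-Euler discretization, the time step, the time constants, and the function names are assumptions made for illustration. Each stage only stores its own current output, so no further temporal buffering of the past is needed.

```python
def first_order_integrator(signal, mu, dt=1.0):
    """Recursive first-order filter with time constant mu.

    Assumed backward-Euler discretization of mu * dy/dt = x - y.
    """
    a = dt / (mu + dt)
    y = 0.0
    out = []
    for x in signal:
        y += a * (x - y)  # update uses only the previous output and new input
        out.append(y)
    return out


def cascade(signal, mus, dt=1.0):
    """Couple several first-order integrators in cascade."""
    for mu in mus:
        signal = first_order_integrator(signal, mu, dt)
    return signal


def impulse_response(mus, delay=10, length=600):
    """Response of the cascade to a unit impulse applied at index `delay`."""
    impulse = [0.0] * delay + [1.0] + [0.0] * length
    return cascade(impulse, mus)


if __name__ == "__main__":
    response = impulse_response([2.0, 5.0])
    mean_delay = sum((n - 10) * v for n, v in enumerate(response))
    print(sum(response), mean_delay)
```

With this discretization, the composed kernel is non-negative, has unit sum, is zero before the onset of the impulse, and has a temporal mean equal to the sum of the time constants.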

*effective temporal scale* \(\tau _\mathrm{eff} = \log \tau \):

### 5.3 Distributions of spatio-temporal receptive fields

### 5.4 Geometric covariance properties

This family of receptive fields is *closed* under (i) *rescalings* of the spatial and temporal dimensions, (ii) *Galilean transformations* of space–time, and (iii) *affine transformations* in the spatial domain. Hence, it satisfies the natural transformation properties that allow it to handle:

- image data acquired with different spatial and/or temporal *sampling rates*,
- image structures of different spatial and/or temporal *extent*,
- objects at different *distances* from the camera,
- the linear component of *relative motions* between objects in the world and the observer, and
- the linear component of *perspective deformations*.

## 6 Computational modelling of biological receptive fields

In two comprehensive reviews, DeAngelis et al. (1995) and DeAngelis and Anzai (2004) present overviews of spatial and temporal response properties of (classical) receptive fields in the central visual pathways. Specifically, the authors point out the limitations of defining receptive fields in the spatial domain only and emphasize the need to characterize receptive fields in the *joint* space–time domain, to describe how a neuron processes the visual image. Conway and Livingstone (2006) show the result of a corresponding investigation concerning color receptive fields.

In the following, we will describe how the above-mentioned spatial and spatio-temporal scale-space concepts can be used for modelling the spatial, spatio-chromatic, and spatio-temporal response properties of biological receptive fields. Indeed, it will be shown that the Gaussian and time-causal scale-space concepts lead to predictions of receptive field profiles that are qualitatively very similar to *all* the receptive field types presented in DeAngelis et al. (1995), DeAngelis and Anzai (2004), and schematic simplifications of most of the receptive fields shown in Conway and Livingstone (2006).

### 6.1 LGN neurons

*first- and second-order temporal derivatives* of a time-causal temporal scale-space representation. For the first-order temporal derivative of a time-causal temporal scale-space kernel, the first peak is strongest, whereas the second peak is the most dominant one for second-order temporal derivatives. The spatial response, on the other hand, shows a high similarity to a *Laplacian of a Gaussian*.

\(\pm \) determines the polarity (on-center/off-surround versus off-center/on-surround),

\(\partial _{x_1 x_1} + \partial _{x_2 x_2}\) denotes the spatial Laplacian operator,

\(g(x_1, x_2;\; s)\) denotes a rotationally symmetric spatial Gaussian,

\(\partial _{{t'}}\) denotes a temporal derivative operator with respect to a possibly self-similar transformation of time \(t' = t^{\alpha }\) or \(t' = \log t\) such that \(\partial _{{t'}} = t^{\kappa } \, \partial _t\) for some constant \(\kappa \in [0, 1]\) (Lindeberg 2011, Sect. 5.1, pages 59–61)^{13},

\(h(t;\; \tau )\) is a temporal smoothing kernel over time corresponding to the time-causal smoothing kernel \(\phi (t;\; \tau ) = \tfrac{1}{\sqrt{2 \pi } \, t^{3/2}} \, \tau \, e^{-\tau ^2/2t}\) in (95), a non-causal time-shifted Gaussian kernel \(g(t;\; \tau , \delta ) = \tfrac{1}{\sqrt{2 \pi \tau }} e^{-(t - \delta )^2/2 \tau }\) according to (76) or a time-causal kernel corresponding to a set of first-order integrators over time coupled in cascade having a Laplace transform \(H_\mathrm{composed}(q;\; \mu ) = \prod _{i=1}^{k} \frac{1}{1 + \mu _i q}\) according to (99),

\(n\) is the order of temporal differentiation,

\(s\) is the spatial scale parameter and

\(\tau \) is the temporal scale parameter.
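The components listed above can be combined into a minimal space–time separable sketch. It assumes first-order temporal differentiation (\(n = 1\)), the choice \(\kappa = 0\) so that \(\partial _{t'} = \partial _t\), and the time-causal kernel \(\phi (t;\; \tau )\) as temporal smoothing kernel; the function names and parameter values are hypothetical. The spatial factor is the Laplacian of a rotationally symmetric Gaussian, which changes sign at \(x_1^2 + x_2^2 = 2 s\), and the temporal factor changes sign at \(t = \tau ^2/3\), which gives a biphasic temporal profile.

```python
import math


def laplacian_of_gaussian(x1, x2, s):
    """(d^2/dx1^2 + d^2/dx2^2) of a 2D Gaussian with scale parameter s."""
    r2 = x1 * x1 + x2 * x2
    g = math.exp(-r2 / (2.0 * s)) / (2.0 * math.pi * s)
    return (r2 - 2.0 * s) / (s * s) * g


def time_causal_kernel(t, tau):
    """Time-causal smoothing kernel phi(t; tau); zero for t <= 0."""
    if t <= 0.0:
        return 0.0
    return tau / (math.sqrt(2.0 * math.pi) * t ** 1.5) * math.exp(-tau * tau / (2.0 * t))


def time_causal_kernel_dt(t, tau):
    """First-order temporal derivative of phi(t; tau) for t > 0."""
    if t <= 0.0:
        return 0.0
    return time_causal_kernel(t, tau) * (-1.5 / t + tau * tau / (2.0 * t * t))


def lgn_response(x1, x2, t, s, tau, polarity=1):
    """Separable product of spatial Laplacian and temporal derivative."""
    return polarity * laplacian_of_gaussian(x1, x2, s) * time_causal_kernel_dt(t, tau)


if __name__ == "__main__":
    s, tau = 2.0, 3.0
    for t in (1.0, 3.0, 10.0):
        print(t, lgn_response(0.0, 0.0, t, s, tau, polarity=-1))
```

Changing the value of `polarity` from \(+1\) to \(-1\) flips the sign of the center and the surround simultaneously.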

Concerning the application of the Laplacian of Gaussian model for on-center/off-surround and off-center/on-surround receptive fields in the retina, it should be emphasized that the retina also contains other types of receptive fields that are not modelled here, such as brisk transient (Y) ganglion cells that respond to rapid transients and directional selective ganglion cells that respond to visual motion (Wässle 2004).

*Note:* In all illustrations in Sect. 6, where spatial and spatio-temporal derivative expressions are aligned to biological data, the unit for the spatial scale parameter \(s\) corresponds to \([\text{ degrees }^2]\) of visual angle and the units for the temporal scale parameter \(\tau \) in the Gaussian spatio-temporal scale-space representation are \([\text{ milliseconds }^2]\), whereas the units for the temporal scale parameter \(\tau \) in the time-causal spatio-temporal scale-space representation are \([\sqrt{\text{ milliseconds }}]\). For image velocities \(v\) of velocity-adapted filters, the units are \([\text{ degrees/millisecond }]\). The reason why the units are different for the three types of spatio-temporal scale spaces is that the dimensionality of the temporal scale parameter is different in each of these spatio-temporal scale-space concepts.

### 6.2 Double-opponent spatio-chromatic cells

*double-opponent cells*, since they simultaneously compute both spatial and chromatic opponency. According to Conway and Livingstone (2006), this cell type can be regarded as the first layer of spatially opponent color computations.

### 6.3 Simple cells

*oriented in the spatial domain* and *sensitive to specific stimulus velocities.*

*simple cells*(Hubel and Wiesel 1959, 1962).

#### 6.3.1 Spatial dependencies

*spatial component*of this orientation dependency according to

\(\partial _{\varphi } = \cos \varphi \, \partial _{x_1} + \sin \varphi \, \partial _{x_2}\) is a directional derivative operator,

\(m\) is the order of spatial differentiation, and

\(g(x_1, x_2;\; \varSigma )\) is an affine Gaussian kernel with spatial covariance matrix \(\varSigma \) as can be parameterized according to (68)

In the specific case when the covariance matrix is proportional to a unit matrix \(\varSigma = s \, I\), with \(s\) denoting the spatial scale parameter, these directional derivatives correspond to regular Gaussian derivatives as proposed as a model for spatial receptive fields by Koenderink and Doorn (1987, 1992). The use of non-isotropic covariance matrices does on the other hand allow for a higher degree of orientation selectivity and does additionally allow for closedness under affine transformations (affine covariance).
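A first-order directional derivative of an affine Gaussian kernel can be evaluated in closed form, as in the following sketch. The covariance matrix is here built from two eigenvalues and an orientation by a rotation, which is an assumed parameterization for illustration, and the function names are hypothetical. The derivative follows from \(\nabla g = - \varSigma ^{-1} x \, g\).

```python
import math


def covariance_matrix(lam1, lam2, theta):
    """Entries (a, b, d) of the symmetric matrix [[a, b], [b, d]].

    Assumed parameterization: eigenvalues lam1, lam2 and orientation theta.
    """
    c, s = math.cos(theta), math.sin(theta)
    return (lam1 * c * c + lam2 * s * s,
            (lam1 - lam2) * c * s,
            lam1 * s * s + lam2 * c * c)


def affine_gaussian(x1, x2, sigma):
    """Affine Gaussian kernel g(x; Sigma) over the 2D spatial domain."""
    a, b, d = sigma
    det = a * d - b * b
    quad = (d * x1 * x1 - 2.0 * b * x1 * x2 + a * x2 * x2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def directional_derivative(x1, x2, sigma, phi):
    """(cos(phi) d/dx1 + sin(phi) d/dx2) applied to g(x; Sigma)."""
    a, b, d = sigma
    det = a * d - b * b
    g = affine_gaussian(x1, x2, sigma)
    gx1 = -(d * x1 - b * x2) / det * g
    gx2 = -(-b * x1 + a * x2) / det * g
    return math.cos(phi) * gx1 + math.sin(phi) * gx2


if __name__ == "__main__":
    sigma = covariance_matrix(4.0, 1.0, 0.5)
    for x1, x2 in [(0.7, -0.3), (-0.7, 0.3), (0.0, 0.0)]:
        print(x1, x2, directional_derivative(x1, x2, sigma, 0.5))
```

The resulting profile is odd under \(x \mapsto -x\), so it has integral zero, and for \(\varSigma = s \, I\) it reduces to a regular first-order Gaussian derivative.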

^{14} to the statistics of image data in a neighborhood of each image point; see Weickert (1998) and Almansa and Lindeberg (2000) for the applications of this idea for enhancing local directional image structures in computer vision.

*Relations to Gabor functions* Based on the work by Marcelja (1980), Gabor functions

Conceptually, the ripples of the Gabor functions, which are given by complex sine waves, are related to the ripples of Gaussian derivatives, which are given by Hermite functions. A Gabor function, however, requires the specification of a scale parameter and a spatial frequency, whereas a Gaussian derivative requires a scale parameter and the order of differentiation (per spatial dimension). With the Gaussian derivative model, receptive fields of different orders can be mutually related by derivative operations and be computed from each other by nearest-neighbor operations. The zero-order receptive fields as well as the derivative-based receptive fields can be modelled by diffusion equations and can therefore be implemented by computations between neighboring computational units.

In relation to invariance properties, the family of affine Gaussian kernels is closed under affine image deformations, whereas the family of Gabor functions obtained by multiplying rotationally symmetric Gaussians with sine and cosine waves is not closed under affine image deformations. This means that it is not possible to compute truly affine invariant image representations from such Gabor functions. Instead, given a pair of images that are related by a non-uniform image deformation, the lack of affine covariance implies that there will be a systematic bias in the image representations derived from such Gabor functions, corresponding to the difference between the backprojected Gabor functions in the two image domains. If using receptive profiles defined from directional derivatives of affine Gaussian kernels, it will on the other hand be possible to compute provably affine invariant image representations.

With regard to invariance to multiplicative illumination variations, the even cosine component of a Gabor function does in general not have its integral equal to zero, which means that the illumination invariant properties under multiplicative illumination variations or exposure control mechanisms described in Sect. 2.3 do not hold for Gabor functions.
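The difference in DC response can be made concrete with a one-dimensional numerical sketch. The function names, the scale \(s = 1\), and the frequency \(\omega = 2\) are hypothetical choices. For a Gaussian envelope with variance \(s\), the even (cosine) Gabor component integrates to \(e^{-\omega ^2 s/2}\), which is nonzero, whereas a second-order Gaussian derivative integrates to zero.

```python
import math


def gaussian(x, s):
    """1D Gaussian with variance s."""
    return math.exp(-x * x / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)


def gabor_even(x, s, omega):
    """Even (cosine) component of a 1D Gabor function."""
    return gaussian(x, s) * math.cos(omega * x)


def gaussian_dxx(x, s):
    """Second-order derivative of the 1D Gaussian."""
    return (x * x - s) / (s * s) * gaussian(x, s)


def integrate(func, lo=-12.0, hi=12.0, n=4801):
    """Trapezoidal rule on a uniform grid with n points."""
    h = (hi - lo) / (n - 1)
    total = 0.5 * (func(lo) + func(hi))
    for k in range(1, n - 1):
        total += func(lo + k * h)
    return total * h


if __name__ == "__main__":
    s, omega = 1.0, 2.0
    print("even Gabor:", integrate(lambda x: gabor_even(x, s, omega)))
    print("predicted :", math.exp(-omega * omega * s / 2.0))
    print("Gaussian second derivative:", integrate(lambda x: gaussian_dxx(x, s)))
```

An additive constant, such as \(\log C_\mathrm{illum}\) on a logarithmic brightness scale, therefore leaks into the response of the even Gabor component in proportion to this integral, but not into the response of the Gaussian derivative.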

In this respect, the Gaussian derivative model is simpler: it can be related to image measurements by differential geometry, be derived axiomatically from symmetry principles, be computed from a minimal set of connections, and allows for provable invariance properties under locally linearized image deformations (affine transformations) as well as local multiplicative illumination variations and exposure control mechanisms. Young (1987) has more generally shown how spatial receptive fields in cats and monkeys can be well modelled by Gaussian derivatives up to order four.

In the area of computer vision, a multi-scale differential geometric framework in terms of Gaussian derivatives and closely related operators has become an accepted and de facto standard for defining image features for feature detection, feature classification, stereo matching, motion estimation, object recognition, spatio-temporal recognition, shape analysis, and image enhancement. Specifically, the formulation of image primitives in terms of scale-space derivatives makes it possible to use tools from differential geometry for deriving relationships between image features and physical properties of objects in the environment, allowing for computationally operational and theoretically well-founded modelling of possibilities or constraints for visual perception.

*Orientation maps* Optical imaging techniques have shown that orientation selective cells that respond best to one orientation are grouped together in highly ordered patches and that these iso-orientation patches are organized around “orientation centers” that produce characteristic pinwheel-like patterns (Bonhoeffer and Grinvald 1991). Measurements have also shown that the degree of orientation selectivity varies regularly over the cortex and can be different near versus further away from the center of a pinwheel (Blasdel 1992). Specifically, the orientation selectivity has been reported to be lowest at the positions of the centers of the pinwheels (see Fig. 27).

Given the model (111) of orientation selective receptive fields as depending on a spatial covariance matrix \(\varSigma \), this property is in good qualitative agreement with a distribution of receptive fields over a population of covariance matrices with different preferred orientations, as determined from the eigenvectors of the covariance matrix, and different ratios between the scale parameters along the preferred orientations, as determined by the square root of the ratio between the eigenvalues of the covariance matrix. Specifically, the property of the orientation selectivity being lowest at the positions of the centers of the pinwheels would be compatible with the covariance matrix there being close to (or at least closer to) a unit matrix. The ratio between the eigenvalues would then be close to (or closer to) one at the center of the pinwheel, which in turn implies that the orientations of the eigenvectors are sensitive to minor perturbations of the covariance matrix.
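The sensitivity of the preferred orientation near a unit covariance matrix can be illustrated numerically. The sketch below uses the standard closed-form expressions for the eigen-decomposition of a symmetric \(2 \times 2\) matrix; the function names and the perturbation size are hypothetical. For a nearly isotropic matrix, a tiny change of the off-diagonal element swings the principal orientation by about \(90^\circ \), whereas an elongated matrix is barely affected.

```python
import math


def principal_orientation(a, b, d):
    """Orientation (radians) of the dominant eigenvector of [[a, b], [b, d]]."""
    return 0.5 * math.atan2(2.0 * b, a - d)


def scale_ratio(a, b, d):
    """Square root of the ratio between the larger and smaller eigenvalue."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return math.sqrt((mean + radius) / (mean - radius))


if __name__ == "__main__":
    eps = 1e-3
    for name, (a, d) in [("near-isotropic", (1.0, 1.0)), ("elongated", (4.0, 1.0))]:
        plus = math.degrees(principal_orientation(a, +eps, d))
        minus = math.degrees(principal_orientation(a, -eps, d))
        print(name, "ratio:", scale_ratio(a, eps, d),
              "orientation change (deg):", plus - minus)
```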

#### 6.3.2 Spatio-temporal dependencies

*joint space–time domain*, the spatio-temporal response properties of receptive fields in the striate cortex range from separable (Fig. 28) to strongly inseparable (Fig. 30), where a majority exhibit *marked space–time inseparability.* The temporal profile is reported to be typically biphasic, although some cells are reported to have monophasic or triphasic responses (DeAngelis et al. 1995; DeAngelis and Anzai 2004) (Figs. 29, 31).

In terms of temporal derivatives, a biphasic behavior arises from first-order derivatives, a monophasic behavior from zero-order derivatives, and a triphasic behavior from second-order derivatives. Concerning the oriented spatial response characteristics, there is a high similarity with directional derivatives of Gaussian kernels (Young 1987).
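The relation between the order of temporal differentiation and the number of response phases can be checked with a small sketch. It uses the non-causal time-shifted Gaussian kernel \(g(t;\; \tau , \delta )\) for simplicity; the function names, the grid, and the parameter values are hypothetical. The number of lobes is counted as the number of sign-constant segments of the sampled kernel.

```python
import math


def temporal_gaussian_derivative(t, tau, delta, order):
    """Derivative of order 0, 1 or 2 of the time-shifted Gaussian g(t; tau, delta)."""
    u = t - delta
    g = math.exp(-u * u / (2.0 * tau)) / math.sqrt(2.0 * math.pi * tau)
    if order == 0:
        return g
    if order == 1:
        return -u / tau * g
    if order == 2:
        return (u * u - tau) / (tau * tau) * g
    raise ValueError("only orders 0, 1 and 2 are implemented in this sketch")


def count_lobes(values, eps=1e-12):
    """Number of sign-constant segments, ignoring negligible samples."""
    signs = [1 if v > 0 else -1 for v in values if abs(v) > eps]
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def lobes_for_order(order, tau=100.0, delta=100.0):
    """Count the lobes of the sampled temporal derivative kernel."""
    samples = [temporal_gaussian_derivative(float(t), tau, delta, order)
               for t in range(0, 201)]
    return count_lobes(samples)


if __name__ == "__main__":
    for order, label in [(0, "monophasic"), (1, "biphasic"), (2, "triphasic")]:
        print(order, lobes_for_order(order), label)
```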

*non-causal Gaussian spatio-temporal derivative kernels*

$$\begin{aligned}&h_\mathrm{Gaussian}(x_1, x_2, t;\; s, \tau , v, \delta ) \nonumber \\&\quad = \partial _{\varphi }^{m_1} \, \partial _{\bot \varphi }^{m_2} \, \partial _{\bar{t}^n} g(x_1, x_2, t;\; s, \tau , v, \delta ) \end{aligned}$$(114)

*time-causal spatio-temporal derivative kernels*

$$\begin{aligned}&h_\mathrm{time-causal}(x_1, x_2, t;\; s, \tau , v)\nonumber \\&\quad = (\partial _{\bar{x_1}^{\alpha _1} \bar{x_2}^{\alpha _2}} \partial _{\bar{t}^{\beta }} h)(x_1, x_2, t;\; s, \tau , v) \end{aligned}$$(115)

\(\partial _{\varphi } = \cos \varphi \, \partial _{x_1} + \sin \varphi \, \partial _{x_2}\) and \(\partial _{\bot \varphi } = \sin \varphi \, \partial _{x_1} - \cos \varphi \, \partial _{x_2}\) denote spatial directional derivative operators according to (69) in two orthogonal directions \(\varphi \) and \(\bot \varphi \),

\(m_1 \ge 0\) and \(m_2 \ge 0\) denote the orders of differentiation in the two orthogonal directions in the spatial domain with the overall spatial order of differentiation \(m = m_1 + m_2\),

\(\partial _{\bar{t}} = v_1 \, \partial _{x_1} + v_2 \, \partial _{x_2} + \partial _t\) denotes a velocity-adapted temporal derivative operator,

\(v = (v_1, v_2)^T\) denotes the image velocity,

\(n\) denotes the order of temporal differentiation,

\(g(x_1 - v_1 t, x_2 - v_2 t;\; \varSigma )\) denotes a spatial affine Gaussian kernel according to (63) that moves with image velocity \(v = (v_1, v_2)^T\) in space–time,

\(\varSigma \) denotes a spatial covariance matrix that can be parameterized by two eigenvalues \(\lambda _1\) and \(\lambda _2\) as well as a spatial orientation \(\theta \) of the form (68),

\(h(t;\; \tau )\) is a temporal smoothing kernel over time corresponding to the time-causal smoothing kernel \(\phi (t;\; \tau ) = \tfrac{1}{\sqrt{2 \pi } \, t^{3/2}} \, \tau \, e^{-\tau ^2/2t}\) in (95), a non-causal time-shifted Gaussian kernel \(g(t;\; \tau , \delta ) = \tfrac{1}{\sqrt{2 \pi \tau }} e^{-(t - \delta )^2/2 \tau }\) according to (76) or a time-causal kernel corresponding to a set of first-order integrators over time coupled in cascade having a Laplace transform \(H_\mathrm{composed}(q;\; \mu ) = \prod _{i=1}^{k} \frac{1}{1 + \mu _i q}\) according to (99),

\(s\) denotes the spatial scale and

\(\tau \) denotes the temporal scale.

*all* the linear receptive field types reported from cell recordings in LGN and V1 by DeAngelis et al. (1995) and DeAngelis and Anzai (2004).

Young et al. (2001) and Young and Lesperance (2001) have also shown how spatio-temporal receptive fields can be modelled by Gaussian derivatives over a spatio-temporal domain, corresponding to the Gaussian spatio-temporal concept described here, although with a different type of parameterization; see also Lindeberg (1997, 2001) for closely related earlier work. These scale-space models can therefore be regarded as *idealized functional and phenomenological models of receptive fields,* whose actual realization can then be implemented in different ways depending on available hardware or wetware.

*Relations to approaches for learning receptive fields from natural image statistics* Work has also been performed on learning receptive field properties and visual models from the statistics of natural image data (Field 1987; van der Schaaf and van Hateren 1996; Olshausen and Field 1996; Rao and Ballard 1998; Simoncelli and Olshausen 2001; Geisler 2008; Hyvärinen et al. 2009; Lörincz et al. 2012) and has been shown to lead to the formation of receptive fields similar to those found in biological vision. The proposed theory of receptive fields can be seen as describing basic physical constraints under which a learning-based method for the development of receptive fields will operate and the solutions to which an optimal adaptive system may converge, if exposed to a sufficiently large and representative set of natural image data. Field (1987) as well as Doi and Lewicki (2005) have described how “natural images are not random, instead they exhibit statistical regularities” and have used such statistical regularities for constraining the properties of receptive fields. The theory presented in this paper can be seen as a theory at a higher level of abstraction, in terms of basic principles that reflect properties of the environment that in turn determine properties of the image data, without need for explicitly constructing specific statistical models for the image statistics. Specifically, the proposed theory can be used for explaining why the above-mentioned statistical models lead to qualitatively similar types of receptive fields as the idealized receptive fields obtained from our theory.

An interesting observation that can be made from the similarities between the receptive field families derived by necessity from the assumptions and the receptive field profiles found by cell recordings in biological vision is that receptive fields in the retina, LGN, and V1 of higher mammals are very close to *ideal* in view of the stated structural requirements/symmetry properties. In this sense, biological vision can be seen as having adapted very well to the transformation properties of the surrounding world and the transformations that occur when a three-dimensional world is projected to a two-dimensional image domain.

### 6.4 Spatio-chrom-temporal receptive fields

### 6.5 Motion selectivity

Concerning motion selectivity, DeAngelis et al. (1995) and DeAngelis and Anzai (2004) report that most cortical neurons are quite *sensitive to stimulus velocity* and that the speed tuning is narrower than for LGN cells. Simple cells with inseparable receptive fields have directional preference, while cells with space–time separable receptive fields do not. Moreover, the preferred direction of motion corresponds to the orientation of the filter in space–time.

*space–time separable* receptive fields correspond to spatio-temporal scale-space kernels without velocity adaptation, whereas *inseparable* receptive fields correspond to kernels that are explicitly adapted to nonzero velocities.
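The relation between velocity adaptation and space–time separability can be illustrated with a reduced example. The sketch below assumes a simplified model with one spatial dimension and a non-causal Gaussian over time, \(g(x - v t;\; s) \, g(t;\; \tau )\), rather than the full 2+1-D kernels; the separability index based on singular values and all function names are illustrative choices.

```python
import numpy as np


def gaussian(x, variance):
    """One-dimensional Gaussian with the given variance."""
    return np.exp(-x**2 / (2.0 * variance)) / np.sqrt(2.0 * np.pi * variance)


def velocity_adapted_kernel(x, t, s, tau, v):
    """1+1-D kernel g(x - v t; s) g(t; tau) sampled on a space-time grid."""
    xx, tt = np.meshgrid(x, t, indexing="ij")
    return gaussian(xx - v * tt, s) * gaussian(tt, tau)


def separability_index(kernel):
    """Fraction of kernel energy in the best rank-one (separable) approximation."""
    singular_values = np.linalg.svd(kernel, compute_uv=False)
    return float(singular_values[0]**2 / np.sum(singular_values**2))


x = np.linspace(-10.0, 10.0, 201)
t = np.linspace(-5.0, 5.0, 101)
index_static = separability_index(velocity_adapted_kernel(x, t, 1.0, 1.0, v=0.0))
index_moving = separability_index(velocity_adapted_kernel(x, t, 1.0, 1.0, v=1.0))
```

For \(v = 0\) the sampled kernel is an outer product of a spatial and a temporal profile, so the index is one up to rounding errors. For \(v = 1\) the kernel is tilted in space–time and a single separable component no longer captures all of its energy.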

The above-mentioned fact that a majority of the cells are inseparable in space–time is indeed nicely compatible with a description in terms of a *multi-parameter scale space* as outlined in Sect. 2.1.3. If the vision system is to give a reasonable coverage of a set of filter parameters \(\varSigma \) and \(v\), then the set of filters corresponding to space–time separable receptive fields (corresponding to the filter parameters \(v = 0\)) will be much smaller than the set of filters allowing for nonzero values of the mixed parameters \(\varSigma \) and \(v\) over space and time.

### 6.6 Complex cells

Besides the above-mentioned linear receptive fields, there is a large number of early *nonlinear* receptive fields that do not obey the superposition principle and whose response properties are rather insensitive to the phase of the visual stimuli. The response profile of such a cell in the spatial domain is typically of the form illustrated in Fig. 21c. Such cells for which the response properties are independent of the polarity of the stimuli are referred to as *complex cells* (Hubel and Wiesel 1959, 1962).

*quasi-quadrature* measure over a one-dimensional either spatial or temporal domain as (Lindeberg 1997)

*scale-normalized derivatives* with respect to scale-normalized coordinates \(\xi = x/\sqrt{s}\) (Lindeberg 1998b) and where the constant \(C\) can be determined either to minimize the amount of ripples in the operator response (\(C = 2/3 \approx 0.667\)) or from scale selection properties (\(C = e/4 \approx 0.670\)). Within this model, the first- and second-order Gaussian derivative approximations constitute an *approximation of a Hilbert pair* within the Gaussian derivative framework.
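The phase insensitivity of such a measure can be illustrated with a short numerical sketch. Since the defining formula is not reproduced in this text, the sketch assumes the form \(\mathcal{Q} L = s \, L_x^2 + C \, s^2 L_{xx}^2\), that is, squared scale-normalized first- and second-order derivatives combined with the weight \(C\). The function names, the sampled Gaussian derivative kernels, and the test signal are illustrative assumptions.

```python
import numpy as np


def gaussian_derivative_kernels(s, radius_factor=6.0):
    """Sampled first- and second-order Gaussian derivative kernels, variance s."""
    radius = int(np.ceil(radius_factor * np.sqrt(s)))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
    return -x / s * g, (x**2 - s) / s**2 * g


def quasi_quadrature(signal, s, C=2.0 / 3.0):
    """Assumed form Q = s L_x^2 + C s^2 L_xx^2; also returns the first-order term."""
    g_x, g_xx = gaussian_derivative_kernels(s)
    L_x = np.convolve(signal, g_x, mode="same")
    L_xx = np.convolve(signal, g_xx, mode="same")
    first_order_energy = s * L_x**2
    return first_order_energy + C * s**2 * L_xx**2, first_order_energy


def relative_ripple(values):
    """Peak-to-peak variation relative to the maximum value."""
    return float((values.max() - values.min()) / values.max())


s, C = 16.0, 2.0 / 3.0
omega = np.sqrt(1.0 / (C * s))   # frequency at which both terms balance
x = np.arange(2000, dtype=float)
Q, E1 = quasi_quadrature(np.sin(omega * x), s, C)
interior = slice(200, 1800)      # discard boundary effects of "same" mode
ripple_Q = relative_ripple(Q[interior])
ripple_E1 = relative_ripple(E1[interior])
```

For a sine wave at the balancing frequency, the combined measure is essentially constant over the interior of the signal, whereas the first-order energy alone oscillates between zero and its maximum. At other frequencies some ripple remains, which is consistent with the pair being only an approximation of a Hilbert pair.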

*pointwise* computation quasi-quadrature entities, we can apply a second-stage smoothing step

In a detailed study of the response properties of complex cells, Touryan et al. (2002) observed an additive interaction between the eigenvectors of a quadratic nonlinear model supporting the energy model (Adelson and Bergen 1985; Heeger 1992). In a more recent study, Rust et al. (2005) found that complex cell responses are better described by more linear filters than the one or two used in previous models. The above-mentioned quasi-quadrature models are in qualitative agreement with such computational structures. Specifically, the second-stage smoothing (125) of the pointwise quasi-quadrature measure is in good agreement with the model of complex cell responses in Rust et al. (2005, Fig. 8, page 953) based on weighted averaging of a set of quadrature pairs.

*divisive normalization* (Heeger 1992; Schwartz and Simoncelli 2001) with respect to an ensemble of different nonlinear feature detectors \(\mathcal{Q}_i L\) with their respective weights \(w_i\) according to

*ensemble* of such nonlinear receptive fields would then correspond to a *population coding* over different spatial orientations, motion directions, and temporal scales.

Suppressive influence can also be obtained by allowing for (ii) *nonlinear feedback* that alters the conductivities in the diffusion equation (112), or in the corresponding spatio-temporal extension, based on local image measurements, or by considering (iii) recurrent feedback from higher levels that influences the gain control of the feature detectors. With these extensions, the resulting model corresponds to an integration of hierarchical and recurrent models as advocated by Martinez and Alonso (2003).

In contrast to the previous treatment of linear receptive field models, which were determined by necessity from first principles, it should be emphasized that the structure of the quasi-quadrature model is not at all determined by necessity. Instead, it is presented as one possible nonlinear extension that reproduces some of the qualitative properties of complex cells.

## 7 Foveated vision

*fovea* in such a way that the spatial extent of the receptive fields *increases essentially linearly with eccentricity* (Koenderink and Doorn 1978) (see Fig. 35).

*limited processing capacity* that is to be distributed over receptive fields at different scales. If we assume that the idealized vision system has a *focus-of-attention* mechanism that allows it to simulate translation invariance by changing the viewing direction, then based on the argument of scale invariance, it is natural to distribute the limited processing capacity in such a way that a *similar amount of processing capacity* is available *for all scales* within some scale range \([s_{\min }, s_{\max }]\). In other words, the vision system should have the same number of receptive fields at all scales within some finite scale range (see Fig. 36).

Given these assumptions, it follows that *the minimum receptive field size will increase linearly with the distance from the fovea*, a distribution that is compatible with neurophysiological and psychophysical findings (Lindeberg and Florack 1992). Given such a spatially varying resolution limit, internal representations at coarser scales can then be constructed from these image measurements based on the semigroup property or the diffusion equation. Specifically, with a *log-polar retinotopic mapping*, the diffusion equation that governs the evolution properties over scale can equivalently be expressed on a log-polar domain (Lindeberg and Florack 1994). In all other respects, the receptive field profiles will be similar to those for a translationally invariant spatial domain.

This foveal scale-space model has been used for computing scale-invariant image descriptors for object recognition by Kokkinos and Yuille (2008). A closely related model for foveal vision in terms of an inverted pyramid has been proposed by Crowley and his co-workers (1994) with close relations to the spotlight model for visual attention by Tsotsos (1995).

A notable property of the receptive field measurements taken in the retina as shown in Fig. 35 is that the receptive field sizes are clustered along linear functions, whereas the foveal scale-space model in Fig. 36 is based on the assumption that all receptive field sizes above a linearly increasing minimum receptive field size should be present. Given the semigroup property (8), it follows, however, that receptive fields at scales coarser than those displayed in Fig. 35 can be constructed by combining receptive fields at finer scales. The distribution in Fig. 35 would therefore correspond to a sampling of the *outer layer* of the inverted cone of receptive field sizes in the foveal scale-space model shown in Fig. 36. Receptive fields in the interior of this cone can therefore be constructed from linear combinations of receptive field responses in the outer layer.
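The construction of coarser-scale receptive fields from finer-scale ones rests on the semigroup property of the Gaussian kernel, by which smoothing at scale \(s_1\) followed by smoothing at scale \(s_2\) is equivalent to smoothing at scale \(s_1 + s_2\). A minimal numerical check for a one-dimensional translationally invariant domain is sketched below; the sampled kernels, the truncation radii, and the chosen scale values are illustrative assumptions.

```python
import numpy as np


def sampled_gaussian(s, radius):
    """Gaussian with variance s sampled at the integers -radius..radius."""
    x = np.arange(-radius, radius + 1, dtype=float)
    return np.exp(-x**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)


s1, s2 = 4.0, 9.0
r1, r2 = 20, 30  # truncation radii, about ten standard deviations each
cascade = np.convolve(sampled_gaussian(s1, r1), sampled_gaussian(s2, r2))
direct = sampled_gaussian(s1 + s2, r1 + r2)
semigroup_error = float(np.max(np.abs(cascade - direct)))
```

The cascade of the two finer-scale kernels agrees with the directly sampled coarser-scale kernel up to numerical precision, which is the mechanism by which the interior of the cone can be reached from its outer layer.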

An interesting question concerns whether the existence of coarser-scale receptive fields corresponding to the interior of this cone could be established by cell recording of linear receptive fields in the LGN or in V1. An alternative possibility could be to investigate whether receptive fields corresponding to the outer layer of this cone could be directly combined into nonlinear receptive fields corresponding to the interior of this cone, without representing the intermediate linear receptive fields explicitly in terms of simple cells. Such investigations could then answer whether and how shift invariance is explicitly represented at the earliest levels of linear receptive fields or at higher nonlinear levels in the visual hierarchy.

## 8 Extensions

With regard to camera geometry, we have throughout based the analysis on a planar perspective projection model with a flat image plane. This choice has been made to simplify the mathematical treatment, since the translational group properties and the diffusion equations are much easier to express for a flat image geometry. To model biological vision more accurately, it would, however, be more appropriate to express a corresponding model based on a *spherical camera geometry* with a spherical image surface, which will lead to a scale-space concept based on diffusion equations on a sphere. Such a model would also have attractive theoretical properties in the sense that geometric distortions toward the periphery, such as vignetting, will disappear, and certain properties of global motion fields will become simpler. From such a background, the present model can be regarded as a *local linearization* applied in the tangent plane of the spherical camera model at the center of the visual sensor.

*self-similar intensity transformation* \(I' = I^{\gamma }\) with \(\gamma < 1\), then the result of applying a logarithmic transformation to this output

## 9 Relations to previous work

### 9.1 Biological vision

The notion of receptive field was originally defined by Sherrington (1906) to describe the somatosensory area of a body surface where a stimulus could cause a reflex. Hartline (1938) extended this notion to light stimuli and defined a *visual receptive field* as the area of the retina that must receive illumination in order to cause a discharge in a particular ganglion cell or nerve fiber. Kuffler (1953) studied the substructure of retinal receptive fields and found that they are concentric with specific “on” or “off” zones. He also coined the term “on–off” receptive fields. The Nobel laureates Hubel and Wiesel (1959, 1962, 2005) investigated and characterized the response properties of cells in the primary visual cortex (V1), discovered their orientation tuning, and proposed a taxonomy in terms of simple or complex cells based on how the cells respond to the polarity of visual stimuli. In the first wave of studies, specific stimuli such as points, bars, or sine wave gratings were used for probing the visual cells.

Later, a new methodology for receptive field mappings was developed based on white noise stimuli, which allow for a complete characterization of the response properties of visual neurons if they can be assumed to be linear. Based on this technique, DeAngelis et al. (1995) were able to derive more detailed maps of receptive fields, including their response properties in the *joint* space–time domain; see DeAngelis and Anzai (2004) for a comprehensive overview of these developments. Conway and Livingstone (2006) performed a corresponding investigation of spatio-chromatic and spatio-chrom-temporal response properties of receptive fields in the macaque monkey. Ringach et al. (2002) showed how receptive field profiles of neurons can be derived using natural image sequences as stimuli. Felsen et al. (2005) have presented comparisons between response properties of neurons to natural image features versus noise stimuli and found that in the responses of complex cells, but not of simple cells, the sensitivity is markedly higher for natural image data than for random stimuli.

Adelson and Bergen (1985) developed a spatio-temporal energy model for motion perception based on oriented filters in the space–time domain. The quasi-quadrature approach in (118) and (119) in combination with a multi-parameter scale space can be seen as an analogue and extension of such a representation within the Gaussian derivative framework. More recently, Young et al. (2001) showed how spatio-temporal receptive fields can be modelled by Gaussian derivatives over a spatio-temporal domain, corresponding to the Gaussian spatio-temporal concept described here, although with a different type of parameterization.

The scale-space models described in this article and our earlier work (Lindeberg 1997, 2001, 2011) *unify* these treatments into a joint framework and do also comprise new extensions in the following ways: (i) a *new continuous time-causal scale-space model* that respects forbidden access to future information, (ii) a *time recursive* update mechanism based on a limited temporal buffer, (iii) a *better parameterization* of the spatio-temporal filters with respect to image velocities and image deformations, and (iv) *necessity results* showing how these scale-space models can be uniquely determined from a small set of structural assumptions regarding an idealized vision system.

It should be emphasized, however, that the theoretical necessity results presented in this paper concern *linear* receptive fields. Characterizing nonlinear receptive fields is a much more complex issue; see Ringach (2004) for an overview of different approaches for mapping receptive fields. Nonlinear gain control mechanisms in the retina have been modelled and related to biological cell recordings by Schwartz et al. (2002). Nonlinear receptive fields in V1 have been investigated and modelled in more detail by Mechler and Ringach (2002), Touryan et al. (2002), Priebe et al. (2004), and Rust et al. (2005). During recent years, there has been some questioning of whether the taxonomy by Hubel and Wiesel into simple and complex cells corresponds to distinct classes or whether V1 cells have response properties along a continuum (Mechler and Ringach 2002). Bardy et al. (2006) have shown that the response properties of some classes of complex cells can be converted to putative simple cells depending on influences originating from the classical receptive field. The experimental results can, however, be strongly dependent on the experimental conditions (Kagan et al. 2002; Mata and Ringach 2005; Chen et al. 2002), and bimodal distributions have been found by Kagan et al. (2002), Ibbotson et al. (2005), and Chen et al. (2002). Moreover, Martinez and Alonso (2003) argue that a large body of neurophysiological evidence indicates that simple cells are a separate population from the total of cortical cells in cat visual cortex. In relation to the classification of complex cells, Kagan et al. (2002) have suggested that distinctions in the classification of complex cells should be made on whether the cells are dominated by magnocellular or parvocellular input. Martinez and Alonso (2003) have suggested that complex cells should be divided into first-order complex cells that receive direct input from the LGN and second-order complex cells that receive input from simple cells. More recently, Williams and Shapley (2007) have found spatial phase-sensitive detectors in V1 that respond to contrast boundaries of one sign but not the opposite. Our knowledge about nonlinear cells in area V1 is therefore far from complete (Olshausen and Field 2004; Carandini et al. 2005).

The notion of a *logarithmic brightness scale* goes back to the Greek astronomer Hipparchus, who constructed a subjective scale for the brightness of stars in six steps labelled “1 ... 6,” where the brightest stars were said to be of the first magnitude (\(m = 1\)) while the faintest stars near the limits of human perception were of the sixth magnitude. Later, when quantitative physical measurements of the intensities of different stars became possible, it was noted that Hipparchus' subjective scale did indeed correspond to a logarithmic scale. In astronomy today, the *apparent brightness* of stars is still measured on a logarithmic scale, although extended over a much wider span of intensity values. A logarithmic transformation of image intensities is also used in the retinex theory (Land 1974, 1986).

The *Weber-Fechner law* attempts to describe the relationship between the physical magnitude and the perceived intensity of stimuli. This law states that the ratio of an increment threshold \(\Delta I\) for a just noticeable difference in relation to the background intensity \(I\) is constant over large ranges of magnitude variations (Palmer 1999, pages 671–672)

For a strictly positive entity \(z\), there are also information theoretic arguments to regard \(\log z\) as a default parameterization (Jaynes 1968). This property is essentially related to the fact that the ratio \(dz/z\) then becomes a dimensionless integration measure. Care should, however, be taken when using such reasoning based on dimensionality arguments, since important phenomena could be missed, e.g., in the presence of hidden variables. The physical modelling of the effect of illumination variations on receptive field measurements in Sect. 2.3 provides a formal justification for using a logarithmic brightness scale in this context as well as an additional contribution of showing how the receptive field measurements can be related to inherent physical properties of object surfaces in the environment.
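The practical consequence of a logarithmic brightness scale for derivative-based receptive fields can be shown with a small sketch. A multiplicative intensity change becomes an additive constant after the logarithm and is therefore cancelled by differentiation. The example below uses first-order differences of a random positive signal as a stand-in for receptive field responses; the signal, the illumination factor, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
intensity = rng.uniform(0.1, 10.0, size=256)  # strictly positive signal
scaled = 5.0 * intensity                      # multiplicative illumination change

# First-order differences stand in for derivative-based receptive fields.
response_linear = np.diff(intensity)
response_linear_scaled = np.diff(scaled)
response_log = np.diff(np.log(intensity))
response_log_scaled = np.diff(np.log(scaled))

log_is_invariant = bool(np.allclose(response_log, response_log_scaled))
linear_is_invariant = bool(np.allclose(response_linear, response_linear_scaled))
```

The responses computed from the logarithmic intensities are unaffected by the illumination factor, whereas the responses computed from the raw intensities are scaled by it.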

### 9.2 Computer vision

In the area of computer vision, multi-scale representations were first constructed by repeated smoothing and subsampling, leading to the notion of *pyramids* (Burt 1981; Crowley 1981; Burt and Adelson 1983; Crowley and Stern 1984; Crowley and Parker 1984; Crowley and Sanderson 1987).

Concerning the development of *scale-space theory*, Witkin (1983) proposed to treat scale as a continuous parameter and noted that Gaussian convolution leads to a decreasing number of zero-crossings or local extrema for a one-dimensional signal. The first necessity results in the Western literature concerning the uniqueness of the Gaussian kernel for generating a linear scale-space representation were derived by Koenderink (1984) based on the assumption of *causality*, which means that iso-surfaces in scale space should point with their convex side toward coarser scales. Related uniqueness results were presented by Babaud et al. (1986) and by Yuille and Poggio (1986).

Lindeberg (1990) showed how a reformulation of Koenderink’s causality requirement in terms of *non-enhancement of local extrema* in combination with the requirement of a semigroup structure could be used for deriving a scale-space theory for discrete signals. Corresponding necessity results concerning scale-space representations of continuous image data were then presented in Lindeberg (1996). A cascade property was also used in the construction of binomial pyramids by Crowley (1981) and Crowley and Stern (1984).

Florack and Haar Romeny (1992) proposed the use of *scale invariance* as a basic scale-space axiom, and Pauwels et al. (1995) showed that in combination with a semigroup structure, there exists a more general one-parameter family of (weak) scale-space kernels that obey these axioms, including the Poisson scale space studied by Felsberg and Sommer (2004). Duits et al. (2004) have investigated the properties of these scale spaces in detail and showed that the so-called \(\alpha \)-scale spaces can be modelled by pseudo-partial differential equations. Except for the Gaussian scale space contained in this class, these *self-similar scale spaces* do, however, not obey non-enhancement of local extrema.

Closely related axiomatic derivations of image processing operators based on scale invariance have also been given in the earlier Japanese literature (Iijima 1962; Weickert et al. 1999). Koenderink and Doorn (1992) showed that Gaussian derivative operators are natural operators to derive from a scale-space representation, given the assumption of scale invariance.

The connections between the strong regularizing properties of Gaussian convolution and Schwartz distribution theory have been pointed out by Florack et al. (1992).

Generalizations of rotationally symmetric smoothing operations to the *affine Gaussian scale-space* concept were introduced in Lindeberg (1994b) and applied in Lindeberg and Gårding (1997) for the computation of affine invariant image descriptors. Specifically, a mechanism of *affine shape adaptation* was proposed for reaching affine covariant interest points in affine scale space, and it was shown that the computation of such affine-adapted image measurements improved the accuracy of later-stage processes in situations when there are significant perspective image deformations outside the similarity group. Baumberg (2000) and Schaffalitzky and Zisserman (2001) furthered this approach to wide baseline image matching. Mikolajczyk and Schmid (2004) proposed a more efficient algorithm and quantified its performance experimentally. Tuytelaars and Gool (2004) performed corresponding matching of widely separated views with application to object modelling. Related investigations of elongated directional filters over the spatial domain have been presented by Freeman and Adelson (1991), Simoncelli et al. (1992), and Perona (1992).

Scale-space representations of *color information* have been developed by Geusebroek et al. (2001) based on a Gaussian color model proposed by Koenderink, from which a set of differential color invariants were defined and by Hall et al. (2000) who computed first-order partial derivatives of color-opponent channels and demonstrated the applicability of such features for object recognition. Linde and Lindeberg (2004, 2012) extended this idea by showing that highly discriminative image descriptors for object recognition can be obtained from spatio-chromatic derivatives and differential invariants up to order two. More recently, Sande et al. (2010) have presented an evaluation of different color-based image descriptors for recognition.

Concerning *temporal scale spaces*, Koenderink (1988) proposed the first scale-space concept that respects temporal causality, based on a logarithmic transformation of the time axis with the present moment as the origin. Such temporal smoothing filters have been considered in follow-up works by Florack (1997) and ter Haar Romeny et al. (2001). These approaches, however, appear to require infinite memory of the past and have so far not been developed for computational applications.

To handle time causality in a manner more suitable for real-time implementation, Lindeberg and Fagerström (1996) expressed a strictly time-recursive space–time separable spatio-temporal scale-space model based on the cascades of temporal scale-space kernels in terms of either truncated exponential functions or first-order recursive filters, based on a characterization of one-dimensional scale-space filters that guarantee non-creation of local extrema with increasing scale (Lindeberg 1990). These scale spaces were also *time recursive* in the sense that no extensive memory of the past was needed. Instead, a compact temporal buffer allowed for efficient computation of the temporal smoothing operation and temporal derivatives directly from a set of internal representations at different temporal scales. A closely related time-recursive computation of temporal derivatives has been used by Fleet and Langley (1995).
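The time-recursive character of such a model can be sketched with a cascade of first-order recursive filters. The update rule used below, \(y(t) = y(t-1) + \tfrac{1}{1 + \mu } (x(t) - y(t-1))\), is an assumed discrete form of a first-order integrator with time constant \(\mu \), chosen here for illustration; the function name and the time constants are hypothetical.

```python
import numpy as np


def recursive_cascade(signal, mus):
    """Pass a signal through first-order recursive filters coupled in cascade.

    Assumed update per stage: y[t] = y[t-1] + (x[t] - y[t-1]) / (1 + mu).
    Only one previous output value per stage has to be stored.
    """
    out = np.asarray(signal, dtype=float)
    for mu in mus:
        filtered = np.empty_like(out)
        state = 0.0  # the whole temporal buffer of this stage
        for i, value in enumerate(out):
            state += (value - state) / (1.0 + mu)
            filtered[i] = state
        out = filtered
    return out


mus = [1.0, 2.0, 4.0]
impulse = np.zeros(400)
impulse[0] = 1.0
h = recursive_cascade(impulse, mus)
n = np.arange(h.size)
total_mass = float(h.sum())
temporal_mean = float((n * h).sum())
```

The impulse response `h` is nonnegative, vanishes before the impulse, sums to one, and has a temporal mean equal to the sum of the time constants. Each stage keeps a single state value, which illustrates how a compact temporal buffer can replace an extensive memory of the past.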

Lindeberg (1997) proposed a non-separable *spatio-temporal scale-space* concept comprising the notion of *velocity-adapted derivatives* for a continuous model based on a Gaussian spatio-temporal scale-space and for a semi-discrete time-causal model; see also Lindeberg (2001) for a more detailed description of the corresponding spatio-temporal scale-space theory. Velocity adaptation was applied to optic flow estimation by Nagel and Gehrke (1998) and was shown to improve the accuracy in optic flow estimates in a similar manner as affine shape adaptation improves the accuracy of image descriptors under perspective image deformations outside the similarity group. A closely related approach for optic flow computation with corresponding deformation of the image filters was developed by Florack et al. (1998). An extension of non-separable spatio-temporal fields into time-causal velocity-adapted recursive filters was given in (Lindeberg 2002).

Laptev and Lindeberg (2004b) investigated the use of families of velocity-adapted filters for computing *Galilean invariant image descriptors*. Given an ensemble of spatio-temporal scale-space filters with different orientations in the space–time domain in a manner similar to Adelson and Bergen (1985), simultaneous adaptation to spatial scales, temporal scales, and image velocities was performed by a multi-parameter scale selection mechanism over these parameters. Specifically, it was shown that the use of velocity-adapted filters improved the separability between classes of spatio-temporal actions in situations when there are unknown relative motions between the objects and the observer. Generalizations of this approach to the context of Galilean invariant interest points were then presented in Lindeberg (2004) with an integrated Galilean invariant spatio-temporal recognition scheme in (Laptev et al. 2007).

Fagerström (2005) investigated self-similar temporal scale-space concepts derived from the assumptions of a semigroup structure combined with scale invariance, with an extension to the spatio-temporal domain in Fagerström (2007) that also comprises the notion of velocity-adapted filters. Lindeberg (2011) gives a unified treatment of the scale-space axiomatics of linear, affine, and spatio-temporal scale space for continuous images based on the assumption of non-enhancement of local extrema over spatial and spatio-temporal domains, including more explicit statements of the uniqueness results regarding the Gaussian spatio-temporal scale space earlier outlined in Lindeberg (2001) and the application of non-enhancement of local extrema to a continuous time-causal and time-recursive spatio-temporal scale space.

## 10 Summary and conclusions

Neurophysiological recordings have shown that mammalian vision has developed receptive fields that are tuned to different sizes and orientations in the image domain as well as to different image velocities in space–time. A main message of this article has been to show that it is possible to derive such families of receptive field profiles *by necessity*, given a set of structural requirements on the first stages of visual processing as formalized into the notion of an *idealized vision system*. These structural requirements reflect *structural properties of the world* in terms of scale covariance, affine covariance, and Galilean covariance, which are natural to *adapt to* for a vision system that is to *interact with the surrounding world* in a successful manner. In a competition between different organisms, adaptation to these properties may constitute an *evolutionary advantage*.

The presented theoretical model provides a *normative theory* for deriving *functional models of linear receptive fields* based on Gaussian derivatives and closely related operators. In addition, a set of plausible mechanisms has been presented for how nonlinear receptive fields can be constructed from this theory, based on a generalized energy model. Specifically, the proposed theory can *explain* the different shapes of receptive field profiles that are found in biological vision from a requirement that the visual system should be able to compute covariant receptive field responses under the natural types of image transformations that occur in the environment, to enable the computation of invariant representations for perception at higher levels.

The proposed receptive field model has been related to Gabor functions, and we have presented several theoretical arguments for preferring a Gaussian derivative model or equivalently a formulation in terms of *diffusion equations*, with the shapes of the receptive fields parameterized by a spatial covariance matrix \(\varSigma \), an image velocity \(v\) and a temporal scale parameter \(\tau \), where the spatial covariance matrix \(\varSigma \) can also encompass the spatial scale parameter \(s\) depending on the choice of parameterization.
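As an illustration of the Gaussian derivative model summarized above, the following minimal Python sketch evaluates a two-dimensional affine Gaussian kernel \(g(x;\; \varSigma )\) and its first-order derivative in the \(x_1\)-direction. The function names and the nested-tuple representation of \(\varSigma \) are hypothetical choices made for this illustration, and the closed-form expression is the standard Gaussian density, not a restatement of a numbered equation in this article. The image velocity \(v\) and the temporal scale \(\tau \) are left out.

```python
import math

def affine_gaussian(x1, x2, cov):
    """2-D Gaussian g(x; Sigma) with cov = ((c11, c12), (c12, c22))."""
    (c11, c12), (_, c22) = cov
    det = c11 * c22 - c12 * c12
    # Quadratic form x^T Sigma^{-1} x via the closed-form 2x2 inverse.
    q = (c22 * x1 * x1 - 2.0 * c12 * x1 * x2 + c11 * x2 * x2) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def affine_gaussian_dx1(x1, x2, cov):
    """Derivative of g with respect to x1, equal to -(Sigma^{-1} x)_1 * g."""
    (c11, c12), (_, c22) = cov
    det = c11 * c22 - c12 * c12
    inv_x_1 = (c22 * x1 - c12 * x2) / det
    return -inv_x_1 * affine_gaussian(x1, x2, cov)

# Illustrative (assumed) covariance matrix of an elongated receptive field.
SIGMA = ((4.0, 1.0), (1.0, 2.0))

# Compare the analytic derivative with a central finite difference.
h = 1e-5
numeric = (affine_gaussian(0.5 + h, -0.3, SIGMA)
           - affine_gaussian(0.5 - h, -0.3, SIGMA)) / (2.0 * h)
analytic = affine_gaussian_dx1(0.5, -0.3, SIGMA)
print(abs(numeric - analytic))
```

Higher-order derivative kernels can be obtained in the same manner by repeated differentiation. The off-diagonal element of \(\varSigma \) controls the orientation of the elongated kernel.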

In the most idealized version of the theory, one can see the covariance matrix \(\varSigma \) in the diffusion equation and the image velocity \(v\) as locally constant within the support region of each receptive field, corresponding to a pure feed-forward model. More generally, one can consider covariance matrices and image velocities that are locally adapted to the local image structures, leading to richer families of pseudo-linear or nonlinear scale spaces, corresponding to top-down or feedback mechanisms in biological vision.

When the image data undergo natural image transformations due to variations in viewing distance, viewing direction, relative motion between the object and the observer or illumination variations, we can linearize the possibly nonlinear image transformations locally by derivatives (Jacobians), from which transformation properties in terms of the filter parameters (scale parameters, covariance matrices, and image velocities) of the receptive fields can be derived, provided that the family of receptive fields is closed under the relevant group or subgroup of image transformations in the tangent space, leading to an algebra of transformation properties of receptive fields. In this article, we have presented a coherent and unified framework for handling such locally linearized image transformations in terms of local scaling transformations, local affine transformations, local Galilean transformations, and local multiplicative intensity transformations, such that the influence of these image transformations on the receptive field responses can be well understood. More generally, the formulation of image primitives in terms of receptive field responses that are expressed in terms of scale-space derivatives makes it possible to use tools from differential geometry for deriving relationships between image features and physical properties of objects or events in the environment, thus allowing for computationally operational and theoretically well-founded modelling of possibilities or constraints for visual perception.
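To make the closedness of the receptive field family under local affine transformations concrete, the following Python sketch checks numerically that an affine Gaussian kernel with covariance matrix \(\varSigma \) is mapped onto another affine Gaussian kernel under \(x' = A x\). It assumes the standard transformation rule \(\varSigma ' = A \varSigma A^T\), which is not restated in this summary, and all function names and numerical values are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def transpose(a):
    """Transpose of a 2x2 matrix."""
    return ((a[0][0], a[1][0]), (a[0][1], a[1][1]))

def det(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def gaussian(x, cov):
    """Affine Gaussian kernel g(x; Sigma) for a symmetric positive definite cov."""
    d = det(cov)
    q = (cov[1][1] * x[0] ** 2 - 2.0 * cov[0][1] * x[0] * x[1]
         + cov[0][0] * x[1] ** 2) / d
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(d))

def transform_point(a, x):
    """Image point after the affine map x' = A x."""
    return (a[0][0] * x[0] + a[0][1] * x[1], a[1][0] * x[0] + a[1][1] * x[1])

def transform_cov(a, cov):
    """Covariance matrix after the affine map: Sigma' = A Sigma A^T (assumed rule)."""
    return matmul(matmul(a, cov), transpose(a))

# Illustrative (assumed) local affine transformation and covariance matrix.
A = ((1.5, 0.4), (-0.2, 0.8))
SIGMA = ((2.0, 0.5), (0.5, 1.0))
x = (0.7, -1.1)

# g(A x; A Sigma A^T) |det A| should equal g(x; Sigma).
lhs = gaussian(transform_point(A, x), transform_cov(A, SIGMA)) * abs(det(A))
rhs = gaussian(x, SIGMA)
print(abs(lhs - rhs))
```

The factor \(|\det A|\) compensates for the change of area element, so that the transformed kernel remains normalized to unit integral.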

We have also related the proposed approach to approaches for learning receptive field profiles from *natural image statistics* and argued that the presented model in such a context provides a normative theory for the solutions that an idealized learning system may reach if exposed to a sufficiently large and representative set of natural image data. The presented theory can therefore be used for explaining why such learning approaches lead to qualitatively similar types of receptive fields.

Several of the theoretically derived receptive field profiles presented in this article have been successfully used in a large number of computer vision applications regarding feature detection, feature classification, stereo matching, motion estimation, shape analysis, and image-based recognition. Hence, these receptive field profiles can generally serve as a basis for expressing a *large number of visual operations* and have empirically been shown to lead to robust algorithms. In this respect, a vision system based on these receptive field families allows for *sharing* of early visual modules between different higher level vision functionalities, which for a biological vision system can be motivated by efficiency of resource utilization.

The linear receptive fields obtained from this theory have been compared to receptive fields found by cell recordings in the LGN and simple cells in V1.

The proposed nonlinear quasi-quadrature model has also been related to qualitatively similar properties observed for complex cells in V1.

A striking conclusion from the comparisons in Sect. 6 is that the receptive field profiles derived by the *axiomatic theory* in Sects. 3–5 are in *very good qualitative agreement* with receptive field profiles recorded in *biological vision*. Thus, we have a very good match between consequences of the theory and experimental data.

Furthermore, this indicates that the earliest receptive fields in higher mammal vision have reached a state that is very close to *ideal* in view of the stated structural requirements or symmetry properties. In this sense, biological vision can be seen as having adapted very well to the transformation properties of the surrounding world and the transformations that occur when a three-dimensional world is projected onto a two-dimensional image domain.

### 10.1 Applications to biological vision

The presented theory provides a *theoretically well-founded computational model* for early receptive fields. We propose that this theory could be used as a powerful and general tool for modelling biological vision, at least in the following ways:

- The Gaussian and the time-causal receptive field families with their spatial and spatio-temporal derivative operators applied to luminance and color-opponent channels can be used for generating wider and more general families of receptive field profiles beyond those explicitly shown in the figures in this article. The idealized model for simple cells (116) comprises receptive fields of different orders of spatial and temporal differentiations, where a subset of combinations of spatial and spatio-temporal derivative operators has been demonstrated to lead to receptive field profiles in good qualitative agreement with receptive field profiles measured by cell recordings in biological vision. An interesting question concerns whether the existence of linear receptive fields corresponding to other combinations of spatial and spatio-temporal derivatives can be demonstrated, in particular when the receptive fields are measured as functions over two spatial dimensions and one temporal dimension and concerning the existence of receptive fields corresponding to higher orders of derivatives. Concerning spatio-chromatic and spatio-chromo-temporal receptive fields, the models for double-opponent receptive fields (110) and (117) are both based on rotationally symmetric Laplacians of Gaussians (alternatively differences of Gaussians) concerning the spatial dependencies. Another interesting question concerns whether biological vision implements non-symmetric spatio-chromatic receptive fields corresponding to, e.g., directional or partial derivatives of color-opponent channels as shown in Fig. 9, and whether or not tighter couplings could be established between the chromatic and temporal dimensions. Answering these questions would provide cues to what types of image structure the visual system explicitly responds to and therefore possibilities as well as limitations for perception. Hence, this theory may be used for generating *predictions* about new hitherto unnoticed or unreported receptive fields and for explaining their properties in terms of differential geometric measurements. This theory can also be used for raising questions about which animals have early receptive fields with properties compatible with general purpose visual operations according to the notion of an idealized visual front end.

- Concerning orientation maps and population coding over image orientations and image velocities, the notion of multi-parameter receptive field families over different spatial covariance matrices \(\varSigma \), image velocities \(v\), and temporal scales \(\tau \) raises questions of how the receptive fields in V1 are distributed over different orientations and directional tunings. Since receptive fields have been found with different degrees of spatial eccentricities, corresponding to different scale parameters in different directions, this raises questions of whether the distribution over different degrees of spatial elongation is such that it could be explained by a geometric model over spatial covariance matrices \(\varSigma _i\) corresponding to structural properties of the environment. More generally and as we have previously discussed in Sect. 6.6, given that we have a population of nonlinear receptive fields that are tuned to different spatial orientations and motion directions and that respond according to an energy model, an interesting question concerns how to combine the responses of a set of such nonlinear receptive fields that respond at different spatial locations and are tuned to different orientations and motion directions. Could a sufficient amount of cell recordings be gathered to answer the question of how this information should be combined from a population of such nonlinear detectors, e.g., for setting the relative weights for divisive normalization or by changing the conductivities in the diffusion equations that determine the properties of the underlying receptive fields?

- In connection with the foveal scale-space model in Sect. 7 and the dominance of receptive fields with a linearly increasing receptive field size as function of eccentricity found by cell recordings of retinal ganglion cells, it would also, as discussed at the end of Sect. 7, be interesting to know whether and where the existence of coarser-scale receptive fields corresponding to the interior of the inverted cone in Fig. 36 could be established. In these and other ways, the presented mathematical framework for receptive fields could be used for expressing and raising questions about computational mechanisms.

- The theoretical covariance properties of the associated scale-space concepts allow for *explicit handling of invariance properties* with respect to scale variations, image deformations, and relative motions. In computational models, such as neural networks, explicit incorporation of such transformation properties may be used for *bypassing* the need for an explicit *training stage* to learn corresponding invariance properties. From a biological standpoint, it appears natural that biological organisms should develop the possibility of having these transformations hard-wired or soft-wired (the latter notion meaning that a set of initial connections is trimmed after birth), since these transformations are universal. In terms of receptive fields, these transformations will then correspond to certain parameter ranges of the scale parameters, determined by the statistics of natural images. This theory may therefore be more generally used for reducing or bypassing the need for explicitly learning the spatial, spatio-chromatic, and spatio-temporal response properties of early receptive fields in computational models of visual perception. In this respect, the presented theory could allow for *lower needs for training data* and a lower amount of *computational resources* in the training stage of computational vision models, by faster formation of receptive fields given a hard-wired or soft-wired architecture. The theory may also imply higher *robustness* of early receptive fields in computational models and require *less variability* in the training data.

- With regard to a possible biological implementation of this theory, the evolution properties of the presented scale-space models are governed by *diffusion equations*, which can be implemented by *operations over neighborhoods*. Hence, the computations can naturally be implemented in terms of *connections between different cells*. Diffusion equations are also used in mean field theory for approximating the computations that are performed by populations of neurons (Omurtag et al. 2000; Mattia and Guidice 2002; Faugeras et al. 2009). The generalized semigroup property (8) with the corresponding cascade property (9), possibly expressed for a multi-parameter scale space, and the diffusion equations in terms of infinitesimal generators (13) and (14) describe how receptive fields corresponding to different possibly multi-dimensional scale parameters can be *related* and hence how receptive fields at coarser scales can be computed from receptive fields at finer scales. In a neural network implementation, these relations can hence be used for *setting the weights* for communications between different cells. This theory also provides a framework for modelling and explaining the temporal dynamics of neural computations between cells at different levels of processing. In this respect, the theory naturally leads to a *hierarchical architecture* with explicit expressions for how receptive fields in the fovea can constitute the basis for receptive fields in the LGN and these in turn can be used for defining receptive fields in V1 and later stages in the visual cortex.
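The statement that receptive fields at coarser scales can be computed from receptive fields at finer scales can be illustrated numerically. The following Python sketch convolves two sampled one-dimensional Gaussian kernels with variances \(s_1\) and \(s_2\) and compares the result with a sampled Gaussian of variance \(s_1 + s_2\). The sampled Gaussian is used here only as a convenient approximation for which the semigroup property holds up to a small discretization error. The function names, the kernel radius, and the scale values are assumptions made for the illustration.

```python
import math

def gauss_kernel(s, radius):
    """Sampled 1-D Gaussian with variance s on the integers -radius..radius."""
    return [math.exp(-n * n / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)
            for n in range(-radius, radius + 1)]

def convolve(a, b):
    """Full discrete convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Illustrative (assumed) scale levels, expressed as variances.
s1, s2, radius = 2.0, 3.0, 30

# Cascade: smoothing with s1 followed by smoothing with s2 ...
cascade = convolve(gauss_kernel(s1, radius), gauss_kernel(s2, radius))
# ... compared with direct smoothing at the coarser scale s1 + s2.
direct = gauss_kernel(s1 + s2, 2 * radius)

max_err = max(abs(c - d) for c, d in zip(cascade, direct))
print(max_err)
```

In a layered implementation, this means that a coarser-scale representation does not need access to the original signal. It can be computed from a finer-scale representation by an incremental smoothing step whose weights are given by the kernel at the scale difference.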

The model considered here is an *idealized theoretical model* for the types of computations that are natural to perform at the *earliest stages of visual processing* given theoretical properties of the structure of the surrounding world, which are then expressed as fundamental assumptions about the functionality of the vision system. If the model should be regarded as *biomimetic*, that would then be in a weaker sense of performing similar types of functions.

In this way, specific properties of specific organisms are suppressed (and not considered here because of reasons of scope). The approach is therefore more related to approaches in *theoretical physics*, where symmetry properties of the world are used as fundamentals in the formulation of physical theories. In the area of scale-space theory, these structural assumptions are referred to as *scale-space axioms*.

Concerning notation, we will throughout use a notation similar to physics or mathematics, with scalars and vectors represented by lower case letters, \(a \in {\mathbb R}\) and \(x \in {\mathbb R}^2\), (without explicit notational overhead for vectors) and matrices represented by upper case letters, \(A\) or \(\varSigma \). Operators that act on functions will be represented by calligraphic symbols, \(\mathcal{T}\) and \(\mathcal{A}\), and we use either lower case or upper case letters for functions, \(f\) and \(L\). The overall convention is that the meaning of a symbol is defined the first time it is used.

In Eq. (1), the symbol “\(\cdot \)” at the position of the first argument of \(L\) is a placeholder to emphasize that in this relation, \(L\) is regarded as a function and not evaluated with respect to its first argument \(x\). The following semi-colon emphasizes the different natures of the image coordinates \(x\) and the filter parameters \(s\).

More precisely, we will assume that linearity should hold for some transformation \(f = z(I)\) of the original luminosity values \(I\) in units of local energy measurements. In Sect. 2.3 it will be shown that a logarithmic intensity mapping \(f \sim \log I\) is particularly attractive in this respect by allowing for invariance of receptive field responses under local multiplicative intensity transformations.
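The invariance obtained from a logarithmic intensity mapping can be verified with a few lines of code. In the following Python sketch, a one-dimensional intensity signal is multiplied by a constant factor, and central differences of \(f = \log I\) are compared before and after the multiplication. The test signal, the factor 3.7, and the function names are hypothetical, and plain finite differences stand in for the Gaussian derivative operators of the theory.

```python
import math

def central_differences(values):
    """Central differences at the interior samples of a 1-D sequence."""
    return [(values[i + 1] - values[i - 1]) / 2.0
            for i in range(1, len(values) - 1)]

def log_derivative(intensity):
    """Derivative approximation applied to f = log(I)."""
    return central_differences([math.log(v) for v in intensity])

# Illustrative (assumed) positive intensity signal and illumination factor.
signal = [10.0 + 5.0 * math.sin(0.3 * n) for n in range(50)]
brighter = [3.7 * v for v in signal]

# On a logarithmic scale, the multiplicative factor becomes an additive
# constant, which is cancelled by differentiation.
log_diff = max(abs(a - b) for a, b in
               zip(log_derivative(signal), log_derivative(brighter)))

# On the original intensity scale, the derivative responses scale by 3.7.
raw_diff = max(abs(a - b) for a, b in
               zip(central_differences(signal), central_differences(brighter)))
print(log_diff, raw_diff)
```

The first printed value is at the level of rounding errors, whereas the second one is of the same order as the derivative responses themselves.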

For us humans and other higher mammals, the retina is obviously not translationally invariant. Instead, finer scale receptive fields are concentrated to the fovea in such a way that the minimum receptive field size increases essentially linearly with eccentricity (see Sect. 7). With respect to such a sensor space, the assumption about translational invariance should be taken as an idealized model for the region in space where there are receptive fields above a certain size.

The symbol “\(\cdot \)” used as placeholder for the first argument of \(T\) and the argument of \(f\) in Eq. (4) indicates that the convolution operation “\(*\)” is performed over the corresponding variable.

With \(s = (s_1, \ldots , s_N)\) representing a multi-dimensional scale parameter \(s \in {\mathbb R}_+^N\), Eq. (7) should be interpreted as \(\lim _{|s| \downarrow 0} L(\cdot ;\; s) = \lim _{|s| \downarrow 0} \mathcal{T}_s f = f\) with \(|s| = \sqrt{s_1^2 + \cdots + s_N^2}\).

With \(s_1 = (s_{1,1}, \dots , s_{1,N})\) and \(s_2 = (s_{2,1}, \dots , s_{2,N})\) denoting two \(N\)-dimensional scale parameters, the inequality \(s_2 \ge s_1\) should be interpreted as a requirement that the scale levels \(s_1\) and \(s_2\) have to be *ordered* in the sense that the increment \(u = s_2 - s_1\) should correspond to a *positive direction* in parameter space that can be interpreted as increasing levels of scale. For example, for the affine spatial scale-space concept \(L(x;\; \varSigma )\) to be considered later in Sect. 3, which for two-dimensional images \(f\) can be parameterized by positive semi-definite \(2 \times 2\) covariance matrices \(\varSigma \), the requirement of an ordered and positive scale direction \(u\) between the scale-space representations computed for two different covariance matrices \(\varSigma _1\) and \(\varSigma _2\) means that the difference between these covariance matrices \(\varSigma _u = \varSigma _2 - \varSigma _1\) must be positive semi-definite. With the corresponding multi-dimensional scale parameters \(s_1\) and \(s_2\) expressed as vectors \(s_1 = (\varSigma _{1,11}, \varSigma _{1,12}, \varSigma _{1,22})\) and \(s_2 = (\varSigma _{2,11}, \varSigma _{2,12}, \varSigma _{2,22})\) where \(\varSigma _{k,ij}\) denote the elements of \(\varSigma _k\) for \(k = 1\) and \(2\), the condition for \(u = (u_1, u_2, u_3) = s_2 - s_1\) to correspond to a positive direction in parameter space can therefore be expressed as \(u_1 u_3 - u_2^2 \ge 0\) and \(u_1 + u_3 \ge 0\).
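The ordering condition on covariance matrices can be expressed as a small test function. The following Python sketch evaluates the two inequalities \(u_1 u_3 - u_2^2 \ge 0\) and \(u_1 + u_3 \ge 0\) for \(u = s_2 - s_1\) and, as a cross-check, computes the smallest eigenvalue of \(\varSigma _u = \varSigma _2 - \varSigma _1\) in closed form. The function names and the numerical examples are hypothetical.

```python
import math

def scale_increment(s1, s2):
    """Increment u = s2 - s1 for s = (Sigma_11, Sigma_12, Sigma_22)."""
    return tuple(b - a for a, b in zip(s1, s2))

def is_positive_direction(s1, s2):
    """True if u1*u3 - u2^2 >= 0 and u1 + u3 >= 0 for u = s2 - s1."""
    u1, u2, u3 = scale_increment(s1, s2)
    return u1 * u3 - u2 * u2 >= 0.0 and u1 + u3 >= 0.0

def min_eigenvalue(s1, s2):
    """Smallest eigenvalue of the symmetric 2x2 matrix Sigma_2 - Sigma_1."""
    u1, u2, u3 = scale_increment(s1, s2)
    return 0.5 * (u1 + u3 - math.sqrt((u1 - u3) ** 2 + 4.0 * u2 * u2))

# Illustrative (assumed) covariance matrices in vector form.
finer = (1.0, 0.2, 1.0)
coarser = (2.0, 0.5, 3.0)    # larger in every direction
unordered = (3.0, 0.0, 0.5)  # larger along x1 but smaller along x2

print(is_positive_direction(finer, coarser), min_eigenvalue(finer, coarser))
print(is_positive_direction(finer, unordered), min_eigenvalue(finer, unordered))
```

The second example shows that an increase in one matrix element is not sufficient. The pair of scale levels is ordered only if the scale does not decrease in any spatial direction.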

This constant brightness assumption is guaranteed to hold for a Lambertian reflectance model extended with a spatially varying albedo, if the surface pattern is subject to illumination that is constant over time for corresponding surface points; see Sect. 2.3 for a more detailed model of receptive field responses under illumination variations. The constant brightness assumption may, however, be violated if the illumination intensity or the orientation of the surface normal in relation to the light source varies over time, or if the reflectance model comprises non-Lambertian, e.g., specular, components. In such situations, a motion field computed from the optical flow obtained from the constant brightness assumption may therefore differ from the projected motion field of physical particles in the world. This situation can, on the other hand, be improved by applying the constancy assumption to spatial derivatives of the image intensity instead of the original zero-order image intensity. As explained in Sect. 2.3, such an assumption will in the specific case of a logarithmic brightness scale cancel the influence of local multiplicative illumination variations. By furthermore applying the constancy assumption to the output from several derivative operators simultaneously and additionally combining this assumption with an assumption of local coherence of the motion, e.g., in terms of a low parameter motion model over local regions in image space, one may additionally address the ambiguity of the aperture problem, provided that the local region of image space at which the low parameter image model is applied contains a sufficiently rich distribution of image structures of different orientations. Otherwise, the aperture problem states that under the assumption of constant brightness of corresponding physical points over time, only the motion component that is parallel to the local image gradient can be computed.
The notion of a Reichardt detector (Reichardt 1961; Reichardt and Schögl 1988) also addresses this issue by delay-coupled receptive fields in the retina. For the purpose of describing motion selective and motion-adapted properties of receptive fields, we shall, however, here for simplicity of presentation model temporal motions in terms of local Galilean transformations applied to image intensities, bearing in mind that this model can in a straightforward manner be transferred to the assumption of constancy of spatial derivative responses over time. Indeed, the spatio-temporal biological receptive fields that we shall describe in more detail in Sect. 6.3.2 do all support such a view by all comprising nonzero first, second, or third orders of spatial differentiation.
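The aperture problem mentioned above can be made explicit with a small numerical example. The Python sketch below assumes the standard linearized form of the constant brightness assumption, \(f_x u + f_y v + f_t = 0\), which is not written out in this note. It computes the motion component along the image gradient and shows that any added component perpendicular to the gradient satisfies the same constraint. The function names and derivative values are hypothetical.

```python
def normal_flow(fx, fy, ft):
    """Motion component along the gradient implied by fx*u + fy*v + ft = 0."""
    g2 = fx * fx + fy * fy
    if g2 == 0.0:
        raise ValueError("zero gradient: no motion component is determined")
    return (-ft * fx / g2, -ft * fy / g2)

def constraint_residual(fx, fy, ft, u, v):
    """Residual of the linearized constant brightness constraint."""
    return fx * u + fy * v + ft

# Illustrative (assumed) spatial and temporal derivatives at one image point.
fx, fy, ft = 3.0, 4.0, -10.0
un, vn = normal_flow(fx, fy, ft)

# Adding any multiple of the direction (-fy, fx), which is perpendicular to
# the gradient, leaves the constraint satisfied; this is the ambiguity.
k = 2.5
u_alt, v_alt = un - k * fy, vn + k * fx

print((un, vn), constraint_residual(fx, fy, ft, un, vn))
print((u_alt, v_alt), constraint_residual(fx, fy, ft, u_alt, v_alt))
```

Resolving the remaining degree of freedom requires additional constraints, such as derivative responses of several orientations within a local region, as described in the note above.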

Note that the form of the vignetting effect may be different for lens systems composed of several lenses, and that lens systems are usually constructed to reduce the vignetting effect over some central part of the field of view. Notably, this natural vignetting effect will not be present with a spherical camera geometry, which is of high relevance with regard to biological vision.

To ensure sufficient differentiability properties such that an infinitesimal generator exists and the resulting multi-scale representation obtained by convolution with the semigroup of convolution kernels can be differentiated with respect to both space and scale such that the requirement of non-enhancement of local extrema can be applied, we do formally for an \(N\)-dimensional spatial domain require the semigroup \(\mathcal{T}_s\) to be \(C_1\)-*continuous* such that \(\lim _{h \downarrow 0} \left\| \frac{1}{h} \int _{s = 0}^{h} \mathcal{T}(s) f \, ds - f \right\| _{H^k({\mathbb R}^N)} = 0\) should hold for some \(k > N/2\) and for all smooth functions \(f \in L^1({\mathbb R}^N) \cap C^{\infty }({\mathbb R}^N)\) with \(\Vert \cdot \Vert _{H^k({\mathbb R}^N)}\) denoting the \(L^2\)-based Sobolev norm \( \Vert u \Vert _{H^k({\mathbb R}^N)} = \left( \int _{\omega \in {\mathbb R}^N} \left( 1 + |\omega |^2 \right) ^k \, |\hat{u}(\omega )|^2 \hbox {d}\omega \right) ^{1/2}\) and \(\hat{u}\) denoting the Fourier transform of \(u\) over \({\mathbb R}^N\); see (Lindeberg (2011), Sect. 3.2 and “Appendix A”) regarding details.

Here, “rotational invariance at the group level” means that although a set of receptive fields may not be rotationally symmetric as individuals, a collection or a group of such receptive fields may nevertheless make it possible to generate rotationally invariant responses, for example if all orientations are explicitly represented or if the receptive fields of different orientations can be related by linear combinations.

To ensure sufficient differentiability properties such that an infinitesimal generator exists and the resulting multi-scale representation obtained by convolution with the semigroup of convolution kernels can be differentiated with respect to both space–time and spatio-temporal scales such that the requirement of non-enhancement of local extrema can be applied, we do formally for an \((N+1)\)-dimensional space–time require the semigroup \(\mathcal{T}_s\) to be \(C_1\)-*continuous* in the sense that \(\lim _{h \downarrow 0} \left\| \frac{1}{h} \int _{s = 0}^{h} \mathcal{T}(s) f \, ds - f \right\| _{H^k({\mathbb R}^N \times {\mathbb R})} = 0\) should hold for some \(k > (N+1)/2\) and for all smooth functions \(f \in L^1({\mathbb R}^N \times {\mathbb R}) \cap C^{\infty }({\mathbb R}^N \times {\mathbb R})\) with \(\Vert \cdot \Vert _{H^k({\mathbb R}^N \times {\mathbb R})}\) denoting the \(L^2\)-based Sobolev norm \( \Vert u \Vert _{H^k({\mathbb R}^N \times {\mathbb R})} = \left( \int _{\omega \in {\mathbb R}^N \times {\mathbb R}} \left( 1 + |\omega |^2 \right) ^k \, |\hat{u}(\omega )|^2 \hbox {d}\omega \right) ^{1/2}\) and \(\hat{u}\) denoting the Fourier transform of \(u\) over \({\mathbb R}^N \times {\mathbb R}\); see (Lindeberg (2011), Sect. 3.2 and “Appendix A”) regarding details.

It can be shown that this definition is compatible with spatio-temporal scale invariance for scale selection based on local extrema over temporal scales of scale-normalized derivatives (manuscript in preparation). Specifically, the value \(\kappa = 1/2\) can be motivated both from theoretical considerations and agreement with biological receptive fields.

By the use of locally adapted feedback, the resulting evolution equation does not obey the original linearity and shift invariance (homogeneity) requirements used for deriving the idealized affine Gaussian receptive field model, if the covariance matrices \(\varSigma _0\) are determined from properties of the image data that are determined in a nonlinear way. For a fixed set of covariance matrices \(\varSigma _0\) at any image point, the evolution equation will still be linear and will specifically obey non-enhancement of local extrema. In this respect, the resulting model could be regarded as a simplest form of nonlinear extension of the linear receptive field model.

## Acknowledgments

I would like to thank Benjamin Auffarth, Oskar Linde, and Prof. Per Roland for valuable discussions and comments.

The support from the Swedish Research Council, Vetenskapsrådet (contracts 2004-4680, 2010-4766) and from the Royal Swedish Academy of Sciences as well as the Knut and Alice Wallenberg Foundation is gratefully acknowledged.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.