Time-causal and time-recursive spatio-temporal receptive fields

We present an improved model and theory for time-causal and time-recursive spatio-temporal receptive fields, obtained by combining Gaussian receptive fields over the spatial domain with first-order integrators, or equivalently truncated exponential filters coupled in cascade, over the temporal domain. Compared to previous spatio-temporal scale-space formulations based on non-enhancement of local extrema or scale invariance, these receptive fields rest on a different scale-space axiomatics over time, ensuring non-creation of new local extrema or zero-crossings with increasing temporal scale. Specifically, extensions are presented concerning (i) parameterizing the intermediate temporal scale levels, (ii) analysing the resulting temporal dynamics, (iii) transferring the theory to a discrete implementation in terms of recursive filters over time, (iv) computing scale-normalized spatio-temporal derivative expressions for spatio-temporal feature detection and (v) computational modelling of receptive fields in the lateral geniculate nucleus (LGN) and the primary visual cortex (V1) in biological vision. We show how scale-normalized temporal derivatives can be defined for these time-causal scale-space kernels and how the composed theory can be used for computing basic types of scale-normalized spatio-temporal derivative expressions in a computationally efficient manner.


1. Introduction
Vision deals with the problem of deriving information about the world from the light reflected from it. Although the active and task-oriented nature of vision is only implicit in this formulation, this view captures several of the essential aspects of vision. As Marr (1982) phrased it in his book Vision, vision is an information processing task, in which an internal representation of information is of utmost importance. Only by representation can information be captured and made available to decision processes. The purpose of a representation is to make certain aspects of the information content explicit, that is, immediately accessible without any need for additional processing.

This introductory chapter deals with a fundamental aspect of early image representation: the notion of scale. As Koenderink (1984) emphasizes, the problem of scale must be faced in any imaging situation. An inherent property of objects in the world and details in images is that they only exist as meaningful entities over certain ranges of scale. A simple example is the concept of a branch of a tree, which makes sense only at a scale from, say, a few centimeters to at most a few meters. It is meaningless to discuss the tree concept at the nanometer or the kilometer level. At those scales it is more relevant to talk about the molecules that form the leaves of the tree, or the forest in which the tree grows. Consequently, a multi-scale representation is of crucial importance if one aims at describing the structure of the world, or more specifically the structure of projections of the three-dimensional world onto two-dimensional images.
The need for multi-scale representation is well understood, for example, in cartography; maps are produced at different degrees of abstraction. A map of the world contains the largest countries and islands, and possibly some of the major cities, whereas towns and smaller islands appear first in a map of a country. In a city guide, the level of abstraction is changed considerably to include streets, buildings, etc. In other words, maps constitute symbolic multi-scale representations of the world around us, although constructed manually and with very specific purposes in mind.
To compute any type of representation from image data, it is necessary to extract information, and hence interact with the data using certain operators. Some of the most fundamental problems in low-level vision and image analysis concern what operators to use, where to apply them, and how large they should be. If these problems are not appropriately addressed, the task of interpreting the output results can be very hard. Ultimately, the task of extracting information from real image data is severely influenced by the inherent measurement problem that real-world structures, in contrast to certain ideal mathematical entities such as "points" or "lines", appear in different ways depending upon the scale of observation.
Phrasing the problem in this way shows the intimate relation to physics. Any physical observation by necessity has to be made through some finite aperture, and the result will, in general, depend on the aperture of observation. This holds for any device that registers physical entities from the real world, including a vision system based on brightness data. Whereas constant-size aperture functions may be sufficient in many (controlled) physical applications, e.g. fixed measurement devices, and the aperture functions of the basic sensors in a camera (or retina) may have to be determined a priori because of practical design constraints, it is far from clear that registering data at a fixed level of resolution is sufficient. A vision system for handling objects of different sizes and at different distances needs a way to control the scale(s) at which the world is observed.
The goal of this chapter is to review some fundamental results concerning a framework known as scale-space that has been developed by the computer vision community for controlling the scale of observation and representing the multi-scale nature of image data. Starting from a set of basic constraints (axioms) on the first stages of visual processing, it will be shown that under reasonable conditions it is possible to substantially restrict the class of possible operations and to derive a (unique) set of weighting profiles for the aperture functions. In fact, the operators that are obtained bear qualitative similarities to receptive fields at the very earliest stages of (human) visual processing. We shall mainly be concerned with the operations that are performed directly on raw image data; the processing modules that carry out these operations are collectively termed the visual front-end. The purpose of this processing is to register the information on the retina, and to make important aspects of it explicit for use in later-stage processes. If the operations are to be local, they have to preserve the topology of the retina; for this reason the processing can be termed retinotopic.

1.1. Early visual operations
An obvious problem concerns what information should be extracted and what computations should be performed at these levels. Is any type of operation feasible? An axiomatic approach that has been adopted in order to restrict the space of possibilities is to assume that the very first stages of visual processing should be able to function without any direct knowledge about what can be expected to be in the scene. As a consequence, the first stages of visual processing should be as uncommitted as possible and make as few irreversible decisions or choices as possible.
The Euclidean nature of the world around us and the perspective mapping onto images impose natural constraints on a visual system. Objects move rigidly, the illumination varies, the size of objects at the retina changes with the depth from the eye, view directions may change, etc. Hence, it is natural to require early visual operations to be unaffected by certain primitive transformations (e.g. translations, rotations, and grey-scale transformations). In other words, the visual system should extract properties that are invariant with respect to these transformations.
As we shall see below, these constraints lead to operations that correspond to spatio-temporal derivatives, which are then used for computing (differential) geometric descriptions of the incoming data flow. Based on the output of these operations, in turn, a large number of feature detectors can be expressed, as well as modules for computing surface shape.
The subject of this chapter is to present a tutorial overview of the historical and current insights of linear scale-space theories as a paradigm for describing the structure of scalar images and as a basis for early vision. For other introductory texts on scale-space, see the monographs by Lindeberg (1991) and Florack (1993) as well as the overview articles by ter Haar Romeny and Florack (1993) and Lindeberg (1994).

1.2. Multi-scale representation of image data

Performing a physical observation (e.g. looking at an image) means that some physical quantity is measured using some set (array) of measuring devices with certain apertures. A basic trade-off problem that arises in this context is that if we are interested in resolving small details, then the apertures should be narrow, which means that less of the physical entity will be registered. A larger aperture, on the other hand, gives a stronger response and coarser details. Since we, in general, cannot know in advance what aperture sizes are appropriate, we would like to be able to treat the scale of observation as a free parameter so as to be able to handle all scales simultaneously. This concept of having a range of measurements using apertures of different physical sizes, corresponding to observations at a range of scales, is called a multi-scale measurement of data. In case a set of measurement data is already given (as is the case when an image is registered at a certain scale using a camera), this process can be simulated by the vision system. The basic idea behind a multi-scale representation is to embed the original signal into such a one-parameter family of derived signals. How should such a representation be constructed? A crucial requirement is that structures at coarse scales in the multi-scale representation should constitute simplifications of corresponding structures at finer scales: they should not be accidental phenomena created by the smoothing method.
This property has been formalized in a variety of ways by different authors. A noteworthy coincidence is that the same conclusion can be reached from several different starting points. The main result we shall arrive at is that if rather general conditions are imposed on the types of computations that are to be performed at the first stages of visual processing, then the Gaussian kernel and its derivatives are singled out as the only possible smoothing kernels. The requirements, or axioms, that specify the Gaussian kernel are basically linearity and spatial shift invariance, combined with different ways of formalizing the notion that structures at coarse scales should be related to structures at finer scales in a well-behaved manner; new structures should not be created by the smoothing method.
A simple example where structure is created in a "multi-scale representation" is when an image is enlarged by pixel replication (see figure 1.2). The sharp boundaries at regular distances are not present in the original data; they are just artifacts of the scale-changing (zooming) process. Why should one represent a signal at multiple scales when all information is present in the original data anyway? The major reason is to explicitly represent the multi-scale aspect of real-world images.¹ Another aim is to simplify further processing by removing unnecessary and disturbing details. More technically, the latter motivation reflects the common need for smoothing as a pre-processing step to many numerical algorithms, as a means of noise suppression.
Of course, there exists a large number of possible ways to construct a one-parameter family of derived signals from a given signal. The terminology that will be adopted² here is to refer to as a "multi-scale representation" any one-parameter family for which the parameter has a clear interpretation in terms of spatial scale.

¹ At the first stages there should be no preference for any certain scale or range of scales; all scales should be measured and represented equivalently. In later stages the task may influence the selection; for example, do we want to see the tree or the leaves?

² In some literature the term "scale-space" is used for denoting any type of multi-scale representation. Using that terminology, the scale-space concept developed here should be called "(linear) diffusion scale-space".

1.3. Early multi-scale representations
The general idea of representing a signal at multiple scales is not entirely new. Early work in this direction was performed by, e.g., Rosenfeld and Thurston (1971), who observed the advantage of using operators of different size in edge detection. Related approaches were considered by Klinger (1971), Uhr (1972), Hanson and Riseman (1974), and Tanimoto and Pavlidis (1975) concerning image representations using different levels of spatial resolution, i.e., different amounts of subsampling.
These ideas have since been furthered, mainly by Burt and by Crowley, into one of the types of multi-scale representations most widely used today, the pyramid. A brief overview of this concept is given below.

1.3.1. Quad tree
One of the earliest types of multi-scale representations of image data is the quad tree introduced by Klinger (1971). It is a tree-like representation of image data, in which the image is recursively divided into smaller regions.
The basic idea is as follows: Consider, for simplicity, a discrete two-dimensional image I of size 2^K × 2^K for some K ∈ ℕ, and define a measure σ of the grey-level variation in any region. This measure may, e.g., be the standard deviation of the grey-level values.
Let I^(K) = I. If σ(I^(K)) is greater than some pre-specified threshold α, then split I^(K) into sub-images I^(K−1)_j (j = 1, ..., p) according to some specified rule. Apply this procedure recursively to all sub-images until convergence is obtained. A tree of degree p is generated, in which each leaf I^(k)_j is a homogeneous block with σ(I^(k)_j) < α. (One example is given in figure 1.3.) In the worst case, each pixel may correspond to an individual leaf. On the other hand, if the image contains a small number of regions with relatively uniform grey-levels, then a substantial data reduction can be obtained.
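The recursive splitting step above can be sketched as follows. This is an illustrative implementation only; the function name, the choice of p = 4 quadrants, and the leaf descriptors (top-left corner plus block size) are assumptions of this sketch, not fixed by the text.

```python
import numpy as np

def split(image, threshold, x0=0, y0=0):
    """Recursively split a 2^K x 2^K image into homogeneous blocks.

    A block becomes a leaf when its grey-level standard deviation
    (the measure sigma of the text) falls below the threshold alpha;
    otherwise it is divided into p = 4 quadrants.  Returns a list of
    (x0, y0, size) leaf descriptors."""
    if image.std() < threshold or image.shape[0] == 1:
        return [(x0, y0, image.shape[0])]
    h = image.shape[0] // 2
    leaves = []
    for (dx, dy) in [(0, 0), (0, h), (h, 0), (h, h)]:
        leaves += split(image[dx:dx + h, dy:dy + h], threshold,
                        x0 + dx, y0 + dy)
    return leaves

# A 4x4 image: uniform left half, varying right half.
I = np.zeros((4, 4))
I[:, 2:] = np.arange(8).reshape(4, 2)
leaves = split(I, threshold=0.5)
```

On this example the two uniform left quadrants stay whole, while the varying right quadrants are split down to individual pixels, illustrating the data reduction for images with uniform regions.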
This representation has been used in simple segmentation algorithms for image processing of grey-level data. In the "split-and-merge" algorithm, a splitting step is first performed according to the above scheme. Then, adjacent regions are merged if the variation measure of the union of the two regions is below the threshold. Another application (typically with α = 0) concerns objects defined by uniform grey-levels, e.g. binary objects; see e.g. the book by Tanimoto and Klinger (1980) for more references on this type of representation.

1.3.2. Pyramids
Pyramids are representations that combine the subsampling operation with a smoothing step. Historically, they have provided important steps towards current scale-space theories, and we shall therefore consider them in more detail. To illustrate the idea, assume again, for simplicity, that the size of the input image I is 2^K × 2^K, and let I^(K) = I. The representation of I^(K) at a coarser level I^(K−1) is defined by a reduction operator. Moreover, assume that the smoothing filter is separable, and that the number of filter coefficients along one dimension is odd. Then, it is sufficient to study the following one-dimensional situation:

I^(k−1)(x) = Σ_{n=−N}^{N} c(n) I^(k)(2x − n), (1.1)

where c: ℤ → ℝ denotes a set of filter coefficients. This type of low-pass pyramid representation (see figures 1.4–1.5) was proposed almost simultaneously by Burt (1981) and in a thesis by Crowley (1981).
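The reduction operator (1.1) can be sketched directly. The border handling (here: clamping the index at the signal ends) is a free design choice not fixed by the text, and the function name is ours.

```python
import numpy as np

def reduce_1d(signal, c):
    """One REDUCE step of eq. (1.1): smooth with coefficients
    c(-N), ..., c(N) and subsample by a factor of two,
    I[k-1](x) = sum_n c(n) I[k](2x - n).
    Out-of-range indices are clamped to the signal border."""
    N = len(c) // 2
    out = np.empty(len(signal) // 2)
    for x in range(len(out)):
        out[x] = sum(c[n + N] * signal[np.clip(2 * x - n, 0, len(signal) - 1)]
                     for n in range(-N, N + 1))
    return out

I = np.arange(8, dtype=float)                # a 2^3-sample ramp
coarse = reduce_1d(I, [0.25, 0.5, 0.25])     # halves the length
```

Each level of the pyramid is obtained by applying this step to the previous one, so the image size (and the work per level) decreases geometrically.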
Low-pass and band-pass pyramids

Basic structure. A main advantage of this representation is that the image size decreases exponentially with the scale level, and hence also the amount of computation required to process the data. If the filter coefficients c(n) are chosen properly, the representations at coarser scale levels (smaller k) will correspond to coarser scale structures in the image data. How can the filter coefficients be determined? Some of the most obvious design criteria that have been proposed are:

– positivity: c(n) ≥ 0,
– unimodality: c(|n|) ≥ c(|n| + 1),
– symmetry: c(−n) = c(n), and
– normalization: Σ_{n=−N}^{N} c(n) = 1.
Another natural condition is that all pixels should contribute equally to all levels; in other words, any point having an odd coordinate index should contribute as much to the next coarser level as any point having an even coordinate index. Formally, this can be expressed as

– equal contribution: Σ_{n=−N}^{N} c(2n) = Σ_{n=−N}^{N} c(2n + 1).
Equivalently, this condition means that the kernel (1/2, 1/2) of width two should occur as at least one factor in the smoothing kernel.
The choice of the filter size N gives rise to a trade-off problem: a larger value of N increases the number of degrees of freedom in the design, at the cost of increased computational work. A natural choice when N = 1 is the binomial filter

(1/4, 1/2, 1/4), (1.2)

which is the unique filter of width three that satisfies the equal contribution condition. It is also the unique filter of width three for which the Fourier transform ψ(θ) = Σ_{n=−N}^{N} c(n) exp(−inθ) is zero at θ = π.
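Both stated properties of the binomial filter (1.2) can be checked numerically; a minimal sketch:

```python
import numpy as np

# Taps of the binomial kernel (1/4, 1/2, 1/4), indexed n = -1, 0, 1.
c = {-1: 0.25, 0: 0.5, 1: 0.25}

# Equal contribution: sum over even taps c(2n) vs. odd taps c(2n+1).
even = sum(v for n, v in c.items() if n % 2 == 0)
odd  = sum(v for n, v in c.items() if n % 2 != 0)

def psi(theta):
    """Fourier transform psi(theta) = sum_n c(n) exp(-i n theta)."""
    return sum(v * np.exp(-1j * n * theta) for n, v in c.items())
```

Here `even` and `odd` both equal 1/2, and ψ(π) = −1/4 + 1/2 − 1/4 = 0, confirming the zero at θ = π; ψ(0) = 1 reflects the normalization condition.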
A negative property of this kernel, however, is that when applied repeatedly, the equivalent convolution kernel (which corresponds to the combined effect of iterated smoothing and subsampling) tends to a triangular function. When N = 2, the design criteria above leave one degree of freedom, a five-tap kernel of the form (1/4 − a/2, 1/4, a, 1/4, 1/4 − a/2). Burt and Adelson (1983) argued that the free parameter a should be selected such that the equivalent smoothing function is as similar to a Gaussian as possible. Empirically, they selected the value a = 0.4.
By considering a representation defined as the difference between two adjacent levels in a low-pass pyramid, one obtains a band-pass pyramid, termed "Laplacian pyramid" by Burt and "DOLP" (Difference Of Low-Pass) by Crowley. This representation has been used for feature detection and data compression. Among the features that can be detected are blobs (maxima), peaks, ridges, etc. (Crowley et al. 1987).

Properties. To summarize, the main advantages of the pyramid representations are that they lead to a rapidly decreasing image size, which reduces the computational work both in the actual computation of the representation and in the subsequent processing. The memory requirements are small, and there exist commercially available implementations of pyramids in hardware. The main disadvantage of pyramids is that they correspond to quite a coarse quantization along the scale direction, which makes it algorithmically complicated to relate (match) structures across scales. Furthermore, pyramids are not translationally invariant.
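A difference-of-low-pass construction can be sketched in a few lines. To keep the sketch short, subsampling is omitted so that adjacent levels can be subtracted directly (in the spirit of the "oversampled pyramids" mentioned below); Burt's Laplacian pyramid instead interposes an EXPAND (upsampling) step before subtracting. Function names and border handling are our assumptions.

```python
import numpy as np

def smooth(signal, c=(0.25, 0.5, 0.25)):
    """Low-pass filtering with the binomial kernel; clamped borders."""
    padded = np.concatenate(([signal[0]], signal, [signal[-1]]))
    return c[0] * padded[:-2] + c[1] * padded[1:-1] + c[2] * padded[2:]

def dolp(signal, levels):
    """Difference-Of-Low-Pass sketch: each band-pass level is the
    difference between two adjacent low-pass levels (no subsampling)."""
    lowpass = [np.asarray(signal, float)]
    for _ in range(levels):
        lowpass.append(smooth(lowpass[-1]))
    bandpass = [lowpass[k] - lowpass[k + 1] for k in range(levels)]
    return lowpass, bandpass

x = np.zeros(16); x[8] = 1.0                 # an impulse
low, band = dolp(x, 3)
# The band-pass levels plus the coarsest low-pass level telescope
# back to the original signal.
recon = sum(band) + low[-1]
```

The telescoping reconstruction property is what makes such representations attractive for data compression.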
Further reading. There is a large literature on further work on pyramid representations; see e.g. the book by Jolion and Rosenfeld (1994), the books edited by Rosenfeld (1984) and Cantoni and Levialdi (1986), as well as the special issue of IEEE-TPAMI edited by Tanimoto (1989). A selection of recent developments can also be found in the articles by Chehikian and Crowley (1991), Knudsen and Christensen (1991), and Wilson and Bhalerao (1992). An interesting approach is the introduction of "oversampled pyramids", in which not every smoothing step is followed by a subsampling operation, and hence a denser sampling along the scale direction can be obtained. Pyramids can, of course, also be expressed for three-dimensional datasets, such as medical tomographic data (see e.g. Vincken et al. 1992).
It is worth noting that pyramid representations show a high degree of similarity with a type of numerical methods called multi-grid methods; see the book by Hackbusch (1985) for an extensive treatment of the subject.

1.4. Linear scale-space

In the quad tree and pyramid representations, rather coarse steps are taken in the scale direction. A scale-space representation is a special type of multi-scale representation that comprises a continuous scale parameter and preserves the same spatial sampling at all scales. The Gaussian scale-space of a signal, as introduced by Witkin (1983), is an embedding of the original signal into a one-parameter family of derived signals constructed by convolution with Gaussian kernels of increasing width.
The linear scale-space representation of a continuous signal is constructed as follows. Let f: ℝ^N → ℝ represent any given signal. Then, the scale-space representation I: ℝ^N × ℝ₊ → ℝ is defined by letting the scale-space representation at zero scale be equal to the original signal, I(·; 0) = f, and for t > 0

I(·; t) = g(·; t) * f, (1.4)

where t ∈ ℝ₊ is the scale parameter and g: ℝ^N × ℝ₊\{0} → ℝ is the Gaussian kernel; in arbitrary dimensions (x ∈ ℝ^N, x_i ∈ ℝ) it is written

g(x; t) = 1/(4πt)^(N/2) e^(−x^T x/(4t)) = 1/(4πt)^(N/2) e^(−Σ_{i=1}^{N} x_i²/(4t)). (1.5)

The effect of this operation is to suppress most of the structures in the signal with a characteristic length less than the standard deviation √(2t) of the kernel; see figure 1.6(a). Notice how this successive smoothing captures the intuitive notion of the signals becoming gradually smoother. A two-dimensional example is presented in figure 1.7.
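The construction (1.4)–(1.5) can be sketched numerically in one dimension. Sampling and renormalizing the truncated continuous kernel is an implementation choice of this sketch (a simple approximation, not the discrete scale-space theory); function names and the variation measure are ours.

```python
import numpy as np

def gauss_kernel(t, radius):
    """Sampled 1-D Gaussian g(x; t) = exp(-x^2/(4t)) / sqrt(4 pi t),
    following the 4t convention of eq. (1.5); renormalized so that
    the truncated, sampled taps sum to one."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (4.0 * t)) / np.sqrt(4.0 * np.pi * t)
    return g / g.sum()

def scale_space(f, ts, radius=20):
    """I(.; t) = g(.; t) * f for each t in ts; zero scale is f itself."""
    return [np.asarray(f, float)] + \
           [np.convolve(f, gauss_kernel(t, radius), mode='same') for t in ts]

rng = np.random.default_rng(0)
f = rng.standard_normal(256)
levels = scale_space(f, ts=[1.0, 4.0, 16.0])
# Fine-scale variation, measured here as the mean squared first
# difference, decreases as the scale parameter t increases.
var = [np.mean(np.diff(L)**2) for L in levels]
```

The decreasing variation across `levels` is the numerical counterpart of the gradual smoothing visible in figure 1.6(a).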

1.5. Towards formalizing the scale-space concept
In this section we shall review some of the most important approaches for formalizing the notion of scale and for deriving the shape of the scale-space kernels in the linear scale-space theory. In a later chapter in this book, by Alvarez and Morel (1994), it will be described how these ideas can be extended to non-linear scale-spaces and how the approach relates to mathematical morphology.

1.5.1. Continuous signals: Original formulation
When Witkin introduced the term scale-space, he was concerned with one-dimensional signals and observed that new local extrema cannot be created in this family. Since differentiation commutes with convolution,

∂_{x^n} I(·; t) = ∂_{x^n} (g(·; t) * f) = g(·; t) * ∂_{x^n} f, (1.6)

this non-creation property applies also to any nth-order spatial derivative computed from the scale-space representation.
Recall that an extremum in a one-dimensional signal I is equivalent to a zero-crossing in the first-order derivative I_x. The non-creation of new local extrema means that the zero-crossings of any derivative of I form curves across scales that will never be closed from below; see figure 1.6(b). Hence, in the one-dimensional case, the zero-crossings form paths across scales, with a set of inclusion relations that gives rise to a tree-like data structure, termed the "interval tree". (For higher-dimensional signals, however, new local extrema and zero-crossings can be created; see section 1.5.5.)

An interesting empirical observation made by Witkin was a marked correspondence between the length of the branches in the interval tree and perceptual saliency: "... intervals that survive over a broad range of scales tend to leap out to the eye ..." In later work by Lindeberg (1991) it has been demonstrated that this observation can be extended to a principle for actually detecting significant image structures from the scale-space representation. That approach is based on the stability and lifetime over scales, the local contrast, and the spatial extent of blob-like image structures.

Gaussian smoothing had been used also before Witkin proposed the scale-space concept, e.g. by Marr and Hildreth (1980), who considered zero-crossings of the Laplacian in images convolved with Gaussian kernels of different standard deviation. One of the most important contributions of Witkin's scale-space formulation, however, was the systematic way to relate and interconnect such representations and image structures at different scales: a scale dimension should be added to the scale-space representation, so that the behaviour of structures across scales can be studied. Some aspects of the resulting "deep structure" of scale-space, i.e. the study of the relations between structures at different scales, will be considered in the next chapter (section 2.3).
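The non-creation property can be illustrated numerically by counting local extrema of a smoothed random signal. This small sketch is not from the text; periodic boundary conditions (smoothing via the Fourier-domain transfer function e^(−ω²t) of the kernel (1.5)) and the extremum counter are our assumptions, chosen to avoid border artifacts.

```python
import numpy as np

def n_extrema(f):
    """Count interior local extrema of a sampled signal as sign
    changes of its first difference (generic signals, no plateaus)."""
    d = np.sign(np.diff(f))
    d = d[d != 0]
    return int(np.sum(d[1:] != d[:-1]))

def smooth(f, t):
    """Periodic Gaussian smoothing in the Fourier domain, using the
    transfer function exp(-omega^2 t) of the kernel (1.5)."""
    omega = 2.0 * np.pi * np.fft.rfftfreq(len(f))
    return np.fft.irfft(np.fft.rfft(f) * np.exp(-omega**2 * t), len(f))

rng = np.random.default_rng(1)
f = rng.standard_normal(200)
counts = [n_extrema(smooth(f, t)) for t in [0.0, 0.5, 2.0, 8.0, 32.0]]
```

For a noisy signal the extremum count drops steeply and, in accordance with the theory, never increases with t.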
See also (Koenderink 1994).

1.5.2. Inner scale, outer scale, and scale-space

Koenderink (1984) emphasized that the problem of scale must be faced in any imaging situation. Any real-world image has a limited extent determined by two scales: the outer scale, corresponding to the finite size of the image, and the inner scale, given by the resolution. For a digital image the inner scale is determined by the pixel size, and for a photographic image by the grain size in the emulsion.
As described in the introduction, similar properties apply to objects in the world, and hence to image features. The outer scale of an object or a feature may be said to correspond to the (minimum) size of a window that completely contains the object or the feature, whereas the inner scale may loosely be said to correspond to the coarsest scale at which substructures of the object or the feature begin to appear.
Referring to the analogy with cartography given in the introduction, it should be emphasized that an atlas usually contains a set of maps covering some region of interest. Within each map the outer scale typically scales in proportion to the inner scale. A single map is, however, usually not sufficient for us to find our way around the world. We need the ability to zoom in to structures at different scales, i.e., to decrease and increase the inner scale of the observation according to the type of situation at hand. (For an excellent illustration of this notion, see the popular book Powers of Ten (Morrison and Morrison 1985), which shows pictures of the world over 50 decades of scale, from the largest to the smallest structures in the universe known to man.)

Koenderink also stressed that if there is no a priori reason for looking at specific image structures, the visual system should be able to handle image structures at all scales. Pyramid representations approach this problem by successive smoothing and subsampling of images. However, "The challenge is to understand the image really on all these levels simultaneously, and not as an unrelated set of derived images at different levels of blurring ..." Adding a scale dimension to the original data set, as is done in the one-parameter embedding, provides a formal way to express this interrelation.

1.5.3. Causality

The observation that new local extrema cannot be created when increasing the scale parameter in the one-dimensional case shows that Gaussian convolution satisfies certain sufficiency requirements for being a smoothing operation. The first proof of the necessity of Gaussian smoothing for a scale-space representation was given by Koenderink (1984), who also gave a formal extension of the scale-space theory to higher dimensions.
He introduced the concept of causality, which means that new level surfaces

{(x, y; t) ∈ ℝ² × ℝ : I(x, y; t) = I₀}

must not be created in the scale-space representation when the scale parameter is increased (see figure 1.8). By combining causality with the notions of isotropy and homogeneity, which essentially mean that all spatial positions and all scale levels must be treated in a similar manner, he showed that the scale-space representation must satisfy the diffusion equation

∂_t I = ∇²I. (1.7)

This diffusion equation, with initial condition I(·; 0) = f, is the well-known physical equation that describes how a heat distribution I evolves over time t in a homogeneous medium with uniform conductivity, given an initial heat distribution I(·; 0) = f (see e.g. Widder 1975 or Strang 1986). Since the Gaussian kernel is the Green's function of the diffusion equation on an infinite domain, it follows that the Gaussian kernel is the unique kernel for generating the scale-space. A similar result holds, as we shall see later, in any dimension.
The technique used for proving this necessity result was to study the level surface through any point in scale-space at which the grey-level function assumes a maximum with respect to the spatial coordinates. If no new level surface is to be created with increasing scale, the level surface must point with its concave side towards decreasing scales. This gives rise to a sign condition on the curvature of the level surface, which assumes the form (1.7) when expressed in terms of derivatives of the scale-space representation with respect to the spatial and scale coordinates. Since the points at which extrema are attained cannot be assumed to be known a priori, this condition must hold at every point, which proves the result.
In the one-dimensional case, this level surface condition becomes a level curve condition, and is equivalent to the previously stated non-creation of local extrema. Since any nth-order derivative of I also satisfies the diffusion equation,

∂_t I_{x^n} = ∇² I_{x^n}, (1.8)

it follows that new zero-crossing curves in I_{x^n} cannot be created with increasing scale, and hence no new maxima.
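The equivalence between Gaussian convolution and the diffusion equation (1.7) can be verified numerically. A minimal sketch, assuming a periodic one-dimensional grid with unit spacing and a simple explicit Euler discretization (our choices, not the text's):

```python
import numpy as np

def diffuse(f, t, dt=0.1):
    """Explicit Euler scheme for I_t = I_xx on a periodic grid
    (dx = 1; stable since dt <= 1/2)."""
    I = np.asarray(f, float).copy()
    for _ in range(int(round(t / dt))):
        I += dt * (np.roll(I, 1) - 2 * I + np.roll(I, -1))
    return I

def gauss_smooth(f, t):
    """Exact solution: convolution with g(.; t) of eq. (1.5), i.e.
    multiplication by exp(-omega^2 t) in the Fourier domain."""
    omega = 2.0 * np.pi * np.fft.rfftfreq(len(f))
    return np.fft.irfft(np.fft.rfft(f) * np.exp(-omega**2 * t), len(f))

x = np.arange(64)
f = np.sin(2 * np.pi * x / 64)
err = np.abs(diffuse(f, t=4.0) - gauss_smooth(f, t=4.0)).max()
```

For this smooth input the finite-difference evolution and the Gaussian convolution agree to within the discretization error of the Euler scheme.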
A similar result was given by Yuille and Poggio (1985, 1986) concerning the zero-crossings of the Laplacian of the Gaussian. Related formulations have also been expressed by Babaud et al. (1986) and Hummel (1986).

1.5.4. Non-creation of local extrema

Lindeberg (1990, 1991) considered the problem of characterizing those kernels in one dimension that share the property of not introducing new local extrema in a signal under convolution. A kernel h ∈ L¹ possessing the property that for any input signal f_in ∈ L¹ the number of local extrema in the convolved signal f_out = h * f_in is always less than or equal to the number of local extrema in the original signal is termed a scale-space kernel:

– scale-space kernel: #extrema(h * f_in) ≤ #extrema(f_in) for all f_in ∈ L¹.
From arguments similar to those in section 1.5.1, this definition implies that the number of extrema (or zero-crossings) in any nth-order derivative is guaranteed never to increase. In this respect, convolution with a scale-space kernel has a strong smoothing property.
Such kernels can easily be shown to be positive and unimodal both in the spatial and the frequency domain. These properties may provide a formal justification for some of the design criteria listed in section 1.3.2 concerning the choice of filter coefficients for pyramid generation.
If the notion of a local maximum or zero-crossing is defined in a proper manner to cover also non-generic functions, it turns out that the scale-space kernels can be completely classified using classical results by Schoenberg (1950): a continuous kernel h is a scale-space kernel if and only if it has a bilateral Laplace transform of the form

∫ h(x) e^(−sx) dx = C e^(γs² + δs) ∏_{i=1}^{∞} e^(a_i s) / (1 + a_i s), (1.9)

for −c < Re(s) < c and some c > 0, where C ≠ 0, γ ≥ 0, and δ and a_i are real with Σ a_i² convergent. In the spatial domain, the factors of (1.9) correspond to convolution with a Gaussian kernel, convolution with truncated exponential functions, as well as trivial translation and rescaling. The product form of the expression (1.9) reflects a direct consequence of the definition of a scale-space kernel: the convolution of two scale-space kernels is a scale-space kernel. Interestingly, the converse also holds: a shift-invariant linear transformation is a smoothing operation if and only if it can be decomposed into these primitive operations.

1.5.5. Semi-group and continuous scale parameter
Another approach to finding the appropriate families of scale-space kernels is provided by group theory. A natural structure to impose on a scale-space representation is a semi-group structure: if every smoothing kernel is associated with a parameter value, and if two such kernels are convolved with each other, then the resulting kernel should be a member of the same family,

h(·; t₁) * h(·; t₂) = h(·; t₁ + t₂). (1.12)

In particular, this condition implies that the transformation from any fine scale to any coarser scale should be of the same type as the transformation from the original signal to any scale in the scale-space representation; algebraically,

I(·; t₂) = h(·; t₂ − t₁) * I(·; t₁) (t₂ > t₁ ≥ 0). (1.13)

If a semi-group structure is imposed on a one-parameter family of scale-space kernels that satisfies a mild degree of smoothness (Borel-measurability) with respect to the parameter, and if the kernels are required to be symmetric and normalized, then the family of smoothing kernels is uniquely determined to be Gaussian (Lindeberg 1990),

h(x; t) = 1/√(4πt) e^(−x²/(4t)) (t > 0, x ∈ ℝ). (1.14)

In other words, when combined with the semi-group structure, the non-creation of new local extrema means that the smoothing family is uniquely determined.
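The semi-group property (1.12) for the Gaussian (1.14) can be checked numerically by approximating the continuous convolution with a Riemann sum on a fine grid (the grid spacing and extent below are arbitrary choices of this sketch):

```python
import numpy as np

def h(x, t):
    """The Gaussian (1.14): h(x; t) = exp(-x^2/(4t)) / sqrt(4 pi t)."""
    return np.exp(-x**2 / (4.0 * t)) / np.sqrt(4.0 * np.pi * t)

dx = 0.01
x = np.arange(-15.0, 15.0 + dx / 2, dx)     # odd-length symmetric grid

# Discretized continuous convolution h(.; 1) * h(.; 2); the factor dx
# turns the discrete sum into a Riemann-sum approximation.
conv = np.convolve(h(x, 1.0), h(x, 2.0), mode='same') * dx
err = np.abs(conv - h(x, 3.0)).max()
```

Up to the quadrature error, the convolution of the kernels at scales t₁ = 1 and t₂ = 2 reproduces the kernel at scale t₁ + t₂ = 3.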
Despite the completeness of these results, however, they cannot be extended directly to higher dimensions, since in two (and higher) dimensions there are no non-trivial kernels guaranteed to never increase the number of local extrema in a signal. One example of this, originating from an observation by Lifshitz and Pizer (1990), is presented below; see also Yuille (1988) concerning creation of other types of image structures: Imagine a two-dimensional image function consisting of two hills, one of them somewhat higher than the other. Assume that they are smooth, wide, rather bell-shaped surfaces situated some distance apart, clearly separated by a deep valley running between them. Connect the two tops by a narrow sloping ridge without any local extrema, so that the top point of the lower hill is no longer a local maximum. Let this configuration be the input image. When the operator corresponding to the diffusion equation is applied to this geometry, the ridge will erode much faster than the hills. After a while it has eroded so much that the lower hill appears as a local maximum again. Thus, a new local extremum has been created. Notice, however, that this decomposition of the scene is intuitively quite reasonable: the narrow ridge is a fine-scale phenomenon, and should therefore disappear before the coarse-scale peaks. The property that new local extrema can be created with increasing scale is inherent in two and higher dimensions.

Scale invariance and the Pi theorem

A formulation by Florack et al. (1992) and continued work by Pauwels et al. (1994) show that the class of allowable scale-space kernels can be restricted under weaker conditions, essentially by combining the earlier mentioned conditions about linearity, shift invariance, rotational invariance and semi-group structure with scale invariance. The basic argument is taken from physics: physical laws must be independent of the choice of fundamental parameters.
In practice, this corresponds to what is known as dimensional analysis: a function that relates physical observables must be independent of the choice of dimensional units. Notably, this condition comprises no direct measure of "structure" in the signal; the non-creation of new structure is only implicit, in the sense that physically observable entities that are subject to scale changes should be treated in a self-similar manner.
Since this way of reasoning is valid in arbitrary dimensions and not very technical, we shall reproduce it (although in a modified form and with somewhat different proofs). The main result we shall arrive at is that scale invariance implies that the Fourier transform of the convolution kernel must be of the form

\hat{h}(\omega;\, \sigma) = e^{-\alpha |\omega \sigma|^p}

for some \alpha > 0 and p > 0. The Gaussian kernel corresponds to the specific case p = 2 and arises as a unique choice if certain additional requirements are imposed.
Preliminaries: Semi-group with arbitrary parameterization. When basing a scale-space formulation on scale invariance, some further considerations are needed concerning the assumption about a semi-group structure.
In the previous section, the scale parameter t associated with the semi-group (see equation (1.12)) was regarded as an abstract ordering parameter only. A priori, i.e. at the stage of formulating the axioms, there was no direct connection between this parameter and measurements of scale in terms of units of length. The only requirement was the qualitative (and essential) constraint that increasing values of the scale parameter should somehow correspond to representations at coarser scales. A posteriori, i.e. after deriving the shape of the convolution kernel, we could conclude that this parameter is related to scale as measured in units of length, e.g. via the standard deviation \sigma of the Gaussian kernel. The relationship is t = \sigma^2/2, and the semi-group operation corresponds to adding \sigma-values in the Euclidean norm.
In the derivation in this section, we shall assume that such a relationship exists already at the stage of formulating the axioms. Introduce \sigma as a scale parameter of dimension [length] associated with each layer in the scale-space representation. To allow for maximum generality in the relation between t and \sigma, assume that there exists some (unknown) transformation

t = \varphi(\sigma)    (1.16)

such that the semi-group structure of the convolution kernel corresponds to mere addition of the scale values when measured in terms of t. For kernels parameterized by \sigma, the semi-group operation then assumes the form

h(\cdot;\, \sigma_1) * h(\cdot;\, \sigma_2) = h(\cdot;\, \varphi^{-1}(\varphi(\sigma_1) + \varphi(\sigma_2)))    (1.17)

where \varphi^{-1} denotes the inverse of \varphi. If zero scale should correspond to the original signal, it must hold that \varphi(0) = 0. To preserve the ordering of scale values, \varphi: \mathbb{R}_+ \to \mathbb{R}_+ must be monotonically increasing.
Proof: Necessity from scale invariance. In analogy with the previous scale-space formulations, let us state that the first stages of processing should be linear and be able to function without any a priori knowledge of the outside world. Combined with scale invariance, this gives the following basic axioms:
- linearity,
- no preferred location (shift invariance),
- no preferred orientation (isotropy),
- no preferred scale (scale invariance).
Recall that any linear and shift-invariant operator can be expressed as a convolution operator, and introduce \sigma \in \mathbb{R}_+ to represent an abstract scale parameter of dimension [length]. Then, we can assume that the scale-space representation I: \mathbb{R}^N \times \mathbb{R}_+ \to \mathbb{R} of any signal f: \mathbb{R}^N \to \mathbb{R} is constructed by convolution with some one-parameter family of kernels h: \mathbb{R}^N \times \mathbb{R}_+ \to \mathbb{R},

I(\cdot;\, \sigma) = h(\cdot;\, \sigma) * f.    (1.18)

In the Fourier domain (\omega \in \mathbb{R}^N) this convolution becomes a product,

\hat{I}(\omega;\, \sigma) = \hat{h}(\omega;\, \sigma) \, \hat{f}(\omega).    (1.19)

Part A: Dimensional analysis, rotational symmetry, and scale invariance. From dimensional analysis it follows that if a physical process is scale independent, it should be possible to express the process in terms of dimensionless variables. These variables can be obtained by using a result in physics known as the Pi-theorem (see e.g. Olver 1986), which states that if the dimensions of the occurring variables are arranged in a table with as many rows as there are physical units and as many columns as there are derived quantities (see next), then the number of independent dimensionless quantities is equal to the number of derived quantities minus the rank of the system matrix. With respect to the linear scale-space representation, the following dimensions and variables occur:

                \hat{I}   \hat{f}   \omega   \sigma
[luminance]        1         1        0        0
[length]           0         0       -1       +1

Obviously, there are four derived quantities and the rank of the matrix is two. Hence, we can e.g. introduce the dimensionless variables \hat{I}/\hat{f} and \omega \sigma.
Using the Pi-theorem, a necessary requirement for (1.19) to reflect a scale invariant process is that (1.19) can be written on the form

\hat{I}(\omega;\, \sigma) = \hat{h}(\omega \sigma) \, \hat{f}(\omega).

Since we require no preference for orientation, it is sufficient to assume that \hat{h} depends on the magnitude of its argument only, and that for some function \hat{H}: \mathbb{R} \to \mathbb{R} with \hat{H}(0) = 1 it holds that

\hat{I}(\omega;\, \sigma) = \hat{H}(|\omega| \sigma) \, \hat{f}(\omega).

Combined with the semi-group structure (1.17), this gives

\hat{H}(|\omega| \sigma_1) \, \hat{H}(|\omega| \sigma_2) = \hat{H}(|\omega| \, \varphi^{-1}(\varphi(\sigma_1) + \varphi(\sigma_2)))

for all \sigma_1, \sigma_2 \in \mathbb{R}_+. It is straightforward to show (see the following paragraph) that scale invariance implies that \varphi must be of the form

\varphi(\sigma) = C \sigma^p    (1.24)

for some arbitrary constants C > 0 and p > 0. Without loss of generality we may take C = 1, since this parameter corresponds to an unessential linear rescaling of t. Then, with \varphi(\sigma) = \sigma^p and \tilde{H}(x^p) = \hat{H}(x) we obtain

\tilde{H}(|\omega \sigma_1|^p) \, \tilde{H}(|\omega \sigma_2|^p) = \tilde{H}(|\omega \sigma_1|^p + |\omega \sigma_2|^p)    (1.25)

which can be recognized as the definition of the exponential function (\psi(\gamma_1) \psi(\gamma_2) = \psi(\gamma_1 + \gamma_2) implies \psi(\gamma) = a^{\gamma} for some a > 0). Consequently,

\tilde{H}(|\omega \sigma|^p) = \exp(\alpha |\omega \sigma|^p)    (1.26)

for some \alpha \in \mathbb{R}, and

\hat{h}(\omega;\, \sigma) = \hat{H}(|\omega| \sigma) = e^{\alpha |\omega \sigma|^p}.    (1.27)

A real solution implies that \alpha must be real. Concerning the sign of \alpha, it is natural to require

\lim_{\sigma \to \infty} \hat{h}(\omega;\, \sigma) = 0    (1.28)

rather than \lim_{\sigma \to \infty} \hat{h}(\omega;\, \sigma) = \infty. This means that \alpha must be negative, and we can without loss of generality set \alpha = -1/2 to preserve consistency with the parameterization t = \sigma^2/2 in the case p = 2. (The functional form of \varphi in equation (1.24) is obtained by integrating \varphi'(\sigma) = C' \sigma^q; since \varphi(0) = 0, the integration constant must be zero, and the singular case q = -1 can be disregarded. The constants C and p must be positive, since \varphi must be positive and increasing.) To summarize, scale invariance combined with the semi-group structure gives convolution kernels of the form

\hat{h}(\omega;\, \sigma) = e^{-|\omega \sigma|^p / 2}    (1.33)

where the parameter p > 0 is undetermined. In the work by Florack et al. (1992), separability in Cartesian coordinates was used as a basic constraint. Except in the one-dimensional case, this fixes h to be a Gaussian.
Since, however, rotational symmetry combined with separability per se is sufficient to fix the function to be a Gaussian, and since the selection of two orthogonal coordinate directions constitutes a very specific choice, it is illuminating to consider the effect of using other choices of p. In recent work, Pauwels et al. (1994) have analyzed properties of these convolution kernels. Here, we shall review some basic results and describe how different additional constraints on h lead to specific values of p.
Powers of \omega that are even integers. Consider first the case when p is an even integer. Using the well-known relation between moments in the spatial domain and derivatives in the frequency domain,

\int_{x \in \mathbb{R}} x^n h(x) \, dx = (-i)^n \hat{h}^{(n)}(0),    (1.34)

it follows that the second moments of h are zero for any even p > 2. Hence, p = 2 is the only even integer that corresponds to a non-negative convolution kernel (recall from section 1.5.4 that non-creation of local extrema implies that the kernel has to be non-negative).
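This can be checked numerically: inverting \hat{h}(\omega) = \exp(-|\omega|^p/2) for p = 2 gives a non-negative (Gaussian) kernel, while for p = 4 the vanishing second moment forces negative lobes. A sketch via direct numerical inversion of the Fourier integral (the grid sizes are arbitrary numerical choices):

```python
import numpy as np

def inv_fourier_kernel(p, x, wmax=10.0, n=8001):
    """Numerically invert h_hat(w) = exp(-|w|^p / 2); h is real and even,
    so h(x) = (1/pi) * integral_0^inf exp(-w^p / 2) cos(w x) dw."""
    w = np.linspace(0.0, wmax, n)
    dw = w[1] - w[0]
    hw = np.exp(-0.5 * w**p)
    return (hw[None, :] * np.cos(np.outer(x, w))).sum(axis=1) * dw / np.pi

x = np.linspace(-8.0, 8.0, 321)
h2 = inv_fourier_kernel(2, x)   # p = 2: Gaussian, non-negative everywhere
h4 = inv_fourier_kernel(4, x)   # p = 4: oscillating negative lobes appear
print(h2.min() >= -1e-6, h4.min() < 0)
```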
Locality of infinitesimal generator. An important requirement on a multi-scale representation is that it should be differentiable with respect to the scale parameter. A general framework for expressing differentiability of semi-groups is in terms of infinitesimal generators (see section 1.7.2 for a review and a scale-space formulation based on this notion). In Pauwels et al. (1994) it is shown that the corresponding multi-scale representations generated by convolution kernels of the form (1.33) have local infinitesimal generators (basically meaning that the multi-scale representations obey differential equations that can be expressed in terms of differential operators only; see section 1.7.2) if and only if the exponent p is an even integer.
Specific choice: Gaussian kernel. In these respects, p = 2 constitutes a very special choice, since it is the only choice that corresponds to both a local infinitesimal generator and a non-negative convolution kernel.
Similarly, p = 2 is the unique choice for which the multi-scale representation satisfies the causality requirement (as will be described in section 1.7.2, a reformulation of the causality requirement in terms of non-enhancement of local extrema implies that the scale-space family must have an infinitesimal generator corresponding to spatial derivatives up to order two).

Other special properties of the Gaussian kernel
The Gaussian kernel has some other special properties. Consider for simplicity the one-dimensional case and define normalized second moments \Delta x and \Delta \omega in the spatial and the Fourier domain, respectively, by

\Delta x = \frac{\int_{-\infty}^{\infty} x^2 |h(x)|^2 \, dx}{\int_{-\infty}^{\infty} |h(x)|^2 \, dx}, \qquad \Delta \omega = \frac{\int_{-\infty}^{\infty} \omega^2 |\hat{h}(\omega)|^2 \, d\omega}{\int_{-\infty}^{\infty} |\hat{h}(\omega)|^2 \, d\omega}.    (1.35)

These measures satisfy the uncertainty relation

\Delta x \, \Delta \omega \geq \tfrac{1}{4}.    (1.36)

A remarkable property of the Gaussian kernel is that it is the only real kernel that gives equality in this relation. The Gaussian kernel is also the frequency function of the normal distribution. The central limit theorem in statistics states that, under rather general requirements on the distribution of a stochastic variable, the distribution of a sum of a large number of such variables asymptotically approaches a normal distribution as the number of terms tends to infinity.

Gaussian derivative operators

Above, it has been shown that by starting from a number of different sets of axioms it is possible to single out the Gaussian kernel as the unique kernel for generating a (linear) scale-space. From this scale-space representation, multi-scale spatial derivative operators can then be defined by

I_{i_1 \ldots i_n}(\cdot;\, t) = \partial_{i_1 \ldots i_n} I(\cdot;\, t) = g_{i_1 \ldots i_n}(\cdot;\, t) * f    (1.37)

where g_{i_1 \ldots i_n}(\cdot;\, t) denotes a (possibly mixed) derivative of some order n = i_1 + \ldots + i_N of the Gaussian kernel. In terms of explicit integrals, the convolution operation (1.37) is written

I_{i_1 \ldots i_n}(x;\, t) = \int_{x' \in \mathbb{R}^N} g_{i_1 \ldots i_n}(x';\, t) \, f(x - x') \, dx'.    (1.38)

Graphical illustrations of such Gaussian derivative kernels in the one-dimensional and two-dimensional cases are given in figures 1.9 and 1.10. Since the Gaussian derivative kernels decay rapidly, the integral (1.38) is guaranteed to converge for any t > 0 under weak conditions on f. This means that although f may not be differentiable of any order, or not even continuous, the result of the Gaussian derivative operator is always well-defined. According to the theory of generalized functions (or Schwartz distributions) (Schwartz 1951; Hörmander 1963), we can then for any t > 0 treat I(\cdot;\, t) = g(\cdot;\, t) * f as infinitely differentiable.

Multi-scale N -jet representation and necessity
Considering the spatial derivatives up to some order N enables characterization of the local image structure up to that order, e.g., in terms of the Taylor expansion of the intensity function. (Figure: discrete derivative approximation kernels, with the bottom row showing the third-order kernels \delta_{xxx}T, \delta_{xxy}T, \delta_{xyy}T, \delta_{yyy}T; qualitatively, these kernels resemble the shape of the continuous Gaussian derivative kernels, although in practice they are defined as discrete derivative approximations using the canonical discretization framework described in section 1.7.) In early work, Koenderink and van Doorn (1987) advocated the use of this so-called multi-scale N-jet signal representation as a model for the earliest stages of visual processing. Then, in (Koenderink and van Doorn 1992) they considered the problem of deriving linear operators from the scale-space representation that are to be invariant under scaling transformations. Their starting point is the relation between the Gaussian kernel and its derivatives, here in one dimension,

\partial_{x^n} g(x;\, \sigma^2) = (-1)^n \frac{1}{\sigma^n} H_n\!\left(\frac{x}{\sigma}\right) g(x;\, \sigma^2),    (1.41)

which follows from the well-known relation between derivatives of the Gaussian kernel and the Hermite polynomials H_n (see table 1.1),

\partial_{x^n}(e^{-x^2}) = (-1)^n H_n(x) \, e^{-x^2}.    (1.42)

(For example, the higher-order entries of table 1.1 read H_6(x) = x^6 - 15x^4 + 45x^2 - 15 and H_7(x) = x^7 - 21x^5 + 105x^3 - 105x in the convention used there.) Starting from an Ansatz of this form, in which a superscript describes the "order" of the function, they considered the problem of deriving operators with a similar scaling behaviour and arrived at the Gaussian derivatives, which obey the cascade smoothing property

g(\cdot;\, t_1) * g_{x^n}(\cdot;\, t_2) = g_{x^n}(\cdot;\, t_2 + t_1).

The latter result is a special case of the more general statement

g_{x^m}(\cdot;\, t_1) * g_{x^n}(\cdot;\, t_2) = g_{x^{m+n}}(\cdot;\, t_2 + t_1),    (1.46)

whose validity follows directly from the commutative property of convolution and differentiation.
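Relation (1.42) can be verified symbolically with NumPy's polynomial utilities (which implement the physicists' Hermite convention): differentiating e^{-x^2} repeatedly yields polynomial prefactors p_n with p_{n+1} = p_n' - 2x p_n, and these should equal (-1)^n H_n.

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial.hermite import herm2poly

# Differentiate exp(-x^2) symbolically: if the n-th derivative is
# p_n(x) exp(-x^2), then p_{n+1} = p_n' - 2 x p_n.
# Relation (1.42) says p_n = (-1)^n H_n.
p = np.array([1.0])                       # p_0 = 1
for n in range(1, 8):
    p = P.polyadd(P.polyder(p), P.polymul([0.0, -2.0], p))
    Hn = herm2poly([0.0] * n + [1.0])     # power-basis coefficients of H_n
    assert np.allclose(p, (-1) ** n * np.asarray(Hn))
print("relation (1.42) verified up to order 7")
```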

Directional derivatives
Let (\cos \varphi, \sin \varphi) represent a unit vector in a certain direction \varphi. From the well-known expression for the n-th order directional derivative \partial_{\varphi}^n of a function I in any direction \varphi,

\partial_{\varphi}^n I = (\cos \varphi \, \partial_x + \sin \varphi \, \partial_y)^n I,    (1.47)

it follows that a directional derivative of order n in any direction can be constructed by linear combination of the partial scale-space derivatives

I_{x_1}, I_{x_2}, I_{x_1 x_1}, I_{x_1 x_2}, I_{x_2 x_2}, I_{x_1 x_1 x_1}, I_{x_1 x_1 x_2}, I_{x_1 x_2 x_2}, I_{x_2 x_2 x_2}, \ldots

of that order. Figure 1.12 shows equivalent derivative approximation kernels of order one and two constructed in this way.
In the terminology of Freeman and Adelson (1990) and Perona (1992), kernels whose outputs are related by linear combinations are said to be "steerable". Note, however, that in this case the "steerable" property is not attributed to the specific choice of the Gaussian kernel; the relation (1.47) holds for any n times continuously differentiable function.
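A numerical illustration of (1.47) for n = 2 (the test function and grid are arbitrary choices): the steered combination \cos^2\varphi \, I_{xx} + 2 \cos\varphi \sin\varphi \, I_{xy} + \sin^2\varphi \, I_{yy} should match a direct second difference of the image along the direction (\cos\varphi, \sin\varphi).

```python
import numpy as np

# Steerability check for relation (1.47) at order n = 2.
def f(x, y):          # a smooth test "image"
    return np.sin(1.3 * x) * np.cos(0.7 * y) + 0.5 * x * y**2

x, y = np.meshgrid(np.linspace(-1, 1, 201), np.linspace(-1, 1, 201),
                   indexing='ij')
h = x[1, 0] - x[0, 0]

fx, fy = np.gradient(f(x, y), h, h)
fxx, fxy = np.gradient(fx, h, h)
_,  fyy = np.gradient(fy, h, h)

phi = 0.6
steered = (np.cos(phi)**2 * fxx + 2 * np.sin(phi) * np.cos(phi) * fxy
           + np.sin(phi)**2 * fyy)

# reference: second difference of f along the direction (cos phi, sin phi)
e = 1e-3
ref = (f(x + e*np.cos(phi), y + e*np.sin(phi)) - 2*f(x, y)
       + f(x - e*np.cos(phi), y - e*np.sin(phi))) / e**2

interior = (slice(5, -5), slice(5, -5))   # exclude one-sided boundary rows
print(np.max(np.abs(steered[interior] - ref[interior])) < 1e-2)
```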

Discrete scale-space
The treatment so far has been concerned with continuous signals. Since real-world signals obtained from standard digital cameras are discrete, however, an obvious problem concerns how to discretize the scale-space theory while still maintaining the scale-space properties.

Non-creation of local extrema
For one-dimensional signals a complete discrete theory can be based on a discrete analogue of the treatment in section 1.5.4. Following Lindeberg (1990), define a discrete kernel h \in \ell^1 to be a discrete scale-space kernel if, for any signal f_{in}, the number of local extrema in f_{out} = h * f_{in} does not exceed the number of local extrema in f_{in}. Using classical results (mainly by Schoenberg 1953; see also Karlin 1968 for a comprehensive summary), it is possible to completely classify those kernels; the primitive smoothing transformations turn out to be:
- moving average or first-order recursive filtering,
  f_{out}(x) = f_{in}(x) + \alpha_i f_{out}(x - 1)    (0 \leq \alpha_i < 1),
  f_{out}(x) = f_{in}(x) + \beta_i f_{out}(x + 1)    (0 \leq \beta_i < 1),
- infinitesimal smoothing or diffusion smoothing (explained below),
- rescaling, and
- translation.
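The defining property is easy to probe numerically. A sketch of the first-order recursive filter above (normalized by 1 - alpha so that constant signals are preserved; the initialization corresponds to extending the input as a constant to the left): the number of local extrema never increases.

```python
import numpy as np

def count_local_extrema(f):
    """Number of strict local extrema of a 1-D sequence (flat runs ignored)."""
    d = np.sign(np.diff(f))
    d = d[d != 0]
    return int(np.sum(d[1:] != d[:-1]))

def recursive_smooth(f, alpha):
    """f_out(x) = f_in(x) + alpha f_out(x - 1), scaled by (1 - alpha) for
    unit DC gain; prev is initialized as if f were constant f[0] to the left."""
    out = np.empty_like(f)
    prev = f[0] / (1.0 - alpha)
    for i, v in enumerate(f):
        prev = v + alpha * prev
        out[i] = prev
    return (1.0 - alpha) * out

rng = np.random.default_rng(1)
ok = True
for _ in range(100):
    f = rng.standard_normal(200)
    ok &= count_local_extrema(recursive_smooth(f, 0.7)) <= count_local_extrema(f)
print(ok)
```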
It follows that a discrete kernel is a scale-space kernel if and only if it can be decomposed into the above primitive transformations. Moreover, the only non-trivial smoothing kernels of finite support arise from generalized binomial smoothing (i.e., non-symmetric extensions of the filter (1.2)). If this definition is combined with a requirement that the family of smoothing transformations must obey a semi-group property (1.12) over scales and possess a continuous scale parameter, then there is in principle only one way to construct a scale-space for discrete signals. Given a signal f: \mathbb{Z} \to \mathbb{R}, the scale-space I: \mathbb{Z} \times \mathbb{R}_+ \to \mathbb{R} is given by

I(x;\, t) = \sum_{n=-\infty}^{\infty} T(n;\, 2t) \, f(x - n)    (1.49)

where T: \mathbb{Z} \times \mathbb{R}_+ \to \mathbb{R} is a kernel termed the discrete analogue of the Gaussian kernel. (The factor 2 in the notation 2t arises due to the use of different parameterizations of the scale parameter. In this book, the scale parameter is related to the standard deviation of the Gaussian kernel by \sigma^2 = 2t, which means that the diffusion equation assumes the form I_t = \nabla^2 I. In a large part of the other scale-space literature, the diffusion equation is written I_t = \frac{1}{2} \nabla^2 I, with the advantage that the scale parameter is the square of the standard deviation of the Gaussian kernel, t = \sigma^2.) A simple interpretation of the discrete analogue of the Gaussian kernel is as follows. Consider the time discretization of the corresponding semi-discretized diffusion equation (1.52) using Euler's explicit method,

I^{(k+1)}(i) = \Delta t \, I^{(k)}(i + 1) + (1 - 2 \Delta t) \, I^{(k)}(i) + \Delta t \, I^{(k)}(i - 1),    (1.53)
where the superscript (k) denotes the iteration index. Assume that the scale-space representation of I at scale t is to be computed by applying this iteration formula using n steps with step size \Delta t = t/n. Then, the discrete analogue of the Gaussian kernel is the limit case of the equivalent convolution kernel

\left( \frac{t}{n}, \; 1 - \frac{2t}{n}, \; \frac{t}{n} \right)^{*n}    (1.54)

when n tends to infinity, i.e., when the number of steps increases and each individual step becomes smaller. This shows that the discrete analogue of the Gaussian kernel can be interpreted as the limit case of iterative application of generalized binomial kernels.
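The limit kernel in (1.54) has a closed form: the discrete analogue of the Gaussian can be written T(n; 2t) = e^{-2t} I_n(2t), where I_n denotes the modified Bessel function of integer order (Lindeberg 1990). A sketch comparing the iterated three-tap Euler kernel with this expression (using SciPy's exponentially scaled Bessel function; the step count and window radius are arbitrary numerical choices):

```python
import numpy as np
from scipy.special import ive   # ive(n, z) = exp(-|z|) * I_n(z)

def discrete_gaussian(t, radius):
    """T(n; 2t) = exp(-2t) I_n(2t), the solution of
    dT_n/dt = T_{n+1} - 2 T_n + T_{n-1} with T(0) = delta."""
    n = np.arange(-radius, radius + 1)
    return ive(np.abs(n), 2.0 * t)

def iterated_euler_kernel(t, steps, radius):
    """steps-fold convolution of the Euler kernel (dt, 1 - 2 dt, dt)."""
    dt = t / steps
    tap = np.array([dt, 1.0 - 2.0 * dt, dt])
    k = np.array([1.0])
    for _ in range(steps):
        k = np.convolve(k, tap)
    centre = steps                       # index of n = 0 in k
    return k[centre - radius: centre + radius + 1]

t, radius = 2.0, 12
exact = discrete_gaussian(t, radius)
approx = iterated_euler_kernel(t, steps=2000, radius=radius)
print(np.max(np.abs(exact - approx)))   # tends to 0 as steps grows
```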
Despite the completeness of these results, and their analogies to the continuous situation, they cannot be extended to higher dimensions. Using similar arguments as in the continuous case, it can be shown that there are no non-trivial kernels in two or higher dimensions that are guaranteed to never introduce new local extrema. Hence, a discrete scale-space formulation in higher dimensions must be based on other axioms.

Non-enhancement and in nitesimal generator
It is clear that the continuous scale-space formulations in terms of causality and scale invariance cannot be transferred directly to discrete signals; there are no direct discrete correspondences to level curves and differential geometry in the discrete case. Neither can the scaling argument be carried out in the discrete situation if a continuous scale parameter is desired, since the discrete grid has a preferred scale given by the distance between adjacent grid points. An alternative way to express the causality requirement in the continuous case, however, is as follows (Lindeberg 1990):

Non-enhancement of local extrema: If for some scale level t_0 a point x_0 is a local maximum for the scale-space representation at that level (regarded as a function of the space coordinates only), then its value must not increase when the scale parameter increases. Analogously, if a point is a local minimum, then its value must not decrease when the scale parameter increases.

It is clear that this formulation is equivalent to the formulation in terms of level curves for continuous data, since if the grey-level value at a local maximum (minimum) were to increase (decrease), then a new level curve would be created. Conversely, if a new level curve is created, then some local maximum (minimum) must have increased (decreased). An intuitive description of this requirement is that it prevents local extrema from being enhanced and from "popping up out of nowhere". In fact, it is closely related to the maximum principle for parabolic differential equations (see, e.g., Widder 1975).

Preliminaries: Infinitesimal generator. If the semi-group structure is combined with a strong continuity requirement with respect to the scale parameter, then it follows from well-known results (Hille and Phillips 1957) that the scale-space family must have an infinitesimal generator (Lindeberg 1990, 1991).
In other words, if a transformation operator T_t from the input signal to the scale-space representation at any scale t is defined by I(\cdot;\, t) = T_t f, then there exists a limit case of this operator, the infinitesimal generator,

A f = \lim_{h \downarrow 0} \frac{T_h f - f}{h}.    (1.57)

Non-enhancement of local extrema implies a second-order infinitesimal generator. By combining the existence of an infinitesimal scale-space generator with the non-enhancement requirement, linear shift-invariance, and spatial symmetry, it can be shown (Lindeberg 1991, 1992) that the scale-space family I: \mathbb{Z}^N \times \mathbb{R}_+ \to \mathbb{R} of a discrete signal f: \mathbb{Z}^N \to \mathbb{R} must satisfy a semi-discrete differential equation

\partial_t I = A_{ScSp} I    (1.58)

for some linear and shift-invariant operator A_{ScSp}. Notably, the locality condition means that A_{ScSp} corresponds to the discretization of derivatives of order up to two. In one and two dimensions, respectively, (1.58) reduces to

\partial_t I = \alpha_1 \nabla_3^2 I,    (1.59)
\partial_t I = \alpha_1 \nabla_5^2 I + \alpha_2 \nabla_{\times}^2 I,    (1.60)

for some constants \alpha_1 \geq 0 and \alpha_2 \geq 0. Here, the symbols \nabla_5^2 and \nabla_{\times}^2 denote the two common discrete approximations of the Laplace operator, defined by (below, the notation f_{-1,+1} stands for f(x - 1, y + 1), etc.):

(\nabla_5^2 f)_{0,0} = f_{-1,0} + f_{+1,0} + f_{0,-1} + f_{0,+1} - 4 f_{0,0},
(\nabla_{\times}^2 f)_{0,0} = \tfrac{1}{2}(f_{-1,-1} + f_{-1,+1} + f_{+1,-1} + f_{+1,+1} - 4 f_{0,0}).

In the particular case when \alpha_2 = 0, the two-dimensional representation is given by convolution with the one-dimensional discrete Gaussian kernel along each dimension. On the other hand, using \alpha_1 = 2 \alpha_2 corresponds to a representation with maximum spatial isotropy in the Fourier domain.
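Non-enhancement can be observed directly in a simulation of (1.59)/(1.60). A sketch with \alpha_2 = 0 and an explicit Euler step for \partial_t I = \nabla_5^2 I (periodic boundaries; with \Delta t \leq 1/4 each update is a convex combination of neighbouring values, so the global maximum cannot increase and the global minimum cannot decrease):

```python
import numpy as np

# Explicit Euler scheme for dI/dt = laplace5(I) on a periodic grid.
def laplace5(I):
    return (np.roll(I, 1, 0) + np.roll(I, -1, 0) +
            np.roll(I, 1, 1) + np.roll(I, -1, 1) - 4.0 * I)

rng = np.random.default_rng(0)
I = rng.standard_normal((64, 64))
dt = 0.2                                 # dt <= 1/4: convex combination
maxima, minima = [I.max()], [I.min()]
for _ in range(50):
    I = I + dt * laplace5(I)
    maxima.append(I.max())
    minima.append(I.min())

max_ok = all(b <= a + 1e-12 for a, b in zip(maxima, maxima[1:]))
min_ok = all(b >= a - 1e-12 for a, b in zip(minima, minima[1:]))
print(max_ok, min_ok)
```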

Discrete derivative approximations
Concerning operators derived from the discrete scale-space, it holds that the scale-space properties transfer to any discrete derivative approximation defined by spatial linear filtering of the scale-space representation. In fact, the converse result is true as well: if derivative approximation kernels are to satisfy the cascade smoothing property

(\delta_{x^n} T)(\cdot;\, t_1) * T(\cdot;\, t_2) = (\delta_{x^n} T)(\cdot;\, t_1 + t_2),    (1.61)

and if similar continuity requirements concerning scale variations are imposed, then by necessity the derivative approximations must also satisfy the semi-discretized diffusion equation (1.58). The specific choice of operators \delta_{x^n} is, however, arbitrary; any linear operator satisfies this relation. Graphs of these kernels at a few levels of scale and for the lowest orders of differentiation are shown in figure 1.9 and figure 1.10.
To summarize, there is a unique and consistent way to define a scale-space representation and discrete analogues of smoothed derivatives for discrete signals, which to a large extent preserves the algebraic structure of the multi-scale N-jet representation in the continuous case.

Scale-space operators and front-end vision
As we have seen, the uniqueness of the Gaussian kernel for scale-space representation can be derived in a variety of different ways: non-creation of new level curves in scale-space, non-creation of new local extrema, non-enhancement of local extrema, and by combining scale invariance with certain additional conditions. Similar formulations can be stated both in the continuous and in the discrete domains. The essence of these results is that the scale-space representation is given by a (possibly semi-discretized) parabolic differential equation corresponding to a second-order differential operator with respect to the spatial coordinates, and a first-order differential operator with respect to the scale parameter.

Scale-space: A canonical visual front-end model
A natural question now arises: Does this approach constitute the only reasonable way to perform the low-level processing in a vision system, and are the Gaussian kernels and their derivatives the only smoothing kernels that can be used? Of course, this question is impossible to answer without further specification of the purpose of the representation and of the tasks the visual system has to accomplish. In any sufficiently specific application it should be possible to design a smoothing filter that in some sense has a "better performance" than the Gaussian derivative model. For example, it is well known that scale-space smoothing leads to shape distortions at edges by smoothing across object boundaries, and also in the estimation of surface shape using algorithms such as shape-from-texture. Hence, it should be emphasized that the theory developed here is rather aimed at describing the principles of the very first stages of low-level processing in an uncommitted visual system aimed at handling a large class of different situations, and in which no or very little a priori information is available.
Then, once initial hypotheses about the structure of the world have been generated within this framework, the intention is that it should be possible to invoke more refined processing, which can compensate for such effects and adapt to the current situation and the task at hand (see section 2.8 in the next chapter as well as following chapters). From the viewpoint of such approaches, the linear scale-space model serves as the natural starting point.

Relations to biological vision
In fact, a certain degree of agreement can be found between the results of this purely theoretical analysis and experimental results on biological vision systems. Neurophysiological studies by Young (1985, 1986, 1987) have shown that there are receptive fields in the mammalian retina and visual cortex whose measured response profiles can be well modeled by Gaussian derivatives. For example, Young models cells in the mammalian retina by kernels termed 'differences of offset Gaussians' (DOOG), which basically correspond to the Laplacian of the Gaussian with an added Gaussian offset term. He also reports cells in the visual cortex whose receptive field profiles agree with Gaussian derivatives up to order four.
Of course, far-reaching conclusions should not be drawn from such a qualitative similarity, since there are also other functions, such as Gabor functions (see section 2.6.2 in the next chapter), that satisfy the recorded data up to the tolerance of the measurements. Nevertheless, it is interesting to note that operators similar to the Laplacian of the Gaussian (center-surround receptive fields) have been reported to be dominant in the retina. A possible explanation concerning the construction of derivatives of other orders from the output of these operators can be obtained from the observation that the original scale-space representation can always be reconstructed if Laplacian derivatives are available at all other scales. If the scale-space representation tends to zero at infinite scale, then it follows from the diffusion equation that

I(x;\, t) = -(I(x;\, \infty) - I(x;\, t)) = -\int_{t'=t}^{\infty} \partial_t I(x;\, t') \, dt' = -\int_{t'=t}^{\infty} \nabla^2 I(x;\, t') \, dt'.

Observe the similarity with the standard method for reconstructing the original signal from a bandpass pyramid (Burt 1981).
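The reconstruction formula can be verified numerically in one dimension. A sketch in the Fourier domain (with the convention I_t = \nabla^2 I, so \hat{I}(\omega; t) = e^{-\omega^2 t} \hat{f}(\omega); the zero-mean test signal guarantees I(x; \infty) = 0, and the integration limits are arbitrary numerical choices):

```python
import numpy as np

n = 256
x = np.arange(n)
f = np.sin(2*np.pi*3*x/n) + 0.3*np.cos(2*np.pi*11*x/n)   # zero-mean signal
fh = np.fft.fft(f)
w2 = (2 * np.pi * np.fft.fftfreq(n))**2                  # omega^2 on the grid

t = 0.5
I_t = np.real(np.fft.ifft(np.exp(-w2 * t) * fh))         # slice at scale t

# midpoint-rule approximation of -integral_t^T laplacian(I)(x; t') dt'
ts = np.linspace(t, 4000.0, 40001)
tm, dts = 0.5 * (ts[:-1] + ts[1:]), np.diff(ts)
acc_hat = np.zeros(n, dtype=complex)
for tmid, dt in zip(tm, dts):
    acc_hat += (-w2) * np.exp(-w2 * tmid) * fh * dt      # Fourier Laplacian
recon = -np.real(np.fft.ifft(acc_hat))

print(np.max(np.abs(recon - I_t)))   # small discretization/truncation error
```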
What remains to be understood is whether there are any particular theoretical advantages of computing the Laplacian of the Gaussian in the first step. Of course, such an operation suppresses any linear illumination gradients, and in a physiological system it may lead to robustness against the loss of some fibers, because there is substantial integration over all available scales. Furthermore, one can contend that spatial derivatives of the Gaussian can be approximated by differences of Gaussian kernels at different spatial positions, and it is therefore, at least in principle, possible to construct any spatial derivative from this representation. Remaining questions concerning the plausibility with respect to biological vision are left to the reader's speculation and further research.

Foveal vision
Another interesting similarity concerns the spatial layout of receptive fields over the visual field. If the scale-space axioms are combined with the assumption of a fixed readout capacity per scale from the visual front-end, it is straightforward to show that there is a natural distribution of receptive fields (of different scales and different spatial positions) over the retina such that the minimum receptive field size grows linearly with eccentricity, that is, with the distance from the center of the visual field (Lindeberg and Florack 1992, 1994). A similar (log-polar) result is obtained when a conformal metric is chosen for a non-linear scale-space (see the chapter by Florack et al. (1994) in this book). There are several results in psychophysics, neuroanatomy and electrophysiology in agreement with such an increase.

Introduction
In the previous chapter a formal justification was given for using linear filtering as an initial step in the early processing of image data (see also section 2.5 in this chapter). More importantly, a catalogue was provided of which filter kernels are natural to use, together with an extensive theoretical explanation of how kernels of different orders and at different scales can be related. This forms the basis of a theoretically well-founded modeling of visual front-end operators with a smoothing effect.
Of course, linear filtering cannot be used as the only component in a vision system aimed at deriving information from image data; some non-linear steps must be introduced into the analysis. More concretely, some mechanism is required for combining the output from the Gaussian derivative operators of different order and at different scales into more explicit descriptors of the image geometry.
This chapter continues the treatment of linear scale-space by showing how different types of early visual operations can be expressed within the scale-space framework. Then, we turn to theoretical properties of linear scale-space and demonstrate how the behaviour of image structures over scales can be analyzed. Finally, it is described how access to additional information suggests situations in which the requirements of uncommitted processing can be relaxed. This open-ended material serves as a natural starting point for the non-linear approaches considered in the following chapters.

Multi-scale feature detection in scale-space
An approach that has been advocated by Koenderink and his co-workers is to describe image properties in terms of differential geometric descriptors, i.e., different (possibly non-linear) combinations of derivatives. A basic motivation for this position is that differential equations and differential geometry constitute natural frameworks for expressing both physical processes and geometric properties. More technically, and as we have seen in section 1.6 of the previous chapter, it can also be shown that spatial derivatives are natural operators to derive from the scale-space representation.
When using such descriptors, it should be observed that a single partial derivative, e.g. I_{x_1}, does not represent any geometrically meaningful information, since its value is crucially dependent on the arbitrary choice of coordinate system. In other words, it is essential to base the analysis on descriptors that do not depend on the actual coordinatization of the spatial and intensity domains. Therefore, it is natural to require the representation to be invariant with respect to primitive transformations such as translations, rotations, scale changes, and certain intensity transformations. In fact, quite a few types of low-level operations can be expressed in terms of such multi-scale differential invariants defined from (non-linear) combinations of Gaussian derivatives at multiple scales. Examples of these are feature detectors, feature classification methods, and primitive shape descriptors. In this sense, the scale-space representation can be used as a basis for early visual operations. (In fact, it would be desirable to directly compute features that are invariant under perspective transformations. Since, however, this problem is known to be much harder, most work has so far been restricted to invariants of two-dimensional Euclidean operations and natural linear extensions thereof, such as uniform rescaling and affine transformations of the spatial coordinates. For an overview of geometric invariance applied to computer vision, see the book edited by Mundy and Zisserman (1992). An excellent discussion of the invariant properties of the diffusion equation is found in Olver (1986); concerning analysis of differential invariants, see also the chapter by Olver et al. (1994) in this book.)

Differential geometry and differential invariants
Florack et al. (1992, 1993) and Kanatani (1990) have pursued this approach of deriving differential invariants in an axiomatic manner, and considered image properties defined in terms of directional derivatives along certain preferred coordinate directions. If the direction, along which a directional derivative is computed, can be uniquely defined from the intensity pattern, then rotational invariance is obtained automatically, since the preferred direction follows any rotation of the coordinate system. Similarly, any derivative is translationally invariant. These properties hold both concerning transformations of the original signal f and the scale-space representation I of f generated by rotationally symmetric Gaussian smoothing.
Detailed studies of differential geometric properties of two-dimensional and three-dimensional scalar images are presented by Salden et al. (1991), who make use of classical techniques from differential geometry (Spivak 1975; Koenderink 1990), algebraic geometry, and invariant theory (Grace and Young 1965; Weyl 1946) for classifying geometric properties of the N-jet of a signal at a given scale in scale-space.
Here, a short description will be given of some elementary results. Although the treatment will be restricted to the two-dimensional case, the ideas behind it are general and can easily be extended to higher dimensions. For more extensive treatments, see also (ter Haar Romeny et al.; Florack 1993).
Local directional derivatives. One choice of preferred directions is to introduce a local orthonormal coordinate system $(u, v)$ at any point $P_0$, where the v-axis is parallel to the gradient direction at $P_0$, and the u-axis is perpendicular, i.e. $e_v = (\cos\beta, \sin\beta)^T$, where $\beta$ denotes the angle of the gradient direction. In terms of Cartesian coordinates, which arise frequently in standard digital images, these local directional derivative operators can be written

$$\partial_u = \sin\beta\,\partial_x - \cos\beta\,\partial_y, \qquad \partial_v = \cos\beta\,\partial_x + \sin\beta\,\partial_y. \qquad (2.2)$$

This $(u, v)$-coordinate system is characterized by the fact that one of the first-order directional derivatives, $I_u$, is zero.
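As a concrete illustration, the gauge derivatives can be computed pointwise from the gradient direction; the following is a minimal numerical sketch (function and variable names are our own, not from the text):

```python
import numpy as np

def directional_derivatives(I, dx=1.0):
    """Gauge derivatives I_u, I_v of a 2-D image, computed pointwise
    from the local gradient direction (a numerical sketch)."""
    Iy, Ix = np.gradient(I, dx)       # np.gradient returns d/drow, d/dcol
    grad_mag = np.hypot(Ix, Iy)
    eps = 1e-12                       # avoid division by zero
    cosb = Ix / (grad_mag + eps)      # e_v = (cos b, sin b)^T, gradient direction
    sinb = Iy / (grad_mag + eps)
    Iv = cosb * Ix + sinb * Iy        # derivative along the gradient = |grad I|
    Iu = sinb * Ix - cosb * Iy        # derivative perpendicular to it = 0
    return Iu, Iv

# Example: a smooth ramp-like pattern with non-vanishing gradient
x, y = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64))
I = x**2 + 0.5 * y
Iu, Iv = directional_derivatives(I)
# By construction I_u vanishes and I_v equals the gradient magnitude.
```

This makes the rotational invariance tangible: whatever the orientation of the pattern, $I_u$ is identically zero and $I_v$ carries the full gradient strength.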
Another natural choice is a coordinate system in which the mixed second-order derivative is zero; such coordinates are named $(p, q)$ by Florack et al.
(1992). In these coordinates, in which $I_{pq} = 0$, the explicit expressions for the directional derivatives become slightly more complicated (see the references above for explicit expressions). A main advantage of expressing differential expressions in terms of such gauge coordinates is that the closed-form expressions for many differential invariants become simple, since a large number of terms disappear.
Monotonic intensity transformations. One approach to deriving differential invariants is to require the differential entities to be invariant with respect to arbitrary monotonic intensity transformations.² Then, any property that can be expressed in terms of the level curves of the signal is guaranteed to be invariant. A classification of such invariants has been given by Florack et al. (1992) and Kanatani (1990), which goes back to the classical classification of polynomial invariants by Hilbert (1893).

Affine intensity transformations. Another approach is to restrict the invariance to affine intensity transformations. Then, the class of invariants becomes larger. A natural condition to impose is that a differential expression $\mathcal{D}I$ should (at least) be a relative invariant with respect to scale changes, i.e., under a rescaling of the spatial coordinates, $I'(x) = I(sx)$, the differential entity should transform as $\mathcal{D}I' = s^k \mathcal{D}I$ for some k. Trivially, this relation holds for any product of mixed directional derivatives, and it extends to sums (and rational functions) of such expressions provided that the sum of the orders of differentiation is the same for any product of derivatives constituting one term in a sum.
To give a formal description of this property, let $I_{u^m v^n} = I_w$ denote a mixed directional derivative of order $|w| = m + n$, and let $\mathcal{D}$ be a (possibly non-linear) homogeneous differential expression formed from sums of products of such derivatives, in which the total order of differentiation is the same for every term.

Tensor notation for invariant expressions. A useful analytical tool when dealing with differential invariants is to express them in terms of tensor notation (Abraham 1988; Lawden 1962). Adopt the Einstein summation convention that double occurrence of a certain index means that summation is to be performed over that index. Furthermore, let $\delta_{ij}$ be the symmetric Kronecker tensor and let $\epsilon_{ij\ldots}$ represent the antisymmetric Levi-Civita connection (see Kay 1988).³ Then, the derivative operators along the u- and v-directions can be written

$$\partial_u = \frac{\epsilon_{ij}\, I_j\, \partial_i}{\sqrt{I_k I_k}}, \qquad \partial_v = \frac{I_i\, \partial_i}{\sqrt{I_k I_k}}.$$

Feature detection from di erential singularities
The singularities (zero-crossings) of differential invariants play an important role in feature detection. This is a special case of a more general principle of using zero-crossings of differential geometric expressions for describing geometric features; see e.g. Bruce and Giblin (1984) for an excellent tutorial. If a feature detector can be expressed as a zero-crossing of such a differential expression, then the feature will also be absolutely invariant to uniform rescalings of the spatial coordinates, i.e. size changes.
Formally, this invariance property can be expressed as follows: Let $S_{\mathcal{D}I}$ denote the singularity set of a differential operator of the form (2.5), i.e.

$$S_{\mathcal{D}I} = \{(x; t) \in \mathbb{R}^2 \times \mathbb{R}_+ : \mathcal{D}I(x; t) = 0\},$$

and let G be the Gaussian smoothing operator, i.e., $I = Gf$. Under translations, rotations and uniform rescalings of the spatial domain (represented by $x \in \mathbb{R}^2$), this singularity set is preserved. In other words, feature detectors formulated in terms of differential singularities by definition commute with a number of elementary transformations, and it does not matter whether the transformation is performed before or after the smoothing step. A few examples of feature detectors that can be expressed in this way are discussed below.
Examples of feature detectors. Edge detection. A natural way to define edges from a continuous grey-level image $I: \mathbb{R}^2 \to \mathbb{R}$ is as the union of the points for which the gradient magnitude assumes a maximum in the gradient direction. This method is usually referred to as non-maximum suppression (see e.g. Canny (1986) or Korn (1988)). Assuming that the second- and third-order directional derivatives of I in the v-direction are not simultaneously zero, a necessary and sufficient condition for $P_0$ to be a gradient maximum in the gradient direction may be stated as:

$$I_{vv} = 0, \qquad I_{vvv} < 0.$$

Since only the sign information is important, this condition can be restated as

$$I_v^2 I_{vv} = I_x^2 I_{xx} + 2 I_x I_y I_{xy} + I_y^2 I_{yy} = 0,$$
$$I_v^3 I_{vvv} = I_x^3 I_{xxx} + 3 I_x^2 I_y I_{xxy} + 3 I_x I_y^2 I_{xyy} + I_y^3 I_{yyy} < 0.$$

Interpolating for zero-crossings of $I_{vv}$ within the sign-constraints of $I_{vvv}$ gives a straightforward method for sub-pixel edge detection; see figure 2.2 for an illustration.

Junction detection. An entity commonly used for junction detection is the curvature of level curves in intensity data, see e.g. Kitchen and Rosenfeld (1982) or Koenderink and Richards (1988). In terms of directional derivatives, the level curve curvature rescaled by the gradient magnitude (to avoid singularities where the gradient vanishes) can be written

$$|\tilde\kappa| = |I_v^2 I_{uu}| = |I_y^2 I_{xx} - 2 I_x I_y I_{xy} + I_x^2 I_{yy}|. \qquad (2.11)$$

Since the sum of the order of differentiation with respect to x and y is the same for all terms in this sum, it follows that junction candidates given by extrema in $\tilde\kappa$ also are invariant under skew transformations (Blom 1992) and affine transformations (see the chapter by Olver et al. (1994)).
Assuming that the first- and second-order differentials of $\tilde\kappa$ are not simultaneously degenerate, a necessary and sufficient condition for a point $P_0$ to be a maximum in this rescaled level curve curvature is that:

$$\partial_u \tilde\kappa = 0, \qquad \partial_v \tilde\kappa = 0, \qquad \mathcal{H}(\tilde\kappa) = \tilde\kappa_{uu}\tilde\kappa_{vv} - \tilde\kappa_{uv}^2 > 0, \qquad \operatorname{sign}(\tilde\kappa)\, \tilde\kappa_{uu} < 0.$$

Interpolating for simultaneous zero-crossings in $\partial_u \tilde\kappa$ and $\partial_v \tilde\kappa$ gives a sub-pixel corner detector.
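The differential edge definition ($I_{vv} = 0$, $I_{vvv} < 0$) can be checked numerically on a synthetic edge; a minimal sketch (all names are ours):

```python
import numpy as np

# Minimal numerical check of the edge conditions I_vv = 0, I_vvv < 0 on a
# synthetic edge I(x, y) = tanh(x).
x = np.linspace(-2.0, 2.0, 401)
h = x[1] - x[0]
I = np.tanh(x)                       # gradient points along x, so v = x here
Ix = np.gradient(I, h)
Ixx = np.gradient(Ix, h)
Ixxx = np.gradient(Ixx, h)

# With I_y = 0 the sign conditions reduce to I_v^2 I_vv = I_x^2 I_xx and
# I_v^3 I_vvv = I_x^3 I_xxx.
Ivv_tilde = Ix**2 * Ixx
Ivvv_tilde = Ix**3 * Ixxx

i = len(x) // 2                      # the true edge is at x = 0
# I_vv changes sign across x = 0 while I_vvv stays negative there, so the
# zero-crossing localizes the edge.
```

Interpolating the zero-crossing of `Ivv_tilde` between neighbouring samples is exactly the sub-pixel localization step described above.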
Junction detectors of higher order can be derived algebraically (Salden et al. 1992) by expressing the local structure up to some order in terms of its (truncated) local Taylor expansion and by studying the roots (i.e., the discriminant) of the corresponding polynomial.

Blob detection. Zero-crossings of the Laplacian have been used for stereo matching (see, e.g., Marr 1982) and blob detection (see, e.g., Blostein and Ahuja 1987). Blob detection methods can also be formulated in terms of local extrema of the grey-level landscape (Lindeberg 1991) and extrema of the Laplacian (Lindeberg and Gårding 1993).
Analysis: "edge detection" using zero-crossings of the Laplacian. Zero-crossings of the Laplacian have also been used for edge detection, although the localization is poor at curved edges. This can be understood from the relation between the Laplace operator and the second derivative in the gradient direction (obtained from (2.10) and (2.14)),

$$\nabla^2 I = I_{uu} + I_{vv} = I_{vv} + \kappa I_v, \qquad (2.15)$$

which shows that the deviation between zero-crossings of $\nabla^2 I$ and zero-crossings of $I_{vv}$ increases with the isophote curvature $\kappa$. This example constitutes a simple indication of how theoretical analysis of feature detectors becomes tractable when expressed in terms of the suggested differential geometric framework.

Scale selection
Although the scale-space theory presented so far provides a well-founded framework for dealing with image structures at different scales and can be used for formulating multi-scale feature detectors, it does not directly address the problem of how to select appropriate scales for further analysis.
Whereas the problem of selecting "the best scale(s)" for handling a real-world data set may be intractable unless at least some a priori information about the scene contents is available, in many situations a mechanism is required for generating hypotheses about interesting scale levels.
One general method for scale selection has been proposed based on the evolution over scales of (possibly non-linear) combinations of normalized derivatives, defined by

$$\partial_{\xi_i} = \sqrt{t}\; \partial_{x_i}, \qquad (2.16)$$

where the normalized coordinates

$$\xi = x/\sqrt{t} \qquad (2.17)$$

are the spatial correspondences to the dimensionless frequency coordinates $\omega$ considered in section 1.5.6 in the previous chapter. The basic idea of the scale selection method is to select scale levels from the scales at which differential geometric entities based on normalized derivatives assume local maxima over scales. The underlying motivation for this approach is to select the scale level(s) where the operator response is at its strongest. Theoretical support can also be obtained from the fact that for a large class of polynomial differential invariants (homogeneous differential expressions of the form (2.5)) such extrema over scales have a nice behaviour under rescalings of the input signal: If a normalized differential invariant $\mathcal{D}_{norm} I$ assumes a maximum over scales at a certain point $(x_0; t_0)$ in scale-space, and a rescaled signal $f'$ is defined by $f'(sx) = f(x)$, then a scale-space maximum in the corresponding normalized differential entity $\mathcal{D}_{norm} I'$ is assumed at $(s x_0; s^2 t_0)$.
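As an illustration of the idea, for a Gaussian blob the normalized Laplacian response at the blob centre can be written in closed form, and its maximum over scales falls at the blob's own variance. The following sketch (ours; the closed-form expression is derived from the fact that smoothing a Gaussian of variance $t_0$ with a Gaussian of variance $t$ gives a Gaussian of variance $t_0 + t$) verifies this numerically:

```python
import numpy as np

# For a 2-D Gaussian blob of variance t0, the scale-space representation at
# scale t is a Gaussian of variance t0 + t, and the magnitude of the
# normalized Laplacian at the blob centre is t / (pi * (t0 + t)**2).
t0 = 4.0
scales = np.linspace(0.5, 20.0, 1000)
response = scales / (np.pi * (t0 + scales) ** 2)
t_selected = scales[np.argmax(response)]
# The maximum over scales is attained at t = t0, i.e. at the blob's variance.
```

Differentiating $t/(t_0+t)^2$ with respect to $t$ gives a zero exactly at $t = t_0$, which is what makes the maximum-over-scales criterion a size estimator.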
Example: Junction detection with automatic scale selection. In junction detection, a useful entity for selecting detection scales is the normalized rescaled curvature of level curves,

$$\tilde\kappa_{norm} = t^2\, |\nabla I|^2\, I_{uu}. \qquad (2.18)$$

Figure 2.5 shows the result of detecting scale-space maxima (points that are simultaneously maxima with respect to variations of both the scale parameter and the spatial coordinates) of this normalized differential invariant. Observe that a set of junction candidates is generated with reasonable interpretation in the scene. Moreover, the circles (with their areas equal to the detection scales) give natural regions of interest around the candidate junctions.

Second stage selection of localization scale. Whereas this junction detector is conceptually very clean, it can certainly lead to poor localization, since shape distortions may be substantial at coarse scales in scale-space. A straightforward way to improve the location estimate is by determining the point x that minimizes the (perpendicular) distance to all lines in a neighbourhood of the junction candidate $x_0$. By defining these lines with the gradient vectors as normals and by weighting each distance by the pointwise gradient magnitude, this can be expressed as a standard least squares problem (Förstner and Gülch 1987),

$$\min_{x \in \mathbb{R}^2}\; x^T A\, x - 2 x^T b + c \quad \Longleftrightarrow \quad A\, x = b, \qquad (2.19)$$

where $x = (x_1, x_2)^T$, $w_{x_0}$ is a window function centred at $x_0$, and A, b, and c are weighted moments of the image gradients formed with this window function. The localization estimate can be improved in this way by using a Gaussian window function with scale value equal to the detection scale, and by selecting the localization scale that minimizes the normalized residual

$$d_{min} = (c - b^T A^{-1} b)/\operatorname{trace} A \qquad (2.23)$$

over scales. This procedure has been applied iteratively five times, and those points for which the procedure did not converge after five iterations have been suppressed. Notice that a sparser set of junction candidates is obtained and how the localization is improved.

Cues to surface shape (texture and disparity)
So far we have been concerned with the theory of scale-space representation and its application to feature detection in image data. A basic functionality of a computer vision system, however, is the ability to derive information about the three-dimensional shape of objects in the world.
Whereas a common approach, historically, has been to compute two-dimensional image features (such as edges) in a first processing step, and then to combine these into a three-dimensional shape description (e.g., by stereo or model matching), we shall here consider the problem of deriving shape cues directly from image data, using only the types of front-end operations that can be expressed within the scale-space framework.
Examples of work in this direction have been presented by Jones and Malik (1992), Lindeberg and Gårding (1993), Malik and Rosenholtz (1993), and Gårding and Lindeberg (1994). A common characteristic of these methods is that they are based on measurements of the distortions that surface patterns undergo under perspective projection, a problem which is simplified by considering the locally linearized component, leading to computation of cues to surface shape from measurements of local affine distortions.
Measuring local affine distortions. The method by Lindeberg and Gårding (1993) is based on an image texture descriptor called the windowed second-moment matrix. With I denoting the image brightness, it is defined by

$$\mu_I(q) = \int_{x' \in \mathbb{R}^2} (\nabla I)(x')\, (\nabla I)^T(x')\; g(q - x')\; dx'. \qquad (2.24)$$

With respect to measurements of local affine distortions, this image descriptor transforms as follows: Define R by $I(\xi) = R(B\xi)$, where B is an invertible 2×2 matrix representing a linear transformation. Then,

$$\mu_I(q) = B^T \mu_R(p)\, B, \qquad (2.25)$$

where $\mu_R(p)$ is the second-moment matrix of R expressed at $p = Bq$ and computed using the "backprojected" normalized window function $w'(\xi - p) = (\det B)^{-1}\, w(\xi - q)$.

Shape-from-texture and disparity gradients. Given two measurements of $\mu_I$ and $\mu_R$, the relation (2.25) can be used for recovering B (up to an arbitrary rotation). This gives a direct method for deriving surface orientation from monocular cues, by imposing specific assumptions on $\mu_R$, e.g., that $\mu_R$ should be a constant times the unit matrix (weak isotropy), or that $\det \mu_R$ should be locally constant (constant area).
Similarly, if two cameras fixate the same surface structure, a direct estimate of surface orientation can be obtained provided that the vergence angle is known. Figure 2.7 shows surface orientation estimates computed in this way. Note that for this image the weak isotropy assumption gives the orientation of the individual flowers, whereas the constant area assumption reflects the orientation of the underlying surface. Figure 2.8 shows corresponding results for stereo data.
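The windowed second-moment matrix is straightforward to compute numerically; a minimal sketch (ours), illustrating its anisotropy for an oriented pattern:

```python
import numpy as np

# Windowed second-moment matrix of an oriented pattern.  For vertical
# stripes I = sin(k x) the matrix is strongly anisotropic, with the
# dominant eigenvector along the x-direction.
n = 128
x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="xy")
I = np.sin(0.5 * x)

Iy, Ix = np.gradient(I)               # np.gradient returns d/drow, d/dcol
sigma = 20.0                          # Gaussian window centred in the image
w = np.exp(-((x - n / 2) ** 2 + (y - n / 2) ** 2) / (2 * sigma**2))
mu = np.array([
    [np.sum(w * Ix * Ix), np.sum(w * Ix * Iy)],
    [np.sum(w * Ix * Iy), np.sum(w * Iy * Iy)],
])
eigvals = np.linalg.eigvalsh(mu)      # ascending order
anisotropy = eigvals[0] / eigvals[1]
# For a perfectly oriented pattern the smaller eigenvalue is ~0.
```

Under an affine distortion of the pattern the matrix transforms as in (2.25), which is what allows the distortion to be recovered from two such measurements.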

Behaviour across scales: Deep structure
The treatment so far has been concerned with the formal definition of the scale-space representation and the definition of image descriptors at any single scale. An important problem, however, concerns how to relate structures at different scales. This subject has been termed deep structure by Koenderink (1984). When a pattern is subject to scale-space smoothing, its shape changes. This gives rise to the notion of dynamic shape, which, as has been argued, is an essential component of any shape description of natural objects.

Iso-intensity linking
An early suggestion by Koenderink (1984) for relating structures at different scales was to identify points across scales that have the same grey-level and correspond to paths of steepest ascent along level surfaces in scale-space.
Since the tangent vectors of such paths must be in the tangent plane to the level surface, and the spatial component must be parallel to $(I_x, I_y)$, these iso-intensity paths are the integral paths of the vector field

$$(I_x I_t,\; I_y I_t,\; -(I_x^2 + I_y^2)). \qquad (2.26)$$

Lifshitz and Pizer (1990) considered such paths in scale-space, and constructed a multi-scale "stack" representation, in which the grey-level at which an extremum disappeared was used for defining a region in the original image by local thresholding on that grey-level.
Although the representation was demonstrated to be applicable to certain segmentation problems in medical image analysis, Lifshitz and Pizer observed the serious problem of non-containment, which basically means that a point, which at one scale has been classified as belonging to a certain region (associated with a local maximum), can escape from that region when the scale parameter increases. Moreover, such paths can be intertwined in a rather complicated way.

Feature based linking (differential singularities)
The main cause of the problems in iso-intensity linking is that the grey-levels corresponding to a feature tracked over scales change under scale-space smoothing. For example, concerning a local extremum, it is a necessary consequence of the diffusion equation that the grey-level value at the maximum point must decrease with scale. For this reason, it is more natural to identify features across scales rather than grey-levels. A type of representation defined in this way is the scale-space primal sketch of blob-like image structures (extrema with extent) defined at all scales in scale-space and linked into a tree-like data structure (Lindeberg 1991). More generally, consider a feature that at any level of scale is defined by

$$h(x; t) = 0 \qquad (x \in \mathbb{R}^N,\; t \in \mathbb{R}_+) \qquad (2.27)$$

for some function $h: \mathbb{R}^N \times \mathbb{R}_+ \to \mathbb{R}^M$. (For example, the differential singularities considered in section 2.2.2 are of this form.) Using the implicit function theorem, it is then formally easy to analyze the dependence of x on t in the solution to (2.27). Here, some simple examples will be presented of how such analysis can be performed; see Lindeberg (1992) for a more extensive treatment. Consider, for simplicity, data given as two-dimensional images. Then, it is sufficient to study the cases where M is either 1 or 2.

Pointwise entities
If M = 2, the features will in general be isolated points. The implicit function theorem states that these points form smooth paths across scales (one-dimensional curves in three-dimensional scale-space) provided that the Jacobian $\partial_x h$ is non-degenerate. The drift velocity along such a path can be written

$$\partial_t x = -(\partial_x h)^{-1}\, \partial_t h.$$

Critical points. Concerning critical points in the grey-level landscape, we have $h = (I_x, I_y)^T$, and the drift velocity can be written

$$\partial_t x = -\tfrac{1}{2}\, (\mathcal{H}I)^{-1}\, \nabla(\nabla^2 I),$$

where $\mathcal{H}I$ denotes the Hessian matrix of I, and where the fact that the spatial derivatives satisfy the diffusion equation has been used for replacing derivatives of I with respect to t by derivatives with respect to x.
Other structures. A similar analysis can be performed for other types of point structures, e.g. junctions given as maxima in $\tilde\kappa$, although the expressions then contain derivatives of higher order. Concerning $\tilde\kappa$, the drift velocity contains derivatives up to order five. This result gives an estimate of the drift velocity of features due to scale-space smoothing, and provides a theoretical basis for relating and, hence, linking corresponding features across scales in a well-defined manner. This analysis can be used for stating a formal description of the edge focusing method developed by Bergholm (1987), in which edges are detected at a coarse scale and then tracked to finer scales; see also Clark (1989) and Lu and Jain (1989) concerning the behaviour of edges in scale-space. (An extension of the edge focusing idea is also presented in the chapter by Richardson and Mitter (1994) in this book.)

Linking in pyramids. Note the qualitative difference between linking across scales in the scale-space representation of a signal and the corresponding problem in a pyramid. In the first case, the linking process can be expressed in terms of differential equations, while in the second case it corresponds to a combinatorial matching problem. It is well known that it is a hard algorithmic problem to obtain stable links between features in different layers of a pyramid.
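The drift of critical points can also be observed directly in a simple 1-D experiment; in the following sketch (ours) the scale-space of a sum of Gaussian blobs is available in closed form, and the position of the dominant maximum is tracked over scales:

```python
import numpy as np

# Drift of a local maximum under smoothing.  Smoothing a Gaussian of
# variance s with a Gaussian of variance t gives a Gaussian of variance
# s + t (with amplitude sqrt(s/(s+t))), so the scale-space of a sum of
# Gaussian blobs is known in closed form.
def L(x, t):
    blob = lambda m, s, a: a * np.sqrt(s / (s + t)) * np.exp(-(x - m) ** 2 / (2 * (s + t)))
    return blob(0.0, 1.0, 1.0) + blob(3.0, 1.0, 0.7)

x = np.linspace(-2.0, 5.0, 70001)
pos = [x[np.argmax(L(x, t))] for t in (0.0, 1.0, 2.0)]
# The maximum (initially at x ~ 0) drifts towards the neighbouring blob as
# the scale parameter t increases, consistent with a finite drift velocity.
```

Differentiating the condition $L_x = 0$ with respect to $t$ for this 1-D case gives the drift velocity $\partial_t x = -I_{xxx}/(2 I_{xx})$, the one-dimensional analogue of the formula above.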

Bifurcations in scale-space
The previous section states that the scale linking is well-defined whenever the appropriate submatrix of the Jacobian of h is non-degenerate. When the Jacobian degenerates, bifurcations may occur.
Concerning critical points in the grey-level landscape, the situation is simple. In the one-dimensional case, the generic bifurcation event is the annihilation of a pair consisting of a local maximum and a minimum point, while in the two-dimensional case a pair consisting of a saddle point and an extremum can be both annihilated and created with increasing scale. A natural model of this so-called fold singularity is the polynomial

$$I(x; t) = x_1^3 + 3 x_1 (t - t_0) + \sum_{i=2}^{N} (x_i^2 + t - t_0), \qquad (2.31)$$

which also satisfies the diffusion equation; see also Poston and Stewart (1978), Lifshitz and Pizer (1990), and Lindeberg (1992). The positions of the critical points are given by

$$x_1(t) = \pm\sqrt{t_0 - t} \qquad (x_i = 0,\; i > 1), \qquad (2.32)$$

i.e. the critical points merge along a parabola, and the drift velocity tends to infinity at the bifurcation point. Johansen (1994) gives a more detailed differential geometric study of such bifurcations, covering also a few cases that are generically unstable when treated in a single image. Under more general parameter variations, however, such as in image sequences, these singularities can be expected to be stable in the sense that a small disturbance of the original signal causes the singular point to appear at a slightly different time moment.
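The properties of the fold model (2.31) can be verified symbolically; a small sketch (ours) for the case N = 2:

```python
import sympy as sp

# Symbolic check that the fold-singularity polynomial satisfies the
# diffusion equation dI/dt = (1/2) laplacian(I), and that its critical
# points merge along the parabola x1 = +/- sqrt(t0 - t).
x1, x2, t, t0 = sp.symbols('x1 x2 t t0', real=True)
I = x1**3 + 3*x1*(t - t0) + (x2**2 + t - t0)          # N = 2 case of (2.31)

heat_residual = sp.simplify(
    sp.diff(I, t) - sp.Rational(1, 2) * (sp.diff(I, x1, 2) + sp.diff(I, x2, 2))
)
critical_x1 = sp.solve(sp.diff(I, x1), x1)            # 3*x1**2 + 3*(t - t0) = 0
# heat_residual is identically zero; the critical points are +/- sqrt(t0 - t),
# so dx1/dt blows up as t approaches t0, as stated in the text.
```

The same computation goes through for general N, since each quadratic term $x_i^2 + t - t_0$ satisfies the diffusion equation on its own.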

Scale sampling
Although the scale-space concept comprises a continuous scale parameter, it is necessary to actually compute the smoothed representations at some discrete set of sampled scale levels. The fact that drift velocities may (momentarily) tend to infinity indicates that, in general, some mechanism for adaptive scale sampling must be used. Distributing scale levels over scales is closely related to the problem of measuring scale differences. From the dimensional analysis in section 1.5.6 in the previous chapter, it follows that $\sigma = \sqrt{t}$ provides a natural unit of length at scale t, and that a natural dimensionless scale parameter is

$$\tau = \log(\sigma/\sigma_0) \qquad (2.33)$$

for some reference scale $\sigma_0 > 0$. This can be obtained directly from scale invariance: If the scale parameter $\sigma$ (which is measured in units of [length]) is to be parametrized by a dimensionless scale parameter $\tau$, then scale invariance or self-similarity implies that $d\sigma/\sigma$ must be the differential of a dimensionless variable (see section 1.5.6 in the previous chapter and Florack et al. 1992).
Without loss of generality, one can let $d\tau = d\sigma/\sigma$ and select a specific reference scale $\sigma_0$ to correspond to $\tau = 0$. Hence, we obtain (2.33).
In a later chapter in this book, Eberly (1994) pursues this idea further and considers the problem of combining measurements of scale differences in terms of $d\tau = d\sigma/\sigma$ with measurements of normalized spatial differences in terms of $d\xi = dx/\sigma$.
Discrete signals. Some more care must be taken if the lifetime of a structure in scale-space is to be used for measuring significance in discrete signals, since otherwise a structure existing in the original signal (assigned scale value zero) would be assigned an infinite lifetime. An analysis in (Lindeberg 1991) shows that a natural way to introduce such a scale parameter for discrete signals is by

$$\tau(t) = A + B \log p(t), \qquad (2.34)$$

where p(t) constitutes a measure of the "amount of structure" in a signal at scale t. For practical purposes, this measure is taken as the density of local extrema in a set of reference data.
Continuous vs. discrete models. Under rather general conditions on a one-dimensional continuous signal, it holds that the number of local extrema in a signal decreases with scale as $p(t) \sim t^{-\alpha}$ for some $\alpha > 0$ (see section 2.7).
This means that $\tau(t)$ given by (2.34) reduces to (2.33). For discrete signals, on the other hand, $\tau(t)$ is approximately linear at fine scales and approaches the logarithmic behaviour asymptotically as t increases.
In this respect, the latter approach provides a well-defined way to extend the notion of effective scale to comprise both continuous and discrete signals, as well as to model the transition from the genuinely discrete behaviour at fine scales to the increasing validity of the continuous approximation at coarser scales.

Regularization properties of scale-space kernels
According to Hadamard, a problem is said to be well-posed if: (i) a solution exists, (ii) the solution is unique, and (iii) the solution depends continuously on the input data. It is well known that several problems in computer vision are ill-posed; one example is differentiation. A small disturbance of a signal, $f(x) \mapsto f(x) + \varepsilon \sin \omega x$, where $\varepsilon$ is small and $\omega$ is large, can lead to an arbitrarily large disturbance in the derivative,

$$f_x(x) \mapsto f_x(x) + \varepsilon\omega \cos \omega x, \qquad (2.35)$$

provided that $\omega$ is sufficiently large relative to $1/\varepsilon$.
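This amplification is easy to demonstrate numerically; in the following sketch (ours) the perturbation of the computed derivative grows in proportion to the frequency $\omega$ of the disturbance:

```python
import numpy as np

# Ill-posedness of differentiation: a tiny perturbation eps*sin(w*x) of the
# input produces a perturbation of size ~eps*w in the derivative, which
# grows without bound as w increases.
x = np.linspace(0.0, 2.0 * np.pi, 200001)
h = x[1] - x[0]
f = np.sin(x)
eps = 1e-3

def deriv_error(w):
    fp = f + eps * np.sin(w * x)
    # maximum deviation of the numerical derivative from the true f'(x)
    return np.max(np.abs(np.gradient(fp, h) - np.cos(x)))

errors = [deriv_error(w) for w in (10.0, 100.0, 1000.0)]
# The errors scale roughly as eps * w, even though the input perturbation
# has the same (tiny) amplitude eps in every case.
```

The point is that the input perturbation is bounded by $\varepsilon$ in the supremum norm in all three cases, yet the derivative error can be made arbitrarily large by increasing $\omega$.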
Regularization is a technique that has been developed for transforming ill-posed problems into well-posed ones; see Tikhonov and Arsenin (1977) for an extensive treatment of the subject. Torre and Poggio (1986) describe this issue with application to one of the most intensely studied subproblems in computer vision, edge detection, and develop how regularization can be used in this context. One example of regularization concerning the problem "given an operator A and data y, find z such that Az = y" is the transformed problem "find z that minimizes the following functional":

$$\min_z\; (1 - \lambda)\, \|Az - y\|^2 + \lambda\, \|Pz\|^2, \qquad (2.36)$$

where P is a stabilizing operator and $\lambda \in [0, 1]$ is a regularization parameter controlling the compromise between the degree of regularization of the solution and closeness to the given data. Variation of the regularization parameter gives solutions with different degrees of smoothness; a large value of $\lambda$ may give rise to a smooth solution, whereas a small value increases the accuracy at the cost of larger variations in the estimate. Hence, this parameter has a certain interpretation in terms of spatial scale in the result. (It should be observed, however, that the solution to the regularized problem is in general not a solution to the original problem, not even in the case of ideal noise-free data.) In the special case when $P = \partial_{xx}$ and the measured data points are discrete, the solution of the problem of finding $S: \mathbb{R} \to \mathbb{R}$ that minimizes

$$\min_S\; (1 - \lambda) \sum_i (f_i - S(x_i))^2 + \lambda \int |S_{xx}(x)|^2\, dx \qquad (2.37)$$

given a set of measurements $f_i$ is given by approximating cubic splines; see de Boor (1978) for an extensive treatment of the subject. Interestingly, this result was first proved by Schoenberg (1946), who also proved the classification of Pólya frequency functions and sequences, which are the natural concepts in mathematics that underlie the scale-space kernels considered in sections 1.5.4–1.5.5 and section 1.7.1 in the previous chapter.
Torre and Poggio made the observation that the corresponding smoothing filters are very close to Gaussian kernels.
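The smoothing-spline formulation (2.37) can be experimented with directly; a sketch (ours) using SciPy's univariate smoothing spline, where the smoothing factor plays the role of the regularization parameter (the parameter choices below are illustrative, not from the text):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Regularized estimation via a cubic smoothing spline.  The smoothing
# factor s bounds the residual sum of squares, so it trades off closeness
# to the data against smoothness, like lambda in (2.36)/(2.37).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 200)
noise_std = 0.3
y = np.sin(x) + rng.normal(0.0, noise_std, x.size)

# s ~ n * noise_variance is a common heuristic for the residual budget.
spline = UnivariateSpline(x, y, k=3, s=x.size * noise_std**2)
rmse_raw = np.sqrt(np.mean((y - np.sin(x)) ** 2))
rmse_spline = np.sqrt(np.mean((spline(x) - np.sin(x)) ** 2))
# The regularized estimate is substantially closer to the underlying signal
# than the raw noisy samples.
```

Differentiating `spline` (rather than the raw data) is then a well-posed operation, which is exactly the role regularization plays in edge detection.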
The strong regularization property of scale-space representation can be appreciated from the introductory example. Under a small high-frequency disturbance of the original signal, $f(x) \mapsto f(x) + \varepsilon \cos \omega x$, the propagation of the disturbance to the first-order derivative of the scale-space representation is given by

$$I_x(x; t) \mapsto I_x(x; t) + \varepsilon\omega\, e^{-\omega^2 t/2} \cos \omega x. \qquad (2.38)$$

Clearly, this disturbance can be made arbitrarily small provided that the derivative of the signal is computed at a sufficiently coarse scale t in scale-space. (The subject of regularization is also treated in the chapters by Mumford, Nordström, Leaci and Solimini in this book.)

Related multi-scale representations

Wavelets

A type of multi-scale representation that has attracted great interest in signal processing, numerical analysis, and mathematics during recent years is the wavelet representation, which dates back to Strömberg (1983) and Meyer (1989, 1992). In traditional wavelet theory, the zero-order derivative is not permitted, since it does not satisfy the admissibility condition, which in practice implies that

$$\int_{-\infty}^{\infty} h(x)\, dx = 0. \qquad (2.42)$$

There are several developments of this theory concerning different special cases. A particularly well-studied problem is the construction of orthogonal wavelets for discrete signals, which permit a compact non-redundant multi-scale representation of the image data. This representation has also been suggested for image analysis.

Tuned scale-space kernels
Interestingly, it is possible to obtain other scale-space kernels by expanding the scale-space axioms in section 1.5.6 in the previous chapter with additional information. This operation is called tuning.
Although the Gaussian family is complete in the sense that Gaussian derivatives up to some order n completely characterize the local image structure in terms of its nth-order Taylor expansion at any scale, this family may not always be the most convenient one. When dealing with time-varying imagery, for example, local optic flow may be obtained directly from the output of a Gaussian family of space-time filters, but it may be more convenient to first tune these filters to the parameter of interest (in this case the velocity vector field). Another example is to tune the low-level processing to a particular spatial frequency, which leads to the family of Gabor functions.
Gabor functions. If a family of kernels is required to be selective for a particular spatial frequency, the dimensional analysis must be expanded to include the wavenumber k of that spatial frequency: both $\hat I$ and $\hat f$ carry the dimension of luminance, while the frequency variables carry dimension $[\mathrm{length}]^{-1}$. According to the Pi-theorem, there are then three independent dimensionless entities, e.g. $\hat I/\hat f$ together with two independent dimensionless combinations of the frequency variables. Reasoning along similar lines as in section 1.5.6 in the previous chapter results in the family of Gabor filters,

$$G_b(x;\; t, k) = \frac{1}{\sqrt{2\pi t}}\, e^{-x^T x/2t}\, e^{-i k^T x}, \qquad (2.43)$$

which are essentially sine and cosine functions modulated by a Gaussian weighting function. Historically, these functions have been extensively used in e.g. texture analysis (see figure 2.9 for graphical illustrations).
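The frequency selectivity of such kernels is easy to verify numerically; a 1-D sketch (ours) with a real (cosine) Gabor filter:

```python
import numpy as np

# Frequency selectivity of a real, 1-D Gabor filter -- a Gaussian-modulated
# cosine.  Normalization and parameter values below are illustrative.
t = 4.0                                   # scale (Gaussian variance)
k0 = 2.0                                  # tuning frequency
x = np.linspace(-20.0, 20.0, 4001)
h = x[1] - x[0]
gabor = np.exp(-x**2 / (2.0 * t)) * np.cos(k0 * x) / np.sqrt(2.0 * np.pi * t)

def response(w):
    # amplitude of the filter's response to a pure cosine of frequency w
    return abs(np.sum(gabor * np.cos(w * x)) * h)

r_matched, r_off = response(2.0), response(5.0)
# The response is far larger at the tuning frequency k0 than away from it.
```

In the Fourier domain the filter is a pair of Gaussians centred at $\pm k_0$, which is why the off-tuning response decays so rapidly.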

Behaviour across scales: Statistical analysis
In section 2.3 we analyzed the qualitative behaviour over scales of image features using differential geometric techniques. To describe global properties, such as the evolution over scales of the number of local extrema, or irregular properties, such as the behaviour of noise in scale-space, other tools are needed. Here, we shall exemplify how such analysis can be performed statistically, or based on statistical approximations.

Decreasing number of local extrema
Stationary processes. According to a result by Rice (1945), the density of local maxima for a stationary normal process can be estimated from the second- and fourth-order moments of the spectral density S.

⁶ Here, the temporal Gaussian kernel has infinite tails, which in principle extend from minus infinity (the past) to plus infinity (the future). To cope with this (time-)causality problem, Koenderink (1988) proposed to reparametrize the time scale in a logarithmic way so as to map the present (unreachable) moment to infinity.
Using this result, it is straightforward to analyze how the number of local extrema can be expected to decrease with scale in the scale-space representation of various types of noise (Lindeberg 1991). For smoothed white noise, the density of local maxima behaves as

$$p(t) \sim \frac{C}{\sqrt{t}}, \qquad (2.48)$$

showing that the density is basically inversely proportional to the value of the scale parameter, provided that the scale parameter is measured in terms of $\sigma = \sqrt{t}$. Note the qualitative similarities between the corresponding graphs for different noise types. Observe also that a straight-line approximation is only valid in the interior part of the interval: at fine scales there is interference with the inner scale of the image given by its sampling density, and at coarse scales there is interference with the outer scale of the image given by its finite size.
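This behaviour can be reproduced in a simple Monte-Carlo experiment (ours), counting the local maxima of smoothed white noise at two scales:

```python
import numpy as np

# Monte-Carlo check: the density of local maxima of Gaussian-smoothed white
# noise decreases roughly like 1/sigma, i.e. like 1/sqrt(t).
rng = np.random.default_rng(1)
signal = rng.normal(size=200000)

def count_maxima(f, sigma):
    radius = int(6 * sigma)
    u = np.arange(-radius, radius + 1)
    kernel = np.exp(-u**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    s = np.convolve(f, kernel, mode="valid")
    # interior samples strictly larger than both neighbours
    return int(np.sum((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])))

n2, n8 = count_maxima(signal, 2.0), count_maxima(signal, 8.0)
ratio = n2 / n8
# Quadrupling sigma reduces the number of maxima by roughly a factor of four.
```

This is the statistical counterpart of the differential-geometric analysis of extrema in section 2.3: individual extrema drift and annihilate, and on average their density thins out as $1/\sigma$.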

Noise propagation in scale-space derivatives
Computing higher-order spatial derivatives in the presence of noise is known to lead to computational problems. In the Fourier domain this effect is usually explained in terms of amplification of higher frequencies,

  \mathcal{F}\left\{ \frac{\partial^{n_1 + \dots + n_D} I}{\partial x_1^{n_1} \dots \partial x_D^{n_D}} \right\} = (i\omega_1)^{n_1} \dots (i\omega_D)^{n_D}\, \mathcal{F}\{I\}.   (2.52)

Given that the noise level is known a priori, several studies have been presented in the literature concerning the design of "optimal filters" for detecting or enhancing certain types of image structures while amplifying the noise as little as possible. From that viewpoint, the underlying idea of scale-space representation is different. Assuming that no prior knowledge is available, it follows that noise must be treated as part of the incoming signal; there is no way for an uncommitted front-end vision system to discriminate between noise and "underlying structures" unless specific models are available.
In this context it is of interest to study analytically how sensitive the Gaussian derivative kernels are to noise in the input. Here, we shall summarize the main results from a treatment by Blom (1993) concerning additive pixel-uncorrelated Gaussian noise with zero mean. This noise is completely described by its mean (assumed to be zero) and its variance.
The ratio between the variance \langle M^2_{m_x m_y} \rangle of the output noise and the variance \langle N^2 \rangle of the input noise is a natural measure of the noise attenuation. In summary, this ratio as a function of the scale parameter and the orders of differentiation, m_x and m_y, is given by

  \frac{\langle M^2_{m_x m_y} \rangle}{\langle N^2 \rangle} = \int_{x \in R^2} g^2_{x^{m_x} y^{m_y}}(x;\, t)\, dx = \frac{(2m_x - 1)!!\, (2m_y - 1)!!}{2^{m_x + m_y + 2}\, \pi\, t^{m_x + m_y + 1}}

(with (-1)!! = 1), i.e., it decreases with scale as t^{-(m_x + m_y + 1)}, while growing rapidly with the order of differentiation at any fixed scale.
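This attenuation behaviour can be checked numerically: for pixel-uncorrelated noise, the output/input variance ratio equals the sum of the squared filter coefficients. A 1-D sketch of the same principle in NumPy (helper name hypothetical), using a first-order Gaussian derivative kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
noise = rng.standard_normal(1 << 18)       # pixel-uncorrelated input, unit variance

def noise_ratio(t):
    """Measured and predicted output/input variance ratios when white noise
    is filtered with a first-order Gaussian derivative kernel of variance t."""
    radius = int(6 * np.sqrt(t)) + 1
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    h = (-x / t) * g                        # sampled first-order derivative kernel
    out = np.convolve(noise, h, mode="valid")
    return out.var() / noise.var(), float(np.sum(h**2))

measured, predicted = noise_ratio(t=9.0)
# measured ~ predicted; both decay with scale roughly as t**(-3/2)
```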

Non-uniform smoothing
Whereas the linear scale-space concept and the associated theory for feature detection provide a conceptually clean model of early visual computations, which can also be demonstrated to give highly useful results, there are certain limitations in basing a vision system on rotationally symmetric Gaussian kernels only. For example, smoothing across "object boundaries" may affect both the shape and the localization of edges in edge detection. Similarly, surface orientation estimates computed by shape-from-texture algorithms are affected, since the anisotropy of a surface pattern may decrease when smoothed using a rotationally symmetric Gaussian. These are some basic motivations for considering non-linear extensions of the linear scale-space theory, as will be the subject of the following chapters in this book.

Shape distortions in computation of surface shape
It is illuminating to consider in more detail the problem of deriving cues to surface shape from noisy image data. For simplicity, let us restrict the analysis to the monocular case, the shape-from-texture problem. The underlying ideas are, however, of much wider validity and apply to problems such as shape-from-stereo-cues and shape-from-motion.

Model. Consider the non-uniform Gaussian blob (2.55), f(x_1, x_2) = g(x_1;\, l_1^2)\, g(x_2;\, l_2^2) with l_1 \geq l_2 > 0, interpreted as the projection of a rotationally symmetric pattern, so that the true foreshortening is \epsilon = l_2 / l_1.

Effects due to scale-space smoothing. From the semi-group property of the Gaussian kernel, g(\cdot;\, t_1) * g(\cdot;\, t_2) = g(\cdot;\, t_1 + t_2), it follows that the scale-space representation of f at scale t is

  I(x_1, x_2;\, t) = g(x_1;\, l_1^2 + t)\, g(x_2;\, l_2^2 + t).   (2.59)

Thus, under scale-space smoothing the estimate of foreshortening varies as

  \hat{\epsilon}(t) = \sqrt{\frac{l_2^2 + t}{l_1^2 + t}},   (2.60)

i.e., it increases and tends to one, which means that after a sufficiently large amount of smoothing the image will eventually be interpreted as flat.
Hence, in cases when non-infinitesimal amounts of smoothing are necessary (e.g., due to the presence of noise), the surface orientation estimates will by necessity be biased. Observe in this context that no assumptions have been made here about what actual method should be used for computing surface orientation from image data. The example describes essential effects of the smoothing operation that arise in any shape-from-X method that contains a smoothing module and interprets a non-uniform Gaussian blob as the projection of a rotationally symmetric one.
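The bias in (2.60) is easy to tabulate numerically. A sketch (NumPy; the semi-axes l_1 = 8, l_2 = 4 are illustrative assumptions, giving a true foreshortening of 0.5) shows how the estimate drifts towards one with increasing scale:

```python
import numpy as np

l1, l2 = 8.0, 4.0   # assumed blob semi-axes; true foreshortening l2/l1 = 0.5

def eps_hat(t):
    """Foreshortening estimate after rotationally symmetric smoothing, eq. (2.60)."""
    return np.sqrt((l2**2 + t) / (l1**2 + t))

# unbiased only at t = 0; drifts monotonically towards 1 (a flat interpretation)
samples = {t: float(eps_hat(t)) for t in (0.0, 16.0, 64.0, 256.0)}
```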
Shape adaptation of the smoothing kernels. If, on the other hand, we have initial estimates (\hat{\sigma}, \hat{\tau}) of the slant angle and the tilt direction, computed, say, by using rotationally symmetric Gaussian smoothing, a straightforward compensation technique is to let the scale parameter in the (estimated) tilt direction, denoted t_{\hat{t}}, and the scale parameter in the perpendicular direction, denoted t_{\hat{b}}, be related by

  t_{\hat{t}} = t_{\hat{b}} \cos^2 \hat{\sigma}.   (2.61)

If this estimate is correct, then the slant estimate will be unaffected by the non-uniform smoothing operation. To illustrate this property, assume that the tilt estimate is correct (\hat{\tau} = \tau = \pi/2) and convolve the signal with a non-uniform Gaussian kernel

  g(x_1, x_2;\, t_{\hat{t}}, t_{\hat{b}}) = g(x_1;\, t_{\hat{b}})\, g(x_2;\, t_{\hat{t}}),   (2.62)

which gives

  I(x_1, x_2;\, t) = g(x_1;\, l_1^2 + t_{\hat{b}})\, g(x_2;\, l_2^2 + t_{\hat{t}}).   (2.63)

Clearly, \hat{\epsilon} = \epsilon if \hat{\sigma} = \sigma. In practice, we cannot of course assume that true values of (\sigma, \tau) are known, since this requires knowledge about the solution to the problem we are to solve. A more realistic formulation is therefore to first compute initial surface orientation estimates using rotationally symmetric smoothing (based on the principle that in situations when no a priori information is available, the first stages of visual processing should be as uncommitted as possible and have no particular bias). Then, when a hypothesis about a certain surface orientation (\hat{\sigma}_0, \hat{\tau}_0) has been established, the estimates can be improved iteratively.
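The iterative compensation can be sketched as a fixed-point iteration (NumPy; the parameter values are illustrative assumptions): each step sets the tilt-direction scale to t_b cos^2(sigma_hat) according to (2.61) and re-estimates the slant from the resulting foreshortening.

```python
import numpy as np

l1, l2 = 8.0, 4.0   # blob semi-axes; the true slant satisfies cos(sigma) = l2/l1
t_b = 16.0          # smoothing variance perpendicular to the tilt direction

def update(sigma_hat):
    """One compensation step: smooth with t_t = t_b cos^2(sigma_hat) in the
    tilt direction (eq. 2.61) and re-estimate slant from the foreshortening."""
    t_t = t_b * np.cos(sigma_hat) ** 2
    eps = np.sqrt((l2**2 + t_t) / (l1**2 + t_b))
    return np.arccos(eps)

# start from the biased estimate obtained with rotationally symmetric smoothing
sigma = np.arccos(np.sqrt((l2**2 + t_b) / (l1**2 + t_b)))
for _ in range(25):
    sigma = update(sigma)
# sigma now approximates arccos(l2 / l1), the true slant
```

The iteration on cos^2(sigma) is affine with contraction factor t_b / (l1^2 + t_b) < 1, which is why convergence to the true slant is rapid whenever the initial estimate is reasonable.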
More generally, it can be shown that by extending the linear scale-space concept based on the rotationally symmetric Gaussian kernel towards an affine scale-space representation based on the non-uniform Gaussian kernel, with its shape specified by a (symmetric positive semi-definite) covariance matrix \Sigma_t,

  g(x;\, \Sigma_t) = \frac{1}{2\pi \sqrt{\det \Sigma_t}}\, e^{-x^T \Sigma_t^{-1} x / 2}, \quad where x \in R^2,   (2.65)

a shape-from-texture method can be formulated such that, up to first order of approximation, the surface orientation estimates are unaffected by the smoothing operation. In practice, this means that the accuracy is improved substantially, typically by an order of magnitude in actual computations.
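As a sanity check of (2.65), the kernel can be sampled and integrated numerically (NumPy; the covariance values are arbitrary assumptions):

```python
import numpy as np

def affine_gaussian(x1, x2, cov):
    """Non-uniform Gaussian kernel with covariance matrix cov, eq. (2.65)."""
    inv = np.linalg.inv(cov)
    quad = inv[0, 0] * x1**2 + 2 * inv[0, 1] * x1 * x2 + inv[1, 1] * x2**2
    return np.exp(-quad / 2) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

# sample on a grid wide enough to capture the tails, then integrate numerically
r = np.linspace(-30.0, 30.0, 601)
X1, X2 = np.meshgrid(r, r, indexing="ij")
cov = np.array([[9.0, 3.0], [3.0, 4.0]])   # arbitrary positive-definite example
mass = float(affine_gaussian(X1, X2, cov).sum() * (r[1] - r[0]) ** 2)
# mass should be close to 1, since the kernel is a normalized density
```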
Affine scale-space. A formal analysis shows that, with respect to the sample problem treated above, the true value of \sigma corresponds to a convergent fixed point of the iteration (2.64). Hence, for the pattern (2.55) the method is guaranteed to converge to the true solution, provided that the initial estimate is sufficiently close to the true value. The essential step in the resulting shape-from-texture method with affine shape adaptation is to adapt the kernel shape according to the local image structure. In the case when the surface pattern is weakly isotropic (see section 2.2.4), a useful method is to measure the second-moment matrix \mu_I according to (2.24) and then let \Sigma_t = \mu_I^{-1}.

Outlook
General. An important point with the approach in the previous section is that the linear scale-space model is used as an uncommitted first stage of processing. Then, when additional information has become available (here, the initial surface orientation estimates), this information is used for tuning the front-end processing to the more specific tasks at hand. Adapting the shape of the smoothing kernel in a linear way constitutes the presumably most straightforward type of geometry-driven processing that can be performed. In the case when the surface pattern is weakly isotropic, this shape adaptation has a very simple geometric interpretation: it corresponds to rotationally symmetric smoothing in the tangent plane to the surface.

Edge detection. Whereas the affine shape adaptation is conceptually simple, it has an interesting relationship to the non-linear diffusion schemes that will be considered in later chapters. If the shape adaptation scheme is applied at edge points, then it leads to a larger amount of smoothing along the edge than in the perpendicular direction. In this respect, the method constitutes a first step towards linking processing modules based on sparse edge data to processing modules based on dense filter outputs.
Biological vision. Referring to the previously mentioned relations between linear scale-space theory and biological vision, one may ask: Are there corresponding geometry-driven processes in human vision? Besides the well-known fact that top-down expectations can influence visual processes at rather low levels, a striking fact is the abundance of fibers projecting backwards (feedback) between the different layers in the visual front-end, being more the rule than the exception (Zeki 1993, chapter 31).
At the current point, however, it is too early to give a definite answer to this question. We leave the subject of non-linear scale-space to the following chapters, where a number of different approaches will be presented.