Riesz networks: scale invariant neural networks in a single forward pass

Scale invariance of an algorithm refers to its ability to treat objects equally independently of their size. For neural networks, scale invariance is typically achieved by data augmentation. However, when presented with a scale far outside the range covered by the training set, neural networks may fail to generalize. Here, we introduce the Riesz network, a novel scale invariant neural network. Instead of standard 2d or 3d convolutions for combining spatial information, the Riesz network is based on the Riesz transform which is a scale equivariant operation. As a consequence, this network naturally generalizes to unseen or even arbitrary scales in a single forward pass. As an application example, we consider detecting and segmenting cracks in tomographic images of concrete. In this context, 'scale' refers to the crack thickness which may vary strongly even within the same sample. To prove its scale invariance, the Riesz network is trained on one fixed crack width. We then validate its performance in segmenting simulated and real tomographic images featuring a wide range of crack widths. An additional experiment is carried out on the MNIST Large Scale data set.


Introduction
In image data, similar objects may occur at highly varying scales. Examples are cars or pedestrians at different distances from the camera, cracks in concrete of varying thickness or imaged at different resolution, or blood vessels in biomedical applications (see Fig. 1). It is natural to assume that the same object or structure at different scales should be treated equally, i.e. should have equal or at least similar features. This property is called scale or dilation invariance and has been investigated in detail in classical image processing [1][2][3].
Neural networks have proven to segment and classify robustly and well in many computer vision tasks. Nowadays, the most popular and successful neural networks are Convolutional Neural Networks (CNNs). It would be desirable that neural networks share typical properties of human vision such as translation, rotation, or scale invariance. While this is true for translation invariance, CNNs are not scale or rotation invariant by default. This is due to the excessive use of convolutions which are local operators. Moreover, training sets often contain a very limited number of scales. To overcome this problem, CNNs are often trained with rescaled images through data augmentation. However, when a CNN is given input whose scale is outside the range covered by the training set, it will not be able to generalize [4,5]. To overcome this problem, a CNN trained at a fixed scale can be applied to several rescaled versions of the input image and the results can be combined. This, however, requires multiple runs of the network.
arXiv:2305.04665v2 [cs.CV] 11 Jan 2024
One application example where the challenges just described occur naturally is the task of segmenting cracks in 2d or 3d gray scale images of concrete. Crack segmentation in 2d has been an actively researched topic in civil engineering, see [6] for an overview. Cracks are naturally multiscale structures (Fig. 1, left) and hence require multiscale treatment. Nevertheless, adaptation to scale (crack thickness) has not been treated explicitly so far.
Recently, crack segmentation in 3d images obtained by computed tomography (CT) has become a subject of interest [6,7]. Here, the effect of varying scales is even more pronounced [8]: crack thicknesses can vary from a single pixel to more than 100 pixels. Hence, the aim is to design and evaluate crack segmentation methods that work equally well on all possible crack widths without complicated adjustment by the user.
In this work, we focus on 2d multiscale crack segmentation in images of concrete samples. We design the Riesz network which replaces the popular 2d convolutions by first and second order Riesz transforms to allow for a scale invariant spatial operation. The resulting neural network is provably scale invariant in only one forward pass. It is sufficient to train the Riesz network on one scale or crack thickness, only. The network then generalizes automatically without any adjustments or rescaling to completely unseen scales. We validate the network performance using images with simulated cracks of constant and varying widths generated as described in [6,10]. Our network is compared with competing methods for multiscale segmentation and finally applied to real multiscale cracks observed in 2d slices of tomographic images.
There is just one publicly available dataset which allows for testing scale equivariance: MNIST Large Scale [5]. Additional experiments with the Riesz network on this dataset are reported in Appendix A.

The Riesz transform
The Riesz transform is a generalization of the Hilbert transform to higher dimensional spaces, see e.g. [11]. First practical applications of the Riesz transform arise in signal processing through the definition of the monogenic signal [12] which enables a decomposition of higher dimensional signals into local phase and local amplitude. First, a bandpass filter is applied to the signal to separate the band of frequencies. Using the Riesz transform, the local phase and amplitude can be calculated for a selected range of frequencies. For more details we refer to [12,13].
As images are 2d or 3d signals, applications of the Riesz transform naturally extend to the fields of image processing and computer vision through the Poisson scale space [14,15] which is an alternative to the well-known Gaussian scale space. Köthe [16] compared the Riesz transform with the structure tensor from a signal processing perspective. Unser and van de Ville [17] related higher order Riesz transforms and derivatives. Furthermore, they give a reason for preferring the Riesz transform over the standard derivative operator: The Riesz transform does not amplify high frequencies.
Higher order Riesz transforms were also used for analysis of local image structures using ideas from differential geometry [18,19]. Benefits of using the first and second order Riesz transforms as low level features have also been shown in measuring similarity [20], analysis and classification of textures [11,21], and orientation estimation [22,23]. The Riesz transform can be used to create steerable wavelet frames, so-called Riesz-Laplace wavelets [17,24], which are the first ones utilizing the scale equivariance property of the Riesz transform and have inspired the design of quasi monogenic shearlets [25]. Interestingly, in early works on the Riesz transform in signal processing or image processing [12,13,15], scale equivariance has not been noticed as a feature of the Riesz transform and hence remained sidelined. Benefits of the scale equivariance have been shown later in [17,19].
Recently, the Riesz transform found its way into the field of deep learning: Riesz transform features are used as supplementary features in classical CNNs to improve robustness [26]. In our work, we will use the Riesz transforms for extracting low-level features from images and use them as basis functions which replace trainable convolutional filters in CNNs or Gaussian derivatives in [27].

Scale invariant deep learning methods
Deep learning methods which have mechanisms to handle variations in scale effectively can be split into two groups based on their scale generalization ability.

Scale invariant deep learning methods for a limited range of scales
The first group can handle multiscale data but is limited to the scales represented either by the training set or by the neural network architecture. The simplest approach to learn multiscale features is to apply the convolutions to several rescaled versions of the images or feature maps in every layer and to combine the results by maximum pooling [4] or by keeping the scale with maximal activation [28] before passing it to the next layer. In [29,30], several approaches based on downscaling images or feature maps with the goal of designing robust multiscale object detectors are summarized. However, scaling is included in the network architecture such that scales have to be selected a priori. Therefore, this approach only yields local scale invariance, i.e. an adaptation to the scale observed in a given input image is not possible after training. Another intuitive approach is to rescale trainable filters, i.e. convolutions, by interpolation [31]. In [29], a new multiscale strategy was derived which uses convolution blocks of varying sizes sequenced in several downscaling layers creating a pyramidal shape. The pyramidal structure is utilized for learning scale dependent features and making predictions in every downsampling layer. Layers can be trained according to object size. That is, only the part of the network relevant for the object size is optimized. This guarantees robustness to a large range of object scales. Similarly, in [30], a network consisting of a downsampling pyramid followed by an upsampling pyramid is proposed. Here, connections between pyramid levels are devised for combining low and high resolution features and predictions are also made independently on every pyramid level. However, in both cases, scale generalization properties of the networks are restricted by their architecture, i.e. by the depth of the network (number of levels in the image pyramid), the size of convolutions as spatial operators as well as the size of the input image.
Spatial transformer networks [32] focus on invariance to affine transformations including scale. This is achieved by using a so-called localisation network which learns transformation parameters. Finally, using these transformation parameters, a new sampling grid can be created and feature maps are resampled to it. These parts form a trainable module which is able to handle and correct the effect of the affine transformations.
However, spatial transformer networks do not necessarily achieve invariant recognition [33]. Also, it is not clear how this type of network would generalize to scales not represented in the training set.
In [34], so-called structured receptive fields are introduced. Linear combinations (1 × 1 convolutions) of basis functions (in this case Gaussian derivatives up to 4th order) are used to learn complex features and to replace convolutions (e.g. of size 3 × 3 or 5 × 5). As a consequence, the number of parameters is reduced, while the expressiveness of the neural network is preserved. This type of network works better than classical CNNs in the case where little training data is available. However, the standard deviation parameters of the Gaussian kernels are manually selected and kept fixed. Hence, the scale generalization ability remains limited.
Making use of the semi-group property of scale spaces, scale equivariant neural networks motivate the use of dilated convolutions [35] to define scale equivariant convolutions on the Gaussian scale space [36] or morphological scale spaces [37]. Unfortunately, these neural networks are unable to generalize to scales outside those determined by their architecture and are only suitable for downscale factors which are powers of 2, i.e. {2, 4, 8, 16, ...}. Furthermore, scale equivariant steerable networks [38] show how to design scale invariant networks on the scale-translation group without using standard or dilated convolutions. Following an idea from [34], convolutions are replaced by linear combinations of basis filters (Hermite polynomials with Gaussian envelope). While this allows for non-integer scales, scales are still limited to powers of a positive scaling factor a. Scale space is again discretized and sampled. Hence, a generalization to arbitrary scales is not guaranteed.

Scale invariant deep learning methods for arbitrary scales
The second group of methods can generalize to arbitrary scales, i.e. any scales that are in the range bounded from below by image resolution and from above by image size, but not necessarily contained in the training set. Our Riesz network also belongs to this second group of methods.
An intuitive approach is to train standard CNNs on a fixed range of scales and enhance their scale generalization ability by the following three step procedure based on image pyramids: downsample by a factor a > 1, forward pass of the CNN, upsample the result by 1/a to the original image size [5,8]. Finally, forward passes of the CNN from several downsampling factors $\{a_1, \dots, a_n > 0 \mid n \in \mathbb{N}\}$ are aggregated by using the maximum or average operator across the scale dimension. This approach indeed guarantees generalization to unseen scales as scales can be adapted to the input image and share the weights of the network [5]. However, it requires multiple forward passes of the CNN and the downsampling factors have to be selected by the user.
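The three step pyramid procedure can be sketched in a few lines. The following is a minimal illustration only, not the implementation used in [5,8]: the function names, the block averaging used for downsampling, the nearest neighbour upsampling, and the thresholding `toy_model` standing in for a trained CNN are all assumptions made for this example.

```python
import numpy as np

def downsample(img, a):
    """Downsample a 2d array by an integer factor a using block averaging."""
    h, w = (img.shape[0] // a) * a, (img.shape[1] // a) * a
    return img[:h, :w].reshape(h // a, a, w // a, a).mean(axis=(1, 3))

def upsample(img, a, shape):
    """Nearest neighbour upsampling by factor a, zero padded or cropped to `shape`."""
    up = np.repeat(np.repeat(img, a, axis=0), a, axis=1)
    out = np.zeros(shape, dtype=up.dtype)
    h, w = min(shape[0], up.shape[0]), min(shape[1], up.shape[1])
    out[:h, :w] = up[:h, :w]
    return out

def pyramid_predict(model, img, factors=(1, 2, 4), aggregate=np.max):
    """Run `model` on several downsampled copies of `img` and merge the results."""
    preds = [upsample(model(downsample(img, a)), a, img.shape) for a in factors]
    return aggregate(np.stack(preds), axis=0)

# Hypothetical stand-in for a trained CNN: flags dark pixels as "crack".
toy_model = lambda x: (x < 0.5).astype(float)

img = np.ones((16, 16))
img[6:10, :] = 0.0  # a dark horizontal band, 4 pixels thick
prob = pyramid_predict(toy_model, img)
```

Note that `pyramid_predict` calls the model once per factor, which is exactly the cost of multiple forward passes mentioned above, and that the tuple `factors` has to be chosen by the user.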
Inspired by Scattering Networks [39,40], normalized differential operators based on first and second order Gaussian derivatives stacked in layers or a cascade of a network can be used to extract more complex features [41]. Subsequently, these features serve as an input for a classifier such as a support vector machine. Varying the standard deviation parameter σ of the Gaussian kernel, generalization to new scales can be achieved. However, this type of network is useful for creating handcrafted complex scale invariant features, only, and hence is not trainable.
Its expansion to trainable networks by creating so-called Gaussian derivative networks [27] is one of the main inspirations for our work. For combining spatial information, γ-normalized Gaussian derivatives are used as scale equivariant operators (γ = 1). Similarly as in [34], linear combinations of normalized derivatives are used to learn more complex features in the spirit of deep learning. However, the Gaussian derivative network is based on ideas from scale space theory and requires the specification of a wide enough range of scales that cover all the scales present in the training and testing set. Hence, the scale dimension needs to be discretized and sampled densely. These networks have a separate channel for every scale and the network weights are shared between channels. Scale invariance is achieved by maximum pooling over the multiple scale channels.
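Formally, for a d-dimensional signal $f \in L^2(\mathbb{R}^d)$ (i.e. an image or a feature map), the Riesz transform of first order $\mathcal{R} = (\mathcal{R}_1, \dots, \mathcal{R}_d)$ is defined in the spatial domain as
$$\mathcal{R}_j(f)(x) = C_d \lim_{\varepsilon \to 0} \int_{\mathbb{R}^d \setminus B_\varepsilon} \frac{y_j \, f(x - y)}{|y|^{d+1}} \, dy, \qquad (1)$$
where $C_d = \Gamma((d+1)/2)/\pi^{(d+1)/2}$ is a normalizing constant and $B_\varepsilon$ is the ball of radius ε centered at the origin. Alternatively, the Riesz transform can be defined in the frequency domain via the Fourier transform $\mathcal{F}$
$$\mathcal{F}(\mathcal{R}_j(f))(u) = -i \frac{u_j}{|u|} \mathcal{F}(f)(u) \qquad (2)$$
for $j \in \{1, \dots, d\}$. Higher order Riesz transforms are defined by applying a sequence of first order Riesz transforms. That is, for $k_1, \dots, k_d \in \mathbb{N} \cup \{0\}$ we set
$$\mathcal{R}^{(k_1, \dots, k_d)}(f)(x) := \mathcal{R}_1^{k_1}\big(\mathcal{R}_2^{k_2}\big(\cdots \mathcal{R}_d^{k_d}(f) \cdots\big)\big)(x), \qquad (3)$$
where $\mathcal{R}_j^{k_j}$ refers to applying the Riesz transform $\mathcal{R}_j$ $k_j$ times in a sequence.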
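The frequency domain definition lends itself to a short FFT based sketch for the 2d case. This is an illustration under stated assumptions rather than the authors' implementation: the function name `riesz_transforms` is made up for this example, the image is treated as periodic, the multiplier $-i u_j/|u|$ is used for the first order, and the multiplier is set to 0 at the zero frequency.

```python
import numpy as np

def riesz_transforms(f):
    """Return R_1 f, R_2 f, R^(2,0) f, R^(1,1) f, R^(0,2) f for a 2d array f.

    Assumptions of this sketch: periodic boundary, multiplier 0 at frequency 0.
    """
    u1 = np.fft.fftfreq(f.shape[0])[:, None]   # frequencies along axis 0
    u2 = np.fft.fftfreq(f.shape[1])[None, :]   # frequencies along axis 1
    norm = np.sqrt(u1**2 + u2**2)
    norm[0, 0] = 1.0                           # avoid division by zero
    h1, h2 = -1j * u1 / norm, -1j * u2 / norm  # first order multipliers
    # Second order transforms are products of first order multipliers.
    multipliers = [h1, h2, h1 * h1, h1 * h2, h2 * h2]
    F = np.fft.fft2(f)
    return [np.fft.ifft2(h * F).real for h in multipliers]

# Plane wave cos(theta) with integer grid frequency (4, 6) on a 64 x 64 grid.
n = 64
x1, x2 = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
theta = 2 * np.pi * (4 * x1 + 6 * x2) / n
r1, r2, r20, r11, r02 = riesz_transforms(np.cos(theta))
```

For this plane wave, `r1` equals `4 / sqrt(52) * sin(theta)`, i.e. the response depends only on the direction of the frequency vector and not on its length, and `r20 + r02` reproduces `-cos(theta)`.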
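The Riesz transform kernels of first and second order resemble those of the corresponding derivatives of smoothing filters such as Gaussian or Poisson filters (Fig. 2). This can be explained by the following relations
$$\mathcal{R}^{(k_1, \dots, k_d)}(f) = (-1)^N (-\triangle)^{-N/2} \frac{\partial^N f}{\partial x_1^{k_1} \cdots \partial x_d^{k_d}} \qquad (4)$$
$$\mathcal{R}^{(k_1, \dots, k_d)}\big((-\triangle)^{N/2} f\big) = (-1)^N \frac{\partial^N f}{\partial x_1^{k_1} \cdots \partial x_d^{k_d}} \qquad (5)$$
for $k_1 + \dots + k_d = N$ and $N \in \mathbb{N}$. The fractional Laplace operator $(-\triangle)^{-N/2}$ in (4) acts as an isotropic low-pass filter. The main properties of the Riesz transform can be summarized in the following way [17]:
• translation equivariance: For $x_0 \in \mathbb{R}^d$ define a translation operator $T_{x_0}$ by $T_{x_0}(f)(x) = f(x - x_0)$. It holds that
$$\mathcal{R}_j(T_{x_0}(f)) = T_{x_0}(\mathcal{R}_j(f)) \qquad (6)$$
where $j \in \{1, \dots, d\}$. This property reflects the fact that the Riesz transform commutes with the translation operator.
• steerability: The directional Hilbert transform $\mathcal{H}_v$ in direction $v \in \mathbb{R}^d$, $\|v\| = 1$, is steerable in terms of the Riesz transform, that is, it can be written as a linear combination of the Riesz transforms
$$\mathcal{H}_v(f)(x) = \sum_{j=1}^{d} v_j \mathcal{R}_j(f)(x) = \langle \mathcal{R}(f)(x), v \rangle. \qquad (7)$$
This is equivalent to the link between gradient and directional derivatives [17] and a very useful property for learning oriented features.
• all-pass filter: Let $H = (H_1, \dots, H_d)$ with $H_j(u) = -i u_j/|u|$ denote the frequency response of the Riesz transform, i.e. $\mathcal{F}(\mathcal{R}_j(f))(u) = H_j(u) \mathcal{F}(f)(u)$. The energy of the Riesz transform for frequency $u \in \mathbb{R}^d$ is defined as the norm of the d-dimensional vector $H(u)$ and has value 1 for all non-zero frequencies $u \neq 0$, i.e.
$$\|H(u)\| = 1 \quad \text{for } u \neq 0. \qquad (8)$$
The all-pass filter property reflects the fact that the Riesz transform is a non-local operator and that every frequency is treated fairly and equally. Combined with scale equivariance, this eliminates the need for multiscale analysis or multiscale feature extraction.
• scale (dilation) equivariance: For a > 0 define a dilation or rescaling operator $L_a$ by $L_a(f)(x) = f(x/a)$. Then
$$\mathcal{R}_j(L_a(f)) = L_a(\mathcal{R}_j(f)) \qquad (9)$$
for $j \in \{1, \dots, d\}$. That is, the Riesz transform does not only commute with translations but also with scaling.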
Scale equivariance enables an equal treatment of the same objects at different scales. As this is the key property of the Riesz transform for our application, we will briefly present a proof. We restrict to the first order in the Fourier domain. The proof for higher orders follows directly from the one for the first order.
Fig. 2: Visualizations of Riesz transform kernels of first and second order. First row (from left to right): $\mathcal{R}_1$ and $\mathcal{R}_2$. Second row (from left to right): $\mathcal{R}^{(2,0)}$, $\mathcal{R}^{(1,1)}$, and $\mathcal{R}^{(0,2)}$.
Lemma 1. The Riesz transform is scale equivariant, i.e. for $f \in L^2(\mathbb{R}^d)$, $a > 0$, and the dilation operator $L_a(f)(x) = f(x/a)$,
$$\mathcal{R}_j(L_a(f)) = L_a(\mathcal{R}_j(f)). \qquad (10)$$
Proof. Remember that the Fourier transform of a dilated function is given by $\mathcal{F}(L_a(f))(u) = a^d \mathcal{F}(f)(au)$. Setting $g := L_a(f)$ and using the frequency domain definition $\mathcal{F}(\mathcal{R}_j(g))(u) = -i \frac{u_j}{|u|} \mathcal{F}(g)(u)$, we have
$$\mathcal{F}(\mathcal{R}_j(g))(u) = -i \frac{u_j}{|u|} a^d \mathcal{F}(f)(au) = a^d \Big(-i \frac{a u_j}{|a u|}\Big) \mathcal{F}(f)(au) = a^d \mathcal{F}(\mathcal{R}_j(f))(au) = \mathcal{F}(L_a(\mathcal{R}_j(f)))(u).$$
Applying the inverse Fourier transform yields the claim.
Fig. 3 provides an illustration of the scale equivariance. It shows four rectangles with length-to-width ratio 20 and varying width (3, 5, 7, and 11 pixels) together with the gray value profile of the second order Riesz transform $\mathcal{R}^{(2,0)}$ along a linear section through the centers of the rectangles. In spite of the different widths, the Riesz transform yields equal filter responses for each rectangle (up to rescaling). In contrast, to achieve the same behaviour in Gaussian scale space, the scale space has to be sampled (i.e. a subset of scales has to be selected), the γ-normalized derivative [1] has to be calculated for every scale, and finally the scale yielding the maximal absolute value has to be selected. In comparison, the simplicity of the Riesz transform achieving the same in just one transform without sampling scale space and without the need for a scale parameter is striking.
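A plane wave gives a compact worked example of the difference between the Riesz transform and an ordinary derivative under rescaling. The computation below is formal (a cosine is not in $L^2$) and assumes the multiplier $-i u_j/|u|$ for the first order Riesz transform and the dilation $L_a(f)(x) = f(x/a)$; the symbol $w$ for the frequency vector is introduced only for this example.

```latex
% Plane wave with frequency vector w \neq 0
f(x) = \cos(\langle w, x \rangle)
\quad\Longrightarrow\quad
\mathcal{R}_j(f)(x) = \frac{w_j}{|w|} \sin(\langle w, x \rangle).

% The dilated wave L_a(f)(x) = \cos(\langle w/a, x \rangle) has frequency w/a:
\mathcal{R}_j(L_a(f))(x)
  = \frac{w_j / a}{|w / a|} \sin(\langle w/a, x \rangle)
  = \frac{w_j}{|w|} \sin(\langle w, x/a \rangle)
  = L_a(\mathcal{R}_j(f))(x).

% In contrast, the partial derivative picks up a factor 1/a:
\partial_j (L_a(f))(x)
  = -\frac{w_j}{a} \sin(\langle w, x/a \rangle)
  = \frac{1}{a} \, L_a(\partial_j f)(x).
```

The amplitude of the Riesz response is unchanged by the dilation, whereas the derivative response shrinks by $1/a$, which is the factor that scale normalized derivatives have to compensate with an explicit scale parameter.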

Riesz transform neural networks
In the spirit of structured receptive fields [34] and Gaussian derivative networks [27], we use the Riesz transforms of first and second order instead of standard convolutions to define Riesz layers. As a result, Riesz layers are scale equivariant in a single forward pass. Replacing standard derivatives with the Riesz transform has been motivated by [16], while using a linear combination of Riesz transforms of several orders follows [21].

Riesz layers
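The base layer of the Riesz networks is defined as a linear combination of Riesz transforms of several orders implemented as 1d convolution across feature channels (Fig. 4). Here, we limit ourselves to first and second order Riesz transforms. Thus, the linear combination reads as
$$J_{\mathcal{R}}(f) := C_0 + \sum_{k=1}^{d} C_k \cdot \mathcal{R}_k(f) + \sum_{k,l \in \mathbb{N}_0,\; k+l=2} C_{k,l} \cdot \mathcal{R}^{(k,l)}(f), \qquad (11)$$
where $\{C_0, C_k \mid k \in \{1, \dots, d\}\} \cup \{C_{k,l} \mid k, l \in \mathbb{N}_0,\; k + l = 2\}$ are parameters that are learned during training.
Now we can define the general layer of the network (Fig. 4). Let us assume that the Kth network layer takes input $F^{(K)} = (F^{(K)}_1, \dots, F^{(K)}_{c^{(K)}})$ with $c^{(K)}$ feature channels and has output $F^{(K+1)} = (F^{(K+1)}_1, \dots, F^{(K+1)}_{c^{(K+1)}})$ with $c^{(K+1)}$ feature channels. Then the output in channel $j \in \{1, \dots, c^{(K+1)}\}$ is given by
$$F^{(K+1)}_j := \sum_{i=1}^{c^{(K)}} J^{(j,i)}_K\big(F^{(K)}_i\big). \qquad (12)$$
Here, $J^{(j,i)}_K$ is defined in the same way as $J_{\mathcal{R}}(f)$ from equation (11), but trainable parameters may vary for different input channels i and output channels j, i.e.
$$J^{(j,i)}_K(f) := C^{(j,i,K)}_0 + \sum_{k=1}^{d} C^{(j,i,K)}_k \cdot \mathcal{R}_k(f) + \sum_{k,l \in \mathbb{N}_0,\; k+l=2} C^{(j,i,K)}_{k,l} \cdot \mathcal{R}^{(k,l)}(f). \qquad (13)$$
In practice, the offset parameters $C^{(j,i,K)}_0$, $i = 1, \dots, c^{(K)}$, enter (12) only through their sum and can therefore be replaced by a single offset per output channel, $C^{(j,K)}_0 := \sum_{i=1}^{c^{(K)}} C^{(j,i,K)}_0$.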
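A Riesz layer of this form can be sketched with NumPy for the 2d case. The sketch is an illustration under assumptions and not the authors' code: the names `riesz_multipliers` and `riesz_layer` are hypothetical, the image is treated as periodic, the multiplier $-i u_j/|u|$ (set to 0 at the zero frequency) is assumed, and the linear combination is carried out in the Fourier domain so that only one inverse FFT per output channel is needed.

```python
import numpy as np

def riesz_multipliers(shape):
    """Frequency responses of R_1, R_2, R^(2,0), R^(1,1), R^(0,2) (2d case)."""
    u1 = np.fft.fftfreq(shape[0])[:, None]
    u2 = np.fft.fftfreq(shape[1])[None, :]
    norm = np.sqrt(u1**2 + u2**2)
    norm[0, 0] = 1.0  # the multiplier is set to 0 at the zero frequency
    h1, h2 = -1j * u1 / norm, -1j * u2 / norm
    return np.stack([h1, h2, h1 * h1, h1 * h2, h2 * h2])  # shape (5, H, W)

def riesz_layer(x, weights, offsets):
    """Apply one Riesz layer.

    x:       input feature maps, shape (c_in, H, W)
    weights: coefficients of the five transforms, shape (c_out, c_in, 5)
    offsets: one offset per output channel, shape (c_out,)
    """
    mult = riesz_multipliers(x.shape[1:])
    X = np.fft.fft2(x, axes=(-2, -1))
    # Sum over input channels i and transform index m in the Fourier domain.
    Y = np.einsum("jim,mhw,ihw->jhw", weights, mult, X)
    # Taking the real part discards round-off and Nyquist frequency residue.
    return np.fft.ifft2(Y, axes=(-2, -1)).real + offsets[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 32, 32))   # 3 input channels
w = rng.normal(size=(4, 3, 5))     # 4 output channels, 5 transforms each
b = rng.normal(size=4)
y = riesz_layer(x, w, b)           # shape (4, 32, 32)
```

Because all Riesz multipliers vanish at the zero frequency, a constant input is mapped to the offsets alone, which is a quick sanity check for an implementation of this kind.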

Proof of scale equivariance
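We prove the scale equivariance for $J_{\mathcal{R}}(f)$. That implies scale equivariance for $J^{(j,i)}_K(f)$ and consequently for $F^{(K+1)}_j$ for arbitrary layers of the network. By construction (see Section 3.3), this will result in provable scale equivariance for the whole network. Formally, we show that $J_{\mathcal{R}}(f)$ from equation (11) is scale equivariant, i.e.
$$J_{\mathcal{R}}(L_a(f)) = L_a(J_{\mathcal{R}}(f)) \qquad (14)$$
for the dilation operator $L_a(f)(x) = f(x/a)$.
Proof. For any scaling parameter a > 0 and $x \in \mathbb{R}^d$, the scale equivariance of the first order Riesz transforms (and hence of their compositions) gives
$$J_{\mathcal{R}}(L_a(f))(x) = C_0 + \sum_{k=1}^{d} C_k \cdot \mathcal{R}_k(L_a(f))(x) + \sum_{k+l=2} C_{k,l} \cdot \mathcal{R}^{(k,l)}(L_a(f))(x)$$
$$= C_0 + \sum_{k=1}^{d} C_k \cdot \mathcal{R}_k(f)\Big(\frac{x}{a}\Big) + \sum_{k+l=2} C_{k,l} \cdot \mathcal{R}^{(k,l)}(f)\Big(\frac{x}{a}\Big) = J_{\mathcal{R}}(f)\Big(\frac{x}{a}\Big) = L_a(J_{\mathcal{R}}(f))(x).$$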

Network design
The basic building block of modern CNNs is a sequence of the following operations: batch normalization, spatial convolution (e.g. 3 × 3 or 5 × 5), the Rectified Linear Unit (ReLU) activation function, and Max Pooling. Spatial convolutions have by default a limited size of the receptive field and Max Pooling is a downsampling operation performed on a window of fixed size. For this reason, these two operations are not scale equivariant and consequently CNNs are sensitive to variations in the scale. Hence, among the classical operations, only batch normalization [42] and ReLU activation preserve scale equivariance. To build our neural network from scale equivariant transformations, only, we restrict to using batch normalization, ReLUs, and Riesz layers, which serve as a replacement for spatial convolutions.
In our setting, Max Pooling can completely be avoided since its main purpose is to combine it with spatial convolutions in cascades to increase the size of the receptive field while reducing the number of parameters. Generally, a layer consists of the following sequence of transformations: batch normalization, Riesz layer, and ReLU. Batch normalization improves the training capabilities and avoids overfitting, ReLUs introduce non-linearity, and the Riesz layers extract scale equivariant spatial features. For every layer, the number of feature channels has to be selected. Hence, our network with K ∈ N layers can be simply defined by a (K + 2)-tuple specifying the channel sizes, e.g. $(c^{(0)}, c^{(1)}, \dots, c^{(K)}, c^{(K+1)})$. The final layer is defined as a linear combination of the features from the previous layer followed by a sigmoid function yielding the desired probability map as output.

Experiments and applications
In this section we evaluate the four layer Riesz network defined above on the task of segmenting cracks in 2d slices from CT images of concrete. Particular emphasis is put on the network's ability to segment multiscale cracks and to generalize to crack scales unseen during training. To quantify these properties, we use images with simulated cracks. Being accompanied by an unambiguous ground truth, they allow for an objective evaluation of the segmentation results. Additionally, in Appendix A scale equivariance of the Riesz network is experimentally validated on the MNIST Large Scale data set [5].

Data generation:
Cracks are generated by fractional Brownian motion (Experiment 1) or minimal surfaces induced by the facet system of a Voronoi tessellation (Experiment 2). Dilated cracks are then integrated into CT images of concrete without cracks. As pores and cracks are both air-filled, their gray value distribution should be similar. Hence, the gray value distribution of crack pixels is estimated from the gray value distribution observed in air pores. The crack thickness is kept fixed (Experiment 1) or varies (Experiment 2) depending on the objective of the experiment. As a result, realistic semi-synthetic images can be generated (see Fig. 5). For more details on the simulation procedure, we refer to [6,10]. Details on number and size of the images can be found below. Finally, we show applicability of the Riesz network for real data containing cracks generated by tensile and pull-out tests.

Quality metrics:
As metrics for evaluation of the segmentation results we use precision (P), recall (R), F1-score (or Dice coefficient, Dice), and Intersection over Union (IoU). The first three quality metrics are based on true positives tp (the number of pixels correctly predicted as crack), true negatives tn (the number of pixels correctly predicted as background), false positives fp (the number of pixels wrongly predicted as crack), and false negatives fn (the number of pixels falsely predicted as background). Precision, recall, and Dice coefficient are then defined via
$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}, \qquad \mathrm{Dice} = \frac{2PR}{P + R}.$$
IoU compares union and intersection of the foregrounds X and Y in the segmented image and the corresponding ground truth, respectively. That is
$$\mathrm{IoU}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}.$$
All these metrics have values in the range [0, 1] with values closer to 1 indicating a better performance.
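The four metrics can be computed from two binary masks as in the following minimal sketch. The function name `segmentation_metrics` and the convention of returning 0 when a denominator vanishes are choices made for this example and are not taken from the evaluation code of the paper.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Precision, recall, Dice and IoU for two binary masks of equal shape."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.sum(pred & truth)    # crack pixels predicted as crack
    fp = np.sum(pred & ~truth)   # background pixels predicted as crack
    fn = np.sum(~pred & truth)   # crack pixels predicted as background
    # Convention chosen for this sketch: a metric with zero denominator is 0.
    ratio = lambda num, den: float(num) / float(den) if den > 0 else 0.0
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    dice = ratio(2 * tp, 2 * tp + fp + fn)   # equals 2PR / (P + R)
    iou = ratio(tp, tp + fp + fn)            # |X and Y| / |X or Y|
    return {"P": precision, "R": recall, "Dice": dice, "IoU": iou}

truth = np.zeros((4, 4), bool)
truth[1, :] = True                 # 4 crack pixels in the ground truth
pred = np.zeros((4, 4), bool)
pred[1, 1:] = True                 # 3 of them are found
pred[2, 0] = True                  # plus 1 false alarm
m = segmentation_metrics(pred, truth)
```

In this toy case tp = 3, fp = 1, and fn = 1, so precision, recall, and Dice are all 0.75, while IoU is 3/5 = 0.6. The example also illustrates that IoU is never larger than Dice.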

Training parameters:
If not specified otherwise, all models are trained on cracks of fixed width of 3 pixels. All models are trained for 50 epochs with initial learning rate 0.001 which is halved every 20 epochs. ADAM optimization [43] is used, while the cost function is set to binary cross entropy loss.
Crack pixels are labelled with 1, while background is labelled with 0. As there are far more background than crack pixels, we deal with a highly imbalanced data set. Therefore, crack and pore pixels are given a weight of 40 to compensate for class imbalance and to help distinguish between these two types of structures which hardly differ in their gray values.
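The training setup can be illustrated by a short sketch of the learning rate schedule and the weighted loss. Several details are assumptions made for this example and are not specified in the text: epochs are counted from 0, the weighted loss is averaged over all pixels without further normalization, and the pore mask is a hypothetical input.

```python
import numpy as np

def learning_rate(epoch, initial=1e-3, step=20, factor=0.5):
    """Step schedule: the rate is multiplied by `factor` every `step` epochs."""
    return initial * factor ** (epoch // step)

def weighted_bce(prob, label, weight_map, eps=1e-7):
    """Pixel-wise weighted binary cross entropy, averaged over the image."""
    prob = np.clip(prob, eps, 1.0 - eps)  # guard against log(0)
    loss = -(label * np.log(prob) + (1 - label) * np.log(1 - prob))
    return float(np.mean(weight_map * loss))

label = np.zeros((8, 8))
label[3, :] = 1.0                              # crack pixels (label 1)
pores = np.zeros((8, 8), bool)
pores[6, 2:4] = True                           # hypothetical pore mask (label 0)
weights = np.where((label == 1) | pores, 40.0, 1.0)
prob = np.full((8, 8), 0.5)                    # an uninformative prediction
loss = weighted_bce(prob, label, weights)
```

With this schedule, the learning rate is 0.001 for epochs 0 to 19, 0.0005 for epochs 20 to 39, and 0.00025 for the remaining epochs of a 50 epoch run.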

Measuring scale equivariance
Measures for assessing scale equivariance have been introduced in [36,38]. For an image or feature map f, a mapping function Φ (e.g. a neural network or a subpart of a network), and a scaling function $L_a$ we define
$$\Delta_a(\Phi) := \frac{\|L_a(\Phi(f)) - \Phi(L_a(f))\|_2}{\|L_a(\Phi(f))\|_2}. \qquad (15)$$
Ideally, this measure should be 0 for perfect scale equivariance. In practice, due to scaling and discretization errors we expect it to be positive yet very small.
To measure scale equivariance of the full Riesz network with randomly initialized weights, we use a data set consisting of 85 images of size 512 × 512 pixels with crack width 11 and use downscaling factors a ∈ {2, 4, 8, 16, 32, 64}. The evaluation was repeated for 20 randomly initialized Riesz networks. The resulting values of $\Delta_a$ are given in Fig. 6.
The measure $\Delta_a$ was used to validate the scale equivariance of Deep Scale-spaces (DSS) in [36] and scale steerable equivariant networks in [38]. In both works, a steep increase in $\Delta_a$ is observed for downscaling factors larger than 16, while for very small downscaling factors, $\Delta_a$ is reported to be below 0.01. In [38], $\Delta_a$ reaches 1 for downscaling factor 45. The application scenario studied here differs from those of [36,38]. Results are thus not directly comparable but can be considered only as an approximate baseline. For small downscaling factors, we find $\Delta_a$ to be higher than in [38] (up to 0.075). However, for larger downscaling factors (a > 32), $\Delta_a$ increases more slowly, e.g. $\Delta_{64} = 0.169$. This demonstrates the resilience of Riesz networks to very high downscaling factors, i.e. large changes in scale.
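The relative equivariance error can be illustrated on toy operators. The sketch below is not the evaluation code of the paper: the rescaling by plain subsampling, the function names, and the two example mappings (a pointwise ReLU and a 3 × 3 mean filter) are assumptions chosen to make the behaviour of the measure visible.

```python
import numpy as np

def rescale(img, a):
    """Hypothetical discrete stand-in for the rescaling: keep every a-th pixel."""
    return img[::a, ::a]

def equivariance_error(phi, img, a):
    """Relative error between rescale(phi(f)) and phi(rescale(f))."""
    lhs = rescale(phi(img), a)
    rhs = phi(rescale(img, a))
    return float(np.linalg.norm(lhs - rhs) / np.linalg.norm(lhs))

def relu(x):
    """Pointwise operation, commutes with subsampling."""
    return np.maximum(x, 0.0)

def box_blur(x):
    """3 x 3 mean filter with periodic boundary, a local operator of fixed size."""
    shifts = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
    return sum(np.roll(x, s, axis=(0, 1)) for s in shifts) / 9.0

rng = np.random.default_rng(0)
f = rng.normal(size=(64, 64))
err_pointwise = equivariance_error(relu, f, a=2)  # exactly 0
err_local = equivariance_error(box_blur, f, a=2)  # clearly positive
```

The pointwise operation yields an error of exactly 0, whereas the filter with a fixed 3 × 3 window does not commute with the rescaling. This mirrors the argument that operations with a fixed spatial extent break scale equivariance.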

Experiment 1: Generalization to unseen scales
Our models are trained on images of fixed crack width 3. To investigate their behaviour on crack widths outside of the training set, we generate cracks of widths {1, 3, 5, 7, 9, 11} pixels in images of size 512 × 512, see Fig. 9. Each class contains 85 images. Besides scale generalization properties of the Riesz network, we check how well it generalizes to random variations in crack topology or shapes, too. For this experiment we will assume that the fixed width is known. This means that the competing methods will use a single scale which is adjusted to this width.

Ablation study on the Riesz network
We investigate how the network parameters and the composition of the training set affect the quality of the results, in order to learn how to design this type of neural networks efficiently.

Size of training set:
First, we investigate the robustness of the Riesz network to the size of the training set. The literature [34] suggests that neural networks based on structured receptive fields are less data hungry, i.e. their performance with respect to the size of the training set is more stable than that of conventional CNNs. Since the Riesz network uses the Riesz transform instead of a Gaussian derivative as in [34], it is expected that the same would hold here, too. The use of smaller training sets has two main benefits. First, obviously, smaller data sets reduce the effort for data collection, i.e. annotation or simulation. Second, smaller data sets reduce the training time for the network if we do not increase the number of epochs during training.
We constrain ourselves to three sizes of training sets: 1 947, 975, and 489. These numbers refer to the sets after data augmentation by flipping and rotation. Hence, the number of original images is three times smaller. In all three cases we train the Riesz network for 50 epochs and with similar batch sizes (11, 13, and 11, respectively). Results on unseen scales with respect to data set size are shown in Table 1 and Fig. 7 (left). We observe that the Riesz network trained on the smallest data set is competitive with counterparts trained on larger data sets albeit featuring generally 1-2% lower Dice and IoU.

Choice of crack width for training:
There are two interesting questions with respect to crack width. Which crack width is suitable for training of the Riesz network? Do varying crack thicknesses in the training set improve performance significantly?
To investigate these questions, we choose three training data sets with cracks of fixed widths 1, 3, or 5. A fourth data set combines crack widths 1, 3, and 5. We train the Riesz network with these sets and evaluate its generalization performance across scales. Results are summarized in Fig. 7 (center) and Table 1. Crack widths 3 and 5 yield similar results, while crack width 1 seems not to be suitable, except when trying to segment cracks of width 1. Cracks of width 1 are very thin, subtle, and in some cases even disconnected. Hence, they differ significantly from thicker cracks which are 8-connected and have a better contrast to the concrete background. This indicates that very thin cracks should be considered a special case which requires somewhat different treatment. Rather surprisingly, using the mixed training data set does not improve the metrics. Diversity with respect to scale in the training set seems not to be a decisive factor when designing Riesz networks.

Number of layers:
Finally, we investigate the explanatory power of the Riesz network depending on network depth and the number of parameters. We train four networks with 2-5 layers and 2 721, 9 169, 18 825, and 34 265 parameters, respectively, on the same data set for 50 epochs. The network with 5 layers has structure 16 → 32 → 40 → 48 → 64 and every other network is constructed from this one by removing the required number of layers at the end. Results are shown in Table 1 and in Fig. 7 (right). The differences between the networks with 3, 4, and 5 layers are rather subtle.
In general, Riesz networks appear to be robust with respect to training set size, depth of network, and number of parameters. Hence, it is not necessary to tune many parameters or to collect thousands of images to achieve good performance, in particular for generalization to unseen scales. For the choice of crack width, 3 and 5 seem appropriate while crack width 1 should be avoided.
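The reported parameter counts can be reproduced with a simple accounting, sketched below. The accounting itself is an assumption of this example: a single-channel input image, five coefficients (two first order and three second order transforms) per pair of input and output channels, one offset per output channel, a final linear combination with one offset, and batch normalization parameters left out of the count.

```python
def count_parameters(channels, in_channels=1, transforms=5):
    """Trainable parameters of a Riesz network with the given layer widths."""
    widths = [in_channels] + list(channels)
    total = 0
    for c_in, c_out in zip(widths[:-1], widths[1:]):
        total += c_in * c_out * transforms + c_out  # coefficients + offsets
    total += widths[-1] + 1  # final linear combination to one output channel
    return total

full = [16, 32, 40, 48, 64]  # widths of the 5 layer network
counts = {depth: count_parameters(full[:depth]) for depth in (2, 3, 4, 5)}
# counts == {2: 2721, 3: 9169, 4: 18825, 5: 34265}
```

That this accounting matches all four reported numbers suggests, but does not prove, that the networks were parameterized in this way.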

Comparison with competing methods
Competing methods: The four layer Riesz network is compared to two other methods: Gaussian derivative networks [27] and U-net [44] on either rescaled images [5] or an image pyramid [8]. The Gaussian derivative network uses scale space theory based on the Gaussian kernel and the diffusion equation. Using the γ-normalized Gaussian derivatives from [1], layers of first and second order Gaussian derivatives are constructed [27]. We briefly state differences between our reimplementation and the original work [27]. In order to reduce the computation time, we use a version of the Gaussian network that has a single scale channel corresponding to the training thickness during the training, while additional channels with shared weights are added for the inference, i.e. testing the scale generalization. This version of the Gaussian network is different from the original one [27], which has more scale channels during training. Using multiple scale channels in the training step has been found to result in better scale generalization properties if the different scale channels were allowed to compete against each other during the training stage. However, this increases the computational burden compared to a single channel network. In this section we use a single channel version of the network but with σ adjusted to the crack width.
The sparser scale sampling ratio of 2 used in this reimplementation from Section 4.3 is, however, expected to lead to lower performance compared to using a scale sampling ratio of √2, as used in the original work.
U-net has around 2.7 million parameters, while the Gaussian derivative network has the same architecture as the Riesz network and hence the same number of parameters (18k).
We design an experiment for detailed analysis and comparison of the ability of the methods to generalize to scales unseen during training.In typical applications, the thickness of the cracks would not be known.Here, crack thickness is kept fixed such that the correct scale of cracks is a priori known.This allows for a selection of an optimal scale (or range of scales) such that we have a best case comparison.For the Gaussian derivative network, scale is controlled by the standard deviation parameter σ which is set to the half width of the crack.Here, we have avoided the inference of the Gaussian network with multiple scale channels and have used the assumption that the scale is a priori known.For the U-net, scale is adjusted by downscaling the image to match the crack width used in the training data.Here, we restrict the downscaling to discrete factors in the set {2, 4, 8, ...} that were determined during validation.For widths 1 and 3, no downscaling is needed.For width 5, the images are downscaled by 2, for width 7 by 4, and for widths 9 and 11 by 8.For completeness, we include results for the U-net without downscaling denoted by "U-net plain".Table 2 yields the prediction quality measured by the Dice coefficient, while the other quality measures are shown in Fig. 8. Exemplary segmentation results are shown in Fig. 9.
As expected, the performance of the plain U-net decreases with increasing scale. Scale adjustment stabilizes U-net's performance but requires manual selection of scales. Moreover, the interpolation in upsampling and downsampling might induce additional errors. The decrease in performance with growing scale is still apparent (10-15%) but significantly reduced compared to the plain U-net (55%). To get more insight into the performance and characteristics of the U-net, we add an experiment similar to the one from [5]: We train the U-net on crack widths 1, 3, and 5 on the same number of images as for one single crack width. This case is referred to as "U-net-mix scale adj." in Table 2. Scales are adjusted similarly: w5 and w7 are downscaled by factor 2, w9 and w11 are downscaled by factor 4. The results are significantly better than those obtained by the U-net trained on the single width (10-15% in Dice and IoU on unseen scales), but still remain worse than those of the Riesz network trained on a single scale (by around 7% in Dice and IoU on unseen scales).
The Gaussian derivative network is able to generalize steadily across the scales (Dice and IoU 74%) but nevertheless performs worse than the scale adjusted U-net (around 10% in IoU). Moreover, it is very sensitive to noise and typical CT imaging artifacts (Fig. 9).
On the other hand, the Riesz network's performance is very steady with growing scale. We even observe improving performance in IoU and Dice with increasing crack thickness. This is due to pixels at the edge of the crack influencing the performance metrics less and less the thicker the crack gets. The Riesz network is unable to precisely localize cracks of width 1 as, due to the partial volume effect, such thin cracks appear to be discontinuous. With the exception of the thinnest crack, the Riesz network has Dice coefficients above 94% and IoU over 88% for completely unseen scales. This even holds for the cases when the crack is more than 3 times thicker than the one used for training.
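The remark that edge pixels weigh less for thicker cracks can be made concrete with a small NumPy sketch. It computes precision, recall, Dice, and IoU from binary masks using their standard definitions and compares a prediction that misses exactly one boundary row of a straight crack of width 3 with the same error on a crack of width 11. The helper names `segmentation_scores` and `band` and the synthetic band-shaped masks are illustrative assumptions, not part of the experiments reported here.

```python
import numpy as np

def segmentation_scores(pred, truth):
    """Standard overlap metrics for two binary masks (no guard against empty masks)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()    # crack pixels found
    fp = np.logical_and(pred, ~truth).sum()   # background labelled as crack
    fn = np.logical_and(~pred, truth).sum()   # crack pixels missed
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
    }

def band(width, size=64, start=20):
    """Hypothetical toy 'crack': a horizontal band of the given width."""
    mask = np.zeros((size, size), dtype=bool)
    mask[start:start + width, :] = True
    return mask

# The same one-row boundary error on a thin and on a thick crack.
results = {}
for width in (3, 11):
    truth = band(width)
    pred = band(width - 1)  # prediction misses one boundary row
    results[width] = segmentation_scores(pred, truth)
    print(width, results[width])
```

In this toy setting the identical one-pixel boundary error costs the thin crack far more in Dice and IoU than the thick one, which matches the explanation given above.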

Experiment 2: Performance on multiscale data
Since cracks are naturally multiscale structures, i.e. crack thickness varies as the crack propagates, the performance of the considered methods on multiscale data is analyzed as well. On the one hand, we want to test on data with an underlying ground truth without relying on manual annotation, which is prone to errors and subjectivity. On the other hand, the experiment should be conducted in a more realistic and less controlled setting than the previous one, with cracks as similar as possible to real ones. We therefore again use simulated cracks, this time however with varying width. The thickness is modeled by an adaptive dilation. See Fig. 10 for realization examples. What makes this experiment more realistic than the first one is that no prior information about the scale is exploited. The Riesz network does not have to be adjusted for this experiment while the competing methods require scale selection as described in Section 4.2.2. Without knowing the scale, testing several configurations is the only option. See Appendix B for examples. Note that in this experiment we used a different crack simulation technique [10] than in Experiment 1. In principle, we cannot claim that either of the two techniques generates more realistic cracks. However, this change serves as an additional check of the methods since these simulation techniques can be seen as independent.
We adjust the U-net as follows: We downscale the image by several factors from {2, 4, 8, 16, ...}. The forward pass of the U-net is applied to the original and every downscaled image. Subsequently, the predictions on the downscaled images are upscaled back to the original size. All predictions are joined by the maximum operator. We report results for several downscaling factor combinations specified by a number N, which is the number of consecutive downscaling factors used, starting at the smallest factor 2. As in Experiment 1, we report results of two U-net models: the first model is trained on cracks of width 3, like the other models in the comparison. The second model is trained on cracks with mixed widths. Including more crack widths in the training set has proven to improve the scale generalization ability in Experiment 1. Hence, the second model represents a more realistic setting that would be used in practice where the crack width is typically unknown. We denote the respective networks as "U-net pyramid" N and "U-net-mix pyramid" N.

Table 3: For each competing method, the highest value is underlined. The scale sampling rate for the Gaussian network and U-net is set to 2, while the number of sampled scales is in the set {1, 2, 3, 4}.
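The U-net pyramid procedure described above (downscale, predict, upscale, fuse by maximum) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: `model` is any callable that maps an image batch to a prediction map of the same spatial size, bilinear interpolation is an arbitrary choice since the interpolation scheme is not specified here, and the function name `pyramid_predict` is hypothetical.

```python
import torch
import torch.nn.functional as F

def pyramid_predict(model, image, num_levels):
    """Fuse predictions on the original image and on num_levels downscaled copies.

    image: tensor of shape [batch, channels, height, width].
    num_levels: number N of consecutive downscaling factors 2, 4, 8, ...
    """
    height, width = image.shape[-2:]
    fused = model(image)  # prediction at the original scale
    for level in range(1, num_levels + 1):
        factor = 2 ** level
        small = F.interpolate(
            image,
            size=(height // factor, width // factor),
            mode="bilinear",          # assumption: interpolation mode not given in the text
            align_corners=False,
        )
        pred = model(small)
        # Bring the prediction back to the original size before fusing.
        pred = F.interpolate(pred, size=(height, width), mode="bilinear", align_corners=False)
        fused = torch.maximum(fused, pred)  # join all predictions by the maximum operator
    return fused
```

Note that each pyramid level costs one additional forward pass, which is the overhead the single-pass Riesz network avoids.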
For the Gaussian network, we vary the standard deviation parameter σ in the set {1.5, 3, 6, 12}. This selection of scales is motivated by the network having been trained on crack width 3 with σ = 1.5. We start with the original σ and double it in each step. Note that in the related study on the scale sampling for scale-channel deep networks [5], using a scale sampling factor of √2 was found to lead to better performance than using a scale sampling factor of 2. Hence, additional experimentation with hyperparameters of the method might improve the results. We decided to keep the sampling scheme as similar as possible to the one from U-net. The reason is that U-net with downscaling was more extensively tested on the crack segmentation task [8,45]. As for the U-net, we test several configurations, now specified by the number N of consecutive σ values used, starting at the smallest (1.5). We denote the respective network "Gaussian network" N.
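To illustrate the trade-off between the two sampling factors mentioned above, the following sketch enumerates the σ values obtained by geometric sampling from σ = 1.5 up to σ = 12. The helper `sigma_schedule` and the small numerical tolerance are assumptions made for this illustration only.

```python
import math

def sigma_schedule(sigma_start, ratio, sigma_max):
    """Geometric scale sampling: sigma_start * ratio**k for all values up to sigma_max."""
    sigmas = []
    k = 0
    # Small relative tolerance so that rounding in ratio**k does not drop the last value.
    while sigma_start * ratio ** k <= sigma_max * (1 + 1e-9):
        sigmas.append(sigma_start * ratio ** k)
        k += 1
    return sigmas

coarse = sigma_schedule(1.5, 2.0, 12.0)          # sampling factor 2, as used in this section
fine = sigma_schedule(1.5, math.sqrt(2.0), 12.0)  # sampling factor sqrt(2), as in [5]
print(len(coarse), coarse)
print(len(fine), [round(s, 3) for s in fine])
```

Covering the same σ range with a factor of √2 needs seven scale channels instead of four, so the denser sampling roughly doubles the inference cost in exchange for the expected gain in performance.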
Results are reported in Table 3 and Fig. 10. We observe a clear weakness of the Riesz network in segmenting thin cracks (Fig. 10, first and last row). Despite this, the recall is still quite high (90%). However, this could be due to thicker cracks, which are handled very well, contributing more strongly to these statistics as they occupy more pixels. Nevertheless, the Riesz network deals with the problem of a wide range of scales without sampling the scale dimension, just with a single forward pass of the network.
The performance of the U-net improves as more levels are included in the pyramid. However, this applies only up to a certain number of levels after which the additional gain becomes minimal. Moreover, applying the U-net on downscaled images seems to induce oversegmentation of the cracks (Fig. 10, second and third row). Including a variety of crack widths in the training set improves the overall performance of U-net in all metrics. This confirms the hypothesis that U-net significantly benefits from variations in the training set. However, this model of U-net is still outperformed by the Riesz network trained on a single crack width. The Gaussian network behaves similarly to the U-net, with slightly worse performance (according to Dice or IoU) but better crack coverage (Recall). As the number of σ values grows, the recall increases but at the same time artifacts accumulate across scales, reducing precision. The best balance on this data set is found to be three scales.

Experiment 3: Application to cracks in CT images of concrete
Finally, we check the methods' performance on real data: cracks in concrete samples generated by tensile and pull-out tests. In these examples, the crack thickness varies from 1 or 2 pixels to more than 20 pixels (Fig. 11). This motivates the need for methods that automatically generalize to completely unseen scales. Here, we can assess the segmentation results only qualitatively as no ground truth is available. Manual segmentation of cracks in high resolution images is time consuming and prone to individual biases. Additional experiments on real cracks in different types of concrete are shown in Appendix C.
The first sample (Fig. 11, first row) is a concrete cylinder with a glass fiber reinforced composite bar embedded along the center line. A force is applied to this bar to pull it out of the sample and thus initiate cracking. Cracking starts around the bar and branches in three directions: left, right diagonal, and down (very subtle, thin crack). Crack thicknesses and thus scales vary depending on the crack branch. As before, our Riesz network is able to handle all but the finest crack thicknesses efficiently in a single forward pass without specifying the scale range. The U-net on the image pyramid requires a selection of downsampling steps (Appendix B), accumulates artifacts from all levels of the pyramid, and slightly oversegments thin cracks (left branch).
The second sample (Fig. 11, second row) features a horizontal crack induced by a tensile test. Here we observe permanently changing scales, similar to our simulated multiscale data. The crack thickness varies from a few to more than 20 pixels. Once more, the Riesz network handles the scale variation well and segments almost all cracks with minimal artifacts. In this example, the U-net covers the cracks well, too, even the very subtle ones. However, it accumulates more false positives in the areas of concrete without any cracks than the Riesz network.

Conclusion
In this paper we introduced a new type of scale invariant neural network based on the Riesz transform as filter basis instead of standard convolutions. Our Riesz neural network is scale invariant in one forward pass without specifying scales or discretizing and sampling the scale dimension. Its ability to generalize to scales differing from those trained on is tested and validated in segmenting cracks in 2d slices from CT images of concrete. The usefulness of the method becomes manifest in the fact that only one fixed scale is needed for training, while preserving generalization to completely unseen scales. This reduces the effort for data collection, generation, or simulation. Furthermore, our network has relatively few parameters (around 18k), which reduces the danger of overfitting.
Experiments on simulated yet realistic multiscale cracks as well as on real cracks corroborate the Riesz network's potential. Compared to other deep learning methods that can generalize to unseen scales, the Riesz network yields improved, more robust, and more stable results.
A detailed ablation study on the network parameters reveals several interesting features: This type of network requires relatively little data to generalize well. The Riesz network proves to perform well on a data set of approximately 200 images before augmentation. This is particularly useful for deep learning tasks where data acquisition is exceptionally complex or expensive. The performance as a function of the depth of the network and the number of parameters has been analyzed. Only three layers of the network suffice to achieve good performance on cracks in 2d slices of CT images. Furthermore, the choice of crack thickness in the training set is found not to be decisive for the performance. Training sets with crack widths 3 and 5 yield very similar results.
The two main weaknesses of our approach in the crack segmentation task are undersegmentation of thin cracks and edge effects around pores. In CT images, thin cracks appear brighter than thicker cracks due to the partial volume effect reducing the contrast between the crack and concrete. For the same reason, thin cracks appear discontinuous. Thin cracks might therefore require special treatment. In some situations, pore edge regions get erroneously segmented as crack. These can, however, be removed by a post-processing step and are not a serious problem.
To unlock the full potential of the Riesz transform, validation on other types of problems is needed. Furthermore, scaling the Riesz network to larger, wider, and deeper models remains an open topic. Our study as well as previous ones [34,46,47] imply that small models based on linear combinations of convolutions with fixed filters could yield results comparable to those of large CNN models. However, in order to state this reliably, training convergence, expressiveness, and run-times of these types of networks have to be compared systematically to those of CNNs.
In the future, the method should be applied in 3d since CT data is originally 3d. In this case, memory issues might occur during discretization of the Riesz kernel in frequency space.
An interesting topic for further research is to join translation and scale invariance with rotation invariance to design a new generation of neural networks with encoded basic computer vision properties [40]. This type of neural network could be very efficient because it would have even fewer parameters and hence would require less training data, too.

Appendix A Experiment on MNIST Large Scale Dataset
We evaluate the Riesz network on a classification task on the MNIST Large Scale data set [5] to assess the wider applicability of Riesz networks beyond the crack segmentation task. This data set was derived from the MNIST data set [48] and consists of images of digits between 0 and 9 belonging to one of ten classes (Fig. 12), which are rescaled to a wide range of scales to test the scale generalization abilities of neural networks (Fig. 13).
Our Riesz network has the channel structure 12-16-24-32-80-10 with the softmax function at the end. In total, it has 20,882 parameters. Following [27], only the central pixel in the image is used for classification. We use the standard CNN described in [5] but without any scale adjustments as a baseline to illustrate its limited scale generalization. This CNN has the channel structure 16-16-32-32-100-10 with the softmax function at the end and in total 574,278 parameters. The training set has 50,000 images of the single scale 1. We used a validation set of 1,000 images. The test set consists of scales ranging in [0.5, 8] with 10,000 images per scale. All images have size 112 × 112. Models are trained using the Adam optimizer [43] with default parameters for 20 epochs with learning rate 0.001, which is halved every 3 epochs. Cross-entropy is used as loss function.
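The training configuration stated above (Adam with default parameters, 20 epochs, learning rate 0.001 halved every 3 epochs, cross-entropy loss) could be set up in PyTorch roughly as follows. This is a sketch under assumptions: the tiny linear `model` and the random batches are placeholders for the actual networks and data, and `StepLR` with `step_size=3, gamma=0.5` is one plausible reading of "halved every 3 epochs". Note that `nn.CrossEntropyLoss` expects raw logits and applies the log-softmax itself.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder classifier for 112 x 112 single-channel images (not the Riesz network).
model = nn.Sequential(nn.Flatten(), nn.Linear(112 * 112, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # remaining Adam parameters at defaults
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

learning_rates = []
for epoch in range(20):
    learning_rates.append(optimizer.param_groups[0]["lr"])
    # One synthetic batch per epoch stands in for a pass over the training set.
    images = torch.randn(4, 1, 112, 112)
    labels = torch.randint(0, 10, (4,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # halves the learning rate after every third epoch

print(learning_rates)
```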
Fig. 14 shows validation and training loss during 20 epochs. Interestingly, the Riesz network converges faster and even its validation loss remains lower than the training loss of the CNN. Accuracies for the different scales are shown in Table 4.
The Riesz network shows stable accuracy for scales in the range [0.5, 4]. The CNN, which has far more degrees of freedom, is only competitive for scales close to the training scale. Results for two scale adjusted versions of the CNN as reported in [5] are also given in Table 4. Their performance is slightly superior to that of the Riesz network (by around 1-2%). However, it is important to note that this approach uses (max or average) pooling over 17 scales.
Further works considering the MNIST Large Scale data set are [27,37]. Unfortunately, no numeric values of the accuracies are provided, so we can compare the results only qualitatively. The Riesz network's accuracy varies less on a larger range of scales than those of the scale-equivariant networks on Gaussian or morphological scale spaces from [37] that were trained on scale 2. The Gaussian derivative network [27] trained on scale 1 yields results in a range between 98% and 99% for medium scales [0.7, 4.7] using pooling over 8 scales. The Riesz network yields similar values but without the need for scale selection.
On the smallest scale of 0.5, the Riesz network seems to give a better result than [27], while it is outperformed on the largest scales. The reason for the latter is that digits start to reach the boundary of the image. To reduce that effect, we pad the images by 20 and 40 pixels with the minimal gray value. Indeed, this improves the accuracy significantly for larger scales (Table 4), while it remains equal for the rest of the scales. For example, for scale 8, accuracy increases from 51.8% to 79.8% (padding 20) and 83.6% (padding 40). This is a better accuracy than that reported in [27] and [5] for models trained on scale 1.
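The padding step described above can be sketched with `torch.nn.functional.pad`. The helper name `pad_with_min` is hypothetical, and taking the minimum over the whole batch rather than per image is an assumption, since the text only speaks of "the minimal gray value".

```python
import torch
import torch.nn.functional as F

def pad_with_min(images, pad):
    """Pad a batch [batch, channels, height, width] on all four sides with its minimal gray value."""
    fill = float(images.min())  # assumption: one fill value for the whole batch
    return F.pad(images, (pad, pad, pad, pad), mode="constant", value=fill)

batch = torch.rand(2, 1, 112, 112) + 0.5   # synthetic stand-in for test images
padded_20 = pad_with_min(batch, 20)        # 112 -> 152 pixels per side
padded_40 = pad_with_min(batch, 40)        # 112 -> 192 pixels per side
print(padded_20.shape, padded_40.shape)
```

Because the Riesz transform is computed on the whole image, the padded inputs can be passed to the same trained network without retraining; that this carries over unchanged to the classification head used here is an assumption of this sketch.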

Appendix B Experiments on scale selection for competing methods related to Riesz network
The largest benefit of the Riesz network is avoiding the sampling of the scale dimension. Here, we give more detailed insight into scale sampling in practice for competing methods: U-net applied on rescaled images and Gaussian derivative networks. We show how segmentation results change as we add additional scales to the output. As we add new scales, cracks that belong (or are close) to the added scales get segmented. However, additional noise gets segmented, too. These noise pixels that are misclassified as cracks originate from two sources: interpolation error and high frequency noise. For simulated data this is shown in Fig. 15 and Fig. 16. For real cracks see Fig. The main drawback is that one needs to select the range of scales on which to apply these methods. Since the scale dimension in the images is bounded from above by the size of the view window, scale sampling needs to be adjusted or recalibrated for images of different sizes. It is not trivial how to achieve this in a general manner.
In contrast, the Riesz transform enables simultaneous, continuous, and equal treatment of all scales automatically adapting to the image size.

Appendix C Experiments on different types of concrete: fiber reinforced concrete
It is a well-known weakness of concrete that it has low tensile strength, i.e. under high tensile force it fails abruptly and explosively. For that reason, reinforcement material is mixed with the cement paste, creating a composite material. The most common reinforcements are steel rebars. Nowadays, fibers have become widely used as reinforcement in concrete, creating a new class of reinforced concrete materials, e.g. ultra high performance fiber-reinforced concrete [49][50][51]. A variety of materials can be used as fiber material, including glass, carbon, and basalt. Since all of these materials have different mechanical properties, the properties of fiber reinforced concrete are connected to the properties of the concrete mixture, including the fiber material. Hence, a lot of effort has recently been invested in the investigation of fiber reinforced concrete samples with various material configurations. In the context of CT imaging, different materials mean different energy absorption properties, i.e. fibers can appear either brighter or darker than concrete, which can result in very different images. In the context of crack segmentation, this means that our methods should be able to efficiently handle these variations. This section compares the performance of the Riesz network, U-net, and U-net-mix from the previous sections on three different fiber reinforced concrete images. We comment on possible post-processing steps to improve results and discuss the robustness of the methods in the context of fiber reinforced concrete.
Fig. 18 shows a sample of high performance concrete (HPC) with polypropylene fibers as reinforcement. See [45] for more details on the sample and crack initiation. In this image, fibers are long and appear dark and hence interfere with the crack in the center. All three methods are able to extract the dominant crack in the center. Unlike both U-nets, the Riesz network is not able to segment the thin crack to the left of the main crack. However, both U-nets accumulate a much larger amount of misclassified noise compared to the Riesz network.
Fig. 19 features a sample reinforced with steel fibers. For more details see [52,53]. In this image, fibers appear bright and create uneven illumination effects. We use a simple pre-processing step to understand if we can reduce this effect and improve the performance of the methods. Simple morphological openings with square structuring elements of half-sizes 2 and 5 are used for that purpose. As the size of the structuring element increases, segmentation results improve for all three methods. While the Riesz network struggles with low contrast cracks on the right, both types of U-net falsely segment many non-crack voxels.
The CT image from Fig. 20 originates from ultra high performance concrete reinforced with steel fibers [50]. Again, fibers turn out to be bright structures in the images. These extremely highly X-ray absorbing fibers affect the gray value dynamics of the CT images. Morphological openings with square structuring elements of half-sizes 2 and 5 are applied to reduce this effect. As we increase the size, crack segmentation improves for the Riesz network. Both types of U-net segment large amounts of noise, even with opening as a pre-processing step, rendering them ineffective for this sample.
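The pre-processing used above, a gray value opening with a square structuring element of half-size r (side length 2r + 1), can be sketched with max pooling in PyTorch. This is an illustrative implementation, not necessarily the one used for the figures: the function name `square_opening` is hypothetical, and pixels outside the image are ignored at the borders, which is an assumption about boundary handling.

```python
import torch
import torch.nn.functional as F

def square_opening(image, half_size):
    """Gray value opening (erosion, then dilation) of a 2d float image with a square of side 2*half_size+1."""
    side = 2 * half_size + 1
    x = image[None, None]  # add batch and channel dimensions for the pooling functions
    # Erosion = local minimum, written as a negated max pooling of the negated image.
    eroded = -F.max_pool2d(-x, kernel_size=side, stride=1, padding=half_size)
    # Dilation = local maximum.
    opened = F.max_pool2d(eroded, kernel_size=side, stride=1, padding=half_size)
    return opened[0, 0]

# Toy image: a small bright blob (like a fiber cross-section) and a larger bright region.
toy = torch.zeros(32, 32)
toy[5:8, 5:8] = 1.0      # 3 x 3 blob, smaller than the 5 x 5 structuring element
toy[15:24, 15:24] = 1.0  # 9 x 9 region, larger than the 5 x 5 structuring element
opened_2 = square_opening(toy, 2)
opened_5 = square_opening(toy, 5)
```

Bright structures that do not contain the whole structuring element are removed while larger bright regions and dark structures such as cracks are retained, which is why increasing the half-size suppresses more of the bright steel fibers.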

Appendix D Method implementation
The Riesz network has been implemented in PyTorch. Here, we discuss the implementation of the Riesz transform (Algorithm 1) and the Riesz layer (Algorithm 2).
In Algorithm 1, the discrete Fourier transform "FT" denotes the composition of two torch functions: first, torch.fft.fft2 is applied to the input, followed by torch.fft.fftshift. Similarly, its inverse "iFT" uses torch.fft.ifftshift and torch.fft.ifft2, i.e. the inverse operations in the reverse order compared to "FT". The operator "⊙" denotes a pointwise multiplication of two image maps (of the same size).
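As a rough illustration of the "FT" / "iFT" convention described above, the following PyTorch sketch applies the first order Riesz transform in the frequency domain. It is not a reproduction of Algorithm 1: the function name `riesz_first_order`, the frequency grid built with torch.fft.fftfreq, the multiplier convention -i ξ_j / |ξ|, and the choice of a zero multiplier at the zero frequency are assumptions made for this sketch.

```python
import torch

def riesz_first_order(image):
    """First order Riesz transforms (along x and along y) of a real 2d image or batch of images."""
    height, width = image.shape[-2:]
    # "FT": fft2 followed by fftshift, so the zero frequency sits at the center.
    spectrum = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))

    # Centered frequency coordinates matching the shifted spectrum.
    freq_y = torch.fft.fftshift(torch.fft.fftfreq(height))
    freq_x = torch.fft.fftshift(torch.fft.fftfreq(width))
    xi_y, xi_x = torch.meshgrid(freq_y, freq_x, indexing="ij")
    norm = torch.sqrt(xi_x ** 2 + xi_y ** 2)
    # Avoid 0/0 at the zero frequency; the numerator is 0 there, so the multiplier becomes 0.
    norm = torch.where(norm == 0, torch.ones_like(norm), norm)

    kernel_x = -1j * xi_x / norm
    kernel_y = -1j * xi_y / norm

    def inverse(shifted_spectrum):
        # "iFT": ifftshift followed by ifft2; the imaginary part is discarded (assumed negligible).
        return torch.fft.ifft2(torch.fft.ifftshift(shifted_spectrum, dim=(-2, -1))).real

    # Pointwise multiplication of spectrum and kernel, then back to image space.
    return inverse(spectrum * kernel_x), inverse(spectrum * kernel_y)
```

For an image that varies only along x, this transform reduces to a Hilbert transform along x, so a cosine wave is mapped to the corresponding sine wave; this gives a simple sanity check for the sign and shift conventions.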
Based on equation (2), Algorithm 2 is the implementation of equations (11) and (12). Here, a linear combination of Riesz transforms of the input feature maps can be seen as a 1d convolution across the channel dimension. Hence, "conv1d" represents the torch function torch.nn.Conv2d with kernel size (1, 1), where [5·N_input, N_output] specifies the number of input and output channels, respectively.
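The channel mixing step described above can be sketched as follows. The sketch only covers the linear combination, not the Riesz transforms themselves: the random tensor `riesz_features` stands in for the five Riesz responses per input channel (presumably the two first order and three second order transforms), the channel counts are hypothetical, and keeping the default bias of torch.nn.Conv2d is an assumption.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_input, n_output = 3, 8  # hypothetical channel counts

# Linear combination across channels, realized as a convolution with kernel size (1, 1).
mix = nn.Conv2d(5 * n_input, n_output, kernel_size=(1, 1))

# Stand-in for the stacked Riesz responses: [batch, 5 * n_input, height, width].
riesz_features = torch.randn(2, 5 * n_input, 32, 32)
out = mix(riesz_features)

# The same operation written as an explicit per-pixel linear combination.
weight = mix.weight.view(n_output, 5 * n_input)
manual = torch.einsum("oc,bchw->bohw", weight, riesz_features) + mix.bias.view(1, -1, 1, 1)

n_parameters = sum(p.numel() for p in mix.parameters())
print(out.shape, n_parameters)
```

Since the kernel has spatial size 1 × 1, the layer has no spatial extent of its own; all spatial information enters through the Riesz transforms, and the number of trainable parameters per layer stays at 5·N_input·N_output plus the bias terms.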

Declarations
Ethical Approval. Not applicable.
Competing interests. The authors have no competing interests to declare that are relevant to the content of this article.

Fig. 1: Examples of similar objects appearing on different scales: section of a CT image of concrete showing a crack of locally varying thickness (left) and pedestrians at different distances from the camera (right, taken from [9]).

Fig. 3: Illustration of the Riesz transform on a mock example of 550 × 550 pixels: aligned rectangles with equal aspect ratio and constant gray value 255 (left) and response of the second order Riesz transform R (2,0) of the left image sampled horizontally through the centers of the rectangles (right).

Fig. 7: Experiment 1. Effect of the training set size (left), the crack width in the training set (center), and the network depth (right) on generalization to unseen scales. The baseline Riesz network is marked with 1 947 (left), w3 (center), and layer 4 (right) and with square symbol □. Quality metric: IoU.

Fig. 8: Experiment 1. Comparison of the competing methods. Results of the simulation study with respect to crack width. Training on crack width 3. Quality metrics (from left to right): precision, recall, and IoU.

Fig. 10: Experiment 2. Cracks with varying width. From left to right: input image, results of the Riesz network and the U-net with 4 pyramid levels. Image size 400 × 400 pixels.

Fig. 14: Train and validation loss for Riesz network and CNN (as a baseline).

Fig. 19: Cracks in concrete with steel fibers. Rows: input image, segmentation results from the Riesz network, U-net and U-net mix, respectively. Columns: original images, images after applying square opening of half-size 2, and images after applying square opening of half-size 5. Image size is 1 295 × 336.

Fig. 20: Cracks in samples of ultra high performance concrete reinforced with steel fibers. Rows (from left to right): input image, segmentation results from the Riesz network, U-net and U-net mix, respectively. Columns: original images, images after applying square opening of half-size 2, and images after applying square opening of half-size 5. Image size is 1 579 × 772.

Table 1: Experiment 1. Ablation study: scale generalization ability of Riesz networks. Baseline is trained on 1 947 images with cracks of width 3 and has 4 layers. Cells are colored in light gray if the metric is better than for the baseline, but not by more than 0.02. Dark gray color is used for metrics being more than 0.02 better compared to the baseline.

Table 2: Experiment 1. Comparison with competing methods: Dice coefficients for segmentation of cracks of differing width. Training was performed on crack width 3. The best performing method is given in bold. Both Gaussian network and U-net adj. are applied on a single scale that was selected according to the crack width.

Table 3: Experiment 2. Performance on simulated multiscale cracks. The highest overall value is given in bold.

Table 4 )
, while it remains equal for the rest of the scales.For example, for scale 8, accuracy increases from 51.8% to 79.8%

Table 4: Classification accuracy (in %) on the MNIST Large Scale data set. The best performing method is given in bold.