1 Introduction

Recently, neural networks for semantic segmentation have been mostly based on the fully convolutional network (FCN) [11] paradigm. FCN models typically consist of an encoder and a decoder, both of which are built from stacked convolution layers. The purpose of the encoder is to extract features from the image. With increasing depth of the encoder, the features become more abstract and the resolution of the feature maps is progressively reduced. The decoder, on the other hand, takes the low-resolution feature maps from the encoder as input and upscales them to the resolution of the original image so that pixel-level classification can be performed.

While encoders in the form of convolutional neural networks (CNN) have been rigorously studied, considerably less research has been conducted on the decoder side of semantic segmentation networks. The main challenge for the decoder is to upscale the feature maps to the original resolution of the image and simultaneously produce accurate region borders. In CNN-based decoders, upsampling or transposed convolution operators are typically used to progressively increase the spatial resolution of the feature maps. These operations introduce a particular kind of inductive bias. For example, transposed convolutions can create spectral artifacts in the upscaled feature maps [5]. Another apparent disadvantage of CNN decoders is that they struggle to capture long-range dependencies between different parts of the image due to their locally connected structure.

In the last few years, neural fields, also known as implicit neural representations or coordinate-based networks, have received much attention for learning a variety of signals, for example, 1D audio signals [22], 2D images [4, 26] and 3D geometries [12, 24]. A neural field takes (spatial) coordinates \(x \in \mathbb {R}^d\) as input and maps them to a task-dependent signal \(y = \varPhi (x)\) through a neural network. For example, a neural field representing an RGB image takes 2D image coordinates as input and produces three RGB values at each location. One interesting property of neural fields is that they represent signals as continuous functions on their respective spatial domain.

Inspired by the recent success of neural fields, we explore the use of neural fields as decoders in semantic segmentation networks. In this regard, we hypothesize that (continuous) neural fields provide an inductive bias which can be better suited for reconstructing high-resolution semantic maps compared with (discrete) CNN-based decoders. In our work, we examine multiple conditioning strategies which enable the neural field decoder to make use of the information in the latent feature maps produced by the encoder. Through our comparative study, we aim to provide more insights into conditioning methods of neural fields, as research has been extremely sparse in this regard. Furthermore, we believe that 2D semantic segmentation provides a well-defined task for studying conditioning methods, as it has comprehensive metrics and the possibility for insightful visualizations of the learned geometries.

2 Related Work

Semantic Segmentation. Encoder-decoder fully convolutional networks [11] have become the predominant approach for semantic segmentation. They share the challenge of how to encode high-level features in typically low-resolution feature maps and subsequently upscale these feature maps to retrieve pixel-accurate semantic predictions. One drawback of CNNs is that, because of their locally connected structure, they struggle to combine information which is spatially distributed across the feature maps. Research attempting to mitigate this drawback has proposed attention mechanisms over feature maps to selectively capture and combine information on a global scale [6]. Extending the concept of attention further, neural network architectures based fully on transformers have recently been proposed for semantic segmentation [25]. In our work, we utilize a CNN, which is more efficient than transformers, for extracting features and use attention in one of our conditioning methods.

CNN Decoders. Research on decoders has been more sparse than research on neural network encoders, i.e. CNN backbones. Wojna et al. [28] compare different CNN-based decoders for several pixel-wise prediction tasks and observe significant variance in results between different types of decoders. Multiple works [5, 14] provide evidence that upscaling using transposed convolution operators introduces artifacts in the feature maps and therefore the decoder’s output. We aim to avoid any explicit or implicit discretization artifacts by using a continuous neural field decoder.

Neural Fields. Neural fields were introduced in 2019 as a representation for learning 3D shapes [12, 15]. Subsequent works extended neural fields by learning colored appearances of scenes and objects [13, 24]. NeRF [13] in particular has attracted a lot of attention, as it is able to generate very realistic novel views of a scene, learning from images and associated poses. NeRF effectively overfits a neural network to one individual scene. This limits its usability, as the neural field needs to be re-trained for every new scene. Some works have explored the use of neural fields for semantic segmentation. Vora et al. [27] built a 3D segmentation on top of the NeRF approach. Hu et al. [9] used neural fields in conjunction with CNNs for upsampling and aligning feature maps in the decoder of a semantic segmentation network.

Neural Field Conditioning. When a neural field should share knowledge between different signals, it needs to be conditioned on a latent code which describes the signal at hand. Several conditioning approaches have been explored in the literature. Methods based on global conditional codes use one code to describe the whole signal [12, 23]. Methods based on local conditional codes [4, 29] use a different code for each spatial area in the signal. On top of these, there exist multiple ways in which a neural field can actually consume a conditional code, which we describe in detail in Sect. 3.3. Rebain et al. [18] compare different methods for conditioning neural fields for 2D and 3D tasks, but do not consider global and local conditional codes. In the neural field community, there is a lack of comparative research on which conditioning strategies work well for which task. We attempt to shed more light on this by comparing different conditioning strategies for the well-defined task of 2D semantic segmentation.

3 Method

3.1 Neural Network Architecture and Training Procedure

Our high-level architecture involves a CNN encoder and a neural field decoder (see Fig. 1). We use a CNN to efficiently encode an image into a feature volume with size \(c \times h \times w\), where c is the number of channels, w is the spatial width and h is the spatial height. From this feature volume, we calculate the conditional code for the neural field decoder in different ways, depending on the conditioning strategy. During training, for every image, we sample S random points within the image. At test time, the points are densely sampled so that there exists one point for each pixel. The point coordinates are normalized to the [0,1] range, stacked, and fed to the neural field decoder as input. For every point, the decoder predicts the semantic class at that position in the image. We use a cross-entropy loss to train the whole setup in an end-to-end fashion. At test time, the class predictions per point are arranged into an image. Thereby, we can compare the predictions with the class labels using standard image segmentation metrics, such as the Intersection over Union (IoU).
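The two sampling regimes described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names are hypothetical, and placing the dense test points at pixel centers is an assumption that the text does not specify.

```python
import random

def sample_training_points(num_points, rng):
    """Draw `num_points` uniformly random (x, y) coordinates in [0, 1]^2."""
    return [(rng.random(), rng.random()) for _ in range(num_points)]

def dense_test_points(height, width):
    """One point per pixel, normalized to [0, 1].

    Assumption: each point is placed at the center of its pixel.
    """
    return [
        ((col + 0.5) / width, (row + 0.5) / height)
        for row in range(height)
        for col in range(width)
    ]

rng = random.Random(0)
train_points = sample_training_points(512, rng)  # S = 512 points per image
test_points = dense_test_points(4, 4)            # tiny 4 x 4 image for illustration
```

Because the decoder is queried point by point, the same network can be trained on a sparse random subset and evaluated on the full pixel grid.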

Fig. 1. Our high-level neural network architecture. A CNN encoder encodes an image into a feature volume consisting of multiple feature maps. During training, S points per image are sampled within the image (red) and fed into the decoder. The decoder is a conditional neural field for which we use different conditioning strategies. For every point the decoder outputs a prediction of the semantic class at this position (purple). (Color figure online)

3.2 Latent Code Source: Global vs. Local

Fig. 2. A visualization of our conditioning strategies. We consider three conditioning methods: Concat conditioning, FiLM conditioning and Cross-Attention conditioning (right side). For Concat and FiLM conditioning, one feature vector is used, which can be calculated from global features (top path) or local features (mid path). The input to the Cross-Attention Transformer is the whole feature volume, which is reshaped and treated as tokens (bottom path).

First, we differentiate how the conditional code is calculated from the feature volume produced by the encoder. We consider a global code and a local code. The global code represents the content of the complete image. Naturally, it can capture the global context in the image well. However, due to its limited capacity, it might not be able to capture fine, local geometries. The local code, on the other hand, represents a spatially limited area in the image. It can utilize its full capacity for modeling the geometry in one area with high fidelity, but it might lack global context. Such context can be informative: for example, the probability of detecting a car rises when a street is detected somewhere in the image.

We calculate the global code vector by applying a global average pooling operation. It averages all the entries in the feature maps across the spatial dimensions (see the top path in Fig. 2). This is a standard operation which is used, for example, in the ResNet classification head [8]. Through this procedure, we calculate one global code per image. For calculating the local code, we utilize the point coordinates, in addition to the feature volume. For every point, we “look up” the value of the feature maps at this position. For this purpose, we normalize the feature maps’ spatial dimensions to the [0,1] range, and therefore effectively align the feature volume with the input image. We then perform a bilinear interpolation within the feature maps based on the point coordinate to calculate the local code vector (see the middle path in Fig. 2). As a result, we have S local codes per image, one for every point. In addition to using either a global or a local code, we also consider the combination of both to jointly exploit their individual advantages. We do this by concatenating both codes.
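The following sketch illustrates both code sources on a toy feature volume stored as nested lists. It is written in plain Python for readability and is not the authors' implementation. The function names are hypothetical, and the alignment convention (feature-cell centers at (i + 0.5) / size, with clamping at the border) is an assumption, since the text only states that the feature maps are normalized to the [0,1] range.

```python
def global_code(volume):
    """Global average pooling: one value per channel of a c x h x w volume."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in volume
    ]

def local_code(volume, x, y):
    """Bilinearly interpolate every channel at the normalized point (x, y).

    Assumption: cell centers sit at (i + 0.5) / size and queries beyond the
    outermost centers are clamped to the border.
    """
    h, w = len(volume[0]), len(volume[0][0])
    # Continuous position in feature-map index space.
    fx = min(max(x * w - 0.5, 0.0), w - 1.0)
    fy = min(max(y * h - 0.5, 0.0), h - 1.0)
    x0, y0 = int(fx), int(fy)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = fx - x0, fy - y0
    code = []
    for ch in volume:
        top = (1 - tx) * ch[y0][x0] + tx * ch[y0][x1]
        bottom = (1 - tx) * ch[y1][x0] + tx * ch[y1][x1]
        code.append((1 - ty) * top + ty * bottom)
    return code

# Toy feature volume with c = 2, h = 2, w = 2.
volume = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[4.0, 4.0], [8.0, 8.0]],
]
global_vec = global_code(volume)             # one code per image
local_vec = local_code(volume, 0.25, 0.25)   # one code per point
combined = global_vec + local_vec            # concatenation of both codes
```

Note that the global code is computed once per image, whereas the local code has to be looked up for each of the S points.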

3.3 Conditioning the Neural Field Decoder

Conditioning a neural field enables it to effectively adapt the knowledge which is shared across all signals to the signal at hand.

Conditioning by Concatenation. In the simplest conditioning method, the conditional code is concatenated to the point coordinates and is jointly used as input to the neural field. We re-concatenate the conditional code to other hidden layers using skip connections. This approach is used by HyperNeRF [16]. It is conceptually simple but computationally inefficient [18], because it requires \(O(k(c+k))\) parameters for the fully connected layers in the neural field, where k is the hidden layer width and c is the size of the conditioning vector.
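To make the \(O(k(c+k))\) term concrete, the sketch below counts the parameters of a single fully connected layer whose input is the hidden vector concatenated with the code. The helper names are hypothetical, and both the presence of a bias term and the choice \(k = c = 512\) are illustrative assumptions rather than a description of the exact layers used in the experiments.

```python
def concat_layer_params(hidden_width, code_size):
    """Weights plus biases of one dense layer fed with [hidden, code]."""
    return hidden_width * (hidden_width + code_size) + hidden_width

def plain_layer_params(hidden_width):
    """The same layer without the re-concatenated code."""
    return hidden_width * hidden_width + hidden_width

k, c = 512, 512  # assumed hidden width and code size
with_code = concat_layer_params(k, c)   # 524800
without_code = plain_layer_params(k)    # 262656
```

Under these assumptions, re-concatenating a code of the same size as the hidden width roughly doubles the parameter count of the layer.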

Feature-Wise Linear Modulation. Another way to condition a neural field is to use the latent code together with an MLP to regress the parameters of the neural field. When all parameters of the neural field are calculated in this way, the approach is known as hyper-networks [7]. Feature-wise Linear Modulation (FiLM) [17] is a more constrained subtype of hyper-networks where, instead of predicting all weights, feature-wise modulations of activations in the neural field are predicted. This approach is used in Occupancy Networks [12] and piGAN [2].
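A minimal sketch of the modulation step is given below. It assumes that a single linear layer per modulation parameter maps the code to a per-feature scale (gamma) and shift (beta), and it omits any normalization of the hidden activations. All names and the toy weights are hypothetical.

```python
def linear(weights, bias, vec):
    """Dense layer; `weights` holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def film(hidden, code, gamma_layer, beta_layer):
    """Scale and shift each hidden feature with values predicted from the code."""
    gamma = linear(*gamma_layer, code)
    beta = linear(*beta_layer, code)
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]

# Toy example: code of size 2 modulating a hidden vector of size 3.
gamma_layer = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
beta_layer = ([[0.0, 0.0], [0.0, 0.0], [1.0, -1.0]], [0.5, 0.5, 0.5])
code = [2.0, 3.0]
hidden = [1.0, 1.0, 1.0]
modulated = film(hidden, code, gamma_layer, beta_layer)  # [2.5, 3.5, 4.5]
```

In contrast to concatenation, the weights of the neural field itself do not grow with the code size here; only the small layers that predict gamma and beta depend on it.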

Cross-Attention. Conditioning by Cross-Attention was introduced by Jiang et al. [10] and extended in the Scene Representation Transformer [21]. The core idea is to selectively attend to features at different spatial positions, based on the point coordinates. A transformer architecture with Cross-Attention layers is used, where the queries are derived from the point coordinates and the feature volume serves as a set of tokens. This approach has an interesting connection with local codes, as both approaches calculate a feature vector by weighting entries in the feature maps based on the current point coordinate. However, in contrast to the spatial “look up” of local codes, which adds almost no computational cost, the Cross-Attention operation can flexibly query both local and global information as needed, at the cost of more computation [18].
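The core operation can be sketched as single-head scaled dot-product attention for one query point. This is a simplified illustration under stated assumptions: the learned query, key and value projections are omitted, the tokens are used directly as both keys and values, and all function names are hypothetical.

```python
import math

def reshape_to_tokens(volume):
    """Turn a c x h x w volume into h*w tokens of length c."""
    c, h, w = len(volume), len(volume[0]), len(volume[0][0])
    return [[volume[ch][i][j] for ch in range(c)] for i in range(h) for j in range(w)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(wt * v[d] for wt, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Toy volume with c = 2, h = 1, w = 2, i.e. two tokens of length 2.
tokens = reshape_to_tokens([[[1.0, 0.0]], [[0.0, 1.0]]])
# The query would be derived from the embedded point coordinates.
output, weights = cross_attention([10.0, 0.0], tokens, tokens)
```

The attention weights play a role comparable to the four bilinear interpolation weights of a local code, except that they are learned and span all h × w positions, which is where the additional computation comes from.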

4 Experiments

We evaluate seven conditioning strategies on a public dataset for semantic segmentation. Concat conditioning and FiLM conditioning are used in conjunction with global, local and combined conditional codes each. The Cross-Attention Transformer uses the reshaped feature volume as input (see Fig. 2).

4.1 Dataset

For our experiments, we used the Potsdam dataset, which is part of the ISPRS semantic labeling contest [20]. It consists of aerial images of the German city of Potsdam together with dense label masks for six classes: Impervious surfaces, Building, Low vegetation, Tree, Car and Clutter/background. The orthographic images have a ground sampling distance of 0.05 m/px. The full dataset consists of 38 tiles with a size of \(6000 \times 6000\) px, of which we use the same 24 tiles for training as in the original contest. Of the remaining tiles, we use 7 for validation and 7 for testing. From the tiles, we randomly crop patches of \(256 \times 256\) or \(512 \times 512\) pixels.
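The cropping step and the ground area covered by a patch can be illustrated with a short sketch. The helper names are hypothetical, and uniform sampling of the crop origin is an assumption; the tile size and sampling distance are taken from the description above.

```python
import random

TILE_SIZE_PX = 6000   # tile edge length stated in the text
GSD_M_PER_PX = 0.05   # sampling distance stated in the text

def random_crop_origin(patch_size, rng, tile_size=TILE_SIZE_PX):
    """Top-left (row, col) of a patch that lies fully inside the tile."""
    limit = tile_size - patch_size
    return rng.randint(0, limit), rng.randint(0, limit)

def ground_extent_m(patch_size, gsd=GSD_M_PER_PX):
    """Edge length in meters that a square patch covers on the ground."""
    return patch_size * gsd

rng = random.Random(0)
row, col = random_crop_origin(512, rng)
```

With these values, a \(256 \times 256\) patch covers about 12.8 m × 12.8 m and a \(512 \times 512\) patch about 25.6 m × 25.6 m, while the size of objects in pixels is identical for both patch sizes.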

4.2 Encoder and Decoder Implementations

For the Concat and the FiLM decoder, we use a similar neural network architecture, which is based on Occupancy Networks [12] (see Fig. 3a). We use either concatenation plus conventional batchnorm or conditional batchnorm at the designated places in the neural network architecture. For the Cross-Attention conditioning, we use a transformer architecture based on the Scene Representation Transformer [21] (see Fig. 3b). It uses one multi-head attention module per block. Keys and values are calculated from the feature tokens, while the queries are calculated from the point coordinates. We can scale both neural network architectures by repeating the yellow blocks N times or by increasing the width of the MLP layers. For all experiments, we use a ResNet34 [8] backbone, pre-trained on ImageNet, as the encoder. Its output feature volume has a size of \(512 \times 8 \times 8\) for input images of \(256 \times 256\) pixels and \(512 \times 16 \times 16\) for input images of \(512 \times 512\) pixels.
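The stated feature volume sizes follow from the total downsampling factor of 32 of a ResNet34 without its classification head. The sketch below reproduces the numbers and the resulting number of tokens seen by the Cross-Attention Transformer; the function names are hypothetical.

```python
def feature_volume_shape(image_size, channels=512, total_stride=32):
    """(c, h, w) of the encoder output for a square input image."""
    side = image_size // total_stride
    return channels, side, side

def num_tokens(image_size):
    """Number of feature tokens obtained by flattening the spatial dimensions."""
    _, h, w = feature_volume_shape(image_size)
    return h * w

small = feature_volume_shape(256)  # (512, 8, 8)
large = feature_volume_shape(512)  # (512, 16, 16)
```

Doubling the image edge length therefore quadruples the number of tokens from 64 to 256, whereas the length of a global code stays at 512.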

Fig. 3. Our neural network architectures used for the Concat and FiLM conditioning (left) and for the Cross-Attention Transformer (right). The yellow block can be repeated N times. For the Concat approach, the orange block denoted with an asterisk represents a concatenation followed by a batchnorm layer. For FiLM, the same block denotes a conditional batchnorm layer. Other batchnorm and layernorm layers have been omitted for clarity. (Color figure online)

4.3 Points Embedding

It has been shown that when coordinates are used directly as inputs, neural fields are biased towards learning low-frequency signals. To counter this, we embed both image coordinates independently into a higher-dimensional space using Fourier features, as is commonly done with neural fields [26]:

$$\begin{aligned} \gamma (x) = \left( \sin (2^0 \pi x), \sin (2^1 \pi x), \ldots , \sin (2^l \pi x), \cos (2^0 \pi x), \cos (2^1 \pi x), \ldots , \cos (2^l \pi x)\right) , \end{aligned}$$
(1)

where x is an image coordinate and l controls the embedding size.
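Eq. (1) translates directly into code. The sketch below follows the equation as written, so each coordinate is mapped to \(2(l+1)\) values; the function names are hypothetical, and concatenating the embeddings of both coordinates is an assumption about how the two independent embeddings are combined.

```python
import math

def fourier_features(x, l):
    """Eq. (1): sines and cosines at the frequencies 2^0 * pi, ..., 2^l * pi."""
    freqs = [2.0 ** i * math.pi for i in range(l + 1)]
    return [math.sin(f * x) for f in freqs] + [math.cos(f * x) for f in freqs]

def embed_point(point, l):
    """Embed each coordinate independently and concatenate the results."""
    return [value for coord in point for value in fourier_features(coord, l)]

embedding = embed_point((0.3, 0.7), l=4)  # 2 coordinates x 2 * (4 + 1) values
```

For \(l = 4\), a 2D point is thus represented by 20 values instead of 2.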

4.4 Training Parameters

The influence of the parameters used in our experiments was evaluated in preliminary runs, based on the validation performance. For all experiments, we choose a fixed learning rate of \(1 \times 10^{-4}\) for the Adam optimizer and a batch size of 64. We use horizontal and vertical flipping as data augmentation and perform early stopping based on the IoU metric on the validation set. For all neural field architectures, 512 points are sampled per image and we choose \(l=4\) as the size of the points embedding. Empirically, we have found that the results are not sensitive to either of these parameters. We have explored scaling the neural field architectures by increasing the number of blocks and the width of the MLP layers. Based on this exploration, we use a hidden size of 512 for all MLP layers, one block in the Concat and FiLM conditioning network, and two blocks in the Cross-Attention Transformer. For all architectures, we aim for approximately the same number of parameters to make a fair comparison.
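For reference, the IoU metric used for early stopping, together with the F-Score reported alongside it, can be computed per class from flattened label lists as sketched below. The function names are hypothetical, and how the per-class values are aggregated into a single score is left open, since the text does not specify it.

```python
def class_counts(pred, target, cls):
    """True positives, false positives and false negatives for one class."""
    tp = sum(p == cls and t == cls for p, t in zip(pred, target))
    fp = sum(p == cls and t != cls for p, t in zip(pred, target))
    fn = sum(p != cls and t == cls for p, t in zip(pred, target))
    return tp, fp, fn

def iou(pred, target, cls):
    """Intersection over Union: TP / (TP + FP + FN)."""
    tp, fp, fn = class_counts(pred, target, cls)
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")

def f_score(pred, target, cls):
    """F1 score: 2 TP / (2 TP + FP + FN)."""
    tp, fp, fn = class_counts(pred, target, cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")

# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 1, 2]
target = [0, 1, 1, 1, 2, 2]
iou_class_1 = iou(pred, target, 1)      # 2 / (2 + 1 + 1) = 0.5
f_class_1 = f_score(pred, target, 1)    # 4 / (4 + 1 + 1)
```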

Table 1. Results for all examined decoder architectures.

Fig. 4. The predictions of all examined decoder architectures on three example images (\(512 \times 512 \) px) from the test set. For Concat and FiLM conditioning, g denotes a global code source, l denotes a local code source and g/l denotes a concatenation of global and local code. The class color code is: white = Impervious surfaces, blue = Building, cyan = Low vegetation, green = Tree, yellow = Car, red = Clutter/background. By comparing the predictions with the ground truth segmentation masks, it can be observed that the ability to represent details, e.g. distinct objects or angular corners, varies greatly between the approaches. Only the Cross-Attention and the DeepLabV3+ decoders are able to faithfully represent the segmentation masks, while the Concat and FiLM approaches tend to produce overly smooth geometries. (Color figure online)

5 Results

In Table 1, we show the Intersection over Union (IoU), F-Score and the number of parameters for all seven conditioning strategies and two different image sizes on the test set. We also compare our neural field decoder with the DeepLabV3+ [3] FCN for semantic segmentation which also uses a ResNet34 backbone. In Fig. 4 we show the predictions of all decoder architectures for three example images. From the results, we can make multiple key observations.

First, the Concat and FiLM decoders perform very similarly in all aspects, regardless of the conditional code source and the image size.

Second, conditioning via Cross-Attention works best among all neural field approaches. Furthermore, it performs similarly to the DeepLabV3+ FCN. Notably, the Cross-Attention decoder has half as many parameters and no access to the intermediate feature maps of the encoder.

Third, the performance of the Concat and FiLM approaches can be improved by using a combination of global and local features, particularly for larger images. In that case, the performance of both approaches is not much lower compared with the Cross-Attention decoder.

Fourth, the performance of the Concat and FiLM conditioning decreases with larger input images when using global codes. This can be expected, as it is harder to model more geometries in larger images with the same code length.

Fifth, when using local codes, the performance is also degraded when dealing with larger images. This is unexpected, as the sampling distance (meters per pixel) remains the same and therefore the size of the features should also remain the same. This could be an indication that the individual vectors in the feature volume produced by the CNN encoder do not model purely local features, as stated by methods using this approach [4, 29]. This is further supported by the fact that modern CNN architectures have very large receptive fields so that one feature vector in the output feature volume receives input from the complete input image. In our case, the ResNet34 encoder has a receptive field of 899 pixels which fully covers both our image sizes.
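The receptive field quoted above can be reproduced with the standard recurrence \(r \leftarrow r + (k - 1)\,j\), \(j \leftarrow j \cdot s\) applied to every layer with kernel size k and stride s, where j is the accumulated stride. The sketch assumes the standard ResNet34 layout (a \(7 \times 7\) stem convolution, a \(3 \times 3\) max pooling, and stages of 3, 4, 6 and 3 basic blocks with two \(3 \times 3\) convolutions each) and ignores the \(1 \times 1\) shortcut convolutions, which do not enlarge the field. The function names are hypothetical.

```python
def resnet34_layers():
    """(kernel_size, stride) of every layer on the deepest path through ResNet34."""
    layers = [(7, 2), (3, 2)]  # stem convolution and max pooling
    for stage, blocks in enumerate([3, 4, 6, 3]):
        for block in range(blocks):
            # The first block of stages 2 to 4 downsamples with stride 2.
            first_stride = 2 if (stage > 0 and block == 0) else 1
            layers += [(3, first_stride), (3, 1)]
    return layers

def receptive_field(layers):
    """Return (receptive field in pixels, accumulated stride)."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field, jump

field, stride = receptive_field(resnet34_layers())  # (899, 32)
```

Since 899 exceeds both 256 and 512, every vector in the output feature volume can in principle depend on the whole input patch. This is the theoretical receptive field; the sketch says nothing about how strongly distant pixels actually contribute.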

6 Conclusion

In this work, we performed a comparative study of neural field conditioning strategies and explored the idea of a neural field-based decoder for 2D semantic segmentation. Our results show that neural fields can achieve competitive performance compared with a classic CNN decoder while requiring even fewer parameters. In the future, we can imagine a further increase in performance of the presented approach by making the neural field decoder utilize information from the intermediate layers of the encoder via skip connections. We also showed that the performance of the neural field is considerably affected by the conditioning strategy. The best conditioning strategy likely depends on the task. For the task of 2D semantic segmentation, a Cross-Attention-based Transformer is superior to Concat and FiLM conditioning. However, the combination of local and global conditional codes is also a promising approach, as its performance is not much lower. Lastly, for local features, we showed an unexpected degradation in performance when increasing the image size. Further research is required to explain this observation and deduce consequences for local conditioning methods.