Road Network Detection from Aerial Imagery of Urban Areas Using Deep ResUNet in Combination with the B‑snake Algorithm

Road network detection is critical for enhancing disaster response and identifying safe evacuation routes. Owing to expanding computational capacity, road extraction from aerial imagery has been investigated extensively in the literature, particularly in the last decade. Previous studies have mainly proposed methods based on pixel classification or image segmentation into road and non-road classes, such as thresholding, edge-based segmentation, k-means clustering and histogram-based segmentation. However, these methods suffer from over-segmentation and from sensitivity to noise and image distortion. This study considers the Hawkesbury Nepean Valley, NSW, Australia, a flood-prone region, as the case study for road network extraction. For road area extraction, semantic segmentation combining residual learning and U-Net is proposed. Public road datasets were used for training and testing. The study proposes a framework that trains and tests on these datasets using the deep ResUnet architecture. Regions were merged based on maximal similarity, and the road network was extracted with the B-snake algorithm. The proposed framework (baseline + region merging + B-snake) improved performance when evaluated on the synthetically modified dataset. Compared with the baseline, adding region merging and the B-snake algorithm improved results significantly, achieving a precision of 0.92 and a recall of 0.89.


Introduction
Recent advancements in aerial imagery have made available high-resolution images in which roads can be distinguished. Road network extraction from aerial images has been applied to transportation management, road navigation, updating geographic information and urban planning. Road extraction from aerial imagery has been carried out using different methods, collectively known as road area extraction/detection [2,10,11,22]. Image segmentation and pixel classification have been the most widely used methods for separating road from non-road pixels. For instance, a shape index feature, a support vector machine (SVM), an angular texture feature and a fuzzy classifier have been proposed for road area extraction [3,15,29]. A framework based on SVM facilitated road feature extraction from multi-spectral images [8]. Similarly, Yuan et al. [32] proposed a multi-stage road extraction method involving road grouping, segmentation and medial axis point selection. Hierarchical graph-based image segmentation has also been proposed for unsupervised extraction [21]. Furthermore, the conditional random field (CRF) model has been implemented for the same purpose [17]. However, obstruction caused by cars, trees or surrounding features results in poor accuracy of these methods [21].

Related work
Modern image segmentation and classification techniques are powered by deep learning. Deep learning methods have progressed immensely in recent times. They have been utilised for interpreting remote sensing data, for computer vision and for solving other complex problems, achieving higher performance and better results [24,26,27]. In deep learning, multilayered models process different levels of visual information on each layer. Local features are processed by lower layers, while higher layers assist in inferring more complex features. Road network extraction has been improved using deep learning methods. The first attempt to detect roads with deep learning was made by Mnih and Hinton [19], who utilised restricted Boltzmann machines (RBMs). Other studies have also reported better outcomes with deep architectures. However, deep architectures are challenging to train because of the vanishing gradient problem. To facilitate training, a deep residual learning framework based on identity mapping was suggested [35].
Similarly, to enhance segmentation accuracy, U-Net was proposed, which concatenates feature maps through skip connections rather than relying on a plain fully convolutional network [1]. The U-Net architecture for semantic segmentation consists of a contracting path and an expansive path. Zhang et al. [34] proposed a deep residual U-Net that combines the strengths of residual learning and U-Net; it is built on residual units instead of basic neural units and removes the cropping operation from the network.
Segmenting images by analysing them at the object level instead of the pixel level is a well-adapted approach for high-resolution images that is robust and less sensitive to noise [6]. Therefore, the Object-based Image Analysis (OBIA) approach improves the quality of segmentation results [13,16,18].
Road boundary detection techniques utilise lane patterns (features) and road models. These techniques should be capable of maintaining the quality of road detection without being affected by the shadows, processing painted and unpainted roads, detecting the curved road, and detecting both sides of the lane markings utilising parallel constraints [24,26,27]. Wang et al. [25] addressed these constraints by proposing a novel B-snake algorithm. This algorithm can define a wide range of lane structures rather than only straight and parabolic models. It utilises parallel knowledge of roads and is robust against external factors like noise, shadow, and missing and incorrect markings. B-snake exhibits local control and forms arbitrary shapes which assist in describing a different range of road shapes while retaining compact representation. For instance, by increasing the control points, more complex shapes of roads with corner turns can be explained by the B-snake algorithm [28].
This study proposes a deep residual dense U-Net method along with (1) region merging (merging the regions formed by segmentation) and (2) a B-snake algorithm for road detection. The regions which were road-like were assembled for the study. The merging criterion in the region merging algorithm defines the cost of merging two regions which should be considered. The proposed framework for road network extraction is shown in Fig. 1. Moreover, the study utilises a boundary loss function in combination with BCE-dice loss [binary cross entropy criteria (BCE) and dice loss] for segmentation to merge pixels along the road network and cancel pixels that were across the road network.
The study considers the case study of the Hawkesbury Nepean Valley, located northwest of Sydney, New South Wales (NSW), Australia for road detection during disaster scenarios. The organisation of the paper is as follows. Section 2 defines the deep residual dense U-Net architecture and the lane boundaries modelling by the B-snake model. Section 3 describes the results of the pre-processing, training of the data sets, and road extraction. Section 4 summarises the outcome of the proposed framework, which depicted better results on synthetically modified data sets.

Methodology
This study proposes a Deep ResUnet architecture for training and testing on the datasets. The regions were merged based on maximal similarity, followed by road network extraction through the B-snake model. The system used during this project had the following specifications: 12th Gen Intel ® Core™ i7-12700H (24 MB cache, 14 cores, 20 threads, up to 4.70 GHz Turbo).

U-Net
U-Net is a convolutional neural network built from convolution, ReLU activation, max pooling and concatenation operations [33]. In semantic segmentation, it is vital to capture fine details for high-precision results while retaining semantic knowledge. It is difficult to train a deep neural network with limited training data. This can be overcome by applying a pre-trained network to the desired dataset; the extensive data augmentation used in U-Net is another way to overcome the training issue. The key contribution of U-Net is the creation of shortcut connections, which are useful for tasks where the output and input are of similar size and the output requires spatial resolution. U-Net creates segmentation masks efficiently. Replacing the basic unit with the residual unit significantly enhances the performance of U-Net.

Residual Unit
The residual neural network is composed of residual units stacked in sequence, which assists in training the U-Net model and overcomes the degradation issue [9]. Each residual unit is composed of convolutional layers with 3 × 3 kernels, batch normalisation and ReLU activation; it has no pooling layer and preserves spatial dimensions. These combinations affect how the data are processed. The residual unit is given below:

$$y_m = h(x_m) + F(x_m), \qquad x_{m+1} = f(y_m)$$

where $x_m$ and $x_{m+1}$ are the input and output of the $m$th residual unit, $F(\cdot)$ is the residual function, $f(y_m)$ is the activation function and $h(x_m)$ is the identity mapping function; for a typical residual unit, $h(x_m) = x_m$.
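The behaviour of a single residual unit can be illustrated with a minimal sketch in plain Python, assuming the identity mapping h(x) = x and the form x_{m+1} = f(h(x_m) + F(x_m)). The function names `residual_unit` and `relu` and the list-based feature vector are illustrative assumptions, not the network code used in the study.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def residual_unit(x, residual_fn, activation=lambda y: y):
    """Compute x_{m+1} = f(h(x_m) + F(x_m)) with h as the identity mapping.

    x           -- input features as a flat list of floats
    residual_fn -- the residual function F (hypothetical stand-in for the
                   convolution / batch-norm / ReLU stack)
    activation  -- the function f applied after the addition
    """
    fx = residual_fn(x)
    y = [xi + fi for xi, fi in zip(x, fx)]  # identity shortcut plus residual
    return activation(y)


# If F outputs zeros, the unit simply passes its input through, which is
# why stacking such units does not degrade the signal.
passthrough = residual_unit([1.0, -2.0], lambda x: [0.0 for _ in x])
```

The shortcut means the unit only has to learn a correction to its input, which is one common explanation of why residual stacks are easier to train.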

Deep ResUnet
The combination of U-Net and residual neural networks has many benefits. It eases training of the network, and the skip connections enhance information propagation and minimise degradation [19] (Fig. 2). It also enables the design of a neural network with fewer parameters and enhanced performance. For road area extraction, a 7-level Deep ResUnet architecture has been proposed [31]. The Deep ResUnet network comprises three parts: encoding, bridge and decoding. The encoding part encodes the input image into a compact representation, and the decoding part converts this representation into a pixel-wise segmentation. The bridge connects the encoding and decoding paths. All three parts are built from residual units, each consisting of an identity mapping (linking the input and output of the unit) and convolution blocks (each consisting of a BN layer, a ReLU activation and a convolutional layer).
The encoding and decoding paths each consist of three residual units (Fig. 2). In the encoding path, instead of using a pooling operation, a stride of two is applied to the first convolution block of each unit. This halves the feature map size for multiscale learning. The stride sets the step with which the filter moves over the image and thereby compresses it; the encoded output volume is also affected by the size of the filter. In the decoding path, before each unit, the feature maps are up-sampled and combined with the feature maps from the corresponding encoding path. After the last level of the decoding path, the multi-channel feature maps are converted into the desired segmentation through a 1 × 1 convolution and a sigmoid activation layer (Fig. 2) [30]. The decoder uses deconvolutional layers to increase the feature map size to the dimensions of the input image [12].
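The effect of the stride on feature map size can be checked with a short calculation. The sketch below assumes "same" padding, so that a convolution with stride 2 produces ceil(size / 2) outputs, and assumes that each of the three encoding units halves the map as described above; `encoder_feature_sizes` is a hypothetical helper name.

```python
def encoder_feature_sizes(input_size, num_units=3, stride=2):
    """Spatial size of the feature map after each strided encoding unit.

    Assumes 'same' padding, for which a strided convolution yields
    ceil(size / stride) output positions per dimension.
    """
    sizes = [input_size]
    for _ in range(num_units):
        sizes.append(-(-sizes[-1] // stride))  # ceiling division
    return sizes


# A 512 x 512 tile passing through three stride-2 units.
sizes = encoder_feature_sizes(512)  # [512, 256, 128, 64]
```

The decoder then has to undo these halvings through up-sampling so that the final mask matches the input dimensions.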

Loss Function
Boundary loss is used for the road boundaries, which constitute a highly unbalanced segmentation problem. The loss function aims to produce smoother outputs at the boundaries and to improve the model output for two close parallel roads. To resolve the issue of highly unbalanced segmentation, a distance metric on the space of contours is formed. The boundary loss function was combined with BCE-Dice loss [7].
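For orientation, the BCE-Dice part of the loss can be sketched in plain Python over flattened masks. This is a minimal illustration under stated assumptions: the two terms are summed with equal weight, a smoothing constant of 1 is used in the Dice term, and the boundary loss term is not included. The function name and default values are hypothetical and are not taken from the study.

```python
import math


def bce_dice_loss(y_true, y_pred, eps=1e-7, smooth=1.0):
    """Binary cross-entropy plus Dice loss for flattened masks.

    y_true -- ground-truth labels (0 or 1)
    y_pred -- predicted road probabilities in [0, 1]
    """
    n = len(y_true)
    # Clip predictions so that the logarithms stay finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_pred]
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / n
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    dice = (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)
    return bce + (1.0 - dice)


truth = [1, 0, 1, 0]
good = bce_dice_loss(truth, [1.0, 0.0, 1.0, 0.0])  # close to zero
poor = bce_dice_loss(truth, [0.5, 0.5, 0.5, 0.5])  # noticeably larger
```

The Dice term rewards overlap with the (sparse) road class, which is why it is often paired with cross-entropy on unbalanced masks.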

Region Merging
Region merging can be defined as the assembly of the raw regions produced by segmentation [14]. The grouping of similar regions is given as follows: where P relates to the regions after grouping and O is the number of all segments before grouping.
The region merging algorithms are classified into non-purposive grouping (NPG) and purposive grouping (PG).
Non-purposive grouping involves merging small regions into larger regions based on efficient segmentation. It merges regions based on related characteristics, such as pixel segmentation and marker refinement. It also merges regions belonging to similar objects based on the expected connections of joints between parts of the same object. PG, on the other hand, is based on the distinct properties of the objects. Maximal similarity based region merging (MSRM) was introduced as a region-merging approach by Ning et al. [20]. Once the similarity measure has been defined, a strategy for selecting the image objects to merge is required. Various heuristics can be applied to merging an arbitrary object A with an adjacent object B. Baatz [4] proposed four strategies: (1) fitting, (2) best fitting, (3) local mutual best fitting and (4) global mutual best fitting. Roads appear as connected road segments in remote sensing images, so applying MSRM assembles the road segments and distinguishes them from the rest [20]. The similarity between two arbitrary objects C and D is given as

$$\mathrm{Sim}(C, D) = \sum_{i=1}^{P} \sqrt{NH_C^{i} \cdot NH_D^{i}}$$

where $NH_C$ and $NH_D$ are the normalised histograms of C and D, $b$ is the number of bins for each colour channel, $P = b^3$, and the superscript $i$ indexes the elements of the histograms. MSRM belongs to the second category, i.e., best fitting, and its merging strategy implies that two arbitrary regions C and D can merge only when the following condition holds:

$$\mathrm{Sim}(C, D) = \max_{i} \mathrm{Sim}(C, N_i^{C})$$

where $N^{C} = \{N_i^{C}\}$ is the set of regions adjacent to C.
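The merging rule can be illustrated with a small sketch. It assumes that the similarity is the Bhattacharyya coefficient between normalised colour histograms, as in the MSRM method of Ning et al., and that region adjacency is available as a dictionary; the names `bhattacharyya`, `should_merge`, `histograms` and `neighbours` are hypothetical.

```python
import math


def bhattacharyya(hist_c, hist_d):
    """Similarity of two normalised histograms with the same bins."""
    return sum(math.sqrt(c * d) for c, d in zip(hist_c, hist_d))


def should_merge(c, d, histograms, neighbours):
    """Best-fitting rule: merge C with D only if D is the most similar
    of all regions adjacent to C (D is expected to be in neighbours[c])."""
    best = max(
        bhattacharyya(histograms[c], histograms[n]) for n in neighbours[c]
    )
    return bhattacharyya(histograms[c], histograms[d]) >= best


# Toy example with three-bin histograms: C resembles D but not E.
histograms = {
    "C": [0.5, 0.5, 0.0],
    "D": [0.5, 0.5, 0.0],
    "E": [0.0, 0.0, 1.0],
}
neighbours = {"C": ["D", "E"]}
merge_cd = should_merge("C", "D", histograms, neighbours)  # True
merge_ce = should_merge("C", "E", histograms, neighbours)  # False
```

In a full implementation the histograms would have b^3 bins per region and would be recomputed after each merge.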

B-spline Snake
The B-spline snake algorithm is efficient for rapid and intuitive contour outlining. Snake algorithms have varied applications and have been used for segmentation, edge detection, shape modelling and motion tracking. Active contours, or snakes, move under the influence of internal forces from the curve itself and external forces from the image data [23]. Cubic B-splines are piecewise polynomial functions; with fewer state variables, they provide a more economical realisation of the snake. They give a local approximation to contours with a limited number of control points or parameters. A curve can be represented by four or more control points. Adding more control points increases the flexibility of the curve, which either permits greater variation in the curve or, when multiple knots are used, reduces continuity at certain points [5].
The B-spline is computed from the segmented image by defining a control point after every n = 64 connected pixels.
A cubic B-spline can be specified by $m + 1$ control points $Q_0, Q_1, \ldots, Q_m$ and comprises $m - 2$ cubic polynomial curve segments, where each segment of the B-spline is derived from its four neighbouring control points. The knots of a B-spline curve are the joints between two segments of the curve. In the standard uniform cubic B-spline form, the equation for each curve segment is:

$$g_i(s) = \frac{1}{6}\begin{bmatrix} s^3 & s^2 & s & 1 \end{bmatrix}\begin{bmatrix} -1 & 3 & -3 & 1 \\ 3 & -6 & 3 & 0 \\ -3 & 0 & 3 & 0 \\ 1 & 4 & 1 & 0 \end{bmatrix}\begin{bmatrix} Q_{i-1} \\ Q_i \\ Q_{i+1} \\ Q_{i+2} \end{bmatrix}$$

where $s$ is the parameter along the curve segment, with $0 \le s \le 1$, and $i = 1, \ldots, m - 2$ indexes the curve segments. Applying B-splines as active contours is effective because they are continuous at each point and knot and smooth out the extracted features in the images. The number of control points controls the flexibility, or curvature, of the splines. They also exhibit local control, since changing a single control point changes only a small section of the contour. The pseudo-code for B-snake is given below:
1. Obtain the output segmentation mask using the proposed deep residual U-Net architecture.
2. Apply region merging (Sect. 2.2), where maximal similarity-based region merging (MSRM) merges small regions into larger regions based on efficient segmentation.
3. Compute the B-spline from the segmented image by defining a control point after every n = 64 connected pixels.
4. Perform minimisation using non-maximal suppression on the control points to calculate the optimised B-spline segments.
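Evaluating a spline segment from its four neighbouring control points can be sketched as follows. The code assumes the standard uniform cubic B-spline basis functions; the function names and the choice of 20 samples per segment in `sample_spline` are illustrative and not taken from the study's implementation.

```python
def bspline_point(q0, q1, q2, q3, s):
    """Point on a uniform cubic B-spline segment for 0 <= s <= 1.

    q0..q3 are the four neighbouring control points as (x, y) tuples.
    """
    b0 = (1 - s) ** 3 / 6.0
    b1 = (3 * s**3 - 6 * s**2 + 4) / 6.0
    b2 = (-3 * s**3 + 3 * s**2 + 3 * s + 1) / 6.0
    b3 = s**3 / 6.0
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d
        for a, b, c, d in zip(q0, q1, q2, q3)
    )


def sample_spline(control_points, samples_per_segment=20):
    """Sample every segment; m + 1 control points give m - 2 segments."""
    points = []
    for i in range(len(control_points) - 3):
        q0, q1, q2, q3 = control_points[i:i + 4]
        for k in range(samples_per_segment):
            points.append(bspline_point(q0, q1, q2, q3, k / samples_per_segment))
    return points


# Five control points (m = 4) give two segments.
controls = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0), (3.0, 2.0), (4.0, 0.0)]
curve = sample_spline(controls)
```

Because each segment depends on only four control points, moving one control point changes at most four neighbouring segments, which is the local control property mentioned above.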

Minimisation Algorithm
The minimisation algorithm finds the minimum of the objective function in the n-dimensional parameter space. Control points for calculating the B-spline segments were generated on the segmented image using non-maximal suppression as follows:
a. The maximum distance between the peaks (n = 64) was defined.
b. For every row in the image, a sliding window operation was performed and, at each step, all non-maximum values were set to a fixed negative number.
c. The same operation (b) was applied per column to handle the non-maximum values.
d. The pixels with a negative value were set to zero.
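Steps a to d can be sketched on a small score map held as a list of lists. The sketch assumes that the input holds non-negative scores (for example predicted road probabilities), that the window is centred on each pixel, and that tied maxima are all kept; the constant `NEGATIVE` and the function names are hypothetical.

```python
NEGATIVE = -1.0  # fixed negative marker for suppressed values (assumption)


def suppress_1d(values, window):
    """Keep a value only if it is positive and the maximum of its window."""
    half = window // 2
    out = []
    for i, v in enumerate(values):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(v if v > 0 and v == max(values[lo:hi]) else NEGATIVE)
    return out


def non_max_suppression(image, window=64):
    """Row pass (step b), column pass (step c), then zeroing (step d)."""
    rows = [suppress_1d(row, window) for row in image]
    cols = [suppress_1d(list(col), window) for col in zip(*rows)]
    restored = [list(row) for row in zip(*cols)]  # back to row-major order
    return [[max(v, 0.0) for v in row] for row in restored]


scores = [
    [0.1, 0.9, 0.2],
    [0.3, 0.4, 0.8],
    [0.0, 0.5, 0.1],
]
peaks = non_max_suppression(scores, window=3)
```

The surviving non-zero pixels are the candidate control points; with the window of 64 used in the study they would be spaced far more sparsely than in this toy example.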
From the control points, the initial B-spline segments were calculated. Twenty sample points (k = 20) were taken along each spline segment, and the distances from the sample points to the closest edge (in the direction normal to the spline) were calculated along the 4 splines. The algorithm then cycled through each control point to find its contribution to the 4 spline segments. For each pixel in a neighbourhood surrounding the current control point, the following steps were followed:
a. The 4 splines affected by the control point were recalculated to check whether the control point needed to be moved.
b. The distances (in the direction normal to the spline) from the sample points along the 4 splines to the closest edge were evaluated.
c. The control point was moved to the neighbourhood point that had the smallest sum of distances.
These steps were repeated until fewer than 60% of the control points were moved.

Experiments and Results
The Hawkesbury Nepean Valley region, NSW, Australia was selected for road extraction. For this purpose, the online Massachusetts roads dataset was used (Table 1). The training dataset (Fig. 3) contained 1105 images with corresponding labelled masks, while the test dataset contained 13 images with 13 corresponding labelled masks.

Data Collection and Pre-processing
This study selected the Hawkesbury Nepean Valley, NSW, Australia for road network detection because this region is prone to floods each year. With road network detection, disaster response can be enhanced and a safe evacuation route can be selected. Additionally, the Massachusetts roads dataset was used. During pre-processing, the training dataset contained 1105 images of size (1500*1500), but corresponding labelled masks were available for only 804 (73%) of them. Therefore, only images with corresponding masks were utilised for training (Fig. 4). Table 1 gives a statistical overview of the Massachusetts roads dataset.
Among the 804 images with masks, some contained white patches even though the masks carried labelled data for those regions. Such images diminish model performance and were therefore not used during training. Each of the remaining images and masks was resized to (1536*1536) and then split into nine images of size (512*512). The benefits of splitting the images were a larger training dataset and more options for augmentation. Each of the nine images can receive a different augmentation at run time, reducing the chance of overfitting. Splitting also allowed a bigger batch size, as more small images than large images can be loaded into limited GPU memory. A few more random crops of size (512*512) were taken from the (1536*1536) images to increase the dataset. To avoid data duplication, the random crops did not coincide with the nine cropped images. These crops were randomly rotated by either 90 degrees or 270 degrees. After pre-processing, a total of 7240 images were obtained for training. All 13 images from the testing set were valid and were used directly during model performance evaluation.
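The split of a resized (1536*1536) image into nine (512*512) tiles amounts to a 3 × 3 grid of non-overlapping windows. The helper below is a minimal sketch that lists the top-left corner of each tile; `tile_origins` is a hypothetical name.

```python
def tile_origins(image_size=1536, tile_size=512):
    """Top-left (row, col) corners of the non-overlapping square tiles."""
    steps = range(0, image_size - tile_size + 1, tile_size)
    return [(row, col) for row in steps for col in steps]


origins = tile_origins()
# Each tile covers rows [row, row + 512) and columns [col, col + 512).
```

Since 1536 = 3 × 512, the tiles cover the resized image exactly, with no leftover border.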

Training
The training set was divided in an 85:15 ratio to obtain 6150 training images and 1090 validation images. TensorFlow v2 and TensorFlow Keras were used to build the U-Net model [33]. Around 15-20% image synthesis was achieved.

Augmentation
Runtime augmentation was performed on the training dataset to increase dataset variety, using a combination of horizontal and vertical flips, each with a probability of 0.5. Brightness augmentation was applied to help the model deal with low-light situations. The TensorFlow Dataset API was used to pre-process the data before feeding it into the model.
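The flip augmentation can be sketched independently of TensorFlow. The example below applies the same random horizontal and vertical flips to an image and its mask, each with probability 0.5, using nested lists as stand-ins for tensors; brightness augmentation is omitted, and the function name and the seeded generator are illustrative assumptions.

```python
import random


def random_flip(image, mask, rng, p=0.5):
    """Apply identical random flips to an image and its label mask.

    rng is any object with a random() method returning a float in [0, 1).
    """
    if rng.random() < p:  # horizontal flip: reverse every row
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    if rng.random() < p:  # vertical flip: reverse the order of the rows
        image = image[::-1]
        mask = mask[::-1]
    return image, mask


rng = random.Random(0)  # seeded only to make the example repeatable
tile = [[1, 2], [3, 4]]
label = [[0, 1], [1, 0]]
aug_tile, aug_label = random_flip(tile, label, rng)
```

Applying the flips to the image and mask together keeps the labels aligned with the pixels they describe.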

Model architecture
The model uses U-net architecture to segment small objects from large images. This capability makes U-net an excellent candidate for satellite imagery segmentation problems [36]. The benefit of using this model for road extraction is that the residual units ease the training of deep networks. The connections within the network ease the propagation of information without degradation, thus allowing the designing of a network with few parameters with better performance.

Training schedule
At the outset, the model was trained for the first ten epochs with a combination of boundary loss and BCE-Dice loss, as shown in Fig. 5. Later, the model was optimised using only BCE-Dice loss for image mask prediction, as shown in Fig. 6. Learning rate decay is a widely applied technique for optimising and generalising deep neural networks (DNNs). In our approach, a decay of 20% in the learning rate was applied after a cycle size of 5 epochs, as shown in Fig. 5. During training, after every batch update, the cyclical learning rate schedule slowly increases the learning rate.
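One possible reading of the schedule is a step decay in which the learning rate drops by 20% at the start of every 5-epoch cycle. The sketch below illustrates only that reading; the base learning rate of 1e-3 is a hypothetical value, and the within-cycle increase after each batch update described above is not modelled.

```python
def step_decay_lr(epoch, base_lr=1e-3, decay=0.2, cycle=5):
    """Learning rate at a given epoch under a per-cycle step decay.

    base_lr is a hypothetical starting value; decay=0.2 and cycle=5
    correspond to a 20% reduction after every 5 epochs.
    """
    return base_lr * (1.0 - decay) ** (epoch // cycle)


schedule = [step_decay_lr(epoch) for epoch in range(15)]
# Epochs 0-4 use the base rate, 5-9 use 80% of it, 10-14 use 64% of it.
```

A function of this form could be passed to a framework's learning rate scheduler callback, provided the framework calls it once per epoch.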
The graphs in Fig. 5 show the training progress; the red line denotes validation data and the orange line denotes training data. As shown in Table 2a-c, performance dropped significantly when the proposed methods were evaluated on an unseen dataset. This is due to differing abilities to generalise knowledge between seemingly identical tasks, as the area in the image was synthetically modified to simulate a flood. However, the proposed framework (baseline + region merging + B-snake) achieved better performance when evaluated on the synthetically modified dataset. Compared with the baseline, adding region merging and the B-snake algorithm yielded a significant improvement, with a precision of 0.92 and a recall of 0.897. A TensorBoard visualisation example for validation samples is shown in Fig. 7.
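For reference, precision and recall on binary masks can be computed as below. The sketch assumes a pixel-wise comparison between a predicted mask and a ground-truth mask, both flattened to lists of 0 and 1; whether the study used exactly this pixel-wise definition is an assumption, and the example masks are toy values.

```python
def precision_recall(pred, truth):
    """Pixel-wise precision and recall for flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # road, correct
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # false alarm
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)  # missed road
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
precision, recall = precision_recall(predicted, reference)
```

Both metrics lie between 0 and 1, so the reported values are fractions rather than percentages.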

Inference
For inference on test images, each image was divided into tiles of size (512*512), as in the training pre-processing, and the model predictions were then stitched together to produce a predicted mask of size (1500*1500). To obtain a binary image from the prediction output, a threshold of 0.5 was applied to each mask: any pixel with a value above 0.5 was classified as a road pixel. Small blobs (white patches) of false positives were then removed.
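The thresholding and blob removal steps can be sketched in plain Python. The example assumes 4-connectivity and a minimum blob size chosen by the user; neither value is stated in the text, and the function names are hypothetical.

```python
def binarise(pred, threshold=0.5):
    """Mark pixels whose predicted value exceeds the threshold as road."""
    return [[1 if v > threshold else 0 for v in row] for row in pred]


def remove_small_blobs(mask, min_size):
    """Zero out 4-connected components with fewer than min_size pixels."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if out[r][c] == 0 or seen[r][c]:
                continue
            stack, blob = [(r, c)], []
            seen[r][c] = True
            while stack:  # depth-first traversal of one component
                y, x = stack.pop()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and out[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(blob) < min_size:
                for y, x in blob:
                    out[y][x] = 0
    return out


prediction = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.6],
    [0.1, 0.0, 0.0, 0.1],
]
cleaned = remove_small_blobs(binarise(prediction), min_size=2)
```

In this toy example the three connected road pixels survive, while the isolated pixel in the second row is removed as a false positive.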

Conclusion
Thus, a framework was proposed to enhance road network extraction. The framework was based on a deep residual dense U-Net, similarity-based region merging and a B-snake algorithm. The study utilised a boundary loss function in combination with BCE-Dice loss for segmentation to merge the pixels along the road and cancel the pixels across the road network. A case study of the Hawkesbury Nepean Valley was considered for road network extraction. The Massachusetts roads dataset was used for training and testing. In the training dataset, there were 1105 images. It was observed that network evaluation on unseen datasets experienced a loss in performance. The reason was varying abilities to gather information from similar tasks slightly modified for floods. However, the proposed framework produced better results on the synthetically modified datasets. A precision of 0.92 and a recall of 0.87 were achieved. The boundary loss function in combination with BCE-Dice loss was selected as the learning strategy for segmentation in this study; however, if a higher weighting is applied to the proposed method, the non-road regions also start to merge pixels, resulting in poor segmentation.