1 Introduction

Buildings are a significant element of the urban environment and the primary spaces for human production and living, making them leading indicators of urban expansion and the size of population centres [1]. Given their importance in the urban environment, numerous studies have focused on automatically extracting building outlines by exploiting different datasets and techniques [2, 3]. Such detailed, up-to-date geographic data on the built environment are essential, offering a practical means of understanding how assets and people are exposed to environmental hazards such as floods, erosion, and coastal storms [4].

With significant advances in diverse platforms and sensors such as Light Detection and Ranging (LiDAR) and cameras mounted on satellites, aerial platforms, and unmanned aerial vehicles (UAVs), it is now easier to acquire 2D images and generate 3D point clouds for building extraction purposes [5,6,7]. Existing studies have shown that satellite and aerial imagery [8,9,10], as well as LiDAR [11,12,13], have been extensively used for building segmentation, and these data sources have achieved outstanding results. However, these data sources are costly to acquire, and the freely available ones lack up-to-date temporal coverage for most developing countries in Africa.

Several techniques have been developed for building segmentation over the past decades; they can loosely be categorised as traditional and deep learning (DL) based methods [14]. Traditional methods involve experts manually digitising building footprints or exploiting geometric or contextual properties of buildings, such as shape [15, 16], edges [17, 18], and shadows [19], to extract building outlines. However, these techniques are arduous, time-consuming, and economically expensive [20]. Consequently, there has been growing interest in DL approaches for automatic building extraction, following advances across a diversity of computer vision tasks such as classification [21], object detection [22], and semantic segmentation [23, 24]. DL-based methods seek to minimise expert intervention as much as possible and thereby increase productivity.

DL-based building extraction is a semantic segmentation problem involving pixel-wise labelling that categorises each image pixel as building or non-building. Among the different DL architectures used, the encoder-decoder is the most effective and resolves most of the end-to-end issues encountered by other DL architectures [20]. One typical example of such an encoder-decoder architecture is the U-Net, built on the fully convolutional network (FCN), which has gained traction as a cutting-edge architecture for building extraction. The U-Net utilises skip connections that enable the decoder to receive down-sampled features from the encoder and create outputs with minimal information loss [25]. To improve the accuracy and performance of the U-Net, some works [26,27,28] proposed replacing the encoder with a pre-trained network via transfer learning. Of the various pre-trained models, the residual network with 34 deep layers (ResNet-34) provides a good balance between performance and accuracy and was hence adopted in this paper.

The hyperparameters of DL networks significantly influence the network's performance, as they serve as control agents for the network's training process. Designing an optimal DL architecture for efficient learning and optimum results depends on adjusting hyperparameters such as the number of convolutional layers, filter size, batch size, and number of convolutional filters. This implies exploring a vast and intricate search space; experimenting with diverse combinations and manually finding and fine-tuning the best configuration is computationally expensive [29]. Selecting appropriate hyperparameters is also imperative for optimal training results. For example, choosing a low learning rate means the network learns slowly and requires more epochs to attain superior performance. By contrast, a high learning rate can lead to premature convergence, whereby the network quickly reaches sub-standard performance and fails to improve further. Therefore, there is a need to optimise DL hyperparameters for proper training and ideal performance.

Existing studies [30,31,32] revealed that grid and random search are the most conventional optimisation techniques used to fine-tune DL hyperparameters. Grid search, a systematic approach, exhaustively explores hyperparameter combinations and is suitable for models with a limited number of hyperparameters [33, 34]. For instance, Jiang and Xu [35] used grid search to optimise hyperparameters for various deep feedforward neural networks and machine learning models, achieving promising results in predicting breast cancer metastasis. Priyadarshini and Cotton [36] also applied grid search to develop a long short-term memory-convolutional neural network (LSTM-CNN)-based model for sentiment analysis. Ngoc et al. [37] utilised grid search within the walk-forward validation methodology to search for the optimal hyperparameters of a multilayer perceptron model. Although these approaches attained promising results, Bacanin et al. [34] argued that grid search is a trial-and-error technique that necessitates a strong understanding of the specific domain. Its efficiency diminishes as the number of hyperparameters expands, leading to exponential growth in computational demands [34]. On the contrary, random search surpasses grid search in effectiveness by allowing parallelisation and random sampling from a search space to identify the best hyperparameters [38, 39]. Rodríguez et al. [40] utilised random search for fine-tuning the hyperparameters of a CNN to recognise images over an augmented reality sandbox, and Jekova and Krasteva [41] exploited it to fine-tune an end-to-end CNN to analyse out-of-hospital cardiac arrest rhythms during cardiopulmonary resuscitation. Likewise, Ragab et al. [42] utilised it to optimise a one-dimensional CNN to recognise human activity. Nonetheless, a notable limitation of random search is its failure to incorporate previously obtained results, potentially leading to repeated searches with the same hyperparameters [43]. As such, deep neural networks (DNNs) optimised with random search tend to converge to local optima, distant from the globally best hyperparameter configurations. Furthermore, hyperparameters vary in type (continuous, categorical, or discrete), which sometimes makes random search inadequate [38, 44].

To address this problem, many researchers have treated hyperparameter tuning of DL models as a hyperparameter optimisation (HPO) problem and proposed the use of advanced optimisation techniques such as metaheuristics [29, 34, 45]. Unlike other optimisation strategies, metaheuristics are designed to efficiently explore a large space, make use of previous searches, and strike a balance between exploration and exploitation [38, 46]. Moreover, these algorithms are capable of escaping local optima, providing a more efficient way to find close-to-optimal solutions [47]. Metaheuristic optimisation algorithms have demonstrated exceptional performance, and their ease of implementation has made them preferable for tackling a wide array of complex optimisation problems spanning engineering, communications, industrial settings, and the social sciences [48]. Furthermore, their versatility and robustness make them valuable tools for consistently delivering effective solutions to intricate optimisation tasks across various domains such as biological information analysis [49], chemical information optimisation [50], task scheduling within cloud computing environments [51], feature selection [52], image segmentation [53], and even cost-effective emission dispatch problems [54], among others. One such algorithm is the grey wolf optimisation (GWO) technique [55], a population-based metaheuristic that has proven a strong fit for handling numerous benchmark optimisation challenges. GWO learns iteratively, identifying the best-fit value in each iteration. Additionally, the mathematical framework of GWO makes it possible to identify solutions in an \(n\)-dimensional search space, mimicking the hunting technique of grey wolves [56]. Moreover, GWO is computationally inexpensive, as it updates a single position per agent while reserving the three best solutions for enhanced exploration [57]. Coupled with its computational advantages, GWO has been utilised to tackle various issues such as feature extraction [58], weight initialisation of CNNs [59, 60], and hyperparameter optimisation of CNNs for classification problems [45, 61]. Despite these vast application areas, GWO remains underexplored for hyperparameter optimisation of DL segmentation architectures such as the U-Net and its variants for image segmentation problems.

In view of the enumerated benefits, this research proposes a hybrid intelligent building segmentation model of grey wolf and UResNet-34 called GWO-UResNet-34. The principal idea is to employ GWO to find the optimal hyperparameters of UResNet-34 in terms of activation function, learning rate, loss function, and epoch. This is necessary because choosing an arbitrary hyperparameter configuration for the UResNet-34 can lead to local optimum solutions, slow convergence, and laborious manual tuning. Therefore, to improve the building segmentation performance of UResNet-34, a suitable objective (fitness) function was constructed and evaluated by the GWO, which computes the fitness value in the form of the mean intersection over union (MIoU) and F1 score metrics. The experimental results, when compared with U-Net and UResNet-34, illustrated the superiority of the proposed GWO-UResNet-34 approach. This study provides the first comprehensive assessment of the applicability of GWO as a reliable hyperparameter optimisation algorithm to enhance the performance of UResNet-34. The significant contributions of this paper to the existing literature are to:

  • Investigate the performance of utilising GWO to fine-tune the adjustable hyperparameters of UResNet-34 for building segmentation.

  • Conduct extensive experiments on different localities to ascertain the versatility of the proposed GWO-UResNet-34.

  • Evaluate and compare the results of the proposed GWO-UResNet-34 architecture with conventional U-Net and UResNet-34-based building segmentation approaches.

2 Related works

2.1 U-Net-Based building extraction

This section examines relevant literature on building extraction and segmentation techniques that utilise the U-Net model and its variations. This is crucial as the GWO-UResNet-34 model is a U-Net variant. A search was conducted on Google Scholar and Science Direct websites to gather information on U-Net-based building segmentation studies. The search employed keywords such as U-Net, building extraction, building, and modified U-Net. Initially, many papers were retrieved, but a selection process was implemented to include only studies that utilised U-Net or its associated variants for building segmentation. To further narrow down the selection, studies conducted between 2018 and 2022 were considered.

The literature review revealed that although scholars have reported promising results for U-Net-based models in semantic segmentation, their applicability to building extraction remains challenging and complex, even with high-quality images. This challenge is attributed to buildings having varied characteristics, while many urban and environmental factors can cause occlusion [62]. Hence, various U-Net architectures have been developed to automatically extract building footprints across different building properties and environments from remotely sensed images, with good results. For instance, Wu et al. [63] proposed a multi-constraint FCN for building segmentation from aerial images by introducing additional constraints to the intermediate layers of the basic U-Net model. Using transfer learning, Adibah et al. [27] developed a U-Net architecture based on ResNet-34 for building extraction and outperformed previous works on the INRIA dataset.

Liu et al. [64] replaced the encoder section of a U-Net model with a ResNet encoder to segment buildings in open-sourced remote sensing data. Delibasoglu and Cetin [65] modified the original U-Net model using inception blocks to improve building segmentation accuracy. AMUNet, a U-Net with multi-loss and an attention block, was presented by Guo et al. [66] to overcome the insensitivity of DL models to small buildings and subdue the background noise for a better segmentation outcome. He et al. [14] proposed a hybrid first and second-order attention network (HFSA) to explore the connection between the intermediate layers for building delineation in remotely sensed images. Erdem and Avdan [26] compared various U-Net models for building extraction by replacing the encoder portion of the original U-Net with VGG-16, InceptionResNetV2, and DenseNet121 CNNs. Pan et al. [67] investigated the feasibility and accuracy of U-Net for extraction and classification in high-density areas. Rastogi et al. [68] presented UNet-AP, a model that introduces an atrous spatial pyramid pooling (ASPP) module capable of incorporating contextual information in the bottleneck of the original U-Net.

The approach by Chen et al. [69] attempts to overcome the biases encountered by the encoder and decoder parts of DL models by combining a self-attention module with the reconstruction-bias strategy for efficient building segmentation. Li et al. [70] developed an attention-enhanced U-Net by utilising a ResNet to add a spatial-channel attention mechanism and a multi-scale fusion module to improve small-building extraction. Jin et al. [71] presented a boundary-aware refined network (BARNet) based on U-Net and DeepLab-v3 to address the incomplete segmentation of large buildings and refine the accuracy of building extraction. Abdollahi and Pradhan [72] proposed a MultiRes-UNet network that utilises a MultiRes block to assimilate learned features while replacing the skip connections of the original U-Net network with a shorter path called the Res path. Xu et al. [73] designed a Holistically-Nested Attention U-Net (HA U-Net) that exploits an influential attention mechanism unit to incorporate multi-scale path information proficiently. Ye et al. [74] introduced a context-transfer-UNet (CT-UNet) to address the inter-class similarity between buildings and backgrounds by constructing a dense boundary block (DBB) that employs feature reuse to enhance attributes and improve recognition capability.

These works demonstrate a collective effort to improve the accuracy and efficiency of the U-Net model by exploring various strategies and modifying the architecture. However, the studies reviewed did not explore the effect of hyperparameter selection on the sensitivity and performance of the various U-Net-based building segmentation models.

2.2 Metaheuristic-based hyperparameter optimisation

Metaheuristic algorithms are stochastic approximations that aim to find solutions close to the global optimum. They comprise two key phases: exploration (diversification) for global search and exploitation (intensification) focused on refining current best solutions. These two mechanisms dictate the generation of new candidate solutions based on the previous ones, enabling the algorithms to have global exploration and efficient search [34, 46]. Various metaheuristics have been developed over the past decade and are broadly classified into four groups [75]:

  • Evolutionary algorithms (EAs): These mimic biological processes such as reproduction, mutation, crossover, and selection to solve complex problems. EAs can be subdivided into various categories, each with its unique characteristics. Evolutionary programming (EP) [76] is one category that focuses on evolving computer programs or representations, and it is well-suited for problems whose solutions can be represented as programs or symbolic expressions. Evolution strategies (ES) [77] pay attention to parameter adaptation and are often used for numerical optimisation problems. Differential evolution (DE) [78] generates new solutions through mutation and crossover, perturbing target individuals with scaled differences between other members of the population. Biogeography-based optimisation (BBO) [79] is inspired by the principles of biogeography, such as immigration and emigration between different regions, to model the flow of solutions between habitats. Genetic algorithms (GAs) [80] are arguably the most utilised subset of EAs. They use a population-based approach and employ selection, crossover, and mutation operators to evolve and improve solutions.

  • Swarm intelligence (SI) algorithms: These algorithms draw inspiration from the behaviours of animals and plants. For instance, particle swarm optimisation (PSO) [81] is inspired by the foraging process of bird flocks, cuckoo search (CS) [82] is influenced by the brood parasitic behaviour of certain cuckoo species, grey wolf optimiser (GWO) [55] is based on the leadership hierarchy and hunting mechanisms of grey wolves, salp swarm algorithm (SSA) [83] emulates the swarming behaviour of salps, and whale optimisation algorithm (WOA) [84] is inspired by the social behaviour of humpback whales, among others.

  • Physics-based algorithms (PAs): These algorithms are rooted in various physical phenomena. Common among these is the simulated annealing algorithm (SA) [85], which is inspired by the annealing process of solids.

  • Human activity-related algorithms (HAs): These algorithms take cues from human activities. Popular algorithms in this subclass include teaching–learning-based optimisation (TLBO) [86], which is influenced by the traditional teaching mode; passing vehicle search (PVS) [87], which is inspired by vehicle passing mechanisms on two-lane highways; and the sine cosine algorithm (SCA) [88], which incorporates sine and cosine functions.

In recent years, metaheuristic algorithms have become prominent as practical alternatives for optimising hyperparameters of DL networks in various domains. For instance, Bouktif et al. [89] adopted GA and PSO to optimise LSTM hyperparameters for load forecasting. The results showed that the multi-sequence DL model tuned by these metaheuristic algorithms outperformed benchmark machine learning models and naive LSTM configurations. Somua et al. [90] introduced the “eDemand” model which utilised an improved sine cosine optimisation algorithm (ISCOA) to optimise the hyperparameters of LSTM for accurate and robust energy consumption forecasting. In addition, a Haar wavelet-based mutation operator was introduced to enhance ISCOA’s ability to converge towards global optimal solutions. A case study using real-time energy consumption data indicated that the proposed model outperformed state-of-the-art energy consumption forecast models across various performance metrics. Similarly, Peng et al. [90] proposed “FOA-LSTM”, a model that combined LSTM with the fruit fly optimisation algorithm (FOA) to determine optimal hyperparameters. Experimental results using various datasets, including the USA’s NN3 time series and monthly energy consumption data, demonstrated the FOA-LSTM model’s effectiveness. The proposed model outperformed other forecasting models, reducing the symmetric mean absolute percentage error (SMAPE) by up to 11.44% in some instances. Nadeem et al. [91] presented the “SHO-CNN” model, which leverages the spotted hyena optimiser (SHO) metaheuristic optimisation algorithm to fine-tune hyperparameters critical for CNN performance such as the learning rate, momentum, number of epochs, batch size, dropout, number of nodes, and activation function. Experimental results on various news datasets demonstrated that SHO-CNN outperformed baseline CNNs and other optimisation approaches, achieving high accuracy levels in multi-label news classification.

Challapalli and Devarakonda [92] used a hybrid particle swarm grey wolf (HPSGW) algorithm to fine-tune the hyperparameters of a CNN, including batch size, number of hidden layers, number of epochs, and filter size, to achieve optimal network performance. The proposed method was tested on benchmark datasets like MNIST and CIFAR and applied to classifying eight Indian classical dances. The experimental results revealed significant performance improvements compared to previous methods, achieving high accuracy for image classification tasks. Tuba et al. [46] addressed the problem of tuning hyperparameters for CNNs using the bare-bones fireworks algorithm. The proposed method was tested on benchmark datasets, including CIFAR-10 and MNIST, and compared to other optimisation techniques. The results indicated that the proposed method outperforms previous methods, achieving high classification accuracy on both datasets. Tsai and Fang [47] introduced a novel metaheuristic algorithm called "search economics for hyperparameter optimisation" to improve the accuracy of prediction systems. This algorithm assigns search agents to different subspaces based on their potential for optimisation. Compared to methods such as Bayesian optimisation, random forest, support vector regression, DNN, and DNN with different hyperparameter search algorithms, and based on data from Taipei City, Taiwan, the proposed method obtained a lower mean absolute percentage error. Similarly, Tsai et al. [93] tackled the problem of predicting bus passengers using deep learning and optimised hyperparameters based on simulated annealing (SA) to enhance accuracy. The proposed method was compared to other machine learning techniques, including support vector machines, random forests, and gradient boosting. Simulation results showed that the SA-based approach outperformed the other methods, achieving high accuracy in forecasting bus passenger numbers.

Nematzadeh et al. [94] proposed a method that utilises the GWO and GA metaheuristics to fine-tune the hyperparameters of DNNs for biomedical and biological purposes. The authors experimented on 11 biomedical, biological, and natural datasets, and the results demonstrated that metaheuristic methods, especially GWO, outperform other optimisation methods and show faster convergence. Houssein et al. [48] introduced an optimised model called "IMPA-ResNet50" that used the improved marine predators algorithm (IMPA) for hyperparameter optimisation of a CNN model. A comparative assessment using mammographic datasets against state-of-the-art approaches showed the superiority of IMPA-ResNet50, achieving high accuracy, sensitivity, and specificity, making it a promising tool for breast cancer diagnosis. Lee et al. [95] proposed using GA to optimise the network architecture and hyperparameters of CNNs. The authors applied this approach to an amyloid brain image dataset for Alzheimer's disease diagnosis. The evaluation results demonstrated that their algorithm outperformed the baseline CNN by a significant margin, achieving an 11.73% improvement on a classification task. Gülcü and Kus [96] presented the microcanonical optimisation algorithm (MOA), a variant of SA, for hyperparameter optimisation and architecture selection. The generated network was compared with networks generated by other optimisation-based approaches using six widely used image recognition datasets, and the results indicated that the proposed method achieved competitive classification results. Utama et al. [97] explored PSO to tune the hyperparameters and architecture of a CNN for multivariate time-series analysis. The proposed network, PSO-CNN, was evaluated using electronic journal visitor datasets, and experimental results showed that PSO-CNN attained better performance than a standard CNN.

The literature review demonstrates the versatility and efficacy of metaheuristics in optimising DNN hyperparameters across different application domains, enhancing model performance and effectiveness in addressing real-world challenges. Nevertheless, while these studies collectively exemplify the success of metaheuristic algorithms in DNN hyperparameter optimisation, the need to explore new metaheuristics remains crucial. The No Free Lunch (NFL) theorem [98] posits that no single metaheuristic algorithm universally outperforms all others across diverse problem domains. As such, researchers continue to introduce novel metaheuristic algorithms, variants, and hybridisation techniques that have the potential to outperform existing methods or exhibit unique capabilities. Moreover, the diversity of problems, sensitivity to problem characteristics, resource constraints, and the need for comprehensive benchmarking further necessitate the continuous investigation of different metaheuristics tailored to specific challenges and problem characteristics.

3 Data and methodology

This work utilises GWO to optimise the hyperparameters of a modified U-Net with a ResNet-34 (UResNet-34) backbone for building segmentation purposes. The overall methodology of the study is illustrated in Fig. 1 and includes the input datasets and data pre-processing, model fitting and testing, and post-processing. The input dataset comprised high-resolution images obtained from a UAV survey and pre-processed into a DL-readable format. The UResNet-34 model was subsequently developed and optimised with the GWO to create the GWO-UResNet-34 for training and prediction. Five commonly used evaluation metrics (accuracy, precision, recall, F1 score, and mean intersection over union (MIoU)) were utilised to evaluate the trained models. Detailed descriptions of the steps employed are given in the subsequent sections.

Fig. 1

Schematic of the Proposed Methodology

3.1 Dataset description and study area

The datasets used as input for training the various models (GWO-UResNet-34, U-Net, and UResNet-34) employed in this study were three-band (red, green, and blue) UAV orthomosaics of Accra and Tarkwa in Ghana, depicted in Fig. 2. Accra is the capital city, while Tarkwa is a mining hub in Ghana, making both areas associated with rapid development and high urbanisation. These factors have led to an upsurge in the construction of various structures to cater for accommodation and office space. The Accra UAV orthomosaic was obtained in 2019 for a pilot phase of the national electrical grid connection project, had a 3 cm spatial resolution, and covered an area of 163.36 ha. It had 2531 manually digitised polygons representing structures (Fig. 2a). The UAV orthomosaic of Tarkwa was acquired in 2021 for an academic project, had a 5 cm spatial resolution, and covered an area of approximately 79.66 ha, with 494 structures annotated as polygons (Fig. 2b). The difference in resolution between the two orthomosaics was intended to yield a more generalised model. Compared to the Accra orthomosaic, the Tarkwa orthomosaic consisted of more slum structures.

Fig. 2

UAV Orthomosaics of (a) Accra, Ghana, and (b) Tarkwa, Ghana

For the testing data, orthomosaics of four different localities were utilised. Locality-1 is associated with well-laid-out buildings and roads, while Locality-2 is dominated by slum buildings in close proximity. In contrast, Locality-3 has few buildings but a substantial area of background vegetation, while Locality-4 is a conglomeration of Locality-1 and Locality-2. The geometrical variance (varying sizes, colours, and architectural designs) of the buildings depicted in Fig. 3 was intended to assess whether the model can effectively segment buildings in other areas worldwide.

Fig. 3

Test Localities with Varying Building Attributes

3.2 Grey wolf optimiser

GWO was inspired by the rigid leadership, pecking order, and group hunting behaviour of grey wolves, a species of the Canidae family [61]. Grey wolves are regarded as elite predators at the pinnacle of the trophic level, habitually living in packs of 5 to 12 wolves. The population of a pack is split into four distinct hierarchies consisting of the alpha (α), beta (β), delta (δ), and omega (ω). The alpha is the most powerful and makes judgments about hunting and sleeping. Beta wolves come second in the pecking order; they support the alpha wolf in administration, convey information to the subordinate groups, and manage them. Omega wolves are the lowest in the hierarchy and typically fulfil sacrificial duties, albeit they are allowed to feed after the top hierarchies have finished feeding. Delta wolves lead the omega wolves and are usually scouts, sentinels, and guardians, assisting the alphas and betas when searching for and hunting prey. They also protect the territorial borders of the pack and nurse the frail and injured wolves. As expounded in the original work by Mirjalili et al. [55], the hunting tactic of the grey wolf has four phases: encircling, hunting, attacking, and searching for the prey. The GWO algorithm is modelled on this hunting strategy, with each wolf denoting a randomly initialised solution: the solution with the highest fitness score is represented by α (the first optimum), followed by β (the second-best) and δ (the third-best), while ω delineates the residual solutions. The three fittest wolves (α, β, and δ) are considered knowledgeable about the probable position of the prey, followed by ω. During hunting, the wolves generally encircle their prey; this behaviour is delineated mathematically in Eqs. (1) and (2) [55].

$$\overrightarrow{D}=\left|\overrightarrow{C} \cdot {\overrightarrow{X}}_{p}\left(t\right)-\overrightarrow{X}(t)\right|$$
(1)
$$\overrightarrow{X}\left(t+1\right)={\overrightarrow{X}}_{p}\left(t\right)-\overrightarrow{A} \cdot \overrightarrow{D}$$
(2)

where \(\overrightarrow{X}\left(t\right)\) and \({\overrightarrow{X}}_{p}\left(t\right)\) denote the positions of a grey wolf and the prey, respectively, at the \(t\)th iteration. \(\overrightarrow{D}\) denotes the position alteration element. \(\overrightarrow{A}\) and \(\overrightarrow{C}\) are coefficient vectors computed as shown in Eqs. (3) and (4) [55].

$$\overrightarrow{A}=2\overrightarrow{a} \cdot \boldsymbol{ }{\overrightarrow{r}}_{1}-\overrightarrow{a}$$
(3)
$$\overrightarrow{C}=2 \cdot {\overrightarrow{r}}_{2}$$
(4)

where \({\overrightarrow{r}}_{1}\) and \({\overrightarrow{r}}_{2}\) are vectors of randomly generated values between 0 and 1, and \(\overrightarrow{a}\) is a moderating entity that diminishes linearly from 2 to 0.

In the GWO algorithm, the positions of the fittest solutions, that is, α, β, and δ, are updated first, followed by the re-positioning of the other search agents (ω) based on Eqs. (5), (6) and (7) [55].

$${\overrightarrow{D}}_{\alpha }=\left|{\overrightarrow{C}}_{1} \cdot {\overrightarrow{X}}_{\alpha }-\overrightarrow{X}\right|$$
(5)
$${\overrightarrow{D}}_{\beta }=\left|{\overrightarrow{C}}_{2} \cdot {\overrightarrow{X}}_{\beta }-\overrightarrow{X}\right|$$
(6)
$${\overrightarrow{D}}_{\delta }=\left|{\overrightarrow{C}}_{3} \cdot {\overrightarrow{X}}_{\delta }-\overrightarrow{X}\right|$$
(7)

where \({\overrightarrow{D}}_{\alpha }\), \({\overrightarrow{D}}_{\beta }\), and \({\overrightarrow{D}}_{\delta }\) denote the step sizes of ω with respect to α, β, and δ, whose respective positions are \({\overrightarrow{X}}_{\alpha }\), \({\overrightarrow{X}}_{\beta }\), and \({\overrightarrow{X}}_{\delta }\). \({\overrightarrow{C}}_{1}\), \({\overrightarrow{C}}_{2}\), and \({\overrightarrow{C}}_{3}\) are randomly initialised vectors, and \(\overrightarrow{X}\) is the current solution's location.

After the distances are defined, \(\overrightarrow{X}\left(t+1\right)\) which denotes the final position of the current solution is subsequently computed by Eqs. (8), (9), (10) and (11) [55].

$${\overrightarrow{X}}_{1}= {\overrightarrow{X}}_{\alpha }- {\overrightarrow{A}}_{1}({\overrightarrow{D}}_{\alpha })$$
(8)
$${\overrightarrow{X}}_{2}= {\overrightarrow{X}}_{\beta }- {\overrightarrow{A}}_{2}({\overrightarrow{D}}_{\beta })$$
(9)
$${\overrightarrow{X}}_{3}= {\overrightarrow{X}}_{\delta }- {\overrightarrow{A}}_{3}({\overrightarrow{D}}_{\delta })$$
(10)
$$\overrightarrow{X}\left(t+1\right)= \frac{{\overrightarrow{X}}_{1}+{\overrightarrow{X}}_{2}+ {\overrightarrow{X}}_{3}}{3}$$
(11)

The capabilities of GWO are enhanced by the random adaptiveness of \(\overrightarrow{A}\) and \(\overrightarrow{C}\). These parameters enable the algorithm to balance exploration and exploitation of the search space. Thus, \(\left|\overrightarrow{A}\right|>1\) initiates exploration, prompting candidate solutions to diverge from a weaker prey in search of a fitter one, whereas candidate solutions converge toward the prey when \(\left|\overrightarrow{A}\right|<1\). The vector \(\overrightarrow{C}\), with components drawn randomly from \(\left[0, 2\right]\), secondarily weights the influence of the prey's location relative to the wolf [55]. Table 1 presents the pseudo-code for the GWO algorithm.

Table 1 Pseudo-code of GWO Algorithm
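To make the update rules concrete, the following is a minimal NumPy sketch of the GWO loop in Eqs. (1)-(11) for a maximisation problem. It is illustrative rather than the implementation used in this study; the function and parameter names (`fitness`, `n_wolves`, `n_iters`) and the toy usage are assumptions.

```python
# A minimal NumPy sketch of the GWO loop in Eqs. (1)-(11); maximisation variant.
import numpy as np

def gwo(fitness, lb, ub, n_wolves=10, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    X = rng.uniform(lb, ub, size=(n_wolves, dim))      # random initial pack

    for t in range(n_iters):
        scores = np.array([fitness(x) for x in X])
        order = np.argsort(scores)[::-1]               # fittest first
        alpha = X[order[0]].copy()                     # first optimum
        beta = X[order[1]].copy()                      # second-best
        delta = X[order[2]].copy()                     # third-best

        a = 2.0 - 2.0 * t / n_iters                    # 'a' decays linearly 2 -> 0
        for i in range(n_wolves):
            x_new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                A = 2.0 * a * rng.random(dim) - a      # Eq. (3)
                C = 2.0 * rng.random(dim)              # Eq. (4)
                D = np.abs(C * leader - X[i])          # Eqs. (5)-(7)
                x_new += leader - A * D                # Eqs. (8)-(10)
            X[i] = np.clip(x_new / 3.0, lb, ub)        # Eq. (11), kept in bounds

    scores = np.array([fitness(x) for x in X])
    return X[np.argmax(scores)], float(np.max(scores))

# Toy usage: maximise -sum(x^2) over [-5, 5]^3 (optimum at the origin).
best_x, best_f = gwo(lambda x: -np.sum(x**2), [-5] * 3, [5] * 3)
```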

3.3 U-Net Architecture

U-Net was proposed initially by Ronneberger et al. [99] for semantic segmentation of biomedical images. The technique is built upon the FCN and is composed of two symmetrical parts that form a U shape with skip connections to help concatenate feature maps that provide localisation information, as depicted in Fig. 4. The first part is the contracting path, usually referred to as the encoder. In contrast, the second part, usually termed the decoder, is an expansive path. The encoder path seeks to learn what objects are in the input images and consists of several convolutions and max-pooling layers that gradually decrease input image size while increasing the network’s depth. The encoder part comprises two repeated 3 × 3 convolution kernels and a down-sampling layer of 2 × 2 window size combined with a rectified linear unit (ReLU) activation function. The decoder part operates similarly but utilises a 2 × 2 transpose convolution strategy to up-sample the images and concatenate the corresponding down-sample feature map from the encoder part. Finally, a convolution with a 1 × 1 kernel and a sigmoid function is utilised to map every feature map into the desired outputs. Regarding building segmentation with U-Net, the encoding process refers to building identification and separating buildings from non-buildings. On the other hand, the expansive process refers to building localisation and involves determining the spatial existence of the buildings [100].

Fig. 4

UNet Architecture
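To illustrate the components described above (repeated 3 × 3 convolutions with ReLU, 2 × 2 max-pooling, 2 × 2 transpose-convolution up-sampling, skip connections, and the final 1 × 1 convolution with a sigmoid), the following is a minimal one-level tf.keras sketch, not the full multi-level U-Net of Fig. 4.

```python
# A one-level tf.keras sketch of the U-Net building blocks described above.
import tensorflow as tf
from tensorflow.keras import layers

def double_conv(x, filters):
    # Two repeated 3x3 convolution kernels with ReLU activation.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

inputs = tf.keras.Input(shape=(256, 256, 3))
c1 = double_conv(inputs, 64)                               # encoder features (skip source)
p1 = layers.MaxPooling2D(2)(c1)                            # 2x2 down-sampling
bottleneck = double_conv(p1, 128)
u1 = layers.Conv2DTranspose(64, 2, strides=2)(bottleneck)  # 2x2 transpose convolution
u1 = layers.concatenate([u1, c1])                          # skip connection from encoder
d1 = double_conv(u1, 64)
outputs = layers.Conv2D(1, 1, activation="sigmoid")(d1)    # pixel-wise building map
mini_unet = tf.keras.Model(inputs, outputs)
```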

3.4 U-Net with ResNet backbone

The residual network (ResNet) was introduced by He et al. [101] to solve the vanishing gradient problem encountered by most DL networks. The layers of ResNet are organised in "residual blocks" capable of learning residual functions with reference to the layer inputs instead of learning unreferenced functions. ResNet utilises skip connections between layers, thereby reducing the effective depth, simplifying the network, and speeding up learning. Fewer effective layers also imply less propagation and a reduced impact of the vanishing gradient. These advantages have made ResNet suitable for several deep networks, such as super-resolution, generative adversarial networks (GANs), and semantic segmentation, among others. ResNet variants comprise ResNet-18, ResNet-34, ResNet-50, and ResNet-101; however, ResNet-34 has proven to provide a good balance between performance and accuracy for semantic segmentation tasks, as demonstrated in the literature [27, 64]. A minimal sketch of constructing such a U-Net with a ResNet-34 backbone is shown below.
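The sketch uses the Segmentation Models library listed in Sect. 4.2; the compile settings are illustrative defaults only, since in this work the corresponding hyperparameters are selected by the GWO (Sect. 3.5).

```python
# A minimal sketch of UResNet-34 via the Segmentation Models library;
# the compile settings are illustrative, not those chosen by the GWO.
import segmentation_models as sm

sm.set_framework("tf.keras")
uresnet34 = sm.Unet(
    backbone_name="resnet34",     # ResNet-34 encoder replaces the U-Net encoder
    encoder_weights="imagenet",   # transfer learning from ImageNet
    input_shape=(256, 256, 3),
    classes=1,                    # building vs. non-building
    activation="sigmoid",
)
uresnet34.compile(
    optimizer="adam",
    loss=sm.losses.bce_jaccard_loss,
    metrics=[sm.metrics.iou_score],
)
```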

3.5 Hyperparameter optimization of UResNet-34 using GWO

In this work, the hyperparameters of UResNet-34 were optimised using GWO. This is necessary because choosing appropriate hyperparameters significantly impacts the accuracy and convergence of the DL model. The learning rate, training epoch, optimiser, activation function, and loss function are notable DL training hyperparameters that are usually optimised. The learning rate regulates how much the DL model changes each time the model weights are updated in response to an estimated error. The training epoch expresses the number of times the complete dataset is passed through the model. The optimiser alters the characteristics of the neural network, such as its weights and learning rate, and aids in reducing total loss and improving accuracy. The activation function helps the DL network learn sophisticated patterns in the data. The loss function measures the difference between an estimated and a true value, from which gradients are derived to update the DL network's weights. Therefore, identifying the optimal hyperparameters is regarded as an optimisation problem. Some hyperparameters, mainly the optimiser, activation function, and loss function, are categorical and were encoded as integers for the objective function; these are inverse-transformed into their original values during training so the model does not encounter an error. Since the objective function aims to ascertain the best hyperparameter combination for attaining greater accuracy, it is defined to maximise the fitness value, with specific lower and upper bounds set for each hyperparameter. The overall design for the GWO-optimised UResNet-34 is illustrated in Fig. 5, and a code sketch of this formulation follows.

Fig. 5

Overall design for the GWO-optimized UResNet-34
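Below is a hedged sketch of wrapping the UResNet-34 training run as a GWO objective, assuming the Mealpy 2.x interface of the library mentioned in Sect. 4.2. The helper `build_model`, the datasets `train_data`/`val_data`, the categorical code lists, and the search bounds are placeholders, not the exact configuration used in this study.

```python
# A hedged sketch of the objective function wired to Mealpy's GWO
# (Mealpy 2.x interface assumed); names marked as placeholders are not
# taken from the paper's implementation.
import numpy as np
from mealpy.swarm_based import GWO

ACTIVATIONS = ["relu", "elu", "sigmoid"]             # integer-encoded categoricals
LOSSES = ["binary_crossentropy", "jaccard", "dice"]

def objective(solution):
    # Inverse-transform the encoded values back to usable hyperparameters.
    activation = ACTIVATIONS[int(round(solution[0]))]
    learning_rate = float(solution[1])
    loss = LOSSES[int(round(solution[2]))]
    epochs = int(round(solution[3]))
    model = build_model(activation, loss, learning_rate)  # placeholder builder
    history = model.fit(train_data, validation_data=val_data, epochs=epochs)
    # Fitness to maximise; the metric key depends on how the model is compiled.
    return float(np.max(history.history["val_iou_score"]))

problem = {
    "fit_func": objective,
    "lb": [0, 1e-5, 0, 10],      # illustrative lower bounds per hyperparameter
    "ub": [2, 1e-2, 2, 100],     # illustrative upper bounds per hyperparameter
    "minmax": "max",             # maximise the fitness value
}
optimiser = GWO.OriginalGWO(epoch=10, pop_size=5)
best_position, best_fitness = optimiser.solve(problem)
```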

3.6 Evaluation metrics

Five standard metrics (recall, precision, accuracy, F1 score, and MIoU) [102] were adopted to assess the proposed GWO-UResNet-34 model against U-Net and UResNet-34. Recall defines how complete a model is and is expressed as the ratio of the number of positively detected targets to the total number of positive targets, as shown in Eq. (12). Precision indicates how exact or correct a model is and is defined as the ratio of the number of positively detected targets to the total number of targets detected as positive; its mathematical expression is presented in Eq. (13). Accuracy is the ratio of correctly detected targets to the total number of detected targets and is computed as shown in Eq. (14). The F1 score is the harmonic mean of precision and recall (Eq. (15)), and MIoU provides a balance between recall rate and accuracy, as defined in Eq. (16).

$$\mathrm{Recall }= \frac{TP}{TP+FN}$$
(12)
$$\mathrm{Precision }=\frac{TP}{TP+FP}$$
(13)
$$\mathrm{Accuracy }= \frac{TP+TN}{TP+FP+TN+FN}$$
(14)
$$\mathrm{F}1-\mathrm{score }= \frac{2*Precision*Recall}{Precision+Recall}$$
(15)
$$\mathrm{MIoU }= \frac{1}{K}{\sum }_{i=1}^{K}\frac{\left|{P}_{i}\cap {G}_{i}\right|}{\left|{P}_{i}\cup {G}_{i}\right|}$$
(16)

where \(\mathrm{TP}\), \(\mathrm{TN}\), \(\mathrm{FP}\), and \(\mathrm{FN}\) represent the true positives (pixels correctly predicted as buildings), true negatives (pixels correctly predicted as non-building), false positives (pixels erroneously predicted as buildings), and false negatives (building pixels erroneously predicted as non-building), respectively. \(K\) is the number of classes, which in this research is 2, \({P}_{i}\) denotes the pixels predicted as class \(i\), and \({G}_{i}\) represents the corresponding ground truth. MIoU ranges between 0 and 1, with 0 being the worst prediction and 1 being the best. A minimal sketch of these computations is given below.
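The following NumPy sketch implements Eqs. (12)-(16) for a pair of binary masks; `pred` and `gt` are assumed to be 0/1 arrays of identical shape.

```python
# A minimal NumPy sketch of Eqs. (12)-(16) for binary building masks.
import numpy as np

def segmentation_metrics(pred, gt):
    tp = np.sum((pred == 1) & (gt == 1))   # building pixels found
    tn = np.sum((pred == 0) & (gt == 0))   # background pixels found
    fp = np.sum((pred == 1) & (gt == 0))   # background labelled as building
    fn = np.sum((pred == 0) & (gt == 1))   # building labelled as background
    recall = tp / (tp + fn)                                 # Eq. (12)
    precision = tp / (tp + fp)                              # Eq. (13)
    accuracy = (tp + tn) / (tp + tn + fp + fn)              # Eq. (14)
    f1 = 2 * precision * recall / (precision + recall)      # Eq. (15)
    # Eq. (16): mean IoU over the K = 2 classes (building, non-building).
    iou_building = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    miou = (iou_building + iou_background) / 2
    return {"recall": recall, "precision": precision, "accuracy": accuracy,
            "f1": f1, "miou": miou}
```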

4 Experimental results

4.1 Dataset pre-processing

The dataset was manually labelled into two classes, building or non-building, and subsequently converted to binary masks (0, 1) to serve as ground truth data, with 1 representing buildings and 0 representing non-buildings. A sample image tile with its corresponding mask is given in Fig. 6. The orthomosaics were in tiles of 5000 × 5000 pixels; as these sizes were too large for computer memory, they were divided into 256 × 256 patches, resulting in 10,624 images and masks each. Image patches without at least 5% building information were removed from the dataset to prevent bias towards the background. The remaining dataset (5045 images and masks each) was randomly divided into training (80%) and validation (20%) datasets. The purpose of the training dataset is to provide the model with the necessary information and visual properties of the buildings. The validation data, on the other hand, aids in verifying and improving the model's performance during training. Data augmentation techniques such as vertical flip, \(90^{\circ}\) random rotation, horizontal flip, transpose, and grid distortion were randomly applied to only the training images and their corresponding masks. This procedure produced 12,000 samples each for the training images and their corresponding masks.

Fig. 6

Sample (a) Training Image and (b) Corresponding Mask Tiles
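The listed augmentations correspond to transforms available in the Albumentations library; the sketch below shows how they can be applied jointly to an image and its mask so the labels stay aligned. The 0.5 probabilities are illustrative, and the zero arrays stand in for a real training patch and its binary mask.

```python
# A sketch of the listed augmentations using Albumentations; probabilities
# and placeholder arrays are illustrative, not the study's exact setup.
import numpy as np
import albumentations as A

augment = A.Compose([
    A.VerticalFlip(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),   # random 90-degree rotation
    A.Transpose(p=0.5),
    A.GridDistortion(p=0.5),
])

image = np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder 256x256 patch
mask = np.zeros((256, 256), dtype=np.uint8)      # placeholder binary mask
# Image and mask are transformed jointly so the labels stay aligned.
out = augment(image=image, mask=mask)
aug_image, aug_mask = out["image"], out["mask"]
```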

For the test data, each orthomosaic was of a tile size of \(5000\times 5000\) pixels and was directly fed to the trained model during the prediction stage. However, a modified version of the Smoothly-Blend-Image-Patches code (https://github.com/Vooban/Smoothly-Blend-Image-Patches) was employed for the models to ensure smooth and efficient predictions. The code works by first appropriately padding the input image to accommodate potential sampling beyond the image's boundaries during the prediction process. Subsequently, the image is divided into patches, forming a 5D NumPy array. These patches, represented initially as 3D arrays, are ordered in spatial dimensions, necessitating the addition of two extra dimensions. These spatially ordered patches are then reshaped into 4D arrays, aligning along a single batch-size dimension. This arrangement facilitates batch predictions, optimising GPU memory usage by simultaneously loading all patches into memory. The prediction function incorporates the trained models and is employed to predict buildings in each patch. The predicted results are restructured back into a 5D array following batch predictions. Lastly, a spline interpolation is applied to merge the patch predictions into a cohesive 3D image array. The benefit of this approach is that it capitalises on batch size for efficient GPU utilisation while ensuring smooth predictions.
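A hedged usage sketch of this prediction strategy is shown below. The function name follows the `smooth_tiled_predictions.py` module in the linked repository and may differ in the modified version used here; `input_img` (a padded orthomosaic tile) and `model` (a trained network) are assumed to exist.

```python
# A hedged usage sketch of the smooth-blending prediction from the linked
# repository; `input_img` and `model` are assumed to be defined elsewhere.
from smooth_tiled_predictions import predict_img_with_smooth_windowing

prediction = predict_img_with_smooth_windowing(
    input_img,           # padded 5000 x 5000 x 3 orthomosaic tile
    window_size=256,     # patch size the models were trained on
    subdivisions=2,      # overlap factor used for spline blending
    nb_classes=1,        # single-channel building probability map
    pred_func=lambda patches: model.predict(patches),  # batched predictions
)
```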

4.2 Experimental design

The experiment was implemented in the Python programming language using open-source libraries such as TensorFlow, OpenCV, NumPy, Segmentation Models, and Mealpy. The models were developed, trained, and evaluated on a Windows operating system using a GeForce RTX 2060 GPU with 16 GB RAM. The training process utilised a data generator with a batch size of 16 to read both the images and their corresponding masks for the training and validation datasets. The data was subsequently fed to the UResNet-34 model, which was formulated as an objective function for the GWO. The GWO parameters were initialised to solve for the best hyperparameter combination for UResNet-34 to achieve maximum model accuracy. The parameters for the GWO settings are presented in Table 2.

Table 2 GWO Parameter Settings

4.3 Performance comparison

The efficacy of the proposed GWO-UResNet-34 model was validated by comparing it against the U-Net and UResNet-34 models on four distinct localities using the evaluation metrics (recall, precision, accuracy, F1 score, and MIoU). Both U-Net and UResNet-34 were trained and validated using the same data as the proposed GWO-UResNet-34 model. The findings of the evaluation are discussed in the subsequent sections.

4.3.1 Results on locality-1 test data

Table 3 shows the performance outcomes of each model when utilised to segment the buildings in Locality-1. Based on Table 3, UResNet-34 achieved better results than U-Net. However, when GWO was utilised to optimise UResNet-34, there were general improvements in all metric scores except precision. Nevertheless, GWO-UResNet-34 was able to find a balance between recall and precision, which UResNet-34 failed to do. This resulted in better F1 and MIoU values for GWO-UResNet-34, which are the more comprehensive metrics for semantic image segmentation tasks. Compared to UResNet-34, GWO-UResNet-34 had 10.39% and 13.20% improvements in F1 score and MIoU, respectively.

Table 3 Comparative evaluation scores among models for Locality-1

A graphical evaluation was conducted to illustrate each model's segmentation outcomes, depicted in Fig. 7. The first and second columns represent the test image and its corresponding mask, while the final three columns represent the segmentation results of U-Net, UResNet-34, and GWO-UResNet-34, respectively. From Fig. 7, it is noticeable that the proposed GWO-UResNet-34 model had only a few inaccurately predicted buildings (false positives) and could extract the geometries of buildings more accurately than U-Net and UResNet-34. This could be attributed to the enhanced learning dynamics that GWO imparts to the UResNet-34 model by directing it towards more promising regions of the hyperparameter space. As such, the proposed model is better equipped to converge to a solution that minimises the segmentation error.

Fig. 7

Building segmentation results for Locality-1. a UAV orthomosaic; (b) mask; (c)U-Net; (d) UResNet-34; (e) GWO-UResNet-34

4.3.2 Results on locality-2 test data

Locality-2 test data was challenging since it largely comprised a slum with buildings having no noticeable boundaries, making them difficult to identify and segment. Regardless, the findings of the quantitative comparison among the models presented in Table 4 indicate the superiority of the proposed GWO-UResNet-34 model. The proposed model had 0.77%, 8.49%, 0.79%, and 1.83% improvements in accuracy, precision, F1 score, and MIoU over UResNet-34, and recorded improvements of 8.96%, 1.15%, 25.57%, 15.32%, and 13.48% across the five evaluation metrics compared to U-Net.

Table 4 Comparative evaluation scores among models for Locality-2

Figure 8 illustrates the visual comparison of the segmentation results achieved by each model. All the models had issues extracting the imperceptible gaps between dense buildings. However, the segmentation output from the proposed GWO-UResNet-34 was comparable to the ground truth.

Fig. 8

Building segmentation results for Locality-2. a UAV orthomosaic; (b) mask; (c)U-Net; (d) UResNet-34; (e) GWO-UResNet-34

4.3.3 Results on locality-3 test data

This test data was selected to assess how well the model can segment buildings in areas with more vegetation than buildings. It is evident in Table 5 that the proposed GWO-UResNet-34 model achieved the best results. U-Net had a high precision of 0.9059 but a lower recall of 0.2810, and a similar situation was encountered for UResNet-34. However, the proposed model found a balance between precision and recall, with scores of 0.9097 and 0.8842, respectively. This balance yielded F1 and MIoU improvements of 46.7% and 31.12% over U-Net, and 17.25% and 13.26% over UResNet-34, respectively.

Table 5 Comparative evaluation scores among models for Locality-3

A visual comparison is presented in Fig. 9 to demonstrate the segmentation results achieved by the proposed and compared models. From Fig. 9, the enhanced learning of the proposed model enabled it to achieve segmentation outputs similar to the mask. Thus, the geometry of the segmented buildings is, to a great extent, akin to that of the masks, with almost no false positives.

Fig. 9

Building segmentation results for Locality-3. a UAV orthomosaic; (b) mask; (c)U-Net; (d) UResNet-34; (e) GWO-UResNet-34

4.3.4 Results on locality-4 test data

The Locality-4 dataset encompasses a diverse range of building types, including commercial, residential, and slum structures. This dataset was explicitly employed to evaluate the model's capability to learn and identify buildings in a complex setting. The quantitative evaluation results for the three models are provided in Table 6. The results in the table reveal an overall improvement in all evaluation metrics achieved by the GWO-UResNet-34 model. The model demonstrated a well-balanced performance, with F1 score improvements of 9.20% and 4.96% and MIoU improvements of 10.99% and 6.08% over U-Net and UResNet-34, respectively.

Table 6 Comparative evaluation scores among models for Locality-4

The schematic diagram in Fig. 10 illustrates the visual assessment of the models. The outlines of buildings segmented by the proposed GWO-UResNet-34 model are well-defined and comparable to the mask. Overall, GWO-UResNet-34 could segment buildings with more detailed information and less noise (false positives). U-Net and UResNet-34 achieved similar results for large buildings but struggled with smaller buildings.

Fig. 10

Building segmentation results for Locality-4. a UAV orthomosaic; (b) mask; (c)U-Net; (d) UResNet-34; (e) GWO-UResNet-34

4.4 Strengths and limitations of the study

The results from the comparative assessment indicated the superiority of the GWO-UResNet-34 for building extraction from different urban layouts. The GWO-UResNet-34 consistently outperformed U-Net and UResNet-34 in almost all evaluation metrics across the four test datasets. Moreover, unlike U-Net and UResNet-34, GWO-UResNet-34 could find a good balance between precision and recall. This balance is vital for semantic image segmentation tasks, ensuring accurate identification and comprehensive coverage of building segments. These improvements imply that the GWO-UResNet-34 model has good generalisation capabilities, demonstrated consistently across different test datasets, including areas with dense vegetation, challenging slum areas, and diverse building types.

However, although the GWO-UResNet-34 model exhibited great potential in segmenting buildings from different localities, some limitations exist. First, the computational cost of GWO-UResNet-34 needs to be reduced: hyperparameter selection took considerable time before an optimal model could be attained, which can limit the scalability and practicality of the approach for large and complex datasets. Also, although satisfactory results were achieved, the study utilised only the GWO algorithm; other standalone metaheuristics (e.g., PSO and WOA), hybrid, or improved metaheuristic algorithms should be investigated to assess their performance in building segmentation. In addition, the study was limited to just four localities. Therefore, future work can test the efficiency of the proposed GWO-UResNet-34 in other localities with different building and roof configurations, such as condensed slums, rural settings, and complex, non-uniform architecture.

4.5 Research implication

This work has demonstrated the use and highlighted the importance of metaheuristic algorithms, notably the GWO algorithm, as an alternative for optimum hyperparameter selection and combination. The accurate and efficient building segmentation achieved by the model can support a variety of applications, such as urban planning and infrastructure monitoring. Moreover, this study will encourage further exploration and refinement of other optimisation techniques for optimising the selection of DL network hyperparameters.

5 Conclusions and future works

This work proposed a GWO-UResNet-34 for building extraction from high-resolution UAV orthomosaics. The GWO algorithm was utilised to fine-tune the adjustable hyperparameters of the UResNet-34 DL model. The hyperparameters comprised the activation function, optimiser, learning rate, loss function, and epoch. The proposed GWO-UResNet-34 model was evaluated using four different location-based testing datasets with distinct building layouts and shapes, and compared with two other segmentation models, namely U-Net and UResNet-34. Five evaluation metrics were used to assess the models' efficiency, specifically accuracy, precision, recall, F1 score, and MIoU. The following key conclusions were drawn from the study:

  i. The results indicated that the proposed GWO-UResNet-34 model was more robust, achieved state-of-the-art performance, and outperformed the other two models.

  ii. One notable strength of the GWO-UResNet-34 was its ability to balance precision and recall, which is vital for accurate and comprehensive building segmentation tasks.

  iii. Overall, the GWO-UResNet-34 had better generalisation capability across the four locations tested, demonstrating the potential of metaheuristic algorithms, particularly the grey wolf optimiser (GWO), for optimising the hyperparameter selection of DL networks for building segmentation from UAV orthomosaics.

  iv. As a limitation, the computational cost and scalability of the approach need to be carefully probed, as the hyperparameter selection process took considerable time and may pose challenges for large and complex datasets.

  v. Future studies could explore other metaheuristic algorithms to assess their performance in optimising the hyperparameters of DNNs for building segmentation.