1 Introduction

Recently, ocean exploration and the utilization of oceanic resources and energy have gained significance worldwide (Zhou et al. 2023a, b). Modern marine research urgently requires high-precision three-dimensional (3D) underwater data (Fan et al. 2017). Underwater 3D reconstruction primarily relies on two key technologies: optical and acoustic imaging. Because visible light attenuates rapidly in water, underwater visibility is substantially limited, making optical imaging far less effective underwater than in air.

In contrast, acoustic imaging has found extensive applications in underwater measurement, seabed operations, underwater archaeology, underwater navigation, and various other fields because of its minimal attenuation and long propagation distance in water. Side-scan sonar has been widely used for reconstructing seabed terrain and objects. The device emits sound pulses to its left and right sides through transducer arrays mounted on either side (Key 2000). When the emitted sound waves encounter the seabed or objects on it, they are scattered and produce echoes. These echoes are captured by the sonar and converted into electrical signals through a series of processing steps. The echo intensity is mapped to pixel brightness in the side-scan sonar image: stronger echoes correspond to higher gray values. Figure 1 illustrates the 'StarFish' side-scan sonar and a captured sonar image.

Fig. 1
figure 1

a Side-scan sonar 'StarFish' from BlueView company. b Schematic of the side-scan sonar scanning range. c Example of the captured sonar image

The imaging model of sonar is similar to that of optical reflection, and researchers often simplify it by approximating it with the Lambertian model (Durá et al. 2004; Coiras et al. 2007). The Lambertian model assumes that the object’s surface exhibits diffuse reflection, where the reflected intensity depends solely on the angle between the incident light and the surface normal, independent of the viewing direction (Ju et al. 2021, 2023c). The shape-from-shading (SfS) algorithm (Horn 1970) can be applied under the Lambertian reflection model to deduce the object’s 3D shape. In other words, given the grayscale variations within a single image, these methods compute the object’s surface normals from the Lambertian model and then derive the surface height. However, this approach suffers from the loss of global information and sensitivity to noise.
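To make the Lambertian assumption concrete, the sketch below renders a reflectance map from a given normal map; the function name, the unit incidence direction light_dir, and the constant albedo are illustrative assumptions rather than part of the sonar imaging model used later.

```python
import numpy as np

def lambertian_intensity(normals, light_dir, albedo=1.0):
    """Render a reflectance map under the Lambertian assumption.

    normals:   (H, W, 3) array of unit surface normals.
    light_dir: (3,) unit vector pointing toward the (acoustic) source.
    Returns an (H, W) intensity image proportional to the cosine of the
    incidence angle, i.e., I = albedo * max(0, n . l).
    """
    light_dir = np.asarray(light_dir, dtype=float)
    light_dir = light_dir / np.linalg.norm(light_dir)
    cos_theta = np.clip(normals @ light_dir, 0.0, None)  # n . l, clamped at 0
    return albedo * cos_theta
```

SfS methods invert exactly this relation: given the observed intensities, they recover the normals and then the surface height.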

To address the above challenges, we explored the following aspects.

  • (1) High-low frequency separation SfS: Traditional SfS algorithms (Horn 1970; Ikeuchi and Horn 1981; Frankot and Chellappa 1988) tend to overly emphasize reconstructing high-frequency details while overlooking the low-frequency information of the overall scene. To overcome these limitations, we employ the discrete cosine transform (DCT) (Ahmed et al. 1974) to split the sonar images into high- and low-frequency components. The low-frequency image is then reconstructed using the SfS algorithm, whereas the high-frequency image contains too much noise and is discarded. Our proposed high-low frequency separation SfS method enhances the reconstruction accuracy of traditional SfS methods.

  • (2) Depth estimation network with dilated convolution and attention mechanism: We also propose a monocular depth estimation network that combines the attention mechanism (Vaswani et al. 2017) with dilated convolution (Yu et al. 2017). The proposed network introduces a multiscale attention mechanism within the connection between the encoder and decoder. This enables the extraction of detailed features in the scene, ultimately improving depth estimation.

  • (3) Normal-depth fusion algorithm: We introduce a normal-depth fusion algorithm to fuse the surface normal maps obtained from (1) and the depth maps obtained from (2). The depth map produced using the depth estimation algorithm places a stronger emphasis on global depth changes, which contrasts with the focus of SfS. By combining these two methods, we address each other’s limitations, resulting in a 3D reconstruction that effectively balances global and local details.

We conducted experiments on synthetic (Gwon et al. 2017), NYU-depth-v2 (Silberman et al. 2012), and real side-scan sonar datasets to demonstrate the effectiveness of the proposed normal-depth fusion method. The remainder of this paper is organized as follows: Section 2 discusses related works. Section 3 presents the proposed method. The experimental results are presented in Section 4.

2 Related work

2.1 Shape-from-Shading (SfS)

SfS, proposed by Horn (1970), is a classic algorithm for monocular image-based 3D reconstruction. The SfS algorithm deduces the 3D shape of the target under the assumption of the Lambertian model. In contrast to photometric stereo methods (Ju et al. 2020a, 2022, 2023a, b; Liu et al. 2022), SfS uses a single image as input, making it capable of addressing surface reconstruction in dynamic scenes with non-rigid objects. SfS methods can be categorized into three types according to the reconstruction process: cost function minimization, propagation, and local methods.

Cost function minimization methods address the Lambertian reflection reconstruction problem by iteratively minimizing a cost function (Ikeuchi and Horn 1981). The problem is to determine the surface normal direction along the x- and y-axes given the grayscale of the input image. Various constraints have been introduced to solve this problem effectively. For example, Brooks and Horn (1985) introduced two fundamental constraints: photometric consistency and smoothness. Photometric consistency ensures that the grayscale rendered from the reconstructed 3D object matches that of the input image, and the smoothness constraint maintains a gradual and natural transition along the edges of the reconstructed surface. Frankot and Chellappa (1988) introduced integrability constraints to ensure that the reconstructed surface is integrable and accurately represents the real scene. For specific tasks, such as side-scan sonar imaging, Zheng and Chellappa (1991) proposed a multiresolution 3D reconstruction method with a coarse-to-fine structure.

Propagation methods calculate other points along characteristic curves based on the depth and normal direction of a known point. Horn (1970) started from a singular point and propagated along surface curves within a spherical neighborhood, with the propagation direction following the change in surface gradient. Rouy and Tourin (1992) introduced an SfS method based on the Hamilton-Jacobi-Bellman equation (Bellman 1966) and viscosity solution theory. They applied dynamic programming to establish the relationship between the viscosity solution and optimal control, enabling the retrieval of a unique SfS solution. Kimmel and Bruckstein (1995) employed multilayer contours, initially defined by closed curves near singular points, to reconstruct the object’s surface.

Local methods recover the shape information of the target by analyzing the image brightness and its first and second derivatives. Pentland (1984) assumed that the surface around each image point can be approximated as locally spherical. Lee and Rosenfeld (1985) leveraged this local spherical assumption and analyzed the first derivative of the image brightness to determine the tilt and slant of the surface normal with respect to the light source coordinate system.

2.2 Monocular depth estimation

Deep learning techniques have shown great capability in many computer vision tasks (Ju et al. 2020b; Kong et al. 2022; Rao et al. 2023; Xiao et al. 2023; Yu et al. 2023; Zhang et al. 2023b). Monocular depth estimation driven by deep neural networks (Wang et al. 2020; Kong et al. 2021; Luo et al. 2021; Xiao et al. 2021; Zhang et al. 2023a) has recently seen significant advances because of their potent feature learning capabilities. Monocular depth estimation learns to predict the depth map of a scene from a single input image. Eigen et al. (2014) first used a coarse-to-fine network for monocular depth estimation, in which the former part of the network learns the global depth of the scene while the latter refines the local details of the target shape. Li et al. (2015) introduced a dual-process framework based on the VGG-16 model, in which the two processes estimate the depth and gradient values of the object, respectively. Chen et al. (2016) developed a multiscale network that predicts the depth value of each pixel by learning the relative depth of the scene, using the relative depth error as the loss function. More recently, Lee et al. (2019) proposed multiscale local planar guidance layers as a replacement for conventional upsampling in the decoder; these layers adopt local planar assumptions to recover feature maps at the original resolution directly. Gan et al. (2018) employed a transformation layer to reflect the positional relationships between images and estimated the relative depth of different objects in the scene from the position changes across adjacent images. Jung et al. (2017) employed generative adversarial networks (GANs) for single-image depth prediction, with a generator that combines GlobalNet to extract global features and RefinementNet to estimate local details.

3 Proposed method

In this section, we present the proposed normal-depth fusion method for underwater side-scan sonar image 3D reconstruction. Figure 2 shows the overview framework. The proposed method can be divided into three parts: high-low frequency separation SfS method (Section 3.1), local planar guidance monocular depth estimation network (Section 3.2), and normal-depth fusion algorithm (Section 3.3).

Fig. 2
figure 2

The overall structure of the proposed underwater side-scan sonar image 3D reconstruction method

3.1 High-low frequency separation SfS method

First, we propose an improved SfS surface reconstruction method for underwater side-scan sonar images. Traditional SfS techniques reconstruct high-frequency details well; however, they suffer from ambiguous global height changes within a scene. To address this limitation, we introduce a high-low frequency separation method for 3D reconstruction from the input images, in which the low-frequency component is reconstructed using the SfS algorithm, whereas the high-frequency component, which contains most of the noise, is discarded. The proposed high-low frequency separation SfS method enhances the reconstruction accuracy of traditional SfS methods. Figure 3 shows the framework of the proposed high-low frequency separation SfS method.

Fig. 3
figure 3

Framework of the proposed high-low frequency separation SfS method. We first remove the shadow of the input sonar image and use the DCT (Ahmed et al. 1974) to split the sonar images into high- and low-frequency components. The output of SfS (Brooks and Horn 1985) using the low-frequency component, with less noise, obviously outperforms the results of the original sonar image

In side-scan sonar imaging, which is similar to the optical imaging model, the propagation of sound waves can be obstructed by tall targets (Deb and Suny 2014). This obstruction forms acoustic shadows behind the obstructing targets. If these shadows are not properly accounted for in the SfS algorithm, they are incorrectly interpreted as depressions, degrading the accuracy of the reconstructed target shape. As shown in Fig. 3, we therefore first remove the shadow areas from the input sonar images. Inspired by Deb and Suny (2014), we transform the red-green-blue (RGB) sonar image into the YCbCr colorspace, where the Y channel carries the brightness information of the image and the Cb and Cr channels represent the color information. To identify shadowed areas, we calculate the mean value of the Y channel and examine each pixel in the image: pixels with brightness values lower than the mean are labeled as shadow areas.

We compute the means of the Y, Cb, and Cr channels, along with the standard deviation of the Y channel, separately for the shadowed and non-shadowed regions of the image. We denote the channel means of the shadowed area as \(\mu _{SY}\), \(\mu _{SCb}\), \(\mu _{SCr}\), those of the non-shadowed area as \(\mu _Y\), \(\mu _{Cb}\), \(\mu _{Cr}\), and the ratio of the Y-channel standard deviations of the two regions as \(\lambda\). The coefficients \(\alpha _{y}\), \(\alpha _{b}\), and \(\alpha _{r}\) for shadow removal are then defined as follows:

$$\begin{aligned} \alpha _{y} = \mu _Y - \lambda \times \mu _{SY} \end{aligned},$$
(1)
$$\begin{aligned} \alpha _{b} = \mu _{Cb}- \lambda \times \mu _{SCb} \end{aligned},$$
(2)
$$\begin{aligned} \alpha _{r} = \mu _{Cr}- \lambda \times \mu _{SCr} \end{aligned}.$$
(3)

In this case, for a pixel in the YCbCr image with the value [Y, Cb, Cr], its value after shadow removal, denoted as [\(Y'\), \(Cb'\), \(Cr'\)], is computed as follows:

$$\begin{aligned} {Y}'= \alpha _{y} + \lambda \times {Y} \end{aligned},$$
(4)
$$\begin{aligned} {Cb}' = \alpha _ {b} + \lambda \times {Cb} \end{aligned},$$
(5)
$$\begin{aligned} {Cr}' = \alpha _{r} + \lambda \times {Cr} \end{aligned}.$$
(6)
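A minimal NumPy/OpenCV sketch of the shadow detection and compensation in Eqs. (1)-(6) is given below. Two points are our assumptions rather than statements of the paper: \(\lambda\) is taken as the ratio of the non-shadow to shadow Y-channel standard deviations, and the correction of Eqs. (4)-(6) is applied only to pixels labeled as shadow.

```python
import cv2
import numpy as np

def remove_sonar_shadow(img_bgr):
    """Sketch of the YCbCr shadow compensation in Eqs. (1)-(6)."""
    ycc = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    Y, Cr, Cb = ycc[..., 0], ycc[..., 1], ycc[..., 2]   # OpenCV stores Y, Cr, Cb

    shadow = Y < Y.mean()                                # pixels darker than the mean of Y
    lam = Y[~shadow].std() / (Y[shadow].std() + 1e-8)    # ratio of Y standard deviations (assumption)

    out = ycc.copy()
    for c, ch in zip((0, 2, 1), (Y, Cb, Cr)):            # channel order Y, Cb, Cr as in Eqs. (1)-(3)
        alpha = ch[~shadow].mean() - lam * ch[shadow].mean()   # Eqs. (1)-(3)
        out[..., c][shadow] = alpha + lam * ch[shadow]         # Eqs. (4)-(6), shadow pixels only
    out = np.clip(out, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_YCrCb2BGR)
```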

After removing the shadows, we employ the DCT (Ahmed et al. 1974) to separate the high- and low-frequency components of the sonar image. The DCT is a widely used image compression technique that converts an image from the spatial domain to the frequency domain. We apply a mask to the transformed image, zeroing out coefficients smaller than a specific threshold, which yields the compressed low-frequency coefficients. The high-frequency coefficients are then obtained by subtracting the compressed coefficients from the original coefficients. The low-frequency coefficients capture the grayscale variations in the image, whereas the high-frequency coefficients represent the edges and texture details of the target shape. Finally, we perform an inverse DCT on the low- and high-frequency coefficients separately to obtain the corresponding low- and high-frequency component images.
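The frequency separation can be sketched with SciPy's 2D DCT as follows; the choice of threshold (a fraction of the largest coefficient magnitude) is our assumption, since the exact value is not specified here.

```python
import numpy as np
from scipy.fft import dctn, idctn

def split_high_low(gray, thresh_ratio=0.02):
    """Split a grayscale image into low- and high-frequency components via the 2D DCT."""
    coeff = dctn(gray.astype(np.float64), norm='ortho')
    thresh = thresh_ratio * np.abs(coeff).max()                 # threshold value is an assumption
    low_coeff = np.where(np.abs(coeff) >= thresh, coeff, 0.0)   # keep the large (compressed) coefficients
    high_coeff = coeff - low_coeff                              # remainder: edges, texture, and noise
    low = idctn(low_coeff, norm='ortho')
    high = idctn(high_coeff, norm='ortho')
    return low, high
```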

Finally, we employ a cost function minimization SfS method (Brooks and Horn 1985) to reconstruct the surface normals from the low-frequency image. As Fig. 3 shows, because the noise resides primarily in the high-frequency component, the proposed high-low frequency separation SfS method yields markedly more accurate results than applying the method of Brooks and Horn (1985) directly to the original sonar image.
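For reference, a heavily simplified iterative normal-estimation sketch in the spirit of cost function minimization SfS (Brooks and Horn 1985) is shown below; it is not their exact update, and the source direction s, the weight lam, and the iteration count are illustrative assumptions.

```python
import numpy as np

def sfs_minimization(I, s=(0.0, 0.0, 1.0), lam=1.0, iters=200):
    """Alternate a smoothness step (local averaging of normals) with a
    brightness step that pulls the rendered intensity n.s toward the image I,
    then re-normalize to keep unit-length normals."""
    H, W = I.shape
    s = np.asarray(s, dtype=float)
    s = s / np.linalg.norm(s)
    n = np.tile(s, (H, W, 1))                           # initialize all normals toward the source
    for _ in range(iters):
        n_bar = 0.25 * (np.roll(n, 1, 0) + np.roll(n, -1, 0) +
                        np.roll(n, 1, 1) + np.roll(n, -1, 1))   # 4-neighbour average (smoothness)
        err = I - n @ s                                 # per-pixel brightness residual
        n = n_bar + lam * err[..., None] * s            # photometric (data) correction
        n = n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12)
    return n
```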

3.2 Depth estimation network with dilated convolution and attention mechanism

In this section, we propose a novel monocular depth estimation network based on an encoder-decoder architecture to estimate the depth of the captured sonar image. The encoder extracts dense features and fundamental contextual information from the input image via a ResNet-101 backbone (He et al. 2016) and captures contextual information at multiple scales using convolution kernels with different dilation rates (dilated convolution) (Yu et al. 2017). The decoder employs local planar guidance modules (Lee et al. 2019) in place of nearest-neighbor upsampling, enabling the network to efficiently restore the feature maps to the original resolution. An attention mechanism (Vaswani et al. 2017) is incorporated into the encoder-decoder connections, allowing the predicted depth map to focus on objects while maintaining the original prediction accuracy. This refinement enhances edges and local details, meeting the requirements of subsequent high-precision 3D reconstruction. Figure 4 shows the detailed structure of the proposed depth estimation network.

Fig. 4
figure 4

Detailed structure of the proposed depth estimation network

Figure 4 shows that the proposed depth estimation network follows an encoding-decoding scheme: the input features are reduced to \(\frac{p}{8}\) resolution and then restored to the original resolution p for dense depth prediction, where p denotes the input image resolution. First, a dense feature extractor with a ResNet-101 backbone (He et al. 2016) downsamples the monocular input image to \(\frac{p}{8}\) resolution. Multiscale dilated convolutional layers combined with an attention mechanism are then employed to extract contextual information; the dilation rates r are 3, 6, 12, and 24, and each scale’s dilated convolution layer is paired with an attention module. During decoding, local planar guidance layers (Lee et al. 2019) upsample the multiscale feature maps layer by layer from \(\frac{p}{8}\) back to the original image resolution p. Finally, a \(1 \times 1\) convolution produces the final high-resolution, single-channel depth map d.
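A condensed PyTorch sketch of this architecture is given below. The channel widths, the squeeze-and-excitation-style attention gate, and the bilinear-upsampling decoder are simplified stand-ins of our own (the actual decoder uses local planar guidance layers), so the code illustrates the overall layout rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torchvision

class DilatedAttentionBlock(nn.Module):
    """One dilated-convolution branch followed by a lightweight channel-attention gate."""
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=rate, dilation=rate, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.att = nn.Sequential(                       # squeeze-and-excitation style gate (stand-in)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 4, out_ch, 1), nn.Sigmoid())

    def forward(self, x):
        f = self.conv(x)
        return f * self.att(f)

class SonarDepthNet(nn.Module):
    """Encoder (ResNet-101, 1/8 resolution) -> multiscale dilated conv + attention -> decoder."""
    def __init__(self, rates=(3, 6, 12, 24), mid_ch=256):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        # keep layers up to layer2: 512-channel features at 1/8 of the input resolution
        self.encoder = nn.Sequential(*list(backbone.children())[:-4])
        self.context = nn.ModuleList(
            DilatedAttentionBlock(512, mid_ch, r) for r in rates)
        self.fuse = nn.Conv2d(mid_ch * len(rates), mid_ch, 1)
        self.decoder = nn.Sequential(                   # stand-in for the local planar guidance decoder
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(mid_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(64, 1, 1))                        # 1x1 conv -> single-channel depth map

    def forward(self, x):
        feat = self.encoder(x)                          # (B, 512, H/8, W/8)
        ctx = torch.cat([blk(feat) for blk in self.context], dim=1)
        return self.decoder(self.fuse(ctx))             # (B, 1, H, W)
```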

The proposed depth estimation network is optimized using the following loss function:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_ {\textrm{depth}} + w \mathcal {L}_{\textrm{gradient}} \end{aligned},$$
(7)

where \(\mathcal {L}_{\textrm{depth}}\) is the loss on the depth map, \(\mathcal {L}_{\textrm{gradient}}\) is a constraint on the gradient of the depth map, and the weight w is empirically set to 0.1. Specifically, we define \(\mathcal {L}_{\textrm{depth}}\) and \(\mathcal {L}_{\textrm{gradient}}\) as follows:

$$\begin{aligned} \mathcal {L}_{\textrm{depth}} = \frac{1}{T} \sum ^{T}_{i} (d_{i} -\tilde{d_{i}})^{2} \end{aligned},$$
(8)

and

$$\begin{aligned} \mathcal {L}_{\textrm{gradient}} = \frac{1}{T} \sum ^{T}_{i} [(\frac{\partial d_{i}}{\partial x}-\frac{\partial \tilde{d_{i}}}{\partial x})^{2} + (\frac{\partial d_{i}}{\partial y}-\frac{\partial \tilde{d_{i}}}{\partial y})^{2} ] \end{aligned},$$
(9)

where \(d_i\) and \(\tilde{d_i}\) denote the ground-truth and estimated depth values at pixel index i, respectively, and T is the total number of pixels. \(\large \frac{\partial d_i}{\partial x}\) and \(\large \frac{\partial d_i}{\partial y}\) denote the gradients of \(d_i\) in the x and y directions, respectively.
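In PyTorch, the loss of Eqs. (7)-(9) can be written compactly as below; the use of forward differences for the spatial gradients is our choice, and the squared depth term follows Eq. (8).

```python
import torch

def depth_loss(pred, gt, w=0.1):
    """Combined loss of Eq. (7): squared depth error (Eq. (8)) plus gradient error (Eq. (9)).

    pred, gt: (B, 1, H, W) depth maps.
    """
    l_depth = torch.mean((pred - gt) ** 2)                                     # Eq. (8)

    dpdx = pred[..., :, 1:] - pred[..., :, :-1]                                # forward differences
    dpdy = pred[..., 1:, :] - pred[..., :-1, :]
    dgdx = gt[..., :, 1:] - gt[..., :, :-1]
    dgdy = gt[..., 1:, :] - gt[..., :-1, :]
    l_grad = torch.mean((dpdx - dgdx) ** 2) + torch.mean((dpdy - dgdy) ** 2)   # Eq. (9)

    return l_depth + w * l_grad                                                # Eq. (7), w = 0.1
```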

3.3 Normal-depth fusion algorithm

In this section, we combine the surface normal map obtained from the high-low frequency separation SfS (Section 3.1) with the depth map obtained from the depth estimation network (Section 3.2). This fusion allows us to leverage the benefits of detailed and global results.

We capitalize on the strengths of SfS and depth estimation to jointly optimize the final height map. The main aim is to accurately compute the global scene height while preserving the local intricacies of the object’s shape. This involves using the object’s surface normal vector obtained from SfS to guide our target surface normal vector and utilizing the depth map obtained from depth estimation to guide the target’s height map. Additionally, we introduce supplementary surface normal gradient constraints to ensure a smooth transition in the object’s surface normal vector. These three components are combined to formulate the final fusion energy function as follows:

$$\begin{aligned} \int \!\!\!\int _{(x, y)} m\left( D_f-D_e\right) ^2+(1-m)\left( \left( n_f-n_s\right) ^2+\left( n_{f x}+n_{f y}\right) ^2\right) {\textrm {d}} x {\textrm {d}} y=\int \!\!\!\int _{(x, y)} F {\textrm {d}} x {\textrm {d}} y \end{aligned},$$
(10)

where \(D_f\) denotes the target depth map after surface optimization, \(D_e\) represents the depth map obtained from depth estimation, \(n_s\) denotes the surface normal vector obtained using the SfS algorithm (Section 3.1), \(n_f\) denotes the target surface normal, and \(n_{fx}\) and \(n_{fy}\) denote the gradients of the target surface normal vector in the x and y directions, respectively. The variable m controls the relative weight of the two components. The surface normal vector can be expressed in terms of the height gradient as follows:

$$\left( p_f, q_f,-1\right) =\left( \frac{n_x}{n_z}, \frac{n_y}{n_z},-1\right) =\left( \frac{\partial D_f}{\partial x}, \frac{\partial D_f}{\partial y},-1\right) \text{ s.t. } n_f=\left( n_x, n_y, n_z\right),$$
(11)

The energy function in Eq. (10), with the normals expressed through Eq. (11), can be minimized using the Euler-Lagrange (E-L) equation, which can be expressed as follows:

$$\begin{aligned} m\left( D_f-D_e\right) -(1-m)\left( \frac{\partial ^2 D_f}{\partial x^2}+\frac{\partial ^2 D_f}{\partial y^2}+\frac{\partial ^3 D_f}{\partial x^3}+\frac{\partial ^3 D_f}{\partial y^3}-\frac{\partial ^2 D_s}{\partial x^2}-\frac{\partial ^2 D_s}{\partial y^2}\right) =0 \end{aligned},$$
(12)

where each part can be expressed via differential representation:

$$\frac{\partial ^2 D(x, y)}{\partial x^2} \approx \Delta ^2 D(x)=D(x+1, y)-2 D(x, y)+D(x-1, y),$$
(13)
$$\frac{\partial ^2 D(x, y)}{\partial y^2} \approx \Delta ^2 D(y)=D(x, y+1)-2 D(x, y)+D(x, y-1),$$
(14)
$$\frac{\partial ^3 D(x, y)}{\partial x^3} \approx \Delta ^2 D(x+1)-\Delta ^2 D(x),$$
(15)
$$\frac{\partial ^3 D(x, y)}{\partial y^3} \approx \Delta ^2 D(y+1)-\Delta ^2 D(y).$$
(16)

Eqs. (13), (14), (15), and (16) can be used to solve the E-L equation, as follows:

$$D_f=\frac{1}{m}\left[ (1-m) \widetilde{D}+m D_e-(1-m)\left( \frac{\partial ^2 D_s}{\partial x^2}+\frac{\partial ^2 D_s}{\partial y^2}\right) \right],$$
(17)

where \(\widetilde{D}\) is:

$$\widetilde{D}= D(x+2, y+1)+D(x, y+1)+D(x+1, y+2)+D(x+1, y)-4 D(x+1, y+1),$$
(18)

where the iterative representation of Eq. (17) is given by:

$$D_f^{n+1}=\frac{1}{m}\left[ (1-m) \widetilde{D}^n+m D_e-(1-m)\left( \frac{\partial ^2 D_s}{\partial x^2}+\frac{\partial ^2 D_s}{\partial y^2}\right) \right],$$
(19)

where \(\widetilde{D}^n\) represents the result of the nth iteration, which is used to obtain the result of the (n+1)th iteration. The iteration terminates when the number of iterations exceeds a maximum threshold or when the absolute difference between two consecutive results falls below a minimum threshold; the output at that point is taken as the final result. Algorithm 1 summarizes the proposed fusion method.

figure a

Algorithm 1 The proposed fusion algorithm
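As a complement to Algorithm 1, the sketch below iterates the fusion condition of Eq. (12) with a Jacobi-style update. For numerical stability we keep the depth term and the Laplacian terms, drop the third-derivative terms, and solve for the centre pixel at each step, so this is a stable rearrangement in the spirit of Eq. (19) rather than its literal transcription; how the SfS height map D_s is integrated from the normal map is outside this sketch.

```python
import numpy as np

def neighbor_sum(D):
    """Sum of the four neighbours, with replicated borders."""
    Dp = np.pad(D, 1, mode='edge')
    return Dp[2:, 1:-1] + Dp[:-2, 1:-1] + Dp[1:-1, 2:] + Dp[1:-1, :-2]

def laplacian(D):
    """Discrete Laplacian, Eqs. (13)-(14)."""
    return neighbor_sum(D) - 4.0 * D

def fuse_normal_depth(D_e, D_s, m=0.5, max_iter=1000, tol=1e-5):
    """Jacobi-style iteration for the fused height map D_f.

    D_e: depth map from the estimation network (global guidance).
    D_s: height map integrated from the SfS normal map (detail guidance).
    m:   weight balancing the depth term against the normal/smoothness terms.
    """
    lap_s = laplacian(D_s)
    D_f = D_e.copy()
    for _ in range(max_iter):
        D_prev = D_f
        # solve m(D_f - D_e) - (1-m)(lap(D_f) - lap(D_s)) = 0 for the centre pixel
        D_f = (m * D_e + (1.0 - m) * (neighbor_sum(D_f) - lap_s)) / (m + 4.0 * (1.0 - m))
        if np.max(np.abs(D_f - D_prev)) < tol:     # stop once consecutive results agree
            break
    return D_f
```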

4 Experimental results

4.1 Implementation details

Our method is implemented in PyTorch. We used the Adam optimizer with default settings (\(\beta _1\) = 0.9 and \(\beta _2\) = 0.999), and training was performed on a single RTX 2080Ti graphics processing unit (GPU). Given the similarity between the imaging models of side-scan sonar and optical images and the limited availability of side-scan sonar training data, we trained on the NYU-depth-v2 dataset. Specifically, we used 30,000 samples with data augmentation (rotation and cropping) for training and 1,000 NYU-depth-v2 samples for validation. We also evaluated the proposed method on the synthetic side-scan sonar dataset (Gwon et al. 2017) and a real side-scan sonar dataset.

4.2 Evaluation metrics

To evaluate the performance of the proposed SfS method, we use the Lambertian reflection model to re-render the reconstructed height map into a reflection map of the object for quantitative comparison, because ground-truth surface normal maps are not available for the side-scan sonar images. The reconstruction quality is evaluated with the correlation coefficient r and the signal-to-noise ratio (SNR) between the reflection image \(I^{\prime }\) and the original grayscale image I. These metrics serve as indirect indices of reconstruction quality; the correlation coefficient is given as follows:

$$\begin{aligned} r=\frac{\sum _{y=1}^N \sum _{x=1}^M\left( I^{\prime }(x, y)-\bar{I^{\prime }}\right) \times (I(x, y)-\bar{I})}{\sqrt{\sum _{y=1}^N \sum _{x=1}^M\left( I^{\prime }(x, y)-\bar{I^{\prime }}\right) ^2} \sqrt{\sum _{y=1}^N \sum _{x=1}^M(I(x, y)-\bar{I})^2}} \end{aligned},$$
(20)

where \(\bar{I^{\prime }}\) and \(\bar{I}\) represent the average values of \(I^{\prime }\) and I, respectively. The correlation coefficient r lies between –1 and 1; the closer r is to 1, the stronger the correlation between the reflection map and the input image, indicating a higher degree of similarity and thus a better reconstruction of the estimated height map. Additionally, a higher SNR indicates a larger proportion of signal in the image, corresponding to less noise and better image quality.
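The correlation coefficient of Eq. (20) can be computed directly; the function below is a small helper of our own naming.

```python
import numpy as np

def correlation_coefficient(I_ref, I):
    """Correlation coefficient r of Eq. (20) between the re-rendered reflection
    map I_ref and the original grayscale image I."""
    a = I_ref.astype(np.float64) - I_ref.mean()
    b = I.astype(np.float64) - I.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum())))
```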

Furthermore, to assess the performance of the proposed depth estimation network and the fused output, we report several widely used metrics: the threshold accuracies (\(\delta < 1.25\), \(\delta < 1.25^2\), \(\delta < 1.25^3\)), the absolute relative error (AbsRel), the root mean square error (RMSE), and the log10 error. For the threshold accuracies, larger values indicate better results; for the error metrics, smaller values are better.
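These metrics follow their standard definitions in the monocular depth estimation literature; a NumPy helper (our own naming, assuming strictly positive depths) is sketched below.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Threshold accuracies, AbsRel, RMSE, and log10 error for positive depth maps."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        'delta_1.25':   np.mean(ratio < 1.25),
        'delta_1.25^2': np.mean(ratio < 1.25 ** 2),
        'delta_1.25^3': np.mean(ratio < 1.25 ** 3),
        'AbsRel': np.mean(np.abs(pred - gt) / gt),
        'RMSE':   np.sqrt(np.mean((pred - gt) ** 2)),
        'log10':  np.mean(np.abs(np.log10(pred) - np.log10(gt))),
    }
```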

4.3 Effectiveness of the proposed SfS method

To assess the performance of the proposed high-low frequency separation SfS method, we conducted a comparative evaluation against the minimization method (Brooks and Horn 1985) and local method (Pentland 1984). We first test them on the NYU-depth-v2 test dataset (Silberman et al. 2012), with visualized results in Fig. 5 and quantitative results in Table 1. Then, we compare the estimations on the real side-scan sonar dataset, with visualized results in Fig. 6 and quantitative results in Table 2.

Fig. 5
figure 5

Reconstruction results of the proposed SfS method compared with those of the minimization method (Brooks and Horn 1985) and local method (Pentland 1984) on the NYU-depth-v2 dataset (Silberman et al. 2012)

Table 1 Quantitative comparisons using the metrics r and SNR on the NYU-depth-v2 dataset (Silberman et al. 2012)
Fig. 6
figure 6

Reconstruction results of the proposed SfS method compared with those of the minimization method (Brooks and Horn 1985) and local method (Pentland 1984) on the real side-scan sonar dataset

Table 2 Quantitative comparisons using the metrics r and SNR on the real side-scan sonar dataset

Figure 5 shows noticeable distortions and considerable noise in the reconstruction results of Brooks and Horn (1985) and Pentland (1984). In contrast, the proposed method, which incorporates low-frequency height constraints, yields smoother reconstructions with fewer instances of missing height information. For distant-view scenes, the minimization and local methods simply produce holes, which clearly fails to meet the scene reconstruction requirements.

As presented in Table 1, the proposed high-low frequency separation SfS outperforms the minimization and local methods across all metrics and samples. The higher r and SNR of the proposed method indicate that our reconstructed height map retains richer local details with less noise, demonstrating the effectiveness of the proposed SfS method.

We conducted comparative experiments on the real side-scan sonar dataset to further evaluate the proposed method. As shown in Fig. 6, the traditional SfS methods (Pentland 1984; Brooks and Horn 1985) produce height maps with more noise and less smoothness. In contrast, the height map reconstructed using our method suppresses the impact of high-frequency noise on the reconstruction quality owing to the added low-frequency height constraints. For example, in the view3 scene, the height map reconstructed using our method delicately represents the undulations of the seafloor ripple topography, resulting in gentler terrain features, whereas the traditional methods produce rougher ripples with larger undulations. The quantitative comparison in Table 2 further demonstrates the effectiveness of the proposed method on the real side-scan sonar dataset.

4.4 Effectiveness of the proposed depth estimation network

In this experiment, we assess the effectiveness of the proposed depth estimation network incorporating dilated convolution and an attention mechanism. We conducted our initial evaluation on the NYU-depth-v2 dataset (Silberman et al. 2012). Figure 7 shows an ablation experiment to assess the efficacy of the incorporated attention module (Vaswani et al. 2017). Subsequently, we conducted a comparative analysis with recent state-of-the-art (SOTA) depth estimation networks on the NYU-depth-v2 test set, with results summarized in Table 3. Finally, we validate the proposed method using a synthetic side-scan sonar dataset (Gwon et al. 2017) and a real side-scan sonar dataset, as shown in Figs. 8 and 9. Since we know the ground truth in the synthetic side-scan sonar dataset, we further quantitatively compare them in Table 3.

Fig. 7
figure 7

Ablation study of attention module (Vaswani et al. 2017) in our depth estimation network. The orange boxes represent the regions with detailed structures

Table 3 Quantitative comparisons on the NYU-depth-v2 test set
Fig. 8
figure 8

Visualized examples on the synthetic side-scan sonar dataset (Gwon et al. 2017), compared with BTS (Lee et al. 2019) and Song (Song et al. 2021)

Fig. 9
figure 9

Qualitative results on the real side-scan sonar dataset, compared with BTS (Lee et al. 2019) and Song (Song et al. 2021)

Figure 7 shows that our network with the attention module accurately predicts global depth, offering superior performance in predicting object boundaries and scene details compared with the network without the attention mechanism. Furthermore, we compare our methods with nine recent networks (as listed in Table 3) using the NYU-depth-v2 test set. The results demonstrate that the proposed depth estimation network achieves the best or second-best performance across six evaluation metrics.

We further validate the depth estimation performance of the proposed method on the side-scan sonar dataset (Fig. 8 and Table 4). As shown in Fig. 8, the comparison clearly demonstrates that our network accurately captures the global scene depth while presenting local details more intricately. The proposed network excels in rendering small objects with precision. In contrast, the depth maps obtained from BTS (Lee et al. 2019) suffer from information loss in edge details, making it challenging to distinguish objects. Song (Song et al. 2021) also exhibits numerous depth estimation errors and missing information. Our depth map closely resembles actual depth, delivering superior performance. As presented in Table 4, the proposed method outperforms SOTA methods.

Figure 9 shows the qualitative evaluation of the proposed depth estimation network on real side-scan sonar images. As demonstrated in the second image, our method reveals finer details, highlighting the pipeline structure. The third and fourth images reveal significant depth information loss and blurry estimations in the BTS (Lee et al. 2019) and Song (Song et al. 2021) methods. In contrast, our method incorporates the attention mechanism (Vaswani et al. 2017) between the encoder and decoder, allowing the network to emphasize vital image features and prioritize local scene details. In the decoding phase, we replace the traditional deconvolution network with local planar guidance layers (Lee et al. 2019), enhancing the estimation of depth map details.

4.5 Effectiveness of the proposed normal-depth fusion algorithm

To evaluate the performance of our fusion method (Section 3.3), we compare it with the standalone SfS method (Section 3.1) and the independent depth estimation network (Section 3.2). Figure 10 and Table 4 show the results on the NYU-depth-v2 dataset, whereas Fig. 11 shows the results on real side-scan sonar images.

Fig. 10
figure 10

Visualized fused results on the NYU-depth-v2 dataset (Silberman et al. 2012)

Table 4 Quantitative comparisons of the fused method on the NYU-depth-v2 dataset (Silberman et al. 2012)
Fig. 11
figure 11

Visualized fused results on the real side-scan sonar images

Figure 10 shows that the proposed SfS method can effectively capture the details but may yield inaccurate depth predictions for objects hidden in shadows and affected by noise. The depth map generated by the depth estimation algorithm accurately depicts global depth trends but lacks fine details. In contrast, the fusion algorithm produces a depth map that combines the correct global depth with rich details from the SfS algorithm. It particularly excels in preserving edge details. Table 4 also demonstrates the improved performance of the proposed fusion algorithm. In summary, our fusion algorithm is superior in 3D reconstruction compared with the single SfS method and depth estimation network.

We also evaluated the proposed fusion algorithm on the real side-scan sonar dataset. To examine its applicability in various side-scan sonar image scenarios, we conducted comparative experiments on the reconstruction of underwater aircraft wreckage, underwater structures, and seabed terrain with distinct characteristics (Fig. 11).

In the underwater aircraft wreckage view, the height map obtained using the SfS algorithm exhibits relatively pronounced fluctuations. In contrast, the fused height map exhibits smoother variations on the seafloor, with reduced disruptions caused by shadows in the image. Additionally, in the underwater structure view, the primary subject in the image appears to be a metal structure, resulting in high-brightness reflection. In this case, the normal map obtained using the SfS method effectively captures the contour details of the object. However, in the reconstructed height map, the edges of the target exhibit excessive jaggedness, which is associated with uneven brightness distribution in the image. The height map obtained using the final fusion algorithm successfully mitigates the excessive noise observed in the SfS algorithm. Moreover, fusion with the depth map results in a smoother and more natural overall height transition. Finally, in the seabed terrain view, the proposed SfS algorithm successfully captures the undulations of the seafloor terrain; however, the resulting 3D terrain appears excessively rough, lacking smoothness. The depth estimation process also encounters issues because it incorrectly estimates the depth in the latter part of the shadow. Although the height map exhibits smoother characteristics during the reconstruction fusion process, it still contains incorrect height estimates, particularly in the upper-left portion. We argue that if the depth estimation process can accurately estimate the depth of the image, the overall reconstruction quality should surpass that of the SfS algorithm.

5 Conclusions

In this study, we presented a novel 3D reconstruction approach for underwater side-scan sonar images. Our method addresses the issues of noise and errors in global information encountered in traditional methods. We first proposed an SfS method that separates high- and low-frequency components in side-scan sonar images using DCT. This technique helps mitigate the problem of noise in traditional SfS methods. We developed a monocular depth estimation network incorporating dilated convolution and an attention mechanism. This network accurately estimates the depth map of sonar images, providing global information that complements the SfS results. Finally, we designed a fusion algorithm to combine the surface normal and depth maps obtained using our methods. The fused 3D height map provides more accurate results with less noise and precise global structural information. The experimental results on various datasets validate the effectiveness of the proposed high-low frequency separation SfS method, depth estimation network, and fusion algorithm.