Detecting large-scale underwater cracks based on remote operated vehicle and graph convolutional neural network

It is of great significance to detect underwater cracks quickly, as they can seriously threaten the safety of underwater structures. Research to date has mainly focused on the detection of above-water-level cracks and has not considered large-scale cracks. In this paper, a large-scale underwater crack examination method is proposed based on image stitching and segmentation. A further purpose of this paper is to design a new convolution method to segment underwater images. An improved As-Projective-As-Possible (APAP) algorithm was designed to extract and stitch keyframes from videos. The graph convolutional neural network (GCN) was used to segment the stitched image. The GCN's m-IOU is 24.02% higher than that of fully convolutional networks (FCN), indicating that GCN has great potential for application in image segmentation and underwater image processing. The results show that the improved APAP algorithm and GCN can adapt to complex underwater environments and perform well in different study areas.


Introduction
In recent decades, with the development of Ocean Engineering, Hydraulic Engineering, Bridge and Tunnel Engineering, etc., a series of new underwater structure models has been formed [1,2]. However, due to the effects of the external environment (such as wind and wave, corrosion, hydraulic flushing, temperature stress, etc.) or human factors (such as design errors or improper selection of materials), underwater structures may have various degrees of damage, which may lead to cracks during long-term service [3]. At present, it is still difficult to simulate the crack propagation and formation mechanism [4]. Many scholars have proposed algorithms to simulate crack propagation, such as NHPD [5] and GCEM [6]. Cracks seriously damage the reliability and longevity of the structure; they can exist not only at the structure's surface but can also extend into the interior [7,8]. Zhang and Zhuang [9,10] proposed a self-propagating strong discontinuity embedded approach with the statically optimal symmetric (SDA-SOS) formulation to study the propagation law of cracks. Their computational examples showed that cracks seriously degrade structure durability. Rezaiee-Pajand and Tavakoli [11] thought cracks are the external manifestations of the accumulation of fatigue in the structure. Therefore, it is essential to monitor structures in time to prevent crack expansion.
Due to the complexity of the underwater environment, only a few technologies have been applied to underwater crack detection, such as electrical exploration, elastic wave testing, radar, etc. [12,13]. For example, Li et al. [1] proposed a high sensitivity rotating alternating current field to measure underwater cracks; Luo et al. [14] detected concrete cracks using a tapered polymer fiber sensor; Shi et al. [15] used sonar images to detect and classify below-water-level cracks in dams. However, these methods have shortcomings in common: shallow measurement depth, inability to fully examine deep-water structures, large positioning errors, low efficiency, and weak adaptability. Moreover, these methods are costly, and neither convenient nor reliable [15,16]. Visual estimation is more efficient and inexpensive for obtaining crack information such as location and shape [17]. With the rapid development of machine vision technology, some crack detection methods based on image processing have been proposed [18]. In the last century, Belytschko et al. [19] began developing and designing the first camera-based road damage detection vehicle GERPHO. Ukai [20,21] proposed an image acquisition system using a multi-eye line array camera to monitor tunnel cracks in 2000 and 2007. Since the start of the 21st century, research on crack detection has been further deepened. Lu et al. [22] proposed a road crack detection algorithm based on adaptive threshold segmentation; Talab et al. [23] used the Sobel filter and Otsu algorithm to detect concrete cracks, which had an accuracy of 85% on their data sets, but the algorithm was sensitive to changes in shooting angle and light; Xiao and Li [24] combined the adaptive Canny operator with seepage theory and proposed a crack detection algorithm. However, these methods were limited to traditional image processing technology. The algorithms were susceptible to environmental factors, had large errors, and had low generalization ability [25].
Deep learning (DL) has the advantages of being little affected by noise, being transferable to different environments, and achieving high accuracy. The rapid development of DL provides different ideas for people to solve problems. For example, Nguyen-Thanh et al. [26] solved potential energy problems in parametric deep energy methods based on physics-informed neural networks (PINN). Nguyen-Thanh et al. [27] also presented a deep energy method for finite deformation hyperelasticity using deep neural networks (DNNs), which could entirely avoid a discretization such as FEM. Guo et al. [28] proposed a deep collocation method (DCM) for thin plate bending problems. Zhuang et al. [29] presented a deep autoencoder based energy method (DAEM) for bending, vibration and buckling analysis of Kirchhoff plates; Guo et al. [30] presented a stochastic deep collocation method (DCM) based on a neural architecture search (NAS) and transfer learning for heterogeneous porous media. DL has been widely used in many fields such as solving partial differential equations in Computational Mechanics [31,32], dam subsidence prediction [33,34], urban traffic monitoring, etc. Many scholars have tried to use DL to detect cracks. Cha et al. [35] used a five-layer convolutional neural network (CNN) to detect concrete cracks and processed gridded images based on sliding window technology. Kim and Cho [36] used a CNN with AlexNet as the backbone for accurate classification of five observable entities such as cracks, plants and concrete. In 2015, Long et al. [37] proposed fully convolutional networks (FCN) by replacing the fully connected layer of CNN with a convolution layer. FCN realizes semantic segmentation in the real sense. Image semantic segmentation based on DL has also been widely used in crack detection. Dung and Anh [38] realized the automatic detection of concrete cracks by deep FCN. Their results show that cracks are reasonably detected, and crack density is also accurately evaluated. Bang et al. [17] introduced an attention model into image semantic segmentation and obtained good results in detecting road cracks. Zhang et al. [39] proposed a neural network with multiple convolution layers and combined context information to detect cracks in structures. The method adopted an end-to-end training approach, and could realize pixel-level processing of images of any size. Zhang et al. [40] proposed a faster, simpler single-stage detector based on YOLOv3 for detecting multiple concrete bridge damages. Liu et al. [41] combined target detection with semantic segmentation, and designed a two-step network. Zhang and Yuen [42] designed a novel crack detection system based on a broad learning system, and their system can be accelerated without GPU during training, which reduces the requirement of the computer configuration.
However, cracks are usually continuous, long-distance, and large-scale. It is difficult to capture a complete crack in the field of view of a single image, either above or below water level [43,44]. Assessment of complete cracks is significant for analyzing the damage degree and true working form of underwater structures. Therefore, it is necessary to determine the complete shape of a crack by stitching multiple images. Image stitching technology mainly consists of two steps: registration and fusion. In the 1980s, Burt and Adelson [45] proposed an image fusion method based on the Laplace Pyramid. The image pyramid and scale transformation laid an important foundation for subsequent research. In 2004, Lowe [46] proposed an image registration method, SIFT, which performs image registration based on the feature vectors of the image's feature points. After Lowe, Bay et al. [47] proposed SURF, which uses integral images to achieve faster image registration. Rublee et al. [48] proposed the ORB algorithm, which has a strong advantage in registration speed. In recent years, image stitching technology has begun to be applied to structural health monitoring (SHM). Zhu et al. [49] stitched images of concrete columns taken at different positions based on a traditional feature-based image stitching technique. Won et al. [50] automatically generated panoramic bridge images using deep matching. Based on the SIFT algorithm, Wang et al. [43] obtained the complete shapes of cracks in a dam. Wu et al. [44] stitched large-scale panoramic cracks using the Oriented FAST and Rotated BRIEF feature matching algorithm.
However, due to the complex underwater optical environment, underwater images have low contrast and much noise. Moreover, due to the multi-interface refraction of light when using a fisheye camera, the collected images are often distorted [51,52], which means that the underwater image is essentially in non-Euclidean space compared to the above-water image. The current image stitching and semantic segmentation algorithms rely on the translation, scale and rotation invariance of image data. In other words, these methods are aimed at data in Euclidean space, and they may not accurately detect cracks from underwater images. The APAP algorithm proposed by Zaragoza et al. [53] is an image fusion algorithm. The APAP algorithm first grids the images, then spatially warps each grid cell using its corresponding local homography matrix, and finally superimposes them on the canvas to complete the image fusion. Because the APAP algorithm uses a local-global fusion strategy, it can eliminate errors caused by image distortion during image stitching. At present, APAP is mainly used to stitch large-scale remote sensing images [54], and it has not been studied for SHM.
Graph convolution neural network (GCN) is a new DL model proposed for data in non-Euclidean space. It was first proposed in 2017 and achieved the detection of high-dimensional data features by constructing nodes and edges. Landrieu and Simonovsky [55] proposed a large-scale point-cloud segmentation algorithm based on superpoint graphs. In 2018, the OCNet was proposed by Yuan and Wang [56], which could apply a non-local operator to segmentation. To date, GCN has seen limited application and research in areas such as image segmentation and underwater image processing, but it has great potential.
Based on the above discussion, our paper's primary work and purpose are to study the large-scale underwater cracks detection method using image stitching and semantic segmentation. Remote Operated Vehicles (ROVs) were used to collect images of cracks in different underwater structures. A dataset of underwater cracks in three areas was created for training, validation, and other GCN processes. Since the data collected by the ROV is transmitted in the form of videos, an improved APAP algorithm was designed, which can automatically extract keyframes from the video and then stitch the images. Then, an image segmentation algorithm based on GCN was designed, which takes the pre-trained Resnet101 as the backbone. In addition, our research also compared the effect of GCN and traditional FCN in segmenting underwater crack images.

Dataset and methods
In the work reported in this paper, an underwater cracks detection method was designed, composed of the proposed image stitching algorithm and GCN algorithm. Our study can be divided into three parts: data acquisition, image stitching and image segmentation. The detailed process of our study is shown in Fig. 1. Firstly, the ROV was used to get underwater videos. Next, image stitching was done directly on videos. The stitched image of different study areas was obtained through the improved APAP algorithm. Then, a dataset of underwater cracks was created by framing videos. These images were stochastically selected, and LabelMe was used to mark crack areas on the image for training, validation and test of GCN and FCN. The datasets were randomly divided into training set, validation set, and test set according to the ratio of 8:1:1. Finally, the stitched images were segmented and the large-scale underwater crack patterns were reconstructed by invoking the final training results.
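The random 8:1:1 division described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_dataset`, the seed, and the placeholder list of 1000 file names are all assumptions, since the size of the final dataset is not stated.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle items and split them into train/val/test at a ratio of 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed only for reproducibility
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the labeled crack images.
names = [f"img_{i:04d}.png" for i in range(1000)]
train, val, test = split_dataset(names)
```

Integer arithmetic is used for the split sizes so that rounding never drops an image; any remainder ends up in the test set.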
The above processes were performed on Matlab2020 and Python 3.9. The software and hardware used for code writing and program running were as follows: OpenCV 2.4.2 was used as the visual library for underwater crack detection. Image stitching was mainly carried out on an HP Star14 X360 platform. PyTorch 3.8 was used as the framework for DL in our study, and LabelMe was used to mark images. GCN and FCN were trained on an NVIDIA Tesla V100 32GB GPU and an NVIDIA GeForce RTX 2080Ti GPU.

Data acquisition
Data were collected from different underwater structures. These structures were a dam in Hubei Province, China, and Tunnel A and Tunnel B, both in Hebei. A Seabotix vLBV300 was mainly used to detect underwater cracks of the dam, and the camera used was a 650 TV line high-definition color camera. Tunnel A and Tunnel B were mainly inspected by Dolphin One, and the camera used was a Shark Marine camera. The detailed parameters of ROVs and cameras are shown in Table 1.
By framing the video, we obtained 4097 + 12701 + 8534 underwater images from the three sites, a total of 25332. After selection, a dataset containing 957 underwater crack images was established. After processing, the final dataset was obtained.

Image stitching
This paper proposes an improved APAP algorithm to stitch underwater videos directly. This algorithm can be divided into three steps in the process: keyframes detection, feature points matching, and image stitching, as shown in Fig. 2.
Keyframe extraction was based on the frames' similarity to ensure that the similarity of adjacent keyframes is not too low. In this study, the pHash algorithm was used to calculate the similarity between frames. With pHash, the Hamming distance between two images could be obtained [57]. The greater the Hamming distance between frames, the smaller their similarity.
Firstly, every frame was resized to the same pixel dimensions, 32 × 32, and converted to a grayscale image (the purpose of which was to reduce the difference between the size and proportion of these images, only retaining the images' basic information). In order to reduce the amount of computation and run the program conveniently on the CPU, the discrete cosine transform (DCT) was used to transform the gray images. The expression of DCT is:
$$F(u) = c(u) \sum_{i=0}^{N-1} f(i) \cos\left[\frac{(2i+1)\pi u}{2N}\right], \qquad c(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases}$$
where $f(i)$ is the original image data, $F(u)$ is the coefficient after DCT transformation, $N$ is the number of points of the original image data, and $c(u)$ is the compensation factor that makes the DCT transformation matrix orthogonal.
Next, based on the result of the DCT calculation, a hash value composed of 64 bits was made. Then, the hash values of two images were compared to calculate the Hamming distance between them. The Hamming distance between frames was used as the criterion for extracting keyframes in our study. The process of extracting keyframes is shown in Fig. 3: frame n − 1 was set as image A and frame n as image B, and the Hamming distance between image A and image B was calculated. Image B (frame n) was considered a keyframe when the distance between B and A was larger than a manually set threshold. Then image B became image A and the process was repeated. In this way, all keyframes were extracted from the videos.
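The keyframe procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: frames are assumed to be already resized to 32 × 32 grayscale arrays, the 64 bits are taken from the 8 × 8 low-frequency DCT block thresholded at its median (one common pHash variant), and the names `dct_matrix`, `phash`, `hamming` and `select_keyframes` are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix (rows are frequencies u, columns are samples i)."""
    u = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos((2 * i + 1) * u * np.pi / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # compensation factor c(0)
    return m


def phash(gray32):
    """64-bit perceptual hash of a 32x32 grayscale image (boolean array)."""
    d = dct_matrix(32)
    coeffs = d @ gray32.astype(float) @ d.T  # separable 2-D DCT
    low = coeffs[:8, :8]                     # keep low frequencies only
    return (low > np.median(low)).flatten()


def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(h1 != h2))


def select_keyframes(frames, threshold):
    """Indices n for which the distance between frame n-1 and frame n exceeds threshold."""
    keyframes = []
    prev = phash(frames[0])
    for n in range(1, len(frames)):
        cur = phash(frames[n])
        if hamming(prev, cur) > threshold:
            keyframes.append(n)
        prev = cur  # image B becomes image A
    return keyframes
```

A larger `threshold` keeps fewer frames; whether the first frame is always retained is not specified in the text, so it is left out here.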
For feature point matching, the method used in this work was SIFT. There are three main steps for the SIFT algorithm to achieve feature matching [46]: extracting prominent feature points in the two images; describing these feature points (for example, their location, direction, and number); and matching them, as shown in Fig. 4. The SIFT algorithm mainly searches for feature points in scale space. First, the keyframe $K(x, y)$ is convolved with a Gaussian kernel function to obtain its projection at different scales; the formula is:
$$L(x, y, \sigma) = G(x, y, \sigma) * K(x, y), \qquad G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} e^{-(x^{2}+y^{2})/(2\sigma^{2})}$$
where $G(x, y, \sigma)$ is a scale-varying Gaussian kernel function; $(x, y)$ is the spatial coordinate of a pixel; $\sigma$ is the scale coordinate that determines the image's smoothness: the overview characteristics of the image correspond to a large scale, and the detail characteristics correspond to a small scale. Large and small $\sigma$ values correspond to coarse scales (low resolution) and fine scales (high resolution), respectively. Combining the original image with the projections gives an image pyramid and a difference of Gaussian (DOG) scale space; the DOG function is:
$$D(x, y, \sigma) = \left(G(x, y, k\sigma) - G(x, y, \sigma)\right) * K(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$$
where $k$ is the constant factor between adjacent scales. Feature points are found on the DOG and described on the image pyramid. SIFT considers that the feature points are essentially the extreme points of the DOG function, which means feature points are composed of local extreme points in the DOG. In SIFT, extreme points of the DOG function are those points that are larger or smaller than all of their neighboring points in the image and scale domains.
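As a rough illustration of the scale-space step, the sketch below builds one difference-of-Gaussian layer with NumPy only. It is a simplified sketch, not SIFT itself: a single pair of scales is used, the ratio `k` between them and the 3σ kernel truncation are assumptions, and the function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3*sigma."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(img, sigma):
    """Separable Gaussian smoothing with edge padding (output has the input shape)."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img.astype(float), r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)


def difference_of_gaussian(img, sigma, k=np.sqrt(2)):
    """One DOG layer: the image smoothed at k*sigma minus the image smoothed at sigma."""
    return gaussian_blur(img, k * sigma) - gaussian_blur(img, sigma)
```

On a flat image the DOG response is zero everywhere, which is why only structured regions such as crack edges produce candidate extreme points.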

After the extreme points were detected, local characteristics of the image were used to assign a baseline direction to each key point. The gradient and direction distribution characteristics of the neighborhood pixels were counted, with the feature point as the origin and $3\sigma$ as the radius, to determine the gradient and direction of the feature point. The formulas for calculating the pixel gradient and direction are:
$$m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^{2} + \left(L(x, y+1) - L(x, y-1)\right)^{2}}$$
$$\theta(x, y) = \arctan\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}$$
where $L$ is the Gaussian-smoothed image at the scale of the feature point. Finally, a descriptor was created for each feature point: a set of vectors was used to describe the feature point, and a set of descriptors covering all feature points was created. Based on this set of descriptors, the feature point in image A which was nearest to the feature point in image B was searched and matched.
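The gradient magnitude and direction used for orientation assignment are central finite differences on the smoothed image. The sketch below evaluates them for all interior pixels; it uses `np.arctan2` rather than a plain arctangent so that the quadrant is resolved, which is an implementation choice and not something stated in the text. The function name is hypothetical.

```python
import numpy as np


def gradient_magnitude_orientation(L):
    """Central-difference gradient magnitude and direction for interior pixels.

    L is indexed as L[y, x]; the outputs are two pixels smaller in each dimension.
    """
    L = np.asarray(L, dtype=float)
    dx = L[1:-1, 2:] - L[1:-1, :-2]   # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]   # L(x, y+1) - L(x, y-1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    orientation = np.arctan2(dy, dx)  # radians in (-pi, pi]
    return magnitude, orientation
```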
Since the study focused on the crack area, too many matching points would affect the stitching result in the crack area. Therefore, random sample consensus (RANSAC) was used to remove matching points that may have affected the final stitching result. The core idea of RANSAC is that, for a fitting problem, there are two kinds of data points: those that degrade the fit (outliers) and those that are consistent with the fitted function (inliers). RANSAC aims to find and eliminate the outliers through repeated random sampling.
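The sampling loop behind RANSAC can be shown with a deliberately simplified model. The sketch below fits a pure 2-D translation between matched points, which needs only one match per sample; the stitching pipeline described here estimates a homography instead, so the model, the tolerance `tol`, the iteration count and the function name are all assumptions made for illustration.

```python
import numpy as np


def ransac_translation(src, dst, tol=2.0, iters=200, seed=0):
    """Estimate a translation dst ~ src + t and flag the matches that agree with it."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))            # minimal sample: a single match
        t = dst[i] - src[i]
        err = np.linalg.norm(src + t - dst, axis=1)
        inliers = err < tol
        if inliers.sum() > best.sum():        # keep the largest consensus set
            best = inliers
    t = (dst[best] - src[best]).mean(axis=0)  # refit on the inliers only
    return t, best
```

The same structure carries over to homography estimation by sampling four matches per iteration and measuring reprojection error.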
Direct linear transform (DLT) was used to estimate the perspective transformation matrix for the remaining feature points, and a global homography matrix was obtained. The calculation method of homography matrix is: where H is the homography matrix; and are the camera model matrix of the two pictures to be stitched. Then by dividing the images to be stitched into grids and taking the center points of each grid, the distance and weights between the interior points on the source map and center points could be determined.
Putting the weightings into the A matrix of the DLT algorithm and building a new W*A matrix, the local homography matrix of the current grid could be naturally obtained. Then, the stitched image was obtained by traversing each grid and mapping it to the panoramic canvas using the local homography matrix.
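The weighted DLT step can be sketched as below. The weight formula, a Gaussian of the distance to the grid center floored at a small constant, follows the APAP formulation of Zaragoza et al. [53]; the values of `sigma` and `gamma`, the function names, and the absence of coordinate normalization are simplifying assumptions of this sketch.

```python
import numpy as np


def dlt_rows(p, q):
    """Two DLT rows for one match p=(x, y) -> q=(u, v)."""
    x, y = p
    u, v = q
    return np.array([
        [-x, -y, -1, 0, 0, 0, u * x, u * y, u],
        [0, 0, 0, -x, -y, -1, v * x, v * y, v],
    ], dtype=float)


def local_homography(src, dst, center, sigma=50.0, gamma=0.01):
    """Homography for one grid cell from distance-weighted matches."""
    A = np.vstack([dlt_rows(p, q) for p, q in zip(src, dst)])
    d2 = np.sum((src - center) ** 2, axis=1)
    w = np.maximum(np.exp(-d2 / sigma ** 2), gamma)  # nearby matches count more
    W = np.repeat(w, 2)                              # one weight per DLT row
    _, _, vt = np.linalg.svd(W[:, None] * A)         # minimize ||W A h||, ||h|| = 1
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]                               # assumes H[2, 2] is nonzero
```

Calling `local_homography` once per grid center yields the set of local matrices used to warp each cell; with all weights equal it reduces to the ordinary global DLT estimate.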
In practice, the video only needed to be imported directly into the program. Our algorithm first divided the video into frames, and then compared the distances between frames. Once the threshold was set, the program output all keyframes that met the requirements, as shown in Fig. 3. In the end, these keyframes were stitched based on SIFT.
FCN, proposed by Long et al. [37] in 2015, is the first image semantic segmentation system based on DL. FCN can accept any input size and produce correspondingly sized output through efficient inference and learning. The network structure of FCN is divided into two parts: full convolution and deconvolution. The full convolution part replaces the last fully connected layer of the CNN with convolution to extract features and form a heatmap. The purpose of the deconvolution part is to upsample the heatmap so that the output is consistent with the original size. In this work, the backbone of FCN was replaced with Resnet101 and the attention mechanism was inserted in Resnet101 to ensure a fair comparison with GCN.

Graph convolutional network
This study used a new semantic segmentation algorithm, GCN. Moving on from traditional semantic segmentation algorithms such as FCN, U-Net, and DeepLab, a new convolution method, graph convolution, was used in GCN, enabling the model to learn deeper information about the data [58]. In addition, the attention mechanism was inserted into the backbone so that the GCN in this work could access more crack area information during training. The GCN of this paper [58] can be divided into two parts: the backbone part and the graph convolution part, as shown in Fig. 5.
The backbone used in this paper is Resnet101, which mainly consists of 33 convolution blocks, two pooling layers, and one fully connected layer. Each convolution block of Resnet101 contains three convolution kernels connected by residual connections to ensure that no network degradation or loss of information occurs during training. To connect with the graph convolution part, the fully connected layer was removed so that the output of Resnet101 is a dense feature map with 2048 channels. Also, considering the small proportion, by size, of crack areas in images, the attention mechanism was inserted into each convolution block, as shown in Fig. 6. By inserting the attention mechanism, the feature map information could mainly focus on the crack area. This helped the graph convolution part to learn the key information better.
The output of the backbone is a feature map with multiple channels. This study considers that not only the pixel of the feature map has correlation, but also those different channels have correlation. Therefore, the graph convolution part had two branches, which convoluted the output of backbone from channel and feature. In the channel branch, 2048 channels of the feature map were convoluted to determine which channels were important and which were unimportant, by two 1 × 1 graph convolution kernels. After further aggregation and compression, the weightings of these channels were obtained. In the feature branch, the pixel of the feature map was convoluted by three 1 × 1 graph convolution kernels, to obtain the coordinates and correlation information of pixels in non-Euclidean space. Finally, the result of two branches was aggregated with the output of backbone to get the segment result of underwater crack.
Due to the underwater environment, the image data collected was often distorted and did not have translation invariance. Therefore, this paper hypothesized that underwater image data is better described as non-Euclidean data. So, a new convolution method, graph convolution, was adopted in the graph convolution part. Graph convolution is a special kind of convolution that can handle data in non-Euclidean space and extract deeper data features. G = (V, E, W) was used to represent the data [59], where V is the node set, E is the edge set, and W is the weighted adjacency matrix. The nodes correspond to the pixels of the images, which record the color, brightness and other information of the object. In contrast, the edges correspond to the relationships between pixels and record the shape and texture of the object. For normal images, the nodes and edges are arranged regularly, so a convolution kernel can slide over the data and learn its deep information. However, as shown in Fig. 7, for non-Euclidean data, if the convolution is performed in the same way, a lot of information is missed.
So, this study builds a graph convolution method based on Fourier transformation. Graph convolution uses Fourier transformation and the Laplace matrix to transform non-Euclidean data into the frequency domain, obtain the graph's spectrum, and convolve the spectrum of the graph, as shown in Fig. 8. Furthermore, the graph Laplacian $L$ could be diagonalized as:
$$L = U \Lambda U^{T}$$
where $U$ is the complete set of orthonormal eigenvectors; $\Lambda$ is the diagonal matrix of the nonnegative eigenvalues of $L$. Then, the data could be transformed into the spectral domain by Fourier transformation:
$$\hat{x} = U^{T} x \tag{10}$$
where $x$ is our data, and $\hat{x}$ is the projection of our data in the spectral domain. Correspondingly, the process of converting data from the spectral domain back to the graph could be represented as:
$$x = U \hat{x} \tag{11}$$
According to the convolution theorem, graph convolution could be written as:
$$g_{\theta} * x = U g_{\theta}(\Lambda) U^{T} x$$
where $g_{\theta}(\Lambda)$ is the convolution filter in the spectral domain.
Fig. 5 The overall structure of GCN model.
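The spectral pipeline above (eigendecomposition, forward transform, filtering, inverse transform) can be written directly in NumPy for a small graph. This sketch assumes the combinatorial Laplacian L = D − W, which the text does not specify, and takes the spectral filter as an arbitrary function of the eigenvalues; the function names are hypothetical.

```python
import numpy as np


def graph_laplacian(W):
    """Combinatorial Laplacian L = D - W of a symmetric weighted adjacency matrix."""
    return np.diag(W.sum(axis=1)) - W


def spectral_conv(W, x, g):
    """Filter the node signal x with a spectral filter g(eigenvalues)."""
    lam, U = np.linalg.eigh(graph_laplacian(W))  # L = U diag(lam) U^T
    x_hat = U.T @ x                              # graph Fourier transform
    y_hat = g(lam) * x_hat                       # pointwise filtering of the spectrum
    return U @ y_hat                             # inverse transform back to the nodes


# A path graph with four nodes and a signal defined on its nodes.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 0.0, -1.0])
y = spectral_conv(W, x, lambda lam: np.exp(-lam))  # a smooth low-pass filter
```

With the identity filter the signal is returned unchanged, and with g(λ) = λ the result equals L x, which is a convenient sanity check on the transform pair.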
We implemented GCN using PyTorch. The hyperparameter information of our program is shown in Table 2. In our program, we adopted a polynomial learning rate decay schedule where the initial learning rate was multiplied by $\left(1 - \frac{iter}{total\_iter}\right)^{0.9}$. The loss function used in this paper was the Cross Entropy Loss Function, and the activation function was ReLU. We also used synchronized
Fig. 7 The gray node represents the pixel of the image; the black node represents the missed node during convolution.
Fig. 8 The graph data is transformed into frequency domain based on Fourier transformation.
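The polynomial decay schedule is a one-line function of the iteration counter. In the sketch below the base learning rate of 0.01 and the total of 100 iterations are placeholders, not the values from Table 2, and the function name is hypothetical.

```python
def poly_lr(base_lr, iteration, total_iter, power=0.9):
    """Learning rate after polynomial decay: base_lr * (1 - iter / total_iter) ** power."""
    return base_lr * (1 - iteration / total_iter) ** power


# Placeholder values for illustration only.
schedule = [poly_lr(0.01, it, 100) for it in range(0, 101, 25)]
```

The rate starts at the base value and falls monotonically to zero at the final iteration.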
where TP represents the number of correct pixels extracted when calling the trained model to extract the crack area; FP represents the number of error pixels extracted; FN represents the number of pixels in the crack area that have been misjudged. F1 combines the indicators of Precision and Recall, representing the model's balance value under the constraints of recall and precision. It is often used to compare the actual application of models. F1 could reflect the overfitting phenomenon of the model. The calculation formula is:
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{14}$$
Accuracy indicates the proportion of all pixels that are correctly classified. The calculation formula is:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$
where TN represents the number of non-crack pixels that are correctly classified as non-crack.
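The indices can be computed from a predicted mask and a ground-truth mask as sketched below. Precision, recall, F1 and accuracy follow the usual confusion-matrix definitions; m-IOU is taken here as the mean of the crack and background IoU, which is the standard two-class definition and an assumption about the paper's exact formula. The function name is hypothetical.

```python
import numpy as np


def crack_metrics(pred, truth):
    """Pixel-wise metrics for binary masks (True or 1 marks a crack pixel)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    iou_crack = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    iou_background = tn / (tn + fp + fn) if tn + fp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "miou": (iou_crack + iou_background) / 2,
    }
```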

Image stitching results
Based on the improved APAP algorithm, the frames of the collected videos were stitched. In total, 522 keyframes were extracted from three videos, of which the 66 keyframes containing underwater cracks were stitched.
Through the SIFT algorithm, feature points of each image were extracted and matched roughly. The SIFT algorithm could extract many feature points, but most of these feature points were useless. Therefore, only a few fine matching point pairs were retained after RANSAC. As shown in Table 3, Tunnel A had the highest number of well-matching points and the longest stitching time. The Dam (a) area had the fewest well-matching points and the shortest time. But this does not mean that the stitching time was determined by the number of well-matching points. Tunnel B had only 625 well-matching points, but its stitching time was longer than that of Dam (b). We think that this was probably because the total number of pixels in Tunnel B was greater than that for Dam (b).
Based on the exact matching result, the local homography matrix was used for image fusion. By iteratively fusing these images, the final stitching result was obtained after adjustment, as shown in Fig. 9.

Image semantic segmentation results
Unlike our improved APAP algorithm, the training of GCN and FCN was carried out directly on frames. By framing these videos, a total of 4097 + 12701 + 8534 = 25332 underwater images were obtained. After selection, 957 images containing underwater cracks were collected. After cropping, de-ghosting, and secondary selection, a dataset for GCN and FCN training, validation, and testing was obtained. After training, the trained GCN and FCN models were used to segment the stitched underwater image, and the results are shown in Fig. 10. The loss curves of GCN are shown in Fig. 11. The result indicates that GCN could accurately segment the underwater crack in images and was not affected by noise such as water, lighting conditions, aquatic plants, shadows, floating dust, etc. GCN detected most of the underwater cracks and segmented the actual crack pixels as much as possible. Compared with the segmentation result of FCN, the segmentation result of GCN was finer, and GCN performed better on slim cracks. The segmentation result of FCN was coarser, less sensitive to slim cracks, and susceptible to the underwater environment.
Fig. 9 The stitching result of large-scale underwater cracks in different areas.
Our study also calculated the proportion of the crack area in whole images, and the proportion of the crack area extracted by GCN and FCN in images, as shown in Table 4. By comparison, it can be seen that the result of GCN was closest to the actual value, and FCN was larger.

Image semantic segmentation evaluation
In order to more accurately evaluate the effect of GCN and FCN, three indices on the test set were compared, as shown in Table 5.
It can be seen that compared with FCN, GCN offered significant improvement. The m-IOU value of GCN was 25.02% higher than that of FCN, and the F1 value was 15.71% higher than that of FCN. GCN showed better generalization ability and practical application effect than FCN. However, their accuracies were not much different: both were more than 90%. This was probably due to class imbalance: Table 4 shows that the proportion of crack area in the whole image was tiny. That is, the proportion of non-crack areas in the image was large. When calculating accuracy, both TP and TN are included. Because the non-crack area is large, the final TN value is also very high and far higher than TP, FP and FN. So, the accuracy tends to be 1, which indicates that accuracy is not an appropriate criterion for the scenario used in this paper.
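A small worked example makes the effect concrete. The numbers below are hypothetical and are not taken from Table 4 or Table 5: an image of 10,000 pixels with 100 crack pixels (1%), a model A that predicts no crack at all, and a model B that finds most of the crack.

```latex
\begin{align*}
\text{Model A: } & TP = 0,\; FP = 0,\; FN = 100,\; TN = 9900 \\
& Accuracy = \frac{0 + 9900}{10000} = 0.990, \qquad
  IoU_{crack} = \frac{0}{0 + 0 + 100} = 0 \\
\text{Model B: } & TP = 80,\; FP = 40,\; FN = 20,\; TN = 9860 \\
& Accuracy = \frac{80 + 9860}{10000} = 0.994, \qquad
  IoU_{crack} = \frac{80}{80 + 40 + 20} \approx 0.571
\end{align*}
```

The two accuracies differ by only 0.004 although model A misses the crack entirely, while the crack IoU separates the two models clearly.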

Threshold selection in improved APAP algorithm
Since the difference (distance) between two adjacent frames in the video is usually small (especially for ROV, which moves slowly in water) [60], if images are stitched directly frame by frame, it will not only increase the running burden of the computer, but also affect the final stitching result and decrease the efficiency.
Moreover, the ROV's navigation is not uniform and straight because of human operations and complex underwater environments. Therefore, the area scanned by the camera is different in different time periods. So, extracting keyframes should not be based on the video timing but on the severity of scene changes in the video; the more dramatic the scene changes, the larger area scanned by the camera, the more keyframes should be extracted. In our opinion, when the similarity between two adjacent frames is small (the distance between two adjacent frames is large), it means that the scene changes violently in this time period (there are more keyframes in this period).
As shown in Fig. 3, the improved APAP algorithm relies on a threshold when extracting keyframes. This threshold is set manually, and it differs between videos. Our study compared the effect of different thresholds on keyframe extraction results. Figure 12(a) shows that the number of keyframes extracted from the video sequence decreased as the threshold increased. No keyframes were extracted from the video once the threshold exceeded a certain value.
The ratio of keyframes number extracted and the total frames number in video could be identified as the extraction rate. From Fig. 12(b), it can be seen that with the increase of threshold, the extraction rate decreased, and the threshold was different in different regions. The lower the extraction rate, the fewer keyframes extracted, and the fewer images to be stitched, the faster the algorithm. However, this also meant that the distance between two adjacent keyframes was greater.
As shown in Fig. 13, as the distance between images increased, the number of matching points pairs reduced, and the final stitching result could be gradually distorted. In summary, the selection of thresholds was neither too large nor too small. The threshold needed to be set according to the video quality and the stitching requirements.
Because the underwater environment is very complex, acquired underwater crack images are affected by many factors, which makes surveying underwater cracks very difficult. If a graph is used to represent an image, the nodes correspond to the pixels of the image and record the color, brightness, etc. of the underwater structure, while the edges correspond to the relationships between pixels and record the shape, location, etc. of the underwater structure. Therefore, the underwater environment affects the data in two main ways: through the nodes and through the edges.
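A minimal sketch of this node and edge view of an image is given below. It assumes a 4-neighbour connectivity between pixels, which is only one possible way to define the edges; the function name `image_to_graph` and the toy pixel values are hypothetical.

```python
def image_to_graph(image):
    """Build a 4-neighbour pixel graph from a 2-D list of pixel values.

    Returns (nodes, edges): nodes maps a node id to its pixel value,
    edges is a list of undirected (id_a, id_b) pairs.
    """
    height = len(image)
    width = len(image[0])
    nodes = {}
    edges = []
    for r in range(height):
        for c in range(width):
            node_id = r * width + c
            nodes[node_id] = image[r][c]  # node attribute: pixel intensity
            if c + 1 < width:             # edge to the right neighbour
                edges.append((node_id, node_id + 1))
            if r + 1 < height:            # edge to the bottom neighbour
                edges.append((node_id, node_id + width))
    return nodes, edges


# Hypothetical 2x3 grey image with one bright pixel.
image = [[10, 12, 11],
         [9, 200, 10]]
nodes, edges = image_to_graph(image)
```

Degradation of pixel values changes the node attributes, whereas geometric distortion changes which pixels should be regarded as neighbours, that is, the edges.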
The impact on nodes is mainly that the image does not reflect the true color of the underwater structure. Many studies [52,61,62] have shown that scattering, refraction, and absorption are unavoidable when light travels in water, as shown in Fig. 14. There is a lot of suspended matter in the water, and light is scattered by these impurities. At the same time, underwater cameras often have water shields, and the medium between the lens and the imaging point is air, so light is refracted when passing through the lens. Because water molecules strongly resonate with photons in the infrared, yellow, and ultraviolet bands, there is a strong spectral effect when light is transmitted in water: the energy in the yellow, ultraviolet, and infrared bands is largely absorbed by water. Therefore, as shown in Fig. 15, underwater structures are mainly imaged in the green band; the brightness of underwater images in the green band is higher than that in the red and blue bands, and the color information of the images is incomplete.
Fig. 13 The influence of the distance between images to be stitched on the stitching result.
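One simple way to check this band imbalance on an image is to compare the mean brightness of each color channel. The sketch below is illustrative only; the function names and the pixel values are hypothetical and are not taken from Fig. 15.

```python
def channel_means(pixels):
    """Mean red, green, and blue values of a list of (r, g, b) pixels."""
    sums = [0, 0, 0]
    count = 0
    for pixel in pixels:
        for i in range(3):
            sums[i] += pixel[i]
        count += 1
    return tuple(s / count for s in sums)


def dominant_channel(pixels):
    """Name of the channel with the highest mean brightness."""
    means = channel_means(pixels)
    return ("red", "green", "blue")[means.index(max(means))]


# Hypothetical underwater pixels with attenuated red values.
pixels = [(20, 90, 60), (30, 110, 70), (10, 100, 50)]
means = channel_means(pixels)
dominant = dominant_channel(pixels)
```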
The impact on edges is mainly manifested as image distortion. In the underwater environment, images are often distorted because of the use of wide-angle or fisheye lenses and the refraction of light across multiple media. Our study tested the distortion of underwater images in Tunnel A, Tunnel B, and the dam. Figure 16 shows the degree of distortion of any point on the image relative to the image center. Moreover, the SMIA TV distortion of these underwater images was calculated: -5.17% for Tunnel A, -2.3% for Tunnel B, and -29.5% for the dam. Notably, Tunnel A and Tunnel B were monitored by the same camera, but the distortion in the two environments was different. These observations indicate that the distortion of the underwater images was mainly barrel distortion, and that different underwater environments had different effects on image distortion. Another factor is that the surface of some underwater structures is curved rather than flat. Images are flat, which means that curved surfaces are compressed and distorted during imaging, as shown in Fig. 17.
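The two quantities discussed here can be sketched as follows. The radial model with a single coefficient `k1` is an assumption chosen for illustration (the paper does not state a distortion model), and `smia_tv_distortion` follows the commonly cited SMIA definition, 100 (A - B) / B with A the mean height of the distorted frame at its left and right edges and B its height through the center; whether the authors used exactly this measurement procedure is not confirmed by the text.

```python
def radial_distort(r_u, k1):
    """Single-coefficient radial model: r_d = r_u * (1 + k1 * r_u**2).

    r_u is the undistorted radius, r_d the distorted radius.
    A negative k1 pulls points toward the center (barrel distortion).
    """
    return r_u * (1.0 + k1 * r_u ** 2)


def smia_tv_distortion(a1, a2, b):
    """SMIA TV distortion in percent.

    a1, a2: heights of the distorted frame at the left and right edges.
    b: height of the distorted frame through the image center.
    Negative values indicate barrel distortion.
    """
    a = (a1 + a2) / 2.0
    return 100.0 * (a - b) / b


# Hypothetical values: normalized radius 1.0 with k1 = -0.1.
r_u = 1.0
r_d = radial_distort(r_u, k1=-0.1)
delta_r = r_d - r_u                            # difference between distorted and undistorted radius
tv = smia_tv_distortion(95.0, 95.0, 100.0)     # -5.0 percent
```

With a negative coefficient the distorted radius is smaller than the undistorted one, which is the barrel-type behaviour indicated by the negative SMIA TV values reported above.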
Generally speaking, image creation involves a structure-to-plane projection. When collecting underwater images with an ROV, many factors affect the result, such as refraction, scattering, type of lens, suspended solids, etc.
In effect, underwater structures are projected onto a distorted plane. Although the data is still an image, the correlation between pixels has been distorted, as shown in Fig. 18, so it is more appropriate to describe such images as non-Euclidean data. This may explain why GCN converged more easily than FCN during training and obtained higher m-IOU and F1 values.
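For readers unfamiliar with graph convolution, the sketch below shows one layer of the widely used propagation rule ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where A is the adjacency matrix, I adds self-loops, D is the degree matrix, H holds the node features, and W the weights. This is an assumption for illustration and is not the dual channel graph convolution module used in this paper; the graph, features, and weights are toy values.

```python
import math


def normalized_adjacency(edges, n):
    """Return D^(-1/2) (A + I) D^(-1/2) as a dense n x n list of lists."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_hat, features, weights):
    """One graph convolution: ReLU(a_hat @ features @ weights)."""
    out = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


# Toy path graph 0-1-2 with one feature per node (hypothetical values).
edges = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(edges, 3)
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]
out = gcn_layer(a_hat, features, weights)
```

Each node aggregates features only from the nodes it is connected to, so the neighbourhood is defined by the graph rather than by a fixed square window as in ordinary convolution.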
In fact, networks other than Resnet101 [63,64] can also serve as the backbone of FCNs and GCNs. As shown in Table 6, this study compared the performance of GCN and FCN with different backbones, and the results show that Resnet101 is indeed more effective than the other networks.
The segmentation results of FCN and GCN at different water depths were also compared, as shown in Fig. 19. As the water depth increased, FCN was increasingly affected by the surrounding environment and its error grew; many non-crack areas were classified as crack areas. GCN still maintained good segmentation results with fewer errors. This indicates that GCN is less affected by changes in water depth and has good stability.
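The m-IOU and F1 values used in such comparisons can be computed from pixel-wise labels as sketched below. The sketch assumes binary labels (0 for background, 1 for crack) stored as flat lists; the label values and function names are hypothetical, and the example masks are toy data rather than results from this study.

```python
def confusion_counts(pred, truth, cls):
    """True positives, false positives, and false negatives for one class."""
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
    return tp, fp, fn


def iou(pred, truth, cls):
    """Intersection over union for one class: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth, cls)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def mean_iou(pred, truth, classes=(0, 1)):
    """Mean of the per-class IoU values."""
    return sum(iou(pred, truth, c) for c in classes) / len(classes)


def f1_score(pred, truth, cls=1):
    """F1 for one class: 2 TP / (2 TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth, cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy flattened masks: one false crack pixel and one missed crack pixel.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 0]
miou = mean_iou(pred, truth)
f1 = f1_score(pred, truth)
```

Non-crack pixels classified as crack count as false positives for the crack class, so the type of error observed for FCN at greater depths lowers both the crack IoU and the F1 value.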

The order of segmentation and stitching
According to some studies [43,54], stitching the processed images can effectively reduce the stitching time. This paper also compared the order of segmentation and stitching when extracting large-scale cracks, as shown in Fig. 20.
It can be seen that the final result of segmentation-first was similar to that of stitching-first. Moreover, stitching the segmented images reduced the stitching time. However, in this paper, stitching and segmentation are two steps of one process, so the total time consumption must be analyzed. As shown in Table 7, although segmentation-first reduced the stitching time, it increased the segmentation time. Overall, stitching-first saved time. The total time can be expressed as:

$$T_{total} = T_{seg} + T_{stitch}$$

where $T_{total}$ represents the total time; $T_{seg}$ represents the segmentation time; $T_{stitch}$ represents the stitching time. Further, $T_{seg}$ is determined by the model and the pixel number: the larger the model and the more pixels to be processed, the longer $T_{seg}$ is. $T_{stitch}$ is determined by the pixel number and the feature point number.

Fig. 16 Image distortion in different areas. Note: $r_d$ represents the distance from a point on the image to the image center, also called the image distortion radius; $r_u$ represents the distance from a point on the undistorted image to the image center, also called the image undistorted radius; $\Delta r$ represents the difference between the distortion radius and the undistorted radius from a point to the center of the image.
Although segmentation-first reduced the number of feature points processed during stitching, it also increased the number of pixels to be segmented. Because the GCN model is large, the segmentation time became longer and the total time increased. Stitching-first reduced the number of pixels to be segmented, so the total time was shorter. Therefore, the relationship between the processing order and $T_{total}$ is not fixed; it needs to be determined according to the model size and the number of image pixels.
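In practice, the order can be chosen by measuring both components for each order and comparing the totals. The sketch below assumes that the total time is the sum of the segmentation and stitching times; the timings are hypothetical placeholders and are not the values in Table 7.

```python
def total_time(t_seg, t_stitch):
    """Total time as the sum of segmentation time and stitching time."""
    return t_seg + t_stitch


def faster_order(timings):
    """Return the order with the smallest total time.

    timings maps an order name to a (t_seg, t_stitch) pair in seconds.
    """
    return min(timings, key=lambda order: total_time(*timings[order]))


# Hypothetical measurements in seconds (not from Table 7).
timings = {
    "stitching-first": (12.0, 30.0),     # fewer pixels to segment, slower stitching
    "segmentation-first": (40.0, 18.0),  # faster stitching, more pixels to segment
}
best = faster_order(timings)
```

With a smaller segmentation model or fewer pixels, the segmentation term shrinks and the comparison could go the other way, which is why the order has to be evaluated case by case.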

Conclusions
Most underwater cracks are large-scale, but an underwater camera has a small field of view and cannot capture the complete shape of such cracks. This paper therefore presents a large-scale underwater crack detection method based on image stitching and semantic image segmentation.
This paper proposes an improved APAP algorithm, which can directly extract keyframes from the video for image stitching. The experimental results show that: the improved APAP algorithm can adapt to different underwater environments; the number of keyframes extracted is far smaller than the total number of video frames, which greatly simplifies the data; APAP can extract a large number of feature points from complex underwater pictures; the use of the RANSAC algorithm can reduce useless matching points; and there are no obvious seams or ghosting in the stitching result.
Based on previous studies, this study considers that: due to the complexity of the underwater environment, the irregularity of the underwater structure, the scattering and absorption of light by water, the presence of suspended matter, the refraction of light, the use of fisheye cameras, and so on, the underwater images are essentially distorted and the relationship between pixels is irregular. Therefore, it is more appropriate to describe and process underwater images using non-Euclidean data.
For semantic image segmentation, the use of GCN to segment underwater cracks is proposed in this study. By inserting the attention mechanism into the Resnet101, the backbone part could retain more crack information during training. By inserting the dual channel graph convolution module, GCN could process the non-Euclidean data and extract the high-dimensional features of underwater