1 Introduction

The ocean acts as a heat sink and is vital to the Earth’s climate system, regulating the global climate through the exchange of energy and water with the atmosphere. As a vast heat reservoir, the ocean absorbs most of the heat from global warming and is sensitive to global climate change. The global ocean holds over 90% of the Earth’s excess heat arising from the Earth’s Energy Imbalance (EEI), leading to substantial ocean warming in recent decades [24, 36]. Subsurface temperature and salinity are basic and essential dynamic environmental variables for understanding the global ocean’s role in recent global warming caused by greenhouse gas emissions. Moreover, many significant dynamic processes and phenomena occur beneath the ocean’s surface, and the ocean’s interior hosts many multiscale and complicated 3D dynamic processes. To fully comprehend these processes, it is necessary to accurately estimate the thermohaline structure of the global ocean’s interior [43].

The ocean has warmed dramatically as a result of heat absorption and sequestration during recent global warming, and its heat content has risen rapidly in recent decades [3, 14]. The global upper ocean warmed significantly from 1993 to 2008 [6], while the rate of heat uptake in the intermediate ocean below 300 m has increased markedly in recent years [2]. In other words, warming of the ocean above 300 m has slowed, whereas warming below 300 m has accelerated. The ocean system is taking up heat at an accelerating rate, leading to significant and unprecedented increases in heat content and worldwide ocean warming, particularly in the subsurface and deeper ocean. As a result, the global ocean heat content has hit record highs in recent years [14, 15]. In addition, ocean salinity, another key dynamic variable, is crucial for investigations of ocean variability and warming. A salinity mechanism has been proposed to explain how heat from upper-ocean warming is transferred to the subsurface and deeper ocean [12], highlighting the importance of the salinity distribution in heat redistribution and the ocean warming process. Furthermore, the global hydrological cycle is modulated by ocean salinity [4], and thermosteric and halosteric expansion, which contributes significantly to sea-level rise, is also linked to ocean temperature and salinity [9]. Therefore, to improve understanding of dynamic processes and climate variability in the subsurface and deeper ocean, deriving and predicting the subsurface thermohaline structure is critical [31].

Due to the sparse and uneven sampling of float observations and the lack of long time-series data in the ocean, large uncertainties remain in estimates of ocean heat content and analyses of the ocean warming process [13, 42]. In the era of ship-based measurement, large areas of the global ocean, especially the Southern Ocean, lacked in-situ observations. Data obtained by traditional ship-based methods not only have limited coverage but also cannot provide spatiotemporally uniform measurements, hindering multiscale studies of ocean processes. Since 2004, the Argo observation network has achieved synchronous spatiotemporal observation of the upper 2000 m of the global ocean [39, 51]. However, the number of Argo floats is still far from sufficient for global ocean observation: it cannot provide high-resolution observations of the ocean interior and cannot meet the requirements of studies on global ocean processes and climate change. Because satellite remote sensing can provide large-scale, high-resolution observations of the sea surface, it has become an essential technique for ocean observation. Sea surface satellites, however, cannot directly observe the subsurface temperature structure [1]. Since many subsurface phenomena have surface manifestations that can be interpreted with the help of satellite measurements, key dynamic parameters within the ocean (especially the thermohaline structure) can be derived from sea surface satellite observations through appropriate mechanistic models. Deep ocean remote sensing (DORS) can retrieve ocean interior dynamic parameters and enables us to characterize ocean interior processes and features and their implications for climate change [25].

Previous studies have demonstrated that the DORS technique has great potential to detect and predict the dynamic parameters of the ocean interior indirectly, based on satellite measurements combined with float observations [41, 43]. DORS methods mainly include numerical modeling and data assimilation [25], dynamical theoretical approaches [30, 48, 50], and empirical statistical and machine learning approaches [23, 41]. The accuracy of numerical and dynamical modeling for large-scale subsurface ocean simulation and estimation is not guaranteed, owing to the complexity and uncertainty of these methods. Reference [47] empirically estimated mesoscale 3D oceanic thermal structures by employing a two-layer model with a set of parameters. Reference [35] determined the vertical structure and transport on a transect across the North Atlantic Current by integrating historical hydrography with acoustic travel time. Reference [34] estimated the 4D structure of the Southern Ocean from satellite altimetry by a gravest empirical mode projection. However, in the era of big ocean data and artificial intelligence, data-driven models, particularly cutting-edge artificial intelligence (AI) or machine learning models, perform well and can reach high accuracy in DORS techniques and applications. So far, empirical statistical and AI models have been well developed and applied, including the linear regression model [19, 23], empirical orthogonal function-based approaches [32, 37], the geographically weighted regression model [43], and advanced machine learning models such as artificial neural networks [1, 45], self-organizing maps [10], support vector machines [28, 41], random forests (RFs) [43], clustering neural networks [31], and XGBoost [44]. Although traditional machine learning methods have made significant contributions to DORS techniques, they are unable to consider and learn the spatiotemporal characteristics of ocean observation data. In the big Earth data era, deep learning has been widely utilized for process understanding in data-driven Earth system science [38]. Deep learning techniques offer great potential in DORS studies to help overcome these limitations and improve performance [46]. For example, Long Short-Term Memory (LSTM) networks capture temporal features well and achieve time-series learning [8], while Convolutional Neural Networks (CNN) exploit the spatial characteristics of data to realize spatial learning [5]. Deep learning has thus unleashed great potential in data-driven oceanography and remote sensing research.

This chapter proposes several novel approaches based on ensemble learning and deep learning to accurately retrieve and depict the subsurface thermohaline structure from multisource satellite observations combined with Argo in situ data, and highlights AI applications in deep ocean remote sensing and climate change studies. We aim to construct AI-based inversion models with strong robustness and generalization ability that can detect and describe the subsurface thermohaline structure of the global ocean. Our new methods provide powerful AI-based techniques for examining, from a remote sensing perspective and on a global scale, the subsurface and deeper ocean thermohaline changes and variability that have played a significant role in recent global warming.

2 Study Area and Data

The ocean plays a significant role in modulating the global climate system, especially during recent global warming and ocean warming [51]. It serves as a major heat sink for the Earth’s climate system [12] and also acts as an important sink for the increasing CO2 produced by anthropogenic activities and emissions. The study area here is the global ocean, including the Pacific Ocean, Atlantic Ocean, Indian Ocean, and Southern Ocean (180\(^{\circ }\) W~180\(^{\circ }\) E and 78.375\(^{\circ }\) S~77.625\(^{\circ }\) N).

The satellite-based sea surface measurements adopted in this study include sea surface height (SSH), sea surface temperature (SST), sea surface salinity (SSS), and sea surface wind (SSW). The SSH is obtained from AVISO satellite altimetry, the SST from the Optimum Interpolation Sea-Surface Temperature (OISST) data, the SSS from the Soil Moisture and Ocean Salinity (SMOS) mission, and the SSW from the Cross-Calibrated Multi-Platform (CCMP) product. The longitude (LON) and latitude (LAT) georeference information is also employed as supplementary input. All the sea surface variables above share the same 0.25\(^{\circ }\) \(\times \) 0.25\(^{\circ }\) spatial resolution. The subsurface temperature (ST) and salinity (SS) data are from Argo gridded products with 1\(^{\circ }\) \(\times \) 1\(^{\circ }\) spatial resolution; this study adopted the Argo gridded data for the upper 1,000 m of the subsurface ocean at 16 depth levels as labels. We first applied nearest-neighbor interpolation to unify the satellite-based sea surface variables to the 1\(^{\circ }\) \(\times \) 1\(^{\circ }\) grid, as sketched below.
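
As a rough illustration of this regridding step, the following minimal sketch uses xarray’s nearest-neighbor interpolation on a synthetic 0.25\(^{\circ }\) field; the variable name, random values, and exact grid coordinates are illustrative assumptions, not the actual product layout.

```python
import numpy as np
import xarray as xr

# Hypothetical 0.25-degree surface field standing in for, e.g., an OISST anomaly map
lat_025 = np.arange(-89.875, 90.0, 0.25)
lon_025 = np.arange(-179.875, 180.0, 0.25)
sst = xr.DataArray(
    np.random.rand(lat_025.size, lon_025.size),
    coords={"lat": lat_025, "lon": lon_025},
    dims=("lat", "lon"),
    name="sst",
)

# Target 1-degree grid spanning the study domain (78.375 S to 77.625 N)
lat_1 = np.arange(-78.375, 77.626, 1.0)
lon_1 = np.arange(-179.5, 180.0, 1.0)

# Nearest-neighbour interpolation onto the coarser, Argo-like grid
sst_1deg = sst.interp(lat=lat_1, lon=lon_1, method="nearest")
```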

All the aforementioned satellite-based sea surface variables and Argo gridded data have their climatology (baseline: 2005–2016) subtracted to obtain anomaly fields, thereby removing the seasonal climatological signal [41]. In this study, we primarily focus on the nonseasonal anomaly signals, which are more difficult to detect but more significant for climate change. We applied min–max normalization to scale the training dataset to the range [0, 1]. The testing dataset was normalized with the same statistics derived from the training dataset, which effectively prevents data leakage during the modeling.
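
A minimal sketch of this normalization step is given below, assuming the anomaly fields have already been flattened into feature matrices; the array shapes and random placeholder values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder anomaly features (e.g., SSHA, SSTA, SSSA, SSWA, LON, LAT);
# real inputs would already have the 2005-2016 climatology removed.
X_train = rng.normal(size=(1000, 6))
X_test = rng.normal(size=(400, 6))

def minmax_fit(train):
    """Per-feature min/max computed on the training data only."""
    return train.min(axis=0), train.max(axis=0)

def minmax_apply(x, lo, hi):
    """Scale to [0, 1] using the training statistics, avoiding leakage."""
    return (x - lo) / (hi - lo + 1e-12)

lo, hi = minmax_fit(X_train)
X_train_norm = minmax_apply(X_train, lo, hi)
X_test_norm = minmax_apply(X_test, lo, hi)
```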

3 Retrieving Subsurface Thermohaline Based on Ensemble Learning

Here, the specific procedure for subsurface thermohaline retrieval based on machine learning contains three technical steps. First, the training dataset for the model was constructed: the satellite-based sea surface parameters (SSH, SST, SSS, SSW) were selected as input variables for the AI-based models, and the subsurface temperature anomaly (STA) and subsurface salinity anomaly (SSA) from the Argo gridded data were adopted as labels for training and testing. All input surface and subsurface data were uniformly normalized and randomly separated into a training dataset (60%) and a testing dataset (40%), which were used to train and test the AI-based models, respectively. Second, the model was trained on the training dataset: the hyperparameters were tuned using a Bayesian optimization approach, and a suitable machine learning model was then set up with the optimal parameters. Finally, prediction was performed with the trained model: we predicted the STA and SSA with the optimized model and evaluated the model performance and accuracy using the coefficient of determination (R2) and root-mean-square error (RMSE).
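
The split and evaluation metrics described above can be sketched as follows with scikit-learn; the placeholder predictor and label arrays are random stand-ins for the normalized surface anomalies and Argo labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 6))   # placeholder surface predictors (SSHA, SSTA, SSSA, SSWA, LON, LAT)
y = rng.normal(size=5000)        # placeholder STA (or SSA) labels at one depth level

# 60% training / 40% testing split, as described in the text
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

def evaluate(y_true, y_pred):
    """Return the R2 and RMSE used for accuracy assessment in this chapter."""
    r2 = r2_score(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return r2, rmse
```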

3.1 EXtreme Gradient Boosting (XGBoost)

The Gradient Boosting Decision Tree (GBDT) is an iterative boosting algorithm composed of multiple decision trees [16]. EXtreme Gradient Boosting (XGBoost) is an upgraded GBDT ensemble learning algorithm [11] as well as an optimized distributed gradient boosting library. XGBoost implements a decision-tree-based ensemble machine learning algorithm under a gradient boosting framework and provides parallel tree boosting that solves many data science problems in an efficient, flexible, and accurate way. To achieve optimal model performance, parameter tuning is essential during modeling. XGBoost contains several hyperparameters related to the complexity and regularization of the model [49], and they must be optimized to refine the model and improve its performance. Here, we used the well-performing Bayesian optimization approach to tune the XGBoost hyperparameters, as sketched below.
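
The sketch below illustrates Bayesian hyperparameter tuning of an XGBoost regressor, assuming scikit-optimize (skopt) as the optimizer; the search ranges, iteration count, and placeholder data are illustrative assumptions, not the chapter’s actual settings.

```python
import numpy as np
from xgboost import XGBRegressor
from skopt import BayesSearchCV
from skopt.space import Integer, Real

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 6))   # normalized surface anomaly predictors (placeholder)
y_train = rng.normal(size=500)        # Argo STA at one depth level (placeholder)

# Illustrative search space over complexity- and regularization-related hyperparameters
search_space = {
    "n_estimators": Integer(100, 1000),
    "max_depth": Integer(3, 10),
    "learning_rate": Real(0.01, 0.3, prior="log-uniform"),
    "subsample": Real(0.5, 1.0),
    "colsample_bytree": Real(0.5, 1.0),
    "reg_lambda": Real(1e-3, 10.0, prior="log-uniform"),
}

opt = BayesSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    search_space,
    n_iter=30,                               # number of Bayesian optimization steps
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_params_)                      # tuned hyperparameters
```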

Fig. 1 Spatial distribution of the (a) Argo STA and the (b) XGBoost-estimated STA in December 2015 at 600 m depth

Fig. 2 Spatial distribution of the (a) Argo SSA and the (b) XGBoost-estimated SSA in December 2015 at 600 m depth

Figures 1 and 2 show the spatial distribution of the subsurface temperature and salinity anomalies (STA and SSA) of the global ocean from the XGBoost-based result and the Argo gridded data in December 2015 at 600 m depth. Both the XGBoost-estimated STA and SSA are highly consistent with the Argo gridded STA and SSA at 600 m depth. The R2 of the STA/SSA between the Argo gridded data and the XGBoost-estimated result is 0.989/0.981, and the RMSE is 0.026 \(^{\circ }\)C/0.004 PSU.

Fig. 3 Spatial distribution of the (a) Argo STA and the (b) RFs-estimated STA in June 2015 at 600 m depth

Fig. 4 Spatial distribution of the (a) Argo SSA and the (b) RFs-estimated SSA in June 2015 at 600 m depth

3.2 Random Forests (RFs)

Random Forests (RFs) are a popular ensemble learning method for classification and regression. Reference [7] proposed the general strategy of RFs, which fit numerous decision trees on subsets obtained by randomly resampling the training data. RFs use averaging to improve prediction accuracy and control overfitting, correcting for the individual decision tree’s tendency to overfit. RFs have been effectively applied in various remote sensing fields [21, 53] and generally perform very well; several advantages make them well suited to remote sensing studies [20, 52].

The basic strategy of RFs is to grow a number of decision trees on random subsets of the training data [40], determine the decision rules, and choose the best split at each node [29]. This strategy performs well compared with many other classifiers and makes RFs robust against overfitting [7]. RFs require only two input parameters for training, the number of trees in the forest (ntree) and the number of variables/features considered in the random subset at each node (mtry), and the model is generally insensitive to their exact values [29]. A minimal sketch is given below.
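
In the scikit-learn sketch below, n_estimators plays the role of ntree and max_features that of mtry; the chosen values and placeholder data are illustrative assumptions rather than the chapter’s tuned configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_train = rng.normal(size=(2000, 6))   # normalized surface anomaly predictors (placeholder)
y_train = rng.normal(size=2000)        # Argo STA (or SSA) at one depth level (placeholder)

rf = RandomForestRegressor(
    n_estimators=500,   # ntree: number of trees in the forest
    max_features=2,     # mtry: features considered at each node split
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)
sta_pred = rf.predict(X_train[:5])      # example prediction on a few samples
```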

Figures 3 and 4 show the spatial distribution of the subsurface thermohaline anomalies of the global ocean from the RFs-based result and the Argo gridded data in June 2015 at 600 m depth. The spatial distributions and patterns of the RFs-estimated results and the Argo gridded data are quite similar. The R2 of the STA/SSA between the Argo data and the RFs-estimated result is 0.971/0.972, and the RMSE is 0.042 \(^{\circ }\)C/0.005 PSU.

4 Predicting Subsurface Thermohaline Based on Deep Learning

The prediction process for subsurface thermohaline based on deep learning includes three steps. First, the training dataset was prepared by combining the satellite-based sea surface parameters (SSH, SST, SSS, SSW) with the Argo subsurface data as training labels. Second, we carried out hyperparameter tuning based on a grid-search strategy to obtain an optimal deep learning model through training. Here, we set up the time-series deep learning models by adopting the time-series data as the training dataset and the rest as the testing dataset, so as to realize time-series subsurface thermohaline prediction. Finally, the RMSE and R2 were adopted to evaluate model performance and accuracy.

4.1 Bi-Long Short-Term Memory (Bi-LSTM)

The LSTM is a type of recurrent neural network [22] that is well suited to time-series modeling and has been widely applied in natural language processing and speech recognition. The primary principle behind the LSTM is to leverage the target variable’s historical information. Unlike traditional feedforward neural networks, the training errors in an LSTM propagate over a time sequence, capturing the time-dependent relationships in the training data’s historical information [18]. Bi-Long Short-Term Memory (Bi-LSTM) is an upgraded LSTM algorithm: it consists of two unidirectional LSTMs that process the input sequence forward and backward simultaneously, capturing information that a unidirectional LSTM would miss.

To ensure that the Bi-LSTM model achieves good performance and high accuracy, it is necessary to select and tune proper hyperparameters as model inputs. Here, we randomly picked 20% of the training dataset for Bi-LSTM hyperparameter tuning in order to obtain the optimal model configuration. The Bayesian optimization approach was utilized in this study to determine the best numbers of layers and neurons for the Bi-LSTM network. After model testing, we selected a network with three layers and 32, 64, and 64 neurons in the respective layers, and batch normalization was applied after each hidden layer. Following previous practice, optimal performance can be attained with mini-batch sizes ranging from 2 to 32 [33], so the batch size was set to 32. In addition, the optimal number of epochs was 257 for the STA network and 81 for the SSA network. We adopted the RMSE, R2, and Spearman’s rank correlation coefficient (\(\rho \)) to determine the optimal Bi-LSTM timestep; the model performs best when the network timestep is set to 10, so the timestep was set to 10. A sketch of this configuration is given below.
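
The Keras sketch below reflects the configuration described above (three bidirectional layers with 32, 64, and 64 units, batch normalization after each hidden layer, a timestep of 10, and a mini-batch size of 32); the single-output Dense head and the number of input features are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

timestep, n_features = 10, 6          # 10-month window, 6 surface predictors (assumed)

model = tf.keras.Sequential([
    layers.Input(shape=(timestep, n_features)),
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
    layers.BatchNormalization(),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.BatchNormalization(),
    layers.Bidirectional(layers.LSTM(64)),
    layers.BatchNormalization(),
    layers.Dense(1),                   # predicted STA (or SSA) at one grid point (assumed head)
])
model.compile(
    optimizer="adam",
    loss="mse",
    metrics=[tf.keras.metrics.RootMeanSquaredError()],
)
# model.fit(X_train, y_train, epochs=257, batch_size=32)   # STA network; 81 epochs for SSA
```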

Table 1 The different datasets fed to the Bi-LSTM model
Fig. 5 Spatial distribution of the (a) Argo STA and the (b) LSTM-predicted STA in December 2015 at 200 m depth

We employed the data from December 2010 to November 2015 as the training dataset and the data of December 2015 as the testing dataset; the testing dataset uses the target month for performance evaluation (Table 1). In general, the Bi-LSTM is driven by a whole temporal sequence in both training and prediction, but for accuracy validation we focused only on the target month. When constructing the input dataset for the Bi-LSTM, we restructured the data grid by grid along the time sequence according to the rule \(X_{i = 1}^{j = 1}\), \(X_{i = 1}^{j = 2}\), ..., \(X_{i = 1}^{j = 60}\), \(X_{i = 2}^{j = 1}\), ..., \(X_{i = 2}^{j = 60}\), ..., \(X_{i = 24922}^{j = 1}\), ..., \(X_{i = 24922}^{j = 60}\), where i denotes the grid point and j denotes the month. A sketch of this restructuring follows.
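
The sketch below illustrates one way to realize this grid-by-grid restructuring, assuming the surface predictors are stored as a (months, grids, features) array and that fixed-length windows of the chosen timestep feed the Bi-LSTM; the window construction, the reduced grid count, and the random placeholder values are assumptions for illustration.

```python
import numpy as np

n_months, n_features, timestep = 60, 6, 10
n_grids = 1000   # illustrative; the chapter uses 24,922 ocean grid points
rng = np.random.default_rng(4)
surface = rng.normal(size=(n_months, n_grids, n_features))

# Reorder so each grid point carries its own 60-month sequence:
# shape (grids, months, features), i.e. X_i^{j=1..60} for every grid point i.
per_grid_series = np.transpose(surface, (1, 0, 2))

# Sliding windows of length `timestep`; the label would be the subsurface
# anomaly at each window's final (target) month.
windows = np.stack(
    [per_grid_series[:, j:j + timestep, :] for j in range(n_months - timestep + 1)],
    axis=1,
)  # shape: (grids, n_windows, timestep, features)
```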

Figures 5 and 6 show the spatial distribution of the subsurface temperature and salinity anomalies of the global ocean from the LSTM-predicted result and the Argo gridded data in December 2015 at 200 m depth. The LSTM-predicted result accurately retrieves and captures most anomaly signals in the subsurface ocean. The R2 of the STA/SSA between the Argo gridded data and the LSTM-predicted result is 0.728/0.476, and the RMSE is 0.378 \(^{\circ }\)C/0.055 PSU.

Fig. 6 Spatial distribution of the (a) Argo SSA and the (b) LSTM-predicted SSA in December 2015 at 200 m depth

Fig. 7 The meridional vertical profile of the STA in December 2015 at the longitude of 190\(^{\circ }\) for (a) Argo gridded data and (b) LSTM-predicted result

Fig. 8 The meridional vertical profile of the SSA in December 2015 at the longitude of 190\(^{\circ }\) for (a) Argo gridded data and (b) LSTM-predicted result

Figure 7 shows the meridional profile (at longitude 190\(^{\circ }\)) of the Argo gridded and LSTM-predicted STA for vertical comparison and validation. The two vertical profiles are highly consistent in their vertical distribution patterns: over 99.75% of the profile points lie within a ±1 \(^{\circ }\)C prediction error, and over 99.44% lie within ±0.5 \(^{\circ }\)C. Figure 8 shows the same meridional profile for the Argo gridded and LSTM-predicted SSA. The two vertical profiles match well in their vertical distribution patterns: over 99.55% of the profile points lie within a ±0.2 PSU prediction error, and over 99.36% lie within ±0.1 PSU. These results demonstrate that the model’s prediction performance for the STA and SSA is excellent, with high accuracy.
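
The error-tolerance statistics quoted above can be computed with a one-line helper like the sketch below; the toy profiles are made-up values, not the chapter’s data.

```python
import numpy as np

def fraction_within(pred, truth, tol):
    """Percentage of profile points with |pred - truth| <= tol."""
    return 100.0 * np.mean(np.abs(np.asarray(pred) - np.asarray(truth)) <= tol)

# Toy demonstration with synthetic profiles (illustrative only)
rng = np.random.default_rng(5)
truth = rng.normal(scale=0.5, size=1000)
pred = truth + rng.normal(scale=0.2, size=1000)
print(fraction_within(pred, truth, 1.0))   # share of points within ±1 °C
print(fraction_within(pred, truth, 0.5))   # share of points within ±0.5 °C
```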

4.2 Convolutional Neural Network (CNN)

The Convolutional Neural Network (CNN) is a well-known deep learning algorithm. Reference [17] proposed a neural network structure including convolution and pooling layers, which can be regarded as the first implementation of the CNN model. On this basis, Reference [27] proposed the LeNet-5 network, which used the error backpropagation algorithm in its structure and is considered a prototype of the CNN. In 2012, a deep network structure and the dropout method were applied in the ImageNet image recognition contest [26] and significantly reduced the error rate, opening a new era in the image recognition field. Since then, the CNN technique has been widely utilized in a variety of applications, including climate change and marine environmental remote sensing [5]. Here, the CNN algorithm combined with satellite observations was employed to predict the ocean subsurface parameters, as sketched below.
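
As an illustration of how gridded surface fields can be mapped to a subsurface field with convolutions, the fully convolutional Keras sketch below preserves the 1\(^{\circ }\) grid; the layer widths, kernel sizes, and single-depth output head are assumptions for illustration, not the chapter’s exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_lat, n_lon, n_channels = 157, 360, 6   # 1-degree global grid, 6 surface input fields (assumed)

model = tf.keras.Sequential([
    layers.Input(shape=(n_lat, n_lon, n_channels)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(1, 1, padding="same"),   # ST (or SS) map at one depth level
])
model.compile(optimizer="adam", loss="mse")
```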

Fig. 9 Spatial distribution of the (a) Argo ST and the (b) CNN-predicted ST in December 2015 at 200 m depth

Fig. 10 Spatial distribution of the (a) Argo SS and the (b) CNN-predicted SS in December 2015 at 200 m depth

We utilized the CNN approach to retrieve the ocean subsurface temperature (ST) and salinity (SS) directly from satellite remote sensing data. Figures 9 and 10 show the spatial distribution of the subsurface thermohaline of the global ocean from the CNN-predicted result and the Argo gridded data in December 2015 at 200 m depth. The R2 of the ST/SS between the Argo gridded data and the CNN-predicted result is 0.972/0.822, and the RMSE is 0.924 \(^{\circ }\)C/0.293 PSU.

5 Conclusions

This chapter proposes several AI-based techniques (ensemble learning and deep learning) for retrieving and predicting the subsurface thermohaline structure of the global ocean. The proposed models are shown to estimate the global subsurface temperature and salinity structures accurately from multisource satellite remote sensing observations (SSH, SST, SSS, and SSW) combined with Argo float data. The performance and accuracy of the models are thoroughly evaluated against the Argo in situ data. The results demonstrate that the AI-based models have strong robustness and generalization ability and can be readily applied to the prediction and reconstruction of subsurface dynamic environmental parameters.

We employ the XGBoost and RFs ensemble learning algorithms to derive the subsurface temperature and salinity of the global ocean: the R2/RMSE of the XGBoost-retrieved STA and SSA are 0.989/0.026 \(^{\circ }\)C and 0.981/0.004 PSU, and the R2/RMSE of the RFs-retrieved STA and SSA are 0.971/0.042 \(^{\circ }\)C and 0.972/0.005 PSU. Moreover, the Bi-LSTM and CNN deep learning algorithms are adopted for time-series prediction of the subsurface thermohaline: the R2/RMSE of the Bi-LSTM-predicted STA and SSA are 0.728/0.378 \(^{\circ }\)C and 0.476/0.055 PSU, and the R2/RMSE of the CNN-predicted ST and SS are 0.972/0.924 \(^{\circ }\)C and 0.822/0.293 PSU (the CNN predicts the ST and SS directly). Overall, ensemble learning algorithms, which are suited to small-data modeling, can be used to retrieve the mono-temporal subsurface thermohaline structure well, while deep learning algorithms, which are suited to big-data modeling, can be adopted to predict the time-series subsurface thermohaline structure.

In the future, we can employ longer time series of remote sensing data for modeling and utilize more advanced deep learning algorithms to improve model applicability and robustness. We should further promote the application of AI and deep learning techniques in deep ocean remote sensing and data reconstruction for revisiting global ocean warming and climate change. Powerful AI technology shows great potential for detecting and predicting subsurface environmental parameters from multisource satellite measurements, and provides a useful technique for advancing studies of deep ocean remote sensing as well as of ocean warming and climate change during recent decades.