1 Introduction

Subsurface temperature data are essential for understanding climate variability, particularly in relation to oceanic processes (Levitus et al. 2012). These measurements, collected by oceanographic instruments ranging from buoys and ships to profiling floats and remote sensing technologies, provide a window into ocean heat content and air-sea interactions. These factors are key drivers of climate variability and shape our understanding of the broader climate system (IPCC 2014).

Fluctuations in subsurface temperature, whether observed at global or regional scale, influence many climatic phenomena. They affect sea level variability, which is closely connected to the thermal expansion of the ocean and to the large-scale processes governing its behavior (Wunsch 2020). They also modulate the exchange of heat between the ocean and atmosphere and thereby influence large-scale climate patterns, including the El Niño-Southern Oscillation (ENSO) and the Pacific Decadal Oscillation (PDO), among others (National Academies of Sciences, Engineering, and Medicine 2016). Importantly, these fluctuations in climate state unfold over a wide range of timescales, from interannual to decadal and beyond. Hence, deciphering the links between subsurface temperature changes and climate variability across these timescales is crucial for understanding and predicting climate patterns and their impacts. To achieve this, monitoring and analyzing subsurface temperature data, along with other ocean and climate variables such as observed sea level and climate indices, is essential (Royston et al. 2022). Such efforts aim to strengthen climate models, improving their ability to simulate and project future climate scenarios.

To investigate climate variability in greater depth, a range of analytical techniques is available. Statistical methods, including time series analysis, spectral analysis, and Empirical Orthogonal Function/Principal Component (EOF/PC) analysis, are the standard tools for identifying dominant modes of variability and characterizing their spatiotemporal structure (IPCC 2021). Among these methods, EOF/PC analysis stands out for its ability to reduce the dimensionality of high-dimensional climate datasets. By transforming the original variables into a new set of uncorrelated variables, known as principal components, EOF/PC analysis distills the essential information while filtering out noise and redundant data (Jolliffe and Cadima 2016). Traditionally, these techniques are applied to surface variables, such as sea surface temperature (SST) or surface sea-level pressure (SLP) anomalies, or to full-depth integrated variables like sea surface height (SSH) (Cassou et al. 2018; Han et al. 2017; Li et al. 2012). Nonetheless, these variables may not fully capture or emphasize interannual to decadal climate variability, especially in the upper-ocean layers extending down to 700 m or deeper (Nieves et al. 2017). Our approach therefore shifts the focus to subsurface ocean layers, where the bulk of heat is stored in many regions (as illustrated in studies such as Maze et al. 2017 for the North Atlantic, Thomas et al. 2021 for the Southern Ocean, or Nieves et al. 2021, which provides a comprehensive analysis across the world’s oceans from tropical to mid-latitudes). This perspective provides a vantage point for studying climate variability and gaining deeper insights into the dominant layers of specific regions.

Furthermore, traditional studies often focus on “user-defined” areas of interest or regions with long-lasting climate patterns, like the Pacific or Atlantic regions. In contrast, our introduction of an optimized unsupervised clustering criterion (Bunkers et al. 1996; Fereday et al. 2008; Fovell and Fovell 1993) applied to global subsurface temperature fields allows shorter-lived variability modes to be highlighted. These overlooked modes and oscillations can have a substantial impact on regional climate variations, as demonstrated by the results presented here. In fact, our mixed clustering PC procedure not only identified well-known persistent climate phenomena like ENSO and PDO but also revealed previously unreported local variability modes with relatively shorter cycles. Comparison with regional and coastal sea levels also demonstrated strong alignment across multiple timescales.

Unlike prior research, this work presents the first globally applicable data-driven framework for automated analysis of multi-year ocean depth-layered temperature estimates to extract regional variability modes. This is accomplished without relying heavily on specific assumptions about the underlying processes, while also reducing data complexity. Additionally, the resulting regional variability modes could serve as sea level analogues in situations where tide gauge data are unavailable or where altimetry is hampered by shallow waters, refraction, or other coastal effects (Adebisi et al. 2021; Benveniste et al. 2019; IPCC 2023).

2 Materials and Methods

2.1 Data Sources

We used three distinct datasets for our analysis. The first dataset, the World Ocean Atlas 2023 (WOA23) series, provides climatologies, that is, mean one-degree gridded oceanographic fields, from 1955 to 2022. These fields include temperature estimates at selected depth levels based on the objective analysis of in-situ historical measurements from various sources (Locarnini et al. 2024). The pre-calculated temperature estimates, supplied by NOAA, are crucial for understanding ocean climate variability and include the seasonal surface temperature anomaly (STA), the vertically averaged temperature anomaly for the 0–100 m layer (MTA 100 m), and the heat content for the 0–700 m layer (OHC 700 m), all spanning from 1955 to the present (https://www.nodc.noaa.gov/OC5/3M_HEAT_CONTENT/). The second dataset is satellite-based and offers daily mean quarter-degree gridded sea level anomalies (MSLA), computed with respect to a twenty-year mean climatological field from 1993 to 2012. These anomalies originate from the multi-satellite merged product (https://data.marine.copernicus.eu/product/SEALEVEL_GLO_PHY_CLIMATE_L4_MY_008_057/description) and have been interpolated to a 3-month resolution to maintain consistency with the temperature estimates. This dataset serves as a reference for tracking variations in sea level. For both the temperature and sea level datasets, we adjusted each data point by removing the global mean temperature or sea level value at each time step, along with the seasonal cycle. The seasonal cycle was calculated by averaging the data for each season across all years in the dataset. These adjustments help to suppress short-term fluctuations and emphasize the longer-term changes associated with natural internal variability, as detailed by Nieves et al. (2021). The third dataset comprises the historical time series of climate indices (https://psl.noaa.gov/gcos_wgsp/Timeseries/), also interpolated to a resolution of 3 months.
To assess the relationships on multi-year to decadal timescales, all datasets were smoothed using two methods, a Savitzky-Golay filter and a moving-window filter, using windows of 1, 3, and 5 years.
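
The preprocessing described in this section can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names, the synthetic input, the optional area weights, the Savitzky-Golay polynomial order (2), and the odd window length (13 steps, approximating 3 years at 3-month resolution) are all assumptions made for the example.

```python
import numpy as np
from scipy.signal import savgol_filter

def remove_global_mean(field, weights=None):
    """Subtract the (optionally area-weighted) spatial mean at each time step.

    field: array of shape (n_time, n_points); weights: shape (n_points,).
    """
    if weights is None:
        weights = np.ones(field.shape[1])  # assumption: unweighted mean
    global_mean = (field * weights).sum(axis=1, keepdims=True) / weights.sum()
    return field - global_mean

def remove_seasonal_cycle(field, steps_per_year=4):
    """Subtract, for each season, its mean over all years (per grid point)."""
    out = field.astype(float).copy()
    for s in range(steps_per_year):
        out[s::steps_per_year] -= field[s::steps_per_year].mean(axis=0)
    return out

def moving_average(field, window):
    """Moving-window mean along time; only fully covered windows are kept."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="valid"), 0, field
    )

# Synthetic 3-monthly data: seasonal cycle + common trend + noise.
rng = np.random.default_rng(0)
n_years, n_points = 20, 50
t = np.arange(n_years * 4)
seasonal = np.tile([1.0, 0.2, -1.0, -0.2], n_years)[:, None]
raw = seasonal + 0.01 * t[:, None] + rng.normal(size=(t.size, n_points))

anom = remove_seasonal_cycle(remove_global_mean(raw))
smooth_ma = moving_average(anom, window=12)  # 3 years at 3-month steps
smooth_sg = savgol_filter(anom, window_length=13, polyorder=2, axis=0)
```

On a latitude-longitude grid, weights proportional to the cosine of latitude would be a natural choice for the global mean; the source does not state which weighting was applied.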

2.2 Methodology

As mentioned earlier, we devised an integrated approach to uncover regional-scale ocean modes across diverse temporal scales, capturing both low-frequency climate variability and longer-lived signals. This method combines a machine learning clustering algorithm (applied to ocean depth-layered temperature data) with principal component analysis tailored to the identified cluster regions.

2.2.1 Identification of Climatic Regions Through Multi-Scale K-Means Clustering of Global Subsurface Ocean Temperature Data

The central aim of this step was to capture predominant climatic regions using an optimized K-means clustering technique. The method was applied to the global maps of preprocessed ocean variables, as described in Sect. 2.1, to discern variations among different regions and within the dominant layers of these regions, particularly within the upper 700 m, using a multi-scale analytical approach.

K-means clustering is a widely used unsupervised machine learning algorithm (Michelangeli et al. 1995). Its primary purpose is to categorize data points into distinct clusters, with the goal of maximizing the ratio between inter- and intra-cluster variance. A similar technique, as demonstrated by Fereday et al. (2008) using K-means clustering, has been previously applied to fields related to SLP. In this study, the distance calculation employed within the K-means clustering technique was based on the correlation distance (Ortiz-Bejar et al. 2022). This distance metric yielded more precise results than alternative distance measures such as the Euclidean, city-block, or cosine distances. To mitigate the potential challenges associated with the random initialization of centroids (Fränti and Sieranoja 2019), we conducted multiple random restarts (15 in this case) and selected the clustering solution with the lowest distance error as the initial configuration for the algorithm. Additionally, given that running K-means clustering on the same dataset may yield slightly different outcomes, the entire process was executed 20 times, and the centroids were recalculated by taking the median of all centroid solutions. This approach ensured a more stable clustering result. We also conducted a sensitivity analysis by examining the distance between centroids across all repetitions, which confirmed the robustness of our method (as seen in Supplementary Fig. 1).
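
The clustering procedure can be illustrated with a short sketch. It is written under several assumptions that the text does not specify: the correlation distance is taken as one minus the Pearson correlation between a grid-point time series and a centroid, centroids are updated as the renormalized mean of standardized members, and centroids from different repetitions are matched to a reference run with the Hungarian algorithm before the median is taken. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def standardize_rows(X):
    """Center each row and scale to unit norm, so dot products are correlations."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

def corr_kmeans(Z, k, rng, n_iter=100):
    """One K-means run on standardized rows Z with distance = 1 - correlation."""
    C = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(Z @ C.T, axis=1)  # highest correlation = nearest
        new_C = C.copy()
        for j in range(k):
            members = Z[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new_C[j] = members.mean(axis=0)
        new_C = standardize_rows(new_C)
        if np.allclose(new_C, C):
            break
        C = new_C
    corr = Z @ C.T
    return C, np.argmax(corr, axis=1), np.sum(1.0 - corr.max(axis=1))

def best_of_restarts(Z, k, rng, n_restarts=15):
    """Keep the restart with the lowest total distance error."""
    runs = [corr_kmeans(Z, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

def stable_centroids(X, k, n_repeats=20, n_restarts=15, seed=0):
    """Repeat the whole procedure and take the median of aligned centroids."""
    rng = np.random.default_rng(seed)
    Z = standardize_rows(X)
    all_C = [best_of_restarts(Z, k, rng, n_restarts)[0] for _ in range(n_repeats)]
    ref = all_C[0]
    aligned = []
    for C in all_C:
        # Assumption: match centroids to the first run by maximum correlation.
        _, cols = linear_sum_assignment(-(ref @ C.T))
        aligned.append(C[cols])
    C_med = standardize_rows(np.median(aligned, axis=0))
    return C_med, np.argmax(Z @ C_med.T, axis=1)

# Synthetic example: 3 groups of 30 grid-point time series each.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 60)
bases = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), np.sin(6 * np.pi * t)])
truth = np.repeat(np.arange(3), 30)
X = bases[truth] + 0.1 * rng.normal(size=(90, 60))
centroids, labels = stable_centroids(X, k=3)
```

In this sketch each row of `X` is the time series at one grid point, so grid points whose temperature anomalies co-vary end up in the same cluster regardless of amplitude.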

The selection of the optimal number of clusters was informed by an analysis using information criteria, which suggested that a minimum of 20 clusters captures the most significant variations (see Supplementary Fig. 2). We found that with K = 22, or a similar cluster count, the clustering maps for each ocean variable across various timescales identified multiple regions of climatic and coastal dynamical significance, aligned with the insights provided by Nieves et al. (2021). Also importantly, increasing the number of clusters beyond this range did not yield additional insights; rather, it produced smaller, less significant clusters. Figure 1 presents results using data smoothed with a 3-year filter. For a broader perspective, Supplementary Fig. 3 explores results across various timescales. The variations observed among these timescales are due to a mix of high-frequency and longer-term ocean/climatic processes. Despite these differences, certain regions show consistent patterns across timescales, indicating persistent phenomena that influence variability from seasonal to multi-year scales. A limitation of this analysis is that smoothing the data could introduce minor biases and subtly impact the clustering outcomes (see Supplementary Fig. 3(d), (h), (l)), especially when fine details are crucial. However, this is not a significant concern in our study, as we are not concentrating on smaller-scale regions. An interesting observation from our study is that some clusters are present across various ocean basins. Shared patterns of variability often emerge in K-means analysis due to factors like global oceanic circulation, teleconnections, common climatic drivers, and anthropogenic influences (Camargo et al. 2023). However, the analysis primarily focused on individual basins.

Fig. 1

Representation of the 22 clustering regions identified through K-means on the global maps of three ocean variables over the 1955–2022 period. From left to right: (a) STA, (b) MTA 100 m, and (c) OHC 700 m. Discrepancies in the clusters for different temperature estimates arise from variations in the warming patterns across different ocean layers (Nieves et al. 2015). Colors and numbers are independent for each result. The map numbers indicate selected regions for test comparisons (in Sect. 3), chosen for their significance in coastal areas (Nieves et al. 2021) and/or their relevance to key global modes (https://psl.noaa.gov/gcos_wgsp/Timeseries/)

2.2.2 PC-Guided Refinement to Unveil Regional Climate Patterns

In this stage of the process, we employed an EOF/PC analysis, a method that reveals the principal components (PCs) accounting for the largest share of variability within a dataset, aiding in additional dimensionality reduction and extraction of the most important features (Hannachi et al. 2007). This analysis is crucial to further refine the data, even in the smaller geographical areas under study, leading to a clearer understanding of the sources of variation within the clusters identified by K-means. Each PC is a linear combination of the original variables, and the PCs are ordered by the amount of variance they account for. In our case, these PCs serve as representatives of the significant underlying climate patterns of the clustered data. To facilitate meaningful comparisons and analysis, all components were normalized by subtracting the minimum value and dividing the result by the range, that is, the difference between the maximum and minimum values across all components. This normalization step enhances the reliability and interpretability of our findings.
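
As a hedged illustration of this step, the sketch below computes EOFs and PCs for a single cluster region through a singular value decomposition and then applies the min-max normalization. The function names and the synthetic field are assumptions for the example, and the normalization takes the minimum and range over all retained components, which is one reading of the description above.

```python
import numpy as np

def eof_pc(field, n_modes=3):
    """EOF/PC decomposition of a (n_time, n_points) field via SVD.

    Returns the PC time series, the EOF spatial patterns, and the
    fraction of variance explained by each retained mode.
    """
    anomalies = field - field.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)
    pcs = U[:, :n_modes] * s[:n_modes]   # time series, shape (n_time, n_modes)
    eofs = Vt[:n_modes]                  # patterns, shape (n_modes, n_points)
    explained = s**2 / np.sum(s**2)
    return pcs, eofs, explained[:n_modes]

def normalize_pcs(pcs):
    """Min-max scaling; assumption: min and range taken across all components."""
    lo, hi = pcs.min(), pcs.max()
    return (pcs - lo) / (hi - lo)

# Synthetic cluster region: one dominant oscillation plus weak noise.
rng = np.random.default_rng(0)
t = np.arange(120)
signal = np.sin(2 * np.pi * t / 40)
pattern = rng.normal(size=30)
field = np.outer(signal, pattern) + 0.05 * rng.normal(size=(120, 30))

pcs, eofs, explained = eof_pc(field)
pcs_norm = normalize_pcs(pcs)
```

The sign of each PC returned by an SVD is arbitrary, so a PC may need to be flipped before it is compared visually with another time series.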

We present only the first PC (referred to as PC1), as it shows a strong correlation with regional sea level data and established climate modes, detailed in Sect. 3. While not displayed here, the second and third PCs (PC2 and PC3, respectively) also exhibited noteworthy agreement with sea levels and climate indices. On average, PC1 explained approximately 65% of the variance across all temperature estimates, underscoring its importance as a descriptor of regional climate variability. In contrast, PC2 and PC3 accounted for 10% and 6% of the variance, respectively, across all regions studied. Refer to Fig. 2 for an overview of the entire methodology workflow.

Fig. 2

Overview of the sequential steps in our automated data-driven approach for identifying regional variability modes: The three temperature estimates at various depth layers underwent preprocessing (Sect. 2.1) before being subjected to K-means clustering, which identified regions with similar temperature changes (Sect. 2.2.1). Subsequently, the clustered regions underwent EOF/PC analysis to reveal robust climate patterns (Sect. 2.2.2), facilitating their comparison with regional/coastal sea level and conventional climate modes (Sect. 3)

3 Results and Discussion

This section presents the outcomes of our automated AI-based framework for identifying regional-scale climate variability patterns using global subsurface ocean temperature data. The results are compared with benchmark datasets, such as sea level and conventional climate modes, to assess the scope of the approach. This methodology facilitates the efficient grouping of similar data points and discerns the pivotal layers that contribute to variance within regional modes, influencing changes in regional sea-level variability across diverse climatic regions.

3.1 Interpreting EOF/PC Solutions From Temperature Data Alongside Sea Level Patterns

In this analysis, we present the quantitative relationship between regional sea levels and their corresponding regional PC1 derived from subsurface temperature estimates across various depth layers for each clustered region (referred to as our regional variability modes). It is crucial to acknowledge that variations in sea level can be influenced by a multitude of factors beyond just temperature, which fall well beyond the scope of this study. Consequently, we do not anticipate a complete correspondence between these two variables. Nevertheless, as we explore regional sea level data across various timescales, a robust connection emerges in many regions of the world’s oceans (including the Pacific Ocean, Indo-Pacific region, Indian Ocean and northwest Atlantic Ocean). This underscores the presence of a local oceanic/climatic phenomenon that predominantly manifests through temperature changes in the studied ocean layers, especially with respect to their variability.

In reference to the shallower layers (as depicted in Fig. 3 and Supplementary Figs. 4–6), it is observed that the PC1 derived from STA estimates effectively represents sea level variability in areas where internal climate variability exerts a substantial influence on oceanic changes. In these regions, incorporating information from the deeper layers does not notably improve the model’s performance. This is particularly noticeable in regions like the east and west (E and W) Pacific Ocean (with correlation values of 0.92 and 0.88, respectively), and the northwest (NW2) Atlantic Ocean (with a correlation of 0.69), as illustrated in Fig. 3(a), Supplementary Fig. 4(d), and Supplementary Fig. 6(d), respectively. In contrast, in regions where substantial heat accumulates in deeper ocean layers, the PC1 obtained from MTA 100 m and OHC 700 m estimates provides a more accurate representation of thermal expansion, consequently resulting in a better replication of sea level variations. This is exemplified, for instance, in Supplementary Fig. 6(b) and (c) concerning the northwest (NW1) Atlantic Ocean (with correlation values of 0.84 and 0.87 for depths of 100 and 700 m, respectively). On the other hand, the Indo-Pacific waters and Indian Ocean exhibit a well-represented sea level variability across all regions (with correlation values ranging from 0.88 to 0.97 for the best temperature estimates), as illustrated in Supplementary Figs. 4 and 5. To accommodate all possible scenarios, these findings emphasize the importance of incorporating depth layers in the analysis of climate variability. This enables the model to automatically and accurately identify the most appropriate temperature estimate for explaining sea level changes in each specific case, marking a significant advancement compared to Nieves et al. (2021). Note that all moderate to strong correlations show high statistical significance, as demonstrated by their corresponding p-values, which were nearly zero.
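
A comparison of this kind can be sketched with a Pearson correlation and its p-value. The example below uses synthetic series and a hypothetical helper name; it assumes that both series share the same 3-month time axis and that missing values are simply dropped. The exact significance test used in the study is not stated, so the p-value from `scipy.stats.pearsonr` is shown only as one common choice.

```python
import numpy as np
from scipy.stats import pearsonr

def compare_mode_to_sea_level(pc1, msla):
    """Pearson correlation and p-value between a regional PC1 and sea level.

    Time steps where either series is missing (NaN) are excluded.
    """
    valid = np.isfinite(pc1) & np.isfinite(msla)
    r, p = pearsonr(pc1[valid], msla[valid])
    return r, p

# Synthetic stand-ins for a regional PC1 and the matching regional MSLA.
rng = np.random.default_rng(0)
t = np.arange(120)
common = np.sin(2 * np.pi * t / 48)
pc1 = common + 0.2 * rng.normal(size=t.size)
msla = 0.8 * common + 0.2 * rng.normal(size=t.size)
msla[5] = np.nan  # one missing sea level value

r, p = compare_mode_to_sea_level(pc1, msla)
```

Because smoothing introduces serial correlation, the effective number of independent samples is smaller than the series length, and p-values computed this way should be read as optimistic for heavily smoothed data.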

Furthermore, we have identified interesting nuances in specific regions. Some clusters span coastal regions across different continents, for example, indicating shared variability between the southeastern coast of America and the central-western and southwestern coasts of Africa, as illustrated in Supplementary Fig. 6(i) and (o). Additionally, we observed regions, particularly on the central and eastern sides of the Atlantic Ocean, where a weakened correlation can be attributed to factors beyond temperature changes, such as salinity variations (Wang et al. 2017) (as shown in Supplementary Fig. 6(j)-(l)). This highlights the intricate interaction among different ocean variables and factors that can sometimes influence sea level variability.

These results, consistently observed across diverse timescales (not shown), reinforce the robustness of our integrated clustering and EOF/PC analysis of subsurface temperature across a wide range of regions. Moreover, the approach holds the potential to generate sea level analogues within numerous coastal areas. It is worth highlighting that we also conducted comparisons against coastal sea levels, in addition to regional estimates, across all study regions (refer to Fig. 3 and Supplementary Figs. 4–6). In general, both regional and coastal sea levels exhibit striking similarities and share the same variability as our regional modes. This feature proves particularly valuable in regions where the measurement of coastal sea levels through tide gauges or satellite altimetry is either lacking or remains inadequately quantified, as previously noted (see also Radin and Nieves 2021; Wenzel and Schröter 2010). Hence, our work also has practical implications for monitoring and understanding sea level changes throughout a broad spectrum of regions.

Fig. 3

Consistency observed between the regional PC1 time series (PC1r, in gray) derived from the subsurface temperature estimates (STA, MTA 100 m, OHC 700 m, from left to right) and the regional (light blue) and coastal (dark blue) MSLA in the east Pacific Ocean for the depicted regions within the plots. Coastal estimates were obtained using the mask provided by Nieves et al. (2021)

3.2 Comparative Assessment with Key Climate Indices

In this section, we investigate the degree of agreement between our regional modes of climate variability and prominent climate indices, such as ENSO and PDO. Our focus centers on the specific geographical clustering regions depicted in Fig. 4, which mirror the conventionally defined areas used in the estimation of the climate time series (referenced here: https://psl.noaa.gov/gcos_wgsp/Timeseries/). As these climate indices are closely linked to surface climate variables like SST, we ensure coherence by comparing them exclusively to our regional climate indices (PC1) derived from the STA data (as described in Sect. 2.1).

While a comprehensive comparison with major climate modes of variability may encounter challenges stemming from differences in datasets, variations in the geographical scopes of study areas, and methodological approaches, our results demonstrate strong alignment. Remarkably, our corresponding regional variability modes for ENSO and PDO exhibit consistent large-scale fluctuations linked to the warming and cooling phases in the tropical and north Pacific regions, respectively (see Fig. 4(a) and (b)). The correlation values for these relationships are 0.76 and 0.79, respectively. Consequently, our novel regional-scale indices provide compelling evidence of the efficacy of our approach in capturing ocean climate information related to large-scale oscillations. Furthermore, the methodology can yield similar outcomes without relying on underlying assumptions about regions or variables, instead leveraging automated means.

It is important to highlight that the introduced clustering method not only identified regions with globally dominant variance, corresponding to primary sources of natural variability such as ENSO/PDO, but also unveiled climate patterns characterized by reduced variability, as discussed earlier. This is significant because applying EOF/PC analysis directly to global fields of ocean/climate variables, rather than focusing solely on regional domains, can obscure the presence of less influential modes on a global scale (Tung et al. 2019). Additionally, the results of our approach may vary slightly depending on the adjustments made to clustering parameters or the choice of decomposition technique. However, the dominant regional modes identified should consistently emerge. Lastly, the differing patterns observed across various ocean depth layers and timescales reflect the complex nature of Earth’s climate system (Nieves et al. 2015).

Fig. 4

Association of the PC1 time series from the STA dataset in the regions depicted in the right panel (PC1r) with the large-scale climate modes: (a) ENSO 3 and (b) PDO. The climate indices were also smoothed with a 3-year filter for fair comparison purposes

4 Conclusions

The unified strategy presented here for the automated extraction of regional-scale variability modes allows for a more detailed exploration of the intricacies of local ocean/climate patterns. This approach identifies the most relevant ocean layers in each uncovered climatic region, providing valuable insights into regional and coastal sea levels.

At the heart of this method lies the use of layered ocean temperature observations, which extend the analysis beyond the surface. Notably, our findings reveal that ocean climate variability is not solely shaped by surface temperature but is strongly influenced by temperature variations at different depths within the ocean. Our analysis emphasizes that many oceanic modes of variability capture a substantial portion of the variance across the entire vertical expanse of the ocean.

Moreover, our proposed approach is not only effective but also adaptable, allowing for precise customization through the incorporation of alternative options in the clustering analysis. It possesses the versatility to include additional variables and to integrate alternative analytical techniques, complementing the EOF/PC analysis used in our initial implementation. This capability serves as a powerful resource, fostering a deeper understanding of the dynamic interplay between the ocean and the climate. Ultimately, it contributes to gaining the knowledge and tools needed to make informed decisions regarding local changes associated with climate variability.