Background & Summary

The northwest Pacific and adjacent eastern Asian waters are known for their high primary productivity1,2,3, which has been examined in various marine chemical and biological studies, including those on water quality changes and material cycling. Nutrient data play a crucial role in these studies, with nitrogen and phosphorus being especially significant, as they influence the growth and reproduction of phytoplankton and shape the area’s phytoplankton species composition4,5,6,7,8,9. Human activities, such as artificial nitrogen input in densely populated regions such as eastern China and western Europe, affect the ocean’s biogeochemical structure10,11,12,13.

Coastal waters near East Asia, including the heavily impacted Yellow Sea and East China Sea, the unique East Sea with relatively lower human impact, and the northwest Pacific influenced by the Kuroshio current, are influenced by both natural and human factors, leading to complex changes in nutrient levels14. Nutrient supply from the deep sea has been reported to have decreased worldwide due to the strengthening of the stratification1,2,3,15, while artificial supply through the atmosphere and rivers has increased in the East Asian waters11,12,16,17,18. Thus, understanding the long-term changes in nutrient levels in this region is crucial, and various perspectives are being studied to comprehend the phenomenon and predict future changes6,19,20,21.

From the perspective of the spatial estimation of ocean information, studies providing gridded data such as OISST (Optimum Interpolation SST [Sea Surface Temperature]) are ongoing22,23,24. In addition, research is being conducted to remove the bias of numerical models using spatial estimation techniques such as Kriging25. However, these techniques mainly focus on temperature and salinity. Nutrient data, however, primarily rely on in situ observations, as satellite remote sensing and unmanned equipment cannot cover them. Efforts have been made to provide monthly gridded nutrient data for the North Pacific region26,27,28, but 4D gridded nutrient data (x, y, z, t) that consider both spatial and temporal variations are limited to some reanalysis data using numerical models29.

This study uses 40 years of nutrient concentration observations from 1980 to 2019 to spatially and temporally optimize the data for the northwest Pacific Ocean (N25–45°, E121–145°) into a gridded format. The estimated nutrient grid data were validated through a verification process and are presented (with validation errors shown) in the modeled results.

Methods

The procedure used to create gridded data is summarized in Fig. 1. This section outlines the steps involved in transforming raw observational data into nutrient grid data and validating the results. All procedures were executed using the R programming language (R core team, 2023)30. The data production process consisted mainly of three steps: data collection and preprocessing, spatiotemporal estimation, and postprocessing and validation. These steps will be discussed in further detail below.

Fig. 1
figure 1

Workflow of the study. Nutrient and depth data were downloaded from various sources. The nutrient data obtained from different sources were compiled into a common data format, and data within the valid range of time, space, and nutrient variables were selected. Then, the refined data were used for variogram modeling and spatiotemporal kriging. 10-fold cross-validation was performed to indicate estimation errors, and the data were saved.

Data procurement

The data analyzed in this study were acquired from the National Institute of Fisheries Science (NIFS) Serial Oceanographic Observation (SOO, https://www.nifs.go.kr/kodc/eng/eng_coo_list.kodc)31, Japan Meteorological Agency (JMA) Oceanographic and Marine Meteorological Observations by Research Vessels (OMMORV, https://www.data.jma.go.jp/gmd/kaiyou/db/vessel_obs/data-report/html/ship/ship_e.php)32, and topographic data from the National Oceanic and Atmospheric Administration (NOAA) Earth TOPOgraphy(https://www.ncei.noaa.gov/maps/grid-extract/)33,34. The data specifications, including file format, observational period, spatiotemporal resolution, and accessible URLs, are presented in Table 1.

Table 1 Data specifications for SOO, OMMORV, and ETOPO 2022.

The OMMORV data are currently accessible for download starting from 1997, with earlier data available in the JMA Data Report of Oceanographic Observations Special Issue35. To avoid duplications, any potential overlap in the data were excluded from the spatiotemporal estimation process.

Nutrient data

The SOO data can be obtained by specifying the region, line, station, observation date, and depth. These data comprise ten water quality parameters, including temperature, salinity, dissolved oxygen concentration, nitrate concentration, and phosphate concentration. The data have been collected since 1961 with a bimonthly (February, April, June, August, October, and December) measurement frequency. Although the station and line locations may have undergone slight changes, as of 2020, data from 207 stations along 25 lines have been accumulated.

The OMMORV data can be obtained through the provided link associated with each research vessel. The hydrographic data, saved with an ‘.E’ extension, were used in this study. The format of the data underwent a change in 2010 and is now classified into two versions, ‘E2.x’ and ‘E3.x’. The data are a 126-byte ASCII record in a fixed width format, with observation information and items arranged at regular intervals. Detailed information on the format of the data is available separately36.

To ensure compatibility, the nutrient data from both SOO and OMMORV underwent unit conversion. The original units of μmol/kg were changed to μmol/L, and the density was calculated based on the recorded water temperature and salinity data. This calculation was carried out assuming standard atmospheric pressure (10.1325 dbar) with the gsw library in the R programming language37,38. The SOO and OMMORV data were merged into a table of 2,023,251 observations and 16 variables, and the missing information for each variable is shown in Table 2.

Table 2 The basic description of the dataset merged from SOO and OMMORV.

The bathymetry used in this study utilized NOAA’s ETOPO 2022 data, which is the second release of these data following ETOPO133,39. The water depth data can be obtained either by downloading the desired area data from the following URL or by using the R Package marmap40. In this study, the water depth data within the range of N25–45° and E121–145° were processed at 10-minute intervals using the getNOAA.bathy function from the marmap library. The computation was performed on grid points that were densely set from the surface to the 9,000 m depth range using the standard depth from WOD1841. However, to reduce the computational load, the number of depth grids was reduced from 137 to 43. Depth intervals were created at 20 m intervals for the 0–200 m range, 100 m intervals for the 200–2000 m range, and 500 m intervals for the 2000–9000 m range to produce the grids.

Spatiotemporal estimation

The spatiotemporal kriging (STK) approach was used to transform the irregular nutrient data collected from the SOO and OMMORV sources into spatiotemporal grid data. This method has been widely used in various fields42,43,44,45 and distinguishes itself from those in previous studies by considering the vertical dimension in the 4-dimensional spatiotemporal estimation. The R libraries gstat and spacetime were utilized for maximum 3-dimensional spatiotemporal estimation46,47,48; however, a custom function had to be developed, as 4-dimensional coordinates are not supported by these libraries.

Kriging with External Drift (KED) was applied for nutrient estimation in unmeasured spatiotemporal areas using spatiotemporal coordinates as the auxiliary variables. KED is also referred to as universal kriging when the drift is limited to spatial coordinates49,50. The unmeasured point’s estimation at a specific time point is represented as a weighted combination of the spatial trend and the residual from the regression model at the measured point as in Eq. (1).

$$\widehat{z}\left({x}_{0}\right)={\sum }_{k=1}^{m}{\omega }_{k}{f}_{k}\left({x}_{0}\right)+{\omega }_{0}+{\sum }_{i=1}^{n}{{\rm{\lambda }}}_{i}e\left(z\left({x}_{i}\right)-\bar{z}\right)$$
(1)

where xi and x0 represent the coordinates of the observation and the target location, respectively, and the 4-dimensional coordinate structure including horizontal, vertical, and temporal dimensions is represented as \(\left[x,y,z,t\right]=\left[{x}_{1},\cdots \,,{x}_{m}\right]\). The subscript i denotes the observed location, and 0 denotes the location of interest for prediction. fk (x0) is a function representing the average spatiotemporal variation, and a linear function is used. ωk is the coefficient of the regression function fk (x0), and ω0 is the Lagrange parameter to remove bias. e is the residual of fk (x0), and λi represents the weight coefficient of e.

The optimal coefficients (ωk, ω0, λi) that satisfy the condition of minimizing the error variance in Eq. (2) are derived in the form of Eq. (3), and it is solved as shown in Eq. (4).

$$\min {\left[\widehat{z}\left({x}_{0}\right)-z\left({x}_{0}\right)\right]}^{2}$$
(2)

The process of finding the solution in the form of a matrix equation is shown in Eqs. (3, 4), and the block matrices that make up the overall matrix equation are constituted as in Eq. (511). The bold symbols indicate the block matrix.

$${{\boldsymbol{C}}}^{{\boldsymbol{KED}}}\cdot {{\boldsymbol{\lambda }}}^{{\boldsymbol{KED}}}={{\boldsymbol{C}}}_{{\bf{0}}}^{{\boldsymbol{KED}}}$$
(3)
$${\lambda }^{{\boldsymbol{KED}}}={{\boldsymbol{C}}}^{{{\boldsymbol{KED}}}^{-1}}\cdot {{\boldsymbol{C}}}_{{\bf{0}}}^{{\boldsymbol{KED}}}$$
(4)
$${{\boldsymbol{\lambda }}}^{{\boldsymbol{KED}}}=\left[\begin{array}{c}{{\boldsymbol{\lambda }}}_{{\boldsymbol{i}}}\\ {\omega }_{0}\\ {{\boldsymbol{\omega }}}_{{\boldsymbol{k}}}\end{array}\right],i=1,\ldots ,n;k=1,\ldots ,m$$
(5)
$${{\boldsymbol{C}}}^{{\boldsymbol{KED}}}=\left[\begin{array}{ccc}{{\boldsymbol{\sigma }}}_{{\boldsymbol{ij}}}^{2} & {{\boldsymbol{I}}}_{{\boldsymbol{n}}} & {\boldsymbol{X}}\\ {{\boldsymbol{I}}}_{{\boldsymbol{n}}}^{{\boldsymbol{T}}} & {\bf{0}} & {{\bf{0}}}_{{\boldsymbol{m}}}\\ {{\boldsymbol{X}}}^{{\boldsymbol{T}}} & {{\bf{0}}}_{{\boldsymbol{m}}}^{{\boldsymbol{T}}} & {\bf{0}}\end{array}\right]$$
(6)
$${{\boldsymbol{C}}}_{{\bf{0}}}^{{\boldsymbol{KED}}}=\left[\begin{array}{c}{{\boldsymbol{\sigma }}}_{{\bf{0}}{\boldsymbol{j}}}^{2}\\ 1\\ {{\boldsymbol{X}}}_{{\bf{0}}}\end{array}\right]$$
(7)

where X is the observed coordinates, X0 is the target coordinates, and ~ represents the min-max scaled coordinates. In is a unit vector of n × 1, 0m is a zero vector of 1 × m, and 0 is a zero matrix of m × m. The superscript T on a matrix represents the transpose.

$${\boldsymbol{X}}=\left[\begin{array}{ccc}{({\widetilde{x}}_{1})}_{1} & \cdots & {({\widetilde{x}}_{m})}_{1}\\ \vdots & \ddots & \vdots \\ {({\widetilde{x}}_{1})}_{n} & \cdots & {({\widetilde{x}}_{m})}_{n}\end{array}\right]=\left[\begin{array}{cccc}{\widetilde{x}}_{1} & {\widetilde{y}}_{1} & {\widetilde{z}}_{1} & {\widetilde{t}}_{1}\\ \vdots & \vdots & \vdots & \vdots \\ {\widetilde{x}}_{n} & {\widetilde{y}}_{n} & {\widetilde{z}}_{n} & {\widetilde{t}}_{n}\end{array}\right],=\left[\begin{array}{c}{\widetilde{x}}_{0}\\ {\widetilde{y}}_{0}\\ {\widetilde{z}}_{0}\\ {\widetilde{t}}_{0}\end{array}\right]$$
(8 9)
$${{\boldsymbol{\sigma }}}_{{\boldsymbol{ij}}}^{{\bf{2}}}={{\boldsymbol{\gamma }}}_{{\boldsymbol{st}}}^{{\boldsymbol{SM}}}({{\boldsymbol{h}}}_{{\boldsymbol{ij}}},{{\boldsymbol{u}}}_{{\boldsymbol{ij}}}),{{\boldsymbol{\sigma }}}_{{\bf{0}}{\boldsymbol{j}}}^{2}={{\boldsymbol{\gamma }}}_{{\boldsymbol{st}}}^{{\boldsymbol{SM}}}({{\boldsymbol{h}}}_{{\bf{0}}{\boldsymbol{i}}},{{\boldsymbol{u}}}_{{\bf{0}}{\boldsymbol{i}}})$$
(10 11)

where the matrices \({{\boldsymbol{\sigma }}}_{{\boldsymbol{ij}}}^{{\bf{2}}}\) and \({{\boldsymbol{\sigma }}}_{{\bf{0}}{\boldsymbol{j}}}^{{\bf{2}}}\) are calculated based on the spatiotemporal variogram \({\gamma }_{st}^{SM}\left(h,u\right)\) estimated from the observational data. The variogram represents the change in covariance with respect to distance and time, reflecting the strength of the correlation between data points as a function of spatial and temporal distance. \({\gamma }_{st}^{SM}\left(h,u\right)\) is expressed as a function of the spatial distance h and the temporal distance u (12, 13).

$${h}_{ij}=\sqrt{{\left({\widetilde{x}}_{i}-{\widetilde{x}}_{j}\right)}^{2}+{\left({\widetilde{y}}_{i}-{\widetilde{y}}_{j}\right)}^{2}+{\left({\widetilde{z}}_{i}-{\widetilde{z}}_{j}\right)}^{2}}$$
(12.1)
$${h}_{0i}=\sqrt{{\left({\widetilde{x}}_{0}-{\widetilde{x}}_{i}\right)}^{2}+{\left({\widetilde{y}}_{0}-{\widetilde{y}}_{i}\right)}^{2}+{\left({\widetilde{z}}_{0}-{\widetilde{z}}_{i}\right)}^{2}}$$
(12.2)
$${u}_{ij}=\sqrt{{\left({\widetilde{t}}_{i}-{\widetilde{t}}_{j}\right)}^{2}}$$
(13.1)
$${u}_{0i}=\sqrt{{\left({\widetilde{t}}_{0}-{\widetilde{t}}_{i}\right)}^{2}}$$
(13.2)

The \({\gamma }_{st}^{SM}\left(h,u\right)\) was fitted using the Sum-Metric (14) model51,52. The Spherical model (15) was applied equally to the γs, γt, and γjoint models.

$${\gamma }_{st}^{SM}\left(h,u\right)={\gamma }_{s}\left({h}_{ij}\right)+{\gamma }_{t}\left({u}_{ij}\right)+{\gamma }_{joint}\left(\sqrt{{h}_{ij}^{2}+{\left(\kappa {u}_{ij}\right)}^{2}}\right)$$
(14)
$$\gamma \left(h\right)=Var\left(z\right)-{C}_{0}\left(1.5\frac{h}{a}-0.5{\left(\frac{h}{a}\right)}^{3}\right)+b$$
(15)

In the variogram model, h is the separation distance (generally, spatial distance), a is the range, and b is the nugget. The minimum of the observational data is C0 + b, and κ is an anisotropic parameter for time. Each parameter was optimally estimated using the L-BFGS-B algorithm46,53. An example of spatiotemporal variogram modeling using observed nitrate data from 2013 is provided in Fig. 2.

Fig. 2
figure 2

Example of empirical (a) and fitted (b) variograms. The nitrate data from 2013 were used and fitted using the Sum-Metric model. The spatial and temporal distances were min-max scaled. The semivariance of the z-axis represents the inverse correlation with respect to spatiotemporal distance.

When computing the actual \(\widehat{z}\left({x}_{0}\right)\), only the estimated λi is used, and the regression coefficient ωk, which determines the spatial average variation, is calculated using Eq. (16) as a constraint.

$$\widehat{z}\left({x}_{0}\right)={\sum }_{i=1}^{n}{\lambda }_{i}{z}_{i}$$
(16)

The prediction results of Eq. (16) are accompanied by the indicator of estimation uncertainty, the error variance, as provided in Eq. (17).

$${\sigma }_{KED}^{2}={\sigma }^{2}-{\sum }_{i=1}^{n}{\lambda }_{i}{\sigma }_{0i}^{2}+{\sum }_{k=0}^{m}{\omega }_{k}{f}_{k}\left({x}_{0}\right)$$
(17)

Data Records

The reproduced data are provided in comma-separated values (.csv) and R Data (.RData) file formats, with processing codes written in the R language54. The code for reading, analyzing, and visualizing the data can also be used to update the data. The data provided in CSV format consist of spatiotemporal coordinates (x, y, z, t), estimated values of nitrate or phosphate, and error variances of kriging. The error variances provide quantitative information on the magnitude of estimation errors and can be utilized in future conditional simulations. R Data (.RData) is a binary data format that can be directly loaded into memory using the load function in the R programming language for immediate use.

The dataset is projected in the Lambert azimuthal equal-area projection method with the following coordinate reference system (CRS):

“+proj=laea +lat_0 = 34.53333 +lon_0 = 137.0698 +x_0 = 0 +y_0 = 0 +datum = WGS84 +units = m +no_defs +ellps = WGS84 +towgs84 = 0,0,0

The data can be converted back to the longitude and latitude coordinate system using the following CRS:

“+proj = longlat +datum = WGS84 +no_defs +ellps = WGS84 +towgs84 = 0,0,0

Coordinate transformation using R can be performed using spatial data libraries such as sp and sf55,56,57.

However, the conversion process may introduce slight errors, resulting in longitude and latitude coordinates with nonuniform degree intervals. Therefore, interpolation methods such as aggregation, nearest neighbor, or bilinear interpolation may be necessary for the stretched grid.

Technical Validation

The performance of the estimation model was evaluated using 10-fold cross-validation for spatial estimation results obtained through STK (Fig. 3, Table 3). Note that Simple and Ordinary Kriging always predict values that are less than or equal to the maximum observed value, while KED can predict values that are greater or smaller than the neighboring observed values. Therefore, if an estimated value falls outside the range of the WOD18 standard, it may need to be adjusted to a value within the range before interpretation. In this case, negative concentration values were replaced with 0. The root mean square error (RMSE), mean absolute error (MAE), and adjusted coefficient of determination \(\left({R}_{adj}^{2}\right)\) were used as performance evaluation metrics (18)58

$${R}_{adj}^{2}=\frac{n-1}{\left(n-m-1\right)\left(1-{R}^{2}\right)}$$
(18)

Where n is the number of data points and m(=4) is the number of predictor variables used in the estimation.

Fig. 3
figure 3

10-fold cross-validation results using observational data from 1980 to 2019, spanning a total of 40 years(a: nitrate, b: phosphate, c: water temperature). Some values fall outside of the valid range, but many data points are densely distributed around the 1:1 line (black solid line). Due to the nature of KED, concentration values calculated as negative values can be converted to 0 or detection limit values.

Table 3 The error evaluation metrics for the 10-fold cross validation, including the root mean square error (RMSE), mean absolute error (MAE), and adjusted R squared.

The performance of the model in predicting water temperature and nitrate and phosphate concentrations was evaluated using error metrics (RMSE, MAE, adjusted R-squared). Note that since various reanalysis datasets are available, sea water temperature estimates are not provided here, and only the performance evaluation results are presented as supporting information. The estimation errors for water temperature were 2.05 °C (RMSE), 1.42 °C (MAE), and 0.93 (\({R}_{adj}^{2}\)). The error metrics for nitrate concentration were 2.79 (RMSE), 1.84 (MAE), and 0.97 (\({R}_{adj}^{2}\)), and those for phosphate concentration were 0.22 μmol/L (RMSE), 0.14 μmol/L (MAE), and 0.96 (\({R}_{adj}^{2}\)). The spatial distribution of the errors showed that the area near Hokkaido in Japan had higher nutrient concentrations than other areas, with approximately 7 μmol/L for nitrate and approximately 0.8 μmol/L for phosphate(Fig. 4a,b).

Fig. 4
figure 4

Error distribution for horizontal (a,b) and vertical (c,d) directions. a and c represent the error distribution of nitrate concentration, while b and d represent the error distribution of phosphate concentration. In areas with high data density, errors are relatively low. The vertical distribution of errors shows an increasing trend in the thermocline (approximately 300 m).

The RMSE variation was analyzed according to depth(Fig. 4c,d). The RMSE was observed to be relatively higher within the 0–1000 m depth range, where the thermocline is located, but remained stable below a depth of 1000 m. However, an increase in RMSE was observed in the deep sea below 5000 m. The increased error in the thermocline and deep sea was attributed to the abrupt changes in nutrient concentration and lack of data, respectively.

Additionally, Compatibility with global-scale projects was assessed. The raw data utilized in this study was contrasted with the biogeochemical data product of GLODAPv2.2022(https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0257247/)59. Since the data prior to 2010 was verified in previous research using CLIVAR and SIO datasets11, the focus was on data from 2010 onwards. A total of 671 data points with precisely matching longitude, latitude, depth and time were compared(Fig. 5).

Fig. 5
figure 5

A comparison of nutrient data used for STK estimation (from NIFS and JMA) with spatiotemporally matched cruise data from GLODAPv2.2022. The results of the comparison for nitrate (a,c) and phosphate (b,d) are presented, with the distribution of differences converging towards a central value of 0.

Subsequently, the data estimated by STK was also compared with the GLODAPv2.2022 data(Fig. 6). Among the gridded data from the period 2010–2019, grids that were spatiotemporally closest to certain GLODAPv2.2022 data were compared. The grid data closest to the GLODAPv2.2022 data were identified, and those within the 5% quantile distance were selected. The criteria for selection were a horizontal distance of approximately 15 km, a vertical distance of about 16 m, and a time difference within roughly 9 days. For NO3, 2,652 data points were contrasted, yielding an \({R}_{adj}^{2}\) of 0.984 and a Residual Standard Error of around 2.03. For PO4, 2,676 data points were examined, with an \({R}_{adj}^{2}\) of 0.981 and a Residual Standard Error of approximately 0.158.

Fig. 6
figure 6

Results from a comparison of nitrate (a) and phosphate (b) gridded data estimated by STK from 2010 to 2019 with GLODAPv2.2022 data. Information from the linear model is represented by the linear regression line (black solid line), 95% confidence interval (red dotted line), and 95% prediction interval (blue dotted line).

Usage Notes

This dataset was used to assess the nutrient dynamics in select areas of the northwest Pacific, both locally and regionally (Fig. 7). Gridded data can be examined through basic statistical analysis and spatial statistical methods such as EOF. These data can also be utilized for comparison with biogeochemical modeling outcomes. The surface (0–50 m averaged) nitrate concentration trend estimated in this study corroborates the decreasing nitrate concentration trend observed in previous studies20,21 since approximately 2010 in the Yellow Sea (Fig. 8) .

Fig. 7
figure 7

Spatial distribution of the mean and standard deviation of nitrate (a,b) and phosphate (c,d) concentrations during the target period of estimation (1980–2019). The mean concentration and variability are highest in the Yellow Sea and near Hokkaido, Japan. The high standard deviation observed in the northwest Pacific off the southeast coast of Japan is attributed to the lack of data.

Fig. 8
figure 8

The time series decomposition of the estimated nitrate (a) and phosphate (b) concentrations in the Yellow Sea region is displayed, where they were decomposed into trend, seasonal, and residual components. These outcomes can be utilized for research on the sources of variability and trend analysis.

Table 4 The system environment used for development and testing is presented.