Highlights

  • ML algorithms consistently modeled SOC variables with high performance

  • Freely and publicly available remote sensing data were useful for predicting SOC variables

  • Bofedales and grasslands were the most important reservoirs of SOC and fractions

Introduction

The High Andes, located between 5°S and 20°S and above 4000 m a.s.l., are characterized by their rich agrobiodiversity (Monge-Salazar and others 2022) and ecosystem services (Rolando and others 2017a). However, the high melting rate of their glaciers (Zemp and others 2019), the high frequency and intensity of extreme events (heavy rainfall, frosts, strong winds, and droughts, among others; Poveda and others 2020), and changes in land use (mainly agricultural intensification and encroachment; Rolando and others 2017a) make these areas especially vulnerable to climate change. Rising temperatures have led to an expansion of crops to higher elevations (Skarbø and VanderMolen 2016), promoting an increasing incidence of pests (Dangles and others 2008) and diseases.

Global warming drives crop encroachment on the Andes’s higher lands (Rolando and others 2017a), which causes a substantial land-use change and the reduction of soil organic carbon (SOC) pools. External market demand, environmental policy, and management of high Andean grasslands have led to regrettable examples of landscape degradation and transformation. In the Andean highlands of Junin-Peru, the so-called “boom” of maca (Lepidium meyenii), a “superfood” valued for its energizing and nutritional properties and in high demand in the Asian market during 2011–2015 (Turin and others 2018), has transformed a landscape dominated by highland grassland into one dominated by bare soil degraded by maca cultivation. This cultivation process involves burning and plowing the grassland with heavy machinery, releasing significant amounts of carbon (123–136 t ha−1; Rolando and others 2017b). Furthermore, other activities put at risk the conservation and functioning of high Andes wetlands named “bofedales,” which are crucial for water security in lowlands (MINAM 2015) and for conserving significant soil C stocks (Monge-Salazar and others 2022; Hribljan and others 2016) and biodiversity (Polk and others 2019; Maldonado 2014). These activities include the extraction of compact blocks of vegetation with a thin layer of soil, which are then used as an alternative energy source for heating and cooking (Caro and others 2014), and overgrazing by domestic livestock (Cochi Machaca and others 2018).

Andean soil contains high quantities of SOC, the carbon that remains in the soil after the partial decomposition of organic matter by microorganisms (Alavi-Murillo and others 2022). However, few studies address SOC assessment and modeling in the Andean highlands region. Refractory SOC (RSOC) represents a fraction that persists in soil and has a finite turnover time of thousands of years (Krull and others 2003). It represents one of the significant global SOC pools (Jagadamma and others 2010), and its quantification is crucial for understanding C dynamics (decomposition and stabilization processes). Also, the 13C isotope composition of SOC (δ13CSOC) constitutes another crucial soil trait because it may be used to estimate plant inputs into soil organic matter (Ehleringer and others 2000; Bernoux and others 1998). Moreover, δ13CSOC has been shown to vary with SOC turnover rate and with the sources of SOC under land-use change (Ehleringer and others 2000; Xia and others 2021; Han and others 2023). Predictions of quantity and turnover rate based on δ13CSOC are subject to errors associated with climate variability, temporal differences, and anthropogenic contamination. Therefore, it is essential to quantify these errors and compare them with other results to achieve robustness.

Artificial intelligence methods, including machine learning (ML) and deep learning, emerged in the last two decades in pedometrics and have been shown to outperform other SOC modeling approaches, such as linear regression and geostatistical approaches, due to their ability to find nonlinear patterns in a multidimensional set of potential environmental predictors (Somarathna and others 2016; Keskin and others 2019; Veronesi and Schillaci 2019; Chen and others 2022; Grunwald 2022; Zhu and others 2022). However, multiple algorithms are commonly tested and compared because no rule exists for choosing the best ML algorithm. This is because ML models are considered black boxes (the underlying processes for prediction are unknown), and the algorithms fit differently depending on the input data.

Remote sensing data have been used as a primary source of predictor variables. Multispectral imagery, including Landsat (Ayala Izurieta and others 2021), MODIS (Sreenivas and others 2016), SPOT (Liu and others 2015a), and others, is used as a nondestructive data source to study SOC variability (Gehl and Rice 2007; Chatterjee and others 2021) through the calculation of different spectral indices. The Andean highlands have received little attention for quantifying soil C fractions. No approaches for developing predictive models that help to understand C process dynamics and the main drivers for this system have been validated, perhaps due to the high spatial heterogeneity and limited resources to conduct sampling. In this study, SOC, RSOC, and δ13CSOC (referred to as soil C target variables hereafter) were measured and used to develop predictive models using ML algorithms and publicly available remote sensing data in the Andean highlands of Junin-Peru. This study aims: i) to compare the soil C target variables among the most important land uses in the zone, ii) to analyze the performance of some ML methods for predictive modeling, iii) to find their most important environmental predictors related to land use, climate, topography, and soil properties, and iv) to spatially model SOC across the study area.

Materials and Methods

Study Area

The study was conducted in the central Peruvian Andean highlands within the districts of Junin and Carhuamayo in the department of Junin (10° 01′ S, 76° 07′ W, 4200 m a.s.l.). The study area comprised about 800 km2 within the Junin National Reserve buffer zone (Figure 1), whose primary purpose is to protect the grassland and bofedal ecosystems and biodiversity of Junin’s lake and the surrounding central Andean highlands. The Ramsar Convention identifies this site as an essential wetland area (site number 882; RSIS 2021). The climate is rainy and cold, with dry autumn/winter according to the Thornthwaite climatic classification system (SENAMHI 2022). The annual average maximum temperature, minimum temperature, and precipitation are 9 to 19 °C, − 3 to 3 °C, and 500 to 1200 mm, respectively (period 1981–2010; SENAMHI 2022). The soil in the study area is characterized mainly by a predominance of Inceptisols with a trend of high SOC concentrations and acidic pH (Rolando and others 2018).

Figure 1

The study area and the 198 soil sample locations in the vicinity of Junin Lake, located in the Central Peruvian Andean Region of the Province of Junin. The Advanced Land Observing Satellite (ALOS) Phased Array type L-band Synthetic Aperture Radar (PALSAR) digital elevation model was used for mapping.

As grasslands and “bofedales” dominate the landscape, the primary land use and the main livelihood is grazing livestock consisting of cattle and sheep, which coexist with wild vicuñas (Vicugna vicugna). In some cases, subsistence agriculture is practiced with crops of potato and maca and is limited to a few small spots of land. However, from the 1990s to the present day, maca cultivation has had a significant expansion, becoming the primary driver of land-use change and the leading disruptor of the high Andean drylands (“puna”) ecosystem (Turin and others 2018).

Measured Soil Data

The soil sampling sites were selected following the Latin Hypercube sampling (LHS) statistical method, which provides an efficient way of sampling variables, ensuring a good representation of the environmental characteristics of the study area (Carré and others 2007; Wang and others 2022; Stein 1987; McKay and others 1979). The LHS method used the multidimensional distributions of the slope, precipitation, minimum and maximum temperatures, normalized difference vegetation index (NDVI), and land cover estimated by a supervised classification from Landsat 8 imagery from the United States Geological Survey (USGS 2020) to determine the sampling locations. The sampling locations were adjusted in practice due to steep slopes and limited accessibility, resulting in the selection of 198 sites. A composite soil sample (~ 5 kg) was gathered at each sampling site from five locations: one central point and four points positioned 2 m away in the N, S, E, and W cardinal directions. These samples were collected from the 0–0.3 m soil profile using an auger (Art’s Manufacturing & Supply Inc., model Mud Augers, USA). In addition, a pit measuring 0.8 × 0.7 × 0.5 m was dug at the central point for bulk density measurements within the 0–0.3 m soil profile, using metal cylinders of 0.05 m in diameter. Then, C stock was estimated by multiplying SOC (see its determination below) by bulk density following the procedure of Rolando and others (2017b). Unfortunately, bulk density was measured at just 64% of the selected sites due to operational constraints; therefore, values averaged by land-use and land-cover (LULC) category are reported.
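As a rough illustration of the stock calculation described above, the following sketch converts a SOC concentration and a bulk density into an areal C stock for a soil layer. The function name and the example inputs are hypothetical, and the formula assumes a fine-earth bulk density without any coarse-fragment correction, which may differ in detail from the procedure of Rolando and others (2017b).

```python
def carbon_stock_t_ha(soc_percent, bulk_density_t_m3, depth_m=0.3):
    """Areal C stock (t C ha-1) for one soil layer.

    stock = SOC fraction x bulk density (t m-3) x layer depth (m) x 10,000 m2 ha-1
    Assumption: no correction for coarse fragments (hypothetical simplification).
    """
    if soc_percent < 0:
        raise ValueError("SOC must be non-negative")
    if bulk_density_t_m3 <= 0 or depth_m <= 0:
        raise ValueError("bulk density and depth must be positive")
    return (soc_percent / 100.0) * bulk_density_t_m3 * depth_m * 10_000.0


# Hypothetical inputs for illustration only (not measured values from this study)
example_stock = carbon_stock_t_ha(soc_percent=5.0, bulk_density_t_m3=1.0)
print(round(example_stock, 1))
```

Because the stock scales linearly with both SOC and bulk density, a soil with low bulk density can still hold a large stock when its SOC concentration is high enough.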

Composite soil samples were analyzed for texture and pH using a hydrometer and suspension potentiometer (water in 1:1 relation) at the Soil Laboratory of the National Agrarian University La Molina, Lima, Peru. The soil C target variables’ values were determined using a Combustion Module coupled to a Cavity Ring-Down Spectroscopy (CM-CRDS) system based on the procedure of Liu and others (2018) for SOC and δ13CSOC. Thus, a soil subsample per site was sieved to < 2 mm, dried at 60 °C, and ground with a mortar and pestle. Then, the final soil sample weights to be analyzed were determined by land-use type based on their mean expected soil C concentration. Hence, 0.015, 0.030, 0.027, and 0.0075 g were packaged in tin capsules for maca crops, fallow and cultivated pastures, native grasslands and improved pastures, and wetlands (“bofedales,” see below), respectively. For RSOC, a second soil subsample per site was oxidized using H2O2, according to Jagadamma and others (2010), with slight modifications. Thus, 1 g of sieved soil (< 2 mm) was oxidized by adding 90 ml of 10% H2O2 for 2–3 days, centrifuged for 15 min, washed three times with deionized water, and freeze-dried. From the remaining soil, 0.075 g was weighed and packaged in tin capsules. Finally, all tin capsules were submitted to a CM-CRDS system (G2131-i Analyzer, Picarro Inc., USA). δ13CSOC was estimated from the 13C/12C natural abundance values reported by the equipment relative to the international standard VPDB (Vienna Pee Dee Belemnite) using the equations by Liu and others (2018). All the analyses were performed in the Schaeffer Lab in the Department of Biosystems Engineering and Soil Science at the University of Tennessee, Knoxville, USA.
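The conventional delta notation behind δ13C values can be sketched as follows. This is the standard definition relative to VPDB and is not necessarily the exact set of equations from Liu and others (2018). The function names are hypothetical, and the VPDB ratio used here is an assumed constant, since published values differ slightly.

```python
# Assumed 13C/12C ratio of the VPDB standard. Published values differ slightly,
# so this constant is an assumption, not necessarily the value used by the authors.
R_VPDB = 0.0111802


def delta13c_permil(ratio_sample, ratio_standard=R_VPDB):
    """Standard delta notation: (R_sample / R_standard - 1) * 1000, in per mil."""
    return (ratio_sample / ratio_standard - 1.0) * 1000.0


def ratio_from_delta(delta_permil, ratio_standard=R_VPDB):
    """Inverse conversion: recover the 13C/12C ratio from a delta value."""
    return ratio_standard * (1.0 + delta_permil / 1000.0)


# A sample with less 13C than the standard yields a negative delta value.
depleted_ratio = ratio_from_delta(-25.0)
print(round(delta13c_permil(depleted_ratio), 3))
```

More negative values therefore indicate material that is more depleted in 13C relative to the standard.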

Environmental Predictors and Land-Use and Land-Cover Categories

Given that the soil C target variables result from complex processes and interactions of several environmental factors (including topography, climate, soil properties, and vegetation), the primary environmental predictors underpinning their unique processes are likely to vary in significance. Despite this complexity and considering the limited ML experience in predicting soil C variables beyond SOC, this study utilized an identical set of features (environmental predictors hereafter) for SOC, RSOC, and δ13CSOC. Thus, the environmental predictors considered for the models were obtained from publicly available remote sensing data, soil lab analysis, and vegetation type and condition at soil sampling (see definitions in Table 1). The topographic variables were elevation (DEM, digital elevation model), slope, aspect, and topographic wetness index (TWI), derived from the Advanced Land Observing Satellite (ALOS) Phased Array type L-band Synthetic Aperture Radar (PALSAR) Radiometric Terrain Correction product. The climate indices were the minimum and maximum of the average monthly minimum (TMNN and TMNX, respectively) and maximum temperatures (TMXN and TMXX, respectively) and the average annual total precipitation (PREC), calculated from WorldClim version 2.1 climate data (period 1970–2000 with ~ 1 km resolution). Vegetation also plays a vital role in these carbon variables, so the nine spectral bands and several vegetation indices were estimated from a Landsat 8 Operational Land Imager (OLI) image from November 26th, 2014 (see list in Table 1). Remote sensing data were preprocessed using Environmental Systems Research Institute (ESRI) ArcGIS software (ESRI, 2011, Redlands, CA).
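Two of the spectral indices named in this study, NDVI and NDMI, are normalized band differences, and a minimal sketch of their usual definitions is given below. The band pairings follow the common formulations (near-infrared with red for NDVI, near-infrared with short-wave infrared-1 for NDMI); the exact definitions applied by the authors are those listed in Table 1. The function names and reflectance values are hypothetical.

```python
def normalized_difference(band_a, band_b):
    """Per-pixel (a - b) / (a + b); returns None where the denominator is zero."""
    result = []
    for a, b in zip(band_a, band_b):
        denominator = a + b
        result.append(None if denominator == 0 else (a - b) / denominator)
    return result


def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndmi(nir, swir1):
    """Normalized difference moisture index: (NIR - SWIR1) / (NIR + SWIR1)."""
    return normalized_difference(nir, swir1)


# Hypothetical surface reflectance values for three pixels (illustration only)
nir = [0.40, 0.30, 0.0]
red = [0.10, 0.30, 0.0]
swir1 = [0.20, 0.30, 0.0]
print(ndvi(nir, red))
print(ndmi(nir, swir1))
```

Both indices are bounded between -1 and 1 for non-negative reflectance, which makes them comparable across pixels with different overall brightness.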

Table 1 Features Considered as Potential Environmental Predictors in Soil Organic Carbon Prediction

In addition, as the predominant vegetation was grasslands and grasslands converted into maca fields, finer land-use and land-cover (LULC) categories were defined depending on the type of grassland, condition, and history (see Figure 2). “Vigorous grasslands” (n = 45) were defined as healthy, tall grasslands with good cover and sparse bare soil. “Partially degraded grasslands” (n = 57) referred to medium-sized, sparse grasslands with some bare soil, whereas “degraded grasslands” (n = 47) were typified as low and sparse grassland cover surrounded by abundant bare soil. All grassland categories comprise land that has been neither cropped nor otherwise disturbed. “Improved pastures” (n = 5) referred to grasslands with introduced cultivated species such as white clover (Trifolium repens) and red clover (Trifolium pratense) through inter-seeding, implying minimal disturbance since it does not require plowing. “Cultivated pastures” (n = 24) referred to native grasslands transformed into an association of species such as ryegrasses (Lolium multiflorum, Lolium perenne) and clovers (Trifolium spp.), introduced 40 years ago in the case of the multi-communal cooperative system and 15 years ago in the farmer community system. “Bofedales” (n = 10) are a type of Andean highland wetland with hydromorphic vegetation that generally accumulates peat and is seasonally or permanently saturated with water (Monge-Salazar and others 2022). “Fallow 1” (n = 13) referred to bare soils from recently harvested maca crops or up to 2 years of fallow, on land recently converted from vigorous or partially degraded grasslands by plowing. “Fallow 2” (n = 20) was composed of bare soils with sparse invasive grass species on maca fields harvested 3 to 5 years earlier, likewise converted from “vigorous” or “partially degraded grasslands” by plowing. “Fallow 3” (n = 12) referred to sparse, low vegetation of invasive grass species resulting from long-standing maca fallow (> 5 years) on grasslands converted to maca cropland (Table 2).

Figure 2

Photos of land-use and land-cover categories (see definition in Materials and Methods section): A Bofedales, B Cultivated pastures, C Improved pastures, D Vigorous grasslands, E Partially degraded grasslands, F Degraded grasslands, G Fallow areas 0–2 years after maca cultivation (Fallow 1), H Fallow areas 3–5 years after maca cultivation (Fallow 2), I Fallow areas > 5 years after maca cultivation (Fallow 3).

Table 2 Model Performance Metrics (R2, Coefficient of Determination, and RMSE, Root Mean Square Error) for Random Forest (RF), Artificial Neural Networks (ANNs), Support Vector Machine (SVM), and eXtreme Gradient Boosting (XGB) Algorithms on the Models’ Training and Testing of Soil Organic Carbon (SOC), Refractory SOC (RSOC), and 13C Isotopic Composition of SOC (δ13CSOC)

Modeling Approach

From the 42 potential environmental predictors considered for this study (33 numerical and nine categorical from LULC, Table 1), some may be nonessential or redundant, and it is preferable to identify and exclude them before model building. Preselecting the minimal-optimal or all-relevant features to include (feature selection) helps optimize the model prediction and reduce overfitting (Parsaie and others 2021). Among the feature selection methods, the Boruta method (Kursa and Rudnicki 2010) has yielded better results when working with environmental processes like SOC decomposition due to its ability to identify linear and nonlinear relationships from complex processes (Keskin and others 2019; Zeraatpisheh and others 2022). This study used Boruta to select all-relevant and tentative environmental predictors for building the models for every soil C target variable. Based on a random forest (RF) classification algorithm, this method adds randomized copies of the features to the dataset and classifies each attribute as unimportant, important, or tentative for a given target variable. After the Boruta feature selection, the new dataset, comprising the selected environmental predictors and the soil C target variables, underwent balancing and partitioning. This partitioning for model training and testing was based on the values of the soil C target variable, utilizing a fivefold approach (Yates and others 2022). The approach comprises five cycles of model training and testing, where each iteration involves permuting four folds for training (75%) and reserving one fold for testing (25%). The four top-performing algorithms in predicting SOC (John and others 2020; Emadi and others 2020), namely RF, Artificial Neural Networks (ANN), Support Vector Machine (SVM), and eXtreme Gradient Boosting (XGB), were employed to develop predictive models for the soil C target variables.
Due to the differences in the ranges and distributions of the environmental predictors’ values, two feature-scaling transformations were tested to determine which most improved model performance: standardization (subtracting the feature mean and dividing by the feature standard deviation, yielding mean 0 and standard deviation 1) and normalization (dividing by the feature maximum, yielding a range from 0 to 1). Following the literature recommendation, especially for regression and when variable importance is of interest, feature scaling was applied even for the tree-based algorithms RF and XGB (Strobl and others 2007; Balabaeva and Kovalchuk 2019). Then, hyperparameters were tuned using the “out-of-bag,” “tenfold cross-validation repeated three times,” and “leave-one-out cross-validation” resampling methods for RF, SVM, and ANN and XGB, respectively. For every soil C target variable modeled, performance metrics were averaged across the fivefold partitions for both the training and testing phases (see next section) and compared to identify the best predictive ML model. Once the best model was found, and because the dataset was small, the ML model was retrained using the whole dataset (without partitioning), and the important variables were evaluated. The ML models were built and assessed using R 3.6.1 and the packages “Boruta” v7.0.0 (Kursa and Rudnicki 2010) for feature selection and “caret” v6.0.86 (Kuhn and others 2019) for applying the RF, ANN, SVM, and XGB algorithms.
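The two feature transformations described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' R/caret workflow; the function names are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("constant feature cannot be standardized")
    return [(v - mean) / sd for v in values]


def max_normalize(values):
    """Divide by the feature maximum; spans 0 to 1 only for non-negative features."""
    maximum = max(values)
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return [v / maximum for v in values]


# Hypothetical feature values (illustration only)
feature = [2.0, 4.0, 8.0]
print(standardize(feature))
print(max_normalize(feature))
```

In a cross-validation setting, the mean, standard deviation, and maximum would typically be computed on the training folds only and then reused to transform the held-out fold, so that no information from the test data leaks into training.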

Finally, SOC was identified as the primary variable of interest, and its spatial distribution was mapped. The RF model was retrained using the most important spatially available environmental predictors, which included LULC, SER2, NDMI, MSAVI, NDVI, DLAKE, SWIR2, and NBR1. For LULC, a land-cover classification was performed using the RF classification algorithm in Google Colaboratory. This classification used the 198 sample sites across the nine LULC categories, additional categories for water bodies, inundated areas, urban areas, rocks, and cattails (Mantas and Caro 2023), and the same Landsat imagery used in this study. These additional land-use categories were masked together and defined as “Non-carbon storing surfaces” for mapping purposes. Furthermore, a raster depicting the Euclidean distance (the shortest straight-line distance) to Junin Lake (DLAKE) was generated based on the lake’s boundary. The remaining predictors were Landsat-based indices, which were already spatially available. The preparation of the training samples for classification, the generation of the DLAKE raster, and the raster snapping (at 30 m resolution) for all variables were conducted in ArcGIS.
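The idea behind a distance raster such as DLAKE can be illustrated with a brute-force sketch on a small grid. This is not the ArcGIS procedure used in the study. The function name, the toy mask, and the use of cell centres of a rasterized lake (rather than a vector boundary) are assumptions made for illustration.

```python
import math


def distance_to_mask(mask, cell_size=30.0):
    """For every cell, the Euclidean distance (in map units) to the nearest
    cell flagged in `mask`, measured between cell centres.

    Brute force, so it is only suitable for small illustrative grids.
    """
    sources = [(r, c) for r, row in enumerate(mask)
               for c, flagged in enumerate(row) if flagged]
    if not sources:
        raise ValueError("mask contains no source cells")
    distances = []
    for r, row in enumerate(mask):
        out_row = []
        for c in range(len(row)):
            nearest = min(math.hypot(r - sr, c - sc) for sr, sc in sources)
            out_row.append(nearest * cell_size)
        distances.append(out_row)
    return distances


# Toy 2 x 3 grid with one hypothetical lake cell in the top-left corner
lake_mask = [[1, 0, 0],
             [0, 0, 0]]
for row in distance_to_mask(lake_mask):
    print([round(d, 1) for d in row])
```

For a study-sized raster, a dedicated distance-transform routine in GIS software would be far more efficient than this quadratic loop.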

Statistical Comparison of Soil Organic C Variables and Models Performance Assessment

The Kruskal–Wallis rank sum test was used to test significant differences among LULC for the soil C target variables, followed by Dunn’s post hoc test with Holm’s correction method for adjusting p-values for multiple comparisons. For that analysis, the R packages “stats” (R Core Team 2022) and “DescTools” (Signorel and others 2022) were used. Next, the coefficient of determination (R2) and root mean square error (RMSE) were used to assess the performance of the ML models tested. R2 represents the proportion of variance explained by each ML model, and RMSE indicates the accuracy of the predicted values (Yang and others 2014). R2 and RMSE were calculated as follows:

$$R^{2} = \left( \frac{N\sum_{i=1}^{N} X_{i}\hat{X}_{i} - \sum_{i=1}^{N} X_{i} \sum_{i=1}^{N} \hat{X}_{i}}{\sqrt{\left[ N\sum_{i=1}^{N} X_{i}^{2} - \left( \sum_{i=1}^{N} X_{i} \right)^{2} \right]\left[ N\sum_{i=1}^{N} \hat{X}_{i}^{2} - \left( \sum_{i=1}^{N} \hat{X}_{i} \right)^{2} \right]}} \right)^{2}$$
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left( \hat{X}_{i} - X_{i} \right)^{2}}{N}}$$

where \(\hat{X}_{i}\), \(X_{i}\), and N are the model-predicted values, the observed values, and the total number of observed values, respectively. Higher R2 (close to 1) and lower RMSE (close to 0) mean better ML model performance. Model performance metrics were calculated as the average across the fivefold partitions for training and testing.
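A direct transcription of the two formulas above into code might look like the sketch below, where R2 is the squared Pearson correlation between observed and predicted values. The function names and the toy data are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    sum_x = sum(observed)
    sum_p = sum(predicted)
    sum_xp = sum(x * p for x, p in zip(observed, predicted))
    sum_x2 = sum(x * x for x in observed)
    sum_p2 = sum(p * p for p in predicted)
    numerator = n * sum_xp - sum_x * sum_p
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_p2 - sum_p ** 2))
    if denominator == 0:
        raise ValueError("R2 is undefined when either series is constant")
    return (numerator / denominator) ** 2


def rmse(observed, predicted):
    """Root mean square error between predicted and observed values."""
    n = len(observed)
    return math.sqrt(sum((p - x) ** 2 for x, p in zip(observed, predicted)) / n)


# Hypothetical values: predictions are perfectly correlated but biased
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
print(r_squared(observed, predicted))
print(round(rmse(observed, predicted), 4))
```

The toy example illustrates why both metrics are reported together: a correlation-based R2 can equal 1 for systematically biased predictions, while RMSE still reflects the size of the errors.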

Results

Soil C Measurements by LULC

SOC values ranged from 1.67 to 17.77%, with the lowest value found in “degraded grasslands,” which was significantly lower than that of “bofedales” (p-value < 0.001) and “partially degraded grasslands” (p-value < 0.01) (Figure 3A, Table S1). The highest SOC value was found in “bofedales,” being significantly (p-value < 0.05) 2–3 times higher than that of the other LULC categories except for “improved pastures” (Figure 3A, Table S1). RSOC values ranged from 0.01 to 2.58%, with the lowest and highest values found in “Fallow 1” and “bofedales,” respectively (Figure 3B). RSOC in “cultivated” and “improved pastures” was not significantly (p-value > 0.05) lower than in “bofedales,” where it was 2–3 times higher than in the other LULC categories (Table S1). Values of δ13CSOC ranged from − 29.09 to − 20.35 ‰, with the highest value found in “Fallow 3” (Figure 3C). The lowest value was found in “bofedales” (all its values were below the overall mean of − 24.76 ± 0.074 ‰), which was significantly different from all the other LULC categories except “cultivated” and “improved pastures” (Figure 3C, Table S1). “Fallow 3” (p-value < 0.01), “degraded grasslands” (p-value < 0.01), and “partially degraded grasslands” (p-value < 0.05) showed significant differences compared to “cultivated pastures” (Table S1). Bulk density in “bofedales” was approximately half that of the other LULC categories (0.49 t m−3 vs. 0.98–1.09 t m−3), while carbon stock was nearly twice as high (210.9 t ha−1 vs. 97.4–126.3 t ha−1) (Table S2).

Figure 3

A Soil organic carbon (SOC), B refractory SOC (RSOC), and C 13C isotopic composition of SOC (δ13CSOC) on different land-use and land-cover categories (see definition in Materials and Methods section). Red dashed horizontal line represents the global average.

Model Performance and Comparison

Overall, RF consistently outperformed the other ML algorithms in modeling the soil C target variables, achieving R2 > 0.87 during training, where it was surpassed only by XGB in the RSOC model (R2 = 0.95), and R2 > 0.42 during testing, where it was surpassed only by ANN in the RSOC model (R2 = 0.50) (Table 2). Thus, following the criteria indicated in Sect. “Statistical Comparison of Soil Organic C Variables and Models Performance Assessment” and analyzing the average fivefold R2 and RMSE values for training and testing (Table 2), RF was selected as the most appropriate model for predicting SOC and δ13CSOC, and ANN for predicting RSOC.

Explanatory Variables

The environmental predictors excluded (see selection criteria in Sect. “Modeling Approach”) from the model building of the soil C target variables were SILT (silt content), CLAY (clay content), BLUE (blue band), GREEN (green band), SWIR1 (short-wave infrared-1 band), SLOPE, ASPECT, TWI, TMNN, TMNX, and “Cultivated pastures” (data not shown). A total of 23, 22, and 20 out of the 42 environmental predictors were selected (see selection criteria in Sect. “Modeling Approach”) for building the models for SOC, δ13CSOC, and RSOC, respectively (data not shown). From the selected environmental predictors, “bofedales” were identified as the most critical for SOC, followed by SER2 (Spectral vegetation indexes 2) and NDMI (Normalized Difference Moisture Index), both of which were considerably less important (Figure 4A). Regarding δ13CSOC, “bofedales” also were the most critical environmental predictor, followed by NDMI and DLAKE, which had similar importance, and then by NIR (Near-infrared band) and pH, which were the next ones in importance (Figure 4B). For RSOC, pH was as critical as “Fallow 3,” followed by SWIR2 (short-wave infrared-2 band), “bofedales,” and EVI (Enhanced Vegetation Index) with lower importance (Figure 4C).

Figure 4

Rankings of the most important environmental predictors defined for the best-performing machine learning model for soil organic carbon (SOC) with random forest (A, in %), 13C isotopic composition of SOC (δ13CSOC) with random forest (B, in ‰), and refractory SOC (RSOC) with Artificial Neural Network (C, in %). Importance is defined as the increase in the MSE prediction when the variables are permuted. The environmental predictors are described in Table 1.
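The permutation-based notion of importance mentioned in the caption above can be sketched generically as follows. This is a simplified, model-agnostic illustration and not the implementation in the R packages used by the authors (random forest implementations, for instance, commonly evaluate permutations on out-of-bag samples). The function names, the toy model, and the toy data are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, feature_index,
                           n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling one feature column.

    `predict` maps one row (a list of feature values) to a prediction.
    """
    rng = random.Random(seed)
    baseline = mse(targets, [predict(row) for row in rows])
    increases = []
    for _ in range(n_repeats):
        column = [row[feature_index] for row in rows]
        rng.shuffle(column)
        permuted = [row[:feature_index] + [value] + row[feature_index + 1:]
                    for row, value in zip(rows, column)]
        increases.append(mse(targets, [predict(row) for row in permuted]) - baseline)
    return sum(increases) / n_repeats


# Toy model that depends on feature 0 only (hypothetical data)
def toy_model(row):
    return 2.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(10)]
targets = [toy_model(row) for row in rows]
print(permutation_importance(toy_model, rows, targets, feature_index=0) > 0)
print(permutation_importance(toy_model, rows, targets, feature_index=1))
```

A feature the model ignores shows no increase in error when shuffled, whereas shuffling an influential feature breaks its link with the target and raises the MSE.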

SOC Mapping

The land-cover classification yielded accuracies of 95% during training and 60% during testing. Predicted SOC values within the study area ranged from 2.7 to 11.5% (Figure 5). The highest SOC values were predominantly found north and south of Junin’s lake, mainly in the “bofedales” zone. Areas with the next highest SOC values were found in the southernmost part of the study area, primarily corresponding to cultivated pasture zones. Conversely, the lowest SOC values were predicted in the western part of the Reserve buffer zone.

Figure 5

Predicted spatial distribution of soil organic carbon across the Lake Junín Region, Junín, Peru. This map showcases the distribution, as inferred by a random forest algorithm utilizing the eight most significant environmental predictors available spatially. Non-carbon storing surfaces correspond to classes such as water bodies, inundated surfaces, cattails, rocks, and urban areas.

Discussion

“Bofedales” as Essential Reservoirs of Soil Organic Carbon in the Andes Highlands

Bofedales showed higher SOC amounts compared to other assessed land uses (Figure 3A). Even though these wetlands have been recognized as an essential reservoir of SOC in the Andes (Alavi-Murillo and others 2022; Segnini and others 2013), they receive little or no attention in policy advocacy and conservation/restoration actions (Maldonado 2014). The SOC range of values in this study (3.2–17.8%) was in the lower range of values reported by other studies (13.2–83.2%) (Cooper and others 2010; Segnini and others 2010; Alavi-Murillo and others 2022; Monge-Salazar and others 2022). Plant biomass extraction from the soil through “champeo” and overgrazing has been reported in the study area (Caro and others 2007, 2014; Salvador and others 2014; Mantas and Caro 2023); these perturbations could promote SOC reduction. On the other hand, C stock values (in the 0–0.3 m soil profile) found in this study (Table S2) were slightly lower than those reported in the literature for “bofedales” (211 vs. 230–306 t ha−1 from Segnini and others 2010), grasslands (102–126 vs. 135–144 t ha−1 from Farley and others 2013), fallows (106–11 vs. ~ 123 t ha−1 from Rolando and others 2017b) and pastures (97–119 vs. ~ 136 t ha−1 from Rolando and others 2017b). There were no significant differences in C stock values among LULC categories except for “bofedales,” where the value was almost twice as high (Table S2). In “bofedales,” C stocks are larger and extend deeper than in the other LULC categories, ranging from 30 to 700 t C ha−1 per meter of peat depth (peat thickness can reach as deep as 15 m). The study findings highlight the importance of “bofedales” as a reservoir of SOC and its stable C fractions and call for their conservation and restoration (see Scale, Reach, and Impacts of Land-use Changes and their Implications for Conservation Section).

The greater depletion of δ13CSOC in “bofedales” (ranging from − 29.5 to − 25.0 ‰) compared with the other LULC categories suggested that SOC was formed from plants growing without water restriction and with better photosynthetic performance, which discriminate more strongly against 13C (Farquhar and others 1989; More and others 2022). This finding highlights the potential for relatively high primary productivity in “bofedales” in this Andean ecosystem. On the other hand, high enrichment of δ13C is also related to higher fractions of persistent SOC pools (Ehleringer and others 2000), which is consistent with our findings considering that “bofedales” showed higher RSOC (1.10 ± 0.23%) than the other LULC categories (Figure 3B). Furthermore, Segnini and others (2010) found an increase in persistent SOC pools with soil depth in Andean “bofedales.”

Highland grasslands have been reported as other important reservoirs of C stocks and SOC in the Andes (Gibbon and others 2010; Zimmermann and others 2010; Farley and others 2013). In our study area, Rolando and others (2017b) detected that cultivated pastures showed similar values of SOC but a higher depletion of δ13C (4.5 ± 0.2% and − 26.0 ± 0.1 ‰, respectively) than native grasslands (4.6 ± 0.3% and − 25.6 ± 0.1 ‰, respectively) and fallow areas (4.1 ± 0.3% and − 25.6 ± 0.1 ‰, respectively). This has been interpreted as a higher depletion of δ13C in cultivated pastures resulting from the incorporation of N-fixer species (white clover) and long-standing perennial grasses (like ryegrass), manure, and supplemental irrigation. In this study (in agreement with Rolando and others 2017b), the “cultivated pastures” LULC category showed significantly more depleted δ13CSOC (− 25.4 ± 0.16 ‰) than “partially degraded” (− 24.6 ± 0.12 ‰) and “degraded grasslands” (− 24.5 ± 0.12 ‰), and fallow areas after three years (− 23.7 ± 0.46 ‰) (Figure 3). This result suggested that the vegetation that formed SOC in cultivated pastures had better physiological performance and that soil in degraded grasslands and fallow areas likely had more labile C forms.

RF as a Promising ML Algorithm for Predicting Soil C Variables in the Andean Highlands

Overall, among the ML algorithms, RF performed the best, capturing C processes’ nonlinear interactions with acceptable and consistent R2 and RMSE performances (Table 2), which agrees with most of the reported SOC modeling studies. In the literature, the performance of ML algorithms predicting SOC is highly variable. It depends on multiple factors, like the observed sample size, number and type of covariates, time–space resolution, extent of the study area, and model algorithm (Grunwald 2022). Sample size has a more significant effect than the model algorithm on the model performance (Somarathna and others 2017). R2 is among the most reported model performance indicators for ML regression algorithms for soil C models due to its more straightforward interpretation, especially when comparing applications across multiple sites, where target value ranges and/or units may differ too much for RMSE to be comparable. However, most of these R2 values ranged from 0.24 to 0.68 (from first to third quartile) (Grunwald 2022), reflecting little understanding of the main drivers and methods for predicting SOC. For this study, the R2 of predicted soil C target variables varied from 0.42 to 0.50 for the best ML algorithms, agreeing with other studies with small sample sizes and similar covariates (Zeraatpisheh and others 2022). Using multi-temporal data or soil nutrient indicators as covariates has been a strategy to counter the effect of a small sample size, allowing somewhat higher R2 values of 0.58–0.68 (John and others 2020; Shafizadeh-Moghadam and others 2022). Therefore, the moderate performance of the models, especially in predicting δ13CSOC, suggested that the processes involved are too complex for the given small sample size and/or that some essential variables at the correct time–space scale were missing as covariates.
Regarding the RMSE, the SOC prediction reached 1.47%, which seems high, but considering the small sample size and the high SOC values from “bofedales,” it is fair and in the midrange of the values reported across multiple SOC studies, from 0.59 to 2.7 (Padarian and others 2019; Peng and others 2015; Safanelli and others 2020). Few studies have modeled C fractions other than SOC with ML techniques; for example, Adi and Grunwald (2020) and Keskin and others (2019) modeled the persistent C fraction at 0–0.2 m depth for the state of Florida using 850 and 1014 soil samples and 151 and 327 environmental predictors, respectively. When employing the RF algorithm, these studies achieved acceptable R2 values of 0.68 and 0.72, respectively. This suggests that model performance could be improved by adding sampled data and potential environmental predictors. The ANN model was selected for RSOC predictions due to its balanced performance in the training and testing phases. Although XGB and RF demonstrated superior learning capabilities during training, ANN performed well in training and exhibited the best generalization to unseen data in the testing phase (Table 2).
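The two performance indicators discussed in this section can be illustrated with a minimal Python sketch. It assumes the coefficient-of-determination form of R2 (1 − SS_res/SS_tot); the source does not state which definition was used, and the SOC values below are invented for illustration only, not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, expressed in the units of the target variable."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (unit-free)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical SOC values (%), invented for illustration only.
observed_soc = [2.0, 4.0, 6.0, 8.0]
predicted_soc = [3.0, 4.0, 5.0, 8.0]

print("RMSE (%):", rmse(observed_soc, predicted_soc))
print("R2:", r_squared(observed_soc, predicted_soc))
```

Because RMSE carries the units and range of the target, it is hard to compare across sites with different SOC ranges, whereas R2 is unit-free; this is the contrast drawn in the text above.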

Vegetation and Climatic Indices as Essential Predictors of Soil Organic Carbon

Quality and quantity of SOC are mainly determined by a soil’s physical and chemical environment, physical accessibility of organic matter to biological agents (that is, microbes and/or enzymes), and the ratio of C inputs to losses (Krull and others 2003; Luo and others 2017; Sing and others 2018; Dynarski and others 2020). Even though land use significantly affects both labile and persistent C pools (Liu and others 2020; Padbhushan and others 2022; Smith 2008), the latter respond much more slowly than labile C pools to land-use and other human-induced changes (for example, land management) (Dynarski and others 2020; Padbhushan and others 2022; Sainepo and others 2018). Thus, LULC was one of the leading environmental predictors, with “bofedales” the most relevant class for SOC and δ13CSOC, and “Fallow-3” for RSOC. Several studies have highlighted the importance of LULC as a predictor variable for SOC (Emadi and others 2020; Keskin and others 2019; Xiong and others 2014) and RSOC (Keskin and others 2019; Xiao and others 2022) using ML algorithms. Regarding δ13CSOC, Wang and others (2015) stress that litter quality and soil water can increase carbon isotope fractionation during organic matter decomposition. Because soil 13C isotope composition (δ13C) is strongly influenced by leaf (litter) δ13C, variations in this variable can be influenced by LULC, which determines the type and quality of litter inputs into the soil (Smith and Chalk 2021; Wang and others 2013). Thus, δ13C values in labile C pools (that is, relatively “new” material) would be closer to those of the current vegetation, whereas persistent C pools (that is, older material) show relatively enriched δ13C values due to isotopic discrimination of the heavy isotope in soil organic matter compounds (Wang and others 2013).
In addition, the crucial role of soil water, soil temperature, and pH during soil organic matter decomposition has been highlighted, as they increase the activity of soil fauna and microorganisms (Wang and others 2013; Wang and others 2015; Smith and Chalk 2021). Accordingly, we found that for “bofedales,” some indicators of soil water (DLAKE and NDMI) and vegetation (SER2), as well as pH, were relevant environmental predictors for δ13CSOC and RSOC (Figure 3). The greater relevance of pH for RSOC and δ13CSOC than for SOC could be due to its impact on the activity and growth of microorganisms, which metabolize the different forms of C, resulting in a variation in the organic carbon isotopic composition of the soil (Neina 2019; Klink and others 2022). Also, soil pH can affect the interactions between soil minerals and organic matter, which determine the preservation and stability of C (Neina 2019). Although some soil variables, such as clay content, have been reported as essential predictors for SOC (John and others 2020; Davy and Koen 2013), clay content was not highly relevant in this study, likely because pH outweighs other chemical indicators in explaining SOC in Andean highland soils (Alavi-Murillo and others 2022).

The relationship between SOC and remotely sensed and easily accessible variables has rarely been reported (Mirchooli and others 2020). However, Lamichhane and others (2019) reported that these variables were among the top five for SOC prediction. NDMI is a vegetation index that detects vegetation water content and is a good predictor of SOC in ML methods (John and others 2020). Mirchooli and others (2020) found that the coloration index and NDMI were the most critical environmental predictors for SOC prediction in the RF model, followed by elevation, NDVI, and slope. NDMI is indirectly related to soil moisture in the surface layers (0–0.3 m), and soil moisture can prevent the net loss of organic soils through oxidation (Liu and others 2015b). In this study, NDMI was the main environmental predictor for both SOC and δ13CSOC under the RF model, followed by SER2, NIR, and NBR1. These last variables are closely related through the NIR and SWIR2 bands, found at wavelengths from 850 to 2200 nm. Bishop and others (2008) found a strong absorption near 1400 nm (also for kaolinite) and 1900 nm, indicating the presence of water bound in the interlayer lattices of soil. This could provide the conditions for a physical protection mechanism through the interaction of SOC with the soil mineral matrix and the stabilization process by aggregate formation (Krull and others 2003). Also, Al-abbas and others (1972) reported an inverse relationship with SOC near this region of the spectrum. Taken together, these properties could explain why variables based on this spectral region were among the best environmental predictors for SOC and δ13CSOC.
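As a hedged illustration of how an index such as NDMI is derived, the sketch below uses the commonly used formulation (NIR − SWIR1) / (NIR + SWIR1). The band choice, the function names, and the reflectance values are assumptions made for illustration; the text above does not specify the sensor or processing chain.

```python
def ndmi(nir, swir1):
    """Normalized Difference Moisture Index for a single pixel.

    nir and swir1 are surface reflectances (assumed to be unitless fractions).
    Returns None when both bands are zero, where the index is undefined.
    """
    denominator = nir + swir1
    if denominator == 0:
        return None
    return (nir - swir1) / denominator


def ndmi_grid(nir_band, swir1_band):
    """Apply ndmi() pixel by pixel to two equally shaped 2-D lists."""
    return [
        [ndmi(n, s) for n, s in zip(nir_row, swir1_row)]
        for nir_row, swir1_row in zip(nir_band, swir1_band)
    ]


# Hypothetical reflectances, invented for illustration only:
# a moist, vegetated pixel next to a dry, bare-soil pixel.
nir_band = [[0.30, 0.20]]
swir1_band = [[0.10, 0.30]]

print(ndmi_grid(nir_band, swir1_band))
```

Higher values indicate more water in the canopy, which is consistent with NDMI acting as an indirect indicator of surface soil moisture.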

Scale, Reach, and Impacts of Land-use Changes and Their Implications for Conservation

The extraction of vegetation and part of the topsoil of “bofedales” and grasslands (an activity locally called “champeo”) has been carried out for decades by rural inhabitants (Caro and others 2014). “Champeo” allows the local population to guarantee fuel for domestic use (mainly cooking); however, it also constitutes a critical perturbation affecting SOC accumulation (Table 3). Overgrazing caused by domestic livestock is another activity reported in the study area (Caro and others 2007; Salvador and others 2014) that reduces peat production and can affect SOC pools in the assessed “bofedales” and grasslands. Both perturbations (“champeo” and overgrazing) are the most important drivers of the change from vigorous/native to degraded “bofedales”/grasslands, reducing SOC (Figure 3) and provisioning, regulation, and supporting ecosystem services (Table 3). Land policy reforms during the 1970s promoted the establishment of a multi-communal agrarian company (SAIS Tupac Amaru) in the region, covering more than 0.2 Mha, to increase grassland productivity for livestock (Diez 2020). Through these reforms, the natural grasslands on these lands were managed by incorporating productive pastures (ryegrass-white clover), irrigation, inorganic–organic fertilization, and rotational livestock grazing (Rolando and others 2017b). The land-use change from native to cultivated grasslands was the only one that was not considered a perturbation; it increased plant productivity (see first Discussion section) and soil health, promoting provisioning, regulation, and supporting ecosystem services (Table 3). Land-use changes caused by crop encroachment in highland grasslands are considered one of the most critical perturbations that threaten the ecosystem services of these landscapes in the highland Andes region (Tovar and others 2013; Rolando and others 2017a).
Climate change, which facilitates the upward expansion of agriculture (Tovar and others 2012; Arce and others 2019), and socioeconomic factors such as increasing international market demand (for example, for quinoa; Gamboa and others 2020) have been crucial drivers of Andean grassland transformation.

Table 3 Impacts Caused by the Primary Land-Use Changes in the Study Area and Highland Andean Ecosystems as Reported in the Literature

In the study area, maca (Lepidium meyenii) cultivation has gradually extended into the grasslands of Junin since the early 1990s for local, American, and European markets. Still, its expansion was massive in 2011–2015 to meet the high demand of the Asian markets. This led to a rapid transformation of the high Andean landscape with direct consequences for “puna” ecosystem services, such as a decrease in grassland primary production, reduced grazing areas, reduced land cover, and loss of water infiltration and retention capacity of soils, besides changes in the main livelihood (Turin and others 2018) (see Table 3). This study corroborates findings previously reported in the field (Rolando and others 2017b, 2018), highlighting the occurrence of a degradation process following maca cultivation (as indicated in Table 3), particularly in steep terrains. Swift restorative measures are imperative to reinstate the ecosystem services provided by grasslands. Despite the inclusion of high Andean natural pasture management for greenhouse gas reduction within the Nationally Determined Contribution (NDC), outlined by the Peruvian multisectoral working group (MINAM 2019), further measures are warranted to ensure the preservation of soil C stored within grasslands and the unique “bofedales” ecosystems. Economic and social incentives for pastoralists must be implemented to guarantee the establishment of best management practices (rotational grazing, improved fallows with legumes, water harvesting, and wetland and grassland restoration) to avoid the expansion of the agricultural frontier. Special attention must be given to “bofedales,” which occupy around 0.8% of Peru’s surface (~1.05 Mha) and are found predominantly in mine concessions (41% of total “bofedales” surface), with 21% of them under the custody of rural inhabitants (Fuentealba and Rios 2023).
Although an increase of +2% year−1 in areas of “bofedales” (due to greater availability of water resources in dry seasons caused by deglaciation) has been reported for the 1986–2005 period in the southern Andes (Pauca-Tanco and others 2020), in recent years there has been a reduction in areas of “bofedales.” Some studies reported an area loss rate of −3.8 to −0.4% year−1 during the 2005–2016 period (Machuca-Crespo 2018; Pauca-Tanco and others 2020; Pamo-Sedano and Oscco-Coa 2022). These ecosystems can be restored by establishing artificial “bofedales,” which can preserve the same ecosystem services as natural ones, as noted in recent studies (Monge-Salazar and others 2022).
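To give a sense of scale for the reported loss rates, the following sketch compounds a constant annual rate over the 2005–2016 period. This is a simplifying assumption made here for illustration: the cited studies may have computed their rates differently, and the function name and the count of 11 annual steps are not taken from the source.

```python
def remaining_fraction(annual_rate_percent, years):
    """Fraction of the initial area left after compounding a constant annual rate.

    annual_rate_percent is negative for losses (for example, -3.8 for -3.8% per year).
    """
    return (1.0 + annual_rate_percent / 100.0) ** years


# Assumption: the 2005-2016 period is treated as 11 annual steps.
years = 2016 - 2005

fastest_loss = remaining_fraction(-3.8, years)
slowest_loss = remaining_fraction(-0.4, years)

print(f"Area remaining at -3.8% per year: {fastest_loss:.1%}")
print(f"Area remaining at -0.4% per year: {slowest_loss:.1%}")
```

Under these assumptions, the reported range corresponds to losing somewhere between roughly one-twentieth and one-third of the initial “bofedales” area over the period.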

The present study was conducted in the Junin National Reserve, which covers 5303.9 and 3608.8 ha of “bofedales” in the Junin and Pasco departments, respectively (Fuentealba and Rios 2023). Conservation areas can be crucial as living labs to test and monitor restoration activities involving local communities, thus improving the geospatial modeling of SOC to build an interoperable public digital infrastructure that can serve as a monitoring-verification system for future compensation schemes for the benefit of indigenous pastoralists and rural inhabitants. Focusing on the ecologically significant and delineated regions of the Junin National Reserve and its buffer zone, our predictive mapping depicted distinct variations in SOC distribution. Specifically, within the reserve itself, approximately 32% of the C-storing surfaces had SOC values over 9.6%, compared to only 8% within its buffer zone (Figure 5). While RSOC and δ13CSOC are key variables that provide valuable information, the high importance of pH (a site-specific sampled predictor) in their models limited our ability to produce accurate spatial distribution maps.

Conclusion

Processes that drive SOC and fractions like RSOC and δ13CSOC in high Andean rangeland systems have not been studied yet, which makes it challenging to choose environmental predictors (LULC identification and classification, remote sensing products, climate and soil variables, among others) for their modeling. In this context, ML algorithms captured nonlinear interactions and process complexity to model the studied soil C target variables with acceptable and consistent performance. “Bofedales” were the most important reservoirs in terms of the total and the refractory fraction of SOC compared to the other land uses. Their highest depletion of δ13C is a potential indicator of higher turnover rates, high plant productivity, and C persistence. Because “bofedales” are affected by strong perturbations in the study area (extraction of vegetation and part of the topsoil, or “champeo,” and overgrazing), it is recommended to establish restoration activities to guarantee the ecosystem services of these ecosystems. For example, the management of natural grasslands through cultivated pastures showed indicators of higher productivity (more depletion of δ13C), highlighting its potential for grassland restoration after crop encroachment (like the maca crop) in this area.

Free, publicly available remote sensing data can be beneficial for SOC prediction. Vegetation indices close to the NIR band, such as NDMI and SER2, were good environmental predictors for the total soil C (SOC and δ13CSOC). However, to improve the prediction, vegetation and climatic indices must be complemented with data taken in situ, such as pH, and especially LULC, because it is the primary driver of SOC variation. Together, these variables can explain SOC dynamics, facilitating their prediction using ML algorithms. Considering the large reservoirs of C in the soils of highland Andean ecosystems, future mapping of SOC and its fractions will be essential for decision-makers and regional governments designing compensation schemes in voluntary or regulated C markets. The SOC map produced in this study can be used for this aim, and it can be improved if more soil samples are collected, especially in “bofedales,” improved and cultivated pastures, and fallow LULC classes.