Introduction

Soil organic carbon (SOC) in farmland plays a pivotal role in both soil fertility and vegetation growth [1, 2]. The spatial heterogeneity of SOC in farmland is pronounced and is influenced by diverse factors, including climate, topography, soil properties, and human activities [3]. These factors collectively determine the input of SOC in farmland to a significant extent. An accurate spatial map of farmland SOC facilitates monitoring of changes over time and therefore plays a crucial role in assessing farmland soil quality [4,5,6].

Current studies often use easy-to-measure environmental variables to predict the spatial distribution of SOC. This approach was established with the introduction of the SCORPAN model [7], which includes soil properties, climate, organisms, topography, parent material, time, and spatial position [8,9,10].

In recent years, the accuracy of farmland SOC digital mapping has been remarkably improved by combining farmland planting and management systems, including cropping system, crop type, multiple cropping index, stubble index, and distance to irrigation canals [5, 11,12,13,14,15]. However, existing studies only address typical cropping systems in certain regions, where the systems are relatively straightforward [16,17,18]. Examples are traditional rice–wheat and wheat–corn rotation systems. Complex cropping systems will exacerbate the spatial heterogeneity of farmland SOC, which brings challenges to SOC spatial prediction.

Multiple linear regression (MLR) has been widely used in soil-landscape prediction models and is a commonly used algorithm for predicting SOC [19]. The model explores the joint influence of multiple variables and is simple, intuitive, and highly interpretable [20]. However, it is limited by its assumption of linear relationships and by low prediction accuracy [21, 22]. Commonly used machine learning algorithms, such as artificial neural networks, support vector machines (SVM), and random forest (RF) [23,24,25], have high prediction accuracy and can reflect the relative importance of variables. However, these algorithms cannot reveal the relationship between SOC and environmental variables and have poor interpretability [26,27,28,29].

The Cubist model is a rule-based predictive model in which each rule is associated with an MLR sub-model [30, 31]. Pairing rules with sub-models compensates for the shortcomings of a single model, thereby improving predictive accuracy. More importantly, it fits nonlinear relationships in the form of stratified linear regression [32,33,34]. Therefore, Cubist can not only reveal the stratified linear relationship between SOC and environmental variables but also contribute to a high-precision SOC spatial distribution map.

In this study, spatial distribution maps of local cropping systems were derived using multi-period “Environment and Disaster Monitoring and Forecasting Small Satellite Constellation System” (abbreviated as HJ-CCD) images in Tianmen, Qianjiang, and Xiantao cities situated in the Jianghan Plain. Subsequently, the Cubist model was employed to investigate the stratified linear relationship between SOC and environmental variables within the region. The primary controlling factors influencing SOC under various cropping systems were identified, leading to the creation of a spatial distribution map for SOC.

Materials and methods

Study area and soil samples

The study area is located in the hinterland of the Jianghan Plain, Hubei Province, China (112° 29′–113° 49′ E, 30° 04′–30° 54′ N). The land was formed by the alluvial deposits of the Yangtze and Han rivers, with a total area of ~ 7133 km2. The region has a typical subtropical monsoon climate, which is warm and humid, and has an average annual rainfall of 1135 mm and an average annual temperature of 17.3 °C. Rainfall in the area is mainly concentrated in the summer, accounting for 70% of the total annual precipitation. The terrain is flat, with an average elevation of 30 m above sea level, and the water level drop is small, making the area prone to flooding. The predominant soil in this area is fluvisols, with a small amount of gleysols also present. Rivers crisscross this area, and the soil is fertile, creating favorable natural conditions for crop growth. On this fertile land, crop cultivation is mainly carried out using biannual or triple-season planting methods, making it one of the important grain production areas in Hubei Province.

For this research, a total of 12,041 soil samples were collected from agricultural fields in 2015 (Fig. 1). Due to insufficient preparation of auxiliary variables in the preliminary stage and considering the challenges of accessing certain areas, we opted for a relatively simple random sampling method. The sampling depth was 30 cm. The samples were air-dried in the laboratory. After removing plant roots and gravel, they were pulverized and sieved through a 20-mesh nylon sieve. Soil organic carbon content was determined using Walkley–Black wet oxidation method [35].

Fig. 1 Location of the study area and sampling points

Acquisition of environment variables

Based on the SCORPAN model [7], 17 environmental variables were selected considering four aspects: soil properties, climate, organism, and spatial position (Table 1). Because the study area is located in a plain, terrain undulation is small and its correlation with SOC is low, so topographic factors were not included in the model. Additionally, residuals in the study area did not exhibit spatial correlation, and as a result, no further processing was applied to the residuals. The formula for the SCORPAN model is as follows:

$${S}_{{\text{a}}} \,{\text{or}} \,{S}_{{\text{c}}}= f\left(s,c,o,r,p,a,n\right)+ \varepsilon ,$$

where \({S}_{{\text{a}}} {\text{or}} {S}_{{\text{c}}}\) represents soil properties or soil type; \(s\) represents other soil information at the same point; \(c\) represents climatic factors;  \(o\) represents biological factors; \(r\) represents topographic and geomorphological features; \(p\) represents soil parent material or lithological characteristics; \(a\) represents the time factor of soil formation; \(n\) represents spatial location; and \(\varepsilon\) represents the residuals with spatial autocorrelation.

Table 1 Influencing factors of SOC

Acquisition of spatial distribution map of cropping system

The spatial distribution map of the cropping system is based on time series and is calculated from summer and winter crops. By investigating the growth cycles of winter rapeseed and winter wheat, we found that the flowering period of winter rapeseed, from March to April, coincides with the jointing stage of winter wheat, and that the two crops are clearly distinguishable during this period. Therefore, we selected a cloud-free image (ID: 2365654) taken by the HJ-CCD in late March. The environmental satellites have a spatial resolution of 30 m and a temporal resolution of 2 days, a combination that is difficult to obtain from other imagery, and they are often used specifically for monitoring environmental changes and assessing disaster risks. This image serves as the foundation for the distribution map of winter crops. In the true-color image (Fig. 2), flowering winter rapeseed appears yellow–green, so the yellow–green regions were identified as winter rapeseed fields. Winter wheat, being in its jointing stage, shows rapid leaf area growth and appears deep green. Fallow land appears in shades of pink–purple. A standard false-color composite was also used to enhance vegetation characteristics and improve interpretation accuracy. In the standard false-color composite, winter rapeseed appears pale pink, winter wheat appears red, and fallow land appears dark green. The two vertical viewing tools in ENVI software were used to compare the standard false-color composite with the true-color image and to establish areas of interest. The images were then classified with a supervised SVM classifier, and the classified results were compared with the validation samples. The overall accuracy and kappa coefficient of the classification were 91.28% and 0.86, respectively.

The distribution map of summer crops was created using 30 m resolution land use data from Hubei Province, China. Paddy fields and dry land areas were extracted from the land use types to determine the geographic distribution of summer crops. The cropping system distribution map was obtained on the ArcGIS 10.8 platform: the distribution maps for summer and winter crops were imported, the two rasters were reclassified separately, and the final map was generated using the raster calculator tool (Fig. 3).
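The reclassify-and-combine step can be illustrated with a minimal Python sketch. The integer class codes and the expression `summer * 10 + winter` are assumptions made for illustration; the paper does not state the codes or the raster calculator expression it used. The six labels follow the cropping-system abbreviations in the Fig. 6 caption.

```python
# Hypothetical class codes (not given in the paper).
# Summer crops: 1 = rice (paddy field), 2 = dry crops (dry land).
# Winter crops: 1 = winter wheat, 2 = fallow, 3 = winter rapeseed.
SYSTEM_LABELS = {
    11: "RW", 12: "RF", 13: "RR",  # rice with wheat / fallow / rapeseed
    21: "DW", 22: "DF", 23: "DR",  # dry crops with wheat / fallow / rapeseed
}


def combine_rasters(summer, winter):
    """Combine two equally shaped integer grids cell by cell.

    Each cell gets the code summer * 10 + winter, which is then mapped
    to a cropping-system label (None if the combination is unknown).
    """
    combined = []
    for summer_row, winter_row in zip(summer, winter):
        combined.append(
            [SYSTEM_LABELS.get(s * 10 + w) for s, w in zip(summer_row, winter_row)]
        )
    return combined


# Tiny invented 2 x 3 grids standing in for the reclassified rasters.
summer_grid = [[1, 1, 2], [2, 1, 2]]
winter_grid = [[1, 3, 2], [1, 2, 3]]
cropping_system = combine_rasters(summer_grid, winter_grid)
```

Encoding the two inputs in separate decimal places keeps every summer and winter combination distinct, which is the only property the combination step needs.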

Fig. 2 Remote sensing images of winter crops in true color and standard false color

Fig. 3 Spatial distribution of cropping systems

Acquisition of other environment variables

Soil types, clay content, silt content, and pH were obtained from the SoilGrids website (https://www.soilgrids.org/) at a resolution of 250 m. Spatial distribution maps of total nitrogen, total phosphorus, and total potassium contents at a resolution of 90 m were obtained from the National Earth System Science Data Center (http://soil.geodata.cn/ztsj.html). The average annual temperature and average annual rainfall data were obtained from the Chengdu Institute of Mountain Hazards and Environment (https://mp.weixin.qq.com/s/FPBT39rBDGzXe9sdunO-9Q), Chinese Academy of Sciences (Fig. 4). These meteorological data are 30-year averages from 1991 to 2020 at a resolution of 30 m. The annual maximum normalized difference vegetation index (NDVI) has a resolution of 30 m, and the annual mean NDVI has a resolution of 1 km; both were obtained from the Data Center for Resources and Environmental Sciences (http://www.resdc.cn/DOI), Chinese Academy of Sciences. Dis_IC, Dis_River, Dis_RS, and Dis_Pond were calculated on the ArcGIS 10.8 platform by using the 'Near' tool in the 'ArcToolbox' (Table 1). Administrative district data were obtained by filtering the county-level administrative boundaries through the attribute table by using ArcGIS 10.8.

Fig. 4 Spatial distribution of environmental variables. a MAP: mean annual precipitation. b MAT: mean annual temperature. c Soil types. d NDVImean: annual NDVI average. e NDVImax: annual NDVI maximum. f Clay content. g pH. h Silt content. i TN. j TP, and k TK

Spatial prediction model

Ordinary kriging

Kriging is a geostatistical interpolation technique that provides an optimal linear unbiased spatial estimate [36, 37]. It uses a semivariogram, which measures how the spatial correlation of the sample data changes with distance, to estimate the interpolation weights. Among the several kriging variants, ordinary kriging (OK) is the most commonly used, particularly in resource reserve estimation [38]. The estimation formula for the OK method is as follows:

$${Z}^{*}\left(x\right)=\sum_{i=1}^{n}{\lambda }_{i}Z\left({x}_{i}\right),$$

where \({Z}^{*}\left(x\right)\) is the value of the point to be estimated, \(Z\left({x}_{i}\right)\) is the observation at location \({x}_{i}\), and \({\lambda }_{i}\) is the weight factor of \(Z\left({x}_{i}\right).\)
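A minimal sketch of how the OK weights \({\lambda }_{i}\) can be obtained is shown here in plain Python. It assumes an exponential semivariogram with zero nugget and invented sill and range values; the paper does not report its fitted semivariogram, so these choices are illustrative only. The weights come from the standard OK linear system, whose last row forces them to sum to 1.

```python
import math


def semivariogram(h, sill=1.0, rng=10.0):
    """Exponential semivariogram with zero nugget (assumed model and parameters)."""
    return sill * (1.0 - math.exp(-3.0 * h / rng))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ok_weights(points, target):
    """Return the ordinary kriging weights for `target` given sample `points`."""
    n = len(points)
    # Semivariances between samples, bordered by the unbiasedness constraint.
    a = [
        [semivariogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
        for i in range(n)
    ]
    a.append([1.0] * n + [0.0])
    b = [semivariogram(math.dist(p, target)) for p in points] + [1.0]
    return solve(a, b)[:n]  # drop the Lagrange multiplier


def ok_predict(points, values, target):
    """Z*(x) = sum_i lambda_i * Z(x_i)."""
    weights = ok_weights(points, target)
    return sum(w * z for w, z in zip(weights, values))


# Invented coordinates and SOC values (g/kg) for illustration.
sample_points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
sample_soc = [12.0, 15.0, 11.0, 16.0]
center_estimate = ok_predict(sample_points, sample_soc, (5.0, 5.0))
```

With a zero nugget the estimator is exact at sampled locations, and at the center of this symmetric layout all four weights are equal.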

Random forest

Random forest (RF), one of the most popular machine learning algorithms, employs the concept of ensemble learning to integrate multiple decision trees. RF can handle both classification and regression problems and is now widely applied in estimating agricultural yields [39]. The algorithm repeatedly draws random samples from the sample set, and each resampled set is used to grow one decision tree, with a specified number of features considered during tree construction [40, 41]. For classification, the trees vote and the class with the most votes is the RF result; for regression, as in this study, the predictions of the trees are averaged. In this study, RF was implemented using the ‘randomForest’ package in R [40]. RF has two key parameters, namely “mtry” and “ntree”, where “mtry” is the number of features randomly sampled as candidates at each split, and “ntree” is the total number of decision trees in the ensemble. In this study, the default parameters were found to be optimal: ntree = 500 and mtry = 5.
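The reported mtry = 5 is consistent with the regression default of R's randomForest package, which is one third of the number of predictors, rounded down. The short Python sketch below reproduces that arithmetic for the 17 predictors and shows how a regression forest aggregates its trees by averaging. The per-tree predictions are invented numbers for illustration.

```python
import math


def default_mtry_regression(n_predictors):
    """Regression default for mtry in R's randomForest: max(floor(p / 3), 1)."""
    return max(math.floor(n_predictors / 3), 1)


def forest_regression_prediction(tree_predictions):
    """A regression forest returns the mean of its trees' predictions."""
    return sum(tree_predictions) / len(tree_predictions)


mtry = default_mtry_regression(17)  # 17 environmental variables in this study
# Invented per-tree SOC predictions (g/kg) for a single location.
soc_estimate = forest_regression_prediction([12.1, 13.4, 11.8, 12.7])
```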

Cubist model

Cubist is derived from Quinlan’s M5 model tree algorithm [30]. The Cubist model splits the dataset by establishing several rules. Splits are chosen to minimize the prediction error for the dependent variable, and a separate linear regression model is then fitted to each resulting subset [42]. Similar to RF, Cubist employs an ensemble learning strategy, in which the final prediction is the weighted average of all model tree predictions.

The two important parameters that need to be set when using the Cubist model are “committees” and “neighbors”. The “committees” parameter is the number of model trees used during model construction, while the “neighbors” parameter applies only to the prediction phase and is the number of nearest calibration samples considered during prediction. The “committees” parameter can be set between 1 and 100, while “neighbors” ranges from 0 to 9. When predicting a specific point, the final prediction is the sum of the model prediction and the average residual of its “neighbors”. Cubist also reports the proportion of rules and sub-models, across all committees, in which each independent variable is used, which indicates the importance of that variable in the calibration process. Two proportions are reported: the conditional contribution rate (%), which reflects use in rule formulation, and the modeling contribution rate (%), which reflects use in the linear sub-models.
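The prediction logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the Cubist implementation: the rules, coefficients, and variable names (`summer_crop`, `map_mm`, `tn`) are hypothetical, committees are combined with a plain average, and the neighbor step adds the mean residual as the text describes. Only the 1114.6 mm precipitation threshold is taken from the rules discussed later in the paper.

```python
# Hypothetical rule set: each rule pairs a condition with a linear sub-model.
# Coefficients are invented for illustration only.
RULES = [
    {"when": lambda x: x["summer_crop"] == "dry",
     "model": lambda x: 8.0 + 2.5 * x["tn"]},
    {"when": lambda x: x["summer_crop"] == "rice" and x["map_mm"] <= 1114.6,
     "model": lambda x: 9.0 + 3.0 * x["tn"]},
    {"when": lambda x: x["summer_crop"] == "rice" and x["map_mm"] > 1114.6,
     "model": lambda x: 7.5 + 3.2 * x["tn"]},
]


def rule_prediction(x, rules):
    """Average the linear sub-models of every rule whose condition holds."""
    hits = [rule["model"](x) for rule in rules if rule["when"](x)]
    return sum(hits) / len(hits)


def committee_prediction(x, committees):
    """Combine several rule sets; a plain average is assumed here."""
    return sum(rule_prediction(x, rules) for rules in committees) / len(committees)


def neighbor_adjusted(base, neighbors):
    """Add the mean residual of the nearest calibration samples.

    `neighbors` holds (observed, predicted) pairs; how the nearest
    samples are found is outside this sketch.
    """
    if not neighbors:
        return base
    mean_residual = sum(obs - pred for obs, pred in neighbors) / len(neighbors)
    return base + mean_residual


sample = {"summer_crop": "rice", "map_mm": 1150.0, "tn": 1.5}
base = committee_prediction(sample, [RULES])
final = neighbor_adjusted(base, [(13.0, 12.0), (12.5, 12.5), (14.0, 13.5)])
```

Because each rule carries its own linear model, the coefficients of a given variable can differ between strata, which is what makes the stratified relationships readable.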

The calibration process for Cubist in this study was implemented through the “Cubist” package in R [43]. After validation, the optimal parameter values of “committees” and “neighbors” were 100 and 9, respectively.

Model validation

In this study, the 12,041 sampling points were randomly divided into a calibration set and a validation set: 80% of the sampling points were used for calibration (n = 9633) and 20% for validation (n = 2408).
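A random 80/20 split of this kind can be sketched as follows. The seed and the function name are arbitrary choices for illustration; the paper does not describe how the random division was generated.

```python
import random


def split_calibration_validation(n_samples, calibration_fraction=0.8, seed=42):
    """Shuffle sample indices and split them into calibration and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_calibration = round(n_samples * calibration_fraction)
    return indices[:n_calibration], indices[n_calibration:]


calibration_idx, validation_idx = split_calibration_validation(12041)
```

Rounding 0.8 × 12,041 = 9632.8 to the nearest integer gives the 9633 and 2408 samples reported above.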

We conducted tenfold cross-validation on the calibration set to enhance the credibility of the model results. To better reflect the performance differences between models, we also performed external validation on the validation set. Root mean square error (RMSE), coefficient of determination (R2), and Lin’s concordance correlation coefficient (LCCC) were chosen as evaluation metrics. RMSE reflects the degree of deviation between the predicted and observed values. R2 measures how well the model fits the data, while LCCC considers the accuracy, consistency, and bias of the model. A good model typically exhibits a low RMSE and R2 and LCCC values close to 1. The formulas for the three evaluation indicators are as follows:

$${\text{RMSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({o}_{i}-{P}_{i}\right)}^{2}},$$
$${R}^{2}=1-\frac{\sum_{i=1}^{n}{\left({o}_{i}-{P}_{i}\right)}^{2}}{\sum_{i=1}^{n}{\left({o}_{i}-\overline{O }\right)}^{2}},$$
$${\text{LCCC}}=\frac{2r{S}_{{\text{O}}}{S}_{{\text{P}}}}{{S}_{{\text{O}}}^{2}+{S}_{{\text{P}}}^{2}+{\left(\overline{O }-\overline{P }\right)}^{2}},$$

where n is the number of validation samples, \({o}_{i}\) is the observation at sample point i, \({P}_{i}\) is the predicted value at sample point i, \(\overline{O }\) is the average of the observations, \(\overline{P }\) is the average of the predicted values, \(r\) is the Pearson correlation coefficient between the observed and predicted values, \({S}_{{\text{O}}}\) is the standard deviation of the observations, and \({S}_{{\text{P}}}\) is the standard deviation of the predicted values.
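The three metrics can be computed directly from paired observed and predicted values, as in the minimal Python sketch below. It assumes population (divide by n) variances for \({S}_{{\text{O}}}\) and \({S}_{{\text{P}}}\), which the paper does not specify, and the sample values are invented.

```python
import math


def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_o = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def lccc(obs, pred):
    """Lin's concordance correlation coefficient (population variances assumed)."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    var_o = sum((o - mean_o) ** 2 for o in obs) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred)) / n
    # 2 * r * S_O * S_P equals 2 * cov, because r = cov / (S_O * S_P).
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


# Invented SOC values (g/kg) for illustration.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 18.0]
metrics = (rmse(observed, predicted), r_squared(observed, predicted), lccc(observed, predicted))
```

Unlike R2, LCCC is penalized by the mean shift term \({\left(\overline{O }-\overline{P }\right)}^{2}\), so a model with a systematic bias scores lower even when its predictions correlate well with the observations.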

Results

Descriptive statistics

The descriptive statistics of the sample point data include minimum, maximum, mean, standard deviation, skewness, and kurtosis (Table 2). The SOC content varies between 0.60 and 54.20 g/kg. The means of the calibration and validation sets do not differ from that of the total dataset, and their standard deviations are similar to that of the total sample set. The skewness and kurtosis of the dataset are 1.063 and 2.083, respectively. A skewness of 1.063 indicates that the data are generally right-skewed, reflecting some extreme high values; a kurtosis of 2.083 indicates that the data distribution is flatter than the normal distribution, indicating few extreme values. A Kolmogorov–Smirnov (K–S) normality test run on the RStudio platform confirmed that the data satisfied the normal distribution.
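Skewness and kurtosis can be computed from central moments, as sketched below in Python. Statistical packages differ in whether they report kurtosis (3 for a normal distribution) or excess kurtosis (0 for a normal distribution); which convention Table 2 follows is not stated, so the sketch returns both. The data are invented.

```python
def shape_statistics(values):
    """Return (skewness, kurtosis, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis, kurtosis - 3.0


# Invented values: a symmetric sample and a right-skewed sample.
symmetric = shape_statistics([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = shape_statistics([1.0, 1.0, 2.0, 2.0, 3.0, 10.0])
```

A positive skewness means a longer right tail, and a kurtosis below 3 (negative excess kurtosis) means a flatter distribution than the normal.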

Table 2 Descriptive statistics for soil organic carbon samples

Soil properties in different cropping systems

One-way ANOVA in SPSS 24 was used to explore variations in SOC content and soil properties under different cropping systems. The SOC content under paddy–upland rotation is ~ 16 g/kg, and that under upland–upland rotation is ~ 12 g/kg; the SOC content in paddy fields is significantly higher than that in dry land (Table 3). Duncan’s post hoc test following the one-way ANOVA demonstrated significant differences in SOC content among cropping systems, with rice–wheat rotation having the highest SOC content. SOC content increases with TN content, suggesting a positive correlation between the two. TN content also differs significantly among cropping systems, with the highest TN content in rice–fallow rotation and the lowest in dry crop–rapeseed rotation. pH varies among the cropping systems in the study area, with the highest pH in rice–fallow rotation and the lowest in dry crop–wheat rotation; the soil in high-SOC areas is weakly alkaline.
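The F statistic behind a one-way ANOVA can be computed by hand, as in the Python sketch below. The three groups are invented SOC values standing in for cropping systems; they are not the study data, and Duncan's post hoc test is not reproduced here.

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (mean - grand_mean) ** 2 for g, mean in zip(groups, group_means)
    )
    ss_within = sum(
        sum((v - mean) ** 2 for v in g) for g, mean in zip(groups, group_means)
    )
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented SOC values (g/kg) for three hypothetical cropping systems.
f_statistic = one_way_anova_f([
    [15.0, 16.0, 17.0],
    [11.0, 12.0, 13.0],
    [13.0, 14.0, 15.0],
])
```

A large F means the differences between group means are large relative to the scatter within groups; significance is then judged against the F distribution with the two degrees of freedom.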

Table 3 Soil properties under different cropping systems

Cubist model results

Analysis of main controlling factors

Analysis of the Cubist calibration process and the contribution of each environmental variable showed that variations in conditional contribution were particularly pronounced. The conditional contribution rates of CS and MAP were high, reaching 85% and 46%, respectively (Fig. 5). The SOC varies under different CS and MAP conditions, making CS and MAP critical stratifying variables. The top five contributors to the modeling in terms of contribution rates are MAP, TN, Silt, Clay, and NDVImax. Climate and soil attributes were the most influential variables. The importance of MAP was particularly substantial, accounting for 83% of the influence. Prior research highlighted the significance of climate on a global scale [44]. However, this study demonstrates its equally important role at a local level, with precipitation being more significant than temperature in this area. Regarding soil nutrients, the importance of TN content stood out, whereas the importance of TP and TK contents was lower. The influence of pH was found to be the weakest. Soil texture emerged as a key factor influencing SOC, with clay and silt contents playing significant roles in this study. In terms of remote sensing data, NDVImax emerged as a critical variable, while NDVImean’s importance was weaker. Among distance-related factors, the importance of water-related factors such as Dis_IC, Dis_Pond, and Dis_River was higher than that of Dis_RS.

Fig. 5 Cubist importance ranking chart. MAP mean annual precipitation, Silt silt content, Clay clay content, NDVImax annual NDVI maximum, Dis_IC distance from the nearest irrigated canal, Dis_Pond distance from the nearest pond, MAT mean annual temperature, Dis_River distance from the nearest river, Dis_RS distance from the nearest rural settlement, NDVImean annual NDVI average, CS cropping system, ST soil types, AD administrative district

Stratified linear model results

The stratified rules of the Cubist model and the linear regression results for each layer are presented in Fig. 6. Zoning rules were primarily based on summer crop type, precipitation, and NDVImax. Rule 1 corresponds to the cultivation of summer dry crops. The majority of this rule is located in Tianmen City and Qianjiang City, while Xiantao City mostly consists of scattered cultivation. Rule 1 is influenced by irrigation channels and ponds, whereas Rules 2–4 remain unaffected. Rule 2 is mainly situated in the northern part of Tianmen City and the western part of Qianjiang City. As inferred from the stratified rules, Rules 1 and 2 are also influenced by Dis_RS. Rule 3 is predominantly distributed in the eastern part of Xiantao City and the southwestern part of Qianjiang City. The classification of Rules 3 and 4 is primarily based on precipitation, with Rule 3 being more influenced by precipitation than Rule 1. However, Rule 3 is not affected by soil texture. Under the premise of precipitation-based division, Rules 3 and 4 further partition zones based on NDVImax. Compared with Rule 3, Rule 4 is significantly influenced by NDVImax. These findings indicate the stratified heterogeneity between SOC and environmental variables. Moreover, main controlling factors vary significantly under different cropping systems.

Fig. 6 Cubist modeling zoning plot and zoning coefficient plot. RW rotation of rice with winter wheat, RF rotation of rice with fallow land, RR rotation of rice with winter rapeseed, DW rotation of dry crops with winter wheat, DF rotation of dry crops with fallow land, DR rotation of dry crops with winter rapeseed, Silt silt content, Clay clay content, MAP mean annual precipitation, MAT mean annual temperature, Dis_IC distance from the nearest irrigated canal, Dis_Pond distance from the nearest pond, Dis_River distance from the nearest river, Dis_RS distance from the nearest rural settlement

Model evaluation results

RMSE, R2, and LCCC were used to evaluate the prediction accuracy of the models (Table 4). The Cubist model has the highest R2 (0.292), followed closely by the RF model (0.263), then the OK model (0.211), while the MLR model has the lowest R2 (0.207). The Cubist model has the lowest RMSE, while the MLR model has the highest, indicating less accurate predictions by the MLR model. The Cubist model also has the best LCCC value (0.482). These findings suggest that the Cubist model is optimal, followed by the RF and OK models, while the MLR model demonstrates the weakest performance in the study area.

Table 4 Comparison of accuracy among OK, MLR, RF, and Cubist predictive models

SOC spatial distribution map

By observing the spatial distribution map of SOC predicted through the Cubist model (Fig. 7), an overall trend can be found in the spatial distribution of SOC in the research area, with higher values in the south and north and lower values in the east and west. Numerous localized high-SOC regions exist in the eastern part. These high-SOC regions are primarily situated in the southern, northern, southeastern, and northeastern parts of the research area, with a few also found in the southwest. The average SOC content in the study area is 13.5 g/kg, ranging from 2.7 g/kg to 34.17 g/kg. These high-SOC areas share a common characteristic: they are all located near ponds, with this feature being particularly pronounced in the southern and northeastern parts of the study area. Based on analysis of Figs. 2 and 3, the paddy fields have significantly higher SOC values than the dry land, and regions with higher MAP values, such as the eastern part of the study area, exhibit lower SOC content. The spatial distribution maps of clay and silt exhibit similar trends to the SOC spatial distribution map, with areas having higher clay content corresponding to higher SOC content. When analyzing the distribution map of TN in conjunction with SOC, a high TN content corresponds to a high SOC content, consistent with the analysis in Table 3.

Fig. 7 SOC spatial distribution map via the Cubist model

Discussion

Relationship between SOC and environmental variables

The results from the Cubist model indicate that the spatial distribution of SOC is influenced by cropping system, climate, soil nutrients, soil texture, and vegetation. The cropping system has the greatest conditional contribution rate.

Climate makes very important contributions in terms of both the modeling contribution rate and the conditional contribution rate. Numerous large-scale studies found that under natural conditions, SOC is significantly positively correlated with precipitation and negatively correlated with temperature [45,46,47,48]. Increasing temperatures can enhance microbial activity, thereby accelerating the mineralization rate of SOC [49, 50]. However, this pattern contrasts with our findings, which could be attributed to the relatively small scope of our study area, which consists predominantly of cultivated land with well-developed irrigation facilities. In such cases, higher temperatures could enhance vegetation photosynthesis, leading to increased input of crop residues and thereby favoring SOC accumulation. Excessive rainfall might result in soil erosion and leaching of SOC, leading to its loss [51]. Given the study area’s location between the Yangtze River and the Han River, excessive rainfall could result in the erosion of soil particles carrying SOC.

Soil nutrients and texture are important influencing factors of SOC, and nitrogen promotes the sequestration of SOC to a certain extent. An increase in nitrogen content can promote the production of plant biomass, thereby facilitating the input of carbon into the soil [52, 53]. In the present study, an elevated nitrogen content was associated with a greater SOC sequestration capacity, in agreement with most previous studies [54,55,56]. Clay and silt can adsorb organic carbon onto their surfaces, thereby impeding the microbial decomposition of SOC [57, 58]. Additionally, a high clay content enhances soil aggregate stability, leading to a reduction in the mineralization rate of SOC. Consequently, clayey soils often exhibit elevated levels of organic carbon [59, 60], consistent with our findings.

The vegetation index serves as a crucial metric for assessing vegetation productivity and health status [61,62,63,64]. A higher vegetation index, on one hand, indicates improved vegetation growth and higher soil nutrient content. On the other hand, it signifies elevated input of plant residues [65, 66]. A positive correlation exists between the vegetation index and SOC, consistent with the outcomes of the present research.

Fields planted with winter wheat have the highest SOC content, followed by fallow fields and rapeseed fields. Winter wheat cultivation increases tillage intensity, which may disrupt soil structure and hinder the accumulation of SOC [67, 68]. To counter this, straw returning has been adopted locally. This practice can directly increase SOC content while improving soil structure and enhancing soil fertility [69,70,71], which may explain why the SOC content under winter wheat is higher than that in fallow land. At the same time, fallow land has a higher SOC content than rapeseed fields, which may be related to tillage intensity [72, 73]. In most of the study region, a double-cropping rice–rapeseed rotation is practiced, and this high-intensity planting often leads to soil nutrient loss, which may be a key reason for the low SOC content in rapeseed fields. The carbon sequestration capacity of the rice–wheat rotation system is higher than that of other rotation systems. The factors above partly explain why soil under rice–wheat rotation has a higher SOC content. The higher content can also be attributed to the benefits of alternate wetting and drying, which stimulates carbon and nitrogen cycling and leads to an increase in SOC [74, 75].

Stratified heterogeneous relationship analysis of SOC and impact factor

The Cubist model has the smallest RMSE and possesses the highest R2 and LCCC values, making it the optimal model for the study area. This rule-based Cubist model effectively reveals the stratified heterogeneity between SOC and environmental variables. Cropping system stands out as a prominent stratifying variable within the Cubist model, with a substantial conditional contribution rate of 85%. The stratification rules indicate varying main controlling factors under different cropping systems.

The differences in cropping systems lead to significant disparities in SOC (Table 3). However, the influence of winter crops appears to be less pronounced, while summer crops predominantly determine the input of SOC. The stratified outcomes from the Cubist model validate this observation. The results highlight that paddy fields exhibit higher SOC content than drylands. This discrepancy could be attributed to the prolonged flooding of paddy fields, leading to poor soil aeration and suppressed decomposition of SOC [76, 77]. Consequently, SOC content in paddy fields tends to be higher than in well-aerated drylands, consistent with previous research outcomes.

The stratified results of the Cubist model also show that sensitivity to water-related factors depends on the summer crop. When dry crops are cultivated in summer (upland–upland rotation), SOC is influenced by water-related factors such as ponds and irrigation channels. However, when rice is cultivated in summer (paddy–upland rotation), there is no impact from ponds or irrigation channels. This finding might be closely tied to the land use types. In dry land rotations, the soil moisture content is low, making upland–upland rotations highly sensitive to water. In paddy field rotations, the soil remains flooded year-round, maintaining a high moisture content; thus, it is less responsive to subtle water changes. Both upland–upland and paddy–upland rotations are influenced by distance from rural settlements (Dis_RS). This finding could be attributed to the fact that proximity to residential areas facilitates fertilization and irrigation, thereby promoting crop growth and the accumulation of organic carbon. Hence, geographical location influences SOC dynamics to some extent.

Rules 3 and 4 both apply to paddy–upland rotations with precipitation exceeding 1114.6 mm; they differ in their vegetation indices. Rule 4 is more strongly influenced by the vegetation index than Rule 3. The vegetation index is a crucial indicator of vegetation coverage and health, and NDVImax > 0.845 in Rule 4 indicates a very healthy vegetation state. Under conditions of abundant precipitation, such healthy vegetation can greatly enhance the input of SOC.

These results underscore a significant relationship between changes in SOC content and different cropping systems. The control of SOC is closely associated with human agricultural management. The main controlling factors vary under different cropping systems. Tailoring agricultural management practices based on the main controlling factors for different regions might contribute to increased crop yields.

Limitations

In this study, the influence of factors such as cropping system on SOC is explored, and mapping is realized based on the quantitative relationship between the factors. The R2 of the model is only 0.292, which may be explained by the following limitations. The years of the soil attribute data are inconsistent with the sampling year (2015): the soil nutrient data are predictions for 2010 to 2018, and the soil texture data are predictions for 2019.

The study area falls within the Jianghan Plain, a significant grain and material production region of China characterized by frequent human activities. Human activity indicators related to SOC, such as fertilization amount, land ownership, and methods of straw returning, are challenging to quantify spatially and do not effectively reflect changes in SOC [78]. Moreover, the study area has a per capita arable land area of 1.48 mu (~ 0.1 ha), with varying cultivation and management practices among different landowners, leading to substantial random errors.

Conclusion

This study investigated the main controlling factors under a complex cropping system and employed Cubist, RF, OK, and MLR models to predict the spatial distribution of SOC. Overall, the SOC content in the study area ranged from 2.70 g/kg to 34.17 g/kg, with the highest content observed in rice–wheat rotation. The Cubist model outperformed the other models, indicating its feasibility in explaining SOC variations under intricate cropping systems. Cropping system, MAP, TN, clay content, silt content, and NDVImax were identified as the main controlling factors for farmland SOC, highlighting that lower rainfall, higher soil attributes, and increased vegetation cover contribute to SOC accumulation. The main controlling factors for SOC differed significantly across cropping systems. Summer crops exhibited a more pronounced impact on SOC spatial variation than winter crops. For paddy–upland rotation, factors such as river distance and NDVI played a key role; for upland–upland rotation, irrigation-related factors were more influential. This finding underscores the need for a greater focus on cultivating summer crops and implementing appropriate planting density in paddy–upland rotations as well as considering irrigation factors in upland–upland rotations. This work reveals the variations in main controlling factors of SOC under different cropping systems and highlights the significance of field zoning management.