Abstract
Diffuse solar radiation (DSR) plays a critical role in renewable energy utilization and efficient agricultural production. However, there is a scarcity of high-precision, long-term, and spatially continuous datasets for DSR in the world, and particularly in China. To address this gap, a 41-year (1982–2022) daily diffuse solar radiation dataset (CHDSR) is constructed with a spatial resolution of 10 km, based on a new ensemble model that combines the clear-sky irradiance estimated by the REST2 model and a machine-learning technique using precise cloud information derived from reanalysis data. Validation against ground-based measurements indicates strong performance of the new hybrid model, with a correlation coefficient, root mean square error and mean bias error (MBE) of 0.94, 13.9 W m−2 and −0.49 W m−2, respectively. The CHDSR dataset shows good spatial and temporal continuity over the time horizon from 1982 to 2022, with a multi-year mean value of 74.51 W m−2. This dataset is now freely available on figshare to the potential benefit of any analytical work in solar energy, agriculture, climate change, etc (https://doi.org/10.6084/m9.figshare.21763223.v3).
Background & Summary
Diffuse solar irradiance (DSR) is a critical component of solar radiation, substantially influenced by atmospheric conditions such as aerosols and cloud cover1,2. Accurately modeling DSR is crucial not only for agricultural applications, where it enhances plant productivity and photosynthetic efficiency by modifying the light environment, but also for broader environmental and economic impacts3,4. Variability in DSR predictions can significantly affect climate modeling and carbon budget assessments, potentially leading to large discrepancies in environmental policy and climate strategy effectiveness. Moreover, as global efforts to achieve carbon neutrality intensify, reliable DSR estimates are increasingly needed for the strategic placement and efficiency optimization of solar power systems5. Therefore, improving the accuracy of DSR prediction is essential, not only for advancing agricultural productivity but also for enhancing the reliability of renewable energy resources and supporting robust climate change responses6,7.
Observations from accurately calibrated and continuously serviced pyranometers are the most effective way to obtain reliable, long-term DSR data. However, because of construction and maintenance costs, only 17 of the 119 stations operated by the China Meteorological Administration (CMA) are equipped to monitor DSR. In contrast, satellite remote sensing can be used to derive spatiotemporally continuous estimates of surface solar irradiance on a regional or global scale8,9,10,11,12,13. Satellite-based retrieval models can be classified into two categories: semi-empirical models14 and physical models15,16. Semi-empirical models first estimate the surface global horizontal irradiance (GHI); an empirical model (among many possibilities17) is then applied to separate it into its direct and diffuse components. However, the actual performance of these empirical models is generally station dependent18, which can lead to large deviations unless a local post-processing adjustment (usually referred to as “site adaptation”) is carried out19,20. Yu21 used an improved separation model while introducing the Kt-K group criterion, which resulted in hourly DSR predictions having a relative root mean square error (rRMSE) of ≈26–44% at three control stations.
Physical radiative transfer models have been applied to estimate solar radiation around the world22,23,24, but such models are too complex and slow to be used in operational satellite retrievals. Simpler parameterized models are thus preferred for such tasks, in particular to obtain the irradiance components under ideal cloud-free conditions. Such models can be divided into the spectral25 and broadband1 types. Simplified physical models like Solar Irradiance Scheme (SOLIS) or Fast All-sky Radiation Model for Solar applications with Narrowband Irradiances on Tilted surface (FARMS-NIT) can provide spectrally resolved irradiance data under clear-sky and/or cloudy conditions26,27,28, using satellite-derived information. In parallel, the REST2 model (Reference Evaluation of Solar Transmittance, 2 bands) can estimate the broadband surface irradiance with an accuracy similar to that of more sophisticated spectral models, yet with much lower computational requirements29, and has thus been widely used for estimating global, direct, and diffuse solar radiation.
Under some cloudy and/or humid weather conditions, such as in tropical climates, the quality of satellite remote-sensing observations can be limited, which constrains the accuracy of physical models that require high-precision cloud optical property data. Whereas conventional radiation models lose accuracy in such noisy environments, machine learning (ML) algorithms are good candidates to achieve good DSR estimates under cloudy and/or hazy conditions, even with limited cloud and aerosol optical properties as input variables. Various ML methods have been developed to improve prediction accuracy in cloudy environments, including Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Networks (ANN) and Deep Neural Networks (DNN)30,31,32,33,34. For example, Shamshiband et al.35 proposed a hybrid model by integrating SVM with the Wavelet Transform algorithm. Their results demonstrate that this hybrid model yields good predictions and offers significantly higher accuracy than SVM or ANN. Fan et al.36 generated three new hybrid SVM models with heuristic algorithms; compared to the stand-alone SVM model, the hybrid models performed more precisely and reliably. Wu et al.37 generated a high-resolution DSR dataset based on a generalized additive model (GAM) that cooperates with several ML models. Zhao et al.38 established a hybrid model based on four XGB boosting models to improve the satellite-based DSR products.
The existing literature shows that ML continues to play a key role in tackling cloud uncertainty, despite the challenges39,40. On this basis, various DSR datasets have been prepared with ML methods under all-sky conditions. For example, Jiang15 produced a 12-year (2007–2018) hourly solar radiation dataset at 5-km resolution based on DNN. In parallel, Chakraborty41 used RF algorithms to generate a 40-year (1980–2019) monthly dataset using information from NASA’s MERRA-2 reanalysis, with a spatial resolution of 0.5 × 0.625°. Each has its own advantages in terms of spatial and temporal resolution or spatial/temporal extent, but lacks the strengths of the other. In addition, well-established databases such as ERA5, SARAH-E, or CERES all supply DSR datasets at different scales. However, these DSR products are subject to large uncertainties, at least over China. For instance, the RMSE and rRMSE of the daily mean DSR obtained by ERA5 and CERES products under all-weather conditions were found to be above 30 W m−2 and 40%, respectively37,42, while the rRMSE for SARAH was even above 50%.
At global or continental spatial scales, the existing DSR datasets lack simultaneous high spatial resolution and temporal continuity. In this study, the REST2 radiation model is used to generate a DSR dataset at high spatial resolution under clear-sky conditions. Subsequently, a novel hybrid model that combines the REST2 clear-sky estimates with an ML stacking model is developed to construct a continuous and gridded DSR dataset with a high resolution of 10 km over China, and a long span of 41 years (1982–2022).
Methods
Validation data
The surface DSR observations used in this study are recorded at 17 stations maintained by CMA. The 2011–2015 data from all these stations are used for the training stage of the hybrid ML model, whereas the 2010 data are used for independent validation. Moreover, the daily DSR records from the same sites but for the 2000–2015 extended period are used for the validation of the modeled DSR values across the country’s five major climate zones: Temperate continental zone (TCZ), Temperate monsoon zone (TMZ), High-mountain plateau zone (HPZ), Subtropical monsoon zone (SMZ), and Tropical monsoon zone (TOZ). Figure 1 shows the spatial distribution of the 17 radiometric sites over China, superimposed with an elevation map of the country.
The ground-based DSR observations used in this study are obtained with pyranometers that were developed by CMA. A shade disk is attached to a solar tracker to block the direct radiation from the pyranometer and only sense DSR. The quality-control method of the measured DSR data includes three successive steps: (1) Climate-dependent thresholding and permissible value check43; (2) Internal consistency check; and (3) Time continuity check.
Reanalysis data
The European Centre for Medium-Range Weather Forecasts (ECMWF) has developed the ERA5 long-term reanalysis dataset, providing several products with various spatial and temporal resolutions44. The main ERA5 reanalysis provides meteorological data at single levels or at the surface, including ozone, water vapor, etc., with a spatial resolution of 0.25° × 0.25° and a temporal resolution of 1 h since 194045. In what follows, it is referred to as ERA5-Single. In parallel, the ERA5-Land dataset, which starts only in 1950, has a finer resolution of 0.1° × 0.1°46. In addition, NASA’s Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) reanalysis provides some additional inputs for the hybrid model47. MERRA-2 products are widely used in solar applications (among others) because of their excellent spatial and temporal continuity48,49. In particular, its aerosol optical depth (AOD) estimates are considered of high quality50 and are frequently used for solar irradiance modeling11,51. Details on the reanalysis data used in this study are given in Table 1.
Clear-sky diffuse irradiance estimation
In this study, the REST2 model1 provides estimates of DSR under clear-sky conditions. REST2 consists of parameterizations of look-up tables obtained from the SMARTS spectral model25,52. The different sources of atmospheric attenuation are parameterized for each of two spectral bands (0.28–0.7 µm and 0.7–4.0 µm). The model has been thoroughly validated against various sources of high-quality irradiance data29,53,54. Version 9.1 of the REST2 model is used here; it incorporates a number of improvements (particularly related to the modeling of diffuse irradiance) compared to the publicly-available version 555. For the present application, the input variables are local standard time (Year, Month, Day, Hour), surface pressure, regional ground albedo, reduced ozone vertical pathlength, precipitable water, and AOD at 550 nm. These variables are interpolated from their ERA5 or MERRA-2 sources to a uniform resolution of 0.1° × 0.1°, and then applied as input variables to REST2 to generate 41 years of hourly clear-sky DSR data for every single pixel over China.
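The regridding step can be illustrated with a short sketch. The interpolation scheme is not specified here, so bilinear interpolation is assumed purely for illustration; the function name, the toy 0.25° cell and the AOD values at its corners are all hypothetical.

```python
def bilinear(x, y, x0, x1, y0, y1, q00, q10, q01, q11):
    """Bilinear interpolation inside one coarse grid cell.

    q00 is the value at (x0, y0), q10 at (x1, y0),
    q01 at (x0, y1) and q11 at (x1, y1).
    """
    tx = (x - x0) / (x1 - x0)  # fractional position along longitude
    ty = (y - y0) / (y1 - y0)  # fractional position along latitude
    bottom = q00 * (1 - tx) + q10 * tx  # interpolate along the y0 edge
    top = q01 * (1 - tx) + q11 * tx     # interpolate along the y1 edge
    return bottom * (1 - ty) + top * ty


# Hypothetical AOD values at the four corners of a 0.25-degree cell,
# interpolated to one point of a 0.1-degree target grid.
aod = bilinear(116.1, 40.1, 116.0, 116.25, 40.0, 40.25,
               q00=0.40, q10=0.50, q01=0.30, q11=0.60)
print(round(aod, 3))
```

Applying such a function to every target pixel and every input variable would bring the coarser reanalysis fields onto a common 0.1° grid; other schemes (nearest neighbour, conservative remapping) are equally plausible choices.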
Under clear-sky conditions, the REST2-derived DSR maps produced here for China can be considered of high quality. In turn, these clear-sky DSR maps can be used as the necessary foundation to apply the machine-learning algorithm that evaluates the cloud impacts, as explained below, and ultimately to prepare maps at relatively high spatial resolution.
All-sky diffuse solar radiation estimation
After the REST2 estimates of clear-sky diffuse solar radiation are obtained, they are combined with various cloud properties derived from ERA5 to form the input data to the ML model that is tasked to estimate DSR under all-sky conditions. It should be noted that the model is constructed from CMA station observations; because station observations at an hourly scale are difficult to obtain in China, both the input and the output of the all-sky model are set at the daily scale. In selecting the input variables, we relied heavily on guidance from previous literature and the availability of data56. Based on previous studies, it was possible to ensure that the variables selected could represent the combined effects of different meteorological and geographical factors on solar radiation.
The process of constructing the DSR estimation model based on the stacking method is shown in Fig. 2. In this study, six years of data samples from 2011–2015, totaling 36492 entries, are formed based on the matching records at 17 CMA stations. Each base learner (GBDT, RF, XGB, Bagging) builds a non-linear relationship between the input variables and the DSR results with the same training set. Then, the outputs of the base learners are combined through ensemble learning to build an improved DSR estimate.
All datasets are split into training and test sets with a 7:3 ratio, and ten-fold cross-validation is performed on the training data to separate it into sub-training and sub-validation sets. The training and validation sets of the meta-learner are obtained in the form of weighted averages. In this study, the correlation coefficient R, the mean bias error (MBE), the root-mean square error (RMSE) and the relative root-mean square error (rRMSE) are used to quantify the quality of the estimation results.
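The four statistics can be computed as in the following minimal sketch. It assumes the usual conventions, which are not spelled out in the text: MBE is taken as estimate minus observation, and rRMSE is the RMSE normalized by the mean of the observations. The function name and the sample values are hypothetical.

```python
import math


def validation_metrics(estimated, observed):
    """Return R, MBE, RMSE and rRMSE (in %) for paired samples."""
    n = len(observed)
    mean_est = sum(estimated) / n
    mean_obs = sum(observed) / n
    errors = [e - o for e, o in zip(estimated, observed)]
    mbe = sum(errors) / n                             # assumed sign: estimate - observation
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    cov = sum((e - mean_est) * (o - mean_obs)
              for e, o in zip(estimated, observed))
    var_est = sum((e - mean_est) ** 2 for e in estimated)
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    r = cov / math.sqrt(var_est * var_obs)            # Pearson correlation coefficient
    rrmse = 100.0 * rmse / mean_obs                   # assumed normalization: observed mean
    return {"R": r, "MBE": mbe, "RMSE": rmse, "rRMSE": rrmse}


# Hypothetical daily DSR values in W m-2
obs = [60.0, 75.0, 90.0, 82.0, 70.0]
est = [62.0, 73.0, 88.0, 85.0, 69.0]
metrics = validation_metrics(est, obs)
print({name: round(value, 3) for name, value in metrics.items()})
```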
The main concept behind ensemble modelling is to combine multiple base models to create a more comprehensive and robust supervised-learning model. By appropriately combining these base models, a reduction in variance or bias, or an improvement in prediction, can be achieved. One approach, known as stacking, allows the use of different types of models as primary learners. The primary learners are trained independently, and their outputs are aggregated to form the final output. The meta-model is then trained using the outputs of the primary learners as inputs, with the observed outputs of the training set as targets. In parallel, the GridSearchCV method is used to optimize the hyperparameters of the ML model. The main steps include: (i) separating the training and test samples; (ii) identifying the parameters to be optimized and setting the others to default values; (iii) adjusting them sequentially in steps within the specified parameter range; and (iv) iterating over the resulting search paths. This whole process results in what is referred to below as the “stacking model”.
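The stacking principle can be sketched in a few lines of plain Python. This is an illustration only, not the model used in this study: two deliberately simple stand-in base learners (a least-squares line and a k-nearest-neighbour mean) replace GBDT, RF, XGB and Bagging, the meta-learner is reduced to a single weight in a weighted average, five folds are used instead of ten, and the single predictor and target values are synthetic.

```python
import math
from statistics import mean


def fit_linear(xs, ys):
    """Base learner 1: least-squares line; returns a predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=3):
    """Base learner 2: mean of the k nearest training targets."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def out_of_fold(fit, xs, ys, n_folds=5):
    """Predict each sample with a model trained on the other folds."""
    preds = [0.0] * len(xs)
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), n_folds):
            preds[i] = model(xs[i])
    return preds


def fit_meta_weight(p1, p2, ys):
    """Meta-learner: weight w in [0, 1] minimizing the squared error
    of w * p1 + (1 - w) * p2 against the targets."""
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0:
        return 0.5
    num = sum((a - b) * (y - b) for a, b, y in zip(p1, p2, ys))
    return min(1.0, max(0.0, num / den))


def rmse(pred, ys):
    return math.sqrt(mean((p - y) ** 2 for p, y in zip(pred, ys)))


# Synthetic predictor (e.g. a cloud-related input) and synthetic DSR target
xs = [i / 19 for i in range(20)]
ys = [60 + 40 * x - 30 * x * x + (1.5 if i % 3 == 0 else -1.0)
      for i, x in enumerate(xs)]

oof_lin = out_of_fold(fit_linear, xs, ys)
oof_knn = out_of_fold(fit_knn, xs, ys)
w = fit_meta_weight(oof_lin, oof_knn, ys)
stacked = [w * a + (1 - w) * b for a, b in zip(oof_lin, oof_knn)]
print(round(w, 3), round(rmse(stacked, ys), 3))
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for samples the base learners have already seen, is what keeps the combination weights from rewarding overfitted base learners.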
Validation of modelled DSR using different clear-sky detection methods
The REST2 model is designed to estimate solar radiation under cloudless conditions, which means that the choice of an appropriate clear-sky detection (CSD) method is important for evaluating the accuracy of such predictions. These methods attempt to detect clear-sky periods based on an analysis of historical time series of at least one component of solar irradiance. Many methods have been proposed in the literature, and they typically provide different (more or less stringent) results, as reviewed in57. Here, three different clear-sky detection methods of various complexity and requirements are tested:
- CSD1: Simple method uniquely using the clearness-index, Kt, defined as the ratio between GHI and its extraterrestrial counterpart; this method has been popular in earlier solar resource studies.
- CSD2: Clear-sky detection method according to Long & Ackerman58; this method has been widely used by the atmospheric sciences community.
- CSD3: The BrightSun method, which is more elaborate and the most recent detection method.
The Chinese city of Xianghe hosted one radiometric station of the BSRN network until 201459. The station provided observations of the three radiation components at 1-min resolution, and is thus qualified to test even the most demanding CSD methods, such as CSD2 or CSD3. In the literature, it is often considered that clear-sky situations occur when the clearness index is larger than 0.760. Compared to that simple (and questionable) assumption, the Long & Ackerman method has a more physical background, whereas BrightSun represents an even more advanced methodology, based on modifications to the Reno-Hansen method61. Whereas the latter only requires GHI measurements, BrightSun also depends on those of either direct or diffuse irradiance, just like Long & Ackerman. Figure 3 compares the 1-min clear-sky periods detected by CSD2 and CSD3 at Xianghe in 2010. These results are plotted in K-Kt space, where K represents the diffuse fraction, i.e., DSR/GHI. It is clear that, under low Kt, CSD3 returns more clear periods than CSD2, in part because the latter does not operate under low-sun conditions. This finding also means that there are limitations to the application of CSD1, since there appear to be many clear situations when Kt < 0.7, most likely because of the generally hazy conditions at Xianghe. Moreover, Fig. 3 indicates that CSD3 effectively filters out situations when various types of clouds have a small effect on GHI but a large effect on DSR. A general limitation of all simple CSD methods, however, is that they typically have difficulty detecting clear conditions under extremely hazy conditions, which do occur in Xianghe. In that sense, CSD3 does appear to detect many such clear but hazy situations, whereas CSD2 appears much too stringent. An important consequence of these findings is that the validation of clear-sky radiation models, such as REST2, would not be possible under historical high-turbidity conditions if using CSD1 or CSD2. Conversely, some of the scenes reported as clear by CSD3 could actually be cloudy.
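The two indices used in this comparison, and the simple CSD1 criterion, can be written out as a minimal sketch. The 0.7 threshold is the one quoted above; the function names and the two sample records are hypothetical, and CSD2 and CSD3 are far more elaborate and are not reproduced here.

```python
def clearness_index(ghi, extraterrestrial_horizontal):
    """Kt: ratio of GHI to its extraterrestrial counterpart."""
    return ghi / extraterrestrial_horizontal


def diffuse_fraction(dsr, ghi):
    """K: ratio of diffuse irradiance to GHI."""
    return dsr / ghi


def is_clear_csd1(ghi, extraterrestrial_horizontal, threshold=0.7):
    """CSD1: flag a record as clear when Kt exceeds the threshold."""
    return clearness_index(ghi, extraterrestrial_horizontal) > threshold


# Hypothetical 1-min records (all irradiances in W m-2)
samples = [
    {"ghi": 780.0, "dsr": 95.0, "e0h": 1000.0},   # clean, cloudless sky
    {"ghi": 620.0, "dsr": 310.0, "e0h": 1000.0},  # hazy but cloudless sky
]
for s in samples:
    kt = clearness_index(s["ghi"], s["e0h"])
    k = diffuse_fraction(s["dsr"], s["ghi"])
    print(round(kt, 2), round(k, 2), is_clear_csd1(s["ghi"], s["e0h"]))
```

The second record shows the limitation discussed above: a cloudless but hazy sample with Kt below 0.7 is rejected by CSD1 even though it would be a legitimate test case for a clear-sky model.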
Figure 4 shows how the REST2 hourly predictions compare to the DSR observations at Xianghe in 2010, alternatively using the three CSD methods. Overall, REST2 demonstrates excellent accuracy in estimating the hourly DSR under clear-sky conditions, with a correlation coefficient above 0.84 and an RMS error of 81.5 W m−2 (39.1%) for CSD1 (N = 898), 20.0 W m−2 (22.0%) for CSD2 (N = 192), and 41.4 W m−2 (31.3%) for CSD3 (N = 956). Figure 4 shows that the DSR results are concentrated around ≈100 W m−2, while hourly clear-sky DSR observations can reach ≈600 W m−2, i.e., very hazy conditions, using either CSD1 or CSD3. In contrast, CSD2 can only detect clear periods with DSR up to ≈180 W m−2, thus excluding the haziest conditions, as discussed above. BrightSun appears to be the most appropriate CSD method in a hazy environment such as Xianghe, or many other urban areas in China. In all plots of Fig. 4, the remaining scatter can be explained by imperfections in the CSD method and/or by imperfections in the critical inputs to the model, particularly in terms of AOD. In conclusion, the REST2 estimates of DSR under clear-sky conditions are found reliable overall, thus providing a stable foundation for the ML modeling step, which is necessary to estimate the all-sky irradiance.
Data Records
Using the models described above, a database of daily average all-sky DSR, referred to as CHDSR62, has been produced for China. This database covers the 41-year period from 1982 to 2022, and can be downloaded from figshare (https://doi.org/10.6084/m9.figshare.21763223.v3). The name of each data file is formatted as “DIF_yyyy.nc”, where “yyyy” represents the year, and the diffuse solar radiation is stored in the netCDF file as floating-point values in “W m−2”. Each file contains three variables (DSR, Longitude, and Latitude), with dimensions of 641 × 361. The longitude and latitude ranges covered by this dataset are 72°–136°E and 18°–54°N, respectively, with a spatial resolution of 0.1° (≈10 km); the time format used is local standard time (China’s time zone is +8).
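As a quick consistency check on the file layout, the following sketch rebuilds the coordinate axes and the list of yearly file names from the figures quoted above. It assumes that grid points sit on both edges of the stated ranges, and that the 641-point axis is longitude and the 361-point axis is latitude, which is inferred from the arithmetic rather than stated; the helper name is hypothetical.

```python
def axis_values(start, stop, step):
    """Regularly spaced coordinates from start to stop, both included."""
    n = round((stop - start) / step) + 1
    return [round(start + i * step, 1) for i in range(n)]


longitudes = axis_values(72.0, 136.0, 0.1)   # degrees East
latitudes = axis_values(18.0, 54.0, 0.1)     # degrees North
file_names = [f"DIF_{year}.nc" for year in range(1982, 2023)]

print(len(longitudes), len(latitudes), len(file_names))
print(file_names[0], file_names[-1])
```

The resulting axis lengths match the stated 641 × 361 dimensions, and the 41 file names cover 1982 to 2022.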
Technical Validation
Validation against ground measurements
The performance of the stacking model for the CHDSR62 dataset is evaluated to check the viability of long-term estimation. Firstly, the model training set is validated for the period 2011–2015, as shown in Fig. 5a. The test set for 2010 is also validated, as depicted in Fig. 5b. Finally, the extended 2000–2015 time range is validated, as illustrated in Fig. 5c. Overall, the stacking model performs well against the test set from 2000 to 2015, with R values ranging from 0.89 to 0.94, RMSE values ranging from 13.9 to 19.0 W m−2, and MBE values ranging from −1.6 to 0.5 W m−2. The sample-based cross-validation R, RMSE and MBE results are 0.93, 16.0 W m−2 and 0.5 W m−2, respectively. Using the test set of 2010, the stacking model performs better, with an R value of 0.94, RMSE of 13.9 W m−2, and MBE of −0.5 W m−2. The deviation of the stacking model remains low for both the training and test sets, showing the stability of the model’s estimation results. Furthermore, over the extended 2000–2015 time period, the model retains high accuracy with R = 0.89, RMSE = 19.0 W m−2, and MBE = −1.6 W m−2, thus revealing the powerful temporal extensibility of the stacking model in estimating DSR. The low RMSE and MBE confirm that the proposed dataset has a high degree of accuracy over China, with the desired stability for long-term estimation.
The CHDSR62 also delivers the best results and lowest bias when compared to other DSR products. For example, ERA5 also provides DSR products, but a significant underestimation of approximately 43.1 W m−2 has been observed over China, apparently caused by poor estimates of the cloud path63. The remote-sensed CERES product performs somewhat better with R = 0.8 and RMSE = 3.6 W m−2 37. Jiang et al. obtained a better result, with R and RMSE of 0.79 and 20.1 W m−2, respectively64. In parallel, Jiang et al.15 also produced another daily DSR dataset with R = 0.89 and RMSE = 58.3 W m−2. Wu et al. produced a high-resolution daily DSR dataset with R = 0.87 and RMSE = 20.2 W m−2 37.
To demonstrate the general applicability of the stacking model, the model is also validated separately for the five major climatic zones of China. The results displayed in Fig. 6 show that, in terms of R, the ranking is TMZ > SMZ > HPZ > TCZ > TOZ, all with rRMSE below 30%. Overall, the accuracy varies significantly over the different climate zones, confirming previous results65. For example, the modeled DSR performs best over TMZ, with the highest R (0.91) and the lowest RMSE (18.72 W m−2), whereas the worst R (0.84) occurs for TOZ. One explanation is that the terrain is flat over TMZ, so that the remote-sensed or modeled input data (e.g., of AOD) are less affected by that important factor than in the TOZ situation. An interesting observation is that the HPZ climate (which consists mainly of the Qinghai-Tibet Plateau) is the worst performing region in terms of RMSE, but not in terms of R. The largest scatter could be expected from the generally high elevations and substantial diversity in terrain features. In conclusion, the stacking model shows good applicability in the temperate monsoon climate zone, especially in the plains.
Spatial distribution
Figure 7 illustrates the annual and multi-year mean DSR values for the period 1982–2020 over China. The diffuse solar radiation values range from 63.7 to 97.7 W m−2, with a 39-year average value of 77.0 W m−2. As depicted in the primary map in Fig. 7, elevated DSR values are predominantly concentrated in the Taklamakan Desert, Central and Southern China, and sections of the Qinghai-Tibet plateau. The presence of sand and dust aerosols amplifies the scattering effect, leading to elevated DSR values across the Taklamakan Desert66. The shorter radiation path in the Qinghai-Tibet region weakens the atmospheric molecular scattering effect, resulting in lower DSR values in that area. In contrast, the high levels of diffuse solar radiation over low-altitude and low-latitude areas, such as Guangzhou, can be attributed to a combination of increased industrialization and the abundance of sea salt aerosols from the monsoon67,68.
The mean diffuse solar radiation experienced significant turning points in 1990, 2000 and 2010, with DSR values of 75.6 W m−2, 78.8 W m−2 and 79.5 W m−2, respectively. However, in 2020, there was a slight decrease in the mean DSR (78.2 W m−2). In each of these four years, the center of the low values was located in Inner Mongolia, with mean DSR of 63.8 W m−2, 68.5 W m−2, 72.3 W m−2, and 68.0 W m−2, respectively. This is mainly because of that area’s high geographical latitude, low sun elevation, and low aerosol burden, resulting in less solar radiation being scattered by the atmosphere. In 1990 and 2000, the centers of the high values were observed over high-altitude areas, specifically in Kunming. There, the mean DSR was 84.4 W m−2 during those years. This is attributed to the higher elevation of the Yunnan Plateau region, which was accompanied by an increase in atmospheric transparency during the period, resulting in a brightening of the Southwestern region69. However, in 2010 and 2020, the center of the high DSR values moved to eastern China, an area characterized by significant urban and industrial developments and high anthropogenic pollutant emissions, resulting in mean DSR values of 92.2 W m−2 and 86.2 W m−2, respectively.
Long-term trends
Over China, the interannual trends of DSR from 1982 to 2020 are depicted in Fig. 8. Overall, the mean annual diffuse solar irradiation varied from 72.3 to 81.8 W m−2, exhibiting an overall decreasing trend of −0.012 W m−2 yr−1. More specifically, the figure delineates five periods with characteristic trends. From 1982 to 1990, the diffuse solar radiation showed a decreasing trend of −0.786 W m−2 yr−1. During 1992–1998, another downward period existed, with a mean trend of −1.245 W m−2 yr−1. An opposite trend followed in 1998–2008, characterized by a slight increase of 0.300 W m−2 yr−1. Finally, DSR was affected by a small decreasing trend of −0.203 W m−2 yr−1 during 2008–2020. These trends are purposely not qualified as “dimming” or “brightening” to avoid confusion, since dimming relates to a decrease in GHI, which normally corresponds to an increase in DSR, and vice versa for brightening. Meanwhile, due to the lack of pre-2000 site data, the DSR trends before 2000 should be considered as indicative only. Regarding the validity of the temporal variation of the dataset over long time series, existing studies report a DSR trend over the 39-year period similar to that found in the present study, which supports the validity of the proposed model until 200070,71,72.
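A trend in W m−2 yr−1 of the kind quoted above can be obtained as the slope of a straight line fitted to the annual means. The sketch below assumes ordinary least squares, which is a common choice but is not stated to be the method used here; the function name is hypothetical and the annual series is synthetic, built with a known slope so that the result can be checked.

```python
def linear_trend(years, values):
    """Ordinary least-squares slope of values against years."""
    n = len(years)
    mean_year = sum(years) / n
    mean_value = sum(values) / n
    sxy = sum((t - mean_year) * (v - mean_value)
              for t, v in zip(years, values))
    sxx = sum((t - mean_year) ** 2 for t in years)
    return sxy / sxx


# Synthetic annual-mean DSR series (W m-2) with a built-in slope of -0.2 per year
years = list(range(2008, 2021))
values = [80.0 - 0.2 * (year - 2008) for year in years]
slope = linear_trend(years, values)
print(round(slope, 3))
```

Applied separately to each sub-period, the same function would return one slope per period, which is how piecewise trends such as those listed above are typically derived.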
Remarkably, Fig. 8 displays two peaks in DSR, occurring in 1983 and 1992. Correspondingly, Fig. 7 shows that the national DSR widely peaked in 1983 and 1992. In 1983, the peak in DSR is directly related to a large global increase in aerosol concentration caused by the eruption of the El Chichón volcano in Mexico, which resulted in an enhanced scattering effect73. Similarly, the eruption of Mount Pinatubo in the Philippines in 1991 caused a peak in DSR the following year (1992)74. Large amounts of volcanic aerosols were released into the stratosphere during those two eruptions, eventually resulting in a significant additional AOD burden, which affected the whole world for ≈2 years75,76. After 2000, rapid industrialization in Asia and an increase in sulphur-containing particle emissions led to an intensified atmospheric scattering effect77. This effect resulted in an annual increase in the distribution of DSR in Northern China as illustrated in Fig. 8, up to a peak that occurred in 2008. Subsequently, the observed decreasing trend in annual DSR can be attributed to the air quality measures and carbon emission reduction policies implemented by the Chinese government78.
Uncertainty and limitations
Figure 9 depicts the validation results of the estimated DSR at the 17 test sites. Overall, the dataset maintains rRMSE under 30%, with over 70% of the stations displaying an RMSE below 20 W m−2. Moreover, more than 40% of the sites demonstrate a correlation coefficient greater than 0.9, while less than 19% of them exhibit a correlation coefficient between 0.85. These findings reflect a high agreement between the estimates and observations, particularly in the Northern (Beijing) and North-Eastern (Shenyang) parts of the country. Conversely, elevated RMSE values and diminished R values are predominantly observed in the Qinghai-Tibet Plateau and the Sichuan Basin, regions known for their dramatic weather changes and/or complex terrain. For those areas, the combination of spatial inhomogeneities in terrain and rapid spatiotemporal changes in cloud cover can directly affect the representativeness of the atmospheric inputs to the model, resulting in relatively larger deviations between modeled DSR and station observations. However, at stations with high levels of industrialization, such as Beijing, Shenyang, Wuhan or Shanghai, where atmospheric aerosols contain a lot of black carbon, the hybrid model yields good results, with R values exceeding 0.9 and RMSE below 20 W m−2. This suggests that the model effectively captures the impact of aerosols on DSR. In contrast, the model appears somewhat less accurate at Kunming (R = 0.84, RMSE = 20.15 W m−2, rRMSE = 24%) than in other regions. That station is representative of Southwest China, where the average annual rainfall is much higher than elsewhere, thus indicative of larger and more complex cloudiness.
In conclusion, this study presents the most accurate dataset of diffuse solar radiation currently available for China, showing a consistently good correlation and homogeneity throughout mainland China. The dataset can serve as a fundamental resource for researchers investigating the long-term spatial distribution of diffuse solar radiation or conducting other relevant studies. The study has identified several limitations, including reduced accuracy over areas affected by either tropical monsoon or terrain-induced inhomogeneities, such as the Qinghai-Tibet Plateau and Sichuan Basin regions. Further refinements would be necessary to address the various possible causes of spatiotemporal variations in cloud cover. To that effect, and to better align the estimates with rapidly changing weather conditions, it is anticipated that the model can be improved by considering new inputs related to cloudiness, such as cloud height, wind speed, or temporal variability in cloud cover. Over China, the MERRA-2 and ERA5 datasets show significant differences in diffuse radiation, which may be partly due to other variables such as cloud cover79. Given the differences in aerosol data assimilation between MERRA-2 and ERA5, additional correction steps will need to be added to the model in the future to reduce the impact of inconsistencies from different data sources on the final results.
Usage Notes
The diffuse solar radiation dataset developed here (CHDSR) performs well when compared to the CMA station observations during 2000–2015. However, because the CHDSR could be affected by the lack of DSR observations during the early years (1982–1990), the pre-2000 data are provided for information purposes only and need to be used with care. Therefore, when using this dataset for early years, it should be considered only as a preliminary reference until further validation can be performed.
This long-term dataset is suitable for a better understanding of the spatial and temporal variations in diffuse solar radiation across China, and for further evaluations of photosynthetic efficiency.
Code availability
The complete code used in this work is not released along with the dataset because a part of it involves the REST2_v9.1 code, which cannot currently be shared without permission. Readers may obtain the code from the home page of the REST2 model (https://www.solarconsultingservices.com/).
References
Gueymard, C. A. REST2: High-performance solar radiation model for cloudless-sky irradiance, illuminance, and photosynthetically active radiation - Validation with a benchmark dataset. Sol. Energy 82, 272–285 (2008).
Liu, Q., Zhang, Z., Fan, M. & Wang, Q. The Divergent Estimates of Diffuse Radiation Effects on Gross Primary Production of Forest Ecosystems Using Light-Use Efficiency Models. Geophys. Res. Lett. 48, 1–11 (2021).
Williams, I. N., Riley, W. J., Kueppers, L. M., Biraud, S. C. & Torn, M. S. Separating the effects of phenology and diffuse radiation on gross primary productivity in winter wheat. J. Geophys. Res. Biogeosciences 121, 1903–1915 (2016).
Rap, A. et al. Enhanced global primary production by biogenic aerosol via diffuse radiation fertilization. Nat. Geosci. 11, 640–644 (2018).
Freitas, S., Catita, C., Redweik, P. & Brito, M. C. Modelling solar potential in the urban environment: State-of-the-art review. Renew. Sustain. Energy Rev. 41, 915–931 (2015).
Mercado, L. M. et al. Impact of changes in diffuse radiation on the global land carbon sink. Nature 458, 1014–1017 (2009).
Chakraborty, T., Lee, X. & Lawrence, D. M. Diffuse Radiation Forcing Constraints on Gross Primary Productivity and Global Terrestrial Evapotranspiration. Earth’s Futur. 10, 1–16 (2022).
Xiao, M., Yu, Z. & Cui, Y. Evaluation and estimation of daily global solar radiation from the estimated direct and diffuse solar radiation. Theor. Appl. Climatol. 140, 983–992 (2020).
Huld, T., Müller, R. & Gambardella, A. A new solar radiation database for estimating PV performance in Europe and Africa. Sol. Energy 86, 1803–1815 (2012).
Letu, H. et al. High-resolution retrieval of cloud microphysical properties and surface solar radiation using Himawari-8/AHI next-generation geostationary satellite. Remote Sens. Environ. 239, 111583 (2020).
Sengupta, M. et al. The National Solar Radiation Data Base (NSRDB). Renew. Sustain. Energy Rev. 89, 51–60 (2018).
Qu, Z. et al. Fast radiative transfer parameterisation for assessing the surface solar irradiance: The Heliosat-4 method. Meteorol. Zeitschrift 26, 33–57 (2017).
Schroedter-Homscheidt, M. et al. Surface solar irradiation retrieval from MSG/SEVIRI based on APOLLO Next Generation and HELIOSAT-4 methods. Meteorol. Zeitschrift 31, 455–476 (2022).
Perez, R., Cebecauer, T. & Šúri, M. Semi-Empirical Satellite Models. Sol. Energy Forecast. Resour. Assess. 21–48 https://doi.org/10.1016/B978-0-12-397177-7.00002-4 (2013).
Jiang, H., Lu, N., Qin, J. & Yao, L. Hourly 5-km surface total and diffuse solar radiation in China, 2007–2018. Sci. Data 7, 1–12 (2020).
Miller, S. D., Heidinger, A. K. & Sengupta, M. Physically Based Satellite Methods. Sol. Energy Forecast. Resour. Assess. 49–79 https://doi.org/10.1016/B978-0-12-397177-7.00003-6 (2013).
Gueymard, C. A. & Ruiz-Arias, J. A. Extensive worldwide validation and climate sensitivity analysis of direct irradiance predictions from 1-min global irradiance. Sol. Energy 128, 1–30 (2016).
Yang, D. Estimating 1-min beam and diffuse irradiance from the global irradiance: A review and an extensive worldwide comparison of latest separation models at 126 stations. Renew. Sustain. Energy Rev. 159, 112195 (2022).
Polo, J. et al. Preliminary survey on site-adaptation techniques for satellite-derived and reanalysis solar radiation datasets. Sol. Energy 132, 25–37 (2016).
Yang, D. & Gueymard, C. A. Probabilistic post-processing of gridded atmospheric variables and its application to site adaptation of shortwave solar radiation. Sol. Energy 225, 427–443 (2021).
Yu, Y., Tang, Y., Chou, J. & Yang, L. A novel adaptive approach for improvement in the estimation of hourly diffuse solar radiation: A case study of China. Energy Convers. Manag. 293, 117455 (2023).
Boudjella, M. Y., Belbachir, A. H., Dib, S. A. A. & Meftah, M. Calculation of surface spectral irradiance using the Geant4 Monte Carlo toolkit. J. Atmos. Solar-Terrestrial Phys. 248, (2023).
Halthore, R. N. et al. Intercomparison of shortwave radiative transfer codes and measurements. J. Geophys. Res. D Atmos. 110, 1–18 (2005).
Vicent, J. et al. Comparative analysis of atmospheric radiative transfer models using the Atmospheric Look-up table Generator (ALG) toolbox (version 2.0). Geosci. Model Dev. 13, 1945–1957 (2020).
Gueymard, C. A. Parameterized transmittance model for direct beam and circumsolar spectral irradiance. Sol. Energy 71, 325–346 (2001).
Mueller, R. W. et al. Rethinking satellite-based solar irradiance modelling: The SOLIS clear-sky module. Remote Sens. Environ. 91, 160–174 (2004).
Xie, Y. & Sengupta, M. A Fast All-sky Radiation Model for Solar applications with Narrowband Irradiances on Tilted surfaces (FARMS-NIT): Part I. The clear-sky model. Sol. Energy 174, 691–702 (2018).
Xie, Y., Sengupta, M. & Wang, C. A Fast All-sky Radiation Model for Solar applications with Narrowband Irradiances on Tilted surfaces (FARMS-NIT): Part II. The cloudy-sky model. Sol. Energy 188, 799–812 (2019).
Abreu, E. F. M., Gueymard, C. A., Canhoto, P. & Costa, M. J. Performance assessment of clear-sky solar irradiance predictions using state-of-the-art radiation models and input atmospheric data from reanalysis or ground measurements. Sol. Energy 252, 309–321 (2023).
Guermoui, M. et al. A novel ensemble learning approach for hourly global solar radiation forecasting. Neural Comput. Appl. 34, 2983–3005 (2022).
Citakoglu, H., Babayigit, B. & Haktanir, N. A. Solar radiation prediction using multi-gene genetic programming approach. Theor. Appl. Climatol. 142, 885–897 (2020).
Patel, D., Patel, S., Patel, P. & Shah, M. Solar radiation and solar energy estimation using ANN and Fuzzy logic concept: A comprehensive and systematic study. Environ. Sci. Pollut. Res. 29, 32428–32442 (2022).
Ma, R. et al. Estimation of Surface Shortwave Radiation from Himawari-8 Satellite Data Based on a Combination of Radiative Transfer and Deep Neural Network. IEEE Trans. Geosci. Remote Sens. 58, 5304–5316 (2020).
Li, R., Wang, D., Liang, S., Jia, A. & Wang, Z. Estimating global downward shortwave radiation from VIIRS data using a transfer-learning neural network. Remote Sens. Environ. 274, 112999 (2022).
Shamshirband, S., Mohammadi, K., Yee, P. L., Petković, D. & Mostafaeipour, A. A comparative evaluation for identifying the suitability of extreme learning machine to predict horizontal global solar radiation. Renew. Sustain. Energy Rev. 52, 1031–1042 (2015).
Fan, J., Wu, L., Ma, X., Zhou, H. & Zhang, F. Hybrid support vector machines with heuristic algorithms for prediction of daily diffuse solar radiation in air-polluted regions. Renew. Energy 145, 2034–2045 (2020).
Wu, J. et al. Constructing High-Resolution (10 km) Daily Diffuse Solar Radiation Dataset across China during 1982–2020 through Ensemble Model. Remote Sens. 14, (2022).
Zhao, S. et al. Simulation of Diffuse Solar Radiation with Tree-Based Evolutionary Hybrid Models and Satellite Data. Remote Sens. 15, 1–23 (2023).
Attar, N. F., Sattari, M. T., Prasad, R. & Apaydin, H. Comprehensive review of solar radiation modeling based on artificial intelligence and optimization techniques: future concerns and considerations. Clean Technol. Environ. Policy https://doi.org/10.1007/s10098-022-02434-7 (2022).
Camporeale, E. The Challenge of Machine Learning in Space Weather: Nowcasting and Forecasting. Sp. Weather 17, 1166–1207 (2019).
Chakraborty, T. C. & Lee, X. Using supervised learning to develop BaRAD, a 40-year monthly bias-adjusted global gridded radiation dataset. Sci. Data 8, 1–10 (2021).
Cao, Q., Liu, Y., Sun, X. & Yang, L. Country-level evaluation of solar radiation data sets using ground measurements in China. Energy 241, 122938 (2022).
Long, C. N. & Shi, Y. An Automated Quality Assessment and Control Algorithm for Surface Radiation Measurements. Open Atmos. Sci. J. 2, 23–37 (2008).
Hersbach, H. et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 146, 1999–2049 (2020).
Hersbach, H. et al. ERA5 hourly data on single levels from 1940 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS) https://doi.org/10.24381/cds.adbb2d47 (2023).
Muñoz Sabater, J. ERA5-Land hourly data from 1950 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS) https://doi.org/10.24381/cds.e2161bac (2019).
Global Modeling and Assimilation Office (GMAO). MERRA-2 inst3_2d_gas_Nx: 2d,3-Hourly,Instantaneous,Single-Level,Assimilation,Aerosol Optical Depth Analysis V5.12.4. Goddard Earth Sciences Data and Information Services Center (GES DISC) https://doi.org/10.5067/HNGA0EWW0R09 (2015).
Sun, X. et al. Worldwide performance assessment of 95 direct and diffuse clear-sky irradiance models using principal component analysis. Renew. Sustain. Energy Rev. 135, 110087 (2021).
Qin, W., Wang, L., Wei, J., Hu, B. & Liang, X. A novel efficient broadband model to derive daily surface solar Ultraviolet radiation (0.280–0.400 μm). Sci. Total Environ. 735, (2020).
Gueymard, C. A. & Yang, D. Worldwide validation of CAMS and MERRA-2 reanalysis aerosol optical depth products using 15 years of AERONET observations. Atmos. Environ. 225, 117216 (2020).
Sun, X., Yang, D., Gueymard, C. A., Bright, J. M. & Wang, P. Effects of spatial scale of atmospheric reanalysis data on clear-sky surface radiation modeling in tropical climates: A case study for Singapore. Sol. Energy 241, 525–537 (2022).
Gueymard, C. A. The SMARTS spectral irradiance model after 25 years: New developments and validation of reference spectra. Sol. Energy 187, 233–253 (2019).
Badescu, V. et al. Computing global and diffuse solar hourly irradiation on clear sky. Review and testing of 54 models. Renew. Sustain. Energy Rev. 16, 1636–1656 (2012).
Ruiz-Arias, J. A., Gueymard, C. A., Santos-Alamillos, F. J. & Pozo-Vázquez, D. Worldwide impact of aerosol’s time scale on the predicted long-term concentrating solar power potential. Sci. Rep. 6, 1–10 (2016).
Gueymard, C. A. & Ruiz-Arias, J. A. Validation of direct normal irradiance predictions under arid conditions: A review of radiative models and their turbidity-dependent performance. Renew. Sustain. Energy Rev. 45, 379–396 (2015).
Zhou, Y., Liu, Y., Wang, D., Liu, X. & Wang, Y. A review on global solar radiation prediction with machine learning models in a comprehensive perspective. Energy Convers. Manag. 235, 113960 (2021).
Gueymard, C. A., Bright, J. M., Lingfors, D., Habte, A. & Sengupta, M. A posteriori clear-sky identification methods in solar irradiance time series: Review and preliminary validation using sky imagers. Renew. Sustain. Energy Rev. 109, 412–427 (2019).
Long, C. N. & Ackerman, T. P. Identification of clear skies from broadband pyranometer measurements and calculation of downwelling shortwave cloud effects. J. Geophys. Res. Atmos. 105, 15609–15626 (2000).
Driemel, A. et al. Baseline Surface Radiation Network (BSRN): Structure and data description (1992–2017). Earth Syst. Sci. Data 10, 1491–1501 (2018).
Liu, B. Y. H. & Jordan, R. C. The interrelationship and characteristic distribution of direct, diffuse and total solar radiation. Sol. Energy 4, 1–19 (1960).
Bright, J. M. et al. BRIGHT-SUN: A globally applicable 1-min irradiance clear-sky detection model. Renew. Sustain. Energy Rev. 121, 109706 (2020).
Qi, Q., Wu, J. & Qin, W. CHDSR: Daily Surface Solar Diffuse Radiation Dataset in China (1980–2022, 10km) based on REST2_v9.1 and Integrated Machine Learning techniques. figshare. Dataset. https://doi.org/10.6084/m9.figshare.21763223.v3 (2022).
Jiang, H., Yang, Y., Bai, Y. & Wang, H. Evaluation of the Total, Direct, and Diffuse Solar Radiations from the ERA5 Reanalysis Data in China. IEEE Geosci. Remote Sens. Lett. 17, 47–51 (2020).
Jiang, H., Yang, Y., Wang, H., Bai, Y. & Bai, Y. Surface diffuse solar radiation determined by reanalysis and satellite over East Asia: Evaluation and comparison. Remote Sens. 12, 1–19 (2020).
Zhou, Y., Wang, D., Liu, Y. & Liu, J. Diffuse solar radiation models for different climate zones in China: Model evaluation and general model development. Energy Convers. Manag. 185, 518–536 (2019).
Hu, K., Kumar, K. R., Kang, N., Boiyo, R. & Wu, J. Spatiotemporal characteristics of aerosols and their trends over mainland China with the recent Collection 6 MODIS and OMI satellite datasets. Environ. Sci. Pollut. Res. 25, 6909–6927 (2018).
Feng, Y., Chen, D. & Zhao, X. Estimated long-term variability of direct and diffuse solar radiation in North China during 1959–2016. Theor. Appl. Climatol. 137, 153–163 (2019).
Zhang, H. et al. Simulation of direct radiative forcing of aerosols and their effects on East Asian climate using an interactive AGCM-aerosol coupled system. Clim. Dyn. 38, 1675–1693 (2012).
Xia, X. A closer looking at dimming and brightening in China during 1961–2005. Ann. Geophys. 28, 1121–1132 (2010).
Jia, D. et al. Evaluation of machine learning models for predicting daily global and diffuse solar radiation under different weather/pollution conditions. Renew. Energy 187, 896–906 (2022).
Feng, Y., Cui, N., Zhang, Q., Zhao, L. & Gong, D. Comparison of artificial intelligence and empirical models for estimation of daily diffuse solar radiation in North China Plain. Int. J. Hydrogen Energy 42, 14418–14428 (2017).
Xue, X. Prediction of daily diffuse solar radiation using artificial neural networks. Int. J. Hydrogen Energy 42, 28214–28221 (2017).
Hay, J. E. & Darby, R. El Chichón – influence on aerosol optical depth and direct, diffuse and total solar irradiances at Vancouver, B.C. Atmos.-Ocean 22, 354–368 (1984).
Nagel, D., Herber, A., Thomason, L. W. & Leiterer, U. Vertical distribution of the spectral aerosol optical depth in the Arctic from 1993 to 1996. J. Geophys. Res. Atmos. 103, 1857–1870 (1998).
Russell, P. B. et al. Global to microscale evolution of the Pinatubo volcanic aerosol derived from diverse measurements and analyses. J. Geophys. Res. Atmos. 101, 18745–18763 (1996).
Molineaux, B. & Ineichen, P. Impact of Pinatubo aerosols on the seasonal trends of global, direct and diffuse irradiance in two northern mid-latitude sites. Sol. Energy 58, 91–101 (1996).
Streets, D. G. et al. Anthropogenic and natural contributions to regional trends in aerosol optical depth, 1980–2006. J. Geophys. Res. Atmos. 114, 1–16 (2009).
He, Q., Zhang, M. & Huang, B. Spatio-temporal variation and impact factors analysis of satellite-based aerosol optical depth over China from 2002 to 2015. Atmos. Environ. 129, 79–90 (2016).
Chakraborty, T. & Lee, X. Large differences in diffuse solar radiation among current-generation reanalysis and satellite-derived product. J. Clim. 34, 6635–6650 (2021).
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Qi, Q., Wu, J., Gueymard, C.A. et al. Mapping of 10-km daily diffuse solar radiation across China from reanalysis data and a Machine-Learning method. Sci Data 11, 756 (2024). https://doi.org/10.1038/s41597-024-03609-1
DOI: https://doi.org/10.1038/s41597-024-03609-1