Value-Added Products Derived from 15 Years of High-Quality Surface Solar Radiation Measurements at Xianghe, a Suburban Site in the North China Plain

Surface solar radiation (SSR) is a key component of the energy budget of the Earth’s surface, and it varies at different spatial and temporal scales. Considerable knowledge of how and why SSR varies is crucial to a better understanding of climate change, which surely requires long-term measurements of high quality. The objective of this study is to introduce a value-added SSR dataset from Oct 2004 to Oct 2019 based on measurements taken at Xianghe, a suburban site in the North China Plain; two value-added products based on the 1-minute SSR measurements are developed. The first is clear sky detection by using a machine learning model. The second is cloud fraction estimation derived from an effective semi-empirical method. A “brightening” of global horizontal irradiance (GHI) was revealed and found to occur under both clear and cloudy conditions. This could likely be attributed to a reduction in aerosol loading and cloud fraction. This dataset could not only improve our knowledge of the variability and trend of SSR in the North China Plain, but also be beneficial for solar energy assessment and forecasting.


Introduction
Surface solar radiation (SSR) is an important energy driver for the earth-atmosphere system and exerts direct or indirect effects on many fields such as climate, hydrological cycle, plant growth, and renewable energy, to name just a few (Wild, 2009;Yang et al., 2022;and references therein). SSR is modulated by atmospheric absorption and scattering processes, which are highly variable at multiple scales due to either predictable factors (e.g., solar zenith angle) or meteorological factors (e.g., water vapor, aerosol, and cloud). Under clear-sky conditions, aerosol dominates SSR variability, while cloud properties (amount, type, thickness, position relative to the sun, etc.) predominantly determine how cloudy-sky SSR varies. Long-term SSR is highly sensitive to climate perturbations caused by natural and anthropogenic factors. The variation in SSR has thus become an indicator of climate change and has received extensive attention.
SSR measurement across the world has been boosted as a result of an international effort to coordinate the collection of geophysical data starting in 1957-58 (International Geophysical Year). As a result, daily and monthly SSR data have been archived since then. Around 500 000 monthly mean surface irradiance data at 2500 locations are publicly accessible from ETH Zurich (http://www.geba.ethz.ch). Limited by recording, i.e., data storage and transfer capability, surface irradiance was measured hourly, at best, before the 1990s. Benefited by technical innovations, SSR measurements with high temporal resolution (1-10 min) have been extensively performed across the world afterwards. For example, 1-minute SSR measurements since the 1990s are available at more than 50 Baseline Surface Radiation Network (BSRN) sites that are located in diverse climatic zones (https://bsrn.awi.de). The measured SSR plays a crucial role in the detection of long-term trends and variability in the surface energy budget, as well as the validation of model and satellite SSR products (Wild, 2009;Wang et al., 2015;Li et al., 2016).
Based on the large amount of measured data, studies have found that SSR has generally shifted from "global dimming" (decrease of global SSR) between the 1950s-80s to "global brightening" (increase of global SSR) since the middle of the 1980s (Wild et al., 2005;Wild, 2009). Further regional analysis has revealed that the "brightening " has been mostly concentrated over developed regions including North America, Europe, Australia, Japan, etc., while develop-ing countries such as China and India still suffered from "dimming " in the 1990s (Che et al., 2005;Qian et al., 2006;Wild, 2009;Wang et al., 2015). The specific decadal variation of SSR and its potential causes still need further exploration in different regions, which surely requires long-term SSR measurements with high temporal resolution.
The North China Plain (NCP) is one of the most heavily polluted regions in China. In order to record the fingerprint of anthropogenic activities on climate, a comprehensive radiation site was established at Xianghe in the end of 2004, which has provided valuable data to explore how SSR varies, especially the trend before and after the pollution prevention and control actions in 2013 Shi et al., 2021;Xia et al., 2021). Efforts have been devoted to the quality control and derivation of value-added products (Liu et al., 2021a(Liu et al., , b, 2022. The goal of this study is to introduce these valuable SSR measurements. The introduced dataset could be able to play a critical role in exploring the reasons for the SSR variability in this region. It could be beneficial for the verification of satellite-retrieved and model-simulated data and could also play an indispensable role in solar energy assessment and forecasting (Gueymard, 2012;Fu et al., 2022;Huang et al., 2022;Yang et al., 2022).
The paper is organized as follows. Section 2 describes the data source and instruments. Section 3 presents the quality control procedures. Sections 4 and 5 briefly describe the methodology of clear sky detection, clear-sky irradiance parameterization, and cloud cover estimation. Section 6 presents the dataset composition and the organizational structure of the dataset. Section 7 presents some general characteristics of the dataset.

Site, instruments, and measurements
Xianghe (XH: 39.75°N, 116.96°E, 30 m a.s.l.) is a suburban site located between Beijing and Tianjin, two megacities in the NCP. XH experiences a continental monsoon climate characterized by warm and moist summers but cold and windy winters. Annual rainfall is ~600 mm, ~75% of which falls in summer. There are both natural aerosols (dust, mostly in spring) and anthropogenic pollutants of urban, rural, or mixed origins (all year round) in this region.
A set of solar radiometers and a sun photometer were installed on the roof of a four-story building in Sep 2004 ( Table 1). The radiometers sample every second, but 1-min means and standard deviations are recorded. Prior to Jul ). An intercomparison between the raw measurements (without quality control) using bias (BIAS), relative mean square error (RMSE), and correlation coefficient (R 2 ) for Oct 2013 is shown in Fig. 1. It indicates that the impact of the replacement of instruments is small, especially for DHI and GHI. Moreover, the obvious outliers in DNI ( Fig. 1a) are caused by solar tracker problems, which would not pass the quality control process. Therefore, the homogeneity would be better after the quality control process. All pyranometers are heated and ventilated as recommended by the BSRN protocol. Daily checks of the radiometer dome and level are performed by site staff to ensure measurement quality. A Yankee Environmental Systems, Inc., Model 440 Total Sky Imager (TSI-440), a full-color sky camera for sky imaging, has provided measurements during 2004-09.

Data quality control procedures
The quality control procedures recommended by BSRN and Atmospheric Radiation Measurement (ARM) are modi- fied and applied to the measurements (Ohmura et al., 1998;Long and Dutton, 2002). The procedures include: 1) zero offset correction; 2) physical threshold check; 3) solar track malfunction test; 4) blocking test; and 5) comparison test.
1) Zero offset correction: The thermopile-based single black detector pyranometer suffers from infrared loss from the detector. The offset correlates to simultaneous net infrared radiation measurements and is used here to correct the offset (Reda et al., 2005).
2) Physical threshold check: The physical threshold of GHI, DNI, and DHI recommended by Long and Dutton (2002) is adopted for the SSR measurements. The physical threshold check has minimal effects on the measurements.
3) Solar tracker malfunction test: The detection of solar tracker malfunction is modified given the frequent occurrence of heavy aerosol loading at the site, as shown by a polluted case on 13 Apr 2008 (Fig. 2). The potential clear sky GHI is originally calculated using the following formula: , where μ is the cosine of solar zenith angle (SZA), and b is set to be 1.2 and a is 1400 (Long and Ackerman, 2000), but here a is reduced gradually from 1400 (Fig. (Fig. 2b) to account for the substantial aerosol radiative effect (Xia et al., 2007). The expected clear-sky Rayleigh DHI is calculated using the formula suggested by Long and Ackerman (2000). The following thresholds are proven to be effective in detecting potential tracking problems in most cases. The instants when SZA < 87.5° and , DNI should exceed 2 W m -2 ; otherwise, the tracker does not work properly (Fig. 2c). If and the ratio of DHI and GHI (k d ) , but DNI < 15 W m -2 and DHI exceeds the calculated Rayleigh DHI, it also means malfunction of the solar tracker.

Clear sky detection and clear-sky SSR parameterization
Clear sky detection (CSD) is a prior step in the evaluation of aerosol and cloud radiative effects. High-frequency observations of GHI and its components provide abundant information about the intermittency of clouds in the sky, which is the fundamental basis for CSD methods (Gueymard et al., 2019). All of these methods rely on dramatic departures of the magnitude and temporal variability of cloudy GHI, DNI, DHI, and their derivative quantities from their clear-sky counterparts. Sixteen CSD methods in the literature were assessed by using TSI clear sky detection results during 2004-09 at XH. All these methods did not produce satisfactory results under heavily polluted conditions (Liu et al., 2021a). Therefore, we used the labeled GHI, DNI, and DHI measurements in Liu et al. (2021a) to train a random forest (RF) model for the CSD, in which seven derivative quantities were adopted as predictor features. The thresholds and intervals of parameters in the optimal procedure are shown in Table 2. The accuracy score here means the proportion of correct detection of clear/cloudy skies relative to total samples of clear/cloudy skies. The average accuracy score of the RF model exceeded 0.91, with the accuracy scores under clear and cloudy conditions being 0.88 and 0.93, respectively. Figure 3 presents the performance of the RF model for different AOD levels. The accuracy scores of clear sky detection are over 0.8, even in heavily polluted conditions (AOD > 1). Meanwhile, the RF model can identify cloudy skies very well. The proposed RF model is applied to the 15-year irradiance measurements to obtain the CSD results.
A simple model is introduced to parameterize the clearsky GHI and DNI each day if there are over 100 clear sky instants and their μ range exceeds 0.65. μ is widely used as the independent variable to parameterize GHI, DNI, and DHI (Long and Ackerman, 2000). However, daytime AOD often shows an increasing tendency at XH, implying that SSR is not symmetric for the same μ in the morning and afternoon (Song et al., 2018). The power law function with μ as the independent variable is therefore not able to work properly under these circumstances. We attempted to use the local standard time (LST) in decimal hours as the independent variable GHI cs = a 1 sin(b 1 LST + c 1 ) DNI cs = a 2 e b 2 LST/µ to parameterize clear SSR, the regression result of which is better than that using μ in many cases. The potential clearsky GHI (GHI cs ) and clear-sky DNI (DNI cs ) are parameterized as follows: , . Figure 4 shows an example for 20 Feb 2005. AOD shows a rapid increasing trend, varying from 0.09 in the early morning to 0.41 in the late afternoon (Fig. 4a); accordingly, DNI decreases and DHI increases. The calculated GHI cs and DNI cs using LST indicate smaller residuals of 0.78 W m -2 and 0.56 W m -2 , relative to 7.27 W m -2 and 11.82 W m -2 , when using μ in the parametrization. If there are not enough clear sky instants in a day for the regression analysis, the monthly mean clear-sky GHI and DNI are used. The potential GHI cs is then used to calculate the clear-sky index (k c = GHI /GHI cs ), which is widely used in the solar energy community.

Cloud fraction estimation
Clouds dramatically modulate SSR, among which CF is the dominant factor, indicating that SSR measurements likely provide CF information (Long et al., 2006;Xie and Liu, 2013). Here, we adopt the method of Xie and Liu (2013) to simultaneously retrieve CF and cloud albedo (CA) based on 1-minute GHI and DNI measurements. The underlying physics is captured well by the analytical formulations, which were approved to produce satisfactory CF and CA retrievals.
In this method, CA can be related to the ratio B 1 /B 2 by Table 2. Parameter settings to determine the optimal hyperparameters and corresponding accuracy score in the optimal situation.
piecewise polynomials (not shown here). B 1 and B 2 can be given by: where f 1 and f 2 are estimation functions (see Xie and Liu (2013) for more details). And CF is defined as a function that relates to CA and B 1 : (3) The parameter requirements can be provided by the irradiance measurements and the proposed clear-sky SSR model in this paper. The Xie and Liu (2013) method is only applied in cloudy samples detected by the proposed RF model. And it should be noted that, limited by the method, only partially cloudy samples have valid CF and CA values. Figure 5 presents the frequency distribution and monthly variation of estimated daytime CF during the study period. The monthly mean CF ranges from about 0.4 up to 0.85, with the lowest CF in January, and the highest in July. CF follows the conventional pattern of monthly variability in the present study over the NCP (Ma et al., 2014;Yang et al., 2020).

Dataset composition and structure
The dataset has one file in ".csv" format; the name of this file is "A value-added surface shortwave radiation dataset at Xianghe.csv". The file has two parts: the headline and the data lines. The data lines are defined by the corresponding column in the headline, which includes the following 17 items: Year, Month, Day, Hour, and Minute are Beijing time LST = UTC+8 of the corresponding data. cosSZA is the cosine of the solar zenith angle. Soldst is the Earth-Sun distance in Astronomical Units. GHI, DNI, and DHI are measured global horizontal irradiance, direct normal irradiance, and diffuse horizontal irradiance, respectively. GHI cs and DNI cs are clear-sky GHI and DNI parameterized by the proposed method in this study. k t is the clearness index calculated as the ratio of each GHI measurement to the GHI at the top of the atmosphere. k c is the clear sky index, the ratio of GHI to GHI cs . Both are widely used in the solar energy community. CSD_m is the CSD results, where "0" represents "clear sky" and "1" represents "cloudy sky." CF_Xie and CA_Xie are, respectively, estimated CF and CA by the method of Xie and Liu (2013).

Trends of aerosol, cloud, and SSR
The seasonal and interannual variations of irradiance, aerosol, and cloud are analyzed. The method recommended by Roesch et al. (2011) is used to calculate monthly mean irradiance as follows. The monthly mean is computed by averaging the monthly diurnal variation (96 × 15 min = 24 h). This method helps to minimize the impact of missing values and is especially suitable for clear and cloudy skies separately. Linear trends of irradiance, aerosol, and cloud time series and their significance levels are estimated using Sen's slope and the Mann-Kendall test, respectively (Mann, 1945;Sen, 1968;Kendall, 1975).
Figures 6a-c show time series of monthly mean GHI, DNI, and DHI for all sky, clear sky, and cloudy sky, respectively. Trends in all-sky GHI, DNI, and DHI are all positive (Fig. 6a). More specifically, all-sky GHI, DNI, and DHI increase by 1.32 W m -2 yr -1 , 1.8 W m -2 yr -1 , and  0.096 W m -2 yr -1 , respectively. It is interesting to note that only the trend of DNI (p-value = 0.005) is significant. It is widely reported that there has been a "brightening " after 1989 in the NCP (Wang and Wild, 2016). For instance, Li et al. (2018) and Shi et al. (2021) both indicate a general increasing trend of SSR in the NCP from 2005 onward. Moreover, our results show that the positive trend has become more apparent since 2013, as seen by annual GHI increases from 160-180 W m -2 (2004-12) to 180-200 W m -2 (2013-19). Similar to the variation trend of all-sky SSR, clear-sky GHI, DNI, and DHI increase by 1.32 W m -2 yr -1 , 1.92 W m −2 yr -1 , and 0.036 W m −2 yr -1 , respectively (Fig. 6b). This result agrees with an increasing clear-sky GHI trend by 1.38 W m −2 yr -1 from 2005 to 2015 . This could be attributed to a reduction of aerosol loading and absorption (Xia et al., 2021). Figure 6d shows that AOD has a significant declining trend (p-value = 0.012) (-0.012 yr -1 ), and single scattering albedo (SSA) has increased (p-value = 0.031) by 0.0012 yr -1 . This reflects the benefits of substantial emission reduction associated with stringent pollution control policies in China (Ding et al., 2019;Shi et al., 2021).
Cloudy GHI, DNI, and DHI trends also show increasing tendencies, but the increasing rates are relatively smaller than those of the clear sky counterparts. This result is consistent with the insignificant decreasing trend of CF shown in Fig. 6e. Note that the CFs are derived from SSR measurements and thus they are not completely independent, though the CF retrieval method has been extensively tested against both sky imager and satellite observations, which are independent of any SSR measurements (Xie et al., 2014;An and Wang, 2015).

Summary
Long term measurement of SSR is crucial for regional energy budget, water cycle, and climate change studies. Furthermore, the classification of sky condition and the evaluation of cloud characteristics are necessary to understand the potential causes behind the variation of SSR. In this study, we use 15 years  of SSR measurements at XH to generate a value-added dataset.
Considering the frequent air pollution at XH, the modified BSRN quality control procedure and clear-sky SSR parameterization are used. The evaluation of the new clearsky SSR parameterization demonstrates that the functions using LST could better reproduce the diurnal variability of aerosol loading. The machine learning-based CSD method shows high accuracy scores, with 0.88 and 0.93 for clear and cloudy skies, respectively. The proposed CSD model has especially good performance under polluted conditions. Due to limited surface observations of cloudiness, an estimation of CF by the Xie and Liu (2013) method was used to understand the cloud climatology at XH.
The dataset has the potential to improve our knowledge of the radiative effect of aerosol and cloud in this heavily polluted region, which should contribute substantially to the evaluation of climate and environmental effects of human activities. The proposed clear-sky irradiance parameterization and CSD model as well as the dataset could be beneficial for irradiance-related studies, for instance, solar energy assessment and forecasting.  Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.