Tests of Sunspot Number Sequences: 3. Effects of Regression Procedures on the Calibration of Historic Sunspot Data
DOI: 10.1007/s11207-015-0829-2
 Cite this article as:
Lockwood, M., Owens, M.J., Barnard, L. et al. Sol Phys (2016) 291: 2829. doi:10.1007/s11207-015-0829-2
Abstract
We use sunspot-group observations from the Royal Greenwich Observatory (RGO) to investigate the effects of intercalibrating data from observers with different visual acuities. The tests are made by counting the number of groups \([R_{\mathrm{B}}]\) above a variable cut-off threshold of observed total whole spot area (uncorrected for foreshortening) to simulate what a lower-acuity observer would have seen. The synthesised annual means of \(R_{\mathrm{B}}\) are then rescaled to the full observed RGO group number \([R_{\mathrm{A}}]\) using a variety of regression techniques. It is found that a very high correlation between \(R_{\mathrm{A}}\) and \(R_{\mathrm{B}}\) (\(r_{\mathrm{AB}} > 0.98\)) does not prevent large errors in the intercalibration (for example, sunspot-maximum values can be over 30 % too large even for such levels of \(r_{\mathrm{AB}}\)). In generating the backbone sunspot number \([R_{\mathrm{BB}}]\), Svalgaard and Schatten (Solar Phys., 2016) force regression fits to pass through the scatter-plot origin, which generates unreliable fits (the residuals do not form a normal distribution) and causes sunspot-cycle amplitudes to be exaggerated in the intercalibrated data. It is demonstrated that the use of Quantile–Quantile (“Q–Q”) plots to test for a normal distribution is a useful indicator of erroneous and misleading regression fits. Ordinary least-squares linear fits, not forced to pass through the origin, are sometimes reliable (although the optimum method used is shown to be different when matching peak and average sunspot-group numbers). However, other fits are only reliable if nonlinear regression is used. From these results it is entirely possible that the inflation of solar-cycle amplitudes in the backbone group sunspot number as one goes back in time, relative to related solar–terrestrial parameters, is entirely caused by the use of inappropriate and non-robust regression techniques to calibrate the sunspot data.
Keywords
Sunspot number, Historic reconstructions, Calibration, Regression techniques
1 Introduction
Articles 1 and 2 of this series (Lockwood et al., 2016a, 2016b) provide evidence that the new “backbone” group sunspot number \([R_{\mathrm{BB}}]\) proposed by Svalgaard and Schatten (2016) overestimates sunspot numbers as late as Solar Cycle 17 and that this overestimation increases as one goes back in time. There is also some evidence that most of the overestimation grows in discrete steps, which could imply a systematic problem with the ordinary linear-regression techniques used by Svalgaard and Schatten to “daisy-chain” the calibration from modern values back to historic ones. This daisy-chaining is unavoidable in this context unless a method is used to calibrate historic (pre-photographic) data with modern data without relating both to data taken during the interim. (Note that one such method, which avoids both regressions and daisy-chaining, has recently been developed by Usoskin et al. (2016).) As discussed in Articles 1 and 2, the regressions used are of particular concern because the daisy-chaining means that both random and systematic errors are amplified as one goes back in time.
As one reads the article by Svalgaard and Schatten (2016), one statement stands out and raises immediate concerns in this context: “Experience shows that the regression line almost always very nearly goes through the origin, so we force it to do so …” To understand the implications of this, consider two observers A and B, recording annual mean sunspot-group numbers \(R_{\mathrm{A}}\) and \(R_{\mathrm{B}}\), respectively. If observer B has lower visual acuity than A, then \(R_{\mathrm{B}} \leq R_{\mathrm{A}}\). This may be caused by B having a lower-resolution and/or less well-focused telescope, or one that gives higher scattered-light levels. It may also be caused by the keenness of observer B’s eyesight and how conservative he/she was in making the subjective decisions to define spots and/or spot groups from what he/she saw. In addition, the local atmospheric conditions may have hindered observer B (greater aerosol concentrations, more mists or thin cloud). Forcing the fits through the origin means that \(R_{\mathrm{A}} = 0\) when \(R_{\mathrm{B}} = 0\) and vice versa. When the higher-acuity observer A sees no spot groups, the lower-acuity observer B should not detect any either, and so \(R_{\mathrm{A}}\) and \(R_{\mathrm{B}}\) should indeed both be zero in this case. However, there will, in general, have been times when observer A could detect groups but observer B could not, and so \(R_{\mathrm{A}} > 0\) when \(R_{\mathrm{B}} = 0\). Thus any linear-regression fit used to scale \(R_{\mathrm{B}}\) to \(R_{\mathrm{A}}\) should not, in general, pass through zero as Svalgaard and Schatten (2016) forced all of their fits to do. There is no advantage gained by forcing the fits through the origin (if anything, fits are easier to make without this restriction) but, as discussed in this article, it introduces the potential for serious error.
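The argument above can be illustrated with a minimal sketch in Python. It is not the procedure used in this article or by Svalgaard and Schatten (2016): the data are synthetic and noise-free, the assumed true relation \(R_{\mathrm{A}} = 1.0 + 1.2 R_{\mathrm{B}}\) and the names `fit_free` and `fit_through_origin` are hypothetical, and the fits minimise the verticals.

```python
def fit_free(x, y):
    """OLS of y on x (verticals minimised), intercept free. Returns (intercept, slope)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def fit_through_origin(x, y):
    """OLS of y on x with the intercept forced to zero. Returns the slope."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Hypothetical annual means over one cycle for the lower-acuity observer B.
r_b = [0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 6.0, 4.0, 2.0, 1.0, 0.5]

# Assumed (illustrative) true relation: A still sees groups when B sees none.
TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 1.2
r_a = [TRUE_INTERCEPT + TRUE_SLOPE * x for x in r_b]

intercept, slope = fit_free(r_b, r_a)
slope_origin = fit_through_origin(r_b, r_a)

# Compare calibrated values of R_B with the true R_A at cycle peak and minimum.
peak_b, min_b = max(r_b), min(r_b)
peak_true = TRUE_INTERCEPT + TRUE_SLOPE * peak_b
min_true = TRUE_INTERCEPT + TRUE_SLOPE * min_b
peak_forced = slope_origin * peak_b
min_forced = slope_origin * min_b

print(f"free fit:      intercept={intercept:.3f}, slope={slope:.3f}")
print(f"forced origin: slope={slope_origin:.3f}")
print(f"peak: true={peak_true:.2f}, forced={peak_forced:.2f}")
print(f"min:  true={min_true:.2f}, forced={min_forced:.2f}")
```

In this toy case the free fit recovers the assumed relation, whereas the forced fit steepens the slope to compensate for the missing intercept, so the calibrated peak is too high and the calibrated minimum too low.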
Another concern is that the errors in the data may not meet the requirements set by the assumptions of ordinary least-squares (OLS) fitting algorithms, and this possibility should always be tested for using the fit residuals. Failure of these tests means an inappropriate fitting procedure has been used or the noise in the data is distorting the fit. In addition, OLS can be applied by minimising the perpendiculars to the best-fit line or by minimising the verticals to the fit line. It can be argued that this choice should depend on the relative magnitudes of the errors in the fitted parameters. Another possibility that we consider here is that the effect of reduced acuity of observer B may vary with the level of solar activity, leading to nonlinearity in the relationship between \(R_{\mathrm{A}}\) and \(R_{\mathrm{B}}\) (see Usoskin et al. 2016 for evidence of this effect). We here also investigate the effects of using the linear ordinary least-squares fits used by Svalgaard and Schatten (2016) under such circumstances.
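The two minimisation choices can be sketched as follows. This is an illustrative Python example rather than the code used for the fits in this article: the data values and the function names are hypothetical, and the perpendicular fit (often called orthogonal or total least squares) is written in its closed form for a straight line, which assumes both variables are in the same units, as \(R_{\mathrm{A}}\) and \(R_{\mathrm{B}}\) are.

```python
import math


def moments(x, y):
    """Return the means and the centred sums of squares and cross-products."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return xbar, ybar, sxx, syy, sxy


def fit_verticals(x, y):
    """Line minimising the squared vertical distances. Returns (intercept, slope)."""
    xbar, ybar, sxx, _, sxy = moments(x, y)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def fit_perpendiculars(x, y):
    """Line minimising the squared perpendicular distances (requires sxy != 0)."""
    xbar, ybar, sxx, syy, sxy = moments(x, y)
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    return ybar - slope * xbar, slope


# Hypothetical annual means with scatter in both variables.
r_b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
r_a = [1.4, 2.1, 3.9, 4.2, 6.3, 6.6]

intercept_v, slope_v = fit_verticals(r_b, r_a)
intercept_p, slope_p = fit_perpendiculars(r_b, r_a)

print(f"verticals:      R_A = {intercept_v:.3f} + {slope_v:.3f} R_B")
print(f"perpendiculars: R_A = {intercept_p:.3f} + {slope_p:.3f} R_B")
```

For positively correlated data the perpendicular slope lies between the slope from regressing \(R_{\mathrm{A}}\) on \(R_{\mathrm{B}}\) and the inverse of the slope from regressing \(R_{\mathrm{B}}\) on \(R_{\mathrm{A}}\), so the two choices generally rescale peak values differently.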
Figure 1b illustrates the effects of using a linear fit if observer B’s lower acuity has more effect at low sunspot numbers than at high ones, giving a nonlinear (quadratic) relationship. In this case, a linear regression with nonzero intercept causes inflation of both the highest and the lowest values but lowers those around the average. Figure 1c shows the effects of both using a linear fit and making it pass through the origin, as employed by Svalgaard and Schatten (2016): in this case the effects are as in Figure 1a but the nonlinearity makes them more pronounced.
Nonlinearity between the two variables is just one of the main pitfalls in OLS regression. These can arise because the data violate one of the four basic assumptions that are inherent in the technique and that justify the use of linear regression for purposes of inference or prediction. The other pitfalls are a lack of statistical independence of the errors in the data; heteroscedasticity in the errors (they vary systematically with the fit parameters); and cases for which the errors are not normally distributed (about zero). In particular, one or more large-error datapoints can exert undue “leverage” on the regression fit. If one or more of these assumptions is violated (i.e. if there is a nonlinear relationship between the variables or if their errors exhibit correlation, heteroscedasticity, or a non-Gaussian distribution) then the forecasts, confidence intervals, and scientific insights yielded by a regression model may be seriously biased or misleading. If the fit is correct, then the fit residuals will reflect the errors in the data and so we can apply tests to the residuals to check that none of the assumptions has been invalidated. Nonlinearity is often evident as a systematic pattern when one plots the fit residuals against either of the regressed variables. For regression of time-series data, lack of independence of the errors is seen as high persistence of the fit-residual time series. Lack of homoscedasticity is apparent from scatter plots because the scatter increases systematically with the variables. A normal distribution of fit residuals can be readily tested for using a Quantile–Quantile (“Q–Q”) plot (e.g. Wilk and Gnanadesikan 1968). This is a graphical technique for determining if two datasets come from populations with a common distribution; hence by making one of the datasets normally distributed we can test the other to see if it also has a normal distribution. Erroneous outliers and lack of linearity can also be identified from such Q–Q plots.
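A Q–Q test of this kind can be sketched with the Python standard library alone. The example below is illustrative only: the residual values are synthetic, the plotting positions \((i + 0.5)/n\) are one common convention among several, and the `straightness` summary (the correlation of the Q–Q points) is an assumption added here for convenience, not a statistic used in this article, where the plots are inspected by eye.

```python
import math
from statistics import NormalDist, mean, pstdev


def qq_points(residuals):
    """Return (theoretical, observed) standardised quantiles for a normal Q-Q plot."""
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    observed = sorted((r - mu) / sigma for r in residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, observed


def straightness(theoretical, observed):
    """Pearson correlation of the Q-Q points; 1.0 means they lie on a straight line."""
    n = len(theoretical)
    mt, mo = sum(theoretical) / n, sum(observed) / n
    cov = sum((t - mt) * (o - mo) for t, o in zip(theoretical, observed))
    var_t = sum((t - mt) ** 2 for t in theoretical)
    var_o = sum((o - mo) ** 2 for o in observed)
    return cov / math.sqrt(var_t * var_o)


# Synthetic residuals that follow a normal distribution by construction.
gaussian_like = [NormalDist(0.0, 0.3).inv_cdf((i + 0.5) / 20) for i in range(20)]
# The same residuals with the largest one replaced by a gross outlier.
with_outlier = gaussian_like[:-1] + [3.0]

r_good = straightness(*qq_points(gaussian_like))
r_bad = straightness(*qq_points(with_outlier))

print(f"normal residuals: {r_good:.3f}")
print(f"with one outlier: {r_bad:.3f}")
```

Plotting `observed` against `theoretical` gives the Q–Q plot itself; departures from the diagonal in the tails correspond to the non-Gaussian tails discussed in the text.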
If outliers are at large or small values they can have a very large influence on a linear regression fit: such points can be identified because they have a large Cook’s D (leverage) factor (Cook 1977) and should be removed and the data refitted. There is no one standard approach to regression that can be applied and implicitly trusted. There are many options that must be investigated, and the above tests must be applied to ensure that the best option is used and that the results are statistically robust. In addition to OLS, we here employ nonlinear regression (using second-order and third-order polynomials), Median Least Squares (MLS) and Bayesian Least Squares (BLS). The MLS and BLS procedures were discussed by Lockwood et al. (2006).
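As a minimal sketch of how such points can be flagged, the Python example below computes Cook’s distance for a straight-line fit that minimises the verticals. The data are hypothetical, the function name `cooks_distance` is introduced here for illustration, and the threshold of 1 is a commonly quoted rule of thumb rather than a criterion taken from this article.

```python
def cooks_distance(x, y):
    """Cook's distance of every point for a straight-line OLS fit of y on x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Unbiased estimate of the residual variance.
    s2 = sum(e * e for e in residuals) / (n - p)
    # Leverage of each point (diagonal of the hat matrix for a straight line).
    leverage = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [
        (e * e / (p * s2)) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]


# Hypothetical data: the last point lies far from the others and off their trend.
r_b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 12.0]
r_a = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0]

distances = cooks_distance(r_b, r_a)
flagged = [i for i, d in enumerate(distances) if d > 1.0]  # rule-of-thumb threshold

for i, d in enumerate(distances):
    print(f"point {i}: D = {d:.3f}")
print("flagged:", flagged)
```

After removing a flagged point the fit should be repeated and the residual tests reapplied, since removing one point changes the leverage of the others.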
The results presented in this article show that linear regression fits in the context of intercalibrating sunspot-group numbers can violate the inherent assumptions and lead to some very large errors, even though the correlation coefficients are high. In Section 2, we present one example in which intercalibration over two full sunspot cycles (1953 – 1975) can produce an inflation of sunspot peak values of over 30 % even when the correlation between \(R_{\mathrm{A}}\) and \(R_{\mathrm{B}}\) exceeds 0.98. This is a significant error. To put it into some context, Svalgaard (2011) pointed out a probable discontinuity in sunspot numbers around 1945 that has been termed the “Waldmeier discontinuity”. Svalgaard quantified it as a 20 % change but Lockwood, Owens, and Barnard (2014) and Lockwood et al. (2016a) find it is \(11.9 \pm 0.6\) % and Lockwood, Owens, and Barnard (2016) find it to be 10 %. (The latter estimate is lower because it is the only one not to assume proportionality.) Hence 30 % is a very significant number for one intercalibration, let alone when it is combined with the effect of others in a series of intercalibrations. In Section 3 we present a second example interval (1923 – 1945, when solar activity was lower) to see if it reveals the same effects.
Lastly, we note that we here employ annual means to be consistent with Svalgaard and Schatten (2016). We do not test for any effects of this in the present article but it does cause additional concerns when the data are sparse. This is because observers A and B may have been taking measurements on different days and, because of factors such as regular annual variations in cloud obscuration, their data could even mainly come from different phases of the year. This may therefore not be a random error, which would again invalidate the assumptions of ordinary least-squares regression. Usoskin et al. (2016) show this effect can be highly significant for sparse data and Willis, Wild, and Warburton (2016) show it even needs to be considered when using the earliest (before 1885) data from the Royal Observatory, Greenwich.
In the present article, we make use of the photoheliographic measurements from the Royal Observatory, Greenwich and the Royal Greenwich Observatory (here collectively referred to as the “RGO” data). We employ the version of the RGO data made available by the Space Physics website (solarscience.msfc.nasa.gov/greenwhch.shtml) of the Marshall Space Flight Center (MSFC), which has been compiled, maintained and corrected by D. Hathaway. These data were downloaded in June 2015. As noted by Willis et al. (2013b), there are some small differences between these MSFC data and versions of the RGO data stored elsewhere (notably those in the National Geophysical Data Center, NGDC, Boulder). We here use only data for 1923 – 1976, for which these differences are minimal. The use of this interval also avoids all times when the calibration of the RGO data has been questioned (Cliver and Ling 2016; Willis, Wild, and Warburton 2016).
2 Study of 1953 – 1975
2.1 Distribution of Sunspot Group Areas
2.2 Variations of \(R_{\mathrm{A}}\) and \(R_{\mathrm{B}}\) and Fits of \(R_{\mathrm{B}}\) to \(R_{\mathrm{A}}\)
Fit procedures employed.
Fit  Line colour in figures  Fit type    Assumed variation     Parameter minimised       Treatment of intercept
1    Blue                    OLS         linear                r.m.s. of perpendiculars  Not forced through origin
2    Green                   OLS         linear                r.m.s. of verticals       Not forced through origin
3    Red                     OLS         linear                r.m.s. of perpendiculars  Forced through origin
4    Orange                  OLS         linear                r.m.s. of verticals       Forced through origin
5    Brown                   Polynomial  2nd-order polynomial  r.m.s. of perpendiculars  Not forced through origin
6    –                       MLS         linear                r.m.s. of perpendiculars  Not forced through origin
7    –                       BLS         linear                r.m.s. of perpendiculars  Not forced through origin
8    Cyan                    Polynomial  3rd-order polynomial  r.m.s. of perpendiculars  Not forced through origin
Fits using median least squares (MLS, fit 6) and Bayesian least squares (BLS, fit 7) were also made but were no better than the comparable OLS fit (fit 1). We also attempted successive removal of the largest outliers to try to make the fits converge to a stable result, but again no improvement was obtained for any of these linear fits. This left just one assumption to test, namely that the variation of \(R_{\mathrm{B}}\) with \(R_{\mathrm{A}}\) is linear. A least-squares fit of a second-order polynomial was carried out (fit 5): this is shown by the brown lines in Figures 3 and 4. This appears to remove the problem of the exaggerated peak values. Note that for this fit one outlier data point has been removed (see below). In addition, a third-order polynomial fit was carried out (fit 8): the Q–Q plot for this fit is shown by the cyan points in Figure 5f and it can be seen that this fit generates some non-Gaussian tails to the distribution.
In Figure 5e, the open triangles show the results for the second-order polynomial fit to all datapoints and the point in the upper tail of the distribution is seen to be non-Gaussian. This arises from the outlier data point that can be seen in Figure 3 at \(R_{\mathrm{A}} \approx 9.3\), \(R_{\mathrm{B}} \approx 3.2\). The solid circles are for the fit after this outlier has been removed and the remaining points can now be seen to give an almost perfect Gaussian distribution of residuals, and so the fit is robust. The brown lines in Figures 3 and 4 show the results of this fit with the outlier removed. The largest outlier was also removed for all other fits, but fit 5 was the only one for which the Q–Q plot was significantly improved. Note that for the test done here, the fits are never used outside the range of values that were used to make the fit. However, this would not necessarily be true of an intercalibration between two daisy-chained data segments, and very large errors could occur if there is nonlinearity and one is extrapolating to values outside the range used for calibration fitting.
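The extrapolation hazard can be illustrated with a small Python sketch. It is not the fit 5 procedure: the quadratic relation and its coefficients are assumed for illustration, the data are noise-free, the helper names are hypothetical, and the polynomial fit here minimises the verticals, whereas the table of fit procedures lists the perpendiculars for the polynomial fits.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def polyfit(x, y, degree):
    """Least-squares polynomial fit via the normal equations; lowest order first."""
    k = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(k)] for i in range(k)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(a, b)


def evaluate(coeffs, x):
    """Evaluate a polynomial whose coefficients are given lowest order first."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Assumed (illustrative) nonlinear relation between the two observers.
TRUE = [1.0, 1.5, -0.05]
r_b = [float(i) for i in range(9)]        # calibration range: 0 to 8
r_a = [evaluate(TRUE, x) for x in r_b]

linear = polyfit(r_b, r_a, 1)
quadratic = polyfit(r_b, r_a, 2)

# Largest linear-fit error inside the calibration range ...
in_range_error = max(abs(evaluate(linear, x) - evaluate(TRUE, x)) for x in r_b)
# ... and the error when the linear fit is used well outside it.
outside = 12.0
extrapolation_error = abs(evaluate(linear, outside) - evaluate(TRUE, outside))

print("quadratic coefficients:", [round(c, 6) for c in quadratic])
print(f"linear fit, worst in-range error: {in_range_error:.3f}")
print(f"linear fit, error at R_B = {outside}: {extrapolation_error:.3f}")
```

In this toy case the linear fit is tolerable inside the calibration range but its error grows rapidly beyond it, which is the situation that can arise when daisy-chained segments cover different ranges of activity.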
2.3 Effect of the Threshold \(A_{\mathrm{th}}\)
3 Study of 1923 – 1945
4 Discussion and Conclusions
Our tests of regression procedures, comparing the original RGO sunspot-group area data with a deliberately degraded version of the same data, show that there is no one definitive method that ensures the regressions derived are robust and accurate. Certainly the correlation coefficient is not a valuable indicator: very high correlations are necessary but very far from sufficient.
The one definitive statement that we can make is that forcing fits through the origin is a major mistake. It causes solar-cycle amplitudes to be inflated so that peak values in the lower-acuity data are too high and both minimum and mean values are too low. This is the method used by Svalgaard and Schatten (2016), and our findings show that it will have contributed to a false upward drift in their backbone group number reconstruction \([R_{\mathrm{BB}}]\) values as one goes back in time. At the time of writing we do not have the original data to check the effect on both the regressions used to intercalibrate backbones and any regressions used to combine data into backbones. Both will be subject to this effect. Hence we cannot tell whether or not this explains all of the differences between, for example, the long-term changes in \(R_{\mathrm{BB}}\) and the terrestrial data (ionospheric, geomagnetic, and auroral) discussed in Articles 1 and 2. However, it will have contributed to these differences. Note that all of the above also applies to any technique based on the ratio \(R_{\mathrm{A}} /R_{\mathrm{B}}\) as that also forces the fit through the origin.
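The link between a forced zero intercept and inflated amplitudes can be made explicit with a short worked example. This is a sketch under stated assumptions, not a result derived from the RGO fits: suppose the true relation is linear, \(R_{\mathrm{A}} = a + b R_{\mathrm{B}}\) with \(a > 0\), the data are noise-free with all \(R_{\mathrm{B}} > 0\), and the fit minimises the verticals. The symbols \(a\), \(b\), \(s\), \(k\) and \(N\) (the number of annual means) are introduced here for illustration only. The through-origin slope \(s\) and the mean ratio \(k\) are then

\[
s = \frac{\sum R_{\mathrm{A}} R_{\mathrm{B}}}{\sum R_{\mathrm{B}}^{2}} = b + a\,\frac{\sum R_{\mathrm{B}}}{\sum R_{\mathrm{B}}^{2}} > b,
\qquad
k = \frac{1}{N}\sum \frac{R_{\mathrm{A}}}{R_{\mathrm{B}}} = b + \frac{a}{N}\sum \frac{1}{R_{\mathrm{B}}} > b .
\]

The calibrated value \(s R_{\mathrm{B}}\) exceeds the true value \(a + b R_{\mathrm{B}}\) whenever

\[
R_{\mathrm{B}} > \frac{\sum R_{\mathrm{B}}^{2}}{\sum R_{\mathrm{B}}} \geq \frac{1}{N}\sum R_{\mathrm{B}},
\]

where the second inequality follows from the Cauchy–Schwarz inequality. Under these assumptions, values near the cycle peak are therefore overestimated, while the minimum and the mean are underestimated.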
Lastly, it is not clear which procedure should be used to daisy-chain the calibrations. Ordinary least-squares fits work well only when the Q–Q plots show a good normal distribution of residuals. Even then, minimising the verticals gives the best answer for the mean values but minimising the perpendiculars gives the best answer for the peak values. The failures in the Q–Q plots appear to be mainly because the dependence is not linear, and a nonlinear fit then works well. We used a second-order polynomial and the fitted \(R_{\mathrm{B}}^{2}\) term is found to be relatively small (meaning it is a near-linear fit) and hence this seems to have been adequate, at least for the cases we studied. However, we note that this should not be used for values that are outside the range seen during the intercalibration interval because the dependence of the extrapolation on the polynomial used is then extremely large.
Acknowledgements
The authors are grateful to David Hathaway and the staff of the Solar Physics Group at NASA’s Marshall Space Flight Center for maintaining the online database of RGO data used here. The work of M. Lockwood, M.J. Owens, and L. Barnard at Reading was funded by STFC consolidated grant number ST/M000885/1 and that of I.G. Usoskin was done under the framework of the ReSoLVE Center of Excellence (Academy of Finland, project 272157).
Funding information
Funder Name  Grant Number  Funding Note 

Science and Technology Facilities Council (GB) 

Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.