Introduction

Power system load forecasting investigates the changing pattern of the power load, seeks intrinsic correlations between power load and the factors that influence it, and then forecasts the future load scientifically based on corresponding historical data [1]. In particular, long-term system load forecasting plays an important role in power system planning.

Classical methods for long-term system load forecasting fall mainly into three categories: time series models [2–5], correlation models [6–9] and artificial intelligence models [9–13]. Time series models forecast the future load based on the historical load data, so the underlying assumption is that the future load will follow the same trend as its past. Thus, the forecasting error will significantly increase once the trend changes. In short, the major problem of common time series methods is that they do not adapt well to a changing environment. Meanwhile, because of the complicated set of factors influencing the power load and insufficient annual data, correlation models and artificial intelligence models sometimes cannot work well either.

For several reasons, long-term system load forecasting for a developed city becomes a problem. When a city enters the late stage of urbanization, economic growth slows down, industry is restructuring, and the population starts to saturate [14]. Under such circumstances, the load trend is changing from the fast growing stage to the saturating and fluctuating stage, which weakens the regularity and makes it difficult to conduct accurate forecasting work. Take Shanghai as an example. The average annual growth rate of power consumption in Shanghai from 2001 to 2010 was 8.83%. However, since 2011, the average rate sharply dropped to 1.42% over the next 4 years, with an unprecedented negative growth of −1.42% in 2014. The official value of the average annual growth rate used for planning for Shanghai is 3.2% from 2016 to 2020, which is also much lower than before [15], while contrasting with the experience of the previous 5-year period. In fact, there are quite a number of cities in this transitional period around the world, so it is timely to propose a corresponding effective forecasting method.

In recent years, two main trends have appeared in the development of long-term system load forecasting research. One is that hybrid models are gradually becoming the mainstream. Reference [16] proposes a hybrid model combining dynamic and fuzzy time series approaches to forecast the power consumption in household, commerce and industry respectively. Reference [17] utilizes an Ensemble Empirical Mode Decomposition method to extract the electricity consumption characteristics in multiple time scales, and then constructs a relational model between these characteristics and their influencing factors to improve forecasting. Reference [18] constructs a semi-parametric model to investigate the uncertainties in mid-to-long-term forecasting and estimate the probability distribution of the future load, while a novel Kullback-Leibler divergence-based similarity measure strategy is combined to identify the significant impact factors. A Grey model optimized by the Ant Lion Optimizer and a regression model optimized by Improved Particle Swarm Optimization are proposed in [19, 20], respectively. Hybrid models incorporate advantages of different single models so that changing load patterns can be better described, and forecasting accuracy can be improved.

Another research trend in the long-term system load forecasting field is that the “big data” concept is gaining increasing attention [21]. With the development of Smart Grids with Advanced Metering Infrastructure, massive power consumption data are available at different network levels, providing a new opportunity to understand the intrinsic characteristics of the power load and improve the forecasting accuracy. Reference [22] investigates how many lagged hourly temperatures and moving average temperatures are needed in a regression model based on a massive load and temperature dataset. In order to prepare for forecasting, [23–25] utilize high-resolution data at an hourly interval to analyze and recognize the load pattern. In particular, clustering methods are widely used for load forecasting, especially in big-data analyses. In [26], customers are grouped according to consumption similarities, and system load forecasting is improved by combining the forecasting results of each group. References [27, 28] apply the hierarchical clustering method to put similar load curves into one cluster, and forecast the future load of each cluster respectively. Reference [29] introduces the Fuzzy Hopfield Neural Network to classify the hourly load curves based on the date information in order to weight different forecasting models. References [30, 31] extract and analyze the load pattern based on the clustering results, while spectral clustering and functional clustering are utilized to prepare for load forecasting in [32, 33]. In summary, by putting similar objects into one cluster, clustering methods can help to recognize load patterns, weight different models, simplify calculation and prepare for model construction in the load forecasting field.

Based on the big data idea, this paper proposes a data-driven linear clustering (DLC) method to improve the stability and accuracy of long-term system load forecasting. A large substation load dataset is utilized to investigate the composition of the system load and to reveal its changing pattern.

Two major contributions of this paper are:

  1) The data-driven forecasting idea is introduced to address the forecasting difficulties caused by load fluctuation in developed cities. Based on an autoregressive integrated moving average (ARIMA) model, it is theoretically proved in this paper that the data-driven method is effective in reducing the random forecasting errors so that it can better adapt to the changing environment.

  2) A novel linear clustering method is proposed to put complementary substation load curves into the same cluster. After clustering, more accurate ARIMA models can be constructed so that the forecasting error can be further reduced.

The rest of the paper is organized as follows. In Sect. 2, the proposed DLC method is introduced. In Sect. 3, the forecasting error is analyzed, while the results of applying the method to load data are demonstrated in Sect. 4. Finally, Sect. 5 concludes this paper.

Data-driven linear clustering method

Among all load forecasting problems, long-term system load forecasting has its own characteristics. In terms of time scale, it usually forecasts the annual power load in the next few years or even decades based on annual historical data, which means a low data quantity and resolution. In terms of spatial scale, system load forecasting focuses on a load system such as a city, a province or even the whole country, so that the load level is usually high and the load curve is relatively smooth.

However, forecasting methods based on annual system load data are increasingly unable to capture the load trend in the transitional period mentioned above. So, in order to improve the forecasting accuracy, we need more detailed information about the load system to better understand its structure and inherent regularity. In this case, we propose the DLC method to conduct long-term system load forecasting based on a large substation load dataset. Two main parts are included in this method: the linear clustering preprocessing part and the optimal ARIMA modelling and forecasting part. The large substation load dataset, comprising load time series at annual intervals, is first clustered by the proposed linear clustering method. Then the time series of the summed load of each cluster is modelled and forecast using an optimal ARIMA model. Finally, the system load forecasting results are obtained by summing up all the ARIMA forecasting results.
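The data flow described above (cluster, forecast each cluster, sum) can be sketched in a few lines. This is only an illustrative outline under stated assumptions: the function names `dlc_forecast`, `no_clustering` and `last_value_forecast` are hypothetical, and the last-value forecaster is a trivial stand-in for the optimal ARIMA model, used here only so that the sketch runs without external libraries.

```python
def dlc_forecast(subsequences, cluster_fn, forecast_fn, horizon):
    """Cluster the subsequences, forecast each cluster sum, then add the forecasts.

    cluster_fn  : maps a list of subsequences to a list of summed cluster series
    forecast_fn : maps (series, horizon) to a list of `horizon` future values
    """
    cluster_sums = cluster_fn(subsequences)
    cluster_forecasts = [forecast_fn(series, horizon) for series in cluster_sums]
    # System forecast = sum of the cluster forecasts at each future step
    return [sum(f[tau] for f in cluster_forecasts) for tau in range(horizon)]


def no_clustering(subsequences):
    """Placeholder clustering (hypothetical): every subsequence is its own cluster."""
    return [list(s) for s in subsequences]


def last_value_forecast(series, horizon):
    """Placeholder forecaster (NOT ARIMA): repeat the last observed value."""
    return [series[-1]] * horizon


if __name__ == "__main__":
    # Two hypothetical substation series with three annual samples each
    loads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
    print(dlc_forecast(loads, no_clustering, last_value_forecast, horizon=2))
```

Passing the clustering and forecasting steps as arguments keeps the outline independent of any particular model; in practice they would be replaced by the linear clustering and ARIMA steps of this section.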

The flow chart of the DLC method is shown in Fig. 1.

Fig. 1 Flow chart of the proposed DLC method

Linear clustering preprocessing

Suppose \(y_{t}\ (t = 1,2, \ldots ,T)\) is the time series of the system power load, and \(y_{t}\) is composed of subsequences \(y_{k,t}\ (k = 1,2, \ldots ,N)\), where N is the number of subsequences and T is the number of time samples, usually at annual intervals. \(y_{k,t}\) could be a substation load series, or the load series of a district, and so on. Then we have:

$$\sum\limits_{k = 1}^{N} {y_{k,t} = y_{t} }$$
(1)

The proposed linear clustering preprocessing method aims to smooth the multiple substation or district load series in such a way that the modelling accuracy is improved. Linear clustering here refers to the clustering criteria. Traditional clustering methods usually classify objects into classes according to a measure of similarity. In order to prepare for modelling and forecasting, we cluster subsequences into classes such that the sum of subsequences in a class has a better linear property than the sum of all subsequences in the dataset. A better linear property means more obvious regularity so that modelling accuracy could be better.

Therefore, the proposed linear clustering is indeed an optimization problem to find the optimal clustering that provides the best global linearity, which can be described by:

$$\min \sum\limits_{i = 1}^{M} {f_{\text{RMS}} (S_{i,t} - S_{i,t}^{*} )}$$
(2)
$${\text{s.t.}} \left\{ {\begin{array}{l} {S_{i,t} = \sum {y_{k,t} } ,\quad i = 1,2, \ldots ,M} \\ {\sum\limits_{i = 1}^{M} {S_{i,t} } = \sum\limits_{k = 1}^{N} {y_{k,t} } = y_{t} } \\ \end{array} } \right.$$

where \(S_{i,t}\) is the sum of the subsequences in cluster i, \(i = 1,2, \ldots ,M\), \(M \le N\); \(S_{i,t}^{*}\) is the corresponding linear fitting series; \(f_{\text{RMS}}\) is the root mean square (RMS) calculation:

$$f_{\text{RMS}} (x) = \sqrt {\frac{1}{n}(x_{1}^{2} + x_{2}^{2} + { \ldots }{ + }x_{n}^{2} )}$$
(3)

where x is an n-dimension vector and \(x = (x_{1} ,x_{2} , \ldots ,x_{n} ).\)

In order to solve this problem, the following iterative algorithm is described:

  • Step 1: Construct a least-squares linear fitting model for each subsequence \(y_{k,t}\) and calculate the corresponding RMS value of the linear fitting residual, denoted \(u_{k}\), \(k = 1,2, \ldots ,N\), as a linearity measurement for each original subsequence.

  • Step 2: Find the subsequence with the maximum RMS value \(u_{k\max }\) from Step 1 and mark it as \(y_{k\max ,t}\). Then \(y_{k\max ,t}\) is the subsequence with the most obvious fluctuation and usually the most difficult one to construct an accurate model for. Therefore, \(y_{k\max ,t}\) is our major optimization target for this iteration.

  • Step 3: Construct new linear fitting models for the sum series \(Y_{j,t} = y_{k\max ,t} + y_{j,t}\), \(j = 1,2, \ldots ,N\), \(j \ne k\max\), and calculate the corresponding RMS values of the fitting residual, marked as \(U_{j}\). This step is to see whether there is any other subsequence that can be summed with \(y_{k\max ,t}\) to improve the linear fit.

  • Step 4: Find the minimum value of \(U_{j}\) from Step 3 and mark it as \(U_{j\min }\). If

    $$U_{j\min } < u_{k\max }$$
    (4)

    it means that there exists a subsequence \(y_{j\min ,t}\) that can be summed with \(y_{k\max ,t}\) to improve the linear fit. In this case, we replace \(y_{j\min ,t}\) and \(y_{k\max ,t}\) by their sum \(Y_{j\min ,t}\), so that the number of subsequences decreases by one, and go back to Step 1. The iteration stops when \(U_{j\min } \ge u_{k\max }\), which means the subsequences cannot be smoothed any further by summation.

After such linear clustering preprocessing, the smoothness of the subsequences is improved while their number is reduced, which are better conditions for modelling and forecasting.
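Steps 1 to 4 can be sketched as a greedy merging loop. The following is a minimal sketch rather than the authors' implementation: the function names `linear_fit_rms` and `linear_clustering` are hypothetical, the straight line is fitted against the time index t = 1, …, T, ties are broken by taking the first maximum or minimum, and only the merge test in (4) is used.

```python
from math import sqrt


def linear_fit_rms(series):
    """RMS of the residuals of a least-squares straight line fitted against t = 1..T."""
    n = len(series)  # assumes n >= 2
    t = range(1, n + 1)
    t_mean = sum(t) / n
    y_mean = sum(series) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, series)]
    return sqrt(sum(r * r for r in residuals) / n)


def linear_clustering(subsequences):
    """Greedy merging following Steps 1-4; returns (cluster sums, member indices)."""
    clusters = [list(s) for s in subsequences]
    members = [[k] for k in range(len(clusters))]
    while len(clusters) > 1:
        # Step 1: linearity measure u_k of every current series
        u = [linear_fit_rms(s) for s in clusters]
        # Step 2: the least linear series is the target of this iteration
        k_max = max(range(len(clusters)), key=u.__getitem__)
        # Step 3: try summing the target with every other series
        best_j, best_U, best_sum = None, None, None
        for j, s in enumerate(clusters):
            if j == k_max:
                continue
            summed = [a + b for a, b in zip(clusters[k_max], s)]
            U = linear_fit_rms(summed)
            if best_U is None or U < best_U:
                best_j, best_U, best_sum = j, U, summed
        # Step 4: merge only if the best sum is more linear than the target, as in (4)
        if best_U >= u[k_max]:
            break
        merged = members[k_max] + members[best_j]
        for idx in sorted((k_max, best_j), reverse=True):
            del clusters[idx]
            del members[idx]
        clusters.append(best_sum)
        members.append(merged)
    return clusters, members


if __name__ == "__main__":
    # Hypothetical data: two complementary zigzag series and one linear series
    data = [[10, 13, 10, 13, 10, 13], [13, 10, 13, 10, 13, 10], [1, 2, 3, 4, 5, 6]]
    sums, groups = linear_clustering(data)
    print(groups)
```

Because each merge replaces two series by their sum, the total over all clusters always equals the system load, which is the second constraint in (2).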

Optimal ARIMA modelling and forecasting

The ARIMA model proposed by Box and Jenkins in the 1970s performs well in describing and forecasting a time series [34]. Therefore, we use it to forecast the summed load of each cluster and to analyze the load forecasting error. The ARIMA (p,d,q) model can be described by:

$$y_{t} = \varphi_{1} y_{t - 1} + \varphi_{2} y_{t - 2} + \cdots + \varphi_{p} y_{t - p} + \varepsilon_{t} - \theta_{1} \varepsilon_{t - 1} - \theta_{2} \varepsilon_{t - 2} - \cdots - \theta_{q} \varepsilon_{t - q}$$
(5)

where \(\varepsilon_{t}\) is white noise; \(\varphi\) and \(\theta\) are the coefficients. We can see that there are two parts contained in the ARIMA model: the autoregressive (AR) part

$$y_{t} = \varphi_{1} y_{t - 1} + \varphi_{2} y_{t - 2} + \cdots + \varphi_{p} y_{t - p} + \varepsilon_{t}$$
(6)

and the moving average (MA) part

$${\kern 1pt} y_{t} = \varepsilon_{t} - \theta_{1} \varepsilon_{t - 1} - \theta_{2} \varepsilon_{t - 2} - \cdots - \theta_{q} \varepsilon_{t - q}$$
(7)

The AR part describes the remembered characteristics of past system states, while the MA part reflects the influence of noise on the current system state. p and q are the corresponding orders of the two parts. Because the ARIMA model is only suitable for stationary series, differencing preprocessing is required if the series is not stationary [35], with d representing the differencing order.
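As a small illustration of the differencing order d, the sketch below applies d-th order differencing to a series and maps first-order differenced forecasts back to the original scale. The function names `difference` and `integrate` are hypothetical and the numbers are made up; this is not taken from the paper's implementation.

```python
def difference(series, d=1):
    """Apply d-th order differencing; each pass shortens the series by one sample."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def integrate(last_value, diff_forecasts):
    """Undo first-order differencing for forecasts, starting from the last observation."""
    values = []
    level = last_value
    for delta in diff_forecasts:
        level += delta
        values.append(level)
    return values


if __name__ == "__main__":
    load = [1, 3, 6, 10]              # hypothetical annual series
    print(difference(load, d=1))      # first differences
    print(difference(load, d=2))      # second differences
    print(integrate(load[-1], [5, 6]))
```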

We construct optimal ARIMA models for \(S_{i,t}\), \(i = 1,2, \ldots ,M\), forecast their future values respectively, and sum them up to obtain the final system load forecast. The corresponding algorithm steps are as follows.

  • Step 1: The Unit Root Test [36] is first applied to the preprocessed series \(S_{i,t}\) to check whether they are stationary or not. Any non-stationary series will be converted into a stationary one by differencing.

  • Step 2: Construct multiple ARIMA (p,d,q) models for each stationary series with different combinations of parameters p and q. Due to limited series length, we limit p and q to a relatively low order in order to avoid overfitting [37]: p = 0,1,2; q = 0,1.

  • Step 3: Among all the ARIMA models constructed in Step 2, find the optimal one for each stationary series by the Akaike information criterion (AIC). This is a criterion to measure the modelling effect considering both the fitting accuracy and the complexity of the constructed model [38]:

    $$AIC = 2n + T\ln (f_{\text{RSS}} /T)$$
    (8)

    where n is the number of parameters in the constructed model; T is the length of the series; \(f_{\text{RSS}}\) is the residual sum of squares, which reflects the modelling accuracy. Generally, the model with the smallest AIC value is the optimal one, so the mathematical description of choosing the optimal ARIMA model for \(S_{i,t}\) is:

    $$\begin{aligned} &\min \; AIC = 2n + T\ln (f_{\text{RSS}} /T) \\ &{\text{s.t.}} \left\{ \begin{aligned} &n = p + q \\ &p = 0,1,2,\; q = 0,1 \\ &q \le p \\ &f_{\text{RSS}} = \sum\limits_{t = 1}^{T} {(S_{i,t} - S_{i,t}^{*} )^{2} } \\ \end{aligned} \right. \end{aligned}$$
    (9)

    where \(S_{i,t}^{*}\) is the ARIMA (p,d,q) fitting value of \(S_{i,t}\).

  • Step 4: Forecast the future value of each preprocessed series \(S_{i,t}\) based on its corresponding optimal ARIMA model selected in Step 3. The forecasting results are denoted by \(S_{i,t + \tau }\), \(\tau = 1,2, \ldots ,\Delta T\), in which ΔT is the forecasting period. Note that ΔT cannot be too large because of a limitation of the ARIMA model [39].

  • Step 5: Sum up all the ARIMA forecasting results to obtain the final system load forecasting results:

    $$S_{t + \tau } = \sum\limits_{i = 1}^{M} {S_{i,t + \tau } }$$
    (10)
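The order selection in Step 3 can be sketched as follows. This is a minimal sketch that assumes the fitting residuals of each candidate ARIMA (p,d,q) model have already been obtained by some fitting routine; the function names `aic` and `select_order` and the residual values are hypothetical. It evaluates (8) with \(f_{\text{RSS}}\) taken as the residual sum of squares and applies the order limits and the constraint q ≤ p from (9).

```python
from math import log


def aic(n_params, residuals):
    """AIC as in (8): 2n + T*ln(RSS/T), with RSS the residual sum of squares (must be > 0)."""
    T = len(residuals)
    rss = sum(r * r for r in residuals)
    return 2 * n_params + T * log(rss / T)


def select_order(candidates):
    """Return the (p, q) pair with the smallest AIC among admissible candidates.

    candidates: dict mapping (p, q) to the list of fitting residuals of that model.
    Admissible orders follow (9): p in {0, 1, 2}, q in {0, 1}, q <= p, n = p + q.
    """
    admissible = {
        (p, q): res
        for (p, q), res in candidates.items()
        if p in (0, 1, 2) and q in (0, 1) and q <= p
    }
    return min(admissible, key=lambda pq: aic(pq[0] + pq[1], admissible[pq]))


if __name__ == "__main__":
    # Hypothetical residuals for three candidate models of one cluster series
    residuals_by_order = {
        (0, 0): [2.0, -2.0, 2.0, -2.0],
        (1, 0): [1.0, -1.0, 1.0, -1.0],
        (2, 1): [0.9, -0.9, 0.9, -0.9],
    }
    print(select_order(residuals_by_order))
```

In this made-up example the higher-order model fits slightly better, but the penalty term 2n outweighs the gain, which is the trade-off between accuracy and complexity that the criterion is meant to capture.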

Forecasting error analysis

When we forecast the power load, the forecasting error mainly consists of two parts: the modelling error and the random error [40]. The modelling error refers to the difference between the model fitting value and the true value. Generally, the smoother the load curve, the smaller the modelling error, so that the constructed model can better fit the pattern of changing load. The random error refers to the forecasting error caused by some random and unpredictable factors that change the original load changing pattern. Thus, in order to improve the forecasting accuracy, we should both improve the modelling accuracy and try to limit the random error.

Here, we analyze the forecasting error of different forecasting methods based on ARIMA model. For simplicity, we make two assumptions in advance [40]:

  1) Because the ARIMA forecasting results mainly depend on the AR part, we assume that the time series of the power load follows the first term of the AR model in (6), denoted the AR(1) pattern:

    $$y_{t} = \varphi_{1} y_{t - 1} + \varepsilon_{t} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} y_{k,t} = \varphi_{k,1} y_{k,t - 1} + \varepsilon_{k,t}$$
    (11)
  2) We assume that the white noise in the time series of power load is White Gaussian Noise (WGN), and that the standard deviation of the noise is proportional to the load level:

    $$\varepsilon_{t} \sim N(0,\sigma^{2} y_{t}^{2} ),{\kern 1pt} \varepsilon_{k,t} \sim N(0,\sigma^{2} y_{k,t}^{2} )$$
    (12)

    where σ > 0 is the proportionality coefficient.

Suppose the modelling error at time t is \(v_{t}^{m}\), and the random error is \(v_{t}^{r}\). Then \(v_{t}^{m}\) depends on the AR(1) part, while \(v_{t}^{r}\) is related to \(\varepsilon_{t}\) according to the analysis above. The total forecasting error can then be described by:

$$v_{t} = v_{t}^{m} + v_{t}^{r}$$
(13)

Modelling error

Based on assumption 1), suppose the ARIMA modelling result for time series \(y_{t}\) is:

$$y_{t}^{*} = \varphi_{1}^{*} y_{t - 1} + \varepsilon_{t}^{*}$$
(14)

From assumption 2) we know that \(\varepsilon_{t}\) is zero-mean WGN, so its fitted value is \(\varepsilon_{t}^{*} = 0\). Then (14) becomes:

$$y_{t}^{*} = \varphi_{1}^{*} y_{t - 1}$$
(15)

From (11), the actual value of \(\varphi_{1}\) is:

$$\varphi_{1} = \frac{{y_{t - 1} - \varepsilon_{t - 1} }}{{y_{t - 2} }}$$
(16)

Therefore the parameter estimation error for \(\varphi_{1}\) is:

$$\Delta \varphi_{1} = \varphi_{1} - \varphi_{1}^{*} = \frac{{\varepsilon_{t - 1} }}{{y_{t - 2} }}$$
(17)

where \(\Delta \varphi_{1}\) is the source of the modelling error, which is proportional to the WGN \(\varepsilon_{t}\) and inversely proportional to the load level \(y_{t}\) according to (17). If we model and forecast the system load directly (called the direct method in this paper), the modelling error will be small because the load level \(y_{t}\) is high and the standard deviation of the noise \(\varepsilon_{t}\) is low due to the smoothness of the system load curve. On the other hand, if we model and forecast the subsequences of the system load and then sum them up to obtain the system load forecasting result (called the data-driven method in this paper), the modelling error for each subsequence will be more significant. The proposed DLC method constructs a forecasting model based on the smoothed sum series so that the modelling accuracy can be guaranteed to a certain extent, theoretically inferior to the direct method but better than the data-driven method.

The modelling error of the forecasting results can be evaluated by:

$$v_{t}^{*m} = \frac{1}{T}\sum\limits_{t = 1}^{T} {\frac{{\left| {y_{t} - y_{t}^{*} } \right|}}{{y_{t} }}} \times 100\%$$
(18)

Random error

From (12) we know that the WGN of a subsequence \(\varepsilon_{k,t} \sim N(0,\sigma^{2} y_{k,t}^{2} )\). Because of the mutual independence property of WGN, we have:

$$\sum\limits_{k = 1}^{N} {\varepsilon_{k,t} } \sim N\left( {0,\sum\limits_{k = 1}^{N} {\sigma^{2} y_{k,t}^{2} } } \right)$$
(19)

Because σ > 0, \(y_{k,t} > 0\), and \(y_{k,t}\) are not all equal for different k, we have:

$$\sum\limits_{k = 1}^{N} {\sigma^{2} y_{k,t}^{2} } < \sigma^{2} \left( {\sum\limits_{k = 1}^{N} {y_{k,t} } } \right)^{2} = \sigma^{2} y_{t}^{2}$$
(20)

Equation (20) is the theoretical basis of the data-driven method: the variance of the WGN is smaller than for the direct method. In this way, the forecasting random error can be limited and a more stable system load forecasting result can be obtained by the data-driven method. And this is exactly the value of using a large quantity of substation load data. Similarly, the proposed DLC method also takes advantage of the large dataset, so that its random forecasting error will be smaller than that of the direct method.
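A short worked check may make the step to (20) easier to follow. The expansion below relies only on the positivity of the subsequence loads; the two-substation values (30 and 70, in arbitrary load units) are hypothetical and chosen purely for illustration.

$$\left( {\sum\limits_{k = 1}^{N} {y_{k,t} } } \right)^{2} = \sum\limits_{k = 1}^{N} {y_{k,t}^{2} } + 2\sum\limits_{1 \le j < k \le N} {y_{j,t} y_{k,t} } > \sum\limits_{k = 1}^{N} {y_{k,t}^{2} } \quad (N \ge 2)$$

$$N = 2,\;y_{1,t} = 30,\;y_{2,t} = 70:\quad \sigma^{2} (30^{2} + 70^{2} ) = 5800\sigma^{2} < 10000\sigma^{2} = \sigma^{2} (30 + 70)^{2}$$

In this illustrative case the noise standard deviation of the summed subsequences is about 0.76 times that assumed for the system series, since \(\sqrt {5800/10000} \approx 0.76\).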

The forecasting error can be evaluated by:

$$v_{t}^{*} = \frac{1}{\Delta T}\sum\limits_{\tau = 1}^{\Delta T} {\frac{{\left| {y_{T + \tau } - S_{T + \tau } } \right|}}{{y_{T + \tau } }}} \times 100\%$$
(21)

According to (13), the random forecasting error can be evaluated as:

$$v_{t}^{*r} = v_{t}^{*} - v_{t}^{*m}$$
(22)
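The error measures in (18), (21) and (22) can be sketched as below. This is a minimal sketch with hypothetical function names (`mean_abs_pct_error`, `random_error`) and made-up series; (18) and (21) share the same mean absolute percentage form and differ only in whether the fitted or the forecast values are compared with the actual load.

```python
def mean_abs_pct_error(actual, predicted):
    """Mean absolute percentage error in percent, the common form of (18) and (21)."""
    terms = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)


def random_error(forecast_error, modelling_error):
    """Random error as in (22): forecasting error minus modelling error."""
    return forecast_error - modelling_error


if __name__ == "__main__":
    # Hypothetical actual, fitted and forecast loads (arbitrary units)
    fitted_err = mean_abs_pct_error([100.0, 200.0], [98.0, 204.0])
    forecast_err = mean_abs_pct_error([210.0, 220.0], [200.0, 235.0])
    print(fitted_err, forecast_err, random_error(forecast_err, fitted_err))
```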

In short, the direct method usually performs well in modelling, but will probably gain an uncontrollable random error when forecasting. On the contrary, the data-driven method can limit the forecasting random error, but makes it harder to construct a precise model for each subsequence. As a combination of the above two methods, the proposed DLC method can reduce the random forecasting error while guaranteeing modelling accuracy, providing improved forecasting results.

Application results

Peak load data from Shanghai are used to test the effectiveness of the proposed DLC method [41]. The annual peak loads from 2001 to 2015 are shown in Fig. 2a. The system load is composed of 83 substation loads at 220 kV (N = 83), of which the corresponding peak load curves are plotted in Fig. 2b.

Fig. 2 Annual peak load and corresponding annual peak load of 83 substations at 220 kV in Shanghai

Here, we construct a model based on the load data from 2001 to 2012, and conduct virtual forecasting from 2013 to 2015 to test its effectiveness. For comparison, four different forecasting schemes are applied:

  1) Direct method: construct an optimal ARIMA model for the system load data in Fig. 2a directly and forecast the system peak load.

  2) Data-driven method: construct an optimal ARIMA model for each original subsequence in Fig. 2b and forecast each one, then sum up all the forecasting results to obtain the system load forecasting results.

  3) DLC method: based on the subsequence data in Fig. 2b, use the forecasting algorithm proposed in Sect. 2 to obtain the system load forecasting results.

  4) Classical methods: apply some classical forecasting methods, such as the scrolling GM(1,1) model, the elasticity coefficient model and a regression model, to forecast the system peak load.

Additionally, we apply the proposed DLC method to another four cities to test its adaptability. Finally, we forecast the future load in Shanghai from 2016 to 2020 based on the DLC method.

Direct method

The optimal ARIMA model was applied directly to forecast the system load. The model fitting and forecasting results are shown in Fig. 3.

Fig. 3 Modelling and forecasting results by direct method

The modelling error of the direct method is 2.18% and the average forecasting error is 10.02%, so the random error is 7.84%. From Fig. 3 we can see that the annual system load curve in Shanghai is relatively smooth, which is advantageous for modelling and leads to a high modelling accuracy. However, the forecasting results are not desirable. This is mainly due to the changing pattern of the load growth. Shanghai has a high urbanization level and is under industrial restructuring, in which backward production facilities are closed down and the development of tertiary industry is accelerated. Meanwhile, the population in Shanghai is becoming saturated. Under such circumstances, the pattern of load growth is changing, having shown significant fluctuation since 2009. This makes it difficult for the direct method to work well. In order to obtain a better forecasting result, more information about the load system is required to explore the internal regularity of the fluctuating system load.

Data-driven method

Optimal ARIMA modelling and forecasting was conducted for each 220 kV substation load in Fig. 2b, and the results are shown in Fig. 4.

Fig. 4 Load curves with data-driven method

The average modelling and forecasting errors for the individual substation loads in Fig. 4a are 20.32% and 26.46% respectively, so the average random error is 6.13%. The significant increase of the modelling error is due to the low load level and the high fluctuation of substation loads, which has been discussed in Sect. 3. This is also the main reason for the large forecasting error for individual substation loads.

After summing up the modelling and forecasting results in Fig. 4a to obtain the system results in Fig. 4b, the system modelling and forecasting errors are 2.35% and 3.65% respectively, with a random error of 1.30%. We can see that the random error has been effectively reduced from 7.84% to 1.30% compared with the direct method, and therefore the corresponding forecasting error is reduced. This is the value of the data-driven method, which has also been discussed in Sect. 3. On the other hand, the modelling error is 2.35% and becomes the major part of the forecasting error.

DLC method

In order to improve the modelling accuracy of the data-driven method, the substation load data were preprocessed using the proposed linear clustering method. The modelling and forecasting results are shown in Fig. 5.

Fig. 5 Load curves with DLC method using clustering criterion 1

We can see from Fig. 5a that the preprocessed data obtained by the linear clustering method are much smoother than the original data in Fig. 4a, making them more suitable for time series modelling. The average modelling error has been reduced to 10.71%. Also, the number of subsequences is reduced from 83 to 30 (M = 30), so the computational work is reduced. More importantly, the forecasting results for substation load clusters are more stable: the average forecasting error is reduced to 18.54%, with a random error of 7.76%.

After summing up the modelling and forecasting results in Fig. 5a to obtain the system results in Fig. 5b, the system modelling error is 1.40%, the forecasting error 2.67%, and the random error 1.27%. The more accurate results prove that the proposed DLC forecasting algorithm takes advantage of clustering to limit the random error while guaranteeing the modelling accuracy.

In the proposed linear clustering preprocessing method, the clustering criterion in Step 4 in Sect. 2.1 is crucial. Different clustering criteria will result in different clustering results, thus leading to different modelling and forecasting effects. In the DLC method presented above, the clustering criterion is shown in (4), denoted “criterion 1”. Consider relaxing it to (23), denoted “criterion 2”:

$$U_{j\hbox{min} } < \sqrt {u_{k\hbox{max} }^{2} + u_{j}^{2} }$$
(23)

where \(u_{j}\) is the RMS value of the linear fitting residual for \(y_{j\min ,t}\). The new modelling and forecasting results are shown in Fig. 6.

Fig. 6 Modelling and forecasting results using clustering criterion 2

We can see that after relaxing the clustering criterion, the number of subsequences is further reduced to 21 (M = 21), and each of them is smoother. The average modelling error is 7.64%, the forecasting error 14.96%, and the random error 7.32%. After summing them up to obtain the system results in Fig. 6b, the modelling error is 1.34%, the forecasting error 3.47%, and the random error 2.13%.

Generally, a relaxed criterion will result in smoother load curves with a lower number of clusters, which is advantageous to the modelling accuracy but disadvantageous to reducing the random error. A stricter criterion will lead to an opposite effect. Therefore, an ideal clustering criterion should be a proper compromise between the number of clusters and smoothness of load curves, so that the forecasting accuracy can be optimized. In order to obtain such an optimal criterion, characteristics of the load should be considered, and clustering results with different criteria should be analyzed and compared.
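The difference between the two merge tests can be made concrete with a small predicate. This is an illustrative sketch only: the function name `accept_merge` and the numeric values are hypothetical, with criterion 1 taken from (4) and criterion 2 from (23).

```python
from math import sqrt


def accept_merge(U_jmin, u_kmax, u_jmin, criterion=1):
    """Decide whether the best candidate sum should replace its two parts.

    U_jmin : RMS of the linear fitting residual of the best sum series
    u_kmax : RMS of the linear fitting residual of the least linear series
    u_jmin : RMS of the linear fitting residual of its best partner
    """
    if criterion == 1:
        return U_jmin < u_kmax                       # criterion 1, as in (4)
    if criterion == 2:
        return U_jmin < sqrt(u_kmax**2 + u_jmin**2)  # criterion 2, as in (23)
    raise ValueError("criterion must be 1 or 2")


if __name__ == "__main__":
    # Hypothetical RMS values: rejected by criterion 1, accepted by criterion 2
    print(accept_merge(5.0, 4.0, 3.5, criterion=1))
    print(accept_merge(5.0, 4.0, 3.5, criterion=2))
```

Since \(\sqrt {u_{k\max }^{2} + u_{j}^{2} } \ge u_{k\max }\), every merge accepted by criterion 1 is also accepted by criterion 2, which is consistent with the smaller number of clusters reported for the relaxed criterion.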

Classical methods

For further comparison, the classical scrolling GM(1,1) model, the elasticity coefficient model and a regression model are constructed to forecast the system power load directly [42]. The modelling and forecasting results are illustrated in Fig. 7.

Fig. 7 Modelling and forecasting results by GM(1,1) model, regression model and elasticity coefficient model

The results show that all three classical models have adequate modelling accuracy: 5.97, 3.51 and 1.22% respectively. However, they share a common difficulty in capturing the changing load pattern in the forecasting zone, with forecasting errors of 11.25, 4.76 and 9.43% respectively.

The final system forecasting results of the forecasting methods applied are summarized in Table 1.

Table 1 Results summary

In order to demonstrate the adaptability of the proposed DLC method, we collected substation load data from four different cities, which are shown in Fig. 8. Note that the four cities are at different stages of urbanization. The modelling and forecasting results of the DLC method with both criteria for each city are plotted in Fig. 9, and the forecasting errors are shown in Table 2.

Fig. 8 Corresponding substation load data

Fig. 9 DLC modelling and forecasting results

Table 2 DLC forecasting error in four different cities

The comparison of results in Table 1 proves the effectiveness of the proposed DLC forecasting method. Firstly, the three methods based on the data-driven methodology all show a successful reduction of random error when compared with the other methods. Secondly, the DLC method can provide modelling accuracy at almost the same level as the direct method, so that the forecasting accuracy is improved. The DLC method also performs better than the three classical models mentioned above. Additionally, the forecasting results for four different cities shown in Table 2 indicate the adaptability and stability of the DLC method.

Finally, we forecast the peak load in Shanghai from 2016 to 2020, based on the proposed DLC method with criterion 1 as an example in Table 3. Figure 10 shows load growth and ARIMA forecasting results for each cluster and the overall peak load growth.

Table 3 DLC forecasting results in Shanghai from 2016 to 2020

Fig. 10 Load curves forecast to 2020 using the proposed DLC method

The modelling error is 1.92%, and the average annual growth rate from 2016 to 2020 for peak load in Shanghai is 2.86%.

Conclusion

In this paper, we propose a data-driven linear clustering method to solve the long-term system load forecasting problem caused by load fluctuations in some developed cities. In order to grasp the internal structure of the system load and improve the forecasting accuracy, we introduce a data-driven method to conduct modelling and forecasting based on a large quantity of substation load data. We have theoretically proved that this data-driven method is effective in reducing the forecasting random error so that a more stable result can be obtained. However, the data-driven method can result in modelling difficulty, which is disadvantageous for forecasting accuracy. To address this problem, we propose a linear clustering method to preprocess the substation load data, making them smoother and thereby reducing the modelling error. When applied to load data from Shanghai, the proposed DLC method is shown to be effective in both reducing the forecasting random error and guaranteeing the modelling accuracy, so that a more stable and accurate system load forecasting result can be obtained. Furthermore, applying the same method to load data from another four cities indicates that the proposed DLC method is adaptable and stable.

Future work could theoretically investigate the optimal clustering criterion and level to further improve the forecasting stability and accuracy. Meanwhile, substation load curves in the same cluster have a linear complementarity property, which provides an opportunity to conduct correlation analysis of urbanization characteristics such as industrial structure, population and land utilization. This would help to understand and quantify the structural influences behind the changing peak load.