1 Introduction

With increasing environmental concerns, electric vehicles (EVs) and renewable energy sources are receiving more and more attention all over the world [1]. According to data from the International Energy Agency [2], the number of EVs worldwide reached 3.1 million in 2017, an increase of 57% over the previous year, and the number of EVs on the road is expected to reach 125 million by 2030. China is expected to be the first country to start replacing traditional fuel vehicles with electric ones. At the same time, by 2030 renewable energy in China will account for 15% of total energy consumption [3].

Large-scale integration of EVs and renewable energy into the grid poses great challenges to the operation of the power system due to their uncertain and intermittent nature. Generally, large centralized energy storage systems (ESSs) can mitigate these problems; however, this would require expensive installations of large-capacity battery banks, pumped hydro and other large systems [4, 5].

A large-scale mobile and distributed ESS, composed of numerous on-board EV batteries, can provide similar solutions if the duality of EVs as ‘loads’ and ‘sources’ is utilized and demand-side response technologies are applied [6, 7]. This makes it possible to increase the penetration of renewable power generation or to improve resiliency and stability by forming microgrids [8, 9].

An important prerequisite for EVs to provide ancillary services to utilities, or for the efficient operation of microgrids, is to forecast the EV schedulable capacity (EVSC) in a fast and accurate way. System operators can then optimize the schedule for the participation of EVs in ancillary services.

In the current literature, EVSC is generally obtained using probabilistic EV models [10,11,12,13,14,15], including plug-in time probability models based on binomial distributions [10], plug-in location probability models [11] and aggregated queuing network models [12]. In [13,14,15], a Monte Carlo method is used to simulate the behavior of different types of EVs operating under realistic conditions, including start-stop time, charging rate, charging time, etc. In other, non-probabilistic models, the state of charge (SOC) of the EV batteries is used to obtain the EVSC of individual and aggregated EVs [16, 17]. Most of these models require several parameter assumptions, partly due to the scarcity of historical data.

With the development of communication and Internet of Things technologies, real-time operation data of individual EVs can be acquired from their battery management systems (BMSs). Large amounts of actual operation data, such as SOC and the times at which EVs access charging infrastructure, make it possible to develop more accurate EVSC models. Nevertheless, processing and analyzing such a very large amount of data poses great challenges. For example, if half of the 100 million EVs estimated to be on the road in China by 2030 [2] are involved in power system scheduling, and the collection interval of the related information is one minute, the volume of data will reach 1–2 petabytes each year. Therefore, this paper treats the forecasting of EVSC based on real-time operation data of individual EVs as an essentially big data analysis problem.
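The order of magnitude of this claim can be checked with a back-of-envelope calculation; the per-record payload of roughly 40 bytes is an assumption for illustration, not a figure from the source:

```python
evs = 50_000_000                  # half of the ~100 million EVs projected for 2030
minutes_per_year = 365 * 24 * 60  # one BMS sample per EV per minute
bytes_per_record = 40             # assumed record size (SOC, IDs, timestamps, rates)

volume = evs * minutes_per_year * bytes_per_record
print(volume / 1e15)              # ≈ 1.05 petabytes per year
```

With a somewhat richer record the figure lands in the 1–2 petabyte range quoted above.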

Big data analysis and management are clear trends in future smart grids. This is challenging for traditional machine learning (ML) algorithms, since they are designed for a single machine and are not suited to big data [18]. Thus, more efficient ML algorithms for parallel computing or for big data are required.

The parallel processing methods proposed in the literature can be divided into three groups. Group one refers to the parallel processing of traditional algorithms using Hadoop and Spark cluster technology [19,20,21]. Group two combines clustering or optimization algorithms with traditional ML algorithms [22,23,24]. Group three is a combination of groups one and two [25,26,27]. In group one, the parallelization of the algorithm effectively improves the computing speed and accuracy of load forecasting in the MapReduce and Spark parallel computing frameworks. For example, [19] analyzes the forecasting time and error for data sets of different sizes on Hadoop clusters of different sizes. In group two, clustering algorithms on large-scale data sets can markedly improve performance [22, 23]. For group three, [25] and [26] propose new hybrid algorithms, which combine improved particle swarm optimization with extreme learning machines, and fuzzy C-means clustering with support vector machines (SVM), respectively. In [27], multiple distributed back-propagation (BP) neural networks are used to address the over-fitting and long training times caused by growing data volumes.

Different from the algorithms above, [19] and [28] propose the ensemble learning algorithms random forest (RF) and gradient boosting decision tree (GBDT), respectively. An ensemble learning algorithm integrates multiple base learners into a strong learner to improve forecasting accuracy, and ensemble learning is considered one of the important future research directions of ML [29]. Unlike the traditional multi-linear regression (MLR) algorithm [30], GBDT can flexibly handle different types of feature attributes, including continuous and discrete values [31]. Thus, it is widely used in traffic and load forecasting. However, since the output of the algorithm is the result of multiple iterations, there is a strong inter-dependence among the regression trees, which makes the GBDT algorithm difficult to parallelize.

The parallel GBDT (PGBDT) algorithm is derived from the GBDT algorithm and enables parallel computation. It requires less iteration time than GBDT by processing the large number of gradient and optimization computations in parallel, without affecting the prediction accuracy of the model. It is therefore applicable to big data scenarios, although it has not so far been used for EVSC forecasting (EVSCF).

The application of big data analysis equipped with ML algorithms has mainly been found in the field of load forecasting, and rarely for EVSCF. In [15] and [32], the application of SVM, RF and decision tree (DT) algorithms to EVSCF is investigated. SVM and RF perform well when the EVSCF curves fluctuate little, while RF is more effective than SVM when there are large fluctuations [15]. DTs are heavily dependent on their input data, which means that even small variations in the data may result in large changes in the structure of the optimal DT. In this paper, in order to address this problem, two algorithms suitable for big data analysis, i.e., PGBDT and parallel k-nearest neighbors (PKNN), are applied to the EVSCF problem and the results are analyzed.

With the rapid growth of EVs, uncontrolled charging of a large number of EVs may cause a “peak-on-peak” phenomenon, i.e., increase the peak-to-valley difference of the utility and affect the stable operation of the power grid. EVSCF methods provide strong data support for load peak shifting, frequency regulation, economic dispatch and intelligent EV charging/discharging strategies. These different applications require EVSCF results for the scheduling of renewable energy or load at different time scales [33, 34]. For example, real-time load forecasting has a time horizon of several seconds to 10 minutes and is used for frequency/voltage regulation, in order to counteract the volatility of renewable energies [35]; ultra-short-term load forecasting has a time scale of one hour, and short-term load forecasting of several hours to tens of hours, for economic dispatch and for peak shaving and valley filling [36, 37]. Forecasting of renewable generation is usually required for the ultra-short term, 15 minutes to four hours ahead, and for the short term, 24–72 hours ahead [38].

So far, the time scales at which EVs can serve power system operation have not been properly discussed. References [15] and [31] deal only with real-time and one-day-ahead time scales. In this paper, an ultra-short-term scale of one hour is additionally incorporated, for the first time, in the EVSCF models. In this way, EVs can be used for more power system services such as real-time optimization, peak shaving and valley filling, economic dispatch, etc.

Overall, the main contribution of this paper is the development of EVSCF models for multi-time scales based on the PGBDT algorithm which is used to forecast EVSC faster and more accurately.

The results of real-time EVSCF based on a large amount of real-time operation data from the BMSs of individual EVs are used as historical data for training the EVSCF models for the ultra-short-term scale of one hour and the one-day-ahead scale of 24 hours. The PGBDT algorithm is applied and tested, for the first time, on a big data platform for multi-time scale EVSCF models to prove its feasibility and effectiveness.

The rest of this paper is organized as follows. In Section 2, the PGBDT algorithm is described. Section 3 discusses EVSCF models for multi-time scales. In Section 4, the proposed models are validated and compared with parallel random forest (PRF) and PKNN algorithms on a big data platform. This is followed by conclusions in Section 5.

2 PGBDT algorithm

In the PGBDT algorithm [28], a training sample is composed of a one-dimensional target \(y_{i}\) and a K-dimensional input vector \(\varvec{x}_{i} = [x_{i}^{1} , x_{i}^{2} , \ldots , x_{i}^{K} ]\). The objective is to obtain the mapping \(f^{*} (\varvec{x})\) from \(\varvec{x}_{i}\) to \(y_{i}\) over the training set \(S = \{ (y_{i} , \varvec{x}_{i} )\}_{i = 1}^{n}\) of known values \(y_{i}\) and feature-attribute vectors \(\varvec{x}_{i}\) of length K, while minimizing the loss function \(L(y_{i} , f(\varvec{x}_{i} ))\), as shown in (1):

$$f^{*} (\varvec{x}) = \mathop {\arg \hbox{min} }\limits_{{f(\varvec{x})}} \sum\limits_{i = 1}^{n} {L(y_{i} ,f(\varvec{x}_{i} ))}$$
(1)

where \(f(\varvec{x}_{i} )\) is the ith predicted value, i.e., the output of the mapping function. The loss function \(L(y_{i} , f(\varvec{x}_{i} ))\) used here is the squared error loss function for regression, as shown in (2):

$$L(y_{i} ,f(\varvec{x}_{i} )) = (y_{i} - f(\varvec{x}_{i} ))^{2}$$
(2)

\(f^{*} (\varvec{x})\) can be approximated by the additive form of \(f(\varvec{x})\) as shown in (3):

$$f(\varvec{x}) = \sum\limits_{m = 1}^{M} {c_{m} b_{m} }$$
(3)

where \(c_{m}\) is a scaling factor; and \(b_{m}\) is the least-squares coefficient of the base learner for the mth iteration.

By defining the base learner as a J-terminal node regression tree, the specific steps of PGBDT algorithm are as follows.

  • Step 1: Initialize the model (3) by setting the iteration counter m to one, the maximum number of iterations to M, and the initial function \(f_{0} (\varvec{x})\) as shown in (4).

    $$f_{0} (\varvec{x}) = \mathop {\arg \hbox{min} }\limits_{p} \sum\limits_{i = 1}^{n} {L(y_{i} ,p)}$$
    (4)

    where p is a constant value for minimizing the loss function; and \(f_{0} (\varvec{x})\) is a regression tree with only one node.

  • Step 2: Obtain the negative gradient of the loss function as shown in (5), where \(f_{m-1} (\varvec{x})\) is the model after the (m−1)th iteration.

    $$r_{mi} = - \left( {\frac{{\partial L(y_{i} ,f(\varvec{x}_{i} ))}}{{\partial f(\varvec{x}_{i} )}}} \right)_{{f(\varvec{x}) = f_{m - 1} (\varvec{x})}}$$
    (5)

    The mth regression tree is constructed from all samples and their negative gradients [39]. Its splitting rule divides the data into two regions according to the value s of the kth feature attribute: \(S_{left} \left( {k, s} \right) = \{ x_{i} \left| {x_{i}^{k} \le s} \right.\}\) and \(S_{right} \left( {k, s} \right) = \{ x_{i} \left| {x_{i}^{k} > s} \right.\}\). The minimization of the sum of regional variances after splitting is shown in (6):

    $$\begin{aligned} Gain(k,s) = \mathop {\hbox{min} }\limits_{k,s} \left[ {\sum\limits_{{x_{i} \in S_{left} }} {\left( {y_{i} - \frac{1}{{n_{left} }}\sum\limits_{i = 1}^{{n_{left} }} {y_{i} } } \right)^{2} } + } \right. \hfill \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\left. {\sum\limits_{{x_{i} \in S_{right} }} {\left( {y_{i} - \frac{1}{{n_{right} }}\sum\limits_{i = 1}^{{n_{right} }} {y_{i} } } \right)^{2} } } \right] \hfill \\ \end{aligned}$$
    (6)

    where the training sample S of size n is divided into the left dataset \(S_{left}\) and the right dataset \(S_{right}\) according to the split value s, with sizes \(n_{left}\) and \(n_{right}\), respectively.

    Thus, its corresponding J-terminal node regions \(R_{mj} ,j = 1,\;2,\; \ldots ,\;J\) are obtained.

  • Step 3: Obtain the corresponding least-squares coefficient \(b_{mj}\) of the mth regression tree as in (7).

    $$b_{mj} = \bar{r}_{mi} \varepsilon (\varvec{x}_{i} )\;\;\;\;\varvec{x}_{i} \in R_{mj}$$
    (7)
    $$\varepsilon \left( {\varvec{x}_{i} } \right) = \left\{ {\begin{array}{*{20}c} {1\;\;\;\;\;\;\varvec{x}_{i} \in R_{mj} } \\ {0\;\;\;\;\;\;\varvec{x}_{i} \notin R_{mj} } \\ \end{array} } \right.$$
    (8)

    where \(\bar{r}_{mi}\) is the average value of negative gradient of mth regression tree; and \(\varepsilon ({\varvec{x}}_{i} )\) is an indicator function.

  • Step 4: Find the scaling factor \(c_{m}\) of the mth regression tree by solving the line search in (9).

    $$c_{m} = \mathop {\arg \hbox{min} }\limits_{c} \sum\limits_{{\varvec{x}_{i} \in R_{mj} }} {L\left( {y_{i} ,f_{m - 1} (\varvec{x}_{i} ) + c\sum\limits_{j = 1}^{J} {b_{mj} } } \right)}$$
    (9)
  • Step 5: Update the model \(f_{m} (\varvec{x})\) as (10).

    $$f_{m} (\varvec{x}) = f_{m - 1} (\varvec{x}) + c_{m} \sum\limits_{j = 1}^{J} {b_{mj} }$$
    (10)
  • Step 6: If \(m < M\), let \(m = m + 1\) and repeat Step 2 to Step 5, otherwise output the final \(f_{M} (\varvec{x})\).

After M iterations, the final model \(f(\varvec{x})\) is obtained as (11):

$$f(\varvec{x}) = f_{M} (\varvec{x}) = f_{0} (\varvec{x}) + \sum\limits_{m = 1}^{M} {\sum\limits_{j = 1}^{J} {c_{m} b_{mj} } }$$
(11)

From (11), it can be seen that PGBDT is an ensemble algorithm. It approximates the expected model by iterating over a series of regression trees to improve model accuracy, and provides strong predictive performance and generalization ability. Each regression tree can be parallelized by finding the splits of its non-terminal nodes in parallel, where the splitting criterion is the minimization of variance after splitting, as in (6). Therefore, the whole PGBDT model can be parallelized by generating each regression tree in parallel during the model generation process.
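As a concrete illustration, Steps 1–6 can be sketched serially with two-leaf regression trees (J = 2) and the squared loss, for which the negative gradient in (5) is simply the residual (up to a constant factor absorbed into the leaf values) and the scaling in (9) reduces to a constant, taken as 1 here. This is a minimal serial sketch; the split search over (6) is the part that PGBDT performs in parallel:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a 1-feature, 2-leaf regression tree by minimizing the split cost (6)."""
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or cost < best[0]:
            best = (cost, s, left.mean(), right.mean())
    _, s, b_left, b_right = best
    return lambda xq: np.where(xq <= s, b_left, b_right)

def gbdt_fit(x, y, M=20):
    """Steps 1-6 with squared loss: the negative gradient is the residual."""
    f0 = y.mean()                        # (4): a one-node tree
    trees = []
    pred = np.full_like(y, f0, dtype=float)
    for m in range(M):
        r = y - pred                     # (5): residual form of the negative gradient
        tree = fit_stump(x, r)           # (6): build the mth tree on the residuals
        pred = pred + tree(x)            # (10), with c_m = 1 for squared loss
        trees.append(tree)
    return lambda xq: f0 + sum(t(xq) for t in trees)   # (11)
```

For example, fitting `gbdt_fit` to a step-shaped target recovers it after the first boosting round, since one stump captures the single discontinuity.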

3 EVSCF models for multi-time scales

3.1 Time scales used in the proposed EVSCF models

In this paper, three time scales for EVSCF are proposed, i.e., a real-time scale of one minute, an ultra-short-term scale of one hour and a one-day-ahead scale of 24 hours. This allows different ancillary services to be provided to the power system, as shown in Fig. 1. In one-day-ahead EVSCF, the forecasting is performed for each of the 24 hours, one day in advance. Considering the uncertainty of one-day-ahead scheduling, the schedule for the ultra-short-term scale of one hour is formulated one hour in advance to reduce the forecasting errors. Real-time EVSCF is carried out one minute in advance; its time scale is short, its precision is high, and it is less affected by uncertain factors, so it can be used for frequency and voltage regulation and for further correction of scheduling errors. The real-time interval is one minute, and the time scales for ultra-short-term and one-day-ahead forecasting are set to 60 minutes and 1440 minutes, respectively.

Fig. 1

Diagram of time scales in the proposed EVSCF

3.2 Real-time EVSCF model

3.2.1 Classification of EVs connected to grid

The proposed real-time EVSCF model is based on real-time data of individual EVs, which are acquired from the BMS of each EV.

In order to ensure the accuracy of the real-time EVSCF model, the time scale of the prediction model is chosen equal to the time interval of real-time operation data acquisition, which is one minute in this paper. Since the SOC of an EV battery changes only slightly within one minute, the real-time EVSC is calculated dynamically through real-time data acquisition and big data analysis, and can be regarded as the forecasted value of EVSC for the next minute. To build the real-time EVSCF model, it is first necessary to classify the individual EVs accessing the utility network according to their SOC levels, so that the aggregated charging or discharging capacity of the EVs can be obtained.

The participation of EVs in dispatch mainly depends on the scheduling period \(t_{s}\) from the grid side, and, from the side of individual EV users, on the expected remaining time to disconnection from the grid \(t_{d,l}\) and the minimum charging time \(t_{d,c}\) needed to reach the expected SOC. The rules to classify EVs are as follows [15].

  1)

    If \(t_{d,l} < t_{s}\) or \(t_{d,l} < t_{d,c}\), \(EV_{d}\) is not allowed to participate in the scheduling plan.

  2)

    If \(t_{d,l} \ge t_{s}\) and \(t_{d,l} \ge t_{d,c}\): ① if \(SOC_{d}^{t} < SOC_{d}^{\hbox{min} }\), \(EV_{d}\) is allowed to be charged; ② if \(SOC_{d}^{\hbox{max} } < SOC_{d}^{t}\), \(EV_{d}\) is allowed to be discharged; ③ if \(SOC_{d}^{\hbox{min} } < SOC_{d}^{t} < SOC_{d}^{\hbox{max} }\), \(EV_{d}\) is allowed to be charged or discharged according to the scheduling plan.

    Note that \(EV_{d}\) is the dth EV; \(SOC_{d}^{t}\) is the SOC of \(EV_{d}\) at current time t; and \(SOC_{d}^{\hbox{min} }\) and \(SOC_{d}^{\hbox{max} }\) are the minimum and maximum expected SOC for each \(EV_{d}\), respectively.
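The classification rules above can be expressed as a small decision function; this is an illustrative sketch, with variable names chosen to mirror the notation of the text:

```python
def classify_ev(t_dl, t_s, t_dc, soc, soc_min, soc_max):
    """Classification rules of Section 3.2.1 for one EV_d (illustrative)."""
    if t_dl < t_s or t_dl < t_dc:
        return "not schedulable"          # rule 1)
    if soc < soc_min:
        return "charge"                   # rule 2)-①
    if soc > soc_max:
        return "discharge"                # rule 2)-②
    return "charge or discharge"          # rule 2)-③
```

For instance, an EV expected to stay connected for 60 minutes, with a 30-minute scheduling period and a current SOC between its expected bounds, falls into the fully schedulable third category.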

3.2.2 Definition of charging/discharging rate

The charging/discharging rate \(v_{d,t}\) is used to characterize the users’ demands, as shown in (12). The value of \(v_{d,t}\) is related to the initial SOC, ending SOC and \(t_{d,l}\) of \(EV_{d}\).

$$v_{d,t} = \frac{{SOC_{d}^{t} - SOC_{d}^{{t + t_{d,l} }} }}{{t_{d,l} }}$$
(12)

where \(SOC_{d}^{{t + t_{d,l} }}\) is the SOC of \(EV_{d}\) at time \(t + t_{d,l}\); \(v_{d,t} < 0\) indicates that \(EV_{d}\) is charging; and \(v_{d,t} > 0\) indicates that \(EV_{d}\) is discharging.

3.2.3 Real-time EVSCF model

The real-time EVSCF model includes the real-time schedulable charge capacity (SCC) and schedulable discharge capacity (SDC) of EVSC, based on the real-time operation data of EVs from their BMSs and the group classification above. In this paper, the SCC and SDC of individual EVs are obtained from (13) and (14), respectively:

$$SCC_{{d,t_{s} }} = v_{d,t} t_{s} C_{d} \;v_{d,t} \in \left[ {\underset{\raise0.3em\hbox{$\smash{\scriptscriptstyle-}$}}{v}_{d,t} ,0} \right]$$
(13)
$$SDC_{{d,t_{s} }} = v_{d,t} t_{s} C_{d} \;\;v_{d,t} \in \left[ {0,\bar{v}_{d,t} } \right]$$
(14)

where \(\bar{v}_{d,t}\) and \(\underset{\raise0.3em\hbox{$\smash{\scriptscriptstyle-}$}}{v}_{d,t}\) are the limits of charging/discharging rate, respectively; \(SCC_{{d,t_{s} }}\)and \(SDC_{{d,t_{s} }}\) are SCC and SDC of EVSCF of \(EV_{d}\) at the scheduling period \(t_{s}\), respectively; and \(C_{d}\) is the rated battery capacity of \(EV_{d}\).

By aggregating the real-time SCC and SDC of individual EVs, the SCC and SDC of the whole cluster of EVs, \(SCC_{{t_{s} }}^{all}\) and \(SDC_{{t_{s} }}^{all}\), are derived as in (15) and (16), respectively:

$$SCC_{{t_{s} }}^{all} = \sum\limits_{d = 1}^{N} {SCC_{{d,t_{s} }} }$$
(15)
$$SDC_{{t_{s} }}^{all} = \sum\limits_{d = 1}^{N} {SDC_{{d,t_{s} }} }$$
(16)

where N is the total number of EVs connected to the grid.
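Equations (12)–(16) amount to one rate computation and two clipped sums per EV. A sketch, with illustrative field names and capacities reported as magnitudes:

```python
def fleet_capacity(evs, t_s):
    """Aggregate SCC/SDC per (12)-(16). Each EV dict holds: soc_now, soc_end
    (SOC at t + t_dl), t_dl (minutes), cap (rated kWh), and rate limits
    v_min <= 0 and v_max >= 0."""
    scc_all = sdc_all = 0.0
    for ev in evs:
        v = (ev["soc_now"] - ev["soc_end"]) / ev["t_dl"]        # (12), per minute
        if v < 0:                                               # charging
            scc_all += -max(v, ev["v_min"]) * t_s * ev["cap"]   # (13), as magnitude
        elif v > 0:                                             # discharging
            sdc_all += min(v, ev["v_max"]) * t_s * ev["cap"]    # (14)
    return scc_all, sdc_all                                     # (15), (16)
```

For example, an EV charging from SOC 0.2 to 0.8 over 60 minutes contributes at a rate of 0.01 of its capacity per minute to the SCC over the scheduling period.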

3.3 Ultra-short-term and one-day-ahead EVSCF models

3.3.1 Construction of training dataset and testing dataset

Feature selection is required before establishing the training and testing datasets for the ultra-short-term and one-day-ahead EVSCF models. The following feature attributes are selected to train the model, based on the historical data and the time attributes of individual EV operation.

  1)

    The average values of SCC and SDC of EVSC at the same time t of the previous month are \(\overline{SCC}_{t,mon}^{all}\) and \(\overline{SDC}_{t,mon}^{all}\), which are calculated as in (17) and (18), respectively:

    $$\overline{SCC}_{t,mon}^{all} { = }\frac{1}{l}\sum\limits_{k = 1}^{l} {SCC_{t,mon}^{k} }$$
    (17)
    $$\overline{SDC}_{t,mon}^{all} { = }\frac{1}{l}\sum\limits_{k = 1}^{l} {SDC_{t,mon}^{k} }$$
    (18)

    where l is the total number of days of the previous month; \(SCC_{t,mon}^{k}\) and \(SDC_{t,mon}^{k}\) are the values of SCC and SDC of EVSC at the same time t on the kth day of the previous month, respectively.

  2)

    The average values of SCC and SDC of EVSC at the same time t last week are \(\overline{SCC}_{t,week}^{all}\) and \(\overline{SDC}_{t,week}^{all}\), which are calculated as in (19) and (20), respectively, where \(SCC_{t,week}^{k}\) and \(SDC_{t,week}^{k}\) are the values of SCC and SDC of EVSC at time t on the kth day of last week:

    $$\overline{SCC}_{t,week}^{all} { = }\frac{1}{ 7}\sum\limits_{k = 1}^{7} {SCC_{t,week}^{k} }$$
    (19)
    $$\overline{SDC}_{t,week}^{all} { = }\frac{1}{ 7}\sum\limits_{k = 1}^{7} {SDC_{t,week}^{k} }$$
    (20)
  3)

    The values of SCC and SDC of EVSC at the same time t of the previous day are \(SCC_{t,day}\) and \(SDC_{t,day}\).

According to the different time attributes, the following four feature attributes are selected as inputs at the training stage: the current time t (1440 time slots per day, numbered 0 to 1439) and three indicators of rush hour, holiday and working day.

In summary, through correlation analysis, the dataset of length q is divided into two parts: a training dataset of length p and a testing dataset of length q − p. The next step is to construct the EVSCF models for the ultra-short-term and one-day-ahead scales, as follows.
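The historical attributes in (17)–(20) are plain column averages over a day-by-minute matrix. A sketch for the SCC attributes (the SDC attributes are computed identically); the flat array layout and the use of l = 30 days for the previous month are illustrative assumptions:

```python
import numpy as np

def historical_scc_features(scc_series, t, minutes_per_day=1440):
    """scc_series: per-minute real-time SCC history ending the day before the
    forecast day. Returns the three historical SCC attributes of Section 3.3.1."""
    days = np.asarray(scc_series).reshape(-1, minutes_per_day)  # one row per day
    scc_mon = days[-30:, t].mean()     # (17): previous-month average at time t
    scc_week = days[-7:, t].mean()     # (19): last-week average at time t
    scc_day = days[-1, t]              # previous-day value at time t
    return scc_mon, scc_week, scc_day
```

Stacking these three values for SCC and SDC yields the 6-dimensional historical vector h used below; appending the four time attributes gives the full 10-dimensional feature vector.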

3.3.2 Ultra-short-term and one-day-ahead EVSCF models

The ultra-short-term and one-day-ahead EVSCF models with PGBDT proposed in this paper differ only in their time scales. Both are trained as follows:

  1)

    Input the training dataset A, including the feature attributes and the actual EVSC values \(y_{t - p}\). \(A = \{ (y_{t - p} , \varvec{h}_{t - p} , \varvec{w}_{t - p} )\}_{t = 1}^{p}\) comprises 10 feature attributes for a training dataset of length p, where \(\varvec{h}_{t - p} = [x_{t - p}^{1} , x_{t - p}^{2} , \ldots , x_{t - p}^{6} ]\) is a 6-dimensional vector of historical EVSC data and \(\varvec{w}_{t - p} = [x_{t - p}^{7} , x_{t - p}^{8} , x_{t - p}^{9} , x_{t - p}^{10} ]\) is a 4-dimensional vector of EVSC time attributes.

  2)

    Set the parameters of the PGBDT algorithm including the number of iterations I and maximum depth d.

  3)

    Train the model represented by (21) by the training dataset A:

    $$y_{t - p} = f (\varvec{h}_{t - p} , \varvec{w}_{t - p} )$$
    (21)
  4)

    Substitute testing dataset B into the model, and obtain the predicted value of EVSCF \(y_{t}^{e}\) as (22):

    $$y_{t}^{e} = f (\varvec{h}_{t} , \varvec{w}_{t} )$$
    (22)

    where \(B = \{ (y_{t} , \varvec{h}_{t} , \varvec{w}_{t} )\}_{t = p + 1}^{q}\) is the testing dataset of length q − p with the same 10 feature attributes.
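The four training steps above can be sketched with scikit-learn's (serial) GradientBoostingRegressor standing in for PGBDT; the dataset shapes follow Section 3.3.1, and the data here are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train = rng.random((500, 10))           # step 1): 10 feature attributes [h, w]
y_train = X_train[:, :6].sum(axis=1)      # synthetic EVSC target built from h
X_test = rng.random((100, 10))

# step 2): number of iterations I and maximum depth d (cf. Table 3)
model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)               # step 3): train the model (21)
y_e = model.predict(X_test)               # step 4): forecast (22)
```

In the actual pipeline, PGBDT would replace the serial fit and the training window would roll forward with each forecast period.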

3.3.3 Evaluation indexes

In order to evaluate the performance of the proposed PGBDT algorithm for the ultra-short-term and one-day-ahead EVSCF models, the mean absolute percentage error (MAPE) and root mean square error (RMSE) are chosen as evaluation indexes. The expressions are shown in (23) and (24), respectively:

$$MAPE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {(y_{i} - y_{i}^{e} )/y_{i} } \right|} \times 100\%$$
(23)
$$RMSE = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left[(y_{i} - y_{i}^{e} )/y_{i} \right]}^{2} } \times 100\%$$
(24)

where \(y_{i}\) and \(y_{i}^{e}\) are the actual and forecasted EVSC values, respectively. If \(y_{i}\) is 0, it is replaced with the historical average of EVSC. The smaller the MAPE, the more accurate the prediction. RMSE is sensitive to outliers and amplifies large prediction errors, so it can be used to evaluate the stability of the algorithm.
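In code, assuming zero actual values have already been replaced as described, (23) and (24) transcribe directly:

```python
import numpy as np

def mape(y, y_e):
    """Mean absolute percentage error, (23), in percent."""
    y, y_e = np.asarray(y, float), np.asarray(y_e, float)
    return np.mean(np.abs((y - y_e) / y)) * 100

def rmse(y, y_e):
    """Normalized root mean square error, (24), in percent."""
    y, y_e = np.asarray(y, float), np.asarray(y_e, float)
    return np.sqrt(np.mean(((y - y_e) / y) ** 2)) * 100
```

For example, actual values [100, 200] forecast as [110, 190] give a MAPE of 7.5% and an RMSE of about 7.9%, the RMSE being pulled toward the larger 10% error.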

3.4 Implementation of EVSCF models with PGBDT algorithm and big data analysis

3.4.1 Real-time EVSCF framework based on big data

Equations (12)–(16) form the real-time EVSCF model. Although the proposed model looks simple, it is difficult to apply, since the large amount of related EV data needs to be processed in real time. In this paper, Hadoop is used to solve the big data storage problem via the Hadoop distributed file system (HDFS) [40]. Moreover, Spark, designed for large-scale data processing, is used. Spark Streaming can process stream data with a minimum interval of 500 ms. In this paper, the real-time processing interval is 60 s, which enables parallel computation to meet the real-time requirements.

Parallel processing on Spark is shown in Fig. 2a. When EVs are connected to the grid, their operation information is acquired and processed through the following functions. The Map function calculates the real-time EVSC of each individual EV. The ReduceByKey function combines the values (the outputs of the Map function) of each key (the EV number). The real-time EVSCF values obtained from the real-time EVSCF model are then used as historical data for the ultra-short-term and one-day-ahead EVSCF models, as discussed in the following sections.
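The Map/ReduceByKey pipeline can be mimicked in plain Python, with Spark's distributed operators replaced by local equivalents; the record layout is illustrative:

```python
from collections import defaultdict

def map_fn(record, t_s=1):
    """Map: one BMS record -> (ev_id, schedulable capacity in kWh),
    using the rate definition (12) with t_s = 1 minute."""
    ev_id, soc_now, soc_end, t_dl, cap = record
    v = (soc_now - soc_end) / t_dl
    return ev_id, abs(v) * t_s * cap

def reduce_by_key(pairs):
    """ReduceByKey: sum the mapped values per key (per EV number),
    as Spark's reduceByKey with addition would."""
    acc = defaultdict(float)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)

records = [("ev1", 0.2, 0.8, 60, 100.0), ("ev2", 0.9, 0.6, 30, 100.0)]
per_ev = reduce_by_key(map_fn(r) for r in records)
fleet_total = sum(per_ev.values())
```

On the actual platform, the same two functions are applied by Spark Streaming to each 60 s micro-batch of BMS records.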

Fig. 2

Structure of EVSCF models for multi-time scales

3.4.2 Framework of EVSCF models based on PGBDT

The structure of EVSCF models based on PGBDT algorithm for multi-time scales is shown in Fig. 2. The real-time EVSCF model is built, as shown in Fig. 2a, and the historical data of the real-time EVSCF is combined with the time attributes to generate the training dataset and testing dataset. According to the different prediction periods, the training dataset and testing dataset are updated in order to apply rolling forecasting. Finally, one-day-ahead and ultra-short-term EVSCF models based on the PGBDT algorithm are trained, tested and evaluated, as shown in Fig. 2b.

4 Study cases

4.1 Big data platform configuration

Combining the advantages of Hadoop and Spark, a big data platform is constructed to test the proposed method. The hardware of the big data platform consists of two IBM servers, which communicate on the same network through a Gigabit gateway. Based on 64-bit Ubuntu operating systems, a computer cluster containing four machines is set up, one of which is selected as the master node and the other three as slave nodes. The configuration parameters of the big data platform are shown in Table 1.

Table 1 Big data platform configuration parameters

With the big data platform, the real-time data of 521 EVs are used to test the EVSCF models proposed in Section 3. These data are acquired from the BMS of each EV with one-minute resolution (17 GB in total) over the period from Nov. 1, 2015, 00:00 to Apr. 30, 2016, 23:59 [15].

4.2 Processing time analysis of different real-time EVSCF data scales

In this paper, the speed-up factor \(S_{speedup}\) is defined as an evaluation index for measuring the parallelization degree of the big data platform, as shown in (25):

$$S_{speedup} = T_{s} /T_{c}$$
(25)

where \(T_{s}\) and \(T_{c}\) are the running times of a single machine and of the cluster for processing big data, respectively.

With the proposed big data platform and the real-time operation data of EVs, real-time EVSCF is performed. Table 2 shows the values of \(S_{speedup}\) for single-machine and cluster processing as the data size of real-time EVSCF increases from 0.5 GB to 17 GB.

Table 2 Processing time for different data size of real-time EVSCF

It can be seen from Table 2 that with the increasing data size of real-time EVSCF, the speed-up factor increases from 11 to 66. The acceleration effect is obvious, reflecting the ability of the proposed method to process large-scale data.

4.3 Simulation results and discussions

4.3.1 Real-time EVSCF

The real-time EVSCF is performed using the operation data of EVs connected to the grid from Nov. 1, 2015, 00:00 to Apr. 30, 2016, 23:59 with one-minute resolution, comprising a total of 262080 data points. The predicted results of both SCC and SDC are shown in Fig. 3. It can be seen that there is a drop in the curves of both SCC and SDC during the period from Feb. 7 to Feb. 13, 2016 (time: 141120–151200 minutes), which is explained by the Chinese Spring Festival.

Fig. 3

Real-time EVSCF during Nov. 2015 to Apr. 2016

Figure 4 depicts the SCC and SDC curves from Apr. 24, 2016 to Apr. 30, 2016. The daily trend of EVSCF is basically the same, because the travel times of the EV buses are nearly the same every day. Since EVs may leave or access the grid at any time, the values of EVSCF change constantly, reflecting the volatility of EVSC.

Fig. 4

Real-time EVSCF of a week during Apr. 24, 2016 to Apr. 30, 2016

Figure 5 shows the results of real-time EVSCF for the specific day of Apr. 30, 2016. From Fig. 5, the maximum values of SCC and SDC during this day are 167.557 kWh and 117.155 kWh, respectively, and the minimum values are zero. The values of real-time EVSCF for the time periods from 05:00 to 08:59 (time: 301–540 minutes) and 16:00 to 18:59 (time: 961–1140 minutes) are close to zero, which reflects the intermittency of EVSC. This happens during rush hours, when the EVs have completed charging and are disconnected from the grid; therefore, during these periods, few EVs participate in grid scheduling.

Fig. 5

Real-time EVSCF of Apr. 30, 2016

In summary, EVSC is lower during the daytime, close to 0 during rush hours and higher during the night. These time characteristics of EVSC are consistent with the operation frequency of the buses. The probability of grid access at night is much higher than during the day, which results in higher EVSC for EV buses at night. Based on this characteristic, the charging of EVs can be shifted, not only reducing the peak power but also exploiting low electricity prices. The operational regularity also provides the basis for EVSC predictability. The big data analysis results thus reveal the main characteristics of EVSC, namely volatility, intermittency and predictability.

4.3.2 Ultra-short-term EVSCF

For the ultra-short-term and one-day-ahead EVSCF models, the real-time historical EVSC data from Nov. 1, 2015, 00:00 to Apr. 23, 2016, 23:59 are used as training datasets, while the historical EVSC data from Apr. 24, 2016, 00:00 to Apr. 30, 2016, 23:59 are used as testing datasets. Ultra-short-term EVSCF is performed one hour in advance to forecast the next hour, rolling forward to the 168th hour (7 × 24 hours).

To demonstrate the effectiveness of the proposed method, the results from the PRF and PKNN algorithms [32, 41] are compared with the results of the PGBDT algorithm proposed in this paper. Taking into account the accuracy and processing time, the set of parameters for different algorithms are selected, as shown in Table 3.

Table 3 Set of parameters for different algorithms

The errors of SCC and SDC in MAPE and RMSE, and the training times for ultra-short-term EVSCF obtained by the three ML algorithms, are shown in Table 4. It can be seen that PGBDT has the best performance in both accuracy and training time: the MAPE of SCC by PGBDT is 6.52% and 24.01% lower than those of PRF and PKNN, respectively. Similarly, the MAPE of SDC by PGBDT is 6.50% and 24.52% lower than those of PRF and PKNN, respectively. The training time of SCC by PGBDT is 3.23 s and 13.48 s shorter than those of PRF and PKNN, respectively. Overall, the prediction accuracy and training time of PKNN are much inferior to those of PGBDT.

Table 4 Prediction errors and training time of ML algorithms for ultra-short-term EVSCF

In order to quantitatively evaluate the reliability of the PGBDT, PRF and PKNN algorithms, the cumulative probability curves of 168 hours of ultra-short-term EVSCF are obtained, as shown in Fig. 6. As can be seen, more than 92% of the PGBDT results meet a MAPE requirement of within 8%, which means that PGBDT generalizes well to 92% of new samples. At the same error level, the proportion for PRF is about 50%, while that for PKNN is less than 20%. This shows that the reliability of PGBDT is the highest among the three algorithms.

Fig. 6
figure 6

Cumulative probability curves of ultra-short-term EVSCF

Figure 7 shows the actual and predicted values of SDC for the different algorithms on a typical day, Apr. 30, from 00:00 to 24:00. It can be seen that the PGBDT curve closely follows the curve of actual values, while the other two curves deviate considerably during the two selected periods, in which the amplitude of SDC changes by more than 70% within two hours and by less than 10% within two hours, respectively.

Fig. 7
figure 7

Forecasting errors of SDC by three algorithms on Apr. 30, 2016, 00:00–24:00

To further evaluate the performance in different time periods, the 24 hours are divided into three periods according to the operating practice of EVs: peak hours of EVSC (00:00–02:59, 20:00–23:59), flat hours of EVSC (03:00–04:59, 09:00–15:59, 19:00–19:59) and valley hours of EVSC (05:00–08:59, 16:00–18:59). The histograms of the evaluation indexes are shown in Figs. 8 and 9. As can be seen, the errors of PGBDT and PRF differ only slightly in peak hours, but in flat and valley hours, owing to errors accumulated over multiple iterations, the results of PGBDT are stable and much better than those of PRF and PKNN.
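The three-period partition of the day is a simple hour-of-day mapping; a sketch that encodes exactly the intervals listed above (the function name is illustrative):

```python
def evsc_period(hour):
    """Map an hour of day (0-23) to the EVSC period used in the evaluation."""
    if hour in (0, 1, 2, 20, 21, 22, 23):
        return "peak"     # 00:00-02:59 and 20:00-23:59
    if hour in (3, 4, 19) or 9 <= hour <= 15:
        return "flat"     # 03:00-04:59, 09:00-15:59 and 19:00-19:59
    return "valley"       # 05:00-08:59 and 16:00-18:59

labels = [evsc_period(h) for h in range(24)]
```

Grouping the hourly MAPE and RMSE values by these labels yields the per-period statistics compared in Figs. 8 and 9.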

Fig. 8
figure 8

Comparison of MAPE in different time periods

Fig. 9
figure 9

Comparison of RMSE in different time periods

4.3.3 One-day-ahead EVSCF

One-day-ahead EVSCF is performed one day in advance for the next day. For example, in the one-day-ahead 24-hour EVSCF model, predicting the EVSC of Apr. 24, 2016 requires a training dataset from Nov. 1, 2015 to Apr. 23, 2016; predicting the EVSC of Apr. 25, 2016 requires a training dataset from Nov. 2, 2015 to Apr. 24, 2016; and so on, in a rolling way, until the EVSC of Apr. 30, 2016 is predicted. Table 5 shows the forecasting errors of SCC and SDC in MAPE and RMSE and the training time of the PGBDT, PRF and PKNN algorithms for one-day-ahead EVSCF. Similar to the results in Table 4, PGBDT performs best among the three algorithms. Comparing the results in Tables 4 and 5, it can be seen that the smaller the time scale of EVSCF, the smaller the forecasting error in MAPE. The value of RMSE does not vary with the prediction time scale and depends only on the complexity of the data and the amount of outlier data.
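The rolling training-window schedule described above can be written down compactly. The sketch below generates the seven (training start, training end, test day) triples for the Apr. 24–30 test week; the function name is illustrative, and the dates are taken directly from the text:

```python
from datetime import date, timedelta

def rolling_windows(first_test_day, last_test_day, train_start):
    """Yield (train_start, train_end, test_day) triples for a day-ahead rolling forecast."""
    day = first_test_day
    start = train_start
    while day <= last_test_day:
        # The training window always ends on the day before the test day
        yield start, day - timedelta(days=1), day
        start += timedelta(days=1)   # slide the training window forward by one day
        day += timedelta(days=1)

windows = list(rolling_windows(date(2016, 4, 24), date(2016, 4, 30), date(2015, 11, 1)))
```

The first triple trains on Nov. 1, 2015 to Apr. 23, 2016 to predict Apr. 24, and the last trains on Nov. 7, 2015 to Apr. 29, 2016 to predict Apr. 30, matching the scheme in the text.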

Table 5 Prediction errors and training time of ML algorithms for one-day-ahead EVSCF

5 Conclusion

This paper investigates EVSCF using big data analysis and ML algorithms. EVSCF models are established for multiple time scales based on actual operation data of EVs. Real-time EVSCF is achieved using the constructed big data platform, where Hadoop and Spark are 66 times faster than traditional methods. The proposed models are tested and compared with PRF and PKNN, exhibiting superior performance. The simulations use real operation data of grid-connected EVs with one-minute resolution. They show that, for the one-hour ultra-short-term EVSCF model, the PGBDT algorithm achieves the highest accuracy for SCC and SDC, with forecasting errors in MAPE of 3.79% and 3.37%, and reduces training time by 30% and 60% compared with PRF and PKNN, respectively. The PGBDT-based EVSCF model for one-day-ahead 24 hours also performs much better than PRF and PKNN, proving its reliable forecasting performance and generalization ability. The simulation results further demonstrate that the proposed PGBDT-based EVSCF models can exploit the analytical ability of ML in a big data environment and provide powerful support for EV participation in grid scheduling and ancillary services.