1 Introduction

The growth of variable renewable energy (VRE) generation presents new challenges to both system operators and system users. VRE generation (e.g. solar and wind) is weather dependent, i.e. uncertain and intermittent, and can only be planned with limited accuracy and controlled in one direction. VRE increases the magnitude and frequency of changes in the supply–demand balance and also affects the predictability of these changes, resulting in an increased need for flexible resources to manage imbalances between supply and demand in power systems [1]. Any imbalance between supply and demand results in frequency deviations. Voltage and frequency regulation is therefore critical to the stability and reliability of the electrical grid. Maintaining the frequency within certain limits is essential for the safe and efficient operation of electrical equipment: frequency fluctuations can disrupt the timing and synchronization of equipment, potentially causing damage or outages.

The transmission system operator (TSO) is responsible for maintaining the instantaneous balance between generation and consumption of electricity. The TSO activates balancing reserves provided by market participants to resolve real-time system imbalances. To maintain the balance of the grid and reduce the need for balancing reserve activation, the imbalance settlement framework incentivizes private legal entities, called balance responsible parties (BRPs), to maintain the aggregated balance of their generation and consumption portfolio (balancing group). The difference between the nomination and the actual electricity consumption or production aggregated to a BRP is the imbalance volume, and BRPs are charged for their imbalance volume (balancing energy cost). BRPs with a significant share of VRE generation in their portfolio will increasingly face imbalance charges due to schedule deviations. The market environment and trends point to shorter lead times, allowing market participants to trade on the basis of their continuously improving forecasts in both the commodity and balancing markets. With the internationalization of the balancing market in Europe, these processes are taking place in an expanding supply–demand environment.

Imbalance settlement mechanisms provide a framework for the settlement of imbalances. According to the EB Regulation [2], an imbalance price is defined as a positive, negative or zero price defined for each settlement period for an imbalance in each direction. The literature distinguishes between single and dual pricing imbalance settlement systems. Under single pricing, a BRP that falls short of its scheduled energy is subject to the same imbalance price as a BRP that exceeds its scheduled energy. Under dual pricing, two imbalance prices are set: one for positive imbalances, which occur when a BRP has a surplus of energy, and another for negative imbalances, which occur when a BRP has a shortage of energy [3]. The choice between single and dual pricing affects the behavior and incentives of the BRP, as well as the financial options available. Under single pricing, the BRP is motivated to contribute to the overall system balance, while dual pricing incentivizes the BRP to maintain a balanced position within its portfolio [4]. The European TSOs’ imbalance settlement harmonization methodology [5] confirms that European TSOs shall implement single imbalance pricing in accordance with Article 55 of the EB Regulation.

Minimizing system imbalance through frequency regulation is an important system operation function of the TSO. Through the balancing group system, it delegates this task to the BRPs. In the single price model, the BRPs also have an interest in ensuring that their deviation from the schedule is such that the direction of the imbalance contributes to the restoration of system balance. In fact, a supportive schedule deviation may generate revenue for a BRP, depending on the imbalance pricing methodology and price forecast. The BRP must know its position, control its assets, and be able to predict system imbalance in order to take advantage of this.

BRPs can influence this imbalance volume by increasing the accuracy of their nominations, by over-nominating or under-nominating, or by applying internal balancing. Over the time horizon in which the system imbalance can be estimated and exploited, minimizing the balancing energy charge/maximizing revenues is not the only decision criterion for a BRP. The shortening of intraday lead times and the internationalization of wholesale markets during this period also provide a liquid market to cover the current position with commercial transactions in case it deviates from the day-ahead plans. A complex decision logic governs its activities. It takes into account the possibility of participating in organized intraday markets (liquidity, prices) and bilateral trading within the scheduling intervals, the schedules and position of its own aggregated portfolio and individual assets, the operating costs and their adjustments, the forecast direction of the system imbalance and the expected price of balancing energy.

The management of imbalances that are not resolved by the trading markets remains the responsibility of the TSO. The TSO ensures that the quality characteristics of the system are maintained at the required level by activating different types of reserves through the frequency control processes. While the activation of FCR (frequency containment reserve) is distributed and autonomous, with the primary controllers of the units involved in FCR control responding to local frequency measurements, the activation of FRR (frequency restoration reserve) is controlled by the TSO [6]. The manual frequency restoration reserve (mFRR) is activated by manual command, while the automatic frequency restoration reserve (aFRR) demand is activated automatically by the TSO through its closed-loop automatic generation controller (AGC). One of the main inputs to the AGC is the area control error (ACE), an instantaneous power value formed by the deviation of a control area from its aggregated plan and the deviation of the frequency from 50 Hz [7]. The objective of the controller is to reduce this value to 0 using the controlled units. If the TSO can estimate the imbalance, it can reduce the area error to be handled by the aFRR controller by activating the cheaper and more available mFRR reserves in advance. Due to the growing variability of the increasing share of VRE generation, TSOs are expected to increase their demand for reserves [8, 9], and the importance of fast reserves continues to grow, emphasizing the importance of imbalance forecasting for TSOs as well.

In this paper, we present a linear method for the short-term prediction (\(\le 2\) h) of multi-period system imbalance volume. The method relies heavily on the autoregressive property of the imbalance and on the appropriate compilation of publicly available explanatory variables that are correlated with the imbalance. By using the predictor variables associated with each step of the multi-step forecasting horizon, we utilize both past observed and future known inputs.

In recent years, several forecasting techniques have emerged for predicting time series data. Most of these techniques involve non-linear models based on sophisticated mathematical algorithms. Partly motivated by the conclusions of [10], the aim of the authors was to propose a computationally tractable imbalance forecasting method based on linear tools. The main contribution of this paper is to propose an Autoregressive Distributed Lag (ARDL) forecasting algorithm that is able to outperform forecasting methods such as LSTM. The proposed ARDL method, together with two non-linear benchmark models (ETR and LSTM) and an ARIMAX model, were subjected to a comparative analysis using the same prepared and processed input data.

The rest of the paper is organized as follows. Section 2 provides a review of the relevant literature. Section 3 describes the methodology used in the study. Section 4 presents the results and introduces the benchmark models. Finally, Sect. 5 concludes the paper with a summary of findings and implications.

2 Related work

In their study, Garcia et al. [11] explore and highlight the constraints and inadequacies of commonly employed yet simplistic forecasting methods such as ARIMA and exponential smoothing, due to the non-periodic, non-stationary, and noisy nature of imbalance time-series data. To address these challenges, the researchers propose the utilization of neural networks to capture the non-linear and irregular patterns within the data, enabling accurate prediction of daily imbalance medians.

Kratochvil [12] deals with multivariate short-term imbalance forecasts from the perspective of a BRP that tries to profit from taking the opposite direction to the system imbalance. The impact of the most important predictors is analyzed by applying autocorrelation analysis. Focusing on the BRP’s objective, the point forecast of the concrete value of the system imbalance is not calculated. Instead, five intervals of the system imbalance are defined and the predicted imbalance is mapped to one of them, simplifying the prediction problem to a classification. Kratochvil’s ARIMA model achieved an accuracy of 61.0 % in the Czech market.

Contreras [13] predicts hourly imbalances using random forest regression in the Spanish market, achieving an accuracy of 68.3 %.

In their study, Salem et al. [14] introduce an additional forecasting approach that offers the Transmission System Operator (TSO) valuable insights into the anticipated trends of imbalances within the upcoming two hours. This solution is accompanied by prediction intervals, providing information on the reliability of the forecasts. The researchers employ a quantile regression forests ensemble method to predict imbalances in the Norwegian power system. They found that training the model with datasets spanning at least twelve months led to significant enhancements in forecast accuracy compared to using three or six-month datasets.

In [15] a comprehensive model is proposed that combines a Bidirectional Long Short-Term Memory (BLSTM) architecture, an attention mechanism, and an encoder-decoder structure. The proposed model provides valuable insights into the relative significance of features and aids in understanding the complex temporal dependencies present in the data.

To gain valuable market insight and competitive advantage, both imbalance volume and price are necessary inputs for decision-making close to delivery time. While the modeling of day-ahead and intraday electricity markets has been the subject of numerous papers, the modeling of imbalance market prices has received less attention. [16] benchmarked earlier models published before 2015 and concluded that none of the benchmarked models produced informative day-ahead point forecasts. This suggests that information available before the day-ahead market closes is efficiently reflected in the day-ahead market price rather than the balancing market price.

Dumas et al. [17] combine imbalance volume forecasts with reserve costs. The authors utilize a two-step approach, namely they first calculate the probabilities for the system imbalance and then based on that make predictions regarding the imbalance prices.

In the study conducted by Browell and Gilbert [18], it was shown that the forecast of imbalance in the UK could be achieved by utilizing a logistic regression model. The model incorporated demand, wind, solar, and total supply margin as explanatory variables. The findings indicated that the logistic regression consistently outperformed the benchmark by an impressive margin of 4 %.

A novel approach to probabilistic forecasting of German power imbalance prices is presented by Michal Narajewski [10]. This study is significant because it addresses the highly volatile nature of the imbalance market, which is characterised by frequent extreme price spikes. Narajewski uses advanced methods such as lasso with bootstrap, gamlss, and probabilistic neural networks to forecast 30 min before delivery. These methods are compared to a naive benchmark and it is shown that while they do not significantly outperform the benchmark in terms of prediction accuracy, they do provide a significantly better empirical coverage.

Bottieau et al. [19] demonstrated the superiority of machine learning techniques compared to conventional benchmarks when utilizing a feature set consisting of forward prices, as well as recent and forecasted data on generation and load. The authors successfully employed a one-step-ahead forecasting model for system imbalance.

Koch [20] analyzes a strategy of taking positions in the German intraday market based on expected imbalance prices and examines its impact on system stability. It uses a logistic regression model to predict the direction of the overall system balance and to apply a profitable trading strategy. Intraday trading is used to estimate the imbalance prices and decide whether to take a buy or sell position on a quarter-hourly basis. The applied strategy simulates a decision with available information during active trading considering the current and not just average market prices. The model was able to correctly classify the system balance in 68 % of all quarter hours.

Ahmed and Kumar [21] use exponential smoothing and Holt’s exponential smoothing for forecasting nodal electricity prices. The model evaluation and forecasting performance were assessed using criteria such as the Akaike Information Criterion (AIC), the corrected Akaike Information Criterion (AICc), and the Bayesian Information Criterion (BIC).

3 Methods

The models presented in this paper assume that imbalance correlates with other external variables that can be measured or scheduled, or with historical values of imbalance itself. The power system can be affected by several primary sources of imbalance. The amount of power consumed by consumers can be highly variable over the course of a day. This can result in deviations from the schedule. Unplanned power outages due to equipment failure, extreme weather conditions, or other unforeseen circumstances can cause imbalances. Power market conditions, such as changes in demand, fuel, or electricity prices can also cause imbalances in the power system. Table 1 contains all the explanatory variables that are used to train the forecast models and forecast future values.

3.1 Problem statement

For the power system, we have metering data, plans and forecasts available. Measurements are historical data with quarterly resolution. They are available for the quarter preceding the forecast. The plans, which are typically schedules provided by market participants, refer to both past and future time periods. Forecasts published by system operators according to their internal methodology are also used as predictors. Given these time-series data, our goal is to provide a multi-step forecast of system imbalances at forecast execution time t for the current and subsequent quarters \(t+0, t+1,..., t+7\). This is achieved by creating separate lag structures of the feature set and training ARDL models for each forecast timestep. Both training and verification are performed on historical data, adhering to the rules of temporal availability of plan and measured data. The quality of the prediction is evaluated using metrics commonly used in the literature and compared to the performance of leading non-linear machine learning methods.

3.2 Variables

Fig. 1
figure 1

Distribution of imbalance

The distribution of the observed imbalance in Fig. 1 is centered around 0 (median: \(-\)12.5), with slightly more weight in the positive imbalance (skew: 0.24). The imbalance distribution is leptokurtic. It has a sharper peak and fatter tails than a normal distribution. The Q1 and Q3 quartiles are \(-\)79 and 50, respectively. The imbalance time series in Fig. 2 shows fluctuations over time with no clear seasonal pattern, suggesting that it may be stationary. A stationary time series has a constant mean and variance over time. This implies no trend or seasonality. As the lags increase, the ACF plot shows a gradual decrease in correlation. The fact that the autocorrelations remain within significance limits after the first few lags suggests that there is no strong autocorrelation in the data at higher lags. The plot of the PACF shows a spike at lag 1, and then a decline in the partial autocorrelations immediately thereafter. The sharp cutoff after the first lag in PACF, together with the gradual decline in ACF, suggests that an AR(1) model is appropriate for the imbalance time series.

Fig. 2
figure 2

Imbalance time series analysis
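The distributional and autocorrelation diagnostics described above can be reproduced with a short script. The series below is a synthetic AR(1) stand-in for the quarter-hourly imbalance (the real input data is described in Sect. 4.1), and the helper name `lag_autocorr` is our own:

```python
import numpy as np

# Synthetic AR(1) stand-in for the quarter-hourly imbalance series (MW);
# the real input data is described in Sect. 4.1.
rng = np.random.default_rng(0)
n = 10_000
imbalance = np.zeros(n)
for t in range(1, n):
    imbalance[t] = 0.8 * imbalance[t - 1] + rng.normal(scale=60.0)

def lag_autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = x - x.mean()
    return float(np.dot(x[lag:], x[:-lag]) / np.dot(x, x))

q1, med, q3 = np.percentile(imbalance, [25, 50, 75])
mu, sigma = imbalance.mean(), imbalance.std()
skew = float(np.mean(((imbalance - mu) / sigma) ** 3))
print(f"median={med:.1f}, Q1={q1:.1f}, Q3={q3:.1f}, skew={skew:.2f}")
# Slow ACF decay plus a sharp PACF cutoff at lag 1 points to an AR(1) model
print(f"ACF(1)={lag_autocorr(imbalance, 1):.2f}")
```

On real data, the same statistics would be computed on the observed series rather than the simulated one.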

We have applied a seasonal decomposition to the imbalance time series. This suggests that there is no significant seasonality in the data. There is no clear long-term upward or downward trend, and the pattern appears to be somewhat cyclical. The consistent spread of residuals suggests that the model captures trend and seasonality fairly well, leaving random noise that the model cannot explain.

The explanatory variables should, of course, have a significant impact on the imbalance. An important additional selection criterion is that they should be available from public sources that are used in the electricity industry, and with a timing that allows them to be used for short-term forecasting of the imbalance.

Table 1 Usage of predictor variables. Present refers to the period the forecast is executed in

The ’Metered’ type refers to variables that are not available in the forecast period; only values from the past can be used. ’Planned’ variables are available both for the current time interval and for the future time interval. Typical examples are schedules that are submitted by market participants as part of the day-ahead or intra-day processes, or forecasts that are published by the TSOs. ’Fixed’ variables are available for both past and future; however, only the forecast interval is used.

The correlation tests support the engineering considerations that higher load and production schedules are associated with higher imbalances, and that a high proportion of weather-dependent production increases the variability. Recent deviations from schedules have a direct impact on imbalance. These can be calculated from the difference between schedules and actuals.

An upward step of the load curve at the quarter-hour boundary reduces the imbalance, while a downward step increases it. An increase in the continuous load curve within the interval also increases the imbalance. These two effects result from the difference between a quarter-hourly stepwise schedule and the continuous real load. Within the fifteen minutes, the schedule is unchanged, so the imbalance increases continuously as the load ramps up. At the fifteen-minute boundary, the imbalance drops when the schedule jumps, and it then continues to increase from that point on as long as the load keeps ramping upward. This means that when calculating the imbalance for a given quarter hour, both the schedule change and the planned load change should be considered.
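The sawtooth interaction between the stepwise schedule and a continuously ramping load can be illustrated numerically; the minute resolution, the 4 MW/min ramp, and the mean-based schedule are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Minute-resolution illustration: a continuously ramping load against a
# stepwise quarter-hourly schedule set to each quarter's mean load.
minutes = np.arange(60)
load = 1000.0 + 4.0 * minutes                    # MW, ramping 4 MW/min
schedule = np.repeat(
    [load[q * 15:(q + 1) * 15].mean() for q in range(4)], 15)
imbalance = load - schedule
# Within a quarter hour the imbalance grows as the load ramps up,
# then drops at the boundary when the schedule jumps.
print(imbalance[13:17])   # [ 24.  28. -28. -24.]
```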

Numerous features could be extracted from the quarter-hourly time stamp, but correlation checks have shown that the time within a day has a clear impact on the imbalance, while annual, monthly, and weekly samples are not as relevant. To capture the cyclic patterns in time-based data, timestamps are encoded using cyclic encoding. By converting the day’s quarters into cyclical components, we can preserve the relative distances between quarters while introducing cyclical patterns into the model. Rather than representing the quarter as a single value that increases linearly, cyclic encoding represents the feature as two separate variables: one representing the sine of the angle around the circle and the other the cosine.
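A minimal sketch of this cyclic encoding for the 96 quarter hours of a day (the function name is our own):

```python
import numpy as np

def encode_quarter_of_day(quarter_index):
    """Cyclically encode the quarter hour of the day (0..95) as a
    sine/cosine pair on the unit circle."""
    angle = 2 * np.pi * np.asarray(quarter_index) / 96
    return np.sin(angle), np.cos(angle)

q = np.arange(96)
sin_q, cos_q = encode_quarter_of_day(q)
# Quarter 0 (00:00) and quarter 95 (23:45) are close in feature space,
# unlike with a linear 0..95 encoding where they are maximally far apart.
d_cyclic = float(np.hypot(sin_q[0] - sin_q[95], cos_q[0] - cos_q[95]))
print(round(d_cyclic, 3))
```

The two encoded columns replace the raw quarter index in the feature set.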

The correlation between time series can be tested with a regression-based causality test, the Granger test [22]. The Granger causality test is a statistical hypothesis test designed to determine whether one time series can be used to predict another. It measures correlation, not causation, in the sense that it can provide useful information about the relationship between two variables, but it does not provide conclusive evidence of causality. The idea is that a variable X Granger-causes Y if past values of X help explain Y. Granger causality is only relevant for time series variables. The test compares the predictive power of two models: one that includes both past X and past Y as predictors of Y’s future values, and another that only includes past Y. If the coefficients of past values of X in the extended model are statistically significant (p-value < 0.05), then we can reject the null hypothesis of the test (that the coefficients of past X values in the regression equation are zero) and conclude that X Granger-causes Y. The Granger test was carried out for 5 lags; the p-values were far below 0.05, so we can conclude that all the selected variables Granger-cause the imbalance.

The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine whether a time series is stationary or non-stationary. The test is based on an autoregressive model of the time series. The null hypothesis of the ADF test is that there is a unit root in the AR model, implying that the data series is non-stationary because its mean and variance change over time. The ADF p-values for the imbalance time series are effectively 0, and the test statistics lie well below the critical values, so we can confidently reject the null hypothesis of a unit root. This confirms the stationary nature of the data.

3.3 Forecast model

We propose a linear multihorizon imbalance forecasting method that uses both past values of the outcome variable and exogenous variables, and provides flexibility in the design of the lag structure. For this purpose, we choose the Autoregressive Distributed Lag (ARDL) model. ARDL is a time series econometric model that is used to analyze the relationship between two or more variables, where one variable is considered the dependent variable and the others are independent variables. The ARDL model allows for the examination of both short-term and long-term relationships between the variables. It is a dynamic model, as it takes into account the lagged values of the variables, which can help to capture the persistence and dynamics of the relationship. The literature review did not find any studies using the ARDL model for the prediction of imbalances, but it is considered to be a well-established tool for the analysis of time series data, and several publications have used the ARDL model in different areas. [23] extended the relationship between co-integration and error correction models. [24] demonstrated the effectiveness of the ARDL approach in producing reliable estimates, especially when the data sample size is relatively small. Furthermore, [25] highlighted the importance of the ARDL model as an essential tool in dynamic one-equation regression, which is widely used in economic time series modelling. Taken together, these references highlight the widespread application and reliability of the ARDL model in various research areas.

The basic ARDL structure includes a lagged dependent variable, a set of lagged independent variables, and possibly exogenous variables [26]. The model can be estimated using ordinary least squares (OLS) regression or other estimation techniques. Our motivation was the combination of the mathematically simpler linear model with deep domain knowledge to create a prediction process that would have competitive results. This required both the target and independent variables to have regressive components. In the case of ARDL, unlike, for example, the linear ARIMAX also used as a benchmark, where the exogenous variables are not lagged, we had the flexibility to specify the lag structure of both the dependent and independent variables.

The general ARDL model applied is specified by Eq. 1:

$$\begin{aligned} Y_t = \underset{\text {Constant}}{\underbrace{\delta _0}} + \underset{\text {Autoregressive}}{\underbrace{\sum _{p=1}^P \phi _p Y_{t-p}}} + \underset{\text {Distributed Lag}}{\underbrace{\sum _{k=1}^M \sum _{j=1}^{Q_k} \beta _{k,j} X_{k, t-j}}} + \underset{\text {Fixed}}{\underbrace{Z_t \Gamma }} + \epsilon _t \end{aligned}$$
(1)

where,

  • \(Y_t\): the forecasted value of the dependent variable at t

  • \(Y_{t-p}\): values of the dependent variable at \(t-1, t-2,\cdots ,t-p\)

  • P: maximal lag of Y

  • \(X_{k,t-j}\): \(k^{th}\) independent variable at \(t-1, t-2,\cdots ,t-j\) periods.

  • \(Q_k\): maximal lag of \(X_k\)

  • M: number of lagged variables

  • \(Z_t\): fixed non-lagged independent variables at time period t

  • \(\epsilon _t\): error term, assumed to be i.i.d.

  • \(\delta _0, \phi , \beta , \Gamma\) are estimated model parameters.

The ARDL model according to Eq. 1 calculates the value of the next time period after the observations. This is called single-step forecasting because only one step ahead is forecast. In our case, however, we are interested in several quarter-hours, so several steps need to be predicted at the same time; this is a multi-step time series forecasting problem. There are several approaches to multi-step forecasting. The direct approach builds separate models for the forecasts \(t, t+1, \ldots , t+n\) with the same predictor variables, but with different dependent variables for each step. This is a simple method, but it does not allow, for instance, the value of the \(t+2\) prediction to be used to calculate \(t+3\). The recursive method uses the same one-step model several times; the result of the previous step is used as input for the current estimation.

Fig. 3
figure 3

Lag structure of the predictor variables. A separate model is trained for each step. FC refers to the forecasted values of imbalance

An ensemble model is used in this paper. For each step, the predicted value of the dependent variable from the previous step is used to train a separate model with different variables and lag structures. The model is illustrated in Fig. 3. The forecast is made at time t for the time intervals between t and \(t+7\). We assume that for both the dependent and independent variables, observations are already available for time period \(t-1\). The value of the fixed variable (Z in Eq. 1) is known for the given interval t. From these inputs, the ARDL model calculates the value of the dependent variable for period \(t\) (\(FC_{t+0}\)). In the same period t, the forecast for the interval \(t+1\) is also computed (\(FC_{t+1}\)). To do this, the existing observations of the dependent variable are used as predictors, along with its forecast for the previous interval (\(t+0\)) and the value of the fixed variable for \(t+1\). For step \(t+2\), the set of observations remains the same, but the forecasts for \(t+0\) and \(t+1\) of the dependent variable are added to the set of predictors. Multi-step forecasting is implemented iteratively, with different steps using different variable and lag structures, thus training a separate ARDL model for each step.

Algorithm 1
figure a

Train multi-step ARDL and forecast

As shown in Fig. 3, to train the model for each step, the predictor set must be constructed by adding the prediction of the previous step to the predictor set. The algorithm for training and prediction is summarized in the Algorithm 1. After a model has been trained for a prediction step, the corresponding prediction is made, and then the next step is trained by adding that prediction to the set of predictions. The forecast follows a similar logic. The forecast that is associated with a step is both the final result and the input for the forecast of the next step.
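Algorithm 1 can be sketched compactly with ordinary least squares in place of a full ARDL estimator; the synthetic data, the simplified lag structure, and all variable names are our own assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, P, H = 3000, 4, 8              # sample size, AR lags, forecast horizon
# Synthetic stand-ins: imbalance y and one "fixed" exogenous variable z
z = np.sin(2 * np.pi * np.arange(n) / 96)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.9 * y[t - 1] - 0.2 * y[t - 2] + 30 * z[t] + rng.normal(scale=10)

def design(step, fc_cols):
    """Predictors for horizon `step`: constant, AR lags observed up to t-1,
    the fixed variable at t+step, and the previous steps' forecasts."""
    t0, t1 = P, n - H
    lags = np.column_stack([y[t0 - p:t1 - p] for p in range(1, P + 1)])
    cols = [np.ones((t1 - t0, 1)), lags, z[t0 + step:t1 + step, None]]
    if fc_cols:
        cols.append(np.column_stack(fc_cols))
    return np.hstack(cols), y[t0 + step:t1 + step]

# Train one model per step; each step's forecast joins the next step's inputs
forecasts = []
for step in range(H):
    X, target = design(step, forecasts)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    forecasts.append(X @ beta)

mae = [float(np.mean(np.abs(f - design(s, [])[1])))
       for s, f in enumerate(forecasts)]
print([round(m, 1) for m in mae])   # error grows with the look-ahead step
```

In the paper's setting, the OLS fit would be replaced by an ARDL estimation per step, with the variable- and step-specific lag structures of Fig. 3.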

4 Results

4.1 ARDL

Real data were used to verify the method described above. The data used to generate the model variables according to Sect. 3.2 are available on the Hungarian TSO website and the ENTSO-E Transparency Platform. The training period covers quarter-hour intervals from 01/01/2011 to 17/02/2022, while the test period spans from 18/02/2022 to 01/06/2022. In these intervals, all quarter-hour values have been taken into account. The time stamps can be regarded as independent and identically distributed (IID). The time series are stationary: their statistical properties (such as the mean and the variance) do not change over time.

A separate model is trained for each step because a given forecast event involves multiple forecast steps with different variable structures. The steps are also evaluated separately, so that we can see the forecasting performance for forecasting steps \(t+0, t+1, t+2, t+3,...,t+7\) and observe the deterioration of the forecast accuracy with an increasing look-ahead window.

When examining forecast errors, it is worth considering the environment for which the imbalance is being studied. The peak load of the Hungarian power system was 7361 MW. The share of renewable energies was 13.66 % of the total electricity consumption. 286 MW of positive aFRR capacity was procured in 2021 [27].

Fig. 4
figure 4

Sample of observed imbalance and prediction

Figure 4 shows a snippet of the imbalance forecast. The solid line is the forecast value and the dashed line is the observed value for the same quarter of an hour. The time series ’FC pred_t0’ represents the prediction step ’Step t+0’ as shown in Fig. 3. For a given time stamp, there are several predictions (t0–t7), but for the sake of clarity, only \(t+0\) and \(t+1\) are shown in the figure. The forecast error is the difference between the observed and the predicted value. In addition to the magnitude of the error, the sign accuracy is of particular importance, because the methods of balancing energy management and settlement for positive and negative system imbalances are very different. Therefore, in addition to the well-known metrics, we present an independently developed metric for evaluating sign accuracy.

Figure 5 contains the metrics used to evaluate the predictions. The mean absolute error (MAE) is a measure of the average size of the errors or differences between the predicted values and the actual values, without taking into account their direction. It is calculated by taking the absolute difference between the predicted value and the actual observed value, and then averaging these differences over all the observations [28]. The resulting value, expressed in the same units as the original variable, represents the average absolute error between the predicted and actual values.

Fig. 5
figure 5

Evaluation metrics of the forecast steps

The root mean squared error (RMSE) is a measure of the average size of the errors between the predicted value and the actual value, taking into account the direction of the errors. The RMSE is calculated by taking the square root of the average of the squared differences between the predicted and actual values [28]. Like the MAE, the RMSE is expressed in the same units as the original variable, and lower values indicate better model performance. RMSE penalizes large errors more compared to MAE, as it gives greater weight to larger errors due to the squaring operation.
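For reference, both metrics in a few lines (the toy numbers are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average error magnitude, direction ignored."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([10.0, -40.0, 25.0, 0.0])
y_pred = np.array([12.0, -55.0, 20.0, 30.0])
print(mae(y_true, y_pred))    # (2 + 15 + 5 + 30) / 4 = 13.0
print(rmse(y_true, y_pred))   # sqrt((4 + 225 + 25 + 900) / 4) ≈ 16.99
```

The single 30 MW outlier pulls RMSE well above MAE, illustrating the squaring effect.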

As the number of prediction steps increases, all prediction metrics increase. This is as expected, since both the imbalance and the measured values of the independent variables affecting the imbalance move further away from the predicted value over time.

Figure 6a shows the cumulative distribution of the absolute error. The error in 90 % of the projections is smaller than 60 MW at \(t+0\), but at \(t+7\) the error is larger than 160 MW in 10 % of the projections.

Fig. 6

Forecast errors of the ARDL model

Sign accuracy is important because the value, or more precisely the sign, of the system imbalance is used to build the logic that encourages power system actors to reduce the system imbalance. Figure 1 shows that system imbalance values around 0 are the most common, and we can assume that the sign is hardest to predict accurately for small imbalances. To illustrate this, Fig. 6b evaluates the sign accuracy as a function of a threshold parameter. The sign accuracy percentage (SAP) is calculated by dividing the number of correct predictions of the direction (Eq. 2) by the total number of directional predictions (Eq. 3). In addition, a threshold value is applied so that the SAP is only calculated for forecasts whose absolute value exceeds the threshold (40 MW in Fig. 5).

$$\begin{aligned} SA_{step,\tau }^{True} &= \sum _{t} \Bigl [ (FC_{t+step} \ge \tau \wedge Y_{t+step} > 0) \vee (FC_{t+step} \le -\tau \wedge Y_{t+step} < 0) \Bigr ] \\ SA_{step,\tau }^{False} &= \sum _{t} \Bigl [ (FC_{t+step} \ge \tau \wedge Y_{t+step} < 0) \vee (FC_{t+step} \le -\tau \wedge Y_{t+step} > 0) \Bigr ] \end{aligned}$$
(2)
$$\begin{aligned} SAP_{step,\tau } = \frac{SA_{step,\tau }^{True}}{SA_{step,\tau }^{True}+SA_{step,\tau }^{False}}\times 100 \end{aligned}$$
(3)

where,

  • \(\left[ condition \right]\) is the Iverson bracket, equal to 1 if the condition in square brackets is satisfied and 0 otherwise

  • \(SA_{step,\tau }^{True}, SA_{step,\tau }^{False}\): number of correct and incorrect directional predictions given a threshold value \(\tau\) and a forecast window step

  • \(FC_{t+step}\): imbalance forecast of interval \(t+step\) executed at time interval t

  • \(Y_{t+step}\): measured imbalance of interval \(t+step\)

  • \(\tau\): threshold value, \(\tau \ge 0\); only forecasts above \(\tau\) or below \(-\tau\) are considered

  • \(SAP_{step,\tau }\): sign accuracy percentage
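The SAP defined by Eqs. (2)–(3) can be sketched directly in Python; the implementation below is a minimal transcription of the formulas above, and the sample values are hypothetical:

```python
import numpy as np

def sap(fc, y, tau):
    """Sign accuracy percentage (Eqs. 2-3): among forecasts with
    |forecast| >= tau, the percentage whose sign matches the measured
    imbalance. Forecasts inside the threshold band are ignored."""
    fc, y = np.asarray(fc), np.asarray(y)
    # Correct directional calls (Iverson brackets summed over t).
    true = np.sum(((fc >= tau) & (y > 0)) | ((fc <= -tau) & (y < 0)))
    # Incorrect directional calls.
    false = np.sum(((fc >= tau) & (y < 0)) | ((fc <= -tau) & (y > 0)))
    return 100.0 * true / (true + false)

# Hypothetical example with tau = 40 MW: the 10 MW forecast is ignored,
# two of the remaining three calls have the correct sign.
example = sap([50, -60, 10, 45], [30, -5, 100, -20], tau=40)
```

Note that, as in Eq. (2), observations of exactly 0 MW count as neither correct nor incorrect.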

4.2 Benchmark models

In the field of short-term forecasting of power system imbalances, it is imperative to establish robust benchmarks to evaluate the effectiveness of newly developed models. In this context, our study uses three widely accepted and scientifically validated benchmark models: ARIMA(X), LSTM, and ETR. The autoregressive integrated moving average (ARIMA) model, a mainstay in time series forecasting, is chosen for its proven utility in capturing linear dependencies and trends in time series data, particularly in electricity load forecasting [21]. The long short-term memory (LSTM) model, a recurrent neural network architecture, is included because of its superior ability to capture complex temporal dynamics and long-term dependencies in sequence data, making it particularly suitable for multi-step time series forecasting, as demonstrated in power load forecasting applications [10]. Finally, the extra-trees regressor (ETR), a variant of the Random Forest algorithm, is chosen for its effectiveness in handling non-linear relationships and complex interactions between variables, which is critical in time series forecasting, especially in contexts involving autocorrelated dependent variables and multiple lagged explanatory variables [19]. Together, these models provide a comprehensive framework for evaluating the performance of our ARDL model, encompassing both linear and nonlinear dynamics relevant to forecasting short-term imbalances in power systems. Although several studies use the methods selected for the benchmark, the literature on short-term imbalance forecasting is sparse, and comparisons are only valid if the models are run on the same assumptions and data. To this end, the benchmark models were not only presented but also implemented and applied to the problem at hand.

The linear ARIMAX model extends the traditional ARIMA model by including exogenous variables. The core of ARIMAX is the ARIMA model, which includes autoregression (AR), differencing (I) and moving average (MA) components [29]. The AR part expresses the current value of a variable as a linear combination of its own past values; the order of the AR part indicates the number of lagged terms used. The 'I' (integrated) part applies differencing to ensure stationarity. The MA part models the current value of the series as a linear combination of current and past error terms; the order of the MA determines the number of lagged forecast errors in the forecast equation. The main extension of ARIMAX is the inclusion of external variables that may influence the forecast. These variables are not part of the time series itself, but are assumed to influence it; they are treated as additional regressors in the forecasting equation and their coefficients are estimated alongside the ARIMA parameters. The input data of an ARIMAX model should satisfy several key characteristics. Stationarity of the dependent variable has already been checked, and the Granger causality test shows that the independent variables have a plausible link to the dependent variable.

Augmented Dickey-Fuller tests were performed to determine the order of differencing, and models were then fitted with autoregressive and moving average orders in the range 0–8 to determine the optimal parameters of the ARIMAX model. The model with the best AIC value was selected (p = 5, d = 0, q = 1).
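Purely as an illustration of the AR-with-exogenous-regressors idea underlying ARIMAX, the sketch below fits a simplified ARX(p) model by ordinary least squares; the function name and data are illustrative, and the MA and differencing components of the full ARIMAX(5, 0, 1) model are omitted:

```python
import numpy as np

def fit_arx(y, x, p):
    """Fit y_t = c + a_1*y_{t-1} + ... + a_p*y_{t-p} + b*x_t by OLS.
    Illustrative ARX sketch, not the paper's exact ARIMAX model."""
    rows = [np.concatenate(([1.0], y[t - p:t][::-1], [x[t]]))
            for t in range(p, len(y))]
    X = np.array(rows)                      # design matrix: [1, lags, exog]
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef                             # [c, a_1..a_p, b]
```

Fitted on a series that truly follows an ARX(1) process, the least-squares estimate recovers the generating coefficients; in practice the order p would be chosen by an information criterion such as the AIC, as in the paper.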

The extra-trees regressor uses an ensemble of unpruned decision trees in a process known as "extremely randomized trees" [30]. This method is particularly useful for time series forecasting tasks because it effectively captures complex interactions between variables, thereby improving the model's predictive performance. Its inherent ability to accommodate nonlinear relationships makes it well suited for this task, as it can identify intricate patterns and dependencies among the variables [31].

Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture first proposed by Hochreiter and Schmidhuber [32], and has been widely adopted for sequence prediction problems due to its potential to remember and learn long-term dependencies in sequential data. This property of LSTM makes it particularly well suited for multi-step time series forecasting, where the objective is to predict several future steps of a target variable given its historical observations and possibly also exogenous variables. It also plays a vital role in efficient power system planning [10].

LSTM units were developed to address the limitations of conventional RNNs, particularly the vanishing gradient problem [33], which makes it difficult for RNNs to learn long-term dependencies in sequence data. An LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell is responsible for remembering values over arbitrary time intervals, and each of the three gates can be thought of as a conventional artificial neuron, performing operations on the data passing through it. The gates regulate the flow of information into and out of the cell. The cell state, the "memory" of the LSTM unit, carries information throughout the sequence processing. The forget gate is a sigmoid function that decides what information should be discarded or kept, and the input gate updates the cell state with new information. The hidden state contains information on previous inputs and is also used for predictions. Because the output of each LSTM unit is fed into the next, and changes to the cell state are governed by the gates, the LSTM can maintain information in "memory" over time.
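The gate operations described above are commonly written as follows (standard LSTM formulation, with \(\sigma\) the sigmoid function, \(\odot\) element-wise multiplication, and \([h_{t-1}, x_t]\) the concatenation of the previous hidden state and the current input):

$$\begin{aligned} f_t &= \sigma (W_f [h_{t-1}, x_t] + b_f) \\ i_t &= \sigma (W_i [h_{t-1}, x_t] + b_i) \\ o_t &= \sigma (W_o [h_{t-1}, x_t] + b_o) \\ \tilde{c}_t &= \tanh (W_c [h_{t-1}, x_t] + b_c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ h_t &= o_t \odot \tanh (c_t) \end{aligned}$$

Here \(f_t\), \(i_t\) and \(o_t\) are the forget, input and output gate activations, \(c_t\) is the cell state and \(h_t\) the hidden state.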

To make multi-step imbalance forecasts from a lagged multivariate input, a sequential LSTM model was created. In a sequential model, the layers are stacked one after another, so that the output of one layer serves as the input to the next. An encoder-decoder architecture [34] was implemented; this architecture is commonly used for sequence-to-sequence prediction tasks, where the input sequence is encoded into a compressed, fixed-length representation (context vector), and the decoder then generates the output sequence based on that representation.

The first encoder LSTM layer takes the input sequence, processes it, and learns the context vector. Its final hidden state is passed to a 'repeat vector' layer that replicates the context vector as many times as there are output steps (the imbalance forecast window). This step matches the context vector to the desired length of the output sequence and prepares the encoded representation to be fed into the decoder part of the network. The decoder LSTM layer then generates the multi-step output sequence, and a dense layer is applied to each time step of the output sequence independently, ensuring that the output matches the desired shape.
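The tensor-shape bookkeeping of this encoder-decoder layout can be sketched as follows; all sizes are hypothetical, chosen only to make the repeat-vector step concrete:

```python
import numpy as np

# Hypothetical sizes: 8 lagged input steps, 5 features,
# 32 hidden units, 8 output (forecast) steps.
n_in, n_feat, n_hidden, n_out = 8, 5, 32, 8

x = np.zeros((1, n_in, n_feat))        # one input sequence (batch of 1)
context = np.zeros((1, n_hidden))      # encoder's final hidden state

# 'Repeat vector' step: copy the context vector once per output step,
# so the decoder receives the same encoding at every forecast interval.
repeated = np.repeat(context[:, None, :], n_out, axis=1)

# The decoder LSTM maps (1, n_out, n_hidden) to (1, n_out, n_hidden);
# a dense layer applied per time step then yields (1, n_out, 1).
```

The repeat step is what reconciles the fixed-length context vector with the multi-step output: without it, the decoder would have no input for steps beyond the first.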

The results of the ARDL model were compared with the benchmark models and are summarized in Table 2. For MAE and RMSE, a lower value indicates better forecasting performance. For SAP, a higher value is favorable, because the value indicates the percentage of forecasts above 40 MW or below -40 MW that correctly predict the system direction. The performance of ARDL is better than the benchmark models in the first 4 intervals, i.e. the first hour. In the second hour, ARIMAX performs slightly better.

The present study is based on the assumption that the correct choice of relevant external variables and the design of a lag structure tailored to the forecasting task effectively support forecast accuracy. These preparatory steps made the mathematically simpler dynamic regression method competitive with the state-of-the-art neural network and random forest models.

Table 2 Comparison of prediction results using 5 months of test data

5 Discussion and Conclusion

In recent publications, the forecast of the volume of imbalances has usually been considered a preliminary step to the price forecast. We believe that the joint prediction of imbalance volume and price provides market participants with the information they need to take the most advantageous market position by trading on intraday markets or by controlling their assets. The direction of the system imbalance is most important from a market position perspective; the forecasted volume, however, helps to judge how reliable the sign prediction is. In addition, TSOs publish anonymous balancing bids in near real time, so that the resulting balancing energy price can be approximated from the balancing energy demand and the merit order of the bids, given knowledge of the settlement rules.

The single price model encourages BRPs to take positions that contribute to the balance of the system, but their response is strongly influenced by their position, the market opportunities and the cost structure of their portfolio. In the case of a negative system imbalance (shortage), the balancing energy price is derived from the average price of upward regulation, which is typically well above spot market prices. The direction of payment is determined by the imbalance position of the BRP: whether its deviation is non-aggravating or aggravating decides whether, at this high price, the TSO pays the BRP or the BRP pays the TSO. Here the BRP should take a position above its schedule, even by purchasing energy on the intraday (ID) market (if \(Price_{ID} < Price_{BE}\)), because the single price settlement model rewards a positive deviation from the schedule at a high price. In real time, a potential BRP deficit should be reduced, since the cost of upward regulation is assumed to be lower than the predicted upward balancing energy price (if \(SRMC_{UP} < Price_{BE}\)). Whether through demand generated on the ID market or self-balancing of the portfolio, production is increased by processes outside the activation of balancing reserves. As a result, the system imbalance decreases, along with the balancing energy demand and the corresponding price. In the case of a positive system imbalance (surplus), there is less difference between the two payment directions, as the balancing energy price is derived from downward regulation, which can be positive or negative but is in any case closer to zero than the upward price.

For a portfolio of gas engines, it is worth controlling upward during a negative imbalance and downward during a positive imbalance if its short-run marginal cost lies between the negative and positive balancing energy prices. A PV portfolio with a short-run marginal cost around 0 should generally not be controlled downward during a positive system imbalance; when there is a surplus in the system, PV may be curtailed only if the downward balancing energy price is negative. Thus, for a given BRP position, the prediction of the imbalance direction is of great importance, because the positive and negative directions require completely different reactions. The decision logic can be summarised by a decision tree with the imbalance forecast, the balancing energy price and the characteristics of the BRP's portfolio in the nodes. The development and evaluation of this decision logic will determine the future direction of research on this topic.
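A minimal sketch of the shortage branch of this decision logic is given below; the function name, return values and inputs are illustrative only, and the full decision tree (including the surplus branch and portfolio characteristics) is left for the future work described above:

```python
def brp_action(imbalance_fc, price_be_fc, price_id, srmc_up):
    """Hypothetical sketch of the shortage-side BRP decision logic.
    imbalance_fc: forecast system imbalance in MW (negative = shortage)
    price_be_fc:  forecast upward balancing energy price
    price_id:     intraday market price
    srmc_up:      short-run marginal cost of ramping up own generation"""
    if imbalance_fc >= 0:
        # Surplus: downward logic depends on the sign of the downward
        # balancing energy price (see text); not sketched here.
        return ["surplus case: downward logic, see text"]
    actions = []
    if price_id < price_be_fc:
        # Positive deviation is rewarded at the high single price.
        actions.append("buy on intraday market (go long vs. schedule)")
    if srmc_up < price_be_fc:
        # Self-balancing is cheaper than the expected imbalance charge.
        actions.append("ramp up own generation (self-balance)")
    return actions or ["hold position"]
```

Either action reduces the system imbalance outside the balancing reserve process, which in turn lowers the balancing energy demand and price, consistent with the feedback described in the text.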