1 Introduction

Accurate river streamflow forecasting is pivotal for effective water resource planning, infrastructure design, utilization, optimization, and flood planning and warning. Streamflow prediction is a cumbersome task due to several factors, including but not limited to uncertainties and complexities in hydrological processes and characteristics such as precipitation, the influence of climate change, topography, and lack of observed data in some cases [1,2,3].

While Brown (1963) [4] distinguishes prediction as a subjective method and forecasting as an objective method, Brass (1974) [5] defines a forecast as "looking into the future" and prediction as a systematic approach to anticipate future events. In this paper, the terms prediction and forecasting are used interchangeably to refer to the systematic analysis and anticipation of future streamflow in comparison to observed streamflow. This paper attempts to apply streamflow prediction methodologies to historical observed univariate streamflow time-series data, with the Sobat River in South Sudan serving as a case study.

Streamflow forecasting relies predominantly on time-series data, which represent observations recorded at discrete points or continuously through time. These observations are either univariate or multivariate time-series. In univariate time-series, which is the focus of this paper, a single variable or feature is exclusively observed at each data point, exemplified by the volumetric rate of water flow measured in cubic meters per second (m3/s). This approach streamlines the analysis, concentrating solely on the historical values of a single variable, operating on the assumption that the variable is primarily dependent on its historical values, thereby serving as a foundation for forecasting and prediction methodologies [6,7,8].

Foundational and traditional time-series forecasting models such as exponential smoothing [9,10,11], auto-regressive integrated moving average (ARIMA) and the seasonal auto-regressive integrated moving average (SARIMA) [7, 12,13,14,15], have long been extensively used in streamflow prediction, capturing vital parameters, patterns and seasonality in observations. A number of studies have demonstrated that the intrinsic linear or non-linear dynamic characteristics that govern hydrological time series have a profound impact on the performance of forecasting models [16,17,18]. Consequently, there is a pressing need to introduce and develop new models that can effectively overcome the challenges posed by the intrinsic features of time-series. For instance, Farhang Rahmani and Mohammad Hadi Fattahi [19] demonstrated that reducing the irregularities and extreme variations in river flow data before anomalies (extreme events such as floods) using multifractal analysis led to predicting river floods approximately 12 days in advance. Wang et al. [20] conducted a study on forecasting annual runoff time-series using ARIMA and the ensemble empirical model decomposition (EEMD). Their study concluded that EEMD effectively enhanced forecasting accuracy and that their proposed EEMD-ARIMA model significantly improved ARIMA time-series approaches for annual runoff time-series forecasting. It is evident that the advent of data-driven machine learning models such as long short-term memory (LSTM), random forest regression and Facebook Prophet (Prophet), among others, have marked a significant evolution in time-series prediction, integrating advanced techniques to enhance predictive capabilities [7, 11, 21,22,23,24,25].

SARIMA is an extension of a class of ARIMA models, that incorporates seasonality into its structure. It has been widely applied in several studies, including streamflow prediction, due to its flexibility in adapting to different combinations of trends, seasonality, and autocorrelation patterns present in observations [26]. Using SARIMA for prediction requires careful selection of appropriate parameters such as the nonseasonal autoregressive order (p), differencing (d), nonseasonal moving average (q), seasonal differencing (D), seasonal autoregressive order (P), seasonal moving average (Q), and period (S) at which observations follow a similar trend [27]. The choice of these parameters requires domain knowledge and can be computationally expensive or challenging when dealing with complex observations. However, an auto-ARIMA algorithm that automates the selection of parameters has recently gained prominence because of its efficiency and effectiveness compared to using domain knowledge to make educated guesses of model parameters [28, 29]. The auto-ARIMA function automates model selection by utilizing an algorithm that performs a grid search to determine optimal hyperparameters. The optimal model is the model with the lowest Akaike information criterion (AIC) value [28, 29]. By automating the selection of SARIMA parameters, auto-ARIMA simplifies workflows, making them accessible to a wider range of users from different scopes while enhancing efficiency [28].

Prophet, on the other hand, developed by Facebook, is robust against outliers and easier to fine-tune than traditional methods such as ARIMA and SARIMA, while also excelling in effectively handling trends, seasonality, holidays, and missing data [30, 31]. Studies have highlighted Prophet’s superiority in terms of both forecasting accuracy and computational time; hence, Prophet is an efficient and reliable choice for time-series analysis [32, 33].

Recent studies have utilized both traditional and data-driven methodologies, including SARIMA and Prophet, for predicting streamflow in various applications [34,35,36,37,38,39].

The SARIMA and Prophet models among other machine learning techniques, were used to forecast natural streamflow in the Upper Collorado River Basin [39]. This study illustrated the efficacy of SARIMA and Prophet in predicting streamflow up to 24 months in advance. When compared to a conceptual model of the hydrologic modeling system (HEC-HMS in streamflow forecasting in a dam), the SARIMA model produced a lower prediction accuracy (RMSE 2.8 and 3.4 m3/s, respectively) [40]. A time-series prediction of monsoon rainfall in India elaborated the strength of the SARIMA model in capturing the relationships between past and current data, revealing its efficacy with reduced forecasting errors compared to the Prophet model [38]. A related study evaluated the efficacy of both the SARIMA and Prophet models in predicting short term traffic in Tamil Nadu, India [41]. The two models had similar efficacies, with the SARIMA model having minimal errors during forecasting compared to the Prophet model [41]. In a study on the forecasting of monthly streamflow for the White Nile River at Malakal station in South Sudan [42], the streamflow was found to be seasonal and non-stationary, and the residuals of the SARIMA model were not correlated, resulting to a reliable forecast of monthly flow for three successive years with a higher prediction accuracy. Another study compared SARIMA with the Holt-Winters model(s) for forecasting monthly streamflow in the western region of Cuba [43]. Although the Holt-Winters models outperformed SARIMA in reproducing the mean series seasonality when training observations were limited, both the SARIMA and Holt-Winters models exhibited comparable performance in forecasting streamflow one year ahead when provided with longer training subsets [43].

While SARIMA and Prophet have been widely used individually for forecasting streamflow, their derivatives or combinations have also been explored in various contexts [44,45,46,47]. For instance, a multi-regime switching ARIMA-MS-GARCH model for daily streamflow prediction emphasized the importance of incorporating multiple regimes in modeling streamflow dynamics, demonstrating the potential of advanced time-series models such as SARIMA and its derivatives for enhancing precision [48]. Similarly, Mehr et al. [49] evaluated a hybrid genetic programming and SARIMA models for daily streamflow prediction in Boreal lake–river systems [49]. They reported Nash-Sutchliffe efficiency values exceeding 99% for both models. Additionally, the hybrid GP-SARIMA model eliminated approximately one-fourth of the RMSEs of the standalone models, indicating strong performance in streamflow forecasting. Furthermore, a study applied a hybrid Facebook Prophet-ARIMA framework for forecasting high-frequency temperature data and concluded that the Prophet model excels at modeling complex seasonality and general trends, while ARIMA excels at capturing short-term dynamics [50].

Although several studies have used the SARIMA and Prophet models individually for streamflow prediction, there is a significant gap in the literature regarding the application of hybrid SARIMA-Prophet models, specifically for univariate streamflow prediction. Despite the extensive use of both SARIMA and Prophet in previous literature, each has its own limitations. Although they are robust at capturing linear and seasonal patterns, SARIMA models struggle with nonlinearity and volatility changes. Prophet, on the other hand, excels in handling complex seasonality and trends but may not capture short-term dynamics as effectively as SARIMA [51, 52]. Existing studies have focused primarily on either SARIMA or Prophet, with limited exploration of the synergistic benefits of combining these models, even after studies indicated that both the SARIMA and Prophet models struggled to correctly predict extreme events, resulting in either underestimations or overestimations [51, 52]. The present study addresses this gap by investigating whether a hybrid SARIMA-Prophet model can improve the accuracy and robustness of streamflow forecasting for the Sobat River in South Sudan, compared to using SARIMA and Prophet individually.

We introduce dynamic weighting that adjusts the contributions of the SARIMA and Prophet models based on their predictive performance on an evaluation segment of the observed time-series data. This adaptive mechanism ensures that the hybrid model remains responsive to changes in the underlying data. A detailed comparative analysis of the residuals and the predictive performance of the SARIMA, Prophet, and hybrid SARIMA-Prophet models is conducted using the mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R2). By combining these models, we aim to enhance the predictive performance, particularly in the context of hydrological time-series where both linear trends and complex seasonal patterns are present.

The next sections of this paper are designed as follows: the Data and Methods section details the methodological approach, describes the formulation of the models and the evaluation metrics; the Results section presents the comprehensive findings of the methods; the Discussion section analyses the effectiveness of the models used, with a comparative analysis to the literature, and suggests directions for future research; and the Conclusion section summarizes the key findings of the study.

2 Data and methods

In this section, we present the methodology employed in the study. Figure 1 shows the structure of the implementation of the methodology.

Fig. 1
figure 1

Workflow of the methodology

2.1 Study area

The White Nile River and its tributaries serve as crucial water sources in South Sudan. Unfortunately, these rivers are also major sources of riverine flooding.

Despite their significance in ecosystems, the availability of gauging stations for monitoring the flow of these rivers is limited. Figure 2 shows the study area. The Sobat River is one of the largest tributaries of the White Nile River. It originates from Ethiopia and converges with the Bahr-el-Jebel section of the White Nile River in Malakal town, which has experienced severe flooding in recent years [53]. At present, the Sobat River does not have a single gauging station, although streamflow was recorded between the early twentieth century and the early 1980s at the Dollieb Hills gauging station. Due to missing observations between 1945 and 1960, the earlier portion of the data was considered for this study. The lack of literature on streamflow prediction conducted on the Sobat River highlights the contributions of this study to future forecasting efforts, particularly if a new gauging station becomes operational along the Sobat River.

Fig. 2
figure 2

Study area (Map generated using Quantum Geographic Information System—QGIS desktop software)

2.2 Data used

Monthly historical univariate streamflow data from the Dolleib Hills gauging station on the Sobat River were collected from the Global Runoff Data Center (GRDC) [54] spanning from 1907 to 1944 as shown in Fig. 3. In Fig. 4, a detailed decomposition of the observations is represented, separated into four distinct panels: observed, trend, seasonal, and residual. The 'Observed' panel directly corresponds to the original streamflow data, providing a contextual basis for the decomposition. The 'Trend' panel highlights underlying nonseasonal changes or trends in the data, which might indicate long-term changes in the river's flow characteristics. The 'Seasonal' panel isolates the recurring seasonal pattern, revealing the periodic fluctuations that likely reflect climatic variations throughout the year. Finally, the 'Residual' panel shows the irregularities and anomalies in the data that are not explained by the trend or seasonal components, potentially indicating random events or nonsystematic factors affecting the streamflow.

Fig. 3
figure 3

Decomposition of the Sobat River hydrograph

Fig. 4
figure 4

A hydrograph of the Sobat River

For our analysis, we partitioned the observed streamflow data (Table 1) from 1907 to 1937 for model training and from 1938 to 1944 for model validation.

Table 1 Study data

2.3 Seasonal ARIMA model

Despite the possibility of removing deterministic seasonal effects from a time-series, many stochastic seasonal components appear in the residuals of time observations [55]. The SARIMA model captures both nonseasonal and seasonal auto-regressive (AR), differencing (I), and moving average (MA) components. It has been extensively used for streamflow prediction in the literature, owing to its robustness in handling seasonality and model interpretability [43, 56]. The equation of the SARIMA [55] model is expressed by Eqs. (1, 2, and 3):

$$ARIMA\left(p,d,q\right) x \left(P,D,Q\right)S$$
(1)

where p represents the nonseasonal AR order, d represents the nonseasonal differencing, q represents the nonseasonal MA order, P represents the seasonal AR order, D represents seasonal differencing, Q represents the seasonal MA order, and S represents the periodicity or the seasonal span pattern.

Eq. (2) formulates SARIMA without any operations being differentiated.

$$\left(1\right)\Phi \left({B}^{S}\right)\phi \left(B\right)\left({x}_{t}-\mu \right)=\Theta ({B}^{S})\theta (B){\omega }_{t}$$
(2)

where the nonseasonal AR term is \(\phi \left(B\right)=1-{\phi }_{1}B-\dots -{\phi }_{p}{B}^{p}\), the nonseasonal MA term is \(\uptheta \left(B\right)=1+{\theta }_{1}B+\dots +{\theta }_{q}{B}^{q}\), the seasonal AR term is \(\Phi \left({B}^{S}\right)=1-{\Phi }_{1}{B}^{S}-\dots -{\Phi }_{P}{B}^{PS}\), and the seasonal MA term is \(\Theta \left({B}^{S}\right)=1+{\Theta }_{1}{B}^{S}+\dots +{\Theta }_{Q}{B}^{QS}\).

When a backshift of the seasonal period(S) of 12 corresponding to a yearly seasonal pattern is applied to the seasonal portion of the model, Eq. (3) is obtained.

$$\left(1-{\phi }_{1}B\right)\left(1-{\Phi }_{1}{B}^{12}\right)\left(1-B\right)\left(1-{B}^{12}\right){y}_{t}=(1+{\theta }_{1}B)(1+{\Theta }_{1}{B}^{12}){\varepsilon }_{t}$$
(3)

where \({y}_{t}\) is the time-series at time \(t\) and \({\varepsilon }_{t}\) is the error term.

2.3.1 Data preparation for the SARIMA model

Before the SARIMA model can be used for any predictions, it is necessary to determine the stationarity of the observed streamflow. To assess the stationarity of the streamflow data, we utilized two widely used statistical tests: the augmented Dickey-Fuller (ADF) and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests. The ADF and KPSS tests are standard time-series analyses for evaluating the stationarity of data and are widely accepted in various fields, including hydrological time-series [57,58,59]. The ADF test focuses on detecting the presence of a unit root in the data, which is indicative of nonstationarity. In contrast, the KPSS test assesses whether the underlying trend of the data is stationary or nonstationary. Utilizing both methods is key in double-checking stationarity and trends and employing seasonal or nonseasonal differencing, depending on the results. While other methods for testing stationarity exist, such as the Phillips–Perron test [60] or visual inspection of time-series plots, we chose the ADF and KPSS due to their ease of interpretability and widespread use.

The general equation of the ADF test [61] is represented by Eq. (4).

$${\Delta y}_{t}= \alpha +\beta t+{\gamma y}_{t-1}+{\delta }_{1}{\Delta y}_{t-1}+{\delta }_{2}{\Delta y}_{t-2}+\dots +{\delta }_{k}{\Delta y}_{k}+{\epsilon }_{t}$$
(4)

where \({\Delta y}_{t}\) is the differenced time-series; \(t\) represents time; \({y}_{t-1}\) represents the lagged value of the time-series; \({\Delta y}_{t-1}\), \({\Delta y}_{t-2}\), and \({\Delta y}_{t-k}\) are lagged differences of the time-series; \(\alpha\) is a constant term; \(\beta\) is the coefficient of the trend term; \(\gamma\) is the coefficient of the lagged level; \({\delta }_{1}\), \({\delta }_{2}\), and \({\delta }_{k}\) are the coefficients of the lagged differences; and \({\epsilon }_{t}\) is the error term.

The null hypothesis of the ADF test (H0) states that the data is nonstationary, while the alternative hypothesis (H1) states that the time-series data is stationary.

The equation of the KPSS test [62] is presented in Eq. (5).

$${y}_{t}= {\mu }_{t}+{\epsilon }_{t}$$
(5)

where: \({y}_{t}\) is the observed time-series, \({\mu }_{t}\) is the deterministic trend component, and \({\epsilon }_{t}\) is the error term.

The null H0 of the KPSS test states that the time-series data is trend stationary, and H1 states that the data is nonstationary.

2.3.2 SARIMA model identification

ARIMA models generally involve selecting the appropriate AR, MA, and I terms. For a SARIMA model, both the nonseasonal and seasonal AR and MA terms and (I)(if any), including the periodicity (S), are determined.

These orders can be selected manually depending on the analysis of the sample autocorrelation function (ACF) and the Partial autocorrelation function(PACF) (taking into consideration a certain number of lags). However, due to the cumbersomeness of making an educated guess of the seasonal and nonseasonal AR and MA terms, an auto-ARIMA [28, 63] function that performs a grid search was preferred in conjunction with a thorough study of the ACF and the PACF.

2.4 Prophet Model

Facebook-Prophet is an open-source time-series forecasting model developed by Facebook and designed to handle various time-series patterns, including seasonality, and changes in trends [64]. The trend component, g(t), is often modeled using a piecewise linear or logistic function. The seasonality component, s(t), can be modeled as a Fourier series or with custom seasonality [64]. The holiday component, h(t), incorporates the effects of holidays or events by creating binary indicators [64]. The Prophet model is suitable for handling large linear and nonlinear datasets with ease of use [41, 50, 65]. The General Equation for the Prophet model is defined in Eq. (6).

$$y\left(t\right)=g\left(t\right)+s\left(t\right)+h\left(t\right)+{\varepsilon }_{t}$$
(6)

where: \(y\left(t\right)\) is the observed time-series data at time t, \(g\left(t\right)\) represents the trend component capturing the underlying growth or decline in the observed time-series,\(s\left(t\right)\) represents the component of seasonality accounting for periodic patterns in the data, \(h\left(t\right)\) represents the component of holidays (if any) considering the effects of specific holidays (or missing data) or events (which is not applicable in the context of the data used), and \(\epsilon \left(t\right)\) is the error term.

2.5 The hybrid SARIMA-prophet model

It has been reported in the literature that while the SARIMA model excels in capturing linear patterns and seasonality, the Prophet model excels in handling seasonality and change points. We combined the strengths of both the SARIMA and Prophet models to construct a hybrid SARIMA-Prophet model, as illustrated in Eq. (7).

$${\widehat{y}}_{t}=\omega \widehat{y}{SARIMA}_{t}+(1-\omega )\widehat{y}{Prophet}_{t}$$
(7)

where:\(\widehat{y}{SARIMA}_{t}\) and \(\widehat{y}{Prophet}_{t}\) are the forecasts from the SARIMA and the Prophet models at time t, respectively. \(\omega\) is a weight parameter that optimizes the predictive contributions of both the SARIMA and Prophet models.

2.6 Model performance evaluation

Three statistical indices were used in the evaluation of the chosen models. The root mean squared error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R2) were used to evaluate the performance of the models.

MAE indicates the average magnitude of the errors, expressed in the same units as the observed data (m3/s), while RMSE measures the square root of the average squared differences between the predicted and observed values (also expressed in the same units as the observed data), also in m3/s. R2, which is dimensionless, represents the proportion of the variance in the observed data that is predictable from the independent variables. It ranges from 0 to 1, with higher values indicating better model performance.

The MAE is computed in Eq. (8) as follows:

$$MAE=\frac{1}{N}\sum |\left({S}_{o}-{S}_{f}\right)$$
(8)

The RMSE is represented as Eq. (9):

$$RMSE=\sqrt{\frac{1}{N}\sum {\left({S}_{0}-{S}_{f}\right)}^{2}}$$
(9)

R2 is computed from Eq. (10) as follows:

$${R}^{2}=\left[\frac{\sum \left({S}_{0}-\overline{{S}_{0}}\right)\left({S}_{f}-\overline{{S}_{f}}\right)}{\sqrt{\sum \left({S}_{0}-\overline{{S}_{0}}\right)\sum \left({\left({S}_{f}-\overline{{S}_{f}}\right)}^{2}\right)}}\right]$$
(10)

where: \({S}_{0}\) is the observed values, \({\overline{s} }_{o}\) is the mean of observed values, \({S}_{f}\) is the forecasted values, \({\overline{s} }_{f}\) is mean of the forecasted values, and N is the total number of observations.

2.7 Software tools

In this study, we employed a variety of software tools for time series analysis and modeling. These tools were selected for their robustness, versatility and ease of integration with different data processing workflows. Owing to its extensive libraries for scientific computing and data analysis, we used Python as the primary programming language for implementing our time series analysis and modeling workflow. Its rich ecosystem of libraries allowed us to efficiently preprocess data, build models, and visualize the results. The following libraries and tools are used in this study.

  1. 1.

    Pandas was used for data cleaning, transformation, generating descriptive statistics and data manipulation.

  2. 2.

    NumPy was employed for efficient numerical computations, including array operations and mathematical functions necessary for data preprocessing and feature engineering.

  3. 3.

    Statsmodels as a Python module was used for conducting statistical tests, implementing and fitting the SARIMA components and identification of optimal parameters and the validation of model assumptions.

  4. 4.

    Facebook Prophet as an open-source forecasting tool was used to model and forecast the trend and seasonality components of the streamflow data.

  5. 5.

    Matplotlib and Seaborn were used for visualizing the time series data, model diagnostics, and forecasting results.

  6. 6.

    Jupyter Notebooks served as the interactive environment for developing and documenting our analysis. It facilitated exploratory data analysis (EDA), model development, and the iterative refining of the models.

The combination of these software tools unlocked a comprehensive and efficient approach to time series analysis and modeling in our study.

3 Results

3.1 Statistical characteristics of the data

Table 2 provides a summary of the basic descriptive statistics for the monthly streamflow volume measurements in cubic meters per second (m3/s). These statistics include the mean, median, standard deviation, skewness, kurtosis of the data, central tendency of the data, variability, and distribution shape of the streamflow volumes.

Table 2 Statistical characteristics of the data

The mean streamflow volume of 36.8 m3/s represents the average flow conditions in the Sobat River over the observed period, with half of the values falling below the median of 39.4 m3/s, suggesting a relatively balanced distribution with a slight skew toward higher values. The significant variability in the streamflow data is highlighted by a standard deviation of 25.2 m3/s and a variance of 623.3 m3/s2, indicating considerable monthly fluctuations due to seasonal changes, precipitation patterns, and other hydrological factors. The 25th and 75th percentile values of 11.0 m3/s and 58.2 m3/s, respectively, show that 50% of observed streamflow volumes range between these levels. The wide range of flow conditions, with a minimum of 1.6 m3/s and a maximum of 107.0 m3/s, underscores the river's variability and potential extreme events. A skewness of 0.2 suggested a near-symmetrical distribution with a slight right-skew, while the kurtosis of -1.1 indicated a flatter distribution than normal, with fewer extreme values.

3.2 The SARIMA model selection

3.2.1 Stationary tests

Following the analysis of the statistical characteristics of the data, we conducted the ADF [61] and KPSS [62] tests to assess the stationarity of the time-series data. The results of these tests are summarized in Table 3.

Table 3 Results of the ADF and KPSS tests

The ADF test yielded a test statistic of -4.138 and a p-value of 0.000836, which is below a significance level of \(\alpha\) = 0.05. Given the critical values at the 1%, 5%, and 10% levels (-3.445, -2.868, and -2.57, respectively), we rejected the null hypothesis, concluding that the time-series data is stationary. The KPSS test produced a test statistic of 0.323 and a p-value of 0.1, which is above the 0.05 threshold. With critical values of 1%, 5%, and 10% (0.739, 0.463, and 0.347, respectively), we failed to reject the null hypothesis.

3.2.2 ACF and PACF analysis

To further confirm the stationarity of the observations and simplify SARIMA model selection, we analyzed the ACF and PACF plots for both the original (Fig. 5) and seasonally differenced data (Fig. 6).

Fig. 5
figure 5

ACF and PACF plots of observed streamflow

Fig. 6
figure 6

ACF and PACF of seasonally differenced observed streamflow

The ACF plot of the original data (Fig. 5 left) shows the correlation of the time-series with its own lagged values. Significant correlations are observed at increments of 12 at lags 12, 24, and 36, suggesting yearly seasonality (S = 12). The correlations decrease over time, which is typical for stationary data. The PACF plot (Fig. 5) shows significant spikes at the initial lags, indicating the presence of autoregressive components.

To account for seasonal variations, seasonal differencing (D = 1) was applied, given the visual inspection of the ACF and PACF of the observed streamflow. The ACF plot of the seasonally differenced observed streamflow (Fig. 6) shows significant spikes at the initial lags, with the first lag indicating a strong positive correlation. This observation suggests that the seasonal pattern had been effectively differenced. The rapid decline and oscillation around zero within the confidence interval indicate that any remaining seasonality and trend have been adequately addressed by seasonal differencing.

The seasonally differenced (Fig. 6, right) PACF plot of the observed streamflow shows a quick decline and lack of significant spikes beyond the first lag, suggesting that higher-order nonseasonal AR terms are unnecessary. Additionally, the diminished spikes at seasonal lags (approximately lags 12, 24, and 36) confirm that seasonal differencing (D = 1) effectively removed the seasonal structure.

3.2.3 AutoARIMA model selection

To ensure the most accurate model selection for the Sobat River's monthly observed streamflow, we utilized auto-ARIMA, an automated approach that evaluates a range of potential models based on statistical criteria. This method minimizes human bias and errors, providing a more objective assessment of model suitability. AutoARIMA evaluates multiple models within the specified constraints and selects the best model based on the lowest AIC value.

Based on the insights from our preliminary analysis of the ACF and PACF plots, we configured the AutoARIMA function with d = 0, D = 1, and S = 12 to reflect findings from the ACF and PACF analysis.

From the comparisons based on the lowest AIC values, the best model for prediction values was found to be the ARIMA(1,0,1)(2,1,0)12 model, highlighted in gray in Table 4.

Table 4 AutoArima model selection

3.2.4 ARIMA(1,0,1)(2,1,0)(12) model diagnostics

The diagnostic plots of the SARIMA model (ARIMA(1,0,1)(2,1,0) [12]) for the Sobat River's observed monthly streamflow data provide insights into the model's adequacy and its potential use for prediction.

The standardized residual plot (Fig. 7) in the top left shows that the residuals fluctuate around zero without any clear patterns. This indicates that the model has effectively captured the structure of the observed streamflow, leaving behind white noise. The absence of significant patterns suggests that the model is well-specified. The histogram of the residuals (Fig. 7, top right), along with the estimated density plot, resembles a normal distribution with the overlaid normal distribution (N(0,1)) confirming this visual assessment. The Q-Q plot (Fig. 7, bottom left) shows that the residuals follow the theoretical quantiles of a normal distribution closely, with minor deviations at the tails. This further supports the normality assumption, indicating that the residuals are approximately normally distributed. The correlogram of the residuals (Fig. 7, bottom left) shows that the autocorrelations fall within the confidence intervals for all lags. This indicates no significant autocorrelation in the residuals, implying that the model has adequately captured the underlying time-series structure and that the residuals are essentially white noise.

Fig. 7
figure 7

SARIMA Model (ARIMA(1,0,1)(2,1,0) [12]) diagnostics summary

3.3 Prophet model implementation

Unlike the SARIMA model, Prophet does not require stationarity checks on the data. However, the tabular data columns must follow a naming convention with the columns strictly named “ds” and “y” for the independent and dependent variables respectively, before being fit to the model. The independent variable is time (in months), while the dependent variable is streamflow (in m3/s).

3.3.1 Prophet components analysis

The trend component (Fig. 8) of the Sobat River's observed monthly streamflow data from 1938 to 1944 reveals a clear linear decline over the observed period. This indicates a consistent decrease in the observed streamflow over time. The yearly seasonality component (Fig. 8) captures the repeating patterns within each year, displaying distinct seasonal fluctuations. The highest observed streamflow was observed between August and November, while the lowest observed streamflow occurred around March and May. Between these peak and low periods, the streamflow exhibited intermediate fluctuations, reflecting the complex interaction of seasonal factors affecting the river’s flow.

Fig. 8
figure 8

Prophet model’s trend and yearly components

3.4 Prediction accuracy

The SARIMA model's predictions (Fig. 9a) are closely aligned with the observed streamflow, indicating its robustness in capturing the general trends and seasonal patterns present in the data. However, deviations from the observed streamflow become more pronounced during periods of high streamflow. These periods of high flow are challenging for the SARIMA model to predict accurately, leading to significant overestimations and isolated underestimations during these times.

Fig. 9
figure 9

Model predictions for SARIMA (a) Prophet (b) and Hybrid SARIMA-Prophet (c) against observed streamflow

The Prophet model (Fig. 9b), on the other hand, demonstrates a strong ability to follow the patterbs of the observed streamflow, accurately capturing both the long-term trends and seasonal variations. Despite this, the Prophet model shows some overestimation and a slight underestimation during abrupt changes in the streamflow. These sudden changes, which might be due to unexpected hydrological events, are less effectively captured by the Prophet model, leading to slight discrepancies in the predictions.

In contrast, the Hybrid SARIMA-Prophet model's predictions (Fig. 9c) align most closely with the observed streamflow. This hybrid approach effectively combines the strengths of both the SARIMA and Prophet models, resulting in predictions that display minimal deviations from the observed streamflow. By eliminating underestimations that affected both the SARIMA and Prophet models, the Hybrid model exceled in capturing both the seasonal patterns and short-term fluctuations in the observed streamflow, providing a more accurate and reliable forecast. This was particularly evident during periods of abrupt changes and high streamflow, where the hybrid model's predictions remained more consistent with the observed data, thereby demonstrating its enhanced predictive performance.

3.4.1 Prediction accuracy

Figure 10 shows a scatter plot comparing the observed streamflow against the predicted streamflow for the SARIMA, Prophet, and Hybrid SARIMA-Prophet models, providing a visual assessment of each model’s accuracy. The red dashed line represents perfect prediction, where the predicted values are equal to the osbserved values, with the expectation that the models will obtain every prediction right.

Fig. 10
figure 10

Combined model prediction against observed streamflow

The predictions of the SARIMA model, represented by the blue dots, show a reasonable alignment with the observed values, with a number of predictions clustering around the perfect line of prediction (red line). Despite the reasonable alignment along the line of perfect prediction, several instances exist where the model significantly overestimates and periodically underestimates the obsered streamflow, especially at higher observed values (and a few more instances at lower values). The results indicate that the SARIMA model struggeled to cope with capturing extreme values accurately.

The Prophet model predictions (orange dots), on the other hand, aslo exhibit a general alignment with the observed values, but also exhibit noticeable deviations from the observed streamflow. Like the SARIMA model, the Prophet model occasionally overestimates and occassionally underestimates streamflow during extreme events. Both models indicated close competition in accuracy.

The hybrid SARIMA-Prophet model predictions which are represented by green dots in Fig. 10, show tighter clustering around the perfect line of prediction than does the individual models. While deviations from perfect prediction exist, the hybrid approach reduced the magnitude of errors observed in individual models, indicating that combining the strengths of both models helped in achieving more accurate predictions.

3.4.2 Residual analysis

The residual plots of the SARIMA, Prophet, and Hybrid SARIMA-Prophet models (Fig. 11) provide a comparative view of the errors in the predictions made by each model against the observed values of the Sobat River streamflow from 1938 to 1944.

Fig. 11
figure 11

Model residuals plot

The residuals of the SARIMA model (blue dots) exhibit noticeable peaks, particularly during approximately 1938 and 1941, indicating instances where the model significantly overestimated or occasionally underestimated the streamflow. The residuals fluctuate around zero (dotted red line, y = 0) but show clusters of positive residuals in some periods and negative residuals in others, suggesting that the SARIMA model captures the overall trend but struggles with certain seasonal variations or anomalies.

The residuals of the Prophet model (orange dots) also show significant peaks similar to those of the SARIMA model, indicating some common challenges in capturing the variability of the data. While the residuals are generally centered around zero, there are periods where the model consistently overestimates) or underestimates the streamflow.

The residuals of the hybrid SARIMA-Prophet model (green dots), which combines predictive capabilities of both the SARIMA and Prophet models, demonstrate less variability than the individual models. This suggests the mitigation of some of the extreme errors present in the individual models. Despite the reduced variability, the hybrid SARIMA-Prophet model residuals still showed clusters around certain periods, indicating that while it smoothed out some errors, it did not eliminate the errors completely.

3.5 Model validation

We compared the performance of the models based on the three-performance metrics as shown in Table 5.

Table 5 Comparative performance of the models

The SARIMA model shows good performance, with a MAE of 5.39 m3/s and an R2 of 0.85, indicating that it can explain 85% of the variance in the observed streamflow. However, the relatively high RMSE of 8.70 m3/s suggested that the model had difficulty capturing extreme values, resulting in large prediction errors. The Prophet model slightly outperformed the SARIMA model with a marginally lower MAE (5.35 m3/s) and RMSE (8.32 m3/s), and a slight improvement in capturing the variability of the streamflow (R2 of 0.86). The hybrid SARIMA-Prophet model outperformed the individual SARIMA and Prophet models. With the lowest MAE (4.047 m3/s) and RMSE (6.17 m3/s) and the highest R2 (0.92), the hybrid model significantly reduced prediction errors and explains 92% of the variance in the observed data.

4 Discussion

Stationarity tests using the ADF and KPSS tests on observed streamflow are necessary for fine-tuning the SARIMA model (Table 3). An ADF p-value of 0.000836, which is significantly below the significance level of α = 0.05, and a test statistic of -4.138, which is less than critical values at the 1%, 5%, and 10% levels (-3.445, -2.868, and -2.57, respectively), imply a rejection of the null hypothesis (existence of a unit root). There is significant evidence to conclude that the observed streamflow is stationary.

Complementing the ADF test, the KPSS test produced a test statistic of 0.323 less than the critical values at the 1%, 5%, and 10% levels (0.739, 0.463, and 0.347, respectively) and a p-value of 0.1 greater than the significance level of the threshold of α = 0.05. The results of both the test statistic and the p-value fail to support the rejection of the null hypothesis that the observed streamflow is stationary (i.e., has no unit root). This implies stationarity around a deterministic trend (trend stationarity). This confirmation of trend stationarity correlates with the occurrence of a consistent trend in the observed monthly streamflow of the Sobat River over time, without the need for additional differencing to remove trends. While observed streamflow at Malakal Station on the White Nile River was reported to be non-stationary [42], our confirmation of stationarity of observed streamflow of the Sobat River which is a major tributary of the White Nile, contributes to the literature documenting the complex and dynamic nature of the Nile River systems.

Following the confirmation of stationarity and trend stationarity, a nonseasonal differencing (d) was not necessary for the transformation of the observed streamflow, hence simplifying the modeling process. However, the ACF and PACF plots of the observed streamflow necessitated a seasonal differencing (D). The significant correlations at seasonal lags observed in the nondifferenced ACF and PACF plots (Fig. 5) exhibited seasonal patterns. Therefore, while nonseasonal differencing is unnecessary, seasonal differencing was required to account for periodic fluctuations within the observed streamflow for the SARIMA model.

A more consistent decrease in observed streamflow between 1938 and 1944 in the prophet trend component (Fig. 8) suggests a long-term reduction in the overall volume of observed streamflow. Several factors could have contributed to this downward trend, including climate change, increased upstream water usage, and other environmental influences, as stated in other studies [66, 67]. Further investigations and research will be beneficial for ensuring the sustainability of water resources on the Sobat River, since the behavior of observed streamflow during consistent regimes and irregular changes impacts predictive capabilities of models.

The residuals of the SARIMA and Prophet models indicate that both models lag in handling extreme values, with significant overestimations and slight underestimations of the observed streamflow, as shown in (Fig. 11). Prophet only slightly edged SARIMA in explaining the residual variability (86%, compared to 85% for SARIMA). A study by Elseidi [50] also noted that while Prophet captured long-term trends, it failed to address short-term fluctuations effectively.

By combining the predictive capabilities of the SARIMA and Prophet models, the hybrid model significantly reduces residual variability, approximating a normal distribution and exhibiting no significant autocorrelation. Previous studies also support the effectiveness of hybrid models. For instance, Mehr et. al. [49] reported that a hybrid genetic programming and SARIMA (GP-SARIMA) model eliminated approximately one-fourth of the RMSEs of the standalone models (2.408 m3/s and 2.288 m3/s for GP and SARIMA respectively, for monthly forecasting). Similarly, Elseidi’s study [50] concluded that the ARIMA-Prophet hybrid model outperformed the single ARIMA model in terms of accuracy and robustness with an RMSE of 10.92 m3/s compared to the 10.99 m3/s of the ARIMA model.

Our hybrid model’s improved performance was indicated through the comparative metrics of MAE (4.047 m3/s), RMSE (6.17 m3/s), and R2 (0.92). These metrics are significantly better than those of the SARIMA (MAE: 5.39 m3/s, RMSE: 8.70 m3/s, R2: 0.85) and Prophet (MAE: 5.35 m3/s, RMSE: 8.32 m3/s, R2: 0.86) models. The improved MAE, RMSE, and R2 values indicate lower residual variability and reduced extreme errors, allowing the hybrid SARIMA-Prophet model to handle both long-term patterns and short-term streamflow fluctuations more effectively. This finding validates our approach of integrating the SARIMA model, a traditional time-series model, with machine learning approaches such as Prophet for enhanced predictive capabilities.

While existing studies provide valuable insights, they have focused on specific hydrological contexts or utilized multivariate variables, or even different environmental variables such as temperature or precipitation. For instance, Ahmadpour et. al. [68] primarily discussed the effectiveness of a hydrologic modeling system (HEC-HMS) and SARIMA in streamflow forecasting without considering a hybrid approach that could enhance accuracy. Although Elseidi’s [50] hybrid ARIMA-Prophet results emphasized temperature forecasting, the direct applicability of these findings is limited in the context of hydrological streamflow. The results and findings of our study will be vital not only for the Sobat River but also in recommending hybrid models for use in predicting streamflow.

While our findings will be useful in different forecasting contexts, the study has been limited to data specificity, particularly specific to the Sobat River and may not be directly applicable to other rivers whose hydrological characteristics are different, without proper validation. Moreover, this study focused primarily on historical univariate streamflow observations without integrating other potential influencing factors such as precipitation, temperature, or land use data, which could provide additional insights and a more comprehensive understanding of the factors influencing streamflow. The application of our hybrid model to various hydrological datasets and different climatic regions would help validate its generalizability and effectiveness under diverse environmental conditions. The implementation of a hybrid model framework for real-time forecasting could also significantly benefit water resource management and planning, providing timely and accurate predictions for decision-making processes.

We recommend that future research explore the application of hybrid models incorporating both traditional and machine learning approaches to multivariate time-series forecasting by incorporating additional environmental variables such as precipitation, temperature, and land use changes to increase model accuracy and reliability.

5 Conclusion

The present study investigated streamflow forecasting in the Sobat River through a combination of the SARIMA and Prophet models, and developed a hybrid SARIMA-Prophet model. By leveraging the complementary strengths of the individual SARIMA and Prophet models, the study aimed to improve predictive accuracy and robustness of the models by integrating them into a hybrid SARIMA-Prophet model. With a comprehensive residual analysis and evaluation with key performance metrics, our study demonstrated two key findings.

  1. 1.

    While Prophet slightly edged over SARIMA in terms of accuracy and reducing large errors, both models effectively captured overall trends and seasonal patterns in the streamflow data. However, both models displayed limitations in handling extreme values and short-term fluctuations.

  2. 2.

    By integrating the SARIMA and Prophet models, the hybrid approach minimized residual variability and reduced extreme errors. The distribution of the residuals from the hybrid model was closer to a normal distribution, showing no significant autocorrelation, thereby suggesting that the hybrid model managed long-term patterns and short-term fluctuations more effectively.

By combining the linear and seasonal pattern-capturing abilities of SARIMA with the flexibility and robustness of Prophet in handling nonlinear trends and complex seasonality, the hybrid SARIMA-Prophet approach contributes to the literature where hybrid models are increasingly being designed to enhance prediction accuracy and reliability, particularly for streamflow forecasting. Due to their ease of use and less computational demands and robustness, we recommend the adoption of hybrid models that combine traditional time-series models with more flexible machine learning approaches such as Prophet and other complex models, to enhance predictive accuracy and robustness.