Introduction

Reliable estimation of river flow rate (discharge) is a prerequisite for, and a crucial component of, hydrological applications and analyses. Because of the dynamic nature of hydrological systems, direct measurements of discharge are typically time consuming and costly, and are sometimes impossible, especially during floods. Therefore, most discharge records are derived by converting measured water levels (stages) to discharges through a functional relationship expressed as a rating curve. A calibrated stage–discharge rating offers an easy, cheap, and fast technique to estimate discharge (World Meteorological Organization 1980; Kennedy 1984; Herschy 1999). The stage–discharge rating is generally treated as the following power curve (Herschy 1999):

$$Q = b\;\;\left( {a + H} \right)^{\alpha }$$
(1)

where Q is the discharge; H is the stage; α is an index exponent; a and b are constants (depending on the study area).
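As a minimal sketch of Eq. (1), the following Python function evaluates the power-law rating for a given stage. The function name and the parameter values in the usage line are illustrative assumptions, not values calibrated for any river.

```python
def rating_curve(stage, a, b, alpha):
    """Discharge from stage via Q = b * (a + H) ** alpha (Eq. 1)."""
    depth = a + stage
    if depth < 0:
        # A negative base with a fractional exponent has no real-valued result.
        raise ValueError("a + H must be non-negative")
    return b * depth ** alpha


# Hypothetical parameters, chosen only for illustration.
q = rating_curve(stage=2.0, a=0.5, b=10.0, alpha=2.0)
```

In practice a, b, and α would be calibrated against paired stage and discharge measurements at the gauging station.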

Unfortunately, the functional relationship between stage and discharge is complex and time-varying, and cannot always be captured by a simple rating curve, even with the help of traditional modeling techniques such as polynomial regression or the autoregressive integrated moving average (ARIMA) technique (Bhattacharya and Solomatine 2000). Many studies have attempted to establish this relation via data-driven techniques such as artificial neural networks (ANNs) (Tawfik et al. 1997; Bhattacharya and Solomatine 2000; Sudheer and Jain 2003; Bisht et al. 2010), decision trees (Bhattacharya and Solomatine 2003; Ghimire and Reddy 2010; Ajmera and Goyal 2012), support vector machines (Aggarwal et al. 2012), wavelet-regression models (Kişi 2011), the Takagi–Sugeno fuzzy inference system (Lohani et al. 2006), and evolutionary-based data-driven models (Ghimire and Reddy 2010; Azamathulla et al. 2011). The results of these studies indicate that such techniques are efficient and reliable.

The aim of this study is to investigate the potential of different data-driven models (artificial neural networks, a fuzzy inference system, and M5 decision trees) to emulate the stage–discharge rating curve of the Gharraf River at Hay, southern Iraq. Daily records of stage and discharge are available for this river at Hay station for the period from April 2005 to May 2006. The performance of these techniques was compared, and the one with the smallest estimation error was selected for future estimation of discharge from available data of previous discharge and stage values.

Modeling techniques

Artificial neural networks

Artificial neural networks (ANNs) are massively parallel systems composed of many processing elements connected by links of variable weights. Given sufficient data and complexity, ANNs can be trained to model any relationship between a series of independent and dependent variables. For this reason, ANNs are considered to be universal approximators and have been successfully applied to a wide variety of problems that are difficult to understand, define, and quantify. There are many different types of ANNs based on topology. Of the many ANN paradigms, the multilayer perceptron (MLP) network is by far the most popular (Lippmann 1987). The MLP is a layered feedforward network that is typically trained with the static backpropagation (BP) algorithm. An MLP is capable of approximating any measurable function from one finite-dimensional space to another within a desired degree of accuracy (Hornik and White 1989). The MLP network consists of layers of parallel processing nodes. Each layer is fully connected to the preceding layer by interconnection strengths, or weights, w. Figure 1 presents a three-layer MLP neural network consisting of layers i, j, and k, with interconnection weights w ij and w jk between layers of neurons. Each neuron in a layer receives and processes weighted input from the previous layer and transmits its output to nodes in the following layer through links. The connection between the ith and jth neurons is characterized by the weight coefficient w ij, and the ith neuron by the threshold coefficient ϑ i. The weight coefficient reflects the degree of importance of the given connection in the network. The output value of the ith neuron, x i, is computed as follows (Haykin 1994):

Fig. 1 Architecture of multilayer perceptron with one hidden layer

$$x_{i} = f\left( {\xi_{i} } \right)$$
(2)

with

$$\xi_{i} \;\; = \;\;\vartheta_{i} \;\; + \;\sum\limits_{{j \in \varGamma_{i}^{ - 1} }} {w_{ij} \;x_{j} }$$
(3)

where f(ξ i ) is the activation function. The threshold coefficient can be understood as the weight coefficient of a connection with a formally added neuron j, where x j  = 1. Sigmoid-shaped activation functions are normally defined as:

$$f\left( {\xi_{i} } \right) = \frac{1}{{1 + e^{ - \xi_{i} } }}$$
(4)
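Equations (2) to (4) can be sketched for a single neuron as follows. This is an illustrative Python sketch, not the implementation used by the software in this study; the function names and the input values are assumptions.

```python
import math


def sigmoid(xi):
    """Logistic activation of Eq. (4), written to avoid overflow for large |xi|."""
    if xi >= 0:
        return 1.0 / (1.0 + math.exp(-xi))
    e = math.exp(xi)
    return e / (1.0 + e)


def neuron_output(inputs, weights, threshold):
    """Eq. (3): weighted sum plus threshold, then Eq. (2): activation."""
    xi = threshold + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(xi)


# Hypothetical inputs and weights; the weighted sum is zero here.
out = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], threshold=0.0)
```

Because the weighted sum in the usage line is zero, the neuron sits at the midpoint of the sigmoid.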

The backpropagation algorithm works by computing the error between the network output and the corresponding target value and propagating this backward through the network to update the weights. The weight updates are calculated based on:

$$\Delta w_{ij} \left( t \right) = - \eta \frac{\partial E}{{\partial w_{ij} }} + \mu \Delta w_{ij} \left( {t - 1} \right)$$
(5)

where η and μ are the learning and momentum rates, respectively, E is the error (objective) function, and Δw ij (t) and Δw ij (t−1) are the weight increments between nodes i and j for iterations t and t−1. A detailed description of this algorithm can be found in Fausett (1994) and Haykin (1994).
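The update rule of Eq. (5) can be illustrated on a single weight. The sketch below is a toy example under stated assumptions: the quadratic error E(w) = (w − 3)² and the function name are hypothetical, and the rates of 0.1 are simply the default values reported later in this paper.

```python
def momentum_update(grad, prev_delta, eta=0.1, mu=0.1):
    """Eq. (5): new weight increment from the gradient and the previous increment."""
    return -eta * grad + mu * prev_delta


# Toy objective E(w) = (w - 3)**2, so dE/dw = 2 * (w - 3).
w, delta = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    delta = momentum_update(grad, delta)
    w += delta
```

The momentum term reuses part of the previous step, which damps oscillations when successive gradients change sign.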

M5 decision tree

A decision tree is a logical model represented as a binary (two-way split) tree that shows how the values of a target (dependent) variable can be predicted using the values of a set of predictor (independent) variables. There are basically two types of decision trees: (1) classification trees, which are the most commonly used and predict a symbolic attribute (class), and (2) regression trees, which predict the value of a numeric attribute (Witten and Frank 2005). If each leaf in the tree contains a linear regression model, which is used to predict the target variable at that leaf, then it is called a model tree.

The M5 model tree algorithm was originally developed by Quinlan (1992). A detailed description of this technique is beyond the scope of this paper and can be found in Witten and Frank (2005); a short description follows. The M5 algorithm constructs a regression tree by recursively splitting the instance space using tests on a single attribute that maximally reduce the variance in the target variable. Figure 2 illustrates this concept. The formula to compute the standard deviation reduction (SDR) is (Quinlan 1992):

Fig. 2 Example of an M5 model tree. Models 1–6 are linear regression models [modified after Solomatine and Xue (2004)]

$${\text{SDR}} = {\text{sd}}\left( T \right) - \sum {\frac{{\left| {T_{i} } \right|}}{\left| T \right|}\;{\text{sd}}\left( {T_{i} } \right)}$$
(6)

where T represents the set of examples that reach the node; T i represents the subset of examples that have the ith outcome of the potential test; and sd represents the standard deviation.
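The SDR criterion of Eq. (6) can be sketched for a single numeric attribute as follows. This is a simplified illustration, not the Weka implementation: the function names are hypothetical, the population standard deviation is assumed for sd, and candidate thresholds are taken midway between consecutive distinct attribute values.

```python
import statistics


def sdr(parent, subsets):
    """Eq. (6): sd(T) minus the size-weighted sd of the subsets T_i."""
    n = len(parent)
    weighted = sum(len(s) / n * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(parent) - weighted


def best_split(x, y):
    """Threshold on one attribute x that maximizes the SDR of target y."""
    pairs = sorted(zip(x, y))
    best_threshold, best_gain = None, float("-inf")
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates equal attribute values
        left = [target for _, target in pairs[:k]]
        right = [target for _, target in pairs[k:]]
        gain = sdr(y, [left, right])
        if gain > best_gain:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2.0
            best_gain = gain
    return best_threshold, best_gain


# Hypothetical data: the target jumps between x = 2 and x = 3.
threshold, gain = best_split([1, 2, 3, 4], [10, 10, 20, 20])
```

In this toy data set the split at 2.5 removes all variation within each subset, so the reduction equals the standard deviation of the parent set.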

After the tree has been grown, a multiple linear regression model is built for every inner node using the data associated with that node and all the attributes that participate in tests in the subtree rooted at that node. Every subtree is then considered for pruning to overcome the overfitting problem. Pruning occurs when the estimated error for the linear model at the root of a subtree is smaller than or equal to the expected error for the subtree. Finally, a smoothing process is employed to compensate for the sharp discontinuities between adjacent linear models at the leaves of the pruned tree.

Fuzzy logic

The term “fuzzy logic” has in fact two distinct meanings. In a narrow sense, it is viewed as a generalization of classical multi-valued logics (Demicco and Klir 2001). In a broad sense, fuzzy logic is viewed as a system of concepts, principles, and methods for dealing with modes of reasoning that are approximate rather than exact. The fuzzy logic system is a cognitive artificial intelligence technique developed in 1965 by Professor Lotfi Zadeh of the Department of Computer Science, University of California, Berkeley, USA. It provides a means of representing the uncertainties and vagueness that characterize human perception, judgmental reasoning, and decision making (Emami et al. 2000). The generation of a fuzzy model is based on expert knowledge and historical data. Fuzzy inference is the process of formulating the mapping from a given input to an output using fuzzy logic; the mapping then provides a basis from which decisions can be made or patterns discerned. The fuzzy inference system (FIS) consists of four main interconnected components (Fig. 3): rules, fuzzifier, inference engine, and output processor. Once the rules have been established, a fuzzy logic system can be viewed as a map from inputs to outputs. The rules are the heart of a FIS and can be expressed as a collection of IF–THEN statements. The IF part of a rule is the antecedent and the THEN part is the consequent. Depending on the particular structure of the consequent proposition, three main types of fuzzy models are distinguished: (1) the linguistic (Mamdani-type) fuzzy model (Zadeh 1973; Mamdani 1977), (2) the fuzzy relational model (Pedrycz 1984; Yi and Chung 1993), and (3) the Takagi–Sugeno (TS) fuzzy model (Takagi and Sugeno 1985). In this paper, the TS fuzzy model is employed to emulate the stage–discharge rating curve, so a brief description of this method is outlined below.

Fig. 3 Flow chart of fuzzy inference system model

In the TS fuzzy inference system, the rule consequents are usually taken to be either crisp numbers or linear functions of the inputs (Lohani et al. 2006):

$$R_{i} :\;{\text{IF}}\;x\;{\text{is}}\;A_{i} \;{\text{THEN}}\;y_{i} = a_{i}^{T} x + b_{i} ,\quad i = 1,\;2,\; \ldots ,\;M$$
(7)

where x ∊ ℜⁿ is the antecedent and y i  ∊ ℜ is the consequent of the ith rule. In the consequent, a i is the parameter vector and b i is the scalar offset. The number of rules is denoted by M, and A i is the (multivariate) antecedent fuzzy set of the ith rule defined by the membership function:

$$\mu_{i} \left( x \right):\Re^{n} \to \left[ {0,\;\;1} \right]$$
(8)

The fuzzy antecedent in the TS model is defined as an and-conjunction by means of the product operator (Wolfs and Willems 2013):

$$\mu_{i} \left( x \right)\;\; = \;\;\prod\nolimits_{j\;\; = \;\;1}^{p} {\mu_{ij} \left( {x_{j} } \right)}$$
(9)

where x j is the jth input variable in the p-dimensional input data space, and μ ij is the membership degree of x j to the fuzzy set describing the jth premise part of the ith rule. μ i (x) is the overall truth value of the ith rule. The model output is then computed as the weighted sum of the rule consequents:

$$y\;\; = \;\;\sum\limits_{i\; = \;1}^{M} {u_{i} \left( x \right)\; \times \;y_{i} }$$
(10)

where u i is the normalized degree of fulfillment of the antecedent clause of rule R i (Setnes 2000):

$$u_{i} = \frac{{\mu_{i} \left( x \right)}}{{\sum\limits_{f = 1}^{M} {\mu_{f} \left( x \right)} }}$$
(11)

The y i s are called consequent functions of the M rules and defined by:

$$y_{i} \;\; = \;\;W_{i0} \;\; + \;\;W_{i1} x_{1} \;\; + \;\;W_{i2} x_{2} \;\; + \cdots \; + \;\;W_{ip} x_{p}$$
(12)

where W ij are the linear weights for the ith rule consequent function.
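Equations (9) to (12) can be combined into a short inference routine. The following Python sketch is illustrative only: it assumes Gaussian membership functions, and the rule dictionaries, function names, and parameter values are hypothetical rather than taken from the fitted models of this study.

```python
import math


def gaussian(x, center, sigma):
    """Gaussian membership degree of x (assumed membership function shape)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def ts_predict(x, rules):
    """Takagi-Sugeno output for input vector x.

    Each rule holds 'centers' and 'sigmas' (one per input) and
    'weights' = [W_i0, W_i1, ..., W_ip] for the linear consequent.
    """
    firing, outputs = [], []
    for rule in rules:
        mu = 1.0
        for xj, c, s in zip(x, rule["centers"], rule["sigmas"]):
            mu *= gaussian(xj, c, s)  # Eq. (9): product and-conjunction
        firing.append(mu)
        w = rule["weights"]
        outputs.append(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))  # Eq. (12)
    total = sum(firing)
    if total == 0.0:
        raise ValueError("no rule fires for this input")
    # Eqs. (10) and (11): normalized, weighted sum of rule consequents.
    return sum(mu / total * y for mu, y in zip(firing, outputs))


# Two hypothetical rules placed symmetrically around the input x = 0.
rules = [
    {"centers": [-1.0], "sigmas": [1.0], "weights": [1.0, 0.0]},
    {"centers": [1.0], "sigmas": [1.0], "weights": [3.0, 0.0]},
]
y = ts_predict([0.0], rules)
```

Both rules fire equally at x = 0, so the output is the average of the two constant consequents.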

The study area and data description

The Gharraf River system is located in the Mesopotamian plain, southern Iraq (Fig. 4). The drainage area of this system is 435,052 × 10⁶ m². The river begins at the Kut Barrage, runs south between the great Euphrates and Tigris Rivers, and ends in the Al-Hammer marshland in Nassyria city. The main length of the river is approximately 230 km. The Gharraf area is characterized by hot, dry summers and cold, wet winters, and its climate is classified as semi-arid. The course of the Shatt Al Gharraf can be subdivided according to the conditions that governed its development as follows (Iraqi Ministries of Environment, Water resources, Municipalities and Public works 2006): (1) the Hay Delta, which ends at Kalaat Sukkar and in which expansion can take place; (2) the Rafai gully, which extends to about 10 km upstream of Bada’a and in which flow is restricted, with no lateral expansion possible; (3) the Bada’a Delta, the most recent region of expansion, on the left bank towards the Hor Abu Ijul, Hor H’weynah and Hor Ghamukah depressions; and (4) the Shattrah and Kasser–Ibrahim Deltas, the regions of expansion at the end of the Rafai gully.

Fig. 4 Location of the Gharraf River system, southern Iraq

The daily averages of stage and discharge data for the Hay station on the Gharraf River were used in this study. The observed data cover the period from April 2005 to May 2006. In Iraq, it is difficult to obtain sufficiently long time series to build data-driven models; hence, approximately 1 year of data was used. The available data were arbitrarily divided into two sets for all models developed in this study: 66 % for training and 34 % for testing. The statistical parameters of the data are given in Table 1. In this table, N, Min., Max., \(\bar{x}\), Me, s, C v, and K s refer to the total number of data, minimum, maximum, arithmetic average, median, standard deviation, coefficient of variation, and coefficient of skewness, respectively. From Table 1, one can conclude that the variation of the discharge values is higher than that of the stage values. The maximum stage in the testing set is higher than that in the training set, which may make discharge difficult to estimate at extreme values. On the other hand, the maximum and minimum values of discharge in the testing set fall within the range of the training set, which may mitigate the previously mentioned problem of estimating extreme discharge values.

Table 1 Summary of statistical parameters of the used data

Performance criteria for the developed models

The performance of the various data-driven models was evaluated by means of error statistics such as the root mean squared error (RMSE) and the coefficient of determination (R 2). The mathematical formulation of these criteria is outlined below:

  (a) Root mean square error (RMSE)

    $${\text{RMSE}} = \sqrt {\frac{{\sum\limits_{i = 1}^{n} {\left( {Q_{i} - \hat{Q}_{i} } \right)}^{2} }}{n}}$$
    (13)

    where Q i is the measured discharge, \(\hat{Q}_{i}\) is the simulated discharge, and n is the number of observations (instants). The closer this criterion is to zero, the better the fit between observed and modeled data.

  (b) Coefficient of determination

    $$R^{2} = 1 - \frac{\text{SSE}}{{{\text{SS}}y}}$$
    (14)

    where \({\text{SSE}} = \sum\limits_{i = 1}^{n} {\left( {Q_{i} - \hat{Q}_{i} } \right)^{2} }\) and \({\text{SS}}y = \sum\limits_{i = 1}^{n} {\left( {Q_{i} - \bar{Q}} \right)^{2} }\), and \(\bar{Q}\) is the arithmetic mean of the observed Q. The better the fit, the closer R 2 is to 1.
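The two criteria of Eqs. (13) and (14) can be computed as in the following Python sketch. The function names and the sample values are illustrative assumptions; they are not data from this study.

```python
import math


def rmse(observed, simulated):
    """Eq. (13): root mean square error between observed and simulated values."""
    n = len(observed)
    sse = sum((q - q_hat) ** 2 for q, q_hat in zip(observed, simulated))
    return math.sqrt(sse / n)


def r_squared(observed, simulated):
    """Eq. (14): 1 - SSE / SSy, with SSy taken about the observed mean."""
    mean_q = sum(observed) / len(observed)
    sse = sum((q - q_hat) ** 2 for q, q_hat in zip(observed, simulated))
    ssy = sum((q - mean_q) ** 2 for q in observed)
    return 1.0 - sse / ssy


# Hypothetical observed and simulated discharges.
obs = [1.0, 2.0, 3.0]
sim = [1.0, 2.0, 4.0]
err = rmse(obs, sim)
r2 = r_squared(obs, sim)
```

With this definition, R 2 equals 1 for a perfect fit and can become negative when the model fits worse than the observed mean.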

Applications of the techniques

Artificial neural networks

In this study, a feedforward neural network (MLP) with the backpropagation algorithm was employed for developing the ANN models. The popularity of the MLP in hydrological applications (Zhang and Govindaraju 2003; Leahy et al. 2008) is the main reason for selecting this network. Although the architecture of an MLP can have many hidden layers, works by Cybenko (1989) and Coulibaly et al. (1999) have shown that a single hidden layer is sufficient for the MLP to approximate any complex non-linear function. For all the developed models, the Levenberg–Marquardt algorithm was applied to train the networks. The logistic sigmoid transfer function was used in the hidden layer and a linear one in the output layer for all the developed networks. The early stopping method was selected to overcome the overfitting problem. The demo version of the commercial software Alyuda NeuroIntelligence was used in this study to build the different neural networks. NeuroIntelligence is neural network software aimed at expert users; it is used to apply neural networks to real-world forecasting, classification, and function approximation problems, and it provides established techniques for neural network design and optimization. To ensure that each variable is treated equally in the models, all the input and output data were automatically normalized into the range [−1, 1]. The default values of the learning rate (0.1) and momentum rate (0.1) were used for building the network models. The number of nodes in the hidden layer of each developed model was determined by a trial and error procedure, considering the need to derive reasonable results.
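The normalization into [−1, 1] can be sketched as a linear min–max scaling. The exact formula applied by the software is not described here, so the linear form, the function names, and the sample values below are assumptions made for illustration.

```python
def scale_to_unit_range(values, lo=None, hi=None):
    """Linearly map values so that lo -> -1 and hi -> +1 (assumed scaling)."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def unscale(scaled, lo, hi):
    """Inverse mapping, used to turn model outputs back into physical units."""
    return [(s + 1.0) / 2.0 * (hi - lo) + lo for s in scaled]


# Hypothetical discharge values.
raw = [10.0, 20.0, 30.0]
scaled = scale_to_unit_range(raw)
restored = unscale(scaled, min(raw), max(raw))
```

Passing the training-set minimum and maximum as lo and hi when scaling the testing set keeps both sets on the same scale, although testing values outside the training range then fall outside [−1, 1].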

The study examined various combinations of the river stage H t, the lagged stages H t−1 and H t−2, and the antecedent discharges Q t−1 and Q t−2 as inputs to the ANN models to evaluate the effect of each of these variables on the output variable Q t . The input combinations evaluated in the present study are shown in Table 2. The same input combinations were also used for the M5 and TS fuzzy inference system techniques. In addition, to reduce the network error, different numbers of iterations were examined for the best network. These tests were conducted to verify whether increasing the number of iterations could reduce the error rate.
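Building the lagged input combinations from the daily series can be sketched as follows. The function name, argument names, and sample values are hypothetical; the default lags correspond to the inputs H t−1, H t , Q t−2, and Q t−1 discussed in this paper, listed in the order stage lags first, then discharge lags.

```python
def build_lagged_samples(stage, discharge, stage_lags=(0, 1), discharge_lags=(1, 2)):
    """Return (features, target) pairs with target Q_t.

    A lag k means the value at time t - k; lag 0 is the current stage H_t.
    """
    max_lag = max(list(stage_lags) + list(discharge_lags))
    samples = []
    for t in range(max_lag, len(discharge)):
        features = [stage[t - k] for k in stage_lags]
        features += [discharge[t - k] for k in discharge_lags]
        samples.append((features, discharge[t]))
    return samples


# Hypothetical four-day record.
stage = [1.0, 2.0, 3.0, 4.0]
discharge = [10.0, 20.0, 30.0, 40.0]
samples = build_lagged_samples(stage, discharge)
```

The first max_lag days are dropped because their lagged values are not available.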

Table 2 Input combinations for the developed models

M5 decision trees

For building the M5 models, Weka 3.6 software was used. Weka is open-source machine learning and data mining software written in Java (Witten and Frank 2005). The software contains a comprehensive set of pre-processing tools, learning algorithms, and evaluation methods. For this study, the parameters of the M5 algorithm were set to their default values: a pruning factor of 4.0 and the smoothing option. The software is available at http://www.cs.waikato.ac.nz/~ml.

TS fuzzy inference system

A fuzzy toolbox in MATLAB 2012a software was used for building the fuzzy models. Membership functions were extracted via the subtractive clustering method. The subtractive clustering method (Chiu 1994) is an extension of the mountain clustering method in which data points (not grid points) are considered as the potential candidates for cluster centers. It uses the positions of the data points to calculate the density function, thus reducing the number of calculations significantly. Since each data point is a candidate cluster center, a density measure at data point x i is defined as (Chiu 1994):

$$D_{i} = \sum\limits_{j = 1}^{n} {\exp \left( { - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{{\left( {r_{a} /2} \right)^{2} }}} \right)}$$
(15)

where r a is a positive constant representing a neighborhood radius. Therefore, a data point will have a high density value if it has many neighboring data points. A trial and error procedure was used to assign a suitable value of the cluster radius. After many trials, the best result was obtained with a radius of 0.2. Three Gaussian membership functions were extracted for each model and labeled low, medium, and high. The same labels were used for Q t . Default values of the TS inference system were used in this study.
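The density measure of Eq. (15) can be sketched in Python as follows. This covers only the density computation, not the full subtractive clustering procedure or the toolbox implementation; the function name and the sample points are assumptions, and the points are assumed to be already normalized so that a radius of 0.2 is meaningful.

```python
import math


def density(points, r_a):
    """Eq. (15): density measure D_i at every data point."""
    denom = (r_a / 2.0) ** 2
    result = []
    for xi in points:
        total = 0.0
        for xj in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            total += math.exp(-sq_dist / denom)
        result.append(total)
    return result


# Hypothetical normalized points: two close neighbors and one isolated point.
points = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]
d = density(points, r_a=0.2)
first_center = max(range(len(points)), key=lambda i: d[i])
```

The point with the highest density is taken as the first cluster center; here it is one of the two neighboring points, while the isolated point only counts itself.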

Results and discussions

The RMSE and R 2 statistics of each ANN model in the testing period are given in Table 3. The ANN model whose inputs were H t−1, H t , Q t−2, and Q t−1 (input combination 4) with the [4 15 1] architecture has the smallest RMSE (9.91 m³/s) and the highest R 2 (0.82). As shown in Table 3, using only the stage H t (input combination 1) gives a poor estimate, with an RMSE of 21.99 m³/s and an R 2 of 0.05. Among the ANN models whose inputs included the antecedent discharges (input combinations 2, 3, 4, and 5), the ANN model with Q t−1 has the biggest RMSE (12.06 m³/s) and the lowest R 2 (0.67). This emphasizes that Q t is mostly dependent on the antecedent discharge values. Among the ANN models whose inputs included the antecedent stages (input combinations 3, 4, and 5), the ANN model with inputs H t−2, H t−1, and H t has the biggest RMSE (12.03 m³/s) and the lowest R 2 (0.72). In general, all the developed ANN models except ANN-1 and ANN-2 with [2 3 1] have good capabilities to emulate the stage–discharge relationship because they have reasonable RMSE and R 2 values. Table 3 also shows that increasing the number of hidden nodes brought slightly better performance for the developed models. The Q t estimates of the best-performing models are also represented graphically in Fig. 5. It can be seen from these plots that the agreement between measured and estimated discharge was reasonably good. All the plots show that the developed models underestimated Q t , especially at the lowest values of discharge.

Table 3 Statistical performance criteria for the ANN models with one hidden layer

Fig. 5 Comparison between measured and estimated discharge and best-fit lines for the best-performing ANN models: a ANN-3 [3 3 1], b ANN-3 [3 15 1], c ANN-4 [4 5 1], d ANN-4 [4 15 1], and e ANN-6 [3 8 8 1]

Table 4 presents the statistical performance of the M5 decision tree models. The model whose inputs were H t and Q t−1 (input combination 2) was the best among all the developed models, with the lowest RMSE (8.10 m³/s) and the highest R 2 (0.88). The other models also performed well, except MT1 with the single input H t . Figure 6 shows a graphical comparison between measured and estimated discharges. It is clear from Fig. 6 that the four models MT2 to MT5 show very good agreement between measured and estimated discharges for both low and high values. For the MT5 model, the following rules were extracted from the M5 algorithm:

Table 4 Statistical performance criteria for M5P decision tree technique

Fig. 6 Comparison between measured and estimated discharges and best-fit lines for the best-performing M5 models

$$\begin{gathered} Q_{t - 1} \le 136.5:\;{\text{LM1 }}\left( {61/39.078\;\% } \right) \hfill \\ Q_{t - 1} > 136.5: \hfill \\ |\quad Q_{t - 1} \le 151.5:\;{\text{LM2 }}\left( {102/31.145\;\% } \right) \hfill \\ |\quad Q_{t - 1} > 151.5:\;{\text{LM3 }}\left( {54/64.323\;\% } \right) \hfill \\ \end{gathered}$$

LM num: 1

$$Q_{t} = \, - 17.2053 \, \times \, H_{t - 2} - \, 2.2944 \, \times \, H_{t - 1} + \, 0.2323 \, \times \, Q_{t - 2} + \, 0.1694 \, \times \, Q_{t - 1} + \, 2.992 \, \times \, H \, + \, 237.7584$$

LM num: 2

$$Q_{t} = - 0.4746 \, \times \, H_{t - 2} - \, 2.5982 \, \times \, H_{t - 1} - \, 0.0163 \, \times \, Q_{t - 2} + \, 0.6489 \, \times \, Q_{t - 1} + \, 3.1867 \, \times \, H \, + \, 53.6424$$

LM num: 3

$$Q_{t} = - 0.4746 \, \times \, H_{t - 2} - \, 14.9457 \, \times \, H_{t - 1} - \, 0.0276 \, \times \, Q_{t - 2} + \, 0.2338 \, \times \, Q_{t - 1} + \, 19.5879 \, \times \, H + \, 85.3705$$

For the MT2 model with minimal input data (input combination 2), the following tree was extracted:

LM num: 1

$$Q_{t} = \, 0.8555 \, \times \, Q_{t - 1} + \, 21.2475$$
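The extracted MT5 and MT2 models can be transcribed directly into code. The Python sketch below simply evaluates the printed split thresholds and linear models; the function and argument names are hypothetical, and no pruning or smoothing step beyond the printed coefficients is reproduced.

```python
def mt5_predict(h_t, h_t1, h_t2, q_t1, q_t2):
    """Evaluate the MT5 model tree: pick a leaf from Q_{t-1}, then apply its LM."""
    if q_t1 <= 136.5:  # LM1
        return (-17.2053 * h_t2 - 2.2944 * h_t1 + 0.2323 * q_t2
                + 0.1694 * q_t1 + 2.992 * h_t + 237.7584)
    if q_t1 <= 151.5:  # LM2
        return (-0.4746 * h_t2 - 2.5982 * h_t1 - 0.0163 * q_t2
                + 0.6489 * q_t1 + 3.1867 * h_t + 53.6424)
    # LM3
    return (-0.4746 * h_t2 - 14.9457 * h_t1 - 0.0276 * q_t2
            + 0.2338 * q_t1 + 19.5879 * h_t + 85.3705)


def mt2_predict(q_t1):
    """Evaluate the single linear model extracted for MT2."""
    return 0.8555 * q_t1 + 21.2475


# Hypothetical antecedent discharge of 100 m3/s.
q_next = mt2_predict(100.0)
```

The tree structure makes the regime switching explicit: the antecedent discharge selects which linear model applies.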

The statistical performance of the TS inference system is shown in Table 5. The results of this data-driven model were similar to those of the M5 model. The TS2 model with two input parameters, i.e., H and Q t−1, was the best among the developed models, with the lowest RMSE (8.17 m³/s) and the highest R 2 (0.88). The worst model was the one whose only input was stage. The other three models (TS3 to TS5) also performed very well, with both high and low values reasonably predicted (Fig. 7). TS2 was selected in this study as the candidate for comparison with the other data-driven models because it requires minimal input data and performed best among the developed TS models, as mentioned previously. The membership editor and fuzzy rules for this model are shown in Figs. 8 and 9, respectively. Three simple fuzzy rules were generated for this model. These are:

Table 5 Statistical performance criteria for TS fuzzy engine

Fig. 7 Comparison between measured and estimated discharge and best-fit lines for the best-performing TS inference engines

Fig. 8 Membership editor for TS2 inference engine

Fig. 9 Fuzzy rules for TS2 inference system

$$\begin{gathered} {\text{IF }}Q_{t - 1} {\text{ is low and }}H{\text{ is low THEN}}\;Q{\text{ is low}} \hfill \\ {\text{IF }}Q_{t - 1} {\text{ is medium and }}H{\text{ is medium THEN}}\;Q{\text{ is medium}} \hfill \\ {\text{IF }}Q_{t - 1} {\text{ is high and }}H{\text{ is high THEN}}\;Q{\text{ is high}} \hfill \\ \end{gathered}$$

The TS inference system for this model is illustrated in Fig. 10.

Fig. 10 TS2 fuzzy inference system

The comparison between the three best data-driven models is presented in Table 6. The best data-driven model for estimating Q t was the M5 model tree. Although the performance of the TS inference system was very close to that of the M5 model in terms of R 2, the M5 method had the lowest RMSE (8.10 m³/s). The results also reveal that the M5 model performed better than the ANN model for both low and high discharge predictions. The complex structure of the ANN and the many parameters that must be assigned for successful training make the ANN a second choice when compared with the simple structure and very fast training of the M5 algorithm. The generated tree structure with linear models at the leaves offers another benefit of this technique: it is easy to understand, even for people who are unfamiliar with hydrology. The same results were obtained by Ajmera and Goyal (2012) when they compared the ANN and M5 techniques for mimicking a flow rating curve. The results of this study agree with Ajmera and Goyal (2012) and add another comparison, i.e., between the TS inference system and M5, which reinforces the capability of the M5 model for emulating the stage–discharge relationship. The results also indicate that the TS and MT models that used only two variables (Q t−1 and H) were very good at predicting Q t for the study area.

Table 6 Comparison between the three best data-driven models

Conclusions

The abilities of artificial neural networks, M5 decision trees, and the Takagi–Sugeno fuzzy inference technique to emulate the stage–discharge relationship of the Gharraf River system, southern Iraq, have been investigated and discussed in this study. The study demonstrated that modeling this relationship is possible using these techniques. The M5 decision tree model with minimal data, i.e., the current stage and one antecedent discharge, performed better than the ANN models and the TS inference engine. The root mean squared error and coefficient of determination for the best M5 model were 8.10 m³/s and 0.88, respectively. The best M5 and TS models were able to predict discharge at both high and low values. Most of the developed ANN models were able to predict the discharge to some extent, but most predictions were underestimates. All the developed models with stage as the single input failed to mimic the stage–discharge relationship. This implies that antecedent discharges are needed for a better relationship in this area. The study used data from one station, and further studies using more data may improve on the results obtained here.