Extreme learning machine
The low modeling speed caused by the large number of parameters that must be initialized prior to modeling, as well as trapping in local minima, are common issues of learning the feed-forward neural network (FFNN) with gradient algorithms such as back propagation (BP). To overcome these problems, the extreme learning machine (ELM) algorithm was presented (Huang et al. 2004, 2006; Bagherifar et al. 2020). Converting the nonlinear problem to a linear form, the most significant achievement of this approach, greatly shortens the modeling process, so that training is completed in a few seconds. The structure of this model is a single-layer FFNN (SLFFNN). In this technique, the matrix connecting the input layer to the hidden layer (the input weight matrix) and the hidden layer biases are randomly initialized, while the matrix connecting the hidden layer to the output layer (i.e., the output weight matrix) is computed by solving a linear problem. This structure significantly reduces the modeling time compared to time-consuming methods such as BP. The results of previous studies demonstrate that, in addition to its remarkable training speed, this method has higher generalizability than gradient-based algorithms (Azimi and Shiri 2021; Azimi et al. 2021; Azimi et al. 2017b).
Given N samples with several inputs (\(x_{i} \in R^{n}\)) and one output (\(y_{i} \in R\)), written as \(\left\{ {(x_{i} ,y_{i} )} \right\}_{i = 1}^{N}\), and considering L neurons in the hidden layer and the activation function g(x), the structure of the ELM model for mapping the considered inputs to the output is expressed as follows (Azimi et al. 2017b):
$$ y_{j} = \sum\limits_{i = 1}^{L} {\beta_{i} g({\mathbf{w}}_{i} \cdot {\mathbf{x}}_{j} + b_{i} )} ,\quad j = 1,2, \ldots ,N $$
(1)
In this relationship, \({\mathbf{w}}_{i} = (w_{i1} ,w_{i2} , \ldots ,w_{in} )\,(i = 1, \ldots ,L)\) is the weight vector connecting the input neurons to the ith hidden layer neuron, \(b_{i}\) represents the bias of the ith hidden layer neuron and \(\beta_{i}\) denotes the weight connecting the ith hidden layer neuron to the output neuron. Among these three sets of parameters, \({\mathbf{w}}_{i}\) and \(b_{i}\) are randomly initialized, while \(\beta_{i}\) is computed during the solution of the problem of interest via the ELM. Also, N denotes the number of training samples and n the number of input variables. The output node in the mathematical expression specified for the ELM is linear, and \({\mathbf{w}}_{i} \cdot {\mathbf{x}}_{j}\) represents the inner product of the input weight vector of the ith hidden neuron and the input vector of the jth sample. Rewriting this formula in matrix form gives Eq. (2) (Azimi et al. 2017b):
$$ {\mathbf{H}}\beta = {\mathbf{y}} $$
(2)
where \(\beta = [\beta_{1} , \ldots ,\beta_{L} ]^{{\text{T}}}\), \({\mathbf{y}} = [y_{1} , \ldots ,y_{N} ]^{{\text{T}}}\) and H denotes the hidden layer output matrix, which is computed as follows:
$$ {\mathbf{H}}({\mathbf{w}}_{{\mathbf{1}}} , \ldots ,{\mathbf{w}}_{{\mathbf{L}}} ,b_{1} , \ldots ,b_{L} ,{\mathbf{x}}_{{\mathbf{1}}} , \ldots ,{\mathbf{x}}_{{\mathbf{N}}} ) = \left[ {\begin{array}{*{20}c} {g({\mathbf{w}}_{{\mathbf{1}}} \cdot {\mathbf{x}}_{{\mathbf{1}}} + b_{1} )} & \cdots & {g({\mathbf{w}}_{{\mathbf{L}}} \cdot {\mathbf{x}}_{{\mathbf{1}}} + b_{L} )} \\ \vdots & \ddots & \vdots \\ {g({\mathbf{w}}_{{\mathbf{1}}} \cdot {\mathbf{x}}_{{\mathbf{N}}} + b_{1} )} & \cdots & {g({\mathbf{w}}_{{\mathbf{L}}} \cdot {\mathbf{x}}_{{\mathbf{N}}} + b_{L} )} \\ \end{array} } \right] $$
(3)
As discussed before, w and b are randomly initialized, so the only remaining unknown is the matrix β, which is calculated analytically from a linear relationship. Since the matrix H in Eq. (2) is non-square, this equation cannot be solved directly like an ordinary linear system. Instead, the loss function \(\min \left\| {{\mathbf{y}} - {\mathbf{H}}\beta } \right\|\) is minimized by the least squares technique, i.e., the solution is obtained by minimizing the l2-norm of the residual (Azimi et al. 2017b).
$$ \hat{\beta } = {\mathbf{H}}^{ + } {\mathbf{y}} $$
(4)
where H+ is the Moore–Penrose generalized inverse (MPGI) (Rao and Mitra 1971) of H. For modeling by the ELM, the number of hidden layer nodes should be less than the number of training samples (L < N); otherwise, overfitting occurs. Hence, the above relationship is rewritten as follows (Azimi et al. 2017b):
$$ \hat{\beta } = ({\mathbf{H}}^{{\text{T}}} {\mathbf{H}}{)}^{ - 1} {\mathbf{H}}^{{\text{T}}} {\mathbf{y}} $$
(5)
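As an illustration of Eqs. (1)–(5), the following minimal Python/NumPy sketch trains and evaluates an ELM; the function names, the uniform initialization range and the tanh activation are illustrative assumptions rather than the exact implementation used in this study.

```python
import numpy as np

def elm_train(X, y, L=20, activation=np.tanh, seed=0):
    """Minimal ELM training sketch: random input weights/biases,
    output weights from the least-squares solution of Eq. (2)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape                       # N samples, n input variables
    W = rng.uniform(-1.0, 1.0, (L, n))   # input weight vectors w_i (random)
    b = rng.uniform(-1.0, 1.0, L)        # hidden-layer biases b_i (random)
    H = activation(X @ W.T + b)          # hidden-layer output matrix, Eq. (3)
    # beta from the Moore-Penrose least-squares solution of Eqs. (4)-(5)
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return W, b, beta

def elm_predict(X, W, b, beta, activation=np.tanh):
    """Evaluate the ELM output of Eq. (1) for new inputs."""
    return activation(X @ W.T + b) @ beta
```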
Outlier robust ELM (ORELM)
In solving complex nonlinear problems via AI-based algorithms, outliers, which arise from the nature of the problem and therefore cannot simply be discarded, have a noticeable influence on the modeling outcomes; thus, a significant share of the modeling error (e) is related to them. To take this issue into account when modeling nonlinear problems with the ELM, Zhang and Luo (2015) described the presence of outliers through sparsity. Knowing that the l0-norm reflects sparsity better than the l2-norm, they defined the training error (e) term as follows instead of using the l2-norm:
$$ \mathop {\min }\limits_{\beta } \;C\left\| {\mathbf{e}} \right\|_{0} + \left\| \beta \right\|_{2}^{2} \quad {\text{subject to}}\quad {\mathbf{y}} - {\mathbf{H}}\beta = {\mathbf{e}} $$
(6)
To solve this non-convex programming problem, Zhang and Luo (2015) recast it in a tractable convex relaxation form without losing the sparsity characteristic of the problem. Using the l1-norm instead of the l0-norm in the following equation not only preserves the sparsity characteristic but also yields a convex minimization problem:
$$ \mathop {\min }\limits_{\beta } \left\| {\mathbf{e}} \right\|_{1} + \frac{1}{C}\left\| \beta \right\|_{2}^{2} \quad {\text{subject to}}\quad {\mathbf{y}} - {\mathbf{H}}\beta = {\mathbf{e}} $$
(7)
This formula is a constrained convex optimization problem that fits completely within the domain of the augmented Lagrangian (AL) multiplier approach. The AL function is therefore given as:
$$ L_{\mu } ({\mathbf{e}},\beta ,\lambda ) = \left\| {\mathbf{e}} \right\|_{1} + \frac{1}{C}\left\| \beta \right\|_{2}^{2} + \lambda^{{\text{T}}} ({\mathbf{y}} - {\mathbf{H}}\beta - {\mathbf{e}}) + \frac{\mu }{2}\left\| {{\mathbf{y}} - {\mathbf{H}}\beta - {\mathbf{e}}} \right\|_{2}^{2} $$
(8)
Here, \(\lambda \in R^{N}\) is the vector of Lagrangian multipliers and \(\mu = 2N/\left\| {\mathbf{y}} \right\|_{1}\) (Yang and Zhang 2011) is the penalty parameter. The optimal values of e, β and λ are obtained by iteratively minimizing the AL function:
$$ \left\{ \begin{gathered} ({\mathbf{e}}_{k + 1} ,\beta_{k + 1} ) = \arg \,\mathop {\min }\limits_{{{\mathbf{e}},\beta }} L_{\mu } ({\mathbf{e}},\beta ,\lambda )\quad (\text{a}) \hfill \\ \lambda_{k + 1} = \lambda_{k} + \mu ({\mathbf{y}} - {\mathbf{H}}\beta_{k + 1} - {\mathbf{e}}_{k + 1} )\quad \;\;\;(\text{b}) \hfill \\ \end{gathered} \right\} $$
(9)
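The Python sketch below illustrates one common way of carrying out the iteration of Eq. (9), alternating a ridge-type β-update with a soft-thresholding e-update before the multiplier update of Eq. (9b); the closed-form sub-steps are standard ADMM-style assumptions, not a verbatim transcription of Zhang and Luo (2015).

```python
import numpy as np

def soft_threshold(x, kappa):
    """Element-wise shrinkage operator used for the l1-norm minimization."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def orelm_train(H, y, C=1.0, n_iter=50):
    """Sketch of the augmented-Lagrangian iteration of Eqs. (8)-(9)."""
    N, L = H.shape
    mu = 2.0 * N / np.linalg.norm(y, 1)          # penalty parameter
    lam = np.zeros(N)                            # Lagrange multipliers
    e = np.zeros(N)                              # outlier (error) vector
    # matrix reused by every beta-update: (H^T H + 2/(C*mu) I)^{-1} H^T
    A = np.linalg.solve(H.T @ H + (2.0 / (C * mu)) * np.eye(L), H.T)
    for _ in range(n_iter):
        beta = A @ (y - e + lam / mu)            # ridge-like beta-update
        r = y - H @ beta + lam / mu
        e = soft_threshold(r, 1.0 / mu)          # l1 (outlier) update
        lam = lam + mu * (y - H @ beta - e)      # multiplier update, Eq. (9b)
    return beta
```

In this sketch the hidden-layer matrix H is assumed to be built exactly as in the ELM code above, so only the computation of β changes between the two models.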
Physical models
In this paper, the experimental data reported by Ojha and Subbaiah (1997) and Hussain et al. (2011) are used to validate the AI models. The rectangular channel used in Ojha and Subbaiah's model is 4.5 m long, 0.4 m wide and 0.5 m high, with a side slide gate 0.2 m wide installed 2.5 m from the channel entrance. The rectangular channel used in the Hussain et al. (2011) model is 9.15 m long, 0.5 m wide and 0.6 m high, with a rectangular side gate located 5 m from the channel entrance. For this model, the experimental data cover rectangular gates with three widths: 0.044 m, 0.089 m and 0.133 m. Figure 1 depicts the layouts of these laboratory models.
DC of side slot
Generally, the DC of side slots is a function of the following parameters (Hussain et al. 2011; Azimi et al. 2018):
\(\left( L \right)\) = length of side slot, \(\left( b \right)\) = height of side slot, \(\left( B \right)\) = width of main channel, \(\left( W \right)\) = height of the side slot bed from the main channel bed, \(\left( {V_{1} } \right)\) = flow velocity in the main channel, \(\left( {Y_{m} } \right)\) = the flow depth in the main channel, \(\left( \rho \right)\) = fluid density, \(\left( \mu \right)\) = flow viscosity, \(\left( g \right)\) = gravitational acceleration
$$ C_{d} = f_{1} \left( {L,b,B,W,V_{1} ,Y_{m} ,\rho ,\mu ,g} \right) $$
(10)
Given that the flow Froude number is \(F_{r} = \frac{{V_{1} }}{{\sqrt {gY_{m} } }}\), that the density, viscosity and gravitational acceleration remain constant, and that \(B\), \(W\) and \(Y_{m}\) are made dimensionless with the slot length \(L\), Eq. (10) is rewritten as:
$$ C_{d} = f_{2} \left( {\frac{B}{L},\frac{W}{L},\frac{{Y_{m} }}{L},F_{r} } \right) $$
(11)
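For clarity, the short sketch below assembles the Eq. (11) input vector from the raw channel and slot quantities; the function name and argument conventions are illustrative assumptions.

```python
import numpy as np

def dimensionless_inputs(L, B, W, Ym, V1, g=9.81):
    """Build the Eq. (11) inputs (B/L, W/L, Ym/L, Fr) from the raw
    quantities (all lengths in metres, V1 in m/s)."""
    Fr = V1 / np.sqrt(g * Ym)          # Froude number of the approach flow
    return np.array([B / L, W / L, Ym / L, Fr])
```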
Hence, in this investigation, the impact of the parameters of Eq. (11) on the DC of side slots is considered. The ranges of the variables reported by Ojha and Subbaiah (1997) and Hussain et al. (2011) for developing the ORELM models are given in Table 1.
Table 1 Range of the dimensionless parameters reported by Ojha and Subbaiah (1997) and Hussain et al. (2011) for developing ORELM models

The combinations of the dimensionless parameters of relation (11) used for developing the ORELM models are shown in Fig. 2. It is worth noting that 70% of the laboratory data are employed for training the models, while the remaining 30% are utilized for testing them.
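A minimal sketch of the random 70/30 division of the laboratory data is given below, assuming the dimensionless inputs are stacked row-wise in an array X with the corresponding DC values in Cd; the names and the fixed seed are illustrative.

```python
import numpy as np

def split_70_30(X, Cd, seed=0):
    """Randomly assign 70% of the samples to training and 30% to testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.7 * len(X))
    train, test = idx[:n_train], idx[n_train:]
    return X[train], Cd[train], X[test], Cd[test]
```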
Goodness of fit
To test the accuracy of the presented models, several statistical indices are utilized, comprising the correlation coefficient (R), variance accounted for (VAF), root-mean-square error (RMSE), scatter index (SI), mean absolute error (MAE) and Nash–Sutcliffe efficiency (NSC) (Azimi et al. 2022):
$$ R = \frac{{\sum\nolimits_{i = 1}^{n} {\left( {F_{i} - \overline{F} } \right)\left( {O_{i} - \overline{O} } \right)} }}{{\sqrt {\sum\nolimits_{i = 1}^{n} {\left( {F_{i} - \overline{F} } \right)^{2} \sum\nolimits_{i = 1}^{n} {\left( {O_{i} - \overline{O} } \right)^{2} } } } }} $$
(12)
$$ {\text{VAF}} = \left( {1 - \frac{{{\text{var}} \left( {F_{i} - O_{i} } \right)}}{{{\text{var}} \left( {F_{i} } \right)}}} \right) \times 100 $$
(13)
$$ {\text{RMSE}} = \sqrt {\frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left( {F_{i} - O_{i} } \right)^{2} } } $$
(14)
$$ {\text{SI}} = \frac{{{\text{RMSE}}}}{{\overline{O} }} $$
(15)
$$ {\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {F_{i} - O_{i} } \right|} $$
(16)
$$ {\text{NSC}} = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {\left( {O_{i} - F_{i} } \right)^{2} } }}{{\sum\nolimits_{i = 1}^{n} {\left( {O_{i} - \overline{O}} \right)^{2} } }} $$
(17)
where \(O_{i}\) = observed measurements, \(F_{i}\) = values predicted by the models, \(\overline{O}\) = average of the observed measurements and n = number of observed measurements.
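The indices of Eqs. (12)–(17) can be computed, for example, with the following Python sketch; the function name and the dictionary output are illustrative assumptions.

```python
import numpy as np

def goodness_of_fit(O, F):
    """Compute the statistical indices of Eqs. (12)-(17) for observed
    values O and model predictions F (both 1-D arrays)."""
    O, F = np.asarray(O, float), np.asarray(F, float)
    R = np.corrcoef(F, O)[0, 1]                                       # Eq. (12)
    VAF = (1.0 - np.var(F - O) / np.var(F)) * 100.0                   # Eq. (13)
    RMSE = np.sqrt(np.mean((F - O) ** 2))                             # Eq. (14)
    SI = RMSE / np.mean(O)                                            # Eq. (15)
    MAE = np.mean(np.abs(F - O))                                      # Eq. (16)
    NSC = 1.0 - np.sum((O - F) ** 2) / np.sum((O - np.mean(O)) ** 2)  # Eq. (17)
    return {"R": R, "VAF": VAF, "RMSE": RMSE, "SI": SI, "MAE": MAE, "NSC": NSC}
```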
Next, different activation functions are tested. After that, the most accurate model, along with the most influential input parameter, is identified by performing a sensitivity analysis (SA). Additionally, the most efficient ORELM model is compared with the ELM. Furthermore, an uncertainty analysis (UA) as well as a partial derivative sensitivity analysis (PDSA) are conducted on the best model. Finally, a computer code is provided for the estimation of the DC of side slots.