1 Introduction

South Africa is well known for cow dung management and the production of biogas. Biogas digesters generate vast quantities of organic residues and biogas. The residues known as digestate are used as organic fertilizer, whereas biogas is used for energy for cooking and heating [1]. Most dairy farms in the province produce meat and milk and generate large quantities of cow dung [2]. The good news is that organic waste can be converted into useful resources through a proper recycling system [3]. He [4] mentioned that the production of waste from livestock farming improves agriculture, and farming (an aspect of agriculture) is sustainable compared to conventional agriculture. However, lots of benefits are derived from cow dung, such as being cheap and readily available, used as a biofertilizer, biopesticides, cleaning agents in rural areas, burning fuel, and acting as a mosquito repellent [5].

For biogas production, any organic waste could be used. Biogas technology is considered one of the best technologies for the treatment of organic waste to bring about both material and energy [6]. Certainly, the recycling process of using agricultural waste for biogas generation is a promising approach and, hence, advantageous towards economic, environmental, and social impact [7]. It has been studied and revealed that biogas production is highly beneficial compared to other renewable energy sources. In this regard, it does not create harmful components such as sulphur dioxide (SO2), carbon monoxide (CO), nitrogen dioxide (NO), and other particles [1].

One way of accessing the quality of production of biogas is through the quantity of methane yield. Hence, this can be achieved through prediction by mathematical modelling. Interestingly, mathematical models have been widely studied in relation to the modelling of anaerobic digestion processes. This usually occurs with a set of input variables to predict the desired or output parameter in a dramatic condition [8]. On the other hand, the statistical models to predict biogas production of biomass have been widely studied, focusing on macromolecular and physical-chemical composition [9]. These studies aimed at determining energy and economic evaluation based on analytical and experimental data. Furthermore, it ensures high replicability and reliability of results. In essence, the usual models used are multiple linear regression (MLR) models [10], multiple non-linear regression (MNLR), artificial neural network (ANN) models [10, 11], and adaptive network-based fuzzy inference (ANBFI) model [10] [11], response surface methodology (RSM).

According to Madhavan et al. [12], these models are helpful in predicting the desired or output yield from a given set of predictors or input parameters as well as assist in testing the workability of scale-up biogas digester, provided there is a given input parameter and organic feed load. Further, models guide analyzing a biogas digester’s failure and optimizing the feedstock’s biomethane yield and reduction rate [13]. Abubakar et al. [14] mentioned that the target of using mathematical models to analyse biogas yield is to determine the kinetic parameter, which is beneficial to the performance of the biogas digester in terms of its efficiency and behaviour. However, for the sake of the present study, the ANN and MLR were employed. The artificial neural network (ANN) and multiple linear regression (MLR) models came about as an alternative to the mechanistic model, such as the ADM1. However, there has always been difficulty in quantifying the organic composition necessary for effluent feed steam. This is because of the need required for many parameters and estimations. Indeed, the effluent feed steam usually provides important information used in the development of the model [15]. As regards the non-linear and complex relationship relating to the input and output variables of ANN and MLR models, Mansourian et al. [16] and Absipour et al. [17] reported their findings which revealed that ANN models have higher ability and efficiency in predicting biogas yield than the MLR model.

In monitoring the framework for the operation of anaerobic digestion, ANN modelling is usually employed. The ANN increases the optimization cost and identifies the quality or stability of wastewater effluent. The ANN was used in the study because of its ability to recognize complex non-linear relationships between the input and output parameters. These are necessary and needed in the model prediction and estimation of the properties of anaerobic digestion properties. Usually, this occurs without the need for prior knowledge of laws (theoretical and physical) that apply to biological processes [18]. Further, the model involving ANN is helpful in predicting the performance of anaerobic digesters with a high degree of accuracy. This is possible if the quality of the historical data is good [18]. Sathish and Vivekanandan [19] mentioned that the ANN model is usually reported to have a high determination coefficient (R2) of 0.998 and is faster and more straightforward in the modelling of large-scale operation plant. It also serves as a guide to estimate or predict the behaviour of digesters in the treatment of waste. This outcome provides suitable information and recommendations to plant engineers and technicians in controlling and preventing biogas digester failure [20].

Multiple linear regression (MLR) is also known as a simple regression model, which is essential in predicting biogas yield from feedstock characteristics and maximizing biogas yield through the management of anaerobic digestion. Further, it can be useful in performing preliminary energetic and economic evaluations [20]. MLR is regarded as one of the types of regression analysis with linear and non-linear [14], consisting of multiple independent variables (χ1, χ2, χ3……….) with their respective coefficient (a1, b2, c3….) and a constant (ao). Mostly, the MLR deals with the modelling of a linear relationship between the independent and dependent variables. According to Abdipour et al. [17], MLR is based on the linear and additive association of explanatory variables in focus to attempt the model. This also involves the relationship between dependent and independent variables. Seeing that the efficiency prediction of the MLR model depends on the existence of linear relationships of the input and output parameters, various studies have been conducted and published because of their simplicity.

In this study, we focused on the yield of biogas from cow dung by comparing their prediction using the developed MLR and ANN in an underground fixed biogas digester. Various parameters were used as predictors for this purpose.

1.1 Objectives of the study

The study sought to accomplish the following objectives:

  1. (a)

    To rank the input parameters (absolute global solar irradiance, absolute relative humidity, slurry temperature, biogas temperature, absolute pH, and ambient temperature) to their weights of contribution to the volume of biogas yield.

  2. (b)

    To develop a multiple linear regression model to predict the volume of biogas yield (V) using absolute global solar irradiance (Ir), absolute relative humidity (RHr), slurry temperature (Ts), biogas temperature (Tg), absolute pH (pHr), and ambient temperature (Ta) as predictors.

  3. (c)

    To develop an artificial neural network model to predict the volume of biogas yield as the model output and the absolute global solar irradiance, absolute relative humidity, slurry temperature, biogas temperature, absolute pH, and ambient temperature as predictors.

  4. (d)

    To compare the accuracy of both the MLR and ANN models and justify the benefits of ANN over MLR models.

However, to the best knowledge of the authors, no study has been published on the detailed and extensive comparison of the MLR and ANN model used in the study, thereby focusing on many aspects such as the 75th and 25th percentile, the median, upper, and lower limit of the measured target and predicted values. The study will be more beneficial to researchers in mathematics, computer science, physics, and engineering intending to undertake studies on or have some knowledge of renewable energy as it relates to bioenergy.

2 Literature review

This section reviews the existing and recent studies on ANN and MLR for the prediction of biogas yield from various authors. This is important to stimulate and predict the produced amount of biogas with time under specified conditions. The ANN and MLR are considered effective alternatives to experimental methods, which are time-consuming and require expensive prototypes [21]. In this context, estimating the biogas yield involving the co-digestion of yellowback fungus spent mushroom and different livestock manure, such as chicken, dairy, and pig manure, was carried out by Gao et al. [22]. The study, conducted at a constant temperature of 35°C, used spent mushrooms to the manure of 10 – 90 w/w and total solids content of 5 – 15% as an independent variable. It was observed that the result of the study fits experimental data through a polynomial model, which is said to be a suitable regression model with a determination coefficient (R2) of more than 0.95.

The prediction of biogas production from anaerobic co-digestion of waste sludge and wheat straw using two-dimensional mathematical models and artificial neural networks was conducted by Abdeldaiem et al. [21]. From the experimental result, co-digestion of a 7% mixing ratio of straw to sludge was said to improve the C/N ratio to 35, and the highest yield of biogas was obtained, which is 15-fold higher than the sludge mono-digestion. Thereby reducing the total solids (58.06%), volatile solids (66.55%), and chemical oxygen demand (74.67%). In terms of the model result, the two-dimensional mathematical model showed a high correlation with the experimental data. This was the case; the logistic kinetic model is considered the best for experimental data representation. Therefore, the ANN result reveals that the training, validation, and testing of the multilayer feedforward neural network (MFFN-MFO) model yield a very high correlation coefficient when compared with other models. Hence, this proved that ANN is the most helpful tool for modelling biogas yield.

Mougari et al. [23] developed a model using the ANN and modified Gompertz (MG) equation to predict the cumulative biogas yield (CBY) and cumulative methane yield (CMY) from mono-digestion and co-digestion of different organic wastes. The study was conducted under optimal operational conditions, considering the effect of VS/TS ratio, carbon, C/N ratio, and digestion time. These were selected as the input predictors for the implementation of the ANN approach, whereas the GM was employed to optimize the ANN architecture and provide reliable and fast learning for better prediction performance. From the study, both approaches (ANN and GM) performed well in predicting the CBY and CMY. Hence, they showed a good agreement with the experimental data. The GA-ANN model showed a minor deviation and higher predictive accuracy with satisfactory RMSE and R2 of 0.0045 and 0.9996 for CBY and 0.0046 and 0.9998 for CMY. The study is evidence that CBY and CMY can be effectively predicted using both approaches and can be a tool for the scale-up of anaerobic digestion in terms of techno-economic analysis.

An artificial neural network (ANN) model was developed for the prediction of biogas production by Oluwayomi et al. [24]. The input parameter used in their study includes retention time and a combination of cow dung and pig waste. The obtained predicted ANN result was compared to the experimental value, which showed a significant and satisfactory performance in relation to biogas yield with a determination coefficient (R2) of 0.92.

On the other hand, to explore the MLR model based on the characteristics of substrate and operating conditions, Xu et al. [25] developed a model. The following substrate characteristic considered includes the volatile solid content, lignin cellulose, sugar yield etc., whilst the operating characteristics include the total solids, carbon/nitrogen ratio, and particle size. The model analysis revealed that the substrate’s characteristics had an impact on the model, whereas the operating condition (TS%) had no significant impact. The insignificant impact of the operating condition on the model was because of the limited data availability and large variance amongst data collection.

Rossi et al. [26] aimed to develop a multilinear regression model for the prediction of biogas production from dry anaerobic digestion of organic fraction of municipal solid waste (OFMSW). Their study used 332 data sets from the plug flow biogas digester to build the model. The input parameter includes total volatile solids, organic loading rate, hydraulic retention time, carbon/nitrogen ratio, lignin content, and total volatile fatty acid. Their study reported that all the input parameters showed the optimal fitting ability of R2 = 0.91. Further, the feedstock characteristics and operating parameters gave the best fit and predictive ability of R2 = 0.87.

To determine the most accurate model for the prediction of biogas production from chicken manure was the case study conducted by Abubakar et al. [27]. The study employed the non-linear regression analysis (NLREG) to compare the first and modified first order. In the study, the authors compare both models, looking at the following regression parameters: the number of iterations, the final sum of deviation, standard error of estimate, average deviation, maximum deviation, determination coefficient, and the final sum of squared deviation. The results of the study reveal that the NLREG of first and modified first-order reported biogas production of 10,252,217.1g and 83,861.2925 g, respectively. Both models showed an R2 of 0.8483 and an average deviation of 141.126, amongst other parameters considered. Findings from the study indicate that using the NLREG, the first-order model is the most correct kinetic model. Although, the model is said not to satisfy or fit the measured biogas yield.

3 Material and methods

3.1 Sample preparation and analysis

The sample referred to as cow dung was used as the substrate for biogas production. This was taken from the dairy farm of the University of Fort Hare. To reduce the moisture content of the cow dung, it was dried to less than 20% based on Li et al. [28] study. The following physio-chemical properties of the cow dung were obtained. Total solids (13.80%), volatile solids (11.04%), pH (7.83), carbon/nitrogen ratio (24), chemical oxygen demand (42,583 mg/L), and calorific value of 27.0 MJ*g−1. All these analytical determinations were carried out at the Department of Microbiology, Fort Hare University, according to APHA 2005 standard method [29]. Thereafter with the ratio of 1:1 (water waste), the dilution of solid waste and water was conducted to obtain the slurry with reference to Obileke et al. [30]. In essence, the mixing and stirring were necessary to ensure that homogeneity was achieved.

3.2 Reactor design

The reactor known as the fixed dome biogas digester consists of three parts: the digester chamber, the inlet, and the outlet chamber (see Fig. 1). In term of design, the digester chamber was fabricated using high-density polyethylene (HDPE) plastic, whilst the inlet and the outlet chamber was constructed using bricks and cement. The digester chamber has a cylindrical (neck of the digester), spherical (gas storage), and frustum shape (slurry chamber) with a volume of 2.15 m3. The inlet chamber is connected to the digester chamber with PVC pipe, as seen in Fig. 1. As seen in Fig. 1, the angle used in the design was to avoid a rectangular shape of 90° that may result in clogging or dead zone of the slurry at the edge of the digester chamber. If not taken into consideration, it might result in a reduction in biogas production. Therefore, to ensure and enable a smooth downward flow of the slurry from the inlet chamber, the digester chamber was made to an incline angle slightly less than 90°. The complete specific reactor (fixed dome digester) is shown in Fig. 1.

Fig. 1
figure 1

Schematic diagram of the fabricated biogas digester

3.3 Experimental design

The study involved the mono-digestion mode of feeding. By mono-digestion, it simply means the use of one substrate for biogas production. Two hundred litres of the cow slurry was introduced into the bio-digester on the first day. Thereafter, the gas valve was left open for 3 days for the expulsion of any air. An inoculum from an existing biogas digester near the University was added to the biogas digester system to increase the rate of fermentation. Subsequently, 50 L of the cow slurry was fed into the biogas digester every 3 days. Interestingly, the cow dung occupies 55−60% of the total volume of the biogas digester, leaving enough space for the accumulation of the biogas. In terms of performance monitoring of the digester for data collection, the authors developed and built a gas temperature and pressure sensor (GTPS) and data acquisition system (DAS). The sensor is responsible for the measurement of biogas composition, pH, and temperature of the gas and slurry, whereas the data acquisition system was helpful in storing and collecting data. The data collection was done over a monitoring period of 50 days with an interval of 30 min using a CR 1000 data logger. For simplicity, Table 1 provides a summary of the material and method of the study.

Table 1 Summary of materials and equipment used in the study

3.4 Selection and consideration of input parameters

In developing the ANN and MLR model, input parameter selection and choice are essential. Therefore, these affect the model’s coefficient and results [31] and maximize biogas yield [32]. Although there are various methods to select the input parameter, Abdipour et al. [17] mentioned that the correlation coefficient is one of the effective methods. It is regarded as the best choice because it deduces a simple relationship between variables and assists in identifying the scenario in which there is a strong correlation between the input and output parameters or variables. Therefore, the selection of parameters in the study was based on the factors that affect the input materials. This parameter includes the absolute global solar irradiance (Ir), absolute relative humidity (RHr), slurry temperature (Ts), biogas temperature (Tg), absolute pH (pHr), and ambient temperature (Ta). The selection of pH and temperature as input parameters provides an effective and efficient process control strategy. In this case, the parameters contribute towards the biogas increase and decrease of gas pressure inside the digester. However, pressure is not included in the present study parameter, but previous studies have reported its impact on the microclimate of the biogas digesters [32].

3.4.1 Properties of network

The MLR and ANN were used to develop both models. This software employs quantitative and graphics to monitor the training and prediction processes. The model training was done using different combinations of parameters; temperature (gas, slurry, and ambient), pHr, RHr, and Ir to achieve the maximum R2 and minimum RMSE values. This process was accomplished through a trial-and-error method, keeping some training parameters constant and slowly moving the other parameters over a wide range of values. MATLAB (R 2021Aa) Toolbox opens the network/data manager window that permits the user to import, create, use, and export data. The properties to develop the ANN in MATLAB (R 2021a) are as follows.

  • Input parameter/network input/predictor: temperature (gas, slurry, and ambient), pHr, RHr, and Ir

  • Output parameter/ network output/ response: volume of biogas yield

  • Network type: feed-forward backpropagation

  • Training function: TRAINLM

  • Adaption learning function: LEARNGDM.

  • Hidden layer transfer function: transig

4 Results and discussion

This section presents the results and discussion focusing on the ranking of predictors to their weight of contribution, developing the MLR and ANN model for the prediction of biogas yield, and finally comparing the accuracies of both models.

4.1 Ranking of predictors by weights of the significance of importance

The selected input parameters (the absolute daily global solar irradiance (Ir), absolute daily relative humidity (RHr), daily slurry temperature (Ts), daily biogas temperature (Tg), absolute daily pH (pHr), and daily ambient temperature (Ta) were ranked by weights of importance to the desired output (daily volume of biogas yield (V)). In a similar study conducted by Tangwe et al. [33], the authors used a different set of input parameters (number of days, slurry temperature, and pH) and the weights of importance to the desired output (log of total bacteria count) whilst comparing the prediction accuracy of total viable bacteria counts in a balloon digester. Recent studies [34,35,36,37] have focused on ranking predictors by weight of significance or importance using different statistical algorithms or analyses. However, in the present study, these predictors were ranked by weight using the reliefF test. The reliefF test is a statistical algorithm test used to rank predictors according to their weights of importance to the output parameter [38]. The choice of statistical test used in the study is because it estimates and shows the quality of a given feature in the context of other features and is non-parametric [39]. Furthermore, its efficiency is because it does not explicitly explore feature subsets and does not bother trying to identify an optimal minimum feature of subset size [40]. However, the reliefF test algorithms identify a subset of features that may not be the smallest but still include some important and redundant features that are small enough to be used with more refined approaches in a detailed analysis. Therefore, the input and output parameters datasets were obtained during the hydraulic retention period of the charged cow manure in the underground digester. The weights of the predictors were achieved by the ranking method using the reliefF test and ranged from −1 (worse) to +1 (best). The positive weights insinuate that the input parameters are primary factors, whilst the negative weights confirm the predictors as secondary factors. Figure 2 shows each selected predictor’s contribution by ranking their weights.

Fig. 2
figure 2

Ranking of input parameters according to weight contribution to output

In Fig. 2, it was depicted that both Ir and RHr were primary factors, with Ir contributing the most by weights (0.0352) followed by RHr (0.0215). This differs from the study conducted by Tangwe et al. [38], where the number of days was reported to contribute the most with a weight of 0.106, followed by slurry temperature with 0.024 and 0.038 for pH. The difference in the result might be attributed to the type of biogas digester used in various studies. However, in the present study, the contributions from the rest of the predictors by weight ranking were negative and, therefore, secondary factors. The ranking by weights for the secondary factors in descending order was Ts (−0.0002), Tg (−0.0114), pHr (−0.0148), and Ta (−0.0183).

4.2 Development of a mathematical model to predict the biogas yield

Multiple linear regression (MLR) and artificial neural network (ANN) models were developed using the selected input parameters (Ir, RHr, Ts, Tg, pHr, and Ta) and output parameters (V). Here, the output parameter (V) refers to the volume of biogas yield. According to Abubakar et al. [14], biogas yield or production is the main defining factor of anaerobic digestion. Having indicated in Sect. 3.4, the input parameters used in the model influence the output (biogas yield). However, the individual effect of these parameters or variables can be found in the study by Obileke et al. [29] and Abubakar et al. [32]. The ANN and MLR models were data-driven, with the MLR model exploiting the least square regression method to derive the forcing and scaling constants associated with the input parameters [29] whilst the ANN used the supervised shallow machine learning to train the input parameters (Ir, RHr, Ts, Tg, pHr, and Ta) and the target (V) to ensure the model output mimics the target.

4.2.1 Derivation of the forcing and scaling constants for the MLR model

The developed MLR model is represented by Eq. 1 with the quantities k0 as the forcing constant whilst k1, k2, k3, k4, k5, and k6 as the scaling constants, respectively. The forcing and scaling constants are necessary for the development of a mathematical model. Hence, this is unlimited in most published articles where the forcing and scaling constants are not derived and stated.

$$k0+k1 Ir+k2(RHr)+k3 Ts+k4 Tg+k5 pH+k6 Ta=V$$
(1)

Table 2 shows the derived forcing and scaling constants based on the trained dataset using the least square regression analysis.

Table 2 Determination of forcing and scaling parameters for the biogas yield based on MLR

Table 2 revealed that the forcing constant was negative (k0 = −0.2533) whilst all the scaling constants associated with each predictor were positives, with the greatest scaling constant (0.9308) associated with pHr whilst the least (0.0017) was associated with Ts. The derived determination coefficient (R2) for the measured and modelled daily volume of biogas yield was 0.9688, whilst the p value was 0.9999. Based on Abubakar et al. [14], defining the R2 as a statistical measure of data proximity to the fitted regression line does not indicate whether a regression model is enough. A low R2 value can be seen as a good model, and a high R2 can also be obtained for a model that does not fit the data points. Certainly, high R2 are not often excellent, and low R2 values are not often bad. However, in a similar study on MLR for biogas production, Obileke et al. [29] reported R2 of 0.955 for the measured and modelled biogas production and p value of 0.920, using cow dung as substrate whereas Rossi et al. [26], Ithnin and Hashin [41], Tufaner and Demirci [8] reported R2 as 0.92, 0.83, and 0.96 using an organic fraction of municipal solid waste (OFMSW), palm oil mill effluent (POME), and synthesis wastewater respectively. Differences in the determination coefficient (R2) reported via MLR might be because of the different feedstock as well as the input parameters considered by various authors. In a different study where the R2 obtained was 0.83 and p value < 0.00006, Rath et al. [42] used the biochemical constituent (acid detergent lignin, hemicellulose, crude fat-, and water-soluble carbohydrate) as the input parameters. It can be observed that the R2 and p value obtained are a bit different from previous studies reported earlier. This might be attributed to the type or selection of predictor used for the MLR model development. The mean square and absolute mean errors for the measured and modelled daily volume of biogas yield were 1.84 × 10−4 and 2.2 × 10−4, respectively. The very good determination coefficient confirmed in the study shows that the model output is accurately predicting the measured target (daily volume of biogas yield), as seen in Rossi et al. [26], Ithnin and Hashin [41], Tufaner and Demirci [8] studies. The excellent p value (close to 1) and the absolute mean error and root mean square error revealed no significant difference between the measured and model output, and the errors between the model output and the measured target are much close to 0. This is different from the Iweka et al. [43] study, where the developed model employed is significant with a p value of 0.0001. It is interesting to state that a p value less than 0.050 means that the represented terms are significant, whereas a p value greater than 0.05 indicates significance. In a similar case, Obileke et al. [44] reported a 5% significance level when conducting a comparative study of the above-ground and underground digester. Hence, there was no significant difference in the gas and slurry temperature (p value >0.05). It can be confirmed that due to the positive values for the scaling constants, an increase in any of the input parameters will increase the daily volume of biogas produced, provided the rest of the predictors are held constant.

Table 3 shows the analysis of variance (ANOVA) table used for the determination of the p value for the measured and predicted daily volume of biogas yield for both MLR and ANN models.

Table 3 Comparing measured and predicted biogas yield using ANOVA table

The sum of the square between (columns) and within (errors) the groups of the measured and predicted daily volume of biogas yield were 0 and 0.58332, whilst the total sum of the square was 0.58332. The degree of freedom between and within the measured and predicted daily volume of biogas yield were 1 and 98, whilst the total degree of freedom was 99. The sum of squares and degree of freedom ratio is the mean square between and within groups for the measured volume, and the model volume of biogas yield was 0 and 0.00595. The ratio of the mean square for the between and within a group was 4.19 × 10−29 and is equal to the F-statistic. The probability of the F-statistic is the p value and was 0.9999, whilst from Iweka et al. [43], an F value of 0.80 was reported.

Figure 3 shows the profiles of the measured daily volume of biogas yield and the predicted values using MLR during the hydraulic retention period. The figure revealed that the measured dataset for the daily volume of biogas yield exhibited no outliers with the predicted model output curve fitting line. Also, it is justified by the very good R2 (0.9688) and the very small RMSE (1.84 × 10−4) reported in the study; there is a good agreement between the measured target and model output of the biogas production in training, as also observed in Rossi et al. [26], Ithnin and Hashin [41], Tufaner and Demirci [8] studies. Abubakar et al. [45] reported a good fit of experimental data with the same high R2 (close to unity) and small RMSE values as in the present study. Hence, the R2 obtained includes 0.9932, 0.9972, 0.9992, and 0.9963, whereas the RMSE obtained includes 0.0202, 0.1371, 0.0093, and 0.0202 for the Cone, Transfert, Modified Gompertz, and Logistic models respectively using POLYMATH and Origin 2018 software.

Fig. 3
figure 3

Comparing the measured volume of biogas yield and the predicted using MLR

In addition, the one-way analysis of variance (ANOVA) was used to demonstrate that no significant difference existed between the measured daily volume of biogas yield and the predicted values. Figure 4 shows the analysis of variance (ANOVA) plots for the measured daily volume of biogas yield and the predicted values over the hydraulic retention period. The upper horizontal lines (black colour) of the ANOVA plots represented the upper limits for the measured biogas volume per day, which is 0.9624, and the predicted per day, which is 0.9448. The lower horizontal lines (black colour) of the ANOVA plots represented the lower limits for the measured biogas volume per day, which is 0.6558, and the predicted value per day, which is 0.62708. The 75th percentile of the ANOVA plots is the upper limit of the ANOVA block (blue colour), and the measured biogas volume is 0.8435 whilst the predicted is 0.83721. The 25th percentile of the ANOVA plots is the lower limit of the ANOVA block (blue colour), and the measured and predicted volumes were 0.7290 and 0.73558, respectively. The red horizontal lines on the ANOVA plots represent the median, and for the measured volume, the value was 0.77406, whilst the median for the predicted values was 0.78069. The ANOVA plots revealed that no outliers existed for the measured volume of biogas yield and the predicted values, and the p value between them was 0.9999. It can be depicted that the dataset was generally distributed for both the measured and predicted volume of biogas yield, and no significant difference existed between the two groups. Therefore, based on the very good values for the determination coefficient, p value, mean square error, and absolute error, the visual plots, as well as the ANOVA plots of the measured and predicted daily volume of biogas yield, it can be accepted to use the derived MLR model in predicting the volume of biogas yield with the selected input parameters as the predictors.

Fig. 4
figure 4

Comparing the measured and the predicted with MLR based on ANOVA plots

4.2.2 Governing concept of the derived ANN

ANNs can model linear or non-linear correlations by mimicking biological neural networks. The network system consists of input and output parameters interconnected through a series of neurons. The output signal of each neuron in the network is transmitted to the following layers by a transfer function (f). f calculates the overall output during the process, and this process also gives \(\hat{y}\) neuron output for n inputs and is expressed by Eq. 2.

$$\hat{y}=f\left(b+\sum\nolimits_{i=1}^n{w}_{i}{x}_{i}\right)$$
(2)

Where b is the bias of the hidden neuron, wi is the weight of the link between the input and hidden neuron, and xi is the input variable. In this study, the transfer functions used in hidden and output layers were the tangential sigmoid transfer function (tansig) given in Eq. 3 and the linear transfer function (purelin) given in Eq. 4, respectively.

$$f(x)=\frac2{1+e^{-2x}}-1;\;\left[-1,1\right]$$
(3)
$$f(x)=x;\;\left[-\infty,\infty\right]$$
(4)

An ANN-based feed-forward algorithm was developed in the MATLAB (R2021a) environment exploiting the Neural Network toolbox.

Fig. 5
figure 5

ANN configuration used to train the input and output dataset

Figure 5 shows the ANN architecture used with a configuration 6-10-1; the input and output matrices were determined to be 6 by 50 and 1 by 50, respectively. The input layer consists of 6 input parameters (Ir, RHr, Ts, Tg, pHr, Ta), each of 50 data points, and the output layer consists of 1 output parameter (V) with 50 data points.

The dataset is divided into three independent data blocks: training validation and test data. The ANN’s original dataset (whole dataset) was randomly divided into training, validation, and testing and, by default, was partitioned such that 70% (36 datasets) of the total 50 datasets were training datasets used to develop the ANN model. The remaining 30% were divided such that 15% (8 datasets) make up the validation and 15% (8 datasets) as testing datasets.

4.2.3 Derivation of the statistical parameters for the ANN

Table 4 shows the descriptive statistics upon the training of the input parameters (Ir, RHr, Ts, Tg, pHr, and Ta) and the target (V) using the Levenberg Marquardt (LM) as backpropagation (BP) algorithm for the feed-forward neuron network (FFNN) with an architectural configuration of 6 -10-1.

Table 4 Determination of p value and MSE for the biogas yield based on all datasets of ANN

The determined p value for the whole dataset (input and output parameters) used in training the ANN was 0.9103, as shown in Table 3. In addition, the determination coefficient, mean square error, and absolute mean error derived after training the whole dataset were 0.9999, 8.33 × 10−6, and 1.93 × 10−4, respectively. Table 5 shows the results of the correlation coefficients and mean square errors for the sample datasets for the training, validation, and testing. The excellent correlation values, which were close to 1, and the very small mean square errors, very close to 0, confirmed that the ANN model could predict the daily volume of the biogas yield.

Table 5 Training of input and target using BP FFNN with ANN configuration (6-10-1)

4.2.4 Established regression plots from the developed ANN

For the developed ANN regression plot, the original dataset of the ANN was divided randomly into training, validation, and testing. By default, 70% (36 datasets) of the total 50 datasets referred to the training dataset used to develop the model of the ANN. The remaining 30% were divided into 2 partitions: the validation (15%) and testing (15%) dataset. Figure 6 shows the derived regression plots obtained using the MATLAB ANN model upon completion of training.

Fig. 6
figure 6

Training, validation, testing, and all dataset ANOVA regression plot

In Fig. 6, during the training stage, the connection weights of the neural network were said to be adjusted to minimize the MSE on the training set. The regression plots for the target (measured volume of biogas yield) and the model output (predicted volume of biogas yield) for the training, validation, testing, and all datasets had very strong correlations of 0.9999, 0.9857, 0.9980, and 0.9971, respectively. These values are close to 1, indicating that the model output accurately predicts the measured target and almost fits the observed data line, similar to a study conducted by Abubakar et al. [14], where the R2 was reported as 0.9993. Also, the trained ANN can predict the daily volume of biogas yield. The findings reveal that the prediction of the ANN was strongly correlated with experimental data, which was also confirmed by Suberu et al. [46]. Furthermore, the modelled line (black dash lines) best fits the sample data (solid colour lines), which shows strong agreement and could predict biogas production outcomes and optimization in the fixed dome digester. The far scattering of some data points from the zero-line experience in Fig. 6 is likely due in part to the noise in the experimental data and might not be exclusively connected to the accuracy of the ANN model, according to Almomani [47].

Figure 7 shows the profile of the measured daily volume of biogas yield (target) and the predicted values using ANN during the hydraulic retention period. It is depicted that the measured dataset for the daily volume of biogas yield contained no outliers, and the predicted model output curve fitting line demonstrated an excellent prediction. This can also be justified by the excellent R2 of 0.9999 and RMSE of 8.33 × 10−6 between the measured (target) and model output. However, the RMSE in the range of 10−9 and 10−7 with the best target achieved around 10−8 and R2 of 0.996 /0.985 were obtained in Almomani [47] study. This was done for the training and testing data set separately for biogas production. The R2 in the present study was observed to agree with that of Almomani [47]. This is evidence that the ANN model is acceptable and does not have any systematic errors.

Fig. 7
figure 7

Comparing the measured volume of biogas yield and the predicted using ANN

Additionally, the one-way analysis of variance (ANOVA) was used to illustrate that no significant difference existed between the measured and the predicted volume of biogas yield. Figure 8 shows the ANOVA plots for the measured daily volume of biogas yield and the predicted values throughout the hydraulic retention cycle. The upper limits (upper horizontal lines (black colour) of the ANOVA plots) for the measured volume yield and the predicted daily values were 0.9624 and 0.9555. The lower limits (lower horizontal lines (black colour) of the ANOVA plots) for the measured target and the predicted value per day were 0.6558 and 0.6582. The 75th percentile of the ANOVA plots (upper limit of the ANOVA block (blue colour)) for the target and model output was 0.8434 and 0.8402. The 25th percentile of the ANOVA plots (lower limit of the ANOVA block (blue colour)) for the measured and predicted volumes was 0.7290 and 0.7291, respectively. The red horizontal lines on the ANOVA plots (median) for the target and model output were 0.7741 and 0.7728. The ANOVA plots revealed that no outliers existed for the measured targets and predicted values; hence, the p value was 0.9103. It can be confirmed that the dataset was generally distributed for both the target and model output, and no significant difference existed between the two groups. Therefore, according to the derived parameters (determination coefficient, p-value, mean square error, and absolute mean error), the visual plots of the targets and model outputs, and the ANOVA plots for the targets and model outputs, it can be accepted to use the derived ANN model in predicting the volume of biogas yield.

Fig. 8
figure 8

Comparing the measured and the predicted with ANN based on ANOVA plots

4.2.5 Summary of the comparison of the MLR and ANN model

In Table 6, the statistical analysis results of the data set modelled with MLR and ANN are presented, and the comparison of the prediction for the biogas production is made. It is interesting that the statistical analysis was performed on the test data. From Table 1, the statistical parameter (MLR and ANN) is close to 1; this confirms that the model was accurate. The same was reported in Tufaner et al. [48] and Willmott [49] studies. According to the Durbin-Watson Statistics, the difference between the MLR and the ANN is that the MLR shows positive autocorrelation; hence, this is not the case for the ANN model. Therefore, the estimation model is remarkable in both models, and therefore, both models predict the yield of biogas accurately, with the ANN model showing a better performance due to its higher R2 and smaller MSE value when compared with the corresponding values for MLR as also reported in Barik and Murugan [50], Suberu et al. [46]. According to Suberu et al. [46], the R value is an indication of the correlation between the measured target and the predicted value. Therefore, a higher value of R obtained indicates a closer relationship, and zero R represents a random relationship. However, the findings also agree with Tufaner and Demirci [8] study, where the ANN made less error than the MLR whilst predicting the optimum result and performing best. Categorically, it is shown that ANN results are more accurate than MLR results.

Table 6 Summary of the comparison studies via MLR and ANN models of prediction of biogas yield

Before concluding the study, it is necessary to provide a literature overview of the application and performance efficiency of the MLR and ANN model and include the present results for comparative purposes. This is presented in Table 7.

Table 7 An overview of the application and performance efficiency of ANN and MLR for predicting biogas generation rates from digesters from the literature compared to the current study results

5 Conclusion

The study confirmed that mathematical models, especially the MLR and ANN, can accurately predict biogas production rates in an underground biogas digester charged with cow manure. It was concluded that the input parameters used in both models constituted primary (Ir and RHr) and secondary factors (Ts, Tg, pHr, and Ta) based on the reliefF test and with reference to the volume of biogas yield as the output parameter. The ranking by weight of the predictors depicted that absolute global daily solar irradiance contributed the most whilst the contribution from the ambient temperature was the least. The ANN was demonstrated to give better prediction when compared with the MLR from the critical statistical results (determination coefficient, correlation coefficient, mean square error, and absolute mean error) achieved. The ANOVA plots for both the targets and model outputs for the derived ANN and MLR were used to show that the dataset was normally distributed, and no significant difference existed between the two groups (measured biogas yield (targets) and the predicted biogas yield (model outputs)) using the p values criteria.

Recommendation

The process of performing other anaerobic digestion cycles with the same cow manure is required to ensure the MLR model is validated, and the ANN is also evaluated with the dataset obtained from the hydraulic retention cycle. A future study on the application of using the ANN as an optimization tool for the performance of the underground digester charged with cow manure is of very high imperative. A real-time simulation application tool will be developed that will be used to simulate the biogas yield with the inputs parameters (Ir, RHr, Ts, Tg, pHr, and Ta) using both ANN and MLR to assist end users or manufacturers of bio-digesters in comparing the performance of the underground digester under different operating conditions.