1 Introduction

Along the entire length of a water body, several processes can influence the quality of its waters [28]. Historically, urban centers developed around freshwater rivers and lakes, so these water resources have played a major role in the development of populations [29]. Currently, the population faces a series of environmental, economic and social problems, which increases concern about the environment [28]. Driven by the standards of modern life and by continued population growth, the demand for water is increasing. This and other factors, such as the inappropriate disposal of solid waste and of industrial and urban effluents, contribute to a worrying increase in the pollution of water resources [41]. Therefore, it is of great interest that the pollution of these natural resources be controlled, reduced and monitored [31].

According to Piedras et al. [30], the Mirim Lagoon River Basin is one of the main transboundary river basins in South America. Located on the Atlantic coast of South America, more specifically between the extreme south of Brazil and the north of Uruguay, Mirim Lagoon is one of the most important freshwater lagoons in the Brazilian territory, being of great importance for the State of Rio Grande do Sul, both economically and ecologically. This important natural resource is the second largest water body in the country, occupying a territorial extension of 375 thousand hectares of water surface, shared between the Brazilian and Uruguayan territories [27, 42].

As a large part of the surrounding land is used for agricultural activities, Mirim Lagoon is exposed to several sources of pollution. These sources also include the release of industrial effluents and domestic sewage into the São Gonçalo channel, which connects Mirim Lagoon to Lagoa dos Patos [27]. In addition, with global population growth, surface waters such as rivers, lakes and ponds are susceptible to contamination from a wide variety of polluting sources, which degrades their quality [13, 14]. In this context, the quality of the waters of Mirim Lagoon has been degraded by the anthropic activities developed in its surroundings [12, 42]. Therefore, its conservation is necessary, as is the use of reliable techniques for monitoring the water quality of this lagoon.

Accurate assessment of the type and extent of water pollution is a difficult task, and one of the main problems faced in monitoring water quality is the complexity of analyzing a large number of variables [21, 23]. According to Alves et al. [2], the Water Quality Index (WQI) is a useful tool for representing the quality of water bodies. Developed by the National Sanitation Foundation (NSF), the WQI was introduced in Brazil primarily through the Environmental Company of the State of São Paulo (CETESB), which adapted it to the country’s specificities [3, 15]. The index adapted by CETESB consists of nine variables; however, by virtue of the different local characteristics of each region of Brazil, it can be adjusted to better portray the reality of each water body or, rather, of each hydrographic basin [2, 43].

More specifically, the WQI can be modeled for a specific situation; that is, it can be simplified by taking into account only those environmental variables that really affect the quality of a given source. This reduction in the number of variables, in addition to facilitating the construction of a quality index, makes it more economically attractive, enabling its use in several hydrographic basins in the country. Another advantage of using a WQI with few environmental variables is the reduction in the eclipse effect. This effect is characterized by the attenuation of, for example, the negative impact of one of the variables due to the stability of the others [38].

Another important tool that can be used very well for monitoring water resources is the use of statistical methods. These methods have the advantage of providing a simple and objective interpretation from a previously complex set of data. The application of these methods can facilitate the manipulation of this data, which helps to better understand the effects of the processes that occur in the water body [18, 49]. In addition, the application of advanced statistical methods is of great use to extract important information without significant loss of accuracy [48].

Several studies around the world have been carried out in order to model quality indexes that best suit the conditions of each region. Al-Janabi et al. [1] assessed the water quality of the Tigris River in Baghdad, using a Canadian water quality index.

One of the statistical methods that can be used to develop a new WQI equation for a given water body is multiple linear regression (MLR). An extension of simple linear regression, MLR is a statistical model, often applied to time series of data, that predicts the values of one or more response variables from a set of explanatory variables [26].

Therefore, this work aims to develop a new WQI equation for Mirim Lagoon through multiple linear regression analysis models and to evaluate the suitability of these models for monitoring the water quality of this water resource, as well as assessing whether the results of WQI generated by this new equation can be used with precision to monitor the quality of the waters of this lagoon.

2 Methodology

2.1 Study area

The Mirim Lagoon hydrographic basin is located on the Atlantic coast of South America, on the coastal plain of Rio Grande do Sul in the extreme south of Brazil. It occupies 47,362 km² of direct contribution and 55,110 km² when including the coastal strip, in which the main wetlands of the region are located, including Lagoa Mangueira, which is connected to Mirim Lagoon exclusively by the Taim hydrological system. This important lagoon constitutes the second largest water body with lake characteristics in the country and connects to Lagoa dos Patos through the São Gonçalo channel [27, 42].

With an average depth of 5 m and a length of 180 km along its central northeast-southwest axis, which is almost parallel to the direction of the most frequent wind, Mirim Lagoon was in the past affected by brackish water entering from Patos Lagoon through the São Gonçalo Channel, which harmed irrigated rice crops in the region. To prevent this, the Eclusa dam was built on the São Gonçalo Channel in 1977 [37]. Furthermore, with an estimated average outflow to Patos Lagoon of 700 m³/s, Mirim Lagoon can also connect to Mangueira Lagoon through swamps during flood events, forming the largest lagoon complex in South America [20, 37]. According to Munar et al. [25], the main tributaries of Mirim Lagoon are the Jaguarão, Taquari and Cebollati rivers, the first on the Brazilian side and the last two on the Uruguayan side. The region has a subtropical climate, with an average annual rainfall ranging from 1250 to 2000 mm and an average annual temperature between 14 and 18 °C [8].

The main use of its waters is direct extraction for irrigating rice fields in Brazilian and Uruguayan territory. In addition, after treatment, the waters of this lagoon are used for human consumption, mainly in the municipalities of Pelotas and Rio Grande [22, 36].

2.2 Collections, monitoring points and laboratory analysis

For the purposes of this study, a total of 22 sampling campaigns were carried out at 7 monitoring points over the years 2015, 2016 and 2017, totaling 154 samples. The monitoring points of this study are listed in Table 1. These points cover the two sides of the lagoon, in the directions Pelotas-Jaguarão and Pelotas-Santa Vitória do Palmar, as shown in Fig. 1.

Table 1 Mirim Lagoon monitoring points
Fig. 1 Mirim Lagoon monitoring points. Source: Author

All sampling campaigns were performed on the banks of Mirim Lagoon at the collection points.

For the collection of samples, in each campaign and at each monitoring point, the following flasks were used: two different plastic flasks previously sanitized and sterilized, a 2-L flask for samples intended for analysis of total solids, turbidity and pH and a 1-L flask for samples referring to nitrogen and phosphorus analysis. For the determinations of DO and BOD, two 250-mL glass vials containing a sealing system were used. For the microbiological analysis (thermotolerant coliforms), a 250-mL glass flask, properly sterilized, was used.

All collected samples were duly stored and sent to the water and effluent laboratory of the Mirim Lagoon Hydrographic Basin Development Agency (LMA), where the physical, chemical and microbiological analyses considered in the WQI calculation (CETESB) were carried out for the study of the water quality of this water body. All analyses were performed according to the methodology described by APHA [4]. The results of these analyses were taken from the LMA website, where they were publicly accessible during the period covered in this study.

2.3 Original WQI

As previously mentioned, this index originated in the United States, at the National Sanitation Foundation, out of the need to monitor the quality of water resources, and was adapted in Brazil by CETESB to better suit national water characteristics [15]. Both in the form developed by NSF and in its adaptation by CETESB, the WQI is calculated as the weighted product of its parameters, as shown in Eq. 1. This calculation results in a number from 0 to 100, with 100 representing the best quality.

$${\text{WQI}} = \prod\limits_{i = 1}^{n} q_{i}^{w_{i}}$$
(1)

where \(q_{i}\) is the quality of the ith parameter, also a number between 0 and 100, which is obtained from the quality graph as a function of the measured concentration, and \(w_{i}\) is the weight of the ith parameter, represented by a number between 0 and 1.
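As a minimal sketch of Eq. 1, the weighted product can be computed as shown below. The function name `wqi_weighted_product` and the sub-index scores and weights in the example are illustrative assumptions; they are not the values of Table 2 nor data from this study.

```python
import math


def wqi_weighted_product(q, w):
    """Weighted product of Eq. 1: the product of q_i raised to w_i."""
    if len(q) != len(w):
        raise ValueError("q and w must have the same length")
    if not math.isclose(sum(w), 1.0, abs_tol=1e-9):
        raise ValueError("weights are expected to sum to 1")
    return math.prod(qi ** wi for qi, wi in zip(q, w))


# Hypothetical sub-index scores (0-100) and weights, for illustration only.
example_q = [81.0, 100.0]
example_w = [0.5, 0.5]
example_wqi = wqi_weighted_product(example_q, example_w)  # sqrt(81 * 100)
```

Because the index is a product, a single very low score pulls the result down more strongly than it would in a weighted arithmetic mean.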

To perform the WQI calculation, 9 determinations are performed according to the adaptation made by CETESB of the parameters defined by the NSF. In Brazil, the nitrate parameter was replaced by total nitrogen and the total phosphate was replaced by total phosphorus, maintaining the same weights and quality curves established by NSF. Therefore, the parameters that constitute the WQI in Brazil are: temperature, dissolved oxygen (DO), thermotolerant coliforms, pH, biochemical oxygen demand (BOD), total nitrogen (N), total phosphorus, turbidity and total solids [3].

Each parameter considered within the WQI has a fixed weight in the calculation of the index, assigned according to its relevance to overall water quality. These weights are shown in Table 2.

Table 2 Relative weights of parameters for calculating the WQI.

The results obtained by calculating the WQI can be represented in quality ranges, as shown in Table 3. These ranges vary according to the state in which the studied water body is located, with the ranges for the state of Rio Grande do Sul being adopted here.

Table 3 Classification of the WQI for the state of Rio Grande do Sul.

For Centeno and Cecconello [6], the calculation of the WQI consists of the use of variables that reflect the changes that occurred in the region of the micro-basin of the water resource, be they anthropic or natural. For these authors, the WQI has the advantage of producing a value that is easy to communicate and understand, since the interactions and changes in the analyzed parameters, which have different units of measurement, are condensed into a single number.

2.4 Statistical analysis

First, the sample distribution of the data covered in this study was analyzed. For this, normality tests were performed. To accept or reject the normality of the data, the Kolmogorov–Smirnov (K–S) and Shapiro–Wilk (S–W) tests were used at a significance level of 0.05. For these tests, the null hypothesis (H0) considers the distribution to be normal, and for p value < 0.05 this hypothesis is rejected. The K–S test is based on the largest difference, in absolute value, between the theoretical and the empirical cumulative probabilities. The calculation of the S–W test is given in Eq. 2.

$$W = \frac{{\left( {\sum\nolimits_{i = 1}^{h} a_{n - i + 1} \left( {x_{(n - i + 1)} - x_{(i)} } \right)} \right)^{2} }}{{\sum\nolimits_{i = 1}^{n} \left( {x_{i} - \bar{x}} \right)^{2} }}$$
(2)

where n is the sample size, h = n/2 for even n and (n − 1)/2 for odd n, \(x_{(i)}\) is the ith ordered value of the sample, \(\bar{x}\) is the mean value of the sample, \(a_{n - i + 1}\) are constants generated from the means, variances and covariances of the order statistics of a sample of size n from a normal distribution, and W is the S–W test statistic.
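The K–S statistic described above, the largest absolute gap between the empirical and the theoretical cumulative probabilities, can be sketched with the Python standard library. This is only an illustration under stated assumptions: the normal parameters are estimated from the sample itself, the function name and the measurements are hypothetical, and no p value is computed (statistical packages provide that step, for example `scipy.stats.kstest` and `scipy.stats.shapiro`).

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(sample):
    """Largest absolute gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    # Assumption: the normal parameters are estimated from the sample itself.
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical measurements, for illustration only.
example_sample = [6.8, 7.1, 7.4, 7.0, 6.9, 7.3, 7.2, 8.5]
example_d = ks_statistic_normal(example_sample)
```

A larger value of the statistic indicates a larger departure from normality; whether it is significant depends on the sample size and on the reference distribution used by the test.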

After the analysis of the sample distribution, a correlation matrix with its coefficients and significance was used, in order to assess the correlation between the chemical and microbiological variables studied and those variables with the WQI. The correlation coefficient to be used depends on the sample distribution of the data, which is why normality tests were previously performed.

For data that follow a normal distribution, for example, Pearson’s coefficient is the most commonly used, whereas for data that are not normally distributed, nonparametric coefficients are recommended, such as Spearman’s rho, also known as Spearman’s coefficient [17].
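Spearman’s coefficient can be understood as Pearson’s coefficient applied to the ranks of the data rather than to the raw values. The sketch below illustrates this idea with the standard library only; the function names are hypothetical and tied values receive the average of their ranks, which is a common convention assumed here.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's coefficient: Pearson's coefficient computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ordering matters, any strictly increasing relationship gives a coefficient of 1 and any strictly decreasing one gives −1, regardless of whether the relationship is linear.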

For the modeling of a new WQI equation, multiple linear regression (MLR) analysis was used. This analysis typically relies on ordinary least squares (OLS), which fits a straight line (or, with several explanatory variables, a plane) to the scatter of the data by minimizing the sum of the squared residuals. In other words, OLS places the fitted line through the center of the data distribution so that the residuals are as small as possible. In this way, regression analysis makes it possible to quantitatively assess the relationship that specific variables have with a result of interest [47].
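To make the OLS idea concrete, the following sketch fits an intercept and slopes by solving the normal equations with plain Python. It is an illustration only, not the software used in this study; the function names are hypothetical, and in practice a statistical package would also report standard errors and significance.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(predictors, response):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated from y = 2 + 3*x1 - 1*x2, for illustration only.
example_x = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
example_y = [2 + 3 * x1 - x2 for x1, x2 in example_x]
example_coefficients = fit_ols(example_x, example_y)
```

With noise-free synthetic data the fit recovers the generating coefficients; with real monitoring data the residuals are nonzero and their size is what the model comparison evaluates.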

First, regression models will be evaluated using all variables that have a significant correlation according to the aforementioned correlation matrix and, later, the model equation that presents the best adjustments will be extracted. This equation, then, will be used in the calculation of a new water quality index for the water resource studied here, using only the variables established by the regression analysis.

With the new WQI equation obtained by the regression analysis and this new index for each sample already calculated, we proceed with the Student’s t test to assess whether there was a significant variation between the original WQI and the new WQI. This test will be done for the numerical results of WQI and for their quality ranges.

The t test for paired samples measures the significance of the variation between pairs of dependent samples. Since the results of the original WQI were used to perform the regression analysis, the results of the new WQI are not independent of them, hence the use of the paired t test.

This test, like all significance tests, evaluates a null hypothesis against an alternative. Here the null hypothesis states that there is no significant variation between the original WQI and the new WQI. If the p value is < 0.05, the null hypothesis is rejected at the 95% confidence level; if p > 0.05, the original and new indices are not significantly different.
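The statistic of the paired t test is the mean of the pairwise differences divided by its standard error, with n − 1 degrees of freedom. The sketch below computes only this statistic; the function name and the example values are hypothetical, and the p value would come from the t distribution in a statistical package (for example `scipy.stats.ttest_rel`).

```python
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for two paired samples."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / n ** 0.5
    return mean(diffs) / standard_error, n - 1


# Hypothetical index values for four samples, for illustration only.
example_original = [71.0, 72.0, 73.0, 74.0]
example_new = [70.0, 70.0, 70.0, 70.0]
example_t, example_df = paired_t_statistic(example_original, example_new)
```

For the 154 paired results of this study, the same calculation would use 153 degrees of freedom.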

3 Results and discussion

3.1 Original WQI

As already discussed, the WQI developed by NSF in the USA and adapted in Brazil by CETESB, hereinafter referred to as the original WQI, is calculated by the weighted product of each of its nine parameters. For this study, the WQI of 7 monitoring points in 22 collections was calculated, generating 154 WQI results, as shown in Table 4.

Table 4 Original WQI of Mirim Lagoon

In Brazil, the original WQI adapted by CETESB is widely used. Recent studies have used this WQI to assess the quality of Mirim Lagoon and found a likely influence of anthropic activities, such as the agricultural activities developed in its surroundings [39, 46]. Santos et al. [34, 35], also studying a water body located in southern Brazil, used this WQI and found that the index decreased as the water body approached the urban area, likewise indicating pollution probably caused by anthropic activities.

3.2 Normality test

All variables related to the original WQI, as well as the result of this index, were evaluated for their sample distribution. According to the K–S and S–W tests, most variables do not follow a normal distribution, as they obtained a p value < 0.05. Therefore, the correlation coefficient used in the next stage of the study will be Spearman’s coefficient.

3.3 Correlation matrix

Once the sample distribution of the data had been evaluated, the correlation matrix was computed. This matrix, with its coefficients and p values, can be seen in Table 5. The focus is on the correlations of the parameters with the final result of the original WQI, as these are the correlations that will be used to define the regression models later.

Table 5 Parameter correlation coefficients—original WQI and their respective significance

As can be seen in Table 5, the variables phosphorus, thermotolerant coliforms, DO, turbidity and total solids have a significant correlation with the result of the original WQI. Therefore, these are the variables that will be evaluated in the linear regression models later. Regarding the intensity of the correlations, the threshold defined by [19] is used, according to which correlations with a coefficient equal to or greater than 0.5, in absolute value, are strong. Thus, the phosphorus-WQI and thermotolerant coliforms-WQI correlations are strong, in addition to being significant. The intensity of the correlations is evaluated in absolute value because the sign only indicates direction: a negative coefficient indicates an inverse relationship and a positive coefficient a direct one. In other words, an increase in the concentrations of phosphorus and thermotolerant coliforms decreases the final WQI, while an increase in the DO concentration increases it.

Fia et al. [16] found high concentrations of phosphorus in their study on Mirim Lagoon, which corroborates the fact that this variable has a strong and significant correlation with the WQI of this water resource. The variables phosphorus, turbidity and total solids are associated with pollution from agricultural activities, with phosphorus being one of the main components of many chemical additives used in plantations [32]. As for the thermotolerant coliforms and DO variables, both can be related to the presence of pollution by organic matter, also causing a deterioration in water quality [45].

With regard to thermotolerant coliforms, their presence in surface waters may be related, according to Santos et al. [34, 35], to sources of agricultural pollution and domestic sewage. Tormam et al. [44], also in a study of Mirim Lagoon, corroborate this inference, showing that coliforms are associated with the degradation of water quality in this lagoon, mainly through the discharge of domestic sewage. Furthermore, the work developed by Drose et al. [10] also shows that Mirim Lagoon may be influenced by sources of anthropogenic pollution such as effluent discharge. This same study associates turbidity with physical and mineral pollution in Mirim Lagoon.

3.4 Linear regression analysis

Once the variables with a significant correlation with the original WQI were known, they were used to construct linear regression models. From the residual statistics of the regression models using these 5 variables, it was possible to infer that the model with the best fit was the one that used only the variables phosphorus, DO and thermotolerant coliforms; thus, only these three parameters will be used in the following analyses. Since phosphorus is representative of pollution from agricultural activities [32] and DO and thermotolerant coliforms are representative of pollution by organic matter [45], these three variables may well be considered to encompass the processes that influence the quality of Mirim Lagoon.

The summary of the models with the 5 variables that have a significant correlation with the original WQI can be seen in Table 6. As can be seen, the addition of predictors in models 1, 2 and 3 causes significant variation, which means that they add more information to the regression models. Although the addition of the turbidity variable has a significant effect on the regression models, this variable will not be used in the following analyses because of the results of the residual statistics, as stated above. The total solids (TS) variable has no significant effect in this MLR analysis, as can be seen in Table 6.

Table 6 Linear regression analysis—summary of the models

Still on Table 6, according to Wheelan [47], R² measures the proportion of the total variation that is explained by the regression equation of the model. For model 3 of the regression analysis in this study, R² is equal to 0.728, which means that this model explains 72.8% of the variation in the WQI.
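The coefficient of determination can be written as one minus the ratio between the residual sum of squares and the total sum of squares. A minimal sketch of that calculation follows; the function name and the example values are illustrative assumptions, not results from Table 6.

```python
def r_squared(observed, predicted):
    """Share of the variation in the observed values explained by the predictions."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical observed and predicted values, for illustration only.
example_r2 = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])  # 1 - 1/2
```

A value of 1 means the predictions reproduce the observations exactly, while a value of 0 means the model does no better than always predicting the mean.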

Siqueira et al. [40], in a study developed in the Parauapebas River, located in the northern region of Brazil, also used linear regression, in addition to principal component analysis, to assess the influence of the variables on the WQI. In this study, the authors found that the variable that most contributed to the variability of water quality was thermotolerant coliforms, indicating anthropogenic pollution, such as the discharge of organic effluents into the Parauapebas River. In addition, demonstrating the importance of statistical methods in the assessment of water quality, several recent studies use methods such as linear regression, principal component analysis, correlation and variance tests and multiple comparisons, among others, for the evaluation of water quality [9, 34, 35, 45, 48, 49].

Table 7 shows the ANOVA of these models. In this table, it is possible to observe the decrease of the residuals and the mean squares along the models and the significance of the variation from one model to another. This significance, confirmed by the ANOVA test, is measured with a 95% confidence level and a 0.05 significance level. Therefore, those models that obtain a p value of less than 0.05 have significant variability among themselves [5]. Model 3 presents the lowest residuals, demonstrating that this is the best model to be used. The ANOVA test confirms the significant variation between the models, that is, it confirms that there is information gain with the addition of predictors up to model 3, as already noted in the summary.

Table 7 ANOVA for regression models 1, 2 and 3

The coefficients of the linear regression equation, using model 3, can be seen in Table 8. From this table was extracted Eq. 3, which will henceforth be the new WQI equation for this study. It is worth emphasizing that, although the regression method can be used for different types of data from different water resources, the equation it generates is specific to the object of study of these data, that is, Eq. 3 generated here should not be used for another water resource other than Mirim Lagoon.

$$Y = 60.2 - 6.2\,X_{a} + 1.9\,X_{b} - 0.008\,X_{c}$$
(3)

where Y is the value of the new WQI, \(X_{a}\) is the concentration of phosphorus, \(X_{b}\) is the concentration of dissolved oxygen (DO) and \(X_{c}\) is the concentration of thermotolerant coliforms.
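Applying Eq. 3 to a sample is a direct substitution, sketched below. The function and argument names are hypothetical, the input values are invented for illustration, and the inputs are assumed to be expressed in the same units as the data used to fit the regression, which are not restated here.

```python
def new_wqi(phosphorus, dissolved_oxygen, thermotolerant_coliforms):
    """New WQI of Eq. 3, using the model 3 coefficients reported in the text."""
    return (
        60.2
        - 6.2 * phosphorus
        + 1.9 * dissolved_oxygen
        - 0.008 * thermotolerant_coliforms
    )


# Hypothetical sample values, for illustration only.
example_new_wqi = new_wqi(0.1, 8.0, 100.0)  # 60.2 - 0.62 + 15.2 - 0.8
```

Unlike the weighted product of Eq. 1, a linear equation is not bounded by construction, so inputs far outside the range of the fitted data could give values outside 0 to 100.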

Table 8 Coefficients of model 3 of the linear regression analysis

As can be seen in Table 9, the standardized residuals and the standardized predicted values are within a range of approximately three standard deviations, corroborating the adequacy of the generated equation. The distribution of the standardized residuals of the regression and the graph of the expected and observed cumulative probabilities are presented in Figs. 2 and 3, which demonstrate the fit of the regression.

Table 9 Residual statistics

Fig. 2 Distribution of standardized residuals of regression

Fig. 3 Graphic distribution of the expected probability versus the observed probability

3.5 New WQI

Using Eq. 3, the new WQI results for Mirim Lagoon were calculated. These results can be seen in Table 10. It is noteworthy that the original WQI uses nine parameters, while the new WQI uses only three.

Table 10 New WQI from Mirim Lagoon

Using the quality ranges described in Table 3 to classify the results of the new WQI and comparing these ranges with those relating to the original WQI, only in 19.5% of the results did the quality range change between the original WQI and the new WQI. In other words, in more than 80% of the 154 samples, there was no change in the quality range for the Mirim Lagoon WQI.
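The comparison of quality ranges amounts to classifying each pair of results and counting how many pairs change class. The sketch below shows this bookkeeping; the cut-offs and labels are placeholders assumed for illustration and should be replaced by the ranges of Table 3, and the function names and example values are hypothetical.

```python
from bisect import bisect_right

# Placeholder class limits and labels (assumptions); replace with Table 3.
CUTOFFS = [26, 51, 71, 91]
LABELS = ["very bad", "bad", "regular", "good", "excellent"]


def quality_range(wqi):
    """Return the label of the range that contains the given index value."""
    return LABELS[bisect_right(CUTOFFS, wqi)]


def share_changed(original, new):
    """Fraction of paired results whose quality range differs."""
    changed = sum(quality_range(a) != quality_range(b) for a, b in zip(original, new))
    return changed / len(original)


# Hypothetical paired results, for illustration only.
example_share = share_changed([80, 60, 95, 40], [75, 72, 93, 45])
```

In this invented example one of the four pairs changes class, giving a share of 0.25; the study reports the corresponding share for its 154 samples as 19.5%.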

3.6 T test

Finally, with the results of the original WQI and the new WQI available, it is assessed whether there was significant variability between the two, both for their numerical results and for their quality ranges. This analysis is done with the Student’s t test for paired samples, and its results can be seen in Table 11. The paired samples t test was used because, unlike the independent samples t test, it considers that the samples are dependent.

Table 11 T test for paired samples

The t test for paired samples considers a null hypothesis in which there is no significant variability between the components of the evaluated pair. Therefore, for p value < 0.05, the null hypothesis is rejected and the significance of this variation is confirmed. For the data in this study, the t test obtained a p value > 0.05, so H0 is retained, indicating that there is no significant variation between the original WQI and the new WQI, either in their absolute values or in their quality ranges.

The fact that there was no significant difference between the original WQI and the new WQI was precisely what was intended by this study, since it aimed at building a new WQI using fewer variables without significant loss of information. Reducing the number of variables analyzed for the WQI can generate a number of advantages, such as, for example, financial savings by river basin managers. Another example of the benefit caused by the reduction in the number of independent variables in the calculation of the WQI is the reduction in the eclipse effect, characterized, as already mentioned, by the attenuation of the impact of the variation of one of the parameters due to the stability of the others.

Ewaid et al. [13], in their study of the Tigris River in Baghdad, using this multiple linear regression (MLR) methodology, also obtained results from a new WQI that are comparable to the original WQI, demonstrating that the new equation modeled by them can be used to predict and monitor the quality of the waters of this river. Other studies, such as those by Kangabam et al. [24], Eregno [11], Chenini and Khemiri [7] and Saleem et al. [33], also used a similar methodology for assessing surface and underground water quality, attesting to the importance of the method and corroborating its viability.

4 Conclusion

The new WQI equation generated through MLR in this study uses only three variables, phosphorus, DO and thermotolerant coliforms, which facilitates the monitoring of this water resource by making it less financially costly. It is also concluded that statistical methods are a very useful tool in the monitoring of water quality, since it was through these methods that it was possible to model a WQI equation with fewer variables for this water resource.

Thus, the new WQI equation modeled by this study can be used to predict and monitor the quality of the waters of Mirim Lagoon with considerable precision. It should also be noted that this method can be widely applied to other water resources as long as they have prior historical monitoring.