Use of random forest for assessing the effect of water quality parameters on the biological status of surface waters

The Water Framework Directive aims to reach good status in European surface waters by 2027. Despite the efforts already made, the ecological status of surface waters has hardly improved during the last decades. In order to find efficient measures, there is an urgent need to better understand the linkage between anthropogenic factors and the indicators used in the ecological status assessment. Due to the complexity of the ecosystems, basic statistical methods (such as linear regression) cannot help in finding relationships between the biological quality elements and the supporting water chemistry parameters. The paper demonstrates that in these cases a data-driven machine learning method can be a promising tool for supporting biological classification. With random forest, the Gini index was used for ranking physico-chemical variables based on their influence on biological elements. The variables with the highest Gini index were selected for predicting the biological status of phytoplankton, phytobenthos and macrophytes. Predictions were performed both as binary classification and on a five-class scale. Prediction errors varied between 8% and 60% (median 33.3%). A comparative analysis was also made with logistic regression, which led to slightly worse predictions in some cases and slightly better ones in others. We concluded that, due to the significant errors, the biological status assessment cannot be replaced completely by model predictions, but the method is sufficient to fill in certain gaps in the data and can help in the planning of biological monitoring systems. The evaluation was performed with the Hungarian river and water quality databases.


Introduction
Freshwater ecosystems are key to maintaining biodiversity (Hooper et al. 2012).
Water ecosystems are especially vulnerable to disturbance and degradation due to anthropogenic pressures. They are the recipients of point and diffuse pollution and bear the negative effects of hydromorphological changes caused by human activity, mainly in the last century (Sabater et al. 2019). Climate change amplifies the effect of anthropogenic pressures; thus, both threaten the health of freshwater ecosystems. Climate change is expected to cause significant changes in the weather of central Europe (Behrens et al. 2010). Water temperature will increase and the quantity of rainwater will decrease, which will lead to a decrease in oxygen saturation and an increase in algal blooms, as well as in the concentrations of contaminants (Whitehead et al. 2009). To deal with these changes, better knowledge is needed on how water ecosystems work. The linkages between anthropogenic pressures and their effects on biological quality elements (BQEs) are usually nonlinear (Grizzetti et al. 2017); therefore, remedial interventions often do not achieve the expected goals. Considering multiple stressors complicates the understanding of the relationship between stressors and biology. Because of this complex relationship, taking into account only one stressor might lead to improper conclusions (Lyche-Solheim et al. 2013; Szomolányi and Clement 2022). Nevertheless, the combined effects of multiple stressors are rarely considered when planning remediation actions (Nõges et al. 2016).
In the European Union, the Water Framework Directive (European Commission 2000) determines the water policy. The main objective of the Directive is to achieve "good status" of all water bodies by 2027. In the case of surface waters, good status means good ecological and chemical status. Water bodies are categorized into five classes according to their ecological status: high, good, moderate, poor, and bad (European Commission 2000).
The overall status of surface water bodies is determined by the combination of ecological and chemical status. Ecological status depends on the following classification elements: biological elements, physico-chemical elements, hydromorphological elements, and river basin specific pollutants (European Commission 2000). The approach of ecological status classification is defined by the ECOSTAT guidance (European Commission Working Group 2A 2003). The Water Framework Directive requires the "one-out all-out" principle to be applied when determining the ecological status or ecological potential of surface waters (Fig. 1). This means that the lowest (worst) class of the considered variables determines the status of the water body. Thus, the water body is ultimately given the classification that is the worst of the results obtained for the biological, physico-chemical and hydromorphological characteristics, as well as for the examination of other specific pollutants. The "one-out all-out" principle also has to be applied to the biological classification; therefore, the outcome of the biological classification is given by the worst rating among the five biological quality elements.
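The "one-out all-out" rule described above can be illustrated with a minimal Python sketch. The function name and the dictionary layout are assumptions made for this illustration only; the class names follow the five-class WFD scale.

```python
# Illustrative sketch only: the function name and input layout are assumptions.
CLASS_ORDER = ["high", "good", "moderate", "poor", "bad"]  # best -> worst


def one_out_all_out(element_classes):
    """Return the worst status class among the assessed quality elements."""
    if not element_classes:
        raise ValueError("at least one quality element is required")
    # The class with the largest position in CLASS_ORDER is the worst one.
    return max(element_classes.values(), key=CLASS_ORDER.index)


# Hypothetical water body: one element in moderate status decides the outcome.
example = {
    "phytoplankton": "good",
    "phytobenthos": "moderate",
    "macrophytes": "high",
}
print(one_out_all_out(example))  # moderate
```

The same function could be applied first to the biological quality elements and then to the combination of biological, physico-chemical and hydromorphological results.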
Contrary to previous practice, when status classification had no legal consequence, the Water Framework Directive (WFD) not only requires the Member States to carry out a general status assessment but also requires the planning and implementation of remedial measures (Somlyódy 2011). It prescribes the development of a long-term action plan, based on the exploration of the linkage between water quality and human activities, taking economic considerations into account, under which the Member States are required to report regularly on measures taken and to be taken in the field of water management and protection (European Commission 2000).
Despite the efforts already made and the upcoming deadline, more than half of the European water bodies still do not reach good ecological status, primarily due to nutrient surplus from diffuse and point sources (European Environment Agency 2018). Rivers in Hungary are in even worse ecological status: 8.1% are in bad, 16.4% in poor and 62.7% in moderate status, and only 12.8% of the water bodies reach good or high status (GDWM 2021). The surveys of the Nitrate Report (Hungarian Ministry of Agriculture and Ministry of Interior 2020) found that 77% of Hungarian watercourses (typically because of the phosphate load) and 32% of lakes are eutrophic. The assessment was carried out in line with the WFD classification, by including the eutrophication-relevant biological elements (phytoplankton, phytobenthos and macrophytes) and physical and chemical parameters (nitrates, total inorganic nitrogen, phosphate and total phosphorus concentrations).
Biological classification plays a key role in ecological classification because of the one-out all-out principle. If BQEs show a weaker status than chemistry and hydromorphology, which happens regularly, then the ecological status will be worse than it would be if we only considered the physico-chemical status (European Commission Working Group 2A 2003). As biological classification is more complex than classification based on physico-chemical variables, the high proportion of rivers with moderate or worse status might be due to the results of the biological classification. In Hungary, the biological status of the water bodies (especially in the case of rivers) is significantly worse than their physico-chemical and hydromorphological status (GDWM 2021). This might reflect a lack of harmonization within the status assessment and nutrient thresholds that were set too high, as demonstrated in our former study (Szomolányi and Clement 2022). This problem is not unique to Hungary; many EU Member States struggle with it (Poikane et al. 2021).
Anthropogenic stressors play a key role in influencing the quality of surface waters. Changes in water quality induced by human activity can affect all biota groups of the WFD (phytoplankton, phytobenthos, macrophytes, benthic invertebrates and fish). Each biological community is sensitive to different pressures in different water types. This paper focuses on phytoplankton, phytobenthos and macrophytes. Phytobenthos is particularly sensitive to pollution and human impacts and less sensitive to hydromorphological changes (Szilágyi et al. 2008) in all watercourse types. In Hungary, the IPS (specific pollution sensitivity index, Coste in CEMAGREF 1982) and IPSITI indices are used; the acronym IPSITI combines IPS, SI (Austrian saprobic index, Rott et al. 1997) and TI (Austrian trophic index, Rott et al. 1999). IPS is used for Hungarian river types 1 (highland small rivers with steep bed-slope and siliceous geochemical aspect), 9 and 10 (Danube-sized lowland rivers), while IPSITI is used for all the other river types (highland small rivers with steep bed-slope and limy geochemical aspect, small-medium hilly and lowland rivers, large-very large hilly and lowland rivers) (GDWM 2021). Phytoplankton is not a good indicator in small, hilly rivers, because in upstream rivers the water residence time is short (Borics et al. 2007) and both zooplankton grazing-induced mortality (Garnier and Billen 1994) and the dilution rate (Billen et al. 1994) are high; it can, however, be used successfully to evaluate ecological status in large rivers (Borics et al. 2007). Phytoplankton is a good indicator of eutrophication (Hilton et al. 2006). The growth of macrophyte communities is influenced by nutrients, salinity, oxygen concentration, sediment characteristics and light (Barendregt 2003).
One of the challenges in defining water quality in accordance with anthropogenic stressors is to understand which factors are the most important in influencing the biological quality of watercourses (Khatri and Tyagi 2015). Basic statistical methods cannot help in solving this problem. Machine learning methods, on the other hand, are able to model complex and nonlinear relationships; examples include artificial neural networks (Banerjee et al. 2011), genetic algorithms (Babbar-Sebens and Minsker 2010), logistic regression and model trees (Holguin-Gonzalez et al. 2013), and random forest and gradient boosted regression trees (Valerio et al. 2021; Stock et al. 2018). The use of machine learning algorithms is still limited in the field of water quality management, in Hungary too, despite their well-known advantages and ease of applicability to multiple-stressor systems.
There are a few comparative studies in the field of water quality prediction and the identification of key water parameters that compare the accuracy of machine learning methods, and they all find that random forest is one of the most accurate (Alnahit et al. 2022; Chen et al. 2020; Visser et al. 2022; Nasir et al. 2022).
Alnahit et al. used and compared random forest and boosted regression tree to predict the long-term median value of water quality parameters such as total N, total P, and turbidity in the Southeast Atlantic region of the USA. The study found that both methods provided reasonable results, but random forest was easier to train and robust to overfitting. Partial plots were used to identify the impact thresholds (Alnahit et al. 2022).
Chen et al. compared the water quality prediction performance of 10 learning models on Chinese surface water quality data. Based on model accuracy measurements, decision tree, random forest and deep cascade forest had the best performance (Chen et al. 2020).
Visser et al. compared 11 machine learning models in terms of their predictive power and interpretability when predicting ecological quality ratios (EQR) of BQEs. The study found random forest and boosting to be the best choice considering every aspect of the models (Visser et al. 2022).
Most of the studies in the field of water quality prediction apply random forest for regression, not for classification. However, Nasir et al. (2022) used random forest classification for predicting the Water Quality Index and compared the results (based on accuracy, precision, receiver operating characteristic (ROC) curve, etc.) with other machine learning methods. Using a sample containing seven water quality parameters and some metadata, they found that most models performed similarly, with accuracy between 0.8 and 0.9, while random forest and CatBoost performed markedly better and logistic regression performed worse than the other methods.
The system of the ecological status assessment is linked to the drivers-pressures-state change-impacts-response (DPSIR) conceptual model: it allows the design of measures based on the description of the load-effect relationships. The Water Framework Directive allows the ecological status assessment to be carried out by expert estimation or modelling in the absence of measurement data; therefore, measurement can be replaced with modelling, or at least the sampling frequency can be reduced (European Commission 2000). Model-based solutions require monitoring data to be analysed in order to understand the functioning of freshwater ecosystems and to make predictions and forecasts; thus, they link ecology with informatics. Tree-based machine learning models based on monitoring data offer low-cost and time-efficient solutions for predicting the biological status of surface waters.
The objective of this paper is to demonstrate how random forest can be a promising tool for supporting biological classification. Biological monitoring is very expensive and time-consuming; however, the WFD allows model-based approaches to be used for estimating the condition of a given quality element if low confidence and precision may lead to misclassification (European Commission Working Group 2A 2003) and for establishing type-specific reference conditions and boundaries (European Commission Working Group 2.3 2003). With the application of prediction methods, the monitoring effort can be reduced or, with better predictions, data gaps can be diminished. In our study, the machine learning method was used for ranking the anthropogenic stressors in order to predict the biological status of watercourses (based on three BQEs: phytoplankton, phytobenthos and macrophytes). The background variables included in the prediction model are as follows: physico-chemical water quality parameters, catchment data (e.g., land use) and hydromorphological features. Biological status predictions were made in two ways: taking into account all five classes, and taking into account only the two most important categories (good or better/moderate or worse). The performance of the random forest model was compared to the performance of logistic regression.

Study area
The whole area of Hungary belongs to the Danube River Basin, where the climate is continental, temperate. The average annual temperature is 9.7 °C, the average annual precipitation is approximately 600 mm (Hungarian Meteorological Service 2021).
Within the 93 000 km² of the country 886 river water bodies were delineated in line with the WFD, which allows the identification and quantification of significant pressures and the classification of status. Data from each water body were used in the study. There are 1279 monitoring stations (GDWM 2021) in the country which were all included in the analysis. Biological and chemical sampling stations are indicated in Fig. 2.
Among the mandatory typological elements prescribed by the WFD (European Commission Working Group 2.3 2003), the height above sea level, the size of the water catchment area, the geology and, in addition to this, the roughness of the bed material and the size of the bed-slope were all used as selected characteristics to differentiate the Hungarian watercourses ("B" system). The water course typology is according to the "B" system described in the WFD. The details of the used typological elements are described in Table 1.
Ten river types were defined in Hungary. The types are differentiated according to their altitude, size of the catchment area, geology, sediment roughness, and bed slope.

Database
The study was performed with the Hungarian surface water quality monitoring database covering the period 2013-2017 (NEIS 2021). NEIS contains raw datasets for all monitoring sites, including water chemistry (measured concentrations) and biology (EQR values). Furthermore, the database of water body-related metadata compiled to support the River Basin Management Plan of Hungary was available (GDWM 2021). These two databases were merged by extending the monitoring-site-level data with the metadata available for the water bodies represented by the selected monitoring sites: river type, land use of the direct catchment area (derived from CORINE Land Cover, European Union, Copernicus Land Monitoring Service 2012), and the results of the water quality status assessment for all BQEs, the physico-chemical quality element and the hydromorphological status. In our study, sampling-site-level data of physico-chemical variables and metadata for all the selected biota (phytoplankton, phytobenthos and macrophytes) were used.

Random forest
Random forest, which was first introduced by Breiman (2001), is a machine learning method based on decision trees, which gives results by combining the results of many decision trees (by majority vote in classification and by averaging in regression). By using multiple trees, overfitting can be reduced.
The advantage of this method is that it gives more accurate results than single decision trees and there is a lower chance of overfitting, i.e., the model will work well not only on the training dataset but also on an unknown dataset (Breiman 2001).
The method develops a predetermined number of trees from the same database. Each tree is grown on a bootstrap sample selected by the bagging method, and at each split only a random subset of the variables is considered, so every step is randomized. Bagging (short for bootstrap aggregating) is a machine learning ensemble meta-algorithm aimed at improving the stability and accuracy of algorithms. With the use of bagging, variance can be reduced and overfitting can be avoided (Breiman 1996). During bagging, random samples are taken with replacement from the original dataset, thus creating a "new" training dataset for constructing each decision tree (Prasad et al. 2006). Samples that are not included in the bootstrap sample are called out-of-bag (OOB) samples, which can be used to calculate the OOB error to validate the model (Virro et al. 2022). In random forest, there is no need for cross-validation or a separate training and test dataset to get an unbiased estimate of the test set error (Breiman 2001).
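The bootstrap and out-of-bag idea can be sketched in a few lines of standard-library Python. This is an illustration of the principle only, not the implementation used in the study; the function names are assumptions, and the per-sample votes would in practice come from the trees that did not see the sample.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample (with replacement).

    Returns (in_bag_indices, out_of_bag_indices).
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob


def oob_error(true_labels, oob_votes):
    """Share of samples whose OOB majority vote differs from the true label.

    oob_votes[i] holds the classes predicted for sample i by those trees
    for which sample i was out-of-bag. Ties are broken arbitrarily.
    """
    wrong = 0
    counted = 0
    for label, votes in zip(true_labels, oob_votes):
        if not votes:  # sample was in-bag for every tree, so it is skipped
            continue
        majority = max(set(votes), key=votes.count)
        wrong += majority != label
        counted += 1
    return wrong / counted if counted else float("nan")


rng = random.Random(42)
in_bag, oob = bootstrap_split(1000, rng)
# For large n the OOB share is expected to be close to 1/e (about 0.37).
print(len(oob) / 1000)
```

Because roughly a third of the samples are left out of each bootstrap sample, every observation can be predicted by a sizeable subset of trees that never saw it, which is what makes the OOB error usable as an internal validation measure.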
There are many variable importance measures, e.g., chi-square (Mingers 1989), Mean Decrease Accuracy and Mean Decrease Impurity (Gini index) (Breiman et al. 1984). We used the Gini index, which reflects how often each variable is selected for a split and its overall discriminative value for the classification problem (Breiman et al. 1984), and compared it with the Mean Decrease Accuracy, which expresses how much accuracy the model loses when each variable is excluded (Breiman et al. 1984).
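To make the Gini-based importance more concrete, the following Python sketch computes the Gini impurity of a node and the impurity decrease achieved by one split. The mean decrease in Gini of a variable is obtained by summing such decreases over all splits made on that variable and averaging over the trees. The function names and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease of a split: parent impurity minus the
    size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy example: a split that separates the two classes perfectly.
parent = ["good"] * 4 + ["moderate"] * 4
left, right = ["good"] * 4, ["moderate"] * 4
print(gini_decrease(parent, left, right))  # 0.5
```

A variable that often produces large decreases of this kind receives a high Gini importance.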

Logistic regression
With logistic regression, a logistic model can be fitted to categorical data. Multinomial regression could be a good method to fit a logistic model to the five-class biological response variable, but because of the small sample size, we had to merge classes and use binomial logistic regression. Biological classes were merged into two classes: moderate or worse and good or better. Thus, binomial logistic regression was a good option for fitting a model with a binary response. The approach has the advantage of being applicable in situations with a weak relationship between the variables of interest (Kelly et al. 2022).

Setup of the data matrix
We selected several variables which are expected to have a significant impact on the status of phytoplankton, phytobenthos and macrophytes according to the literature. We expected that nutrients, parameters describing the oxygen regime, and suspended matter (via light limitation) have a strong relationship with phytoplankton (Hilton et al. 2006; Mischke et al. 2018). Phytobenthos tends to show a strong relationship with nutrients and organic pollution (Várbíró et al. 2012). A significant relationship was expected between macrophytes and nutrients, oxygen concentration, hydromorphology and suspended matter (via light limitation) (Barendregt 2003). As eutrophication is a general problem detected in Hungarian rivers too (Hungarian Ministry of Agriculture and Ministry of Interior 2020), nutrient forms are included among the variables with greater emphasis, as they best indicate the progress of the eutrophication process.
We only considered variables which are not included in the BQEs (i.e., chlorophyll-a concentration was excluded) and which do not directly correlate with each other (like dissolved oxygen concentration and oxygen saturation), in order to avoid distortion. However, we disregarded the relationship between nutrient forms, as well as the correlation between electrical conductivity and chloride ion concentration (as chloride makes up just a fraction of the measured conductivity). pH was deliberately left out because it is affected by the photosynthetic activity of the plants. Studies show that correlation between variables does not affect the result of the model significantly (Nicodemus et al. 2010). Variables for which the data gap exceeded 50% were removed (for example, DOC was dropped during this step). After the data cleaning, our data matrix contained the following variables: EQR of the BQEs, watercourse type, and the selected background variables that are described in Table 3.
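The data-gap filter can be expressed as a short Python sketch. The record layout (a list of dictionaries with `None` for a missing measurement), the function name and the example values are assumptions made for illustration and are not taken from the study's database.

```python
def drop_sparse_variables(rows, max_missing_share=0.5):
    """Remove variables whose share of missing values exceeds the threshold.

    rows: list of dicts mapping variable name -> value (None = missing).
    """
    if not rows:
        return []
    variables = sorted({name for row in rows for name in row})
    n = len(rows)
    keep = [
        name
        for name in variables
        if sum(row.get(name) is None for row in rows) / n <= max_missing_share
    ]
    return [{name: row.get(name) for name in keep} for row in rows]


# Made-up records: DOC is missing in 3 of 4 rows, TP in 1 of 4.
records = [
    {"TP": 0.2, "DOC": None},
    {"TP": 0.3, "DOC": None},
    {"TP": None, "DOC": 4.1},
    {"TP": 0.1, "DOC": None},
]
cleaned = drop_sparse_variables(records)
print(sorted(cleaned[0]))  # ['TP']
```

In this toy example DOC is removed because its data gap is 75%, whereas TP is kept with a 25% gap.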
Note that hydromorphological status is derived from morphological status, hydrological status and continuity, and is classified on a five-class scale. Morphological status assessment involves riverbed modifications, the occurrence of artificial substances in the bed and/or the shore (sealed surfaces), silting, land use on the catchment area, and the linkage between the water body and the floodplain. Continuity is affected by hydraulic structures which obstruct the longitudinal and transversal continuity. Hydrological status assessment involves the effect of backwaters on the water body, the effects of water withdrawal, retention by reservoirs and hydropeaking. From continuity, morphological and hydrological status, the overall hydromorphological status is derived based on the one-out, all-out principle (GDWM 2021).
Land use data refer only to the direct catchment area of the rivers (excluding tributaries and upstream river stretches of the same river). We considered three land use categories: intensively used arable lands, extensively used pastures, and heterogeneous agricultural areas. The category of intensively used arable lands also includes permanent crops like vineyards, fruit trees and berry plantations.
Some of the background variables were not used in the analysis of large, very large and Danube-sized rivers (Hungarian types 4, 7, 8, 9, 10):
• Proportion of agricultural areas (heterogeneous agricultural areas, intensively used arable lands, extensively used pastures) on the catchment, as we only had these data for the direct catchment of the water body (we did not consider the catchment of the tributaries), which may underrepresent the entire catchment,
• Electrical conductivity, as in big rivers pollution coming from controllable human sources has no significant effect on this parameter; rather, conductivity is determined by the geological aspects of the river catchment,
• Water temperature, as human impacts (e.g. thermal water discharges) cannot modify the water temperature significantly in big rivers.
The analyses were performed for each type separately and also for combined type groups. The reason for merging type groups is the small number of samples in some types and, consequently, the differing sample number for each type. Predictions were not made for small rivers (types 1, 2, 3, 5, 6) in the case of phytoplankton and for large, very large and Danube-sized rivers (types 4, 8, 9, 10) in the case of macrophytes, as these BQEs are not relevant in the mentioned watercourse types.
From the variables described in Table 3, we chose the five that have the highest importance for each biota in each river type according to the Gini index (see Table 4). The Gini index was obtained from a random forest prediction made with all the variables for each river type (with the exception of large, very large and Danube-sized rivers) and each BQE.
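The selection step amounts to sorting the variables by their importance score and keeping the first five. A minimal Python sketch follows; the function name and the importance values are hypothetical and serve only to show the mechanics, they are not results of the study.

```python
def top_k_variables(importance, k=5):
    """Return the k variable names with the highest importance score."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical mean-decrease-in-Gini scores for one river type and one BQE.
importance = {
    "TP": 12.4,
    "BOD5": 10.1,
    "chloride": 9.7,
    "NO3-N": 8.2,
    "suspended matter": 7.9,
    "water temperature": 5.0,
    "continuity": 1.3,
}
print(top_k_variables(importance))
```

The selected names would then be used as the predictor set of a second random forest run for the same river type and BQE.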

Computation
The predictive model and variable ranking were performed in R 4.2.0 (R Core Team 2022) with the randomForest package, version 4.7-1.1 (Liaw and Wiener 2002) (see the steps of the methodology in Fig. 3). Besides the mentioned statistical tool, various packages were used for data manipulation (tidyverse version 1.3.1 (Wickham et al. 2019)) and data visualization (ggplot2 version 3.3.6 (Wickham 2016), ggpubr version 0.4.0 (Kassambara 2020)). As random forest does not need a separate training and test dataset (Breiman 2001), the model was trained on the entire dataset. Hyperparameters were tuned for each run with the tuneRF function of the randomForest 4.7-1.1 package (Liaw and Wiener 2002). The number of trees was 50; the number of variables randomly sampled as candidates at each split varied between 3 and 20 for the five-class predictions and was set to four for the binomial predictions.
Five variables were selected according to the Gini index (mean decrease in Gini) with random forest for each water type and each BQE. We predicted the biological status classes from the chosen variables. The number of variables selected as predictors was defined arbitrarily. It must be sufficient to represent the complexity of the riverine ecosystems; however, a higher number of predictors may cause overfitting and, as we found, does not necessarily increase the accuracy of the model, while fewer predictors also lead to higher error rates.
Fig. 3 Flowchart of the steps of the methodology
Two types of predictions were made: first we predicted the biological status on a five-class scale (bad/poor/moderate/good/high classes), then we compared the results with predictions with a binary outcome (good or better/moderate or worse classes). The reason for making two types of predictions is that misclassifications do not all have the same consequence, since water quality improvement is only needed when the water body does not reach good status. The most important boundary is the one between moderate and good status; therefore, binomial random forest predictions could be used for deciding whether the water body reaches good status or not. Five-class predictions could be used for designing remediation actions.
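The relation between the two prediction scales can be shown with a small Python sketch. The function names are assumptions made for illustration; the class grouping follows the text (good or better versus moderate or worse).

```python
GOOD_OR_BETTER = {"high", "good"}
MODERATE_OR_WORSE = {"moderate", "poor", "bad"}


def to_binary_status(status_class):
    """Map a five-class status to the binary scale used for the second model."""
    if status_class in GOOD_OR_BETTER:
        return "good or better"
    if status_class in MODERATE_OR_WORSE:
        return "moderate or worse"
    raise ValueError(f"unknown status class: {status_class!r}")


def crosses_good_boundary(observed, predicted):
    """True if a misclassification changes the answer to the question
    'does the water body reach good status?'."""
    return to_binary_status(observed) != to_binary_status(predicted)


print(crosses_good_boundary("poor", "bad"))       # False
print(crosses_good_boundary("good", "moderate"))  # True
```

A five-class error such as poor versus bad leaves the need for measures unchanged, whereas confusing good with moderate changes it, which is why the binary model targets the good/moderate boundary directly.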
The accuracy of the predictions was assessed with the out-of-bag (OOB) error, which is an unbiased estimate of the true prediction error. However, in the case when the number of subjects is not much fewer than the number of variables, the OOB error overestimates the true error, i.e., the random forest actually performs better than the OOB error indicates (Mitchell 2011). Out-of-bag samples have the advantage of providing internal accuracy estimates without separating the dataset into a training and a test set (Prasad et al. 2006).
Random forest predictions which only considered the "good or better" and "moderate or worse" classes were compared with predictions made with a benchmark binomial logistic regression. The MASS package version 7.3-58.2 (Venables and Ripley 2002) was used for the logistic regression. Multinomial logistic regression could not be performed because of the small sample size. For turning the logistic regression output into class predictions, the standard cut-off value of 0.5 was used, meaning that if the predicted probability is greater than 0.5, then the observation is assigned to the class being modelled.
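The cut-off rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the fitted model of the study: the coefficients are hypothetical, and treating "good or better" as the modelled class is an assumption of this sketch.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Logistic model: probability of the modelled class for one observation."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, cutoff=0.5):
    """Assign the modelled class only if the probability exceeds the cut-off.

    Assumption of this sketch: the modelled class is 'good or better'.
    """
    return "good or better" if probability > cutoff else "moderate or worse"


def error_rate(observed, predicted):
    """Share of observations whose predicted class differs from the observed one."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(o != p for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical coefficients for two standardized predictors.
p = predicted_probability(0.3, [-1.2, 0.4], [0.5, 1.0])
print(classify(p))
```

The resulting error rate is directly comparable with the OOB error of the binary random forest, since both express the share of misclassified observations.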
R script for the random forest and binomial logistic regression predictions can be found as Supplementary material S1.

Limitations of the study
The database we used has some shortcomings. Variables were available with different scaling, which can cause distortion in the predictions. In order for the data to undergo uniform processing, we did not add additional data to the database (e.g., land use categories were not extended).
We only had land use information for the direct catchment area of the water bodies, which did not include the land use of the catchments of the tributaries and upstream stretches. Therefore, pollution stemming from the land use of the whole upstream catchment area is not considered, although it might have an influence on the biological status of the water body. It should be noted, however, that the direct catchment area, especially the zone extending a few hundred meters away from the shoreline, has a much higher influence on the water quality (Szpakowska et al. 2022).
Data about potential local sources (e.g. urban and industrial areas), which are not necessarily reflected in the background variables (e.g. water chemistry), were omitted, because only the areal proportion based on the land use was available, which does not provide information on the intensity of the activity.
We used five predictor variables uniformly for each prediction, although in some cases a different number of variables would have been justified based on the Gini importance rankings. Predictions were tested with fewer and also with more than five variables. In some cases, the error decreased with the change in the number of predictor variables, but in other cases it increased. As no connection could be found between the OOB error and the number of variables, it has been set arbitrarily to 5. The changes in the OOB error with the number of variables are presented through the example of phytobenthos in Supplementary material S2.

Variable importance
The variable analysis of the predictive model revealed that some of the variables are not as important for the biological status as was assumed (for example, hydromorphological status), while other stressors, like electrical conductivity and water temperature, are more important than expected. The ranking of the variables according to their importance for the biological quality elements is shown in Figs. 4, 5, and 6.
From the variables that appear in Figs. 4, 5, and 6 and from the variable ranking for each water type, we selected the five with the highest relative importance for the predictions. The top five variables differed between the five-class and the binary classification models. The chosen variables can be seen in Tables 4 and 5. In the case of river types 1 and 2 and macrophytes, random forest was not applicable, as the database only contained data from one biological class.

Predictions
We used the random forest algorithm to predict the status class of biological quality indicators. The prediction was made with the five chosen variables, which varied by river type and BQE (defined in paragraph 3.1).
The errors of the predictions for phytoplankton, phytobenthos and macrophytes biological status classes are listed in Table 6.
As a benchmark for the random forest, predictions were made with binomial logistic regression for the "good or better"/"moderate or worse" status classes. Predictions were only made for the merged river types (lowland small watercourses, highland small watercourses, big rivers), as there were not enough data to perform the analysis for each type alone. The errors of the logistic regression are listed in Table 7. Because of the limited sample number, multinomial logistic regression for the five-class predictions could not be performed.

Discussion
With the random forest we successfully ranked physico-chemical parameters according to their impact on the biological status for each BQE and each river type, which can help us select the stressors which are responsible for water quality deterioration. With the knowledge of the most important stressors, it becomes easier to create adequate water management policies. We used the mean decrease in Gini to choose the five most important variables for each biota and each river type, although the variable orders given by the mean decrease in accuracy were very similar.
The study revealed that reducing nutrient loads remains vital, but is not the only tool in the fight against eutrophication (Istvánovics and Honti 2012).
The ranking gave different results for each group of organisms. In the case of phytobenthos, BOD5, CODMn, nutrients and suspended matter concentration are the most important variables, in accordance with the fact that diatoms are good indicators of organic pollution, eutrophication, and salinity (Sládeček 1986; Martín and de los Reyes Fernández 2012).
The ranking of the stressors revealed that inorganic nutrients, suspended matter concentration, dissolved oxygen and chloride are very important variables for both phytobenthos and phytoplankton. However, the percentage of agricultural areas on the catchment (indicating non-point loads) affects phytobenthos more strongly: this factor is among the top three predictor variables for most water types. This is supported by other studies (Trábert et al. 2020; Birk et al. 2020). Nutrient forms, land use on the catchment, CODMn and electrical conductivity also showed high importance in classifying macrophyte status.

Fig. 4 Ranking of variable importance according to the mean decrease in impurity (Gini importance) and mean decrease in accuracy in the case of phytobenthos in the merged river types. Abbreviations: prop. of int. used arable land: proportion of intensively used arable lands; prop. of ext. used pastures: proportion of extensively used pastures; prop. of heterogeneous agricultural areas: proportion of heterogeneous agricultural areas

Fig. 5 Ranking of variable importance according to the mean decrease in impurity (Gini importance) and mean decrease in accuracy in the case of phytoplankton in large rivers
Many EU countries use different metrics to measure organic matter content. The variable ranking of the random forest can determine the order of importance among these metrics. Based on the importance ranks, we identified BOD5 as the best metric.
In most cases, the hydromorphological characteristics were placed at the back of the ranking. Continuity, hydromorphological status and morphological status were included among the five selected variables in one case each (for macrophytes and for phytobenthos). This indicates that although, based on the literature, hydromorphological effects are the determining stressors for rivers (Nõges et al. 2016), water chemistry and, through this, the presence of pollution have a stronger influence on the condition of the studied groups of organisms than hydrological and morphological factors.
On a five-class scale, random forest predictions were made with 47.9-53.3% OOB error for phytoplankton, 18.7-60.4% OOB error for phytobenthos, and 8.3-44.3% OOB error for macrophytes biological status. Macrophytes status class predictions tended to be the best considering the mean OOB error, although this is probably due to overfitting induced by the small sample size. Phytobenthos status class predictions tended to be better than predictions for phytoplankton.
Binomial status predictions tended to be better: the OOB error was between 27.3 and 33.5% for phytoplankton, 16.2-50.0% for phytobenthos, and between 8.3 and 33.3% for macrophytes.
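How an out-of-bag (OOB) error of this kind is obtained can be shown with a small, self-contained sketch. It is an assumption-laden illustration rather than the code used in the study: the function name, the bootstrap index sets and the per-tree predictions are invented. Each sample is classified by majority vote of only those trees whose bootstrap sample did not contain it, and the OOB error is the share of samples misclassified in this way.

```python
from collections import Counter


def oob_error(y, in_bag, tree_preds):
    """Out-of-bag misclassification rate of an ensemble of trees.

    y          -- true class label of each sample
    in_bag     -- in_bag[t] is the set of sample indices used to train tree t
    tree_preds -- tree_preds[t][i] is the class predicted by tree t for sample i
    """
    wrong = 0
    counted = 0
    for i, truth in enumerate(y):
        # Only trees that did not see sample i during training may vote.
        votes = Counter(
            tree_preds[t][i] for t in range(len(in_bag)) if i not in in_bag[t]
        )
        if not votes:
            continue  # sample was in the bag of every tree, so it is skipped
        counted += 1
        # Ties are resolved in favour of the class encountered first.
        if votes.most_common(1)[0][0] != truth:
            wrong += 1
    if counted == 0:
        raise ValueError("no sample is out of bag for any tree")
    return wrong / counted


# Invented toy data: four samples, three trees (G = good, M = moderate).
y = ["G", "M", "G", "M"]
in_bag = [{0, 1}, {1, 2}, {2, 3}]
tree_preds = [
    ["G", "M", "M", "M"],
    ["G", "M", "G", "M"],
    ["G", "G", "G", "M"],
]
print(oob_error(y, in_bag, tree_preds))
```

Because every sample is evaluated only by trees that were not trained on it, the OOB error serves as an internal estimate of the prediction error without a separate test set, which is convenient when the sample size is small.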
The predictor sample size could affect the accuracy of the model (Chen et al. 2020). The model performance can be improved with a larger set of data, but random forest should provide satisfactory performance with a limited-size dataset (Prusa et al. 2016; Chen et al. 2020). We found five predictors satisfactory for the models, because a higher number of predictors did not necessarily increase the accuracy of the model, while fewer predictors may lead to higher error rates. The error is not proportional to the number of variables. Even when the variable importance figures indicate that the right number is not five, the error might be bigger or smaller. For example, in the case of phytobenthos and highland small rivers (types 1, 2, 3), according to the variable importance plot (Fig. 4), BOD5 concentration, dissolved oxygen concentration and proportion of extensively used pastures are the most important variables, standing apart from the rest, and all the other variables have nearly the same importance. Despite this, the OOB error gets higher when only the top three variables are considered. Because of the lack of relationship between the OOB error and the number of variables, an optimal function for the variable number could not be determined in general. Changes in OOB error with the number of predictor variables in the case of binary classifications of phytobenthos are shown in supplementary material S2.

Fig. 6 Ranking of variable importance according to the mean decrease in impurity (Gini importance) and mean decrease in accuracy in the case of macrophytes in small rivers. Abbreviations: prop. of int. used arable land: proportion of intensively used arable lands; prop. of ext. used pastures: proportion of extensively used pastures; prop. of heterogeneous agricultural areas: proportion of heterogeneous agricultural areas
The results of logistic regression and random forest models cannot be compared properly due to their different operating mechanisms and efficiency indicators. Binomial logistic regression was used only as a comparative analysis for the random forest models. The errors of the binomial logistic regression for the biological status class predictions were between 21.2 and 45.1%, very similar to the OOB errors of the random forest predictions in the cases of phytoplankton and phytobenthos; however, the differences between the two models were bigger in the case of macrophytes. Due to the small sample size, logistic regression, in contrast to the random forest approach, could not be applied for multinomial classification. Compared with other studies (Nasir et al. 2022), our random forest model showed a higher error rate, and in our case logistic regression gave similar predictions for the binomial classification problem.

Conclusion and outlook
The research attempted the application of a machine learning technique to predict the biological status of rivers based on environmental factors (e.g. basic water chemistry, hydromorphology and catchment properties), aiming to show that with a good prediction, multiple stressors can be more easily taken into consideration in water policies. Random forest with the Gini index was applied for ranking the background variables according to their relative importance for the biological quality elements (phytoplankton, phytobenthos and macrophytes). The variables identified by the model for each biota corresponded to the expectations known from the literature. Nutrients, BOD5, dissolved oxygen and chloride are very important variables for both phytobenthos and phytoplankton. Macrophytes are rather influenced by nutrient forms and electrical conductivity. Land use had a significant effect on both phytobenthos and macrophytes. After selecting the most important variables, the random forest algorithm was used to predict the biological status for each biota in each relevant river type. In some cases it performed very well (8.3% error), while in other cases the prediction was poor (the biggest error is 60.4%). Based on these error rates, it is clear that the predictive model is not sufficient to replace biological classification completely. However, in the case of certain waterbody types (types 4, 7, 8, 9, and 10 for phytobenthos, and type 5 for macrophytes), the data gaps can be reduced by predictions. Predictive methods can also be used in the planning of monitoring systems (for example, the status of water bodies with bad, poor or high status classifications can be predicted with small error, thus the expenditures for sampling these water bodies would be smaller). The paper showed that random forest is able to describe the behaviour of river ecosystems and model their biological status with similar precision to logistic regression. However, logistic regression, in contrast to the random forest approach, could not be applied for multinomial classification due to the small sample size.
The improvement of the predictor models will be the subject of future research. With a larger dataset (for example, EU-level data from the WISE system), the model performance could be tested. Including hazardous substances could also be interesting, but because of the lack of data in the period 2013-2017, it was not feasible. Eliminating the limitations described in paragraph 2.7 can also improve the performance of the model. The multiple stressor analyses demonstrated in this paper provide useful insights into how complex freshwater ecosystems work. The variable order produced by the method can help improve the quality of freshwater ecosystems by considering the most important stressors during the implementation of water management policies. Thus, the effectiveness of current policies can be improved. This does not mean that only the five most important variables identified by the random forest for each river type and each BQE have to be measured in rivers, but these should be measured more frequently and regulated more strictly.

Fig. 1 An example of how certain parameters may be combined to estimate the status of a biological quality element and the status of the water body. Letters in the squares indicate the five status classes: H = High, G = Good, M = Moderate, P = Poor, B = Bad (Modified from source: European Commission Working group 2A 2003)

Fig. 2 Map of Hungary with surface water bodies, and monitoring stations

Table 1
Elements used for the Hungarian watercourse typology (Modified from source: GDWM 2021)
slope (GDWM 2021). The applied dataset represents all Hungarian river types. The properties of the types are presented in Table 2. In the study, certain river types were combined. We formed groups by merging certain water types with similar attributes. The reason for merging type groups is the

Table 2
River types used in the study

Table 3
Background variables of the final database that were used for the predictions
hydromorphological features
Status based on morphological features
Status based on hydrological features
Status based on longitudinal and transversal continuity
Proportion of intensively used arable lands on the direct catchment
Proportion of extensively used pastures on the direct catchment
Proportion of heterogeneous agricultural areas on the direct catchment

Table 4
Variables ranked based on their overall discriminative values according to the random forest

Table 5
Variables ranked based on their overall discriminative values according to the random forest
Predictions for binary classification were made with these variables

Table 6
Errors of the random forest predictions. The number of samples is indicated in brackets. OOB error: out-of-bag error; n.r.: not relevant; NA: not applicable, as only one status class was in the database for river types 1, 2

Table 7
Errors of the predictions made with binomial logistic regression compared to the errors of random forest. BLR: binomial logistic regression; n.r.: not relevant; RF: random forest