CleverRiver: an open source and free Google Colab toolkit for deep-learning river-flow models

In a period in which climate change is significantly altering rainfall regimes and their intensity all over the world, river-flow prediction is a major concern in the geosciences. In recent years there has been an increase in the use of deep-learning models for river-flow prediction. However, two main issues can be observed in this field: i) many case studies use similar (or the same) strategies without sharing the code, and ii) the application of these techniques requires good computer skills. This work proposes a Google Colab notebook called CleverRiver, which allows the application of deep-learning to river-flow prediction. CleverRiver is dynamic software that can be upgraded and modified not only by the authors but also by the users. The main advantages of CleverRiver are the following: the software is not limited by the client hardware, operating system, etc.; the code is open-source; the toolkit is integrated with user-friendly interfaces; updated releases with new architectures, data management, and model parameters will be progressively uploaded. The software consists of three sections: the first makes it possible to train models with a choice of architectures, parameters, and data; the second allows the user to create predictions with the trained models; the third allows the user to send feedback and share experiences with the authors, providing a flow of valuable information that can improve scientific research.


Introduction
River-flow prediction is an important tool for early flood warning, water resource management, water demand assessment, irrigation, agriculture, and hydroelectric power generation. These aspects become more and more critical as climate change causes variations in rainfall regime and land use in many areas (Merz et al. 2014; Deitch et al. 2017; Blöschl et al. 2019). In particular, extreme weather events produce flash floods, floods, and debris flow phenomena. These have relevant socio-economic implications and represent a significant scientific issue, as confirmed by the extensive literature on the subject (Bates et al. 2008a, b, 2012; Gaume et al. 2016; Bryndal et al. 2017; IPCC 2018).
In recent years, we have observed an increase in the use of deep-learning in geosciences and in particular in river-flow prediction, with promising results (e.g., Boulmaiz et al. 2020; Chattopadhyay et al. 2020; Kratzert et al. 2018; Luppichini et al. 2022; Sit et al. 2020). The implementation of suitable run-off models is made difficult by the complexity of natural systems and by the environmental information available (Jaiswal et al. 2020). Furthermore, each physically based model is limited by the inevitable simplifications of the modeled system (Antonetti and Zappa 2018). Deep-learning models make it possible to manage complex systems without having to introduce such simplifications; information is instead extracted directly from the data. These procedures are the most appropriate for addressing the noisy and chaotic nature of time-series forecasting problems (Livieris et al. 2020).
Long short-term memory (LSTM) and convolutional neural networks (CNNs) are the most common and most efficient deep-learning methods (Zheng et al. 2019; Yi et al. 2019; Fawaz et al. 2020; Sit et al. 2020). The combination of CNN and LSTM models (CNN-LSTM) makes it possible to exploit the advantages of the two different layers. LSTM efficiently acquires sequence pattern information thanks to its peculiar architecture, whereas CNN layers filter the noise in the input data and extract the most significant features for the final prediction model (Bengio et al. 2013). On the other hand, LSTM networks exploit only the features present in the training set, although they can be adapted to cope with temporal correlations (Livieris et al. 2020). Several works have used deep-learning models based on LSTM networks to create run-off simulations (Kratzert et al. 2018; Le et al. 2019; Boulmaiz et al. 2020; Liu et al. 2020; Nguyen and Bae 2020; Hu et al. 2020), whereas others are based on CNN (Li et al. 2018; Huang et al. 2020; Kim and Song 2020; Hussain et al. 2020), or on a combination of both (CNN-LSTM) (Kimura et al. 2019; Baek et al. 2020; Xu et al. 2020). Other LSTM techniques (LSTM-ED) consider two blocks of layers: the first block (called encoder) reads the input sequence and encodes it into a fixed-length vector, whereas the second block (called decoder) decodes the fixed-length vector and outputs the intended sequence (Sutskever et al. 2014; Cui et al. 2022; Luppichini et al. 2022).
However, these tools require good computer skills, which can limit the application of these techniques outside the research community, for example by the technical bodies responsible for land management. The availability of software and user-friendly toolkits can extend the application of these techniques to several other cases. If the results derived from these toolkits are shared within a network, it will be possible to obtain increased knowledge, leading to future developments and improvements in the field. However, Sit et al. (2020) observed that similar techniques had been used worldwide for different studies but, apart from some exceptions, these applications are not open-source and reproducible. This is a noteworthy limit to their distribution and application.
The aim of this work is to present a dynamic Google Colab toolkit called CleverRiver for the application of deep-learning models to river-flow prediction. This toolkit makes it possible to build workflows using hardware resources made available by Google rather than those of the user's desktop PC (Bisong 2019). The toolkit allows the application of deep-learning models based on different architectures, currently the most used to create models for river-flow prediction. In particular, the architectures are based on the research of Luppichini et al. (2022) and Lupi et al. (2022). Google Colab can be employed by two different types of users: the first has poor computer skills, or none at all; the second is able to understand and interact with the code of the toolkit. In the first case, the user can apply the method of the tool to her/his data to obtain a result and new computational capacity. In the second case, the user can compare the code of the toolkit with her/his own code, and can also contribute to improving the toolkit by bringing new and clear knowledge to the scientific community.

Materials and methods
CleverRiver is designed in close relationship to the work of Luppichini et al. (2022), making it possible to apply their method. The workflow is based on the Keras API and the TensorFlow library for the creation of the deep-learning models. The toolkit also uses the NumPy and Pandas libraries for the management of the data. CleverRiver is composed of three sections. The first one aims to train deep-learning models by using a progressive and user-friendly procedure. In this section the user can exploit different types of data (e.g., hydrometric height, discharge, rainfall, temperature) with different data frequencies (e.g., daily, hourly), setting the inputs and outputs of the models in a simple manner. A deep-learning model can be interpreted as a mathematical expression:
$$\hat{O}_{t} = f\left(I^{m}_{t-1}, I^{m}_{t-2}, \ldots, I^{m}_{t-n}\right)$$
where $\hat{O}_{t}$ is the predicted output (hydrometric height or discharge) at time $t$, $I^{m}$ are the antecedent inputs (e.g., $m$ can be 1 = rainfall, 2 = discharge, 3 = temperature), and $n$ is the number of antecedent time steps. The choice of n depends on the characteristics of the data, such as sample frequency (daily, hourly, etc.), and on the characteristics of the simulated watershed (e.g., run-off time). These parameters must be chosen by the user after some tests have been performed. For example, Luppichini et al. (2022) set n to 96, using rainfall data with a frequency of 15 min, corresponding to a maximum antecedent t of 24 h. The watersheds simulated by the authors are characterized by a fast run-off (in several cases shorter than 12 h). By setting these parameters and after uploading the CSV files (e.g., rainfall data) to the workspace, the procedure creates the input matrix that will be used to train the models.
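The construction of the input matrix can be illustrated with a minimal Python sketch. It assumes one regularly sampled series per input variable and that the output at time t is predicted from the n preceding records; the function name `build_input_matrix` and the toy values are hypothetical and are not taken from the CleverRiver code.

```python
def build_input_matrix(inputs, target, n):
    """Build samples of n antecedent records for each predicted time step.

    inputs: list of equally long series, one per input variable m
            (e.g., rainfall, discharge), ordered in time.
    target: series of the output variable (e.g., hydrometric height).
    n:      number of antecedent time steps used as input.
    """
    length = len(target)
    if any(len(series) != length for series in inputs):
        raise ValueError("all series must have the same length")
    X, y = [], []
    for t in range(n, length):
        # One sample: n time steps, each holding one value per input variable.
        window = [[series[k] for series in inputs] for k in range(t - n, t)]
        X.append(window)
        y.append(target[t])
    return X, y


# Toy example (hypothetical values): two input variables, n = 3.
rain = [0.0, 1.2, 3.4, 0.5, 0.0, 0.0]
level = [1.0, 1.1, 1.6, 1.9, 1.5, 1.2]
X, y = build_input_matrix([rain, level], level, n=3)

# With n = 96 and 15-minute data, each sample spans 96 * 15 / 60 hours.
hours_covered = 96 * 15 / 60
```

Each sample has shape (n, number of variables), which is the layout commonly expected by recurrent layers; the first n records of the series cannot be predicted because they lack a complete antecedent window.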
For training of the models, the dataset has to be divided into three parts: training, validation, and test datasets. The training and validation datasets are used during the training steps, whereas the test dataset is used during the evaluation of the results. Dividing the dataset allows the user to reduce the possibility of overfitting. The partition 60%-20%-20% for training, validation and test datasets, respectively, has been used in several studies (Li et al. 2020; Nguyen and Bae 2020; Hu et al. 2020; Luppichini et al. 2022) and provides sufficient data for both the training and the evaluation of the model. To train the model, the software allows the user to select the loss function and the optimizer from lists of the most commonly used options. CleverRiver provides three different model architectures: i) LSTM; ii) LSTM-ED; iii) CNN-LSTM. These are the most common architectures used for flood prediction (Sit et al. 2020; Cui et al. 2022; Luppichini et al. 2022). The first one is the most straightforward architecture, composed of a simple LSTM node and a Dense node. The LSTM-ED architecture was proposed by Luppichini et al. (2022) and is based on two blocks of LSTM nodes. Finally, the CNN-LSTM architecture proposed by Lupi et al. (2022) is composed of a combination of CNN and LSTM nodes (Fig. 1). The parameters of the architecture size (e.g., number of nodes) can be modified by the user, making it possible to test different settings.
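As an illustration of the 60%-20%-20% partition, the following sketch splits time-ordered samples into three consecutive blocks. Whether CleverRiver splits chronologically or shuffles the records first is not stated here, so the chronological order and the function name `chronological_split` are assumptions.

```python
def chronological_split(samples, train_frac=0.6, val_frac=0.2):
    """Split time-ordered samples into training, validation, and test parts."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    n_total = len(samples)
    train_end = round(n_total * train_frac)
    val_end = train_end + round(n_total * val_frac)
    # The remaining records (here 20%) form the test dataset.
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]


# Hypothetical dataset of 100 time-ordered records.
records = list(range(100))
train, val, test = chronological_split(records)
```

Keeping the test block at the end of the series means the model is evaluated on records that follow everything seen during training, which is one common way to limit information leakage in time-series problems.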
The last simulation parameters define the range of the time interval of the predictions. For example, if the dataset has a daily frequency, we can set the maximum value of the range to t = 10 days and create a simulation each day (step = 1) or every five days (step = 5). The algorithm trains a model for each t of prediction. The following step is model training. During this phase, some graphs and CSV files are compiled, which help to understand the errors of the models. To stop the training, we used the early stopping method provided by the Keras API. This method stops the training procedure when the monitored metric, namely the value of the cost function, has ceased to improve. In other words, among all the possible hypotheses, we wanted to find the best one (called "optimal"). This hypothesis would allow us to make more accurate estimates, still based on the data available.
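The early stopping rule can be sketched in plain Python as a patience-based check on the monitored loss. Keras exposes this behaviour through its `EarlyStopping` callback (with parameters such as `monitor`, `patience`, and `min_delta`); the patience value, the loss history, and the function name below are hypothetical and do not reflect the settings used in CleverRiver.

```python
def early_stopping_epoch(losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a patience-based stopping rule.

    Training stops once the monitored loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # The history ended before the patience was exhausted.
    return len(losses) - 1, best_epoch


# Hypothetical history of the monitored loss, one value per epoch.
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
```

In this toy history the loss last improves at epoch 3, so with a patience of 3 the training stops at epoch 6 and the model of epoch 3 is the one to keep.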
The first section ends with the option to create a single ZIP file of the model outputs and to transfer it to the local device.
The second section uses the trained models produced in the first section to create specific simulations of events designed to test and apply the models. In this section, the user can define the time interval of the events and the step between each simulation. In other words, one can choose a time interval (e.g., from 2020-01-10 to 2020-01-15) and then decide the beginning of each simulation (e.g., each day, every two days), until the interval is complete. The results can be exported using the functionalities of Google Colab, or by running a specific step that creates a ZIP of the work environment.
The third section aims to create a relationship between the users and the CleverRiver authors, with a form that allows users to send a message directly to the authors and invites them to share their toolkits and experiences.
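The choice of the simulation start times within an event can be illustrated with the standard `datetime` module. This is only a sketch: the function name `simulation_starts` is hypothetical, and treating the end of the interval as inclusive is an assumption.

```python
from datetime import datetime, timedelta


def simulation_starts(start, end, step):
    """List the start times of successive simulations within [start, end]."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    starts = []
    current = start
    while current <= end:
        starts.append(current)
        current += step
    return starts


# Interval from the example in the text: 2020-01-10 to 2020-01-15.
event_start = datetime(2020, 1, 10)
event_end = datetime(2020, 1, 15)
daily = simulation_starts(event_start, event_end, timedelta(days=1))
every_two_days = simulation_starts(event_start, event_end, timedelta(days=2))
```

With a daily step the interval yields six simulations, whereas a two-day step yields three (starting on 10, 12, and 14 January).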
CleverRiver is hosted in a GitHub repository together with the documentation and datasets having different characteristics (e.g., number of stations, sample frequency) for experimentation (https://github.com/mluppichini/CleverRiver). In this work, the CleverRiver results are derived using the "dataset2" uploaded to the GitHub repository, composed of 25 hydrometric height time series and 19 rainfall time series of the Arno River. Figure 2 shows the location of the stations composing the dataset used to describe the workflow.

Software description
CleverRiver is installed by importing the necessary libraries and setting up the workspace (Step 1.1 in Fig. 3). The tool then prompts the user to import the input data into the "training_input_data" directory; the notebook checks whether the files are correct for the following procedures (Step 1.2 in Fig. 3). CleverRiver creates the input matrix through Steps 1.3 and 1.4. The algorithm provides some information on the size of the input matrix, such as number of records, number of columns, and number of data (Fig. 4). The user can define the model parameters in Step 1.5 by using a simple user-friendly interface (Fig. 5).
Step 1.6 trains the models. For each simulation, CleverRiver provides the structure of the model and the errors of the best model calculated on the test dataset, expressed in terms of Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). The outputs are saved in the "trained_models" directory and each training has its own directory. For each simulation, the algorithm saves the model in JSON and H5 format. It also saves a CSV file including the information and the errors of the model; the predicted values calculated on the test dataset in CSV format; and three graphs (Fig. 6).
The first graph is a plot of the training history of the loss function value calculated on the training and on the validation dataset (Fig. 7). The second graph is a scatter plot showing the relationship between observed and predicted values (Fig. 8). The third graph is a time-series plot of the test dataset with observed and predicted values (Fig. 9).
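The two error measures can be written out explicitly. The sketch below uses the standard definitions of RMSE and MAE on paired observed and predicted values; the function names and the toy values are hypothetical, and the code is not taken from the toolkit.

```python
import math


def rmse(observed, predicted):
    """Root Mean Square Error between paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean Absolute Error between paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


# Toy example (hypothetical values): a single error of 2 on four records.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
rmse_value = rmse(observed, predicted)
mae_value = mae(observed, predicted)
```

Because RMSE squares the errors before averaging, a single large deviation weighs more on it than on MAE; in the toy example the RMSE is twice the MAE.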
The second section allows the user to apply the trained models to create specific simulations. With Steps 2.1, 2.2 and 2.3, CleverRiver creates the workspace and imports the trained models and data for the simulations. With Step 2.4, it creates the input matrix as in Step 1.4, and it saves the predicted values in the "output_predictions" directory.
It is then possible to execute Step 2.5 for the simulation of a specific event. In this step, the algorithm prompts for some simple inputs: i) the time interval to simulate; ii) the time distance between each simulation; iii) the label of the y axis for the output graphs (Fig. 10). For each simulated event, the plots (Fig. 11) are saved in the output folder and can be downloaded using the Google Colab functions or Step 2.6, which creates a ZIP file of the "output_predictions" directory. Table 1 summarizes the parameters required by the toolkit with a brief description, the value ranges, and the default values.
Fig. 10 Steps 2.1, 2.2 and 2.3: creation of the workspace and import of trained models and data input for simulations
Fig. 11 Step 2.5: setting the simulation of the event and result of the graph (the coloured curves are the successive simulations with a time distance of 6 h, whereas the black line represents the observed values)
The third section is a form that allows users to contact the CleverRiver authors, so as to create a network for different applications of river flood prediction (Fig. 12).

Discussion and conclusions
Google Colab notebooks are important tools for creating dynamic workspaces with no limits for the client in terms of operating system, Python installation, and hardware (Bisong 2019; Yang et al. 2022). CleverRiver is the first deep-learning software for the prediction of river-flow, and it provides valid techniques based on the most common approaches (Sit et al. 2020; Van et al. 2020; Luppichini et al. 2022) for training of the models and evaluation of the results. CleverRiver is an open-source Python toolkit for the simulation of river flows and can be a reference point for the dissemination of deep-learning models in this field. This toolkit is based on LSTM and CNN layers, which are probably the most popular, efficient, and commonly used deep-learning techniques (Fawaz et al. 2020; Yi et al. 2019; Zheng et al. 2019). These types of layers have been used in several works with the purpose of predicting river-flow (e.g., Li et al. 2018; Baek et al. 2020; Boulmaiz et al. 2020; Huang et al. 2020; Kim and Song 2020; Van et al. 2020; Hussain et al. 2020; Luppichini et al. 2022). For this reason, CleverRiver is a valid toolkit able to apply these, or similar, architectures in a potentially large number of future applications.
River-flow models based on deep-learning cannot yet be used on a large scale as they require particular computational skills. This is the main difference from physical models, which use different types of free, for-pay, open-source, and non open-source software.
Importantly, the ability to efficiently simulate river-flow with the great amount of data available in different parts of the world is crucial for present-day river management and geo-risk assessment. In this regard, CleverRiver represents a valuable tool for a range of potential users including (but not limited to):
• policy makers responsible for regulating river development;
• river managers and engineers designing and implementing flood protection;
• researchers evaluating the impacts of climate change within the fluvial zone;
• students and neophytes to deep-learning techniques, who will be able to learn and try out their datasets.
Finally, the growing demographic pressure on fluvial zones and the changes caused by climate change in high-frequency and high-intensity precipitation events strongly suggest the need to plan for future adaptation of the community (Bates et al. 2008a, b, 2012; Gaume et al. 2016; Bryndal et al. 2017; IPCC 2018).
New releases will be progressively uploaded with new architectures, data management, and model parameters. For these reasons, we think that CleverRiver can be a valid tool to solve the problem of the scarce availability of open-source codes for flood prediction (Sit et al. 2020) and to extend the use of these tools outside the scientific community by means of a preliminary and cognitive approach.
collaborative research agreement no. 579999-2019 "Autorità di Bacino Distrettuale Appennino Settentrionale" (Resp. Monica Bini and Roberto Giannecchini) and by the project "Cambiamenti globali e impatti locali: conoscenza e consapevolezza per uno sviluppo sostenibile della pianura Apuo-versiliese" Fondazione Cassa di Risparmio di Lucca (call 2018 years 2019-2022-Resp. M. Bini).
Data availability You can contact Marco Luppichini (marco.luppichini@unifi.it) for data and materials.

Declarations
Competing interests The authors declare no competing interests.

Conflicts of Interest
The authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.