1 Introduction

The use of hydrodynamic models, not only for flood forecasting but also as a planning tool in urban drainage, has increased considerably over the last decades, and with it the importance of understanding a model's ability to reproduce the system behaviour. To ensure that the model performance is sufficient to provide a reliable foundation for any planning procedure, calibration is a crucial and fundamental component of the model development process (Muschalla et al. 2009; Tscheikner-Gratl et al. 2017). Consequently, model calibration has been the topic of many research activities and publications. For example, Di Pierro et al. (2005) investigated the development of calibration algorithms, Kleidorfer et al. (2009a) highlighted the impact of data accuracy and Deletic et al. (2009) focussed on the sources and propagation of uncertainties.

However, uncalibrated or insufficiently calibrated models are still in use in engineering practice, with data availability often being the limiting factor. Calibration usually requires measurement campaigns, which in turn can increase the economic cost of simulation projects to an unachievable level, especially for smaller operators (Freni et al. 2009). Calibration uncertainties relate to the data used for calibration and their selection (Notaro et al. 2013) and to the calibration methods (Leonhardt 2015). They stem from measurement errors in both input and calibration data, the selection of appropriate calibration and validation datasets, the applied calibration algorithms and the objective functions used during the calibration process (Deletic et al. 2012).

Another possible deficit of urban water management studies is that the case studies in the scientific literature are often the same, usually larger cities which have the financial and human resources to participate in research projects and to provide an appropriate data background (e.g. Los Angeles in Barco et al. (2008), Melbourne in Bach et al. (2013) or Shenzhen in Gong et al. (2017)). They are selected for providing this good data background, e.g. measurement data over longer periods of time, and/or the required infrastructure for further data collection and management. Such case studies are not always representative of the overall situation in a country. At the very least, there is a risk that research outcomes are biased towards larger and more affluent municipalities (Tscheikner-Gratl et al. 2016a).

It is apparent that many factors influence data availability. In this paper, the influence of data availability on calibration performance is investigated for the hydrodynamic drainage model of a small Austrian municipality. For this purpose, different scenarios of data availability for calibration are simulated. Scenarios with varying input data (different numbers of calibration events, different rainfall inputs) have already been considered in Tscheikner-Gratl et al. (2016a). There, the model was calibrated with different rainfall events and data sampled according to empirically based measurement campaigns. Additional scenarios also considered uncertainties in the measurement data by assuming systematic errors in the collection of water level monitoring data. While the influence of using different model input data (i.e. rainfall recordings) for calibration was evaluated in great detail, the effects of the spatial distribution of the calibration data (in-sewer measurements) were only marginally discussed. All scenarios used one single measurement dataset, namely a water level measurement at one point of the system. Consequently, that work cannot answer the question of the effects of using only one measurement site for calibration in contrast to spatially distributed measurements. Although the small size of the case study and the limitation of funds nicely mimicked an engineering approach, a distributed measurement campaign may lead to a more differentiated outcome and consequently a better representation of the case study.

Existing studies, e.g. Kleidorfer et al. (2009b), investigated the influence of an increasing number of measurement stations on the calibration of conceptual sewer models. The effects of locating calibration points for hydrodynamic models were investigated by Vonach et al. (2018), who proposed a heuristic for measurement site placement and concluded that further research is needed to improve the methodology. With this work, we want to enhance the understanding of these effects by investigating the influence of different numbers and combinations of calibration points for hydrodynamic drainage models. For this, simulation results from a reference scenario are taken as synthetic measurement data, due to limited data availability. This paper describes three scenarios (scenarios I, II and III, varying in the number and distribution of measurement sites) to highlight the influences originating from the design of in-sewer measurement campaigns.

2 Methods

The basis for the methodology is an existing hydrodynamic model of the case study's urban drainage network (Kleidorfer et al. 2014; Muschalla et al. 2015; Tscheikner-Gratl et al. 2016a; Tscheikner-Gratl et al. 2016b; Vonach et al. 2018). The performance of the different scenarios was assessed using the Storm Water Management Model (SWMM 5.1.012) software tool (Burger et al. 2014; Gironás et al. 2010), which is widely used (e.g. by Yazdi (2017) and Gong et al. (2017)).

2.1 Case Study

The analysed case study, Telfs, is a small municipality with 15,000 inhabitants in Tyrol, Austria, at an altitude of about 630 m above sea level. It has an average annual rainfall of about 1000 mm.

The modelled urban drainage network of Telfs consists of 52 km of combined sewers, 28 km of wastewater sewers and 12 km of stormwater sewers. The stormwater sewers have nine outfalls (in the following figures referred to as RW for rainwater) into the receiving water bodies, whereas only three combined sewer overflows (CSO) exist. In total, a catchment area of approx. 73 ha (1251 subcatchments) is connected to the sewer system. For model calibration and validation, precipitation was measured over a period of one year with a temporal resolution of 5 min at three sites (rain gauges RG 1-3) within the catchment area, and the water level was measured at one site near the inflow to the wastewater treatment plant. This measurement setup also reflects the limited data availability typical of smaller operators with limited budgets.

The ten rain events with the highest occurring intensities that surpassed an event threshold of 3 mm at all three rain gauges, with an inter-event time of 24 h (listed in Table 1), are consolidated into one continuous rain series. Interconnection of the events is avoided by inserting dry-weather periods of 4 h (DWA-A 118 2006) between the individual events (see Fig. 1).
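The consolidation step can be sketched in R (the language used later for the calibration automation). This is a minimal illustration under assumed data structures: each event is taken to be a data frame of 5-min rainfall depths [mm] with columns RG1, RG2 and RG3; the function and column names are illustrative and not part of the original workflow.

```r
# Sketch: merge rain events into one continuous series, separated by 4 h of dry weather
consolidate_events <- function(events, gap_hours = 4, dt_min = 5) {
  n_gap <- gap_hours * 60 / dt_min                      # 48 dry 5-min steps
  gap <- as.data.frame(matrix(0, nrow = n_gap, ncol = 3,
                              dimnames = list(NULL, c("RG1", "RG2", "RG3"))))
  pieces <- list(events[[1]])
  for (ev in events[-1]) pieces <- c(pieces, list(gap, ev))   # event, gap, event, ...
  out <- do.call(rbind, pieces)
  out$t_min <- seq(0, by = dt_min, length.out = nrow(out))    # continuous time axis [min]
  out
}
```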

Table 1 Characteristics of the rain events (avg. peak [mm/5 min]: averaged maximum of the three gauges; max. peak [mm/5 min]: maximum occurring at one of the gauges; max. at: rain gauge where the maximum peak occurs; avg. sum [mm])
Fig. 1 Consolidated rain series with 10 measured rain events (RE01-RE10) (RG 1: weighing bucket (precipitation) gauge, RG 2 and RG 3: tipping bucket gauges)

A wastewater treatment plant (WWTP) is located southeast of the town. This plant additionally treats the wastewater of four nearby communities, and its capacity is designed for 40,000 population equivalents. Accordingly, the case study's drainage network must also convey the wastewater of the other association members (the four nearby communities) to the WWTP.

Regarding the model, it is important to mention that there are two different and independent outlets connected to the WWTP.

2.2 Calibration Procedure

For the initial model calibration, a measurement campaign for water level data at one location and precipitation data at three locations (shown in Fig. 2) was carried out in 2014. Due to this restriction in data availability, the calibration datasets used in this paper are derived from the model that was calibrated to the measured data, subsequently referred to as the 'reference scenario'. This reference scenario was calibrated by Tscheikner-Gratl et al. (2016a) with the rain series shown in Fig. 1, considering all three rain gauges.

Fig. 2 Model of the case study Telfs with synthetic measurement points for scenario I, II and III (measurement points encircled with arrows leading to enlarged details)

Calibration scenarios I, II and III, representing different numbers and locations of measurement sites, were established by considering weak points in the uncalibrated model (pipes where the agreement of the simulated water level courses between the uncalibrated model and the model of the reference scenario is low) and the operator's empirical knowledge. Basic considerations regarding this heuristic scenario development approach can be found in Vonach et al. (2018).

Figure 3 illustrates the abstractions made during the procedure. The basis is the real system with the real water level measurement. A first abstraction is made by calibrating a model to this measurement. For this, the input data is varied in different calibration scenarios (Tscheikner-Gratl et al. 2016a). The calibration scenario that resulted in the best overall agreement between the measurement and the model, i.e. the scenario calibrated to the entire rain series of 10 events with all three rain gauges, with agreement expressed by the Nash-Sutcliffe efficiency NSE (McCuen et al. 2006; Nash and Sutcliffe 1970), is then used as the reference scenario.

Fig. 3 Necessary abstractions for the model-based approach

In contrast to Tscheikner-Gratl et al. (2016a), all scenarios of the present work are calibrated with the three rain events RE03, RE06 and RE09 of all three gauges (see Fig. 1). These rain events turned out to be representative, as models calibrated with only one of them delivered a high model performance compared to the reference scenario in terms of their ability to predict CSO and flooding volumes (Tscheikner-Gratl et al. 2016a). To obtain appropriate synthetic measurement data, i.e. model outputs of the reference scenario, the reference model is simulated with a consolidated rain series of these three rain events.

By this means, synthetic measurement data and system performance are available in any form (e.g. water levels, CSO and flooding volumes, etc.) and at every point in the system. The model is then calibrated in three separate scenarios (I, II and III), each using different data from different calibration points (shown in Fig. 2), extracted from the reference scenario. A similar approach, although for a different objective, is also used and described in Kleidorfer et al. (2009b). In this work, a temporal resolution of 5 min is used for the synthetic measurement data.

The locations of the measurement sites providing the simulated calibration data for the three scenarios I to III are shown in Fig. 2. Scenario I consists of five calibration points scattered throughout the network. Scenario II has two calibration points right at the outlets to the wastewater treatment plant. The single calibration point investigated for scenario III is the same point where the real data was collected. Nevertheless, the synthetic water levels from the reference scenario were used for scenario III to keep the scenarios comparable at the same level of abstraction.

Only subcatchment-related parameters concerning the runoff concentration and the total runoff volume are varied. The runoff model implemented in SWMM uses two main parameters, the subcatchment width and the subcatchment imperviousness. The width of a subcatchment determines the shape of the area and consequently significantly affects the concentration time, while the imperviousness mainly relates to the total runoff volume. There are further parameters determining the timing (e.g. catchment slope or pipe roughness), but the subcatchment width is a non-measurable value and is thus only determinable by calibration, and choosing more parameters to calibrate would reduce parameter identifiability. To avoid further reducing the identifiability of the parameters, the subcatchments are clustered, according to their initially estimated values, into three groups distinguished by imperviousness and four groups differentiated by their width (the ratio between width and area, i.e. their shape); see also Vonach et al. (2018). Each cluster has its own factor, which is multiplied with the initial width or imperviousness. Consequently, these seven factors represent the calibration parameters (instead of 1251 × 2 = 2502 possible unique values).
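To make the parameter reduction concrete, the following R sketch shows how the seven cluster factors could be applied to the 1251 subcatchments. The data frame columns and cluster labels are assumptions for illustration, not the authors' actual data structures.

```r
# Sketch: map 7 calibration factors onto the initial subcatchment estimates.
# 'subs' is assumed to hold one row per subcatchment with initial values
# (imperv0 [%], width0 [m]) and cluster labels (imp_cluster in 1:3, width_cluster in 1:4).
apply_factors <- function(subs, f_imp, f_width) {
  stopifnot(length(f_imp) == 3, length(f_width) == 4)   # 3 + 4 = 7 calibration parameters
  subs$imperv_cal <- pmin(100, subs$imperv0 * f_imp[subs$imp_cluster])  # cap at 100 %
  subs$width_cal  <- subs$width0 * f_width[subs$width_cluster]
  subs
}

# Example call: only 7 values are tuned instead of 1251 x 2 individual parameters
# subs_cal <- apply_factors(subs, f_imp = c(0.9, 1.1, 1.0), f_width = c(1.2, 0.8, 1.0, 1.1))
```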

Parameter adaptation is implemented and automated with R (R Development Core Team 2008). We used an optimization algorithm based on a Nelder-Mead simplex (Duan et al. 1994; Nelder and Mead 1965). The Nash-Sutcliffe efficiency (NSE) was chosen as the objective function to compare measured and predicted water levels. The NSE is a measure to compare time series; it ranges from −∞ (no agreement) to 1 (perfect match). For the calibrations in this paper, a threshold of NSE = 0.9 is chosen (Shamseldin 1997). When the optimization algorithm finds a parameter set that exceeds this threshold at the calibration point, the algorithm terminates and the model is considered calibrated for the currently regarded calibration point.
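The objective function and the search can be outlined as follows. This is a minimal R sketch, assuming a hypothetical wrapper run_swmm() that writes the seven factors into the SWMM input file, runs the simulation and returns the simulated water levels at the current calibration point, and a vector obs_levels holding the synthetic water levels at that point; neither is part of the published material.

```r
# Nash-Sutcliffe efficiency between observed and simulated water level series
nse <- function(obs, sim) 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)

# Objective for optim(): it minimises, so return 1 - NSE (>= 0 because NSE <= 1)
objective <- function(par, obs) {
  sim <- run_swmm(par)              # hypothetical model wrapper, not shown here
  1 - nse(obs, sim)
}

# Nelder-Mead search starting from factors of 1 (i.e. the initial estimates);
# abstol = 0.1 stops the search once 1 - NSE <= 0.1, i.e. NSE >= 0.9
fit <- optim(par = rep(1, 7), fn = objective, obs = obs_levels,
             method = "Nelder-Mead", control = list(abstol = 0.1))
```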

As there are several points to calibrate the model to, calibration to each synthetic measurement station is performed in a downstream order. By this means, only subcatchments lying upstream of the current and downstream of the previous calibration point are modified. As an example, the subcatchments varied for each calibration step of scenario I are shown in Fig. 4. This methodology greatly increases the number of calibration parameters compared with a calibration to all points simultaneously, as every step establishes a new set of seven parameters.
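A possible outline of this stepwise procedure is given below. The helpers subs_between() and calibrate_point() are hypothetical (the latter wrapping the Nelder-Mead search sketched above), and the calibration points are assumed to be ordered from upstream to downstream.

```r
# Sketch: calibrate to each synthetic measurement station in downstream order.
# At step i, only the subcatchments upstream of point i and downstream of
# point i-1 are released for adaptation (7 new factors per step).
calibrate_stepwise <- function(model, calib_points, synth_levels) {
  for (i in seq_along(calib_points)) {
    previous  <- if (i > 1) calib_points[i - 1] else NULL
    free_subs <- subs_between(model, upstream_of = calib_points[i],
                              downstream_of = previous)      # hypothetical helper
    model <- calibrate_point(model, free_subs, obs = synth_levels[[i]])
  }
  model
}
```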

Fig. 4 Systematic order of adapted subcatchments for calibration scenario I

Each point used for calibration (in total seven different points of the network, because scenarios I and II use the same point at the wastewater treatment plant) is additionally tested as a single calibration point to investigate the effect of a preceding stepwise calibration procedure.

A calibration to the system's outlets can lead to a loss of information about high-resolution system behaviour. To highlight the possible extent of this loss, a sensitivity analysis is performed for scenario II (a calibration to measurements at the outlets to the WWTP) in addition to the calibration scenarios. In this sensitivity analysis, 300 sets of calibration parameters were sampled randomly within defined ranges. The resulting models were then simulated three times with three different rainfall inputs. Two of the rainfall inputs are Euler II design storms with return periods of 5 and 10 years, respectively, prepared according to Austrian design guidelines (ÖWAV RB11 2009). More information about the application of these design storm events can be found e.g. in Mikovits et al. (2017) or De Toffol et al. (2006). They are characterized by a high peak (here 12.9 and 15.7 mm/5 min) and a short duration (here 120 min). The required statistical data for these rain events are available from the Austrian rainfall database eHYD (Weilguni 2009). The third simulation is done with the measured rain event RE03 from all three rain gauges, in which the highest intensities of the measurement series occur (9.2 mm/5 min, measured at RG3).
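The sampling step of the sensitivity analysis could look as follows in R. The parameter bounds, the seed and the wrapper simulate_scenarioII() (returning simulated water levels at both WWTP outlets plus CSO and flooding volumes) are assumptions for illustration only; the nse() function is the one sketched above.

```r
set.seed(1)                                     # for reproducibility of the sketch only
n_sets <- 300
lower <- rep(0.5, 7); upper <- rep(1.5, 7)      # assumed ranges of the 7 factors
par_sets <- replicate(n_sets, runif(7, min = lower, max = upper), simplify = FALSE)

results <- do.call(rbind, lapply(par_sets, function(p) {
  out <- simulate_scenarioII(p)                 # hypothetical SWMM wrapper
  data.frame(nse_out1 = nse(obs_out1, out$level_out1),   # agreement at both outlets
             nse_out2 = nse(obs_out2, out$level_out2),
             cso      = out$cso_volume,                   # total CSO volume [m3]
             flooding = out$flood_volume)                 # total flooding volume [m3]
}))
```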

2.3 Model Validation and Evaluation

Influences caused by the choice of calibration points are evaluated with scenarios I, II and III. The models of the intermediate calibration steps are simulated with six different rainfall inputs (resulting in (1 + 2 + 5) × 6 = 48 simulations with the intermediate scenario models and six simulations with the reference scenario model). These rainfall inputs comprise our own measurements (the entire measured data of the three rain gauges, not the consolidated rain series from Fig. 1), with each gauge used for an individual simulation, a design storm event of type Euler II with a return period (rp) of 10 years (y) and the precipitation data sets ZAMG1 and ZAMG2. ZAMG1 and ZAMG2 are data sets of the Austrian Central Institute for Meteorology and Geodynamics (ZAMG) and were chosen from the nearest available measurement sites (ZAMG1 10 km from the catchment and ZAMG2 30 km). A spatial distribution of the occurring rainfall is not considered for validation. All rain series used for validation last 200 days, except the design storm events, which have a duration of 120 min. Our own measurements and the design storms have a temporal resolution of 5 min, whereas ZAMG1 and ZAMG2 have a time step of 10 min.
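The validation runs can be organized as a simple grid of model versions and rainfall inputs. The sketch below assumes a hypothetical helper run_model() (returning total CSO and flooding volumes for a given calibrated model and rain series) and a named list models of the eight intermediate calibration steps; all object names are illustrative.

```r
# Sketch: 8 intermediate models x 6 rainfall inputs = 48 validation simulations
rain_inputs <- list(RG1 = rg1, RG2 = rg2, RG3 = rg3,
                    EulerII_10y = euler10, ZAMG1 = zamg1, ZAMG2 = zamg2)
validation <- list()
for (m in names(models)) {                      # models: list of intermediate calibration steps
  for (r in names(rain_inputs)) {
    out <- run_model(models[[m]], rain_inputs[[r]])        # hypothetical SWMM wrapper
    validation[[length(validation) + 1]] <-
      data.frame(model = m, rain = r, cso = out$cso_volume, flooding = out$flood_volume)
  }
}
validation <- do.call(rbind, validation)        # the reference scenario is run in the same way
```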

Subsequently, model outputs are compared to the reference scenario regarding the occurring flooding and CSO volumes.

To show the effects of generalizing the system behaviour, accompanied by information loss due to a smaller number of measurement stations, simulated CSO and flooding volumes (taken from the randomly created models of the sensitivity analysis for scenario II) were compared with the results from the reference scenario. Only those models were considered which showed a good agreement with the reference scenario (NSE > 0.8) for the water level courses at the scenario II calibration points (the pipes just before both outlets to the WWTP). In this way, the information loss about the upstream system's behaviour that can occur while maintaining a good agreement at the calibration points is highlighted. For these evaluations, the threshold is lowered compared to the calibration scenarios. The authors are aware of the fact that an NSE of 0.9 is an ambitious value for calibration, as this value indicates a "very satisfactory model performance" according to Shamseldin (1997). By lowering the threshold to a still appropriate value of 0.8 (indicating a "fairly good model" (Shamseldin 1997)), the benefit of having more evaluable data points might outweigh the disadvantage of a slightly lower agreement for the significance of the results.
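Continuing the illustrative sketches from Section 2.2, the filtering and comparison step could be written as follows; ref_cso and ref_flood denote the reference scenario volumes and are, like the results table, assumed objects rather than the original script.

```r
# Keep only models that reproduce the water levels at both scenario II
# calibration points with NSE > 0.8
behavioural <- subset(results, nse_out1 > 0.8 & nse_out2 > 0.8)

# Percentage deviation of CSO and flooding volumes from the reference scenario
behavioural$cso_dev_pct   <- 100 * (behavioural$cso      - ref_cso)   / ref_cso
behavioural$flood_dev_pct <- 100 * (behavioural$flooding - ref_flood) / ref_flood
```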

3 Results and Discussion

The following elaboration of the results is divided into three parts. First, the final validation performance of the calibration scenarios is compared. Secondly, the performance of the intermediate calibration steps is evaluated to provide an insight into the impact of a varying complexity of the calibration process on the model accuracy. Thirdly, the results of the sensitivity analysis performed for scenario II are given.

3.1 Performance of Calibrated Models

Tables 2, 3 and 4 give an overview of the simulated CSO and flooding volumes of the calibration scenarios when simulating the resulting models with the five measured rainfall series (RG1, RG2, RG3, ZAMG1, ZAMG2) and one design storm (Euler II) used for validation.

Table 2 CSO and flooding volumes for different calibration scenarios (Sc. I, II, III and reference scenario) and rainfall inputs (depicted as: CSO / flooding in [m3])
Table 3 Percentage deviations of CSO volumes from the reference scenario for different calibration scenarios (Sc. I, II, III) and rainfall inputs
Table 4 Percentage deviations of flooding volumes from the reference scenario for different calibration scenarios (Sc. I, II, III) and rainfall inputs

Out of the measured rainfall series, ZAMG1 caused the largest CSO and flooding volumes in the reference scenario.

Rain series RG3 and RG2 result in very low flooding volumes (4 and 36 m3) for the reference scenario; the relative deviations of the calibrated models (Table 4) are therefore even higher but accordingly less significant. Rain series ZAMG2 and RG1 did not produce any flooding in the reference scenario, so percentage deviations cannot be evaluated.

Figure 5 shows the results for models simulated with the measured rainfall of ZAMG1. It shows the deviations of predicted flooding and CSO volumes from the reference scenario not only for the scenarios evaluated in this paper but also in comparison with the calibration scenarios with different rainfall input or systematic errors in the water level measurements (Tscheikner-Gratl et al. 2016a). In this way, the effects of different calibration points on model performance can be compared directly to the effect of using different rainfall inputs for calibration or a systematic measurement error.

Fig. 5 Flooding volume and CSO volume deviation for the measured 200-day rain series ZAMG1 (asterisked scenarios are results simulated with models according to Tscheikner-Gratl et al. (2016a); these scenarios are named according to their rainfall input (RainGauge and RainEvent) used for calibration)

Only one of all models (the scenario calibrated with rain event RE05 including all three rain gauges) deviates by less than 25% for both volumes. Scenarios I, II and III all show less than 25% deviation from the reference scenario for the CSO volume. The flooding volume is overestimated throughout for these three scenarios, by up to 102%. Scenario III, which has the same number of calibration points (one) as the other calibration scenarios from Tscheikner-Gratl et al. (2016a), shows a better performance for CSO volumes than most of those scenarios. In contrast, the agreement of the flooding volume is inferior.

No scenario exceeds a deviation of 100% for CSO volume, but three (including scenario III) exceed a deviation of 100% for flooding volume.

The good agreement of the CSO volumes of scenarios I, II and III with the reference scenario could result from an advantageous sampling of the model input data. A spatially distributed rainfall as well as different rain events were used for these calibrations. This contrasts with the majority of the other scenarios, where mostly either only one rain gauge or only one rain event was used for calibration.

Looking at the resulting models themselves, scenario II agrees best in terms of the connected impervious area. The reference scenario has a mean imperviousness of 45.6%. Scenario II approximates this mean imperviousness with 49.3%. Nevertheless, according to the results in Fig. 5, it still underestimates CSO volumes by 7% while overestimating flooding volumes by 93%. Scenarios I and III both overestimate the mean imperviousness (Im) with Im = 54.2% (scenario I) and Im = 57.5% (scenario III), flooding volumes (I: 58% and III: 102% deviation from the reference scenario) and CSO volumes (I: 17% and III: 17% deviation from the reference scenario).

3.2 Uncertainties Due to a Spatial Difference in Calibration Data Availability

Concerning the necessary number of measurement stations to gather data for calibration, Fig. 6 shows the change in the model behaviour (CSO and flooding volumes compared to the reference scenario) with a different number of measurement sites. To investigate a larger number of calibration points as well as the final model, the models resulting from the intermediate steps of the calibration procedure are used. For flooding, only results for the two rainfall inputs ZAMG1 and Euler II are shown in Fig. 6, as they result in the most significant flooding volumes in the reference scenario.

Fig. 6 CSO and flooding volumes and deviations from the reference scenario for six (CSO) and two (flooding) validation rainfall inputs and the intermediate calibration steps of the three scenarios

For both scenarios I and II, adding a further calibration point improves the model performance for flooding volumes throughout. This is also the case for CSO volumes in scenario II. For scenario I, there are some exceptions, where the model performance worsens temporarily during the step-wise calibration procedure (RG1: step 3, RG2: step 2, RG3: step 2, ZAMG1: step 2). Nevertheless, in absolute numbers, these interim degradations of performance for CSO volumes are only marginal, e.g. for a simulation with RG1 the CSO volume increases only by 133 m3 from step 2 (1893 m3) to step 3 (2026 m3). For the other rainfall inputs, the degradations amount to less than 30 m3 increase in CSO volume.

Regarding the significant improvements at the last points of scenarios I and II (which coincide at the same point of the network) and the fact that there is only one CSO but no additional subcatchments between the endpoints of scenarios I and II and the calibration point of scenario III, it can be assumed that this branch is favourable for calibration. Thus, Fig. 7 shows the resulting CSO volumes for models calibrated to only one calibration point on this branch. Models calibrated only to the endpoint of scenarios I and II show very similar results to calibration scenario III, where the calibration point is approx. 2.5 km further upstream with a CSO structure in between. In addition, scenario I, where the model is calibrated to five different points sequentially, is not superior to a model calibrated only to the last of these five points.

Fig. 7 CSO volumes when using one single calibration point (all on the same sewer branch)

Both points used in scenario II would result in a positive deviation when used as a single calibration point. Thus, the switch from an over- to an underestimation of CSO volumes within the calibration procedure of scenario II (Figs. 6 and 7) does not stem from one of these two points alone but results from the combination of both calibration points. The inferior model performance from a calibration to the first point of scenario II compared to a calibration to both points or only the second point might be related to the large increase in the calibrated connected area. After the stepwise calibration of scenario II, the second measurement point (south-eastern outlet) is connected to 18.3 ha of impervious area and the first (north-eastern outlet) to only 5.6 ha.

Using the reference scenario model to produce synthetic measurement data also enables us to compare the resulting water level courses in all other pipes. Figure 8a to h exemplify how the model fit changes away from the considered calibration point(s). For this, the Nash-Sutcliffe efficiency is calculated for selected pipes of the network at the end of a branch with no change in the inflow (comprising about one third of all pipes in the system). It shows the stepwise improvement in specific pipes for the intermediate steps of each calibration scenario.

Fig. 8 Overall agreements for different calibration steps in scenarios I, III and II

3.3 Sensitivity Analysis for Scenario II

Out of 300 models, 71 resulted in an NSE > 0.8 at both calibration points for the Euler II design storm with rp = 5y and 88 models for rp = 10y. The measured rainfall input (rain event RE03) produced only very small flooding volumes of between 0 and 10 m3 and is neglected in the further evaluations. Therefore, Fig. 9 shows the results for the simulations with the two design storms.

Fig. 9 Absolute values (left) and deviation from the reference scenario (right) of flooding volume and CSO volume for design storm events Euler II (rp = 5y and 10y) and random calibration parameter variations for scenario II

All but one of the random but calibrated models overestimated the flooding volume while underestimating the CSO volume. The deviations in the CSO volume range between −16% and +9%, a significantly smaller range than that of the flooding volume, which ranges from −9% to +109%. The rainfall events with the higher return period are shown in this case to result in lower relative errors than those with the lower return period, especially for the flooding volume. These results agree with the findings of Ahmed (2012), who also showed an improved model performance at higher flows in the context of calibrating an integrated river basin model.

The range of CSO volumes is about the same for both rainfall events (1589 m3 for rp = 5y and 1262 m3 for rp = 10y). Also, if we neglect the flooding volume outliers, the occurring range for the absolute flooding volume is of the same order of magnitude for both return periods (742 m3 for rp = 5y and 645 m3 for rp = 10y).

The impact of coarse-grained sensor distributions (coarse-grained in terms of the flow path within the system, not in a geographic sense) is shown to be relatively minor for the CSO volume but more significant for the resulting flooding volume. This shows that the flooding volume is more sensitive to changes in the connected area than the CSO volume. The simulated return period of 5 years is close to the design return period (3 to 5 years depending on land use), meaning that additional surface runoff causes flooding before it can reach a CSO outlet. The resulting ranges for both volumes are remarkably smaller than their absolute differences from the values of the reference scenario. Consequently, a dense distribution of models with similar deviations can be seen. This could be a result of unsuitable boundary conditions for the calibration parameters. However, the presence of outliers precludes this possibility.

4 Conclusion

A model-based approach was used to evaluate different calibration scenarios with 1, 2 and 5 calibration points. They were subsequently validated with different rainfall sets, including measured rainfall series from available surrounding measurement stations and design storm events with different return periods. They were then compared to the scenarios processed in Tscheikner-Gratl et al. (2016a). Even for different rainfall inputs and neglecting the spatial distribution of the occurring intensities, their validations resulted in very good agreement and low deviations from the reference scenario with regard to the CSO volumes. Concerning flooding volumes, the scenarios established here can keep up with those from Tscheikner-Gratl et al. (2016a), but do not outperform them.

The performed sensitivity analysis exemplified a possible error range due to the loss of information. For design storms with return periods of 5 and 10 years, it shows higher possible deviation ranges for flooding than for CSO volumes. Possible ranges of 100% width for the predicted flooding volumes, while having about the same agreement at the calibration points, show that there is a significant uncertainty inherent to the calibration procedure when only few calibration points are used as a reference. This result of the study is especially interesting when linking the model calibration with the modelling aims, which are usually either the prediction of CSO volumes (to protect receiving water quality) or the prediction of flooding volumes (to limit flood risk). While the prediction of CSO volumes is rather stable, high deviations of flooding volumes have to be expected even for "calibrated" models. In current modelling practice, this is not a problem, as hydrodynamic models are still often used in a way that flooding is simply not allowed for certain design return periods. If flooding occurs, the pipes have to be re-designed. But with changes in the design approach towards a risk-based approach, in which flooding volumes, flooded areas and water on the surface are accepted to a certain extent for certain design return periods, the instability in predicting flooding volumes might cause problems. Currently, the main recommendation is only that distributed measurements improve the understanding of the overall system behaviour. They also improve the calibration performance, but further investigations of the calibration of hydrodynamic urban drainage models under high flow conditions are certainly required.

The evaluation of different scenarios for measurement campaigns showed significant differences in model performance with a varying number and location of calibration points. Often, economic or practical reasons restrict the execution of extended measurement campaigns. This study shows that, with a careful selection of input data, even one well-chosen calibration point can express and predict the system behaviour of a small case study to an extent satisfactory for engineering purposes. A favourable sampling of output data (i.e. a well-planned measurement site selection) can reduce uncertainties arising from an unfavourable choice of input data (i.e. the choice of calibration rainfall events). This emphasizes the importance of devoting resources to the calibration process. These resources are meant in terms of ensuring the availability of different kinds of measurement data (e.g. water levels and CSO volumes) as well as the time needed to enable a well-considered planning of the data collection.