Abstract
Discriminating the role of input variables in a hydrological system or in a multivariate hydrological study is particularly useful. Nowadays, emerging tools, called feature importance measures, are increasingly being applied in hydrological applications. In this study, we propose a virtual experiment to fully understand the functionality and, most importantly, the usefulness of these measures. Thirteen importance measures related to four general classes of methods are quantitatively evaluated to reproduce a benchmark importance ranking. This benchmark ranking is designed using a linear combination of ten random variables. Synthetic time series with varying distribution, cross-correlation, autocorrelation and random noise are simulated to mimic hydrological scenarios. The obtained results clearly suggest that a subgroup of three feature importance measures (Shapley-based feature importance, derivative-based measure, and permutation feature importance) generally provide reliable rankings and outperform the remaining importance measures, making them preferable in hydrological applications.
Acknowledgements
This research has received no external funding.
Author information
Contributions
Both authors performed the experiments, analyzed the data and wrote the manuscript. Both authors reviewed and approved the final version of the manuscript.
Ethics declarations
Conflict of interests
The authors declare no conflict of interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
A Feature importance measures in detail
We introduce here the notation and the main notions needed to define the measures used in this work. Let Y and \(\textbf{X}=(X_1,\dots ,X_d)\) be a random variable and a random vector on the probability space \((\Omega ,{\mathcal {B}}(\Omega ), {\mathbb {P}})\), with \(\textbf{X} \in {\mathcal {X}}_d \subseteq {\mathbb {R}}^d\) and \(Y \in {\mathcal {Y}} \subseteq {\mathbb {R}}\), cumulative distribution functions \({\mathbb {P}}_X\) and \({\mathbb {P}}_Y\), and respective density functions \(p_X\) and \(p_Y\). The vector \({\textbf{X}}\) can be written as \(\textbf{X}=(X_i,{\textbf{X}}_{-i})\), where \(\textbf{X}_{-i}=\{X_l,\,l=1,\dots ,d,\,l\ne i\}\). We denote the vector of observed values of \(X_i\) as \({\textbf{x}}_i=\{x_i^{(1)},\dots,x_i^{(N)}\}'\) and the jth observation as \(\textbf{x}^{(j)}=\{x_1^{(j)},\dots,x_d^{(j)}\}\), associated with the corresponding target value \(y^{(j)} \in {\mathcal {Y}}\). For the computation of the ML feature importance measures, we adopt an ML model \({\widehat{g}}\) (here, a linear model) to approximate the unknown model. The chosen ML model is fitted on the training set \(\{(\textbf{x}^{(j)},y^{(j)})\}_{j=1}^N\). We use \(L({\widehat{g}})={\mathbb {E}}[{\mathcal {L}}(Y,{\widehat{g}}({\textbf{X}}))]\) to denote the generalization error of a trained ML model, where \({\mathcal {L}}:{\mathcal {Y}}\times {\mathbb {R}} \rightarrow {\mathbb {R}}_{+}\) is the loss function.

The notion of Shapley value arises from cooperative game theory. Consider a group of players \(D=\{1,\dots ,d\}\) who can join coalitions \(K\subseteq D\); the total number of possible coalitions is \(2^d\). We denote by \(v:2^D\rightarrow {\mathbb {R}}_{+}\) the value function that assigns a reward to each coalition. The reward attributed to player i is given by its Shapley value

$$\begin{aligned} \phi _i(v)=\sum _{K\subseteq D \setminus \{i\}}\frac{|K|!\,(|D|-|K|-1)!}{|D|!}\left[ v(K\cup \{i\})-v(K)\right] . \end{aligned}$$
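As a concrete illustration of the combinatorial weighting \(|K|!\,(|D|-|K|-1)!/|D|!\) — a minimal sketch, not part of the original study — the Shapley value of a small game can be computed exactly by enumerating all coalitions. The additive value function below is a hypothetical example:

```python
from itertools import combinations
from math import factorial

def shapley_values(d, v):
    """Exact Shapley values phi_i for a d-player game with value function v,
    where v maps a frozenset of player indices to a real-valued reward."""
    phi = []
    for i in range(d):
        others = [p for p in range(d) if p != i]
        total = 0.0
        for size in range(d):
            for K in map(frozenset, combinations(others, size)):
                # Combinatorial weight |K|! (|D|-|K|-1)! / |D|!
                w = factorial(len(K)) * factorial(d - len(K) - 1) / factorial(d)
                total += w * (v(K | {i}) - v(K))  # marginal contribution of i
        phi.append(total)
    return phi

# Hypothetical additive game: a coalition's reward is the sum of its members'
# weights, so each player's Shapley value recovers its own weight.
weights = [1.0, 2.0, 3.0]
v = lambda K: sum(weights[p] for p in K)
phi = shapley_values(3, v)  # ≈ [1.0, 2.0, 3.0]
```

For an additive game the marginal contribution of player i is constant, so the weights sum to one and each \(\phi _i\) equals the player's own weight; interactions between players would redistribute the rewards.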
Regarding the ALE function, its estimation requires partitioning the support of the feature of interest into K non-overlapping intervals \({\mathcal {X}}_i^k=[z_i^{k-1},z_i^k]\), with \(k=1,\dots ,K\). The estimate of the ALE function of feature \(X_i\) can be written as
for each \(x_i \in \left[z_{i}^0,z_{i}^K\right]\), where \(z_{i}^0=\min \left\{x_i^{(1)},\dots,x_i^{(N)}\right\}\) and \(z_{i}^K=\max \left\{x_i^{(1)},\dots,x_i^{(N)}\right\}\). The term \(N_i^k\) denotes the number of data points falling in the kth interval for feature \(X_i\). We now report the feature importance measures employed in our analysis.
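To make this estimation scheme concrete, the following sketch (a hypothetical Python implementation, not the authors' code; the quantile-based interval edges and the toy model are assumptions) accumulates the averaged local differences over the K intervals:

```python
import numpy as np

def ale_estimate(g, X, i, n_intervals=10):
    """Uncentered first-order ALE estimate for feature i of a fitted model g.
    X is an (N, d) array; g maps such an array to a vector of N predictions."""
    # Interval edges z_i^0 < ... < z_i^K taken at empirical quantiles of X_i.
    z = np.quantile(X[:, i], np.linspace(0.0, 1.0, n_intervals + 1))
    # Assign every observation to one of the K intervals.
    idx = np.clip(np.searchsorted(z[1:-1], X[:, i]), 0, n_intervals - 1)
    ale = np.zeros(n_intervals + 1)
    for k in range(n_intervals):
        ale[k + 1] = ale[k]          # carry the accumulated effect forward
        mask = idx == k              # the N_i^k points falling in interval k
        if mask.any():
            lo, hi = X[mask].copy(), X[mask].copy()
            lo[:, i], hi[:, i] = z[k], z[k + 1]
            ale[k + 1] += np.mean(g(hi) - g(lo))
    return z, ale

# Toy check: for a model additive and linear in X_0, the ALE of X_0 is linear.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
g = lambda A: 2.0 * A[:, 0] + A[:, 1]   # hypothetical fitted model \hat{g}
edges, ale = ale_estimate(g, X, i=0)
```

Replacing the feature value by the two interval edges before differencing is what keeps the estimate free of the extrapolation bias that plain partial-dependence averaging incurs on correlated inputs.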
1. Correlation methods

- Pearson correlation coefficient \(r_i\) of feature \(X_i\) is defined as follows

$$\begin{aligned} r_i = \frac{{\mathbb {E}}(YX_i)-{\mathbb {E}}(X_i){\mathbb {E}}(Y)}{\sqrt{{\mathbb {E}}(X_i^2)-({\mathbb {E}}(X_i))^2}\sqrt{{\mathbb {E}}(Y^2)-({\mathbb {E}}(Y))^2}} = \frac{\text {cov}(Y, X_i)}{\sigma _{X_i}\sigma _{Y}}, \end{aligned}$$(7)

where \(\text {cov}(Y,X_i)\) denotes the covariance between Y and \(X_i\), while \(\sigma _{X_i}\) and \(\sigma _Y\) denote the standard deviations of \(X_i\) and Y, respectively.
- Spearman’s rank correlation coefficient \(\rho _i\) of feature \(X_i\) can be written as

$$\begin{aligned} \rho _i=\frac{\text {cov}(\text {Rank}(Y), \text {Rank}(X_i))}{\sigma _{\text {Rank}(X_i)}\sigma _{\text {Rank}(Y)}}. \end{aligned}$$(8)

Note that the formula above is the same as Pearson’s coefficient, but applied to the ranks of Y and \(X_i\).
- Kendall rank correlation coefficient \(\tau _i\) of feature \(X_i\) can be written as

$$\begin{aligned} \tau _i=\frac{n_c-n_d}{\sqrt{n_0-n_1}\sqrt{n_0-n_2}}, \end{aligned}$$(9)

where \(n_c\) is the number of concordant pairs, \(n_d\) is the number of discordant pairs, \(n_0\) is the total number of data pairs, and \(n_1\) and \(n_2\) are the numbers of pairs tied in \(X_i\) and in Y, respectively.
2. Regression-based methods

- Standardized regression coefficient \(\text {SRC}_i\) of feature \(X_i\) is defined as

$$\begin{aligned} \text {SRC}_i=\frac{\beta _i\sigma _{X_i}}{\sigma _Y}, \end{aligned}$$(10)

where \(\beta _i\) is the coefficient of \(X_i\) in the linear model.
3. ML feature importance measures

- Permutation feature importance \(\text {PFI}_i\) of feature \(X_i\) is defined as

$$\begin{aligned} \text {PFI}_i = {\mathbb {E}}[{\mathcal {L}}(Y,{\widehat{g}}(X_i^{\pi }, \textbf{X}_{-i}))]-{\mathbb {E}}[{\mathcal {L}}(Y,{\widehat{g}}(X_i, \textbf{X}_{-i}))], \end{aligned}$$(11)

where \(X_i^\pi\) is a permuted version of \(X_i\) with the same marginal distribution, and \({\mathbb {E}}[\cdot ]\) is the expectation operator.
- Permute-and-relearn feature importance \(\text {VI}_i^{\pi \text {L}}\) of feature \(X_i\) is defined as

$$\begin{aligned} \text {VI}_i^{\pi \text {L}}= {\mathbb {E}}[{\mathcal {L}}(Y, {\widehat{g}}^{\pi ,i}(X_i, \textbf{X}_{-i}))]-{\mathbb {E}}[{\mathcal {L}}(Y, {\widehat{g}}(X_i, \textbf{X}_{-i}))], \end{aligned}$$(12)

where \({\widehat{g}}^{\pi ,i}\) is the ML model retrained after the permutation of \(X_i\).
- Shapley-based feature importance \(\text {SbFI}_i\) of feature \(X_i\) is defined as

$$\begin{aligned} \text {SbFI}_i = \sum _{K\subseteq D \setminus \{i\}}\frac{|K|!\,(|D|-|K|-1)!}{|D|!}\left[ {\widehat{g}}_{K\cup \{i\}}(\textbf{x}_{K\cup \{i\}})-{\widehat{g}}_K(\textbf{x}_{K})\right] , \end{aligned}$$(13)

where \({\widehat{g}}_K ({\textbf{x}}_K )={\mathbb {E}}_{{\textbf{X}}_C } [{\widehat{g}}({\textbf{x}}_K,{\textbf{X}}_C )]\), with C being the complement of K.
- Shapley feature importance \(\text {SFIMP}_i\) of feature \(X_i\) is defined using a value function based on the model generalization error, i.e.,

$$\begin{aligned} \text {SFIMP}_i = \sum _{K\subseteq D \setminus \{i\}}\frac{|K|!\,(|D|-|K|-1)!}{|D|!}\left[ {\widehat{L}}_{K\cup \{i\}}({\widehat{g}})- {\widehat{L}}_{K}({\widehat{g}})\right] , \end{aligned}$$(14)

where \({\widehat{L}}_{K}({\widehat{g}})=\frac{1}{N}\sum _{j=1}^N \sum _{k=j}^N {\mathcal {L}}\left( {\widehat{g}}(\textbf{x}_K^{(j)}, {\textbf{x}}_C^{(k)}), y^{(j)}\right)\).
- Derivative-based measure \(\kappa _i^{\text {ALE}}\) of feature \(X_i\) is defined as

$$\begin{aligned} \kappa _i^{\text {ALE}} = \frac{1}{K}\sum _{k=1}^{K} {\mathbb {E}}\left[ \frac{{\widehat{g}}\left( z_i^k, \textbf{x}_{-i}^{(j)}\right) -{\widehat{g}}\left( z_i^{k-1}, \textbf{x}_{-i}^{(j)}\right) }{z_i^k-z_i^{k-1}}\right] ^2\frac{\sigma _{X_i}^2}{\sigma _Y^2}. \end{aligned}$$(15)

- ALE-based feature importance \(\text {FI}^{\text {ALE}}_i\) of feature \(X_i\) is defined as

$$\begin{aligned} \text {FI}^{\text {ALE}}_i= \sqrt{{\mathbb {V}}(\text {ALE}_i(X_i))}, \end{aligned}$$(16)

where \({\mathbb {V}}[\cdot ]\) denotes the variance operator.
4. SA methods

- Variance-based sensitivity measure \(\eta _i^2\) of feature \(X_i\) is defined as

$$\begin{aligned} \eta _i^2=\frac{{\mathbb {V}}[Y] -{\mathbb {E}}_{X_i}[{\mathbb {V}}[Y \mid X_i]]}{{\mathbb {V}}[Y]}. \end{aligned}$$(17)

- Density-based sensitivity measure \(\delta _i\) of feature \(X_i\) can be written as

$$\begin{aligned} \delta _i=\frac{1}{2}{\mathbb {E}}_{X_i}\left[ \int _{{\mathcal {Y}}} |p_Y(y)-p_{Y\mid X_i}(y) |dy\right] , \end{aligned}$$(18)

where \(p_Y\) and \(p_{Y\mid X_i}\) are the marginal output density and the conditional output density, respectively.
- Cumulative distribution-based sensitivity measure \(\beta ^{\text {KS}}_i\) of feature \(X_i\) can be written as

$$\begin{aligned} \beta ^{\text {KS}}_i={\mathbb {E}}_{X_i}\left[ \sup _{y\in {\mathcal {Y}}}\left| {\mathbb {P}}_Y(y)-{\mathbb {P}}_{Y\mid X_i }(y)\right| \right] , \end{aligned}$$(19)

where \({\mathbb {P}}_Y\) and \({\mathbb {P}}_{Y\mid X_i}\) are the marginal and conditional cumulative distribution functions of Y, respectively.
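As a self-contained illustration — a sketch on assumed toy linear data, not the paper's experimental code or its ten-variable benchmark — several of the measures listed above can be estimated with standard Python tooling:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, d = 1000, 3
X = rng.normal(size=(N, d))
beta = np.array([3.0, 1.0, 0.0])               # toy ground truth: X_0 > X_1 > X_2
y = X @ beta + rng.normal(scale=0.5, size=N)

# Correlation methods (Eqs. 7-9).
pearson  = [stats.pearsonr(X[:, i], y)[0]   for i in range(d)]
spearman = [stats.spearmanr(X[:, i], y)[0]  for i in range(d)]
kendall  = [stats.kendalltau(X[:, i], y)[0] for i in range(d)]

# Standardized regression coefficients (Eq. 10): beta_i * sigma_{X_i} / sigma_Y.
coef = np.linalg.lstsq(np.column_stack([np.ones(N), X]), y, rcond=None)[0]
src = coef[1:] * X.std(axis=0) / y.std()

# Permutation feature importance (Eq. 11) with a squared-error loss,
# using the fitted linear model as \hat{g}.
g = lambda A: np.column_stack([np.ones(len(A)), A]) @ coef
base_loss = np.mean((y - g(X)) ** 2)
pfi = []
for i in range(d):
    Xp = X.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])       # X_i^pi: same marginal, link to Y broken
    pfi.append(np.mean((y - g(Xp)) ** 2) - base_loss)

ranking = list(np.argsort(pfi)[::-1])          # importance ranking of the features
```

On independent inputs like these, all four estimates recover the same ranking; with correlated or autocorrelated inputs — the scenarios examined in the virtual experiment — the methods need not agree, which is precisely what the CPI comparison quantifies.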
B Tables
In this section, we report the mean values of the CPI index, calculated over the 100 simulations generated for the 30 case studies with \(N=\) 100, 1000, and 10,000, for the four classes of methods considered.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cappelli, F., Grimaldi, S. Feature importance measures for hydrological applications: insights from a virtual experiment. Stoch Environ Res Risk Assess 37, 4921–4939 (2023). https://doi.org/10.1007/s00477-023-02545-7