Abstract
Machine learning models are used in many sensitive areas where, besides predictive accuracy, their comprehensibility is also essential. Interpretability of prediction models is necessary for determining their biases and causes of errors, and is a prerequisite for users' trust. For complex state-of-the-art black-box models, post-hoc, model-agnostic explanation techniques are an established solution. Popular and effective techniques, such as IME, LIME, and SHAP, perturb instance features to explain individual predictions. Recently, (Slack et al. in Fooling LIME and SHAP: Adversarial attacks on post-hoc explanation methods, 2020) called their robustness into question by showing that their outcomes can be manipulated because of the inadequate perturbation sampling they employ. This weakness would allow owners of sensitive models to deceive inspection and hide potentially unethical or illegal biases in their predictive models. Such a possibility could undermine public trust in machine learning models and lead to legal restrictions on their use. We show that better sampling in these explanation methods prevents malicious manipulation. The proposed sampling uses data generators that learn the training-set distribution and generate perturbation instances much more similar to the training set. We show that the improved sampling increases the robustness of LIME and SHAP, while the previously untested IME is the most robust of the three. Further ablation studies show how the enhanced sampling changes the quality of explanations, reveal differences between data generators, and analyze the effect of different levels of conservatism in the deployment of biased classifiers.
Data availability
All the datasets used are publicly available; the links are provided in the paper.
Code availability
The source code is freely available; the link is provided in the paper.
References
Alvarez-Melis D, Jaakkola TS (2018) On the robustness of interpretability methods. In: ICML workshop on human interpretability in machine learning (WHI 2018)
Angwin J, Larson J, Mattu S, Kirchner L (2016) Machine bias. ProPublica
Apley DW, Zhu J (2020) Visualizing the effects of predictor variables in black box supervised learning models. J R Stat Soc Ser B 82(4):1059–1086
Barber R, Candès E (2015) Controlling the false discovery rate via knockoffs. Ann Stat 43:2055–2085
Bates S, Candès E, Janson L, Wang W (2021) Metropolized knockoff sampling. J Am Stat Assoc 116(535):1413–1427
Candès E, Fan Y, Janson L, Lv J (2018) Panning for gold: ‘model-x’ knockoffs for high dimensional controlled variable selection. J R Stat Soc Ser B 80(3):551–577
Chakraborty J, Peng K, Menzies T (2020) Making fair ML software using trustworthy explanation. In: 2020 35th IEEE/ACM International conference on automated software engineering (ASE), pp 1229–1233
Dimanov B, Bhatt U, Jamnik M, Weller A (2020) You shouldn’t trust me: Learning models which conceal unfairness from multiple explanation methods. Proc ECAI 2020:2473–2480
Doersch C (2016) Tutorial on variational autoencoders. ArXiv preprint arXiv:1606.05908
Dombrowski AK, Alber M, Anders C, Ackermann M, Müller KR, Kessel P (2019) Explanations can be manipulated and geometry is to blame. In: Advances in neural information processing systems, pp 13589–13600
Dua D, Graff C (2019) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 9 Aug, 2020
Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: Proceedings of international conference on machine learning (ICML), pp 1050–1059
Ghorbani A, Abid A, Zou J (2019) Interpretation of neural networks is fragile. Proc AAAI Conf Artif Intell 33:3681–3688
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Heo J, Joo S, Moon T (2019) Fooling neural network interpretations via adversarial model manipulation. In: Advances in neural information processing systems, pp 2925–2936
Hooker G, Mentch L, Zhou S (2021) Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Stat Comput 31:82
Kroll JA, Huey J, Barocas S, Felten EW, Reidenberg JR, Robinson DG, Yu H (2017) Accountable algorithms. Univ Pa Law Rev 165(3):633–705
Lipton ZC (2018) The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16(3):31–57
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 30:4765–4774
Miok K, Nguyen-Doan D, Zaharie D, Robnik-Šikonja M (2019) Generating data using Monte Carlo dropout. In: International conference on intelligent computer communication and processing (ICCP), pp 509–515
Molnar C, König G, Herbinger J, Freiesleben T, Dandl S, Scholbeck CA, Casalicchio G, Grosse-Wentrup M, Bischl B (2021) General pitfalls of model-agnostic interpretation methods for machine learning models. ArXiv preprint arXiv:2007.04131
Moody J, Darken CJ (1989) Fast learning in networks of locally-tuned processing units. Neural Comput 1:281–294
Mujkic E, Klingner D (2019) Dieselgate: how hubris and bad leadership caused the biggest scandal in automotive history. Public Integr 21(4):365–377
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Redmond M, Baveja A (2002) A data-driven software tool for enabling cooperative information sharing among police departments. Eur J Oper Res 141:660–678
Ribeiro MT, Singh S, Guestrin C (2016) “Why should I trust you?”: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1135–1144
Robnik-Šikonja M (2018) Dataset comparison workflows. Int J Data Sci 3:126–145
Robnik-Šikonja M (2019) semiArtificial: Generator of semi-artificial data. https://cran.r-project.org/package=semiArtificial, R package version 2.3.1
Robnik-Šikonja M (2016) Data generators for learning systems based on RBF networks. IEEE Trans Neural Netw Learn Syst 27(5):926–938
Robnik-Šikonja M, Kononenko I (2008) Explaining classifications for individual instances. IEEE Trans Knowl Data Eng 20:589–600
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Saito S, Chua E, Capel N, Hu R (2020) Improving LIME robustness with smarter locality sampling. ArXiv preprint arXiv:2006.12302
Selbst AD, Barocas S (2018) The intuitive appeal of explainable machines. Fordham Law Rev 87:1085
Shapley LS (1988) A value for n-person games. In: Roth AE (ed) The Shapley value: essays in Honor of Lloyd S. Shapley. Cambridge University Press, Cambridge, pp 31–40
Slack D, Hilgard S, Jia E, Singh S, Lakkaraju H (2020) Fooling LIME and SHAP: Adversarial attacks on post-hoc explanation methods. In: AAAI/ACM Conference on AI, Ethics, and Society (AIES)
Štrumbelj E, Kononenko I (2010) An efficient explanation of individual classifications using game theory. J Mach Learn Res 11:1–18
Štrumbelj E, Kononenko I (2013) Explaining prediction models and individual predictions with feature contributions. Knowl Inf Syst 41:647–665
Štrumbelj E, Kononenko I, Robnik-Šikonja M (2009) Explaining instance classifications with interactions of subsets of feature values. Data Knowl Eng 68(10):886–904
Acknowledgements
The work was partially supported by the Slovenian Research Agency (ARRS) core research programme P6-0411. This paper is supported by European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media). We thank the anonymous reviewers who helped to substantially improve our work with insightful comments and suggestions.
Funding
The work was partially supported by the Slovenian Research Agency (ARRS) core research programme P6-0411 and by the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 825153, Project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).
Author information
Authors and Affiliations
Contributions
DV conducted the experiments and wrote the initial version of the paper. MR led the research and participated in writing the paper.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Responsible editor: Martin Atzmueller, Johannes Fürnkranz, Tomáš Kliegr and Ute Schmid.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Derivation of IME convergence rate
The following derivation was presented by Štrumbelj and Kononenko (2010). Let \( \sigma ^2_i \) be the variance of the i-th feature's sampling population for instance w, i.e. the population of \( V_{\pi , w} = f(w_{[w_j = x_j,\, j\in Pre^i(\pi )\cup \{i\}]}) - f(w_{[w_j = x_j,\, j\in Pre^i(\pi )]}) \). The value of \( \sigma ^2_i \) is estimated from \( m_{min} \) samples drawn for each feature at the beginning of the explanation process. The desired accuracy of the IME method is given by the pair \( (1-\alpha , e) \), which limits the size of the error e through the following expression (recall Eq. (2.4)):
Let \( m_i(1-\alpha , e) \) be the minimal number of samples for the i-th feature that satisfies Eq. (A.1). The value of \( m_i(1-\alpha , e) \) can be calculated as
where \( z_{1-\alpha } \) is implicitly defined by \( P(|X| > z_{1-\alpha }) = \alpha \) for an \( {\mathscr {N}}(0, 1) \) distributed random variable X. Let \( m(1-\alpha , e) \) be the minimal total number of samples such that each feature satisfies (A.1). The value of \( m(1-\alpha , e) \) can be calculated as
where n denotes the number of features and \( \overline{\sigma ^2} \) denotes the mean \( \frac{1}{n}\sum _{i=1}^{n}\sigma _i^2 \).
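The displayed formulas in this appendix can be summarized as follows. This is a reconstruction from the definitions in the surrounding text (following Štrumbelj and Kononenko 2010), with \( \hat{\varphi }_i \) denoting the sampling estimate of feature i's contribution \( \varphi _i \); it is not a verbatim copy of the original equations:

```latex
% Eq. (A.1): the accuracy requirement for the estimate of feature i
P\left( \left| \hat{\varphi}_i - \varphi_i \right| < e \right) \ge 1 - \alpha

% Minimal per-feature sample size satisfying (A.1),
% using the normal approximation of the sampling distribution:
m_i(1-\alpha, e) = \frac{z_{1-\alpha}^2 \, \sigma_i^2}{e^2}

% Minimal total number of samples so that every feature satisfies (A.1):
m(1-\alpha, e) = \sum_{i=1}^{n} m_i(1-\alpha, e)
               = \frac{z_{1-\alpha}^2 \, n \, \overline{\sigma^2}}{e^2}
```

The last equality follows directly from \( \overline{\sigma ^2} = \frac{1}{n}\sum _{i=1}^{n}\sigma _i^2 \), so \( \sum _i \sigma _i^2 = n\,\overline{\sigma ^2} \).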
Training discriminator function of the attacker
The training of the attacker's decision models d (see Fig. 2) is described in Algorithms 1, 2, and 3. We used a slightly different algorithm for each of the three explanation methods, LIME, SHAP, and IME, as each method uses different sampling. The algorithms first create out-of-distribution instances by method-specific sampling. The training sets for the decision models are created by labelling the created instances with 0, while the instances from the sample S (to which the attacker has access) from distribution \( X_{dist} \) are labelled with 1. Finally, the machine learning model dModel is trained on this training set and returned as d. In our experiments, we used a random forest classifier as dModel and the training part of each evaluation dataset as S.
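The common skeleton of the three algorithms can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the toy Gaussian-noise perturbation stand in for the method-specific LIME/SHAP/IME sampling described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_discriminator(S, perturb_fn, n_perturbed=None, seed=0):
    """Train the attacker's decision model d (sketch of Algorithms 1-3).

    S          -- in-distribution sample the attacker has access to
    perturb_fn -- method-specific sampling that creates the
                  out-of-distribution instances (hypothetical stand-in)
    """
    rng = np.random.default_rng(seed)
    n_perturbed = n_perturbed or len(S)
    # Out-of-distribution instances are labelled 0, real instances 1.
    X_ood = perturb_fn(S, n_perturbed, rng)
    X = np.vstack([S, X_ood])
    y = np.concatenate([np.ones(len(S)), np.zeros(len(X_ood))])
    # dModel is a random forest, as in the paper's experiments.
    d = RandomForestClassifier(random_state=seed).fit(X, y)
    return d

def lime_like_perturbation(S, n, rng):
    """Toy stand-in for LIME's sampling: Gaussian noise around drawn rows."""
    idx = rng.integers(0, len(S), size=n)
    return S[idx] + rng.normal(scale=1.0, size=(n, S.shape[1]))
```

The resulting model d predicts 1 for instances it considers to come from \( X_{dist} \) and 0 for instances it considers perturbation artefacts, which is exactly the signal the attacker exploits.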
Heatmaps as tables
We present the information contained in Figs. 4 and 5 in a more detailed tabular form in Tables 8 and 9, respectively.
Different prediction thresholds
The full results of experiments described in Sect. 5.6 are shown in Tables 10, 11, and 12.
MvNormData dataset
To test the robustness of the improved explanation methods on highly correlated data in Sect. 5.2, we created the synthetic MvNormData dataset based on the multivariate normal distribution. The dataset consists of 2000 instances described by 10 features. As we wanted the features to be zero-centred, we sampled instances from \( {\mathscr {N}}(0, \varSigma ) \). We randomly generated the symmetric positive definite \( 10\times 10 \) matrix \( \varSigma \) by first generating its Cholesky factor, a lower triangular \( 10\times 10 \) matrix V. We generated each element on and below the diagonal of V independently as a random integer between \( -5 \) and 5. To ensure that the diagonal elements are positive, we transformed them with:
where \( v_{ii} \) represents the i-th diagonal element of V. We obtained the following matrix V:
By setting \( \varSigma = V^T V \), we obtained the following \( \varSigma \):
With this \( \varSigma \), we randomly sampled 2000 instances from \( {\mathscr {N}}(0, \varSigma ) \) distribution.
As we wanted our unbiased model \( \psi \) to be simple and to fit the output variable y perfectly, we made y linearly dependent on the instances x. Therefore, we sampled y from the \( Bernoulli(X\cdot \beta ) \) distribution. The coefficient vector \( \beta \) was randomly generated with integer components between \( -5 \) and 5, while \( \beta _1 \) was set to 0, as we chose \( X_1 \) as the sensitive feature and did not want it to have a direct impact on the unbiased classifier \( \psi \). We obtained the following \( \beta \):
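The construction above can be sketched as follows. Note the hedges: the exact diagonal transform, the realized matrices V, \( \varSigma \), and vector \( \beta \), and the mapping of \( X\cdot \beta \) to a Bernoulli parameter are assumptions here (the sketch uses \( |v_{ii}|+1 \) and a logistic link), not the values used in the paper.

```python
import numpy as np

def make_mvnorm_data(n=2000, p=10, seed=0):
    """Sketch of the MvNormData construction (assumed details marked)."""
    rng = np.random.default_rng(seed)
    # Lower-triangular V with random integer entries in [-5, 5] ...
    V = np.tril(rng.integers(-5, 6, size=(p, p))).astype(float)
    # ... with the diagonal forced positive (assumed transform: |v| + 1).
    d = np.arange(p)
    V[d, d] = np.abs(V[d, d]) + 1
    Sigma = V.T @ V  # symmetric positive definite by construction
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # Integer coefficients; beta_1 = 0 so the sensitive feature X_1
    # has no direct impact on the unbiased classifier.
    beta = rng.integers(-5, 6, size=p).astype(float)
    beta[0] = 0.0
    # Assumed logistic link to map X @ beta into [0, 1] for sampling y.
    prob = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
    y = rng.binomial(1, prob)
    return X, y, Sigma, beta
```

Because V has a strictly positive diagonal, it is invertible, so \( \varSigma = V^T V \) is guaranteed to be symmetric positive definite.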
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vreš, D., Robnik-Šikonja, M. Preventing deception with explanation methods using focused sampling. Data Min Knowl Disc (2022). https://doi.org/10.1007/s10618-022-00900-w