Towards robustness of post hoc Explainable AI methods
Title | Towards robustness of post hoc Explainable AI methods |
---|---|
Summary | Towards robustness of post hoc Explainable AI methods |
Keywords | Explainable AI, Robustness, Adversarial attacks |
TimeFrame | Fall 2023 |
References | [[References::[1] Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?: Explaining the Predictions of Any Classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. [2] Lundberg, Scott M., and Su-In Lee. "A Unified Approach to Interpreting Model Predictions." arXiv preprint arXiv:1705.07874 (2017). [3] Aïvodji, Ulrich, et al. "Fooling SHAP with Stealthily Biased Sampling." The Eleventh International Conference on Learning Representations. 2023. [4] Laberge, Gabriel, et al. "Fool SHAP with Stealthily Biased Sampling." arXiv preprint arXiv:2205.15419 (2022). [5] Slack, Dylan, et al. "Fooling LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods." Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 2020. [6] Friedman, Jerome H. "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics (2001): 1189-1232. [7] Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." arXiv preprint arXiv:1312.6034 (2013). [8] Baniecki, Hubert, Wojciech Kretowicz, and Przemyslaw Biecek. "Fooling Partial Dependence via Data Poisoning." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham: Springer Nature Switzerland, 2022. [9] Dimanov, Botty, et al. "You Shouldn't Trust Me: Learning Models Which Conceal Unfairness from Multiple Explanation Methods." ECAI 2020. IOS Press, 2020. 2473-2480. [10] Saito, Sean, et al. "Improving LIME Robustness with Smarter Locality Sampling." arXiv preprint arXiv:2006.12302 (2020).]] |
Prerequisites | |
Author | |
Supervisor | Parisa Jamshidi, Peyman Mashhadi, Jens Lundström |
Level | Master |
Status | Open |
Post hoc explanation methods such as LIME [1] and SHAP [2] have been shown to be susceptible to adversarial attacks because of their internal perturbation mechanisms [3, 4]. For example, a biased model can be maliciously altered so that it fools the explanation methods and appears unbiased [5]. Similar attacks exist for Partial Dependence Plots (PDP) [6] and gradient-based approaches [7], each tailored to the specific weaknesses of the targeted method [8, 9]. Almost every industrial sector is moving towards adopting AI, yet a lack of trust remains a barrier that Explainable AI (XAI) can help overcome; it is therefore of great importance to make XAI methods robust to adversarial attacks. This project aims to explore a chosen post hoc XAI method and equip it with a mechanism that makes it robust to adversarial attacks [10].
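To illustrate why perturbation-based explainers are vulnerable, the sketch below mimics the "scaffolding" idea described by Slack et al. [5]: an out-of-distribution detector distinguishes real data points from explainer-style perturbations, and an adversarial wrapper answers biased on real inputs but innocuous on perturbed ones. This is a minimal, hypothetical illustration, not the authors' implementation; all names (biased_model, scaffolded_predict, the Gaussian stand-in for the explainer's perturbation distribution) are assumptions introduced for this sketch.

```python
# Illustrative sketch of a scaffolding-style attack on perturbation-based explainers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy tabular data: column 0 plays the role of a "sensitive" feature.
X = rng.normal(size=(2000, 5))

def biased_model(X):
    """Discriminatory model: predicts purely from the sensitive feature 0."""
    return (X[:, 0] > 0).astype(int)

def innocuous_model(X):
    """Harmless-looking model shown to the explainer: uses feature 1 instead."""
    return (X[:, 1] > 0).astype(int)

# LIME/SHAP query the model on perturbed samples that tend to lie off the data
# manifold. The attacker trains a detector to separate real points from such
# perturbations (here crudely approximated by wider Gaussian noise).
X_perturbed = rng.normal(scale=2.0, size=X.shape)
ood_detector = RandomForestClassifier(n_estimators=100, random_state=0)
ood_detector.fit(np.vstack([X, X_perturbed]),
                 np.hstack([np.ones(len(X)), np.zeros(len(X_perturbed))]))

def scaffolded_predict(X):
    """Adversarial wrapper: biased on real data, unbiased on explainer queries."""
    is_real = ood_detector.predict(X).astype(bool)
    out = np.empty(len(X), dtype=int)
    out[is_real] = biased_model(X[is_real])
    out[~is_real] = innocuous_model(X[~is_real])
    return out

# Deployed predictions stay biased, yet an explainer sampling off-manifold
# perturbations would mostly see the innocuous model and attribute importance
# to feature 1 rather than the sensitive feature 0.
print("Agreement with biased model on real data:",
      (scaffolded_predict(X) == biased_model(X)).mean())
```

A robustness mechanism for the chosen post hoc method would aim to close exactly this gap, for example by generating explanation samples that stay on the data manifold, in the spirit of the smarter locality sampling proposed for LIME [10].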