Towards robustness of post hoc Explainable AI methods

Title Towards robustness of post hoc Explainable AI methods
Summary Towards robustness of post hoc Explainable AI methods
Keywords Explainable AI, Robustness, Adversarial attacks
TimeFrame Fall 2023
References [1] Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why should I trust you?' Explaining the predictions of any classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.

[2] Lundberg, Scott M., and Su-In Lee. "A unified approach to interpreting model predictions." arXiv preprint arXiv:1705.07874 (2017).

[3] Aïvodji, Ulrich, et al. "Fooling SHAP with Stealthily Biased Sampling." The Eleventh International Conference on Learning Representations. 2023.

[4] Laberge, Gabriel, et al. "Fool SHAP with Stealthily Biased Sampling." arXiv preprint arXiv:2205.15419 (2022).

[5] Slack, Dylan, et al. "Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods." Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 2020.

[6] Friedman, Jerome H. "Greedy function approximation: A gradient boosting machine." Annals of Statistics (2001): 1189-1232.

[7] Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps." arXiv preprint arXiv:1312.6034 (2013).

[8] Baniecki, Hubert, Wojciech Kretowicz, and Przemyslaw Biecek. "Fooling partial dependence via data poisoning." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham: Springer Nature Switzerland, 2022.

[9] Dimanov, Botty, et al. "You shouldn’t trust me: Learning models which conceal unfairness from multiple explanation methods." ECAI 2020. IOS Press, 2020. 2473-2480.

[10] Saito, Sean, et al. "Improving LIME robustness with smarter locality sampling." arXiv preprint arXiv:2006.12302 (2020).

Prerequisites
Author
Supervisor Parisa Jamshidi, Peyman Mashhadi, Jens Lundström
Level Master
Status Open


Post hoc explanation methods such as LIME [1] and SHAP [2] have been shown to be susceptible to adversarial attacks because of their internal perturbation mechanisms [3, 4]. For example, a biased model can be maliciously altered so that it fools the explanation method and appears unbiased [5]. Similar attacks exist against Partial Dependence Plots (PDP) [6] and gradient-based approaches [7], each exploiting the specific weaknesses of the targeted method [8, 9]. As almost every industrial sector moves towards adopting AI, lack of trust remains a barrier that Explainable AI (XAI) can help alleviate; it is therefore of great importance to make XAI methods robust to adversarial attacks. This project aims to explore a chosen post hoc XAI method and equip it with a mechanism that makes it robust to such attacks [10].
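
To make the threat model concrete, the minimal sketch below (plain Python with scikit-learn; all class and variable names are illustrative and not taken from the cited papers) mimics the scaffolding-style attack studied by Slack et al. [5]: a wrapper model routes inputs that look like real data to a biased model, while the off-manifold perturbations generated by LIME or SHAP are routed to an innocuous model, so the resulting explanations hide the bias.

import numpy as np
from sklearn.ensemble import IsolationForest

class ScaffoldedModel:
    """Illustrative adversarial wrapper in the spirit of Slack et al. [5]."""

    def __init__(self, biased_model, innocuous_model, reference_data):
        self.biased_model = biased_model        # model actually used on real inputs
        self.innocuous_model = innocuous_model  # model shown to the explainer
        # Simple out-of-distribution heuristic: perturbations produced by
        # LIME/SHAP often fall off the data manifold and get flagged as outliers.
        self.ood_detector = IsolationForest(random_state=0).fit(reference_data)

    def predict(self, X):
        X = np.asarray(X)
        on_manifold = self.ood_detector.predict(X) == 1  # +1 = looks like real data
        preds = np.empty(len(X), dtype=int)
        if on_manifold.any():
            preds[on_manifold] = self.biased_model.predict(X[on_manifold])
        if (~on_manifold).any():
            preds[~on_manifold] = self.innocuous_model.predict(X[~on_manifold])
        return preds

Explaining the wrapper's predict function with LIME or SHAP would therefore attribute importance according to the innocuous model while the biased model decides on real data; detecting or withstanding exactly this kind of manipulation is what a robust post hoc method developed in this project should target.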