Towards robustness of post hoc Explainable AI methods

Title Towards robustness of post hoc Explainable AI methods
Summary Towards robustness of post hoc Explainable AI methods
Keywords Explainable AI, Robustness, Adversarial attacks
TimeFrame Fall 2023
References [[References::[1] Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. ""Why Should I Trust You?": Explaining the Predictions of Any Classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.

[2] Lundberg, Scott M., and Su-In Lee. "A Unified Approach to Interpreting Model Predictions." arXiv preprint arXiv:1705.07874 (2017).

[3] Aïvodji, Ulrich, et al. "Fooling SHAP with Stealthily Biased Sampling." The Eleventh International Conference on Learning Representations (ICLR). 2023.

[4] Laberge, Gabriel, et al. "Fool SHAP with Stealthily Biased Sampling." arXiv preprint arXiv:2205.15419 (2022).

[5] Slack, Dylan, et al. "Fooling lime and shap: Adversarial attacks on post hoc explanation methods." Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 2020.

[6] Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of statistics (2001): 1189-1232.

[7] Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps." arXiv preprint arXiv:1312.6034 (2013).

[8] Baniecki, Hubert, Wojciech Kretowicz, and Przemyslaw Biecek. "Fooling partial dependence via data poisoning." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham: Springer Nature Switzerland, 2022.

[9] Dimanov, Botty, et al. "You shouldn’t trust me: Learning models which conceal unfairness from multiple explanation methods." ECAI 2020. IOS Press, 2020. 2473-2480.

[10] Saito, Sean, et al. "Improving lime robustness with smarter locality sampling." arXiv preprint arXiv:2006.12302 (2020).]]

Supervisor Parisa Jamshidi, Peyman Mashhadi, Jens Lundström
Level Master
Status Open

Post hoc explanation methods such as LIME [1] and SHAP [2] have been shown to be susceptible to adversarial attacks because of their internal perturbation mechanisms [3, 4]. For example, a biased model can be maliciously altered so that it fools the explanation method into appearing unbiased [5]. Attacks have also been proposed against Partial Dependence Plots (PDP) [6] and gradient-based approaches [7], each crafted around the weaknesses of the targeted method [8, 9]. Almost every industrial sector is moving towards adopting AI, but a lack of trust remains a barrier; Explainable AI can help alleviate this barrier, which makes it immensely important that XAI methods be robust to adversarial attacks. This project aims to explore a chosen post hoc XAI method and equip it with a mechanism that makes it robust to such attacks [10].
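To illustrate why perturbation-based explainers are vulnerable, the scaffolding idea from Slack et al. [5] can be sketched in a few lines: an adversarial wrapper detects whether an input looks like an explainer-generated perturbation (which tends to fall off the data manifold) and, if so, routes it to an innocuous model instead of the biased one. Everything below is a toy illustration, not the authors' implementation: the data, the two model functions, and the off-manifold heuristic are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: feature 0 is a binary sensitive attribute,
# feature 1 is an unrelated continuous feature.
X_real = np.column_stack([
    rng.integers(0, 2, 500).astype(float),
    rng.normal(0.0, 1.0, 500),
])

def biased_model(X):
    # Discriminates purely on the sensitive attribute (feature 0).
    return (X[:, 0] > 0.5).astype(int)

def innocuous_model(X):
    # Ignores the sensitive attribute entirely.
    return (X[:, 1] > 0.0).astype(int)

def looks_perturbed(X):
    # Crude off-manifold detector: LIME-style Gaussian perturbations
    # almost never land exactly on the binary values that feature 0
    # takes in the real data.
    return ~np.isclose(X[:, 0], np.round(X[:, 0]))

def adversarial_model(X):
    # Scaffolding attack: behave fairly on detected perturbations,
    # discriminate on genuine inputs.
    out = biased_model(X)
    mask = looks_perturbed(X)
    out[mask] = innocuous_model(X[mask])
    return out
```

Because LIME builds its local surrogate from exactly those perturbed samples, the explanation it produces reflects `innocuous_model` and attributes no importance to the sensitive attribute, even though real inputs are still classified by `biased_model`. Robustness mechanisms of the kind this project targets aim to close this gap, for example by sampling perturbations that stay on the data manifold [10].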