TY - JOUR
T1 - Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution
AU - Wang, Ying
AU - Rudner, Tim G.J.
AU - Wilson, Andrew Gordon
N1 - Publisher Copyright:
© 2023 Neural information processing systems foundation. All rights reserved.
PY - 2023
Y1 - 2023
AB - Vision-language pretrained models have seen remarkable success, but their application to safety-critical settings is limited by their lack of interpretability. To improve the interpretability of vision-language models such as CLIP, we propose a multimodal information bottleneck (M2IB) approach that learns latent representations that compress irrelevant information while preserving relevant visual and textual features. We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models, increasing attribution accuracy and improving the interpretability of such models when applied to safety-critical domains such as healthcare. Crucially, unlike commonly used unimodal attribution methods, M2IB does not require ground truth labels, making it possible to audit representations of vision-language pretrained models when multiple modalities but no ground-truth data is available. Using CLIP as an example, we demonstrate the effectiveness of M2IB attribution and show that it outperforms gradient-based, perturbation-based, and attention-based attribution methods both qualitatively and quantitatively.
UR - http://www.scopus.com/inward/record.url?scp=85191163994&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85191163994&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85191163994
SN - 1049-5258
VL - 36
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Y2 - 10 December 2023 through 16 December 2023
ER -