Black-box access is insufficient for rigorous AI audits

S Casper, C Ezell, C Siegmann, N Kolt… - The 2024 ACM …, 2024 - dl.acm.org
External audits of AI systems are increasingly recognized as a key mechanism for AI
governance. The effectiveness of an audit, however, depends on the degree of access …

Sparse feature circuits: Discovering and editing interpretable causal graphs in language models

S Marks, C Rager, EJ Michaud, Y Belinkov… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce methods for discovering and applying sparse feature circuits. These are
causally implicated subnetworks of human-interpretable features for explaining language …

CARZero: Cross-attention alignment for radiology zero-shot classification

H Lai, Q Yao, Z Jiang, R Wang, Z He… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advancement of Zero-Shot Learning in the medical domain has been driven
forward by using pre-trained models on large-scale image-text pairs focusing on image-text …

Interpreting CLIP with sparse linear concept embeddings (SpLiCE)

U Bhalla, A Oesterling, S Srinivas, FP Calmon… - arXiv preprint arXiv …, 2024 - arxiv.org
CLIP embeddings have demonstrated remarkable performance across a wide range of
computer vision tasks. However, these high-dimensional, dense vector representations are …

Decomposing and editing predictions by modeling model computation

H Shah, A Ilyas, A Madry - arXiv preprint arXiv:2404.11534, 2024 - arxiv.org
How does the internal computation of a machine learning model transform inputs into
predictions? In this paper, we introduce a task called component modeling that aims to …

ERM++: An improved baseline for domain generalization

P Teterwak, K Saito, T Tsiligkaridis, K Saenko… - arXiv preprint arXiv …, 2023 - arxiv.org
Domain Generalization (DG) measures a classifier's ability to generalize to new distributions
of data it was not trained on. Recent work has shown that a hyperparameter-tuned Empirical …

Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines

M Toker, H Orgad, M Ventura, D Arad… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the
image generation process. However, the process by which the encoder produces the text …

LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

W Bousselham, A Boggust, S Chaybouti… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Transformers (ViTs), with their ability to model long-range dependencies through self-
attention mechanisms, have become a standard architecture in computer vision. However …

Evolving Interpretable Visual Classifiers with Large Language Models

M Chiquier, U Mall, C Vondrick - arXiv preprint arXiv:2404.09941, 2024 - arxiv.org
Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to
their open-vocabulary flexibility and high performance. However, vision-language models …

Finding Visual Task Vectors

A Hojel, Y Bai, T Darrell, A Globerson, A Bar - arXiv preprint arXiv …, 2024 - arxiv.org
Visual Prompting is a technique for teaching models to perform a visual task via in-context
examples, without any additional training. In this work, we analyze the activations of MAE …