From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

Understanding the internal mechanisms of Vision-Language Models like CLIP is becoming increasingly critical as they see broader deployment. While mechanistic interpretability has made great strides, most existing methods still rely on model activations. This makes them inherently dataset-dependent, vulnerable to data bias, and often restricted to providing only coarse, head-level explanations.

To address these challenges, we introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free and training-free framework that analyzes CLIP’s vision transformer directly in weight space. By applying Singular Value Decomposition (SVD) to the value-output matrices of attention heads, SITH uncovers the principal semantic directions the model uses to process information. We then interpret these directions using COMP (Coherent Orthogonal Matching Pursuit), a novel sparse coding algorithm that decomposes each singular vector into a sparse, semantically coherent combination of human-interpretable concepts.
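As a rough illustration of the first step, the sketch below forms the value-output matrix of a single head and takes its SVD. The weight layout (`w_v` as `d_model x d_head`, `w_o` as `d_head x d_model`), the function name `head_vo_svd`, and the random stand-in weights are assumptions made for this example, not details taken from the SITH implementation.

```python
import numpy as np

def head_vo_svd(w_v: np.ndarray, w_o: np.ndarray):
    """SVD of one attention head's value-output matrix.

    w_v: (d_model, d_head) value projection of a single head (assumed layout).
    w_o: (d_head, d_model) slice of the output projection for the same head.
    """
    w_vo = w_v @ w_o  # (d_model, d_model), rank at most d_head
    u, s, vt = np.linalg.svd(w_vo, full_matrices=False)
    r = w_v.shape[1]  # only the first d_head directions can be non-trivial
    return u[:, :r], s[:r], vt[:r, :]

# Random weights stand in for a real CLIP head (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_head = 64, 8
w_v = rng.standard_normal((d_model, d_head))
w_o = rng.standard_normal((d_head, d_model))
u, s, vt = head_vo_svd(w_v, w_o)
```

Because the product has rank at most `d_head`, the truncated factors reconstruct the full value-output matrix, so nothing is lost by keeping only `d_head` singular vectors per head.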

Figure: Overview of the SITH framework. We isolate the value-output matrix of a CLIP attention head, decompose it via SVD, and interpret the resulting singular vectors as sparse combinations of semantic concepts using the COMP algorithm.
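To make the sparse-coding step concrete, here is a minimal sketch of plain Orthogonal Matching Pursuit, the classical algorithm that COMP's name suggests it extends. This is not COMP itself: the coherence component is not specified here and is omitted. The dictionary of "concept embeddings" is random synthetic data, and the function name `omp` is hypothetical.

```python
import numpy as np

def omp(target: np.ndarray, dictionary: np.ndarray, k: int):
    """Plain Orthogonal Matching Pursuit (a stand-in, not COMP).

    target:     (d,) vector to explain, e.g. one singular vector.
    dictionary: (n_concepts, d) matrix with unit-norm rows.
    Returns the selected row indices and their least-squares coefficients.
    """
    residual = target.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(k):
        scores = np.abs(dictionary @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same atom twice
        selected.append(int(np.argmax(scores)))
        atoms = dictionary[selected].T  # (d, n_selected)
        # Refit all selected atoms jointly, then update the residual.
        coeffs, *_ = np.linalg.lstsq(atoms, target, rcond=None)
        residual = target - atoms @ coeffs
    return selected, coeffs

# Synthetic dictionary: 40 random unit vectors standing in for concept embeddings.
rng = np.random.default_rng(0)
concepts = rng.standard_normal((40, 256))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# A target built from two known concepts, which OMP should recover.
target = 2.0 * concepts[3] - 1.0 * concepts[17]
idx, coeffs = omp(target, concepts, k=2)
```

In a real setting the dictionary rows would be embeddings of candidate concept words, and each selected index would map back to a human-readable label.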

Our analysis reveals that these singular vectors are not merely mathematical abstractions but map to distinct, human-interpretable concepts. We also find that many attention heads exhibit a striking degree of intra-head semantic alignment, where dominant directions are grouped under a cohesive “theme”. For instance, in ViT-L/14, one head specializes in materials (with individual vectors encoding “steel”, “paper”, or “glass”) while another focuses on colors (such as “red”, “purple”, or “orange”). This level of granularity offers a clear advantage over activation-based methods: where a head might be broadly labeled as “color-related”, SITH identifies the precise vectors responsible for specific colors like “red” or “green”.

Remarkably, these functional patterns appear to be universal. We observe the same specialized heads emerging across vastly different architectures, scales, and training regimes, from ViT-B/32 to MobileCLIP. This provides strong weight-space evidence for the Universality Hypothesis, suggesting that diverse models converge on the same fundamental circuits to perform similar semantic operations.

These insights allow us to move beyond passive observation and perform precise, data-free model edits. By surgically modulating the singular values of specific vectors, we can directly control the model’s behavior. For instance, we can suppress “background” vectors to mitigate spurious correlations in datasets like Waterbirds, or remove NSFW concepts to improve model safety. We can also amplify task-relevant features to boost zero-shot classification performance. All of these improvements are achieved without retraining or access to new data.
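One way such an edit could look in code is sketched below: decompose a value-output matrix, rescale chosen singular values, and rebuild the matrix. The function name `rescale_singular_values`, the scaling factors, and the random low-rank matrix are illustrative assumptions; which vectors to edit and by how much is not specified here.

```python
import numpy as np

def rescale_singular_values(w_vo: np.ndarray, scales: dict) -> np.ndarray:
    """Return a copy of w_vo with selected singular values rescaled.

    scales maps a singular-vector index to a factor:
    0.0 suppresses that direction, values above 1.0 amplify it.
    """
    u, s, vt = np.linalg.svd(w_vo, full_matrices=False)
    s_edit = s.copy()
    for i, factor in scales.items():
        s_edit[i] *= factor
    return (u * s_edit) @ vt  # same as u @ diag(s_edit) @ vt

# Random rank-4 matrix standing in for a head's value-output matrix.
rng = np.random.default_rng(0)
w_vo = rng.standard_normal((16, 4)) @ rng.standard_normal((4, 16))

# Suppress direction 0 and double direction 1 (arbitrary choices).
w_edit = rescale_singular_values(w_vo, {0: 0.0, 1: 2.0})
```

Because the singular vectors are orthonormal, rescaling one singular value leaves the contribution of every other direction unchanged, which is what makes this kind of edit targeted.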

Furthermore, SITH provides a unique lens into model adaptation. By analyzing the “fine-tuning delta” in weight space, we demonstrate that adaptation does not drastically alter the model’s semantic foundation. Instead, it subtly reorients the stable feature basis, with the weight updates themselves showing clear semantic alignment with the target domain, such as the emergence of specialized plant-related directions when fine-tuning on Flower102.
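The same decomposition can be applied to a weight update. The sketch below assumes the "fine-tuning delta" is the element-wise difference between fine-tuned and pretrained matrices, and uses a synthetic rank-1 update so the dominant direction is known in advance. The function name `delta_directions` and all values are hypothetical.

```python
import numpy as np

def delta_directions(w_pre: np.ndarray, w_ft: np.ndarray, top_k: int = 1):
    """Top singular values and right singular vectors of the weight update."""
    delta = w_ft - w_pre  # assumed definition of the fine-tuning delta
    _, s, vt = np.linalg.svd(delta, full_matrices=False)
    return s[:top_k], vt[:top_k]

# Synthetic "pretrained" weights plus a small rank-1 update along direction b.
rng = np.random.default_rng(0)
d = 32
w_pre = rng.standard_normal((d, d))
a = rng.standard_normal(d)
b = rng.standard_normal(d)
b /= np.linalg.norm(b)
w_ft = w_pre + 0.1 * np.outer(a, b)

s_top, v_top = delta_directions(w_pre, w_ft)
# Absolute cosine between the recovered direction and b (sign is arbitrary in SVD).
alignment = abs(float(v_top[0] @ b))
```

With real checkpoints, the recovered directions could then be passed to a sparse-coding step to check whether they align with target-domain concepts.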

For a more in-depth look at our findings, including interactive demos and visualizations, please visit the Project Page.
