Authors: A. Doumanoglou, K. Driessens, D. Zarpalas

Year: 2026

Venue: Transactions on Machine Learning Research
Empirical evidence shows that deep vision networks often represent concepts as directions in latent space, with concept information written along directional components of the input's vector representation. However, the mechanism that encodes (writes) and decodes (reads) concept information to and from vector representations is not directly accessible: it is a latent mechanism that emerges naturally from the network's training process. Recovering this mechanism unlocks significant potential to open up the black-box nature of deep networks, enabling understanding, debugging, and improving deep learning models. In this work, we propose an unsupervised method to recover this mechanism. For each concept, we explain that, under the hypothesis of linear concept representations, this mechanism can be implemented with the help of two directions: the first facilitating the encoding of concept information and the second facilitating its decoding. Compared to previous matrix decomposition, autoencoder, and dictionary learning approaches, which rely on reconstructing feature activations, we propose a different perspective for learning these encoding-decoding direction pairs. We identify decoding directions through directional clustering of feature activations, and we introduce signal vectors to estimate encoding directions from a probabilistic perspective. Unlike most other works, we also take advantage of the network's instructions, encoded in its weights, to guide our direction search. To this end, we show that a novel technique called Uncertainty Region Alignment can exploit these instructions to reveal the encoding-decoding mechanism of interpretable concepts that influence the network's predictions. Our thorough and multifaceted analysis shows that, in controlled toy settings with synthetic data, our approach recovers the ground-truth encoding-decoding direction pairs.
In real-world settings, our method effectively reveals the encoding-decoding mechanism of interpretable concepts, often scoring substantially better on interpretability metrics than other unsupervised baselines such as PCA and NMF. Finally, we provide concrete applications of how the learned directions can help open the black box: understanding global model behavior, explaining individual sample predictions in terms of local, spatially aware concept contributions, and intervening on the network's prediction strategy to provide either counterfactual explanations or corrections of erroneous model behavior.
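The linear-representation hypothesis underlying the paper can be illustrated with a minimal sketch: a concept is "written" into a representation by adding a scaled encoding direction, and "read" back by projecting onto a decoding direction. All names and values below are illustrative assumptions, not the paper's implementation; for simplicity the decoding direction is tied to the encoding direction, whereas the paper treats them as a distinct pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # latent dimensionality (illustrative)

# Encoding direction (unit norm); decoding direction tied here for simplicity.
w_enc = rng.normal(size=d)
w_enc /= np.linalg.norm(w_enc)
w_dec = w_enc.copy()

def write_concept(h, strength):
    """Encode concept information into representation h along w_enc."""
    return h + strength * w_enc

def read_concept(h):
    """Decode concept strength from representation h via projection on w_dec."""
    return float(h @ w_dec)

h0 = rng.normal(size=d)          # arbitrary baseline activation
baseline = read_concept(h0)
h1 = write_concept(h0, 3.0)      # write the concept with strength 3.0

# Reading after writing recovers the added strength relative to baseline,
# because w_dec @ w_enc == 1 when the directions are tied and unit-norm.
recovered = read_concept(h1) - baseline
print(round(recovered, 6))  # 3.0
```

When encoding and decoding directions differ, the recovered strength is scaled by their inner product, which is one reason recovering both directions of each pair matters.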