Research paper: Estimating the Effective Rank of Vision Transformers via Low-Rank Factorization
Neural networks are typically over-parameterized relative to the complexity of the functions they learn. A Vision Transformer might have millions of parameters, but it's not clear how many of those parameters are actually necessary for the learned behavior. Put differently: what is the intrinsic dimensionality of a trained network?
This question matters for practical reasons (compression, efficiency) and theoretical ones (understanding generalization and capacity). While we have well-developed notions of effective rank for matrices, extending this to neural networks—which consist of many matrices arranged in complex ways—is less obvious.
The concept of effective rank is well-established in linear algebra and matrix theory. For a given matrix, the effective rank estimates how many singular values meaningfully contribute to its structure, as opposed to just counting non-zero singular values. Various definitions exist, often based on entropy measures or thresholding the singular value spectrum.
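To make the matrix-level notion concrete, here is a minimal sketch (Python/NumPy, my own choice of notation rather than anything specified in this paper) of one common entropy-based definition: the exponential of the Shannon entropy of the normalized singular value spectrum.

```python
import numpy as np

def effective_rank(W: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    singular value spectrum, normalized to sum to one."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)          # treat singular values as a probability distribution
    p = p[p > eps]                   # drop numerically negligible entries
    return float(np.exp(-(p * np.log(p)).sum()))

# A 512x512 matrix with planted rank-8 structure plus small noise:
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 512)) + 1e-3 * rng.normal(size=(512, 512))
print(effective_rank(W))             # far below the nominal 512, close to the planted rank of 8
```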
However, applying this concept to neural networks is not straightforward. A trained network doesn't have a single weight matrix—it has many, arranged in complex architectures. More importantly, what matters isn't just the rank of individual weight matrices, but the representational capacity of the learned functions they compute. This raises a question: can we estimate how many dimensions a neural network actually uses in a way that reflects its learned behavior, not just its weight matrices?
The method I explore in this paper is fairly straightforward: constrain student networks to a chosen rank via low-rank factorization of the weight matrices, distill the teacher into each student, and measure how much of the teacher's accuracy each rank can recover.
By systematically varying the rank and observing how performance changes, we can probe where the meaningful capacity lives. The approach treats effective rank as an empirical property measured through distillation, rather than something computed directly from the weight matrices.
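A rough sketch of a single probe in PyTorch is shown below. The SVD-based initialization, the helper names (`factorize_linear`, `distill_step`), and the temperature-scaled KL objective are illustrative assumptions, not necessarily the exact recipe used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense layer W (out x in) with a rank-r bottleneck B @ A,
    initialized from the truncated SVD of W."""
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    A.weight.data = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]    # (rank, in)
    B.weight.data = U[:, :rank] * S[:rank].sqrt()               # (out, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)

def distill_step(student, teacher, x, optimizer, T: float = 2.0) -> float:
    """One distillation step: match the teacher's temperature-softened outputs."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```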
One choice I made was to treat effective rank as a range rather than a single number. The idea is to identify the smallest contiguous set of ranks where a student can achieve 85-95% of the teacher's accuracy. This seems more realistic than pinpointing a single threshold, since the transition from insufficient to sufficient capacity is gradual.
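As a concrete illustration of this choice, a minimal helper that extracts the range from measured (rank, accuracy) pairs could look like the following; the measurements and the exposed thresholds are hypothetical.

```python
def effective_rank_range(ranks, accs, teacher_acc, lo=0.85, hi=0.95):
    """Ranks whose student accuracy, relative to the teacher, falls inside the
    [lo, hi] transition band; returns (min, max) or None if none qualify."""
    in_band = [r for r, a in zip(ranks, accs) if lo <= a / teacher_acc <= hi]
    return (min(in_band), max(in_band)) if in_band else None

# Hypothetical measurements, one distilled student per rank:
ranks = [4, 8, 16, 32, 64, 128]
accs  = [0.41, 0.58, 0.68, 0.73, 0.76, 0.77]
print(effective_rank_range(ranks, accs, teacher_acc=0.78))   # -> (16, 32)
```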
To make the estimates more stable, I fit the accuracy vs. rank curve with a monotone PCHIP interpolant. I also compute what I call an "effective knee"—the rank that maximizes perpendicular distance between the smoothed curve and its endpoint secant. This is just a way to identify where improvements start to saturate.
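A sketch of that computation with SciPy's PchipInterpolator is below. Normalizing both axes before measuring perpendicular distance is my added assumption (to keep the rank scale from dominating), and the dense evaluation grid size is arbitrary.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def effective_knee(ranks, accs, num=200):
    """Rank at which the PCHIP-smoothed accuracy curve is farthest, in
    perpendicular distance, from the secant joining its two endpoints."""
    f = PchipInterpolator(ranks, accs)                    # monotone, shape-preserving fit
    x = np.linspace(min(ranks), max(ranks), num)
    y = f(x)
    # Normalize both axes to [0, 1] so rank units do not dominate the distance.
    xn = (x - x[0]) / (x[-1] - x[0])
    yn = (y - y.min()) / (y.max() - y.min() + 1e-12)
    p0 = np.array([xn[0], yn[0]])
    d = np.array([xn[-1], yn[-1]]) - p0
    d = d / np.linalg.norm(d)
    pts = np.stack([xn, yn], axis=1) - p0
    dist = np.abs(pts[:, 0] * d[1] - pts[:, 1] * d[0])    # 2D cross-product magnitude
    return x[int(np.argmax(dist))]

ranks = [4, 8, 16, 32, 64, 128]
accs  = [0.41, 0.58, 0.68, 0.73, 0.76, 0.77]
print(effective_knee(ranks, accs))                        # knee well below the largest rank
```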
On Vision Transformers, the method shows that substantial compression is possible. Through systematic low-rank factorization and distillation, I was able to reduce parameters by roughly 11× while retaining about 94.7% of the teacher's accuracy. This suggests that these models, despite having millions of parameters, operate in considerably lower-dimensional spaces than their nominal capacity.
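To see where reductions of this magnitude can come from, consider the parameter count of a single factorized projection: a dense layer costs d_in * d_out parameters, while a rank-r factorization costs r * (d_in + d_out). The dimensions below are illustrative (ViT-Base-like), not the exact configuration used in the experiments.

```python
# Illustrative ViT-Base-like dimensions; not the paper's exact configuration.
d_in, d_out, r = 768, 3072, 96
dense_params   = d_in * d_out                 # 2,359,296
lowrank_params = r * (d_in + d_out)           # 368,640
print(dense_params / lowrank_params)          # ~6.4x fewer parameters in this one layer
# Any rank below d_in * d_out / (d_in + d_out) (~614 here) already saves parameters.
```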
The framework is automated and doesn't require manual tuning for each architecture. This makes it possible to systematically compare intrinsic dimensionality across different models, datasets, and training conditions, though more work is needed to understand how these measurements relate to generalization and other properties.
Understanding intrinsic dimensionality could help address several questions in deep learning, from how aggressively trained models can be compressed without losing accuracy to why heavily over-parameterized networks generalize as well as they do.
This work has several limitations. The method requires training multiple student networks, which is computationally expensive. The choice of distillation objective and training procedure can affect the results. And most importantly, "effective rank" as measured here is just one proxy for intrinsic dimensionality—there are likely other useful ways to measure this.
The framework is also limited to architectures where low-rank factorization makes sense. Extending this to other types of structure (sparsity, quantization, etc.) would be interesting but isn't covered by the current approach.