Research paper: Estimating the Effective Rank of Vision Transformers via Low-Rank Factorization
Neural networks are typically over-parameterized relative to the complexity of the functions they learn. A Vision Transformer might have millions of parameters, but it's not clear how many of those parameters are actually necessary for the learned behavior. Put differently: what is the intrinsic dimensionality of a trained network?
This question matters for practical reasons (compression, efficiency) and theoretical ones (understanding generalization and capacity). While we have well-developed notions of effective rank for matrices, extending this to neural networks—which consist of many matrices arranged in complex ways—is less obvious.
The concept of effective rank is well-established in linear algebra and matrix theory. For a given matrix, the effective rank estimates how many singular values meaningfully contribute to its structure, as opposed to just counting non-zero singular values. Various definitions exist, often based on entropy measures or thresholding the singular value spectrum.
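To make the matrix-level notion concrete, here is a minimal sketch (Python/NumPy, my own choice of notation rather than anything specified in this paper) of one common entropy-based definition: the exponential of the Shannon entropy of the normalized singular value spectrum.

```python
import numpy as np

def effective_rank(W: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    singular value spectrum, normalized to sum to one."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)          # treat singular values as a probability distribution
    p = p[p > eps]                   # drop numerically negligible entries
    return float(np.exp(-(p * np.log(p)).sum()))

# A 512x512 matrix with planted rank-8 structure plus small noise:
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 512)) + 1e-3 * rng.normal(size=(512, 512))
print(effective_rank(W))             # far below the nominal 512, close to the planted rank of 8
```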
However, applying this concept to neural networks is not straightforward. A trained network doesn't have a single weight matrix—it has many, arranged in complex architectures. More importantly, what matters isn't just the rank of individual weight matrices, but the representational capacity of the learned functions they compute. This raises a question: can we estimate how many dimensions a neural network actually uses in a way that reflects its learned behavior, not just its weight matrices?
The method I explore in this paper is fairly straightforward: constrain student networks to a chosen rank via low-rank factorization of the weight matrices, distill the teacher into each student, and measure how much of the teacher's accuracy each rank can recover.
By systematically varying the rank and observing how performance changes, we can probe where the meaningful capacity lives. The approach treats effective rank as an empirical property measured through distillation, rather than something computed directly from the weight matrices.
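A rough sketch of a single probe in PyTorch is shown below. The SVD-based initialization, the helper names (`factorize_linear`, `distill_step`), and the temperature-scaled KL objective are illustrative assumptions, not necessarily the exact recipe used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense layer W (out x in) with a rank-r bottleneck B @ A,
    initialized from the truncated SVD of W."""
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    A.weight.data = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]    # (rank, in)
    B.weight.data = U[:, :rank] * S[:rank].sqrt()               # (out, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)

def distill_step(student, teacher, x, optimizer, T: float = 2.0) -> float:
    """One distillation step: match the teacher's temperature-softened outputs."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```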
One choice I made was to treat effective rank as a range rather than a single number. The idea is to identify the smallest contiguous set of ranks where a student can achieve 85-95% of the teacher's accuracy. This seems more realistic than pinpointing a single threshold, since the transition from insufficient to sufficient capacity is gradual.
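As a concrete illustration of this choice, a minimal helper that extracts the range from measured (rank, accuracy) pairs could look like the following; the measurements and the exposed thresholds are hypothetical.

```python
def effective_rank_range(ranks, accs, teacher_acc, lo=0.85, hi=0.95):
    """Ranks whose student accuracy, relative to the teacher, falls inside the
    [lo, hi] transition band; returns (min, max) or None if none qualify."""
    in_band = [r for r, a in zip(ranks, accs) if lo <= a / teacher_acc <= hi]
    return (min(in_band), max(in_band)) if in_band else None

# Hypothetical measurements, one distilled student per rank:
ranks = [4, 8, 16, 32, 64, 128]
accs  = [0.41, 0.58, 0.68, 0.73, 0.76, 0.77]
print(effective_rank_range(ranks, accs, teacher_acc=0.78))   # -> (16, 32)
```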
To make the estimates more stable, I fit the accuracy vs. rank curve with a monotone PCHIP interpolant. I also compute what I call an "effective knee"—the rank that maximizes perpendicular distance between the smoothed curve and its endpoint secant. This is just a way to identify where improvements start to saturate.
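A sketch of that computation with SciPy's PchipInterpolator is below. Normalizing both axes before measuring perpendicular distance is my added assumption (to keep the rank scale from dominating), and the dense evaluation grid size is arbitrary.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def effective_knee(ranks, accs, num=200):
    """Rank at which the PCHIP-smoothed accuracy curve is farthest, in
    perpendicular distance, from the secant joining its two endpoints."""
    f = PchipInterpolator(ranks, accs)                    # monotone, shape-preserving fit
    x = np.linspace(min(ranks), max(ranks), num)
    y = f(x)
    # Normalize both axes to [0, 1] so rank units do not dominate the distance.
    xn = (x - x[0]) / (x[-1] - x[0])
    yn = (y - y.min()) / (y.max() - y.min() + 1e-12)
    p0 = np.array([xn[0], yn[0]])
    d = np.array([xn[-1], yn[-1]]) - p0
    d = d / np.linalg.norm(d)
    pts = np.stack([xn, yn], axis=1) - p0
    dist = np.abs(pts[:, 0] * d[1] - pts[:, 1] * d[0])    # 2D cross-product magnitude
    return x[int(np.argmax(dist))]

ranks = [4, 8, 16, 32, 64, 128]
accs  = [0.41, 0.58, 0.68, 0.73, 0.76, 0.77]
print(effective_knee(ranks, accs))                        # knee well below the largest rank
```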
On Vision Transformers, the method shows that substantial compression is possible. Through systematic low-rank factorization and distillation, I was able to reduce parameters by roughly 11× while retaining about 94.7% of the teacher's accuracy. This suggests that these models, despite having millions of parameters, operate in considerably lower-dimensional spaces than their nominal capacity.
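To see where reductions of this magnitude can come from, consider the parameter count of a single factorized projection: a dense layer costs d_in * d_out parameters, while a rank-r factorization costs r * (d_in + d_out). The dimensions below are illustrative (ViT-Base-like), not the exact configuration used in the experiments.

```python
# Illustrative ViT-Base-like dimensions; not the paper's exact configuration.
d_in, d_out, r = 768, 3072, 96
dense_params   = d_in * d_out                 # 2,359,296
lowrank_params = r * (d_in + d_out)           # 368,640
print(dense_params / lowrank_params)          # ~6.4x fewer parameters in this one layer
# Any rank below d_in * d_out / (d_in + d_out) (~614 here) already saves parameters.
```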
The framework is automated and doesn't require manual tuning for each architecture. This makes it possible to systematically compare intrinsic dimensionality across different models, datasets, and training conditions, though more work is needed to understand how these measurements relate to generalization and other properties.
Understanding intrinsic dimensionality could help address several questions in deep learning, from how aggressively trained models can be compressed without losing accuracy to why heavily over-parameterized networks generalize as well as they do.
This work has several limitations. The method requires training multiple student networks, which is computationally expensive. The choice of distillation objective and training procedure can affect the results. And most importantly, "effective rank" as measured here is just one proxy for intrinsic dimensionality—there are likely other useful ways to measure this.
The framework is also limited to architectures where low-rank factorization makes sense. Extending this to other types of structure (sparsity, quantization, etc.) would be interesting but isn't covered by the current approach.