Invertible Memory Flow Networks
Research paper with Alexandr Plashchinsky at VECTOR Labs.
Overview
Long-sequence memory is still a hard problem. Standard Transformer attention scales quadratically with context length, while recurrent models often lose information over long horizons. More generally, compressing a long stream into a bounded state is difficult because the model has to decide what information should survive.
Invertible Memory Flow Networks, or IMFN, are an early attempt to make this problem easier by changing its structure.
Instead of asking one model to compress an entire sequence directly, IMFN breaks the problem into many small local compression problems. Each local module learns a 2-to-1 merge: take two memory states, merge them into one, and learn an inverse path that can approximately recover the original pair.
Those local merges are composed in a binary tree.
The result is a tree-structured memory system with logarithmic compression depth, local reconstruction pressure, and a route toward constant-cost online inference through student distillation.
This should be read as preliminary evidence, not a finished solution. The experiments are mostly reconstruction-based, and reconstruction is an imperfect proxy for useful memory. A model can reconstruct pixels without understanding what matters, and it can also preserve useful predictive information while failing to reconstruct every detail.
Memory as a Trajectory
Many sequence models can be viewed as dynamical systems. An LSTM updates a cell state. A Transformer updates a residual stream. A state space model evolves a latent state through time.
In all of these cases, information moves through some shared memory space.
IMFN starts from the view that memory is not only a static vector. It is also the trajectory that vector takes through representation space. If memory is a trajectory, then one useful question is:
Can we make that trajectory approximately invertible?
A useful memory state should preserve enough structure that some information about the past remains recoverable. In this work, we test that idea through reconstruction, but reconstruction should not be treated as the final goal. It is a convenient measurement tool, not necessarily the right objective for intelligent memory.
The Teacher
The IMFN teacher is a binary tree of learned sweeper modules.
Each sweeper has two parts:
- a merge function that maps two memory states into one
- an inverse function that maps one memory state back into two
At the bottom level, raw inputs are encoded into memory space. Higher levels operate only on latent memory states. Each level is trained locally, so every merge is pressured to preserve information that can be recovered by the inverse path.
This factorization is the main idea. Rather than learning one global compression map over a long sequence, the model learns a local approximately invertible primitive and composes it.
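The composition described above can be sketched in a few lines. This is only a structural illustration, not the paper's implementation: the learned merge and inverse networks are replaced by hypothetical stand-ins (an element-wise mean and a duplicate-the-parent inverse), and the names `merge`, `inverse`, `local_loss`, `compress`, and `expand` are assumptions introduced here.

```python
def merge(a, b):
    # Hypothetical stand-in for the learned 2-to-1 merge: element-wise mean.
    return [(x + y) / 2.0 for x, y in zip(a, b)]

def inverse(m):
    # Hypothetical stand-in for the learned inverse: both children are
    # approximated by a copy of the parent state.
    return list(m), list(m)

def local_loss(a, b):
    # Local reconstruction pressure for one sweeper: how well does
    # inverse(merge(a, b)) recover the original pair (mean squared error)?
    left, right = inverse(merge(a, b))
    errs = [(x - y) ** 2 for x, y in zip(left + right, a + b)]
    return sum(errs) / len(errs)

def compress(leaves):
    # Merge adjacent states level by level; returns all levels, leaves first.
    n = len(leaves)
    assert n > 0 and (n & (n - 1)) == 0, "number of leaves must be a power of two"
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        levels.append([merge(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def expand(root, depth):
    # Run the tree backward: apply the inverse once per level.
    states = [root]
    for _ in range(depth):
        nxt = []
        for s in states:
            left, right = inverse(s)
            nxt.extend([left, right])
        states = nxt
    return states

leaves = [[float(i), float(-i)] for i in range(8)]  # toy 2-d memory states
levels = compress(leaves)
depth = len(levels) - 1            # 3 levels of merging for 8 leaves
recon = expand(levels[-1][0], depth)
```

With 8 leaves the tree has depth 3, which is the logarithmic compression depth mentioned in the overview. In the real model, `local_loss` would be the training signal for each level's learned sweeper rather than a fixed quantity.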
For MNIST, the teacher compresses image sequences by mapping flattened images into memory vectors, then repeatedly merging adjacent memory states up the tree. Reconstruction runs the tree backward.
For video, the same idea is applied to tokenized UCF-101 clips. The bottom level merges pairs of frames into memory tokens, and higher levels merge token latents.
The Student
The tree teacher is useful, but it is still a structured computation. For online inference, IMFN distills the tree into a recurrent student.
The teacher defines a trajectory by zero-padding unseen leaves:
y0 = f(0, 0, ..., 0)
y1 = f(x1, 0, ..., 0)
y2 = f(x1, x2, 0, ..., 0)
...
yn = f(x1, x2, ..., xn)
The student learns to follow this trajectory with a residual update:
m_{t+1} = m_t + g_theta(m_t, x_t, t)
This gives the student constant per-step inference cost while trying to preserve the teacher's tree-induced memory dynamics.
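A toy version of the zero-padded trajectory and the residual update might look like the sketch below. The teacher `f` is replaced by a hypothetical pairwise-mean tree (`tree_root`), and `g_toy` is a hand-written residual that happens to be exact for that toy teacher. In IMFN, `g_theta` is a learned network trained to follow the teacher's trajectory. The list of inputs is 0-indexed in the code.

```python
def tree_root(leaves):
    # Hypothetical stand-in for the teacher f: a pairwise-mean binary tree.
    level = leaves
    while len(level) > 1:
        level = [[(a + b) / 2.0 for a, b in zip(level[i], level[i + 1])]
                 for i in range(0, len(level), 2)]
    return level[0]

def teacher_trajectory(xs, dim):
    # y_t = f(x_1, ..., x_t, 0, ..., 0) for t = 0..n, with unseen leaves zeroed.
    n = len(xs)
    zero = [0.0] * dim
    return [tree_root(xs[:t] + [zero] * (n - t)) for t in range(n + 1)]

def g_toy(m, x, t, n):
    # Assumed residual for illustration only. For the mean-tree teacher the
    # exact increment y_{t+1} - y_t is x / n, so m and t are unused here.
    return [xi / n for xi in x]

def student_rollout(xs, dim):
    # m_{t+1} = m_t + g(m_t, x_t, t): one constant-cost update per input.
    n = len(xs)
    m = [0.0] * dim                      # m_0 matches y_0 for this toy teacher
    out = [m]
    for t, x in enumerate(xs):
        m = [mi + di for mi, di in zip(m, g_toy(m, x, t, n))]
        out.append(m)
    return out

xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
teacher = teacher_trajectory(xs, dim=2)
student = student_rollout(xs, dim=2)
```

In this toy setting the student reproduces the teacher trajectory exactly, because the increments have a closed form. With learned, nonlinear merges there is no such closed form, which is why the student has to be distilled and why it can drift from the teacher.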
A useful detail is that teacher trajectories can be generated efficiently. When a new leaf changes from zero to data, only the path from that leaf to the root needs to be recomputed. This gives a Merkle-style update with O(log n) work per step rather than rebuilding the full tree.
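The path-only recomputation can be sketched with a heap-style array, where node `i` has children `2i` and `2i + 1`. The class name `IncrementalTree`, the `merge_calls` counter, and the mean-based `mean_merge` are assumptions made for this illustration. Any 2-to-1 merge function could be passed in instead.

```python
def mean_merge(a, b):
    # Hypothetical stand-in for a learned merge.
    return [(x + y) / 2.0 for x, y in zip(a, b)]

class IncrementalTree:
    def __init__(self, n, dim, merge):
        assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
        self.n, self.merge = n, merge
        # Heap layout: root at index 1, leaves at indices n .. 2n - 1.
        self.nodes = [[0.0] * dim for _ in range(2 * n)]
        # Build the internal nodes once for the all-zero leaves.
        for i in range(n - 1, 0, -1):
            self.nodes[i] = merge(self.nodes[2 * i], self.nodes[2 * i + 1])
        self.merge_calls = 0  # counts merges done by incremental updates only

    def set_leaf(self, idx, x):
        # Replace one zero leaf with data, then recompute only its ancestors.
        i = self.n + idx
        self.nodes[i] = list(x)
        i //= 2
        while i >= 1:
            self.nodes[i] = self.merge(self.nodes[2 * i], self.nodes[2 * i + 1])
            self.merge_calls += 1
            i //= 2
        return self.nodes[1]  # current root, i.e. the teacher state y_t

tree = IncrementalTree(8, 1, mean_merge)
roots = [tree.set_leaf(t, [float(t)]) for t in range(8)]
```

Each call to `set_leaf` performs log2(n) merges (3 for n = 8), so generating the whole trajectory costs on the order of n log n merges rather than rebuilding the full tree at every step.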
Results
On MNIST sequence reconstruction, IMFN is compared against Transformer and Mamba sequence compressors at sequence length T = 128 and memory dimension d = 1024.
| Model | MSE | PSNR (dB) | SSIM |
|---|---|---|---|
| IMFN | 0.052132 +/- 0.000939 | 12.83 +/- 0.08 | 0.6300 +/- 0.0058 |
| Transformer | 0.066598 +/- 0.000140 | 11.77 +/- 0.01 | 0.3396 +/- 0.0011 |
| Mamba | 0.055753 +/- 0.000518 | 12.54 +/- 0.04 | 0.4341 +/- 0.0041 |
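As a sanity check on how the MSE and PSNR columns relate, the standard identity PSNR = 10 log10(MAX^2 / MSE) can be applied to the mean MSE values. The sketch assumes pixel values scaled to [0, 1] (so MAX = 1), which the text does not state explicitly. The function name `psnr_from_mse` is introduced here for illustration.

```python
import math

def psnr_from_mse(mse, max_val=1.0):
    # Standard PSNR definition in dB; max_val = 1.0 is an assumption
    # about the pixel range, not something stated in the write-up.
    return 10.0 * math.log10(max_val ** 2 / mse)

# Mean MSE values copied from the MNIST table.
mnist_mse = {"IMFN": 0.052132, "Transformer": 0.066598, "Mamba": 0.055753}
mnist_psnr = {name: psnr_from_mse(m) for name, m in mnist_mse.items()}
```

Under that assumption the computed values agree with the reported PSNR column to within about 0.01 dB, so the two columns carry essentially the same information on this benchmark. SSIM is the column that adds an independent, structure-sensitive view.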
The parameter counts are also notable:
| Model | Parameters |
|---|---|
| IMFN | 61.43M |
| Transformer | 182.56M |
| Mamba | 186.86M |
In this setup, IMFN achieves the lowest MSE and the highest PSNR and SSIM while using about a third of the parameters of either baseline. The result is encouraging, but narrow. It suggests that locally invertible structure can help with this particular compression task. It does not show that IMFN is generally better than Transformers, Mamba, or other long-context models.
On UCF-101 video reconstruction, IMFN degrades gradually as compression depth increases:
| Level | Compression | MSE | PSNR (dB) |
|---|---|---|---|
| L0 | 2-to-1 | 0.000796 +/- 0.000551 | 31.95 +/- 2.96 |
| L1 | 4-to-1 | 0.000810 +/- 0.000556 | 31.87 +/- 2.96 |
| L2 | 8-to-1 | 0.001204 +/- 0.000888 | 30.25 +/- 3.11 |
| L3 | 16-to-1 | 0.002163 +/- 0.001839 | 27.96 +/- 3.46 |
From L0 to L3, the compression ratio grows by a factor of 8 (2-to-1 to 16-to-1), while mean MSE grows by a factor of about 2.7 (0.000796 to 0.002163). This suggests that the hierarchy is not simply accumulating error linearly with depth.
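The growth factors can be recomputed directly from the table. The snippet below uses only the reported mean values. The variable names are introduced here, and the reported standard deviations are ignored.

```python
# Values copied from the UCF-101 table (levels L0..L3).
compression = [2, 4, 8, 16]                      # inputs merged per memory state
mse = [0.000796, 0.000810, 0.001204, 0.002163]   # mean MSE per level

compression_growth = compression[-1] / compression[0]   # L0 -> L3
mse_growth = mse[-1] / mse[0]                            # L0 -> L3

# Per-level factor: how much MSE grows each time compression doubles.
step_factors = [b / a for a, b in zip(mse, mse[1:])]
```

The per-level factors are roughly 1.02, 1.49, and 1.80, so most of the degradation appears at the deeper levels rather than being spread evenly across them.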
Again, this is a reconstruction result. It is useful, but limited.
Why This Matters
The central idea is that long-context memory may need structure, not just scale.
A flat context window stores everything explicitly. A recurrent state compresses everything into one evolving vector. IMFN explores a third option: a tree-like memory whose local transformations are trained to preserve recoverable information.
This is interesting because many real memory problems are not about remembering every pixel or token. They are about preserving the information needed for future prediction, decision-making, and action.
That distinction matters. Reconstruction asks:
Can the model recover what it saw?
Prediction asks:
Did the memory preserve what matters for what comes next?
For agents, robotics, and long-horizon planning, prediction is probably the more important question.
Limitations
This work is early.
The current experiments are mostly reconstruction-focused. MNIST is a clean testbed, and UCF-101 gives a first real-video validation, but neither is enough to establish this as a general memory mechanism.
Reconstruction is also not a great final metric. It can reward low-level detail that may not matter, and it can miss higher-level structure that does matter. A memory system for agents should probably be judged by whether it improves prediction, planning, and behavior under bounded memory.
The student also trails the teacher at longer sequence lengths. This is expected: matching the trajectory of a deeper tree is harder than matching a shallow one. Better training schedules, intermediate supervision, and different distillation targets may reduce this gap.
Direction
The next direction is to explore tree-like memory structures using prediction rather than reconstruction.
Instead of training the hierarchy mainly to reconstruct inputs, we want to test whether a tree-structured memory can preserve information that improves future prediction. For example:
- predicting future frames or latent states
- predicting missing parts of a sequence
- predicting task-relevant events
- predicting action-conditioned outcomes in an agent or robotics setting
This would move the objective closer to what memory is actually for. A useful memory should not necessarily reproduce the past. It should carry the information needed to model the future.
If this works, IMFN-like systems could become a memory substrate for agents that need bounded memory, fast online updates, and some ability to recover or reason about the past without storing everything.