State-of-the-art models today have two training phases: pre-training and post-training. The two serve different purposes, but together, they produce AI that both knows what we want and, to some extent, does what we want.
For readers unfamiliar with how GPT works, here's a very brief overview. GPT, which stands for Generative Pre-trained Transformer, is a family of large neural networks based on the Transformer architecture. These models are auto-regressive: at each step, they output a probability distribution over the next token, sample a token from it, and append that token to the context used for the next prediction.
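To make that loop concrete, here is a minimal Python sketch. The `model` function is only a stand-in that returns arbitrary logits; a real GPT would compute them from the full token context with a Transformer. The names and sizes are illustrative, not any real API.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 50_000  # hypothetical vocabulary size

def model(tokens):
    """Stand-in for a Transformer: returns logits over the next token.
    A real GPT would compute these from the whole token context."""
    toy_rng = np.random.default_rng(len(tokens))
    return toy_rng.normal(size=VOCAB_SIZE)

def generate(prompt_tokens, steps=10):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        logits = model(tokens)                        # scores for every token
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                          # softmax -> probability distribution
        next_token = rng.choice(VOCAB_SIZE, p=probs)  # sample from the distribution
        tokens.append(int(next_token))                # feed the sample back in
    return tokens

print(generate([1, 2, 3]))
```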
Both pre-training and post-training share this core process. Where they diverge is in the training signal: pre-training teaches the model to predict the next token across a vast corpus of text, which is how it acquires broad knowledge, while post-training (instruction tuning and reinforcement learning from human feedback) shapes that knowledge toward the behavior people actually want.
Today's models output tokens that often align with our goals, but as we look forward, two main problems emerge: the reward signals we optimize are only proxies for what we actually want, and we have no way to inspect whether the reasoning behind an output is sound.
These limitations suggest we need a fundamentally different approach. The current paradigm treats alignment as an optimization problem: finding the right reward signal to guide behavior. But what if the issue isn't about rewards at all? What if the real problem is that we have no way to verify whether our models are reasoning correctly?
Today, when a model answers a question, we might evaluate its quality by human preference, but we have no insight into its reasoning. The model could hallucinate, make logical errors, or even intentionally mislead, and we might not realize it until much later. As models become more capable, this opacity is dangerous; we're essentially flying blind.
Instead of asking "How do we reward good behavior?", perhaps we should ask, "How do we verify correct reasoning?" In mathematics, we accept proofs because they follow valid logical steps from axioms, not because they feel right. In courts, we require evidence and logic that withstand scrutiny. In science, we demand reproducible evidence and sound methodology. Why shouldn't AI be held to the same standard?
What if we reframed the alignment problem not as a lack of alignment, but as a lack of proof? Currently, we have no guarantee that a model's output is tethered to reality. It may hallucinate, fabricate, or mislead; we can't verify its reasoning.
Imagine a world where AI doesn't just output answers, but outputs proofs for its statements: a chain of reasoning similar to what mathematicians construct in Lean or other proof assistants. The AI would be required to establish its claims using a transparent, verifiable proof engine.
This engine could combine probabilistic reasoning (Bayesian networks, causal models) with formal proof systems, forming a "reality prover": an engine that can validate deductions in a transparent, checkable way.
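As a rough sketch of what such an engine's interface might look like, here is a toy checker with invented names, a two-entry knowledge base, and a fixed rule set. It accepts a chain of reasoning only if every step cites a permitted inference rule and every premise is either an earlier conclusion or an empirical fact above a confidence threshold; a real "reality prover" would also have to verify the logical form of each step.

```python
from dataclasses import dataclass

# Toy sketch: empirical facts carry confidences (the probabilistic side),
# and proofs must move through a fixed set of rules (the formal side).
FACTS = {                               # statements the engine already accepts
    "socrates is a man": 0.99,
    "all men are mortal": 0.97,
}
RULES = {"modus_ponens", "restate"}     # the only rules a proof may cite
THRESHOLD = 0.9                         # minimum confidence for an empirical premise

@dataclass
class Step:
    conclusion: str
    rule: str
    premises: list[str]

def check_proof(steps: list[Step]) -> tuple[bool, str]:
    """Accept the chain only if every step cites a known rule and every
    premise is a high-confidence fact or an earlier conclusion."""
    proved: set[str] = set()
    for step in steps:
        if step.rule not in RULES:
            return False, f"unknown rule: {step.rule}"
        for premise in step.premises:
            if premise not in proved and FACTS.get(premise, 0.0) < THRESHOLD:
                return False, f"unsupported premise: {premise}"
        proved.add(step.conclusion)
    return True, "proof checked"

proof = [Step("socrates is mortal", "modus_ponens",
              ["socrates is a man", "all men are mortal"])]
print(check_proof(proof))               # (True, 'proof checked')
```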
Traditional proof systems like Lean work from the top down: begin with axioms (foundational truths), then build proofs by applying logical rules. This is a deductive approach: rigorous, but it requires humans to hand-specify what counts as true.
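For readers who haven't seen this style, here is roughly what it looks like in Lean 4. The hypotheses stand in for axioms a human has already declared true, and the proof is just the application of a logical rule:

```lean
-- Top-down, deductive style: assumptions are given up front,
-- and the conclusion is built by applying a rule of inference.
theorem socrates_mortal
    (Man Mortal : Prop)
    (h1 : Man)            -- "Socrates is a man"
    (h2 : Man → Mortal)   -- "all men are mortal"
    : Mortal :=
  h2 h1                   -- modus ponens
```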
I propose the inverse: a bottom-up, inductive method. Instead of starting with axioms, we start with data, the vast corpus of statements humanity believes to be true or false. We then synthesize a proof-checking engine: a program that correctly classifies these statements according to our collective knowledge.
Think of it as learning the rules of reality from examples, rather than coding them by hand. We feed the system millions of claims (scientific facts, historical events, logical contradictions) and task it with generating a proof engine that can distinguish truth from falsehood. The engine becomes a functional artifact: a discrete program that captures humanity's epistemology, not as a black-box neural network, but as inspectable, verifiable code.
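Here is a deliberately tiny sketch of that generate-and-test idea, with an invented four-claim corpus, two hand-written candidate rules, and brute-force search standing in for LLM-driven synthesis; in the actual proposal the candidate engines would be far richer programs.

```python
import itertools

# Toy sketch of the inductive approach: search over small, inspectable rule
# sets and keep the combination that best reproduces a corpus of labeled claims.
LABELED_CLAIMS = [                      # (claim, verdict humanity assigns)
    ("2 + 2 = 4", True),
    ("2 + 2 = 5", False),
    ("water freezes at 0 C", True),
    ("the sun orbits the earth", False),
]

def rule_arithmetic(claim: str):
    """Decide simple 'a + b = c' claims; abstain (None) on everything else."""
    left, _, right = claim.replace(" ", "").partition("=")
    if right and "+" in left:
        a, _, b = left.partition("+")
        if a.isdigit() and b.isdigit() and right.isdigit():
            return int(a) + int(b) == int(right)
    return None

def rule_lookup(claim: str):
    """Look the claim up in a tiny table of already-settled statements."""
    table = {"water freezes at 0 C": True, "the sun orbits the earth": False}
    return table.get(claim)

RULE_LIBRARY = [rule_arithmetic, rule_lookup]

def classify(rules, claim):
    for rule in rules:                  # first rule with an opinion wins
        verdict = rule(claim)
        if verdict is not None:
            return verdict
    return False                        # no rule fires: reject by default

def score(rules):
    return sum(classify(rules, c) == label for c, label in LABELED_CLAIMS)

# Brute-force search over rule subsets stands in for synthesis of the engine.
candidates = [subset for r in range(len(RULE_LIBRARY) + 1)
              for subset in itertools.combinations(RULE_LIBRARY, r)]
best = max(candidates, key=score)
print([rule.__name__ for rule in best], score(best), "of", len(LABELED_CLAIMS))
```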
This approach is inductive in the same way a scientist induces general laws from observations. We observe what humanity accepts as true, and we synthesize the inference rules that best explain these observations. The result is a proof engine grounded in human knowledge, yet formal and checkable like traditional proof systems.
Projects like Cyc tried something similar with hand-built logic, but LLMs might be able to generate, evolve, and refine these proof engines much faster and more broadly than any manual approach.
This mirrors how we align human reasoning in high-stakes situations: courts require evidence and logical argument; academia demands peer review; engineering requires verifiable calculations. By requiring AI systems to show their work in a formally verifiable way, we move from alignment by reward to alignment by proof and verification.
The technical challenges are significant. We must develop proof systems that are rigorous enough to prevent deception and flexible enough to handle real-world uncertainty. We need to ground formal proofs in empirical evidence and ensure that proof engines cannot be gamed by sophisticated models.
But the potential payoff is enormous. By reframing alignment as a problem of proof rather than reward, we gain scalability, interpretability, and the ability to formally reason about AI behavior. This is a method that could, in principle, help align not just today's models, but superintelligent systems in the future.
After all, if we're going to trust AI with increasingly important decisions, shouldn't we be able to verify that their reasoning is sound?