Alignment by Proof: An Alternative Method

State-of-the-art models today have two training phases: pre-training and post-training. The two serve different purposes, but together, they produce AI that both knows what we want and, to some extent, does what we want.

How GPT Works (Brief Overview)

For readers unfamiliar with how GPT works, here's a very brief overview. GPT, which stands for Generative Pre-trained Transformer, is a family of large neural networks based on the Transformer architecture. These models are autoregressive: at each step, they output a probability distribution over the next token, sample from it, and feed that sample back in for the next prediction.

Example: "The big brown fox jumped over the fence."

Phase 1: Tokenization. The input sentence is split into tokens mapped to a dictionary. If "fox" is token 452, we get a one-hot vector with a 1 at index 452 and 0s everywhere else. This is then mapped to a word embedding: a dense vector, far smaller than the vocabulary, that encodes richer, learned meaning.
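
To make the lookup concrete, here is a rough sketch in Python with a toy vocabulary and made-up sizes; in a real model the embedding table is learned during training:

    # Toy tokenization and embedding lookup; vocabulary, sizes, and weights are illustrative.
    import numpy as np

    vocab = {"The": 0, "big": 1, "brown": 2, "fox": 452, "jumped": 7}   # toy vocabulary
    vocab_size, embed_dim = 50000, 768                                  # illustrative sizes

    token_id = vocab["fox"]                   # "fox" -> 452
    one_hot = np.zeros(vocab_size)
    one_hot[token_id] = 1.0                   # 1 at index 452, 0 everywhere else

    embedding_table = np.random.randn(vocab_size, embed_dim) * 0.02     # learned in practice
    word_embedding = one_hot @ embedding_table        # equivalent to embedding_table[token_id]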

Phase 2: Attention & Forward Propagation. These embeddings are processed by attention blocks. Each token's representation is updated by aggregating information from the tokens it is most related to, letting the model "pay attention" to the parts of the context that matter most for each word.
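
A minimal, single-head version of that update can be sketched as follows; real decoder blocks add a causal mask, multiple heads, and further layers, so treat this only as an illustration of the core computation:

    # Single-head scaled dot-product attention; weight matrices stand in for learned parameters.
    import numpy as np

    def attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # how related each pair of tokens is
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
        return weights @ V                               # each token aggregates the tokens it attends to

    d_model = 64
    X = np.random.randn(9, d_model)                      # embeddings for a 9-token sentence
    Wq, Wk, Wv = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
    updated = attention(X, Wq, Wk, Wv)                   # (9, 64): context-aware token representations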

Phase 3: Sampling. Decoder-based transformers work in an autoregressive fashion: input the first word, predict the second, input both to predict the third, and so on. The model produces a probability distribution for the next token; we sample from this distribution.
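
In code, that loop looks roughly like the following sketch, where model_fn is a hypothetical stand-in for a trained network that returns logits over the vocabulary:

    # Autoregressive generation sketch; model_fn is a hypothetical placeholder for the network.
    import numpy as np

    def sample_next(logits, temperature=1.0):
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())     # softmax over the vocabulary
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

    def generate(model_fn, prompt_ids, n_new_tokens):
        ids = list(prompt_ids)
        for _ in range(n_new_tokens):
            logits = model_fn(ids)                # probability distribution for the next token
            ids.append(sample_next(logits))       # the sample is fed back in on the next step
        return ids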

Both pre-training and post-training share this core process. Where they diverge is in what the model is optimized for: pre-training has the model predict the next token across a vast corpus of text, giving it broad knowledge of language and the world, while post-training (instruction tuning and reinforcement learning from human feedback) rewards the outputs humans prefer, steering that knowledge toward what we actually want.

The Growing Cracks in Current Alignment

Today's models output tokens that often align with our goals, but as we look forward, two main problems emerge: the reward signals we train against are imperfect proxies that increasingly capable models can learn to game, and we have no window into whether the reasoning behind any given output is actually sound.

These limitations suggest we need a fundamentally different approach. The current paradigm treats alignment as an optimization problem— finding the right reward signal to guide behavior. But what if the issue isn't about rewards at all? What if the real problem is that we have no way to verify whether our models are reasoning correctly?

A Different Framework: From Rewards to Reasoning

Today, when a model answers a question, we might evaluate the quality by human preference, but we have no insight into its reasoning. The model could hallucinate, make logical errors, or even intentionally mislead— and we might not realize until much later. As models become more capable, this opacity is dangerous; we're essentially flying blind.

Instead of asking "How do we reward good behavior?", perhaps we should ask, "How do we verify correct reasoning?" In mathematics, we accept proofs because they follow valid logical steps from axioms— not because they feel right. In courts, we require evidence and logic that withstands scrutiny. In science, we demand reproducible evidence and methodology. Why shouldn't AI be held to the same standard?

Alignment by Proof: A New Approach

What if we reframed the alignment problem— not as a lack of alignment, but as a lack of proof? Currently, we have no guarantee that a model's output is tethered to reality. It may hallucinate, fabricate, or mislead; we can't verify its reasoning.

Imagine a world where AI doesn't just output answers, but outputs proofs for its statements— a chain of reasoning, similar to what mathematicians use with Lean or other proof assistants. The AI would be required to establish its claims using a transparent, verifiable proof engine.
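
For readers who have never seen a proof assistant, here is a toy Lean 4 snippet; the point is that nothing is accepted on trust, because the kernel checks that every term actually proves the claim attached to it:

    -- Toy Lean 4 examples: the kernel mechanically checks every step.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

    -- A conclusion is accepted only if the inference from the evidence type-checks.
    example (P Q : Prop) (evidence : P) (rule : P → Q) : Q :=
      rule evidence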

This engine could combine probabilistic reasoning (Bayesian networks, causal models) with formal proof systems, forming a "reality prover"— an engine that can validate deductions in a transparent, checkable way.
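
No such engine exists yet, but as a purely hypothetical sketch of the combination, each derived claim could carry both an inspectable derivation and a confidence inherited from its premises:

    # Hypothetical "reality prover" sketch: derived claims carry a derivation and a confidence.
    from dataclasses import dataclass

    @dataclass
    class Claim:
        statement: str
        probability: float        # belief given the evidence (e.g., a Bayesian posterior)
        derivation: list          # rules and premises used, kept inspectable

    def apply_rule(rule_name, premises, conclusion):
        # Noisy modus ponens: the conclusion's confidence is bounded by that of its premises.
        confidence = 1.0
        for premise in premises:
            confidence *= premise.probability
        return Claim(conclusion, confidence,
                     [rule_name] + [p.statement for p in premises])

    warm = Claim("the pavement is warm", 0.95, ["sensor reading"])
    sunny = Claim("it has been sunny all afternoon", 0.9, ["weather report"])
    dry = apply_rule("warm-and-sunny-implies-dry", [warm, sunny], "the pavement is dry")
    print(dry.probability, dry.derivation)    # 0.855, plus the full chain of support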

A Functional, Inductive Approach

Traditional proof systems like Lean start from the top down: begin with axioms (foundational truths), then build proofs by applying logical rules. This is a deductive approach— rigorous, but requiring humans to hand-specify what counts as true.

I propose the inverse: a bottom-up, inductive method. Instead of starting with axioms, we start with data— the vast corpus of statements humanity believes to be true or false. We then synthesize a proof-checking engine: a program that correctly classifies these statements according to our collective knowledge.

Think of it as learning the rules of reality from examples, rather than coding them by hand. We feed the system millions of claims— scientific facts, historical events, logical contradictions— and task it with generating a proof engine that can distinguish truth from falsehood. The engine becomes a functional artifact: a discrete program that captures humanity's epistemology, not as a black-box neural network, but as inspectable, verifiable code.

This approach is inductive in the same way a scientist induces general laws from observations. We observe what humanity accepts as true, and we synthesize the inference rules that best explain these observations. The result is a proof engine grounded in human knowledge, yet formal and checkable like traditional proof systems.
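
As a deliberately tiny, hypothetical illustration of that inductive step, imagine searching over candidate inference rules and keeping whichever combination best reproduces the labels on a corpus of claims; the string-matching rules below are toys standing in for real inference rules:

    # Hypothetical synthesis sketch: pick the rule set that best matches humanity's labels.
    import itertools

    labeled_claims = [
        ("water boils at 100 C at sea level", True),
        ("the Earth is flat", False),
        ("2 + 2 = 5", False),
    ]   # in practice, millions of statements

    candidate_rules = [
        lambda claim: "flat" not in claim,
        lambda claim: "2 + 2 = 5" not in claim,
        lambda claim: "boils" in claim,
    ]

    def synthesize_engine(rules, corpus):
        best_engine, best_score = None, -1
        for k in range(1, len(rules) + 1):
            for subset in itertools.combinations(rules, k):
                engine = lambda claim, subset=subset: all(rule(claim) for rule in subset)
                score = sum(engine(claim) == label for claim, label in corpus)
                if score > best_score:
                    best_engine, best_score = engine, score
        return best_engine    # a discrete, inspectable program rather than a neural network

    engine = synthesize_engine(candidate_rules, labeled_claims)
    print(engine("the Earth is flat"))    # False under the synthesized rule set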

Concrete example:
Suppose you ask: "Did the assassination of Archduke Franz Ferdinand directly lead to World War I?"

With proof-based alignment, the AI wouldn't just write an essay. It would generate a formal proof showing:
  1. What historical evidence it draws from
  2. The causal reasoning connecting the assassination to the war
  3. The logical steps supporting or weakening alternate explanations
  4. Confidence and uncertainty at each step
This proof could then be run through a checker to verify that each step follows logically and is properly supported by evidence.
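
There is no standard format for such a proof yet, but a hypothetical sketch might represent it as a list of steps, each carrying its evidence, its dependencies, and a confidence, with a checker that rejects any step that is not supported:

    # Hypothetical proof object and checker; sources and confidence values are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Step:
        claim: str
        depends_on: list = field(default_factory=list)    # earlier claims this step builds on
        evidence: list = field(default_factory=list)      # cited sources
        confidence: float = 1.0

    def check_proof(steps):
        established = set()
        for i, step in enumerate(steps):
            if any(dep not in established for dep in step.depends_on):
                return False, f"step {i} uses a claim that was never established"
            if not step.depends_on and not step.evidence:
                return False, f"step {i} has neither evidence nor supporting claims"
            established.add(step.claim)
        return True, "all steps supported"

    proof = [
        Step("Franz Ferdinand was assassinated on 28 June 1914",
             evidence=["primary sources"], confidence=0.99),
        Step("Austria-Hungary issued an ultimatum to Serbia",
             evidence=["primary sources"], confidence=0.97),
        Step("The ultimatum escalated into a general European war",
             depends_on=["Austria-Hungary issued an ultimatum to Serbia"], confidence=0.9),
    ]
    print(check_proof(proof))    # (True, 'all steps supported')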

Projects like Cyc tried something similar with hand-built logic, but LLMs might be able to generate, evolve, and refine these proof engines much faster and more broadly than any manual approach.

The Promise of Provable Alignment

This mirrors how we align human reasoning in high-stakes situations: courts require evidence and logical argument; academia demands peer review; engineering requires verifiable calculations. By requiring AI systems to show their work in a formally verifiable way, we move from alignment by reward to alignment by proof and verification.

Looking Forward

The technical challenges are significant. We must develop proof systems that are rigorous enough to prevent deception and flexible enough to handle real-world uncertainty. We need to ground formal proofs in empirical evidence and ensure that proof engines cannot be gamed by sophisticated models.

But the potential payoff is enormous. By reframing alignment as a problem of proof rather than reward, we gain scalability, interpretability, and the ability to formally reason about AI behavior. This is a method that could, in principle, help align not just today's models, but superintelligent systems in the future.

After all, if we're going to trust AI systems with increasingly important decisions, shouldn't we be able to verify that their reasoning is sound?