Taming LLMs: Using Executable Oracles to Prevent Bad Code

You Can't Trust The Damn Things

By this point, most of us who have experimented with Claude, Codex, and other LLM-based coding agents have noticed that the current generation can sometimes do good work, at superhuman speed, on certain kinds of highly constrained tasks. On the other hand, these same tools frequently fall over in baffling ways, emitting tasteless or nonsensical code.

When an LLM has the option of doing something poorly, we simply can't trust it to make the right choices. The solution, then, is clear: we need to take away the freedom to do the job badly.

Executable Oracles

The software tools that can help us accomplish this are executable oracles. The simplest executable oracle is a test case—but test cases, even when there are a lot of them, are weak.

Consider Claude's C Compiler: even after passing GCC's "torture test suite" and more, it still had 34 nasty miscompilation bugs that were within easy reach. But it wouldn't have had those bugs if Csmith and YARPGen had been included in the testing loop. These tools are better executable oracles because each of them implicitly encodes a vast collection of test cases.
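
To make the difference between a fixed test case and a generator-backed oracle concrete, here is a minimal sketch in Python. It is a toy, not how Csmith or YARPGen work internally: the expression format, the planted bug, and every function name are assumptions made up for illustration. The idea it shares with those tools is differential testing: generate random programs and compare the implementation under test against a trusted reference.

```python
import random

OPS = {"add": lambda a, b: a + b,
       "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def gen_expr(rng, depth=3):
    """Generate a random arithmetic expression as a nested tuple (hypothetical format)."""
    if depth == 0 or rng.random() < 0.3:
        return ("lit", rng.randint(-8, 8))
    op = rng.choice(sorted(OPS))
    return (op, gen_expr(rng, depth - 1), gen_expr(rng, depth - 1))

def reference_eval(e):
    """Trusted, deliberately simple evaluator that plays the role of the reference."""
    if e[0] == "lit":
        return e[1]
    return OPS[e[0]](reference_eval(e[1]), reference_eval(e[2]))

def buggy_eval(e):
    """Implementation under test, with a planted bug in one 'optimization'."""
    if e[0] == "lit":
        return e[1]
    if e[0] == "sub" and e[1] == ("lit", 0):
        return buggy_eval(e[2])  # planted bug: folds 0 - x to x instead of -x
    return OPS[e[0]](buggy_eval(e[1]), buggy_eval(e[2]))

def find_mismatch(impl, seed=0, trials=2000):
    """Differential oracle: return an expression where impl disagrees with the reference."""
    rng = random.Random(seed)  # fixed seed keeps the oracle deterministic
    for _ in range(trials):
        e = gen_expr(rng)
        if impl(e) != reference_eval(e):
            return e
    return None
```

A handful of hand-written tests could easily miss the `0 - x` case, while `find_mismatch(buggy_eval)` stumbles onto it simply because the generator covers so much more of the input space.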

This piece is about collapsing as many failure-producing degrees of freedom as possible. Zero degrees of freedom is aspirational, but a good aspiration.

Example Scenarios

1. Claude's C Compiler Quality

The compiler contains a somewhat elaborate set of optimization passes, but they appear to make very little difference in the quality of its output. What if the human had included an executable oracle for code quality in the testing loop? The LLM would have been able to incorporate this feedback and do a significantly better job.

2. Dataflow Transfer Functions

Given access to command-line tools for evaluating the precision and verifying the soundness of a transfer function, Codex produced results better than anything seen in LLVM or in the authors' own randomized synthesis results. By pinching the LLM's results between opposing executable oracles for soundness and precision, synthesis worked really well.
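
A toy version of that pinch can be sketched in a few lines of Python. Everything here is an assumption chosen for illustration (a 3-bit unsigned interval domain, wrapping addition, and the function names), and it is far simpler than the real command-line tools. One oracle searches exhaustively for a soundness counterexample; the other scores imprecision, so a transfer function cannot satisfy the first by returning "anything is possible" without being punished by the second.

```python
from itertools import product

WIDTH = 3
MOD = 1 << WIDTH  # 3-bit unsigned values: 0..7

def intervals():
    """All non-wrapping intervals (lo, hi) with lo <= hi."""
    return [(lo, hi) for lo in range(MOD) for hi in range(lo, MOD)]

def soundness_counterexample(transfer):
    """Soundness oracle: find concrete x, y whose wrapped sum escapes the result."""
    for a, b in product(intervals(), repeat=2):
        lo, hi = transfer(a, b)
        for x in range(a[0], a[1] + 1):
            for y in range(b[0], b[1] + 1):
                if not lo <= (x + y) % MOD <= hi:
                    return (a, b, x, y)
    return None

def imprecision(transfer):
    """Precision oracle: total size of all output intervals (lower is better)."""
    return sum(hi - lo + 1
               for lo, hi in (transfer(a, b)
                              for a, b in product(intervals(), repeat=2)))

def top_add(a, b):
    """Sound but useless: always answers 'any value'."""
    return (0, MOD - 1)

def saturating_add(a, b):
    """Looks precise but ignores wraparound, so it is unsound."""
    return (min(a[0] + b[0], MOD - 1), min(a[1] + b[1], MOD - 1))

def wrap_aware_add(a, b):
    """Handles the no-wrap and always-wrap cases exactly, else gives up."""
    lo, hi = a[0] + b[0], a[1] + b[1]
    if hi < MOD:
        return (lo, hi)
    if lo >= MOD:
        return (lo - MOD, hi - MOD)
    return (0, MOD - 1)
```

Here `saturating_add` is rejected by the soundness oracle, `top_add` passes it but scores badly on imprecision, and `wrap_aware_add` is the kind of candidate that survives both checks.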

3. JustHTML Parser

JustHTML was tested into existence using a large test suite. The coding agent painted itself into a corner by creating a poor software architecture—a difficult degree of freedom to put an executable oracle on. The author manually walked the LLM through refactoring tasks, arriving at a suitable architecture.

Where Can We Find Executable Oracles?

Finding executable oracles for LLMs feels the same as finding test oracles: with a little effort and critical thinking, we can often find a programmatic way to pin down some degree of freedom that would otherwise be available to the LLM to screw up.

Correctness Oracles

Test suites
Fuzzers and property-based testers
Runtime sanitizers
Static analyzers
Linters
Strong type systems
Formal verifiers

Performance Oracles

Compiler-inserted instrumentation
Runtime instrumentation
Heap profilers
Hardware performance counters
Performance regression test suites
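
As a small illustration of runtime instrumentation used as a performance oracle, the Python sketch below counts comparisons instead of measuring wall-clock time, which keeps the verdict repeatable from run to run. The `Counted` wrapper, the budget formula, and the `factor` of 2 are all assumptions picked for this example, not a recommendation for real thresholds.

```python
import math
import random

class Counted:
    """Wraps a value and counts every '<' comparison made on it."""
    count = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.count += 1
        return self.value < other.value

def comparisons_used(sort_fn, data):
    """Run sort_fn on instrumented items; check correctness, return comparison count."""
    Counted.count = 0
    out = sort_fn([Counted(v) for v in data])
    assert [c.value for c in out] == sorted(data), "sort produced wrong output"
    return Counted.count

def insertion_sort(items):
    """Quadratic sort standing in for a slow implementation under test."""
    out = []
    for x in items:
        i = len(out)
        while i > 0 and x < out[i - 1]:
            i -= 1
        out.insert(i, x)
    return out

def within_budget(sort_fn, n=256, factor=2.0, seed=0):
    """Performance oracle: pass if comparisons stay under factor * n * log2(n)."""
    data = list(range(n))
    random.Random(seed).shuffle(data)  # fixed seed: same input every run
    budget = factor * n * math.log2(n)
    return comparisons_used(sort_fn, data) <= budget
```

With these settings the built-in `sorted` stays within the budget and `insertion_sort` does not, so an agent that swaps in a quadratic algorithm gets an immediate, unambiguous failure.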

Which Degrees of Freedom Can't We Control?

Software architecture, modularity, and maintainability: Picking an appropriate architecture is critical as software scales up, and there's no good executable oracle.
Duplication and unnecessary complexity: LLMs love to write weirdly excessive defensive code.
GUI polish: Creating a polished interface is doubly difficult for coding agents.
Security: Fuzzers are useful, but there are plenty of defects they can't catch.

Practicalities

An ideal executable oracle would be fast, deterministic, local, sandbox-compatible, and have easy-to-interpret output.
Provide the LLM with CLI tools that have queryable interfaces (for example, tool --help should work).
LLMs are poor at dealing with long-running tools.
Provide specific advice about tool timeouts in your documentation.
Any playbook you write for the agent should codify a linear sequence of steps.
Be clear about which requirements are hard and which are soft.
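
Several of these points can be folded into a thin wrapper around whatever check you already have. The Python sketch below is one possible shape, under stated assumptions: the tool name `oracle-run`, the exit-code convention, and the 60-second default are invented for illustration. It uses only the standard `argparse` and `subprocess` modules, so `--help` works for free, long-running checks are killed at a hard timeout, and the output is a single line that is easy to interpret.

```python
import argparse
import subprocess

# Hypothetical exit-code convention for this sketch.
EXIT_PASS, EXIT_FAIL, EXIT_TIMEOUT = 0, 1, 2

def run_oracle(cmd, timeout_s):
    """Run one check; return (exit_code, single-line verdict)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return EXIT_TIMEOUT, f"TIMEOUT after {timeout_s}s: {' '.join(cmd)}"
    if proc.returncode == 0:
        return EXIT_PASS, f"PASS: {' '.join(cmd)}"
    # Surface only the last line of output so the verdict stays short.
    lines = (proc.stderr or proc.stdout).strip().splitlines()
    tail = lines[-1] if lines else ""
    return EXIT_FAIL, f"FAIL (exit {proc.returncode}): {tail}"

def main(argv=None):
    parser = argparse.ArgumentParser(
        prog="oracle-run",
        description="Run a check with a hard timeout and print a one-line verdict.")
    parser.add_argument("--timeout", type=float, default=60.0,
                        help="seconds before the check is killed (default: 60)")
    parser.add_argument("cmd", nargs="+",
                        help="command to run, placed after a '--' separator")
    args = parser.parse_args(argv)
    # Tolerate a leading '--' in case the argument parser leaves it in place.
    cmd = args.cmd[1:] if args.cmd[0] == "--" and len(args.cmd) > 1 else args.cmd
    code, verdict = run_oracle(cmd, args.timeout)
    print(verdict)
    return code

# In a real script, the entry point would call: sys.exit(main())
```

An invocation might look like `oracle-run --timeout 30 -- make check`, where `make check` stands in for whatever command your project actually uses. Keeping the timeout and verdict format in the wrapper means the documentation you hand the agent only has to describe one tool.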

Conclusions

Our goal should be to give an LLM coding agent zero degrees of freedom. This is aspirational at present, but it's where we should be trying to go.

Given any uncontrolled degree of freedom—some aspect of implementing a piece of software that is important for its use case—we cannot expect LLM coding agents to reliably do a good job. Strong, automated oracles are necessary forcing functions for keeping LLMs in line.

An important corollary is that since there are very important aspects of code that we can't easily measure, such as security, modularity, maintainability, and readability, code written by the current generation of LLM coding agents is generally not suitable for use cases where those things are important.