Deterministic Validation for LLM Output
Schema-based validation catches the variance an LLM data cleaner produces between runs. Pattern: deterministic where you can, LLM where you must.
An LLM-based data cleaner running the same prompt on the same input three times can produce three different outputs. The fix is not better prompting; it is pairing the LLM with a deterministic validator that knows what valid output looks like. This page covers the variance pattern, the validation approach, and where to apply it.
The variance is real and measurable
A data cleaner that normalizes messy JSON against a schema should produce identical output for identical input. It does not.
Running the same cleaner three times on the same three-record test file with action values "dropped", "ACCEPTED", and "reject" (target enum: DROP, ACCEPT, REJECT):
| Run | Changes | "dropped" became |
|---|---|---|
| 1 | 9 | DROP |
| 2 | 11 | REJECT |
| 3 | 10 | unchanged |
Three records, three runs, three outcomes. On a test file that small, variance this high is obvious. On a larger dataset it hides.
The fix: spec-driven validation
Define what valid output looks like. Let the LLM attempt the transformation. Run a deterministic validator over the result. If validation fails, you know exactly what is wrong and where.
```
$ python3 schema_validator.py cleaned.json -s schema.json --pretty
{
  "valid": false,
  "violation_count": 1,
  "violations": [
    {
      "path": "[0].action",
      "message": "'dropped' is not one of ['DROP', 'ACCEPT', 'REJECT']",
      "expected": "enum: DROP, ACCEPT, REJECT",
      "actual": "\"dropped\""
    }
  ]
}
```
No inference cost. No variation. If the cleaner misses something, the validator catches it. The combination is more reliable than either alone.
This is JSON Schema plus Python's jsonschema library. Nothing exotic. The point is that the validation layer is structurally incapable of producing inconsistent results, which is the exact property the LLM does not have.
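A minimal sketch of that validation layer, using the jsonschema library the text names. The schema here is a cut-down illustration covering only the `action` enum from the example above; the report shape mirrors the CLI output.

```python
# Sketch: deterministic schema validation over LLM output.
# SCHEMA is an illustrative fragment, not the full production schema.
import json
import jsonschema

SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "action": {"enum": ["DROP", "ACCEPT", "REJECT"]},
        },
        "required": ["action"],
    },
}

def validate(records):
    """Return a report dict: valid flag plus one entry per violation."""
    validator = jsonschema.Draft7Validator(SCHEMA)
    violations = [
        {
            # Render the error path as "[0].action"-style notation.
            "path": "".join(f"[{p}]" if isinstance(p, int) else f".{p}"
                            for p in err.absolute_path),
            "message": err.message,
        }
        for err in validator.iter_errors(records)
    ]
    return {
        "valid": not violations,
        "violation_count": len(violations),
        "violations": violations,
    }

report = validate([{"action": "dropped"}, {"action": "ACCEPT"}])
print(json.dumps(report, indent=2))
```

Because the validator is pure code over a declared schema, running it a thousand times on the same input produces the same report a thousand times.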
Measuring where variance lives
Not every stage of a pipeline has variance. Timing five chained tools and running the pipeline multiple times on the same input produces a breakdown like:
| Stage | Time | Notes |
|---|---|---|
| Parse | 0.44s | Regex, no LLM |
| Validate | 1.65s | Python jsonschema |
| Analyze | 38.73s | LLM triage + investigation |
| Total | 41.19s | |
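A breakdown like this falls out of a per-stage timing harness; a minimal sketch, with stand-in stage functions in place of the real parser, validator, and LLM call:

```python
# Sketch: time each stage of a chained pipeline with perf_counter.
# The stage functions here are trivial stand-ins for the real tools.
import time

def timed(stages, data):
    """Run stages in order; return (final result, per-stage seconds)."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        data = fn(data)
        timings[name] = time.perf_counter() - start
    return data, timings

stages = [
    ("parse", lambda d: d),      # stand-in for the regex parser
    ("validate", lambda d: d),   # stand-in for the schema validator
    ("analyze", lambda d: d),    # stand-in for the LLM analysis stage
]
result, timings = timed(stages, {"records": []})
print(timings)
```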
The LLM is the bottleneck. The Python stages are noise. More interesting is the variance distribution:
Stability Assessment:
- anomalies_detected: STABLE (0% CoV)
- investigated: VARIABLE (26% CoV)
- dismissed: MODERATE (15% CoV)
Statistical detection is perfectly deterministic. The variance comes from LLM triage decisions. This tells you where to focus prompt tuning, temperature adjustments, and chain-of-thought forcing: at the 26% CoV stage, not at the stages that are already stable.
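The stability labels come from the coefficient of variation (CoV = standard deviation / mean) of each metric across runs. A sketch of that assessment, with invented run counts and illustrative thresholds (0% STABLE, under 20% MODERATE, otherwise VARIABLE):

```python
# Sketch: classify per-metric run-to-run stability by coefficient of
# variation. Run data and thresholds are illustrative, not measured.
from statistics import mean, pstdev

def stability(runs):
    """runs: one dict of metric -> count per pipeline run."""
    report = {}
    for metric in runs[0]:
        values = [r[metric] for r in runs]
        m = mean(values)
        cov = pstdev(values) / m if m else 0.0
        if cov == 0:
            label = "STABLE"
        elif cov < 0.20:
            label = "MODERATE"
        else:
            label = "VARIABLE"
        report[metric] = (label, round(cov * 100))
    return report

runs = [
    {"anomalies_detected": 12, "investigated": 5, "dismissed": 7},
    {"anomalies_detected": 12, "investigated": 8, "dismissed": 4},
    {"anomalies_detected": 12, "investigated": 6, "dismissed": 6},
]
print(stability(runs))
```

A metric with identical counts across runs gets CoV 0 by construction, which is exactly what the deterministic detection stage showed.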
The pattern across the pipeline
| Tool | LLM role | Variance |
|---|---|---|
| Format converter | LLM for everything | High |
| Schema validator | No LLM | Zero |
| Log parser | Regex-first, LLM-fallback | Low |
| Statistical profiler | Python only | Zero |
| Agent investigation | LLM for reasoning only | Low (reasoning only) |
Each iteration moves more work out of the LLM and into deterministic code. LLMs are good at reasoning, synthesis, and handling ambiguity. They are expensive, slow, and non-deterministic. The pattern is to use them for what they are good at and use Python for everything else.
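The regex-first, LLM-fallback row can be sketched as: try the deterministic parser first, and only hand the line to the LLM when the pattern cannot match. The log format and the `llm_parse` stub are hypothetical stand-ins for whatever the real pipeline uses.

```python
# Sketch: regex-first parsing with an LLM fallback. llm_parse is a
# stub standing in for an actual model call.
import re

LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<msg>.*)$"
)

def llm_parse(line):
    # Placeholder: in the real pipeline this would be an LLM call.
    return {"raw": line, "source": "llm"}

def parse_line(line):
    m = LOG_RE.match(line)
    if m:  # deterministic path: free, fast, zero variance
        return {**m.groupdict(), "source": "regex"}
    return llm_parse(line)  # fallback for lines regex cannot handle

print(parse_line("2024-05-01T12:00:00 ERROR disk full"))
print(parse_line("weird free-form line"))
```

The variance of the combined parser is bounded by how often the fallback fires, which is why the table rates it Low rather than Zero.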
What I got wrong first
Skipping the three-runs test because things looked fine. Variance is easy to miss when a single run looks correct. The only way to catch it is to deliberately run the same input through the same pipeline multiple times and diff the outputs. A three-record test file surfaced it clearly. A larger file would have hidden it.
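The three-runs test is mechanical to automate: push the same input through the pipeline N times and compare the serialized outputs; more than one distinct output means variance. `run_pipeline` below is a deterministic stand-in for the real cleaner, so this sketch's check passes.

```python
# Sketch: detect run-to-run variance by running the same input N times
# and diffing serialized outputs. run_pipeline stands in for the real
# (LLM-backed) cleaner; here it is deterministic by construction.
import json

def run_pipeline(records):
    # Stand-in: a real run would call the model, not str.upper().
    return [{**r, "action": r["action"].upper()} for r in records]

def is_deterministic(records, runs=3):
    outputs = {json.dumps(run_pipeline(records), sort_keys=True)
               for _ in range(runs)}
    return len(outputs) == 1  # one distinct output across all runs

records = [{"action": "dropped"}, {"action": "accept"}]
print(is_deterministic(records))
```

Wiring this into CI makes the variance check routine instead of something you remember to do after an incident.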
Also: writing the LLM integration before defining the output spec. Retrofitting a schema onto a cleaner that was already running meant rewriting prompts and tests. Defining the spec first would have made the cleaner simpler and the validator obvious from the start.
Tradeoffs
Deterministic validation adds a code layer. The cost is maintaining a schema alongside the prompts. The benefit is that the validator catches drift that no amount of prompt tuning can fully eliminate. For any pipeline where downstream code consumes the LLM's output as structured data, the validation layer is non-optional. For cases where the output is a free-form summary consumed by a human, skip it.