Deterministic Validation for LLM Output
Schema-based validation catches the variance an LLM data cleaner produces between runs. Pattern: deterministic where you can, LLM where you must.
An LLM-based data cleaner running the same prompt on the same input three times can produce three different outputs. The fix is not better prompting; it is pairing the LLM with a deterministic validator that knows what valid output looks like. This page covers the variance pattern, the validation approach, and where to apply it.
The variance is real and measurable
A data cleaner that normalizes messy JSON against a schema should produce identical output for identical input. It does not.
Running the same cleaner three times on the same three-record test file with action values "dropped", "ACCEPTED", and "reject" (target enum: DROP, ACCEPT, REJECT):
| Run | Changes | "dropped" became |
|---|---|---|
| 1 | 9 | DROP |
| 2 | 11 | REJECT |
| 3 | 10 | unchanged |
Three records, three runs, three outcomes. On a test file that small, variance this high is obvious. On a larger dataset it hides.
The fix: spec-driven validation
Define what valid output looks like. Let the LLM attempt the transformation. Run a deterministic validator over the result. If validation fails, you know exactly what is wrong and where.
```
$ python3 schema_validator.py cleaned.json -s schema.json --pretty
{
  "valid": false,
  "violation_count": 1,
  "violations": [
    {
      "path": "[0].action",
      "message": "'dropped' is not one of ['DROP', 'ACCEPT', 'REJECT']",
      "expected": "enum: DROP, ACCEPT, REJECT",
      "actual": "\"dropped\""
    }
  ]
}
```
No inference cost. No variation. If the cleaner misses something, the validator catches it. The combination is more reliable than either alone.
This is JSON Schema plus Python's jsonschema library. Nothing exotic. The point is that the validation layer is structurally incapable of producing inconsistent results, which is the exact property the LLM does not have.
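A minimal sketch of that validation layer, using the jsonschema library the text names. The schema here is a cut-down illustration covering only the `action` enum from the example above; the report shape mirrors the CLI output.

```python
# Sketch: deterministic schema validation over LLM output.
# SCHEMA is an illustrative fragment, not the full production schema.
import json
import jsonschema

SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "action": {"enum": ["DROP", "ACCEPT", "REJECT"]},
        },
        "required": ["action"],
    },
}

def validate(records):
    """Return a report dict: valid flag plus one entry per violation."""
    validator = jsonschema.Draft7Validator(SCHEMA)
    violations = [
        {
            # Render the error path as "[0].action"-style notation.
            "path": "".join(f"[{p}]" if isinstance(p, int) else f".{p}"
                            for p in err.absolute_path),
            "message": err.message,
        }
        for err in validator.iter_errors(records)
    ]
    return {
        "valid": not violations,
        "violation_count": len(violations),
        "violations": violations,
    }

report = validate([{"action": "dropped"}, {"action": "ACCEPT"}])
print(json.dumps(report, indent=2))
```

Because the validator is pure code over a declared schema, running it a thousand times on the same input produces the same report a thousand times.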
Measuring where variance lives
Not every stage of a pipeline has variance. Timing five chained tools and running the pipeline multiple times on the same input produces a breakdown like:
| Stage | Time | Notes |
|---|---|---|
| Parse | 0.44s | Regex, no LLM |
| Validate | 1.65s | Python jsonschema |
| Analyze | 38.73s | LLM triage + investigation |
| Total | 41.19s | |
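A breakdown like this falls out of a per-stage timing harness; a minimal sketch, with stand-in stage functions in place of the real parser, validator, and LLM call:

```python
# Sketch: time each stage of a chained pipeline with perf_counter.
# The stage functions here are trivial stand-ins for the real tools.
import time

def timed(stages, data):
    """Run stages in order; return (final result, per-stage seconds)."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        data = fn(data)
        timings[name] = time.perf_counter() - start
    return data, timings

stages = [
    ("parse", lambda d: d),      # stand-in for the regex parser
    ("validate", lambda d: d),   # stand-in for the schema validator
    ("analyze", lambda d: d),    # stand-in for the LLM analysis stage
]
result, timings = timed(stages, {"records": []})
print(timings)
```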
The LLM is the bottleneck. The Python stages are noise. More interesting is the variance distribution:
Stability Assessment:
- anomalies_detected: STABLE (0% CoV)
- investigated: VARIABLE (26% CoV)
- dismissed: MODERATE (15% CoV)
Statistical detection is perfectly deterministic. The variance comes from LLM triage decisions. This tells you where to focus prompt tuning, temperature adjustments, and chain-of-thought forcing: at the 26% CoV stage, not at the stages that are already stable.
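The stability labels come from the coefficient of variation (CoV = standard deviation / mean) of each metric across runs. A sketch of that assessment, with invented run counts and illustrative thresholds (0% STABLE, under 20% MODERATE, otherwise VARIABLE):

```python
# Sketch: classify per-metric run-to-run stability by coefficient of
# variation. Run data and thresholds are illustrative, not measured.
from statistics import mean, pstdev

def stability(runs):
    """runs: one dict of metric -> count per pipeline run."""
    report = {}
    for metric in runs[0]:
        values = [r[metric] for r in runs]
        m = mean(values)
        cov = pstdev(values) / m if m else 0.0
        if cov == 0:
            label = "STABLE"
        elif cov < 0.20:
            label = "MODERATE"
        else:
            label = "VARIABLE"
        report[metric] = (label, round(cov * 100))
    return report

runs = [
    {"anomalies_detected": 12, "investigated": 5, "dismissed": 7},
    {"anomalies_detected": 12, "investigated": 8, "dismissed": 4},
    {"anomalies_detected": 12, "investigated": 6, "dismissed": 6},
]
print(stability(runs))
```

A metric with identical counts across runs gets CoV 0 by construction, which is exactly what the deterministic detection stage showed.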
The pattern across the pipeline
| Tool | LLM role | Variance |
|---|---|---|
| Format converter | LLM for everything | High |
| Schema validator | No LLM | Zero |
| Log parser | Regex-first, LLM-fallback | Low |
| Statistical profiler | Python only | Zero |
| Agent investigation | LLM for reasoning only | Low (reasoning only) |
Each iteration moves more work out of the LLM and into deterministic code. LLMs are good at reasoning, synthesis, and handling ambiguity. They are expensive, slow, and non-deterministic. The pattern is to use them for what they are good at and use Python for everything else.
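The regex-first, LLM-fallback row can be sketched as: try the deterministic parser first, and only hand the line to the LLM when the pattern cannot match. The log format and the `llm_parse` stub are hypothetical stand-ins for whatever the real pipeline uses.

```python
# Sketch: regex-first parsing with an LLM fallback. llm_parse is a
# stub standing in for an actual model call.
import re

LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<msg>.*)$"
)

def llm_parse(line):
    # Placeholder: in the real pipeline this would be an LLM call.
    return {"raw": line, "source": "llm"}

def parse_line(line):
    m = LOG_RE.match(line)
    if m:  # deterministic path: free, fast, zero variance
        return {**m.groupdict(), "source": "regex"}
    return llm_parse(line)  # fallback for lines regex cannot handle

print(parse_line("2024-05-01T12:00:00 ERROR disk full"))
print(parse_line("weird free-form line"))
```

The variance of the combined parser is bounded by how often the fallback fires, which is why the table rates it Low rather than Zero.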
What I got wrong first
Skipping the three-runs test because things looked fine. Variance is easy to miss when a single run looks correct. The only way to catch it is to deliberately run the same input through the same pipeline multiple times and diff the outputs. A three-record test file surfaced it clearly. A larger file would have hidden it.
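The three-runs test is mechanical to automate: push the same input through the pipeline N times and compare the serialized outputs; more than one distinct output means variance. `run_pipeline` below is a deterministic stand-in for the real cleaner, so this sketch's check passes.

```python
# Sketch: detect run-to-run variance by running the same input N times
# and diffing serialized outputs. run_pipeline stands in for the real
# (LLM-backed) cleaner; here it is deterministic by construction.
import json

def run_pipeline(records):
    # Stand-in: a real run would call the model, not str.upper().
    return [{**r, "action": r["action"].upper()} for r in records]

def is_deterministic(records, runs=3):
    outputs = {json.dumps(run_pipeline(records), sort_keys=True)
               for _ in range(runs)}
    return len(outputs) == 1  # one distinct output across all runs

records = [{"action": "dropped"}, {"action": "accept"}]
print(is_deterministic(records))
```

Wiring this into CI makes the variance check routine instead of something you remember to do after an incident.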
Also: writing the LLM integration before defining the output spec. Retrofitting a schema onto a cleaner that was already running meant rewriting prompts and tests. Defining the spec first would have made the cleaner simpler and the validator obvious from the start.
Tradeoffs
Deterministic validation adds a code layer. The cost is maintaining a schema alongside the prompts. The benefit is that the validator catches drift that no amount of prompt tuning can fully eliminate. For any pipeline where downstream code consumes the LLM's output as structured data, the validation layer is non-optional. For cases where the output is a free-form summary consumed by a human, skip it.