Deterministic Where You Can, LLM Where You Must
I ran my data cleaner three times on the same input. Got 9 changes, then 11, then 10. Same prompt, same model, same data. I knew LLMs were non-deterministic, but the variance surprised me. That's when I started building safety nets.
The Variance Problem
The data cleaner normalizes messy JSON against a schema. The schema says action must be DROP, ACCEPT, or REJECT. My test data had "dropped", "ACCEPTED", and "reject". The cleaner should normalize all of these.
Run 1: 9 changes. "dropped" became "DROP".
Run 2: 11 changes. "dropped" became "REJECT".
Run 3: 10 changes. "dropped" was left unchanged.
Three records. Three runs. Three different outcomes. On a small test file.
I'd used LLMs enough to expect some variation. Temperature settings, sampling, the usual suspects. But seeing it this concretely on such a small dataset was different. The cleaner was useful, but I couldn't trust it to catch everything.
The Fix: Spec-Driven Validation
Claude suggested adding a schema validator. It clicked immediately.
The insight: define what valid output looks like, then verify against it. The LLM handles the fuzzy transformation work. A deterministic validator catches anything that doesn't conform.
```
$ python3 schema_validator.py cleaned.json -s schema.json --pretty
{
  "valid": false,
  "violation_count": 1,
  "violations": [
    {
      "path": "[0].action",
      "message": "'dropped' is not one of ['DROP', 'ACCEPT', 'REJECT']",
      "expected": "enum: DROP, ACCEPT, REJECT",
      "actual": "\"dropped\""
    }
  ]
}
```
No inference cost. No variation. If the cleaner misses something, the validator catches it. The combination is more reliable than either alone.
This is spec-driven development applied to LLM pipelines. Define the spec first (JSON Schema). Let the LLM attempt the transformation. Validate the result. If it fails, you know exactly what's wrong and where.
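The core of that check is only a few lines. A simplified sketch, assuming the jsonschema package and a schema pared down to just the action enum from above (the real validator adds the expected/actual detail and the CLI flags):

```python
import json

from jsonschema import Draft7Validator

# Pared-down schema: only the action enum from this post. The real schema covers more fields.
SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "action": {"enum": ["DROP", "ACCEPT", "REJECT"]},
        },
        "required": ["action"],
    },
}

def json_path(error):
    # Build a rough "[0].action"-style pointer from the error's location.
    parts = []
    for p in error.absolute_path:
        parts.append(f"[{p}]" if isinstance(p, int) else f".{p}")
    return "".join(parts)

def validate(records):
    """Return a list of violations; an empty list means the cleaned output conforms."""
    validator = Draft7Validator(SCHEMA)
    return [
        {"path": json_path(err), "message": err.message}
        for err in validator.iter_errors(records)
    ]

if __name__ == "__main__":
    with open("cleaned.json") as f:
        cleaned = json.load(f)
    violations = validate(cleaned)
    print(json.dumps({
        "valid": not violations,
        "violation_count": len(violations),
        "violations": violations,
    }, indent=2))
```

Nothing in there calls a model, which is the whole point: the check costs milliseconds and gives the same answer every time.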
Measuring Where Variance Lives
I built metrics into the orchestration pipeline. Five tools chained together, timing collected at each stage.
The breakdown:
| Stage | Time | Notes |
|---|---|---|
| Parse | 0.44s | Regex, no LLM |
| Validate | 1.65s | Python jsonschema |
| Analyze | 38.73s | LLM triage + investigation |
| Total | 41.19s | |
The LLM is the bottleneck. Everything else is noise.
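The instrumentation behind that table is nothing special. A sketch of the per-stage timing, with stub functions standing in for the real tools:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    # Record wall-clock time for one pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Stubs standing in for the real tools; each stage is its own script in the actual pipeline.
def parse_logs():
    time.sleep(0.01)

def validate_records():
    time.sleep(0.01)

def llm_analyze():
    time.sleep(0.05)  # the slow, non-deterministic part in practice

with stage("parse"):
    parse_logs()
with stage("validate"):
    validate_records()
with stage("analyze"):
    llm_analyze()

for name, seconds in timings.items():
    print(f"{name:<10}{seconds:7.2f}s")
print(f"{'total':<10}{sum(timings.values()):7.2f}s")
```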
More interesting was the variance testing. Run the same pipeline multiple times, collect stats:
```
Stability Assessment:
- anomalies_detected: STABLE (0% CoV)
- investigated: VARIABLE (26% CoV)
- dismissed: MODERATE (15% CoV)
```
Statistical detection is perfectly deterministic. Same 269,347 anomalies every run. The variance comes from LLM triage decisions.
This tells you where to focus. Prompt tuning, temperature adjustments, chain-of-thought forcing - apply them to the 26% variance stage, not the stages that are already stable.
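For the record, CoV is the coefficient of variation: standard deviation over mean across runs. The calculation is a few lines of Python; the counts below are placeholders except the anomaly total:

```python
from statistics import mean, stdev

def cov(values):
    """Coefficient of variation: standard deviation over mean, as a percentage."""
    m = mean(values)
    return 0.0 if m == 0 else stdev(values) / m * 100

# Per-metric counts collected from repeated runs on the same input (placeholder values,
# apart from the anomaly total, which really was identical every run).
runs = {
    "anomalies_detected": [269347, 269347, 269347],  # deterministic stage: no drift
    "investigated": [11, 8, 9],                       # LLM triage drifts from run to run
    "dismissed": [42, 35, 39],
}

for metric, values in runs.items():
    print(f"{metric}: {cov(values):.0f}% CoV")
```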
The Pattern
Looking back at everything I built:
| Tool | LLM Role | Variance |
|---|---|---|
| Format Converter | LLM for everything | High |
| Schema Validator | No LLM | Zero |
| Log Parser | Regex-first, LLM-fallback | Low |
| Statistical Profiler | Python only | Zero |
| Agent Investigation | LLM for reasoning only | Low (reasoning only) |
Each iteration moved more work out of the LLM and into deterministic code. The pattern: deterministic where you can, LLM where you must.
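The regex-first, LLM-fallback row is the shape I'd reach for by default. A sketch of the idea, with a made-up line format and a hypothetical llm_parse_line() standing in for the model call:

```python
import re

# Hypothetical line format; the real parser carries a library of patterns per log format.
LINE_RE = re.compile(
    r"(?P<timestamp>\S+)\s+(?P<action>DROP|ACCEPT|REJECT)\s+(?P<src>\S+)\s+(?P<dst>\S+)"
)

def llm_parse_line(line: str) -> dict:
    """Stand-in for the model call: slow, costs tokens, and may vary between runs."""
    raise NotImplementedError("call your LLM here")

def parse_line(line: str) -> dict:
    match = LINE_RE.match(line)
    if match:
        return match.groupdict()   # deterministic path: free, fast, repeatable
    return llm_parse_line(line)    # fallback path: only lines the regex can't handle

print(parse_line("2024-06-01T12:00:00Z DROP 10.0.0.5 192.168.1.1"))
```

Most lines never touch the model, so most of the pipeline's output is exactly reproducible.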
I think LLMs are good at reasoning, synthesis, handling ambiguity. They're also expensive, slow, and non-deterministic. For me, the pattern became: use them for what they're good at, use Python for everything else. This approach paid off when the agent actually found the attacks in two different datasets.
What I'd Do Next Time
If I were starting another LLM pipeline from scratch:
Define the output spec first. Before writing any LLM integration, I'd write down what valid output looks like. JSON Schema, type definitions, example outputs. This becomes the validation target. I didn't do this at the start of this project and had to retrofit it.
Build deterministic validation early. The validator doesn't need to be fancy. JSON Schema + Python's jsonschema library took an afternoon. I'm glad Claude suggested it, because I wouldn't have thought to build one on my own.
Measure variance before tuning. Running the pipeline multiple times on the same input showed me which stages were stable and which ones varied. Without that, I would have wasted time tuning stages that didn't need it.
Don't skip the three-runs test. It sounds obvious, but it's easy to skip when things seem to be working. That test caught variance I wouldn't have noticed otherwise.
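It doesn't need infrastructure either. Assuming some run_cleaner() wrapper around a full pipeline pass (the name is hypothetical), the whole test is:

```python
import json
from typing import Callable

def three_runs_test(run: Callable[[], object], n: int = 3) -> bool:
    """Run the same pipeline n times on the same input and check the outputs agree."""
    results = [json.dumps(run(), sort_keys=True) for _ in range(n)]
    for i, result in enumerate(results[1:], start=2):
        if result != results[0]:
            print(f"run {i} differs from run 1")
    return all(r == results[0] for r in results)

# Usage (run_cleaner is hypothetical: one full cleaner pass over the same test file):
# three_runs_test(lambda: run_cleaner("test_data.json"))
```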
The LLM handles the fuzzy work. The spec catches the drift. Together they worked better than either alone.