Deterministic Where You Can, LLM Where You Must
I ran my data cleaner three times on the same input. Got 9 changes, then 11, then 10. Same prompt, same model, same data. I knew LLMs were non-deterministic, but the variance surprised me. That's when I started building safety nets.
The Variance Problem
The data cleaner normalizes messy JSON against a schema. The schema says action must be DROP, ACCEPT, or REJECT. My test data had "dropped", "ACCEPTED", and "reject". The cleaner should normalize all of these.
Run 1: 9 changes. "dropped" became "DROP".
Run 2: 11 changes. "dropped" became "REJECT".
Run 3: 10 changes. "dropped" was left unchanged.
Three records. Three runs. Three different outcomes. On a small test file.
I'd used LLMs enough to expect some variation. Temperature settings, sampling, the usual suspects. But seeing it this concretely on such a small dataset was different. The cleaner was useful, but I couldn't trust it to catch everything.
The Fix: Spec-Driven Validation
Claude suggested adding a schema validator. It clicked immediately.
The insight: define what valid output looks like, then verify against it. The LLM handles the fuzzy transformation work. A deterministic validator catches anything that doesn't conform.
```
$ python3 schema_validator.py cleaned.json -s schema.json --pretty
{
  "valid": false,
  "violation_count": 1,
  "violations": [
    {
      "path": "[0].action",
      "message": "'dropped' is not one of ['DROP', 'ACCEPT', 'REJECT']",
      "expected": "enum: DROP, ACCEPT, REJECT",
      "actual": "\"dropped\""
    }
  ]
}
```
No inference cost. No variation. If the cleaner misses something, the validator catches it. The combination is more reliable than either alone.
This is spec-driven development applied to LLM pipelines. Define the spec first (JSON Schema). Let the LLM attempt the transformation. Validate the result. If it fails, you know exactly what's wrong and where.
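The core of that check is only a few lines. A simplified sketch, assuming the jsonschema package and a schema pared down to just the action enum from above (the real validator adds the expected/actual detail and the CLI flags):

```python
import json

from jsonschema import Draft7Validator

# Pared-down schema: only the action enum from this post. The real schema covers more fields.
SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "action": {"enum": ["DROP", "ACCEPT", "REJECT"]},
        },
        "required": ["action"],
    },
}

def json_path(error):
    # Build a rough "[0].action"-style pointer from the error's location.
    parts = []
    for p in error.absolute_path:
        parts.append(f"[{p}]" if isinstance(p, int) else f".{p}")
    return "".join(parts)

def validate(records):
    """Return a list of violations; an empty list means the cleaned output conforms."""
    validator = Draft7Validator(SCHEMA)
    return [
        {"path": json_path(err), "message": err.message}
        for err in validator.iter_errors(records)
    ]

if __name__ == "__main__":
    with open("cleaned.json") as f:
        cleaned = json.load(f)
    violations = validate(cleaned)
    print(json.dumps({
        "valid": not violations,
        "violation_count": len(violations),
        "violations": violations,
    }, indent=2))
```

Nothing in there calls a model, which is the whole point: the check costs milliseconds and gives the same answer every time.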
Measuring Where Variance Lives
I built metrics into the orchestration pipeline. Five tools chained together, timing collected at each stage.
The breakdown:
| Stage | Time | Notes |
|---|---|---|
| Parse | 0.44s | Regex, no LLM |
| Validate | 1.65s | Python jsonschema |
| Analyze | 38.73s | LLM triage + investigation |
| Total | 41.19s | |
The LLM is the bottleneck. Everything else is noise.
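The instrumentation behind that table is nothing special. A sketch of the per-stage timing, with stub functions standing in for the real tools:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    # Record wall-clock time for one pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Stubs standing in for the real tools; each stage is its own script in the actual pipeline.
def parse_logs():
    time.sleep(0.01)

def validate_records():
    time.sleep(0.01)

def llm_analyze():
    time.sleep(0.05)  # the slow, non-deterministic part in practice

with stage("parse"):
    parse_logs()
with stage("validate"):
    validate_records()
with stage("analyze"):
    llm_analyze()

for name, seconds in timings.items():
    print(f"{name:<10}{seconds:7.2f}s")
print(f"{'total':<10}{sum(timings.values()):7.2f}s")
```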
More interesting was the variance testing. Run the same pipeline multiple times, collect stats:
```
Stability Assessment:
- anomalies_detected: STABLE (0% CoV)
- investigated: VARIABLE (26% CoV)
- dismissed: MODERATE (15% CoV)
```
Statistical detection is perfectly deterministic. Same 269,347 anomalies every run. The variance comes from LLM triage decisions.
This tells you where to focus. Prompt tuning, temperature adjustments, chain-of-thought forcing - apply them to the 26% variance stage, not the stages that are already stable.
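For the record, CoV is the coefficient of variation: standard deviation over mean across runs. The calculation is a few lines of Python; the counts below are placeholders except the anomaly total:

```python
from statistics import mean, stdev

def cov(values):
    """Coefficient of variation: standard deviation over mean, as a percentage."""
    m = mean(values)
    return 0.0 if m == 0 else stdev(values) / m * 100

# Per-metric counts collected from repeated runs on the same input (placeholder values,
# apart from the anomaly total, which really was identical every run).
runs = {
    "anomalies_detected": [269347, 269347, 269347],  # deterministic stage: no drift
    "investigated": [11, 8, 9],                       # LLM triage drifts from run to run
    "dismissed": [42, 35, 39],
}

for metric, values in runs.items():
    print(f"{metric}: {cov(values):.0f}% CoV")
```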
The Pattern
Looking back at everything I built:
| Tool | LLM Role | Variance |
|---|---|---|
| Format Converter | LLM for everything | High |
| Schema Validator | No LLM | Zero |
| Log Parser | Regex-first, LLM-fallback | Low |
| Statistical Profiler | Python only | Zero |
| Agent Investigation | LLM for reasoning only | Low (reasoning only) |
Each iteration moved more work out of the LLM and into deterministic code. The pattern: deterministic where you can, LLM where you must.
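The regex-first, LLM-fallback row is the shape I'd reach for by default. A sketch of the idea, with a made-up line format and a hypothetical llm_parse_line() standing in for the model call:

```python
import re

# Hypothetical line format; the real parser carries a library of patterns per log format.
LINE_RE = re.compile(
    r"(?P<timestamp>\S+)\s+(?P<action>DROP|ACCEPT|REJECT)\s+(?P<src>\S+)\s+(?P<dst>\S+)"
)

def llm_parse_line(line: str) -> dict:
    """Stand-in for the model call: slow, costs tokens, and may vary between runs."""
    raise NotImplementedError("call your LLM here")

def parse_line(line: str) -> dict:
    match = LINE_RE.match(line)
    if match:
        return match.groupdict()   # deterministic path: free, fast, repeatable
    return llm_parse_line(line)    # fallback path: only lines the regex can't handle

print(parse_line("2024-06-01T12:00:00Z DROP 10.0.0.5 192.168.1.1"))
```

Most lines never touch the model, so most of the pipeline's output is exactly reproducible.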
I think LLMs are good at reasoning, synthesis, handling ambiguity. They're also expensive, slow, and non-deterministic. For me, the pattern became: use them for what they're good at, use Python for everything else. This approach paid off when the agent actually found the attacks in two different datasets.
What I'd Do Next Time
If I were starting another LLM pipeline from scratch:
Define the output spec first. Before writing any LLM integration, I'd write down what valid output looks like. JSON Schema, type definitions, example outputs. This becomes the validation target. I didn't do this at the start of this project and had to retrofit it.
Build deterministic validation early. The validator doesn't need to be fancy. JSON Schema + Python's jsonschema library took an afternoon. I'm glad Claude suggested it, because I wouldn't have thought to build one on my own.
Measure variance before tuning. Running the pipeline multiple times on the same input showed me which stages were stable and which ones varied. Without that, I would have wasted time tuning stages that didn't need it.
Don't skip the three-runs test. It sounds obvious, but it's easy to skip when things seem to be working. That test caught variance I wouldn't have noticed otherwise.
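It doesn't need infrastructure either. Assuming some run_cleaner() wrapper around a full pipeline pass (the name is hypothetical), the whole test is:

```python
import json
from typing import Callable

def three_runs_test(run: Callable[[], object], n: int = 3) -> bool:
    """Run the same pipeline n times on the same input and check the outputs agree."""
    results = [json.dumps(run(), sort_keys=True) for _ in range(n)]
    for i, result in enumerate(results[1:], start=2):
        if result != results[0]:
            print(f"run {i} differs from run 1")
    return all(r == results[0] for r in results)

# Usage (run_cleaner is hypothetical: one full cleaner pass over the same test file):
# three_runs_test(lambda: run_cleaner("test_data.json"))
```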
The LLM handles the fuzzy work. The spec catches the drift. Together they worked better than either alone.