Debugging Experiment Loops
Running observations from debugging autonomous experiment loops. What I find when I stop guessing from aggregates and trace through scoring code and spans.
When an agent metric plateaus and prompt changes don't move it, the first instinct is to reason from aggregate scores. Every time I've done that, I've been wrong. This is a running log of what I find when I stop guessing and trace through the scoring code, the trace spans, and the actual agent behavior.
What it is
Running observations from debugging autonomous experiment loops. I use a Karpathy-style autoresearch harness adapted for agent workflows: pick a metric, iterate on prompts overnight, see what moves the number. Each entry here documents one debugging cycle: what looked stuck, what hypotheses failed, what the autopsy found, and what actually moved the number.
Why it exists
Claude gave me three confident root-cause explanations for a stuck metric in one session. All three turned out to be wrong when I asked it to actually pull the traces instead of reasoning from aggregate scores. The trace data was there the whole time. Claude just defaulted to guessing from summaries instead of looking at the evidence.
Since I don't write the code myself, I need a way to verify what the LLM tells me about agent behavior. The protocol below is that verification layer. This Finding tracks how well it works in practice.
The protocol
These are the rules I follow now before proposing any explanation for observed agent behavior. They came out of getting burned enough times.
- Before claiming "the agent does X": Pull at least one trace and confirm X happens in the spans. Cite the trace ID and the span sequence.
- Before claiming "metric Y measures Z": Read the function that computes Y. Quote the relevant lines.
- Before proposing a code change: Read the current code that produces the behavior you want to change.
- Before running an A/B experiment: Verify the hypothesis target actually maps to a measurable score component. If you are targeting evidence_breadth, the lever has to be something score_evidence actually weighs.
- Before comparing experiments: Compare same-shape data (median vs median, include sample size).
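The comparison rule is mechanical enough to enforce in code. A minimal sketch, assuming nothing about the harness; the helper name is mine:

```python
# "Same-shape data" helper: always report median plus sample size, never a
# bare mean against a median. summarize() is an illustrative name, not a
# function from the harness.
from statistics import median

def summarize(scores: list[float]) -> dict:
    return {"median": round(median(scores), 3), "n": len(scores)}

summarize([0.83, 0.84, 0.85, 0.84])  # {'median': 0.84, 'n': 4}
```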
Red-flag phrases that should trigger "pull the trace" instead of more reasoning: "this probably", "likely because", "the data suggests", "it looks like", "my read is", "I think the issue is." If the LLM uses one of these instead of citing a specific trace or code line, that's the signal to push back.
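The red-flag check can also be automated rather than eyeballed. A sketch using the phrase list above; the function name is mine:

```python
# Scan an LLM reply for hedging phrases that should trigger "pull the
# trace" instead of more reasoning. Phrase list is from the protocol above.
RED_FLAGS = [
    "this probably", "likely because", "the data suggests",
    "it looks like", "my read is", "i think the issue is",
]

def needs_evidence(reply: str) -> list[str]:
    """Return the hedging phrases found in a reply, if any."""
    lowered = reply.lower()
    return [phrase for phrase in RED_FLAGS if phrase in lowered]

needs_evidence("It looks like the eviction buffer is the issue.")
# Non-empty result -> push back and ask for a trace ID or code line.
```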
Current observations
2026-04-11: The vocabulary-gap cycle
Project: ai-data-analyst (security investigation agent)
Metric stuck: evidence_breadth plateaued at 0.84 across 107 hermes-4-70b runs
Dataset: 22 new runs after the fix, compared against 29-run baseline
What looked stuck. The agent's evidence_breadth score sat at 0.84 regardless of what we changed. Prompt rewrites, budget increases, forced tool-count rules, different investigation strategies. Nothing moved it. The trace explorer showed it as a flat line across experiments while every other dimension varied.
Three hypotheses that failed.
- "Force 4 tools per entity to increase evidence breadth." Added a hard rule requiring get_entity_detail, pivot, get_timeline, and get_peer_comparison for every high-score entity. Coverage crashed to 0.25 (2 of 15 entities investigated). The 4-tool workload overflowed the context eviction buffer, and the scorer doesn't measure tool diversity anyway. Wrong lever, wrong direction.
- "The tool budget is the constraint." Raised tool_budget from 20 to 50 with the same prompt. The agent used 11 of the 50 budgeted tool calls (22% utilization). Budget was definitively not the bottleneck.
- "The eviction function is broken." Pulled an old trace and found eviction fires at "15 → 11" steady state under 2-tool investigations. The function works fine for the workload it handles. The 4-tool workload from hypothesis 1 overflowed because per-cycle message growth exceeded per-cycle trim, not because eviction was broken.
All three hypotheses came from Claude reasoning about aggregate scores instead of reading the code. I didn't catch it until I asked Claude to explain how it was arriving at those conclusions. It couldn't point to specific traces or code lines.
The autopsy. I had Claude read scorer_v2.py line by line (and every other file in the pipeline). The score_evidence function computes three components: has_evidence_ratio (40%), avg_source_coverage (30%), and evidence_depth (30%). The middle component, avg_source_coverage, compares the agent's declared sources_checked array against the sources actually available for each entity in the SQLite database.
The comparison is exact string matching after normalization. The database uses canonical names: suricata, stream_dns, stream_http, fortigate, auth_log, wineventlog. The agent had been writing informal names: dns, http, auth, windows. None of those match after normalization. Every finding got roughly 50% source coverage credit regardless of how thorough the investigation actually was.
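The failure mode is easy to see in miniature. This is my reconstruction of the matching behavior described above, not scorer_v2.py verbatim; the canonical names are the ones quoted in this writeup:

```python
# Exact-match source coverage: informal names earn zero credit after
# normalization because they never equal a canonical entry.
CANONICAL = {"suricata", "stream_dns", "stream_http",
             "fortigate", "auth_log", "wineventlog"}

def source_coverage(sources_checked: list[str], available: set[str]) -> float:
    """Fraction of available sources the agent's declared list matches."""
    checked = {s.strip().lower() for s in sources_checked}  # normalization
    return len(checked & available) / len(available) if available else 0.0

source_coverage(["dns", "http", "auth"], CANONICAL)                    # 0.0
source_coverage(["stream_dns", "stream_http", "auth_log"], CANONICAL)  # 0.5
```

However thorough the investigation, the informal vocabulary scores as if half the sources were never touched.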
The 0.84 plateau was a vocabulary gap, not a prompt problem or a tool-count problem.
The fix. One prompt edit: a 6-row mapping table showing the get_entity_detail response key → canonical scorer name. The agent now knows to write stream_dns instead of dns.
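The mapping, rendered as data rather than a prompt table. The pairs below are reconstructed from the names quoted in this writeup; the actual table has six rows, so this is the shape, not the exact contents:

```python
# Informal response key -> canonical scorer name (reconstructed subset).
SOURCE_NAME_MAP = {
    "dns": "stream_dns",
    "http": "stream_http",
    "auth": "auth_log",
    "windows": "wineventlog",
}

def canonicalize(name: str) -> str:
    """Rewrite an informal source name to the scorer's vocabulary."""
    return SOURCE_NAME_MAP.get(name.lower(), name)
```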
The result. Evidence_breadth jumped from 0.84 median to 0.97 median across 22 new runs. Composite went from 9.30 to 9.74. Two runs hit a perfect 10.0/10. The fix also exposed a secondary finding: evidence_breadth anti-correlates with entity count at the per-finding level (runs investigating 8 entities averaged 0.976 breadth, runs investigating 15 averaged 0.952). This is a design property of the scoring formula, not a bug.
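The anti-correlation follows mechanically from per-finding averaging. A toy illustration with invented numbers, not the real run data: each extra entity tends to add a finding with below-ceiling coverage, which pulls the mean down.

```python
# Two toy runs: per-finding coverage scores, averaged the way the scorer does.
eight = [1.0] * 7 + [0.8]          # small run: mostly perfect findings
fifteen = [1.0] * 12 + [0.8] * 3   # larger run: more imperfect findings

avg = lambda xs: sum(xs) / len(xs)
round(avg(eight), 3)    # 0.975
round(avg(fifteen), 3)  # 0.96
```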
I also found 5 other harness bugs during the autopsy that had nothing to do with evidence_breadth: a hardcoded constant in the efficiency scorer, a unit-mismatch in the diagnostic report for script-mode evaluation, the harness returning reasoning from the best run while returning the average score, cache hits being invisible in the records, and fragile substring-based nudge classification. Fixed all 6 (including the vocabulary gap), documented in the harness changelog.
What this cycle taught me. If I had made Claude read scorer_v2.py before running any experiment, it would have found the string matching and the canonical name list in 30 minutes. Instead I let it guess from dashboards for hours and run two useless experiments. The trace system and the scorer code are the verification layer. If the LLM skips them, push back.
I also ran an LLM council on the prompt design question. All 5 advisors reasoned about what the prompt "should" do without anyone asking to see the data. A DuckDB query against existing records disproved the council's pick in 30 seconds. Councils are useful for surfacing framings. They are not a substitute for checking the data.
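The kind of query that settled it looks roughly like this. The original check used DuckDB against the harness records; sqlite3 stands in here so the sketch is stdlib-only and self-contained, and the table, columns, and numbers are all illustrative:

```python
# Check the records before trusting the council: does the picked variant
# actually move the metric? Dummy in-memory data, not real run records.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (experiment TEXT, evidence_breadth REAL)")
con.executemany("INSERT INTO runs VALUES (?, ?)", [
    ("baseline", 0.84), ("baseline", 0.85),
    ("council_pick", 0.83), ("council_pick", 0.84),
])
rows = con.execute("""
    SELECT experiment, AVG(evidence_breadth), COUNT(*)
    FROM runs GROUP BY experiment
""").fetchall()
# (The real comparison used medians; AVG keeps the SQL portable here.)
```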
Open questions
- Should the score_evidence formula change to remove the coverage/evidence_breadth anti-correlation? Currently it averages source coverage per finding, so investigating more entities with more source diversity mechanically lowers the per-finding average. Changing the formula would invalidate every prior baseline.
- How do I detect other vocabulary-style impedance mismatches before they plateau? The canonical-name gap was invisible for months. Is there a structural check I could run against new scoring functions?
- When the baseline is at the scoring ceiling (9.78/10), is there value in more autoresearch iterations? The editor found exactly one +0.13 improvement in 15 attempts. Maybe the harness should shift to a different workflow or dataset instead of squeezing the same one.
- The agent still writes informal source names 9% of the time despite the mapping table. Structural post-hook to rewrite names before submission, or stronger prompt instruction, or just accept it?
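One candidate answer to the structural-check question above: diff the vocabulary the agent actually emits against the scorer's canonical set, and flag anything that will earn zero credit. The canonical names are from the cycle writeup; the function is hypothetical, not part of the harness:

```python
# Flag agent-emitted source names the scorer will never match.
CANONICAL = {"suricata", "stream_dns", "stream_http",
             "fortigate", "auth_log", "wineventlog"}

def vocab_mismatches(sources_checked: list[str]) -> set[str]:
    """Names the agent used that fall outside the scorer's vocabulary."""
    return {s.lower() for s in sources_checked} - CANONICAL

vocab_mismatches(["stream_dns", "dns", "windows"])  # {'dns', 'windows'}
```

Run as a post-hook it could also rewrite the stragglers instead of just flagging them, which would answer the 9%-residual question structurally.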
What we've learned so far
- 2026-04-11: Canonical string matching in scoring functions created an invisible metric ceiling. The scorer's expected source names and the agent's actual vocabulary lived in different namespaces for months. One 6-row mapping table fixed it. Evidence_breadth 0.84 → 0.97.
- 2026-04-11: I don't write the code, Claude does. The trace system is my safeguard against having to trust the LLM's reasoning at face value. In this session Claude skipped pulling traces and instead reasoned from aggregate scores, producing three wrong root causes before I caught it by asking how it was arriving at those conclusions. The traces existed the whole time. The LLM just didn't use them until I made it.
- 2026-04-11: Forcing Claude to read every file end-to-end instead of skimming turned up 17 things the prior session's handoff missed, including bugs in the harness that were degrading the optimization signal. The prior session had access to the same files and didn't catch them.
- 2026-04-11: LLM councils can be directionally wrong when none of the advisors ask to see data before reasoning. Good for surfacing framings. Not a substitute for measurement.