created April 10, 2026 · updated April 12, 2026 · 5 min read · evolving · Autoresearch

Autoresearch Harness Log

Working notes on how I use the autoresearch harness to probe agent workflows, find design holes, and decide what to experiment with next.

autoresearch · agents · experiments · harness

Working notes on the autoresearch harness. This is a living doc, not a finished writeup. I dump observations here as I run experiments and let patterns emerge over time. If you are looking for what the harness is and how it is built, see the project page. This is where I track what I am learning from using it.

The pattern comes from Karpathy's autoresearch, originally built for ML training iteration. My adaptation generalizes it for prompts, configs, documents, and full agent investigation scoring.

How I use it

The feedback loop looks like this:

1. Pick a project I am trying to improve (right now that is mostly ai-data-analyst)
2. Look at the trace explorer or scorer output, find a dimension that is lagging
3. Design an autoresearch experiment targeting that dimension
4. Walk away, let it iterate overnight
5. Read the digest the next morning, keep what worked, note what did not

The interesting part is usually step 5. The digest tends to surface something I did not know to ask about at step 1, which becomes the next experiment.
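The overnight iteration in the loop above can be sketched roughly as follows. This is a minimal sketch, assuming a keep-best strategy where a candidate replaces the current best only when it scores strictly higher. The names `run_experiment`, `propose`, `score`, and the toy stand-ins are hypothetical and are not the harness's actual API.

```python
import random
from typing import Callable, List, Tuple


def run_experiment(
    baseline: str,
    propose: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    iterations: int = 15,
    seed: int = 0,
) -> Tuple[str, float, List[dict]]:
    """Keep-best loop: propose an edit, score it, keep it only if it beats the best so far."""
    rng = random.Random(seed)
    best, best_score = baseline, score(baseline)
    records: List[dict] = []
    for i in range(1, iterations + 1):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        kept = candidate_score > best_score
        # One record per iteration, so the morning digest can show what was tried.
        records.append({"iteration": i, "score": candidate_score, "kept": kept})
        if kept:
            best, best_score = candidate, candidate_score
    return best, best_score, records


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_propose(text: str, rng: random.Random) -> str:
    return text + rng.choice(["a", "b"])


def toy_score(text: str) -> float:
    return float(text.count("a"))


best, best_score, records = run_experiment("", toy_propose, toy_score)
digest = [r for r in records if r["kept"]]  # the "what worked" half of the digest
```

Recording rejected candidates alongside kept ones matters here, since the digest is most useful when it shows what did not work as well.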

Current focus (updated 2026-04-12)

First full experiment cycle completed on ai-data-analyst: autopsy → 6 harness fixes → vocabulary-gap prompt fix → 15-iteration run. The vocabulary fix alone lifted the baseline from 9.30 to 9.78. Across the 15 iterations, the editor agent found one additional improvement of +0.13. See the debugging observations for the full breakdown.
Created HARNESS_CHANGELOG.md to track structural changes that affect score comparability across experiments. Without it, future sessions cannot tell whether a metric moved because the prompt changed or because the harness changed.
Second workflow landed: results-analysis, a two-phase agent pipeline that compares autoresearch experiments. Phase 1 (collector) gathers evidence from records.jsonl, traces, the scorer source, and the changelog. Phase 2 (analyst) runs statistical tests and submits a structured analysis. A deterministic scorer checks 6 dimensions of analysis quality.
First e2e run scored 2.75/10. The analyst correctly ran Mann-Whitney U tests (statistical_rigor = 1.0) but failed the discipline checks: it didn't cite trace_ids (evidence_grounding = 0.0), discussed dimensions it hadn't researched (scorer_awareness = 0.0), ignored the changelog (0.0), and didn't flag tracing gaps (0.0). These are hook and prompt issues, not architecture issues.
Context management was the main engineering challenge. Hermes-4-70B at 16K context blows out after 4 tool calls if trace data is returned raw. Fixed with tool output condensation (35KB → 800B per trace) and fresh-session scratchpad writing (Domain 5 cheat sheet pattern). The pipeline now completes 12 tool calls comfortably.
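The tool output condensation mentioned above could look something like the following. This is a minimal sketch under stated assumptions: the trace shape (`trace_id`, `spans`, `tool`, `duration_ms`, `error`) and the function name `condense_trace` are hypothetical, chosen only to illustrate returning a summary instead of raw trace data. The 800-byte budget mirrors the figure quoted above.

```python
import json
from typing import Any, Dict


def condense_trace(trace: Dict[str, Any], max_bytes: int = 800) -> str:
    """Summarize a raw trace as compact JSON, aiming to stay within max_bytes (best effort)."""
    spans = trace.get("spans", [])
    summary = {
        "trace_id": trace.get("trace_id"),
        "span_count": len(spans),
        "tools": sorted({str(s.get("tool", "unknown")) for s in spans}),
        "errors": [s.get("error") for s in spans if s.get("error")][:3],
        "total_ms": sum(s.get("duration_ms", 0) for s in spans),
    }
    text = json.dumps(summary, separators=(",", ":"))
    # Drop the bulkiest optional fields, one at a time, until the summary fits.
    for key in ("errors", "tools"):
        if len(text.encode("utf-8")) <= max_bytes:
            break
        summary[key] = []
        text = json.dumps(summary, separators=(",", ":"))
    return text


# Hypothetical raw trace: 200 spans, each carrying a bulky payload.
raw = {
    "trace_id": "t-001",
    "spans": [
        {"tool": "query", "duration_ms": 5, "payload": "x" * 150}
        for _ in range(200)
    ],
}
condensed = condense_trace(raw)
```

The point of the design is that the model sees counts and totals rather than payloads, so the cost per tool call stays roughly constant no matter how large the underlying trace is.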

Open questions

When a dimension plateaus across 15 iterations, how do I tell the difference between "the prompt cannot improve this" and "the evaluator cannot see the improvement"?
What is the right granularity for a single experiment? Run one prompt through 15 iterations, or run 3 prompts through 5 iterations each?
Should the harness generate its own experiment ideas, or is that overreach? The current version just executes what I give it. Letting it propose targets feels like the next meaningful capability but also feels risky.
How do I avoid re-learning the same lesson across experiments? The scratchpad helps within a single run but does not carry between runs. Maybe a cross-run scratchpad lives in this log.
When the baseline is near the scoring ceiling (9.78/10), is running more iterations useful? The editor found exactly one +0.13 improvement in 15 attempts. Maybe the next experiment should target a different workflow or dataset instead of squeezing the same one.

What I am learning (dated)

2026-04-10: The trace explorer and the autoresearch harness are two sides of the same loop. Telemetry shows me where to aim, the harness moves the needle, telemetry measures whether anything moved. Without both I was guessing.
2026-04-11: Had Claude read every file in the pipeline end-to-end instead of skimming. Found 6 harness bugs (efficiency hardcode, diagnostic report unit mismatch, reasoning from best-not-mean, source name vocabulary gap, cache hits invisible in records, fragile nudge classification). Fixed all 6, re-scored historical data, launched a 15-iteration experiment. The vocabulary fix alone moved the baseline from 9.30 to 9.78.
2026-04-11: The stuck evidence_breadth dimension turned out to be neither the prompt nor the evaluator. The scorer expected canonical source names (stream_dns), the agent wrote informal ones (dns). A string-matching mismatch that was invisible in aggregate scores.
2026-04-12: Built and deployed the results-analysis workflow. First e2e run on GB10: 2.75/10 baseline. The scoring breakdown immediately shows where to aim: statistical_rigor is perfect (1.0), everything else is near zero. The hooks and prompts need work, not the architecture. This is the harness working as designed, same as when synthetic-v1 ran 15 iterations with 0 improvements because the ceiling was tool budget, not prompt quality.
2026-04-12: Context management is not optional for local 70B models. Hermes at 16K context needs condensed tool outputs (summaries not raw data) and fresh-session scratchpad writing (cheat sheet Domain 5). Without both, the pipeline fails at 4 tool calls. With both, it completes 12 tool calls comfortably.
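The 2026-04-11 vocabulary mismatch is easy to reproduce in miniature. The sketch below assumes the scorer computes evidence_breadth by exact string match against a set of canonical source names. The alias table and the functions `canonicalize` and `evidence_breadth` are hypothetical. The fix described above was made on the prompt side; scorer-side normalization is shown here only as one alternative way to make the gap visible.

```python
from typing import Dict, Iterable, Set

# Hypothetical alias table: informal name -> canonical source name.
ALIASES: Dict[str, str] = {"dns": "stream_dns"}


def canonicalize(name: str, aliases: Dict[str, str] = ALIASES) -> str:
    """Map an informal source name to its canonical form, if an alias is known."""
    key = name.strip().lower()
    return aliases.get(key, key)


def evidence_breadth(cited: Iterable[str], expected: Set[str], normalize: bool) -> float:
    """Fraction of expected sources that appear among the cited names."""
    names = {canonicalize(c) if normalize else c for c in cited}
    return len(names & expected) / len(expected) if expected else 0.0


expected = {"stream_dns"}
cited = ["dns"]  # what the agent wrote

strict = evidence_breadth(cited, expected, normalize=False)
lenient = evidence_breadth(cited, expected, normalize=True)
```

With exact matching the dimension scores zero even though the agent used the right source, which is why the problem could not be seen in aggregate scores.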

$ ls ../findings/ --project="Autoresearch"

[evolving] Debugging Experiment Loops