created April 10, 2026 · 3 min read · evolving

Autoresearch Harness Log

Working notes on how I use the autoresearch harness to probe agent workflows, find design holes, and decide what to experiment with next.

autoresearch · agents · experiments · harness

Working notes on the autoresearch harness. This is a living doc, not a finished writeup. I dump observations here as I run experiments and let patterns emerge over time. If you are looking for what the harness is and how it is built, see the project page. This is where I track what I am learning from using it.

The pattern comes from Karpathy's autoresearch, originally built for ML training iteration. My adaptation generalizes it to prompts, configs, documents, and scoring full agent investigations.

How I use it

The feedback loop looks like this:

  1. Pick a project I am trying to improve (right now that is mostly ai-data-analyst)
  2. Look at the trace explorer or scorer output, find a dimension that is lagging
  3. Design an autoresearch experiment targeting that dimension
  4. Walk away, let it iterate overnight
  5. Read the digest the next morning, keep what worked, note what did not

The interesting part is usually step 5. The digest tends to surface something I did not know to ask about at step 1, which becomes the next experiment.
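The loop above can be sketched in a few lines. This is a hypothetical reduction, not the harness's actual code: `mutate` and `score` stand in for whatever proposes a variation (new prompt, new config) and whatever the evaluator returns, and the digest I read in the morning is built from `history`.

```python
def overnight_run(candidate, mutate, score, iterations=15):
    """Iterate on a candidate (prompt/config/doc); keep the best-scoring one.

    `mutate` proposes a variation of the current best; `score` runs the
    evaluator. Both are stand-ins for harness internals not shown here.
    """
    best, best_score = candidate, score(candidate)
    history = [(0, best_score)]            # (iteration, score) pairs
    for i in range(1, iterations + 1):
        trial = mutate(best)               # propose a variation
        s = score(trial)                   # evaluate it
        history.append((i, s))
        if s > best_score:                 # keep only improvements
            best, best_score = trial, s
    return best, history                   # history feeds the morning digest
```

Steps 1-3 of the loop are me choosing `candidate`, `mutate`, and `score`; step 4 is this function; step 5 is reading `history`.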

Current focus (updated 2026-04-10)

  • Using the harness to probe where the ai-data-analyst agent workflow has design holes. I keep noticing that certain scoring dimensions plateau even when the prompt changes significantly, which points to structural problems (tool layer, context management, evaluator sensitivity) rather than prompt problems
  • Thinking about how the harness could start generating its own experiment ideas based on patterns across runs, instead of me having to decide what to try next manually. No design for this yet, just circling the idea

Open questions

  • When a dimension plateaus across 15 iterations, how do I tell the difference between "the prompt cannot improve this" and "the evaluator cannot see the improvement"?
  • What is the right granularity for a single experiment? Run one prompt through 15 iterations, or run 3 prompts through 5 iterations each?
  • Should the harness generate its own experiment ideas, or is that overreach? The current version just executes what I give it. Letting it propose targets feels like the next meaningful capability but also feels risky.
  • How do I avoid re-learning the same lesson across experiments? The scratchpad helps within a single run but does not carry between runs. Maybe a cross-run scratchpad lives in this log.
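On the first question, one check I could run before blaming the prompt: hand the evaluator reference pairs where I already know which output is better, and see whether it ranks them correctly. This is a sketch of that idea, not harness code; `evaluate` is a stand-in for the real scorer.

```python
def evaluator_resolution(evaluate, known_pairs):
    """Fraction of (better, worse) reference pairs the evaluator orders correctly.

    If this is near 0.5 (chance), the evaluator cannot see the dimension,
    and a plateau on it says nothing about whether the prompt improved.
    """
    correct = sum(1 for better, worse in known_pairs
                  if evaluate(better) > evaluate(worse))
    return correct / len(known_pairs)
```

A plateau plus high resolution points at the prompt (or the structure behind it); a plateau plus near-chance resolution points at the evaluator.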

What I am learning (dated)

  • 2026-04-10: The trace explorer and the autoresearch harness are two sides of the same loop. Telemetry shows me where to aim, the harness moves the needle, telemetry measures whether anything moved. Without both I was guessing.
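If the cross-run scratchpad from the open questions does end up living in this log, the mechanics could be as small as an append-only dated file the harness reads back at startup. Everything here is my invention for illustration: the file name, the entry format, and the two helpers are not part of the harness.

```python
import datetime
from pathlib import Path

LOG = Path("autoresearch-lessons.md")      # hypothetical cross-run scratchpad

def record_lesson(text, when=None):
    """Append one dated lesson so it survives between runs."""
    when = when or datetime.date.today().isoformat()
    with LOG.open("a", encoding="utf-8") as f:
        f.write(f"  • {when}: {text}\n")

def load_lessons():
    """Return prior lessons, e.g. to seed the next run's scratchpad."""
    return LOG.read_text(encoding="utf-8").splitlines() if LOG.exists() else []
```

The dated bullets in this section already follow that entry format, which is partly why a plain log file feels like enough.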