Autoresearch
Autonomous experiment harness that iterates on prompts, configs, and agent workflows overnight on the Dell Pro Max GB10.
$ inspired by karpathy/autoresearch
$ cat story.md
Autoresearch is the harness I use to figure out what is actually worth changing in a prompt, config, or agent workflow. I run 15 iterations overnight, see which dimensions moved, and decide what to try next. Without it I was guessing at what to tune.
The first full experiment cycle is complete: I ran an autopsy on the ai-data-analyst scoring pipeline, found 6 harness bugs (including one that silently capped a scoring dimension via a string-matching vocabulary gap), fixed them, and then ran a 15-iteration experiment. The vocabulary fix alone moved the baseline from 9.30/10 to 9.78/10. The harness now keeps a changelog so future experiments can tell whether a metric moved because of a prompt change or because the harness changed underneath it.
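To make the vocabulary-gap failure mode concrete, here is a minimal sketch of how a string-matching scorer can silently cap a dimension. It is not the ai-data-analyst scoring code: `concept_score`, the `NARROW` and `WIDE` vocabularies, and the sample report are all hypothetical, chosen only to show the mechanism.

```python
# Hypothetical scorer: each concept earns credit if ANY accepted phrasing appears.
# A vocabulary that lists too few phrasings caps the score even when the
# report covers every concept.

NARROW = {
    "trend": ["trend"],
    "outlier": ["outlier"],
    "correlation": ["correlation"],
}

# Same concepts, with the synonyms a real report is likely to use.
WIDE = {
    "trend": ["trend", "trajectory"],
    "outlier": ["outlier", "anomaly"],
    "correlation": ["correlation", "correlated"],
}


def concept_score(report: str, vocabulary: dict[str, list[str]],
                  max_score: float = 10.0) -> float:
    """Return max_score scaled by the fraction of concepts matched by substring."""
    text = report.lower()
    hits = sum(
        any(phrase in text for phrase in phrasings)
        for phrasings in vocabulary.values()
    )
    return max_score * hits / len(vocabulary)


report = ("Sales show an upward trend, one anomaly in March, "
          "and are correlated with ad spend.")

narrow = concept_score(report, NARROW)  # only "trend" matches
wide = concept_score(report, WIDE)      # all three concepts match

print(f"narrow vocabulary: {narrow:.2f}/10")
print(f"wide vocabulary:   {wide:.2f}/10")
```

The report is identical in both runs; only the vocabulary changes. That is why a fix like this shifts the baseline without any prompt change, and why it is worth recording in a harness changelog.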
$ ls ./components
- Autonomous iteration loops that run unsupervised
- Diff-based edits for large targets and reasoning-heavy models
- Script-based evaluation beyond LLM-as-judge
- Scratchpad-style context management over long experiments
- Reusable harness patterns across different target types
- Dual-model orchestration (editor and evaluator on different ports)
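As a rough illustration of how the first few components fit together, the sketch below runs an unsupervised keep-if-better loop with a script-based evaluator and a scratchpad of one-line notes. It is a sketch under assumptions, not the harness itself: `run_experiment`, `propose`, `evaluate`, and the toy stand-ins are hypothetical names, and in a real run `propose` would call the editor model while `evaluate` would call the scoring script or the evaluator model.

```python
from typing import Callable


def run_experiment(
    baseline: str,
    propose: Callable[[str, list[str]], str],  # hypothetical editor: (best, notes) -> candidate
    evaluate: Callable[[str], float],          # hypothetical evaluator: candidate -> score
    iterations: int = 15,
) -> tuple[str, float, list[str]]:
    """Keep a candidate only when it beats the best score so far."""
    best, best_score = baseline, evaluate(baseline)
    # The scratchpad holds one short line per iteration instead of full transcripts,
    # so the context handed to the editor stays small over a long run.
    scratchpad = [f"baseline: score={best_score:.2f}"]
    for i in range(1, iterations + 1):
        candidate = propose(best, scratchpad)
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            best, best_score = candidate, score
        scratchpad.append(
            f"iter {i}: score={score:.2f} {'kept' if kept else 'reverted'}"
        )
    return best, best_score, scratchpad


# Toy stand-ins so the loop runs without any model server.
def toy_propose(current: str, notes: list[str]) -> str:
    return current + "x"


def toy_evaluate(text: str) -> float:
    return float(min(len(text), 5))  # score saturates at 5


best, best_score, log = run_experiment("ab", toy_propose, toy_evaluate)
print(best, best_score)
print("\n".join(log[-3:]))
```

Reverting to the last best target on a non-improving iteration keeps a bad edit from compounding, and the per-iteration log is what lets you see afterwards which changes actually moved the score.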
$ ls ./findings/ --project="Autoresearch"
Autoresearch Harness Log
evolving: Working notes on how I use the autoresearch harness to probe agent workflows, find design holes, and decide what to experiment with next.
Debugging Experiment Loops
evolving: Running observations from debugging autonomous experiment loops. What I find when I stop guessing from aggregates and trace through scoring code and spans.