In Progress | Python | YAML | llama.cpp | JSONL | SSH

Autoresearch

Autonomous experiment harness that iterates on prompts, configs, and agent workflows overnight on the Dell Pro Max GB10.

$ inspired by karpathy/autoresearch

$ cat story.md

Autoresearch is the harness I use to figure out what is actually worth changing in a prompt, config, or agent workflow. I run 15 iterations overnight, see which dimensions moved, and decide what to try next. Without it I was guessing at what to tune.

The first full experiment cycle is complete. I ran an autopsy on the ai-data-analyst scoring pipeline, found 6 harness bugs (including one that silently capped a scoring dimension because the string-matching vocabulary had a gap), fixed them, and then ran a 15-iteration experiment. The vocabulary fix alone moved the baseline from 9.30/10 to 9.78/10. The harness now keeps a changelog, so future experiments can tell whether a metric moved because of a prompt change or because the harness changed underneath it.
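To make the vocabulary-gap failure mode concrete, here is a minimal sketch of how a string-matching scorer can silently cap a dimension. This is not the ai-data-analyst scorer: `keyword_score`, the two vocabularies, and the sample report are all hypothetical, chosen only to show the pattern.

```python
# Hypothetical scorer: one point per expected concept that appears in the text.
def keyword_score(text, vocabulary):
    """Score 0-10 by the fraction of concepts with at least one accepted phrasing."""
    lowered = text.lower()
    hits = sum(
        1 for phrases in vocabulary.values()
        if any(phrase in lowered for phrase in phrases)
    )
    return round(10 * hits / len(vocabulary), 2)

# Narrow vocabulary: each concept accepts exactly one literal string.
NARROW = {"trend": ["trend"], "outlier": ["outlier"], "caveat": ["caveat"]}

# Widened vocabulary: same concepts, more accepted phrasings (substring match).
WIDENED = {
    "trend": ["trend"],
    "outlier": ["outlier", "anomal"],
    "caveat": ["caveat", "limitation"],
}

report = "The trend is upward, with two anomalies in March. One limitation: small sample."

narrow_score = keyword_score(report, NARROW)    # report is penalized for its wording
widened_score = keyword_score(report, WIDENED)  # same report, same concepts
```

The report covers all three concepts, but the narrow vocabulary only recognizes one of them, so no prompt change could have lifted that dimension until the vocabulary itself was fixed.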

$ ls ./components

YAML Experiment Config, Diff-Based Edit Engine, LLM-as-Judge Evaluator, Script-Based Evaluator, Structural Validator, Scratchpad Context Manager, JSONL Event Logger, Status Monitor, Markdown Digest Generator, SSH Deploy Scripts, Dual-Model Experiment Runner

  • Autonomous iteration loops that run unsupervised
  • Diff-based edits for large targets and reasoning-heavy models
  • Script-based evaluation beyond LLM-as-judge
  • Scratchpad-style context management over long experiments
  • Reusable harness patterns across different target types
  • Dual-model orchestration (editor and evaluator on different ports)
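As a rough illustration of how an unsupervised iteration loop and a JSONL event log can fit together, here is a minimal sketch. It assumes a greedy keep-if-better policy, which may not match the real harness. `run_experiment`, `propose_edit`, `evaluate`, and `HARNESS_VERSION` are hypothetical names, and the toy editor and evaluator stand in for the model-backed ones.

```python
import io
import json

# Hypothetical version tag; logging it lets later runs notice harness changes.
HARNESS_VERSION = "0.0-example"

def run_experiment(target, propose_edit, evaluate, iterations, log):
    """Greedy loop: keep a candidate only if it beats the best score so far."""
    best, best_score = target, evaluate(target)
    for i in range(1, iterations + 1):
        candidate = propose_edit(best, i)
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            best, best_score = candidate, score
        # One JSON object per line (JSONL), so a status monitor can tail the log.
        event = {"iteration": i, "score": score, "kept": kept,
                 "harness_version": HARNESS_VERSION}
        log.write(json.dumps(event) + "\n")
    return best, best_score

# Toy stand-ins for the editor and evaluator models (assumptions for the demo).
def propose_edit(text, i):
    # Odd iterations append an instruction; even ones drop the last character.
    return text + " Be concise." if i % 2 else text[:-1]

def evaluate(text):
    # Deterministic toy metric: count occurrences of "concise".
    return text.count("concise")

log = io.StringIO()  # an open file handle would work the same way
best, best_score = run_experiment("Summarize the data.", propose_edit, evaluate, 15, log)
events = [json.loads(line) for line in log.getvalue().splitlines()]
```

Recording the harness version on every event is one simple way to separate a score change caused by an edit from one caused by a change to the harness.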

$ ls ./findings/ --project="Autoresearch"