In Progress · Python · YAML · llama.cpp · JSONL · SSH

Autoresearch

Autonomous experiment harness that iterates on prompts, configs, and agent workflows overnight on the Dell Pro Max GB10.

$ inspired by karpathy/autoresearch

$ cat story.md

Autoresearch is the harness I use to figure out what is actually worth changing in a prompt, config, or agent workflow. I run 15 iterations overnight, see which dimensions moved, and decide what to try next. Without it I was guessing at what to tune.
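The overnight loop described above can be sketched roughly as follows. This is a minimal stand-in, not the harness itself: `run_iteration` is a hypothetical stub (the real harness calls an editor and an evaluator model), and the dimension names are assumptions for illustration.

```python
import json
import random

def run_iteration(config: dict) -> dict:
    """Stand-in for one harness iteration: apply an edit to the target,
    then score it along each dimension. Hypothetical stub; real scores
    would come from an evaluator model or script."""
    return {dim: round(random.uniform(0.0, 1.0), 2)
            for dim in config["dimensions"]}

def overnight_run(config: dict, iterations: int = 15) -> list[dict]:
    """Run the loop unsupervised and collect per-dimension scores."""
    history = []
    for i in range(iterations):
        scores = run_iteration(config)
        history.append({"iteration": i, "scores": scores})
    return history

config = {"dimensions": ["accuracy", "brevity", "tool_use"]}
history = overnight_run(config, iterations=15)

# Morning question: which dimensions moved between first and last iteration?
first, last = history[0]["scores"], history[-1]["scores"]
moved = {d: round(last[d] - first[d], 2) for d in config["dimensions"]}
print(json.dumps(moved))
```

The payoff is the `moved` dict at the end: instead of guessing at what to tune, the deltas say which dimensions responded to the edits.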

Still early. Most of what I am doing right now is probing my own agent workflows to find where the design assumptions break down. The trace explorer on the ai-data-analyst project feeds this loop: look at the telemetry, pick a dimension that is lagging, run an autoresearch experiment, see if the scores move. The harness itself evolves as I use it, so I keep the Autoresearch Harness Log for notes on what is working and what is not.
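The "pick a dimension that is lagging" step can be sketched against JSONL telemetry like the trace explorer emits. The event shape and field names below are assumptions, not the actual schema.

```python
import json
from collections import defaultdict

# Hypothetical JSONL telemetry: one scored event per line.
telemetry = """\
{"dimension": "grounding", "score": 0.82}
{"dimension": "grounding", "score": 0.78}
{"dimension": "formatting", "score": 0.95}
{"dimension": "tool_use", "score": 0.55}
{"dimension": "tool_use", "score": 0.61}
"""

def lagging_dimension(jsonl_text: str) -> str:
    """Average the scores per dimension and return the weakest one,
    i.e. the next candidate for an autoresearch experiment."""
    scores = defaultdict(list)
    for line in jsonl_text.strip().splitlines():
        event = json.loads(line)
        scores[event["dimension"]].append(event["score"])
    return min(scores, key=lambda d: sum(scores[d]) / len(scores[d]))

print(lagging_dimension(telemetry))  # → tool_use
```

Running an experiment against that dimension and re-reading the telemetry closes the loop: if the average moves, the edit was worth keeping.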

$ ls ./components

YAML Experiment Config · Diff-Based Edit Engine · LLM-as-Judge Evaluator · Script-Based Evaluator · Structural Validator · Scratchpad Context Manager · JSONL Event Logger · Status Monitor · Markdown Digest Generator · SSH Deploy Scripts · Dual-Model Experiment Runner
  • Autonomous iteration loops that run unsupervised
  • Diff-based edits for large targets and reasoning-heavy models
  • Script-based evaluation beyond LLM-as-judge
  • Scratchpad-style context management over long experiments
  • Reusable harness patterns across different target types
  • Dual-model orchestration (editor and evaluator on different ports)
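Tying several of these components together, a YAML experiment config might look something like the sketch below. Every key name and path here is an assumption made for illustration, not the harness's documented schema.

```yaml
# Hypothetical experiment config; key names and paths are assumptions.
experiment: tighten-tool-use-prompt
target: prompts/agent_system.md
iterations: 15
edit_mode: diff               # diff-based edits for large targets
editor:
  backend: llama.cpp          # served locally on the GB10
  port: 8080
evaluator:
  backend: llama.cpp
  port: 8081                  # dual-model setup: separate port from the editor
evaluators:
  - type: llm_judge
    dimensions: [accuracy, brevity, tool_use]
  - type: script              # script-based evaluation beyond LLM-as-judge
    command: python scripts/check_structure.py
log: runs/tool-use/events.jsonl
digest: runs/tool-use/digest.md
```

The split between an `llm_judge` evaluator and a `script` evaluator mirrors the component list above: judgments that need a model, plus checks that a deterministic script can verify.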

$ ls ./findings/ --project="Autoresearch"