Ralph Loops
Learning autonomous AI development through test-driven loops.
$ cat story.md
I wanted to understand what makes AI succeed or fail at completing tasks on its own. The hypothesis: if your tests fully specify what you want, the AI can build it.
The pattern is simple. Write tests that capture exactly what you'd accept AND reject, give the AI a prompt, let it run until tests pass. The magic is in learning what makes good tests.
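In code, the loop is only a few lines. This is a sketch in TypeScript rather than the bash script I actually use, and the ai-agent command, prompt.md, and the iteration cap are placeholders:

// ralph-loop.ts: run the agent until the test suite passes (sketch, not the real script)
import { spawnSync } from "node:child_process";

const MAX_ITERATIONS = 10;

for (let i = 1; i <= MAX_ITERATIONS; i++) {
  // Same prompt and specs every time; the repo is the agent's only memory.
  spawnSync("ai-agent", ["--prompt-file", "prompt.md"], { stdio: "inherit" });

  // The tests are the acceptance criteria: stop only when they all pass.
  const tests = spawnSync("npm", ["test"], { stdio: "inherit" });
  if (tests.status === 0) {
    console.log(`Tests green after iteration ${i}`);
    break;
  }
}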
This isn't about building production software. It's about learning a methodology that makes AI more useful, and discovering that good PRDs and tests are valuable skills even without AI.
$ ls ./experiments/
TryParse
A parser combinator library built in 2 iterations.
CLI Components
A React component library built in 2 iterations.
Photon Forge
A light-beam puzzle game built in 9 iterations.
llama-mcp-server
An MCP server bridging Claude Code to local llama.cpp, built with the "True Ralph" pattern.
$ cat pipeline.txt
AI passes whatever tests you write. Shallow tests = shallow code.
E2E tests verify "it runs" not "it is correct."
Tests for what should NOT happen catch bugs positive tests miss.
Lock critical behavior with known-good states.
Build a validator, test all generated content against it.
If you would reject output that technically passes, your tests are missing a criterion.
Generators need both acceptance AND rejection tests (see the sketch after this list).
You cannot test "looks good" but CAN test "follows rules."
Unit tests pass but build fails. Always test tsc/build.
Browser devtools device emulation misses real device issues. Test on phones.
Debugging non-essential features can be worse than removing them.
For bounded problems with clear tests, the loop is a safety net.
If E2E tests matter, they must be in the explicit checklist.
Run all test suites once before Ralph to catch config bugs.
Code that passes jsdom tests may crash in real Chrome.
Demos must match their deployment context (dark theme, etc.).
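Here is what the generator lessons above look like as actual tests. A vitest-style sketch; generateLevel, the seed option, and the field names and thresholds are illustrative stand-ins for Photon Forge's real generator:

import { describe, it, expect } from "vitest";
// Illustrative import: a puzzle level generator under test.
import { generateLevel } from "./generator";

describe("level generator", () => {
  const level = generateLevel({ seed: 42 });

  it("meets the acceptance criteria", () => {
    expect(level.mirrors.length).toBeGreaterThanOrEqual(1); // non-trivial: at least one mirror to place
    expect(level.targets.length).toBeGreaterThan(0);        // something to aim the beam at
  });

  it("avoids output I would reject by hand", () => {
    // Rejection criteria: technically valid output a human would still throw away.
    expect(level.mirrors.length).toBeLessThanOrEqual(8);                      // no cluttered, unreadable boards
    expect(level.grid.width * level.grid.height).toBeLessThanOrEqual(100);    // keep boards playable on a phone
  });
});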
$ cat current-status.md
llama-mcp-server: Complete · Publishing
llama-mcp-server validated the "True Ralph" pattern: 44 tasks executed in separate context windows, zero failures, 398 tests. Each Ralph read specs from files, completed one task, and exited. Knowledge persisted in code, not memory.
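The outer driver for that pattern is small. A sketch, with ai-agent standing in for the real agent CLI and tasks.md / SPEC.md as illustrative file names:

// true-ralph.ts: one atomic task per fresh context window (sketch)
import { readFileSync } from "node:fs";
import { spawnSync } from "node:child_process";

// tasks.md: one atomic task per line, e.g. "Create a formatError() helper"
const tasks = readFileSync("tasks.md", "utf8")
  .split("\n")
  .filter((line) => line.trim().length > 0);

for (const task of tasks) {
  // Fresh invocation = fresh context window. Specs and conventions live in the repo, not in memory.
  const run = spawnSync(
    "ai-agent",
    ["--prompt", `Read SPEC.md, complete exactly this task, run the tests, then exit: ${task}`],
    { stdio: "inherit" }
  );
  if (run.status !== 0) {
    console.error(`Task failed, stopping: ${task}`);
    break;
  }
}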
Key discovery: convention inheritance without shared memory. Ralph #7 created a formatError() helper. Every subsequent Ralph copied the pattern. Institutional knowledge lives in files.
Four subprojects now complete. Prepping llama-mcp-server for npm publish as first open source contribution. Package name confirmed: llama-mcp-server.
Milestones
5 / 7
- What makes AI succeed/fail at autonomous tasks
- Writing tests that fully specify requirements
- Patterns for different task types (games, parsers, UI, MCP servers)
- Transferable skills: PRD writing, test design
- Multi-context execution (True Ralph pattern)
$ ls ./blog/ --project="Ralph Loops"
Setting Up Your First Ralph Loop: A Practical Guide
Jan 15, 2026 · How I set up autonomous AI development with specs, task lists, and a bash loop
Building an MCP Server in 2 Hours with 44 Autonomous AI Tasks
Jan 15, 2026 · How fresh context windows per task changed my AI-assisted development workflow
Ralph Loops Work Too Well (Now What?)
Jan 12, 2026 · I tried two different approaches to test-driven AI development. Both worked. Here's what I learned about writing tests as requirements.
$ cat log.txt
Package name confirmed (llama-mcp-server). First open source contribution. 19 tools bridging Claude Code to local llama.cpp.
Multi-context execution works. Each task in fresh context window, knowledge persists in files. Convention inheritance observed: later Ralphs copy patterns from earlier ones.
Set up 44 atomic tasks to build MCP server for llama.cpp. Testing "True Ralph" pattern: each task gets own context window, specs live in files.
React component library completed in 2 iterations. Key lesson: E2E tests weren't in the explicit checklist, so Ralph never ran them. Success criteria must be exhaustive.
Unit tests passed in jsdom but E2E failed in Chrome with "Illegal invocation." Timer functions need their original context in real browsers.
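The minimal reproduction of that class of bug: Chrome's setTimeout is a method that must be called with window as its receiver, so a detached reference throws, while the same code had passed under jsdom:

// Passes under jsdom, throws "TypeError: Illegal invocation" in real Chrome:
const schedule = window.setTimeout; // detached from its window receiver
schedule(() => console.log("tick"), 100);

// Fix: keep or restore the original context.
const scheduleBound = window.setTimeout.bind(window);
scheduleBound(() => console.log("tick"), 100);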
Set up 97 unit tests and 22 E2E tests for 4 React components: TypingEffect, ProgressBar, Collapsible, CopyButton. Testing methodology on UI component library.
Parser library completed in 2 iterations. Discovered that unit tests passing does not mean the build succeeds. Added build test (tsc --noEmit) as Lesson 13.
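That lesson turns into a one-test fix: make the compile step part of the suite so the loop cannot go green without it. A sketch assuming vitest; execSync throws on a non-zero exit, which fails the test:

import { it } from "vitest";
import { execSync } from "node:child_process";

// Unit tests passing does not mean the project compiles.
it("type-checks cleanly", () => {
  execSync("npx tsc --noEmit", { stdio: "inherit" }); // throws if tsc reports errors
});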
TryParse passed all 180 tests in a single Claude invocation. The loop is a safety net for larger projects, not a requirement for bounded problems.
Set up 180 tests for a parser combinator library. Testing autonomous approach vs phased approach from Photon Forge.
Fixed mirror removal for touch devices (tap-to-cycle). Cut endless mode rather than debug complex mobile timing issues. Lesson 13: know when to cut.
Discovered you cannot test "looks good" but CAN test design rules (same button widths, min touch targets, no horizontal scroll). Ralph fixed 12 UI issues in 1 iteration.
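Those rules translate directly into E2E assertions. A Playwright-style sketch; the selectors and thresholds are illustrative, and 44px is the usual minimum touch-target guideline:

import { test, expect } from "@playwright/test";

test("design rules hold on the game screen", async ({ page }) => {
  await page.goto("/");

  // Rule: no horizontal scroll.
  const overflows = await page.evaluate(
    () => document.documentElement.scrollWidth > document.documentElement.clientWidth
  );
  expect(overflows).toBe(false);

  // Rule: every button meets the minimum touch target.
  for (const button of await page.locator("button").all()) {
    const box = await button.boundingBox();
    expect(box).not.toBeNull();
    expect(box!.height).toBeGreaterThanOrEqual(44);
  }

  // Rule: buttons in the same toolbar share a width (selector illustrative).
  const widths = await Promise.all(
    (await page.locator(".toolbar button").all()).map(async (b) => (await b.boundingBox())!.width)
  );
  expect(new Set(widths.map(Math.round)).size).toBeLessThanOrEqual(1);
});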
Added test for non-trivial levels (mirrors >= 1). Fixed generator in 1 iteration. Lesson 11 proven: generators need rejection criteria, not just acceptance.
Three iterations to fix endless mode. Lessons: flaky tests let bugs slip, React hooks need direct tests, generators can produce technically-valid-but-unacceptable output.
Built a solver to validate all 20 levels are solvable. Lesson: for content that needs to be "valid," build a validator and test against it.
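In test form that is one loop over the content. A vitest-style sketch; levels and solve stand in for the project's real modules:

import { describe, it, expect } from "vitest";
import { levels } from "./levels"; // the 20 shipped levels
import { solve } from "./solver";  // returns a solution, or null if the level is unsolvable

describe("every shipped level is solvable", () => {
  for (const [index, level] of levels.entries()) {
    it(`level ${index + 1}`, () => {
      expect(solve(level)).not.toBeNull();
    });
  }
});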
First Ralph Loop experiment. E2E tests passed but game logic was buggy. Core lesson: E2E tests verify "it runs" not "it is correct."
Created project to systematically test autonomous AI development. Goal: document what works through experiments, not theory.