Built: Claude Code · Vitest · Playwright · React · TypeScript · MCP

Ralph Loops

Learning autonomous AI development through test-driven loops.

$ cat story.md

I wanted to understand what makes AI succeed or fail at completing tasks on its own. The hypothesis: if your tests fully specify what you want, the AI can build it.

The pattern is simple. Write tests that capture exactly what you'd accept AND reject, give the AI a prompt, let it run until tests pass. The magic is in learning what makes good tests.
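
A minimal version of the loop, sketched as a Node script. The `claude -p` invocation, the file names, and the prompt text are illustrative assumptions, not the exact setup used here:

```ts
// ralph-loop.ts: rerun the agent until the test suite is green.
import { spawnSync } from "node:child_process";

// Illustrative prompt; in practice the specs live in files the agent reads.
const PROMPT = "Read PROMPT.md. Make all tests pass. Do not modify the tests.";

function testsPass(): boolean {
  // Vitest exits 0 only when every test passes.
  return spawnSync("npx", ["vitest", "run"], { stdio: "inherit" }).status === 0;
}

let iteration = 0;
while (!testsPass()) {
  iteration += 1;
  console.log(`--- iteration ${iteration} ---`);
  // Hand the same prompt to a fresh agent run; the tests are the only judge.
  spawnSync("claude", ["-p", PROMPT], { stdio: "inherit" });
}
console.log(`Tests green after ${iteration} iteration(s).`);
```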

This isn't about building production software. It's about learning a methodology that makes AI more useful, and discovering that good PRDs and tests are valuable skills even without AI.

$ ls ./experiments/

photon-forge/  tryparse/  cli-components/  llama-mcp-server/

$ cat pipeline.txt

Setup
Run Loop
Observe
Document
Iterate
Components Complete: 16 of 16

Tests Define Done (Level 1)

AI passes whatever tests you write. Shallow tests = shallow code.

Unit Tests for Logic (Level 1)

E2E tests verify "it runs" not "it is correct."
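
For example, with a hypothetical `reflect()` function from a Photon Forge-style game (an E2E test would only confirm the board renders):

```ts
import { it, expect } from "vitest";
import { reflect } from "./optics"; // hypothetical game-logic module

// A unit test pins down what "correct" means, not just "it runs".
it("reflects an eastbound beam north off a '/' mirror", () => {
  expect(reflect("east", "/")).toBe("north");
});

it("reflects an eastbound beam south off a '\\' mirror", () => {
  expect(reflect("east", "\\")).toBe("south");
});
```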

Test Negative Cases (Level 2)

Tests for what should NOT happen catch bugs positive tests miss.
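
A sketch of the idea, using a hypothetical game-state API:

```ts
import { it, expect } from "vitest";
import { createGame } from "./game"; // hypothetical

it("rejects placing a mirror on an occupied cell", () => {
  const game = createGame();
  game.placeMirror(2, 2);
  // Positive tests prove placement works; this proves bad placement fails.
  expect(() => game.placeMirror(2, 2)).toThrow();
  expect(game.mirrorCount()).toBe(1); // and state wasn't silently mutated
});
```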

Fixture Tests (Level 2)

Lock critical behavior with known-good states.
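
One way to lock a known-good state, with a hypothetical solver and fixture path:

```ts
import { it, expect } from "vitest";
import { readFileSync } from "node:fs";
import { solve } from "./solver"; // hypothetical

it("level-07 fixture still solves to the recorded solution", () => {
  // The fixture is a state verified good by hand and committed to the repo.
  const fixture = JSON.parse(readFileSync("fixtures/level-07.json", "utf8"));
  expect(solve(fixture.level)).toEqual(fixture.expectedSolution);
});
```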

The Solver Pattern (Level 3)

Build a validator, test all generated content against it.
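
A sketch of the pattern, with hypothetical generator and solver functions; the "mirrors >= 1" rejection criterion from the log below appears as the second assertion:

```ts
import { it, expect } from "vitest";
import { generateLevel } from "./generator"; // hypothetical
import { solve } from "./solver";            // hypothetical

it("every generated level is solvable and non-trivial", () => {
  for (let seed = 0; seed < 100; seed++) {
    const level = generateLevel(seed);
    // Acceptance: the validator can solve it.
    expect(solve(level), `seed ${seed}`).not.toBeNull();
    // Rejection: technically-valid-but-trivial output is still a failure.
    expect(level.mirrors.length).toBeGreaterThanOrEqual(1);
  }
});
```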

Explicit Requirements (Level 2)

If you would reject valid output, that criterion is missing.

Rejection Criteria (Level 3)

Generators need both acceptance AND rejection tests.

Design Constraints (Level 3)

You cannot test "looks good" but CAN test "follows rules."
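
Design rules like these translate directly into Playwright assertions (selectors and thresholds here are illustrative):

```ts
import { test, expect } from "@playwright/test";

test("touch targets are big enough and nothing scrolls sideways", async ({ page }) => {
  await page.goto("/"); // assumes baseURL in playwright.config.ts
  for (const button of await page.locator("button").all()) {
    const box = await button.boundingBox();
    expect(box, "every button should be visible").not.toBeNull();
    expect(box!.height).toBeGreaterThanOrEqual(44); // minimum touch target
  }
  const overflows = await page.evaluate(
    () => document.documentElement.scrollWidth > document.documentElement.clientWidth
  );
  expect(overflows).toBe(false); // no horizontal scroll
});
```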

Build Tests (Level 2)

Unit tests can pass while the build fails. Always run tsc/build.
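
A build check can live inside the suite itself, so "all tests pass" implies "it compiles" (a sketch):

```ts
import { it, expect } from "vitest";
import { spawnSync } from "node:child_process";

it("tsc --noEmit finds no type errors", () => {
  const result = spawnSync("npx", ["tsc", "--noEmit"], { encoding: "utf8" });
  // Attach compiler output to the failure message so the agent can read it.
  expect(result.status, result.stdout + result.stderr).toBe(0);
});
```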

Mobile Testing (Level 3)

Browser devtools emulation misses real device issues. Test on phones.

When to Cut (Level 4)

Debugging non-essential features can be worse than removing them.

One Iteration Enough? (Level 4)

For bounded problems with clear tests, the loop is a safety net.

Exhaustive Success Criteria (Level 2)

If E2E tests matter, they must be in the explicit checklist.

Verify Scaffolding (Level 2)

Run all test suites once before Ralph to catch config bugs.
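
A preflight sketch. The point is "does each runner execute," not "does it pass" (red tests are expected before the loop starts):

```ts
// preflight.ts: launch every suite once to flush out config bugs.
import { spawnSync } from "node:child_process";

const suites: [string, string[]][] = [
  ["npx", ["tsc", "--noEmit"]],
  ["npx", ["vitest", "run"]],
  ["npx", ["playwright", "test"]],
];

for (const [cmd, args] of suites) {
  console.log(`\n$ ${cmd} ${args.join(" ")}`);
  const { status, error } = spawnSync(cmd, args, { stdio: "inherit" });
  // A config bug looks like a crash or "no tests found", not failing assertions.
  console.log(error ? `could not launch: ${error.message}` : `exit code: ${status}`);
}
```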

jsdom vs Browser (Level 3)

Code that passes jsdom tests may crash in real Chrome.
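
The classic instance, matching the "Illegal invocation" failure logged below: jsdom tolerates detached window methods, real Chrome does not.

```ts
// Passes under jsdom, throws "TypeError: Illegal invocation" in Chrome:
const delay = window.setTimeout; // detached reference loses its `this`
delay(() => console.log("tick"), 100);

// Fix: keep the binding (or just call window.setTimeout directly).
const boundDelay = window.setTimeout.bind(window);
boundDelay(() => console.log("tick"), 100);
```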

Demo Styling Context (Level 3)

Demos must match their deployment context (dark theme, etc.).

$ cat current-status.md

llama-mcp-server: Complete

Publishing

January 15, 2026

llama-mcp-server validated the "True Ralph" pattern: 44 tasks executed in separate context windows, zero failures, 398 tests. Each Ralph read specs from files, completed one task, and exited. Knowledge persisted in code, not memory.

Key discovery: convention inheritance without shared memory. Ralph #7 created a formatError() helper. Every subsequent Ralph copied the pattern. Institutional knowledge lives in files.
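
The source only names formatError(); a plausible shape, purely illustrative:

```ts
// errors.ts: once this file exists, later Ralphs find and reuse it
// simply by reading the codebase. (Signature is hypothetical.)
export function formatError(tool: string, err: unknown): string {
  const message = err instanceof Error ? err.message : String(err);
  return `[${tool}] ${message}`;
}
```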

Four subprojects now complete. Prepping llama-mcp-server for npm publish as first open source contribution. Package name confirmed: llama-mcp-server.

Milestones: 5 / 7 complete

Photon Forge: 9 loops, 330 tests, puzzle game
TryParse: 2 loops, 201 tests, parser library
CLI Components: 2 loops, 119 tests, React components
llama-mcp-server: 44 tasks, 398 tests, MCP server
True Ralph pattern validated
Publish to npm as llama-mcp-server (pending)
Blog post: Building MCP Server with Autonomous AI (pending)
  • What makes AI succeed/fail at autonomous tasks
  • Writing tests that fully specify requirements
  • Patterns for different task types (games, parsers, UI, MCP servers)
  • Transferable skills: PRD writing, test design
  • Multi-context execution (True Ralph pattern)

$ ls ./blog/ --project="Ralph Loops"

Jan 15, 2026
llama-mcp-server ready for npm publish [llama-mcp-server]

Package name confirmed (llama-mcp-server). First open source contribution. 19 tools bridging Claude Code to local llama.cpp.

Jan 14, 2026
True Ralph pattern validated: 44 tasks, zero failures [llama-mcp-server]

Multi-context execution works. Each task in fresh context window, knowledge persists in files. Convention inheritance observed: later Ralphs copy patterns from earlier ones.

llama-mcp-server experiment started [llama-mcp-server]

Set up 44 atomic tasks to build MCP server for llama.cpp. Testing "True Ralph" pattern: each task gets own context window, specs live in files.

CLI Components complete: success criteria lesson [CLI Components]

React component library completed in 2 iterations. Key lesson: E2E tests weren't in the explicit checklist, so Ralph never ran them. Success criteria must be exhaustive.

jsdom vs browser differences discovered [CLI Components]

Unit tests passed in jsdom but E2E failed in Chrome with "Illegal invocation." Timer functions need their original context in real browsers.

CLI Components experiment started [CLI Components]

Set up 97 unit tests and 22 E2E tests for 4 React components: TypingEffect, ProgressBar, Collapsible, CopyButton. Testing methodology on UI component library.

Jan 11, 2026
TryParse complete: build tests lesson learned [TryParse]

Parser library completed in 2 iterations. Discovered that unit tests passing does not mean the build succeeds. Added a build test (tsc --noEmit) as a new lesson.

One iteration can be enough [TryParse]

TryParse completed 180 tests in a single Claude invocation. The loop is a safety net for larger projects, not a requirement for bounded problems.

TryParse experiment started [TryParse]

Set up 180 tests for a parser combinator library. Testing autonomous approach vs phased approach from Photon Forge.

Photon Forge deployed with mobile fixes [Photon Forge]

Fixed mirror removal for touch devices (tap-to-cycle). Cut endless mode rather than debug complex mobile timing issues. Lesson learned: know when to cut.

Jan 10, 2026
UI polish via design constraints [Photon Forge]

Discovered you cannot test "looks good" but CAN test design rules (same button widths, min touch targets, no horizontal scroll). Ralph fixed 12 UI issues in 1 iteration.

Rejection criteria validated [Photon Forge]

Added test for non-trivial levels (mirrors >= 1). Fixed generator in 1 iteration. Lesson 11 proven: generators need rejection criteria, not just acceptance.

Endless mode bugs: multiple lessons [Photon Forge]

Three iterations to fix endless mode. Lessons: flaky tests let bugs slip, React hooks need direct tests, generators can produce technically-valid-but-unacceptable output.

Solver pattern enables content validation [Photon Forge]

Built a solver to validate all 20 levels are solvable. Lesson: for content that needs to be "valid," build a validator and test against it.

Photon Forge v1: E2E tests not enough [Photon Forge]

First Ralph Loop experiment. E2E tests passed but game logic was buggy. Core lesson: E2E tests verify "it runs" not "it is correct."

Ralph Loops project started [Methodology]

Created project to systematically test autonomous AI development. Goal: document what works through experiments, not theory.

--- journey start ---