January 12, 2026Ralph Loops

Ralph Loops Work Too Well (Now What?)

I tried two different approaches to test-driven AI development. Both worked. Here's what I learned about writing tests as requirements.

ralph-loopsai-developmentmethodologytdd

I set out to find where the Ralph Loop methodology breaks. Two projects, two different approaches. I haven't found the ceiling yet. Here's what I learned along the way.

The Experiment

Photon Forge was the stress test. A light-beam puzzle game with physics, color mixing, procedural generation, scoring. I ran it through 9 iterations, adding features each time to see where Claude would stumble.

v1: Basic game (57 tests)
v3: Added a solver that validates every level (124 tests)
v6: Procedural level generation with difficulty scaling (320 tests)
v8: Mobile support, UI polish (330 tests)

TryParse was the opposite approach. Minimal prompt, 180 tests written upfront, one instruction: "make all tests pass." Completed in a single iteration.

Both worked. Every time. You can try both demos on their project pages.

What I Think Is Happening: Tests Are Requirements

Looking at both projects, a pattern emerged: write the test, get the feature. The tests seemed to be the specification.

Here's what that looked like in practice. For Photon Forge's color mixing system, I wrote tests like this:

describe('color mixing - primary combinations', () => {
  test('red + blue = purple', () => {
    expect(mixColors('red', 'blue')).toBe('purple');
  });

  test('blue + red = purple (order independent)', () => {
    expect(mixColors('blue', 'red')).toBe('purple');
  });

  test('red + yellow = orange', () => {
    expect(mixColors('red', 'yellow')).toBe('orange');
  });
});

That's the entire requirement. Claude reads the test, understands what mixColors needs to do, and implements it. No ambiguity, no back-and-forth about "what should happen when..."

For more complex features, the tests get more specific. The solver needed to not just return something, but return solutions that actually work:

test('solver finds solution for Level 1', () => {
  const level = getLevelById(1)!;
  const result = solve(level);

  expect(result).not.toBeNull();
  expect(result!.solved).toBe(true);
});

test('solver solution actually wins the game', () => {
  const level = getLevelById(1)!;
  const result = solve(level)!;

  // Verify the solution works by running it through physics
  const { cellColors } = traceAllBeams(level, result.mirrors);
  const { allLit } = checkTargets(level, cellColors);

  expect(allLit).toBe(true);
});

The second test is the key - it doesn't trust that the solver "works," it proves it by running the solution through the actual game engine. That level of specificity meant Claude couldn't cut corners.

What This Suggests

If I'm right about tests being requirements, the bottleneck isn't Claude's capability. It's my ability to express what I want as tests.

If I can write a test for it, I can build it. The question shifts from "can AI code this?" to "can I specify what I actually want?"

What seemed to work for me:

Starting with tests, not prompts. The prompt was minimal ("make all tests pass") because the tests were complete.
Testing behavior, not implementation. 'red + blue = purple' is a behavior. How mixColors works internally didn't matter.
Integration tests caught shortcuts. The solver test that verified solutions actually work caught bugs that unit tests missed.
More tests meant more features. Photon Forge grew from 57 to 330 tests. Each batch of new tests added new capabilities.

Where This Gets Interesting

I found the repomirror project using a similar approach - they shipped 6 repositories overnight.

If the bottleneck is specification, not implementation, this process could work for anyone. Write the requirements as tests, run the loop, get working software.

I'm still testing this across different domains to find where it breaks. So far: nothing. Once I've validated the approach further, I plan to open-source the template so others can try it themselves.

What I'm Still Figuring Out

The hard question: how do you know what tests to write upfront? For TryParse, I knew exactly what a URL parser should do. For a game, I discovered requirements as I went.

Maybe that's fine - the methodology handles both approaches. But I suspect there's a skill here around translating vague ideas into testable requirements. That might be the thing worth getting good at.

For now, I'm still experimenting. If you want to try this approach yourself, I documented how to set up your first Ralph Loop and the fresh context per task pattern that made llama-mcp-server so efficient.