Setting Up Your First Ralph Loop: A Practical Guide
How I set up autonomous AI development with specs, task lists, and a bash loop
I built a complete npm package (19 tools, 398 tests) with zero manual intervention during execution. The setup was simpler than I expected: spec files that define what to build, a task list that tracks progress, and a bash loop that runs until everything's done. Here's how I set it up.
What You Need
Before starting, you'll need:
- Claude Code - The CLI tool (I use it via my Max subscription, but API access works too)
- A test framework - Vitest, Jest, Playwright, whatever fits your project
- A project you can break into atomic tasks - Each task needs clear success criteria
That's it. The magic isn't in special tooling. It's in how you structure the work upfront.
The Setup That Worked for Me
For llama-mcp-server, I organized the project like this:
llama-mcp-server/
├── specs/
│   ├── tools.md          # What each tool should do
│   ├── conventions.md    # Code patterns to follow
│   └── task-list.md      # Checkbox list of tasks
├── src/                  # Code goes here
├── tests/                # Tests go here
└── run-ralph.sh          # The loop script
The key insight I stumbled into: specs on disk become institutional memory. Each time Claude runs, it reads the same files. Context window limits stop mattering because every task starts from a fresh context.
Front-Loading Through Conversation
I didn't write these specs myself. I spent about 30 minutes talking with Claude about what an MCP server should do, what tools it should expose, how errors should be handled. Then Claude wrote the specs.
The task list ended up looking like this:
## Phase 1: Infrastructure
- [x] Create src/types.ts with Tool and ToolResult interfaces
- [x] Create src/config.ts with environment loading
- [x] Create src/client.ts with LlamaClient interface
- [x] Write tests/unit/config.test.ts
- [x] Write tests/unit/client.test.ts
## Phase 2: Server Tools
- [x] Implement llama_health in src/tools/server.ts
- [x] Write tests for llama_health
- [x] Implement llama_props in src/tools/server.ts
- [x] Write tests for llama_props
...
44 tasks total. Each one atomic. Each one testable.
The Loop Script
Here's the bash script that ran the whole thing:
#!/bin/bash
cd "$(dirname "$0")"

while true; do
  # Check if any tasks remain
  if ! grep -q "^\- \[ \]" specs/task-list.md; then
    echo "All tasks complete!"
    break
  fi

  echo "=== Starting new Ralph context ==="

  if claude --dangerously-skip-permissions -p "You are Ralph. Read these files:
- specs/tools.md (tool specifications)
- specs/conventions.md (code patterns)
- specs/task-list.md (your task list)

Pick the FIRST task marked [ ] (not started). Complete ONLY that task.

After completing the task:
1. Run: npm run typecheck
2. Run: npm test
3. If both pass, mark the task [x] in specs/task-list.md
4. Output: RALPH_WIGGUM_COMPLETE"; then
    echo "=== Task completed ==="
  else
    echo "=== Failed (likely rate limit), waiting 5 min ==="
    sleep 300
  fi

  sleep 2
done
Each iteration:
- Checks if any `[ ]` tasks remain
- Runs Claude with a fresh context
- Claude reads the specs, picks the first incomplete task, and does it
- Marks it `[x]` and exits
- Loop continues until all tasks are done
The --dangerously-skip-permissions flag is required. Without it, Claude hangs waiting for permission prompts you can't see.
What the Spec Files Actually Looked Like
The loop script tells Claude to read the specs, but what goes in them? Here's what I ended up with.
specs/tools.md - What to Build
This file described each tool the MCP server needed to expose. Here's a sample:
# Tool Specifications
## llama_health
Check if llama-server is running and get status.
**Parameters:** None
**Returns:**
- `status`: "ok" | "loading_model" | "error"
- `slots_idle`: number of available slots
- `slots_processing`: number of busy slots
**Error handling:**
- If server unreachable, return user-friendly message about checking if server is running
- If timeout, suggest reducing max_tokens or checking server load
## llama_complete
Generate text completion from a prompt.
**Parameters:**
- `prompt` (required): The prompt to complete
- `max_tokens` (optional, default 256): Maximum tokens to generate
- `temperature` (optional, default 0.7): Sampling temperature 0-2
- `stop` (optional): Array of stop sequences
**Returns:**
- `content`: Generated text
- `tokens_evaluated`: Number of prompt tokens
- `tokens_predicted`: Number of generated tokens
**Error handling:**
- Same connection/timeout handling as llama_health
Each tool got this treatment. The key was being specific about parameters, return values, and error handling. When I was vague ("handle errors appropriately"), I got inconsistent results. When I was specific ("return user-friendly message about checking if server is running"), I got exactly that.
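Looking ahead to the conventions below, a spec like this maps almost directly onto a zod input schema. Here's a sketch of how llama_complete's parameters might translate; this is my reconstruction, not code from the actual project:

```typescript
import { z } from 'zod';

// llama_complete parameters, transcribed from the spec above
const llamaCompleteSchema = z.object({
  prompt: z.string(),                                   // required
  max_tokens: z.number().int().positive().default(256), // optional, default 256
  temperature: z.number().min(0).max(2).default(0.7),   // optional, 0-2, default 0.7
  stop: z.array(z.string()).optional(),                 // optional stop sequences
});

export type LlamaCompleteInput = z.infer<typeof llamaCompleteSchema>;
```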
specs/conventions.md - How to Write the Code
This file established patterns that every task should follow:
# Code Conventions
## File Structure
- Tools go in `src/tools/[category].ts` (e.g., server.ts, inference.ts)
- Each tool is a factory function: `createXxxTool(client: LlamaClient): Tool`
- Tests go in `tests/unit/tools/[category].test.ts`
## Tool Pattern
Every tool follows this structure:
export function createHealthTool(client: LlamaClient): Tool {
  return {
    name: 'llama_health',
    description: 'Check if llama-server is running',
    inputSchema: z.object({}),
    handler: async (input): Promise<ToolResult> => {
      try {
        const result = await client.health();
        return {
          content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
        };
      } catch (error) {
        return {
          content: [{ type: 'text', text: formatError(error, client.baseUrl) }],
          isError: true,
        };
      }
    },
  };
}
## Error Handling
Use the `formatError` helper for all error messages:
- Connection refused → "Cannot connect to llama-server at {url}. Is it running?"
- Timeout → "Request timed out. Try reducing max_tokens or check server load."
- Other errors → Return the raw message
## Testing Pattern
Each tool needs tests for:
1. Success case with mocked response
2. Error case (connection refused)
3. Error case (timeout)
Use vi.mocked() for the client methods.
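Here's a sketch of what a test following that pattern might look like. It's my reconstruction, not the project's actual test file; the import paths and the shape of the mocked client are assumptions based on the file structure above:

```typescript
import { describe, it, expect, vi } from 'vitest';
import { createHealthTool } from '../../../src/tools/server';
import type { LlamaClient } from '../../../src/client';

describe('llama_health', () => {
  // Minimal mocked client: only the pieces the tool touches
  const client = {
    baseUrl: 'http://localhost:8080',
    health: vi.fn(),
  } as unknown as LlamaClient;

  it('returns server status on success', async () => {
    vi.mocked(client.health).mockResolvedValue({ status: 'ok', slots_idle: 1, slots_processing: 0 });
    const result = await createHealthTool(client).handler({});
    expect(result.isError).toBeUndefined();
    expect(result.content[0].text).toContain('"status": "ok"');
  });

  it('returns a friendly message when the connection is refused', async () => {
    vi.mocked(client.health).mockRejectedValue(new Error('connect ECONNREFUSED 127.0.0.1:8080'));
    const result = await createHealthTool(client).handler({});
    expect(result.isError).toBe(true);
    expect(result.content[0].text).toContain('Is it running?');
  });

  // The timeout case follows the same shape, rejecting with a timeout error instead.
});
```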
The conventions file grew during the project. Task 7 created the formatError helper, and I added it to conventions so later tasks would use it. By the end, every tool had consistent error handling because it was documented as the pattern to follow.
specs/task-list.md - The Work Breakdown
The task list needed tasks that were atomic (completable in one shot) and verifiable (tests prove it's done). Here's what good vs bad tasks looked like:
# Task List
## Good Tasks (atomic, testable)
- [ ] Create src/types.ts with Tool and ToolResult interfaces
- [ ] Implement llama_health in src/tools/server.ts
- [ ] Write tests for llama_health (success, connection error, timeout)
- [ ] Implement llama_complete in src/tools/inference.ts
## Bad Tasks (too big, unclear)
- [ ] Implement all server tools # Too big, no checkpoint
- [ ] Make the error handling better # Unclear what "better" means
- [ ] Add tests # Which tests? For what?
I organized tasks into phases so dependencies flowed downward:
## Phase 1: Infrastructure
- [x] Create src/types.ts with Tool and ToolResult interfaces
- [x] Create src/client.ts with LlamaClient class
- [x] Write tests/unit/client.test.ts
## Phase 2: Server Tools (depends on Phase 1)
- [x] Implement llama_health
- [x] Write tests for llama_health
- [x] Implement llama_props
- [x] Write tests for llama_props
## Phase 3: Inference Tools (depends on Phase 1)
- [ ] Implement llama_complete
- [ ] Write tests for llama_complete
...
How the Conversation Produced These
I didn't write these specs from scratch. The conversation went something like:
Me: "I want to build an MCP server that wraps the llama.cpp HTTP API. What tools should it expose?"
Claude: Listed the llama.cpp endpoints and suggested mapping each to an MCP tool.
Me: "What parameters does each one need? What should the error handling look like?"
Claude: Detailed each tool's parameters based on the llama.cpp docs, suggested consistent error messaging.
Me: "Can you write this up as a specs/tools.md file I can use?"
Claude: Produced the file.
Me: "Now break this into atomic tasks. Each task should be one thing that can be tested."
Claude: Produced the task list, organized into phases.
The whole conversation took about 30 minutes. The actual specs files came out of that discussion, not from me writing documentation.
Why This Worked Better Than My Earlier Attempts
I'd tried other approaches before. With Photon Forge, I ran single-context loops where Claude would iterate on the same codebase until tests passed. That worked, but I kept running into problems:
Context window limits. Projects got big enough that Claude would forget earlier decisions.
Scope creep. I'd think of new features mid-stream and inject them. The project would drift.
The fresh-context-per-task approach fixed both. Each task runs in isolation. By the time I see the output, it's already done. There's no opportunity to say "actually, also do this."
What I Learned About Writing Specs
The quality of the output depends entirely on the specs. Some things I noticed:
Task 1 Can Over-Deliver (And That's Good)
My first task said "Create src/types.ts with Tool and ToolResult interfaces." That's 2 interfaces.
Claude created 25+. It read the tool specifications, understood what types the whole project would need, and created them all upfront. Every subsequent task just imported what it needed.
I didn't plan this. It just happened. But it meant later tasks had everything they needed.
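For reference, the two interfaces the task actually asked for look roughly like this, reconstructed from the tool pattern in conventions.md above; treat the exact shapes as my guess rather than the project's real types.ts:

```typescript
import type { ZodTypeAny } from 'zod';

// What a tool returns to the MCP client
export interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
}

// A single MCP tool: name, description, input schema, and handler
export interface Tool {
  name: string;
  description: string;
  inputSchema: ZodTypeAny;
  handler: (input: unknown) => Promise<ToolResult>;
}
```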
Conventions Propagate Automatically
The conventions file told Claude how to structure error handling:
function formatError(error: unknown, baseUrl: string): string {
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('ECONNREFUSED')) {
    return `Cannot connect to llama-server at ${baseUrl}. Is it running?`;
  }
  return message;
}
Task 7 created this helper. Every subsequent task that needed error handling copied the pattern. By the end, all 19 tools had consistent error messages.
No one told Task 15 to use formatError. It just read the existing code, saw the pattern, and followed it.
Tests Need to Be Specific
Early on, I had a test that said "level is solvable" for a puzzle game. Claude generated levels that were technically solvable but trivial (the solution was 0 moves). I would have rejected them if I'd seen them, but I never wrote that rejection into the test.
What I learned: if I'd reject the output when I see it, that rejection needs to be in the test.
// Too vague
test('level is solvable', () => {
  expect(solve(level).solved).toBe(true);
});

// Better - includes what I actually meant
test('level requires at least 1 mirror to solve', () => {
  const solution = solve(level);
  expect(solution.solved).toBe(true);
  expect(solution.mirrors.length).toBeGreaterThanOrEqual(1);
});
Pitfalls I Hit
Build Tests Catch What Unit Tests Miss
All my unit tests passed. Then npm run build failed with 17 TypeScript errors. Vitest doesn't run strict TypeScript checking. The production build does.
I added a build test after that:
// execAsync is assumed to be promisify(exec) from node:util / node:child_process
test('typescript compiles without errors', async () => {
  await execAsync('npx tsc --noEmit'); // rejects (and fails the test) if tsc reports errors
}, 30000);
Success Criteria Must Be Exhaustive
Claude follows the success criteria literally. If I say "run npm test" but not "run npm run test:e2e", E2E tests won't run. I had a project where unit tests all passed but E2E failed because I forgot to include E2E in the checklist.
Rate Limits Need Handling
The original loop script had no error handling. When I hit API rate limits, Claude would exit with an error, the loop would immediately retry, hit the limit again, and spin. Adding a 5-minute sleep on failure fixed it.
The Results
For llama-mcp-server:
| Metric | Value |
|---|---|
| Tasks | 44 |
| Tests | 398 |
| Runtime | ~2 hours |
| Manual intervention | 0 |
The code quality was better than I expected. Consistent error handling, proper TypeScript types, comprehensive test coverage. I didn't write any of it during execution.
Getting Started
If you want to try this, I'd suggest:
- **Start with a conversation.** Talk through what you want to build until you understand it well enough to break it into tasks.
- **Let Claude write the specs.** The conversation produces the task list, conventions, and requirements. You're not writing documentation, you're having a discussion that gets captured.
- **Keep tasks atomic.** Each task should be completable in one context window and verifiable with tests.
- **Run the loop and walk away.** The point is autonomous execution. If you're tempted to intervene, add it to the task list instead.
I've written more about the llama-mcp-server project specifically in Building an MCP Server in 2 Hours if you want the full story.
What I'm Still Figuring Out
This worked well for a project where I knew the full scope upfront. I'm less sure how it handles:
- Projects where requirements emerge as you build
- Large-scale refactoring across many files
- Tasks that depend on judgment calls rather than test results
For now, I'm using this approach for anything I can spec out in advance. It's the most efficient workflow I've found so far.