Building an MCP Server in 2 Hours with 44 Autonomous AI Tasks
How fresh context windows per task changed my AI-assisted development workflow
30 minutes of conversation. 44 autonomous tasks. 398 tests. Zero manual intervention. I built a complete MCP server package (now on npm as llama-mcp-server) by front-loading requirements through conversation, then letting Claude execute with a fresh context window per task. This was by far my most efficient project, and I think I finally understand why.
The Old Way Wasn't Working
I've been using AI to help build software for a while now. My previous workflow looked like this: start with an idea, begin building, stay engaged throughout to help with integration logic, course-correct as issues arose. It worked, but I kept running into the same problems.
Scope creep. I'm bad about this. As one feature gets implemented, I think about possibilities. "What if we also added..." leads to a refactor, which leads to more refactoring. The project grows in directions I didn't plan.
Context window limits. Projects got big enough that they couldn't fit in a single context window. Claude would forget earlier decisions. I'd end up managing multiple documents just to track dependencies and keep things consistent.
Building before defining "done." I'm not a developer by trade. I'd start building before I really understood what the output should look like. Then I'd learn the hard way that my mental model was wrong.
I tried to be disciplined about planning, but it's hard to resist the excitement of seeing code appear. The "plan before you code" advice is easy to ignore when you can just... start coding.
The New Pattern: Fresh Context Per Task
The llama-mcp-server project used a different approach. Instead of one long session where I stayed engaged throughout, I:
- Spent 30 minutes defining what the tool should do (conversation with Claude)
- Had Claude write the specs, conventions, and task list
- Ran a loop that gave each task its own context window
- Walked away
Here's the loop script ("Ralph" is just the name I gave each fresh Claude instance, which is why the completion marker reads the way it does):
#!/bin/bash
cd "$(dirname "$0")"
while true; do
# Check if any tasks remain
if ! grep -q "^\- \[ \]" specs/task-list.md; then
echo "All tasks complete!"
break
fi
echo "=== Starting new Ralph context ==="
if claude --dangerously-skip-permissions -p "You are Ralph. Read these files:
- specs/tools.md (tool specifications)
- specs/conventions.md (code patterns)
- specs/task-list.md (your task list)
Pick the FIRST task marked [ ] (not started). Complete ONLY that task.
After completing the task:
1. Run: npm run typecheck
2. Run: npm test
3. If both pass, mark the task [x] in specs/task-list.md
4. Output: RALPH_WIGGUM_COMPLETE"; then
echo "=== Ralph context ended successfully ==="
else
echo "=== Claude failed (likely rate limit) ==="
echo "Waiting 5 minutes before retry..."
sleep 300
fi
sleep 2
done
Each iteration:
- Starts with a fresh context (no memory of previous tasks)
- Reads the specs from disk (that's the institutional memory)
- Picks the first incomplete task
- Implements it, runs tests
- Marks it complete and exits
The key insight: context resets, but the specs persist. Each new Claude instance reads the same conventions, follows the same patterns, builds on what previous instances wrote to disk.
Front-Loading for Non-Developers
Here's what "front-loading" actually meant for me: a 30-minute conversation.
I didn't write the specs myself. I talked with Claude about what an MCP server should do, what tools it should expose, how it should handle errors. Claude recommended the MCP server approach for bridging to llama.cpp. I asked questions until I understood what the end user would experience.
Then Claude wrote:
- specs/tools.md - What each of the 19 tools should do
- specs/conventions.md - Code patterns to follow
- specs/task-list.md - 44 tasks across 8 phases
The task list looked like this:
## Phase 1: Infrastructure
- [x] Create src/types.ts with Tool and ToolResult interfaces
- [x] Create src/config.ts with environment loading
- [x] Create src/client.ts with LlamaClient interface
- [x] Write tests/unit/config.test.ts
- [x] Write tests/unit/client.test.ts
- [x] Write tests/unit/build.test.ts
## Phase 2: Server Tools
- [x] Implement llama_health in src/tools/server.ts
- [x] Write tests for llama_health
- [x] Implement llama_props in src/tools/server.ts
- [x] Write tests for llama_props
...
Each task is atomic. Each task is testable. Each task has clear success criteria: typecheck passes, tests pass.
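To make that concrete, here's roughly what a Phase 1 task like "Create src/config.ts with environment loading" might produce. This is my own minimal sketch, not the published package's code, and the environment variable names are illustrative:
// A hypothetical src/config.ts. The variable names (LLAMA_BASE_URL, LLAMA_TIMEOUT_MS)
// are guesses for illustration, not the package's actual configuration keys.
export interface Config {
  baseUrl: string;
  timeout: number;
}

export function loadConfig(env: Record<string, string | undefined> = process.env): Config {
  const baseUrl = env.LLAMA_BASE_URL ?? 'http://localhost:8080';
  const timeout = Number(env.LLAMA_TIMEOUT_MS ?? 30000);
  if (!Number.isFinite(timeout) || timeout <= 0) {
    throw new Error(`Invalid LLAMA_TIMEOUT_MS: ${env.LLAMA_TIMEOUT_MS}`);
  }
  return { baseUrl, timeout };
}
A task this small has an unambiguous definition of done: the typecheck passes, and a companion test file (tests/unit/config.test.ts, a couple of tasks later in the same phase) passes.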
The Happy Accident
Task 1 said: "Create src/types.ts with Tool and ToolResult interfaces." That's 2 interfaces.
Here's what Claude actually created:
// 25+ types, not 2
import { z } from 'zod';
export interface ToolResult {
content: Array<{ type: 'text'; text: string }>;
isError?: boolean;
}
export interface Tool {
name: string;
description: string;
inputSchema: z.ZodType;
handler: (input: unknown) => Promise<ToolResult>;
}
export interface ChatMessage {
role: 'system' | 'user' | 'assistant';
content: string;
}
export interface HealthResponse {
status: 'ok' | 'loading_model' | 'error';
slots_idle: number;
slots_processing: number;
}
export interface CompletionOptions {
max_tokens?: number;
temperature?: number;
top_p?: number;
top_k?: number;
stop?: string[];
seed?: number;
}
// ... 20 more interfaces
Task 1 over-delivered. It read the tool specifications, understood what types the whole project would need, and created them all upfront. Every subsequent task just imported what it needed.
This wasn't planned. It was Claude (Opus 4.5) "showing off" a bit. But it worked out because the types were correct and comprehensive.
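In practice, that meant a later tool file could just pull what it needed from the shared module. Something along these lines at the top of a tool file (the paths follow the task list's layout; treat this as a sketch rather than the package's exact imports):
// Illustrative imports for a tool file such as src/tools/server.ts.
import { z } from 'zod';
import type { Tool, ToolResult, HealthResponse } from '../types';
import type { LlamaClient } from '../client';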
Why This Prevents Scope Creep
The old workflow let me inject new ideas mid-stream. "While we're here, let's also add..." was always an option.
The new workflow doesn't have a "while we're here." Each task runs in isolation. By the time I see the output, the task is already complete and marked done. There's no opportunity to say "actually, also do this."
If I want to add a feature, I have to:
- Add it to the task list
- Wait for the loop to pick it up
- Watch it get implemented in isolation
That friction is the feature. It forces me to think about whether the new feature is worth adding before I add it. Usually, by the time I've thought about it, I realize it's scope creep and I don't add it.
The other scope creep killer: the specs are written before building starts. The 30-minute conversation produced a complete specification. Once the loop started running, the spec was the contract. I couldn't move the goalposts because the goalposts were in a file that Ralph was reading.
The Evidence
Final Stats
| Metric | Value |
|---|---|
| Tasks | 44 |
| Tests | 398 |
| Runtime | ~2 hours (including one rate limit pause) |
| Manual intervention | 0 |
| Lines of code | ~3000 |
Convention Inheritance
The conventions file told Ralph how to structure code:
// From specs/conventions.md - Error handling pattern
function formatError(message: string, baseUrl: string): string {
if (message.includes('ECONNREFUSED') || message.includes('fetch failed')) {
return `Cannot connect to llama-server at ${baseUrl}. Is it running?`;
}
if (message.includes('abort') || message.includes('timeout')) {
return `Request timed out. Try reducing max_tokens or check server load.`;
}
return message;
}
Task 7 (implementing llama_health) created this helper. Every subsequent task that needed error handling copied the pattern. By the end, all 19 tools had consistent, user-friendly error messages.
No one told Task 15 to use formatError. It just read the existing code, saw the pattern, and followed it.
Generated Code Quality
Here's what the autonomous loop produced for llama_health:
export function createHealthTool(client: LlamaClient): Tool {
return {
name: 'llama_health',
description: 'Check if llama-server is running and get status',
inputSchema: z.object({}),
handler: async (): Promise<ToolResult> => {
try {
const health = await client.health();
return {
content: [{ type: 'text', text: JSON.stringify(health, null, 2) }],
};
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
return {
content: [{ type: 'text', text: `Error: ${formatError(message, client.baseUrl)}` }],
isError: true,
};
}
},
};
}
Clean. Follows the conventions. Handles errors without throwing. This is production code, and I didn't write any of it.
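The Tool shape also makes the server side mechanical: every tool is just a name, a zod schema, and a handler. As a rough sketch of how such tools could be dispatched (my illustration, not the package's actual server code):
import type { Tool, ToolResult } from './types'; // path assumed for illustration

// Look up a tool by name, validate the arguments against its zod schema,
// then run the handler (which, per the conventions, returns errors as
// ToolResult values rather than throwing).
export async function callTool(
  tools: Map<string, Tool>,
  name: string,
  args: unknown
): Promise<ToolResult> {
  const tool = tools.get(name);
  if (!tool) {
    return {
      content: [{ type: 'text', text: `Error: unknown tool "${name}"` }],
      isError: true,
    };
  }
  const parsed = tool.inputSchema.safeParse(args);
  if (!parsed.success) {
    return {
      content: [{ type: 'text', text: `Error: invalid input for ${name}: ${parsed.error.message}` }],
      isError: true,
    };
  }
  return tool.handler(parsed.data);
}
Keeping validation at the dispatch boundary is what lets every handler treat its input as already well-formed.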
Test Coverage
Each tool got its own tests with mocked dependencies:
// Imports shown for completeness; the relative paths are my assumption based on the task list.
import { describe, it, expect, beforeEach, vi } from 'vitest';
import { createHealthTool } from '../../src/tools/server';
import type { LlamaClient } from '../../src/client';

describe('createHealthTool', () => {
let mockClient: LlamaClient;
beforeEach(() => {
mockClient = {
baseUrl: 'http://localhost:8080',
timeout: 30000,
health: vi.fn(),
// ... other mocked methods
} as unknown as LlamaClient;
});
it('returns health status when server responds with ok', async () => {
vi.mocked(mockClient.health).mockResolvedValue({
status: 'ok',
slots_idle: 2,
slots_processing: 0,
});
const tool = createHealthTool(mockClient);
const result = await tool.handler({});
expect(result.isError).toBeUndefined();
expect(result.content[0].text).toContain('ok');
});
it('returns error when server unreachable', async () => {
vi.mocked(mockClient.health).mockRejectedValue(
new Error('fetch failed')
);
const tool = createHealthTool(mockClient);
const result = await tool.handler({});
expect(result.isError).toBe(true);
expect(result.content[0].text).toContain('Cannot connect');
});
});
398 tests like this, all generated autonomously, all passing.
What I'd Tell Past Me
If I could go back to when I started experimenting with AI-assisted development, I'd say:
Use fresh context per task from the start. Don't try to keep one long conversation going. Let each task be its own isolated execution. The specs on disk are your institutional memory.
Front-load through conversation, not documentation. You don't have to write specs yourself. Have a conversation about what you want to build until you understand it, then let Claude write the specs. The conversation is the work; the implementation is execution.
The method makes planning mandatory. This is the real benefit for someone like me who's prone to scope creep. You can't skip to the fun part because the specs ARE the work. Once the loop starts, you're committed to what you defined.
I've now done four projects with variations of this approach. This one, with fresh context per task, was by far the most efficient. I didn't have to manage context. I didn't have to remind Claude of earlier decisions. I didn't have to fight the urge to add features mid-stream.
The llama-mcp-server is now published on npm. 19 tools, 398 tests, comprehensive error handling. Built in 2 hours of loop time plus 30 minutes of conversation.
I think this might be how I build things from now on.