Building an MCP Server in 2 Hours with 44 Autonomous AI Tasks
How fresh context windows per task changed my AI-assisted development workflow
30 minutes of conversation. 44 autonomous tasks. 398 tests. Zero manual intervention. I built a complete MCP server package (now on npm as llama-mcp-server) by front-loading requirements through conversation, then letting Claude execute with a fresh context window per task. This was by far my most efficient project, and I think I finally understand why.
The Old Way Wasn't Working
I've been using AI to help build software for a while now. My previous workflow looked like this: start with an idea, begin building, stay engaged throughout to help with integration logic, course-correct as issues arose. It worked, but I kept running into the same problems.
Scope creep. I'm bad about this. As one feature gets implemented, I think about possibilities. "What if we also added..." leads to a refactor, which leads to more refactoring. The project grows in directions I didn't plan.
Context window limits. Projects got big enough that they couldn't fit in a single context window. Claude would forget earlier decisions. I'd end up managing multiple documents just to track dependencies and keep things consistent.
Building before defining "done." I'm not a developer by trade. I'd start building before I really understood what the output should look like. Then I'd learn the hard way that my mental model was wrong.
I tried to be disciplined about planning, but it's hard to resist the excitement of seeing code appear. The "plan before you code" advice is easy to ignore when you can just... start coding.
The New Pattern: Fresh Context Per Task
The llama-mcp-server project used a different approach. Instead of one long session where I stayed engaged throughout, I:
- Spent 30 minutes defining what the tool should do (conversation with Claude)
- Had Claude write the specs, conventions, and task list
- Ran a loop that gave each task its own context window
- Walked away
Here's the loop script ("Ralph" is just the name I gave each fresh Claude instance, which is why the completion marker reads the way it does):
#!/bin/bash
cd "$(dirname "$0")"
while true; do
# Check if any tasks remain
if ! grep -q "^\- \[ \]" specs/task-list.md; then
echo "All tasks complete!"
break
fi
echo "=== Starting new Ralph context ==="
if claude --dangerously-skip-permissions -p "You are Ralph. Read these files:
- specs/tools.md (tool specifications)
- specs/conventions.md (code patterns)
- specs/task-list.md (your task list)
Pick the FIRST task marked [ ] (not started). Complete ONLY that task.
After completing the task:
1. Run: npm run typecheck
2. Run: npm test
3. If both pass, mark the task [x] in specs/task-list.md
4. Output: RALPH_WIGGUM_COMPLETE"; then
echo "=== Ralph context ended successfully ==="
else
echo "=== Claude failed (likely rate limit) ==="
echo "Waiting 5 minutes before retry..."
sleep 300
fi
sleep 2
done
Each iteration:
- Starts with a fresh context (no memory of previous tasks)
- Reads the specs from disk (that's the institutional memory)
- Picks the first incomplete task
- Implements it, runs tests
- Marks it complete and exits
The key insight: context resets, but the specs persist. Each new Claude instance reads the same conventions, follows the same patterns, builds on what previous instances wrote to disk.
Front-Loading for Non-Developers
Here's what "front-loading" actually meant for me: a 30-minute conversation.
I didn't write the specs myself. I talked with Claude about what an MCP server should do, what tools it should expose, how it should handle errors. Claude recommended the MCP server approach for bridging to llama.cpp. I asked questions until I understood what the end user would experience.
Then Claude wrote:
- specs/tools.md - What each of the 19 tools should do
- specs/conventions.md - Code patterns to follow
- specs/task-list.md - 44 tasks across 8 phases
The task list looked like this:
## Phase 1: Infrastructure
- [x] Create src/types.ts with Tool and ToolResult interfaces
- [x] Create src/config.ts with environment loading
- [x] Create src/client.ts with LlamaClient interface
- [x] Write tests/unit/config.test.ts
- [x] Write tests/unit/client.test.ts
- [x] Write tests/unit/build.test.ts
## Phase 2: Server Tools
- [x] Implement llama_health in src/tools/server.ts
- [x] Write tests for llama_health
- [x] Implement llama_props in src/tools/server.ts
- [x] Write tests for llama_props
...
Each task is atomic. Each task is testable. Each task has clear success criteria: typecheck passes, tests pass.
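To make that concrete, here's roughly what a Phase 1 task like "Create src/config.ts with environment loading" might produce. This is my own minimal sketch, not the published package's code, and the environment variable names are illustrative:
// A hypothetical src/config.ts. The variable names (LLAMA_BASE_URL, LLAMA_TIMEOUT_MS)
// are guesses for illustration, not the package's actual configuration keys.
export interface Config {
  baseUrl: string;
  timeout: number;
}

export function loadConfig(env: Record<string, string | undefined> = process.env): Config {
  const baseUrl = env.LLAMA_BASE_URL ?? 'http://localhost:8080';
  const timeout = Number(env.LLAMA_TIMEOUT_MS ?? 30000);
  if (!Number.isFinite(timeout) || timeout <= 0) {
    throw new Error(`Invalid LLAMA_TIMEOUT_MS: ${env.LLAMA_TIMEOUT_MS}`);
  }
  return { baseUrl, timeout };
}
A task this small has an unambiguous definition of done: the typecheck passes, and a companion test file (tests/unit/config.test.ts, a couple of tasks later in the same phase) passes.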
The Happy Accident
Task 1 said: "Create src/types.ts with Tool and ToolResult interfaces." That's 2 interfaces.
Here's what Claude actually created:
// 25+ types, not 2
import { z } from 'zod';
export interface ToolResult {
content: Array<{ type: 'text'; text: string }>;
isError?: boolean;
}
export interface Tool {
name: string;
description: string;
inputSchema: z.ZodType;
handler: (input: unknown) => Promise<ToolResult>;
}
export interface ChatMessage {
role: 'system' | 'user' | 'assistant';
content: string;
}
export interface HealthResponse {
status: 'ok' | 'loading_model' | 'error';
slots_idle: number;
slots_processing: number;
}
export interface CompletionOptions {
max_tokens?: number;
temperature?: number;
top_p?: number;
top_k?: number;
stop?: string[];
seed?: number;
}
// ... 20 more interfaces
Task 1 over-delivered. It read the tool specifications, understood what types the whole project would need, and created them all upfront. Every subsequent task just imported what it needed.
This wasn't planned. It was Claude (Opus 4.5) "showing off" a bit. But it worked out because the types were correct and comprehensive.
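In practice, that meant a later tool file could just pull what it needed from the shared module. Something along these lines at the top of a tool file (the paths follow the task list's layout; treat this as a sketch rather than the package's exact imports):
// Illustrative imports for a tool file such as src/tools/server.ts.
import { z } from 'zod';
import type { Tool, ToolResult, HealthResponse } from '../types';
import type { LlamaClient } from '../client';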
Why This Prevents Scope Creep
The old workflow let me inject new ideas mid-stream. "While we're here, let's also add..." was always an option.
The new workflow doesn't have a "while we're here." Each task runs in isolation. By the time I see the output, the task is already complete and marked done. There's no opportunity to say "actually, also do this."
If I want to add a feature, I have to:
- Add it to the task list
- Wait for the loop to pick it up
- Watch it get implemented in isolation
That friction is the feature. It forces me to think about whether the new feature is worth adding before I add it. Usually, by the time I've thought about it, I realize it's scope creep and I don't add it.
The other scope creep killer: the specs are written before building starts. The 30-minute conversation produced a complete specification. Once the loop started running, the spec was the contract. I couldn't move the goalposts because the goalposts were in a file that Ralph was reading.
The Evidence
Final Stats
| Metric | Value |
|---|---|
| Tasks | 44 |
| Tests | 398 |
| Runtime | ~2 hours (including one rate limit pause) |
| Manual intervention | 0 |
| Lines of code | ~3000 |
Convention Inheritance
The conventions file told Ralph how to structure code:
// From specs/conventions.md - Error handling pattern
function formatError(message: string, baseUrl: string): string {
if (message.includes('ECONNREFUSED') || message.includes('fetch failed')) {
return `Cannot connect to llama-server at ${baseUrl}. Is it running?`;
}
if (message.includes('abort') || message.includes('timeout')) {
return `Request timed out. Try reducing max_tokens or check server load.`;
}
return message;
}
Task 7 (implementing llama_health) created this helper. Every subsequent task that needed error handling copied the pattern. By the end, all 19 tools had consistent, user-friendly error messages.
No one told Task 15 to use formatError. It just read the existing code, saw the pattern, and followed it.
Generated Code Quality
Here's what the autonomous loop produced for llama_health:
export function createHealthTool(client: LlamaClient): Tool {
return {
name: 'llama_health',
description: 'Check if llama-server is running and get status',
inputSchema: z.object({}),
handler: async (): Promise<ToolResult> => {
try {
const health = await client.health();
return {
content: [{ type: 'text', text: JSON.stringify(health, null, 2) }],
};
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
return {
content: [{ type: 'text', text: `Error: ${formatError(message, client.baseUrl)}` }],
isError: true,
};
}
},
};
}
Clean. Follows the conventions. Handles errors without throwing. This is production code, and I didn't write any of it.
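The Tool shape also makes the server side mechanical: every tool is just a name, a zod schema, and a handler. As a rough sketch of how such tools could be dispatched (my illustration, not the package's actual server code):
import type { Tool, ToolResult } from './types'; // path assumed for illustration

// Look up a tool by name, validate the arguments against its zod schema,
// then run the handler (which, per the conventions, returns errors as
// ToolResult values rather than throwing).
export async function callTool(
  tools: Map<string, Tool>,
  name: string,
  args: unknown
): Promise<ToolResult> {
  const tool = tools.get(name);
  if (!tool) {
    return {
      content: [{ type: 'text', text: `Error: unknown tool "${name}"` }],
      isError: true,
    };
  }
  const parsed = tool.inputSchema.safeParse(args);
  if (!parsed.success) {
    return {
      content: [{ type: 'text', text: `Error: invalid input for ${name}: ${parsed.error.message}` }],
      isError: true,
    };
  }
  return tool.handler(parsed.data);
}
Keeping validation at the dispatch boundary is what lets every handler treat its input as already well-formed.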
Test Coverage
Each tool got its own tests with mocked dependencies:
// Imports shown for completeness; the relative paths are my assumption based on the task list.
import { describe, it, expect, beforeEach, vi } from 'vitest';
import { createHealthTool } from '../../src/tools/server';
import type { LlamaClient } from '../../src/client';

describe('createHealthTool', () => {
let mockClient: LlamaClient;
beforeEach(() => {
mockClient = {
baseUrl: 'http://localhost:8080',
timeout: 30000,
health: vi.fn(),
// ... other mocked methods
} as unknown as LlamaClient;
});
it('returns health status when server responds with ok', async () => {
vi.mocked(mockClient.health).mockResolvedValue({
status: 'ok',
slots_idle: 2,
slots_processing: 0,
});
const tool = createHealthTool(mockClient);
const result = await tool.handler({});
expect(result.isError).toBeUndefined();
expect(result.content[0].text).toContain('ok');
});
it('returns error when server unreachable', async () => {
vi.mocked(mockClient.health).mockRejectedValue(
new Error('fetch failed')
);
const tool = createHealthTool(mockClient);
const result = await tool.handler({});
expect(result.isError).toBe(true);
expect(result.content[0].text).toContain('Cannot connect');
});
});
398 tests like this, all generated autonomously, all passing.
What I'd Tell Past Me
If I could go back to when I started experimenting with AI-assisted development, I'd say:
Use fresh context per task from the start. Don't try to keep one long conversation going. Let each task be its own isolated execution. The specs on disk are your institutional memory.
Front-load through conversation, not documentation. You don't have to write specs yourself. Have a conversation about what you want to build until you understand it, then let Claude write the specs. The conversation is the work; the implementation is execution.
The method makes planning mandatory. This is the real benefit for someone like me who's prone to scope creep. You can't skip to the fun part because the specs ARE the work. Once the loop starts, you're committed to what you defined.
I've now done four projects with variations of this approach. This one, with fresh context per task, was by far the most efficient. I didn't have to manage context. I didn't have to remind Claude of earlier decisions. I didn't have to fight the urge to add features mid-stream.
The llama-mcp-server is now published on npm. 19 tools, 398 tests, comprehensive error handling. Built in 2 hours of loop time plus 30 minutes of conversation.
I think this might be how I build things from now on.