Local LLM Security Agent on Consumer Hardware
Running a security investigation agent on a 16GB consumer GPU with llama.cpp, the OpenAI-compatible API, and a small 7B model.
A security investigation agent can run on a 16GB consumer GPU with a 7B parameter model, llama.cpp, and two query tools. The tooling has matured enough that no CUDA wrangling or cloud API budget is required. This page is a walkthrough of the stack, what each piece does, and where consumer hardware stops being enough.
Why local at all
API calls to hosted LLMs work fine. For a lot of work, that's the right default. The reasons to go local are narrow but real: data you can't send out, running agents in loops where per-call cost adds up, and wanting to actually see the mechanics instead of hiding them behind a framework.
Local inference used to mean fighting CUDA drivers, VRAM limits, and model formats. llama.cpp has taken most of that away. Download a quantized GGUF, run the server, done.
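On a machine with the llama.cpp binaries installed, "run the server" really is one command. A sketch (the GGUF filename is illustrative; any quantized 7B model works):

```shell
# Serve a quantized GGUF on llama.cpp's OpenAI-compatible endpoint (port 8080).
# -ngl 99 offloads all layers to the GPU; drop it for CPU-only inference.
llama-server -m hermes-2-pro-mistral-7b.Q4_K_M.gguf --port 8080 -ngl 99
```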
The stack
Three pieces: a model, an inference runtime, and an agent loop with tools.
┌─────────────────────────────────────────────────────────────┐
│ Python orchestration (no LLM)                               │
│ - Decides what data to gather                               │
│ - Calls query tools                                         │
│ - Manages context window                                    │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Query tools (Python, fast)                                  │
│ - Return SUMMARIES, not raw data                            │
│ - Deterministic, zero variance                              │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ LLM analysis (Hermes 2 Pro 7B via llama.cpp)                │
│ - Receives pre-fetched summaries                            │
│ - Reasons about what the data means                         │
│ - Outputs structured findings                               │
└─────────────────────────────────────────────────────────────┘
The LLM never sees raw log lines. It sees the output of deterministic Python that has already filtered and aggregated. The split matters because Python is free, fast, and has no variance, while LLM tokens are expensive, slow, and probabilistic.
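A query tool in that middle layer might look like the following sketch. The field names and the `events` input are assumptions for illustration, not the article's actual implementation; the point is that the return value is a small summary dict, never raw log lines:

```python
from collections import Counter

def query_ip(ip: str, events: list[dict]) -> dict:
    """Summarize everything known about one IP.

    The LLM sees only this dict -- deterministic Python has already
    filtered and aggregated the raw events.
    """
    hits = [e for e in events if e.get("src_ip") == ip]
    return {
        "ip": ip,
        "total_events": len(hits),
        # Top alert signatures, most frequent first
        "alert_signatures": Counter(
            e["signature"] for e in hits if "signature" in e
        ).most_common(5),
        "targets": sorted({e["dest_ip"] for e in hits if "dest_ip" in e}),
        "first_seen": min((e["timestamp"] for e in hits), default=None),
        "last_seen": max((e["timestamp"] for e in hits), default=None),
    }
```

Whatever the real field names are, the shape is the same: a bounded, pre-aggregated summary that costs zero tokens to compute and is identical on every run.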
The API is dead simple
llama.cpp exposes an OpenAI-compatible endpoint. Same call format whether you're hitting a local server, Claude, or OpenAI:
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a data converter..."},
            {"role": "user", "content": "Convert this to JSON: ..."}
        ],
        "max_tokens": 500
    }
)
result = response.json()["choices"][0]["message"]["content"]
HTTP POST with a messages array. The model has no memory between requests. Every call is stateless. Conversation history means including previous messages in the array.
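Since the server is stateless, "memory" is just the client re-sending history. A minimal sketch of that pattern (the helper names are mine; the endpoint and payload shape are the same as above):

```python
import requests

API = "http://localhost:8080/v1/chat/completions"

def chat(messages, max_tokens=500):
    """One stateless call -- the server sees only what is in `messages`."""
    r = requests.post(API, json={"messages": messages, "max_tokens": max_tokens})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def with_turn(history, role, content):
    """Append one turn; re-sending this list is the only 'memory' there is."""
    return history + [{"role": role, "content": content}]

history = [{"role": "system", "content": "You are a security analyst."}]
history = with_turn(history, "user", "Summarize the alerts for 40.80.148.42.")
# reply = chat(history)                              # stateless call #1
# history = with_turn(history, "assistant", reply)   # context for call #2
```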
Once this pattern is clear, swapping between providers is trivial. Same code structure, different endpoint. The abstraction frameworks like CrewAI provide starts to feel like overhead. Direct calls make the mechanics visible.
What the agent does
The agent is given two tools: query_ip and query_domain. Starting from alerts, it pulls relevant data and produces findings in a structured format:
Finding: 40.80.148.42
- Who: IP address 40.80.148.42
- What: Triggered 589 IDS alerts including XSS, SQL injection, and information leak attempts
- When: August 10, 2016, with 28,119 total events
- Where: Targeting internal server 192.168.250.70
- Why: Repeated, diverse attack signatures indicate active exploitation attempt
- Verdict: MALICIOUS
The 5 W's framework forces the agent to answer specific questions rather than produce narrative summaries. Each field has to come from actual tool output, not inference.
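The gather-then-reason loop can be sketched as follows. The tool internals, prompt wording, and `Finding` field names are stand-ins, not the article's code; what matters is that the model only ever reasons over a pre-fetched summary:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Finding:
    """The 5 W's plus a verdict; every field must trace to tool output."""
    who: str
    what: str
    when: str
    where: str
    why: str
    verdict: str  # e.g. "MALICIOUS", "SUSPICIOUS", "BENIGN"

def investigate(entity: str, tools: dict, llm: Callable) -> Finding:
    """Deterministic gather first, LLM reasoning second."""
    summary = tools["query_ip"](entity)  # pre-fetched summary, never raw logs
    prompt = (
        "Using ONLY the summary below, answer Who/What/When/Where/Why "
        f"and give a verdict for {entity}.\n{summary}"
    )
    return llm(prompt)  # expected to parse the model's reply into a Finding
```

Forcing the output through a fixed-field structure like this is what makes "each field has to come from actual tool output" checkable at all.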
Hardware reality
On an RTX 5080 with 16GB VRAM, a 7B parameter model (Hermes 2 Pro) fits comfortably. Processing 86,000 auth.log records took under a minute for the statistical work and about 20 minutes for full agent investigation.
This does not scale to enterprise data volumes. Millions of events, real-time analysis, and multiple concurrent investigations would need more hardware or cloud GPU services like Modal. For learning, prototyping, and home-scale workloads, consumer GPUs are fine.
What I got wrong first
Assuming local inference would be a CUDA nightmare. The mental barrier was bigger than the technical one. The actual getting-started work was a few hours of reading llama.cpp docs and running a quantized model. Months of avoidance for a few hours of setup.
Also: trying to make the LLM do the heavy lifting. My first architecture fed raw logs to the model and asked it to triage. It flagged 269,000 anomalies on 86,000 records because every unique IP was "rare." The fix was pushing detection into deterministic Python and using the LLM only for reasoning over pre-filtered summaries. See Entity Profiling Over Anomaly Flagging for how that shift played out.
Tradeoffs
Consumer hardware works for home-scale or prototyping, not for enterprise throughput. A 7B model is capable enough for tool-using agent loops if the tools do the heavy lifting, but it will hallucinate if you ask it to reason across too much context. The sweet spot is narrow: small enough to fit on a laptop, smart enough to follow a structured agent loop.