Building a Local LLM Security Agent on Consumer Hardware
I avoided local AI for months. Work forced my hand, and I had a security agent running in a week.
I avoided local AI for months, assuming it would mean VRAM crashes and hours of configuration. Then work forced my hand. Turns out llama.cpp just works now. A week later I had a security investigation agent running on my laptop. Here's what made it finally click.
Why I Avoided It
I'd tried local image generation before. The pattern was always the same: download a model that was "optimized for consumer GPUs," watch it eat all my VRAM, crash, tweak settings, repeat. Eventually something would work, but I never understood why. Just lucky configuration.
I assumed local LLMs would be the same. Probably worse, given how much larger language models are. So I stuck with API calls to Claude and OpenAI. Worked fine. No reason to change.
What Made It Click
At work, I needed to set up local inference for a project. No choice but to figure it out.
The first surprise: llama.cpp just worked. Download a quantized model, run the server, done. No CUDA hell, no driver mismatches. The tooling has matured.
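If "run the server" sounds like it hides a mountain of setup, it doesn't. The whole thing is roughly one command (the binary name and flags vary by llama.cpp version, and the model filename here is just an example of a quantized GGUF):

$ llama-server -m Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf -ngl 99 -c 8192 --port 8080

Here -ngl offloads layers to the GPU and -c sets the context window. That's the entire "installation."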
The second surprise: the API is dead simple. llama.cpp exposes an OpenAI-compatible endpoint. Same format whether you're hitting a local server, Claude, or OpenAI:
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a data converter..."},
            {"role": "user", "content": "Convert this to JSON: ..."},
        ],
        "max_tokens": 500,
    },
)
result = response.json()["choices"][0]["message"]["content"]
That's it. HTTP POST with a messages array. The model has no memory between requests. Every call is stateless. Want conversation history? Include previous messages in the array.
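A minimal sketch of what that looks like against the same local endpoint (the ask() helper and the example questions are mine, for illustration only):

import requests

URL = "http://localhost:8080/v1/chat/completions"

# The server keeps no state, so the client owns the history.
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text):
    messages.append({"role": "user", "content": user_text})
    resp = requests.post(URL, json={"messages": messages, "max_tokens": 500})
    reply = resp.json()["choices"][0]["message"]["content"]
    # Append the model's reply so the next call sees the full conversation.
    messages.append({"role": "assistant", "content": reply})
    return reply

ask("What fields does a syslog line usually contain?")
ask("Which of those would you map to JSON keys?")  # sees the first exchange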
Once I understood this pattern, swapping between providers became trivial. Same code structure, different endpoint. The abstraction that frameworks like CrewAI provide started to feel like overhead. I wanted to see the actual mechanics.
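To make that concrete, here is roughly what the swap looks like. The only real differences are the base URL, an auth header, and a model name for hosted providers; the chat() helper is an assumption for illustration, not code from the project:

import os
import requests

def chat(messages, base_url, api_key=None, model=None):
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    payload = {"messages": messages, "max_tokens": 500}
    if model:                        # hosted providers require a model name
        payload["model"] = model
    resp = requests.post(f"{base_url}/chat/completions", json=payload, headers=headers)
    return resp.json()["choices"][0]["message"]["content"]

msgs = [{"role": "user", "content": "Say hello."}]

# Local llama.cpp server
chat(msgs, "http://localhost:8080/v1")

# OpenAI, same structure, different endpoint and credentials
chat(msgs, "https://api.openai.com/v1",
     api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o-mini")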
The First Tool
To test the setup, I built a format converter. Feed it messy data (CSV, logs, key-value pairs), get structured JSON back.
$ python3 format_converter.py syslog.txt --pretty
Input (firewall logs):
<134>Feb 12 14:22:01 firewall01 kernel: DROP IN=eth0 OUT= SRC=10.0.0.50
DST=192.168.1.1 PROTO=TCP SPT=44231 DPT=22
Output:
{
  "timestamp": "Feb 12 14:22:01",
  "source": "firewall01",
  "action": "DROP",
  "interface_in": "eth0",
  "source_ip": "10.0.0.50",
  "destination_ip": "192.168.1.1",
  "protocol": "TCP",
  "source_port": 44231,
  "destination_port": 22
}
I didn't tell it those were iptables logs. I didn't provide field name examples. It figured out sensible names on its own. Ports came back as integers, not strings.
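The converter itself is little more than a system prompt wrapped around the same endpoint. A simplified sketch of the idea, not the actual script (the prompt wording and the strict-JSON handling are my assumptions):

import json
import sys
import requests

SYSTEM = ("You are a data converter. Convert the user's input into a single "
          "JSON object or array. Use sensible field names and native types. "
          "Respond with JSON only, no explanation.")

def convert(raw_text):
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": raw_text},
            ],
            "max_tokens": 1000,
        },
    )
    content = resp.json()["choices"][0]["message"]["content"]
    return json.loads(content)  # fails loudly if the model strays from pure JSON

if __name__ == "__main__":
    print(json.dumps(convert(open(sys.argv[1]).read()), indent=2))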
From nothing to working tool in a few hours. The barrier I'd imagined wasn't there.
From Tool to Agent
The format converter was a single tool. I wanted a pipeline: detect format, parse, validate, clean, analyze. Each stage feeding the next.
The interesting evolution happened in the analysis stage. My first approach was message-centric: flag anomalies, have the LLM triage them. On 86,000 auth.log records, it flagged 269,000 anomalies. Every unique IP got flagged because that specific IP was "rare." Technically correct. Completely useless.
I pivoted to entity profiling. Instead of "is this message rare?" I asked "what is this IP doing overall?" That worked better, but the profilers were purpose-built: I had an IP profiler and a username profiler because that's what auth.log needed. Adding new data sources meant building new profilers. (I wrote more about why I stopped flagging anomalies and started profiling entities.)
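The shift is easier to see in code. A hypothetical IP profiler boils down to aggregation (the record fields here are invented for illustration):

from collections import defaultdict

def profile_ips(records):
    """Aggregate parsed auth.log records into one profile per source IP."""
    profiles = defaultdict(lambda: {"events": 0, "failed_logins": 0, "usernames": set()})
    for rec in records:
        p = profiles[rec["source_ip"]]
        p["events"] += 1
        if rec.get("action") == "failed_password":
            p["failed_logins"] += 1
        if rec.get("username"):
            p["usernames"].add(rec["username"])
    return profiles

# A profile answers "what is this IP doing overall?" -- 4,000 failed logins
# across 200 usernames reads very differently from one statistically rare event.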
The final architecture: give the agent tools to query what it needs.
┌─────────────────────────────────────────────────────────────┐
│ Python Orchestration (no LLM)                                │
│ - Decides what data to gather                                │
│ - Calls query tools                                          │
│ - Manages context window                                     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Query Tools (Python, fast)                                   │
│ - Returns SUMMARIES, not raw data                            │
│ - Deterministic, zero variance                               │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ LLM Analysis (Hermes 2 Pro 7B)                               │
│ - Receives pre-fetched summaries                             │
│ - Reasons about what the data means                          │
│ - Outputs structured findings                                │
└─────────────────────────────────────────────────────────────┘
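A sketch of what the middle and bottom layers might look like. The query_ip name, the summary fields, and the prompt are placeholders, but the point stands: the LLM never sees raw events, only a compact summary produced by deterministic Python.

import json
import requests

def query_ip(ip, events):
    """Deterministic Python: reduce thousands of raw events to a small summary."""
    hits = [e for e in events if e["source_ip"] == ip]
    return {
        "ip": ip,
        "total_events": len(hits),
        "alert_signatures": sorted({e["signature"] for e in hits if e.get("signature")}),
        "targets": sorted({e["destination_ip"] for e in hits}),
        "first_seen": min(e["timestamp"] for e in hits),
        "last_seen": max(e["timestamp"] for e in hits),
    }

def analyze(summary):
    """LLM: reason about the summary and return a structured finding."""
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": "You are a security analyst. "
                 "Given a summary of an entity's activity, return a finding as JSON "
                 "with keys who, what, when, where, why, verdict."},
                {"role": "user", "content": json.dumps(summary)},
            ],
            "max_tokens": 800,
        },
    )
    return resp.json()["choices"][0]["message"]["content"]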
Two tools: query IP, query domain. The agent starts from alerts, pulls relevant data, and produces findings:
Finding: 40.80.148.42
- Who: IP address 40.80.148.42
- What: Triggered 589 IDS alerts including XSS, SQL injection, and information leak attempts
- When: August 10, 2016, with 28,119 total events
- Where: Targeting internal server 192.168.250.70
- Why: Repeated, diverse attack signatures indicate active exploitation attempt
- Verdict: MALICIOUS
It found the attacker. It also found the ransomware infrastructure. From a format converter to a security investigation agent in about a week. I documented the full investigation results in a follow-up post.
Hardware Reality
Everything runs on an RTX 5080 with 16GB VRAM. A 7B parameter model (Hermes 2 Pro) fits comfortably. No cloud APIs for the analysis work.
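The back-of-envelope math explains why it fits, assuming a roughly 4-bit quantization:

# Rough weight-memory estimate for a 4-bit quantized 7B model
# (ignores KV cache and runtime overhead, which add a few more GB)
params = 7e9
bytes_per_param = 0.5                     # ~4 bits per weight
print(params * bytes_per_param / 2**30)   # ≈ 3.3 GiB of the 16 GB available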
The honest take on consumer hardware: it depends on scale.
For a home network security solution, my laptop is more than enough. Personal projects, learning, experimentation - consumer GPUs are viable. I processed 86,000 auth.log records in under a minute for the statistical work, and in about 20 minutes for the full agent investigation.
This wouldn't scale to enterprise data volumes. Millions of events, real-time analysis, multiple concurrent investigations - you'd need more hardware or cloud GPU services like Modal.
But for learning? For building something that works? Consumer hardware is fine.
What I'd Tell Someone Still Avoiding It
Find a project or walkthrough and just do it.
A lot of really smart people have created the tooling required to make this simple. llama.cpp handles the inference. Quantized models fit on consumer GPUs. The OpenAI-compatible API means you don't need to learn new patterns.
For me, the barrier was mental, not technical. I spent months assuming this would be hard. Work forced me to try it. I had something working in hours.