January 2, 2026 · AI Data Analyst

Building a Local LLM Security Agent on Consumer Hardware

I avoided local AI for months. Work forced my hand, and I had a security agent running in a week.

local-llm · llama.cpp · security · hardware · tutorial

I avoided local AI for months, assuming it would mean VRAM crashes and hours of configuration. Then work forced my hand. Turns out llama.cpp just works now. A week later I had a security investigation agent running on my laptop. Here's what made it finally click.

Why I Avoided It

I'd tried local image generation before. The pattern was always the same: download a model that was "optimized for consumer GPUs," watch it eat all my VRAM, crash, tweak settings, repeat. Eventually something would work, but I never understood why. Just lucky configuration.

I assumed local LLMs would be the same. Probably worse, given how much larger language models are. So I stuck with API calls to Claude and OpenAI. Worked fine. No reason to change.

What Made It Click

At work, I needed to set up local inference for a project. No choice but to figure it out.

The first surprise: llama.cpp just worked. Download a quantized model, run the server, done. No CUDA hell, no driver mismatches. The tooling has matured.
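For reference, "run the server" is roughly one command (the binary name and flags vary a little between llama.cpp versions, and the model path is just an example):

$ llama-server -m Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --port 8080 -ngl 99

That serves an OpenAI-compatible API on localhost:8080, which is what everything below talks to.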

The second surprise: the API is dead simple. llama.cpp exposes an OpenAI-compatible endpoint. Same format whether you're hitting a local server, Claude, or OpenAI:

import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a data converter..."},
            {"role": "user", "content": "Convert this to JSON: ..."}
        ],
        "max_tokens": 500
    }
)

result = response.json()["choices"][0]["message"]["content"]

That's it. HTTP POST with a messages array. The model has no memory between requests. Every call is stateless. Want conversation history? Include previous messages in the array.
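For example, a follow-up turn just resends the growing list, since the server keeps nothing between calls (same local endpoint as above):

import requests

history = [
    {"role": "system", "content": "You are a data converter..."},
    {"role": "user", "content": "Convert this to JSON: ..."}
]

# first call
reply = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": history, "max_tokens": 500}
).json()["choices"][0]["message"]

# the server remembers nothing, so keep the history yourself
history.append(reply)
history.append({"role": "user", "content": "Now add a severity field."})

# second call resends the whole conversation
follow_up = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": history, "max_tokens": 500}
).json()["choices"][0]["message"]["content"]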

Once I understood this pattern, swapping between providers became trivial. Same code structure, different endpoint. The abstraction that frameworks like CrewAI provide started to feel like overhead. I wanted to see the actual mechanics.
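Swapping providers really is just a parameter change. A sketch of what that looks like (chat() is a hypothetical helper, not code from the project; hosted providers additionally want an API key and a model name):

import requests

def chat(messages, base_url="http://localhost:8080/v1",
         api_key=None, model=None, max_tokens=500):
    """Call any OpenAI-compatible chat completions endpoint."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    payload = {"messages": messages, "max_tokens": max_tokens}
    if model:
        payload["model"] = model
    resp = requests.post(f"{base_url}/chat/completions",
                         json=payload, headers=headers)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

Point it at localhost for llama.cpp, or at any hosted OpenAI-compatible endpoint, and the calling code doesn't change.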

The First Tool

To test the setup, I built a format converter. Feed it messy data (CSV, logs, key-value pairs), get structured JSON back.

$ python3 format_converter.py syslog.txt --pretty

Input (firewall logs):

<134>Feb 12 14:22:01 firewall01 kernel: DROP IN=eth0 OUT= SRC=10.0.0.50 DST=192.168.1.1 PROTO=TCP SPT=44231 DPT=22

Output:

{
  "timestamp": "Feb 12 14:22:01",
  "source": "firewall01",
  "action": "DROP",
  "interface_in": "eth0",
  "source_ip": "10.0.0.50",
  "destination_ip": "192.168.1.1",
  "protocol": "TCP",
  "source_port": 44231,
  "destination_port": 22
}

I didn't tell it those were iptables logs. I didn't provide field name examples. It figured out sensible names on its own. Ports came back as integers, not strings.
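The script itself is thin. A sketch of the core against the same local endpoint (prompt wording and function names here are illustrative; the real format_converter.py also handles format detection and the --pretty flag):

import json
import sys
import requests

SYSTEM_PROMPT = (
    "You convert messy log or text data into structured JSON. "
    "Respond with JSON only, no commentary."
)

def convert(raw_text):
    # hand the raw lines to the model and ask for structured JSON back
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": "Convert this to JSON:\n" + raw_text}
            ],
            "max_tokens": 1000
        }
    )
    content = response.json()["choices"][0]["message"]["content"]
    return json.loads(content)  # a real script would handle non-JSON replies

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        print(json.dumps(convert(f.read()), indent=2))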

From nothing to working tool in a few hours. The barrier I'd imagined wasn't there.

From Tool to Agent

The format converter was a single tool. I wanted a pipeline: detect format, parse, validate, clean, analyze. Each stage feeding the next.

The interesting evolution happened in the analysis stage. My first approach was message-centric: flag anomalies, have the LLM triage them. On 86,000 auth.log records, it flagged 269,000 anomalies. Every unique IP got flagged because that specific IP was "rare." Technically correct. Completely useless.

I pivoted to entity profiling. Instead of "is this message rare?" I asked "what is this IP doing overall?" That worked better, but the profilers were purpose-built: I had an IP profiler and a username profiler because that's what auth.log needed, and adding a new data source meant building a new profiler. I wrote more about why I stopped flagging anomalies and started profiling entities in a separate post.

The final architecture: give the agent tools to query what it needs.

┌─────────────────────────────────────────────────────────────┐
│  Python Orchestration (no LLM)                              │
│  - Decides what data to gather                              │
│  - Calls query tools                                        │
│  - Manages context window                                   │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  Query Tools (Python, fast)                                 │
│  - Returns SUMMARIES, not raw data                          │
│  - Deterministic, zero variance                             │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LLM Analysis (Hermes 2 Pro 7B)                             │
│  - Receives pre-fetched summaries                           │
│  - Reasons about what the data means                        │
│  - Outputs structured findings                              │
└─────────────────────────────────────────────────────────────┘
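The query-tool layer is ordinary Python doing aggregation and returning a compact summary instead of raw events. Roughly like this (field names and the query_ip helper are hypothetical, not the project's actual schema):

from collections import Counter

def query_ip(events, ip):
    """Summarize everything a single IP did, in a few hundred tokens."""
    hits = [e for e in events if e.get("source_ip") == ip]
    timestamps = [e["timestamp"] for e in hits if "timestamp" in e]
    return {
        "ip": ip,
        "total_events": len(hits),
        "top_signatures": Counter(
            e["signature"] for e in hits if "signature" in e
        ).most_common(10),
        "top_targets": Counter(
            e["destination_ip"] for e in hits if "destination_ip" in e
        ).most_common(5),
        "first_seen": min(timestamps, default=None),
        "last_seen": max(timestamps, default=None),
    }

The orchestration layer feeds this summary, not the raw events, to the model, which keeps the context window under control.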

Two tools: query IP, query domain. The agent starts from alerts, pulls relevant data, and produces findings:

Finding: 40.80.148.42

  • Who: IP address 40.80.148.42
  • What: Triggered 589 IDS alerts including XSS, SQL injection, and information leak attempts
  • When: August 10, 2016, with 28,119 total events
  • Where: Targeting internal server 192.168.250.70
  • Why: Repeated, diverse attack signatures indicate active exploitation attempt
  • Verdict: MALICIOUS

It found the attacker. It also found the ransomware infrastructure. From a format converter to a security investigation agent in about a week. I documented the full investigation results in a follow-up post.

Hardware Reality

Everything runs on an RTX 5080 with 16GB VRAM. A 7B parameter model (Hermes 2 Pro) fits comfortably. No cloud APIs for the analysis work.

The honest take on consumer hardware: it depends on scale.

For home network security, my laptop is more than enough. Personal projects, learning, experimentation - consumer GPUs are viable for all of it. I processed 86,000 auth.log records in under a minute for the statistical work, and the full agent investigation took about 20 minutes.

This wouldn't scale to enterprise data volumes. Millions of events, real-time analysis, multiple concurrent investigations - you'd need more hardware or cloud GPU services like Modal.

But for learning? For building something that works? Consumer hardware is fine.

What I'd Tell Someone Still Avoiding It

Find a project or walkthrough and just do it.

A lot of really smart people have created the tooling required to make this simple. llama.cpp handles the inference. Quantized models fit on consumer GPUs. The OpenAI-compatible API means you don't need to learn new patterns.

For me, the barrier was mental, not technical. I spent months assuming this would be hard. Work forced me to try it. I had something working in hours.