March 31, 2026 · 7 min read · AI Data Analyst

Three Models, One Investigation

My investigation agent produced zero findings on real data. I rebuilt the tool layer as a pre-computed index, then ran three local models against the same attack dataset.

security · agents · local-llm · architecture · investigation

My investigation agent produced zero findings on real data. Every tool call timed out. The model was calling the right tools with the right arguments, but the tools themselves were scanning 1.5GB files line by line and couldn't finish in 30 seconds. I rebuilt the tool layer as a pre-computed SQLite index, and the same agent architecture produced 9 structured findings with correct verdicts against known ground truth.

Where It Started

Back in January, I built a security investigation agent that ran on a 7B model locally. It had two tools: query_ip and query_domain. You'd feed it alerts, it would query the data, and produce findings in a structured format. I wrote four posts about that process.

It worked on small test datasets. The tools loaded log files into memory, filtered by IP, and returned summaries. Good enough for a few thousand records.

What Broke

I pointed it at the Splunk BOTS v1 dataset. 3.5 million Suricata events. 1.3 million DNS records. 39,000 HTTP requests. Real attack data with known ground truth.

Every tool call timed out. The query_ip subprocess loaded the entire 1.5GB Suricata file into a Python list, then scanned every record looking for a matching IP. With a 30-second timeout, it never finished. The stuck-loop detection I'd built (part of an earlier ralph loop to enforce coding standards) kicked in correctly: the model called query_ip("40.80.148.42"), got a timeout error, retried, got another timeout, hit the 3-call limit, moved to the next entity. Same result. Every entity, zero data returned.

The model was doing exactly what it should. The tools just couldn't keep up.
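The stuck-loop guard described above can be sketched in a few lines. This is a hypothetical reconstruction, not the actual code from the agent; the class and method names are illustrative. The idea is simply to count repeated calls with identical arguments and cut the agent off at the limit:

```python
from collections import Counter

class StuckLoopGuard:
    """Hypothetical sketch of a repeated-call limiter: after max_calls
    identical tool invocations (same tool, same arguments), stop retrying
    and let the agent move on to the next entity."""

    def __init__(self, max_calls=3):
        self.max_calls = max_calls
        self.counts = Counter()

    def allow(self, tool, args):
        # Key on tool name + sorted arguments so retries are recognized
        # even if argument order differs between calls.
        key = (tool, tuple(sorted(args.items())))
        self.counts[key] += 1
        return self.counts[key] <= self.max_calls
```

With a guard like this, the timeout-retry-timeout cycle on `query_ip("40.80.148.42")` ends after three attempts instead of looping forever.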

The Diagnosis

The problem was architectural, not model-related. query_ip.py called load_jsonl() which read every line of a JSONL file into a list, then iterated the list to filter by IP. For the Suricata file alone, that's 3.5 million json.loads() calls and 1.5GB of memory, per tool invocation, in a subprocess with a timeout.

The fix wasn't a longer timeout. It was moving all that work out of the agent's tool-calling loop entirely.

The Fix: Pre-Computed Index

I wrote build_index.py, a single Python script that streams through all data sources once and produces a SQLite database. No LLM involved. It runs before the agent starts and takes about 2 minutes.

What it computes:

  • 22,053 entities (IPs and domains) with cross-source profiles
  • 119,719 relationships (who talked to whom, shared targets, DNS resolutions)
  • Anomaly scores (0-100) from keyword matching on alert signatures, volume outliers, auth failure patterns
  • Temporal bins (hourly activity for entities with alerts or suspicious behavior)
  • 1,360 alert details and 60 HTTP attack samples (SQL injection URIs, XSS payloads)

The agent's tools became SQLite lookups instead of file scans:

| Tool | What It Does | Response Time |
| --- | --- | --- |
| get_overview | Top anomalous entities + dataset summary | under 1 ms |
| get_entity_detail | Full cross-source profile for one entity | under 1 ms |
| pivot | Related entities (follow the attack chain) | under 1 ms |
| get_timeline | Hourly activity + concurrent entities | under 1 ms |
| submit_finding | Structured verdict submission | instant |

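Each tool reduces to a thin, indexed query. A minimal sketch of two of them, assuming a simplified `entities` table (`entity`, `events`, `alerts`, `anomaly_score`); the real tools return richer cross-source profiles:

```python
import sqlite3

def get_entity_detail(con, entity):
    # Primary-key lookup: sub-millisecond regardless of dataset size.
    row = con.execute(
        "SELECT entity, events, alerts, anomaly_score "
        "FROM entities WHERE entity = ?", (entity,)).fetchone()
    if row is None:
        return None
    return dict(zip(("entity", "events", "alerts", "anomaly_score"), row))

def get_overview(con, limit=10):
    # Top anomalous entities: the agent's first look at the landscape.
    return con.execute(
        "SELECT entity, anomaly_score FROM entities "
        "ORDER BY anomaly_score DESC LIMIT ?", (limit,)).fetchall()
```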
The agent now starts by calling get_overview to see the landscape, picks the highest-scored entity, calls get_entity_detail for a deep dive, uses pivot to follow relationships, and submits findings as it goes. Open-ended investigation instead of a fixed entity list.
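Stripped of the LLM, the investigation loop above has a simple deterministic skeleton. This sketch is illustrative only; in the real agent the model chooses each step via tool calls, and the verdict line below stands in for the model's judgment:

```python
def investigate(tools, max_entities=3):
    """Skeleton of the loop: overview -> deep dive -> pivot -> finding.
    `tools` is any object exposing the four tool methods."""
    findings = []
    seen = set()
    frontier = [e for e, _ in tools.get_overview()]   # highest scores first
    while frontier and len(findings) < max_entities:
        entity = frontier.pop(0)
        if entity in seen:
            continue
        seen.add(entity)
        detail = tools.get_entity_detail(entity)
        # Judgment step: in the real agent the LLM weighs the evidence;
        # this threshold is a placeholder for that decision.
        verdict = "MALICIOUS" if detail["alerts"] > 0 else "SUSPICIOUS"
        findings.append(tools.submit_finding(entity, verdict))
        frontier.extend(tools.pivot(entity))          # follow the attack chain
    return findings
```

The open-ended part, and the reason an LLM is in the loop at all, is everything the placeholder hides: which pivot to trust, when an entity is exonerated, and how confident the verdict should be.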

Three Models, Same Dataset

I ran three local models on the Dell Pro Max GB10 (128GB unified memory, NVIDIA GB10 GPU) against the same BOTS v1 index:

| Model | Parameters | Findings | Tool Calls | Duration |
| --- | --- | --- | --- | --- |
| Hermes 4 70B | 70B (Q4) | 9 | 12 | 904 s |
| Nemotron Cascade 30B | 30B (Q4) | 3 | 4 | 110 s |
| GPT-OSS 20B | 20B (Q4) | 3 | 6 | 24 s |

What They All Found

Every model correctly identified the three main entities:

  • 40.80.148.42 (MALICIOUS): The attacker. 589 IDS alerts including XSS, SQL injection, Shellshock (CVE-2014-6271). 20,000+ HTTP requests targeting imreallynotbatman.com. All three models gave this 0.95-0.97 confidence.
  • 192.168.250.70 (MALICIOUS): The victim web server. Targeted by the attacker with exploit attempts. Two models called it MALICIOUS, Nemotron Cascade called it SUSPICIOUS (it was receiving attacks, not generating them, so that's a judgment call).
  • 192.168.250.100 (MALICIOUS): Internal host with C2 communication patterns, Cerber ransomware check-ins, Tor traffic.

What Only Hermes 4 Found

With a 16K context window and 12 tool calls, Hermes 4 70B went deeper:

  • 85.93.43.236, 85.93.4.54, 85.93.0.0 (MALICIOUS, 0.85): The Cerber ransomware C2 infrastructure. Three external IPs communicating with the compromised internal host using ransomware beacon patterns.
  • 61.197.203.243 (SUSPICIOUS, 0.80): An SSH brute-forcer with 409 failed login attempts across 47 usernames. Completely unrelated to the web attack, found because the anomaly index flagged it independently.
  • 54.148.194.58 (SUSPICIOUS, 0.70): External IP involved in potential data exfiltration.
  • 192.168.2.50 (MALICIOUS, 0.90): Internal host with DoS attempts and exploit signatures against other internal hosts.

What None of Them Found

The known scanner IP, 23.22.63.114, scored only 5 out of 100 on the anomaly index. It had high HTTP volume (enough to flag it) but zero IDS alerts. None of the three models investigated it because it wasn't in the top 10 overview results.

This is the real gap. The obvious attacks get caught because IDS rules fire. The subtle attacker who scans without triggering signatures needs a different detection approach entirely. That's the next problem to solve.

Context Window Matters

I ran Hermes 4 twice: once with 8K context, once with 16K.

| Context Size | Findings | Duration |
| --- | --- | --- |
| 8K tokens | 1 | 139 s |
| 16K tokens | 9 | 904 s |

With 8K, the agent investigated 5 entities but hit context truncation at 14 messages. It got only one finding out before the conversation overflowed. With 16K, it completed 12 tool calls and submitted 9 findings, including the ransomware C2 infrastructure.

The same model, same data, same tools. The only difference was how much conversation history it could hold. Context window is a hard constraint on investigation depth.

What I Learned

Pre-processing quality determines investigation quality. The model is only as good as the data it gets. When tools returned timeout errors, the best model in the world couldn't investigate. When tools returned pre-computed anomaly scores and relationship graphs, even a 20B model produced correct findings in 24 seconds.

Smaller models are faster but less curious. GPT-OSS 20B finished in 24 seconds and got the top 3 right. But it stopped there. Hermes 4 70B spent 15 minutes and found the ransomware infrastructure, the brute-forcer, and an internal host doing lateral movement. Whether the extra findings are worth 15 minutes depends on the use case.

The principle holds: deterministic where you can, LLM where you must. The anomaly scores, relationship graph, and temporal bins are all computed without any LLM. The agent only does what requires judgment: deciding which entities to investigate next, synthesizing evidence into a verdict, and determining confidence levels.

What's Next

The current system finds threats that trigger IDS alerts. That's the easy case. The harder problem: what if the adversary has been inside the network for months without triggering any signatures? I'm working on behavioral profiling, entity clustering, and temporal drift detection to find the "normal but wrong" traffic patterns that signature-based detection misses.