created March 31, 2026 · 6 min read · AI Data Analyst

Three Local Models Compared on One Investigation

Running Hermes 4 70B, Nemotron Cascade 30B, and GPT-OSS 20B against the same security investigation exposes a speed-vs-depth tradeoff that shows up clearly when tools are fast.

security · agents · local-llm · model-comparison

Three local models (Hermes 4 70B, Nemotron Cascade 30B, GPT-OSS 20B) running against the same pre-computed index on the Splunk BOTS v1 dataset produce different investigation depths despite agreeing on the obvious attacks. The comparison only becomes meaningful once the tool layer stops being the bottleneck. This page walks through what changed, what each model found, and what the speed-versus-depth tradeoff looks like when tools return in under a millisecond.

The tool layer was the bottleneck

Pointing an agent at the BOTS v1 dataset (3.5M Suricata events, 1.3M DNS records, 39K HTTP requests) with a file-scanning tool produced zero findings. Every query_ip call loaded a 1.5GB Suricata JSONL into a Python list, then iterated to filter. With a 30-second subprocess timeout, no call finished. The stuck-loop detector correctly kicked in, the model tried each entity three times, and the whole investigation returned nothing.

The model was calling the right tools with the right arguments. The tools could not keep up. The fix was architectural: move all the scanning out of the agent's tool-calling loop entirely.

Pre-computed SQLite index

build_index.py is a single Python script that streams through all data sources once and produces a SQLite database. No LLM involved. It runs before the agent starts and takes about 2 minutes.

What the index contains:

  • 22,053 entities (IPs and domains) with cross-source profiles
  • 119,719 relationships (who talked to whom, shared targets, DNS resolutions)
  • Anomaly scores (0 to 100) from keyword matching on alert signatures, volume outliers, and auth failure patterns
  • Temporal bins (hourly activity for flagged entities)
  • 1,360 alert details and 60 HTTP attack samples (SQL injection URIs, XSS payloads)
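A minimal sketch of the single-pass build that a script like build_index.py performs, streaming events once into SQLite with keyword-based anomaly scoring. The table schema, field names, and keyword weights here are illustrative assumptions, not the actual implementation:

```python
import json
import sqlite3

# Hypothetical signature keywords and weights for anomaly scoring (assumption).
ALERT_KEYWORDS = {"sql injection": 40, "xss": 30, "shellshock": 40}

def build_index(db_path, suricata_lines):
    """Stream events once, accumulate per-entity stats, write SQLite. No LLM."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE entities (id TEXT PRIMARY KEY, alerts INTEGER, score INTEGER)"
    )
    stats = {}
    for line in suricata_lines:  # one streaming pass over the JSONL
        event = json.loads(line)
        ip = event.get("src_ip")
        if not ip:
            continue
        entry = stats.setdefault(ip, {"alerts": 0, "score": 0})
        sig = event.get("alert", {}).get("signature", "").lower()
        if sig:
            entry["alerts"] += 1
            for kw, weight in ALERT_KEYWORDS.items():
                if kw in sig:
                    entry["score"] = min(100, entry["score"] + weight)
    con.executemany(
        "INSERT INTO entities VALUES (?, ?, ?)",
        [(ip, s["alerts"], s["score"]) for ip, s in stats.items()],
    )
    con.commit()
    return con
```

The real index also materializes relationships, temporal bins, and attack samples, but the shape is the same: pay the scan cost once, up front, so the agent never pays it per tool call.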

The agent's tools became SQLite lookups:

| Tool | What it does | Response time |
| --- | --- | --- |
| get_overview | Top anomalous entities plus dataset summary | under 1ms |
| get_entity_detail | Full cross-source profile for one entity | under 1ms |
| pivot | Related entities (follow the attack chain) | under 1ms |
| get_timeline | Hourly activity plus concurrent entities | under 1ms |
| submit_finding | Structured verdict submission | instant |
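With the index in place, each tool reduces to an indexed SQLite query. A sketch of two of them, assuming a simple `entities` table (the column names are my assumption, not the actual schema):

```python
import sqlite3

def get_overview(con, limit=10):
    """Top anomalous entities: one ORDER BY over an indexed column."""
    rows = con.execute(
        "SELECT id, score FROM entities ORDER BY score DESC LIMIT ?", (limit,)
    ).fetchall()
    return [{"entity": r[0], "score": r[1]} for r in rows]

def get_entity_detail(con, entity_id):
    """Full profile for one entity: a primary-key lookup, sub-millisecond."""
    row = con.execute(
        "SELECT id, alerts, score FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    if row is None:
        return None
    return {"entity": row[0], "alerts": row[1], "score": row[2]}
```

The point is not the SQL itself but the cost model: the agent's loop now spends its 30-second timeout budget on model inference, never on data scanning.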

The agent now calls get_overview first, picks the highest-scored entity, runs get_entity_detail for a deep dive, uses pivot to follow relationships, and submits findings as it goes. Open-ended investigation instead of a fixed entity list.
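The control flow of that open-ended investigation can be sketched as a loop over injected tool callables. In the real system the model, not a threshold, decides what to investigate next and what verdict to submit; this deterministic stand-in (and the score cutoff of 50) is purely illustrative:

```python
def investigate(tools, max_steps=12):
    """Overview -> detail -> pivot -> submit, until the step budget runs out.
    `tools` is a dict of callables mirroring the tool table (assumption)."""
    findings = []
    seen = set()
    queue = [e["entity"] for e in tools["get_overview"]()]
    for _ in range(max_steps):
        if not queue:
            break
        entity = queue.pop(0)
        if entity in seen:
            continue
        seen.add(entity)
        detail = tools["get_entity_detail"](entity)
        if detail and detail["score"] >= 50:  # the judgment the LLM actually makes
            findings.append(tools["submit_finding"](entity, detail))
            queue.extend(tools["pivot"](entity))  # follow the attack chain
    return findings
```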

Three models, same dataset

Running on the Dell Pro Max GB10 (128GB unified memory, NVIDIA GB10 GPU) against the same BOTS v1 index:

| Model | Parameters | Findings | Tool calls | Duration |
| --- | --- | --- | --- | --- |
| Hermes 4 70B | 70B (Q4) | 9 | 12 | 904s |
| Nemotron Cascade 30B | 30B (Q4) | 3 | 4 | 110s |
| GPT-OSS 20B | 20B (Q4) | 3 | 6 | 24s |

What all three found

Every model correctly identified the three main entities:

  • 40.80.148.42 (MALICIOUS): the attacker. 589 IDS alerts including XSS, SQL injection, Shellshock (CVE-2014-6271). 20,000 HTTP requests targeting imreallynotbatman.com. All three models gave this 0.95 to 0.97 confidence.
  • 192.168.250.70 (MALICIOUS or SUSPICIOUS): the victim web server, targeted by the attacker with exploit attempts. Two models called it MALICIOUS, Nemotron Cascade called it SUSPICIOUS (it was receiving attacks, not generating them, so that is a judgment call).
  • 192.168.250.100 (MALICIOUS): internal host with C2 communication patterns, Cerber ransomware check-ins, Tor traffic.

What only Hermes 4 70B found

With a 16K context window and 12 tool calls, Hermes 4 went deeper:

  • 85.93.43.236, 85.93.4.54, 85.93.0.0 (MALICIOUS, 0.85): the Cerber ransomware C2 infrastructure. Three external IPs communicating with the compromised internal host using ransomware beacon patterns.
  • 61.197.203.243 (SUSPICIOUS, 0.80): an SSH brute-forcer with 409 failed login attempts across 47 usernames. Completely unrelated to the web attack, found because the anomaly index flagged it independently.
  • 54.148.194.58 (SUSPICIOUS, 0.70): external IP involved in potential data exfiltration.
  • 192.168.2.50 (MALICIOUS, 0.90): internal host with DoS attempts and exploit signatures against other internal hosts.

What none of them found

The known scanner IP 23.22.63.114 scored only 5 out of 100 on the anomaly index. It had high HTTP volume (enough to flag it) but zero IDS alerts. None of the three models investigated it because it was not in the top 10 overview results. This is the real gap: the obvious attacks get caught because IDS rules fire, but the subtle attacker who scans without triggering signatures needs a different detection approach entirely.

Context window is a hard constraint

Running Hermes 4 twice, once with 8K context and once with 16K:

| Context size | Findings | Duration |
| --- | --- | --- |
| 8K tokens | 1 | 139s |
| 16K tokens | 9 | 904s |

With 8K, the agent investigated 5 entities but hit context truncation at 14 messages. Only one finding made it out before the conversation overflowed. With 16K, it completed 12 tool calls and submitted 9 findings including the ransomware C2 infrastructure. Same model, same data, same tools. Investigation depth is gated by how much conversation history fits.
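One common mitigation is trimming old messages before the window overflows, keeping the system prompt plus the newest history that fits. A rough sketch, assuming a flat chars-per-token heuristic (the 4:1 ratio is an approximation, not a measured value for any of these models):

```python
def trim_history(messages, max_tokens=8192, chars_per_token=4):
    """Keep the system prompt plus the most recent messages within budget."""
    budget = max_tokens * chars_per_token
    system, rest = messages[0], messages[1:]
    budget -= len(system["content"])
    kept = []
    for msg in reversed(rest):  # walk newest-first, stop when the budget runs out
        cost = len(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```

Trimming trades away old evidence instead of crashing, but it does not change the conclusion above: a finding the model can no longer see is a finding it cannot synthesize, so context size still gates depth.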

What I got wrong first

I chased model quality before fixing the tool layer. With a broken tool layer, all three models produced zero findings, which made it look like a model-selection problem. It was not. Once the tools returned in under a millisecond, the gap between models became visible and measurable. Tool speed is a prerequisite for any meaningful model comparison.

Tradeoffs

Smaller models are faster but less curious. GPT-OSS 20B finished in 24 seconds and got the top 3 right, then stopped. Hermes 4 70B spent 15 minutes and found the ransomware infrastructure, the brute-forcer, and an internal host doing lateral movement. Whether the extra findings are worth 15 minutes depends on the use case. For on-call triage, 24 seconds and the top 3 is a good floor. For deep investigations, the extra depth is the whole point.

The principle that shows up everywhere: deterministic where you can, LLM where you must. The anomaly scores, relationship graph, and temporal bins are all computed without any LLM. The agent only does what requires judgment: deciding which entities to investigate next, synthesizing evidence into a verdict, and determining confidence levels.