created January 5, 2026 · 5 min read · AI Data Analyst

Agent Investigation With Query Tools

Giving a 7B model two query tools and a 5 W's output format is enough to find attacks on a raw auth.log. The architecture beats dumping the logs into the prompt.

security · agents · local-llm · investigation

A 7B parameter model cannot fit 86,000 log records in context, and even if it could, inference slows as context grows. Exposing query tools instead and asking the agent to answer the 5 W's of an investigation produces usable findings on both labeled and unlabeled datasets. This page walks through both test runs and what the architecture proves.

The constraint

A naive "dump all the logs into the prompt" approach does not work at any meaningful scale. Context budget runs out long before the dataset does, and inference latency compounds. Analysts do not read every log line either. They query, filter, pivot, and work from summaries.
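The arithmetic makes the constraint concrete. A quick sketch, assuming roughly 30 tokens per log line and an 8K context window (both illustrative numbers, not measured from this dataset):

```python
# Back-of-envelope context math. Both constants are assumptions
# for illustration, not measurements from the actual logs.
records = 86_000
tokens_per_record = 30       # rough guess for an average log line
context_window = 8_192       # a common window size for small local models

total = records * tokens_per_record
print(total)                    # → 2580000
print(total // context_window)  # → 314
```

Even with generous rounding, the dataset is hundreds of context windows deep. No amount of prompt engineering closes that gap; querying does.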

The same approach works for an agent: Python handles aggregation, statistics, and outlier detection, while the agent gets query tools to pull what it needs. See Deterministic Validation for LLM Output for the broader split.

Test 1: BOTS with IDS labels

The Splunk BOTS dataset has 12 million events across firewall, IDS, DNS, and HTTP logs, plus pre-labeled IDS alerts that give the agent signatures to work from. Two tools: query_ip and query_domain. The agent starts from alerts and pulls data as needed.
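The original tools are not shown here, but a minimal sketch of what a query_ip tool could look like follows. The event dicts, field names, and signature strings are all hypothetical; the point is that the tool returns a compact summary rather than raw events:

```python
import json
from collections import Counter

def query_ip(events, ip, limit=50):
    """Hypothetical sketch of a query_ip tool: filter pre-parsed
    events for one IP and return a compact summary for the agent.
    Field names (src_ip, dest_ip, sourcetype, signature) are assumed."""
    hits = [e for e in events if ip in (e.get("src_ip"), e.get("dest_ip"))]
    summary = {
        "ip": ip,
        "total_events": len(hits),
        "by_sourcetype": dict(Counter(e["sourcetype"] for e in hits)),
        "sample_alerts": [e["signature"] for e in hits if "signature" in e][:limit],
    }
    return json.dumps(summary)

# Usage: the agent calls this with an IP pulled from an IDS alert.
events = [
    {"src_ip": "40.80.148.42", "dest_ip": "192.168.250.70",
     "sourcetype": "suricata", "signature": "ET WEB_SERVER SQL Injection"},
    {"src_ip": "40.80.148.42", "dest_ip": "192.168.250.70",
     "sourcetype": "stream:http"},
]
print(query_ip(events, "40.80.148.42"))
```

The summary is small enough to fit in context even when the underlying hit count is in the tens of thousands, which is what makes the 28,119-event finding below possible.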

Finding: 40.80.148.42

  • Who: IP address 40.80.148.42
  • What: Triggered 589 IDS alerts including XSS, SQL injection, and information leak attempts
  • When: August 10, 2016, with 28,119 total events
  • Where: Targeting internal server 192.168.250.70
  • Why: Repeated, diverse attack signatures indicate active exploitation attempt
  • Verdict: MALICIOUS

It also traced the ransomware infrastructure:

Finding: 85.93.43.236

  • What: Triggered an alert due to a signature indicating TROJAN Ransomware/Cerber Checkin Error ICMP Response
  • When: 2016-08-24 at 10:49:36
  • Verdict: MALICIOUS

Two tools, one path from web scanner to ransomware infection. Worth noting that this test had training wheels: the IDS told the agent what was bad before it started looking.

Test 2: auth.log without signatures

The harder test is a plain auth.log with no IDS, no threat intel, and no labels. Just failed logins, successful logins, and disconnects. The profiler computes statistics, but statistics alone do not tell an analyst what to do.

Profiler output for a suspicious IP:

{
  "ip": "173.192.158.3",
  "priority": "ELEVATED",
  "behaviors": {
    "AUTH_FAILURE": 303,
    "DISCONNECT": 445
  },
  "unique_usernames": 61,
  "total_events": 748,
  "first_seen": "Dec  2 04:12:40",
  "last_seen": "Dec  2 04:17:19"
}
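The profiler's internals are not shown in this post; a minimal sketch of how a profile like the one above could be computed follows. The output field names match the JSON above, but the parsing regex and behavior labels are assumptions about standard sshd log lines:

```python
import re
from collections import Counter

def profile_ip(lines, ip):
    """Sketch of a profiler (assumed logic, not the original code):
    count behaviors and distinct usernames for one IP in an auth.log."""
    behaviors = Counter()
    usernames = set()
    timestamps = []
    for line in lines:
        if ip not in line:
            continue
        timestamps.append(line[:15])  # syslog timestamp prefix
        if "Failed password" in line:
            behaviors["AUTH_FAILURE"] += 1
            m = re.search(r"for (?:invalid user )?(\S+) from", line)
            if m:
                usernames.add(m.group(1))
        elif "Received disconnect" in line:
            behaviors["DISCONNECT"] += 1
    return {
        "ip": ip,
        "behaviors": dict(behaviors),
        "unique_usernames": len(usernames),
        "total_events": len(timestamps),
        "first_seen": timestamps[0] if timestamps else None,
        "last_seen": timestamps[-1] if timestamps else None,
    }

lines = [
    "Dec  2 04:12:40 app-1 sshd[24660]: Failed password for invalid user admin from 173.192.158.3 port 4497 ssh2",
    "Dec  2 04:12:42 app-1 sshd[24660]: Received disconnect from 173.192.158.3: 11: Bye Bye",
]
print(profile_ip(lines, "173.192.158.3"))
```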

Recognizing 61 usernames in 5 minutes as a credential spray takes pattern recognition, not just arithmetic. Agent output for the same IP:

Finding: IP 173.192.158.3

  • Who: IP address 173.192.158.3
  • What: High volume of authentication attempts with credential spraying (61 different usernames targeted) including 303 authentication failures
  • When: The activity occurred within a short duration (Dec 2, 04:12:40 - 04:17:19) with frequent DISCONNECT events
  • Where: The target host (limited to SSH authentication logs)
  • Why: This behavior may indicate an automated script or brute force attack attempting unauthorized access
  • Verdict: MALICIOUS
  • Confidence: HIGH
  • Recommended Actions:
    1. Block the IP address at the network level
    2. Examine firewall logs for broader context
    3. Questions remaining: What are the origins of the traffic? Are there any successful login attempts?

Same numbers, different output. The agent identified the 5-minute window as significant, produced context and reasoning, and recommended specific follow-up actions.
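A side benefit of the fixed 5 W's format is that the output is machine-checkable. A sketch of a deterministic validator, assuming findings are parsed into dicts keyed by field name (the BENIGN label is an assumption; MALICIOUS and SUSPICIOUS appear in the results):

```python
REQUIRED_FIELDS = ["Who", "What", "When", "Where", "Why", "Verdict"]
ALLOWED_VERDICTS = {"MALICIOUS", "SUSPICIOUS", "BENIGN"}  # assumed label set

def validate_finding(finding):
    """Deterministic check on the agent's structured output:
    plain field-presence and enum checks, no second LLM as judge."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in finding]
    verdict = finding.get("Verdict")
    if verdict is not None and verdict not in ALLOWED_VERDICTS:
        errors.append(f"invalid verdict: {verdict}")
    return errors

finding = {"Who": "173.192.158.3", "What": "credential spray",
           "When": "Dec 2 04:12:40 - 04:17:19", "Where": "SSH auth logs",
           "Why": "automated brute force", "Verdict": "MALICIOUS"}
print(validate_finding(finding))  # → []
```

An empty error list means the finding can flow into downstream tooling; anything else gets bounced back to the agent for a retry.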

Results summary

  • Entities investigated: 20 (10 IPs, 10 usernames)
  • Time: about 2 minutes
  • MALICIOUS verdicts: 3 (HIGH confidence)
  • SUSPICIOUS verdicts: 15 (mostly MEDIUM confidence)
  • Model: Hermes 2 Pro 7B (local)
  • Hardware: RTX 5080

Twenty entities investigated in two minutes on consumer hardware. No cloud API, no credits burned.

What the agent missed

The agent correctly flagged distributed attack patterns across common usernames like admin, ftpuser, and guest. It missed usernames like D-Link, PlcmSpIp, vyatta, and pi, which are default credentials for IoT devices (D-Link routers, Polycom VoIP phones, Vyatta network appliances, Raspberry Pis). The server was being targeted by Mirai-style botnets scanning for factory defaults.

The agent asked the right question ("is this a valid system account?") but lacked the domain knowledge to answer it. That is expected. A 7B model trained months ago does not have current threat intelligence.

The fix is not a smarter prompt; it is another tool: a threat intel lookup, a web search tool, or an enrichment API. The architecture already supports adding tools. That is the point of the split.
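As a sketch of what such a tool could look like: a static lookup built from the defaults named above. The dictionary here is illustrative, not a real intel feed; a production version would call an enrichment API instead:

```python
# Hypothetical enrichment tool. The username list is illustrative,
# drawn from the defaults discussed above, not a real intel feed.
IOT_DEFAULT_USERS = {
    "D-Link": "D-Link router default account",
    "PlcmSpIp": "Polycom VoIP phone default account",
    "vyatta": "Vyatta network appliance default account",
    "pi": "Raspberry Pi default account",
}

def enrich_username(username):
    """Answer the agent's question 'is this a valid system account?'
    with a static lookup; a real tool would query a threat intel API."""
    if username in IOT_DEFAULT_USERS:
        return f"KNOWN IoT DEFAULT: {IOT_DEFAULT_USERS[username]}"
    return "no match in default-credential list"

print(enrich_username("PlcmSpIp"))
```

With this registered alongside query_ip and query_domain, the agent's "is this a valid system account?" question becomes answerable without any prompt changes.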

What I got wrong first

Assuming the agent would need more context to be useful. A bigger model, a longer context window, more tokens per call. None of that was the limiting factor. The limiting factor was the set of tools the agent could call, and those tools were under a hundred lines of Python each.

Tradeoffs

Two query tools plus a 5 W's output format is a low-ceiling, high-floor setup. It will not beat a trained human analyst on subtle cases, but it will not go off the rails either. Adding tools is cheap: each new tool extends the agent's effective reach without requiring any prompt changes. The ceiling lives in the tool set, not the model.