created January 5, 2026 · 5 min read · AI Data Analyst

Agent Investigation With Query Tools

Giving a 7B model two query tools and a 5 W's output format is enough to find attacks on a raw auth.log. The architecture beats dumping the logs into the prompt.

security · agents · local-llm · investigation

A 7B parameter model cannot fit 86,000 log records in context, and even if it could, inference slows as context grows. Exposing query tools instead and asking the agent to answer the 5 W's of an investigation produces usable findings on both labeled and unlabeled datasets. This page walks through both test runs and what the architecture proves.

The constraint

A naive "dump all the logs into the prompt" approach does not work at any meaningful scale. Context budget runs out long before the dataset does, and inference latency compounds. Analysts do not read every log line either. They query, filter, pivot, and work from summaries.
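The arithmetic makes the constraint concrete. A quick sketch, assuming roughly 30 tokens per log line and an 8K context window (both illustrative numbers, not measured from this dataset):

```python
# Back-of-envelope context math. Both constants are assumptions
# for illustration, not measurements from the actual logs.
records = 86_000
tokens_per_record = 30       # rough guess for an average log line
context_window = 8_192       # a common window size for small local models

total = records * tokens_per_record
print(total)                    # → 2580000
print(total // context_window)  # → 314
```

Even with generous rounding, the dataset is hundreds of context windows deep. No amount of prompt engineering closes that gap; querying does.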

The same approach works for an agent: Python handles aggregation, statistics, and outlier detection, while the agent gets query tools to pull what it needs. See Deterministic Validation for LLM Output for the broader split.

Test 1: BOTS with IDS labels

The Splunk BOTS dataset has 12 million events across firewall, IDS, DNS, and HTTP logs, plus pre-labeled IDS alerts that give the agent signatures to work from. Two tools: query_ip and query_domain. The agent starts from alerts and pulls data as needed.
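The original tools are not shown here, but a minimal sketch of what a query_ip tool could look like follows. The event dicts, field names, and signature strings are all hypothetical; the point is that the tool returns a compact summary rather than raw events:

```python
import json
from collections import Counter

def query_ip(events, ip, limit=50):
    """Hypothetical sketch of a query_ip tool: filter pre-parsed
    events for one IP and return a compact summary for the agent.
    Field names (src_ip, dest_ip, sourcetype, signature) are assumed."""
    hits = [e for e in events if ip in (e.get("src_ip"), e.get("dest_ip"))]
    summary = {
        "ip": ip,
        "total_events": len(hits),
        "by_sourcetype": dict(Counter(e["sourcetype"] for e in hits)),
        "sample_alerts": [e["signature"] for e in hits if "signature" in e][:limit],
    }
    return json.dumps(summary)

# Usage: the agent calls this with an IP pulled from an IDS alert.
events = [
    {"src_ip": "40.80.148.42", "dest_ip": "192.168.250.70",
     "sourcetype": "suricata", "signature": "ET WEB_SERVER SQL Injection"},
    {"src_ip": "40.80.148.42", "dest_ip": "192.168.250.70",
     "sourcetype": "stream:http"},
]
print(query_ip(events, "40.80.148.42"))
```

The summary is small enough to fit in context even when the underlying hit count is in the tens of thousands, which is what makes the 28,119-event finding below possible.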

Finding: 40.80.148.42

  • Who: IP address 40.80.148.42
  • What: Triggered 589 IDS alerts including XSS, SQL injection, and information leak attempts
  • When: August 10, 2016, with 28,119 total events
  • Where: Targeting internal server 192.168.250.70
  • Why: Repeated, diverse attack signatures indicate active exploitation attempt
  • Verdict: MALICIOUS

It also traced the ransomware infrastructure:

Finding: 85.93.43.236

  • What: Triggered an alert due to a signature indicating TROJAN Ransomware/Cerber Checkin Error ICMP Response
  • When: 2016-08-24 at 10:49:36
  • Verdict: MALICIOUS

Two tools, one path from web scanner to ransomware infection. Worth noting that this test had training wheels: the IDS told the agent what was bad before it started looking.

Test 2: auth.log without signatures

The harder test is a plain auth.log with no IDS, no threat intel, and no labels. Just failed logins, successful logins, and disconnects. The profiler computes statistics, but statistics alone do not tell an analyst what to do.

Profiler output for a suspicious IP:

{
  "ip": "173.192.158.3",
  "priority": "ELEVATED",
  "behaviors": {
    "AUTH_FAILURE": 303,
    "DISCONNECT": 445
  },
  "unique_usernames": 61,
  "total_events": 748,
  "first_seen": "Dec  2 04:12:40",
  "last_seen": "Dec  2 04:17:19"
}
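The profiler's internals are not shown in this post; a minimal sketch of how a profile like the one above could be computed follows. The output field names match the JSON above, but the parsing regex and behavior labels are assumptions about standard sshd log lines:

```python
import re
from collections import Counter

def profile_ip(lines, ip):
    """Sketch of a profiler (assumed logic, not the original code):
    count behaviors and distinct usernames for one IP in an auth.log."""
    behaviors = Counter()
    usernames = set()
    timestamps = []
    for line in lines:
        if ip not in line:
            continue
        timestamps.append(line[:15])  # syslog timestamp prefix
        if "Failed password" in line:
            behaviors["AUTH_FAILURE"] += 1
            m = re.search(r"for (?:invalid user )?(\S+) from", line)
            if m:
                usernames.add(m.group(1))
        elif "Received disconnect" in line:
            behaviors["DISCONNECT"] += 1
    return {
        "ip": ip,
        "behaviors": dict(behaviors),
        "unique_usernames": len(usernames),
        "total_events": len(timestamps),
        "first_seen": timestamps[0] if timestamps else None,
        "last_seen": timestamps[-1] if timestamps else None,
    }

lines = [
    "Dec  2 04:12:40 app-1 sshd[24660]: Failed password for invalid user admin from 173.192.158.3 port 4497 ssh2",
    "Dec  2 04:12:42 app-1 sshd[24660]: Received disconnect from 173.192.158.3: 11: Bye Bye",
]
print(profile_ip(lines, "173.192.158.3"))
```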

Recognizing 61 usernames in 5 minutes as a credential spray takes pattern recognition, not just arithmetic. Agent output for the same IP:

Finding: IP 173.192.158.3

  • Who: IP address 173.192.158.3
  • What: High volume of authentication attempts with credential spraying (61 different usernames targeted) including 303 authentication failures
  • When: The activity occurred within a short duration (Dec 2, 04:12:40 - 04:17:19) with frequent DISCONNECT events
  • Where: The target host (limited to SSH authentication logs)
  • Why: This behavior may indicate an automated script or brute force attack attempting unauthorized access
  • Verdict: MALICIOUS
  • Confidence: HIGH
  • Recommended Actions:
    1. Block the IP address at the network level
    2. Examine firewall logs for broader context
    3. Questions remaining: What are the origins of the traffic? Are there any successful login attempts?

Same numbers, different output. The agent identified the 5-minute window as significant, produced context and reasoning, and recommended specific follow-up actions.
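A side benefit of the fixed 5 W's format is that the output is machine-checkable. A sketch of a deterministic validator, assuming findings are parsed into dicts keyed by field name (the BENIGN label is an assumption; MALICIOUS and SUSPICIOUS appear in the results):

```python
REQUIRED_FIELDS = ["Who", "What", "When", "Where", "Why", "Verdict"]
ALLOWED_VERDICTS = {"MALICIOUS", "SUSPICIOUS", "BENIGN"}  # assumed label set

def validate_finding(finding):
    """Deterministic check on the agent's structured output:
    plain field-presence and enum checks, no second LLM as judge."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in finding]
    verdict = finding.get("Verdict")
    if verdict is not None and verdict not in ALLOWED_VERDICTS:
        errors.append(f"invalid verdict: {verdict}")
    return errors

finding = {"Who": "173.192.158.3", "What": "credential spray",
           "When": "Dec 2 04:12:40 - 04:17:19", "Where": "SSH auth logs",
           "Why": "automated brute force", "Verdict": "MALICIOUS"}
print(validate_finding(finding))  # → []
```

An empty error list means the finding can flow into downstream tooling; anything else gets bounced back to the agent for a retry.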

Results summary

  • Entities investigated: 20 (10 IPs, 10 usernames)
  • Time: about 2 minutes
  • MALICIOUS verdicts: 3 (HIGH confidence)
  • SUSPICIOUS verdicts: 15 (mostly MEDIUM confidence)
  • Model: Hermes 2 Pro 7B (local)
  • Hardware: RTX 5080

Twenty entities investigated in two minutes on consumer hardware. No cloud API, no credits burned.

What the agent missed

The agent correctly flagged distributed attack patterns across common usernames like admin, ftpuser, and guest. It missed usernames like D-Link, PlcmSpIp, vyatta, and pi, which are default credentials for IoT devices (D-Link routers, Polycom VoIP phones, Vyatta network appliances, Raspberry Pis). The server was being targeted by Mirai-style botnets scanning for factory defaults.

The agent asked the right question ("is this a valid system account?") but lacked the domain knowledge to answer it. That is expected. A 7B model trained months ago does not have current threat intelligence.

The fix is not a smarter prompt; it is another tool: a threat intel lookup, a web search tool, or an enrichment API. The architecture already supports adding tools. That is the point of the split.
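As a sketch of what such a tool could look like: a static lookup built from the defaults named above. The dictionary here is illustrative, not a real intel feed; a production version would call an enrichment API instead:

```python
# Hypothetical enrichment tool. The username list is illustrative,
# drawn from the defaults discussed above, not a real intel feed.
IOT_DEFAULT_USERS = {
    "D-Link": "D-Link router default account",
    "PlcmSpIp": "Polycom VoIP phone default account",
    "vyatta": "Vyatta network appliance default account",
    "pi": "Raspberry Pi default account",
}

def enrich_username(username):
    """Answer the agent's question 'is this a valid system account?'
    with a static lookup; a real tool would query a threat intel API."""
    if username in IOT_DEFAULT_USERS:
        return f"KNOWN IoT DEFAULT: {IOT_DEFAULT_USERS[username]}"
    return "no match in default-credential list"

print(enrich_username("PlcmSpIp"))
```

With this registered alongside query_ip and query_domain, the agent's "is this a valid system account?" question becomes answerable without any prompt changes.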

What I got wrong first

Assuming the agent would need more context to be useful. A bigger model, a longer context window, more tokens per call. None of that was the limiting factor. The limiting factor was the set of tools the agent could call, and those tools were under a hundred lines of Python each.

Tradeoffs

Two query tools plus a 5 W's output format is a low-ceiling, high-floor setup. It will not beat a trained human analyst on subtle cases, but it will not go off the rails either. Adding tools is cheap: each new tool extends the agent's effective reach without requiring any prompt changes. The ceiling lives in the tool set, not the model.