Why I Stopped Flagging Anomalies and Started Profiling Entities
I expected the LLM to do the heavy lifting. I learned most of the work should be deterministic.
I started this project thinking the LLM would do the heavy lifting. By the end, most of the work was Python, and the LLM only handled reasoning. That shift taught me more than any individual tool I built.
The Problem With Message-Centric Analysis
My first approach was simple: feed logs to an anomaly detector, flag the rare ones, have the LLM triage them.
On 86,000 auth.log records, it flagged 269,000 anomalies, more flags than there were records.
Every unique IP got flagged because that specific IP only appeared a few times. Technically correct. Completely useless. The output was so noisy that an analyst would be overwhelmed before finding anything real.
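Here's roughly what that first pass boiled down to (a minimal sketch; the field name, threshold, and record format are illustrative, not my actual implementation):
# Naive message-centric flagging: anything rare is an "anomaly".
from collections import Counter

def flag_rare_values(records, field="src_ip", threshold=5):
    counts = Counter(r[field] for r in records if field in r)
    anomalies = []
    for r in records:
        value = r.get(field)
        if value is not None and counts[value] < threshold:
            anomalies.append({"record": r, "reason": f"{field}={value} seen only {counts[value]}x"})
    return anomalies

# Every one-off scanner IP trips the threshold, so the output is a flood
# of technically-rare, practically-useless "anomalies".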
I tried prompt engineering. I added chain-of-thought forcing to make the LLM justify its triage decisions. Variance got worse: ten runs gave me 0, 20, 20, 20, 20, 9, 20, 20, 20, and 20 anomalies flagged.
That's when I stopped thinking about prompts and started thinking like an analyst.
The Analyst Perspective
If someone handed me this output at work, what would I actually do with it?
I'd ignore the individual messages. The first thing I'd do is group by IP. Then I'd ask: "What is this IP doing overall?"
The tool was answering "is this message rare?" when I needed "is this entity suspicious?" That's a behavioral question, not a statistical one.
Before (message-centric):
Anomaly #1: "Failed password for root from 61.197.203.243" (MEDIUM)
Anomaly #2: "Failed password for admin from 61.197.203.243" (MEDIUM)
Anomaly #3: "Failed password for test from 61.197.203.243" (LOW)
... 47 more anomalies from the same IP ...
An analyst has to mentally group these, figure out what the IP is doing, cross-reference with other IPs. That's investigative work the tool should handle.
After (entity-centric):
CREDENTIAL STUFFING (HIGH):
IP 61.197.203.243 attempted 47 different usernames
→ Block this IP. Investigate if any succeeded elsewhere.
The profiler did the grouping. The analyst gets actionable intelligence, not raw events.
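The grouping itself is just deterministic Python. A minimal sketch of the idea behind the IP profiler (field names and the distinct-username threshold are illustrative, not the exact code):
from collections import defaultdict

def profile_ips(records):
    """Group auth events by source IP and summarize each IP's behavior."""
    profiles = defaultdict(lambda: {"failed_logins": 0, "usernames": set()})
    for r in records:
        ip = r.get("src_ip")
        if ip is None:
            continue
        if r.get("event") == "failed_password":
            profiles[ip]["failed_logins"] += 1
            profiles[ip]["usernames"].add(r.get("username"))
    return profiles

def credential_stuffing_findings(profiles, min_usernames=10):
    """Emit one finding per suspicious IP instead of one flag per message."""
    return [
        f"CREDENTIAL STUFFING (HIGH): IP {ip} attempted {len(p['usernames'])} different usernames"
        for ip, p in profiles.items()
        if len(p["usernames"]) >= min_usernames
    ]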
Profilers Worked Until They Didn't
I built two profilers: one for IPs, one for usernames. For auth.log analysis, they were perfect. Fast, deterministic, zero variance.
Then I wanted to test on a larger, more diverse dataset: Splunk's BOTS (Boss of the SOC), 12 million events across firewall, IDS, DNS, and HTTP logs.
The problem hit immediately. BOTS has different field types. Firewall logs have source/destination IPs, ports, actions. IDS logs have signatures and severity. HTTP logs have URLs and user agents.
Building a profiler for every field type across every log source wasn't going to scale. At work, I deal with custom datasets constantly. If I had to write a new profiler every time a new data source showed up, the maintenance burden would kill the project.
The Agent Approach
Instead of pre-computing everything, I gave the agent tools to query what it needed:
# Before: pre-computed profiles for everything
ip_profiles = build_ip_profiles(records) # Have to know fields upfront
user_profiles = build_user_profiles(records)
# After: agent queries what it needs
def query_ip(ip: str) -> dict:
    """Returns behavioral summary for any IP across all log sources."""
    return aggregate_entity_data(ip, entity_type="ip")
The agent starts from alerts (like a real analyst would), then pulls relevant data as the investigation unfolds. Adding a new data source means adding a query function, not building a new profiler.
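The loop around those tools is also plain Python; the model only decides what to look at next and what the results mean. A rough sketch of how such a tool-calling loop can look (the llm client interface and tool registry are assumptions, not the exact implementation; query_ip and query_domain are the two tools described here):
TOOLS = {
    "query_ip": query_ip,          # behavioral summary for an IP
    "query_domain": query_domain,  # behavioral summary for a domain
}

def investigate(alerts, llm, max_steps=20):
    """Start from alerts, let the model request entity summaries, stop at a verdict."""
    context = [{"role": "user", "content": f"Initial alerts:\n{alerts}"}]
    for _ in range(max_steps):
        response = llm.chat(context, tools=list(TOOLS))   # hypothetical client call
        if response.tool_call is None:
            return response.content                       # structured findings
        tool = TOOLS[response.tool_call.name]
        summary = tool(**response.tool_call.arguments)    # Python does the counting
        context.append({"role": "tool", "content": str(summary)})
    return "Investigation exceeded step budget"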
Here's what the agent produces:
Finding: 40.80.148.42
- Who: IP address 40.80.148.42
- What: Triggered 589 IDS alerts including XSS, SQL injection, and information leak attempts
- When: August 10, 2016, with 28,119 total events
- Where: Targeting internal server 192.168.250.70
- Why: Repeated, diverse attack signatures indicate active exploitation attempt
- Verdict: MALICIOUS
Two tools. Query IP, query domain. That's all it had. And it traced an attack from web scanner to ransomware infection.
Where the LLM Lives Now
The architecture that worked:
┌────────────────────────────────────────┐
│ Python Orchestration (no LLM)          │
│ - Decides what data to gather          │
│ - Calls query tools                    │
│ - Manages context window               │
└────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ Query Tools (Python, fast)             │
│ - Returns SUMMARIES, not raw data      │
│ - Deterministic, zero variance         │
└────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ LLM Analysis (Hermes 2 Pro 7B)         │
│ - Receives pre-fetched summaries       │
│ - Reasons about what the data means    │
│ - Outputs structured findings          │
└────────────────────────────────────────┘
Python handles orchestration and data gathering. The LLM only does what LLMs are good at: reasoning about information and generating natural language analysis.
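In code, that split looks roughly like this (a sketch under my own naming; the llm.generate call stands in for whatever inference client you use):
def analyze_ip(ip, llm):
    """Python gathers and summarizes; the LLM only reasons over the summary."""
    summary = query_ip(ip)   # deterministic, fast, zero variance
    prompt = (
        "You are a security analyst. Based only on this behavioral summary, "
        "write a finding with Who/What/When/Where/Why and a verdict:\n"
        f"{summary}"
    )
    return llm.generate(prompt)   # reasoning and natural language only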
The Numbers
| Metric | Message-Centric | Profilers | Agent |
|---|---|---|---|
| Processing time (86K records) | 617s | 22.7s | ~20 min |
| Variance | 26% CoV | 0% | Low (reasoning only) |
| New data source effort | N/A | Build new profiler | Add query function |
| Output usefulness | Noise | Good for single source | Good for any source |
The agent is slower than profilers for single-source analysis. But it scales to any data source without custom code per field type.
What I'd Tell Someone Starting Fresh
I can't give a one-sentence answer. I'd start by asking: "What's your dataset like? How many different data sources are you pulling in?"
- Single source, known fields: Profilers. Fast, deterministic, zero variance.
- Multiple sources, varied fields: Agent with tools. More flexible, reasonable speed.
- Either way: Do as much as possible in Python. The LLM should reason, not count.
The biggest surprise from this project: not only could most of the work be done in deterministic Python, it should be. LLMs are expensive, slow, and non-deterministic. Use them for what they're good at (reasoning, synthesis, natural language) and use Python for everything else. I wrote more about this pattern in Deterministic Where You Can, LLM Where You Must.
The Pattern
Looking back across all the tools I built:
| Tool | LLM Role |
|---|---|
| Format Converter | LLM for everything |
| Log Parser | Regex-first, LLM-fallback |
| Anomaly Flagger | Stats-first, LLM-selective |
| Profilers | Python-only |
| Agent | LLM for reasoning only |
Each iteration moved more work out of the LLM and into deterministic code. That's the pattern I'd follow from the start next time.