AI Data Analyst
An agent pipeline that turns messy data into structured analysis.
$ cat story.md
This project exists because of time. At work, I build tools for security analysts who need to process massive amounts of data quickly. The bottleneck isn't computing power, it's the manual work of cleaning, formatting, and making sense of messy inputs.
The dream: feed it CSVs, logs, API responses, whatever, and get back structured output ready for analyst review. Not just formatted data, but actual analysis with anomalies flagged and patterns identified.
I'm building this on local AI (llama.cpp) for three reasons: cost (API calls add up), privacy (some data shouldn't leave the machine), and learning (I want to understand how this works at a lower level than just calling an API).
$ cat pipeline.txt
Basic tool schema, local LLM setup
Complex output, error reporting
Multi-field processing, transformation
Pattern detection, auto-format detection
Multi-tool orchestration
Multi-source extraction, format handling
Agent tool design, data summarization
Agentic workflows, LLM tool orchestration
$ cat current-status.md
Project Complete
Retrospective
This project is done. What started as a generic data analyst became a security investigation agent, and that scope creep taught me more than the original plan would have.
The final architecture: Python handles orchestration and data gathering, the LLM reasons about what it finds. Not the autonomous agent I imagined at the start, but it works on consumer hardware (RTX 5080, 7B model) and produces analyst-ready output.
Eleven blog posts document the journey, including a capstone retrospective on what I learned about scope creep, AI collaboration, and planning for outputs first. The tools work. The architecture is sound. Time for the next project.
Milestones
8 / 8
$ cat agent-tools.txt
The Investigation Agent queries log data to assess threats. Python handles aggregation, the LLM does reasoning.
Activity summary across Suricata, FortiGate, DNS, HTTP logs
DNS queries, resolved IPs, alert associations for a domain
Windows logon events, privilege escalations, host access
Find connected IPs, domains, or users for lateral movement
- Running local LLMs with llama.cpp
- Tool-use patterns for structured output
- Multi-step agent pipelines
- Data transformation at scale
$ ls ./blog/ --project="AI Data Analyst"
When the Agent Found the Attacks
Jan 5, 2026
I gave a 7B model query tools and asked it to answer the 5 W's of an investigation. On two different datasets, it found the attacks.
Why I Stopped Flagging Anomalies and Started Profiling Entities
Jan 5, 2026
I expected the LLM to do the heavy lifting. I learned most of the work should be deterministic.
Deterministic Where You Can, LLM Where You Must
Jan 3, 2026
I ran my data cleaner three times on the same input. Got three different results. That's when I started building safety nets.
Building a Local LLM Security Agent on Consumer Hardware
Jan 2, 2026
I avoided local AI for months. Work forced my hand, and I had a security agent running in a week.
Published "From Data Analyst to Security Analyst: A Week of Building with AI" covering the full journey: scope creep, architectural pivots, what I learned about AI collaboration, and why the next project needs guardrails.
Tested the agent on auth.log using statistics alone (no IDS signatures). Agent correctly identified brute force and credential stuffing patterns from behavioral data. Missed IoT default credentials, but architecture is sound.
Published "From Profilers to Agent Investigation" explaining the architectural pivot, and "Testing the Agent on 12 Million Events" documenting the BOTS results with actual agent output.
Ran investigation on 12.6M events (no entity limit). Agent found scanner IP, target server, ransomware victim, and C2 infrastructure in ~20 minutes. Results match published BOTS writeups used for analyst training.
Limited test with 3 entities confirmed the approach works. Agent correctly identified the attacker IP, target server, and victim domain from the BOTS scenario.
Built query_ip and query_domain tools that summarize data for the LLM. Investigator agent uses tool calling to query entities extracted from alerts. Context resets between entities to stay within 8K limit.
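For reference, a minimal sketch of how tools like these can be declared in the OpenAI-compatible tools format that tool-capable models served by llama.cpp (e.g. Hermes 2 Pro) understand; the parameter schemas here are illustrative, not the project's exact definitions:

```python
# Tool schemas handed to the model; Python executes the matching function
# when the model emits a tool call, then feeds the summary back as context.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "query_ip",
            "description": "Summarize all log activity for a source or destination IP.",
            "parameters": {
                "type": "object",
                "properties": {"ip": {"type": "string"}},
                "required": ["ip"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "query_domain",
            "description": "Summarize DNS queries, resolved IPs, and alerts for a domain.",
            "parameters": {
                "type": "object",
                "properties": {"domain": {"type": "string"}},
                "required": ["domain"],
            },
        },
    },
]
```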
Extracted 12.6M events from BOTS v1: Fortigate (7.7M), Suricata (3.6M), DNS (1.4M), HTTP (39K). Windows Event Logs skipped (no handler built, scope control).
Profilers worked for auth.log but building one per field type per log source would not scale. New approach: LLM-driven investigator queries data on demand starting from alerts. Same principle (Python aggregates, LLM reasons) but agent drives the investigation.
LLM correlation tested on 86k records. Identified coordinated credential stuffing campaign across 7 IPs. Smart cleaning implemented (skip when validation passes). Pipeline now runs in 22 seconds, down from 617s.
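A rough sketch of that smart-cleaning gate; `passes_validation` and `llm_clean` are placeholders standing in for the pipeline's own validator and cleaner:

```python
def clean_if_needed(records, schema, passes_validation, llm_clean):
    """Skip the non-deterministic LLM cleaner when validation already passes."""
    cleaned, llm_calls = [], 0
    for record in records:
        if passes_validation(record, schema):
            cleaned.append(record)          # deterministic fast path, no LLM call
            continue
        llm_calls += 1
        fixed = llm_clean(record, schema)   # LLM normalizes types, enums, whitespace
        # Re-validate: keep the fix only if the deterministic check now passes.
        cleaned.append(fixed if passes_validation(fixed, schema) else record)
    return cleaned, llm_calls
```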
Integrated IP Profiler, Username Profiler, and Correlator into analysis_pipeline.py. New flow: detect → parse → validate → clean → profile → correlate. Removed old anomaly-flagger stages. Added --no-llm flag for fast deterministic analysis.
Built tool that combines IP and username profiles. Pre-correlation (Python) finds credential stuffing, distributed brute force, potential compromises. Optional LLM analysis for deep campaign detection. 125 HIGH severity findings from test data.
Groups events by target username. Detects distributed attacks (many IPs targeting one account) and potential compromises (success after failures). Pure Python, 0% variance. 543 usernames profiled, 69 HIGH priority.
Groups events by source IP, classifies into SCANNING, AUTH_FAILURE, AUTH_SUCCESS, DISCONNECT, PRIVILEGE. Scores by behavioral diversity. Pure Python, 0% variance. Processes 86k records in 3.3 seconds (was 41s with LLM triage).
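A condensed sketch of the entity-centric idea: group by source IP, bucket each event into a behavior category, score by diversity. The regex patterns and field names are simplified stand-ins, not the profiler's actual rules:

```python
import re
from collections import defaultdict

CATEGORIES = {
    "AUTH_FAILURE": re.compile(r"Failed password|authentication failure", re.I),
    "AUTH_SUCCESS": re.compile(r"Accepted (password|publickey)", re.I),
    "SCANNING": re.compile(r"Did not receive identification|invalid user", re.I),
    "DISCONNECT": re.compile(r"Received disconnect|Connection closed", re.I),
    "PRIVILEGE": re.compile(r"sudo|session opened for user root", re.I),
}

def profile_ips(events: list[dict]) -> dict[str, dict]:
    """events: [{'src_ip': ..., 'message': ...}] -> per-IP behavior profile."""
    profiles: dict[str, dict] = defaultdict(lambda: {"counts": defaultdict(int)})
    for ev in events:
        for category, pattern in CATEGORIES.items():
            if pattern.search(ev["message"]):
                profiles[ev["src_ip"]]["counts"][category] += 1
                break
    for prof in profiles.values():
        # Score by behavioral diversity: more distinct behaviors = more interesting.
        prof["score"] = len(prof["counts"])
    return dict(profiles)
```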
Core insight from variance testing: the problem was not the prompt, it was the model. Message-centric anomaly detection creates noise (50 scanner IPs = 50 anomalies). Entity-centric profiling creates signal (50 scanner IPs = 50 behaviors on 1 profile). Analysts naturally group this way.
Built orchestrator that chains all 5 tools. Single command runs full pipeline with auto-detect, metrics collection, and variance testing. Validated on 86k records in 41s. All pipeline components complete.
Ran 3 iterations to measure consistency. Statistical detection: perfectly stable (0% CoV). Investigation severity: stable. LLM triage decisions: variable (26% CoV). Now we know exactly where to focus prompt tuning.
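The consistency number here is a coefficient of variation (standard deviation over mean). A minimal sketch of the measurement, with illustrative run counts:

```python
from statistics import mean, pstdev

def coefficient_of_variation(counts: list[float]) -> float:
    """CoV as a percentage; 0% means identical results across runs."""
    return 100.0 * pstdev(counts) / mean(counts)

# e.g. anomalies kept by LLM triage in each of 3 runs (illustrative numbers)
print(f"{coefficient_of_variation([19, 14, 12]):.0f}% CoV")
```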
Observed inconsistent severity ratings (same pattern rated HIGH vs CRITICAL). Tested hypothesis: prompting for justification before rating. Result: all 5 investigations got consistent HIGH ratings, and analysts now get the "why" behind each rating. Chain-of-thought forcing works.
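A sketch of the reordered prompt; the wording is illustrative, the point is that the justification is requested before the severity field:

```python
TRIAGE_PROMPT = """You are reviewing one anomaly from SSH authentication logs.

First, in 2-3 sentences, explain what the pattern is and why it matters.
Only after the justification, assign a severity.

Respond as JSON: {"justification": "...", "severity": "LOW|MEDIUM|HIGH|CRITICAL"}
"""
```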
Ran complete three-phase pipeline on 86k SSH attack logs. LLM triage filtered 20 → 19 (only 1 dismissed). Deep investigation found real attack patterns: brute force with common usernames, scanning probes, unusual disconnects. Added --summary-only flag for clean output.
Solved the noise problem in the same session. Added auto-detection: skip fields by name pattern (timestamp, pid, uuid) or cardinality ratio (>60% unique). Result: 16% fewer anomalies, 5x faster runtime. The analyst shouldn't have to pre-analyze data.
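A sketch of that auto-skip heuristic, assuming flat parsed records; the name patterns and the 60% threshold are the ones mentioned above:

```python
SKIP_NAME_PATTERNS = ("timestamp", "pid", "uuid")

def skippable_fields(records: list[dict], unique_ratio: float = 0.60) -> set[str]:
    """Fields to exclude from anomaly flagging: noisy by name or by cardinality."""
    skip = set()
    fields = {key for record in records for key in record}
    for field in fields:
        if any(pat in field.lower() for pat in SKIP_NAME_PATTERNS):
            skip.add(field)
            continue
        values = [r[field] for r in records if field in r]
        if values and len(set(values)) / len(values) > unique_ratio:
            skip.add(field)   # near-unique fields (timestamps, PIDs) create noise
    return skip
```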
Tested full pipeline on SecRepo auth.log (86,839 SSH attack logs). Log Parser: 0.53 seconds, 100% success, regex-first validated. Anomaly Flagger: revealed design gap where high-cardinality fields (timestamps, PIDs) create noise.
Published fourth blog post covering the three-phase architecture. Core insight: treat context window as a finite resource. Stats do bulk work (free), LLM triage is batched (efficient), deep investigation resets context between anomalies (prevents bias).
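A sketch of the context-reset pattern: every deep investigation starts from a fresh message list so earlier findings can't bias later ones. `chat` is a placeholder for the local LLM call:

```python
def investigate(anomalies, chat, system_prompt):
    """Each anomaly gets a fresh context: no history carried between investigations."""
    reports = []
    for anomaly in anomalies:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Investigate this anomaly:\n{anomaly}"},
        ]
        reports.append(chat(messages))   # chat() wraps the local LLM HTTP call
    return reports
```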
Knowledge check revealed sub-optimal design: Format Converter (always LLM) and Log Parser (regex-first) have overlapping functionality. Future consideration: add a routing layer that samples data and picks the fastest path automatically. Analyst shouldn't need to pre-sort data.
Fifth tool done. Second Level 3 tool. Three-phase architecture: stats find outliers, LLM triage filters noise, deep investigation provides actionable analysis. Context window managed as a resource.
Fourth tool done. First Level 3 tool. Auto-detects log format (syslog, JSON, key=value) using regex patterns, with LLM fallback for unknown formats. Regex-first for speed, LLM for flexibility.
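A simplified sketch of the regex-first detection; these patterns are illustrative stand-ins for the tool's own:

```python
import json
import re

SYSLOG_RE = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\S+\s\S+")
KEYVALUE_RE = re.compile(r"^(\S+=\S+\s*)+$")

def detect_format(line: str) -> str:
    """Cheap regex checks first; only unknown formats fall through to the LLM."""
    try:
        json.loads(line)
        return "json"
    except ValueError:
        pass
    if SYSLOG_RE.match(line):
        return "syslog"
    if KEYVALUE_RE.match(line):
        return "key=value"
    return "unknown"   # caller routes these to the LLM fallback

print(detect_format('{"event": "login", "user": "root"}'))      # json
print(detect_format("action=deny src=10.0.0.5 dst=10.0.0.9"))   # key=value
```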
Third tool done. Uses local LLM to normalize JSON data based on schema. Handles type conversions, enum normalization, whitespace trimming. Key learning: LLMs are non-deterministic, which is why you chain with the deterministic Schema Validator.
Second tool done. Validates JSON against JSON Schema files, outputs structured violation reports with paths, messages, and expected vs actual values. Pure Python with jsonschema library, no LLM needed for v1. Teaches handling multiple inputs and complex error reporting.
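A minimal sketch of that validation pattern with the jsonschema library; the report fields here are illustrative, not the tool's exact output:

```python
import json
from jsonschema import Draft7Validator

def validate(record: dict, schema: dict) -> list[dict]:
    """Return structured violations: path, message, expected vs actual."""
    violations = []
    for err in Draft7Validator(schema).iter_errors(record):
        violations.append({
            "path": "/".join(str(p) for p in err.absolute_path) or "<root>",
            "message": err.message,
            "expected": err.schema.get(err.validator),  # e.g. the expected type or enum
            "actual": err.instance,
        })
    return violations

schema = {"type": "object",
          "properties": {"port": {"type": "integer"}},
          "required": ["port"]}
print(json.dumps(validate({"port": "443"}, schema), indent=2))
```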
First tool in the pipeline is working. Built a Python CLI that calls the local LLM via HTTP and converts messy data (CSV, logs, key-value) to structured JSON. Tested with real firewall logs. Understanding how LLM APIs work (messages array, roles, HTTP transport) made building the tool straightforward.
Python calling llama.cpp server via HTTP requests. OpenAI-compatible format: messages array with system/user/assistant roles. Same pattern works with any LLM API.
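A minimal sketch of the call, assuming the llama.cpp server is listening on localhost:8080 with its OpenAI-compatible endpoint enabled; the prompts are placeholders:

```python
import requests

def chat(system_prompt: str, user_prompt: str) -> str:
    """One chat completion against a local llama.cpp server."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.2,
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions",
                         json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("You convert messy logs to JSON.",
               "src=10.0.0.5 dst=10.0.0.9 action=deny"))
```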
Tested llama.cpp server mode on port 8080. OpenAI-compatible API confirmed working. Multi-line prompts work properly via JSON body. Performance: ~22,000 t/s prompt (cached), 93 t/s generation.
Interactive prompts work. Learned that each newline submits a message (no multi-line input). Workaround: use -f prompt.txt for complex prompts.
RTX 5080 needs CUDA 12.8 for native sm_120 support. Upgraded from 12.6 and got 19x performance improvement. Generation went from 5.2 t/s to 100 t/s.
Downloaded Hermes 2 Pro 7B Q4_K_M (4.1GB). Built llama.cpp with CUDA support. Model loads and generates text.
Chose Format Converter as the first tool to build. Set up llama.cpp infrastructure. The goal: learn tool building from the ground up.