cd ../projects
Built · Python · pandas · llama.cpp · ElevenLabs (skipped) · Multi-Agent Orchestration

Agent-Driven Discovery

Multi-agent system that explores datasets and finds interesting insights.

$ cat story.md

I love data. I used to pull random datasets from Kaggle and visualize them just for fun. What if AI agents could do the exploration part?

The idea: point a team of agents at a dataset and let them decide what's interesting. An Explorer proposes insights, a Validator pushes back. The back-and-forth is where the magic happens.

This is a showcase project: I run it locally and publish the interesting findings as blog posts, with visualizations of how the agents got there. It's also practice with multi-agent orchestration, monitoring, and the pandas-to-LLM pattern.

$ cat pipeline.txt

Load Data
Explore
Propose
Question
Validate
Visualize
Components Complete: 12 of 12

Data Tools Layer · Level 1

Pandas tool design, agent-callable functions

LLM Communication · Level 1

Conversation history, JSON parsing, error recovery

Agent Prompts · Level 2

Persona design, structured output instructions

Orchestrator · Level 3

Multi-agent coordination, rejection loops (see the sketch after this list)

Monitoring · Level 2

Structured logging, run capture for visualization

Skeptic Refactor · Level 2

Mandatory feedback loops, structured questioning

Hallucination Detection · Level 2

Column validation, grounding agents in reality

Batch Analysis · Level 2

Measuring agent behavior at scale, approval rates

Exploration Visualizer · Level 3

Visualizing agent decision paths, responsive design patterns

Multiple Personas · Level 3

Different Explorer personalities (Statistician, Storyteller, Detective, Contrarian)

Collaborative Mode · Level 3

Turn-based multi-agent discussions, consensus mechanisms, adversarial voting

Roundtable Replay · Level 3

SVG visualization, playback UX, animating multi-agent conversations

Narrator Agent · Level 2

Cut; project goals achieved without it
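
The orchestration pattern, as a minimal sketch. The `propose`, `ask_question`, and `vote` callables stand in for the real LLM calls (`propose` is assumed to accept an optional question), the verdict shape is an assumption, and the rejection limit sits inside the project's stated 3-5 budget; this illustrates the loop, not the project's actual code.

```python
MAX_REJECTIONS = 3  # within the project's stated 3-5 rejection budget

def explore(dataset, propose, ask_question, vote):
    """Explorer/Skeptic loop; all three callables stand in for LLM calls."""
    feedback = None
    for attempt in range(1, MAX_REJECTIONS + 1):
        insight = propose(dataset, feedback)           # Explorer proposes
        question = ask_question(insight)               # Skeptic MUST ask first
        answer = propose(dataset, feedback, question)  # Explorer answers
        verdict = vote(insight, answer)                # only then approve/reject
        if verdict["approved"]:
            return {"insight": insight, "attempts": attempt}
        feedback = verdict["reason"]                   # rejection feeds the retry
    return None  # rejection budget exhausted
```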

$ cat current-status.md

Project Complete

Done

January 10, 2026

Project complete. Built a multi-agent data exploration system with Explorer/Skeptic pattern, multiple personas, collaborative mode, and visualization components for the portfolio.

Key learnings: mandatory feedback loops beat prompt engineering, specialized personas outperform generic ones, adversarial voting requires explicit thresholds, and SVG visualizations need explicit hex colors (Tailwind opacity classes fail).

The Narrator Agent was cut. The core learning goals were achieved, and adding commentary would have been polish without new insights.

Milestones

23 / 23
Picked dataset (Steam Games, 122k games)
Built 7 pandas tools
Built LLM communication layer
Created Explorer and Validator prompts
Built orchestrator with rejection handling
Built structured JSON logging
First successful exploration run
Tuned Validator to be harder to impress
Ran batch explorations
Built challenge mode (experimental)
Refactored Validator → Skeptic (mandatory questions)
Added column validation for hallucination detection
Tested across multiple datasets
Wrote first blog post
Built multiple Explorer personas
Wrote second blog post (personas)
Created exploration path visualization
Built persona comparison UI
Made visualizations mobile-responsive
Built Collaborative Mode
Wrote third blog post (self-correcting agents)
Built Roundtable Replay visualization
~~Narrator Agent~~ (cut)

$ cat agent-tools.txt

The Explorer agent has access to these data analysis tools:

get_schema()

Dataset structure: columns, types, row count

get_stats(column)

Summary statistics: mean, median, std, min, max

get_distribution(column)

Value distribution: histogram or top categories

correlate(col_a, col_b)

Pearson correlation between two numeric columns

query(filter_expr)

Filter data with pandas expression, return sample

find_outliers(column)

Statistical outliers using IQR method

get_sample(n)

Random sample of n rows from the dataset
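
As an illustration, two of these tools sketched as plain pandas wrappers returning JSON-friendly dicts. The global DataFrame, file path, and return shapes are assumptions; the real signatures may differ.

```python
import pandas as pd

df = pd.read_csv("earthquake1826_2026.csv")  # placeholder path

def find_outliers(column: str) -> dict:
    """Statistical outliers via the IQR method, as listed above."""
    s = df[column].dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    outliers = s[(s < lo) | (s > hi)]
    return {
        "column": column,
        "bounds": [round(float(lo), 2), round(float(hi), 2)],
        "outlier_pct": round(100 * len(outliers) / len(s), 2),
    }

def correlate(col_a: str, col_b: str) -> dict:
    """Pearson correlation between two numeric columns."""
    r = df[col_a].corr(df[col_b])  # pandas defaults to Pearson
    return {"columns": [col_a, col_b], "pearson_r": round(float(r), 3)}
```

Returning small dicts instead of raw DataFrames keeps tool output cheap to serialize into the model's context, which is the pandas-to-LLM pattern mentioned above.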

$ cat exploration-path.txt

Sample exploration run showing the Detective persona analyzing earthquake data. Tool errors (red) are normal; the agent recovers and continues exploring. The Skeptic catches a hallucination mid-run when the Explorer starts discussing "budget and revenue" on earthquake data, forcing a correction.

detective · earthquake1826_2026.csv · 36k tokens · 57.7s
Legend: Tool Call · Explorer Proposal · Skeptic Question · Skeptic Rejection · Approved
get_schema · 22 tok
find_outliers(column=depth) · 106 tok
get_distribution(column=mag, top_n=10) · 77 tok
correlate(col_a=depth, col_b=mag) · 83 tok
query [ERROR] · 122 tok
query [ERROR] · 96 tok
get_sample(n=5, filter_expr=depth > 107.75) [ERROR] · 76 tok
Explorer: The presence of a significant number of earthquakes with depths greater than the upper bound of the expected range suggests either anomalous earthquake behavior or measurement errors in the depth data.
Skeptic asks: What is the expected range of depth values for earthquakes in this dataset, and how do the outliers deviate from this range?
Explorer: The positive correlation between budget and revenue suggests that investing more in a company's operations and marketing leads to higher returns. This relationship is supported by the high correlation coefficient of 0.88, indicating a strong connection between the two financial metrics. Companies with larger budgets tend to generate more revenue, which could be attributed to better brand recognition, increased market share, and more effective product development.
Skeptic rejects: Hallucination (references non-existent data)
ATTEMPT 2
Explorer: The presence of outliers in the depth column, representing 12.56% of the total depth values, suggests that these unusual earthquake depths could be attributed to either anomalous earthquake behavior or measurement errors. Further investigation into these specific cases may help determine the root cause of this deviation.
Skeptic asks: What is the distribution of depth values for the outliers found in the depth column?
Explorer: The distribution of outliers in the depth column, which represent 12.56% of the total depth values, shows that these unusual earthquake depths have a wider range and a more spread-out shape compared to the main distribution. This indicates that the outliers deviate significantly from the typical depth values, and further investigation into these specific cases may help determine the root cause of this deviation.
✓ Approved
APPROVED after 1 rejection, 7 tool calls

$ diff personas.txt

Same earthquake dataset, three different personas. The Statistician focuses on distributions and correlations, the Detective hunts for anomalies, and the Storyteller looks for narratives in the data.

6 tools · 0 rejects · 37.9s · OK ✓
get_schema
get_stats(column=mag)
get_stats(column=depth)
correlate(col_a=mag, col_b=depth)
find_outliers(column=depth)
query(filter_expr=depth > 50.3, sample_n=10)
Explorer: The distribution of latitude for the top 10 deepest earthquakes is significantly skewed towards the southern hemisphere.
Skeptic asks: Are the geological conditions in the southern hemisphere more conducive to deep earthquakes along the Pacific Ring of Fire, leading to the skewed latitude distribution?
Explorer: Deep earthquakes along the Pacific Ring of Fire are predominantly found in the southern hemisphere, particularly in the subduction zones between -50° and -15° latitude. The higher frequency of outlier deep earthquakes in this region is consistent with the overall distribution of subduction zones along the Ring of Fire, which are more extensive in the southern hemisphere. In the northern hemisphere, deep earthquakes are less common but still occur, mainly between 15° N and 50° N latitude.
✓ Approved


$ ./roundtable --replay

Watch a collaborative exploration unfold. Six personas analyze earthquake data, building on each other's insights through 8 rounds of discussion until reaching consensus. Press play to see their thought process.

$ cat roundtable-replay.txt
earthquake1826_2026.csv · 8 rounds · 4m 2s · 87 events
Legend: Active Speaker · Tool Call · Proposal · Agree · Disagree
Table seats: STORY storyteller · DET detective · CON contrarian · SKEP skeptic · STAT statistician · DEF default

- Multi-agent orchestration patterns
- Agent monitoring and observability
- Pandas-to-LLM data summarization
- Critic/validator agent patterns
- Persona-driven prompt engineering
- Collaborative consensus mechanisms

$ ls ./blog/ --project="Agent-Driven Discovery"

Jan 9, 2026
Roundtable Replay visualization complete [Roundtable Replay]

Built animated visualization for collaborative mode. Shows 6 personas seated around a circular table, with discussion feed playing back events step-by-step. Includes manual navigation and auto-play modes.

Tailwind opacity classes fail in SVG [Roundtable Replay]

Discovered that Tailwind opacity classes (bg-foreground/20) render as pure black in SVG elements. Had to switch to explicit hex colors for all fills and strokes.

Third blog post: Self-Correcting Multi-Agent Systems [Documentation]

Documented the collaborative mode development arc: repetition loops, folding skeptics, quality validation. Real transcript examples show the iteration process.

Collaborative mode hardened [Collaborative Mode]

Fixed 4 major issues: repetition detection (80% word overlap), adversarial voting prompts, filter_expr support in tools, quality validation with correlation thresholds.
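
Of those fixes, repetition detection is the easiest to show. A minimal sketch, assuming turns are compared as bags of words; the 80% threshold comes from the entry above, the function shape does not.

```python
def is_repetitive(message: str, history: list[str], threshold: float = 0.8) -> bool:
    """Flag a turn whose words overlap >= 80% with any earlier turn."""
    words = set(message.lower().split())
    if not words:
        return False
    return any(
        len(words & set(prior.lower().split())) / len(words) >= threshold
        for prior in history
    )
```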

Adversarial agents finally disagree [Collaborative Mode]

After adding explicit correlation thresholds and separate adversarial vote prompts, Skeptic and Contrarian now vote DISAGREE when proposals cite weak evidence. 4-2 votes instead of 6-0.
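
A sketch of what "explicit thresholds" might look like in code. The 0.5 correlation cutoff and the 4-of-6 consensus rule are illustrative assumptions; only the idea of keying adversarial votes to a concrete evidence bar comes from the entry above.

```python
MIN_CORRELATION = 0.5  # assumed cutoff, not the project's actual value

def adversarial_vote(proposal: dict) -> str:
    """Skeptic/Contrarian vote DISAGREE when the cited evidence is weak."""
    strong = abs(proposal.get("correlation", 0.0)) >= MIN_CORRELATION
    return "AGREE" if strong else "DISAGREE"

def has_consensus(votes: list[str], required: int = 4) -> bool:
    """4 of 6 passes, so a 4-2 split like the one above reaches consensus."""
    return votes.count("AGREE") >= required
```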

Collaborative mode functional [Collaborative Mode]

Built roundtable discussion system: 6 personas share context, take turns, vote on consensus. Includes repetition detection, adversarial voting, and quality validation.

Mobile-responsive visualizations complete [Exploration Visualizer]

Added mobile-responsive design to exploration paths. Mobile uses flattened sequential view with colored depth indicators (border thickness). Comparison UI unified to tabbed interface with clear active state highlighting. Tested on actual mobile device.

Persona comparison UI live on site [Exploration Visualizer]

Added tabbed comparison view showing Statistician, Detective, and Storyteller exploring the same earthquake dataset. Users can switch between personas to see different exploration strategies.

Agent tools documentation added [Documentation]

Added tools sections to both Agent-Driven Discovery and AI Data Analyst project pages. Shows what capabilities each agent has: 7 pandas tools for Explorer, 4 investigation tools for AI Data Analyst.

Exploration path visualization complete [Exploration Visualizer]

Built CLI and React visualization showing Explorer → Skeptic decision trees. ASCII flowcharts for terminal, expandable React component for site. Includes --compare mode for side-by-side persona comparison.

Second blog post: Five Personas, One Dataset [Documentation]

Documented the personas feature with comparison results, efficiency metrics, and the hallucination catch during Detective exploration.

Hallucination caught mid-exploration [Hallucination Detection]

Detective persona started talking about "budget and revenue" on earthquake data. The mandatory questioning architecture saved us: Skeptic rejected, Explorer recovered with proper analysis.

Specialized personas outperform default [Multiple Personas]

Focused prompts lead to faster convergence. Specialized personas use 3-6 tool calls vs 15 for default. Clear "what makes a good insight" criteria help the model focus.

Different perspectives, different insights [Multiple Personas]

Same movies dataset, different findings. Contrarian found low-budget success stories (Blumhouse, documentaries) while Statistician confirmed overall correlation. Different questions lead to different answers.

Multiple Explorer personas complete [Multiple Personas]

Built 5 personas: Default, Statistician, Storyteller, Detective, Contrarian. Each has unique personality traits, good/bad insight criteria, and exploration tips. CLI supports --persona all for comparison runs.

Jan 8, 2026
First blog post published [Documentation]

Wrote "Building Feedback Loops into Multi-Agent Systems" documenting the Validator→Skeptic journey, with before/after examples, hallucination warts, and lessons learned.

Column validation catches hallucinations [Hallucination Detection]

Model was talking about "budget and revenue" when analyzing network logs. Added actual column names to Skeptic prompts with instruction to reject insights referencing non-existent data. Fixed.
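
A minimal sketch of that prompt-level grounding, assuming the Skeptic is driven by a plain text prompt; the wording is illustrative.

```python
def build_skeptic_prompt(insight: str, columns: list[str]) -> str:
    """Inject the dataset's real column names so the Skeptic can
    reject insights that reference non-existent data."""
    return (
        "You are a Skeptic reviewing a proposed data insight.\n"
        f"The dataset contains ONLY these columns: {', '.join(columns)}.\n"
        "REJECT any insight that references data outside these columns.\n\n"
        f"Insight: {insight}"
    )
```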

Tested across 5 datasets successfully [Skeptic Refactor]

Movies, earthquakes, network logs, Netflix, time-series all produce legitimate, dataset-specific insights. No more hallucinations after column validation.

Skeptic refactor complete [Skeptic Refactor]

Renamed Validator to Skeptic, made questioning mandatory. Skeptic ALWAYS asks a question before approving. Removed rubber-stamp path. Simple change, dramatic improvement in insight quality.

Challenge mode produces deeper insights [Challenge Mode]

When Validator challenges, insights go from shallow ("budget correlates with revenue") to nuanced ("Drama/Comedy have 0.82 correlation, Documentaries have none"). Follow-up questions work.

LLMs are probabilistic, not rule-followers [Prompts]

Despite "ALWAYS CHALLENGE" in prompts, Validator only challenged ~50% of obvious insights. Need to enforce rules in code, not prompts. Planning Skeptic refactor.

Batch analysis tooling [Monitoring]

Built analyze_runs.py to measure Validator behavior at scale. Reduced first-try approval from 93% to 47% through iterative prompt tuning.
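
A sketch of the kind of metric analyze_runs.py computes; the log layout and field names (a per-run rejections count) are assumptions.

```python
import json
from pathlib import Path

def first_try_approval_rate(log_dir: str = "runs") -> float:
    """Share of runs approved with zero rejections (field names assumed)."""
    runs = [json.loads(p.read_text()) for p in Path(log_dir).glob("*.json")]
    if not runs:
        return 0.0
    return sum(r.get("rejections", 0) == 0 for r in runs) / len(runs)
```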

Data quality > prompt engineering [Data Tools]

Steam dataset had broken Metacritic column (boolean, not score). Explains repetitive Price correlations. Switched to TMDB movies dataset with real numeric relationships.

Jan 7, 2026
First successful exploration runs [Orchestrator]

Two explorations completed on Steam Games dataset. Explorer proposed insights about price/player correlation and popularity/review scores. Both approved by Validator in 4-8 seconds.

MVP pipeline complete [All]

Built all 6 core modules: tools.py, llm.py, prompts.py, orchestrator.py, monitor.py, run.py. End-to-end pipeline working with structured JSON logging.
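
A sketch of structured JSON logging in the monitor.py spirit: one self-describing event per line, so later tooling (batch analysis, visualizers) can replay a run. The event shape is an assumption.

```python
import json, time

def log_event(run_id: str, kind: str, payload: dict, log_dir: str = "runs") -> None:
    """Append one JSON event per line (JSONL); field names are assumed."""
    event = {"run_id": run_id, "ts": time.time(), "kind": kind, **payload}
    with open(f"{log_dir}/{run_id}.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")
```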

Fixed infinite loop bug [Orchestrator]

When Explorer failed to produce valid JSON, the loop continued without incrementing the attempt counter. Added explorer_failures limit to prevent runaway loops.

JSON compliance struggles [Prompts]

LLM sometimes outputs natural language instead of JSON despite instructions. Added retry logic with stronger prompts, max 3 invalid responses before giving up.
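
A minimal sketch of that retry logic; call_llm stands in for the real model call, and the escalation wording is illustrative.

```python
import json

MAX_INVALID = 3  # give up after 3 invalid responses, as described above

def get_json_response(call_llm, prompt: str):
    for _ in range(MAX_INVALID):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Strengthen the instruction before retrying
            prompt += "\n\nRespond with VALID JSON only. No prose."
    return None  # caller counts this as an explorer failure
```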

Steam Games dataset loaded [Data Tools]

Downloaded Steam Games dataset from Kaggle: 122,611 games, 39 columns. Fixed CSV parsing issue where pandas used first column as index.

Jan 6, 2026
Architecture decisions locked [Planning]

Decided: tool-based data access (agent calls pandas), local llama.cpp (Hermes 2 Pro 7B), custom Python orchestration, structured JSON logging, 3-5 rejection limit.
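
For the local-model decision, a hedged sketch using the llama-cpp-python bindings' chat API; the model path, quantization, and parameters are placeholders, and the project may drive llama.cpp differently (e.g. through its server).

```python
from llama_cpp import Llama

# Placeholder GGUF path; Hermes 2 Pro 7B is the model named above.
llm = Llama(model_path="models/hermes-2-pro-7b.Q4_K_M.gguf", n_ctx=4096)

def chat(messages: list[dict]) -> str:
    out = llm.create_chat_completion(messages=messages, temperature=0.7)
    return out["choices"][0]["message"]["content"]
```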

Pivoted from Site Chatbot [Planning]

Started planning a portfolio chatbot but pivoted during conversational discovery. Landed on Agent-Driven Discovery: multi-agent system for autonomous data exploration.

--- journey start ---