January 8, 2026 · Agent-Driven Discovery

Building Feedback Loops into Multi-Agent Systems

How changing from a Validator to a Skeptic (and making it ask questions) dramatically improved insight quality in our data exploration pipeline.

multi-agent · llm · prompt-engineering · local-llm · experimentation

This is a story about how a small change in agent behavior transformed the quality of outputs from our multi-agent data exploration system. The change? Making our "Validator" agent ask questions instead of just approving or rejecting.

The Setup

I'm building a multi-agent system that explores datasets and discovers insights. The architecture is simple:

  1. Explorer: Investigates a dataset using tools (get_schema, correlate, query, etc.)
  2. Validator: Reviews proposed insights and approves or rejects them

The whole thing runs on a local LLM (Hermes 2 Pro 7B via llama.cpp) because I wanted to experiment freely without API costs adding up.
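
For concreteness, here is a minimal sketch of how that two-agent loop could be wired up. None of this is the actual code: call_llm() is a placeholder for the llama.cpp call, the prompts are paraphrased, and the tool registry only hints at the real tools.

import json

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for the local model call (Hermes 2 Pro 7B via llama.cpp)."""
    raise NotImplementedError("wire this up to your local model server")

# Tools the Explorer can call; names from the post, bodies omitted.
TOOLS = {
    "get_schema": lambda dataset: list(dataset.columns),
    # "correlate": ..., "query": ..., etc.
}

def explorer_propose(dataset_summary: str) -> dict:
    """Explorer investigates the dataset and proposes an insight as JSON."""
    raw = call_llm("You are the Explorer. Use tools, then propose one insight as JSON.",
                   dataset_summary)
    return json.loads(raw)  # e.g. {"insight": "...", "evidence": ["..."]}

def validator_review(insight: dict) -> dict:
    """Validator reviews the proposed insight and approves or rejects it."""
    raw = call_llm("You are the Validator. Approve or reject this insight as JSON.",
                   json.dumps(insight))
    return json.loads(raw)  # e.g. {"action": "approve"} or {"action": "reject", "reason": "..."}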

The Problem: Rubber-Stamp Approvals

The first version of the Validator was too lenient. I'd run the pipeline and get insights like:

"Movies with higher budgets tend to have higher revenues."

The Validator would approve this immediately. First-try approval rate? 93%.

That's not validation. That's a rubber stamp.

Attempt #1: Make the Validator Meaner

My first fix was to add explicit rejection criteria to the Validator prompt:

REJECT immediately if ANY of these apply:
- Weak correlation (<0.3): "correlation of 0.12" is noise, not insight
- Obvious relationships: "higher budget = higher revenue"
- Missing "so what": who cares? what's the implication?

This helped. First-try approval dropped to 47%. The Validator was now rejecting surface-level observations and forcing the Explorer to try again.

But there was a problem: the Explorer would often just rephrase the same insight or try a completely different angle. The feedback loop was binary (approve/reject) with no guidance on how to improve.
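
In code, that binary loop looks roughly like the sketch below (the propose/review callables and the record shapes are assumptions, not the real implementation). The only thing that flows back to the Explorer on a rejection is the reason string, which is why it tended to rephrase or wander.

from typing import Callable, Optional

def run_binary_loop(
    propose: Callable[[str], dict],   # Explorer: context -> {"insight": ..., "evidence": [...]}
    review: Callable[[dict], dict],   # Validator: insight -> {"action": "approve" | "reject", ...}
    context: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    for _ in range(max_attempts):
        insight = propose(context)
        verdict = review(insight)
        if verdict.get("action") == "approve":
            return insight
        # The only guidance is the rejection reason; there is no question to investigate.
        context += f"\nPrevious attempt was rejected: {verdict.get('reason', '')}"
    return None  # give up after max_attempts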

Attempt #2: Challenge Mode (Optional)

I added an experimental "challenge" mode where the Validator could respond with a question instead of just approve/reject:

{"action": "challenge", "question": "Does this hold for all genres?"}

When the Validator challenged, the Explorer would investigate the question and propose a refined insight. The results were great when it worked:

Before challenge:

"Budget correlates with revenue"

After challenge:

"Drama/Comedy films show 0.82 budget-revenue correlation, but Documentaries show almost none"

The problem? It was inconsistent. Sometimes the Validator would challenge, sometimes it would just approve. I had instructions like "ALWAYS CHALLENGE these patterns" but the model didn't reliably follow them.
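
Mechanically, challenge mode was just a third branch in the review handling, something like this sketch (the handler names are mine, not from the real code):

def next_step(verdict: dict) -> str:
    """Map the Validator's JSON action to the next stage of the pipeline."""
    action = verdict.get("action")
    if action == "approve":
        return "accept_insight"
    if action == "reject":
        return "explorer_retries"                  # propose something new
    if action == "challenge":
        # e.g. {"action": "challenge", "question": "Does this hold for all genres?"}
        return "explorer_investigates_question"    # refine the existing insight
    return "explorer_retries"                      # malformed output is treated as a reject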

The Fix: Rename to Skeptic, Make Questioning Mandatory

The solution was embarrassingly simple: remove the option to approve on first pass entirely.

I renamed "Validator" to "Skeptic" and gave it a new mandate:

On FIRST review, you have TWO options:

1. QUESTION (default - always do this unless rejecting):
{"action": "question", "question": "A specific question..."}

2. REJECT (only for fundamentally broken insights):
{"action": "reject", "reason": "...", "suggestion": "..."}

NEVER approve on first review. Always ask a question unless rejecting.

Then I added a separate "followup" phase where the Skeptic reviews the Explorer's response and can finally approve or reject.
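
Put together, the two-phase flow looks roughly like this sketch; the callables stand in for prompts against the local model, and the JSON shapes follow the mandate above.

from typing import Callable, Optional

def skeptic_round(
    propose: Callable[[], dict],               # Explorer: initial insight
    investigate: Callable[[dict, str], dict],  # Explorer: answer the question, expand the insight
    first_review: Callable[[dict], dict],      # Skeptic pass 1: {"action": "question" | "reject", ...}
    followup_review: Callable[[dict], dict],   # Skeptic pass 2: {"action": "approve" | "reject", ...}
) -> Optional[dict]:
    insight = propose()
    first = first_review(insight)
    if first.get("action") == "reject":
        return None                            # fundamentally broken; start over
    # Approving is not an option on first review, so the only other path is a question.
    expanded = investigate(insight, first["question"])
    final = followup_review(expanded)
    return expanded if final.get("action") == "approve" else None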

The New Data Flow

Before (Validator):

Explorer proposes insight
    ↓
Validator: approve/reject
    ↓
(if rejected, Explorer tries again)

After (Skeptic):

Explorer proposes insight
    ↓
Skeptic asks a probing question
    ↓
Explorer investigates and expands insight
    ↓
Skeptic reviews expanded insight: approve/reject

The key difference: every insight goes through at least one round of questioning. No more rubber stamps.

Results: Before and After

Here's what the same type of finding looks like with each approach:

Movies Dataset

Validator (before):

"Movies with higher budgets tend to have higher revenues."

Evidence: Correlation coefficient 0.73

After Skeptic questioning:

"The positive correlation between budget and revenue holds across most genres, with a few exceptions. Generally, higher budgets lead to higher revenues, except in the Horror genre where a negative correlation is observed."

Evidence:

  • Very strong positive correlation (r=0.73) between budget and revenue overall
  • Action and adventure movies have the highest correlation (0.72 and 0.69)
  • Horror genre shows negative correlation

The Skeptic asked: "Does this hold for all genres, or are some outliers?" and the Explorer actually investigated.

Another Movies Example

Explorer insight (after questioning):

"The relationship between movie budgets and revenues varies by studio, with some companies consistently achieving high revenues from high-budget films, while others exhibit weaker or negative correlations."

Evidence:

  • Disney and Warner Bros show strong positive correlations
  • Pixus and Bleecker Street exhibit weaker or negative correlations

This came from the Skeptic asking about studio-level variation.

Earthquake Dataset

Explorer insight (after questioning):

"The distribution of moderate intensity earthquakes (mag 5.0-6.35) is predominantly driven by high seismic activity regions, such as the Pacific Ring of Fire. In the Eastern Mediterranean, the majority occur in Turkey, Greece, and the Aegean Sea. The Central United States experiences a higher frequency of earthquakes below magnitude 5.0."

Evidence:

  • Statistical summary of magnitudes by region
  • Majority of 5.0-6.35 magnitude earthquakes occur in coastal Japan, Philippines, and California

Network Logs Dataset

Explorer insight (after questioning):

"Intrusion attempts on ports 80, 443, and 3389 follow a consistent pattern of higher frequency on Tuesdays and Thursdays, especially during 12:00 AM to 6:00 AM and 4:00 PM to 8:00 PM. This pattern suggests targeted exploitation during lower network activity periods."

Evidence:

  • Filtered rows show consistent Tuesday/Thursday pattern for ports 80, 443, and 3389
  • Pattern holds across all source and destination IPs

Netflix Dataset

Explorer insight (after questioning):

"The relationship between Rating and Season_Count varies significantly across genres. In Drama and Crime shows, higher ratings are associated with more seasons, while in Comedy and Reality shows, higher ratings lead to fewer seasons."

Evidence:

  • Drama and Crime: +0.22 correlation
  • Comedy and Reality: -0.23 correlation
  • Higher rated Drama/Crime shows are long-running series; higher rated Comedy/Reality are often limited series

The Warts: Hallucinations

This experimental process wasn't all wins. Running a 7B parameter model locally means dealing with hallucinations.

Here's an actual output from the Network_logs dataset before I added safeguards:

"There is a strong positive correlation between budget and revenue in the dataset, indicating that higher budgets lead to higher revenues."

That's the movies insight appearing when analyzing network logs. The model confused which dataset it was exploring.

Another example from the movies dataset:

"Within the Action, Adventure, and RPG genres, high-priced games tend to have higher average ratings in specific subgenres such as Action RPGs..."

It started talking about video games when analyzing movies.

The Fix: Column Validation

I added the dataset's actual column names to the Skeptic's prompt:

Dataset columns: Scan_Type, Port, Payload_Size, Intrusion, Source_IP...

VALIDATION: If the insight references columns or concepts NOT in this
dataset (e.g., "budget" in a network logs dataset), REJECT immediately
with reason "Hallucination - references non-existent data".
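
Here is a sketch of what that grounding can look like in code. The prompt builder paraphrases the snippet above; the mentions_schema() pre-check is an extra guard I would consider adding, not something the pipeline actually does.

def build_skeptic_prompt(columns: list[str], insight_text: str) -> str:
    """Inject the dataset's real schema so the Skeptic can spot foreign concepts."""
    return (
        f"Dataset columns: {', '.join(columns)}\n\n"
        "VALIDATION: If the insight references columns or concepts NOT in this "
        "dataset, REJECT immediately with reason "
        '"Hallucination - references non-existent data".\n\n'
        f"Insight to review: {insight_text}"
    )

def mentions_schema(insight_text: str, columns: list[str]) -> bool:
    """Cheap pre-check: a grounded insight should mention at least one real column."""
    text = insight_text.lower()
    return any(c.lower() in text or c.lower().replace("_", " ") in text for c in columns)

cols = ["Scan_Type", "Port", "Payload_Size", "Intrusion", "Source_IP"]
leaked = "There is a strong positive correlation between budget and revenue..."
print(mentions_schema(leaked, cols))  # False -> flag as a likely hallucination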

After this change, the same Network_logs run produced:

"For 'full' Scan_Type and Port 80, larger Payload sizes (>5kb) are associated with a higher likelihood of intrusion attempts, as evidenced by a correlation of -0.0697 and a histogram distribution showing that 52.43% of full scans with larger Payload_Sizes occur with intrusions."

Legitimate, dataset-specific analysis instead of hallucinated movie insights.

Batch Results

I ran batches to measure the impact:

Before Skeptic (Validator with challenge mode):

Metric                 Value
First-try approvals    80%
Avg rejections/run     0.2
Avg tool calls/run     3.0
Avg duration           33s

After Skeptic (mandatory questioning + column validation):

Metric                   Value
First-round approvals    20%
Avg rejections/run       1.0
Avg tool calls/run       15.0
Avg duration             86s

The Skeptic is "harsher" by the numbers, but the insights are dramatically better. More tool calls mean more investigation, and the longer duration means more thinking. And 0% forced accepts means the system is still producing approvable insights, just after more rigorous examination.
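
For reference, the aggregation behind those tables is simple enough to sketch. The run-record fields below are my assumptions about what gets logged per run, not the pipeline's actual schema.

from statistics import mean

def summarize_batch(runs: list[dict]) -> dict:
    """Aggregate per-run logs into batch metrics like the tables above."""
    return {
        "first_round_approval_rate": mean(r["first_round_approved"] for r in runs),
        "avg_rejections_per_run": mean(r["rejections"] for r in runs),
        "avg_tool_calls_per_run": mean(r["tool_calls"] for r in runs),
        "avg_duration_s": mean(r["duration_s"] for r in runs),
    }

runs = [
    {"first_round_approved": False, "rejections": 1, "tool_calls": 15, "duration_s": 86},
    {"first_round_approved": True,  "rejections": 0, "tool_calls": 12, "duration_s": 70},
]
print(summarize_batch(runs))  # {'first_round_approval_rate': 0.5, ...}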

What I Noticed Along the Way

1. Optional behaviors were unreliable with the 7B model

"ALWAYS CHALLENGE these patterns" didn't work. The model would sometimes follow the instruction, sometimes not. Removing the option entirely (no approve path on first pass) made the behavior consistent. I'm not sure if this is a model size issue or a prompt design issue.

2. Structure helped more than iteration

The original approve/reject loop let the Explorer try again, but without direction. Adding a question-answer phase created a structured feedback loop where each iteration built on specific feedback. That structure seemed to matter more than just "try again."

3. Grounding in column names caught hallucinations

Giving the Skeptic the actual column names let it catch hallucinations. The model could verify "does this insight reference real columns?" even when it couldn't reliably generate correct insights itself.

4. Approval rate was a useful tuning metric

When first-try approval was 93%, I knew the Validator was too lenient. When mandatory questioning dropped it to 20%, insights were getting properly scrutinized. I'm still figuring out what the "right" approval rate is.

What Seemed to Work

The Skeptic refactor was a small change, but it transformed the quality of outputs. Making agents ask questions of each other created behaviors that simple approve/reject gates didn't provide.

I think multi-agent systems might benefit more from structured dialogue than from task handoffs. But I'm still early in this experiment. The next step is trying different personas and seeing if the pattern holds. I documented that experiment in Five Personas, One Dataset.