Building Feedback Loops into Multi-Agent Systems
How turning our Validator into a Skeptic (and making it ask questions) dramatically improved insight quality in our data exploration pipeline.
This is a story about how a small change in agent behavior transformed the quality of outputs from our multi-agent data exploration system. The change? Making our "Validator" agent ask questions instead of just approving or rejecting.
The Setup
I'm building a multi-agent system that explores datasets and discovers insights. The architecture is simple:
- Explorer: Investigates a dataset using tools (get_schema, correlate, query, etc.)
- Validator: Reviews proposed insights and approves or rejects them
The whole thing runs on a local LLM (Hermes 2 Pro 7B via llama.cpp) because I wanted to experiment freely without API costs adding up.
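To make the setup concrete, here's a minimal sketch of the two-agent loop. It assumes the model is served through llama.cpp's OpenAI-compatible endpoint (e.g. llama-server on localhost:8080); the prompt constants and the single chat() helper are placeholders rather than the actual implementation, and the Explorer's tool-calling loop is omitted.

```python
import requests

LLM_URL = "http://localhost:8080/v1/chat/completions"

# Placeholder prompts; the real system prompts are much longer.
EXPLORER_PROMPT = "You explore datasets with tools and propose insights."
VALIDATOR_PROMPT = "You review proposed insights and approve or reject them."

def chat(system_prompt: str, user_prompt: str) -> str:
    """Send one system+user turn to the local model and return its reply."""
    resp = requests.post(LLM_URL, json={
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.7,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# The Explorer proposes an insight (tool calls omitted), then the reviewer judges it.
insight = chat(EXPLORER_PROMPT, "Explore the movies dataset and propose one insight.")
verdict = chat(VALIDATOR_PROMPT, f"Review this insight:\n{insight}")
```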
The Problem: Rubber-Stamp Approvals
The first version of the Validator was too lenient. I'd run the pipeline and get insights like:
"Movies with higher budgets tend to have higher revenues."
The Validator would approve this immediately. First-try approval rate? 93%.
That's not validation. That's a rubber stamp.
Attempt #1: Make the Validator Meaner
My first fix was to add explicit rejection criteria to the Validator prompt:
REJECT immediately if ANY of these apply:
- Weak correlation (<0.3): "correlation of 0.12" is noise, not insight
- Obvious relationships: "higher budget = higher revenue"
- Missing "so what": who cares? what's the implication?
This helped. First-try approval dropped to 47%. The Validator was now rejecting surface-level observations and forcing the Explorer to try again.
But there was a problem: the Explorer would often just rephrase the same insight or try a completely different angle. The feedback loop was binary (approve/reject) with no guidance on how to improve.
Attempt #2: Challenge Mode (Optional)
I added an experimental "challenge" mode where the Validator could respond with a question instead of just approve/reject:
{"action": "challenge", "question": "Does this hold for all genres?"}
When the Validator challenged, the Explorer would investigate the question and propose a refined insight. The results were great when it worked:
Before challenge:
"Budget correlates with revenue"
After challenge:
"Drama/Comedy films show 0.82 budget-revenue correlation, but Documentaries show almost none"
The problem? It was inconsistent. Sometimes the Validator would challenge, sometimes it would just approve. I had instructions like "ALWAYS CHALLENGE these patterns," but the model didn't reliably follow them.
The Fix: Rename to Skeptic, Make Questioning Mandatory
The solution was embarrassingly simple: remove the option to approve on first pass entirely.
I renamed "Validator" to "Skeptic" and gave it a new mandate:
On FIRST review, you have TWO options:
1. QUESTION (default - always do this unless rejecting):
{"action": "question", "question": "A specific question..."}
2. REJECT (only for fundamentally broken insights):
{"action": "reject", "reason": "...", "suggestion": "..."}
NEVER approve on first review. Always ask a question unless rejecting.
Then I added a separate "followup" phase where the Skeptic reviews the Explorer's response and can finally approve or reject.
The New Data Flow
Before (Validator):
Explorer proposes insight
↓
Validator: approve/reject
↓
(if rejected, Explorer tries again)
After (Skeptic):
Explorer proposes insight
↓
Skeptic asks a probing question
↓
Explorer investigates and expands insight
↓
Skeptic reviews expanded insight: approve/reject
The key difference: every insight goes through at least one round of questioning. No more rubber stamps.
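Here's a sketch of that flow, reusing chat() and EXPLORER_PROMPT from the earlier snippet. The SKEPTIC_PROMPT is a stand-in for the full mandate quoted above, and error handling for malformed JSON is omitted:

```python
import json

# Stand-in for the full Skeptic mandate quoted above.
SKEPTIC_PROMPT = "You are a Skeptic. On first review, question or reject; never approve."

def review_with_skeptic(insight: str) -> dict:
    """Run an insight through at least one round of questioning."""
    # Phase 1: first review -- the Skeptic must question or reject.
    first = json.loads(chat(SKEPTIC_PROMPT, f"First review of:\n{insight}"))
    if first["action"] == "reject":
        return {"status": "rejected", "reason": first.get("reason", "")}

    # Phase 2: the Explorer investigates the question and expands the insight.
    expanded = chat(
        EXPLORER_PROMPT,
        f"Original insight:\n{insight}\n"
        f"Skeptic's question: {first['question']}\n"
        "Investigate and propose a refined insight.",
    )

    # Phase 3: follow-up review -- approval is finally allowed here.
    followup = json.loads(chat(SKEPTIC_PROMPT, f"Follow-up review of:\n{expanded}"))
    return {"status": followup["action"], "insight": expanded}
```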
Results: Before and After
Here's what the same type of finding looks like with each approach:
Movies Dataset
Validator (before):
"Movies with higher budgets tend to have higher revenues."
Evidence: Correlation coefficient 0.73
After Skeptic questioning:
"The positive correlation between budget and revenue holds across most genres, with a few exceptions. Generally, higher budgets lead to higher revenues, except in the Horror genre where a negative correlation is observed."
Evidence:
- Very strong positive correlation (r=0.73) between budget and revenue overall
- Action and adventure movies have the highest correlation (0.72 and 0.69)
- Horror genre shows negative correlation
The Skeptic asked: "Does this hold for all genres, or are some outliers?" and the Explorer actually investigated.
Another Movies Example
Explorer insight (after questioning):
"The relationship between movie budgets and revenues varies by studio, with some companies consistently achieving high revenues from high-budget films, while others exhibit weaker or negative correlations."
Evidence:
- Disney and Warner Bros show strong positive correlations
- Pixus and Bleecker Street exhibit weaker or negative correlations
This came from the Skeptic asking about studio-level variation.
Earthquake Dataset
Explorer insight (after questioning):
"The distribution of moderate intensity earthquakes (mag 5.0-6.35) is predominantly driven by high seismic activity regions, such as the Pacific Ring of Fire. In the Eastern Mediterranean, the majority occur in Turkey, Greece, and the Aegean Sea. The Central United States experiences a higher frequency of earthquakes below magnitude 5.0."
Evidence:
- Statistical summary of magnitudes by region
- Majority of 5.0-6.35 magnitude earthquakes occur in coastal Japan, Philippines, and California
Network Logs Dataset
Explorer insight (after questioning):
"Intrusion attempts on ports 80, 443, and 3389 follow a consistent pattern of higher frequency on Tuesdays and Thursdays, especially during 12:00 AM to 6:00 AM and 4:00 PM to 8:00 PM. This pattern suggests targeted exploitation during lower network activity periods."
Evidence:
- Filtered rows show consistent Tuesday/Thursday pattern for ports 80, 443, and 3389
- Pattern holds across all source and destination IPs
Netflix Dataset
Explorer insight (after questioning):
"The relationship between Rating and Season_Count varies significantly across genres. In Drama and Crime shows, higher ratings are associated with more seasons, while in Comedy and Reality shows, higher ratings lead to fewer seasons."
Evidence:
- Drama and Crime: +0.22 correlation
- Comedy and Reality: -0.23 correlation
- Higher rated Drama/Crime shows are long-running series; higher rated Comedy/Reality are often limited series
The Warts: Hallucinations
This experimental process wasn't all wins. Running a 7B parameter model locally means dealing with hallucinations.
Here's an actual output from the Network_logs dataset before I added safeguards:
"There is a strong positive correlation between budget and revenue in the dataset, indicating that higher budgets lead to higher revenues."
That's the movies insight appearing when analyzing network logs. The model confused which dataset it was exploring.
Another example from the movies dataset:
"Within the Action, Adventure, and RPG genres, high-priced games tend to have higher average ratings in specific subgenres such as Action RPGs..."
It started talking about video games when analyzing movies.
The Fix: Column Validation
I added the dataset's actual column names to the Skeptic's prompt:
Dataset columns: Scan_Type, Port, Payload_Size, Intrusion, Source_IP...
VALIDATION: If the insight references columns or concepts NOT in this
dataset (e.g., "budget" in a network logs dataset), REJECT immediately
with reason "Hallucination - references non-existent data".
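In code, this amounts to appending a grounding block with the real schema to the Skeptic's system prompt. A minimal sketch with hypothetical names (build_skeptic_prompt is mine, and the base prompt is a placeholder):

```python
# Grounding block appended to the Skeptic's system prompt.
SKEPTIC_VALIDATION_TEMPLATE = """Dataset columns: {columns}

VALIDATION: If the insight references columns or concepts NOT in this
dataset, REJECT immediately with reason
"Hallucination - references non-existent data"."""

def build_skeptic_prompt(base_prompt: str, columns: list[str]) -> str:
    """Append a grounding block listing the dataset's actual columns."""
    return base_prompt + "\n\n" + SKEPTIC_VALIDATION_TEMPLATE.format(
        columns=", ".join(columns))

# Example for the network logs dataset:
prompt = build_skeptic_prompt(
    "You are a Skeptic reviewing data insights.",
    ["Scan_Type", "Port", "Payload_Size", "Intrusion", "Source_IP"])
```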
After this change, the same Network_logs run produced:
"For 'full' Scan_Type and Port 80, larger Payload sizes (>5kb) are associated with a higher likelihood of intrusion attempts, as evidenced by a correlation of -0.0697 and a histogram distribution showing that 52.43% of full scans with larger Payload_Sizes occur with intrusions."
Legitimate, dataset-specific analysis instead of hallucinated movie insights.
Batch Results
I ran batches to measure the impact:
Before Skeptic (Validator with challenge mode):
| Metric | Value |
|---|---|
| First-try approvals | 80% |
| Avg rejections/run | 0.2 |
| Avg tool calls/run | 3.0 |
| Avg duration | 33s |
After Skeptic (mandatory questioning + column validation):
| Metric | Value |
|---|---|
| First-round approvals | 20% |
| Avg rejections/run | 1.0 |
| Avg tool calls/run | 15.0 |
| Avg duration | 86s |
The Skeptic is "harsher" by the numbers, but the insights are dramatically better. More tool calls mean more investigation. A longer duration means more thinking. And 0% forced accepts means the system is still producing approvable insights, just after more rigorous examination.
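For reference, these metrics are simple aggregates over per-run records; something like the sketch below, where the field names and the two illustrative records are assumptions about the run logs, not actual data:

```python
from statistics import mean

# Illustrative run records; the real pipeline writes one of these per run.
runs = [
    {"first_round_approved": False, "rejections": 1, "tool_calls": 14, "duration_s": 82},
    {"first_round_approved": True,  "rejections": 0, "tool_calls": 9,  "duration_s": 61},
]

print(f"First-round approvals: {mean(r['first_round_approved'] for r in runs):.0%}")
print(f"Avg rejections/run:    {mean(r['rejections'] for r in runs):.1f}")
print(f"Avg tool calls/run:    {mean(r['tool_calls'] for r in runs):.1f}")
print(f"Avg duration:          {mean(r['duration_s'] for r in runs):.0f}s")
```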
What I Noticed Along the Way
1. Optional behaviors were unreliable with the 7B model
"ALWAYS CHALLENGE these patterns" didn't work. The model would sometimes follow the instruction, sometimes not. Removing the option entirely (no approve path on first pass) made the behavior consistent. I'm not sure if this is a model size issue or a prompt design issue.
2. Structure helped more than iteration
The original approve/reject loop let the Explorer try again, but without direction. Adding a question-answer phase created a structured feedback loop where each iteration built on specific feedback. That structure seemed to matter more than just "try again."
3. Grounding in column names caught hallucinations
Giving the Skeptic the actual column names let it catch hallucinations. The model could verify "does this insight reference real columns?" even when it couldn't reliably generate correct insights itself.
4. Approval rate was a useful tuning metric
When first-try approval was 93%, I knew the Validator was too lenient. When mandatory questioning dropped it to 20%, insights were getting properly scrutinized. I'm still figuring out what the "right" approval rate is.
What Seemed to Work
The Skeptic refactor was a small change, but it transformed the quality of the outputs. Making agents question each other produced a level of scrutiny that simple approve/reject gates never did.
I think multi-agent systems might benefit more from structured dialogue than from task handoffs. But I'm still early in this experiment. The next step is trying different personas and seeing if the pattern holds. I documented that experiment in Five Personas, One Dataset.