Results Analysis
Two-phase agent pipeline that compares experiment pairs with statistical rigor and structural enforcement of evidence-before-conclusions.
$ cat story.md
During a session analyzing autoresearch experiment results, Claude proposed three wrong hypotheses in a row. Each was stated with confidence, and each reversed once I told Claude to pull the trace data instead of reasoning from aggregate scores. The ANALYZE_RESULTS_PROTOCOL came out of that session as a checklist, but a checklist is prompt-level guidance, and the ai-data-analyst had already shown that prompt-level rules get ignored (the 5 W's framework was vestigial across 100+ runs).
The solution is to enforce evidence-before-conclusions architecturally: a two-phase pipeline in which Phase 1 must finish collecting evidence before Phase 2 can analyze it. Post-hooks on tool use require that every claim cites a trace_id, every metric reference links to the scorer code, and every 'improved/regressed' judgment has a statistical test behind it. If the agent tries to speculate, it gets blocked.
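The hook rules above can be sketched in a few lines. This is a minimal illustration, not the project's actual hook code: the `Claim` record, its field names, and the `post_hook` return shape are all assumptions made for the example, and the scorer-code link check is left out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Words that signal a directional judgment and therefore need a test behind them.
JUDGMENT_WORDS = re.compile(r"\b(improved|regressed)\b", re.IGNORECASE)


@dataclass
class Claim:
    """Hypothetical claim record; the field names are assumptions for this sketch."""
    text: str
    trace_ids: List[str] = field(default_factory=list)
    stat_test: Optional[str] = None  # e.g. "mann_whitney_u"


def check_claim(claim: Claim, collected_trace_ids: Set[str]) -> List[str]:
    """Return a list of violations; an empty list means the claim may pass."""
    violations = []
    if not claim.trace_ids:
        violations.append("claim cites no trace_id")
    unknown = [t for t in claim.trace_ids if t not in collected_trace_ids]
    if unknown:
        violations.append(f"trace_ids not collected in Phase 1: {unknown}")
    if JUDGMENT_WORDS.search(claim.text) and claim.stat_test is None:
        violations.append("improved/regressed judgment has no statistical test")
    return violations


def post_hook(claim: Claim, collected_trace_ids: Set[str]) -> dict:
    """Block the claim when any rule is violated, and say why."""
    violations = check_claim(claim, collected_trace_ids)
    return {"allowed": not violations, "violations": violations}
```

Returning the violation list instead of a bare boolean gives the agent something concrete to fix on its next attempt.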
The learning output was the piece that made this more than automation. Every statistical test includes a method card explaining why that test was chosen, what the assumptions are, how to interpret the numbers, and when you would use a different test. The harness optimizes for educational quality alongside analytical correctness.
The first e2e run on the GB10 with Hermes-4-70B scored 2.75/10. The analyst correctly ran Mann-Whitney U tests across 3 dimensions (statistical_rigor = 1.0) but failed the discipline checks: it did not cite trace_ids in its claims, it discussed dimensions it had not researched in Phase 1, and it ignored the changelog entirely. These are exactly the behaviors the scoring system was built to catch.
The 16K context window was the main engineering challenge. Trace spans were 8-35KB each, which blew out the context after 4 tool calls. I had Claude condense all tool outputs to summaries (traces went from 35KB to 800 bytes) and implement a fresh-session pattern from the cheat sheet for scratchpad writing. The pipeline now completes 12 tool calls comfortably.
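The condensing step can be sketched as follows. This is an assumption-laden illustration rather than the code Claude wrote: the span fields (`name`, `duration_ms`, `tokens`) and the trimming strategy are hypothetical, and only the 800-byte target comes from the text above.

```python
import json

MAX_SUMMARY_BYTES = 800  # target size from the text; the trimming logic is an assumption


def _encode(summary):
    """Compact JSON encoding, so the byte budget goes to content, not whitespace."""
    return json.dumps(summary, separators=(",", ":"))


def condense_trace(spans, max_bytes=MAX_SUMMARY_BYTES, notable=3):
    """Reduce a list of span dicts to a small JSON summary string."""
    by_duration = sorted(spans, key=lambda s: s.get("duration_ms", 0), reverse=True)
    summary = {
        "tool_sequence": [s["name"] for s in spans],
        "total_ms": sum(s.get("duration_ms", 0) for s in spans),
        "total_tokens": sum(s.get("tokens", 0) for s in spans),
        # Keep only the slowest spans as "notable"; raw payloads are dropped entirely.
        "notable": [
            {"name": s["name"], "duration_ms": s.get("duration_ms", 0)}
            for s in by_duration[:notable]
        ],
        "truncated": False,
    }
    encoded = _encode(summary)
    # Trim notable spans first, then the tail of the sequence, until under the cap.
    # A single very long span name could still exceed the cap; this sketch accepts that.
    while len(encoded.encode("utf-8")) > max_bytes and (
        summary["notable"] or len(summary["tool_sequence"]) > 1
    ):
        summary["truncated"] = True
        if summary["notable"]:
            summary["notable"].pop()
        else:
            summary["tool_sequence"].pop()
        encoded = _encode(summary)
    return encoded
```

The totals are computed before trimming, so the timing and token counts stay accurate even when the sequence itself is cut short.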
$ cat agent-tools.txt
The pipeline has two phases: Phase 1 collects evidence (12-tool-call budget) and Phase 2 analyzes it (8-tool-call budget). Context is evicted between phases, so only the scratchpad crosses the boundary.
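A minimal sketch of this orchestration, assuming tools are plain callables that return strings. The `Phase` class, `BudgetExceeded`, and `run_pipeline` are hypothetical names for the example; only the 12 and 8 call budgets and the scratchpad-only boundary come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when a phase tries to make more tool calls than it is allowed."""


@dataclass
class Phase:
    name: str
    budget: int
    calls_made: int = 0

    def call_tool(self, tool: Callable[[], str]) -> str:
        """Run one tool call, refusing once the phase budget is spent."""
        if self.calls_made >= self.budget:
            raise BudgetExceeded(f"{self.name}: budget of {self.budget} tool calls used up")
        self.calls_made += 1
        return tool()


def run_pipeline(
    collectors: Sequence[Callable[[], str]],
    analyzers: Sequence[Callable[[Tuple[str, ...]], str]],
) -> List[str]:
    # Phase 1: every collector result is written to the scratchpad.
    phase1 = Phase("phase1_collect", budget=12)
    scratchpad = [phase1.call_tool(tool) for tool in collectors]

    # Context eviction: Phase 2 receives nothing from Phase 1 except a
    # read-only copy of the scratchpad.
    frozen = tuple(scratchpad)
    phase2 = Phase("phase2_analyze", budget=8)
    return [phase2.call_tool(lambda a=analyzer: a(frozen)) for analyzer in analyzers]
```

Because the budget check lives in the orchestrator rather than in the prompt, the agent cannot talk its way past it.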
Phase 1: Evidence Collector
Aggregate stats and per-run composite scores from records.jsonl. Returns median, mean, std per dimension.
Condensed trace summary: tool call sequence, timing, token counts, entities investigated, and 2-3 notable spans.
Plain-text explanation of what a scoring dimension measures, its weight, and key thresholds.
HARNESS_CHANGELOG entries between two dates. Context for code changes that affect score comparability.
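The aggregate-stats tool in Phase 1 could look roughly like this. The record layout is an assumption: each line of records.jsonl is taken to be a JSON object with a hypothetical `scores` mapping from dimension name to number, which may not match the real file.

```python
import json
import statistics
from collections import defaultdict


def aggregate_scores(jsonl_lines):
    """Compute n, median, mean and sample std per scoring dimension."""
    per_dim = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines at the end of the file
        record = json.loads(line)
        # "scores" is a hypothetical field name for this sketch.
        for dim, value in record["scores"].items():
            per_dim[dim].append(float(value))
    return {
        dim: {
            "n": len(vals),
            "median": statistics.median(vals),
            "mean": statistics.fmean(vals),
            # Sample standard deviation needs at least two observations.
            "std": statistics.stdev(vals) if len(vals) > 1 else 0.0,
        }
        for dim, vals in per_dim.items()
    }
```

With a real file this would be called as `aggregate_scores(open("records.jsonl"))`, since a file object iterates line by line.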
Phase 2: Analyst + Educator
Mann-Whitney U or bootstrap with effect size, p-value, confidence intervals, and a method card explaining the test.
Checks whether claims are supported by trace evidence. Flags missing trace_ids, unresearched dimensions, missing span types.
Structured report with dimension comparisons, root causes, and gaps. Pre-hooks enforce evidence citation and statistical backing.
Educational output: methods used, dimensions explained, decision framework for next experiment. Pre-hook checks completeness.
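For the statistical test tool at the top of Phase 2, here is a standard-library sketch of a Mann-Whitney U test bundled with a method card. It is not the project's implementation: it uses the normal approximation with tie and continuity corrections (an exact test is preferable for very small samples), reports rank-biserial correlation as the effect size, and omits confidence intervals. The card wording and the return shape are illustrative.

```python
import math
from statistics import NormalDist


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b], key=lambda p: p[0])
    rank_sum_a, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i                   # size of this tie group
        tie_term += t**3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        p_value = 1.0  # every observation is identical
    else:
        z = max(abs(u_a - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
        p_value = 2 * (1 - NormalDist().cdf(z))
    return {
        "u": u_a,
        "p_value": p_value,
        # Rank-biserial correlation in [-1, 1]; positive means a tends to be larger.
        "effect_size": 2 * u_a / (n1 * n2) - 1,
        "method_card": {
            "why": "Compares two independent samples by rank, without assuming normality.",
            "assumptions": "Independent observations on at least an ordinal scale.",
            "interpretation": "Small p suggests one group tends to score higher; "
                              "the effect size says how strongly.",
            "alternatives": "Bootstrap for a confidence interval on a chosen statistic; "
                            "Welch's t-test when the data are roughly normal.",
        },
    }
```

Shipping the card in the same return value as the numbers means the educational output cannot drift away from the test that was actually run.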
$ ls ./components
- Two-phase agent pipeline with code-level orchestration
- Structural enforcement of analysis protocols via tool-use hooks
- Research validation methodologies (statistical significance, effect size, confidence intervals)
- OpenTelemetry-aligned tracing with parent-child span hierarchies
- Converting prototypes into reusable multi-workflow infrastructure
- Scoring educational output quality alongside analytical correctness