Results Analysis
A two-phase agent pipeline that compares experiment pairs with statistical rigor and structurally enforces evidence-before-conclusions.
$ cat story.md
During a session analyzing autoresearch experiment results, I had Claude propose three wrong hypotheses in a row. Each was stated with confidence, and each reversed the moment I told Claude to pull the actual trace data instead of reasoning from aggregate scores. The ANALYZE_RESULTS_PROTOCOL came out of that session as a checklist, but checklists are prompt-level guidance, and the ai-data-analyst had already proven that prompt-level rules get ignored (the 5 W's framework was vestigial across 100+ runs).
The solution: enforce evidence-before-conclusions architecturally. A two-phase pipeline where Phase 1 must complete evidence collection before Phase 2 can analyze. Post-hooks on tool use enforce that every claim cites a trace_id, every metric reference links to the scorer code, and every 'improved/regressed' judgment has a statistical test behind it. If the agent tries to speculate, it gets blocked.
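A minimal sketch of what such a post-hook could look like. The names here (`Claim`, `post_hook`, the judgment-word list) are illustrative assumptions, not the project's actual API:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    trace_id: Optional[str] = None   # link to the trace the claim is grounded in
    stat_test: Optional[str] = None  # name of the statistical test backing a judgment

# Judgment words that require statistical backing before they can pass.
JUDGMENT = re.compile(r"\b(improved|regressed|better|worse)\b", re.I)

def post_hook(claim: Claim) -> Claim:
    """Runs after each Phase 2 tool call; raises instead of letting an
    unevidenced claim through (hypothetical hook, for illustration)."""
    if claim.trace_id is None:
        raise ValueError("every claim must cite a trace_id")
    if JUDGMENT.search(claim.text) and claim.stat_test is None:
        raise ValueError("'improved/regressed' judgments need a statistical test")
    return claim
```

Because the hook raises rather than warns, speculation is structurally impossible: a claim without evidence never reaches the report.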
The learning output was the piece that made this more than automation. Every statistical test includes a method card explaining why that test was chosen, what its assumptions are, how to interpret the numbers, and when a different test would be appropriate. The harness optimizes for educational quality alongside analytical correctness.
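A method card might be as simple as a structured record attached to each test result. This is a sketch under my own assumptions; the field names and example content are illustrative:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MethodCard:
    """Educational companion to a statistical result: explains the choice,
    not just the number (hypothetical structure)."""
    test: str
    why_chosen: str
    assumptions: List[str]
    interpretation: str
    alternatives: str

card = MethodCard(
    test="Mann-Whitney U",
    why_chosen="Per-trace scores are ordinal and not normally distributed.",
    assumptions=["independent samples",
                 "similar-shaped distributions for a location interpretation"],
    interpretation="A small p-value means one run's scores tend to rank "
                   "higher than the other's.",
    alternatives="Use Welch's t-test when scores are approximately normal.",
)
```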
$ ls ./components
- Two-phase agent pipeline with code-level orchestration
- Structural enforcement of analysis protocols via tool-use hooks
- Research validation methodologies (statistical significance, effect size, confidence intervals)
- OpenTelemetry-aligned tracing with parent-child span hierarchies
- Converting prototypes into reusable multi-workflow infrastructure
- Scoring educational output quality alongside analytical correctness
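The validation component above (significance, effect size, confidence intervals) can be sketched with the standard library alone. This is a simplified stand-in, not the project's actual scorer: a bootstrap 95% CI for the mean difference plus a crude Cohen's d:

```python
import random
import statistics

def compare_runs(baseline, candidate, n_boot=2000, seed=0):
    """Compare two score lists: mean difference, effect size, bootstrap CI.
    (Illustrative sketch; a real harness would also run a hypothesis test.)"""
    rng = random.Random(seed)
    diff = statistics.mean(candidate) - statistics.mean(baseline)
    pooled = statistics.pstdev(baseline + candidate)  # crude pooled spread
    d = diff / pooled if pooled else 0.0
    # Bootstrap the mean difference to get a 95% confidence interval.
    boots = []
    for _ in range(n_boot):
        b = [rng.choice(baseline) for _ in baseline]
        c = [rng.choice(candidate) for _ in candidate]
        boots.append(statistics.mean(c) - statistics.mean(b))
    boots.sort()
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
    return {"mean_diff": diff, "cohens_d": d, "ci95": (lo, hi),
            "significant": not (lo <= 0.0 <= hi)}
```

The `significant` flag is what gates an 'improved/regressed' judgment: if the CI straddles zero, the agent cannot make the claim.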
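The OpenTelemetry-aligned tracing component boils down to parent-child span links sharing one trace_id. A stdlib-only illustration of that shape (not the opentelemetry SDK; names are assumptions):

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    """Minimal OpenTelemetry-style span: parent_id links children to parents,
    and every span in one analysis run shares a trace_id."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str) -> "Span":
        c = Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)
        self.children.append(c)
        return c

# One root span per analysis, with each phase as a child span.
root = Span(name="analyze_results", trace_id=uuid.uuid4().hex)
phase1 = root.child("phase1_evidence")
phase2 = root.child("phase2_analysis")
```

This hierarchy is also what makes the trace_id citation rule enforceable: every claim resolves to a concrete span in the run.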