In Progress · Python · llama.cpp · scipy · JSONL

Results Analysis

Two-phase agent pipeline that compares experiment pairs with statistical rigor and structural enforcement of evidence-before-conclusions.

$ cat story.md

During a session analyzing autoresearch experiment results, Claude proposed three wrong hypotheses in a row. Each was stated with confidence, and each reversed once I told Claude to actually pull the trace data instead of reasoning from aggregate scores. The ANALYZE_RESULTS_PROTOCOL came out of that session as a checklist, but checklists are prompt-level guidance, and the ai-data-analyst project had already shown that prompt-level rules get ignored (the 5 W's framework was vestigial across 100+ runs).

The solution: enforce evidence-before-conclusions architecturally. A two-phase pipeline where Phase 1 must complete evidence collection before Phase 2 can analyze. Post-hooks on tool use enforce that every claim cites a trace_id, that every metric reference links to the scorer code, and that every 'improved/regressed' judgment has a statistical test behind it. If the agent tries to speculate, it gets blocked.
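A minimal sketch of what such a post-hook check might look like. The names here (`validate_claim`, the claim-dict fields) are illustrative assumptions, not the project's actual API; the idea is simply that a claim object carrying no evidence fails structural validation before it can reach the analysis phase:

```python
import re

# OTEL-style trace ids are 16 bytes, rendered as 32 lowercase hex chars.
TRACE_ID_RE = re.compile(r"^[0-9a-f]{32}$")

def validate_claim(claim: dict) -> list[str]:
    """Return a list of violations; an empty list means the claim may pass."""
    violations = []
    if not any(TRACE_ID_RE.match(t) for t in claim.get("trace_ids", [])):
        violations.append("claim does not cite a valid trace_id")
    if claim.get("kind") == "metric" and not claim.get("scorer_ref"):
        violations.append("metric reference lacks a link to scorer code")
    if claim.get("kind") == "judgment" and not claim.get("stat_test"):
        violations.append("'improved/regressed' judgment lacks a statistical test")
    return violations

# A speculative judgment with no evidence attached gets blocked:
bad = {"kind": "judgment", "text": "variant B improved accuracy"}
print(validate_claim(bad))
```

In the real pipeline this kind of check would run as a hook after each tool call, with a non-empty violation list causing the output to be rejected rather than merely warned about.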

The learning output was the piece that made this more than automation. Every statistical test includes a method card explaining why that test was chosen, what the assumptions are, how to interpret the numbers, and when you would use a different test. The harness optimizes for educational quality alongside analytical correctness.
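As a sketch of the method-card idea, here is a Mann-Whitney U comparison (one of the statistical test tools named below) paired with the kind of card the harness emits. The scores are made-up illustration data, and the card structure is an assumption about the output format, not the harness's actual schema:

```python
from scipy.stats import mannwhitneyu

# Illustrative per-run scores; the real harness pulls these from trace data.
baseline = [0.61, 0.58, 0.64, 0.57, 0.60, 0.59, 0.63, 0.62]
candidate = [0.66, 0.70, 0.65, 0.68, 0.64, 0.71, 0.67, 0.69]

stat, p = mannwhitneyu(candidate, baseline, alternative="greater")

method_card = {
    "test": "Mann-Whitney U (one-sided)",
    "why": "per-run scores are small samples with no normality guarantee; "
           "a rank-based test compares distributions without that assumption",
    "assumptions": "independent runs; scores are ordinal and comparable",
    "interpretation": f"U={stat:.1f}, p={p:.4f}; a small p suggests the "
                      "candidate's scores are stochastically greater",
    "when_to_use_something_else": "Welch's t-test if both samples look "
                                  "roughly normal; a bootstrap CI when an "
                                  "effect-size estimate is the goal",
}
print(method_card["interpretation"])
```

The point of the card is that the test result never travels alone: the choice of test, its assumptions, and the alternatives are attached to every number the analyst phase cites.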

$ ls ./components

  • Evidence Collector (Phase 1)
  • Analyst + Educator (Phase 2)
  • Orchestrator (fixed-phase pipeline)
  • Structural Scorer (6 checks)
  • Post-hook Enforcement (ANALYZE_RESULTS_PROTOCOL)
  • OTEL-Aligned Tracing
  • Statistical Test Tools (Mann-Whitney U, Bootstrap)
  • Tracing Gap Identifier
  • Wiki-Capture Integration

  • Two-phase agent pipeline with code-level orchestration
  • Structural enforcement of analysis protocols via tool-use hooks
  • Research validation methodologies (statistical significance, effect size, confidence intervals)
  • OpenTelemetry-aligned tracing with parent-child span hierarchies
  • Converting prototypes into reusable multi-workflow infrastructure
  • Scoring educational output quality alongside analytical correctness
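The bootstrap tool listed above can be sketched as a percentile confidence interval on the difference in mean scores. The data and function name are hypothetical; the design point is that an 'improved' judgment is only allowed when the interval excludes zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-run scores; the harness would pull these from traces.
baseline = np.array([0.61, 0.58, 0.64, 0.57, 0.60, 0.59, 0.63, 0.62])
candidate = np.array([0.66, 0.70, 0.65, 0.68, 0.64, 0.71, 0.67, 0.69])

def bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in means (b - a)."""
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        diffs[i] = rng.choice(b, b.size).mean() - rng.choice(a, a.size).mean()
    return tuple(np.quantile(diffs, [alpha / 2, 1 - alpha / 2]))

lo, hi = bootstrap_ci(baseline, candidate)
# An interval that excludes 0 is the evidence an 'improved' judgment must cite.
print(f"95% CI for mean improvement: [{lo:.3f}, {hi:.3f}]")
```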