Agent Trace Telemetry
What to measure about an agentic investigation loop, and how a trace explorer turns raw run data into evidence for the next prompt or harness change.
Agentic investigation runs produce a lot of signal you cannot see from the final score alone. This page covers the trace record format, the five scoring dimensions, and the six-tab DuckDB-backed explorer that turns 114 backfilled traces into a diagnostic view over experiments. It is a living document and gets updated as more runs land.
What it is
Each investigation run writes a JSONL trace record with the inputs, decisions, tool calls, token counts, post-hook nudges, and final score. A Marimo notebook loads every JSONL file through DuckDB and exposes six tabs (Overview, Scores, Tokens, Tools, Enforcement, SQL) with charts for each class of signal. The tabs are designed around a "good vs investigate" framing so the next experiment has somewhere to aim.
Why it exists
Before the trace explorer, an experiment produced a composite score and nothing else. A regression from 9.3 to 7.0 was visible. The reason was not. The bottleneck could live in triage ordering, evidence depth, coverage, field completeness, efficiency, or some combination. Picking the wrong dimension to fix burns a whole experiment cycle.
The telemetry layer fills the gap between "the score changed" and "here is what to change next."
What we measure
Scoring dimensions (five, weighted to a 0 to 10 composite):
| Dimension | Weight | What it captures |
|---|---|---|
| Triage quality | 25% | Kendall tau between investigation order and anomaly score ranking |
| Evidence breadth | 25% | Distinct sources per finding and source coverage vs available sources |
| Coverage | 25% | Ratio of flagged entities (score at or above 50) actually investigated |
| Structural completeness | 15% | Required fields populated per finding (8 fields checked) |
| Efficiency | 10% | Findings per tool call plus budget utilization |
Per-run JSONL record fields include the obvious (composite score, dimension scores, outcome, model, prompt version) and the ones that matter for diagnostics:
tokens_input_total,tokens_output_total,tokens_wasted(tokens from context that got evicted before being used)total_tool_calls,cache_hits,cache_missesposthook_triggers,findings_rejected,posthook_types(classified nudge categories)findings_submitted,entities_investigatedmessages_at_end(parsed from eviction spans to measure context pressure)
The JSONL format keeps each run self-contained. DuckDB loads all files with read_json_auto(traces/*.jsonl) and everything downstream is SQL.
The six tabs
Overview

Summary cards (total runs, avg composite with sigma, pass rate, latest score), a composite-score-over-time line with a standard deviation band, and a horizontal bar chart of outcome distribution. Good looks like an upward trend with tight bands and no red dots. Worth investigating when bands widen or clusters of failures appear.
Scores

A 5-axis radar chart showing the mean value per scoring dimension, with a "Bottleneck" callout naming the weakest axis. Box plots per prompt_version show how each dimension moves as prompts evolve. Good looks like a full, symmetrical radar reaching toward 1.0 with tight boxes. Worth investigating when one axis collapses (that is the bottleneck to fix next) or when a prompt version has wide boxes (variance without quality).
Tokens

Three charts: tokens per finding scatter colored by prompt version, token waste trend bars, and input vs output stacked bars per run. Token waste is the key metric: input tokens that got evicted from context before the agent used them. High waste means the investigation is processing content it could not exploit, which is usually a context management or loop-discipline problem.
Tools

Cache hit ratio over time, a total_tool_calls vs findings_submitted efficiency frontier scatter, and a findings-per-tool-call line with a target rule at 0.75. The efficiency frontier is the most useful one: upper-left means lots of findings from few tool calls (good), lower-right means lots of tool calls with few findings (the agent is grinding).
Enforcement

Nudge frequency by day (post-hook triggers), rejection frequency by day (findings that failed validation), and a nudges vs composite scatter. Post-hooks fire when the agent breaks a structural rule (missing required fields, calling get_entity_detail without submit_finding, invalid format). High nudge counts mean the agent is bouncing off guardrails. A negative nudges-vs-composite correlation is the warning signal: nudges firing without improving outcomes.
SQL

Five pre-seeded DuckDB queries (score variance by prompt version, token cost per run, post-hook trigger rate by model, cache effectiveness, eviction correlation with scores) plus three custom query editors for ad-hoc investigation. The pre-seeded queries are the ones I kept re-typing, promoted into the UI.
Current observations (updated 2026-04-10)
First pass against the 114 backfilled records spanning 5 experiment directories. This section will grow as more experiments land.
- Hermes-4-70B is the clear baseline at 9.3 composite. Cross-model runs on the same prompt show Nemotron-3-Super-120B jumping from 3.45 to 7.01 after the post-hook fix, which is a real lift but still well below Hermes. Agent-specific training appears to matter more than raw parameter count on this task.
- Evidence breadth is the likely next bottleneck. Initial test run of scorer v2 scored 9.43/10 composite with evidence breadth at 0.837, the lowest of the five dimensions. The 5 W's framework added to the system prompt was meant to drive more tool usage per entity but the ceiling has not moved much yet.
- Token waste is low in passing runs. Healthy runs show input:output token ratios where output is a small fraction of input, suggesting the model is concise when the loop is working. Verbose failure modes (Nemotron-3-Super at 1449 tokens per submit_finding vs Hermes at 350) fill context quickly and cause overflow inside four turns.
- GPT-OSS-20B and Llama-3.3-70B are broken for agent use at the moment. GPT-OSS-20B writes 2000-token essays instead of tool calls. Llama-3.3-70B produces zero LLM responses (llama.cpp chat template issue). Both show up as flat zero composite scores across all dimensions.
Open questions
- Does the evidence breadth ceiling come from the 5 W's framework being too soft, or from the scorer counting raw source coverage instead of specific W-coverage (Who / What / When / Where / Why)?
- Is the
cache_hitsmetric gameable in a way that hurts investigation quality? An agent that keeps asking for the same entity's data with no progress looks cache-efficient but is actually stuck. - Is there a threshold of post-hook triggers above which composite score starts dropping? The nudges-vs-score scatter should reveal this once more data lands.
- Does the eviction-correlation query surface a clean relationship between number of context evictions and coverage drops, or is the effect dominated by other variables?
What we've learned so far
- 2026-04-10: A trace explorer built around "good vs investigate" descriptors is faster to act on than raw SQL, because each chart already encodes what the experiment designer should look for. The SQL tab exists for everything the pre-seeded charts do not answer.
- 2026-04-10: Daily-rotated JSONL with DuckDB
read_json_autois enough for 100 to 1000 traces. No database, no migrations, no schema work. A single file per day keeps diffs readable and scp backups cheap.