## Evaluation
ts-agents now includes a deterministic internal benchmark that compares the raw tool surface against the structured discovery, skills, and workflow layers. The goal is not to simulate a frontier model; it is to make the repo’s own agent-facing contract measurable and easy to rerun.
### What the benchmark covers
The harness in `ts_agents.evals.refactor_benchmark` compares four assist levels:
- `plain_model`: no runnable repo contract
- `plain_tools`: direct tool calls with no discovery or workflow guidance
- `structured_discovery`: `tool search` / `tool show` before execution
- `skills_workflows`: `skills show` plus `workflow run`
It runs three representative tasks:
- inspect an unknown univariate series
- compare forecasting baselines on a deterministic series
- perform labeled-stream window-size selection and evaluation
Metrics:
- task success rate
- parse / schema failure rate
- invalid tool calls
- artifact completeness
- latency
- retries / recovery
### Latest checked-in snapshot
Rerun command:
```bash
uv run python -m ts_agents.evals.refactor_benchmark \
  --output-dir benchmarks/results/latest
```

Checked-in outputs:

- `benchmarks/results/latest/results.json`
- `benchmarks/results/latest/summary.md`
Results from the current snapshot:
| Assist level | Success rate | Parse failure rate | Invalid tool calls | Avg artifact completeness | Avg duration (ms) | Retries | Recovery rate |
|---|---|---|---|---|---|---|---|
| `plain_model` | 0.00 | 1.00 | 0 | 0.00 | 0.0 | 0 | 0.00 |
| `plain_tools` | 0.67 | 0.00 | 1 | 0.00 | 314.1 | 1 | 0.33 |
| `structured_discovery` | 1.00 | 0.00 | 0 | 0.00 | 62.6 | 0 | 0.00 |
| `skills_workflows` | 1.00 | 0.00 | 0 | 1.00 | 114.9 | 0 | 0.00 |
### How to read it
- `plain_model` fails every scenario because it produces no machine-runnable contract.
- `plain_tools` sometimes recovers, but it still guesses wrong once and produces no artifact bundles.
- `structured_discovery` reaches full task success, but it still leaves artifact completeness at 0.00 because raw tool calls do not write the workflow bundle.
- `skills_workflows` reaches full task success and full artifact completeness, which is the main proof that the repo's structured surface improves agent outcomes.
### Why this matters
This benchmark closes the loop on the refactor:
- workflows make artifact production deterministic
- `skills show` exposes the actionable policy layer
- deprecated compatibility aliases remain available without being the recommended path
If you change the workflow contract, skill metadata, or tool discovery surface, rerun the benchmark and update the checked-in snapshot.