Building Data Science Agents in the Real World

Lessons from ts-agents: a CLI-first time-series automation toolkit

Farrukh Nauman · InertialRange Labs AB

2026-03-01

CLI-first

Stable contract for tools
Composes with shell/CI
Artifacts are paths

Skills

SKILL.md runbooks
Domain priors + checklists
Works with many harnesses

Sandboxes

Isolate messy deps
Scale compute when needed
Save logs + outputs

Scope: a “personal assistant data scientist” for time series

Make quick-n-dirty analysis fast, repeatable, and hackable.

What this repo is

  • A batteries-included time-series toolkit (60+ tools) exposed as one CLI: ts-agents
  • A thin agent layer that can drive the CLI and produce artifacts (plots, reports)
  • A modular workbench: swap UI / agent harness / execution backend without rewriting tools

Scope (cont.)

What it is not

  • Not a monolithic end-user GUI product (yet)
  • Not “just an agent framework” — the value is in tool contracts + priors
  • Not trying to replace Claude Code / Codex as general harnesses

Roadmap direction

Add richer GUI interactions (artifact browser + job monitor + review UX) — but keep the CLI as the stable spine.

What is ts-agents?

CLI + skills + sandboxes + optional UI/agents

  • CLI-first workflow (ts-agents ...) for scripting, composing, automation
  • Tool registry with metadata (category, cost, params, timeouts)
  • Sandbox executor backends: local, subprocess, docker, daytona, modal
  • Optional front-ends: Gradio UI + simple (LangChain) and deep (multi-agent) modes
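The tool-registry idea above can be sketched in a few lines. This is a minimal illustration, not the actual ts-agents API; the class and field names are hypothetical stand-ins for the metadata the deck lists (category, cost, params, timeouts).

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a registry entry; names are illustrative,
# not the real ts-agents internals.
@dataclass
class ToolSpec:
    name: str
    category: str          # e.g. "forecasting", "decomposition"
    cost: str              # "LOW" ... "VERY_HIGH", used for routing/approval
    timeout_s: int         # hard wall-clock limit enforced by the executor
    params: dict = field(default_factory=dict)  # JSON-schema-style param spec

REGISTRY: dict[str, ToolSpec] = {}

def register(spec: ToolSpec) -> None:
    """Make a tool discoverable to the CLI and any wrapper layer."""
    REGISTRY[spec.name] = spec

register(ToolSpec(
    name="forecast_theta_with_data",
    category="forecasting",
    cost="LOW",
    timeout_s=120,
    params={"horizon": {"type": "integer", "default": 30}},
))
```

Because every front-end reads the same registry, `tool list` output, function-call schemas, and approval gates all derive from one source of truth.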

Quickstart

uv sync
uv run ts-agents workflow list
uv run ts-agents tool list
uv run ts-agents workflow run forecast-series \
  --input-json '{"series":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]}' \
  --horizon 5

Design choice: artifacts over chat — tools write plots/reports to disk; the agent returns paths + summaries.
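The "artifacts over chat" pattern looks roughly like this sketch: the tool writes its payload to disk and hands the agent only paths plus a one-line summary. The function and keys are illustrative, not the real implementation.

```python
import json
from pathlib import Path

def run_tool(outdir: str = "outputs") -> dict:
    """Illustrative tool: write the result as an artifact, return paths."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    result = {"forecast": [1.0, 2.0, 3.0]}   # stand-in computation
    artifact = out / "theta.json"
    artifact.write_text(json.dumps(result))
    # The agent (or shell pipeline) sees a path + summary, not raw data.
    return {"artifacts": [str(artifact)], "summary": "3-step forecast saved"}
```

Keeping payloads out of the chat transcript means any harness that can read files (Claude Code, Codex, a shell script) can consume the result.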

Architecture: CLI is the stable contract

Agents and UIs are optional front-ends.

┌──────────────────────────────────┐    ┌──────────────────────────────┐
│  ts-agents CLI (contract)        │    │  Front-ends (swappable)      │
│  workflow / tool / sandbox /     │    │                              │
│  skills / data / agent           │    │  • Claude Code / Codex CLI   │
└──────────┬───────────────────────┘    │  • Custom agents (simple +   │
           │                            │    deep via LangChain)       │
           ▼                            │  • Gradio UI                 │
┌──────────────────────────────────┐    └──────────────────────────────┘
│  Tool registry + wrappers        │               ▲
│  metadata: params + cost +       │               │
│  timeouts                        ├───────────────┘
│  wrap: LangChain / deepagent     │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Execution layer (sandboxes)     │
│  local • subprocess • docker •   │
│  daytona • modal                 │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Artifacts (outputs/…)           │
│  plots • tables • JSON •         │
│  QMD/PDF • logs                  │
└──────────────────────────────────┘

How do you expose time-series skills to agents?

Five common integration styles (and why they feel different).

Approach                       What the agent sees                         Tradeoffs
Function tool-calls            JSON-schema tools; direct function calls    Great framework UX; wrappers/parsers can be brittle
  (LangChain, deepagents)
Single CLI contract            One command surface + stable artifacts      Composable, debuggable, harness-friendly
  (ts-agents)
Many small CLIs                Dozens of commands and pipes                Strong Unix composability; weaker discoverability
Service/API (HTTP/gRPC)        Network calls + JSON                        Good for multi-user governance; infra/auth overhead
Notebook/interpreter           Inline Python/cells                         Flexible exploration; weaker repeatability

Key idea: pick a stable contract first (CLI or API), then add wrappers as convenience layers.
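A wrapper layer over a stable CLI contract can be as small as this sketch: shell out to the command surface and parse JSON from stdout. The helper name is hypothetical; the point is that any framework's function tool reduces to one subprocess call.

```python
import json
import subprocess

def call_cli(argv: list[str]) -> dict:
    """Run a CLI command that emits JSON on stdout; raise on failure."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

A LangChain or deepagents tool then just forwards its validated arguments into `argv`, so the brittle part (schemas, parsing) stays in the thin wrapper while the contract underneath stays fixed.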

What ts-agents implements (in the end)

CLI as the primary interface + wrappers for agent frameworks.

Implemented

  • One package → one CLI entrypoint: ts-agents
  • Tools registered with metadata: params, cost, timeouts
  • CLI routes execution through sandboxes; outputs saved as artifacts
  • Optional wrappers for function-calls (LangChain / deepagents)
  • SKILL.md runbooks for reliable external harness execution

Contract examples

# discover tools
uv run ts-agents tool list --bundle demo

# run a tool with stable args
uv run ts-agents tool run \
  stl_decompose_with_data \
  --run Re200Rm200 \
  --var bx001_real

# export SKILL.md
uv run ts-agents skills export --all-agents

Why this matters: you can swap the “brain” (Claude/Codex/LangChain) without rewriting the time-series code.

Two paths to automation

Both exist in ts-agents — the point is to compare tradeoffs.

Path A: Custom agent harness

  • LangChain / tool schemas + wrappers
  • Simple agent (flat) for testing
  • Deep agent (orchestrator + subagents)
  • UI must handle plots/logs/reports
  • Full control… and full maintenance cost

Path B: Reuse a mature harness

  • SKILL.md runbooks define workflows + outputs
  • Claude Code / Codex drives ts-agents in a terminal
  • Artifacts are paths; review via diff + git
  • Composability: shell scripts, Make, CI
  • Low glue-code; fewer failure modes

Practical recommendation: start with Path B. Build Path A only when you need strict policies, custom UX, or deep integrations.

Lesson: SKILL.md + mature CLIs beat bespoke harnesses

Plain-text runbooks are a high-leverage domain prior.

What mature CLI agents give you for free

  • ✓ Project-wide search/navigation
  • ✓ File-aware editing + diffs
  • ✓ Command execution + logs
  • ✓ Session persistence + resumption
  • ✓ Permissions/sandboxing + audit trails
  • ✓ Background execution for long commands

When a custom harness is still worth it

  • Strict tool policies + centralized auditability
  • Non-developer UI requirements
  • Deep proprietary integrations (data stores, schedulers)
  • Latency-sensitive or offline deployments

A skill is a cheap way to inject domain expertise: tool order, commands, artifacts, and done criteria.

Out-of-the-box harness power: Claude Code CLI vs Codex CLI

These are agent runtimes, not just chat UIs.

Capability            Claude Code CLI                 Codex CLI
Repo-aware editing    Multi-file in-terminal edits    Full-screen TUI + diffs
Command execution     Shell commands with logs        Command runs with transcript
Permissions/sandbox   Permission modes + sandbox      Approval modes (read-only/auto/full)
Long-running work     Background bash + subagents     Background mode + cloud execution
Session persistence   Resume/continue sessions        Persistent interactive sessions
Context management    Auto-compaction near limits     Long-session context management
Extensibility         Hooks + plugins + subagents     Skills + MCP + scripting
Review workflow       Git-oriented flows              Built-in review presets

Implementing wait/resume/background/compaction/permissions/diffs in a custom harness is real engineering work.

Lesson: chat GUIs do not scale to real DS agents

Multi-tool + multi-artifact workflows need a workbench.

A real agent UI needs

  1. Chat messages + reasoning traces
  2. Tool outputs (stdout/stderr, JSON)
  3. Intermediate artifacts (plots/tables/HTML)
  4. Report viewers (PDF/Quarto/Markdown)
  5. State inspection (files/sessions/caches)
  6. Progress + cancellation + retries

Why CLI-first works better initially

  • stdout is a universal UI (text/JSON)
  • Every artifact is a file path
  • Shell scripts / Make / CI provide composability
  • A thin GUI can browse artifacts and launch commands

Lesson: dependencies are the hidden killer

Time-series stacks are fragmented; one env rarely covers everything.

Failure modes

  • Coverage gaps: best method in another library
  • Binary/system deps vary by OS/toolchain
  • Version conflicts (numpy/torch/protobuf, etc.)
  • Silent breakage: agent chooses tool, C-extension fails cryptically

Design response in ts-agents

  • Expose algorithms as CLI tools (stable contract)
  • Use lazy imports/conditional registration for optional deps
  • Use sandbox backends for per-tool environment isolation
  • Persist logs + artifacts for post-mortem debugging
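The conditional-registration idea above can be sketched with a spec check: a tool only becomes visible when its heavy optional dependency is actually importable, so the agent never selects a tool that would die in a C-extension import. Names here are illustrative.

```python
import importlib.util

AVAILABLE_TOOLS: list[str] = []

def maybe_register(tool_name: str, required_module: str) -> bool:
    """Register the tool only if its optional dependency is installed."""
    if importlib.util.find_spec(required_module) is None:
        return False  # dep missing: tool stays hidden, no cryptic crash later
    AVAILABLE_TOOLS.append(tool_name)
    return True

maybe_register("stl_decompose", "statistics")       # stdlib: always present
maybe_register("prophet_forecast", "no_such_dep")   # missing: skipped quietly
```

The failure mode shifts from "agent picked a tool that segfaults mid-run" to "tool simply isn't in the choice set", which is far easier to debug from `tool list` output.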

Lesson: long-running compute breaks agent loops

Minutes-to-hours jobs need async, progress, and resumption.

Submit → Execute → Progress → Artifacts → Resume

What helped in ts-agents

  • Cost metadata per tool (LOW → VERY_HIGH) + approval gates
  • Timeouts and sandbox limits to avoid wedged runs
  • Report-first outputs (QMD/PDF) + intermediate artifacts
  • Treat expensive runs as batch jobs, not chat turns

Why mature harnesses matter

  • Claude Code: background bash commands + subagents
  • Codex: background mode + cloud task execution
  • Both: robust log capture + session continuity

Lesson: tool selection needs domain priors

With many similar tools, LLMs often choose slow or mediocre defaults.

Typical failure: given many classifiers, an agent jumps straight to expensive methods before cheap baselines.

How ts-agents injects priors

  • Tool metadata: category + estimated cost + parameter schema
  • Tool bundles (minimal / demo / full) to reduce choice set
  • SKILL.md decision trees (“try cheap baselines first…”)
  • Benchmarks that reproduce failures and compare bundles

Practical heuristic

  1. Cheap baselines
  2. Robust defaults
  3. Only then SOTA

In practice: tool ordering + cost flags + skills.
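The heuristic above reduces to a sorting rule: present candidate tools to the agent ordered by cost rank, so baselines come before SOTA. Tool names and costs here are illustrative.

```python
COST_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "VERY_HIGH": 3}

def order_candidates(tools: dict[str, str]) -> list[str]:
    """Return tool names cheapest-first (stable within equal cost)."""
    return sorted(tools, key=lambda name: COST_RANK[tools[name]])

candidates = {
    "deep_transformer_forecast": "VERY_HIGH",
    "naive_seasonal": "LOW",
    "theta": "LOW",
    "arima_auto": "MEDIUM",
}
```

Even when the final pick is left to the LLM, feeding it a cheapest-first list (or a trimmed bundle) biases it toward trying baselines before burning compute.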

Sandboxes: where agents actually run

Isolation + scalability + safety for tool execution.

What sandboxes solve

  • ✓ Per-tool environments (dependency conflict avoidance)
  • ✓ Hard resource limits (CPU/RAM/timeouts)
  • ✓ Network policy + data boundary controls
  • ✓ Reproducible execution + clean state
  • ✓ Burst compute for heavy scripts (cloud backends)

ts-agents backends: local · subprocess · docker · daytona · modal
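The simplest of these backends, subprocess isolation with hard limits, can be sketched as follows: a wall-clock timeout turns a wedged tool into a clean, reportable failure instead of a hung agent loop. The wrapper is illustrative, not the actual executor.

```python
import subprocess

def run_sandboxed(argv: list[str], timeout_s: int) -> dict:
    """Run a command with a hard timeout; never hang the calling agent."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
        return {"ok": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # The process is killed; the agent gets a structured failure it can act on.
        return {"ok": False, "stdout": "",
                "stderr": f"timeout after {timeout_s}s"}
```

The docker/daytona/modal backends add stronger isolation and burst compute on top, but they return the same shape of result, which keeps artifacts as the interface.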

What sandboxes do NOT solve

  • Data gravity (large datasets still move)
  • Cold-start latency (spin-up/install time)
  • Cost governance (cloud bills)
  • Infra debug overhead (new failure modes)

Rule of thumb: default local; burst to sandbox/cloud for heavy or messy deps; keep artifacts as the interface.

Why CLI-first: modularity + composability

A hackable, batteries-included toolkit beats reinventing the harness.

Advantages of one CLI contract

  • Composes with shell pipelines, Makefiles, CI
  • Easier debugging from logs + artifacts
  • Easier benchmarking from deterministic commands + bundles
  • Harness swaps: Claude/Codex/LangChain/custom
  • Competition tuning: add tools, adjust routing, pin environments

Composability example

# chain tools via artifacts
uv run ts-agents tool run forecast_theta_with_data \
  --run Re200Rm200 \
  --var bx001_real \
  --param horizon=30 \
  --save outputs/theta.json

# extract and post-process
cat outputs/theta.json | \
  jq '.forecast[0:10]' \
  > outputs/preview.json

# make it repeatable
make demo-forecasting

Contrast: general-purpose CLIs are powerful, but they don’t ship your domain tools.

Open problems + roadmap

Where to invest next to make DS agents feel real.

Agent runtime

  • Async-first execution: queue/progress/cancel/resume
  • Automatic environment resolution: per-tool images/lockfiles + caching
  • Tool routing: hybrid LLM + heuristics + learned rankers
  • Evaluation harness: realistic tasks with latency/cost constraints

UI workbench

  • Artifact browser (plots/reports/tables) + provenance
  • Run history + diffs between experiments
  • Job monitor for long tasks + streaming logs
  • Human-in-the-loop review gates for expensive tools

North star: keep the CLI contract stable while iterating agents + UI around it.

Key takeaways

  1. Invest in CLI tools + SKILL.md runbooks — let mature CLI agents be your harness.

  2. Treat artifacts as first-class outputs — chat is just the control plane.

  3. Dependencies are the #1 silent failure mode — sandboxes reduce pain but add latency/cost.

  4. Long-running compute needs runtime features (background/progress/resume), not bigger prompts.

  5. Tool routing needs domain priors: cost metadata, bundles, and decision trees.

  6. A stable CLI contract keeps the system hackable and benchmark-friendly.

Thanks!

Repo + docs: github.com/fnauman/ts-agents

Suggested live demo prompts

# workflow-first demo (no API key)
uv run ts-agents workflow run forecast-series \
  --input-json '{"series":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]}' \
  --horizon 5

# agent run (requires OPENAI_API_KEY)
uv run ts-agents agent run \
  "Compare forecasting methods for bx001_real" \
  --type deep

# skills export for Claude/Codex
uv run ts-agents skills export --all-agents