ClickHouse is a columnar analytics database optimised for append-only, high-throughput writes and fast aggregation queries. It is the correct technology for LLM trace data because: - Trace rows are immutable: written once, never updated
Milestone 2.5.1 — ClickHouse Schema: llm_traces Table
Status: Planned
Goal: 2.5 — Async trace emission to ClickHouse
Phase: 2 — Single Provider End-to-End
Estimated effort: 1–2 days
ADR required: ADR-0030 — ClickHouse schema design and retention
Why This Milestone Exists
ClickHouse is a columnar analytics database optimised for append-only, high-throughput writes and fast aggregation queries. It is the correct technology for LLM trace data because:
- Trace rows are immutable: written once, never updated
- Query patterns are aggregations: "total tokens by agent this week," "p99 latency by model"
- Write throughput is very high: 1000+ requests/sec per proxy instance = 86M rows/day
- Storage efficiency: columnar compression on repetitive fields (org_id, model, provider) is 10–50× better than row-oriented Postgres
The llm_traces table is the single source of truth for all operational analytics in IBEX Harness.
Branch
feat/m2-5-1-clickhouse-schema
PR Title
feat(infra): ClickHouse llm_traces table schema (m2.5.1)
Schema
-- infra/migrations/clickhouse/001_create_llm_traces.sql
CREATE TABLE IF NOT EXISTS ibex.llm_traces (
-- Request identity
request_id String,
org_id UUID,
agent_id UUID,
session_id Nullable(UUID),
checkpoint_id Nullable(UUID),
-- LLM request metadata
model LowCardinality(String), -- "gpt-4o", "gpt-4o-mini", etc.
provider LowCardinality(String), -- "openai", "anthropic" (Phase 4+)
is_streaming Bool,
-- Token usage (from provider response)
input_tokens UInt32,
output_tokens UInt32,
total_tokens UInt32,
-- Latency breakdown (milliseconds)
auth_latency_ms UInt16, -- time for auth validation (cache or gRPC)
directive_latency_ms UInt16, -- time for directive resolve
provider_ttfb_ms UInt32, -- time to first byte from provider
total_latency_ms UInt32, -- total proxy processing time (not including LLM)
-- Outcome
status_code UInt16, -- HTTP status returned to client
is_complete Bool, -- false if stream terminated early or provider error
error_code LowCardinality(String), -- "" if success
-- Timestamps
requested_at DateTime64(3, 'UTC'), -- when request arrived at proxy
completed_at DateTime64(3, 'UTC'), -- when response was fully sent
-- Partitioning key
event_date Date MATERIALIZED toDate(requested_at)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (org_id, agent_id, requested_at)
TTL event_date + INTERVAL 90 DAY -- configurable; default 90-day retention
SETTINGS index_granularity = 8192;Design decisions documented in ADR-0030:
LowCardinality(String)formodel,provider,error_code— ClickHouse dictionary compression reduces storage 5–10×ORDER BY (org_id, agent_id, requested_at)— primary sort key supports the most common query pattern: "traces for this org's agent in time range"- No content fields — LLM request/response content is never stored in traces (privacy, storage cost)
- 90-day TTL default — configurable per deployment; enterprise customers may need 1-year retention
Acceptance Criteria
- Table created in ClickHouse via migration runner
- Schema includes all fields listed above
- No content fields (prompt text, completion text)
- TTL configured with default 90 days
-
EXPLAINon common query patterns shows primary key usage - ADR-0030 written covering schema choices and retention policy
Last updated on