Milestone 2.5.1 — ClickHouse Schema: `llm_traces` Table

Status: Planned
Goal: 2.5 — Async trace emission to ClickHouse
Phase: 2 — Single Provider End-to-End
Estimated effort: 1–2 days
ADR required: ADR-0030 — ClickHouse schema design and retention

Why This Milestone Exists

ClickHouse is a columnar analytics database optimised for append-only, high-throughput writes and fast aggregation queries. It is the correct technology for LLM trace data because:

Trace rows are immutable: written once, never updated
Query patterns are aggregations: "total tokens by agent this week," "p99 latency by model"
Write throughput is very high: 1000+ requests/sec per proxy instance = 86M rows/day
Storage efficiency: columnar compression on repetitive fields (org_id, model, provider) is 10–50× better than row-oriented Postgres

The llm_traces table is the single source of truth for all operational analytics in IBEX Harness.

Branch

feat/m2-5-1-clickhouse-schema

PR Title

feat(infra): ClickHouse llm_traces table schema (m2.5.1)

Schema

SQL

-- infra/migrations/clickhouse/001_create_llm_traces.sql
CREATE TABLE IF NOT EXISTS ibex.llm_traces (
    -- Request identity
    request_id          String,
    org_id              UUID,
    agent_id            UUID,
    session_id          Nullable(UUID),
    checkpoint_id       Nullable(UUID),
 
    -- LLM request metadata
    model               LowCardinality(String),  -- "gpt-4o", "gpt-4o-mini", etc.
    provider            LowCardinality(String),  -- "openai", "anthropic" (Phase 4+)
    is_streaming        Bool,
 
    -- Token usage (from provider response)
    input_tokens        UInt32,
    output_tokens       UInt32,
    total_tokens        UInt32,
 
    -- Latency breakdown (milliseconds)
    auth_latency_ms     UInt16,   -- time for auth validation (cache or gRPC)
    directive_latency_ms UInt16,  -- time for directive resolve
    provider_ttfb_ms    UInt32,   -- time to first byte from provider
    total_latency_ms    UInt32,   -- total proxy processing time (not including LLM)
 
    -- Outcome
    status_code         UInt16,   -- HTTP status returned to client
    is_complete         Bool,     -- false if stream terminated early or provider error
    error_code          LowCardinality(String),  -- "" if success
 
    -- Timestamps
    requested_at        DateTime64(3, 'UTC'),  -- when request arrived at proxy
    completed_at        DateTime64(3, 'UTC'),  -- when response was fully sent
 
    -- Partitioning key
    event_date          Date MATERIALIZED toDate(requested_at)
 
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (org_id, agent_id, requested_at)
TTL event_date + INTERVAL 90 DAY   -- configurable; default 90-day retention
SETTINGS index_granularity = 8192;

Design decisions documented in ADR-0030:

LowCardinality(String) for model, provider, error_code — ClickHouse dictionary compression reduces storage 5–10×
ORDER BY (org_id, agent_id, requested_at) — primary sort key supports the most common query pattern: "traces for this org's agent in time range"
No content fields — LLM request/response content is never stored in traces (privacy, storage cost)
90-day TTL default — configurable per deployment; enterprise customers may need 1-year retention

Acceptance Criteria

Table created in ClickHouse via migration runner
Schema includes all fields listed above
No content fields (prompt text, completion text)
TTL configured with default 90 days
EXPLAIN on common query patterns shows primary key usage
ADR-0030 written covering schema choices and retention policy