phase 2 single provider

ClickHouse is a columnar analytics database optimised for append-only, high-throughput writes and fast aggregation queries. It is the correct technology for LLM trace data because: - Trace rows are immutable: written once, never updated

Milestone 2.5.1 — ClickHouse Schema: llm_traces Table

Status: Planned
Goal: 2.5 — Async trace emission to ClickHouse
Phase: 2 — Single Provider End-to-End
Estimated effort: 1–2 days
ADR required: ADR-0030 — ClickHouse schema design and retention


Why This Milestone Exists

ClickHouse is a columnar analytics database optimised for append-only, high-throughput writes and fast aggregation queries. It is the correct technology for LLM trace data because:

  • Trace rows are immutable: written once, never updated
  • Query patterns are aggregations: "total tokens by agent this week," "p99 latency by model"
  • Write throughput is very high: 1000+ requests/sec per proxy instance = 86M rows/day
  • Storage efficiency: columnar compression on repetitive fields (org_id, model, provider) is 10–50× better than row-oriented Postgres

The llm_traces table is the single source of truth for all operational analytics in IBEX Harness.


Branch

feat/m2-5-1-clickhouse-schema

PR Title

feat(infra): ClickHouse llm_traces table schema (m2.5.1)


Schema

SQL
-- infra/migrations/clickhouse/001_create_llm_traces.sql
CREATE TABLE IF NOT EXISTS ibex.llm_traces (
    -- Request identity
    request_id          String,
    org_id              UUID,
    agent_id            UUID,
    session_id          Nullable(UUID),
    checkpoint_id       Nullable(UUID),
 
    -- LLM request metadata
    model               LowCardinality(String),  -- "gpt-4o", "gpt-4o-mini", etc.
    provider            LowCardinality(String),  -- "openai", "anthropic" (Phase 4+)
    is_streaming        Bool,
 
    -- Token usage (from provider response)
    input_tokens        UInt32,
    output_tokens       UInt32,
    total_tokens        UInt32,
 
    -- Latency breakdown (milliseconds)
    auth_latency_ms     UInt16,   -- time for auth validation (cache or gRPC)
    directive_latency_ms UInt16,  -- time for directive resolve
    provider_ttfb_ms    UInt32,   -- time to first byte from provider
    total_latency_ms    UInt32,   -- total proxy processing time (not including LLM)
 
    -- Outcome
    status_code         UInt16,   -- HTTP status returned to client
    is_complete         Bool,     -- false if stream terminated early or provider error
    error_code          LowCardinality(String),  -- "" if success
 
    -- Timestamps
    requested_at        DateTime64(3, 'UTC'),  -- when request arrived at proxy
    completed_at        DateTime64(3, 'UTC'),  -- when response was fully sent
 
    -- Partitioning key
    event_date          Date MATERIALIZED toDate(requested_at)
 
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (org_id, agent_id, requested_at)
TTL event_date + INTERVAL 90 DAY   -- configurable; default 90-day retention
SETTINGS index_granularity = 8192;

Design decisions documented in ADR-0030:

  • LowCardinality(String) for model, provider, error_code — ClickHouse dictionary compression reduces storage 5–10×
  • ORDER BY (org_id, agent_id, requested_at) — primary sort key supports the most common query pattern: "traces for this org's agent in time range"
  • No content fields — LLM request/response content is never stored in traces (privacy, storage cost)
  • 90-day TTL default — configurable per deployment; enterprise customers may need 1-year retention

Acceptance Criteria

  • Table created in ClickHouse via migration runner
  • Schema includes all fields listed above
  • No content fields (prompt text, completion text)
  • TTL configured with default 90 days
  • EXPLAIN on common query patterns shows primary key usage
  • ADR-0030 written covering schema choices and retention policy

Edit on GitHub

Last updated on

On this page

0%