Milestone 1.3.2 — Prometheus Metric Catalog and Client Migration

Status: Planned
Goal: 1.3 — Observability baseline
Phase: 1 — Core Platform
Estimated effort: 2–3 days


Why This Milestone Exists

Both services currently expose /metrics using a custom mutex-based counter implementation. That implementation does not produce the Prometheus text exposition format, so a standard Prometheus server cannot scrape it. ARCHITECTURE.md lists 15+ metric names that must exist. This milestone replaces the custom implementation with prometheus/client_golang and defines the canonical metric catalog that all services must implement.


Non-Goals

  • Grafana dashboard configuration (ops tooling, not platform code)
  • ClickHouse-sourced metrics (Phase 3)
  • Per-org or per-agent metric labels (high cardinality — explicitly forbidden)

Branch

chore/m1-3-2-prometheus-metric-catalog

PR Title

chore(obs): prometheus client migration and canonical metric catalog (m1.3.2)


Prerequisites

  • 1.3.1 merged — OTel providers initialized; packages/telemetry exists

Deliverables

1. packages/metrics — canonical metric registry

Go
// Package metrics defines the canonical Prometheus metric registry for
// all IBEX Harness Go services.
//
// Rules:
//   1. All metrics are defined here (single source of truth).
//   2. No metric uses org_id, agent_id, or user_id as a label — these
//      are high-cardinality. Per-entity breakdowns live in ClickHouse.
//   3. Metric names follow the ibex_{service}_{noun}_{unit} convention.
//   4. All histograms use the standard latency buckets below.
//   5. Callers import this package and call functions — they never
//      call prometheus.MustRegister directly.
package metrics
 
import "github.com/prometheus/client_golang/prometheus"
 
// LatencyBuckets are the standard histogram buckets for all latency
// measurements in IBEX Harness. They are tuned for the <20ms proxy
// overhead target: fine-grained at the low end, coarser at the top.
var LatencyBuckets = []float64{
    0.001, 0.005, 0.010, 0.020, 0.050,
    0.100, 0.250, 0.500, 1.000, 5.000,
}

Required metrics — Phase 1 (both services must implement):

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| ibex_proxy_request_duration_seconds | Histogram | route, method, status_code | End-to-end proxy request duration |
| ibex_proxy_requests_total | Counter | route, method, status_code | Total requests to the proxy |
| ibex_proxy_active_connections | Gauge | (none) | Currently open HTTP connections |
| ibex_proxy_rate_limited_total | Counter | result (allowed/denied) | Rate limit check outcomes |
| ibex_proxy_rate_limit_redis_errors_total | Counter | (none) | Redis failures during rate limiting |
| ibex_auth_validate_token_duration_seconds | Histogram | result (ok/error/revoked) | Auth gRPC ValidateToken call duration |
| ibex_auth_validate_agent_duration_seconds | Histogram | result (ok/error/not_found) | Auth gRPC ValidateAgent call duration |
| ibex_auth_grpc_requests_total | Counter | method, status | Auth gRPC call outcomes |
| ibex_db_query_duration_seconds | Histogram | operation | Database query duration |
| ibex_db_pool_open_connections | Gauge | state (in_use/idle) | DB connection pool state |
| ibex_process_up | Gauge | service | 1 if the service is running |

Label value rules:

  • route: the route template (e.g., /v1/chat/completions), NOT the raw request URL; label values must never contain UUIDs or other per-request identifiers
  • status_code: HTTP status code as string ("200", "429", "503")
  • result: a small, bounded enum — never a dynamic string
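
The label rules above could be enforced with small helper functions that turn raw request data into bounded label values. This is a minimal sketch under stated assumptions: the `Result` type, the `knownRoutes` set, and the `"other"` fallback for unmatched routes are hypothetical and not part of the milestone.

```go
package main

import (
	"fmt"
	"strconv"
)

// Result is a closed set of outcome label values (hypothetical type).
// Using typed constants keeps callers from passing dynamic strings.
type Result string

const (
	ResultAllowed Result = "allowed"
	ResultDenied  Result = "denied"
)

// knownRoutes lists the route templates the service exposes. The contents
// here are an assumption for the example.
var knownRoutes = map[string]bool{
	"/v1/chat/completions": true,
}

// RouteLabel returns the route template when it is known and "other"
// otherwise, so unmatched paths (which may contain UUIDs) never become
// label values.
func RouteLabel(template string) string {
	if knownRoutes[template] {
		return template
	}
	return "other"
}

// StatusCodeLabel renders an HTTP status code as a string label value.
func StatusCodeLabel(code int) string {
	return strconv.Itoa(code)
}

// RateLimitResult maps a rate limit decision onto the bounded enum.
func RateLimitResult(allowed bool) Result {
	if allowed {
		return ResultAllowed
	}
	return ResultDenied
}

func main() {
	fmt.Println(RouteLabel("/v1/chat/completions"))
	fmt.Println(RouteLabel("/v1/agents/3f2a9c1e-0000-4000-8000-000000000000"))
	fmt.Println(StatusCodeLabel(429))
	fmt.Println(RateLimitResult(false))
}
```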

2. Replace custom metrics in auth and proxy

  • Delete services/auth/internal/metrics/ custom implementation
  • Delete services/proxy/internal/metrics/ custom implementation
  • Import packages/metrics in both services
  • Wire promhttp.HandlerFor(registry, ...) to the /metrics endpoint in both services

3. CI metric format gate

Add a CI step that:

  1. Starts the proxy in test mode
  2. Scrapes /metrics
  3. Parses the response using the Prometheus text parser
  4. Asserts each required metric from the table above is present
bash
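# ci/check-metrics.sh
set -euo pipefail
# Save the scrape to a file so it can be read twice; a pipe is consumed once.
metrics_file="$(mktemp)"
curl -sf "http://localhost:${PROXY_PORT}/metrics" > "${metrics_file}"
promtool check metrics < "${metrics_file}"
grep -q 'ibex_proxy_request_duration_seconds' "${metrics_file}"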

Testing Requirements

  • TestMetricsEndpoint_Format: scrape /metrics, parse with prometheus/common/expfmt, assert zero parse errors
  • TestMetricsEndpoint_RequiredMetrics: assert each metric in the catalog table above is present in the output
  • TestMetricLabels_NoHighCardinality: lint test that asserts no label named org_id, agent_id, user_id, or session_id appears in any registered metric
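
The assertion logic behind the last two tests can be sketched without any dependencies. This example deliberately swaps the expfmt parser for a simple scan of `# TYPE` lines and takes label names as a plain map, so it is a stand-in for illustration only; a real test would presumably feed these checks from the parsed scrape or from the registry's gathered metric families. All function names here are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// forbiddenLabels are the high-cardinality label names named in the
// TestMetricLabels_NoHighCardinality requirement.
var forbiddenLabels = map[string]bool{
	"org_id": true, "agent_id": true, "user_id": true, "session_id": true,
}

// familyNames collects metric family names from "# TYPE <name> <type>"
// lines. It is a simplified stand-in for a full exposition parser.
func familyNames(exposition string) map[string]bool {
	names := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 4 && fields[0] == "#" && fields[1] == "TYPE" {
			names[fields[2]] = true
		}
	}
	return names
}

// missingMetrics returns the required metric names absent from the scrape.
func missingMetrics(exposition string, required []string) []string {
	present := familyNames(exposition)
	var missing []string
	for _, name := range required {
		if !present[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

// highCardinalityLabels returns a sorted "metric:label" entry for every
// forbidden label found in the metric-to-label-names map.
func highCardinalityLabels(labelsByMetric map[string][]string) []string {
	var bad []string
	for metric, labels := range labelsByMetric {
		for _, label := range labels {
			if forbiddenLabels[label] {
				bad = append(bad, metric+":"+label)
			}
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	scrape := "# HELP ibex_process_up 1 if the service is running.\n" +
		"# TYPE ibex_process_up gauge\n" +
		"ibex_process_up{service=\"proxy\"} 1\n"

	fmt.Println(missingMetrics(scrape, []string{"ibex_process_up", "ibex_proxy_requests_total"}))
	fmt.Println(highCardinalityLabels(map[string][]string{
		"ibex_proxy_requests_total": {"route", "method", "status_code"},
		"ibex_bad_example_total":    {"org_id"},
	}))
}
```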

Acceptance Criteria

  • /metrics on both services returns valid Prometheus text format
  • All 11 metrics in the catalog exist and are registered
  • No org_id, agent_id, or user_id labels on any metric
  • Custom mutex metrics implementation deleted from both services
  • CI gate (promtool check metrics) passes
  • packages/metrics is the single registration point — no MustRegister calls outside this package