Milestone 1.3.2 — Prometheus Metric Catalog and Client Migration
Status: Planned
Goal: 1.3 — Observability baseline
Phase: 1 — Core Platform
Estimated effort: 2–3 days
Why This Milestone Exists
Both services currently expose /metrics using a custom mutex-based counter implementation. This implementation does not produce the Prometheus text exposition format and cannot be scraped by a standard Prometheus server. The ARCHITECTURE.md lists 15+ metric names that must exist. This milestone replaces the custom implementation with prometheus/client_golang and defines the canonical metric catalog that all services must implement.
Non-Goals
- Grafana dashboard configuration (ops tooling, not platform code)
- ClickHouse-sourced metrics (Phase 3)
- Per-org or per-agent metric labels (high cardinality — explicitly forbidden)
Branch
chore/m1-3-2-prometheus-metric-catalog
PR Title
chore(obs): prometheus client migration and canonical metric catalog (m1.3.2)
Prerequisites
- 1.3.1 merged (OTel providers initialized); packages/telemetry exists
Deliverables
1. packages/metrics — canonical metric registry
// Package metrics defines the canonical Prometheus metric registry for
// all IBEX Harness Go services.
//
// Rules:
// 1. All metrics are defined here (single source of truth).
// 2. No metric uses org_id, agent_id, or user_id as a label — these
// are high-cardinality. Per-entity breakdowns live in ClickHouse.
// 3. Metric names follow the ibex_{service}_{noun}_{unit} convention.
// 4. All histograms use the standard latency buckets below.
// 5. Callers import this package and call functions — they never
// call prometheus.MustRegister directly.
package metrics
import "github.com/prometheus/client_golang/prometheus"
// LatencyBuckets are the standard histogram buckets for all latency
// measurements in IBEX Harness. They are tuned for the <20ms proxy
// overhead target: fine-grained at the low end, coarser at the top.
var LatencyBuckets = []float64{
	0.001, 0.005, 0.010, 0.020, 0.050,
	0.100, 0.250, 0.500, 1.000, 5.000,
}
Required metrics, Phase 1 (both services must implement):
| Metric Name | Type | Labels | Description |
|---|---|---|---|
| ibex_proxy_request_duration_seconds | Histogram | route, method, status_code | End-to-end proxy request duration |
| ibex_proxy_requests_total | Counter | route, method, status_code | Total requests to the proxy |
| ibex_proxy_active_connections | Gauge | (none) | Currently open HTTP connections |
| ibex_proxy_rate_limited_total | Counter | result (allowed/denied) | Rate limit check outcomes |
| ibex_proxy_rate_limit_redis_errors_total | Counter | (none) | Redis failures during rate limiting |
| ibex_auth_validate_token_duration_seconds | Histogram | result (ok/error/revoked) | Auth gRPC ValidateToken call duration |
| ibex_auth_validate_agent_duration_seconds | Histogram | result (ok/error/not_found) | Auth gRPC ValidateAgent call duration |
| ibex_auth_grpc_requests_total | Counter | method, status | Auth gRPC call outcomes |
| ibex_db_query_duration_seconds | Histogram | operation | Database query duration |
| ibex_db_pool_open_connections | Gauge | state (in_use/idle) | DB connection pool state |
| ibex_process_up | Gauge | service | 1 if the service is running |
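The catalog rules can be checked mechanically if the table is also kept as data. The following is a minimal sketch using only the standard library; MetricSpec and Lint are hypothetical names for illustration and are not part of packages/metrics. It checks only two rules stated in this document: the ibex_ name prefix and the ban on high-cardinality label names.

```go
package main

import (
	"fmt"
	"strings"
)

// MetricSpec is a hypothetical, illustration-only description of one catalog row.
type MetricSpec struct {
	Name   string
	Type   string   // "counter", "gauge", or "histogram"
	Labels []string // label names only, never label values
}

// forbiddenLabels holds the high-cardinality label names banned by the catalog rules.
var forbiddenLabels = map[string]bool{
	"org_id":     true,
	"agent_id":   true,
	"user_id":    true,
	"session_id": true,
}

// Lint returns one message per rule violation found in specs.
func Lint(specs []MetricSpec) []string {
	var problems []string
	for _, s := range specs {
		if !strings.HasPrefix(s.Name, "ibex_") {
			problems = append(problems, fmt.Sprintf("%s: name must start with ibex_", s.Name))
		}
		for _, l := range s.Labels {
			if forbiddenLabels[l] {
				problems = append(problems, fmt.Sprintf("%s: forbidden label %q", s.Name, l))
			}
		}
	}
	return problems
}

func main() {
	// Three rows copied from the catalog table; the remaining rows follow the same shape.
	specs := []MetricSpec{
		{Name: "ibex_proxy_requests_total", Type: "counter", Labels: []string{"route", "method", "status_code"}},
		{Name: "ibex_db_query_duration_seconds", Type: "histogram", Labels: []string{"operation"}},
		{Name: "ibex_process_up", Type: "gauge", Labels: []string{"service"}},
	}
	for _, p := range Lint(specs) {
		fmt.Println(p)
	}
}
```

Keeping the specs as plain data means the same slice could drive both registration and the lint test, though that design is an assumption and not something this milestone mandates.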
Label value rules:
- route: route template (e.g., /v1/chat/completions), NOT the raw URL (never include UUIDs)
- status_code: HTTP status code as a string ("200", "429", "503")
- result: a small, bounded enum, never a dynamic string
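One way to keep label values bounded is to funnel them through small helpers. This is a sketch under assumptions: the type and function names are hypothetical, and the "unmatched" fallback for requests that hit no known route is an illustrative choice, not something this document specifies.

```go
package main

import (
	"fmt"
	"strconv"
)

// RateLimitResult is a closed set of values for the "result" label on
// ibex_proxy_rate_limited_total. Using a named type discourages ad hoc strings.
type RateLimitResult string

const (
	RateLimitAllowed RateLimitResult = "allowed"
	RateLimitDenied  RateLimitResult = "denied"
)

// StatusCodeLabel renders an HTTP status code in the string form used by
// the status_code label, e.g. 429 becomes "429".
func StatusCodeLabel(code int) string {
	return strconv.Itoa(code)
}

// RouteLabel returns the matched route template. If the router reported no
// template, it returns a fixed placeholder (an assumption of this sketch)
// so that a raw URL path, which may contain UUIDs, never becomes a label value.
func RouteLabel(template string) string {
	if template == "" {
		return "unmatched"
	}
	return template
}

func main() {
	fmt.Println(RouteLabel("/v1/chat/completions"), StatusCodeLabel(429), RateLimitDenied)
}
```

The caller is expected to pass the template string from its router, never the request path itself.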
2. Replace custom metrics in auth and proxy
- Delete the custom implementation in services/auth/internal/metrics/
- Delete the custom implementation in services/proxy/internal/metrics/
- Import packages/metrics in both services
- Wire promhttp.HandlerFor(registry, ...) to the /metrics endpoint in both
3. CI metric format gate
Add a CI step that:
- Starts the proxy in test mode
- Scrapes /metrics
- Parses the response using the Prometheus text parser
- Asserts each required metric from the table above is present
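# ci/check-metrics.sh
# Save the scrape once: promtool and grep cannot both read the same piped stdin.
metrics="$(curl -sf "http://localhost:${PROXY_PORT}/metrics")" || exit 1
printf '%s\n' "$metrics" | promtool check metrics || exit 1
printf '%s\n' "$metrics" | grep -q 'ibex_proxy_request_duration_seconds'
Testing Requirements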
- TestMetricsEndpoint_Format: scrape /metrics, parse with github.com/prometheus/common/expfmt, assert zero parse errors
- TestMetricsEndpoint_RequiredMetrics: assert each metric in the catalog table above is present in the output
- TestMetricLabels_NoHighCardinality: lint test that asserts no label named org_id, agent_id, user_id, or session_id appears in any registered metric
Acceptance Criteria
- /metrics on both services returns valid Prometheus text format
- All 11 metrics in the catalog exist and are registered
- No org_id, agent_id, or user_id label values in any metric
- Custom mutex metrics implementation deleted from both services
- CI gate (promtool check metrics) passes
- packages/metrics is the single registration point: no MustRegister calls outside this package