ADR-0021: Prometheus Metric Catalog (Phase 1)

Architecture decision record 0021.

Status: Accepted
Date: 2026-06-07
Milestone: 1.3.2

Context

Before this decision, the auth and proxy services each exposed /metrics through their own custom, mutex-based implementation. The two implementations disagreed on metric names and histogram buckets, and both used raw URL paths (including UUIDs) as label values, which produced high-cardinality series. Cursor rules (18-observability.mdc, 29-ibex-packages.mdc) require a shared packages/metrics registry.

M1.3.1 initialized OTel tracing (ADR-0019). Phase 1 Prometheus exposition uses prometheus/client_golang directly — not the OTel meter bridge.

Decision

1) Single registration point

All Prometheus metrics are defined and registered in packages/metrics. Services call exported methods on *ProxyRegistry or *AuthRegistry; they never call prometheus.MustRegister directly.

2) Naming convention

ibex_{service}_{noun}_{unit} — e.g. ibex_proxy_request_duration_seconds.
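
The convention can be checked mechanically. The following is a minimal sketch, not the project's linter: checkMetricName and namePattern are hypothetical names, and the suffix rules (_total for counters, _seconds for latency histograms) are general Prometheus practice inferred from the catalog rather than rules this ADR states.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// namePattern accepts lowercase snake_case names that start with "ibex_"
// and have at least three segments. Illustrative only.
var namePattern = regexp.MustCompile(`^ibex_[a-z]+(_[a-z0-9]+)+$`)

// checkMetricName reports whether name fits the convention for the given
// metric type ("counter", "histogram", or "gauge").
func checkMetricName(name, metricType string) error {
	if !namePattern.MatchString(name) {
		return fmt.Errorf("%q does not match ibex_{service}_{noun}_{unit}", name)
	}
	switch metricType {
	case "counter":
		// Assumption: counters end in _total, as every counter in the catalog does.
		if !strings.HasSuffix(name, "_total") {
			return fmt.Errorf("counter %q should end in _total", name)
		}
	case "histogram":
		// Assumption: latency histograms are measured in seconds.
		if !strings.HasSuffix(name, "_seconds") {
			return fmt.Errorf("latency histogram %q should end in _seconds", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkMetricName("ibex_proxy_request_duration_seconds", "histogram"))
	fmt.Println(checkMetricName("ibex_proxy_requests", "counter"))
	fmt.Println(checkMetricName("http_requests_total", "counter"))
}
```

A check like this could run in a unit test over every name registered in packages/metrics, so a misnamed metric fails CI before it is scraped.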

3) Phase 1 catalog

| Metric | Type | Labels | Service |
| --- | --- | --- | --- |
| ibex_proxy_request_duration_seconds | Histogram | route, method, status_code | proxy |
| ibex_proxy_requests_total | Counter | route, method, status_code | proxy |
| ibex_proxy_active_connections | Gauge | (none) | proxy |
| ibex_proxy_rate_limited_total | Counter | result (allowed/denied) | proxy |
| ibex_proxy_rate_limit_redis_errors_total | Counter | (none) | proxy |
| ibex_auth_validate_token_duration_seconds | Histogram | result (ok/error/revoked) | auth |
| ibex_auth_validate_agent_duration_seconds | Histogram | result (ok/error/not_found) | auth |
| ibex_auth_grpc_requests_total | Counter | method, status | auth |
| ibex_auth_http_request_duration_seconds | Histogram | route, method, status_code | auth |
| ibex_auth_http_requests_total | Counter | route, method, status_code | auth |
| ibex_db_query_duration_seconds | Histogram | operation | auth |
| ibex_db_pool_open_connections | Gauge | state (in_use/idle) | auth |
| ibex_process_up | Gauge | service | both |

4) Label rules

  • route: ServeMux route template (r.Pattern, a field available since Go 1.23; the method-and-wildcard pattern syntax dates from Go 1.22), recorded after ServeMux dispatch. Never raw r.URL.Path.
  • status_code: HTTP status as string ("200", "429").
  • result: Bounded enums only — never dynamic error strings.
  • Forbidden labels: org_id, agent_id, user_id, session_id — per-entity breakdowns belong in ClickHouse (Phase 3).
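
The label rules above can be sketched as a few small helpers. All function names here are hypothetical, and the "unmatched" fallback for requests that hit no route is an assumption; the ADR only says that the raw path must never be used.

```go
package main

import (
	"fmt"
	"strconv"
)

// forbiddenLabels lists the per-entity label names the ADR bans.
var forbiddenLabels = map[string]bool{
	"org_id": true, "agent_id": true, "user_id": true, "session_id": true,
}

// validateLabelNames returns an error if any label name is forbidden.
func validateLabelNames(names ...string) error {
	for _, n := range names {
		if forbiddenLabels[n] {
			return fmt.Errorf("label %q is forbidden: per-entity cardinality", n)
		}
	}
	return nil
}

// statusCodeLabel renders an HTTP status as a string label value.
func statusCodeLabel(code int) string { return strconv.Itoa(code) }

// rateLimitResult collapses a decision into the bounded enum used by
// ibex_proxy_rate_limited_total.
func rateLimitResult(allowed bool) string {
	if allowed {
		return "allowed"
	}
	return "denied"
}

// routeLabel returns the matched route template. When no route matched it
// returns a fixed placeholder (an assumption) instead of the raw URL path.
func routeLabel(pattern string) string {
	if pattern == "" {
		return "unmatched"
	}
	return pattern
}

func main() {
	fmt.Println(validateLabelNames("route", "method", "status_code"))
	fmt.Println(validateLabelNames("route", "org_id"))
	fmt.Println(statusCodeLabel(429), rateLimitResult(false), routeLabel(""))
}
```

Each helper maps an open-ended input onto a small fixed set of values, which keeps the number of time series per metric bounded.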

5) Histogram buckets

All latency histograms use packages/metrics.LatencyBuckets, with upper bounds in seconds:

0.001, 0.005, 0.010, 0.020, 0.050, 0.100, 0.250, 0.500, 1.000, 5.000

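Tuned for the <20 ms proxy overhead target: half of the bounds sit at or below 50 ms, and 0.020 is itself a bucket boundary, so the share of requests within the target can be read directly from that bucket.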
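
To make the bucket semantics concrete, here is a small sketch of how an observation maps to a bucket. Prometheus histogram buckets are cumulative with inclusive upper bounds, so an observation counts toward the smallest bound that is greater than or equal to it. bucketFor is a hypothetical helper for illustration; the real bucketing is done by the Prometheus client.

```go
package main

import (
	"fmt"
	"sort"
)

// LatencyBuckets mirrors the upper bounds listed in the ADR, in seconds.
var LatencyBuckets = []float64{0.001, 0.005, 0.010, 0.020, 0.050, 0.100, 0.250, 0.500, 1.000, 5.000}

// bucketFor returns the smallest upper bound that is >= seconds.
// ok is false when the observation exceeds every bound and therefore
// lands only in the implicit +Inf bucket.
func bucketFor(seconds float64) (le float64, ok bool) {
	i := sort.SearchFloat64s(LatencyBuckets, seconds)
	if i == len(LatencyBuckets) {
		return 0, false
	}
	return LatencyBuckets[i], true
}

func main() {
	for _, v := range []float64{0.018, 0.020, 0.021, 7} {
		le, ok := bucketFor(v)
		fmt.Println(v, le, ok)
	}
}
```

An 18 ms request and a 20 ms request both fall in the 0.020 bucket, while a 21 ms request falls in the 0.050 bucket, which is why the target boundary needs to be an exact bucket bound.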

6) Exposition

/metrics on each service uses promhttp.HandlerFor(registry, promhttp.HandlerOpts{}). Content-Type is set by promhttp.

7) Middleware order

Proxy (outer → inner): RequestContext → Span → metrics → ResponseHeaders → logging → mux.

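Metrics middleware must sit inside RequestContext and Span. Both call r.WithContext, which returns a shallow copy of the request, and http.ServeMux sets r.Pattern only on the request pointer it receives. A metrics middleware placed outside them would hold the original pointer and never see the pattern. See TestHTTPMiddleware_RecordsRouteTemplate.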

Auth HTTP: AuthHTTPMiddleware records ibex_auth_http_* on the auth HTTP router (health, metrics, etc.).

8) ValidateAgent status vs metric labels

ValidateAgent returns gRPC PermissionDenied when the agent record is missing or belongs to another org (GetByIDAndOrg returns nil). This follows multi-tenant isolation rules (no NotFound that leaks cross-org existence). The histogram label result=not_found is used for observability only and does not mirror the gRPC status code.

9) Validate timing location

ibex_auth_validate_token_duration_seconds and ibex_auth_validate_agent_duration_seconds are recorded on the auth gRPC server. Proxy-side ibex_proxy_auth_validate_* and ibex_proxy_agent_validate_* metrics are retired (supersedes ADR-0011 §8 proxy metric names).

Rate-limit metrics deferred in ADR-0015 §7 are implemented in M1.3.2.

Consequences

Positive

  • Standard Prometheus scrape format; CI can validate the output with the expfmt parser
  • Single catalog; consistent buckets and naming
  • Route-template labels align with OTel span http.route

Negative

  • Breaking change for anyone scraping old ibex_http_* or proxy validate metric names
  • Token admin counters (ibex_auth_token_*) removed from Phase 1 catalog scope

References

  • Milestone 1.3.2
  • ADR-0019
  • ADR-0015
