Observability
OpenTelemetry tracing, Prometheus metrics, and structured JSON logs for Phase 1 Go services.
Phase 1 ships observability for the auth and proxy Go services: W3C distributed tracing, a bounded Prometheus catalog, and structured JSON logs correlated by request_id and trace_id. Python services and full SLO alerting expand in later phases.
Correlation identifiers
Every inbound HTTP request carries:
| ID | Source | Propagation |
|---|---|---|
| request_id | packages/reqid (UUID v7) | X-Request-ID header; all log lines in request scope |
| trace_id | OpenTelemetry span context | W3C traceparent; X-Trace-ID response header |
| org_id | Verified token (when authenticated) | Logs and gRPC metadata; never as a metric label |
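To make the trace_id row concrete, the following is a minimal standard-library sketch of how a trace ID can be read out of a W3C traceparent value. In the services themselves this is handled by the OpenTelemetry propagator; the helper names `traceIDFromTraceparent` and `isLowerHex` are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// isLowerHex reports whether s consists only of the characters 0-9 and a-f.
func isLowerHex(s string) bool {
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// traceIDFromTraceparent (hypothetical helper) extracts the trace-id field
// from a W3C traceparent value of the form version-traceid-parentid-flags.
func traceIDFromTraceparent(header string) (string, bool) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) < 4 {
		return "", false
	}
	version, traceID, parentID, flags := parts[0], parts[1], parts[2], parts[3]
	if len(version) != 2 || len(traceID) != 32 || len(parentID) != 16 || len(flags) != 2 {
		return "", false
	}
	if !isLowerHex(version) || !isLowerHex(traceID) || !isLowerHex(parentID) || !isLowerHex(flags) {
		return "", false
	}
	// Version ff is invalid, and version 00 carries exactly four fields.
	if version == "ff" || (version == "00" && len(parts) != 4) {
		return "", false
	}
	// All-zero trace and parent IDs are invalid.
	if strings.Trim(traceID, "0") == "" || strings.Trim(parentID, "0") == "" {
		return "", false
	}
	return traceID, true
}

func main() {
	id, ok := traceIDFromTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(id, ok)
}
```

The returned value is the same 32-character hex string that would appear as trace_id in logs and in the X-Trace-ID response header.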
OpenTelemetry
Both services initialize OTel via packages/telemetry at startup:
| Parameter | Type | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | host:port | OTLP gRPC collector (e.g. localhost:4317). Empty = noop exporter. |
| OTEL_SAMPLE_RATIO | float | Root span sampling ratio via ParentBased + TraceIDRatio. Default: 0.01 |
| OTEL_SERVICE_NAME | string | Falls back to IBEX_SERVICE_NAME when unset. |
| OTEL_DEPLOYMENT_ENVIRONMENT | string | Falls back to IBEX_ENV (development, staging, production). |
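The fallback and default rules in the table above can be sketched with the standard library alone. This is not the packages/telemetry implementation: the `telemetryConfig` struct, `loadTelemetryConfig`, and `firstNonEmpty` are hypothetical names, and rejecting ratios outside [0, 1] is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// telemetryConfig is a hypothetical container for the documented parameters.
type telemetryConfig struct {
	Endpoint    string  // empty means a noop exporter
	SampleRatio float64 // root span sampling ratio
	ServiceName string
	Environment string
}

// firstNonEmpty returns the first non-empty value, or "" if all are empty.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// loadTelemetryConfig reads the parameters through lookup (os.Getenv in a
// real process) and applies the documented fallbacks and default ratio.
func loadTelemetryConfig(lookup func(string) string) (telemetryConfig, error) {
	cfg := telemetryConfig{
		Endpoint:    lookup("OTEL_EXPORTER_OTLP_ENDPOINT"),
		SampleRatio: 0.01, // documented default
		ServiceName: firstNonEmpty(lookup("OTEL_SERVICE_NAME"), lookup("IBEX_SERVICE_NAME")),
		Environment: firstNonEmpty(lookup("OTEL_DEPLOYMENT_ENVIRONMENT"), lookup("IBEX_ENV")),
	}
	if raw := lookup("OTEL_SAMPLE_RATIO"); raw != "" {
		ratio, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return cfg, fmt.Errorf("OTEL_SAMPLE_RATIO: %w", err)
		}
		// Assumption: values outside [0, 1] (including NaN) are rejected.
		if !(ratio >= 0 && ratio <= 1) {
			return cfg, fmt.Errorf("OTEL_SAMPLE_RATIO: %v is outside [0, 1]", ratio)
		}
		cfg.SampleRatio = ratio
	}
	return cfg, nil
}

func main() {
	cfg, err := loadTelemetryConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the fallback logic easy to exercise with a plain map in tests.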
HTTP spans are named {method} {route_template} (e.g. POST /v1/chat/completions) — never raw paths with UUIDs. Proxy gRPC auth calls propagate trace context via otelgrpc client interceptors.
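As a small illustration of the span naming rule, the sketch below builds the name from the method and the matched route template. `spanName` is a hypothetical helper, the second route is made up for the example, and falling back to the method alone when no route matched is an assumption, not documented behavior.

```go
package main

import "fmt"

// spanName (hypothetical helper) builds a low-cardinality span name from the
// HTTP method and the matched route template. When no route matched, it
// falls back to the method alone so raw paths never end up in span names.
func spanName(method, routeTemplate string) string {
	if routeTemplate == "" {
		return method
	}
	return method + " " + routeTemplate
}

func main() {
	fmt.Println(spanName("POST", "/v1/chat/completions"))
	fmt.Println(spanName("GET", "/v1/keys/{key_id}")) // hypothetical route template
	fmt.Println(spanName("GET", ""))                  // unmatched route
}
```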
Tail-based 100% export for 5xx traces requires a collector-side sampler (Phase 2). Phase 1 marks span status ERROR on HTTP status ≥ 500 for sampled spans.
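One way to implement the status rule is a response writer wrapper that remembers the status code and fires a hook for values of 500 and above. The sketch below uses only the standard library; `statusRecorder`, `withErrorStatus`, and `markedAsError` are hypothetical names. In a real service the hook would presumably call `span.SetStatus(codes.Error, ...)` on the active OpenTelemetry span.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusRecorder wraps a ResponseWriter and remembers the first status code.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	if r.status == 0 {
		r.status = code
	}
	r.ResponseWriter.WriteHeader(code)
}

func (r *statusRecorder) Write(b []byte) (int, error) {
	if r.status == 0 {
		r.status = http.StatusOK // implicit status when the body is written first
	}
	return r.ResponseWriter.Write(b)
}

// withErrorStatus calls markError after the handler returns if the recorded
// status is 500 or higher.
func withErrorStatus(next http.Handler, markError func(status int)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w}
		next.ServeHTTP(rec, req)
		if rec.status >= 500 {
			markError(rec.status)
		}
	})
}

// markedAsError runs a handler that replies with code through the middleware
// and reports whether the error hook fired.
func markedAsError(code int) bool {
	marked := false
	h := withErrorStatus(
		http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { w.WriteHeader(code) }),
		func(int) { marked = true },
	)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return marked
}

func main() {
	for _, code := range []int{200, 404, 500, 503} {
		fmt.Println(code, markedAsError(code))
	}
}
```

Client errors (4xx) leave the hook untouched, which matches the rule that only HTTP status ≥ 500 marks a span as ERROR.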
Prometheus metrics
Scrape GET /metrics on each service. All metrics register in packages/metrics — services never call prometheus.MustRegister directly.
Phase 1 catalog (selected)
| Metric | Type | Service |
|---|---|---|
| ibex_proxy_request_duration_seconds | Histogram | proxy |
| ibex_proxy_requests_total | Counter | proxy |
| ibex_proxy_rate_limited_total | Counter | proxy |
| ibex_auth_validate_token_duration_seconds | Histogram | auth |
| ibex_auth_grpc_requests_total | Counter | auth |
| ibex_db_query_duration_seconds | Histogram | auth |
| ibex_process_up | Gauge | both |
Histogram buckets target the <20ms proxy overhead budget: 0.001 … 5.000 seconds.
curl -s http://localhost:8080/metrics | grep ibex_proxy
Structured logging
Logs use packages/logger — JSON to stdout in production. Every line in a request scope includes request_id; when a span is active, trace_id is injected automatically.
| Level | When to use |
|---|---|
| DEBUG | Local debugging only; never per-request in production |
| INFO | Lifecycle events: startup, config loaded, migration applied |
| WARN | Recoverable anomalies: Redis fail-open, retry succeeded |
| ERROR | Requires action: DB failure, auth service unreachable |
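The request-scoped fields described in this section could be attached along the following lines. The sketch uses the standard library's log/slog rather than packages/logger, whose API is not shown here; the context keys, `scopedLogger`, and `renderLine` are hypothetical, and reading the IDs from context values is an assumption about how they are carried.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"os"
	"strings"
)

type ctxKey string

// Hypothetical context keys for the correlation identifiers.
const (
	requestIDKey ctxKey = "request_id"
	traceIDKey   ctxKey = "trace_id"
)

// scopedLogger returns a JSON logger that stamps the correlation IDs found
// in ctx on every line. trace_id is added only when one is present.
func scopedLogger(ctx context.Context, w io.Writer) *slog.Logger {
	logger := slog.New(slog.NewJSONHandler(w, nil))
	if id, ok := ctx.Value(requestIDKey).(string); ok && id != "" {
		logger = logger.With("request_id", id)
	}
	if id, ok := ctx.Value(traceIDKey).(string); ok && id != "" {
		logger = logger.With("trace_id", id)
	}
	return logger
}

// renderLine logs one WARN line into a buffer and decodes it again, which
// makes the resulting JSON fields easy to inspect.
func renderLine(requestID, traceID, msg string) map[string]any {
	ctx := context.WithValue(context.Background(), requestIDKey, requestID)
	ctx = context.WithValue(ctx, traceIDKey, traceID)
	var buf strings.Builder
	scopedLogger(ctx, &buf).Warn(msg)
	fields := map[string]any{}
	if err := json.Unmarshal([]byte(buf.String()), &fields); err != nil {
		panic(err)
	}
	return fields
}

func main() {
	fields := renderLine("req-123", "4bf92f3577b34da6a3ce929d0e0e4736", "redis fail-open")
	fmt.Println(fields["level"], fields["request_id"], fields["trace_id"], fields["msg"])

	// Outside a request scope the logger carries no correlation fields.
	scopedLogger(context.Background(), os.Stdout).Info("startup complete")
}
```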
Log level configuration
| Parameter | Type | Description |
|---|---|---|
| IBEX_LOG_LEVEL | enum | One of DEBUG, INFO, WARN, ERROR. Default: INFO |
| IBEX_LOG_FORMAT | string | json in production. Default: json |
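A minimal sketch of how IBEX_LOG_LEVEL might be mapped onto a logger, again using log/slog as a stand-in for packages/logger. `parseLevel` is a hypothetical helper; accepting lowercase values and reporting unknown values as errors are assumptions.

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
	"strings"
)

// parseLevel (hypothetical helper) maps an IBEX_LOG_LEVEL value to a slog
// level. An empty value selects the documented default, INFO.
func parseLevel(raw string) (slog.Level, error) {
	switch strings.ToUpper(strings.TrimSpace(raw)) {
	case "", "INFO":
		return slog.LevelInfo, nil
	case "DEBUG":
		return slog.LevelDebug, nil
	case "WARN":
		return slog.LevelWarn, nil
	case "ERROR":
		return slog.LevelError, nil
	default:
		return slog.LevelInfo, fmt.Errorf("IBEX_LOG_LEVEL: unknown level %q", raw)
	}
}

func main() {
	level, err := parseLevel(os.Getenv("IBEX_LOG_LEVEL"))
	if err != nil {
		// On an unknown value this sketch reports it and continues at INFO.
		fmt.Fprintln(os.Stderr, err)
	}
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))
	logger.Debug("emitted only when the level is DEBUG")
	logger.Info("config loaded")
}
```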
Middleware order (proxy)
Outer → inner: RequestContext → Span → metrics → ResponseHeaders → logging → mux. Auth middleware runs inside the mux after span creation so route templates and trace context are available to metrics and logs.
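The outer-to-inner ordering can be sketched with a small chaining helper in which the first middleware listed becomes the outermost wrapper. The layers below are placeholders that only record when they are entered; `chain`, `named`, and `entryOrder` are hypothetical names, not the proxy's actual middleware.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type middleware func(http.Handler) http.Handler

// chain wraps h so that the first middleware listed is the outermost one.
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// named returns a placeholder middleware that records its name when entered.
func named(name string, order *[]string) middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*order = append(*order, name)
			next.ServeHTTP(w, r)
		})
	}
}

// entryOrder sends one request through the documented stack and returns the
// order in which the layers were entered.
func entryOrder() []string {
	var order []string
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		order = append(order, "mux")
	})
	h := chain(mux,
		named("RequestContext", &order),
		named("Span", &order),
		named("metrics", &order),
		named("ResponseHeaders", &order),
		named("logging", &order),
	)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return order
}

func main() {
	fmt.Println(entryOrder())
	// prints: [RequestContext Span metrics ResponseHeaders logging mux]
}
```

Because each layer calls the next one from inside its own handler, work done after `next.ServeHTTP` returns runs in the reverse order, innermost first.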
Optional error tracking
Sentry DSN via SENTRY_DSN is optional in development. Link Sentry events to trace_id and request_id when enabled.
Local verification
Confirm metrics endpoints
curl -s http://localhost:8080/metrics and curl -s http://localhost:8081/metrics should each return Prometheus text.
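For a scripted version of this check, the sketch below pulls the distinct metric names out of Prometheus text exposition output. `metricNames` and `hasMetric` are hypothetical helpers, and the sample text (including its label) is illustrative input, not real service output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// metricNames returns the distinct sample names found in Prometheus text
// exposition output, in first-seen order. Comment lines (# HELP, # TYPE)
// and blank lines are skipped.
func metricNames(text string) []string {
	seen := map[string]bool{}
	var names []string
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The name ends at the label block or at the first whitespace.
		name := line
		if i := strings.IndexAny(line, "{ \t"); i >= 0 {
			name = line[:i]
		}
		if !seen[name] {
			seen[name] = true
			names = append(names, name)
		}
	}
	return names
}

// hasMetric reports whether name appears among the samples in text.
func hasMetric(text, name string) bool {
	for _, n := range metricNames(text) {
		if n == name {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative sample only; the label is hypothetical.
	sample := "# TYPE ibex_process_up gauge\n" +
		"ibex_process_up 1\n" +
		"# TYPE ibex_proxy_requests_total counter\n" +
		"ibex_proxy_requests_total{status=\"200\"} 42\n"
	fmt.Println(metricNames(sample))
	fmt.Println(hasMetric(sample, "ibex_process_up"))
}
```

Against a running service, the text would come from the body of a GET request to the /metrics endpoint instead of a string literal.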
Send a traced request
Include X-Request-ID on a proxy call; search service stdout for that ID.
Optional collector
Point OTEL_EXPORTER_OTLP_ENDPOINT at Jaeger or Tempo to visualize spans locally.