ibexharness

Observability

OpenTelemetry tracing, Prometheus metrics, and structured JSON logs for Phase 1 Go services.

Phase 1 ships observability for the auth and proxy Go services: W3C distributed tracing, a bounded Prometheus catalog, and structured JSON logs correlated by request_id and trace_id. Python services and full SLO alerting are planned for later phases.

Phase 1 stack

Auth and proxy expose /metrics. OTel exports to OTLP when OTEL_EXPORTER_OTLP_ENDPOINT is set; otherwise spans exist for in-process propagation only (ADR-0019).

Correlation identifiers

Every inbound HTTP request carries:

| ID | Source | Propagation |
| --- | --- | --- |
| request_id | packages/reqid (UUID v7) | X-Request-ID header; all log lines in request scope |
| trace_id | OpenTelemetry span context | W3C traceparent; X-Trace-ID response header |
| org_id | Verified token (when authenticated) | Logs and gRPC metadata; never as a metric label |
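
The request_id row can be illustrated with a small standard-library sketch. It is not the packages/reqid implementation: the names newUUIDv7, requestID, requestIDFrom, and roundTrip are hypothetical, and echoing the ID on the response is an assumption made for the demo.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// ctxKey is unexported so the context key cannot collide with other packages.
type ctxKey struct{}

// newUUIDv7 builds an RFC 9562 version 7 UUID: a 48-bit Unix millisecond
// timestamp followed by random bits, with the version and variant fields set.
func newUUIDv7(now time.Time) (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[6:]); err != nil {
		return "", err
	}
	ms := uint64(now.UnixMilli())
	b[0], b[1], b[2] = byte(ms>>40), byte(ms>>32), byte(ms>>24)
	b[3], b[4], b[5] = byte(ms>>16), byte(ms>>8), byte(ms)
	b[6] = (b[6] & 0x0f) | 0x70 // version 7
	b[8] = (b[8] & 0x3f) | 0x80 // RFC variant (10xx)
	h := hex.EncodeToString(b[:])
	return fmt.Sprintf("%s-%s-%s-%s-%s", h[0:8], h[8:12], h[12:16], h[16:20], h[20:32]), nil
}

// requestID is hypothetical middleware: it reuses an inbound X-Request-ID or
// generates one, stores it in the context, and (assumption) echoes it back.
func requestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			var err error
			if id, err = newUUIDv7(time.Now()); err != nil {
				http.Error(w, "request id generation failed", http.StatusInternalServerError)
				return
			}
		}
		w.Header().Set("X-Request-ID", id)
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), ctxKey{}, id)))
	})
}

// requestIDFrom returns the ID stored by requestID, or "" outside request scope.
func requestIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// roundTrip sends one request through the middleware and reports the ID the
// handler saw and the ID echoed on the response.
func roundTrip(inbound string) (seen, echoed string) {
	h := requestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = requestIDFrom(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if inbound != "" {
		req.Header.Set("X-Request-ID", inbound)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return seen, rec.Header().Get("X-Request-ID")
}

func main() {
	seen, echoed := roundTrip("")
	fmt.Println(seen, seen == echoed)
}
```

Keeping the ID in the request context lets any logger downstream attach it without threading an extra parameter through every call.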

Cardinality budget

Prometheus labels must stay bounded. Never label metrics with org_id, agent_id, or raw URL paths. Per-tenant analytics belong in ClickHouse (ADR-0021).
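
One way to enforce this rule is a guard that runs when a metric is defined. The sketch below is an assumption about how such a check could look, not the packages/metrics code: checkLabels and forbiddenLabels are hypothetical names, "path" stands in for raw URL paths, and the labels method, route, and status are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// forbiddenLabels lists label names the cardinality budget rules out.
// "path" stands in for raw URL paths; that label name is an assumption.
var forbiddenLabels = map[string]bool{
	"org_id":   true,
	"agent_id": true,
	"path":     true,
}

// checkLabels returns an error naming every forbidden label in names,
// or nil when all labels are allowed.
func checkLabels(metric string, names []string) error {
	var bad []string
	for _, n := range names {
		if forbiddenLabels[n] {
			bad = append(bad, n)
		}
	}
	if len(bad) == 0 {
		return nil
	}
	sort.Strings(bad)
	return fmt.Errorf("metric %s uses unbounded labels %v", metric, bad)
}

func main() {
	// Bounded labels pass; a per-tenant label is rejected.
	fmt.Println(checkLabels("ibex_proxy_requests_total", []string{"method", "route", "status"}))
	fmt.Println(checkLabels("ibex_proxy_requests_total", []string{"route", "org_id"}))
}
```

Failing at definition time keeps a high-cardinality label from ever reaching a scrape.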

OpenTelemetry

Both services initialize OTel via packages/telemetry at startup:

+--------------+     +----------------+     +-----------------------+     +--------------------------+     +-----------------------+
|              |     |                |     |                       |     |                          |     |                       |
| HTTP request |---->| SpanMiddleware |---->| Handler / gRPC client |---->| OTel context propagation |---->| Logger trace_id field |
|              |     |                |     |                       |     |                          |     |                       |
+--------------+     +----------------+     +-----------------------+     +--------------------------+     +-----------------------+

| Parameter | Type | Description |
| --- | --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | host:port | OTLP gRPC collector (e.g. localhost:4317). Empty = noop exporter. |
| OTEL_SAMPLE_RATIO | float | Root span sampling ratio via ParentBased + TraceIDRatio. Default: 0.01 |
| OTEL_SERVICE_NAME | string | Falls back to IBEX_SERVICE_NAME when unset. |
| OTEL_DEPLOYMENT_ENVIRONMENT | string | Falls back to IBEX_ENV (development, staging, production). |
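
The defaults and fallbacks above can be sketched as a small loader. This is an illustration under stated assumptions, not the packages/telemetry API: otelConfig, firstNonEmpty, and loadOTelConfig are hypothetical names, and rejecting ratios outside [0, 1] is an assumed validation rule.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// otelConfig mirrors the four documented settings. The struct and field
// names are illustrative only.
type otelConfig struct {
	Endpoint    string  // empty means spans are created but not exported
	SampleRatio float64 // root span sampling ratio
	ServiceName string
	Environment string
}

// firstNonEmpty returns the first non-empty value, or "" if none is set.
func firstNonEmpty(vals ...string) string {
	for _, v := range vals {
		if v != "" {
			return v
		}
	}
	return ""
}

// loadOTelConfig reads settings through getenv so callers can inject a map
// instead of the process environment.
func loadOTelConfig(getenv func(string) string) (otelConfig, error) {
	cfg := otelConfig{
		Endpoint:    getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
		SampleRatio: 0.01, // documented default
		ServiceName: firstNonEmpty(getenv("OTEL_SERVICE_NAME"), getenv("IBEX_SERVICE_NAME")),
		Environment: firstNonEmpty(getenv("OTEL_DEPLOYMENT_ENVIRONMENT"), getenv("IBEX_ENV")),
	}
	if raw := getenv("OTEL_SAMPLE_RATIO"); raw != "" {
		ratio, err := strconv.ParseFloat(raw, 64)
		// Assumed rule: the ratio must lie in [0, 1]; this also rejects NaN.
		if err != nil || !(ratio >= 0 && ratio <= 1) {
			return cfg, fmt.Errorf("OTEL_SAMPLE_RATIO must be a float in [0, 1], got %q", raw)
		}
		cfg.SampleRatio = ratio
	}
	return cfg, nil
}

func main() {
	cfg, err := loadOTelConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing getenv as a function keeps the loader free of global state, so the fallback rules can be exercised without touching the real environment.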

HTTP spans are named {method} {route_template} (e.g. POST /v1/chat/completions) — never raw paths with UUIDs. Proxy gRPC auth calls propagate trace context via otelgrpc client interceptors.
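
Trace context travels in the W3C traceparent header, which has four dash-separated fields: version, trace-id, parent-id, and trace-flags. The services rely on OpenTelemetry propagators for this; the standalone parser below is only a sketch for inspecting a header by hand. It handles version 00 only, and traceParent, isLowerHex, and parseTraceParent are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// traceParent holds the fields of a version-00 W3C traceparent header.
type traceParent struct {
	TraceID  string // 32 lowercase hex characters
	ParentID string // 16 lowercase hex characters
	Sampled  bool   // bit 0 of the trace-flags byte
}

// isLowerHex reports whether s has length n and only lowercase hex digits.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceParent validates and splits a version-00 traceparent value.
// All-zero trace and parent IDs are invalid per the specification.
func parseTraceParent(h string) (traceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errors.New("traceparent: want 00-<trace-id>-<parent-id>-<flags>")
	}
	if !isLowerHex(parts[1], 32) || strings.Trim(parts[1], "0") == "" {
		return traceParent{}, errors.New("traceparent: invalid trace-id")
	}
	if !isLowerHex(parts[2], 16) || strings.Trim(parts[2], "0") == "" {
		return traceParent{}, errors.New("traceparent: invalid parent-id")
	}
	if !isLowerHex(parts[3], 2) {
		return traceParent{}, errors.New("traceparent: invalid trace-flags")
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return traceParent{}, err
	}
	return traceParent{TraceID: parts[1], ParentID: parts[2], Sampled: flags&0x01 == 1}, nil
}

func main() {
	// Example value taken from the W3C Trace Context specification.
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.TraceID, tp.Sampled, err)
}
```

The TraceID field is the same value that appears as trace_id in logs, which makes it easy to confirm that an inbound header and a log line belong to the same trace.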

Tail-based 100% export for 5xx traces requires a collector-side sampler (Phase 2). Phase 1 marks span status ERROR on HTTP status ≥ 500 for sampled spans.
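
The naming and error-status rules can be sketched without the OTel SDK. The example below is an assumption about the shape of such middleware, not the shipped SpanMiddleware: statusRecorder, spanName, spanStatus, respondWith, and observe are hypothetical names, the route template is passed in as a string, and status values are modelled as plain strings.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusRecorder captures the status code a handler writes so the span
// can be updated after the handler returns.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// spanName joins the method and the route template, never the raw path.
func spanName(method, routeTemplate string) string {
	return method + " " + routeTemplate
}

// spanStatus returns "ERROR" for HTTP status >= 500 and "UNSET" otherwise.
func spanStatus(code int) string {
	if code >= 500 {
		return "ERROR"
	}
	return "UNSET"
}

// respondWith returns a stand-in handler that replies with the given status.
func respondWith(code int) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(code)
	})
}

// observe serves one request and returns the span name and status that a
// span middleware following these rules would record.
func observe(h http.Handler, method, routeTemplate, rawPath string) (name, status string) {
	// Default to 200 because handlers may write a body without calling WriteHeader.
	rec := &statusRecorder{ResponseWriter: httptest.NewRecorder(), status: http.StatusOK}
	h.ServeHTTP(rec, httptest.NewRequest(method, rawPath, nil))
	return spanName(method, routeTemplate), spanStatus(rec.status)
}

func main() {
	name, status := observe(respondWith(http.StatusBadGateway),
		http.MethodPost, "/v1/chat/completions", "/v1/chat/completions")
	fmt.Println(name, status)
}
```

Because the name is built from the template, two requests to different resource IDs share one span name, which keeps trace search and span-derived metrics bounded.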

Prometheus metrics

Scrape GET /metrics on each service. All metrics are registered in packages/metrics; services never call prometheus.MustRegister directly.

Phase 1 catalog (selected)

| Metric | Type | Service |
| --- | --- | --- |
| ibex_proxy_request_duration_seconds | Histogram | proxy |
| ibex_proxy_requests_total | Counter | proxy |
| ibex_proxy_rate_limited_total | Counter | proxy |
| ibex_auth_validate_token_duration_seconds | Histogram | auth |
| ibex_auth_grpc_requests_total | Counter | auth |
| ibex_db_query_duration_seconds | Histogram | auth |
| ibex_process_up | Gauge | both |

Histogram buckets target the <20ms proxy overhead budget: 0.001 … 5.000 seconds.

bash
curl -s http://localhost:8080/metrics | grep ibex_proxy

Structured logging

Logs use packages/logger — JSON to stdout in production. Every line in a request scope includes request_id; when a span is active, trace_id is injected automatically.
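
A request-scoped JSON logger of this kind can be approximated with the standard library's log/slog (Go 1.21+). This is a sketch, not the packages/logger API: newRequestLogger and logFields are hypothetical names, an active span is modelled as a non-empty traceID argument, and the IDs and the component attribute in main are placeholders.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log/slog"
	"os"
)

// newRequestLogger returns a JSON logger whose every line carries request_id,
// plus trace_id when a span is active (modelled as a non-empty traceID).
func newRequestLogger(w io.Writer, requestID, traceID string) *slog.Logger {
	logger := slog.New(slog.NewJSONHandler(w, nil)).With("request_id", requestID)
	if traceID != "" {
		logger = logger.With("trace_id", traceID)
	}
	return logger
}

// logFields emits one WARN line into a buffer and decodes it back into a
// map, which is handy for checking which fields a line carries.
func logFields(requestID, traceID, msg string) (map[string]any, error) {
	var buf bytes.Buffer
	newRequestLogger(&buf, requestID, traceID).Warn(msg)
	fields := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	// Placeholder IDs; in a service these come from the request context.
	logger := newRequestLogger(os.Stdout, "example-request-id", "4bf92f3577b34da6a3ce929d0e0e4736")
	logger.Warn("redis unavailable, failing open", "component", "ratelimit")
}
```

Binding the IDs once with With means individual call sites cannot forget them, which is what makes a later search by ID reliable.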

| Level | When to use |
| --- | --- |
| DEBUG | Local debugging only; never per-request in production |
| INFO | Lifecycle events: startup, config loaded, migration applied |
| WARN | Recoverable anomalies: Redis fail-open, retry succeeded |
| ERROR | Requires action: DB failure, auth service unreachable |

Privacy by default

Do not log raw tokens, passwords, memory content, or full prompts/responses. Metrics capture per-request volume; logs capture anomalies.

Log level configuration

| Parameter | Type | Description |
| --- | --- | --- |
| IBEX_LOG_LEVEL | enum | DEBUG, INFO, WARN, or ERROR. Default: INFO |
| IBEX_LOG_FORMAT | string | json in production. Default: json |
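
Mapping IBEX_LOG_LEVEL onto a logger threshold might look like the sketch below, which uses log/slog from the standard library. parseLevel is a hypothetical helper, and accepting lowercase input and rejecting unknown values with an error are assumptions rather than documented behaviour.

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
	"strings"
)

// parseLevel maps an IBEX_LOG_LEVEL value to a slog.Level.
// An empty value yields the documented default, INFO.
func parseLevel(raw string) (slog.Level, error) {
	switch strings.ToUpper(strings.TrimSpace(raw)) {
	case "", "INFO":
		return slog.LevelInfo, nil
	case "DEBUG":
		return slog.LevelDebug, nil
	case "WARN":
		return slog.LevelWarn, nil
	case "ERROR":
		return slog.LevelError, nil
	default:
		return slog.LevelInfo, fmt.Errorf("IBEX_LOG_LEVEL: unknown level %q", raw)
	}
}

func main() {
	level, err := parseLevel(os.Getenv("IBEX_LOG_LEVEL"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))
	logger.Debug("dropped unless IBEX_LOG_LEVEL=DEBUG")
	logger.Info("config loaded", "log_level", level.String())
}
```

Failing fast on an unknown level surfaces a typo at startup instead of silently running at the wrong verbosity.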

Middleware order (proxy)

Outer → inner: RequestContext → Span → metrics → ResponseHeaders → logging → mux. Auth middleware runs inside the mux after span creation so route templates and trace context are available to metrics and logs.
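
The outer-to-inner order can be made concrete with a small chaining helper. This is a sketch using stand-in middleware that only record their names; chain, tag, and entryOrder are hypothetical, and the real middleware do far more than append to a slice.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type middleware func(http.Handler) http.Handler

// chain wraps h so that the first middleware listed is the outermost,
// which means it sees the request first and the response last.
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag returns a stand-in middleware that records its name when entered.
func tag(name string, order *[]string) middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*order = append(*order, name)
			next.ServeHTTP(w, r)
		})
	}
}

// entryOrder serves one request through the documented proxy stack and
// returns the layer names in the order they were entered.
func entryOrder() []string {
	var order []string
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		order = append(order, "mux")
	})
	h := chain(mux,
		tag("RequestContext", &order),
		tag("Span", &order),
		tag("metrics", &order),
		tag("ResponseHeaders", &order),
		tag("logging", &order),
	)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return order
}

func main() {
	fmt.Println(entryOrder())
}
```

Running it prints [RequestContext Span metrics ResponseHeaders logging mux]. Iterating the slice in reverse is what makes the first argument the outermost wrapper.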

Optional error tracking

Setting a Sentry DSN via SENTRY_DSN is optional in development. When Sentry is enabled, link its events to trace_id and request_id.

Local verification

1. Confirm metrics endpoints
   curl -s http://localhost:8080/metrics and curl -s http://localhost:8081/metrics should each return Prometheus text-format output.

2. Send a traced request
   Include X-Request-ID on a proxy call, then search service stdout for that ID.

3. Optional collector
   Point OTEL_EXPORTER_OTLP_ENDPOINT at Jaeger or Tempo to visualize spans locally.
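
The first step can also be scripted. The sketch below fetches a metrics endpoint and lists the sample names that start with ibex_, roughly what piping curl into grep does. sampleNames is a hypothetical helper, the default URL assumes the service listens on localhost:8080, and the parser handles only the common line shapes of the Prometheus text format.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"sort"
	"strings"
	"time"
)

// sampleNames returns the sorted, de-duplicated sample names in a Prometheus
// text-format body that start with prefix. Comment lines (# HELP, # TYPE)
// and blank lines are skipped.
func sampleNames(body, prefix string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The name ends at the label block or at the space before the value.
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		if strings.HasPrefix(name, prefix) {
			seen[name] = true
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	url := "http://localhost:8080/metrics" // assumed local address
	if len(os.Args) > 1 {
		url = os.Args[1]
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	raw, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, n := range sampleNames(string(raw), "ibex_") {
		fmt.Println(n)
	}
}
```

Histogram metrics appear as several sample names (with _bucket, _sum, and _count suffixes), so the output is longer than the catalog table.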

Related

  • Health checks
  • Troubleshooting
  • ADR-0019: OpenTelemetry
  • ADR-0021: Prometheus catalog
