
ADR-0022: Health check contract (Phase 1)

Architecture decision record 0022.


  • Status: Accepted
  • Date: 2026-06-05
  • Authors: IBEX Harness team

Context

M1.4.3 requires a documented, consistent health check contract across Go services. Before this ADR:

  • /health and /ready returned ad-hoc JSON (status + optional reason)
  • Auth readiness used a Postgres TCP dial only (not SELECT 1)
  • Proxy readiness checked Redis but not auth gRPC
  • Logic was duplicated in services/*/internal/health

Kubernetes operators need predictable probe behaviour: liveness restarts unhealthy pods; readiness removes pods from load balancing without restart.

Decision

1) Endpoints

| Endpoint | Purpose | External checks | HTTP status |
| --- | --- | --- | --- |
| GET /health | Liveness | None | Always 200 when process responds |
| GET /ready | Readiness | Critical + advisory dependencies | 200 or 503 (see below) |

2) Response schema

JSON
{
  "status": "unhealthy",
  "checks": {
    "postgres": { "status": "ok", "latency_ms": 2 },
    "redis": { "status": "failed", "message": "connection refused", "latency_ms": 501 }
  }
}
  • status (top-level): ok | degraded | unhealthy
  • checks[name].status: ok | failed
  • checks[name].message: present when failed (no secrets)
  • checks[name].latency_ms: wall time for that checker

/health body: {"status":"ok","checks":{}} always.
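The schema above could be modelled in Go roughly as follows. This is a minimal sketch: the type names CheckResult and Response and the helper LivenessBody are illustrative assumptions, not identifiers taken from packages/healthcheck.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CheckResult mirrors one entry of the "checks" map.
// The name is hypothetical; only the JSON field names come from the ADR.
type CheckResult struct {
	Status    string `json:"status"`            // "ok" | "failed"
	Message   string `json:"message,omitempty"` // set only on failure; must not contain secrets
	LatencyMS int64  `json:"latency_ms"`        // wall time for this checker
}

// Response mirrors the top-level body returned by /health and /ready.
type Response struct {
	Status string                 `json:"status"` // "ok" | "degraded" | "unhealthy"
	Checks map[string]CheckResult `json:"checks"`
}

// LivenessBody returns the fixed /health payload. The map is non-nil so that
// it encodes as {} rather than null.
func LivenessBody() ([]byte, error) {
	return json.Marshal(Response{Status: "ok", Checks: map[string]CheckResult{}})
}

func main() {
	b, err := LivenessBody()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b)) // {"status":"ok","checks":{}}
}
```

Using omitempty on message keeps the field out of passing checks, which matches "present when failed".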

3) Readiness HTTP mapping

| Condition | HTTP | status field |
| --- | --- | --- |
| All critical pass, no advisory failures | 200 | ok |
| All critical pass, any advisory failure | 200 | degraded |
| Any critical failure | 503 | unhealthy |
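The mapping can be expressed as a small pure function. The sketch below assumes a hypothetical Outcome type that records only whether a checker is critical and whether it failed; the real package may represent results differently.

```go
package main

import (
	"fmt"
	"net/http"
)

// Outcome is a hypothetical summary of one checker run.
type Outcome struct {
	Critical bool // false means advisory
	Failed   bool
}

// Aggregate applies the readiness mapping: any critical failure wins,
// then any advisory failure, otherwise everything is ok.
func Aggregate(outcomes []Outcome) (status string, httpCode int) {
	degraded := false
	for _, o := range outcomes {
		if !o.Failed {
			continue
		}
		if o.Critical {
			return "unhealthy", http.StatusServiceUnavailable
		}
		degraded = true
	}
	if degraded {
		return "degraded", http.StatusOK
	}
	return "ok", http.StatusOK
}

func main() {
	// One passing critical checker plus one failing advisory checker.
	s, c := Aggregate([]Outcome{{Critical: true}, {Critical: false, Failed: true}})
	fmt.Println(s, c) // degraded 200
}
```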

4) Checker classification (Phase 1)

Auth — critical:

  • postgres: SELECT 1 via *sql.DB
  • grpc: TCP reachability to the gRPC listen address
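The two auth checkers might look like the following sketch, which uses only the standard library. The function names CheckPostgres and CheckTCP and the demo helper are assumptions for illustration; the demo exercises only the TCP probe against a throwaway local listener, because a Postgres driver is outside the scope of this example.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"net"
	"time"
)

// CheckPostgres runs SELECT 1 so that a database that accepts connections
// but cannot serve queries still fails the probe.
func CheckPostgres(ctx context.Context, db *sql.DB) error {
	var one int
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		return err
	}
	if one != 1 {
		return fmt.Errorf("SELECT 1 returned %d", one)
	}
	return nil
}

// CheckTCP reports whether addr accepts a TCP connection before ctx expires.
func CheckTCP(ctx context.Context, addr string) error {
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return err
	}
	return conn.Close()
}

// demo dials a local listener while it is open and again after it is closed.
func demo() (whileOpen, afterClose error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err, err
	}
	addr := ln.Addr().String()
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	whileOpen = CheckTCP(ctx, addr)
	ln.Close()
	afterClose = CheckTCP(ctx, addr)
	return whileOpen, afterClose
}

func main() {
	open, closed := demo()
	fmt.Println("listener open:", open)
	fmt.Println("listener closed:", closed)
}
```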

Proxy — critical:

  • auth_grpc: calls ValidateToken with the sentinel probe token ibex_health_probe_invalid; a codes.Unauthenticated response means the auth service is reachable
  • redis: PING over RESP
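A PING over RESP needs no Redis client library. The sketch below assumes an unauthenticated, non-TLS endpoint; PingRedis, fakeRedis, and probe are hypothetical names, and fakeRedis is a stand-in server used only to exercise the checker.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"net"
	"time"
)

// PingRedis sends a RESP-encoded PING and expects the simple string "+PONG".
func PingRedis(ctx context.Context, addr string) error {
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Bound reads and writes by the context deadline, if one is set.
	if dl, ok := ctx.Deadline(); ok {
		if err := conn.SetDeadline(dl); err != nil {
			return err
		}
	}
	// RESP array containing one bulk string: ["PING"].
	if _, err := conn.Write([]byte("*1\r\n$4\r\nPING\r\n")); err != nil {
		return err
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return err
	}
	if line != "+PONG\r\n" {
		return fmt.Errorf("unexpected PING reply %q", line)
	}
	return nil
}

// fakeRedis answers every connection with a fixed reply.
func fakeRedis(reply string) (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 64)
				c.Read(buf) // consume the PING command
				c.Write([]byte(reply))
			}(c)
		}
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}

// probe runs PingRedis against a fake server with a 500ms deadline.
func probe(reply string) error {
	addr, stop, err := fakeRedis(reply)
	if err != nil {
		return err
	}
	defer stop()
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	return PingRedis(ctx, addr)
}

func main() {
	fmt.Println(probe("+PONG\r\n"))
	fmt.Println(probe("-NOAUTH Authentication required.\r\n"))
}
```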

Advisory (Phase 1): none. Phase 2 may add LLM provider reachability as advisory (degraded but still HTTP 200).

5) Timeouts

  • 500ms per checker (PerCheckTimeout)
  • 750ms overall request budget (OverallTimeout)
  • Checkers run in parallel within the overall budget
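One way to combine the two limits is to derive each per-checker context from a parent that carries the overall budget. In the sketch below, PerCheckTimeout and OverallTimeout come from this ADR, while Checker, Result, RunAll, and demo are hypothetical. It assumes every checker honours context cancellation, since RunAll waits for all of them to return.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	PerCheckTimeout = 500 * time.Millisecond
	OverallTimeout  = 750 * time.Millisecond
)

// Checker is a hypothetical signature; implementations must stop when ctx is done.
type Checker func(ctx context.Context) error

// Result records one checker's outcome and wall time.
type Result struct {
	Err     error
	Latency time.Duration
}

// RunAll starts every checker concurrently. Each checker gets its own
// PerCheckTimeout context derived from a parent bounded by OverallTimeout.
func RunAll(parent context.Context, checkers map[string]Checker) map[string]Result {
	ctx, cancel := context.WithTimeout(parent, OverallTimeout)
	defer cancel()

	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = make(map[string]Result, len(checkers))
	)
	for name, check := range checkers {
		wg.Add(1)
		go func(name string, check Checker) {
			defer wg.Done()
			cctx, ccancel := context.WithTimeout(ctx, PerCheckTimeout)
			defer ccancel()
			start := time.Now()
			err := check(cctx)
			mu.Lock()
			out[name] = Result{Err: err, Latency: time.Since(start)}
			mu.Unlock()
		}(name, check)
	}
	wg.Wait()
	return out
}

// demo runs one instant checker and one that would take 2s if not cancelled.
func demo() (fastOK, slowTimedOut bool, elapsed time.Duration) {
	start := time.Now()
	res := RunAll(context.Background(), map[string]Checker{
		"fast": func(ctx context.Context) error { return nil },
		"slow": func(ctx context.Context) error {
			select {
			case <-time.After(2 * time.Second):
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		},
	})
	fastOK = res["fast"].Err == nil
	slowTimedOut = errors.Is(res["slow"].Err, context.DeadlineExceeded)
	return fastOK, slowTimedOut, time.Since(start)
}

func main() {
	fastOK, slowTimedOut, elapsed := demo()
	fmt.Println(fastOK, slowTimedOut, elapsed.Round(100*time.Millisecond))
}
```

When all checkers start at once, the per-checker limit is normally the one that fires first; the overall budget acts as a backstop if checkers start late.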

6) packages/healthcheck

Shared implementation used by auth and proxy. Services register checkers in main; HTTP routers delegate to Server.HealthHandler() / ReadyHandler().

| Package | May import | Must not import |
| --- | --- | --- |
| packages/healthcheck | stdlib, packages/proto (auth client interface) | service packages, packages/logger |
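The registration and delegation pattern could be sketched as below using only the standard library, in line with the import rule. Only the method names HealthHandler and ReadyHandler appear in this ADR; Register, CheckFunc, the http.Handler return type, and the helpers are assumptions. For brevity this sketch runs checkers one after another and omits timeouts.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// CheckFunc is a hypothetical checker signature.
type CheckFunc func(ctx context.Context) error

type check struct {
	name     string
	critical bool
	fn       CheckFunc
}

type checkResult struct {
	Status    string `json:"status"`
	Message   string `json:"message,omitempty"`
	LatencyMS int64  `json:"latency_ms"`
}

type body struct {
	Status string                 `json:"status"`
	Checks map[string]checkResult `json:"checks"`
}

// Server holds the registered checkers.
type Server struct{ checks []check }

// Register adds a checker; services would call this from main.
func (s *Server) Register(name string, critical bool, fn CheckFunc) {
	s.checks = append(s.checks, check{name, critical, fn})
}

func writeJSON(w http.ResponseWriter, code int, b body) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(b)
}

// HealthHandler never consults dependencies.
func (s *Server) HealthHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		writeJSON(w, http.StatusOK, body{Status: "ok", Checks: map[string]checkResult{}})
	})
}

// ReadyHandler runs each checker and applies the readiness mapping.
// A real implementation should sanitise error text before exposing it.
func (s *Server) ReadyHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		out := body{Status: "ok", Checks: map[string]checkResult{}}
		code := http.StatusOK
		for _, c := range s.checks {
			start := time.Now()
			err := c.fn(r.Context())
			res := checkResult{Status: "ok", LatencyMS: time.Since(start).Milliseconds()}
			if err != nil {
				res.Status, res.Message = "failed", err.Error()
				if c.critical {
					out.Status, code = "unhealthy", http.StatusServiceUnavailable
				} else if out.Status == "ok" {
					out.Status = "degraded"
				}
			}
			out.Checks[c.name] = res
		}
		writeJSON(w, code, out)
	})
}

// get drives a handler in-process and returns the status code and body.
func get(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

// demoServer registers one critical checker that always fails.
func demoServer() *Server {
	s := &Server{}
	s.Register("redis", true, func(context.Context) error { return errors.New("connection refused") })
	return s
}

func main() {
	srv := demoServer()
	fmt.Println(get(srv.HealthHandler(), "/health"))
	fmt.Println(get(srv.ReadyHandler(), "/ready"))
}
```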

Consequences

  • Breaking change: the previous not_ready status and reason field are replaced by the checks map (documented here)
  • make dev-smoke unchanged (HTTP status codes only)
  • K8s probe examples live in OPS_GUIDE.md

References

  • M1.4.3 milestone
  • ADR-0020
