ADR-0022: Health check contract (Phase 1)
Architecture decision record 0022.
- Status: Accepted
- Date: 2026-06-05
- Authors: IBEX Harness team
Context
M1.4.3 requires a documented, consistent health check contract across Go services. Before this ADR:
- `/health` and `/ready` returned ad-hoc JSON (`status` + optional `reason`)
- Auth readiness used a Postgres TCP dial only (not `SELECT 1`)
- Proxy readiness checked Redis but not auth gRPC
- Logic was duplicated in `services/*/internal/health`
Kubernetes operators need predictable probe behaviour: liveness restarts unhealthy pods; readiness removes pods from load balancing without restart.
Decision
1) Endpoints
| Endpoint | Purpose | External checks | HTTP status |
|---|---|---|---|
| `GET /health` | Liveness | None | Always 200 when process responds |
| `GET /ready` | Readiness | Critical + advisory dependencies | 200 or 503 (see below) |
2) Response schema
```json
{
  "status": "ok",
  "checks": {
    "postgres": { "status": "ok", "latency_ms": 2 },
    "redis": { "status": "failed", "message": "connection refused", "latency_ms": 501 }
  }
}
```

- `status` (top-level): `ok` | `degraded` | `unhealthy`
- `checks[name].status`: `ok` | `failed`
- `checks[name].message`: present when `failed` (no secrets)
- `checks[name].latency_ms`: wall time for that checker

`/health` body: `{"status":"ok","checks":{}}` always.
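The schema above could be modelled in Go roughly as follows. This is a minimal sketch, not the `packages/healthcheck` implementation: the type names `Response` and `CheckResult`, the helper `livenessBody`, and the `Content-Type` header are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// CheckResult mirrors one entry of the "checks" map (hypothetical type name).
type CheckResult struct {
	Status    string `json:"status"`            // "ok" or "failed"
	Message   string `json:"message,omitempty"` // only set when failed; must not contain secrets
	LatencyMS int64  `json:"latency_ms"`        // wall time for that checker
}

// Response is the top-level body shared by /health and /ready (hypothetical type name).
type Response struct {
	Status string                 `json:"status"` // "ok", "degraded" or "unhealthy"
	Checks map[string]CheckResult `json:"checks"`
}

// livenessHandler runs no external checks and always answers 200.
func livenessHandler(w http.ResponseWriter, _ *http.Request) {
	// Assumption: the ADR does not state a Content-Type; JSON is the natural choice.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	// A non-nil empty map encodes as {} rather than null.
	_ = json.NewEncoder(w).Encode(Response{Status: "ok", Checks: map[string]CheckResult{}})
}

// livenessBody calls the handler in-process and returns the status code and body.
func livenessBody() (int, string) {
	rec := httptest.NewRecorder()
	livenessHandler(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	code, body := livenessBody()
	fmt.Println(code, body) // prints: 200 {"status":"ok","checks":{}}
}
```

Using `omitempty` on `message` keeps passing checks free of an empty string, which matches the rule that the field is present only when a check has failed.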
3) Readiness HTTP mapping
| Condition | HTTP | status field |
|---|---|---|
| All critical pass, no advisory failures | 200 | ok |
| All critical pass, any advisory failure | 200 | degraded |
| Any critical failure | 503 | unhealthy |
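The mapping table can be read as a small decision function. The sketch below is one possible way to express it; the `Outcome` type and the `aggregate` function are hypothetical names, not part of the documented package.

```go
package main

import (
	"fmt"
	"net/http"
)

// Outcome is a hypothetical record of one finished check.
type Outcome struct {
	Critical bool // false means the check is advisory
	Failed   bool
}

// aggregate applies the readiness mapping: any critical failure yields
// 503/unhealthy, otherwise any advisory failure yields 200/degraded,
// otherwise 200/ok.
func aggregate(outcomes []Outcome) (httpStatus int, status string) {
	degraded := false
	for _, o := range outcomes {
		if !o.Failed {
			continue
		}
		if o.Critical {
			return http.StatusServiceUnavailable, "unhealthy"
		}
		degraded = true
	}
	if degraded {
		return http.StatusOK, "degraded"
	}
	return http.StatusOK, "ok"
}

func main() {
	fmt.Println(aggregate([]Outcome{{Critical: true}}))                     // 200 ok
	fmt.Println(aggregate([]Outcome{{Critical: true}, {Failed: true}}))     // 200 degraded
	fmt.Println(aggregate([]Outcome{{Critical: true, Failed: true}}))       // 503 unhealthy
}
```

A critical failure short-circuits the loop because no later result can change the outcome.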
4) Checker classification (Phase 1)
Auth (critical):
- `postgres`: `SELECT 1` via `*sql.DB`
- `grpc`: TCP reachability to the gRPC listen address

Proxy (critical):
- `auth_grpc`: `ValidateToken` with sentinel probe token `ibex_health_probe_invalid`; `codes.Unauthenticated` means reachable
- `redis`: `PING` over RESP
Advisory (Phase 1): none. Phase 2 may add LLM provider reachability as advisory (degraded but still HTTP 200).
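As an illustration of the proxy's `redis` checker, the sketch below sends `PING` as a RESP array over a plain TCP connection and expects `+PONG`. It uses only the standard library and assumes a server without AUTH or TLS; whether the real checker uses a client library instead is not stated in this ADR. The names `pingRedis` and `startFakeRedis` are hypothetical, and the fake server exists only so the example runs without a real Redis.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// perCheckTimeout matches the 500ms per-checker limit in this ADR.
const perCheckTimeout = 500 * time.Millisecond

// pingRedis sends PING as a RESP array and expects the simple string +PONG.
// Assumption: the server requires neither AUTH nor TLS.
func pingRedis(addr string) error {
	// One deadline covers dial, write and read together.
	deadline := time.Now().Add(perCheckTimeout)
	d := net.Dialer{Deadline: deadline}
	conn, err := d.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	if err := conn.SetDeadline(deadline); err != nil {
		return err
	}
	// RESP encoding of the one-element command ["PING"].
	if _, err := conn.Write([]byte("*1\r\n$4\r\nPING\r\n")); err != nil {
		return err
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return err
	}
	if reply := strings.TrimRight(line, "\r\n"); reply != "+PONG" {
		return fmt.Errorf("unexpected reply %q", reply)
	}
	return nil
}

// startFakeRedis is a test double that answers every connection with reply.
func startFakeRedis(reply string) (addr string, stop func()) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 64)
				_, _ = c.Read(buf) // consume the PING request
				_, _ = c.Write([]byte(reply))
			}(c)
		}
	}()
	return ln.Addr().String(), func() { _ = ln.Close() }
}

func main() {
	addr, stop := startFakeRedis("+PONG\r\n")
	defer stop()
	fmt.Println("redis check error:", pingRedis(addr))
}
```

Unlike a bare TCP dial, this confirms that the server actually speaks the protocol, which is the same reasoning behind replacing the Postgres TCP dial with `SELECT 1`.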
5) Timeouts
- 500ms per checker (`PerCheckTimeout`)
- 750ms overall request budget (`OverallTimeout`)
- Checkers run in parallel within the overall budget
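One way to combine the two limits is to derive each checker's context from a parent context that carries the overall budget. The following is a sketch under that assumption: `PerCheckTimeout` and `OverallTimeout` come from this ADR, while `Checker`, `Result`, `runAll`, and the demo helpers are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	PerCheckTimeout = 500 * time.Millisecond
	OverallTimeout  = 750 * time.Millisecond
)

// Checker is a hypothetical signature for one dependency probe.
type Checker func(ctx context.Context) error

// Result records the outcome of a single checker.
type Result struct {
	Err       error
	LatencyMS int64
}

// runAll starts every checker at once. Each checker gets its own timeout,
// derived from a shared context that carries the overall budget.
// Checkers must honour ctx cancellation, or wg.Wait will block on them.
func runAll(parent context.Context, checkers map[string]Checker) map[string]Result {
	overall, cancel := context.WithTimeout(parent, OverallTimeout)
	defer cancel()

	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]Result, len(checkers))
	)
	for name, check := range checkers {
		wg.Add(1)
		go func(name string, check Checker) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(overall, PerCheckTimeout)
			defer cancel()
			start := time.Now()
			err := check(ctx)
			mu.Lock()
			results[name] = Result{Err: err, LatencyMS: time.Since(start).Milliseconds()}
			mu.Unlock()
		}(name, check)
	}
	wg.Wait()
	return results
}

// fast succeeds immediately; slow hangs until its context expires.
func fast(context.Context) error { return nil }

func slow(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo reports whether fast passed and slow was cut off by its deadline.
func demo() (fastOK, slowTimedOut bool) {
	res := runAll(context.Background(), map[string]Checker{"fast": fast, "slow": slow})
	return res["fast"].Err == nil, errors.Is(res["slow"].Err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(demo()) // prints: true true
}
```

Because the checkers run concurrently, a single hanging dependency costs about 500ms rather than the sum of all checker timeouts, which leaves headroom inside the 750ms request budget.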
6) packages/healthcheck
Shared implementation used by auth and proxy. Services register checkers in main; HTTP routers delegate to Server.HealthHandler() / ReadyHandler().
| Package | May import | Must not |
|---|---|---|
| `packages/healthcheck` | stdlib, `packages/proto` (auth client interface) | service packages, `packages/logger` |
Consequences
- Breaking change: `not_ready` + `reason` replaced by `checks` map (documented here)
- `make dev-smoke` unchanged (HTTP status codes only)
- K8s probe examples live in OPS_GUIDE.md