Milestone 1.4.3 — Health Check Contract and Standardisation
Status: Complete
Goal: 1.4 — Developer Experience Baseline
Phase: 1 — Core Platform
Estimated effort: 1 day
ADR required: ADR-0022 — Health check contract
Why This Milestone Exists
Both services expose /health and /ready endpoints, but with no defined response schema, no specification of what each probe checks, and no consistent behaviour between services. This creates two concrete problems:
- Kubernetes probe misconfiguration: Without a documented contract, platform engineers configuring liveness and readiness probes must guess at response codes and timing. A wrong probe configuration silently causes rolling deploy failures or keeps broken pods in the load balancer rotation.
- Inconsistent dependency checking: The proxy's /ready should verify that it can reach the auth gRPC service, and the auth service's /ready should verify that it can reach Postgres. Without a standard, each developer adds whichever checks seem useful at the time: one service checks Redis but not Postgres; another checks nothing.
Non-Goals
- Deep health checks that verify data integrity (too expensive for probes)
- Health check aggregation dashboard (observability work)
- Custom health check protocols beyond HTTP
Branch
chore/m1-4-3-health-check-contract
PR Title
chore(ops): health check contract and standardisation across services (m1.4.3)
Deliverables
1. ADR-0022 — Health check contract
Define:
- /health (liveness probe): returns 200 if the process is alive and not deadlocked. Checks nothing external. Called by Kubernetes every 10s. Fails → pod is restarted.
- /ready (readiness probe): returns 200 only if the service can handle traffic. Checks all critical dependencies. Fails → pod removed from service load balancer (not restarted).
- Response schema: {"status": "ok"|"degraded"|"unhealthy", "checks": {...}}
- Dependency classification: critical (readiness fails if down) vs. non-critical (degraded but ready)
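The contract above reduces to a small decision rule: any failed critical check makes the service unhealthy (503), a failed non-critical check alone makes it degraded (still 200), and otherwise it is ok. A minimal sketch of that rule in Go, assuming ADR-0022 adopts exactly the semantics listed here; the function names `overallStatus` and `httpCode` are illustrative and not part of the ADR:

```go
package main

import (
	"fmt"
	"net/http"
)

// overallStatus derives the top-level "status" field from the number of
// failed checks in each dependency class.
func overallStatus(criticalFailed, advisoryFailed int) string {
	switch {
	case criticalFailed > 0:
		return "unhealthy"
	case advisoryFailed > 0:
		return "degraded"
	default:
		return "ok"
	}
}

// httpCode maps the overall status to the /ready response code.
// "degraded" still returns 200 so the pod stays in the load balancer.
func httpCode(status string) int {
	if status == "unhealthy" {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	// Each pair is {failed critical checks, failed non-critical checks}.
	for _, c := range [][2]int{{0, 0}, {0, 1}, {1, 0}, {1, 1}} {
		s := overallStatus(c[0], c[1])
		fmt.Printf("critical=%d advisory=%d -> %s (%d)\n", c[0], c[1], s, httpCode(s))
	}
}
```

Keeping the rule in one pure function means both services would compute status the same way, which is the consistency this milestone is after.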
2. packages/healthcheck — reusable probe framework
    package healthcheck

    import "context"

    // Checker is a named dependency check function.
    // It returns nil if the dependency is healthy, or an error with a
    // human-readable description if it is not.
    type Checker func(ctx context.Context) error

    // Server runs the /health and /ready endpoints.
    type Server struct {
        // CriticalCheckers are checked on /ready.
        // If any fails, /ready returns 503.
        CriticalCheckers map[string]Checker

        // AdvisoryCheckers are checked on /ready.
        // If any fails, /ready returns 200 with status "degraded".
        // The service continues to receive traffic.
        AdvisoryCheckers map[string]Checker
    }

    // Response is the JSON body returned by both endpoints.
    type Response struct {
        Status string           `json:"status"` // "ok", "degraded", "unhealthy"
        Checks map[string]Check `json:"checks"`
    }

    // Check is the result of a single dependency check.
    type Check struct {
        Status  string `json:"status"`            // "ok", "failed"
        Message string `json:"message,omitempty"` // error message if failed
        LatMs   int64  `json:"latency_ms"`        // time the check took, in milliseconds
    }

3. Standard checks per service
Auth service critical checks (fail /ready):
- Postgres: SELECT 1 with 500ms timeout
- gRPC listener: can bind and accept (self-check)
Auth service advisory checks (degrade /ready):
- None in Phase 1
Proxy service critical checks (fail /ready):
- Auth gRPC: Ping or ValidateToken with a known-bad token (expect UNAUTHENTICATED, not a transport error)
- Redis: PING with 500ms timeout
Proxy service advisory checks (degrade /ready):
- None in Phase 1; Phase 2 will add LLM provider reachability as advisory
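The timeouts listed above suggest that every checker runs under a deadline, with its latency and error message captured for the response. The following is a sketch under stated assumptions rather than the package's actual code: `runCheck`, `stubChecker`, and `runStub` are hypothetical helpers, and `postgresChecker` assumes the service holds a `*sql.DB` opened with a real driver (none is registered here, so the demo uses stubs). The auth gRPC check is not shown because it depends on the gRPC client library.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// Checker matches the signature defined in packages/healthcheck.
type Checker func(ctx context.Context) error

// Check mirrors the per-dependency result in the response schema.
type Check struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
	LatMs   int64  `json:"latency_ms"`
}

// checkTimeout is the 500ms budget stated for the Postgres and Redis checks.
const checkTimeout = 500 * time.Millisecond

// runCheck (hypothetical helper) runs one checker under the timeout and
// records its outcome and latency.
func runCheck(parent context.Context, c Checker) Check {
	ctx, cancel := context.WithTimeout(parent, checkTimeout)
	defer cancel()

	start := time.Now()
	err := c(ctx)
	res := Check{Status: "ok", LatMs: time.Since(start).Milliseconds()}
	if err != nil {
		res.Status = "failed"
		res.Message = err.Error()
	}
	return res
}

// postgresChecker issues SELECT 1 and honours the context deadline.
// Assumption: db was opened elsewhere with a registered Postgres driver.
func postgresChecker(db *sql.DB) Checker {
	return func(ctx context.Context) error {
		var one int
		return db.QueryRowContext(ctx, "SELECT 1").Scan(&one)
	}
}

// stubChecker simulates a dependency that answers after delay, failing
// with msg if msg is non-empty, or with the context error on timeout.
func stubChecker(delay time.Duration, msg string) Checker {
	return func(ctx context.Context) error {
		select {
		case <-time.After(delay):
			if msg != "" {
				return errors.New(msg)
			}
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runStub runs a stub checker with the given delay in milliseconds.
func runStub(delayMs int, msg string) Check {
	return runCheck(context.Background(), stubChecker(time.Duration(delayMs)*time.Millisecond, msg))
}

func main() {
	fmt.Printf("%+v\n", runStub(1, ""))                   // healthy dependency
	fmt.Printf("%+v\n", runStub(1, "connection refused")) // fast failure
	fmt.Printf("%+v\n", runStub(2000, ""))                // exceeds the 500ms budget
}
```

Enforcing the deadline in one wrapper, rather than inside each checker, keeps a slow dependency from stalling the probe past the period Kubernetes allows for it.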
4. Response examples
    // GET /health: always 200 if the process is alive
    {
      "status": "ok",
      "checks": {}
    }

    // GET /ready: 200 when all critical checks pass
    {
      "status": "ok",
      "checks": {
        "postgres": {"status": "ok", "latency_ms": 2},
        "redis": {"status": "ok", "latency_ms": 1}
      }
    }

    // GET /ready: 503 when a critical check fails
    {
      "status": "unhealthy",
      "checks": {
        "postgres": {"status": "failed", "message": "connection refused", "latency_ms": 5001},
        "redis": {"status": "ok", "latency_ms": 1}
      }
    }

    // GET /ready: 200 but degraded when an advisory check fails
    {
      "status": "degraded",
      "checks": {
        "postgres": {"status": "ok", "latency_ms": 3}
      }
    }

Acceptance Criteria
- /health on both services returns 200 and {"status":"ok"} with no external checks
- /ready on auth service returns 503 when Postgres is unreachable
- /ready on proxy service returns 503 when auth gRPC is unreachable
- /ready on proxy service returns 503 when Redis is unreachable
- Response body is valid JSON matching the defined schema
- packages/healthcheck used by both services (no duplicated implementation)
- Kubernetes probe configuration documented in docs/OPS_GUIDE.md
- ADR-0022 written and indexed
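The endpoint criteria above can be exercised without real infrastructure by substituting stub checkers and probing the handler in-process. The sketch below assumes a `Server` shaped like the one in packages/healthcheck; the `Handler` method, the `probe` helper, and the check names `auth_grpc` and `redis` are hypothetical, since the milestone does not specify how the server is wired to HTTP.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

type Checker func(ctx context.Context) error

type Server struct {
	CriticalCheckers map[string]Checker
	AdvisoryCheckers map[string]Checker
}

type Check struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
	LatMs   int64  `json:"latency_ms"`
}

type Response struct {
	Status string           `json:"status"`
	Checks map[string]Check `json:"checks"`
}

// run executes one checker and records its status, message, and latency.
func run(ctx context.Context, c Checker) Check {
	start := time.Now()
	err := c(ctx)
	res := Check{Status: "ok", LatMs: time.Since(start).Milliseconds()}
	if err != nil {
		res.Status, res.Message = "failed", err.Error()
	}
	return res
}

func writeJSON(w http.ResponseWriter, code int, body Response) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(body)
}

// Handler (hypothetical wiring) serves /health and /ready.
func (s *Server) Handler() http.Handler {
	mux := http.NewServeMux()
	// Liveness: no external checks, always 200 while the process can respond.
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		writeJSON(w, http.StatusOK, Response{Status: "ok", Checks: map[string]Check{}})
	})
	// Readiness: advisory failures degrade, critical failures return 503.
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		resp := Response{Status: "ok", Checks: map[string]Check{}}
		code := http.StatusOK
		for name, c := range s.AdvisoryCheckers {
			resp.Checks[name] = run(r.Context(), c)
			if resp.Checks[name].Status == "failed" {
				resp.Status = "degraded"
			}
		}
		for name, c := range s.CriticalCheckers {
			resp.Checks[name] = run(r.Context(), c)
			if resp.Checks[name].Status == "failed" {
				resp.Status, code = "unhealthy", http.StatusServiceUnavailable
			}
		}
		writeJSON(w, code, resp)
	})
	return mux
}

// probe issues an in-process GET and returns the status code and decoded body.
func probe(s *Server, path string) (int, Response) {
	rec := httptest.NewRecorder()
	s.Handler().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	var body Response
	_ = json.Unmarshal(rec.Body.Bytes(), &body)
	return rec.Code, body
}

func healthy(context.Context) error { return nil }
func down(context.Context) error    { return errors.New("connection refused") }

// proxyServer builds a Server with the proxy's two critical checks stubbed.
func proxyServer(authGRPC, redis Checker) *Server {
	return &Server{CriticalCheckers: map[string]Checker{"auth_grpc": authGRPC, "redis": redis}}
}

func main() {
	code, body := probe(proxyServer(healthy, down), "/ready")
	fmt.Println(code, body.Status, body.Checks["redis"].Message)
	code, body = probe(proxyServer(down, down), "/health")
	fmt.Println(code, body.Status)
}
```

Because `probe` decodes the body into `Response`, the same helper also gives a basic check that the payload is valid JSON in the expected shape.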