Milestone 1.4.3 — Health Check Contract and Standardisation

Status: Complete
Goal: 1.4 — Developer Experience Baseline
Phase: 1 — Core Platform
Estimated effort: 1 day
ADR required: ADR-0022 — Health check contract

Why This Milestone Exists

The auth and proxy services both expose /health and /ready endpoints, but there is no defined response schema, no specification of what each probe checks, and no consistent behaviour between the two services. This creates two concrete problems:

Kubernetes probe misconfiguration: Without a documented contract, platform engineers configuring liveness and readiness probes must guess at response codes and timing. A misconfigured probe can silently fail rolling deploys or keep broken pods in the load balancer rotation.

Inconsistent dependency checking: The proxy's /ready should verify that it can reach the auth gRPC service, and the auth service's /ready should verify that it can reach Postgres. Without a standard, each developer adds checks ad hoc: one service checks Redis but not Postgres, another checks nothing.

Non-Goals

Deep health checks that verify data integrity (too expensive for probes)
Health check aggregation dashboard (observability work)
Custom health check protocols beyond HTTP

Deliverables

1. ADR-0022 — Health check contract

/health (liveness probe): returns 200 if the process is alive and not deadlocked. Checks nothing external. Called by Kubernetes every 10s. On failure, the pod is restarted.
/ready (readiness probe): returns 200 only if the service can handle traffic, 503 otherwise. Checks all critical dependencies. On failure, the pod is removed from the service load balancer but not restarted.
Response schema: {"status": "ok"|"degraded"|"unhealthy", "checks": {...}}
Dependency classification: critical (/ready returns 503 if the dependency is down) vs. advisory (/ready returns 200 with status "degraded" and the service keeps receiving traffic)
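
The mapping from check outcomes to the /ready status and HTTP code can be sketched as a small pure function. This is an illustration of the contract above, not the ADR's text; the function name Overall and the status constants are assumptions made for the example.

```go
package main

import "net/http"

// Status values taken from the response schema in the contract.
const (
	StatusOK        = "ok"
	StatusDegraded  = "degraded"
	StatusUnhealthy = "unhealthy"
)

// Overall (hypothetical name) maps the outcome of the dependency checks
// to the body status and HTTP status code that /ready returns.
// A critical failure always wins over an advisory failure.
func Overall(criticalFailed, advisoryFailed bool) (string, int) {
	switch {
	case criticalFailed:
		return StatusUnhealthy, http.StatusServiceUnavailable
	case advisoryFailed:
		return StatusDegraded, http.StatusOK
	default:
		return StatusOK, http.StatusOK
	}
}
```

For example, Overall(false, true) yields "degraded" with a 200, so Kubernetes keeps the pod in rotation while the body still reports the problem.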

2. `packages/healthcheck` — reusable probe framework

package healthcheck

import "context"

// Checker is a dependency check function. Its name is the key it is
// registered under in the Server's checker maps.
// Returns nil if the dependency is healthy.
// Returns an error with a human-readable description if not.
type Checker func(ctx context.Context) error
 
// Server runs /health and /ready endpoints.
type Server struct {
    // CriticalCheckers are checked on /ready.
    // If any fails, /ready returns 503.
    CriticalCheckers map[string]Checker
 
    // AdvisoryCheckers are checked on /ready.
    // If any fails, /ready returns 200 with status "degraded".
    // The service continues to receive traffic.
    AdvisoryCheckers map[string]Checker
}
 
// Response is the JSON body returned by both endpoints.
type Response struct {
    Status string           `json:"status"` // "ok", "degraded", "unhealthy"
    Checks map[string]Check `json:"checks"`
}
 
type Check struct {
    Status  string `json:"status"`           // "ok", "failed"
    Message string `json:"message,omitempty"` // error message if failed
    LatMs   int64  `json:"latency_ms"`
}
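
One possible shape for the /ready handler built on these types is sketched below. It is not the package's actual implementation: ReadyHandler, runAll, probe, passing, and failing are hypothetical names, the checks run sequentially for simplicity, and the sketch assumes critical and advisory checkers use distinct names (a duplicate name would overwrite the earlier entry in the checks map).

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"net/http"
	"net/http/httptest"
	"time"
)

type Checker func(ctx context.Context) error

type Check struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
	LatMs   int64  `json:"latency_ms"`
}

type Response struct {
	Status string           `json:"status"`
	Checks map[string]Check `json:"checks"`
}

// runAll runs each checker in turn, records its result and latency in out,
// and reports whether any of them failed.
func runAll(ctx context.Context, checkers map[string]Checker, out map[string]Check) bool {
	failed := false
	for name, check := range checkers {
		start := time.Now()
		err := check(ctx)
		c := Check{Status: "ok", LatMs: time.Since(start).Milliseconds()}
		if err != nil {
			c.Status, c.Message = "failed", err.Error()
			failed = true
		}
		out[name] = c
	}
	return failed
}

// ReadyHandler (hypothetical) returns 503 if any critical checker fails,
// 200 with status "degraded" if only advisory checkers fail, and 200 "ok" otherwise.
func ReadyHandler(critical, advisory map[string]Checker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := Response{Status: "ok", Checks: map[string]Check{}}
		code := http.StatusOK
		advisoryFailed := runAll(r.Context(), advisory, resp.Checks)
		if runAll(r.Context(), critical, resp.Checks) {
			resp.Status, code = "unhealthy", http.StatusServiceUnavailable
		} else if advisoryFailed {
			resp.Status = "degraded"
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(resp)
	}
}

// passing and failing are stand-in checkers for trying the handler out.
func passing(context.Context) error { return nil }

func failing(msg string) Checker {
	return func(context.Context) error { return errors.New(msg) }
}

// probe calls the handler in memory and returns the status code and decoded body.
func probe(h http.Handler) (int, Response) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	var body Response
	_ = json.Unmarshal(rec.Body.Bytes(), &body)
	return rec.Code, body
}
```

Calling probe(ReadyHandler(map[string]Checker{"postgres": failing("connection refused")}, nil)) exercises the 503 path without a real database. A production version would likely run the checkers concurrently and bound each one with a timeout.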

3. Standard checks per service

Auth service critical checks (fail /ready):

Postgres: SELECT 1 with 500ms timeout
gRPC listener: can bind and accept (self-check)

Auth service advisory checks (degrade /ready):

None in Phase 1

Proxy service critical checks (fail /ready):

Auth gRPC: Ping or ValidateToken with a known-bad token (expect UNAUTHENTICATED, not a transport error)
Redis: PING with 500ms timeout

Proxy service advisory checks (degrade /ready):

None in Phase 1; Phase 2 will add LLM provider reachability as advisory
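
The 500ms bound on the Postgres check could be applied with a generic timeout wrapper, sketched here using only the standard library. WithTimeout, PostgresChecker, hangs, instant, and timesOut are hypothetical names, and the sketch assumes the service already holds a *sql.DB opened with whatever Postgres driver it uses.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

type Checker func(ctx context.Context) error

// WithTimeout (hypothetical) bounds a checker with a deadline.
// It only helps if the wrapped checker honours context cancellation.
func WithTimeout(d time.Duration, check Checker) Checker {
	return func(ctx context.Context) error {
		ctx, cancel := context.WithTimeout(ctx, d)
		defer cancel()
		return check(ctx)
	}
}

// PostgresChecker runs SELECT 1 against db with the 500ms timeout from the spec.
func PostgresChecker(db *sql.DB) Checker {
	return WithTimeout(500*time.Millisecond, func(ctx context.Context) error {
		var one int
		return db.QueryRowContext(ctx, "SELECT 1").Scan(&one)
	})
}

// hangs simulates a dependency that never answers: it returns only
// once the context is cancelled or its deadline passes.
func hangs(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// instant simulates a healthy dependency that answers immediately.
func instant(context.Context) error { return nil }

// timesOut reports whether check, bounded by ms milliseconds,
// fails with a deadline error.
func timesOut(ms int, check Checker) bool {
	err := WithTimeout(time.Duration(ms)*time.Millisecond, check)(context.Background())
	return errors.Is(err, context.DeadlineExceeded)
}
```

The same wrapper could bound the Redis PING and the auth gRPC call, provided the client libraries in use accept a context.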

4. Response examples

// GET /health — always 200 if process is alive
{
  "status": "ok",
  "checks": {}
}
 
// GET /ready — 200 when all critical checks pass
{
  "status": "ok",
  "checks": {
    "postgres": {"status": "ok", "latency_ms": 2},
    "redis":    {"status": "ok", "latency_ms": 1}
  }
}
 
// GET /ready — 503 when a critical check fails
{
  "status": "unhealthy",
  "checks": {
    "postgres": {"status": "failed", "message": "connection refused", "latency_ms": 5},
    "redis":    {"status": "ok", "latency_ms": 1}
  }
}
 
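// GET /ready: 200 but degraded when an advisory check fails
// (illustrative only: Phase 1 defines no advisory checks, and "llm_provider" is a hypothetical Phase 2 check name)
{
  "status": "degraded",
  "checks": {
    "redis":        {"status": "ok", "latency_ms": 1},
    "llm_provider": {"status": "failed", "message": "timeout", "latency_ms": 500}
  }
}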
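
As a quick check that the Go types produce the bodies shown above, the sketch below marshals a Response with encoding/json. Encode is a hypothetical helper, and the type definitions are repeated so the snippet stands alone.

```go
package main

import "encoding/json"

type Check struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
	LatMs   int64  `json:"latency_ms"`
}

type Response struct {
	Status string           `json:"status"`
	Checks map[string]Check `json:"checks"`
}

// Encode (hypothetical) returns the compact JSON form of a Response.
// Struct fields are emitted in declaration order and map keys are sorted.
func Encode(r Response) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

One detail worth noting: a nil Checks map marshals as "checks":null rather than "checks":{}, so a /health handler that wants to match the example body would need to initialise an empty map. The omitempty tag on Message is what keeps "message" out of passing checks.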

Acceptance Criteria

/health on both services returns 200 with {"status": "ok", "checks": {}} and performs no external checks
/ready on auth service returns 503 when Postgres is unreachable
/ready on proxy service returns 503 when auth gRPC is unreachable
/ready on proxy service returns 503 when Redis is unreachable
Response body is valid JSON matching the defined schema
packages/healthcheck used by both services (no duplicated implementation)
Kubernetes probe configuration documented in docs/OPS_GUIDE.md
ADR-0022 written and indexed
