Milestone 1.4.3 — Health Check Contract and Standardisation

Status: Complete
Goal: 1.4 — Developer Experience Baseline
Phase: 1 — Core Platform
Estimated effort: 1 day
ADR required: ADR-0022 — Health check contract


Why This Milestone Exists

Both services expose /health and /ready endpoints, but with no defined response schema, no specification of what each probe checks, and no consistent behaviour between services. This creates two concrete problems:

Kubernetes probe misconfiguration: Without a documented contract, platform engineers configuring liveness and readiness probes must guess at response codes and timing. A wrong probe configuration silently causes rolling deploy failures or keeps broken pods in the load balancer rotation.

Inconsistent dependency checking: The proxy's /ready should verify that it can reach the auth gRPC service, and the auth service's /ready should verify that it can reach Postgres. Without a standard, each developer adds whichever checks seem useful at the time. One service ends up checking Redis but not Postgres; another checks nothing.


Non-Goals

  • Deep health checks that verify data integrity (too expensive for probes)
  • Health check aggregation dashboard (observability work)
  • Custom health check protocols beyond HTTP

Branch

chore/m1-4-3-health-check-contract

PR Title

chore(ops): health check contract and standardisation across services (m1.4.3)


Deliverables

1. ADR-0022 — Health check contract

Define:

  • /health — liveness probe: returns 200 if the process is alive and not deadlocked. Checks nothing external. Called by Kubernetes every 10s. Fails → pod is restarted.
  • /ready — readiness probe: returns 200 only if the service can handle traffic. Checks all critical dependencies. Fails → pod removed from service load balancer (not restarted).
  • Response schema: {"status": "ok"|"degraded"|"unhealthy", "checks": {...}}
  • Dependency classification: critical (readiness fails if down) vs. non-critical (degraded but ready)
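The mapping from dependency classification to response status and HTTP code can be sketched as a single pure function. This is a minimal illustration, not the ADR's wording: `overallStatus` and its parameters are hypothetical names, and the inputs are simply the number of failed critical and advisory checks.

```go
package main

import (
	"fmt"
	"net/http"
)

// overallStatus is a hypothetical helper that derives the /ready response
// status and HTTP code from the number of failed checks in each class.
// A failed critical check always wins over a failed advisory check.
func overallStatus(criticalFailed, advisoryFailed int) (string, int) {
	switch {
	case criticalFailed > 0:
		return "unhealthy", http.StatusServiceUnavailable // 503: pod leaves the load balancer
	case advisoryFailed > 0:
		return "degraded", http.StatusOK // 200: pod keeps receiving traffic
	default:
		return "ok", http.StatusOK
	}
}

func main() {
	// Each pair is {failed critical checks, failed advisory checks}.
	for _, c := range [][2]int{{0, 0}, {0, 1}, {1, 0}, {2, 3}} {
		status, code := overallStatus(c[0], c[1])
		fmt.Printf("critical=%d advisory=%d -> %s (%d)\n", c[0], c[1], status, code)
	}
}
```

Keeping this decision in one place means both services report "degraded" and "unhealthy" under exactly the same conditions.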

2. packages/healthcheck — reusable probe framework

Go
package healthcheck

import "context"

// Checker is a named dependency check function.
// Returns nil if the dependency is healthy.
// Returns an error with a human-readable description if not.
type Checker func(ctx context.Context) error

// Server runs /health and /ready endpoints.
type Server struct {
    // CriticalCheckers are checked on /ready.
    // If any fails, /ready returns 503.
    CriticalCheckers map[string]Checker

    // AdvisoryCheckers are checked on /ready.
    // If any fails, /ready returns 200 with status "degraded".
    // The service continues to receive traffic.
    AdvisoryCheckers map[string]Checker
}

// Response is the JSON body returned by both endpoints.
type Response struct {
    Status string           `json:"status"` // "ok", "degraded", "unhealthy"
    Checks map[string]Check `json:"checks"` // keyed by checker name; empty for /health
}

// Check is the result of running a single Checker.
type Check struct {
    Status  string `json:"status"`            // "ok", "failed"
    Message string `json:"message,omitempty"` // error message if failed
    LatMs   int64  `json:"latency_ms"`        // time the check took, in milliseconds
}
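One way the types above could drive a /ready handler is sketched below. It is an illustration under assumptions, not the package's actual implementation: `runAll`, `readyHandler`, `probe`, `alwaysOK` and `alwaysDown` are hypothetical names, checks run sequentially, and critical and advisory checker names are assumed not to collide.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

type Checker func(ctx context.Context) error

type Check struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
	LatMs   int64  `json:"latency_ms"`
}

type Response struct {
	Status string           `json:"status"`
	Checks map[string]Check `json:"checks"`
}

// Stand-in checkers used only for the demonstration in main.
var (
	alwaysOK   Checker = func(context.Context) error { return nil }
	alwaysDown Checker = func(context.Context) error { return errors.New("connection refused") }
)

// runAll executes every checker, records its status and latency under its
// name in out, and reports whether at least one checker failed.
func runAll(ctx context.Context, checkers map[string]Checker, out map[string]Check) bool {
	failed := false
	for name, check := range checkers {
		start := time.Now()
		err := check(ctx)
		c := Check{Status: "ok", LatMs: time.Since(start).Milliseconds()}
		if err != nil {
			c.Status, c.Message = "failed", err.Error()
			failed = true
		}
		out[name] = c
	}
	return failed
}

// readyHandler builds a /ready handler: advisory failures degrade the
// response, critical failures turn it into a 503.
func readyHandler(critical, advisory map[string]Checker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := Response{Status: "ok", Checks: map[string]Check{}}
		code := http.StatusOK
		if runAll(r.Context(), advisory, resp.Checks) {
			resp.Status = "degraded"
		}
		if runAll(r.Context(), critical, resp.Checks) {
			resp.Status, code = "unhealthy", http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(resp)
	}
}

// probe calls the handler in-process and returns the status code and decoded body.
func probe(critical, advisory map[string]Checker) (int, Response) {
	rec := httptest.NewRecorder()
	readyHandler(critical, advisory)(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	var resp Response
	if err := json.Unmarshal(rec.Body.Bytes(), &resp); err != nil {
		panic(err)
	}
	return rec.Code, resp
}

func main() {
	code, resp := probe(map[string]Checker{"postgres": alwaysDown}, map[string]Checker{"cache": alwaysOK})
	fmt.Println(code, resp.Status, resp.Checks["postgres"].Message)
}
```

Running advisory checks before critical ones lets a critical failure overwrite "degraded" with "unhealthy" without extra branching.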

3. Standard checks per service

Auth service critical checks (fail /ready):

  • Postgres: SELECT 1 with 500ms timeout
  • gRPC listener: can bind and accept (self-check)

Auth service advisory checks (degrade /ready):

  • None in Phase 1

Proxy service critical checks (fail /ready):

  • Auth gRPC: Ping or ValidateToken with a known-bad token (expect UNAUTHENTICATED, not a transport error)
  • Redis: PING with 500ms timeout

Proxy service advisory checks (degrade /ready):

  • None in Phase 1; Phase 2 will add LLM provider reachability as advisory

4. Response examples

JSON
// GET /health — always 200 if process is alive
{
  "status": "ok",
  "checks": {}
}
 
// GET /ready — 200 when all critical checks pass
{
  "status": "ok",
  "checks": {
    "postgres": {"status": "ok", "latency_ms": 2},
    "redis":    {"status": "ok", "latency_ms": 1}
  }
}
 
// GET /ready — 503 when a critical check fails
{
  "status": "unhealthy",
  "checks": {
    "postgres": {"status": "failed", "message": "connection refused", "latency_ms": 500},
    "redis":    {"status": "ok", "latency_ms": 1}
  }
}
 
// GET /ready — 200 but degraded when advisory check fails
{
  "status": "degraded",
  "checks": {
    "postgres": {"status": "ok", "latency_ms": 3}
  }
}

Acceptance Criteria

  • /health on both services returns 200 and {"status":"ok"} with no external checks
  • /ready on auth service returns 503 when Postgres is unreachable
  • /ready on proxy service returns 503 when auth gRPC is unreachable
  • /ready on proxy service returns 503 when Redis is unreachable
  • Response body is valid JSON matching the defined schema
  • packages/healthcheck used by both services (no duplicated implementation)
  • Kubernetes probe configuration documented in docs/OPS_GUIDE.md
  • ADR-0022 written and indexed
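The "valid JSON matching the defined schema" criterion could be checked mechanically. The validator below is only a sketch of one possible check: `validateBody` and the lowercase mirror types are hypothetical, and it treats `latency_ms` as required on every check because all the response examples include it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// check and response mirror the documented schema. LatMs is a pointer so
// that a missing latency_ms can be told apart from a value of 0.
type check struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
	LatMs   *int64 `json:"latency_ms"`
}

type response struct {
	Status string           `json:"status"`
	Checks map[string]check `json:"checks"`
}

// validateBody returns the first deviation from the schema, or nil if the
// body conforms.
func validateBody(body []byte) error {
	var r response
	if err := json.Unmarshal(body, &r); err != nil {
		return fmt.Errorf("invalid JSON: %w", err)
	}
	switch r.Status {
	case "ok", "degraded", "unhealthy":
	default:
		return fmt.Errorf("unexpected status %q", r.Status)
	}
	if r.Checks == nil {
		return fmt.Errorf(`missing "checks" object`)
	}
	for name, c := range r.Checks {
		if c.Status != "ok" && c.Status != "failed" {
			return fmt.Errorf("check %q: unexpected status %q", name, c.Status)
		}
		if c.LatMs == nil {
			return fmt.Errorf("check %q: missing latency_ms", name)
		}
	}
	return nil
}

func main() {
	good := `{"status":"ok","checks":{"postgres":{"status":"ok","latency_ms":2}}}`
	bad := `{"status":"up","checks":{}}`
	fmt.Println("good:", validateBody([]byte(good)))
	fmt.Println("bad: ", validateBody([]byte(bad)))
}
```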
