Health checks

Liveness and readiness endpoints, probe contract, and Kubernetes configuration.

The auth and proxy services each expose two HTTP probe endpoints with distinct semantics. Liveness tells the orchestrator whether the process is alive; readiness tells the load balancer whether to send traffic. The contract is defined in ADR-0022 and implemented in packages/healthcheck.

Do not conflate probes

Point liveness at /health only. Using /ready for liveness restarts pods whenever a dependency blips, which turns a transient dependency failure into a wider outage.

Endpoint summary

| Endpoint | Probe type | Checks deps | On failure |
| --- | --- | --- | --- |
| GET /health | Liveness | No | Orchestrator restarts the pod |
| GET /ready | Readiness | Yes (critical) | Pod removed from service endpoints |

Response schema

JSON
{
  "status": "unhealthy",
  "checks": {
    "postgres": { "status": "ok", "latency_ms": 2 },
    "redis": { "status": "failed", "message": "connection refused", "latency_ms": 501 }
  }
}
| Field | Values |
| --- | --- |
| Top-level status | ok, degraded, unhealthy |
| checks[name].status | ok, failed |
| checks[name].message | Present on failure; no secrets |
| checks[name].latency_ms | Wall time for that checker |

/health always returns HTTP 200 with {"status":"ok","checks":{}}.
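The schema above lends itself to a small contract check in tests. The following is a minimal sketch, not the project's own validator: the function name `validate_probe_body` is hypothetical, and it assumes that `latency_ms` is always present and that `message` is required whenever a check has failed.

```python
from typing import Any

TOP_LEVEL_STATUSES = {"ok", "degraded", "unhealthy"}
CHECK_STATUSES = {"ok", "failed"}


def validate_probe_body(body: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the body conforms."""
    problems: list[str] = []

    if body.get("status") not in TOP_LEVEL_STATUSES:
        problems.append(f"unexpected top-level status: {body.get('status')!r}")

    checks = body.get("checks")
    if not isinstance(checks, dict):
        problems.append("checks must be an object")
        return problems

    for name, result in checks.items():
        if not isinstance(result, dict):
            problems.append(f"{name}: result must be an object")
            continue
        if result.get("status") not in CHECK_STATUSES:
            problems.append(f"{name}: unexpected status {result.get('status')!r}")
        # Assumption: a failed check must always carry a message.
        if result.get("status") == "failed" and not result.get("message"):
            problems.append(f"{name}: failed check is missing a message")
        # Assumption: latency_ms is always present and non-negative.
        latency = result.get("latency_ms")
        if (
            not isinstance(latency, (int, float))
            or isinstance(latency, bool)
            or latency < 0
        ):
            problems.append(f"{name}: latency_ms must be a non-negative number")

    return problems
```

The liveness body `{"status": "ok", "checks": {}}` passes with no problems reported, since an empty `checks` object is valid.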

Readiness HTTP mapping

| Condition | HTTP | status field |
| --- | --- | --- |
| All critical checks pass | 200 | ok |
| Critical pass, advisory fail | 200 | degraded |
| Any critical check fails | 503 | unhealthy |

Phase 1 defines no advisory checks. degraded is reserved for Phase 2 (e.g. LLM provider reachability).
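The mapping above reduces to two questions asked in order: did any critical check fail, and if not, did any advisory check fail? A minimal sketch, assuming a hypothetical `CheckResult` record and `readiness_outcome` function (neither name comes from packages/healthcheck):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    ok: bool
    critical: bool = True  # Phase 1 defines only critical checks


def readiness_outcome(results: list[CheckResult]) -> tuple[int, str]:
    """Map checker results to (HTTP status code, top-level status field)."""
    # Any critical failure wins, regardless of advisory results.
    if any(r.critical and not r.ok for r in results):
        return 503, "unhealthy"
    # Only advisory failures remain: still serving, but flagged.
    if any(not r.critical and not r.ok for r in results):
        return 200, "degraded"
    return 200, "ok"
```

Because Phase 1 has no advisory checks, the `degraded` branch is unreachable until an advisory checker is registered.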

Readiness decision flow:
+--------------------------+                                  
|                          |                                  
|        GET /ready        |                                  
|                          |                                  
+--------------------------+                                  
              |                                               
              |                                               
              |                                               
              |                                               
              v                                               
<-------------------------->                                  
|                          |                                  
|     Critical checks      |-------------------+              
|                          |                   |              
<-------------------------->               any fail           
              |                                |              
          all pass                             |              
              |                                |              
              |                                |              
              v                                v              
+--------------------------+     +---------------------------+
|                          |     |                           |
|    HTTP 200 status ok    |     | HTTP 503 status unhealthy |
|                          |     |                           |
+--------------------------+     +---------------------------+
              |                                               
              |                                               
              |                                               
              |                                               
              v                                               
<-------------------------->                                  
|                          |                                  
| Advisory checks Phase 2+ |                                  
|                          |                                  
<-------------------------->                                  
              |                                               
        advisory fail                                         
              |                                               
              |                                               
              v                                               
+--------------------------+                                  
|                          |                                  
| HTTP 200 status degraded |                                  
|                          |                                  
+--------------------------+                                  

Phase 1 checkers

Auth service

| Checker | Type | What it does |
| --- | --- | --- |
| postgres | Critical | SELECT 1 via connection pool |
| grpc | Critical | TCP reachability to gRPC listen port |
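A TCP reachability check like the `grpc` checker amounts to opening and immediately closing a connection. The sketch below is illustrative only: `tcp_reachable` is a hypothetical name, the 0.5 s default mirrors the per-checker budget, and reporting only the exception class name is one assumed way to keep secrets out of `message`.

```python
import socket
import time


def tcp_reachable(host: str, port: int, timeout_s: float = 0.5) -> dict:
    """Try to open a TCP connection and report the result in checker form."""
    started = time.monotonic()
    try:
        # A completed handshake is enough; no application data is sent.
        with socket.create_connection((host, port), timeout=timeout_s):
            result = {"status": "ok"}
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        result = {"status": "failed", "message": type(exc).__name__}
    result["latency_ms"] = int((time.monotonic() - started) * 1000)
    return result
```

Note that this only proves a listener is accepting connections on the port, not that the gRPC server behind it is handling requests.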

Proxy service

| Checker | Type | What it does |
| --- | --- | --- |
| auth_grpc | Critical | ValidateToken probe; Unauthenticated means reachable |
| redis | Critical | PING over RESP |
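"PING over RESP" means sending the command as a RESP array of bulk strings and expecting the simple string `+PONG` back. The following sketch shows the wire format with the standard library only; the function names are hypothetical, and it assumes a Redis instance that does not require authentication (one that does would answer with an error reply instead of `+PONG`).

```python
import socket


def encode_command(*parts: bytes) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = b"*%d\r\n" % len(parts)
    for part in parts:
        out += b"$%d\r\n%s\r\n" % (len(part), part)
    return out


PING_COMMAND = encode_command(b"PING")  # b"*1\r\n$4\r\nPING\r\n"


def ping_over(conn: socket.socket) -> bool:
    """Send PING on an open connection and check for a +PONG reply."""
    conn.sendall(PING_COMMAND)
    return conn.recv(64).startswith(b"+PONG")


def redis_ping(host: str, port: int, timeout_s: float = 0.5) -> bool:
    """Return True if the server at host:port answers PING within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            return ping_over(conn)
    except OSError:
        # Refused, unreachable, or timed out: treat all as a failed check.
        return False
```

Splitting `ping_over` from the connection setup keeps the protocol exchange testable against any connected socket.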

Empty REDIS_URL

When Redis is not configured (REDIS_URL is empty), the redis checker fails and proxy /ready returns 503. Rate limiting falls back to a Noop limiter, which is acceptable for local development without Redis as long as you accept the not-ready status.

Timeouts

| Budget | Value |
| --- | --- |
| Per checker | 500ms |
| Overall /ready request | 750ms |

Checkers run in parallel within the overall budget. Configure Kubernetes probe timeoutSeconds: 2 to accommodate network overhead.
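The two budgets can be pictured as nested timeouts: each checker gets its own deadline, and the whole batch is cut off at the overall deadline. This is a rough sketch using asyncio, not the implementation in packages/healthcheck; the names `run_one` and `run_all`, and the failure messages, are assumptions made for illustration.

```python
import asyncio
import time
from typing import Awaitable, Callable

PER_CHECK_TIMEOUT_S = 0.5   # per-checker budget
OVERALL_BUDGET_S = 0.75     # overall /ready budget

Checker = Callable[[], Awaitable[None]]  # raises on failure


async def run_one(check: Checker, timeout_s: float) -> dict:
    """Run a single checker under its own timeout."""
    started = time.monotonic()
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        result = {"status": "ok"}
    except asyncio.TimeoutError:
        result = {"status": "failed", "message": "timed out"}
    except Exception as exc:
        result = {"status": "failed", "message": type(exc).__name__}
    result["latency_ms"] = int((time.monotonic() - started) * 1000)
    return result


async def run_all(
    checkers: dict[str, Checker],
    per_check_s: float = PER_CHECK_TIMEOUT_S,
    overall_s: float = OVERALL_BUDGET_S,
) -> dict[str, dict]:
    """Run all checkers concurrently, bounded by the overall budget."""
    tasks = {
        name: asyncio.ensure_future(run_one(check, per_check_s))
        for name, check in checkers.items()
    }
    if not tasks:
        return {}
    done, pending = await asyncio.wait(tasks.values(), timeout=overall_s)
    for task in pending:
        task.cancel()  # anything still running is past the overall budget
    late = {
        "status": "failed",
        "message": "overall budget exceeded",
        "latency_ms": int(overall_s * 1000),
    }
    return {
        name: task.result() if task in done else dict(late)
        for name, task in tasks.items()
    }
```

Since the checkers run concurrently and each is capped at 500ms, the 750ms overall budget mainly acts as a backstop for scheduling delays rather than as the common failure path.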

Local verification

bash
curl -s http://localhost:8081/health | jq .
curl -s http://localhost:8081/ready  | jq .

Kubernetes probes

Recommended configuration for auth and proxy Deployments:

YAML
ports:
  - name: http
    containerPort: 8080   # auth: 8081 per IBEX_PORT
 
livenessProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
 
readinessProbe:
  httpGet:
    path: /ready
    port: http
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2
  successThreshold: 1
1. Liveness → /health. Restarts only when the process is deadlocked or crashed, not when Postgres is briefly unavailable.

2. Readiness → /ready. Removes the pod from the Service endpoints when a critical dependency is unreachable: auth gRPC or Redis for proxy, Postgres or the gRPC listen port for auth.

3. Verify after deploy. Confirm that at least N replicas (your required minimum) return HTTP 200 from /ready before declaring the rollout complete.
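The post-deploy check can be scripted. The sketch below assumes you already have a list of per-pod base URLs (how you obtain them is outside this page) and uses hypothetical function names; it counts a pod as ready only when /ready answers HTTP 200, which includes the `degraded` case.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Iterable


def fetch_ready(base_url: str, timeout_s: float = 2.0) -> tuple[int, dict]:
    """GET <base_url>/ready and return (HTTP status code, parsed JSON body)."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout_s) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # urllib raises on 503; the error object still carries the JSON body.
        return err.code, json.load(err)


def rollout_complete(
    pod_urls: Iterable[str],
    min_ready: int,
    fetch: Callable[[str], tuple[int, dict]] = fetch_ready,
) -> bool:
    """Return True when at least min_ready pods answer /ready with HTTP 200."""
    ready = 0
    for url in pod_urls:
        try:
            code, _body = fetch(url)
        except (OSError, ValueError):
            # Unreachable pod or non-JSON body: count it as not ready.
            continue
        if code == 200:
            ready += 1
    return ready >= min_ready
```

The `fetch` parameter exists so the counting logic can be exercised with a stub, without a running cluster.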

Related

  • Observability — /metrics on the same HTTP port
  • Troubleshooting
  • ADR-0022: Health check contract
