Incident response
On-call runbooks for proxy outages, auth failures, and tenant isolation violations.
Runbooks are step-by-step procedures for alerts and incidents. They apply to Phase 1 services (auth, proxy) and shared dependencies (Postgres, Redis). Capture trace_id, deploy version, and a timeline from the first observation — never paste secrets into tickets or chat.
Before you start
Open an incident note immediately:
## Incident
Start time:
Severity:
Impact:
Services affected:
## Timeline
- [time] observed symptom
- [time] hypothesis + check
- [time] mitigation applied
- [time] recovery verified
Runbook: Proxy down (P1)
Alert signals
- ibex_process_up{service="proxy"} == 0 for 1 minute
- Error rate above 10% for 5 minutes
- Customer report: LLM calls failing
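The "above 10% for 5 minutes" signal only fires when the breach is sustained, not on a single bad sample. A minimal sketch of that logic, assuming one error-rate sample per minute (the function name, the sampling interval, and the defaults are illustrative assumptions, not the actual alert rule):

```python
from typing import Sequence


def sustained_breach(
    error_rates: Sequence[float],
    threshold: float = 0.10,  # 10% error rate, taken from the alert signal above
    window: int = 5,          # assumed: one sample per minute, so 5 samples = 5 minutes
) -> bool:
    """Return True when each of the last `window` samples exceeds `threshold`."""
    if len(error_rates) < window:
        # Not enough history to claim a sustained breach.
        return False
    return all(rate > threshold for rate in error_rates[-window:])
```

A single dip below the threshold inside the window resets the condition, which is why a flapping error rate may never page even though customers see failures.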
Likely causes
- Bad deploy or crash loop
- Missing required environment variables
- OOMKilled under load
- Readiness failure misconfigured as liveness (avoidable restarts)
Immediate actions (first 15 minutes)
Confirm blast radius
All traffic or single region? Is ingress reachable?
Inspect pods
CrashLoopBackOff, OOMKilled, ImagePullBackOff?
Recent deploy
If a rollout started near onset, prepare rollback to last known-good image digest.
kubectl -n ibex-system get pods -l app=proxy
kubectl -n ibex-system describe pod <proxy-pod>
kubectl -n ibex-system logs <proxy-pod> --previous
Mitigation
- Roll back to last known-good image (preferred — proxy is stateless)
- Scale replicas up if load spike
- If auth is down: proxy should stay live (/health returns 200) but not ready (/ready returns 503). Do not point the liveness probe at /ready.
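The liveness and readiness split described above can be sketched as two independent checks. This is an illustration only: `ProbeResult`, `health`, `ready`, and the `auth_reachable` flag are hypothetical names, not the proxy's actual handlers.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status: int
    body: str


def health() -> ProbeResult:
    """Liveness: the process is up and can answer. No dependency checks."""
    return ProbeResult(200, "ok")


def ready(auth_reachable: bool) -> ProbeResult:
    """Readiness: only accept traffic when the auth dependency is reachable."""
    if auth_reachable:
        return ProbeResult(200, "ready")
    # 503 removes the pod from load balancing without restarting it.
    return ProbeResult(503, "auth unavailable")
```

If the liveness probe pointed at the readiness check instead, an auth outage would restart every proxy pod, turning a dependency failure into a crash loop.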
Recovery verification
- /health returns 200 on N replicas
- /ready returns 200
- Error rate and ibex_proxy_request_duration_seconds p99 back to baseline
- make dev-smoke passes locally after fix
Runbook: Auth validation failures spike (P1/P2)
Alert signals
- Spike in 401 responses across proxy
- Dashboard login failures (Phase 2+)
- ibex_auth_validate_token_duration_seconds errors rising
Likely causes
- Client mass misconfiguration (wrong token deployed)
- Token revocation or expiration bug
- JWT key rotation mismatch (kid not in keyset)
- Clock skew between services
Diagnosis
Classify failure type
Invalid vs expired vs revoked vs org_suspended — error codes in logs and metrics result labels.
Check concentration
A single org suggests client misconfiguration; many source IPs suggest brute force or another attack.
JWT keyset
Verify JWT_KEY_ID_CURRENT matches a key in JWT_PUBLIC_KEYS_PEM on all verifiers.
Clock sync
NTP drift causes premature expiry if services disagree on time.
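The concentration check in the diagnosis steps can be approximated by summarizing recent 401 events per org and per source IP. A minimal sketch, assuming the events have already been extracted from logs as `(org_id, source_ip)` pairs (the input shape and the function name are assumptions):

```python
from collections import Counter
from typing import Dict, Iterable, Tuple, Union


def summarize_401s(
    events: Iterable[Tuple[str, str]],
) -> Dict[str, Union[str, float, int, None]]:
    """Summarize 401 events given as (org_id, source_ip) pairs."""
    rows = list(events)
    if not rows:
        return {"top_org": None, "top_org_share": 0.0, "distinct_ips": 0}
    org_counts = Counter(org for org, _ in rows)
    top_org, top_count = org_counts.most_common(1)[0]
    return {
        "top_org": top_org,
        # Share of all 401s that belong to the most affected org.
        "top_org_share": top_count / len(rows),
        # Many distinct IPs point away from a single misconfigured client.
        "distinct_ips": len({ip for _, ip in rows}),
    }
```

A high `top_org_share` with few distinct IPs is consistent with one client deploying a wrong token; a low share spread across many IPs is worth escalating to security.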
Mitigation
| Cause | Action |
|---|---|
| Attack suspected | Tighten auth endpoint rate limits; notify security |
| Key rotation bug | Restore previous public key in keyset; redeploy verifiers |
| Client misconfig | Notify impacted org; issue new PAT — avoid extending expired tokens |
Verification
- 401 rate returns to baseline
- Proxy auth cache hit rate stable
- Protected route smoke test returns expected 501 (auth passed)
Runbook: Suspected tenant isolation violation (P1)
Alert signals
- User report: seeing another org's data
- Audit log anomaly
- Cross-tenant access in telemetry (should be impossible)
Immediate actions — do not delay
Declare P1
Freeze deployments. Start incident timeline.
Enable safe logging
Enhanced audit mode — still no raw tokens or memory content in logs.
Identify path
API endpoint, Redis key namespace, ClickHouse query, or connection pool org context leak?
Diagnosis plan
- Confirm with controlled test orgs if possible
- Verify RLS policies enabled on ibex_core tables
- Search code path for queries missing org_id
- Inspect Redis keys for missing org prefix
- For analytics: confirm query guard rejected unscoped ClickHouse SQL
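For the Redis key inspection in the plan above, a small filter over a sampled key list can flag keys that lack an org prefix. The `org:<org_id>:` convention and the allow-list of global prefixes are hypothetical here; substitute whatever namespace scheme the service actually uses.

```python
import re
from typing import Iterable, List

# Assumed convention: tenant-scoped keys look like "org:<org_id>:<rest>".
ORG_PREFIX = re.compile(r"^org:[A-Za-z0-9_-]+:")


def keys_missing_org_prefix(
    keys: Iterable[str],
    allow: Iterable[str] = (),
) -> List[str]:
    """Return keys that are neither org-prefixed nor under an allowed global prefix."""
    allowed = tuple(allow)
    return [
        key
        for key in keys
        if not ORG_PREFIX.match(key) and not key.startswith(allowed)
    ]
```

On a live instance, collect the sample with an incremental SCAN rather than KEYS, since KEYS blocks the server while it walks the whole keyspace.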
Mitigation
- Fail closed when org context is missing or mismatched
- Hotfix query guards and key namespacing
- If uncertain: disable affected endpoint via feature flag while preserving core proxy traffic
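Failing closed means a request with missing or mismatched org context is rejected rather than served with a default. A minimal sketch of such a guard, with hypothetical names (`OrgContextError`, `require_org_match`) that are not taken from the codebase:

```python
from typing import Optional


class OrgContextError(Exception):
    """Raised when org context is missing or does not match the resource."""


def require_org_match(
    request_org_id: Optional[str],
    resource_org_id: Optional[str],
) -> str:
    """Return the org id only when both sides are present and equal."""
    if not request_org_id or not resource_org_id:
        # Missing context is treated as a denial, never as "no restriction".
        raise OrgContextError("org context missing")
    if request_org_id != resource_org_id:
        raise OrgContextError("org context mismatch")
    return request_org_id
```

Calling a guard like this at the data-access boundary, rather than in each handler, makes it harder for a new code path to skip the check.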
Verification
- Cross-tenant integration tests pass in staging
- Reproduction no longer possible
- Postmortem with regression tests committed
Severity reference
| Level | Examples |
|---|---|
| P1 | Tenant isolation breach, secret compromise, auth bypass, proxy total outage |
| P2 | Sustained p99 overhead regression, Redis degraded, DB failover |
| P3 | Analytics delay, non-blocking dashboard issues |