ibexharness
Incident response

On-call runbooks for proxy outages, auth failures, and tenant isolation violations.

Runbooks are step-by-step procedures for alerts and incidents. They apply to Phase 1 services (auth, proxy) and shared dependencies (Postgres, Redis). For every incident, capture the trace_id, the deploy version, and a timeline starting from the first observation. Never paste secrets into tickets or chat.
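
One way to follow the no-secrets rule when copying log excerpts into a ticket is to mask obvious credentials first. The filter below is a minimal sketch, not part of ibexharness: the function name redact_secrets and the patterns it matches (Bearer tokens, and key=value pairs named token, password, secret, or api_key) are assumptions for illustration. It will not catch every secret format, so review the output before pasting.

```shell
# Hypothetical helper: masks common credential patterns in text read from stdin.
# The patterns are illustrative assumptions; extend them for your own log format.
redact_secrets() {
  sed -E \
    -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g' \
    -e 's#((token|password|secret|api_key)=)[^[:space:]&"]+#\1[REDACTED]#g'
}
```

For example, `kubectl -n ibex-system logs <proxy-pod> --previous | redact_secrets` prints the log with matching values replaced by [REDACTED].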

Safety rules

Do not disable auth, rate limits, or RLS to recover service. Prefer rollback, scale-up, or reversible mitigations.

Before you start

Open an incident note immediately:

markdown
## Incident
Start time:
Severity:
Impact:
Services affected:
 
## Timeline
- [time] observed symptom
- [time] hypothesis + check
- [time] mitigation applied
- [time] recovery verified

Runbook: Proxy down (P1)

Alert signals

  • ibex_process_up{service="proxy"} == 0 for 1 minute
  • Error rate above 10% for 5 minutes
  • Customer report: LLM calls failing
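
To confirm the first signal by hand, the alert expression can be run as an instant query against the Prometheus HTTP API. This is a sketch under assumptions: PROM_URL (and its localhost default) and the helper name prom_query are hypothetical, and your monitoring stack may expose metrics somewhere else entirely.

```shell
# Assumption: PROM_URL points at a Prometheus server that scrapes the ibex metrics.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query against the Prometheus HTTP API and print the raw JSON.
prom_query() {
  curl -fsS --get "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Expression from the alert above. An empty result set means no proxy
# target currently reports ibex_process_up == 0.
proxy_down_query='ibex_process_up{service="proxy"} == 0'
```

Running `prom_query "$proxy_down_query"` returns the series that match the alert condition at the moment of the query.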

Likely causes

  • Bad deploy or crash loop
  • Missing required environment variables
  • OOMKilled under load
  • Readiness check wired to the liveness probe, causing avoidable restarts
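
The last cause can be checked directly by reading the liveness probe path from the Deployment spec. The sketch below assumes the Deployment is named proxy and that its first container carries an httpGet liveness probe; both are assumptions, as are the function names.

```shell
# Assumptions: the Deployment is named "proxy" and its first container serves the probes.
liveness_path() {
  kubectl -n ibex-system get deployment proxy \
    -o jsonpath='{.spec.template.spec.containers[0].livenessProbe.httpGet.path}'
}

# Fails if the liveness probe targets /ready, which would restart pods
# whenever a dependency makes the proxy temporarily not ready.
check_liveness_probe() {
  local path
  path="$(liveness_path)"
  if [ "$path" = "/ready" ]; then
    echo "liveness probe points at /ready; expected /health" >&2
    return 1
  fi
  echo "liveness probe path: ${path:-<not an httpGet probe>}"
}
```

A non-zero exit from `check_liveness_probe` is a hint that restarts during a dependency outage are self-inflicted rather than crashes.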

Immediate actions (first 15 minutes)

  1. Confirm blast radius: All traffic or a single region? Is ingress reachable?
  2. Inspect pods: CrashLoopBackOff, OOMKilled, ImagePullBackOff?
  3. Recent deploy: If a rollout started near onset, prepare rollback to the last known-good image digest.

bash
kubectl -n ibex-system get pods -l app=proxy
kubectl -n ibex-system describe pod <proxy-pod>
kubectl -n ibex-system logs <proxy-pod> --previous

Mitigation

  1. Roll back to the last known-good image (preferred, since the proxy is stateless)
  2. Scale replicas up if the cause is a load spike
  3. If auth is down, the proxy should stay live (/health 200) but not ready (/ready 503). Do not point the liveness probe at /ready
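
The rollback in step 1 could look like the following sketch. It assumes the Deployment and its container are both named proxy, which this page does not confirm, and it refuses any image reference that is not pinned by digest. The function name and the 120-second timeout are illustrative choices.

```shell
# Assumptions: Deployment and container are both named "proxy".
# Usage: rollback_proxy <image reference pinned by digest>
rollback_proxy() {
  local image_ref="$1"
  case "$image_ref" in
    *@sha256:*) ;;  # pinned by digest, as the runbook asks
    *)
      echo "refusing to deploy: image must be pinned by digest (name@sha256:...)" >&2
      return 1
      ;;
  esac
  # Point the Deployment at the known-good image, then wait for the rollout.
  kubectl -n ibex-system set image deployment/proxy "proxy=${image_ref}" &&
    kubectl -n ibex-system rollout status deployment/proxy --timeout=120s
}
```

If the previous revision is known to be good, `kubectl -n ibex-system rollout undo deployment/proxy` is a shorter alternative, at the cost of not naming the digest explicitly.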

Recovery verification

  • /health returns 200 on N replicas
  • /ready returns 200
  • Error rate and ibex_proxy_request_duration_seconds p99 back to baseline
  • make dev-smoke passes locally after fix
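
The two endpoint checks above can be scripted so the result lands in the incident timeline as a single line per endpoint. This is a minimal sketch: the helper names and the 5-second timeout are assumptions, and the base URL must come from your own environment (for example a port-forward to a proxy pod).

```shell
# Print the HTTP status code for a URL ("000" if the connection fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Succeeds only when the URL answers with the expected status code.
expect_status() {
  local url="$1" want="$2" got
  got="$(http_status "$url")"
  if [ "$got" = "$want" ]; then
    echo "ok   $url -> $got"
  else
    echo "FAIL $url -> $got (want $want)" >&2
    return 1
  fi
}
```

With BASE_URL set to a reachable proxy address, `expect_status "$BASE_URL/health" 200 && expect_status "$BASE_URL/ready" 200` exits non-zero as soon as either endpoint disagrees.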

Runbook: Auth validation failures spike (P1/P2)

Alert signals

  • Spike in 401 responses across proxy
  • Dashboard login failures (Phase 2+)
  • ibex_auth_validate_token_duration_seconds errors rising

Likely causes

  • Client mass misconfiguration (wrong token deployed)
  • Token revocation or expiration bug
  • JWT key rotation mismatch (kid not in keyset)
  • Clock skew between services

Diagnosis

  1. Classify failure type: invalid vs expired vs revoked vs org_suspended. Use the error codes in logs and the result labels on metrics.
  2. Check concentration: a single org suggests client misconfiguration; many source IPs suggest brute force or another attack.
  3. JWT keyset: verify that JWT_KEY_ID_CURRENT matches a key in JWT_PUBLIC_KEYS_PEM on all verifiers.
  4. Clock sync: NTP drift causes premature expiry if services disagree on the current time.
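
The first two diagnosis steps (failure type and concentration) amount to a tally over auth logs. The sketch below assumes a hypothetical log format: whitespace-separated key=value fields that include result= and org_id=, with successful validations logged as result=ok. Adjust the field names to whatever the auth service actually emits.

```shell
# Reads log lines on stdin and prints "<count> <result> <org_id>", most frequent first.
# Assumed (hypothetical) line format: ... result=<code> org_id=<id> ...
tally_auth_failures() {
  awk '
    {
      result = ""; org = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^result=/) result = substr($i, 8)   # text after "result="
        if ($i ~ /^org_id=/) org = substr($i, 8)      # text after "org_id="
      }
      # Count only failures, grouped by failure type and org.
      if (result != "" && result != "ok") count[result " " org]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k1,1nr -k2
}
```

One org dominating a single failure type points toward client misconfiguration, while failures spread thinly across many orgs point toward a server-side cause such as key rotation or clock skew.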

Mitigation

| Cause | Action |
| --- | --- |
| Attack suspected | Tighten auth endpoint rate limits; notify security |
| Key rotation bug | Restore previous public key in keyset; redeploy verifiers |
| Client misconfig | Notify impacted org; issue new PAT; avoid extending expired tokens |

Verification

  • 401 rate returns to baseline
  • Proxy auth cache hit rate stable
  • Protected route smoke test returns the expected 501 rather than 401 (auth passed)

Runbook: Suspected tenant isolation violation (P1)

Alert signals

  • User report: seeing another org's data
  • Audit log anomaly
  • Cross-tenant access in telemetry (should be impossible)

Immediate actions — do not delay

  1. Declare P1: freeze deployments and start the incident timeline.
  2. Enable safe logging: turn on enhanced audit mode, still with no raw tokens or memory content in logs.
  3. Identify path: API endpoint, Redis key namespace, ClickHouse query, or an org context leak in the connection pool?

Diagnosis plan

Diagnosis flow (top to bottom):

Report received
  -> Reproducible?
       no  -> Audit log + trace_id
       yes -> Controlled test orgs
                -> RLS policies enabled?
                -> SET LOCAL per transaction?
                -> App WHERE org_id?
                -> Redis keys namespaced?

  1. Confirm with controlled test orgs if possible
  2. Verify RLS policies enabled on ibex_core tables
  3. Search code path for queries missing org_id
  4. Inspect Redis keys for missing org prefix
  5. For analytics: confirm query guard rejected unscoped ClickHouse SQL
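
Steps 2 and 4 can be sketched as two read-only checks. Everything environment-specific here is an assumption: that ibex_core is a Postgres schema, that DATABASE_URL holds a read-only connection string, and that tenant-scoped Redis keys start with org:<org_id>: (a hypothetical naming scheme). The catalog columns used (pg_class.relrowsecurity, pg_namespace.nspname) are standard PostgreSQL.

```shell
# Assumption: tenant-scoped Redis keys start with "org:<org_id>:" (hypothetical scheme).
# Reads key names on stdin and prints those without the prefix.
# Note: grep exits 1 when nothing is printed, i.e. when every key is scoped.
unscoped_redis_keys() {
  grep -Ev '^org:[^:]+:'
}

# Assumptions: ibex_core is a Postgres schema; DATABASE_URL is a read-only connection string.
# Lists ordinary tables in that schema that do NOT have row-level security enabled.
tables_without_rls() {
  psql "$DATABASE_URL" -At -c "
    SELECT c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'ibex_core'
      AND c.relkind = 'r'
      AND NOT c.relrowsecurity
    ORDER BY c.relname;"
}
```

Feed the first function from `redis-cli --scan`, which iterates incrementally instead of blocking the server the way KEYS does. Any output from either check is a lead to record in the timeline, not proof of a leak: some keys and tables may be legitimately global.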

Mitigation

  • Fail closed when org context is missing or mismatched
  • Hotfix query guards and key namespacing
  • If uncertain: disable affected endpoint via feature flag while preserving core proxy traffic

Never return 404 for cross-tenant

Cross-tenant probes must get 403, whether or not the target resource exists. If some IDs return 404 and others 403, the difference tells an attacker which resources exist.

Verification

  • Cross-tenant integration tests pass in staging
  • Reproduction no longer possible
  • Postmortem with regression tests committed

Severity reference

| Level | Examples |
| --- | --- |
| P1 | Tenant isolation breach, secret compromise, auth bypass, proxy total outage |
| P2 | Sustained p99 overhead regression, Redis degraded, DB failover |
| P3 | Analytics delay, non-blocking dashboard issues |

Related

  • Security overview
  • Tenant isolation
  • Troubleshooting
  • Health checks
