Milestone 1.4.2 — Shared Infrastructure Packages: packages/config and packages/apierror

Status: Planned
Goal: 1.4 — Developer Experience Baseline
Phase: 1 — Core Platform
Estimated effort: 2–3 days
ADR required: ADR-0020 — Shared package boundaries


Why This Milestone Exists

The config problem: Both services/auth and services/proxy currently call os.Getenv() directly, scattered across multiple source files. There is no validation that required variables are set. A service that starts with POSTGRES_DSN="" will crash at runtime when handling the first request, not at startup with a clear diagnostic message. There is no single source of truth for what environment variables a service needs — instead, developers discover them one by one from crash logs.

The error code problem: M1.2.3 introduced the stable error envelope with codes like INVALID_JSON, MISSING_AGENT_ID, and RATE_LIMITED. These codes are currently string literals scattered across handler files. There is nothing preventing a developer from typing "INVLAID_JSON" (typo) in a new handler and shipping an inconsistent error code to clients. There is no canonical registry that documents all valid codes, their HTTP status mappings, and their intended semantics.

Both problems compound as more services are added. Phase 2 adds a Python API service. Without a canonical error code registry, Python and Go services will independently invent error codes that diverge. Without shared config patterns, each new service will reinvent env loading.


Non-Goals

  • Feature flags or runtime config hot-reload (Phase 4)
  • A secrets manager integration (Vault, AWS SSM — Phase 3)
  • Centralised config server (Consul, etcd — not planned)
  • Python config package (Python uses pydantic-settings; the pattern is documented in 07-python-code-quality.mdc)

Branch

chore/m1-4-2-shared-config-and-error-packages

PR Title

chore(infra): packages/config and packages/apierror shared infrastructure (m1.4.2)


Prerequisites


Deliverables

1. packages/config — typed, validated environment loading

Go
// Package config provides typed, validated environment variable loading
// for IBEX Harness Go services.
//
// Usage:
//   type MyServiceConfig struct {
//       DatabaseURL string        `env:"POSTGRES_DSN"          required:"true"`
//       ListenAddr  string        `env:"IBEX_PROXY_ADDR"       envDefault:":8080"`
//       MaxConns    int           `env:"IBEX_DB_MAX_CONNS"     envDefault:"20"`
//       Timeout     time.Duration `env:"IBEX_SHUTDOWN_TIMEOUT" envDefault:"30s"`
//   }
//   cfg, err := config.Load[MyServiceConfig]()
//   // err describes every missing or invalid variable, not just the first one
//
// The Load function:
//   1. Reads all struct fields tagged with `env:"VAR_NAME"`
//   2. Parses according to field type (string, int, bool, time.Duration, url.URL, uuid.UUID)
//   3. Applies defaults from `envDefault` tags
//   4. Returns all validation errors at once (not just the first)
//   5. Logs the resolved config at DEBUG level (redacting values for fields tagged `secret:"true"`)
package config
 
// Load parses environment variables into T and validates all required fields.
// Returns a descriptive error listing ALL missing/invalid variables — not just the first.
// Call in main() and Fatal on error; do not ignore the error.
func Load[T any]() (T, error)
 
// MustLoad is Load with a fatal log on error. Use in main() only.
func MustLoad[T any]() T
 
// Secret marks a field value as sensitive. Its value is redacted in debug logs.
// Apply to fields that hold tokens, passwords, or API keys.
//   APIKey string `env:"OPENAI_API_KEY" required:"true" secret:"true"`
type Secret string
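
The spec above mentions both a Secret type and a `secret:"true"` tag but does not say how redaction is carried out. One possible mechanism for the type-based variant is sketched below, assuming redaction goes through fmt's Stringer and GoStringer hooks. The `Reveal` helper and the `[REDACTED]` placeholder are hypothetical names, not part of the planned API.

```go
package main

import "fmt"

// Secret is a string whose value should never appear in logs.
// Assumption: redaction is implemented through fmt hooks; the real
// package may use the struct tag or a logger-specific hook instead.
type Secret string

const redacted = "[REDACTED]"

// String is picked up by fmt for the %v, %s and %q verbs.
func (s Secret) String() string { return redacted }

// GoString is picked up by fmt for %#v, which would otherwise
// print the raw underlying string.
func (s Secret) GoString() string { return redacted }

// Reveal returns the underlying value for the one call site that needs it,
// for example when building an Authorization header (hypothetical helper).
func (s Secret) Reveal() string { return string(s) }

func main() {
	key := Secret("sk-live-123")
	// All three verbs go through the redacting methods above.
	fmt.Printf("%v %s %#v\n", key, key, key)
	// The raw value is only reachable through an explicit call.
	fmt.Println(len(key.Reveal()))
}
```

An explicit accessor keeps accidental leaks (a stray `%v` in a log line) separate from deliberate use, at the cost of a conversion at each call site.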

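Adopt github.com/caarlos0/env/v11 as the underlying parser: it is well-maintained, supports generics, and handles all required types. Note that env/v11 marks required variables with an option inside the env tag (`env:"POSTGRES_DSN,required"`) rather than a separate `required:"true"` tag, so packages/config must either translate the `required` and `secret` tags shown above itself or adopt the library's tag syntax. Do NOT use Viper (too heavy), godotenv (env files are not our pattern), or raw os.Getenv.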
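
To illustrate the "report every problem at once" behaviour that Load is expected to have, here is a minimal standard-library sketch. It does not use caarlos0/env and is not the planned implementation. `LoadFrom`, `LoadError`, and `ProxyConfig` are hypothetical names, only string, int, and time.Duration fields are handled, and treating an empty value as unset is an assumption drawn from the POSTGRES_DSN="" scenario.

```go
package main

import (
	"fmt"
	"os"
	"reflect"
	"strconv"
	"strings"
	"time"
)

// LoadError lists every problem found in a single pass (hypothetical type).
type LoadError struct{ Problems []string }

func (e *LoadError) Error() string { return "config: " + strings.Join(e.Problems, "; ") }

// Load reads from the real process environment.
func Load[T any]() (T, error) { return LoadFrom[T](os.LookupEnv) }

// LoadFrom fills T using lookup, so tests can pass a map instead of the environment.
func LoadFrom[T any](lookup func(string) (string, bool)) (T, error) {
	var cfg T
	v := reflect.ValueOf(&cfg).Elem()
	if v.Kind() != reflect.Struct {
		return cfg, &LoadError{Problems: []string{"config type must be a struct"}}
	}
	var problems []string
	for i := 0; i < v.NumField(); i++ {
		field := v.Type().Field(i)
		name := field.Tag.Get("env")
		if name == "" || !v.Field(i).CanSet() {
			continue
		}
		raw, _ := lookup(name)
		if raw == "" { // assumption: an empty value counts as unset
			raw = field.Tag.Get("envDefault")
		}
		if raw == "" {
			if field.Tag.Get("required") == "true" {
				problems = append(problems, name+" is required but not set")
			}
			continue
		}
		// Keep going after a failure so later fields are still checked.
		if err := setField(v.Field(i), raw); err != nil {
			problems = append(problems, fmt.Sprintf("%s is invalid: %v", name, err))
		}
	}
	if len(problems) > 0 {
		return cfg, &LoadError{Problems: problems}
	}
	return cfg, nil
}

// setField parses raw according to the field's type.
func setField(f reflect.Value, raw string) error {
	switch {
	case f.Type() == reflect.TypeOf(time.Duration(0)): // Kind is Int64, so check the type first
		d, err := time.ParseDuration(raw)
		if err != nil {
			return err
		}
		f.SetInt(int64(d))
	case f.Kind() == reflect.String:
		f.SetString(raw)
	case f.Kind() == reflect.Int:
		n, err := strconv.Atoi(raw)
		if err != nil {
			return err
		}
		f.SetInt(int64(n))
	default:
		return fmt.Errorf("unsupported type %s", f.Type())
	}
	return nil
}

// ProxyConfig mirrors the usage example in the package comment.
type ProxyConfig struct {
	DatabaseURL string        `env:"POSTGRES_DSN" required:"true"`
	ListenAddr  string        `env:"IBEX_PROXY_ADDR" envDefault:":8080"`
	MaxConns    int           `env:"IBEX_DB_MAX_CONNS" envDefault:"20"`
	Timeout     time.Duration `env:"IBEX_SHUTDOWN_TIMEOUT" envDefault:"30s"`
}

func main() {
	fakeEnv := map[string]string{"IBEX_DB_MAX_CONNS": "lots"} // POSTGRES_DSN deliberately absent
	_, err := LoadFrom[ProxyConfig](func(k string) (string, bool) {
		val, ok := fakeEnv[k]
		return val, ok
	})
	fmt.Println(err) // a single error naming both POSTGRES_DSN and IBEX_DB_MAX_CONNS
}
```

Injecting the lookup function is a design choice of this sketch: it keeps tests free of process-wide environment mutation. A real implementation built on env/v11 would delegate parsing to the library instead of the reflection loop shown here.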


2. packages/apierror — canonical error code registry

Go
// Package apierror defines all error codes used in IBEX Harness HTTP and
// gRPC APIs. It is the single source of truth for:
//   - Error code string values (used in the JSON error envelope)
//   - HTTP status code mappings
//   - Human-readable default messages
//   - gRPC status code mappings
//
// Rules:
//   1. New error codes are added ONLY in this package.
//   2. No service or package uses raw string literals for error codes.
//   3. Every code has a corresponding HTTP status and gRPC status.
//   4. Error messages are safe to return to API callers (no internal details).
package apierror
 
// Code is a canonical IBEX error code string.
// Values are UPPER_SNAKE_CASE and stable across API versions.
type Code string
 
// ═════════════════════════════════════════════════════════════
// Client error codes (4xx)
// ═════════════════════════════════════════════════════════════
 
const (
    // CodeInvalidJSON is returned when the request body cannot be parsed as JSON.
    // HTTP 400 | gRPC INVALID_ARGUMENT
    CodeInvalidJSON Code = "INVALID_JSON"
 
    // CodeInvalidRequest is returned when a request field fails validation.
    // HTTP 400 | gRPC INVALID_ARGUMENT
    CodeInvalidRequest Code = "INVALID_REQUEST"
 
    // CodeMissingAgentID is returned when X-IBEX-Agent-ID header is absent.
    // HTTP 400 | gRPC INVALID_ARGUMENT
    CodeMissingAgentID Code = "MISSING_AGENT_ID"
 
    // CodeProviderNotConfigured is returned when no LLM provider is configured
    // for the requested model. HTTP 400 | gRPC FAILED_PRECONDITION
    CodeProviderNotConfigured Code = "PROVIDER_NOT_CONFIGURED"
 
    // CodeUnauthenticated is returned for missing, invalid, expired, or revoked tokens.
    // HTTP 401 | gRPC UNAUTHENTICATED
    CodeUnauthenticated Code = "UNAUTHENTICATED"
 
    // CodePermissionDenied is returned for valid tokens with insufficient permissions,
    // and for all cross-tenant resource access attempts.
    // HTTP 403 | gRPC PERMISSION_DENIED
    CodePermissionDenied Code = "PERMISSION_DENIED"
 
    // CodeAgentNotAuthorized is returned when X-IBEX-Agent-ID refers to an agent
    // that does not exist in the authenticated org, or belongs to another org.
    // HTTP 403 | gRPC PERMISSION_DENIED
    CodeAgentNotAuthorized Code = "AGENT_NOT_AUTHORIZED"
 
    // CodeAgentSuspended is returned when the agent exists and belongs to the org
    // but its status is paused, suspended, or archived.
    // HTTP 403 | gRPC PERMISSION_DENIED
    CodeAgentSuspended Code = "AGENT_SUSPENDED"
 
    // CodeRateLimited is returned when the org-level rate limit is exceeded.
    // HTTP 429 | gRPC RESOURCE_EXHAUSTED
    CodeRateLimited Code = "RATE_LIMITED"
)
 
// ═════════════════════════════════════════════════════════════
// Server error codes (5xx)
// ═════════════════════════════════════════════════════════════
 
const (
    // CodeInternalError is returned for unexpected server errors.
    // Callers should not retry without backoff.
    // HTTP 500 | gRPC INTERNAL
    CodeInternalError Code = "INTERNAL_ERROR"
 
    // CodeAuthUnavailable is returned when the auth gRPC service is unreachable.
    // HTTP 503 | gRPC UNAVAILABLE
    CodeAuthUnavailable Code = "AUTH_UNAVAILABLE"
 
    // CodeProviderUnavailable is returned when the LLM provider is unreachable (Phase 2+).
    // HTTP 503 | gRPC UNAVAILABLE
    CodeProviderUnavailable Code = "PROVIDER_UNAVAILABLE"
 
    // CodeProviderTimeout is returned when the LLM provider exceeds its deadline (Phase 2+).
    // HTTP 504 | gRPC DEADLINE_EXCEEDED
    CodeProviderTimeout Code = "PROVIDER_TIMEOUT"
)
 
// HTTPStatus returns the HTTP status code for a given error code.
// Returns 500 for unknown codes.
func HTTPStatus(code Code) int
 
// GRPCCode returns the gRPC status code for a given error code.
func GRPCCode(code Code) codes.Code
 
// ErrorResponse is the stable JSON error envelope returned by all IBEX HTTP endpoints.
// It matches the envelope introduced in M1.2.3.
type ErrorResponse struct {
    Error ErrorDetail `json:"error"`
}
 
type ErrorDetail struct {
    Code      Code   `json:"code"`
    Message   string `json:"message"`
    RequestID string `json:"request_id"`
}
 
// New constructs an ErrorResponse with the given code, message, and request ID.
func New(code Code, message, requestID string) ErrorResponse
 
// WriteHTTP writes an ErrorResponse as JSON to w with the appropriate HTTP status.
func WriteHTTP(w http.ResponseWriter, code Code, message, requestID string)
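
The declarations above leave the bodies open. A minimal sketch of how HTTPStatus, New, and WriteHTTP could fit together is shown below, assuming a single lookup table keyed by Code. Only three of the codes are included, the gRPC mapping is omitted to keep the example standard-library only, and the `httpStatus` map name is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Code is a canonical error code string.
type Code string

const (
	CodeInvalidJSON   Code = "INVALID_JSON"
	CodeRateLimited   Code = "RATE_LIMITED"
	CodeInternalError Code = "INTERNAL_ERROR"
)

// httpStatus is the one place where a code is tied to its HTTP status
// (hypothetical name; only a subset of the registry is shown).
var httpStatus = map[Code]int{
	CodeInvalidJSON:   http.StatusBadRequest,
	CodeRateLimited:   http.StatusTooManyRequests,
	CodeInternalError: http.StatusInternalServerError,
}

// HTTPStatus returns the mapped status, or 500 for unknown codes.
func HTTPStatus(code Code) int {
	if status, ok := httpStatus[code]; ok {
		return status
	}
	return http.StatusInternalServerError
}

// ErrorDetail and ErrorResponse form the JSON error envelope.
type ErrorDetail struct {
	Code      Code   `json:"code"`
	Message   string `json:"message"`
	RequestID string `json:"request_id"`
}

type ErrorResponse struct {
	Error ErrorDetail `json:"error"`
}

// New builds the envelope for the given code, message, and request ID.
func New(code Code, message, requestID string) ErrorResponse {
	return ErrorResponse{Error: ErrorDetail{Code: code, Message: message, RequestID: requestID}}
}

// WriteHTTP sets the content type, writes the mapped status, then the JSON body.
func WriteHTTP(w http.ResponseWriter, code Code, message, requestID string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(HTTPStatus(code))
	// The status line is already sent at this point, so an encode failure
	// (typically a broken connection) cannot be reported to the client.
	_ = json.NewEncoder(w).Encode(New(code, message, requestID))
}

func main() {
	// httptest.NewRecorder stands in for a real response writer.
	rec := httptest.NewRecorder()
	WriteHTTP(rec, CodeRateLimited, "org rate limit exceeded", "req-123")
	fmt.Println(rec.Code)
	fmt.Print(rec.Body.String())
}
```

Because handlers pass a Code constant rather than a string literal, a misspelled identifier fails at compile time instead of reaching clients.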

3. Adopt both packages in existing services

  • Replace all os.Getenv calls in services/auth with a typed config struct loaded through packages/config
  • Replace all os.Getenv calls in services/proxy with a typed config struct loaded through packages/config
  • Replace all raw error code strings in both services with apierror.Code* constants
  • Replace all inline {"error":{"code":...}} JSON construction with calls to apierror.WriteHTTP

Files Affected

Path                                            Action
packages/config/config.go                       Add
packages/config/config_test.go                  Add
packages/apierror/apierror.go                   Add
packages/apierror/apierror_test.go              Add
services/auth/cmd/auth/main.go                  Adopt packages/config
services/proxy/cmd/proxy/main.go                Adopt packages/config
services/proxy/internal/handler/*.go            Adopt packages/apierror
services/auth/internal/grpc/*.go                Adopt packages/apierror for gRPC codes
docs/adr/ADR-0020-shared-package-boundaries.md  Add

Acceptance Criteria

  • packages/config.Load[T]() reports ALL missing required vars in a single error, not just the first
  • Service startup fails in main() with a clear message listing the missing env vars, not at the first request
  • All os.Getenv calls removed from both services
  • All error code strings replaced with apierror.Code* constants
  • apierror.WriteHTTP is the only way to write error responses in handler code
  • packages/config struct field tags are the single source of truth for env var names (docs generated from struct)
  • ADR-0020 written and indexed
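
The criterion that env var documentation is generated from the struct could be met with a small reflection pass over the field tags. The sketch below is one possible shape under that assumption. `Describe`, `Render`, `EnvVarDoc`, and `ExampleConfig` are hypothetical names, and the tab-separated output format is arbitrary.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// EnvVarDoc describes one environment variable derived from a struct field.
type EnvVarDoc struct {
	Name     string
	Type     string
	Required bool
	Default  string
}

// Describe returns one EnvVarDoc per field of T that carries an `env` tag.
func Describe[T any]() []EnvVarDoc {
	var zero T
	t := reflect.TypeOf(zero)
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	var docs []EnvVarDoc
	for i := 0; i < t.NumField(); i++ {
		field := t.Field(i)
		name := field.Tag.Get("env")
		if name == "" {
			continue
		}
		docs = append(docs, EnvVarDoc{
			Name:     name,
			Type:     field.Type.String(),
			Required: field.Tag.Get("required") == "true",
			Default:  field.Tag.Get("envDefault"),
		})
	}
	return docs
}

// Render formats the docs as tab-separated rows with a header line.
func Render(docs []EnvVarDoc) string {
	var b strings.Builder
	b.WriteString("Variable\tType\tRequired\tDefault\n")
	for _, d := range docs {
		fmt.Fprintf(&b, "%s\t%s\t%t\t%s\n", d.Name, d.Type, d.Required, d.Default)
	}
	return b.String()
}

// ExampleConfig uses the tag names from the package comment.
type ExampleConfig struct {
	DatabaseURL string `env:"POSTGRES_DSN" required:"true"`
	MaxConns    int    `env:"IBEX_DB_MAX_CONNS" envDefault:"20"`
}

func main() {
	fmt.Print(Render(Describe[ExampleConfig]()))
}
```

Running such a generator in CI and diffing its output against the committed docs would be one way to keep the two from drifting apart.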
