Milestone 3.5.2 — Token Budget Calculator (Model-Aware, tiktoken)
Status: Planned
Goal: 3.5 — Context assembly engine
Phase: 3 — Memory Engine and Operator Platform
Estimated effort: 2 days
ADR required: ADR-0039 — Token budget allocation strategy
Why This Milestone Exists
LLMs have context window limits. Injecting memories without tracking token consumption can push the context over the limit, causing provider errors or silent truncation. The budget calculator determines how many tokens are available for memory injection given the model, the existing messages, and the required reserves.
Accurate token counting requires the provider's tokeniser — for OpenAI models this is tiktoken. Approximations (character count / 4) are off by up to 30% for code, technical content, and non-English text. At 128K token budgets, a 30% error is 38,400 tokens of wasted space.
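The arithmetic behind that figure can be sketched in a few lines. This is only an illustration: `chars_over_four_estimate` and `misallocated_tokens` are hypothetical helper names, not part of the planned module, and the 30% figure is the upper bound quoted above rather than a measured value.

```python
def chars_over_four_estimate(text: str) -> int:
    """Rough token estimate: one token per four characters."""
    return len(text) // 4


def misallocated_tokens(budget_tokens: int, relative_error: float) -> int:
    """Tokens put at risk when counts are off by `relative_error`."""
    return round(budget_tokens * relative_error)


# A 27-character code snippet is estimated at 6 tokens by the heuristic;
# the real count depends on the tokeniser and is not computed here.
snippet = "def add(a, b): return a + b"
print(chars_over_four_estimate(snippet))

# 30% of a 128,000-token window.
print(misallocated_tokens(128_000, 0.30))
```

The error cuts both ways: an overestimate leaves usable context empty, while an underestimate risks exceeding the window.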
ADR-0039 — Token budget allocation strategy
Document:
- Why tiktoken (not character approximation): Accuracy. Character/4 approximation fails on code (1.5 chars/token), CJK text (0.5 chars/token), and special tokens.
- Token reserve strategy: Reserve 15% of the context window for the LLM response (min 500, max 4,096 tokens). Reserve a 5% safety buffer. Reserve directive tokens (always in full). What remains after subtracting the existing messages is the memory budget.
- Model context windows used in Phase 3:
| Model | Context window | Memory budget (approximate) |
|---|---|---|
| gpt-4o | 128,000 | ~80,000 after safety + directive |
| gpt-4o-mini | 128,000 | ~80,000 |
| gpt-4-turbo | 128,000 | ~80,000 |
| gpt-3.5-turbo | 16,385 | ~8,000 |
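As a worked example of the reserve strategy, the sketch below computes the fixed reserves for two of the context windows in the table. The constants mirror those in the `budget.py` deliverable; `fixed_reserves` is a hypothetical helper introduced only for illustration. It yields the ceiling before any messages or directive are counted, so the approximate figures in the table, which presumably also allow for typical message and directive usage, are lower.

```python
from __future__ import annotations

RESPONSE_RESERVE_RATIO = 0.15  # 15% reserved for the LLM response
RESPONSE_RESERVE_MIN = 500
RESPONSE_RESERVE_MAX = 4_096
SAFETY_BUFFER_RATIO = 0.05     # 5% safety margin


def fixed_reserves(context_window: int) -> tuple[int, int, int]:
    """Return (response_reserve, safety_buffer, ceiling) for a window size.

    `ceiling` is what is left before messages and directive are subtracted.
    """
    response_reserve = max(
        RESPONSE_RESERVE_MIN,
        min(RESPONSE_RESERVE_MAX, int(context_window * RESPONSE_RESERVE_RATIO)),
    )
    safety_buffer = int(context_window * SAFETY_BUFFER_RATIO)
    ceiling = context_window - response_reserve - safety_buffer
    return response_reserve, safety_buffer, ceiling


for model, window in [("gpt-4o", 128_000), ("gpt-3.5-turbo", 16_385)]:
    response, safety, ceiling = fixed_reserves(window)
    print(f"{model}: response={response} safety={safety} ceiling={ceiling}")
# gpt-4o: response=4096 safety=6400 ceiling=117504
# gpt-3.5-turbo: response=2457 safety=819 ceiling=13109
```

For 128K models the 15% response reserve (19,200 tokens) is capped at 4,096, so the safety buffer is the larger of the two fixed reserves.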
Deliverables
src/context/services/budget.py
from __future__ import annotations

from dataclasses import dataclass

import tiktoken

# Context windows by model (as of Phase 3).
# Update when new models are added.
MODEL_CONTEXT_WINDOWS: dict[str, int] = {
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-4-turbo": 128_000,
    "gpt-4": 8_192,
    "gpt-3.5-turbo": 16_385,
    "gpt-3.5-turbo-16k": 16_385,
}

DEFAULT_CONTEXT_WINDOW = 8_192   # conservative fallback for unknown models
RESPONSE_RESERVE_RATIO = 0.15    # 15% reserved for LLM response
RESPONSE_RESERVE_MIN = 500       # minimum response reserve, tokens
RESPONSE_RESERVE_MAX = 4_096     # maximum response reserve, tokens
SAFETY_BUFFER_RATIO = 0.05       # 5% safety margin


@dataclass(frozen=True)
class TokenBudget:
    model: str
    context_window: int    # total context window for this model
    messages_tokens: int   # tokens consumed by original messages
    directive_tokens: int  # tokens consumed by directive (reserved in full)
    response_reserve: int  # tokens reserved for LLM response
    safety_buffer: int     # tokens reserved as safety margin
    memory_tokens: int     # tokens available for memory injection
    is_constrained: bool   # True if memory_tokens < 1000 (very little room)


class TokenBudgetCalculator:
    """
    Calculates the token budget for memory injection.

    Uses tiktoken for OpenAI models; falls back to the cl100k_base
    encoding for models tiktoken does not recognise.
    """

    def __init__(self) -> None:
        # Encoders are cached per model name; constructing them is expensive.
        self._encoders: dict[str, tiktoken.Encoding] = {}

    def _get_encoder(self, model: str) -> tiktoken.Encoding:
        if model not in self._encoders:
            try:
                self._encoders[model] = tiktoken.encoding_for_model(model)
            except KeyError:
                # Unknown model: use cl100k_base (used by GPT-4 and GPT-3.5-turbo)
                self._encoders[model] = tiktoken.get_encoding("cl100k_base")
        return self._encoders[model]

    def count_tokens(self, text: str, model: str) -> int:
        enc = self._get_encoder(model)
        return len(enc.encode(text))

    def count_messages_tokens(self, messages: list[dict], model: str) -> int:
        """
        Count tokens for a messages array including role overhead.

        Assumes 4 overhead tokens per message for role and separator
        tokens, plus 3 priming tokens for the reply.
        """
        enc = self._get_encoder(model)
        total = 3  # priming tokens
        for msg in messages:
            total += 4  # message overhead
            # `content` may be missing or None; count it as empty.
            total += len(enc.encode(msg.get("content") or ""))
            total += len(enc.encode(msg.get("role") or ""))
        return total

    def calculate(
        self,
        model: str,
        messages: list,  # list of proto Message objects
        directive: str = "",
    ) -> TokenBudget:
        context_window = MODEL_CONTEXT_WINDOWS.get(model, DEFAULT_CONTEXT_WINDOW)

        msg_dicts = [{"role": m.role, "content": m.content} for m in messages]
        messages_tokens = self.count_messages_tokens(msg_dicts, model)
        directive_tokens = self.count_tokens(directive, model) if directive else 0

        # 15% of the window, clamped to [RESPONSE_RESERVE_MIN, RESPONSE_RESERVE_MAX].
        response_reserve = max(
            RESPONSE_RESERVE_MIN,
            min(RESPONSE_RESERVE_MAX, int(context_window * RESPONSE_RESERVE_RATIO)),
        )
        safety_buffer = int(context_window * SAFETY_BUFFER_RATIO)

        consumed = messages_tokens + directive_tokens + response_reserve + safety_buffer
        # Never report a negative budget when the context is already full.
        memory_tokens = max(0, context_window - consumed)

        return TokenBudget(
            model=model,
            context_window=context_window,
            messages_tokens=messages_tokens,
            directive_tokens=directive_tokens,
            response_reserve=response_reserve,
            safety_buffer=safety_buffer,
            memory_tokens=memory_tokens,
            is_constrained=memory_tokens < 1_000,
        )

Acceptance Criteria
- `count_tokens("Hello world", "gpt-4o")` returns 2 (verified against OpenAI's tokeniser)
- `calculate` returns `memory_tokens=0` when messages already fill the context window
- Unknown model falls back to `cl100k_base` encoder without raising
- `is_constrained=True` when memory budget < 1000 tokens
- ADR-0039 written with token reserve rationale and model table
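The two budget-related criteria can be checked without a tokeniser by isolating the final subtraction step. The sketch below assumes token counts are supplied as plain integers; `memory_budget` is a hypothetical stand-in for the last lines of `calculate`, not part of the planned API.

```python
from __future__ import annotations

CONSTRAINED_THRESHOLD = 1_000  # mirrors the `memory_tokens < 1_000` check


def memory_budget(
    context_window: int,
    messages_tokens: int,
    directive_tokens: int,
    response_reserve: int,
    safety_buffer: int,
) -> tuple[int, bool]:
    """Return (memory_tokens, is_constrained) from pre-counted inputs."""
    consumed = messages_tokens + directive_tokens + response_reserve + safety_buffer
    # Clamp at zero so a full context never yields a negative budget.
    memory_tokens = max(0, context_window - consumed)
    return memory_tokens, memory_tokens < CONSTRAINED_THRESHOLD


# Messages alone fill an 8,192-token window: the budget is clamped to 0.
print(memory_budget(8_192, 8_192, 0, 1_228, 409))      # (0, True)

# Some room remains, but fewer than 1,000 tokens: constrained.
print(memory_budget(8_192, 5_600, 0, 1_228, 409))      # (955, True)

# A 128K window with a short conversation: plenty of room.
print(memory_budget(128_000, 2_000, 500, 4_096, 6_400))  # (115004, False)
```

A zero budget is itself a constrained state, so callers only need to inspect `is_constrained` to decide whether to skip or trim memory injection.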