# Compaction

Automatic conversation context management using provider-native compaction APIs. Keeps long-running conversations coherent without exceeding context limits.
Compaction automatically summarizes older conversation history when approaching a model's context window limit. This allows agents to handle arbitrarily long conversations without losing important context.
## How Compaction Works

```
Turn N:   Conversation grows past token threshold
          → Provider generates a summary of older messages
          → Summary replaces the old messages in context
Turn N+1: Agent sees summary + recent messages (reduced token count)
```
Compaction is provider-native — it uses each LLM provider's built-in mechanisms rather than a custom summarization layer:
| Provider | Mechanism | How It Works |
|---|---|---|
| Anthropic (Claude) | Server-side `compact_20260112` beta | The API automatically summarizes when input tokens exceed a configurable trigger threshold. Returns a compaction block that replaces older messages. |
| OpenAI (GPT) | Summary-based fallback | After a turn exceeds the threshold, a separate `gpt-5.4-nano` call generates a conversation summary. The summary is injected as a system message on subsequent turns. |
## Configuration

Compaction is configured per-agent via the API or dashboard:

| Setting | Default | Description |
|---|---|---|
| `enabled` | `false` | Turn compaction on/off |
| `anthropicTriggerTokens` | `150000` | Token threshold for Anthropic compaction (min: 50,000) |
| `anthropicInstructions` | `null` | Custom summarization prompt (e.g., "preserve all code blocks") |
| `anthropicPauseAfter` | `false` | Pause after compaction for custom content insertion |
| `openaiCompactThreshold` | `100000` | Token threshold for OpenAI summary generation (min: 1,000) |
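Taken together, these settings form a single per-agent config object. A TypeScript sketch of its shape, inferred from the table above (the interface name is illustrative, not an SDK export):

```typescript
// Shape of the per-agent compaction config, inferred from the settings table.
// The interface name is illustrative; field names match the documented settings.
interface CompactionConfig {
  enabled: boolean;                      // default: false
  anthropicTriggerTokens: number;        // default: 150000 (min: 50,000)
  anthropicInstructions: string | null;  // e.g. "preserve all code blocks"
  anthropicPauseAfter: boolean;          // default: false
  openaiCompactThreshold: number;        // default: 100000 (min: 1,000)
}
```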
### API

```
# Get compaction config
GET /api/agents/{agentId}/compaction

# Update compaction config
PUT /api/agents/{agentId}/compaction
{
  "enabled": true,
  "anthropicTriggerTokens": 100000,
  "openaiCompactThreshold": 75000
}
```
### SDK

```typescript
// Read config
const config = await client.getCompactionConfig(agentId);

// Enable compaction
await client.updateCompactionConfig(agentId, {
  enabled: true,
  anthropicTriggerTokens: 100000,
});
```
## State Persistence

Compaction state is stored per-thread in `thread_compaction_state`:
| Field | Description |
|---|---|
| `provider` | Which provider performed the compaction (`anthropic` or `openai`) |
| `compaction_summary` | The generated summary text |
| `compacted_at` | When compaction last occurred |
| `compaction_count` | How many times this thread has been compacted |
| `pre_compaction_tokens` | Input tokens before compaction |
| `post_compaction_tokens` | Input tokens after compaction |
The summary is also written to `threads.summary` for backward compatibility with agents that don't use compaction.
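For orientation, here is a hypothetical TypeScript mirror of one row (field names follow the table; the type itself is not part of any published SDK):

```typescript
// Hypothetical mirror of a thread_compaction_state row (one per thread).
interface ThreadCompactionState {
  provider: "anthropic" | "openai"; // which provider performed the compaction
  compactionSummary: string;        // the generated summary text
  compactedAt: Date;                // when compaction last occurred
  compactionCount: number;          // times this thread has been compacted
  preCompactionTokens: number;      // input tokens before compaction
  postCompactionTokens: number;     // input tokens after compaction
}

// Example use: how much did the last compaction shrink the context?
function compactionRatio(s: ThreadCompactionState): number {
  return s.postCompactionTokens / s.preCompactionTokens;
}
```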
## Compaction vs Memory

| | Compaction | Memory |
|---|---|---|
| Purpose | Reduce context window usage | Store discrete retrievable facts |
| Scope | Per-thread conversation history | Per-agent, per-thread, or per-resource |
| Trigger | Token threshold exceeded | Auto-extraction or explicit tool call |
| Retrieval | Automatically prepended to messages | Semantic search injection |
| Persistence | Replaces old messages with summary | Stored indefinitely in `memories` table |
These systems are complementary: compaction keeps the context window manageable, while memory provides long-term recall of specific facts across threads.
## Cost Tracking

Compaction generates additional tokens that are tracked separately in usage analytics:

- `compaction_input_tokens` — tokens sent to the compaction/summary model
- `compaction_output_tokens` — tokens generated by the compaction/summary model

These tokens are included in the `estimated_cost_usd` value returned in the `done` SSE event. The billing model differs by provider:
| Provider | Compaction Model | Rate |
|---|---|---|
| Anthropic | Same as the main model | Main model's per-token rate |
| OpenAI | `gpt-5.4-nano` | $0.20 input / $1.25 output per 1M tokens |
In the `cost_events` ledger, main-LLM and compaction costs are recorded as separate rows so that per-token-rate analysis stays accurate. The `usage_events` table stores the combined total for backward compatibility.
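As a sketch of the arithmetic for the OpenAI side (rates from the table above; the function is illustrative, not an SDK API):

```typescript
// Estimate the OpenAI compaction surcharge from the tracked token counts,
// using the documented gpt-5.4-nano rates: $0.20 per 1M input tokens and
// $1.25 per 1M output tokens.
function openaiCompactionCostUsd(
  compactionInputTokens: number,
  compactionOutputTokens: number,
): number {
  return (
    compactionInputTokens * (0.2 / 1_000_000) +
    compactionOutputTokens * (1.25 / 1_000_000)
  );
}

// Example: summarizing a 40,000-token conversation into a 1,500-token
// summary adds about $0.0099 of compaction cost.
console.log(openaiCompactionCostUsd(40_000, 1_500)); // 0.009875
```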
## How History Limits Work
All providers now share a single history cap: the 500 most-recent messages are fetched (newest first, then reversed to chronological order). Older turns beyond this cap are handled by compaction when enabled.
| Compaction State | Behavior |
|---|---|
| Disabled | 500 most-recent messages sent to the model |
| Enabled | 500 most-recent messages sent; compaction summarizes older context when the token threshold is exceeded |
This replaced the previous per-provider limits (50 for OpenAI, 200 for Anthropic), which fetched the oldest N messages and effectively dropped the most recent turns once a thread exceeded the cap.
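A sketch of the fetch pattern (the database interface and table/column names are assumptions for illustration; the point is ordering newest-first, capping at 500, then reversing):

```typescript
// Fetch the 500 most-recent messages, newest first, then restore
// chronological order before sending them to the model. The Database and
// Message types and the schema names are illustrative.
interface Message { role: string; content: string; created_at: string }
interface Database { query<T>(sql: string, params: unknown[]): Promise<T[]> }

const HISTORY_CAP = 500;

async function loadHistory(db: Database, threadId: string): Promise<Message[]> {
  const rows = await db.query<Message>(
    `SELECT role, content, created_at
       FROM messages
      WHERE thread_id = $1
      ORDER BY created_at DESC
      LIMIT $2`,
    [threadId, HISTORY_CAP],
  );
  return rows.reverse(); // oldest-to-newest, as the model expects
}
```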
## Example Flow (Anthropic)

- User enables compaction with `anthropicTriggerTokens: 100000`
- After 40 turns, input tokens reach 105,000
- Anthropic API detects the threshold is exceeded
- API generates a summary of older messages (~3,500 tokens)
- Summary is returned as a compaction block
- Flapjack persists the summary in `thread_compaction_state` and `threads.summary` (see the sketch after this list)
- On turn 41, the compaction summary replaces the older messages
- Input tokens drop to ~25,000 (summary + recent messages)
- Conversation continues with full context awareness
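A sketch of the persistence step above. Everything here is hypothetical: the `CompactionBlock` shape is not Anthropic's published type, and `StateStore` stands in for Flapjack's internal storage layer:

```typescript
// Hypothetical persistence step once the API returns a compaction block.
interface CompactionBlock {
  summary: string;
  preCompactionTokens: number;
  postCompactionTokens: number;
}

interface StateStore {
  upsertCompactionState(threadId: string, state: Record<string, unknown>): Promise<void>;
  updateThreadSummary(threadId: string, summary: string): Promise<void>;
}

async function persistCompaction(
  store: StateStore,
  threadId: string,
  block: CompactionBlock,
): Promise<void> {
  // Record the compaction in thread_compaction_state...
  await store.upsertCompactionState(threadId, {
    provider: "anthropic",
    compactionSummary: block.summary,
    compactedAt: new Date(),
    preCompactionTokens: block.preCompactionTokens,
    postCompactionTokens: block.postCompactionTokens,
  });
  // ...and mirror it to threads.summary for backward compatibility.
  await store.updateThreadSummary(threadId, block.summary);
}
```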
## Example Flow (OpenAI)

- User enables compaction with `openaiCompactThreshold: 80000`
- After 35 turns, input tokens reach 85,000
- After the response is streamed, Flapjack calls `gpt-5.4-nano` with the conversation (see the sketch after this list)
- A ~1,500-token summary is generated (with a 15-second timeout)
- Summary is persisted in `thread_compaction_state` and `threads.summary`
- On turn 36, the summary is injected as a system message
- Combined with the existing `threads.summary` mechanism, context is preserved
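A sketch of the fallback summarization call from the flow above, using the OpenAI Node SDK. The model name and the 15-second timeout come from this page; the prompt wording and transcript plumbing are illustrative, not Flapjack's exact implementation:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Summarize the conversation with gpt-5.4-nano after the main response has
// finished streaming. The per-request timeout mirrors the documented
// 15-second cap; the system prompt is illustrative.
async function summarizeConversation(transcript: string): Promise<string> {
  const response = await openai.chat.completions.create(
    {
      model: "gpt-5.4-nano",
      messages: [
        {
          role: "system",
          content: "Summarize this conversation, preserving key facts and decisions.",
        },
        { role: "user", content: transcript },
      ],
    },
    { timeout: 15_000 }, // give up if summarization takes longer than 15s
  );
  return response.choices[0].message.content ?? "";
}
```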