# Compaction

Automatic conversation context management using provider-native compaction APIs. Keeps long-running conversations coherent without exceeding context limits.
Compaction automatically summarizes older conversation history when approaching a model's context window limit. This allows agents to handle arbitrarily long conversations without losing important context.
## How Compaction Works

```
Turn N:   Conversation grows past token threshold
          → Provider generates a summary of older messages
          → Summary replaces the old messages in context
Turn N+1: Agent sees summary + recent messages (reduced token count)
```
Compaction is provider-native — it uses each LLM provider's built-in mechanisms rather than a custom summarization layer:
| Provider | Mechanism | How It Works |
|---|---|---|
| Anthropic (Claude) | Server-side `compact_20260112` beta | The API automatically summarizes when input tokens exceed a configurable trigger threshold. Returns a compaction block that replaces older messages. |
| OpenAI (GPT) | Summary-based fallback | After a turn exceeds the threshold, a separate `gpt-5.4-nano` call generates a conversation summary. The summary is injected as a system message on subsequent turns. |
## Configuration

Compaction is configured per-agent via the API or dashboard:

| Setting | Default | Description |
|---|---|---|
| `enabled` | `false` | Turn compaction on/off |
| `anthropicTriggerTokens` | `150000` | Token threshold for Anthropic compaction (min: 50,000) |
| `anthropicInstructions` | `null` | Custom summarization prompt (e.g., "preserve all code blocks") |
| `anthropicPauseAfter` | `false` | Pause after compaction for custom content insertion |
| `openaiCompactThreshold` | `100000` | Token threshold for OpenAI summary generation (min: 1,000) |
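Taken together, these settings form a single per-agent config object. A TypeScript sketch of its shape, inferred from the table above (the interface name is illustrative, not an SDK export):

```typescript
// Shape of the per-agent compaction config, inferred from the settings table.
// The interface name is illustrative; field names match the documented settings.
interface CompactionConfig {
  enabled: boolean;                      // default: false
  anthropicTriggerTokens: number;        // default: 150000 (min: 50,000)
  anthropicInstructions: string | null;  // e.g. "preserve all code blocks"
  anthropicPauseAfter: boolean;          // default: false
  openaiCompactThreshold: number;        // default: 100000 (min: 1,000)
}
```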
### API

```
# Get compaction config
GET /api/agents/{agentId}/compaction

# Update compaction config
PUT /api/agents/{agentId}/compaction
{
  "enabled": true,
  "anthropicTriggerTokens": 100000,
  "openaiCompactThreshold": 75000
}
```
### SDK

```typescript
// Read config
const config = await client.getCompactionConfig(agentId);

// Enable compaction
await client.updateCompactionConfig(agentId, {
  enabled: true,
  anthropicTriggerTokens: 100000,
});
```
## State Persistence

Compaction state is stored per-thread in `thread_compaction_state`:
| Field | Description |
|---|---|
| `provider` | Which provider performed the compaction (`anthropic` or `openai`) |
| `compaction_summary` | The generated summary text |
| `compacted_at` | When compaction last occurred |
| `compaction_count` | How many times this thread has been compacted |
| `pre_compaction_tokens` | Input tokens before compaction |
| `post_compaction_tokens` | Input tokens after compaction |
The summary is also written to `threads.summary` for backward compatibility with agents that don't use compaction.
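For orientation, here is a hypothetical TypeScript mirror of one row (field names follow the table; the type itself is not part of any published SDK):

```typescript
// Hypothetical mirror of a thread_compaction_state row (one per thread).
interface ThreadCompactionState {
  provider: "anthropic" | "openai"; // which provider performed the compaction
  compactionSummary: string;        // the generated summary text
  compactedAt: Date;                // when compaction last occurred
  compactionCount: number;          // times this thread has been compacted
  preCompactionTokens: number;      // input tokens before compaction
  postCompactionTokens: number;     // input tokens after compaction
}

// Example use: how much did the last compaction shrink the context?
function compactionRatio(s: ThreadCompactionState): number {
  return s.postCompactionTokens / s.preCompactionTokens;
}
```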
## Compaction vs Memory

| | Compaction | Memory |
|---|---|---|
| Purpose | Reduce context window usage | Store discrete retrievable facts |
| Scope | Per-thread conversation history | Per-agent, per-thread, or per-resource |
| Trigger | Token threshold exceeded | Auto-extraction or explicit tool call |
| Retrieval | Automatically prepended to messages | Semantic search injection |
| Persistence | Replaces old messages with summary | Stored indefinitely in `memories` table |
These systems are complementary: compaction keeps the context window manageable, while memory provides long-term recall of specific facts across threads.
## Cost Tracking

Compaction generates additional tokens that are tracked separately in usage analytics:

- `compaction_input_tokens` — tokens sent to the compaction/summary model
- `compaction_output_tokens` — tokens generated by the compaction/summary model

These tokens are included in the `estimated_cost_usd` value returned in the `done` SSE event. The billing model differs by provider:
| Provider | Compaction Model | Rate |
|---|---|---|
| Anthropic | Same as the main model | Main model's per-token rate |
| OpenAI | `gpt-5.4-nano` | $0.20 input / $1.25 output per 1M tokens |
In the `cost_events` ledger, main-LLM and compaction costs are recorded as separate rows so that per-token-rate analysis stays accurate. The `usage_events` table stores the combined total for backward compatibility.
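As a sketch of the arithmetic for the OpenAI side (rates from the table above; the function is illustrative, not an SDK API):

```typescript
// Estimate the OpenAI compaction surcharge from the tracked token counts,
// using the documented gpt-5.4-nano rates: $0.20 per 1M input tokens and
// $1.25 per 1M output tokens.
function openaiCompactionCostUsd(
  compactionInputTokens: number,
  compactionOutputTokens: number,
): number {
  return (
    compactionInputTokens * (0.2 / 1_000_000) +
    compactionOutputTokens * (1.25 / 1_000_000)
  );
}

// Example: summarizing a 40,000-token conversation into a 1,500-token
// summary adds about $0.0099 of compaction cost.
console.log(openaiCompactionCostUsd(40_000, 1_500)); // 0.009875
```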
## How History Limits Work
All providers now share a single history cap: the 500 most-recent messages are fetched (newest first, then reversed to chronological order). Older turns beyond this cap are handled by compaction when enabled.
| Compaction State | Behavior |
|---|---|
| Disabled | 500 most-recent messages sent to the model |
| Enabled | 500 most-recent messages sent; compaction summarizes older context when the token threshold is exceeded |
This replaced the previous per-provider limits (50 for OpenAI, 200 for Anthropic), which fetched the oldest N messages and effectively dropped the most recent turns once a thread exceeded the cap.
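A sketch of the fetch pattern (the database interface and table/column names are assumptions for illustration; the point is ordering newest-first, capping at 500, then reversing):

```typescript
// Fetch the 500 most-recent messages, newest first, then restore
// chronological order before sending them to the model. The Database and
// Message types and the schema names are illustrative.
interface Message { role: string; content: string; created_at: string }
interface Database { query<T>(sql: string, params: unknown[]): Promise<T[]> }

const HISTORY_CAP = 500;

async function loadHistory(db: Database, threadId: string): Promise<Message[]> {
  const rows = await db.query<Message>(
    `SELECT role, content, created_at
       FROM messages
      WHERE thread_id = $1
      ORDER BY created_at DESC
      LIMIT $2`,
    [threadId, HISTORY_CAP],
  );
  return rows.reverse(); // oldest-to-newest, as the model expects
}
```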
## Example Flow (Anthropic)

- User enables compaction with `anthropicTriggerTokens: 100000`
- After 40 turns, input tokens reach 105,000
- Anthropic API detects the threshold is exceeded
- API generates a summary of older messages (~3,500 tokens)
- Summary is returned as a compaction block
- Flapjack persists the summary in `thread_compaction_state` and `threads.summary` (see the sketch after this list)
- On turn 41, the compaction summary replaces the older messages
- Input tokens drop to ~25,000 (summary + recent messages)
- Conversation continues with full context awareness
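A sketch of the persistence step above. Everything here is hypothetical: the `CompactionBlock` shape is not Anthropic's published type, and `StateStore` stands in for Flapjack's internal storage layer:

```typescript
// Hypothetical persistence step once the API returns a compaction block.
interface CompactionBlock {
  summary: string;
  preCompactionTokens: number;
  postCompactionTokens: number;
}

interface StateStore {
  upsertCompactionState(threadId: string, state: Record<string, unknown>): Promise<void>;
  updateThreadSummary(threadId: string, summary: string): Promise<void>;
}

async function persistCompaction(
  store: StateStore,
  threadId: string,
  block: CompactionBlock,
): Promise<void> {
  // Record the compaction in thread_compaction_state...
  await store.upsertCompactionState(threadId, {
    provider: "anthropic",
    compactionSummary: block.summary,
    compactedAt: new Date(),
    preCompactionTokens: block.preCompactionTokens,
    postCompactionTokens: block.postCompactionTokens,
  });
  // ...and mirror it to threads.summary for backward compatibility.
  await store.updateThreadSummary(threadId, block.summary);
}
```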
## Example Flow (OpenAI)

- User enables compaction with `openaiCompactThreshold: 80000`
- After 35 turns, input tokens reach 85,000
- After the response is streamed, Flapjack calls `gpt-5.4-nano` with the conversation (see the sketch after this list)
- A ~1,500-token summary is generated (with a 15-second timeout)
- Summary is persisted in `thread_compaction_state` and `threads.summary`
- On turn 36, the summary is injected as a system message
- Combined with the existing `threads.summary` mechanism, context is preserved
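A sketch of the fallback summarization call from the flow above, using the OpenAI Node SDK. The model name and the 15-second timeout come from this page; the prompt wording and transcript plumbing are illustrative, not Flapjack's exact implementation:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Summarize the conversation with gpt-5.4-nano after the main response has
// finished streaming. The per-request timeout mirrors the documented
// 15-second cap; the system prompt is illustrative.
async function summarizeConversation(transcript: string): Promise<string> {
  const response = await openai.chat.completions.create(
    {
      model: "gpt-5.4-nano",
      messages: [
        {
          role: "system",
          content: "Summarize this conversation, preserving key facts and decisions.",
        },
        { role: "user", content: transcript },
      ],
    },
    { timeout: 15_000 }, // give up if summarization takes longer than 15s
  );
  return response.choices[0].message.content ?? "";
}
```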