Sleep-time Compute

Type: Concept
Published: 2026-04-29
Aliases: offline reasoning, between-turn compute, dream-time compute
Brief definition

Compute that an LLM agent spends between user turns — pre-computing likely answers, reorganising memory, and updating indexes — so that test-time queries are faster, cheaper, and more accurate.

What it is

Most LLM use is synchronous: the user asks, the model thinks at test time, the model answers. Sleep-time compute breaks this rhythm. The agent does work between interactions — anticipating likely next questions from the existing context, pre-computing summaries, reorganising its memory store, updating indexes — so that when the next query arrives, much of the synthesis has already been done.

Shira summarises a 2025 paper showing that this offline reasoning between turns gave roughly a 5× test-time compute reduction and up to 18% accuracy gains on the benchmark studied. The framing she offers is intuitive: “let LLMs dream.” The model is doing the same kind of consolidation work humans do during sleep — turning recent experience into reorganised, more accessible memory.

Why it matters

Sleep-time compute changes the economics of agent design. Test-time compute is the expensive, latency-sensitive part of the pipeline; users feel it as response time. Sleep-time compute is neither expensive nor latency-sensitive: it can run cheaply on background hardware, in parallel, with no user waiting. Shifting work from test time to sleep time therefore directly improves the user-facing system.

The pattern shows up implicitly in several places. The LLM Knowledge Base workflow is sleep-time compute writ large: the wiki itself is the pre-computed answer cache, and lint passes find inconsistencies before any query encounters them. Agent memory systems that pre-summarise long conversations, RAG pipelines that pre-compute embeddings and reranker scores, and CI-style checks that run on every commit all share the same logic. Sleep-time compute is the architectural answer to the question of what an idle agent should be doing.
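The pre-computed-index variant can be made concrete with a small sketch. Everything here is illustrative: `embed` is a deterministic stand-in for a real embedding model, and the class and method names are assumptions, not any particular library's API. The idea is only that writes are cheap at test time, embedding happens between turns, and retrieval at test time is a similarity scan over work already done.

```python
import hashlib
import math

def embed(text):
    # Stand-in for a real embedding model: a tiny deterministic vector
    # derived from a hash, so the sketch runs without any model calls.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class MemoryIndex:
    """Sketch: embeddings are pre-computed during idle time, so
    test-time recall never pays for embedding new memory entries."""

    def __init__(self):
        self.pending = []   # entries written during a conversation
        self.index = []     # (embedding, text) pairs, built off-line

    def remember(self, text):
        self.pending.append(text)   # cheap append on the latency path

    def sleep_pass(self):
        """Run between turns: embed everything accumulated so far."""
        while self.pending:
            text = self.pending.pop()
            self.index.append((embed(text), text))

    def recall(self, query, k=1):
        """Test time: similarity scan over pre-computed embeddings."""
        q = embed(query)
        ranked = sorted(self.index, key=lambda e: cosine(q, e[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]


mem = MemoryIndex()
mem.remember("user prefers terse answers")
mem.remember("project deadline is Friday")
mem.sleep_pass()   # idle-time work: no user is waiting on this
print(mem.recall("project deadline is Friday"))
```

The same shape covers the other examples in the paragraph: swap `sleep_pass` for a summarisation job, a reranker-score pre-computation, or a lint pass over a wiki, and the division of labour is identical.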