From Inference Loops
to Long-Running Agents

Fundamentals, Workflow, and What Actually Fits

← tap to go back

tap for next →

A note before we start

These aren’t all my own ideas. This is an aggregate of the best thinking in this space — the engineers I learn from — folded together with my own experiments on what actually works and what doesn’t.

“text in”LLM“text out”

An LLM is a function. Text in, text out. Stateless. No memory between calls.

An agent is a while-true loop
that appends to an array.

agent.py

# What an agent actually is
while True:
    user_input = get_input()
    response = llm.complete(user_input)
    if response.wants_tool:
        result = execute_tool(response.tool_call)
        response = llm.complete(result)
    print(response)

Agent · ↻ while true

user input

LLM

output

References

Mihai Eric · The Emperor Has No Clothes: Claude Code in 200 Lines

Geoffrey Huntley · fundamental skills and knowledge you must have in 2026 for SWE

And that array
is the context window.

Every API call sends the entire array.
Each turn appends.
The model is stateless.

CONTEXT WINDOW · 200K · not to scale

system prompt6.3k

tool definitions9.5k

CLAUDE.md2.5k

user message0.4k

assistant1.2k

tool result3.1k

unused context

The loop, in action.

agent.py

# this iteration takes the tool branchwhile True:    user_input = get_input()    response = llm.complete(user_input)    if response.wants_tool:        result = execute_tool(response.tool_call)        response = llm.complete(result)    print(response)

Agent · log

// agent ready · waiting for input

→user_input received“find all TODO comments in src/”

→LLM responds: wants_tool=truetool_call: bash(“grep -rn TODO src/”)

→executing tool · bash3 matches in auth.ts, api.ts, db.ts

→LLM responds with answer“Found 3 TODOs across the codebase”

→print(response) · loop iterates ↻

Context window · 200K

system prompt6.3k

tool definitions9.5k

CLAUDE.md2.5k

skills0.9k

user0.3k

assistant · tool_call0.5k

tool_result0.6k

assistant0.4k

~179k remaining · empty

The agent harness wraps the loop.

Everything that isn’t the LLM — what tools exist, what context loads, when to stop.

Agent harness · Claude Code

Agent · ↻ while true

user input

LLM

output

system prompt

context mgmt

skills & tools

MCPs

sub-agents

plan mode

session persistence

permissions & hooks

the whole stack

Loop.Array.Harness.

An agent is a while-true loop appending to an array. The harness controls what’s in it.

If the context window is just an array,
what goes in the array is everything.

Instruction-following accuracy decays as the number of instructions grows; frontier thinking models hold to roughly 150–200 instructions before rules start getting ignored.

The instruction ceiling is real.

Frontier thinking models reliably follow ~150–200 instructions. Beyond that, even rules at the top get ignored.

smaller models · exponential decayfrontier thinking · linear decay

Reference

Dex Horthy · humanlayer.dev · arxiv:2507.11538

Smart zone, dumb zone.

The window isn’t uniform. The first ~40%is where the model thinks clearly. Past that, attention frays — tool choice gets sloppy, instructions get dropped, the goal drifts.

“The more context you use, the worse results you’ll get.”

Context window · 200K

system prompt6.3k

tool definitions9.5k

CLAUDE.md2.5k

skills0.9k

~180k remaining · fresh session

Reference

Dex Horthy · escaping the Dumb Zone (#262)

The allocation problem.

Static fills eat your usable space before the conversation starts.

Static fills → smart zone shrinks.

before the conversation even starts.

Context window · 200K

system prompt6.3k

tool definitions9.5k

CLAUDE.md2.5k

skills0.9k

~180k remaining · ~60k smart available

The context rot problem.

Nothing fails. Every call succeeds. It just fills up.

Same window, same model → it rots.

no errors. just volume.

Context window · 200K

system prompt6.3k

tool definitions9.5k

CLAUDE.md2.5k

skills0.9k

user · “implement feature X”0.3k

assistant · read files0.5k

tool_result · 4 files0.6k

~178k remaining · smart zone

Good context stays in the smart zone.

Fresh session per task start clean, don't reuse a tired window
Only what this task needs drop the MCPs and notes that aren't useful here
Offload to disk save big stuff as files, keep short summaries in the window
Send sub-agents for side quests let them explore, return one paragraph
Leave room below the line finalizing work (tests, commits, lint) still has space
Split big work across sessions when it won't fit one window, plan it, write the spec to disk, let multiple agents pick it up

Context window · 200K

system prompt · lean1.2k

tool definitions · 4 tools2k

spec.md · one task3k

skills0.9k

user · one clear goal0.3k

assistant · tool_call0.5k

tool_result1.5k

assistant · tool_call0.5k

tool_result2k

assistant · “done”0.4k

~188k remaining · all above the line

Every session starts from zero. Context doesn’t engineer itself.

Allocation, rot, compaction, recovery. Someone has to handle them.

Your harness wraps theirs.

Anthropic ships the agent harness. You ship the layer around it.

User harness · what you build

Agent harness · Claude Code

Agent · ↻ while true

user input

LLM

output

system prompt

context mgmt

skills & tools

MCPs

sub-agents

plan mode

session persistence

permissions & hooks

custom CLAUDE.md

agent_docs/

custom skills

issue tracker

The Ralph loop

3rd-party tools

The layer you own.

Files, skills, loops, and rules the agent reads every session. That’s what makes long-running runs possible.

Start with a dumb harness.

Dumb = minimal. A few files the agent reads every run — no cleverness.

~/project/

├── CLAUDE.md

├── .claude/

│ └── skills/

│ ├── grill-me.md

│ ├── to-prd.md

│ ├── session-planner.md

│ └── improve-codebase-architecture.md

├── .beads/

├── agent_docs/

│ ├── adding-a-feature.md

│ ├── anti-patterns.md

│ ├── architecture.md

│ ├── components.md

│ ├── server.md

│ ├── tech-debt.md

│ ├── workflow.md

│ └── specs/

├── scripts/

│ └── ralph.sh

└── src/

├── app/

│ └── CLAUDE.md

├── modules/

│ └── CLAUDE.md

├── components/

│ └── CLAUDE.md

└── server/

└── CLAUDE.md

01

CLAUDE.md

project map · multi-level

02

agent_docs/

architecture · conventions · specs

03

custom skills

slash commands · tiny markdown

04

issue tracker

Beads · dependency graph · survives sessions

05

The Ralph loop

the loop · fresh window each pass

CLAUDE.md.

~/project/

├── CLAUDE.md

├── .claude/

│ └── skills/

├── .beads/

├── agent_docs/

│ └── ...

├── scripts/

│ └── ralph.sh

└── src/

├── app/

│ └── CLAUDE.md

├── modules/

│ └── CLAUDE.md

├── components/

│ └── CLAUDE.md

└── server/

└── CLAUDE.md

A map, not a brain dump. Points at docs, doesn't contain them.
Lists standards, never-rules, skill names. One section each.
Multi-level. Sub-CLAUDE.mds in each module — app, modules, components, server.
Progressive disclosure. Root loads at session start. Sub-files load only when the agent enters that directory.
Same context budget, more steering. Right rules show up at the right moment.

References

Dex Horthy · Writing a Good CLAUDE.md

multica-ai · andrej-karpathy-skills · CLAUDE.md

agent_docs/

~/project/

├── CLAUDE.md

├── .claude/

│ └── skills/

├── .beads/

├── agent_docs/

│ ├── adding-a-feature.md

│ ├── anti-patterns.md

│ ├── architecture.md

│ ├── components.md

│ ├── server.md

│ ├── tech-debt.md

│ ├── workflow.md

│ └── specs/

├── scripts/

│ └── ralph.sh

└── src/

└── ...

One concern per file. architecture, conventions, anti-patterns, workflow.
Linked from CLAUDE.md, not loaded by it. “Before adding a feature, read adding-a-feature.md.”
On-demand context. Agent reads docs only when relevant — nothing wasted up front.
Specs split big work into phases. Dated, on disk, diffable. One phase per session.
Surviveable. Specs outlive context windows. Reset and continue.

some skills

~/project/

├── CLAUDE.md

├── .claude/

│ └── skills/

│ ├── grill-me.md

│ ├── to-prd.md

│ ├── session-planner.md

│ └── improve-codebase-architecture.md

├── .beads/

├── agent_docs/

│ └── ...

├── scripts/

│ └── ralph.sh

└── src/

└── ...

/grill-meinterview before plan

/to-prdturn idea into a PRD

/session-plannerbreak PRD into ralph-ready sessions

/improve-codebase-architectureaudit + propose refactors

Each skill wraps a recurring workflow into one verb. Loaded only when invoked. Stolen from Matt Pocock.

Reference

Matt Pocock · mattpocock/skills

issue tracker (Beads)

~/project/

├── CLAUDE.md

├── .claude/

│ └── skills/

├── .beads/

├── agent_docs/

│ └── ...

├── scripts/

│ └── ralph.sh

└── src/

└── ...

Tasks survive sessions. Not in the context window — on disk, in a graph.
Dependency graph. Beads knows what's blocked, what's ready, what's done.
bd ready — the next thing to work on. One command. Top-priority issue with no open blockers.
Linked to specs. One spec breaks into many issues. Same vocabulary across plan and execution.
Feeds the loop. Ralph asks Beads what's next, runs it, loops.

Reference

Steve Yegge · Introducing Beads — A Coding-Agent Memory System

The Ralph loop

~/project/

├── CLAUDE.md

├── .claude/

│ └── skills/

├── .beads/

├── agent_docs/

│ └── ...

├── scripts/

│ └── ralph.sh

└── src/

└── ...

while :; do
  cat PROMPT.md | claude  # Claude Code CLI
done

The window is the budget. A plan becomes epics, epics become issues. Each session takes whatever fits inside the smart zone, could be one issue, could be five.
PROMPT.md is the instruction sheet. Tells the agent which spec to read, where to find the next task, and the rules of the run. Re-read every iteration. This is where the intelligence lives.
Reset every loop. Fresh window each pass. No compaction. State that matters lives on disk: Beads, specs, CLAUDE.md.

Reference

Geoffrey Huntley · The Ralph Wiggum Loop

Where my hours actually go.

Plan

~2:00 hr

Execute

59 min

Review

~30 min

Most of my time is here, not in the run. Brainstorm, grill, spec, atomic issues.

Two ways to plan.

Write the spec after you understand it, not before.

Plan mode

eager to write

Rushes to write the plan before it understands the problem. The asset comes first; understanding never catches up.

/grill-me

eager to understand

Builds shared understanding of the problem first. The asset comes after — and it's right.

Grill before you plan.

/grill-me

The agent asks now or assumes later. Assumptions become bugs.

Q1.

What happens when a user pulls to refresh while a stream is loading? Cancel? Queue?

Q2.

Bottom-sheet scroll behavior, does it lock the parent scroll or compete with it?

Q3.

Offline state: stale data shown, error shown, or skeleton? Decide once, here.

Reference

Matt Pocock · grill skill

The spec.

/to-prd

Synthesizes the grilling session into a PRD on disk. No new interview.

specs/2026-05-04-mobile-components-phase-4.md

## GOAL

Reimplement 8 web components for React Native.

## CONTEXT

Web components don't translate 1:1 to RN. Preserve the visual language.

## SCOPE

atoms, molecules, weather, explore, profile, places, itineraries, verification.

## ACCEPTANCE

Visual parity at 3 breakpoints. All existing tests pass.

## OUT OF SCOPE

Architecture decisions, navigation refactor.

Reference

Matt Pocock · to-prd skill

Decompose.

/session-planner

Spec to issues, sized for the smart zone.

spec.md

/session-planner

feature branch

N Beads issues + deps

ralph.sh

Sized for the smart zone. Each issue fits one fresh session.
Ralph script wired to this queue. Per-epic guard rails, ready to run.

The run.

The same session that planned the work now runs and watches it.

Phone showing a Claude session tailing the Ralph log

Launched from the session, not a terminal. Same session that did grilling, spec, decompose now runs ralph.sh.
Ralph runs headless. Each iteration spawns a fresh claude -p, picks the next Beads issue, lints, commits, closes it, loops. Verbose output streams to a log.
The session polls the log. Tails every minute, summarizes progress, flags failures. I’m on my phone.

Review.

Ralph closes the queue. The agent opens the PR.

Phone showing the Ralph final summary: 59 min, 13 epics, all closed

Open the PR. Ralph pushed and opened it. Read the diff.
Check the preview build. Web → Vercel preview. Mobile → Xcode Cloud build lands in TestFlight. Click through.
Loop back if needed. Anything off becomes a new issue. Ralph runs again. Otherwise merge.

When the task fits, the result lands at 95–99% of what I wanted.

Task selection is the work.

What fits

01

Large refactors

Class components to hooks. Old testing library to new one. Web app to mobile.

02

Migrations

Database changes, API upgrades, repo-wide pattern swaps.

03

Repetitive product work

Forms, tables, admin pages, login flows. Anything where the requirements are clear.

04

Mechanical cleanup

Dead code, dependency upgrades, documentation, test coverage.

What doesn't fit

01

High-taste UI work

Motion, interaction feel, visual identity. Generators give you good. Taste gives you great.

02

Architecture choices

Service boundaries, data models, scaling tradeoffs. Wrong here is expensive.

03

Open-ended improvement

“Make this better.” “Modernize the product.” If you can’t measure success, the loop has no compass.

The harness compounds.

Every repeated mistake→an anti-pattern entry.Every repeated workflow→a slash command.Every recurring correction→a CLAUDE.md edit.Every mechanical rule→a lint.

Each fix is permanent. The next session starts smarter than the last.

Be on the loop. Not in.

With ideas from

Geoffrey Huntley · Dexter Horthy · Matt Pocock · Steve Yegge
Ryan Lopopolo · Lance Martin · Mario Zechner · Armin Ronacher

Thank you.

You’ve got the map.

The Deck is free, and it stays free. When the Harness Starter Kit ships, it hands you the actual harness setup — the custom skills, CLAUDE.md, the ralph-loop script and the agent_docs templates.

Lock the discounted price, get the Kit the day it ships, plus the Screencast free when it lands.

Back to home

Beontheloop

From Inference Loopsto Long-Running Agents

From Inference Loops
to Long-Running Agents