
Cursor: Scaling Long-Running Autonomous Coding Agents


At a Glance

Cursor ran hundreds of coding agents concurrently on single codebases, writing 1M+ lines of code. Flat self-coordination failed hard. The fix: hierarchy — planners create tasks, workers grind, judges evaluate. They built a web browser from scratch in a week. Prompts matter more than harness or models. Simplicity wins.

Metadata

| Field | Value |
| --- | --- |
| Title | Scaling Long-Running Autonomous Coding Agents |
| Link | https://cursor.com/blog/scaling-agents |
| Tags | cursor, agents, multi-agent, coordination, scaling |
| Date Downloaded | 2026-02-25 |


Quotes

With no hierarchy, agents avoided hard tasks, made only safe small changes, and churned without progress.

— Cursor engineering

We removed an "integrator" role and things got better. Simplicity wins.

— Cursor blog

Prompts matter more than harness or models. The difference between a well-prompted GPT-5.2 and a poorly-prompted one is larger than the gap between models.

— Cursor team

Sam's TLDR

Cursor ran hundreds of coding agents concurrently on single codebases for weeks, writing 1M+ lines of code. Flat self-coordination failed hard — agents held locks, became risk-averse, and churned. The fix: hierarchy. Planners create tasks, workers grind, judges evaluate. They built a web browser from scratch (1M LoC, 1000 files), migrated Solid→React in the Cursor codebase (+266K/-193K edits, 3 weeks), and made video rendering 25x faster. GPT-5.2 crushes extended autonomous work. Opus 4.5 takes shortcuts. Prompts matter more than harness or models. Simplicity wins — they removed an "integrator" role and things got better.

Key Points

| Project | Duration | Scale |
| --- | --- | --- |
| Web browser from scratch | ~1 week | 1M+ LoC, 1,000 files |
| Solid→React migration (Cursor codebase) | 3+ weeks | +266K/-193K edits |
| Video rendering optimization | Long-running | 25x faster (Rust rewrite), merged to prod |
| Java LSP | Ongoing | 7.4K commits, 550K LoC |
| Windows 7 emulator | Ongoing | 14.6K commits, 1.2M LoC |
| Excel clone | Ongoing | 12K commits, 1.6M LoC |

Full Summary

Cursor has been running an experiment in scaling autonomous coding agents — not just one agent on a task, but hundreds of agents working concurrently on a single codebase for weeks at a time. The results are striking: over a million lines of code written, trillions of tokens deployed, and real production code shipped.

Their journey through coordination strategies mirrors what we've seen in distributed systems, but with distinctly AI-flavored failure modes. The first attempt — flat self-coordination via a shared file with locks — failed because agents are unreliable lock holders. They'd forget to release, hold too long, or write without acquiring. Even optimistic concurrency (no locks, but writes fail if state changed) didn't solve the deeper problem: without hierarchy, agents become risk-averse. Nobody owns the hard problems. Everyone makes small safe changes. Work churns without progress.
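The optimistic-concurrency variant they describe amounts to a versioned compare-and-swap: a writer snapshots the state and its version, and the write only lands if the version hasn't moved. A minimal Python sketch of that mechanic (all names hypothetical; a lock stands in for whatever atomic primitive the real shared file used):

```python
import threading

class SharedState:
    """Shared coordination state with optimistic concurrency:
    no locks held across work, but a write fails if anyone
    else wrote since the caller's read."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for an atomic file swap
        self.version = 0
        self.data = {}

    def read(self):
        with self._lock:
            return self.version, dict(self.data)

    def try_write(self, expected_version, updates):
        """Apply updates only if the state is unchanged since the read."""
        with self._lock:
            if self.version != expected_version:
                return False  # conflict: caller must re-read and retry
            self.data.update(updates)
            self.version += 1
            return True

state = SharedState()
v, snapshot = state.read()
assert state.try_write(v, {"task-1": "claimed-by-agent-A"})
# a stale writer still holding the old version now fails
assert not state.try_write(v, {"task-1": "claimed-by-agent-B"})
```

This removes the unreliable-lock-holder failure mode, but as the post notes it does nothing about the deeper issue: no agent is forced to own the hard, conflict-prone tasks.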

The solution was a planner/worker/judge pipeline. Planners recursively explore the codebase and decompose work into tasks. Workers claim and complete individual tasks without worrying about coordination. Judges evaluate at cycle boundaries. This is basically how human engineering orgs work — PMs plan, ICs execute, leads review.
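The shape of that pipeline can be sketched in a few lines (a hypothetical minimal skeleton, not Cursor's implementation — in the real system each function is an LLM agent):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    description: str
    status: str = "pending"   # pending -> claimed -> done
    result: Optional[str] = None

def plan(goal):
    """Planner: explore the goal and decompose it into independent tasks."""
    return [Task(f"{goal}: part {i}") for i in range(3)]

def work(task):
    """Worker: claim one task and complete it; no global coordination."""
    task.status = "claimed"
    task.result = f"completed '{task.description}'"
    task.status = "done"

def judge(tasks):
    """Judge: evaluate at the cycle boundary; return tasks needing rework."""
    return [t for t in tasks if t.status != "done" or t.result is None]

tasks = plan("build renderer")
for t in tasks:
    work(t)
assert judge(tasks) == []  # clean cycle: nothing to redo
```

The key design property is that only the planner sees the whole problem and only the judge sees the whole result; workers stay narrow, which is what kept them from churning on safe edits.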

The showcase projects are genuinely impressive. Building a web browser from scratch in a week (1M LoC across 1000 files) demonstrates that the coordination actually works at scale. The Solid→React migration of Cursor's own codebase — a 3-week effort involving nearly 460K lines changed — shows it can handle real-world complexity. And a Rust-based video renderer that's 25x faster and was actually merged to production proves this isn't just a demo.

The model findings are directly relevant to anyone building agent systems. GPT-5.2 dominates for long-running autonomous tasks. Opus 4.5 is worse for sustained work because it takes shortcuts and yields control early. And counter-intuitively, GPT-5.2 outperforms the coding-specific GPT-5.1-Codex as a planner. The lesson: match model to role, don't just pick the "best" model for everything.
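One way to act on the match-model-to-role finding in an orchestrator is a per-role routing table. The structure below is a hypothetical sketch (model identifiers taken from the post, not from any real API):

```python
# Role-to-model routing per the article's findings. The mapping and
# function are illustrative; only the model names come from the post.
MODEL_BY_ROLE = {
    "planner": "gpt-5.2",  # beat the coding-specific gpt-5.1-codex at planning
    "worker": "gpt-5.2",   # strongest at long-running autonomous work
    "judge": "gpt-5.2",
}

def model_for(role: str) -> str:
    """Pick the model for a pipeline role, with a sane default."""
    return MODEL_BY_ROLE.get(role, "gpt-5.2")
```

Keeping the routing in one table makes it cheap to A/B a different model for a single role without touching the rest of the pipeline.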

The most quotable insight: "The harness and models matter, but the prompts matter more." After all the engineering of coordination mechanisms, the biggest improvements came from prompt engineering — getting agents to coordinate well, avoid pathological behaviors, and maintain focus over long periods.

For Sam's orchestrator, the parallels are clear. We already use the planner/worker model with atomic task claims (findOneAndUpdate, not locks). The judge/reviewer step is something we could strengthen. And the insight about role-specific model selection is worth testing — could we use different models for planning vs. coding vs. reviewing?
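The atomic-claim pattern referenced above (findOneAndUpdate instead of locks) can be shown with an in-memory stand-in; names and structure here are hypothetical, and the lock simulates the single-document atomicity MongoDB provides:

```python
import threading

class TaskQueue:
    """In-memory stand-in for a task collection. claim_next mirrors
    findOneAndUpdate({'status': 'pending'}, {'$set': {'status': 'claimed'}}):
    find-and-modify happens as one atomic step, so two workers can
    never claim the same task."""

    def __init__(self, descriptions):
        self._lock = threading.Lock()
        self._tasks = [{"_id": i, "desc": d, "status": "pending"}
                       for i, d in enumerate(descriptions)]

    def claim_next(self, worker):
        # the lock simulates the atomicity findOneAndUpdate gives for free
        with self._lock:
            for t in self._tasks:
                if t["status"] == "pending":
                    t["status"] = "claimed"
                    t["worker"] = worker
                    return t
            return None

q = TaskQueue(["migrate file A", "migrate file B"])
a = q.claim_next("worker-1")
b = q.claim_next("worker-2")
assert a["_id"] != b["_id"]              # no double-claims
assert q.claim_next("worker-3") is None  # queue drained
```

The claim and the status flip being one operation is the whole point: there is no window where a task is found-but-not-yet-claimed, which is exactly the window agents exploited as unreliable lock holders.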

References

  1. Cursor Engineering Blog — Scaling Long-Running Autonomous Coding Agents. https://cursor.com/blog/scaling-agents