
Cursor: Scaling Long-Running Autonomous Coding Agents


At a Glance

Cursor ran hundreds of coding agents concurrently on single codebases, writing 1M+ lines of code. Flat self-coordination failed hard. The fix: hierarchy — planners create tasks, workers grind, judges evaluate. They built a web browser from scratch in a week. Prompts matter more than harness or models. Simplicity wins.

Metadata

| Field | Value |
| --- | --- |
| Title | Scaling Long-Running Autonomous Coding Agents |
| Link | https://cursor.com/blog/scaling-agents |
| Tags | cursor, agents, multi-agent, coordination, scaling |
| Date Downloaded | 2026-02-25 |


Quotes

With no hierarchy, agents avoided hard tasks, made only safe small changes, and churned without progress.

— Cursor engineering

We removed an "integrator" role and things got better. Simplicity wins.

— Cursor blog

Prompts matter more than harness or models. The difference between a well-prompted GPT-5.2 and a poorly-prompted one is larger than the gap between models.

— Cursor team

Sam's TLDR

Cursor ran hundreds of coding agents concurrently on single codebases for weeks, writing 1M+ lines of code. Flat self-coordination failed hard — agents held locks, became risk-averse, and churned. The fix: hierarchy. Planners create tasks, workers grind, judges evaluate. They built a web browser from scratch (1M LoC, 1000 files), migrated Solid→React in the Cursor codebase (+266K/-193K edits, 3 weeks), and made video rendering 25x faster. GPT-5.2 crushes extended autonomous work. Opus 4.5 takes shortcuts. Prompts matter more than harness or models. Simplicity wins — they removed an "integrator" role and things got better.

Key Points

| Project | Duration | Scale |
| --- | --- | --- |
| Web browser from scratch | ~1 week | 1M+ LoC, 1,000 files |
| Solid→React migration (Cursor codebase) | 3+ weeks | +266K/-193K edits |
| Video rendering optimization | Long-running | 25x faster (Rust rewrite), merged to prod |
| Java LSP | Ongoing | 7.4K commits, 550K LoC |
| Windows 7 emulator | Ongoing | 14.6K commits, 1.2M LoC |
| Excel clone | Ongoing | 12K commits, 1.6M LoC |

Full Summary

Cursor has been running an experiment in scaling autonomous coding agents — not just one agent on a task, but hundreds of agents working concurrently on a single codebase for weeks at a time. The results are striking: over a million lines of code written, trillions of tokens deployed, and real production code shipped.

Their journey through coordination strategies mirrors what we've seen in distributed systems, but with distinctly AI-flavored failure modes. The first attempt — flat self-coordination via a shared file with locks — failed because agents are unreliable lock holders. They'd forget to release, hold too long, or write without acquiring. Even optimistic concurrency (no locks, but writes fail if state changed) didn't solve the deeper problem: without hierarchy, agents become risk-averse. Nobody owns the hard problems. Everyone makes small safe changes. Work churns without progress.
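The optimistic-concurrency variant they describe amounts to a versioned compare-and-swap: a writer snapshots the state and its version, and the write only lands if the version hasn't moved. A minimal Python sketch of that mechanic (all names hypothetical; a lock stands in for whatever atomic primitive the real shared file used):

```python
import threading

class SharedState:
    """Shared coordination state with optimistic concurrency:
    no locks held across work, but a write fails if anyone
    else wrote since the caller's read."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for an atomic file swap
        self.version = 0
        self.data = {}

    def read(self):
        with self._lock:
            return self.version, dict(self.data)

    def try_write(self, expected_version, updates):
        """Apply updates only if the state is unchanged since the read."""
        with self._lock:
            if self.version != expected_version:
                return False  # conflict: caller must re-read and retry
            self.data.update(updates)
            self.version += 1
            return True

state = SharedState()
v, snapshot = state.read()
assert state.try_write(v, {"task-1": "claimed-by-agent-A"})
# a stale writer still holding the old version now fails
assert not state.try_write(v, {"task-1": "claimed-by-agent-B"})
```

This removes the unreliable-lock-holder failure mode, but as the post notes it does nothing about the deeper issue: no agent is forced to own the hard, conflict-prone tasks.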

The solution was a planner/worker/judge pipeline. Planners recursively explore the codebase and decompose work into tasks. Workers claim and complete individual tasks without worrying about coordination. Judges evaluate at cycle boundaries. This is basically how human engineering orgs work — PMs plan, ICs execute, leads review.
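The shape of that pipeline can be sketched in a few lines (a hypothetical minimal skeleton, not Cursor's implementation — in the real system each function is an LLM agent):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    description: str
    status: str = "pending"   # pending -> claimed -> done
    result: Optional[str] = None

def plan(goal):
    """Planner: explore the goal and decompose it into independent tasks."""
    return [Task(f"{goal}: part {i}") for i in range(3)]

def work(task):
    """Worker: claim one task and complete it; no global coordination."""
    task.status = "claimed"
    task.result = f"completed '{task.description}'"
    task.status = "done"

def judge(tasks):
    """Judge: evaluate at the cycle boundary; return tasks needing rework."""
    return [t for t in tasks if t.status != "done" or t.result is None]

tasks = plan("build renderer")
for t in tasks:
    work(t)
assert judge(tasks) == []  # clean cycle: nothing to redo
```

The key design property is that only the planner sees the whole problem and only the judge sees the whole result; workers stay narrow, which is what kept them from churning on safe edits.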

The showcase projects are genuinely impressive. Building a web browser from scratch in a week (1M LoC across 1000 files) demonstrates that the coordination actually works at scale. The Solid→React migration of Cursor's own codebase — a 3-week effort involving nearly 460K lines changed — shows it can handle real-world complexity. And a Rust-based video renderer that's 25x faster and was actually merged to production proves this isn't just a demo.

The model findings are directly relevant to anyone building agent systems. GPT-5.2 dominates for long-running autonomous tasks. Opus 4.5 is worse for sustained work because it takes shortcuts and yields control early. And counter-intuitively, GPT-5.2 outperforms the coding-specific GPT-5.1-Codex as a planner. The lesson: match model to role, don't just pick the "best" model for everything.
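One way to act on the match-model-to-role finding in an orchestrator is a per-role routing table. The structure below is a hypothetical sketch (model identifiers taken from the post, not from any real API):

```python
# Role-to-model routing per the article's findings. The mapping and
# function are illustrative; only the model names come from the post.
MODEL_BY_ROLE = {
    "planner": "gpt-5.2",  # beat the coding-specific gpt-5.1-codex at planning
    "worker": "gpt-5.2",   # strongest at long-running autonomous work
    "judge": "gpt-5.2",
}

def model_for(role: str) -> str:
    """Pick the model for a pipeline role, with a sane default."""
    return MODEL_BY_ROLE.get(role, "gpt-5.2")
```

Keeping the routing in one table makes it cheap to A/B a different model for a single role without touching the rest of the pipeline.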

The most quotable insight: "The harness and models matter, but the prompts matter more." After all the engineering of coordination mechanisms, the biggest improvements came from prompt engineering — getting agents to coordinate well, avoid pathological behaviors, and maintain focus over long periods.

For Sam's orchestrator, the parallels are clear. We already use the planner/worker model with atomic task claims (findOneAndUpdate, not locks). The judge/reviewer step is something we could strengthen. And the insight about role-specific model selection is worth testing — could we use different models for planning vs. coding vs. reviewing?
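The atomic-claim pattern referenced above (findOneAndUpdate instead of locks) can be shown with an in-memory stand-in; names and structure here are hypothetical, and the lock simulates the single-document atomicity MongoDB provides:

```python
import threading

class TaskQueue:
    """In-memory stand-in for a task collection. claim_next mirrors
    findOneAndUpdate({'status': 'pending'}, {'$set': {'status': 'claimed'}}):
    find-and-modify happens as one atomic step, so two workers can
    never claim the same task."""

    def __init__(self, descriptions):
        self._lock = threading.Lock()
        self._tasks = [{"_id": i, "desc": d, "status": "pending"}
                       for i, d in enumerate(descriptions)]

    def claim_next(self, worker):
        # the lock simulates the atomicity findOneAndUpdate gives for free
        with self._lock:
            for t in self._tasks:
                if t["status"] == "pending":
                    t["status"] = "claimed"
                    t["worker"] = worker
                    return t
            return None

q = TaskQueue(["migrate file A", "migrate file B"])
a = q.claim_next("worker-1")
b = q.claim_next("worker-2")
assert a["_id"] != b["_id"]              # no double-claims
assert q.claim_next("worker-3") is None  # queue drained
```

The claim and the status flip being one operation is the whole point: there is no window where a task is found-but-not-yet-claimed, which is exactly the window agents exploited as unreliable lock holders.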

References

  1. Cursor Engineering Blog — Scaling Long-Running Autonomous Coding Agents. https://cursor.com/blog/scaling-agents