The AI-Native
Developer Playbook
A spec-driven, AI-native way of building software, optimised for Kanban flow and compound engineering. Work moves through clearly defined stages — Requirements, PRD, Design, Code, Tests, Docs, Deployment, Operations, and Modernization — guided by reusable specs and AI agents at every step.
A field-ready guide for developers, architects, and product managers — moving from AI-assisted habits to a systematic, compound way of building software, built on the Plan → Work → Review → Compound loop.
Why This Playbook Exists
If you ship code with a copilot and use a chat model to draft tests or docs, you're already AI-assisted. That's the easy part. The harder, more valuable part — the one this playbook is about — is becoming AI-native: making AI a structural part of how you design, build, review, ship, and operate software, in a way that gets better with every loop instead of plateauing. The SAND Framework is the system that makes that happen.
You'll know the practice is working when…
Specs live in git
Your team's specifications are versioned alongside code, reviewed in PRs, and referenced in stories — not stashed in someone's notebook.
Stories direct agents
Every user story names its inputs, target artifacts, governing specs, and expected compound outcome — clear enough that an agent can act without a meeting.
Retros produce diffs
Every retro ends with a list of artifacts you wrote back into the system: spec updates, new patterns, regression tests, refined prompts.
The Shift, in Plain Terms
The move from AI-assisted to AI-native is mostly about replacing two old habits with two new ones. Hold these in your head as you read the rest of the playbook.
Where most teams are today
- Free-form prompts, written fresh per task
- AI output is a one-off draft you copy-paste
- Tests, docs, and diagrams are downstream chores
- What worked last sprint is in someone's head
- Reviews catch defects but rarely improve the next loop
Where you're going with SAND
- Versioned specs govern every agent invocation
- AI output is a reviewable artifact in the pipeline
- Tests, docs, and diagrams are first-class deliverables
- What worked is written back into specs and patterns
- Each loop leaves the system measurably better
The Three Terms You'll See Throughout
Spec
A versioned, reviewable document that constrains what an agent does. Lives in git. Examples: prd_spec, code_spec, qa_docs_spec.
Agent
A narrowly scoped AI worker invoked under a spec, producing reviewable artifacts (PRDs, code, tests, docs, diagrams). Logged, costed, replayable.
Compound
The step where you write learnings back into specs, patterns, agents, and tests so the next loop starts ahead of where this one ended.
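To make "logged, costed, replayable" concrete, here's a hypothetical run record for one agent invocation. Every field name and value is illustrative, not something the framework mandates:

```yaml
# Hypothetical agent run record; names and values are illustrative
run_id: "2025-03-02-build-agent-0042"
agent: build_agent
model_tier: tier-2                  # routed per "routing matches risk"
governing_specs: ["code_spec §7.3"]
story: "Add tenant-scoped audit log to feature flag service"
inputs: ["PRD §4.3", "ADR-014"]
outputs: ["code diff + unit tests (one reviewable PR)"]
cost_usd: 1.84
iterations: 2
```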
The Loop You'll Live In
Every unit of work — a story, a bugfix, a refactor, a release — runs the same four-step loop. Steps 1–3 deliver the change. Step 4 is what makes the system get better.
Plan
Frame the work. Name the artifact being transformed, the spec that governs it, and what success looks like.
Work
Agents produce the primary output under spec control. You contribute what they're bad at.
Review
Multi-agent critics and human reviewers assess against spec and constraints. Defects are fixed in place.
Compound
Write learnings back: spec updates, new patterns, new tests, refined prompts, documented anti-patterns.
What "Compound" Actually Looks Like
Compound is not a discussion or a retrospective sticky note. It produces concrete artifacts. After every story, you should be able to point at a diff against one of these:
Tangible compound deliverables
- A new section, example, or rule added to a spec
- A reusable snippet promoted into a shared library
- A new regression test added to the cross-team suite
- A refined prompt, system message, or agent definition
- A documented anti-pattern with explicit rationale
The honest exception
If a story ends with no diff against any of these, ask: was this loop genuinely identical to one we've run before? If yes, fine — but that should be rare. If you're skipping Compound every story, you're not compounding; you're just delivering.
Eight Operating Principles
These are the principles you cite in design reviews and PR threads. They're how disagreements get settled without re-litigating philosophy every time.
1. Specs over prompts
Prompts are tactical. Specs are versioned, reviewed, reused. If something matters, it goes in a spec.
2. Every loop compounds
Work isn't done until the Compound step has produced a concrete improvement to the system.
3. Humans decide. Agents produce.
Architectural choices and approvals stay with humans. Generation, refactoring, and repetitive work move to agents.
4. Reviewable diffs only
Agents produce changes small enough to review with clear blast radius. Large rewrites are decomposed.
5. Routing matches risk
Frontier models for ambiguous, high-risk, cross-artifact work. Cheaper models for routine, repetitive work.
6. Default to determinism
Where outputs can follow schemas or templates, prefer that over open-ended generation. A sketch of what this looks like follows this list.
7. Reuse beats regeneration
If a pattern, snippet, or spec exists, the agent must use it. Regenerating from scratch is a smell.
8. Cost is first-class
Cost per story and per stage are tracked alongside lead time and quality. Surprise bills are bugs.
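Principle 6 in practice: where the deliverable has a known shape, hand the agent a schema to fill instead of a blank page. A minimal sketch, assuming a hypothetical release-notes agent; the field names are illustrative, not part of any spec in this playbook:

```yaml
# Hypothetical output contract for a release-notes agent: the agent fills
# these fields rather than writing free-form prose.
output_schema:
  type: object
  required: [summary, changes, rollback_notes]
  properties:
    summary:
      type: string
      maxLength: 280
    changes:
      type: array
      items:
        type: object
        required: [pr, description, risk]
        properties:
          pr: { type: string }                  # e.g. "#1234"
          description: { type: string }
          risk: { enum: [low, medium, high] }
    rollback_notes:
      type: string
```

Validating the output against a schema turns "looks plausible" into "passes or fails", which is the point of the principle.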
Specs are the Source of Truth
If you take one habit from this playbook, take this one: write a spec, not a prompt. A prompt is a snowflake — it works for a moment and disappears. A spec is a contract — it lives in git, it's reviewed, it's versioned, it's reused, and it gets sharper every time someone uses it.
The Specs You'll Touch Most
| Spec | What It Encodes | Owned By | Used By |
|---|---|---|---|
prd_spec | Templates, domain language, NFR patterns, acceptance-criteria style | Product | PRD agent, consistency agent |
code_spec | Tech stack, architecture style, coding standards, security & observability norms | Architecture group + all engineers | Build agent, multi-agent reviewers |
qa_docs_spec | Test strategies, coverage rules, doc tone and structure | QA + Tech-writing | Verification agent, doc agent |
design_system_spec | Tokens, components, accessibility rules, microcopy guidelines | Design | Wireframe / microcopy agents |
deployment_spec | Pipeline templates, rollout strategies, risk classification rules | SRE | Pipeline agent, risk-classification agent |
ops_spec | Alerts, SLOs, runbooks, remediation policies | SRE | Monitoring & remediation agents |
modernization_spec | Refactor patterns, migration playbooks, deprecation policies | Architects | Modernization agent |
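One plausible way to make "lives in git" literal is a specs directory with owners wired into PR review. The paths and approver handles below are illustrative, not prescribed by the framework:

```yaml
# Illustrative only: where each spec might live and who must approve changes
specs:
  prd_spec:        { path: specs/prd_spec.md,        approvers: [product-leads] }
  code_spec:       { path: specs/code_spec.md,       approvers: [architecture-group, staff-engineers] }
  qa_docs_spec:    { path: specs/qa_docs_spec.md,    approvers: [qa-leads, tech-writing] }
  deployment_spec: { path: specs/deployment_spec.md, approvers: [sre] }
  ops_spec:        { path: specs/ops_spec.md,        approvers: [sre] }
review_policy:
  min_approvers_for_shared_specs: 2
```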
What a Real Spec Snippet Looks Like
This is a redacted slice of a code_spec — the section that governs how cache layers should be built for multi-tenant services. Notice it's not a blob of free-form prose; it's structured enough that an agent can actually use it.
```yaml
# Section 7.3 — Multi-tenant cache layers
applies_to: [service, library]
when: "component reads/writes cached data scoped to a tenant"
required_patterns:
  - name: tenant-keyed cache keys
    rule: "all keys MUST embed tenant_id as the first segment"
    example: "flag:{tenant_id}:{flag_key}"
  - name: bounded invalidation fan-out
    rule: "invalidation MUST NOT exceed N keys per call (N=1000)"
    rationale: "see incident I-2024-117 — unbounded fan-out caused regional outage"
  - name: audit on write
    rule: "every write emits an audit event with actor, tenant, before/after"
anti_patterns:
  - "global cache keys without tenant prefix (cross-tenant leak risk)"
  - "cache-aside without negative-result caching (thundering herd)"
  - "premature interfaces around the cache client (over-engineering)"
reusable_components:
  - "@platform/tenant-cache (preferred)"
  - "@platform/audit-emitter"
tests_required:
  - "isolation: tenant A cannot read tenant B's keys"
  - "invalidation bound: fan-out > N raises error before execution"
  - "audit completeness: every write produces matching audit event"
```
How to Write Your First Spec Section
1. Wait for the second instance
Don't speculate. The first time you do something, just do it. The second time, ask: is this a pattern? If yes, write it down.
2. Start with structure, not prose
Rules, examples, anti-patterns, required tests. Agents can act on lists; they can't act on essays.
3. Cite the incident or PR
Every rule has a reason. Link to it. "See incident I-2024-117" beats "this is important."
4. Get it reviewed
Specs go through PR review like code. Two approvers minimum for shared specs. Treat updates as first-class deliverables — tag them in your PR description.
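Applied to a brand-new pattern, those four steps can produce something much smaller than the Section 7.3 example above: one rule, one example, one anti-pattern, one required test. A sketch of a plausible first section; the pattern and wording are hypothetical:

```yaml
# A deliberately small first section; grow it when the third instance appears
# Section 1.1 - Idempotent event handlers (hypothetical)
applies_to: [service]
when: "component consumes events from a shared queue"
required_patterns:
  - name: de-duplicate on consume
    rule: "handlers MUST check event_id before producing side effects"
    example: "seen(event_id) lookup backed by a TTL store"
    rationale: "cite the incident or PR that made this a rule"
anti_patterns:
  - "relying on broker delivery settings instead of handler-side de-duplication"
tests_required:
  - "replaying the same event twice produces exactly one side effect"
```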
The Pipeline, Stage by Stage
The SAND Framework breaks every feature into nine stages. At each stage there's a spec, an artifact going in, and an artifact coming out. The agents change; the loop is the same. We'll trace a single feature — a tenant-scoped audit log for a feature flag service — through every stage so you can see the actual hand-offs.
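Before walking the stages, it helps to read each one as a small contract: governing spec, inputs, outputs, agents, and the human gate. The sketch below shows the Implementation stage in that shape; the field names are ours for illustration, not mandated by SAND:

```yaml
# Illustrative stage contract; field names are for illustration only
stage: Implementation
governing_specs: [code_spec]
input_artifacts: [approved PRD, ADRs, design artifacts]
output_artifacts: [reviewable code diffs, unit tests, contract tests]
agents: [build agent, multi-agent reviewers, test-scaffold agent]
human_gate: "an engineer reviews and approves each diff"
compound_hook: "corrected patterns are written back into code_spec"
```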
Discovery & Requirements
Capture business goals, user needs, constraints, and risks at a level of clarity sufficient to drive PRD generation.
You do
- Articulate goals, target users, success metrics
- Identify hard constraints (regulatory, integration, performance)
- Surface known risks and dependencies
Agents do
- Discovery agent produces a structured brief from notes
- Gap-analysis agent flags missing NFRs and edge cases
- Risk agent surfaces likely failure modes
Compound
- New question categories added to requirements_spec
- Domain glossary grows
- Missed patterns from human reviewers added to gap-analysis checklist
Product Requirements Document
Convert the brief into a structured, reviewable PRD that downstream stages can act on directly.
You do
- Author goal narrative, success metrics, prioritization
- Approve the structured PRD
Agents do
- PRD agent generates the structured PRD
- Consistency agent checks against organizational standards
- Traceability agent links sections to upstream and downstream artifacts
Compound
- Sections reviewers had to add by hand become new templates
- Ambiguous phrasings caught downstream get blacklisted
Design & Architecture
Decide the architectural shape, key patterns, and UX direction; produce design and architecture artifacts that constrain implementation.
You do
- Architects make decisions and write ADRs
- Designers own user journeys and IA
- Security architects do threat modeling
Agents do
- Architecture agent generates 2–3 candidates with trade-offs
- Diagram agent produces C4 views
- Design agent produces wireframes against the design system
- Threat-model agent drafts STRIDE analysis
Compound
- Chosen pattern becomes a reference architecture
- Rejected candidates with hot-spot risks become documented anti-patterns in architecture_spec
Implementation
Produce the code, IaC, and initial tests that realize the approved design.
You do
- Frame each unit of work; decompose into reviewable diffs
- Review agent-generated code critically
- Handle complex debugging, novel algorithms, performance work
Agents do
- Build agent generates diffs from PRD + design + code_spec + repo context
- Multi-agent review runs in parallel: security, performance, over-engineering, style
- Test-scaffold agent produces unit and contract tests
Compound
- Corrected patterns added to code_spec
- Anti-patterns documented with rationale
- Reusable code goes into shared platform libraries
Testing & Quality
Verify the implementation meets the PRD and NFRs; grow the regression net.
You do
- Design test strategy and risk-based coverage
- Identify edge cases agents miss
- Design what cannot be automated
Agents do
- Verification agent generates unit, integration, contract tests
- Property-based agent proposes invariants
- Flakiness detector and coverage agent run continuously
Compound
- Edge cases become templates for similar components
- Regression suite grows with every loop
- Property invariants become checklists for similar work
Documentation & Diagrams
Produce and maintain docs, diagrams, and operational artifacts.
You do
- Tech writers and architects review for tone, accuracy, audience fit
- SREs own runbooks
- Approve customer-facing docs
Agents do
- Doc agent generates API reference, READMEs, FAQs, changelogs
- Diagram agent regenerates C4 views from code
- Runbook agent drafts initial runbooks
- Customer-facing doc agent works against a voice-tuned spec
Compound
- Frequent FAQs are pulled into qa_docs_spec
- Confusing phrasings caught in support tickets become things-to-avoid in content_spec
Deployment & Release
Promote changes safely with appropriate gates and rollback paths.
You do
- Release managers approve promotion
- EMs own release decisions
- SREs own the deployment platform
Agents do
- Pipeline agent generates and maintains CI/CD, IaC, manifests
- Risk-classification agent assigns risk levels and recommends rollout
- Release-notes agent composes notes from PRs and ADRs
Compound
- Successful canary patterns become "safe templates"
- Metrics that should have been rollback triggers but weren't get added
Operations & Incidents
Keep production healthy, detect incidents early, convert every incident into durable improvement.
You do
- SREs own SLOs, on-call, postmortems
- Engineering teams own service health
- Incident commanders run major incidents
Agents do
- Monitoring agent correlates signals, surfaces anomalies
- Remediation agent proposes diagnoses and fixes
- Postmortem agent drafts timelines
Compound
- Every incident produces durable updates: alerts, SLOs, runbooks, tests
- Repeat-incident-class rate trends to zero
Continuous Modernization
Keep systems healthy and changeable: upgrade dependencies, remove dead code, refactor toward simpler designs.
You do
- Architects and tech leads decide scope and risk appetite
- EMs integrate modernization into the backlog
Agents do
- Modernization agent proposes incremental refactors
- Dependency-graph agent maintains the system view
- Migration-plan agent generates step-by-step plans with rollback paths
Compound
- Each refactor must leave the system simpler or test net stronger
- Migration playbooks become near-templated for the next service
AI-Ready User Stories
An AI-ready story gives an agent enough to start without a meeting. The familiar "As a... I want... so that..." stays. Four things get added: the input artifact, the target artifacts, the spec sections that govern the work, and the loop position.
A Real Story, in the Format
```yaml
# Add tenant-scoped audit log to feature flag service
user_narrative: |
  As a release manager, I want every flag change recorded with actor, tenant,
  before/after value, and timestamp, so that I can audit changes during incidents.
inputs:
  - PRD §4.3 (audit)
  - ADR-014 (audit-log architecture)
  - code_spec §7 (audit logging)
  - qa_docs_spec §3 (async event tests)
target_artifacts:
  - code change in flag-admin-service
  - contract test (audit-log API)
  - integration test (write path → audit emission)
  - API doc update + runbook update
acceptance_criteria:
  - Given a tenant admin updates a flag, when the update is committed, then an audit record is written within 100ms.
  - The record contains all required fields (actor, tenant, flag_key, before, after, timestamp).
  - Queries by tenant return only that tenant's records.
loop_position: Work + Review
compound_expectation: |
  If audit-emission helpers are reused, promote into @platform/audit-emitter
  and update code_spec §7.
cost_budget: "$8 / story (est.)"
```
The Discipline Behind the Format
Inputs are explicit
The agent shouldn't have to guess which spec applies. If you can't list the inputs, the story isn't ready.
Target artifacts are listed up front
Code is rarely the only deliverable. Tests, docs, runbook updates are part of "done."
Acceptance criteria are testable
"Given/When/Then" or similar. If a human can't write a test from it, neither can an agent.
Compound expectation is named
If you can predict what should be promoted into a spec or library, write it down. Otherwise flag the story as exploratory.
Cost budget sets a ceiling
If the agent burns through it, that's a signal to pause and re-plan, not to keep going.
Loop position is named
Tells reviewers what to look for. A "Plan" story is reviewed differently from a "Work" story.
Reviewing Agent Output
Reviewing agent-generated work is a different skill from reviewing human-generated work. Agents are confident, prolific, and locally consistent — which means defects are often plausible. Your job isn't to read every line. It's to ask the questions that catch what plausibility hides.
The Six Questions, in Order
Run these in order on every agent-produced PR. The order matters: cheap checks first.
1. Does the diff follow the specs cited in the story? Open the PR alongside those spec sections and walk through them. Is every required pattern actually applied? If a spec section is missing from the PR, ask why before reading further.
2. What's the blast radius? Data, security, public APIs, infra, or internal-only? Match review depth to blast radius. A pure docs PR doesn't need a 90-minute review. A change to the auth path does.
3. What's missing? Agents tend to omit the unglamorous: error paths, partial-failure handling, observability hooks, audit logging, edge cases on inputs. If you don't see them, ask explicitly.
4. Is it over-engineered? Look for suspiciously clean abstractions, premature interfaces, and novel patterns where boring ones would do. Over-engineering is the most common AI-generated defect. Push back hard.
5. What should this PR compound into? Before approving: what spec, library, or test should this PR feed? If nothing, why not? The Compound deliverable is part of the PR, not a follow-up.
6. Did the critics actually push back? Look at the multi-agent review output. If every critic returned "looks good," be suspicious; they may be too lenient. Tighten their prompts in the Compound step.
The Compound Step
The Compound step is where AI-native delivery diverges from "AI-assisted faster." It's also the step under the most pressure to be skipped. The story is shipped, the reviewer is satisfied, the next story is waiting. Twenty minutes spent updating a spec feels like a tax. It isn't. It's the principal.
End-of-Story Compound Check
- A pattern emerged that's likely to recur — added to the relevant spec.
- A reusable snippet was extracted into a shared library or module.
- A new edge case was caught — regression test added to the suite.
- An anti-pattern was rejected — documented in the spec's anti-patterns section with rationale.
- An agent prompt or system message was refined — change committed and noted.
- A spec gap was found — issue filed for the spec owner with a concrete proposal.
- An incident or near-miss occurred — postmortem entry made with monitoring/runbook updates.
- A cost surprise occurred — routing rule or batching strategy updated.
- Nothing applies — story is genuinely a repeat of one we've shipped before. (Be suspicious of this answer.)
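To keep the check honest and auditable, the outcome can be recorded in the PR itself. A hypothetical compound record; the format and field names are illustrative:

```yaml
# Hypothetical compound record pasted into the PR description
compound:
  story: "Add tenant-scoped audit log to feature flag service"
  deliverables:
    - type: spec_update
      target: "code_spec §7"
      change: "audit-emission helper promoted to @platform/audit-emitter; usage rule added"
    - type: regression_test
      target: cross-team suite
      change: "audit completeness test for flag writes"
  nothing_applies: false   # set true only for a genuine repeat, and say why
```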
Roles & What Changes for You
The shift looks different from each seat. Pick yours below; the others are useful too — knowing what your teammates are leaning into is half of working well together.
Developers
Your craft doesn't disappear — it concentrates. The judgment calls about decomposition, debugging, and design get more of your hours. The boilerplate gets less. The Compound step is where you make your team faster, not just yourself.
✓ Lean into
- Decomposing work into reviewable units
- Reviewing agent output critically
- Debugging hard, novel problems
- Shaping code_spec
- Mentoring & capturing patterns
- Performance and integration work
✕ Step away from
- Boilerplate and scaffolding
- Repetitive test writing
- Mechanical refactoring
- Manual changelog maintenance
- Hand-rolled doc updates
- Acting as a typist for the agent
Tech Leads
Your team's speed is now bounded by how well it runs the loop, not by how fast you review. Spend your hours on Plan and Compound. Make the rest of the team great at Work and Review.
✓ Lean into
- Architectural intent at story level
- Resolving cross-cutting questions
- Running Plan and Compound for the team
- Ensuring team work feeds shared specs
- Calibrating multi-agent review
✕ Step away from
- Reviewing every routine PR
- Manually maintaining team docs
- Re-explaining patterns in chat
- Status-collection meetings
Architects
The diagrams now generate themselves from code. What doesn't generate itself is the judgment encoded in architecture_spec and code_spec. That's where your hours go. You're the steward of the constraints under which everyone else's agents operate.
✓ Lean into
- Architecture decisions and ADRs
- Owning architecture_spec & code_spec
- Steering Compound across BUs
- Reviewing high-impact agent proposals
- Cross-team pattern promotion
✕ Step away from
- Drawing diagrams by hand
- One-off architecture documents that go stale
- Reviewing every routine PR
- Being the only person who knows the why
QA Engineers
Agents write the repetitive tests. You design the strategy: what level, what edge cases, what risks justify what coverage. Your most valuable artifact isn't a test suite — it's a richer qa_docs_spec that everyone's verification agent reads from.
✓ Lean into
- Test strategy and risk-based coverage
- Designing edge cases agents miss
- Owning qa_docs_spec
- Auditing the regression library
- Property-based testing design
✕ Step away from
- Writing repetitive unit/integration tests
- Maintaining fixtures and mocks by hand
- Manual regression sweeps
- Status reports the dashboard already shows
SREs
Every incident is now a Compound opportunity. The remediation agent does the rote work; you design what it does and what it doesn't, and you write incidents back into ops_spec so the same class doesn't recur.
✓ Lean into
- SLO design and incident command
- Postmortem authorship
- Owning ops_spec & deployment_spec
- Designing remediation policies
- Toil-reduction agent calibration
✕ Step away from
- Manual signal correlation
- Hand-writing every runbook
- Repetitive remediations
- Pipeline plumbing
Product Managers
Your PRD is no longer a document — it's an input to a pipeline. Quality goes up when the PRD is structured enough that the PRD agent and the consistency agent can do most of the drafting. Your hours move toward goal-setting, prioritization, and growing prd_spec.
✓ Lean into
- Goal articulation and success metrics
- Customer empathy and prioritization
- Owning prd_spec
- Structured requirements briefs
- Cross-team domain glossary
✕ Step away from
- Drafting boilerplate PRD sections
- Manual traceability matrices
- Re-typing the same NFR templates
- Status-collection meetings
Designers
Agents will produce wireframes and microcopy. The design system, the tokens, and design_system_spec are what make those outputs good. Your most leveraged work is in the system, not the screen.
✓ Lean into
- Owning the design system & tokens
- Encoding accessibility into specs
- Crafting content_spec for voice
- Reviewing and refining agent UI
- User-research synthesis & journeys
✕ Step away from
- Producing every wireframe by hand
- Re-writing similar microcopy from scratch
- Maintaining the design library manually
- One-off prototype builds
SDLC & Kanban Alignment
The SAND Framework can work with any SDLC approach — including Waterfall and various Agile frameworks. But it is principally aligned with Kanban. The small, reviewable, spec-governed increments that SAND produces are a natural fit with Kanban's core philosophy of continuous flow, limited WIP, and relentless cycle-time optimisation.
Why SAND and Kanban Are a Natural Fit
Limit Work in Progress
Each SAND stage is a discrete, bounded column. Stories can't proceed until their stage artifact is reviewed and accepted. Agents producing reviewable diffs — not massive rewrites — keep each card genuinely small and completable within WIP limits.
Faster Cycle Time
Agents compress the Work phase. Specs eliminate the planning ramp-up on repeat patterns. The Compound step means the second similar story starts faster than the first. Cycle time doesn't just stay flat — it actively trends down.
Optimise for Flow
Blockers in traditional Kanban often come from waiting for humans to draft things. SAND moves that wait time to the agent, which is non-blocking. Human review is focused and fast because reviewable diffs have clear blast radius.
Continuous Improvement
Kanban requires you to make the process visible and improve it. The Compound step is the structural mechanism: every loop writes its improvement back into specs, libraries, and tests — exactly what a Kanban retrospective should produce.
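If your tooling supports board-as-configuration, the stage columns and WIP limits can be written down rather than implied. A sketch; the column set and limits are placeholders, not recommendations:

```yaml
# Illustrative board definition: SAND stages as columns; limits are placeholders
columns:
  - { name: Requirements,   wip_limit: 3 }
  - { name: PRD,            wip_limit: 3 }
  - { name: Design,         wip_limit: 2 }
  - { name: Implementation, wip_limit: 4 }
  - { name: Review,         wip_limit: 3 }   # see the WIP-overflow anti-pattern
  - { name: Testing,        wip_limit: 3 }
  - { name: Docs,           wip_limit: 2 }
  - { name: Deployment,     wip_limit: 2 }
blocked_card_sla_hours: 48                    # matches the definition-of-done WIP item
```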
Framework Compatibility Overview
🎯 Kanban
SAND's primary alignment. Small increments, WIP limits, flow optimisation, and the Compound step map directly to Kanban principles. The stagewise pipeline is a natural Kanban board layout.
⚡ Scrum / Agile Sprints
Works well. Map stages to sprint ceremonies. The sprint cadence replaces Kanban's continuous flow — compound deliverables happen at sprint retro. WIP limits require explicit enforcement.
🏗️ Waterfall / Stage-Gate
Compatible at the stage level — each SAND stage aligns with a waterfall phase. Compound is harder to enforce at pace. Large batch sizes reduce the benefit of agent-generated reviewable diffs.
Selective Model Usage — Routing AI to the Right Work
Not all work is equal, and not all AI models are equal in price or capability. Cost optimisation is a first-class SAND principle. Here's the routing logic.
Complex & Novel Work
- Ambiguous requirements needing deep reasoning
- First-iteration architecture on greenfield problems
- Cross-artifact traceability (PRD ↔ code ↔ tests)
- High-blast-radius security or performance reviews
- Novel algorithm or domain-specific logic generation
- Postmortem root-cause analysis
Standard Development Work
- PRD generation from a structured brief
- 2nd and 3rd iteration on established patterns
- Diagram generation from code
- Test generation for known component types
- Routine PR review against existing spec rules
- API documentation from OpenAPI schema
Routine & Repetitive Tasks
- Changelog generation from commit messages
- Boilerplate code from templates
- Formatting and linting correction
- Simple unit test scaffolding (4th+ iteration)
- FAQ generation from support tickets
- Translation/localisation of known strings
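As a sketch, the three tiers above can live as a routing table instead of per-story judgment calls. Tier labels are placeholders; map them to whatever models your organisation has approved:

```yaml
# Illustrative routing rules; tier labels are placeholders
routing:
  frontier:
    use_for:
      - ambiguous requirements needing deep reasoning
      - first-iteration architecture on greenfield problems
      - high-blast-radius security or performance reviews
  standard:
    use_for:
      - PRD generation from a structured brief
      - test generation for known component types
      - routine PR review against existing spec rules
  light:
    use_for:
      - changelog generation from commit messages
      - boilerplate code from templates
      - formatting and linting correction
  default: standard
  escalate_when: "the story is flagged high-risk or the agent reports low confidence"
```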
When to Let AI Iterate — and When to Hand Off to a Human
AI iteration is powerful but not infinite. The quality of AI output typically follows a curve: significant gains on early iterations, diminishing returns by iteration 3–4, and potential quality erosion after that. Know when to stop the loop.
| Iteration | Owner | Typical Focus | Signal to Proceed / Escalate |
|---|---|---|---|
| 1st iteration | AI (Tier 1–2) | Initial draft — scaffold, structure, happy path. High variance is expected. | Proceed if spec coverage is ≥ 70%. Escalate if the agent is hallucinating APIs or misreading context. |
| 2nd iteration | AI (Tier 2) | Address review feedback, fill in error paths, add observability hooks and edge cases. | Proceed if spec coverage is ≥ 90% and no security findings remain. Escalate if same defects recur. |
| 3rd iteration | AI or Human (assess) | Fine-tuning: performance, subtle logic bugs, complex multi-system interactions. | If 3 iterations haven't resolved the core issue, a human should diagnose root cause before a 4th AI pass. |
| 4th+ iteration | Human (preferred) | Persistent defects often signal a misunderstanding of context, system constraints, or spec gaps. | Human fixes the issue, then updates the spec so the same pattern doesn't repeat on the next story. |
| Always human | Human only | Architecture decisions, ADRs, postmortem conclusions, regulatory sign-off, incident command. | N/A — these are non-delegable by design. |
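The thresholds in the table can also be enforced as policy rather than remembered. A sketch that restates them as configuration; the schema is hypothetical:

```yaml
# Hypothetical iteration policy; thresholds mirror the table above
iteration_policy:
  max_ai_iterations: 3
  proceed_gates:
    after_iteration_1: "spec coverage >= 70%, no hallucinated APIs or misread context"
    after_iteration_2: "spec coverage >= 90%, no open security findings"
  on_fourth_attempt: "a human diagnoses root cause before any further AI pass"
  after_human_fix: "update the governing spec so the pattern does not repeat"
  always_human:
    - architecture decisions and ADRs
    - postmortem conclusions
    - regulatory sign-off
    - incident command
```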
Greenfield vs Brownfield
The loop is the same, but the emphasis shifts hard depending on whether you're building on empty ground or modifying a system that already has users. Treat them as different sports.
🌱 Greenfield · Generate
- Heavy use of scaffolding agents — full service skeletons from PRD + specs
- Architecture agent proposes candidates; you choose and ADR
- Reuse platform scaffolds aggressively
- Front-load design and architecture; constraints stick for the system's life
- Speed of iteration matters more than reversibility
- Compound win: each project contributes back to scaffolds and reference architectures
🏗️ Brownfield · Comprehend, then change
- Codebase-comprehension agent first — build a knowledge model before any change
- Characterization tests required before refactor
- Small, reversible diffs only. No big-bang rewrites
- Feature-flag-controlled cutovers; explicit rollback paths
- Domain-expert review where agent confidence is low
- Compound win: modernization_spec grows; the second migration is faster than the first
The Decision Rule
| Question | Greenfield treatment | Brownfield treatment |
|---|---|---|
| Existing codebase to integrate with? | No, or only at well-defined boundaries | Yes, with deep coupling |
| Current system understood? | N/A | Imperfectly; comprehension is part of the work |
| Cost of breaking existing behaviour? | Low | High; users and SLAs depend on it |
| Test coverage of affected area? | Build from scratch with verification agent | Often thin; characterization tests required first |
| Primary AI emphasis | Generation and scaffolding | Comprehension, characterization, incremental change |
| Spec emphasis | prd_spec, architecture_spec, code_spec | modernization_spec, ops_spec, code_spec evolution |
| Risk posture | Speed of iteration | Reversibility and small blast radius |
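For brownfield work, the comprehend-then-change posture can be captured as a migration plan the agents must follow. A sketch reusing names from earlier examples; the steps and percentages are illustrative, not a template from modernization_spec:

```yaml
# Illustrative brownfield migration plan; steps and percentages are made up
migration: "flag-admin-service cache layer -> @platform/tenant-cache"
preconditions:
  - characterization tests cover current cache behaviour
  - codebase-comprehension agent has produced a dependency map
steps:
  - introduce @platform/tenant-cache behind a feature flag (dark launch)
  - dual-write to old and new caches; compare reads offline
  - cut over the read path for 5% of tenants; watch the error budget
  - complete the cutover; keep the rollback flag for two releases
rollback: "disable the feature flag; the old path remains intact"
compound: "promote this plan into modernization_spec as a near-template"
```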
Anti-Patterns to Spot Early
These are failure modes we've seen across the industry and inside our own teams. Each is a path of least resistance — easy to fall into, expensive to walk back.
| Anti-Pattern | What it looks like | The fix |
|---|---|---|
| 🚩 The prompt sprawl | Every developer writes their own prompts in their own style. Outputs drift. Nothing compounds because there's nothing to compound into. | Promote good prompts into shared spec sections. Treat one-off prompts as drafts on the way to specs. |
| 🚩 The skipped Compound | "We're behind, let's just ship and circle back." Two months later: still shipping, still behind, system hasn't improved. | Compound is part of definition-of-done, not after it. If you can't compound today, decide explicitly that this story is a repeat — don't drift. |
| 🚩 The over-eager agent | The agent ships a 1500-line PR that "refactors while we're at it." Reviewable diffs become unreviewable diffs. | Bound the scope in the story. Reject scope creep at review. Decompose large rewrites; don't accept them in one PR. |
| 🚩 The plausible mistake | The agent invents a function, an API, or a library that doesn't exist — but it looks right. You merge. CI catches it. Or worse: it doesn't. | Always run generated code against the actual project. Trust nothing that hasn't compiled or linted in your environment. |
| 🚩 The frontier-model default | Routine tasks hit the most expensive model. Costs balloon. Latency degrades. Routine work blocks behind frontier-model queue. | Tier the routing. Smaller models for repetitive work. Frontier only for genuinely complex, ambiguous, or cross-artifact work. |
| 🚩 The tribal spec | One person owns the spec, edits it from gut feel, doesn't review changes. The spec becomes their preferences in document form. | Specs go through PR review. Changes cite incidents, PRs, or evidence. Two approvers for shared specs. Quarterly pruning. |
| 🚩 The skill atrophy | Engineers stop debugging hard problems because the agent always tries first. Real expertise erodes; the team can't operate without agents. | Rotate engineers through "no-agent" debugging weeks. Require human authorship of high-impact ADRs and postmortems. |
| 🚩 The infinite iteration trap | The same defect is retried 5+ times with slight prompt tweaks. No human diagnoses the root cause. Time is lost; spec gaps compound silently. | Cap AI iterations at 3. On the 4th pass, a human diagnoses first. The insight goes back into the spec before the next story. |
| 🚩 The WIP overflow | Agents generate so fast that review queues overwhelm the team. Ten stories are "in review" simultaneously; none are truly done. | Apply Kanban WIP limits to the review column, not just to development. Agent speed without review discipline creates the illusion of progress. |
Metrics That Matter
Four families of metrics: delivery (familiar), AI-native, compounding, and Kanban flow. Track them all. The compounding metrics are how you'll know the practice is working — delivery metrics alone can be misleading in early phases.
What Good Looks Like at 12 Months
The Four Metric Families
| Family | Metric | Phase 1 target | Phase 3 target |
|---|---|---|---|
| Delivery | Lead time per feature (vs baseline) | −20 to −30% | −70% or more |
| Delivery | Change failure rate | Flat | Flat or better |
| Delivery | MTTR | Flat | −50% |
| Delivery | Deployment frequency | +50% | +200% |
| AI-Native | % AI-generated code/tests/docs | 40–60% | 70–85% |
| AI-Native | Human review time per PR | Flat | −30% |
| AI-Native | Agent rework rate | <25% | <10% |
| AI-Native | AI cost per story | Tracked | Below phase 1 baseline |
| AI-Native | Avg. AI iterations per story | Tracked | ≤2.5 (signal of spec quality) |
| Compounding | Reuse rate | Tracked | >60% |
| Compounding | Spec updates per sprint (impact-weighted) | ≥3 per team | ≥5 per team |
| Compounding | Time to ship Nth similar feature vs first | Tracked | −50% by N=3 |
| Compounding | Repeat-incident-class rate | Tracked | Trending to zero |
| Kanban | Cycle time per stage | Tracked by stage | WIP-limit violations trending to zero |
| Kanban | Review queue depth | <5 cards simultaneously | <3 cards simultaneously |
Self-Assessment
Eight questions about your team. Answer honestly — there's no scoring police. The result will tell you whether you should focus on foundations, scaling, or refinement.
Team Readiness Diagnostic
~3 minutes. Answer for your immediate team, not the whole company.
The 8-Week Path to Your First Compound
If your team is starting from AI-assisted today, here's a concrete path. Don't try to do everything at once. The point is to ship a real loop and feel the compound, not to roll out a framework.
Weeks 1–2 · Adopt the language
- Read this playbook end-to-end as a team
- Pick one feature for the pilot loop
- Identify the spec sections you'll need
- Set up cost reporting per story
- Set up your Kanban board with SAND stage columns
Weeks 3–4 · Run the first loop
- Convert the pilot story to AI-ready format
- Run Plan → Work → Review with explicit ownership
- Force the Compound step at the end
- Tag every artifact in the PR description
- Set a WIP limit on the Review column
Weeks 5–6 · Scale the loop
- Run 3–5 stories under the model
- Add multi-agent review on non-trivial PRs
- Measure: lead time, rework rate, cost, AI iterations
- Update specs with what worked
- Tier your model routing for the first time
Weeks 7–8 · Demonstrate compound
- Ship a second instance of a similar story
- Measure how much faster it was
- Demo the spec diffs at sprint review
- Onboard one neighbouring team
Definition of Done
The team's definition of done evolves to match what an AI-native loop is expected to produce. Print this. Stick it next to your sprint board. Argue from it.
A story is done when…
- Code is merged and meets code_spec.
- Tests cover the acceptance criteria and any new edge cases; risk-weighted coverage is adequate.
- Documentation is regenerated and reviewed; customer-facing docs are reviewed by tech-writing where applicable.
- Diagrams reflecting the change are current.
- Multi-agent review has run on non-trivial PRs with no unresolved findings; human review is recorded.
- Compound deliverables are explicit: spec updates, new patterns, new tests — or a recorded note that nothing applies (and why).
- Cost is recorded and within budget for the story.
- The PR description names the inputs (specs, ADRs) and the agent runs that produced the change.
- Model routing is recorded: which tier was used, and why (captures cost and complexity signals).
- AI iteration count is recorded; if ≥ 4 iterations were required, a human diagnosis note is included.
- WIP limits respected throughout: the story did not sit blocked in any stage column beyond the agreed SLA.
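Several of these checks can be enforced mechanically before merge. A hypothetical gate file a CI job might evaluate; this is not the syntax of any real CI product:

```yaml
# Hypothetical merge gates derived from the definition of done above;
# not the syntax of any real CI system
done_gates:
  - check: spec_compliance
    require: "PR cites governing spec sections; multi-agent review has no unresolved findings"
  - check: tests
    require: "acceptance criteria and new edge cases are covered"
  - check: docs_and_diagrams
    require: "regenerated and reviewed"
  - check: compound_deliverable
    require: "spec/pattern/test diff linked, or a recorded 'nothing applies' note with reason"
  - check: cost_and_routing
    require: "cost within budget; model tier and AI iteration count recorded"
  - check: wip_sla
    require: "story did not sit blocked in any stage beyond the agreed SLA"
```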
Governance & Continuous Learning
This playbook is a living system. It should improve every quarter based on what we ship, what breaks, and what we learn. The compound principle applies to the playbook itself.
For Individual Contributors
- Follow the Plan → Work → Review → Compound loop on every story
- Treat spec updates as first-class deliverables, not optional follow-ups
- Cite inputs (specs, ADRs) and agent runs in every PR description
- Surface anti-patterns and spec gaps the moment you spot them
- Spend an hour a week reading other teams' Compound diffs
- Record model tier used per story; flag routing anomalies
For Tech Leads & Architects
- Run Plan and Compound for the team — these are not delegable
- Calibrate multi-agent reviewers quarterly; tighten when too lenient
- Audit one randomly selected agent run per week
- Promote patterns across teams; resist forking
- Coach on judgment in reviews, not just activity
- Review WIP limits and stage cycle times monthly; adjust to team capacity