Way of Building · Developer Playbook · Version 2

The AI-Native
Developer Playbook

⬡ The SAND Framework — Stagewise AI‑Native Development

A spec-driven, AI-native way of building software, optimised for Kanban flow and compound engineering. Work moves through clearly defined stages — Requirements, PRD, Design, Code, Tests, Docs, Deployment, and Operations — guided by reusable specs and AI agents at every step.

A field-ready guide for developers, architects, and product managers — moving from AI-assisted habits to a systematic, compound way of building software, built on the Plan → Work → Review → Compound loop.

⬡ SAND Framework · 👩‍💻 Developers · Architects · PMs · 🔁 Compound Engineering · 🧠 Specs over Prompts · 🤝 Human-Led · Agent-Powered · 🎯 Kanban-Aligned · ⚡ v2 — Innovation Labs

The loop: Plan → Work → Review → Compound
Section 1

Why This Playbook Exists

If you ship code with a copilot and use a chat model to draft tests or docs, you're already AI-assisted. That's the easy part. The harder, more valuable part — the one this playbook is about — is becoming AI-native: making AI a structural part of how you design, build, review, ship, and operate software, in a way that gets better with every loop instead of plateauing. The SAND Framework is the system that makes that happen.

The SAND Framework: Stagewise AI-Native Development is our spec-driven, AI-native way of building software, optimised for Kanban flow and compound engineering. Engineers, designers, and PMs focus on intent, architecture, and review — while agents handle generation, refactoring, and routine tasks. Every stage is governed by a reusable spec; every loop compounds into the next.
How to use this playbook: Read it once cover-to-cover to absorb the shape of the practice. Come back to specific sections when you're starting a new project, onboarding a teammate, or stuck in a review. Print sections 8, 9, and 17 — they are designed to live next to your screen.

You'll know the practice is working when…

Specs live in git

Your team's specifications are versioned alongside code, reviewed in PRs, and referenced in stories — not stashed in someone's notebook.

Stories direct agents

Every user story names its inputs, target artifacts, governing specs, and expected compound outcome — clear enough that an agent can act without a meeting.

Retros produce diffs

Every retro ends with a list of artifacts you wrote back into the system: spec updates, new patterns, regression tests, refined prompts.

The Compound Principle: Every loop produces two things — the feature, and an improvement to the system that produced it. Skip the second one and you're just doing AI-assisted work faster.
Section 2

The Shift, in Plain Terms

The move from AI-assisted to AI-native is mostly about replacing two old habits with two new ones. Hold these in your head as you read the rest of the playbook.

Where most teams are today
  • Free-form prompts, written fresh per task
  • AI output is a one-off draft you copy-paste
  • Tests, docs, and diagrams are downstream chores
  • What worked last sprint is in someone's head
  • Reviews catch defects but rarely improve the next loop
Where you're going with SAND
  • Versioned specs govern every agent invocation
  • AI output is a reviewable artifact in the pipeline
  • Tests, docs, and diagrams are first-class deliverables
  • What worked is written back into specs and patterns
  • Each loop leaves the system measurably better

The Three Terms You'll See Throughout

Spec

A versioned, reviewable document that constrains what an agent does. Lives in git. Examples: prd_spec, code_spec, qa_docs_spec.

Agent

A narrowly scoped AI worker invoked under a spec, producing reviewable artifacts (PRDs, code, tests, docs, diagrams). Logged, costed, replayable.

Compound

The step where you write learnings back into specs, patterns, agents, and tests so the next loop starts ahead of where this one ended.

Section 3

The Loop You'll Live In

Every unit of work — a story, a bugfix, a refactor, a release — runs the same four-step loop. Steps 1–3 deliver the change. Step 4 is what makes the system get better.

Step 1

Plan

Frame the work. Name the artifact being transformed, the spec that governs it, and what success looks like.

Step 2

Work

Agents produce the primary output under spec control. You contribute what they're bad at.

Step 3

Review

Multi-agent critics and human reviewers assess against spec and constraints. Defects are fixed in place.

Step 4

Compound

Write learnings back: spec updates, new patterns, new tests, refined prompts, documented anti-patterns.

What "Compound" Actually Looks Like

Compound is not a discussion or a retrospective sticky note. It produces concrete artifacts. After every story, you should be able to point at a diff against one of these:

Tangible compound deliverables

  • A new section, example, or rule added to a spec
  • A reusable snippet promoted into a shared library
  • A new regression test added to the cross-team suite
  • A refined prompt, system message, or agent definition
  • A documented anti-pattern with explicit rationale

The honest exception

If a story ends with no diff against any of these, ask: was this loop genuinely identical to one we've run before? If yes, fine — but that should be rare. If you're skipping Compound every story, you're not compounding; you're just delivering.

Section 4

Eight Operating Principles

These are the principles you cite in design reviews and PR threads. They're how disagreements get settled without re-litigating philosophy every time.

1. Specs over prompts

Prompts are tactical. Specs are versioned, reviewed, reused. If something matters, it goes in a spec.

2. Every loop compounds

Work isn't done until the Compound step has produced a concrete improvement to the system.

3. Humans decide. Agents produce.

Architectural choices and approvals stay with humans. Generation, refactoring, and repetitive work move to agents.

4. Reviewable diffs only

Agents produce changes small enough to review with clear blast radius. Large rewrites are decomposed.

5. Routing matches risk

Frontier models for ambiguous, high-risk, cross-artifact work. Cheaper models for routine, repetitive work.

6. Default to determinism

Where outputs can follow schemas or templates, prefer that over open-ended generation.

7. Reuse beats regeneration

If a pattern, snippet, or spec exists, the agent must use it. Regenerating from scratch is a smell.

8. Cost is first-class

Cost per story and per stage are tracked alongside lead time and quality. Surprise bills are bugs.

Section 5

Specs Are the Source of Truth

If you take one habit from this playbook, take this one: write a spec, not a prompt. A prompt is a snowflake — it works for a moment and disappears. A spec is a contract — it lives in git, it's reviewed, it's versioned, it's reused, and it gets sharper every time someone uses it.

The Specs You'll Touch Most

| Spec | What It Encodes | Owned By | Used By |
|------|-----------------|----------|---------|
| prd_spec | Templates, domain language, NFR patterns, acceptance-criteria style | Product | PRD agent, consistency agent |
| code_spec | Tech stack, architecture style, coding standards, security & observability norms | Architecture group + all engineers | Build agent, multi-agent reviewers |
| qa_docs_spec | Test strategies, coverage rules, doc tone and structure | QA + Tech-writing | Verification agent, doc agent |
| design_system_spec | Tokens, components, accessibility rules, microcopy guidelines | Design | Wireframe / microcopy agents |
| deployment_spec | Pipeline templates, rollout strategies, risk classification rules | SRE | Pipeline agent, risk-classification agent |
| ops_spec | Alerts, SLOs, runbooks, remediation policies | SRE | Monitoring & remediation agents |
| modernization_spec | Refactor patterns, migration playbooks, deprecation policies | Architects | Modernization agent |

What a Real Spec Snippet Looks Like

This is a redacted slice of a code_spec — the section that governs how cache layers should be built for multi-tenant services. Notice it's not a blob of free-form prose; it's structured enough that an agent can actually use it.

# Section 7.3 — Multi-tenant cache layers
applies_to: [service, library]
when: "component reads/writes cached data scoped to a tenant"

required_patterns:
  - name: tenant-keyed cache keys
    rule: "all keys MUST embed tenant_id as the first segment"
    example: "flag:{tenant_id}:{flag_key}"
  - name: bounded invalidation fan-out
    rule: "invalidation MUST NOT exceed N keys per call (N=1000)"
    rationale: "see incident I-2024-117 — unbounded fan-out caused regional outage"
  - name: audit on write
    rule: "every write emits an audit event with actor, tenant, before/after"

anti_patterns:
  - "global cache keys without tenant prefix (cross-tenant leak risk)"
  - "cache-aside without negative-result caching (thundering herd)"
  - "premature interfaces around the cache client (over-engineering)"

reusable_components:
  - "@platform/tenant-cache (preferred)"
  - "@platform/audit-emitter"

tests_required:
  - "isolation: tenant A cannot read tenant B's keys"
  - "invalidation bound: fan-out > N raises error before execution"
  - "audit completeness: every write produces matching audit event"
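The tests_required list above maps directly onto code. Here is a minimal Python sketch of the behaviour this spec section demands from a cache wrapper; the names (TenantCache, MAX_FANOUT, FanoutExceeded) are invented for the sketch and are not the real @platform/tenant-cache API:

```python
# Illustrative sketch of the behaviour Section 7.3 requires. Names are
# invented for the sketch, not the real @platform/tenant-cache API.

MAX_FANOUT = 1000  # the "N" in the bounded-invalidation rule

class FanoutExceeded(Exception):
    """Raised before execution when invalidation would exceed the bound."""

class TenantCache:
    def __init__(self):
        self._store = {}

    def _key(self, tenant_id, key):
        # Required pattern: tenant_id MUST be the first key segment.
        return f"{tenant_id}:{key}"

    def put(self, tenant_id, key, value):
        self._store[self._key(tenant_id, key)] = value

    def get(self, tenant_id, key):
        # Isolation falls out of the prefix: a caller can only address
        # keys under its own tenant segment.
        return self._store.get(self._key(tenant_id, key))

    def invalidate(self, tenant_id, keys):
        # Fan-out is checked *before* any deletion executes.
        if len(keys) > MAX_FANOUT:
            raise FanoutExceeded(f"fan-out {len(keys)} > {MAX_FANOUT}")
        for key in keys:
            self._store.pop(self._key(tenant_id, key), None)
```

The spec's isolation and invalidation-bound tests follow directly from this shape; the audit-on-write rule is omitted here for brevity.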

How to Write Your First Spec Section

1. Wait for the second instance

Don't speculate. The first time you do something, just do it. The second time, ask: is this a pattern? If yes, write it down.

2. Start with structure, not prose

Rules, examples, anti-patterns, required tests. Agents can act on lists; they can't act on essays.

3. Cite the incident or PR

Every rule has a reason. Link to it. "See incident I-2024-117" beats "this is important."

4. Get it reviewed

Specs go through PR review like code. Two approvers minimum for shared specs. Treat updates as first-class deliverables — tag them in your PR description.

Section 6

The Pipeline, Stage by Stage

The SAND Framework breaks every feature into nine stages. At each stage there's a spec, an artifact going in, and an artifact coming out. The agents change; the loop is the same. We'll trace a single feature — a tenant-scoped audit log for a feature flag service — through every stage so you can see the actual hand-offs.

⬡ SAND Framework · All Stages at a Glance

| Stage | Name | Spec | Who | Output |
|-------|------|------|-----|--------|
| S1 | Discovery | requirements_spec | Human | Goals · Risks |
| S2 | PRD | prd_spec | PM + Agent | Structured PRD |
| S3 | Design & Arch | architecture_spec | Architect + Agent | ADRs · C4 |
| S4 | Code | code_spec | Dev + Agent | Code · IaC |
| S5 | Tests | qa_docs_spec | QA + Agent | Tests · Props |
| S6 | Docs | content_spec | Writer + Agent | Docs · Diagrams |
| S7 | Deployment | deployment_spec | SRE + Agent | CI/CD · Risk |
| S8 | Operations | ops_spec | SRE + Agent | SLOs · Runbooks |
| S9 | Modernize | modernization_spec | Architect | Refactor · Migrate |

Every stage: Plan → Work → Review → Compound · Stage 9 is continuous, not sequential · Each stage governed by a versioned spec
Stage 1 · Discovery · Spec: requirements_spec

Discovery & Requirements

Capture business goals, user needs, constraints, and risks at a level of clarity sufficient to drive PRD generation.

You do

  • Articulate goals, target users, success metrics
  • Identify hard constraints (regulatory, integration, performance)
  • Surface known risks and dependencies

Agents do

  • Discovery agent produces a structured brief from notes
  • Gap-analysis agent flags missing NFRs and edge cases
  • Risk agent surfaces likely failure modes

Compound

  • New question categories added to requirements_spec
  • Domain glossary grows
  • Missed patterns from human reviewers added to gap-analysis checklist
Audit-log example: The gap-analysis agent flags missing requirements: retention period, query SLAs, who can read other tenants' logs, GDPR-style deletion. PM resolves these before the PRD agent runs.
Stage 2 · PRD · Spec: prd_spec

Product Requirements Document

Convert the brief into a structured, reviewable PRD that downstream stages can act on directly.

You do

  • Author goal narrative, success metrics, prioritization
  • Approve the structured PRD

Agents do

  • PRD agent generates the structured PRD
  • Consistency agent checks against organizational standards
  • Traceability agent links sections to upstream and downstream artifacts

Compound

  • Sections reviewers had to add by hand become new templates
  • Ambiguous phrasings caught downstream get blacklisted
Audit-log example: PRD comes out with capability sections (write, query, retention, deletion), NFRs (write latency < 100ms), Given-When-Then acceptance criteria. Consistency agent flags missing retention policy.
Stage 3 · Design · Specs: architecture_spec, design_system_spec

Design & Architecture

Decide the architectural shape, key patterns, and UX direction; produce design and architecture artifacts that constrain implementation.

You do

  • Architects make decisions and write ADRs
  • Designers own user journeys and IA
  • Security architects do threat modeling

Agents do

  • Architecture agent generates 2–3 candidates with trade-offs
  • Diagram agent produces C4 views
  • Design agent produces wireframes against the design system
  • Threat-model agent drafts STRIDE analysis

Compound

  • Chosen pattern becomes a reference architecture
  • Rejected candidates with hot-spot risks become documented anti-patterns
Audit-log example: Three candidate designs — synchronous-write, async via queue, hybrid with local buffer. The hybrid wins; ADR-014 is recorded and the pattern is added to architecture_spec.
Stage 4 · Implementation · Spec: code_spec

Implementation

Produce the code, IaC, and initial tests that realize the approved design.

You do

  • Frame each unit of work; decompose into reviewable diffs
  • Review agent-generated code critically
  • Handle complex debugging, novel algorithms, performance work

Agents do

  • Build agent generates diffs from PRD + design + code_spec + repo context
  • Multi-agent review runs in parallel: security, performance, over-engineering, style
  • Test-scaffold agent produces unit and contract tests

Compound

  • Corrected patterns added to code_spec
  • Anti-patterns documented with rationale
  • Reusable code goes into shared platform libraries
Audit-log example: Build agent produces the audit emitter, async pipeline, and storage layer. Security reviewer flags missing tenant scoping on the read path. Over-engineering checker flags an unnecessary abstraction. Both fixed in a second pass before human review.
Stage 5 · Testing · Spec: qa_docs_spec

Testing & Quality

Verify the implementation meets the PRD and NFRs; grow the regression net.

You do

  • Design test strategy and risk-based coverage
  • Identify edge cases agents miss
  • Design what cannot be automated

Agents do

  • Verification agent generates unit, integration, contract tests
  • Property-based agent proposes invariants
  • Flakiness detector and coverage agent run continuously

Compound

  • Edge cases become templates for similar components
  • Regression suite grows with every loop
  • Property invariants become checklists for similar work
Audit-log example: Property-based agent proposes "every write produces exactly one audit event with matching fields." QA engineer adds an edge case: partial-failure on the queue must not produce phantom audits. Both go into the regression suite.
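The invariant in the example above can be checked mechanically. A minimal sketch using a hand-rolled randomised check in place of a property-based testing library; write_flag and the list-backed audit sink are stand-ins, not the real service API:

```python
import random
import string

# Stand-in write path and audit sink; the real service would write to
# storage and emit through the async audit pipeline.
audit_events = []

def write_flag(tenant_id, flag_key, before, after):
    audit_events.append({"tenant": tenant_id, "flag_key": flag_key,
                         "before": before, "after": after})

def check_invariant(trials=200):
    """Every write produces exactly one audit event with matching fields."""
    for _ in range(trials):
        tenant = "".join(random.choices(string.ascii_lowercase, k=6))
        flag = "".join(random.choices(string.ascii_lowercase, k=6))
        before, after = random.choice([True, False]), random.choice([True, False])
        seen = len(audit_events)
        write_flag(tenant, flag, before, after)
        new = audit_events[seen:]
        assert len(new) == 1, "expected exactly one audit event per write"
        assert new[0] == {"tenant": tenant, "flag_key": flag,
                          "before": before, "after": after}
    return True
```

A property-based tool would generate the inputs and shrink failures automatically; the shape of the invariant stays the same.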
Stage 6 · Documentation · Specs: qa_docs_spec, content_spec

Documentation & Diagrams

Produce and maintain docs, diagrams, and operational artifacts.

You do

  • Tech writers and architects review for tone, accuracy, audience fit
  • SREs own runbooks
  • Approve customer-facing docs

Agents do

  • Doc agent generates API reference, READMEs, FAQs, changelogs
  • Diagram agent regenerates C4 views from code
  • Runbook agent drafts initial runbooks
  • Customer-facing doc agent works against a voice-tuned spec

Compound

  • Frequently asked questions are pulled into qa_docs_spec
  • Confusing phrasings caught in support tickets become things-to-avoid in content_spec
Audit-log example: Internal README and API reference are generated. Tech writer edits the customer-facing audit-log guide for tone. Architect adds two human-only steps to the runbook (manual replay sign-off).
Stage 7 · Deployment · Spec: deployment_spec

Deployment & Release

Promote changes safely with appropriate gates and rollback paths.

You do

  • Release managers approve promotion
  • EMs own release decisions
  • SREs own the deployment platform

Agents do

  • Pipeline agent generates and maintains CI/CD, IaC, manifests
  • Risk-classification agent assigns risk levels and recommends rollout
  • Release-notes agent composes notes from PRs and ADRs

Compound

  • Successful canary patterns become "safe templates"
  • Metrics that should have triggered a rollback but didn't are added as rollback triggers
Audit-log example: Risk agent classifies the change as medium risk, recommends 10% canary on a single tenant for 24h. Release manager approves and signs off after each stage.
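A risk-classification agent's core rule set can be small. A hypothetical sketch: the touch categories, risk mapping, and rollout parameters are invented for illustration, not drawn from a real deployment_spec:

```python
# Hypothetical risk-classification rules. Categories and thresholds are
# illustrative, not a real deployment_spec.

def classify(change):
    """Map a change's blast radius to a risk level and rollout plan."""
    touches = set(change.get("touches", []))
    if touches & {"auth", "data-migration", "public-api"}:
        risk = "high"
    elif touches & {"service-code", "config"}:
        risk = "medium"
    else:
        risk = "low"

    rollout = {
        "high":   {"strategy": "canary",  "traffic": "1%",   "bake": "48h"},
        "medium": {"strategy": "canary",  "traffic": "10%",  "bake": "24h"},
        "low":    {"strategy": "rolling", "traffic": "100%", "bake": "1h"},
    }[risk]
    return risk, rollout
```

Under these invented rules, a change touching service code and config classifies as medium risk with a 10% canary and a 24h bake, the same shape of recommendation as the audit-log example.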
Stage 8 · Operations · Spec: ops_spec

Operations & Incidents

Keep production healthy, detect incidents early, convert every incident into durable improvement.

You do

  • SREs own SLOs, on-call, postmortems
  • Engineering teams own service health
  • Incident commanders run major incidents

Agents do

  • Monitoring agent correlates signals, surfaces anomalies
  • Remediation agent proposes diagnoses and fixes
  • Postmortem agent drafts timelines

Compound

  • Every incident produces durable updates: alerts, SLOs, runbooks, tests
  • Repeat-incident-class rate trends to zero
Audit-log example: Monitoring agent detects elevated audit-write latency in one region, correlates with a recent config change. Remediation agent proposes rollback. SRE approves; rollback runs; postmortem produces three durable updates.
Stage 9 · Modernization · Spec: modernization_spec

Continuous Modernization

Keep systems healthy and changeable: upgrade dependencies, remove dead code, refactor toward simpler designs.

You do

  • Architects and tech leads decide scope and risk appetite
  • EMs integrate modernization into the backlog

Agents do

  • Modernization agent proposes incremental refactors
  • Dependency-graph agent maintains the system view
  • Migration-plan agent generates step-by-step plans with rollback paths

Compound

  • Each refactor must leave the system simpler or test net stronger
  • Migration playbooks become near-templated for the next service
Audit-log example: Twelve months in, the audit-log service's queue library is superseded. Modernization agent proposes a migration PR with rollback plan. Architect approves; it ships through the standard pipeline.
Section 7

AI-Ready User Stories

An AI-ready story gives an agent enough to start without a meeting. The familiar "As a... I want... so that..." stays. Four things get added: the input artifact, the target artifacts, the spec sections that govern the work, and the loop position.

A Real Story, in the Format

# Add tenant-scoped audit log to feature flag service

user_narrative: |
  As a release manager, I want every flag change recorded with
  actor, tenant, before/after value, and timestamp, so that I can
  audit changes during incidents.

inputs:
  - PRD §4.3 (audit)
  - ADR-014 (audit-log architecture)
  - code_spec §7 (audit logging)
  - qa_docs_spec §3 (async event tests)

target_artifacts:
  - code change in flag-admin-service
  - contract test (audit-log API)
  - integration test (write path → audit emission)
  - API doc update + runbook update

acceptance_criteria:
  - Given a tenant admin updates a flag, when the update is
    committed, then an audit record is written within 100ms.
  - The record contains all required fields (actor, tenant,
    flag_key, before, after, timestamp).
  - Queries by tenant return only that tenant's records.

loop_position: Work + Review

compound_expectation: |
  If audit-emission helpers are reused, promote into
  @platform/audit-emitter and update code_spec §7.

cost_budget: "$8 / story (est.)"
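Because the format is structured, story readiness can be linted before an agent picks up the card. A hypothetical sketch using the field names from the snippet above; the rules paraphrase this section's discipline and are not an official schema:

```python
# Hypothetical readiness lint for AI-ready stories. Field names mirror
# the story snippet above; the rules are illustrative, not a schema.

REQUIRED_FIELDS = [
    "user_narrative", "inputs", "target_artifacts",
    "acceptance_criteria", "loop_position",
    "compound_expectation", "cost_budget",
]

def lint_story(story):
    """Return a list of problems; an empty list means agent-ready."""
    problems = [f"missing or empty field: {f}"
                for f in REQUIRED_FIELDS if not story.get(f)]
    # "Code is rarely the only deliverable": nudge when no test artifact
    # appears among the targets.
    artifacts = story.get("target_artifacts") or []
    if artifacts and all("test" not in a for a in artifacts):
        problems.append("no test artifact listed; is 'done' code-only?")
    return problems
```

A check like this makes "if you can't list the inputs, the story isn't ready" enforceable at board level rather than a matter of reviewer memory.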

The Discipline Behind the Format

Inputs are explicit

The agent shouldn't have to guess which spec applies. If you can't list the inputs, the story isn't ready.

Target artifacts are listed up front

Code is rarely the only deliverable. Tests, docs, runbook updates are part of "done."

Acceptance criteria are testable

"Given/When/Then" or similar. If a human can't write a test from it, neither can an agent.

Compound expectation is named

If you can predict what should be promoted into a spec or library, write it down. Otherwise flag the story as exploratory.

Cost budget sets a ceiling

If the agent burns through it, that's a signal to pause and re-plan, not to keep going.

Loop position is named

Tells reviewers what to look for. A "Plan" story is reviewed differently from a "Work" story.

Section 8

Reviewing Agent Output

Reviewing agent-generated work is a different skill from reviewing human-generated work. Agents are confident, prolific, and locally consistent — which means defects are often plausible. Your job isn't to read every line. It's to ask the questions that catch what plausibility hides.

The Six Questions, in Order

Run these in order on every agent-produced PR. The order matters: cheap checks first.

1
Did it follow the spec?

Open the PR alongside the spec sections cited in the story. Walk through them. Is every required pattern actually applied? If a spec section is missing from the PR, ask why before reading further.

2
Where is the blast radius?

What does this PR touch: data, security, public APIs, infra, internal-only? Match review depth to blast radius. A pure docs PR doesn't need a 90-minute review. A change to the auth path does.

3
What's the agent not showing me?

Agents tend to omit the unglamorous: error paths, partial-failure handling, observability hooks, audit logging, edge cases on inputs. If you don't see them, ask explicitly.

4
Is anything too elegant?

Suspiciously clean abstractions, premature interfaces, novel patterns where boring ones would do. Over-engineering is the most common AI-generated defect. Push back hard.

5
What's the compound win?

Before approving: what spec, library, or test should this PR feed? If nothing, why not? The Compound deliverable is part of the PR, not a follow-up.

6
Did the critics earn their keep?

Look at the multi-agent review output. If every critic returned "looks good," be suspicious — they may be too lenient. Tighten their prompts in the Compound step.

Section 9

The Compound Step

The Compound step is where AI-native delivery diverges from "AI-assisted faster." It's also the step under the most pressure to be skipped. The story is shipped, the reviewer is satisfied, the next story is waiting. Twenty minutes spent updating a spec feels like a tax. It isn't. It's the principal.

End-of-Story Compound Check

~10 min · Per story
  • A pattern emerged that's likely to recur — added to the relevant spec.
  • A reusable snippet was extracted into a shared library or module.
  • A new edge case was caught — regression test added to the suite.
  • An anti-pattern was rejected — documented in the spec's anti-patterns section with rationale.
  • An agent prompt or system message was refined — change committed and noted.
  • A spec gap was found — issue filed for the spec owner with a concrete proposal.
  • An incident or near-miss occurred — postmortem entry made with monitoring/runbook updates.
  • A cost surprise occurred — routing rule or batching strategy updated.
  • Nothing applies — story is genuinely a repeat of one we've shipped before. (Be suspicious of this answer.)
Definition by demonstration: If your retro ends without a list of artifacts you wrote back into the system, you didn't compound. You just delivered.
Section 10

Roles & What Changes for You

The shift looks different from each seat. Pick yours below; the others are useful too — knowing what your teammates are leaning into is half of working well together.

DEV
Developer
From individual contributor to compound engineer.

Your craft doesn't disappear — it concentrates. The judgment calls about decomposition, debugging, and design get more of your hours. The boilerplate gets less. The Compound step is where you make your team faster, not just yourself.

✓ Lean into

  • Decomposing work into reviewable units
  • Reviewing agent output critically
  • Debugging hard, novel problems
  • Shaping code_spec
  • Mentoring & capturing patterns
  • Performance and integration work

✕ Step away from

  • Boilerplate and scaffolding
  • Repetitive test writing
  • Mechanical refactoring
  • Manual changelog maintenance
  • Hand-rolled doc updates
  • Acting as a typist for the agent
TL
Tech Lead
From PR-bottleneck to loop conductor.

Your team's speed is now bounded by how well it runs the loop, not by how fast you review. Spend your hours on Plan and Compound. Make the rest of the team great at Work and Review.

✓ Lean into

  • Architectural intent at story level
  • Resolving cross-cutting questions
  • Running Plan and Compound for the team
  • Ensuring team work feeds shared specs
  • Calibrating multi-agent review

✕ Step away from

  • Reviewing every routine PR
  • Manually maintaining team docs
  • Re-explaining patterns in chat
  • Status-collection meetings
ARC
Architect
From diagrammer to spec steward.

The diagrams now generate themselves from code. What doesn't generate itself is the judgment encoded in architecture_spec and code_spec. That's where your hours go. You're the steward of the constraints under which everyone else's agents operate.

✓ Lean into

  • Architecture decisions and ADRs
  • Owning architecture_spec & code_spec
  • Steering Compound across business units
  • Reviewing high-impact agent proposals
  • Cross-team pattern promotion

✕ Step away from

  • Drawing diagrams by hand
  • One-off architecture documents that go stale
  • Reviewing every routine PR
  • Being the only person who knows the why
QA
QA Engineer
From test-writer to test-strategist.

Agents write the repetitive tests. You design the strategy: what level, what edge cases, what risks justify what coverage. Your most valuable artifact isn't a test suite — it's a richer qa_docs_spec that everyone's verification agent reads from.

✓ Lean into

  • Test strategy and risk-based coverage
  • Designing edge cases agents miss
  • Owning qa_docs_spec
  • Auditing the regression library
  • Property-based testing design

✕ Step away from

  • Writing repetitive unit/integration tests
  • Maintaining fixtures and mocks by hand
  • Manual regression sweeps
  • Status reports the dashboard already shows
SRE
Site Reliability Engineer
From firefighter to feedback-loop designer.

Every incident is now a Compound opportunity. The remediation agent does the rote work; you design what it does and what it doesn't, and you write incidents back into ops_spec so the same class doesn't recur.

✓ Lean into

  • SLO design and incident command
  • Postmortem authorship
  • Owning ops_spec & deployment_spec
  • Designing remediation policies
  • Toil-reduction agent calibration

✕ Step away from

  • Manual signal correlation
  • Hand-writing every runbook
  • Repetitive remediations
  • Pipeline plumbing
PM
Product Manager
From PRD-writer to spec-author.

Your PRD is no longer a document — it's an input to a pipeline. Quality goes up when the PRD is structured enough that the PRD agent and the consistency agent can do most of the drafting. Your hours move toward goal-setting, prioritization, and growing prd_spec.

✓ Lean into

  • Goal articulation and success metrics
  • Customer empathy and prioritization
  • Owning prd_spec
  • Structured requirements briefs
  • Cross-team domain glossary

✕ Step away from

  • Drafting boilerplate PRD sections
  • Manual traceability matrices
  • Re-typing the same NFR templates
  • Status-collection meetings
DSN
Designer
From visual producer to system curator.

Agents will produce wireframes and microcopy. The design system, the tokens, and design_system_spec are what make those outputs good. Your most leveraged work is in the system, not the screen.

✓ Lean into

  • Owning the design system & tokens
  • Encoding accessibility into specs
  • Crafting content_spec for voice
  • Reviewing and refining agent UI
  • User-research synthesis & journeys

✕ Step away from

  • Producing every wireframe by hand
  • Re-writing similar microcopy from scratch
  • Maintaining the design library manually
  • One-off prototype builds
Section 11

SDLC & Kanban Alignment

The SAND Framework can work with any SDLC approach — including Waterfall and various Agile frameworks. But it is principally aligned with Kanban. The small, reviewable, spec-governed increments that SAND produces are a natural fit with Kanban's core philosophy of continuous flow, limited WIP, and relentless cycle-time optimisation.

Why SAND and Kanban Are a Natural Fit

The alignment in one sentence: SAND breaks every unit of work into the smallest reviewable increment that a spec can govern and an agent can produce — which is exactly what Kanban's WIP limits and cycle-time pressure demand.
Kanban Principle 1

Limit Work in Progress

Each SAND stage is a discrete, bounded column. Stories can't proceed until their stage artifact is reviewed and accepted. Agents producing reviewable diffs — not massive rewrites — keep each card genuinely small and completable within WIP limits.

Kanban Principle 2

Faster Cycle Time

Agents compress the Work phase. Specs eliminate the planning ramp-up on repeat patterns. The Compound step means the second similar story starts faster than the first. Cycle time doesn't just stay flat — it actively trends down.

Kanban Principle 3

Optimise for Flow

Blockers in traditional Kanban often come from waiting for humans to draft things. SAND moves that wait time to the agent, which is non-blocking. Human review is focused and fast because reviewable diffs have clear blast radius.

Kanban Principle 4

Continuous Improvement

Kanban requires you to make the process visible and improve it. The Compound step is the structural mechanism: every loop writes its improvement back into specs, libraries, and tests — exactly what a Kanban retrospective should produce.

Framework Compatibility Overview

🎯 Kanban

SAND's primary alignment. Small increments, WIP limits, flow optimisation, and the Compound step map directly to Kanban principles. The stagewise pipeline is a natural Kanban board layout.

Compatibility: Excellent (primary framework)

⚡ Scrum / Agile Sprints

Works well. Map stages to sprint ceremonies. The sprint cadence replaces Kanban's continuous flow — compound deliverables happen at sprint retro. WIP limits require explicit enforcement.

Compatibility: Good

🏗️ Waterfall / Stage-Gate

Compatible at the stage level — each SAND stage aligns with a waterfall phase. Compound is harder to enforce at pace. Large batch sizes reduce the benefit of agent-generated reviewable diffs.

Compatibility: Partial

Selective Model Usage — Routing AI to the Right Work

Not all work is equal, and not all AI models are equal in price or capability. Cost optimisation is a first-class SAND principle. Here's the routing logic.

Tier 1 · Frontier Models
e.g. GPT-4o, Claude Opus, Gemini Ultra

Complex & Novel Work

  • Ambiguous requirements needing deep reasoning
  • First-iteration architecture on greenfield problems
  • Cross-artifact traceability (PRD ↔ code ↔ tests)
  • High-blast-radius security or performance reviews
  • Novel algorithm or domain-specific logic generation
  • Postmortem root-cause analysis
Tier 2 · Capable Models
e.g. Claude Sonnet, GPT-4o-mini (large context)

Standard Development Work

  • PRD generation from a structured brief
  • 2nd and 3rd iteration on established patterns
  • Diagram generation from code
  • Test generation for known component types
  • Routine PR review against existing spec rules
  • API documentation from OpenAPI schema
Tier 3 · Lighter / Hosted Models
e.g. Claude Haiku, smaller fine-tuned models

Routine & Repetitive Tasks

  • Changelog generation from commit messages
  • Boilerplate code from templates
  • Formatting and linting correction
  • Simple unit test scaffolding (4th+ iteration)
  • FAQ generation from support tickets
  • Translation/localisation of known strings
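The tiering above can be sketched as a small routing table. This is an illustrative sketch only — the tier names, the task taxonomy, and the demotion rule are assumptions for the example, not a prescribed API:

```python
# Illustrative tiered routing for agent tasks. Tier names and the task
# taxonomy are assumptions for this sketch, not part of the framework.
TIER_1 = "frontier"   # complex, ambiguous, cross-artifact, high blast radius
TIER_2 = "capable"    # standard development work on established patterns
TIER_3 = "light"      # routine, repetitive, template-driven tasks

ROUTING_RULES = {
    "architecture_draft": TIER_1,
    "security_review": TIER_1,
    "prd_from_brief": TIER_2,
    "test_generation": TIER_2,
    "changelog": TIER_3,
    "lint_fix": TIER_3,
}

def route(task_type: str, iteration: int = 1) -> str:
    """Pick a model tier. Later iterations on established patterns are
    demoted to a cheaper tier, per the Tier 3 guidance above."""
    tier = ROUTING_RULES.get(task_type, TIER_2)  # unknown work: capable default
    if tier == TIER_2 and iteration >= 4:        # 4th+ pass on known patterns
        return TIER_3
    return tier
```

For example, `route("security_review")` stays on the frontier tier regardless of iteration, while `route("test_generation", iteration=4)` drops to the light tier — matching the "4th+ iteration" bullet under Tier 3.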

When to Let AI Iterate — and When to Hand Off to a Human

AI iteration is powerful but not infinite. The quality of AI output typically follows a curve: significant gains on early iterations, diminishing returns by iteration 3–4, and potential quality erosion after that. Know when to stop the loop.

Iteration | Owner | Typical focus | Signal to proceed / escalate
1st iteration | AI (Tier 1–2) | Initial draft — scaffold, structure, happy path. High variance is expected. | Proceed if spec coverage is ≥ 70%. Escalate if the agent is hallucinating APIs or misreading context.
2nd iteration | AI (Tier 2) | Address review feedback, fill in error paths, add observability hooks and edge cases. | Proceed if spec coverage is ≥ 90% and no security findings remain. Escalate if the same defects recur.
3rd iteration | AI or Human (assess) | Fine-tuning: performance, subtle logic bugs, complex multi-system interactions. | If three iterations haven't resolved the core issue, a human should diagnose the root cause before a 4th AI pass.
4th+ iteration | Human (preferred) | Persistent defects often signal a misunderstanding of context, system constraints, or spec gaps. | The human fixes the issue, then updates the spec so the same pattern doesn't repeat on the next story.
Always human | Human only | Architecture decisions, ADRs, postmortem conclusions, regulatory sign-off, incident command. | N/A — these are non-delegable by design.
Iteration discipline: The temptation after a failed 3rd AI iteration is to try a 4th with a better prompt. Resist it. Repeated AI iteration on the same problem is a symptom — either the spec is wrong, the story is too large, or the problem requires human judgment. Diagnose before retrying.
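The thresholds in the table can be expressed as a simple loop guard. A minimal sketch, assuming the team measures spec coverage and security findings per iteration — the function name and return values are hypothetical:

```python
MAX_AI_ITERATIONS = 3  # on the 4th pass, a human diagnoses first

def next_owner(iterations_done: int, spec_coverage: float,
               security_findings: int) -> str:
    """Decide who runs the next iteration, mirroring the table:
    >= 70% coverage to proceed past the 1st pass; >= 90% coverage and
    zero security findings to proceed past the 2nd."""
    if iterations_done >= MAX_AI_ITERATIONS:
        return "human"      # diagnose root cause, then update the spec
    if iterations_done == 1 and spec_coverage < 0.70:
        return "escalate"   # agent is likely misreading context
    if iterations_done == 2 and (spec_coverage < 0.90 or security_findings > 0):
        return "escalate"   # recurring defects: fix the story or the spec
    return "ai"
```

The point of encoding it at all is that the cap becomes a team rule rather than a per-developer judgment call under deadline pressure.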
The SAND / Kanban compact: Each SAND stage is a Kanban column. Each story is a card. WIP limits on columns enforce the "small reviewable increment" discipline. The Compound step is your Kanban improvement cadence — not once a quarter, but every card.
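As a sketch of that compact: a card may only be pulled into a column while the column is under its WIP limit. The stage names and limit values below are example assumptions, not prescribed numbers:

```python
# Example WIP-limit guard for a SAND-stage Kanban board.
# Limits are illustrative; tune them to team capacity.
WIP_LIMITS = {"Plan": 3, "Work": 4, "Review": 3, "Compound": 2}

def can_pull(board: dict, stage: str) -> bool:
    """A card may enter a stage only if that column is under its WIP limit."""
    return len(board.get(stage, [])) < WIP_LIMITS[stage]
```

With `board = {"Review": ["story-1", "story-2", "story-3"]}`, `can_pull(board, "Review")` is False: the review column is full, so fast-generating agents wait instead of piling up unreviewed work.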
Section 12

Greenfield vs Brownfield

The loop is the same, but the emphasis shifts hard depending on whether you're building on empty ground or modifying a system that already has users. Treat them as different sports.

🌱 Greenfield · Generate

  • Heavy use of scaffolding agents — full service skeletons from PRD + specs
  • Architecture agent proposes candidates; you choose and record the decision as an ADR
  • Reuse platform scaffolds aggressively
  • Front-load design and architecture; constraints stick for the system's life
  • Speed of iteration matters more than reversibility
  • Compound win: each project contributes back to scaffolds and reference architectures

🏗️ Brownfield · Comprehend, then change

  • Codebase-comprehension agent first — build a knowledge model before any change
  • Characterization tests required before refactor
  • Small, reversible diffs only. No big-bang rewrites
  • Feature-flag-controlled cutovers; explicit rollback paths
  • Domain-expert review where agent confidence is low
  • Compound win: modernization_spec grows; the second migration is faster than the first

The Decision Rule

Question | Greenfield treatment | Brownfield treatment
Existing codebase to integrate with? | No, or only at well-defined boundaries | Yes, with deep coupling
Current system understood? | N/A | Imperfectly; comprehension is part of the work
Cost of breaking existing behaviour? | Low | High; users and SLAs depend on it
Test coverage of affected area? | Build from scratch with verification agent | Often thin; characterization tests required first
Primary AI emphasis | Generation and scaffolding | Comprehension, characterization, incremental change
Spec emphasis | prd_spec, architecture_spec, code_spec | modernization_spec, ops_spec, code_spec evolution
Risk posture | Speed of iteration | Reversibility and small blast radius
Heuristic: If any answer in the right column applies, treat the workstream as brownfield — even if part of it is technically new. Mixed workstreams (a greenfield service that integrates deeply with a legacy core) require explicit decisions about which patterns apply at which boundary.
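One way to mechanise the heuristic — any single brownfield answer flips the whole workstream — is a sketch like the following. The field names are illustrative, not a defined schema:

```python
from dataclasses import dataclass

@dataclass
class Workstream:
    """Answers to the decision-rule questions for one workstream."""
    deep_coupling_to_existing_code: bool
    system_poorly_understood: bool
    high_cost_of_breaking_behaviour: bool
    thin_test_coverage: bool

def classify(w: Workstream) -> str:
    """Any brownfield signal makes the workstream brownfield,
    even if parts of it are technically new."""
    signals = (w.deep_coupling_to_existing_code, w.system_poorly_understood,
               w.high_cost_of_breaking_behaviour, w.thin_test_coverage)
    return "brownfield" if any(signals) else "greenfield"
```

A mixed workstream (a new service coupled to a legacy core) classifies as brownfield here, which is the conservative default the heuristic intends; the per-boundary pattern decisions still have to be made explicitly.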
Section 13

Anti-Patterns to Spot Early

These are failure modes we've seen across the industry and inside our own teams. Each is a path of least resistance — easy to fall into, expensive to walk back.

🚩 The prompt sprawl
What it looks like: Every developer writes their own prompts in their own style. Outputs drift. Nothing compounds because there's nothing to compound into.
The fix: Promote good prompts into shared spec sections. Treat one-off prompts as drafts on the way to specs.

🚩 The skipped Compound
What it looks like: "We're behind, let's just ship and circle back." Two months later: still shipping, still behind, and the system hasn't improved.
The fix: Compound is part of definition-of-done, not after it. If you can't compound today, decide explicitly that this story is a repeat — don't drift.

🚩 The over-eager agent
What it looks like: The agent ships a 1500-line PR that "refactors while we're at it." Reviewable diffs become unreviewable diffs.
The fix: Bound the scope in the story. Reject scope creep at review. Decompose large rewrites; don't accept them in one PR.

🚩 The plausible mistake
What it looks like: The agent invents a function, an API, or a library that doesn't exist — but it looks right. You merge. CI catches it. Or worse: it doesn't.
The fix: Always run generated code against the actual project. Trust nothing that hasn't compiled or linted in your environment.

🚩 The frontier-model default
What it looks like: Routine tasks hit the most expensive model. Costs balloon. Latency degrades. Routine work blocks behind the frontier-model queue.
The fix: Tier the routing. Smaller models for repetitive work. Frontier only for genuinely complex, ambiguous, or cross-artifact work.

🚩 The tribal spec
What it looks like: One person owns the spec, edits it from gut feel, and doesn't review changes. The spec becomes their preferences in document form.
The fix: Specs go through PR review. Changes cite incidents, PRs, or evidence. Two approvers for shared specs. Quarterly pruning.

🚩 The skill atrophy
What it looks like: Engineers stop debugging hard problems because the agent always tries first. Real expertise erodes; the team can't operate without agents.
The fix: Rotate engineers through "no-agent" debugging weeks. Require human authorship of high-impact ADRs and postmortems.

🚩 The infinite iteration trap
What it looks like: The same defect is retried 5+ times with slight prompt tweaks. No human diagnoses the root cause. Time is lost; spec gaps compound silently.
The fix: Cap AI iterations at 3. On the 4th pass, a human diagnoses first. The insight goes back into the spec before the next story.

🚩 The WIP overflow
What it looks like: Agents generate so fast that review queues overwhelm the team. Ten stories are "in review" simultaneously; none are truly done.
The fix: Apply Kanban WIP limits to the review column, not just to development. Agent speed without review discipline creates the illusion of progress.
Section 14

Metrics That Matter

The metrics fall into four families: delivery (familiar), AI-native, compounding (new), and Kanban flow. Track them all. The compounding metrics are how you'll know the practice is working — delivery metrics alone can be misleading in early phases.

What Good Looks Like at 12 Months

  • Lead time on participating teams: −40 to −60%
  • Change failure rate: flat or better
  • Reuse rate: ≥ 30%
  • Time to ship the 2nd similar feature: −30% or more

The Four Metric Families

Family | Metric | Phase 1 target | Phase 3 target
Delivery | Lead time per feature (vs baseline) | −20 to −30% | −70% or more
Delivery | Change failure rate | Flat | Flat or better
Delivery | MTTR | Flat | −50%
Delivery | Deployment frequency | +50% | +200%
AI-Native | % AI-generated code/tests/docs | 40–60% | 70–85%
AI-Native | Human review time per PR | Flat | −30%
AI-Native | Agent rework rate | <25% | <10%
AI-Native | AI cost per story | Tracked | Below Phase 1 baseline
AI-Native | Avg. AI iterations per story | Tracked | ≤2.5 (a signal of spec quality)
Compounding | Reuse rate | Tracked | >60%
Compounding | Spec updates per sprint (impact-weighted) | ≥3 per team | ≥5 per team
Compounding | Time to ship Nth similar feature vs first | Tracked | −50% by N=3
Compounding | Repeat-incident-class rate | Tracked | Trending to zero
Kanban | Cycle time per stage | Tracked by stage | WIP-limit violations trending to zero
Kanban | Review queue depth | <5 cards simultaneously | <3 cards simultaneously
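Two of the compounding metrics are straightforward to compute once the inputs are tracked. A minimal sketch — the function names and input conventions are assumptions about how a team records its data:

```python
def reuse_rate(reused_artifacts: int, total_artifacts: int) -> float:
    """Share of specs, patterns, and tests pulled from the shared library
    rather than written fresh. Phase 3 target: above 60%."""
    return reused_artifacts / total_artifacts if total_artifacts else 0.0

def nth_feature_speedup(first_lead_time_days: float,
                        nth_lead_time_days: float) -> float:
    """Lead-time reduction for the Nth similar feature vs the first.
    Target: a 50% reduction by N=3, i.e. a value of at least 0.5."""
    return 1 - nth_lead_time_days / first_lead_time_days
```

A team that reused 9 of 12 artifacts last cycle has a reuse rate of 0.75; a second feature shipped in 4 days against a 10-day first instance shows a 0.6 speedup — both past the Phase 3 targets.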
Section 15

Self-Assessment

Eight questions about your team. Answer honestly — there's no scoring police. The result will tell you whether you should focus on foundations, scaling, or refinement.

Team Readiness Diagnostic

~3 minutes. Answer for your immediate team, not the whole company.

1. When your team uses AI to generate code, how does it work?
2. Where do learnings from your last incident live?
3. Your typical user story tells the agent...
4. Documentation and diagrams in your repo are...
5. Your last sprint's retro produced...
6. Multi-agent review (security, perf, over-engineering) on PRs is...
7. AI cost per story is...
8. Shipping the second instance of a similar feature takes...

Section 16

The 8-Week Path to Your First Compound

If your team is starting from AI-assisted today, here's a concrete path. Don't try to do everything at once. The point is to ship a real loop and feel the compound, not to roll out a framework.

Weeks 1–2 · Adopt the language

  • Read this playbook end-to-end as a team
  • Pick one feature for the pilot loop
  • Identify the spec sections you'll need
  • Set up cost reporting per story
  • Set up your Kanban board with SAND stage columns

Weeks 3–4 · Run the first loop

  • Convert the pilot story to AI-ready format
  • Run Plan → Work → Review with explicit ownership
  • Force the Compound step at the end
  • Tag every artifact in the PR description
  • Set a WIP limit on the Review column

Weeks 5–6 · Scale the loop

  • Run 3–5 stories through the loop
  • Add multi-agent review on non-trivial PRs
  • Measure: lead time, rework rate, cost, AI iterations
  • Update specs with what worked
  • Tier your model routing for the first time

Weeks 7–8 · Demonstrate compound

  • Ship a second instance of a similar story
  • Measure how much faster it was
  • Demo the spec diffs at sprint review
  • Onboard one neighbouring team
Pilot success criteria: By the end of week 8, you should have shipped 6–8 stories under the loop, made at least 5 spec/library/test contributions back to the system, and seen the second-instance feature ship in measurably less time than the first. If those three are true, you're ready to onboard the next team.
Section 17

Definition of Done

The team's definition of done evolves to match what an AI-native loop is expected to produce. Print this. Stick it next to your sprint board. Argue from it.

A story is done when…

v2 · Print & pin
  • Code is merged and meets code_spec.
  • Tests cover the acceptance criteria and any new edge cases; risk-weighted coverage is adequate.
  • Documentation is regenerated and reviewed; customer-facing docs are reviewed by tech-writing where applicable.
  • Diagrams reflecting the change are current.
  • Multi-agent review has run on non-trivial PRs with no unresolved findings; human review is recorded.
  • Compound deliverables are explicit: spec updates, new patterns, new tests — or a recorded note that nothing applies (and why).
  • Cost is recorded and within budget for the story.
  • The PR description names the inputs (specs, ADRs) and the agent runs that produced the change.
  • Model routing is recorded: which tier was used, and why (captures cost and complexity signals).
  • AI iteration count is recorded; if ≥ 4 iterations were required, a human diagnosis note is included.
  • WIP limits respected throughout: the story did not sit blocked in any stage column beyond the agreed SLA.
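Several of these items are mechanically checkable. A hedged sketch of an automated gate over PR metadata — the field names are assumptions about what your PR template or CI records, not an existing schema:

```python
def unmet_done_criteria(pr: dict) -> list:
    """Return the machine-checkable DoD items this PR still fails
    (an empty list means the automated part of the checklist passes)."""
    failures = []
    if not pr.get("specs_cited"):
        failures.append("PR must name its input specs and ADRs")
    if not pr.get("model_tier_recorded"):
        failures.append("model routing (tier and why) must be recorded")
    if pr.get("ai_iterations", 0) >= 4 and not pr.get("human_diagnosis_note"):
        failures.append("4+ AI iterations require a human diagnosis note")
    if not pr.get("compound_deliverables") and not pr.get("compound_waiver"):
        failures.append("compound deliverables (or an explicit waiver) required")
    return failures
```

The human-judgment items — review quality, whether the compound deliverable is actually meaningful — stay on the printed checklist; the gate only stops stories from silently skipping the recorded artifacts.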
Section 18

Governance & Continuous Learning

This playbook is a living system. It should improve every quarter based on what we ship, what breaks, and what we learn. The compound principle applies to the playbook itself.

For Individual Contributors

  • Follow the PLAN → WORK → REVIEW → COMPOUND loop on every story
  • Treat spec updates as first-class deliverables, not optional follow-ups
  • Cite inputs (specs, ADRs) and agent runs in every PR description
  • Surface anti-patterns and spec gaps the moment you spot them
  • Spend an hour a week reading other teams' Compound diffs
  • Record the model tier used per story; flag routing anomalies

For Tech Leads & Architects

  • Run Plan and Compound for the team — these are not delegable
  • Calibrate multi-agent reviewers quarterly; tighten when too lenient
  • Audit one randomly selected agent run per week
  • Promote patterns across teams; resist forking
  • Coach on judgment in reviews, not just activity
  • Review WIP limits and stage cycle times monthly; adjust to team capacity

Feedback & Learning Loop

Observe
Log every agent run with inputs, outputs, model tier, cost, iteration count, and spec version. Without observation there is nothing to learn from.
Surface
In retros, name what worked, what didn't, and what's missing from the specs. Make the gap visible before debating the fix.
Encode
Write the learning back into the right artifact: spec, prompt, test, library, runbook, or routing rule. Vague action items don't compound.
Propagate
Share the diff with neighbouring teams. A pattern that compounds across two teams compounds twice as fast.
Prune
Quarterly, retire what no longer serves: stale specs, unused patterns, brittle prompts. Compound systems also accumulate dead weight.
Quarterly Playbook Review Agenda: (1) Compounding metrics by team — who's actually accumulating capability, (2) spec catalogue health — what's growing, what's stale, (3) routing and cost — where models are over-spent, (4) anti-pattern review — what new failure modes have we seen, (5) Kanban flow health — WIP violations, stage cycle time anomalies, (6) AI iteration rates — stages with high average iterations signal spec gaps, (7) playbook updates — sections that need rewriting, (8) cross-BU pattern promotion candidates.

The treadmill builds endurance. The compound builds capability.

Choose deliberately, every loop. The SAND Framework is designed to evolve as our practice matures and as the technology shifts beneath us. Treat it the same way you'd treat a good code_spec — argue with it, refine it, version it.

Tarento · Way of Building · Version 2 · Innovation Labs · SAND Framework