Recovery

A playbook plus a score. Six of eight known Next.js + Supabase disasters covered. Two shown as gaps. The score exists to be accurate, not reassuring.

Covered disasters — 6 of 8

What the playbook covers.

Each entry has four parts: symptoms (how you know you're in this disaster), immediate triage (what to do in the first five minutes), rollback procedure (how to undo the damage), and post-mortem checklist (what to add so it doesn't happen again).

RLS off in production

Symptoms: All users can read each other's data. Supabase logs show unexpected cross-tenant queries.
Immediate triage: Identify affected tables. Disable public access immediately. Enable RLS. Write and apply policies.
Rollback: Policies can be added without downtime. No data loss: RLS gates access to rows (reads and writes) and deletes nothing.
Post-mortem: Audit all tables. Add RLS-on-every-table rule to prevent recurrence.
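
The "audit all tables" step can be sketched as below. This is a minimal sketch, not the kit's implementation: it assumes the rows come from the Postgres pg_tables view (columns schemaname, tablename, rowsecurity) and leaves the query itself out. The function names and sample rows are hypothetical.

```typescript
// Row shape mirrors the Postgres pg_tables view; fetching the rows is not shown.
interface PgTableRow {
  schemaname: string;
  tablename: string;
  rowsecurity: boolean; // true when RLS is enabled on the table
}

// Returns the tables in the given schema that have RLS switched off.
function tablesWithoutRls(rows: PgTableRow[], schema: string = "public"): PgTableRow[] {
  return rows.filter((r) => r.schemaname === schema && !r.rowsecurity);
}

// Builds the SQL that enables RLS. Identifiers are double-quoted so that
// mixed-case or unusual table names survive.
function enableRlsStatements(rows: PgTableRow[]): string[] {
  return rows.map(
    (r) => `ALTER TABLE "${r.schemaname}"."${r.tablename}" ENABLE ROW LEVEL SECURITY;`
  );
}

// Hypothetical sample data for illustration only.
const sampleRows: PgTableRow[] = [
  { schemaname: "public", tablename: "profiles", rowsecurity: true },
  { schemaname: "public", tablename: "invoices", rowsecurity: false },
  { schemaname: "auth", tablename: "users", rowsecurity: false },
];

const exposedTables = tablesWithoutRls(sampleRows);
console.log(enableRlsStatements(exposedTables)); // one statement, for public.invoices
```

Enabling RLS on a table that has no policies denies access to every role that is not the table owner, so policies should be ready before the statements run.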

Leaked service-role key

Symptoms: Unexpected API calls in Supabase logs. Key visible in public bundle or git history.
Immediate triage: Rotate the key immediately. Invalidate old key in Supabase dashboard. Audit who hit the API.
Rollback: Key rotation takes seconds. Any code using the old key stops working until you update env vars and redeploy.
Post-mortem: Run credential-shape scan on full git history. Add pre-commit hook to block commits that contain credential-shaped strings.
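
A credential-shape scan might look like the sketch below. It assumes the leaked key is JWT-shaped (three base64url segments, the first two typically starting with "eyJ") and only checks the text it is given. Walking git history and wiring it into a pre-commit hook are left out. The pattern and function names are hypothetical, and other key formats would need their own patterns.

```typescript
// Heuristic only: three dot-separated base64url segments, the first two
// beginning with "eyJ", which is how an encoded JSON object usually starts.
const JWT_SHAPE = /eyJ[A-Za-z0-9_-]{10,}\.eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}/g;

interface CredentialHit {
  line: number;    // 1-based line number in the scanned text
  preview: string; // truncated so the report itself does not leak the key
}

// Scans a blob of text (a file, a diff, a commit patch) for JWT-shaped strings.
function scanForJwtShapedStrings(text: string): CredentialHit[] {
  const hits: CredentialHit[] = [];
  text.split("\n").forEach((content, index) => {
    const matches = content.match(JWT_SHAPE);
    if (matches === null) return;
    for (const m of matches) {
      hits.push({ line: index + 1, preview: m.slice(0, 12) + "..." });
    }
  });
  return hits;
}
```

A pre-commit hook built on this would fail the commit when the returned list is non-empty. False positives are possible, since any JWT-shaped test token matches too.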

Force-pushed main

Symptoms: Commits missing from main. Team members with stale checkouts pushing on top of wrong base.
Immediate triage: Find the last good commit in reflog. Create a recovery branch. Do not push to main yet.
Rollback: Use git reflog to find the lost commit, then cherry-pick or reset onto it. One deliberate force-push to restore main is acceptable at this point.
Post-mortem: Enable branch protection on main. Add PreToolUse hook to block force-push.
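
The check inside a force-push hook can be sketched as a pure function. This is an illustrative sketch: isForcePush is a hypothetical name, and how the hook receives the command string is not specified here. It is deliberately conservative and also treats --force-with-lease and +refspec pushes as force-pushes.

```typescript
// Returns true when a shell command looks like a git force-push.
// Conservative by design: a false positive blocks a harmless command,
// a false negative rewrites main.
function isForcePush(command: string): boolean {
  const tokens = command.trim().split(/\s+/);
  const gitIndex = tokens.indexOf("git");
  if (gitIndex === -1) return false;

  const pushIndex = tokens.indexOf("push", gitIndex);
  if (pushIndex === -1) return false;

  const args = tokens.slice(pushIndex + 1);
  return args.some(
    (arg) =>
      arg.startsWith("--force") ||          // --force, --force-with-lease, ...
      /^-[a-zA-Z]*f[a-zA-Z]*$/.test(arg) || // -f, or combined short flags like -fu
      arg.startsWith("+")                   // +main style refspec forces the update
  );
}
```

For example, isForcePush("git push --force origin main") returns true and isForcePush("git push -u origin feature") returns false.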

Auth.js session loop

Symptoms: Users are redirected repeatedly between the login page and the protected route and never land on either.
Immediate triage: Check that the auth secret matches across environments. Check callback URL config. Check middleware matcher.
Rollback: Temporary: disable the protected route. Fix: align NEXTAUTH_SECRET and NEXTAUTH_URL (named AUTH_SECRET and AUTH_URL in Auth.js v5).
Post-mortem: Add session revalidation rule. Test login flow in staging before deploy.

Hardcoded OAuth callback

Symptoms: OAuth works locally, fails in production with redirect_uri_mismatch from Google/GitHub.
Immediate triage: Check provider config for localhost:3000. Add production URL to OAuth app's allowed redirect list.
Rollback: The provider config change needs no code deploy. It usually takes effect quickly, though some providers take a few minutes to propagate it.
Post-mortem: Add OAuth callback list audit rule. Use environment variable for the base URL.
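
Both post-mortem items can be sketched in a few lines. This sketch assumes the Auth.js / NextAuth route convention of /api/auth/callback/<provider>. Other auth layers use a different path. The function names are hypothetical, and the base URL is expected to come from an environment variable.

```typescript
// Builds the OAuth callback URL from a configured base URL instead of a literal.
function buildCallbackUrl(baseUrl: string, provider: string): string {
  const trimmed = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${trimmed}/api/auth/callback/${provider}`;
}

// Audit helper: returns the redirect URIs that still point at a local dev server.
function findLocalhostCallbacks(redirectUris: string[]): string[] {
  const localPattern = /^https?:\/\/(localhost|127\.0\.0\.1)(:\d+)?(\/|$)/;
  return redirectUris.filter((uri) => localPattern.test(uri));
}

// Hypothetical values for illustration.
const productionCallback = buildCallbackUrl("https://app.example.com/", "github");
// productionCallback === "https://app.example.com/api/auth/callback/github"
```

A localhost entry is fine in the provider's allowed list for development. The audit is about making sure the production URL is there as well and that no code path hardcodes the local one.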

Broken Supabase migration

Symptoms: Migration fails mid-run. Database in partial state. Subsequent migrations blocked.
Immediate triage: Check migration status (for example with supabase migration list). Identify which migration failed. Decide: fix or rollback.
Rollback: If reversible: apply the down migration. If not: restore from last clean backup.
Post-mortem: Require migrations to be reversible. Test migrations on a branch database before applying to main.

Gaps — 2 of 8

What we don't cover yet.

Multi-tenant RLS escalation

The failure mode requires tenant-boundary fixture testing we haven't written. We won't claim coverage we can't demonstrate.

Server-side OAuth callback mismatch

Distinct from the hardcoded-callback case above — affects server-side OAuth flows in specific Next.js route handler configurations. Fixture not yet written.

Why we show the gap

A "100% covered" claim with two unverified disasters is dishonest.

Six confirmed beats eight including two we're not sure about. The score updates when the kit ships fixture-tested coverage for the missing two — not when we write a page that says "coming soon."

The Recovery Readiness score on the dashboard is the same number that appears here. If you see 6 / 8, those are the same six. No asterisks.

Five review protocols

The protocols below are listed in order. Protocol 0, the codebase audit, runs before any work starts. Protocol 4, the readiness gate, is the exit gate.

Codebase audit before any epic

Commands: find . -type f, wc -l, pytest --collect-only, git log --since=... The planner agent enforces the audit. No scope is written blind.

Epic review checklist

7-section EPIC.md (Context / What / Why / How / DoD / Test Plan / Owner) audited line-by-line. One owner per epic, named. Not "the team."

Architecture quality

Four checks: observability (is it logged?), testability (can you test it without mocking the world?), data flow (does the input boundary match the output boundary?), error handling (do failures surface or silently degrade?).

Ontology consistency

The new component must fit the type system from Layer 11 and must not smuggle behaviour past type boundaries. It is reviewed against the pre-defined ontology, not against what feels natural.

Readiness gate

Tests passing, no open blockers, DoD verified, Last Audited date stamped. A component is not done until Protocol 4 is satisfied.
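
The gate reads naturally as a function that returns every unmet condition. This is a minimal sketch: the field names and the ISO date format for Last Audited are assumptions.

```typescript
interface ReadinessInput {
  testsPassing: boolean;
  openBlockers: number;
  dodVerified: boolean;
  lastAudited: string | null; // assumed ISO date such as "2025-01-31"; null if never audited
}

// Returns the list of unmet conditions. An empty list means the gate is satisfied.
function readinessFailures(input: ReadinessInput): string[] {
  const failures: string[] = [];
  if (!input.testsPassing) failures.push("tests are not passing");
  if (input.openBlockers > 0) failures.push(`${input.openBlockers} open blocker(s)`);
  if (!input.dodVerified) failures.push("Definition of Done not verified");
  if (input.lastAudited === null || !/^\d{4}-\d{2}-\d{2}$/.test(input.lastAudited)) {
    failures.push("Last Audited date missing or malformed");
  }
  return failures;
}
```

Returning all failures rather than a single boolean keeps the gate explainable: the component is done only when the list is empty.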

EPIC.md template

Seven sections. One owner. Migration strategy required.

Every epic gets a single EPIC.md file with seven fixed sections: Context, What, Why, How, Definition of Done, Test Plan, Owner. One owner per epic — named, not "the team." If no one is named, no one is accountable.

Migrations follow a fixed three-stage pattern: shadow mode (new and old run in parallel, new writes are silently discarded), hint (new runs and its output is visible but not acted on), cutover (new is canonical, old is removed). Hard cutovers — turning off the old system in the same deploy that turns on the new one — are explicitly disallowed.
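
The three stages can be sketched as one dispatcher that decides which implementation is canonical. This is an illustrative sketch: runStaged, StageResult, and the callback shape are hypothetical names, not part of the kit.

```typescript
type MigrationStage = "shadow" | "hint" | "cutover";

interface StageResult<T> {
  canonical: T;          // the value callers act on
  visibleHint: T | null; // the new system's output, shown but not acted on
}

function runStaged<T>(
  stage: MigrationStage,
  oldImpl: () => T,
  newImpl: () => T
): StageResult<T> {
  switch (stage) {
    case "shadow": {
      // The new path runs alongside the old one. Its result is discarded,
      // and a failure in it must never affect the old path.
      try {
        newImpl();
      } catch (_err) {
        // Swallowed on purpose in this sketch; a real system would log it.
      }
      return { canonical: oldImpl(), visibleHint: null };
    }
    case "hint":
      // The new output is surfaced for comparison, but the old one still wins.
      return { canonical: oldImpl(), visibleHint: newImpl() };
    case "cutover":
      // The new path is canonical and the old path is no longer called.
      return { canonical: newImpl(), visibleHint: null };
  }
}
```

Because the stage is a single value, moving from shadow to hint to cutover is a config change per stage. A hard cutover would mean skipping the first two values in one deploy, which is exactly what the pattern disallows.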

Definition of Done requirements

Four structural requirements for every task's DoD.

Structured logging

Logger name declared, levels appropriate, all external API calls logged, no secrets in any log line.
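
A log line that meets these requirements might be produced as sketched below. The field layout, the redaction pattern, and the function names are assumptions. Redaction here is by field name only: secrets embedded in message text or in values are not detected.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Field names that suggest a secret. Deliberately broad: over-redacting is cheaper
// than leaking.
const SECRET_FIELD_PATTERN = /(secret|token|password|authorization|key)/i;

// Replaces the value of any suspicious field with a placeholder.
function redactSecrets(fields: Record<string, unknown>): Record<string, unknown> {
  const safeFields: Record<string, unknown> = {};
  for (const fieldName of Object.keys(fields)) {
    safeFields[fieldName] = SECRET_FIELD_PATTERN.test(fieldName)
      ? "[REDACTED]"
      : fields[fieldName];
  }
  return safeFields;
}

// Produces one JSON log line with the logger name and level always present.
function formatLogLine(
  loggerName: string,
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {}
): string {
  return JSON.stringify({ logger: loggerName, level, message, ...redactSecrets(fields) });
}
```

An external API call would then be logged with something like formatLogLine("billing.sync", "info", "external API call", { endpoint: "/v1/charges", status: 200 }), where the logger name and fields are hypothetical.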

manifest.json required fields

id, name, tier, type, inputs.requires, outputs.schema, fallback_outputs. Missing fields block activation.
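
The "missing fields block activation" rule can be sketched as a presence check over dotted paths. The required paths are the ones listed above. The function names are hypothetical, and the sketch checks only that a field is present, not its type, since the field types are not specified here.

```typescript
const REQUIRED_MANIFEST_PATHS: string[] = [
  "id",
  "name",
  "tier",
  "type",
  "inputs.requires",
  "outputs.schema",
  "fallback_outputs",
];

// Walks a dotted path such as "inputs.requires" and reports whether a
// non-null value exists at the end of it.
function hasPath(obj: unknown, path: string): boolean {
  let current: unknown = obj;
  for (const segment of path.split(".")) {
    if (typeof current !== "object" || current === null || !(segment in current)) {
      return false;
    }
    current = (current as Record<string, unknown>)[segment];
  }
  return current !== undefined && current !== null;
}

// Returns the required paths that are absent. Activation proceeds only on [].
function missingManifestFields(manifest: unknown): string[] {
  return REQUIRED_MANIFEST_PATHS.filter((path) => !hasPath(manifest, path));
}
```

Given a parsed manifest.json that lacks outputs.schema and fallback_outputs, the function returns exactly those two paths, which gives the activation step a concrete reason to refuse.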

reasoning and confidence fields

Required in every decision output. An audit log where you can't tell why a decision was made is not an audit log.
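
A decision output that satisfies this requirement could be checked as sketched below. The interface and function names are hypothetical, and the 0 to 1 confidence scale is an assumption: no scale is fixed above.

```typescript
interface DecisionOutput {
  decision: string;
  reasoning: string;  // why the decision was made, in plain language
  confidence: number; // assumed to be on a 0..1 scale
}

// Returns the reasons an output is not auditable. An empty list means it is.
function decisionOutputProblems(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["output is not an object"];
  }
  const record = value as Record<string, unknown>;
  const problems: string[] = [];

  const reasoning = record["reasoning"];
  if (typeof reasoning !== "string" || reasoning.trim() === "") {
    problems.push("reasoning is missing or empty");
  }

  const confidence = record["confidence"];
  if (
    typeof confidence !== "number" ||
    !isFinite(confidence) ||
    confidence < 0 ||
    confidence > 1
  ) {
    problems.push("confidence is missing or outside 0..1");
  }
  return problems;
}

// Hypothetical example of a well-formed output.
const exampleDecision: DecisionOutput = {
  decision: "retry",
  reasoning: "Upstream returned 503 twice within the retry budget.",
  confidence: 0.8,
};
```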

Platform line

Every task declares one of: mobile-conform required (checked at widths 375 / 390 / 768), server-side, kit-content, or none.

Activation evidence

Tests passing is not activation.

Activation tests check disk or DB output, not mocked results. A test that passes against mock_response proves the test harness works. It says nothing about whether the component ran on real data.

The partial-activation table in the kit tracks which components are permanently in fallback, which return NaN, and which routes are wired but unreachable. The distinction is explicit: "ran" vs "ran on real data."

Integration audit

Finds the installed-but-never-used failure class.

The integration audit runs as part of /pre-commit and catches: installed packages never imported, UI components never rendered, modules never called, API routes with no caller, agents never invoked, env vars never read.

This is the "we installed shadcn six months ago and never used it" failure class. It is not a code quality issue — it is a signal that the integration step was skipped. The audit surfaces it before it becomes a maintenance burden.
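
One of the six checks, installed packages never imported, can be sketched as below. The function name is hypothetical. The sketch takes the dependency names and source file contents as plain strings, so reading package.json and walking the source tree are left out. It matches import and require statements textually rather than parsing the code.

```typescript
// Returns the dependencies that no source file imports or requires.
function unusedDependencies(dependencies: string[], sources: string[]): string[] {
  return dependencies.filter((dep) => {
    // Escape regex metacharacters in the package name (e.g. the "." in "socket.io").
    const escaped = dep.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // Matches: from "dep", from "dep/sub", import "dep", import("dep"), require("dep")
    const importPattern = new RegExp(
      `(?:from\\s+|import\\s+|import\\(\\s*|require\\(\\s*)["']${escaped}(?:/[^"']*)?["']`
    );
    return !sources.some((src) => importPattern.test(src));
  });
}

// Hypothetical inputs for illustration.
const unusedExample = unusedDependencies(
  ["react", "@radix-ui/react-dialog", "zod"],
  ['import React from "react";', "const z = require('zod/v4');"]
);
// unusedExample contains only "@radix-ui/react-dialog"
```

Packages that are used only through config files or CLIs, such as linters and build tools, will show up as unused in a check like this, so a real audit needs an allowlist or a second pass over config.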

How it gets used

Two ways to open the playbook.

Via slash command

Type /recover rls-off inside Claude Code. The command opens the playbook entry directly in the conversation, and Claude walks you through it.

Direct file read

If Claude Code is the thing that broke, open __documentation/recovery-playbook.md directly. The file is human-readable markdown.
