Deterministic
Every PostToolUse: Edit|Write hook
Pure unit tests. Run on every file write via hook — no manual trigger. Failure blocks the commit. Target: under 30 seconds.
Features · Testing
Six levels, six schedules. Level 1 runs on every file write. Level 6 is a hard deploy gate. Skip any level and something slips — not hypothetically, predictably.
The six levels
Every PostToolUse: Edit|Write hook
Pure unit tests. Run on every file write via hook — no manual trigger. Failure blocks the commit. Target: under 30 seconds.
Pre-commit + CI
Real database, real HTTP calls. Marked @pytest.mark.integration. Run pre-commit and in CI. No mocks at the DB boundary.
Weekly
Fixture-based LLM behaviour tests. Does this prompt still produce a sane answer? Catches drift before it lands in production.
Per deploy
Did the component run on real data and write the expected artefact to disk or DB? A test that passes against mock_response proves nothing about prod.
Per deploy
Suppresses pre-existing failures so a new break is visible. ships as templates/baseline_test_wrapper.sh. Run it before reporting a test count — you want signal, not noise.
Deploy gate — hard block
Cross-component consistency. Six categories: configuration consistency, wiring completeness, model/constant staleness, dead-component detection, data-flow completeness, deploy integrity. The deploy script refuses to ship if any system-invariant test fails.
Why six levels
Stacking levels is not redundancy — each one finds a failure class the one above cannot see.
Hook integration
PostToolUse: Edit|Write
Level 1 — deterministic unit tests on every file write
PreToolUse: Bash (matched on git commit)
Level 2 — integration tests before any commit lands
Stop hook, Sundays
Level 3 — weekly quality regression on LLM behaviour
deploy.sh
Level 6 — system invariants. Not a hook — the deploy script. Hard block.
The deploy gate
These are not style checks or coverage thresholds. They check structural invariants that, if broken, mean something real is wired wrong.
Every capability row with status: active has a corresponding worker entry. Catches the 'we built it but never wired it' failure.
Config schema has no orphaned fields. Catches config rot — fields added for an old feature, never removed.
Audit log is unbroken since last deploy. Any gap — missing entry, tampered entry — blocks the deploy.
Activation evidence
Level 4 tests check disk or DB output, not a mocked return value. A test that passes against mock_response proves the test harness works. It says nothing about whether the component ran on your data.
The partial-activation table tracks which components are permanently in fallback, which return NaN, and which routes are wired but unreachable. Green tests with a half-activated system are not a clean bill of health.