Claude Code Kit

Features · Testing

Testing

Six levels, six schedules. Level 1 runs on every file write. Level 6 is a hard deploy gate. Skip any level and something slips — not hypothetically, predictably.

The six levels

Each level catches what the one above misses.

01

Deterministic

Every PostToolUse: Edit|Write hook

Pure unit tests. Run on every file write via hook — no manual trigger. Failure blocks the commit. Target: under 30 seconds.

02

Integration

Pre-commit + CI

Real database, real HTTP calls. Marked @pytest.mark.integration. Run pre-commit and in CI. No mocks at the DB boundary.

03

Quality regression

Weekly

Fixture-based LLM behaviour tests. Does this prompt still produce a sane answer? Catches drift before it lands in production.

04

Activation evidence

Per deploy

Did the component run on real data and write the expected artefact to disk or DB? A test that passes against mock_response proves nothing about prod.

05

Baseline-aware wrapper

Per deploy

Suppresses pre-existing failures so a new break is visible. ships as templates/baseline_test_wrapper.sh. Run it before reporting a test count — you want signal, not noise.

06

System invariants

Deploy gate — hard block

Cross-component consistency. Six categories: configuration consistency, wiring completeness, model/constant staleness, dead-component detection, data-flow completeness, deploy integrity. The deploy script refuses to ship if any system-invariant test fails.

Why six levels

Each level has a blind spot.

Stacking levels is not redundancy — each one finds a failure class the one above cannot see.

Unit tests

Finds
Logic bugs
Blind spot
Wiring bugs — 'we never registered the new capability in the worker'

Integration tests

Finds
Wiring bugs
Blind spot
Drift — 'the prompt now returns gibberish for unicode names'

Activation tests

Finds
Drift
Blind spot
Config rot — 'the new tier never had its rate-limit row added'

System invariants

Finds
Config rot
Blind spot
Nothing — this is the floor

Hook integration

Tests run when they need to, without a manual trigger.

  • PostToolUse: Edit|Write

    Level 1 — deterministic unit tests on every file write

  • PreToolUse: Bash (matched on git commit)

    Level 2 — integration tests before any commit lands

  • Stop hook, Sundays

    Level 3 — weekly quality regression on LLM behaviour

  • deploy.sh

    Level 6 — system invariants. Not a hook — the deploy script. Hard block.

The deploy gate

Three system-invariant tests that block the ship.

These are not style checks or coverage thresholds. They check structural invariants that, if broken, mean something real is wired wrong.

test_every_active_capability_registered_in_worker

Every capability row with status: active has a corresponding worker entry. Catches the 'we built it but never wired it' failure.

test_no_pairs_field_in_training_config

Config schema has no orphaned fields. Catches config rot — fields added for an old feature, never removed.

verify_hash_chain()

Audit log is unbroken since last deploy. Any gap — missing entry, tampered entry — blocks the deploy.

Activation evidence

"Ran" is not the same as "ran on real data."

Level 4 tests check disk or DB output, not a mocked return value. A test that passes against mock_response proves the test harness works. It says nothing about whether the component ran on your data.

The partial-activation table tracks which components are permanently in fallback, which return NaN, and which routes are wired but unreachable. Green tests with a half-activated system are not a clean bill of health.