The Pre-Commit Review Gate

FOLIO LXXXII 22 APR 2026 · 25 MIN · LONG-FORM

AI on a Leash: Mechanical Self-Review for Claude Code

Diagram · folio lxxxii

flowchart TB
  A([Agent edits code]) --> B{{git commit}}
  B --> H[/PreToolUse hook/]
  H --> C{Trivial diff?<br/>Doc-only?<br/>Plan mode?}
  C -->|yes| OK([allow commit])
  C -->|no| D{Artifact valid?<br/>hash matches?<br/>verdict CLEAN?}
  D -->|yes| OK
  D -->|no| DENY[/deny + return<br/>review prompt/]
  DENY --> SUB([general-purpose<br/>sub-agent review])
  SUB --> F{Findings?}
  F -->|N findings| FIX([fix in working tree])
  FIX --> SUB
  F -->|CLEAN| ART[(write artifact:<br/>diff hash + verdict)]
  ART --> B

A pull request shouldn’t need four review rounds. But that’s what I kept getting from Claude Code: write code, run tests, claim done, ask for review, find real problems, fix, push, ask again, find more, fix, push, repeat. Across several PRs the pattern was identical. Tokens, time, and CI cycles burned on a loop the agent could have closed itself before the first commit.

§TL;DR
A PreToolUse hook on git commit blocks the commit until an adversarial sub-agent has reviewed the working-tree diff and produced a Verdict: CLEAN artifact tied to that exact diff. The gate is mechanical — the agent literally cannot commit without it — but the prompt is calibrated against review theater. Findings have a quality bar: a specific defect at file:line, not “could be cleaner.” The agent can dispute findings it disagrees with in writing rather than fixing them mechanically. The goal is a clean first commit, not a maximum-rounds-of-review badge.

AI on a Leash Series | Previous: Ralph’s Uncle covers verification principles. Complete Go Project Configuration covers the test toolchain. This article covers the missing gate between “code passed local tests” and “code is committed.”

§The Cycle You Already Know

If you’ve worked with Claude Code on anything past a toy project, you’ve seen this:

1. The agent writes code, runs make verify, and reports “all tests pass, the change is complete.”
2. You ask for a critical review.
3. The agent finds real issues that should have been caught the first time.
4. The agent commits “fixes,” pushes.
5. You ask for another critical review.
6. The agent finds more real issues.
7. Repeat.

Three or four iterations per PR is normal. The fixes are not invented problems. They’re real bugs: backend mismatches, doc-vs-code drift, tests that pass but don’t exercise the failure they claim to test, magic numbers that escaped a refactor. The agent has the capacity to find them. It just doesn’t, until you ask.

The framing I landed on after the third PR with this pattern: you can write code, and you can review code, but for some reason you have to be told to review the code you write.

That has to stop being true. The fix is not asking nicely. The fix is making “review your code” a precondition the runtime enforces, not a habit you hope the agent adopts.

§Why Self-Review Memory Entries Don’t Work

The first thing I tried was a memory entry: a checklist of common gaps the agent should look for before claiming done. Backend consistency, doc-vs-code drift, tests that match their claims, edge cases, error paths.

It didn’t work, and the failure mode is worth understanding because it’s the same failure mode every “be more careful” instruction hits.

The agent treats the checklist as a vibe-check. It reads the items, mentally answers “yes I considered that,” and moves on. There is no forcing function. Tests pass. The agent says done. You find things. The checklist might as well not exist.

A more procedural version (do this, then this, then this) helps a little but still gets skipped under any kind of time pressure or context bloat. Items get noted as “I should fix this” and then quietly buried as the conversation moves on.

The lesson: in-context discipline does not survive contact with the model’s own confidence. If you want the agent to do something every single time, you cannot put the requirement in the agent’s prompt. You have to put it in the runtime.

§Why Stop Hooks and PR-Open Hooks Are Too Late

The first runtime-level fix I considered was a Stop hook: at the end of every turn, run an automated review against the working tree.

Wrong trigger point. By the time Stop fires, the agent may have already committed. Claude Code’s commit workflow runs inside a single tool sequence that doesn’t always cross a Stop boundary. Gating at PR-open time has the same problem one layer up: the commit and the push are already done, and you’re paying the cost of every “fix to address review” commit thereafter.

The whole point of moving the review left is to avoid the round-trip cost of commit, push, review, fix-commit, push. If the gate fires after any of those steps, you’ve already bought the ticket.

The review needs to fire before the commit. Specifically, when the agent calls the Bash tool with a git commit command, the hook needs to intercept and decide whether to allow or deny it.

That is a PreToolUse hook. PreToolUse is the only place on the timeline where you can cheaply force the work to happen before any cost is sunk.

§The Architecture

Three coupled pieces. The first is the gate. The other two reinforce it.

§The Gate: A PreToolUse Hook on git commit

A shell script registered as a PreToolUse hook in ~/.claude/settings.json. When the agent calls Bash with a command matching git commit or git commit --amend, the hook intercepts and runs through this decision tree:

1. Kill switch. If ~/.claude/.review-gate-disabled exists, allow. Used for debugging the gate itself.
2. Plan mode. If .claude/.plan-mode exists in the project, allow. Plan mode is read-only by design.
3. Trivial diff. If staged + unstaged changes total fewer than 20 lines, allow. Typo fixes pass through; otherwise the gate trains the agent to dismiss it as overhead.
4. Doc-only diff. If no *.go|*.sql|*.sh|*.py|*.ts|*.tsx|*.js files changed, allow. Pure docs changes don’t need code review.
5. Otherwise: compute a SHA-256 hash of the combined output of git diff --cached HEAD; git diff, truncate it to 16 characters, and look for a matching review artifact at .claude/.last-review.md.

If the artifact is missing, stale, or doesn’t carry verdict: CLEAN, the hook denies the tool call. The denial returns hookSpecificOutput.permissionDecision: "deny" with the full sub-agent review prompt as the reason. The agent’s next turn fires with explicit instructions and zero excuse not to spawn the review.
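
A minimal sketch of how the deny response might be assembled, assuming jq is installed. The helper name emit_deny is hypothetical. The permissionDecision field comes from the description above; the hookEventName and permissionDecisionReason field names reflect my understanding of the Claude Code hook output schema and are worth checking against the current hooks documentation.

```shell
# Sketch only: emit_deny is a hypothetical helper name, not taken from the
# author's script.

# Print a PreToolUse "deny" decision on stdout.
# $1 = the reason text (here, the rendered review prompt).
emit_deny() {
  local reason="$1"
  # --arg passes the string as data, so quotes, backslashes and newlines
  # are JSON-escaped by jq rather than by hand.
  jq -n --arg reason "$reason" '{
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision: "deny",
      permissionDecisionReason: $reason
    }
  }'
}
```

The hook would call this and then exit 0, so the decision travels in the JSON rather than in the exit code.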

§The Review: An Adversarial Sub-Agent

When the gate denies, the agent spawns a general-purpose sub-agent with a structured prompt that requires explicit output sections rather than freeform thoughts. The prompt forces the sub-agent to produce:

Numbered findings, each with file:line, what’s wrong, why it matters, and a minimal fix. Findings are concrete defects: a real bug, broken contract, security issue, or regression. “Could be cleaner”, “consider extracting”, “doc could be tighter” are explicitly NOT findings — they go in a separate Suggestions section that doesn’t block the verdict.
A paired-backend audit: every contract, every backend, every edge case.
A doc-vs-code walkthrough: every new doc paragraph traced against implementing code.
A test-vs-claim audit: every new test verified to actually exercise the failure mode it claims.
A verdict: CLEAN or N findings (counting Findings only, not Suggestions).

I picked general-purpose deliberately. The Plan agent is read-only and constrains fix suggestions. Custom code-reviewer plugins risk being changed out from under you. With general-purpose, the prompt controls behavior end to end, and the prompt is yours to version.

The agent loop:

Read findings.
Address each one. Three valid responses: fix it, dispute in writing (record rationale in the commit body if you’ve reasoned the finding is wrong — typically a misread of the code or re-litigation of an accepted prior decision), or defer with a TODO (real concern, out of scope for this PR; use sparingly).
Re-spawn the sub-agent on the new diff.
Stop when Verdict: CLEAN, OR when remaining findings are docstring/comment polish you’ve judged not worth another iteration. The artifact is then written, and the commit body carries the rationale for any disputed items.
Iteration cap: two rounds for typical changes; round three only if round two surfaced a substantive bug worth chasing, not docstring quibbles. If round three still has substantive findings, the change is too large and must be split.

§The Artifact: A Hash That Pins the Verdict to a Diff

.claude/.last-review.md carries:

reviewed_hash: <16-char sha256 of git diff --cached HEAD; git diff>
findings_count: <int>
verdict: CLEAN

Any subsequent edit changes the diff hash. The next commit attempt re-blocks until a fresh review is performed and a new artifact is written.
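
The hash-and-artifact check can be sketched roughly as follows. All helper names are hypothetical, and the sketch assumes the current directory is inside a git repository that already has at least one commit (so HEAD resolves). The sha256sum/shasum fallback is my addition for portability between Linux and macOS.

```shell
# Sketch only: helper names are hypothetical, not the author's.
ARTIFACT=".claude/.last-review.md"

# Hash stdin with whichever SHA-256 tool is available.
sha256_stdin() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum
  else
    shasum -a 256
  fi
}

# First 16 characters of the SHA-256 of staged + unstaged changes.
diff_hash() {
  { git diff --cached HEAD; git diff; } | sha256_stdin | cut -c1-16
}

# Write the three-line artifact for the current diff.
# $1 = findings count (defaults to 0).
write_artifact() {
  mkdir -p "$(dirname "$ARTIFACT")"
  printf 'reviewed_hash: %s\nfindings_count: %s\nverdict: CLEAN\n' \
    "$(diff_hash)" "${1:-0}" > "$ARTIFACT"
}

# Succeed only if the artifact exists, pins the current diff, and is CLEAN.
artifact_valid() {
  [ -f "$ARTIFACT" ] || return 1
  grep -Fqx "reviewed_hash: $(diff_hash)" "$ARTIFACT" || return 1
  grep -Fqx "verdict: CLEAN" "$ARTIFACT"
}
```

Because the hash is recomputed on every commit attempt, editing any tracked file after write_artifact makes artifact_valid fail until a new artifact is written.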

The .claude/ directory is gitignored, so the artifact never lands in the repo.

§Files Installed

~/.claude/hooks/review-gate.sh                    # the hook script (executable)
~/.claude/hooks/review-prompt-template.md         # the adversarial prompt template
~/.claude/settings.json                           # PreToolUse + PostToolUse entries
~/.claude/projects/<project>/memory/feedback_self_review_checklist.md   # memory pointer

The hook script lives outside the project so it works across every repo on the machine. The prompt template is a separate file (not inline in the script) so it’s auditable and can evolve without touching hook logic. The memory entry is a short pointer to the gate, not a procedural checklist (the gate itself does that work).

§The Hook Script

The script supports two modes selected by its first argument:

precommit: PreToolUse hook for Bash. Denies git commit calls without a valid artifact.
size-warn: PostToolUse hook for Edit|Write|MultiEdit. Emits a systemMessage warning when the cumulative diff vs origin/main exceeds 800 lines. Soft signal that the change should be split.
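
A rough sketch of what the size-warn mode might do, assuming jq is installed. The function names and the message wording are hypothetical, and counting added plus deleted lines from git diff --numstat is one plausible reading of “cumulative diff”; the author's script may count differently. The top-level systemMessage output field is named in the description above.

```shell
# Sketch only: function names and message wording are hypothetical.
SIZE_WARN_LINES=800

# Sum added + deleted lines from `git diff --numstat` output on stdin.
# Binary files report "-" for both counts and are skipped.
count_changed_lines() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { n += $1 + $2 } END { print n + 0 }'
}

# Emit a systemMessage warning when the diff against origin/main is large.
size_warn() {
  local total
  total="$(git diff --numstat origin/main 2>/dev/null | count_changed_lines)"
  if [ "$total" -gt "$SIZE_WARN_LINES" ]; then
    jq -n --arg msg "Cumulative diff vs origin/main is $total lines (soft cap $SIZE_WARN_LINES). Consider splitting the change." \
      '{systemMessage: $msg}'
  fi
  return 0   # advisory only: this mode never blocks the edit
}
```

If origin/main does not exist, git prints nothing to stdout, the count is 0, and no warning is emitted, which matches the fail-open posture.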

The opening section captures the design choices that took the longest to get right:

#!/usr/bin/env bash
#
# review-gate.sh - pre-commit adversarial-review gate.
#
# Two modes (selected by $1):
#   precommit   PreToolUse hook for Bash. Blocks git commit when the
#               working tree has un-reviewed code changes.
#   size-warn   PostToolUse hook for Edit|Write|MultiEdit. Emits a
#               systemMessage warning if the cumulative diff vs
#               origin/main exceeds 800 lines.
#
# Fail-open: any internal error allows the action and emits a stderr
# warning. A buggy gate must never block legitimate work.

set -uo pipefail

MODE="${1:-precommit}"
HOOK_INPUT="$(cat)"

if [[ -f "$HOME/.claude/.review-gate-disabled" ]]; then
  exit 0
fi

PROJECT_DIR="$(printf '%s' "$HOOK_INPUT" | jq -r '.cwd // empty' 2>/dev/null)"
if [[ -z "$PROJECT_DIR" ]]; then
  PROJECT_DIR="${CLAUDE_PROJECT_DIR:-$PWD}"
fi
cd "$PROJECT_DIR" 2>/dev/null || exit 0

if [[ -f "$PROJECT_DIR/.claude/.plan-mode" ]]; then
  exit 0
fi

if ! git rev-parse --git-dir >/dev/null 2>&1; then
  exit 0
fi

A few invariants worth calling out:

Fail-open posture. Any internal error allows the action and prints a stderr warning. A buggy gate that blocks legitimate work is worse than no gate, because you’ll disable it permanently and never re-enable it.
JSON output via jq -n --arg. When you build the deny response, the --arg form properly JSON-escapes multi-line strings. Trying to build the JSON with string concatenation will hand you a broken hook the first time the diff contains a quote or a newline.
Diff size threshold. Trivial-diff skip is essential. Without it, every typo-fix triggers a review and trains the agent to dismiss the gate as bureaucracy.
Doc-only skip. A pure docs change should not block on code review.
Project dir from the .cwd field. The hook input JSON contains cwd. Trust that over $CLAUDE_PROJECT_DIR or $PWD.
Watch the git commit-tree substring trap. Match git[[:space:]]+commit($|[[:space:]]), not just the substring commit. Otherwise you block unrelated subcommands.
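
The command match from the last item can be sketched with grep -E, using the pattern given above. The helper name is_git_commit is hypothetical.

```shell
# Sketch only: is_git_commit is a hypothetical helper name.

# Succeed when the command string invokes `git commit` (with or without
# flags), but not `git commit-tree` or other commands that merely contain
# the substring "commit".
# $1 = the Bash tool's command string.
is_git_commit() {
  printf '%s\n' "$1" | grep -Eq 'git[[:space:]]+commit($|[[:space:]])'
}
```

For example, is_git_commit 'git commit --amend' succeeds, while is_git_commit 'git commit-tree abc123' and is_git_commit 'git log --grep commit' both fail. The pattern is unanchored, so it can still over-match text such as echo "git commit", and it will not catch forms like git -C dir commit; treat it as a starting point.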

The dispatch logic, helpers, and full hash computation come after the prelude. The full file is around 7.5 KB. When installing on a new machine, copy the canonical version verbatim; do not retype it.
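
Since the full script is not reproduced here, the following is only a sketch of how the trivial-diff and doc-only skips might be written. The helper names are hypothetical; the 20-line threshold and the extension list are the ones given in the article. Counting added plus deleted lines via git diff --numstat is an assumption about how “total lines” is measured.

```shell
# Sketch only: helper names are hypothetical.
TRIVIAL_LINES=20

# Sum added + deleted lines from `git diff --numstat` output on stdin.
numstat_total() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { n += $1 + $2 } END { print n + 0 }'
}

# Succeed when none of the paths on stdin (one per line) is a code file.
is_doc_only() {
  ! grep -Eq '\.(go|sql|sh|py|ts|tsx|js)$'
}

# Succeed when the gate should step aside for the current working tree.
should_skip_review() {
  local total
  total="$( { git diff --cached --numstat HEAD; git diff --numstat; } | numstat_total)"
  [ "$total" -lt "$TRIVIAL_LINES" ] && return 0
  { git diff --cached --name-only HEAD; git diff --name-only; } | is_doc_only
}
```

Keeping the two checks as small filters over stdin makes them easy to exercise with canned input, without needing a repository.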

§The Prompt Template

The prompt template is the forcing function. The hook substitutes a {{DIFF}} placeholder with the materialized git diff --cached HEAD; git diff output (capped at 200 KB) and hands the result back as the deny reason. The next agent turn sees that prompt as input.
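
One way the substitution might be done, as a sketch: render_prompt and MAX_DIFF_BYTES are hypothetical names, and the split-around-the-placeholder approach is my choice, not necessarily the author's. It assumes the template contains the placeholder exactly once.

```shell
# Sketch only: render_prompt and MAX_DIFF_BYTES are hypothetical names.
MAX_DIFF_BYTES=204800   # 200 KB cap, as described above

# Print the template with the {{DIFF}} placeholder replaced by the diff
# read from stdin. Fails if the template has no placeholder.
# $1 = path to the template file.
render_prompt() {
  local template diff placeholder='{{DIFF}}'
  template="$(cat "$1")" || return 1
  case "$template" in
    *"$placeholder"*) ;;
    *) return 1 ;;
  esac
  diff="$(head -c "$MAX_DIFF_BYTES")"
  # Split around the placeholder rather than using ${var//pat/repl}: diff
  # text is arbitrary, and characters such as "&" or "\" can be treated
  # specially in a replacement string depending on the shell version.
  printf '%s%s%s\n' "${template%%"$placeholder"*}" "$diff" "${template#*"$placeholder"}"
}
```

Usage would look like `{ git diff --cached HEAD; git diff; } | render_prompt ~/.claude/hooks/review-prompt-template.md`, with the output passed on as the deny reason.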

The structure that actually works:

Opens with framing: “Your job is to find issues that would embarrass us in code review or break in production. Your job is NOT to manufacture findings to look thorough.”
Inlines the diff so the sub-agent doesn’t have to compute its own.
Constraints: numbered findings, file:line, no greeting, no padding. Findings have an explicit quality bar — concrete defects only. Style preferences and doc-clarity nudges go under Suggestions, not Findings.
An “if you genuinely have nothing” list of common gaps to consult once: backend consistency, doc-vs-code drift, test-vs-claim, magic numbers, error paths, sink safety. The list is a checklist, not a quota — if everything passes, return CLEAN.
Required output sections. An agent can decline to think. It cannot silently decline to produce a section: the missing section is visible. An empty Findings section is allowed and should be rendered as (none), not padded.
Re-review discipline for round 2+: prior findings should already be addressed (fixed, disputed-in-writing, or deferred-with-TODO). Findings that were addressed via dispute do NOT get re-raised — that’s confirmation bias against accepted answers, and it’s the most common failure mode of strict review loops.
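
Because a missing section is visible, it can also be checked mechanically. The following is a hypothetical helper, not part of the hook as described; the heading names are the five required output sections from the template.

```shell
# Sketch only: missing_sections is a hypothetical checker for the
# sub-agent's reply, not part of the author's hook.

# Print the name of every required section heading missing from the
# review text on stdin; succeed only when none is missing.
missing_sections() {
  local review heading missing=0
  review="$(cat)"
  for heading in "Findings" "Suggestions" "Paired-backend audit" \
                 "Doc-vs-code walkthrough" "Verdict"; do
    if ! printf '%s\n' "$review" | grep -Fqx "## $heading"; then
      printf '%s\n' "$heading"
      missing=1
    fi
  done
  return "$missing"
}
```

A reply that skips a section then fails the check by name, which is a cheap signal that the sub-agent declined part of the work.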

The line between “find every defect” and “manufacture findings to look thorough” is what separates a useful gate from review theater. Both failure modes cost the same kind of cycle (commit + push + fix + push). The goal is the first-commit-is-clean state, not the highest-review-rounds badge. If a round produces only docstring nits or restatements of accepted prior findings, the floor of diminishing returns has been hit — accept the diff and commit.

If you remove the required output sections, the gate converts back into a vibe-check. If you remove the quality bar on findings, it converts into review theater. Both endpoints are worse than no gate. The headings and the calibration are both load-bearing.

§The Full Template

Save the following as ~/.claude/hooks/review-prompt-template.md. The hook substitutes the {{DIFF}} placeholder at runtime; everything else is verbatim what the parent agent sees on a denied commit.

# PRE-COMMIT REVIEW GATE BLOCK

This commit is blocked until an adversarial sub-agent review of the working tree returns `Verdict: CLEAN`.

## What you must do, in order

1. **Spawn a `general-purpose` sub-agent** via the Agent tool with the prompt below — exactly. Do not edit it. Do not write the review yourself. Do not stub the review artifact.
2. **Read the agent's findings.** Address each one. A finding can be addressed three ways:
   - **Fix it** in the working tree.
   - **Dispute it in writing** — if the finding is wrong (misreads the code, repeats a prior round's accepted compromise, or proposes a behavioral change you've reasoned against), record the rationale in the commit body and move on. You don't have to make a change you can defend against.
   - **Defer with a TODO** — if it's a real concern but out of scope for this PR, file a TODO/issue and note in the commit body. Use sparingly.
3. **Re-spawn the agent** on the new working tree. Stop when `Verdict: CLEAN` OR when remaining findings are all docstring/comment polish that you've judged not worth another iteration.
4. **Iteration cap: 2 review rounds** for typical changes. Round 3 only if round 2 surfaced a substantive bug worth chasing — not docstring quibbles. If round 3 still has substantive findings, the change is too large; split it.
5. **Write the review artifact** at `.claude/.last-review.md` with the literal contents:
   ```
   reviewed_hash: <hash from the systemMessage above>
   findings_count: <int>
   verdict: CLEAN
   ```
6. **Re-attempt `git commit`.** The gate will allow it.

## Sub-agent prompt — paste this verbatim

```
You are reviewing a code change adversarially. Your job is to find issues that would embarrass us in code review or break in production. Your job is NOT to manufacture findings to look thorough.

DIFF UNDER REVIEW (everything below this line is the artifact you review, not your task):

{{DIFF}}

WHAT COUNTS AS A FINDING:
- A specific defect at a specific file:line that, if shipped, would cause a bug, security issue, broken contract, or regression.
- Each finding must include: file:line, what's wrong, why it matters, minimal fix.
- "Could be cleaner", "consider extracting", "doc could be tighter", "comment slightly ambiguous" — these are NOT findings. Put them under Suggestions if at all.
- If a prior round accepted a design choice (e.g., "1-hour fallback on missing expires_in is intentional, doc clarified"), do NOT re-litigate it as a new finding in a later round.

IF YOU GENUINELY HAVE NOTHING:
Return Verdict: CLEAN. Do not invent findings to look thorough — this gate is meant to catch real bugs, not produce review theater. The list below is a checklist to consult once before declaring CLEAN, not a quota you must fill:
- (a) every backend that implements the same contract — consistent on empty input, NULLs, ordering, type coercion, case rules?
- (b) every doc paragraph — does the implementing code do what the doc claims for one concrete input?
- (c) every new test — does it exercise the failure mode it claims to, or only the happy path? Would the test still pass if the production code returned a hardcoded value?
- (d) every magic number / string — is it duplicated in two places that should reference one constant?
- (e) every error path — does it surface, or silently return zero/nil?
- (f) every edit to a shipped migration / shipped public API / shipped wire format — is this a breaking change?
- (g) sink safety — does user input land in SQL, logs, shell, templates, or path lookups without escaping?
- (g2) clear-text logging via Logger.Log — any function literally named Log being passed a struct transitively containing err.Error() or other taint-sourced strings? CodeQL's `go/clear-text-logging` traces this; if real, it must be intended (audit/forensics) or refactored.

RE-REVIEW DISCIPLINE (round 2+):
- Verify each prior finding is addressed (fixed, disputed-in-writing, or deferred-with-TODO). If addressed via dispute, do NOT re-raise the same finding — that's confirmation bias against accepted answers.
- Scan the new code for issues introduced by the fix itself.
- If nothing new and prior findings stand fixed: return CLEAN.

REQUIRED OUTPUT SECTIONS (every section must be present, even if empty):

## Findings
Numbered. file:line. What's wrong. Why it matters. Fix.
Empty list = "## Findings\n(none)" — do not pad.

## Suggestions
Optional. Things you'd raise in a code review but wouldn't block on. Not counted toward verdict.

## Paired-backend audit
One row per (contract × backend × edge case). Mark each cell as covered/missing/inconsistent. If your code only has one backend, write "N/A — single implementation."

## Doc-vs-code walkthrough
For each new or changed doc paragraph in the diff, paste the implementing code below it and trace one concrete input. If the trace contradicts the doc, that's a finding above. Skip paragraphs that only restate field types or are pure boilerplate.

## Verdict
Either:
  Verdict: CLEAN
or:
  Verdict: N findings
(where N is the count in your Findings section, NOT including Suggestions)
```

## Why this gate exists

To catch real bugs before commit. Every uncaught issue costs commit + push + PR-review + fix-commit + push. **But** review theater (manufactured findings to look thorough) costs the same loop in reverse — it makes a clean commit take 5 hours instead of 30 minutes. The goal is the first-commit-is-clean state, not maximum-rounds-of-review.

If a round produces only docstring nits or restatements of accepted prior findings, you've reached the diminishing-returns floor — accept the diff and commit.

§Settings Wiring

Add the following entries to ~/.claude/settings.json. They are additive. Existing Notification and Stop hooks remain unchanged.

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "$HOME/.claude/hooks/review-gate.sh precommit" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit|Write|MultiEdit",
        "hooks": [
          { "type": "command", "command": "$HOME/.claude/hooks/review-gate.sh size-warn" }
        ]
      }
    ]
  }
}

§What the Agent Does When Blocked

The deny reason returned by the hook contains the full review prompt plus the materialized diff. The agent’s next turn fires with this as input. Required behavior:

Read the deny reason. Do not write the review yourself. Spawn the sub-agent.
Pass the prompt verbatim to a general-purpose Agent tool call.
Read the agent’s findings.
Address each one. Real bugs and broken contracts get fixed. Findings the agent disagrees with after careful reading get disputed in writing in the commit body. Real concerns out of scope for this PR get deferred with a TODO. A failed test or a security defect is never disputable.
Re-spawn the sub-agent against the new working-tree diff.
Stop when Verdict: CLEAN OR when remaining findings are docstring/comment polish not worth another iteration.
Write .claude/.last-review.md with the current diff hash and verdict: CLEAN.
Re-attempt git commit. The gate now allows it.

The cap is two rounds for typical changes, three for exceptional cases. Round three should only happen if round two surfaced a substantive bug worth chasing, not docstring quibbles. If round three still has substantive findings, the change is too large for a single commit. git reset, re-stage smaller batches, and commit each piece separately through the gate.

The asymmetric cost is worth being explicit about. A real bug that ships costs the full commit + push + PR-review + fix-commit + push round-trip. A spurious finding that gets fixed anyway costs an iteration of the same loop. They look similar in chat but they are very different in cost — the spurious one was avoidable. The dispute-in-writing path exists exactly so the gate doesn’t force you to pay one cost to avoid the other.

§Verification: Proving the Gate Actually Works

Once the files are installed, walk through this end-to-end test. Skip it and you’ll discover the gate is misconfigured the first time you actually need it.

Trivial change. One-line typo fix. Try to commit. The gate allows it (diff is under 20 lines).
Doc-only change. Add 50 lines to a .md file. Try to commit. The gate allows it.
Real code change, no artifact. Add 50 lines to a .go file. Try to commit. Hook denies with the full review prompt.
Spawn the sub-agent. Fix any findings, write the artifact with the matching diff hash and verdict: CLEAN. Re-attempt commit. Gate allows.
Edit again after artifact. Modify the same file. Try to commit. Hash mismatch causes the gate to re-deny.
Kill switch. touch ~/.claude/.review-gate-disabled. Commit succeeds without review. Remove the flag and re-test.
Plan mode. touch .claude/.plan-mode. Gate is skipped.

The hook script ships with a self-test that exercises a dozen scenarios via synthetic JSON input. Run it once at install time and again after any edit to the script.
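
The shipped self-test is not shown here, so the following is only a sketch of how a synthetic-input check might be driven from outside the script, assuming jq is installed. The helper name hook_decision is hypothetical, and the input field names (hook_event_name, tool_name, tool_input.command, cwd) reflect my understanding of the hook input JSON; only cwd is confirmed by the script prelude above.

```shell
# Sketch only: hook_decision is a hypothetical harness helper.

# Feed a synthetic PreToolUse payload to a hook and print its decision.
# $1 = path to the hook script, $2 = Bash command string, $3 = cwd.
# Prints "deny" when the hook returns a deny decision, "allow" otherwise.
hook_decision() {
  local hook="$1" cmd="$2" cwd="$3" out
  out="$(jq -n --arg c "$cmd" --arg d "$cwd" \
    '{hook_event_name: "PreToolUse", tool_name: "Bash",
      tool_input: {command: $c}, cwd: $d}' \
    | "$hook" precommit 2>/dev/null)"
  # No output means the hook stepped aside and the tool call proceeds.
  if [ -z "$out" ]; then
    echo allow
    return 0
  fi
  printf '%s' "$out" | jq -r '.hookSpecificOutput.permissionDecision // "allow"'
}
```

A test case is then one line, for example `[ "$(hook_decision ~/.claude/hooks/review-gate.sh 'ls -la' "$PWD")" = "allow" ]`.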

§The Metrics That Matter

Two numbers, one to minimize and one to keep an eye on.

To minimize: PRs where post-commit review surfaces real bugs the gate missed. This is the gate’s actual job — catching bugs before commit. Any fix(...): address PR #N critical review pattern on a feature branch is evidence the gate let something through. Treat the appearance of such a commit as a process failure, investigate why the sub-agent missed the finding, and harden the prompt accordingly.
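
Counting those review-fix commits on a branch can be sketched as below. The function name and the exact subject pattern are assumptions; adapt the pattern to your own commit conventions.

```shell
# Sketch only: the function name and subject pattern are assumptions.

# Count commits in a revision range whose subject looks like a
# "fix(<scope>): address PR ..." review-fix commit.
# $1 = revision range, e.g. origin/main..HEAD
count_review_fix_commits() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
  # keeps the function's exit status clean in that case.
  git log --format='%s' "$1" | grep -Ec '^fix\([^)]*\): address PR' || true
}
```

Running count_review_fix_commits origin/main..HEAD on a feature branch gives the number to drive toward zero.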

To watch: average review rounds per PR. If this number drifts above two, the gate has tipped from useful to theater. The sub-agent has started manufacturing findings to look thorough, the parent agent is dutifully fixing them, and you’re paying a multi-round cycle for diminishing returns. The cure is calibrating the prompt — tightening the quality bar on findings, reinforcing the “if you have nothing, return CLEAN” guidance, and making the dispute-in-writing path explicit. A round-five-converges PR is a configuration smell, not careful work.

The 800-line soft size cap is a third signal worth tracking. The PostToolUse systemMessage warning fires once the cumulative diff crosses the threshold. A 1700-line diff is structurally hard to review, no matter how careful the gate is. The single biggest intervention is keeping changes small.

§What Did Not Work

Three approaches I tried before the pre-commit hook, in order, all failed.

In-context self-review checklist as a memory entry. The agent reads the checklist, claims to apply it, doesn’t actually walk it procedurally. Vibe-check failure mode.

Procedural checklist with explicit items. Better than freeform but still skipped under time pressure or context pressure. Items get noted as “I should fix this” and then quietly moved past as the conversation continues.

Stop hook (post-tool, end-of-turn). Wrong trigger point. By the time Stop fires, the commit may already have happened. The cost cycle has already started.

The pre-commit hook works because the agent literally cannot commit without it returning success. There is no introspection, no “I think I’ve done enough.” The gate is mechanical. That is the entire point.

A second failure mode worth naming. The first version of this prompt told the sub-agent to “find every issue” and instructed the parent agent to “fix every finding, no deferrals, loop until CLEAN.” That’s the symmetric over-correction of the vibe-check problem and it produces review theater: the sub-agent learns to invent findings to look thorough, the parent agent dutifully fixes them, and a 30-minute change converges in five rounds across two hours. The fix is the calibration described above — the quality bar on what counts as a finding, the dispute-in-writing path, the explicit permission to return CLEAN when nothing real surfaces. A loop with no exit other than “agent exhausts itself producing nits” is not a leash, it’s a treadmill.

§The Memory Entry That Supports the Gate

The memory entry is short. It does not duplicate the procedure. Its only job is to remind the agent the gate exists and what to do when blocked:

---
name: Pre-commit review gate (mechanical, externalized)
description: Adversarial review of every code change happens BEFORE commit, enforced by a hook.
type: feedback
---

The previous in-context "self-review checklist" was unreliable: the agent
treated it as a vibe-check rather than a procedure. Replacement: a
mechanical pre-commit gate at ~/.claude/hooks/review-gate.sh blocks
`git commit` until an adversarial sub-agent review of the working-tree
diff returns `Verdict: CLEAN` and the artifact at
`.claude/.last-review.md` carries the matching `reviewed_hash`.

When the hook fires:
1. Read the block reason for the full sub-agent prompt.
2. Spawn a `general-purpose` Agent with that prompt verbatim.
3. Address each finding: fix it, dispute it in writing in the commit body
   (if you've reasoned the finding is wrong), or defer with a TODO (used
   sparingly for real concerns out of scope for this PR).
4. Re-spawn until `Verdict: CLEAN`, OR until remaining findings are
   docstring/comment polish not worth another iteration. Cap: 2 rounds
   for typical changes, 3 only if round 2 surfaced a substantive bug
   worth chasing. If round 3 still has substantive findings, the change
   is too large and must be split.
5. Write the artifact and re-attempt `git commit`.

Source-of-truth files:
- ~/.claude/hooks/review-gate.sh
- ~/.claude/hooks/review-prompt-template.md
- ~/.claude/settings.json (PreToolUse + PostToolUse entries)

The gate does the enforcement. The memory entry just hands the agent a map of where the gate lives so it can comply efficiently when blocked, rather than burning a turn figuring out what hit it.

§Known Limitations

A few things this gate does not solve, by design or by current limitation:

The size cap is soft. The 800-line warning is a systemMessage, not a deny. The agent can ignore it. A future hardening would convert it to a deny on PreToolUse(Edit|Write) once the diff crosses some threshold (say 1500 lines), forcing a split. Right now the cap relies on the agent reading the warning and acting on it.
The diff hash covers the raw diff text, not its meaning. Whitespace changes, formatter runs, and reordering all change the hash. That is correct behavior (any change requires a new review), but it does mean re-running gofmt after a clean review invalidates the artifact and you have to re-review.
The gate doesn’t run in CI. It’s a developer-machine forcing function. CI still needs its own quality gates: lint, security scanning, tests. The gate complements CI, it doesn’t replace it.
The sub-agent inherits project memory. The same gentle reviewer instincts that fail in the parent context can leak into the sub-agent. The prompt’s adversarial framing and required output sections are the primary mitigation. Eternal vigilance against “looks good” responses is required — and equally, against the opposite failure mode of manufactured findings.
Calibration drift in either direction. A prompt that’s too lenient produces a vibe-check; a prompt that’s too aggressive produces review theater. Both states are detectable from the metrics: post-commit review-fix commits indicate the gate is too lenient; PRs converging in five-plus rounds indicate it’s too aggressive. Re-tune when either signal appears. The prompt is a config file, not a relic.
The iteration cap and the size cap interact. A 1700-line PR may need four rounds to converge rather than two or three. Enforce the iteration cap strictly anyway, and let the size cap do more of the work earlier by keeping changes small.

§Why This Belongs in the Series

The “AI on a Leash” thesis is that AI productivity is real but unreviewable at speed, and the only answer that holds up over time is a verification toolchain that catches what humans no longer have hours to catch. The Go articles cover the static and dynamic verification layers: linters, race detection, mutation testing, coverage. Those layers tell you whether the code is correct in the small.

The pre-commit gate is a different layer. It tells you whether the code is correct against the claims being made about it: the docstrings, the doc paragraphs, the test names, the commit message. Those claims are invisible to a linter and are exactly where AI fails most often. The agent writes a doc paragraph that subtly disagrees with the function it documents. The agent writes a test named TestHandlesEmptyInput that does not actually exercise the empty-input path. The lint suite is fine with both.

The pre-commit gate forces a structured re-read of every change against its own claims, before that change becomes a commit. Combined with the toolchain layers, the leash starts to feel like it has the right number of links.

§TL;DR for the Next Agent (or the Next Engineer)

Install ~/.claude/hooks/review-gate.sh (executable) and ~/.claude/hooks/review-prompt-template.md. Copy them, do not retype.
Add the PreToolUse(Bash) and PostToolUse(Edit|Write|MultiEdit) entries to ~/.claude/settings.json pointing at the script with precommit and size-warn arguments respectively.
Add the memory entry pointing at the gate.
When the gate denies a git commit, spawn a general-purpose sub-agent with the prompt the hook handed back. Address each finding (fix, dispute in writing, or defer with a TODO). Re-spawn until Verdict: CLEAN or remaining findings are docstring polish you’ve judged not worth another iteration. Cap at two rounds for typical changes. Write the artifact. Retry the commit.
Track two failure modes:
- fix(...): address PR review commits on feature branches → gate is too lenient, hardening needed.
- PRs that converge in five-plus rounds → gate is too aggressive, prompt needs recalibrating to tighten the finding quality bar and reinforce the “if nothing real, return CLEAN” path.

The leash works only as long as it’s mechanical AND calibrated. A leash that mechanically forces a checkpoint and a calibrated reviewer that knows what counts as a real finding are the same project. Drop either half and the cycle comes back — either the original “the agent claims done without reviewing” cycle, or the symmetric “the agent reviews forever finding nits” cycle. Both cost the same kind of round-trip. Both are avoidable.

Craig Johnston · 22 April 2026