My daughter called me an “unk” the other day. I’ll take it. If Ralph is the young hotshot running AI in a bash loop until the output converges, I’m the uncle who’s been shipping software long enough to know that “it looks right” and “it works” are not the same thing.
TL;DR
Refine the verification toolchain until bad output can’t survive.
Don’t get me wrong, I want Ralph on my team. I need his enthusiasm and boundary pushing. I need to see just how good he is, and for that he needs real-world constraints.
“Instead of freaking out about these constraints, embrace them. Let them guide you. Constraints drive innovation and force focus. Instead of trying to remove them, use them to your advantage.” ― 37Signals, Getting Real: The Smarter, Faster, Easier Way to Build a Web Application
I run a small team. We’ve worked together for decades, shipping software on tight budgets for clients who care about one thing: does it work? Nobody calls to congratulate us on our cyclomatic complexity scores or test coverage percentages. They call when the system is down. So that’s what this is about. Not theory. Just what we’ve learned shipping AI-assisted code and cleaning up after it when we didn’t verify enough.
AI-Verified Development Series | Previous: Go’s Constraints and Idioms Make AI Coding Better
This article covers the language-agnostic principles for closing the gap between code that looks right and code that’s verified. The language doesn’t matter. The toolchain does.
Where This Differs from Ralph
Geoffrey Huntley’s Ralph has become a popular meme in AI coding circles. The technique is a bash loop: while :; do cat PROMPT.md | claude-code ; done. Run the AI, refine the prompt, run it again. Eventually the output converges. I don’t hate Ralph. There’s common DNA here: both approaches use persistent instruction files, both iterate, both recognize that operator skill matters, and both are pragmatic about shipping on a budget.
Where we diverge is on what “done” means.
Ralph refines the prompt until the output looks right. I refine the verification toolchain until bad output can’t survive. Ralph’s quality ceiling is the operator’s ability to judge the result. My quality ceiling is the verification suite’s ability to catch problems the operator would miss.
And to be clear: I don’t implicitly trust make verify either. But I’m not willing to throw out 50 years of software engineering best practices in hopes of perfect convergence. Ralph’s output might look great today, then three weeks from now the system goes down from a memory leak that no amount of prompt tuning would have caught. Or worse: you thought OAuth was wired up properly, but half your endpoints are open to the public exposing PII. The code converged. It ran. It looked right. It just wasn’t verified.
That’s the gap this series is about closing.
Why AI Guardrails Matter
LLMs produce statistically likely code. They pattern-match against training data. When you ask an LLM to write a sorting function, it produces the most probable sorting function given its training corpus. Useful, but dangerous.
The training data contains bugs. Real codebases have real bugs, and LLMs learn those patterns just as readily as they learn correct patterns. An LLM will reproduce an off-by-one error with the same confidence it reproduces a correct boundary check. It has no concept of correctness, only probability.
This means guardrails must be:
- Automated. Human discipline fails at scale. You will not catch every subtle bug in every AI-generated function through code review alone. You might catch it on Monday morning when you’re fresh. You won’t catch it at 4pm on Friday when you’re reviewing the fifteenth AI-generated file.
- Deterministic. Same input, same result, every time. A guardrail that sometimes catches a bug is worse than no guardrail at all, because it creates false confidence.
- Fast. Seconds, not minutes. AI generates code at machine speed. Your verification must run at machine speed too, or the feedback loop breaks. If your CI takes 20 minutes, the AI (and you) have already moved on to the next feature by the time you discover the last one was broken.
Traditional code review doesn’t scale when AI generates 10x the code volume. A senior engineer reviewing AI output line-by-line will bottleneck your entire team. Verification has to keep up.
We’ve always automated everything we could, long before AI entered the picture. CI pipelines, deployment scripts, linting, test suites. AI didn’t change the philosophy. It changed the volume. More generated code means more verification, and human time is too expensive to spend on problems a machine can catch.
The point isn’t to replace human judgment. No amount of static analysis replaces a thoughtful code review. But no reviewer on my team reviews code that hasn’t already passed every automated check. When a human reviews AI-generated code, the question should be “is this the right approach?” not “does this even work?”
The Verification Hierarchy
Not all verification is equal. Each level catches failures that the previous level misses. Skip a level and you leave a category of bugs completely undetected.
Level 1: Static Analysis
Static analysis examines code without running it. It catches structural problems: type errors, unused variables, unreachable code, complexity violations, known security anti-patterns. It runs in milliseconds and costs nothing.
Every language has mature static analysis tooling. ESLint and TypeScript for JavaScript/TypeScript. pylint, mypy, and ruff for Python. golangci-lint for Go. clippy for Rust. Checkstyle and SpotBugs for Java.
AI-generated code frequently triggers static analysis warnings that the AI itself would never flag. Variable shadowing, unused imports from hallucinated packages, type coercions that lose precision. These are exactly the kinds of issues that look fine during a quick review but cause subtle runtime failures.
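Here’s a contrived Python sketch of what that looks like. The `fastutilz` import is a made-up name standing in for a hallucinated package:

```python
import json        # unused import: ruff/flake8 report F401
import fastutilz   # hallucinated package: the install fails, and it's unused anyway


def parse_prices(list):   # parameter shadows the built-in `list` (pylint W0622)
    return [float(price) for price in list]
```

`ruff check .` or `pylint` reports every one of these in milliseconds, with no human attention spent.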
Run static analysis first. If it fails, don’t bother with the rest.
Level 2: Unit Tests
Unit tests verify that individual functions produce correct output for known inputs. They’re the foundation of your verification strategy.
The key word is “known.” Unit tests encode specific expectations: given this input, produce this output. They’re a contract between the developer (or the AI) and the codebase. When a unit test fails, you know exactly what broke and where.
AI is good at writing unit tests. It’s also good at writing bad unit tests. We’ll cover this problem in detail in the Anti-Vaporware section below.
A minimum of 80% code coverage is a reasonable baseline. Not because 80% is a magic number, but because anything less means significant portions of your codebase are completely unverified. And “unverified” with AI-generated code means “assumed correct because it looks right.”
Level 3: Integration and E2E Tests
Unit tests verify functions in isolation. Integration tests verify that components work together. The distinction matters because AI-generated code frequently gets individual functions right while getting the connections between them wrong.
Common failures at this level:
- Interface mismatches. The AI writes a client that sends JSON with camelCase keys. The server expects snake_case. Both sides work perfectly in unit tests.
- Protocol errors. The AI generates an HTTP client that doesn’t handle 429 (rate limit) responses. Unit tests mock the API and never return 429.
- Timing issues. The AI writes async code that assumes operations complete in a specific order. Unit tests run synchronously and never expose the race condition.
- State management. The AI creates a function that works correctly the first time it’s called but corrupts shared state on subsequent calls. Single-invocation unit tests pass.
Integration tests are slower and more complex than unit tests. That’s the tradeoff. They catch an entire category of bugs that unit tests are structurally incapable of detecting.
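As a sketch of what catching the first failure mode above (the interface mismatch) looks like: a minimal integration-style test that exercises a real in-process server instead of a mock, assuming Flask’s built-in test client. The route and field names are invented.

```python
# Minimal sketch of an integration-style test using Flask's test client.
# Client and server are exercised together, so a camelCase/snake_case
# mismatch surfaces here instead of in production.
from flask import Flask, jsonify, request


def create_app():
    app = Flask(__name__)

    @app.post("/users")
    def create_user():
        payload = request.get_json()
        # Server contract: snake_case keys
        return jsonify({"user_id": 1, "full_name": payload["full_name"]}), 201

    return app


def test_create_user_round_trip():
    client = create_app().test_client()
    # If an AI-generated client sent {"fullName": ...} instead, this request
    # (and therefore the test) would fail right here.
    resp = client.post("/users", json={"full_name": "Ada Lovelace"})
    assert resp.status_code == 201
    assert resp.get_json()["full_name"] == "Ada Lovelace"
```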
Level 4: Mutation Testing
Mutation testing is the final check. It answers the question no other testing level can: “would my tests actually catch a real bug?”
Discovering mutation testing shifted how my team thinks about quality. We went from “just ship it, the tests pass” to actual software engineering, asking whether the tests themselves are any good. It’s a different mindset.
The concept is simple. A mutation testing tool takes your code, makes small changes (mutations), and runs your tests against each mutated version. If your tests still pass when the code is wrong, your tests are worthless.
Mutations include:
- Changing `+` to `-`
- Changing `>` to `>=`
- Changing `true` to `false`
- Removing function calls
- Replacing return values with defaults
If your tests catch the mutation (fail when the code is wrong), the mutant is “killed.” If your tests still pass, the mutant “survived,” and you have a test gap.
A 60% mutation kill rate is a reasonable starting threshold. This means at least 60% of all introduced mutations cause test failures. Below that, your test suite provides a false sense of security.
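Here’s a small Python illustration of a single mutant, with an invented function so the boundary is obvious:

```python
# Code under test
def is_adult(age):
    return age >= 18


# A mutation tool would, among other things, change >= to > and re-run the
# tests. This test never touches the boundary, so that mutant survives:
def test_is_adult_weak():
    assert is_adult(30) is True
    assert is_adult(5) is False


# Adding the boundary case kills it: the original returns True for 18,
# the mutated version returns False.
def test_is_adult_boundary():
    assert is_adult(18) is True
```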
What Each Level Catches
| Verification Level | What It Catches | What It Misses |
|---|---|---|
| Static Analysis | Type errors, style violations, security patterns, dead code | Logic errors, incorrect behavior, integration failures |
| Unit Tests | Wrong outputs, edge cases, error handling, regressions | Interface mismatches, timing bugs, system-level failures |
| Integration/E2E Tests | Component interaction bugs, protocol errors, state issues | Subtle logic bugs masked by test design |
| Mutation Testing | Weak tests, tautological assertions, test gaps | Nothing (it validates the tests themselves) |
They stack. You need all four.
The Anti-Vaporware Problem
Here’s the problem that keeps biting me: AI writes tautological tests.
A tautological test verifies “the code does what the code does.” But then again, don’t we all? It’s the natural instinct. You write a function, then write a test that basically restates the function. if (a == 1 || a != 1) is always true and it tells you nothing. AI just does it faster and at scale.
AI can generate unit tests almost for free. The time-value equation has changed completely. What used to take a developer an afternoon now takes seconds. That’s a genuine shift. But a thousand tautological tests are worth exactly zero. The cost dropped, but worthless is still worthless.
Here’s what this looks like in practice:
```python
# AI-generated function
def calculate_discount(price, percentage):
    return price * (1 - percentage / 100)


# AI-generated "test" - tautological
def test_calculate_discount():
    assert calculate_discount(100, 20) == 100 * (1 - 20 / 100)
```
Read that test carefully. The assertion literally reimplements the function’s logic. It computes the expected value using the same formula as the function under test. If the formula is wrong, both the function and the test are wrong in the same way, and the test passes.
This isn’t a contrived example. I see this pattern constantly in AI-generated test suites. The AI copies the implementation into the test and calls it verification. Your coverage report says 100%. Your mutation score says 0%.
The fix is to encode expected outputs as concrete values that come from business requirements, not from reimplementing the logic:
```python
# Meaningful test - encodes business requirements
def test_calculate_discount():
    assert calculate_discount(100, 20) == 80.0   # 20% off $100 = $80
    assert calculate_discount(100, 0) == 100.0   # No discount
    assert calculate_discount(100, 100) == 0.0   # Full discount
    assert calculate_discount(0, 50) == 0.0      # Zero price
```
Each assertion here is a business requirement expressed as a concrete number. The test doesn’t know or care how the function computes the discount. It only cares that the output matches the expected business result. If someone changes the formula, these tests catch it because the expected values are independent of the implementation.
Mutation testing is the cure for tautological tests at scale. The worst offenders derive their expected value by calling the code under test, or assert nothing specific at all, so when a mutation tool changes 1 - percentage / 100 to 1 + percentage / 100, the “expected” side drifts right along with the bug and the test still passes. The meaningful test’s hardcoded 80.0 doesn’t move, the assertion fails, and the mutant is killed.
Set a 60% minimum mutation score threshold in your CI pipeline. Below that, your test suite has too many survivors, meaning too many cases where real bugs would go undetected.
Dead code detection is the companion tool. If code can be deleted without any test failures, one of two things is true: it’s untested, or it’s unnecessary. Both are problems. In AI-generated codebases, dead code accumulates quickly because the AI generates “just in case” utility functions that nothing actually calls. Remove them. Less code means fewer bugs and a smaller attack surface.
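A hedged Python example of the pattern (module and function names invented): one helper sits on a real code path, the other is a “just in case” function nothing imports. A dead-code tool such as vulture, or a coverage report with zero hits, flags the second.

```python
# utils.py -- AI-generated helpers; only one has a caller anywhere.

def normalize_email(address: str) -> str:
    """Called by the registration flow: kept."""
    return address.strip().lower()


def normalize_phone(number: str) -> str:
    """Never imported or called anywhere: dead code. Delete it."""
    return "".join(ch for ch in number if ch.isdigit())
```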
Reproducibility and Pinned Dependencies
AI suggests packages from its training data. This creates three distinct problems:
Outdated packages. The AI learned about version 2.3 of a library. The current version is 4.1 with a completely different API. The code compiles against 2.3 but your lock file resolves to 4.1.
Vulnerable packages. The AI suggests a package that had no known CVEs at training time but has three critical ones now. The AI has no mechanism to check this.
Hallucinated packages. The AI invents a package name that doesn’t exist. This sounds harmless until you realize that attackers register these hallucinated package names on npm, PyPI, and other registries. They publish malicious packages under the names that AI commonly hallucinates. This is a documented supply chain attack vector, not a theoretical risk.
The solution is aggressive pinning. Pin everything to exact, immutable references.
Dependency versions
Use exact versions, not ranges. "lodash": "4.17.21" not "lodash": "^4.17.0". Commit your lock files. Run security audits in CI.
Container images
Tags are mutable. Someone can push a new image to python:3.12-slim at any time, and your next build pulls different code than your last build.
```dockerfile
# Bad: mutable tag, you get whatever "slim" means today
FROM python:3.12-slim

# Good: immutable digest, you get exactly this image, forever
FROM python:3.12-slim@sha256:abcdef123456...
```
The digest pins the exact image. It cannot change. Your build today produces the same result as your build six months from now, even if the upstream maintainer pushes a compromised image to the 3.12-slim tag.
Tool versions in CI
Pin your CI tool versions too. If your CI installs golangci-lint@latest, a new release with different default rules will randomly break your builds. Or worse, a new release with relaxed rules will stop catching bugs your previous version caught.
This applies to everything: linters, formatters, security scanners, test runners. Pin them all. Update them deliberately, not accidentally.
The general principle: if AI suggested it, verify it exists, verify it’s maintained, and pin it to an exact version. Trust nothing that comes from statistical inference about package names.
Acceptance Criteria Before Code
Without acceptance criteria, AI generates code that satisfies itself. It writes an implementation, then writes tests that verify the implementation does what the implementation does. We’ve already seen why that’s useless.
Acceptance criteria break this cycle by defining expected behavior before any code exists. The Given/When/Then format forces requirements to be explicit, testable, and independent of implementation:
```
## Feature: User Registration Rate Limiting

### Acceptance Criteria:

- Given a new IP address, when they attempt registration, then allow the request
- Given an IP that has registered 5 accounts in 1 hour, when they attempt another, then return 429 Too Many Requests
- Given a rate-limited IP, when 1 hour has elapsed, then allow registration again
- Given a request from a whitelisted IP, when they attempt registration, then always allow regardless of rate
```
Now the AI has concrete specifications to implement against. Each “then” clause becomes a test assertion with a specific expected outcome. The AI can’t write tautological tests because the expected behavior is defined externally.
Without these criteria, the AI might implement rate limiting that:
- Uses 10 requests per hour instead of 5 (wrong threshold)
- Uses a 30-minute window instead of 1 hour (wrong time window)
- Returns 403 Forbidden instead of 429 Too Many Requests (wrong status code)
- Doesn’t support whitelisting at all (missing feature)
All of these implementations would pass AI-generated tautological tests. None of them meet the actual requirements.
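Those criteria translate directly into assertions whose expected values come from the spec, not from the implementation. A sketch, assuming pytest: `client` and `advance_clock` are hypothetical fixtures standing in for your HTTP test client and time control.

```python
# Sketch: each "then" clause becomes an assertion with a concrete expected
# outcome. `client` and `advance_clock` are hypothetical fixtures.

def test_new_ip_is_allowed(client):
    assert client.register(ip="203.0.113.10").status_code == 201


def test_sixth_registration_in_an_hour_is_rate_limited(client):
    for _ in range(5):
        assert client.register(ip="203.0.113.11").status_code == 201
    assert client.register(ip="203.0.113.11").status_code == 429


def test_rate_limit_expires_after_one_hour(client, advance_clock):
    for _ in range(5):
        client.register(ip="203.0.113.12")
    advance_clock(hours=1)
    assert client.register(ip="203.0.113.12").status_code == 201


def test_whitelisted_ip_is_never_limited(client):
    for _ in range(20):
        assert client.register(ip="198.51.100.1").status_code == 201  # whitelisted
```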
Now, I’m not suggesting you spend hours writing detailed specifications by hand. If the whole point of AI is automation, automate the automation. Have the AI draft the acceptance criteria. Have it draft the Given/When/Then specs, the edge cases, the error scenarios. Then have it critically review its own output before you even look at it.
I treat this the same way I treat code review: I don’t want to see it until it’s passed its own tests. I’d rather smell-test AI’s assessment of its own criteria than read its first draft. I’m like an evil college professor who hands back your paper unread and says “revise it.” Not because I’m lazy, but because the second draft is always better than the first, and my time is better spent evaluating a refined version than line-editing a rough one.
The format matters less than the discipline: define what “correct” means before the AI writes implementation code. Whether a human or an AI drafts those criteria, they need to exist and they need to be independent of the implementation.
Documentation as Code: CLAUDE.md
CLAUDE.md (or the equivalent context file for other AI tools) is persistent context that loads at the start of every AI coding session. Unlike chat history, which disappears when you start a new conversation, CLAUDE.md survives session restarts. It’s the closest thing to “institutional memory” that AI coding tools have.
In my experience, it’s the file that makes the biggest difference in AI output quality.
What makes a good CLAUDE.md?

- Keep it concise: every line competes for context window space. A 500-line CLAUDE.md wastes tokens on boilerplate.
- Make it actionable: include commands the AI can run, not philosophy it should follow. “Run make verify before committing” works. “Strive for high code quality” doesn’t.
- Use thresholds, not vibes: “80% coverage minimum” is something the AI can check against. “Good coverage” means the AI decides 47% is fine because the tests look reasonable.

And the AI treats CLAUDE.md as ground truth. If it says “run tests before every commit,” the AI does. If it says nothing about tests, the AI skips them.
Here is a complete, copy-paste-ready universal CLAUDE.md template. Adapt the placeholder commands to your language and toolchain:
# CLAUDE.md - Universal AI-Verified Development
## Verification (Required Before Every Commit)
Run the full verification suite:
```
[VERIFY_COMMAND] # e.g., make verify, npm run verify, ./scripts/verify.sh
```
Individual checks (all must pass):
```
[LINT_COMMAND] # Static analysis
[TEST_COMMAND] # Unit tests with race/thread-safety detection
[COVERAGE_COMMAND] # Coverage report (threshold: 80% minimum)
[SECURITY_COMMAND] # Security scanning
[MUTATION_COMMAND] # Mutation testing (threshold: 60% minimum)
[DEADCODE_COMMAND] # Dead code detection
[BUILD_COMMAND] # Build/release validation
```
## Code Quality Thresholds
- Test coverage: ≥80% (no exceptions for "simple" code)
- Mutation score: ≥60% (proves tests catch real bugs)
- Cyclomatic complexity: ≤10 per function
- Cognitive complexity: ≤15 per function
- Function arguments: ≤5
## Dependency Policy
- All dependencies pinned to exact versions
- Lock files committed and up to date
- No deprecated or archived dependencies
- Security audit clean (`[AUDIT_COMMAND]`)
- Container images use digests, not tags
## AI-Specific Rules
1. **No tautological tests**: tests must encode expected outputs, not reimplement logic
2. **No hallucinated imports**: verify every dependency exists and is actively maintained
3. **Human review required**: all code requires human review before merge
4. **Acceptance criteria first**: do not write code without Given/When/Then acceptance criteria
5. **Explain non-obvious decisions**: if the AI makes an architectural choice, it must explain why in a comment or commit message
## Git Policy
- Conventional commits: `type(scope): description`
- Types: feat, fix, refactor, test, docs, chore, ci
- Commits are atomic: one logical change per commit
- No force-pushing to shared branches
- Branch names: `type/short-description` (e.g., `feat/user-auth`)
What each section does
The Verification section is the one the AI reads most carefully. It knows to run these commands before committing. Replace the placeholders with your stack’s commands: npm run lint, golangci-lint run, ruff check ., whatever fits.
The Code Quality Thresholds give the AI concrete numbers to check against. The “no exceptions for simple code” note matters because AI will argue that a trivial function doesn’t need testing. Simple code has bugs too, and simple untested code becomes complex untested code when someone extends it later.
The Dependency Policy forces the AI to verify its own suggestions. Without an explicit instruction to check that packages exist and are maintained, the AI won’t bother.
The AI-Specific Rules address failure modes that human developers don’t have. Humans rarely write tautological tests because they understand what they’re testing. AI writes them constantly. Humans rarely invent package names. AI does it routinely.
Conventional commits in the Git Policy create a parseable history. `feat(auth): add rate limiting to registration endpoint` tells you what changed at 2am. `update code` tells you nothing.
Property-Based and Contract Testing
Property-based testing attacks the tautological test problem from a different angle. Instead of testing specific input/output pairs, you define properties that must hold for all valid inputs. The testing framework generates hundreds of random inputs and checks each one.
The AI can’t game this by reimplementing the function, because it doesn’t control the inputs.
Consider our discount function. A property-based approach defines invariants:
```
# Property: discount should always be between 0 and original price
for any price >= 0 and 0 <= percentage <= 100:
    0 <= calculate_discount(price, percentage) <= price

# Property: 0% discount returns original price
for any price >= 0:
    calculate_discount(price, 0) == price

# Property: 100% discount returns 0
for any price >= 0:
    calculate_discount(price, 100) == 0

# Property: discount is monotonic, higher percentage means lower result
for any price > 0 and 0 <= p1 < p2 <= 100:
    calculate_discount(price, p1) > calculate_discount(price, p2)
```
These properties are mathematical truths about discounts, not reimplementations of the formula. A function that computes price * percentage / 100 instead of price * (1 - percentage / 100), returning the discount amount rather than the discounted price, violates the second property immediately: a 0% discount should return the original price, but it returns 0. The testing framework will find the counterexample automatically.
Every major language has property-based testing libraries. Hypothesis for Python. rapid for Go. proptest for Rust. fast-check for TypeScript. QuickCheck for Haskell (where the concept originated). jqwik for Java.
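Here’s roughly what the first two properties look like with Hypothesis, as a sketch; the strategy bounds mirror the invariants above.

```python
from hypothesis import given, strategies as st


def calculate_discount(price, percentage):   # same function as earlier
    return price * (1 - percentage / 100)


prices = st.floats(min_value=0, max_value=1e6, allow_nan=False, allow_infinity=False)
percentages = st.floats(min_value=0, max_value=100, allow_nan=False, allow_infinity=False)


@given(price=prices, percentage=percentages)
def test_discount_stays_within_bounds(price, percentage):
    # Hypothesis generates hundreds of (price, percentage) pairs per run
    assert 0 <= calculate_discount(price, percentage) <= price


@given(price=prices)
def test_zero_percent_discount_returns_original_price(price):
    assert calculate_discount(price, 0) == price
```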
Contract Testing for Services
For microservices and API-driven architectures, contract testing verifies that services agree on API shapes. This matters for AI-generated code because the AI frequently generates API clients that drift from the actual API.
The pattern is consumer-driven contracts:
- The consumer (the service calling the API) defines what it expects: which endpoints, which request/response shapes, which status codes.
- The contract is stored as a shared artifact.
- The provider (the service implementing the API) runs the contract as a test, proving it delivers what the consumer expects.
When AI generates an API client, it’s guessing the API shape based on training data. Contract tests catch the drift immediately. The client says “I expect GET /users/:id to return { name: string, email: string }.” The provider test verifies that’s what it actually returns. When the AI generates a client expecting { username: string, mail: string }, the contract test fails.
Tools like Pact, Spring Cloud Contract, and Dredd implement this pattern across languages. The important thing isn’t which tool you use. It’s that the contract exists as a testable artifact, not as a shared assumption in the AI’s training data.
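The tools differ, but the shape of the idea fits in a few lines. A minimal framework-free sketch: the contract lives in a shared file, and the provider’s test suite asserts against it. The file path, keys, and `client` fixture are hypothetical; a real setup would use Pact or Spring Cloud Contract instead of raw JSON.

```python
# Provider-side sketch: load the shared contract artifact and verify the
# actual response matches it. "contracts/users_api.json" and `client`
# are hypothetical placeholders.
import json


def test_get_user_matches_consumer_contract(client):
    with open("contracts/users_api.json") as f:
        contract = json.load(f)["GET /users/:id"]

    response = client.get("/users/42")
    assert response.status_code == contract["status"]
    assert set(response.get_json().keys()) == set(contract["response_fields"])
```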
Combining Property-Based and Mutation Testing
These two techniques work well together. Property-based tests define invariants that must hold for all inputs. Mutation testing verifies that the property-based tests actually catch violations.
If a mutation tool changes price * (1 - percentage / 100) to price * (1 + percentage / 100), the property “result should be between 0 and price” fails immediately for any positive percentage. The mutation is killed.
If your property-based tests survive mutations, your properties are too weak. Tighten them until mutations consistently cause failures.
Building the Verification Pipeline
None of this works if it requires human discipline to follow. Someone will skip it the first time they’re in a hurry. With AI coding tools, everyone is always in a hurry because the code comes so fast.
Here’s my rule: I don’t look at AI-generated code until every criterion passes. Human time is too valuable for eyeballing logic errors that a machine can catch. I don’t care how the AI writes the code. I care that it meets external criteria. If the linter, the tests, the coverage check, the security scan, and the mutation score all pass, then I’ll look at it. Not before.
The enforcement mechanism is your CI pipeline. Every commit, every pull request, every merge triggers the full verification hierarchy:
- Static analysis runs first (fast, catches structural issues)
- Unit tests run second (fast, catches logic errors)
- Integration tests run third (slower, catches interaction bugs)
- Mutation testing runs last (slowest, validates test quality)
If any level fails, the pipeline stops. The AI and the developer both get immediate feedback about what broke and why.
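A minimal sketch of that ordering as a local script, in the same placeholder style as the CLAUDE.md template; substitute your real commands.

```python
#!/usr/bin/env python3
"""Run the verification hierarchy in order; stop at the first failure.
The commands are placeholders, matching the CLAUDE.md template above."""
import subprocess
import sys

STAGES = [
    ("static analysis", "[LINT_COMMAND]"),
    ("unit tests", "[TEST_COMMAND]"),
    ("integration tests", "[INTEGRATION_TEST_COMMAND]"),  # placeholder, not in the template
    ("mutation testing", "[MUTATION_COMMAND]"),
]

for name, command in STAGES:
    print(f"==> {name}: {command}")
    if subprocess.run(command, shell=True).returncode != 0:
        print(f"{name} failed; stopping the pipeline.")
        sys.exit(1)

print("All verification stages passed.")
```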
For AI coding tools specifically, the verification should also run locally before commit. CLAUDE.md instructs the AI to run verification commands before committing. This catches issues in seconds instead of waiting for CI. The CI pipeline is the backstop, not the primary check.
Pre-commit Hooks
Git pre-commit hooks run verification automatically before every commit. The developer (or AI) can’t skip them without deliberate effort:
```sh
#!/bin/sh
# .git/hooks/pre-commit

# Run static analysis
[LINT_COMMAND] || exit 1

# Run tests
[TEST_COMMAND] || exit 1

# Check coverage
[COVERAGE_COMMAND] || exit 1
```
Pre-commit hooks run on the developer’s machine. They’re fast because they only check changed files (in most configurations). They catch the obvious issues before code even reaches CI.
Combine pre-commit hooks with CI for defense in depth. The hook catches issues locally in seconds. CI catches issues that the hook might miss (integration tests, mutation testing) in minutes. Both must pass.
Making Mutation Testing Practical
Mutation testing has a reputation for being slow. Running thousands of mutations against your full test suite takes time. There are practical strategies to make it viable:
Incremental mutation testing. Only mutate files that changed in the current commit or pull request. This reduces the mutation space from “entire codebase” to “changed code,” which typically runs in seconds or minutes instead of hours.
Targeted mutation operators. Not all mutations are equally valuable. Start with the mutations most likely to catch real bugs: arithmetic operator changes, comparison boundary changes, and return value modifications. Skip esoteric mutations until your team is comfortable with the basics.
Nightly full runs. Run full-codebase mutation testing on a nightly schedule. This catches regressions in areas that weren’t directly modified but might be affected by upstream changes. Use incremental runs for PR feedback and full runs for comprehensive analysis.
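Here’s a sketch of the incremental strategy: mutate only the Python files changed relative to the main branch. The mutation command is a placeholder; check your tool’s docs (mutmut, Stryker, cargo-mutants, and similar) for the flag that restricts which paths get mutated.

```python
#!/usr/bin/env python3
"""Incremental mutation testing sketch: only mutate files changed on this
branch. [MUTATION_COMMAND] is a placeholder for your tool's invocation."""
import subprocess
import sys

diff = subprocess.run(
    ["git", "diff", "--name-only", "origin/main...HEAD"],  # assumes a 'main' branch
    capture_output=True, text=True, check=True,
)
changed = [path for path in diff.stdout.splitlines() if path.endswith(".py")]

if not changed:
    print("No Python source changed; skipping mutation run.")
    sys.exit(0)

# Placeholder invocation: restrict the mutation run to the changed files.
command = "[MUTATION_COMMAND] " + " ".join(changed)
sys.exit(subprocess.run(command, shell=True).returncode)
```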
Closing
None of this is language-specific. The tools change between ecosystems, but the approach is the same: automate verification, kill tautological tests, pin dependencies, define acceptance criteria before writing code, and use CLAUDE.md to make the AI follow the same rules your team follows.
The hard part is doing all of it consistently. Knowing about mutation testing doesn’t help if you don’t run it. That’s why everything here is designed to be automated and enforced by machines, not by willpower.
This blog post, titled: "Ralph's Uncle: The AI Verification Loop" by Craig Johnston, is licensed under a Creative Commons Attribution 4.0 International License.
