The principles article covered why verification matters. The precursor Go article covered why Go’s constraints help AI produce better code. This article provides every configuration file you need. Drop these files into your Go project, change the module path, and you have a verification pipeline that catches the mistakes AI makes before they reach human review. One less round trip between you and the AI.
TL;DR: Install the Go Leash
Every config file in this article is available as a Claude Code skill. It audits your Go project, tells you what’s missing, and generates the files adapted to your module:
/plugin marketplace add txn2/claude-skills
/plugin install go-leash@txn2-claude-skills
Run `/go-leash` in your Go project. It scaffolds `.golangci.yml`, `revive.toml`, `GNUmakefile`, `codecov.yml`, and `CLAUDE.md`, then optionally runs `make verify`.
Frankly, I enjoy this part of working with AI. I get to think of new ways to torture it into producing sound code. The AI needs strict rules, but it can also help generate those rules. Give it higher-level goals and it’ll propose linter configs, test strategies, complexity thresholds. Then you run its code through the gauntlet and see what survives.
AI on a Leash Series | Previous: Ralph’s Uncle covers verification principles, mutation testing philosophy, and why human review remains the firewall
Why This Much Verification
None of the tools in this article are new. golangci-lint, gremlins, Codecov: they all existed before AI coding tools. Most Go teams wouldn’t bother with this much verification infrastructure. A small team of trusted developers doesn’t need 43 linters and mutation testing. Code review from people you trust catches most problems.
AI changes the math. Whatever productivity multiplier you get, your review capacity stays the same. Ten developers producing code at ten times their normal volume still have the same number of human reviewers. The review bottleneck doesn’t scale with the output.
This isn’t about slowing AI down to accommodate humans. It’s about demanding standards beyond what you’d normally ask of your trusted team. Your senior engineers don’t need a linter to tell them not to shadow builtins. They don’t need mutation testing to write boundary tests. But AI isn’t your senior engineer, even when it writes code like one. AI is productive and it lies to you every now and then. Humans lie too, but you don’t expect your trusted experts to confidently hand you broken code with a straight face. AI does exactly that.
The consequences scale with the output. More code means more surface area for subtle bugs that pass a quick review. You have two options: reduce AI to human speed and read every line yourself, or invest your time in building a verification pipeline that catches its mistakes automatically. This article is the second option.
Getting real productivity out of AI meant constantly correcting it and forcing it to test its own assertions. Each correction became a new rule, a new lint check, a new test threshold, so the same mistake wouldn’t survive the next session. I needed a strict language with a mature toolchain that could enforce those constraints. Go gave me that.
Project Structure
Go projects follow a standard layout. AI coding tools trained on thousands of Go repositories recognize this structure and adjust their behavior accordingly.
myproject/
├── cmd/
│   └── myproject/
│       └── main.go
├── pkg/
│   ├── api/
│   │   ├── handler.go
│   │   └── handler_test.go
│   └── service/
│       ├── service.go
│       └── service_test.go
├── internal/
│   └── config/
│       └── config.go
├── .golangci.yml
├── revive.toml
├── GNUmakefile
├── codecov.yml
├── CLAUDE.md
├── go.mod
└── go.sum
Three directories matter:
cmd/ holds application entrypoints. Each subdirectory is a main package that produces one binary. Keep these thin. Parse flags, wire dependencies, call into pkg/ or internal/.
pkg/ holds public library code. Anything here can be imported by external projects. When an AI sees code in pkg/, it writes with the public API in mind: exported types, documented functions, stable interfaces.
internal/ holds private application code. The Go compiler enforces this boundary. Code in internal/ cannot be imported by packages outside the module. When an AI sees internal/, it knows it can use unexported types freely and doesn’t need to worry about backward compatibility.
Tests live next to the code they test. handler.go gets handler_test.go in the same directory. The AI doesn’t need to guess where tests belong.
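Concretely, a thin `cmd/` entrypoint that wires into the other two directories might look like the sketch below. The module path, `config.Load`, and `api.Serve` are hypothetical placeholders for whatever your project actually exposes:

```go
// cmd/myproject/main.go - a sketch of a thin entrypoint; the imported
// packages and functions are placeholders, not a prescribed API.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"

	"example.com/myproject/internal/config" // hypothetical module path
	"example.com/myproject/pkg/api"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	cfg, err := config.Load()
	if err != nil {
		log.Fatalf("loading config: %v", err)
	}
	if err := api.Serve(ctx, cfg); err != nil {
		log.Fatalf("server exited: %v", err)
	}
}
```

All the real logic lives in `pkg/` and `internal/`, where it can be tested; `main` only wires things together and handles shutdown.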
The layout works because it’s predictable. An AI that has seen this structure thousands of times during training doesn’t waste context trying to figure out your project’s custom organization. It just starts writing code in the right places.
A caveat: the Go team doesn’t endorse any official project layout, and plenty of successful Go projects skip pkg/ entirely. The layout shown here works for projects that export library code. If everything is internal, you can drop pkg/ and keep things flatter. The point is consistency within a project, not adherence to a template.
Note the configuration files in the project root. .golangci.yml, revive.toml, GNUmakefile, codecov.yml, and CLAUDE.md form the verification infrastructure. The AI reads these files to understand your project’s rules.
Every file in this article is shown in full. No snippets, no “add your own settings here.” Copy them, change the module path, and you have a working verification pipeline.
You don’t need all of this on day one. If you’re starting a new project, begin with the Makefile, golangci-lint config, and CLAUDE.md. Add mutation testing and Codecov as the project matures. The principles article covers when to adopt each layer.
Revive Linter Configuration
Revive is a fast, configurable Go linter that replaces golint. The configuration below targets the specific mistakes AI-generated code makes most often.
Create revive.toml in your project root:
# revive.toml - AI on a Leash for Go
ignoreGeneratedHeader = false
severity = "warning"
confidence = 0.8
errorCode = 1
warningCode = 1
[rule.blank-imports]
[rule.context-as-argument]
[rule.context-keys-type]
[rule.dot-imports]
[rule.error-return]
[rule.error-strings]
[rule.error-naming]
[rule.exported]
[rule.increment-decrement]
[rule.var-naming]
[rule.var-declaration]
[rule.package-comments]
[rule.range]
[rule.receiver-naming]
[rule.time-naming]
[rule.unexported-return]
[rule.indent-error-flow]
[rule.errorf]
[rule.empty-block]
[rule.superfluous-else]
[rule.unused-parameter]
[rule.unreachable-code]
[rule.redefines-builtin-id]
[rule.modifies-value-receiver]
[rule.waitgroup-by-value]
[rule.atomic]
[rule.range-val-in-closure]
[rule.range-val-address]
[rule.constant-logical-expr]
[rule.identical-branches]
[rule.unconditional-recursion]
[rule.duplicated-imports]
[rule.unhandled-error]
arguments = [["fmt.Printf", "fmt.Println", "fmt.Print"]]
[rule.max-public-structs]
arguments = [5]
[rule.cyclomatic]
arguments = [10]
[rule.cognitive-complexity]
arguments = [15]
[rule.argument-limit]
arguments = [5]
[rule.function-result-limit]
arguments = [3]
Each rule targets a real problem, and the rules fall into three groups.
Complexity limits keep functions small and reviewable. cyclomatic at 10 prevents deep branching. cognitive-complexity at 15 catches functions that are hard to understand even without deep nesting (nested loops with breaks and continues). argument-limit at 5 forces options structs instead of eight-parameter functions. function-result-limit at 3 stops AI from returning four or five values. max-public-structs at 5 per file prevents god-files.
I once watched a developer spend an entire day crafting a 1000-character regex. For about 24 hours, he was the world’s foremost expert on what that regex did. After that, nobody, including him, could confidently modify it. AI has the same problem. Complex functions aren’t just hard for humans to review; they’re hard for AI to extend without introducing regressions in the next session.
Concurrency safety catches bugs that only surface under load. modifies-value-receiver catches methods that silently lose modifications because Go passes value receivers by copy. waitgroup-by-value and atomic catch copied sync primitives and non-atomic shared state. range-val-in-closure and range-val-address catch the classic loop variable capture bug where goroutines process the last element N times instead of each element once. AI writes plausible-looking concurrent code that falls apart under real traffic.
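To make the loop-variable capture bug concrete, here is a minimal sketch of the pattern range-val-in-closure flags. The function and names are invented for illustration:

```go
// Illustrative only: notifyAll and sendEmail are made-up names.
package notify

import "sync"

func sendEmail(addr string) { /* ... */ }

func notifyAll(users []string) {
	var wg sync.WaitGroup
	for _, u := range users {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Before Go 1.22, u is shared across iterations: every goroutine
			// may observe the final element instead of its own.
			sendEmail(u)
		}()
	}
	wg.Wait()
}
```

The classic fix is to pass `u` into the closure as an argument. In Go 1.22 and later the loop variable is scoped per iteration and this particular bug disappears, which is why the golangci-lint config below also enables copyloopvar to flag the now-redundant manual copies.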
Logic errors catch refactoring mistakes. constant-logical-expr and identical-branches flag if x || true or if/else blocks with identical code. unconditional-recursion catches infinite recursion. unhandled-error on fmt.Printf/Println/Print catches dropped error returns. These are surprisingly common after the AI refactors a conditional and forgets to update both branches.
The remaining rules enforce standard Go idioms: indent-error-flow for early returns, unused-parameter for dead parameters, redefines-builtin-id to stop AI from naming variables error or len, context-as-argument for context-first parameters, and blank-imports to flag blank imports outside main or test packages. AI mostly gets these right, but “mostly” isn’t good enough for production code.
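A tiny, contrived sketch of what two of these rules catch; the function is illustrative, not from a real codebase:

```go
// Contrived example for illustration only.
func count(items []string, filter string) int {
	len := 0 // redefines-builtin-id: shadows the builtin len
	for range items {
		len++
	}
	return len // unused-parameter: filter is never used
}
```

The compiler accepts both problems without complaint, which is exactly why the rules exist.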
Setting errorCode and warningCode both to 1 means any violation fails the build. No warnings that get ignored. Every rule is enforced or removed.
Full golangci-lint Configuration
golangci-lint aggregates dozens of linters into a single tool. The configuration below enables 43 linters tuned to keep AI on a leash.
Create .golangci.yml in your project root:
# .golangci.yml - AI on a Leash for Go
run:
  timeout: 5m
  modules-download-mode: readonly
output:
  formats:
    - format: colored-line-number
  sort-results: true
linters:
  enable:
    - bodyclose
    - copyloopvar
    - depguard
    - dogsled
    - dupl
    - errcheck
    - errorlint
    - exhaustive
    - funlen
    - gochecknoinits
    - goconst
    - gocritic
    - gocyclo
    - godot
    - gofmt
    - goimports
    - gosec
    - gosimple
    - govet
    - ineffassign
    - misspell
    - nakedret
    - nestif
    - nilerr
    - noctx
    - nolintlint
    - prealloc
    - predeclared
    - revive
    - rowserrcheck
    - sqlclosecheck
    - staticcheck
    - stylecheck
    - thelper
    - tparallel
    - typecheck
    - unconvert
    - unparam
    - unused
    - wastedassign
    - whitespace
    - wrapcheck
    - wsl
linters-settings:
  funlen:
    lines: 80
    statements: 50
  gocyclo:
    min-complexity: 10
  dupl:
    threshold: 100
  goconst:
    min-len: 3
    min-occurrences: 3
  gocritic:
    enabled-tags:
      - diagnostic
      - experimental
      - opinionated
      - performance
      - style
  nestif:
    min-complexity: 5
  gosec:
    excludes:
      - G104 # Unhandled errors (errcheck handles this better)
  revive:
    config-file: revive.toml
  depguard:
    rules:
      main:
        deny:
          - pkg: "io/ioutil"
            desc: "Deprecated: use io and os packages instead"
          - pkg: "github.com/pkg/errors"
            desc: "Use fmt.Errorf with %w instead"
issues:
  max-issues-per-linter: 0
  max-same-issues: 0
  exclude-rules:
    - path: _test\.go
      linters:
        - funlen
        - dupl
The pattern across all 43 linters is the same: each one catches a specific mistake that AI makes often enough to justify the config line. A few are worth calling out.
Error handling is where AI fails most visibly. errcheck catches dropped error returns on Close() and Flush() calls. errorlint enforces modern error wrapping: %w instead of %v, so errors.Is and errors.As keep working down the chain. wrapcheck requires wrapping errors from external packages so error chains don’t lose context. nilerr catches functions that check err != nil and then return nil instead of the error. AI trained on older Go code gets these wrong constantly.
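A sketch of what that looks like in practice. The first branch shows the nilerr bug, the second shows the wrapping errorlint and wrapcheck expect; the Config type and field are placeholders:

```go
package config

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config is a placeholder type for the example.
type Config struct {
	Port int `json:"port"`
}

func loadConfig(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, nil // nilerr: error checked, then discarded; caller sees "success" with a nil *Config
	}
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parsing config %s: %w", path, err) // wrap with %w so the chain survives
	}
	return &cfg, nil
}
```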
Security linters catch vulnerabilities the AI introduces without knowing it. gosec finds SQL injection, hardcoded credentials, weak crypto, and unvalidated file paths. noctx catches HTTP requests without context (no timeouts, no cancellation). We exclude G104 from gosec because errcheck handles unhandled errors better.
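And a sketch of the shape noctx pushes you toward: the outbound call carries a context, so it can time out and be cancelled. The URL and function name are placeholders:

```go
package client

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchOrders builds the request with a context; a bare http.Get(url) would be flagged by noctx.
func fetchOrders(ctx context.Context) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://api.example.com/orders", nil)
	if err != nil {
		return nil, fmt.Errorf("building orders request: %w", err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("fetching orders: %w", err)
	}
	defer resp.Body.Close() // bodyclose checks for this too

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("reading orders response: %w", err)
	}
	return body, nil
}
```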
Code quality enforcement keeps AI output manageable. funlen at 80 lines forces decomposition (test files are exempt). dupl at 100 tokens catches copy-paste logic. nestif at complexity 5 catches nested conditionals six or seven levels deep. gocritic with all tags enabled runs the most thorough checks, including the experimental tag, which catches subtle bugs like assigning the result of appending to one slice onto a different slice.
Database safety matters if your project touches SQL. rowserrcheck verifies sql.Rows.Err() is checked after iteration. sqlclosecheck ensures rows and statements are closed. AI forgets defer rows.Close() in about half the database code it generates, causing connection pool exhaustion under load.
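A hedged sketch of the full rows lifecycle those two linters check for; the table and column names are invented for the example:

```go
package store

import (
	"context"
	"database/sql"
	"fmt"
)

func activeUserEmails(ctx context.Context, db *sql.DB) ([]string, error) {
	rows, err := db.QueryContext(ctx, `SELECT email FROM users WHERE active = true`)
	if err != nil {
		return nil, fmt.Errorf("querying active users: %w", err)
	}
	defer rows.Close() // sqlclosecheck: without this, the pooled connection leaks

	var emails []string
	for rows.Next() {
		var email string
		if err := rows.Scan(&email); err != nil {
			return nil, fmt.Errorf("scanning email: %w", err)
		}
		emails = append(emails, email)
	}
	if err := rows.Err(); err != nil { // rowserrcheck: iteration errors are silent without this
		return nil, fmt.Errorf("iterating users: %w", err)
	}
	return emails, nil
}
```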
Testing quality is enforced by thelper (ensures test helpers call t.Helper()) and tparallel (catches parallel tests that share state unsafely). depguard blocks deprecated imports like io/ioutil that still appear in AI training data.
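For thelper, the rule is simple: a test helper should begin with t.Helper(), so a failure inside the helper is attributed to the calling test. A sketch with a stand-in Server type (the type and constructor are invented to keep the example self-contained):

```go
package api

import "testing"

// Server is a stand-in type so the helper below is self-contained.
type Server struct{ addr string }

func NewServer(addr string) (*Server, error) { return &Server{addr: addr}, nil }
func (s *Server) Close()                     {}

// mustNewServer is the shape thelper enforces: t.Helper() first, failure handling
// inside, cleanup registered so each test gets an isolated instance.
func mustNewServer(t *testing.T) *Server {
	t.Helper()
	srv, err := NewServer("127.0.0.1:0")
	if err != nil {
		t.Fatalf("NewServer: %v", err)
	}
	t.Cleanup(srv.Close)
	return srv
}
```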
The remaining linters handle style (wsl, goimports, godot), dead code (wastedassign, predeclared), and correctness (goconst, unconvert). The issues section sets max-issues-per-linter: 0 and max-same-issues: 0, meaning all violations are reported. No suppression.
gofumpt and lll (line length) are deliberately excluded. gofumpt adds formatting opinions beyond gofmt that fight with AI training data. Line length limits produce awkward wrapping in Go, especially with error wrapping and struct literals.
GNUmakefile with All Verification Targets
The Makefile ties the tools together. One command runs every check. The AI executes make verify and gets a pass/fail answer.
Create GNUmakefile in your project root:
# GNUmakefile - AI on a Leash for Go
MODULE := $(shell head -1 go.mod | awk '{print $$2}')
COVERAGE_THRESHOLD := 80
MUTATION_THRESHOLD := 60

.PHONY: all test lint coverage patch-coverage security mutation deadcode bench profile build-check verify clean

all: verify

## Test targets
test:
	go test -race -shuffle=on -count=1 ./...

test-verbose:
	go test -race -shuffle=on -count=1 -v ./...

## Lint targets
lint:
	golangci-lint run ./...
	go vet ./...

## Coverage
coverage:
	go test -coverprofile=coverage.out -covermode=atomic ./...
	@COVERAGE=$$(go tool cover -func=coverage.out | grep total | awk '{print $$3}' | sed 's/%//'); \
	echo "Coverage: $${COVERAGE}%"; \
	if [ $$(echo "$${COVERAGE} < $(COVERAGE_THRESHOLD)" | bc -l) -eq 1 ]; then \
		echo "FAIL: Coverage $${COVERAGE}% is below threshold $(COVERAGE_THRESHOLD)%"; \
		exit 1; \
	fi

## Patch coverage (only changed lines vs main branch)
PATCH_THRESHOLD := 80

patch-coverage:
	@MERGE_BASE=$$(git merge-base main HEAD 2>/dev/null || echo "HEAD"); \
	if [ "$$MERGE_BASE" = "$$(git rev-parse HEAD)" ]; then \
		echo "On main branch, skipping patch coverage"; exit 0; \
	fi; \
	CHANGED_FILES=$$(git diff --name-only "$$MERGE_BASE"...HEAD -- '*.go' | grep -v '_test.go' || true); \
	if [ -z "$$CHANGED_FILES" ]; then \
		echo "No non-test Go files changed, skipping patch coverage"; exit 0; \
	fi; \
	echo "Changed files: $$CHANGED_FILES"; \
	TOTAL=0; COVERED=0; \
	for FILE in $$CHANGED_FILES; do \
		if [ ! -f "$$FILE" ]; then continue; fi; \
		PKG=$$(dirname "$$FILE"); \
		go test -coverprofile=patch_cov.tmp -covermode=atomic "./$$PKG" > /dev/null 2>&1 || true; \
		if [ -f patch_cov.tmp ]; then \
			FILE_COV=$$(go tool cover -func=patch_cov.tmp 2>/dev/null | grep "$$FILE" || true); \
			rm -f patch_cov.tmp; \
		fi; \
	done; \
	go test -coverprofile=coverage.out -covermode=atomic ./... > /dev/null 2>&1; \
	for FILE in $$CHANGED_FILES; do \
		LINES=$$(git diff --unified=0 "$$MERGE_BASE"...HEAD -- "$$FILE" | \
			grep '^@@' | sed 's/.*+\([0-9]*\).*/\1/' || true); \
		for LINE in $$LINES; do \
			TOTAL=$$((TOTAL + 1)); \
			if grep -q "$$FILE:$$LINE" coverage.out 2>/dev/null; then \
				COVERED=$$((COVERED + 1)); \
			fi; \
		done; \
	done; \
	if [ "$$TOTAL" -eq 0 ]; then \
		echo "No executable changed lines detected"; exit 0; \
	fi; \
	PCT=$$((COVERED * 100 / TOTAL)); \
	echo "Patch coverage: $$COVERED/$$TOTAL lines = $$PCT%"; \
	if [ "$$PCT" -lt "$(PATCH_THRESHOLD)" ]; then \
		echo "FAIL: Patch coverage $$PCT% is below threshold $(PATCH_THRESHOLD)%"; \
		exit 1; \
	fi

## Security scanning
security:
	gosec ./...
	govulncheck ./...

## Mutation testing (requires gremlins: go install github.com/go-gremlins/gremlins/cmd/gremlins@latest)
mutation:
	gremlins unleash --workers 1 --timeout-coefficient 3 --threshold-efficacy $(MUTATION_THRESHOLD)

## Dead code detection
deadcode:
	deadcode ./...

## Benchmarking
bench:
	go test -bench=. -benchmem -count=3 -run=^$$ ./... | tee bench.txt

## Profiling (CPU and memory)
profile:
	go test -bench=. -benchmem -cpuprofile=cpu.prof -memprofile=mem.prof -run=^$$ ./...
	@echo "CPU profile: go tool pprof cpu.prof"
	@echo "Memory profile: go tool pprof mem.prof"

## Build validation
build-check:
	go build ./...
	go mod verify

## Meta-target: runs everything
verify: lint test coverage patch-coverage security deadcode build-check
	@echo "All verification checks passed."

clean:
	rm -f coverage.out bench.txt cpu.prof mem.prof
	rm -rf dist/
Every target is designed to fail loudly and exit with a non-zero status code.
test runs with -race to catch data races and -count=1 to disable test caching. Cached test results mask flaky tests. During development, you always want fresh runs. -shuffle=on randomizes test execution order, catching tests that accidentally depend on running in a specific sequence. AI-generated tests often share setup state in ways that only work when tests run in declaration order.
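Here is the kind of ordering dependence -shuffle=on exposes, sketched with invented names:

```go
package session

import "testing"

// Shared package-level state: the second test only passes if the first ran before it.
var sessionCache = map[string]string{}

func TestLogin(t *testing.T) {
	sessionCache["alice"] = "token-123"
}

func TestFetchProfile(t *testing.T) {
	if sessionCache["alice"] == "" {
		t.Fatal("expected a seeded session; this test depends on TestLogin running first")
	}
}
```

With -shuffle=on, the second test fails whenever it happens to run first, which is the point: the hidden dependency becomes visible instead of latent.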
coverage uses -covermode=atomic for accurate coverage under concurrent code. The shell script parses the coverage percentage and compares it against the threshold. Below 80%? Build fails.
patch-coverage checks coverage of only the lines you changed relative to the main branch. Overall project coverage can stay at 80% while a new PR introduces 200 untested lines hidden in a large codebase. Patch coverage catches that. The implementation diffs your branch against the merge base, identifies changed lines in non-test Go files, and checks each against the coverage profile. If you’re on the main branch or have no Go changes, it skips gracefully. For a more sophisticated version with per-file reporting and specific uncovered line numbers, see scripts/patch-coverage.sh.
security runs two scanners. gosec finds security anti-patterns in your code. govulncheck checks your dependencies against the Go vulnerability database.
mutation runs gremlins with a 60% efficacy threshold. --workers 1 produces deterministic results across runs. --timeout-coefficient 3 gives mutations extra time before declaring a timeout kill. More on mutation testing below.
deadcode finds unreachable functions. Note that ineffassign is already enabled in the golangci-lint configuration, so running it separately here would be redundant.
bench and profile are not part of verify because they’re diagnostic tools, not pass/fail gates. They exist so the AI can investigate performance problems when you point it at slow code. More on these below.
build-check runs go build to verify compilation and go mod verify to confirm your dependencies haven’t been tampered with since download.
verify runs everything in order. This is the single command the AI executes before declaring “done.” If it passes, you have reasonable confidence the code is correct.
Why GNUmakefile instead of Makefile? GNU Make looks for GNUmakefile first, then makefile, then Makefile. Using GNUmakefile explicitly signals that this file requires GNU Make features (like the $(shell ...) function used to derive the module path). On macOS and Linux, the default make is GNU Make. On BSD systems (FreeBSD, OpenBSD), the default is BSD make, which has different syntax. The explicit name avoids silent breakage on those systems.
The MODULE variable at the top extracts your module path from go.mod automatically. You never need to hardcode it. This means the Makefile works as-is when copied to any Go project.
Mutation Testing in Go
Test coverage measures which lines your tests execute. Mutation testing measures whether your tests actually verify behavior. The distinction matters.
This is where gremlins changed how my team works. Before we adopted mutation testing, we’d hit our coverage numbers and move on. Ship it, the tests pass. Gremlins forced us to ask a different question: do the tests actually prove anything? It shifted our focus from “code that works” to code we can prove works. That’s a different thing.
A test that runs a function without checking the output gives you coverage. It gives you zero verification. Mutation testing catches this.
How It Works
Gremlins modifies your source code in small, targeted ways (mutations) and runs your tests against each mutation. If your tests still pass after a mutation, the mutant “survived,” meaning your tests failed to detect the change.
Install gremlins:
go install github.com/go-gremlins/gremlins/cmd/gremlins@latest
Consider this function:
// pkg/pricing/discount.go
func ApplyDiscount(price float64, percentage int) float64 {
if percentage < 0 || percentage > 100 {
return price
}
return price * float64(100-percentage) / 100
}
Gremlins creates mutations like these:
- Change `<` to `<=` (boundary mutation)
- Change `>` to `>=` (boundary mutation)
- Change `100` to `99` (constant mutation)
- Change `*` to `/` (arithmetic mutation)
- Change `||` to `&&` (logical mutation)
- Remove the guard clause entirely (statement deletion)
Each mutation produces a modified version of your code. Gremlins compiles it, runs your tests, and records whether the tests caught the change.
Tests That Survive Mutation
Here’s a test that achieves 100% line coverage but fails mutation testing:
func TestApplyDiscount_Basic(t *testing.T) {
result := ApplyDiscount(100, 20)
if result != 80 {
t.Errorf("got %v, want 80", result)
}
}
Full line coverage, zero boundary testing. Change < to <= in the guard clause and the test still passes because it never tests percentage 0 or -1. The mutant survives.
Tests That Kill Mutants
func TestApplyDiscount(t *testing.T) {
tests := []struct {
name string
price float64
percentage int
want float64
}{
{"20% off 100", 100, 20, 80.0},
{"no discount", 100, 0, 100.0},
{"full discount", 100, 100, 0.0},
{"negative percentage unchanged", 100, -1, 100.0},
{"over 100% unchanged", 100, 101, 100.0},
{"boundary: 1%", 100, 1, 99.0},
{"boundary: 99%", 100, 99, 1.0},
{"zero price", 0, 50, 0.0},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := ApplyDiscount(tt.price, tt.percentage)
if got != tt.want {
t.Errorf("ApplyDiscount(%v, %v) = %v, want %v",
tt.price, tt.percentage, got, tt.want)
}
})
}
}
The boundary tests are the key. Testing percentage 0, 1, 99, 100, -1, and 101 covers every edge of the guard clause. If gremlins changes < to <=, the test for percentage 0 fails. If it changes > to >=, the test for percentage 100 fails. If it changes 100 to 99, the test for full discount fails. Every mutant is killed.
AI-generated tests rarely include these boundary cases unless explicitly asked. Mutation testing quantifies the gap: “73% of mutants killed” means 27% of code changes go undetected by your tests.
Running Gremlins
Run mutation testing against your project:
# Run with default settings
gremlins unleash
# Run with a minimum score threshold
gremlins unleash --threshold-efficacy 60
# Run against a specific package
gremlins unleash ./pkg/pricing/...
Gremlins outputs a report showing which mutants survived, which were killed, and which timed out. Timeouts count as killed because a mutation that causes an infinite loop is still detected.
The 60% mutation score threshold in the Makefile is deliberately lower than the 80% coverage threshold. Mutation testing is harder to satisfy. Some mutations are equivalent (they change the code without changing behavior), and these survivors are unavoidable. 60% is a reasonable starting point. Raise it as your test suite matures.
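What an equivalent mutant looks like, for intuition. Whether gremlins generates this exact mutation depends on its operator set, but the survival mechanism is the same:

```go
// processAll visits every element of items in order.
func processAll(items []string, visit func(string)) {
	// Original condition: i < len(items).
	// Equivalent mutant:  i != len(items).
	// Because i only ever increments by one, both conditions stop the loop at
	// exactly the same point. No test can tell them apart, so the mutant survives.
	for i := 0; i < len(items); i++ {
		visit(items[i])
	}
}
```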
Gremlins is younger than mutation testing tools in other ecosystems (Stryker for JS/TS, pitest for Java). It works, and it’s actively maintained, but expect rougher edges. Some Go-specific constructs like goroutine patterns and channel operations don’t have dedicated mutation operators yet. Start with the mutations gremlins does support well: arithmetic, comparisons, boundary conditions, and statement deletion. These cover the most common AI-generated bugs.
Dead Code Detection
AI generates code speculatively. It creates helper functions “in case they’re needed.” It writes utility methods that never get called. It adds struct fields that nothing reads. Dead code accumulates fast with AI assistance.
Two tools catch this:
deadcode
The deadcode tool (part of the Go tools suite) finds unreachable functions:
go install golang.org/x/tools/cmd/deadcode@latest
deadcode ./...
Example output:
pkg/utils/helpers.go:15:6: unreachable func: FormatTimestamp
pkg/utils/helpers.go:28:6: unreachable func: SanitizeInput
pkg/api/middleware.go:42:6: unreachable func: rateLimitMiddleware
AI generated FormatTimestamp, SanitizeInput, and rateLimitMiddleware because they seemed useful. Nothing calls them. They’re dead weight: code that must be maintained, reviewed, and understood by the next person reading the file, but that provides zero value.
ineffassign
ineffassign finds assignments to variables that are never read:
go install github.com/gordonklaus/ineffassign@latest
ineffassign ./...
This catches a pattern AI produces often:
func processOrder(order Order) error {
total := calculateTotal(order) // assigned
discount := lookupDiscount(order.ID) // assigned
total = total - discount // reassigned but...
// AI refactored here and started using a different calculation
finalPrice := order.Subtotal * 0.9
return chargeCustomer(order.CustomerID, finalPrice)
}
The total and discount variables are computed but never used in the final result. The AI refactored partway through and left dead assignments behind. ineffassign catches these immediately.
Remove dead code aggressively. Every unused function is a function that could confuse the AI in the next session. AI reads existing code to understand patterns. Dead code teaches it the wrong patterns.
The Feedback Loop
Dead code detection creates a feedback loop with AI coding tools. When the AI generates a utility function that deadcode flags as unreachable, you have two choices: delete the function, or wire it up if it’s actually useful. Either way, the codebase stays clean.
Over time, running deadcode after every AI session trains you to give more precise instructions. Instead of “add a helper function that might be useful,” you say “add a helper function and use it in the handler.” The AI generates code that’s immediately integrated, not speculative.
Dead code compounds. One unused function isn’t a problem. Twenty unused functions across ten files clutter the codebase and confuse the AI in future sessions.
No Vaporware
Tools like deadcode catch unreachable functions, but there’s a subtler form of dead code that slips past them: code that compiles, passes its own tests, but isn’t wired into the running application. AI is good at this. It creates entire packages with working tests that nothing imports. It writes database migrations for tables that no query ever touches.
You can catch this with tests. A test that scans all packages under pkg/ and verifies each one is imported by at least one non-test file will fail when AI generates a standalone package that isn’t connected to anything. A test that extracts table names from your migration files and verifies each table appears in at least one INSERT, SELECT, UPDATE, or DELETE statement will fail when AI creates a schema without wiring up the data access layer.
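Here is a minimal sketch of the first of those checks. It assumes the test file lives at the repository root and that the modulePath constant matches your go.mod; both are assumptions to adapt, and a production version would read the module path from go.mod instead:

```go
package main

import (
	"go/parser"
	"go/token"
	"os"
	"path/filepath"
	"strings"
	"testing"
)

// modulePath is a placeholder; in a real project, read it from go.mod.
const modulePath = "example.com/myproject"

// TestEveryPkgPackageIsImported fails if a package under pkg/ compiles but is
// never imported by any non-test file in the module.
func TestEveryPkgPackageIsImported(t *testing.T) {
	imports := map[string]bool{} // import paths used by non-test files
	pkgDirs := map[string]bool{} // directories under pkg/ containing Go code

	fset := token.NewFileSet()
	_ = filepath.Walk(".", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".go") {
			return nil
		}
		dir := filepath.ToSlash(filepath.Dir(path))
		if dir == "pkg" || strings.HasPrefix(dir, "pkg/") {
			pkgDirs[dir] = true
		}
		if strings.HasSuffix(path, "_test.go") {
			return nil
		}
		f, perr := parser.ParseFile(fset, path, nil, parser.ImportsOnly)
		if perr != nil {
			return nil
		}
		for _, imp := range f.Imports {
			imports[strings.Trim(imp.Path.Value, `"`)] = true
		}
		return nil
	})

	for dir := range pkgDirs {
		if !imports[modulePath+"/"+dir] {
			t.Errorf("package %s compiles but is never imported by non-test code", dir)
		}
	}
}
```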
There’s an even sneakier variant: the noop loophole. AI creates an interface, writes a no-op implementation that satisfies it, and moves on. The package is imported, the interface compliance assertion compiles, tests pass, coverage looks fine. But the core behavior, writing to an external system, sending a notification, whatever the interface was supposed to do, never actually executes. A noop bypasses every other verification level.
You can catch this too. Go projects use var _ SomeInterface = (*MyStruct)(nil) as a compile-time interface compliance assertion. A test can scan your source for these patterns and verify that every interface with a noop implementation also has a real one:
func TestNoopOnlyInterfaces(t *testing.T) {
pkgDir := filepath.Join(".", "pkg")
implRe := regexp.MustCompile(`var\s+_\s+(\S+)\s*=\s*\(\*(\w+)\)\(nil\)`)
byInterface := make(map[string][]string)
filepath.Walk(pkgDir, func(path string, info os.FileInfo, err error) error {
if err != nil || info.IsDir() || !strings.HasSuffix(info.Name(), ".go") ||
strings.HasSuffix(info.Name(), "_test.go") {
return nil
}
content, _ := os.ReadFile(path)
for _, match := range implRe.FindAllStringSubmatch(string(content), -1) {
byInterface[match[1]] = append(byInterface[match[1]], match[2])
}
return nil
})
for iface, types := range byInterface {
hasNoop, hasReal := false, false
for _, name := range types {
if strings.HasPrefix(strings.ToLower(name), "noop") ||
strings.HasPrefix(strings.ToLower(name), "no_op") {
hasNoop = true
} else {
hasReal = true
}
}
if hasNoop && !hasReal {
t.Errorf("interface %q has only noop implementation(s) %v - "+
"a real implementation is required or the feature should be removed", iface, types)
}
}
}
The regex matches compile-time assertions like var _ Store = (*NoopStore)(nil). It groups them by interface name, then checks whether any interface has only noop implementations and no real ones. If an interface only has a noop, either the real implementation was never delivered or the feature should be removed.
These tests sound aggressive. They are. The alternative is discovering six months later that an entire subsystem the AI built was never called from anywhere, or that it was called but did nothing.
Property-Based Testing
Table-driven tests verify specific inputs produce specific outputs. Property-based tests verify that invariants hold across randomly generated inputs. For keeping AI on a leash, property tests may be the more important of the two. A table-driven test checks the cases you thought of. A property test checks the cases you didn’t.
AI-generated code passes table-driven tests reliably because the AI can see the expected outputs and work backward. Property tests are harder to game. The AI can’t reverse-engineer a function from “the result is always non-negative.” It has to write code that actually satisfies the invariant across all inputs.
Property Testing with rapid
rapid is the property-based testing library to use for Go. It provides typed generators, automatic shrinking, and integrates with Go’s standard testing package.
import "pgregory.net/rapid"
func TestDiscountProperties(t *testing.T) {
rapid.Check(t, func(t *rapid.T) {
price := rapid.Float64Range(0, 10000).Draw(t, "price")
pct := rapid.IntRange(0, 100).Draw(t, "percentage")
result := ApplyDiscount(price, pct)
if result < 0 {
t.Fatalf("negative result: %v for price=%v pct=%v", result, price, pct)
}
if result > price {
t.Fatalf("result %v exceeds price %v", result, price)
}
})
}
No expected outputs here. The test asserts a property: the discounted price is never negative and never exceeds the original price. rapid.Check generates random inputs and verifies the property holds for all of them.
Consider an AI generating ApplyDiscount with a floating-point rounding error that produces -0.0000001 for certain inputs. A table-driven test with five cases would miss it. A property test with hundreds of random inputs is likely to find it.
Why rapid Catches What AI Misses
rapid provides typed generators. Float64Range(0, 10000) generates only valid prices. IntRange(0, 100) generates only valid percentages. No need to skip invalid inputs because the generator never produces them. This matters for AI-generated code.
AI tends to test the obvious path. It generates tests for 20% off $100, maybe 50% off $200. It rarely tests 0.01% off $0.01, or 100% off the maximum float64, or the boundary between 99% and 100%. Rapid generates these cases automatically because it doesn’t have the same blind spots the AI does.
rapid also supports shrinking. When it finds a failing input, it minimizes it to the smallest reproducing case. Instead of “failed for price=8347.293847 pct=73,” you get “failed for price=1.0 pct=1,” which is far easier to debug. Shrinking matters because the AI needs to understand why a test failed to fix the code. A minimal reproducing case is actionable. A random large number is not.
Multiple Properties per Function
A single function can have multiple properties. Test them separately:
func TestApplyDiscount_NeverNegative(t *testing.T) {
rapid.Check(t, func(t *rapid.T) {
price := rapid.Float64Range(0, 10000).Draw(t, "price")
pct := rapid.IntRange(0, 100).Draw(t, "pct")
if result := ApplyDiscount(price, pct); result < 0 {
t.Fatalf("negative result %v", result)
}
})
}
func TestApplyDiscount_MonotonicInPercentage(t *testing.T) {
rapid.Check(t, func(t *rapid.T) {
price := rapid.Float64Range(0.01, 10000).Draw(t, "price")
pct1 := rapid.IntRange(0, 99).Draw(t, "pct1")
pct2 := rapid.IntRange(pct1+1, 100).Draw(t, "pct2")
r1 := ApplyDiscount(price, pct1)
r2 := ApplyDiscount(price, pct2)
if r2 > r1 {
t.Fatalf("higher discount %d%% produced higher price %v > %v", pct2, r2, r1)
}
})
}
func TestApplyDiscount_ZeroPercentIsIdentity(t *testing.T) {
rapid.Check(t, func(t *rapid.T) {
price := rapid.Float64Range(0, 10000).Draw(t, "price")
if result := ApplyDiscount(price, 0); result != price {
t.Fatalf("0%% discount changed price: %v != %v", result, price)
}
})
}
Each test verifies one invariant: results are never negative, higher discounts produce lower prices, and zero discount returns the original price. If the AI implements a discount function that violates any of these, the property tests fail regardless of whether the table-driven tests pass.
This is where property testing pulls its weight for AI verification. The AI can write a function that passes “20% off $100 equals $80” but violates monotonicity for certain float64 values. Table-driven tests would never catch that.
Where to Write Property Tests
Use property-based tests for any pure function: takes inputs, returns outputs, no side effects. Good candidates are parsers, formatters, validators, mathematical calculations, serialization/deserialization round-trips, and any function with documented invariants.
The AI should generate both table-driven tests and property tests. Table-driven tests for specific known behaviors. Property tests for invariants the function must satisfy regardless of input.
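For the round-trip case specifically, the property is that decoding what you just encoded returns the original value. A sketch with rapid, using a made-up Event type:

```go
package event

import (
	"encoding/json"
	"testing"

	"pgregory.net/rapid"
)

// Event is an invented type for the example.
type Event struct {
	ID      string `json:"id"`
	Retries int    `json:"retries"`
}

func TestEventJSONRoundTrip(t *testing.T) {
	rapid.Check(t, func(t *rapid.T) {
		original := Event{
			ID:      rapid.StringMatching(`[a-z0-9]{1,12}`).Draw(t, "id"),
			Retries: rapid.IntRange(0, 100).Draw(t, "retries"),
		}
		data, err := json.Marshal(original)
		if err != nil {
			t.Fatalf("marshal: %v", err)
		}
		var decoded Event
		if err := json.Unmarshal(data, &decoded); err != nil {
			t.Fatalf("unmarshal: %v", err)
		}
		if decoded != original {
			t.Fatalf("round trip changed the value: %+v != %+v", decoded, original)
		}
	})
}
```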
Go’s standard library includes testing/quick for basic property testing, but it’s frozen and there’s an open proposal to deprecate it. Go’s built-in fuzz testing (go test -fuzz) covers some of the same ground. Rapid is the better choice for property testing specifically because of typed generators and shrinking.
When to Use Each Approach
Table-driven tests verify specific known behaviors. “20% off $100 equals $80.” These tests document business rules and serve as executable specifications.
Property-based tests find edge cases you didn’t think of. They’re especially valuable for AI-generated code because AI introduces subtle mathematical errors that pass specific test cases but violate general invariants.
Mutation tests verify that your tests actually check results. They’re meta-tests. They don’t find bugs in your code; they find gaps in your tests.
Use all three. They catch different things and none of them replaces the others.
Benchmarking and Profiling
Everything above catches correctness problems. None of it catches performance problems. AI can produce code that passes every linter, every test, every mutation check, and still allocates memory in a tight loop or runs an O(n²) algorithm where O(n) would do. The AI needs tools to see performance, not just correctness.
Go’s built-in benchmarking and profiling give the AI exactly that.
Benchmarks
Go’s testing package includes benchmarking out of the box. Write benchmark functions alongside your tests:
func BenchmarkApplyDiscount(b *testing.B) {
for i := 0; i < b.N; i++ {
ApplyDiscount(99.99, 15)
}
}
func BenchmarkApplyDiscount_Table(b *testing.B) {
cases := []struct {
price float64
pct int
}{
{100, 20},
{0, 50},
{99999.99, 99},
}
for _, tc := range cases {
b.Run(fmt.Sprintf("price=%.2f_pct=%d", tc.price, tc.pct), func(b *testing.B) {
for i := 0; i < b.N; i++ {
ApplyDiscount(tc.price, tc.pct)
}
})
}
}
Run benchmarks with make bench:
BenchmarkApplyDiscount-10 1000000000 0.2987 ns/op 0 B/op 0 allocs/op
BenchmarkApplyDiscount_Table/price=100.00_pct=20-10 1000000000 0.2991 ns/op 0 B/op 0 allocs/op
The -benchmem flag reports allocations per operation. Zero allocations means the function doesn’t touch the heap. When AI refactors a function and allocations jump from 0 to 3 per call, the benchmark tells you immediately.
-count=3 runs each benchmark three times. Single runs are noisy. Three runs give you enough data to spot variance. For serious comparison between implementations, pipe the output to benchstat, which computes statistical differences.
Profiling with pprof
When benchmarks reveal a performance problem, profiling tells you where the time goes. Run make profile to generate CPU and memory profiles, then inspect them:
# Interactive CPU profile
go tool pprof cpu.prof
# Top functions by CPU time
(pprof) top 10
# Memory profile with allocation counts
go tool pprof -alloc_objects mem.prof
# Web-based flame graph (opens browser)
go tool pprof -http=:8080 cpu.prof
The AI can read pprof output. Tell it “run make profile and fix the top allocation” and it has something concrete to work with. Without profiling, the AI guesses at performance problems. With it, the AI sees exactly which function allocates, which loop dominates CPU time, and where the hot path sits.
When to Use Each
Benchmarks are cheap. Write them for any function that handles request-path data: parsers, serializers, query builders, middleware. AI should generate benchmarks alongside table-driven tests for these functions.
Profiling is for investigation. When a benchmark shows unexpected allocations or a request takes longer than it should, profiling finds the cause. You don’t run profiling in CI. You run it when something is slow and you need to know why.
Neither benchmarks nor profiling are part of make verify because they’re diagnostic, not gates. Performance regressions are harder to automate thresholds for than coverage or mutation scores. But having make bench and make profile in the Makefile means the AI can reach for them when you say “this endpoint is slow, figure out why.”
Codecov Configuration
Codecov tracks coverage over time and enforces thresholds on pull requests, preventing AI-generated code from landing without adequate test coverage.
Create codecov.yml in your project root:
# codecov.yml
coverage:
  status:
    project:
      default:
        target: 80%
        threshold: 2%
    patch:
      default:
        target: 80%
        threshold: 5%
  precision: 2
  round: down
  range: "60...100"
comment:
  layout: "header, diff, flags, components"
  behavior: default
  require_changes: true
ignore:
  - "cmd/**/*"
  - "**/*_test.go"
  - "**/mock_*.go"
  - "**/mocks/**"
The two coverage checks serve different purposes.
Project coverage tracks the overall repository. Target 80%, threshold 2%. The repository’s total coverage must stay at or above 80%. The 2% threshold allows minor fluctuations without blocking merges.
Patch coverage tracks only the lines changed in a pull request. Target 80%, threshold 5%. Patch coverage is the critical check for keeping AI on a leash. When the AI generates a PR with 200 new lines of code, at least 80% of those new lines need test coverage. Untested code can’t hide behind high overall project coverage.
The 5% threshold on patches is more generous because small PRs with a few untestable lines (error returns from os.Exit, for example) would otherwise block constantly.
Ignored paths keep the metrics honest. cmd/ directories contain main() functions that are hard to unit test and better verified by integration tests. Test files don’t need coverage of themselves. Mock files are generated code.
require_changes: true in the comment section means Codecov only comments on PRs that affect coverage. No noise on documentation-only changes.
Go CLAUDE.md for AI on a Leash
Here is a complete, copy-paste-ready CLAUDE.md that gives AI coding tools everything they need to work effectively in your Go project.
It extends the universal template from Ralph’s Uncle with Go-specific commands, thresholds, and conventions.
Create CLAUDE.md in your project root:
# CLAUDE.md - AI on a Leash for Go
## Project Architecture
- `cmd/` - Application entrypoints (main packages)
- `pkg/` - Public library code (importable by external projects)
- `internal/` - Private application code (not importable externally)
- Tests live next to the code they test: `foo.go` → `foo_test.go`
## Verification (Required Before Every Commit)
Run the full verification suite:
```
make verify
```
Individual checks (all must pass):
```
make lint # golangci-lint + go vet
make test # go test -race -shuffle=on ./...
make coverage # Coverage report (threshold: 80%)
make patch-coverage # Coverage of changed lines only (threshold: 80%)
make security # gosec + govulncheck
make mutation # gremlins (threshold: 60%)
make deadcode # deadcode (unreachable functions)
make build-check # go build + go mod verify
```
Performance diagnostics (not part of verify, use when investigating):
```
make bench # Run benchmarks with memory allocation reporting
make profile # Generate CPU and memory profiles for pprof
```
## Code Quality Thresholds
- Test coverage: ≥80%
- Mutation score: ≥60%
- Cyclomatic complexity: ≤10 per function
- Cognitive complexity: ≤15 per function
- Function length: ≤80 lines, ≤50 statements
- Function arguments: ≤5
- Function return values: ≤3
- Go Report Card: 100%
## Go Code Standards
1. **Error handling**: Always wrap errors with context: `fmt.Errorf("operation failed: %w", err)`
2. **Naming**: Follow Go conventions. MixedCaps, not underscores. Acronyms are all-caps (HTTP, URL, ID).
3. **Interfaces**: Accept interfaces, return structs. Define interfaces at the consumer, not the provider.
4. **Context**: First parameter when needed. Never store in structs.
5. **Concurrency**: Use channels for communication, mutexes for state. Always run tests with `-race`.
6. **Dependencies**: Use `internal/` for code that shouldn't be imported. Minimize third-party dependencies.
7. **Testing**: Table-driven tests. Property-based tests for pure functions. Testcontainers for integration tests.
## Dependency Policy
- All dependencies pinned to exact versions in `go.sum`
- Run `go mod tidy` before commits
- No deprecated packages (enforced by depguard in `.golangci.yml`)
- `govulncheck` must pass clean
## AI-Specific Rules
1. **No tautological tests**: tests must encode expected outputs, not reimplement logic
2. **No hallucinated imports**: verify every dependency exists in the Go module ecosystem
3. **Human review required**: all code requires human review before merge
4. **Acceptance criteria first**: do not write code without Given/When/Then criteria
5. **Explain non-obvious decisions**: comment WHY, not WHAT
6. **Integration tests for multi-component features**: unit tests alone are not sufficient when components interact. Wire up real objects, not mocks, and prove data flows through the actual call chain
7. **No vaporware**: every package must be imported by non-test code. Every database table must have DML in non-test code. Code that compiles but isn't wired into the application is dead code
## Git Policy
- Conventional commits: `type(scope): description`
- Types: feat, fix, refactor, test, docs, chore, ci
- Commits are atomic: one logical change per commit
- No force-pushing to shared branches
- Branch names: `type/short-description`
How the Sections Work Together
The architecture and verification sections do the most work. The AI gets a mental model of the project without exploring the filesystem, and make verify gives it a single command that answers “is this code ready?” The thresholds section provides objective numbers the AI can check against, removing any ambiguity about “good enough.”
The Go code standards section encodes decisions that linters can’t fully enforce. “Accept interfaces, return structs” is a design principle, not a lint rule. “Define interfaces at the consumer” tells the AI where to put the type Storer interface declaration.
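A small sketch of what “define interfaces at the consumer” means in practice; the package and method names are invented, but the Storer example mirrors the guidance above:

```go
// pkg/report/report.go - the consumer declares the interface it needs.
package report

import "context"

// Storer lives next to the code that uses it, not in the storage package.
type Storer interface {
	Save(ctx context.Context, key string, value []byte) error
}

// Generator accepts the interface and works with any implementation;
// the storage package simply returns its concrete struct.
type Generator struct {
	store Storer
}

func NewGenerator(store Storer) *Generator { return &Generator{store: store} }
```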
The AI-specific rules call out failure modes unique to generated code: tautological tests that pass by reimplementing the function under test, and hallucinated imports that the AI invents from training data. Without the git policy section, AI creates commits like “update files” with twelve unrelated changes. With it, commits follow conventional format and stay atomic.
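To make the first of those failure modes concrete, here is what a tautological test of the earlier ApplyDiscount function looks like. It encodes no independent expectation, because the “expected” value is computed with the same formula as the implementation:

```go
// Tautological: the expected value reimplements ApplyDiscount's own logic,
// so the test can never disagree with the code it is supposed to verify.
func TestApplyDiscount_Tautological(t *testing.T) {
	price, pct := 100.0, 20
	want := price * float64(100-pct) / 100 // mirrors the implementation instead of stating "20% off 100 is 80"
	if got := ApplyDiscount(price, pct); got != want {
		t.Errorf("got %v, want %v", got, want)
	}
}
```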
Keeping CLAUDE.md Effective
Every line in CLAUDE.md competes for context window space. Review it monthly. Remove anything the AI consistently gets right without being told. Add anything it consistently gets wrong.
If you find yourself correcting the same mistake across multiple sessions, add a rule. If a rule never triggers, remove it to reclaim context space.
CLAUDE.md should evolve with your project and your experience working with AI tools.
Putting It All Together
Here’s the complete set of files to create in a new Go project:
| File | Purpose |
|---|---|
| revive.toml | Revive linter rules: complexity limits, argument counts, concurrency safety |
| .golangci.yml | 43 linters with AI-tuned settings, deprecated package blocking, test exemptions |
| GNUmakefile | All verification targets, coverage thresholds, single make verify command |
| codecov.yml | Project and patch coverage enforcement, path exclusions |
| CLAUDE.md | AI context: architecture, commands, thresholds, standards, rules |
For release automation with signed builds and SBOMs, see GoReleaser with Cosign Signing and Syft SBOM.
To set up a new project:
- Copy these files into your project root.
- Install the tools: `golangci-lint`, `gosec`, `govulncheck`, `deadcode`, `gremlins`.
- Run `make verify`. Fix anything that fails.
- Start working with your AI coding tool. It reads CLAUDE.md, runs `make verify`, and self-corrects.
The Verification Loop in Practice
Here’s what a session looks like. You tell the AI to add a new endpoint. It reads CLAUDE.md, writes the handler, writes tests, and runs make verify. The first run fails:
pkg/api/handler.go:42:1: cyclomatic complexity 14 of func HandleOrder is high (> 10) (gocyclo)
pkg/api/handler.go:58:9: error returned from external package is not wrapped (wrapcheck)
FAIL: Coverage 74% is below threshold 80%
The AI reads those errors and fixes them without you saying anything. It decomposes the handler, wraps the error, adds test cases to cover the missing lines, and runs make verify again. Second run fails on mutation testing. It adds boundary tests. Third run passes. You review the final result.
Three AI iterations, zero human intervention. Without the pipeline, each of those iterations would have been a review comment from you, a round trip, and a context switch. The verification pipeline turns “review, comment, wait for fix, re-review” into “review once.”
The configuration files encode your quality standards into tools. The AI doesn’t need to remember the standards. It runs the tools and responds to their output.
These configs raise the bar, but they don’t eliminate the need for human review. The AI can write tests that pass the mutation threshold but still miss specification errors. The configs buy you leverage so the human review focuses on “is this the right behavior?” not “does this compile?” For the full argument on why human review remains the firewall, see Ralph’s Uncle.
CI Integration
Everything in this article runs locally via make verify. To enforce it on every pull request, you need a CI pipeline. The next article in this series covers GitHub Actions workflows that run the full verification suite, add CodeQL scanning, OpenSSF Scorecard, and a CI-aware CLAUDE.md that references your pipeline status.
Tool Installation
For reference, here are the installation commands for every tool used in this article:
# Linting
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
# Security
go install github.com/securego/gosec/v2/cmd/gosec@latest
go install golang.org/x/vuln/cmd/govulncheck@latest
# Dead code
go install golang.org/x/tools/cmd/deadcode@latest
# Mutation testing
go install github.com/go-gremlins/gremlins/cmd/gremlins@latest
Pin these to specific versions in your CI configuration. Using @latest is fine for local development, but CI should be reproducible.
AI on a Leash Series
Previous: Ralph’s Uncle
This blog post, titled: "AI on a Leash: Complete Go Project Configuration: AI on a Leash for Go" by Craig Johnston, is licensed under a Creative Commons Attribution 4.0 International License.