An MCP server is written by an agent, used by an agent, and now often evaluated by an agent, and every one of them can confabulate. The agent that builds the server can report a tool that works when it does not. The server it builds can lie to the agents that call it, with a description that overpromises and a result that misstates what it did. The agents that call it can misreport what it is capable of. Point a fourth agent at the thing to evaluate it and feed the result back to the one building it, and a confabulated evaluation does not break the chain, it reinforces it. MCP development is close to a perfect storm for compounding confabulation: a loop of confident, occasionally wrong narrators with nothing in it anchored to the truth. The only way to break a loop like that is to put something in it that cannot lie, and nothing an agent emits qualifies. A compiler can. A linter can. A round-trip against a real database can. Building MCP servers well is mostly the work of wiring things that cannot confabulate into a loop of things that can.
This is the first post in MCP by Design, a short series on building MCP servers that survive contact with real agents. Where MCP on the Wire took the protocol apart message by message, this series is about the engineering around it: composition, agent steering, the knowledge loop, and testing. Everything here comes out of the open-source txn2/mcp-data-platform, an Apache-2.0 platform in Go that fronts Trino, DataHub, and S3 behind one MCP endpoint, also available hosted as Plexara.
§A Loop of Agents, None of Them Reliable
A normal service faces a human through a UI or another program through an API. An MCP server is stranger. Walk the loop. The agent on the build side writes the server: the handler, the schema, the tests. The server it produces is the artifact in the middle, and on the run side a different agent calls that artifact, reading its description as a contract and treating its result as ground truth. An evaluating agent closes the loop, judging the artifact and reporting back to the builder. Not one of them is reliable on its own. The builder produces statistically likely Go, not necessarily correct Go. The consumer believes whatever the description says and acts on whatever the result returns, wrong parts included. The evaluator is one more model, as able to sign off on a broken tool as a working one.
What breaks the loop is the same in every position: ground truth that is not a model. So the question is not “which language writes MCP servers fastest.” It is “which language lets a machine, rather than another agent, catch the most mistakes before any of them are trusted.” That is where Go and Python part ways, and it is why every MCP server I ship is in Go.
The work splits along the loop. This post anchors the builder: the SDLC that makes the agent’s code answerable to something that cannot confabulate. The posts that follow anchor the artifact, the design that keeps the server from lying to the agents that call it. And the last post anchors the evaluator, where I drove the platform with an agent and trusted none of its findings until a real database confirmed them. Design and SDLC are different work, and a server needs both. A well-designed server can still be unverified, and a heavily tested one can still be badly designed.
§The Compiler Is the First Reviewer the Agent Can’t Argue With
I made the general case for this in the AI on a Leash series: a strongly typed language with fast compilation turns the build loop into automated verification. The agent proposes, the compiler rejects with a specific message, the agent reads the message and fixes it. That loop runs dozens of times a minute in Go and barely runs at all in Python, where the same class of mistake waits until runtime to surface, usually in production, usually as a page.
For MCP work the stakes are sharper than for ordinary services, because the output of an MCP tool is not rendered for a human who might notice it looks off. It is fed to a model that will plan its next three calls on the assumption the result was true. A function that returns None where a list was expected is a 3 a.m. page in Python. In an MCP handler it is worse: it is an agent confidently building on a value that should never have existed. The type system refusing to compile that is the cheapest review I will ever get.
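A minimal sketch of that idea, not code from the platform: the names `TableList`, `listTables`, and `catalog` are hypothetical. The signature pins the shape of the result, so "nothing found" has to be stated as an empty list or an explicit error, and a value of the wrong type never reaches the calling agent.

```go
package main

import (
	"errors"
	"fmt"
)

// TableList is a hypothetical typed result for a "list tables" tool.
type TableList struct {
	Tables []string
}

// errUnknownSchema is a hypothetical sentinel error for this sketch.
var errUnknownSchema = errors.New("unknown schema")

// catalog stands in for a real metadata source.
var catalog = map[string][]string{
	"sales": {"orders", "refunds"},
	"empty": {},
}

// listTables returns a typed result or an error. The compiler rejects any
// return statement that does not match (TableList, error).
func listTables(schema string) (TableList, error) {
	tables, ok := catalog[schema]
	if !ok {
		return TableList{}, fmt.Errorf("list tables in %q: %w", schema, errUnknownSchema)
	}
	// return "no tables", nil // would not compile: a string is not a TableList
	return TableList{Tables: tables}, nil
}

func main() {
	for _, schema := range []string{"sales", "empty", "missing"} {
		res, err := listTables(schema)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %d tables\n", schema, len(res.Tables))
	}
}
```

The empty schema and the missing schema are different answers here, and the caller cannot confuse them: one is a result with zero tables, the other is an error.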
§The Leash Is in the Config, Not in Good Intentions
Typing is the floor. The MCP servers in mcp-data-platform sit on top of a linter configuration that does the rest, and the important word in it is the first one:
linters:
  default: none
  enable:
    - errcheck       # error returns are handled
    - staticcheck    # comprehensive static analysis
    - bodyclose      # HTTP response bodies are closed
    - rowserrcheck   # sql.Rows.Err is checked
    - sqlclosecheck  # sql.Rows and sql.Stmt are closed
    - gocyclo        # cyclomatic complexity
    - gocognit       # cognitive complexity
    - revive         # fast, extensible linter (AI guardrails)
    - gosec          # security issues (OWASP)
    - wrapcheck      # wrap errors from external packages
    # ... three dozen in total
default: none means nothing is on unless I turn it on. This is the inverse of the usual setup, where you take the defaults and silence what annoys you. Here every one of roughly three dozen linters is an explicit decision, which means an agent cannot quietly benefit from a check getting relaxed upstream. The allowlist is the contract.
The thresholds are where the agent’s worst habits get clamped. Models love deep nesting and long functions, so the config makes both illegal:
gocyclo:
  min-complexity: 10
gocognit:
  min-complexity: 15
nestif:
  min-complexity: 5
revive carries the rest, and the config comments call it what it is, AI guardrails. Functions cap at 80 lines, arguments at 5, results at 3. A datarace rule and an atomic rule run at error severity. There is even a banned-characters rule that rejects stray Greek letters, the kind of debris a model occasionally drops into an identifier, and an add-constant rule that fails on magic numbers past a literal count of three. None of this is style for its own sake. Each rule removes a way the generated code can be subtly wrong in a way a human reviewer skims past.
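As a rough illustration of what the nesting and complexity ceilings push generated code toward, here is the same validation written twice. The `Query` type, `maxLimit`, and both function names are hypothetical, and the sketch makes no claim about the exact score any linter assigns to either version.

```go
package main

import (
	"errors"
	"fmt"
)

// Query is a hypothetical tool request used only in this sketch.
type Query struct {
	SQL      string
	Limit    int
	ReadOnly bool
}

const maxLimit = 1000 // hypothetical row cap

// validateNested is the shape a model tends to produce: each check opens
// another level, and the error for a check sits far from its condition.
func validateNested(q Query) error {
	if q.SQL != "" {
		if q.ReadOnly {
			if q.Limit > 0 {
				if q.Limit <= maxLimit {
					return nil
				}
				return errors.New("limit too large")
			}
			return errors.New("limit must be positive")
		}
		return errors.New("only read-only queries are allowed")
	}
	return errors.New("empty query")
}

// validateFlat performs the same checks with guard clauses, so every
// condition sits next to the error it produces and nothing nests.
func validateFlat(q Query) error {
	if q.SQL == "" {
		return errors.New("empty query")
	}
	if !q.ReadOnly {
		return errors.New("only read-only queries are allowed")
	}
	if q.Limit <= 0 {
		return errors.New("limit must be positive")
	}
	if q.Limit > maxLimit {
		return errors.New("limit too large")
	}
	return nil
}

// message returns the error text, or "" for a nil error.
func message(err error) string {
	if err == nil {
		return ""
	}
	return err.Error()
}

// sameVerdict reports whether both versions agree on a query.
func sameVerdict(q Query) bool {
	return message(validateNested(q)) == message(validateFlat(q))
}

func main() {
	q := Query{SQL: "SELECT 1", ReadOnly: true, Limit: 5000}
	fmt.Println(validateFlat(q), sameVerdict(q))
}
```

Both functions return the same verdict for every input. The flat one is the version a reviewer can check at a glance, which is the point of making the nested one fail the build.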
wrapcheck deserves a callout for MCP specifically. It forces every error crossing from an external package to be wrapped with context. When a Trino driver or an S3 client fails three layers down, the wrap chain is what lets the tool return a message the calling agent can actually reason about instead of a bare EOF.
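A small sketch of the wrap chain in practice. The function names, bucket, and key are hypothetical, and `fetchObject` is a local stand-in for an external client call so the example stays self-contained. Each layer adds what it was trying to do with `fmt.Errorf` and `%w`, and the root cause stays reachable through `errors.Is`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// fetchObject stands in for an external client call (hypothetical).
// It fails the way a low-level driver often does: with a bare io.EOF.
func fetchObject(key string) ([]byte, error) {
	return nil, io.EOF
}

// loadManifest wraps the low-level error with what was being attempted.
func loadManifest(bucket, key string) ([]byte, error) {
	data, err := fetchObject(key)
	if err != nil {
		return nil, fmt.Errorf("load manifest s3://%s/%s: %w", bucket, key, err)
	}
	return data, nil
}

// describeTable adds the tool-level context on top.
func describeTable(table string) (string, error) {
	if _, err := loadManifest("warehouse", table+"/manifest.json"); err != nil {
		return "", fmt.Errorf("describe table %q: %w", table, err)
	}
	return "ok", nil
}

// toolMessage returns the text a tool could surface to the calling agent.
func toolMessage(table string) string {
	if _, err := describeTable(table); err != nil {
		return err.Error()
	}
	return "ok"
}

// causedByEOF shows that the wrapped chain still identifies the root cause.
func causedByEOF(table string) bool {
	_, err := describeTable(table)
	return errors.Is(err, io.EOF)
}

func main() {
	fmt.Println(toolMessage("orders"))
	fmt.Println(causedByEOF("orders"))
}
```

Without the two wraps the agent would see only `EOF`. With them it sees which table and which object were involved, which is enough to decide whether to retry, pick another table, or stop.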
§Verification the Agent Runs Itself
Linting catches shape. It does not catch a tool that lies. For that the project’s Makefile defines a gate the agent runs before I look at anything, and two of its targets are the ones I rarely see in a quickly shipped server.
The first is mutation testing:
## mutate: Run mutation testing with 60% efficacy threshold
mutate:
	@echo "Running mutation testing..."
Coverage tells you a line executed. It does not tell you a test would notice if that line were wrong. Mutation testing flips operators and conditions in the real code and fails if the suite still passes, which is the only real measure of whether the tests have teeth. Holding it to a 60% efficacy threshold is a constraint an agent will never impose on itself, because an agent optimizing for green will write tests that cover lines without asserting anything. The threshold makes that strategy fail.
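To make the difference between coverage and teeth concrete, here is a hand-rolled sketch. A real mutation tool rewrites the source and reruns the actual test suite; this example writes one mutant by hand and models each suite as a function, purely for illustration. All names are hypothetical.

```go
package main

import "fmt"

// withinQuota is the code under test: a request is allowed up to and
// including the quota.
func withinQuota(rows, quota int) bool {
	return rows <= quota
}

// withinQuotaMutant is the kind of change a mutation tool might make:
// the <= operator flipped to <.
func withinQuotaMutant(rows, quota int) bool {
	return rows < quota
}

// weakSuite executes every line of the function but never probes the
// boundary, so it cannot tell the original from the mutant.
func weakSuite(f func(rows, quota int) bool) bool {
	return f(1, 10) && !f(50, 10)
}

// strongSuite adds the boundary case rows == quota, which the mutant fails.
func strongSuite(f func(rows, quota int) bool) bool {
	return f(1, 10) && !f(50, 10) && f(10, 10)
}

func main() {
	fmt.Println("weak suite, original:  ", weakSuite(withinQuota))
	fmt.Println("weak suite, mutant:    ", weakSuite(withinQuotaMutant))
	fmt.Println("strong suite, original:", strongSuite(withinQuota))
	fmt.Println("strong suite, mutant:  ", strongSuite(withinQuotaMutant))
}
```

The weak suite passes against both versions, so the mutant survives even though line coverage is complete. The strong suite passes the original and fails the mutant. An efficacy threshold counts how often the suite behaves like the second one.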
The second is a real database:
## test-realdb: Real-Postgres round-trip gate for tool write paths (Docker required)
test-realdb:
This is the gate that catches the failure I will spend the last post in this series on: a tool that reports failure while the write commits, because the unit test used a SQL mock that accepted a query real Postgres rejects. A mock proves the code compiles against an interface. It does not prove the tool told the agent the truth about what happened. The only thing that proves that is writing a row and reading it back through a real engine.
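The shape of a round-trip check can be sketched without a database, with loud caveats: the in-memory `table` below stands in for Postgres only so the example is self-contained, the follow-up `audit` step is a hypothetical failure mechanism, and none of these names come from the platform. The real gate described above runs against an actual Postgres instance in Docker. What carries over is the comparison between what the tool claimed and what a read-back finds.

```go
package main

import (
	"errors"
	"fmt"
)

// table is an in-memory stand-in for a real engine (assumption of this sketch).
type table struct {
	rows     map[string]string
	auditErr error // error returned by the follow-up statement, if any
}

func (t *table) insert(id, note string) { t.rows[id] = note }
func (t *table) audit(id string) error  { return t.auditErr }
func (t *table) get(id string) (string, bool) {
	note, ok := t.rows[id]
	return note, ok
}

// healthyTable accepts every statement.
func healthyTable() *table {
	return &table{rows: map[string]string{}}
}

// rejectingTable stores rows but rejects the follow-up statement.
func rejectingTable() *table {
	return &table{rows: map[string]string{}, auditErr: errors.New("statement rejected")}
}

// saveNote is the tool under test. It stores the row, then runs a follow-up
// statement. If the follow-up fails it reports failure, even though the row
// is already stored.
func saveNote(t *table, id, note string) bool {
	t.insert(id, note)
	if err := t.audit(id); err != nil {
		return false
	}
	return true
}

// roundTrip runs the tool, reads the row back, and returns an error when the
// tool's claim and the stored state disagree.
func roundTrip(t *table, id, note string) error {
	claimed := saveNote(t, id, note)
	got, found := t.get(id)
	stored := found && got == note
	if claimed != stored {
		return fmt.Errorf("tool claimed saved=%t but read-back found stored=%t", claimed, stored)
	}
	return nil
}

func main() {
	fmt.Println("healthy:  ", roundTrip(healthyTable(), "n1", "hello"))
	fmt.Println("rejecting:", roundTrip(rejectingTable(), "n1", "hello"))
}
```

A test that only inspects the tool's return value would accept the second case as a correctly reported failure. The read-back is what exposes that the row is there anyway.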
The coverage gate refuses to merge anything under 80% on changed lines. gosec and govulncheck run on every commit; semgrep and CodeQL run as a SAST pass. The whole thing is one make verify, and make verify-release adds mutation testing on top before a tag. The agent can run all of it without me, which is the point. Anthropic’s own guidance is that a model performs dramatically better when it can verify its own work. In Go I can hand it a verification surface that is almost entirely machine-checkable.
§The Detail That Sold Me
There is one target in that Makefile that captures why this matters more than any feature:
## golangci-lint or gosec versions enable different rules with
## different defaults. Concrete incident on 2026-05-08: local gosec
## 2.26.1 silently dropped rules ...
tools-check:
A linter upgrade silently stopped running checks the project depended on. Nothing failed. The build stayed green. The coverage held. The only symptom was that a category of bug was no longer being caught, and the only reason it got noticed is that someone pinned the versions and wrote down the date it bit them. That is the actual texture of an SDLC: not the tools you run, but the discipline of noticing when a tool quietly stops running. You cannot vibe-code your way to that. You arrive at it by getting burned and encoding the lesson where the machine reads it.
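The excerpt does not show the body of tools-check, so the following is only a sketch of the idea, not the project's implementation: compare the tool versions a project has pinned against what is actually installed, and fail loudly on any drift. The function name and the version strings are hypothetical placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// versionDrift compares pinned tool versions against installed ones and
// returns one message per mismatch, sorted for stable output.
func versionDrift(pinned, installed map[string]string) []string {
	var drift []string
	for tool, want := range pinned {
		got, ok := installed[tool]
		switch {
		case !ok:
			drift = append(drift, fmt.Sprintf("%s: pinned %s, not installed", tool, want))
		case got != want:
			drift = append(drift, fmt.Sprintf("%s: pinned %s, found %s", tool, want, got))
		}
	}
	sort.Strings(drift)
	return drift
}

func main() {
	// Placeholder versions, not the project's real pins.
	pinned := map[string]string{"golangci-lint": "v0.0.1", "gosec": "v0.0.1"}
	installed := map[string]string{"golangci-lint": "v0.0.1", "gosec": "v0.0.2"}

	for _, msg := range versionDrift(pinned, installed) {
		fmt.Println("version drift:", msg)
	}
}
```

The check itself is trivial. Its value is that it turns a silent change in what gets linted into a visible failure before any other gate runs.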
§Why Not Python
Python is where a lot of MCP examples live, and for a throwaway it is fine. The objection is not that you cannot write a careful Python MCP server. It is that almost none of the discipline above is enforceable in Python the way it is in Go. There is no compiler turning a type mistake into a pre-run rejection. The linting ecosystem is fragmented and mostly advisory. Complexity ceilings, mutation thresholds, and wrap-checking exist in pieces but are not the cultural default, so the agent has not seen them enforced a million times in its training data and does not reach for them on its own.
Go’s other quiet advantage is uniformity. gofmt settles formatting, the standard library sets the patterns the ecosystem follows, and there is broadly one idiomatic way to do most things. An agent generating Go draws from a tighter distribution, so its first guess lands closer to correct, and the narrow gap that remains is exactly what the linters and gates are sized to close. Python gives the model more rope, and an MCP server is the last place I want more rope, because the thing on the other end of it is another model that will trust whatever I let through.
I am not against agents writing my MCP servers. I let them write almost all of it. I just do not trust either agent, the one writing or the one calling, past the point where the compiler, the linters, the mutation suite, and a round-trip against a real database can check the work. Go is the language that lets me make that distrust automatic. The protocol you can read off the wire in an afternoon. The leash is the SDLC half of the work, the half that proves the agent’s code does what it claims, and a machine can enforce most of it. The other half is design, what the server actually does, and that is where the rest of this series goes. Skip either one and what you shipped is a demo that happened to record well.
The next post takes the first design decision that follows from all this: composing several MCP servers into one Go process instead of spawning a fleet of them.
The platform behind this series is txn2/mcp-data-platform, available hosted as Plexara.