Sampling: When the Server Calls Your Model Back

FOLIO XCV 2026-05-23 · 8 MIN · LONG-FORM


The protocol inverts. A server asks the client to run the host's model, with no API key of its own and a human able to deny every request.

Diagram · folio xcv

sequenceDiagram
  autonumber
  participant S as Server
  participant C as Client
  participant U as User
  participant M as Model
  S->>C: sampling/createMessage
  C->>U: review this request?
  U-->>C: approve (deny returns error -1)
  C->>M: run the model
  M-->>C: completion
  C->>U: review the response?
  U-->>C: approve
  C->>S: result { content, model, stopReason }

Every method so far has run one direction: the client asks, the server answers. Sampling runs the other way. sampling/createMessage is a request the server sends the client, asking it to run the host’s model. The server gets to use a language model without holding an API key, and the host stays in control of which model runs and whether it runs at all. This post reads the full request, the model-preference scheme that lets a server ask for a model without naming one, and the 2025-11-25 tool loop, all from real frames.

This is part of MCP on the Wire, a series that takes the Model Context Protocol apart message by message, in Go. It comes out of building and running MCP servers in production, including the open-source txn2/mcp-data-platform, an Apache-2.0 platform in Go that connects AI assistants to Trino, DataHub, and S3 through one MCP endpoint, enriching every result with semantic context (ownership, lineage, PII, data quality) behind OAuth 2.1 auth, personas, and an audit trail. Everything here is read straight off the wire against spec revision 2025-11-25, with the official Go SDK at v1.6.1.

§The Inversion

The protocol-versus-API post caught a server calling back to the client and promised the detail later. This is later. A server with a summarize tool has no model of its own, so when the tool runs it sends a sampling/createMessage to the client and uses the answer. On the wire, in the middle of handling the client’s tools/call, a request flows back the other way:

read:  sampling/createMessage  { the server asking the client to run a model }
write: result { the model's answer, the model name, a stopReason }

The arrows are what make this matter. The server is the one asking, and the client, the host with the model and the API key, is the one that decides. The spec is firm that this decision includes a person: there should always be a human in the loop able to deny a sampling request, review the prompt before it is sent, and review the response before it is returned. The server proposes. The user disposes. That consent gate is the sequence diagram at the top of this post, and it is why sampling is safe to expose: a server can drive the model, but never unsupervised.
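The consent gate in the diagram can be sketched in a few lines of plain Go. This is a minimal illustration, not the SDK's handler type: Gate, its three hooks, and ErrRejected are hypothetical names, and treating a rejected response the same way as a rejected request is an assumption rather than something the frames above show.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRejected stands in for the error a client returns when the user says no.
// Hypothetical name; the wire encoding is covered later in the post.
var ErrRejected = errors.New("user rejected sampling request")

// Gate bundles the three steps from the sequence diagram. All hooks are
// hypothetical; a real host would back them with UI and a model client.
type Gate struct {
	ApproveRequest  func(prompt string) bool
	RunModel        func(prompt string) (string, error)
	ApproveResponse func(completion string) bool
}

// Handle runs the model only after the user approves the prompt, and releases
// the completion only after the user approves it as well.
func (g Gate) Handle(prompt string) (string, error) {
	if !g.ApproveRequest(prompt) {
		return "", ErrRejected
	}
	out, err := g.RunModel(prompt)
	if err != nil {
		return "", fmt.Errorf("run model: %w", err)
	}
	// Assumption: a declined response is reported like a declined request.
	if !g.ApproveResponse(out) {
		return "", ErrRejected
	}
	return out, nil
}

func main() {
	g := Gate{
		ApproveRequest:  func(prompt string) bool { return true },
		RunModel:        func(prompt string) (string, error) { return "stub completion", nil },
		ApproveResponse: func(completion string) bool { return true },
	}
	out, err := g.Handle("Summarize: MCP lets a server ask the client to run the model.")
	fmt.Println(out, err)
}
```

The point of the shape is ordering: the model hook is unreachable until the first approval returns true, so a denial costs nothing and leaks nothing.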

§The Request

A sampling request is richer than the bare call from the earlier post. Here is the full frame this server sent, captured through a logging transport:

{
  "method": "sampling/createMessage",
  "params": {
    "maxTokens": 100,
    "temperature": 0.2,
    "stopSequences": ["\n\n"],
    "systemPrompt": "You are a terse summarizer. One sentence, no preamble.",
    "messages": [
      { "role": "user", "content": { "type": "text", "text": "Summarize: MCP lets a server ask the client to run the model." } }
    ],
    "modelPreferences": {
      "hints": [{ "name": "claude-sonnet" }],
      "intelligencePriority": 0.8, "speedPriority": 0.5, "costPriority": 0.3
    }
  }
}

messages is the conversation to sample from; each message has a role of user or assistant and content that can be text, image, or audio. systemPrompt, maxTokens, temperature, and stopSequences are the familiar generation controls, with one caveat the spec keeps repeating: every one of them is a request, not a command. The client may shorten maxTokens, edit the systemPrompt, or ignore a preference, because the client owns the model. The server is asking nicely.
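To make "a request, not a command" concrete, here is a small sketch of a client capping maxTokens before it runs anything. The struct is a hand-rolled subset of the frame above (temperature, stopSequences, and modelPreferences are left out for brevity), and ClampMaxTokens and the ceiling of 64 are hypothetical, chosen only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types mirroring part of the wire frame; not the SDK's structs.
type TextContent struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

type Message struct {
	Role    string      `json:"role"`
	Content TextContent `json:"content"`
}

type CreateMessageParams struct {
	MaxTokens    int       `json:"maxTokens"`
	SystemPrompt string    `json:"systemPrompt,omitempty"`
	Messages     []Message `json:"messages"`
}

// ClampMaxTokens returns p with maxTokens capped at the client's ceiling.
// The server's value is treated as an upper bound it asked for, nothing more.
func ClampMaxTokens(p CreateMessageParams, ceiling int) CreateMessageParams {
	if p.MaxTokens > ceiling {
		p.MaxTokens = ceiling
	}
	return p
}

func main() {
	req := CreateMessageParams{
		MaxTokens:    100,
		SystemPrompt: "You are a terse summarizer. One sentence, no preamble.",
		Messages: []Message{{
			Role:    "user",
			Content: TextContent{Type: "text", Text: "Summarize: MCP lets a server ask the client to run the model."},
		}},
	}
	applied := ClampMaxTokens(req, 64) // 64 is an arbitrary client policy
	out, err := json.MarshalIndent(applied, "", "  ")
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(out))
}
```

The same pattern would apply to any other field the client chooses to override: copy the params, adjust, and run the adjusted copy.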

§Asking for a Model Without Naming One

The interesting field is modelPreferences, and it solves a problem that only exists because of the inversion. The server does not know what models the client has. The client might run Claude, or Gemini, or something local. So a server cannot request a model by name and expect it to exist. Instead it describes what it needs, two ways at once.

Diagram · the server expresses needs (intelligence 0.8, speed 0.5, cost 0.3; hints: claude-sonnet → claude) and the client makes the call (claude-sonnet-4-6)

Hints are advisory substrings, evaluated in order. If the client has no Claude model, it may map claude-sonnet to an equivalent like gemini-1.5-pro, then use the priorities to break ties. The server never learns which until the result names the model that ran.

The three priorities are normalized from 0 to 1: how much the server values intelligence, speed, and low cost. The hints are substrings matched against model names, advisory and tried in order, so a hints list of claude-sonnet followed by claude prefers a Sonnet-class model and falls back to any Claude. A client with a different provider may map the hint to its nearest equivalent. The server states a shape of need, and the client fills it from its own shelf.
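One way a client might turn hints and priorities into a choice is sketched below. The selection policy is the client's own business, so everything here is an assumption: the Candidate scores are invented numbers, the weighted sum is one plausible scoring rule among many, and Pick is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// Preferences mirrors modelPreferences: ordered hints plus three weights in 0..1.
type Preferences struct {
	Hints        []string
	Intelligence float64
	Speed        float64
	Cost         float64
}

// Candidate is a model on the client's shelf. The scores are illustrative
// values a client would have to assign itself; Cheapness is higher when cheaper.
type Candidate struct {
	Name         string
	Intelligence float64
	Speed        float64
	Cheapness    float64
}

// score weights a candidate by the server's stated priorities.
func score(p Preferences, c Candidate) float64 {
	return p.Intelligence*c.Intelligence + p.Speed*c.Speed + p.Cost*c.Cheapness
}

// best returns the highest-scoring candidate in a non-empty pool.
func best(p Preferences, pool []Candidate) Candidate {
	top := pool[0]
	for _, c := range pool[1:] {
		if score(p, c) > score(p, top) {
			top = c
		}
	}
	return top
}

// Pick tries each hint in order as a substring of the model name. The first
// hint with any match narrows the pool; priorities break ties inside it.
// With no matching hint, priorities alone decide across the whole catalog.
func Pick(p Preferences, catalog []Candidate) (Candidate, bool) {
	if len(catalog) == 0 {
		return Candidate{}, false
	}
	for _, hint := range p.Hints {
		var matches []Candidate
		for _, c := range catalog {
			if strings.Contains(c.Name, hint) {
				matches = append(matches, c)
			}
		}
		if len(matches) > 0 {
			return best(p, matches), true
		}
	}
	return best(p, catalog), true
}

func main() {
	prefs := Preferences{Hints: []string{"claude-sonnet", "claude"}, Intelligence: 0.8, Speed: 0.5, Cost: 0.3}
	catalog := []Candidate{
		{Name: "claude-sonnet-4-6", Intelligence: 0.9, Speed: 0.6, Cheapness: 0.4},
		{Name: "gemini-1.5-pro", Intelligence: 0.9, Speed: 0.5, Cheapness: 0.4},
		{Name: "local-small", Intelligence: 0.4, Speed: 0.9, Cheapness: 1.0}, // hypothetical local model
	}
	if m, ok := Pick(prefs, catalog); ok {
		fmt.Println(m.Name) // prints claude-sonnet-4-6
	}
}
```

Remove the Claude entry from the catalog and the same preferences land on gemini-1.5-pro under these made-up scores, which is the fallback behavior the paragraph above describes.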

One related field, includeContext, asks the client to attach context from MCP servers to the prompt. Its thisServer and allServers values are soft-deprecated in 2025-11-25 and gated behind a sampling.context capability; the default is none, and leaving it there is the path the spec now points to.

§The Result

The reply names what actually happened:

{ "role": "assistant",
  "content": { "type": "text", "text": "The protocol inverts: the server drives the host's model." },
  "model": "claude-sonnet-4-6",
  "stopReason": "endTurn" }

model is the model the client chose, which is how the server finds out which one ran. stopReason says why generation ended: endTurn for a natural finish, stopSequence for a hit on one of the stop strings, maxTokens for the cap, and toolUse for the case the next section is about.
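On the server side, stopReason is what decides the next step. A minimal sketch of that dispatch follows; the Action names are hypothetical, and routing unrecognized values to a separate case is a defensive assumption rather than something shown in the frames here.

```go
package main

import "fmt"

// Action is what the server does next; the names are illustrative.
type Action int

const (
	UseAnswer    Action = iota // generation finished on its own terms
	UseTruncated               // output hit the token cap and may be cut off
	RunTools                   // the model asked for a tool call
	Unknown                    // a stop reason this server does not recognize
)

// NextAction maps a result's stopReason to the server's next step.
func NextAction(stopReason string) Action {
	switch stopReason {
	case "endTurn", "stopSequence":
		return UseAnswer
	case "maxTokens":
		return UseTruncated
	case "toolUse":
		return RunTools
	default:
		return Unknown
	}
}

func main() {
	for _, r := range []string{"endTurn", "stopSequence", "maxTokens", "toolUse"} {
		fmt.Println(r, NextAction(r))
	}
}
```

Keeping maxTokens distinct from endTurn matters for a summarizer like the one above: a capped reply may stop mid-sentence and is worth flagging rather than passing through as a finished answer.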

§The Tool Loop

The 2025-11-25 revision let a sampling request carry tools, which turns a single call into an agentic loop. A client must declare it can handle this with a sampling.tools capability, visible in the handshake of a server built to use it:

"capabilities": { "sampling": { "tools": {} }, "roots": { "listChanged": true } }

With that declared, the server can put a tools array and a toolChoice in the request. The model may then answer not with text but with a request to call a tool. Here is the real loop, both rounds, captured frame by frame. Round one, the server offers a get_time tool and lets the model decide:

read:  sampling/createMessage { messages: [user "What time is it?"],
         tools: [get_time], toolChoice: {mode: "auto"} }
write: result { content: {type: "tool_use", id: "call_1", name: "get_time", input: {}},
         model: "claude-sonnet-4-6", stopReason: "toolUse" }

The model returned a tool_use block and a stopReason of toolUse: it wants the tool run. The server runs get_time and sends round two, the conversation so far plus the result:

read:  sampling/createMessage { messages: [
         user "What time is it?",
         assistant {tool_use call_1},
         user {tool_result toolUseId: "call_1", content: [text "14:32 UTC"]} ] }
write: result { content: {type: "text", text: "It is 14:32 UTC."}, stopReason: "endTurn" }

Now the model has the tool’s output and finishes with text, stopReason back to endTurn. That is a complete tool loop inside one server’s tool call. The toolChoice modes steer it: auto lets the model choose, required forces at least one tool call, and none forbids tools, which a server uses on the last iteration to force a final answer. The protocol enforces two rules on the messages: every tool_use from the assistant must be answered by a matching tool_result in the next user message, keyed by toolUseId, and a message carrying tool results must contain nothing but tool results. Both exist so the exchange maps cleanly onto Claude, OpenAI, and Gemini tool APIs, which is also why the model may return several tool_use blocks at once for parallel calls.
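The two message rules are easy to check before a request leaves the server. The sketch below is a hypothetical pre-flight validator over simplified types: Block keeps only the fields the rules touch, and Validate is an illustration of the rules as stated above, not code from the SDK.

```go
package main

import "fmt"

// Block is a simplified content block: only the fields the two rules need.
type Block struct {
	Type      string // "text", "tool_use", or "tool_result"
	ID        string // set on tool_use
	ToolUseID string // set on tool_result
}

type Message struct {
	Role    string
	Content []Block
}

// Validate checks that a message carrying tool results carries nothing else,
// and that every assistant tool_use is answered in the next user message.
func Validate(msgs []Message) error {
	for i, m := range msgs {
		results := 0
		for _, b := range m.Content {
			if b.Type == "tool_result" {
				results++
			}
		}
		if results > 0 && results != len(m.Content) {
			return fmt.Errorf("message %d mixes tool results with other content", i)
		}
		if m.Role != "assistant" {
			continue
		}
		var wanted []string
		for _, b := range m.Content {
			if b.Type == "tool_use" {
				wanted = append(wanted, b.ID)
			}
		}
		if len(wanted) == 0 {
			continue
		}
		if i+1 >= len(msgs) || msgs[i+1].Role != "user" {
			return fmt.Errorf("message %d: tool_use with no following user message", i)
		}
		answered := map[string]bool{}
		for _, b := range msgs[i+1].Content {
			if b.Type == "tool_result" {
				answered[b.ToolUseID] = true
			}
		}
		for _, id := range wanted {
			if !answered[id] {
				return fmt.Errorf("message %d: tool_use %q has no tool_result", i, id)
			}
		}
	}
	return nil
}

func main() {
	round2 := []Message{
		{Role: "user", Content: []Block{{Type: "text"}}},
		{Role: "assistant", Content: []Block{{Type: "tool_use", ID: "call_1"}}},
		{Role: "user", Content: []Block{{Type: "tool_result", ToolUseID: "call_1"}}},
	}
	fmt.Println(Validate(round2)) // prints <nil>
}
```

Because the check walks every assistant message, it also covers the parallel case: several tool_use blocks in one message each need their own tool_result in the reply.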

§When the User Says No

The consent gate is not a suggestion in the prose; it is a code on the wire. If the user denies the request, the client returns an error, and the server has to handle it:

{ "id": 3, "error": { "code": -1, "message": "User rejected sampling request" } }

The -1 is the protocol encoding the human-in-the-loop. A server that calls sampling/createMessage must be ready to be told no, the same way a polite request can be declined. Malformed tool messages, a missing tool_result or results mixed with other content, come back as -32602 instead.
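A server can separate "the user said no" from "my request was malformed" by switching on the code. The sketch below decodes a raw response frame with the standard library only; Classify and the two sentinel errors are hypothetical names, and a server built on the SDK would more likely inspect the error its session call returns.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Sentinel errors for the two codes discussed above. Hypothetical names.
var (
	ErrUserDeclined  = errors.New("user declined the sampling request")
	ErrInvalidParams = errors.New("client rejected the request as malformed")
)

// Classify inspects a JSON-RPC response frame and returns nil for a success,
// a sentinel for the two known codes, or a generic error for anything else.
func Classify(frame []byte) error {
	var resp struct {
		Error *struct {
			Code    int    `json:"code"`
			Message string `json:"message"`
		} `json:"error"`
	}
	if err := json.Unmarshal(frame, &resp); err != nil {
		return fmt.Errorf("decode response: %w", err)
	}
	if resp.Error == nil {
		return nil
	}
	switch resp.Error.Code {
	case -1:
		return ErrUserDeclined
	case -32602:
		return ErrInvalidParams
	default:
		return fmt.Errorf("sampling failed with code %d: %s", resp.Error.Code, resp.Error.Message)
	}
}

func main() {
	frame := []byte(`{ "id": 3, "error": { "code": -1, "message": "User rejected sampling request" } }`)
	if err := Classify(frame); errors.Is(err, ErrUserDeclined) {
		fmt.Println("declined: fall back without the model")
	}
}
```

The distinction is worth keeping in code because the responses differ: a decline is a normal outcome to degrade around, while -32602 points at a bug in how the server built its messages.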

§The Go Side

The two halves are small. A server samples by calling req.Session.CreateMessage from inside a tool handler, or CreateMessageWithTools when it wants the tool loop. A client opts in by setting a CreateMessageHandler in its options, which is what advertises the sampling capability, or a CreateMessageWithToolsHandler, which advertises sampling.tools. The handler is where a real host shows the user the request, runs its model, and returns the result, or returns the -1 if the user declines.

Not every server needs the inversion. txn2/mcp-data-platform grounds the model a different way, by injecting catalog context into its tool results rather than calling back for sampling, which keeps the model’s reasoning on the client side. Sampling is the tool for servers that genuinely need the host’s model mid-task; many data platforms do not.

§What’s Next

Sampling is one of three requests that run server to client. The next is smaller and quieter. Roots: Telling the Server Where It May Look covers roots/list, the client telling the server which directories it is allowed to touch, a boundary the server asks for and the client defines, and why that short exchange is a security primitive and not a convenience.

The production data platform behind this series is txn2/mcp-data-platform, available hosted as Plexara.

Craig Johnston · 2026-05-23
