Wrap a real API in an MCP server and you hit a wall almost immediately. A production API has hundreds of operations. Expose each as its own tool and you bury the model in a tool list it cannot reason about, and most models get worse at choosing as that list grows. Hand the model the OpenAPI spec instead and you are pushing a document that can run to megabytes into a context window that has better uses. Neither option works, and the same wall shows up everywhere an MCP fronts something large: too many tools, too many prompts, a memory that has grown for a year. The answer is not a bigger context window. It is letting the agent search for the few things it needs, by meaning, at the moment it needs them.
§TL;DR
An MCP server that fronts real systems cannot lay everything in front of the model at once. Hundreds of API operations, a wall of tools, a year of accumulated memory: expose it all and you either explode the tool list or blow the context budget. The platform’s answer is semantic search over the corpora it owns, the agent’s memory, reusable prompts, generated assets, the platform’s own tool list, and a per-operation index of every API it wraps, so the agent retrieves the handful of relevant items by intent instead of receiving everything. The data catalog itself, tables and columns and glossary terms, keeps its own semantic search through DataHub; this series is about everything else. It runs in PostgreSQL with pgvector: an HNSW index for cosine similarity, a GIN index for full-text, fused into one hybrid query. No Pinecone, no Weaviate, no second datastore to keep in sync. When the embedder is down, the same query degrades to lexical-only ranking instead of returning nothing.
Vector Search Without the Vector Database | Part 1 of 3. This part covers the data layer and the why. Part 2 covers the async embedding pipeline on Postgres LISTEN/NOTIFY and Ollama. Part 3 covers running it in production on Kubernetes with a CPU-only embedder.
§The Search Problem Inside a Data MCP
I build MCP servers that sit in front of data systems. txn2/mcp-data-platform, the open-source MCP server behind Plexara, composes DataHub for metadata, Trino for SQL, and S3 for object storage behind a single endpoint. DataHub is the semantic layer for the data catalog itself: when the agent needs to find a table, a column, or a glossary term, that is a DataHub search, and the platform leans on it. So this series is not about searching the catalog. It is about everything else the platform owns and has to make findable, which turns out to be most of what an agent actually touches: its memory, reusable prompts, generated assets, the platform’s own tools, and the operations of every API the platform wraps.
The sharpest version of the problem is wrapping an API. Take something like the Salesforce REST API: one large spec, hundreds of operations. There are two obvious ways to expose it through MCP, and both are bad. Turn each operation into its own tool and the agent faces a list of hundreds of tools before it has done anything; the tool list alone can dwarf the task, and selection accuracy drops as it grows. Or skip the per-tool approach and hand the model the spec so it can choose an operation itself, except an OpenAPI document for an API that size can be megabytes of JSON, and now the context window is spent on documentation instead of work.
The platform takes a third path. Every operation in a catalog is embedded once into a per-operation vector index, and api_list_endpoints ranks operations by intent. The agent asks for “create a new customer record” and gets back the handful of operations that match, not all four hundred. A catalog that bundles several specs adds a gate: ask without naming a spec and you get the list of sections to choose from, not every operation across every section flattened into one oversized response. The full schema for an operation, its parameters and request body and responses, is fetched only for the one the agent settles on. Hundreds of operations and megabytes of spec collapse to a short ranked list and a single detail lookup.
The same shape repeats across the platform. A platform_find_tools call ranks the server’s own tools by intent, so an agent discovers the right tool without scanning every name. Prompts, generated assets, and saved collections are searchable the same way. And the memory store, which I will use as the worked example through the rest of this article because it is the simplest table, carries the most human version of the problem.
Here is that one concretely. An analyst tells the agent, months ago, “customer identifiers are GDPR-sensitive personal data, never put them in an exported report.” That gets saved as a memory record. Today a different session asks the agent to “build a churn dashboard with user-level detail.” A keyword search for “churn” or “dashboard” against that memory finds nothing. The words do not overlap. The constraint that should have fired never surfaces, and the agent happily exports a column it was told to protect.
The record and the query mean almost the same thing. They share almost no tokens. That gap, between lexical overlap and semantic meaning, is the whole problem, and it is the same gap behind the API case: “create a customer” has to find an operation named POST /sobjects/Account that shares not one word with the request. LIKE '%churn%' and even Postgres full-text search are matching strings. What you want is a search that matches meaning: “account attrition” should find “customer churn,” “PII” should find “personal data,” “auth” should find “login flow.”
That is what vector search does, and the reason it matters more for an MCP than for a normal app is that the consumer is a language model. A human will rephrase, scroll, try three searches, and use judgment. An agent issues one query, takes the top results as ground truth, and acts. If recall misses, the agent does not know it missed. Bad recall is not a worse user experience, it is a wrong answer delivered with confidence.
§What an Embedding Actually Is
An embedding is a list of numbers that represents the meaning of a piece of text. You run text through a model, and it returns a fixed-length vector. The platform uses Ollama with the nomic-embed-text model, which produces a 768-dimensional vector: 768 floating-point numbers per record.
The useful property is that texts with similar meaning produce vectors that point in similar directions. “Customer churn” and “account attrition” land close together in that 768-dimensional space even though they share no words. “Customer churn” and “quarterly office supply budget” land far apart. You measure closeness with cosine distance, which compares the angle between two vectors rather than their magnitude.
To search, you embed the query the same way you embedded the records, then ask the database for the stored vectors closest to the query vector. The model does the hard part of turning meaning into geometry, and the database does what databases are good at: finding the nearest rows, fast. (Some embedding models, nomic-embed-text among them, expect a short task prefix that differs between queries and stored documents. The platform keeps it simple and embeds both sides the same way, trading a little recall quality for one less moving part. If you want to squeeze recall later, that prefix is the first knob to reach for.)
The reason this is suddenly practical inside Postgres is pgvector. It adds a real vector column type and the distance operators that go with it. You do not need a separate service to hold the vectors. They sit in the same row as the text they describe.
§Why Not Just Use a Vector Database
The default architecture diagram people reach for puts a dedicated vector database next to Postgres: Pinecone, Weaviate, Qdrant, Milvus. The records live in Postgres, the vectors live in the vector store, and something keeps them in sync. I chose not to do that, and the reasoning is worth spelling out because “just add a vector DB” is the reflex answer.
A separate vector store buys you scale you do not have yet and costs you consistency you need today. The moment vectors live in a different system from the source rows, you have a synchronization problem. A record updates in Postgres but its vector is stale in the other store. A record is deleted in Postgres but its vector lingers and keeps showing up in results. You are now writing reconciliation logic to paper over the gap between two systems that disagree about what exists.
Keeping the vector in the same row as its text removes that entire class of bug. The write is one transaction. The delete cascades. There is no cross-system divergence to reconcile, because there is only one system holding both the text and its vector.
The second reason is hybrid ranking, which I will get to in detail. The best recall comes from combining vector similarity with old-fashioned lexical matching. If your vectors are in Pinecone and your full-text index is in Postgres, every hybrid query becomes two network round trips and a merge step in your application. When both indexes are on the same table, the database does both and you fuse the results in one place.
The caveat: this stops being the right call at a scale most data MCPs never reach. pgvector with an HNSW index is comfortable into the low single-digit millions of vectors, and wants real tuning into the tens of millions. Past a hundred million embeddings under heavy concurrent query load, a purpose-built vector engine earns its operational cost. I will come back to where that line sits in Part 3. For a platform whose vectors are API operations, tools, prompts, generated assets, and accumulated memory, Postgres is the right tool, not a compromise you tolerate until you can afford something dedicated.
§The Schema: One Table, One Vector Column
Here is the memory store’s table, trimmed to the columns that matter for search. The full migration enables the extension and creates the table:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS memory_records (
id TEXT PRIMARY KEY,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
created_by TEXT NOT NULL DEFAULT '', -- user email, the scope boundary
persona TEXT NOT NULL DEFAULT '',
content TEXT NOT NULL, -- the text we embed and search
category TEXT NOT NULL DEFAULT 'business_context',
-- Embedding (768-dim nomic-embed-text via Ollama)
embedding vector(768),
status TEXT NOT NULL DEFAULT 'active'
);
The embedding vector(768) column is the only thing that distinguishes this from an ordinary table. It is nullable on purpose: a record can exist before its vector does. That matters more than it looks, and it is the seam the whole asynchronous pipeline in Part 2 hangs from. When an analyst saves a memory, the row is written immediately. The embedding gets filled in later, off the request path, because generating it takes a second or more and you do not make a user wait on a model to save a note.
A later migration adds two breadcrumbs next to the vector:
ALTER TABLE memory_records
ADD COLUMN IF NOT EXISTS embedding_model TEXT NOT NULL DEFAULT '',
ADD COLUMN IF NOT EXISTS embedding_text_hash BYTEA;
embedding_model records which model produced the vector, and embedding_text_hash is a hash of the exact text that was embedded. Those two columns let the system answer two questions cheaply: “is this vector stale because the text changed?” and “is this vector stale because we swapped embedding models?” Both are central to the indexing engine in Part 2. For now, just note that the vector carries its own provenance.
§Two Indexes, Two Kinds of Recall
A vector column with no index works, and it is a sequential scan. Every query computes the distance from the query vector to every row, then sorts. That is fine for a thousand rows and a disaster for a million. The migration adds two indexes that back two different retrieval strategies, and turn that full scan into something that scales close to logarithmically in practice, depending on how hard you push recall:
-- Approximate nearest-neighbor index for cosine similarity.
CREATE INDEX idx_memory_records_embedding_hnsw
ON memory_records USING hnsw (embedding vector_cosine_ops);
-- Full-text index for the lexical retrieval arm.
CREATE INDEX idx_memory_records_content_fts
ON memory_records USING gin (to_tsvector('english', content));
The first is HNSW, Hierarchical Navigable Small World. It is an approximate nearest-neighbor index, which means it does not promise the exact closest k vectors, it promises a very good approximation very quickly. For relevance ranking that trade is almost always worth it: nobody can tell the difference between the true tenth-best result and the approximate tenth-best result, and you stop paying for a full table scan on every search.
I picked HNSW over pgvector’s other option, IVFFlat, for one practical reason: HNSW needs no training step. IVFFlat has to be built against a representative sample of your data to choose its cluster centroids, which means you cannot build a good IVFFlat index on an empty or small table and have it stay good as the table grows. HNSW builds incrementally and performs well on a table that starts empty and fills up over time, which is exactly how a memory store behaves. The vector_cosine_ops operator class matters too: it pairs with the <=> cosine operator the query uses. Pick the wrong class for your distance metric and the planner ignores the index. Even with the right class, the index only accelerates ORDER BY embedding <=> $1 LIMIT k, not a WHERE distance < threshold filter, which is why the query path is shaped the way it is.
The second index is a plain GIN full-text index. It is not vector search at all. It backs the lexical arm of the hybrid query, the one that still matters when the query and the record do share words, and the one that can find records whose embedding has not been computed yet. The expression to_tsvector('english', content) in the index has to be the same expression with the same text-search configuration as the one in the query, or Postgres will not use the index. That is a sharp edge: a mismatched language config or a different tsquery function silently drops you back to a scan.
§Hybrid Search: Run Both Arms, Fuse in Go
Pure vector search has a weakness. It is great at meaning and bad at exact tokens. Ask it for an error code, a table name, a specific identifier, and semantic similarity can rank a vaguely related record above the one that contains the literal string you typed. Pure lexical search has the opposite weakness. The fix is to run both and combine them.
The naive way to combine them is a single SQL query that orders by some blend of the two scores. That is a trap, and the reason is indexes. The HNSW index only accelerates a pure ORDER BY embedding <=> $1 LIMIT k. The GIN index only accelerates the tsquery match. A single ORDER BY over a blended expression of both cannot use either index, so you forfeit both and fall back to scanning. You lose exactly the speed you built the indexes for.
So the platform runs two arms and fuses them in application code. The vector arm asks the HNSW index for the cosine top-k. The lexical arm asks the GIN index for the full-text top-k. Their union is deduplicated and re-scored in Go. Here is the core of HybridSearch:
// HybridSearch ranks records by fusing cosine similarity with a lexical
// full-text signal. It runs two index-backed arms and fuses in Go rather
// than ordering by a blended SQL expression, because the hnsw ANN index
// only accelerates a pure `ORDER BY embedding <=> $1 LIMIT k` and the GIN
// index only accelerates the tsquery match; a single blended ORDER BY
// would forfeit both.
func (s *postgresStore) HybridSearch(ctx context.Context, query HybridQuery) ([]ScoredRecord, error) {
limit := clampStoreLimit(query.Limit)
args := []any{pgvector.NewVector(query.Embedding), query.QueryText}
vecArm := fmt.Sprintf(
"SELECT %s, 1 - (embedding <=> $1) AS vec_score, (%s @@ %s) AS lex_match "+
"FROM %s WHERE embedding IS NOT NULL "+
"ORDER BY embedding <=> $1 LIMIT %d",
rawRecordCols, ftsExpr, ftsQuery, tableName, limit)
lexArm := fmt.Sprintf(
"SELECT %s, CASE WHEN embedding IS NOT NULL THEN 1 - (embedding <=> $1) ELSE 0 END AS vec_score, TRUE AS lex_match "+
"FROM %s WHERE %s @@ %s "+
"ORDER BY ts_rank_cd(%s, %s) DESC LIMIT %d",
rawRecordCols, tableName, ftsExpr, ftsQuery, ftsExpr, ftsQuery, limit)
sqlStr := "(" + vecArm + ") UNION ALL (" + lexArm + ")"
rows, err := s.db.QueryContext(ctx, sqlStr, args...)
if err != nil {
return nil, fmt.Errorf("executing hybrid search: %w", err)
}
defer rows.Close()
fused, err := collectHybridRows(rows)
if err != nil {
return nil, err
}
return rankFused(fused, limit), nil
}
A few details that took me a while to get right. The <=> operator is cosine distance, so 1 - (embedding <=> $1) converts it to a similarity score where higher is better. The vector arm filters embedding IS NOT NULL because a row with no vector cannot be ranked by distance. The lexical arm deliberately does not filter that way: it surfaces rows whose embedding is still null, which is what lets recall keep working for freshly written records that the pipeline has not embedded yet. The two arms are combined with UNION ALL, not UNION, because deduplication happens in Go, not in SQL.
Fusion is plain code. A record can show up in both arms, so the fuse step dedupes by id and keeps the higher score:
func rankFused(candidates []hybridCandidate, limit int) []ScoredRecord {
byID := make(map[string]int, len(candidates))
var out []ScoredRecord
for _, c := range candidates {
score := fuseHybridScore(c.vecScore, c.lexMatch)
if idx, ok := byID[c.record.ID]; ok {
if score > out[idx].Score {
out[idx].Score = score
}
continue
}
byID[c.record.ID] = len(out)
out = append(out, ScoredRecord{Record: c.record, Score: score})
}
sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
if len(out) > limit {
out = out[:limit]
}
return out
}
One honest note on the fusion: the two signals are not on the same scale. The vector arm produces a continuous cosine similarity, and the lexical arm contributes a boolean “did it match.” Blending them is a judgment call, not a law of nature, and the more principled approach when you want to tune it hard is reciprocal rank fusion, which ranks by position in each arm rather than by raw score. The platform keeps the simpler blend because it is good enough for recall and easy to reason about.
Doing the fuse in code instead of SQL has a side benefit beyond that: the ranking policy is testable Go, not a string of SQL arithmetic. When I want to weigh an exact lexical match more heavily, or add a recency boost, or swap in rank fusion, that is a unit test, not a migration.
§Graceful Degradation Beats a Hard Dependency
Vector search adds a dependency: the embedding model. The query has to be embedded before it can be matched, and that means a network call to Ollama on every search. If that call fails, you have a choice. You can return an error, or you can fall back to something that still works.
The platform always falls back. There is one shared decision point that every search path runs through:
// EmbedForSearch returns a query embedding for relevance ranking, or nil
// to signal that the caller should fall back to lexical-only ranking. It
// returns nil when no real provider is configured, when the embed call
// errors, or when the result is a zero vector.
func EmbedForSearch(ctx context.Context, p Provider, query string) []float32 {
if !IsConfigured(p) {
return nil
}
emb, err := p.Embed(ctx, query)
if err != nil {
slog.Warn("search embedding failed; falling back to lexical ranking", "error", err)
return nil
}
if IsZeroVector(emb) {
return nil
}
return emb
}
When this returns nil, the caller runs LexicalSearch instead of HybridSearch: pure Postgres full-text, no vector involved. The results are worse, because keyword matching is what we were trying to get past in the first place. But “worse” is a different category from “broken.” The agent gets a degraded answer it can work with rather than an error it cannot.
This is the part of the design I am most glad I got right. Semantic search should be an enhancement layered on top of a system that already works without it, not a hard dependency that takes recall to zero when the model is unreachable. The lexical GIN index is always there. The vectors make it better. If the embedder disappears, you are back to a normal full-text search, which is exactly where most applications live anyway.
§What This Buys You
Step back and look at what the memory store actually is, and remember the same shape backs every other corpus: API operations, tools, prompts, assets. One table with a vector column, an HNSW index, and a GIN index. Search is two index-backed arms in a single UNION ALL query, fused by twenty lines of Go. The only external dependency is an embedding model, and the system keeps working without it.
There is no second datastore to reconcile against. A deleted record takes its vector with it, so nothing stale lingers in the results, because the vector lived in the row that was deleted. The agent asks for “account attrition” and finds the note about “customer churn”; it asks to “create a customer” and finds the one right operation out of hundreds without the spec ever entering the context window. Both work because the text and its meaning sit in the same 768-dimensional space that Postgres can search with an index it already knows how to build.
The piece I have skipped so far is how the vectors get there. Embedding every record synchronously on write would put a slow model call in the middle of every save, and re-embedding a whole corpus after a model upgrade would be a nightmare of ad-hoc scripts. That needs a real job system: durable, idempotent, self-healing, and able to run a CPU-bound model without melting under load. The surprise is that you can build it on the same Postgres you are already running, with LISTEN/NOTIFY and SELECT ... FOR UPDATE SKIP LOCKED, and no message broker at all. That is Part 2.
§References
- txn2/mcp-data-platform : the open-source MCP server this series is built from
- Plexara : the commercial data platform built on it
- pgvector : PostgreSQL extension for vector similarity search
- HNSW indexing in pgvector : the approximate nearest-neighbor index used here
- Model Context Protocol : the spec these servers implement
Vector Search Without the Vector Database | Part 1 of 3. Next: The Embedding Pipeline builds the asynchronous job system that fills these vectors in, using Postgres LISTEN/NOTIFY and Ollama on CPU.