Professional notes by Craig Johnston
long-form, short-form, working drafts · since 2008
VOL. XIX · MMXXVI
85 NOTES IN PRINT
FOLIO LXXXV 07 JUN 2026 · 17 MIN · LONG-FORM

Running Ollama and pgvector on Kubernetes

Vector Search Without the Vector Database, Part 3

Diagram · folio lxxxv
flowchart TB
  subgraph ns[namespace: data-platform]
    PG[(Postgres + pgvector<br/>PVC)]
    OL[Ollama CPU<br/>PVC: models]
    PULL[/Job: ollama pull<br/>nomic-embed-text/]
    APP[mcp-data-platform<br/>worker + reaper + reconciler]
  end
  PULL --> OL
  APP --> PG
  APP --> OL
  AGENT([MCP agent]) --> APP

The first two parts of this series ran entirely on a laptop. A docker-compose file brings up Postgres with pgvector and a CPU-only Ollama, the platform connects to both, and semantic search works. Moving that to a cluster sounds like it should be a formality, and mostly it is, with one exception. Every time I propose this stack, the first question is where the GPU node pool goes. It does not need one. For text embeddings at the scale a data platform produces, a couple of CPU cores are enough, and this part shows how to deploy the stack, tune it, and operate it.

§TL;DR

Deploy three things to a namespace: Postgres with the pgvector extension on a persistent volume, Ollama on CPU with a persistent volume for the model and a Job that pulls nomic-embed-text once, and the platform itself running the worker, reaper, and reconciler. Tune the embed-jobs config for slow CPU inference: one worker, a five-minute per-batch timeout under a ten-minute lease, batch size thirty-two. Operate it through the failure-triage surface, which auto-resolves transient failures and keeps permanent ones visible. The whole stack is two stateful workloads and a deployment, with no GPU and no vector database.

Vector Search Without the Vector Database | Part 3 of 3. Part 1 covered pgvector and hybrid search. Part 2 built the asynchronous embedding pipeline on Postgres LISTEN/NOTIFY and Ollama. This part puts it in production.

§What Production Actually Requires

Strip the stack down and there are exactly three workloads:

  1. Postgres with pgvector. Holds the records, the vectors, and the index_jobs queue. Stateful, needs a volume.
  2. Ollama on CPU. Serves embeddings over HTTP. Stateful only in that it caches the model on disk, so it needs a volume to avoid re-downloading the model on every restart.
  3. The platform. txn2/mcp-data-platform, the open-source MCP server behind Plexara, running the MCP server plus the background worker, reaper, and reconciler from Part 2. Stateless, scales horizontally.

That is the whole footprint of a semantic search system: two stateful workloads and a stateless one, with nothing else to deploy alongside them. The dev environment proves the shape of it:

# dev/docker-compose.yml, trimmed to the two stateful services
services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: platform
      POSTGRES_PASSWORD: platform_secret
      POSTGRES_DB: mcp_platform
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U platform -d mcp_platform"]

  ollama:
    image: ollama/ollama:0.30.6
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - acme_ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD-SHELL", "ollama list >/dev/null 2>&1 || exit 1"]

The cluster version is the same two services with Kubernetes semantics around them: persistent volumes, resource requests, and a one-shot model pull.

§Postgres With pgvector

The dev image is pgvector/pgvector:pg16, which is stock Postgres 16 with the extension precompiled in. In production you have two paths. If you run your own Postgres, use that image or install the extension into your existing one, and the platform’s migrations handle the rest: they run CREATE EXTENSION IF NOT EXISTS vector themselves on startup, so you do not pre-provision anything beyond making the extension available.

If you use managed Postgres, check that the provider ships pgvector before you commit. The major clouds do now: it is available on Amazon RDS and Aurora, Google Cloud SQL, and Azure Database for PostgreSQL. You enable it once and the migrations take it from there. Either way the platform needs nothing special from the database beyond three things it relies on from Part 2: SKIP LOCKED for the claim path, LISTEN/NOTIFY for wake-ups, and partial unique indexes for idempotent enqueue. Any Postgres 13 or newer with pgvector has all three.

One sharp edge on managed Postgres specifically: the LISTEN/NOTIFY worker needs a connection pinned to a session for the life of the listener. If you put a connection pooler in transaction-pooling mode in front of the database, which is a common setup for PgBouncer and for managed poolers, LISTEN/NOTIFY stops delivering wake-ups, because the listening connection gets handed to other transactions between notifications. The pipeline still functions, because the thirty-second poll and the reconciler are the backstop, but you lose the low-latency wake-up that was the whole point of the notify. Point the listener at a session-pooled or direct connection, and pool the rest of the platform’s traffic separately if you need to.

A minimal self-hosted StatefulSet looks like any other Postgres on Kubernetes, with the pgvector image swapped in:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: data-platform
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels: { app: postgres }
  template:
    metadata:
      labels: { app: postgres }
    spec:
      containers:
        - name: postgres
          image: pgvector/pgvector:pg16
          env:
            - name: POSTGRES_DB
              value: mcp_platform
            - name: POSTGRES_USER
              value: platform
            - name: POSTGRES_PASSWORD
              valueFrom: { secretKeyRef: { name: postgres, key: password } }
            # initdb refuses a non-empty directory, and many CSI volumes
            # ship a lost+found at the mount root. Point PGDATA at a
            # subdirectory so init succeeds on a fresh volume.
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "platform", "-d", "mcp_platform"]
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: data-platform
spec:
  clusterIP: None          # headless, matches the StatefulSet serviceName
  selector: { app: postgres }
  ports:
    - port: 5432
      targetPort: 5432
---
apiVersion: v1
kind: Secret
metadata:
  name: postgres
  namespace: data-platform
type: Opaque
stringData:
  password: "change-me"
  # The platform reads its connection string from this DSN. The worker's
  # LISTEN/NOTIFY listener uses the same DSN, so keep it a direct or
  # session-pooled connection (see the pooler note above).
  dsn: "postgres://platform:change-me@postgres:5432/mcp_platform?sslmode=disable"

I hardcoded POSTGRES_USER to platform rather than sourcing it from the Secret, so the value matches the pg_isready -U platform probe and there is no way for the two to drift. Only the password and the DSN come from the Secret. One thing to be clear about: pg_isready checks that the server is accepting connections, but it does not authenticate, so a wrong password here fails later at the platform’s connection, not at the probe.

Sizing storage is mostly about the vectors, and the index is the part people underestimate. A 768-dimensional nomic-embed-text vector is 768 four-byte floats, roughly 3 KB per row. The HNSW index stores the full vector again plus the neighbor graph, so in practice the index is often as large as or larger than the raw vector column, not a rounding error on top of it. The knob that decides whether an HNSW build is fast or miserable is maintenance_work_mem: if the graph does not fit, the build spills and slows dramatically, so raise it before a large index build. A million records is about 3 GB of raw vectors and a comparable amount again for the index, so the total still lands in the single-digit gigabytes. For most data MCPs the database is small. The 20Gi above is generous headroom.
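The arithmetic is easy to reproduce. This is a back-of-the-envelope sketch, not a pgvector sizing formula: it ignores per-row and page overhead, and the index_ratio default of 1.0 is my assumption that the HNSW index is about as large as the raw vector column. The function name is made up for illustration.

```python
def estimate_vector_storage_gb(rows, dims=768, bytes_per_float=4, index_ratio=1.0):
    """Return (raw_gb, index_gb, total_gb) in decimal gigabytes.

    index_ratio is an assumed multiplier: index size relative to raw vectors.
    """
    raw_bytes = rows * dims * bytes_per_float
    index_bytes = raw_bytes * index_ratio
    gb = 1e9
    return raw_bytes / gb, index_bytes / gb, (raw_bytes + index_bytes) / gb


raw, index, total = estimate_vector_storage_gb(1_000_000)
print(f"raw={raw:.2f} GB index={index:.2f} GB total={total:.2f} GB")
```

Raise index_ratio if your measured index comes out larger than the column. The point is only that a million rows stays far below the 20Gi claim.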

§Ollama On CPU

This is the workload that prompts the GPU question. The Ollama deployment is a normal Deployment, and the two things that make it work on CPU are a persistent volume for the model cache and realistic resource requests.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: data-platform
spec:
  replicas: 1
  selector:
    matchLabels: { app: ollama }
  template:
    metadata:
      labels: { app: ollama }
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:0.30.6
          ports:
            - containerPort: 11434
              name: http
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          resources:
            requests:
              cpu: "2"
              memory: "1Gi"
            limits:
              cpu: "4"
              memory: "2Gi"
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: data-platform
spec:
  selector: { app: ollama }
  ports:
    - port: 11434
      targetPort: 11434

The mounted volume at /root/.ollama is not optional. Without it, every pod restart re-downloads the model, which is about 270 MB for nomic-embed-text. With it, the model is pulled once and survives restarts and rescheduling.

The resource numbers are tuned for nomic-embed-text, which is a small model. Two cores requested and four as a limit is enough for embedding throughput, and the model needs well under a gigabyte of memory to run, which is why the requests sit at 1Gi with a 2Gi ceiling rather than the multi-gigabyte footprint a chat model would want. This is a CPU job and nothing more exotic. An embedding is a single forward pass through a relatively small network, not token generation, so it does not need an accelerator to be fast enough. The latency is one to three seconds per text on these resources, and batching amortizes the round trips, as Part 2 covered.

The readiness probe hits /api/tags, which responds once Ollama is up, the same check the dev startup uses. Note it returns ready before the model is pulled, which is why the pull is a separate step.

§Pulling The Model Once

Ollama starts empty. The model has to be pulled into its cache before the first embedding call. In the dev environment a startup script does it:

# dev/start.sh, the model-pull step
docker exec acme-dev-ollama ollama pull nomic-embed-text

On Kubernetes the clean equivalent is a Job that runs the pull against the Ollama service and exits. Use the ollama image itself rather than hand-rolling a curl against /api/pull. That endpoint streams a progress response, so a curl returns 200 immediately and then streams, which means a mid-pull error does not produce a non-zero exit and your Job reports success with no model present. The ollama pull command exits non-zero on a real failure, so the Job and its backoffLimit behave the way you expect. Because the model lives on the persistent volume, this only does real work the first time; on later runs it is a fast no-op:

apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-pull-model
  namespace: data-platform
spec:
  backoffLimit: 6
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: pull
          image: ollama/ollama:0.30.6
          env:
            - name: OLLAMA_HOST
              value: "http://ollama:11434"
          command:
            - sh
            - -c
            - |
              until ollama list >/dev/null 2>&1; do
                echo "waiting for ollama..."; sleep 3;
              done
              ollama pull nomic-embed-text

The job waits for the Ollama service to answer, then pulls. OLLAMA_HOST points the CLI at the in-cluster service instead of a local daemon. If you prefer, you can fold the same logic into an init container on the platform deployment, but a Job keeps the concern separate and re-runnable.
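If you do end up talking to /api/pull directly, the stream has to be read to the end before you can call it a success. The following is a minimal sketch, assuming the stream is newline-delimited JSON with a final status of "success" and an "error" key on failure. The function name and the sample error text are made up for illustration.

```python
import json


def pull_succeeded(ndjson_lines):
    """Return True only if the stream has no error object and ends in 'success'."""
    last_status = None
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if "error" in event:
            # An HTTP 200 was already sent; the failure only shows up here.
            return False
        last_status = event.get("status", last_status)
    return last_status == "success"


ok_stream = ['{"status":"pulling manifest"}', '{"status":"success"}']
bad_stream = ['{"status":"pulling manifest"}', '{"error":"sample failure"}']
cut_stream = ['{"status":"pulling manifest"}']  # connection dropped mid-pull

print(pull_succeeded(ok_stream), pull_succeeded(bad_stream), pull_succeeded(cut_stream))
```

A plain curl skips this check entirely, which is why the Job above shells out to ollama pull instead.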

There is a deliberate property here: if this job never runs, or fails, the platform still starts and still serves search. It degrades to lexical-only ranking, exactly as Part 1 described, because the embedder being unreachable returns nil from EmbedForSearch rather than an error. The model pull is an improvement to the system, not something it needs in order to boot.

The one wrinkle is the gap between Ollama being Ready and the model being present. The readiness probe answers before the pull finishes, so a platform replica that starts embedding in that window gets errors and burns retry attempts against a model-less Ollama. The pipeline self-heals, because the reconciler re-enqueues anything that failed once the model lands, but if you want to avoid the wasted attempts entirely you can gate the worker on the model existing, or simply let the pull Job complete before scaling the platform up.
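One way to gate on the model is to look for it in the /api/tags response before starting work. This is a sketch of that check and not something the platform ships. It assumes the response has a "models" list whose entries carry a "name" such as nomic-embed-text:latest, and both function names are hypothetical.

```python
import json
import urllib.request


def model_present(tags_payload, model):
    """True if the model appears in a decoded /api/tags payload."""
    # A bare model name is treated as the ':latest' tag.
    wanted = model if ":" in model else model + ":latest"
    models = tags_payload.get("models") or []
    return any(entry.get("name") == wanted for entry in models)


def fetch_tags(base_url="http://ollama:11434", timeout=5.0):
    """Fetch and decode /api/tags from the in-cluster Ollama service."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as resp:
        return json.load(resp)


empty = {"models": []}
pulled = {"models": [{"name": "nomic-embed-text:latest"}]}
print(model_present(empty, "nomic-embed-text"), model_present(pulled, "nomic-embed-text"))
```

In a cluster you would call model_present(fetch_tags(), "nomic-embed-text") in a loop with a sleep, from an init container or a startup hook.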

§Wiring The Platform

The platform points at both services through config. The database DSN and the embedding block both live in the config file, and the embedding block is nested under memory:, not at the top level:

database:
  dsn: "${DATABASE_URL}"   # expanded from the env var, sourced from the Secret

memory:
  embedding:
    provider: ollama
    ollama:
      url: "http://ollama:11434"
      model: "nomic-embed-text"

Two things matter here. The embedding config sits under memory:; putting embedding: at the top level is a silent no-op that leaves the platform on the noop provider and quietly degrades every search to lexical-only. And the URL is the in-namespace service DNS, http://ollama:11434, instead of the dev config’s localhost. The DSN uses ${DATABASE_URL} so the connection string comes from the Secret rather than being baked into the ConfigMap.
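Because the misplacement is silent, a small lint step in CI can catch it before deploy. This is a hypothetical check of my own, not part of the platform. It takes the config already parsed into a dict (by whatever YAML loader you use) and assumes only what is described above: the block belongs under memory:, and the fallback provider is noop.

```python
def embedding_provider(cfg):
    """Return (effective_provider, warnings) for a parsed config dict."""
    warnings = []
    if "embedding" in cfg:
        warnings.append("top-level 'embedding:' is ignored; nest it under 'memory:'")
    memory = cfg.get("memory") or {}
    embedding = memory.get("embedding") or {}
    provider = embedding.get("provider", "noop")
    if provider == "noop":
        warnings.append("no embedding provider configured; search will be lexical-only")
    return provider, warnings


good = {"memory": {"embedding": {"provider": "ollama"}}}
misplaced = {"embedding": {"provider": "ollama"}}

print(embedding_provider(good))
print(embedding_provider(misplaced))
```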

The platform deployment is an ordinary stateless workload. It mounts its config from a ConfigMap, injects DATABASE_URL from the postgres Secret, and probes the platform’s real health endpoints, /healthz for liveness and /readyz for readiness:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-data-platform
  namespace: data-platform
spec:
  replicas: 1
  selector:
    matchLabels: { app: mcp-data-platform }
  template:
    metadata:
      labels: { app: mcp-data-platform }
    spec:
      containers:
        - name: mcp-data-platform
          image: ghcr.io/txn2/mcp-data-platform:latest   # pin a released tag in production
          args: ["--config", "/etc/mcp-data-platform/config.yaml"]
          env:
            - name: DATABASE_URL
              valueFrom: { secretKeyRef: { name: postgres, key: dsn } }
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: config
              mountPath: /etc/mcp-data-platform
              readOnly: true
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /readyz, port: 8080 }
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: mcp-data-platform-config

The worker, reaper, and reconciler from Part 2 run inside this process. There is no separate worker deployment to manage. When the platform pod starts, it starts the background loops; when it stops, they stop cleanly. Run two replicas and you get two worker pools racing for the same index_jobs queue through SKIP LOCKED, with the reaper and reconciler on each pod converging on the same correct state, exactly the multi-pod behavior the queue was designed for.

§Tuning For A Slow Embedder

The defaults assume a fast embedder. On CPU you adjust four knobs, plus a retention setting that is about table size rather than speed, and the platform exposes all five under apigateway.embed_jobs:

apigateway:
  embed_jobs:
    workers: 1            # goroutines per pod; CPU-only saturates at 1
    embed_timeout: 5m     # per-batch HTTP timeout; must be < lease_duration
    lease_duration: 10m   # claim window the heartbeat re-stamps at lease/3
    batch_size: 32        # texts per upstream EmbedBatch call
    retention_days: 14    # purge finished index_jobs history older than this

None of these are arbitrary. Each one falls out of the mechanics from Part 2.

Start with workers: 1. The bottleneck on CPU is the model, not the queue, so adding worker goroutines just means more of them waiting on the same saturated Ollama. On a GPU embedder, or with multiple Ollama replicas behind a service, raising this to two to four pulls more jobs in parallel. On CPU it does not, so keep it at one and scale by adding platform replicas only if you also add Ollama capacity. It is a reasonable default, not a law; measure before you raise it.

embed_timeout: 5m is the HTTP timeout the worker puts on its batched /api/embed call. This is deliberately separate from the thirty-second timeout that request-path callers use. A user-facing memory_recall should fail fast at thirty seconds if Ollama is wedged, but the background worker embedding a thirty-two-text batch on CPU needs minutes of headroom. Two budgets, two timeouts.

The lease has to outlast the batch, which is why lease_duration: 10m sits above the five-minute embed_timeout. The lease is the crash-recovery window from Part 2, re-stamped by the heartbeat at one-third cadence. If it were shorter than a batch could take, the reaper would reclaim a job out from under a worker that is still healthily grinding through it.

batch_size: 32 is the lost-work-versus-overhead trade. A bigger batch means fewer round trips but more work to redo if a batch times out. On a slow CPU embedder, smaller batches recover faster from a transient failure. Thirty-two is the balance; lower it if your embedder is particularly slow and you see batches bumping the timeout.
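The relationships between the timeout, the lease, and the batch size can be checked mechanically. Here is a sketch under one pessimistic assumption of mine: that a batch costs batch_size times the per-text latency, with no gain from batching. The function name is made up. The one-third heartbeat cadence is the one described above.

```python
from datetime import timedelta


def check_embed_jobs(embed_timeout, lease_duration, batch_size, seconds_per_text):
    """Return (heartbeat_interval, worst_case_batch, problems)."""
    problems = []
    if embed_timeout >= lease_duration:
        problems.append("embed_timeout must be shorter than lease_duration")
    # Pessimistic: assume texts in a batch are embedded one after another.
    worst_case = timedelta(seconds=batch_size * seconds_per_text)
    if worst_case >= embed_timeout:
        problems.append("worst-case batch time reaches embed_timeout; lower batch_size")
    heartbeat = lease_duration / 3
    return heartbeat, worst_case, problems


heartbeat, worst_case, problems = check_embed_jobs(
    embed_timeout=timedelta(minutes=5),
    lease_duration=timedelta(minutes=10),
    batch_size=32,
    seconds_per_text=3,  # upper end of the one-to-three-second range
)
print(heartbeat, worst_case, problems)
```

With the values above, the worst-case batch is 96 seconds against a 300-second timeout, and the heartbeat re-stamps the lease every 200 seconds, so there is room on both sides.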

Finished history grows steadily, because the reconciler writes a row per unit per sweep, so retention_days: 14 exists to keep the index_jobs table bounded. The retainer purges succeeded rows and resolved failures older than two weeks, while never touching open failures or in-flight jobs. Two weeks keeps a useful window for the admin dashboard’s throughput and latency views without letting the table grow without end.

§Operating It

Most of the time this system runs untended. The producer enqueues on write, the worker drains, the reconciler catches anything the producer missed, and search just works. The two things an operator actually interacts with are status and failures.

Status is visible per source kind. The platform rolls up the queue into pending, running, succeeded, and failed counts, plus a count of distinct units with unresolved failures, which is the real “is this degraded” signal. During a long embed pass the worker publishes an items_done counter at every chunk boundary, so a unit shows “running, N of M” before its final write commits, instead of looking frozen.
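The roll-up is simple enough to sketch in a few lines. This is an illustration of the counting logic and not the platform’s query. It assumes each job is a dict with source_kind, source_id, status, and resolved_at, and the source kinds in the sample data are hypothetical.

```python
from collections import Counter

STATUSES = ("pending", "running", "succeeded", "failed")


def rollup(jobs):
    """Per source kind: job counts by status plus units with open failures."""
    acc = {}
    for job in jobs:
        entry = acc.setdefault(job["source_kind"], {"counts": Counter(), "open": set()})
        entry["counts"][job["status"]] += 1
        if job["status"] == "failed" and job.get("resolved_at") is None:
            entry["open"].add(job["source_id"])
    return {
        kind: {**{s: e["counts"][s] for s in STATUSES},
               "units_with_open_failures": len(e["open"])}
        for kind, e in acc.items()
    }


jobs = [
    {"source_kind": "memory", "source_id": "m1", "status": "failed", "resolved_at": None},
    {"source_kind": "memory", "source_id": "m1", "status": "failed", "resolved_at": None},
    {"source_kind": "memory", "source_id": "m2", "status": "succeeded", "resolved_at": None},
    {"source_kind": "memory", "source_id": "m3", "status": "failed", "resolved_at": "2026-06-01"},
    {"source_kind": "tool", "source_id": "t1", "status": "pending", "resolved_at": None},
]
summary = rollup(jobs)
print(summary)
```

Three failed rows collapse to one degraded unit here, which is why the distinct-unit count is a better health signal than the raw failed count.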

Failures get their own treatment, because a naive failed-jobs list becomes useless noise fast. A job that exhausted its retries an hour ago, then succeeded on the next reconcile sweep, should not still be screaming for attention. The platform models this with a resolved_at column and resolves failures two ways. The first is automatic: when a later job for the same unit succeeds, it stamps resolved_at on every open failure for that unit. A transient failure that self-heals disappears from the triage surface the moment the unit is healthy again, with no operator action:

UPDATE index_jobs f
   SET resolved_at = NOW()
  FROM index_jobs c
 WHERE c.id = $1
   AND f.source_kind = c.source_kind
   AND f.source_id = c.source_id
   AND f.status = 'failed'
   AND f.resolved_at IS NULL

The second is manual: an operator can dismiss a failure that will never be superseded, the leftover from a removed consumer or a source that was deleted on purpose. The triage surface reads only unresolved failures, so what an operator sees is the actionable set: units that are failing now and have not recovered. A unit that failed, was retried, and failed again shows an occurrence count and how long it has been failing, so you can tell a flapping problem from a one-time blip and a unit that never worked from one that used to.
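The automatic path is easy to try against a toy table. The sketch below uses SQLite in memory so it runs anywhere, which means the statement is restated with subqueries in place of the Postgres UPDATE ... FROM. The cut-down schema and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE index_jobs (
           id INTEGER PRIMARY KEY,
           source_kind TEXT, source_id TEXT,
           status TEXT, resolved_at TEXT)"""
)
conn.executemany(
    "INSERT INTO index_jobs (id, source_kind, source_id, status) VALUES (?, ?, ?, ?)",
    [
        (1, "memory", "m1", "failed"),
        (2, "memory", "m1", "failed"),
        (3, "memory", "m2", "failed"),      # different unit, stays open
        (4, "memory", "m1", "succeeded"),   # the job that just completed
    ],
)


def resolve_failures_for(conn, completed_id):
    """Stamp resolved_at on open failures for the completed job's unit."""
    cur = conn.execute(
        """UPDATE index_jobs
              SET resolved_at = CURRENT_TIMESTAMP
            WHERE status = 'failed'
              AND resolved_at IS NULL
              AND source_kind = (SELECT source_kind FROM index_jobs WHERE id = ?)
              AND source_id   = (SELECT source_id   FROM index_jobs WHERE id = ?)""",
        (completed_id, completed_id),
    )
    return cur.rowcount


resolved = resolve_failures_for(conn, 4)
open_failures = [row[0] for row in conn.execute(
    "SELECT id FROM index_jobs WHERE status = 'failed' AND resolved_at IS NULL ORDER BY id")]
print(resolved, open_failures)
```

Only the failure for the other unit is left open, which is exactly the set the triage surface would read.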

In practice, the most common “failure” is the embedder being briefly unreachable during an Ollama restart. I watched this play out during a routine Ollama image bump: the pod cycled, a handful of in-flight embed jobs failed mid-batch, and the triage count ticked up. By the time I had the dashboard open, the reconciler had already re-enqueued them and the failures had auto-resolved against the next success. That is the normal rhythm. You rarely have to do anything. When you do, it is because a permanent failure is sitting in the triage list telling you a real thing is broken, which is exactly when you want to be told.

§When To Outgrow This

I have spent three articles arguing against a dedicated vector database, so I owe you the boundary. This architecture is the right call for a data MCP, and it is not the right call for everything.

The pieces that hold it together have ceilings, and they are the same ones Part 1 named. pgvector with HNSW is comfortable into the low single-digit millions of vectors and starts to want real care, on index build time and maintenance_work_mem, into the tens of millions. The single-table approach means your vector search shares a database with everything else, which is a feature for consistency and a constraint for isolation: a punishing analytical query and your semantic search compete for the same Postgres. The Postgres-as-queue choice tops out long before a real broker would, in the range of thousands of jobs a minute, which a semantic-indexing workload never approaches but a high-volume event pipeline would.

HNSW is also approximate, and as Part 1 argued that is the right trade for relevance ranking. If you have a use case that needs exact nearest-neighbor guarantees, or sub-millisecond vector queries under heavy concurrency, or a corpus past a hundred million vectors, that is when a purpose-built engine earns its operational weight.

For the workload this series is about, recall over API operations, tools, prompts, generated assets, and a memory store that accumulates over a year, none of those ceilings is close. The vectors number in the thousands to low millions. The query load is one search per agent turn, not thousands per second. Adding a vector database to this would be paying the full operational cost of a second stateful system, plus a synchronization problem, to solve a scale problem you do not have. The right time to adopt that complexity is when you measure a real limit, not when an architecture diagram suggests one.

§The Thread Through All Three Parts

The series comes down to one argument: an MCP that fronts data needs semantic recall, semantic recall needs vectors, and vectors need to be generated, stored, searched, and kept fresh. Every one of those needs turned out to have a Postgres-shaped answer that is simpler, more consistent, and easier to operate than the off-the-shelf assembly of a vector database and a message broker. Part 1 stored and searched the vectors in one table. Part 2 filled them in with a queue that is itself just rows. Part 3 put both on a cluster next to a model that runs on a couple of CPU cores. No GPU, no broker, no second datastore, and at the scale a data platform actually operates, no reason to want them.

Vector Search Without the Vector Database | Part 3 of 3. Previous: The Embedding Pipeline on Postgres and Ollama. Start the series: Why Data MCPs Need Vector Search.
