Professional notes by Craig Johnston
long-form, short-form, working drafts · since 2008
VOL. XIX · MMXXVI
86 NOTES IN PRINT
FOLIO LXXXVI 2026-06-14 · 5 MIN · SHORT-FORM

Build a Better Platform Than You're Renting

Cloud-grade data platforms from FOSS on Kubernetes, now that agents run the hard part

Diagram · folio lxxxvi
mindmap
  root((Self-hosted platform))
    Instead of Snowflake
      Trino plus Iceberg
      object storage
    Instead of managed search
      OpenSearch
    Instead of Confluent
      Kafka with KRaft
    Instead of Workato
      Apache NiFi
    Instead of managed databases
      Postgres
      Cassandra
    Operated by
      Kubernetes
      AI agents

In 2020, Apress published my book, Advanced Platform Development with Kubernetes. It runs about five hundred pages and builds a complete data-centric platform end to end: streaming ingestion, search and analytics, data lakes and warehouses, IoT collection, and machine learning, all assembled from best-in-class open source on Kubernetes. It is still on Amazon, and I stand by every idea in it. What I can no longer stand by is the specific technology. Six years is a long time in this work, and more than the tools changed.

§The Cloud Bill Is a List of Things You Could Run Yourself

Look at your monthly cloud bill. The data warehouse, the search cluster, the streaming service, the integration platform, the managed databases, the observability vendor. Every line item is a piece of software you could run yourself, from free and open source projects, on hardware you control, for a fraction of what you are paying, with nobody able to raise your rates or lock you in. That has been true since well before I wrote the book. The reason most teams do not do it was never the building. It was the running, and that is the part that just changed.

§What You Can Actually Build

The gap between what you can self-host and what the big vendors sell has largely closed. You can stand up a query engine over open table formats (Trino over Iceberg, on object storage) that does what Snowflake does. You can run search and analytics that match a managed Elasticsearch, except it is OpenSearch and nobody owns you. You can replace Confluent’s streaming with the same Apache Kafka underneath it. Apache NiFi does what Workato and the other integration-platform vendors do, and frankly does it better. The managed database zoo (Postgres, Cassandra, and the rest) runs under operators that handle the failover and backups that used to be the reason you paid someone else.

None of it is a toy version. These are the same best-in-class projects the cloud vendors wrap, polish, and rent back to you. Assemble them yourself and you get more than the SaaS offers, at a fraction of the cost, with no vendor able to change the terms on you. Kubernetes is the substrate that ties them together and handles the networking, storage, scaling, and security underneath.

§Kubernetes Is Boring Now, and That Is Exactly Why This Works

When I wrote the book, this pitch came with an asterisk. Kubernetes was new, it was moving fast, and standing on it felt like a bet. That is over. Kubernetes settled into boring, dependable infrastructure, and boring is a compliment. The “is Kubernetes overkill” argument has worn itself out and stopped being interesting. What is left is a mature, declarative control plane that makes maintenance easier, not harder, because the whole platform is described in version-controlled manifests and reconciled by operators instead of held together by hand.
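To make "described in manifests and reconciled by operators" concrete, here is a minimal sketch of the reconcile idea in plain Python. It is an illustration only, not Kubernetes client code: the `Workload` type, the `reconcile` function, and the action strings are hypothetical names invented for this example. The point it shows is that the operator compares declared state with observed state and emits only the changes needed to close the gap.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical, simplified stand-in for a declared resource."""
    name: str
    replicas: int


def reconcile(desired: dict[str, Workload], observed: dict[str, Workload]) -> list[str]:
    """Return the actions that move the observed state toward the desired state."""
    actions: list[str] = []
    # Create what is declared but missing, and fix what has drifted.
    for name, want in sorted(desired.items()):
        have = observed.get(name)
        if have is None:
            actions.append(f"create {name} replicas={want.replicas}")
        elif have.replicas != want.replicas:
            actions.append(f"scale {name} {have.replicas}->{want.replicas}")
    # Remove what is running but no longer declared.
    for name in sorted(observed.keys() - desired.keys()):
        actions.append(f"delete {name}")
    return actions


# Desired state, as it would be declared in version control (illustrative values).
desired = {"trino": Workload("trino", 3), "kafka": Workload("kafka", 3)}
# Observed state, as the cluster currently reports it (illustrative values).
observed = {"trino": Workload("trino", 2), "nifi": Workload("nifi", 1)}

plan = reconcile(desired, observed)
for action in plan:
    print(action)
```

When the observed state already matches the declaration, the plan is empty. That is why a reconciled platform needs so little hand-holding: nothing happens until something drifts.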

Boring infrastructure is the foundation you want under a data platform. You do not want excitement in the layer that keeps your data alive.

§The Thing That Changed Is the Operator

So if all of this was possible, why did so many teams still rent? Because the operation was the real cost. Running a dozen interconnected systems well means knowing all dozen deeply enough to tune, debug, upgrade, and secure them, and that expertise is expensive and scarce. The cloud sold convenience, and the convenience was worth it precisely because the operational burden was crushing.

That is what broke open. A capable agent, given the right context, now configures, manages, and optimizes Kubernetes and the FOSS stack on top of it better than most experts could a decade ago. The deep, scarce, expensive knowledge that was the real reason to rent is now something you keep on call for the cost of tokens. It is an expert beside you that already understands every component of your platform and the stack above it.

Take the operational burden and turn it into background noise, and the trade flips. The case for building your own, which used to rest on principle and a long-term cost argument, now wins outright.

§What This Series Is

This is a ground-up rebuild of that 2020 book, posted component by component, on real Kubernetes, with current and correct configuration you can run. Object storage, streaming, a lakehouse, search, identity, model serving, vector search, and the agentic layer that ties it together, each one assembled from liberally licensed open source.

You do not need to have read the original. The technology has turned over since then: OpenSearch for Elasticsearch, SeaweedFS for MinIO, Trino for Presto, Kafka without ZooKeeper, and a whole layer of AI tooling that did not exist in 2020. What carried over is the only part that was ever the point: how to wire best-in-class free software into a platform that rivals the cloud.

§Written for Your Agent, Too

Here is the part I find genuinely useful. You can hand this series to an agent and let it build. A cold prompt like “stand up a data platform on Kubernetes” gives you something plausible and wrong, the same way it does in any domain. An agent performs in proportion to the frame of reference you give it. Ground it in a coherent, current, opinionated reference that says which pieces to use, how to wire them, and why, and it builds something real.
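The difference between a cold prompt and a grounded one can be sketched in a few lines of Python. This is an assumption-laden illustration, not any particular agent framework's API: `build_grounded_prompt`, its `max_chars` budget, and the prompt wording are all hypothetical. It only shows the shape of the idea, which is that the reference goes in ahead of the task.

```python
from __future__ import annotations


def build_grounded_prompt(task: str, reference_notes: list[str], max_chars: int = 8000) -> str:
    """Prepend reference material to a task so the agent works from a stated frame of reference."""
    sections: list[str] = []
    used = 0
    for i, note in enumerate(reference_notes, start=1):
        block = f"[reference {i}]\n{note.strip()}"
        # Stop adding notes once the (hypothetical) context budget is spent.
        if used + len(block) > max_chars:
            break
        sections.append(block)
        used += len(block)
    context = "\n\n".join(sections) if sections else "(no reference supplied)"
    return (
        "Use only the reference below for component choices and wiring.\n\n"
        f"{context}\n\n"
        f"Task: {task}"
    )


# Illustrative notes; in practice these would be the posts themselves.
notes = [
    "Use Trino with Iceberg on object storage for the lakehouse.",
    "Run Kafka in KRaft mode, without ZooKeeper.",
]
prompt = build_grounded_prompt("stand up a data platform on Kubernetes", notes)
print(prompt)
```

With an empty `reference_notes` list, the function yields the cold prompt: the task alone, with nothing to constrain the choices.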

That is what each post is written to be. The configs were never the scarce thing. The judgment about how the pieces fit is, and that is what a good reference carries. So read it yourself, or feed it to your agent and supervise, whichever you prefer. If it is too long to read, that is fine. That is rather the point.

Next, we start at the bottom and work up, from the substrate to the agents that run it.
