Storage the Cluster Owns: Rook and Ceph

FOLIO CX 2026-06-23 · 15 MIN · LONG-FORM

Portable block and shared-filesystem persistence for stateful services, no managed disks required

Diagram · folio cx

flowchart TB
  PVC["PVC from a StatefulSet"] --> SC["StorageClass<br>rook-ceph-block"]
  SC --> CSI["Ceph CSI<br>rook-ceph.rbd.csi.ceph.com"]
  CSI --> POOL["CephBlockPool<br>replicated"]
  POOL --> OSD1["OSD on worker 1 disk"]
  POOL --> OSD2["OSD on worker 2 disk"]
  FS["CephFilesystem<br>ReadWriteMany"] -.-> OSD1
  FS -.-> OSD2

The cluster from the last post can schedule containers, but it cannot yet keep anything. Pods are transient and their filesystems go with them, so the moment you want to run a database, an index, or a queue, you need storage that outlives the pod that wrote it. You could rent managed disks from your cloud, but that is exactly the per-provider lock-in this series avoids. Instead we give the cluster its own distributed storage with Ceph, running on plain disks attached to your own nodes, as portable as everything else here.

This series rebuilds my 2020 Apress book, Advanced Platform Development with Kubernetes, for 2026. The approach behind it comes from building and running data platforms in production for more than twenty years.

§Why Run Your Own Storage

Managed block storage, the EBS volumes and persistent disks of the world, is convenient and it is a leash. The volume type, the snapshot format, the performance tiers, and the provisioning API are all specific to one provider, and a platform built on them does not move. The snapshots you took for safety are in a format only that cloud can restore. The IOPS you are paying for are metered by their rules. None of it follows you when you leave, which means leaving stops being an option you actually have.

There is a tempting middle ground that is worse than either choice: the cluster’s built-in hostPath or a local-path provisioner, where a volume is just a directory on whatever node the pod happened to land on. It works in a demo and then betrays you. The data is pinned to one node, so the moment that node reboots or the pod reschedules elsewhere, the storage is gone or stranded. A database on a node-local volume is a database with a single point of failure you built on purpose.

Ceph is the open answer to both problems. It is a distributed storage system that pools the raw disks across your nodes and serves them back as block volumes, a shared filesystem, and object storage, all replicated across the nodes and self-healing when a disk or a node dies. It runs the same on DigitalOcean, Hetzner, or a rack of your own machines, and it has been the serious-people’s answer to open storage for well over a decade. A volume lives in the cluster, not on a node, so it follows the pod wherever the scheduler puts it.

You do not operate Ceph by hand, and that distinction is what makes this practical in 2026. Rook is a Kubernetes operator that installs, configures, heals, and upgrades a Ceph cluster for you, and exposes it through ordinary Kubernetes storage classes. A pod asks for a PersistentVolumeClaim, Rook and Ceph provision a real replicated volume, and the pod neither knows nor cares that there is a storage cluster underneath. The expertise that used to make Ceph a specialist’s tool is now encoded in an operator, the same shift that runs through this whole series: the hard parts are automated, so owning the thing is no longer a second job.

§What Ceph Actually Is, and What Rook Does

It is worth understanding the pieces before you deploy them, because when something is wrong these are the names you will be reading in logs. A Ceph cluster is a handful of daemon types working together.

The OSDs, object storage daemons, are the workers. There is one per disk, and they hold the actual data and handle replication and recovery between themselves. The MONs, monitors, hold the cluster map and the consensus about who is in the cluster and healthy; they are the brain, and like any consensus system they want an odd number so they can form a quorum. The MGR, manager, runs the dashboard, the metrics, and assorted housekeeping. And the MDS, metadata server, exists only if you use the shared filesystem, where it tracks the directory tree.

Underneath, Ceph places data with an algorithm called CRUSH, which decides which OSDs hold which copies based on a failure domain you set. Tell Ceph the failure domain is host and it guarantees the replicas of any piece of data land on different nodes, so losing a whole node never takes out every copy. That single setting is the difference between replication that protects you and replication that merely uses more disk.

Rook’s job is to run all of this as Kubernetes resources. You declare a CephCluster, a CephBlockPool, a CephFilesystem, and Rook turns them into the right daemons, pods, secrets, and CSI drivers. You describe the storage you want in YAML, the operator makes Ceph match it, and your interface to the whole system stays declarative and manifest-first, the same way you run everything else on this cluster.

§What You Need

The three-node cluster from the previous post, plus one thing it did not have: an empty block device on each worker for Ceph to claim. On a cloud, attach an extra unformatted volume to each worker node (a hundred gigabytes is plenty for development) and leave it raw. On bare metal, a spare disk or partition does the same job. The control-plane node stays out of it; storage lives on the workers.

That raw-device requirement is itself a change worth noting. The 2020 edition could point Ceph at a directory on the node. Modern Rook provisions storage daemons on actual block devices, which is more robust and the reason for the attached volume.

The word “empty” is load-bearing, and it is the single most common reason a Rook deployment looks healthy but no OSDs ever appear. Ceph refuses to claim a device that has any existing partition table or filesystem signature, as a safety measure so it never eats data you forgot was there. A cloud volume is usually clean, but a reused disk often is not. Confirm before you start:

lsblk -f
# the target device, e.g. /dev/sdb, should show no FSTYPE and no children
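
The same check can be scripted so it runs identically on every worker. This is a minimal sketch, not part of Rook: the helper name device_looks_empty is hypothetical, and /dev/sdb is the assumed target. It reads lsblk's raw output and succeeds only when the device has no filesystem type and no child partitions.

```shell
# Hypothetical helper: reads `lsblk -rno NAME,FSTYPE <device>` output on stdin.
# Succeeds (exit 0) only when there is exactly one row (no child partitions)
# and that row carries no FSTYPE value.
device_looks_empty() {
  awk 'NF > 1 { dirty = 1 } END { if (NR == 1 && !dirty) exit 0; exit 1 }'
}

# Usage on a worker node (the device name is an assumption):
#   lsblk -rno NAME,FSTYPE /dev/sdb | device_looks_empty && echo "sdb looks empty"
```

Treat a pass as a hint rather than a guarantee. Ceph runs its own inspection when it prepares the OSD, and that result is the one that counts.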

If the device shows an old filesystem or partitions, wipe its signatures so Ceph will take it. This destroys whatever is on that device, so be certain you have the right one.

sgdisk --zap-all /dev/sdb
dd if=/dev/zero of=/dev/sdb bs=1M count=100 oflag=direct
blkdiscard /dev/sdb   # on SSDs; skip on spinning disks

§Install the Rook Operator

Rook ships the operator as a set of plain manifests, the same way the original book applied them. Pin a release tag rather than tracking a branch. This post uses v1.16.2; check for the latest release before you run it.

ROOK_VERSION=v1.16.2
BASE=https://raw.githubusercontent.com/rook/rook/${ROOK_VERSION}/deploy/examples

kubectl apply -f ${BASE}/crds.yaml
kubectl apply -f ${BASE}/common.yaml
kubectl apply -f ${BASE}/operator.yaml

The crds.yaml installs Rook’s custom resource definitions, common.yaml creates the rook-ceph namespace and RBAC, and operator.yaml runs the operator that does the actual work. Wait for the operator pod to be running before continuing.

kubectl -n rook-ceph rollout status deployment/rook-ceph-operator

§Create the Ceph Cluster

Now declare the Ceph cluster itself. The example cluster.yaml in the same directory is tuned for production; for the small development cluster here, apply this trimmed CephCluster instead.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.0   # a Ceph version your Rook release supports
  dataDirHostPath: /var/lib/rook
  mon:
    count: 1            # one monitor is fine for development; use 3 in production
  mgr:
    count: 1
  dashboard:
    enabled: true
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^sdb$"   # claim only this device, never the OS disk

Two choices in there deserve a word, because the example file’s defaults are riskier than they look. mon.count: 1 is fine for development and a single point of failure for the storage control plane; production uses three so the monitors can form a quorum and survive losing one, exactly the same odd-number logic as control-plane nodes in the last post.

The bigger one is device selection. The example cluster.yaml ships with useAllDevices: true, which tells Ceph to grab every empty disk it finds on every node. That is convenient until the day it claims a disk you meant to use for something else. The safer habit, and the one worth building now, is useAllDevices: false with an explicit deviceFilter naming exactly the device Ceph may have. You are telling the storage system precisely what it owns, rather than letting it forage.
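
Because deviceFilter is a regular expression over short device names, the anchors matter. The sketch below uses grep -E to preview which names a pattern would match. It assumes grep's extended syntax behaves like Rook's matching for simple patterns such as these, and the helper filter_matches is hypothetical.

```shell
# Hypothetical helper: print the short device names that a filter pattern matches.
# Assumes grep -E agrees with Rook's regex matching for simple anchored patterns.
filter_matches() {
  pattern=$1
  shift
  for name in "$@"; do
    printf '%s\n' "$name" | grep -Eq -- "$pattern" && printf '%s\n' "$name"
  done
  return 0
}

filter_matches '^sdb$' sda sdb sdb1 sdc       # prints: sdb
filter_matches 'sdb' sda sdb sdb1 sdc         # prints: sdb, sdb1 (unanchored)
filter_matches '^sd[b-c]$' sda sdb sdb1 sdc   # prints: sdb, sdc
```

The unanchored pattern also matches sdb1, which is why the manifest above spells the filter as ^sdb$ rather than sdb.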

kubectl apply -f cephcluster.yaml

Rook discovers the disk, brings up the monitor and manager, and starts an OSD on each worker device. It takes a few minutes. Apply the toolbox to get the ceph command-line tools, then check health.

kubectl apply -f ${BASE}/toolbox.yaml

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

Representative output once the cluster is healthy:

  cluster:
    id:     a1b2c3d4-1234-5678-9abc-def012345678
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 4m)
    mgr: a(active, since 3m)
    osd: 2 osds: 2 up (since 3m), 2 in (since 3m)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   54 MiB used, 200 GiB / 200 GiB avail
    pgs:

HEALTH_OK with two OSDs up means Ceph is running on your disks. Confirm they landed where you expect with the OSD tree, which shows the placement Ceph will use for replication.

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree

ID  CLASS  WEIGHT   TYPE NAME             STATUS
-1         0.19519  root default
-3         0.09760      host platform-w1
 0    ssd  0.09760          osd.0             up
-5         0.09760      host platform-w2
 1    ssd  0.09760          osd.1             up

One OSD under each worker host. That is what makes a host failure domain meaningful: the replicas of your data will sit on different machines.
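
Since the cluster takes a few minutes to settle, a small polling loop saves re-running the status command by hand. This is a sketch under stated assumptions: the helper name wait_for_health_ok is hypothetical, and it relies on ceph health printing a line that begins with HEALTH_OK when the cluster is healthy.

```shell
# Hypothetical helper: poll a health command until it reports HEALTH_OK.
#   $1 = max attempts, $2 = seconds between attempts, rest = command to run
wait_for_health_ok() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # A failing command (toolbox not ready yet) is treated as "no response".
    status=$("$@" 2>/dev/null) || status=""
    case $status in
      HEALTH_OK*) echo "healthy after $i attempt(s)"; return 0 ;;
    esac
    echo "attempt $i: ${status:-no response}"
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage against the toolbox from this post (no -it, nothing here is interactive):
#   wait_for_health_ok 30 10 \
#     kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health
```

Passing the health command as arguments keeps the loop independent of how you reach Ceph, so the same function works from a CI job or an agent.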

§Reach the Ceph Dashboard

The CephCluster enabled the dashboard, which is genuinely useful for both you and any agent operating this cluster, since it surfaces health, capacity, and OSD status without shelling into the toolbox. Pull the generated admin password and forward the dashboard service.

kubectl -n rook-ceph get secret rook-ceph-dashboard-password \
  -o jsonpath="{['data']['password']}" | base64 -d ; echo

kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 8443:8443

Open https://localhost:8443 and log in as admin. The certificate is self-signed, so your browser will complain; that is expected for an internal service reached over a port-forward.

§Block Storage and the One Big Change Since 2020

Pods get storage through a PersistentVolumeClaim that names a StorageClass. Create a CephBlockPool, which tells Ceph how to replicate the data across the worker disks, and a StorageClass that provisions volumes from it.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 2             # one copy on each worker; raise to 3 as you add nodes
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true

Look at the provisioner. In the 2020 book the storage class used ceph.rook.io/block, Rook’s own flex-volume provisioner. That whole mechanism is gone. Kubernetes storage standardized on the Container Storage Interface, and Ceph volumes now come through the CSI driver rook-ceph.rbd.csi.ceph.com, with the secret references above wiring the provisioner to Rook’s credentials. You do not invent these by hand; Rook ships the exact storage class as an example under deploy/examples/csi/rbd/. The point to carry forward is that the provisioner is CSI now, and allowVolumeExpansion: true means you can grow a volume later by editing the claim, which the old flex provisioner could not do.

kubectl apply -f rook-ceph-block.yaml
kubectl get storageclass

NAME                        PROVISIONER                  RECLAIMPOLICY   ALLOWVOLUMEEXPANSION
rook-ceph-block (default)   rook-ceph.rbd.csi.ceph.com   Delete          true
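
Growing a volume later comes down to raising the storage request on its claim. The sketch below builds the patch for that. The helper name expand_patch and the claim name my-claim are hypothetical, and the new size must be larger than the current one because volumes can grow but not shrink.

```shell
# Hypothetical helper: build the patch that requests a larger size for a claim.
expand_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}

# Usage (claim name and size are illustrative):
#   kubectl patch pvc my-claim -p "$(expand_patch 10Gi)"
#   kubectl get pvc my-claim -w    # capacity updates once the resize completes
```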

§Prove It Actually Persists

A Bound claim is necessary but it does not prove the thing that matters: that the data lives in the cluster and follows a pod to a different node, which is exactly what a node-local volume cannot do. So prove that, end to end. Claim a volume, write a file through one pod, destroy that pod, and read the file back from a new pod that the scheduler is free to place anywhere.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 5Gi

kubectl apply -f test-claim.yaml

# write a file through a pod that mounts the claim
kubectl run writer --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"writer","image":"busybox","command":["sh","-c","echo owned-by-the-cluster > /data/proof.txt && sleep 5"],"volumeMounts":[{"name":"v","mountPath":"/data"}]}],"volumes":[{"name":"v","persistentVolumeClaim":{"claimName":"test-claim"}}]}}'

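# a --restart=Never pod stays around as Completed; wait for it to finish, then delete it
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/writer --timeout=60s
kubectl delete pod writer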

The writer pod is gone, but the volume and its file are not. Mount the same claim from a fresh pod and read it back.

kubectl run reader --image=busybox --restart=Never -it --rm \
  --overrides='{"spec":{"containers":[{"name":"reader","image":"busybox","command":["cat","/data/proof.txt"],"volumeMounts":[{"name":"v","mountPath":"/data"}]}],"volumes":[{"name":"v","persistentVolumeClaim":{"claimName":"test-claim"}}]}}'

owned-by-the-cluster

The file survived the pod that wrote it and reattached to a pod the scheduler placed independently. That is real cluster storage, not a directory pinned to a node. Delete the claim when you are done; with reclaimPolicy: Delete the underlying Ceph volume goes with it.

kubectl delete pvc test-claim

§A Shared Filesystem When You Need One

Block volumes are ReadWriteOnce, mounted read-write by one node at a time, which is what databases, search indexes, and queues want. Some workloads instead need many pods reading and writing the same files at once. For that, create a CephFilesystem, which brings up an MDS, and a filesystem storage class.

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: rook-ceph-fs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 2
  dataPools:
    - failureDomain: host
      replicated:
        size: 2
  metadataServer:
    activeCount: 1
    activeStandby: true

Pair it with a StorageClass using the rook-ceph.cephfs.csi.ceph.com provisioner (Rook’s deploy/examples/csi/cephfs/storageclass.yaml), and claims against it come back ReadWriteMany. Reach for this only when you genuinely need shared access; for everything with a single owner, block volumes are simpler and faster.
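
As a rough sketch of that pairing, the function below prints a StorageClass for the filesystem declared above, ready to pipe into kubectl apply. The helper name and the class name rook-cephfs are assumptions, and the secret names follow the rook-csi-cephfs-* pattern from Rook's example. Rook's shipped file also sets a pool parameter naming the data pool; it is left out here on the assumption that the driver falls back to the filesystem's default data pool, so compare against the shipped example before relying on it.

```shell
# Hypothetical helper: print a CephFS StorageClass for the CephFilesystem above.
# The class name rook-cephfs is an assumption; fsName must match the
# CephFilesystem's metadata.name.
cephfs_storageclass() {
  cat <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: rook-ceph-fs
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
allowVolumeExpansion: true
EOF
}

# Usage:
#   cephfs_storageclass | kubectl apply -f -
```

A claim against this class then asks for accessModes: [ReadWriteMany] instead of ReadWriteOnce.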

§Owning the Storage Means Operating It

A managed disk service quietly does several things for you: it watches the hardware, it grows when you ask, it re-replicates around failures, and it takes snapshots. Own the storage and those are yours, and Ceph was built to do all of them. None is a heroic operation; each is a command or a manifest, and each is the kind of routine an agent with this post in front of it can run.

Watch its health. Ceph tells you plainly how it is doing. HEALTH_OK is the goal; HEALTH_WARN means it is coping with something and usually fixing it; HEALTH_ERR means it needs you. The dashboard shows the same, and the detail view explains any warning in words.

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail

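Grow it. Running low on space is not a migration, it is another disk. Attach a new empty volume to a worker and make sure the deviceFilter in the CephCluster matches its name: a second disk on an existing worker usually appears as sdc, which ^sdb$ does not match, so widen the filter first. Because the operator reconciles continuously, Rook then discovers the disk and brings up another OSD. Ceph rebalances data onto it automatically using CRUSH, with no downtime and no manual data movement. Watch it happen.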

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# a new osd appears under the host you added the disk to
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# 'data' shows objects misplaced and recovering, then settles back to HEALTH_OK

Let it heal. This is the part managed storage hides behind a curtain, and Ceph does it in the open. Lose a disk or a whole node, and Ceph notices, marks the OSD down, and, as long as enough OSDs remain in separate failure domains to hold every copy, re-replicates the affected data onto them to restore the replica count you asked for. On the two-worker cluster here, with size: 2 and failureDomain: host, losing a whole node leaves no second host to copy to, so the data stays degraded until the node returns or a third worker joins. During recovery ceph status reports a HEALTH_WARN with degraded and recovering placement groups, then returns to HEALTH_OK on its own once the copies are restored. Your size: 2 is not a wish, it is a contract Ceph actively maintains.

Snapshot it. Ceph’s RBD volumes snapshot natively, and Kubernetes exposes that through the CSI snapshot API, so a backup is a manifest. With the snapshot CRDs and the snapshot controller installed in the cluster, declare a VolumeSnapshotClass once (Rook ships one under deploy/examples/csi/rbd/), then snapshot any claim.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
driver: rook-ceph.rbd.csi.ceph.com
deletionPolicy: Delete
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pgdata-snap-20260623
spec:
  volumeSnapshotClassName: csi-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: pgdata-postgres-0

That snapshot becomes the source for a new claim, which is how you restore a database to a point in time or clone it for testing. It is the storage-layer companion to the etcd backups from the last post: the cluster owns its state, so the cluster takes its own backups.
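
To make the restore step concrete, the sketch below prints a claim whose dataSource points at a snapshot. The helper name restore_claim_manifest, the new claim name, and the 10Gi size are illustrative assumptions. The requested size should be at least the size of the snapshotted volume, and the claim has to live in the same namespace as the snapshot.

```shell
# Hypothetical helper: print a PVC that restores from a VolumeSnapshot.
#   $1 = new claim name, $2 = snapshot name, $3 = requested size
restore_claim_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: rook-ceph-block
  dataSource:
    name: $2
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: $3
EOF
}

# Usage (claim name and size are illustrative):
#   restore_claim_manifest pgdata-restore pgdata-snap-20260623 10Gi | kubectl apply -f -
```

The restored claim is an independent volume, so a test database can write to it without touching the original.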

§When Something Is Wrong

A handful of failure modes account for nearly all the trouble with Rook, and like most of this stack they are obvious once you know where to look.

No OSDs ever appear and there is no disk usage. The device was not actually empty. This is the most common one by far. Check the OSD-prepare job logs (kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare); a line about an existing filesystem or partition signature is the tell. Wipe the device as in the prerequisites and let the operator reconcile.

Ceph claimed the wrong disk. You used useAllDevices: true and it grabbed something you wanted for another purpose. Switch to useAllDevices: false with an explicit deviceFilter, the safer pattern from the cluster section.

A reinstall will not come up and the monitors will not reach quorum. Rook keeps cluster identity under dataDirHostPath (/var/lib/rook) on each node. If you tore down a previous Ceph cluster without clearing that directory on every node, the stale state poisons the new one. Remove /var/lib/rook on all nodes before reinstalling. This trips up almost everyone exactly once.

A claim is stuck Pending. The CSI provisioner is not doing its job. Confirm the operator and the CSI provisioner pods are running (kubectl -n rook-ceph get pods | grep csi), and that the storage class names a pool that actually exists.

§Block, Filesystem, and What Comes Next

The default rook-ceph-block class is what the StatefulSets later in this series claim from: Postgres, OpenSearch, Kafka, all want their own replicated block volume per replica, snapshot-able and grow-able on infrastructure you own. The filesystem class is there for the rarer case of shared files across pods. Object storage, the S3-style bucket layer for the data lake, is a separate concern handled later by SeaweedFS rather than Ceph’s own object store, so we leave that off here and keep Ceph focused on block and filesystem.

The cluster can now keep data, replicated across its own disks, healing itself when hardware fails, and backing itself up, all on infrastructure that owes nothing to any provider. Next we make services reachable from the outside with the Gateway API, the successor to the Ingress this stack used to lean on.

Craig Johnston · 2026-06-23
