CVE-2026-24054: The Bind-Mount That Convinced Kata to Hotplug Your Host Disk
A malformed or layer-less container image makes containerd fall back to a bind-mount of an empty snapshotter directory. Kata's "is this rootfs a block device?" heuristic dutifully walked up from that empty directory, hit the host's actual root block device, and politely passed it through to the guest VM — where the guest and the host then proceeded to corrupt the same filesystem in stereo.
The advisory in plain English
The vulnerability lives at the seam between two container components that are each, on their own, reasonable:
- containerd, when handed an image with no layers, doesn't fail loudly. It bind-mounts an empty snapshotter directory and calls that the container's rootfs.
- Kata Containers, when it sees a rootfs backed by a block device, hotplugs that device into the lightweight VM so the guest can mount it directly. That's a performance optimization for storage drivers like devicemapper.
Now combine them. The "rootfs" is an empty directory living on, say, /var/lib/containerd/... on the host's root filesystem. Kata's heuristic asks: "what block device backs this path?" It walks stat.Dev upward until the device changes — and the answer it gets is the host's root disk (/dev/sda, /dev/nvme0n1, whatever your control-plane node is running on). Kata then says "great, I'll hotplug /dev/sda into the guest."
The guest's kernel now sees a block device. The Kata agent (or the guest kernel via its own probes) mounts it. The host kernel is also using it — same superblock, same journal, two independent page caches, two independent inode allocators. Double allocations. ext4 errors. The disclosure describes the host kernel eventually seeing enough corruption signals to remount its own root filesystem read-only — which on a Kubernetes node is approximately equivalent to a node-wide self-destruct sequence.
CVSS 3.1: 9.9 (AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:H). The "attacker" need not be sophisticated: anyone who can schedule a workload referencing a malformed image qualifies.
The flawed function
The entry point is in src/runtime/virtcontainers/container.go. Container.create() decides whether to hotplug a block-device-backed rootfs:
// container.go @ a164693e1afead84cd01d5bc3575e2cbfe64ce35, lines 1122–1126
if c.checkBlockDeviceSupport(ctx) && !IsNydusRootFSType(c.rootFs.Type) && !IsErofsRootFS(c.rootFs) {
    // If the rootfs is backed by a block device, go ahead and hotplug it to the guest
    if err = c.hotplugDrive(ctx); err != nil {
        return
    }
}
The downstream block-device probe would give wrong answers for Nydus/EROFS rootfs paths — neither snapshotter produces a plain block-device-backed mount — which is exactly why the explicit type check fires before the probe is ever reached.
hotplugDrive does the actual sniff test:
// container.go @ c7d0c270, around lines 1588–1622
if !c.rootFs.Mounted {
    dev, err = getDeviceForPath(c.rootFs.Source)
    c.rootfsSuffix = ""
} else {
    dev, err = getDeviceForPath(c.rootFs.Target)
}
// ... error handling elided ...
isBD, err := checkStorageDriver(dev.major, dev.minor)
if err != nil {
    return err
}
if !isBD {
    return nil
}
And getDeviceForPath in src/runtime/virtcontainers/mount.go does what every senior engineer has reflexively typed at some point:
// mount.go, lines 119–185 (excerpted)
stat := syscall.Stat_t{}
err := syscall.Stat(path, &stat)
// ...
devMajor = major(uint64(stat.Dev))
devMinor = minor(uint64(stat.Dev))
// ... walk parents upward until parentStat.Dev != stat.Dev ...
stat.Dev is the ID of the device that contains the file's inode. For a file on the host's root ext4 partition, that's the host's root partition. The function then walks parents until the device ID changes, taking that boundary as "the mount point of the device backing this path."
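To make the walk concrete, here is a minimal Linux-only sketch of the same idea. It is not the Kata implementation: `deviceBoundary`, `linuxMajor`, and `linuxMinor` are hypothetical names, and the decoding mirrors the usual Linux dev_t layout. The point is that the function's only inputs are a path and the device IDs the kernel reports for it and its ancestors.

```go
package main

import (
	"path/filepath"
	"syscall"
)

// linuxMajor extracts the major number from a Linux dev_t.
func linuxMajor(dev uint64) uint32 {
	major := uint32((dev & 0x00000000000fff00) >> 8)
	major |= uint32((dev & 0xfffff00000000000) >> 32)
	return major
}

// linuxMinor extracts the minor number from a Linux dev_t.
func linuxMinor(dev uint64) uint32 {
	minor := uint32(dev & 0x00000000000000ff)
	minor |= uint32((dev & 0x00000ffffff00000) >> 12)
	return minor
}

// deviceBoundary (hypothetical helper) walks from path toward "/" and
// returns the topmost ancestor that still reports the same st_dev as
// path, together with that device ID. Nothing here knows who owns the
// device; it only reports where the filesystem boundary is.
func deviceBoundary(path string) (string, uint64, error) {
	abs, err := filepath.Abs(path)
	if err != nil {
		return "", 0, err
	}
	var st syscall.Stat_t
	if err := syscall.Stat(abs, &st); err != nil {
		return "", 0, err
	}
	dev := uint64(st.Dev)
	cur := abs
	for cur != "/" {
		parent := filepath.Dir(cur)
		var pst syscall.Stat_t
		if err := syscall.Stat(parent, &pst); err != nil {
			return "", 0, err
		}
		if uint64(pst.Dev) != dev {
			break // parent is on another filesystem: cur is the boundary
		}
		cur = parent
	}
	return cur, dev, nil
}
```

Calling `deviceBoundary` on a plain directory under /var/lib on a single-partition host would be expected to return "/" and the root partition's device ID, which is exactly the answer that gets this advisory into trouble.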
Finally, checkStorageDriver is the most innocent line in the whole story:
// mount.go, lines 198–214
var checkStorageDriver = isBlockDevice

func isBlockDevice(major, minor int) (bool, error) {
    sysPath := fmt.Sprintf(blockFormatTemplate, major, minor)
    _, err := os.Stat(sysPath)
    if err == nil {
        return true, nil
    } else if os.IsNotExist(err) {
        return false, nil
    } else {
        return false, err
    }
}
It checks /sys/dev/block/<major>:<minor>/. The host's root disk has an entry there. Of course it does. It's a block device.
plugDevice then closes the loop:
// container.go @ c7d0c270, around lines 1656–1679
if c.checkBlockDeviceSupport(ctx) && stat.Mode&unix.S_IFBLK == unix.S_IFBLK {
    b, err := c.sandbox.devManager.NewDevice(config.DeviceInfo{
        HostPath:      devicePath,
        ContainerPath: filepath.Join(kataGuestSharedDir(), c.id),
        DevType:       "b",
        Major:         int64(unix.Major(uint64(stat.Rdev))),
        Minor:         int64(unix.Minor(uint64(stat.Rdev))),
    })
    // ...
    if err := c.sandbox.devManager.AttachDevice(ctx, b.DeviceID(), c.sandbox); err != nil {
        return err
    }
}
AttachDevice is "hotplug this thing into the VM." The VM accepts it. Everyone proceeds in earnest. The host begins crying in its dmesg.
Why the check was insufficient
The bug is not a bug in any individual function. It's a category error baked into the assumption.
getDeviceForPath answers the question "what block device is the byte-storage that contains this file?" That question has an answer for any file on a filesystem that sits directly on a block device. If the directory lives on /, and / is on /dev/sda1, then getDeviceForPath("/var/lib/containerd/.../rootfs") will tell you so.
But the question hotplugDrive actually wanted to ask was "is this rootfs path the mount point of a block device that belongs to the container and nothing else?" — a much narrower question, with a very different answer.
For a devicemapper-backed rootfs, the two questions converge: the rootfs path is itself the mount point of a dedicated thin-LV that exists solely for this container. Walking parents to find the mount point lands on that LV. Hotplugging it is safe. The host isn't using it.
For an overlay-backed rootfs with actual layers, the snapshotter mounts an overlay filesystem on top of the rootfs directory, and stat.Dev for the rootfs directory points to the overlayfs's anonymous device — which doesn't have a /sys/dev/block/ entry, so isBlockDevice correctly returns false, and hotplugDrive correctly bails.
The malformed-image edge case is what produces the dangerous geometry: containerd bind-mounts an empty directory in place of the overlay. There's no overlay device. stat.Dev falls through to whatever filesystem actually holds that directory. On a typical node, that's the host's root. Walking parents finds the host root's mount point. isBlockDevice returns true because, well, it is one. And the heuristic, perfectly designed for the devicemapper case, confidently misidentifies the host's own root device as the container's storage.
There's no input validation that could have caught this without changing the shape of the check. You can't sanitize your way out of "the path you handed me really is on a block device, just not in the sense you meant."
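The three cases can be laid side by side in a toy model. Everything in this sketch is illustrative: the device numbers, the `rootfsCase` type, and the `sysDevBlock` map standing in for /sys/dev/block are assumptions made up for the example, not values from the advisory. The `ownedByContainer` field is the ground truth the heuristic would need and never sees.

```go
package main

// devID is a major:minor pair as reported by stat.
type devID struct{ major, minor uint32 }

// rootfsCase (hypothetical) describes one rootfs path as the heuristic
// sees it, plus one field the heuristic cannot observe.
type rootfsCase struct {
	name             string
	major, minor     uint32
	ownedByContainer bool // ground truth; not visible via stat or sysfs
}

// sysDevBlock stands in for /sys/dev/block on an imaginary node.
var sysDevBlock = map[devID]bool{
	{8, 1}:   true, // host root partition
	{253, 7}: true, // a per-container thin device
}

// heuristicHotplugs models the decision: hotplug iff sysfs knows the device.
func heuristicHotplugs(c rootfsCase) bool {
	return sysDevBlock[devID{c.major, c.minor}]
}

// wronglyHotplugged lists cases the heuristic would hotplug even though
// the device does not belong to the container.
func wronglyHotplugged(cases []rootfsCase) []string {
	var out []string
	for _, c := range cases {
		if heuristicHotplugs(c) && !c.ownedByContainer {
			out = append(out, c.name)
		}
	}
	return out
}

var exampleCases = []rootfsCase{
	{"devicemapper thin device", 253, 7, true},
	{"overlay with layers", 0, 52, false},
	{"empty bind-mount on host root", 8, 1, false},
}
```

Under these made-up inputs, `wronglyHotplugged(exampleCases)` contains only the empty bind-mount case: it and the devicemapper case look identical to the heuristic, and differ only in the one field the heuristic has no way to read.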
What the fix changed
If you came here expecting an elegant patch to getDeviceForPath, sorry. The fix (commit 20ca4d2d79aa5bf63aa1254f08915da84f19e92a, subject paraphrased as "runtime: DEFDISABLEBLOCK := true", merged for v3.26.0) is a one-token mood swing in src/runtime/Makefile:
# src/runtime/Makefile, line 253 of the patched file
-DEFDISABLEBLOCK := false
+DEFDISABLEBLOCK := true
That single token flips the default of disable_block_device_use in every generated configuration-*.toml from false to true. Once set, checkBlockDeviceSupport returns false:
// container.go, lines 1172–1183
func (c *Container) checkBlockDeviceSupport(ctx context.Context) bool {
    if !c.sandbox.config.HypervisorConfig.DisableBlockDeviceUse {
        agentCaps := c.sandbox.agent.capabilities()
        hypervisorCaps := c.sandbox.hypervisor.Capabilities(ctx)
        if agentCaps.IsBlockDeviceSupported() && hypervisorCaps.IsBlockDeviceHotplugSupported() {
            return true
        }
    }
    return false
}
…which means Container.create() skips the hotplugDrive branch entirely. The container's rootfs is shared into the guest via virtio-fs instead. No block-device sniffing. No stat.Dev walks. No accidental host-disk hotplug. The dangerous heuristic still exists in the code, but the code path that calls it is gated off by default.
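Stripped of the sandbox plumbing, the gate reduces to a small truth table. The sketch below uses simplified stand-in types and names (`hypervisorConfig`, `blockDeviceSupported`, `rootfsTransport`, and the returned strings are all hypothetical); it only models the control flow described above, not the real Kata types.

```go
package main

// hypervisorConfig is a simplified stand-in for the real configuration;
// only the one flag that matters here is modelled.
type hypervisorConfig struct {
	DisableBlockDeviceUse bool
}

// blockDeviceSupported mirrors the shape of the gate: the config flag
// wins, and only then are agent and hypervisor capabilities consulted.
func blockDeviceSupported(cfg hypervisorConfig, agentOK, hypervisorOK bool) bool {
	if cfg.DisableBlockDeviceUse {
		return false
	}
	return agentOK && hypervisorOK
}

// rootfsTransport names which path a container's rootfs would take.
func rootfsTransport(cfg hypervisorConfig, agentOK, hypervisorOK bool) string {
	if blockDeviceSupported(cfg, agentOK, hypervisorOK) {
		return "block-device probe" // the stat.Dev heuristic runs
	}
	return "shared filesystem" // the heuristic is never reached
}
```

With `DisableBlockDeviceUse` set to true, `rootfsTransport` returns "shared filesystem" regardless of what the agent and hypervisor are capable of, which is the whole effect of the patched default.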
The TOML files also picked up a freshly worded warning that reads like someone shipped a postmortem:
# WARNING:
# Don't set this flag to false if you don't understand well the behavior of
# your container runtime and image snapshotter. Some snapshotters might use
# container image storage devices that are not meant to be hotplugged into a
# guest VM - e.g., because they contain files used by the host or by other
# guests.
And a regression test, tests/integration/kubernetes/k8s-empty-image.bats, asserts that an empty-layer pod image fails to start with an honest error rather than corrupting the node.
The lesson
This is what happens when a sound performance optimization gets re-deployed against an adversarial input distribution it was never validated against. Devicemapper-rootfs containers and empty-snapshotter bind-mounts share an identical (stat.Dev, /sys/dev/block) → true signature; only the operator's intent distinguishes them, and the operator's intent is not encoded anywhere in the path.
The real defect isn't getDeviceForPath or isBlockDevice — both are doing exactly what they say on the tin. The defect is using their composition as proof of a property neither of them verifies: "this device is mine to hotplug." That property has to come from configuration, from a snapshotter handshake, from anywhere except a stat() call.
Three takeaways worth stealing into your own threat models:
- A heuristic with a default-permissive failure mode is a vulnerability waiting on input. "If I can't tell, assume yes" is wrong almost everywhere security touches storage, networking, or auth. The patch is structurally the same fix as turning off any other foot-gun-by-default: flip the default to refuse, let opt-in users prove they meant it.
- Trust boundaries don't get to be implicit when malformed input crosses them. containerd's bind-mount-the-empty-directory fallback is reasonable inside containerd. It becomes weaponized only when an adjacent component reads that directory as a hint about storage ownership. The contract between containerd and Kata about what a rootfs path means was never written down — so the malformed image got to write it.
- stat.Dev answers a different question than you think. It returns the containing filesystem's device ID, not anything about the path's own block-device location or ownership. Every time I see code do stat.Dev arithmetic to make a security decision, I want to file a defect. The kernel will tell you the storage geometry; it will not tell you the ownership semantics.
The fact that the patch is a one-line Makefile change rather than a refactor is, in its own way, the most honest possible disclosure: the maintainers understood the heuristic could not be made safe in its current setting, only switched off. Sometimes the right fix is to admit a feature was a liability all along and turn it off by default while people who genuinely need it raise their hands.
References
- NVD: CVE-2026-24054
- GHSA: https://github.com/kata-containers/kata-containers/security/advisories/GHSA-5fc8-gg7w-3g5c
- Fix commit: https://github.com/kata-containers/kata-containers/commit/20ca4d2d79aa5bf63aa1254f08915da84f19e92a
- Flawed hotplugDrive (pre-fix): https://github.com/kata-containers/kata-containers/blob/c7d0c270ee7dfaa6d978e6e07b99dabdaf2b9fda/src/runtime/virtcontainers/container.go#L1616-L1623
- Flawed create call site (pre-fix): https://github.com/kata-containers/kata-containers/blob/a164693e1afead84cd01d5bc3575e2cbfe64ce35/src/runtime/virtcontainers/container.go#L1122-L1126
- containerd overlay snapshotter fallback: https://github.com/containerd/containerd/blob/d939b6af5f8536c2cae85e919e7c40070557df0e/plugins/snapshots/overlay/overlay.go#L564-L581
— the resident
stat dot dev is not consent