From: Achkinazi, Igor
Date: Thu, 28 May 2026 15:24:27 +0000 (+0000)
Subject: nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=88bac2c1a72b8f4f71e9845699aa872df04e5850;p=thirdparty%2Fkernel%2Flinux.git

nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks

When nvme_ns_head_submit_bio() remaps a bio from the multipath head to a
per-path namespace, bio_set_dev() clears BIO_REMAPPED. The remapped bio
is then resubmitted through submit_bio_noacct() which calls
bio_check_eod() because BIO_REMAPPED is not set.

This races with nvme_ns_remove() which zeroes the per-path capacity
before synchronize_srcu():

  CPU 0 (IO submission)
  ---------------------
  srcu_read_lock()
  nvme_find_path() -> ns
    [NVME_NS_READY is set]

                              CPU 1 (namespace removal)
                              -------------------------
                              clear_bit(NVME_NS_READY)
                              set_capacity(ns->disk, 0)
                              synchronize_srcu()  <- blocks

  CPU 0 (IO submission)
  ---------------------
  bio_set_dev(bio, ns->disk->part0)
    [clears BIO_REMAPPED]
  submit_bio_noacct(bio)
    -> bio_check_eod() sees capacity=0
    -> bio fails with IO error

The SRCU read lock prevents synchronize_srcu() from completing, but does
not prevent set_capacity(0) from executing. The bio fails the EOD check
before it reaches the NVMe driver, so nvme_failover_req() never gets a
chance to redirect it to another path of multipath. IO errors are
reported to the application despite another path being available.

On older kernels (before commit 0b64682e78f7 "block: skip unnecessary
checks for split bio"), the same race was also reachable through split
remainders resubmitted via submit_bio_noacct().

Fix this by setting BIO_REMAPPED after bio_set_dev() in
nvme_ns_head_submit_bio(). This skips bio_check_eod() on the per-path
device; the EOD check already passed on the multipath head.
NVMe per-path namespace devices are always whole disks (bd_partno=0), so
the blk_partition_remap() skip also gated by BIO_REMAPPED is a no-op.

The flag does not persist across failover and cannot go stale if the
namespace geometry changes between attempts: nvme_failover_req() calls
bio_set_dev() to redirect the bio back to the multipath head, which
clears BIO_REMAPPED. When nvme_requeue_work() resubmits through
submit_bio_noacct(), bio_check_eod() runs normally against the current
capacity.

Same approach as commit 3a905c37c351 ("block: skip bio_check_eod for
partition-remapped bios").

Fixes: a7c7f7b2b641 ("nvme: use bio_set_dev to assign ->bi_bdev")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig
Signed-off-by: Igor Achkinazi
Signed-off-by: Keith Busch
---

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index d6c51f59ff258..bd9e8d5a27132 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -521,6 +521,12 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
 	ns = nvme_find_path(head);
 	if (likely(ns)) {
 		bio_set_dev(bio, ns->disk->part0);
+		/*
+		 * Use BIO_REMAPPED to skip bio_check_eod() when this bio
+		 * enters submit_bio_noacct() for the per-path device. The EOD
+		 * check already passed on the multipath head.
+		 */
+		bio_set_flag(bio, BIO_REMAPPED);
 		bio->bi_opf |= REQ_NVME_MPATH;
 		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
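
Illustration only, not part of the patch: the interleaving described in
the commit message can be replayed in a small userspace model. Every name
below (model_disk, model_bio, model_submit, model_race, MODEL_BIO_REMAPPED)
is hypothetical and exists only for this sketch; none of it is kernel API.
The end-of-device test is a simplified approximation of what
bio_check_eod() is described as doing, and the concurrent removal is
replayed sequentially rather than run on a second thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the BIO_REMAPPED flag bit. */
#define MODEL_BIO_REMAPPED (1u << 0)

struct model_disk {
	uint64_t capacity;	/* in sectors; 0 models set_capacity(disk, 0) */
};

struct model_bio {
	struct model_disk *disk;
	uint64_t sector;
	uint32_t nr_sectors;
	unsigned int flags;
};

/* Models bio_set_dev(): retarget the bio and clear the remapped flag. */
static void model_bio_set_dev(struct model_bio *bio, struct model_disk *disk)
{
	bio->disk = disk;
	bio->flags &= ~MODEL_BIO_REMAPPED;
}

/* Simplified end-of-device check: true if the bio fits on the disk. */
static bool model_eod_ok(const struct model_bio *bio)
{
	uint64_t cap = bio->disk->capacity;

	return bio->nr_sectors == 0 ||
	       (cap >= bio->nr_sectors && bio->sector <= cap - bio->nr_sectors);
}

/* Models the submit path: 0 = bio reaches the driver, -1 = EOD failure. */
static int model_submit(const struct model_bio *bio)
{
	assert(bio && bio->disk);

	/* The EOD check is skipped when the remapped flag is set. */
	if (!(bio->flags & MODEL_BIO_REMAPPED) && !model_eod_ok(bio))
		return -1;
	return 0;
}

/* Replays the race: capacity is zeroed before the bio is retargeted. */
int model_race(bool set_remapped)
{
	struct model_disk path = { .capacity = 1024 };
	struct model_bio bio = { .sector = 8, .nr_sectors = 8, .flags = 0 };

	path.capacity = 0;			/* CPU 1: namespace removal */
	model_bio_set_dev(&bio, &path);		/* CPU 0: remap to the path */
	if (set_remapped)
		bio.flags |= MODEL_BIO_REMAPPED;	/* the fix */
	return model_submit(&bio);
}
```

In this model, model_race(false) fails at the EOD check, which corresponds
to the IO error seen by the application, while model_race(true) lets the
bio through to the point where, in the real driver, nvme_failover_req()
could redirect it to another path.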