git.ipfire.org Git - thirdparty/linux.git/log

2 weeks ago  zloop: fail zone append operations that are targeting full zones
Damien Le Moal [Sat, 15 Nov 2025 12:15:52 +0000 (21:15 +0900)] 
zloop: fail zone append operations that are targeting full zones

zloop_rw() will fail any regular write operation that targets a full
sequential zone. The check for this is indirect, achieved by checking
the write pointer alignment of the write operation. But this check is
ineffective for zone append operations since these are always
automatically directed at the zone write pointer.

Prevent zone append operations from being executed in a full zone with
an explicit check of the zone condition.
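
A minimal sketch of the kind of explicit check described above (the request
and zone variable names are assumptions, not the actual patch):

```c
/* Illustrative only: reject a zone append aimed at a zone that is already full */
if (req_op(rq) == REQ_OP_ZONE_APPEND && zone->cond == BLK_ZONE_COND_FULL)
	return BLK_STS_IOERR;
```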

Fixes: eb0570c7df23 ("block: new zoned loop block device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks ago  zloop: make the write pointer of full zones invalid
Damien Le Moal [Sat, 15 Nov 2025 12:15:51 +0000 (21:15 +0900)] 
zloop: make the write pointer of full zones invalid

The write pointer of zones that are in the full condition is always
invalid. Reflect that fact by setting the write pointer of full zones
to ULLONG_MAX.
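
For illustration, the update amounts to something like the following (zone
struct layout assumed):

```c
/* Full zones have no valid write pointer, so poison it instead of
 * leaving a stale value behind.
 */
if (zone->cond == BLK_ZONE_COND_FULL)
	zone->wp = ULLONG_MAX;
```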

Fixes: eb0570c7df23 ("block: new zoned loop block device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks ago  block/blk-throttle: Remove throtl_slice from struct throtl_data
Guenter Roeck [Fri, 14 Nov 2025 23:54:34 +0000 (15:54 -0800)] 
block/blk-throttle: Remove throtl_slice from struct throtl_data

throtl_slice is now a constant. Remove the variable and use the constant
directly where needed.

Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks ago  block/blk-throttle: drop unneeded blk_stat_enable_accounting
Guenter Roeck [Fri, 14 Nov 2025 23:54:33 +0000 (15:54 -0800)] 
block/blk-throttle: drop unneeded blk_stat_enable_accounting

After the removal of CONFIG_BLK_DEV_THROTTLING_LOW, it is no longer
necessary to enable block accounting, so remove the call to
blk_stat_enable_accounting(). With that, the track_bio_latency variable
is no longer used and can be deleted from struct throtl_data. Also,
including blk-stat.h is no longer necessary.

Fixes: bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks ago  block/blk-throttle: Fix throttle slice time for SSDs
Guenter Roeck [Fri, 14 Nov 2025 23:54:32 +0000 (15:54 -0800)] 
block/blk-throttle: Fix throttle slice time for SSDs

Commit d61fcfa4bb18 ("blk-throttle: choose a small throtl_slice for SSD")
introduced device type specific throttle slices if BLK_DEV_THROTTLING_LOW
was enabled. Commit bf20ab538c81 ("blk-throttle: remove
CONFIG_BLK_DEV_THROTTLING_LOW") removed support for BLK_DEV_THROTTLING_LOW,
but left the device type specific throttle slices in place. This
effectively changed throttling behavior on systems with SSDs, which now use
a different and non-configurable slice time compared to non-SSD devices.
The practical impact is that throughput tests with low configured throttle
values (65536 bps) see less throughput than expected on SSDs, presumably
due to rounding errors associated with the small throttle slice time used
for those devices. The same tests pass when the throttle values are set to
65536 * 4 = 262144 bps.

The original code sets the throttle slice time to DFL_THROTL_SLICE_HD if
CONFIG_BLK_DEV_THROTTLING_LOW is disabled. Restore that code to fix the
problem. With that, DFL_THROTL_SLICE_SSD is no longer necessary. Revert to
the original code and re-introduce DFL_THROTL_SLICE to replace both
DFL_THROTL_SLICE_HD and DFL_THROTL_SLICE_SSD. This effectively reverts
commit d61fcfa4bb18 ("blk-throttle: choose a small throtl_slice for SSD").

While at it, also remove MAX_THROTL_SLICE since it is not used anymore.
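
A rough sketch of the consolidation; only the single-constant structure is
taken from the description above, the value shown is an assumption:

```c
/* one default slice again, regardless of device type */
#define DFL_THROTL_SLICE	(HZ / 10)	/* value assumed for illustration */

/* no SSD/HD distinction anymore when setting up throttling */
td->throtl_slice = DFL_THROTL_SLICE;
```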

Fixes: bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks ago  block: consider discard merge last
Keith Busch [Fri, 14 Nov 2025 18:31:45 +0000 (10:31 -0800)] 
block: consider discard merge last

If the next discard range is contiguous with the current range being
considered, it's cheaper to expand the current range than to append an
additional bio.
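
Conceptually the preference looks like this (a hedged sketch; 'cur', 'next'
and the append helper are hypothetical names, not the actual merge code):

```c
if (bio_end_sector(cur) == next->bi_iter.bi_sector) {
	/* contiguous: expanding the current range is cheaper */
	cur->bi_iter.bi_size += next->bi_iter.bi_size;
} else {
	/* not contiguous: fall back to appending another bio */
	append_discard_bio(rq, next);
}
```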

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks ago  floppy: fix for PAGE_SIZE != 4KB
Rene Rebe [Fri, 14 Nov 2025 13:41:27 +0000 (14:41 +0100)] 
floppy: fix for PAGE_SIZE != 4KB

For years I wondered why the floppy driver does not just work on
sparc64, e.g:

root@SUNW_375_0066:# disktype /dev/fd0
disktype: Can't open /dev/fd0: No such device or address

[  525.341906] disktype: attempt to access beyond end of device
fd0: rw=0, sector=0, nr_sectors = 16 limit=8
[  525.341991] floppy: error 10 while reading block 0

It turns out floppy.c's __floppy_read_block_0() tries to read one page
for the first test read used to determine the disk size, and thus fails
if the page size is greater than 4k. Adjust the minimum MAX_DISK_SIZE to
PAGE_SIZE to fix the floppy driver on sparc64 and likely all other
PAGE_SIZE != 4KB configs.

Cc: stable@vger.kernel.org
Signed-off-by: René Rebe <rene@exactco.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  ps3disk: use memcpy_{from,to}_bvec index
Rene Rebe [Fri, 14 Nov 2025 14:30:33 +0000 (15:30 +0100)] 
ps3disk: use memcpy_{from,to}_bvec index

With commit 6e0a48552b8c ("ps3disk: use memcpy_{from,to}_bvec") converting
ps3disk to the new bvec helpers, incrementing the offset was accidentally
lost, corrupting consecutive buffers. Restore the index so that data
transfers are no longer corrupted.
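
A sketch of the restored offset accounting (variable names simplified from
the driver):

```c
/* copy each request segment into the bounce buffer, advancing the offset */
unsigned int offset = 0;
struct req_iterator iter;
struct bio_vec bvec;

rq_for_each_segment(bvec, req, iter) {
	memcpy_from_bvec(dev->bounce_buf + offset, &bvec);
	offset += bvec.bv_len;	/* the increment that was lost */
}
```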

Fixes: 6e0a48552b8c ("ps3disk: use memcpy_{from,to}_bvec")
Signed-off-by: René Rebe <rene@exactco.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  Merge branch 'p2pdma-mmio-6.19.v5' into for-6.19/block
Jens Axboe [Fri, 14 Nov 2025 12:10:09 +0000 (05:10 -0700)] 
Merge branch 'p2pdma-mmio-6.19.v5' into for-6.19/block

Merge MMIO P2P DMA series from Leon:

"This patch series improves block layer and NVMe driver support for MMIO
 memory regions, particularly for peer-to-peer (P2P) DMA transfers that
 go through the host bridge.

 The series addresses a critical gap where P2P transfers through the
 host bridge (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE) were not properly marked
 as MMIO memory, leading to potential issues with:

 - Inappropriate CPU cache synchronization operations on MMIO regions
 - Incorrect DMA mapping/unmapping that doesn't respect MMIO semantics
 - Missing IOMMU configuration for MMIO memory handling

 This work is extracted from the larger DMA physical API improvement
 series [1] and focuses specifically on block layer and NVMe
 requirements for MMIO memory support.

 [1] https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/"

Link: https://lore.kernel.org/linux-block/20251114-block-with-mmio-v5-0-69d00f73d766@nvidia.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* p2pdma-mmio-6.19.v5:
  block-dma: properly take MMIO path
  nvme-pci: migrate to dma_map_phys instead of map_page

3 weeks ago  block-dma: properly take MMIO path
Leon Romanovsky [Fri, 14 Nov 2025 09:07:04 +0000 (11:07 +0200)] 
block-dma: properly take MMIO path

In commit eadaa8b255f3 ("dma-mapping: introduce new DMA attribute to
indicate MMIO memory"), the DMA_ATTR_MMIO attribute was added to describe
MMIO addresses, which require avoiding any memory cache flushing, as an
outcome of the discussion referenced in the Link tag below.

In the case of a PCI_P2PDMA_MAP_THRU_HOST_BRIDGE transfer, the blk-mq-dma
logic treated this as a regular page and relied on the "struct page" DMA
flow. That flow performs CPU cache flushing, which shouldn't be done here,
and doesn't set the IOMMU_MMIO flag in the DMA-IOMMU case.

As a solution, encode the peer-to-peer transaction type in the NVMe IOD
flags variable and provide it to the blk-mq-dma API.
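
A hedged sketch of the idea; the IOD flag name and surrounding variables are
assumptions, while dma_map_phys() and DMA_ATTR_MMIO are the interfaces named
above:

```c
/* pass the MMIO attribute for P2P ranges mapped through the host bridge,
 * so the DMA layer skips CPU cache maintenance and sets IOMMU_MMIO
 */
unsigned long attrs = (iod->flags & IOD_P2P_MMIO) ? DMA_ATTR_MMIO : 0;
dma_addr_t addr = dma_map_phys(dma_dev, paddr, len, rq_dma_dir(req), attrs);
```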

Link: https://lore.kernel.org/all/f912c446-1ae9-4390-9c11-00dce7bf0fd3@arm.com/
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  nvme-pci: migrate to dma_map_phys instead of map_page
Leon Romanovsky [Fri, 14 Nov 2025 09:07:03 +0000 (11:07 +0200)] 
nvme-pci: migrate to dma_map_phys instead of map_page

After the introduction of dma_map_phys(), there is no need to convert
from a physical address to a struct page in order to map a page. So
let's use it directly.
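
Roughly, the conversion looks like this (a sketch; the surrounding variables
are assumed):

```c
/* before: the physical address had to be converted to a page */
dma_addr = dma_map_page(dma_dev, pfn_to_page(PHYS_PFN(phys)),
			offset_in_page(phys), len, dir);

/* after: map the physical address directly */
dma_addr = dma_map_phys(dma_dev, phys, len, dir, 0);
```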

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  Merge branch 'elevator-switch-6.19' into for-6.19/block
Jens Axboe [Thu, 13 Nov 2025 16:28:00 +0000 (09:28 -0700)] 
Merge branch 'elevator-switch-6.19' into for-6.19/block

Merge elevator switching improvements from Nilay:

"This patchset reorganizes the elevator switch path used during both
 nr_hw_queues update and elv_iosched_store() operations to address a
 recently reported lockdep splat [1].

 The warning highlights a locking dependency between ->freeze_lock and
 ->elevator_lock on pcpu_alloc_mutex, triggered when the Kyber scheduler
 dynamically allocates its private scheduling data. The fix is to ensure
 that such allocations occur outside the locked sections, thus
 eliminating the dependency chain.

 While working on this, it also became evident that the nr_hw_queue
 update code maintains two disjoint xarrays—one for elevator tags and
 another for elevator type—both serving the same purpose. Unifying these
 into a single elv_change_ctx structure improves clarity and
 maintainability.

 This series therefore implements five patches:
 The first preparatory patch unifies the elevator tags and type xarrays. It
 combines both xarrays into a single struct elv_change_ctx, simplifying
 per-queue elevator state management.

 The second patch is aimed to group together all elevator-related
 resources that share the same lifetime and as a first step we move the
 elevator tags pointer from struct elv_change_ctx into the newly
 introduced struct elevator_resources. The subsequent patch extends the
 struct elevator_resources to include other elevator-related data.

 The third patch introduces ->alloc_sched_data and ->free_sched_data
 elevator ops which can then be used to safely allocate and free
 scheduler data.

 The fourth patch builds upon the previous one and starts using the newly
 introduced alloc/free sched data methods during elevator switch and
 nr_hw_queue update. While doing so, it ensures that sched data allocation
 and freeing happen before we acquire ->freeze_lock and ->elevator_lock,
 thus preventing any dependency on pcpu_alloc_mutex.

 The last patch of this series converts the Kyber scheduler to use the new
 methods introduced in the previous patch. It moves Kyber's scheduler
 data allocation and teardown logic from ->init_sched and ->exit_sched
 into the new methods, ensuring memory operations are performed outside
 locked sections.

 Together, these changes simplify the elevator switch logic and prevent
 the reported lockdep splat."

Link: https://lore.kernel.org/linux-block/20251113090619.2030737-1-nilay@linux.ibm.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* elevator-switch-6.19:
  block: define alloc_sched_data and free_sched_data methods for kyber
  block: use {alloc|free}_sched data methods
  block: introduce alloc_sched_data and free_sched_data elevator methods
  block: move elevator tags into struct elevator_resources
  block: unify elevator tags and type xarrays into struct elv_change_ctx

3 weeks ago  block: define alloc_sched_data and free_sched_data methods for kyber
Nilay Shroff [Thu, 13 Nov 2025 08:58:22 +0000 (14:28 +0530)] 
block: define alloc_sched_data and free_sched_data methods for kyber

Currently, the Kyber elevator allocates its private data dynamically in
->init_sched and frees it in ->exit_sched. However, since ->init_sched
is invoked during elevator switch after acquiring both ->freeze_lock and
->elevator_lock, it may trigger the lockdep splat [1] due to dependency
on pcpu_alloc_mutex.

To resolve this, move the elevator data allocation and deallocation
logic from ->init_sched and ->exit_sched into the newly introduced
->alloc_sched_data and ->free_sched_data methods. These callbacks are
invoked before acquiring ->freeze_lock and ->elevator_lock, ensuring
that memory allocation happens safely without introducing additional
locking dependencies.

This change breaks the dependency chain involving pcpu_alloc_mutex and
prevents the reported lockdep warning.

[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/

Reported-by: Changhui Zhong <czhong@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  block: use {alloc|free}_sched data methods
Nilay Shroff [Thu, 13 Nov 2025 08:58:21 +0000 (14:28 +0530)] 
block: use {alloc|free}_sched data methods

The previous patch introduced ->alloc_sched_data and
->free_sched_data methods. This patch builds upon that
by now using these methods during elevator switch and
nr_hw_queue update.

It's also ensured that scheduler-specific data is
allocated and freed through the new callbacks outside
of the ->freeze_lock and ->elevator_lock locking contexts,
thereby preventing any dependency on pcpu_alloc_mutex.
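
A simplified sketch of the ordering this enforces (locking steps abbreviated;
the callback names are the ones introduced by the previous patch, the rest is
assumed):

```c
void *new_data = NULL, *old_data = NULL;	/* old_data saved from the outgoing elevator */

/* 1) allocate scheduler data with no block-layer locks held */
if (new_e->ops.alloc_sched_data)
	new_data = new_e->ops.alloc_sched_data(q);	/* may take pcpu_alloc_mutex */

/* 2) freeze the queue, take ->elevator_lock, switch elevators and hand
 *    'new_data' to ->init_sched, then unlock and unfreeze */

/* 3) free the old scheduler data only after all locks are dropped */
if (old_e->ops.free_sched_data)
	old_e->ops.free_sched_data(old_data);
```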

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  block: introduce alloc_sched_data and free_sched_data elevator methods
Nilay Shroff [Thu, 13 Nov 2025 08:58:20 +0000 (14:28 +0530)] 
block: introduce alloc_sched_data and free_sched_data elevator methods

The recent lockdep splat [1] highlights a potential deadlock risk
involving ->elevator_lock and ->freeze_lock dependencies on
pcpu_alloc_mutex. The trace shows that the issue occurs when the Kyber
scheduler allocates dynamic memory for its elevator data during
initialization.

To address this, introduce two new elevator operation callbacks:
->alloc_sched_data and ->free_sched_data. The subsequent patch would
build upon these newly introduced methods to suppress lockdep splat[1].

[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  block: move elevator tags into struct elevator_resources
Nilay Shroff [Thu, 13 Nov 2025 08:58:19 +0000 (14:28 +0530)] 
block: move elevator tags into struct elevator_resources

This patch introduces a new structure, struct elevator_resources, to
group together all elevator-related resources that share the same
lifetime. As a first step, this change moves the elevator tag pointer
from struct elv_change_ctx into the new struct elevator_resources.

Additionally, rename blk_mq_alloc_sched_tags_batch() and
blk_mq_free_sched_tags_batch() to blk_mq_alloc_sched_res_batch() and
blk_mq_free_sched_res_batch(), respectively. Introduce two new wrapper
helpers, blk_mq_alloc_sched_res() and blk_mq_free_sched_res(), around
blk_mq_alloc_sched_tags() and blk_mq_free_sched_tags().

These changes pave the way for consolidating the allocation and freeing
of elevator-specific resources into common helper functions. This
refactoring improves encapsulation and prepares the code for future
extensions, allowing additional elevator-specific data to be added to
struct elevator_resources without cluttering struct elv_change_ctx.

Subsequent patches will extend struct elevator_resources to include
other elevator-related data.
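
For illustration, the grouping could look roughly like this (only the tags
pointer is taken from the description; anything else is a placeholder):

```c
/* per-queue elevator resources that share one lifetime */
struct elevator_resources {
	struct elevator_tags *et;	/* moved out of struct elv_change_ctx */
	/* later patches add more elevator-specific data here */
};
```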

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  block: unify elevator tags and type xarrays into struct elv_change_ctx
Nilay Shroff [Thu, 13 Nov 2025 08:58:18 +0000 (14:28 +0530)] 
block: unify elevator tags and type xarrays into struct elv_change_ctx

Currently, the nr_hw_queues update path manages two disjoint xarrays —
one for elevator tags and another for elevator type — both used during
elevator switching. Maintaining these two parallel structures for the
same purpose adds unnecessary complexity and potential for mismatched
state.

This patch unifies both xarrays into a single structure, struct
elv_change_ctx, which holds all per-queue elevator change context. A
single xarray, named elv_tbl, now maps each queue (q->id) in a tagset
to its corresponding elv_change_ctx entry, encapsulating the elevator
tags, type and name references.

This unification simplifies the code, improves maintainability, and
clarifies ownership of per-queue elevator state.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  Merge branch 'bcache-updates-6.19' into for-6.19/block
Jens Axboe [Thu, 13 Nov 2025 16:18:19 +0000 (09:18 -0700)] 
Merge branch 'bcache-updates-6.19' into for-6.19/block

Merge bcache updates from Coly for 6.19:

"The major change is from me, which is to remove useless discard
 interface and code for cache device (not the backing device). And the
 last patch about gc latency is a cooperative result from Robert Pang
 (Google), Mingzhe Zou (Easystack) and me, by inspired from their
 previous works, I compose the final version and Robert prvides positive
 benchmark result.

 Marco contributes 2 patches to improve the usage of  per-cpu system
 work queue. Gustavo contributes a patch to fix the not-at-end
 flexible-array member warning by gcc14. And Qianfeng contributes a code
 cleanup patch to remove redundant __GFP_NOWARN."

Link: https://lore.kernel.org/linux-block/20251113053630.54218-1-colyli@fnnas.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache-updates-6.19:
  bcache: Avoid -Wflex-array-member-not-at-end warning
  bcache: WQ_PERCPU added to alloc_workqueue users
  bcache: replace use of system_wq with system_percpu_wq
  bcache: remove redundant __GFP_NOWARN
  bcache: reduce gc latency by processing less nodes and sleep less time
  bcache: remove discard sysfs interface document
  bcache: drop discard sysfs interface
  bcache: remove discard code from alloc.c
  bcache: get rid of discard code from journal

3 weeks ago  bcache: Avoid -Wflex-array-member-not-at-end warning
Gustavo A. R. Silva [Thu, 13 Nov 2025 05:36:30 +0000 (13:36 +0800)] 
bcache: Avoid -Wflex-array-member-not-at-end warning

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Use the new TRAILING_OVERLAP() helper to fix the following warning:

drivers/md/bcache/bset.h:330:27: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

This helper creates a union between a flexible-array member (FAM) and a
set of MEMBERS that would otherwise follow it.

This overlays the trailing MEMBER struct btree_iter_set stack_data[MAX_BSETS];
onto the FAM struct btree_iter::data[], while keeping the FAM and the start
of MEMBER aligned.

The static_assert() ensures this alignment remains, and it's
intentionally placed immediately after the corresponding structures,
with no blank line in between.
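
A hedged sketch of the resulting layout (simplified; the real bcache
structures have additional members):

```c
struct btree_iter_stack {
	TRAILING_OVERLAP(struct btree_iter, iter, data,
		struct btree_iter_set stack_data[MAX_BSETS];
	);
};

/* the FAM and the overlaid members must start at the same offset */
static_assert(offsetof(struct btree_iter_stack, iter.data) ==
	      offsetof(struct btree_iter_stack, stack_data));
```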

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  bcache: WQ_PERCPU added to alloc_workqueue users
Marco Crivellari [Thu, 13 Nov 2025 05:36:29 +0000 (13:36 +0800)] 
bcache: WQ_PERCPU added to alloc_workqueue users

Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch continues the effort to refactor workqueue APIs, which began
with the changes introducing new workqueues and a new alloc_workqueue flag:

commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
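
As an example of what such a conversion looks like (the workqueue name and
other flags are assumptions, not a specific bcache call site):

```c
/* the queue stays per-cpu, but now says so explicitly */
wq = alloc_workqueue("bcache_example", WQ_MEM_RECLAIM | WQ_PERCPU, 0);
```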

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  bcache: replace use of system_wq with system_percpu_wq
Marco Crivellari [Thu, 13 Nov 2025 05:36:28 +0000 (13:36 +0800)] 
bcache: replace use of system_wq with system_percpu_wq

Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

This patch continues the effort to refactor workqueue APIs, which began
with the changes introducing new workqueues and a new alloc_workqueue flag:

commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

system_wq should be the per-cpu workqueue, yet in this name nothing makes
that clear, so replace system_wq with system_percpu_wq.

The old wq (system_wq) will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  bcache: remove redundant __GFP_NOWARN
Qianfeng Rong [Thu, 13 Nov 2025 05:36:27 +0000 (13:36 +0800)] 
bcache: remove redundant __GFP_NOWARN

GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant
__GFP_NOWARN.
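
In other words (an illustrative call, not the specific bcache site):

```c
/* before */
p = kzalloc(size, GFP_NOWAIT | __GFP_NOWARN);
/* after: GFP_NOWAIT already implies __GFP_NOWARN */
p = kzalloc(size, GFP_NOWAIT);
```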

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Acked-by: Coly Li <colyli@fnnas.com>
Acked-by: Coly Li <colyli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  bcache: reduce gc latency by processing less nodes and sleep less time
Coly Li [Thu, 13 Nov 2025 05:36:26 +0000 (13:36 +0800)] 
bcache: reduce gc latency by processing less nodes and sleep less time

When the bcache device is busy with high I/O loads, there are two methods
to reduce the garbage collection latency,
- Process fewer nodes in each loop of incremental garbage collection in
  btree_gc_recurse().
- Sleep less time between two full garbage collections in
  bch_btree_gc().

This patch introduces two helper routines to provide different garbage
collection node counts and sleep interval times.
- btree_gc_min_nodes()
  If there is no front end I/O, return 128 nodes to process in each
  incremental loop, otherwise only 10 nodes are returned. Then front end
  I/O is able to access the btree earlier.
- btree_gc_sleep_ms()
  If there is no synchronized wait for bucket allocation, sleep 100 ms
  between two incremental GC loops. Otherwise only sleep 10 ms before the
  next incremental GC loop. Then a faster GC may provide available buckets
  earlier, to avoid most of the bcache working threads being starved by
  bucket allocation.

The idea is inspired by works from Mingzhe Zou and Robert Pang, but is
much simpler and its expected behavior is more predictable.
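
A hedged sketch of the two helpers; the 128/10 node counts and 100 ms/10 ms
sleeps come from the description above, while the busy/wait predicates are
assumed names:

```c
static unsigned int btree_gc_min_nodes(struct cache_set *c)
{
	/* process fewer nodes per incremental loop while front end I/O runs */
	return front_end_io_active(c) ? 10 : 128;
}

static unsigned int btree_gc_sleep_ms(struct cache_set *c)
{
	/* sleep less between loops when a bucket allocation is waiting */
	return bucket_alloc_waiting(c) ? 10 : 100;
}
```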

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Robert Pang <robertpang@google.com>
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  bcache: remove discard sysfs interface document
Coly Li [Thu, 13 Nov 2025 05:36:25 +0000 (13:36 +0800)] 
bcache: remove discard sysfs interface document

This patch removes the documentation of the bcache discard sysfs
interface. It drops the discard related sections from:
- Documentation/ABI/testing/sysfs-block-bcache
- Documentation/admin-guide/bcache.rst

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  bcache: drop discard sysfs interface
Coly Li [Thu, 13 Nov 2025 05:36:24 +0000 (13:36 +0800)] 
bcache: drop discard sysfs interface

Since the discard code is removed, the sysfs interface to enable discard
is now useless. This patch removes the corresponding sysfs entry, and
removes the bool variable 'discard' from struct cache as well.

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  bcache: remove discard code from alloc.c
Coly Li [Thu, 13 Nov 2025 05:36:23 +0000 (13:36 +0800)] 
bcache: remove discard code from alloc.c

The bcache allocator initially has no free space to allocate. It first
does a garbage collection, which is triggered by a cache device write,
and fills free space into the ca->free[] lists. The discard happens after
a free bucket handled by garbage collection is added into one of the
ca->free[] lists. But normally this bucket will be allocated out very
soon to a requester and filled with data. The discard hint on this
bucket's LBA range doesn't help the SSD controller improve internal
erasure performance, and wastes extra CPU cycles issuing discard bios.

This patch removes the almost-useless discard code from alloc.c.

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  bcache: get rid of discard code from journal
Coly Li [Thu, 13 Nov 2025 05:36:22 +0000 (13:36 +0800)] 
bcache: get rid of discard code from journal

The bcache journal has discard functionality, but it is almost useless
in reality, because the discard happens after a journal bucket is
reclaimed and the reclaimed bucket is allocated for new journaling
immediately. There is no time for the underlying SSD to use the discard
hint for internal data management.

The discard code in the bcache journal doesn't bring any performance
optimization and wastes CPU cycles issuing discard bios. Therefore
this patch gets rid of it from journal.c and journal.h.

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  dm: fix zone reset all operation processing
Damien Le Moal [Thu, 13 Nov 2025 13:40:28 +0000 (22:40 +0900)] 
dm: fix zone reset all operation processing

dm_zone_get_reset_bitmap() is used to generate a bitmap of the zones of
a zoned device target when a REQ_OP_ZONE_RESET_ALL request is being
processed. This bitmap is built by executing a zone report with a report
callback set to the function dm_zone_need_reset_cb() in struct
dm_report_zones_args. However, the cb callback pointer is no longer
the same as the callback specified by callers of the
blkdev_report_zones() function. Rather, this is a DM internal callback,
and report zones callback functions from blkdev_report_zones() are now
passed using struct blk_report_zones_args, introduced with commit
db9aed869f34 ("block: introduce disk_report_zone()").

This commit changed the DM main report zones callback handler function
dm_report_zones_cb() to call the new disk_report_zone() so that callback
functions from blkdev_report_zones() are executed. This change resulted
in the DM internal dm_zone_need_reset_cb() callback function no longer
being executed, turning any REQ_OP_ZONE_RESET_ALL request into a no-op.

Fix this by calling the DM internal cb function specified in struct
dm_report_zones_args from dm_report_zones_cb().

Fixes: db9aed869f34 ("block: introduce disk_report_zone()")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  block: fix NULL pointer dereference in disk_report_zones()
Damien Le Moal [Thu, 13 Nov 2025 13:40:27 +0000 (22:40 +0900)] 
block: fix NULL pointer dereference in disk_report_zones()

Commit 2284eec5053d ("block: introduce blkdev_get_zone_info()")
introduced the report_active field in struct blk_report_zones_args so
that open and closed zones can be reported with the condition
BLK_ZONE_COND_ACTIVE in the case of a cached report zone.
However, the args pointer to a struct blk_report_zones_args that is
passed to disk_report_zones() can be NULL, e.g. in the case of internal
report zones operations for device mapper zoned targets.

Fix disk_report_zones() to check that args is not NULL before updating
a zone condition for cached zone reports.
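
A minimal sketch of the added check; the surrounding logic and the
open/closed test are simplified assumptions:

```c
/* only rewrite the condition for cached reports when args is provided */
if (args && args->report_active &&
    (zone->cond == BLK_ZONE_COND_IMP_OPEN ||
     zone->cond == BLK_ZONE_COND_EXP_OPEN ||
     zone->cond == BLK_ZONE_COND_CLOSED))
	zone->cond = BLK_ZONE_COND_ACTIVE;
```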

Fixes: 2284eec5053d ("block: introduce blkdev_get_zone_info()")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  block: fix NULL pointer dereference in blk_zone_reset_all_bio_endio()
Damien Le Moal [Thu, 13 Nov 2025 13:40:26 +0000 (22:40 +0900)] 
block: fix NULL pointer dereference in blk_zone_reset_all_bio_endio()

For zoned block devices that do not need zone write plugs (e.g. most
device mapper devices that support zones), the disk hash table of zone
write plugs is NULL. For such devices, blk_zone_reset_all_bio_endio()
should not attempt to scan this hash table as that causes a NULL pointer
dereference.

Fix this by checking that the disk does have zone write plugs using the
atomic counter. This is equivalent to checking for a non-NULL hash table
but has the advantage of also speeding up the execution of
blk_zone_reset_all_bio_endio() for devices that do use zone write plugs
but do not have any plug in the hash table (e.g. a disk with only full
zones).
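
Sketched, the early return described above looks something like this (the
counter name is taken from the description, the rest is assumed):

```c
/* nothing to do if the disk has no zone write plugs at all */
if (!atomic_read(&disk->nr_zone_wplugs))
	return;

/* ... otherwise walk the zone write plug hash table as before ... */
```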

Fixes: efae226c2ef1 ("block: handle zone management operations completions")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller
Bart Van Assche [Tue, 11 Nov 2025 23:29:02 +0000 (15:29 -0800)] 
blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller

Move the following code into the only caller of disk_zone_wplug_add_bio():
 - The code for clearing the REQ_NOWAIT flag.
 - The code that sets the BLK_ZONE_WPLUG_PLUGGED flag.
 - The disk_zone_wplug_schedule_bio_work() call.

This patch moves all code that is related to REQ_NOWAIT or to bio
scheduling into a single function. Additionally, the 'schedule_bio_work'
variable is removed. No functionality has been changed.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking
Bart Van Assche [Tue, 11 Nov 2025 23:29:01 +0000 (15:29 -0800)] 
blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking

Document that all callers hold this lock because the code in
disk_zone_wplug_schedule_bio_work() depends on this.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  blk-zoned: Fix a typo in a source code comment
Bart Van Assche [Tue, 11 Nov 2025 23:29:00 +0000 (15:29 -0800)] 
blk-zoned: Fix a typo in a source code comment

Remove a superfluous parenthesis that was introduced by commit fa8555630b32
("blk-zoned: Improve the queue reference count strategy documentation").

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  null_blk: fix zone read length beyond write pointer
Keith Busch [Wed, 12 Nov 2025 16:42:18 +0000 (08:42 -0800)] 
null_blk: fix zone read length beyond write pointer

Fix up the divisor calculating the number of zone sectors being read and
handle a read that straddles the zone write pointer. The length is
rounded up to a sector boundary, so be sure to truncate any excess bytes
off to avoid copying past the data segment.

Fixes: 3451cf34f51bb70 ("null_blk: allow byte aligned memory offsets")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  blk-mq-dma: fix kernel-doc function name for integrity DMA iterator
Kriish Sharma [Tue, 11 Nov 2025 11:58:10 +0000 (11:58 +0000)] 
blk-mq-dma: fix kernel-doc function name for integrity DMA iterator

Documentation build reported:

  Warning: block/blk-mq-dma.c:373 expecting prototype for blk_rq_integrity_dma_map_iter_start(). Prototype was for blk_rq_integrity_dma_map_iter_next() instead

The kernel-doc comment above `blk_rq_integrity_dma_map_iter_next()` used
the wrong function name (`blk_rq_integrity_dma_map_iter_start`) in its
header. This patch corrects the function name in the kernel-doc block to
match the actual implementation, ensuring clean documentation builds.

Fixes: fec9b16dc555 ("blk-mq-dma: add scatter-less integrity data DMA mapping")
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  block: fix merging data-less bios
Keith Busch [Tue, 11 Nov 2025 14:06:20 +0000 (06:06 -0800)] 
block: fix merging data-less bios

The data segment gaps the block layer tracks don't apply to bios that
don't have data. Skip calculating them to fix a NULL pointer access.

Fixes: 2f6b2565d43cdb5 ("block: accumulate memory segment gaps per bio")
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  ublk: return unsigned from ublk_{,un}map_io()
Caleb Sander Mateos [Sat, 8 Nov 2025 04:23:16 +0000 (21:23 -0700)] 
ublk: return unsigned from ublk_{,un}map_io()

ublk_map_io() and ublk_unmap_io() never return negative values, and
their return values are stored in variables of type unsigned. Clarify
that they can't fail by making their return types unsigned.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  ublk: remove unnecessary checks in ublk_check_and_get_req()
Caleb Sander Mateos [Sat, 8 Nov 2025 22:17:45 +0000 (15:17 -0700)] 
ublk: remove unnecessary checks in ublk_check_and_get_req()

ub = iocb->ki_filp->private_data cannot be NULL, as it's set in
ublk_ch_open() before it returns successfully. req->mq_hctx cannot be
NULL as any inflight ublk request must belong to some queue. And
req->mq_hctx->driver_data cannot be NULL as it's set to the ublk_queue
pointer in ublk_init_hctx(). So drop the unnecessary checks.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  block: add lockdep to queue_limits_commit_update()
Chaitanya Kulkarni [Sun, 9 Nov 2025 07:44:26 +0000 (23:44 -0800)] 
block: add lockdep to queue_limits_commit_update()

queue_limits_commit_update() expects q->limits_lock to be held by
the caller (via queue_limits_start_update()).

The API pattern is:

  lim = queue_limits_start_update(q);  /* acquires lock */
              /* modify lim */
  queue_limits_commit_update(q, &lim); /* releases lock */

  OR

  lim = queue_limits_start_update(q);  /* acquires lock */
              /* modify lim */
  queue_limits_commit_update_frozen(q, &lim); /* releases lock */

Add lockdep_assert_held() to report incorrect API usage.
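
The assertion itself is small; a sketch of where it lands (the function body
is elided):

```c
int queue_limits_commit_update(struct request_queue *q,
			       struct queue_limits *lim)
{
	/* the caller must have done queue_limits_start_update() */
	lockdep_assert_held(&q->limits_lock);

	/* ... existing logic: validate, apply, release ->limits_lock ... */
	return 0;
}
```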

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  nbd: defer config unlock in nbd_genl_connect
Zheng Qixing [Mon, 10 Nov 2025 12:49:20 +0000 (20:49 +0800)] 
nbd: defer config unlock in nbd_genl_connect

There is one use-after-free warning when running NBD_CMD_CONNECT and
NBD_CLEAR_SOCK:

nbd_genl_connect
  nbd_alloc_and_init_config // config_refs=1
  nbd_start_device // config_refs=2
  set NBD_RT_HAS_CONFIG_REF
open nbd // config_refs=3
  recv_work done // config_refs=2
NBD_CLEAR_SOCK // config_refs=1
close nbd // config_refs=0
  refcount_inc -> uaf

------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 24 PID: 1014 at lib/refcount.c:25 refcount_warn_saturate+0x12e/0x290
 nbd_genl_connect+0x16d0/0x1ab0
 genl_family_rcv_msg_doit+0x1f3/0x310
 genl_rcv_msg+0x44a/0x790

The issue can be easily reproduced by adding a small delay before
refcount_inc(&nbd->config_refs) in nbd_genl_connect():

        mutex_unlock(&nbd->config_lock);
        if (!ret) {
                set_bit(NBD_RT_HAS_CONFIG_REF, &config->runtime_flags);
+               printk("before sleep\n");
+               mdelay(5 * 1000);
+               printk("after sleep\n");
                refcount_inc(&nbd->config_refs);
                nbd_connect_reply(info, nbd->index);
        }

Fixes: e46c7287b1c2 ("nbd: add a basic netlink interface")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks ago  Merge tag 'md-6.19-20251111' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid...
Jens Axboe [Tue, 11 Nov 2025 13:58:11 +0000 (06:58 -0700)] 
Merge tag 'md-6.19-20251111' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.19/block

Pull MD changes from Yu:

"- Change maintainer's email address (Yu Kuai)
 - Data can be lost if array is created with different lbs devices, fix
   this problem and record lbs of the array in metadata (Li Nan)
 - Fix rcu protection for md_thread (Yun Zhou)
 - Fix mddev kobject lifetime regression (Xiao Ni)
 - Enable atomic writes for md-linear (John Garry)
 - Some cleanups (Chen Ni, Huiwen He, Wu Guanghao)"

* tag 'md-6.19-20251111' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux:
  md: allow configuring logical block size
  md: add check_new_feature module parameter
  md/raid0: Move queue limit setup before r0conf initialization
  md: init bioset in mddev_init
  md: delete md_redundancy_group when array is becoming inactive
  md: prevent adding disks with larger logical_block_size to active arrays
  md/raid5: remove redundant __GFP_NOWARN
  md: avoid repeated calls to del_gendisk
  md/md-llbitmap: Remove unneeded semicolon
  md/md-linear: Enable atomic writes
  Factor out code into md_should_do_recovery()
  md: fix rcu protection in md_wakeup_thread
  md: delete mddev kobj before deleting gendisk kobj
  MAINTAINERS: Update Yu Kuai's E-mail address

3 weeks ago  md: allow configuring logical block size
Li Nan [Mon, 3 Nov 2025 12:57:57 +0000 (20:57 +0800)] 
md: allow configuring logical block size

Previously, raid array used the maximum logical block size (LBS)
of all member disks. Adding a larger LBS disk at runtime could
unexpectedly increase RAID's LBS, risking corruption of existing
partitions. This can be reproduced by:

```
  # LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
  mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean

  # LBS is 512
  cat /sys/block/md0/queue/logical_block_size

  # create partition md0p1
  parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
  lsblk | grep md0p1

  # LBS becomes 4096 after adding sdf
  mdadm --add -q /dev/md0 /dev/sdf
  cat /sys/block/md0/queue/logical_block_size

  # partition lost
  partprobe /dev/md0
  lsblk | grep md0p1
```

Simply restricting larger-LBS disks is inflexible. In some scenarios,
only disks with a 512-byte LBS are currently available, but later, disks
with a 4KB LBS may be added to the array.

Making LBS configurable is the best way to solve this scenario.
After this patch, the raid will:
  - store LBS in disk metadata
  - add a read-write sysfs 'mdX/logical_block_size'

Future mdadm should support setting LBS via metadata field during RAID
creation and the new sysfs. Though the kernel allows runtime LBS changes,
users should avoid modifying it after creating partitions or filesystems
to prevent compatibility issues.

Only 1.x metadata supports configurable LBS. 0.90 metadata inits all
fields to default values at auto-detect. Supporting 0.90 would require
more extensive changes and no such use case has been observed.

Note that many RAID paths rely on PAGE_SIZE alignment, including for
metadata I/O. An LBS larger than PAGE_SIZE will result in metadata
read/write failures, so this configuration should be prevented.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
3 weeks ago  md: add check_new_feature module parameter
Li Nan [Mon, 3 Nov 2025 12:57:56 +0000 (20:57 +0800)] 
md: add check_new_feature module parameter

The RAID code checks that pad3 is zero when loading a superblock from
disk. Arrays created with new features may fail to assemble on old
kernels, as pad3 is then used.

Add module parameter check_new_feature to bypass this check.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-5-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
3 weeks ago  md/raid0: Move queue limit setup before r0conf initialization
Li Nan [Mon, 3 Nov 2025 12:57:55 +0000 (20:57 +0800)] 
md/raid0: Move queue limit setup before r0conf initialization

Prepare for making logical blocksize configurable. This change has no
impact until logical block size becomes configurable.

Move raid0_set_limits() before create_strip_zones(). It is safe as fields
modified in create_strip_zones() do not involve mddev configuration, and
rdev modifications there are not used in raid0_set_limits().

'blksize' in create_strip_zones() fetches mddev's logical block size,
which is already the maximum across all rdevs, so the later max() can be
removed.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-4-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
3 weeks ago  md: init bioset in mddev_init
Li Nan [Mon, 3 Nov 2025 12:57:54 +0000 (20:57 +0800)] 
md: init bioset in mddev_init

IO operations may be needed before md_run(), such as updating metadata
after writing sysfs. Without bioset, this triggers a NULL pointer
dereference as below:

 BUG: kernel NULL pointer dereference, address: 0000000000000020
 Call Trace:
  md_update_sb+0x658/0xe00
  new_level_store+0xc5/0x120
  md_attr_store+0xc9/0x1e0
  sysfs_kf_write+0x6f/0xa0
  kernfs_fop_write_iter+0x141/0x2a0
  vfs_write+0x1fc/0x5a0
  ksys_write+0x79/0x180
  __x64_sys_write+0x1d/0x30
  x64_sys_call+0x2818/0x2880
  do_syscall_64+0xa9/0x580
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Reproducer
```
  mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd]
  echo inactive > /sys/block/md0/md/array_state
  echo 10 > /sys/block/md0/md/new_level
```

mddev_init() can only be called once per mddev, so there is no need to
test whether the bioset has been initialized anymore.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-3-linan666@huaweicloud.com
Fixes: d981ed841930 ("md: Add new_level sysfs interface")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
3 weeks ago  md: delete md_redundancy_group when array is becoming inactive
Li Nan [Mon, 3 Nov 2025 12:57:53 +0000 (20:57 +0800)] 
md: delete md_redundancy_group when array is becoming inactive

'md_redundancy_group' is created in md_run() and deleted in del_gendisk(),
but these calls are not paired. Writing inactive/active to sysfs array_state can
trigger md_run() multiple times without del_gendisk(), leading to
duplicate creation as below:

 sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/sync_action'
 Call Trace:
  dump_stack_lvl+0x9f/0x120
  dump_stack+0x14/0x20
  sysfs_warn_dup+0x96/0xc0
  sysfs_add_file_mode_ns+0x19c/0x1b0
  internal_create_group+0x213/0x830
  sysfs_create_group+0x17/0x20
  md_run+0x856/0xe60
  ? __x64_sys_openat+0x23/0x30
  do_md_run+0x26/0x1d0
  array_state_store+0x559/0x760
  md_attr_store+0xc9/0x1e0
  sysfs_kf_write+0x6f/0xa0
  kernfs_fop_write_iter+0x141/0x2a0
  vfs_write+0x1fc/0x5a0
  ksys_write+0x79/0x180
  __x64_sys_write+0x1d/0x30
  x64_sys_call+0x2818/0x2880
  do_syscall_64+0xa9/0x580
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 md: cannot register extra attributes for md0

Its creation depends on 'pers', so its lifecycle cannot be aligned with
the gendisk. Fix this issue by triggering 'md_redundancy_group' deletion
when the array is becoming inactive.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-2-linan666@huaweicloud.com
Fixes: 790abe4d77af ("md: remove/add redundancy group only in level change")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
3 weeks ago  md: prevent adding disks with larger logical_block_size to active arrays
Li Nan [Thu, 18 Sep 2025 11:57:58 +0000 (19:57 +0800)] 
md: prevent adding disks with larger logical_block_size to active arrays

When adding a disk to an md array, avoid updating the array's
logical_block_size to match the new disk. This prevents accidental
partition table loss that renders the array unusable.

A later patch will introduce a way to configure the array's
logical_block_size.

The issue was introduced before Linux 2.6.12-rc2.

Link: https://lore.kernel.org/linux-raid/20250918115759.334067-2-linan666@huaweicloud.com/
Fixes: d2e45eace8 ("[PATCH] Fix raid "bio too big" failures")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
4 weeks ago  block: clean up indentation in blk_rq_map_iter_init()
Caleb Sander Mateos [Sat, 8 Nov 2025 04:06:13 +0000 (21:06 -0700)] 
block: clean up indentation in blk_rq_map_iter_init()

blk_rq_map_iter_init() has one line with 7 spaces of indentation and
another that mixes 1 tab and 8 spaces. Convert both to tabs.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks ago  nbd: defer config put in recv_work
Zheng Qixing [Sat, 8 Nov 2025 07:02:02 +0000 (15:02 +0800)] 
nbd: defer config put in recv_work

There is one uaf issue in recv_work when running NBD_CLEAR_SOCK and
NBD_CMD_RECONFIGURE:
  nbd_genl_connect     // conf_ref=2 (connect and recv_work A)
  nbd_open        // conf_ref=3
  recv_work A done     // conf_ref=2
  NBD_CLEAR_SOCK       // conf_ref=1
  nbd_genl_reconfigure // conf_ref=2 (trigger recv_work B)
  close nbd        // conf_ref=1
  recv_work B
    config_put         // conf_ref=0
    atomic_dec(&config->recv_threads); -> UAF

Or only running NBD_CLEAR_SOCK:
  nbd_genl_connect   // conf_ref=2
  nbd_open       // conf_ref=3
  NBD_CLEAR_SOCK     // conf_ref=2
  close nbd
    nbd_release
      config_put     // conf_ref=1
  recv_work
    config_put       // conf_ref=0
    atomic_dec(&config->recv_threads); -> UAF

Commit 87aac3a80af5 ("nbd: call nbd_config_put() before notifying the
waiter") moved nbd_config_put() to run before waking up the waiter in
recv_work, in order to ensure that nbd_start_device_ioctl() would not
be woken up while nbd->task_recv was still uncleared.

However, in nbd_start_device_ioctl(), after being woken up it explicitly
calls flush_workqueue() to make sure all current works are finished.
Therefore, there is no need to move the config put ahead of the wakeup.

Move nbd_config_put() to the end of recv_work, so that the reference is
held for the whole lifetime of the worker thread. This makes sure the
config cannot be freed while recv_work is still running, even if clear
+ reconfigure interleave.
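
A simplified sketch of the new ordering at the end of recv_work(), based on
the description above (the receive loop itself is elided):

```c
static void recv_work(struct work_struct *work)
{
	struct recv_thread_args *args = container_of(work,
					struct recv_thread_args, work);
	struct nbd_device *nbd = args->nbd;
	struct nbd_config *config = nbd->config;

	/* ... receive and complete replies ... */

	atomic_dec(&config->recv_threads);
	wake_up(&config->recv_wq);

	/* drop the config reference last, after every access to it */
	nbd_config_put(nbd);
	kfree(args);
}
```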

In addition, we don't need to worry about recv_work dropping the last
nbd_put (which causes deadlock):

path A (netlink with NBD_CFLAG_DESTROY_ON_DISCONNECT):
  connect  // nbd_refs=1 (trigger recv_work)
  open nbd // nbd_refs=2
  NBD_CLEAR_SOCK
  close nbd
    nbd_release
      nbd_disconnect_and_put
        flush_workqueue // recv_work done
      nbd_config_put
        nbd_put // nbd_refs=1
      nbd_put // nbd_refs=0
        queue_work

path B (netlink without NBD_CFLAG_DESTROY_ON_DISCONNECT):
  connect  // nbd_refs=2 (trigger recv_work)
  open nbd // nbd_refs=3
  NBD_CLEAR_SOCK // conf_refs=2
  close nbd
    nbd_release
      nbd_config_put // conf_refs=1
      nbd_put // nbd_refs=2
  recv_work done // conf_refs=0, nbd_refs=1
  rmmod // nbd_refs=0

Reported-by: syzbot+56fbf4c7ddf65e95c7cc@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6907edce.a70a0220.37351b.0014.GAE@google.com/T/
Fixes: 87aac3a80af5 ("nbd: make the config put is called before the notifying the waiter")
Depends-on: e2daec488c57 ("nbd: Fix hungtask when nbd_config_put")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks ago  md/raid5: remove redundant __GFP_NOWARN
Huiwen He [Sun, 2 Nov 2025 15:25:40 +0000 (23:25 +0800)] 
md/raid5: remove redundant __GFP_NOWARN

The __GFP_NOWARN flag was included in GFP_NOWAIT since commit
16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT"). So
remove the redundant __GFP_NOWARN flag.

Link: https://lore.kernel.org/linux-raid/20251102152540.871568-1-hehuiwen@kylinos.cn
Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
4 weeks ago  md: avoid repeated calls to del_gendisk
Xiao Ni [Wed, 29 Oct 2025 06:34:19 +0000 (14:34 +0800)] 
md: avoid repeated calls to del_gendisk

There is a UAF problem found by test case 23rdev-lifetime:

Oops: general protection fault, probably for non-canonical address 0xdead000000000122
RIP: 0010:bdi_unregister+0x4b/0x170
Call Trace:
 <TASK>
 __del_gendisk+0x356/0x3e0
 mddev_unlock+0x351/0x360
 rdev_attr_store+0x217/0x280
 kernfs_fop_write_iter+0x14a/0x210
 vfs_write+0x29e/0x550
 ksys_write+0x74/0xf0
 do_syscall_64+0xbb/0x380
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff5250a177e

The sequence is:
1. rdev remove path gets reconfig_mutex
2. rdev remove path releases reconfig_mutex in mddev_unlock
3. md stop calls do_md_stop and sets MD_DELETED
4. rdev remove path calls del_gendisk because MD_DELETED is set
5. md stop path releases reconfig_mutex and calls del_gendisk again

So there is a race condition we should resolve. This patch adds a
flag MD_DO_DELETE to avoid the race condition.

Link: https://lore.kernel.org/linux-raid/20251029063419.21700-1-xni@redhat.com
Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Signed-off-by: Xiao Ni <xni@redhat.com>
Suggested-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
4 weeks ago  md/md-llbitmap: Remove unneeded semicolon
Chen Ni [Wed, 10 Sep 2025 09:19:12 +0000 (17:19 +0800)] 
md/md-llbitmap: Remove unneeded semicolon

Remove unnecessary semicolons reported by Coccinelle/coccicheck and the
semantic patch at scripts/coccinelle/misc/semicolon.cocci.

Link: https://lore.kernel.org/linux-raid/20250910091912.25624-1-nichen@iscas.ac.cn
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
4 weeks ago  md/md-linear: Enable atomic writes
John Garry [Wed, 3 Sep 2025 16:10:52 +0000 (16:10 +0000)] 
md/md-linear: Enable atomic writes

All the infrastructure has already been plumbed to support this for
stacked devices, so just enable the request_queue limits features flag.

A note about chunk sectors for linear arrays:
While it is possible to set a chunk sectors param for building a linear
array, this is for specifying the granularity at which data sectors from
the device are used. It is not the same as a stripe size, like for RAID0.

As such, it is not appropriate to set chunk_sectors request queue limit to
the same value, as chunk_sectors request limit is a boundary for which
requests cannot straddle.

However, request_queue limit max_hw_sectors is set to chunk sectors, which
almost has the same effect as setting chunk_sectors limit.

Link: https://lore.kernel.org/linux-raid/20250903161052.3326176-1-john.g.garry@oracle.com
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
4 weeks ago  Factor out code into md_should_do_recovery()
Wu Guanghao [Mon, 8 Sep 2025 08:20:37 +0000 (16:20 +0800)] 
Factor out code into md_should_do_recovery()

In md_check_recovery(), use a new helper to make the code cleaner.

Link: https://lore.kernel.org/linux-raid/e62894c8-d916-94bc-ef48-3c60e6e1fc5d@huawei.com
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
4 weeks ago  md: fix rcu protection in md_wakeup_thread
Yun Zhou [Wed, 15 Oct 2025 08:32:27 +0000 (16:32 +0800)] 
md: fix rcu protection in md_wakeup_thread

We attempted to use RCU to protect the pointer 'thread', but directly
passed the value when calling md_wakeup_thread(). This means that the
RCU pointer has been acquired before rcu_read_lock(), which renders
rcu_read_lock() ineffective and could lead to a use-after-free.
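
The pattern being fixed, sketched in simplified form (the wakeup details
follow md's existing helpers, but the surrounding code is illustrative):

```c
/* Wrong: the pointer is loaded before the read-side section begins,
 * so rcu_read_lock() protects nothing.
 */
md_wakeup_thread(mddev->thread);

/* Right: dereference the RCU-protected pointer inside the critical section */
struct md_thread *t;

rcu_read_lock();
t = rcu_dereference(mddev->thread);
if (t) {
	set_bit(THREAD_WAKEUP, &t->flags);
	wake_up(&t->wqueue);
}
rcu_read_unlock();
```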

Link: https://lore.kernel.org/linux-raid/20251015083227.1079009-1-yun.zhou@windriver.com
Fixes: 446931543982 ("md: protect md_thread with rcu")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
4 weeks ago  md: delete mddev kobj before deleting gendisk kobj
Xiao Ni [Sun, 28 Sep 2025 01:24:24 +0000 (09:24 +0800)] 
md: delete mddev kobj before deleting gendisk kobj

In the sync del_gendisk path, the gendisk is deleted first and the directory
/sys/block/md is removed. Then the mddev kobj is released in a delayed work.
If we enable the debug log in sysfs_remove_group, we can see the debug
message 'sysfs group bitmap not found for kobject md'. The reason is that
the parent kobj has already been deleted, so the parent directory cannot be
found.

In the creation path, the gendisk is allocated first, then the mddev kobj is
added. So the mddev kobj should be deleted before deleting the gendisk.

Before commit 9e59d609763f ("md: call del_gendisk in control path"), the
mddev kobj was released first. If the kobj hadn't been deleted yet, the
release did the cleanup and deleted the kobj. Then del_gendisk was called
and the gendisk kobj released. So there was no need to call kobject_del to
delete the mddev kobj. After this patch, the sequence in the sync
del_gendisk path changes, so kobject_del needs to be called to delete the
mddev kobj.

After this patch, the sequence is:
1. kobject del mddev kobj
2. del_gendisk deletes gendisk kobj
3. mddev_delayed_delete releases mddev kobj
4. md_kobj_release releases gendisk kobj

Link: https://lore.kernel.org/linux-raid/20250928012424.61370-1-xni@redhat.com
Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
4 weeks agoMAINTAINERS: Update Yu Kuai's E-mail address
Yu Kuai [Tue, 21 Oct 2025 12:28:00 +0000 (20:28 +0800)] 
MAINTAINERS: Update Yu Kuai's E-mail address

Change to my new email address on fnnas.com.

Link: https://lore.kernel.org/linux-raid/20251021122800.3158836-1-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
4 weeks agoblock: introduce bdev_zone_start()
Damien Le Moal [Fri, 7 Nov 2025 06:38:44 +0000 (15:38 +0900)] 
block: introduce bdev_zone_start()

Introduce the function bdev_zone_start() as a more explicit (and clear)
replacement for ALIGN_DOWN() to get the start sector of a zone
containing a particular sector of a zoned block device.

Use this new helper in blkdev_get_zone_info() and
blkdev_report_zones_cached().
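
Since zone sizes are a power of two, the helper is essentially the
ALIGN_DOWN() it replaces. A sketch of what such a helper could look like
(the actual definition may differ):

    static inline sector_t bdev_zone_start(struct block_device *bdev,
                                           sector_t sector)
    {
        /* Align down to the zone size to get the start sector of the
         * zone containing 'sector'.
         */
        return ALIGN_DOWN(sector, bdev_zone_sectors(bdev));
    }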

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: refactor disk_zone_wplug_sync_wp_offset()
Damien Le Moal [Fri, 7 Nov 2025 06:38:43 +0000 (15:38 +0900)] 
block: refactor disk_zone_wplug_sync_wp_offset()

The helper function blk_zone_wp_offset() is called from
disk_zone_wplug_sync_wp_offset(), and again called from
blk_revalidate_seq_zone() right after the call to
disk_zone_wplug_sync_wp_offset().

Change disk_zone_wplug_sync_wp_offset() to return the value obtained with
blk_zone_wp_offset() to avoid this double call, which simplifies
blk_revalidate_seq_zone() a little.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: improve blk_zone_wp_offset()
Damien Le Moal [Fri, 7 Nov 2025 06:38:42 +0000 (15:38 +0900)] 
block: improve blk_zone_wp_offset()

blk_zone_wp_offset() is always called with a struct blk_zone obtained
from the device, that is, it will never see the BLK_ZONE_COND_ACTIVE
condition. However, handling this condition makes the function more
robust and will also avoid issues once propagating cached report requests
to underlying stacked devices is implemented. Add BLK_ZONE_COND_ACTIVE
as a new case in the blk_zone_wp_offset() switch.

Also while at it, change the handling of the full condition to return
UINT_MAX for the zone write pointer to reflect the fact that the write
pointer of a full zone is invalid.
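
A sketch of the resulting switch; the cases and return values are
inferred from the description above and the real function may differ in
detail:

    static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
    {
        switch (zone->cond) {
        case BLK_ZONE_COND_EMPTY:
            return 0;
        case BLK_ZONE_COND_IMP_OPEN:
        case BLK_ZONE_COND_EXP_OPEN:
        case BLK_ZONE_COND_CLOSED:
        case BLK_ZONE_COND_ACTIVE: /* new: handled like open/closed */
            return zone->wp - zone->start;
        case BLK_ZONE_COND_FULL:
            /* The write pointer of a full zone is invalid. */
            return UINT_MAX;
        default:
            /* Conventional, offline and read-only zones. */
            return UINT_MAX;
        }
    }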

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: don't return 1 for the fallback case in blkdev_get_zone_info
Christoph Hellwig [Thu, 6 Nov 2025 14:53:32 +0000 (15:53 +0100)] 
block: don't return 1 for the fallback case in blkdev_get_zone_info

blkdev_do_report_zones returns the number of reported zones, but
blkdev_get_zone_info returns 0 or an errno.  Translate to the expected
return value in blkdev_report_zone_fallback.
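
The translation is essentially a one-liner; a sketch (the argument list
and the error code are illustrative):

    ret = blkdev_do_report_zones(bdev, sector, 1, &args);
    if (ret < 0)
        return ret;
    /* blkdev_do_report_zones() returns the number of zones reported;
     * callers of blkdev_get_zone_info() expect 0 on success.
     */
    return ret == 1 ? 0 : -EIO;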

Fixes: b037d41762fd ("block: introduce blkdev_get_zone_info()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agonvme: remove virtual boundary for sgl capable devices
Keith Busch [Tue, 14 Oct 2025 15:04:56 +0000 (08:04 -0700)] 
nvme: remove virtual boundary for sgl capable devices

The nvme virtual boundary is only required for the PRP format. Devices
that can use SGL for DMA don't need it for IO queues. Drop reporting it
for such devices; rdma fabrics controllers will continue to use the
limit as they currently don't report any boundary requirements, but tcp
and fc never needed it in the first place so they get to report no
virtual boundary.

Applications may continue to align to the same virtual boundaries for
optimization purposes if they want, and the driver will continue to
decide whether to use the PRP format the same as before if the IO allows
it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: accumulate memory segment gaps per bio
Keith Busch [Tue, 14 Oct 2025 15:04:55 +0000 (08:04 -0700)] 
block: accumulate memory segment gaps per bio

The blk-mq dma iterator has an optimization for requests that align to
the device's iommu merge boundary. This boundary may be larger than the
device's virtual boundary, but the code had been depending on that queue
limit to know ahead of time if the request is guaranteed to align to
that optimization.

Rather than relying on that queue limit, which many devices may not
report, save the lowest set bit of any boundary gap between segments in
the bio while checking the segments. The request stores this value for
merging and for quickly checking per I/O whether the request can use
IOVA optimizations.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agovirtio_blk: NULL out vqs to avoid double free on failed resume
Cong Zhang [Tue, 21 Oct 2025 11:07:56 +0000 (19:07 +0800)] 
virtio_blk: NULL out vqs to avoid double free on failed resume

vblk->vqs is freed during freeze. If resume fails before vblk->vqs is
allocated again, a later freeze/remove may attempt to free the vqs a
second time. Set vblk->vqs to NULL after freeing to avoid the double
free.
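
The fix is the usual free-then-NULL pattern; a sketch of the freeze path:

    /* After freeing the virtqueues in the freeze/teardown path: */
    kfree(vblk->vqs);
    vblk->vqs = NULL;  /* a later kfree() of a NULL pointer is a no-op */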

Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agonull_blk: allow byte aligned memory offsets
Keith Busch [Thu, 6 Nov 2025 01:54:47 +0000 (17:54 -0800)] 
null_blk: allow byte aligned memory offsets

Allowing byte aligned memory provides a nice testing ground for
direct-io.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agonull_blk: single kmap per bio segment
Keith Busch [Thu, 6 Nov 2025 01:54:46 +0000 (17:54 -0800)] 
null_blk: single kmap per bio segment

Rather than kmap the request bio segment for each sector, do the mapping
just once per segment.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agonull_blk: consistently use blk_status_t
Keith Busch [Thu, 6 Nov 2025 01:54:45 +0000 (17:54 -0800)] 
null_blk: consistently use blk_status_t

No need to mix errno and blk_status_t error types. Just use the standard
block layer type.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agonull_blk: simplify copy_from_nullb
Keith Busch [Thu, 6 Nov 2025 01:54:44 +0000 (17:54 -0800)] 
null_blk: simplify copy_from_nullb

It always returns success, so the code that saves the error status but
then proceeds without checking it looks a bit odd. Clean this up.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoublk: use rq_for_each_segment() for user copy
Caleb Sander Mateos [Thu, 6 Nov 2025 17:16:47 +0000 (10:16 -0700)] 
ublk: use rq_for_each_segment() for user copy

ublk_advance_io_iter() and ublk_copy_io_pages() currently open-code the
iteration over the request's bvecs. Switch to the rq_for_each_segment()
macro provided by blk-mq to avoid reaching into the bio internals and
simplify the code.
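
A sketch of what iterating a request's data with the macro looks like
(the copy itself is elided):

    struct req_iterator iter;
    struct bio_vec bv;

    rq_for_each_segment(bv, rq, iter) {
        /* bv.bv_page, bv.bv_offset and bv.bv_len describe one segment;
         * copy to or from the user buffer here.
         */
    }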

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoublk: use copy_{to,from}_iter() for user copy
Caleb Sander Mateos [Thu, 6 Nov 2025 17:16:46 +0000 (10:16 -0700)] 
ublk: use copy_{to,from}_iter() for user copy

ublk_copy_user_pages()/ublk_copy_io_pages() currently use
iov_iter_get_pages2() to extract the pages from the iov_iter and memcpy()
between the bvec_iter and the iov_iter's pages one page at a time. Switch
to using copy_to_iter()/copy_from_iter() instead. This avoids the user
page reference count increments and decrements as well as the need to
split the memcpy() at user page boundaries. It also simplifies the code
considerably.

Ming reports a 40% throughput improvement when issuing I/O to the
selftests null ublk server with zero-copy disabled.
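
A minimal sketch of such a copy loop, assuming the iov_iter is already
positioned and each bvec fits within a single page (the real code handles
more than this):

    static size_t ublk_copy_sketch(struct request *rq,
                                   struct iov_iter *uiter, bool to_user)
    {
        struct req_iterator iter;
        struct bio_vec bv;
        size_t done = 0;

        rq_for_each_segment(bv, rq, iter) {
            void *kaddr = kmap_local_page(bv.bv_page) + bv.bv_offset;

            if (to_user)
                done += copy_to_iter(kaddr, bv.bv_len, uiter);
            else
                done += copy_from_iter(kaddr, bv.bv_len, uiter);
            kunmap_local(kaddr);
        }
        return done;
    }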

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: fix cached zone reporting after zone append was used
Christoph Hellwig [Wed, 5 Nov 2025 19:52:15 +0000 (14:52 -0500)] 
block: fix cached zone reporting after zone append was used

No zone plugs are allocated when a zone is opened by calling Zone Append
on it.  This makes the cached zone reporting incorrectly report such
zones as empty if the file system is unmounted and report zones is called
after that, e.g. by xfstests test cases using the scratch device.

Fix this by recording if zone append was used on a device, and disable
cached reporting for the device until a ZONE_RESET_ALL happens that
guarantees all zones are empty.

We could probably do even better using a per-zone flag, but the practical
use cases for cached zone reporting after the initial mount are rather
limited, so let's keep things simple for now.

Fixes: 31f0656a4ab7 ("block: introduce blkdev_report_zones_cached()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: don't leak disk->zones_cond for !disk_need_zone_resources
Christoph Hellwig [Wed, 5 Nov 2025 19:52:14 +0000 (14:52 -0500)] 
block: don't leak disk->zones_cond for !disk_need_zone_resources

disk->zones_cond is allocated for all zoned devices, but
disk_free_zone_resources skips it when the zone write plug hash is not
allocated, leaking the allocation for non-mq devices that don't emulate
zone append.  This is reported by kmemleak-enabled xfstests for various
tests that use simple device mapper targets.

Fix this by moving all code that requires write plugs from
disk_free_zone_resources into disk_destroy_zone_wplugs_hash_table and by
executing the rest of the code, including the freeing of
disk->zones_cond, unconditionally.

Fixes: 6e945ffb6555 ("block: use zone condition to determine conventional zones")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agorust: block: update ARef and AlwaysRefCounted imports from sync::aref
Shankari Anand [Sun, 12 Oct 2025 14:20:12 +0000 (19:50 +0530)] 
rust: block: update ARef and AlwaysRefCounted imports from sync::aref

Update call sites in the block subsystem to import `ARef` and
`AlwaysRefCounted` from `sync::aref` instead of `types`.

This aligns with the ongoing effort to move `ARef` and
`AlwaysRefCounted` to sync.

Suggested-by: Benno Lossin <lossin@kernel.org>
Link: https://github.com/Rust-for-Linux/linux/issues/1173
Signed-off-by: Shankari Anand <shankari.ak0208@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoMerge branch 'cached-zones' into for-6.19/block
Jens Axboe [Wed, 5 Nov 2025 15:07:41 +0000 (08:07 -0700)] 
Merge branch 'cached-zones' into for-6.19/block

This patch series implements cached zone reporting using information from
the block layer zone write plugs and a new zone condition tracking
scheme.
This avoids having to execute slow report zones commands on the device
when for instance mounting file systems, which can significantly speed
things up, especially in setups with multiple SMR HDDs (e.g. a BTRFS
RAID volume).

The first patch improves the handling of zone management commands. Patch
2 fixes zone resource updates and the following 3 patches clean up the
zone code in preparation for introducing cached zone report support.
Patches 6 to 13 implement cached zone reporting and make it available to
users with a new ioctl() command.

Finally, patches 14 and 15 introduce the use of cached report zones in
the mount operation of XFS and BTRFS.

Link: https://lore.kernel.org/linux-block/20251104212249.1075412-1-dlemoal@kernel.org/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* cached-zones:
  xfs: use blkdev_report_zones_cached()
  btrfs: use blkdev_report_zones_cached()
  block: add zone write plug condition to debugfs zone_wplugs
  block: improve zone_wplugs debugfs attribute output
  block: introduce BLKREPORTZONESV2 ioctl
  block: introduce blkdev_report_zones_cached()
  block: introduce blkdev_get_zone_info()
  block: refactor blkdev_report_zones() code
  block: track zone conditions
  block: use zone condition to determine conventional zones
  block: reorganize struct blk_zone_wplug
  block: introduce disk_report_zone()
  block: cleanup blkdev_report_zones()
  block: freeze queue when updating zone resources
  block: handle zone management operations completions

4 weeks agoxfs: use blkdev_report_zones_cached()
Damien Le Moal [Tue, 4 Nov 2025 21:22:49 +0000 (06:22 +0900)] 
xfs: use blkdev_report_zones_cached()

Modify xfs_mount_zones() to replace the call to blkdev_report_zones()
with blkdev_report_zones_cached() to speed up mount operations.
Since this causes xfs_zone_validate_seq() to see zones with the
BLK_ZONE_COND_ACTIVE condition, this function is also modified to accept
this condition as valid.

With this change, mounting a freshly formatted large capacity (30 TB)
SMR HDD completes in under 2s compared to over 4.7s before.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agobtrfs: use blkdev_report_zones_cached()
Damien Le Moal [Tue, 4 Nov 2025 21:22:48 +0000 (06:22 +0900)] 
btrfs: use blkdev_report_zones_cached()

Modify btrfs_get_dev_zones() and btrfs_sb_log_location_bdev() to replace
the call to blkdev_report_zones() with blkdev_report_zones_cached() to
speed up mount operations. btrfs_get_dev_zone_info() is also modified to
take into account the BLK_ZONE_COND_ACTIVE condition, which is
equivalent to either BLK_ZONE_COND_IMP_OPEN, BLK_ZONE_COND_EXP_OPEN or
BLK_ZONE_COND_CLOSED.

With this change, mounting a freshly formatted large capacity (30 TB)
SMR HDD completes in under 100ms compared to over 1.8s before.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: add zone write plug condition to debugfs zone_wplugs
Damien Le Moal [Tue, 4 Nov 2025 21:22:47 +0000 (06:22 +0900)] 
block: add zone write plug condition to debugfs zone_wplugs

Modify queue_zone_wplug_show() to include the condition of a zone write
plug in the zone_wplugs debugfs attribute of a zoned block device.
To improve readability and ease of use, the zone condition name, obtained
with blk_zone_cond_str(), is printed rather than the raw condition value.

Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: improve zone_wplugs debugfs attribute output
Damien Le Moal [Tue, 4 Nov 2025 21:22:46 +0000 (06:22 +0900)] 
block: improve zone_wplugs debugfs attribute output

Make the output of the zone_wplugs debugfs attribute file easier to read
by adding the names of the zone write plug fields to the output.

No functional changes.

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: introduce BLKREPORTZONESV2 ioctl
Damien Le Moal [Tue, 4 Nov 2025 21:22:45 +0000 (06:22 +0900)] 
block: introduce BLKREPORTZONESV2 ioctl

Introduce the new BLKREPORTZONESV2 ioctl command to allow user
applications access to the fast zone report implemented by
blkdev_report_zones_cached(). This new ioctl is defined as number 142
and is documented in include/uapi/linux/fs.h.

Unlike the existing BLKREPORTZONES ioctl, this new ioctl also uses the
flags field of struct blk_zone_report as an input. If the user sets the
BLK_ZONE_REP_CACHED flag on input, then blkdev_report_zones_cached() is
used to generate the zone report from cached zone information. If this
flag is not set, then BLKREPORTZONESV2 behaves in the same manner as
BLKREPORTZONES and the zone report is generated by accessing the zoned
device.
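
From user space, usage could look like the following sketch, assuming the
new ioctl number and the BLK_ZONE_REP_CACHED flag are available in the
installed UAPI headers (fd and nr_zones are given):

    #include <linux/blkzoned.h>
    #include <sys/ioctl.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct blk_zone_report *rep;

    rep = calloc(1, sizeof(*rep) + nr_zones * sizeof(struct blk_zone));
    rep->sector = 0;                  /* start reporting from sector 0 */
    rep->nr_zones = nr_zones;         /* room for this many zones */
    rep->flags = BLK_ZONE_REP_CACHED; /* input: use cached zone info */
    if (ioctl(fd, BLKREPORTZONESV2, rep) < 0)
        perror("BLKREPORTZONESV2");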

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: introduce blkdev_report_zones_cached()
Damien Le Moal [Tue, 4 Nov 2025 21:22:44 +0000 (06:22 +0900)] 
block: introduce blkdev_report_zones_cached()

Introduce the function blkdev_report_zones_cached() to provide a fast
zone report built using the blkdev_get_zone_info() function, which gets
zone information from a disk zones_cond array or zone write plugs.
For a large capacity SMR drive, such a fast zone report can be completed
in a few milliseconds, compared to several seconds when the report is
obtained from the device.

The zone report is built in the same manner as with the regular
blkdev_report_zones() function, that is, the first zone reported is the
one containing the specified start sector and the report is limited to
the specified number of zones (nr_zones argument). The information for
each zone in the report is obtained using blkdev_get_zone_info().

For zoned devices that do not use zone write plug resources,
using blkdev_get_zone_info() is inefficient as the zone report would
be very slow, generated one zone at a time. To avoid this,
blkdev_report_zones_cached() falls back to calling
blkdev_do_report_zones() to execute a regular zone report. In this case,
the .report_active field of struct blk_report_zones_args is set to true
to report zone conditions using the BLK_ZONE_COND_ACTIVE condition in
place of the implicit open, explicit open and closed conditions.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: introduce blkdev_get_zone_info()
Damien Le Moal [Tue, 4 Nov 2025 21:22:43 +0000 (06:22 +0900)] 
block: introduce blkdev_get_zone_info()

Introduce the function blkdev_get_zone_info() to obtain the information
for a single zone from cached zone data, that is, either from the zone
write plug for the target zone if it exists, or from the disk zones_cond
array otherwise.

Since sequential zones that do not have a zone write plug are either
full, empty or in a bad state (read-only or offline), the zone write
pointer can be inferred from the zone condition cached in the disk
zones_cond array. For sequential zones that have a zone write plug, the
zone condition and zone write pointer are obtained from the condition
and write pointer offset managed with the zone write plug. This allows
obtaining the information for a zone much more quickly than having to
execute a report zones command on the device.

blkdev_get_zone_info() falls back to using a regular zone report if the
target zone is flagged as needing an update with the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag, or if the target device does not
use zone write plugs (i.e. a device mapper device). In this case, the
new function blkdev_report_zone_fallback() is used and the zone
condition is reported consistently with the cached report, that is, the
BLK_ZONE_COND_ACTIVE condition is used in place of the implicit open,
explicit open and closed conditions. This is achieved by adding the
.report_active field to struct blk_report_zones_args and by having
disk_report_zone() set the correct zone condition if .report_active is
true.

In preparation for using blkdev_get_zone_info() in upcoming file systems
changes, also export this function as a GPL symbol.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: refactor blkdev_report_zones() code
Damien Le Moal [Tue, 4 Nov 2025 21:22:42 +0000 (06:22 +0900)] 
block: refactor blkdev_report_zones() code

In preparation for implementing cached report zone, split the main part
of the code of blkdev_report_zones() into the helper function
blkdev_do_report_zones(), with this new helper taking as argument a
struct blk_report_zones_args pointer instead of a report callback
function and its private argument.

No functional changes.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: track zone conditions
Damien Le Moal [Tue, 4 Nov 2025 21:22:41 +0000 (06:22 +0900)] 
block: track zone conditions

The function blk_revalidate_zone_cond() already caches the condition of
all zones of a zoned block device in the zones_cond array of a gendisk.
However, the zone conditions are updated only when the device is scanned
or revalidated.

Implement tracking of the runtime changes to zone conditions using
the new cond field in struct blk_zone_wplug. The size of this structure
remains 112 Bytes as the new field replaces the 4 Bytes padding at the
end of the structure.

Because zones that do not have a zone write plug can be in the empty,
implicit open, explicit open or full condition, the zones_cond array of
a disk is used to track the conditions of zones that do not have a zone
write plug. The condition of such a zone is updated in the disk
zones_cond array when a zone reset, reset all or finish operation is
executed, and also when a zone write plug is removed from the disk hash
table once the zone becomes full.

Since a device may automatically close an implicitly open zone when
writing to an empty or closed zone, if the total number of open zones
has reached the device limit, the BLK_ZONE_COND_IMP_OPEN and
BLK_ZONE_COND_CLOSED zone conditions cannot be precisely tracked. To
overcome this, the zone condition BLK_ZONE_COND_ACTIVE is introduced to
represent a zone that has the condition BLK_ZONE_COND_IMP_OPEN,
BLK_ZONE_COND_EXP_OPEN or BLK_ZONE_COND_CLOSED.  This follows the
definition of an active zone as defined in the NVMe Zoned Namespace
specifications. As such, for a zoned device that has a limit on the
maximum number of open zones, we will never have more zones in the
BLK_ZONE_COND_ACTIVE condition than the device limit. This is compatible
with the SCSI ZBC and ATA ZAC specifications for SMR HDDs as these
devices do not have a limit on the number of active zones.

The function disk_zone_wplug_set_wp_offset() is modified to use the new
helper disk_zone_wplug_update_cond() to update a zone write plug
condition whenever a zone write plug write offset is updated on
submission or merging of write BIOs to a zone.

The functions blk_zone_reset_bio_endio(), blk_zone_reset_all_bio_endio()
and blk_zone_finish_bio_endio() are modified to update the condition of
the zones targeted by reset, reset_all and finish operations, either
through disk_zone_wplug_set_wp_offset() for zones that have a zone write
plug, or through the disk_zone_set_cond() helper, which updates the
zones_cond array of the disk, for zones that do not have a zone write
plug.

When a zone write plug is removed from the disk hash table (when the
zone becomes empty or full), the condition of struct blk_zone_wplug is
used to update the disk zones_cond array. Conversely, when a zone write
plug is added to the disk hash table, the zones_cond array is used to
initialize the zone write plug condition.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: use zone condition to determine conventional zones
Damien Le Moal [Tue, 4 Nov 2025 21:22:40 +0000 (06:22 +0900)] 
block: use zone condition to determine conventional zones

The conv_zones_bitmap field of struct gendisk is used to define a bitmap
to identify the conventional zones of a zoned block device. The bit for
a zone is set in this bitmap if the zone is a conventional one, that is,
if the zone type is BLK_ZONE_TYPE_CONVENTIONAL. For such a zone, this
always corresponds to the zone condition BLK_ZONE_COND_NOT_WP.
In other words, conv_zones_bitmap tracks a single condition of the
zones of a zoned block device.

In preparation for tracking more zone conditions, change
conv_zones_bitmap into an array of zone conditions, using 1 byte per
zone. This increases the memory usage from 1 bit per zone to 1 byte per
zone, that is, from 16 KiB to about 100 KiB for a 30 TB SMR HDD with 256
MiB zones. This is a trade-off to allow fast cached report zones later
on top of this change.

Rename the conv_zones_bitmap field of struct gendisk to zones_cond. Add
a blk_revalidate_zone_cond() function to initialize the zones_cond array
of a disk during device scan and to update it on device revalidation.
Move the allocation of the zones_cond array to
disk_revalidate_zone_resources(), making sure that this array is always
allocated, even for devices that do not need zone write plugs (zone
resources), to ensure that bdev_zone_is_seq() can be re-implemented to
use the zone condition array in place of the conv zones bitmap.

Finally, the function bdev_zone_is_seq() is rewritten to use a test on
the condition of the target zone.
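
A sketch of the rewritten check (names follow the description above;
details may differ from the actual code):

    static inline bool bdev_zone_is_seq(struct block_device *bdev,
                                        sector_t sector)
    {
        struct gendisk *disk = bdev->bd_disk;

        if (!disk->zones_cond)
            return false;

        /* A zone is sequential iff its cached condition is anything
         * other than "not write pointer", i.e. not conventional.
         */
        return disk->zones_cond[disk_zone_no(disk, sector)] !=
            BLK_ZONE_COND_NOT_WP;
    }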

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: reorganize struct blk_zone_wplug
Damien Le Moal [Tue, 4 Nov 2025 21:22:39 +0000 (06:22 +0900)] 
block: reorganize struct blk_zone_wplug

Reorganize the fields of struct blk_zone_wplug to remove a hole after
the wp_offset field and avoid having the bio_work structure split
between 2 cache lines.

No functional changes.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: introduce disk_report_zone()
Damien Le Moal [Tue, 4 Nov 2025 21:22:38 +0000 (06:22 +0900)] 
block: introduce disk_report_zone()

Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer
alignment") introduced an indirect call for the callback function of a
report zones executed with blkdev_report_zones(). This is necessary so
that the function disk_zone_wplug_sync_wp_offset() can be called to
refresh a zone write plug zone write pointer offset after a write error.
However, this solution makes the path taken by zone information harder
to follow.

Clean this up by introducing the new blk_report_zones_args structure to
define a zone report callback and its private data and introduce the
helper function disk_report_zone() which calls both
disk_zone_wplug_sync_wp_offset() and the zone report user callback
function for all zones of a zone report. This helper function must be
called by all block device drivers that implement the report zones
block operation in order to correctly report zone information.

All block device drivers supporting the report_zones block operation are
updated to use this new scheme.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: cleanup blkdev_report_zones()
Damien Le Moal [Tue, 4 Nov 2025 21:22:37 +0000 (06:22 +0900)] 
block: cleanup blkdev_report_zones()

The variable capacity is used only in one place and so can be removed
and get_capacity(disk) used directly instead.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: freeze queue when updating zone resources
Damien Le Moal [Tue, 4 Nov 2025 21:22:36 +0000 (06:22 +0900)] 
block: freeze queue when updating zone resources

Modify disk_update_zone_resources() to freeze the device queue before
updating the number of zones, zone capacity and other zone related
resources. The locking order resulting from the call to
queue_limits_commit_update_frozen() is preserved, that is, the queue
limits lock is first taken by calling queue_limits_start_update() before
freezing the queue, and the queue is unfrozen after executing
queue_limits_commit_update(), which replaces the call to
queue_limits_commit_update_frozen().

This change ensures that there are no in-flight I/Os when the zone
resources are updated due to a zone revalidation. In case of error when
the limits are applied, directly call disk_free_zone_resources() from
disk_update_zone_resources() while the disk queue is still frozen to
avoid needing to freeze & unfreeze the queue again in
blk_revalidate_disk_zones(), thus simplifying that function code a
little.
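
The resulting ordering can be sketched as follows (error handling elided;
the exact freeze/unfreeze API may differ between kernel versions):

    lim = queue_limits_start_update(q); /* take the limits lock first */
    memflags = blk_mq_freeze_queue(q);  /* then freeze: no in-flight I/O */

    /* ... update nr_zones, zone capacity and other zone resources ... */

    ret = queue_limits_commit_update(q, &lim);
    if (ret)
        disk_free_zone_resources(disk); /* still frozen: safe to free */
    blk_mq_unfreeze_queue(q, memflags);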

Fixes: 0b83c86b444a ("block: Prevent potential deadlock in blk_revalidate_disk_zones()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: handle zone management operations completions
Damien Le Moal [Tue, 4 Nov 2025 21:22:35 +0000 (06:22 +0900)] 
block: handle zone management operations completions

The functions blk_zone_wplug_handle_reset_or_finish() and
blk_zone_wplug_handle_reset_all() both modify the zone write pointer
offset of zone write plugs that are the target of a reset, reset all or
finish zone management operation. However, these functions do this
modification before the BIO is executed. So if the zone operation fails,
the modified zone write pointer offsets become invalid.

Avoid this by modifying the zone write pointer offset of a zone write
plug that is the target of a zone management operation when the
operation completes. To do so, modify blk_zone_bio_endio() to call the
new function blk_zone_mgmt_bio_endio() which in turn calls the functions
blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
blk_zone_finish_bio_endio() depending on the operation of the completed
BIO, to modify a zone write plug write pointer offset accordingly.
These functions are called only if the BIO execution was successful.
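
A sketch of the completion-side dispatch; the function names are taken
from the description above, bodies and exact arguments are elided:

    static void blk_zone_mgmt_bio_endio(struct bio *bio)
    {
        /* Only update cached write pointer state on success. */
        if (bio->bi_status != BLK_STS_OK)
            return;

        switch (bio_op(bio)) {
        case REQ_OP_ZONE_RESET:
            blk_zone_reset_bio_endio(bio);
            break;
        case REQ_OP_ZONE_RESET_ALL:
            blk_zone_reset_all_bio_endio(bio);
            break;
        case REQ_OP_ZONE_FINISH:
            blk_zone_finish_bio_endio(bio);
            break;
        default:
            break;
        }
    }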

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoMerge branch 'autopi-deadlock' into for-6.19/block
Jens Axboe [Tue, 4 Nov 2025 19:43:02 +0000 (12:43 -0700)] 
Merge branch 'autopi-deadlock' into for-6.19/block

Currently the automatic block layer PI generation allocates the integrity
buffer using kmalloc, and thus could deadlock or fail I/O requests under
memory pressure.

Fix this by adding a mempool, and capping the maximum I/O size on PI
capable devices to not exceed the allocation size of the mempool.

Link: https://lore.kernel.org/linux-block/20251103101653.2083310-1-hch@lst.de/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* autopi-deadlock:
  block: make bio auto-integrity deadlock safe
  block: blocking mempool_alloc doesn't fail

4 weeks agoblock: make bio auto-integrity deadlock safe
Christoph Hellwig [Mon, 3 Nov 2025 10:16:45 +0000 (05:16 -0500)] 
block: make bio auto-integrity deadlock safe

The current block layer automatic integrity protection allocates the
actual integrity buffer, which has three problems:

 - because it happens at the bottom of the I/O stack and doesn't use a
   mempool it can deadlock under load
 - because the data size in a bio is almost unbounded when using large
   folios, it can relatively easily exceed the maximum kmalloc size
 - even when it does not exceed the maximum kmalloc size, it could
   exceed the maximum segment size of the device

Fix this by limiting the I/O size so that we can allocate at least a
2MiB integrity buffer, i.e. 128MiB for 8 byte PI and 512 byte integrity
intervals, and create a mempool as a last resort for this maximum size,
mirroring the scheme used for bvecs.  As a nice upside none of this
can fail now, so we remove the error handling and open code the
trivial addition of the bip vec.
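
A sketch of the allocation strategy, mirroring the bvec scheme (the pool
name and flags below are illustrative, not the actual symbols):

    /* Try a regular allocation first, without aggressive retrying, then
     * fall back to the mempool, which is guaranteed to eventually
     * succeed for the capped maximum buffer size.
     */
    buf = kmalloc(len, GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
    if (!buf) {
        buf = mempool_alloc(&blk_integrity_pool, GFP_NOIO);
        from_pool = true;  /* remember to free back to the pool */
    }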

The new allocation helpers sit outside of bio-integrity-auto.c because
I plan to reuse them for file system based PI in the near future.

Fixes: 7ba1ba12eeef ("block: Block layer data integrity support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblock: blocking mempool_alloc doesn't fail
Christoph Hellwig [Mon, 3 Nov 2025 10:16:44 +0000 (05:16 -0500)] 
block: blocking mempool_alloc doesn't fail

So remove the error check for it in bio_integrity_prep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoselftests: ublk: make ublk_thread thread-local variable
Ming Lei [Sat, 1 Nov 2025 13:31:20 +0000 (21:31 +0800)] 
selftests: ublk: make ublk_thread thread-local variable

Refactor ublk_thread to be a thread-local variable instead of storing
it in ublk_dev:

- Remove pthread_t thread field from struct ublk_thread and move it to
  struct ublk_thread_info

- Remove struct ublk_thread array from struct ublk_dev, reducing memory
  footprint

- Define struct ublk_thread as local variable in __ublk_io_handler_fn()
  instead of accessing it from dev->threads[]

- Extract main IO handling logic into __ublk_io_handler_fn() which is
  marked as noinline

- Move CPU affinity setup to ublk_io_handler_fn() before calling
  __ublk_io_handler_fn()

- Update ublk_thread_set_sched_affinity() to take struct ublk_thread_info *
  instead of struct ublk_thread *, and use pthread_setaffinity_np()
  instead of sched_setaffinity()

- Reorder struct ublk_thread fields to group related state together

This change makes each thread's ublk_thread structure truly local to
the thread, improving cache locality and reducing memory usage.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoselftests: ublk: set CPU affinity before thread initialization
Ming Lei [Sat, 1 Nov 2025 13:31:19 +0000 (21:31 +0800)] 
selftests: ublk: set CPU affinity before thread initialization

Move ublk_thread_set_sched_affinity() call before ublk_thread_init()
to ensure memory allocations during thread initialization occur on
the correct NUMA node. This leverages Linux's first-touch memory
policy for better NUMA locality.

Also convert ublk_thread_set_sched_affinity() to use
pthread_setaffinity_np() instead of sched_setaffinity(), as the
pthread API is the proper interface for setting thread affinity in
multithreaded programs.
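
A sketch of the conversion to the pthread affinity API (the cpu set
handling is illustrative):

    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* Before: sched_setaffinity(0, sizeof(set), &set); */

    /* After: operate on the calling thread through the pthread API,
     * placed before ublk_thread_init() so that allocations in the init
     * path are first-touched on the right NUMA node.
     */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);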

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoublk: use struct_size() for allocation
Ming Lei [Sat, 1 Nov 2025 13:31:18 +0000 (21:31 +0800)] 
ublk: use struct_size() for allocation

Convert ublk_queue to use struct_size() for allocation.

Changes in this commit:

1. Update ublk_init_queue() to use struct_size(ubq, ios, depth)
   instead of manual size calculation (sizeof(struct ublk_queue) +
   depth * sizeof(struct ublk_io)).

This provides better type safety and makes the code more maintainable
by using standard kernel macro for flexible array handling.

Meanwhile, annotate ublk_queue.ios with __counted_by().
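
The conversion follows the standard pattern; a sketch (surrounding struct
fields elided, the length field name is an assumption):

    /* Before: manual, overflow-prone size arithmetic. */
    ubq = kvzalloc(sizeof(struct ublk_queue) +
                   depth * sizeof(struct ublk_io), GFP_KERNEL);

    /* After: struct_size() accounts for the flexible array and guards
     * against overflow; __counted_by() ties the array to its length
     * field for compiler and runtime bounds checks.
     */
    struct ublk_io ios[] __counted_by(q_depth); /* in struct ublk_queue */

    ubq = kvzalloc(struct_size(ubq, ios, depth), GFP_KERNEL);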

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoublk: implement NUMA-aware memory allocation
Ming Lei [Sat, 1 Nov 2025 13:31:17 +0000 (21:31 +0800)] 
ublk: implement NUMA-aware memory allocation

Implement NUMA-friendly memory allocation for ublk driver to improve
performance on multi-socket systems.

This commit includes the following changes:

1. Rename __queues to queues, dropping the __ prefix since the field is
   now accessed directly throughout the codebase rather than only through
   the ublk_get_queue() helper.

2. Remove the queue_size field from struct ublk_device as it is no longer
   needed.

3. Move queue allocation and deallocation into ublk_init_queue() and
   ublk_deinit_queue() respectively, improving encapsulation. This
   simplifies ublk_init_queues() and ublk_deinit_queues() to just
   iterate and call the per-queue functions.

4. Add ublk_get_queue_numa_node() helper function to determine the
   appropriate NUMA node for a queue by finding the first CPU mapped
   to that queue via tag_set.map[HCTX_TYPE_DEFAULT].mq_map[] and
   converting it to a NUMA node using cpu_to_node(). This function is
   called internally by ublk_init_queue() to determine the allocation
   node.

5. Allocate each queue structure on its local NUMA node using
   kvzalloc_node() in ublk_init_queue().

6. Allocate the I/O command buffer on the same NUMA node using
   alloc_pages_node().

This reduces memory access latency on multi-socket NUMA systems by
ensuring each queue's data structures are local to the CPUs that
access them.
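
A sketch of the helper described in item 4 above and of the resulting
node-local allocation (see items 5 and 6); names follow the description:

    static int ublk_get_queue_numa_node(struct ublk_device *ub, int q_id)
    {
        unsigned int cpu;

        /* Use the NUMA node of the first CPU that blk-mq maps to this
         * queue.
         */
        for_each_possible_cpu(cpu) {
            if (ub->tag_set.map[HCTX_TYPE_DEFAULT].mq_map[cpu] == q_id)
                return cpu_to_node(cpu);
        }
        return NUMA_NO_NODE;
    }

    /* In ublk_init_queue(): allocate the queue on its local node. */
    ubq = kvzalloc_node(struct_size(ubq, ios, depth), GFP_KERNEL,
                        ublk_get_queue_numa_node(ub, q_id));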

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoublk: reorder tag_set initialization before queue allocation
Ming Lei [Sat, 1 Nov 2025 13:31:16 +0000 (21:31 +0800)] 
ublk: reorder tag_set initialization before queue allocation

Move ublk_add_tag_set() before ublk_init_queues() in the device
initialization path. This allows us to use the blk-mq CPU-to-queue
mapping established by the tag_set to determine the appropriate
NUMA node for each queue allocation.

The error handling paths are also reordered accordingly.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agoblktrace: add support for REQ_OP_WRITE_ZEROES tracing
Chaitanya Kulkarni [Wed, 29 Oct 2025 03:34:23 +0000 (20:34 -0700)] 
blktrace: add support for REQ_OP_WRITE_ZEROES tracing

Currently, REQ_OP_WRITE_ZEROES operations are not handled in the
blktrace infrastructure, resulting in incorrect or missing operation
labels in ftrace blktrace output. This manifests as write-zeroes
operations appearing with incorrect labels like "N" instead of a
proper "WZ" designation.

This patch adds complete support for REQ_OP_WRITE_ZEROES across the
blktrace infrastructure:

 - Add BLK_TC_WRITE_ZEROES trace category in blktrace_api.h and update
   the BLK_TC_END_V2 marker accordingly
 - Map REQ_OP_WRITE_ZEROES to BLK_TC_WRITE_ZEROES in __blk_add_trace()
   to ensure proper trace event categorization
 - Update fill_rwbs() to generate the "WZ" label for write-zeroes
   operations in ftrace output, making them easily identifiable
 - Add the "write-zeroes" string mapping in the act_to_str array for the
   debugfs filter interface
 - Update blk_fill_rwbs() to handle REQ_OP_WRITE_ZEROES for block layer
   event tracing

With this fix, write-zeroes operations are now correctly traced and
displayed.
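
For example, the rwbs decoding gains a case along these lines (sketch):

    case REQ_OP_WRITE_ZEROES:
        rwbs[i++] = 'W';
        rwbs[i++] = 'Z';
        break;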

===========================================================
BEFORE THIS PATCH
===========================================================
blkdiscard -z -o 0 -l 40960 /dev/nvme0n1
   blkdiscard-3809 [030] .....  1212.253701: block_bio_queue: 259,0 NS 0 + 80 [blkdiscard]
   blkdiscard-3809 [030] .....  1212.253703: block_getrq: 259,0 NS 0 + 80 [blkdiscard]
   blkdiscard-3809 [030] .....  1212.253704: block_io_start: 259,0 NS 40960 () 0 + 80 be,0,4 [blkdiscard]
   blkdiscard-3809 [030] .....  1212.253704: block_plug: [blkdiscard]
   blkdiscard-3809 [030] .....  1212.253706: block_unplug: [blkdiscard] 1
   blkdiscard-3809 [030] .....  1212.253706: block_rq_insert: 259,0 NS 40960 () 0 + 80 be,0,4 [blkdiscard]
kworker/30:1H-566  [030] .....  1212.253726: block_rq_issue: 259,0 NS 40960 () 0 + 80 be,0,4 [kworker/30:1H]
       <idle>-0    [030] d.h1.  1212.253957: block_rq_complete: 259,0 NS () 0 + 80 be,0,4 [0]
       <idle>-0    [030] dNh1.  1212.253960: block_io_done: 259,0 NS 0 () 0 + 0 none,0,0 [swapper/30]

Trace Event Breakdown:
 Event             | Device | Op  | Sector | Sectors | Byte Size | Calculation

 block_bio_queue   | 259,0  | NS  | 0      | 80      | -         | 80 × 512 = 40,960
 block_getrq       | 259,0  | NS  | 0      | 80      | -         | 80 × 512 = 40,960
 block_io_start    | 259,0  | NS  | 0      | 80      | 40960     | Direct from trace
 block_rq_insert   | 259,0  | NS  | 0      | 80      | 40960     | Direct from trace
 block_rq_issue    | 259,0  | NS  | 0      | 80      | 40960     | Direct from trace
 block_rq_complete | 259,0  | NS  | 0      | 80      | -         | 80 × 512 = 40,960
 block_io_done     | 259,0  | NS  | 0      | 0       | 0         | Completion (no data)

  Total Bytes Transferred: Sectors: 80 Bytes: 80 × 512 = 40,960 bytes

===========================================================
AFTER THIS PATCH
===========================================================
blkdiscard -z -o 0 -l 40960 /dev/nvme0n1

   blkdiscard-2477 [020] .....   960.989131: block_bio_queue: 259,0 WZS 0 + 80 [blkdiscard]
   blkdiscard-2477 [020] .....   960.989134: block_getrq: 259,0 WZS 0 + 80 [blkdiscard]
   blkdiscard-2477 [020] .....   960.989135: block_io_start: 259,0 WZS 40960 () 0 + 80 be,0,4 [blkdiscard]
   blkdiscard-2477 [020] .....   960.989138: block_plug: [blkdiscard]
   blkdiscard-2477 [020] .....   960.989140: block_unplug: [blkdiscard] 1
   blkdiscard-2477 [020] .....   960.989141: block_rq_insert: 259,0 WZS 40960 () 0 + 80 be,0,4 [blkdiscard]
kworker/20:1H-736  [020] .....   960.989166: block_rq_issue: 259,0 WZS 40960 () 0 + 80 be,0,4 [kworker/20:1H]
       <idle>-0    [020] d.h1.   960.989476: block_rq_complete: 259,0 WZS () 0 + 80 be,0,4 [0]
       <idle>-0    [020] dNh1.   960.989482: block_io_done: 259,0 WZS 0 () 0 + 0 none,0,0 [swapper/20]

Trace Event Breakdown:
 Event             | Device | Op  | Sector | Sectors | Byte Size | Calculation

 block_bio_queue   | 259,0  | WZS | 0      | 80      | -         | 80 × 512 = 40,960
 block_getrq       | 259,0  | WZS | 0      | 80      | -         | 80 × 512 = 40,960
 block_io_start    | 259,0  | WZS | 0      | 80      | 40960     | Direct from trace
 block_rq_insert   | 259,0  | WZS | 0      | 80      | 40960     | Direct from trace
 block_rq_issue    | 259,0  | WZS | 0      | 80      | 40960     | Direct from trace
 block_rq_complete | 259,0  | WZS | 0      | 80      | -         | 80 × 512 = 40,960
 block_io_done     | 259,0  | WZS | 0      | 0       | 0         | Completion (no data)

  Total Bytes Transferred: Sectors: 80 Bytes: 80 × 512 = 40,960 bytes

Tested with ftrace blktrace on NVMe devices using blkdiscard with
the -z (write-zeroes) flag.

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 weeks agodrbd: replace kmap() with kmap_local_page() in receiver path
Shi Hao [Sat, 1 Nov 2025 05:44:22 +0000 (11:14 +0530)] 
drbd: replace kmap() with kmap_local_page() in receiver path

Use kmap_local_page() instead of kmap() to avoid
CPU contention.

kmap() uses a global set of mapping slots that can cause contention
between multiple CPUs, while kmap_local_page() uses per-CPU slots
eliminating this contention. It also ensures non-sleeping operation
and provides better cache locality.

Convert kmap() to kmap_local_page() as it aligns with ongoing
kernel efforts to modernize kmap() usage for better multi-core
scalability.
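
The conversion itself is mechanical; a sketch (drbd receive specifics
elided):

    /* Before */
    data = kmap(page);
    /* ... receive into data ... */
    kunmap(page);

    /* After: per-CPU, non-sleeping mapping; unmaps must nest strictly. */
    data = kmap_local_page(page);
    /* ... receive into data ... */
    kunmap_local(data);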

Signed-off-by: Shi Hao <i.shihao.999@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 weeks agoblktrace: for ftrace use correct trace format ver
Chaitanya Kulkarni [Tue, 28 Oct 2025 05:50:42 +0000 (22:50 -0700)] 
blktrace: for ftrace use correct trace format ver

The ftrace blktrace path allocates buffers and writes trace events but
was using the wrong recording function. After
commit 4d8bc7bd4f73 ("blktrace: move ftrace blk_io_tracer to blk_io_trace2"),
the ftrace interface was moved to use blk_io_trace2 format, but
__blk_add_trace() still called record_blktrace_event() which writes in
blk_io_trace (v1) format.

This causes critical data corruption:

- blk_io_trace (v1) has 32-bit 'action' field at offset 28
- blk_io_trace2 (v2) has 32-bit 'pid' at offset 28 and 64-bit 'action'
  at offset 32
- When record_blktrace_event() writes to a v2 buffer:
  * Writing pid (offset 32 in v1) corrupts the v2 action field
  * Writing action (offset 28 in v1) corrupts the v2 pid field
  * The 64-bit action is truncated to 32-bit via lower_32_bits()

Fix by:
1. Adding version switch to select correct format (v1 vs v2)
2. Calling appropriate recording function based on version
3. Defaulting to v2 for ftrace (as intended by commit 4d8bc7bd4f73)
4. Adding WARN_ONCE for unexpected version values

Without this patch :-
linux-block (for-next) # sh reproduce_blktrace_bug.sh
              dd-14242   [033] d..1.  3903.022308: Unknown action 36a2
              dd-14242   [033] d..1.  3903.022333: Unknown action 36a2
              dd-14242   [033] d..1.  3903.022365: Unknown action 36a2
              dd-14242   [033] d..1.  3903.022366: Unknown action 36a2
              dd-14242   [033] d..1.  3903.022369: Unknown action 36a2

The action field is corrupted because:
  - ftrace allocated blk_io_trace2 buffer (64 bytes)
  - But called record_blktrace_event() (writes v1, 48 bytes)
  - Field offsets don't match, causing corruption

The hex value shown 0x30e3 is actually a PID, not an action code!

linux-block (for-next) #
linux-block (for-next) #
linux-block (for-next) # sh reproduce_blktrace_bug.sh
Trace output looks correct:

              dd-2420    [019] d..1.    59.641742: 251,0    Q  RS 0 + 8 [dd]
              dd-2420    [019] d..1.    59.641775: 251,0    G  RS 0 + 8 [dd]
              dd-2420    [019] d..1.    59.641784: 251,0    P   N [dd]
              dd-2420    [019] d..1.    59.641785: 251,0    U   N [dd] 1
              dd-2420    [019] d..1.    59.641788: 251,0    D  RS 0 + 8 [dd]

Fixes: 4d8bc7bd4f73 ("blktrace: move ftrace blk_io_tracer to blk_io_trace2")
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>