git.ipfire.org Git - thirdparty/kernel/linux.git/log
7 weeks ago ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
Ming Lei [Thu, 9 Apr 2026 13:30:15 +0000 (21:30 +0800)] 
ublk: simplify PFN range loop in __ublk_ctrl_reg_buf

Use the for-loop increment instead of a manual `i++` past the last
page, and fix the mtree_insert_range end key accordingly.
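
The loop shape described above can be sketched in userspace C. This is not the kernel code: `struct pfn_range` and `coalesce_pfns()` are hypothetical names, and the inclusive `last` field only stands in for the end key that a range insert would take. The point is that the for-loop index alone marks one past the last page of each run, so the end key is derived from it without a manual increment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one range entry: an inclusive PFN range. */
struct pfn_range {
    unsigned long first;
    unsigned long last; /* inclusive end key */
};

/*
 * Coalesce pfns[0..n) into runs of consecutive PFNs.
 * 'out' must have room for n entries. Returns the number of ranges.
 */
static size_t coalesce_pfns(const unsigned long *pfns, size_t n,
                            struct pfn_range *out)
{
    size_t nr = 0;
    size_t start = 0;

    assert(pfns && out);

    for (size_t i = 1; i <= n; i++) {
        /* Still inside a contiguous run: let the loop increment advance. */
        if (i < n && pfns[i] == pfns[i - 1] + 1)
            continue;
        /* The run is pfns[start..i-1]; i is one past its last page. */
        out[nr].first = pfns[start];
        out[nr].last = pfns[start] + (i - start) - 1;
        nr++;
        start = i;
    }
    return nr;
}
```

For the input `{10, 11, 12, 20, 21, 30}` this yields the ranges 10-12, 20-21 and 30-30.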

Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-4-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago ublk: verify all pages in multi-page bvec fall within registered range
Ming Lei [Thu, 9 Apr 2026 13:30:14 +0000 (21:30 +0800)] 
ublk: verify all pages in multi-page bvec fall within registered range

rq_for_each_bvec() yields multi-page bvecs where bv_page is only the
first page. ublk_try_buf_match() only validated the start PFN against
the maple tree, but a bvec can span multiple pages past the end of a
registered range.

Use mas_walk() instead of mtree_load() to obtain the range boundaries
stored in the maple tree, and check that the bvec's end PFN does not
exceed the range. Also remove base_pfn from struct ublk_buf_range
since mas.index already provides the range start PFN.
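
A minimal userspace sketch of the check described above, assuming a registered range is known by its inclusive first and last PFN (as a maple tree walk would report them). `struct reg_range` and `bvec_within_range()` are hypothetical names, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical registered range, inclusive on both ends. */
struct reg_range {
    unsigned long first_pfn;
    unsigned long last_pfn;
};

/*
 * A multi-page bvec starts at start_pfn and covers nr_pages pages.
 * Checking only start_pfn is not enough: the last page must also
 * fall inside the registered range.
 */
static bool bvec_within_range(unsigned long start_pfn, unsigned long nr_pages,
                              const struct reg_range *r)
{
    assert(nr_pages > 0);
    unsigned long end_pfn = start_pfn + nr_pages - 1;

    return start_pfn >= r->first_pfn && end_pfn <= r->last_pfn;
}
```

With a range of PFNs 100-103, a two-page bvec starting at PFN 102 passes, while a three-page bvec starting at the same PFN is rejected even though its first page is registered.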

Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-3-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
Ming Lei [Thu, 9 Apr 2026 13:30:13 +0000 (21:30 +0800)] 
ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support

The __u32 len field cannot represent a 4GB buffer (0x100000000
overflows to 0). Change it to __u64 so buffers up to 4GB can be
registered. Add a reserved field for alignment and validate it
is zero.

The kernel enforces a default max of 4GB (UBLK_SHMEM_BUF_SIZE_MAX)
which may be increased in future.
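
The truncation mentioned above is easy to reproduce in plain C. The helper names below are made up for illustration; only the arithmetic is the point.

```c
#include <assert.h> /* used by the check in the note below */
#include <stdbool.h>
#include <stdint.h>

/* What a 32-bit length field would store for a given byte count. */
static uint32_t len_as_u32(uint64_t len)
{
    return (uint32_t)len; /* keeps only the low 32 bits */
}

/* True if the length survives a round trip through 32 bits. */
static bool len_fits_u32(uint64_t len)
{
    return len <= UINT32_MAX;
}
```

For example, `assert(len_as_u32(1ULL << 32) == 0);` holds: a 4GB length stored in 32 bits reads back as zero.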

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-2-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago Merge tag 'md-7.1-20260407' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid...
Jens Axboe [Wed, 8 Apr 2026 12:53:16 +0000 (06:53 -0600)] 
Merge tag 'md-7.1-20260407' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.1/block

Pull MD changes from Yu Kuai:

"Bug Fixes:
 - avoid a sysfs deadlock when clearing array state (Yu Kuai)
 - validate raid5 journal payloads before reading metadata (Junrui Luo)
 - fall back to the correct bitmap operations after version mismatches
   (Yu Kuai)
 - serialize overlapping writes on writemostly raid1 disks (Xiao Ni)
 - wake raid456 reshape waiters before suspend (Yu Kuai)
 - prevent retry_aligned_read() from triggering soft lockups
   (Chia-Ming Chang)

 Improvements:
 - switch raid0 strip zone and devlist allocations to kvmalloc helpers
   (Gregory Price)
 - track clean unwritten stripes for proactive RAID5 parity building
   (Yu Kuai)
 - speed up initial llbitmap sync with write_zeroes_unmap support
   (Yu Kuai)

 Cleanups:
 - remove the unused static md workqueue definition
   (Abd-Alrhman Masalkhi)"

* tag 'md-7.1-20260407' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md/raid5: fix soft lockup in retry_aligned_read()
  md: wake raid456 reshape waiters before suspend
  md/raid1: serialize overlap io for writemostly disk
  md/md-llbitmap: optimize initial sync with write_zeroes_unmap support
  md/md-llbitmap: add CleanUnwritten state for RAID-5 proactive parity building
  md: add fallback to correct bitmap_ops on version mismatch
  md/raid5: validate payload size before accessing journal metadata
  md: remove unused static md_wq workqueue
  md/raid0: use kvzalloc/kvfree for strip_zone and devlist allocations
  md: fix array_state=clear sysfs deadlock

7 weeks ago xfs: use bio_await in xfs_zone_gc_reset_sync
Christoph Hellwig [Tue, 7 Apr 2026 14:05:28 +0000 (16:05 +0200)] 
xfs: use bio_await in xfs_zone_gc_reset_sync

Replace the open-coded bio wait logic with the new bio_await helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260407140538.633364-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago block: add a bio_submit_or_kill helper
Christoph Hellwig [Tue, 7 Apr 2026 14:05:27 +0000 (16:05 +0200)] 
block: add a bio_submit_or_kill helper

Factor out the common logic for the ioctl helpers to either submit a bio
or end it if the process is being killed.  As this is now the only user
of bio_await_chain, open code that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260407140538.633364-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago block: factor out a bio_await helper
Christoph Hellwig [Tue, 7 Apr 2026 14:05:26 +0000 (16:05 +0200)] 
block: factor out a bio_await helper

Add a new helper to wait for a bio and anything chained off it to
complete synchronously after submitting it.  This factors common code out
of submit_bio_wait and bio_await_chain and will also be useful for
file system code and thus is exported.

Note that this will now set REQ_SYNC also for the bio_await case for
consistency.  Nothing should look at the flag in the end_io handler,
but if something does, having the flag set makes more sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260407140538.633364-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago block: unify the synchronous bi_end_io callbacks
Christoph Hellwig [Tue, 7 Apr 2026 14:05:25 +0000 (16:05 +0200)] 
block: unify the synchronous bi_end_io callbacks

Put the bio in bio_await_chain after waiting for the completion, and
share the now identical callbacks between submit_bio_wait and
bio_await_chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260407140538.633364-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago xfs: fix number of GC bvecs
Christoph Hellwig [Tue, 7 Apr 2026 14:05:24 +0000 (16:05 +0200)] 
xfs: fix number of GC bvecs

GC scratch allocations can wrap around and use the same buffer twice, and
the current code fails to account for that.  So far this worked due to
rounding in the block layer, but changes to the bio allocator drop the
over-provisioning and generic/256 or generic/361 will now usually fail
when running against the current block tree.

Simplify the allocation to always pass the maximum value, which is easier
to verify, as a saving of up to one bvec per allocation isn't worth the
effort to verify a complicated calculated value.

Fixes: 102f444b57b3 ("xfs: rework zone GC buffer management")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://patch.msgid.link/20260407140538.633364-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago selftests/ublk: add read-only buffer registration test
Ming Lei [Tue, 31 Mar 2026 15:32:01 +0000 (23:32 +0800)] 
selftests/ublk: add read-only buffer registration test

Add --rdonly_shmem_buf option to kublk that registers shared memory
buffers with UBLK_SHMEM_BUF_READ_ONLY (read-only pinning without
FOLL_WRITE) and mmaps with PROT_READ only.

Add test_shmemzc_04.sh which exercises the new flag with a null target,
hugetlbfs buffer, and write workload. Write I/O works because the
server only reads from the shared buffer — the data flows from client
to kernel to the shared pages, and the server reads them out.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago selftests/ublk: add filesystem fio verify test for shmem_zc
Ming Lei [Tue, 31 Mar 2026 15:32:00 +0000 (23:32 +0800)] 
selftests/ublk: add filesystem fio verify test for shmem_zc

Add test_shmemzc_03.sh which exercises shmem_zc through the full
filesystem stack: mkfs ext4 on the ublk device, mount it, then run
fio verify on a file inside the filesystem with --mem=mmaphuge.

Extend _mkfs_mount_test() to accept an optional command that runs
between mount and umount. The function cd's into the mount directory
so the command can use relative file paths. Existing callers that
pass only the device are unaffected.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-10-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago selftests/ublk: add hugetlbfs shmem_zc test for loop target
Ming Lei [Tue, 31 Mar 2026 15:31:59 +0000 (23:31 +0800)] 
selftests/ublk: add hugetlbfs shmem_zc test for loop target

Add test_shmem_zc_02.sh which tests the UBLK_IO_F_SHMEM_ZC zero-copy
path on the loop target using a hugetlbfs shared buffer. Both kublk and
fio mmap the same hugetlbfs file with MAP_SHARED, sharing physical
pages. The kernel's PFN matching enables zero-copy — the loop target
reads/writes directly from the shared buffer to the backing file.

Uses standard fio --mem=mmaphuge:<path> (supported since fio 1.10),
no patched fio required.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago selftests/ublk: add shared memory zero-copy test
Ming Lei [Tue, 31 Mar 2026 15:31:58 +0000 (23:31 +0800)] 
selftests/ublk: add shared memory zero-copy test

Add test_shmem_zc_01.sh which tests UBLK_IO_F_SHMEM_ZC on the null
target using a hugetlbfs shared buffer. Both kublk (--htlb) and fio
(--mem=mmaphuge:<path>) mmap the same hugetlbfs file with MAP_SHARED,
sharing physical pages. The kernel PFN match enables zero-copy I/O.

Uses standard fio --mem=mmaphuge:<path> (supported since fio 1.10),
no patched fio required.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
Ming Lei [Tue, 31 Mar 2026 15:31:57 +0000 (23:31 +0800)] 
selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target

Add loop_queue_shmem_zc_io() which handles I/O requests marked with
UBLK_IO_F_SHMEM_ZC. When the kernel sets this flag, the request data
lives in a registered shared memory buffer — decode index + offset
from iod->addr and use the server's mmap as the I/O buffer.

The dispatch check in loop_queue_tgt_rw_io() routes SHMEM_ZC requests
to this new function, bypassing the normal buffer registration path.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago selftests/ublk: add shared memory zero-copy support in kublk
Ming Lei [Tue, 31 Mar 2026 15:31:56 +0000 (23:31 +0800)] 
selftests/ublk: add shared memory zero-copy support in kublk

Add infrastructure for UBLK_F_SHMEM_ZC shared memory zero-copy:

- kublk.h: struct ublk_shmem_entry and table for tracking registered
  shared memory buffers
- kublk.c: per-device unix socket listener that accepts memfd
  registrations from clients via SCM_RIGHTS fd passing. The listener
  mmaps the memfd and registers the VA range with the kernel for PFN
  matching. Also adds --shmem_zc command line option.
- kublk.c: --htlb <path> option to open a pre-allocated hugetlbfs
  file, mmap it with MAP_SHARED|MAP_POPULATE, and register it with
  the kernel via ublk_ctrl_reg_buf(). Any process that mmaps the same
  hugetlbfs file shares the same physical pages, enabling zero-copy
  without socket-based fd passing.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago ublk: eliminate permanent pages[] array from struct ublk_buf
Ming Lei [Tue, 31 Mar 2026 15:31:55 +0000 (23:31 +0800)] 
ublk: eliminate permanent pages[] array from struct ublk_buf

The pages[] array (kvmalloc'd, 8 bytes per page = 2MB for a 1GB buffer)
was stored permanently in struct ublk_buf but only needed during
pin_user_pages_fast() and maple tree construction. Since the maple tree
already stores PFN ranges via ublk_buf_range, struct page pointers can
be recovered via pfn_to_page() during unregistration.

Make pages[] a temporary allocation in ublk_ctrl_reg_buf(), freed
immediately after the maple tree is built. Rewrite __ublk_ctrl_unreg_buf()
to iterate the maple tree for matching buf_index entries, recovering
struct page pointers via pfn_to_page() and unpinning in batches of 32.
Simplify ublk_buf_erase_ranges() to iterate the maple tree by buf_index
instead of walking the now-removed pages[] array.
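
The batched unpinning described above can be modeled in userspace as follows. This is only a sketch of the batching pattern: `unpin_range_batched()`, `unpin_fn` and `count_pages()` are hypothetical, and the callback stands in for whatever the kernel does with each batch of recovered pages.

```c
#include <assert.h>
#include <stddef.h>

#define UNPIN_BATCH 32

/* Hypothetical callback standing in for the per-batch unpin step. */
typedef void (*unpin_fn)(const unsigned long *pfns, size_t n, void *ctx);

/*
 * Walk the inclusive PFN range [first, last], handing the PFNs to 'fn'
 * in batches of at most UNPIN_BATCH. Returns the number of batches.
 */
static size_t unpin_range_batched(unsigned long first, unsigned long last,
                                  unpin_fn fn, void *ctx)
{
    unsigned long batch[UNPIN_BATCH];
    size_t n = 0, calls = 0;

    assert(first <= last);

    for (unsigned long pfn = first; ; pfn++) {
        batch[n++] = pfn;
        /* Flush when the batch is full or the range is exhausted. */
        if (n == UNPIN_BATCH || pfn == last) {
            fn(batch, n, ctx);
            calls++;
            n = 0;
        }
        if (pfn == last)
            break;
    }
    return calls;
}

/* Example callback: count how many pages were handed over in total. */
static void count_pages(const unsigned long *pfns, size_t n, void *ctx)
{
    (void)pfns;
    *(size_t *)ctx += n;
}
```

A 70-page range is processed as three batches (32, 32 and 6 pages), so only a small fixed-size array is needed instead of a permanent per-buffer page array.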

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago ublk: enable UBLK_F_SHMEM_ZC feature flag
Ming Lei [Tue, 31 Mar 2026 15:31:54 +0000 (23:31 +0800)] 
ublk: enable UBLK_F_SHMEM_ZC feature flag

Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL.
Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from
returning false to checking the actual flag, enabling the shared
memory zero-copy feature for devices that request it.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-4-ming.lei@redhat.com
[axboe: ublk_buf_reg -> ublk_shmem_buf_reg errors]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago ublk: add PFN-based buffer matching in I/O path
Ming Lei [Tue, 31 Mar 2026 15:31:53 +0000 (23:31 +0800)] 
ublk: add PFN-based buffer matching in I/O path

Add ublk_try_buf_match() which walks a request's bio_vecs, looks up
each page's PFN in the per-device maple tree, and verifies all pages
belong to the same registered buffer at contiguous offsets.

Add ublk_iod_is_shmem_zc() inline helper for checking whether a
request uses the shmem zero-copy path.

Integrate into the I/O path:
- ublk_setup_iod(): if pages match a registered buffer, set
  UBLK_IO_F_SHMEM_ZC and encode buffer index + offset in addr
- ublk_start_io(): skip ublk_map_io() for zero-copy requests
- __ublk_complete_rq(): skip ublk_unmap_io() for zero-copy requests

The feature remains disabled (ublk_support_shmem_zc() returns false)
until the UBLK_F_SHMEM_ZC flag is enabled in the next patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
Ming Lei [Tue, 31 Mar 2026 15:31:52 +0000 (23:31 +0800)] 
ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands

Add control commands for registering and unregistering shared memory
buffers for zero-copy I/O:

- UBLK_U_CMD_REG_BUF (0x18): pins pages from userspace, inserts PFN
  ranges into a per-device maple tree for O(log n) lookup during I/O.
  Buffer pointers are tracked in a per-device xarray. Returns the
  assigned buffer index.

- UBLK_U_CMD_UNREG_BUF (0x19): removes PFN entries and unpins pages.

Queue freeze/unfreeze is handled internally so userspace need not
quiesce the device during registration.

Also adds:
- UBLK_IO_F_SHMEM_ZC flag and addr encoding helpers in UAPI header
  (16-bit buffer index supporting up to 65536 buffers)
- Data structures (ublk_buf, ublk_buf_range) and xarray/maple tree
- __ublk_ctrl_reg_buf() helper for PFN insertion with error unwinding
- __ublk_ctrl_unreg_buf() helper for cleanup reuse
- ublk_support_shmem_zc() / ublk_dev_support_shmem_zc() stubs
  (returning false — feature not enabled yet)

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-2-ming.lei@redhat.com
[axboe: fixup ublk_buf_reg -> ublk_shmem_buf_reg errors, comments]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago drbd: use get_random_u64() where appropriate
David Carlier [Sun, 5 Apr 2026 15:47:04 +0000 (16:47 +0100)] 
drbd: use get_random_u64() where appropriate

Use the typed random integer helpers instead of
get_random_bytes() when filling a single integer variable.
The helpers return the value directly, require no pointer
or size argument, and better express intent.

Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260405154704.4610-1-devnexen@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago md/raid5: fix soft lockup in retry_aligned_read()
Chia-Ming Chang [Thu, 2 Apr 2026 06:14:06 +0000 (14:14 +0800)] 
md/raid5: fix soft lockup in retry_aligned_read()

When retry_aligned_read() encounters an overlapped stripe, it releases
the stripe via raid5_release_stripe() which puts it on the lockless
released_stripes llist. In the next raid5d loop iteration,
release_stripe_list() drains the stripe onto handle_list (since
STRIPE_HANDLE is set by the original IO), but retry_aligned_read()
runs before handle_active_stripes() and removes the stripe from
handle_list via find_get_stripe() -> list_del_init(). This prevents
handle_stripe() from ever processing the stripe to resolve the
overlap, causing an infinite loop and soft lockup.

Fix this by using __release_stripe() with temp_inactive_list instead
of raid5_release_stripe() in the failure path, so the stripe does not
go through the released_stripes llist. This allows raid5d to break out
of its loop, and the overlap will be resolved when the stripe is
eventually processed by handle_stripe().

Fixes: 773ca82fa1ee ("raid5: make release_stripe lockless")
Cc: stable@vger.kernel.org
Signed-off-by: FengWei Shih <dannyshih@synology.com>
Signed-off-by: Chia-Ming Chang <chiamingc@synology.com>
Link: https://lore.kernel.org/linux-raid/20260402061406.455755-1-chiamingc@synology.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago md: wake raid456 reshape waiters before suspend
Yu Kuai [Fri, 27 Mar 2026 14:07:29 +0000 (22:07 +0800)] 
md: wake raid456 reshape waiters before suspend

During raid456 reshape, direct IO across the reshape position can sleep
in raid5_make_request() waiting for reshape progress while still
holding an active_io reference. If userspace then freezes reshape and
writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io
and waits for all in-flight IO to drain.

This can deadlock: the IO needs reshape progress to continue, but the
reshape thread is already frozen, so the active_io reference is never
dropped and suspend never completes.

raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do
the same for normal md suspend when reshape is already interrupted, so
waiting raid456 IO can abort, drop its reference, and let suspend
finish.

The mdadm test tests/25raid456-reshape-deadlock reproduces the hang.

Fixes: 714d20150ed8 ("md: add new helpers to suspend/resume array")
Link: https://lore.kernel.org/linux-raid/20260327140729.2030564-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago md/raid1: serialize overlap io for writemostly disk
Xiao Ni [Tue, 24 Mar 2026 07:24:54 +0000 (15:24 +0800)] 
md/raid1: serialize overlap io for writemostly disk

Previously, using wait_event() would wake up all waiters simultaneously,
and they would compete for the tree lock. The bio which gets the lock
first will be handled, so the write sequence cannot be guaranteed.

For example:
bio1(100,200)
bio2(150,200)
bio3(150,300)

The write sequence on the fast device is bio1,bio2,bio3. But the write
sequence on the slow device could be bio1,bio3,bio2 due to lock competition.
This causes data corruption.

Replace the waitqueue with a FIFO list to guarantee the write sequence. The
list must also be iterated when an entry is removed; otherwise a waiting IO
may miss its opportunity to be woken up.

For example:
bio1(1,3), bio2(2,4)
bio3(5,7), bio4(6,8)
These four bios are in the same bucket. bio1 and bio3 are inserted into
the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the
first one. bio3 returns from slow disk and tries to wake up the waiting
bios. bio2 is removed from the list and will be handled. But bio1 hasn't
finished. So bio2 will be added into waiting list again. Then bio1 returns
from slow disk and wakes up waiting bios. bio4 is removed from the list
and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left
on the waiting list. So it needs to iterate the waiting list to wake up
the right bio.

Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260324072501.59865-1-xni@redhat.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago md/md-llbitmap: optimize initial sync with write_zeroes_unmap support
Yu Kuai [Mon, 23 Mar 2026 05:46:44 +0000 (13:46 +0800)] 
md/md-llbitmap: optimize initial sync with write_zeroes_unmap support

For RAID-456 arrays with llbitmap, if all underlying disks support
write_zeroes with unmap, issue write_zeroes to zero all disk data
regions and initialize the bitmap to BitCleanUnwritten instead of
BitUnwritten.

This optimization skips the initial XOR parity building because:
1. write_zeroes with unmap guarantees zeroed reads after the operation
2. For RAID-456, when all data is zero, parity is automatically
   consistent (0 XOR 0 XOR ... = 0)
3. BitCleanUnwritten indicates parity is valid but no user data
   has been written
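
Point 2 can be checked with a few lines of C. The sketch below computes plain XOR parity over in-memory buffers; `xor_parity()` and `all_zero()` are illustrative helpers and have nothing to do with the md implementation.

```c
#include <assert.h> /* used by the checks in the note below */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* XOR the same byte position of every data buffer into 'parity'. */
static void xor_parity(const uint8_t *const *data, size_t ndisks,
                       size_t len, uint8_t *parity)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;

        for (size_t d = 0; d < ndisks; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/* True if every byte of 'buf' is zero. */
static bool all_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return false;
    return true;
}
```

If every data buffer is zero-filled, `xor_parity()` produces an all-zero parity buffer, which is exactly what a zeroed parity region already contains. Once any data byte changes, the parity is no longer zero and has to be written.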

The implementation adds two helper functions:
- llbitmap_all_disks_support_wzeroes_unmap(): Checks if all active
  disks support write_zeroes with unmap
- llbitmap_zero_all_disks(): Issues blkdev_issue_zeroout() to each
  rdev's data region to zero all disks

The zeroing and bitmap state setting happens in llbitmap_init_state()
during bitmap initialization. If any disk fails to zero, we fall back
to BitUnwritten and normal lazy recovery.

This significantly reduces array initialization time for RAID-456
arrays built on modern NVMe SSDs or other devices that support
write_zeroes with unmap.

Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-4-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago md/md-llbitmap: add CleanUnwritten state for RAID-5 proactive parity building
Yu Kuai [Mon, 23 Mar 2026 05:46:43 +0000 (13:46 +0800)] 
md/md-llbitmap: add CleanUnwritten state for RAID-5 proactive parity building

Add new states to the llbitmap state machine to support proactive XOR
parity building for RAID-5 arrays. This allows users to pre-build parity
data for unwritten regions before any user data is written.

New states added:
- BitNeedSyncUnwritten: Transitional state when proactive sync is triggered
  via sysfs on Unwritten regions.
- BitSyncingUnwritten: Proactive sync in progress for unwritten region.
- BitCleanUnwritten: XOR parity has been pre-built, but no user data
  written yet. When user writes to this region, it transitions to BitDirty.

New actions added:
- BitmapActionProactiveSync: Trigger for proactive XOR parity building.
- BitmapActionClearUnwritten: Convert CleanUnwritten/NeedSyncUnwritten/
  SyncingUnwritten states back to Unwritten before recovery starts.

State flows:
- Current (lazy): Unwritten -> (write) -> NeedSync -> (sync) -> Dirty -> Clean
- New (proactive): Unwritten -> (sysfs) -> NeedSyncUnwritten -> (sync) -> CleanUnwritten
- On write to CleanUnwritten: CleanUnwritten -> (write) -> Dirty -> Clean
- On disk replacement: CleanUnwritten regions are converted to Unwritten
  before recovery starts, so recovery only rebuilds regions with user data
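
The state flows above can be written down as a small transition function. This is a simplified model, not the llbitmap state table: the `ST_`/`EV_` names are invented here, `EV_SETTLE` is a made-up label for the unlabeled Dirty -> Clean step, and the sync step is collapsed into a single event, so SyncingUnwritten appears only as a source state for the clear action.

```c
#include <assert.h> /* used by the check in the note below */

enum bit_state {
    ST_UNWRITTEN,
    ST_NEED_SYNC,
    ST_DIRTY,
    ST_CLEAN,
    ST_NEED_SYNC_UNWRITTEN,
    ST_SYNCING_UNWRITTEN,
    ST_CLEAN_UNWRITTEN,
};

enum bit_event {
    EV_WRITE,           /* user write to the region */
    EV_SYNC,            /* sync of the region finished */
    EV_SETTLE,          /* hypothetical label for Dirty -> Clean */
    EV_PROACTIVE_SYNC,  /* triggered via sysfs */
    EV_CLEAR_UNWRITTEN, /* before recovery starts */
};

/* Return the next state; unknown combinations leave the state unchanged. */
static enum bit_state next_state(enum bit_state s, enum bit_event e)
{
    switch (e) {
    case EV_WRITE:
        if (s == ST_UNWRITTEN)
            return ST_NEED_SYNC;
        if (s == ST_CLEAN_UNWRITTEN)
            return ST_DIRTY;
        break;
    case EV_SYNC:
        if (s == ST_NEED_SYNC)
            return ST_DIRTY;
        if (s == ST_NEED_SYNC_UNWRITTEN)
            return ST_CLEAN_UNWRITTEN;
        break;
    case EV_SETTLE:
        if (s == ST_DIRTY)
            return ST_CLEAN;
        break;
    case EV_PROACTIVE_SYNC:
        if (s == ST_UNWRITTEN)
            return ST_NEED_SYNC_UNWRITTEN;
        break;
    case EV_CLEAR_UNWRITTEN:
        if (s == ST_CLEAN_UNWRITTEN || s == ST_NEED_SYNC_UNWRITTEN ||
            s == ST_SYNCING_UNWRITTEN)
            return ST_UNWRITTEN;
        break;
    }
    return s;
}
```

Following the proactive flow, `next_state(next_state(ST_UNWRITTEN, EV_PROACTIVE_SYNC), EV_SYNC)` ends in `ST_CLEAN_UNWRITTEN`, and a later `EV_WRITE` moves the region to `ST_DIRTY`.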

A new sysfs interface is added at /sys/block/mdX/md/llbitmap/proactive_sync
(write-only) to trigger proactive sync. This only works for RAID-456 arrays.

Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-3-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago md: add fallback to correct bitmap_ops on version mismatch
Yu Kuai [Mon, 23 Mar 2026 05:46:42 +0000 (13:46 +0800)] 
md: add fallback to correct bitmap_ops on version mismatch

If the default bitmap version and the on-disk version don't match, and
mdadm is too old to set bitmap_type, set bitmap_ops based on the on-disk
version.

Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-2-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago md/raid5: validate payload size before accessing journal metadata
Junrui Luo [Sat, 4 Apr 2026 07:44:35 +0000 (15:44 +0800)] 
md/raid5: validate payload size before accessing journal metadata

r5c_recovery_analyze_meta_block() and
r5l_recovery_verify_data_checksum_for_mb() iterate over payloads in a
journal metadata block using on-disk payload size fields without
validating them against the remaining space in the metadata block.

A corrupted journal containing payload sizes that extend beyond the
PAGE_SIZE boundary can cause out-of-bounds reads when accessing payload
fields or computing offsets.

Add bounds validation for each payload type to ensure the full payload
fits within meta_size before processing.
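
The kind of bounds validation described above can be sketched generically. The helpers below are hypothetical and work on a plain list of payload sizes rather than the on-disk r5l structures; they only show how to test "offset + size <= meta_size" without risking integer overflow.

```c
#include <assert.h> /* used by the check in the note below */
#include <stdbool.h>
#include <stddef.h>

/* True if a payload of 'size' bytes at 'offset' lies within meta_size. */
static bool payload_fits(size_t offset, size_t size, size_t meta_size)
{
    /* Written as a subtraction so offset + size cannot overflow. */
    return offset <= meta_size && size <= meta_size - offset;
}

/*
 * Walk n payloads laid out back to back starting at 'start'.
 * Returns the number of payloads, or -1 as soon as one would overrun
 * the metadata block (an untrusted size must not be followed).
 */
static int count_valid_payloads(const size_t *sizes, size_t n,
                                size_t start, size_t meta_size)
{
    size_t offset = start;

    for (size_t i = 0; i < n; i++) {
        if (!payload_fits(offset, sizes[i], meta_size))
            return -1;
        offset += sizes[i];
    }
    return (int)n;
}
```

For a 4096-byte block, `payload_fits(4000, 96, 4096)` is true while `payload_fits(4000, 97, 4096)` is false, so a payload that claims one byte too many is rejected before any of its fields are read.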

Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1")
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Link: https://lore.kernel.org/linux-raid/SYBPR01MB78815E78D829BB86CD7C8015AF5FA@SYBPR01MB7881.ausprd01.prod.outlook.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago md: remove unused static md_wq workqueue
Abd-Alrhman Masalkhi [Sat, 28 Mar 2026 19:35:22 +0000 (22:35 +0300)] 
md: remove unused static md_wq workqueue

The md_wq workqueue is defined as static and initialized in md_init(),
but it is not used anywhere within md.c.

All asynchronous and deferred work in this file is handled via
md_misc_wq or dedicated md threads.

Fixes: b75197e86e6d3 ("md: Remove flush handling")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Link: https://lore.kernel.org/linux-raid/20260328193522.3624-1-abd.masalkhi@gmail.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago md/raid0: use kvzalloc/kvfree for strip_zone and devlist allocations
Gregory Price [Sun, 8 Mar 2026 23:42:02 +0000 (19:42 -0400)] 
md/raid0: use kvzalloc/kvfree for strip_zone and devlist allocations

syzbot reported a WARNING at mm/page_alloc.c:__alloc_frozen_pages_noprof()
triggered by create_strip_zones() in the RAID0 driver.

When raid_disks is large, the allocation size exceeds MAX_PAGE_ORDER (4MB
on x86), causing WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER).

Convert the strip_zone and devlist allocations from kzalloc/kzalloc_objs to
kvzalloc/kvzalloc_objs, which first attempts a contiguous allocation with
__GFP_NOWARN and then falls back to vmalloc for large sizes. Convert the
corresponding kfree calls to kvfree.

Both arrays are pure metadata lookup tables (arrays of pointers and zone
descriptors) accessed only via indexing, so they do not require physically
contiguous memory.

Reported-by: syzbot+924649752adf0d3ac9dd@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69adaba8.a00a0220.b130.0005.GAE@google.com/
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20260308234202.3118119-1-gourry@gourry.net/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago drbd: remove DRBD_GENLA_F_MANDATORY flag handling
Christoph Böhmwalder [Fri, 3 Apr 2026 13:29:53 +0000 (15:29 +0200)] 
drbd: remove DRBD_GENLA_F_MANDATORY flag handling

DRBD used a custom mechanism to mark netlink attributes as "mandatory":
bit 14 of nla_type was repurposed as DRBD_GENLA_F_MANDATORY. Attributes
sent from userspace that had this bit present and that were unknown
to the kernel would lead to an error.

Since commit ef6243acb478 ("genetlink: optionally validate strictly/dumps"),
the generic netlink layer rejects unknown top-level attributes when
strict validation is enabled. DRBD never opted out of strict
validation, so unknown top-level attributes are already rejected by
the netlink core.

The mandatory flag mechanism was required for nested attributes, because
these are parsed liberally, silently dropping attributes unknown to the
kernel.

This prepares for the move to a new YNL-based family, which will use the
now-default strict parsing.
The current family is not expected to gain any new attributes, which
makes this change safe.

Old userspace that still sets bit 14 is unaffected: nla_type()
strips it before __nla_validate_parse() performs attribute validation,
so the bit never reaches DRBD.
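
The masking behind that statement can be illustrated in userspace. The `EX_` constants below mirror, to the best of my understanding, the flag bits and type mask in the netlink UAPI header (bit 15 for nested, bit 14 for network byte order); `ex_nla_type()` is a local stand-in for the kernel's `nla_type()` and takes the raw 16-bit type field directly.

```c
#include <assert.h> /* used by the check in the note below */
#include <stdint.h>

/* Local copies of the flag bits, assumed to match the UAPI header. */
#define EX_NLA_F_NESTED        (1 << 15)
#define EX_NLA_F_NET_BYTEORDER (1 << 14)
#define EX_NLA_TYPE_MASK       (~(EX_NLA_F_NESTED | EX_NLA_F_NET_BYTEORDER))

/* Strip the two flag bits from a raw attribute type field. */
static int ex_nla_type(uint16_t raw_type)
{
    return raw_type & EX_NLA_TYPE_MASK;
}
```

An attribute of type 5 sent with bit 14 set (`5 | (1 << 14)`) is seen as plain type 5 after masking, which is why old userspace that still sets the bit keeps working.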

Remove all references to the mandatory flag in DRBD.

Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260403132953.2248751-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()
Yuto Ohnuki [Mon, 16 Mar 2026 07:03:59 +0000 (07:03 +0000)] 
blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()

wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:

- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered

syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.

wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.

Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.

Reported-by: syzbot+71fcf20f7c1e5043d78c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71fcf20f7c1e5043d78c
Fixes: 41afaeeda509 ("blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://patch.msgid.link/20260316070358.65225-2-ytohnuki@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago selftests: ublk: test that teardown after incomplete recovery completes
Uday Shankar [Mon, 6 Apr 2026 04:25:31 +0000 (22:25 -0600)] 
selftests: ublk: test that teardown after incomplete recovery completes

Before the fix, teardown of a ublk server that was attempting to recover
a device, but died after it had submitted a nonempty proper subset of the
fetch commands to any queue, would loop forever. Add a test to verify
that, after the fix, teardown completes. This is done by:

- Adding a new argument to the fault_inject target that causes it to die
  after fetching a nonempty proper subset of the IOs to a queue
- Using that argument in a new test while trying to recover an
  already-created device
- Attempting to delete the ublk device at the end of the test; this
  hangs forever if teardown from the fault-injected ublk server never
  completed.

It was manually verified that the test passes with the fix and hangs
without it.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260405-cancel-v2-2-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago  ublk: reset per-IO canceled flag on each fetch
Uday Shankar [Mon, 6 Apr 2026 04:25:30 +0000 (22:25 -0600)] 
ublk: reset per-IO canceled flag on each fetch

If a ublk server starts recovering devices but dies before issuing fetch
commands for all IOs, cancellation of the fetch commands that were
successfully issued may never complete. This is because the per-IO
canceled flag can remain set even after the fetch for that IO has been
submitted - the per-IO canceled flags for all IOs in a queue are reset
together only once all IOs for that queue have been fetched. So if a
nonempty proper subset of the IOs for a queue are fetched when the ublk
server dies, the IOs in that subset will never successfully be canceled,
as their canceled flags remain set, and this prevents ublk_cancel_cmd
from actually calling io_uring_cmd_done on the commands, despite the
fact that they are outstanding.

Fix this by resetting the per-IO cancel flags immediately when each IO
is fetched instead of waiting for all IOs for the queue (which may never
happen).
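
As an illustration only, the difference between the two reset policies
can be reproduced with a small user-space model. This is a sketch, not
the ublk driver code: the Io class, fetch(), cancel_all() and
stuck_after_partial_fetch() are hypothetical names, and "done" merely
stands in for io_uring_cmd_done() having been called.

```python
# Toy user-space model of the per-IO canceled flag; not the ublk driver code.
# All names here are hypothetical and chosen only for illustration.


class Io:
    def __init__(self):
        self.canceled = True   # left set by an earlier cancellation
        self.fetched = False   # a fetch command is outstanding for this IO
        self.done = False      # stands in for io_uring_cmd_done() having run


def fetch(queue, idx, reset_on_each_fetch):
    io = queue[idx]
    io.fetched = True
    if reset_on_each_fetch:
        # Fixed behaviour: clear the flag as soon as this IO is fetched.
        io.canceled = False
    elif all(i.fetched for i in queue):
        # Old behaviour: clear all flags only once the whole queue is fetched.
        for i in queue:
            i.canceled = False


def cancel_all(queue):
    """Complete every outstanding command whose canceled flag is clear."""
    for io in queue:
        if io.fetched and not io.canceled:
            io.canceled = True
            io.done = True


def stuck_after_partial_fetch(reset_on_each_fetch, depth=4, fetched=2):
    """Return how many outstanding commands cancellation fails to complete."""
    queue = [Io() for _ in range(depth)]
    for idx in range(fetched):      # the server dies after a proper subset
        fetch(queue, idx, reset_on_each_fetch)
    cancel_all(queue)
    return sum(1 for io in queue if io.fetched and not io.done)
```

In this model, stuck_after_partial_fetch(False) leaves both fetched
commands outstanding, while stuck_after_partial_fetch(True) leaves none.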

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes: 728cbac5fe21 ("ublk: move device reset into ublk_ch_release()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: zhang, the-essence-of-life <zhangweize9@gmail.com>
Link: https://patch.msgid.link/20260405-cancel-v2-1-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks ago  md: fix array_state=clear sysfs deadlock
Yu Kuai [Mon, 30 Mar 2026 05:52:13 +0000 (13:52 +0800)] 
md: fix array_state=clear sysfs deadlock

When "clear" is written to array_state, md_attr_store() breaks sysfs
active protection so the array can delete itself from its own sysfs
store method.

However, md_attr_store() currently drops the mddev reference before
calling sysfs_unbreak_active_protection(). Once do_md_stop(..., 0)
has made the mddev eligible for delayed deletion, the temporary
kobject reference taken by sysfs_break_active_protection() can become
the last kobject reference protecting the md kobject.

That allows sysfs_unbreak_active_protection() to drop the last
kobject reference from the current sysfs writer context. kobject
teardown then recurses into kernfs removal while the current sysfs
node is still being unwound, and lockdep reports recursive locking on
kn->active with kernfs_drain() in the call chain.

Reproducer on an existing level:
1. Create an md0 linear array and activate it:
   mknod /dev/md0 b 9 0
   echo none > /sys/block/md0/md/metadata_version
   echo linear > /sys/block/md0/md/level
   echo 1 > /sys/block/md0/md/raid_disks
   echo "$(cat /sys/class/block/sdb/dev)" > /sys/block/md0/md/new_dev
   echo "$(($(cat /sys/class/block/sdb/size) / 2))" > \
     /sys/block/md0/md/dev-sdb/size
   echo 0 > /sys/block/md0/md/dev-sdb/slot
   echo active > /sys/block/md0/md/array_state
2. Wait briefly for the array to settle, then clear it:
   sleep 2
   echo clear > /sys/block/md0/md/array_state

The warning looks like:

  WARNING: possible recursive locking detected
  bash/588 is trying to acquire lock:
  (kn->active#65) at __kernfs_remove+0x157/0x1d0
  but task is already holding lock:
  (kn->active#65) at sysfs_unbreak_active_protection+0x1f/0x40
  ...
  Call Trace:
   kernfs_drain
   __kernfs_remove
   kernfs_remove_by_name_ns
   sysfs_remove_group
   sysfs_remove_groups
   __kobject_del
   kobject_put
   md_attr_store
   kernfs_fop_write_iter
   vfs_write
   ksys_write

Restore active protection before mddev_put() so the extra sysfs
kobject reference is dropped while the mddev is still held alive. The
actual md kobject deletion is then deferred until after the sysfs
write path has fully returned.

Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260330055213.3976052-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
7 weeks ago  block: remove unused BVEC_ITER_ALL_INIT
Caleb Sander Mateos [Fri, 3 Apr 2026 18:48:51 +0000 (12:48 -0600)] 
block: remove unused BVEC_ITER_ALL_INIT

This macro no longer has any users, so remove it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://patch.msgid.link/20260403184852.2140919-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  bcache: fix uninitialized closure object
Mingzhe Zou [Fri, 3 Apr 2026 04:21:35 +0000 (12:21 +0800)] 
bcache: fix uninitialized closure object

In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and
crash"), we adopted a simple modification suggestion from AI to fix the
use-after-free.

But in actual testing, we found an extreme case where the device is
stopped before calling bch_write_bdev_super().

At this point, struct closure sb_write has not been initialized yet.
For this patch, we ensure that sb_bio has been completed via
sb_write_mutex.

Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com
Fixes: fec114a98b87 ("bcache: fix cached_dev.sb_bio use-after-free and crash")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  bcache: fix cached_dev.sb_bio use-after-free and crash
Mingzhe Zou [Sun, 22 Mar 2026 13:41:02 +0000 (21:41 +0800)] 
bcache: fix cached_dev.sb_bio use-after-free and crash

In our production environment, we have received multiple crash reports
regarding libceph, which have caught our attention:

```
[6888366.280350] Call Trace:
[6888366.280452]  blk_update_request+0x14e/0x370
[6888366.280561]  blk_mq_end_request+0x1a/0x130
[6888366.280671]  rbd_img_handle_request+0x1a0/0x1b0 [rbd]
[6888366.280792]  rbd_obj_handle_request+0x32/0x40 [rbd]
[6888366.280903]  __complete_request+0x22/0x70 [libceph]
[6888366.281032]  osd_dispatch+0x15e/0xb40 [libceph]
[6888366.281164]  ? inet_recvmsg+0x5b/0xd0
[6888366.281272]  ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[6888366.281405]  ceph_con_process_message+0x79/0x140 [libceph]
[6888366.281534]  ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
[6888366.281661]  ceph_con_workfn+0x329/0x680 [libceph]
```

After analyzing the coredump file, we found that the address of
dc->sb_bio has been freed. We know that cached_dev is only freed when it
is stopped.

sb_bio is embedded in struct cached_dev rather than allocated for each
write. If the device is stopped while the superblock is being written,
the freed memory is accessed at endio.

This patch waits for sb_write to complete in cached_dev_free().

It should be noted that we analyzed the cause of the problem, then gave
all the details to QWEN and adopted the modifications it made.

Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Fixes: cafe563591446 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  block: use sysfs_emit in sysfs show functions
Thorsten Blum [Thu, 2 Apr 2026 16:50:00 +0000 (18:50 +0200)] 
block: use sysfs_emit in sysfs show functions

Replace sprintf() with sysfs_emit() in sysfs show functions.
sysfs_emit() is preferred for formatting sysfs output because it
provides safer bounds checking.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260402164958.894879-4-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  blk-crypto: fix name of the bio completion callback
Christoph Hellwig [Wed, 1 Apr 2026 13:58:51 +0000 (15:58 +0200)] 
blk-crypto: fix name of the bio completion callback

Fix a simple naming issue in the documentation: the completion
routine is called bi_end_io and not bi_complete.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260401135854.125109-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  bio: fix kmemleak false positives from percpu bio alloc cache
Ming Lei [Thu, 26 Mar 2026 14:40:58 +0000 (22:40 +0800)] 
bio: fix kmemleak false positives from percpu bio alloc cache

When a bio is allocated from the mempool with REQ_ALLOC_CACHE set and
later completed, bio_put() places it into the per-cpu bio_alloc_cache
via bio_put_percpu_cache() instead of freeing it back to the
mempool/slab. The slab allocation remains tracked by kmemleak, but the
only reference to the bio is through the percpu cache's free_list,
which kmemleak fails to trace through percpu memory. This causes
kmemleak to report the cached bios as unreferenced objects.

Use symmetric kmemleak_free()/kmemleak_alloc() calls to properly track
bios across percpu cache transitions:

 - bio_put_percpu_cache: call kmemleak_free() when a bio enters the
   cache, unregistering it from kmemleak tracking.

 - bio_alloc_percpu_cache: call kmemleak_alloc() when a bio is taken
   from the cache for reuse, re-registering it so that genuine leaks
   of reused bios remain detectable.

 - __bio_alloc_cache_prune: call kmemleak_alloc() before bio_free() so
   that kmem_cache_free()'s internal kmemleak_free() has a matching
   allocation to pair with.
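
The pairing described above can be sketched with a toy user-space
model. This is not the kernel implementation: the "tracked" set stands
in for kmemleak's object table, the list stands in for the percpu
free_list, and the model deliberately raises on an unpaired free so that
a missing kmemleak_alloc() becomes visible. That strictness is a
property of the model only.

```python
# Toy model of kmemleak bookkeeping around a per-cpu cache; not kernel code.
# Function names mirror the commit text, but the bodies are illustrative.

tracked = set()      # objects the leak detector currently knows about
percpu_cache = []    # objects parked in the cache, invisible to the scanner


def kmemleak_alloc(obj):
    tracked.add(obj)


def kmemleak_free(obj):
    tracked.remove(obj)          # KeyError here means an unpaired free


def slab_alloc(obj):
    kmemleak_alloc(obj)          # allocation registers the object
    return obj


def slab_free(obj):
    kmemleak_free(obj)           # freeing unregisters it again


def cache_put(obj):
    kmemleak_free(obj)           # entering the cache: stop tracking
    percpu_cache.append(obj)


def cache_get():
    obj = percpu_cache.pop()
    kmemleak_alloc(obj)          # leaving the cache for reuse: track again
    return obj


def cache_prune():
    while percpu_cache:
        obj = percpu_cache.pop()
        kmemleak_alloc(obj)      # give slab_free() something to pair with
        slab_free(obj)


def false_positives():
    """Objects that are tracked but only reachable through the cache."""
    return tracked & set(percpu_cache)
```

With these three hooks in place, an object sitting in the cache is never
in the tracked set, so the model reports no false positives, while a
reused object is tracked again and a real leak of it would still show.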

Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260326144058.2392319-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  blk-iocost: fix busy_level reset when no IOs complete
Jialin Wang [Tue, 31 Mar 2026 10:05:09 +0000 (10:05 +0000)] 
blk-iocost: fix busy_level reset when no IOs complete

When a disk is saturated, it is common for no IOs to complete within a
timer period. In this case, rq_wait_pct and missed_ppm are currently
calculated as 0, so iocost incorrectly interprets this as meeting QoS
targets and resets busy_level to 0.

This reset prevents busy_level from reaching the threshold (4) needed
to reduce vrate. On certain cloud storage, such as Azure Premium SSD,
we observed that iocost may fail to reduce vrate for tens of seconds
during saturation, failing to mitigate noisy neighbor issues.

Fix this by tracking the number of IO completions (nr_done) in a period.
If nr_done is 0 and there are lagging IOs, the saturation status is
unknown, so we keep busy_level unchanged.
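
A heavily simplified model shows why the guard matters. Only the
"nr_done is 0 and there are lagging IOs" check is taken from the
description above; the rule that a missed QoS target adds 1 and a met
target resets to 0 is an assumption made for illustration, and the
function names are hypothetical. The real blk-iocost update logic has
more cases.

```python
# Heavily simplified model of the busy_level update; not the blk-iocost code.
# Only the "nr_done == 0 with lagging IOs" guard comes from the description;
# the +1 / reset-to-0 rules are illustrative assumptions.

VRATE_THRESHOLD = 4  # busy_level needed before vrate is reduced (per the text)


def next_busy_level(level, nr_done, nr_lagging, missed_qos, keep_when_unknown):
    if nr_done == 0:
        if keep_when_unknown and nr_lagging > 0:
            return level        # saturation status unknown: leave unchanged
        missed_qos = False      # no completions, so the miss stats read as 0
    return level + 1 if missed_qos else 0


def peak_busy_level(periods, keep_when_unknown):
    """Run the update over (nr_done, nr_lagging, missed_qos) periods."""
    level = peak = 0
    for nr_done, nr_lagging, missed_qos in periods:
        level = next_busy_level(level, nr_done, nr_lagging, missed_qos,
                                keep_when_unknown)
        peak = max(peak, level)
    return peak


# A saturated disk: periods that miss QoS alternate with periods in which
# nothing completes at all.
SATURATED = [(10, 5, True), (0, 5, False)] * 4
```

Under these assumptions, peak_busy_level(SATURATED, False) never gets
past 1, whereas peak_busy_level(SATURATED, True) reaches
VRATE_THRESHOLD.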

The issue is consistently reproducible on Azure Standard_D8as_v5 (Dasv5)
VMs with 512GB Premium SSD (P20) using the script below. It was not
observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd), and no
regressions were found with this patch. In this script, cgA performs
large IOs with iodepth=128, while cgB performs small IOs with iodepth=1
rate_iops=100 rw=randrw. With iocost enabled, we expect it to throttle
cgA: the submission latency (slat) of cgA should be significantly higher,
cgB should reach 200 IOPS, and its completion latency (clat) should be low.

  BLK_DEVID="8:0"
  MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566"
  QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000"

  echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model
  echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos

  CG_A="/sys/fs/cgroup/cgA"
  CG_B="/sys/fs/cgroup/cgB"

  FILE_A="/path/to/sda/A.fio.testfile"
  FILE_B="/path/to/sda/B.fio.testfile"
  RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)"

  mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR"

  get_result() {
    local file=$1
    local label=$2

    local results=$(jq -r '
    .jobs[0].mixed |
    ( .iops | tonumber | round ) as $iops |
    ( .bw_bytes / 1024 / 1024 ) as $bps |
    ( .slat_ns.mean / 1000000 ) as $slat |
    ( .clat_ns.mean / 1000000 ) as $avg |
    ( .clat_ns.max / 1000000 ) as $max |
    ( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 |
    ( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 |
    ( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 |
    ( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 |
    "\($iops)|\($bps)|\($slat)|\($avg)|\($max)|\($p90)|\($p99)|\($p999)|\($p9999)"
    ' "$file")

    IFS='|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results"
    printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \
           "$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999"
  }

  run_fio() {
    local cg_path=$1
    local filename=$2
    local name=$3
    local bs=$4
    local qd=$5
    local out=$6
    shift 6
    local extra=$@

    (
      pid=$(sh -c 'echo $PPID')
      echo $pid >"${cg_path}/cgroup.procs"
      fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \
          --ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \
          --time_based --group_reporting --unified_rw_reporting=mixed \
          --output-format=json --output="$out" $extra >/dev/null 2>&1
    ) &
  }

  echo "Starting Test ..."

  for bs_b in "4k" "32k" "256k"; do
    echo "Running iteration: BS=$bs_b"
    out_a="${RESULT_DIR}/cgA_1m.json"
    out_b="${RESULT_DIR}/cgB_${bs_b}.json"

    # cgA: Heavy background (BS 1MB, QD 128)
    run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a"
    # cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100)
    run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100"

    wait
    SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n'
    SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n'
  done

  echo -e "\nFinal Results Summary:\n"

  printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \
          "" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat"
  printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \
          "CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99"
  echo "$SUMMARY_DATA"

  echo "Results saved in $RESULT_DIR"

Before:
                          slat     clat     clat     clat     clat     clat     clat
  CGROUP   IOPS   MB/s    avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

  cgA-1m   166    166.37  3.44     748.95   1298.29  977.27   1233.13  1300.23  1300.23
  cgB-4k   5      0.02    0.02     181.74   761.32   742.39   759.17   759.17   759.17

  cgA-1m   167    166.51  1.98     748.68   1549.41  809.50   1451.23  1551.89  1551.89
  cgB-32k  6      0.18    0.02     169.98   761.76   742.39   759.17   759.17   759.17

  cgA-1m   166    165.55  2.89     750.89   1540.37  851.44   1451.23  1535.12  1535.12
  cgB-256k 5      1.30    0.02     191.35   759.51   750.78   759.17   759.17   759.17

After:
                          slat     clat     clat     clat     clat     clat     clat
  CGROUP   IOPS   MB/s    avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

  cgA-1m   162    162.48  6.14     749.69   850.02   826.28   834.67   843.06   851.44
  cgB-4k   199    0.78    0.01     1.95     42.12    2.57     7.50     34.87    42.21

  cgA-1m   146    146.20  6.83     833.04   908.68   893.39   901.78   910.16   910.16
  cgB-32k  200    6.25    0.01     2.32     31.40    3.06     7.50     16.58    31.33

  cgA-1m   110    110.46  9.04     1082.67  1197.91  1182.79  1199.57  1199.57  1199.57
  cgB-256k 200    49.98   0.02     3.69     22.20    4.88     9.11     20.05    22.15

Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260331100509.182882-1-wjl.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()
Jackie Liu [Tue, 31 Mar 2026 08:50:54 +0000 (16:50 +0800)] 
blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()

Add the missing put_disk() on the error path in
blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or
blkg_tryget() fails, the function jumps to the out label which only
calls rcu_read_unlock() but does not release the disk reference acquired
by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk
is already set to NULL before the lookup, blkcg_exit() cannot release
this reference either, causing the disk to never be freed.

Restore the reference release that was present as blk_put_queue() in the
original code but was inadvertently dropped during the conversion from
request_queue to gendisk.
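
The leak can be pictured with a toy reference-count model. This is a
sketch under stated assumptions, not the blk-cgroup code: Disk,
schedule_throttle(), maybe_throttle_current() and leaked_refs() are
hypothetical names, a plain counter stands in for the device reference,
and the lookup failures are collapsed into one lookup_ok flag.

```python
# Toy reference-count model; not the blk-cgroup code. Names are hypothetical.


class Disk:
    def __init__(self):
        self.refs = 0


def schedule_throttle(task, disk):
    disk.refs += 1                   # reference taken when throttling is scheduled
    task["throttle_disk"] = disk


def maybe_throttle_current(task, lookup_ok, put_on_error):
    disk = task["throttle_disk"]
    task["throttle_disk"] = None     # cleared before the lookups
    if disk is None:
        return
    if not lookup_ok:
        if put_on_error:
            disk.refs -= 1           # the put that the fix adds
        return
    disk.refs -= 1                   # the normal path drops the reference


def leaked_refs(lookup_ok, put_on_error):
    task, disk = {"throttle_disk": None}, Disk()
    schedule_throttle(task, disk)
    maybe_throttle_current(task, lookup_ok, put_on_error)
    return disk.refs
```

Because the task's pointer is already cleared, nothing else in the model
can drop the reference after a failed lookup, which mirrors why the exit
path cannot release it.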

Fixes: f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  zloop: add max_open_zones option
Damien Le Moal [Thu, 26 Mar 2026 20:32:45 +0000 (05:32 +0900)] 
zloop: add max_open_zones option

Introduce the new max_open_zones option to allow specifying a limit on
the maximum number of open zones of a zloop device. This change allows
creating a zloop device that can more closely mimic the characteristics
of a physical SMR drive.

When set to a non-zero value, only up to max_open_zones zones can be in
the implicit open (BLK_ZONE_COND_IMP_OPEN) and explicit open
(BLK_ZONE_COND_EXP_OPEN) conditions at any time. The transition to the
implicit open condition of a zone on a write operation can result in an
implicit close of an already implicitly open zone. This is handled in
the function zloop_do_open_zone(). This function also handles
transitions to the explicit open condition. Implicit close transitions
are handled using an LRU ordered list of open zones which is managed
using the helper functions zloop_lru_rotate_open_zone() and
zloop_lru_remove_open_zone().
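
The LRU handling can be sketched in user space as follows. This is a
minimal model, not the zloop driver: the ZoneSet class and its method
are hypothetical, and two behaviours are assumptions made for the
sketch, namely that an explicit open may also evict an implicitly open
zone and that an open fails when every open zone is explicitly open.

```python
# Minimal user-space model of an open-zone limit with LRU implicit close.
# Not the zloop driver; class, method and constant names are hypothetical.
from collections import OrderedDict

IMP_OPEN = "implicit open"
EXP_OPEN = "explicit open"


class ZoneSet:
    def __init__(self, max_open_zones):
        self.max_open = max_open_zones   # 0 means no limit
        self.open = OrderedDict()        # zone number -> condition, LRU first

    def open_zone(self, zno, explicit=False):
        """Open a zone, implicitly closing the LRU implicit-open zone if needed."""
        if zno in self.open:
            if explicit:
                self.open[zno] = EXP_OPEN
            self.open.move_to_end(zno)   # rotate to most recently used
            return True
        if self.max_open and len(self.open) >= self.max_open:
            victim = next((z for z, cond in self.open.items()
                           if cond == IMP_OPEN), None)
            if victim is None:
                return False             # assumption: nothing can be closed
            del self.open[victim]        # implicit close of the LRU zone
        self.open[zno] = EXP_OPEN if explicit else IMP_OPEN
        return True
```

For example, with a limit of two, opening zones 0 and 1, writing to
zone 0 again and then writing to zone 2 closes zone 1, because zone 0
was rotated to the most recently used position.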

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326203245.946830-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  block: fix zones_cond memory leak on zone revalidation error paths
Jackie Liu [Tue, 31 Mar 2026 11:12:16 +0000 (19:12 +0800)] 
block: fix zones_cond memory leak on zone revalidation error paths

When blk_revalidate_disk_zones() fails after disk_revalidate_zone_resources()
has allocated args.zones_cond, the memory is leaked because no error path
frees it.

Fixes: 6e945ffb6555 ("block: use zone condition to determine conventional zones")
Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Link: https://patch.msgid.link/20260331111216.24242-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  loop: fix partition scan race between udev and loop_reread_partitions()
Daan De Meyer [Tue, 31 Mar 2026 10:51:28 +0000 (10:51 +0000)] 
loop: fix partition scan race between udev and loop_reread_partitions()

When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following
sequence occurs:

  1. disk_force_media_change() sets GD_NEED_PART_SCAN
  2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
  3. loop_global_unlock() releases the lock
  4. loop_reread_partitions() calls bdev_disk_changed() to scan

There is a race between steps 2 and 4: when udev receives the uevent
and opens the device before loop_reread_partitions() runs,
blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls
bdev_disk_changed() for a first scan. Then loop_reread_partitions()
does a second scan. The open_mutex serializes these two scans, but
does not prevent both from running.

The second scan in bdev_disk_changed() drops all partition devices
from the first scan (via blk_drop_partitions()) before re-adding
them, causing partition block devices to briefly disappear. This
breaks any systemd unit with BindsTo= on the partition device: systemd
observes the device going dead, fails the dependent units, and does
not retry them when the device reappears.

Fix this by removing the GD_NEED_PART_SCAN set from
disk_force_media_change() entirely. None of the current callers need
the lazy on-open partition scan triggered by this flag:

  - floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always
    false and GD_NEED_PART_SCAN has no effect.
  - loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is
    set, loop_reread_partitions() performs an explicit scan. When not
    set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
  - loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if
    LO_FLAGS_PARTSCAN is set.
  - nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately
    after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.

With GD_NEED_PART_SCAN no longer set by disk_force_media_change(),
udev opening the loop device after the uevent no longer triggers a
redundant scan in blkdev_get_whole(), and only the single explicit
scan from loop_reread_partitions() runs.

A regression test for this bug has been submitted to blktests:
https://github.com/linux-blktests/blktests/pull/240.

Fixes: 9f65c489b68d ("loop: raise media_change event")
Signed-off-by: Daan De Meyer <daan@amutable.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks ago  sed-opal: Add STACK_RESET command
Milan Broz [Tue, 10 Mar 2026 09:53:49 +0000 (10:53 +0100)] 
sed-opal: Add STACK_RESET command

The TCG Opal device could enter a state where no new session can be
created, blocking even Discovery or PSID reset. While a power cycle
or waiting for the timeout should work, there is another possibility
for recovery: using the Stack Reset command.

The Stack Reset command is defined in the TCG Storage Architecture Core
Specification and is mandatory for all Opal devices (see Section 3.3.6
of the Opal SSC specification).

This patch implements the Stack Reset command. Sending it should clear
all active sessions immediately, allowing subsequent commands to run
successfully. While it is a TCG transport layer command, the Linux
kernel implements only Opal ioctls, so it makes sense to use the
IOC_OPAL ioctl interface.

The Stack Reset takes no arguments; the response can be success or pending.
If the command reports a pending state, userspace can try to repeat it;
in this case, the code returns -EBUSY.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Ondrej Kozina <okozina@redhat.com>
Link: https://patch.msgid.link/20260310095349.411287-1-gmazyland@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months ago  Merge tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme into for-7.1/block
Jens Axboe [Fri, 27 Mar 2026 15:51:17 +0000 (09:51 -0600)] 
Merge tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme into for-7.1/block

Pull NVMe updates from Keith:

"- Fabrics authentication updates (Eric, Alistair)
 - Enhanced block queue limits support (Caleb)
 - Workqueue usage updates (Marco)
 - A new write zeroes device quirk (Robert)
 - Tagset cleanup fix for loop device (Nilay)"

* tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme: (41 commits)
  nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
  nvme: add WQ_PERCPU to alloc_workqueue users
  nvmet-fc: add WQ_PERCPU to alloc_workqueue users
  nvmet: replace use of system_wq with system_percpu_wq
  nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
  nvme: Add the DHCHAP maximum HD IDs
  nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
  nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
  nvmet: report NPDGL and NPDAL
  nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
  nvme: set discard_granularity from NPDG/NPDA
  nvme: add from0based() helper
  nvme: always issue I/O Command Set specific Identify Namespace
  nvme: update nvme_id_ns OPTPERF constants
  nvme: fold nvme_config_discard() into nvme_update_disk_info()
  nvme: add preferred I/O size fields to struct nvme_id_ns_nvm
  nvme: Allow reauth from sysfs
  nvme: Expose the tls_configured sysfs for secure concat connections
  nvmet-tcp: Don't free SQ on authentication success
  nvmet-tcp: Don't error if TLS is enabed on a reset
  ...

2 months ago  nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
Nilay Shroff [Fri, 13 Mar 2026 11:38:48 +0000 (17:08 +0530)] 
nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown

Cancelling the I/O and admin tagsets during nvme-loop controller reset
or shutdown is unnecessary. The subsequent destruction of the I/O and
admin queues already waits for all in-flight target operations to
complete.

Cancelling the tagsets first also opens a race window. After a request
tag has been cancelled, a late completion from the target may still
arrive before the queues are destroyed. In that case the completion path
may access a request whose tag has already been cancelled or freed,
which can lead to a kernel crash. Please see below the kernel crash
encountered while running blktests nvme/040:

run blktests nvme/040 at 2026-03-08 06:34:27
loop0: detected capacity change from 0 to 2097152
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
nvme nvme6: creating 96 I/O queues.
nvme nvme6: new ctrl: "blktests-subsystem-1"
nvme_log_error: 1 callbacks suppressed
block nvme6n1: no usable path - requeuing I/O
nvme6c6n1: Read(0x2) @ LBA 2096384, 128 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
blk_print_req_error: 1 callbacks suppressed
I/O error, dev nvme6c6n1, sector 2096384 op 0x0:(READ) flags 0x2880700 phys_seg 1 prio class 2
block nvme6n1: no usable path - requeuing I/O
Kernel attempted to read user page (236) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000236
Faulting instruction address: 0xc000000000961274
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nvme_loop nvme_fabrics loop nvmet null_blk rpadlpar_io rpaphp xsk_diag bonding rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink pseries_rng dax_pmem vmx_crypto drm drm_panel_orientation_quirks xfs mlx5_core nvme bnx2x sd_mod nd_pmem nd_btt nvme_core sg papr_scm tls libnvdimm ibmvscsi ibmveth scsi_transport_srp nvme_keyring nvme_auth mdio hkdf pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: loop]
CPU: 25 UID: 0 PID: 0 Comm: swapper/25 Kdump: loaded Not tainted 7.0.0-rc3+ #14 PREEMPT
Hardware name: IBM,9043-MRX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RF1120_128) hv:phyp pSeries
NIP:  c000000000961274 LR: c008000009af1808 CTR: c00000000096124c
REGS: c0000007ffc0f910 TRAP: 0300   Not tainted  (7.0.0-rc3+)
MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22222222  XER: 00000000
CFAR: c008000009af232c DAR: 0000000000000236 DSISR: 40000000 IRQMASK: 0
GPR00: c008000009af17fc c0000007ffc0fbb0 c000000001c78100 c0000000be05cc00
GPR04: 0000000000000001 0000000000000000 0000000000000007 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000002 c008000009af2318
GPR12: c00000000096124c c0000007ffdab880 0000000000000000 0000000000000000
GPR16: 0000000000000010 0000000000000000 0000000000000004 0000000000000000
GPR20: 0000000000000001 c000000002ca2b00 0000000100043bb2 000000000000000a
GPR24: 000000000000000a 0000000000000000 0000000000000000 0000000000000000
GPR28: c000000084021d40 c000000084021d50 c0000000be05cd60 c0000000be05cc00
NIP [c000000000961274] blk_mq_complete_request_remote+0x28/0x2d4
LR [c008000009af1808] nvme_loop_queue_response+0x110/0x290 [nvme_loop]
Call Trace:
 0xc00000000502c640 (unreliable)
 nvme_loop_queue_response+0x104/0x290 [nvme_loop]
 __nvmet_req_complete+0x80/0x498 [nvmet]
 nvmet_req_complete+0x24/0xf8 [nvmet]
 nvmet_bio_done+0x58/0xcc [nvmet]
 bio_endio+0x250/0x390
 blk_update_request+0x2e8/0x68c
 blk_mq_end_request+0x30/0x5c
 lo_complete_rq+0x94/0x110 [loop]
 blk_complete_reqs+0x78/0x98
 handle_softirqs+0x148/0x454
 do_softirq_own_stack+0x3c/0x50
 __irq_exit_rcu+0x18c/0x1b4
 irq_exit+0x1c/0x34
 do_IRQ+0x114/0x278
 hardware_interrupt_common_virt+0x28c/0x290

Since the queue teardown path already guarantees that all target-side
operations have completed, cancelling the tagsets is redundant and
unsafe. So avoid cancelling the I/O and admin tagsets during controller
reset and shutdown.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months ago  nvme: add WQ_PERCPU to alloc_workqueue users
Marco Crivellari [Mon, 23 Feb 2026 10:23:28 +0000 (11:23 +0100)] 
nvme: add WQ_PERCPU to alloc_workqueue users

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months ago  nvmet-fc: add WQ_PERCPU to alloc_workqueue users
Marco Crivellari [Mon, 23 Feb 2026 10:23:29 +0000 (11:23 +0100)] 
nvmet-fc: add WQ_PERCPU to alloc_workqueue users

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Cc: Justin Tee <justin.tee@broadcom.com>
Cc: Naresh Gottumukkala <nareshgottumukkala83@gmail.com>
CC: Paul Ely <paul.ely@broadcom.com>
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months ago  nvmet: replace use of system_wq with system_percpu_wq
Marco Crivellari [Mon, 23 Feb 2026 10:23:27 +0000 (11:23 +0100)] 
nvmet: replace use of system_wq with system_percpu_wq

This patch continues the effort to refactor workqueue APIs, which began
with the changes introducing new workqueues and a new alloc_workqueue flag:

   commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.

Before that can happen, workqueue users must be converted to the
better-named new workqueues with no intended behaviour changes:

   system_wq -> system_percpu_wq
   system_unbound_wq -> system_dfl_wq

This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
Alistair Francis [Fri, 20 Mar 2026 00:20:45 +0000 (10:20 +1000)] 
nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C

Section 8.3.4.5.2 of the NVMe 2.1 base spec states that

"""
The 00h identifier shall not be proposed in an AUTH_Negotiate message
that requests secure channel concatenation (i.e., with the SC_C field
set to a non-zero value).
"""

We need to ensure that we don't set the NVME_AUTH_DHGROUP_NULL idlist if
SC_C is set.
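
The rule quoted above can be sketched in userspace C as follows. This is only an illustration, not the driver's code: the function name, the `sc_c` parameter, and the local enum are hypothetical, and the numeric DH group identifiers are assumed to follow the spec's ordering with 00h as the NULL group.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed identifier values; 0x00 is the NULL DH group from the quote above. */
enum {
	DHGROUP_NULL = 0x00,
	DHGROUP_2048 = 0x01,
	DHGROUP_3072 = 0x02,
	DHGROUP_4096 = 0x03,
	DHGROUP_6144 = 0x04,
	DHGROUP_8192 = 0x05,
};

/*
 * Fill idlist with the DH groups to propose and return how many were written.
 * When sc_c is non-zero (secure channel concatenation requested), the NULL
 * group is skipped so that it is never proposed.
 */
static size_t fill_dhgroup_idlist(uint8_t *idlist, size_t cap, int sc_c)
{
	static const uint8_t groups[] = {
		DHGROUP_NULL, DHGROUP_2048, DHGROUP_3072,
		DHGROUP_4096, DHGROUP_6144, DHGROUP_8192,
	};
	size_t n = 0;

	for (size_t i = 0; i < sizeof(groups) / sizeof(groups[0]); i++) {
		if (sc_c && groups[i] == DHGROUP_NULL)
			continue;	/* must not propose 00h with SC_C set */
		if (n == cap)
			break;		/* caller's buffer is full */
		idlist[n++] = groups[i];
	}
	return n;
}
```

With `sc_c` set to zero the list starts with the NULL group; with any non-zero `sc_c` it starts with the first real DH group.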

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: Add the DHCHAP maximum HD IDs
Alistair Francis [Fri, 20 Mar 2026 00:20:44 +0000 (10:20 +1000)] 
nvme: Add the DHCHAP maximum HD IDs

In preparation for using the DHCHAP length in upcoming host and target
patches, let's add the hash and Diffie-Hellman ID length macros.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
Robert Beckett [Fri, 20 Mar 2026 19:22:09 +0000 (19:22 +0000)] 
nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4

The Kingston OM3SGP42048K2-A00 (PCI ID 2646:502f) firmware has a race
condition when processing concurrent write zeroes and DSM (discard)
commands, causing spurious "LBA Out of Range" errors and IOMMU page
faults at address 0x0.

The issue is reliably triggered by running two concurrent mkfs commands
on different partitions of the same drive, which generates interleaved
write zeroes and discard operations.

Disable write zeroes for this device, matching the pattern used for
other Kingston OM* drives that have similar firmware issues.

Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
Robert Beckett [Fri, 20 Mar 2026 19:22:08 +0000 (19:22 +0000)] 
nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set

The NVM Command Set Identify Controller data may report a non-zero
Write Zeroes Size Limit (wzsl). When present, nvme_init_non_mdts_limits()
unconditionally overrides max_zeroes_sectors from wzsl, even if
NVME_QUIRK_DISABLE_WRITE_ZEROES previously set it to zero.

This effectively re-enables write zeroes for devices that need it
disabled, defeating the quirk. Several Kingston OM* drives rely on
this quirk to avoid firmware issues with write zeroes commands.

Check for the quirk before applying the wzsl override.
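
The ordering problem can be sketched as below. It is a simplified userspace model, not the driver's function: the struct, the quirk bit value, and the helper name are hypothetical, and the limit is assumed to be already converted to sectors (the real wzsl field needs a unit conversion first).

```c
#include <stdint.h>

#define QUIRK_DISABLE_WRITE_ZEROES_SKETCH (1u << 0)	/* hypothetical bit value */

/* Hypothetical stand-in for the controller state involved. */
struct ctrl_sketch {
	unsigned int quirks;
	uint32_t max_zeroes_sectors;
};

/*
 * Apply a Write Zeroes Size Limit (assumed here to be in sectors).
 * The quirk is checked first, so a non-zero limit reported by the device
 * can no longer re-enable write zeroes.
 */
static void apply_wzsl_sketch(struct ctrl_sketch *ctrl, uint32_t wzsl_sectors)
{
	if (ctrl->quirks & QUIRK_DISABLE_WRITE_ZEROES_SKETCH) {
		ctrl->max_zeroes_sectors = 0;
		return;
	}
	if (wzsl_sectors)
		ctrl->max_zeroes_sectors = wzsl_sectors;
}
```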

Fixes: 5befc7c26e5a ("nvme: implement non-mdts command limits")
Cc: stable@vger.kernel.org
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvmet: report NPDGL and NPDAL
Caleb Sander Mateos [Fri, 27 Feb 2026 20:23:53 +0000 (13:23 -0700)] 
nvmet: report NPDGL and NPDAL

A block device with a very large discard_granularity queue limit may not
be able to report it in the 16-bit NPDG and NPDA fields in the Identify
Namespace data structure. For this reason, version 2.1 of the NVMe specs
added 32-bit fields NPDGL and NPDAL to the NVM Command Set Specific
Identify Namespace structure. So report the discard_granularity there
too and set OPTPERF to 11b to indicate those fields are supported.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
Caleb Sander Mateos [Fri, 27 Feb 2026 20:23:52 +0000 (13:23 -0700)] 
nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT

Use the NVME_NS_FEAT_OPTPERF_SHIFT constant in nvmet_bdev_set_limits()
to set the OPTPERF bits of the nvme_id_ns NSFEAT field instead of the
magic number 4.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: set discard_granularity from NPDG/NPDA
Caleb Sander Mateos [Fri, 27 Feb 2026 20:23:51 +0000 (13:23 -0700)] 
nvme: set discard_granularity from NPDG/NPDA

Currently, nvme_config_discard() always sets the discard_granularity
queue limit to the logical block size. However, NVMe namespaces can
advertise a larger preferred discard granularity in the NPDG or NPDA
field of the Identify Namespace structure or the NPDGL or NPDAL fields
of the I/O Command Set Specific Identify Namespace structure.

Use these fields to compute the discard_granularity limit. The logic is
somewhat involved. First, the fields are optional. NPDG is only reported
if the low bit of OPTPERF is set in NSFEAT. NPDA is reported if any bit
of OPTPERF is set. And NPDGL and NPDAL are reported if the high bit of
OPTPERF is set. NPDGL and NPDAL can also each be set to 0 to opt out of
reporting a limit. I/O Command Set Specific Identify Namespace may also
not be supported by older NVMe controllers. Another complication is that
multiple values may be reported among NPDG, NPDGL, NPDA, and NPDAL. The
spec says to prefer the values reported in the L variants. The spec says
NPDG should be a multiple of NPDA and NPDGL should be a multiple of
NPDAL, but it doesn't specify a relationship between NPDG and NPDAL or
NPDGL and NPDA. So use the maximum of the reported NPDG(L) and NPDA(L)
values as the discard_granularity.
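
The selection logic described above might look roughly like this in userspace C. The struct and function names are hypothetical, the fields are assumed to be already converted to host byte order, NPDG and NPDA are assumed to be 0's based, and NPDGL and NPDAL are assumed to be plain logical block counts where 0 means "not reported". The result is in logical blocks, so 1 corresponds to the old default of one logical block.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the relevant Identify Namespace fields. */
struct discard_hints {
	uint8_t  optperf;	/* 2-bit OPTPERF value from NSFEAT */
	uint16_t npdg;		/* 0's based; valid if OPTPERF bit 0 is set */
	uint16_t npda;		/* 0's based; valid if any OPTPERF bit is set */
	bool     have_nvm_id;	/* command set specific Identify succeeded */
	uint32_t npdgl;		/* valid if OPTPERF bit 1 is set; 0 = no limit */
	uint32_t npdal;		/* valid if OPTPERF bit 1 is set; 0 = no limit */
};

/* Return the discard granularity in logical blocks. */
static uint32_t discard_granularity_blocks(const struct discard_hints *h)
{
	uint32_t npdg = 1, npda = 1;	/* default: one logical block */

	if (h->optperf & 0x1)
		npdg = (uint32_t)h->npdg + 1;
	if (h->optperf)
		npda = (uint32_t)h->npda + 1;

	/* Prefer the 32-bit L variants whenever they are reported. */
	if ((h->optperf & 0x2) && h->have_nvm_id) {
		if (h->npdgl)
			npdg = h->npdgl;
		if (h->npdal)
			npda = h->npdal;
	}

	/* No relationship is guaranteed across variants, so take the max. */
	return npdg > npda ? npdg : npda;
}
```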

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: add from0based() helper
Caleb Sander Mateos [Fri, 27 Feb 2026 20:23:50 +0000 (13:23 -0700)] 
nvme: add from0based() helper

The NVMe specifications are big fans of "0's based"/"0-based" fields for
encoding values that must be positive. The encoded value is 1 less than
the value it represents. nvmet already provides a helper to0based() for
encoding 0's based values, so add a corresponding helper to decode these
fields on the host side.
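
A minimal userspace sketch of the encoding, for illustration only. The `_sketch` names are hypothetical and the 16-bit field width is an assumption; the in-tree helpers may differ in types and clamping.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a positive value as a 16-bit 0's based field (stored = value - 1). */
static inline uint16_t to0based_sketch(uint32_t value)
{
	assert(value >= 1);	/* a 0's based field cannot represent zero */
	if (value > 65536)
		value = 65536;	/* clamp to the largest representable value */
	return (uint16_t)(value - 1);
}

/* Decode a 16-bit 0's based field back to the value it represents. */
static inline uint32_t from0based_sketch(uint16_t field)
{
	return (uint32_t)field + 1;
}
```

Decoding widens to 32 bits before adding 1, so a field value of 0xFFFF decodes to 65536 instead of wrapping to 0.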

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: always issue I/O Command Set specific Identify Namespace
Caleb Sander Mateos [Fri, 27 Feb 2026 20:23:49 +0000 (13:23 -0700)] 
nvme: always issue I/O Command Set specific Identify Namespace

Currently, the I/O Command Set specific Identify Namespace structure is
only fetched for controllers that support extended LBA formats. This is
because struct nvme_id_ns_nvm is only used by nvme_configure_pi_elbas(),
which is only called when the ELBAS bit is set in the CTRATT field of
the Identify Controller structure.

However, the I/O Command Set specific Identify Namespace structure will
soon be used in nvme_update_disk_info(), so always try to obtain it in
nvme_update_ns_info_block(). This Identify structure is first defined in
NVMe spec version 2.0, but controllers reporting older versions could
still implement it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: update nvme_id_ns OPTPERF constants
Caleb Sander Mateos [Fri, 27 Feb 2026 20:23:48 +0000 (13:23 -0700)] 
nvme: update nvme_id_ns OPTPERF constants

In NVMe version 2.0 and below, OPTPERF comprises only bit 4 of NSFEAT in
the Identify Namespace structure. Since version 2.1, OPTPERF includes
both bits 4 and 5 of NSFEAT. Replace the NVME_NS_FEAT_IO_OPT constant
with NVME_NS_FEAT_OPTPERF_SHIFT, NVME_NS_FEAT_OPTPERF_MASK, and
NVME_NS_FEAT_OPTPERF_MASK_2_1, representing the first bit, pre-2.1 bit
width, and post-2.1 bit width of OPTPERF.

Update nvme_update_disk_info() to check both OPTPERF bits for
controllers that report version 2.1 or newer, as NPWG and NOWS are
supported even if only bit 5 is set.
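
As an illustration, the version-dependent width could be handled as sketched below. The constant names come from the message above, but their numeric values are assumptions derived from the bit positions it describes, and the helper with its boolean version parameter is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: OPTPERF starts at bit 4 and is 1 or 2 bits wide. */
#define NVME_NS_FEAT_OPTPERF_SHIFT	4
#define NVME_NS_FEAT_OPTPERF_MASK	0x1	/* NVMe 2.0 and below */
#define NVME_NS_FEAT_OPTPERF_MASK_2_1	0x3	/* NVMe 2.1 and newer */

/* Extract OPTPERF from NSFEAT using the width implied by the version. */
static inline uint8_t optperf_sketch(uint8_t nsfeat, bool is_2_1_or_newer)
{
	uint8_t mask = is_2_1_or_newer ? NVME_NS_FEAT_OPTPERF_MASK_2_1
				       : NVME_NS_FEAT_OPTPERF_MASK;

	return (nsfeat >> NVME_NS_FEAT_OPTPERF_SHIFT) & mask;
}
```

A namespace with only bit 5 set yields a non-zero OPTPERF on a 2.1 controller, while the same NSFEAT byte reads as 0 under the pre-2.1 interpretation.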

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: fold nvme_config_discard() into nvme_update_disk_info()
Caleb Sander Mateos [Fri, 27 Feb 2026 20:23:47 +0000 (13:23 -0700)] 
nvme: fold nvme_config_discard() into nvme_update_disk_info()

The choice of what queue limits are set in nvme_update_disk_info() vs.
nvme_config_discard() seems a bit arbitrary. A subsequent commit will
compute the discard_granularity limit using struct nvme_id_ns, which is
only passed to nvme_update_disk_info() currently. So move the logic in
nvme_config_discard() to nvme_update_disk_info(). Replace several
instances of ns->ctrl in nvme_update_disk_info() with the ctrl variable
brought from nvme_config_discard().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: add preferred I/O size fields to struct nvme_id_ns_nvm
Caleb Sander Mateos [Fri, 27 Feb 2026 20:23:46 +0000 (13:23 -0700)] 
nvme: add preferred I/O size fields to struct nvme_id_ns_nvm

A subsequent change will use the NPDGL and NPDAL fields of the NVM
Command Set Specific Identify Namespace structure, so add them (and the
handful of intervening fields) to struct nvme_id_ns_nvm. Add an
assertion that the size is still 4 KB.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: Allow reauth from sysfs
Alistair Francis [Tue, 2 Dec 2025 05:17:55 +0000 (15:17 +1000)] 
nvme: Allow reauth from sysfs

Allow userspace to trigger a reauth (REPLACETLSPSK) from sysfs.
This can be done by writing a zero to the sysfs file.

echo 0 > /sys/devices/virtual/nvme-fabrics/ctl/nvme0/tls_configured_key

In order to use the new keys for the admin queue we call controller
reset. This isn't ideal, but I can't find a simpler way to reset the
admin queue TLS connection.

Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme: Expose the tls_configured sysfs for secure concat connections
Alistair Francis [Tue, 2 Dec 2025 05:17:54 +0000 (15:17 +1000)] 
nvme: Expose the tls_configured sysfs for secure concat connections

Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvmet-tcp: Don't free SQ on authentication success
Alistair Francis [Tue, 2 Dec 2025 05:17:53 +0000 (15:17 +1000)] 
nvmet-tcp: Don't free SQ on authentication success

Currently, after the host sends a REPLACETLSPSK, we free the TLS keys as
part of calling nvmet_auth_sq_free() on success. This means when the
host sends a follow up REPLACETLSPSK we return CONCAT_MISMATCH as the
check for !nvmet_queue_tls_keyid(req->sq) fails.

This patch ensures we don't free the TLS key on success as we might need
it again in the future.

Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvmet-tcp: Don't error if TLS is enabled on a reset
Alistair Francis [Tue, 2 Dec 2025 05:17:52 +0000 (15:17 +1000)] 
nvmet-tcp: Don't error if TLS is enabled on a reset

If the host sends an AUTH_Negotiate Message on the admin queue with
REPLACETLSPSK set then we expect and require a TLS connection and
shouldn't report an error if TLS is enabled.

This change only enforces the nvmet_queue_tls_keyid() check if we aren't
resetting the negotiation.

Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agocrypto: remove HKDF library
Eric Biggers [Mon, 2 Mar 2026 07:59:59 +0000 (23:59 -0800)] 
crypto: remove HKDF library

Remove crypto/hkdf.c, since it's no longer used.  Originally it had two
users, but now both of them just inline the needed HMAC computations
using the HMAC library APIs.  That ends up being better, since it
eliminates all the complexity and performance issues associated with the
crypto_shash abstraction and multi-step HMAC input formatting.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: remove selections of no-longer used crypto modules
Eric Biggers [Mon, 2 Mar 2026 07:59:58 +0000 (23:59 -0800)] 
nvme-auth: common: remove selections of no-longer used crypto modules

Now that nvme-auth uses the crypto library instead of crypto_shash,
remove obsolete selections from the NVME_AUTH kconfig option.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: remove nvme_auth_digest_name()
Eric Biggers [Mon, 2 Mar 2026 07:59:57 +0000 (23:59 -0800)] 
nvme-auth: common: remove nvme_auth_digest_name()

Since nvme_auth_digest_name() is no longer used, remove it and the
associated data from the hash_map array.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: target: use crypto library in nvmet_auth_ctrl_hash()
Eric Biggers [Mon, 2 Mar 2026 07:59:56 +0000 (23:59 -0800)] 
nvme-auth: target: use crypto library in nvmet_auth_ctrl_hash()

For the HMAC computation in nvmet_auth_ctrl_hash(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: target: use crypto library in nvmet_auth_host_hash()
Eric Biggers [Mon, 2 Mar 2026 07:59:55 +0000 (23:59 -0800)] 
nvme-auth: target: use crypto library in nvmet_auth_host_hash()

For the HMAC computation in nvmet_auth_host_hash(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: target: remove obsolete crypto_has_shash() checks
Eric Biggers [Mon, 2 Mar 2026 07:59:54 +0000 (23:59 -0800)] 
nvme-auth: target: remove obsolete crypto_has_shash() checks

Since nvme-auth is now doing its HMAC computations using the crypto
library, it's guaranteed that all the algorithms actually work.
Therefore, remove the crypto_has_shash() checks which are now obsolete.

However, the caller in nvmet_auth_negotiate() seems to have also been
relying on crypto_has_shash(nvme_auth_hmac_name(host_hmac_id)) to
validate the host_hmac_id.  Therefore, make it validate the ID more
directly by checking whether nvme_auth_hmac_hash_len() returns 0 or not.
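
The validation idea can be sketched as follows. The function names are hypothetical and the identifier values (0x01 to 0x03) are assumptions; only the "length of 0 means unknown ID" convention is taken from the message above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return the digest length for a hash ID, or 0 if the ID is unknown. */
static size_t hmac_hash_len_sketch(uint8_t hmac_id)
{
	switch (hmac_id) {
	case 0x01: return 32;	/* SHA-256 */
	case 0x02: return 48;	/* SHA-384 */
	case 0x03: return 64;	/* SHA-512 */
	default:   return 0;	/* not a supported hash */
	}
}

/* An ID is valid exactly when it maps to a non-zero digest length. */
static bool hmac_id_is_valid_sketch(uint8_t hmac_id)
{
	return hmac_hash_len_sketch(hmac_id) != 0;
}
```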

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: host: remove allocation of crypto_shash
Eric Biggers [Mon, 2 Mar 2026 07:59:53 +0000 (23:59 -0800)] 
nvme-auth: host: remove allocation of crypto_shash

Now that the crypto_shash that is being allocated in
nvme_auth_process_dhchap_challenge() and stored in the
struct nvme_dhchap_queue_context is no longer used, remove it.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: host: use crypto library in nvme_auth_dhchap_setup_ctrl_response()
Eric Biggers [Mon, 2 Mar 2026 07:59:52 +0000 (23:59 -0800)] 
nvme-auth: host: use crypto library in nvme_auth_dhchap_setup_ctrl_response()

For the HMAC computation in nvme_auth_dhchap_setup_ctrl_response(), use
the crypto library instead of crypto_shash.  This is simpler, faster,
and more reliable.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: host: use crypto library in nvme_auth_dhchap_setup_host_response()
Eric Biggers [Mon, 2 Mar 2026 07:59:51 +0000 (23:59 -0800)] 
nvme-auth: host: use crypto library in nvme_auth_dhchap_setup_host_response()

For the HMAC computation in nvme_auth_dhchap_setup_host_response(), use
the crypto library instead of crypto_shash.  This is simpler, faster,
and more reliable.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: use crypto library in nvme_auth_derive_tls_psk()
Eric Biggers [Mon, 2 Mar 2026 07:59:50 +0000 (23:59 -0800)] 
nvme-auth: common: use crypto library in nvme_auth_derive_tls_psk()

For the HKDF-Expand-Label computation in nvme_auth_derive_tls_psk(), use
the crypto library instead of crypto_shash and crypto/hkdf.c.

While this means the HKDF "helper" functions are no longer utilized,
they clearly weren't buying us much: it's simpler to just inline the
HMAC computations directly, and this code needs to be tested anyway.  (A
similar result was seen in fs/crypto/.)  As a result, this eliminates the
last user of crypto/hkdf.c, which we'll be able to remove as well.

As usual this is also a lot more efficient, eliminating the allocation
of a transformation object and multiple other dynamic allocations.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: use crypto library in nvme_auth_generate_digest()
Eric Biggers [Mon, 2 Mar 2026 07:59:49 +0000 (23:59 -0800)] 
nvme-auth: common: use crypto library in nvme_auth_generate_digest()

For the HMAC computation in nvme_auth_generate_digest(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: use crypto library in nvme_auth_generate_psk()
Eric Biggers [Mon, 2 Mar 2026 07:59:48 +0000 (23:59 -0800)] 
nvme-auth: common: use crypto library in nvme_auth_generate_psk()

For the HMAC computation in nvme_auth_generate_psk(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: use crypto library in nvme_auth_augmented_challenge()
Eric Biggers [Mon, 2 Mar 2026 07:59:47 +0000 (23:59 -0800)] 
nvme-auth: common: use crypto library in nvme_auth_augmented_challenge()

For the hash and HMAC computations in nvme_auth_augmented_challenge(),
use the crypto library instead of crypto_shash.  This is simpler,
faster, and more reliable.  Notably, this eliminates two crypto
transformation object allocations for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: use crypto library in nvme_auth_transform_key()
Eric Biggers [Mon, 2 Mar 2026 07:59:46 +0000 (23:59 -0800)] 
nvme-auth: common: use crypto library in nvme_auth_transform_key()

For the HMAC computation in nvme_auth_transform_key(), use the crypto
library instead of crypto_shash.  This is simpler, faster, and more
reliable.  Notably, this eliminates the transformation object allocation
for every call, which was very slow.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: add HMAC helper functions
Eric Biggers [Mon, 2 Mar 2026 07:59:45 +0000 (23:59 -0800)] 
nvme-auth: common: add HMAC helper functions

Add some helper functions for computing HMAC-SHA256, HMAC-SHA384, or
HMAC-SHA512 values using the crypto library instead of crypto_shash.
These will enable some significant simplifications and performance
improvements in nvme-auth.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: explicitly verify psk_len == hash_len
Eric Biggers [Mon, 2 Mar 2026 07:59:44 +0000 (23:59 -0800)] 
nvme-auth: common: explicitly verify psk_len == hash_len

nvme_auth_derive_tls_psk() is always called with psk_len == hash_len.
And based on the comments above nvme_auth_generate_psk() and
nvme_auth_derive_tls_psk(), this isn't an implementation choice but
rather just the length the spec uses.  Add a check which makes this
explicit, so that when cleaning up nvme_auth_derive_tls_psk() we don't
have to retain support for arbitrary values of psk_len.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: rename nvme_auth_generate_key() to nvme_auth_parse_key()
Eric Biggers [Mon, 2 Mar 2026 07:59:43 +0000 (23:59 -0800)] 
nvme-auth: rename nvme_auth_generate_key() to nvme_auth_parse_key()

This function does not generate a key.  It parses the key from the
string that the caller passes in.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: add KUnit tests for TLS key derivation
Eric Biggers [Mon, 2 Mar 2026 07:59:42 +0000 (23:59 -0800)] 
nvme-auth: common: add KUnit tests for TLS key derivation

Unit-test the sequence of function calls that derive tls_psk, so that we
can be more confident that changes in the implementation don't break it.

Since the NVMe specification doesn't seem to include any test vectors
for this (nor does its description of the algorithm seem to match what
was actually implemented, for that matter), I just set the expected
values to the values that the code currently produces.  In the case
of SHA-512, nvme_auth_generate_digest() currently returns -EINVAL, so
for now the test tests for that too.  If it is later determined that
some other behavior is needed, the test can be updated accordingly.

Tested with:

    tools/testing/kunit/kunit.py run --kunitconfig drivers/nvme/common/

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: use proper argument types
Eric Biggers [Mon, 2 Mar 2026 07:59:41 +0000 (23:59 -0800)] 
nvme-auth: use proper argument types

For input parameters, use pointer to const.  This makes it easier to
understand which parameters are inputs and which are outputs.

In addition, consistently use char for strings and u8 for binary.  This
makes it easier to understand what is a string and what is binary data.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: common: constify static data
Eric Biggers [Mon, 2 Mar 2026 07:59:40 +0000 (23:59 -0800)] 
nvme-auth: common: constify static data

Fully constify the dhgroup_map and hash_map arrays.  Remove 'const' from
individual fields, as it is now redundant.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agonvme-auth: add NVME_AUTH_MAX_DIGEST_SIZE constant
Eric Biggers [Mon, 2 Mar 2026 07:59:39 +0000 (23:59 -0800)] 
nvme-auth: add NVME_AUTH_MAX_DIGEST_SIZE constant

Define a NVME_AUTH_MAX_DIGEST_SIZE constant and use it in the
appropriate places.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2 months agodrbd: Balance RCU calls in drbd_adm_dump_devices()
Bart Van Assche [Thu, 26 Mar 2026 21:40:54 +0000 (14:40 -0700)] 
drbd: Balance RCU calls in drbd_adm_dump_devices()

Make drbd_adm_dump_devices() call rcu_read_lock() before
rcu_read_unlock() is called. This has been detected by the Clang
thread-safety analyzer.

Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: a55bbd375d18 ("drbd: Backport the "status" command")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260326214054.284593-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agodrbd: use genl pre_doit/post_doit
Christoph Böhmwalder [Tue, 24 Mar 2026 15:29:07 +0000 (16:29 +0100)] 
drbd: use genl pre_doit/post_doit

Every doit handler followed the same pattern: stack-allocate an
adm_ctx, call drbd_adm_prepare() at the top, call drbd_adm_finish()
at the bottom. This duplicated boilerplate across 25 handlers and
made error paths inconsistent, since some handlers could miss sending
the reply skb on early-exit paths.

The generic netlink framework already provides pre_doit/post_doit
hooks for exactly this purpose. An old comment even noted "this
would be a good candidate for a pre_doit hook".

Use them:

- pre_doit heap-allocates adm_ctx, looks up per-command flags from a
  new drbd_genl_cmd_flags[] table, runs drbd_adm_prepare(), and
  stores the context in info->user_ptr[0].
- post_doit sends the reply, drops kref references for
  device/connection/resource, and frees the adm_ctx.
- Handlers just receive adm_ctx from info->user_ptr[0], set
  reply_dh->ret_code, and return. All teardown is in post_doit.
- drbd_adm_finish() is removed, superseded by post_doit.
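
The resulting control flow can be modelled in userspace C as below. All types and function names here are hypothetical stand-ins, not the DRBD or generic netlink definitions; the sketch only assumes that post_doit runs after every handler for which pre_doit succeeded.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the per-request context and netlink info. */
struct adm_ctx_sketch { int ret_code; };
struct info_sketch {
	void *user_ptr[2];
	int reply_sent;
	int last_ret;
};

/* Allocate the context once and hand it to the handler via user_ptr[0]. */
static int pre_doit_sketch(struct info_sketch *info)
{
	struct adm_ctx_sketch *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return -1;
	info->user_ptr[0] = ctx;
	return 0;
}

/* A handler only records its result; it performs no teardown. */
static int handler_sketch(struct info_sketch *info)
{
	struct adm_ctx_sketch *ctx = info->user_ptr[0];

	ctx->ret_code = 42;
	return 0;
}

/* Send the reply and free the context in exactly one place. */
static void post_doit_sketch(struct info_sketch *info)
{
	struct adm_ctx_sketch *ctx = info->user_ptr[0];

	info->last_ret = ctx->ret_code;	/* stands in for sending the reply */
	info->reply_sent = 1;
	free(ctx);
	info->user_ptr[0] = NULL;
}

/* What the framework does around each handler call. */
static int run_cmd_sketch(struct info_sketch *info,
			  int (*doit)(struct info_sketch *))
{
	int err = pre_doit_sketch(info);

	if (err)
		return err;	/* nothing to tear down yet */
	err = doit(info);
	post_doit_sketch(info);	/* runs on success and on handler error */
	return err;
}
```

Because teardown lives only in the post hook, an early return inside a handler cannot skip the reply or leak the context.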

Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://patch.msgid.link/20260324152907.2840984-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agozloop: forget write cache on force removal
Christoph Hellwig [Mon, 23 Mar 2026 07:11:50 +0000 (08:11 +0100)] 
zloop: forget write cache on force removal

Add a new option that causes zloop to truncate the zone files to the
write pointer value recorded at the last cache flush to simulate
unclean shutdowns.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260323071156.2940772-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agozloop: refactor zloop_rw
Christoph Hellwig [Mon, 23 Mar 2026 07:11:49 +0000 (08:11 +0100)] 
zloop: refactor zloop_rw

Split out two helper functions to make the function more readable and
to avoid conditional locking.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260323071156.2940772-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoblock: fix bio_alloc_bioset slowpath GFP handling
Vasily Gorbik [Sun, 22 Mar 2026 02:35:10 +0000 (03:35 +0100)] 
block: fix bio_alloc_bioset slowpath GFP handling

bio_alloc_bioset() first strips __GFP_DIRECT_RECLAIM from the optimistic
fast allocation attempt with try_alloc_gfp(). If that fast path fails,
the slowpath checks saved_gfp to decide whether blocking allocation is
allowed, but then still calls mempool_alloc() with the stripped gfp mask.
That can lead to a NULL bio pointer being passed into bio_init().

Fix the slowpath by using saved_gfp for the bio and bvec mempool
allocations.
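
The pattern behind the fix can be shown with a toy userspace model. Nothing here is the block layer's code: the flag value, the allocators, and the function names are hypothetical, and the toy pool simply mimics a reserve that only guarantees success when the caller is allowed to block.

```c
#include <stddef.h>

#define GFP_DIRECT_RECLAIM_SKETCH 0x1u	/* stand-in for __GFP_DIRECT_RECLAIM */

static int reserved_slot;	/* the toy pool's single reserved object */

/* Toy fast path: pretend the opportunistic attempt always fails. */
static void *fast_alloc_sketch(unsigned int gfp)
{
	(void)gfp;
	return NULL;
}

/* Toy pool: succeeds only when the mask allows blocking. */
static void *pool_alloc_sketch(unsigned int gfp)
{
	return (gfp & GFP_DIRECT_RECLAIM_SKETCH) ? &reserved_slot : NULL;
}

static void *alloc_sketch(unsigned int gfp)
{
	unsigned int saved_gfp = gfp;	/* remember what the caller asked for */
	void *p;

	gfp &= ~GFP_DIRECT_RECLAIM_SKETCH;	/* fast path must not block */
	p = fast_alloc_sketch(gfp);
	if (p)
		return p;

	/* Slow path: pass the original mask, not the stripped one. */
	return pool_alloc_sketch(saved_gfp);
}
```

Passing the stripped `gfp` to the pool in the last line would make a blocking caller see NULL, which is the failure mode described above.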

Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: syzbot+09ddb593eea76a158f42@syzkaller.appspotmail.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/p01.gc6e9ad5845ad.ttca29g@ub.hpns
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoublk: move cold paths out of __ublk_batch_dispatch() for icache efficiency
Ming Lei [Wed, 18 Mar 2026 01:41:12 +0000 (09:41 +0800)] 
ublk: move cold paths out of __ublk_batch_dispatch() for icache efficiency

Mark ublk_filter_unused_tags() as noinline since it is only called from
the unlikely(needs_filter) branch. Extract the error-handling block from
__ublk_batch_dispatch() into a new noinline ublk_batch_dispatch_fail()
function to keep the hot path compact and icache-friendly. This also
makes __ublk_batch_dispatch() more readable by separating the error
recovery logic from the normal dispatch flow.

Before: __ublk_batch_dispatch is ~1419 bytes
After:  __ublk_batch_dispatch is ~1090 bytes (-329 bytes, -23%)
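
The general pattern can be sketched in plain C with GCC/Clang attributes.
This is an illustration under assumptions, not the ublk code:
dispatch_batch(), dispatch_fail() and the negative-tag error condition are
hypothetical, and the kernel's unlikely() and noinline macros are written
out as the compiler features they are assumed to wrap.

```c
#include <assert.h>
#include <stddef.h>

/* Written-out equivalents of the kernel's unlikely() and noinline. */
#define model_unlikely(x) __builtin_expect(!!(x), 0)
#define model_noinline    __attribute__((noinline))

/*
 * Cold error-recovery path, kept out of line so that its instructions do
 * not take up space in the hot loop. Returns the number of tags that were
 * handled before the failure.
 */
static model_noinline size_t dispatch_fail(const int *tags, size_t done)
{
	(void)tags;	/* a real handler would recover tags[done..] here */
	return done;
}

/* Hot path: sums valid tags and only calls out on the rare bad entry. */
static size_t dispatch_batch(const int *tags, size_t n, long *sum)
{
	size_t i;

	assert(tags != NULL && sum != NULL);
	*sum = 0;
	for (i = 0; i < n; i++) {
		if (model_unlikely(tags[i] < 0))
			return dispatch_fail(tags, i);
		*sum += tags[i];
	}
	return n;
}
```

Because dispatch_fail() is never inlined, the loop body of
dispatch_batch() contains only a call instruction for the error case.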

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260318014112.3125432-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months ago Merge tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid...
Jens Axboe [Sun, 22 Mar 2026 19:37:45 +0000 (13:37 -0600)] 
Merge tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.1/block

Pull MD changes from Yu Kuai:

"Bug Fixes:
 - md: suppress spurious superblock update error message for dm-raid
   (Chen Cheng)
 - md/raid1: fix the comparing region of interval tree (Xiao Ni)
 - md/raid10: fix deadlock with check operation and nowait requests
   (Josh Hunt)
 - md/raid5: skip 2-failure compute when other disk is R5_LOCKED
   (FengWei Shih)
 - md/md-llbitmap: raise barrier before state machine transition
   (Yu Kuai)
 - md/md-llbitmap: skip reading rdevs that are not in_sync (Yu Kuai)

 Improvements:
 - md/raid5: set chunk_sectors to enable full stripe I/O splitting
   (Yu Kuai)

 Cleanups:
 - md: remove unused mddev argument from export_rdev (Chen Cheng)
 - md/raid5: remove stale md_raid5_kick_device() declaration
   (Chen Cheng)
 - md/raid5: move handle_stripe() comment to correct location
   (Chen Cheng)"

* tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md: remove unused mddev argument from export_rdev
  md/raid5: move handle_stripe() comment to correct location
  md/raid5: remove stale md_raid5_kick_device() declaration
  md/raid1: fix the comparing region of interval tree
  md/raid5: skip 2-failure compute when other disk is R5_LOCKED
  md/md-llbitmap: raise barrier before state machine transition
  md/md-llbitmap: skip reading rdevs that are not in_sync
  md/raid5: set chunk_sectors to enable full stripe I/O splitting
  md/raid10: fix deadlock with check operation and nowait requests
  md: suppress spurious superblock update error message for dm-raid

2 months ago md: remove unused mddev argument from export_rdev
Chen Cheng [Wed, 4 Mar 2026 11:14:17 +0000 (19:14 +0800)] 
md: remove unused mddev argument from export_rdev

The mddev argument in export_rdev() is never used. Remove it to
simplify callers.

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20260304111417.20777-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2 months ago md/raid5: move handle_stripe() comment to correct location
Chen Cheng [Wed, 4 Mar 2026 11:10:01 +0000 (19:10 +0800)] 
md/raid5: move handle_stripe() comment to correct location

Move the handle_stripe() documentation comment from above
analyse_stripe() to directly above handle_stripe() where it belongs.

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304111001.15767-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2 months ago md/raid5: remove stale md_raid5_kick_device() declaration
Chen Cheng [Wed, 4 Mar 2026 11:09:18 +0000 (19:09 +0800)] 
md/raid5: remove stale md_raid5_kick_device() declaration

Remove the unused md_raid5_kick_device() declaration from raid5.h -
no definition exists for this function.

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304110919.15071-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2 months ago md/raid1: fix the comparing region of interval tree
Xiao Ni [Thu, 5 Mar 2026 01:18:33 +0000 (09:18 +0800)] 
md/raid1: fix the comparing region of interval tree

The interval tree stores each region as a closed interval [start, end].
raid1 passes the wrong end value. For example, bio(A,B) is too big and
is split into bio1(A,C-1) and bio2(C,B). The region recorded for bio1 is
[A,C] and the region for bio2 is [C,B], so bio1 and bio2 overlap even
though they share no sectors.

Fix this problem by using the correct end value for the region.
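
The off-by-one in the example can be checked with a short sketch. The
helpers regions_overlap() and region_last() are hypothetical and only
model closed-interval arithmetic; they are not the raid1 or interval tree
code.

```c
#include <assert.h>
#include <stdbool.h>

/* Closed-interval overlap test: [a_start, a_end] against [b_start, b_end]. */
static bool regions_overlap(unsigned long a_start, unsigned long a_end,
			    unsigned long b_start, unsigned long b_end)
{
	assert(a_start <= a_end && b_start <= b_end);
	return a_start <= b_end && b_start <= a_end;
}

/* Last sector covered by an I/O of nr_sectors sectors starting at start. */
static unsigned long region_last(unsigned long start, unsigned long nr_sectors)
{
	assert(nr_sectors > 0);
	return start + nr_sectors - 1;
}
```

With A = 0, C = 8 and B = 15, the regions [0,8] and [8,15] overlap at
sector 8, while [0, region_last(0, 8)] = [0,7] and [8,15] do not.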

Fixes: d0d2d8ba0494 ("md/raid1: introduce wait_for_serialization")
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260305011839.5118-2-xni@redhat.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2 months ago md/raid5: skip 2-failure compute when other disk is R5_LOCKED
FengWei Shih [Thu, 19 Mar 2026 05:33:51 +0000 (13:33 +0800)] 
md/raid5: skip 2-failure compute when other disk is R5_LOCKED

When skip_copy is enabled on a doubly-degraded RAID6, a device that is
being written to will be in R5_LOCKED state with R5_UPTODATE cleared.
If a new read triggers fetch_block() while the write is still in
flight, the 2-failure compute path may select this locked device as a
compute target because it is not R5_UPTODATE.

Because skip_copy makes the device page point directly to the bio page,
reconstructing data into it might be risky. Also, since the compute
marks the device R5_UPTODATE, it triggers WARN_ON in ops_run_io()
which checks that R5_SkipCopy and R5_UPTODATE are not both set.

This can be reproduced by running small-range concurrent read/write on
a doubly-degraded RAID6 with skip_copy enabled, for example:

  mdadm -C /dev/md0 -l6 -n6 -R -f /dev/loop[0-3] missing missing
  echo 1 > /sys/block/md0/md/skip_copy
  fio --filename=/dev/md0 --rw=randrw --bs=4k --numjobs=8 \
      --iodepth=32 --size=4M --runtime=30 --time_based --direct=1

Fix by checking R5_LOCKED before proceeding with the compute. The
compute will be retried once the lock is cleared on IO completion.
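
The shape of the check can be sketched as follows. This is a simplified
model, not the fetch_block() code: the MODEL_* bits and both
may_compute_*() helpers are hypothetical, and the real R5_* flags are
defined in raid5.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for R5_UPTODATE and R5_LOCKED. */
#define MODEL_UPTODATE 0x1u
#define MODEL_LOCKED   0x2u
#define MODEL_KNOWN    (MODEL_UPTODATE | MODEL_LOCKED)

/* Modelled old behaviour: any device that is not up to date may be computed. */
static bool may_compute_old(unsigned int flags)
{
	assert((flags & ~MODEL_KNOWN) == 0);
	return !(flags & MODEL_UPTODATE);
}

/* Modelled new behaviour: a device with I/O in flight is skipped. */
static bool may_compute_new(unsigned int flags)
{
	assert((flags & ~MODEL_KNOWN) == 0);
	if (flags & MODEL_LOCKED)
		return false;	/* retried once the lock clears on completion */
	return !(flags & MODEL_UPTODATE);
}
```

A device that is locked and not up to date is accepted as a compute
target by may_compute_old() and rejected by may_compute_new().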

Signed-off-by: FengWei Shih <dannyshih@synology.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260319053351.3676794-1-dannyshih@synology.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>