Chris Leech [Wed, 22 Apr 2026 19:06:36 +0000 (12:06 -0700)]
nvme-auth: Hash DH shared secret to create session key
The NVMe Base Specification 8.3.5.5.9 states that the session key Ks
shall be computed from the ephemeral DH key by applying the hash
function selected by the HashID parameter.
The current implementation stores the raw DH shared secret as the
session key without hashing it. This causes redundant hash operations:
1. Augmented challenge computation (section 8.3.5.5.4) requires
Ca = HMAC(H(g^xy mod p), C). The code compensates by hashing the
unhashed session key in nvme_auth_augmented_challenge() to produce
the correct result.
2. PSK generation (section 8.3.5.5.9) requires PSK = HMAC(Ks, C1 || C2)
where Ks should already be H(g^xy mod p). As the DH shared secret
is always larger than the HMAC block size, HMAC internally hashes
it before use, accidentally producing the correct result.
When using secure channel concatenation with bidirectional
authentication, this results in hashing the DH value three times: twice
for augmented challenge calculations and once during PSK generation.
Fix this by:
- Modifying nvme_auth_gen_shared_secret() to hash the DH shared secret
once after computation: Ks = H(g^xy mod p)
- Removing the hash operation from nvme_auth_augmented_challenge()
as the session key is now already hashed
- Updating session key buffer size from DH key size to hash output size
- Adding specification references in comments
This avoids storing the raw DH shared secret and reduces the number of
hash operations from three to one when using secure channel
concatenation.
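The HMAC behaviour described in point 2 can be checked in user space. This is a minimal Python sketch, not the kernel code: the 256-byte stand-in secret, the challenge values and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import hmac

# Stand-in for g^xy mod p from a 2048-bit DH group (256 bytes); not a real secret.
dh_secret = bytes(range(256))
c1, c2 = b"\x01" * 32, b"\x02" * 32   # hypothetical challenge values

# Ks = H(g^xy mod p), hashed once when the shared secret is generated.
ks = hashlib.sha256(dh_secret).digest()

# HMAC replaces any key longer than the hash block size (64 bytes for
# SHA-256) with its hash, so keying with the raw secret yields the same PSK.
psk_from_raw = hmac.new(dh_secret, c1 + c2, hashlib.sha256).digest()
psk_from_ks = hmac.new(ks, c1 + c2, hashlib.sha256).digest()

# The augmented challenge Ca = HMAC(H(g^xy mod p), C) can use Ks directly.
ca = hmac.new(ks, c1, hashlib.sha256).digest()
```

Both PSK values are identical, which is why the old code produced correct results while hashing the DH value more often than needed.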
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
We can batch admin commands submitted through io_uring_cmd passthrough,
which means bd->last may be false and skips the doorbell write to
aggregate multiple commands per write. If a subsequent command can't be
dispatched for whatever reason, we have to provide the blk-mq ops'
commit_rqs callback in order to ensure we properly update the doorbell.
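The role of commit_rqs can be pictured with a toy model. This is a Python sketch under simplifying assumptions; the class and field names are hypothetical and do not match the driver's structures.

```python
class SubmissionQueue:
    """Toy model of a queue whose doorbell write can be deferred."""

    def __init__(self):
        self.tail = 0        # commands placed in the queue so far
        self.doorbell = 0    # last tail value written to the doorbell

    def queue_rq(self, last):
        self.tail += 1
        if last:             # bd->last: end of the batch, ring now
            self.doorbell = self.tail

    def commit_rqs(self):
        # Called when a batch ends early: ring for anything not yet visible.
        if self.doorbell != self.tail:
            self.doorbell = self.tail


sq = SubmissionQueue()
sq.queue_rq(last=False)                  # batched, doorbell write skipped
pending_before = sq.tail - sq.doorbell
sq.commit_rqs()                          # next command could not be dispatched
pending_after = sq.tail - sq.doorbell
```

Without the commit_rqs step, the queued command in this model would stay invisible to the device.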
Fixes: 58e5bdeb9c2b ("nvme: enable uring-passthrough for admin commands") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Alistair Francis [Fri, 17 Apr 2026 00:50:48 +0000 (10:50 +1000)]
nvme-auth: Include SC_C in RVAL controller hash
Section 8.3.4.5.5 of the NVMe Base Specification 2.1 describes what is
included in the Response Value (RVAL) hash and SC_C should be included.
Currently we are hardcoding 0 instead of using the correct SC_C value.
Update the host and target code to use the SC_C when calculating the
RVAL instead of using 0.
Fixes: e88a7595b57f2 ("nvme-tcp: request secure channel concatenation") Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
When a controller reset is triggered via sysfs (by writing to
/sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down
and re-establishes all queues. Releasing the socket with fput() defers
the actual cleanup to task_work or the delayed_fput workqueue. This deferred
cleanup can race with the subsequent queue re-allocation during reset,
potentially leading to use-after-free or resource conflicts.
Replace fput() with __fput_sync() to ensure synchronous socket release,
guaranteeing that all socket resources are fully cleaned up before the
function returns. This prevents races during controller reset where
new queue setup may begin before the old socket is fully released.
memalloc_noreclaim_save() sets PF_MEMALLOC which is intended for tasks
performing memory reclaim work that need reserve access. While PF_MEMALLOC
prevents the task from entering direct reclaim (causing __need_reclaim() to
return false), it does not strip __GFP_IO from gfp flags. The allocator can
therefore still trigger writeback I/O when __GFP_IO remains set, which is
unsafe when the caller holds block layer locks.
Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes
current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in
the scope, making it safe to allocate memory while holding elevator_lock and
set->srcu.
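The difference between the two scopes can be sketched as flag masking. This Python model is a simplification: the bit values are hypothetical and only the masking described above is represented.

```python
# Hypothetical bit values; only the masking logic matters here.
GFP_IO, GFP_FS, GFP_DIRECT_RECLAIM = 0x1, 0x2, 0x4
PF_MEMALLOC, PF_MEMALLOC_NOIO = 0x10, 0x20


def effective_gfp(task_flags, gfp):
    """Simplified model of the scope-based gfp masking."""
    if task_flags & PF_MEMALLOC_NOIO:
        gfp &= ~(GFP_IO | GFP_FS)    # NOIO scope strips both flags
    return gfp                       # PF_MEMALLOC alone strips nothing


request = GFP_IO | GFP_FS | GFP_DIRECT_RECLAIM
under_noreclaim = effective_gfp(PF_MEMALLOC, request)
under_noio = effective_gfp(PF_MEMALLOC_NOIO, request)
```

In this model the PF_MEMALLOC scope leaves the I/O flag in place, while the NOIO scope removes it for every allocation in the scope.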
* The issue can be reproduced using blktests:
nvme_trtype=tcp ./check nvme/005
blktests (master) # nvme_trtype=tcp ./check nvme/005
nvme/005 (tr=tcp) (reset local loopback target) [failed]
runtime 0.725s ... 0.798s
something found in dmesg:
[ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[...]
...
(See '/root/blktests/results/nodev_tr_tcp/nvme/005.dmesg' for the entire message)
blktests (master) # cat /root/blktests/results/nodev_tr_tcp/nvme/005.dmesg
[ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[ 108.526983] loop0: detected capacity change from 0 to 2097152
[ 108.555606] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 108.572531] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 108.613061] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 108.616832] nvme nvme0: creating 48 I/O queues.
[ 108.630791] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 108.661892] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 108.746639] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 108.748466] nvme nvme0: creating 48 I/O queues.
[ 108.802984] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 108.829983] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 108.854288] block nvme0n1: no available path - failing I/O
[ 108.854344] block nvme0n1: no available path - failing I/O
[ 108.854373] Buffer I/O error on dev nvme0n1, logical block 1, async page read
[ 108.891693] ======================================================
[ 108.895912] WARNING: possible circular locking dependency detected
[ 108.900184] 6.17.0nvme+ #3 Tainted: G N
[ 108.903913] ------------------------------------------------------
[ 108.908171] nvme/2734 is trying to acquire lock:
[ 108.911957] ffff88810210e610 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x17/0x170
[ 108.917587]
but task is already holding lock:
[ 108.921570] ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[ 108.927361]
which lock already depends on the new lock.
Alistair Francis [Fri, 17 Apr 2026 00:48:09 +0000 (10:48 +1000)]
nvmet-tcp: Don't clear tls_key when freeing sq
Currently, after the host sends a REPLACETLSPSK we free the TLS keys as
part of calling nvmet_auth_sq_free() on success. This means when the
host sends a follow up REPLACETLSPSK we return CONCAT_MISMATCH as the
check for !nvmet_queue_tls_keyid(req->sq) fails.
A previous attempt to fix this involved not calling nvmet_auth_sq_free()
on successful connections, but that results in memory leaks. Instead we
should not clear `tls_key` in nvmet_auth_sq_free(), as that was
incorrectly wiping the tls keys which are used for the session.
This patch ensures we correctly free the ephemeral session key on
connection, yet we don't free the TLS key unless closing the connection.
Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Alistair Francis [Fri, 17 Apr 2026 00:48:08 +0000 (10:48 +1000)]
Revert "nvmet-tcp: Don't free SQ on authentication success"
In an attempt to fix REPLACETLSPSK we stopped freeing the secrets on
successful connections. This resulted in memory leaks in the kernel, so
let's revert the commit. An improved fix is being developed to just avoid
clearing the tls_key variable.
Closes: https://lore.kernel.org/linux-nvme/CAHj4cs-u3MWQR4idywptMfjEYi4YwObWFx4KVib35dZ5HMBDdw@mail.gmail.com Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Keith Busch [Mon, 20 Apr 2026 16:02:28 +0000 (09:02 -0700)]
nvme: skip trace completion for host path errors
The command was never dispatched for the driver's "host path error", so
the command was never actually initialized and there's no corresponding
submit trace for the completion.
Reported-by: Minsik Jeon <hmi.jeon@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Tao Jiang [Wed, 15 Apr 2026 17:27:15 +0000 (01:27 +0800)]
nvme-pci: add quirk for Memblaze Pblaze5 (0x1c5f:0x0555)
The Memblaze Pblaze5 NVMe device (PCI ID 0x1c5f:0x0555)
is detected as a controller on recent kernels (tested on 5.15.85
and 6.8.4), but no namespace is exposed.
Tools like lsblk and fdisk do not report any block device.
dmesg shows:
nvme nvme0: missing or invalid SUBNQN field.
The device works correctly on older kernels (e.g. 4.19), suggesting
a compatibility issue with newer namespace handling.
This indicates the device does not properly support the
Namespace Descriptor List feature.
Applying NVME_QUIRK_NO_NS_DESC_LIST allows the namespace to be
discovered correctly.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Tao Jiang <tanroame.kyle@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
John Garry [Wed, 15 Apr 2026 15:53:58 +0000 (15:53 +0000)]
nvme-multipath: put module reference when delayed removal work is canceled
The delayed disk removal work is canceled when an NS (re)appears. However,
we do not put the module reference grabbed in nvme_mpath_remove_disk(), so
fix that.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Wed, 8 Apr 2026 16:19:56 +0000 (18:19 +0200)]
nvme: expose TLS mode
It is not possible to determine the active TLS mode from the
presence or absence of sysfs attributes like tls_key,
tls_configured_key, or dhchap_secret.
With the introduction of the concat mode and optional DH-CHAP
authentication, different configurations can result in identical
sysfs state. This makes user space detection unreliable.
Expose the TLS mode explicitly to allow user space to
unambiguously identify the active configuration and avoid
fragile heuristics in nvme-cli.
Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme-apple: drop invalid put of admin queue reference count
Commit 03b3bcd319b3 ("nvme: fix admin request_queue lifetime") moved the
admin queue reference ->put call into nvme_free_ctrl() - a controller
device release callback performed for every nvme driver doing
nvme_init_ctrl().
nvme-apple sets refcount of the admin queue to 1 at allocation during the
probe function and then puts it twice now:
Note that there is a commit 941f7298c70c ("nvme-apple: remove an extra
queue reference") which intended to drop taking an extra admin queue
reference. Looks like at that moment it accidentally fixed a refcount
leak, which existed since the driver's introduction. There were two ->get
calls at driver's probe function and a single ->put inside
apple_nvme_free_ctrl().
However now after commit 03b3bcd319b3 ("nvme: fix admin request_queue
lifetime") the refcount is imbalanced again. Fix it by removing extra
->put call from apple_nvme_free_ctrl(). anv->dev and ctrl->dev point to
the same device, so use ctrl->dev directly for simplification. Compile
tested only.
Found by Linux Verification Center (linuxtesting.org).
In the declaration of the "core_quirks[]" array, the comment for the
"Kioxia CD6-V Series / HPE PE8030" devices gives the parameter
"default_ps_max_latency_us" incorrectly:
nvme_core.default_ps_max_latency=0
The correct form is, instead:
nvme_core.default_ps_max_latency_us=0
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Flavio Suligoi <f.suligoi@asem.it> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet: avoid recursive nvmet-wq flush in nvmet_ctrl_free
nvmet_tcp_release_queue_work() runs on nvmet-wq and can drop the
final controller reference through nvmet_cq_put(). If that triggers
nvmet_ctrl_free(), the teardown path flushes ctrl->async_event_work on
the same nvmet-wq.
The work was previously scheduled by:
nvmet_add_async_event
queue_work(nvmet_wq, &ctrl->async_event_work);
This trips lockdep with a possible recursive locking warning.
[ 5223.015876] run blktests nvme/003 at 2026-04-07 20:53:55
[ 5223.061801] loop0: detected capacity change from 0 to 2097152
[ 5223.072206] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 5223.088368] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 5223.126086] nvmet: Created discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 5223.128453] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 5233.199447] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 5233.227718] ============================================
[ 5233.231283] WARNING: possible recursive locking detected
[ 5233.234696] 7.0.0-rc3nvme+ #20 Tainted: G O N
[ 5233.238434] --------------------------------------------
[ 5233.241852] kworker/u192:6/2413 is trying to acquire lock:
[ 5233.245429] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90
[ 5233.251438]
but task is already holding lock:
[ 5233.255254] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0
[ 5233.261125]
other info that might help us debug this:
[ 5233.265333] Possible unsafe locking scenario:
There is also no need to flush async_event_work from controller
teardown. The admin queue teardown already fails outstanding AER
requests before the final controller put :-
The controller has already been removed from the subsystem list before
nvmet_ctrl_free() quiesces outstanding work.
Replace flush_work() with cancel_work_sync() so a pending
async_event_work item is canceled and a running instance is waited on
without recursing into the same workqueue.
Fixes: 06406d81a2d7 ("nvmet: cancel fatal error and flush async work before free controller") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
John Garry [Wed, 8 Apr 2026 08:03:57 +0000 (08:03 +0000)]
nvme-multipath: drop head pointer check in nvme_mpath_clear_current_path()
An NS will always have a head pointer, so drop the check. As proof in
practice, all the nvme_mpath_clear_current_path() callers also
dereference ns->head.
This check has endured since the original changes to support multipath.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet-tcp: fix race between ICReq handling and queue teardown
nvmet_tcp_handle_icreq() updates queue->state after sending an
Initialization Connection Response (ICResp), but it does so without
serializing against target-side queue teardown.
If an NVMe/TCP host sends an Initialization Connection Request
(ICReq) and immediately closes the connection, target-side teardown
may start in softirq context before io_work drains the already
buffered ICReq. In that case, nvmet_tcp_schedule_release_queue()
sets queue->state to NVMET_TCP_Q_DISCONNECTING and drops the queue
reference under state_lock.
If io_work later processes that ICReq, nvmet_tcp_handle_icreq() can
still overwrite the state back to NVMET_TCP_Q_LIVE. That defeats the
DISCONNECTING-state guard in nvmet_tcp_schedule_release_queue() and
allows a later socket state change to re-enter teardown and issue a
second kref_put() on an already released queue.
The ICResp send failure path has the same problem. If teardown has
already moved the queue to DISCONNECTING, a send error can still
overwrite the state with NVMET_TCP_Q_FAILED, again reopening the
window for a second teardown path to drop the queue reference.
Fix this by serializing both post-send state transitions with
state_lock and bailing out if teardown has already started.
Use -ESHUTDOWN as an internal sentinel for that bail-out path rather
than propagating it as a transport error like -ECONNRESET. Keep
nvmet_tcp_socket_error() setting rcv_state to NVMET_TCP_RECV_ERR before
honoring that sentinel so receive-side parsing stays quiesced until the
existing release path completes.
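The serialized transition can be sketched as follows. This is a Python model of the idea only: the class, method names and the boolean return of the release helper are assumptions, not the target code.

```python
import errno
import threading
from enum import Enum


class QState(Enum):
    CONNECTING = 0
    LIVE = 1
    DISCONNECTING = 2
    FAILED = 3


class Queue:
    def __init__(self):
        self.state = QState.CONNECTING
        self.state_lock = threading.Lock()

    def schedule_release(self):
        # Teardown: only the first caller moves the queue to DISCONNECTING.
        with self.state_lock:
            if self.state is QState.DISCONNECTING:
                return False
            self.state = QState.DISCONNECTING
            return True

    def finish_icreq(self, send_ok):
        # Post-send transition: never overwrite DISCONNECTING.
        with self.state_lock:
            if self.state is QState.DISCONNECTING:
                return -errno.ESHUTDOWN      # internal bail-out sentinel
            self.state = QState.LIVE if send_ok else QState.FAILED
            return 0


q = Queue()
first_release = q.schedule_release()     # host closed right after ICReq
icreq_result = q.finish_icreq(send_ok=True)
second_release = q.schedule_release()    # later socket state change
```

Because the late ICReq cannot move the state back to LIVE, the second release attempt in this model is rejected and the queue reference is dropped only once.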
nvmet-tcp: remove redundant calls to nvmet_tcp_fatal_error()
Executing nvmet_tcp_fatal_error() is generally the responsibility
of the caller (nvmet_tcp_try_recv); all other functions should
just return the error code.
Remove the nvmet_tcp_fatal_error() function; it is no longer
needed.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet-tcp: propagate nvmet_tcp_build_pdu_iovec() errors to its callers
Currently, when nvmet_tcp_build_pdu_iovec() detects an out-of-bounds
PDU length or offset, it triggers nvmet_tcp_fatal_error(cmd->queue)
and returns early. However, because the function returns void, the
callers are entirely unaware that a fatal error has occurred and
that the cmd->recv_msg.msg_iter was left uninitialized.
Callers such as nvmet_tcp_handle_h2c_data_pdu() proceed to blindly
overwrite the queue state with queue->rcv_state = NVMET_TCP_RECV_DATA.
Consequently, the socket receiving loop may attempt to read incoming
network data into the uninitialized iterator.
Fix this by shifting the error handling responsibility to the callers.
Fixes: 52a0a9854934 ("nvmet-tcp: add bounds checks in nvmet_tcp_build_pdu_iovec") Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Using this port configuration, one will be able to set the Maximum Data
Transfer Size (MDTS) for any controller that will be associated to the
configured port. The default value remains 0 (no limit).
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Aurelien Aptel <aaptel@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Geliang Tang [Tue, 31 Mar 2026 08:17:31 +0000 (16:17 +0800)]
nvme: add missing MODULE_ALIAS for fabrics transports
The generic fabrics layer uses request_module("nvme-%s", opts->transport)
to auto-load transport modules. Currently, the nvme-tcp, nvme-rdma, and
nvme-fc modules lack MODULE_ALIAS entries for these names, which prevents
the kernel from automatically finding and loading them when requested.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
Shivam Kumar [Wed, 18 Mar 2026 22:56:58 +0000 (18:56 -0400)]
nvmet-tcp: check INIT_FAILED before nvmet_req_uninit in digest error path
In nvmet_tcp_try_recv_ddgst(), when a data digest mismatch is detected,
nvmet_req_uninit() is called unconditionally. However, if the command
arrived via the nvmet_tcp_handle_req_failure() path, nvmet_req_init()
had returned false and percpu_ref_tryget_live() was never executed. The
unconditional percpu_ref_put() inside nvmet_req_uninit() then causes a
refcount underflow, leading to a WARNING in
percpu_ref_switch_to_atomic_rcu, a use-after-free diagnostic, and
eventually a permanent workqueue deadlock.
Check cmd->flags & NVMET_TCP_F_INIT_FAILED before calling
nvmet_req_uninit(), matching the existing pattern in
nvmet_tcp_execute_request().
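The refcount imbalance and its guard can be modelled in a few lines. This Python sketch uses a plain counter in place of the percpu ref, and all names are illustrative stand-ins.

```python
F_INIT_FAILED = 0x1   # stand-in for NVMET_TCP_F_INIT_FAILED


class Cmd:
    def __init__(self, init_ok):
        self.flags = 0 if init_ok else F_INIT_FAILED


class SqRef:
    """Plain counter standing in for the percpu reference."""

    def __init__(self):
        self.count = 1


def receive(ref, init_ok):
    if init_ok:
        ref.count += 1        # only a successful init takes a reference
    return Cmd(init_ok)


def on_digest_error(ref, cmd):
    # Guarded path: skip the put if init never took a reference.
    if not (cmd.flags & F_INIT_FAILED):
        ref.count -= 1


ref = SqRef()
on_digest_error(ref, receive(ref, init_ok=False))
count_after_failed_init = ref.count
on_digest_error(ref, receive(ref, init_ok=True))
count_after_good_init = ref.count
```

In both cases the counter returns to its starting value. Dropping the flag check would take it to zero in the failed-init case.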
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shivam Kumar <kumar.shivam43666@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Yuto Ohnuki [Mon, 16 Mar 2026 07:03:59 +0000 (07:03 +0000)]
blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:
- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered
syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.
wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.
Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.
selftests: ublk: test that teardown after incomplete recovery completes
Before the fix, teardown would loop forever if a ublk server that was
attempting to recover a device died after submitting a nonempty proper
subset of the fetch commands to any queue. Add a test to verify
that, after the fix, teardown completes. This is done by:
- Adding a new argument to the fault_inject target that causes it to die
after fetching a nonempty proper subset of the IOs to a queue
- Using that argument in a new test while trying to recover an
already-created device
- Attempting to delete the ublk device at the end of the test; this
hangs forever if teardown from the fault-injected ublk server never
completed.
It was manually verified that the test passes with the fix and hangs
without it.
If a ublk server starts recovering devices but dies before issuing fetch
commands for all IOs, cancellation of the fetch commands that were
successfully issued may never complete. This is because the per-IO
canceled flag can remain set even after the fetch for that IO has been
submitted - the per-IO canceled flags for all IOs in a queue are reset
together only once all IOs for that queue have been fetched. So if a
nonempty proper subset of the IOs for a queue are fetched when the ublk
server dies, the IOs in that subset will never successfully be canceled,
as their canceled flags remain set, and this prevents ublk_cancel_cmd
from actually calling io_uring_cmd_done on the commands, despite the
fact that they are outstanding.
Fix this by resetting the per-IO cancel flags immediately when each IO
is fetched instead of waiting for all IOs for the queue (which may never
happen).
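A simplified model of the fixed behaviour, written in Python for illustration. The Io class and its fields are hypothetical, and `done` stands in for the call to io_uring_cmd_done.

```python
class Io:
    def __init__(self):
        self.canceled = True    # left set by the previous server's teardown
        self.fetched = False
        self.done = False       # stands in for io_uring_cmd_done()


def fetch(io):
    io.fetched = True
    io.canceled = False         # the fix: reset per IO at fetch time


def cancel(io):
    if not io.fetched or io.canceled:
        return                  # nothing outstanding, or already canceled
    io.canceled = True
    io.done = True


queue = [Io() for _ in range(4)]
for io in queue[:2]:            # server dies after fetching a proper subset
    fetch(io)
for io in queue:
    cancel(io)
outstanding = [io for io in queue if io.fetched and not io.done]
```

If fetch() left the canceled flag untouched, as in the pre-fix behaviour, the two fetched IOs in this model would stay outstanding forever.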
Signed-off-by: Uday Shankar <ushankar@purestorage.com> Fixes: 728cbac5fe21 ("ublk: move device reset into ublk_ch_release()") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: zhang, the-essence-of-life <zhangweize9@gmail.com> Link: https://patch.msgid.link/20260405-cancel-v2-1-02d711e643c2@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mingzhe Zou [Fri, 3 Apr 2026 04:21:35 +0000 (12:21 +0800)]
bcache: fix uninitialized closure object
In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and
crash"), we adopted a simple modification suggestion from AI to fix the
use-after-free.
But in actual testing, we found an extreme case where the device is
stopped before calling bch_write_bdev_super().
At this point, struct closure sb_write has not been initialized yet.
For this patch, we ensure that sb_bio has been completed via
sb_write_mutex.
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@fnnas.com> Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com Fixes: fec114a98b87 ("bcache: fix cached_dev.sb_bio use-after-free and crash") Signed-off-by: Jens Axboe <axboe@kernel.dk>
After analyzing the coredump file, we found that the address of
dc->sb_bio has been freed. We know that cached_dev is only freed when it
is stopped.
sb_bio is embedded in struct cached_dev rather than allocated for each
write, so if the device is stopped while the superblock is being written,
the freed memory is accessed at endio.
This patch waits for sb_write to complete in cached_dev_free().
It should be noted that we analyzed the cause of the problem, then gave
all the details to QWEN and adopted the modifications it made.
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Fixes: cafe563591446 ("bcache: A block layer cache") Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Coly Li <colyli@fnnas.com> Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace sprintf() with sysfs_emit() in sysfs show functions.
sysfs_emit() is preferred for formatting sysfs output because it
provides safer bounds checking.
Ming Lei [Thu, 26 Mar 2026 14:40:58 +0000 (22:40 +0800)]
bio: fix kmemleak false positives from percpu bio alloc cache
When a bio is allocated from the mempool with REQ_ALLOC_CACHE set and
later completed, bio_put() places it into the per-cpu bio_alloc_cache
via bio_put_percpu_cache() instead of freeing it back to the
mempool/slab. The slab allocation remains tracked by kmemleak, but the
only reference to the bio is through the percpu cache's free_list,
which kmemleak fails to trace through percpu memory. This causes
kmemleak to report the cached bios as unreferenced objects.
Use symmetric kmemleak_free()/kmemleak_alloc() calls to properly track
bios across percpu cache transitions:
- bio_put_percpu_cache: call kmemleak_free() when a bio enters the
cache, unregistering it from kmemleak tracking.
- bio_alloc_percpu_cache: call kmemleak_alloc() when a bio is taken
from the cache for reuse, re-registering it so that genuine leaks
of reused bios remain detectable.
- __bio_alloc_cache_prune: call kmemleak_alloc() before bio_free() so
that kmem_cache_free()'s internal kmemleak_free() has a matching
allocation to pair with.
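The pairing rule in the three bullets can be illustrated with a small tracker model. This Python sketch uses a set in place of kmemleak's object tracking and a list in place of the per-cpu free_list; it is not the kernel implementation.

```python
tracked = set()   # objects the leak tracker currently knows about
cache = []        # stand-in for the per-cpu free_list


def kmemleak_alloc(obj):
    tracked.add(obj)


def kmemleak_free(obj):
    tracked.remove(obj)       # KeyError models an unpaired free


def bio_put_percpu_cache(bio):
    kmemleak_free(bio)        # cached bios are no longer tracked
    cache.append(bio)


def bio_alloc_percpu_cache():
    bio = cache.pop()
    kmemleak_alloc(bio)       # reused bios are tracked again
    return bio


def bio_alloc_cache_prune():
    while cache:
        bio = cache.pop()
        kmemleak_alloc(bio)   # re-register so the free below has a pair
        kmemleak_free(bio)    # models kmem_cache_free()'s internal call


kmemleak_alloc("bio0")                 # initial slab allocation
bio_put_percpu_cache("bio0")
tracked_while_cached = set(tracked)
reused = bio_alloc_percpu_cache()
tracked_after_reuse = set(tracked)
bio_put_percpu_cache(reused)
bio_alloc_cache_prune()
```

Every free in this model has a matching allocation, so the prune step never hits an untracked object.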
Jialin Wang [Tue, 31 Mar 2026 10:05:09 +0000 (10:05 +0000)]
blk-iocost: fix busy_level reset when no IOs complete
When a disk is saturated, it is common for no IOs to complete within a
timer period. Currently, in this case, rq_wait_pct and missed_ppm are
calculated as 0, so iocost incorrectly interprets this as meeting QoS
targets and resets busy_level to 0.
This reset prevents busy_level from reaching the threshold (4) needed
to reduce vrate. On certain cloud storage, such as Azure Premium SSD,
we observed that iocost may fail to reduce vrate for tens of seconds
during saturation, failing to mitigate noisy neighbor issues.
Fix this by tracking the number of IO completions (nr_done) in a period.
If nr_done is 0 and there are lagging IOs, the saturation status is
unknown, so we keep busy_level unchanged.
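The decision can be sketched as a small function. This Python model is heavily simplified: the real timer logic has more inputs and adjusts busy_level in more ways, and the parameter names other than nr_done and busy_level are assumptions.

```python
def next_busy_level(busy_level, nr_done, nr_lagging, qos_met):
    """Simplified per-period decision; not the actual iocost timer code."""
    if nr_done == 0 and nr_lagging > 0:
        return busy_level        # saturation status unknown: hold
    if qos_met:
        return 0                 # targets met: reset
    return busy_level + 1        # targets missed: move toward a vrate cut


# No completions in the period: percentages read as 0, so QoS looks "met".
held = next_busy_level(3, nr_done=0, nr_lagging=5, qos_met=True)
# Completions resume and show the targets are missed.
raised = next_busy_level(held, nr_done=10, nr_lagging=5, qos_met=False)
```

With the hold in place, busy_level in this model survives the empty period and can still reach the threshold of 4.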
The issue is consistently reproducible on Azure Standard_D8as_v5 (Dasv5)
VMs with 512GB Premium SSD (P20) using the script below. It was not
observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd), and no
regressions were found with this patch. In this script, cgA performs
large IOs with iodepth=128, while cgB performs small IOs with iodepth=1
rate_iops=100 rw=randrw. With iocost enabled, we expect it to throttle
cgA: the submission latency (slat) of cgA should be significantly higher,
cgB should reach 200 IOPS, and its completion latency (clat) should be low.
Jackie Liu [Tue, 31 Mar 2026 08:50:54 +0000 (16:50 +0800)]
blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()
Add the missing put_disk() on the error path in
blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or
blkg_tryget() fails, the function jumps to the out label which only
calls rcu_read_unlock() but does not release the disk reference acquired
by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk
is already set to NULL before the lookup, blkcg_exit() cannot release
this reference either, causing the disk to never be freed.
Restore the reference release that was present as blk_put_queue() in the
original code but was inadvertently dropped during the conversion from
request_queue to gendisk.
Fixes: f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 26 Mar 2026 20:32:45 +0000 (05:32 +0900)]
zloop: add max_open_zones option
Introduce the new max_open_zones option to allow specifying a limit on
the maximum number of open zones of a zloop device. This change allows
creating a zloop device that can more closely mimic the characteristics
of a physical SMR drive.
When set to a nonzero value, only up to max_open_zones zones can be in
the implicit open (BLK_ZONE_COND_IMP_OPEN) and explicit open
(BLK_ZONE_COND_EXP_OPEN) conditions at any time. The transition to the
implicit open condition of a zone on a write operation can result in an
implicit close of an already implicitly open zone. This is handled in
the function zloop_do_open_zone(). This function also handles
transitions to the explicit open condition. Implicit close transitions
are handled using an LRU ordered list of open zones which is managed
using the helper functions zloop_lru_rotate_open_zone() and
zloop_lru_remove_open_zone().
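The LRU handling can be pictured with a toy model. This Python sketch is an illustration under assumptions: the class and method names are hypothetical, and refusing the open when every open zone is explicitly open is an assumed policy, not taken from the driver.

```python
from collections import OrderedDict


class OpenZones:
    """Toy LRU of open zones with an optional max_open_zones limit."""

    def __init__(self, max_open_zones):
        self.max_open = max_open_zones   # 0 means no limit
        self.lru = OrderedDict()         # zone -> "imp" or "exp", oldest first

    def open_zone(self, zone, explicit):
        if zone in self.lru:
            if explicit:
                self.lru[zone] = "exp"   # implicit open upgraded to explicit
            self.lru.move_to_end(zone)   # rotate: most recently used
            return True
        if self.max_open and len(self.lru) >= self.max_open:
            victim = next((z for z, c in self.lru.items() if c == "imp"), None)
            if victim is None:
                return False             # assumed: nothing can be closed
            del self.lru[victim]         # implicit close of the LRU zone
        self.lru[zone] = "exp" if explicit else "imp"
        return True


zones = OpenZones(max_open_zones=2)
zones.open_zone(0, explicit=False)
zones.open_zone(1, explicit=True)
zones.open_zone(0, explicit=False)       # write to zone 0 again: rotate
opened = zones.open_zone(2, explicit=False)
open_now = list(zones.lru)
```

Zone 0 is the only implicitly open zone when zone 2 is written, so it is the one closed and the open set ends up as zones 1 and 2.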
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260326203245.946830-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jackie Liu [Tue, 31 Mar 2026 11:12:16 +0000 (19:12 +0800)]
block: fix zones_cond memory leak on zone revalidation error paths
When blk_revalidate_disk_zones() fails after disk_revalidate_zone_resources()
has allocated args.zones_cond, the memory is leaked because no error path
frees it.
Fixes: 6e945ffb6555 ("block: use zone condition to determine conventional zones") Suggested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Link: https://patch.msgid.link/20260331111216.24242-1-liu.yun@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
Daan De Meyer [Tue, 31 Mar 2026 10:51:28 +0000 (10:51 +0000)]
loop: fix partition scan race between udev and loop_reread_partitions()
When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following
sequence occurs:
1. disk_force_media_change() sets GD_NEED_PART_SCAN
2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
3. loop_global_unlock() releases the lock
4. loop_reread_partitions() calls bdev_disk_changed() to scan
There is a race between steps 2 and 4: when udev receives the uevent
and opens the device before loop_reread_partitions() runs,
blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls
bdev_disk_changed() for a first scan. Then loop_reread_partitions()
does a second scan. The open_mutex serializes these two scans, but
does not prevent both from running.
The second scan in bdev_disk_changed() drops all partition devices
from the first scan (via blk_drop_partitions()) before re-adding
them, causing partition block devices to briefly disappear. This
breaks any systemd unit with BindsTo= on the partition device: systemd
observes the device going dead, fails the dependent units, and does
not retry them when the device reappears.
Fix this by removing the GD_NEED_PART_SCAN set from
disk_force_media_change() entirely. None of the current callers need
the lazy on-open partition scan triggered by this flag:
- floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always
false and GD_NEED_PART_SCAN has no effect.
- loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is
set, loop_reread_partitions() performs an explicit scan. When not
set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
- loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if
LO_FLAGS_PARTSCAN is set.
- nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately
after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.
With GD_NEED_PART_SCAN no longer set by disk_force_media_change(),
udev opening the loop device after the uevent no longer triggers a
redundant scan in blkdev_get_whole(), and only the single explicit
scan from loop_reread_partitions() runs.
A regression test for this bug has been submitted to blktests:
https://github.com/linux-blktests/blktests/pull/240.
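The race described in this commit can be illustrated with a toy model. This is only a sketch: the `Disk` class and `configure()` function are hypothetical stand-ins written for this note, not kernel interfaces, and the model only counts how often already-created partitions get dropped.

```python
class Disk:
    """Toy stand-in for a gendisk; not a kernel structure."""

    def __init__(self):
        self.need_part_scan = False   # models GD_NEED_PART_SCAN
        self.partitions = []
        self.drops = 0                # times existing partitions were torn down

    def scan(self):
        # Models bdev_disk_changed(): drop existing partitions, then re-add.
        if self.partitions:
            self.drops += 1
        self.partitions = ["p1", "p2"]
        self.need_part_scan = False


def configure(disk, set_flag_on_media_change, udev_opens_early):
    # Step 1: disk_force_media_change(); the fix removes the flag set.
    if set_flag_on_media_change:
        disk.need_part_scan = True
    # Steps 2-3: the uevent is sent; udev may open the device before step 4.
    if udev_opens_early and disk.need_part_scan:
        disk.scan()                   # lazy on-open scan
    # Step 4: loop_reread_partitions() always scans explicitly.
    disk.scan()
    return disk.drops
```

With the flag set and an early open, the model reports one drop of live partitions; without the flag, the single explicit scan never drops anything.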
Milan Broz [Tue, 10 Mar 2026 09:53:49 +0000 (10:53 +0100)]
sed-opal: Add STACK_RESET command
The TCG Opal device could enter a state where no new session can be
created, blocking even Discovery or PSID reset. While a power cycle
or waiting for the timeout should work, there is another possibility
for recovery: using the Stack Reset command.
The Stack Reset command is defined in the TCG Storage Architecture Core
Specification and is mandatory for all Opal devices (see Section 3.3.6
of the Opal SSC specification).
This patch implements the Stack Reset command. Sending it should clear
all active sessions immediately, allowing subsequent commands to run
successfully. While it is a TCG transport layer command, the Linux
kernel implements only Opal ioctls, so it makes sense to use the
IOC_OPAL ioctl interface.
The Stack Reset takes no arguments; the response can be success or pending.
If the command reports a pending state, userspace can try to repeat it;
in this case, the code returns -EBUSY.
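A userspace caller might handle the pending state roughly as follows. This is a minimal sketch: `send_stack_reset` is a hypothetical callable that wraps the ioctl and raises `OSError` with `errno.EBUSY` while the reset is pending; the retry count and delay are arbitrary choices, not values from the patch.

```python
import errno
import time


def stack_reset_with_retry(send_stack_reset, attempts=5, delay=0.0):
    """Call send_stack_reset() until it stops failing with EBUSY.

    send_stack_reset is a hypothetical wrapper around the ioctl; it is
    expected to raise OSError(errno.EBUSY) while the reset is pending.
    Returns the number of calls that were needed.
    """
    for attempt in range(attempts):
        try:
            send_stack_reset()
            return attempt + 1
        except OSError as exc:
            if exc.errno != errno.EBUSY:
                raise                 # any other error is not retried
            time.sleep(delay)
    raise TimeoutError("Stack Reset still pending after %d attempts" % attempts)
```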
Jens Axboe [Fri, 27 Mar 2026 15:51:17 +0000 (09:51 -0600)]
Merge tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme into for-7.1/block
Pull NVMe updates from Keith:
"- Fabrics authentication updates (Eric, Alistar)
- Enhanced block queue limits support (Caleb)
- Workqueue usage updates (Marco)
- A new write zeroes device quirk (Robert)
- Tagset cleanup fix for loop device (Nilay)"
* tag 'nvme-7.1-2026-03-27' of git://git.infradead.org/nvme: (41 commits)
nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
nvme: add WQ_PERCPU to alloc_workqueue users
nvmet-fc: add WQ_PERCPU to alloc_workqueue users
nvmet: replace use of system_wq with system_percpu_wq
nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
nvme: Add the DHCHAP maximum HD IDs
nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
nvmet: report NPDGL and NPDAL
nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
nvme: set discard_granularity from NPDG/NPDA
nvme: add from0based() helper
nvme: always issue I/O Command Set specific Identify Namespace
nvme: update nvme_id_ns OPTPERF constants
nvme: fold nvme_config_discard() into nvme_update_disk_info()
nvme: add preferred I/O size fields to struct nvme_id_ns_nvm
nvme: Allow reauth from sysfs
nvme: Expose the tls_configured sysfs for secure concat connections
nvmet-tcp: Don't free SQ on authentication success
nvmet-tcp: Don't error if TLS is enabed on a reset
...
Nilay Shroff [Fri, 13 Mar 2026 11:38:48 +0000 (17:08 +0530)]
nvme-loop: do not cancel I/O and admin tagset during ctrl reset/shutdown
Cancelling the I/O and admin tagsets during nvme-loop controller reset
or shutdown is unnecessary. The subsequent destruction of the I/O and
admin queues already waits for all in-flight target operations to
complete.
Cancelling the tagsets first also opens a race window. After a request
tag has been cancelled, a late completion from the target may still
arrive before the queues are destroyed. In that case the completion path
may access a request whose tag has already been cancelled or freed,
which can lead to a kernel crash. Please see below the kernel crash
encountered while running blktests nvme/040:
Since the queue teardown path already guarantees that all target-side
operations have completed, cancelling the tagsets is redundant and
unsafe. So avoid cancelling the I/O and admin tagsets during controller
reset and shutdown.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Marco Crivellari [Mon, 23 Feb 2026 10:23:28 +0000 (11:23 +0100)]
nvme: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Marco Crivellari [Mon, 23 Feb 2026 10:23:29 +0000 (11:23 +0100)]
nvmet-fc: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Cc: Justin Tee <justin.tee@broadcom.com> Cc: Naresh Gottumukkala <nareshgottumukkala83@gmail.com> CC: Paul Ely <paul.ely@broadcom.com> Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Marco Crivellari [Mon, 23 Feb 2026 10:23:27 +0000 (11:23 +0100)]
nvmet: replace use of system_wq with system_percpu_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that can happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
Alistair Francis [Fri, 20 Mar 2026 00:20:45 +0000 (10:20 +1000)]
nvme-auth: Don't propose NVME_AUTH_DHGROUP_NULL with SC_C
Section 8.3.4.5.2 of the NVMe 2.1 base spec states that
"""
The 00h identifier shall not be proposed in an AUTH_Negotiate message
that requests secure channel concatenation (i.e., with the SC_C field
set to a non-zero value).
"""
We need to ensure that we don't set the NVME_AUTH_DHGROUP_NULL idlist if
SC_C is set.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
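The rule quoted in the commit above amounts to filtering the proposed DH group list. The sketch below assumes the identifier 00h corresponds to `NVME_AUTH_DHGROUP_NULL`, as the commit title implies; the function name and the list representation are illustrative, not the driver's code.

```python
NVME_AUTH_DHGROUP_NULL = 0x00   # the "00h identifier" from the spec quote


def proposed_dhgroups(supported, sc_c):
    """Return the DH group IDs to propose in AUTH_Negotiate.

    supported: IDs the host supports, in preference order.
    sc_c: value of the SC_C field; non-zero requests secure channel
          concatenation, in which case the NULL group must not be proposed.
    """
    if sc_c != 0:
        return [g for g in supported if g != NVME_AUTH_DHGROUP_NULL]
    return list(supported)
```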
Alistair Francis [Fri, 20 Mar 2026 00:20:44 +0000 (10:20 +1000)]
nvme: Add the DHCHAP maximum HD IDs
In preparation for using DHCHAP length in upcoming host and target
patches let's add the hash and Diffie-Hellman ID length macros.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yunje Shin <ioerts@kookmin.ac.kr> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chris Leech <cleech@redhat.com> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Robert Beckett [Fri, 20 Mar 2026 19:22:09 +0000 (19:22 +0000)]
nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4
The Kingston OM3SGP42048K2-A00 (PCI ID 2646:502f) firmware has a race
condition when processing concurrent write zeroes and DSM (discard)
commands, causing spurious "LBA Out of Range" errors and IOMMU page
faults at address 0x0.
The issue is reliably triggered by running two concurrent mkfs commands
on different partitions of the same drive, which generates interleaved
write zeroes and discard operations.
Disable write zeroes for this device, matching the pattern used for
other Kingston OM* drives that have similar firmware issues.
Cc: stable@vger.kernel.org Signed-off-by: Robert Beckett <bob.beckett@collabora.com> Assisted-by: claude-opus-4-6-v1 Signed-off-by: Keith Busch <kbusch@kernel.org>
Robert Beckett [Fri, 20 Mar 2026 19:22:08 +0000 (19:22 +0000)]
nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set
The NVM Command Set Identify Controller data may report a non-zero
Write Zeroes Size Limit (wzsl). When present, nvme_init_non_mdts_limits()
unconditionally overrides max_zeroes_sectors from wzsl, even if
NVME_QUIRK_DISABLE_WRITE_ZEROES previously set it to zero.
This effectively re-enables write zeroes for devices that need it
disabled, defeating the quirk. Several Kingston OM* drives rely on
this quirk to avoid firmware issues with write zeroes commands.
Check for the quirk before applying the wzsl override.
Fixes: 5befc7c26e5a ("nvme: implement non-mdts command limits") Cc: stable@vger.kernel.org Signed-off-by: Robert Beckett <bob.beckett@collabora.com> Assisted-by: claude-opus-4-6-v1 Signed-off-by: Keith Busch <kbusch@kernel.org>
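The check added by the wzsl commit above can be modelled in a few lines. This is a sketch under assumptions: the quirk bit value is made up for illustration, and the conversion treats wzsl as a power-of-two exponent in units of the minimum memory page size (4 KiB pages by default here).

```python
QUIRK_DISABLE_WRITE_ZEROES = 1 << 0   # illustrative bit, not the kernel's value


def max_zeroes_sectors(current, wzsl, quirks, page_shift=12):
    """Return the write-zeroes limit in 512-byte sectors.

    current: limit already chosen (0 if the quirk disabled write zeroes).
    wzsl: Write Zeroes Size Limit exponent; 0 means not reported.
    """
    if wzsl and not (quirks & QUIRK_DISABLE_WRITE_ZEROES):
        return 1 << (wzsl + page_shift - 9)
    return current
```

With the quirk set, a non-zero wzsl no longer overrides the zero limit; without it, the wzsl-derived value is used as before.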
nvmet: report NPDGL and NPDAL
A block device with a very large discard_granularity queue limit may not
be able to report it in the 16-bit NPDG and NPDA fields in the Identify
Namespace data structure. For this reason, version 2.1 of the NVMe specs
added 32-bit fields NPDGL and NPDAL to the NVM Command Set Specific
Identify Namespace structure. So report the discard_granularity there
too and set OPTPERF to 11b to indicate those fields are supported.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet: use NVME_NS_FEAT_OPTPERF_SHIFT
Use the NVME_NS_FEAT_OPTPERF_SHIFT constant in nvmet_bdev_set_limits()
to set the OPTPERF bits of the nvme_id_ns NSFEAT field instead of the
magic number 4.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme: set discard_granularity from NPDG/NPDA
Currently, nvme_config_discard() always sets the discard_granularity
queue limit to the logical block size. However, NVMe namespaces can
advertise a larger preferred discard granularity in the NPDG or NPDA
field of the Identify Namespace structure or the NPDGL or NPDAL fields
of the I/O Command Set Specific Identify Namespace structure.
Use these fields to compute the discard_granularity limit. The logic is
somewhat involved. First, the fields are optional. NPDG is only reported
if the low bit of OPTPERF is set in NSFEAT. NPDA is reported if any bit
of OPTPERF is set. And NPDGL and NPDAL are reported if the high bit of
OPTPERF is set. NPDGL and NPDAL can also each be set to 0 to opt out of
reporting a limit. I/O Command Set Specific Identify Namespace may also
not be supported by older NVMe controllers. Another complication is that
multiple values may be reported among NPDG, NPDGL, NPDA, and NPDAL. The
spec says to prefer the values reported in the L variants. The spec says
NPDG should be a multiple of NPDA and NPDGL should be a multiple of
NPDAL, but it doesn't specify a relationship between NPDG and NPDAL or
NPDGL and NPDA. So use the maximum of the reported NPDG(L) and NPDA(L)
values as the discard_granularity.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
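The selection rules in the discard_granularity commit above can be summarised as a small function. This is only a model of the description, not the driver code. It assumes the 16-bit NPDG and NPDA fields are 0's based counts of logical blocks, and that NPDGL and NPDAL are plain counts of logical blocks where 0 means "not reported".

```python
def discard_granularity(lba_size, optperf, npdg=0, npda=0, npdgl=0, npdal=0):
    """Return a discard granularity in bytes.

    optperf: 2-bit OPTPERF value from NSFEAT.
    npdg, npda: 16-bit 0's based fields (encoded value + 1 logical blocks).
    npdgl, npdal: 32-bit fields in logical blocks; 0 means "not reported".
    """
    granularity = 0
    alignment = 0
    if optperf & 0b10:                       # high bit: L variants may be valid
        granularity = npdgl
        alignment = npdal
    if not granularity and optperf & 0b01:   # low bit: NPDG is valid
        granularity = npdg + 1
    if not alignment and optperf:            # any bit: NPDA is valid
        alignment = npda + 1
    blocks = max(granularity, alignment, 1)  # fall back to one logical block
    return blocks * lba_size
```

For example, with 512-byte blocks, OPTPERF = 01b, NPDG = 7 and NPDA = 3, the model picks max(8, 4) blocks, i.e. 4096 bytes.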
nvme: add from0based() helper
The NVMe specifications are big fans of "0's based"/"0-based" fields for
encoding values that must be positive. The encoded value is 1 less than
the value it represents. nvmet already provides a helper to0based() for
encoding 0's based values, so add a corresponding helper to decode these
fields on the host side.
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
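The encoding described in the from0based() commit above is simple enough to show directly. The sketch mirrors the helper names from the commit message, but the bodies are illustrative; in particular, raising on values below 1 is a choice made for this example and is not claimed to match what the kernel helper does.

```python
def to0based(value):
    """Encode a positive count as a 0's based field (stored value = count - 1)."""
    if value < 1:
        raise ValueError("0's based fields cannot encode values below 1")
    return value - 1


def from0based(encoded):
    """Decode a 0's based field back to the count it represents."""
    return encoded + 1
```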
nvme: always issue I/O Command Set specific Identify Namespace
Currently, the I/O Command Set specific Identify Namespace structure is
only fetched for controllers that support extended LBA formats. This is
because struct nvme_id_ns_nvm is only used by nvme_configure_pi_elbas(),
which is only called when the ELBAS bit is set in the CTRATT field of
the Identify Controller structure.
However, the I/O Command Set specific Identify Namespace structure will
soon be used in nvme_update_disk_info(), so always try to obtain it in
nvme_update_ns_info_block(). This Identify structure is first defined in
NVMe spec version 2.0, but controllers reporting older versions could
still implement it.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme: update nvme_id_ns OPTPERF constants
In NVMe version 2.0 and below, OPTPERF comprises only bit 4 of NSFEAT in
the Identify Namespace structure. Since version 2.1, OPTPERF includes
both bits 4 and 5 of NSFEAT. Replace the NVME_NS_FEAT_IO_OPT constant
with NVME_NS_FEAT_OPTPERF_SHIFT, NVME_NS_FEAT_OPTPERF_MASK, and
NVME_NS_FEAT_OPTPERF_MASK_2_1, representing the first bit, pre-2.1 bit
width, and post-2.1 bit width of OPTPERF.
Update nvme_update_disk_info() to check both OPTPERF bits for
controllers that report version 2.1 or newer, as NPWG and NOWS are
supported even if only bit 5 is set.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
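The bit layout described in the OPTPERF commit above can be sketched as follows. The constant names come from the commit message; their numeric values are inferred from "bit 4" and "bits 4 and 5" and should be treated as assumptions, as should the version encoding used by the hypothetical `nvme_vs()` helper.

```python
NVME_NS_FEAT_OPTPERF_SHIFT = 4        # OPTPERF starts at bit 4 of NSFEAT
NVME_NS_FEAT_OPTPERF_MASK = 0x1       # one bit wide up to NVMe 2.0
NVME_NS_FEAT_OPTPERF_MASK_2_1 = 0x3   # two bits wide from NVMe 2.1


def nvme_vs(major, minor, tertiary=0):
    """Pack a version number so that versions compare as integers."""
    return (major << 16) | (minor << 8) | tertiary


def optperf(nsfeat, version):
    """Extract OPTPERF from NSFEAT, honouring the controller's version."""
    if version >= nvme_vs(2, 1):
        mask = NVME_NS_FEAT_OPTPERF_MASK_2_1
    else:
        mask = NVME_NS_FEAT_OPTPERF_MASK
    return (nsfeat >> NVME_NS_FEAT_OPTPERF_SHIFT) & mask
```

A namespace that sets only bit 5 is seen as OPTPERF = 10b on a 2.1 controller and as 0 on an older one, which is why the version check matters.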
nvme: fold nvme_config_discard() into nvme_update_disk_info()
The choice of what queue limits are set in nvme_update_disk_info() vs.
nvme_config_discard() seems a bit arbitrary. A subsequent commit will
compute the discard_granularity limit using struct nvme_id_ns, which is
only passed to nvme_update_disk_info() currently. So move the logic in
nvme_config_discard() to nvme_update_disk_info(). Replace several
instances of ns->ctrl in nvme_update_disk_info() with the ctrl variable
brought from nvme_config_discard().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme: add preferred I/O size fields to struct nvme_id_ns_nvm
A subsequent change will use the NPDGL and NPDAL fields of the NVM
Command Set Specific Identify Namespace structure, so add them (and the
handful of intervening fields) to struct nvme_id_ns_nvm. Add an
assertion that the size is still 4 KB.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
In order to use the new keys for the admin queue we call controller
reset. This isn't ideal, but I can't find a simpler way to reset the
admin queue TLS connection.
Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet-tcp: Don't free SQ on authentication success
Currently, after the host sends a REPLACETLSPSK we free the TLS keys as
part of calling nvmet_auth_sq_free() on success. This means when the
host sends a follow-up REPLACETLSPSK we return CONCAT_MISMATCH as the
check for !nvmet_queue_tls_keyid(req->sq) fails.
This patch ensures we don't free the TLS key on success as we might need
it again in the future.
Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet-tcp: Don't error if TLS is enabed on a reset
If the host sends an AUTH_Negotiate Message on the admin queue with
REPLACETLSPSK set then we expect and require a TLS connection and
shouldn't report an error if TLS is enabled.
This change only enforces the nvmet_queue_tls_keyid() check if we aren't
resetting the negotiation.
Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:59 +0000 (23:59 -0800)]
crypto: remove HKDF library
Remove crypto/hkdf.c, since it's no longer used. Originally it had two
users, but now both of them just inline the needed HMAC computations
using the HMAC library APIs. That ends up being better, since it
eliminates all the complexity and performance issues associated with the
crypto_shash abstraction and multi-step HMAC input formatting.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:58 +0000 (23:59 -0800)]
nvme-auth: common: remove selections of no-longer used crypto modules
Now that nvme-auth uses the crypto library instead of crypto_shash,
remove obsolete selections from the NVME_AUTH kconfig option.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:57 +0000 (23:59 -0800)]
nvme-auth: common: remove nvme_auth_digest_name()
Since nvme_auth_digest_name() is no longer used, remove it and the
associated data from the hash_map array.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:56 +0000 (23:59 -0800)]
nvme-auth: target: use crypto library in nvmet_auth_ctrl_hash()
For the HMAC computation in nvmet_auth_ctrl_hash(), use the crypto
library instead of crypto_shash. This is simpler, faster, and more
reliable. Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:55 +0000 (23:59 -0800)]
nvme-auth: target: use crypto library in nvmet_auth_host_hash()
For the HMAC computation in nvmet_auth_host_hash(), use the crypto
library instead of crypto_shash. This is simpler, faster, and more
reliable. Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Since nvme-auth is now doing its HMAC computations using the crypto
library, it's guaranteed that all the algorithms actually work.
Therefore, remove the crypto_has_shash() checks which are now obsolete.
However, the caller in nvmet_auth_negotiate() seems to have also been
relying on crypto_has_shash(nvme_auth_hmac_name(host_hmac_id)) to
validate the host_hmac_id. Therefore, make it validate the ID more
directly by checking whether nvme_auth_hmac_hash_len() returns 0 or not.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
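The validation approach in the commit above, treating a hash length of 0 as "unknown ID", can be sketched like this. The ID-to-length table assumes IDs 1, 2 and 3 map to SHA-256, SHA-384 and SHA-512; the function names are illustrative rather than the kernel's.

```python
# Assumed mapping of HMAC IDs to digest sizes in bytes.
HASH_LEN_BY_ID = {1: 32, 2: 48, 3: 64}   # SHA-256, SHA-384, SHA-512


def hmac_hash_len(hmac_id):
    """Return the digest size for a known ID, or 0 for an unknown one."""
    return HASH_LEN_BY_ID.get(hmac_id, 0)


def host_hmac_id_is_valid(hmac_id):
    """Validate the ID directly instead of probing for an algorithm."""
    return hmac_hash_len(hmac_id) != 0
```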
Eric Biggers [Mon, 2 Mar 2026 07:59:53 +0000 (23:59 -0800)]
nvme-auth: host: remove allocation of crypto_shash
Now that the crypto_shash that is being allocated in
nvme_auth_process_dhchap_challenge() and stored in the
struct nvme_dhchap_queue_context is no longer used, remove it.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:52 +0000 (23:59 -0800)]
nvme-auth: host: use crypto library in nvme_auth_dhchap_setup_ctrl_response()
For the HMAC computation in nvme_auth_dhchap_setup_ctrl_response(), use
the crypto library instead of crypto_shash. This is simpler, faster,
and more reliable.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:51 +0000 (23:59 -0800)]
nvme-auth: host: use crypto library in nvme_auth_dhchap_setup_host_response()
For the HMAC computation in nvme_auth_dhchap_setup_host_response(), use
the crypto library instead of crypto_shash. This is simpler, faster,
and more reliable.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:50 +0000 (23:59 -0800)]
nvme-auth: common: use crypto library in nvme_auth_derive_tls_psk()
For the HKDF-Expand-Label computation in nvme_auth_derive_tls_psk(), use
the crypto library instead of crypto_shash and crypto/hkdf.c.
While this means the HKDF "helper" functions are no longer utilized,
they clearly weren't buying us much: it's simpler to just inline the
HMAC computations directly, and this code needs to be tested anyway. (A
similar result was seen in fs/crypto/.) As a result, this eliminates the
last user of crypto/hkdf.c, which we'll be able to remove as well.
As usual this is also a lot more efficient, eliminating the allocation
of a transformation object and multiple other dynamic allocations.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
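To show why inlining is practical, here is a minimal sketch of HKDF-Expand-Label as defined for TLS 1.3, restricted to outputs no longer than one hash block so that it reduces to a single HMAC call. It uses Python's standard library rather than the kernel crypto library, and the labels and contexts that nvme_auth_derive_tls_psk() actually passes are not shown in this log, so none are assumed here.

```python
import hashlib
import hmac
import struct


def hkdf_expand_label(prk, label, context, length, hash_name="sha256"):
    """HKDF-Expand-Label for length <= hash output size (one HMAC block)."""
    digest_size = hashlib.new(hash_name).digest_size
    if length > digest_size:
        raise ValueError("this sketch only produces a single HMAC block")
    full_label = b"tls13 " + label
    # HkdfLabel: uint16 length, then length-prefixed label and context.
    info = (struct.pack(">H", length)
            + bytes([len(full_label)]) + full_label
            + bytes([len(context)]) + context)
    # HKDF-Expand with one block: T(1) = HMAC(PRK, info || 0x01)
    block = hmac.new(prk, info + b"\x01", hash_name).digest()
    return block[:length]
```

Because the requested length never exceeds the digest size in this sketch, the usual HKDF-Expand loop over T(1), T(2), ... is not needed.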
Eric Biggers [Mon, 2 Mar 2026 07:59:49 +0000 (23:59 -0800)]
nvme-auth: common: use crypto library in nvme_auth_generate_digest()
For the HMAC computation in nvme_auth_generate_digest(), use the crypto
library instead of crypto_shash. This is simpler, faster, and more
reliable. Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:48 +0000 (23:59 -0800)]
nvme-auth: common: use crypto library in nvme_auth_generate_psk()
For the HMAC computation in nvme_auth_generate_psk(), use the crypto
library instead of crypto_shash. This is simpler, faster, and more
reliable. Notably, this eliminates the crypto transformation object
allocation for every call, which was very slow.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:47 +0000 (23:59 -0800)]
nvme-auth: common: use crypto library in nvme_auth_augmented_challenge()
For the hash and HMAC computations in nvme_auth_augmented_challenge(),
use the crypto library instead of crypto_shash. This is simpler,
faster, and more reliable. Notably, this eliminates two crypto
transformation object allocations for every call, which was very slow.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
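For reference, the nvme-auth session-key commit earlier in this log gives the augmented challenge as Ca = HMAC(H(g^xy mod p), C). A minimal sketch of that "hash, then HMAC" computation, using Python's standard library instead of the kernel crypto library and an illustrative function signature, looks like this:

```python
import hashlib
import hmac


def augmented_challenge(dh_shared_secret, challenge, hash_name="sha256"):
    """Compute Ca = HMAC(H(g^xy mod p), C) for the selected hash."""
    hashed_key = hashlib.new(hash_name, dh_shared_secret).digest()
    return hmac.new(hashed_key, challenge, hash_name).digest()
```

When the shared secret is longer than the hash's block size, HMAC hashes the key internally, so `hmac.new(dh_shared_secret, challenge, hash_name)` yields the same bytes. That is the coincidence the session-key commit describes for PSK generation.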
Eric Biggers [Mon, 2 Mar 2026 07:59:46 +0000 (23:59 -0800)]
nvme-auth: common: use crypto library in nvme_auth_transform_key()
For the HMAC computation in nvme_auth_transform_key(), use the crypto
library instead of crypto_shash. This is simpler, faster, and more
reliable. Notably, this eliminates the transformation object allocation
for every call, which was very slow.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:45 +0000 (23:59 -0800)]
nvme-auth: common: add HMAC helper functions
Add some helper functions for computing HMAC-SHA256, HMAC-SHA384, or
HMAC-SHA512 values using the crypto library instead of crypto_shash.
These will enable some significant simplifications and performance
improvements in nvme-auth.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme_auth_derive_tls_psk() is always called with psk_len == hash_len.
And based on the comments above nvme_auth_generate_psk() and
nvme_auth_derive_tls_psk(), this isn't an implementation choice but
rather just the length the spec uses. Add a check which makes this
explicit, so that when cleaning up nvme_auth_derive_tls_psk() we don't
have to retain support for arbitrary values of psk_len.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:43 +0000 (23:59 -0800)]
nvme-auth: rename nvme_auth_generate_key() to nvme_auth_parse_key()
This function does not generate a key. It parses the key from the
string that the caller passes in.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:42 +0000 (23:59 -0800)]
nvme-auth: common: add KUnit tests for TLS key derivation
Unit-test the sequence of function calls that derive tls_psk, so that we
can be more confident that changes in the implementation don't break it.
Since the NVMe specification doesn't seem to include any test vectors
for this (nor does its description of the algorithm seem to match what
was actually implemented, for that matter), I just set the expected
values to the values that the code currently produces. In the case
of SHA-512, nvme_auth_generate_digest() currently returns -EINVAL, so
for now the test tests for that too. If it is later determined that
some other behavior is needed, the test can be updated accordingly.
Tested with:
tools/testing/kunit/kunit.py run --kunitconfig drivers/nvme/common/
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:41 +0000 (23:59 -0800)]
nvme-auth: use proper argument types
For input parameters, use pointer to const. This makes it easier to
understand which parameters are inputs and which are outputs.
In addition, consistently use char for strings and u8 for binary. This
makes it easier to understand what is a string and what is binary data.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:40 +0000 (23:59 -0800)]
nvme-auth: common: constify static data
Fully constify the dhgroup_map and hash_map arrays. Remove 'const' from
individual fields, as it is now redundant.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Eric Biggers [Mon, 2 Mar 2026 07:59:39 +0000 (23:59 -0800)]
nvme-auth: add NVME_AUTH_MAX_DIGEST_SIZE constant
Define a NVME_AUTH_MAX_DIGEST_SIZE constant and use it in the
appropriate places.
Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Every doit handler followed the same pattern: stack-allocate an
adm_ctx, call drbd_adm_prepare() at the top, call drbd_adm_finish()
at the bottom. This duplicated boilerplate across 25 handlers and
made error paths inconsistent, since some handlers could miss sending
the reply skb on early-exit paths.
The generic netlink framework already provides pre_doit/post_doit
hooks for exactly this purpose. An old comment even noted "this
would be a good candidate for a pre_doit hook".
Use them:
- pre_doit heap-allocates adm_ctx, looks up per-command flags from a
new drbd_genl_cmd_flags[] table, runs drbd_adm_prepare(), and
stores the context in info->user_ptr[0].
- post_doit sends the reply, drops kref references for
device/connection/resource, and frees the adm_ctx.
- Handlers just receive adm_ctx from info->user_ptr[0], set
reply_dh->ret_code, and return. All teardown is in post_doit.
- drbd_adm_finish() is removed, superseded by post_doit.
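The hook pattern listed above can be modelled outside the kernel. This sketch is a plain-Python analogy, not generic netlink code: `run_command()` stands in for the framework's dispatch, and the dictionary-based `info` and context objects are illustrative.

```python
def run_command(handler, pre_doit, post_doit, info):
    """Wrap a handler with setup and teardown hooks."""
    err = pre_doit(info)
    if err:
        return err                  # handler and post_doit are skipped
    try:
        return handler(info)
    finally:
        post_doit(info)             # teardown runs on every handler exit path


def pre_doit(info):
    # Allocate the per-command context and stash it for the handler.
    info["user_ptr"] = [{"ret_code": 0}]
    return 0


def post_doit(info):
    # Send the reply and release the context in one place.
    ctx = info["user_ptr"][0]
    info["log"].append(("reply", ctx["ret_code"]))
    info["user_ptr"] = [None]


def handler(info):
    # A handler only sets the return code; it never does teardown itself.
    info["user_ptr"][0]["ret_code"] = 101
    return 0
```

Since the reply is sent from a single place, an early return in a handler cannot skip it, which is the inconsistency the commit message mentions.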
Add a new option that causes zloop to truncate the zone files to the
write pointer value recorded at the last cache flush to simulate
unclean shutdowns.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260323071156.2940772-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split out two helper functions to make the function more readable and
to avoid conditional locking.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260323071156.2940772-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Vasily Gorbik [Sun, 22 Mar 2026 02:35:10 +0000 (03:35 +0100)]
block: fix bio_alloc_bioset slowpath GFP handling
bio_alloc_bioset() first strips __GFP_DIRECT_RECLAIM from the optimistic
fast allocation attempt with try_alloc_gfp(). If that fast path fails,
the slowpath checks saved_gfp to decide whether blocking allocation is
allowed, but then still calls mempool_alloc() with the stripped gfp mask.
That can lead to a NULL bio pointer being passed into bio_init().
Fix the slowpath by using saved_gfp for the bio and bvec mempool
allocations.
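A user-space model of the mask mix-up, for illustration only. The flag values, the `model_*` names and the pool behaviour (an exhausted pool succeeds only when the caller may block) are assumptions of this sketch, not the block layer code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bits; the real GFP values differ. */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_IO             0x2u

static int model_object;

/*
 * Model of a mempool-style allocator: when the pool is exhausted it can
 * only succeed if the caller allows blocking (direct reclaim).
 */
static void *model_pool_alloc(unsigned int gfp, bool pool_empty)
{
	if (pool_empty && !(gfp & MODEL_GFP_DIRECT_RECLAIM))
		return NULL;
	return &model_object;
}

/* Mask used for the optimistic attempt: never allowed to block. */
static unsigned int model_try_alloc_gfp(unsigned int gfp)
{
	return gfp & ~MODEL_GFP_DIRECT_RECLAIM;
}

/*
 * Slowpath model; the fast path is assumed to have failed already.
 * use_saved_gfp == false models the bug, true models the fix.
 */
static void *model_bio_alloc(unsigned int saved_gfp, bool pool_empty,
			     bool use_saved_gfp)
{
	unsigned int gfp = model_try_alloc_gfp(saved_gfp);

	/* The stripped mask can never permit blocking. */
	assert(!(gfp & MODEL_GFP_DIRECT_RECLAIM));

	/* The decision to enter the slowpath is made on the saved mask... */
	if (!(saved_gfp & MODEL_GFP_DIRECT_RECLAIM))
		return NULL;

	/* ...so the allocation itself must use the saved mask as well. */
	return model_pool_alloc(use_saved_gfp ? saved_gfp : gfp, pool_empty);
}
```

In this model the buggy variant returns NULL for a caller that is allowed to block, which corresponds to the pointer that the commit message says ends up in bio_init().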
Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath") Reported-by: syzbot+09ddb593eea76a158f42@syzkaller.appspotmail.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/p01.gc6e9ad5845ad.ttca29g@ub.hpns Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 18 Mar 2026 01:41:12 +0000 (09:41 +0800)]
ublk: move cold paths out of __ublk_batch_dispatch() for icache efficiency
Mark ublk_filter_unused_tags() as noinline since it is only called from
the unlikely(needs_filter) branch. Extract the error-handling block from
__ublk_batch_dispatch() into a new noinline ublk_batch_dispatch_fail()
function to keep the hot path compact and icache-friendly. This also
makes __ublk_batch_dispatch() more readable by separating the error
recovery logic from the normal dispatch flow.
Before: __ublk_batch_dispatch is ~1419 bytes
After: __ublk_batch_dispatch is ~1090 bytes (-329 bytes, -23%)
Jens Axboe [Sun, 22 Mar 2026 19:37:45 +0000 (13:37 -0600)]
Merge tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.1/block
Pull MD changes from Yu Kuai:
"Bug Fixes:
- md: suppress spurious superblock update error message for dm-raid
(Chen Cheng)
- md/raid1: fix the comparing region of interval tree (Xiao Ni)
- md/raid10: fix deadlock with check operation and nowait requests
(Josh Hunt)
- md/raid5: skip 2-failure compute when other disk is R5_LOCKED
(FengWei Shih)
- md/md-llbitmap: raise barrier before state machine transition
(Yu Kuai)
- md/md-llbitmap: skip reading rdevs that are not in_sync (Yu Kuai)
Improvements:
- md/raid5: set chunk_sectors to enable full stripe I/O splitting
(Yu Kuai)
Cleanups:
- md: remove unused mddev argument from export_rdev (Chen Cheng)
- md/raid5: remove stale md_raid5_kick_device() declaration
(Chen Cheng)
- md/raid5: move handle_stripe() comment to correct location
(Chen Cheng)"
* tag 'md-7.1-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md: remove unused mddev argument from export_rdev
md/raid5: move handle_stripe() comment to correct location
md/raid5: remove stale md_raid5_kick_device() declaration
md/raid1: fix the comparing region of interval tree
md/raid5: skip 2-failure compute when other disk is R5_LOCKED
md/md-llbitmap: raise barrier before state machine transition
md/md-llbitmap: skip reading rdevs that are not in_sync
md/raid5: set chunk_sectors to enable full stripe I/O splitting
md/raid10: fix deadlock with check operation and nowait requests
md: suppress spurious superblock update error message for dm-raid
Xiao Ni [Thu, 5 Mar 2026 01:18:33 +0000 (09:18 +0800)]
md/raid1: fix the comparing region of interval tree
An interval tree stores each region as a closed interval [start, end].
raid1 uses the wrong end value. For example, suppose bio(A,B) is too big
and is split into bio1(A,C-1) and bio2(C,B). The region of bio1 is [A,C]
and the region of bio2 is [C,B], so bio1 and bio2 appear to overlap,
which is wrong.
Fix this by using the correct end value for the region.
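The off-by-one can be illustrated with a closed-interval overlap check. This is a sketch, not the raid1 code: the `region` type and the helpers are hypothetical, and it assumes the wrong end value was the exclusive end (start + sectors) rather than the last covered sector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Closed interval [start, last]: both endpoints belong to the region. */
struct region {
	uint64_t start;
	uint64_t last;
};

/* Two closed intervals overlap when each starts no later than the other ends. */
static bool regions_overlap(struct region a, struct region b)
{
	return a.start <= b.last && b.start <= a.last;
}

/* Assumed bug: the exclusive end is stored as the last covered sector. */
static struct region region_wrong(uint64_t start, uint64_t sectors)
{
	return (struct region){ start, start + sectors };
}

/* Assumed fix: the last covered sector is start + sectors - 1. */
static struct region region_right(uint64_t start, uint64_t sectors)
{
	assert(sectors > 0);
	return (struct region){ start, start + sectors - 1 };
}
```

With two back-to-back 8-sector bios at sectors 0 and 8, `region_wrong()` yields [0,8] and [8,16], which overlap at sector 8, while `region_right()` yields [0,7] and [8,15], which do not.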
FengWei Shih [Thu, 19 Mar 2026 05:33:51 +0000 (13:33 +0800)]
md/raid5: skip 2-failure compute when other disk is R5_LOCKED
When skip_copy is enabled on a doubly-degraded RAID6, a device that is
being written to will be in R5_LOCKED state with R5_UPTODATE cleared.
If a new read triggers fetch_block() while the write is still in
flight, the 2-failure compute path may select this locked device as a
compute target because it is not R5_UPTODATE.
Because skip_copy makes the device page point directly to the bio page,
reconstructing data into it might be risky. Also, since the compute
marks the device R5_UPTODATE, it triggers WARN_ON in ops_run_io()
which checks that R5_SkipCopy and R5_UPTODATE are not both set.
This can be reproduced by running small-range concurrent read/write on
a doubly-degraded RAID6 with skip_copy enabled, for example:
Kees Cook [Sat, 21 Mar 2026 00:48:44 +0000 (17:48 -0700)]
block: partitions: Replace pp_buf with struct seq_buf
In preparation for removing the strlcat API[1], replace the char *pp_buf
with a struct seq_buf, which tracks the current write position and
remaining space internally. This allows for:
- Direct use of seq_buf_printf() in place of snprintf()+strlcat()
pairs, eliminating local tmp buffers throughout.
- Adjacent strlcat() calls that build strings piece-by-piece
(e.g., strlcat("["); strlcat(name); strlcat("]")) to be collapsed
into single seq_buf_printf() calls.
- Simpler call sites: seq_buf_puts() takes only the buffer and string,
with no need to pass PAGE_SIZE at every call.
The backing buffer allocation is unchanged (__get_free_page), and the
output path uses seq_buf_str() to NUL-terminate before passing to
printk().
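To show why a position-tracking buffer simplifies the call sites, here is a minimal user-space sketch. `mini_buf` and its helpers are hypothetical and only imitate the idea behind struct seq_buf; the kernel's overflow handling may differ in detail.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* The buffer remembers its own capacity and write position. */
struct mini_buf {
	char *buffer;
	size_t size;	/* total capacity in bytes */
	size_t len;	/* bytes written so far; == size once full */
};

static void mini_buf_init(struct mini_buf *s, char *buf, size_t size)
{
	assert(buf != NULL && size > 0);
	s->buffer = buf;
	s->size = size;
	s->len = 0;
	buf[0] = '\0';
}

/*
 * Append formatted text. On overflow the truncated text stays in the
 * buffer (still NUL-terminated) and later appends are ignored.
 */
static void mini_buf_printf(struct mini_buf *s, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (s->len >= s->size)
		return;
	va_start(ap, fmt);
	n = vsnprintf(s->buffer + s->len, s->size - s->len, fmt, ap);
	va_end(ap);
	if (n < 0)
		return;
	if ((size_t)n >= s->size - s->len)
		s->len = s->size;	/* mark the buffer as full */
	else
		s->len += (size_t)n;
}
```

A sequence such as strlcat("["); strlcat(name); strlcat("]") then becomes a single `mini_buf_printf(&s, "[%s]", name)` call, with no temporary buffer and no size argument at the call site.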
Yang Xiuwei [Tue, 17 Mar 2026 07:22:26 +0000 (15:22 +0800)]
scsi: bsg: add io_uring passthrough handler
Implement the SCSI-specific io_uring command handler for BSG using
struct bsg_uring_cmd.
The handler builds a SCSI request from the io_uring command, maps user
buffers (including fixed buffers), and completes asynchronously via a
request end_io callback and task_work. Completion returns a 32-bit
status and packed residual/sense information via CQE res and res2, and
supports IO_URING_F_NONBLOCK.
Yang Xiuwei [Tue, 17 Mar 2026 07:22:25 +0000 (15:22 +0800)]
bsg: add io_uring command support to generic layer
Add an io_uring command handler to the generic BSG layer. The new
.uring_cmd file operation validates io_uring features and delegates
handling to a per-queue bsg_uring_cmd_fn callback.
Extend bsg_register_queue() so transport drivers can register both
sg_io and io_uring command handlers.