ublk: remove redundant zone op check in ublk_setup_iod()
ublk_setup_iod() checks first whether the request is a zoned operation
issued to a device without zoned support and returns BLK_STS_IOERR if
so. However, such a request would already hit the default case in the
subsequent switch statement and fail the ublk_queue_is_zoned() check,
which also results in a return of BLK_STS_IOERR. So remove the redundant
early check for unsupported zone ops.
* tag 'nvme-6.18-2025-09-23' of git://git.infradead.org/nvme:
nvme: Use non zero KATO for persistent discovery connections
nvmet: add safety check for subsys lock
nvme-core: use nvme_is_io_ctrl() for I/O controller check
nvme-core: do ioccsz/iorcsz validation only for I/O controllers
nvme-core: add method to check for an I/O controller
nvme-pci: Add TUXEDO IBS Gen8 to Samsung sleep quirk
nvme-auth: use hkdf_expand_label()
nvme-auth: add hkdf_expand_label()
nvme-tcp: send only permitted commands for secure concat
nvme-fc: use lock accessing port_state and rport state
nvmet-fcloop: call done callback even when remote port is gone
nvmet-fc: avoid scheduling association deletion twice
nvmet-fc: move lsop put work to nvmet_fc_ls_req_op
nvme-auth: update bi_directional flag
nvme: Use non zero KATO for persistent discovery connections
The NVMe Base Specification 2.1 states that:
"""
A host requests an explicit persistent connection ... by specifying a
non-zero Keep Alive Timer value in the Connect command.
"""
As such if we are starting a persistent connection to a discovery
controller and the KATO is currently 0 we need to update KATO to a non
zero value to avoid continuous timeouts on the target.
Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Max Gurtovoy [Sun, 21 Sep 2025 12:44:05 +0000 (15:44 +0300)]
nvmet: add safety check for subsys lock
Replace comment about required lock with a lockdep_assert_held()
check in the following functions:
- nvmet_p2pmem_ns_add_p2p()
- nvmet_setup_p2p_ns_map()
- nvmet_release_p2p_ns_map()
This ensures the subsystem lock is held at runtime.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Martin George [Mon, 22 Sep 2025 12:05:32 +0000 (17:35 +0530)]
nvme-core: use nvme_is_io_ctrl() for I/O controller check
Replace the current I/O controller check in
nvme_init_non_mdts_limits() with the helper nvme_is_io_ctrl()
function to maintain consistency with similar checks in other
parts of the code and improve code readability.
Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme-core: do ioccsz/iorcsz validation only for I/O controllers
An administrative controller does not support I/O queues, hence it
should ignore existing checks for IOCCSZ/IORCSZ. Currently, these checks
only exclude a discovery controller but need to also exclude an
administrative controller.
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme-core: add method to check for an I/O controller
Add nvme_is_io_ctrl() to check if the controller is of type I/O
controller. Uses negative logic by excluding an administrative
controller and a discovery controller.
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Root cause is that queue_usage_counter is grabbed with rq_qos_mutex
held in blkg_conf_prep(), while queue should be freezed before
rq_qos_mutex from other context.
The blk_queue_enter() from blkg_conf_prep() is used to protect against
policy deactivation, which is already protected with blkcg_mutex, hence
convert blk_queue_enter() to blkcg_mutex to fix this problem. Meanwhile,
consider that blkcg_mutex is held after queue is freezed from policy
deactivation, also convert blkg_alloc() to use GFP_NOIO.
Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
blk_mq_free_tags() can be called after blk_mq_init_tags(), while
tags->page_list is still not initialized, causing null-ptr-deref.
Fix this problem by initializing tags->page_list at blk_mq_init_tags(),
meanwhile, also free tags directly from error path because there is no
srcu barrier.
blk-mq: Fix more tag iteration function documentation
Commit 8ab30a331946 ("blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument")
removed the hctx argument from the callback functions called by
bt_for_each() and blk_mq_queue_tag_busy_iter(). Commit 2dd6532e9591
("blk-mq: Drop 'reserved' arg of busy_tag_iter_fn") removed the
'reserved' argument of the busy_tag_iter_fn function pointer type. Bring
the documentation of the tag iteration functions in sync with these
changes.
Cc: John Garry <john.g.garry@oracle.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: fix behavior when fio is not installed
Some ublk selftests have strange behavior when fio is not installed.
While most tests behave correctly (run if they don't need fio, or skip
if they need fio), the following tests have different behavior:
- test_null_01, test_null_02, test_generic_01, test_generic_02, and
test_generic_12 try to run fio without checking if it exists first,
and fail on any failure of the fio command (including "fio command
not found"). So these tests fail when they should skip.
- test_stress_05 runs fio without checking if it exists first, but
doesn't fail on fio command failure. This test passes, but that pass
is misleading as the test doesn't do anything useful without fio
installed. So this test passes when it should skip.
Fix these issues by adding _have_program fio checks to the top of all of
these tests.
Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_unmap_io() is a frequent cache miss. Pass to __ublk_complete_rq()
whether the ublk server's data buffer needs to be copied to the request.
In the callers __ublk_fail_req() and ublk_ch_uring_cmd_local(), get the
flags from the ublk_device instead, as its flags have just been read.
In ublk_put_req_ref(), pass false since all the features that require
reference counting disable copying of the data buffer upon completion.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: don't access ublk_queue in ublk_need_complete_req()
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_need_complete_req() is a frequent cache miss. Get the flags from
the ublk_device instead, which is accessed earlier in
ublk_ch_uring_cmd_local().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_check_commit_and_fetch() is a frequent cache miss. Get the flags
from the ublk_device instead, which is accessed earlier in
ublk_ch_uring_cmd_local().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: don't access ublk_queue in ublk_config_io_buf()
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_config_io_buf() is a frequent cache miss. Get the flags
from the ublk_device instead, which is accessed earlier in
ublk_ch_uring_cmd_local().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: pass q_id and tag to __ublk_check_and_get_req()
__ublk_check_and_get_req() only uses its ublk_queue argument to get the
q_id and tag. Pass those arguments explicitly to save an access to the
ublk_queue.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: don't access ublk_queue in ublk_daemon_register_io_buf()
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_daemon_register_io_buf() is a frequent cache miss. Get the flags
from the ublk_device instead, which is accessed earlier in
ublk_ch_uring_cmd_local().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: don't access ublk_queue in ublk_register_io_buf()
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_register_io_buf() is a frequent cache miss. Get the flags from the
ublk_device instead, which is accessed earlier in
ublk_ch_uring_cmd_local().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid repeating the 2 dereferences to get the ublk_device from the
io_uring_cmd by passing it from ublk_ch_uring_cmd_local() to
ublk_register_io_buf().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: don't dereference ublk_queue in ublk_check_and_get_req()
For ublk servers with many ublk queues, accessing the ublk_queue in
ublk_ch_{read,write}_iter() is a frequent cache miss. Get the flags and
queue depth from the ublk_device instead, which is accessed just before.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: don't dereference ublk_queue in ublk_ch_uring_cmd_local()
For ublk servers with many ublk queues, accessing the ublk_queue to
handle a ublk command is a frequent cache miss. Get the queue depth from
the ublk_device instead, which is accessed just before.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: don't pass q_id to ublk_queue_cmd_buf_size()
ublk_queue_cmd_buf_size() only needs the queue depth, which is the same
for all queues. Get the queue depth from the ublk_device instead so the
q_id parameter can be dropped.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: add test to verify that feat_map is complete
Add a test that verifies that the currently running kernel does not
report support for any features that are unrecognized by kublk. This
should catch cases where features are added without updating kublk's
feat_map accordingly, which has happened multiple times in the past (see
[1], [2]).
Note that this new test may fail if the test suite is older than the
kernel, and the newer kernel contains a newly introduced feature. I
believe this is not a use case we currently care about - we only care
about newer test suites passing on older kernels.
selftests: ublk: kublk: add UBLK_F_BUF_REG_OFF_DAEMON to feat_map
When UBLK_F_BUF_REG_OFF_DAEMON was added, we missed updating kublk's
feat_map, which results in the feature being reported as "unknown." Add
UBLK_F_BUF_REG_OFF_DAEMON to feat_map to fix this.
Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify the definition of feat_map by introducing a helper macro
FEAT_NAME to avoid having to type the feature name twice. As a side
effect, this changes the names in the feature list to be the full macro
name instead of the abbreviated names that were used before, but this is
a good change for clarity.
Using the full feature macro names ruins the alignment of the output, so
change the output format to put each feature's hex value before its
name, as this is easier to align nicely. The output now looks as
follows:
blk-throttle: fix throtl_data leak during disk release
Tightening the throttle activation check in blk_throtl_activated() to
require both q->td presence and policy bit set introduced a memory leak
during disk release:
blkg_destroy_all() clears the policy bit first during queue deactivation,
causing subsequent blk_throtl_exit() to skip throtl_data cleanup when
blk_throtl_activated() fails policy check.
Idealy we should avoid modifying blk_throtl_exit() activation check because
it's intuitive that blk-throtl start from blk_throtl_init() and end in
blk_throtl_exit(). However, call blk_throtl_exit() before
blkg_destroy_all() will make a long term deadlock problem easier to
trigger[1], hence fix this problem by checking if q->td is NULL from
blk_throtl_exit(), and remove policy deactivation as well since it's
useless.
blk-mq: Fix the blk_mq_tagset_busy_iter() documentation
Commit 2dd6532e9591 ("blk-mq: Drop 'reserved' arg of busy_tag_iter_fn")
removed the 'reserved' argument from tag iteration callback functions.
Bring the blk_mq_tagset_busy_iter() documentation in sync with that
change.
Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Mon, 15 Sep 2025 10:35:00 +0000 (10:35 +0000)]
block: relax atomic write boundary vs chunk size check
blk_validate_atomic_write_limits() ensures that any boundary fits into
and is aligned to any chunk size.
However, it should also be possible to fit the chunk size into any
boundary. That check is already made in
blk_stack_atomic_writes_boundary_head().
Relax the check in blk_validate_atomic_write_limits() by reusing (and
renaming) blk_stack_atomic_writes_boundary_head().
Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Mon, 15 Sep 2025 10:34:59 +0000 (10:34 +0000)]
block: fix stacking of atomic writes when atomics are not supported
Atomic writes support may not always be possible when stacking devices
which support atomic writes. Such as case is a different atomic write
boundary between stacked devices (which is not supported).
In the case that atomic writes cannot supported, the top device queue HW
limits are set to 0.
However, in blk_stack_atomic_writes_limits(), we detect that we are
stacking the first bottom device by checking the top device
atomic_write_hw_max value == 0. This get confused with the case of atomic
writes not supported, above.
Make the distinction between stacking the first bottom device and no
atomics supported by initializing stacked device atomic_write_hw_max =
UINT_MAX and checking that for stacking the first bottom device.
Fixes: d7f36dc446e8 ("block: Support atomic writes limits for stacked devices") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Mon, 15 Sep 2025 10:34:58 +0000 (10:34 +0000)]
block: update validation of atomic writes boundary for stacked devices
In commit 63d092d1c1b1 ("block: use chunk_sectors when evaluating stacked
atomic write limits"), it was missed to use a chunk sectors limit check
in blk_stack_atomic_writes_boundary_head(), so update that function to
do the proper check.
Fixes: 63d092d1c1b1 ("block: use chunk_sectors when evaluating stacked atomic write limits") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme-pci: Add TUXEDO IBS Gen8 to Samsung sleep quirk
On the TUXEDO InfinityBook S Gen8, a Samsung 990 Evo NVMe leads to
a high power consumption in s2idle sleep (3.5 watts).
This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with
a lower power consumption, typically around 1 watts.
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com> Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Cc: stable@vger.kernel.org Signed-off-by: Keith Busch <kbusch@kernel.org>
Chris Leech [Thu, 21 Aug 2025 20:48:16 +0000 (13:48 -0700)]
nvme-auth: use hkdf_expand_label()
When generating keying material during an authentication transaction
(secure channel concatenation), the HKDF-Expand-Label function is part
of the specified key derivation process.
The current open-coded implementation misses the length prefix
requirements on the HkdfLabel label and context variable-length vectors
(RFC 8446 Section 3.4).
Instead, use the hkdf_expand_label() function.
Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
block/mq-deadline: Remove the redundant rb_entry_rq in the deadline_from_pos().
In commit(fde02699c242), the "if (blk_rq_is_seq_zoned_write(rq))"
was removed, but the "rb_entry_rq(node)" and some other code were
inadvertently left behind. This patch fixed it.
Signed-off-by: chengkaitao <chengkaitao@kylinos.cn> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Martin George [Tue, 9 Sep 2025 10:35:09 +0000 (16:05 +0530)]
nvme-tcp: send only permitted commands for secure concat
In addition to sending permitted commands such as connect/auth
over the initial unencrypted admin connection as part of secure
channel concatenation, the host also sends commands such as
Property Get and Identify on the same. This is a spec violation
leading to secure concat failures. Fix this by ensuring these
additional commands are avoided on this connection.
Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation") Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Tue, 2 Sep 2025 10:22:03 +0000 (12:22 +0200)]
nvme-fc: use lock accessing port_state and rport state
nvme_fc_unregister_remote removes the remote port on a lport object at
any point in time when there is no active association. This races with
with the reconnect logic, because nvme_fc_create_association is not
taking a lock to check the port_state and atomically increase the
active count on the rport.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/u4ttvhnn7lark5w3sgrbuy2rxupcvosp4qmvj46nwzgeo5ausc@uyrkdls2muwx Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Tue, 2 Sep 2025 10:22:02 +0000 (12:22 +0200)]
nvmet-fcloop: call done callback even when remote port is gone
When the target port is gone, it's not possible to access any of the
request resources. The function should just silently drop the response.
The comment is misleading in this regard.
Though it's still necessary to call the driver via the ->done callback
so the driver is able to release all resources.
Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs-OBA0WMt5f7R0dz+rR4HcEz19YLhnyGsj-MRV3jWDsPg@mail.gmail.com/ Fixes: 84eedced1c5b ("nvmet-fcloop: drop response if targetport is gone") Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Tue, 2 Sep 2025 10:22:01 +0000 (12:22 +0200)]
nvmet-fc: avoid scheduling association deletion twice
When forcefully shutting down a port via the configfs interface,
nvmet_port_subsys_drop_link() first calls nvmet_port_del_ctrls() and
then nvmet_disable_port(). Both functions will eventually schedule all
remaining associations for deletion.
The current implementation checks whether an association is about to be
removed, but only after the work item has already been scheduled. As a
result, it is possible for the first scheduled work item to free all
resources, and then for the same work item to be scheduled again for
deletion.
Because the association list is an RCU list, it is not possible to take
a lock and remove the list entry directly, so it cannot be looked up
again. Instead, a flag (terminating) must be used to determine whether
the association is already in the process of being deleted.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/rsdinhafrtlguauhesmrrzkybpnvwantwmyfq2ih5aregghax5@mhr7v3eryci3/ Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Tue, 2 Sep 2025 10:22:00 +0000 (12:22 +0200)]
nvmet-fc: move lsop put work to nvmet_fc_ls_req_op
It’s possible for more than one async command to be in flight from
__nvmet_fc_send_ls_req. For each command, a tgtport reference is taken.
In the current code, only one put work item is queued at a time, which
results in a leaked reference.
To fix this, move the work item to the nvmet_fc_ls_req_op struct, which
already tracks all resources related to the command.
Fixes: 710c69dbaccd ("nvmet-fc: avoid deadlock on delete association path") Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Martin George [Mon, 15 Sep 2025 11:49:21 +0000 (17:19 +0530)]
nvme-auth: update bi_directional flag
While setting chap->s2 to zero as part of secure channel
concatenation, the host missed out to disable the bi_directional
flag to indicate that controller authentication is not requested.
Fix the same.
Fixes: e88a7595b57f ("nvme-tcp: request secure channel concatenation") Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
When building for 32-bit platforms, there are several link (if builtin)
or modpost (if a module) errors due to dividends of type 'sector_t' in
DIV_ROUND_UP:
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_resize':
drivers/md/md-llbitmap.c:1017:(.text+0xae8): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.c:1020:(.text+0xb10): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_end_discard':
drivers/md/md-llbitmap.c:1114:(.text+0xf14): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_start_discard':
drivers/md/md-llbitmap.c:1097:(.text+0x1808): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_read_sb':
drivers/md/md-llbitmap.c:867:(.text+0x2080): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o:drivers/md/md-llbitmap.c:895: more undefined references to `__aeabi_uldivmod' follow
Use DIV_ROUND_UP_SECTOR_T instead of DIV_ROUND_UP, which exists to
handle this exact situation.
blk-mq: fix potential deadlock while nr_requests grown
Allocate and free sched_tags while queue is freezed can deadlock[1],
this is a long term problem, hence allocate memory before freezing
queue and free memory after queue is unfreezed.
blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
For shared tags case, all hctx->sched_tags/tags are the same, it doesn't
make sense to call into blk_mq_tag_update_depth() multiple times for the
same tags.
blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
request_queue->nr_requests can be changed by:
a) switch elevator by updating nr_hw_queues
b) switch elevator by elevator sysfs attribute
c) configue queue sysfs attribute nr_requests
Current lock order is:
1) update_nr_hwq_lock, case a,b
2) freeze_queue
3) elevator_lock, case a,b,c
And update nr_requests is seriablized by elevator_lock() already,
however, in the case c, we'll have to allocate new sched_tags if
nr_requests grow, and do this with elevator_lock held and queue
freezed has the risk of deadlock.
Hence use update_nr_hwq_lock instead, make it possible to allocate
memory if tags grow, meanwhile also prevent nr_requests to be changed
concurrently.
blk-mq: check invalid nr_requests in queue_requests_store()
queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the
only caller of blk_mq_tag_update_depth(), however, they all have
checkings for nr_requests input by user.
Make code cleaner by moving all the checkings to the top function:
1) nr_requests > reserved tags;
2) if there is elevator, 4 <= nr_requests <= 2048;
3) if elevator is none, 4 <= nr_requests <= tag_set->queue_depth;
Meanwhile, case 2 is the only case tags can grow and -ENOMEM might be
returned.
blk-mq: remove useless checkings in blk_mq_update_nr_requests()
1) queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), where queue is already freezed, no need to
check mq_freeze_depth;
2) q->tag_set must be set for request based device, and queue_is_mq() is
already checked in blk_mq_queue_attr_visible(), no need to check
q->tag_set.
ublk_mark_io_ready() tracks whether all the ublk_device's I/Os have been
fetched by incrementing ublk_queue's nr_io_ready count and incrementing
ublk_device's nr_queues_ready count if the whole queue is ready.
Simplify the logic by just tracking the total number of fetched I/Os on
each ublk_device. When this count reaches nr_hw_queues * queue_depth,
the ublk_device is ready to receive I/O.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md/raid0: convert raid0_make_request() to use bio_submit_split_bioset()
Currently, raid0_make_request() will remap the original bio to underlying
disks to prevent reordered IO. Now that bio_submit_split_bioset() will put
original bio to the head of current->bio_list, it's safe converting to use
this helper and bio will still be ordered.
CC: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, split bio will be chained to original bio, and original bio
will be resubmitted to the tail of current->bio_list, waiting for
split bio to be issued. However, if split bio get split again, the IO
order will be messed up. This problem, on the one hand, will cause
performance degradation, especially for mdraid with large IO size; on
the other hand, will cause write errors for zoned block devices[1].
For example, in raid456 IO will first be split by max_sector from
md_submit_bio(), and then later be split again by chunksize for internal
handling:
For example, assume max_sectors is 1M, and chunksize is 512k
1) issue a 2M IO:
bio issuing: 0+2M
current->bio_list: NULL
2) md_submit_bio() split by max_sector:
bio issuing: 0+1M
current->bio_list: 1M+1M
3) chunk_aligned_read() split by chunksize:
bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k
4) after first bio issued, __submit_bio_noacct() will contuine issuing
next bio:
bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k
5) chunk_aligned_read() split by chunksize:
bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k
6) no split afterwards, finally the issue order is:
0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
This behaviour will cause large IO read on raid456 endup to be small
discontinuous IO in underlying disks. Fix this problem by placing split
bio to the head of current->bio_list.
Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
Test results:
Before this patch
1) iostat results:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10
sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20
2) blktrace G stage:
8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd]
8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd]
8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd]
8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd]
8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd]
8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd]
8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd]
3) io seek for sdx:
Noted io seek is the result from blktrace D stage, statistic of:
ABS((offset of next IO) - (offset + len of previous IO))
After this patch:
1) iostat results:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60
sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50
2) blktrace G stage:
8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd]
8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd]
8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd]
8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd]
8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd]
8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd]
8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd]
8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
>=(KB) .. <(KB) : count ratio |distribution |
0 .. 1 : 21483 75.0% |########################################|
1 .. 2 : 0 0.0% | |
2 .. 4 : 0 0.0% | |
4 .. 8 : 0 0.0% | |
8 .. 16 : 0 0.0% | |
16 .. 32 : 0 0.0% | |
32 .. 64 : 7161 25.0% |############## |
BTW, this looks like a long term problem from day one, and large
sequential IO read is pretty common case like video playing.
And even with this patch, in this test case IO is merged to at most 128k
is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such
limit can get even better performance. However, we'll figure out how to do
this properly later.
Lots of checks are already done while submitting this bio the first
time, and there is no need to check them again when this bio is
resubmitted after split.
Hence open code should_fail_bio() and blk_throtl_bio() that are still
necessary from submit_bio_split_bioset().
md/raid10: convert read/write to use bio_submit_split_bioset()
Unify bio split code, prepare to fix ordering of split IO, the error path
is modified a bit, however no functional changes are intended:
- bio_submit_split_bioset() can fail the original bio directly
by split error, set R10BIO_Uptodate in this case to notify
raid_end_bio_io() that the original bio is returned already.
- set R10BIO_Uptodate and set error value to -EIO is useless now,
for r10_bio without R10BIO_Uptodate, -EIO will be returned for
original bio.
And discard is not handled, because discard is only split for
unaligned head and tail, and this can be considered slow path, the
reorder here does not matter much.
md/raid1: convert to use bio_submit_split_bioset()
Unify bio split code, and prepare to fix ordering of split IO.
Noted that bio_submit_split_bioset() can fail the original bio directly
by split error, set R1BIO_Returned in this case to notify raid_end_bio_io()
that the original bio is returned already.
md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()
Unify bio split code, and prepare to fix ordering of split IO
Noted commit 319ff40a5427 ("md/raid0: Fix performance regression for large
sequential writes") already fix ordering of split IO by remapping bio to
underlying disks before resubmitting it, with the respect
md_submit_bio() already split it by sectors, and raid0_make_request()
will split at most once for unaligned IO. This is a bit hacky and we'll
convert this to solution in general later.
If bio is split by internal handling like chunksize or badblocks, the
corresponding trace_block_split() is missing, resulting in blktrace
inability to catch BIO split events and making it harder to analyze the
BIO sequence.
Cc: stable@vger.kernel.org Fixes: 4b1faf931650 ("block: Kill bio_pair_split()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio->issue_time_ns is initialized for every bio, however, it's only used
by blk-iolatency. Add a new queue_flag and only set this flag when
blk-iolatency is enabled, so that extra blk_time_get_ns() can be saved
for disks that blk-iolatency is not enabled.
block: initialize bio issue time in blk_mq_submit_bio()
bio->issue_time_ns is only used by blk-iolatency, which can only be
enabled for rq-based disk, hence it's not necessary to initialize
the time for bio-based disk.
Meanwhile, if bio is split by blk_crypto_fallback_split_bio_if_needed(),
the issue time is not initialized for new split bio, this can be fixed
as well.
Noted the next patch will optimize better that bio issue time will
only be used when blk-iolatency is really enabled by the disk.
Now that bio->bi_issue is only used by blk-iolatency to get bio issue
time, replace bio_issue with u64 time directly and remove bio_issue to
make code cleaner.
Merge tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.18/block
Pull MD changes from Yu Kuai:
"Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data vary depending on the RAID levels. And it's
important to maintain the consistency of redundant data.
Bitmap is used to record which data blocks have been synchronized and
which ones need to be resynchronized or recovered. Each bit in the
bitmap represents a segment of data in the array. When a bit is set,
it indicates that the multiple redundant copies of that data segment
may not be consistent. Data synchronization can be performed based on
the bitmap after power failure or readding a disk. If there is no
bitmap, a full disk synchronization is required.
Due to known performance issues with md-bitmap and the unreasonable
implementations:
- self-managed IO submitting like filemap_write_page();
- global spin_lock
I have decided not to continue optimizing based on the current bitmap
implementation, this new bitmap is invented without locking from IO fast
path and can be used with fast disks.
Key features for the new bitmap:
- IO fastpath is lockless, if user issues lots of write IO to the same
bitmap bit in a short time, only the first write has additional
overhead to update bitmap bit, no additional overhead for the
following writes;
- support only resync or recover written data, means in the case
creating new array or replacing with a new disk, there is no need to
do a full disk resync/recovery;"
* tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: (24 commits)
md/md-llbitmap: introduce new lockless bitmap
md/md-bitmap: make method bitmap_ops->daemon_work optional
md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
md/md-bitmap: add a new method blocks_synced() in bitmap_operations
md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
md/md-bitmap: delay registration of bitmap_ops until creating bitmap
md/md-bitmap: add a new sysfs api bitmap_type
md: add a new mddev field 'bitmap_id'
md/md-bitmap: support discard for bitmap ops
md: factor out a helper raid_is_456()
md: add a new parameter 'offset' to md_super_write()
md/md-bitmap: introduce CONFIG_MD_BITMAP
md: check before referencing mddev->bitmap_ops
md/dm-raid: check before referencing mddev->bitmap_ops
md/raid5: check before referencing mddev->bitmap_ops
md/raid10: check before referencing mddev->bitmap_ops
md/raid1: check before referencing mddev->bitmap_ops
md/raid1: check bitmap before behind write
md/md-bitmap: handle the case bitmap is not enabled before end_sync()
md/md-bitmap: handle the case bitmap is not enabled before start_sync()
...
Keith Busch [Wed, 3 Sep 2025 20:27:46 +0000 (13:27 -0700)]
blk-map: provide the bdev to bio if one exists
We can now safely provide a block device when extracting user pages for
driver and user passthrough commands. Set the bdev so the caller doesn't
have to do that later. This has an additional benefit of being able to
extract P2P pages in the passthrough path.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 3 Sep 2025 19:33:17 +0000 (12:33 -0700)]
blk-mq-dma: bring back p2p request flags
We only need to consider data and metadata dma mapping types separately.
The request and bio integrity payload have enough flag bits to
internally track the mapping type for each. Use these so the caller
doesn't need to track them, and provide separete request and integrity
helpers to the common code. This will make it easier to scale new
mappings, like the proposed MMIO attribute, without burdening the caller
to track such things.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 3 Sep 2025 19:33:16 +0000 (12:33 -0700)]
blk-integrity: enable p2p source and destination
Set the extraction flags to allow p2p pages for the metadata buffer if
the block device allows it. Similar to data payloads, ensure the bio
does not use merging if we see a p2p page.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 27 Aug 2025 14:12:58 +0000 (07:12 -0700)]
iov_iter: remove iov_iter_is_aligned
No more callers.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 27 Aug 2025 14:12:57 +0000 (07:12 -0700)]
blk-integrity: use simpler alignment check
We're checking length and addresses against the same alignment value, so
use the more simple iterator check.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 27 Aug 2025 14:12:56 +0000 (07:12 -0700)]
block: remove bdev_iter_is_aligned
No more callers.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 27 Aug 2025 14:12:55 +0000 (07:12 -0700)]
iomap: simplify direct io validity check
The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 27 Aug 2025 14:12:54 +0000 (07:12 -0700)]
block: simplify direct io validity check
The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 27 Aug 2025 14:12:53 +0000 (07:12 -0700)]
block: align the bio after building it
Instead of ensuring each vector is block size aligned while constructing
the bio, just ensure the entire size is aligned after it's built. This
makes getting bio pages more flexible to accepting device valid io
vectors that would otherwise get rejected by alignment checks.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 27 Aug 2025 14:12:52 +0000 (07:12 -0700)]
block: add size alignment to bio_iov_iter_get_pages
The block layer tries to align bio vectors to the block device's logical
block size. Some cases don't have a block device, or we may need to
align to something larger, which we can't derive it from the queue
limits. Have the caller specify what they want, or allow any length
alignment if nothing was specified. Since the most common use case
relies on the block device's limits, a helper function is provided.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 27 Aug 2025 14:12:51 +0000 (07:12 -0700)]
block: check for valid bio while splitting
We're already iterating every segment, so check these for a valid IO
lengths at the same time. Individual segment lengths will not be checked
on passthrough commands. The read/write command segments must be sized
to the dma alignment.
Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
drivers/block: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
drivers/block: replace use of system_unbound_wq with system_dfl_wq
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: whether the user still use the old wq a warn will be
printed along with a wq redirect to the new one.
The old system_unbound_wq will be kept for a few release cycles.
drivers/block: replace use of system_wq with system_percpu_wq
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: whether the user still use the old wq a warn will be
printed along with a wq redirect to the new one.
The old system_unbound_wq will be kept for a few release cycles.
Ming Lei [Tue, 9 Sep 2025 12:33:10 +0000 (20:33 +0800)]
blk-mq: Document tags_srcu member in blk_mq_tag_set structure
Add missing documentation for the tags_srcu member that was introduced
to defer freeing of tags page_list to prevent use-after-free when
iterating tags.
Fixes htmldocs warning:
WARNING: include/linux/blk-mq.h:536 struct member 'tags_srcu' not described in 'blk_mq_tag_set'
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: remove the bi_inline_vecs variable sized array from struct bio
Bios are embedded into other structures, and at least spare is unhappy
about embedding structures with variable sized arrays. There's no
real need to the array anyway, we can replace it with a helper pointing
to the memory just behind the bio, and with the previous cleanups there
is very few site doing anything special with it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Han Guangjiang [Fri, 5 Sep 2025 10:24:11 +0000 (18:24 +0800)]
blk-throttle: fix access race during throttle policy activation
On repeated cold boots we occasionally hit a NULL pointer crash in
blk_should_throtl() when throttling is consulted before the throttle
policy is fully enabled for the queue. Checking only q->td != NULL is
insufficient during early initialization, so blkg_to_pd() for the
throttle policy can still return NULL and blkg_to_tg() becomes NULL,
which later gets dereferenced.
Genjian Zhang [Fri, 15 Aug 2025 09:07:32 +0000 (17:07 +0800)]
null_blk: Fix the description of the cache_size module argument
When executing modinfo null_blk, there is an error in the description
of module parameter mbps, and the output information of cache_size is
incomplete.The output of modinfo before and after applying this patch
is as follows:
Before:
[...]
parm: cache_size:ulong
[...]
parm: mbps:Cache size in MiB for memory-backed device.
Default: 0 (none) (uint)
[...]
After:
[...]
parm: cache_size:Cache size in MiB for memory-backed device.
Default: 0 (none) (ulong)
[...]
parm: mbps:Limit maximum bandwidth (in MiB/s).
Default: 0 (no limit) (uint)
[...]
Fixes: 058efe000b31 ("null_blk: add module parameters for 4 options") Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 30 Aug 2025 02:18:23 +0000 (10:18 +0800)]
blk-mq: Replace tags->lock with SRCU for tag iterators
Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.
This is done by:
- Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
- Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
This change fixes lockup issue in scsi_host_busy() in case of shost->host_blocked.
Also avoids big tags->lock when reading disk sysfs attribute `inflight`.
Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 30 Aug 2025 02:18:22 +0000 (10:18 +0800)]
blk-mq: Defer freeing flush queue to SRCU callback
The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
tag iterators that may still be accessing it. To prevent a potential
use-after-free, the deallocation should be deferred until after a grace
period. With this way, we can replace the big tags->lock in tags iterator
code path with srcu for solving the issue.
This patch introduces an SRCU-based deferred freeing mechanism for the
flush queue.
The changes include:
- Adding a `rcu_head` to `struct blk_flush_queue`.
- Creating a new callback function, `blk_free_flush_queue_callback`,
to handle the actual freeing.
- Replacing the direct call to `blk_free_flush_queue()` in
`blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
instance to ensure synchronization with tag iterators.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 30 Aug 2025 02:18:21 +0000 (10:18 +0800)]
blk-mq: Defer freeing of tags page_list to SRCU callback
Tag iterators can race with the freeing of the request pages(tags->page_list),
potentially leading to use-after-free issues.
Defer the freeing of the page list and the tags structure itself until
after an SRCU grace period has passed. This ensures that any concurrent
tag iterators have completed before the memory is released. With this
way, we can replace the big tags->lock in tags iterator code path with
srcu for solving the issue.
This is achieved by:
- Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect
tag map iteration.
- Adding an `rcu_head` to `struct blk_mq_tags` to be used with
`call_srcu`.
- Moving the page list freeing logic and the `kfree(tags)` call into a
new callback function, `blk_mq_free_tags_callback`.
- In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new
callback for deferred execution.
The read-side protection for the tag iterators will be added in a
subsequent patch.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 30 Aug 2025 02:18:20 +0000 (10:18 +0800)]
blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
To prepare for converting the tag->rqs freeing to be SRCU-based, the
tag_set is needed in the freeing helper functions.
This patch adds 'struct blk_mq_tag_set *' as the first parameter to
blk_mq_free_rq_map() and blk_mq_free_tags(), and updates all their call
sites.
This allows access to the tag_set's SRCU structure in the next step,
which will be used to free the tag maps after a grace period.
No functional change is intended in this patch.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>