Ming Lei [Fri, 16 Jan 2026 14:18:37 +0000 (22:18 +0800)]
ublk: handle UBLK_U_IO_PREP_IO_CMDS
This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command,
which allows userspace to prepare a batch of I/O requests.
The core of this change is the `ublk_walk_cmd_buf` function, which iterates
over the elements in the uring_cmd fixed buffer. For each element, it parses
the I/O details, finds the corresponding `ublk_io` structure, and prepares it
for future dispatch.
Add per-io lock for protecting concurrent delivery and committing.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 16 Jan 2026 14:18:34 +0000 (22:18 +0800)]
ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO
Introduces the basic structure for a batched I/O feature in the ublk driver.
It adds placeholder functions and a new file operations structure,
ublk_ch_batch_io_fops, which will be used for fetching and committing I/O
commands in batches. Currently, the feature is disabled.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 16 Jan 2026 07:46:38 +0000 (15:46 +0800)]
nvme/io_uring: optimize IOPOLL completions for local ring context
When multiple io_uring rings poll on the same NVMe queue, one ring can
find completions belonging to another ring. The current code always
uses task_work to handle this, but this adds overhead for the common
single-ring case.
This patch passes the polling io_ring_ctx through io_comp_batch's new
poll_ctx field. In io_do_iopoll(), the polling ring's context is stored
in iob.poll_ctx before calling the iopoll callbacks.
In nvme_uring_cmd_end_io(), we now compare iob->poll_ctx with the
request's owning io_ring_ctx (via io_uring_cmd_ctx_handle()). If they
match (local context), we complete inline with io_uring_cmd_done32().
If they differ (remote context) or iob is NULL (non-iopoll path), we
use task_work as before.
This optimization eliminates task_work scheduling overhead for the
common case where a ring polls and finds its own completions.
~10% IOPS improvement is observed in the following benchmark:
Ming Lei [Fri, 16 Jan 2026 07:46:37 +0000 (15:46 +0800)]
block: pass io_comp_batch to rq_end_io_fn callback
Add a third parameter 'const struct io_comp_batch *' to the rq_end_io_fn
callback signature. This allows end_io handlers to access the completion
batch context when requests are completed via blk_mq_end_request_batch().
The io_comp_batch is passed from blk_mq_end_request_batch(), while NULL
is passed from __blk_mq_end_request() and blk_mq_put_rq_ref() which don't
have batch context.
This infrastructure change enables drivers to detect whether they're
being called from a batched completion path (like iopoll) and access
additional context stored in the io_comp_batch.
Damien Le Moal [Tue, 6 Jan 2026 07:00:56 +0000 (16:00 +0900)]
block: fix blk_zone_cond_str() comment
Fix the comment for blk_zone_cond_str() by replacing the meaningless
BLK_ZONE_ZONE_XXX comment with the correct BLK_ZONE_COND_name, thus also
replacing the XXX with what that actually means.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 12 Jan 2026 22:05:02 +0000 (00:05 +0200)]
selftests: ublk: add stop command with --safe option
Add 'stop' subcommand to kublk utility that uses the new
UBLK_CMD_TRY_STOP_DEV command when --safe option is specified.
This allows stopping a device only if it has no active openers,
returning -EBUSY otherwise.
Also add test_generic_16.sh to test the new functionality.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yoav Cohen [Mon, 12 Jan 2026 22:05:01 +0000 (00:05 +0200)]
ublk: add UBLK_CMD_TRY_STOP_DEV command
Add a best-effort stop command, UBLK_CMD_TRY_STOP_DEV, which only stops a
ublk device when it has no active openers.
Unlike UBLK_CMD_STOP_DEV, this command does not disrupt existing users.
New opens are blocked only after disk_openers has reached zero; if the
device is busy, the command returns -EBUSY and leaves it running.
The ub->block_open flag is used only to close a race with an in-progress
open and does not otherwise change open behavior.
Advertise support via the UBLK_F_SAFE_STOP_DEV feature flag.
Signed-off-by: Yoav Cohen <yoav@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add test case loop_08 to verify the ublk integrity data flow. It uses
the kublk loop target to create a ublk device with integrity on top of
backing data and integrity files. It then writes to the whole device
with fio configured to generate integrity data. Then it reads back the
whole device with fio configured to verify the integrity data.
It also verifies that injected guard, reftag, and apptag corruptions are
correctly detected.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add test case null_04 to exercise all the different integrity params. It
creates 4 different ublk devices with different combinations of
integrity arguments and verifies their integrity limits via sysfs and
the metadata_size utility.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: add integrity data support to loop target
To perform and end-to-end test of integrity information through a ublk
device, we need to actually store it somewhere and retrieve it. Add this
support to kublk's loop target. It uses a second backing file for the
integrity data corresponding to the data stored in the first file.
The integrity file is initialized with byte 0xFF, which ensures the app
and reference tags are set to the "escape" pattern to disable the
bio-integrity-auto guard and reftag checks until the blocks are written.
The integrity file is opened without O_DIRECT since it will be accessed
at sub-block granularity. Each incoming read/write results in a pair of
reads/writes, one to the data file, and one to the integrity file. If
either backing I/O fails, the error is propagated to the ublk request.
If both backing I/Os read/write some bytes, the ublk request is
completed with the smaller of the number of blocks accessed by each I/O.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: support non-O_DIRECT backing files
A subsequent commit will add support for using a backing file to store
integrity data. Since integrity data is accessed in intervals of
metadata_size, which may be much smaller than a logical block on the
backing device, direct I/O cannot be used. Add an argument to
backing_file_tgt_init() to specify the number of files to open for
direct I/O. The remaining files will use buffered I/O. For now, continue
to request direct I/O for all the files.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: implement integrity user copy in kublk
If integrity data is enabled for kublk, allocate an integrity buffer for
each I/O. Extend ublk_user_copy() to copy the integrity data between the
ublk request and the integrity buffer if the ublksrv_io_desc indicates
that the request has integrity data.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: add kublk support for integrity params
Add integrity param command line arguments to kublk. Plumb these to
struct ublk_params for the null and fault_inject targets, as they don't
need to actually read or write the integrity data. Forbid the integrity
params for loop or stripe until the integrity data copy is implemented.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: add utility to get block device metadata size
Some block device integrity parameters are available in sysfs, but
others are only accessible using the FS_IOC_GETLBMD_CAP ioctl. Add a
metadata_size utility program to print out the logical block metadata
size, PI offset, and PI size within the metadata. Example output:
$ metadata_size /dev/ublkb0
metadata_size: 64
pi_offset: 56
pi_tuple_size: 8
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk user copy syscalls may be issued from any task, so they take a
reference count on the struct ublk_io to check whether it is owned by
the ublk server and prevent a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ
from completing the request. However, if the user copy syscall is issued
on the io's daemon task, a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ isn't
possible, so the atomic reference count dance is unnecessary. Check for
UBLK_IO_FLAG_OWNED_BY_SRV to ensure the request is dispatched to the
sever and obtain the request from ublk_io's req field instead of looking
it up on the tagset. Skip the reference count increment and decrement.
Commit 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon
task") made an analogous optimization for ublk zero copy buffer
registration.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stanley Zhang [Thu, 8 Jan 2026 09:19:38 +0000 (02:19 -0700)]
ublk: support UBLK_F_INTEGRITY
Now that all the components of the ublk integrity feature have been
implemented, add UBLK_F_INTEGRITY to UBLK_F_ALL, conditional on block
layer integrity support (CONFIG_BLK_DEV_INTEGRITY). This allows ublk
servers to create ublk devices with UBLK_F_INTEGRITY set and
UBLK_U_CMD_GET_FEATURES to report the feature as supported.
Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
[csander: make feature conditional on CONFIG_BLK_DEV_INTEGRITY] Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stanley Zhang [Thu, 8 Jan 2026 09:19:37 +0000 (02:19 -0700)]
ublk: implement integrity user copy
Add a function ublk_copy_user_integrity() to copy integrity information
between a request and a user iov_iter. This mirrors the existing
ublk_copy_user_pages() but operates on request integrity data instead of
regular data. Check UBLKSRV_IO_INTEGRITY_FLAG in iocb->ki_pos in
ublk_user_copy() to choose between copying data or integrity data.
[csander: change offset units from data bytes to integrity data bytes,
fix CONFIG_BLK_DEV_INTEGRITY=n build, rebase on user copy refactor]
Signed-off-by: Stanley Zhang <stazhang@purestorage.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: move offset check out of __ublk_check_and_get_req()
__ublk_check_and_get_req() checks that the passed in offset is within
the data length of the specified ublk request. However, only user copy
(ublk_check_and_get_req()) supports accessing ublk request data at a
nonzero offset. Zero-copy buffer registration (ublk_register_io_buf())
always passes 0 for the offset, so the check is unnecessary. Move the
check from __ublk_check_and_get_req() to ublk_check_and_get_req().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: inline ublk_check_and_get_req() into ublk_user_copy()
ublk_check_and_get_req() has a single callsite in ublk_user_copy(). It
takes a ton of arguments in order to pass local variables from
ublk_user_copy() to ublk_check_and_get_req() and vice versa. And more
are about to be added. Combine the functions to reduce the argument
passing noise.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk_ch_read_iter() and ublk_ch_write_iter() are nearly identical except
for the iter direction. Split out a helper function ublk_user_copy() to
reduce the code duplication as these functions are about to get larger.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stanley Zhang [Thu, 8 Jan 2026 09:19:31 +0000 (02:19 -0700)]
ublk: support UBLK_PARAM_TYPE_INTEGRITY in device creation
Add a feature flag UBLK_F_INTEGRITY for a ublk server to request
integrity/metadata support when creating a ublk device. The ublk server
can also check for the feature flag on the created device or the result
of UBLK_U_CMD_GET_FEATURES to tell if the ublk driver supports it.
UBLK_F_INTEGRITY requires UBLK_F_USER_COPY, as user copy is the only
data copy mode initially supported for integrity data.
Add UBLK_PARAM_TYPE_INTEGRITY and struct ublk_param_integrity to struct
ublk_params to specify the integrity params of a ublk device.
UBLK_PARAM_TYPE_INTEGRITY requires UBLK_F_INTEGRITY and a nonzero
metadata_size. The LBMD_PI_CAP_* and LBMD_PI_CSUM_* values from the
linux/fs.h UAPI header are used for the flags and csum_type fields.
If the UBLK_PARAM_TYPE_INTEGRITY flag is set, validate the integrity
parameters and apply them to the blk_integrity limits.
The struct ublk_param_integrity validations are based on the checks in
blk_validate_integrity_limits(). Any invalid parameters should be
rejected before being applied to struct blk_integrity.
[csander: drop redundant pi_tuple_size field, use block metadata UAPI
constants, add param validation]
Signed-off-by: Stanley Zhang <stazhang@purestorage.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk_dev_support_user_copy() will be used in ublk_validate_params().
Move these functions next to ublk_{dev,queue}_is_zoned() to avoid
needing to forward-declare them.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-integrity: take const pointer in blk_integrity_rq()
blk_integrity_rq() doesn't modify the struct request passed in, so allow
a const pointer to be passed. Use a matching signature for the
!CONFIG_BLK_DEV_INTEGRITY version.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 11 Jan 2026 20:16:36 +0000 (13:16 -0700)]
Merge branch 'block-6.19' into for-7.0/block
Merge in fixes that went to 6.19 after for-7.0/block was branched.
Pending ublk changes depend on particularly the async scan work.
* block-6.19:
block: zero non-PI portion of auto integrity buffer
ublk: fix use-after-free in ublk_partition_scan_work
blk-mq: avoid stall during boot due to synchronize_rcu_expedited
loop: add missing bd_abort_claiming in loop_set_status
block: don't merge bios with different app_tags
blk-rq-qos: Remove unlikely() hints from QoS checks
loop: don't change loop device under exclusive opener in loop_set_status
block, bfq: update outdated comment
blk-mq: skip CPU offline notify on unmapped hctx
selftests/ublk: fix Makefile to rebuild on header changes
selftests/ublk: add test for async partition scan
ublk: scan partition in async way
block,bfq: fix aux stat accumulation destination
md: Fix forward incompatibility from configurable logical block size
md: Fix logical_block_size configuration being overwritten
md: suspend array while updating raid_disks via sysfs
md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
md: Fix static checker warning in analyze_sbs
blk-crypto: handle the fallback above the block layer
Add a blk_crypto_submit_bio helper that either submits the bio when
it is not encrypted or inline encryption is provided, but otherwise
handles the encryption before going down into the low-level driver.
This reduces the risk from bio reordering and keeps memory allocation
as high up in the stack as possible.
Note that if the submitter knows that inline enctryption is known to
be supported by the underyling driver, it can still use plain
submit_bio.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid the relatively high overhead of constructing and walking per-page
segment bio_vecs for data unit alignment checking by merging the checks
into existing loops.
For hardware support crypto, perform the check in bio_split_io_at, which
already contains a similar alignment check applied for all I/O. This
means bio-based drivers that do not call bio_split_to_limits, should they
ever grow blk-crypto support, need to implement the check themselves,
just like all other queue limits checks.
For blk-crypto-fallback do it in the encryption/decryption loops. This
means alignment errors for decryption will only be detected after I/O
has completed, but that seems like a worthwhile trade off.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-crypto: use mempool_alloc_bulk for encrypted bio page allocation
Calling mempool_alloc in a loop is not safe unless the maximum allocation
size times the maximum number of threads using it is less than the
minimum pool size. Use the new mempool_alloc_bulk helper to allocate
all missing elements in one pass to remove this deadlock risk. This
also means that non-pool allocations now use alloc_pages_bulk which can
be significantly faster than a loop over individual page allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-crypto: use on-stack skcipher requests for fallback en/decryption
Allocating a skcipher request dynamically can deadlock or cause
unexpected I/O failures when called from writeback context. Avoid the
allocation entirely by using on-stack skciphers, similar to what the
non-blk-crypto fscrypt path already does.
This drops the incomplete support for asynchronous algorithms, which
previously could be used, but only synchronously.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-crypto: optimize bio splitting in blk_crypto_fallback_encrypt_bio
The current code in blk_crypto_fallback_encrypt_bio is inefficient and
prone to deadlocks under memory pressure: It first walks the passed in
plaintext bio to see how much of it can fit into a single encrypted
bio using up to BIO_MAX_VEC PAGE_SIZE segments, and then allocates a
plaintext clone that fits the size, only to allocate another bio for
the ciphertext later. While the plaintext clone uses a bioset to avoid
deadlocks when allocations could fail, the ciphertex one uses bio_kmalloc
which is a no-go in the file system I/O path.
Switch blk_crypto_fallback_encrypt_bio to walk the source plaintext bio
while consuming bi_iter without cloning it, and instead allocate a
ciphertext bio at the beginning and whenever we fille up the previous
one. The existing bio_set for the plaintext clones is reused for the
ciphertext bios to remove the deadlock risk.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-crypto: submit the encrypted bio in blk_crypto_fallback_bio_prep
Restructure blk_crypto_fallback_bio_prep so that it always submits the
encrypted bio instead of passing it back to the caller, which allows
to simplify the calling conventions for blk_crypto_fallback_bio_prep and
blk_crypto_bio_prep so that they never have to return a bio, and can
use a true return value to indicate that the caller should submit the
bio, and false that the blk-crypto code consumed it.
The submission is handled by the on-stack bio list in the current
task_struct by the block layer and does not cause additional stack
usage or major overhead. It also prepares for the following optimization
and fixes for the blk-crypto fallback write path.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
fscrypt: keep multiple bios in flight in fscrypt_zeroout_range_inline_crypt
This should slightly improve performance for large zeroing operations,
but more importantly prepares for blk-crypto refactoring that requires
all fscrypt users to call submit_bio directly.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
fscrypt: pass a real sector_t to fscrypt_zeroout_range_inline_crypt
While the pblk argument to fscrypt_zeroout_range_inline_crypt is
declared as a sector_t it actually is interpreted as a logical block
size unit, which is highly unusual. Switch to passing the 512 byte
units that sector_t is defined for.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 10 Jan 2026 13:42:36 +0000 (21:42 +0800)]
block: account for bi_bvec_done in bio_may_need_split()
When checking if a bio fits in a single segment, bio_may_need_split()
compares bi_size against the current bvec's bv_len. However, for
partially consumed bvecs (bi_bvec_done > 0), such as in cloned or
split bios, the remaining bytes in the current bvec is actually
(bv_len - bi_bvec_done), not bv_len.
This could cause bio_may_need_split() to incorrectly return false,
leading to nr_phys_segments being set to 1 when the bio actually
spans multiple segments. This triggers the WARN_ON in __blk_rq_map_sg()
when the actual mapped segments exceed the expected count.
Fix by subtracting bi_bvec_done from bv_len in the comparison.
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Close: https://lore.kernel.org/linux-block/9687cf2b-1f32-44e1-b58d-2492dc6e7185@linux.ibm.com/ Repored-and-bisected-by: Christoph Hellwig <hch@infradead.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Tested-by: Christoph Hellwig <hch@infradead.org> Fixes: ee623c892aa5 ("block: use bvec iterator helper for bio_may_need_split()") Cc: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
bi_offload_capable() returns whether a block device's metadata size
matches its PI tuple size. Use pi_tuple_size instead of switching on
csum_type. This makes the code considerably simpler and less branchy.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: zero non-PI portion of auto integrity buffer
The auto-generated integrity buffer for writes needs to be fully
initialized before being passed to the underlying block device,
otherwise the uninitialized memory can be read back by userspace or
anyone with physical access to the storage device. If protection
information is generated, that portion of the integrity buffer is
already initialized. The integrity data is also zeroed if PI generation
is disabled via sysfs or the PI tuple size is 0. However, this misses
the case where PI is generated and the PI tuple size is nonzero, but the
metadata size is larger than the PI tuple. In this case, the remainder
("opaque") of the metadata is left uninitialized.
Generalize the BLK_INTEGRITY_CSUM_NONE check to cover any case when the
metadata is larger than just the PI tuple.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: c546d6f43833 ("block: only zero non-PI metadata tuples in bio_integrity_prep") Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 9 Jan 2026 12:14:54 +0000 (20:14 +0800)]
ublk: fix use-after-free in ublk_partition_scan_work
A race condition exists between the async partition scan work and device
teardown that can lead to a use-after-free of ub->ub_disk:
1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk()
2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does:
- del_gendisk(ub->ub_disk)
- ublk_detach_disk() sets ub->ub_disk = NULL
- put_disk() which may free the disk
3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk
leading to UAF
Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold
a reference to the disk during the partition scan. The spinlock in
ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker
either gets a valid reference or sees NULL and exits early.
Also change flush_work() to cancel_work_sync() to avoid running the
partition scan work unnecessarily when the disk is already detached.
Fixes: 7fc4da6a304b ("ublk: scan partition in async way") Reported-by: Ruikai Peng <ruikai@pwno.io> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 31 Dec 2025 03:00:57 +0000 (11:00 +0800)]
io_uring: remove nr_segs recalculation in io_import_kbuf()
io_import_kbuf() recalculates iter->nr_segs to reflect only the bvecs
needed for the requested byte range. This was added to provide an
accurate segment count to bio_iov_bvec_set(), which copied nr_segs to
bio->bi_vcnt for use as a bio split hint.
The previous two patches eliminated this dependency:
- bio_may_need_split() now uses bi_iter instead of bi_vcnt for split
decisions
- bio_iov_bvec_set() no longer copies nr_segs to bi_vcnt
Since nr_segs is no longer used for bio split decisions, the
recalculation loop is unnecessary. The iov_iter already has the correct
bi_size to cap iteration, so an oversized nr_segs is harmless.
Link: https://lkml.org/lkml/2025/4/16/351 Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 31 Dec 2025 03:00:56 +0000 (11:00 +0800)]
block: don't initialize bi_vcnt for cloned bio in bio_iov_bvec_set()
bio_iov_bvec_set() creates a cloned bio that borrows a bvec array from
an iov_iter. For cloned bios, bi_vcnt is meaningless because iteration
is controlled entirely by bi_iter (bi_idx, bi_size, bi_bvec_done), not
by bi_vcnt. Remove the incorrect bi_vcnt assignment.
Explicitly initialize bi_iter.bi_idx to 0 to ensure iteration starts
at the first bvec. While bi_idx is typically already zero from bio
initialization, making this explicit improves clarity and correctness.
This change also avoids accessing iter->nr_segs, which is an iov_iter
implementation detail that block code should not depend on.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 31 Dec 2025 03:00:55 +0000 (11:00 +0800)]
block: use bvec iterator helper for bio_may_need_split()
bio_may_need_split() uses bi_vcnt to determine if a bio has a single
segment, but bi_vcnt is unreliable for cloned bios. Cloned bios share
the parent's bi_io_vec array but iterate over a subset via bi_iter,
so bi_vcnt may not reflect the actual segment count being iterated.
Replace the bi_vcnt check with bvec iterator access via
__bvec_iter_bvec(), comparing bi_iter.bi_size against the current
bvec's length. This correctly handles both cloned and non-cloned bios.
Move bi_io_vec into the first cache line adjacent to bi_iter. This is
a sensible layout since bi_io_vec and bi_iter are commonly accessed
together throughout the block layer - every bvec iteration requires
both fields. This displaces bi_end_io to the second cache line, which
is acceptable since bi_end_io and bi_private are always fetched
together in bio_endio() anyway.
The struct layout change requires bio_reset() to preserve and restore
bi_io_vec across the memset, since it now falls within BIO_RESET_BYTES.
Nitesh verified that this patch doesn't regress NVMe 512-byte IO perf [1].
Tetsuo Handa [Wed, 7 Jan 2026 10:41:43 +0000 (19:41 +0900)]
loop: add missing bd_abort_claiming in loop_set_status
Commit 08e136ebd193 ("loop: don't change loop device under exclusive
opener in loop_set_status") forgot to call bd_abort_claiming() when
mutex_lock_killable() failed.
Fixes: 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme_set_app_tag() uses the app_tag value from the bio_integrity_payload
of the struct request's first bio. This assumes all the request's bios
have the same app_tag. However, it is possible for bios with different
app_tag values to be merged into a single request.
Add a check in blk_integrity_merge_{bio,rq}() to prevent the merging of
bios/requests with different app_tag values if BIP_CHECK_APPTAG is set.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 3d8b5a22d404 ("block: add support to pass user meta buffer") Signed-off-by: Jens Axboe <axboe@kernel.dk>
Breno Leitao [Tue, 6 Jan 2026 14:26:57 +0000 (06:26 -0800)]
blk-rq-qos: Remove unlikely() hints from QoS checks
The unlikely() annotations on QUEUE_FLAG_QOS_ENABLED checks are
counterproductive. Writeback throttling (WBT) might be enabled by
default, mainly because CONFIG_BLK_WBT_MQ defaults to 'y'.
Branch profiling on Meta servers, which have WBT enabled, confirms 100%
misprediction rates on these checks.
Remove the unlikely() annotations to let the CPU's branch predictor
learn the actual behavior, potentially improving I/O path performance.
Leon Romanovsky [Wed, 17 Dec 2025 09:41:24 +0000 (11:41 +0200)]
types: move phys_vec definition to common header
Move the struct phys_vec definition from block/blk-mq-dma.c to
include/linux/types.h to make it available for use across the kernel.
The phys_vec structure represents a physical address range with a
length, which is used by the new physical address-based DMA mapping
API. This structure is already used by the block layer and will be
needed for DMA phys API users.
Moving this definition to types.h provides a centralized location
for this common data structure and eliminates code duplication
across subsystems that need to work with physical address ranges.
Leon Romanovsky [Wed, 17 Dec 2025 09:41:23 +0000 (11:41 +0200)]
nvme-pci: Use size_t for length fields to handle larger sizes
This patch changes the length variables from unsigned int to size_t.
Using size_t ensures that we can handle larger sizes, as size_t is
always equal to or larger than the previously used u32 type.
Originally, u32 was used because blk-mq-dma code evolved from
scatter-gather implementation, which uses unsigned int to describe length.
This change will also allow us to reuse the existing struct phys_vec in places
that don't need scatter-gather.
loop: don't change loop device under exclusive opener in loop_set_status
loop_set_status() is allowed to change the loop device while there
are other openers of the device, even exclusive ones.
In this case, it causes a KASAN: slab-out-of-bounds Read in
ext4_search_dir(), since when looking for an entry in an inlined
directory, e_value_offs is changed underneath the filesystem by
loop_set_status().
Fix the problem by forbidding loop_set_status() from modifying the loop
device while there are exclusive openers of the device. This is similar
to the fix in loop_configure() by commit 33ec3e53e7b1 ("loop: Don't
change loop device under exclusive opener") alongside commit ecbe6bc0003b
("block: use bd_prepare_to_claim directly in the loop driver").
Reported-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3ee481e21fd75e14c397 Tested-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com Tested-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Md Haris Iqbal [Fri, 5 Dec 2025 12:47:33 +0000 (13:47 +0100)]
rnbd-srv: Zero the rsp buffer before using it
Before using the data buffer to send back the response message, zero it
completely. This prevents any stray bytes to be picked up by the client
side when there the message is exchanged between different protocol
versions.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
rnbd-srv: Fix server side setting of bi_size for special IOs
On rnbd-srv, the bi_size of the bio is set during the bio_add_page
function, to which datalen is passed. But for special IOs like DISCARD
and WRITE_ZEROES, datalen is 0, since there is no data to write. For
these special IOs, use the bi_size of the rnbd_msg_io.
Fixes: f6f84be089c9 ("block/rnbd-srv: Add sanity check and remove redundant assignment") Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jack Wang [Fri, 5 Dec 2025 12:47:31 +0000 (13:47 +0100)]
rnbd-srv: fix the trace format for flags
The __print_flags helper meant for bitmask, while the rnbd_rw_flags is
mixed with bitmask and enum, to avoid confusion, just print the data
as it is.
Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Md Haris Iqbal [Fri, 5 Dec 2025 12:47:30 +0000 (13:47 +0100)]
block/rnbd-proto: Check and retain the NOUNMAP flag for requests
The NOUNMAP flag is in combination with WRITE_ZEROES flag to indicate
that the upper layers wants the sectors zeroed, but does not want it to
get freed. This instruction is especially important for storage stacks
which involves a layer capable of thin provisioning.
This commit makes RNBD block device transfer and retain this NOUNMAP flag
for requests, so it can be passed onto the backend device on the server
side.
Since it is a change in the wire protocol, bump the minor version of
protocol.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Md Haris Iqbal [Fri, 5 Dec 2025 12:47:28 +0000 (13:47 +0100)]
block/rnbd-proto: Handle PREFLUSH flag properly for IOs
In RNBD client, for a WRITE request of size 0, with only the REQ_PREFLUSH
bit set, while converting from bio_opf to rnbd_opf, we do REQ_OP_WRITE to
RNBD_OP_WRITE, and then check if the rq is flush through function
op_is_flush. That function checks both REQ_PREFLUSH and REQ_FUA flag, and
if any of them is set, the RNBD_F_FUA is set.
On the RNBD server side, while converting the RNBD flags to req flags, if
the RNBD_F_FUA flag is set, we just set the REQ_FUA flag. This means we
have lost the PREFLUSH flag, and added the REQ_FUA flag in its place.
This commits adds a new RNBD_F_PREFLUSH flag, and also adds separate
handling for REQ_PREFLUSH flag. On the server side, if the RNBD_F_PREFLUSH
is present, the REQ_PREFLUSH is added to the bio.
Since it is a change in the wire protocol, bump the minor version of
protocol.
The change is backwards compatible, and does not change the functionality
if either the client or the server is running older/newer versions.
If the client side is running the older version, both REQ_PREFLUSH and
REQ_FUA is converted to RNBD_F_FUA. The server running newer one would
still add only the REQ_FUA flag which is what happens when both client and
server is running the older version.
If the client side is running the newer version, just like before a
RNBD_F_FUA is added, but now a RNBD_F_PREFLUSH is also added to the
rnbd_opf. In case the server is running the older version the
RNBD_F_PREFLUSH is ignored, and only the RNBD_F_FUA is processed.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Reviewed-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Julia Lawall [Wed, 31 Dec 2025 17:22:07 +0000 (18:22 +0100)]
block, bfq: update outdated comment
The function bfq_bfqq_may_idle() was renamed as bfq_better_to_idle()
in commit 277a4a9b56cd ("block, bfq: give a better name to
bfq_bfqq_may_idle"). Update the comment accordingly.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 31 Dec 2025 13:55:07 +0000 (06:55 -0700)]
Merge tag 'md-6.19-20251231' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into block-6.19
Pull MD fixes from Yu Kuai:
"- Fix null-pointer dereference in raid5 sysfs group_thread_cnt store
(Tuo Li)
- Fix possible mempool corruption during raid1 raid_disks update via
sysfs (FengWei Shih)
- Fix logical_block_size configuration being overwritten during
super_1_validate() (Li Nan)
- Fix forward incompatibility with configurable logical block size:
arrays assembled on new kernels could not be assembled on kernels
<=6.18 due to non-zero reserved pad rejection (Li Nan)
- Fix static checker warning about iterator not incremented (Li Nan)"
* tag 'md-6.19-20251231' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux:
md: Fix forward incompatibility from configurable logical block size
md: Fix logical_block_size configuration being overwritten
md: suspend array while updating raid_disks via sysfs
md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
md: Fix static checker warning in analyze_sbs
Cong Zhang [Tue, 30 Dec 2025 09:17:05 +0000 (17:17 +0800)]
blk-mq: skip CPU offline notify on unmapped hctx
If an hctx has no software ctx mapped, blk_mq_map_swqueue() never
allocates tags and leaves hctx->tags NULL. The CPU hotplug offline
notifier can still run for that hctx, return early since hctx cannot
hold any requests.
Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com> Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
null_blk: Constify struct configfs_item_operations and configfs_group_operations
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
100263 37808 2752 140823 22617 drivers/block/null_blk/main.o
After:
=====
text data bss dec hex filename
100423 37648 2752 140823 22617 drivers/block/null_blk/main.o
Thorsten Blum [Sat, 20 Dec 2025 23:59:23 +0000 (00:59 +0100)]
brd: replace simple_strtol with kstrtoul in ramdisk_size
Replace simple_strtol() with the recommended kstrtoul() for parsing the
'ramdisk_size=' boot parameter. Unlike simple_strtol(), which returns a
long, kstrtoul() converts the string directly to an unsigned long and
avoids implicit casting.
Check the return value of kstrtoul() and reject invalid values. This
adds error handling while preserving behavior for existing values, and
removes use of the deprecated simple_strtol() helper. The current code
silently sets 'rd_size = 0' if parsing fails, instead of leaving the
default value (CONFIG_BLK_DEV_RAM_SIZE) unchanged.
Linus Torvalds [Sun, 28 Dec 2025 18:21:47 +0000 (10:21 -0800)]
Merge tag 'usb-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB fixes, and bunch of reverts for 6.19-rc3.
Included in here are:
- reverts of some typec ucsi driver changes that had a lot of
regression reports after -rc1. Let's just revert it all for now and
it will come back in a way that is better tested.
- other typec bugfixes
- usb-storage quirk fixups
- dwc3 driver fix
- other minor USB fixes for reported problems.
All of these have passed 0-day testing and individual testing"
* tag 'usb-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (22 commits)
Revert "usb: typec: ucsi: Update UCSI structure to have message in and message out fields"
Revert "usb: typec: ucsi: Add support for message out data structure"
Revert "usb: typec: ucsi: Enable debugfs for message_out data structure"
Revert "usb: typec: ucsi: Add support for SET_PDOS command"
Revert "usb: typec: ucsi: Fix null pointer dereference in ucsi_sync_control_common"
Revert "usb: typec: ucsi: Get connector status after enable notifications"
usb: ohci-nxp: clean up probe error labels
usb: gadget: lpc32xx_udc: clean up probe error labels
usb: ohci-nxp: fix device leak on probe failure
usb: phy: isp1301: fix non-OF device reference imbalance
usb: gadget: lpc32xx_udc: fix clock imbalance in error path
usb: typec: ucsi: Get connector status after enable notifications
usb: usb-storage: Maintain minimal modifications to the bcdDevice range.
usb: dwc3: of-simple: fix clock resource leak in dwc3_of_simple_probe
usb: typec: ucsi: Fix null pointer dereference in ucsi_sync_control_common
USB: lpc32xx_udc: Fix error handling in probe
usb: typec: altmodes/displayport: Drop the device reference in dp_altmode_probe()
usb: phy: fsl-usb: Fix use-after-free in delayed work during device removal
usb: renesas_usbhs: Fix a resource leak in usbhs_pipe_malloc()
usb: typec: ucsi: huawei-gaokin: add DRM dependency
...
Linus Torvalds [Sun, 28 Dec 2025 18:14:49 +0000 (10:14 -0800)]
Merge tag 'tty-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull serial driver fixes from Greg KH:
"Here are some small serial driver fixes for some reported issues.
Included in here are:
- serial sysfs fwnode fix that was much reported
- sh-sci driver fix
- serial device init bugfix
- 8250 bugfix
- xilinx_uartps bugfix
All of these have passed 0-day testing and individual testing"
* tag 'tty-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: xilinx_uartps: fix rs485 delay_rts_after_send
serial: sh-sci: Check that the DMA cookie is valid
serial: core: Fix serial device initialization
serial: 8250: longson: Fix NULL vs IS_ERR() bug in probe
serial: core: Restore sysfs fwnode information
Linus Torvalds [Sun, 28 Dec 2025 18:11:18 +0000 (10:11 -0800)]
Merge tag 'firewire-fixes-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire fix from Takashi Sakamoto:
"A fix for PCI driver for Texas Instruments PCILyx series.
The driver had a bug where it allocated a DMA-coherent buffer of 16 KB
but released it using PAGE_SIZE. This disproportion was reported in
2020, but the fix was never merged. It is finally resolved"
* tag 'firewire-fixes-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: nosy: Fix dma_free_coherent() size
Linus Torvalds [Sun, 28 Dec 2025 17:44:26 +0000 (09:44 -0800)]
Merge tag 'riscv-for-linus-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Paul Walmsley:
"Nothing exotic here; these are the cleanup and new ISA extension
probing patches (not including CFI):
- Add probing and userspace reporting support for the standard RISC-V
ISA extensions Zilsd and Zclsd, which implement load/store dual
instructions on RV32
- Abstract the register saving code in setup_sigcontext() so it can
be used for stateful RISC-V ISA extensions beyond the vector
extension
- Add the SBI extension ID and some initial data structure
definitions for the RISC-V standard SBI debug trigger extension
- Clean up some code slightly: change some page table functions to
avoid atomic operations oinn !SMP and to avoid unnecessary casts to
atomic_long_t; and use the existing RISCV_FULL_BARRIER macro in
place of some open-coded 'fence rw,rw' instructions"
* tag 'riscv-for-linus-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Add SBI debug trigger extension and function ids
riscv/atomic.h: use RISCV_FULL_BARRIER in _arch_atomic* function.
riscv: hwprobe: export Zilsd and Zclsd ISA extensions
riscv: add ISA extension parsing for Zilsd and Zclsd
dt-bindings: riscv: add Zilsd and Zclsd extension descriptions
riscv: mm: use xchg() on non-atomic_long_t variables, not atomic_long_xchg()
riscv: mm: ptep_get_and_clear(): avoid atomic ops when !CONFIG_SMP
riscv: mm: pmdp_huge_get_and_clear(): avoid atomic ops when !CONFIG_SMP
riscv: signal: abstract header saving for setup_sigcontext
Linus Torvalds [Sun, 28 Dec 2025 17:40:09 +0000 (09:40 -0800)]
Merge tag 'powerpc-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan:
- Fix for kexec warning due to SMT disable or partial SMT enabled
- Handle font bitmap pointer with reloc_offset to fix boot crash
- Fix to enable cpuidle state for Power11
- Couple of misc fixes
Thanks to Aboorva Devarajan, Aditya Bodkhe, Cedar Maxwell, Christian
Zigotzky, Christophe Leroy, Christophe Leroy (CS GROUP), Finn Thain,
Gopi Krishna Menon, Guenter Roeck, Jan Stancek, Joe Lawrence, Josh
Poimboeuf, Justin M. Forbes, Madadi Vineeth Reddy, Naveen N Rao (AMD),
Nysal Jan K.A., Sachin P Bappalige, Samir M, Sourabh Jain, Srikar
Dronamraju, and Stan Johnson
* tag 'powerpc-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/32: Restore disabling of interrupts at interrupt/syscall exit
powerpc/powernv: Enable cpuidle state detection for POWER11
powerpc: Add reloc_offset() to font bitmap pointer used for bootx_printf()
powerpc/tools: drop `-o pipefail` in gcc check scripts
selftests/powerpc/pmu/: Add check_extended_reg_test to .gitignore
powerpc/kexec: Enable SMT before waking offline CPUs
Ming Lei [Tue, 23 Dec 2025 03:27:41 +0000 (11:27 +0800)]
selftests/ublk: add test for async partition scan
Add test_generic_15.sh to verify that async partition scan prevents
IO hang when reading partition tables.
The test creates ublk devices with fault_inject target and very large
delay (60s) to simulate blocked partition table reads, then kills the
daemon to verify proper state transitions without hanging:
1. Without recovery support:
- Create device with fault_inject and 60s delay
- Kill daemon while partition scan may be blocked
- Verify device transitions to DEAD state
2. With recovery support (-r 1):
- Create device with fault_inject, 60s delay, and recovery
- Kill daemon while partition scan may be blocked
- Verify device transitions to QUIESCED state
Before the async partition scan fix, killing the daemon during
partition scan would cause deadlock as partition scan held ub->mutex
while waiting for IO. With the async fix, partition scan happens in
a work function and flush_work() ensures proper synchronization.
Add _add_ublk_dev_no_settle() helper function to skip udevadm settle,
which would otherwise hang waiting for partition scan events to
complete when partition table read is delayed.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 23 Dec 2025 03:27:40 +0000 (11:27 +0800)]
ublk: scan partition in async way
Implement async partition scan to avoid IO hang when reading partition
tables. Similar to nvme_partition_scan_work(), partition scanning is
deferred to a work queue to prevent deadlocks.
When partition scan happens synchronously during add_disk(), IO errors
can cause the partition scan to wait while holding ub->mutex, which
can deadlock with other operations that need the mutex.
Changes:
- Add partition_scan_work to ublk_device structure
- Implement ublk_partition_scan_work() to perform async scan
- Always suppress sync partition scan during add_disk()
- Schedule async work after add_disk() for trusted daemons
- Add flush_work() in ublk_stop_dev() before grabbing ub->mutex
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reported-by: Yoav Cohen <yoav@nvidia.com> Closes: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/ Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sat, 27 Dec 2025 16:37:26 +0000 (08:37 -0800)]
Merge tag 'spi-fix-v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"We've got more fixes here for the Cadence QSPI controller, this time
fixing some issues that come up when working with slower flashes on
some platforms plus a general race condition.
We also add support for the Allwinner A523, this is just some new
compatibles"
* tag 'spi-fix-v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: cadence-quadspi: Improve CQSPI_SLOW_SRAM quirk if flash is slow
spi: cadence-quadspi: Prevent lost complete() call during indirect read
spi: sun6i: Support A523's SPI controllers
spi: dt-bindings: sun6i: Add compatibles for A523's SPI controllers
Linus Torvalds [Sat, 27 Dec 2025 16:04:39 +0000 (08:04 -0800)]
Merge tag 'regulator-fix-v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A couple of fixes from Thomas, making the UAPI headers more robustly
correct and ensuring they are covered by checkpatch, and one from
Andreas fixing an update for a change to the DT bindings that I missed
was requested during bindings review in the newly added fp9931 driver"
* tag 'regulator-fix-v6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: fp9931: fix regulator node pointer
regulator: Add UAPI headers to MAINTAINERS
regulator: uapi: Use UAPI integer type
Li Nan [Fri, 26 Dec 2025 02:42:21 +0000 (10:42 +0800)]
md: Fix forward incompatibility from configurable logical block size
Commit 62ed1b582246 ("md: allow configuring logical block size") used
reserved pad to add 'logical_block_size' to metadata. RAID rejects
non-zero reserved pad, so arrays fail when rolling back to old kernels
after booting new ones.
Set 'logical_block_size' only for newly created arrays to support rollback
to old kernels. Importantly new arrays still won't work on old kernels to
prevent data loss issue from LBS changes.
For arrays created on old kernels which confirmed not to rollback,
configure LBS by echo current LBS (queue/logical_block_size) to
md/logical_block_size.
Li Nan [Fri, 26 Dec 2025 02:42:20 +0000 (10:42 +0800)]
md: Fix logical_block_size configuration being overwritten
In super_1_validate(), mddev->logical_block_size is directly overwritten
with the value from metadata. This causes the previously configured lbs
to be lost, making the configuration ineffective. Fix it.
FengWei Shih [Fri, 26 Dec 2025 10:18:16 +0000 (18:18 +0800)]
md: suspend array while updating raid_disks via sysfs
In raid1_reshape(), freeze_array() is called before modifying the r1bio
memory pool (conf->r1bio_pool) and conf->raid_disks, and
unfreeze_array() is called after the update is completed.
However, freeze_array() only waits until nr_sync_pending and
(nr_pending - nr_queued) of all buckets reaches zero. When an I/O error
occurs, nr_queued is increased and the corresponding r1bio is queued to
either retry_list or bio_end_io_list. As a result, freeze_array() may
unblock before these r1bios are released.
This can lead to a situation where conf->raid_disks and the mempool have
already been updated while queued r1bios, allocated with the old
raid_disks value, are later released. Consequently, free_r1bio() may
access memory out of bounds in put_all_bios() and release r1bios of the
wrong size to the new mempool, potentially causing issues with the
mempool as well.
Since only normal I/O might increase nr_queued while an I/O error occurs,
suspending the array avoids this issue.
Note: Updating raid_disks via ioctl SET_ARRAY_INFO already suspends
the array. Therefore, we suspend the array when updating raid_disks
via sysfs to avoid this issue too.
To fix this issue, the function should unlock mddev and return before
invoking raid5_quiesce() when conf is NULL, following the existing pattern
in raid5_change_consistency_policy().
Fixes: fa1944bbe622 ("md/raid5: Wait sync io to finish before changing group cnt") Signed-off-by: Tuo Li <islituo@gmail.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/linux-raid/20251225130326.67780-1-islituo@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Linus Torvalds [Fri, 26 Dec 2025 21:41:02 +0000 (13:41 -0800)]
Merge tag 'driver-core-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:
- Introduce DMA Rust helpers to avoid build errors when !CONFIG_HAS_DMA
- Remove unnecessary (and hence incorrect) endian conversion in the
Rust PCI driver sample code
- Fix memory leak in the unwind path of debugfs_change_name()
- Support non-const struct software_node pointers in
SOFTWARE_NODE_REFERENCE(), after introducing _Generic()
- Avoid NULL pointer dereference in the unwind path of
simple_xattrs_free()
* tag 'driver-core-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
fs/kernfs: null-ptr deref in simple_xattrs_free()
software node: Also support referencing non-constant software nodes
debugfs: Fix memleak in debugfs_change_name().
samples: rust: fix endianness issue in rust_driver_pci
rust: dma: add helpers for architectures without CONFIG_HAS_DMA
Linus Torvalds [Fri, 26 Dec 2025 21:37:11 +0000 (13:37 -0800)]
Merge tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
"A couple of fixes for EFI regressions introduced this cycle:
- Make EDID handling in the EFI stub mixed mode safe
- Ensure that efi_mm.user_ns has a sane value - this is needed now
that EFI runtime calls are preemptible on arm64"
* tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()
arm64: efi: Fix NULL pointer dereference by initializing user_ns
efi/libstub: gop: Fix EDID support in mixed-mode
Linus Torvalds [Fri, 26 Dec 2025 19:44:35 +0000 (11:44 -0800)]
Merge tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- Fix for a signedness issue introduced in this kernel release for rnbd
- Fix up user copy references for ublk when the server exits
* tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: rnbd-clt: Fix signedness bug in init_dev()
ublk: clean up user copy references on ublk server exit
Linus Torvalds [Fri, 26 Dec 2025 19:34:38 +0000 (11:34 -0800)]
Merge tag 'io_uring-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fix from Jens Axboe:
"Just a single fix for a bug that can cause a leak of the filename with
IORING_OP_OPENAT, if direct descriptors are asked for and O_CLOEXEC
has been set in the request flags"
* tag 'io_uring-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: fix filename leak in __io_openat_prep()
__io_openat_prep() allocates a struct filename using getname(). However,
for the condition of the file being installed in the fixed file table as
well as having O_CLOEXEC flag set, the function returns early. At that
point, the request doesn't have REQ_F_NEED_CLEANUP flag set. Due to this,
the memory for the newly allocated struct filename is not cleaned up,
causing a memory leak.
Fix this by setting the REQ_F_NEED_CLEANUP for the request just after the
successful getname() call, so that when the request is torn down, the
filename will be cleaned up, along with other resources needing cleanup.
Breno Leitao [Tue, 23 Dec 2025 10:55:44 +0000 (02:55 -0800)]
kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()
Add a WARN_ON_ONCE() check to detect mm_struct instances that are
missing user_ns initialization when passed to kthread_use_mm().
When a kthread adopts an mm via kthread_use_mm(), LSM hooks and
capability checks may access current->mm->user_ns for credential
validation. If user_ns is NULL, this leads to a NULL pointer
dereference crash.
This was observed with efi_mm on arm64, where commit a5baf582f4c0
("arm64/efi: Call EFI runtime services without disabling preemption")
introduced kthread_use_mm(&efi_mm), but efi_mm lacked user_ns
initialization, causing crashes during /proc access.
Adding this warning helps catch similar bugs early during development
rather than waiting for hard-to-debug NULL pointer crashes in
production.