Shida Zhang [Thu, 22 Jan 2026 06:13:21 +0000 (14:13 +0800)]
bcache: use bio cloning for detached device requests
Previously, bcache hijacked the bi_end_io and bi_private fields of
the incoming bio when the backing device was in a detached state.
This is fragile and breaks if the bio needs to be processed by
other layers.
This patch transitions to using a cloned bio embedded within a private
structure. This ensures the original bio's metadata remains untouched.
Fixes: 53280e398471 ("bcache: fix improper use of bi_end_io")
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Acked-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 22 Jan 2026 04:28:58 +0000 (12:28 +0800)]
blk-mq: use BLK_POLL_ONESHOT for synchronous poll completion
blk_execute_rq() with polling is used in kernel code paths such as
NVMe controller connect. The aggressive spinning in blk_hctx_poll()
can prevent the completion task from getting a chance to run, causing
a lockup.
The spinning with cpu_relax() doesn't yield CPU, so need_resched()
only becomes true on timer tick. This causes unnecessary spinning
while the completion task is already waiting to run.
Before commit f22ecf9c14c1, the loop would exit early because
task_is_running() was always true. After that commit removed the
check, the loop now spins until need_resched().
Fix this by using BLK_POLL_ONESHOT in blk_rq_poll_completion(). This
causes blk_hctx_poll() to poll once and return immediately, letting
the outer loop's cond_resched() yield CPU so the completion task can
run.
Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
Cc: Diangang Li <lidiangang@bytedance.com>
Cc: Fengnan Chang <changfengnan@bytedance.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 13 Jan 2026 08:58:02 +0000 (16:58 +0800)]
selftests/ublk: fix garbage output in foreground mode
Initialize _evtfd to -1 in struct dev_ctx to prevent garbage output
when running kublk in foreground mode. Without this, _evtfd is
zero-initialized to 0 (stdin), and ublk_send_dev_event() writes
binary data to stdin which appears as garbage on the terminal.
Also fix debug message format string.
Fixes: 6aecda00b7d1 ("selftests: ublk: add kernel selftests for ublk")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
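The _evtfd initialization described above can be illustrated with a minimal userspace sketch. The struct and function names below are hypothetical stand-ins, not the selftest's actual definitions, and the return convention is an assumption made for illustration. The point is only that -1 is a safe "no descriptor" sentinel, whereas a zero-initialized field silently aliases stdin.

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical stand-in for the selftest's struct dev_ctx. */
struct dev_ctx_sketch {
	int evtfd;	/* -1 means "no eventfd"; 0 would alias stdin */
};

/*
 * Assumed convention for this sketch:
 * returns 1 if an event was written, 0 if skipped, -1 on write error.
 */
int send_dev_event_sketch(const struct dev_ctx_sketch *ctx, uint64_t id)
{
	if (ctx->evtfd < 0)
		return 0;	/* foreground mode: nobody to notify */
	if (write(ctx->evtfd, &id, sizeof(id)) != (ssize_t)sizeof(id))
		return -1;
	return 1;
}
```

A caller would initialize the context with a designated initializer such as `{ .evtfd = -1 }`, so that a forgotten assignment can never turn into a write to file descriptor 0.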
Ming Lei [Tue, 13 Jan 2026 08:58:01 +0000 (16:58 +0800)]
selftests/ublk: fix error handling for starting device
Fix error handling in ublk_start_daemon() when start_dev fails:
1. Call ublk_ctrl_stop_dev() to cancel inflight uring_cmd before
cleanup. Without this, the device deletion may hang waiting for
I/O completion that will never happen.
2. Add fail_start label so that pthread_join() is called on the
error path. This ensures proper thread cleanup when startup fails.
Fixes: 6aecda00b7d1 ("selftests: ublk: add kernel selftests for ublk")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 13 Jan 2026 08:58:00 +0000 (16:58 +0800)]
selftests/ublk: fix IO thread idle check
Include cmd_inflight in ublk_thread_is_done() check. Without this,
the thread may exit before all FETCH commands are completed, which
may cause device deletion to hang.
Fixes: 6aecda00b7d1 ("selftests: ublk: add kernel selftests for ublk")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: make the new blkzoned UAPI constants discoverable
The Linux 6.19 merge window added the new BLKREPORTZONESV2 ioctl, and
with it the new BLK_ZONE_REP_CACHED and BLK_ZONE_COND_ACTIVE constants.
The two constants are defined as part of enums, which makes it very
painful for userspace to discover if they are present in the installed
system headers.
Use the trick of #defining each constant to its own name to make them
trivially discoverable using CPP directives such as #ifdef.
Fixes: 0bf0e2e46668 ("block: track zone conditions")
Fixes: b30ffcdc0c15 ("block: introduce BLKREPORTZONESV2 ioctl")
Reported-by: Andrey Albershteyn <aalbersh@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
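The self-referential #define pattern from the blkzoned commit above can be sketched as follows. The enum and constant names here are hypothetical and deliberately do not reuse the real UAPI identifiers. An enumerator alone is invisible to the preprocessor, while one that is also defined as a macro expanding to itself can be probed with #ifdef and still evaluates to the enum value.

```c
/* Hypothetical enum mirroring the pattern; not the real UAPI header. */
enum zone_cond_sketch {
	ZONE_COND_SKETCH_EMPTY = 1,	/* plain enumerator: not visible to CPP */
	ZONE_COND_SKETCH_ACTIVE = 2,
#define ZONE_COND_SKETCH_ACTIVE ZONE_COND_SKETCH_ACTIVE
};

/* Returns 1 because the macro of the same name exists. */
int have_active_cond(void)
{
#ifdef ZONE_COND_SKETCH_ACTIVE
	return 1;
#else
	return 0;
#endif
}

/* Returns 0: the enumerator exists, but the preprocessor cannot see it. */
int empty_is_detectable(void)
{
#ifdef ZONE_COND_SKETCH_EMPTY
	return 1;
#else
	return 0;
#endif
}
```

Because a macro is never re-expanded inside its own expansion, `ZONE_COND_SKETCH_ACTIVE` still resolves to the enumerator with value 2 after preprocessing.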
Seamus Connor [Thu, 15 Jan 2026 02:59:52 +0000 (18:59 -0800)]
ublk: fix ublksrv pid handling for pid namespaces
When ublksrv runs inside a pid namespace, START/END_RECOVERY compared
the stored init-ns tgid against the userspace pid (getpid vnr), so the
check failed and control ops could not proceed. Compare against the
caller’s init-ns tgid and store that value, then translate it back to
the caller’s pid namespace when reporting GET_DEV_INFO so ublk list
shows a sensible pid.
Testing: start/recover in a pid namespace; `ublk list` shows
reasonable pid values in init, child, and sibling namespaces.
Fixes: c2c8089f325e ("ublk: validate ublk server pid")
Signed-off-by: Seamus Connor <sconnor@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 14 Jan 2026 19:28:02 +0000 (11:28 -0800)]
block: Fix an error path in disk_update_zone_resources()
Any queue_limits_start_update() call must be followed either by a
queue_limits_commit_update() call or by a queue_limits_cancel_update()
call. Make sure that the error path near the start of
disk_update_zone_resources() follows this requirement. Remove the
"goto unfreeze" statement from that error path to make the code easier
to verify.
This was detected by annotating the queue_limits_*() calls with Clang
thread-safety attributes and by building the kernel with thread-safety
checking enabled. Without this patch and with thread-safety checking
enabled, the following error is reported:
block/blk-zoned.c:2020:1: error: mutex 'disk->queue->limits_lock' is not held on every path through here [-Werror,-Wthread-safety-analysis]
2020 | }
| ^
block/blk-zoned.c:1959:8: note: mutex acquired here
1959 | lim = queue_limits_start_update(q);
| ^
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: bba4322e3f30 ("block: freeze queue when updating zone resources")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260114192803.4171847-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
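The start/commit/cancel pairing rule from the disk_update_zone_resources() commit above can be modeled in a few lines of userspace C. Everything below is a hypothetical simplification: the struct layouts, the `_sketch` helpers, and the validity check are assumptions, and a plain integer flag stands in for the limits mutex. What it shows is that every exit path after the start call must go through either commit or cancel.

```c
/* Hypothetical queue model: a flag stands in for q->limits_lock. */
struct queue_sketch {
	int limits_locked;
	int max_open_zones;
};

struct limits_sketch {
	int max_open_zones;
};

/* Takes the "lock" and returns a private copy of the limits. */
struct limits_sketch limits_start_update_sketch(struct queue_sketch *q)
{
	struct limits_sketch lim = { q->max_open_zones };

	q->limits_locked = 1;
	return lim;
}

/* Applies the new limits and drops the "lock". */
void limits_commit_update_sketch(struct queue_sketch *q,
				 const struct limits_sketch *lim)
{
	q->max_open_zones = lim->max_open_zones;
	q->limits_locked = 0;
}

/* Discards the private copy and drops the "lock". */
void limits_cancel_update_sketch(struct queue_sketch *q)
{
	q->limits_locked = 0;
}

int update_zone_resources_sketch(struct queue_sketch *q, int nr_zones)
{
	struct limits_sketch lim = limits_start_update_sketch(q);

	if (nr_zones <= 0) {
		/* Error path: cancel directly instead of jumping to a label. */
		limits_cancel_update_sketch(q);
		return -1;
	}
	lim.max_open_zones = nr_zones;
	limits_commit_update_sketch(q, &lim);
	return 0;
}
```

Handling the error path inline, without a shared goto label, keeps the lock balance visible at a glance, which is the same property the thread-safety analysis checks mechanically.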
The bug is a double kobject_put() on dev->kobj during device cleanup.
Kobject Lifecycle:
kobject_init_and_add() sets kobj.kref = 1 (initialization)
kobject_put() sets kobj.kref = 0 (should be called once)
* Before this patch:
rnbd_clt_unmap_device()
rnbd_destroy_sysfs()
kobject_del(&dev->kobj) [remove from sysfs]
kobject_put(&dev->kobj) PUT #1 (WRONG!)
kref: 1 to 0
rnbd_dev_release()
kfree(dev) [DEVICE FREED!]
rnbd_destroy_gen_disk() [use-after-free!]
rnbd_clt_put_dev()
refcount_dec_and_test(&dev->refcount)
kobject_put(&dev->kobj) PUT #2 (UNDERFLOW!)
kref: 0 to -1 [WARNING!]
The first kobject_put() in rnbd_destroy_sysfs() prematurely frees the
device via rnbd_dev_release(), then the second kobject_put() in
rnbd_clt_put_dev() causes refcount underflow.
* After this patch:
Remove kobject_put() from rnbd_destroy_sysfs(). This function should
only remove sysfs visibility (kobject_del), not manage object lifetime.
Call Graph (FIXED):
rnbd_clt_unmap_device()
rnbd_destroy_sysfs()
kobject_del(&dev->kobj) [remove from sysfs only]
[kref unchanged: 1]
rnbd_destroy_gen_disk() [device still valid]
rnbd_clt_put_dev()
refcount_dec_and_test(&dev->refcount)
kobject_put(&dev->kobj) ONLY PUT (CORRECT!)
kref: 1 to 0 [BALANCED]
rnbd_dev_release()
kfree(dev) [CLEAN DESTRUCTION]
This follows the kernel pattern where sysfs removal (kobject_del) is
separate from object destruction (kobject_put).
Fixes: 581cf833cac4 ("block: rnbd: add .release to rnbd_dev_ktype")
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
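The fixed call graph of the rnbd commit above can be traced with a small reference-count model. This is a sketch under stated assumptions: the struct and the `_sketch` functions are hypothetical, the "release" only sets a flag instead of freeing memory so the state can be inspected afterwards, and the gendisk teardown is reduced to a liveness check.

```c
/* Hypothetical device model; "released" stands in for kfree(dev). */
struct rnbd_dev_sketch {
	int kref;
	int in_sysfs;
	int released;
};

/* Removes sysfs visibility only; does not touch the reference count. */
static void kobject_del_sketch(struct rnbd_dev_sketch *d)
{
	d->in_sysfs = 0;
}

/* Drops one reference and "releases" the device when it reaches zero. */
static void kobject_put_sketch(struct rnbd_dev_sketch *d)
{
	if (--d->kref == 0)
		d->released = 1;
}

/* After the fix: sysfs teardown no longer drops a reference. */
void destroy_sysfs_sketch(struct rnbd_dev_sketch *d)
{
	kobject_del_sketch(d);
}

/* Returns 1 if the device was still alive during gendisk teardown. */
int unmap_device_sketch(struct rnbd_dev_sketch *d)
{
	int alive_during_teardown;

	destroy_sysfs_sketch(d);
	alive_during_teardown = !d->released;	/* gendisk teardown runs here */
	kobject_put_sketch(d);			/* the single, final put */
	return alive_during_teardown;
}
```

With the extra put left in `destroy_sysfs_sketch()`, the same model would report the device as released before the teardown step and end with a reference count of -1, matching the "Before this patch" trace.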
Jens Axboe [Wed, 14 Jan 2026 19:35:00 +0000 (12:35 -0700)]
Merge tag 'nvme-6.19-2026-01-14' of git://git.infradead.org/nvme into block-6.19
Pull NVMe fixes from Keith:
"- Device quirk to disable faulty temperature (Ilikara)
- TCP target null pointer fix from bad host protocol usage (Shivam)
- Add compatible apple controller (Janne)
- FC tagset leak fix (Chaitanya)
- TCP socket deadlock fix (Hannes)
- Target name buffer overrun fix (Shin'ichiro)"
* tag 'nvme-6.19-2026-01-14' of git://git.infradead.org/nvme:
nvme: fix PCIe subsystem reset controller state transition
nvmet: do not copy beyond sybsysnqn string length
nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()
nvme-fc: release admin tagset if init fails
nvme-apple: add "apple,t8103-nvme-ans2" as compatible
nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec
nvme-pci: disable secondary temp for Wodposit WPBSNM8
Nilay Shroff [Wed, 14 Jan 2026 07:24:13 +0000 (12:54 +0530)]
nvme: fix PCIe subsystem reset controller state transition
The commit d2fe192348f9 (“nvme: only allow entering LIVE from CONNECTING
state”) disallows controller state transitions directly from RESETTING
to LIVE. However, the NVMe PCIe subsystem reset path relies on this
transition to recover the controller on PowerPC (PPC) systems.
On PPC systems, issuing a subsystem reset causes a temporary loss of
communication with the NVMe adapter. A subsequent PCIe MMIO read then
triggers EEH recovery, which restores the PCIe link and brings the
controller back online. For EEH recovery to proceed correctly, the
controller must transition back to the LIVE state.
Due to the changes introduced by commit d2fe192348f9 (“nvme: only allow
entering LIVE from CONNECTING state”), the controller can no longer
transition directly from RESETTING to LIVE. As a result, EEH recovery
exits prematurely, leaving the controller stuck in the RESETTING state.
Fix this by explicitly transitioning the controller state from RESETTING
to CONNECTING and then to LIVE. This satisfies the updated state
transition rules and allows the controller to be successfully recovered
on PPC systems following a PCIe subsystem reset.
Cc: stable@vger.kernel.org
Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state")
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
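The state-transition rule behind the PCIe subsystem reset fix above can be sketched as a tiny state machine. The enum, the transition table, and both functions are hypothetical and cover only the transitions the commit message discusses. The real controller has more states and more rules.

```c
/* Hypothetical subset of controller states. */
enum ctrl_state_sketch {
	ST_NEW,
	ST_LIVE,
	ST_RESETTING,
	ST_CONNECTING,
};

/* Returns 1 and updates *cur if the transition is allowed, else 0. */
int change_state_sketch(enum ctrl_state_sketch *cur,
			enum ctrl_state_sketch next)
{
	switch (next) {
	case ST_LIVE:
		/* LIVE may only be entered from CONNECTING. */
		if (*cur != ST_CONNECTING)
			return 0;
		break;
	case ST_CONNECTING:
		if (*cur != ST_RESETTING && *cur != ST_NEW)
			return 0;
		break;
	case ST_RESETTING:
		if (*cur != ST_LIVE && *cur != ST_NEW)
			return 0;
		break;
	default:
		return 0;
	}
	*cur = next;
	return 1;
}

/* Recovery path after the fix: RESETTING, then CONNECTING, then LIVE. */
int recover_after_subsys_reset_sketch(enum ctrl_state_sketch *cur)
{
	return change_state_sketch(cur, ST_CONNECTING) &&
	       change_state_sketch(cur, ST_LIVE);
}
```

Starting from `ST_RESETTING`, a direct request for `ST_LIVE` is refused and leaves the state unchanged, while the two-step recovery succeeds.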
Commit edd17206e363 ("nvmet: remove redundant subsysnqn field from
ctrl") replaced ctrl->subsysnqn with ctrl->subsys->subsysnqn. This
change works as expected because both point to strings with the same
data. However, their memory allocation lengths differ. ctrl->subsysnqn
had the fixed size defined as NVMF_NQN_FIELD_LEN, while
ctrl->subsys->subsysnqn has variable length determined by kstrndup().
Due to this difference, KASAN slab-out-of-bounds occurs at memcpy() in
nvmet_passthru_override_id_ctrl() after the commit. The failure can be
recreated by running the blktests test case nvme/033. To prevent such
failures, replace memcpy() with strscpy(), which copies only the string
length and avoids overruns.
Fixes: edd17206e363 ("nvmet: remove redundant subsysnqn field from ctrl")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
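The difference between a fixed-length memcpy() and a string-length-bounded copy, as in the subsysnqn fix above, can be shown in userspace. strscpy() itself is kernel-only, so the helper below is a hypothetical stand-in written for this sketch, with a return convention chosen for illustration. It reads the source only up to its terminating NUL, which is the property that avoids the out-of-bounds read when the source buffer is shorter than the destination field.

```c
#include <stddef.h>

/*
 * Hypothetical userspace stand-in for a bounded string copy.
 * Copies at most dst_size - 1 bytes, always NUL-terminates, and never
 * reads past the source string's terminator.
 * Returns the copied length, or -1 if the source was truncated.
 */
long copy_string_bounded_sketch(char *dst, size_t dst_size, const char *src)
{
	size_t i;

	if (dst_size == 0)
		return -1;
	for (i = 0; i < dst_size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] == '\0' ? (long)i : -1;
}
```

A fixed-size memcpy() of the destination field length would instead read that many bytes from the source regardless of where its string ends, which is the access KASAN flagged.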
Hannes Reinecke [Mon, 18 Aug 2025 09:32:45 +0000 (11:32 +0200)]
nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()
When the socket is closed while in TCP_LISTEN, a callback is run to
flush all outstanding packets, which in turn calls
nvmet_tcp_listen_data_ready() with the sk_callback_lock held.
So we need to check if we are in TCP_LISTEN before attempting
to take the sk_callback_lock to avoid a deadlock.
Nilay Shroff [Tue, 13 Jan 2026 06:57:22 +0000 (12:27 +0530)]
null_blk: fix kmemleak by releasing references to fault configfs items
When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, the null-blk
driver sets up fault injection support by creating the timeout_inject,
requeue_inject, and init_hctx_fault_inject configfs items as children
of the top-level nullbX configfs group.
However, when the nullbX device is removed, the references taken to
these fault-config configfs items are not released. As a result,
kmemleak reports a memory leak, for example:
block: zero non-PI portion of auto integrity buffer
The auto-generated integrity buffer for writes needs to be fully
initialized before being passed to the underlying block device,
otherwise the uninitialized memory can be read back by userspace or
anyone with physical access to the storage device. If protection
information is generated, that portion of the integrity buffer is
already initialized. The integrity data is also zeroed if PI generation
is disabled via sysfs or the PI tuple size is 0. However, this misses
the case where PI is generated and the PI tuple size is nonzero, but the
metadata size is larger than the PI tuple. In this case, the remainder
("opaque") of the metadata is left uninitialized.
Generalize the BLK_INTEGRITY_CSUM_NONE check to cover any case when the
metadata is larger than just the PI tuple.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: c546d6f43833 ("block: only zero non-PI metadata tuples in bio_integrity_prep")
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme_fc_init_ctrl() allocates the admin blk-mq resources right after
nvme_add_ctrl() succeeds. If any of the subsequent steps fail (changing
the controller state, scheduling connect work, etc.), we jump to the
fail_ctrl path, which tears down the controller references but never
frees the admin queue/tag set. The leaked blk-mq allocations match the
kmemleak report seen during blktests nvme/fc.
Check ctrl->ctrl.admin_tagset in the fail_ctrl path and call
nvme_remove_admin_tag_set() when it is set so that all admin queue
allocations are reclaimed whenever controller setup aborts.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Janne Grunau [Wed, 31 Dec 2025 10:10:57 +0000 (11:10 +0100)]
nvme-apple: add "apple,t8103-nvme-ans2" as compatible
After discussion with the devicetree maintainers we agreed to not extend
lists with the generic compatible "apple,nvme-ans2" anymore [1]. Add
"apple,t8103-nvme-ans2" as fallback compatible as it is the SoC the
driver and bindings were written for.
Shivam Kumar [Sat, 13 Dec 2025 18:57:48 +0000 (13:57 -0500)]
nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec
Commit efa56305908b ("nvmet-tcp: Fix a kernel panic when host sends an
invalid H2C PDU length") added ttag bounds checking and data_offset
validation in nvmet_tcp_handle_h2c_data_pdu(), but it did not validate
whether the command's data structures (cmd->req.sg and cmd->iov) have
been properly initialized before processing H2C_DATA PDUs.
The nvmet_tcp_build_pdu_iovec() function dereferences these pointers
without NULL checks. This can be triggered by sending an H2C_DATA PDU
immediately after the ICREQ/ICRESP handshake, before sending a CONNECT
command or NVMe write command.
Attack vectors that trigger NULL pointer dereferences:
1. H2C_DATA PDU sent before CONNECT → both pointers NULL
2. H2C_DATA PDU for READ command → cmd->req.sg allocated, cmd->iov NULL
3. H2C_DATA PDU for uninitialized command slot → both pointers NULL
The fix validates both cmd->req.sg and cmd->iov before calling
nvmet_tcp_build_pdu_iovec(). Both checks are required because:
- Uninitialized commands: both NULL
- READ commands: cmd->req.sg allocated, cmd->iov NULL
- WRITE commands: both allocated
Fixes: efa56305908b ("nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU length")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Shivam Kumar <kumar.shivam43666@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
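The three command states listed in the nvmet_tcp_build_pdu_iovec() commit above suggest why both pointers have to be checked. The sketch below uses a hypothetical command struct with opaque pointers in place of the real scatterlist and iovec types, and a hypothetical predicate name. It only models the validation decision, not the PDU handling itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-command state. */
struct cmd_sketch {
	void *sg;	/* stands in for cmd->req.sg */
	void *iov;	/* stands in for cmd->iov */
};

/*
 * Returns 1 only if both buffers exist, which per the commit message
 * is the case for WRITE commands only.
 */
int h2c_data_allowed_sketch(const struct cmd_sketch *cmd)
{
	return cmd->sg != NULL && cmd->iov != NULL;
}
```

Checking only `sg` would let the READ case through (sg allocated, iov NULL), which is why a single check is not sufficient.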
Ming Lei [Fri, 9 Jan 2026 12:14:54 +0000 (20:14 +0800)]
ublk: fix use-after-free in ublk_partition_scan_work
A race condition exists between the async partition scan work and device
teardown that can lead to a use-after-free of ub->ub_disk:
1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk()
2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does:
- del_gendisk(ub->ub_disk)
- ublk_detach_disk() sets ub->ub_disk = NULL
- put_disk() which may free the disk
3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk
leading to UAF
Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold
a reference to the disk during the partition scan. The spinlock in
ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker
either gets a valid reference or sees NULL and exits early.
Also change flush_work() to cancel_work_sync() to avoid running the
partition scan work unnecessarily when the disk is already detached.
Fixes: 7fc4da6a304b ("ublk: scan partition in async way")
Reported-by: Ruikai Peng <ruikai@pwno.io>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tetsuo Handa [Wed, 7 Jan 2026 10:41:43 +0000 (19:41 +0900)]
loop: add missing bd_abort_claiming in loop_set_status
Commit 08e136ebd193 ("loop: don't change loop device under exclusive
opener in loop_set_status") forgot to call bd_abort_claiming() when
mutex_lock_killable() failed.
Fixes: 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme_set_app_tag() uses the app_tag value from the bio_integrity_payload
of the struct request's first bio. This assumes all the request's bios
have the same app_tag. However, it is possible for bios with different
app_tag values to be merged into a single request.
Add a check in blk_integrity_merge_{bio,rq}() to prevent the merging of
bios/requests with different app_tag values if BIP_CHECK_APPTAG is set.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 3d8b5a22d404 ("block: add support to pass user meta buffer")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
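The merge rule added by the app_tag commit above can be sketched as a predicate over two integrity payloads. The struct, the flag value, and the function name are hypothetical, and the preliminary requirement that both payloads carry identical flags is an assumption of this sketch rather than something stated in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit; the real BIP_CHECK_APPTAG value is not assumed. */
#define BIP_CHECK_APPTAG_SKETCH (1u << 0)

/* Hypothetical stand-in for the relevant payload fields. */
struct bip_sketch {
	unsigned int flags;
	uint16_t app_tag;
};

bool integrity_mergeable_sketch(const struct bip_sketch *a,
				const struct bip_sketch *b)
{
	/* Assumption: payloads with differing flags are never merged. */
	if (a->flags != b->flags)
		return false;
	/* New rule: when the app tag is checked, it must match. */
	if ((a->flags & BIP_CHECK_APPTAG_SKETCH) && a->app_tag != b->app_tag)
		return false;
	return true;
}
```

When the check flag is clear, differing app_tag values do not block the merge, since the device is not asked to verify the tag.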
Breno Leitao [Tue, 6 Jan 2026 14:26:57 +0000 (06:26 -0800)]
blk-rq-qos: Remove unlikely() hints from QoS checks
The unlikely() annotations on QUEUE_FLAG_QOS_ENABLED checks are
counterproductive. Writeback throttling (WBT) might be enabled by
default, mainly because CONFIG_BLK_WBT_MQ defaults to 'y'.
Branch profiling on Meta servers, which have WBT enabled, confirms 100%
misprediction rates on these checks.
Remove the unlikely() annotations to let the CPU's branch predictor
learn the actual behavior, potentially improving I/O path performance.
loop: don't change loop device under exclusive opener in loop_set_status
loop_set_status() is allowed to change the loop device while there
are other openers of the device, even exclusive ones.
In this case, it causes a KASAN: slab-out-of-bounds Read in
ext4_search_dir(), since when looking for an entry in an inlined
directory, e_value_offs is changed underneath the filesystem by
loop_set_status().
Fix the problem by forbidding loop_set_status() from modifying the loop
device while there are exclusive openers of the device. This is similar
to the fix in loop_configure() by commit 33ec3e53e7b1 ("loop: Don't
change loop device under exclusive opener") alongside commit ecbe6bc0003b
("block: use bd_prepare_to_claim directly in the loop driver").
Reported-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3ee481e21fd75e14c397
Tested-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com
Tested-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Julia Lawall [Wed, 31 Dec 2025 17:22:07 +0000 (18:22 +0100)]
block, bfq: update outdated comment
The function bfq_bfqq_may_idle() was renamed as bfq_better_to_idle()
in commit 277a4a9b56cd ("block, bfq: give a better name to
bfq_bfqq_may_idle"). Update the comment accordingly.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 31 Dec 2025 13:55:07 +0000 (06:55 -0700)]
Merge tag 'md-6.19-20251231' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into block-6.19
Pull MD fixes from Yu Kuai:
"- Fix null-pointer dereference in raid5 sysfs group_thread_cnt store
(Tuo Li)
- Fix possible mempool corruption during raid1 raid_disks update via
sysfs (FengWei Shih)
- Fix logical_block_size configuration being overwritten during
super_1_validate() (Li Nan)
- Fix forward incompatibility with configurable logical block size:
arrays assembled on new kernels could not be assembled on kernels
<=6.18 due to non-zero reserved pad rejection (Li Nan)
- Fix static checker warning about iterator not incremented (Li Nan)"
* tag 'md-6.19-20251231' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux:
md: Fix forward incompatibility from configurable logical block size
md: Fix logical_block_size configuration being overwritten
md: suspend array while updating raid_disks via sysfs
md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
md: Fix static checker warning in analyze_sbs
Cong Zhang [Tue, 30 Dec 2025 09:17:05 +0000 (17:17 +0800)]
blk-mq: skip CPU offline notify on unmapped hctx
If an hctx has no software ctx mapped, blk_mq_map_swqueue() never
allocates tags and leaves hctx->tags NULL. The CPU hotplug offline
notifier can still run for that hctx. Return early in that case, since
such an hctx cannot hold any requests.
Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com>
Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 23 Dec 2025 03:27:41 +0000 (11:27 +0800)]
selftests/ublk: add test for async partition scan
Add test_generic_15.sh to verify that async partition scan prevents
IO hang when reading partition tables.
The test creates ublk devices with fault_inject target and very large
delay (60s) to simulate blocked partition table reads, then kills the
daemon to verify proper state transitions without hanging:
1. Without recovery support:
- Create device with fault_inject and 60s delay
- Kill daemon while partition scan may be blocked
- Verify device transitions to DEAD state
2. With recovery support (-r 1):
- Create device with fault_inject, 60s delay, and recovery
- Kill daemon while partition scan may be blocked
- Verify device transitions to QUIESCED state
Before the async partition scan fix, killing the daemon during
partition scan would cause deadlock as partition scan held ub->mutex
while waiting for IO. With the async fix, partition scan happens in
a work function and flush_work() ensures proper synchronization.
Add _add_ublk_dev_no_settle() helper function to skip udevadm settle,
which would otherwise hang waiting for partition scan events to
complete when partition table read is delayed.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 23 Dec 2025 03:27:40 +0000 (11:27 +0800)]
ublk: scan partition in async way
Implement async partition scan to avoid IO hang when reading partition
tables. Similar to nvme_partition_scan_work(), partition scanning is
deferred to a work queue to prevent deadlocks.
When partition scan happens synchronously during add_disk(), IO errors
can cause the partition scan to wait while holding ub->mutex, which
can deadlock with other operations that need the mutex.
Changes:
- Add partition_scan_work to ublk_device structure
- Implement ublk_partition_scan_work() to perform async scan
- Always suppress sync partition scan during add_disk()
- Schedule async work after add_disk() for trusted daemons
- Add flush_work() in ublk_stop_dev() before grabbing ub->mutex
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reported-by: Yoav Cohen <yoav@nvidia.com>
Closes: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Nan [Fri, 26 Dec 2025 02:42:21 +0000 (10:42 +0800)]
md: Fix forward incompatibility from configurable logical block size
Commit 62ed1b582246 ("md: allow configuring logical block size") used
reserved pad to add 'logical_block_size' to metadata. RAID rejects
non-zero reserved pad, so arrays fail when rolling back to old kernels
after booting new ones.
Set 'logical_block_size' only for newly created arrays to support rollback
to old kernels. Importantly, new arrays still will not work on old kernels,
which prevents data loss from LBS changes.
For arrays created on old kernels that are confirmed never to be rolled
back, configure LBS by echoing the current LBS (queue/logical_block_size)
to md/logical_block_size.
Li Nan [Fri, 26 Dec 2025 02:42:20 +0000 (10:42 +0800)]
md: Fix logical_block_size configuration being overwritten
In super_1_validate(), mddev->logical_block_size is directly overwritten
with the value from metadata. This causes the previously configured lbs
to be lost, making the configuration ineffective. Fix it.
FengWei Shih [Fri, 26 Dec 2025 10:18:16 +0000 (18:18 +0800)]
md: suspend array while updating raid_disks via sysfs
In raid1_reshape(), freeze_array() is called before modifying the r1bio
memory pool (conf->r1bio_pool) and conf->raid_disks, and
unfreeze_array() is called after the update is completed.
However, freeze_array() only waits until nr_sync_pending and
(nr_pending - nr_queued) of all buckets reach zero. When an I/O error
occurs, nr_queued is increased and the corresponding r1bio is queued to
either retry_list or bio_end_io_list. As a result, freeze_array() may
unblock before these r1bios are released.
This can lead to a situation where conf->raid_disks and the mempool have
already been updated while queued r1bios, allocated with the old
raid_disks value, are later released. Consequently, free_r1bio() may
access memory out of bounds in put_all_bios() and release r1bios of the
wrong size to the new mempool, potentially causing issues with the
mempool as well.
Since only normal I/O might increase nr_queued while an I/O error occurs,
suspending the array avoids this issue.
Note: Updating raid_disks via ioctl SET_ARRAY_INFO already suspends
the array. Therefore, we suspend the array when updating raid_disks
via sysfs to avoid this issue too.
To fix this issue, the function should unlock mddev and return before
invoking raid5_quiesce() when conf is NULL, following the existing pattern
in raid5_change_consistency_policy().
Fixes: fa1944bbe622 ("md/raid5: Wait sync io to finish before changing group cnt")
Signed-off-by: Tuo Li <islituo@gmail.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20251225130326.67780-1-islituo@gmail.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Dan Carpenter [Sat, 20 Dec 2025 08:46:10 +0000 (11:46 +0300)]
block: rnbd-clt: Fix signedness bug in init_dev()
The "dev->clt_device_id" variable is set using ida_alloc_max() which
returns an int and in particular it returns negative error codes.
Change the type from u32 to int to fix the error checking.
Fixes: c9b5645fd8ca ("block: rnbd-clt: Fix leaked ID in init_dev()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
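The signedness bug in the rnbd-clt init_dev() commit above is easy to reproduce in isolation. In the sketch below, the allocator is a hypothetical stub that returns either a fixed ID or a negative error code (-28 is used as an arbitrary stand-in), and the two checker functions differ only in the type of the variable that holds the result.

```c
#include <stdint.h>

/* Hypothetical allocator stub: returns an ID or a negative error code. */
int alloc_id_sketch(int fail)
{
	return fail ? -28 : 7;
}

/* Returns 1 if the failure is detected. With a u32 it never is. */
int detects_error_with_u32(int fail)
{
	uint32_t id = (uint32_t)alloc_id_sketch(fail);

	/*
	 * Widening to a signed 64-bit type shows the stored value is a
	 * large positive number, so a "negative" check cannot succeed.
	 */
	return (int64_t)id < 0;
}

/* Returns 1 if the failure is detected. With an int it is. */
int detects_error_with_int(int fail)
{
	int id = alloc_id_sketch(fail);

	return id < 0;
}
```

In the u32 variant, the error code -28 is stored as 4294967268, which then passes for a valid ID. Changing the variable to int restores the error check.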
ublk: clean up user copy references on ublk server exit
If a ublk server process releases a ublk char device file, any requests
dispatched to the ublk server but not yet completed will retain a ref
value of UBLK_REFCOUNT_INIT. Before commit e63d2228ef83 ("ublk: simplify
aborting ublk request"), __ublk_fail_req() would decrement the reference
count before completing the failed request. However, that commit
optimized __ublk_fail_req() to call __ublk_complete_rq() directly
without decrementing the request reference count.
The leaked reference count incorrectly allows user copy and zero copy
operations on the completed ublk request. It also triggers the
WARN_ON_ONCE(refcount_read(&io->ref)) warnings in ublk_queue_reinit()
and ublk_deinit_queue().
Commit c5c5eb24ed61 ("ublk: avoid ublk_io_release() called after ublk
char dev is closed") already fixed the issue for ublk devices using
UBLK_F_SUPPORT_ZERO_COPY or UBLK_F_AUTO_BUF_REG. However, the reference
count leak also affects UBLK_F_USER_COPY, the other reference-counted
data copy mode. Fix the condition in ublk_check_and_reset_active_ref()
to include all reference-counted data copy modes. This ensures that any
ublk requests still owned by the ublk server when it exits have their
reference counts reset to 0.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: e63d2228ef83 ("ublk: simplify aborting ublk request")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Various code assumes that the integrity interval is at least 1 sector
and evenly divides the logical block size. Add these checks to
blk_validate_integrity_limits(). This guards against block drivers that
report invalid interval_exp values.
The PI tuple must be contained within the metadata value, so validate
that pi_offset + pi_tuple_size <= metadata_size. This guards against
block drivers that report invalid pi_offset values.
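The two checks described above can be written out as a standalone validation function. This is a sketch under stated assumptions: the struct is a hypothetical subset of the integrity profile, the function name is invented, a 512-byte sector (shift of 9) is assumed, and defaulting of an unset interval_exp is not modeled.

```c
#include <stdbool.h>

#define SECTOR_SHIFT_SKETCH 9	/* assumes 512-byte sectors */

/* Hypothetical subset of the integrity limits. */
struct integrity_sketch {
	unsigned int interval_exp;	/* log2 of the integrity interval */
	unsigned int metadata_size;
	unsigned int pi_offset;
	unsigned int pi_tuple_size;
};

bool integrity_limits_valid_sketch(const struct integrity_sketch *bi,
				   unsigned int logical_block_size)
{
	unsigned int interval;

	/* The interval must be at least one sector (and a sane shift). */
	if (bi->interval_exp < SECTOR_SHIFT_SKETCH || bi->interval_exp >= 32)
		return false;
	interval = 1u << bi->interval_exp;

	/* The interval must evenly divide the logical block size. */
	if (logical_block_size % interval != 0)
		return false;

	/* The PI tuple must lie entirely within the metadata. */
	if ((unsigned long long)bi->pi_offset + bi->pi_tuple_size >
	    bi->metadata_size)
		return false;

	return true;
}
```

The widening cast in the last check keeps the sum from wrapping around when a driver reports very large offsets.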
Ming Lei [Fri, 12 Dec 2025 14:34:15 +0000 (22:34 +0800)]
ublk: fix deadlock when reading partition table
When a process (such as udev) opens a ublk block device (e.g., to read
the partition table via bdev_open()), a deadlock[1] can occur:
1. bdev_open() grabs disk->open_mutex
2. The process issues read I/O to ublk backend to read partition table
3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
runs bio->bi_end_io() callbacks
4. If this triggers fput() on file descriptor of ublk block device, the
work may be deferred to current task's task work (see fput() implementation)
5. This eventually calls blkdev_release() from the same context
6. blkdev_release() tries to grab disk->open_mutex again
7. Deadlock: same task waiting for a mutex it already holds
The fix is to run blk_update_request() and blk_mq_end_request() with bottom
halves disabled. This forces blkdev_release() to run in kernel work-queue
context instead of the current task's task-work context, which lets the ublk
server make forward progress and avoids the deadlock.
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Link: https://github.com/ublk-org/ublksrv/issues/170
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
[axboe: rewrite comment in ublk]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: add allocation size check in blkdev_pr_read_keys()
blkdev_pr_read_keys() takes num_keys from userspace and uses it to
calculate the allocation size for keys_info via struct_size(). While
there is a check for SIZE_MAX (integer overflow), there is no upper
bound validation on the allocation size itself.
A malicious or buggy userspace can pass a large num_keys value that
doesn't trigger overflow but still results in an excessive allocation
attempt, causing a warning in the page allocator when the order exceeds
MAX_PAGE_ORDER.
Fix this by introducing PR_KEYS_MAX to limit the number of keys to
a sane value. This makes the SIZE_MAX check redundant, so remove it.
Also switch to kvzalloc/kvfree to handle larger allocations gracefully.
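A minimal userspace sketch of the "bound the count before sizing the allocation" pattern follows. The limit value, struct keys_info, and keys_info_alloc() are assumptions made for illustration; the real PR_KEYS_MAX value and the kernel's kvzalloc() path are not reproduced here, and calloc() stands in for the allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative limit only; the value the patch picks may differ. */
#define SKETCH_PR_KEYS_MAX 65536u

/* Hypothetical model of a key list with a flexible array member. */
struct keys_info {
	uint32_t generation;
	uint32_t num_keys;
	uint64_t keys[];
};

/*
 * Allocate room for num_keys keys, rejecting oversized requests up front.
 * Returns 0 and stores the zeroed buffer in *out, or a negative errno.
 */
static int keys_info_alloc(uint32_t num_keys, struct keys_info **out)
{
	size_t size;

	assert(out != NULL);
	if (num_keys > SKETCH_PR_KEYS_MAX)
		return -EINVAL;

	/* Cannot overflow: num_keys is bounded by SKETCH_PR_KEYS_MAX. */
	size = sizeof(struct keys_info) + (size_t)num_keys * sizeof(uint64_t);
	*out = calloc(1, size);
	if (!*out)
		return -ENOMEM;
	(*out)->num_keys = num_keys;
	return 0;
}
```

Because the count is capped first, a separate overflow check on the computed size is no longer needed, which mirrors the reasoning in the text.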
Yongpeng Yang [Mon, 15 Dec 2025 15:21:06 +0000 (23:21 +0800)]
zloop: use READ_ONCE() to read zlo->state in queue_rq path
In the queue_rq path, zlo->state is accessed without locking, and direct
access may read stale data. This patch uses READ_ONCE() to read
zlo->state and data_race() to silence code checkers, and changes all
assignments to use WRITE_ONCE().
Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yongpeng Yang [Mon, 15 Dec 2025 15:21:04 +0000 (23:21 +0800)]
loop: use READ_ONCE() to read lo->lo_state without locking
When lo->lo_mutex is not held, direct access may read stale data. This
patch uses READ_ONCE() to read lo->lo_state and data_race() to silence
code checkers, and changes all assignments to use WRITE_ONCE().
Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
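The marked-access pattern used by the zloop and loop patches above can be sketched in userspace as follows. The two macros are simplified stand-ins built on volatile casts and the GCC/Clang __typeof__ extension; the kernel's READ_ONCE()/WRITE_ONCE() are more elaborate. The enum, struct, and helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins; the kernel macros are more elaborate. */
#define SKETCH_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

enum dev_state { DEV_UNBOUND, DEV_BOUND, DEV_RUNDOWN };

struct dev {
	enum dev_state state;
};

/* Writer side: every assignment goes through the marked store. */
static void dev_set_state(struct dev *d, enum dev_state s)
{
	assert(d != NULL);
	SKETCH_WRITE_ONCE(d->state, s);
}

/* Lockless reader: one marked load, then act on the local copy only. */
static int dev_can_queue(struct dev *d)
{
	enum dev_state s;

	assert(d != NULL);
	s = SKETCH_READ_ONCE(d->state);
	return s == DEV_BOUND;
}
```

The marked load keeps the compiler from re-reading or tearing the field; it does not by itself order the access against other memory operations.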
Ilikara Zheng [Mon, 8 Dec 2025 13:23:40 +0000 (21:23 +0800)]
nvme-pci: disable secondary temp for Wodposit WPBSNM8
Secondary temperature thresholds (temp2_{min,max}) were not reported
properly on this NVMe SSD. This resulted in an error while attempting to
read these values with sensors(1):
ERROR: Can't get value of subfeature temp2_min: I/O error
ERROR: Can't get value of subfeature temp2_max: I/O error
Add the device to the nvme_id_table with the
NVME_QUIRK_NO_SECONDARY_TEMP_THRESH flag to suppress access to all non-
composite temperature thresholds.
Ming Lei [Fri, 12 Dec 2025 14:35:00 +0000 (22:35 +0800)]
block: fix race between wbt_enable_default and IO submission
When wbt_enable_default() is moved out of queue freezing in elevator_change(),
it can cause the wbt inflight counter to become negative (-1), leading to hung
tasks in the writeback path. Tasks get stuck in wbt_wait() because the counter
is in an inconsistent state.
The issue occurs because wbt_enable_default() could race with IO submission,
allowing the counter to be decremented before proper initialization. This manifests
as:
rq_wait[0]:
inflight: -1
has_waiters: True
rwb_enabled() checks the state, which can be updated exactly between wbt_wait()
(rq_qos_throttle()) and wbt_track() (rq_qos_track()); the inflight counter
then becomes negative.
This results in hung task warnings like:
task:kworker/u24:39 state:D stack:0 pid:14767
Call Trace:
rq_qos_wait+0xb4/0x150
wbt_wait+0xa9/0x100
__rq_qos_throttle+0x24/0x40
blk_mq_submit_bio+0x672/0x7b0
...
Fix this by:
1. Splitting wbt_enable_default() into:
- __wbt_enable_default(): Returns true if wbt_init() should be called
- wbt_enable_default(): Wrapper for existing callers (no init)
- wbt_init_enable_default(): New function that checks and inits WBT
2. Using wbt_init_enable_default() in blk_register_queue() to ensure
proper initialization during queue registration
3. Moving wbt_init() out of wbt_enable_default(), which is only used by bfq
and iocost to re-enable disabled wbt and does not need wbt_init(). This
avoids the original lock warning.
4. Removing the ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT flag and its handling
code since it's no longer needed
This ensures WBT is properly initialized before any IO can be submitted,
preventing the counter from going negative.
Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Yu Kuai <yukuai@fnnas.com> Cc: Guangwu Zhang <guazhang@redhat.com> Fixes: 78c271344b6f ("block: move wbt_enable_default() out of queue freezing from sched ->exit()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ublk selftests cover every data copy mode except user copy. Add
tests for user copy based on the existing test suite:
- generic_14 ("basic recover function verification (user copy)") based
on generic_04 and generic_05
- null_03 ("basic IO test with user copy") based on null_01 and null_02
- loop_06 ("write and verify over user copy") based on loop_01 and
loop_03
- loop_07 ("mkfs & mount & umount with user copy") based on loop_02 and
loop_04
- stripe_05 ("write and verify test on user copy") based on stripe_03
- stripe_06 ("mkfs & mount & umount on user copy") based on stripe_02
and stripe_04
- stress_06 ("run IO and remove device (user copy)") based on stress_01
and stress_03
- stress_07 ("run IO and kill ublk server (user copy)") based on
stress_02 and stress_04
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: add support for user copy to kublk
The ublk selftests mock ublk server kublk supports every data copy mode
except user copy. Add support for user copy to kublk, enabled via the
--user_copy (-u) command line argument. On writes, issue pread() calls
to copy the write data into the ublk_io's buffer before dispatching the
write to the target implementation. On reads, issue pwrite() calls to
copy read data from the ublk_io's buffer before committing the request.
Copy in 2 KB chunks to provide some coverage of the offsetting logic.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The kublk mock ublk server allows multiple data copy mode arguments to
be passed on the command line (--zero_copy, --get_data, and --auto_zc).
The ublk device will be created with all the requested feature flags,
however kublk will only use one of the modes to interact with request
data (arbitrarily preferring auto_zc over zero_copy over get_data). To
clarify the intent of the test, don't allow multiple data copy modes to
be specified. --zero_copy and --auto_zc are allowed together for
--auto_zc_fallback, which uses both copy modes.
Don't set UBLK_F_USER_COPY for zero_copy, as it's a separate feature.
Fix the test cases in test_stress_05 passing --get_data along with
--zero_copy or --auto_zc.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: don't share backing files between ublk servers
stress_04 is missing a wait between blocks of tests, meaning multiple
ublk servers will be running in parallel using the same backing files.
Add a wait after each section to ensure each backing file is in use by a
single ublk server at a time.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: use auto_zc for PER_IO_DAEMON tests in stress_04
stress_04 is described as "run IO and kill ublk server(zero copy)" but
the --per_io_tasks test cases don't use zero copy. Plus, one of the
test cases is duplicated. Add --auto_zc to these test cases and
--auto_zc_fallback to one of the duplicated ones. This matches the test
cases in stress_03.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: fix fio arguments in run_io_and_recover()
run_io_and_recover() invokes fio with --size="${size}", but the variable
size doesn't exist. Thus, the argument expands to --size=, which causes
fio to exit immediately with an error without issuing any I/O. Pass the
value for size as the first argument to the function.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
selftests: ublk: correct last_rw map type in seq_io.bt
The last_rw map is initialized with a value of 0 but later assigned the
value args.sector + args.nr_sector, which has type sector_t = u64.
bpftrace complains about the type mismatch between int64 and uint64:
trace/seq_io.bt:18:3-59: ERROR: Type mismatch for @last_rw: trying to assign value of type 'uint64' when map already contains a value of type 'int64'
@last_rw[$dev, str($2)] = (args.sector + args.nr_sector);
Cast the initial value to uint64 so bpftrace will load the program.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 12 Dec 2025 17:16:59 +0000 (10:16 -0700)]
selftests: ublk: fix overflow in ublk_queue_auto_zc_fallback()
The functions ublk_queue_use_zc(), ublk_queue_use_auto_zc(), and
ublk_queue_auto_zc_fallback() were returning int, but performing
bitwise AND on q->flags which is __u64.
When a flag bit is set in the upper 32 bits (beyond INT_MAX), the
result of the bitwise AND operation could overflow when cast to int,
leading to incorrect boolean evaluation.
For example, if UBLKS_Q_AUTO_BUF_REG_FALLBACK is 0x8000000000000000:
- (u64)flags & 0x8000000000000000 = 0x8000000000000000
- Cast to int: undefined behavior / incorrect value
- Used in if(): may evaluate incorrectly
Fix by:
1. Changing return type from int to bool for semantic correctness
2. Using !! to explicitly convert to boolean (0 or 1)
This ensures the functions return proper boolean values regardless
of which bit position the flags occupy in the 64-bit field.
Fixes: c3a6d48f86da ("selftests: ublk: remove ublk queue self-defined flags") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
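The difference between the two return types can be sketched as below. The flag value mirrors the hypothetical bit-63 example in the text, and the struct and function names are made up for illustration. Converting an out-of-range 64-bit value to int is implementation-defined in C; common compilers with a 32-bit int keep only the low 32 bits, so the high flag is lost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag in the top bit of a 64-bit flags word. */
#define SKETCH_Q_HIGH_FLAG (1ULL << 63)

struct queue {
	uint64_t flags;
};

/*
 * Buggy shape: the 64-bit result is converted to int on return. With a
 * typical 32-bit int the high bits are dropped, so a set flag reads as 0.
 */
static int queue_has_flag_int(const struct queue *q)
{
	assert(q != NULL);
	return q->flags & SKETCH_Q_HIGH_FLAG;
}

/* Fixed shape: !! collapses any non-zero value to 1 before returning. */
static bool queue_has_flag_bool(const struct queue *q)
{
	assert(q != NULL);
	return !!(q->flags & SKETCH_Q_HIGH_FLAG);
}
```

With the bool version the result no longer depends on which bit position the flag occupies.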
Pavel Begunkov [Sun, 23 Nov 2025 22:51:23 +0000 (22:51 +0000)]
block: move around bio flagging helpers
We'll need bio_flagged() earlier in bio.h for later patches, move it
together with all related helpers, and mark the bio_flagged()'s bio
argument as const.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 10 Dec 2025 10:43:46 +0000 (02:43 -0800)]
blk-mq-dma: always initialize dma state
Ensure the dma state is initialized when we're not using the contiguous
iova, otherwise the caller may be using a stale state from a previous
request that could use the coalesced iova allocation.
Fixes: 2f6b2565d43cdb5 ("block: accumulate memory segment gaps per bio") Reported-by: Sebastian Ott <sebott@redhat.com> Tested-by: Sebastian Ott <sebott@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fengnan Chang [Wed, 10 Dec 2025 08:55:00 +0000 (16:55 +0800)]
blk-mq: delete task running check in blk_hctx_poll()
blk_hctx_poll() always checks if the task is running or not, and returns
1 if the task is running. This is a leftover from when polled IO was
purely for synchronous IO, and doesn't make sense anymore when polled IO
is purely asynchronous. Similarly, marking the task as TASK_RUNNING is
also superfluous, as the task very much has to be running to enter the
function in the first place.
This check appears to exist for historical reasons: in very early
versions of this function the user would set the process state to
TASK_UNINTERRUPTIBLE.
Signed-off-by: Diangang Li <lidiangang@bytedance.com> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
[axboe: kill all remnants of task running, pointless now. massage message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: fix cached zone reports on devices with native zone append
When mounting a btrfs file system on virtio-blk which supports native
Zone Append there has been a WARN triggering in btrfs' space management
code.
Further looking into btrfs' zoned statistics uncovered the filesystem
expecting the zones to be used, but the write pointers being 0:
# cat /sys/fs/btrfs/8eabd2e7-3294-4f9e-9b58-7e64135c8bf4/zoned_stats
active block-groups: 4
reclaimable: 0
unused: 0
need reclaim: false
data relocation block-group: 1342177280
active zones:
start: 1073741824, wp: 0 used: 0, reserved: 0, unusable: 0
start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0
start: 1610612736, wp: 0 used: 16384, reserved: 0, unusable: 18446744073709535232
start: 1879048192, wp: 0 used: 131072, reserved: 0, unusable: 18446744073709420544
Looking at the blkzone report output for the zone in question
(1610612736) the write pointer on the device moved, but the filesystem
did not see a change on the write pointer:
# blkzone report -c 1 -o 0x300000 /dev/vda
start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x000040 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
The zone write pointer is 0 because btrfs is using the cached version
of blkdev_report_zones() and, since virtio-blk supports native zone
append, blkdev_revalidate_zones() does not initialize the zone write
plugs in this case.
Not skipping the revalidate of sequential zones in
blkdev_revalidate_zones() callchain fixes this issue.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Fixes: a6aa36e957a1 ("block: Remove zone write plugs when handling native zone append writes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
tagset, the functions make sure that tagset and queues are marked as
shared when two or more queues are attached to the same tagset.
Initially a tagset starts as unshared and when the number of added
queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
with all the queues attached to it. When the number of attached queues
drops to 1, blk_mq_del_queue_tag_set() needs to mark both the tagset and
the remaining queue as unshared.
Both functions need to freeze current queues in tagset before setting or
unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
hold set->tag_list_lock mutex, which makes sense as we do not want
queues to be added or deleted in the process. This used to work fine
until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
made the nvme driver quiesce tagset instead of quiescing individual
queues. blk_mq_quiesce_tagset() does the job and quiesces the queues in
set->tag_list while also holding set->tag_list_lock.
This results in deadlock between two threads with these stacktraces:
The top stacktrace shows nvme_timeout() being called to handle an nvme
command timeout. The timeout handler is trying to disable the controller
and, as a first step, it needs to call blk_mq_quiesce_tagset() to tell
blk-mq not to call queue callback handlers. The thread is stuck waiting for
set->tag_list_lock as it tries to walk the queues in set->tag_list.
The lock is held by the second thread in the bottom stack which is
waiting for one of the queues to be frozen. The queue usage counter will
drop to zero after nvme_timeout() finishes, and this will not happen
because the thread will wait for this mutex forever.
Given that [un]quiescing queue is an operation that does not need to
sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking
set->tag_list_lock, update blk_mq_{add,del}_queue_tag_set() to use RCU
safe list operations. Also, delete INIT_LIST_HEAD(&q->tag_set_list)
in blk_mq_del_queue_tag_set() because we can not re-initialize it while
the list is being traversed under RCU. The deleted queue will not be
added/deleted to/from a tagset and it will be freed in blk_free_queue()
after the end of RCU grace period.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
__bio_for_each_segment() uses the returned struct bio_vec's bv_len field
to advance the struct bvec_iter at the end of each loop iteration. So
it's incorrect to modify it during the loop. Don't assign to bv_len (or
bv_offset, for that matter) in ublk_copy_user_pages().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: e87d66ab27ac ("ublk: use rq_for_each_segment() for user copy") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
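The hazard can be illustrated with a toy iterator whose loop, like the one described above, advances by the length stored in the current segment. The types, the 4096-byte segment cap, and the function names are all hypothetical and much simpler than the kernel's bvec iterator; the point is only that a per-segment clamp belongs in a local variable.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a segment and an iterator; not the kernel's bvec types. */
struct seg {
	size_t offset;
	size_t len;
};

struct seg_iter {
	size_t pos;   /* bytes consumed so far */
	size_t size;  /* total bytes to walk */
};

#define SKETCH_SEG_MAX 4096u

/* Current segment: at most SKETCH_SEG_MAX bytes starting at it->pos. */
static struct seg seg_at(const struct seg_iter *it)
{
	assert(it != NULL && it->pos < it->size);
	size_t left = it->size - it->pos;
	struct seg s = { it->pos, left < SKETCH_SEG_MAX ? left : SKETCH_SEG_MAX };
	return s;
}

/*
 * Walk all segments, accounting at most `limit` bytes from each one.
 * The loop advances by s.len, so s is left untouched and the clamp is
 * kept in the local variable n instead.
 */
static size_t walk(size_t total, size_t limit, size_t *iterations)
{
	struct seg_iter it = { 0, total };
	struct seg s = { 0, 0 };
	size_t copied = 0;

	assert(iterations != NULL);
	*iterations = 0;
	for (; it.pos < it.size && ((s = seg_at(&it)), 1); it.pos += s.len) {
		size_t n = s.len < limit ? s.len : limit; /* do not write s.len */

		copied += n;
		(*iterations)++;
	}
	return copied;
}
```

Had the loop body assigned the clamped value back to s.len, the iterator would advance by the clamped amount and visit a different, larger set of segments.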
Shida Zhang [Tue, 9 Dec 2025 09:01:57 +0000 (17:01 +0800)]
block: prohibit calls to bio_chain_endio
Now that all potential callers of bio_chain_endio have been
eliminated, completely prohibit any future calls to this function.
Suggested-by: Ming Lei <ming.lei@redhat.com> Suggested-by: Andreas Gruenbacher <agruenba@redhat.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
Handling most of the ublksrv_ctrl_cmd opcodes requires locking a mutex,
so ublk_ctrl_uring_cmd() bails out with EAGAIN when called with the
IO_URING_F_NONBLOCK issue flag. However, several opcodes can be handled
without blocking:
- UBLK_CMD_GET_QUEUE_AFFINITY
- UBLK_CMD_GET_DEV_INFO
- UBLK_CMD_GET_DEV_INFO2
- UBLK_U_CMD_GET_FEATURES
Handle these opcodes synchronously instead of returning EAGAIN so
io_uring doesn't need to issue the command via the worker thread pool.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* tag 'nvme-6.19-2025-12-04' of git://git.infradead.org/nvme:
nvme-fabrics: add ENOKEY to no retry criteria for authentication failures
nvme-auth: use kvfree() for memory allocated with kvcalloc()
nvmet-tcp: use kvcalloc for commands array
nvmet-rdma: use kvcalloc for commands and responses arrays
nvme: fix typo error in nvme target
nvmet-fc: use pr_* print macros instead of dev_*
nvmet-fcloop: remove unused lsdir member.
nvmet-fcloop: check all request and response have been processed
nvme-fc: check all request and response have been processed
nvme-fc: don't hold rport lock when putting ctrl
nvme-pci: add debug message on fail to read CSTS
nvme-pci: print error message on failure in nvme_probe
nvmet: pci-epf: fix DMA channel debug print
nvmet: pci-epf: move DMA initialization to EPC init callback
nvmet: remove redundant subsysnqn field from ctrl
nvmet: add sanity checks when freeing subsystem
Justin Tee [Mon, 17 Nov 2025 18:43:43 +0000 (10:43 -0800)]
nvme-fabrics: add ENOKEY to no retry criteria for authentication failures
With authentication, in addition to EKEYREJECTED there is also no point in
retrying reconnects when status is ENOKEY. Thus, add -ENOKEY as another
criterion to determine when to stop retries.
Cc: Daniel Wagner <wagi@kernel.org> Cc: Hannes Reinecke <hare@suse.de> Closes: https://lore.kernel.org/linux-nvme/20250829-nvme-fc-sync-v3-0-d69c87e63aee@kernel.org/ Signed-off-by: Justin Tee <justintee8345@gmail.com> Tested-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Israel Rukshin [Sun, 23 Nov 2025 14:46:48 +0000 (16:46 +0200)]
nvme-auth: use kvfree() for memory allocated with kvcalloc()
Memory allocated by kvcalloc() may come from vmalloc or kmalloc,
so use kvfree() instead of kfree() for proper deallocation.
Fixes: aa36d711e945 ("nvme-auth: convert dhchap_auth_list to an array") Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Israel Rukshin [Mon, 24 Nov 2025 06:49:21 +0000 (08:49 +0200)]
nvmet-tcp: use kvcalloc for commands array
Replace kcalloc with kvcalloc for allocation of the commands
array. Each command structure is 712 bytes. The array typically
exceeds a single page, and grows much larger with high queue depths
(e.g., commands >182KB).
kvcalloc automatically falls back to vmalloc for large or fragmented
allocations, improving reliability. In our case, this memory is not
aimed for DMA operations and could be safely allocated by kvcalloc.
Using virtually contiguous memory helps to avoid allocation failures
and out-of-memory conditions common with kcalloc on large pools.
Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Israel Rukshin [Mon, 24 Nov 2025 06:49:20 +0000 (08:49 +0200)]
nvmet-rdma: use kvcalloc for commands and responses arrays
Replace kcalloc with kvcalloc for allocation of the commands and
responses arrays. Each command structure is 272 bytes and each
response structure is 672 bytes. These arrays typically exceed a
single page, and grow much larger with high queue depths
(e.g., commands >2MB, responses >170KB)
kvcalloc automatically falls back to vmalloc for large or fragmented
allocations, improving reliability. In our case, this memory is not
aimed for DMA operations and could be safely allocated by kvcalloc.
Using virtually contiguous memory helps to avoid allocation failures
and out-of-memory conditions common with kcalloc on large pools.
Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Shaurya Rane [Thu, 4 Dec 2025 18:12:59 +0000 (23:42 +0530)]
block: fix memory leak in __blkdev_issue_zero_pages
Move the fatal signal check before bio_alloc() to prevent a memory
leak when BLKDEV_ZERO_KILLABLE is set and a fatal signal is pending.
Previously, the bio was allocated before checking for a fatal signal.
If a signal was pending, the code would break out of the loop without
freeing or chaining the just-allocated bio, causing a memory leak.
This matches the pattern already used in __blkdev_issue_write_zeroes()
where the signal check precedes the allocation.
Fixes: bf86bcdb4012 ("blk-lib: check for kill signal in ioctl BLKZEROOUT") Reported-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=527a7e48a3d3d315d862 Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
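The ordering fix can be sketched in userspace with a linked list standing in for the bio chain. Everything here is hypothetical (struct node, the stop callback modelling a pending fatal signal, and the two demo predicates); malloc() replaces bio_alloc(). The only point is that the check precedes the allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node standing in for a chained bio. */
struct node {
	struct node *next;
};

/* Demo predicates modelling "a fatal signal is pending". */
static bool stop_at_three(size_t idx) { return idx >= 3; }
static bool never_stop(size_t idx) { (void)idx; return false; }

/*
 * Build a chain of up to `count` nodes. `stop` is consulted BEFORE each
 * allocation, so breaking out of the loop never strands a node that was
 * allocated but not yet linked into the chain.
 */
static struct node *build_chain(size_t count, bool (*stop)(size_t idx))
{
	struct node *head = NULL;

	assert(stop != NULL);
	for (size_t i = 0; i < count; i++) {
		struct node *n;

		if (stop(i))             /* check first ... */
			break;
		n = malloc(sizeof(*n));  /* ... allocate second */
		if (!n)
			break;
		n->next = head;
		head = n;
	}
	return head;
}

/* Free every node in the chain and return how many there were. */
static size_t free_chain(struct node *head)
{
	size_t freed = 0;

	while (head) {
		struct node *next = head->next;

		free(head);
		head = next;
		freed++;
	}
	return freed;
}
```

With the check placed after the allocation instead, the node allocated in the interrupted iteration would be reachable from nowhere, which is the leak the text describes.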
Damien Le Moal [Thu, 4 Dec 2025 10:59:52 +0000 (19:59 +0900)]
block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs
Commit fe0418eb9bd6 ("block: Prevent potential deadlocks in zone write
plug error recovery") added a WARN check in disk_put_zone_wplug() to
verify that when the last reference to a zone write plug is dropped,
this zone write plug does not have the BLK_ZONE_WPLUG_PLUGGED flag set,
that is, that it is not plugged.
However, the function disk_zone_wplug_abort(), which is called for zone
reset and zone finish operations, does not clear this flag after
emptying a zone write plug BIO list. This can result in the
disk_put_zone_wplug() warning triggering if the user (erroneously, as
that is bad practice) issues zone reset or zone finish operations while
the target zone still has plugged BIOs.
Modify disk_zone_wplug_abort() to clear the BLK_ZONE_WPLUG_PLUGGED flag.
And while at it, also add a lockdep annotation to ensure that this
function is called with the zone write plug spinlock held.
Fixes: fe0418eb9bd6 ("block: Prevent potential deadlocks in zone write plug error recovery") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cong Zhang [Wed, 3 Dec 2025 03:34:21 +0000 (11:34 +0800)]
blk-mq: Abort suspend when wakeup events are pending
During system suspend, wakeup capable IRQs for block device can be
delayed, which can cause blk_mq_hctx_notify_offline() to hang
indefinitely while waiting for pending requests to complete.
Skip the request waiting loop and abort suspend when wakeup events are
pending to prevent the deadlock.
Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new helper function blk_rq_nr_bvec() that returns the number of
bvecs in a request. This count represents the number of iterations
rq_for_each_bvec() would perform on a request.
Drivers need to pre-allocate bvec arrays before iterating through
a request's bvecs. Currently, they manually count bvecs using
rq_for_each_bvec() in a loop, which is repetitive. The new helper
centralizes this logic.
This pattern exists in loop and zloop drivers, where multi-bio requests
require copying bvecs into a contiguous array before creating
an iov_iter for file operations.
Update loop and zloop drivers to use the new helper, eliminating
duplicate code.
This patch also provides a clear API to avoid any potential misuse of
blk_rq_nr_phys_segments() for calculating the bvecs, since one bvec can
have more than one segment and use of blk_rq_nr_phys_segments() can
lead to extra memory allocation:
[ 6155.673749] nullb_bio: 128K bio as ONE bvec: sector=0, size=131072
[ 6155.673846] null_blk: #### null_handle_data_transfer:1375
[ 6155.673850] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=2
[ 6155.674263] null_blk: #### null_handle_data_transfer:1375
[ 6155.674267] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=1
Stefan Hajnoczi [Mon, 1 Dec 2025 21:43:29 +0000 (16:43 -0500)]
block: add IOC_PR_READ_RESERVATION ioctl
Add a Persistent Reservations ioctl to read the current reservation.
This calls the pr_ops->read_reservation() function that was previously
added in commit c787f1baa503 ("block: Add PR callouts for read keys and
reservation") but was only used by the in-kernel SCSI target so far.
The IOC_PR_READ_RESERVATION ioctl is necessary so that userspace
applications that rely on Persistent Reservations ioctls have a way of
inspecting the current state. Cluster managers and validation tests need
this functionality.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stefan Hajnoczi [Mon, 1 Dec 2025 21:43:28 +0000 (16:43 -0500)]
block: add IOC_PR_READ_KEYS ioctl
Add a Persistent Reservations ioctl to read the list of currently
registered reservation keys. This calls the pr_ops->read_keys() function
that was previously added in commit c787f1baa503 ("block: Add PR
callouts for read keys and reservation") but was only used by the
in-kernel SCSI target so far.
The IOC_PR_READ_KEYS ioctl is necessary so that userspace applications
that rely on Persistent Reservations ioctls have a way of inspecting the
current state. Cluster managers and validation tests need this
functionality.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The pr_read_keys() interface has a u32 num_keys parameter. The NVMe
Reservation Report command has a u32 maximum length. Reject num_keys
values that are too large to fit.
This will become important when pr_read_keys() is exposed to untrusted
userspace via a <linux/pr.h> ioctl.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The pr_read_keys() interface has a u32 num_keys parameter. The SCSI
PERSISTENT RESERVE IN command has a maximum READ KEYS service action
size of 65536 bytes. Reject num_keys values that are too large to fit
into the SCSI command.
This will become important when pr_read_keys() is exposed to untrusted
userspace via a <linux/pr.h> ioctl.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
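The size limit above translates into a cap on num_keys. The sketch below assumes the READ KEYS response is an 8-byte header followed by one 8-byte key per entry; those two sizes, the macro names, and read_keys_data_len() are assumptions for illustration, and the actual driver check may be written differently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_READ_KEYS_MAX_LEN 65536u /* limit quoted in the text */
#define SKETCH_HDR_LEN 8u               /* assumed header size */
#define SKETCH_KEY_LEN 8u               /* assumed size of one key */

/*
 * Compute the response buffer length for num_keys keys, or return
 * -EINVAL when it would not fit within the READ KEYS limit.
 */
static int read_keys_data_len(uint32_t num_keys, uint32_t *len)
{
	assert(len != NULL);
	if (num_keys > (SKETCH_READ_KEYS_MAX_LEN - SKETCH_HDR_LEN) / SKETCH_KEY_LEN)
		return -EINVAL;
	/* Cannot overflow: num_keys was bounded just above. */
	*len = SKETCH_HDR_LEN + num_keys * SKETCH_KEY_LEN;
	return 0;
}
```

Under these assumptions the largest accepted count is (65536 - 8) / 8 = 8191 keys.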
Fengnan Chang [Fri, 14 Nov 2025 09:21:49 +0000 (17:21 +0800)]
block: enable per-cpu bio cache by default
Since commit 12e4e8c7ab59 ("io_uring/rw: enable bio caches for
IRQ rw"), bio_put is safe for task and irq context, and bio_alloc_bioset
is safe for task context and is never called in irq context, so we can
enable the per-cpu bio cache by default.
Benchmarked with t/io_uring and ext4+nvme:
taskset -c 6 /root/fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1
-X1 -n1 -P1 /mnt/testfile
base IOPS is 562K, patch IOPS is 574K. The CPU usage of bio_alloc_bioset
decreases from 1.42% to 1.22%.
The worst case is allocating a bio on CPU A but freeing it on CPU B;
again using t/io_uring and ext4+nvme:
base IOPS is 648K, patch IOPS is 647K.
Also tested ext4/xfs with fio using libaio/sync/io_uring on null_blk and
nvme: no obvious performance regression.
Linus Torvalds [Thu, 4 Dec 2025 04:45:43 +0000 (20:45 -0800)]
Merge tag 'ntfs3_for_6.19' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"New code:
- support timestamps prior to epoch
- do not overwrite uptodate pages
- disable readahead for compressed files
- setting of dummy blocksize to read boot_block when mounting
- the run_lock initialization when loading $Extend
- initialization of allocated memory before use
- support for the NTFS3_IOC_SHUTDOWN ioctl
- check for minimum alignment when performing direct I/O reads
- check for shutdown in fsync
Fixes:
- mount failure for sparse runs in run_unpack()
- use-after-free of sbi->options in cmp_fnames
- KMSAN uninit bug after failed mi_read in mi_format_new
- uninit error after buffer allocation by __getname()
- KMSAN uninit-value in ni_create_attr_list
- double free of sbi->options->nls and ownership of fc->fs_private
- incorrect vcn adjustments in attr_collapse_range()
- mode update when ACL can be reduced to mode
- memory leaks in add sub record
Changes:
- refactor code, updated terminology, spelling
- do not kmap pages in (de)compression code
- after ntfs_look_free_mft(), code that fails must put mft_inode
- default mount options for "acl" and "prealloc"
Replaced:
- use unsafe_memcpy() to avoid memcpy size warning
- ntfs_bio_pages with page cache for compressed files"
* tag 'ntfs3_for_6.19' of https://github.com/Paragon-Software-Group/linux-ntfs3: (26 commits)
fs/ntfs3: check for shutdown in fsync
fs/ntfs3: change the default mount options for "acl" and "prealloc"
fs/ntfs3: Prevent memory leaks in add sub record
fs/ntfs3: out1 also needs to put mi
fs/ntfs3: Fix spelling mistake "recommened" -> "recommended"
fs/ntfs3: update mode in xattr when ACL can be reduced to mode
fs/ntfs3: check minimum alignment for direct I/O
fs/ntfs3: implement NTFS3_IOC_SHUTDOWN ioctl
fs/ntfs3: correct attr_collapse_range when file is too fragmented
ntfs3: fix double free of sbi->options->nls and clarify ownership of fc->fs_private
fs/ntfs3: Initialize allocated memory before use
fs/ntfs3: remove ntfs_bio_pages and use page cache for compressed I/O
ntfs3: avoid memcpy size warning
fs/ntfs3: fix KMSAN uninit-value in ni_create_attr_list
ntfs3: init run lock for extend inode
ntfs: set dummy blocksize to read boot_block when mounting
fs/ntfs3: disable readahead for compressed files
ntfs3: Fix uninit buffer allocated by __getname()
ntfs3: fix uninit memory after failed mi_read in mi_format_new
ntfs3: fix use-after-free of sbi->options in cmp_fnames
...
Linus Torvalds [Thu, 4 Dec 2025 04:37:15 +0000 (20:37 -0800)]
Merge tag 'ext4_for_linus-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New features and improvements for the ext4 file system:
- Optimize online defragmentation by using folios instead of
individual buffer heads
- Improve error codes stored in the superblock when the journal
aborts
- Minor cleanups and clarifications in ext4_map_blocks()
- Add documentation of the casefold and encrypt flags
- Add support for file systems with a blocksize greater than the
pagesize
- Improve performance by caching the fact that an inode
does not have a POSIX ACL
Various Bug Fixes:
- Fix false positive complaints from smatch
- Fix error code which is returned by ext4fs_dirhash() when Siphash
is used without the encryption key
- Fix races when writing to inline data files which could trigger a
BUG
- Fix potential NULL dereference when there is a corrupt file system
with an extended attribute value stored in an inode
- Fix false positive lockdep report when syzbot uses ext4 and ocfs2
together
- Fix false positive reported by DEPT by adjusting lock annotation
- Avoid a potential BUG_ON in jbd2 when a file system is massively
corrupted
- Fix a WARN_ON when superblock is corrupted with a non-NUL
terminated mount options field
- Add check if the userspace passes in a non-NUL terminated mount
options field to EXT4_IOC_SET_TUNE_SB_PARAM
- Fix a potential journal checksum failure when a file system is
copied while it is mounted read-only
- Fix a potential orphan file tracking error which only
showed on 32-bit systems
- Fix assertion checks in mballoc (which have to be explicitly enabled
by manually enabling AGGRESSIVE_CHECKS and recompiling)
- Avoid complaining about overly large orphan files created by mke2fs
with file systems with a 64k block size"
* tag 'ext4_for_linus-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
ext4: mark inodes without acls in __ext4_iget()
ext4: enable block size larger than page size
ext4: add checks for large folio incompatibilities when BS > PS
ext4: support verifying data from large folios with fs-verity
ext4: make data=journal support large block size
ext4: support large block size in __ext4_block_zero_page_range()
ext4: support large block size in mpage_prepare_extent_to_map()
ext4: support large block size in mpage_map_and_submit_buffers()
ext4: support large block size in ext4_block_write_begin()
ext4: support large block size in ext4_mpage_readpages()
ext4: rename 'page' references to 'folio' in multi-block allocator
ext4: prepare buddy cache inode for BS > PS with large folios
ext4: support large block size in ext4_mb_init_cache()
ext4: support large block size in ext4_mb_get_buddy_page_lock()
ext4: support large block size in ext4_mb_load_buddy_gfp()
ext4: add EXT4_LBLK_TO_PG and EXT4_PG_TO_LBLK for block/page conversion
ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion
ext4: support large block size in ext4_readdir()
ext4: support large block size in ext4_calculate_overhead()
ext4: introduce s_min_folio_order for future BS > PS support
...
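The block/page conversions that the "BS > PS" patches above rely on can be sketched in plain C. This is an illustrative sketch only, not ext4's code: struct fs_geom and the helper names lblk_to_bytes(), lblk_to_pgidx() and pgidx_to_lblk() are hypothetical stand-ins, and the real EXT4_LBLK_TO_B, EXT4_LBLK_TO_PG and EXT4_PG_TO_LBLK macros may be defined differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry description; not an ext4 structure. */
struct fs_geom {
	unsigned int blkbits;  /* log2(block size), e.g. 16 for 64 KiB */
	unsigned int pagebits; /* log2(page size), e.g. 12 for 4 KiB */
};

/* Logical block number -> byte offset within the file. */
static uint64_t lblk_to_bytes(const struct fs_geom *g, uint64_t lblk)
{
	assert(g->blkbits < 64);
	return lblk << g->blkbits;
}

/* Logical block number -> index of the first page that covers it. */
static uint64_t lblk_to_pgidx(const struct fs_geom *g, uint64_t lblk)
{
	if (g->blkbits >= g->pagebits)	/* block spans one or more pages */
		return lblk << (g->blkbits - g->pagebits);
	return lblk >> (g->pagebits - g->blkbits);
}

/* Page index -> logical block that contains the start of that page. */
static uint64_t pgidx_to_lblk(const struct fs_geom *g, uint64_t pgidx)
{
	if (g->blkbits >= g->pagebits)
		return pgidx >> (g->blkbits - g->pagebits);
	return pgidx << (g->pagebits - g->blkbits);
}
```

With 64 KiB blocks on 4 KiB pages, each block covers 16 pages, so a fixed "blocks per page" shift no longer works and the direction of the shift has to depend on which of the two sizes is larger.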
Linus Torvalds [Thu, 4 Dec 2025 04:28:50 +0000 (20:28 -0800)]
Merge tag 'gfs2-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Major withdraw / error handling overhaul based on dlm's new
DLM_RELEASE_RECOVER feature: this allows gfs to treat withdraws like
node failures. Make withdraws asynchronous
- Fix a bug in commit e4a8b5481c59a that caused 'df' to remain out of
sync. ('df' is still allowed to go slightly out of sync for short
periods of time)
- Prevent recursive memory reclaim in gfs2_unstuff_dinode()
- Clean up SDF_JOURNAL_LIVE flag handling
- Fix remote evict for read-only filesystems
- Fix a misuse of bio_chain()
- Various other minor cleanups
* tag 'gfs2-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (35 commits)
gfs2: Fix use of bio_chain
gfs2: Clean up SDF_JOURNAL_LIVE flag handling
gfs2: No longer thaw filesystems during a withdraw
gfs2: Withdraw immediately in gfs2_trans_add_meta
gfs2: New gfs2_withdraw_helper
gfs2: Clean up properly during a withdraw
gfs2: Rename gfs2_{gl_dq_holders => withdraw_glocks}
Revert "gfs2: fix infinite loop when checking ail item count before go_inval"
Revert "gfs2: Allow some glocks to be used during withdraw"
Revert "gfs2: Check for log write errors before telling dlm to unlock"
Revert "gfs2: fix a deadlock on withdraw-during-mount"
Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (6/6)
Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (5/6)
Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (4/6)
Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (3/6)
Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (2/6)
Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (1/6)
Revert "gfs2: don't stop reads while withdraw in progress"
gfs2: Rename LM_FLAG_{NOEXP -> RECOVER}
gfs2: Kill gfs2_io_error_bh_wd
...
Linus Torvalds [Thu, 4 Dec 2025 04:23:41 +0000 (20:23 -0800)]
Merge tag 'v6.19-rc-smb-fixes' of git://git.samba.org/ksmbd
Pull smb client and server updates from Steve French:
- server fixes:
- IPC use after free locking fix
- fix locking bug in delete paths
- fix use after free in disconnect
- fix underflow in locking check
- error mapping improvement
- socket listening improvement
- return code mapping fixes
- crypto improvements (use default libraries)
- cleanup patches:
- netfs
- client checkpatch cleanup
- server cleanup
- move server/client duplicate code to common code
- fix some defines to better match protocol specification
- smbdirect (RDMA) fixes
- client debugging improvements for leases
* tag 'v6.19-rc-smb-fixes' of git://git.samba.org/ksmbd: (44 commits)
cifs: Use netfs_alloc/free_folioq_buffer()
smb: client: show smb lease key in open_dirs output
smb: client: show smb lease key in open_files output
ksmbd: ipc: fix use-after-free in ipc_msg_send_request
smb: client: relax WARN_ON_ONCE(SMBDIRECT_SOCKET_*) checks in recv_done() and smbd_conn_upcall()
smb: server: relax WARN_ON_ONCE(SMBDIRECT_SOCKET_*) checks in recv_done() and smb_direct_cm_handler()
smb: smbdirect: introduce SMBDIRECT_CHECK_STATUS_{WARN,DISCONNECT}()
smb: smbdirect: introduce SMBDIRECT_DEBUG_ERR_PTR() helper
ksmbd: vfs: fix race on m_flags in vfs_cache
ksmbd: Replace strcpy + strcat to improve convert_to_nt_pathname
smb: move FILE_SYSTEM_ATTRIBUTE_INFO to common/fscc.h
ksmbd: implement error handling for STATUS_INFO_LENGTH_MISMATCH in smb server
ksmbd: fix use-after-free in ksmbd_tree_connect_put under concurrency
ksmbd: server: avoid busy polling in accept loop
smb: move create_durable_reconn to common/smb2pdu.h
smb: fix some warnings reported by scripts/checkpatch.pl
smb: do some cleanups
smb: move FILE_SYSTEM_SIZE_INFO to common/fscc.h
smb: move some duplicate struct definitions to common/fscc.h
smb: move list of FileSystemAttributes to common/fscc.h
...
Linus Torvalds [Thu, 4 Dec 2025 04:19:38 +0000 (20:19 -0800)]
Merge tag 'xfs-merge-6.19' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Carlos Maiolino:
"There are no major changes in xfs. This contains mostly some code
cleanups, a few bug fixes and a documentation update. Highlights are:
- Quota locking cleanup
- Getting rid of old xlog_in_core_2_t type"
* tag 'xfs-merge-6.19' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (33 commits)
docs: remove obsolete links in the xfs online repair documentation
xfs: move some code out of xfs_iget_recycle
xfs: use zi more in xfs_zone_gc_mount
xfs: remove the unused bv field in struct xfs_gc_bio
xfs: remove xarray mark for reclaimable zones
xfs: remove the xlog_in_core_t typedef
xfs: remove l_iclog_heads
xfs: remove the xlog_rec_header_t typedef
xfs: remove xlog_in_core_2_t
xfs: remove a very outdated comment from xlog_alloc_log
xfs: cleanup xlog_alloc_log a bit
xfs: don't use xlog_in_core_2_t in struct xlog_in_core
xfs: add a on-disk log header cycle array accessor
xfs: add a XLOG_CYCLE_DATA_SIZE constant
xfs: reduce ilock roundtrips in xfs_qm_vop_dqalloc
xfs: move xfs_dquot_tree calls into xfs_qm_dqget_cache_{lookup,insert}
xfs: move quota locking into xrep_quota_item
xfs: move quota locking into xqcheck_commit_dquot
xfs: move q_qlock locking into xqcheck_compare_dquot
xfs: move q_qlock locking into xchk_quota_item
...
Linus Torvalds [Thu, 4 Dec 2025 04:14:44 +0000 (20:14 -0800)]
Merge tag 'erofs-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
- Fix a WARNING caused by a recent FSDAX misdetection regression
- Fix the filesystem stacking limit for file-backed mounts
- Print more informative diagnostics on decompression errors
- Switch the on-disk definition `erofs_fs.h` to the MIT license
- Minor cleanups
* tag 'erofs-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: switch on-disk header `erofs_fs.h` to MIT license
erofs: get rid of raw bi_end_io() usage
erofs: enable error reporting for z_erofs_fixup_insize()
erofs: enable error reporting for z_erofs_stream_switch_bufs()
erofs: improve Zstd, LZMA and DEFLATE error strings
erofs: improve decompression error reporting
erofs: tidy up z_erofs_lz4_handle_overlap()
erofs: limit the level of fs stacking for file-backed mounts
erofs: correct FSDAX detection
Linus Torvalds [Thu, 4 Dec 2025 04:08:32 +0000 (20:08 -0800)]
Merge tag 'hfs-v6.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs
Pull hfs/hfsplus updates from Viacheslav Dubeyko:
"Several fixes for syzbot reported issues, HFS/HFS+ fixes of xfstests
failures, Kunit-based unit-tests introduction, and code cleanup:
- Dan Carpenter fixed a potential use-after-free issue in the
hfs_correct_next_unused_CNID() method. Tetsuo Handa fixed a
syzbot-reported issue related to incorrect inode->i_mode
management when the volume has been corrupted. Yang Chenzhi
fixed a potential race condition in the __hfs_bnode_create()
method for the HFS+ file system.
- Several fixes for xfstests failures. In particular, the generic/070,
generic/073, and generic/101 test cases now pass on the HFS+ file
system.
- HFS and HFS+ drivers share multiple structures of on-disk layout
declarations. Some structures are used without any change. However,
we had two independent declarations of the same structures in HFS
and HFS+ drivers.
The on-disk layout declarations have been moved into
include/linux/hfs_common.h to eliminate the duplicated
declarations and to keep the HFS/HFS+ on-disk layout
declarations in one place.
Also, this patch prepares the basis for creating an hfslib that can
aggregate common functionality without the need to duplicate the
same code in HFS and HFS+ drivers.
- HFS/HFS+ really need unit tests because of multiple xfstests
failures. The first two patches introduce KUnit-based unit tests
for string operations in the HFS/HFS+ file system drivers"
* tag 'hfs-v6.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs:
hfs/hfsplus: move on-disk layout declarations into hfs_common.h
hfsplus: fix volume corruption issue for generic/101
hfsplus: introduce KUnit tests for HFS+ string operations
hfs: introduce KUnit tests for HFS string operations
hfsplus: fix volume corruption issue for generic/073
hfsplus: Verify inode mode when loading from disk
hfsplus: fix volume corruption issue for generic/070
hfs/hfsplus: prevent getting negative values of offset/length
hfsplus: fix missing hfs_bnode_get() in __hfs_bnode_create
hfs: fix potential use after free in hfs_correct_next_unused_CNID()
Linus Torvalds [Thu, 4 Dec 2025 04:03:46 +0000 (20:03 -0800)]
Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features:
- shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
- set filesystem state as being shut down (also named going down
in other filesystems), where all active operations return EIO
and this cannot be changed until unmount
- pending operations are attempted to be finished but error
messages may still show up depending on where exactly the
shutdown happened
- scrub (and device replace) vs suspend/hibernate:
- a running scrub will prevent suspend, which can be annoying as
suspend is an immediate request and scrub is not critical
- filesystem freezing before suspend was not sufficient as the
problem was in process freezing
- behaviour change: on suspend scrub and device replace are
cancelled, where scrub can record the last state and continue
from there; the device replace has to be restarted from the
beginning
- zone stats exported in sysfs, from the perspective of the
filesystem this includes active, reclaimable, relocation etc zones
Performance:
- improvements when processing space reservation tickets by
optimizing locking and shrinking critical sections, cumulative
improvements in lockstat numbers show +15%
Notable fixes:
- use vmalloc fallback when allocating bios as high order allocations
can happen with wide checksums (like sha256)
- scrub will always track the last position of progress so it's not
starting from zero after an error
Core:
- under experimental config, checksum calculations are offloaded to
process context, which simplifies locking and allows removing the
compression write worker kthread(s):
- speed improvement in direct IO throughput with buffered IO
fallback is +15% when not offloaded but this is more related to
internal crypto subsystem improvements
- this will probably become the default in the future, removing the
sysfs tunable
- (experimental) block size > page size updates:
- support more operations when not using large folios (encoded
read/write and send)
- raid56
- more preparations for fscrypt support
Other:
- more conversions to auto-cleaned variables
- parameter cleanups and removals
- extended warning fixes
- improved printing of structured values like keys
- lots of other cleanups and refactoring"
* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: remove unnecessary inode key in btrfs_log_all_parents()
btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
btrfs: send: do not allocate memory for xattr data when checking it exists
btrfs: send: add unlikely to all unexpected overflow checks
btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
btrfs: remove root argument from btrfs_del_dir_entries_in_log()
btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
btrfs: don't search back for dir inode item in INO_LOOKUP_USER
btrfs: don't rewrite ret from inode_permission
btrfs: add orig_logical to btrfs_bio for encryption
btrfs: disable verity on encrypted inodes
btrfs: disable various operations on encrypted inodes
btrfs: remove redundant level reset in btrfs_del_items()
btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
btrfs: optimize balance_level() path reference handling
btrfs: factor out root promotion logic into promote_child_to_root()
btrfs: raid56: remove the "_step" infix
btrfs: raid56: enable bs > ps support
btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
...
Linus Torvalds [Thu, 4 Dec 2025 03:26:18 +0000 (19:26 -0800)]
Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Fix head insertion for mq-deadline, a regression from when priority
support was added
- Series simplifying and improving the ublk user copy code
- Various ublk related cleanups
- Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
request is punted to a thread for handling
- Merge and then later revert loop dio nowait support, as it ended up
causing excessive stack usage for when the inline issue code needs to
dip back into the full file system code
- Improve auto integrity code, making it less deadlock prone
- Speed up polled IO handling by manually managing the hctx lookups
- Fixes for blk-throttle for SSD devices
- Small series with fixes for the S390 dasd driver
- Add support for caching zones, avoiding unnecessary report zone
queries
- MD pull requests via Yu:
- fix null-ptr-dereference regression for dm-raid0
- fix IO hang for raid5 when array is broken with IO inflight
- remove legacy 1s delay to speed up system shutdown
- change maintainer's email address
- data can be lost if array is created with different lbs devices,
fix this problem and record lbs of the array in metadata
- fix rcu protection for md_thread
- fix mddev kobject lifetime regression
- enable atomic writes for md-linear
- some cleanups
- bcache updates via Coly
- remove useless discard and cache device code
- improve usage of per-cpu workqueues
- Reorganize the IO scheduler switching code, fixing some lockdep
reports as well
- Improve the block layer P2P DMA support
- Add support to the block tracing code for zoned devices
- Segment calculation improvements, and memory alignment flexibility
improvements
- Set of prep and cleanup patches for ublk batching support. The
actual batching hasn't been added yet, but this helps shrink down the
workload of getting that patchset ready for 6.20
- Fix for how the ps3 block driver handles segment offsets
- Improve how block plugging handles batch tag allocations
- nbd fixes for use-after-free of the configuration on device clear/put
- Set of improvements and fixes for zloop
- Add Damien as maintainer of the block zoned device code handling
- Various other fixes and cleanups
* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
block/rnbd: correct all kernel-doc complaints
blk-mq: use queue_hctx in blk_mq_map_queue_type
md: remove legacy 1s delay in md_notify_reboot
md/raid5: fix IO hang when array is broken with IO inflight
md: warn about updating super block failure
md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
sbitmap: fix all kernel-doc warnings
ublk: add helper of __ublk_fetch()
ublk: pass const pointer to ublk_queue_is_zoned()
ublk: refactor auto buffer register in ublk_dispatch_req()
ublk: add `union ublk_io_buf` with improved naming
ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
kfifo: add kfifo_alloc_node() helper for NUMA awareness
blk-mq: fix potential uaf for 'queue_hw_ctx'
blk-mq: use array manage hctx map instead of xarray
ublk: prevent invalid access with DEBUG
s390/dasd: Use scnprintf() instead of sprintf()
s390/dasd: Move device name formatting into separate function
s390/dasd: Remove unnecessary debugfs_create() return checks
s390/dasd: Fix gendisk parent after copy pair swap
...
Linus Torvalds [Thu, 4 Dec 2025 02:58:57 +0000 (18:58 -0800)]
Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Unify how task_work cancelations are detected, placing it in the
task_work running state rather than needing to check the task state
- Series cleaning up and moving the cancelation code to where it
belongs, in cancel.c
- Cleanup of waitid and futex argument handling
- Add support for mixed sized SQEs. 6.18 added support for mixed sized
CQEs, improving flexibility and efficiency of workloads that need big
CQEs. This adds similar support for SQEs, where the occasional need
for a 128b SQE doesn't necessitate having all SQEs be 128b in size
- Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
features are available. And both return the ring size information to
help with allocation size calculation for user provided rings like
IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER
- Zcrx updates for 6.19. It includes a bunch of small patches,
IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing
zcrx between multiple io_uring instances
- Series cleaning up ring initializations, notably deduplicating ring
size and offset calculations. It also moves most of the checking
before doing any allocations, making the code simpler
- Add support for getsockname and getpeername, which is mostly a
trivial hookup after a bit of refactoring on the networking side
- Various fixes and cleanups
* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring: Introduce getsockname io_uring cmd
socket: Split out a getsockname helper for io_uring
socket: Unify getsockname and getpeername implementation
io_uring/query: drop unused io_handle_query_entry() ctx arg
io_uring/kbuf: remove obsolete buf_nr_pages and update comments
io_uring/register: use correct location for io_rings_layout
io_uring/zcrx: share an ifq between rings
io_uring/zcrx: add io_fill_zcrx_offsets()
io_uring/zcrx: export zcrx via a file
io_uring/zcrx: move io_zcrx_scrub() and dependencies up
io_uring/zcrx: count zcrx users
io_uring/zcrx: add sync refill queue flushing
io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
io_uring/zcrx: elide passing msg flags
io_uring/zcrx: use folio_nr_pages() instead of shift operation
io_uring/zcrx: convert to use netmem_desc
io_uring/query: introduce rings info query
io_uring/query: introduce zcrx query
io_uring: move cq/sq user offset init around
io_uring: pre-calculate scq layout
...
Linus Torvalds [Thu, 4 Dec 2025 01:24:33 +0000 (17:24 -0800)]
Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Replace busylock at the Tx queuing layer with a lockless list,
resulting in a 300% (4x) improvement on heavy TX workloads, sending
twice the number of packets per second, for half the cpu cycles.
- Allow constantly busy flows to migrate to a more suitable CPU/NIC
queue.
Normally we perform queue re-selection when flow comes out of idle,
but under extreme circumstances the flows may be constantly busy.
Add sysctl to allow periodic rehashing even if it'd risk packet
reordering.
- Optimize the NAPI skb cache, make it larger, use it in more paths.
- Attempt returning Tx skbs to the originating CPU (like we already
did for Rx skbs).
- Various data structure layout and prefetch optimizations from Eric.
- Remove ktime_get() from the recvmsg() fast path, ktime_get() is
sadly quite expensive on recent AMD machines.
- Extend threaded NAPI polling to allow the kthread busy poll for
packets.
- Make MPTCP use Rx backlog processing. This lowers the lock
pressure, improving the Rx performance.
- Support memcg accounting of MPTCP socket memory.
- Allow admin to opt sockets out of global protocol memory accounting
(using a sysctl or BPF-based policy). The global limits are a poor
fit for modern container workloads, where limits are imposed using
cgroups.
- Improve heuristics for when to kick off AF_UNIX garbage collection.
- Allow users to control TCP SACK compression, and default to 33% of
RTT.
- Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
unnecessarily aggressive rcvbuf growth and overshoot when the
connection RTT is low.
- Preserve skb metadata space across skb_push / skb_pull operations.
- Support for IPIP encapsulation in the nftables flowtable offload.
- Support appending IP interface information to ICMP messages (RFC
5837).
- Support setting max record size in TLS (RFC 8449).
- Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
- Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
- Let users configure the number of write buffers in SMC.
- Add new struct sockaddr_unsized for sockaddr of unknown length,
from Kees.
- Some conversions away from the crypto_ahash API, from Eric Biggers.
- Some preparations for slimming down struct page.
- YAML Netlink protocol spec for WireGuard.
- Add a tool on top of YAML Netlink specs/lib for reporting commonly
computed derived statistics and summarized system state.
Driver API:
- Add CAN XL support to the CAN Netlink interface.
- Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
defined by the OPEN Alliance's "Advanced diagnostic features for
100BASE-T1 automotive Ethernet PHYs" specification.
- Add DPLL phase-adjust-gran pin attribute (and implement it in
zl3073x).
- Refactor xfrm_input lock to reduce contention when NIC offloads
IPsec and performs RSS.
- Add info to devlink params whether the current setting is the
default or a user override. Allow resetting back to default.
- Add standard device stats for PSP crypto offload.
- Leverage DSA frame broadcast to implement simple HSR frame
duplication for a lot of switches without dedicated HSR offload.
- Convert drivers to support dedicated ops for timestamping control,
and away from the direct IOCTL handling. While at it support GET
operations for PHY timestamping.
- Add (and convert most drivers to) a dedicated ethtool callback for
reading the Rx ring count.
- Significant refactoring efforts in the STMMAC driver, which
supports Synopsys turn-key MAC IP integrated into a ton of SoCs.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support PPS in/out on all pins
- Intel (100G, ice, idpf):
- ice: implement standard ethtool and timestamping stats
- i40e: support setting the max number of MAC addresses per VF
- iavf: support RSS of GTP tunnels for 5G and LTE deployments
- nVidia/Mellanox (mlx5):
- reduce downtime on interface reconfiguration
- disable being an XDP redirect target by default (same as
other drivers) to avoid wasting resources if feature is
unused
- Meta (fbnic):
- add support for Linux-managed PCS on 25G, 50G, and 100G links
- Wangxun:
- support Rx descriptor merge, and Tx head writeback
- support Rx coalescing offload
- support 25G SFP and 40G QSFP modules
- Ethernet virtual:
- Google (gve):
- allow ethtool to configure rx_buf_len
- implement XDP HW RX Timestamping support for DQ descriptor
format
- Microsoft vNIC (mana):
- support HW link state events
- handle hardware recovery events when probing the device
- Ethernet NICs consumer, and embedded:
- usbnet: add support for Byte Queue Limits (BQL)
- AMD (amd-xgbe):
- add device selftests
- NXP (enetc):
- add i.MX94 support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- bcmasp: add support for PHY-based Wake-on-LAN
- Broadcom switches (b53):
- support port isolation
- support BCM5389/97/98 and BCM63XX ARL formats
- Lantiq/MaxLinear switches:
- support bridge FDB entries on the CPU port
- use regmap for register access
- allow user to enable/disable learning
- support Energy Efficient Ethernet
- support configuring RMII clock delays
- add tagging driver for MaxLinear GSW1xx switches
- Synopsys (stmmac):
- support using the HW clock in free running mode
- add Eswin EIC7700 support
- add Rockchip RK3506 support
- add Altera Agilex5 support
- Cadence (macb):
- cleanup and consolidate descriptor and DMA address handling
- add EyeQ5 support
- TI:
- icssg-prueth: support AF_XDP
- Airoha access points:
- add missing Ethernet stats and link state callback
- add AN7583 support
- support out-of-order Tx completion processing
- Power over Ethernet:
- pd692x0: preserve PSE configuration across reboots
- add support for TPS23881B devices
- Ethernet PHYs:
- Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
- Support 50G SerDes and 100G interfaces in Linux-managed PHYs
- micrel:
- support for non PTP SKUs of lan8814
- enable in-band auto-negotiation on lan8814
- realtek:
- cable testing support on RTL8224
- interrupt support on RTL8221B
- motorcomm: support for PHY LEDs on YT853
- microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
- mscc: support for PHY LED control
- CAN drivers:
- m_can: add support for optional reset and system wake up
- remove can_change_mtu() obsoleted by core handling
- mcp251xfd: support GPIO controller functionality
- Bluetooth:
- add initial support for PASTa
- WiFi:
- split ieee80211.h file, it's way too big
- improvements in VHT radiotap reporting, S1G, Channel Switch
Announcement handling, rate tracking in mesh networks
- improve multi-radio monitor mode support, and add a cfg80211
debugfs interface for it
- HT action frame handling on 6 GHz
- initial chanctx work towards NAN
- MU-MIMO sniffer improvements
- WiFi drivers:
- RealTek (rtw89):
- support USB devices RTL8852AU and RTL8852CU
- initial work for RTL8922DE
- improved injection support
- Intel:
- iwlwifi: new sniffer API support
- MediaTek (mt76):
- WED support for >32-bit DMA
- airoha NPU support
- regdomain improvements
- continued WiFi7/MLO work
- Qualcomm/Atheros:
- ath10k: factory test support
- ath11k: TX power insertion support
- ath12k: BSS color change support
- ath12k: statistics improvements
- brcmfmac: Acer A1 840 tablet quirk
- rtl8xxxu: 40 MHz connection fixes/support"
* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
net: page_pool: sanitise allocation order
net: page pool: xa init with destroy on pp init
net/mlx5e: Support XDP target xmit with dummy program
net/mlx5e: Update XDP features in switch channels
selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
wireguard: netlink: generate netlink code
wireguard: uapi: generate header with ynl-gen
wireguard: uapi: move flag enums
wireguard: uapi: move enum wg_cmd
wireguard: netlink: add YNL specification
selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
selftests: drv-net: introduce Iperf3Runner for measurement use cases
selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
Documentation: net: dsa: mention simple HSR offload helpers
Documentation: net: dsa: mention availability of RedBox
...
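The first item in the pull message above replaces a lock with a lockless list. The general pattern (producers push with compare-and-swap, a single consumer detaches the whole batch at once) can be sketched in userspace C11 as follows. This is a sketch under stated assumptions, not the networking code from this merge: struct node, struct lockless_list and the list_*() helpers are hypothetical names, and the kernel uses its own primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical element type; a real queue would carry a packet. */
struct node {
	struct node *next;
	int value;
};

struct lockless_list {
	_Atomic(struct node *) head;
};

static void list_init(struct lockless_list *l)
{
	atomic_init(&l->head, (struct node *)NULL);
}

/* Producer side: push one node without taking a lock. */
static void list_push(struct lockless_list *l, struct node *n)
{
	struct node *old = atomic_load_explicit(&l->head, memory_order_relaxed);

	assert(n != NULL);
	do {
		n->next = old;	/* 'old' is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak_explicit(&l->head, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/* Consumer side: detach every queued node in one atomic exchange. */
static struct node *list_take_all(struct lockless_list *l)
{
	return atomic_exchange_explicit(&l->head, (struct node *)NULL,
					memory_order_acquire);
}

/* The detached batch is newest-first; reverse it to restore FIFO order. */
static struct node *list_reverse(struct node *n)
{
	struct node *prev = NULL;

	while (n) {
		struct node *next = n->next;

		n->next = prev;
		prev = n;
		n = next;
	}
	return prev;
}
```

Producers only contend on a single compare-and-swap, and the consumer pays one atomic exchange per batch instead of one lock acquisition per element.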
Linus Torvalds [Thu, 4 Dec 2025 00:54:54 +0000 (16:54 -0800)]
Merge tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Convert selftests/bpf/test_tc_edt and test_tc_tunnel from .sh to
test_progs runner (Alexis Lothoré)
- Convert selftests/bpf/test_xsk to test_progs runner (Bastien
Curutchet)
- Replace bpf memory allocator with kmalloc_nolock() in
bpf_local_storage (Amery Hung), and in bpf streams and range tree
(Puranjay Mohan)
- Introduce support for indirect jumps in BPF verifier and x86 JIT
(Anton Protopopov) and arm64 JIT (Puranjay Mohan)
- Remove runqslower bpf tool (Hoyeon Lee)
- Fix corner cases in the verifier to close several syzbot reports
(Eduard Zingerman, KaFai Wan)
- Several improvements in deadlock detection in rqspinlock (Kumar
Kartikeya Dwivedi)
- Implement "jmp" mode for BPF trampoline and corresponding
DYNAMIC_FTRACE_WITH_JMP. It improves "fexit" program type performance
from 80 M/s to 136 M/s. With Steven's Ack. (Menglong Dong)
- Add ability to test non-linear skbs in BPF_PROG_TEST_RUN (Paul
Chaignon)
- Do not let BPF_PROG_TEST_RUN emit invalid GSO types to stack (Daniel
Borkmann)
- Generalize buildid reader into bpf_dynptr (Mykyta Yatsenko)
- Optimize bpf_map_update_elem() for map-in-map types (Ritesh
Oedayrajsingh Varma)
- Introduce overwrite mode for BPF ring buffer (Xu Kuohai)
* tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (169 commits)
bpf: optimize bpf_map_update_elem() for map-in-map types
bpf: make kprobe_multi_link_prog_run always_inline
selftests/bpf: do not hardcode target rate in test_tc_edt BPF program
selftests/bpf: remove test_tc_edt.sh
selftests/bpf: integrate test_tc_edt into test_progs
selftests/bpf: rename test_tc_edt.bpf.c section to expose program type
selftests/bpf: Add success stats to rqspinlock stress test
rqspinlock: Precede non-head waiter queueing with AA check
rqspinlock: Disable spinning for trylock fallback
rqspinlock: Use trylock fallback when per-CPU rqnode is busy
rqspinlock: Perform AA checks immediately
rqspinlock: Enclose lock/unlock within lock entry acquisitions
bpf: Remove runqslower tool
selftests/bpf: Remove usage of lsm/file_alloc_security in selftest
bpf: Disable file_alloc_security hook
bpf: check for insn arrays in check_ptr_alignment
bpf: force BPF_F_RDONLY_PROG on insn array creation
bpf: Fix exclusive map memory leak
selftests/bpf: Make CS length configurable for rqspinlock stress test
selftests/bpf: Add lock wait time stats to rqspinlock stress test
...
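The "overwrite mode" mentioned for the BPF ring buffer can be illustrated with a generic fixed-size ring in which a full buffer drops its oldest entry instead of rejecting the new one. This is only a sketch of the general idea: struct ring, RING_CAP and the ring_*() helpers are hypothetical, and the actual BPF ring buffer layout, API and semantics are not shown in this log and may differ.

```c
#include <assert.h>
#include <stdint.h>

#define RING_CAP 4u	/* hypothetical capacity; must be a power of two */

struct ring {
	uint32_t slots[RING_CAP];
	uint64_t head;	/* total number of entries ever written */
	uint64_t tail;	/* total number of entries consumed or dropped */
};

/* Write one entry; when the ring is full, discard the oldest entry. */
static void ring_write_overwrite(struct ring *r, uint32_t v)
{
	assert(r->head - r->tail <= RING_CAP);
	if (r->head - r->tail == RING_CAP)
		r->tail++;	/* make room by dropping the oldest entry */
	r->slots[r->head & (RING_CAP - 1)] = v;
	r->head++;
}

/* Read the oldest surviving entry; returns 0 when the ring is empty. */
static int ring_read(struct ring *r, uint32_t *out)
{
	if (r->head == r->tail)
		return 0;
	*out = r->slots[r->tail & (RING_CAP - 1)];
	r->tail++;
	return 1;
}
```

The trade-off is that a slow reader loses the oldest data rather than the newest, which suits "most recent events" style tracing.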
Linus Torvalds [Wed, 3 Dec 2025 23:50:11 +0000 (15:50 -0800)]
Merge tag 'linux_kselftest-kunit-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit updates from Shuah Khan:
- Make filter parameters configurable via Kconfig
- Add description of kunit.enable parameter to documentation
* tag 'linux_kselftest-kunit-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Make filter parameters configurable via Kconfig
Documentation: kunit: add description of kunit.enable parameter