git.ipfire.org Git - thirdparty/linux.git/log

net/sched: sch_cake: annotate data-races in cake_dump_class_stats (II)

cake_dump_class_stats() runs without qdisc spinlock being held.

In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations for:

- flow->deficit
- flow->cvars.dropping
- flow->cvars.count
- flow->cvars.p_drop
- flow->cvars.blue_timer
- flow->cvars.drop_next

Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061610.3503483-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_cake: annotate data-races in cake_dump_class_stats (I)

cake_dump_class_stats() runs without qdisc spinlock being held.

In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations for:

- flow->head
- flow->dropped
- b->backlogs[]

Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061610.3503483-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

batman-adv: stop tp_meter sessions during mesh teardown

TP meter sessions remain linked on bat_priv->tp_list after the netlink
request has already finished. When the mesh interface is removed,
batadv_mesh_free() currently tears down the mesh without first draining
these sessions.

A running sender thread or a late incoming tp_meter packet can then keep
processing against a mesh instance which is already shutting down.
Synchronize tp_meter with the mesh lifetime by stopping all active
sessions from batadv_mesh_free() and waiting for sender threads to exit
before teardown continues.

Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Co-developed-by: Luxing Yin <tr0jan@lzu.edu.cn>
Signed-off-by: Luxing Yin <tr0jan@lzu.edu.cn>
Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Sven Eckelmann <sven@narfation.org>

batman-adv: reject new tp_meter sessions during teardown

Prevent tp_meter from starting new sender or receiver sessions after
mesh_state has left BATADV_MESH_ACTIVE.

Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Co-developed-by: Luxing Yin <tr0jan@lzu.edu.cn>
Signed-off-by: Luxing Yin <tr0jan@lzu.edu.cn>
Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Sven Eckelmann <sven@narfation.org>

batman-adv: fix integer overflow on buff_pos

Fixing an integer overflow present in batadv_iv_ogm_send_to_if. The size
check is done using the int type in batadv_iv_ogm_aggr_packet whereas the
buff_pos variable uses the s16 type. This could lead to an out-of-bound
read.

Cc: stable@vger.kernel.org
Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
Signed-off-by: Lyes Bourennani <lbourennani@fuzzinglabs.com>
Signed-off-by: Alexis Pinson <apinson@fuzzinglabs.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>

Merge tag 'v7.1-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:

- Reject algorithms with authsizes that are too short in authencesn

* tag 'v7.1-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: authencesn - reject short ahash digests during instance creation

Merge tag 'ntfs-for-7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/ntfs

Pull ntfs fixes from Namjae Jeon:

- Fix a NULL pointer dereference in ntfs_index_walk_down() by
   validating index block allocation

- Fix a memory leak of the symlink target string in
   ntfs_reparse_set_wsl_symlink() during error paths

- Prevent VCN overflow and validate lowest_vcn in
   ntfs_mapping_pairs_decompress() to avoid runlist corruption

- Fix a page reference leak in ntfs_write_iomap_end_resident()
   when attribute search context allocation fails

- Fix an invalid PTR_ERR() usage on a valid folio pointer in
   __ntfs_bitmap_set_bits_in_run()

- Correct directory link counting by dropping nlink only when
   the MFT record link count reaches zero for WIN32/DOS aliases

- Fix an uninitialized variable in ntfs_mapping_pairs_decompress()
   by returning an error pointer directly

* tag 'ntfs-for-7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/ntfs:
  ntfs: Use return instead of goto in ntfs_mapping_pairs_decompress()
  ntfs: drop nlink once for WIN32/DOS aliases
  ntfs: fix invalid PTR_ERR() usage in __ntfs_bitmap_set_bits_in_run()
  ntfs: fix error handling in ntfs_write_iomap_end_resident()
  ntfs: fix VCN overflow in ntfs_mapping_pairs_decompress()
  ntfs: fix WSL symlink target leak on reparse failure
  ntfs: fix NULL dereference in ntfs_index_walk_down()

RDMA/hns: Fix unlocked call to hns_roce_qp_remove()

Sashiko points out that hns_roce_qp_remove() requires the caller to hold
locks. The error flow in hns_roce_create_qp_common() doesn't hold those
locks for the error unwind so it risks corrupting memory.

Grab the same locks the other two callers use.

Cc: stable@vger.kernel.org
Fixes: e088a685eae9 ("RDMA/hns: Support rq record doorbell for the user space")
Link: https://sashiko.dev/#/patchset/0-v2-1c49eeb88c48%2B91-rdma_udata_rep_jgg%40nvidia.com?part=9
Link: https://patch.msgid.link/r/15-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Reviewed-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/hns: Fix xarray race in hns_roce_create_qp_common()

Similar to the SRQ case the hr_qp is stored in the xarray before it is
fully initialized. Unlike the SRQ case the error unwinds do not wait for
the completion so keep the refcount 0 until the function succeeds.

Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Link: https://patch.msgid.link/r/14-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Suggested-by: Junxian Huang <huangjunxian6@hisilicon.com>
Reviewed-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/hns: Fix xarray race in hns_roce_create_srq()

Sashiko points out that once the srq memory is stored into the xarray by
alloc_srqc() it can immediately be looked up by:

xa_lock(&srq_table->xa);
srq = xa_load(&srq_table->xa, srqn & (hr_dev->caps.num_srqs - 1));
if (srq)
refcount_inc(&srq->refcount);
xa_unlock(&srq_table->xa);

Which will fail refcount debug because the refcount is 0 and then crash:

srq->event(srq, event_type);

Because event is NULL.

Use refcount_inc_not_zero() instead to ensure a partially prepared srq is
never retrieved from the event handler and fix the ordering of the
initialization so refcount becomes 1 only after it is fully ready.

All the initialization must be done before calling free_srqc() since it
depends on the completion and refcount.

Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Link: https://sashiko.dev/#/patchset/0-v1-e911b76a94d1%2B65d95-rdma_udata_rep_jgg%40nvidia.com?part=3
Link: https://patch.msgid.link/r/13-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Reviewed-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx4: Fix mis-use of RCU in mlx4_srq_event()

Sashiko points out the radix_tree itself is RCU safe, but nothing ever
frees the mlx4_srq struct with RCU, and it isn't even accessed within the
RCU critical section. It also will crash if an event is delivered before
the srq object is finished initializing.

Use the spinlock since it isn't easy to make RCU work, use
refcount_inc_not_zero() to protect against partially initialized objects,
and order the refcount_set() to be after the srq is fully initialized.

Cc: stable@vger.kernel.org
Fixes: 30353bfc43a1 ("net/mlx4_core: Use RCU to perform radix tree lookup for SRQ")
Link: https://sashiko.dev/#/patchset/0-v2-1c49eeb88c48%2B91-rdma_udata_rep_jgg%40nvidia.com?part=5
Link: https://patch.msgid.link/r/12-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx4: Fix resource leak on error in mlx4_ib_create_srq()

Sashiko points out that mlx4_srq_alloc() was not undone during error
unwind, add the missing call to mlx4_srq_free().

Cc: stable@vger.kernel.org
Fixes: 225c7b1feef1 ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
Link: https://sashiko.dev/#/patchset/0-v1-e911b76a94d1%2B65d95-rdma_udata_rep_jgg%40nvidia.com?part=8
Link: https://patch.msgid.link/r/11-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/vmw_pvrdma: Fix double free on pvrdma_alloc_ucontext() error path

Sashiko points out that pvrdma_uar_free() is already called within
pvrdma_dealloc_ucontext(), so calling it before triggers a double free.

Cc: stable@vger.kernel.org
Fixes: 29c8d9eba550 ("IB: Add vmw_pvrdma driver")
Link: https://sashiko.dev/#/patchset/0-v1-e911b76a94d1%2B65d95-rdma_udata_rep_jgg%40nvidia.com?part=4
Link: https://patch.msgid.link/r/10-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/ocrdma: Don't NULL deref uctx on errors in ocrdma_copy_pd_uresp()

Sashiko points out that pd->uctx isn't initialized until late in the
function so all these error flow references are NULL and will crash. Use
the uctx that isn't NULL.

Cc: stable@vger.kernel.org
Fixes: fe2caefcdf58 ("RDMA/ocrdma: Add driver for Emulex OneConnect IBoE RDMA adapter")
Link: https://sashiko.dev/#/patchset/0-v1-e911b76a94d1%2B65d95-rdma_udata_rep_jgg%40nvidia.com?part=4
Link: https://patch.msgid.link/r/9-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/ocrdma: Clarify the mm_head searching

The intention of this code is to find matching entries exactly, the driver
never creates phys_addr's with different lens so the current expression is
not a bug, but it doesn't make sense and confuses review tooling.

Search for exact match instead.

Link: https://sashiko.dev/#/patchset/0-v1-e911b76a94d1%2B65d95-rdma_udata_rep_jgg%40nvidia.com?part=4
Link: https://patch.msgid.link/r/8-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mana: Fix error unwind in mana_ib_create_qp_rss()

Sashiko points out that mana_ib_cfg_vport_steering() is leaked, the normal
destroy path cleans it up.

Cc: stable@vger.kernel.org
Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Link: https://sashiko.dev/#/patchset/0-v1-e911b76a94d1%2B65d95-rdma_udata_rep_jgg%40nvidia.com?part=4
Link: https://patch.msgid.link/r/7-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mana: Fix mana_destroy_wq_obj() cleanup in mana_ib_create_qp_rss()

Sashiko points out there are two bugs here in the error unwind flow, both
related to how the WQ table is unwound.

First there is a double i-- on the first failure path due to the while loop
having a i--, remove it.

Second if mana_ib_install_cq_cb() fails then mana_create_wq_obj() is not
undone due to the above i--.

Cc: stable@vger.kernel.org
Fixes: c15d7802a424 ("RDMA/mana_ib: Add CQ interrupt support for RAW QP")
Link: https://sashiko.dev/#/patchset/0-v2-1c49eeb88c48%2B91-rdma_udata_rep_jgg%40nvidia.com?part=1
Link: https://patch.msgid.link/r/6-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mana: Remove user triggerable WARN_ON() in mana_ib_create_qp_rss()

Sashiko points out that the user can specify WQs sharing the same CQ as a
part of the uAPI and this will trigger the WARN_ON() then go on to corrupt
the kernel.

Just reject it outright and fail the QP creation.

Cc: stable@vger.kernel.org
Fixes: c15d7802a424 ("RDMA/mana_ib: Add CQ interrupt support for RAW QP")
Link: https://sashiko.dev/#/patchset/0-v2-1c49eeb88c48%2B91-rdma_udata_rep_jgg%40nvidia.com?part=1
Link: https://patch.msgid.link/r/5-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mana: Validate rx_hash_key_len

Sashiko points out that rx_hash_key_len comes from a uAPI structure and is
blindly passed to memcpy, allowing the userspace to trash kernel
memory. Bounds check it so the memcpy cannot overflow.

Cc: stable@vger.kernel.org
Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Link: https://sashiko.dev/#/patchset/0-v2-1c49eeb88c48%2B91-rdma_udata_rep_jgg%40nvidia.com?part=1
Link: https://patch.msgid.link/r/4-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx5: Add missing store/release for lock elision pattern

mlx5 has a common pattern implementing a device-global singleton resource
where it checks the resource pointer for !NULL and then skips obtaining
the lock.

This is not ordered properly as observing !NULL doesn't mean that all the
data under that pointer is also visible on this CPU when the lock is not
taken.

Use a release/acquire pairing to explicitly manage this.

Pointed out by sashiko, Codex found more cases.

Fixes: 5895e70f2e6e ("IB/mlx5: Allocate resources just before first QP/SRQ is created")
Fixes: 638420115cc4 ("IB/mlx5: Create UMR QP just before first reg_mr occurs")
Link: https://sashiko.dev/#/patchset/SYBPR01MB7881E1E0970268BD69C0BA75AF2B2%40SYBPR01MB7881.ausprd01.prod.outlook.com
Link: https://patch.msgid.link/r/3-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Assisted-by: Codex:GPT-5.5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx5: Restore zero-init to mlx5_ib_modify_qp() ucmd

Sashiko points out the check for inlen==0 got missed, the ={} was not
redundant, put it back.

Fixes: a9cd442a5347 ("RDMA: Remove redundant = {} for udata req structs")
Link: https://patch.msgid.link/r/2-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/ionic: Fix typo in format string

Applying the corrupted patch by hand mangled the format string, put the s
in the right place.

Cc: stable@vger.kernel.org
Fixes: 654a27f25530 ("RDMA/ionic: bound node_desc sysfs read with %.64s")
Link: https://patch.msgid.link/r/1-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Reported-by: Brad Spengler <brad.spengler@opensrcsec.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Merge branch 'net-dsa-yt921x-add-port-police-support'

David Yang says:

====================
net: dsa: yt921x: Add port police support
====================

Link: https://patch.msgid.link/20260430114529.3536911-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: yt921x: Add port police support

Enable rate meter ability and support limiting the rate of incoming
traffic.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260430114529.3536911-4-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: yt921x: Refactor long register helpers

Dealing long registers with u64 is good, until you realize there are
longer 96-bit registers.

Refactor reg64 helpers to use u32 arrays instead of u64 values, in
preparation for 96-bit registers. We do not keep the separate u64
version for reg64 to avoid duplicated wrappers, although it looks better
when dealing with reg64 *only*.

Helpers for reg96 should be added when they are actually used to avoid
function unused warnings.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260430114529.3536911-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: pass extack to dsa_switch_ops :: port_policer_add()

Drivers might have error messages to propagate to user space. Propagate
the netlink extack so that they can inform user space in a verbal way of
their limitations.

Make the according transformations to the two users (sja1105 and felix).

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260430114529.3536911-2-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ip6_gre: Use cached t->net in ip6erspan_changelink().

After commit 5e72ce3e3980 ("net: ipv6: Use link netns in newlink() of
rtnl_link_ops"), ip6erspan_newlink() correctly resolves the per-netns
ip6gre hash via link_net. ip6erspan_changelink() was not converted in
that series and still uses dev_net(dev), which diverges from the
device's creation netns after IFLA_NET_NS_FD migration.

This re-inserts the tunnel into the wrong per-netns hash. The
original netns keeps a stale entry. When that netns is later
destroyed, ip6gre_exit_rtnl_net() walks the stale entry, producing a
slab-use-after-free reported by KASAN, followed by a kernel BUG at
net/core/dev.c (LIST_POISON1) in unregister_netdevice_many_notify().

Reachable from an unprivileged user namespace (unshare --user
--map-root-user --net).

ip6gre_changelink() earlier in the same file already uses the cached
t->net; only ip6erspan_changelink() has the wrong shape.

Fixes: 2d665034f239 ("net: ip6_gre: Fix ip6erspan hlen calculation")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260430103318.3206018-1-maoyi.xie@ntu.edu.sg
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'replace-direct-dequeue-call-with-qdisc_dequeue_peeked'

Jamal Hadi Salim says:

====================
Replace direct dequeue call with qdisc_dequeue_peeked

When sfb and red qdiscs have children (eg qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (red/sfb in this case), it will do the following:
1a. do a peek() - and when sensing there's an skb the child can offer, then
     - the child in this case(red/sfb) calls its child's (qfq) peek.
        qfq does the right thing and will return the gso_skb queue packet.
        Note: if there wasnt a gso_skb entry then qfq will store it there.
1b. invoke a dequeue() on the child (red/sfb). And herein lies the problem.
     - red/sfb will call the child's dequeue() which will essentially just
       try to grab something of qfq's queue.

The right thing to do in #1b is to grab the skb off gso_skb queue.
This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked()
method instead.

Patch 1 fixes the issue for red qdisc. Patch 2 fixes it for sfb.
Patch 3 adds testcases for the two setups.
====================

Link: https://patch.msgid.link/20260430152957.194015-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/tc-testing: Add tests that force red and sfb to dequeue from child's gso_skb

Create 4 test cases:
- Force red to dequeue from its child's gso_skb with qfq leaf
- Force sfb to dequeue from its child's gso_skb with qfq leaf
- Force red to dequeue from its child's gso_skb with dualpi2 leaf
- Force sfb to dequeue from its child's gso_skb with dualpi2 leaf

All of them have tbf followed by red (or sfb) followed by qfq (or
dualpi2). Since tbf calls its child's peek followed by
qdisc_dequeue_peeked, it will force red/sfb to call their child's peek.
In this case, since the child (qfq/dualpi2) has qdisc_peek_dequeued as
its peek callback, the packet will be stored in its gso_skb queue. During
the subsequent call to qdisc_dequeue_peeked, red/sfb will have to dequeue
from the child's gso_skb to retrieve the packet.
Not doing so will cause a NULL ptr deref which was happening before a
recent fix.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260430152957.194015-4-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_sfb: Replace direct dequeue call with peek and qdisc_dequeue_peeked

When sfb has children (eg qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (sfb in this case), it will do the following:
1a. do a peek() - and when sensing there's an skb the child can offer, then
     - the child in this case(sfb) calls its child's (qfq) peek.
        qfq does the right thing and will return the gso_skb queue packet.
        Note: if there wasnt a gso_skb entry then qfq will store it there.
1b. invoke a dequeue() on the child (sfb). And herein lies the problem.
     - sfb will call the child's dequeue() which will essentially just
       try to grab something of qfq's queue.

[  127.594489][  T453] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[  127.594741][  T453] CPU: 2 UID: 0 PID: 453 Comm: ping Not tainted 7.1.0-rc1-00035-gac961974495b-dirty #793 PREEMPT(full)
[  127.595059][  T453] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  127.595254][  T453] RIP: 0010:qfq_dequeue+0x35c/0x1650 [sch_qfq]
[  127.595461][  T453] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b
[  127.596081][  T453] RSP: 0018:ffff88810e5af440 EFLAGS: 00010216
[  127.596337][  T453] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[  127.596623][  T453] RDX: 0000000000000009 RSI: 0000001880000000 RDI: ffff888104fd82b0
[  127.596917][  T453] RBP: ffff888104fd8000 R08: ffff888104fd8280 R09: 1ffff110211893a3
[  127.597165][  T453] R10: 1ffff110211893a6 R11: 1ffff110211893a7 R12: 0000001880000000
[  127.597404][  T453] R13: ffff888104fd82b8 R14: 0000000000000048 R15: 0000000040000000
[  127.597644][  T453] FS:  00007fc380cbfc40(0000) GS:ffff88816f2a8000(0000) knlGS:0000000000000000
[  127.597956][  T453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  127.598160][  T453] CR2: 00005610aa9890a8 CR3: 000000010369e000 CR4: 0000000000750ef0
[  127.598390][  T453] PKRU: 55555554
[  127.598509][  T453] Call Trace:
[  127.598629][  T453]  <TASK>
[  127.598718][  T453]  ? mark_held_locks+0x40/0x70
[  127.598890][  T453]  ? srso_alias_return_thunk+0x5/0xfbef5
[  127.599053][  T453]  sfb_dequeue+0x88/0x4d0
[  127.599174][  T453]  ? ktime_get+0x137/0x230
[  127.599328][  T453]  ? srso_alias_return_thunk+0x5/0xfbef5
[  127.599480][  T453]  ? qdisc_peek_dequeued+0x7b/0x350 [sch_qfq]
[  127.599670][  T453]  ? srso_alias_return_thunk+0x5/0xfbef5
[  127.599831][  T453]  tbf_dequeue+0x6b1/0x1098 [sch_tbf]
[  127.599988][  T453]  __qdisc_run+0x169/0x1900

The right thing to do in #1b is to grab the skb off gso_skb queue.
This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked()
method instead.

Fixes: e13e02a3c68d ("net_sched: SFB flow scheduler")
Signed-off-by: Victor Nogueria <victor@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430152957.194015-3-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_red: Replace direct dequeue call with peek and qdisc_dequeue_peeked

When red qdisc has children (eg qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (red in this case), it will do the following:
1a. do a peek() - and when sensing there's an skb the child can offer, then
     - the child in this case(red) calls its child's (qfq) peek.
        qfq does the right thing and will return the gso_skb queue packet.
        Note: if there wasnt a gso_skb entry then qfq will store it there.
1b. invoke a dequeue() on the child (red). And herein lies the problem.
     - red will call the child's dequeue() which will essentially just
       try to grab something of qfq's queue.

[   78.667668][  T363] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[   78.667927][  T363] CPU: 1 UID: 0 PID: 363 Comm: ping Not tainted 7.1.0-rc1-00033-g46f74a3f7d57-dirty #790 PREEMPT(full)
[   78.668263][  T363] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   78.668486][  T363] RIP: 0010:qfq_dequeue+0x446/0xc90 [sch_qfq]
[   78.668718][  T363] Code: 54 c0 e8 dd 90 00 f1 48 c7 c7 e0 03 54 c0 48 89 de e8 ce 90 00 f1 48 8d 7b 48 b8 ff ff 37 00 48 89 fa 48 c1 e0 2a 48 c1 ea 03 <80> 3c 02 00 74 05 e8 ef a1 e1 f1 48 8b 7b 48 48 8d 54 24 58 48 8d
[   78.669312][  T363] RSP: 0018:ffff88810de573e0 EFLAGS: 00010216
[   78.669533][  T363] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   78.669790][  T363] RDX: 0000000000000009 RSI: 0000000000000004 RDI: 0000000000000048
[   78.670044][  T363] RBP: ffff888110dc4000 R08: ffffffffb1b0885a R09: fffffbfff6ba9078
[   78.670297][  T363] R10: 0000000000000003 R11: ffff888110e31c80 R12: 0000001880000000
[   78.670560][  T363] R13: ffff888110dc4150 R14: ffff888110dc42b8 R15: 0000000000000200
[   78.670814][  T363] FS:  00007f66a8f09c40(0000) GS:ffff888163428000(0000) knlGS:0000000000000000
[   78.671110][  T363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   78.671324][  T363] CR2: 000055db4c6a30a8 CR3: 000000010da67000 CR4: 0000000000750ef0
[   78.671585][  T363] PKRU: 55555554
[   78.671713][  T363] Call Trace:
[   78.671843][  T363]  <TASK>
[   78.671936][  T363]  ? __pfx_qfq_dequeue+0x10/0x10 [sch_qfq]
[   78.672148][  T363]  ? __pfx__printk+0x10/0x10
[   78.672322][  T363]  ? srso_alias_return_thunk+0x5/0xfbef5
[   78.672496][  T363]  ? lockdep_hardirqs_on_prepare+0xa8/0x1a0
[   78.672706][  T363]  ? srso_alias_return_thunk+0x5/0xfbef5
[   78.672875][  T363]  ? trace_hardirqs_on+0x19/0x1a0
[   78.673047][  T363]  red_dequeue+0x65/0x270 [sch_red]
[   78.673217][  T363]  ? srso_alias_return_thunk+0x5/0xfbef5
[   78.673385][  T363]  tbf_dequeue.cold+0xb0/0x70c [sch_tbf]
[   78.673566][  T363]  __qdisc_run+0x169/0x1900

The right thing to do in #1b is to grab the skb off gso_skb queue.
This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked()
method instead.

Fixes: 77be155cba4e ("pkt_sched: Add peek emulation for non-work-conserving qdiscs.")
Reported-by: Manas <ghandatmanas@gmail.com>
Reported-by: Rakshit Awasthi <rakshitawasthi17@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430152957.194015-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add vhca_id_type support to IPsec alias creation

When creating an alias FT for MPV IPsec, if alias creation with
sw_vhca_id is supported use it instead of using the hw_vhca_id.

This in turn allows IPsec to work properly after live migration,
in case a VF was live migrated and his hw_vhca_id changed due to
migration which can happen if you migrate to a VF with a different index
than yours, IPsec would fail to start post migration, this patch
resolves the issue by using sw_vhca_id instead which doesn't change post
migration.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061958.225245-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Documentation/tcp_ao: Document the supported MAC algorithms and lengths

Update the TCP-AO documentation to fix some incorrect terminology and
claims regarding the MAC algorithms, and document which MAC algorithms
and lengths the Linux implementation supports.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260429210856.725667-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

amd-xgbe: fix PTP addend overflow causing frozen clock

XGBE_PTP_ACT_CLK_FREQ and XGBE_V2_PTP_ACT_CLK_FREQ were 10x too
large (500MHz/1GHz instead of 50MHz/100MHz), causing the computed
addend to overflow the 32-bit tstamp_addend. In the general case
this would result in the clock advancing at the wrong rate. For v2
(PCI), ptpclk_rate is hardcoded to 125MHz, so the addend formula
(ACT_CLK_FREQ << 32) / ptpclk_rate yields exactly 8 * 2^32, and
when stored to the 32-bit tstamp_addend the value is zero. With
addend = 0 the hardware accumulator never overflows and the PTP
clock is fully stopped. For v1 (platform), ptpclk_rate is read from
ACPI/DT so the exact overflow behavior depends on the
firmware-reported frequency.

Define the constants as NSEC_PER_SEC / SSINC so the relationship is
explicit and cannot drift out of sync.

Fixes: fbd47be098b5 ("amd-xgbe: add hardware PTP timestamping support")
Tested-by: Gregory Fuchedgi <gfuchedgi@gmail.com>
Signed-off-by: Gregory Fuchedgi <gfuchedgi@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429-fix-xgbe-ptp-addend-v1-1-fca5b0ca5e62@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wan: fsl_ucc_hdlc: fix ucc_hdlc_remove

If the driver is used in a non tdm mode priv->utdm is a NULL pointer.
Therefore we need to check this pointer first before checking si_regs.

Fixes: c19b6d246a35 ("drivers/net: support hdlc function for QE-UCC")
Signed-off-by: Holger Brunck <holger.brunck@hitachienergy.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wan: fsl_ucc_hdlc: fix uhdlc_memclean

Unmapping of uf_regs is done from ucc_fast_free and doesn't need to be
done explicitly. If already unmapped ucc_fast_free will crash.

Fixes: c19b6d246a35 ("drivers/net: support hdlc function for QE-UCC")
Signed-off-by: Holger Brunck <holger.brunck@hitachienergy.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: Add self for the DEC LANCE network driver

Like with the rest of DECstation and TURBOchannel hardware I have been
handling the DEC LANCE network driver for some 25 years now anyway.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/alpine.DEB.2.21.2604271113520.28583@angie.orcam.me.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/headers: Regenerate stddef.h to fix BPF selftests

With commit dacbfc167808 ("crypto: af_alg - Annotate struct af_alg_iv
with __counted_by"), two selftests, test_tag and crypto_sanity, now
indirectly rely on the __counted_by macro. On systems with commit
dacbfc167808 in the installed UAPI headers, the selftests build fails
with:

  In file included from tools/testing/selftests/bpf/prog_tests/crypto_sanity.c:7:
  /usr/include/linux/if_alg.h:45:22: error: expected ‘:’, ‘,’, ‘;’, ‘}’ or ‘__attribute__’ before ‘__counted_by’
     45 |         __u8    iv[] __counted_by(ivlen);
        |                      ^~~~~~~~~~~~

This patch fixes it by regenerating stddef.h in tools/include using the
instructions from commit a778f5d46b62 ("tools/headers: Pull in stddef.h
to uapi to fix BPF selftests build in CI").

Fixes: dacbfc167808 ("crypto: af_alg - Annotate struct af_alg_iv with __counted_by")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/8da8ef16055aa452d940668ed5359ce54adc6b0b.1777715500.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

driver core: reject devices with unregistered buses

Trying to register a device on a bus which has not yet been registered
used to trigger a NULL-pointer dereference, but since the const bus
structure rework registration instead succeeds without the device being
added to the bus.

This specifically means that the device will never bind to a driver and
that the bus sysfs attributes are not created (i.e. as if the device had
no bus).

Reject devices with unregistered buses to catch any callers that get
the ordering wrong and to handle bus registration failures more
gracefully.

Fixes: 5221b82d46f2 ("driver core: bus: bus_add/probe/remove_device() cleanups")
Cc: stable@vger.kernel.org # 6.3
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260430091718.230228-1-johan@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

driver core: faux: clean up init error handling

Clean up the faux bus init error handling by naming the labels after
what they do (rather than from where they are jumped to) and separating
the success path more clearly by returning explicit zero.

Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260424153127.2647405-3-johan@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

driver core: faux: fix root device registration

A recent change made the faux bus root device be allocated dynamically
but failed to provide a release function to free the memory when the
last reference is dropped (on theoretical failure to register the device
or bus).

Fix this by using root_device_register() instead of open coding.

Also add the missing sanity check when registering faux devices to avoid
use-after-free if the bus failed to register (which would previously
have triggered a bunch of use-after-free warnings).

Fixes: 61b76d07d2b4 ("driver core: faux: stop using static struct device")
Cc: stable@vger.kernel.org # 7.0
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260424153127.2647405-2-johan@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

riscv: mm: Fixup no5lvl failure when vaddr is invalid

Unlike no4lvl, no5lvl still continues to detect satp, which
requires va=pa mapping. When pa=0x800000000000, no5lvl
would fail in Sv48 mode due to an illegal VA value of
0x800000000000.

So, prevent detecting the satp flow for no5lvl, when
vaddr is invalid. Add the is_vaddr_valid() function for
checking.

Fixes: 26e7aacb83df ("riscv: Allow to downgrade paging mode from the command line")
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Guo Ren (Alibaba DAMO Academy) <guoren@kernel.org>
Tested-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Link: https://patch.msgid.link/20260125055212.433163-1-guoren@kernel.org
[pjw@kernel.org: cleaned up commit message]
Signed-off-by: Paul Walmsley <pjw@kernel.org>

riscv: Fix register corruption from uninitialized cregs on error

compat_riscv_gpr_set() calls cregs_to_regs() unconditionally, even when
user_regset_copyin() fails. Since cregs is an uninitialized stack
variable, a copyin failure causes uninitialized stack data to be written
into the target task's pt_regs, corrupting its register state and
potentially leaking kernel stack contents.

compat_restore_sigcontext() has the same issue: it calls cregs_to_regs()
even when __copy_from_user() fails, leading to the same corruption of
the signal-returning task's register state on error.

Only call cregs_to_regs() when the user copy succeeds.

Fixes: 4608c159594f ("riscv: compat: ptrace: Add compat_arch_ptrace implement")
Fixes: 7383ee05314b ("riscv: compat: signal: Add rt_frame implementation")
Signed-off-by: Michael Neuling <mikey@neuling.org>
Assisted-by: Cursor:claude-4.6-opus-high-thinking
Link: https://patch.msgid.link/20260501062320.2339562-1-mikey@neuling.org
Signed-off-by: Paul Walmsley <pjw@kernel.org>

ksmbd: validate inherited ACE SID length

smb_inherit_dacl() walks the parent directory DACL loaded from the
security descriptor xattr. It verifies that each ACE contains the fixed
SID header before using it, but does not verify that the variable-length
SID described by sid.num_subauth is fully contained in the ACE.

A malformed inheritable ACE can advertise more subauthorities than are
present in the ACE. compare_sids() may then read past the ACE.
smb_set_ace() also clamps the copied destination SID, but used the
unchecked source SID count to compute the inherited ACE size. That could
advance the temporary inherited ACE buffer pointer and nt_size accounting
past the allocated buffer.

Fix this by validating the parent ACE SID count and SID length before
using the SID during inheritance. Compute the inherited ACE size from the
copied SID so the size matches the bounded destination SID. Reject the
inherited DACL if size accumulation would overflow smb_acl.size or the
security descriptor allocation size.

Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Shota Zaizen <s@zaizen.me>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

ksmbd: fix kernel-doc warnings from ksmbd_conn_get/put()

The kernel test robot reported W=1 build warnings for ksmbd_conn_get()
and ksmbd_conn_put() due to missing parameter descriptions.
Add the @conn description to fix these warnings.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

ksmbd: fail share config requests when path allocation fails

Non-pipe shares must have a duplicated backing path before they can be
published. share_config_request() currently calls kstrndup() for that
path, but if the allocation fails it leaves ret unchanged. If veto list
parsing succeeds and share->name exists, the partially built share is
still inserted into the global share table with share->path left NULL.

A later share-root SMB2 create uses tree_conn->share_conf->path as the
lookup root. If the share was published with path == NULL, that request
passes a NULL pathname into do_getname_kernel()/strlen() and can crash
the ksmbd worker.

Set ret = -ENOMEM when path duplication fails so the incomplete share is
destroyed before publication.

Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

ksmbd: close durable scavenger races against m_fp_list lookups

ksmbd_durable_scavenger() has two related races against any walker
that iterates f_ci->m_fp_list, including ksmbd_lookup_fd_inode()
(used by ksmbd_vfs_rename) and the share-mode checks in
fs/smb/server/smb_common.c.

(1) fp->node list-head reuse.  Durable-preserved handles can remain
linked on f_ci->m_fp_list after session teardown so share-mode checks
still see them while the handle is reconnectable.  The scavenger
collected expired handles by adding fp->node to a local
scavenger_list after removing them from the global durable idr.
Because fp->node is the same list_head used by m_fp_list,
list_add(&fp->node, &scavenger_list) overwrites the m_fp_list links
and corrupts both lists.  CONFIG_DEBUG_LIST can report this on the
share-mode walk path.

(2) Refcount race against m_fp_list walkers.  The scavenger qualifies
an expired durable handle with atomic_read(&fp->refcount) > 1 and
fp->conn under global_ft.lock, removes fp from global_ft, then drops
global_ft.lock before unlinking fp from m_fp_list and freeing it.
During that gap fp is still linked on m_fp_list with f_state ==
FP_INITED.  ksmbd_lookup_fd_inode() under m_lock read calls
ksmbd_fp_get() (atomic_inc_not_zero on refcount that is still 1) and
takes a live reference; the scavenger then unlinks and frees fp
while the holder owns a reference, leading to UAF on the holder's
subsequent ksmbd_fd_put() and on any field reads performed by a
concurrent share-mode walker that iterates m_fp_list without taking
ksmbd_fp_get() (smb_check_perm_dleases-like paths).

Fix both:

  * Stop reusing fp->node as a scavenger-private list node.  Remove
    one expired handle from global_ft under global_ft.lock, take an
    explicit transient reference, drop the lock, unlink fp->node
    from m_fp_list under f_ci->m_lock, then drop both the durable
    lifetime and transient references with atomic_sub_and_test(2,
    &fp->refcount).  If the scavenger is the last putter the close
    runs there; otherwise an in-flight holder that already raced
    through the m_fp_list lookup owns the final close via its
    ksmbd_fd_put() path.  The one-at-a-time disposal can rescan the
    durable idr when multiple handles expire in the same pass, but
    durable scavenging is a background expiration path and the final
    full scan recomputes min_timeout before the next wait.

  * Clear fp->persistent_id inside __ksmbd_remove_durable_fd() right
    after idr_remove(), so a delayed final close from a holder that
    snatched fp does not re-issue idr_remove() on a persistent id
    that idr_alloc_cyclic() in ksmbd_open_durable_fd() may have
    already handed out to a brand-new durable handle.

  * Bypass the per-conn open_files_count decrement in
    __put_fd_final() when fp is detached from any session table
    (fp->conn cleared by session_fd_check() at durable preserve --
    paired with the volatile_id clear at unpublish, so checking
    fp->conn alone is sufficient).  The walker that owns the final
    close runs from an unrelated work->conn whose
    stats.open_files_count never tracked this durable fp; without
    this guard the holder would underflow that unrelated counter.

The two races are folded into one patch because patch (1) alone
cleans up the corrupted list but leaves a deterministic UAF window
for m_fp_list walkers that the transient-reference and
persistent_id discipline in (2) close; bisecting onto an
intermediate state would land on a UAF that pre-patch chaos merely
made less reproducible.

Validation:
  * CONFIG_DEBUG_LIST coverage for the list_head reuse path.
  * KASAN-enabled direct SMB2 durable-handle coverage that exercised
    ksmbd_durable_scavenger() and non-NULL ksmbd_lookup_fd_inode()
    returns while durable handles expired under concurrent rename
    lookups, with no KASAN, UAF, list-corruption, ODEBUG, or WARNING
    reports.
  * checkpatch --strict
  * make -j$(nproc) M=fs/smb/server

Fixes: d484d621d40f ("ksmbd: add durable scavenger timer")
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

ksmbd: harden file lifetime during session teardown

__close_file_table_ids() is the per-session teardown that closes every
fp belonging to a session (or to one tree connect on that session) by
walking the session's volatile-id idr.  The current loop has three
related problems on busy or racing workloads:

  * Sleeping under ft->lock.  The session-teardown skip callback,
    session_fd_check(), already sleeps in ksmbd_vfs_copy_durable_owner()
    -> kstrdup(GFP_KERNEL) and down_write(&fp->f_ci->m_lock) (a
    rw_semaphore).  Running the callback inside write_lock(&ft->lock)
    trips CONFIG_DEBUG_ATOMIC_SLEEP / CONFIG_PROVE_LOCKING on a
    durable-fd workload.

  * Refcount accounting blind to f_state.  The unconditional
    atomic_dec_and_test(&fp->refcount) does not distinguish
    FP_INITED (idr-owned reference still intact) from FP_CLOSED (an
    earlier ksmbd_close_fd() already consumed the idr-owned reference
    while leaving fp in the idr because a holder kept refcount
    non-zero).  When the latter races with teardown the same path
    over-decrements into a holder reference and ksmbd_fd_put() later
    UAFs that holder.

  * FP_NEW window.  Between __open_id() publishing fp into the
    session idr and ksmbd_update_fstate(..., FP_INITED) committing the
    transition at the end of smb2_open(), an fp is in FP_NEW and an
    intervening teardown that takes a transient reference and
    unpublishes the volatile id leaves the original idr-owned
    reference orphaned -- the opener is unaware that fp has been
    unpublished, returns success to the client, and the fp leaks at
    refcount = 1.

Refactor __close_file_table_ids() to take a transient reference on fp
and unpublish fp from the session idr *under ft->lock* before calling
skip() outside the lock.  A transient ref protects lifetime but not
concurrent field mutation, so the idr_remove() is what keeps
__ksmbd_lookup_fd() through this session's idr from granting a new
ksmbd_fp_get() reference to an fp whose fp->conn / fp->tcon /
fp->volatile_id / op->conn / lock_list links are about to be rewritten
by session_fd_check().  Durable reconnect is unaffected because it
reaches fp through the global durable table (ksmbd_lookup_durable_fd
-> global_ft).

Decide n_to_drop together with any FP_INITED -> FP_CLOSED transition
under ft->lock so teardown and ksmbd_close_fd() never both consume the
idr-owned reference.  See ksmbd_mark_fp_closed() for the per-state
accounting.  For the FP_NEW path to be safe, the opener has to learn
that fp was unpublished: ksmbd_update_fstate() now returns -ENOENT
when an FP_NEW -> FP_INITED transition finds f_state already advanced
or the volatile id cleared (both committed by teardown under
ft->lock); smb2_open() propagates that as STATUS_OBJECT_NAME_INVALID
and drops the original reference via ksmbd_fd_put().

The list removal cannot be left for a deferred final putter because
fp->volatile_id has already been cleared and __ksmbd_remove_fd() will
intentionally skip both idr_remove() and list_del_init().  Move the
m_fp_list unlink in __ksmbd_remove_fd() above the volatile-id check so
that an FP_NEW fp that happened to be added to m_fp_list (smb2_open()
adds fp->node before ksmbd_update_fstate() runs) is still cleaned up
on the deferred putter path; list_del_init() on an empty node is a
no-op and remains safe for fps that were never added.

Add a defensive guard in session_fd_check() that refuses non-FP_INITED
fps so that even if a teardown reaches an FP_NEW fp it falls into the
close branch (where the n_to_drop = 1 accounting keeps the opener's
reference alive) instead of the durable-preserve branch (which mutates
fp->conn / fp->tcon).

Validation on a debug kernel additionally built with CONFIG_DEBUG_LIST
and CONFIG_DEBUG_OBJECTS_WORK used a same-session two-tcon workload
(open/write storm on one tcon, 50 tree disconnects on the other) and
reported no list-corruption, work_struct ODEBUG, sleep-in-atomic,
lockdep or kmemleak reports.  Reverting only the
__close_file_table_ids() hunk while keeping a forced-is_reconnectable()
harness produced the expected sleep-in-atomic at vfs_cache.c:1095,
confirming the ft->lock-out-of-sleepable-skip discipline.

KASAN-enabled direct SMB2 coverage with durable handles enabled
exercised ksmbd_close_tree_conn_fds(), ksmbd_close_session_fds(),
the FP_NEW failure path, tree_conn_fd_check(), and a non-zero
session_fd_check() durable-preserve return.  This produced no KASAN,
DEBUG_LIST, ODEBUG, or WARNING reports.

Fixes: f44158485826 ("cifsd: add file operations")
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

ksmbd: centralize ksmbd_conn final release to plug transport leak

ksmbd_conn_free() is one of four sites that can observe the last
refcount drop of a struct ksmbd_conn.  The other three

    fs/smb/server/connection.c    ksmbd_conn_r_count_dec()
    fs/smb/server/oplock.c        __free_opinfo()
    fs/smb/server/vfs_cache.c     session_fd_check()

end the conn with a bare kfree(), skipping
ida_destroy(&conn->async_ida) and
conn->transport->ops->free_transport(conn->transport).  Whenever one
of them is the last putter, the embedded async_ida and the entire
transport struct leak -- for TCP, that is also the struct socket and
the kvec iov.

__free_opinfo() being a final putter is not theoretical.  opinfo_put()
queues the callback via call_rcu(&opinfo->rcu, free_opinfo_rcu), so
ksmbd_server_terminate_conn() can deposit N opinfo releases in RCU and
have ksmbd_conn_free() run in the handler thread before any of them
fire.  ksmbd_conn_free() then observes refcnt > 0 and short-circuits;
the last RCU-delivered __free_opinfo() falls onto its bare kfree(conn)
branch and the transport is lost.

A/B validation in a QEMU/virtme guest, mounting //127.0.0.1/testshare:
each iteration holds 8 files open via sleep processes, force-closes
TCP with "ss -K sport = :445", kills the holders, lazy-umounts;
repeated 10 times, then ksmbd shutdown and kmemleak scan.

    state         conn_alloc  conn_free  tcp_free  opi_rcu  kmemleak
    ----------    ----------  ---------  --------  -------  --------
    pre-patch         20          20        10       160        7
    with patch        20          20        20       160        0

Pre-patch conn_free=20 with tcp_free=10 directly demonstrates the
bare-kfree paths skipping transport cleanup; kmemleak backtraces point
into struct tcp_transport / iov.  With this patch tcp_free matches
conn_free at 20/20 and kmemleak is clean.

Move the per-struct final release into __ksmbd_conn_release_work() and
route the three bare-kfree final-put sites through a new
ksmbd_conn_put().  Those sites now pair ida_destroy() and
free_transport() with kfree(conn) regardless of which holder happens
to release the last reference.  stop_sessions() only triggers the
transport shutdown and does not itself drop the last conn reference,
so it is unaffected.

The centralized release reaches sock_release() -> tcp_close() ->
lock_sock_nested() (might_sleep) from every final putter, including
__free_opinfo() invoked from an RCU softirq callback, which trips
CONFIG_DEBUG_ATOMIC_SLEEP.  Defer the release to a dedicated
ksmbd_conn_wq workqueue so ksmbd_conn_put() is safe from any
non-sleeping context.

Make ksmbd_file own a strong connection reference while fp->conn is
non-NULL so durable-preserve and final-close paths cannot dereference
a stale connection.  ksmbd_open_fd() and ksmbd_reopen_durable_fd()
take the reference via ksmbd_conn_get() (the latter also reorders the
fp->conn / fp->tcon assignments before __open_id() so the published fp
is never observed with fp->conn == NULL); session_fd_check() and
__ksmbd_close_fd() drop it via ksmbd_conn_put().  With that invariant,
session_fd_check() can take a local conn pointer once and use it
across the m_op_list and lock_list iterations even though op->conn
puts may otherwise drop the last reference.

At module exit the workqueue is flushed and destroyed after
rcu_barrier(), so any release queued by a trailing RCU callback is
drained before the inode hash and module text go away.

Fixes: ee426bfb9d09 ("ksmbd: add refcnt to ksmbd_conn struct")
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

Merge branch 'net-mlx5-enable-sub-page-allocations-for-mlx5_frag_buf'

Tariq Toukan says:

====================
net/mlx5: enable sub-page allocations for mlx5_frag_buf

This series aims to improve memory utilization for DMA-coherent
fragmented-buffer allocations on systems with large PAGE_SIZE.

Before this change, such allocations were page-granular, as they were
backed by full pages. On large-page systems this caused significant
internal waste for small objects. For example, a single 4K request
consumed an entire 64K page.

The common kernel solution for sub-page coherent DMA allocations is the
DMA pool API. However, those pools do not return pages to the system
until teardown. That behavior is not a good fit for mlx5_frag_buf
allocations, since they back interface resources (WQs and CQs).
Interfaces may be removed dynamically, so their memory footprint should
reflect live usage to avoid situations where large amounts of memory
remain tied up in pools.

This series introduces a lightweight mlx5-local pool implementation for
sub-page coherent DMA allocations, which immediately returns free
backing pages. It wires mlx5_frag_buf allocations to use these internal
pools, while keeping the mechanism reusable for other mlx5-internal
coherent DMA allocation users in follow-up work.
====================

Link: https://patch.msgid.link/20260429201429.223809-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: use internal dma pools for frag buf alloc

Add mlx5_dma_pool alloc/free paths, and wire mlx5_frag_buf allocation
and free paths to use them.

mlx5_frag_buf_alloc_node() now selects an mlx5_dma_pool to allocate
fragments from, instead of directly allocating full coherent pages.

mlx5_frag_buf_free() frees from the respective pool.

mlx5_dma_pool_alloc() keeps allocation fast by maintaining pages with
available indexes at the head of the list, so the common allocation path
can take a free index immediately. New backing pages are allocated only
when no free index is available.

mlx5_dma_pool_free() returns released indexes to the pool and frees a
backing page once all of its indexes become free. This avoids keeping
fully free pages for the lifetime of the pool and reduces coherent DMA
memory footprint.

Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260429201429.223809-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: add frag buf pools create/destroy paths

Introduce mlx5 DMA pool and pool-page data structures, and add the
creation and teardown paths.

Each NUMA node owns a set of mlx5_dma_pool instances, each one with a
different block size. The sizes are defined as all powers of two
starting from MLX5_ADAPTER_PAGE_SHIFT and up to PAGE_SHIFT. Since
mlx5_frag_bufs are used to back objects whose sizes are encoded relative
to MLX5_ADAPTER_PAGE_SHIFT, a smaller block_shift value cannot be used.
Requests larger than PAGE_SIZE continue to be handled as page-sized
fragments, as in the existing frag-buf allocation model.

Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260429201429.223809-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: wire frag buf pools lifecycle hooks

Wire mlx5_frag_buf pools init/cleanup hooks into
mlx5_mdev_init()/uninit() and the init unwind path.

Keep temporary no-op stubs in alloc.c so lifecycle ordering is in place
before the coherent DMA sub-page allocator implementation is added in
follow-up patches.

Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260429201429.223809-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

pppoe: optimize hash with word access

Currently, hash_item() processes the 6-byte Ethernet address and the
2-byte session ID byte-wise to compute a hash.

Optimize this by using 16-bit word operations: XOR three 16-bit words
from the Ethernet address and the 16-bit session ID, then fold the
result. This reduces the total number of loads and XORs. The Ethernet
addresses in a skb and struct pppoe_addr are both 2-byte aligned, so the
u16 pointer cast is safe.

Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Link: https://patch.msgid.link/20260429023848.153425-1-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipv6-fix-ecmp-route-failover-on-carrier-loss'

Sagarika Sharma says:

====================
ipv6: fix ECMP route failover on carrier loss

This patchset resolves an issue where established IPv6 connections are
unable to transition to alternative ECMP nexthops upon carrier loss.

Unlike IPv4, the IPv6 routing subsystem does not actively invalidate
cached destinations during a NETDEV_CHANGE event. Sockets persist
with dead routes, leading to stalled traffic or connection drops.

This series introduces a fix to trigger route invalidation by
updating the route serial number on link carrier loss and provides
a corresponding selftest to validate the failover behavior for IPv4
and IPv6.
====================

Link: https://patch.msgid.link/20260430200909.527827-1-sharmasagarika@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftest: net: Add test for TCP flow failover with ECMP routes.

Without the previous commit, TCP failed to switch to alternative
IPv6 routes immediately upon carrier loss.

It would persist with the dead route until reaching the threshold
net.ipv4.tcp_retries1, leading to unnecessary delays in failover.

Let's add a selftest for this scenario to ensure TCP fails over
immediately upon a carrier loss event.

Before:
  TEST: TCP IPv4 failover                                             [ OK ]
  TEST: TCP IPv6 failover                                             [FAIL]

After:
  TEST: TCP IPv4 failover                                             [ OK ]
  TEST: TCP IPv6 failover                                             [ OK ]

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Sagarika Sharma <sharmasagarika@google.com>
Link: https://patch.msgid.link/20260430200909.527827-3-sharmasagarika@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: update route serial number on NETDEV_CHANGE

When using IPv6 ECMP routes, if a netdev listed as a nexthop experiences
a carrier change event (e.g., a bond device generating a NETDEV_CHANGE
event after its slaves go linkdown), established connections utilizing
that nexthop fail to fail over to other available nexthops. Instead,
these connections stall or drop.

This happens because the IPv6 FIB code does not invalidate the socket's
cached destination when a NETDEV_CHANGE event occurs. While
fib6_ifdown() correctly marks the nexthop with RTNH_F_LINKDOWN, it
leaves the route's serial number unchanged. As a result, sockets with a
previously cached dst do not realize the route is no longer viable and
continue to try using the non-functional nexthop.

This behavior contrasts with IPv4, which actively flushes cached
destinations on a NETDEV_CHANGE event (see fib_netdev_event() in
net/ipv4/fib_frontend.c).

Fix this by updating the route serial number in fib6_ifdown() when
setting RTNH_F_LINKDOWN. This invalidates stale cached destinations,
forcing sockets to perform a new route lookup and fail over to a
functioning nexthop.

Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Sagarika Sharma <sharmasagarika@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430200909.527827-2-sharmasagarika@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_pie: annotate more data-races in pie_dump_stats()

My prior patch missed few READ_ONCE()/WRITE_ONCE() annotations.

Fixes: 5154561d9b11 ("net/sched: sch_pie: annotate data-races in pie_dump_stats()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430080056.35104-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: configure QoS channel for HW accelerated flowtable traffic

As done for the SW path, configure the QoS channel for HW accelerated
traffic according to the user port index when forwarding to a DSA port,
or rely on the GDM port identifier otherwise. This allows HTB shaping
to be applied to HW accelerated traffic.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260430-airoha-ppe-qos-channel-v1-1-5ef9221e85c1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-move-some-fastpath-fields-to-appropriate-groups'

Eric Dumazet says:

====================
tcp: move some fastpath fields to appropriate groups

Move following fields to better groups to increase data locality.

- delivered
- delivered_ce
- segs_in
- segs_out
- first_tx_mstamp
- delivered_mstamp
- max_packets_out
- cwnd_usage_seq
- rate_delivered
- rate_interval_us

No change in overall tcp_sock size.
====================

Link: https://patch.msgid.link/20260430100021.211139-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move max_packets_out, cwnd_usage_seq, rate_delivered and rate_interval_us to tcp_sock_write_tx group

These fields are used in TX path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->bytes_acked to tcp_sock_write_tx group

tp->bytes_acked is touched in TX path only.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->first_tx_mstamp and tp->delivered_mstamp to tcp_sock_write_tx

These fields are touched in when payload is sent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->segs_in and tp->segs_out to tcp_sock_write_txrx group

segs_in is changed for each incoming packet, including ACK packets.
segs_out is changed for each outgoing packet, including ACK packets.

They belong to tcp_sock_write_txrx group.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->delivered and tp->delivered_ce to tcp_sock_write_tx group

These counters are changed whenever sent data is acknowleged.

They do not belong to tcp_sock_write_txrx group, because TCP receivers
do not touch them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: cdc_ncm: add Apple Mac USB-C direct networking quirk

Apple Silicon Macs expose two CDC NCM "private" data interfaces over
USB-C with VID:PID 0x05ac:0x1905 and product string "Mac". This is the
same protocol Apple already ships on iPhone (0x05ac:0x12a8) and iPad
(0x05ac:0x12ab) for RemoteXPC since iOS 17 -- both data interfaces lack
an interrupt status endpoint, so they rely on the FLAG_LINK_INTR-
conditional bind path introduced in commit 3ec8d7572a69 ("CDC-NCM: add
support for Apple's private interface").

The id_table currently has entries for iPhone and iPad but not for the
Mac. Without a match, cdc_ncm falls through to the generic CDC NCM
class-match entry, which uses the FLAG_LINK_INTR-having cdc_ncm_info
struct, so bind_common() fails on the missing status endpoint and no
netdev appears.

Add id_table entries for both interface numbers (0 and 2) of the Mac,
bound to the existing apple_private_interface_info driver_info.

Verified empirically on a Mac Studio M3 Ultra running macOS 26.5: when
a Mac is connected via USB-C, ioreg shows VID 0x05ac, PID 0x1905,
product string "Mac", with two NCM data interfaces at numbers 0 and 2.
The same PID is presented by all current Apple Silicon Mac models
(MacBook Pro/Air, Mac mini, Mac Studio across the M-series), mirroring
Apple's single-PID-per-family pattern from iPhone/iPad.

After this patch, plugging a Mac into a Linux host running the patched
kernel produces two enx... interfaces (one per data interface),
"ip -br link" lists them as UP, and standard userspace networking
(DHCP, NetworkManager shared mode, etc.) works without any modprobe
overrides or out-of-tree modules.

Signed-off-by: Alex Cheema <alex@exolabs.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429175739.34426-1-alex@exolabs.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: igmp: annotate data-races in igmp_heard_query()

Multiple cpus can run igmp_heard_query() concurrently.

Add missing READ_ONCE()/WRITE_ONCE() over following in_dev fields.

- mr_qrv
- mr_qi
- mr_qri
- mr_v1_seen
- mr_v2_seen

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+ae9a171f239b14485310@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69f38675.050a0220.3cbe47.0002.GAE@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430164836.872079-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: Enable ntuple-filters if supported

Certain devices which support ntuple-filters do not enable the feature
by default. The existing tests will skip (if they check for the feature),
or fail if they blindly attempt to install rules. Therefore, attempt to turn
on ntuple-filters if the device supports them.

Signed-off-by: Dimitri Daskalakis <daskald@meta.com>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260430165217.3700469-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: rtnetlink: zero ifla_vf_broadcast to avoid stack infoleak in rtnl_fill_vfinfo

rtnl_fill_vfinfo() declares struct ifla_vf_broadcast on the stack
without initialisation:

struct ifla_vf_broadcast vf_broadcast;

The struct contains a single fixed 32-byte field:

/* include/uapi/linux/if_link.h */
struct ifla_vf_broadcast {
__u8 broadcast[32];
};

The function then copies dev->broadcast into it using dev->addr_len
as the length:

memcpy(vf_broadcast.broadcast, dev->broadcast, dev->addr_len);

On Ethernet devices (the overwhelming majority of SR-IOV NICs)
dev->addr_len is 6, so only the first 6 bytes of broadcast[] are
written. The remaining 26 bytes retain whatever was previously on
the kernel stack. The full struct is then handed to userspace via:

nla_put(skb, IFLA_VF_BROADCAST,
sizeof(vf_broadcast), &vf_broadcast)

leaking up to 26 bytes of uninitialised kernel stack per VF per
RTM_GETLINK request, repeatable.

The other vf_* structs in the same function are explicitly zeroed
for exactly this reason - see the memset() calls for ivi,
vf_vlan_info, node_guid and port_guid a few lines above.
vf_broadcast was simply missed when it was added.

Reachability: any unprivileged local process can open AF_NETLINK /
NETLINK_ROUTE without capabilities and send RTM_GETLINK with an
IFLA_EXT_MASK attribute carrying RTEXT_FILTER_VF. The kernel walks
each VF and emits IFLA_VF_BROADCAST, leaking 26 bytes of stack per
VF per request. Stack residue at this call site can include return
addresses and transient sensitive data; KASAN with stack
instrumentation, or KMSAN, will flag the nla_put() when reproduced.

Zero the on-stack struct before the partial memcpy, matching the
existing pattern used for the other vf_* structs in the same
function.

Fixes: 75345f888f70 ("ipoib: show VF broadcast address")
Cc: stable@vger.kernel.org
Signed-off-by: Kai Zen <kai.aizen.dev@gmail.com>
Link: https://patch.msgid.link/3c506e8f936e52b57620269b55c348af05d413a2.1777557228.git.kai.aizen.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ip6mr: plug drop_reason to ip6mr_cache_report()

- Check mrt->mroute_sk earlier in the function.

- Use sock_queue_rcv_skb_reason() instead of sock_queue_rcv_skb().
- Use sk_skb_reason_drop() instead of kfree_skb().
Note that we return -ENOMEM if sock_queue_rcv_skb_reason() failed,
as the precise error is not really needed for callers.

- Remove one net_warn_ratelimited().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260430074004.4133602-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipmr: prevent info-leak in pmr_cache_report()

Yiming Qian reported:

<quote>
ipmr_cache_report()` allocates a report skb with `alloc_skb(128,
GFP_ATOMIC)` and appends a `struct igmphdr` using `skb_put()`. In the
non-`IGMPMSG_WHOLEPKT` path it initializes only:

- `igmp->type`
- `igmp->code`

but does not initialize:

- `igmp->csum`
- `igmp->group`

Later, `igmpmsg_netlink_event()` copies the bytes after `sizeof(struct
igmpmsg)` into the `IPMRA_CREPORT_PKT` netlink attribute and emits
`RTM_NEWCACHEREPORT` on `RTNLGRP_IPV4_MROUTE_R`.

As a result, 6 bytes of stale heap data from the skb head are
disclosed to userspace.
</quote>

Let's use skb_put_zero() instead of skb_put() to fix this bug.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260430070611.4004529-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'drm-fixes-2026-05-02' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Fixes for rc2, the usual amdgpu/xe double header, I think xe had a
  couple of weeks combined due to some maintainer access issues,
  otherwise there's just a few misc fixes and documentation fixups.

  core and helpers:
   - calculate framebuffer geometry with format helpers
   - fix docs

  amdgpu:
   - GFX12 fix for CONFIG_DRM_DEBUG_MM configs
   - Fix DC analog support
   - Userq fixes
   - GART placement fix
   - Aldebaran SMU fixes
   - AMDGPU_INFO_READ_MMR_REG fix
   - UVD 3.1 fix
   - GC 6 TCC fix
   - Fix root reservation in amdgpu_vm_handle_fault()
   - RAS fix
   - Module reload fix for APUs
   - Fix build for CONFIG_DRM_FBDEV_EMULATION=n
   - IGT DWB regression fix
   - GC 11.5.4 fix
   - VCN user fence fixes
   - JPEG user fence fixes
   - SMU 13.0.6 fix
   - VCN 3/4 IB parser fixes
   - NV3x+ dGPU vblank fix
   - DCE6/8 fixes for LVDS/eDP panels without an EDID

  amdkfd:
   - Fix for when CONFIG_HSA_AMD is not set
   - SVM fixes

  xe:
   - uapi: Add missing pad and extensions check
   - uapi: Reject unsafe PAT indices for CPU cached memory
   - Drop registration of guc_submit_wedged_fini from xe_guc_submit_wedge
   - Xe3p tuning and workaround fixes
   - USE drm mm instead of drm SA for CCS read/write
   - Fix leaks and null derefs
   - Fix Wa_18022495364

  appletbdrm:
   - allocate protocol buffers with kvzalloc()

  dma-buf:
   - fix docs

  imagination:
   - avoid segfault in debugfs

  ofdrm:
   - put PCI device reference on errors

  udl:
   - increase USB timeout"

* tag 'drm-fixes-2026-05-02' of https://gitlab.freedesktop.org/drm/kernel: (77 commits)
  drm/xe/uapi: Reject coh_none PAT index for CPU_ADDR_MIRROR
  drm/xe/uapi: Reject coh_none PAT index for CPU cached memory in madvise
  drm/xe/xelp: Fix Wa_18022495364
  drm/xe/gsc: Fix BO leak on error in query_compatibility_version()
  drm/xe/eustall: Fix drm_dev_put called before stream disable in close
  drm/xe: Fix error cleanup in xe_exec_queue_create_ioctl()
  drm/xe: Fix dma-buf attachment leak in xe_gem_prime_import()
  drm/xe: Fix bo leak in xe_dma_buf_init_obj() on allocation failure
  drm/xe/bo: Fix bo leak on GGTT flag validation in xe_bo_init_locked()
  drm/xe/bo: Fix bo leak on unaligned size validation in xe_bo_init_locked()
  drm/xe: Fix potential NULL deref in xe_exec_queue_tlb_inval_last_fence_put_unlocked
  drm/xe/vf: Use drm mm instead of drm sa for CCS read/write
  drm/xe: Add memory pool with shadow support
  drm/xe/debugfs: Correct printing of register whitelist ranges
  drm/xe: Mark ROW_CHICKEN5 as a masked register
  drm/xe/tuning: Use proper register offset for GAMSTLB_CTRL
  drm/xe/xe3p_lpg: Add missing indirect ring state feature flag
  drm/xe: Drop redundant rtp entries for Wa_14019988906 & Wa_14019877138
  drm/xe/vm: Add missing pad and extensions check
  drm/xe: Drop registration of guc_submit_wedged_fini from xe_guc_submit_wedge()
  ...

net: usb: r8152: add TRENDnet TUC-ET2G v2.0

The TRENDnet TUC-ET2G V2.0 is an RTL8156B based 2.5G Ethernet controller.

Add the vendor and product ID values to the driver. This makes Ethernet
work with the adapter.

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260430213435.21821-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ne2k: fold drivers/net/Space.c into ne.c

drivers/net/Space.c is the last remnant of the linux-2.4.x driver model
that required each subsystem and device driver init function to be called
from init/main.c explicitly, before the introduction of initcall levels.

In linux-7.0, this was only used for a handful of ISA network drivers,
with the ne2000 driver being the last one.

Fold the code into ne.c directly, with minimal changes to preserve
the existing command line parsing.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429145624.2948432-2-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cs89x0: remove ISA bus probing

The cs89x0 driver is really two in one, and they are mutually exclusive:

- the ISA driver was used on 486-era PCs. It likely has no remaining
   users, like the other ethernet drivers that got removed in
   linux-7.1. The DMA support in here is the last device driver use of
   the deprecated isa_bus_to_virt() interface, all other users are either
   x86 specific or or got converted to the normal dma-mapping interface.
   The driver was maintained by Andrew Morton at the time, based on
   the linux-2.2 vendor driver from Cirrus Logic.

- the platform_driver instance was used on some embedded Arm boards
   around the same time, such as the EP7211 Development Kit. This
   is the same chip, but uses modern devicetree based probing and no DMA.
   This was added by Alexander Shiyan.

Remove the ISA driver as a cleanup, including all of the outdated
documentation referring to its configuration.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429145624.2948432-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: tls: reshuffle the device ops check

We try to validate during registration that the netdev
has ops if it has features. This is currently somewhat sillily
written because we have a dereference before a NULL check
on the ops struct. Straighten this out.

No functional change intended other than saving ourselves
the very theoretical crash with a bad driver.

Note that we check earlier in the function that either ops
or TLS features are set for the device in question.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429213001.1908235-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains Netfilter fixes for net:

1) Replace skb_try_make_writable() by skb_ensure_writable() in
   nft_fwd_netdev and the flowtable to deal with uncloned packets
   having their network header in paged fragments.

2) Drop packet if output device does not exist and ensure sufficient
   headroom in nft_fwd_netdev before transmitting the skb.

3) Use the existing dup recursion counter in nft_fwd_netdev for the
   neigh_xmit variant, from Weiming Shi.

4) Add .check_hooks interface to x_tables to detach the control plane
   hook check based on the match/target configuration. Then, update
   nft_compat to use .check_hooks from .validate path, this fixes a
   lack of hook validation for several match/targets.

5) Fix incorrect .usersize in xt_CT, from Florian Westphal.

6) Fix a memleak with netdev tables in dormant state,
   from Florian Westphal.

7) Several patches to check if the packet is a fragment, then skip
   layer 4 inspection, for x_tables and nf_tables; as well as common
   nf_socket infrastructure. The xt_hashlimit match drops fragments
   to stay consistent with the existing approach when failing to parse
   the layer 4 protocol header.

8) Ensure sufficient headroom in the flowtable before transmitting
   the skb.

9) Fix the flowtable inline vlan approach for double-tagged vlan:
   Reverse the iteration over .encap[] since it represents the
   encapsulation as seen from the ingress path. Postpone pushing
   layer 2 header so output device is available to calculate needed
   headroom. Finally, add and use nf_flow_vlan_push() to fix it.

10) Fix flowtable inline pppoe with GSO packets. Moreover, use
    FLOW_OFFLOAD_XMIT_DIRECT to fill up destination hardware
    address since neighbour cache does not exist in pppoe.

11) Use skb_pull_rcsum() to decapsulate vlan and pppoe headers, for
    double-tagged vlan in particular this should provide some benefits
    in certain scenarios.

More notes regarding 9-11):

- sashiko is also signalling to use it for IPIP headers, but that needs
  more adjustments such setting skb->protocol after removing the IPIP
  header, will follow up in a separated patch.
- I plan to submit selftests to cover double-tagged-vlan. As for pppoe,
  it should be possible but that would mandate a few userspace dependencies.
  This has been semi-automatically  tested by me and reporters describing
  broken double-vlan-tagged and pppoe currently in the flowtable.

* tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: flowtable: use skb_pull_rcsum() to pop vlan/pppoe header
  netfilter: flowtable: fix inline pppoe encapsulation in xmit path
  netfilter: flowtable: fix inline vlan encapsulation in xmit path
  netfilter: flowtable: ensure sufficient headroom in xmit path
  netfilter: xtables: fix L4 header parsing for non-first fragments
  netfilter: nf_tables: skip L4 header parsing for non-first fragments
  netfilter: nf_socket: skip socket lookup for non-first fragments
  netfilter: nf_tables: fix netdev hook allocation memleak with dormant tables
  netfilter: xt_CT: fix usersize for v1 and v2 revision
  netfilter: nft_compat: run xt_check_hooks_{match,target}() from .validate
  netfilter: x_tables: add .check_hooks to matches and targets
  netfilter: nft_fwd_netdev: use recursion counter in neigh egress path
  netfilter: nft_fwd_netdev: add device and headroom validate with neigh forwarding
  netfilter: replace skb_try_make_writable() by skb_ensure_writable()
====================

Link: https://patch.msgid.link/20260501122237.296262-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

- Avoid writing an uninitialised stack variable to POR_EL0 on sigreturn
   if the poe_context record is absent

- Reserve one more page for the early 4K-page kernel mapping to cover
   the extra [_text, _stext) split introduced by the non-executable
   read-only mapping

- Force the arch_local_irq_*() wrappers to be __always_inline so that
   noinstr entry and idle paths cannot call out-of-line, instrumentable
   copies

- Fix potential sign extension in the arm64 SCS unwinder's DWARF
   advance_loc4 decoding

- Tolerate arm64 ACPI platforms with only WFI and no deeper PSCI idle
   states, restoring cpuidle registration on such systems

- Include the UAPI <asm/ptrace.h> header in the arm64 GCS libc test
   rather than carrying a duplicate struct user_gcs definition (the
   original #ifdef NT_ARM_GCS was wrong to cover the structure
   definition as it would be masked out if the toolchain defined it)

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: signal: Preserve POR_EL0 if poe_context is missing
  arm64: Reserve an extra page for early kernel mapping
  kselftest/arm64: Include <asm/ptrace.h> for user_gcs definition
  ACPI: arm64: cpuidle: Tolerate platforms with no deep PSCI idle states
  arm64/irqflags: __always_inline the arch_local_irq_*() helpers
  arm64/scs: Fix potential sign extension issue of advance_loc4

smb: smbdirect: fix MR registration for coalesced SG lists

ib_dma_map_sg() modifies the provided scatterlist and returns the
number of mapped entries, which can be fewer than the requested
mr->sgt.nents if the DMA controller coalesces contiguous memory
segments. Passing the original, uncoalesced count to ib_map_mr_sg()
causes memory registration failures if coalescing actually occurs.

Capture the actual mapped count returned by ib_dma_map_sg() and pass it
to ib_map_mr_sg() to ensure correct MR registration.

Also update the ib_dma_map_sg() error logging to drop the error
pointer formatting, since the return value is an integer count
rather than an error code.

Ensure a proper error code (-EIO) is assigned when DMA mapping or
MR registration fails.

Fixes: de5ef8ec3c46 ("smb: smbdirect: introduce smbdirect_mr.c with client mr code")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221408
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Yi Kuo <yi@yikuo.dev>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: introduce and use include/linux/smbdirect.h

This makes it easier to rebuild cifs.ko and ksmbd.ko against
a running kernel.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/linux-cifs/aehrPuY60VMcYGU8@infradead.org/
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: make use of DEFAULT_SYMBOL_NAMESPACE and EXPORT_SYMBOL_GPL

This is a better solution than
EXPORT_SYMBOL_FOR_MODULES(__sym, "cifs,ksmbd") as it makes
it possible to rebuild smbdirect.ko against a
running kernel and then load the existing cifs.ko and ksmbd.ko
from the running kernel.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/linux-cifs/aehrPuY60VMcYGU8@infradead.org/
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

accel/qaic: fix incorrect counter check in RAS message decode

The UE and UE_NF cases check ce_count against UINT_MAX before incrementing
their respective counters. This is logically incorrect and prevents
ue_count and ue_nf_count from incrementing when ce_count reaches UINT_MAX.

Fixes: c11a50b170e7 ("accel/qaic: Add Reliability, Accessibility, Serviceability (RAS)")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Link: https://patch.msgid.link/20260410112015.592546-1-alok.a.tiwari@oracle.com

Merge tag 'selinux-pr-20260501' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux fixes from Paul Moore:

- Ensure SELinux is always properly accessing its own sock LSM state

- Only reserve an xattr slot for SELinux if it will be used

- Fix a SELinux auditing regression in the directory avdcache

* tag 'selinux-pr-20260501' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: fix avdcache auditing
  selinux: don't reserve xattr slot when we won't fill it
  selinux: use sk blob accessor in socket permission helpers

futex: Drop CLONE_THREAD requirement for private default hash alloc

Currently need_futex_hash_allocate_default() depends on strict pthread
semantics, abusing CLONE_THREAD.  This breaks the non-concurrency
assumptions when doing the mm->futex_ref pcpu allocations, leading to
bugs[0] when sharing the mm in other ways; ie:

    BUG: KASAN: slab-use-after-free in futex_hash_put

... where the +1 bias can end up on a percpu counter that mm->futex_ref
no longer points at.

Loosen the check to cover any CLONE_VM clone, except vfork().  Excluding
vfork keeps the existing paths untouched (no overhead), and we can't
race in the first place: either the parent is suspended and the child
runs alone, or mm->futex_ref is already allocated from an earlier
CLONE_VM.

Link: https://lore.kernel.org/all/CAL_bE8LsmCQ-FAtYDuwbJhOkt9p2wwYQwAbMh=PifC=VsiBM6A@mail.gmail.com/
Fixes: d9b05321e21e ("futex: Move futex_hash_free() back to __mmput()")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 's390-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fixes from Alexander Gordeev:

- Reject zero-length writes from userspace that corrupt Debug Facility
   buffers

- Replace one s390 PCI maintainer

- Remove SCLP_OFB Kconfig option and enable the guarded code
   unconditionally

- Replace incorrect use of phys_to_folio() to virt_to_folio() in
   do_secure_storage_access()

* tag 's390-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/mm: Fix phys_to_folio() usage in do_secure_storage_access()
  s390/sclp: Remove SCLP_OFB Kconfig option
  MAINTAINERS: Replace one of the maintainers for s390/pci
  s390/debug: Reject zero-length input in debug_input_flush_fn()
  s390/debug: Reject zero-length input before trimming a newline

alarmtimer: Remove unused interfaces

All alarmtimer users are converted to alarm_start_timer(). Remove the now
unused interfaces.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20260408114952.670899355@kernel.org

netfilter: xt_IDLETIMER: Switch to alarm_start_timer()

The existing alarm_start() interface is replaced with the new
alarm_start_timer() mechanism, which does not longer queue an already
expired timer and returns the state.

Adjust the code to utilize this so it schedules the work in the case that
the timer was already expired. Unlikely to happen as the timeout is at
least a second, but not impossible especially with virtualization.

No functional change intended

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260408114952.604232981@kernel.org

power: supply: charger-manager: Switch to alarm_start_timer()

The existing alarm_start() interface is replaced with the new
alarm_start_timer() mechanism, which does not longer queue an already
expired timer and returns the state. Adjust the code to utilize this.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Link: https://patch.msgid.link/20260408114952.536945376@kernel.org

fs/timerfd: Use the new alarm/hrtimer functions

Like any other user controlled interface, timerfd based timers can be
programmed with expiry times in the past or vary small intervals.

Both hrtimer and alarmtimer provide new interfaces which return the queued
state of the timer. If the timer was already expired, then let the callsite
handle the timerfd context update so that the full round trip through the
hrtimer interrupt is avoided.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20260408114952.469141112@kernel.org

alarmtimer: Convert posix timer functions to alarm_start_timer()

Use the new alarm_start_timer() for arming and rearming posix interval
timers and for clock_nanosleep() so that already expired timers do not go
through the full timer interrupt cycle.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260408114952.400451460@kernel.org

alarmtimer: Provide alarm_start_timer()

Alarm timers utilize hrtimers for normal operation and only switch to the
RTC on suspend. In order to catch already expired timers early and without
going through a timer interrupt cycle, provide a new start function which
internally uses hrtimer_start_range_ns_user().

If hrtimer_start_range_ns_user() detects an already expired timer, it does
not queue it. In that case remove the timer from the alarm base as well.

Return the status queued or not back to the caller to handle the early
expiry.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260408114952.332822525@kernel.org

posix-timers: Switch to hrtimer_start_expires_user()

Switch the arm and rearm callbacks for hrtimer based posix timers over to
hrtimer_start_expires_user() so that already expired timers are not
queued. Hand the result back to the caller, which then queues the signal.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260408114952.266001916@kernel.org

posix-timers: Handle the timer_[re]arm() return value

The [re]arm callbacks will return true when the timer was queued and false
if it was already expired at enqueue time.

In both cases the call sites can trivially queue the signal right there,
when the timer was already expired. That avoids a full round trip through
the hrtimer interrupt.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260408114952.198028466@kernel.org

posix-timers: Expand timer_[re]arm() callbacks with a boolean return value

In order to catch expiry times which are already in the past the
timer_arm() and timer_rearm() callbacks need to be able to report back to
the caller whether the timer has been queued or not.

Change the function signature and let all implementations return true for
now. While at it simplify posix_cpu_timer_rearm().

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260408114952.130222296@kernel.org

hrtimer: Use hrtimer_start_expires_user() for hrtimer sleepers

Most hrtimer sleepers are user controlled and user space can hand arbitrary
expiry values in as long as they are valid timespecs. If the expiry value
is in the past then this requires a full loop through reprogramming the
clock event device, taking the hrtimer interrupt, waking the task and
reprogram again.

Use hrtimer_start_expires_user() which avoids the full round trip by
checking the timer for expiry on enqueue.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260408114952.062400833@kernel.org

hrtimer: Provide hrtimer_start_range_ns_user()

Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space. He provided a reproducer, which set's up a timerfd based
timer and then rearms it in a loop with an absolute expiry time of 1ns.

As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value. If the machine is fast enough, this ends up
in a endless loop of programming the delta value to the minimum value
defined by the clock event device, before the timer interrupt can fire,
which starves the interrupt and consequently triggers the lockup detector
because the hrtimer callback of the lockup mechanism is never invoked.

The clockevents code already has a last resort mechanism to prevent that,
but it's sensible to catch such issues before trying to reprogram the clock
event device.

Provide a variant of hrtimer_start_range_ns(), which sanity checks the
timer after queueing it. It does not so before because the timer might be
armed and therefore needs to be dequeued. also we optimize for the latest
possible point to check, so that the clock event prevention is avoided as
much as possible.

If the timer is already expired _before_ the clock event is reprogrammed,
remove the timer from the queue and signal to the caller that the operation
failed by returning false.

That allows the caller to take immediate action without going through the
loops and hoops of the hrtimer interrupt.

The queueing code can't invoke the timer callback as the caller might hold
a lock which is taken in the callback.

Add a tracepoint which allows to analyze the expired at start situation.

Reported-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260408114951.995031895@kernel.org

rseq: Don't advertise time slice extensions if disabled

If time slice extensions have been disabled on the kernel command line,
then advertising them in RSEQ flags is wrong.

Adjust the conditionals to reflect reality, fixup the misleading comments
about the gap of these flags and the rseq::flags field.

Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.437059375%40kernel.org
Cc: stable@vger.kernel.org

rseq: Protect rseq_reset() against interrupts

rseq_reset() uses memset() to clear the tasks rseq data. That's racy
against membarrier() and preemption.

Guard it with irqsave to cure this.

Fixes: faba9d250eae ("rseq: Introduce struct rseq_data")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.353887714%40kernel.org
Cc: stable@vger.kernel.org

rseq: Set rseq::cpu_id_start to 0 on unregistration

The RSEQ rework changed that to RSEQ_CPU_UNINITILIZED, which is obviously
incompatible. Revert back to the original behavior.

Fixes: 0f085b41880e ("rseq: Provide and use rseq_set_ids()")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.271566313%40kernel.org
Cc: stable@vger.kernel.org

selftests/rseq: Don't run tests with runner scripts outside of the scripts

The rseq selftests include two runner scripts run_param_test.sh and
run_syscall_errors_test.sh which set up the environment for test binaries
and run them with various parameters. Currently we list these test binaries
in TEST_GEN_PROGS but this results in the kselftest framework running them
directly as well as via the runners, resulting in duplication and spurious
failures when the environment is not correctly set up (eg, if glibc tries
to use rseq).

Move the binaries the runners invoke to TEST_GEN_PROGS_EXTENDED, binaries
listed there are built but not run by the framework. The param_test
benchmarks are not moved since they are not run by run_param_test.sh.

Fixes: 830969e7821a ("selftests/rseq: Implement time slice extension test")
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260423-selftests-rseq-use-runner-v1-1-e13a133754c1@kernel.org
Cc: stable@vger.kernel.org