git.ipfire.org Git - thirdparty/linux.git/log

netlink: add one debug check in nla_nest_end()

Add a DEBUG_NET_WARN_ON_ONCE(diff > U16_MAX)
to warn if the kernel sends corrupted nested attribute
to user space.

Offenders can be converted to nla_nest_end_safe().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260512155244.4137851-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-7.1-rc4).

No conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.

  Previous releases - regressions:

   - ethtool: fix NULL pointer dereference in phy_reply_size

   - netfilter:
      - allocate hook ops while under mutex
      - close dangling table module init race
      - restore nf_conntrack helper propagation via expectation

   - tcp:
      - fix potential UAF in reqsk_timer_handler().
      - fix out-of-bounds access for twsk in tcp_ao_established_key().

   - vsock: fix empty payload in tap skb for non-linear buffers

   - hsr: fix NULL pointer dereference in hsr_get_node_data()

   - eth:
      - cortina: fix RX drop accounting
      - ice: fix locking in ice_dcb_rebuild()

  Previous releases - always broken:

   - napi: avoid gro timer misfiring at end of busypoll

   - sched:
      - dualpi2: initialize timer earlier in dualpi2_init()
      - sch_cbs: Call qdisc_reset for child qdisc

   - shaper:
      - fix ordering issue in net_shaper_commit()
      - reject handle IDs exceeding internal bit-width

   - ipv6: flowlabel: enforce per-netns limit for unprivileged callers

   - tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring

   - smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint

   - sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL

   - batman-adv:
      - reject new tp_meter sessions during teardown
      - purge non-released claims

   - eth:
      - i40e: cleanup PTP registration on probe failure
      - idpf: fix double free and use-after-free in aux device error paths
      - ena: fix potential use-after-free in get_timestamp"

* tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  net: phy: DP83TC811: add reading of abilities
  net: tls: prevent chain-after-chain in plain text SG
  net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
  net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
  macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
  macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
  macsec: introduce dedicated workqueue for SA crypto cleanup
  net: net_failover: Fix the deadlock in slave register
  MAINTAINERS: update atlantic driver maintainer
  selftests/tc-testing: Add QFQ/CBS qlen underflow test
  net/sched: sch_cbs: Call qdisc_reset for child qdisc
  FDDI: defza: Sanitise the reset safety timer
  net: ethernet: ravb: Do not check URAM suspension when WoL is active
  ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
  net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
  net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
  net: atm: fix skb leak in sigd_send() default branch
  net: ethtool: phy: avoid NULL deref when PHY driver is unbound
  net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled
  net: shaper: reject QUEUE scope handle with missing id
  ...

Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fixes from Paul Moore:

- Correctly log the inheritable capabilities

- Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands

* tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
audit: fix incorrect inheritable capability in CAPSET records

ptrace: slightly saner 'get_dumpable()' logic

The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.

And almost all users do in fact use it only for the case where the task
has a mm pointer.

But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).

It's not what this flag was designed for, but it is what it is.

The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
traditional "drop capabilities" model doesn't make any difference for
this all.

Make it all make a *bit* more sense by saying that if you don't have a
MM pointer, we'll use a cached "last dumpability" flag if the thread
ever had a MM (it will be zero for kernel threads since it is never
set), and require a proper CAP_SYS_PTRACE capability to override.

Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

net: phy: DP83TC811: add reading of abilities

At this time the driver is not listing any speeds
it supports. This should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT
for DP83TC811. Add the missing call for phylib to read the abilities.

Fixes: b753a9faaf9a ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de
[pabeni@redhat.com: dropped revision history]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: tls: prevent chain-after-chain in plain text SG

Sashiko points out that if end = 0 (start != 0) the current
code will create a chain link to content type right after
the wrap link:

  This would create a chain where the wrap link points directly
  to another chain link. The scatterlist API sg_next iterator
  does not recursively resolve consecutive chain links.

meaning this is illegal input to crypto.

The wrapping link is unnecessary if end = 0. end is the entry after
the last one used so end = 0 means there's nothing pushed after
the wrap:

   end         start            i
    v            v              v
  [   ]...[   ][ d ][ d ][ d ][ d ][rsv for wrap]

Skip the wrapping in this case.

TLS 1.3 can use the "wrapping slot" for it's chaining if end = 0.
This avoids the chain-after-chain.

Move the wrap chaining before marking END and chaining off content
type, that feels like more logical ordering to me, but should not
matter from functional perspective.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring

When an sk_msg scatterlist ring wraps (sg.end < sg.start),
tls_push_record() chains the tail portion of the ring to the head
using sg_chain(). An extra entry in the sg array is reserved for
this:

  struct sk_msg_sg {
        [...]
        /* The extra two elements:
         * 1) used for chaining the front and sections when the list becomes
         *    partitioned (e.g. end < start). The crypto APIs require the
         *    chaining;
         * 2) to chain tailer SG entries after the message.
         */
        struct scatterlist              data[MAX_MSG_FRAGS + 2];

The current code uses MAX_SKB_FRAGS + 1 as the ring size:

    sg_chain(&msg_pl->sg.data[msg_pl->sg.start],
             MAX_SKB_FRAGS - msg_pl->sg.start + 1,
             msg_pl->sg.data);

This places the chain pointer at

  sg_chain(data[start], (MAX_SKB_FRAGS - msg_start + 1) .. =
  &data[start] + (MAX_SKB_FRAGS - msg_start + 1) - 1 =
  data[start + (MAX_SKB_FRAGS - start + 1) - 1] =
  data[MAX_SKB_FRAGS]

instead of the true last entry. This is likely due to a "race" of
the commit under Fixes landing close to
commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on psock down")

Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested
by Sabrina).

Reported-by: 钱一铭 <yimingqian591@gmail.com>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'bridge-add-selective-forwarding-of-gratuitous-neighbor-announcements'

Danielle Ratson says:

====================
bridge: Add selective forwarding of gratuitous neighbor announcements

The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.

This series adds a new neigh_forward_grat option that provides
independent control of gratuitous ARP and unsolicited NA forwarding.
When neigh_suppress is enabled but neigh_forward_grat is enabled,
regular neighbor discovery is suppressed while gratuitous announcements
are forwarded.

The implementation marks gratuitous ARPs and unsolicited NAs in
BR_INPUT_SKB_CB during input processing, then checks the per-output-port
neigh_forward_grat setting during flooding. This allows gratuitous
announcements from any input port to be selectively forwarded based on
each output port's individual configuration.

Both port-level control (via IFLA_BRPORT_NEIGH_FORWARD_GRAT) and
per-VLAN control (via BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT) are
provided. The default value of OFF preserves existing behavior.

This behavior is in accordance with RFC 9161 (Section 3.6), which
recommends that VTEPs forward gratuitous ARP and unsolicited NA messages
to avoid traffic disruption during host mobility events.

The new attributes use NLA_U8, although the kernel netlink guideline
recommends NLA_U32 as the minimum integer type on the grounds that
alignment makes smaller types equivalent on the wire. For a simple
on/off attribute there is no technical advantage to u32 over u8, and
keeping u8 preserves consistency with all surrounding bridge port
attributes and avoids introducing new helpers alongside the existing
infrastructure.

Patchset overview:
Patch #1: adds uapi headers.
Patches #2-#3: support selective forwarding of gratuitous ARP.
Patches #4-#5: add netlink handling.
Patch #6: adds tests.

Please see iproute related patches in the last 3 commits of:
https://github.com/daniellerts/iproute2
====================

Link: https://patch.msgid.link/20260511065936.4173106-1-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: net: Add tests for neigh_forward_grat option

Add tests to validate the neigh_forward_grat bridge option for selective
forwarding of gratuitous neighbor announcements.

The tests verify per-port and per-VLAN control of gratuitous neighbor
announcement forwarding for both IPv4 (gratuitous ARP) and IPv6
(unsolicited NA):
- When neigh_suppress is enabled with neigh_forward_grat off (default),
gratuitous announcements are suppressed
- When neigh_forward_grat is enabled, gratuitous announcements are
forwarded while regular neighbor discovery remains suppressed

For IPv4, use arping to send gratuitous ARP packets. For IPv6, use
mausezahn to craft unsolicited Neighbor Advertisement packets.

For the per-port tests, the IPv4 test exercises the ip link interface,
while the IPv6 test exercises the bridge link interface.
The per-VLAN tests use the bridge interface throughout, as per-VLAN
attributes are only accessible via 'bridge vlan'.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260511065936.4173106-7-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add per-VLAN netlink handling for neigh_forward_grat

Add netlink handlers for the per-VLAN neigh_forward_grat option via
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT attribute.

The per-VLAN option provides fine-grained control, allowing different
VLANs on the same port to have different gratuitous ARP/unsolicited NA
forwarding behavior.

This enables control via 'bridge' commands:
# bridge vlan set dev eth0 vid 10 neigh_suppress on
# bridge vlan set dev eth0 vid 10 neigh_forward_grat on

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-6-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add port-level netlink handling for neigh_forward_grat

Add netlink handlers for the port-level neigh_forward_grat option via
IFLA_BRPORT_NEIGH_FORWARD_GRAT attribute.

The default value of OFF preserves existing behavior, i.e. gratuitous ARP
and unsolicited NA are suppressed when neigh_suppress is enabled. Users can
explicitly set it to ON to allow these packets through.

Example for enabling control via 'bridge link' command:
# bridge link set dev eth0 neigh_suppress on
# bridge link set dev eth0 neigh_forward_grat on

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-5-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add selective forwarding of gratuitous neighbor announcements

The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.

Add the neigh_forward_grat option to allow selective control of gratuitous
neighbor announcements. When neigh_suppress is enabled but
neigh_forward_grat is disabled (default), gratuitous announcements are
suppressed. When neigh_forward_grat is enabled, gratuitous announcements
are forwarded while regular neighbor discovery remains suppressed.

The implementation provides per-output-port control by:
1. Adding a 'grat_arp' flag to BR_INPUT_SKB_CB to mark gratuitous ARPs and
   unsolicited NAs.
2. Setting both grat_arp and proxyarp_replied flags in
   br_do_proxy_suppress_arp() and br_do_suppress_nd() when gratuitous
   packets are detected.
3. Checking neigh_forward_grat per output port during flooding:
   - For gratuitous ARPs/NAs: suppress unless the output port has
     neigh_forward_grat enabled.
   - For regular ARPs/NDs: maintain existing behavior.

This allows gratuitous announcements from any input port to be selectively
forwarded based on each output port's individual neigh_forward_grat
setting, enabling gratuitous neighbor announcements to be flooded to the
VXLAN fabric.

Regular neighbor discovery (ARP requests, NS queries, solicited replies)
remains controlled by neigh_suppress and is unaffected.

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-4-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: Add internal flags for neigh_forward_grat

Add internal flags for the neigh_forward_grat feature:

- BR_NEIGH_FORWARD_GRAT: Port-level flag
- BR_VLFLAG_NEIGH_FORWARD_GRAT_ENABLED: Per-VLAN flag

These will be used to control whether gratuitous ARP and unsolicited NA
packets are forwarded when neighbor suppression is enabled.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-3-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bridge: uapi: Add neigh_forward_grat netlink attributes

Add netlink attributes for controlling gratuitous ARP and unsolicited NA
forwarding when neighbor suppression is enabled.

Add IFLA_BRPORT_NEIGH_FORWARD_GRAT for port-level control and
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT for per-VLAN control.

The new attributes provide independent control of gratuitous ARP and
unsolicited NA packets. Operators can enable forwarding for those packets
for fast mobility across VTEPs while keeping general neighbor suppression
active.

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260511065936.4173106-2-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot

On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is
reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt()
populates V2 entries starting at index 1, so when no V1 device is
selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] ==
NULL and ism_chid[0] == 0.

smc_v2_determine_accepted_chid() then matches the peer's CHID against
the array starting from index 0 using the CHID alone. A malicious
peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches
the empty slot, ini->ism_selected becomes 0, and the subsequent
ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at
offsetof(struct smcd_dev, lgr_lock) == 0x68:

  BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0
  Write of size 4 at addr 0000000000000068 by task exploit/144
  Call Trace:
   _raw_spin_lock_bh
   smc_conn_create (net/smc/smc_core.c:1997)
   __smc_connect (net/smc/af_smc.c:1447)
   smc_connect (net/smc/af_smc.c:1720)
   __sys_connect
   __x64_sys_connect
   do_syscall_64

Require ism_dev[i] to be non-NULL before accepting a CHID match.

Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sched-netem-enhancements'

Stephen Hemminger says:

====================
net/sched: netem: enhancements

This is a collection of improvements to netem found while
investigating the fixes now in net tree.

v5 - minor cleanups
- add allocation_errors counter
- align counters with where impairment occurs
====================

Link: https://patch.msgid.link/20260509171123.307549-1-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: add per-impairment extended statistics

Add 64-bit counters for each impairment netem applies (delay, loss,
ECN marking, corruption, duplication, reordering) and for skb
allocation failures during enqueue. Exposed through TCA_STATS_APP
as struct tc_netem_xstats.

Counters increment when an impairment is occurs, independent of later
events that may mask its on-wire effect. Added allocation_errors
(similar to sch_fq) to account for when impairment could not be
applied due to memory pressure, etc.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-6-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: handle multi-segment skb in corruption

The packet corruption code only flipped bits in the linear
header portion of the skb, skipping corruption when
skb_headlen() was zero.

Linearize the whole skb if necessary before corruption.
Extends d64cb81dcbd5 ("net/sched: sch_netem: fix out-of-bounds access
in packet corruption") with a more general solution.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-5-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: replace pr_info with netlink extack error messages

Use netlink extack to report errors instead of sending them
to the kernel log with pr_info(). The error message can them be seen
with tc commands; and avoids log spam.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-4-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: remove useless VERSION

The version printed was never updated and kernel version is
better indication of what is fixed or not.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-3-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: netem: reorder struct netem_sched_data

The current layout of struct netem_sched_data can be improved
by optimizing cache locality, compacting data types (use u8
for enum) and eliminating unused elements.

Reorganize the struct as follows:

  - Cacheline 0 holds the tfifo state (t_root/t_head/t_tail/t_len),
    counter, and the unconditional enqueue scalars
    latency/jitter/rate/gap/loss.

  - Cacheline 1 holds the remaining zero-check scalars
    (duplicate/reorder/corrupt/ecn), all five crndstate correlation
    structures, and loss_model.

  - Cacheline 2 holds prng, delay_dist, the slot dequeue state,
    slot_dist, and the inner classful qdisc pointer.

  - Rate-shaping fields, q->limit (config-only; the fast path reads
    sch->limit), and the CLG Markov state move to the warm tail.

  - tc_netem_slot slot_config and qdisc_watchdog (only consulted on
    slot reschedule and watchdog wake) move to the cold tail.

Also reorder struct clgstate to place the u8 state member after the
u32 transition probabilities.  This removes the 3-byte interior hole
without changing the struct's size.

Should have no functional change.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/20260509171123.307549-2-stephen@networkplumber.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sched-refine-fq_codel-memory-limits'

Eric Dumazet says:

====================
net/sched: refine fq_codel memory limits

Packets that are associated with local sockets sk_wmem_alloc
do not really need additional memory control.

First patch makes is_skb_wmem() available to modules.

Second patch uses is_skb_wmem() in fq_codel.
====================

Link: https://patch.msgid.link/20260512094859.3673997-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: fq_codel: local packets no longer count against memory limit

Commit 95b58430abe7 ("fq_codel: add memory limitation per queue")
claimed that the 32Mb default was "reasonable even for heavy duty usages."

In practice, this is not the case.

Packets that are associated with local sockets sk_wmem_alloc
do not really need additional memory control.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260512094859.3673997-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: make is_skb_wmem() available to modules

Following patch will use is_skb_wmem() from fq_codel.

Provide __sock_wfree() only if CONFIG_INET=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260512094859.3673997-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5e-improve-rss-indirection-table-sizing-and-resizing'

Tariq Toukan says:

====================
net/mlx5e: improve RSS indirection table sizing and resizing

This series by Yael improves mlx5e RSS indirection table handling around
channel count changes and large RSS configurations.

The series:
* removes the XOR8-specific channel count limitation,
* advertises the maximum supported RSS indirection table size,
* fixes resizing of non-default RSS contexts,
* allows resizing configured default RSS contexts during channel
changes,
* and increases the default RSS spread factor from 2x to 4x to improve
traffic distribution for large channel counts.

Together, these changes make RSS table sizing more flexible and robust,
while improving load balancing behavior on large systems.
====================

Link: https://patch.msgid.link/20260511172719.330490-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: increase RSS indirection table spread factor

Increase the RQT uniform spread factor from 2 to 4 so that each channel
gets more indirection table entries and traffic is spread more evenly.
For num_channels > 64 imbalance drops from up to ~50% to up to ~25%.
For 64 or fewer channels the 256 entry minimum already provides at least
4x coverage and the table size is unchanged by this commit.

This satisfies the minimum 4x coverage requirement validated by the
generic RSS selftest commit 9e3d4dae9832 ("selftests: drv-net: rss:
validate min RSS table size").

The 4x spread factor is best-effort and the table size is always capped by
the device's log_max_rqt_size capability.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: resize configured default RSS context table on channel change

mlx5e_ethtool_set_channels() rejected channel count changes that
required a different RQT size when the default context indirection
table was user-configured. This restriction was introduced by
commit ee3572409f74 ("net/mlx5e: RSS, Block changing channels number
when RXFH is configured").

Lift the restriction. Validate the resize upfront with
ethtool_rxfh_indir_can_resize(), then fold or unfold the table
in-place via ethtool_rxfh_indir_resize() inside state_lock, before
mlx5e_safe_switch_params(), so the preactivate callback sees the
correct table content when it programs the HW.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: resize non-default RSS indirection tables on channel change

When the channel count changes and the RQT size changes with it, a
problem arise for non-default RSS contexts. The driver-side indirection
table grows actual_table_size without filling the new entries; stale
entries from a prior larger configuration may be re-exposed, causing
mlx5e_calc_indir_rqns() to WARN on an out-of-range index.

Replace mlx5e_rss_params_indir_modify_actual_size() with
mlx5e_rss_ctx_resize(), which fills new entries by replicating
the existing pattern, matching what ethtool_rxfh_ctxs_resize() does
for the same case. And restrict the loop to non-default contexts.

Call ethtool_rxfh_ctxs_can_resize() before acquiring state_lock to
validate that all non-default contexts can be resized, and
ethtool_rxfh_ctxs_resize() after releasing it to fold or unfold their
indirection tables. Both functions acquire rss_lock internally and
cannot be called under state_lock. RTNL, held by all set_channels
callers, serialises context creation and deletion making the pre-lock
check safe.

Guard both ethtool calls on mlx5e_rx_res_rss_cnt() > 1: skip the
validation and resize when no non-default contexts exist. This
naturally covers representors and IPoIB, which share
mlx5e_ethtool_set_channels() but cannot have non-default RSS contexts.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: advertise max RSS indirection table size to ethtool

Set rxfh_indir_space to the maximum indirection table size the driver
can support: the next power of two above MLX5E_MAX_NUM_CHANNELS times
MLX5E_UNIFORM_SPREAD_RQT_FACTOR.

Without this, ethtool_rxfh_ctxs_can_resize() returns -EINVAL, blocking
non-default RSS contexts from tracking indirection table size changes
when the channel count changes.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: remove channel count limit for XOR8 RSS hash

mlx5e_ethtool_set_channels() and mlx5e_rxfh_hfunc_check() rejected
channel counts that would produce an indirection table larger than 256
entries when the XOR8 hash function was active. This check was
introduced in commit 49e6c9387051 ("net/mlx5e: RSS, Block XOR hash
with over 128 channels").

XOR8 yields an 8-bit hash, so in practice only up to 256 entries in the
indirection table can be reached due to limited entropy. However, this
does not provide a strong justification for prohibiting larger
indirection tables. Remove the limitation.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'macsec-use-rcu_work-to-fix-crypto-cleanup-in-softirq-context'

Jinliang Zheng says:

====================
macsec: use rcu_work to fix crypto cleanup in softirq context

From: Jinliang Zheng <alexjlzheng@tencent.com>

crypto_free_aead() can internally call vunmap() (e.g. via dma_free_attrs()
in hardware crypto drivers like hisi_sec2), which must not be invoked from
softirq context. Both free_rxsa() and free_txsa() are RCU callbacks that
run in softirq, causing a kernel crash on affected hardware.

This series fixes the issue by deferring the actual cleanup to a workqueue
using rcu_work, which combines the RCU grace period and workqueue dispatch
into a single primitive.

Two design decisions worth noting:

1. rcu_work instead of schedule_work() + synchronize_rcu()

   An alternative would be to call schedule_work() directly from
   macsec_rxsa_put()/macsec_txsa_put(), then call synchronize_rcu() at
   the start of the work handler to replace the grace period previously
   provided by call_rcu(). However, synchronize_rcu() blocks the worker
   thread for the duration of a full RCU grace period. Under high SA
   churn (e.g. tearing down an interface with many SAs), each SA would
   occupy a worker thread while waiting, and multiple concurrent calls
   cannot share the same grace period — leading to unnecessary latency
   and resource waste.

   rcu_work uses call_rcu_hurry() internally, which is fully asynchronous:
   the worker thread is only dispatched after the grace period has elapsed,
   and multiple concurrent queue_rcu_work() calls naturally batch under the
   same grace period via the RCU subsystem's existing coalescing mechanism.

2. Dedicated workqueue instead of system_wq

   Using a dedicated workqueue (macsec_wq) allows macsec_exit() to drain
   exactly the work items belonging to this module — by calling
   destroy_workqueue() after rcu_barrier(). If system_wq were used,
   flush_scheduled_work() would drain all pending work items across the
   entire system, creating unnecessary coupling with unrelated subsystems
   and potentially causing unexpected delays. The dedicated workqueue
   provides a clean, contained teardown path.
====================

Link: https://patch.msgid.link/20260511153102.2640368-1-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: use rcu_work to defer TX SA crypto cleanup out of softirq

free_txsa() is an RCU callback running in softirq context, but calls
crypto_free_aead() which can invoke vunmap() internally on hardware
crypto drivers (e.g. hisi_sec2), triggering a kernel crash.

Use rcu_work to defer the cleanup to a workqueue, for the same reasons
as the analogous fix to free_rxsa() in the previous patch.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-4-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: use rcu_work to defer RX SA crypto cleanup out of softirq

crypto_free_aead() can internally invoke vunmap() (e.g. via
dma_free_attrs() in hardware crypto drivers such as hisi_sec2).
vunmap() must not be called from softirq context, but free_rxsa()
is an RCU callback that runs in softirq, leading to a kernel crash:

  vunmap+0x4c/0x70
  __iommu_dma_free+0xd0/0x138
  dma_free_attrs+0xf4/0x100
  sec_aead_exit+0x64/0xb8 [hisi_sec2]
  crypto_destroy_tfm+0x98/0x110
  free_rxsa+0x28/0x50 [macsec]
  rcu_do_batch+0x184/0x460
  rcu_core+0xf4/0x1f8
  handle_softirqs+0x118/0x330

Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches
the worker asynchronously after the RCU grace period, so no thread
blocks waiting, and concurrent releases of multiple SAs naturally
share the same grace period.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-3-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: introduce dedicated workqueue for SA crypto cleanup

Introduce a dedicated ordered workqueue, macsec_wq, which will be used
by subsequent patches to defer SA crypto cleanup (crypto_free_aead and
related teardown) out of softirq context.

Using a dedicated workqueue instead of system_wq allows macsec_exit()
to drain exactly the work items belonging to this module via
destroy_workqueue(), without interfering with unrelated work items on
system_wq or causing unexpected delays elsewhere.

rcu_barrier() in macsec_exit() ensures all in-flight rcu_work callbacks
have enqueued their work items before destroy_workqueue() drains and
destroys the queue, making the two-step teardown correct and complete.
The same sequence is kept in the error path of macsec_init() as a
precaution, to mirror macsec_exit() and stay safe if work ever becomes
queueable before this point in the future.

While at it, rename the error labels in macsec_init() from the
resource-named style (rtnl:, notifier:, wq:) to the err_xxx: style
(err_rtnl:, err_notifier:, err_destroy_wq:) to align with the broader
kernel convention.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-2-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: net_failover: Fix the deadlock in slave register

There is netdev_lock_ops() before the NETDEV_REGISTER notifier
in register_netdevice(), so use the non-locking functions
in net_failover_slave_register().
failover_slave_register() in failover_existing_slave_register() adds lock
and unlock ops too.

Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_preempt_disabled+0x15/0x30
__mutex_lock.constprop.0+0x538/0x9e0
__mutex_lock_slowpath+0x13/0x20
mutex_lock+0x3b/0x50
dev_set_mtu+0x40/0xe0
net_failover_slave_register+0x24/0x280
failover_slave_register+0x103/0x1b0
failover_event+0x15e/0x210
? dropmon_net_event+0xac/0xe0
notifier_call_chain+0x5e/0xe0
raw_notifier_call_chain+0x16/0x30
call_netdevice_notifiers_info+0x52/0xa0
register_netdevice+0x5f4/0x7c0
register_netdev+0x1e/0x40
_mlx5e_probe+0xe2/0x370 [mlx5_core]
mlx5e_probe+0x59/0x70 [mlx5_core]
? __pfx_mlx5e_probe+0x10/0x10 [mlx5_core]

Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Signed-off-by: Faicker Mo <faicker.mo@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: update atlantic driver maintainer

Igor Russkikh and Egor Pomozov have left Marvell.
Take over maintenance of the atlantic driver and its PTP subsystem.

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netpoll-move-out-netconsole-specific-functions'

Breno Leitao says:

====================
netpoll: move out netconsole-specific functions

netpoll and netconsole were created together and their code has
been intermixed in net/core/netpoll.c for decades. The result is
that netpoll exposes two send-side interfaces:

* a generic "give me an sk_buff" path used by every stacked-device
driver (bonding, team, vlan, bridge, macvlan, dsa),

* a second path that takes raw bytes and builds a UDP/IP/Ethernet
packet -- exclusively for netconsole.

The packet builder, an skb pool allocator, and several
netconsole-specific helpers all live next to the generic plumbing even
though no other consumer ever touches them.

Worse, every netpoll user pays for that overlap: struct netpoll carries
an skb_pool and a refill work_struct that only netconsole's find_skb()
ever reads from, and net-core has to review unrelated changes (TTL, hop
limit, IP ID generation, source MAC selection, pool sizing) just because
they happen to be coded inside netpoll.

This is a waste of memory for something useless.

This series splits the netconsole-specific code out:

* netpoll_send_udp() and its private helpers (push_ipv6, push_ipv4,
push_eth, push_udp, netpoll_udp_checksum, find_skb) move into
drivers/net/netconsole.c, leaving netpoll with a single skb-only
send interface that is the same for every user.

The moves are one function per patch for reviewability; helpers are
temporarily EXPORT_SYMBOL_GPL'd while netpoll_send_udp() is still in
netpoll calling them, then those exports are dropped together once
netpoll_send_udp() itself moves.

The only new permanent export is zap_completion_queue(), needed because
find_skb() still drains the per-CPU TX completion queue before
allocating.

struct netpoll is unchanged in this series; making the pool itself
netconsole-private (and reclaiming the skb_pool / refill_wq fields for
the rest of netpoll's users) is the natural follow-up, once this patchset
lands.
====================

Link: https://patch.msgid.link/20260512-netconsole_split-v2-0-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move find_skb() from netpoll

find_skb() is the netconsole-specific entry into the netpoll skb
pool: every other netpoll consumer (bonding, team, vlan, bridge,
macvlan, dsa) builds its own sk_buff and never touches the pool.
With netpoll_send_udp() (its only caller) now living in netconsole,
find_skb() can join it.

Move find_skb() into drivers/net/netconsole.c as a file-static
helper, drop EXPORT_SYMBOL_GPL(find_skb) and remove its prototype
from include/linux/netpoll.h. find_skb() drains TX completions via
netpoll_zap_completion_queue(), which is already exported in the
NETDEV_INTERNAL namespace, so netconsole picks up
MODULE_IMPORT_NS("NETDEV_INTERNAL") to consume it.

The skb pool's lifecycle (np->skb_pool, np->refill_wq, refill_skbs(),
refill_skbs_work_handler(), skb_pool_flush()) stays in netpoll: it
is initialised in __netpoll_setup() and torn down in
__netpoll_cleanup(), both of which remain netpoll's responsibility.
The refill work queued via schedule_work(&np->refill_wq) from the
moved find_skb() runs refill_skbs_work_handler() in netpoll without
any further plumbing.

This is pure code motion: the function body is unchanged and its
sole caller (netpoll_send_udp(), already moved by an earlier patch)
keeps invoking it the same way. Pre-existing concerns about
find_skb() running from NMI/printk context (zap_completion_queue()
re-entry, skb_pool spinlocks, GFP_ATOMIC allocation, fallback skb
sizing vs. MAX_SKB_SIZE, PREEMPT_RT semantics of __kfree_skb()) are
inherited as-is and are not addressed here; they predate this
series and are out of scope. Fixing them is left for follow-up
work.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-9-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netpoll: rename and export netpoll_zap_completion_queue()

zap_completion_queue() drains the per-CPU softnet completion queue.
Rename it with the netpoll_ prefix shared by the rest of the
subsystem's public API, and promote it from file-static to
EXPORT_SYMBOL_NS_GPL in the NETDEV_INTERNAL namespace so the upcoming
netconsole-side find_skb() can call it once the function moves out.
A forward declaration is added to include/linux/netpoll.h, and the
old file-static forward declaration is dropped.

No functional change.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-8-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move netpoll_udp_checksum() from netpoll

netpoll_udp_checksum() computes the UDP checksum for netconsole's
packets. Move it into drivers/net/netconsole.c as a file-static
helper; drop its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

This was the last csum_ipv6_magic() consumer in net/core/netpoll.c,
so drop the now-stale <net/ip6_checksum.h> include there. Pull it
into netconsole.c so the moved code keeps building.

It was also the last udp_hdr() consumer in net/core/netpoll.c. The
file no longer needs anything from <net/udp.h> (the UDP socket-layer
helpers); MAX_SKB_SIZE only needs struct udphdr, which is provided
by the lighter <linux/udp.h>. Swap the include accordingly.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-7-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_udp() from netpoll

push_udp() builds the UDP header (and triggers the checksum) for
netconsole's UDP packets. Move it into drivers/net/netconsole.c as
a file-static helper; drop its EXPORT_SYMBOL_GPL and remove the
prototype from include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-6-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_eth() from netpoll

push_eth() builds the Ethernet header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-5-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_ipv4() from netpoll

push_ipv4() builds the IPv4 header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

put_unaligned() is no longer used in net/core/netpoll.c, so drop
the now-stale <linux/unaligned.h> include from there. Pull it into
netconsole.c so the moved code keeps building.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-4-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_ipv6() from netpoll

push_ipv6() builds the IPv6 header for netconsole's UDP packets.
Its only caller, netpoll_send_udp(), now lives in netconsole, so
the helper can move there as a file-static function. Drop its
EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-3-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move netpoll_send_udp() from netpoll

Move netpoll_send_udp() from net/core/netpoll.c into
drivers/net/netconsole.c as a static helper, drop EXPORT_SYMBOL(),
and remove the prototype from include/linux/netpoll.h.

netconsole was the only in-tree caller of this entry point. Every
other netpoll consumer (bonding, team, vlan, bridge, macvlan, dsa)
already builds its own sk_buff and hands it to netpoll_send_skb(),
so the netpoll send-side interface is now skb-only.

The helpers it depends on (find_skb(), push_ipv6(), push_ipv4(),
push_udp(), push_eth(), netpoll_udp_checksum()) were exposed in
the previous patches and stay in net/core/netpoll.c for now.
Subsequent patches move each of them into netconsole one at a time
and drop the corresponding EXPORT_SYMBOL_GPL.

Pull <linux/ip.h>, <linux/ipv6.h> and <linux/udp.h> into netconsole.c
so the moved code can name the header structures.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-2-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netpoll: expose UDP packet builder helpers for netconsole

Promote each from file-static to EXPORT_SYMBOL_GPL and forward-
declare them in include/linux/netpoll.h so netconsole can call
them once netpoll_send_udp() moves out.

These exports are kept until the end of the series, when
al of them move into netconsole.

No functional change.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-1-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: pegasus: replace simple_strtoul with kstrtouint

simple_strtoul() is deprecated as it has no error checking. Replace it
with kstrtouint() which returns an error code on invalid input, and add
appropriate error handling.

Also add a NULL check before parsing flags, since strsep() can set id
to NULL if the input has fewer tokens than expected.

Preserve the original behavior for a trailing colon by checking *id
before parsing flags, so an empty string results in flags = 0 rather
than an error.

Signed-off-by: Sajal Gupta <sajal2005gupta@gmail.com>
Link: https://patch.msgid.link/20260509095518.2640-1-sajal2005gupta@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipvlan: use netif_receive_skb() in ipvlan_process_multicast()

ipvlan_process_multicast() runs from process context, there is no
risk of stack overflow if we call netif_receive_skb() instead
of netif_rx().

This avoids some overhead adding/removing skbs to/from a per-cpu
backlog and raising/processing NET_RX softirqs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260512042019.3300975-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/tc-testing: Add QFQ/CBS qlen underflow test

Since CBS was not calling reset for its child qdisc, there are scenarios
where it could cause an underflow on its parent's qlen/backlog. When the
parent is QFQ, a null-ptr deref could occur.

Add a test case that reproduces the underflow followed by a null-ptr
deref scenario.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_cbs: Call qdisc_reset for child qdisc

During a reset, CBS is not calling reset on its child qdisc, which
might cause qlen/backlog accounting issues. For example, if we have CBS
with a QFQ parent and a netem child with delay, we can create a scenario
where the parent's qlen underflows. QFQ, specifically, uses qlen to
check whether it should deference a pointer, so this scenario may cause
a null-ptr deref in QFQ:

[   43.875639][  T319] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] SMP KASAN NOPTI
[   43.876124][  T319] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[   43.876417][  T319] CPU: 10 UID: 0 PID: 319 Comm: ping Not tainted 7.0.0-13039-ge728258debd5 #773 PREEMPT(full)
[   43.876751][  T319] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   43.876949][  T319] RIP: 0010:qfq_dequeue+0x35c/0x1650
[   43.877123][  T319] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b
[   43.877648][  T319] RSP: 0018:ffff8881017ef4f0 EFLAGS: 00010216
[   43.877845][  T319] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[   43.878073][  T319] RDX: 0000000000000009 RSI: 0000000c40000000 RDI: ffff88810eef02b0
[   43.878306][  T319] RBP: ffff88810eef0000 R08: ffff88810eef0280 R09: 1ffff1102120fd63
[   43.878523][  T319] R10: 1ffff1102120fd66 R11: 1ffff1102120fd67 R12: 0000000c40000000
[   43.878742][  T319] R13: ffff88810eef02b8 R14: 0000000000000048 R15: 0000000020000000
[   43.878959][  T319] FS:  00007f9c51c47c40(0000) GS:ffff88817a0be000(0000) knlGS:0000000000000000
[   43.879214][  T319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   43.879403][  T319] CR2: 000055e69a2230a8 CR3: 000000010c07a000 CR4: 0000000000750ef0
[   43.879621][  T319] PKRU: 55555554
[   43.879735][  T319] Call Trace:
[   43.879844][  T319]  <TASK>
[   43.879924][  T319]  __qdisc_run+0x169/0x1900
[   43.880075][  T319]  ? dev_qdisc_enqueue+0x8b/0x210
[   43.880222][  T319]  __dev_queue_xmit+0x2346/0x37a0
[   43.880376][  T319]  ? register_lock_class+0x3f/0x800
[   43.880531][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.880684][  T319]  ? __pfx___dev_queue_xmit+0x10/0x10
[   43.880834][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.880977][  T319]  ? __lock_acquire+0x819/0x1df0
[   43.881124][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.881275][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.881418][  T319]  ? __asan_memcpy+0x3c/0x60
[   43.881563][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.881708][  T319]  ? eth_header+0x165/0x1a0
[   43.881853][  T319]  ? lockdep_hardirqs_on_prepare+0xdb/0x1a0
[   43.882031][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.882174][  T319]  ? neigh_resolve_output+0x3cc/0x7e0
[   43.882325][  T319]  ? srso_alias_return_thunk+0x5/0xfbef5
[   43.882471][  T319]  ip_finish_output2+0x6b6/0x1e10

Fix this by calling qdisc_reset for CBS' child qdisc.
Sashiko caught an issue which could result in a null ptr deref if
qdisc_create_dflt() is invoked on an unitialised cbs qdisc which is exposed
by this patch. We add an early return if the qdisc is null to address this.
This is a similar approach used by two other fixes[1][2].

The proper fix for this specific issue elucidated by sashiko is to remove
the call to qdisc_reset when qdisc_create_dflt fails. Since the dflt qdisc
isn't attached anywhere yet at that point, calling the reset callback doesn't
make much sense (and as stated has been a source of two other bugs).
We plan on  submitting this fix in a later patch.
[1] https://lore.kernel.org/netdev/20221018063201.306474-2-shaozhengchao@huawei.com/
[2] https://lore.kernel.org/netdev/20221018063201.306474-4-shaozhengchao@huawei.com/

Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
Reported-by: Junyoung Jang <graypanda.inzag@gmail.com>
Tested-by: Junyoung Jang <graypanda.inzag@gmail.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tun-tap-vhost-net-apply-qdisc-backpressure-on-full-ptr_ring-to-reduce-tx-drops'

Simon Schippers says:

====================
tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops

This patch series deals with tun/tap & vhost-net which drop incoming
SKBs whenever their internal ptr_ring buffer is full. Instead, with this
patch series, the associated netdev queue is stopped - but only when a
qdisc is attached. If no qdisc is present the existing behavior is
preserved. The XDP transmit path is not affected. This patch series
touches tun/tap and vhost-net, as they share common logic and must be
updated together. Modifying only one of them would break the other.

By applying proper backpressure, this change allows the connected qdisc to
operate correctly, as reported in [1], and significantly improves
performance in real-world scenarios, as demonstrated in our paper [2]. For
example, we observed a 36% TCP throughput improvement for an OpenVPN
connection between Germany and the USA.

Synthetic pktgen benchmarks indicate a slight regression, and packet
loss is reduced to near zero. Pktgen benchmarks are provided per commit,
with the final commit showing the overall performance.

Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
====================

Link: https://patch.msgid.link/20260510151529.43895-1-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present

This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once the ring reaches capacity after a produce attempt,
the netdev queue is stopped instead of dropping subsequent packets.
If no qdisc is present, the previous tail-drop behavior is preserved.

If producing an entry fails anyway due to a race, tun_net_xmit() drops
the packet. Such races are expected because LLTX is enabled and the
transmit path operates without the usual locking.

The __tun_wake_queue() function of the consumer races with the producer
for waking/stopping the netdev queue, which could result in a stalled
queue. Therefore, an smp_mb__after_atomic() is introduced that pairs
with the smp_mb() of the consumer. It follows the principle of store
buffering described in tools/memory-model/Documentation/recipes.txt:

- The producer in tun_net_xmit() first sets __QUEUE_STATE_DRV_XOFF,
  followed by an smp_mb__after_atomic() (= smp_mb()), and then reads the
  ring with __ptr_ring_check_produce().

- The consumer in __tun_wake_queue() first writes zero to the ring in
  __ptr_ring_consume(), followed by an smp_mb(), and then reads the queue
  status with netif_tx_queue_stopped().

=> Following the aforementioned principle, it is impossible for the
   producer to see a full ring (and therefore not wake the queue on the
   re-check) while the consumer simultaneously fails to see a stopped
   queue (and therefore also does not wake it).

Benchmarks:
The benchmarks show a slight regression in raw transmission performance
when using two sending threads. Packet loss also occurs only in the
two-thread sending case; no packet loss was observed with a single
sending thread.

Test setup:
AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
mitigations disabled.

Note for tap+vhost-net:
XDP drop program active in VM -> ~2.5x faster; slower for tap due to
more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)

+--------------------------+--------------+----------------+----------+
| 1 thread                 | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Received    | 1.132 Mpps   | 1.123 Mpps     | -0.8%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 3.765 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Received    | 3.857 Mpps   | 3.901 Mpps     | +1.1%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 0.802 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

+--------------------------+--------------+----------------+----------+
| 2 threads                | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Received    | 1.115 Mpps   | 1.081 Mpps     | -3.0%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 8.490 Mpps   | 391 pps        |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Received    | 3.664 Mpps   | 3.555 Mpps     | -3.0%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 5.330 Mpps   | 938 pps        |          |
+------------+-------------+--------------+----------------+----------+

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260510151529.43895-5-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptr_ring: move free-space check into separate helper

This patch moves the check for available free space for a new entry into
a separate function. Existing callers that only check for a non-zero
return value are unaffected; __ptr_ring_produce() now returns -EINVAL
for a zero-size ring and -ENOSPC when full, whereas before both cases
returned -ENOSPC. The new helper allows callers to determine in advance
whether subsequent __ptr_ring_produce() calls will succeed. This
information can, for example, be used to temporarily stop producing until
__ptr_ring_check_produce() indicates that space is available again.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260510151529.43895-4-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vhost-net: wake queue of tun/tap after ptr_ring consume

Add tun_wake_queue() to tun.c and export it for use by vhost-net. The
function validates that the file belongs to a tun/tap device and that
the tfile exists, dereferences the tun_struct under RCU, and delegates
to __tun_wake_queue().

vhost_net_buf_produce() now calls tun_wake_queue() after a successful
batched consume of the ring to allow the netdev subqueue to be woken up.
The point is to allow the queue to be stopped when it gets full, which
is required for traffic shaping - implemented by the following
"avoid ptr_ring tail-drop when a qdisc is present".

Without the corresponding queue stopping, this patch alone causes no
throughput regression for a tap+vhost-net setup sending to a qemu VM:
3.857 Mpps to 3.891 Mpps.

Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, XDP drop program active in VM, pktgen sender; Avg over
50 runs @ 100,000,000 packets. SRSO and spectre v2 mitigations disabled.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260510151529.43895-3-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun/tap: add ptr_ring consume helper with netdev queue wakeup

Introduce tun_ring_consume() that wraps ptr_ring_consume() and calls
__tun_wake_queue(). The latter wakes the stopped netdev subqueue once
half of the ring capacity has been consumed, tracked via the new
cons_cnt field in tun_file. As a safety net, the queue is also woken on
the last consumed entry if it leaves the ring empty. The point is to
allow the queue to be stopped when it gets full, which is required for
traffic shaping - implemented by the following "avoid ptr_ring tail-drop
when a qdisc is present".

Some implementation details:
- tun_ring_recv() replaces ptr_ring_consume() with tun_ring_consume()
  to properly wake the queue.
- __tun_detach() locks the tx_ring.consumer_lock to avoid races with
  the consumer on the queue_index.
- The ptr_ring_consume() call in tun_queue_purge() is not replaced with
  tun_ring_consume(). Instead, within the same tx_ring.consumer_lock
  in __tun_detach(), the netdev queue is woken for the ntfile taking
  it over, to avoid a possible stall. This does not matter for
  tun_detach_all(), as it is called during device teardown and no tfile
  takes over any queue.
- Reset cons_cnt in tun_attach() so the half-ring wake threshold is
  valid for the new ring size after ptr_ring_resize().
- tun_queue_resize() wakes all queues after resizing with the proper
  tx_ring.consumer_lock and resets the cons_cnt to avoid a possible
  stale queue.
- The aforementioned upcoming patch explains the pairing of the smp_mb()
  of __tun_wake_queue().

Without the corresponding queue stopping, this patch alone causes no
regression for a tap setup sending to a qemu VM: 1.132 Mpps
to 1.134 Mpps.

Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, pktgen sender; Avg over 50 runs @ 100,000,000 packets;
SRSO and spectre v2 mitigations disabled.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260510151529.43895-2-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

FDDI: defza: Sanitise the reset safety timer

The reset actions of the DEFZA adapters are exceedingly slow, taking up
to 30 seconds to complete by the device spec and typically in the range
of 10 seconds in reality, as required for the device RTOS to boot, still
quite a lot.  Therefore a state machine is used that's interrupt driven,
however a safety mechanism is required in case of adapter malfunction,
so that if no state change interrupt has arrived in time, then the
situation is taken care of.

The safety mechanism depends on the origin of the reset.  For regular
adapter initialisation at the device probe time a sleep is requested.
However a reset is also required by the device spec when the adapter has
transitioned into the halted state, such as in response to a PC Trace
event in the course of ring fault recovery, possibly a common network
event.  In that case no sleep is possible as a device halt is reported
at the hardirq level.

A timer is therefore set up to ensure progress in case no adapter state
change interrupt has arrived in time, but as from commit 168f6b6ffbee
("timers: Use del_timer_sync() even on UP") a warning is issued as the
timer is deleted in the hardirq handler upon an expected state change:

  defza: v.1.1.4  Oct  6 2018  Maciej W. Rozycki
  tc2: DEC FDDIcontroller 700 or 700-C at 0x18000000, irq 4
  tc2: resetting the board...
  ------------[ cut here ]------------
  WARNING: kernel/time/timer.c:1611 at __timer_delete_sync+0x104/0x120, CPU#0: swapper/0/0
  Modules linked in:
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-dirty #2 VOLUNTARY
  Stack : 9800000002027d08 00000000140120e0 0000000000000000 ffffffff8089d468
          0000000000000000 0000000000000000 ffffffff807ed6b8 ffffffff80897458
          ffffffff80897400 9800000002027b88 0000000000000000 7070617773203a6d
          0000000000000000 9800000002027ba4 0000000000001000 6465746e69617420
          0000000000000000 ffffffff807ed6b8 00000000140120e0 0000000000000009
          000000000000064b ffffffff800dd14c 0000000000000036 9800000002184000
          0000000000000000 0000000000000020 0000000000000000 ffffffff80910000
          ffffffff8085c000 9800000002027c70 0000000000000001 ffffffff80045fa0
          0000000000000000 0000000000000000 0000000000000000 0000000000000009
          000000000000064b ffffffff800502b8 ffffffff807ed6b8 ffffffff80045fa0
          ...
  Call Trace:
  [<ffffffff800502b8>] show_stack+0x28/0xf0
  [<ffffffff80045fa0>] dump_stack_lvl+0x48/0x7c
  [<ffffffff80068c98>] __warn+0xa0/0x128
  [<ffffffff8004120c>] warn_slowpath_fmt+0x64/0xa4
  [<ffffffff800dd14c>] __timer_delete_sync+0x104/0x120
  [<ffffffff804934ac>] fza_interrupt+0xc74/0xeb8
  [<ffffffff800c6390>] __handle_irq_event_percpu+0x70/0x228
  [<ffffffff800c6560>] handle_irq_event_percpu+0x18/0x78
  [<ffffffff800cc320>] handle_percpu_irq+0x50/0x80
  [<ffffffff800c5970>] generic_handle_irq+0x90/0xd0
  [<ffffffff806e956c>] do_IRQ+0x1c/0x30
  [<ffffffff8004ad4c>] handle_int+0x148/0x154
  [<ffffffff800ab7c0>] do_idle+0x40/0x108
  [<ffffffff800abb0c>] cpu_startup_entry+0x2c/0x38
  [<ffffffff806dfec8>] kernel_init+0x0/0x108

  ---[ end trace 0000000000000000 ]---
  tc2: OK
  tc2: model 700 (DEFZA-AA), MMF PMD, address 08-00-2b-xx-xx-xx
  tc2: ROM rev. 1.0, firmware rev. 1.2, RMC rev. A, SMT ver. 1
  tc2: link unavailable
  ------------[ cut here ]------------
  WARNING: kernel/time/timer.c:1611 at __timer_delete_sync+0x104/0x120, CPU#0: swapper/0/0
  Modules linked in:
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G        W           7.0.0-dirty #2 VOLUNTARY
  Tainted: [W]=WARN
  Stack : 9800000002027d08 00000000140120e0 0000000000000000 ffffffff8089d468
          0000000000000000 0000000000000000 ffffffff807ed6b8 ffffffff80897458
          ffffffff80897400 9800000002027b88 0000000000000000 0000000000000000
          0000000000000000 9800000002027ba4 0000000000001000 0000000000000000
          0000000000000000 ffffffff807ed6b8 00000000140120e0 0000000000000009
          000000000000064b ffffffff800dd14c 0000000000000036 9800000002184000
          0000000000000000 0000000000000020 0000000000000000 ffffffff80910000
          ffffffff8085c000 9800000002027c70 0000000000000001 ffffffff80045fa0
          0000000000000000 0000000000000000 0000000000000000 0000000000000009
          000000000000064b ffffffff800502b8 ffffffff807ed6b8 ffffffff80045fa0
          ...
  Call Trace:
  [<ffffffff800502b8>] show_stack+0x28/0xf0
  [<ffffffff80045fa0>] dump_stack_lvl+0x48/0x7c
  [<ffffffff80068c98>] __warn+0xa0/0x128
  [<ffffffff8004120c>] warn_slowpath_fmt+0x64/0xa4
  [<ffffffff800dd14c>] __timer_delete_sync+0x104/0x120
  [<ffffffff804934ac>] fza_interrupt+0xc74/0xeb8
  [<ffffffff800c6390>] __handle_irq_event_percpu+0x70/0x228
  [<ffffffff800c6560>] handle_irq_event_percpu+0x18/0x78
  [<ffffffff800cc320>] handle_percpu_irq+0x50/0x80
  [<ffffffff800c5970>] generic_handle_irq+0x90/0xd0
  [<ffffffff806e956c>] do_IRQ+0x1c/0x30
  [<ffffffff8004ad4c>] handle_int+0x148/0x154
  [<ffffffff806de8a4>] arch_local_irq_disable+0x4/0x28
  [<ffffffff800ab7d0>] do_idle+0x50/0x108
  [<ffffffff800abb0c>] cpu_startup_entry+0x2c/0x38
  [<ffffffff806dfec8>] kernel_init+0x0/0x108

  ---[ end trace 0000000000000000 ]---
  tc2: registered as fddi0

The immediate origin of the new warning is the switch away from aliasing
del_timer_sync() to del_timer() (timer_delete_sync() to timer_delete()
in terms of current function names) for UP configurations, which however
is the only choice for this driver anyway as no SMP hardware supports
the TURBOchannel bus this device interfaces to.  Therefore there is a
very remote issue only this is a sign of.

Specifically if an adapter reset issued upon a transition to the halted
state times out and first triggers fza_reset_timer() for another reset
assertion, which then schedules fza_reset_timer() for reset deassertion
and then that second call is pre-empted after poking at the hardware,
but before the timer has been rearmed and owing to high system load
causing exceedingly high scheduling latency control is not handed back
before a transition to the uninitialised state has caused the timer to
be deleted even before it has been started, then fza_reset_timer() will
be called yet again and issue another reset even though by then the
adapter has already recovered.

Prevent this situation from happening by switching to timer_delete() for
the transition to the halted state and protect the code region affected
with a spinlock, also to make sure add_timer() has not been called twice
in a row due to an execution race between the interrupt handler and the
timer handler (though it could only happen on SMP, but let's keep the
driver clean).  It's a very unlikely sequence of events to happen and
therefore there's no point in trying to be overly clever about it, such
as by placing printk() calls outside the protection.  For the transition
to the uninitialised state switch to timer_delete_sync_try() instead, so
that a timer isn't deleted that's just been rearmed by the timer handler
and needs to watch for the device to come out of reset again (again, an
SMP scenario only).

Retain timer_delete_sync() invocations outside the hardirq context for a
stray timer not to fire once device structures have been released.

Fixes: 61414f5ec9834 ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
"The bulk of this is hardening of the new sub-scheduler infrastructure.

   - UAFs and lifecycle bugs on the sub-sched attach/detach paths:
     parent sub_kset freed under a racing child, list_del_rcu on an
     uninitialized list head, ops->priv stomped by concurrent
     attach/detach, and a UAF in the init-failure error path

   - Task state-machine reorg closing concurrent enable-vs-dead races: a
     task exiting during the unlocked init window could trip NULL ops
     derefs or skip exit_task() cleanup

   - A scx_link_sched() self-deadlock on scx_sched_lock

   - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
     cpumask without RCU, and stop rejecting BPF schedulers when only
     cpuset isolated partitions are active

   - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
     the failing task rather than the irq_work kthread

   - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
     fixes"

* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
  sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
  sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
  sched_ext: Drop NONE early return in scx_disable_and_exit_task()
  sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
  sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
  sched_ext: Fix ops->priv clobber on concurrent attach/detach
  selftests/sched_ext: Fix build error in dequeue selftest
  sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
  sched_ext: Close sub-sched init race with post-init DEAD recheck
  sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
  sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
  sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
  sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work
  sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
  sched_ext: Drop unused scx_find_sub_sched() stub
  sched_ext: Move scx_error() out of scx_link_sched()'s lock region

Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

- cpuset fixes:
     - Partition invalidation could return CPUs still in use by sibling
       partitions, producing overlapping effective_cpus
     - cpuset_can_attach() over-reserved DL bandwidth on moves that
       stayed within the same root domain
     - Pending DL migration state leaked into later attaches when a
       later can_attach() check failed
     - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
       allocate from any node and exit quickly

- dmem: propagate -ENOMEM instead of spinning forever when the fallback
   pool allocation also fails

- selftests/cgroup: percpu test error-path leak, bogus numeric
   comparison of cpuset strings, and a zero-length read() that silently
   passed OOM-kill tests

* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
  selftests/cgroup: Fix error path leaks in test_percpu_basic
  cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
  cgroup/cpuset: Reset DL migration state on can_attach() failure
  selftests/cgroup: Fix string comparison in write_test
  selftests/cgroup: Fix cg_read_strcmp() empty string comparison
  cgroup/dmem: Return -ENOMEM on failed pool preallocation
  cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()

Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:

- Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path

- Fix a cancel_delayed_work_sync() livelock against drain_workqueue()
   caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING
   set with no owner

* tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path
  workqueue: Release PENDING in __queue_work() drain/destroy reject path

sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation

scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.

Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:

  =============================
  WARNING: suspicious RCU usage
  -----------------------------
  kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!

  1 lock held by scx_flash/281:
   #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at:
       bpf_struct_ops_link_create+0x134/0x1c0

  Call Trace:
   dump_stack_lvl+0x6f/0xb0
   lockdep_rcu_suspicious.cold+0x37/0x70
   housekeeping_cpumask+0xcd/0xe0
   scx_enable.isra.0+0x17/0x120
   bpf_scx_reg+0x5e/0x80
   bpf_struct_ops_link_create+0x151/0x1c0
   __sys_bpf+0x1e4b/0x33c0
   __x64_sys_bpf+0x21/0x30
   do_syscall_64+0x117/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.

Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.

Fixes: 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>

cgroup/cpuset: Return only actually allocated CPUs during partition invalidation

In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs
to return to the parent are computed as:

    adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus);

where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set)
or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can
contain CPUs that were never actually granted to the partition due to
sibling exclusion in compute_excpus(). Consequently, the invalidation
may return CPUs to the parent that remain in use by sibling partitions,
causing overlapping effective_cpus and triggering the
WARN_ON_ONCE(1) in generate_sched_domains().

Use cs->effective_xcpus instead, which reflects the CPUs actually
granted to this partition.

Reproducer (on a 4-CPU machine):

    cd /sys/fs/cgroup
    mkdir a1 b1

    # a1 becomes partition root with CPUs 0-1
    echo "0-1" > a1/cpuset.cpus
    echo "root" > a1/cpuset.cpus.partition

    # b1 becomes partition root with CPUs 1-2, but sibling exclusion
    # reduces its effective_xcpus to CPU 2 only
    echo "1-2" > b1/cpuset.cpus
    echo "root" > b1/cpuset.cpus.partition

    # b1 changes cpus_allowed to 0-1 -> partition invalidation
    echo "0-1" > b1/cpuset.cpus

    # Expected: CPUs 2-3  (only CPU 2 returned from b1)
    # Actual:   CPUs 1-3  (CPU 0-1 returned, overlapping with a1)
    cat cpuset.cpus.effective

dmesg will also show a WARNING from generate_sched_domains() reporting
overlapping partition root effective_cpus.

Fixes: 2a3602030d80 ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: sunshaojie <sunshaojie@kylinos.cn>
Tested-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"arm64:

   - Add the pKVM side of the workaround for ARM's erratum 4193714,
     provided that the EL3 firmware does its part of the job. KVM will
     refuse to initialise otherwise

   - Correctly handle 52bit VAs for guest EL2 stage-1 translations when
     running under NV with E2H==0

   - Correctly deal with permission faults in guest_memfd memslots

   - Fix the steal-time selftest after the infrastructure was reworked

   - Make sure the host cannot pass a non-sensical clock update to the
     EL2 tracing infrastructure

   - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
     ability to run arm64 guests, which will inevitably lead to arm64
     code being directly used on s390

   - Make sure that EL2 is configured with both exception entry and exit
     being Context Synchronization Events

   - Handle the current vcpu being NULL on EL2 panic

   - Fix the selftest_vcpu memcache being empty at the point of donation
     or sharing

   - Check that the memcache has enough capacity before engaging on the
     share/donate path

   - Fix __deactivate_fgt() to use its parameter rather than a variable
     in the macro context

  s390:

   - Fix array overrun with large amounts of PCI devices

  x86:

   - Never use L0's PAUSE loop exiting while L2 is running, since it's
     unlikely that a nested guest will help solving the hypervisor's
     spinlock contention

   - Fix emulation of MOVNTDQA

   - Fix typo in Xen hypercall tracepoint

   - Add back an optimization that was left behind when recently fixing
     a bug

   - Add module parameter to disable CET, whose implementation seems to
     have issues. For now it remains enabled by default

  Generic:

   - Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()

  Documentation:

   - Update stale links

  Selftests:

   - Fix guest_memfd_test with host page size > guest page size"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
  KVM: VMX: introduce module parameter to disable CET
  KVM: x86: Swap the dst and src operand for MOVNTDQA
  KVM: x86: use again the flush argument of __link_shadow_page()
  KVM: selftests: Ensure gmem file sizes are multiple of host page size
  Documentation: kvm: update links in the references section of AMD Memory Encryption
  KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
  KVM: x86: Fix Xen hypercall tracepoint argument assignment
  KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
  KVM: arm64: Pre-check vcpu memcache for host->guest donate
  KVM: arm64: Pre-check vcpu memcache for host->guest share
  KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
  KVM: arm64: Fix __deactivate_fgt macro parameter typo
  KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
  KVM: arm64: Make EL2 exception entry and exit context-synchronization events
  MAINTAINERS: Add Steffen as reviewer for KVM/arm64
  KVM: arm64: Remove potential UB on nvhe tracing clock update
  KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
  KVM: arm64: Handle permission faults with guest_memfd
  KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
  KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
  ...

selftests/cgroup: Fix error path leaks in test_percpu_basic

When cg_name_indexed() returns NULL partway through the child creation
loop, the code returned -1 without running cleanup_children and cleanup.
That left the `parent` pathname allocation unreleased and did not remove
child cgroup directories already created under the parent. Fix by jumping
to cleanup_children instead of returning.

When cg_create() fails, `child` (the pathname from cg_name_indexed())
was not freed before cleanup_children. Fix by freeing `child` before
branching to cleanup_children.

Fixes: 90631e1dea55 ("kselftests: cgroup: add perpcu memory accounting test")
Signed-off-by: Yu Miao <yumiao@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux

Pull liveupdate fixes from Mike Rapoport:
"A few fixes for kexec handover and liveupdate:

   - make sure KHO is skipped for crash kernel

   - fix error reporting in memfd preservation if it fails mid-loop

   - don't allow preserving memfds whose page count exceeds UINT_MAX

   - fix documentation of memfd seals preservation to match the code"

* tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux:
  mm/memfd_luo: document preservation of file seals
  mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX
  mm/memfd_luo: report error when restoring a folio fails mid-loop
  kho: skip KHO for crash kernel

KVM: VMX: introduce module parameter to disable CET

There have been reports of host hangs caused by CET virtualization.
Until these are analyzed further, introduce a module parameter that
makes it possible to easily disable it.

Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/
Cc: David Riley <d.riley@proxmox.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'dpll-rework-fractional-frequency-offset-reporting'

Ivan Vecera says:

====================
dpll: rework fractional frequency offset reporting

Rework how the fractional frequency offset (FFO) is reported in
the DPLL subsystem.

Both fractional-frequency-offset (PPM) and
fractional-frequency-offset-ppt (PPT) attributes are now present at
the top level of a pin and inside each pin-parent-device nest. They
carry the same measurement at different precisions.

Introduce enum dpll_ffo_type and struct dpll_ffo_param to distinguish
FFO contexts: DPLL_FFO_PORT_RXTX_RATE for the RX vs TX symbol rate
offset at the top level, and DPLL_FFO_PIN_DEVICE for the pin vs
parent DPLL offset in the nest. Drivers declare which types they
support via the supported_ffo bitmask in dpll_pin_ops; the core only
calls ffo_get for opted-in types.

Patch 1 adds the type-safe FFO API, updates the YAML spec, netlink
handling, and documentation, and converts mlx5 and zl3073x drivers.

Patch 2 implements the nested FFO for zl3073x using the
dpll_df_offset_x register with ref_ofst=1, providing 2^-48
resolution. The old per-reference frequency measurement is removed
as it was redundant with measured-frequency.
====================

Link: https://patch.msgid.link/20260511155816.99936-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: zl3073x: report FFO as DPLL vs input reference offset

Replace the per-reference frequency offset measurement (which was
redundant with measured-frequency) with a direct read of the DPLL's
delta frequency offset vs its tracked input reference.

The new implementation uses the dpll_df_offset_x register with
ref_ofst=1 via the dpll_df_read_x semaphore mechanism. This
provides 2^-48 resolution (~3.5 fE) and reports the actual
frequency difference between the DPLL and its active input.

Switch supported_ffo from DPLL_FFO_PORT_RXTX_RATE to
DPLL_FFO_PIN_DEVICE so FFO is reported only in the per-parent
context for the active input pin.

Use atomic64_t for freq_offset to prevent torn reads on 32-bit
architectures between the periodic worker and netlink callbacks.

Rewrite ffo_check to compare the cached df_offset converted to PPT
instead of using the old per-reference measurement. Remove the
ref_ffo_update periodic measurement and the ref ffo field since
they are no longer needed.

Changes v3 -> v4:
- Switch to DPLL_FFO_PIN_DEVICE, remove dpll=NULL guard
- Use atomic64_t for freq_offset (torn read on 32-bit)

Reviewed-by: Petr Oros <poros@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260511155816.99936-3-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: add fractional frequency offset to pin-parent-device

Add both fractional-frequency-offset (PPM) and
fractional-frequency-offset-ppt (PPT) attributes to the
pin-parent-device nested attribute set, alongside the existing
top-level pin attributes. Both carry the same measurement at
different precisions.

Introduce enum dpll_ffo_type and struct dpll_ffo_param to
distinguish FFO contexts: DPLL_FFO_PORT_RXTX_RATE for the RX vs
TX symbol rate offset reported at the top level, and
DPLL_FFO_PIN_DEVICE for the pin vs parent DPLL offset reported
in the pin-parent-device nest.

Add a supported_ffo bitmask to struct dpll_pin_ops so drivers
declare which FFO types they support. The core only calls ffo_get
for types the driver has opted into, eliminating the need for
per-driver NULL pointer guards. Validate at pin registration time
that supported_ffo is not set without an ffo_get callback.

Update mlx5 (DPLL_FFO_PORT_RXTX_RATE) and zl3073x
(DPLL_FFO_PORT_RXTX_RATE) drivers to use the new API.

Add documentation for both FFO types to dpll.rst.

Changes v3 -> v4:
- Replace dpll=NULL overloading with enum dpll_ffo_type and
  struct dpll_ffo_param (Jakub Kicinski)
- Add supported_ffo opt-in bitmask in dpll_pin_ops for fail-close
  driver validation (Jakub Kicinski)
- Add WARN_ON in dpll_pin_register for supported_ffo without
  ffo_get callback

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260511155816.99936-2-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ravb: Do not check URAM suspension when WoL is active

When updating the driver to match latest datasheet to suspend access to
URAM when suspending DMA transfers a corner-case was missed, URAM access
will not be suspended if WoL is enabled. This lead to the error message
(correctly) being triggered as URAM access is not suspended even tho
it's requested as part of stopping DMA.

Avoid checking if URAM access is suspended and printing the error
message if WoL is enabled when we suspend the system, as we know it will
not be.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Closes: https://lore.kernel.org/all/CAMuHMdWnjV%3DHGE1o08zLhUfTgOSene5fYx1J5GG10mB%2BToq8qg@mail.gmail.com/
Fixes: 353d8e7989b6 ("net: ethernet: ravb: Suspend and resume the transmission flow")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sai Krishna <saikrishnag@marvell.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics

ethnl_bitmap32_not_zero() should return true if some bit in [start, end)
is set:

- Fix inverted memchr_inv() sense: return true when the scan finds a
  non-zero byte, not when the middle words are all zero.
- Return false for an empty interval (end <= start).
- When end is 32-bit aligned, indices in [start, end) do not include any
  bits from map[end_word]; return false after earlier checks found no
  non-zero data.

Fixes: 10b518d4e6dd ("ethtool: netlink bitset handling")
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint

The smc_msg_event tracepoint class, shared by smc_tx_sendmsg and
smc_rx_recvmsg, unconditionally dereferences smc->conn.lnk:

__string(name, smc->conn.lnk->ibname)

conn->lnk is only set for SMC-R; for SMC-D it is NULL. Other code on
these paths already handles this (e.g. !conn->lnk in
SMC_STAT_RMB_TX_SIZE_SMALL()). With the tracepoint enabled, the first
sendmsg()/recvmsg() on an SMC-D socket crashes:

  Oops: general protection fault, probably for non-canonical address
  KASAN: null-ptr-deref in range [...]
  RIP: 0010:strlen+0x1e/0xa0
  Call Trace:
   trace_event_raw_event_smc_msg_event (net/smc/smc_tracepoint.h:44)
   smc_rx_recvmsg (net/smc/smc_rx.c:515)
   smc_recvmsg (net/smc/af_smc.c:2859)
   __sys_recvfrom (net/socket.c:2315)
   __x64_sys_recvfrom (net/socket.c:2326)
   do_syscall_64

The faulting address 0x3e0 is offsetof(struct smc_link, ibname),
confirming the NULL ->lnk deref. Enabling the tracepoint requires
root, but the trigger itself is unprivileged: socket(AF_SMC, ...) has
no capability check, and SMC-D negotiation needs no admin step on
s390 or on x86 with the loopback ISM device loaded.

Log an empty device name for SMC-D instead of dereferencing NULL.

Fixes: aff3083f10bf ("net/smc: Introduce tracepoints for tx and rx msg")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS

A logic flaw in __smc_setsockopt() allows a local unprivileged user to
cause a Denial of Service (DoS) by holding the socket lock indefinitely.

The function __smc_setsockopt() calls copy_from_sockptr() while holding
lock_sock(sk). By passing a userfaultfd-monitored memory page (or
FUSE-backed memory on systems where unprivileged userfaultfd is disabled)
as the optval, an attacker can halt execution during the copy operation,
keeping the lock held.

Combined with asynchronous tear-down operations like shutdown(), this
exhausts the kernel wq (kworkers) and triggers the hung task watchdog.

[  240.123456] INFO: task kworker/u8:2 blocked for more than 120 seconds.
[  240.123489] Call Trace:
[  240.123501]  smc_shutdown+...
[  240.123512]  lock_sock_nested+...

This patch moves the user-space copy outside the lock_sock() critical
section to prevent the issue.

Fixes: a6a6fe27bab4 ("net/smc: Dynamic control handshake limitation by socket options")
Signed-off-by: Nicolò Coccia <n.coccia96@gmail.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: realtek: rtl8365mb: add support for RTL8367SB

Add chip info entry for the Realtek RTL8367SB switch. This device has
chip ID 0x6367 and version 0x0010. It exposes two external interfaces:
port 6 supports MII, TMII, RMII, RGMII, SGMII and HSGMII, while port 7
supports MII, TMII, RMII and RGMII. Use the existing 8365MB-VC jam table
for initialization.

Reviewed-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: Mieczyslaw Nalewaj <namiltd@yahoo.com>
Link: https://patch.msgid.link/3c6d822b-0e85-4173-86ba-2badb140bbf1@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-use-ip_outnoroutes-drop-reason'

Eric Dumazet says:

====================
net: use IP_OUTNOROUTES drop reason

First patch changes sk_skb_reason_drop() sock to be const.

Second and last patch add SKB_DROP_REASON_IP_OUTNOROUTES
to both tcp_v6_send_response() and inet6_csk_xmit().
====================

Link: https://patch.msgid.link/20260511072310.1094859-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: use SKB_DROP_REASON_IP_OUTNOROUTES in inet6_csk_xmit()

Replace a bare kfree_skb() with a modern sk_skb_reason_drop() call,
and provide IP_OUTNOROUTES drop reason.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260511072310.1094859-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: use SKB_DROP_REASON_IP_OUTNOROUTES in tcp_v6_send_response()

Replace a bare kfree_skb() with a modern sk_skb_reason_drop() call,
and provide IP_OUTNOROUTES drop reason.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260511072310.1094859-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: constify sk_skb_reason_drop() sock parameter

sk_skb_reason_drop() does not change sock parameter, make it
const so that we can call it from TCP stack without a cast
on a (const) listener socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260511072310.1094859-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: add RTEXT_FILTER_NAME_ONLY support

iproute2 can spend considerable amount of time in ll_init_map()
or ll_link_get() to dump verbose netdev attributes, contributing
to RTNL pressure.

Add RTEXT_FILTER_NAME_ONLY new flag so that rtnl_fill_ifinfo()
limits its output to:

- struct nlmsghdr
- IFLA_IFNAME
- IFLA_PROP_LIST (alternate names)

We can later avoid using RTNL when RTEXT_FILTER_NAME_ONLY
is requested, as none of these attributes need RTNL.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260511070244.971028-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'rework-pci_device_id-initialisation'

Uwe Kleine-König says:

====================
Rework pci_device_id initialisation
====================

Link: https://patch.msgid.link/20260511090023.1634387-4-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Consistently define pci_device_ids using named initializers

... and PCI device helpers.

The various struct pci_device_id arrays were initialized mostly by one
the PCI_DEVICE macros and then list expressions. The latter isn't easily
readable if you're not into PCI. Using named initializers is more
explicit and thus easier to parse.

Also use PCI_DEVICE* helper macros to assign .vendor, .device,
.subvendor and .subdevice where appropriate and skip explicit
assignments of 0 (which the compiler takes care of).

The secret plan is to make struct pci_device_id::driver_data an
anonymous union (similar to
https://lore.kernel.org/all/cover.1776579304.git.u.kleine-koenig@baylibre.com/)
and that requires named initializers. But it's also a nice cleanup on
its own.

This change doesn't introduce changes to the compiled pci_device_id
arrays. Tested on x86 and arm64.

Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Forwarded: id:76da4f44d48bdde84580963862bf9616bee5c9e9.1778149923.git.u.kleine-koenig@baylibre.com (v2)
Reviewed-by: Michael Grzeschik <mgr@kernel.org>
Link: https://patch.msgid.link/20260511090023.1634387-6-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: nfp: Drop PCI class entries with .class_mask = 0

With .class_mask being zero the value of .class doesn't matter because
to check if a pci_device_id entry matches a given device the expression

(id->class ^ dev->class) & id->class_mask

is checked for being zero (see pci_match_one_device()). So drop the
useless and irritating assignment for .class to match what (I think) all
other drivers are doing that don't need to match on .class, i.e. set
both members to zero.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20260511090023.1634387-5-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atm: fix skb leak in sigd_send() default branch

The default branch in sigd_send() calls sock_put() and returns -EINVAL
without freeing the skb, while all other exit paths do so. Add the
missing dev_kfree_skb() before sock_put() to fix the leak.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Wei Yang <albinwyang@tencent.com>
Link: https://patch.msgid.link/20260509122358.1102997-1-albin_yang@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: intel-xway: add PHY-level statistics via ethtool

Report PCS receive error counts for all supported PEF 7061, 7071, 7072 and
xRX200 PHYs.

Accumulate the vendor-specific PHY_ERRCNT read-clear counter
(SEL=RXERR) in .update_stats() and expose it as both IEEE 802.3
SymbolErrorDuringCarrier and generic rx_errors via
.get_phy_stats().

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Link: https://patch.msgid.link/20260509205933.3965832-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: intel-xway: fix typo in Kconfig description

Replace "22E" with "22F" in the description.

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Link: https://patch.msgid.link/20260509210900.3968447-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: phy: avoid NULL deref when PHY driver is unbound

phydev->drv can become NULL while the phy_device is still attached to
its net_device, namely after the PHY driver is unbound via sysfs:

echo <mdio_id> > /sys/bus/mdio_bus/drivers/<phy_drv>/unbind

phy_remove() clears phydev->drv but doesn't call phy_detach(), so the
phy_device stays in the link topology xarray and ethnl_req_get_phydev()
still hands it back. ETHTOOL_MSG_PHY_GET then oopses on:

rep_data->drvname = kstrdup(phydev->drv->name, GFP_KERNEL);

drvname is already treated as optional by phy_reply_size(),
phy_fill_reply() and phy_cleanup_data(), so just skip the allocation
when there is no driver bound.

Fixes: 9dd2ad5e92b9 ("net: ethtool: phy: Convert the PHY_GET command to generic phy dump")
Cc: stable@vger.kernel.org # 6.13.x
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260509215046.107157-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: cope with slow env in so_txtime.py test

This test was converted from shell script to drv-net test.

The new version is flaky in dbg builds on the netdev.bots dashboard.
The previous shell script had more protections to avoid these. Added
in commit a7ee79b9c455 ("selftests: net: cope with slow env in
so_txtime.sh test").

Add the same overall protection:

- Suppress so_txtime process failure if KSFT_MACHINE_SLOW

Also relax two timeouts to reduce the number of process failures
themselves

- Increase SO_RCVTIMEO to 2 seconds
- Increase process start-up stabilization to 2 seconds

Delays were experimentally arrived at while running with vng
built with kernel/configs/debug.config

Fixes: 5c6baef3885c ("selftests: drv-net: convert so_txtime to drv-net")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20260510174219.74aeee6d@kernel.org/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260511222138.2045551-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled

The shutdown handler aq_pci_shutdown() unconditionally calls
pci_wake_from_d3(pdev, false), clearing the PCI PME_En bit even when
wake-on-LAN has been configured. While aq_nic_shutdown() correctly
programs the NIC firmware via aq_nic_set_power() to listen for magic
packets, the PCI subsystem will not propagate the resulting PME wake
event from D3, so the system never wakes after poweroff.

WOL from suspend (S3) is unaffected because aq_suspend_common() does
not touch pci_wake_from_d3() and relies on the PM core's wake
configuration via device_may_wakeup().

This affects all atlantic-supported NICs (AQC107/108/111/112/113);
users have reported that WOL works if the atlantic driver is never
loaded, but breaks once it has run its shutdown path.

Pass the configured WOL state to pci_wake_from_d3() instead of a
literal false, so the PCI PME_En bit is preserved when the user has
armed WOL via ethtool.

Fixes: 90869ddfefeb ("net: aquantia: Implement pci shutdown callback")
Cc: stable@vger.kernel.org
Signed-off-by: Zoran Ilievski <goodboy@rexbytes.com>
Reviewed-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260511064002.1857-1-goodboy@rexbytes.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work

scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock,
but that does not pin parent->sub_kset. Concurrent disable can
kset_unregister and free sub_kset before scx_alloc_and_add_sched()
dereferences it.

Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer
kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing
child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and
the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT.

Fixes: 411d3ef1a705 ("sched_ext: Unregister sub_kset on scheduler disable")
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()

On scx_link_sched() error paths (parent disabled, hash insert failure),
&sch->all is never added to scx_sched_all. The cleanup path runs
scx_unlink_sched() unconditionally, which calls list_del_rcu(&sch->all) on a
list_head that was never initialized triggering a corruption warning.

Initialize &sch->all.

Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()")
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'kvm-s390-master-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: pci: fix array indexing

For large amounts of PCI devices its possible to overrun the arrays as
the index was miscalculated in 2 places.

sched_ext: Drop NONE early return in scx_disable_and_exit_task()

d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from
paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks.
After scx_fail_parent() parks a task at NONE/sched=parent and the parent
is later freed via queue_rcu_work() during root_disable, the preserved
p->scx.sched dangles - print_scx_info() from sched_show_task() reads
sch->ops.name from freed memory.

Drop the early return. __scx_disable_and_exit_task() already short-
circuits on NONE and the SUB_INIT block was cleared by
scx_fail_parent()'s earlier call, so clearing p->scx.sched is the only
work left - and the one thing the path actually needs.

v2: Extend the SUB_INIT block comment to note that the flag is only
set on the sub-enable path, so it's always clear on the NONE
re-entry (Andrea).

Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

KVM: x86: Swap the dst and src operand for MOVNTDQA

Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same
characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores
to registers, while MOVNTDQ loads from registers and stores to memory.

Per the SDM:

MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal
hint.

MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint
if WC memory type.

Reported-by: Josh Eads <josheads@google.com>
Fixes: c57d9bafbd0b ("KVM: x86: Add support for emulating MOVNTDQA")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260506213514.2781948-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: use again the flush argument of __link_shadow_page()

Except in the case of parentless nested-TDP pages, mmu_page_zap_pte()
clears the SPTE but leaves the invalid_list empty. In this case, using
kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill.
Avoid flushing the entirety of the remote TLBs unless the invalid_list
was populated: instead, use a more efficient gfn-targeting flush (if
available) and skip it altogether if the caller guarantees that a TLB
flush is not necessary.

Based-on: <20260503201029.106481-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260503210917.121840-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: selftests: Ensure gmem file sizes are multiple of host page size

When creating a guest_memfd file and associated memslot to validate shared
guest memory, size the file+memslot to the maximum of the host or guest
page size. Attempting to allocate a single guest page will fail if the
host page size is greater than the guest page size, as KVM requires that
the size of memslots and guest_memfd files are a multiple of the host page
size.

For simplicity, verify the entire file can be shared between guest and host,
e.g. instead of trying to validate "partial" mappings.

Fixes: 42188667be38 ("KVM: selftests: Add guest_memfd testcase to fault-in on !mmap()'d memory")
Reported-by: Zenghui Yu <zenghui.yu@linux.dev>
Closes: https://lore.kernel.org/all/0064952b-048c-455d-ad89-e27e5cb82591@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260512155634.772602-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge tag 'kvmarm-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 7.1, take #2

- Add the pKVM side of the workaround for ARM's erratum 4193714, provided
  that the EL3 firmware does its part of the job. KVM will refuse to
  initialise otherwise.

- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
  running under NV with E2H==0.

- Correctly deal with permission faults in guest_memfd memslots.

- Fix the steal-time selftest after the infrastructure was reworked.

- Make sure the host cannot pass a non-sensical clock update to the
  EL2 tracing infrastructure.

- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
  ability to run arm64 guests, which will inevitably lead to arm64
  code being directly used on s390.

- Make sure that EL2 is configured with both exception entry and exit
  being Context Synchronization Events.

- Handle the current vcpu being NULL on EL2 panic.

- Fix the selftest_vcpu memcache being empty at the point of donation or
  sharing.

- Check that the memcache has enough capacity before engaging on the
  share/donate path.

- Fix __deactivate_fgt() to use its parameter rather than a variable
  in the macro context.

Documentation: kvm: update links in the references section of AMD Memory Encryption

Replace non-working links in the reference section with the working ones.

Signed-off-by: Ninad Naik <ninadnaik07@gmail.com>
Link: https://patch.msgid.link/20260511174302.811918-1-ninadnaik07@gmail.com/
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running

Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running,
and instead always configure vmcb02 according to L1's exact capabilities
and desires.

The purpose of intercepting PAUSE after N attempts is to detect when the
vCPU may be stuck waiting on a lock, so that KVM can schedule in a
different vCPU that may be holding said lock.  Barring a very interesting
setup, L1 and L2 do not share locks, and it's extremely unlikely that an
L1 vCPU would hold a spinlock while running L2.  I.e. having a vCPU
executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU
to make forward progress, and vice versa.

While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is
doable, in all likelihood it would do more harm than good for most setups.
KVM has limited visibility into which L2 "vCPUs" belong to the same VM,
and thus share a locking domain.  And even if L2 vCPUs are in the same
VM, KVM has no visilibity into L2 vCPU's that are scheduled out by the
L1 hypervisor.

Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is
intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as
nested_svm_intercept() gives priority to the vmcb12 intercept.  As such,
overriding the count/threshold fields in vmcb02 with vmcb01's values is
nonsensical, as doing so clobbers all the training/learning that has been
done in L1.

Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE
exits, then KVM will adjust the PLE knobs based on L2 behavior, which could
very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE
training with bad data.

And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even
less sense, because again, the purpose of PLE is to detect spinning vCPUs.
Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit
has no relevance as to the behavior of the vCPU when it executes in L1.

The only scenarios where any of this actually works is if at least one
of KVM or L1 is NOT intercepting PAUSE for the guest.  Per the original
changelog, those were the only scenarios considered to be supported.
Disabling KVM's use of PLE makes it so the VM is always in a "supported"
mode.

Last, but certainly not least, using KVM's count/threshold instead of the
values provided by L1 is a blatant violation of the SVM architecture.

Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: Fix Xen hypercall tracepoint argument assignment

TRACE_EVENT(kvm_xen_hypercall) stores a5 in __entry->a4 instead of
__entry->a5.

That overwrites the recorded a4 argument and leaves a5 unset in the
trace entry. Fix the typo so both arguments are captured correctly.

Signed-off-by: Qiang Ma <maqianga@uniontech.com>
Link: https://patch.msgid.link/20260512015313.1685784-1-maqianga@uniontech.com/
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: Reject wrapped offset in kvm_reset_dirty_gfn()

kvm_reset_dirty_gfn() guards the gfn range with

if (!memslot || (offset + __fls(mask)) >= memslot->npages)
return;

but offset is u64 and the addition is unchecked.  The check can be
silently bypassed by a u64 wrap.

The dirty ring backing those entries is MAP_SHARED at
KVM_DIRTY_LOG_PAGE_OFFSET of the vcpu fd, so the VMM can rewrite the
slot and offset fields of any entry between when the kernel pushes
them and when KVM_RESET_DIRTY_RINGS consumes them.  On reset,
kvm_dirty_ring_reset() re-reads the values via READ_ONCE() and feeds
them straight back into this check; only the flags handshake is
treated as the handover, the slot/offset payload is taken on trust.

Crafting two entries

entry[i].offset   = 0xffffffffffffffc1
entry[i+1].offset = 0

makes the coalescing loop in kvm_dirty_ring_reset() compute

delta = (s64)(0 - 0xffffffffffffffc1) = 63

which falls in [0, BITS_PER_LONG), so it folds entry[i+1] into the
existing mask by setting bit 63.  The trailing kvm_reset_dirty_gfn()
call then sees offset = 0xffffffffffffffc1 and __fls(mask) = 63;
the sum is 0 in u64 and the bounds check passes.

That offset propagates into kvm_arch_mmu_enable_log_dirty_pt_masked()
unchanged.  On the legacy MMU path -- kvm_memslots_have_rmaps() ==
true, i.e. shadow paging, any VM that has allocated shadow roots, or
a write-tracked slot -- it reaches gfn_to_rmap(), which indexes
slot->arch.rmap[0][] with a near-U64_MAX gfn.  That is an
out-of-bounds load of a kvm_rmap_head, followed by a conditional
clear of PT_WRITABLE_MASK in whatever the loaded pointer points at.
The path is reachable from any process holding /dev/kvm.

Range-check offset on its own first, so the addition cannot wrap.
memslot->npages is bounded well below U64_MAX, so once offset <
npages holds, offset + __fls(mask) (with __fls(mask) < BITS_PER_LONG)
stays in range.

Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Cc: stable@vger.kernel.org
Signed-off-by: Aaron Sacks <contact@xchglabs.com>
Link: https://patch.msgid.link/20260512060742.1628959-1-contact@xchglabs.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>