Guenter Roeck [Thu, 14 May 2026 11:06:38 +0000 (13:06 +0200)]
kunit: Add backtrace suppression self-tests
Add unit tests to verify that warning backtrace suppression works.
Tests cover both API forms:
- Scoped: kunit_warning_suppress() with in-block count verification
and post-block inactivity check.
- Direct functions: kunit_start/end_suppress_warning() with
sequential independent suppression blocks and per-block counts.
Furthermore, tests verify incremental warning counting, that
kunit_has_active_suppress_warning() transitions correctly around
suppression boundaries, and that suppression active in the test
kthread does not leak to a separate kthread.
If backtrace suppression does _not_ work, the unit tests will likely
trigger unsuppressed backtraces, which should actually help to get
the affected architectures / platforms fixed.
Link: https://lore.kernel.org/r/20260514-kunit_add_support-v11-2-b36a530a6d8f@redhat.com
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Acked-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
Reviewed-by: David Gow <david@davidgow.net>
Signed-off-by: Albert Esteve <aesteve@redhat.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
bug/kunit: Core support for suppressing warning backtraces
Some unit tests intentionally trigger warning backtraces by passing bad
parameters to kernel API functions. Such unit tests typically check the
return value from such calls, not the existence of the warning backtrace.
Such intentionally generated warning backtraces are neither desirable
nor useful for a number of reasons:
- They can result in overlooked real problems.
- A warning that suddenly starts to show up in unit tests needs to be
investigated and has to be marked to be ignored, for example by
adjusting filter scripts. Such filters are ad hoc because there is
no real standard format for warnings. On top of that, such filter
scripts would require constant maintenance.
Solve the problem by providing a means to suppress warning backtraces
originating from the current kthread while executing test code. Since
each KUnit test runs in its own kthread, this effectively scopes
suppression to the test that enabled it. Limit changes to generic code
to the absolute minimum.
Implementation details:
Suppression is integrated into the existing KUnit hooks infrastructure
in test-bug.h, reusing the kunit_running static branch for zero
overhead when no tests are running.
Suppression is checked at three points in the warning path:
- In warn_slowpath_fmt(), the check runs before any output, fully
suppressing both message and backtrace. This covers architectures
without __WARN_FLAGS.
- In __warn_printk(), the check suppresses the warning message text.
This covers architectures that define __WARN_FLAGS but not their own
__WARN_printf (arm64, loongarch, parisc, powerpc, riscv, sh), where
the message is printed before the trap enters __report_bug().
- In __report_bug(), the check runs before __warn() is called,
suppressing the backtrace and stack dump.
To avoid double-counting on architectures where both __warn_printk()
and __report_bug() run for the same warning, kunit_is_suppressed_warning()
takes a bool parameter: true to increment the suppression counter
(used in warn_slowpath_fmt and __report_bug), false to check only
(used in __warn_printk).
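The counting rule can be modelled in a few lines of plain C. This is only an illustrative sketch: `struct suppress_model`, `model_is_suppressed()` and `model_one_warning()` are hypothetical stand-ins, not the KUnit implementation, and the only detail taken from the text is the bool parameter that selects "check and count" versus "check only".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-test suppression state. */
struct suppress_model {
	bool active;	/* suppression currently enabled */
	int counter;	/* number of suppressed warnings recorded */
};

/*
 * Model of the check: report whether the warning is suppressed and,
 * only when 'count' is true, record it in the counter.
 */
static bool model_is_suppressed(struct suppress_model *s, bool count)
{
	if (!s->active)
		return false;
	if (count)
		s->counter++;
	return true;
}

/*
 * One warning on an architecture where both hooks run: the message
 * hook only checks, the trap hook checks and counts.
 * Returns the counter value after the warning.
 */
static int model_one_warning(struct suppress_model *s)
{
	model_is_suppressed(s, false);	/* role of __warn_printk() */
	model_is_suppressed(s, true);	/* role of __report_bug() */
	return s->counter;
}
```

With this split, each warning increments the counter once even though two hooks consult the suppression state.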
The suppression state is dynamically allocated via kunit_kzalloc() and
tied to the KUnit test lifecycle via kunit_add_action(), ensuring
automatic cleanup at test exit. Writer-side access to the global
suppression list is serialized with a spinlock; readers use RCU.
Two API forms are provided:
- kunit_warning_suppress(test) { ... }: scoped, uses __cleanup for
automatic teardown on scope exit, kunit_add_action() as safety net
for abnormal exits (e.g. kthread_exit from failed assertions).
Suppression handle is only accessible inside the block.
- kunit_start/end_suppress_warning(test): direct functions returning
an explicit handle, for retaining the handle within the test,
or for cross-function usage.
Ruijie Li [Thu, 14 May 2026 08:13:25 +0000 (16:13 +0800)]
batman-adv: clear current gateway during teardown
batadv_gw_node_free() removes the gateway list entries during mesh teardown,
but it does not clear the currently selected gateway. This leaves stale
gateway state behind across cleanup and can break a later mesh recreation.
Clear bat_priv->gw.curr_gw before walking the gateway list so the selected
gateway reference is dropped as part of teardown.
Fixes: 2265c1410864 ("batman-adv: gateway election code refactoring")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ruijie Li <ruijieli51@gmail.com>
Signed-off-by: Zhanpeng Li <lzhanpeng2025@lzu.edu.cn>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Alim Akhtar [Fri, 17 Apr 2026 12:14:49 +0000 (17:44 +0530)]
arm64: dts: exynosautov920: Add syscon hsi2 node
Syscon HSI2 block has system configuration settings for
HSI IPs, like ufs, usb etc. Add a syscon_hsi2 node entry
so that related HSI controller can make use of the same.
Use dev_err_probe() in tegra114_emc_interconnect_init() to make the code a
bit simpler. It's the preferred form of printing error messages during
probe, even if the actual call cannot return EPROBE_DEFER.
clk: samsung: exynos850: mark APM I3C clocks as critical
The Exynos850 APM co-processor relies on the I3C bus to communicate with
the PMIC. Currently, there is no dedicated PMIC consumer driver managing
these clocks, so the clock subsystem automatically gates them during the
initialisation. Once gated, any subsequent ACPM communication with APM
results in timeouts.
As a temporary workaround (and let's hope it doesn't become permanent),
mark both `gout_i3c_pclk` and `gout_i3c_sclk` as CLK_IS_CRITICAL to
prevent the clock subsystem from disabling them. This makes the ACPM
communication functional. This workaround should be reverted once a
proper ACPM PMIC driver is implemented to manage these clocks.
Cc: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org>
Reviewed-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Link: https://patch.msgid.link/20260430-exynos850-i3c-criticalclocks-v1-1-6e1fd8dfa21b@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Ruide Cao [Wed, 13 May 2026 03:58:15 +0000 (11:58 +0800)]
batman-adv: fix fragment reassembly length accounting
batman-adv keeps a running payload length for queued fragments and uses it
to validate a fragment chain before reassembly.
That accounting currently allows the accumulated fragment length to be
truncated during updates. As a result, malformed fragment chains can
bypass the intended validation and drive reassembly with inconsistent
length state, leading to a local denial of service.
Fix the accounting by storing the accumulated length in a length-typed
field and rejecting update overflows before the existing validation logic
runs.
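The difference between a truncating and a checked length update can be sketched in userspace C. Everything here is an assumption made for illustration: the 16-bit accumulator width, the helper names, and the use of `size_t` as the "length-typed" field are not taken from the batman-adv sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical narrow accumulator: the running total is silently
 * truncated to 16 bits on every update.
 */
static uint16_t accumulate_truncating(uint16_t total, size_t frag_len)
{
	return (uint16_t)(total + frag_len);
}

/*
 * Hypothetical checked update: keep the total in a size_t and refuse
 * any update that would overflow it, leaving the total unchanged.
 */
static bool accumulate_checked(size_t *total, size_t frag_len)
{
	if (frag_len > SIZE_MAX - *total)
		return false;	/* update would overflow: reject */
	*total += frag_len;
	return true;
}
```

In the truncating form a chain whose real length is 66000 bytes is recorded as 464, so a later "total fits" check can pass for a chain that does not fit. In the checked form the total always reflects the real sum, so the existing validation sees consistent state.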
The fix was verified against the original reproducer and against valid
fragment reassembly paths.
Fixes: 610bfc6bc99b ("batman-adv: Receive fragmented packets and merge")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ruide Cao <caoruide123@gmail.com>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
With support for nested lazy mmu sections, arch_enter_lazy_mmu_mode()
can be called twice without an intervening call of
arch_leave_lazy_mmu_mode(), as the lazy_mmu_*() helpers do not
disable preemption when checking for nested lazy mmu sections.
This is a problem when running as a Xen PV guest, as
xen_enter_lazy_mmu() and xen_leave_lazy_mmu() don't tolerate this
case.
Fix that in xen_enter_lazy_mmu() and xen_leave_lazy_mmu() in order
not to hurt all other lazy mmu mode users.
Fixes: 291b3abed657 ("x86/xen: use lazy_mmu_state when context-switching")
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260508143933.493013-1-jgross@suse.com>
Juergen Gross [Tue, 5 May 2026 10:24:17 +0000 (12:24 +0200)]
x86/xen: Fix xen_e820_swap_entry_with_ram()
When swapping an E820 map entry that is not page-aligned with RAM, the start
address of the modified entry is calculated incorrectly (the offset into the
page is subtracted from the page address instead of being added to it).
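The arithmetic at issue can be sketched in standalone C. This is an illustration under stated assumptions, not the Xen code: the 4 KiB page size, the helper names and the example addresses are all made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ull	/* assumed page size for the example */

/* Offset of 'addr' within its page. */
static uint64_t page_offset_of(uint64_t addr)
{
	return addr & (SKETCH_PAGE_SIZE - 1);
}

/* Intended result: keep the entry's in-page offset at the new page. */
static uint64_t swapped_start(uint64_t new_page_addr, uint64_t old_start)
{
	return new_page_addr + page_offset_of(old_start);
}

/* The mistake described above: the offset is subtracted instead. */
static uint64_t swapped_start_buggy(uint64_t new_page_addr, uint64_t old_start)
{
	return new_page_addr - page_offset_of(old_start);
}
```

For a page-aligned entry both forms agree, because the offset is zero; they differ only for the unaligned entries the fix targets.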
Fixes: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory")
Reported-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260505102417.208138-1-jgross@suse.com>
- ipv6: flowlabel: enforce per-netns limit for unprivileged callers
- tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
- smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
- sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL
- batman-adv:
- reject new tp_meter sessions during teardown
- purge non-released claims
- eth:
- i40e: cleanup PTP registration on probe failure
- idpf: fix double free and use-after-free in aux device error paths
- ena: fix potential use-after-free in get_timestamp"
* tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
net: phy: DP83TC811: add reading of abilities
net: tls: prevent chain-after-chain in plain text SG
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
macsec: introduce dedicated workqueue for SA crypto cleanup
net: net_failover: Fix the deadlock in slave register
MAINTAINERS: update atlantic driver maintainer
selftests/tc-testing: Add QFQ/CBS qlen underflow test
net/sched: sch_cbs: Call qdisc_reset for child qdisc
FDDI: defza: Sanitise the reset safety timer
net: ethernet: ravb: Do not check URAM suspension when WoL is active
ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
net: atm: fix skb leak in sigd_send() default branch
net: ethtool: phy: avoid NULL deref when PHY driver is unbound
net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled
net: shaper: reject QUEUE scope handle with missing id
...
Jeremy Erazo [Thu, 14 May 2026 12:03:34 +0000 (12:03 +0000)]
smb: client: avoid integer overflow in SMB2 READ length check
SMB2 READ response validation in cifs_readv_receive() and
handle_read_data() checks data_offset + data_len against the received
buffer length. Both values are attacker-controlled fields from the
server response and are stored as unsigned int, so the addition can
wrap before the bounds check:
  fs/smb/client/transport.c:1259
      if (!use_rdma_mr && (data_offset + data_len > buflen))
  fs/smb/client/smb2ops.c:4839
      else if (buf_len >= data_offset + data_len)
A malicious SMB server can use this to bypass validation. In the
non-encrypted receive path the client attempts an oversized socket
read and stalls for the SMB response timeout (180 seconds) before
reconnecting. In the SMB3 encrypted path, runtime testing shows the
malformed length can reach copy_to_iter() in handle_read_data() with
attacker-controlled size, where usercopy hardening stops the oversized
copy before bytes reach userspace.
Guard both call sites with check_add_overflow(), which is already
used elsewhere in this subsystem (smb2pdu.c). On overflow, treat the
response as malformed and reject with -EIO.
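The shape of the guarded check can be sketched in userspace C. This is a sketch under assumptions, not the patch itself: the helper names are hypothetical, and the compiler builtin `__builtin_add_overflow()` (GCC/Clang) stands in for the kernel's check_add_overflow(), which reports overflow in the same way.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true when the region
 * [data_offset, data_offset + data_len) fits inside buflen bytes.
 */
static bool read_region_fits(unsigned int data_offset, unsigned int data_len,
			     unsigned int buflen)
{
	unsigned int end;

	/* The builtin returns true when the sum wrapped around. */
	if (__builtin_add_overflow(data_offset, data_len, &end))
		return false;	/* malformed response; the patch rejects with -EIO */

	return end <= buflen;
}

/* The unchecked form, kept only to show how the wrap defeats it. */
static bool read_region_fits_unchecked(unsigned int data_offset,
				       unsigned int data_len,
				       unsigned int buflen)
{
	return data_offset + data_len <= buflen;
}
```

With data_offset = 16 and data_len = UINT_MAX - 7, the unchecked sum wraps to 8 and passes a 4096-byte bound, while the checked helper rejects the pair.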
Signed-off-by: Jeremy Erazo <mendozayt13@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Linus Torvalds [Thu, 14 May 2026 15:53:24 +0000 (08:53 -0700)]
Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fixes from Paul Moore:
- Correctly log the inheritable capabilities
- Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands
* tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
audit: fix incorrect inheritable capability in CAPSET records
David Carlier [Fri, 8 May 2026 20:19:58 +0000 (21:19 +0100)]
phy: apple: atc: Fix typec switch/mux leak on unbind
atcphy_probe_switch() and atcphy_probe_mux() discard the pointers
returned by typec_switch_register() and typec_mux_register(). The
platform driver has no .remove callback, so when the driver unbinds
(e.g. via sysfs unbind) neither typec_switch_unregister() nor
typec_mux_unregister() is called. The framework reference taken in
typec_switch_register() (device_initialize() + device_add() in
drivers/usb/typec/mux.c) is therefore never dropped and the
typec_switch_dev / typec_mux_dev objects stay live forever, with
their sysfs entries under the typec_mux class also left behind. A
subsequent rebind cannot recreate them with the same fwnode-derived
name.
Save the registered handles and unregister them through
devm_add_action_or_reset() so framework registration is torn down
in step with the driver's other devm-managed state. While here,
drop struct apple_atcphy::sw and ::mux: they were declared with the
consumer-side types (typec_switch *, typec_mux *) instead of the
provider-side types and were never assigned.
Scope of the fix
================
This patch fixes the registration leak only. It does not close the
use-after-free window that arises when a consumer that obtained a
reference via fwnode_typec_switch_get() / fwnode_typec_mux_get()
outlives the provider unbind: such consumers keep the underlying
typec_switch_dev / typec_mux_dev alive past device_unregister(),
and a later typec_switch_set() / typec_mux_set() still invokes the
registered atcphy_sw_set() / atcphy_mux_set(), which dereferences
the freed apple_atcphy through typec_{switch,mux}_get_drvdata().
On Apple Silicon the relevant consumers are the typec port and the
cd321x controller registered by drivers/usb/typec/tipd/core.c.
Cable plug / orientation events and alt-mode transitions trigger
the .set callbacks via:
Closing that window requires framework support for invalidating
consumer-held references on provider unbind. The same
consumer-survives-provider pattern has been discussed for the PHY
framework [1] and is out of scope here.
phy: econet: Add PCIe PHY driver for EcoNet EN751221 and EN7528 SoCs.
Introduce support for the EcoNet PCIe PHY controllers found in EN751221
and EN7528 SoCs. These SoCs are not identical but are similar, each
having one Gen1 port and one Gen1/Gen2 port.
Co-developed-by: Ahmed Naseef <naseefkm@gmail.com>
Signed-off-by: Ahmed Naseef <naseefkm@gmail.com>
[cjd@cjdns.fr: add EN751221 support and refactor for clarity]
Signed-off-by: Caleb James DeLisle <cjd@cjdns.fr>
Link: https://patch.msgid.link/20260425173642.406089-3-cjd@cjdns.fr
Signed-off-by: Vinod Koul <vkoul@kernel.org>
dt-bindings: phy: Document PCIe PHY in EcoNet EN751221 and EN7528
EN751221 and EN7528 SoCs have two PCIe slots, and each one has a PHY
which behaves slightly differently because one slot is Gen1/Gen2 while
the other is Gen1 only.
Thomas Weißschuh [Thu, 14 May 2026 12:05:13 +0000 (14:05 +0200)]
tools/nolibc: always pass mode to open syscall
When O_TMPFILE is set, the open mode needs to be passed to the kernel as
per the documentation. Currently this is not done.
Instead of checking for O_TMPFILE explicitly and making the conditionals
more complex, just always pass the mode to the kernel. If no value was
passed, the mode will be garbage, but the kernel will ignore it anyway.
Jianwei Zheng [Tue, 5 May 2026 17:04:10 +0000 (19:04 +0200)]
phy: rockchip: inno-usb2: Add support for RK3528
The RK3528 has a single USB2PHY with an OTG and a host port.
Add support for the RK3528 variant of USB2PHY.
PHY tuning for RK3528:
- Turn off differential receiver in suspend mode to save power
consumption.
- Set HS eye-height to 400mV instead of default 450mV.
- Choose the Tx fs/ls data as linestate from TX driver for otg port
which uses dwc3 controller to improve fs/ls devices compatibility with
long cables.
Undocumented magic-values are based on the linux-stan-6.1-rkr5 tag of
the vendor-kernel.
The logic to decide if usbgrf or grf should be used is more complex than
it needs to be. For RK3568, RV1108 and soon RK3528 we can assign the
rockchip,usbgrf regmap directly to grf instead of doing a usbgrf and grf
dance.
Simplify the code to only use the grf regmap and handle the logic of
what regmap should be used in driver probe instead.
The only expected change from this is that RK3528 can be supported
thanks to the addition of an of_property_present() check.
Signed-off-by: Jonas Karlman <jonas@kwiboo.se>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://patch.msgid.link/20260505170410.3265305-3-heiko@sntech.de
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Jonas Karlman [Tue, 5 May 2026 17:04:06 +0000 (19:04 +0200)]
dt-bindings: phy: rockchip,inno-usb2phy: Require GRF for RK3568/RV1108
Typically these Rockchip USB2 PHYs are fully contained within a single
GRF. However, for RK3568 and RV1108 the registers that control the USB2 PHY
are located in a different GRF than the base address.
Update this binding to require rockchip,usbgrf for RK3568 and RV1108 to
properly reflect that the USB GRF is required to control the USB2 PHYs
on these variants. Also disable use of rockchip,usbgrf for variants
where it is not required.
This should not introduce any breakage as the affected usb2phy nodes for
RK3568 and RV1108 were added together with a rockchip,usbgrf phandle in
their initial commit.
Andy Shevchenko [Wed, 13 May 2026 22:01:30 +0000 (00:01 +0200)]
phy: phy-can-transceiver: Decouple assignment and definition in probe
Code like
	int foo = X;
	...
	if (bar)
		foo = Y;
is prone to subtle mistakes and hence harder to maintain, as the value of foo
may be changed inadvertently as the code in '...' grows. On top of that,
it's harder to work out the possible values of foo when the branch
is not taken (one has to look elsewhere in the code, far from the piece
at hand).
Besides that, when the branch is taken foo is written twice, which is
not a problem per se, just an unneeded operation.
Decouple assignment and definition and use if-else to address the inconveniences
described above.
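For comparison with the snippet at the top of this message, the decoupled form might look like the following. It is a generic sketch, not the driver code: `pick_foo()` is a hypothetical wrapper, and X and Y are placeholder values exactly as in the message.

```c
#include <assert.h>
#include <stdbool.h>

#define X 1	/* placeholder values, as in the snippet in the message */
#define Y 2

/* Definition and assignment decoupled: every path assigns foo exactly once. */
static int pick_foo(bool bar)
{
	int foo;

	if (bar)
		foo = Y;
	else
		foo = X;

	return foo;
}
```

Both possible values now sit next to each other, and a path that forgets to assign foo becomes something the compiler can warn about rather than a silent default.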
Andy Shevchenko [Wed, 13 May 2026 22:01:29 +0000 (00:01 +0200)]
phy: phy-can-transceiver: Don't check for specific errors when parsing properties
Instead of checking for the specific error codes (that can be considered
a layering violation to some extent) check for the property existence first
and then either parse it, or apply a default value.
With that, return an error when parsing of the existing property fails.
Andy Shevchenko [Wed, 13 May 2026 22:01:28 +0000 (00:01 +0200)]
phy: phy-can-transceiver: Move OF ID table closer to their user
There is no code that uses the ID table directly, except the
struct device_driver at the end of the file. Hence, move the
table closer to its user. It's always possible to access
it via a pointer.
Andy Shevchenko [Wed, 13 May 2026 22:01:26 +0000 (00:01 +0200)]
phy: phy-can-transceiver: Check driver match and driver data against NULL
Every platform driver can be forced to match a device that doesn't
match its list of device IDs because of device_match_driver_override()
so platform drivers that rely on the existence of a device's driver
data need to verify its presence.
Accordingly, add requisite match and driver data checks against NULL
to the driver where they are missing.
Linus Torvalds [Wed, 13 May 2026 18:37:18 +0000 (11:37 -0700)]
ptrace: slightly saner 'get_dumpable()' logic
The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.
And almost all users do in fact use it only for the case where the task
has a mm pointer.
But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).
It's not what this flag was designed for, but it is what it is.
The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
traditional "drop capabilities" model doesn't make any difference for
this all.
Make it all make a *bit* more sense by saying that if you don't have a
MM pointer, we'll use a cached "last dumpability" flag if the thread
ever had a MM (it will be zero for kernel threads since it is never
set), and require a proper CAP_SYS_PTRACE capability to override.
Ioana Ciornei [Mon, 11 May 2026 15:00:23 +0000 (18:00 +0300)]
phy: lynx-28g: add support for 25GBASER
Add support for 25GBASE-R in the Lynx 28G SerDes PHY driver. This will
be used by the dpaa2-mac consumer on LX2160A with:
- phy_validate(phy, PHY_MODE_ETHERNET, PHY_INTERFACE_MODE_25GBASER) to
detect support.
- phy_set_mode_ext(phy, PHY_MODE_ETHERNET, PHY_INTERFACE_MODE_25GBASER)
to reconfigure the lane for this protocol.
The intended use case for dynamic protocol switching to 25GBase-R is
with SFP28 modules, and protocol switching is triggered by the SFP
module insertion. There also exists a 25GBase-KR use case, where the
protocol switching is covered by IEEE 802.3 clause 73 auto-negotiation.
However, that is not handled here; it merely needs the support added
here as basic ground work.
The lane frequency for 25GbE is sourced from a clock net frequency of
12.890625 GHz, as produced by PLLF or PLLS, further multiplied by the
lane by 2. The clock net frequencies produced by the PLLs are treated as
read-only by the driver, so the absence of a PLL provisioned for the
right clock net frequency implies absence of 25GbE support, even though
a lane might have the appropriate protocol converter for it.
In terms of implementation, the change consists of:
- determining at probe time if any PLL was preconfigured for the
required clock net frequency for 25GbE
- adding the default lane parameters for reconfiguring a lane to 25GbE
irrespective of the original protocol
- allowing this operating mode only on supported lanes, i.e. all lanes
of LX2162A SerDes #1, and LX2160A SerDes lanes 0-1, 4-7.
Vladimir Oltean [Mon, 11 May 2026 15:00:22 +0000 (18:00 +0300)]
phy: lynx-28g: probe on per-SoC and per-instance compatible strings
Add driver support for probing on the new, per-instance and per-SoC
bindings, which provide the main benefit that they allow rejecting
unsupported protocols per lane (10GbE on SerDes 2 lanes 0-5), but they
also allow avoiding the creation of PHYs for lanes that don't exist
(LX2162A lanes 0-3).
For old device trees with just "fsl,lynx-28g", the only things that
change are:
- a probe time warning/encouragement to update the device tree. This is
warranted by the fact that using "fsl,lynx-28g" may already provide
incorrect behaviour (undetected absent 10GbE support on LX2160A
SerDes 2 lanes 0-5). But we retain bug compatibility nonetheless.
- the feature set is frozen in time (e.g. no 25GbE). Since we cannot
guarantee that this protocol will work on a lane, just err on the safe
side and don't offer it (and require a device tree update to get it).
In terms of code, the lynx_28g_supports_lane_mode() function prototype
changes. It was a SerDes-global function and now becomes per lane, to
reflect the specific capabilities each instance may have. The
implementation goes through priv->info->lane_supports_mode().
Vladimir Oltean [Mon, 11 May 2026 15:00:21 +0000 (18:00 +0300)]
phy: lynx-28g: require an OF node to probe
The driver will gain support for variants in an upcoming change, and
will use of_device_get_match_data() to deduce the running variant from
the compatible string.
Currently, the driver expects the schema at phy/fsl,lynx-28g.yaml, and
OF-based consumers, but doesn't enforce this. And it is possible for
user space to force-bind the driver to a device without an OF node using
the driver_override sysfs attribute.
To avoid future surprise crashes for an unsupported configuration,
explicitly test for the presence of an OF node and fail probing if
there is none.
Vladimir Oltean [Mon, 11 May 2026 15:00:19 +0000 (18:00 +0300)]
dt-bindings: phy: lynx-28g: add compatible strings per SerDes and instantiation
The 28G Lynx SerDes is instantiated 3 times in the NXP LX2160A SoC and
twice in the NXP LX2162A. All these instances share the same register
map, but the number of lanes and the protocols supported by each lane
differs in a way that isn't detectable by the programming model.
For example, not all lanes of all SerDes block instantiations support
25GbE.
So, using a generic "fsl,lynx-28g" compatible string and expecting all
SerDes instantiations to use it was a mistake that needs to be fixed.
The option chosen is to encode the SoC and the SerDes instance in the
compatible string, with everything else being the responsibility of the
driver to derive.
An alternative considered but dismissed was to add sufficient device
tree properties to describe the per-lane differences (implying:
supported protocols), as well as the different lane count.
Any decision made for the 28G Lynx should be consistent with the
decisions taken for the yet-to-be-introduced 10G Lynx SerDes (older
generation for older SoCs), because of how similar they are.
I've seen the alternative at play in this unmerged patch set for the 10G
Lynx here, and I didn't like it:
https://lore.kernel.org/linux-phy/20230413160607.4128315-3-sean.anderson@seco.com/
This is because there, we have a higher degree of variability in the
PCCR register values that need to be written per protocol. This makes
that approach more drawn-out and more prone to errors, compared to the
compatible strings which are more succinct and obviously correct.
NXP SoC reference manuals clearly document the SerDes instantiations as
not identical, and refer to them as such (SerDes 1, 2, etc).
The per-SoC compatible string is prepended to the "fsl,lynx-28g" generic
compatible, which is left there for compatibility with old kernels. An
exception would be LX2160A SerDes #3, which at the time of writing is
not described in fsl-lx2160a.dtsi. As "fsl,lx2160a-serdes3" implies it
is a 28G Lynx SerDes, it makes "fsl,lynx-28g" redundant so we don't
accept it.
Shuicheng Lin [Mon, 11 May 2026 15:33:07 +0000 (15:33 +0000)]
drm/xe/gt_idle: Use NSEC_PER_MSEC instead of float literal
The residency multiplier conversion in get_residency_ms() used the
floating-point literal 1e6 as the divisor of mul_u64_u32_div(). While
the compiler constant-folds this to an integer, using float literals
in kernel code is bad practice since the kernel generally avoids
floating-point operations.
Replace 1e6 with the standard NSEC_PER_MSEC macro from <linux/time64.h>,
which is both self-documenting (ns to ms conversion) and unambiguously
integer. Add the corresponding include rather than relying on
transitive inclusion.
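The conversion can be sketched in userspace C. The code below is an assumption-laden illustration: `mul_u64_u32_div_sketch()` is a portable stand-in that computes a * mul / divisor like the kernel helper of similar name, `residency_ms_sketch()` and its parameter names are hypothetical, and NSEC_PER_MSEC is redefined locally with the kernel's value.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000L	/* local copy of the kernel's value */

/*
 * Stand-in for mul_u64_u32_div(): a * mul / divisor without losing
 * the intermediate product. Splitting a into quotient and remainder
 * keeps (a % divisor) * mul below 2^64.
 */
static uint64_t mul_u64_u32_div_sketch(uint64_t a, uint32_t mul,
					uint32_t divisor)
{
	return (a / divisor) * mul + ((a % divisor) * mul) / divisor;
}

/* Hypothetical conversion: raw counter times a per-tick ns multiplier, in ms. */
static uint64_t residency_ms_sketch(uint64_t raw_count, uint32_t multiplier_ns)
{
	return mul_u64_u32_div_sketch(raw_count, multiplier_ns, NSEC_PER_MSEC);
}
```

Writing 1e6 in place of NSEC_PER_MSEC would also compile, because the double constant is implicitly converted to the integer parameter type, which is exactly why the integer macro reads more clearly.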
Shuicheng Lin [Mon, 11 May 2026 15:41:34 +0000 (15:41 +0000)]
drm/xe/gsc: Fix double-free of managed BO in error path
The error path in xe_gsc_init_post_hwconfig() explicitly frees a BO
allocated with xe_managed_bo_create_pin_map() via
xe_bo_unpin_map_no_vm(). Since the managed BO already has a devm
cleanup action registered, this causes a double-free when devm
unwinds during probe failure.
Remove the explicit free and let devm handle it, consistent with
all other xe_managed_bo_create_pin_map() callers.
Fixes: 2e5d47fe7839 ("drm/xe/uc: Use managed bo for HuC and GSC objects")
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Assisted-by: Claude:claude-opus-4.6
Link: https://patch.msgid.link/20260511154134.223696-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Guannan Wang [Thu, 14 May 2026 07:44:54 +0000 (15:44 +0800)]
bpf: Use array_map_meta_equal for percpu array inner map replacement
percpu_array_map_ops.map_meta_equal points to the generic
bpf_map_meta_equal(), which does not compare max_entries. When a
percpu array serves as an inner map, replacing it with one that has
fewer max_entries bypasses the check. Since percpu_array_map_gen_lookup()
inlines the original template's index_mask as a JIT immediate, a lookup
on the replacement map can access pptrs[] out of bounds.
Point percpu_array_map_ops.map_meta_equal to array_map_meta_equal(),
which already enforces the max_entries equality check.
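Why a smaller replacement map is dangerous can be modelled in plain C. This is a simplified model, not the BPF code: it assumes the inlined lookup bakes in both a bounds check against the template's max_entries and a mask of the form "next power of two minus one", and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest power of two >= n, for 1 <= n <= 2^31. */
static uint32_t roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Model of a lookup whose bounds check and mask were baked in from a
 * template map with 'template_entries' slots, but which runs against
 * a map that really has 'actual_entries' slots. Returns true when the
 * slot that would be dereferenced lies outside the actual array.
 */
static bool baked_lookup_is_oob(uint32_t index, uint32_t template_entries,
				uint32_t actual_entries)
{
	uint32_t index_mask = roundup_pow2(template_entries) - 1;

	if (index >= template_entries)
		return false;	/* lookup fails cleanly, nothing is accessed */

	return (index & index_mask) >= actual_entries;
}
```

With a template of 8 entries and a replacement of 4, index 5 passes the baked-in checks and lands past the end of the smaller array; requiring equal max_entries removes that case.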
Add a selftest to verify that replacing a percpu array inner map with
a differently-sized one is rejected.
Peter Ujfalusi [Fri, 8 May 2026 10:17:55 +0000 (18:17 +0800)]
soundwire: intel: Move suspend tracking from trigger to pm suspend
Mark all open DAI runtimes as suspended in the component .suspend
callback instead of relying on SNDRV_PCM_TRIGGER_SUSPEND, which is
not delivered during PAUSE or xrun states.
If a DAI is open during system suspend, it is in either the
SUSPENDED, PAUSED or STOPPED (due to xrun) state, and it will need to be
re-initialized during resume (which is done in the .prepare callback).
DaeMyung Kang [Wed, 13 May 2026 13:26:22 +0000 (22:26 +0900)]
cifs: client: stage smb3_reconfigure() updates and restore ctx on failure
smb3_reconfigure() moves strings out of cifs_sb->ctx before the
multichannel update, so a later failure can leave the live context
with NULL strings or options that do not match the session.
Stage the new ctx separately, commit it only on success, and restore
the snapshot on failure. Also make smb3_sync_session_ctx_passwords()
all-or-nothing.
Commit session passwords before channel updates so newly added channels
authenticate with the staged credentials.
Fixes: ef529f655a2c ("cifs: client: allow changing multichannel mount options on remount")
Reported-by: RAJASI MANDAL <rajasimandalos@gmail.com>
Closes: https://lore.kernel.org/lkml/CAEY6_V1+dzW3OD5zqXhsWyXwrDTrg5tAMGZ1AJ7_GAuRE+aevA@mail.gmail.com/
Link: https://lore.kernel.org/lkml/xkr2dlvgibq5j6gkcxd3yhhnj4atgxw2uy4eug2pxm7wy7nbms@iq6cf5taa65v/
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Nick Chan [Thu, 14 May 2026 13:16:01 +0000 (21:16 +0800)]
nvme-apple: Reset q->sq_tail during queue init
Fixes a "duplicate tag error for tag 0" firmware crash during controller
reset while setting up a queue on Apple A11 / T8015 caused by stale
entries in the submission queue due to an invalid sq_tail offset after
reset.
Fixes: 04d8ecf37b5e ("nvme: apple: Add Apple A11 support")
Cc: stable@vger.kernel.org
Suggested-by: Yuriy Havrylyuk <yhavry@gmail.com>
Reviewed-by: Sven Peter <sven@kernel.org>
Signed-off-by: Nick Chan <towinchenmi@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Ye Bin [Thu, 14 May 2026 13:14:18 +0000 (21:14 +0800)]
smb/client: fix possible infinite loop and oob read in symlink_data()
On 32-bit architectures, the infinite loop is as follows:
len = p->ErrorDataLength == 0xfffffff8
u8 *next = p->ErrorContextData + len
next == p
On 32-bit architectures, the out-of-bounds read is as follows:
len = p->ErrorDataLength == 0xfffffff0
u8 *next = p->ErrorContextData + len
next == (u8 *)p - 8
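The wraparound above can be reproduced in userspace with a minimal sketch. It models a 32-bit address as a uint32_t so the overflow is well defined, and it assumes ErrorContextData sits 8 bytes into the context header, which is what the two cases above imply. The names next_ctx() and CTX_HDR_SIZE are made up for illustration and are not the kernel's.

```c
#include <stdint.h>

/* Assumption for illustration: ErrorContextData starts 8 bytes into the header. */
#define CTX_HDR_SIZE 8u

/*
 * Model of "next = p->ErrorContextData + len" on a 32-bit machine.
 * uint32_t arithmetic wraps modulo 2^32, like a 32-bit pointer would.
 */
static uint32_t next_ctx(uint32_t p, uint32_t error_data_length)
{
	return p + CTX_HDR_SIZE + error_data_length;
}
```

With len == 0xfffffff8 the result equals p, so a loop that advances to next never makes progress; with len == 0xfffffff0 it lands 8 bytes before p.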
Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Fixes: 76894f3e2f71 ("cifs: improve symlink handling for smb2+") Cc: stable@vger.kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com>
ovpn: fix race between deleting interface and adding new peer
While deleting an existing ovpn interface, there is a very
narrow window where adding a new peer via netlink may cause
the netdevice to hang and prevent its unregistration.
It may happen during ovpn_dellink(), when all existing peers are
freed and the device is queued for deregistration, but a
CMD_PEER_NEW message comes in, adding a new peer that again takes
a reference to the netdev.
At this point there is no way to release the device because we are
under the assumption that all peers were already released.
Fix the race condition by releasing all peers in ndo_uninit(),
when the netdevice has already been removed from the netdev
list.
ovpn_peer_add() now also has an extra check that forces the
function to bail out if the device reg_state is not REGISTERED.
This way any incoming CMD_PEER_NEW racing with the interface
deletion routine will simply stop before adding the peer.
Note that the above check happens while holding the netdev_lock
to prevent racing netdev state changes.
ovpn_dellink() is now empty and can be removed.
Reported-by: Hyunwoo Kim <imv4bel@gmail.com> Closes: https://lore.kernel.org/netdev/aaVgJ16edTfQkYbx@v4bel/ Suggested-by: Sabrina Dubroca <sd@queasysnail.net> Fixes: 80747caef33d ("ovpn: introduce the ovpn_peer object") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
David Carlier [Wed, 13 May 2026 10:55:21 +0000 (11:55 +0100)]
ovpn: respect peer refcount in CMD_NEW_PEER error path
ovpn_nl_peer_new_doit()'s error path calls ovpn_peer_release() directly
rather than ovpn_peer_put(), bypassing the kref. The accompanying
comment ("peer was not yet hashed, thus it is not used in any context")
holds for UDP but not for TCP.
For UDP, the ovpn_socket union uses the .ovpn arm and never points back
at a peer; UDP encap_recv looks up peers via the not-yet-populated
hashtables, so the new peer is unreachable until ovpn_peer_add()
publishes it.
For TCP, ovpn_socket_new() sets ovpn_sock->peer and
ovpn_tcp_socket_attach() publishes ovpn_sock via rcu_assign_sk_user_data().
From that moment until ovpn_socket_release() detaches in the error path,
the TCP fd is fully wired: userspace recvmsg / sendmsg / close / poll
on the fd, as well as the strparser-driven ovpn_tcp_rcv() path, can
reach the peer through sk_user_data -> ovpn_sock->peer and bump its
refcount via ovpn_peer_hold().
ovpn_tcp_socket_wait_finish() (called inside ovpn_socket_release())
drains strparser and the tx work, but does not synchronize with
userspace syscall callers that already hold a peer reference. If
ovpn_nl_peer_modify() or ovpn_peer_add() returns an error while such
a caller is in flight - notably an ovpn_tcp_recvmsg() blocked in
__skb_recv_datagram() on peer->tcp.user_queue - the direct
ovpn_peer_release() destroys the peer while the caller still holds
the reference, and the eventual ovpn_peer_put() from that caller
operates on freed memory.
Replace the direct destructor call with ovpn_peer_put() so the kref
correctly defers destruction until the last reference is dropped.
In the common case where no concurrent user is present, behaviour is
unchanged: the kref hits zero immediately and ovpn_peer_release_kref()
runs the same destructor.
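The difference between calling the destructor directly and dropping a reference can be shown with a toy refcount. This is only a sketch using C11 atomics; struct peer, peer_hold(), peer_put() and peer_release() are simplified stand-ins invented for illustration, not the ovpn functions.

```c
#include <stdatomic.h>

struct peer {
	atomic_int refs;
	int released;	/* set by the destructor, for demonstration only */
};

/* Destructor: must only run once nobody holds a reference. */
static void peer_release(struct peer *p)
{
	p->released = 1;
}

/* Take an additional reference. */
static void peer_hold(struct peer *p)
{
	atomic_fetch_add(&p->refs, 1);
}

/* Drop one reference; destroy the object only when it was the last one. */
static void peer_put(struct peer *p)
{
	if (atomic_fetch_sub(&p->refs, 1) == 1)
		peer_release(p);
}
```

If a second holder exists, the error path's peer_put() leaves the object alive and the holder's later peer_put() runs the destructor. With a single reference the destructor runs immediately, matching the "behaviour is unchanged" case.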
With this conversion ovpn_peer_release() has no callers outside peer.c
- ovpn_peer_release_kref() in the same translation unit is the only
remaining user - so make it static and drop its declaration from
peer.h.
Fixes: 11851cbd60ea ("ovpn: implement TCP transport") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
David Carlier [Wed, 13 May 2026 10:55:20 +0000 (11:55 +0100)]
ovpn: tcp - use cached peer pointer in ovpn_tcp_close()
ovpn_tcp_close() loads the ovpn_socket via rcu_dereference_sk_user_data()
under rcu_read_lock(), takes a reference on sock->peer, caches the peer
pointer in a local, and drops the read lock. It then passes sock->peer
(rather than the cached local) to ovpn_peer_del(), re-dereferencing the
ovpn_socket after the RCU read section has ended.
Unlike ovpn_tcp_sendmsg(), which uses the same "load under RCU, use
after unlock" pattern but is protected by lock_sock() held across the
function, ovpn_tcp_close() runs without the socket lock: inet_release()
invokes sk_prot->close() without taking lock_sock first.
ovpn_socket_release() can therefore complete its kref_put -> detach ->
synchronize_rcu -> kfree(sock) sequence concurrently, in the window
after ovpn_tcp_close() drops rcu_read_lock() but before it dereferences
sock->peer. The synchronize_rcu() in ovpn_socket_release() protects
readers that use the dereferenced pointer inside the RCU read section,
not those that escape the pointer to a local and use it afterwards.
A reproducer follows the pattern of commit 94560267d6c4 ("ovpn: tcp -
don't deref NULL sk_socket member after tcp_close()"): trigger a peer
removal (keepalive expiration or netlink OVPN_CMD_DEL_PEER) at the same
moment userspace closes the TCP fd. That commit fixed the detach-side
of the same race window; this one fixes the close-side at a different
victim.
Tighten the entry block to read sock->peer exactly once into the cached
peer local, and route all subsequent uses (the hold check, the
ovpn_peer_del() call, and the prot->close() invocation) through that
local. sock->peer is only ever written once in ovpn_socket_new() under
lock_sock(), before rcu_assign_sk_user_data() publishes the ovpn_socket,
and is never reassigned afterwards - but the previous multi-read pattern
made that invariant implicit rather than explicit. The same multi-read
shape exists in ovpn_tcp_recvmsg(), ovpn_tcp_sendmsg(),
ovpn_tcp_data_ready() and ovpn_tcp_write_space(); those will be cleaned
up via a dedicated helper in a follow-up net-next series.
Fixes: 11851cbd60ea ("ovpn: implement TCP transport") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Commit 201ba706318d ("selftests: ovpn: reduce ping count in test.sh")
lowered the baseline traffic flood ping count to avoid flakes on slower
CI instances, however some instances were left out.
Apply the same limit to the remaining ovpn selftest flood pings that
still request 500 packets.
Fixes: 201ba706318d ("selftests: ovpn: reduce ping count in test.sh") Signed-off-by: Ralf Lici <ralf@mandelbit.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Jens Axboe [Mon, 4 May 2026 11:42:51 +0000 (05:42 -0600)]
io_uring/rsrc: raise registered buffer 1GB limit
There's no real reason to have a limit, as the memory is accounted by
the lockmem limits anyway, if any exist. io_pin_pages() will still
restrict the maximum allowed limit per buffer, which is INT_MAX
number of pages. Cap it a bit lower than that, at 1TB for a 64-bit
system. Surely that should be enough for everyone. For now.
Jens Axboe [Sat, 24 Jan 2026 17:02:41 +0000 (10:02 -0700)]
io_uring/rsrc: add huge page accounting for registered buffers
Track huge page references in a per-ring xarray to prevent double
accounting when the same huge page is used by multiple registered
buffers, either within the same ring or across cloned rings.
When registering buffers backed by huge pages, we need to account for
RLIMIT_MEMLOCK. But if multiple buffers share the same huge page (common
with cloned buffers), we must not account for the same page multiple
times. Similarly, we must only unaccount when the last reference to a
huge page is released.
Maintain a per-ring xarray (hpage_acct) that tracks reference counts for
each huge page. When registering a buffer, for each unique huge page,
increment its accounting reference count, and only account pages that
are newly added.
When unregistering a buffer, for each unique huge page, decrement its
refcount. Once the refcount hits zero, the page is unaccounted.
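The account-on-first-reference, unaccount-on-last-reference scheme can be sketched as follows. A small fixed array stands in for the xarray, and every name here (struct hpage_table, hpage_get(), hpage_put()) is hypothetical; the real code keys an xarray and charges RLIMIT_MEMLOCK, which this sketch reduces to a counter.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_HPAGES 16

struct hpage_ref {
	uintptr_t key;		/* identifies one huge page */
	unsigned int refs;	/* 0 means the slot is free */
};

struct hpage_table {
	struct hpage_ref slot[MAX_HPAGES];
	unsigned long accounted;	/* huge pages currently charged */
};

static struct hpage_ref *hpage_find(struct hpage_table *t, uintptr_t key)
{
	for (size_t i = 0; i < MAX_HPAGES; i++)
		if (t->slot[i].refs && t->slot[i].key == key)
			return &t->slot[i];
	return NULL;
}

/* Returns 1 if the page was newly accounted, 0 if already tracked, -1 if full. */
static int hpage_get(struct hpage_table *t, uintptr_t key)
{
	struct hpage_ref *r = hpage_find(t, key);

	if (r) {
		r->refs++;
		return 0;
	}
	for (size_t i = 0; i < MAX_HPAGES; i++) {
		if (!t->slot[i].refs) {
			t->slot[i].key = key;
			t->slot[i].refs = 1;
			t->accounted++;
			return 1;
		}
	}
	return -1;
}

/* Returns 1 if the page was unaccounted, 0 if still referenced, -1 if unknown. */
static int hpage_put(struct hpage_table *t, uintptr_t key)
{
	struct hpage_ref *r = hpage_find(t, key);

	if (!r)
		return -1;
	if (--r->refs)
		return 0;
	t->accounted--;
	return 1;
}
```

Two buffers sharing one huge page charge it once, and the charge is dropped only when the second buffer goes away.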
Note: any accounting is done against the ctx->user that was assigned when
the ring was set up. As before, if root is running the operation, no
accounting is done.
With these changes, any use of imu->acct_pages is also dead, hence kill
it from struct io_mapped_ubuf. This shrinks it from 56b to 48b on a
64-bit arch. Additionally, hpage_already_acct() is gone, which was an
O(M*M) scan over current + previous registrations.
Shuai Zhang [Mon, 11 May 2026 13:58:37 +0000 (21:58 +0800)]
Bluetooth: hci_qca: Convert timeout from jiffies to ms
Since the timer uses jiffies as its unit rather than ms, the timeout value
must be converted from ms to jiffies when configuring the timer. Otherwise,
the intended 8s timeout is incorrectly set to approximately 33s.
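The unit mix-up can be illustrated with a simplified model of the conversion. ms_to_ticks() below is a rounded-up stand-in for msecs_to_jiffies(), not the kernel implementation, and the tick rate of 250 per second is an assumption chosen for the example; the actual stretch factor depends on the kernel's HZ.

```c
#include <stdint.h>

/* Simplified stand-in for msecs_to_jiffies(): round up to whole ticks. */
static uint64_t ms_to_ticks(uint64_t ms, uint64_t hz)
{
	return (ms * hz + 999) / 1000;
}

/* How long a timer really runs when armed with 'ticks' at 'hz' ticks per second. */
static uint64_t ticks_to_ms(uint64_t ticks, uint64_t hz)
{
	return ticks * 1000 / hz;
}
```

Under that assumption, arming a timer with the raw value 8000 gives ticks_to_ms(8000, 250), i.e. 32000 ms, while ms_to_ticks(8000, 250) yields the 2000 ticks that correspond to the intended 8 s.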
To improve readability, embed msecs_to_jiffies() directly in the macro
definitions and drop the _MS suffix from macros that now yield jiffies
values: MEMDUMP_TIMEOUT, FW_DOWNLOAD_TIMEOUT, IBS_DISABLE_SSR_TIMEOUT,
CMD_TRANS_TIMEOUT, and IBS_BTSOC_TX_IDLE_TIMEOUT.
IBS_WAKE_RETRANS_TIMEOUT_MS and IBS_HOST_TX_IDLE_TIMEOUT_MS are
intentionally left unchanged. Their values are stored in the struct fields
wake_retrans and tx_idle_delay, which hold ms values at runtime and can be
modified via debugfs. The msecs_to_jiffies() conversion happens at each
call site against the field value, so it cannot be embedded in the macro.
Bluetooth: L2CAP: ecred_reconfigure: send packed pdu, not stack pointer
Commit 1c08108f3014 ("Bluetooth: L2CAP: Avoid -Wflex-array-member-not-at-end
warnings") converted the on-stack request PDU in l2cap_ecred_reconfigure()
from an explicit packed struct to DEFINE_RAW_FLEX(), but did not adjust the
size and source-pointer arguments to l2cap_send_cmd():
After the conversion, DEFINE_RAW_FLEX() expands to declare an anonymous
union pdu_u plus a local pointer "pdu" pointing at it. Therefore:
- sizeof(pdu) is now sizeof(struct l2cap_ecred_reconf_req *) = 8 on
64-bit (4 on 32-bit), not the 6 bytes of (mtu, mps, scid[1]).
- &pdu is the address of the local pointer's stack storage, not the
address of the request payload.
l2cap_send_cmd() forwards (data, count) to l2cap_build_cmd(), which calls
skb_put_data(skb, data, count). The L2CAP_ECRED_RECONFIGURE_REQ packet
body therefore contains 8 bytes copied from the kernel stack starting at
&pdu -- the 8 bytes overlap the pdu pointer's value, leaking a kernel
stack address to the paired Bluetooth peer. The intended (mtu, mps, scid)
fields are not transmitted at all, so the peer rejects the request as
malformed and the L2CAP_ECRED_RECONFIGURE feature itself has been broken
for the local-side initiator since the introducing commit landed.
The sibling site l2cap_ecred_conn_req() in the same commit was converted
correctly (sizeof(*pdu) + len, pdu); only this site was missed.
Restore the original semantics: pass the full flex-struct size via
struct_size(pdu, scid, 1) and the pdu pointer (the struct address) as
the source.
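The size mismatch is easy to reproduce outside the kernel. The sketch below uses a stand-in struct with the same (mtu, mps, scid[]) shape; struct reconf_req and the two helpers are illustrative names, and reconf_req_size() only mimics what struct_size() computes, without its overflow checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the request layout: two 16-bit fields plus a flexible array. */
struct reconf_req {
	uint16_t mtu;
	uint16_t mps;
	uint16_t scid[];	/* flexible array member */
};

/* Bytes occupied by the header plus n trailing scid entries. */
static size_t reconf_req_size(size_t n)
{
	return sizeof(struct reconf_req) + n * sizeof(uint16_t);
}

/* What the buggy call passed as the length: the size of a pointer. */
static size_t reconf_ptr_size(void)
{
	return sizeof(struct reconf_req *);
}
```

For one scid the payload is 6 bytes, whereas the pointer size is 8 on a 64-bit build, which matches the 8-byte bodies in the captures above.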
Validated on a stock 7.0-based host kernel via the real call path:
setsockopt(SOL_BLUETOOTH, BT_RCVMTU, ...) on a BT_CONNECTED
L2CAP_MODE_EXT_FLOWCTL socket emits an L2CAP_ECRED_RECONFIGURE_REQ
whose body is 8 bytes (the on-stack pdu local's value) rather than
the expected 6. Three captures from fresh socket / fresh hciemu peer
on the same host -- low bytes vary per call, high 0xffff confirms a
kernel virtual address (KASLR-randomised stack slot, not a fixed
string):
RECONF_REQ body (ident=0x02 len=8): 42 fb 54 af 0e ca ff ff
RECONF_REQ body (ident=0x02 len=8): 52 3d 2e af 0e ca ff ff
RECONF_REQ body (ident=0x02 len=8): b2 fc 5b af 0e ca ff ff
After this patch the body is 6 bytes carrying the expected
little-endian (mtu, mps, scid).
Cc: stable@vger.kernel.org Fixes: 1c08108f3014 ("Bluetooth: L2CAP: Avoid -Wflex-array-member-not-at-end warnings") Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pauli Virtanen [Fri, 24 Apr 2026 19:24:29 +0000 (22:24 +0300)]
Bluetooth: btmtk: accept too short WMT FUNC_CTRL events
MT7925 (USB ID 0e8d:e025) on fw version 20260106153314 sends WMT
FUNC_CTRL events that are missing the status field.
Prior to commit 006b9943b982 ("Bluetooth: btmtk: validate WMT event SKB
length before struct access") the status was read from out-of-bounds of
SKB data, which usually would result in success with
BTMTK_WMT_ON_UNDONE, although I don't know the intent here. The bounds
check added in that commit returns with error instead, producing
"Bluetooth: hci0: Failed to send wmt func ctrl (-22)" and makes the
device unusable.
Fix the regression by interpreting a too-short packet as status
BTMTK_WMT_ON_UNDONE, which makes the device work normally again.
Fixes: 634a4408c061 ("Bluetooth: btmtk: validate WMT event SKB length before struct access") Signed-off-by: Pauli Virtanen <pav@iki.fi> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> # MT7922 (0489:e0e2) Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Jiexun Wang [Wed, 6 May 2026 11:43:30 +0000 (19:43 +0800)]
Bluetooth: serialize accept_q access
bt_sock_poll() walks the accept queue without synchronization, while
child teardown can unlink the same socket and drop its last reference.
The unsynchronized accept queue walk has existed since the initial
Bluetooth import.
Protect accept_q with a dedicated lock for queue updates and polling.
Also rework bt_accept_dequeue() to take temporary child references under
the queue lock before dropping it and locking the child socket.
Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Reported-by: Jann Horn <jannh@google.com> Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
cpufreq/amd-pstate: Drop Kconfig option for dynamic EPP
Some performance issues have been identified with dynamic EPP,
and we don't want distributions turning it on by default and
exposing them to users at this time.
Drop the kconfig option, and require an explicit opt in from kernel
command line or runtime sysfs option to turn it on.
Reported-by: Viktor Jägersküpper <viktor_jaegerskuepper@freenet.de> Closes: https://lore.kernel.org/linux-pm/14a87c99-785c-4b16-bfce-35ecbf053448@freenet.de/ Reported-by: Stuart Meckle <stuartmeckle@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221473 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20260512221947.1652988-1-mario.limonciello@amd.com
(fix sysfs file path) Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Thomas Hellström [Mon, 11 May 2026 16:24:43 +0000 (18:24 +0200)]
drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
Apply the same fix as b2ed01e7ad ("drm/ttm: Fix ttm_bo_swapout()
infinite LRU walk on swapout failure") to the ttm_bo_shrink() path.
Move del_bulk_move from before the backup to after success only,
using ttm_resource_del_bulk_move_unevictable() since the resource
is now unevictable once fully backed up.
Fixes: 70d645deac98 ("drm/ttm: Add helpers for shrinking") Cc: Christian König <christian.koenig@amd.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Dave Airlie <airlied@redhat.com> Cc: dri-devel@lists.freedesktop.org Cc: stable@vger.kernel.org # v6.15+ Assisted-by: GitHub_Copilot:claude-opus-4.6 Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20260511162443.24352-1-thomas.hellstrom@linux.intel.com Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Shouvik Kar [Tue, 12 May 2026 11:02:42 +0000 (16:32 +0530)]
io_uring/net: allow filtering on IORING_OP_CONNECT
This adds custom filtering for IORING_OP_CONNECT, where the target
family is always exposed, and (for AF_INET / AF_INET6) port and
address are exposed. port and v4_addr are in network byte order so
filter authors can compare against on-wire constants.
Skip population unless addr_len covers the populated fields, to
avoid leaking stale io_async_msghdr data on short connects.
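The length check can be sketched in userspace as below. struct connect_view and fill_connect_view() are hypothetical names; only the field names port and v4_addr come from the text above, and the sketch covers AF_INET only. It assumes a POSIX system with <netinet/in.h>.

```c
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical view handed to a filter; fields stay zero unless populated. */
struct connect_view {
	uint16_t family;
	uint16_t port;		/* network byte order */
	uint32_t v4_addr;	/* network byte order */
};

static void fill_connect_view(struct connect_view *v,
			      const struct sockaddr *sa, socklen_t addr_len)
{
	memset(v, 0, sizeof(*v));

	/* The family is exposed whenever the address is long enough to hold it. */
	if (addr_len < offsetof(struct sockaddr, sa_family) + sizeof(sa_family_t))
		return;
	v->family = sa->sa_family;

	/* Only read port and address if addr_len really covers them. */
	if (sa->sa_family == AF_INET && addr_len >= sizeof(struct sockaddr_in)) {
		struct sockaddr_in sin;

		memcpy(&sin, sa, sizeof(sin));
		v->port = sin.sin_port;
		v->v4_addr = sin.sin_addr.s_addr;
	}
}
```

A short connect therefore yields a view with the family set and zeroed port and address, rather than whatever happened to be in the buffer.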
Wrap the io_ring_head_to_buf() macro value in an extra pair of parentheses
so it is safe when composed into larger expressions, and to satisfy
scripts/checkpatch.pl.
Sven Schuchmann [Tue, 12 May 2026 07:19:47 +0000 (09:19 +0200)]
net: phy: DP83TC811: add reading of abilities
At this time the driver is not listing any speeds
it supports. This should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT
for DP83TC811. Add the missing call for phylib to read the abilities.
Fixes: b753a9faaf9a ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy") Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de
[pabeni@redhat.com: dropped revision history] Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Qing Wang [Tue, 12 May 2026 03:50:35 +0000 (11:50 +0800)]
mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()
flush_rcu_sheaves_on_cache() calls queue_work_on() in a
for_each_online_cpu() loop, which requires the cpu to stay online.
But cpus_read_lock() is not held in kvfree_rcu_barrier_on_cache() and the
set of "online cpus" is subject to change.
There are two paths that call flush_rcu_sheaves_on_cache():
// has cpus_read_lock()
flush_all_rcu_sheaves()
-> flush_rcu_sheaves_on_cache()
// no cpus_read_lock()
kvfree_rcu_barrier_on_cache()
-> flush_rcu_sheaves_on_cache()
Fix this by holding cpus_read_lock() in kvfree_rcu_barrier_on_cache().
Why not move cpus_read_lock() from flush_all_rcu_sheaves() into
flush_rcu_sheaves_on_cache()? The reason is it would introduce a new lock
order (slab_mutex -> cpu_hotplug_lock). The reverse order
(cpu_hotplug_lock -> slab_mutex) is established by
The unstripped vDSO files are useful for debugging.
They are provided in the upstream 'linux-headers' package.
Also package them as part of 'make pacman-pkg'.
Make them part of the '-debug' package, as they fit there best.
This differs from the upstream package as that has no '-debug' variant.
Jim Mattson [Tue, 7 Apr 2026 19:03:31 +0000 (12:03 -0700)]
KVM: x86: nSVM: Save/restore gPAT with KVM_{GET,SET}_NESTED_STATE
Add a 'gpat' field to kvm_svm_nested_state_hdr to carry L2's guest PAT
value across save and restore.
When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
guest mode with nested NPT enabled, save vmcb02's g_pat into the header on
KVM_GET_NESTED_STATE, and restore it on KVM_SET_NESTED_STATE.
Host-initiated accesses to IA32_PAT (via KVM_GET/SET_MSRS) always target
L1's hPAT, so they cannot be used to save or restore gPAT. The separate
header field ensures that KVM_GET/SET_MSRS and KVM_GET/SET_NESTED_STATE are
independent and can be ordered arbitrarily during save and restore.
Note that struct kvm_svm_nested_state_hdr is included in a union padded to
120 bytes, so there is room to add the gpat field without changing any
offsets.
Jim Mattson [Tue, 7 Apr 2026 19:03:30 +0000 (12:03 -0700)]
KVM: Documentation: document KVM_{GET,SET}_NESTED_STATE for SVM
Document the nested state constants and structures for SVM that were added
by commit cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and
KVM_SET_NESTED_STATE").
Jim Mattson [Tue, 7 Apr 2026 19:03:29 +0000 (12:03 -0700)]
KVM: x86: nSVM: Save gPAT to vmcb12.g_pat on VMEXIT
According to the APM volume 3 pseudo-code for "VMRUN," when nested paging
is enabled in the vmcb, the guest PAT register (gPAT) is saved to the vmcb
on emulated VMEXIT.
When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and the vCPU is in
guest mode with nested NPT enabled, save the vmcb02 g_pat field to the
vmcb12 g_pat field on emulated VMEXIT.
Jim Mattson [Tue, 7 Apr 2026 19:03:28 +0000 (12:03 -0700)]
KVM: x86: nSVM: Redirect IA32_PAT accesses to either hPAT or gPAT
When handling PAT accesses from L2, route PAT accesses to either hPAT or
gPAT based on whether or not L2 has a separate PAT, i.e. if KVM is actually
emulating gPAT, instead of using L1's PAT for everything. Specifically, if
KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled, the vCPU is in guest mode
with nested NPT enabled, *and* the access is from the guest (i.e. is not
from the host stuffing PAT as part of save/restore), then redirect guest
PAT accesses to the gPAT "register" in vmcb02, i.e. emulate gPAT for L2.
Always route non-guest accesses to hPAT, i.e. L1's PAT in vcpu->arch.pat,
to ensure that KVM_{G,S}ET_MSRS and KVM_{G,S}ET_NESTED_STATE are
independent of each other and can be ordered arbitrarily during save and
restore. E.g. if KVM didn't exempt host accesses, then whether a write to
PAT hit hPAT or gPAT would vary based on whether userspace restores PAT
before or after nested state. Note, gPAT is saved and restored separately
via KVM_{G,S}ET_NESTED_STATE.
WARN if there's a host-initiated access to PAT from within KVM_RUN, i.e. if
KVM itself initiated the access, as there are no such accesses today, and
it's not clear what the "right" behavior would be.
Fixes: 15038e147247 ("KVM: SVM: obey guest PAT") Signed-off-by: Jim Mattson <jmattson@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Jim Mattson [Tue, 7 Apr 2026 19:03:27 +0000 (12:03 -0700)]
KVM: x86: nSVM: Set vmcb02.g_pat correctly for nested NPT
When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and nested NPT is
enabled in vmcb12, copy the (cached and validated) vmcb12 g_pat field to
vmcb02's g_pat, giving L2 its own independent guest PAT register.
When the quirk is enabled (default), or when NPT is enabled but nested NPT
is disabled, copy L1's IA32_PAT MSR to the vmcb02 g_pat field, since L2
shares the IA32_PAT MSR with L1.
When NPT is disabled, the g_pat field is ignored by hardware.
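The selection rule described above boils down to a small decision function. This is a sketch of the rule as stated in the message, with a hypothetical helper name and plain parameters instead of the real vCPU and VMCB structures.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the value for vmcb02's g_pat.
 *   shared_pat_quirk: true while KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is enabled
 *   nested_npt:       true when nested NPT is enabled in vmcb12
 */
static uint64_t pick_vmcb02_g_pat(bool shared_pat_quirk, bool nested_npt,
				  uint64_t vmcb12_g_pat, uint64_t l1_pat)
{
	/* L2 gets its own PAT only with the quirk off and nested NPT on. */
	if (!shared_pat_quirk && nested_npt)
		return vmcb12_g_pat;

	/* Otherwise L2 shares L1's IA32_PAT. */
	return l1_pat;
}
```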
Jim Mattson [Tue, 7 Apr 2026 19:03:26 +0000 (12:03 -0700)]
KVM: x86: nSVM: Cache and validate vmcb12 g_pat
When KVM_X86_QUIRK_NESTED_SVM_SHARED_PAT is disabled and nested paging is
enabled in vmcb12, validate g_pat at emulated VMRUN and cause an immediate
VMEXIT with exit code VMEXIT_INVALID if it is invalid, as specified in the
APM, volume 2: "Nested Paging and VMRUN/VMEXIT."
Jim Mattson [Tue, 7 Apr 2026 19:03:25 +0000 (12:03 -0700)]
KVM: x86: nSVM: Clear VMCB_NPT clean bit when updating hPAT from guest mode
When running an L2 guest and writing to MSR_IA32_CR_PAT, the host PAT value
is stored in both vmcb01's g_pat field and vmcb02's g_pat field, but the
clean bit was only being cleared for vmcb02.
Introduce the helper vmcb_set_gpat() which sets vmcb->save.g_pat and marks
the VMCB dirty for VMCB_NPT. Use this helper in both svm_set_msr() for
updating vmcb01 and in nested_vmcb02_compute_g_pat() for updating vmcb02,
ensuring both VMCBs' NPT fields are properly marked dirty.
Define a quirk to control whether nested SVM shares L1's PAT with L2
(legacy behavior) or gives L2 its own independent gPAT (correct behavior
per the APM).
When the quirk is enabled (default), L2 shares L1's PAT, preserving the
legacy KVM behavior. When userspace disables the quirk, KVM correctly
virtualizes the PAT for nested SVM guests, giving L2 a separate gPAT as
specified in the AMD architecture.
Jonathan Corbet [Wed, 13 May 2026 20:58:53 +0000 (14:58 -0600)]
docs: threat-model: don't limit root capabilities to CAP_SYS_ADMIN
The threat-model document says that only users with CAP_SYS_ADMIN can carry
out a number of admin-level tasks, but there are numerous capabilities that
can confer that sort of power. Generalize the text slightly to make it
clear that CAP_SYS_ADMIN is not the only all-powerful capability.
Acked-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Andy Shevchenko [Wed, 13 May 2026 20:48:55 +0000 (22:48 +0200)]
RAS/AMD/ATL: Drop malformed default N from Kconfig
The capital letters are for symbols and N in 'default N' will be evaluated as
another, nonexistent, Kconfig symbol, and not as the 'no' it should be. More
importantly, 'n' *is* the default already. Hence just drop the malformed line.
Jakub Kicinski [Mon, 11 May 2026 17:49:18 +0000 (10:49 -0700)]
net: tls: prevent chain-after-chain in plain text SG
Sashiko points out that if end = 0 (start != 0) the current
code will create a chain link to content type right after
the wrap link:
This would create a chain where the wrap link points directly
to another chain link. The scatterlist API sg_next iterator
does not recursively resolve consecutive chain links,
meaning this is illegal input to crypto.
The wrapping link is unnecessary if end = 0. end is the entry after
the last one used so end = 0 means there's nothing pushed after
the wrap:
end start i
v v v
[ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap]
Skip the wrapping in this case.
TLS 1.3 can use the "wrapping slot" for its chaining if end = 0.
This avoids the chain-after-chain.
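The condition for skipping the wrap link can be written down as a pair of predicates. This is a sketch of the rule as described here, using hypothetical helper names and bare ring indices rather than the kernel's sk_msg structures.

```c
#include <stdbool.h>

/* The used entries run past the end of the array when end is below start. */
static bool ring_wrapped(unsigned int start, unsigned int end)
{
	return end < start;
}

/*
 * A wrap link is needed only if entries actually follow the wrap.
 * end is the entry after the last one used, so end == 0 means nothing
 * was pushed at the front and the reserved slot stays free.
 */
static bool needs_wrap_link(unsigned int start, unsigned int end)
{
	return ring_wrapped(start, end) && end != 0;
}
```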
Move the wrap chaining before marking END and chaining off content
type. That feels like a more logical ordering to me, but should not
matter from a functional perspective.
Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 11 May 2026 17:49:17 +0000 (10:49 -0700)]
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
When an sk_msg scatterlist ring wraps (sg.end < sg.start),
tls_push_record() chains the tail portion of the ring to the head
using sg_chain(). An extra entry in the sg array is reserved for
this:
struct sk_msg_sg {
[...]
/* The extra two elements:
* 1) used for chaining the front and sections when the list becomes
* partitioned (e.g. end < start). The crypto APIs require the
* chaining;
* 2) to chain tailer SG entries after the message.
*/
struct scatterlist data[MAX_MSG_FRAGS + 2];
The current code uses MAX_SKB_FRAGS + 1 as the ring size:
instead of the true last entry. This is likely due to a "race" of
the commit under Fixes landing close to
commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on psock down")
Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested
by Sabrina).
Reported-by: 钱一铭 <yimingqian591@gmail.com> Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Stepan Ionichev [Wed, 13 May 2026 09:09:00 +0000 (14:09 +0500)]
clk: scpi: Unregister child clock providers on remove
SCPI clock providers are registered for each child node in
scpi_clk_add(), but scpi_clocks_remove() unregisters the parent node on
each iteration.
of_clk_del_provider() matches providers by the node used at registration
time, so passing the parent node leaves the child providers registered.
This leaks the provider allocations and the node references held by the
clock provider core.
Pass the child node to of_clk_del_provider() so the remove path matches
the probe path.
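Why passing the parent node is a no-op can be seen with a toy registry that, like the description above, matches providers by the node they were registered with. Everything in this sketch (struct node, add_provider(), del_provider(), count_providers()) is invented for illustration and is far simpler than the clock provider core.

```c
#include <stddef.h>

#define MAX_PROVIDERS 8

struct node {
	const char *name;
};

/* Registered providers, identified by the node used at registration time. */
static const struct node *providers[MAX_PROVIDERS];

static int add_provider(const struct node *np)
{
	for (size_t i = 0; i < MAX_PROVIDERS; i++) {
		if (!providers[i]) {
			providers[i] = np;
			return 0;
		}
	}
	return -1;
}

/* Removes the first provider registered with exactly this node, if any. */
static void del_provider(const struct node *np)
{
	for (size_t i = 0; i < MAX_PROVIDERS; i++) {
		if (providers[i] == np) {
			providers[i] = NULL;
			return;
		}
	}
}

static size_t count_providers(void)
{
	size_t n = 0;

	for (size_t i = 0; i < MAX_PROVIDERS; i++)
		if (providers[i])
			n++;
	return n;
}
```

Registering two children and then deleting by the parent twice leaves both entries in place; deleting by each child empties the registry.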
Fixes: cd52c2a4b5c4 ("clk: add support for clocks provided by SCP(System Control Processor)") Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com> Link: https://patch.msgid.link/20260513090900.5323-1-sozdayvek@gmail.com
(sudeep.holla: Updated commit title and message a bit) Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>
drm/ttm: Convert -EAGAIN from dmem_cgroup_try_charge to -ENOSPC
dmem_cgroup_try_charge() returns -EAGAIN when the cgroup limit is
hit and the charge fails. TTM has no concept of -EAGAIN from resource
allocation; -ENOSPC is the canonical error meaning "no space, try
eviction". Convert at the source in ttm_resource_alloc() so no caller
needs to handle an unexpected error code, and clean up the now-redundant
-EAGAIN check in ttm_bo_alloc_resource().
Without this, -EAGAIN escaping ttm_resource_alloc() during an eviction
walk causes the walk to terminate early instead of continuing to the
next candidate.
Cc: Friedrich Vock <friedrich.vock@gmx.de> Cc: Maarten Lankhorst <dev@lankhorst.se> Cc: Tejun Heo <tj@kernel.org> Cc: Maxime Ripard <mripard@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: dri-devel@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v6.14+ Fixes: 2b624a2c1865 ("drm/ttm: Handle cgroup based eviction in TTM") Assisted-by: GitHub_Copilot:claude-sonnet-4.6 Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Maarten Lankhorst <dev@lankhrost.se> Link: https://patch.msgid.link/20260508160920.230339-1-thomas.hellstrom@linux.intel.com
====================
bridge: Add selective forwarding of gratuitous neighbor announcements
The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.
This series adds a new neigh_forward_grat option that provides
independent control of gratuitous ARP and unsolicited NA forwarding.
When neigh_suppress is enabled and neigh_forward_grat is also enabled,
regular neighbor discovery is suppressed while gratuitous announcements
are forwarded.
The implementation marks gratuitous ARPs and unsolicited NAs in
BR_INPUT_SKB_CB during input processing, then checks the per-output-port
neigh_forward_grat setting during flooding. This allows gratuitous
announcements from any input port to be selectively forwarded based on
each output port's individual configuration.
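The per-output-port decision can be sketched as two small predicates. The helper names are hypothetical, and arp_is_gratuitous() uses the common definition of a gratuitous ARP (sender IP equals target IP), which is an assumption here; the series' actual classification may check more than that.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed definition: a gratuitous ARP announces the sender's own address. */
static bool arp_is_gratuitous(uint32_t sender_ip, uint32_t target_ip)
{
	return sender_ip == target_ip;
}

/*
 * Decide, for one output port, whether a neighbor packet is suppressed.
 * is_grat is the mark set during input processing.
 */
static bool should_suppress(bool neigh_suppress, bool neigh_forward_grat,
			    bool is_grat)
{
	if (!neigh_suppress)
		return false;	/* suppression is off for this port */
	if (is_grat && neigh_forward_grat)
		return false;	/* gratuitous announcements are let through */
	return true;		/* regular neighbor discovery stays suppressed */
}
```

With neigh_forward_grat left at its default of off, the function reduces to the existing behavior: everything is suppressed whenever neigh_suppress is on.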
Both port-level control (via IFLA_BRPORT_NEIGH_FORWARD_GRAT) and
per-VLAN control (via BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT) are
provided. The default value of OFF preserves existing behavior.
This behavior is in accordance with RFC 9161 (Section 3.6), which
recommends that VTEPs forward gratuitous ARP and unsolicited NA messages
to avoid traffic disruption during host mobility events.
The new attributes use NLA_U8, although the kernel netlink guideline
recommends NLA_U32 as the minimum integer type on the grounds that
alignment makes smaller types equivalent on the wire. For a simple
on/off attribute there is no technical advantage to u32 over u8, and
keeping u8 preserves consistency with all surrounding bridge port
attributes and avoids introducing new helpers alongside the existing
infrastructure.
Patchset overview:
Patch #1: adds uapi headers.
Patches #2-#3: support selective forwarding of gratuitous ARP.
Patches #4-#5: add netlink handling.
Patch #6: adds tests.
Please see iproute related patches in the last 3 commits of:
https://github.com/daniellerts/iproute2
====================
Danielle Ratson [Mon, 11 May 2026 06:59:36 +0000 (09:59 +0300)]
selftests: net: Add tests for neigh_forward_grat option
Add tests to validate the neigh_forward_grat bridge option for selective
forwarding of gratuitous neighbor announcements.
The tests verify per-port and per-VLAN control of gratuitous neighbor
announcement forwarding for both IPv4 (gratuitous ARP) and IPv6
(unsolicited NA):
- When neigh_suppress is enabled with neigh_forward_grat off (default),
gratuitous announcements are suppressed
- When neigh_forward_grat is enabled, gratuitous announcements are
forwarded while regular neighbor discovery remains suppressed
For IPv4, use arping to send gratuitous ARP packets. For IPv6, use
mausezahn to craft unsolicited Neighbor Advertisement packets.
For the per-port tests, the IPv4 test exercises the ip link interface,
while the IPv6 test exercises the bridge link interface.
The per-VLAN tests use the bridge interface throughout, as per-VLAN
attributes are only accessible via 'bridge vlan'.
Danielle Ratson [Mon, 11 May 2026 06:59:35 +0000 (09:59 +0300)]
bridge: Add per-VLAN netlink handling for neigh_forward_grat
Add netlink handlers for the per-VLAN neigh_forward_grat option via
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT attribute.
The per-VLAN option provides fine-grained control, allowing different
VLANs on the same port to have different gratuitous ARP/unsolicited NA
forwarding behavior.
This enables control via 'bridge' commands:
# bridge vlan set dev eth0 vid 10 neigh_suppress on
# bridge vlan set dev eth0 vid 10 neigh_forward_grat on
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-6-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Danielle Ratson [Mon, 11 May 2026 06:59:34 +0000 (09:59 +0300)]
bridge: Add port-level netlink handling for neigh_forward_grat
Add netlink handlers for the port-level neigh_forward_grat option via
IFLA_BRPORT_NEIGH_FORWARD_GRAT attribute.
The default value of OFF preserves existing behavior, i.e. gratuitous ARP
and unsolicited NA are suppressed when neigh_suppress is enabled. Users can
explicitly set it to ON to allow these packets through.
Example for enabling control via 'bridge link' command:
# bridge link set dev eth0 neigh_suppress on
# bridge link set dev eth0 neigh_forward_grat on
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-5-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Danielle Ratson [Mon, 11 May 2026 06:59:33 +0000 (09:59 +0300)]
bridge: Add selective forwarding of gratuitous neighbor announcements
The existing neighbor suppression unconditionally suppresses gratuitous
ARPs and unsolicited Neighbor Advertisements, which prevents fast
mobility of hosts between VTEPs.
Add the neigh_forward_grat option to allow selective control of gratuitous
neighbor announcements. When neigh_suppress is enabled but
neigh_forward_grat is disabled (default), gratuitous announcements are
suppressed. When neigh_forward_grat is enabled, gratuitous announcements
are forwarded while regular neighbor discovery remains suppressed.
The implementation provides per-output-port control by:
1. Adding a 'grat_arp' flag to BR_INPUT_SKB_CB to mark gratuitous ARPs and
unsolicited NAs.
2. Setting both grat_arp and proxyarp_replied flags in
br_do_proxy_suppress_arp() and br_do_suppress_nd() when gratuitous
packets are detected.
3. Checking neigh_forward_grat per output port during flooding:
- For gratuitous ARPs/NAs: suppress unless the output port has
neigh_forward_grat enabled.
- For regular ARPs/NDs: maintain existing behavior.
This allows gratuitous announcements from any input port to be selectively
forwarded based on each output port's individual neigh_forward_grat
setting, enabling gratuitous neighbor announcements to be flooded to the
VXLAN fabric.
Regular neighbor discovery (ARP requests, NS queries, solicited replies)
remains controlled by neigh_suppress and is unaffected.
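The per-output-port decision described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct pkt_flags, struct out_port and flood_to_port() are hypothetical names, not the bridge code itself, and the model ignores the proxy ARP port flags that the real flooding path may also consult.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-packet flags kept in BR_INPUT_SKB_CB. */
struct pkt_flags {
	bool proxyarp_replied;	/* packet was marked by neighbor suppression */
	bool grat_arp;		/* gratuitous ARP or unsolicited NA */
};

/* Hypothetical stand-in for the per-output-port options. */
struct out_port {
	bool neigh_suppress;
	bool neigh_forward_grat;
};

/* Return true if the packet should be flooded to output port p. */
static bool flood_to_port(const struct pkt_flags *cb, const struct out_port *p)
{
	/* Ports without suppression, and unmarked packets, flood as before. */
	if (!p->neigh_suppress || !cb->proxyarp_replied)
		return true;

	/* Marked packet on a suppressing port: only gratuitous ones may pass. */
	return cb->grat_arp && p->neigh_forward_grat;
}
```

In this model a gratuitous packet reaches only those suppressing ports that opted in, while a marked regular ARP/ND packet stays suppressed regardless of neigh_forward_grat.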
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260511065936.4173106-4-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add netlink attributes for controlling gratuitous ARP and unsolicited NA
forwarding when neighbor suppression is enabled.
Add IFLA_BRPORT_NEIGH_FORWARD_GRAT for port-level control and
BRIDGE_VLANDB_ENTRY_NEIGH_FORWARD_GRAT for per-VLAN control.
The new attributes provide independent control of gratuitous ARP and
unsolicited NA packets. Operators can enable forwarding for those packets
for fast mobility across VTEPs while keeping general neighbor suppression
active.
Thomas Weißschuh [Wed, 22 Apr 2026 09:42:32 +0000 (11:42 +0200)]
vdso/gettimeofday: Reload sequence counter after switch to time page in do_aux()
After switching to the real data pages, the sequence counter needs to be
reloaded from there. The code using vdso_read_begin_timens() assumed
this worked by 'continue' jumping to the *beginning* of the do-while
retry loop. However the 'continue' jumps to the *end* of said loop,
evaluating the exit condition. If the data page has a sequence counter
of '1' it will match the one from the time namespace page and prematurely
exit the retry loop. This would result in garbage returned to the caller.
Reload the sequence counter after switching the pages by using an inner
while loop again, which will loop at most once.
The loop generates slightly better code than an explicit reload through
'seq = vdso_read_begin()'.
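The control-flow pitfall can be reproduced with a simplified userspace model. The names below (struct data_page, read_buggy(), read_fixed(), NO_VALUE) are hypothetical and stand in for the vDSO data pages and helpers. The sequence handling is reduced to a plain comparison, so this sketches only the 'continue' semantics, not the actual vDSO code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a vDSO data page. */
struct data_page {
	unsigned int seq;	/* sequence counter */
	bool is_timens;		/* true for the time namespace page */
	int value;		/* payload protected by seq */
};

#define NO_VALUE (-1)		/* marks "nothing was read" in this model */

/* Buggy pattern: 'continue' jumps to the loop condition, not to the top. */
static int read_buggy(const struct data_page *p, const struct data_page *real)
{
	unsigned int seq;
	int val = NO_VALUE;

	do {
		seq = p->seq;
		if (p->is_timens) {
			p = real;
			continue;	/* seq still holds the timens counter */
		}
		val = p->value;
	} while (p->seq != seq);

	return val;
}

/* Fixed pattern: an inner loop reloads seq after switching pages. */
static int read_fixed(const struct data_page *p, const struct data_page *real)
{
	unsigned int seq;
	int val;

	do {
		while ((seq = p->seq, p->is_timens))
			p = real;	/* body runs at most once */
		val = p->value;
	} while (p->seq != seq);

	return val;
}
```

When both pages carry a counter of 1, read_buggy() leaves the loop without ever reading the payload, whereas read_fixed() reloads the counter from the real page and returns its value.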
Fixes: ed78b7b2c5ae ("vdso/gettimeofday: Add a helper to read the sequence lock of a time namespace aware clock")
Reported-by: Ricardo Ribalda <ribalda@chromium.org>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://patch.msgid.link/20260422-vdso-aux-timens-loop-v1-1-e2dd8c7164cc@linutronix.de
Closes: https://lore.kernel.org/lkml/CANiDSCsOy0P1if-gJZqOM5pTJ0RDcwVfru1B7KFbTOEMqjPKJw@mail.gmail.com/
Xiang Mei [Mon, 11 May 2026 06:21:38 +0000 (23:21 -0700)]
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is
reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt()
populates V2 entries starting at index 1, so when no V1 device is
selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] ==
NULL and ism_chid[0] == 0.
smc_v2_determine_accepted_chid() then matches the peer's CHID against
the array starting from index 0 using the CHID alone. A malicious
peer replying to an SMC-Dv2-only proposal with d1.chid == 0 matches
the empty slot, ini->ism_selected becomes 0, and the subsequent
ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at
offsetof(struct smcd_dev, lgr_lock) == 0x68:
BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0
Write of size 4 at addr 0000000000000068 by task exploit/144
Call Trace:
_raw_spin_lock_bh
smc_conn_create (net/smc/smc_core.c:1997)
__smc_connect (net/smc/af_smc.c:1447)
smc_connect (net/smc/af_smc.c:1720)
__sys_connect
__x64_sys_connect
do_syscall_64
Require ism_dev[i] to be non-NULL before accepting a CHID match.
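A minimal sketch of the stricter match follows, written as a userspace model. The types, MAX_DEVS and match_accepted_chid() are hypothetical simplifications that borrow only the field names mentioned above; this is not the net/smc code.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVS 4	/* hypothetical array size for this model */

struct dev {
	int unused;	/* placeholder for the real device state */
};

/* Hypothetical stand-in for the relevant part of the init info. */
struct init_info {
	struct dev *ism_dev[MAX_DEVS];
	uint16_t ism_chid[MAX_DEVS];
	int ism_selected;
};

/*
 * Select the slot whose CHID equals the peer's CHID. Empty slots are
 * skipped, so a zero-initialized slot 0 can no longer match CHID 0.
 * Returns 0 on success and -1 if no populated slot matches.
 */
static int match_accepted_chid(struct init_info *ini, uint16_t peer_chid)
{
	for (int i = 0; i < MAX_DEVS; i++) {
		if (ini->ism_dev[i] && ini->ism_chid[i] == peer_chid) {
			ini->ism_selected = i;
			return 0;
		}
	}
	return -1;
}
```

With slot 0 left empty and a V2 device in slot 1, a peer CHID of 0 is rejected in this model instead of selecting the NULL entry.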
Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add 64-bit counters for each impairment netem applies (delay, loss,
ECN marking, corruption, duplication, reordering) and for skb
allocation failures during enqueue. The counters are exposed through
TCA_STATS_APP as struct tc_netem_xstats.
Counters increment when an impairment occurs, independent of later
events that may mask its on-wire effect. The allocation_errors counter
(similar to sch_fq) accounts for impairments that could not be
applied due to memory pressure, etc.
net/sched: netem: handle multi-segment skb in corruption
The packet corruption code only flipped bits in the linear
header portion of the skb, skipping corruption when
skb_headlen() was zero.
Linearize the whole skb if necessary before corruption.
Extends d64cb81dcbd5 ("net/sched: sch_netem: fix out-of-bounds access
in packet corruption") with a more general solution.
net/sched: netem: replace pr_info with netlink extack error messages
Use netlink extack to report errors instead of sending them
to the kernel log with pr_info(). The error message can then be seen
with tc commands, which also avoids log spam.
The current layout of struct netem_sched_data can be improved
by optimizing cache locality, compacting data types (use u8
for enum) and eliminating unused elements.
Reorganize the struct as follows:
- Cacheline 0 holds the tfifo state (t_root/t_head/t_tail/t_len),
counter, and the unconditional enqueue scalars
latency/jitter/rate/gap/loss.
- Cacheline 1 holds the remaining zero-check scalars
(duplicate/reorder/corrupt/ecn), all five crndstate correlation
structures, and loss_model.
- Cacheline 2 holds prng, delay_dist, the slot dequeue state,
slot_dist, and the inner classful qdisc pointer.
- Rate-shaping fields, q->limit (config-only; the fast path reads
sch->limit), and the CLG Markov state move to the warm tail.
- tc_netem_slot slot_config and qdisc_watchdog (only consulted on
slot reschedule and watchdog wake) move to the cold tail.
Also reorder struct clgstate to place the u8 state member after the
u32 transition probabilities. This removes the 3-byte interior hole
without changing the struct's size.
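The effect of the clgstate reordering can be checked with a small layout sketch. The two structs below are assumed stand-ins (one u8 state plus five u32 values, named a1..a5 here for illustration), and the offsets hold on common ABIs where uint32_t is 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout before the change: the u8 comes first. */
struct clg_before {
	uint8_t state;			/* followed by a 3-byte interior hole */
	uint32_t a1, a2, a3, a4, a5;
};

/* Assumed layout after the change: the u8 moves to the end. */
struct clg_after {
	uint32_t a1, a2, a3, a4, a5;
	uint8_t state;			/* followed by 3 bytes of tail padding */
};

/* The u32 values start at offset 4 before and at offset 0 after. */
_Static_assert(offsetof(struct clg_before, a1) == 4, "interior hole expected");
_Static_assert(offsetof(struct clg_after, a1) == 0, "no leading hole expected");
_Static_assert(offsetof(struct clg_after, state) == 20, "state follows the u32s");

/* Alignment padding moves to the tail, so the total size is unchanged. */
_Static_assert(sizeof(struct clg_before) == sizeof(struct clg_after),
	       "size must not change");
```

The padding does not disappear. It moves from between the members to the end of the struct, which is why the size stays the same while the interior hole goes away.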