smb: smbdirect: add some logging to SMBDIRECT_CHECK_STATUS_{WARN,DISCONNECT}()
This should make it easier to analyze any possible problems.
Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
This will be used by client and server in order to keep controlling
the logging when we move to shared functions.
Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
smb: smbdirect: let smbdirect.h include #include <linux/types.h>
This will make it easier to use.
Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd: fix use-after-free from async crypto on Qualcomm crypto engine
ksmbd_crypt_message() sets a NULL completion callback on AEAD requests
and does not handle the -EINPROGRESS return code from async hardware
crypto engines like the Qualcomm Crypto Engine (QCE). When QCE returns
-EINPROGRESS, ksmbd treats it as an error and immediately frees the
request while the hardware DMA operation is still in flight. The DMA
completion callback then dereferences freed memory, causing a NULL
pointer crash:
pc : qce_skcipher_done+0x24/0x174
lr : vchan_complete+0x230/0x27c
...
el1h_64_irq+0x68/0x6c
ksmbd_free_work_struct+0x20/0x118 [ksmbd]
ksmbd_exit_file_cache+0x694/0xa4c [ksmbd]
Use the standard crypto_wait_req() pattern with crypto_req_done() as
the completion callback, matching the approach used by the SMB client
in fs/smb/client/smb2ops.c. This properly handles both synchronous
engines (immediate return) and async engines (-EINPROGRESS followed
by callback notification).
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Link: https://github.com/openwrt/openwrt/issues/21822 Signed-off-by: Joshua Klinesmith <joshuaklinesmith@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd: fix mechToken leak when SPNEGO decode fails after token alloc
The kernel ASN.1 BER decoder calls action callbacks incrementally as it
walks the input. When ksmbd_decode_negTokenInit() reaches the mechToken
[2] OCTET STRING element, ksmbd_neg_token_alloc() allocates
conn->mechToken immediately via kmemdup_nul(). If a later element in
the same blob is malformed, then the decoder will return nonzero after
the allocation is already live. This could happen if mechListMIC [3]
overrunse the enclosing SEQUENCE.
decode_negotiation_token() then sets conn->use_spnego = false because
both the negTokenInit and negTokenTarg grammars failed. The cleanup at
the bottom of smb2_sess_setup() is gated on use_spnego:
if (conn->use_spnego && conn->mechToken) {
kfree(conn->mechToken);
conn->mechToken = NULL;
}
so the kfree is skipped, causing the mechToken to never be freed.
This codepath is reachable pre-authentication, so untrusted clients can
cause slow memory leaks on a server without even being properly
authenticated.
Fix this up by not checking check for use_spnego, as it's not required,
so the memory will always be properly freed. At the same time, always
free the memory in ksmbd_conn_free() incase some other failure path
forgot to free it.
Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: <stable@kernel.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd: require 3 sub-authorities before reading sub_auth[2]
parse_dacl() compares each ACE SID against sid_unix_NFS_mode and on
match reads sid.sub_auth[2] as the file mode. If sid_unix_NFS_mode is
the prefix S-1-5-88-3 with num_subauth = 2 then compare_sids() compares
only min(num_subauth, 2) sub-authorities so a client SID with
num_subauth = 2 and sub_auth = {88, 3} will match.
If num_subauth = 2 and the ACE is placed at the very end of the security
descriptor, sub_auth[2] will be 4 bytes past end_of_acl. The
out-of-band bytes will then be masked to the low 9 bits and applied as
the file's POSIX mode, probably not something that is good to have
happen.
Fix this up by forcing the SID to actually carry a third sub-authority
before reading it at all.
Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: <stable@kernel.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
smb2_get_ea() reads ea_req->EaNameLength from the client request and
passes it directly to strncmp() as the comparison length without
verifying that the length of the name really is the size of the input
buffer received.
Fix this up by properly checking the size of the name based on the value
received and the overall size of the request, to prevent a later
strncmp() call to use the length as a "trusted" size of the buffer.
Without this check, uninitialized heap values might be slowly leaked to
the client.
Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: <stable@kernel.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd: validate owner of durable handle on reconnect
Currently, ksmbd does not verify if the user attempting to reconnect
to a durable handle is the same user who originally opened the file.
This allows any authenticated user to hijack an orphaned durable handle
by predicting or brute-forcing the persistent ID.
According to MS-SMB2, the server MUST verify that the SecurityContext
of the reconnect request matches the SecurityContext associated with
the existing open.
Add a durable_owner structure to ksmbd_file to store the original opener's
UID, GID, and account name. and catpure the owner information when a file
handle becomes orphaned. and implementing ksmbd_vfs_compare_durable_owner()
to validate the identity of the requester during SMB2_CREATE (DHnC).
Fixes: c8efcc786146 ("ksmbd: add support for durable handles v1/v2") Reported-by: Davide Ornaghi <d.ornaghi97@gmail.com> Reported-by: Navaneeth K <knavaneeth786@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd: fix use-after-free in __ksmbd_close_fd() via durable scavenger
When a durable file handle survives session disconnect (TCP close without
SMB2_LOGOFF), session_fd_check() sets fp->conn = NULL to preserve the
handle for later reconnection. However, it did not clean up the byte-range
locks on fp->lock_list.
Later, when the durable scavenger thread times out and calls
__ksmbd_close_fd(NULL, fp), the lock cleanup loop did:
spin_lock(&fp->conn->llist_lock);
This caused a slab use-after-free because fp->conn was NULL and the
original connection object had already been freed by
ksmbd_tcp_disconnect().
The root cause is asymmetric cleanup: lock entries (smb_lock->clist) were
left dangling on the freed conn->lock_list while fp->conn was nulled out.
To fix this issue properly, we need to handle the lifetime of
smb_lock->clist across three paths:
- Safely skip clist deletion when list is empty and fp->conn is NULL.
- Remove the lock from the old connection's lock_list in
session_fd_check()
- Re-add the lock to the new connection's lock_list in
ksmbd_reopen_durable_fd().
Fixes: c8efcc786146 ("ksmbd: add support for durable handles v1/v2") Co-developed-by: munan Huang <munanevil@gmail.com> Signed-off-by: munan Huang <munanevil@gmail.com> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
ZhangGuoDong [Tue, 3 Mar 2026 15:13:16 +0000 (15:13 +0000)]
smb: move filesystem_vol_info into common/fscc.h
The structure definition on the server side is specified in MS-CIFS
2.2.8.2.3, but we should instead refer to MS-FSCC 2.5.9, just as the
client side does.
Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Reviewed-by: Steve French <stfrench@microsoft.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
ZhangGuoDong [Tue, 3 Mar 2026 15:13:14 +0000 (15:13 +0000)]
smb: move some definitions from common/smb2pdu.h into common/fscc.h
These definitions are specified in MS-FSCC, so move them into fscc.h.
Only add some documentation references, no other changes.
Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Reviewed-by: Steve French <stfrench@microsoft.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
====================
bpf: fix short IPv4/IPv6 handling in test_run_skb
bpf_prog_test_run_skb() may access IPv4/IPv6 network headers based on
skb->protocol even when the provided test input only contains an
Ethernet header.
Fix it by rejecting such short IPv4/IPv6 inputs before accessing the
L3 headers, and add a selftest that exercises the reported
bpf_skb_adjust_room() path on ETH_HLEN-sized IPv4/IPv6 EtherType
inputs.
Changes in v4:
- Split the selftests into a separate patch.
- Rework the selftest to actually execute a BPF program calling
bpf_skb_adjust_room().
- Reuse a single struct ethhdr eth_hlen and initialize h_proto from
the test case table.
- Add the Fixes tag to the test_run.c patch.
Sun Jian [Wed, 8 Apr 2026 03:46:23 +0000 (11:46 +0800)]
selftests/bpf: cover short IPv4/IPv6 inputs with adjust_room
Add a selftest covering ETH_HLEN-sized IPv4/IPv6 EtherType inputs for
bpf_prog_test_run_skb().
Reuse a single zero-initialized struct ethhdr eth_hlen and set
eth_hlen.h_proto from the per-test h_proto field.
Also add a dedicated tc_adjust_room program and route the short
IPv4/IPv6 cases to it, so the selftest actually exercises the
bpf_skb_adjust_room() path from the report.
====================
net: fix skb_ext BUILD_BUG_ON failures with GCOV
This mini-series fixes build failures in net/core/skbuff.c when the
kernel is built with CONFIG_GCOV_PROFILE_ALL=y.
This is part of a larger effort to add -fprofile-update=atomic to
global CFLAGS_GCOV (posted earlier as a combined series):
https://lore.kernel.org/lkml/20260401142020.1434243-1-khorenko@virtuozzo.com/T/#t
That combined series was split per subsystem as requested by Jakub.
The companion patches are:
- iommu: use __always_inline for amdv1pt_install_leaf_entry()
(sent to iommu maintainers)
- gcov: add -fprofile-update=atomic globally (sent to gcov/kbuild
maintainers, depends on this series and the iommu patch)
Patch 1/2 fixes a pre-existing build failure with CONFIG_GCOV_PROFILE_ALL:
GCOV counters prevent GCC from constant-folding the skb_ext_total_length()
loop. It also removes the CONFIG_KCOV_INSTRUMENT_ALL preprocessor guard
from d6e5794b06c0: that guard was a precaution in case KCOV instrumentation
also prevented constant folding, but KCOV's -fsanitize-coverage=trace-pc
does not interfere with GCC's constant folding (verified experimentally
with GCC 14.2 and GCC 16.0.1), so the guard is unnecessary.
Patch 2/2 is an additional fix needed when -fprofile-update=atomic is
added to CFLAGS_GCOV: __no_profile on the __always_inline function alone
is insufficient because after inlining, the code resides in the caller's
profiled body. The caller (skb_extensions_init) needs __no_profile and
noinline to prevent re-exposure to GCOV instrumentation.
====================
net: add noinline __init __no_profile to skb_extensions_init() for GCOV compatibility
With -fprofile-update=atomic in global CFLAGS_GCOV, GCC still cannot
constant-fold the skb_ext_total_length() loop when it is inlined into a
profiled caller. The existing __no_profile on skb_ext_total_length()
itself is insufficient because after __always_inline expansion the code
resides in the caller's body, which still carries GCOV instrumentation.
Mark skb_extensions_init() with __no_profile so the BUILD_BUG_ON checks
can be evaluated at compile time. Also mark it noinline to prevent the
compiler from inlining it into skb_init() (which lacks __no_profile),
which would re-expose the function body to GCOV instrumentation.
Add __init since skb_extensions_init() is only called from __init
skb_init(). Previously it was implicitly inlined into the .init.text
section; with noinline it would otherwise remain in permanent .text,
wasting memory after boot.
Build-tested with both CONFIG_GCOV_PROFILE_ALL=y and
CONFIG_KCOV_INSTRUMENT_ALL=y.
net: fix skb_ext_total_length() BUILD_BUG_ON with CONFIG_GCOV_PROFILE_ALL
When CONFIG_GCOV_PROFILE_ALL=y is enabled, the kernel fails to build:
In file included from <command-line>:
In function 'skb_extensions_init',
inlined from 'skb_init' at net/core/skbuff.c:5214:2:
././include/linux/compiler_types.h:706:45: error: call to
'__compiletime_assert_1490' declared with attribute error:
BUILD_BUG_ON failed: skb_ext_total_length() > 255
CONFIG_GCOV_PROFILE_ALL adds -fprofile-arcs -ftest-coverage
-fno-tree-loop-im to CFLAGS globally. GCC inserts branch profiling
counters into the skb_ext_total_length() loop and, combined with
-fno-tree-loop-im (which disables loop invariant motion), cannot
constant-fold the result.
BUILD_BUG_ON requires a compile-time constant and fails.
The issue manifests in kernels with 5+ SKB extension types enabled
(e.g., after addition of SKB_EXT_CAN, SKB_EXT_PSP). With 4 extensions
GCC can still unroll and fold the loop despite GCOV instrumentation;
with 5+ it gives up.
Mark skb_ext_total_length() with __no_profile to prevent GCOV from
inserting counters into this function. Without counters the loop is
"clean" and GCC can constant-fold it even with -fno-tree-loop-im active.
This allows BUILD_BUG_ON to work correctly while keeping GCOV profiling
for the rest of the kernel.
This also removes the CONFIG_KCOV_INSTRUMENT_ALL preprocessor guard
introduced by d6e5794b06c0. That guard was added as a precaution because
KCOV instrumentation was also suspected of inhibiting constant folding.
However, KCOV uses -fsanitize-coverage=trace-pc, which inserts
lightweight trace callbacks that do not interfere with GCC's constant
folding or loop optimization passes. Only GCOV's -fprofile-arcs combined
with -fno-tree-loop-im actually prevents the compiler from evaluating
the loop at compile time. The guard is therefore unnecessary and can be
safely removed.
Fixes: 96ea3a1e2d31 ("can: add CAN skb extension infrastructure") Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Reviewed-by: Thomas Weissschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20260410162150.3105738-2-khorenko@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Fri, 10 Apr 2026 02:57:52 +0000 (03:57 +0100)]
net: ethernet: mtk_eth_soc: initialize PPE per-tag-layer MTU registers
The PPE enforces output frame size limits via per-tag-layer VLAN_MTU
registers that the driver never initializes. The hardware defaults do
not account for PPPoE overhead, causing the PPE to punt encapsulated
frames back to the CPU instead of forwarding them.
Initialize the registers at PPE start and on MTU changes using the
maximum GMAC MTU. This is a conservative approximation -- the actual
per-PPE requirement depends on egress path, but using the global
maximum ensures the limits are never too small.
Qingfang Deng [Fri, 10 Apr 2026 05:49:49 +0000 (13:49 +0800)]
pppox: remove sk_pppox() helper
The sk member can be directly accessed from struct pppox_sock without
relying on type casting. Remove the sk_pppox() helper and update all
call sites to use po->sk directly.
rtc: abx80x: Disable alarm feature if no interrupt attached
Commit 795cda8338ea ("rtc: interface: Fix long-standing race when setting
alarm") exposed an issue where the rtc-abx80x driver does not clear the
alarm feature bit, but instead relies on the set_alarm operation to return
invalid.
For example, when a RTC_UIE_ON ioctl is handled, it should abort at the
feature validation. Instead, it proceeds to the rtc_timer_enqueue(),
which used to return an error from the set_alarm call. However,
following the race condition handling, which likely should not be
discarding predecing errors, a success condition is returned to the
ioctl() caller. This results in (for example):
hwclock: select() to /dev/rtc0 to wait for clock tick timed out
Notwithstanding the validity of the race condition handling, if an interrupt
wasn't specified, or could not be attached, the driver should clear the
alarm feature bit.
selftests/bpf: Use memfd_create instead of shm_open in cgroup_iter_memcg
Replace shm_open/shm_unlink with memfd_create in the shmem subtest.
shm_open requires /dev/shm to be mounted, which is not always available
in test environments, causing the test to fail with ENOENT.
memfd_create creates an anonymous shmem-backed fd without any filesystem
dependency while exercising the same shmem accounting path.
Gal Pressman [Thu, 9 Apr 2026 20:28:52 +0000 (23:28 +0300)]
net/mlx5e: IPsec, fix ASO poll timeout with read_poll_timeout_atomic()
The do-while poll loop uses jiffies for its timeout:
expires = jiffies + msecs_to_jiffies(10);
jiffies is sampled at an arbitrary point within the current tick, so the
first partial tick contributes anywhere from a full tick down to nearly
zero real time. For small msecs_to_jiffies() results this is
significant, the effective poll window can be much shorter than the
requested 10ms, and in the worst case the loop exits after a single
iteration (e.g., when HZ=100), well before the device has delivered the
CQE.
Replace the loop with read_poll_timeout_atomic(), which counts elapsed
time via udelay() accounting rather than jiffies, guaranteeing the full
poll window regardless of HZ.
Additionally, read_poll_timeout_atomic() executes the poll operation one
more time after the timeout has expired, giving the CQE a final chance
to be detected. The old do-while loop could exit without a final poll if
the timeout expired during the udelay() between iterations.
Fixes: 76e463f6508b ("net/mlx5e: Overcome slow response for first IPsec ASO WQE") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260409202852.158059-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Thu, 9 Apr 2026 20:28:51 +0000 (23:28 +0300)]
net/mlx5e: Fix features not applied during netdev registration
mlx5e_fix_features() returns early when the netdevice is not present.
This is correct during profile transitions where priv is cleared, but it
also incorrectly blocks feature fixups during register_netdev(), when
the device is also not yet present.
It is not trivial to distinguish between both cases as we cannot use
priv to carry state, and in both cases reg_state == NETREG_REGISTERED.
Force a netdev features update after register_netdev() completes, where
the device is present and fix_features() can actually work.
This is not a pretty solution, as it results in an additional features
update call (register_netdevice() already calls
__netdev_update_features() internally), but it is the simplest,
cleanest, and most robust way I found to fix this issue after multiple
attempts.
This fixes an issue on systems where CQE compression is enabled by
default, RXHASH remains enabled after registration despite the two
features being mutually exclusive.
Fixes: ab4b01bfdaa6 ("net/mlx5e: Verify dev is present for fix features ndo") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260409202852.158059-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sun, 12 Apr 2026 21:34:27 +0000 (14:34 -0700)]
Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2026-04-09
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add icm_mng_function_id_mode cap bit
net/mlx5: Rename MLX5_PF page counter type to MLX5_SELF
net/mlx5: Add vhca_id_type bit to alias context
mlx5: Remove redundant iseg base
====================
In vsock_update_buffer_size(), the buffer size was being clamped to the
maximum first, and then to the minimum. If a user sets a minimum buffer
size larger than the maximum, the minimum check overrides the maximum
check, inverting the constraint.
This breaks the intended socket memory boundaries by allowing the
vsk->buffer_size to grow beyond the configured vsk->buffer_max_size.
Fix this by checking the minimum first, and then the maximum. This
ensures the buffer size never exceeds the buffer_max_size.
Fixes: b9f2b0ffde0c ("vsock: handle buffer_size sockopts in the core") Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Norbert Szetei <norbert@doyensec.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/180118C5-8BCF-4A63-A305-4EE53A34AB9C@doyensec.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Add support for PIC64-HPSC/HX MDIO controller
This series adds a driver for the two MDIO controllers of PIC64-HPSC/HX.
The hardware supports C22 and C45 but only C22 is implemented for now.
This MDIO hardware is based on a Microsemi design supported in Linux by
mdio-mscc-miim.c. However, The register interface is completely different
with pic64hpsc, hence the need for a separate driver.
The documentation recommends an input clock of 156.25MHz and a prescaler of
39, which yields an MDIO clock of 1.95MHz.
This was tested on Microchip HB1301 evalkit which has a VSC8574 and a
VSC8541. I've tested with bus frequencies of 0.6, 1.95 and 2.5 MHz.
This series also adds a PHY write barrier when disabling PHY interrupts as
discussed in: https://lore.kernel.org/acvUqDgepCIScs8M@shell.armlinux.org.uk
====================
Charles Perry [Wed, 8 Apr 2026 13:18:16 +0000 (06:18 -0700)]
net: phy: add a PHY write barrier when disabling interrupts
MDIO bus controllers are not required to wait for write transactions to
complete before returning as synchronization is often achieved by polling
status bits.
This can cause issues when disabling interrupts since an interrupt could
fire before the interrupt handler is unregistered and there's no status
bit to poll.
Add a phy_write_barrier() function and use it in phy_disable_interrupts()
to fix this issue. The write barrier just reads an MII register and
discards the value, which is enough to guarantee that previous writes have
completed.
Charles Perry [Wed, 8 Apr 2026 13:18:15 +0000 (06:18 -0700)]
net: mdio: add a driver for PIC64-HPSC/HX MDIO controller
This adds an MDIO driver for PIC64-HPSC/HX. The hardware supports C22
and C45 but only C22 is implemented in this commit.
This MDIO hardware is based on a Microsemi design supported in Linux by
mdio-mscc-miim.c. However, The register interface is completely
different with pic64hpsc, hence the need for a separate driver.
The documentation recommends an input clock of 156.25MHz and a prescaler
of 39, which yields an MDIO clock of 1.95MHz.
The hardware supports an interrupt pin or a "TRIGGER" bit that can be
polled to signal transaction completion. This commit uses polling.
This was tested on Microchip HB1301 evalkit with a VSC8574 and a
VSC8541.
This MDIO hardware is based on a Microsemi design supported in Linux by
mdio-mscc-miim.c. However, The register interface is completely different
with pic64hpsc, hence the need for separate documentation.
The hardware supports C22 and C45.
The documentation recommends an input clock of 156.25MHz and a prescaler
of 39, which yields an MDIO clock of 1.95MHz.
The hardware supports an interrupt pin to signal transaction completion
which is not strictly needed as the software can also poll a "TRIGGER"
bit for this.
Charles Perry [Thu, 9 Apr 2026 13:36:54 +0000 (06:36 -0700)]
net: phy: fix a return path in get_phy_c45_ids()
The return value of phy_c45_probe_present() is stored in "ret", not
"phy_reg", fix this. "phy_reg" always has a positive value if we reach
this return path (since it would have returned earlier otherwise), which
means that the original goal of the patch of not considering -ENODEV
fatal wasn't achieved.
Fixes: 17b447539408 ("net: phy: c45 scanning: Don't consider -ENODEV fatal") Signed-off-by: Charles Perry <charles.perry@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20260409133654.3203336-1-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johan Hovold [Tue, 7 Apr 2026 12:27:17 +0000 (14:27 +0200)]
rtc: ntxec: fix OF node reference imbalance
The driver reuses the OF node of the parent multi-function device but
fails to take another reference to balance the one dropped by the
platform bus code when unbinding the MFD and deregistering the child
devices.
Fix this by using the intended helper for reusing OF nodes.
Fixes: 435af89786c6 ("rtc: New driver for RTC in Netronix embedded controller") Cc: stable@vger.kernel.org # 5.13 Cc: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://patch.msgid.link/20260407122717.2676774-1-johan@kernel.org Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Brian Masney [Sun, 22 Feb 2026 23:30:51 +0000 (18:30 -0500)]
rtc: pic32: allow driver to be compiled with COMPILE_TEST
This driver currently only supports builds against a PIC32 target. Now
that commit ed65ae9f6c6b ("rtc: pic32: update include to use pic32.h
from platform_data") is merged, it's possible to compile this driver on
other architectures.
To avoid future breakage of this driver in the future, let's update the
Kconfig so that it can be built with COMPILE_TEST enabled on all
architectures.
Akashdeep Kaur [Fri, 13 Mar 2026 11:17:40 +0000 (16:47 +0530)]
rtc: ti-k3: Add support to resume from IO DDR low power mode
Restore the RTC HW context which may be lost when system enters
certain low power mode (IO+DDR mode).
Check if the RTC registers are locked which would indicate loss of
context (reset) and restore the context as needed.
Initially 'reg' and 'val' are assigned from HW_PARAM_2.
But since IPA v5.0+ takes EV_PER_EE from HW_PARAM_4 (instead of
NUM_EV_PER_EE from HW_PARAM_2), we not only need to re-assign 'reg' but
also read the register value of that register into 'val' so that
reg_decode() works on the correct value.
Taegu Ha [Thu, 9 Apr 2026 07:11:15 +0000 (16:11 +0900)]
ppp: require CAP_NET_ADMIN in target netns for unattached ioctls
/dev/ppp open is currently authorized against file->f_cred->user_ns,
while unattached administrative ioctls operate on current->nsproxy->net_ns.
As a result, a local unprivileged user can create a new user namespace
with CLONE_NEWUSER, gain CAP_NET_ADMIN only in that new user namespace,
and still issue PPPIOCNEWUNIT, PPPIOCATTACH, or PPPIOCATTCHAN against
an inherited network namespace.
Require CAP_NET_ADMIN in the user namespace that owns the target network
namespace before handling unattached PPP administrative ioctls.
This preserves normal pppd operation in the network namespace it is
actually privileged in, while rejecting the userns-only inherited-netns
case.
Fix OOB read when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE
map to another pcpu map with the same value_size that is not rounded
up to 8 bytes, and add a test case to reproduce the issue.
The root cause is that pcpu_init_value() uses copy_map_value_long() which
rounds up the copy size to 8 bytes, but CGROUP_STORAGE map values are not
8-byte aligned (e.g., 4-byte). This causes a 4-byte OOB read when
the copy is performed.
====================
Lang Xu [Thu, 2 Apr 2026 07:42:36 +0000 (15:42 +0800)]
selftests/bpf: Add test for cgroup storage OOB read
Add a test case to reproduce the out-of-bounds read issue when copying
from a cgroup storage map to a pcpu map with a value_size not rounded
up to 8 bytes.
The test creates:
1. A CGROUP_STORAGE map with 4-byte value (not 8-byte aligned)
2. A LRU_PERCPU_HASH map with 4-byte value (same size)
When a socket is created in the cgroup, the BPF program triggers
bpf_map_update_elem() which calls copy_map_value_long(). This function
rounds up the copy size to 8 bytes, but the cgroup storage buffer is
only 4 bytes, causing an OOB read (before the fix).
Lang Xu [Thu, 2 Apr 2026 07:42:35 +0000 (15:42 +0800)]
bpf: Fix OOB in pcpu_init_value
An out-of-bounds read occurs when copying element from a
BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the
same value_size that is not rounded up to 8 bytes.
The issue happens when:
1. A CGROUP_STORAGE map is created with value_size not aligned to
8 bytes (e.g., 4 bytes)
2. A pcpu map is created with the same value_size (e.g., 4 bytes)
3. Update element in 2 with data in 1
pcpu_init_value assumes that all sources are rounded up to 8 bytes,
and invokes copy_map_value_long to make a data copy, However, the
assumption doesn't stand since there are some cases where the source
may not be rounded up to 8 bytes, e.g., CGROUP_STORAGE, skb->data.
the verifier verifies exactly the size that the source claims, not
the size rounded up to 8 bytes by kernel, an OOB happens when the
source has only 4 bytes while the copy size(4) is rounded up to 8.
This is initially introduced in: d5a8ac28a7ff ("RDS-TCP: Make RDS-TCP work correctly when it is set up
in a netns other than init_net").
Here, we made RDS aware of the namespace by storing a net pointer in
each connection. But it is not explicitly restricted to init_net in
the case of ib. The RDS/TCP transport has its own pernet exit handler
(rds_tcp_exit_net) that destroys connections when a namespace is torn
down. But RDS/IB does not support more than the initial namespace and
has no such handler. The initial namespace is statically allocated,
and never torn down, so it always has at least one reference.
Allowing non init namespaces that do not have a persistent reference
means that when their refcounts drop to zero, they are released through
cleanup_net(). Which would call any registered pernet clean up handlers
if it had any, but since they don't in this case, the extra
rds_connections remain with stale c_net pointers. Which are then
accessed later causing the use-after-free bug.
So, the simple fix is to disallow more than the initial namespace
to be created in the case of ib connections.
Fixes are ported from UEK patches found here:
https://github.com/oracle/linux-uek/commit/8ed9a82376b7
Patch 1 is a prerequisite optimization to rds_ib_laddr_check() that
avoids excessive rdma_bind_addr() calls during transport probing by
first checking rds_ib_get_device(). This is needed because patch 2
adds a namespace check at the top of the same function.
https://github.com/oracle/linux-uek/commit/bd9489a08004
Patch 2 restricts RDS/IB to the initial network namespace. It adds
checks in both rds_ib_laddr_check() and rds_set_transport() to reject
IB use from non-init namespaces with -EPROTOTYPE. This prevents the
use-after-free by ensuring IB connections cannot exist in namespaces
that may be torn down.
UEK: bd9489a08004 ("net/rds: Restrict use of RDS/IB to the initial
network namespace")
Questions, comments and feedback appreciated!
====================
net/rds: Restrict use of RDS/IB to the initial network namespace
Prevent using RDS/IB in network namespaces other than the initial one.
The existing RDS/IB code will not work properly in non-initial network
namespaces.
Fixes: d5a8ac28a7ff ("RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net") Reported-by: syzbot+da8e060735ae02c8f3d1@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=da8e060735ae02c8f3d1 Signed-off-by: Greg Jumper <greg.jumper@oracle.com> Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260408080420.540032-3-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rds_ib_laddr_check() creates a CM_ID and attempts to bind the address
in question to it. This in order to qualify the allegedly local
address as a usable IB/RoCE address.
In the field, ExaWatcher runs rds-ping to all ports in the fabric from
all local ports. This using all active ToS'es. In a full rack system,
we have 14 cell servers and eight db servers. Typically, 6 ToS'es are
used. This implies 528 rds-ping invocations per ExaWatcher's "RDSinfo"
interval.
Adding to this, each rds-ping invocation creates eight sockets and
binds the local address to them:
So, at every interval ExaWatcher executes rds-ping's, 4224 CM_IDs are
allocated, considering this full-rack system. After the a CM_ID has
been allocated, rdma_bind_addr() is called, with the port number being
zero. This implies that the CMA will attempt to search for an un-used
ephemeral port. Simplified, the algorithm is to start at a random
position in the available port space, and then if needed, iterate
until an un-used port is found.
The book-keeping of used ports uses the idr system, which again uses
slab to allocate new struct idr_layer's. The size is 2092 bytes and
slab tries to reduce the wasted space. Hence, it chooses an order:3
allocation, for which 15 idr_layer structs will fit and only 1388
bytes are wasted per the 32KiB order:3 chunk.
Although this order:3 allocation seems like a good space/speed
trade-off, it does not resonate well with how it used by the CMA. The
combination of the randomized starting point in the port space (which
has close to zero spatial locality) and the close proximity in time of
the 4224 invocations of the rds-ping's, creates a memory hog for
order:3 allocations.
These costly allocations may need reclaims and/or compaction. At
worst, they may fail and produce a stack trace such as (from uek4):
To avoid these excessive calls to rdma_bind_addr(), we optimize
rds_ib_laddr_check() by simply checking if the address in question has
been used before. The rds_rdma module keeps track of addresses
associated with IB devices, and the function rds_ib_get_device() is
used to determine if the address already has been qualified as a valid
local address. If not found, we call the legacy rds_ib_laddr_check(),
now renamed to rds_ib_laddr_check_cm().
====================
net: hamradio: fix missing input validation in bpqether and scc
This series fixes two missing input validation bugs in the hamradio
drivers. Both patches were reviewed by Joerg Reuter (hamradio
maintainer).
====================
net: hamradio: scc: validate bufsize in SIOCSCCSMEM ioctl
The SIOCSCCSMEM ioctl copies a scc_mem_config from user space and
assigns its bufsize field directly to scc->stat.bufsize without any
range validation:
scc->stat.bufsize = memcfg.bufsize;
If a privileged user (CAP_SYS_RAWIO) sets bufsize to 0, the receive
interrupt handler later calls dev_alloc_skb(0) and immediately writes
a KISS type byte via skb_put_u8() into a zero-capacity socket buffer,
corrupting the adjacent skb_shared_info region.
Reject bufsize values smaller than 16; this is large enough to hold
at least one KISS header byte plus useful data.
net: hamradio: bpqether: validate frame length in bpq_rcv()
The BPQ length field is decoded as:
len = skb->data[0] + skb->data[1] * 256 - 5;
If the sender sets bytes [0..1] to values whose combined value is
less than 5, len becomes negative. Passing a negative int to
skb_trim() silently converts to a huge unsigned value, causing the
function to be a no-op. The frame is then passed up to AX.25 with
its original (untrimmed) payload, delivering garbage beyond the
declared frame boundary.
Additionally, a negative len corrupts the 64-bit rx_bytes counter
through implicit sign-extension.
Add a bounds check before pulling the length bytes: reject frames
where len is negative or exceeds the remaining skb data.
Paul Chaignon [Wed, 8 Apr 2026 20:40:50 +0000 (22:40 +0200)]
selftests/bpf: Fix reg_bounds to match new tnum-based refinement
Commit efc11a667878 ("bpf: Improve bounds when tnum has a single
possible value") improved the bounds refinement to detect when the tnum
and u64 range overlap in a single value (and the bounds can thus be set
to that value).
Eduard then noticed that it broke the slow-mode reg_bounds selftests
because they don't have an equivalent logic and are therefore unable to
refine the bounds as much as the verifier. The following test case
illustrates this.
When w6 == w7 is true, the verifier can deduce that the R6's tnum is
equal to (0xfffffffe00000000; 0x100000000) and then use that information
to refine the bounds: the tnum only overlap with the u64 range in
0xffffffff00000000. The reg_bounds selftest doesn't know about tnums
and therefore fails to perform the same refinement.
This issue happens when the tnum carries information that cannot be
represented in the ranges, as otherwise the selftest could reach the
same refined value using just the ranges. The tnum thus needs to
represent non-contiguous values (ex., R6's tnum above, after the
condition). The only way this can happen in the reg_bounds selftest is
at the boundary between the 32 and 64bit ranges. We therefore only need
to handle that case.
This patch fixes the selftest refinement logic by checking if the u32
and u64 ranges overlap in a single value. If so, the ranges can be set
to that value. We need to handle two cases: either they overlap in
umin64...
To detect the first case, we decrease umax64 to the maximum value that
matches the u32 range. If that happens to be umin64, then umin64 is the
only overlap. We proceed similarly for the second case, increasing
umin64 to the minimum value that matches the u32 range.
Note this is similar to how the verifier handles the general case using
tnum, but we don't need to care about a single-value overlap in the
middle of the range. That case is not possible when comparing two
ranges.
This patch also adds two test cases reproducing this bug as part of the
normal test runs (without SLOW_TESTS=1).
Fixes: efc11a667878 ("bpf: Improve bounds when tnum has a single possible value") Reported-by: Eduard Zingerman <eddyz87@gmail.com> Closes: https://lore.kernel.org/bpf/4e6dd64a162b3cab3635706ae6abfdd0be4db5db.camel@gmail.com/ Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ada9UuSQi2SE2IfB@mail.gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
net: rose: reject truncated CLEAR_REQUEST frames in state machines
All five ROSE state machines (states 1-5) handle ROSE_CLEAR_REQUEST
by reading the cause and diagnostic bytes directly from skb->data[3]
and skb->data[4] without verifying that the frame is long enough:
The entry-point check in rose_route_frame() only enforces
ROSE_MIN_LEN (3 bytes), so a remote peer on a ROSE network can
send a syntactically valid but truncated CLEAR_REQUEST (3 or 4
bytes) while a connection is open in any state. Processing such a
frame causes a one- or two-byte out-of-bounds read past the skb
data, leaking uninitialized heap content as the cause/diagnostic
values returned to user space via getsockopt(ROSE_GETCAUSE).
Add a single length check at the rose_process_rx_frame() dispatch
point, before any state machine is entered, to drop frames that
carry the CLEAR_REQUEST type code but are too short to contain the
required cause and diagnostic fields.
Billy Tsai [Tue, 7 Apr 2026 08:53:23 +0000 (16:53 +0800)]
i3c: mipi-i3c-hci: fix IBI payload length calculation for final status
In DMA mode, the IBI status descriptor encodes the payload using
CHUNKS (number of chunks) and DATA_LENGTH (valid bytes in the last
chunk). All preceding chunks are implicitly full-sized.
The current code accumulates full chunk sizes for non-final status
descriptors, but for the final status descriptor it only adds
DATA_LENGTH. This ignores the contribution of the preceding full
chunks described by the same final status entry.
As a result, the computed IBI payload length is truncated whenever
the final status spans multiple chunks. For example, with a chunk
size of 4 bytes, CHUNKS=2 and DATA_LENGTH=1 should result in a total
payload size of 5 bytes, but the current code reports only 1 byte.
Fix the calculation by adding the size of (CHUNKS - 1) full chunks
plus DATA_LENGTH for the last chunk.
Fixes: 9ad9a52cce28 ("i3c/master: introduce the mipi-i3c-hci driver") Signed-off-by: Billy Tsai <billy_tsai@aspeedtech.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260407-i3c-hci-dma-v2-1-a583187b9d22@aspeedtech.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
====================
net: enetc: improve statistics for v1 and add statistics for v4
For ENETC v1, some standardized statistics were redundantly included in
the unstructured statistics, so remove these duplicated entries.
Previously, the unstructured statistics only contained eMAC data and
did not include pMAC data; add pMAC statistics to ensure completeness.
For ENETC v4, the driver previously reported MAC statistics only for the
internal ENETC (Pseudo MAC). Extend the implementation to provide
additional statistics for both the internal ENETC and the standalone
ENETC.
====================
net: enetc: add unstructured pMAC counters for ENETC v1
The ENETC v1 has two MACs (eMAC and pMAC) to support preemption. The
existing unstructured counters include the eMAC counters, but not the
pMAC counters. So add pMAC counters to improve statistical coverage.
net: enetc: remove standardized counters from enetc_pm_counters
The standardized counters are already exposed via the get_pause_stats(),
get_rmon_stats(), get_eth_ctrl_stats() and get_eth_mac_stats()
interfaces. Keeping the same counters in enetc_pm_counters results in
redundant output.
Remove these standardized counters from enetc_pm_counters and rely on
the existing statistics interfaces to report them.
net: enetc: show RX drop counters only for assigned RX rings
For ENETC v1, each SI provides 16 RBDCR registers for RX ring drop
counters, but this does not imply that an SI actually owns 16 RX rings.
The ENETC hardware supports a total of 16 RX rings, which are assigned
to 3 SIs (1 PSI and 2 VSIs), so each SI is assigned fewer than 16 RX
rings.
The current implementation always reports 16 RX drop counters per SI,
leading to redundant output for SIs with fewer RX rings. Update the
logic to display drop counters only for the RX rings that are actually
assigned to the SI.
net: enetc: add support for the standardized counters
ENETC v4 provides 64-bit counters for IEEE 802.3 basic and mandatory
managed objects, the IETF Management Information Database (MIB) package
(RFC2665), and Remote Network Monitoring (RMON) statistics. In addition,
some ENETCs support preemption, so these ENETCs have two MACs: MAC 0 is
the express MAC (eMAC), MAC 1 is the preemptible MAC (pMAC). Both MACs
support these statistics.
Emil Tsalapatis [Sun, 12 Apr 2026 17:45:38 +0000 (13:45 -0400)]
bpf: Allow instructions with arena source and non-arena dest registers
The compiler sometimes stores the result of a PTR_TO_ARENA and SCALAR
operation into the scalar register rather than the pointer register.
Relax the verifier to allow operations between a source arena register
and a destination non-arena register, marking the destination's value
as a PTR_TO_ARENA.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Song Liu <song@kernel.org> Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.") Link: https://lore.kernel.org/r/20260412174546.18684-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
bpf: add the missing fsession
Add the missing fsession attach type to the BPF docs, verifier log and
bpftool.
Changes since v2:
- replace "FENTRY/FEXIT/FSESSION" with "Tracing" in the 1st patch
- v2: https://lore.kernel.org/all/20260408062109.386083-1-dongml2@chinatelecom.cn/
Changes since v1:
- add a missing FSESSION in bpf_check_attach_target() in the 1st patch
- v1: https://lore.kernel.org/all/20260408031416.266229-1-dongml2@chinatelecom.cn/
====================
The fsession attach type is missed in the verifier log in
check_get_func_ip(), bpf_check_attach_target() and check_attach_btf_id().
Update them to make the verifier log proper. Meanwhile, update the
corresponding selftests.
====================
v3->v4: Restore few minor comments and undo few function moves
v2->v3: Actually restore comments lost in patch 3
(instead of adding them to patch 4)
v1->v2: Restore comments lost in patch 3
verifier.c is huge. Split it into logically independent pieces.
No functional changes.
The diff is impossible to review over email.
'git show' shows minimal actual changes. Only plenty of moved lines.
Such split may cause backport headaches.
We should have split it long ago.
Even after split verifier.c is still 20k lines,
but further split is harder.
====================
Gal Pressman [Thu, 9 Apr 2026 09:09:45 +0000 (12:09 +0300)]
gre: Count GRE packet drops
GRE is silently dropping packets without updating statistics.
In case of drop, increment rx_dropped counter to provide visibility into
packet loss. For the case where no GRE protocol handler is registered,
use rx_nohandler.
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20260409090945.1542440-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
bpf: Fix SOCK_OPS_GET_SK same-register OOB read in sock_ops and add selftest
When a BPF sock_ops program accesses ctx fields with dst_reg == src_reg,
the SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() macros fail to zero the
destination register in the !fullsock / !locked_tcp_sock path, leading to
OOB read (GET_SK) and kernel pointer leak (GET_FIELD).
Patch 1: Fix both macros by adding BPF_MOV64_IMM(si->dst_reg, 0) in the
!fullsock landing pad.
Patch 2: Add selftests covering same-register and different-register cases
for both GET_SK and GET_FIELD.
selftests/bpf: Add tests for sock_ops ctx access with same src/dst register
Add selftests to verify SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() correctly
return NULL/zero when dst_reg == src_reg and is_fullsock == 0.
Three subtests are included:
- get_sk: ctx->sk with same src/dst register (SOCK_OPS_GET_SK)
- get_field: ctx->snd_cwnd with same src/dst register (SOCK_OPS_GET_FIELD)
- get_sk_diff_reg: ctx->sk with different src/dst register (baseline)
Each BPF program uses inline asm (__naked) to force specific register
allocation, reads is_fullsock first, then loads the field using the same
(or different) register. The test triggers TCP_NEW_SYN_RECV via a TCP
handshake and checks that the result is NULL/zero when is_fullsock == 0.
bpf: Fix same-register dst/src OOB read and pointer leak in sock_ops
When a BPF sock_ops program accesses ctx fields with dst_reg == src_reg,
the SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() macros fail to zero the
destination register in the !fullsock / !locked_tcp_sock path.
Both macros borrow a temporary register to check is_fullsock /
is_locked_tcp_sock when dst_reg == src_reg, because dst_reg holds the
ctx pointer. When the check is false (e.g., TCP_NEW_SYN_RECV state with
a request_sock), dst_reg should be zeroed but is not, leaving the stale
ctx pointer:
- SOCK_OPS_GET_SK: dst_reg retains the ctx pointer, passes NULL checks
as PTR_TO_SOCKET_OR_NULL, and can be used as a bogus socket pointer,
leading to stack-out-of-bounds access in helpers like
bpf_skc_to_tcp6_sock().
- SOCK_OPS_GET_FIELD: dst_reg retains the ctx pointer which the
verifier believes is a SCALAR_VALUE, leaking a kernel pointer.
Fix both macros by:
- Changing JMP_A(1) to JMP_A(2) in the fullsock path to skip the
added instruction.
- Adding BPF_MOV64_IMM(si->dst_reg, 0) after the temp register
restore in the !fullsock path, placed after the restore because
dst_reg == src_reg means we need src_reg intact to read ctx->temp.
Fixes: fd09af010788 ("bpf: sock_ops ctx access may stomp registers in corner case") Fixes: 84f44df664e9 ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg") Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/6fe1243e-149b-4d3b-99c7-fcc9e2f75787@std.uestc.edu.cn/T/#u Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260407022720.162151-2-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nohz_full was introduced in v3.10 in 2013, which means this
documentation is overdue for 13 years.
Fortunately Paul wrote a part of the needed documentation a while ago,
especially concerning nohz_full in Documentation/timers/no_hz.rst and
also about per-CPU kthreads in
Documentation/admin-guide/kernel-per-CPU-kthreads.rst
Introduce a new page that gives an overview of CPU isolation in general.
Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260402094749.18879-1-frederic@kernel.org>
The NFC-A anti-collision cascade in digital_in_recv_sdd_res() appends 3
or 4 bytes to target->nfcid1 on each round, but the number of cascade
rounds is controlled entirely by the peer device. The peer sets the
cascade tag in the SDD_RES (deciding 3 vs 4 bytes) and the
cascade-incomplete bit in the SEL_RES (deciding whether another round
follows).
ISO 14443-3 limits NFC-A to three cascade levels and target->nfcid1 is
sized accordingly (NFC_NFCID1_MAXSIZE = 10), but nothing in the driver
actually enforces this. This means a malicious peer can keep the
cascade running, writing past the heap-allocated nfc_target with each
round.
Fix this by rejecting the response when the accumulated UID would exceed
the buffer.
Commit e329e71013c9 ("NFC: nci: Bounds check struct nfc_target arrays")
fixed similar missing checks against the same field on the NCI path.
net_sched: fix skb memory leak in deferred qdisc drops
When the network stack cleans up the deferred list via qdisc_run_end(),
it operates on the root qdisc. If the root qdisc do not implement the
TCQ_F_DEQUEUE_DROPS flag the packets queue to free are never freed and
gets stranded on the child's local to_free list.
Fix this by making qdisc_dequeue_drop() aware of the root qdisc. It
fetches the root qdisc and check for the TCQ_F_DEQUEUE_DROPS flag. If
the flag is present, the packet is appended directly to the root's
to_free list. Otherwise, drop it directly as it was done before the
optimization was implemented.
Fixes: a6efc273ab82 ("net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel") Reported-by: Damilola Bello <damilola@aterlo.com> Closes: https://lore.kernel.org/netdev/CAPgFtOLaedBMU0f_BxV2bXftTJSmJr018Q5uozOo5vVo6b9tjw@mail.gmail.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260408100044.4530-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: phy: add support for disabling autonomous EEE
Some PHYs implement autonomous EEE where the PHY manages EEE
independently, preventing the MAC from controlling LPI signaling.
This conflicts with MACs that implement their own LPI control.
This series adds a .disable_autonomous_eee callback to struct phy_driver
and calls it from phy_support_eee(). When a MAC indicates it supports
EEE, the PHY's autonomous EEE is automatically disabled. The setting is
persisted across suspend/resume by re-applying it in phy_init_hw() after
soft reset, following the same pattern suggested by Russell King for PHY
tunables [1].
Patch 1 adds the phylib infrastructure.
Patch 2 implements it for Broadcom BCM54xx (AutogrEEEn).
Patch 3 converts the Realtek RTL8211F, which previously unconditionally
disabled PHY-mode EEE in config_init.
This came up while adding EEE support to the Cadence macb driver (used
on Raspberry Pi 5 with a BCM54210PE PHY). The PHY's AutogrEEEn mode
prevented the MAC from tracking LPI state. The Realtek RTL8211F has
the same pattern, unconditionally disabling PHY-mode EEE with the
comment "Disable PHY-mode EEE so LPI is passed to the MAC".
Other BCM54xx PHYs likely have the same AutogrEEEn register layout,
but I only have access to the BCM54210PE/BCM54213PE datasheets. It
would be appreciated if Florian or others could confirm which other
BCM54xx variants share this register so we can wire them up too.
Tested on Raspberry Pi CM4 (bcmgenet + BCM54210PE),
Raspberry Pi CM5 (Cadence GEM + BCM54210PE) and
Raspberry Pi 5 (Cadence GEM + BCM54213PE).
net: phy: realtek: convert RTL8211F to .disable_autonomous_eee
The RTL8211F previously unconditionally disabled PHY-mode EEE in
config_init. Convert this to use the new .disable_autonomous_eee
callback so it is only disabled when the MAC indicates EEE support
via phy_support_eee().
This preserves PHY-autonomous EEE for MACs that do not support EEE,
while still disabling it when the MAC manages LPI.